* [PATCH 00/13] sched, deadline: patches
@ 2013-12-17 12:27 Peter Zijlstra
  2013-12-17 12:27 ` [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI Peter Zijlstra
                   ` (14 more replies)
  0 siblings, 15 replies; 71+ messages in thread
From: Peter Zijlstra @ 2013-12-17 12:27 UTC (permalink / raw)
  To: tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang, jkacur
  Cc: Peter Zijlstra

Hai..

This is my current queue of SCHED_DEADLINE patches, which I hope to merge 'soon'.

Juri handed me a version that should've (I didn't check) included all feedback,
including the new sched_attr interface.

I did clean up some of the patches: moved some hunks around so that each patch
compiles on its own, etc.

I then added a number of patches at the end that change some actual stuff around.

Juri, do you have some userspace around to test with? I'm just too likely to
have wrecked everything, so if you don't have anything I need to go write
something before merging this :-)

No need to update your test proglets to the new interface; I can lift it if
that is more convenient -- I know you're conferencing atm.

One question: the SoB chain in patch 2/13 seems weird; I suppose that
patch is a collaborative effort? Could we cure that by mentioning Fabio
and Michael in some other way?



* [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI
  2013-12-17 12:27 [PATCH 00/13] sched, deadline: patches Peter Zijlstra
@ 2013-12-17 12:27 ` Peter Zijlstra
  2014-01-21 14:36   ` Michael Kerrisk
  2014-01-26  9:48   ` [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI Geert Uytterhoeven
  2013-12-17 12:27 ` [PATCH 02/13] sched: SCHED_DEADLINE structures & implementation Peter Zijlstra
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 71+ messages in thread
From: Peter Zijlstra @ 2013-12-17 12:27 UTC (permalink / raw)
  To: tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang, jkacur
  Cc: Peter Zijlstra

[-- Attachment #1: 0002-sched-Add-3-new-scheduler-syscalls-to-support-an-ext.patch --]
[-- Type: text/plain, Size: 16888 bytes --]

From: Dario Faggioli <raistlin@linux.it>

Add the syscalls needed for supporting scheduling algorithms
with extended scheduling parameters (e.g., SCHED_DEADLINE).

In general, this makes it possible to specify a periodic/sporadic task that
executes for a given amount of runtime at each instance and is scheduled
according to the urgency of its own timing constraints, i.e.:

 - a (maximum/typical) instance execution time,
 - a minimum interval between consecutive instances,
 - a time constraint by which each instance must be completed.

Thus, both the data structure that holds the scheduling parameters of
the tasks and the system calls dealing with it must be extended.
Unfortunately, modifying the existing struct sched_param would break
the ABI and result in potentially serious compatibility issues with
legacy binaries.

For these reasons, this patch:

 - defines the new struct sched_attr, containing all the fields
   that are necessary for specifying a task in the computational
   model described above;
 - defines and implements the new scheduling related syscalls that
   manipulate it, i.e., sched_setscheduler2(), sched_setattr()
   and sched_getattr().

Syscalls are introduced for x86 (32 and 64 bit) and ARM only, as a
proof of concept and for development and testing purposes. Making them
available on other architectures is straightforward.

Since no "user" for these new parameters is introduced in this patch,
the implementation of the new system calls is just identical to their
already existing counterpart. Future patches that implement scheduling
policies able to exploit the new data structure must also take care of
modifying the sched_*attr() calls accordingly with their own purposes.
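
For illustration only (not part of the patch), here is a minimal userspace
sketch of how the new interface could be exercised on x86-64, using the
syscall numbers added below and a hand-rolled mirror of the new struct
sched_attr (no glibc wrapper is assumed to exist). The NR_* macros and the
test program itself are assumptions of this sketch, not something the patch
provides:

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* x86-64 numbers from arch/x86/syscalls/syscall_64.tbl below. */
#define NR_sched_setscheduler2	316
#define NR_sched_getattr	315

/* Userspace mirror of this version of the kernel's struct sched_attr. */
struct sched_attr {
	int32_t  sched_priority;
	uint32_t sched_flags;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t size;
	uint32_t __reserved;
};

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);	/* 40 bytes == SCHED_ATTR_SIZE_VER0 */
	attr.sched_priority = 10;

	/* With no -deadline user yet, this behaves just like
	 * sched_setscheduler(pid, SCHED_FIFO, ...); pid 0 means current. */
	if (syscall(NR_sched_setscheduler2, 0, SCHED_FIFO, &attr))
		perror("sched_setscheduler2");

	memset(&attr, 0, sizeof(attr));
	if (syscall(NR_sched_getattr, 0, &attr, sizeof(attr)))
		perror("sched_getattr");

	printf("rt priority: %d\n", attr.sched_priority);
	return 0;
}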

Cc: oleg@redhat.com
Cc: darren@dvhart.com
Cc: paulmck@linux.vnet.ibm.com
Cc: dhaval.giani@gmail.com
Cc: p.faure@akatech.ch
Cc: fchecconi@gmail.com
Cc: fweisbec@gmail.com
Cc: harald.gustafsson@ericsson.com
Cc: hgu1972@gmail.com
Cc: insop.song@gmail.com
Cc: rostedt@goodmis.org
Cc: jkacur@redhat.com
Cc: tommaso.cucinotta@sssup.it
Cc: johan.eker@ericsson.com
Cc: vincent.guittot@linaro.org
Cc: liming.wang@windriver.com
Cc: luca.abeni@unitn.it
Cc: michael@amarulasolutions.com
Cc: bruce.ashfield@windriver.com
Cc: nicola.manica@disi.unitn.it
Cc: claudio@evidence.eu.com
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Twiddled the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/arm/include/asm/unistd.h      |    2 
 arch/arm/include/uapi/asm/unistd.h |    3 
 arch/arm/kernel/calls.S            |    3 
 arch/x86/syscalls/syscall_32.tbl   |    3 
 arch/x86/syscalls/syscall_64.tbl   |    3 
 include/linux/sched.h              |   54 ++++++++
 include/linux/syscalls.h           |    8 +
 kernel/sched/core.c                |  234 +++++++++++++++++++++++++++++++++++--
 8 files changed, 298 insertions(+), 12 deletions(-)

--- a/arch/arm/include/asm/unistd.h
+++ b/arch/arm/include/asm/unistd.h
@@ -15,7 +15,7 @@
 
 #include <uapi/asm/unistd.h>
 
-#define __NR_syscalls  (380)
+#define __NR_syscalls  (383)
 #define __ARM_NR_cmpxchg		(__ARM_NR_BASE+0x00fff0)
 
 #define __ARCH_WANT_STAT64
--- a/arch/arm/include/uapi/asm/unistd.h
+++ b/arch/arm/include/uapi/asm/unistd.h
@@ -406,6 +406,9 @@
 #define __NR_process_vm_writev		(__NR_SYSCALL_BASE+377)
 #define __NR_kcmp			(__NR_SYSCALL_BASE+378)
 #define __NR_finit_module		(__NR_SYSCALL_BASE+379)
+#define __NR_sched_setscheduler2	(__NR_SYSCALL_BASE+380)
+#define __NR_sched_setattr		(__NR_SYSCALL_BASE+381)
+#define __NR_sched_getattr		(__NR_SYSCALL_BASE+382)
 
 /*
  * This may need to be greater than __NR_last_syscall+1 in order to
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -389,6 +389,9 @@
 		CALL(sys_process_vm_writev)
 		CALL(sys_kcmp)
 		CALL(sys_finit_module)
+/* 380 */	CALL(sys_sched_setscheduler2)
+		CALL(sys_sched_setattr)
+		CALL(sys_sched_getattr)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -357,3 +357,6 @@
 348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
 349	i386	kcmp			sys_kcmp
 350	i386	finit_module		sys_finit_module
+351	i386	sched_setattr		sys_sched_setattr
+352	i386	sched_getattr		sys_sched_getattr
+353	i386	sched_setscheduler2	sys_sched_setscheduler2
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -320,6 +320,9 @@
 311	64	process_vm_writev	sys_process_vm_writev
 312	common	kcmp			sys_kcmp
 313	common	finit_module		sys_finit_module
+314	common	sched_setattr		sys_sched_setattr
+315	common	sched_getattr		sys_sched_getattr
+316	common	sched_setscheduler2	sys_sched_setscheduler2
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -56,6 +56,58 @@ struct sched_param {
 
 #include <asm/processor.h>
 
+#define SCHED_ATTR_SIZE_VER0	40	/* sizeof first published struct */
+
+/*
+ * Extended scheduling parameters data structure.
+ *
+ * This is needed because the original struct sched_param can not be
+ * altered without introducing ABI issues with legacy applications
+ * (e.g., in sched_getparam()).
+ *
+ * However, the possibility of specifying more than just a priority for
+ * the tasks may be useful for a wide variety of application fields, e.g.,
+ * multimedia, streaming, automation and control, and many others.
+ *
+ * This variant (sched_attr) is meant to describe a so-called
+ * sporadic time-constrained task. In such a model a task is specified by:
+ *  - the activation period or minimum instance inter-arrival time;
+ *  - the maximum (or average, depending on the actual scheduling
+ *    discipline) computation time of all instances, a.k.a. runtime;
+ *  - the deadline (relative to the actual activation time) of each
+ *    instance.
+ * Very briefly, a periodic (sporadic) task asks for the execution of
+ * some specific computation --which is typically called an instance--
+ * (at most) every period. Moreover, each instance typically lasts no more
+ * than the runtime and must be completed by time instant t equal to
+ * the instance activation time + the deadline.
+ *
+ * This is reflected by the actual fields of the sched_attr structure:
+ *
+ *  @sched_priority     task's priority (might still be useful)
+ *  @sched_flags        for customizing the scheduler behaviour
+ *  @sched_deadline     representative of the task's deadline
+ *  @sched_runtime      representative of the task's runtime
+ *  @sched_period       representative of the task's period
+ *
+ * Given this task model, there is a multiplicity of scheduling algorithms
+ * and policies that can be used to ensure that all the tasks will meet their
+ * timing constraints.
+ *
+ *  @size		size of the structure, for fwd/bwd compat.
+ */
+struct sched_attr {
+	int sched_priority;
+	unsigned int sched_flags;
+	u64 sched_runtime;
+	u64 sched_deadline;
+	u64 sched_period;
+	u32 size;
+
+	/* Align to u64. */
+	u32 __reserved;
+};
+
 struct exec_domain;
 struct futex_pi_state;
 struct robust_list_head;
@@ -1960,6 +2012,8 @@ extern int sched_setscheduler(struct tas
 			      const struct sched_param *);
 extern int sched_setscheduler_nocheck(struct task_struct *, int,
 				      const struct sched_param *);
+extern int sched_setscheduler2(struct task_struct *, int,
+				 const struct sched_attr *);
 extern struct task_struct *idle_task(int cpu);
 /**
  * is_idle_task - is the specified task an idle task?
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -38,6 +38,7 @@ struct rlimit;
 struct rlimit64;
 struct rusage;
 struct sched_param;
+struct sched_attr;
 struct sel_arg_struct;
 struct semaphore;
 struct sembuf;
@@ -277,11 +278,18 @@ asmlinkage long sys_clock_nanosleep(cloc
 asmlinkage long sys_nice(int increment);
 asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_setscheduler2(pid_t pid, int policy,
+					struct sched_attr __user *attr);
 asmlinkage long sys_sched_setparam(pid_t pid,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_setattr(pid_t pid,
+					struct sched_attr __user *attr);
 asmlinkage long sys_sched_getscheduler(pid_t pid);
 asmlinkage long sys_sched_getparam(pid_t pid,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_getattr(pid_t pid,
+					struct sched_attr __user *attr,
+					unsigned int size);
 asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
 					unsigned long __user *user_mask_ptr);
 asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3023,7 +3023,8 @@ static bool check_same_owner(struct task
 }
 
 static int __sched_setscheduler(struct task_struct *p, int policy,
-				const struct sched_param *param, bool user)
+				const struct sched_attr *attr,
+				bool user)
 {
 	int retval, oldprio, oldpolicy = -1, on_rq, running;
 	unsigned long flags;
@@ -3053,11 +3054,11 @@ static int __sched_setscheduler(struct t
 	 * 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_NORMAL,
 	 * SCHED_BATCH and SCHED_IDLE is 0.
 	 */
-	if (param->sched_priority < 0 ||
-	    (p->mm && param->sched_priority > MAX_USER_RT_PRIO-1) ||
-	    (!p->mm && param->sched_priority > MAX_RT_PRIO-1))
+	if (attr->sched_priority < 0 ||
+	    (p->mm && attr->sched_priority > MAX_USER_RT_PRIO-1) ||
+	    (!p->mm && attr->sched_priority > MAX_RT_PRIO-1))
 		return -EINVAL;
-	if (rt_policy(policy) != (param->sched_priority != 0))
+	if (rt_policy(policy) != (attr->sched_priority != 0))
 		return -EINVAL;
 
 	/*
@@ -3073,8 +3074,8 @@ static int __sched_setscheduler(struct t
 				return -EPERM;
 
 			/* can't increase priority */
-			if (param->sched_priority > p->rt_priority &&
-			    param->sched_priority > rlim_rtprio)
+			if (attr->sched_priority > p->rt_priority &&
+			    attr->sched_priority > rlim_rtprio)
 				return -EPERM;
 		}
 
@@ -3123,7 +3124,7 @@ static int __sched_setscheduler(struct t
 	 * If not changing anything there's no need to proceed further:
 	 */
 	if (unlikely(policy == p->policy && (!rt_policy(policy) ||
-			param->sched_priority == p->rt_priority))) {
+		     attr->sched_priority == p->rt_priority))) {
 		task_rq_unlock(rq, p, &flags);
 		return 0;
 	}
@@ -3160,7 +3161,7 @@ static int __sched_setscheduler(struct t
 
 	oldprio = p->prio;
 	prev_class = p->sched_class;
-	__setscheduler(rq, p, policy, param->sched_priority);
+	__setscheduler(rq, p, policy, attr->sched_priority);
 
 	if (running)
 		p->sched_class->set_curr_task(rq);
@@ -3188,10 +3189,20 @@ static int __sched_setscheduler(struct t
 int sched_setscheduler(struct task_struct *p, int policy,
 		       const struct sched_param *param)
 {
-	return __sched_setscheduler(p, policy, param, true);
+	struct sched_attr attr = {
+		.sched_priority = param->sched_priority
+	};
+	return __sched_setscheduler(p, policy, &attr, true);
 }
 EXPORT_SYMBOL_GPL(sched_setscheduler);
 
+int sched_setscheduler2(struct task_struct *p, int policy,
+			const struct sched_attr *attr)
+{
+	return __sched_setscheduler(p, policy, attr, true);
+}
+EXPORT_SYMBOL_GPL(sched_setscheduler2);
+
 /**
  * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
  * @p: the task in question.
@@ -3208,7 +3219,10 @@ EXPORT_SYMBOL_GPL(sched_setscheduler);
 int sched_setscheduler_nocheck(struct task_struct *p, int policy,
 			       const struct sched_param *param)
 {
-	return __sched_setscheduler(p, policy, param, false);
+	struct sched_attr attr = {
+		.sched_priority = param->sched_priority
+	};
+	return __sched_setscheduler(p, policy, &attr, false);
 }
 
 static int
@@ -3233,6 +3247,97 @@ do_sched_setscheduler(pid_t pid, int pol
 	return retval;
 }
 
+/*
+ * Mimics kernel/events/core.c perf_copy_attr().
+ */
+static int sched_copy_attr(struct sched_attr __user *uattr,
+			   struct sched_attr *attr)
+{
+	u32 size;
+	int ret;
+
+	if (!access_ok(VERIFY_WRITE, uattr, SCHED_ATTR_SIZE_VER0))
+		return -EFAULT;
+
+	/*
+	 * zero the full structure, so that a short copy will be nice.
+	 */
+	memset(attr, 0, sizeof(*attr));
+
+	ret = get_user(size, &uattr->size);
+	if (ret)
+		return ret;
+
+	if (size > PAGE_SIZE)	/* silly large */
+		goto err_size;
+
+	if (!size)		/* abi compat */
+		size = SCHED_ATTR_SIZE_VER0;
+
+	if (size < SCHED_ATTR_SIZE_VER0)
+		goto err_size;
+
+	/*
+	 * If we're handed a bigger struct than we know of,
+	 * ensure all the unknown bits are 0 - i.e. new
+	 * user-space does not rely on any kernel feature
+	 * extensions we don't know about yet.
+	 */
+	if (size > sizeof(*attr)) {
+		unsigned char __user *addr;
+		unsigned char __user *end;
+		unsigned char val;
+
+		addr = (void __user *)uattr + sizeof(*attr);
+		end  = (void __user *)uattr + size;
+
+		for (; addr < end; addr++) {
+			ret = get_user(val, addr);
+			if (ret)
+				return ret;
+			if (val)
+				goto err_size;
+		}
+		size = sizeof(*attr);
+	}
+
+	ret = copy_from_user(attr, uattr, size);
+	if (ret)
+		return -EFAULT;
+
+out:
+	return ret;
+
+err_size:
+	put_user(sizeof(*attr), &uattr->size);
+	ret = -E2BIG;
+	goto out;
+}
+
+static int
+do_sched_setscheduler2(pid_t pid, int policy,
+		       struct sched_attr __user *attr_uptr)
+{
+	struct sched_attr attr;
+	struct task_struct *p;
+	int retval;
+
+	if (!attr_uptr || pid < 0)
+		return -EINVAL;
+
+	if (sched_copy_attr(attr_uptr, &attr))
+		return -EFAULT;
+
+	rcu_read_lock();
+	retval = -ESRCH;
+	p = find_process_by_pid(pid);
+	if (p != NULL)
+		retval = sched_setscheduler2(p, policy, &attr);
+	rcu_read_unlock();
+
+	return retval;
+}
+
 /**
  * sys_sched_setscheduler - set/change the scheduler policy and RT priority
  * @pid: the pid in question.
@@ -3252,6 +3357,21 @@ SYSCALL_DEFINE3(sched_setscheduler, pid_
 }
 
 /**
+ * sys_sched_setscheduler2 - same as above, but with extended sched_param
+ * @pid: the pid in question.
+ * @policy: new policy (could use extended sched_param).
+ * @attr: structure containing the extended parameters.
+ */
+SYSCALL_DEFINE3(sched_setscheduler2, pid_t, pid, int, policy,
+		struct sched_attr __user *, attr)
+{
+	if (policy < 0)
+		return -EINVAL;
+
+	return do_sched_setscheduler2(pid, policy, attr);
+}
+
+/**
  * sys_sched_setparam - set/change the RT priority of a thread
  * @pid: the pid in question.
  * @param: structure containing the new RT priority.
@@ -3264,6 +3384,17 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, p
 }
 
 /**
+ * sys_sched_setattr - same as above, but with extended sched_attr
+ * @pid: the pid in question.
+ * @attr: structure containing the extended parameters.
+ */
+SYSCALL_DEFINE2(sched_setattr, pid_t, pid,
+		struct sched_attr __user *, attr)
+{
+	return do_sched_setscheduler2(pid, -1, attr);
+}
+
+/**
  * sys_sched_getscheduler - get the policy (scheduling class) of a thread
  * @pid: the pid in question.
  *
@@ -3329,6 +3460,87 @@ SYSCALL_DEFINE2(sched_getparam, pid_t, p
 	return retval;
 
 out_unlock:
+	rcu_read_unlock();
+	return retval;
+}
+
+static int sched_read_attr(struct sched_attr __user *uattr,
+			   struct sched_attr *attr,
+			   unsigned int usize)
+{
+	int ret;
+
+	if (!access_ok(VERIFY_WRITE, uattr, usize))
+		return -EFAULT;
+
+	/*
+	 * If we're handed a smaller struct than we know of,
+	 * ensure all the unknown bits are 0 - i.e. old
+	 * user-space does not get incomplete information.
+	 */
+	if (usize < sizeof(*attr)) {
+		unsigned char *addr;
+		unsigned char *end;
+
+		addr = (void *)attr + usize;
+		end  = (void *)attr + sizeof(*attr);
+
+		for (; addr < end; addr++) {
+			if (*addr)
+				goto err_size;
+		}
+
+		attr->size = usize;
+	}
+
+	ret = copy_to_user(uattr, attr, usize);
+	if (ret)
+		return -EFAULT;
+
+out:
+	return ret;
+
+err_size:
+	ret = -E2BIG;
+	goto out;
+}
+
+/**
+ * sys_sched_getattr - same as above, but with extended "sched_param"
+ * @pid: the pid in question.
+ * @attr: structure containing the extended parameters.
+ * @size: sizeof(attr) for fwd/bwd comp.
+ */
+SYSCALL_DEFINE3(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
+		unsigned int, size)
+{
+	struct sched_attr attr = {
+		.size = sizeof(struct sched_attr),
+	};
+	struct task_struct *p;
+	int retval;
+
+	if (!uattr || pid < 0 || size > PAGE_SIZE ||
+	    size < SCHED_ATTR_SIZE_VER0)
+		return -EINVAL;
+
+	rcu_read_lock();
+	p = find_process_by_pid(pid);
+	retval = -ESRCH;
+	if (!p)
+		goto out_unlock;
+
+	retval = security_task_getscheduler(p);
+	if (retval)
+		goto out_unlock;
+
+	attr.sched_priority = p->rt_priority;
+	rcu_read_unlock();
+
+	retval = sched_read_attr(uattr, &attr, size);
+	return retval;
+
+out_unlock:
 	rcu_read_unlock();
 	return retval;
 }
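
As a further illustration (not part of the patch), the perf_copy_attr()-style
size handling in sched_copy_attr() above is what gives the new ABI room to
grow: a newer userspace may pass a larger structure, which is accepted as long
as every byte the kernel does not know about is zero; otherwise the call is
rejected and the size the kernel does understand is written back into
uattr->size. A hypothetical sketch, reusing the struct sched_attr mirror and
NR_sched_setscheduler2 definition from the earlier example (sched_attr_v_next
and sched_future_field are made-up names):

/* Hypothetical future layout: the 40-byte struct plus one extra field. */
struct sched_attr_v_next {
	struct sched_attr v0;		/* layout as defined in this patch */
	uint64_t sched_future_field;	/* made-up later extension */
};

struct sched_attr_v_next attr = {
	.v0 = {
		.sched_priority	= 10,
		.size		= sizeof(struct sched_attr_v_next),
	},
	/* .sched_future_field is left zero: the kernel verifies that the
	 * unknown tail is zeroed and then copies only the 40 bytes it knows. */
};

if (syscall(NR_sched_setscheduler2, 0, SCHED_FIFO, &attr))
	perror("sched_setscheduler2");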




* [PATCH 02/13] sched: SCHED_DEADLINE structures & implementation
  2013-12-17 12:27 [PATCH 00/13] sched, deadline: patches Peter Zijlstra
  2013-12-17 12:27 ` [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI Peter Zijlstra
@ 2013-12-17 12:27 ` Peter Zijlstra
  2013-12-17 12:27 ` [PATCH 03/13] sched: SCHED_DEADLINE SMP-related data structures & logic Peter Zijlstra
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2013-12-17 12:27 UTC (permalink / raw)
  To: tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang, jkacur
  Cc: Peter Zijlstra

[-- Attachment #1: 0003-sched-SCHED_DEADLINE-structures-implementation.patch --]
[-- Type: text/plain, Size: 39080 bytes --]

From: Dario Faggioli <raistlin@linux.it>

Introduces the data structures, constants and symbols needed for
SCHED_DEADLINE implementation.

Core data structures of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking whether a task belongs to the new policy
are also added where they are needed.

Adds a scheduling class, in kernel/sched/deadline.c, and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks from one another.

The typical -deadline task is made up of a computation phase
(instance) which is activated in a periodic or sporadic fashion. The
expected (maximum) duration of such a computation is called the task's
runtime; the time interval by which each instance needs to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant at which a task (or rather,
an instance) activates plus the relative deadline.

The EDF algorithm selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures that each
task runs for at most its runtime in every (relative) deadline-length
time interval, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, even tasks that do not strictly comply with
the computational model sketched above can effectively use the new
policy.
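
As a concrete example (illustrative only, not part of the patch), a task that
needs at most 10ms of CPU time every 100ms, each instance being due 30ms after
its activation, could be switched to the new policy from userspace roughly as
sketched below, reusing the struct sched_attr mirror and NR_sched_setscheduler2
definition from the example under patch 01/13. Note that at this stage
__setparam_dl() consumes only sched_runtime, sched_deadline and sched_flags,
and admission control is outside the scope of this patch:

#define SCHED_DEADLINE	6	/* from the uapi/linux/sched.h hunk below */

/* All times in nanoseconds; __checkparam_dl() requires a non-zero deadline
 * that is not smaller than the runtime. */
struct sched_attr attr = {
	.size		= sizeof(attr),
	.sched_runtime	= 10 * 1000 * 1000,	/* 10 ms of budget ...  */
	.sched_deadline	= 30 * 1000 * 1000,	/* ... within 30 ms ... */
	.sched_period	= 100 * 1000 * 1000,	/* ... every 100 ms     */
};

/* pid 0 means the calling task. */
if (syscall(NR_sched_setscheduler2, 0, SCHED_DEADLINE, &attr))
	perror("sched_setscheduler2");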

To summarize, this patch:
 - introduces the data structures, constants and symbols needed;
 - implements the core logic of the scheduling algorithm in the new
   scheduling class file;
 - provides all the glue code between the new scheduling class and
   the core scheduler and refines the interactions between sched/dl
   and the other existing scheduling classes.

Cc: johan.eker@ericsson.com
Cc: mingo@redhat.com
Cc: tommaso.cucinotta@sssup.it
Cc: dhaval.giani@gmail.com
Cc: liming.wang@windriver.com
Cc: nicola.manica@disi.unitn.it
Cc: rostedt@goodmis.org
Cc: fweisbec@gmail.com
Cc: harald.gustafsson@ericsson.com
Cc: hgu1972@gmail.com
Cc: paulmck@linux.vnet.ibm.com
Cc: insop.song@gmail.com
Cc: vincent.guittot@linaro.org
Cc: bruce.ashfield@windriver.com
Cc: darren@dvhart.com
Cc: jkacur@redhat.com
Cc: oleg@redhat.com
Cc: tglx@linutronix.de
Cc: luca.abeni@unitn.it
Cc: claudio@evidence.eu.com
Cc: p.faure@akatech.ch
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/arm/include/asm/unistd.h  |    2 
 include/linux/sched.h          |   46 ++
 include/linux/sched/deadline.h |   24 +
 include/uapi/linux/sched.h     |    1 
 kernel/fork.c                  |    4 
 kernel/hrtimer.c               |    3 
 kernel/sched/Makefile          |    3 
 kernel/sched/core.c            |  109 +++++-
 kernel/sched/deadline.c        |  684 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h           |   28 +
 kernel/sched/stop_task.c       |    2 
 11 files changed, 885 insertions(+), 21 deletions(-)
 create mode 100644 include/linux/sched/deadline.h
 create mode 100644 kernel/sched/deadline.c

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -95,6 +95,10 @@ struct sched_param {
  * timing constraints.
  *
  *  @size		size of the structure, for fwd/bwd compat.
+ *
+ * As of now, the SCHED_DEADLINE policy (sched_dl scheduling class) is the
+ * only user of this new interface. More information about the algorithm
+ * available in the scheduling class file or in Documentation/.
  */
 struct sched_attr {
 	int sched_priority;
@@ -1080,6 +1084,45 @@ struct sched_rt_entity {
 #endif
 };
 
+struct sched_dl_entity {
+	struct rb_node	rb_node;
+
+	/*
+	 * Original scheduling parameters. Copied here from sched_attr
+	 * during sched_setscheduler2(), they will remain the same until
+	 * the next sched_setscheduler2().
+	 */
+	u64 dl_runtime;		/* maximum runtime for each instance	*/
+	u64 dl_deadline;	/* relative deadline of each instance	*/
+
+	/*
+	 * Actual scheduling parameters. Initialized with the values above,
+	 * they are continously updated during task execution. Note that
+	 * they are continuously updated during task execution. Note that
+	 */
+	s64 runtime;		/* remaining runtime for this instance	*/
+	u64 deadline;		/* absolute deadline for this instance	*/
+	unsigned int flags;	/* specifying the scheduler behaviour	*/
+
+	/*
+	 * Some bool flags:
+	 *
+	 * @dl_throttled tells if we exhausted the runtime. If so, the
+	 * task has to wait for a replenishment to be performed at the
+	 * next firing of dl_timer.
+	 *
+	 * @dl_new tells if a new instance arrived. If so we must
+	 * start executing it with full runtime and reset its absolute
+	 * deadline;
+	 */
+	int dl_throttled, dl_new;
+
+	/*
+	 * Bandwidth enforcement timer. Each -deadline task has its
+	 * own bandwidth to be enforced, thus we need one timer per task.
+	 */
+	struct hrtimer dl_timer;
+};
 
 struct rcu_node;
 
@@ -1116,6 +1159,7 @@ struct task_struct {
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group *sched_task_group;
 #endif
+	struct sched_dl_entity dl;
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* list of struct preempt_notifier: */
@@ -2093,7 +2137,7 @@ extern void wake_up_new_task(struct task
 #else
  static inline void kick_process(struct task_struct *tsk) { }
 #endif
-extern void sched_fork(unsigned long clone_flags, struct task_struct *p);
+extern int sched_fork(unsigned long clone_flags, struct task_struct *p);
 extern void sched_dead(struct task_struct *p);
 
 extern void proc_caches_init(void);
--- /dev/null
+++ b/include/linux/sched/deadline.h
@@ -0,0 +1,24 @@
+#ifndef _SCHED_DEADLINE_H
+#define _SCHED_DEADLINE_H
+
+/*
+ * SCHED_DEADLINE tasks have negative priorities, reflecting
+ * the fact that any of them has higher prio than RT and
+ * NORMAL/BATCH tasks.
+ */
+
+#define MAX_DL_PRIO		0
+
+static inline int dl_prio(int prio)
+{
+	if (unlikely(prio < MAX_DL_PRIO))
+		return 1;
+	return 0;
+}
+
+static inline int dl_task(struct task_struct *p)
+{
+	return dl_prio(p->prio);
+}
+
+#endif /* _SCHED_DEADLINE_H */
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -39,6 +39,7 @@
 #define SCHED_BATCH		3
 /* SCHED_ISO: reserved but not implemented yet */
 #define SCHED_IDLE		5
+#define SCHED_DEADLINE		6
 /* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
 #define SCHED_RESET_ON_FORK     0x40000000
 
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1310,7 +1310,9 @@ static struct task_struct *copy_process(
 #endif
 
 	/* Perform scheduler related setup. Assign this task to a CPU. */
-	sched_fork(clone_flags, p);
+	retval = sched_fork(clone_flags, p);
+	if (retval)
+		goto bad_fork_cleanup_policy;
 
 	retval = perf_event_init_task(p);
 	if (retval)
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -46,6 +46,7 @@
 #include <linux/sched.h>
 #include <linux/sched/sysctl.h>
 #include <linux/sched/rt.h>
+#include <linux/sched/deadline.h>
 #include <linux/timer.h>
 #include <linux/freezer.h>
 
@@ -1610,7 +1611,7 @@ long hrtimer_nanosleep(struct timespec *
 	unsigned long slack;
 
 	slack = current->timer_slack_ns;
-	if (rt_task(current))
+	if (dl_task(current) || rt_task(current))
 		slack = 0;
 
 	hrtimer_init_on_stack(&t.timer, clockid, mode);
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -11,7 +11,8 @@ ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER
 CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer
 endif
 
-obj-y += core.o proc.o clock.o cputime.o idle_task.o fair.o rt.o stop_task.o
+obj-y += core.o proc.o clock.o cputime.o
+obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o
 obj-y += wait.o completion.o
 obj-$(CONFIG_SMP) += cpupri.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -899,7 +899,9 @@ static inline int normal_prio(struct tas
 {
 	int prio;
 
-	if (task_has_rt_policy(p))
+	if (task_has_dl_policy(p))
+		prio = MAX_DL_PRIO-1;
+	else if (task_has_rt_policy(p))
 		prio = MAX_RT_PRIO-1 - p->rt_priority;
 	else
 		prio = __normal_prio(p);
@@ -1716,6 +1718,12 @@ static void __sched_fork(unsigned long c
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
 #endif
 
+	RB_CLEAR_NODE(&p->dl.rb_node);
+	hrtimer_init(&p->dl.dl_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	p->dl.dl_runtime = p->dl.runtime = 0;
+	p->dl.dl_deadline = p->dl.deadline = 0;
+	p->dl.flags = 0;
+
 	INIT_LIST_HEAD(&p->rt.run_list);
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
@@ -1767,7 +1775,7 @@ void set_numabalancing_state(bool enable
 /*
  * fork()/clone()-time setup:
  */
-void sched_fork(unsigned long clone_flags, struct task_struct *p)
+int sched_fork(unsigned long clone_flags, struct task_struct *p)
 {
 	unsigned long flags;
 	int cpu = get_cpu();
@@ -1789,7 +1797,7 @@ void sched_fork(unsigned long clone_flag
 	 * Revert to default priority/policy on fork if requested.
 	 */
 	if (unlikely(p->sched_reset_on_fork)) {
-		if (task_has_rt_policy(p)) {
+		if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
 			p->policy = SCHED_NORMAL;
 			p->static_prio = NICE_TO_PRIO(0);
 			p->rt_priority = 0;
@@ -1806,8 +1814,14 @@ void sched_fork(unsigned long clone_flag
 		p->sched_reset_on_fork = 0;
 	}
 
-	if (!rt_prio(p->prio))
+	if (dl_prio(p->prio)) {
+		put_cpu();
+		return -EAGAIN;
+	} else if (rt_prio(p->prio)) {
+		p->sched_class = &rt_sched_class;
+	} else {
 		p->sched_class = &fair_sched_class;
+	}
 
 	if (p->sched_class->task_fork)
 		p->sched_class->task_fork(p);
@@ -1836,6 +1850,7 @@ void sched_fork(unsigned long clone_flag
 #endif
 
 	put_cpu();
+	return 0;
 }
 
 /*
@@ -2767,7 +2782,7 @@ void rt_mutex_setprio(struct task_struct
 	struct rq *rq;
 	const struct sched_class *prev_class;
 
-	BUG_ON(prio < 0 || prio > MAX_PRIO);
+	BUG_ON(prio > MAX_PRIO);
 
 	rq = __task_rq_lock(p);
 
@@ -2799,7 +2814,9 @@ void rt_mutex_setprio(struct task_struct
 	if (running)
 		p->sched_class->put_prev_task(rq, p);
 
-	if (rt_prio(prio))
+	if (dl_prio(prio))
+		p->sched_class = &dl_sched_class;
+	else if (rt_prio(prio))
 		p->sched_class = &rt_sched_class;
 	else
 		p->sched_class = &fair_sched_class;
@@ -2833,9 +2850,9 @@ void set_user_nice(struct task_struct *p
 	 * The RT priorities are set via sched_setscheduler(), but we still
 	 * allow the 'normal' nice value to be set - but as expected
 	 * it wont have any effect on scheduling until the task is
-	 * SCHED_FIFO/SCHED_RR:
+	 * SCHED_DEADLINE, SCHED_FIFO or SCHED_RR:
 	 */
-	if (task_has_rt_policy(p)) {
+	if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
 		p->static_prio = NICE_TO_PRIO(nice);
 		goto out_unlock;
 	}
@@ -2999,7 +3016,9 @@ __setscheduler(struct rq *rq, struct tas
 	p->normal_prio = normal_prio(p);
 	/* we are holding p->pi_lock already */
 	p->prio = rt_mutex_getprio(p);
-	if (rt_prio(p->prio))
+	if (dl_prio(p->prio))
+		p->sched_class = &dl_sched_class;
+	else if (rt_prio(p->prio))
 		p->sched_class = &rt_sched_class;
 	else
 		p->sched_class = &fair_sched_class;
@@ -3007,6 +3026,50 @@ __setscheduler(struct rq *rq, struct tas
 }
 
 /*
+ * This function initializes the sched_dl_entity of a newly becoming
+ * SCHED_DEADLINE task.
+ *
+ * Only the static values are considered here, the actual runtime and the
+ * absolute deadline will be properly calculated when the task is enqueued
+ * for the first time with its new policy.
+ */
+static void
+__setparam_dl(struct task_struct *p, const struct sched_attr *attr)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+
+	init_dl_task_timer(dl_se);
+	dl_se->dl_runtime = attr->sched_runtime;
+	dl_se->dl_deadline = attr->sched_deadline;
+	dl_se->flags = attr->sched_flags;
+	dl_se->dl_throttled = 0;
+	dl_se->dl_new = 1;
+}
+
+static void
+__getparam_dl(struct task_struct *p, struct sched_attr *attr)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+
+	attr->sched_priority = p->rt_priority;
+	attr->sched_runtime = dl_se->dl_runtime;
+	attr->sched_deadline = dl_se->dl_deadline;
+	attr->sched_flags = dl_se->flags;
+}
+
+/*
+ * This function validates the new parameters of a -deadline task.
+ * We ask for the deadline to be non-zero, and greater than or
+ * equal to the runtime.
+ */
+static bool
+__checkparam_dl(const struct sched_attr *attr)
+{
+	return attr && attr->sched_deadline != 0 &&
+	       (s64)(attr->sched_deadline - attr->sched_runtime) >= 0;
+}
+
+/*
  * check the target process has a UID that matches the current process's
  */
 static bool check_same_owner(struct task_struct *p)
@@ -3043,7 +3106,8 @@ static int __sched_setscheduler(struct t
 		reset_on_fork = !!(policy & SCHED_RESET_ON_FORK);
 		policy &= ~SCHED_RESET_ON_FORK;
 
-		if (policy != SCHED_FIFO && policy != SCHED_RR &&
+		if (policy != SCHED_DEADLINE &&
+				policy != SCHED_FIFO && policy != SCHED_RR &&
 				policy != SCHED_NORMAL && policy != SCHED_BATCH &&
 				policy != SCHED_IDLE)
 			return -EINVAL;
@@ -3058,7 +3122,8 @@ static int __sched_setscheduler(struct t
 	    (p->mm && attr->sched_priority > MAX_USER_RT_PRIO-1) ||
 	    (!p->mm && attr->sched_priority > MAX_RT_PRIO-1))
 		return -EINVAL;
-	if (rt_policy(policy) != (attr->sched_priority != 0))
+	if ((dl_policy(policy) && !__checkparam_dl(attr)) ||
+	    (rt_policy(policy) != (attr->sched_priority != 0)))
 		return -EINVAL;
 
 	/*
@@ -3124,7 +3189,8 @@ static int __sched_setscheduler(struct t
 	 * If not changing anything there's no need to proceed further:
 	 */
 	if (unlikely(policy == p->policy && (!rt_policy(policy) ||
-		     attr->sched_priority == p->rt_priority))) {
+			attr->sched_priority == p->rt_priority) &&
+			!dl_policy(policy))) {
 		task_rq_unlock(rq, p, &flags);
 		return 0;
 	}
@@ -3161,6 +3227,8 @@ static int __sched_setscheduler(struct t
 
 	oldprio = p->prio;
 	prev_class = p->sched_class;
+	if (dl_policy(policy))
+		__setparam_dl(p, attr);
 	__setscheduler(rq, p, policy, attr->sched_priority);
 
 	if (running)
@@ -3331,8 +3399,11 @@ do_sched_setscheduler2(pid_t pid, int po
 	rcu_read_lock();
 	retval = -ESRCH;
 	p = find_process_by_pid(pid);
-	if (p != NULL)
+	if (p != NULL) {
+		if (dl_policy(policy))
+			attr.sched_priority = 0;
 		retval = sched_setscheduler2(p, policy, &attr);
+	}
 	rcu_read_unlock();
 
 	return retval;
@@ -3449,6 +3520,10 @@ SYSCALL_DEFINE2(sched_getparam, pid_t, p
 	if (retval)
 		goto out_unlock;
 
+	if (task_has_dl_policy(p)) {
+		retval = -EINVAL;
+		goto out_unlock;
+	}
 	lp.sched_priority = p->rt_priority;
 	rcu_read_unlock();
 
@@ -3506,7 +3581,7 @@ static int sched_read_attr(struct sched_
 }
 
 /**
- * sys_sched_getattr - same as above, but with extended "sched_param"
+ * sys_sched_getattr - similar to sched_getparam, but with sched_attr
  * @pid: the pid in question.
  * @attr: structure containing the extended parameters.
  * @size: sizeof(attr) for fwd/bwd comp.
@@ -3534,6 +3609,7 @@ SYSCALL_DEFINE3(sched_getattr, pid_t, pi
 	if (retval)
 		goto out_unlock;
 
+	__getparam_dl(p, &attr);
 	attr.sched_priority = p->rt_priority;
 	rcu_read_unlock();
 
@@ -3956,6 +4032,7 @@ SYSCALL_DEFINE1(sched_get_priority_max,
 	case SCHED_RR:
 		ret = MAX_USER_RT_PRIO-1;
 		break;
+	case SCHED_DEADLINE:
 	case SCHED_NORMAL:
 	case SCHED_BATCH:
 	case SCHED_IDLE:
@@ -3982,6 +4059,7 @@ SYSCALL_DEFINE1(sched_get_priority_min,
 	case SCHED_RR:
 		ret = 1;
 		break;
+	case SCHED_DEADLINE:
 	case SCHED_NORMAL:
 	case SCHED_BATCH:
 	case SCHED_IDLE:
@@ -6462,6 +6540,7 @@ void __init sched_init(void)
 		rq->calc_load_update = jiffies + LOAD_FREQ;
 		init_cfs_rq(&rq->cfs);
 		init_rt_rq(&rq->rt, rq);
+		init_dl_rq(&rq->dl, rq);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
 		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
@@ -6646,7 +6725,7 @@ void normalize_rt_tasks(void)
 		p->se.statistics.block_start	= 0;
 #endif
 
-		if (!rt_task(p)) {
+		if (!dl_task(p) && !rt_task(p)) {
 			/*
 			 * Renice negative nice level userspace
 			 * tasks back to 0:
--- /dev/null
+++ b/kernel/sched/deadline.c
@@ -0,0 +1,684 @@
+/*
+ * Deadline Scheduling Class (SCHED_DEADLINE)
+ *
+ * Earliest Deadline First (EDF) + Constant Bandwidth Server (CBS).
+ *
+ * Tasks that periodically execute their instances for less than their
+ * runtime won't miss any of their deadlines.
+ * Tasks that are not periodic or sporadic or that try to execute more
+ * than their reserved bandwidth will be slowed down (and may potentially
+ * miss some of their deadlines), and won't affect any other task.
+ *
+ * Copyright (C) 2012 Dario Faggioli <raistlin@linux.it>,
+ *                    Michael Trimarchi <michael@amarulasolutions.com>,
+ *                    Fabio Checconi <fchecconi@gmail.com>
+ */
+#include "sched.h"
+
+static inline int dl_time_before(u64 a, u64 b)
+{
+	return (s64)(a - b) < 0;
+}
+
+static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
+{
+	return container_of(dl_se, struct task_struct, dl);
+}
+
+static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
+{
+	return container_of(dl_rq, struct rq, dl);
+}
+
+static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+{
+	struct task_struct *p = dl_task_of(dl_se);
+	struct rq *rq = task_rq(p);
+
+	return &rq->dl;
+}
+
+static inline int on_dl_rq(struct sched_dl_entity *dl_se)
+{
+	return !RB_EMPTY_NODE(&dl_se->rb_node);
+}
+
+static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+
+	return dl_rq->rb_leftmost == &dl_se->rb_node;
+}
+
+void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
+{
+	dl_rq->rb_root = RB_ROOT;
+}
+
+static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
+static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
+static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
+				  int flags);
+
+/*
+ * We are being explicitly informed that a new instance is starting,
+ * and this means that:
+ *  - the absolute deadline of the entity has to be placed at
+ *    current time + relative deadline;
+ *  - the runtime of the entity has to be set to the maximum value.
+ *
+ * The capability of specifying such an event is useful whenever a -deadline
+ * entity wants to (try to!) synchronize its behaviour with the scheduler's
+ * one, and to (try to!) reconcile itself with its own scheduling
+ * parameters.
+ */
+static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	WARN_ON(!dl_se->dl_new || dl_se->dl_throttled);
+
+	/*
+	 * We use the regular wall clock time to set deadlines in the
+	 * future; in fact, we must consider execution overheads (time
+	 * spent on hardirq context, etc.).
+	 */
+	dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
+	dl_se->runtime = dl_se->dl_runtime;
+	dl_se->dl_new = 0;
+}
+
+/*
+ * Pure Earliest Deadline First (EDF) scheduling does not deal with the
+ * possibility of an entity lasting more than what it declared, and thus
+ * exhausting its runtime.
+ *
+ * Here we are interested in making runtime overrun possible, but we do
+ * not want an entity which is misbehaving to affect the scheduling of all
+ * other entities.
+ * Therefore, a budgeting strategy called Constant Bandwidth Server (CBS)
+ * is used, in order to confine each entity within its own bandwidth.
+ *
+ * This function deals exactly with that, and ensures that when the runtime
+ * of an entity is replenished, its deadline is also postponed. That ensures
+ * the overrunning entity can't interfere with other entities in the system and
+ * can't make them miss their deadlines. Reasons why this kind of overruns
+ * could happen are, typically, an entity voluntarily trying to overcome its
+ * runtime, or one that just underestimated it during sched_setscheduler2().
+ */
+static void replenish_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	/*
+	 * We keep moving the deadline away until we get some
+	 * available runtime for the entity. This ensures correct
+	 * handling of situations where the runtime overrun is
+	 * arbitrarily large.
+	 */
+	while (dl_se->runtime <= 0) {
+		dl_se->deadline += dl_se->dl_deadline;
+		dl_se->runtime += dl_se->dl_runtime;
+	}
+
+	/*
+	 * At this point, the deadline really should be "in
+	 * the future" with respect to rq->clock. If it's
+	 * not, we are, for some reason, lagging too much!
+	 * Anyway, after having warn userspace abut that,
+	 * Anyway, after having warned userspace about that,
+	 * resetting the deadline and the budget of the
+	 * entity.
+	 */
+	if (dl_time_before(dl_se->deadline, rq_clock(rq))) {
+		static bool lag_once = false;
+
+		if (!lag_once) {
+			lag_once = true;
+			printk_sched("sched: DL replenish lagged too much\n");
+		}
+		dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
+		dl_se->runtime = dl_se->dl_runtime;
+	}
+}
+
+/*
+ * Here we check if --at time t-- an entity (which is probably being
+ * [re]activated or, in general, enqueued) can use its remaining runtime
+ * and its current deadline _without_ exceeding the bandwidth it is
+ * assigned (function returns true if it can't). We are in fact applying
+ * one of the CBS rules: when a task wakes up, if the residual runtime
+ * over residual deadline fits within the allocated bandwidth, then we
+ * can keep the current (absolute) deadline and residual budget without
+ * disrupting the schedulability of the system. Otherwise, we should
+ * refill the runtime and set the deadline a period in the future,
+ * because keeping the current (absolute) deadline of the task would
+ * result in breaking guarantees promised to other tasks.
+ *
+ * This function returns true if:
+ *
+ *   runtime / (deadline - t) > dl_runtime / dl_deadline ,
+ *
+ * IOW we can't recycle current parameters.
+ */
+static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
+{
+	u64 left, right;
+
+	/*
+	 * left and right are the two sides of the equation above,
+	 * after a bit of shuffling to use multiplications instead
+	 * of divisions.
+	 *
+	 * Note that none of the time values involved in the two
+	 * multiplications are absolute: dl_deadline and dl_runtime
+	 * are the relative deadline and the maximum runtime of each
+	 * instance, runtime is the runtime left for the last instance
+	 * and (deadline - t), since t is rq->clock, is the time left
+	 * to the (absolute) deadline. Even if overflowing the u64 type
+	 * is very unlikely to occur in both cases, here we scale down
+	 * as we want to avoid that risk at all. Scaling down by 10
+	 * means that we reduce granularity to 1us. We are fine with it,
+	 * since this is only a true/false check and, anyway, thinking
+	 * of anything below microseconds resolution is actually fiction
+	 * (but still we want to give the user that illusion >;).
+	 */
+	left = (dl_se->dl_deadline >> 10) * (dl_se->runtime >> 10);
+	right = ((dl_se->deadline - t) >> 10) * (dl_se->dl_runtime >> 10);
+
+	return dl_time_before(right, left);
+}
+
+/*
+ * When a -deadline entity is queued back on the runqueue, its runtime and
+ * deadline might need updating.
+ *
+ * The policy here is that we update the deadline of the entity only if:
+ *  - the current deadline is in the past,
+ *  - using the remaining runtime with the current deadline would make
+ *    the entity exceed its bandwidth.
+ */
+static void update_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	/*
+	 * The arrival of a new instance needs special treatment, i.e.,
+	 * the actual scheduling parameters have to be "renewed".
+	 */
+	if (dl_se->dl_new) {
+		setup_new_dl_entity(dl_se);
+		return;
+	}
+
+	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
+	    dl_entity_overflow(dl_se, rq_clock(rq))) {
+		dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
+		dl_se->runtime = dl_se->dl_runtime;
+	}
+}
+
+/*
+ * If the entity depleted all its runtime, and if we want it to sleep
+ * while waiting for some new execution time to become available, we
+ * set the bandwidth enforcement timer to the replenishment instant
+ * and try to activate it.
+ *
+ * Notice that it is important for the caller to know if the timer
+ * actually started or not (i.e., the replenishment instant is in
+ * the future or in the past).
+ */
+static int start_dl_timer(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+	ktime_t now, act;
+	ktime_t soft, hard;
+	unsigned long range;
+	s64 delta;
+
+	/*
+	 * We want the timer to fire at the deadline, but considering
+	 * that it is actually coming from rq->clock and not from
+	 * hrtimer's time base reading.
+	 */
+	act = ns_to_ktime(dl_se->deadline);
+	now = hrtimer_cb_get_time(&dl_se->dl_timer);
+	delta = ktime_to_ns(now) - rq_clock(rq);
+	act = ktime_add_ns(act, delta);
+
+	/*
+	 * If the expiry time already passed, e.g., because the value
+	 * chosen as the deadline is too small, don't even try to
+	 * start the timer in the past!
+	 */
+	if (ktime_us_delta(act, now) < 0)
+		return 0;
+
+	hrtimer_set_expires(&dl_se->dl_timer, act);
+
+	soft = hrtimer_get_softexpires(&dl_se->dl_timer);
+	hard = hrtimer_get_expires(&dl_se->dl_timer);
+	range = ktime_to_ns(ktime_sub(hard, soft));
+	__hrtimer_start_range_ns(&dl_se->dl_timer, soft,
+				 range, HRTIMER_MODE_ABS, 0);
+
+	return hrtimer_active(&dl_se->dl_timer);
+}
+
+/*
+ * This is the bandwidth enforcement timer callback. If here, we know
+ * a task is not on its dl_rq, since the fact that the timer was running
+ * means the task is throttled and needs a runtime replenishment.
+ *
+ * However, what we actually do depends on whether the task is active
+ * (it is on its rq) or has been removed from there by a call to
+ * dequeue_task_dl(). In the former case we must issue the runtime
+ * replenishment and add the task back to the dl_rq; in the latter, we just
+ * do nothing but clearing dl_throttled, so that runtime and deadline
+ * updating (and the queueing back to dl_rq) will be done by the
+ * next call to enqueue_task_dl().
+ */
+static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
+{
+	struct sched_dl_entity *dl_se = container_of(timer,
+						     struct sched_dl_entity,
+						     dl_timer);
+	struct task_struct *p = dl_task_of(dl_se);
+	struct rq *rq = task_rq(p);
+	raw_spin_lock(&rq->lock);
+
+	/*
+	 * We need to take care of possible races here. In fact, the
+	 * task might have changed its scheduling policy to something
+	 * different from SCHED_DEADLINE or changed its reservation
+	 * parameters (through sched_setscheduler()).
+	 */
+	if (!dl_task(p) || dl_se->dl_new)
+		goto unlock;
+
+	sched_clock_tick();
+	update_rq_clock(rq);
+	dl_se->dl_throttled = 0;
+	if (p->on_rq) {
+		enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
+		if (task_has_dl_policy(rq->curr))
+			check_preempt_curr_dl(rq, p, 0);
+		else
+			resched_task(rq->curr);
+	}
+unlock:
+	raw_spin_unlock(&rq->lock);
+
+	return HRTIMER_NORESTART;
+}
+
+void init_dl_task_timer(struct sched_dl_entity *dl_se)
+{
+	struct hrtimer *timer = &dl_se->dl_timer;
+
+	if (hrtimer_active(timer)) {
+		hrtimer_try_to_cancel(timer);
+		return;
+	}
+
+	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	timer->function = dl_task_timer;
+}
+
+static
+int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
+{
+	int dmiss = dl_time_before(dl_se->deadline, rq_clock(rq));
+	int rorun = dl_se->runtime <= 0;
+
+	if (!rorun && !dmiss)
+		return 0;
+
+	/*
+	 * If we are beyond our current deadline and we are still
+	 * executing, then we have already used some of the runtime of
+	 * the next instance. Thus, if we do not account that, we are
+	 * stealing bandwidth from the system at each deadline miss!
+	 */
+	if (dmiss) {
+		dl_se->runtime = rorun ? dl_se->runtime : 0;
+		dl_se->runtime -= rq_clock(rq) - dl_se->deadline;
+	}
+
+	return 1;
+}
+
+/*
+ * Update the current task's runtime statistics (provided it is still
+ * a -deadline task and has not been removed from the dl_rq).
+ */
+static void update_curr_dl(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	struct sched_dl_entity *dl_se = &curr->dl;
+	u64 delta_exec;
+
+	if (!dl_task(curr) || !on_dl_rq(dl_se))
+		return;
+
+	/*
+	 * Consumed budget is computed considering the time as
+	 * observed by schedulable tasks (excluding time spent
+	 * in hardirq context, etc.). Deadlines are instead
+	 * computed using hard walltime. This seems to be the more
+	 * natural solution, but the full ramifications of this
+	 * approach need further study.
+	 */
+	delta_exec = rq_clock_task(rq) - curr->se.exec_start;
+	if (unlikely((s64)delta_exec < 0))
+		delta_exec = 0;
+
+	schedstat_set(curr->se.statistics.exec_max,
+		      max(curr->se.statistics.exec_max, delta_exec));
+
+	curr->se.sum_exec_runtime += delta_exec;
+	account_group_exec_runtime(curr, delta_exec);
+
+	curr->se.exec_start = rq_clock_task(rq);
+	cpuacct_charge(curr, delta_exec);
+
+	dl_se->runtime -= delta_exec;
+	if (dl_runtime_exceeded(rq, dl_se)) {
+		__dequeue_task_dl(rq, curr, 0);
+		if (likely(start_dl_timer(dl_se)))
+			dl_se->dl_throttled = 1;
+		else
+			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
+
+		if (!is_leftmost(curr, &rq->dl))
+			resched_task(curr);
+	}
+}
+
+static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rb_node **link = &dl_rq->rb_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct sched_dl_entity *entry;
+	int leftmost = 1;
+
+	BUG_ON(!RB_EMPTY_NODE(&dl_se->rb_node));
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct sched_dl_entity, rb_node);
+		if (dl_time_before(dl_se->deadline, entry->deadline))
+			link = &parent->rb_left;
+		else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		dl_rq->rb_leftmost = &dl_se->rb_node;
+
+	rb_link_node(&dl_se->rb_node, parent, link);
+	rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
+
+	dl_rq->dl_nr_running++;
+}
+
+static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+	if (RB_EMPTY_NODE(&dl_se->rb_node))
+		return;
+
+	if (dl_rq->rb_leftmost == &dl_se->rb_node) {
+		struct rb_node *next_node;
+
+		next_node = rb_next(&dl_se->rb_node);
+		dl_rq->rb_leftmost = next_node;
+	}
+
+	rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
+	RB_CLEAR_NODE(&dl_se->rb_node);
+
+	dl_rq->dl_nr_running--;
+}
+
+static void
+enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
+{
+	BUG_ON(on_dl_rq(dl_se));
+
+	/*
+	 * If this is a wakeup or a new instance, the scheduling
+	 * parameters of the task might need updating. Otherwise,
+	 * we want a replenishment of its runtime.
+	 */
+	if (!dl_se->dl_new && flags & ENQUEUE_REPLENISH)
+		replenish_dl_entity(dl_se);
+	else
+		update_dl_entity(dl_se);
+
+	__enqueue_dl_entity(dl_se);
+}
+
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
+{
+	__dequeue_dl_entity(dl_se);
+}
+
+static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
+{
+	/*
+	 * If p is throttled, we do nothing. In fact, if it exhausted
+	 * its budget it needs a replenishment and, since it now is on
+	 * its rq, the bandwidth timer callback (which clearly has not
+	 * run yet) will take care of this.
+	 */
+	if (p->dl.dl_throttled)
+		return;
+
+	enqueue_dl_entity(&p->dl, flags);
+	inc_nr_running(rq);
+}
+
+static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
+{
+	dequeue_dl_entity(&p->dl);
+}
+
+static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
+{
+	update_curr_dl(rq);
+	__dequeue_task_dl(rq, p, flags);
+
+	dec_nr_running(rq);
+}
+
+/*
+ * Yield task semantic for -deadline tasks is:
+ *
+ *   get off from the CPU until our next instance, with
+ *   a new runtime. This is of little use now, since we
+ *   don't have a bandwidth reclaiming mechanism. Anyway,
+ *   bandwidth reclaiming is planned for the future, and
+ *   yield_task_dl will indicate that some spare budget
+ *   is available for other task instances to use it.
+ */
+static void yield_task_dl(struct rq *rq)
+{
+	struct task_struct *p = rq->curr;
+
+	/*
+	 * We make the task go to sleep until its current deadline by
+	 * forcing its runtime to zero. This way, update_curr_dl() stops
+	 * it and the bandwidth timer will wake it up and will give it
+	 * new scheduling parameters (thanks to dl_new=1).
+	 */
+	if (p->dl.runtime > 0) {
+		rq->curr->dl.dl_new = 1;
+		p->dl.runtime = 0;
+	}
+	update_curr_dl(rq);
+}
+
+/*
+ * Only called when both the current and waking task are -deadline
+ * tasks.
+ */
+static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
+				  int flags)
+{
+	if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
+		resched_task(rq->curr);
+}
+
+#ifdef CONFIG_SCHED_HRTICK
+static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+{
+	s64 delta = p->dl.dl_runtime - p->dl.runtime;
+
+	if (delta > 10000)
+		hrtick_start(rq, p->dl.runtime);
+}
+#endif
+
+static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
+						   struct dl_rq *dl_rq)
+{
+	struct rb_node *left = dl_rq->rb_leftmost;
+
+	if (!left)
+		return NULL;
+
+	return rb_entry(left, struct sched_dl_entity, rb_node);
+}
+
+struct task_struct *pick_next_task_dl(struct rq *rq)
+{
+	struct sched_dl_entity *dl_se;
+	struct task_struct *p;
+	struct dl_rq *dl_rq;
+
+	dl_rq = &rq->dl;
+
+	if (unlikely(!dl_rq->dl_nr_running))
+		return NULL;
+
+	dl_se = pick_next_dl_entity(rq, dl_rq);
+	BUG_ON(!dl_se);
+
+	p = dl_task_of(dl_se);
+	p->se.exec_start = rq_clock_task(rq);
+#ifdef CONFIG_SCHED_HRTICK
+	if (hrtick_enabled(rq))
+		start_hrtick_dl(rq, p);
+#endif
+	return p;
+}
+
+static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
+{
+	update_curr_dl(rq);
+}
+
+static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
+{
+	update_curr_dl(rq);
+
+#ifdef CONFIG_SCHED_HRTICK
+	if (hrtick_enabled(rq) && queued && p->dl.runtime > 0)
+		start_hrtick_dl(rq, p);
+#endif
+}
+
+static void task_fork_dl(struct task_struct *p)
+{
+	/*
+	 * SCHED_DEADLINE tasks cannot fork and this is achieved through
+	 * sched_fork()
+	 */
+}
+
+static void task_dead_dl(struct task_struct *p)
+{
+	struct hrtimer *timer = &p->dl.dl_timer;
+
+	if (hrtimer_active(timer))
+		hrtimer_try_to_cancel(timer);
+}
+
+static void set_curr_task_dl(struct rq *rq)
+{
+	struct task_struct *p = rq->curr;
+
+	p->se.exec_start = rq_clock_task(rq);
+}
+
+static void switched_from_dl(struct rq *rq, struct task_struct *p)
+{
+	if (hrtimer_active(&p->dl.dl_timer))
+		hrtimer_try_to_cancel(&p->dl.dl_timer);
+}
+
+static void switched_to_dl(struct rq *rq, struct task_struct *p)
+{
+	/*
+	 * If p is throttled, don't consider the possibility
+	 * of preempting rq->curr, the check will be done right
+	 * after its runtime will get replenished.
+	 */
+	if (unlikely(p->dl.dl_throttled))
+		return;
+
+	if (!p->on_rq || rq->curr != p) {
+		if (task_has_dl_policy(rq->curr))
+			check_preempt_curr_dl(rq, p, 0);
+		else
+			resched_task(rq->curr);
+	}
+}
+
+static void prio_changed_dl(struct rq *rq, struct task_struct *p,
+			    int oldprio)
+{
+	switched_to_dl(rq, p);
+}
+
+#ifdef CONFIG_SMP
+static int
+select_task_rq_dl(struct task_struct *p, int prev_cpu, int sd_flag, int flags)
+{
+	return task_cpu(p);
+}
+#endif
+
+const struct sched_class dl_sched_class = {
+	.next			= &rt_sched_class,
+	.enqueue_task		= enqueue_task_dl,
+	.dequeue_task		= dequeue_task_dl,
+	.yield_task		= yield_task_dl,
+
+	.check_preempt_curr	= check_preempt_curr_dl,
+
+	.pick_next_task		= pick_next_task_dl,
+	.put_prev_task		= put_prev_task_dl,
+
+#ifdef CONFIG_SMP
+	.select_task_rq		= select_task_rq_dl,
+#endif
+
+	.set_curr_task		= set_curr_task_dl,
+	.task_tick		= task_tick_dl,
+	.task_fork              = task_fork_dl,
+	.task_dead		= task_dead_dl,
+
+	.prio_changed           = prio_changed_dl,
+	.switched_from		= switched_from_dl,
+	.switched_to		= switched_to_dl,
+};
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2,6 +2,7 @@
 #include <linux/sched.h>
 #include <linux/sched/sysctl.h>
 #include <linux/sched/rt.h>
+#include <linux/sched/deadline.h>
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
 #include <linux/stop_machine.h>
@@ -88,11 +89,23 @@ static inline int rt_policy(int policy)
 	return 0;
 }
 
+static inline int dl_policy(int policy)
+{
+	if (unlikely(policy == SCHED_DEADLINE))
+		return 1;
+	return 0;
+}
+
 static inline int task_has_rt_policy(struct task_struct *p)
 {
 	return rt_policy(p->policy);
 }
 
+static inline int task_has_dl_policy(struct task_struct *p)
+{
+	return dl_policy(p->policy);
+}
+
 /*
  * This is the priority-queue data structure of the RT scheduling class:
  */
@@ -364,6 +377,15 @@ struct rt_rq {
 #endif
 };
 
+/* Deadline class' related fields in a runqueue */
+struct dl_rq {
+	/* runqueue is an rbtree, ordered by deadline */
+	struct rb_root rb_root;
+	struct rb_node *rb_leftmost;
+
+	unsigned long dl_nr_running;
+};
+
 #ifdef CONFIG_SMP
 
 /*
@@ -432,6 +454,7 @@ struct rq {
 
 	struct cfs_rq cfs;
 	struct rt_rq rt;
+	struct dl_rq dl;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this cpu: */
@@ -988,6 +1011,7 @@ static const u32 prio_to_wmult[40] = {
 #else
 #define ENQUEUE_WAKING		0
 #endif
+#define ENQUEUE_REPLENISH	8
 
 #define DEQUEUE_SLEEP		1
 
@@ -1043,6 +1067,7 @@ struct sched_class {
    for (class = sched_class_highest; class; class = class->next)
 
 extern const struct sched_class stop_sched_class;
+extern const struct sched_class dl_sched_class;
 extern const struct sched_class rt_sched_class;
 extern const struct sched_class fair_sched_class;
 extern const struct sched_class idle_sched_class;
@@ -1078,6 +1103,8 @@ extern void resched_cpu(int cpu);
 extern struct rt_bandwidth def_rt_bandwidth;
 extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
 
+extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
+
 extern void update_idle_cpu_load(struct rq *this_rq);
 
 extern void init_task_runnable_average(struct task_struct *p);
@@ -1354,6 +1381,7 @@ extern void print_rt_stats(struct seq_fi
 
 extern void init_cfs_rq(struct cfs_rq *cfs_rq);
 extern void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq);
+extern void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq);
 
 extern void cfs_bandwidth_usage_inc(void);
 extern void cfs_bandwidth_usage_dec(void);
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -103,7 +103,7 @@ get_rr_interval_stop(struct rq *rq, stru
  * Simple, special scheduling class for the per-CPU stop tasks:
  */
 const struct sched_class stop_sched_class = {
-	.next			= &rt_sched_class,
+	.next			= &dl_sched_class,
 
 	.enqueue_task		= enqueue_task_stop,
 	.dequeue_task		= dequeue_task_stop,
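
The one-line change above is what establishes the overall class ordering
stop -> deadline -> rt -> fair -> idle via the ->next pointers that
for_each_class() follows. A tiny user-space model of that walk
(illustrative names only, not kernel code):

#include <stdio.h>

struct class_model {
	const char *name;
	const struct class_model *next;
	int (*has_runnable)(void);	/* stand-in for pick_next_task() */
};

static int none(void) { return 0; }
static int some(void) { return 1; }

static const struct class_model idle_c = { "idle", NULL,    some };
static const struct class_model fair_c = { "fair", &idle_c, none };
static const struct class_model rt_c   = { "rt",   &fair_c, some };
static const struct class_model dl_c   = { "dl",   &rt_c,   some };
static const struct class_model stop_c = { "stop", &dl_c,   none };

int main(void)
{
	const struct class_model *c;

	/* Mirrors for_each_class(): a runnable -deadline task beats -rt. */
	for (c = &stop_c; c; c = c->next) {
		if (c->has_runnable()) {
			printf("picked a %s task\n", c->name);	/* "dl" */
			break;
		}
	}
	return 0;
}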



^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 03/13] sched: SCHED_DEADLINE SMP-related data structures & logic
  2013-12-17 12:27 [PATCH 00/13] sched, deadline: patches Peter Zijlstra
  2013-12-17 12:27 ` [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI Peter Zijlstra
  2013-12-17 12:27 ` [PATCH 02/13] sched: SCHED_DEADLINE structures & implementation Peter Zijlstra
@ 2013-12-17 12:27 ` Peter Zijlstra
  2013-12-17 12:27 ` [PATCH 04/13] [PATCH 05/13] sched: SCHED_DEADLINE avg_update accounting Peter Zijlstra
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2013-12-17 12:27 UTC (permalink / raw)
  To: tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang, jkacur
  Cc: Peter Zijlstra

[-- Attachment #1: 0004-sched-SCHED_DEADLINE-SMP-related-data-structures-log.patch --]
[-- Type: text/plain, Size: 33387 bytes --]

From: Juri Lelli <juri.lelli@gmail.com>

Introduces the data structures needed to implement dynamic
migration of -deadline tasks, along with the logic for checking
whether runqueues are overloaded with -deadline tasks and for
choosing where a task should migrate, when that is the case.

It also adds dynamic migrations to SCHED_DEADLINE, so that tasks
can be moved among CPUs when necessary. It is also possible to bind
a task to a (set of) CPU(s), thus restricting its ability to
migrate, or forbidding migrations altogether.

The very same approach used in sched_rt is followed:
 - -deadline tasks are kept in CPU-specific runqueues,
 - -deadline tasks are migrated among runqueues to achieve the
   following:
    * on an M-CPU system the M earliest deadline ready tasks
      are always running;
    * affinity/cpusets settings of all the -deadline tasks are
      always respected.

Therefore, this very special form of "load balancing" is done with
an active method, i.e., the scheduler pushes or pulls tasks between
runqueues when they are woken up and/or (de)scheduled.
IOW, every time a preemption occurs, the descheduled task might be sent
to some other CPU (depending on its deadline) to continue executing
(push). On the other hand, every time a CPU becomes idle, it might pull
the second earliest deadline ready task from some other CPU.

To enforce this, a pull operation is always attempted before taking any
scheduling decision (pre_schedule()), as well as a push one after each
scheduling decision (post_schedule()). In addition, when a task arrives
or wakes up, the best CPU on which to resume it is selected taking into
account its affinity mask, the system topology, but also its deadline.
E.g., from the scheduling point of view, the best CPU on which to wake
up (and also to push) a task is the one running the task with the
latest deadline among the M executing ones.

In order to facilitate these decisions, per-runqueue "caching" of the
deadlines of the currently running and of the first ready task is used.
Queued but not running tasks are also parked in another rb-tree to
speed up pushes.
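
The CPU-selection rule described above can be condensed into a small
stand-alone sketch (illustrative user-space C; struct and helper names
are hypothetical, the in-kernel counterparts being latest_cpu_find()
and find_later_rq() in the patch below). Affinity filtering is omitted
for brevity:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU snapshot of the cached values described above. */
struct dl_cpu_state {
	bool has_dl;		/* any queued -deadline task?		*/
	uint64_t earliest;	/* cached earliest absolute deadline	*/
};

/* Wrap-safe "a before b", as dl_time_before() in the patch. */
static inline bool dl_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/*
 * Return the best push/wake-up target among @nr CPUs for a task with
 * absolute deadline @dl: an idle -deadline runqueue wins outright;
 * otherwise pick the CPU whose earliest deadline is the latest one
 * still later than @dl. Returns -1 if no CPU is suitable.
 */
static int best_later_cpu(const struct dl_cpu_state *cpus, int nr, uint64_t dl)
{
	int i, found = -1;
	uint64_t max_dl = 0;

	for (i = 0; i < nr; i++) {
		if (!cpus[i].has_dl)
			return i;	/* idle: best possible target */
		if (dl_before(dl, cpus[i].earliest) &&
		    (found < 0 || dl_before(max_dl, cpus[i].earliest))) {
			max_dl = cpus[i].earliest;
			found = i;
		}
	}
	return found;
}

int main(void)
{
	const struct dl_cpu_state cpus[] = {
		{ true, 1000 }, { true, 5000 }, { true, 3000 },
	};

	/* A task with absolute deadline 2000 should go to CPU 1 (latest). */
	return best_later_cpu(cpus, 3, 2000) == 1 ? 0 : 1;
}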

Cc: jkacur@redhat.com
Cc: tglx@linutronix.de
Cc: vincent.guittot@linaro.org
Cc: dhaval.giani@gmail.com
Cc: raistlin@linux.it
Cc: luca.abeni@unitn.it
Cc: paulmck@linux.vnet.ibm.com
Cc: fchecconi@gmail.com
Cc: fweisbec@gmail.com
Cc: harald.gustafsson@ericsson.com
Cc: mingo@redhat.com
Cc: hgu1972@gmail.com
Cc: johan.eker@ericsson.com
Cc: tommaso.cucinotta@sssup.it
Cc: nicola.manica@disi.unitn.it
Cc: insop.song@gmail.com
Cc: claudio@evidence.eu.com
Cc: michael@amarulasolutions.com
Cc: rostedt@goodmis.org
Cc: p.faure@akatech.ch
Cc: liming.wang@windriver.com
Cc: darren@dvhart.com
Cc: bruce.ashfield@windriver.com
Cc: oleg@redhat.com
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/sched.h   |    1 
 kernel/sched/core.c     |    9 
 kernel/sched/deadline.c |  934 +++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/rt.c       |    2 
 kernel/sched/sched.h    |   34 +
 5 files changed, 963 insertions(+), 17 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1193,6 +1193,7 @@ struct task_struct {
 	struct list_head tasks;
 #ifdef CONFIG_SMP
 	struct plist_node pushable_tasks;
+	struct rb_node pushable_dl_tasks;
 #endif
 
 	struct mm_struct *mm, *active_mm;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1847,6 +1847,7 @@ int sched_fork(unsigned long clone_flags
 	init_task_preempt_count(p);
 #ifdef CONFIG_SMP
 	plist_node_init(&p->pushable_tasks, MAX_PRIO);
+	RB_CLEAR_NODE(&p->pushable_dl_tasks);
 #endif
 
 	put_cpu();
@@ -5031,6 +5032,7 @@ static void free_rootdomain(struct rcu_h
 	struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
 
 	cpupri_cleanup(&rd->cpupri);
+	free_cpumask_var(rd->dlo_mask);
 	free_cpumask_var(rd->rto_mask);
 	free_cpumask_var(rd->online);
 	free_cpumask_var(rd->span);
@@ -5082,8 +5084,10 @@ static int init_rootdomain(struct root_d
 		goto out;
 	if (!alloc_cpumask_var(&rd->online, GFP_KERNEL))
 		goto free_span;
-	if (!alloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
+	if (!alloc_cpumask_var(&rd->dlo_mask, GFP_KERNEL))
 		goto free_online;
+	if (!alloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
+		goto free_dlo_mask;
 
 	if (cpupri_init(&rd->cpupri) != 0)
 		goto free_rto_mask;
@@ -5091,6 +5095,8 @@ static int init_rootdomain(struct root_d
 
 free_rto_mask:
 	free_cpumask_var(rd->rto_mask);
+free_dlo_mask:
+	free_cpumask_var(rd->dlo_mask);
 free_online:
 	free_cpumask_var(rd->online);
 free_span:
@@ -6441,6 +6447,7 @@ void __init sched_init_smp(void)
 	free_cpumask_var(non_isolated_cpus);
 
 	init_sched_rt_class();
+	init_sched_dl_class();
 }
 #else
 void __init sched_init_smp(void)
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -10,6 +10,7 @@
  * miss some of their deadlines), and won't affect any other task.
  *
  * Copyright (C) 2012 Dario Faggioli <raistlin@linux.it>,
+ *                    Juri Lelli <juri.lelli@gmail.com>,
  *                    Michael Trimarchi <michael@amarulasolutions.com>,
  *                    Fabio Checconi <fchecconi@gmail.com>
  */
@@ -20,6 +21,15 @@ static inline int dl_time_before(u64 a,
 	return (s64)(a - b) < 0;
 }
 
+/*
+ * Tells if entity @a should preempt entity @b.
+ */
+static inline
+int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
+{
+	return dl_time_before(a->deadline, b->deadline);
+}
+
 static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
 {
 	return container_of(dl_se, struct task_struct, dl);
@@ -53,8 +63,168 @@ static inline int is_leftmost(struct tas
 void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
 {
 	dl_rq->rb_root = RB_ROOT;
+
+#ifdef CONFIG_SMP
+	/* zero means no -deadline tasks */
+	dl_rq->earliest_dl.curr = dl_rq->earliest_dl.next = 0;
+
+	dl_rq->dl_nr_migratory = 0;
+	dl_rq->overloaded = 0;
+	dl_rq->pushable_dl_tasks_root = RB_ROOT;
+#endif
+}
+
+#ifdef CONFIG_SMP
+
+static inline int dl_overloaded(struct rq *rq)
+{
+	return atomic_read(&rq->rd->dlo_count);
+}
+
+static inline void dl_set_overload(struct rq *rq)
+{
+	if (!rq->online)
+		return;
+
+	cpumask_set_cpu(rq->cpu, rq->rd->dlo_mask);
+	/*
+	 * Must be visible before the overload count is
+	 * set (as in sched_rt.c).
+	 *
+	 * Matched by the barrier in pull_dl_task().
+	 */
+	smp_wmb();
+	atomic_inc(&rq->rd->dlo_count);
+}
+
+static inline void dl_clear_overload(struct rq *rq)
+{
+	if (!rq->online)
+		return;
+
+	atomic_dec(&rq->rd->dlo_count);
+	cpumask_clear_cpu(rq->cpu, rq->rd->dlo_mask);
+}
+
+static void update_dl_migration(struct dl_rq *dl_rq)
+{
+	if (dl_rq->dl_nr_migratory && dl_rq->dl_nr_total > 1) {
+		if (!dl_rq->overloaded) {
+			dl_set_overload(rq_of_dl_rq(dl_rq));
+			dl_rq->overloaded = 1;
+		}
+	} else if (dl_rq->overloaded) {
+		dl_clear_overload(rq_of_dl_rq(dl_rq));
+		dl_rq->overloaded = 0;
+	}
+}
+
+static void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	struct task_struct *p = dl_task_of(dl_se);
+	dl_rq = &rq_of_dl_rq(dl_rq)->dl;
+
+	dl_rq->dl_nr_total++;
+	if (p->nr_cpus_allowed > 1)
+		dl_rq->dl_nr_migratory++;
+
+	update_dl_migration(dl_rq);
+}
+
+static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	struct task_struct *p = dl_task_of(dl_se);
+	dl_rq = &rq_of_dl_rq(dl_rq)->dl;
+
+	dl_rq->dl_nr_total--;
+	if (p->nr_cpus_allowed > 1)
+		dl_rq->dl_nr_migratory--;
+
+	update_dl_migration(dl_rq);
+}
+
+/*
+ * Unlike in sched_rt.c, the list of pushable -deadline tasks is not
+ * a plist; it is an rb-tree with tasks ordered by deadline.
+ */
+static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+	struct dl_rq *dl_rq = &rq->dl;
+	struct rb_node **link = &dl_rq->pushable_dl_tasks_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct task_struct *entry;
+	int leftmost = 1;
+
+	BUG_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct task_struct,
+				 pushable_dl_tasks);
+		if (dl_entity_preempt(&p->dl, &entry->dl))
+			link = &parent->rb_left;
+		else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		dl_rq->pushable_dl_tasks_leftmost = &p->pushable_dl_tasks;
+
+	rb_link_node(&p->pushable_dl_tasks, parent, link);
+	rb_insert_color(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
+}
+
+static void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+	struct dl_rq *dl_rq = &rq->dl;
+
+	if (RB_EMPTY_NODE(&p->pushable_dl_tasks))
+		return;
+
+	if (dl_rq->pushable_dl_tasks_leftmost == &p->pushable_dl_tasks) {
+		struct rb_node *next_node;
+
+		next_node = rb_next(&p->pushable_dl_tasks);
+		dl_rq->pushable_dl_tasks_leftmost = next_node;
+	}
+
+	rb_erase(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
+	RB_CLEAR_NODE(&p->pushable_dl_tasks);
+}
+
+static inline int has_pushable_dl_tasks(struct rq *rq)
+{
+	return !RB_EMPTY_ROOT(&rq->dl.pushable_dl_tasks_root);
+}
+
+static int push_dl_task(struct rq *rq);
+
+#else
+
+static inline
+void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline
+void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline
+void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+}
+
+static inline
+void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
 }
 
+#endif /* CONFIG_SMP */
+
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
 static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
 static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
@@ -309,6 +479,14 @@ static enum hrtimer_restart dl_task_time
 			check_preempt_curr_dl(rq, p, 0);
 		else
 			resched_task(rq->curr);
+#ifdef CONFIG_SMP
+		/*
+		 * Queueing this task back might have overloaded rq,
+		 * check if we need to kick someone away.
+		 */
+		if (has_pushable_dl_tasks(rq))
+			push_dl_task(rq);
+#endif
 	}
 unlock:
 	raw_spin_unlock(&rq->lock);
@@ -399,6 +577,100 @@ static void update_curr_dl(struct rq *rq
 	}
 }
 
+#ifdef CONFIG_SMP
+
+static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu);
+
+static inline u64 next_deadline(struct rq *rq)
+{
+	struct task_struct *next = pick_next_earliest_dl_task(rq, rq->cpu);
+
+	if (next && dl_prio(next->prio))
+		return next->dl.deadline;
+	else
+		return 0;
+}
+
+static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
+{
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	if (dl_rq->earliest_dl.curr == 0 ||
+	    dl_time_before(deadline, dl_rq->earliest_dl.curr)) {
+		/*
+		 * If the dl_rq had no -deadline tasks, or if the new task
+		 * has shorter deadline than the current one on dl_rq, we
+		 * know that the previous earliest becomes our next earliest,
+		 * as the new task becomes the earliest itself.
+		 */
+		dl_rq->earliest_dl.next = dl_rq->earliest_dl.curr;
+		dl_rq->earliest_dl.curr = deadline;
+	} else if (dl_rq->earliest_dl.next == 0 ||
+		   dl_time_before(deadline, dl_rq->earliest_dl.next)) {
+		/*
+		 * On the other hand, if the new -deadline task has a
+		 * later deadline than the earliest one on dl_rq, but
+		 * it is earlier than the next (if any), we must
+		 * recompute the next-earliest.
+		 */
+		dl_rq->earliest_dl.next = next_deadline(rq);
+	}
+}
+
+static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
+{
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	/*
+	 * Since we may have removed our earliest (and/or next earliest)
+	 * task we must recompute them.
+	 */
+	if (!dl_rq->dl_nr_running) {
+		dl_rq->earliest_dl.curr = 0;
+		dl_rq->earliest_dl.next = 0;
+	} else {
+		struct rb_node *leftmost = dl_rq->rb_leftmost;
+		struct sched_dl_entity *entry;
+
+		entry = rb_entry(leftmost, struct sched_dl_entity, rb_node);
+		dl_rq->earliest_dl.curr = entry->deadline;
+		dl_rq->earliest_dl.next = next_deadline(rq);
+	}
+}
+
+#else
+
+static inline void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
+static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
+
+#endif /* CONFIG_SMP */
+
+static inline
+void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	int prio = dl_task_of(dl_se)->prio;
+	u64 deadline = dl_se->deadline;
+
+	WARN_ON(!dl_prio(prio));
+	dl_rq->dl_nr_running++;
+
+	inc_dl_deadline(dl_rq, deadline);
+	inc_dl_migration(dl_se, dl_rq);
+}
+
+static inline
+void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	int prio = dl_task_of(dl_se)->prio;
+
+	WARN_ON(!dl_prio(prio));
+	WARN_ON(!dl_rq->dl_nr_running);
+	dl_rq->dl_nr_running--;
+
+	dec_dl_deadline(dl_rq, dl_se->deadline);
+	dec_dl_migration(dl_se, dl_rq);
+}
+
 static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
@@ -426,7 +698,7 @@ static void __enqueue_dl_entity(struct s
 	rb_link_node(&dl_se->rb_node, parent, link);
 	rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
 
-	dl_rq->dl_nr_running++;
+	inc_dl_tasks(dl_se, dl_rq);
 }
 
 static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
@@ -446,7 +718,7 @@ static void __dequeue_dl_entity(struct s
 	rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
 	RB_CLEAR_NODE(&dl_se->rb_node);
 
-	dl_rq->dl_nr_running--;
+	dec_dl_tasks(dl_se, dl_rq);
 }
 
 static void
@@ -484,12 +756,17 @@ static void enqueue_task_dl(struct rq *r
 		return;
 
 	enqueue_dl_entity(&p->dl, flags);
+
+	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
+		enqueue_pushable_dl_task(rq, p);
+
 	inc_nr_running(rq);
 }
 
 static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	dequeue_dl_entity(&p->dl);
+	dequeue_pushable_dl_task(rq, p);
 }
 
 static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
@@ -527,6 +804,74 @@ static void yield_task_dl(struct rq *rq)
 	update_curr_dl(rq);
 }
 
+#ifdef CONFIG_SMP
+
+static int find_later_rq(struct task_struct *task);
+static int latest_cpu_find(struct cpumask *span,
+			   struct task_struct *task,
+			   struct cpumask *later_mask);
+
+static int
+select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
+{
+	struct task_struct *curr;
+	struct rq *rq;
+
+	if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
+		goto out;
+
+	rq = cpu_rq(cpu);
+
+	rcu_read_lock();
+	curr = ACCESS_ONCE(rq->curr); /* unlocked access */
+
+	/*
+	 * If we are dealing with a -deadline task, we must
+	 * decide where to wake it up.
+	 * If it has a later deadline and the current task
+	 * on this rq can't move (provided the waking task
+	 * can!) we prefer to send it somewhere else. On the
+	 * other hand, if it has a shorter deadline, we
+	 * try to make it stay here, it might be important.
+	 */
+	if (unlikely(dl_task(curr)) &&
+	    (curr->nr_cpus_allowed < 2 ||
+	     !dl_entity_preempt(&p->dl, &curr->dl)) &&
+	    (p->nr_cpus_allowed > 1)) {
+		int target = find_later_rq(p);
+
+		if (target != -1)
+			cpu = target;
+	}
+	rcu_read_unlock();
+
+out:
+	return cpu;
+}
+
+static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
+{
+	/*
+	 * Current can't be migrated, useless to reschedule,
+	 * let's hope p can move out.
+	 */
+	if (rq->curr->nr_cpus_allowed == 1 ||
+	    latest_cpu_find(rq->rd->span, rq->curr, NULL) == -1)
+		return;
+
+	/*
+	 * p is migratable, so let's not schedule it and
+	 * see if it is pushed or pulled somewhere else.
+	 */
+	if (p->nr_cpus_allowed != 1 &&
+	    latest_cpu_find(rq->rd->span, p, NULL) != -1)
+		return;
+
+	resched_task(rq->curr);
+}
+
+#endif /* CONFIG_SMP */
+
 /*
  * Only called when both the current and waking task are -deadline
  * tasks.
@@ -534,8 +879,20 @@ static void yield_task_dl(struct rq *rq)
 static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
 				  int flags)
 {
-	if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
+	if (dl_entity_preempt(&p->dl, &rq->curr->dl)) {
 		resched_task(rq->curr);
+		return;
+	}
+
+#ifdef CONFIG_SMP
+	/*
+	 * In the unlikely case current and p have the same deadline
+	 * let us try to decide what's the best thing to do...
+	 */
+	if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0 &&
+	    !need_resched())
+		check_preempt_equal_dl(rq, p);
+#endif /* CONFIG_SMP */
 }
 
 #ifdef CONFIG_SCHED_HRTICK
@@ -575,16 +932,29 @@ struct task_struct *pick_next_task_dl(st
 
 	p = dl_task_of(dl_se);
 	p->se.exec_start = rq_clock_task(rq);
+
+	/* Running task will never be pushed. */
+	if (p)
+		dequeue_pushable_dl_task(rq, p);
+
 #ifdef CONFIG_SCHED_HRTICK
 	if (hrtick_enabled(rq))
 		start_hrtick_dl(rq, p);
 #endif
+
+#ifdef CONFIG_SMP
+	rq->post_schedule = has_pushable_dl_tasks(rq);
+#endif /* CONFIG_SMP */
+
 	return p;
 }
 
 static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
 {
 	update_curr_dl(rq);
+
+	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
+		enqueue_pushable_dl_task(rq, p);
 }
 
 static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
@@ -618,16 +988,517 @@ static void set_curr_task_dl(struct rq *
 	struct task_struct *p = rq->curr;
 
 	p->se.exec_start = rq_clock_task(rq);
+
+	/* You can't push away the running task */
+	dequeue_pushable_dl_task(rq, p);
+}
+
+#ifdef CONFIG_SMP
+
+/* Only try algorithms three times */
+#define DL_MAX_TRIES 3
+
+static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
+{
+	if (!task_running(rq, p) &&
+	    (cpu < 0 || cpumask_test_cpu(cpu, &p->cpus_allowed)) &&
+	    (p->nr_cpus_allowed > 1))
+		return 1;
+
+	return 0;
+}
+
+/* Returns the second earliest -deadline task, NULL otherwise */
+static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu)
+{
+	struct rb_node *next_node = rq->dl.rb_leftmost;
+	struct sched_dl_entity *dl_se;
+	struct task_struct *p = NULL;
+
+next_node:
+	next_node = rb_next(next_node);
+	if (next_node) {
+		dl_se = rb_entry(next_node, struct sched_dl_entity, rb_node);
+		p = dl_task_of(dl_se);
+
+		if (pick_dl_task(rq, p, cpu))
+			return p;
+
+		goto next_node;
+	}
+
+	return NULL;
+}
+
+static int latest_cpu_find(struct cpumask *span,
+			   struct task_struct *task,
+			   struct cpumask *later_mask)
+{
+	const struct sched_dl_entity *dl_se = &task->dl;
+	int cpu, found = -1, best = 0;
+	u64 max_dl = 0;
+
+	for_each_cpu(cpu, span) {
+		struct rq *rq = cpu_rq(cpu);
+		struct dl_rq *dl_rq = &rq->dl;
+
+		if (cpumask_test_cpu(cpu, &task->cpus_allowed) &&
+		    (!dl_rq->dl_nr_running || dl_time_before(dl_se->deadline,
+		     dl_rq->earliest_dl.curr))) {
+			if (later_mask)
+				cpumask_set_cpu(cpu, later_mask);
+			if (!best && !dl_rq->dl_nr_running) {
+				best = 1;
+				found = cpu;
+			} else if (!best &&
+				   dl_time_before(max_dl,
+						  dl_rq->earliest_dl.curr)) {
+				max_dl = dl_rq->earliest_dl.curr;
+				found = cpu;
+			}
+		} else if (later_mask)
+			cpumask_clear_cpu(cpu, later_mask);
+	}
+
+	return found;
+}
+
+static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
+
+static int find_later_rq(struct task_struct *task)
+{
+	struct sched_domain *sd;
+	struct cpumask *later_mask = __get_cpu_var(local_cpu_mask_dl);
+	int this_cpu = smp_processor_id();
+	int best_cpu, cpu = task_cpu(task);
+
+	/* Make sure the mask is initialized first */
+	if (unlikely(!later_mask))
+		return -1;
+
+	if (task->nr_cpus_allowed == 1)
+		return -1;
+
+	best_cpu = latest_cpu_find(task_rq(task)->rd->span, task, later_mask);
+	if (best_cpu == -1)
+		return -1;
+
+	/*
+	 * If we are here, some target has been found,
+	 * the most suitable of which is cached in best_cpu.
+	 * That is, among the runqueues whose current tasks have
+	 * a later deadline than our task's, best_cpu is the rq
+	 * with the latest such deadline.
+	 *
+	 * Now we check how well this matches with task's
+	 * affinity and system topology.
+	 *
+	 * The last cpu where the task ran is our first
+	 * guess, since it is most likely cache-hot there.
+	 */
+	if (cpumask_test_cpu(cpu, later_mask))
+		return cpu;
+	/*
+	 * Check if this_cpu is to be skipped (i.e., it is
+	 * not in the mask) or not.
+	 */
+	if (!cpumask_test_cpu(this_cpu, later_mask))
+		this_cpu = -1;
+
+	rcu_read_lock();
+	for_each_domain(cpu, sd) {
+		if (sd->flags & SD_WAKE_AFFINE) {
+
+			/*
+			 * If possible, preempting this_cpu is
+			 * cheaper than migrating.
+			 */
+			if (this_cpu != -1 &&
+			    cpumask_test_cpu(this_cpu, sched_domain_span(sd))) {
+				rcu_read_unlock();
+				return this_cpu;
+			}
+
+			/*
+			 * Last chance: if best_cpu is valid and is
+			 * in the mask, that becomes our choice.
+			 */
+			if (best_cpu < nr_cpu_ids &&
+			    cpumask_test_cpu(best_cpu, sched_domain_span(sd))) {
+				rcu_read_unlock();
+				return best_cpu;
+			}
+		}
+	}
+	rcu_read_unlock();
+
+	/*
+	 * At this point, all our guesses failed, we just return
+	 * 'something', and let the caller sort the things out.
+	 */
+	if (this_cpu != -1)
+		return this_cpu;
+
+	cpu = cpumask_any(later_mask);
+	if (cpu < nr_cpu_ids)
+		return cpu;
+
+	return -1;
+}
+
+/* Locks the rq it finds */
+static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
+{
+	struct rq *later_rq = NULL;
+	int tries;
+	int cpu;
+
+	for (tries = 0; tries < DL_MAX_TRIES; tries++) {
+		cpu = find_later_rq(task);
+
+		if ((cpu == -1) || (cpu == rq->cpu))
+			break;
+
+		later_rq = cpu_rq(cpu);
+
+		/* Retry if something changed. */
+		if (double_lock_balance(rq, later_rq)) {
+			if (unlikely(task_rq(task) != rq ||
+				     !cpumask_test_cpu(later_rq->cpu,
+				                       &task->cpus_allowed) ||
+				     task_running(rq, task) || !task->on_rq)) {
+				double_unlock_balance(rq, later_rq);
+				later_rq = NULL;
+				break;
+			}
+		}
+
+		/*
+		 * If the rq we found has no -deadline task, or
+		 * its earliest one has a later deadline than our
+		 * task, the rq is a good one.
+		 */
+		if (!later_rq->dl.dl_nr_running ||
+		    dl_time_before(task->dl.deadline,
+				   later_rq->dl.earliest_dl.curr))
+			break;
+
+		/* Otherwise we try again. */
+		double_unlock_balance(rq, later_rq);
+		later_rq = NULL;
+	}
+
+	return later_rq;
+}
+
+static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
+{
+	struct task_struct *p;
+
+	if (!has_pushable_dl_tasks(rq))
+		return NULL;
+
+	p = rb_entry(rq->dl.pushable_dl_tasks_leftmost,
+		     struct task_struct, pushable_dl_tasks);
+
+	BUG_ON(rq->cpu != task_cpu(p));
+	BUG_ON(task_current(rq, p));
+	BUG_ON(p->nr_cpus_allowed <= 1);
+
+	BUG_ON(!p->se.on_rq);
+	BUG_ON(!dl_task(p));
+
+	return p;
+}
+
+/*
+ * See if the non-running -deadline tasks on this rq
+ * can be sent to some other CPU where they can preempt
+ * and start executing.
+ */
+static int push_dl_task(struct rq *rq)
+{
+	struct task_struct *next_task;
+	struct rq *later_rq;
+
+	if (!rq->dl.overloaded)
+		return 0;
+
+	next_task = pick_next_pushable_dl_task(rq);
+	if (!next_task)
+		return 0;
+
+retry:
+	if (unlikely(next_task == rq->curr)) {
+		WARN_ON(1);
+		return 0;
+	}
+
+	/*
+	 * If next_task preempts rq->curr, and rq->curr
+	 * can move away, it makes sense to just reschedule
+	 * without going further in pushing next_task.
+	 */
+	if (dl_task(rq->curr) &&
+	    dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
+	    rq->curr->nr_cpus_allowed > 1) {
+		resched_task(rq->curr);
+		return 0;
+	}
+
+	/* We might release rq lock */
+	get_task_struct(next_task);
+
+	/* Will lock the rq it'll find */
+	later_rq = find_lock_later_rq(next_task, rq);
+	if (!later_rq) {
+		struct task_struct *task;
+
+		/*
+		 * We must check all this again, since
+		 * find_lock_later_rq releases rq->lock and it is
+		 * then possible that next_task has migrated.
+		 */
+		task = pick_next_pushable_dl_task(rq);
+		if (task_cpu(next_task) == rq->cpu && task == next_task) {
+			/*
+			 * The task is still there. We don't try
+			 * again, some other cpu will pull it when ready.
+			 */
+			dequeue_pushable_dl_task(rq, next_task);
+			goto out;
+		}
+
+		if (!task)
+			/* No more tasks */
+			goto out;
+
+		put_task_struct(next_task);
+		next_task = task;
+		goto retry;
+	}
+
+	deactivate_task(rq, next_task, 0);
+	set_task_cpu(next_task, later_rq->cpu);
+	activate_task(later_rq, next_task, 0);
+
+	resched_task(later_rq->curr);
+
+	double_unlock_balance(rq, later_rq);
+
+out:
+	put_task_struct(next_task);
+
+	return 1;
+}
+
+static void push_dl_tasks(struct rq *rq)
+{
+	/* Terminates as it moves a -deadline task */
+	while (push_dl_task(rq))
+		;
+}
+
+static int pull_dl_task(struct rq *this_rq)
+{
+	int this_cpu = this_rq->cpu, ret = 0, cpu;
+	struct task_struct *p;
+	struct rq *src_rq;
+	u64 dmin = LONG_MAX;
+
+	if (likely(!dl_overloaded(this_rq)))
+		return 0;
+
+	/*
+	 * Match the barrier from dl_set_overload(); this guarantees that if we
+	 * see overloaded we must also see the dlo_mask bit.
+	 */
+	smp_rmb();
+
+	for_each_cpu(cpu, this_rq->rd->dlo_mask) {
+		if (this_cpu == cpu)
+			continue;
+
+		src_rq = cpu_rq(cpu);
+
+		/*
+		 * It looks racy, and it is! However, as in sched_rt.c,
+		 * we are fine with this.
+		 */
+		if (this_rq->dl.dl_nr_running &&
+		    dl_time_before(this_rq->dl.earliest_dl.curr,
+				   src_rq->dl.earliest_dl.next))
+			continue;
+
+		/* Might drop this_rq->lock */
+		double_lock_balance(this_rq, src_rq);
+
+		/*
+		 * If there are no more pullable tasks on the
+		 * rq, we're done with it.
+		 */
+		if (src_rq->dl.dl_nr_running <= 1)
+			goto skip;
+
+		p = pick_next_earliest_dl_task(src_rq, this_cpu);
+
+		/*
+		 * We found a task to be pulled if:
+		 *  - it preempts our current (if there's one),
+		 *  - it will preempt the last one we pulled (if any).
+		 */
+		if (p && dl_time_before(p->dl.deadline, dmin) &&
+		    (!this_rq->dl.dl_nr_running ||
+		     dl_time_before(p->dl.deadline,
+				    this_rq->dl.earliest_dl.curr))) {
+			WARN_ON(p == src_rq->curr);
+			WARN_ON(!p->se.on_rq);
+
+			/*
+			 * Then we pull iff p has actually an earlier
+			 * deadline than the current task of its runqueue.
+			 */
+			if (dl_time_before(p->dl.deadline,
+					   src_rq->curr->dl.deadline))
+				goto skip;
+
+			ret = 1;
+
+			deactivate_task(src_rq, p, 0);
+			set_task_cpu(p, this_cpu);
+			activate_task(this_rq, p, 0);
+			dmin = p->dl.deadline;
+
+			/* Is there any other task even earlier? */
+		}
+skip:
+		double_unlock_balance(this_rq, src_rq);
+	}
+
+	return ret;
+}
+
+static void pre_schedule_dl(struct rq *rq, struct task_struct *prev)
+{
+	/* Try to pull other tasks here */
+	if (dl_task(prev))
+		pull_dl_task(rq);
+}
+
+static void post_schedule_dl(struct rq *rq)
+{
+	push_dl_tasks(rq);
+}
+
+/*
+ * Since the task is not running and a reschedule is not going to happen
+ * anytime soon on its runqueue, we try pushing it away now.
+ */
+static void task_woken_dl(struct rq *rq, struct task_struct *p)
+{
+	if (!task_running(rq, p) &&
+	    !test_tsk_need_resched(rq->curr) &&
+	    has_pushable_dl_tasks(rq) &&
+	    p->nr_cpus_allowed > 1 &&
+	    dl_task(rq->curr) &&
+	    (rq->curr->nr_cpus_allowed < 2 ||
+	     dl_entity_preempt(&rq->curr->dl, &p->dl))) {
+		push_dl_tasks(rq);
+	}
 }
 
+static void set_cpus_allowed_dl(struct task_struct *p,
+				const struct cpumask *new_mask)
+{
+	struct rq *rq;
+	int weight;
+
+	BUG_ON(!dl_task(p));
+
+	/*
+	 * Update only if the task is actually running (i.e.,
+	 * it is on the rq AND it is not throttled).
+	 */
+	if (!on_dl_rq(&p->dl))
+		return;
+
+	weight = cpumask_weight(new_mask);
+
+	/*
+	 * Only update if the process changes whether or not it is
+	 * able to migrate.
+	 */
+	if ((p->nr_cpus_allowed > 1) == (weight > 1))
+		return;
+
+	rq = task_rq(p);
+
+	/*
+	 * The process used to be able to migrate OR it can now migrate
+	 */
+	if (weight <= 1) {
+		if (!task_current(rq, p))
+			dequeue_pushable_dl_task(rq, p);
+		BUG_ON(!rq->dl.dl_nr_migratory);
+		rq->dl.dl_nr_migratory--;
+	} else {
+		if (!task_current(rq, p))
+			enqueue_pushable_dl_task(rq, p);
+		rq->dl.dl_nr_migratory++;
+	}
+
+	update_dl_migration(&rq->dl);
+}
+
+/* Assumes rq->lock is held */
+static void rq_online_dl(struct rq *rq)
+{
+	if (rq->dl.overloaded)
+		dl_set_overload(rq);
+}
+
+/* Assumes rq->lock is held */
+static void rq_offline_dl(struct rq *rq)
+{
+	if (rq->dl.overloaded)
+		dl_clear_overload(rq);
+}
+
+void init_sched_dl_class(void)
+{
+	unsigned int i;
+
+	for_each_possible_cpu(i)
+		zalloc_cpumask_var_node(&per_cpu(local_cpu_mask_dl, i),
+					GFP_KERNEL, cpu_to_node(i));
+}
+
+#endif /* CONFIG_SMP */
+
 static void switched_from_dl(struct rq *rq, struct task_struct *p)
 {
-	if (hrtimer_active(&p->dl.dl_timer))
+	if (hrtimer_active(&p->dl.dl_timer) && !dl_policy(p->policy))
 		hrtimer_try_to_cancel(&p->dl.dl_timer);
+
+#ifdef CONFIG_SMP
+	/*
+	 * Since this might be the only -deadline task on the rq,
+	 * this is the right place to try to pull some other one
+	 * from an overloaded cpu, if any.
+	 */
+	if (!rq->dl.dl_nr_running)
+		pull_dl_task(rq);
+#endif
 }
 
+/*
+ * When switching to -deadline, we may overload the rq, then
+ * we try to push someone off, if possible.
+ */
 static void switched_to_dl(struct rq *rq, struct task_struct *p)
 {
+	int check_resched = 1;
+
 	/*
 	 * If p is throttled, don't consider the possibility
 	 * of preempting rq->curr, the check will be done right
@@ -637,26 +1508,53 @@ static void switched_to_dl(struct rq *rq
 		return;
 
 	if (!p->on_rq || rq->curr != p) {
-		if (task_has_dl_policy(rq->curr))
+#ifdef CONFIG_SMP
+		if (rq->dl.overloaded && push_dl_task(rq) && rq != task_rq(p))
+			/* Only reschedule if pushing failed */
+			check_resched = 0;
+#endif /* CONFIG_SMP */
+		if (check_resched && task_has_dl_policy(rq->curr))
 			check_preempt_curr_dl(rq, p, 0);
-		else
-			resched_task(rq->curr);
 	}
 }
 
+/*
+ * If the scheduling parameters of a -deadline task changed,
+ * a push or pull operation might be needed.
+ */
 static void prio_changed_dl(struct rq *rq, struct task_struct *p,
 			    int oldprio)
 {
-	switched_to_dl(rq, p);
-}
-
+	if (p->on_rq || rq->curr == p) {
 #ifdef CONFIG_SMP
-static int
-select_task_rq_dl(struct task_struct *p, int prev_cpu, int sd_flag, int flags)
-{
-	return task_cpu(p);
+		/*
+		 * This might be too much, but unfortunately
+		 * we don't have the old deadline value, and
+		 * we can't tell whether the task is increasing
+		 * or lowering its prio, so...
+		 */
+		if (!rq->dl.overloaded)
+			pull_dl_task(rq);
+
+		/*
+		 * If we now have an earlier deadline task than p,
+		 * then reschedule, provided p is still on this
+		 * runqueue.
+		 */
+		if (dl_time_before(rq->dl.earliest_dl.curr, p->dl.deadline) &&
+		    rq->curr == p)
+			resched_task(p);
+#else
+		/*
+		 * Again, we don't know if p has an earlier
+		 * or later deadline, so let's blindly set a
+		 * (maybe not needed) rescheduling point.
+		 */
+		resched_task(p);
+#endif /* CONFIG_SMP */
+	} else
+		switched_to_dl(rq, p);
 }
-#endif
 
 const struct sched_class dl_sched_class = {
 	.next			= &rt_sched_class,
@@ -671,6 +1569,12 @@ const struct sched_class dl_sched_class
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_dl,
+	.set_cpus_allowed       = set_cpus_allowed_dl,
+	.rq_online              = rq_online_dl,
+	.rq_offline             = rq_offline_dl,
+	.pre_schedule		= pre_schedule_dl,
+	.post_schedule		= post_schedule_dl,
+	.task_woken		= task_woken_dl,
 #endif
 
 	.set_curr_task		= set_curr_task_dl,
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1738,7 +1738,7 @@ static void task_woken_rt(struct rq *rq,
 	    !test_tsk_need_resched(rq->curr) &&
 	    has_pushable_tasks(rq) &&
 	    p->nr_cpus_allowed > 1 &&
-	    rt_task(rq->curr) &&
+	    (dl_task(rq->curr) || rt_task(rq->curr)) &&
 	    (rq->curr->nr_cpus_allowed < 2 ||
 	     rq->curr->prio <= p->prio))
 		push_rt_tasks(rq);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -384,6 +384,31 @@ struct dl_rq {
 	struct rb_node *rb_leftmost;
 
 	unsigned long dl_nr_running;
+
+#ifdef CONFIG_SMP
+	/*
+	 * Deadline values of the currently executing and the
+	 * earliest ready task on this rq. Caching these facilitates
+	 * the decision whether or not a ready but not running task
+	 * should migrate somewhere else.
+	 */
+	struct {
+		u64 curr;
+		u64 next;
+	} earliest_dl;
+
+	unsigned long dl_nr_migratory;
+	unsigned long dl_nr_total;
+	int overloaded;
+
+	/*
+	 * Tasks on this rq that can be pushed away. They are kept in
+	 * an rb-tree, ordered by tasks' deadlines, with caching
+	 * of the leftmost (earliest deadline) element.
+	 */
+	struct rb_root pushable_dl_tasks_root;
+	struct rb_node *pushable_dl_tasks_leftmost;
+#endif
 };
 
 #ifdef CONFIG_SMP
@@ -404,6 +429,13 @@ struct root_domain {
 	cpumask_var_t online;
 
 	/*
+	 * The bit corresponding to a CPU gets set here if such CPU has more
+	 * than one runnable -deadline task (as it is below for RT tasks).
+	 */
+	cpumask_var_t dlo_mask;
+	atomic_t dlo_count;
+
+	/*
 	 * The "RT overload" flag: it gets set if a CPU has more than
 	 * one runnable RT task.
 	 */
@@ -1094,6 +1126,8 @@ static inline void idle_balance(int cpu,
 extern void sysrq_sched_debug_show(void);
 extern void sched_init_granularity(void);
 extern void update_max_interval(void);
+
+extern void init_sched_dl_class(void);
 extern void init_sched_rt_class(void);
 extern void init_sched_fair_class(void);
 



^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 04/13] [PATCH 05/13] sched: SCHED_DEADLINE avg_update accounting
  2013-12-17 12:27 [PATCH 00/13] sched, deadline: patches Peter Zijlstra
                   ` (2 preceding siblings ...)
  2013-12-17 12:27 ` [PATCH 03/13] sched: SCHED_DEADLINE SMP-related data structures & logic Peter Zijlstra
@ 2013-12-17 12:27 ` Peter Zijlstra
  2013-12-17 12:27 ` [PATCH 05/13] sched: Add period support for -deadline tasks Peter Zijlstra
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2013-12-17 12:27 UTC (permalink / raw)
  To: tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang, jkacur
  Cc: Peter Zijlstra

[-- Attachment #1: 0005-sched-SCHED_DEADLINE-avg_update-accounting.patch --]
[-- Type: text/plain, Size: 1479 bytes --]

From: Dario Faggioli <raistlin@linux.it>

Make the core scheduler and load balancer aware of the load
produced by -deadline tasks, by updating the moving average
like for sched_rt.
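
Conceptually (a rough user-space sketch, illustrative names only; the
actual hook is the sched_rt_avg_update() call added below), the
execution time of -rt and now also -deadline tasks feeds a periodically
decayed per-CPU average, which the load balancer can subtract from the
total capacity when estimating what is left for fair (CFS) tasks:

#include <stdint.h>

struct cpu_rt_avg {
	uint64_t avg;		/* decayed sum of -rt/-dl runtime, in ns */
	uint64_t period;	/* averaging period, in ns		 */
};

static void rt_avg_update(struct cpu_rt_avg *a, uint64_t delta_exec)
{
	a->avg += delta_exec;	/* like sched_rt_avg_update(rq, delta_exec) */
}

/* Fraction (out of 1024) of the CPU still usable by fair tasks. */
static uint64_t fair_capacity(const struct cpu_rt_avg *a)
{
	uint64_t busy = a->avg < a->period ? a->avg : a->period;

	return ((a->period - busy) << 10) / a->period;
}

int main(void)
{
	struct cpu_rt_avg a = { 0, 1000000 };	/* 1ms averaging period	*/

	rt_avg_update(&a, 250000);		/* -deadline ran for 250us */
	return fair_capacity(&a) == 768 ? 0 : 1; /* 3/4 of 1024 left	*/
}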

Cc: bruce.ashfield@windriver.com
Cc: claudio@evidence.eu.com
Cc: darren@dvhart.com
Cc: dhaval.giani@gmail.com
Cc: fchecconi@gmail.com
Cc: fweisbec@gmail.com
Cc: harald.gustafsson@ericsson.com
Cc: hgu1972@gmail.com
Cc: insop.song@gmail.com
Cc: jkacur@redhat.com
Cc: johan.eker@ericsson.com
Cc: liming.wang@windriver.com
Cc: luca.abeni@unitn.it
Cc: michael@amarulasolutions.com
Cc: mingo@redhat.com
Cc: nicola.manica@disi.unitn.it
Cc: oleg@redhat.com
Cc: paulmck@linux.vnet.ibm.com
Cc: p.faure@akatech.ch
Cc: rostedt@goodmis.org
Cc: tglx@linutronix.de
Cc: tommaso.cucinotta@sssup.it
Cc: vincent.guittot@linaro.org
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/deadline.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index ab5deb9..e69b4e0 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -564,6 +564,8 @@ static void update_curr_dl(struct rq *rq)
 	curr->se.exec_start = rq_clock_task(rq);
 	cpuacct_charge(curr, delta_exec);
 
+	sched_rt_avg_update(rq, delta_exec);
+
 	dl_se->runtime -= delta_exec;
 	if (dl_runtime_exceeded(rq, dl_se)) {
 		__dequeue_task_dl(rq, curr, 0);
-- 
1.7.10.4




^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 05/13] sched: Add period support for -deadline tasks
  2013-12-17 12:27 [PATCH 00/13] sched, deadline: patches Peter Zijlstra
                   ` (3 preceding siblings ...)
  2013-12-17 12:27 ` [PATCH 04/13] [PATCH 05/13] sched: SCHED_DEADLINE avg_update accounting Peter Zijlstra
@ 2013-12-17 12:27 ` Peter Zijlstra
  2013-12-17 12:27 ` [PATCH 06/13] [PATCH 07/13] sched: Add latency tracing " Peter Zijlstra
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2013-12-17 12:27 UTC (permalink / raw)
  To: tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang, jkacur
  Cc: Harald Gustafsson, Peter Zijlstra

[-- Attachment #1: 0006-sched-Add-period-support-for-deadline-tasks.patch --]
[-- Type: text/plain, Size: 4692 bytes --]

From: Harald Gustafsson <harald.gustafsson@ericsson.com>

Make it possible to specify a period (different from or equal to the
deadline) for -deadline tasks. Relative deadlines (D_i) are used on
task arrivals to generate new scheduling (absolute) deadlines as "d =
t + D_i", and periods (P_i) to postpone the scheduling deadlines as "d
= d + P_i" when the budget is zero.

This is in general useful to model (and schedule) tasks that have slow
activation rates (long periods), but have to be scheduled soon once
activated (short deadlines).
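
A stand-alone sketch of the arithmetic described above (illustrative C
with hypothetical names; the in-kernel counterpart is
replenish_dl_entity(), touched below):

#include <stdint.h>

struct dl_params { uint64_t Q, D, P; };	/* runtime, rel. deadline, period */
struct dl_state  { int64_t runtime; uint64_t deadline; };

/* On a new instance arriving at @now: d = t + D_i, budget refilled. */
static void dl_arrival(struct dl_state *s, const struct dl_params *p,
		       uint64_t now)
{
	s->deadline = now + p->D;
	s->runtime  = p->Q;
}

/* When the budget is depleted: d = d + P_i, possibly more than once. */
static void dl_replenish(struct dl_state *s, const struct dl_params *p)
{
	while (s->runtime <= 0) {
		s->deadline += p->P;
		s->runtime  += p->Q;
	}
}

int main(void)
{
	/* Illustrative: 5ms budget, 20ms relative deadline, 500ms period. */
	const struct dl_params p = { 5000000, 20000000, 500000000 };
	struct dl_state s;

	dl_arrival(&s, &p, 0);		/* d = 0 + 20ms			*/
	s.runtime -= 7000000;		/* overran the budget by 2ms	*/
	dl_replenish(&s, &p);		/* d = 20ms + 500ms, 3ms budget	*/
	return (s.deadline == 520000000ULL && s.runtime == 3000000) ? 0 : 1;
}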

Cc: oleg@redhat.com
Cc: darren@dvhart.com
Cc: paulmck@linux.vnet.ibm.com
Cc: dhaval.giani@gmail.com
Cc: p.faure@akatech.ch
Cc: fchecconi@gmail.com
Cc: fweisbec@gmail.com
Cc: hgu1972@gmail.com
Cc: insop.song@gmail.com
Cc: jkacur@redhat.com
Cc: rostedt@goodmis.org
Cc: johan.eker@ericsson.com
Cc: tglx@linutronix.de
Cc: liming.wang@windriver.com
Cc: tommaso.cucinotta@sssup.it
Cc: luca.abeni@unitn.it
Cc: vincent.guittot@linaro.org
Cc: michael@amarulasolutions.com
Cc: mingo@redhat.com
Cc: bruce.ashfield@windriver.com
Cc: nicola.manica@disi.unitn.it
Cc: claudio@evidence.eu.com
Signed-off-by: Harald Gustafsson <harald.gustafsson@ericsson.com>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/sched.h   |    1 +
 kernel/sched/core.c     |   10 ++++++++--
 kernel/sched/deadline.c |   10 +++++++---
 3 files changed, 16 insertions(+), 5 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1094,6 +1094,7 @@ struct sched_dl_entity {
 	 */
 	u64 dl_runtime;		/* maximum runtime for each instance	*/
 	u64 dl_deadline;	/* relative deadline of each instance	*/
+	u64 dl_period;		/* separation of two instances (period) */
 
 	/*
 	 * Actual scheduling parameters. Initialized with the values above,
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1722,6 +1722,7 @@ static void __sched_fork(unsigned long c
 	hrtimer_init(&p->dl.dl_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	p->dl.dl_runtime = p->dl.runtime = 0;
 	p->dl.dl_deadline = p->dl.deadline = 0;
+	p->dl.dl_period = 0;
 	p->dl.flags = 0;
 
 	INIT_LIST_HEAD(&p->rt.run_list);
@@ -3042,6 +3043,7 @@ __setparam_dl(struct task_struct *p, con
 	init_dl_task_timer(dl_se);
 	dl_se->dl_runtime = attr->sched_runtime;
 	dl_se->dl_deadline = attr->sched_deadline;
+	dl_se->dl_period = attr->sched_period ?: dl_se->dl_deadline;
 	dl_se->flags = attr->sched_flags;
 	dl_se->dl_throttled = 0;
 	dl_se->dl_new = 1;
@@ -3055,19 +3057,23 @@ __getparam_dl(struct task_struct *p, str
 	attr->sched_priority = p->rt_priority;
 	attr->sched_runtime = dl_se->dl_runtime;
 	attr->sched_deadline = dl_se->dl_deadline;
+	attr->sched_period = dl_se->dl_period;
 	attr->sched_flags = dl_se->flags;
 }
 
 /*
  * This function validates the new parameters of a -deadline task.
  * We ask for the deadline not being zero, and greater or equal
- * than the runtime.
+ * than the runtime, as well as the period being either zero or
+ * greater than or equal to the deadline.
  */
 static bool
 __checkparam_dl(const struct sched_attr *attr)
 {
 	return attr && attr->sched_deadline != 0 &&
-	       (s64)(attr->sched_deadline - attr->sched_runtime) >= 0;
+		(attr->sched_period == 0 ||
+		(s64)(attr->sched_period   - attr->sched_deadline) >= 0) &&
+		(s64)(attr->sched_deadline - attr->sched_runtime ) >= 0;
 }
 
 /*
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -289,7 +289,7 @@ static void replenish_dl_entity(struct s
 	 * arbitrary large.
 	 */
 	while (dl_se->runtime <= 0) {
-		dl_se->deadline += dl_se->dl_deadline;
+		dl_se->deadline += dl_se->dl_period;
 		dl_se->runtime += dl_se->dl_runtime;
 	}
 
@@ -329,9 +329,13 @@ static void replenish_dl_entity(struct s
  *
  * This function returns true if:
  *
- *   runtime / (deadline - t) > dl_runtime / dl_deadline ,
+ *   runtime / (deadline - t) > dl_runtime / dl_period ,
  *
  * IOW we can't recycle current parameters.
+ *
+ * Notice that the bandwidth check is done against the period. For
+ * tasks with deadline equal to period this is the same as using
+ * dl_deadline instead of dl_period in the equation above.
  */
 static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
 {
@@ -355,7 +359,7 @@ static bool dl_entity_overflow(struct sc
 	 * of anything below microseconds resolution is actually fiction
 	 * (but still we want to give the user that illusion >;).
 	 */
-	left = (dl_se->dl_deadline >> 10) * (dl_se->runtime >> 10);
+	left = (dl_se->dl_period >> 10) * (dl_se->runtime >> 10);
 	right = ((dl_se->deadline - t) >> 10) * (dl_se->dl_runtime >> 10);
 
 	return dl_time_before(right, left);
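
For readability outside of the diff, the bandwidth test above can be
written as a stand-alone sketch (illustrative user-space C, not the
kernel code): divisions are avoided by cross-multiplying after dropping
10 bits of (sub-microsecond) resolution, so the products fit in 64 bits.

#include <stdbool.h>
#include <stdint.h>

/* True if runtime / (deadline - t) > dl_runtime / dl_period, i.e. the
 * current parameters cannot be reused and a full renewal is needed. */
static bool dl_entity_overflow_sketch(uint64_t runtime, uint64_t deadline,
				      uint64_t t, uint64_t dl_runtime,
				      uint64_t dl_period)
{
	uint64_t left  = (dl_period >> 10) * (runtime >> 10);
	uint64_t right = ((deadline - t) >> 10) * (dl_runtime >> 10);

	return (int64_t)(right - left) < 0;
}

int main(void)
{
	/* 4ms left, 5ms to the deadline, but only 5ms reserved per 100ms. */
	return dl_entity_overflow_sketch(4000000, 5000000, 0,
					 5000000, 100000000) ? 0 : 1;
}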



^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 06/13] [PATCH 07/13] sched: Add latency tracing for -deadline tasks
  2013-12-17 12:27 [PATCH 00/13] sched, deadline: patches Peter Zijlstra
                   ` (4 preceding siblings ...)
  2013-12-17 12:27 ` [PATCH 05/13] sched: Add period support for -deadline tasks Peter Zijlstra
@ 2013-12-17 12:27 ` Peter Zijlstra
  2013-12-17 12:27 ` [PATCH 07/13] rtmutex: Turn the plist into an rb-tree Peter Zijlstra
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2013-12-17 12:27 UTC (permalink / raw)
  To: tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang, jkacur
  Cc: Peter Zijlstra

[-- Attachment #1: 0007-sched-Add-latency-tracing-for-deadline-tasks.patch --]
[-- Type: text/plain, Size: 7944 bytes --]

From: Dario Faggioli <raistlin@linux.it>

It is very likely that systems that want/need to use the new
SCHED_DEADLINE policy also want to have the scheduling latency of
their -deadline tasks under control.

For this reason a new version of the wakeup latency tracer, called
"wakeup_dl", is introduced.

As a consequence of applying this patch there will be three wakeup
latency tracers:
 * "wakeup", that deals with all tasks in the system;
 * "wakeup_rt", that deals with -rt and -deadline tasks only;
 * "wakeup_dl", that deals with -deadline tasks only.

Cc: bruce.ashfield@windriver.com
Cc: claudio@evidence.eu.com
Cc: darren@dvhart.com
Cc: dhaval.giani@gmail.com
Cc: fchecconi@gmail.com
Cc: fweisbec@gmail.com
Cc: harald.gustafsson@ericsson.com
Cc: hgu1972@gmail.com
Cc: insop.song@gmail.com
Cc: jkacur@redhat.com
Cc: johan.eker@ericsson.com
Cc: liming.wang@windriver.com
Cc: luca.abeni@unitn.it
Cc: michael@amarulasolutions.com
Cc: mingo@redhat.com
Cc: nicola.manica@disi.unitn.it
Cc: oleg@redhat.com
Cc: paulmck@linux.vnet.ibm.com
Cc: p.faure@akatech.ch
Cc: rostedt@goodmis.org
Cc: tglx@linutronix.de
Cc: tommaso.cucinotta@sssup.it
Cc: vincent.guittot@linaro.org
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/trace/trace_sched_wakeup.c |   64 ++++++++++++++++++++++++++++++++++---
 kernel/trace/trace_selftest.c     |   32 +++++++++++--------
 2 files changed, 78 insertions(+), 18 deletions(-)

diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index fee77e1..090c4d9 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -27,6 +27,8 @@ static int			wakeup_cpu;
 static int			wakeup_current_cpu;
 static unsigned			wakeup_prio = -1;
 static int			wakeup_rt;
+static int			wakeup_dl;
+static int			tracing_dl = 0;
 
 static arch_spinlock_t wakeup_lock =
 	(arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
@@ -437,6 +439,7 @@ static void __wakeup_reset(struct trace_array *tr)
 {
 	wakeup_cpu = -1;
 	wakeup_prio = -1;
+	tracing_dl = 0;
 
 	if (wakeup_task)
 		put_task_struct(wakeup_task);
@@ -472,9 +475,17 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
 	tracing_record_cmdline(p);
 	tracing_record_cmdline(current);
 
-	if ((wakeup_rt && !rt_task(p)) ||
-			p->prio >= wakeup_prio ||
-			p->prio >= current->prio)
+	/*
+	 * Semantic is like this:
+	 *  - wakeup tracer handles all tasks in the system, independently
+	 *    from their scheduling class;
+	 *  - wakeup_rt tracer handles tasks belonging to sched_dl and
+	 *    sched_rt class;
+	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
+	 */
+	if (tracing_dl || (wakeup_dl && !dl_task(p)) ||
+	    (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
+	    (!dl_task(p) && (p->prio >= wakeup_prio || p->prio >= current->prio)))
 		return;
 
 	pc = preempt_count();
@@ -486,7 +497,8 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
 	arch_spin_lock(&wakeup_lock);
 
 	/* check for races. */
-	if (!tracer_enabled || p->prio >= wakeup_prio)
+	if (!tracer_enabled || tracing_dl ||
+	    (!dl_task(p) && p->prio >= wakeup_prio))
 		goto out_locked;
 
 	/* reset the trace */
@@ -496,6 +508,15 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
 	wakeup_current_cpu = wakeup_cpu;
 	wakeup_prio = p->prio;
 
+	/*
+	 * Once you start tracing a -deadline task, don't bother tracing
+	 * another task until the first one wakes up.
+	 */
+	if (dl_task(p))
+		tracing_dl = 1;
+	else
+		tracing_dl = 0;
+
 	wakeup_task = p;
 	get_task_struct(wakeup_task);
 
@@ -597,16 +618,25 @@ static int __wakeup_tracer_init(struct trace_array *tr)
 
 static int wakeup_tracer_init(struct trace_array *tr)
 {
+	wakeup_dl = 0;
 	wakeup_rt = 0;
 	return __wakeup_tracer_init(tr);
 }
 
 static int wakeup_rt_tracer_init(struct trace_array *tr)
 {
+	wakeup_dl = 0;
 	wakeup_rt = 1;
 	return __wakeup_tracer_init(tr);
 }
 
+static int wakeup_dl_tracer_init(struct trace_array *tr)
+{
+	wakeup_dl = 1;
+	wakeup_rt = 0;
+	return __wakeup_tracer_init(tr);
+}
+
 static void wakeup_tracer_reset(struct trace_array *tr)
 {
 	int lat_flag = save_flags & TRACE_ITER_LATENCY_FMT;
@@ -674,6 +704,28 @@ static struct tracer wakeup_rt_tracer __read_mostly =
 	.use_max_tr	= true,
 };
 
+static struct tracer wakeup_dl_tracer __read_mostly =
+{
+	.name		= "wakeup_dl",
+	.init		= wakeup_dl_tracer_init,
+	.reset		= wakeup_tracer_reset,
+	.start		= wakeup_tracer_start,
+	.stop		= wakeup_tracer_stop,
+	.wait_pipe	= poll_wait_pipe,
+	.print_max	= true,
+	.print_header	= wakeup_print_header,
+	.print_line	= wakeup_print_line,
+	.flags		= &tracer_flags,
+	.set_flag	= wakeup_set_flag,
+	.flag_changed	= wakeup_flag_changed,
+#ifdef CONFIG_FTRACE_SELFTEST
+	.selftest    = trace_selftest_startup_wakeup,
+#endif
+	.open		= wakeup_trace_open,
+	.close		= wakeup_trace_close,
+	.use_max_tr	= true,
+};
+
 __init static int init_wakeup_tracer(void)
 {
 	int ret;
@@ -686,6 +738,10 @@ __init static int init_wakeup_tracer(void)
 	if (ret)
 		return ret;
 
+	ret = register_tracer(&wakeup_dl_tracer);
+	if (ret)
+		return ret;
+
 	return 0;
 }
 core_initcall(init_wakeup_tracer);
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index a7329b7..542de6a 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -1022,11 +1022,15 @@ trace_selftest_startup_nop(struct tracer *trace, struct trace_array *tr)
 #ifdef CONFIG_SCHED_TRACER
 static int trace_wakeup_test_thread(void *data)
 {
-	/* Make this a RT thread, doesn't need to be too high */
-	static const struct sched_param param = { .sched_priority = 5 };
+	/* Make this a -deadline thread */
+	static const struct sched_attr attr = {
+		.sched_runtime = 100000ULL,
+		.sched_deadline = 10000000ULL,
+		.sched_period = 10000000ULL
+	};
 	struct completion *x = data;
 
-	sched_setscheduler(current, SCHED_FIFO, &param);
+	sched_setscheduler2(current, SCHED_DEADLINE, &attr);
 
 	/* Make it know we have a new prio */
 	complete(x);
@@ -1040,8 +1044,8 @@ static int trace_wakeup_test_thread(void *data)
 	/* we are awake, now wait to disappear */
 	while (!kthread_should_stop()) {
 		/*
-		 * This is an RT task, do short sleeps to let
-		 * others run.
+		 * This will likely be the system's top priority
+		 * task, do short sleeps to let others run.
 		 */
 		msleep(100);
 	}
@@ -1054,21 +1058,21 @@ trace_selftest_startup_wakeup(struct tracer *trace, struct trace_array *tr)
 {
 	unsigned long save_max = tracing_max_latency;
 	struct task_struct *p;
-	struct completion isrt;
+	struct completion is_ready;
 	unsigned long count;
 	int ret;
 
-	init_completion(&isrt);
+	init_completion(&is_ready);
 
-	/* create a high prio thread */
-	p = kthread_run(trace_wakeup_test_thread, &isrt, "ftrace-test");
+	/* create a -deadline thread */
+	p = kthread_run(trace_wakeup_test_thread, &is_ready, "ftrace-test");
 	if (IS_ERR(p)) {
 		printk(KERN_CONT "Failed to create ftrace wakeup test thread ");
 		return -1;
 	}
 
-	/* make sure the thread is running at an RT prio */
-	wait_for_completion(&isrt);
+	/* make sure the thread is running at -deadline policy */
+	wait_for_completion(&is_ready);
 
 	/* start the tracing */
 	ret = tracer_init(trace, tr);
@@ -1082,19 +1086,19 @@ trace_selftest_startup_wakeup(struct tracer *trace, struct trace_array *tr)
 
 	while (p->on_rq) {
 		/*
-		 * Sleep to make sure the RT thread is asleep too.
+		 * Sleep to make sure the -deadline thread is asleep too.
 		 * On virtual machines we can't rely on timings,
 		 * but we want to make sure this test still works.
 		 */
 		msleep(100);
 	}
 
-	init_completion(&isrt);
+	init_completion(&is_ready);
 
 	wake_up_process(p);
 
 	/* Wait for the task to wake up */
-	wait_for_completion(&isrt);
+	wait_for_completion(&is_ready);
 
 	/* stop the tracing. */
 	tracing_stop();
-- 
1.7.10.4




^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 07/13] rtmutex: Turn the plist into an rb-tree
  2013-12-17 12:27 [PATCH 00/13] sched, deadline: patches Peter Zijlstra
                   ` (5 preceding siblings ...)
  2013-12-17 12:27 ` [PATCH 06/13] [PATCH 07/13] sched: Add latency tracing " Peter Zijlstra
@ 2013-12-17 12:27 ` Peter Zijlstra
  2013-12-17 12:27 ` [PATCH 08/13] sched: Drafted deadline inheritance logic Peter Zijlstra
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2013-12-17 12:27 UTC (permalink / raw)
  To: tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang, jkacur
  Cc: Peter Zijlstra

[-- Attachment #1: 0008-rtmutex-Turn-the-plist-into-an-rb-tree.patch --]
[-- Type: text/plain, Size: 18322 bytes --]

Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
and provide a proper comparison function for -deadline and
-priority tasks.

This is done mainly because:
 - the classical prio field of the plist is just an int, which might
   not be enough for representing a deadline;
 - manipulating such a list would become O(nr_deadline_tasks),
   which might be too much, as the number of -deadline tasks increases.

Therefore, an rb-tree is used, and tasks are queued in it according
to the following logic:
 - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
   one with the higher (lower, actually!) prio wins;
 - among a -priority and a -deadline task, the latter always wins;
 - among two -deadline tasks, the one with the earliest deadline
   wins.

Queueing and dequeueing functions are changed accordingly, for both
the list of a task's pi-waiters and the list of tasks blocked on
a pi-lock.
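
The ordering rules above boil down to a comparison like the following
stand-alone sketch (illustrative user-space C with hypothetical names;
the patch implements the equivalent inside the rt_mutex code):

#include <stdbool.h>
#include <stdint.h>

struct waiter {
	bool dl;		/* SCHED_DEADLINE task?				*/
	int prio;		/* classical prio, lower is more important	*/
	uint64_t deadline;	/* absolute deadline, valid when dl is true	*/
};

/* True if @a should be queued before @b. */
static bool waiter_less(const struct waiter *a, const struct waiter *b)
{
	if (a->dl != b->dl)
		return a->dl;		/* a -deadline waiter always wins */
	if (a->dl)			/* earliest deadline first	  */
		return (int64_t)(a->deadline - b->deadline) < 0;
	return a->prio < b->prio;	/* lower prio value wins	  */
}

int main(void)
{
	const struct waiter dl_w = { true, -1, 1000 };
	const struct waiter rt_w = { false, 10, 0 };

	return waiter_less(&dl_w, &rt_w) ? 0 : 1;
}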

Cc: nicola.manica@disi.unitn.it
Cc: darren@dvhart.com
Cc: oleg@redhat.com
Cc: dhaval.giani@gmail.com
Cc: paulmck@linux.vnet.ibm.com
Cc: fchecconi@gmail.com
Cc: fweisbec@gmail.com
Cc: harald.gustafsson@ericsson.com
Cc: hgu1972@gmail.com
Cc: insop.song@gmail.com
Cc: p.faure@akatech.ch
Cc: jkacur@redhat.com
Cc: rostedt@goodmis.org
Cc: johan.eker@ericsson.com
Cc: tglx@linutronix.de
Cc: liming.wang@windriver.com
Cc: tommaso.cucinotta@sssup.it
Cc: luca.abeni@unitn.it
Cc: vincent.guittot@linaro.org
Cc: michael@amarulasolutions.com
Cc: bruce.ashfield@windriver.com
Cc: mingo@redhat.com
Cc: claudio@evidence.eu.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/init_task.h       |   10 ++
 include/linux/rtmutex.h         |   18 +---
 include/linux/sched.h           |    4 -
 kernel/fork.c                   |    3 
 kernel/futex.c                  |    2 
 kernel/locking/rtmutex-debug.c  |    8 --
 kernel/locking/rtmutex.c        |  151 ++++++++++++++++++++++++++++++++--------
 kernel/locking/rtmutex_common.h |   22 ++---
 kernel/sched/core.c             |    4 -
 9 files changed, 157 insertions(+), 65 deletions(-)

--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -11,6 +11,7 @@
 #include <linux/user_namespace.h>
 #include <linux/securebits.h>
 #include <linux/seqlock.h>
+#include <linux/rbtree.h>
 #include <net/net_namespace.h>
 #include <linux/sched/rt.h>
 
@@ -154,6 +155,14 @@ extern struct task_group root_task_group
 
 #define INIT_TASK_COMM "swapper"
 
+#ifdef CONFIG_RT_MUTEXES
+# define INIT_RT_MUTEXES(tsk)						\
+	.pi_waiters = RB_ROOT,						\
+	.pi_waiters_leftmost = NULL,
+#else
+# define INIT_RT_MUTEXES(tsk)
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -221,6 +230,7 @@ extern struct task_group root_task_group
 	INIT_TRACE_RECURSION						\
 	INIT_TASK_RCU_PREEMPT(tsk)					\
 	INIT_CPUSET_SEQ(tsk)						\
+	INIT_RT_MUTEXES(tsk)						\
 	INIT_VTIME(tsk)							\
 }
 
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -13,7 +13,7 @@
 #define __LINUX_RT_MUTEX_H
 
 #include <linux/linkage.h>
-#include <linux/plist.h>
+#include <linux/rbtree.h>
 #include <linux/spinlock_types.h>
 
 extern int max_lock_depth; /* for sysctl */
@@ -22,12 +22,14 @@ extern int max_lock_depth; /* for sysctl
  * The rt_mutex structure
  *
  * @wait_lock:	spinlock to protect the structure
- * @wait_list:	pilist head to enqueue waiters in priority order
+ * @waiters:	rbtree root to enqueue waiters in priority order
+ * @waiters_leftmost: top waiter
  * @owner:	the mutex owner
  */
 struct rt_mutex {
 	raw_spinlock_t		wait_lock;
-	struct plist_head	wait_list;
+	struct rb_root          waiters;
+	struct rb_node          *waiters_leftmost;
 	struct task_struct	*owner;
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 	int			save_state;
@@ -66,7 +68,7 @@ struct hrtimer_sleeper;
 
 #define __RT_MUTEX_INITIALIZER(mutexname) \
 	{ .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(mutexname.wait_lock) \
-	, .wait_list = PLIST_HEAD_INIT(mutexname.wait_list) \
+	, .waiters = RB_ROOT \
 	, .owner = NULL \
 	__DEBUG_RT_MUTEX_INITIALIZER(mutexname)}
 
@@ -98,12 +100,4 @@ extern int rt_mutex_trylock(struct rt_mu
 
 extern void rt_mutex_unlock(struct rt_mutex *lock);
 
-#ifdef CONFIG_RT_MUTEXES
-# define INIT_RT_MUTEXES(tsk)						\
-	.pi_waiters	= PLIST_HEAD_INIT(tsk.pi_waiters),	\
-	INIT_RT_MUTEX_DEBUG(tsk)
-#else
-# define INIT_RT_MUTEXES(tsk)
-#endif
-
 #endif
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -16,6 +16,7 @@ struct sched_param {
 #include <linux/types.h>
 #include <linux/timex.h>
 #include <linux/jiffies.h>
+#include <linux/plist.h>
 #include <linux/rbtree.h>
 #include <linux/thread_info.h>
 #include <linux/cpumask.h>
@@ -1346,7 +1347,8 @@ struct task_struct {
 
 #ifdef CONFIG_RT_MUTEXES
 	/* PI waiters blocked on a rt_mutex held by this task */
-	struct plist_head pi_waiters;
+	struct rb_root pi_waiters;
+	struct rb_node *pi_waiters_leftmost;
 	/* Deadlock detection and priority inheritance handling */
 	struct rt_mutex_waiter *pi_blocked_on;
 #endif
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1086,7 +1086,8 @@ static void rt_mutex_init_task(struct ta
 {
 	raw_spin_lock_init(&p->pi_lock);
 #ifdef CONFIG_RT_MUTEXES
-	plist_head_init(&p->pi_waiters);
+	p->pi_waiters = RB_ROOT;
+	p->pi_waiters_leftmost = NULL;
 	p->pi_blocked_on = NULL;
 #endif
 }
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2315,6 +2315,8 @@ static int futex_wait_requeue_pi(u32 __u
 	 * code while we sleep on uaddr.
 	 */
 	debug_rt_mutex_init_waiter(&rt_waiter);
+	RB_CLEAR_NODE(&rt_waiter.pi_tree_entry);
+	RB_CLEAR_NODE(&rt_waiter.tree_entry);
 	rt_waiter.task = NULL;
 
 	ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
--- a/kernel/locking/rtmutex-debug.c
+++ b/kernel/locking/rtmutex-debug.c
@@ -24,7 +24,7 @@
 #include <linux/kallsyms.h>
 #include <linux/syscalls.h>
 #include <linux/interrupt.h>
-#include <linux/plist.h>
+#include <linux/rbtree.h>
 #include <linux/fs.h>
 #include <linux/debug_locks.h>
 
@@ -57,7 +57,7 @@ static void printk_lock(struct rt_mutex
 
 void rt_mutex_debug_task_free(struct task_struct *task)
 {
-	DEBUG_LOCKS_WARN_ON(!plist_head_empty(&task->pi_waiters));
+	DEBUG_LOCKS_WARN_ON(!RB_EMPTY_ROOT(&task->pi_waiters));
 	DEBUG_LOCKS_WARN_ON(task->pi_blocked_on);
 }
 
@@ -154,16 +154,12 @@ void debug_rt_mutex_proxy_unlock(struct
 void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
 {
 	memset(waiter, 0x11, sizeof(*waiter));
-	plist_node_init(&waiter->list_entry, MAX_PRIO);
-	plist_node_init(&waiter->pi_list_entry, MAX_PRIO);
 	waiter->deadlock_task_pid = NULL;
 }
 
 void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter)
 {
 	put_pid(waiter->deadlock_task_pid);
-	DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->list_entry));
-	DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
 	memset(waiter, 0x22, sizeof(*waiter));
 }
 
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -14,6 +14,7 @@
 #include <linux/export.h>
 #include <linux/sched.h>
 #include <linux/sched/rt.h>
+#include <linux/sched/deadline.h>
 #include <linux/timer.h>
 
 #include "rtmutex_common.h"
@@ -91,10 +92,104 @@ static inline void mark_rt_mutex_waiters
 }
 #endif
 
+static inline int
+rt_mutex_waiter_less(struct rt_mutex_waiter *left,
+		     struct rt_mutex_waiter *right)
+{
+	if (left->task->prio < right->task->prio)
+		return 1;
+
+	/*
+	 * If both tasks are dl_task(), we check their deadlines.
+	 */
+	if (dl_prio(left->task->prio) && dl_prio(right->task->prio))
+		return (left->task->dl.deadline < right->task->dl.deadline);
+
+	return 0;
+}
+
+static void
+rt_mutex_enqueue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
+{
+	struct rb_node **link = &lock->waiters.rb_node;
+	struct rb_node *parent = NULL;
+	struct rt_mutex_waiter *entry;
+	int leftmost = 1;
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct rt_mutex_waiter, tree_entry);
+		if (rt_mutex_waiter_less(waiter, entry)) {
+			link = &parent->rb_left;
+		} else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		lock->waiters_leftmost = &waiter->tree_entry;
+
+	rb_link_node(&waiter->tree_entry, parent, link);
+	rb_insert_color(&waiter->tree_entry, &lock->waiters);
+}
+
+static void
+rt_mutex_dequeue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
+{
+	if (RB_EMPTY_NODE(&waiter->tree_entry))
+		return;
+
+	if (lock->waiters_leftmost == &waiter->tree_entry)
+		lock->waiters_leftmost = rb_next(&waiter->tree_entry);
+
+	rb_erase(&waiter->tree_entry, &lock->waiters);
+	RB_CLEAR_NODE(&waiter->tree_entry);
+}
+
+static void
+rt_mutex_enqueue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
+{
+	struct rb_node **link = &task->pi_waiters.rb_node;
+	struct rb_node *parent = NULL;
+	struct rt_mutex_waiter *entry;
+	int leftmost = 1;
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct rt_mutex_waiter, pi_tree_entry);
+		if (rt_mutex_waiter_less(waiter, entry)) {
+			link = &parent->rb_left;
+		} else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		task->pi_waiters_leftmost = &waiter->pi_tree_entry;
+
+	rb_link_node(&waiter->pi_tree_entry, parent, link);
+	rb_insert_color(&waiter->pi_tree_entry, &task->pi_waiters);
+}
+
+static void
+rt_mutex_dequeue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
+{
+	if (RB_EMPTY_NODE(&waiter->pi_tree_entry))
+		return;
+
+	if (task->pi_waiters_leftmost == &waiter->pi_tree_entry)
+		task->pi_waiters_leftmost = rb_next(&waiter->pi_tree_entry);
+
+	rb_erase(&waiter->pi_tree_entry, &task->pi_waiters);
+	RB_CLEAR_NODE(&waiter->pi_tree_entry);
+}
+
 /*
- * Calculate task priority from the waiter list priority
+ * Calculate task priority from the waiter tree priority
  *
- * Return task->normal_prio when the waiter list is empty or when
+ * Return task->normal_prio when the waiter tree is empty or when
  * the waiter is not allowed to do priority boosting
  */
 int rt_mutex_getprio(struct task_struct *task)
@@ -102,7 +197,7 @@ int rt_mutex_getprio(struct task_struct
 	if (likely(!task_has_pi_waiters(task)))
 		return task->normal_prio;
 
-	return min(task_top_pi_waiter(task)->pi_list_entry.prio,
+	return min(task_top_pi_waiter(task)->task->prio,
 		   task->normal_prio);
 }
 
@@ -233,7 +328,7 @@ static int rt_mutex_adjust_prio_chain(st
 	 * When deadlock detection is off then we check, if further
 	 * priority adjustment is necessary.
 	 */
-	if (!detect_deadlock && waiter->list_entry.prio == task->prio)
+	if (!detect_deadlock && waiter->task->prio == task->prio)
 		goto out_unlock_pi;
 
 	lock = waiter->lock;
@@ -254,9 +349,9 @@ static int rt_mutex_adjust_prio_chain(st
 	top_waiter = rt_mutex_top_waiter(lock);
 
 	/* Requeue the waiter */
-	plist_del(&waiter->list_entry, &lock->wait_list);
-	waiter->list_entry.prio = task->prio;
-	plist_add(&waiter->list_entry, &lock->wait_list);
+	rt_mutex_dequeue(lock, waiter);
+	waiter->task->prio = task->prio;
+	rt_mutex_enqueue(lock, waiter);
 
 	/* Release the task */
 	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
@@ -280,17 +375,15 @@ static int rt_mutex_adjust_prio_chain(st
 
 	if (waiter == rt_mutex_top_waiter(lock)) {
 		/* Boost the owner */
-		plist_del(&top_waiter->pi_list_entry, &task->pi_waiters);
-		waiter->pi_list_entry.prio = waiter->list_entry.prio;
-		plist_add(&waiter->pi_list_entry, &task->pi_waiters);
+		rt_mutex_dequeue_pi(task, top_waiter);
+		rt_mutex_enqueue_pi(task, waiter);
 		__rt_mutex_adjust_prio(task);
 
 	} else if (top_waiter == waiter) {
 		/* Deboost the owner */
-		plist_del(&waiter->pi_list_entry, &task->pi_waiters);
+		rt_mutex_dequeue_pi(task, waiter);
 		waiter = rt_mutex_top_waiter(lock);
-		waiter->pi_list_entry.prio = waiter->list_entry.prio;
-		plist_add(&waiter->pi_list_entry, &task->pi_waiters);
+		rt_mutex_enqueue_pi(task, waiter);
 		__rt_mutex_adjust_prio(task);
 	}
 
@@ -355,7 +448,7 @@ static int try_to_take_rt_mutex(struct r
 	 * 3) it is top waiter
 	 */
 	if (rt_mutex_has_waiters(lock)) {
-		if (task->prio >= rt_mutex_top_waiter(lock)->list_entry.prio) {
+		if (task->prio >= rt_mutex_top_waiter(lock)->task->prio) {
 			if (!waiter || waiter != rt_mutex_top_waiter(lock))
 				return 0;
 		}
@@ -369,7 +462,7 @@ static int try_to_take_rt_mutex(struct r
 
 		/* remove the queued waiter. */
 		if (waiter) {
-			plist_del(&waiter->list_entry, &lock->wait_list);
+			rt_mutex_dequeue(lock, waiter);
 			task->pi_blocked_on = NULL;
 		}
 
@@ -379,8 +472,7 @@ static int try_to_take_rt_mutex(struct r
 		 */
 		if (rt_mutex_has_waiters(lock)) {
 			top = rt_mutex_top_waiter(lock);
-			top->pi_list_entry.prio = top->list_entry.prio;
-			plist_add(&top->pi_list_entry, &task->pi_waiters);
+			rt_mutex_enqueue_pi(task, top);
 		}
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 	}
@@ -416,13 +508,11 @@ static int task_blocks_on_rt_mutex(struc
 	__rt_mutex_adjust_prio(task);
 	waiter->task = task;
 	waiter->lock = lock;
-	plist_node_init(&waiter->list_entry, task->prio);
-	plist_node_init(&waiter->pi_list_entry, task->prio);
 
 	/* Get the top priority waiter on the lock */
 	if (rt_mutex_has_waiters(lock))
 		top_waiter = rt_mutex_top_waiter(lock);
-	plist_add(&waiter->list_entry, &lock->wait_list);
+	rt_mutex_enqueue(lock, waiter);
 
 	task->pi_blocked_on = waiter;
 
@@ -433,8 +523,8 @@ static int task_blocks_on_rt_mutex(struc
 
 	if (waiter == rt_mutex_top_waiter(lock)) {
 		raw_spin_lock_irqsave(&owner->pi_lock, flags);
-		plist_del(&top_waiter->pi_list_entry, &owner->pi_waiters);
-		plist_add(&waiter->pi_list_entry, &owner->pi_waiters);
+		rt_mutex_dequeue_pi(owner, top_waiter);
+		rt_mutex_enqueue_pi(owner, waiter);
 
 		__rt_mutex_adjust_prio(owner);
 		if (owner->pi_blocked_on)
@@ -486,7 +576,7 @@ static void wakeup_next_waiter(struct rt
 	 * boosted mode and go back to normal after releasing
 	 * lock->wait_lock.
 	 */
-	plist_del(&waiter->pi_list_entry, &current->pi_waiters);
+	rt_mutex_dequeue_pi(current, waiter);
 
 	rt_mutex_set_owner(lock, NULL);
 
@@ -510,7 +600,7 @@ static void remove_waiter(struct rt_mute
 	int chain_walk = 0;
 
 	raw_spin_lock_irqsave(&current->pi_lock, flags);
-	plist_del(&waiter->list_entry, &lock->wait_list);
+	rt_mutex_dequeue(lock, waiter);
 	current->pi_blocked_on = NULL;
 	raw_spin_unlock_irqrestore(&current->pi_lock, flags);
 
@@ -521,13 +611,13 @@ static void remove_waiter(struct rt_mute
 
 		raw_spin_lock_irqsave(&owner->pi_lock, flags);
 
-		plist_del(&waiter->pi_list_entry, &owner->pi_waiters);
+		rt_mutex_dequeue_pi(owner, waiter);
 
 		if (rt_mutex_has_waiters(lock)) {
 			struct rt_mutex_waiter *next;
 
 			next = rt_mutex_top_waiter(lock);
-			plist_add(&next->pi_list_entry, &owner->pi_waiters);
+			rt_mutex_enqueue_pi(owner, next);
 		}
 		__rt_mutex_adjust_prio(owner);
 
@@ -537,8 +627,6 @@ static void remove_waiter(struct rt_mute
 		raw_spin_unlock_irqrestore(&owner->pi_lock, flags);
 	}
 
-	WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
-
 	if (!chain_walk)
 		return;
 
@@ -565,7 +653,7 @@ void rt_mutex_adjust_pi(struct task_stru
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 
 	waiter = task->pi_blocked_on;
-	if (!waiter || waiter->list_entry.prio == task->prio) {
+	if (!waiter || waiter->task->prio == task->prio) {
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 		return;
 	}
@@ -638,6 +726,8 @@ rt_mutex_slowlock(struct rt_mutex *lock,
 	int ret = 0;
 
 	debug_rt_mutex_init_waiter(&waiter);
+	RB_CLEAR_NODE(&waiter.pi_tree_entry);
+	RB_CLEAR_NODE(&waiter.tree_entry);
 
 	raw_spin_lock(&lock->wait_lock);
 
@@ -904,7 +994,8 @@ void __rt_mutex_init(struct rt_mutex *lo
 {
 	lock->owner = NULL;
 	raw_spin_lock_init(&lock->wait_lock);
-	plist_head_init(&lock->wait_list);
+	lock->waiters = RB_ROOT;
+	lock->waiters_leftmost = NULL;
 
 	debug_rt_mutex_init(lock, name);
 }
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -40,13 +40,13 @@ extern void schedule_rt_mutex_test(struc
  * This is the control structure for tasks blocked on a rt_mutex,
  * which is allocated on the kernel stack on of the blocked task.
  *
- * @list_entry:		pi node to enqueue into the mutex waiters list
- * @pi_list_entry:	pi node to enqueue into the mutex owner waiters list
+ * @tree_entry:		pi node to enqueue into the mutex waiters tree
+ * @pi_tree_entry:	pi node to enqueue into the mutex owner waiters tree
  * @task:		task reference to the blocked task
  */
 struct rt_mutex_waiter {
-	struct plist_node	list_entry;
-	struct plist_node	pi_list_entry;
+	struct rb_node          tree_entry;
+	struct rb_node          pi_tree_entry;
 	struct task_struct	*task;
 	struct rt_mutex		*lock;
 #ifdef CONFIG_DEBUG_RT_MUTEXES
@@ -57,11 +57,11 @@ struct rt_mutex_waiter {
 };
 
 /*
- * Various helpers to access the waiters-plist:
+ * Various helpers to access the waiters-tree:
  */
 static inline int rt_mutex_has_waiters(struct rt_mutex *lock)
 {
-	return !plist_head_empty(&lock->wait_list);
+	return !RB_EMPTY_ROOT(&lock->waiters);
 }
 
 static inline struct rt_mutex_waiter *
@@ -69,8 +69,8 @@ rt_mutex_top_waiter(struct rt_mutex *loc
 {
 	struct rt_mutex_waiter *w;
 
-	w = plist_first_entry(&lock->wait_list, struct rt_mutex_waiter,
-			       list_entry);
+	w = rb_entry(lock->waiters_leftmost, struct rt_mutex_waiter,
+		     tree_entry);
 	BUG_ON(w->lock != lock);
 
 	return w;
@@ -78,14 +78,14 @@ rt_mutex_top_waiter(struct rt_mutex *loc
 
 static inline int task_has_pi_waiters(struct task_struct *p)
 {
-	return !plist_head_empty(&p->pi_waiters);
+	return !RB_EMPTY_ROOT(&p->pi_waiters);
 }
 
 static inline struct rt_mutex_waiter *
 task_top_pi_waiter(struct task_struct *p)
 {
-	return plist_first_entry(&p->pi_waiters, struct rt_mutex_waiter,
-				  pi_list_entry);
+	return rb_entry(p->pi_waiters_leftmost, struct rt_mutex_waiter,
+			pi_tree_entry);
 }
 
 /*
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6625,10 +6625,6 @@ void __init sched_init(void)
 	INIT_HLIST_HEAD(&init_task.preempt_notifiers);
 #endif
 
-#ifdef CONFIG_RT_MUTEXES
-	plist_head_init(&init_task.pi_waiters);
-#endif
-
 	/*
 	 * The boot idle thread does lazy MMU switching as well:
 	 */



^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 08/13] sched: Drafted deadline inheritance logic
  2013-12-17 12:27 [PATCH 00/13] sched, deadline: patches Peter Zijlstra
                   ` (6 preceding siblings ...)
  2013-12-17 12:27 ` [PATCH 07/13] rtmutex: Turn the plist into an rb-tree Peter Zijlstra
@ 2013-12-17 12:27 ` Peter Zijlstra
  2013-12-17 12:27 ` [PATCH 09/13] sched: Add bandwidth management for sched_dl Peter Zijlstra
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2013-12-17 12:27 UTC (permalink / raw)
  To: tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang, jkacur
  Cc: Peter Zijlstra

[-- Attachment #1: 0009-sched-Drafted-deadline-inheritance-logic.patch --]
[-- Type: text/plain, Size: 17620 bytes --]

From: Dario Faggioli <raistlin@linux.it>

Some method is needed to deal with rt-mutexes and to make sched_dl
interact with the current PI code. This raises non-trivial issues
that, in our view, need to be solved with some restructuring of the
pi-code (i.e., going towards a proxy execution-ish implementation).

That restructuring is still under development; in the meanwhile, as a
temporary solution, what this commit does is:
 - ensure a pi-lock owner with waiters is never throttled down. Instead,
   when it runs out of runtime, it immediately gets replenished and its
   deadline is postponed;
 - the scheduling parameters (relative deadline and default runtime)
   used for those replenishments --during the whole period it holds the
   pi-lock-- are the ones of the waiting task with the earliest deadline.

Acting this way, we provide some kind of boosting to the lock owner,
still using the existing pi-architecture (slightly modified by the
previous commit).
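
As a rough sketch only (this is not the kernel implementation and the
names are made up), the rule above amounts to the following when the
owner's runtime is charged, with pi_se standing for the deadline
parameters inherited from the earliest-deadline waiter:

    struct dl_params {
            long long runtime;              /* remaining runtime */
            unsigned long long deadline;    /* absolute deadline */
            unsigned long long dl_runtime;  /* runtime per instance */
            unsigned long long dl_period;   /* distance between instances */
    };

    static void charge_runtime(struct dl_params *se,
                               const struct dl_params *pi_se,
                               long long delta_exec, int boosted)
    {
            se->runtime -= delta_exec;
            if (se->runtime > 0)
                    return;

            if (boosted) {
                    /*
                     * Never throttle a boosted owner: replenish at once,
                     * postponing the deadline with the inherited parameters.
                     */
                    while (se->runtime <= 0) {
                            se->deadline += pi_se->dl_period;
                            se->runtime += pi_se->dl_runtime;
                    }
            }
            /* A non-boosted entity would instead be throttled here until
             * its replenishment timer fires. */
    }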

We would stress that this is only a surely needed, but far from clean,
solution to the problem. In the end it is mainly a way to re-start
discussion within the community. So, as always, comments, ideas, rants,
etc. are welcome! :-)

Cc: liming.wang@windriver.com
Cc: oleg@redhat.com
Cc: tommaso.cucinotta@sssup.it
Cc: dhaval.giani@gmail.com
Cc: luca.abeni@unitn.it
Cc: paulmck@linux.vnet.ibm.com
Cc: fchecconi@gmail.com
Cc: fweisbec@gmail.com
Cc: harald.gustafsson@ericsson.com
Cc: hgu1972@gmail.com
Cc: vincent.guittot@linaro.org
Cc: insop.song@gmail.com
Cc: michael@amarulasolutions.com
Cc: p.faure@akatech.ch
Cc: bruce.ashfield@windriver.com
Cc: jkacur@redhat.com
Cc: mingo@redhat.com
Cc: rostedt@goodmis.org
Cc: claudio@evidence.eu.com
Cc: johan.eker@ericsson.com
Cc: nicola.manica@disi.unitn.it
Cc: tglx@linutronix.de
Cc: darren@dvhart.com
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/sched.h             |    8 ++-
 include/linux/sched/rt.h          |    1 
 kernel/fork.c                     |    1 
 kernel/locking/rtmutex.c          |   31 +++++++++---
 kernel/locking/rtmutex_common.h   |    1 
 kernel/sched/core.c               |   36 ++++++++++++---
 kernel/sched/deadline.c           |   91 ++++++++++++++++++++++----------------
 kernel/sched/sched.h              |   14 +++++
 kernel/trace/trace_sched_wakeup.c |    1 
 9 files changed, 130 insertions(+), 54 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1116,8 +1116,12 @@ struct sched_dl_entity {
 	 * @dl_new tells if a new instance arrived. If so we must
 	 * start executing it with full runtime and reset its absolute
 	 * deadline;
+	 *
+	 * @dl_boosted tells if we are boosted due to DI. If so we are
+	 * outside bandwidth enforcement mechanism (but only until we
+	 * exit the critical section).
 	 */
-	int dl_throttled, dl_new;
+	int dl_throttled, dl_new, dl_boosted;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
@@ -1351,6 +1355,8 @@ struct task_struct {
 	struct rb_node *pi_waiters_leftmost;
 	/* Deadlock detection and priority inheritance handling */
 	struct rt_mutex_waiter *pi_blocked_on;
+	/* Top pi_waiters task */
+	struct task_struct *pi_top_task;
 #endif
 
 #ifdef CONFIG_DEBUG_MUTEXES
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -35,6 +35,7 @@ static inline int rt_task(struct task_st
 #ifdef CONFIG_RT_MUTEXES
 extern int rt_mutex_getprio(struct task_struct *p);
 extern void rt_mutex_setprio(struct task_struct *p, int prio);
+extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
 static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
 {
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1089,6 +1089,7 @@ static void rt_mutex_init_task(struct ta
 	p->pi_waiters = RB_ROOT;
 	p->pi_waiters_leftmost = NULL;
 	p->pi_blocked_on = NULL;
+	p->pi_top_task = NULL;
 #endif
 }
 
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -96,13 +96,16 @@ static inline int
 rt_mutex_waiter_less(struct rt_mutex_waiter *left,
 		     struct rt_mutex_waiter *right)
 {
-	if (left->task->prio < right->task->prio)
+	if (left->prio < right->prio)
 		return 1;
 
 	/*
-	 * If both tasks are dl_task(), we check their deadlines.
+	 * If both waiters have dl_prio(), we check the deadlines of the
+	 * associated tasks.
+	 * If left waiter has a dl_prio(), and we didn't return 1 above,
+	 * then right waiter has a dl_prio() too.
 	 */
-	if (dl_prio(left->task->prio) && dl_prio(right->task->prio))
+	if (dl_prio(left->prio))
 		return (left->task->dl.deadline < right->task->dl.deadline);
 
 	return 0;
@@ -197,10 +200,18 @@ int rt_mutex_getprio(struct task_struct
 	if (likely(!task_has_pi_waiters(task)))
 		return task->normal_prio;
 
-	return min(task_top_pi_waiter(task)->task->prio,
+	return min(task_top_pi_waiter(task)->prio,
 		   task->normal_prio);
 }
 
+struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
+{
+	if (likely(!task_has_pi_waiters(task)))
+		return NULL;
+
+	return task_top_pi_waiter(task)->task;
+}
+
 /*
  * Adjust the priority of a task, after its pi_waiters got modified.
  *
@@ -210,7 +221,7 @@ static void __rt_mutex_adjust_prio(struc
 {
 	int prio = rt_mutex_getprio(task);
 
-	if (task->prio != prio)
+	if (task->prio != prio || dl_prio(prio))
 		rt_mutex_setprio(task, prio);
 }
 
@@ -328,7 +339,7 @@ static int rt_mutex_adjust_prio_chain(st
 	 * When deadlock detection is off then we check, if further
 	 * priority adjustment is necessary.
 	 */
-	if (!detect_deadlock && waiter->task->prio == task->prio)
+	if (!detect_deadlock && waiter->prio == task->prio)
 		goto out_unlock_pi;
 
 	lock = waiter->lock;
@@ -350,7 +361,7 @@ static int rt_mutex_adjust_prio_chain(st
 
 	/* Requeue the waiter */
 	rt_mutex_dequeue(lock, waiter);
-	waiter->task->prio = task->prio;
+	waiter->prio = task->prio;
 	rt_mutex_enqueue(lock, waiter);
 
 	/* Release the task */
@@ -448,7 +459,7 @@ static int try_to_take_rt_mutex(struct r
 	 * 3) it is top waiter
 	 */
 	if (rt_mutex_has_waiters(lock)) {
-		if (task->prio >= rt_mutex_top_waiter(lock)->task->prio) {
+		if (task->prio >= rt_mutex_top_waiter(lock)->prio) {
 			if (!waiter || waiter != rt_mutex_top_waiter(lock))
 				return 0;
 		}
@@ -508,6 +519,7 @@ static int task_blocks_on_rt_mutex(struc
 	__rt_mutex_adjust_prio(task);
 	waiter->task = task;
 	waiter->lock = lock;
+	waiter->prio = task->prio;
 
 	/* Get the top priority waiter on the lock */
 	if (rt_mutex_has_waiters(lock))
@@ -653,7 +665,8 @@ void rt_mutex_adjust_pi(struct task_stru
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 
 	waiter = task->pi_blocked_on;
-	if (!waiter || waiter->task->prio == task->prio) {
+	if (!waiter || (waiter->prio == task->prio &&
+			!dl_prio(task->prio))) {
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 		return;
 	}
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -54,6 +54,7 @@ struct rt_mutex_waiter {
 	struct pid		*deadlock_task_pid;
 	struct rt_mutex		*deadlock_lock;
 #endif
+	int prio;
 };
 
 /*
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -947,7 +947,7 @@ static inline void check_class_changed(s
 		if (prev_class->switched_from)
 			prev_class->switched_from(rq, p);
 		p->sched_class->switched_to(rq, p);
-	} else if (oldprio != p->prio)
+	} else if (oldprio != p->prio || dl_task(p))
 		p->sched_class->prio_changed(rq, p, oldprio);
 }
 
@@ -2780,7 +2780,7 @@ EXPORT_SYMBOL(sleep_on_timeout);
  */
 void rt_mutex_setprio(struct task_struct *p, int prio)
 {
-	int oldprio, on_rq, running;
+	int oldprio, on_rq, running, enqueue_flag = 0;
 	struct rq *rq;
 	const struct sched_class *prev_class;
 
@@ -2807,6 +2807,7 @@ void rt_mutex_setprio(struct task_struct
 	}
 
 	trace_sched_pi_setprio(p, prio);
+	p->pi_top_task = rt_mutex_get_top_task(p);
 	oldprio = p->prio;
 	prev_class = p->sched_class;
 	on_rq = p->on_rq;
@@ -2816,19 +2817,42 @@ void rt_mutex_setprio(struct task_struct
 	if (running)
 		p->sched_class->put_prev_task(rq, p);
 
-	if (dl_prio(prio))
+	/*
+	 * Boosting condition are:
+	 * 1. -rt task is running and holds mutex A
+	 *      --> -dl task blocks on mutex A
+	 *
+	 * 2. -dl task is running and holds mutex A
+	 *      --> -dl task blocks on mutex A and could preempt the
+	 *          running task
+	 */
+	if (dl_prio(prio)) {
+		if (!dl_prio(p->normal_prio) || (p->pi_top_task &&
+			dl_entity_preempt(&p->pi_top_task->dl, &p->dl))) {
+			p->dl.dl_boosted = 1;
+			p->dl.dl_throttled = 0;
+			enqueue_flag = ENQUEUE_REPLENISH;
+		} else
+			p->dl.dl_boosted = 0;
 		p->sched_class = &dl_sched_class;
-	else if (rt_prio(prio))
+	} else if (rt_prio(prio)) {
+		if (dl_prio(oldprio))
+			p->dl.dl_boosted = 0;
+		if (oldprio < prio)
+			enqueue_flag = ENQUEUE_HEAD;
 		p->sched_class = &rt_sched_class;
-	else
+	} else {
+		if (dl_prio(oldprio))
+			p->dl.dl_boosted = 0;
 		p->sched_class = &fair_sched_class;
+	}
 
 	p->prio = prio;
 
 	if (running)
 		p->sched_class->set_curr_task(rq);
 	if (on_rq)
-		enqueue_task(rq, p, oldprio < prio ? ENQUEUE_HEAD : 0);
+		enqueue_task(rq, p, enqueue_flag);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 out_unlock:
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -16,20 +16,6 @@
  */
 #include "sched.h"
 
-static inline int dl_time_before(u64 a, u64 b)
-{
-	return (s64)(a - b) < 0;
-}
-
-/*
- * Tells if entity @a should preempt entity @b.
- */
-static inline
-int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
-{
-	return dl_time_before(a->deadline, b->deadline);
-}
-
 static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
 {
 	return container_of(dl_se, struct task_struct, dl);
@@ -242,7 +228,8 @@ static void check_preempt_curr_dl(struct
  * one, and to (try to!) reconcile itself with its own scheduling
  * parameters.
  */
-static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
+static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se,
+				       struct sched_dl_entity *pi_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
@@ -254,8 +241,8 @@ static inline void setup_new_dl_entity(s
 	 * future; in fact, we must consider execution overheads (time
 	 * spent on hardirq context, etc.).
 	 */
-	dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
-	dl_se->runtime = dl_se->dl_runtime;
+	dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
+	dl_se->runtime = pi_se->dl_runtime;
 	dl_se->dl_new = 0;
 }
 
@@ -277,11 +264,23 @@ static inline void setup_new_dl_entity(s
  * could happen are, typically, a entity voluntarily trying to overcome its
  * runtime, or it just underestimated it during sched_setscheduler_ex().
  */
-static void replenish_dl_entity(struct sched_dl_entity *dl_se)
+static void replenish_dl_entity(struct sched_dl_entity *dl_se,
+				struct sched_dl_entity *pi_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
 
+	BUG_ON(pi_se->dl_runtime <= 0);
+
+	/*
+	 * This could be the case for a !-dl task that is boosted.
+	 * Just go with full inherited parameters.
+	 */
+	if (dl_se->dl_deadline == 0) {
+		dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
+		dl_se->runtime = pi_se->dl_runtime;
+	}
+
 	/*
 	 * We keep moving the deadline away until we get some
 	 * available runtime for the entity. This ensures correct
@@ -289,8 +288,8 @@ static void replenish_dl_entity(struct s
 	 * arbitrary large.
 	 */
 	while (dl_se->runtime <= 0) {
-		dl_se->deadline += dl_se->dl_period;
-		dl_se->runtime += dl_se->dl_runtime;
+		dl_se->deadline += pi_se->dl_period;
+		dl_se->runtime += pi_se->dl_runtime;
 	}
 
 	/*
@@ -309,8 +308,8 @@ static void replenish_dl_entity(struct s
 			lag_once = true;
 			printk_sched("sched: DL replenish lagged to much\n");
 		}
-		dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
-		dl_se->runtime = dl_se->dl_runtime;
+		dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
+		dl_se->runtime = pi_se->dl_runtime;
 	}
 }
 
@@ -337,7 +336,8 @@ static void replenish_dl_entity(struct s
  * task with deadline equal to period this is the same of using
  * dl_deadline instead of dl_period in the equation above.
  */
-static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
+static bool dl_entity_overflow(struct sched_dl_entity *dl_se,
+			       struct sched_dl_entity *pi_se, u64 t)
 {
 	u64 left, right;
 
@@ -359,8 +359,8 @@ static bool dl_entity_overflow(struct sc
 	 * of anything below microseconds resolution is actually fiction
 	 * (but still we want to give the user that illusion >;).
 	 */
-	left = (dl_se->dl_period >> 10) * (dl_se->runtime >> 10);
-	right = ((dl_se->deadline - t) >> 10) * (dl_se->dl_runtime >> 10);
+	left = (pi_se->dl_period >> 10) * (dl_se->runtime >> 10);
+	right = ((dl_se->deadline - t) >> 10) * (pi_se->dl_runtime >> 10);
 
 	return dl_time_before(right, left);
 }
@@ -374,7 +374,8 @@ static bool dl_entity_overflow(struct sc
  *  - using the remaining runtime with the current deadline would make
  *    the entity exceed its bandwidth.
  */
-static void update_dl_entity(struct sched_dl_entity *dl_se)
+static void update_dl_entity(struct sched_dl_entity *dl_se,
+			     struct sched_dl_entity *pi_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
@@ -384,14 +385,14 @@ static void update_dl_entity(struct sche
 	 * the actual scheduling parameters have to be "renewed".
 	 */
 	if (dl_se->dl_new) {
-		setup_new_dl_entity(dl_se);
+		setup_new_dl_entity(dl_se, pi_se);
 		return;
 	}
 
 	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
-	    dl_entity_overflow(dl_se, rq_clock(rq))) {
-		dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
-		dl_se->runtime = dl_se->dl_runtime;
+	    dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) {
+		dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
+		dl_se->runtime = pi_se->dl_runtime;
 	}
 }
 
@@ -405,7 +406,7 @@ static void update_dl_entity(struct sche
  * actually started or not (i.e., the replenishment instant is in
  * the future or in the past).
  */
-static int start_dl_timer(struct sched_dl_entity *dl_se)
+static int start_dl_timer(struct sched_dl_entity *dl_se, bool boosted)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
@@ -414,6 +415,8 @@ static int start_dl_timer(struct sched_d
 	unsigned long range;
 	s64 delta;
 
+	if (boosted)
+		return 0;
 	/*
 	 * We want the timer to fire at the deadline, but considering
 	 * that it is actually coming from rq->clock and not from
@@ -573,7 +576,7 @@ static void update_curr_dl(struct rq *rq
 	dl_se->runtime -= delta_exec;
 	if (dl_runtime_exceeded(rq, dl_se)) {
 		__dequeue_task_dl(rq, curr, 0);
-		if (likely(start_dl_timer(dl_se)))
+		if (likely(start_dl_timer(dl_se, curr->dl.dl_boosted)))
 			dl_se->dl_throttled = 1;
 		else
 			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
@@ -728,7 +731,8 @@ static void __dequeue_dl_entity(struct s
 }
 
 static void
-enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
+enqueue_dl_entity(struct sched_dl_entity *dl_se,
+		  struct sched_dl_entity *pi_se, int flags)
 {
 	BUG_ON(on_dl_rq(dl_se));
 
@@ -738,9 +742,9 @@ enqueue_dl_entity(struct sched_dl_entity
 	 * we want a replenishment of its runtime.
 	 */
 	if (!dl_se->dl_new && flags & ENQUEUE_REPLENISH)
-		replenish_dl_entity(dl_se);
+		replenish_dl_entity(dl_se, pi_se);
 	else
-		update_dl_entity(dl_se);
+		update_dl_entity(dl_se, pi_se);
 
 	__enqueue_dl_entity(dl_se);
 }
@@ -752,6 +756,18 @@ static void dequeue_dl_entity(struct sch
 
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
+	struct task_struct *pi_task = p->pi_top_task;
+	struct sched_dl_entity *pi_se = &p->dl;
+
+	/*
+	 * Use the scheduling parameters of the top pi-waiter
+	 * task if we have one and its (relative) deadline is
+	 * smaller than our one... OTW we keep our runtime and
+	 * deadline.
+	 */
+	if (pi_task && p->dl.dl_boosted && dl_prio(pi_task->normal_prio))
+		pi_se = &pi_task->dl;
+
 	/*
 	 * If p is throttled, we do nothing. In fact, if it exhausted
 	 * its budget it needs a replenishment and, since it now is on
@@ -761,7 +777,7 @@ static void enqueue_task_dl(struct rq *r
 	if (p->dl.dl_throttled)
 		return;
 
-	enqueue_dl_entity(&p->dl, flags);
+	enqueue_dl_entity(&p->dl, pi_se, flags);
 
 	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
@@ -985,8 +1001,7 @@ static void task_dead_dl(struct task_str
 {
 	struct hrtimer *timer = &p->dl.dl_timer;
 
-	if (hrtimer_active(timer))
-		hrtimer_try_to_cancel(timer);
+	hrtimer_cancel(timer);
 }
 
 static void set_curr_task_dl(struct rq *rq)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -106,6 +106,20 @@ static inline int task_has_dl_policy(str
 	return dl_policy(p->policy);
 }
 
+static inline int dl_time_before(u64 a, u64 b)
+{
+	return (s64)(a - b) < 0;
+}
+
+/*
+ * Tells if entity @a should preempt entity @b.
+ */
+static inline
+int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
+{
+	return dl_time_before(a->deadline, b->deadline);
+}
+
 /*
  * This is the priority-queue data structure of the RT scheduling class:
  */
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -16,6 +16,7 @@
 #include <linux/uaccess.h>
 #include <linux/ftrace.h>
 #include <linux/sched/rt.h>
+#include <linux/sched/deadline.h>
 #include <trace/events/sched.h>
 #include "trace.h"
 



^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 09/13] sched: Add bandwidth management for sched_dl
  2013-12-17 12:27 [PATCH 00/13] sched, deadline: patches Peter Zijlstra
                   ` (7 preceding siblings ...)
  2013-12-17 12:27 ` [PATCH 08/13] sched: Drafted deadline inheritance logic Peter Zijlstra
@ 2013-12-17 12:27 ` Peter Zijlstra
  2013-12-18 16:55   ` Peter Zijlstra
  2013-12-17 12:27 ` [PATCH 10/13] sched: speed up -dl pushes with a push-heap Peter Zijlstra
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2013-12-17 12:27 UTC (permalink / raw)
  To: tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang, jkacur
  Cc: Peter Zijlstra

[-- Attachment #1: 0010-sched-Add-bandwidth-management-for-sched_dl.patch --]
[-- Type: text/plain, Size: 28088 bytes --]

From: Dario Faggioli <raistlin@linux.it>

For -deadline scheduling to be effective and useful, it is important
to have some method of keeping the allocation of the available CPU
bandwidth to tasks and task groups under control.
This is usually called "admission control"; if it is not performed at
all, no guarantee can be given on the actual scheduling of the
-deadline tasks.

Since RT-throttling was introduced, each task group has had a
bandwidth associated with it, computed as a certain amount of runtime
over a period. Moreover, to make it possible to manipulate that
bandwidth, readable/writable controls have been added to both procfs
(for system-wide settings) and cgroupfs (for per-group settings).
The same interface is therefore used for controlling the bandwidth
distribution to -deadline tasks and task groups, i.e., new controls
with similar names, equivalent meaning and the same usage paradigm
are added.

However, more discussion is needed in order to figure out how we want
to manage SCHED_DEADLINE bandwidth at the task group level. Therefore,
this patch adds a less sophisticated, but actually very sensible,
mechanism to ensure that a certain utilization cap is not exceeded in
each root_domain (the single rq for !SMP configurations).

Another main difference between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth of their own
(while -rt ones don't!), and thus we don't need a higher-level
throttling mechanism to enforce the desired bandwidth.

This patch, therefore:
 - adds system-wide deadline bandwidth management by means of:
    * /proc/sys/kernel/sched_dl_runtime_us,
    * /proc/sys/kernel/sched_dl_period_us,
   which together (i.e., as runtime / period) determine the total
   bandwidth available on each CPU of each root_domain for -deadline
   tasks;
 - couples the RT and deadline bandwidth management, i.e., enforces
   that the sum of the bandwidth devoted to -rt and -deadline tasks
   stays below 100%.

This means that, for a root_domain comprising M CPUs, -deadline tasks
can be created until the sum of their bandwidths stays below:

    M * (sched_dl_runtime_us / sched_dl_period_us)

It is also possible to disable this bandwidth management logic, and
thus be free to oversubscribe the system to any arbitrary level.
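
For illustration only (not the kernel code; the fixed-point helper and
function names below are invented for this sketch), the per-root_domain
admission test described above boils down to:

    #include <stdbool.h>
    #include <stdint.h>

    /* Bandwidth as a fixed-point fraction: runtime/period scaled by 2^20. */
    static uint64_t to_bw(uint64_t runtime_us, uint64_t period_us)
    {
            return period_us ? (runtime_us << 20) / period_us : 0;
    }

    /*
     * cap_bw:   per-CPU cap, to_bw(sched_dl_runtime_us, sched_dl_period_us)
     * cpus:     number of CPUs (M) in the root_domain
     * total_bw: bandwidth already allocated to -deadline tasks in the domain
     * old_bw:   the task's current bandwidth (0 if it was not -deadline)
     * new_bw:   the bandwidth the task is asking for
     */
    static bool dl_admit(uint64_t cap_bw, int cpus, uint64_t total_bw,
                         uint64_t old_bw, uint64_t new_bw)
    {
            return cap_bw * cpus >= total_bw - old_bw + new_bw;
    }

With the defaults introduced by this patch (50000us of runtime every
1000000us) the cap is 5% of each CPU; disabling the check (presumably
by writing -1 to sched_dl_runtime_us, mirroring the -rt controls) makes
every task admissible.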

Cc: hgu1972@gmail.com
Cc: tglx@linutronix.de
Cc: bruce.ashfield@windriver.com
Cc: dhaval.giani@gmail.com
Cc: johan.eker@ericsson.com
Cc: michael@amarulasolutions.com
Cc: rostedt@goodmis.org
Cc: luca.abeni@unitn.it
Cc: paulmck@linux.vnet.ibm.com
Cc: fchecconi@gmail.com
Cc: oleg@redhat.com
Cc: fweisbec@gmail.com
Cc: vincent.guittot@linaro.org
Cc: darren@dvhart.com
Cc: jkacur@redhat.com
Cc: harald.gustafsson@ericsson.com
Cc: nicola.manica@disi.unitn.it
Cc: p.faure@akatech.ch
Cc: tommaso.cucinotta@sssup.it
Cc: claudio@evidence.eu.com
Cc: insop.song@gmail.com
Cc: liming.wang@windriver.com
Cc: mingo@redhat.com
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/sched.h        |    1 
 include/linux/sched/sysctl.h |   13 +
 kernel/sched/core.c          |  437 ++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/deadline.c      |   46 +++-
 kernel/sched/sched.h         |   76 +++++++
 kernel/sysctl.c              |   14 +
 6 files changed, 554 insertions(+), 33 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1096,6 +1096,7 @@ struct sched_dl_entity {
 	u64 dl_runtime;		/* maximum runtime for each instance	*/
 	u64 dl_deadline;	/* relative deadline of each instance	*/
 	u64 dl_period;		/* separation of two instances (period) */
+	u64 dl_bw;		/* dl_runtime / dl_deadline		*/
 
 	/*
 	 * Actual scheduling parameters. Initialized with the values above,
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -81,6 +81,15 @@ static inline unsigned int get_sysctl_ti
 extern unsigned int sysctl_sched_rt_period;
 extern int sysctl_sched_rt_runtime;
 
+/*
+ *  control SCHED_DEADLINE reservations:
+ *
+ *  /proc/sys/kernel/sched_dl_period_us
+ *  /proc/sys/kernel/sched_dl_runtime_us
+ */
+extern unsigned int sysctl_sched_dl_period;
+extern int sysctl_sched_dl_runtime;
+
 #ifdef CONFIG_CFS_BANDWIDTH
 extern unsigned int sysctl_sched_cfs_bandwidth_slice;
 #endif
@@ -99,4 +108,8 @@ extern int sched_rt_handler(struct ctl_t
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
 
+int sched_dl_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+
 #endif /* _SCHED_SYSCTL_H */
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -296,6 +296,15 @@ __read_mostly int scheduler_running;
  */
 int sysctl_sched_rt_runtime = 950000;
 
+/*
+ * Maximum bandwidth available for all -deadline tasks and groups
+ * (if group scheduling is configured) on each CPU.
+ *
+ * default: 5%
+ */
+unsigned int sysctl_sched_dl_period = 1000000;
+int sysctl_sched_dl_runtime = 50000;
+
 
 
 /*
@@ -1855,6 +1864,111 @@ int sched_fork(unsigned long clone_flags
 	return 0;
 }
 
+unsigned long to_ratio(u64 period, u64 runtime)
+{
+	if (runtime == RUNTIME_INF)
+		return 1ULL << 20;
+
+	/*
+	 * Doing this here saves a lot of checks in all
+	 * the calling paths, and returning zero seems
+	 * safe for them anyway.
+	 */
+	if (period == 0)
+		return 0;
+
+	return div64_u64(runtime << 20, period);
+}
+
+#ifdef CONFIG_SMP
+inline struct dl_bw *dl_bw_of(int i)
+{
+	return &cpu_rq(i)->rd->dl_bw;
+}
+
+static inline int __dl_span_weight(struct rq *rq)
+{
+	return cpumask_weight(rq->rd->span);
+}
+#else
+inline struct dl_bw *dl_bw_of(int i)
+{
+	return &cpu_rq(i)->dl.dl_bw;
+}
+
+static inline int __dl_span_weight(struct rq *rq)
+{
+	return 1;
+}
+#endif
+
+static inline
+void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw)
+{
+	dl_b->total_bw -= tsk_bw;
+}
+
+static inline
+void __dl_add(struct dl_bw *dl_b, u64 tsk_bw)
+{
+	dl_b->total_bw += tsk_bw;
+}
+
+static inline
+bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
+{
+	return dl_b->bw != -1 &&
+	       dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
+}
+
+/*
+ * We must be sure that accepting a new task (or allowing changing the
+ * parameters of an existing one) is consistent with the bandwidth
+ * constraints. If yes, this function also accordingly updates the currently
+ * allocated bandwidth to reflect the new situation.
+ *
+ * This function is called while holding p's rq->lock.
+ */
+static int dl_overflow(struct task_struct *p, int policy,
+		       const struct sched_attr *attr)
+{
+
+	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+	u64 period = attr->sched_period;
+	u64 runtime = attr->sched_runtime;
+	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
+	int cpus = __dl_span_weight(task_rq(p));
+	int err = -1;
+
+	if (new_bw == p->dl.dl_bw)
+		return 0;
+
+	/*
+	 * Either if a task, enters, leave, or stays -deadline but changes
+	 * its parameters, we may need to update accordingly the total
+	 * allocated bandwidth of the container.
+	 */
+	raw_spin_lock(&dl_b->lock);
+	if (dl_policy(policy) && !task_has_dl_policy(p) &&
+	    !__dl_overflow(dl_b, cpus, 0, new_bw)) {
+		__dl_add(dl_b, new_bw);
+		err = 0;
+	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
+		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
+		__dl_clear(dl_b, p->dl.dl_bw);
+		__dl_add(dl_b, new_bw);
+		err = 0;
+	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
+		__dl_clear(dl_b, p->dl.dl_bw);
+		err = 0;
+	}
+	raw_spin_unlock(&dl_b->lock);
+
+	return err;
+}
+
+extern void init_dl_bw(struct dl_bw *dl_b);
+
 /*
  * wake_up_new_task - wake up a newly created task for the first time.
  *
@@ -3069,6 +3183,7 @@ __setparam_dl(struct task_struct *p, con
 	dl_se->dl_deadline = attr->sched_deadline;
 	dl_se->dl_period = attr->sched_period ?: dl_se->dl_deadline;
 	dl_se->flags = attr->sched_flags;
+	dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
 	dl_se->dl_throttled = 0;
 	dl_se->dl_new = 1;
 }
@@ -3089,7 +3204,9 @@ __getparam_dl(struct task_struct *p, str
  * This function validates the new parameters of a -deadline task.
  * We ask for the deadline not being zero, and greater or equal
  * than the runtime, as well as the period of being zero or
- * greater than deadline.
+ * greater than deadline. Furthermore, we have to be sure that
+ * user parameters are above the internal resolution (1us); we
+ * check sched_runtime only since it is always the smaller one.
  */
 static bool
 __checkparam_dl(const struct sched_attr *attr)
@@ -3097,7 +3214,8 @@ __checkparam_dl(const struct sched_attr
 	return attr && attr->sched_deadline != 0 &&
 		(attr->sched_period == 0 ||
 		(s64)(attr->sched_period   - attr->sched_deadline) >= 0) &&
-		(s64)(attr->sched_deadline - attr->sched_runtime ) >= 0;
+		(s64)(attr->sched_deadline - attr->sched_runtime ) >= 0  &&
+		attr->sched_runtime >= (2 << (DL_SCALE - 1));
 }
 
 /*
@@ -3226,8 +3344,8 @@ static int __sched_setscheduler(struct t
 		return 0;
 	}
 
-#ifdef CONFIG_RT_GROUP_SCHED
 	if (user) {
+#ifdef CONFIG_RT_GROUP_SCHED
 		/*
 		 * Do not allow realtime tasks into groups that have no runtime
 		 * assigned.
@@ -3238,8 +3356,34 @@ static int __sched_setscheduler(struct t
 			task_rq_unlock(rq, p, &flags);
 			return -EPERM;
 		}
-	}
 #endif
+#ifdef CONFIG_SMP
+		if (dl_bandwidth_enabled() && dl_policy(policy)) {
+			cpumask_t *span = rq->rd->span;
+			cpumask_t act_affinity;
+
+			/*
+			 * cpus_allowed mask is statically initialized with
+			 * CPU_MASK_ALL, span is instead dynamic. Here we
+			 * compute the "dynamic" affinity of a task.
+			 */
+			cpumask_and(&act_affinity, &p->cpus_allowed,
+				    cpu_active_mask);
+
+			/*
+			 * Don't allow tasks with an affinity mask smaller than
+			 * the entire root_domain to become SCHED_DEADLINE. We
+			 * will also fail if there's no bandwidth available.
+			 */
+			if (!cpumask_equal(&act_affinity, span) ||
+			    		   rq->rd->dl_bw.bw == 0) {
+				__task_rq_unlock(rq);
+				raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+				return -EPERM;
+			}
+		}
+#endif
+	}
 
 	/* recheck policy now with rq lock held */
 	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
@@ -3247,6 +3391,19 @@ static int __sched_setscheduler(struct t
 		task_rq_unlock(rq, p, &flags);
 		goto recheck;
 	}
+
+	/*
+	 * If setscheduling to SCHED_DEADLINE (or changing the parameters
+	 * of a SCHED_DEADLINE task) we need to check if enough bandwidth
+	 * is available.
+	 */
+	if ((dl_policy(policy) || dl_task(p)) &&
+	    dl_overflow(p, policy, attr)) {
+		__task_rq_unlock(rq);
+		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+		return -EBUSY;
+	}
+
 	on_rq = p->on_rq;
 	running = task_current(rq, p);
 	if (on_rq)
@@ -3696,6 +3853,24 @@ long sched_setaffinity(pid_t pid, const
 	if (retval)
 		goto out_unlock;
 
+	/*
+	 * Since bandwidth control happens on root_domain basis,
+	 * if admission test is enabled, we only admit -deadline
+	 * tasks allowed to run on all the CPUs in the task's
+	 * root_domain.
+	 */
+#ifdef CONFIG_SMP
+	if (task_has_dl_policy(p)) {
+		const struct cpumask *span = task_rq(p)->rd->span;
+
+		if (dl_bandwidth_enabled() &&
+		    !cpumask_equal(in_mask, span)) {
+			retval = -EBUSY;
+			goto out_unlock;
+		}
+	}
+#endif
+
 	cpuset_cpus_allowed(p, cpus_allowed);
 	cpumask_and(new_mask, in_mask, cpus_allowed);
 again:
@@ -4350,6 +4525,42 @@ int set_cpus_allowed_ptr(struct task_str
 EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
 
 /*
+ * When dealing with a -deadline task, we have to check if moving it to
+ * a new CPU is possible or not. In fact, this is only true iff there
+ * is enough bandwidth available on such CPU, otherwise we want the
+ * whole migration progedure to fail over.
+ */
+static inline
+bool set_task_cpu_dl(struct task_struct *p, unsigned int cpu)
+{
+	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+	struct dl_bw *cpu_b = dl_bw_of(cpu);
+	int ret = 1;
+	u64 bw;
+
+	if (dl_b == cpu_b)
+		return 1;
+
+	raw_spin_lock(&dl_b->lock);
+	raw_spin_lock(&cpu_b->lock);
+
+	bw = cpu_b->bw * cpumask_weight(cpu_rq(cpu)->rd->span);
+	if (dl_bandwidth_enabled() &&
+	    bw < cpu_b->total_bw + p->dl.dl_bw) {
+		ret = 0;
+		goto unlock;
+	}
+	dl_b->total_bw -= p->dl.dl_bw;
+	cpu_b->total_bw += p->dl.dl_bw;
+
+unlock:
+	raw_spin_unlock(&cpu_b->lock);
+	raw_spin_unlock(&dl_b->lock);
+
+	return ret;
+}
+
+/*
  * Move (not current) task off this cpu, onto dest cpu. We're doing
  * this because either it can't run here any more (set_cpus_allowed()
  * away from this CPU, or CPU going down), or because we're
@@ -4381,6 +4592,13 @@ static int __migrate_task(struct task_st
 		goto fail;
 
 	/*
+	 * If p is -deadline, proceed only if there is enough
+	 * bandwidth available on dest_cpu
+	 */
+	if (unlikely(dl_task(p)) && !set_task_cpu_dl(p, dest_cpu))
+		goto fail;
+
+	/*
 	 * If we're not on a rq, the next wake-up will ensure we're
 	 * placed properly.
 	 */
@@ -5119,6 +5337,8 @@ static int init_rootdomain(struct root_d
 	if (!alloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
 		goto free_dlo_mask;
 
+	init_dl_bw(&rd->dl_bw);
+
 	if (cpupri_init(&rd->cpupri) != 0)
 		goto free_rto_mask;
 	return 0;
@@ -6554,6 +6774,8 @@ void __init sched_init(void)
 
 	init_rt_bandwidth(&def_rt_bandwidth,
 			global_rt_period(), global_rt_runtime());
+	init_dl_bandwidth(&def_dl_bandwidth,
+			global_dl_period(), global_dl_runtime());
 
 #ifdef CONFIG_RT_GROUP_SCHED
 	init_rt_bandwidth(&root_task_group.rt_bandwidth,
@@ -6954,16 +7176,6 @@ void sched_move_task(struct task_struct
 }
 #endif /* CONFIG_CGROUP_SCHED */
 
-#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
-static unsigned long to_ratio(u64 period, u64 runtime)
-{
-	if (runtime == RUNTIME_INF)
-		return 1ULL << 20;
-
-	return div64_u64(runtime << 20, period);
-}
-#endif
-
 #ifdef CONFIG_RT_GROUP_SCHED
 /*
  * Ensure that the real time constraints are schedulable.
@@ -7137,10 +7349,48 @@ static long sched_group_rt_period(struct
 	do_div(rt_period_us, NSEC_PER_USEC);
 	return rt_period_us;
 }
+#endif /* CONFIG_RT_GROUP_SCHED */
+
+/*
+ * Coupling of -rt and -deadline bandwidth.
+ *
+ * Here we check if the new -rt bandwidth value is consistent
+ * with the system settings for the bandwidth available
+ * to -deadline tasks.
+ *
+ * IOW, we want to enforce that
+ *
+ *   rt_bandwidth + dl_bandwidth <= 100%
+ *
+ * is always true.
+ */
+static bool __sched_rt_dl_global_constraints(u64 rt_bw)
+{
+	unsigned long flags;
+	u64 dl_bw;
+	bool ret;
+
+	raw_spin_lock_irqsave(&def_dl_bandwidth.dl_runtime_lock, flags);
+	if (global_rt_runtime() == RUNTIME_INF ||
+	    global_dl_runtime() == RUNTIME_INF) {
+		ret = true;
+		goto unlock;
+	}
+
+	dl_bw = to_ratio(def_dl_bandwidth.dl_period,
+			 def_dl_bandwidth.dl_runtime);
+
+	ret = rt_bw + dl_bw <= to_ratio(RUNTIME_INF, RUNTIME_INF);
+unlock:
+	raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock, flags);
+
+	return ret;
+}
 
+#ifdef CONFIG_RT_GROUP_SCHED
 static int sched_rt_global_constraints(void)
 {
-	u64 runtime, period;
+	u64 runtime, period, bw;
 	int ret = 0;
 
 	if (sysctl_sched_rt_period <= 0)
@@ -7155,6 +7405,10 @@ static int sched_rt_global_constraints(v
 	if (runtime > period && runtime != RUNTIME_INF)
 		return -EINVAL;
 
+	bw = to_ratio(period, runtime);
+	if (!__sched_rt_dl_global_constraints(bw))
+		return -EINVAL;
+
 	mutex_lock(&rt_constraints_mutex);
 	read_lock(&tasklist_lock);
 	ret = __rt_schedulable(NULL, 0, 0);
@@ -7177,19 +7431,19 @@ static int sched_rt_can_attach(struct ta
 static int sched_rt_global_constraints(void)
 {
 	unsigned long flags;
-	int i;
+	int i, ret = 0;
+	u64 bw;
 
 	if (sysctl_sched_rt_period <= 0)
 		return -EINVAL;
 
-	/*
-	 * There's always some RT tasks in the root group
-	 * -- migration, kstopmachine etc..
-	 */
-	if (sysctl_sched_rt_runtime == 0)
-		return -EBUSY;
-
 	raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags);
+	bw = to_ratio(global_rt_period(), global_rt_runtime());
+	if (!__sched_rt_dl_global_constraints(bw)) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
 	for_each_possible_cpu(i) {
 		struct rt_rq *rt_rq = &cpu_rq(i)->rt;
 
@@ -7197,12 +7451,93 @@ static int sched_rt_global_constraints(v
 		rt_rq->rt_runtime = global_rt_runtime();
 		raw_spin_unlock(&rt_rq->rt_runtime_lock);
 	}
+unlock:
 	raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags);
 
-	return 0;
+	return ret;
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+/*
+ * Coupling of -dl and -rt bandwidth.
+ *
+ * Here we check, while setting the system wide bandwidth available
+ * for -dl tasks and groups, if the new values are consistent with
+ * the system settings for the bandwidth available to -rt entities.
+ *
+ * IOW, we want to enforce that
+ *
+ *   rt_bandwidth + dl_bandwidth <= 100%
+ *
+ * is always true.
+ */
+static bool __sched_dl_rt_global_constraints(u64 dl_bw)
+{
+	u64 rt_bw;
+	bool ret;
+
+	raw_spin_lock(&def_rt_bandwidth.rt_runtime_lock);
+	if (global_dl_runtime() == RUNTIME_INF ||
+	    global_rt_runtime() == RUNTIME_INF) {
+		ret = true;
+		goto unlock;
+	}
+
+	rt_bw = to_ratio(ktime_to_ns(def_rt_bandwidth.rt_period),
+			 def_rt_bandwidth.rt_runtime);
+
+	ret = rt_bw + dl_bw <= to_ratio(RUNTIME_INF, RUNTIME_INF);
+unlock:
+	raw_spin_unlock(&def_rt_bandwidth.rt_runtime_lock);
+
+	return ret;
+}
+
+static bool __sched_dl_global_constraints(u64 runtime, u64 period)
+{
+	if (!period || (runtime != RUNTIME_INF && runtime > period))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int sched_dl_global_constraints(void)
+{
+	u64 runtime = global_dl_runtime();
+	u64 period = global_dl_period();
+	u64 new_bw = to_ratio(period, runtime);
+	int ret, i;
+
+	ret = __sched_dl_global_constraints(runtime, period);
+	if (ret)
+		return ret;
+
+	if (!__sched_dl_rt_global_constraints(new_bw))
+		return -EINVAL;
+
+	/*
+	 * Here we want to check the bandwidth not being set to some
+	 * value smaller than the currently allocated bandwidth in
+	 * any of the root_domains.
+	 *
+	 * FIXME: Cycling on all the CPUs is overdoing, but simpler than
+	 * cycling on root_domains... Discussion on different/better
+	 * solutions is welcome!
+	 */
+	for_each_possible_cpu(i) {
+		struct dl_bw *dl_b = dl_bw_of(i);
+
+		raw_spin_lock(&dl_b->lock);
+		if (new_bw < dl_b->total_bw) {
+			raw_spin_unlock(&dl_b->lock);
+			return -EBUSY;
+		}
+		raw_spin_unlock(&dl_b->lock);
+	}
+
+	return 0;
+}
+
 int sched_rr_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos)
@@ -7249,6 +7584,60 @@ int sched_rt_handler(struct ctl_table *t
 	}
 	mutex_unlock(&mutex);
 
+	return ret;
+}
+
+int sched_dl_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int ret;
+	int old_period, old_runtime;
+	static DEFINE_MUTEX(mutex);
+	unsigned long flags;
+
+	mutex_lock(&mutex);
+	old_period = sysctl_sched_dl_period;
+	old_runtime = sysctl_sched_dl_runtime;
+
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		raw_spin_lock_irqsave(&def_dl_bandwidth.dl_runtime_lock,
+				      flags);
+
+		ret = sched_dl_global_constraints();
+		if (ret) {
+			sysctl_sched_dl_period = old_period;
+			sysctl_sched_dl_runtime = old_runtime;
+		} else {
+			u64 new_bw;
+			int i;
+
+			def_dl_bandwidth.dl_period = global_dl_period();
+			def_dl_bandwidth.dl_runtime = global_dl_runtime();
+			if (global_dl_runtime() == RUNTIME_INF)
+				new_bw = -1;
+			else
+				new_bw = to_ratio(global_dl_period(),
+						  global_dl_runtime());
+			/*
+			 * FIXME: As above...
+			 */
+			for_each_possible_cpu(i) {
+				struct dl_bw *dl_b = dl_bw_of(i);
+
+				raw_spin_lock(&dl_b->lock);
+				dl_b->bw = new_bw;
+				raw_spin_unlock(&dl_b->lock);
+			}
+		}
+
+		raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock,
+					   flags);
+	}
+	mutex_unlock(&mutex);
+
 	return ret;
 }
 
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -16,6 +16,8 @@
  */
 #include "sched.h"
 
+struct dl_bandwidth def_dl_bandwidth;
+
 static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
 {
 	return container_of(dl_se, struct task_struct, dl);
@@ -46,6 +48,27 @@ static inline int is_leftmost(struct tas
 	return dl_rq->rb_leftmost == &dl_se->rb_node;
 }
 
+void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime)
+{
+	raw_spin_lock_init(&dl_b->dl_runtime_lock);
+	dl_b->dl_period = period;
+	dl_b->dl_runtime = runtime;
+}
+
+extern unsigned long to_ratio(u64 period, u64 runtime);
+
+void init_dl_bw(struct dl_bw *dl_b)
+{
+	raw_spin_lock_init(&dl_b->lock);
+	raw_spin_lock(&def_dl_bandwidth.dl_runtime_lock);
+	if (global_dl_runtime() == RUNTIME_INF)
+		dl_b->bw = -1;
+	else
+		dl_b->bw = to_ratio(global_dl_period(), global_dl_runtime());
+	raw_spin_unlock(&def_dl_bandwidth.dl_runtime_lock);
+	dl_b->total_bw = 0;
+}
+
 void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
 {
 	dl_rq->rb_root = RB_ROOT;
@@ -57,6 +80,8 @@ void init_dl_rq(struct dl_rq *dl_rq, str
 	dl_rq->dl_nr_migratory = 0;
 	dl_rq->overloaded = 0;
 	dl_rq->pushable_dl_tasks_root = RB_ROOT;
+#else
+	init_dl_bw(&dl_rq->dl_bw);
 #endif
 }
 
@@ -359,8 +384,9 @@ static bool dl_entity_overflow(struct sc
 	 * of anything below microseconds resolution is actually fiction
 	 * (but still we want to give the user that illusion >;).
 	 */
-	left = (pi_se->dl_period >> 10) * (dl_se->runtime >> 10);
-	right = ((dl_se->deadline - t) >> 10) * (pi_se->dl_runtime >> 10);
+	left = (pi_se->dl_period >> DL_SCALE) * (dl_se->runtime >> DL_SCALE);
+	right = ((dl_se->deadline - t) >> DL_SCALE) *
+		(pi_se->dl_runtime >> DL_SCALE);
 
 	return dl_time_before(right, left);
 }
@@ -911,8 +937,8 @@ static void check_preempt_curr_dl(struct
 	 * In the unlikely case current and p have the same deadline
 	 * let us try to decide what's the best thing to do...
 	 */
-	if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0 &&
-	    !need_resched())
+	if ((p->dl.deadline == rq->curr->dl.deadline) &&
+	    !test_tsk_need_resched(rq->curr))
 		check_preempt_equal_dl(rq, p);
 #endif /* CONFIG_SMP */
 }
@@ -1000,6 +1026,14 @@ static void task_fork_dl(struct task_str
 static void task_dead_dl(struct task_struct *p)
 {
 	struct hrtimer *timer = &p->dl.dl_timer;
+	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
+	/*
+	 * Since we are TASK_DEAD we won't slip out of the domain!
+	 */
+	raw_spin_lock_irq(&dl_b->lock);
+	dl_b->total_bw -= p->dl.dl_bw;
+	raw_spin_unlock_irq(&dl_b->lock);
 
 	hrtimer_cancel(timer);
 }
@@ -1226,7 +1260,7 @@ static struct task_struct *pick_next_pus
 	BUG_ON(task_current(rq, p));
 	BUG_ON(p->nr_cpus_allowed <= 1);
 
-	BUG_ON(!p->se.on_rq);
+	BUG_ON(!p->on_rq);
 	BUG_ON(!dl_task(p));
 
 	return p;
@@ -1373,7 +1407,7 @@ static int pull_dl_task(struct rq *this_
 		     dl_time_before(p->dl.deadline,
 				    this_rq->dl.earliest_dl.curr))) {
 			WARN_ON(p == src_rq->curr);
-			WARN_ON(!p->se.on_rq);
+			WARN_ON(!p->on_rq);
 
 			/*
 			 * Then we pull iff p has actually an earlier
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -82,6 +82,13 @@ extern void update_cpu_load_active(struc
  */
 #define RUNTIME_INF	((u64)~0ULL)
 
+/*
+ * Single value that decides SCHED_DEADLINE internal math precision.
+ * 10 -> just above 1us
+ * 9  -> just above 0.5us
+ */
+#define DL_SCALE (10)
+
 static inline int rt_policy(int policy)
 {
 	if (policy == SCHED_FIFO || policy == SCHED_RR)
@@ -106,7 +113,7 @@ static inline int task_has_dl_policy(str
 	return dl_policy(p->policy);
 }
 
-static inline int dl_time_before(u64 a, u64 b)
+static inline bool dl_time_before(u64 a, u64 b)
 {
 	return (s64)(a - b) < 0;
 }
@@ -114,8 +121,8 @@ static inline int dl_time_before(u64 a,
 /*
  * Tells if entity @a should preempt entity @b.
  */
-static inline
-int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
+static inline bool
+dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
 {
 	return dl_time_before(a->deadline, b->deadline);
 }
@@ -135,6 +142,50 @@ struct rt_bandwidth {
 	u64			rt_runtime;
 	struct hrtimer		rt_period_timer;
 };
+/*
+ * To keep the bandwidth of -deadline tasks and groups under control
+ * we need some place where:
+ *  - store the maximum -deadline bandwidth of the system (the group);
+ *  - cache the fraction of that bandwidth that is currently allocated.
+ *
+ * This is all done in the data structure below. It is similar to the
+ * one used for RT-throttling (rt_bandwidth), with the main difference
+ * that, since here we are only interested in admission control, we
+ * do not decrease any runtime while the group "executes", nor do we
+ * need a timer to replenish it.
+ *
+ * With respect to SMP, the bandwidth is given on a per-CPU basis,
+ * meaning that:
+ *  - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
+ *  - dl_total_bw array contains, in the i-th element, the currently
+ *    allocated bandwidth on the i-th CPU.
+ * Moreover, groups consume bandwidth on each CPU, while tasks only
+ * consume bandwidth on the CPU they're running on.
+ * Finally, dl_total_bw_cpu is used to cache the index of dl_total_bw
+ * that will be shown the next time the proc or cgroup controls are
+ * read. It can in turn be changed by writing to its own control.
+ */
+struct dl_bandwidth {
+	raw_spinlock_t dl_runtime_lock;
+	u64 dl_runtime;
+	u64 dl_period;
+};
+
+static inline int dl_bandwidth_enabled(void)
+{
+	return sysctl_sched_dl_runtime >= 0;
+}
+
+extern struct dl_bw *dl_bw_of(int i);
+
+struct dl_bw {
+	raw_spinlock_t lock;
+	u64 bw, total_bw;
+};
+
+static inline u64 global_dl_period(void);
+static inline u64 global_dl_runtime(void);
 
 extern struct mutex sched_domains_mutex;
 
@@ -422,6 +473,8 @@ struct dl_rq {
 	 */
 	struct rb_root pushable_dl_tasks_root;
 	struct rb_node *pushable_dl_tasks_leftmost;
+#else
+	struct dl_bw dl_bw;
 #endif
 };
 
@@ -448,6 +501,7 @@ struct root_domain {
 	 */
 	cpumask_var_t dlo_mask;
 	atomic_t dlo_count;
+	struct dl_bw dl_bw;
 
 	/*
 	 * The "RT overload" flag: it gets set if a CPU has more than
@@ -896,7 +950,18 @@ static inline u64 global_rt_runtime(void
 	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
 }
 
+static inline u64 global_dl_period(void)
+{
+	return (u64)sysctl_sched_dl_period * NSEC_PER_USEC;
+}
+
+static inline u64 global_dl_runtime(void)
+{
+	if (sysctl_sched_dl_runtime < 0)
+		return RUNTIME_INF;
 
+	return (u64)sysctl_sched_dl_runtime * NSEC_PER_USEC;
+}
 
 static inline int task_current(struct rq *rq, struct task_struct *p)
 {
@@ -1144,6 +1209,7 @@ extern void update_max_interval(void);
 extern void init_sched_dl_class(void);
 extern void init_sched_rt_class(void);
 extern void init_sched_fair_class(void);
+extern void init_sched_dl_class(void);
 
 extern void resched_task(struct task_struct *p);
 extern void resched_cpu(int cpu);
@@ -1151,8 +1217,12 @@ extern void resched_cpu(int cpu);
 extern struct rt_bandwidth def_rt_bandwidth;
 extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
 
+extern struct dl_bandwidth def_dl_bandwidth;
+extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime);
 extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
 
+unsigned long to_ratio(u64 period, u64 runtime);
+
 extern void update_idle_cpu_load(struct rq *this_rq);
 
 extern void init_task_runnable_average(struct task_struct *p);
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -414,6 +414,20 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sched_rr_handler,
 	},
+	{
+		.procname	= "sched_dl_period_us",
+		.data		= &sysctl_sched_dl_period,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sched_dl_handler,
+	},
+	{
+		.procname	= "sched_dl_runtime_us",
+		.data		= &sysctl_sched_dl_runtime,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= sched_dl_handler,
+	},
 #ifdef CONFIG_SCHED_AUTOGROUP
 	{
 		.procname	= "sched_autogroup_enabled",
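
An editorial aside (not part of the patch): the admission bookkeeping
the dl_bw structure above carries boils down to a cap (bw) and the sum
of admitted task bandwidths (total_bw), both to_ratio()-style
fixed-point fractions where 1 << 20 stands for 100%. A minimal
single-CPU userspace sketch of that check, simplified with respect to
the real dl_overflow() used elsewhere in the series:

#include <stdint.h>
#include <stdio.h>

struct dl_bw_sketch { int64_t bw, total_bw; };  /* bw == -1: no limit */

static int dl_admit(struct dl_bw_sketch *b, int64_t task_bw)
{
        if (b->bw != -1 && b->total_bw + task_bw > b->bw)
                return -1;              /* would exceed the cap: -EBUSY */
        b->total_bw += task_bw;
        return 0;
}

int main(void)
{
        struct dl_bw_sketch b = { .bw = 996147, .total_bw = 0 }; /* ~95% */

        printf("%d\n", dl_admit(&b, 524288));   /*  0: first 50% fits */
        printf("%d\n", dl_admit(&b, 524288));   /* -1: second 50% does not */
        return 0;
}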



^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 10/13] sched: speed up -dl pushes with a push-heap.
  2013-12-17 12:27 [PATCH 00/13] sched, deadline: patches Peter Zijlstra
                   ` (8 preceding siblings ...)
  2013-12-17 12:27 ` [PATCH 09/13] sched: Add bandwidth management for sched_dl Peter Zijlstra
@ 2013-12-17 12:27 ` Peter Zijlstra
  2013-12-17 12:27 ` [PATCH 11/13] sched: Remove sched_setscheduler2() Peter Zijlstra
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2013-12-17 12:27 UTC (permalink / raw)
  To: tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang, jkacur
  Cc: Peter Zijlstra

[-- Attachment #1: 0012-sched-speed-up-dl-pushes-with-a-push-heap.patch --]
[-- Type: text/plain, Size: 14239 bytes --]

From: Juri Lelli <juri.lelli@gmail.com>

Data from tests confirmed that the original active load balancing
logic did not scale in either the number of CPUs or the number of
tasks (as sched_rt does).

Here we provide a global data structure to keep track of the deadlines
of the running tasks in the system. The structure is composed of
a bitmask showing the free CPUs and a max-heap, needed when the system
is heavily loaded.

The implementation and concurrent access scheme are kept simple by
design. However, our measurements show that we can compete with sched_rt
on large multi-CPU machines [1].

Only the push path is addressed; extending this structure to pull
decisions as well is straightforward. However, we are currently
evaluating different data structures (in order to decrease/avoid
contention) that could possibly solve both problems. We will also
re-run the tests taking recent changes inside cpupri into account [2].

[1] http://retis.sssup.it/~jlelli/papers/Ospert11Lelli.pdf
[2] http://www.spinics.net/lists/linux-rt-users/msg06778.html
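
As an editorial illustration (not from the patch), the push decision
the cpudl heap implements can be condensed to a few lines of plain C:
prefer any free CPU, otherwise pick the CPU whose earliest deadline is
the latest, and only if that deadline is later than the task's own.
The affinity masks and locking handled by the real cpudl_find() are
left out of this sketch:

#include <stdint.h>
#include <stdio.h>

struct heap_item { uint64_t dl; int cpu; };

static int pick_cpu(const struct heap_item *heap, int size,
                    uint64_t free_mask, uint64_t task_dl)
{
        int cpu;

        for (cpu = 0; cpu < 64; cpu++)
                if (free_mask & (1ULL << cpu))
                        return cpu;             /* any free CPU wins */

        /* heap[0] is the max: the latest "earliest deadline" of all CPUs */
        if (size > 0 && (int64_t)(task_dl - heap[0].dl) < 0)
                return heap[0].cpu;

        return -1;                              /* no suitable CPU */
}

int main(void)
{
        struct heap_item heap[] = { { 300, 2 }, { 200, 0 }, { 100, 1 } };

        printf("%d\n", pick_cpu(heap, 3, 0x0, 150)); /* 2: 150 is before 300 */
        return 0;
}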

Cc: nicola.manica@disi.unitn.it
Cc: darren@dvhart.com
Cc: oleg@redhat.com
Cc: dhaval.giani@gmail.com
Cc: paulmck@linux.vnet.ibm.com
Cc: fchecconi@gmail.com
Cc: fweisbec@gmail.com
Cc: harald.gustafsson@ericsson.com
Cc: hgu1972@gmail.com
Cc: insop.song@gmail.com
Cc: p.faure@akatech.ch
Cc: jkacur@redhat.com
Cc: raistlin@linux.it
Cc: johan.eker@ericsson.com
Cc: rostedt@goodmis.org
Cc: liming.wang@windriver.com
Cc: tglx@linutronix.de
Cc: luca.abeni@unitn.it
Cc: tommaso.cucinotta@sssup.it
Cc: michael@amarulasolutions.com
Cc: bruce.ashfield@windriver.com
Cc: vincent.guittot@linaro.org
Cc: mingo@redhat.com
Cc: claudio@evidence.eu.com
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/Makefile      |    2 +-
 kernel/sched/core.c        |    3 +
 kernel/sched/cpudeadline.c |  216 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/cpudeadline.h |   33 +++++++
 kernel/sched/deadline.c    |   53 +++--------
 kernel/sched/sched.h       |    2 +
 6 files changed, 269 insertions(+), 40 deletions(-)
 create mode 100644 kernel/sched/cpudeadline.c
 create mode 100644 kernel/sched/cpudeadline.h

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index b039035..9a95c8c 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -14,7 +14,7 @@ endif
 obj-y += core.o proc.o clock.o cputime.o
 obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o
 obj-y += wait.o completion.o
-obj-$(CONFIG_SMP) += cpupri.o
+obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c506792..2ae0acd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5278,6 +5278,7 @@ static void free_rootdomain(struct rcu_head *rcu)
 	struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
 
 	cpupri_cleanup(&rd->cpupri);
+	cpudl_cleanup(&rd->cpudl);
 	free_cpumask_var(rd->dlo_mask);
 	free_cpumask_var(rd->rto_mask);
 	free_cpumask_var(rd->online);
@@ -5336,6 +5337,8 @@ static int init_rootdomain(struct root_domain *rd)
 		goto free_dlo_mask;
 
 	init_dl_bw(&rd->dl_bw);
+	if (cpudl_init(&rd->cpudl) != 0)
+		goto free_dlo_mask;
 
 	if (cpupri_init(&rd->cpupri) != 0)
 		goto free_rto_mask;
diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
new file mode 100644
index 0000000..3bcade5
--- /dev/null
+++ b/kernel/sched/cpudeadline.c
@@ -0,0 +1,216 @@
+/*
+ *  kernel/sched/cpudl.c
+ *
+ *  Global CPU deadline management
+ *
+ *  Author: Juri Lelli <j.lelli@sssup.it>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; version 2
+ *  of the License.
+ */
+
+#include <linux/gfp.h>
+#include <linux/kernel.h>
+#include "cpudeadline.h"
+
+static inline int parent(int i)
+{
+	return (i - 1) >> 1;
+}
+
+static inline int left_child(int i)
+{
+	return (i << 1) + 1;
+}
+
+static inline int right_child(int i)
+{
+	return (i << 1) + 2;
+}
+
+static inline int dl_time_before(u64 a, u64 b)
+{
+	return (s64)(a - b) < 0;
+}
+
+void cpudl_exchange(struct cpudl *cp, int a, int b)
+{
+	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
+
+	swap(cp->elements[a], cp->elements[b]);
+	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
+}
+
+void cpudl_heapify(struct cpudl *cp, int idx)
+{
+	int l, r, largest;
+
+	/* adapted from lib/prio_heap.c */
+	while(1) {
+		l = left_child(idx);
+		r = right_child(idx);
+		largest = idx;
+
+		if ((l < cp->size) && dl_time_before(cp->elements[idx].dl,
+							cp->elements[l].dl))
+			largest = l;
+		if ((r < cp->size) && dl_time_before(cp->elements[largest].dl,
+							cp->elements[r].dl))
+			largest = r;
+		if (largest == idx)
+			break;
+
+		/* Push idx down the heap one level and bump one up */
+		cpudl_exchange(cp, largest, idx);
+		idx = largest;
+	}
+}
+
+void cpudl_change_key(struct cpudl *cp, int idx, u64 new_dl)
+{
+	WARN_ON(idx > num_present_cpus() || idx == IDX_INVALID);
+
+	if (dl_time_before(new_dl, cp->elements[idx].dl)) {
+		cp->elements[idx].dl = new_dl;
+		cpudl_heapify(cp, idx);
+	} else {
+		cp->elements[idx].dl = new_dl;
+		while (idx > 0 && dl_time_before(cp->elements[parent(idx)].dl,
+					cp->elements[idx].dl)) {
+			cpudl_exchange(cp, idx, parent(idx));
+			idx = parent(idx);
+		}
+	}
+}
+
+static inline int cpudl_maximum(struct cpudl *cp)
+{
+	return cp->elements[0].cpu;
+}
+
+/*
+ * cpudl_find - find the best (later-dl) CPU in the system
+ * @cp: the cpudl max-heap context
+ * @p: the task
+ * @later_mask: a mask to fill in with the selected CPUs (or NULL)
+ *
+ * Returns: int - best CPU (heap maximum if suitable)
+ */
+int cpudl_find(struct cpudl *cp, struct task_struct *p,
+	       struct cpumask *later_mask)
+{
+	int best_cpu = -1;
+	const struct sched_dl_entity *dl_se = &p->dl;
+
+	if (later_mask && cpumask_and(later_mask, cp->free_cpus,
+			&p->cpus_allowed) && cpumask_and(later_mask,
+			later_mask, cpu_active_mask)) {
+		best_cpu = cpumask_any(later_mask);
+		goto out;
+	} else if (cpumask_test_cpu(cpudl_maximum(cp), &p->cpus_allowed) &&
+			dl_time_before(dl_se->deadline, cp->elements[0].dl)) {
+		best_cpu = cpudl_maximum(cp);
+		if (later_mask)
+			cpumask_set_cpu(best_cpu, later_mask);
+	}
+
+out:
+	WARN_ON(best_cpu > num_present_cpus() && best_cpu != -1);
+
+	return best_cpu;
+}
+
+/*
+ * cpudl_set - update the cpudl max-heap
+ * @cp: the cpudl max-heap context
+ * @cpu: the target cpu
+ * @dl: the new earliest deadline for this cpu
+ *
+ * Notes: assumes cpu_rq(cpu)->lock is locked
+ *
+ * Returns: (void)
+ */
+void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
+{
+	int old_idx, new_cpu;
+	unsigned long flags;
+
+	WARN_ON(cpu > num_present_cpus());
+
+	raw_spin_lock_irqsave(&cp->lock, flags);
+	old_idx = cp->cpu_to_idx[cpu];
+	if (!is_valid) {
+		/* remove item */
+		if (old_idx == IDX_INVALID) {
+			/*
+			 * Nothing to remove if old_idx was invalid.
+			 * This could happen if a rq_offline_dl is
+			 * called for a CPU without -dl tasks running.
+			 */
+			goto out;
+		}
+		new_cpu = cp->elements[cp->size - 1].cpu;
+		cp->elements[old_idx].dl = cp->elements[cp->size - 1].dl;
+		cp->elements[old_idx].cpu = new_cpu;
+		cp->size--;
+		cp->cpu_to_idx[new_cpu] = old_idx;
+		cp->cpu_to_idx[cpu] = IDX_INVALID;
+		while (old_idx > 0 && dl_time_before(
+				cp->elements[parent(old_idx)].dl,
+				cp->elements[old_idx].dl)) {
+			cpudl_exchange(cp, old_idx, parent(old_idx));
+			old_idx = parent(old_idx);
+		}
+		cpumask_set_cpu(cpu, cp->free_cpus);
+		cpudl_heapify(cp, old_idx);
+
+		goto out;
+	}
+
+	if (old_idx == IDX_INVALID) {
+		cp->size++;
+		cp->elements[cp->size - 1].dl = 0;
+		cp->elements[cp->size - 1].cpu = cpu;
+		cp->cpu_to_idx[cpu] = cp->size - 1;
+		cpudl_change_key(cp, cp->size - 1, dl);
+		cpumask_clear_cpu(cpu, cp->free_cpus);
+	} else {
+		cpudl_change_key(cp, old_idx, dl);
+	}
+
+out:
+	raw_spin_unlock_irqrestore(&cp->lock, flags);
+}
+
+/*
+ * cpudl_init - initialize the cpudl structure
+ * @cp: the cpudl max-heap context
+ */
+int cpudl_init(struct cpudl *cp)
+{
+	int i;
+
+	memset(cp, 0, sizeof(*cp));
+	raw_spin_lock_init(&cp->lock);
+	cp->size = 0;
+	for (i = 0; i < NR_CPUS; i++)
+		cp->cpu_to_idx[i] = IDX_INVALID;
+	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL))
+		return -ENOMEM;
+	cpumask_setall(cp->free_cpus);
+
+	return 0;
+}
+
+/*
+ * cpudl_cleanup - clean up the cpudl structure
+ * @cp: the cpudl max-heap context
+ */
+void cpudl_cleanup(struct cpudl *cp)
+{
+	/*
+	 * nothing to do for the moment
+	 */
+}
diff --git a/kernel/sched/cpudeadline.h b/kernel/sched/cpudeadline.h
new file mode 100644
index 0000000..a202789
--- /dev/null
+++ b/kernel/sched/cpudeadline.h
@@ -0,0 +1,33 @@
+#ifndef _LINUX_CPUDL_H
+#define _LINUX_CPUDL_H
+
+#include <linux/sched.h>
+
+#define IDX_INVALID     -1
+
+struct array_item {
+	u64 dl;
+	int cpu;
+};
+
+struct cpudl {
+	raw_spinlock_t lock;
+	int size;
+	int cpu_to_idx[NR_CPUS];
+	struct array_item elements[NR_CPUS];
+	cpumask_var_t free_cpus;
+};
+
+
+#ifdef CONFIG_SMP
+int cpudl_find(struct cpudl *cp, struct task_struct *p,
+	       struct cpumask *later_mask);
+void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid);
+int cpudl_init(struct cpudl *cp);
+void cpudl_cleanup(struct cpudl *cp);
+#else
+#define cpudl_set(cp, cpu, dl, is_valid) do { } while (0)
+#define cpudl_init(cp) do { } while (0)
+#endif /* CONFIG_SMP */
+
+#endif /* _LINUX_CPUDL_H */
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0da5e15..84e0f63 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -16,6 +16,8 @@
  */
 #include "sched.h"
 
+#include <linux/slab.h>
+
 struct dl_bandwidth def_dl_bandwidth;
 
 static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
@@ -639,6 +641,7 @@ static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
 		 */
 		dl_rq->earliest_dl.next = dl_rq->earliest_dl.curr;
 		dl_rq->earliest_dl.curr = deadline;
+		cpudl_set(&rq->rd->cpudl, rq->cpu, deadline, 1);
 	} else if (dl_rq->earliest_dl.next == 0 ||
 		   dl_time_before(deadline, dl_rq->earliest_dl.next)) {
 		/*
@@ -662,6 +665,7 @@ static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
 	if (!dl_rq->dl_nr_running) {
 		dl_rq->earliest_dl.curr = 0;
 		dl_rq->earliest_dl.next = 0;
+		cpudl_set(&rq->rd->cpudl, rq->cpu, 0, 0);
 	} else {
 		struct rb_node *leftmost = dl_rq->rb_leftmost;
 		struct sched_dl_entity *entry;
@@ -669,6 +673,7 @@ static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
 		entry = rb_entry(leftmost, struct sched_dl_entity, rb_node);
 		dl_rq->earliest_dl.curr = entry->deadline;
 		dl_rq->earliest_dl.next = next_deadline(rq);
+		cpudl_set(&rq->rd->cpudl, rq->cpu, entry->deadline, 1);
 	}
 }
 
@@ -854,9 +859,6 @@ static void yield_task_dl(struct rq *rq)
 #ifdef CONFIG_SMP
 
 static int find_later_rq(struct task_struct *task);
-static int latest_cpu_find(struct cpumask *span,
-			   struct task_struct *task,
-			   struct cpumask *later_mask);
 
 static int
 select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
@@ -903,7 +905,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
 	 * let's hope p can move out.
 	 */
 	if (rq->curr->nr_cpus_allowed == 1 ||
-	    latest_cpu_find(rq->rd->span, rq->curr, NULL) == -1)
+	    cpudl_find(&rq->rd->cpudl, rq->curr, NULL) == -1)
 		return;
 
 	/*
@@ -911,7 +913,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
 	 * see if it is pushed or pulled somewhere else.
 	 */
 	if (p->nr_cpus_allowed != 1 &&
-	    latest_cpu_find(rq->rd->span, p, NULL) != -1)
+	    cpudl_find(&rq->rd->cpudl, p, NULL) != -1)
 		return;
 
 	resched_task(rq->curr);
@@ -1084,39 +1086,6 @@ next_node:
 	return NULL;
 }
 
-static int latest_cpu_find(struct cpumask *span,
-			   struct task_struct *task,
-			   struct cpumask *later_mask)
-{
-	const struct sched_dl_entity *dl_se = &task->dl;
-	int cpu, found = -1, best = 0;
-	u64 max_dl = 0;
-
-	for_each_cpu(cpu, span) {
-		struct rq *rq = cpu_rq(cpu);
-		struct dl_rq *dl_rq = &rq->dl;
-
-		if (cpumask_test_cpu(cpu, &task->cpus_allowed) &&
-		    (!dl_rq->dl_nr_running || dl_time_before(dl_se->deadline,
-		     dl_rq->earliest_dl.curr))) {
-			if (later_mask)
-				cpumask_set_cpu(cpu, later_mask);
-			if (!best && !dl_rq->dl_nr_running) {
-				best = 1;
-				found = cpu;
-			} else if (!best &&
-				   dl_time_before(max_dl,
-						  dl_rq->earliest_dl.curr)) {
-				max_dl = dl_rq->earliest_dl.curr;
-				found = cpu;
-			}
-		} else if (later_mask)
-			cpumask_clear_cpu(cpu, later_mask);
-	}
-
-	return found;
-}
-
 static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
 
 static int find_later_rq(struct task_struct *task)
@@ -1133,7 +1102,8 @@ static int find_later_rq(struct task_struct *task)
 	if (task->nr_cpus_allowed == 1)
 		return -1;
 
-	best_cpu = latest_cpu_find(task_rq(task)->rd->span, task, later_mask);
+	best_cpu = cpudl_find(&task_rq(task)->rd->cpudl,
+			task, later_mask);
 	if (best_cpu == -1)
 		return -1;
 
@@ -1509,6 +1479,9 @@ static void rq_online_dl(struct rq *rq)
 {
 	if (rq->dl.overloaded)
 		dl_set_overload(rq);
+
+	if (rq->dl.dl_nr_running > 0)
+		cpudl_set(&rq->rd->cpudl, rq->cpu, rq->dl.earliest_dl.curr, 1);
 }
 
 /* Assumes rq->lock is held */
@@ -1516,6 +1489,8 @@ static void rq_offline_dl(struct rq *rq)
 {
 	if (rq->dl.overloaded)
 		dl_clear_overload(rq);
+
+	cpudl_set(&rq->rd->cpudl, rq->cpu, 0, 0);
 }
 
 void init_sched_dl_class(void)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 34e8c07..3140942 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -10,6 +10,7 @@
 #include <linux/slab.h>
 
 #include "cpupri.h"
+#include "cpudeadline.h"
 #include "cpuacct.h"
 
 struct rq;
@@ -501,6 +502,7 @@ struct root_domain {
 	cpumask_var_t dlo_mask;
 	atomic_t dlo_count;
 	struct dl_bw dl_bw;
+	struct cpudl cpudl;
 
 	/*
 	 * The "RT overload" flag: it gets set if a CPU has more than
-- 
1.7.10.4




^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 11/13] sched: Remove sched_setscheduler2()
  2013-12-17 12:27 [PATCH 00/13] sched, deadline: patches Peter Zijlstra
                   ` (9 preceding siblings ...)
  2013-12-17 12:27 ` [PATCH 10/13] sched: speed up -dl pushes with a push-heap Peter Zijlstra
@ 2013-12-17 12:27 ` Peter Zijlstra
  2013-12-17 12:27 ` [PATCH 12/13] sched, deadline: Fixup the smp-affinity mask tests Peter Zijlstra
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2013-12-17 12:27 UTC (permalink / raw)
  To: tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang, jkacur
  Cc: Peter Zijlstra

[-- Attachment #1: peterz-sched-attr-more.patch --]
[-- Type: text/plain, Size: 14060 bytes --]

Expand sched_{set,get}attr() to include the policy and nice value.

This obviates the need for sched_setscheduler2().

The new sched_setattr() call now covers the functionality of:

  sched_setscheduler(),
  sched_setparam(),
  setpriority(.which = PRIO_PROCESS)

And sched_getattr() now covers:

  sched_getscheduler(),
  sched_getparam(),
  getpriority(.which = PRIO_PROCESS)
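
A hedged userspace sketch (editorial, not part of the patch) of calling
the reshaped interface directly through syscall(2). The layout mirrors
the sched_attr structure introduced below; the syscall number (314 on
x86-64 in this series) and the SCHED_DEADLINE policy value (6) come
from earlier patches in the set and are assumptions as far as this
snippet is concerned:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct sched_attr_v0 {                  /* mirrors struct sched_attr below */
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;            /* SCHED_NORMAL, SCHED_BATCH */
        uint32_t sched_priority;        /* SCHED_FIFO, SCHED_RR */
        uint64_t sched_runtime;         /* SCHED_DEADLINE, in ns */
        uint64_t sched_deadline;
        uint64_t sched_period;
};

#define NR_sched_setattr        314     /* x86-64 number in this series */
#define SCHED_DEADLINE_POLICY   6       /* assumed policy value */

int main(void)
{
        struct sched_attr_v0 attr;

        memset(&attr, 0, sizeof(attr));
        attr.size           = sizeof(attr);     /* SCHED_ATTR_SIZE_VER0 == 48 */
        attr.sched_policy   = SCHED_DEADLINE_POLICY;
        attr.sched_runtime  = 10 * 1000 * 1000;         /* 10ms runtime */
        attr.sched_deadline = 30 * 1000 * 1000;         /* 30ms deadline */
        attr.sched_period   = 30 * 1000 * 1000;         /* 30ms period */

        if (syscall(NR_sched_setattr, 0, &attr))        /* pid 0 == current */
                perror("sched_setattr");
        return 0;
}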

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/arm/include/asm/unistd.h      |    2 
 arch/arm/include/uapi/asm/unistd.h |    5 -
 arch/arm/kernel/calls.S            |    3 
 arch/x86/syscalls/syscall_32.tbl   |    1 
 arch/x86/syscalls/syscall_64.tbl   |    1 
 include/linux/sched.h              |   24 +++--
 include/linux/syscalls.h           |    2 
 kernel/sched/core.c                |  173 +++++++++++++++++++------------------
 kernel/sched/sched.h               |   13 +-
 9 files changed, 119 insertions(+), 105 deletions(-)

--- a/arch/arm/include/asm/unistd.h
+++ b/arch/arm/include/asm/unistd.h
@@ -15,7 +15,7 @@
 
 #include <uapi/asm/unistd.h>
 
-#define __NR_syscalls  (383)
+#define __NR_syscalls  (382)
 #define __ARM_NR_cmpxchg		(__ARM_NR_BASE+0x00fff0)
 
 #define __ARCH_WANT_STAT64
--- a/arch/arm/include/uapi/asm/unistd.h
+++ b/arch/arm/include/uapi/asm/unistd.h
@@ -406,9 +406,8 @@
 #define __NR_process_vm_writev		(__NR_SYSCALL_BASE+377)
 #define __NR_kcmp			(__NR_SYSCALL_BASE+378)
 #define __NR_finit_module		(__NR_SYSCALL_BASE+379)
-#define __NR_sched_setscheduler2	(__NR_SYSCALL_BASE+380)
-#define __NR_sched_setattr		(__NR_SYSCALL_BASE+381)
-#define __NR_sched_getattr		(__NR_SYSCALL_BASE+382)
+#define __NR_sched_setattr		(__NR_SYSCALL_BASE+380)
+#define __NR_sched_getattr		(__NR_SYSCALL_BASE+381)
 
 /*
  * This may need to be greater than __NR_last_syscall+1 in order to
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -389,8 +389,7 @@
 		CALL(sys_process_vm_writev)
 		CALL(sys_kcmp)
 		CALL(sys_finit_module)
-/* 380 */	CALL(sys_sched_setscheduler2)
-		CALL(sys_sched_setattr)
+/* 380 */	CALL(sys_sched_setattr)
 		CALL(sys_sched_getattr)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -359,4 +359,3 @@
 350	i386	finit_module		sys_finit_module
 351	i386	sched_setattr		sys_sched_setattr
 352	i386	sched_getattr		sys_sched_getattr
-353	i386	sched_setscheduler2	sys_sched_setscheduler2
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -322,7 +322,6 @@
 313	common	finit_module		sys_finit_module
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
-316	common	sched_setscheduler2	sys_sched_setscheduler2
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -57,7 +57,7 @@ struct sched_param {
 
 #include <asm/processor.h>
 
-#define SCHED_ATTR_SIZE_VER0	40	/* sizeof first published struct */
+#define SCHED_ATTR_SIZE_VER0	48	/* sizeof first published struct */
 
 /*
  * Extended scheduling parameters data structure.
@@ -85,7 +85,9 @@ struct sched_param {
  *
  * This is reflected by the actual fields of the sched_attr structure:
  *
- *  @sched_priority     task's priority (might still be useful)
+ *  @sched_policy	task's scheduling policy
+ *  @sched_nice		task's nice value      (SCHED_NORMAL/BATCH)
+ *  @sched_priority     task's static priority (SCHED_FIFO/RR)
  *  @sched_flags        for customizing the scheduler behaviour
  *  @sched_deadline     representative of the task's deadline
  *  @sched_runtime      representative of the task's runtime
@@ -102,15 +104,21 @@ struct sched_param {
  * available in the scheduling class file or in Documentation/.
  */
 struct sched_attr {
-	int sched_priority;
-	unsigned int sched_flags;
+	u32 size;
+
+	u32 sched_policy;
+	u64 sched_flags;
+
+	/* SCHED_NORMAL, SCHED_BATCH */
+	s32 sched_nice;
+
+	/* SCHED_FIFO, SCHED_RR */
+	u32 sched_priority;
+
+	/* SCHED_DEADLINE */
 	u64 sched_runtime;
 	u64 sched_deadline;
 	u64 sched_period;
-	u32 size;
-
-	/* Align to u64. */
-	u32 __reserved;
 };
 
 struct exec_domain;
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -278,8 +278,6 @@ asmlinkage long sys_clock_nanosleep(cloc
 asmlinkage long sys_nice(int increment);
 asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
 					struct sched_param __user *param);
-asmlinkage long sys_sched_setscheduler2(pid_t pid, int policy,
-					struct sched_attr __user *attr);
 asmlinkage long sys_sched_setparam(pid_t pid,
 					struct sched_param __user *param);
 asmlinkage long sys_sched_setattr(pid_t pid,
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2973,6 +2973,7 @@ void rt_mutex_setprio(struct task_struct
 	__task_rq_unlock(rq);
 }
 #endif
+
 void set_user_nice(struct task_struct *p, long nice)
 {
 	int old_prio, delta, on_rq;
@@ -3147,24 +3148,6 @@ static struct task_struct *find_process_
 	return pid ? find_task_by_vpid(pid) : current;
 }
 
-/* Actually do priority change: must hold rq lock. */
-static void
-__setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
-{
-	p->policy = policy;
-	p->rt_priority = prio;
-	p->normal_prio = normal_prio(p);
-	/* we are holding p->pi_lock already */
-	p->prio = rt_mutex_getprio(p);
-	if (dl_prio(p->prio))
-		p->sched_class = &dl_sched_class;
-	else if (rt_prio(p->prio))
-		p->sched_class = &rt_sched_class;
-	else
-		p->sched_class = &fair_sched_class;
-	set_load_weight(p);
-}
-
 /*
  * This function initializes the sched_dl_entity of a newly becoming
  * SCHED_DEADLINE task.
@@ -3188,6 +3171,34 @@ __setparam_dl(struct task_struct *p, con
 	dl_se->dl_new = 1;
 }
 
+/* Actually do priority change: must hold pi & rq lock. */
+static void __setscheduler(struct rq *rq, struct task_struct *p,
+			   const struct sched_attr *attr)
+{
+	int policy = attr->sched_policy;
+
+	p->policy = policy;
+
+	if (fair_policy(policy))
+		p->static_prio = NICE_TO_PRIO(attr->sched_nice);
+	if (rt_policy(policy))
+		p->rt_priority = attr->sched_priority;
+	if (dl_policy(policy))
+		__setparam_dl(p, attr);
+
+	p->normal_prio = normal_prio(p);
+	p->prio = rt_mutex_getprio(p);
+
+	if (dl_prio(p->prio))
+		p->sched_class = &dl_sched_class;
+	else if (rt_prio(p->prio))
+		p->sched_class = &rt_sched_class;
+	else
+		p->sched_class = &fair_sched_class;
+
+	set_load_weight(p);
+}
+
 static void
 __getparam_dl(struct task_struct *p, struct sched_attr *attr)
 {
@@ -3234,11 +3245,12 @@ static bool check_same_owner(struct task
 	return match;
 }
 
-static int __sched_setscheduler(struct task_struct *p, int policy,
+static int __sched_setscheduler(struct task_struct *p,
 				const struct sched_attr *attr,
 				bool user)
 {
 	int retval, oldprio, oldpolicy = -1, on_rq, running;
+	int policy = attr->sched_policy;
 	unsigned long flags;
 	const struct sched_class *prev_class;
 	struct rq *rq;
@@ -3271,6 +3283,7 @@ static int __sched_setscheduler(struct t
 	    (p->mm && attr->sched_priority > MAX_USER_RT_PRIO-1) ||
 	    (!p->mm && attr->sched_priority > MAX_RT_PRIO-1))
 		return -EINVAL;
+
 	if ((dl_policy(policy) && !__checkparam_dl(attr)) ||
 	    (rt_policy(policy) != (attr->sched_priority != 0)))
 		return -EINVAL;
@@ -3279,6 +3292,11 @@ static int __sched_setscheduler(struct t
 	 * Allow unprivileged RT tasks to decrease priority:
 	 */
 	if (user && !capable(CAP_SYS_NICE)) {
+		if (fair_policy(policy)) {
+			if (!can_nice(p, attr->sched_nice))
+				return -EPERM;
+		}
+
 		if (rt_policy(policy)) {
 			unsigned long rlim_rtprio =
 					task_rlimit(p, RLIMIT_RTPRIO);
@@ -3337,12 +3355,18 @@ static int __sched_setscheduler(struct t
 	/*
 	 * If not changing anything there's no need to proceed further:
 	 */
-	if (unlikely(policy == p->policy && (!rt_policy(policy) ||
-			attr->sched_priority == p->rt_priority) &&
-			!dl_policy(policy))) {
+	if (unlikely(policy == p->policy)) {
+		if (fair_policy(policy) && attr->sched_nice != TASK_NICE(p))
+			goto change;
+		if (rt_policy(policy) && attr->sched_priority != p->rt_priority)
+			goto change;
+		if (dl_policy(policy))
+			goto change;
+
 		task_rq_unlock(rq, p, &flags);
 		return 0;
 	}
+change:
 
 	if (user) {
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -3399,8 +3423,7 @@ static int __sched_setscheduler(struct t
 	 */
 	if ((dl_policy(policy) || dl_task(p)) &&
 	    dl_overflow(p, policy, attr)) {
-		__task_rq_unlock(rq);
-		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+		task_rq_unlock(rq, p, &flags);
 		return -EBUSY;
 	}
 
@@ -3415,9 +3438,7 @@ static int __sched_setscheduler(struct t
 
 	oldprio = p->prio;
 	prev_class = p->sched_class;
-	if (dl_policy(policy))
-		__setparam_dl(p, attr);
-	__setscheduler(rq, p, policy, attr->sched_priority);
+	__setscheduler(rq, p, attr);
 
 	if (running)
 		p->sched_class->set_curr_task(rq);
@@ -3446,18 +3467,18 @@ int sched_setscheduler(struct task_struc
 		       const struct sched_param *param)
 {
 	struct sched_attr attr = {
+		.sched_policy   = policy,
 		.sched_priority = param->sched_priority
 	};
-	return __sched_setscheduler(p, policy, &attr, true);
+	return __sched_setscheduler(p, &attr, true);
 }
 EXPORT_SYMBOL_GPL(sched_setscheduler);
 
-int sched_setscheduler2(struct task_struct *p, int policy,
-			const struct sched_attr *attr)
+int sched_setattr(struct task_struct *p, const struct sched_attr *attr)
 {
-	return __sched_setscheduler(p, policy, attr, true);
+	return __sched_setscheduler(p, attr, true);
 }
-EXPORT_SYMBOL_GPL(sched_setscheduler2);
+EXPORT_SYMBOL_GPL(sched_setattr);
 
 /**
  * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
@@ -3476,9 +3497,10 @@ int sched_setscheduler_nocheck(struct ta
 			       const struct sched_param *param)
 {
 	struct sched_attr attr = {
+		.sched_policy   = policy,
 		.sched_priority = param->sched_priority
 	};
-	return __sched_setscheduler(p, policy, &attr, false);
+	return __sched_setscheduler(p, &attr, false);
 }
 
 static int
@@ -3561,6 +3583,12 @@ static int sched_copy_attr(struct sched_
 	if (ret)
 		return -EFAULT;
 
+	/*
+	 * XXX: do we want to be lenient like existing syscalls; or do we want
+	 * to be strict and return an error on out-of-bounds values?
+	 */
+	attr->sched_nice = clamp(attr->sched_nice, -20, 19);
+
 out:
 	return ret;
 
@@ -3570,33 +3598,6 @@ static int sched_copy_attr(struct sched_
 	goto out;
 }
 
-static int
-do_sched_setscheduler2(pid_t pid, int policy,
-		       struct sched_attr __user *attr_uptr)
-{
-	struct sched_attr attr;
-	struct task_struct *p;
-	int retval;
-
-	if (!attr_uptr || pid < 0)
-		return -EINVAL;
-
-	if (sched_copy_attr(attr_uptr, &attr))
-		return -EFAULT;
-
-	rcu_read_lock();
-	retval = -ESRCH;
-	p = find_process_by_pid(pid);
-	if (p != NULL) {
-		if (dl_policy(policy))
-			attr.sched_priority = 0;
-		retval = sched_setscheduler2(p, policy, &attr);
-	}
-	rcu_read_unlock();
-
-	return retval;
-}
-
 /**
  * sys_sched_setscheduler - set/change the scheduler policy and RT priority
  * @pid: the pid in question.
@@ -3616,21 +3617,6 @@ SYSCALL_DEFINE3(sched_setscheduler, pid_
 }
 
 /**
- * sys_sched_setscheduler2 - same as above, but with extended sched_param
- * @pid: the pid in question.
- * @policy: new policy (could use extended sched_param).
- * @attr: structure containg the extended parameters.
- */
-SYSCALL_DEFINE3(sched_setscheduler2, pid_t, pid, int, policy,
-		struct sched_attr __user *, attr)
-{
-	if (policy < 0)
-		return -EINVAL;
-
-	return do_sched_setscheduler2(pid, policy, attr);
-}
-
-/**
  * sys_sched_setparam - set/change the RT priority of a thread
  * @pid: the pid in question.
  * @param: structure containing the new RT priority.
@@ -3647,10 +3633,26 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, p
  * @pid: the pid in question.
  * @attr: structure containing the extended parameters.
  */
-SYSCALL_DEFINE2(sched_setattr, pid_t, pid,
-		struct sched_attr __user *, attr)
+SYSCALL_DEFINE2(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr)
 {
-	return do_sched_setscheduler2(pid, -1, attr);
+	struct sched_attr attr;
+	struct task_struct *p;
+	int retval;
+
+	if (!uattr || pid < 0)
+		return -EINVAL;
+
+	if (sched_copy_attr(uattr, &attr))
+		return -EFAULT;
+
+	rcu_read_lock();
+	retval = -ESRCH;
+	p = find_process_by_pid(pid);
+	if (p != NULL)
+		retval = sched_setattr(p, &attr);
+	rcu_read_unlock();
+
+	return retval;
 }
 
 /**
@@ -3797,8 +3799,14 @@ SYSCALL_DEFINE3(sched_getattr, pid_t, pi
 	if (retval)
 		goto out_unlock;
 
-	__getparam_dl(p, &attr);
-	attr.sched_priority = p->rt_priority;
+	attr.sched_policy = p->policy;
+	if (task_has_dl_policy(p))
+		__getparam_dl(p, &attr);
+	else if (task_has_rt_policy(p))
+		attr.sched_priority = p->rt_priority;
+	else
+		attr.sched_nice = TASK_NICE(p);
+
 	rcu_read_unlock();
 
 	retval = sched_read_attr(uattr, &attr, size);
@@ -6948,13 +6956,16 @@ EXPORT_SYMBOL(__might_sleep);
 static void normalize_task(struct rq *rq, struct task_struct *p)
 {
 	const struct sched_class *prev_class = p->sched_class;
+	struct sched_attr attr = {
+		.sched_policy = SCHED_NORMAL,
+	};
 	int old_prio = p->prio;
 	int on_rq;
 
 	on_rq = p->on_rq;
 	if (on_rq)
 		dequeue_task(rq, p, 0);
-	__setscheduler(rq, p, SCHED_NORMAL, 0);
+	__setscheduler(rq, p, &attr);
 	if (on_rq) {
 		enqueue_task(rq, p, 0);
 		resched_task(rq->curr);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -90,18 +90,19 @@ extern void update_cpu_load_active(struc
  */
 #define DL_SCALE (10)
 
+static inline int fair_policy(int policy)
+{
+	return policy == SCHED_NORMAL || policy == SCHED_BATCH;
+}
+
 static inline int rt_policy(int policy)
 {
-	if (policy == SCHED_FIFO || policy == SCHED_RR)
-		return 1;
-	return 0;
+	return policy == SCHED_FIFO || policy == SCHED_RR;
 }
 
 static inline int dl_policy(int policy)
 {
-	if (unlikely(policy == SCHED_DEADLINE))
-		return 1;
-	return 0;
+	return unlikely(policy == SCHED_DEADLINE);
 }
 
 static inline int task_has_rt_policy(struct task_struct *p)



^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 12/13] sched, deadline: Fixup the smp-affinity mask tests
  2013-12-17 12:27 [PATCH 00/13] sched, deadline: patches Peter Zijlstra
                   ` (10 preceding siblings ...)
  2013-12-17 12:27 ` [PATCH 11/13] sched: Remove sched_setscheduler2() Peter Zijlstra
@ 2013-12-17 12:27 ` Peter Zijlstra
  2013-12-17 12:27 ` [PATCH 13/13] sched, deadline: Remove the sysctl_sched_dl knobs Peter Zijlstra
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2013-12-17 12:27 UTC (permalink / raw)
  To: tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang, jkacur
  Cc: Peter Zijlstra

[-- Attachment #1: peterz-frob-smp-affinity.patch --]
[-- Type: text/plain, Size: 3330 bytes --]

For now deadline tasks are not allowed to set SMP affinity; however,
the current tests are wrong, so cure this.

The test in __sched_setscheduler() also uses an on-stack cpumask_t
which is a no-no.

Change both tests to use cpumask_subset() such that we test that the root
domain span is a subset of the cpus_allowed mask. This way we're
sure the tasks can always run on all CPUs they can be balanced over,
and have no effective affinity constraints.
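
A tiny editorial sketch (not from the patch) of the subset test on
plain words, standing in for cpumask_subset(): the root domain span
must not contain a CPU that is missing from the task's affinity mask.

#include <stdio.h>

static int is_subset(unsigned long span, unsigned long allowed)
{
        return (span & ~allowed) == 0;
}

int main(void)
{
        unsigned long span = 0x0f;              /* root domain: CPUs 0-3 */

        printf("%d\n", is_subset(span, 0xff));  /* 1: affinity 0-7 is fine */
        printf("%d\n", is_subset(span, 0x03));  /* 0: affinity 0-1 too small */
        return 0;
}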

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/core.c  |   44 +++++++++---------------------------
 kernel/sched/rt.c    |   62 +++++++++++++++++++++++++++++++++++++--------------
 kernel/sched/sched.h |    2 +
 3 files changed, 58 insertions(+), 50 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3384,23 +3384,14 @@ static int __sched_setscheduler(struct t
 #ifdef CONFIG_SMP
 		if (dl_bandwidth_enabled() && dl_policy(policy)) {
 			cpumask_t *span = rq->rd->span;
-			cpumask_t act_affinity;
-
-			/*
-			 * cpus_allowed mask is statically initialized with
-			 * CPU_MASK_ALL, span is instead dynamic. Here we
-			 * compute the "dynamic" affinity of a task.
-			 */
-			cpumask_and(&act_affinity, &p->cpus_allowed,
-				    cpu_active_mask);
 
 			/*
 			 * Don't allow tasks with an affinity mask smaller than
 			 * the entire root_domain to become SCHED_DEADLINE. We
 			 * will also fail if there's no bandwidth available.
 			 */
-			if (!cpumask_equal(&act_affinity, span) ||
-			    		   rq->rd->dl_bw.bw == 0) {
+			if (!cpumask_subset(span, &p->cpus_allowed) ||
+			    rq->rd->dl_bw.bw == 0) {
 				__task_rq_unlock(rq);
 				raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 				return -EPERM;
@@ -3421,8 +3412,7 @@ static int __sched_setscheduler(struct t
 	 * of a SCHED_DEADLINE task) we need to check if enough bandwidth
 	 * is available.
 	 */
-	if ((dl_policy(policy) || dl_task(p)) &&
-	    dl_overflow(p, policy, attr)) {
+	if ((dl_policy(policy) || dl_task(p)) && dl_overflow(p, policy, attr)) {
 		task_rq_unlock(rq, p, &flags);
 		return -EBUSY;
 	}
@@ -3861,6 +3851,10 @@ long sched_setaffinity(pid_t pid, const
 	if (retval)
 		goto out_unlock;
 
+
+	cpuset_cpus_allowed(p, cpus_allowed);
+	cpumask_and(new_mask, in_mask, cpus_allowed);
+
 	/*
 	 * Since bandwidth control happens on root_domain basis,
 	 * if admission test is enabled, we only admit -deadline
@@ -3871,16 +3865,12 @@ long sched_setaffinity(pid_t pid, const
 	if (task_has_dl_policy(p)) {
 		const struct cpumask *span = task_rq(p)->rd->span;
 
-		if (dl_bandwidth_enabled() &&
-		    !cpumask_equal(in_mask, span)) {
+		if (dl_bandwidth_enabled() && !cpumask_subset(span, new_mask)) {
 			retval = -EBUSY;
 			goto out_unlock;
 		}
 	}
 #endif
-
-	cpuset_cpus_allowed(p, cpus_allowed);
-	cpumask_and(new_mask, in_mask, cpus_allowed);
 again:
 	retval = set_cpus_allowed_ptr(p, new_mask);
 
@@ -4536,7 +4526,7 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
  * When dealing with a -deadline task, we have to check if moving it to
  * a new CPU is possible or not. In fact, this is only true iff there
  * is enough bandwidth available on such CPU, otherwise we want the
- * whole migration progedure to fail over.
+ * whole migration procedure to fail over.
  */
 static inline
 bool set_task_cpu_dl(struct task_struct *p, unsigned int cpu)



^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 13/13] sched, deadline: Remove the sysctl_sched_dl knobs
  2013-12-17 12:27 [PATCH 00/13] sched, deadline: patches Peter Zijlstra
                   ` (11 preceding siblings ...)
  2013-12-17 12:27 ` [PATCH 12/13] sched, deadline: Fixup the smp-affinity mask tests Peter Zijlstra
@ 2013-12-17 12:27 ` Peter Zijlstra
  2013-12-17 20:17 ` [PATCH] sched, deadline: Properly initialize def_dl_bandwidth lock Steven Rostedt
  2013-12-20 13:51 ` [PATCH 00/13] sched, deadline: patches Juri Lelli
  14 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2013-12-17 12:27 UTC (permalink / raw)
  To: tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang, jkacur
  Cc: Peter Zijlstra

[-- Attachment #1: peterz-frob-admission-control.patch --]
[-- Type: text/plain, Size: 14043 bytes --]

Remove the deadline-specific sysctls for now. The problem with them is
that their interaction with the existing rt knobs is nearly impossible
to get right.

The current situation (as of before this patch) is that the rt and dl
bandwidths are completely separate and we enforce rt+dl < 100%. This is
undesirable because it means that the rt default of 95% leaves us
hardly any room, even though dl tasks are safer than rt tasks.

Another proposed solution (in a discarded patch) was to have the dl
bandwidth be a fraction of the rt bandwidth. This is highly
confusing imo.

Furthermore, neither proposal is consistent with the situation we
actually want, which is rt tasks run from a dl server; in that case
the rt bandwidth is a direct subset of the dl bandwidth.

So whichever way we go, the introduction of dl controls at this point
is painful. Therefore remove them and instead share the rt budget.

This means that for now the rt knobs are used for dl admission control
and the dl runtime is accounted against the rt runtime. I realise that
this isn't entirely desirable either; but whatever we do we appear to
need to change the interface later, so better have a small interface
for now.
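
For illustration only (editorial, not part of the patch), and assuming
the default rt knobs of sched_rt_runtime_us = 950000 and
sched_rt_period_us = 1000000: the -dl admission cap that now derives
from them is the usual to_ratio()-style fraction, roughly 95% of
1 << 20. A simplified to_ratio() (no RUNTIME_INF handling) shows the
arithmetic:

#include <stdint.h>
#include <stdio.h>

static uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
        return (runtime << 20) / period;
}

int main(void)
{
        uint64_t cap = to_ratio(1000000, 950000);

        /* prints: dl_bw cap: 996147 / 1048576 */
        printf("dl_bw cap: %llu / %llu\n",
               (unsigned long long)cap, 1ULL << 20);
        return 0;
}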

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/sched/sysctl.h |   13 --
 kernel/sched/core.c          |  259 +++++++++++--------------------------------
 kernel/sched/deadline.c      |   27 ++++
 kernel/sched/sched.h         |   18 --
 kernel/sysctl.c              |   14 --
 5 files changed, 97 insertions(+), 234 deletions(-)

--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -81,15 +81,6 @@ static inline unsigned int get_sysctl_ti
 extern unsigned int sysctl_sched_rt_period;
 extern int sysctl_sched_rt_runtime;
 
-/*
- *  control SCHED_DEADLINE reservations:
- *
- *  /proc/sys/kernel/sched_dl_period_us
- *  /proc/sys/kernel/sched_dl_runtime_us
- */
-extern unsigned int sysctl_sched_dl_period;
-extern int sysctl_sched_dl_runtime;
-
 #ifdef CONFIG_CFS_BANDWIDTH
 extern unsigned int sysctl_sched_cfs_bandwidth_slice;
 #endif
@@ -108,8 +99,4 @@ extern int sched_rt_handler(struct ctl_t
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
 
-int sched_dl_handler(struct ctl_table *table, int write,
-		void __user *buffer, size_t *lenp,
-		loff_t *ppos);
-
 #endif /* _SCHED_SYSCTL_H */
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6776,7 +6776,7 @@ void __init sched_init(void)
 	init_rt_bandwidth(&def_rt_bandwidth,
 			global_rt_period(), global_rt_runtime());
 	init_dl_bandwidth(&def_dl_bandwidth,
-			global_dl_period(), global_dl_runtime());
+			global_rt_period(), global_rt_runtime());
 
 #ifdef CONFIG_RT_GROUP_SCHED
 	init_rt_bandwidth(&root_task_group.rt_bandwidth,
@@ -7355,64 +7355,11 @@ static long sched_group_rt_period(struct
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
-/*
- * Coupling of -rt and -deadline bandwidth.
- *
- * Here we check if the new -rt bandwidth value is consistent
- * with the system settings for the bandwidth available
- * to -deadline tasks.
- *
- * IOW, we want to enforce that
- *
- *   rt_bandwidth + dl_bandwidth <= 100%
- *
- * is always true.
- */
-static bool __sched_rt_dl_global_constraints(u64 rt_bw)
-{
-	unsigned long flags;
-	u64 dl_bw;
-	bool ret;
-
-	raw_spin_lock_irqsave(&def_dl_bandwidth.dl_runtime_lock, flags);
-	if (global_rt_runtime() == RUNTIME_INF ||
-	    global_dl_runtime() == RUNTIME_INF) {
-		ret = true;
-		goto unlock;
-	}
-
-	dl_bw = to_ratio(def_dl_bandwidth.dl_period,
-			 def_dl_bandwidth.dl_runtime);
-
-	ret = rt_bw + dl_bw <= to_ratio(RUNTIME_INF, RUNTIME_INF);
-unlock:
-	raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock, flags);
-
-	return ret;
-}
-
 #ifdef CONFIG_RT_GROUP_SCHED
 static int sched_rt_global_constraints(void)
 {
-	u64 runtime, period, bw;
 	int ret = 0;
 
-	if (sysctl_sched_rt_period <= 0)
-		return -EINVAL;
-
-	runtime = global_rt_runtime();
-	period = global_rt_period();
-
-	/*
-	 * Sanity check on the sysctl variables.
-	 */
-	if (runtime > period && runtime != RUNTIME_INF)
-		return -EINVAL;
-
-	bw = to_ratio(period, runtime);
-	if (!__sched_rt_dl_global_constraints(bw))
-		return -EINVAL;
-
 	mutex_lock(&rt_constraints_mutex);
 	read_lock(&tasklist_lock);
 	ret = __rt_schedulable(NULL, 0, 0);
@@ -7436,18 +7383,8 @@ static int sched_rt_global_constraints(v
 {
 	unsigned long flags;
 	int i, ret = 0;
-	u64 bw;
-
-	if (sysctl_sched_rt_period <= 0)
-		return -EINVAL;
 
 	raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags);
-	bw = to_ratio(global_rt_period(), global_rt_runtime());
-	if (!__sched_rt_dl_global_constraints(bw)) {
-		ret = -EINVAL;
-		goto unlock;
-	}
-
 	for_each_possible_cpu(i) {
 		struct rt_rq *rt_rq = &cpu_rq(i)->rt;
 
@@ -7455,69 +7392,18 @@ static int sched_rt_global_constraints(v
 		rt_rq->rt_runtime = global_rt_runtime();
 		raw_spin_unlock(&rt_rq->rt_runtime_lock);
 	}
-unlock:
 	raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags);
 
 	return ret;
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
-/*
- * Coupling of -dl and -rt bandwidth.
- *
- * Here we check, while setting the system wide bandwidth available
- * for -dl tasks and groups, if the new values are consistent with
- * the system settings for the bandwidth available to -rt entities.
- *
- * IOW, we want to enforce that
- *
- *   rt_bandwidth + dl_bandwidth <= 100%
- *
- * is always true.
- */
-static bool __sched_dl_rt_global_constraints(u64 dl_bw)
-{
-	u64 rt_bw;
-	bool ret;
-
-	raw_spin_lock(&def_rt_bandwidth.rt_runtime_lock);
-	if (global_dl_runtime() == RUNTIME_INF ||
-	    global_rt_runtime() == RUNTIME_INF) {
-		ret = true;
-		goto unlock;
-	}
-
-	rt_bw = to_ratio(ktime_to_ns(def_rt_bandwidth.rt_period),
-			 def_rt_bandwidth.rt_runtime);
-
-	ret = rt_bw + dl_bw <= to_ratio(RUNTIME_INF, RUNTIME_INF);
-unlock:
-	raw_spin_unlock(&def_rt_bandwidth.rt_runtime_lock);
-
-	return ret;
-}
-
-static int __sched_dl_global_constraints(u64 runtime, u64 period)
-{
-	if (!period || (runtime != RUNTIME_INF && runtime > period))
-		return -EINVAL;
-
-	return 0;
-}
-
 static int sched_dl_global_constraints(void)
 {
-	u64 runtime = global_dl_runtime();
-	u64 period = global_dl_period();
+	u64 runtime = global_rt_runtime();
+	u64 period = global_rt_period();
 	u64 new_bw = to_ratio(period, runtime);
-	int ret, i;
-
-	ret = __sched_dl_global_constraints(runtime, period);
-	if (ret)
-		return ret;
-
-	if (!__sched_dl_rt_global_constraints(new_bw))
-		return -EINVAL;
+	int cpu, ret = 0;
 
 	/*
 	 * Here we want to check the bandwidth not being set to some
@@ -7528,46 +7414,68 @@ static int sched_dl_global_constraints(v
 	 * cycling on root_domains... Discussion on different/better
 	 * solutions is welcome!
 	 */
-	for_each_possible_cpu(i) {
-		struct dl_bw *dl_b = dl_bw_of(i);
+	for_each_possible_cpu(cpu) {
+		struct dl_bw *dl_b = dl_bw_of(cpu);
 
 		raw_spin_lock(&dl_b->lock);
-		if (new_bw < dl_b->total_bw) {
-			raw_spin_unlock(&dl_b->lock);
-			return -EBUSY;
-		}
+		if (new_bw < dl_b->total_bw)
+			ret = -EBUSY;
 		raw_spin_unlock(&dl_b->lock);
+
+		if (ret)
+			break;
 	}
 
-	return 0;
+	return ret;
 }
 
-int sched_rr_handler(struct ctl_table *table, int write,
-		void __user *buffer, size_t *lenp,
-		loff_t *ppos)
+static void sched_dl_do_global(void)
 {
-	int ret;
-	static DEFINE_MUTEX(mutex);
+	u64 new_bw = -1;
+	int cpu;
 
-	mutex_lock(&mutex);
-	ret = proc_dointvec(table, write, buffer, lenp, ppos);
-	/* make sure that internally we keep jiffies */
-	/* also, writing zero resets timeslice to default */
-	if (!ret && write) {
-		sched_rr_timeslice = sched_rr_timeslice <= 0 ?
-			RR_TIMESLICE : msecs_to_jiffies(sched_rr_timeslice);
+	def_dl_bandwidth.dl_period = global_rt_period();
+	def_dl_bandwidth.dl_runtime = global_rt_runtime();
+
+	if (global_rt_runtime() != RUNTIME_INF)
+		new_bw = to_ratio(global_rt_period(), global_rt_runtime());
+
+	/*
+	 * FIXME: As above...
+	 */
+	for_each_possible_cpu(cpu) {
+		struct dl_bw *dl_b = dl_bw_of(cpu);
+
+		raw_spin_lock(&dl_b->lock);
+		dl_b->bw = new_bw;
+		raw_spin_unlock(&dl_b->lock);
 	}
-	mutex_unlock(&mutex);
-	return ret;
+}
+
+static int sched_rt_global_validate(void)
+{
+	if (sysctl_sched_rt_period <= 0)
+		return -EINVAL;
+
+	if (sysctl_sched_rt_runtime > sysctl_sched_rt_period)
+		return -EINVAL;
+
+	return 0;
+}
+
+static void sched_rt_do_global(void)
+{
+	def_rt_bandwidth.rt_runtime = global_rt_runtime();
+	def_rt_bandwidth.rt_period = ns_to_ktime(global_rt_period());
 }
 
 int sched_rt_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos)
 {
-	int ret;
 	int old_period, old_runtime;
 	static DEFINE_MUTEX(mutex);
+	int ret;
 
 	mutex_lock(&mutex);
 	old_period = sysctl_sched_rt_period;
@@ -7576,72 +7484,47 @@ int sched_rt_handler(struct ctl_table *t
 	ret = proc_dointvec(table, write, buffer, lenp, ppos);
 
 	if (!ret && write) {
+		ret = sched_rt_global_validate();
+		if (ret)
+			goto undo;
+
 		ret = sched_rt_global_constraints();
-		if (ret) {
-			sysctl_sched_rt_period = old_period;
-			sysctl_sched_rt_runtime = old_runtime;
-		} else {
-			def_rt_bandwidth.rt_runtime = global_rt_runtime();
-			def_rt_bandwidth.rt_period =
-				ns_to_ktime(global_rt_period());
-		}
+		if (ret)
+			goto undo;
+
+		ret = sched_dl_global_constraints();
+		if (ret)
+			goto undo;
+
+		sched_rt_do_global();
+		sched_dl_do_global();
+	}
+	if (0) {
+undo:
+		sysctl_sched_rt_period = old_period;
+		sysctl_sched_rt_runtime = old_runtime;
 	}
 	mutex_unlock(&mutex);
 
 	return ret;
 }
 
-int sched_dl_handler(struct ctl_table *table, int write,
+int sched_rr_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos)
 {
 	int ret;
-	int old_period, old_runtime;
 	static DEFINE_MUTEX(mutex);
-	unsigned long flags;
 
 	mutex_lock(&mutex);
-	old_period = sysctl_sched_dl_period;
-	old_runtime = sysctl_sched_dl_runtime;
-
 	ret = proc_dointvec(table, write, buffer, lenp, ppos);
-
+	/* make sure that internally we keep jiffies */
+	/* also, writing zero resets timeslice to default */
 	if (!ret && write) {
-		raw_spin_lock_irqsave(&def_dl_bandwidth.dl_runtime_lock,
-				      flags);
-
-		ret = sched_dl_global_constraints();
-		if (ret) {
-			sysctl_sched_dl_period = old_period;
-			sysctl_sched_dl_runtime = old_runtime;
-		} else {
-			u64 new_bw;
-			int i;
-
-			def_dl_bandwidth.dl_period = global_dl_period();
-			def_dl_bandwidth.dl_runtime = global_dl_runtime();
-			if (global_dl_runtime() == RUNTIME_INF)
-				new_bw = -1;
-			else
-				new_bw = to_ratio(global_dl_period(),
-						  global_dl_runtime());
-			/*
-			 * FIXME: As above...
-			 */
-			for_each_possible_cpu(i) {
-				struct dl_bw *dl_b = dl_bw_of(i);
-
-				raw_spin_lock(&dl_b->lock);
-				dl_b->bw = new_bw;
-				raw_spin_unlock(&dl_b->lock);
-			}
-		}
-
-		raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock,
-					   flags);
+		sched_rr_timeslice = sched_rr_timeslice <= 0 ?
+			RR_TIMESLICE : msecs_to_jiffies(sched_rr_timeslice);
 	}
 	mutex_unlock(&mutex);
-
 	return ret;
 }
 
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -63,10 +63,10 @@ void init_dl_bw(struct dl_bw *dl_b)
 {
 	raw_spin_lock_init(&dl_b->lock);
 	raw_spin_lock(&def_dl_bandwidth.dl_runtime_lock);
-	if (global_dl_runtime() == RUNTIME_INF)
+	if (global_rt_runtime() == RUNTIME_INF)
 		dl_b->bw = -1;
 	else
-		dl_b->bw = to_ratio(global_dl_period(), global_dl_runtime());
+		dl_b->bw = to_ratio(global_rt_period(), global_rt_runtime());
 	raw_spin_unlock(&def_dl_bandwidth.dl_runtime_lock);
 	dl_b->total_bw = 0;
 }
@@ -612,6 +612,29 @@ static void update_curr_dl(struct rq *rq
 		if (!is_leftmost(curr, &rq->dl))
 			resched_task(curr);
 	}
+
+	/*
+	 * Because -- for now -- we share the rt bandwidth, we need to
+	 * account our runtime there too, otherwise actual rt tasks
+	 * would be able to exceed the shared quota.
+	 *
+	 * Account to the root rt group for now.
+	 *
+	 * The solution we're working towards is having the RT groups scheduled
+	 * using deadline servers -- however there's a few nasties to figure
+	 * out before that can happen.
+	 */
+	if (rt_bandwidth_enabled()) {
+		struct rt_rq *rt_rq = &rq->rt;
+
+		raw_spin_lock(&rt_rq->rt_runtime_lock);
+		rt_rq->rt_time += delta;
+		/*
+		 * We'll let actual RT tasks worry about the overflow here, we
+		 * have our own CBS to keep us inline -- see above.
+		 */
+		raw_spin_unlock(&rt_rq->rt_runtime_lock);
+	}
 }
 
 #ifdef CONFIG_SMP
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -176,7 +176,7 @@ struct dl_bandwidth {
 
 static inline int dl_bandwidth_enabled(void)
 {
-	return sysctl_sched_dl_runtime >= 0;
+	return sysctl_sched_rt_runtime >= 0;
 }
 
 extern struct dl_bw *dl_bw_of(int i);
@@ -186,9 +186,6 @@ struct dl_bw {
 	u64 bw, total_bw;
 };
 
-static inline u64 global_dl_period(void);
-static inline u64 global_dl_runtime(void);
-
 extern struct mutex sched_domains_mutex;
 
 #ifdef CONFIG_CGROUP_SCHED
@@ -953,19 +950,6 @@ static inline u64 global_rt_runtime(void
 	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
 }
 
-static inline u64 global_dl_period(void)
-{
-	return (u64)sysctl_sched_dl_period * NSEC_PER_USEC;
-}
-
-static inline u64 global_dl_runtime(void)
-{
-	if (sysctl_sched_dl_runtime < 0)
-		return RUNTIME_INF;
-
-	return (u64)sysctl_sched_dl_runtime * NSEC_PER_USEC;
-}
-
 static inline int task_current(struct rq *rq, struct task_struct *p)
 {
 	return rq->curr == p;
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -414,20 +414,6 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sched_rr_handler,
 	},
-	{
-		.procname	= "sched_dl_period_us",
-		.data		= &sysctl_sched_dl_period,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= sched_dl_handler,
-	},
-	{
-		.procname	= "sched_dl_runtime_us",
-		.data		= &sysctl_sched_dl_runtime,
-		.maxlen		= sizeof(int),
-		.mode		= 0644,
-		.proc_handler	= sched_dl_handler,
-	},
 #ifdef CONFIG_SCHED_AUTOGROUP
 	{
 		.procname	= "sched_autogroup_enabled",



^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH] sched, deadline: Properly initialize def_dl_bandwidth lock
  2013-12-17 12:27 [PATCH 00/13] sched, deadline: patches Peter Zijlstra
                   ` (12 preceding siblings ...)
  2013-12-17 12:27 ` [PATCH 13/13] sched, deadline: Remove the sysctl_sched_dl knobs Peter Zijlstra
@ 2013-12-17 20:17 ` Steven Rostedt
  2013-12-18 10:01   ` Peter Zijlstra
  2013-12-20 13:51 ` [PATCH 00/13] sched, deadline: patches Juri Lelli
  14 siblings, 1 reply; 71+ messages in thread
From: Steven Rostedt @ 2013-12-17 20:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur


Spinlocks, even those embedded in structures, need to be properly initialized.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Index: linux-rt.git/kernel/sched/deadline.c
===================================================================
--- linux-rt.git.orig/kernel/sched/deadline.c
+++ linux-rt.git/kernel/sched/deadline.c
@@ -18,7 +18,9 @@
 
 #include <linux/slab.h>
 
-struct dl_bandwidth def_dl_bandwidth;
+struct dl_bandwidth def_dl_bandwidth = {
+	.dl_runtime_lock = __RAW_SPIN_LOCK_UNLOCKED(def_dl_bandwidth.dl_runtime_lock),
+};
 
 static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
 {

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH] sched, deadline: Properly initialize def_dl_bandwidth lock
  2013-12-17 20:17 ` [PATCH] sched, deadline: Properly initialize def_dl_bandwidth lock Steven Rostedt
@ 2013-12-18 10:01   ` Peter Zijlstra
  0 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2013-12-18 10:01 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur

On Tue, Dec 17, 2013 at 03:17:53PM -0500, Steven Rostedt wrote:
> 
> Spinlocks, even those embedded in structures, need to be properly initialized.
> 
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> 
> Index: linux-rt.git/kernel/sched/deadline.c
> ===================================================================
> --- linux-rt.git.orig/kernel/sched/deadline.c
> +++ linux-rt.git/kernel/sched/deadline.c
> @@ -18,7 +18,9 @@
>  
>  #include <linux/slab.h>
>  
> -struct dl_bandwidth def_dl_bandwidth;
> +struct dl_bandwidth def_dl_bandwidth = {
> +	.dl_runtime_lock = __RAW_SPIN_LOCK_UNLOCKED(def_dl_bandwidth.dl_runtime_lock),
> +};
>  
>  static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
>  {

The thing is, init_dl_bandwidth() in sched_init() is supposed to already
do that. The stacktrace you provided out of band:

[    0.000000]  [<ffffffff814e2aca>] spin_bug+0x2b/0x2d
[    0.000000]  [<ffffffff8107535c>] do_raw_spin_lock+0x27/0x104
[    0.000000]  [<ffffffff814ec32e>] _raw_spin_lock+0x20/0x24
[    0.000000]  [<ffffffff8106fc41>] init_dl_bw+0x2d/0x79
[    0.000000]  [<ffffffff810607b9>] init_rootdomain+0x20/0x4f
[    0.000000]  [<ffffffff81b0db07>] sched_init+0x69/0x432
[    0.000000]  [<ffffffff81af3b6c>] start_kernel+0x201/0x41c
[    0.000000]  [<ffffffff81af3773>] ? repair_env_string+0x56/0x56
[    0.000000]  [<ffffffff81af3487>] x86_64_start_reservations+0x2a/0x2c
[    0.000000]  [<ffffffff81af357b>] x86_64_start_kernel+0xf2/0xf9

Has clue though. So it appears we use the lock before we reach that
init_dl_bandwidth.

The below hunk should fix things up -- merged into patch 9/13.

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6768,15 +6768,15 @@ void __init sched_init(void)
 #endif /* CONFIG_CPUMASK_OFFSTACK */
 	}
 
-#ifdef CONFIG_SMP
-	init_defrootdomain();
-#endif
-
 	init_rt_bandwidth(&def_rt_bandwidth,
 			global_rt_period(), global_rt_runtime());
 	init_dl_bandwidth(&def_dl_bandwidth,
 			global_dl_period(), global_dl_runtime());
 
+#ifdef CONFIG_SMP
+	init_defrootdomain();
+#endif
+
 #ifdef CONFIG_RT_GROUP_SCHED
 	init_rt_bandwidth(&root_task_group.rt_bandwidth,
 			global_rt_period(), global_rt_runtime());
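
IOW, after this the ordering in sched_init() becomes (sketch of the
relevant calls only, arguments elided):

	init_rt_bandwidth(&def_rt_bandwidth, ...);
	init_dl_bandwidth(&def_dl_bandwidth, ...);	/* sets up dl_runtime_lock */
#ifdef CONFIG_SMP
	init_defrootdomain();	/* -> init_rootdomain() -> init_dl_bw(),
				 * which takes that lock */
#endif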

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 09/13] sched: Add bandwidth management for sched_dl
  2013-12-17 12:27 ` [PATCH 09/13] sched: Add bandwidth management for sched_dl Peter Zijlstra
@ 2013-12-18 16:55   ` Peter Zijlstra
  2013-12-20 17:13     ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2013-12-18 16:55 UTC (permalink / raw)
  To: tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang, jkacur

On Tue, Dec 17, 2013 at 01:27:29PM +0100, Peter Zijlstra wrote:
> @@ -4381,6 +4592,13 @@ static int __migrate_task(struct task_st
>  		goto fail;
>  
>  	/*
> +	 * If p is -deadline, proceed only if there is enough
> +	 * bandwidth available on dest_cpu
> +	 */
> +	if (unlikely(dl_task(p)) && !set_task_cpu_dl(p, dest_cpu))
> +		goto fail;
> +
> +	/*
>  	 * If we're not on a rq, the next wake-up will ensure we're
>  	 * placed properly.
>  	 */

I just noticed this one.. we can't do this.

The reason we cannot do this is because:

  CPU_DYING -> migration_call() -> migrate_tasks() -> __migrate_task()

cannot fail and hard assumes it _will_ move all tasks off of the dying
cpu; failing this will break hotplug.

Also, I'm not entirely sure why this hunk exists. For GEDF we don't need
this constraint AFAIK: as long as we guarantee we run the N earliest
deadlines, it only matters what the total utilization (root domain wide)
is; the per-cpu utilization is irrelevant.
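
That check is, roughly, the __dl_overflow() test applied over the whole
root domain -- a sketch with a stand-in name, not the literal kernel code:

/*
 * dl_b->bw is the per-CPU quota, dl_b->total_bw the sum of admitted
 * -deadline bandwidth; bw == -1 means no limit. 'cpus' is the number
 * of CPUs in the root domain.
 */
static bool dl_bw_overflows(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
{
	return dl_b->bw != (u64)-1 &&
	       dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
}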

If the purpose is to fail hotplug because taking out the CPU would end
up in over-subscription, then we need a DOWN_PREPARE handler.

Dario, Juri?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 00/13] sched, deadline: patches
  2013-12-17 12:27 [PATCH 00/13] sched, deadline: patches Peter Zijlstra
                   ` (13 preceding siblings ...)
  2013-12-17 20:17 ` [PATCH] sched, deadline: Properly initialize def_dl_bandwidth lock Steven Rostedt
@ 2013-12-20 13:51 ` Juri Lelli
  2013-12-20 14:28   ` Steven Rostedt
  2013-12-20 14:51   ` Peter Zijlstra
  14 siblings, 2 replies; 71+ messages in thread
From: Juri Lelli @ 2013-12-20 13:51 UTC (permalink / raw)
  To: Peter Zijlstra, tglx, mingo, rostedt, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur

Hi,

On 12/17/2013 01:27 PM, Peter Zijlstra wrote:
> Hai..
> 
> This is my current queue of SCHED_DEADLINE; which I hope to merge 'soon'.
> 
> Juri handed me a version that should've (didn't check) included all feedback
> including the new sched_attr interface.
> 
> I did clean up some of the patches; moved some hunks around so that each patch
> compiles on its own etc..
> 
> I then did a number of patches at the end to change some actual stuff around.
> 
> Juri do you have some userspace around to test with? I'm just too likely to
> have wrecked everything; so if you don't have something I need to go write
> something before merging this :-)
> 
> No need to update your test proglets to the new interface, I can lift it if
> that is more convenient -- I know you're conferencing atm.
> 

I use this: https://github.com/gbagnoli/rt-app/tree/new-ABI
It is aligned with the new ABI (but it is missing your last changes).

> One question; the SoB chain in patch 2/13 seems weird; I suppose that
> patch is a collaborative effort? could we cure that by mentioning Fabio
> and Michael in some other way?
> 

We could change their SoB to be a Cc (not sure this cures the thing anyway). Not
even Cc-ing them seems pointless. Then we have them mentioned in
sched/deadline.c copyright (oh! s/2012/2014/ there :-O).

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 00/13] sched, deadline: patches
  2013-12-20 13:51 ` [PATCH 00/13] sched, deadline: patches Juri Lelli
@ 2013-12-20 14:28   ` Steven Rostedt
  2013-12-20 14:51   ` Peter Zijlstra
  1 sibling, 0 replies; 71+ messages in thread
From: Steven Rostedt @ 2013-12-20 14:28 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Peter Zijlstra, tglx, mingo, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur

On Fri, 20 Dec 2013 14:51:18 +0100
Juri Lelli <juri.lelli@gmail.com> wrote:


> > One question; the SoB chain in patch 2/13 seems weird; I suppose that
> > patch is a collaborative effort? could we cure that by mentioning Fabio
> > and Michael in some other way?
> > 
> 
> We could change their SoB to be a Cc (not sure this cure the thing anyway). Not
> even Cc-ing them seems pointless. Then we have them mentioned in
> sched/deadline.c copyright (oh! s/2012/2014/ there :-O).

Well, if they were involved in the development of the code, then their
SoB's would be appropriate. That copyright seems to state that they are
required.

-- Steve

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 00/13] sched, deadline: patches
  2013-12-20 13:51 ` [PATCH 00/13] sched, deadline: patches Juri Lelli
  2013-12-20 14:28   ` Steven Rostedt
@ 2013-12-20 14:51   ` Peter Zijlstra
  2013-12-20 15:19     ` Steven Rostedt
  1 sibling, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2013-12-20 14:51 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur

On Fri, Dec 20, 2013 at 02:51:18PM +0100, Juri Lelli wrote:
> I use this: https://github.com/gbagnoli/rt-app/tree/new-ABI
> It is aligned with the new ABI (but it is missing your last changes).

I'm a moron, I can't find how to make github get me a git:// url I can
clone from.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 00/13] sched, deadline: patches
  2013-12-20 14:51   ` Peter Zijlstra
@ 2013-12-20 15:19     ` Steven Rostedt
  0 siblings, 0 replies; 71+ messages in thread
From: Steven Rostedt @ 2013-12-20 15:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, tglx, mingo, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur

On Fri, 20 Dec 2013 15:51:40 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Dec 20, 2013 at 02:51:18PM +0100, Juri Lelli wrote:
> > I use this: https://github.com/gbagnoli/rt-app/tree/new-ABI
> > It is aligned with the new ABI (but it is missing your last changes).
> 
> I'm a moron, I can't find how to make github get me a git:// url I can
> clone from.

I don't think you can use git:// but you can do:

git clone https://github.com/gbagnoli/rt-app.git

-- Steve

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 09/13] sched: Add bandwidth management for sched_dl
  2013-12-18 16:55   ` Peter Zijlstra
@ 2013-12-20 17:13     ` Peter Zijlstra
  2013-12-20 17:37       ` Steven Rostedt
  2014-01-13 15:55       ` [tip:sched/core] sched/deadline: Fix hotplug admission control tip-bot for Peter Zijlstra
  0 siblings, 2 replies; 71+ messages in thread
From: Peter Zijlstra @ 2013-12-20 17:13 UTC (permalink / raw)
  To: tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang, jkacur

On Wed, Dec 18, 2013 at 05:55:08PM +0100, Peter Zijlstra wrote:

> If the purpose is to fail hotplug because taking out the CPU would end
> up in over-subscription, then we need a DOWN_PREPARE handler.

Juri just said (on IRC) that that was indeed the intended purpose.

---
Subject: sched, deadline: Fix hotplug admission control
From: Peter Zijlstra <peterz@infradead.org>
Date: Thu Dec 19 11:54:45 CET 2013

The current hotplug admission control is broken because:

  CPU_DYING -> migration_call() -> migrate_tasks() -> __migrate_task()

cannot fail and hard assumes it _will_ move all tasks off of the dying
cpu; failing this will break hotplug.

The much simpler solution is a DOWN_PREPARE handler that fails when
removing one CPU gets us below the total allocated bandwidth.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/core.c |   68 ++++++++++++++++------------------------------------
 1 file changed, 21 insertions(+), 47 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1886,9 +1886,9 @@ inline struct dl_bw *dl_bw_of(int i)
 	return &cpu_rq(i)->rd->dl_bw;
 }
 
-static inline int __dl_span_weight(struct rq *rq)
+static inline int dl_bw_cpus(int i)
 {
-	return cpumask_weight(rq->rd->span);
+	return cpumask_weight(cpu_rq(i)->rd->span);
 }
 #else
 inline struct dl_bw *dl_bw_of(int i)
@@ -1896,7 +1896,7 @@ inline struct dl_bw *dl_bw_of(int i)
 	return &cpu_rq(i)->dl.dl_bw;
 }
 
-static inline int __dl_span_weight(struct rq *rq)
+static inline int dl_bw_cpus(int i)
 {
 	return 1;
 }
@@ -1937,7 +1937,7 @@ static int dl_overflow(struct task_struc
 	u64 period = attr->sched_period;
 	u64 runtime = attr->sched_runtime;
 	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
-	int cpus = __dl_span_weight(task_rq(p));
+	int cpus = dl_bw_cpus(task_cpu(p));
 	int err = -1;
 
 	if (new_bw == p->dl.dl_bw)
@@ -4523,42 +4523,6 @@ int set_cpus_allowed_ptr(struct task_str
 EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
 
 /*
- * When dealing with a -deadline task, we have to check if moving it to
- * a new CPU is possible or not. In fact, this is only true iff there
- * is enough bandwidth available on such CPU, otherwise we want the
- * whole migration procedure to fail over.
- */
-static inline
-bool set_task_cpu_dl(struct task_struct *p, unsigned int cpu)
-{
-	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
-	struct dl_bw *cpu_b = dl_bw_of(cpu);
-	int ret = 1;
-	u64 bw;
-
-	if (dl_b == cpu_b)
-		return 1;
-
-	raw_spin_lock(&dl_b->lock);
-	raw_spin_lock(&cpu_b->lock);
-
-	bw = cpu_b->bw * cpumask_weight(cpu_rq(cpu)->rd->span);
-	if (dl_bandwidth_enabled() &&
-	    bw < cpu_b->total_bw + p->dl.dl_bw) {
-		ret = 0;
-		goto unlock;
-	}
-	dl_b->total_bw -= p->dl.dl_bw;
-	cpu_b->total_bw += p->dl.dl_bw;
-
-unlock:
-	raw_spin_unlock(&cpu_b->lock);
-	raw_spin_unlock(&dl_b->lock);
-
-	return ret;
-}
-
-/*
  * Move (not current) task off this cpu, onto dest cpu. We're doing
  * this because either it can't run here any more (set_cpus_allowed()
  * away from this CPU, or CPU going down), or because we're
@@ -4590,13 +4554,6 @@ static int __migrate_task(struct task_st
 		goto fail;
 
 	/*
-	 * If p is -deadline, proceed only if there is enough
-	 * bandwidth available on dest_cpu
-	 */
-	if (unlikely(dl_task(p)) && !set_task_cpu_dl(p, dest_cpu))
-		goto fail;
-
-	/*
 	 * If we're not on a rq, the next wake-up will ensure we're
 	 * placed properly.
 	 */
@@ -4985,6 +4942,23 @@ migration_call(struct notifier_block *nf
 	unsigned long flags;
 	struct rq *rq = cpu_rq(cpu);
 
+	switch (action) {
+	case CPU_DOWN_PREPARE: /* explicitly allow suspend */
+		{
+			struct dl_bw *dl_b = dl_bw_of(cpu);
+			int cpus = dl_bw_cpus(cpu);
+			bool overflow;
+
+			raw_spin_lock_irqsave(&dl_b->lock, flags);
+			overflow = __dl_overflow(dl_b, cpus-1, 0, 0);
+			raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+
+			if (overflow)
+				return notifier_from_errno(-EBUSY);
+		}
+		break;
+	}
+
 	switch (action & ~CPU_TASKS_FROZEN) {
 
 	case CPU_UP_PREPARE:

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 09/13] sched: Add bandwidth management for sched_dl
  2013-12-20 17:13     ` Peter Zijlstra
@ 2013-12-20 17:37       ` Steven Rostedt
  2013-12-20 17:42         ` Peter Zijlstra
  2014-01-13 15:55       ` [tip:sched/core] sched/deadline: Fix hotplug admission control tip-bot for Peter Zijlstra
  1 sibling, 1 reply; 71+ messages in thread
From: Steven Rostedt @ 2013-12-20 17:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur

On Fri, 20 Dec 2013 18:13:43 +0100
Peter Zijlstra <peterz@infradead.org> wrote:


> @@ -4985,6 +4942,23 @@ migration_call(struct notifier_block *nf
>  	unsigned long flags;
>  	struct rq *rq = cpu_rq(cpu);
>  
> +	switch (action) {
> +	case CPU_DOWN_PREPARE: /* explicitly allow suspend */
> +		{
> +			struct dl_bw *dl_b = dl_bw_of(cpu);
> +			int cpus = dl_bw_cpus(cpu);
> +			bool overflow;
> +
> +			raw_spin_lock_irqsave(&dl_b->lock, flags);
> +			overflow = __dl_overflow(dl_b, cpus-1, 0, 0);
> +			raw_spin_unlock_irqrestore(&dl_b->lock, flags);
> +
> +			if (overflow)
> +				return notifier_from_errno(-EBUSY);

Is it possible to have a race here to create a new deadline task that
may work with cpus but not cpus-1? That is, if a new deadline task is
currently being created as a CPU is going offline, this check happens
first while the creation is spinning on the dl_b->lock, and it sets
overflow to false, then once the lock is released, the new deadline
task makes the condition true.

Should the system call have a get_online_cpus() somewhere?

-- Steve


> +		}
> +		break;
> +	}
> +
>  	switch (action & ~CPU_TASKS_FROZEN) {
>  
>  	case CPU_UP_PREPARE:


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 09/13] sched: Add bandwidth management for sched_dl
  2013-12-20 17:37       ` Steven Rostedt
@ 2013-12-20 17:42         ` Peter Zijlstra
  2013-12-20 18:23           ` Steven Rostedt
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2013-12-20 17:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur

On Fri, Dec 20, 2013 at 12:37:07PM -0500, Steven Rostedt wrote:
> On Fri, 20 Dec 2013 18:13:43 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> 
> > @@ -4985,6 +4942,23 @@ migration_call(struct notifier_block *nf
> >  	unsigned long flags;
> >  	struct rq *rq = cpu_rq(cpu);
> >  
> > +	switch (action) {
> > +	case CPU_DOWN_PREPARE: /* explicitly allow suspend */
> > +		{
> > +			struct dl_bw *dl_b = dl_bw_of(cpu);
> > +			int cpus = dl_bw_cpus(cpu);
> > +			bool overflow;
> > +
> > +			raw_spin_lock_irqsave(&dl_b->lock, flags);
> > +			overflow = __dl_overflow(dl_b, cpus-1, 0, 0);
> > +			raw_spin_unlock_irqrestore(&dl_b->lock, flags);
> > +
> > +			if (overflow)
> > +				return notifier_from_errno(-EBUSY);
> 
> Is it possible to have a race here to create a new deadline task that
> may work with cpus but not cpus-1? That is, if a new deadline task is
> currently being created as a CPU is going offline, this check happens
> first while the creation is spinning on the dl_b->lock, and it sets
> overflow to false, then once the lock is released, the new deadline
> task makes the condition true.
> 
> Should the system call have a get_online_cpus() somewhere?

No, should be all good; the entire admission control is serialized by
that dl_b->lock, and it's a raw_spin_lock (as can be seen from the above)
which already very much excludes hotplug.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 09/13] sched: Add bandwidth management for sched_dl
  2013-12-20 17:42         ` Peter Zijlstra
@ 2013-12-20 18:23           ` Steven Rostedt
  2013-12-20 18:26             ` Steven Rostedt
  2013-12-20 21:44             ` Peter Zijlstra
  0 siblings, 2 replies; 71+ messages in thread
From: Steven Rostedt @ 2013-12-20 18:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur

On Fri, 20 Dec 2013 18:42:00 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Dec 20, 2013 at 12:37:07PM -0500, Steven Rostedt wrote:
> > On Fri, 20 Dec 2013 18:13:43 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > 
> > > @@ -4985,6 +4942,23 @@ migration_call(struct notifier_block *nf
> > >  	unsigned long flags;
> > >  	struct rq *rq = cpu_rq(cpu);
> > >  
> > > +	switch (action) {
> > > +	case CPU_DOWN_PREPARE: /* explicitly allow suspend */
> > > +		{
> > > +			struct dl_bw *dl_b = dl_bw_of(cpu);
> > > +			int cpus = dl_bw_cpus(cpu);
> > > +			bool overflow;
> > > +
> > > +			raw_spin_lock_irqsave(&dl_b->lock, flags);
> > > +			overflow = __dl_overflow(dl_b, cpus-1, 0, 0);
> > > +			raw_spin_unlock_irqrestore(&dl_b->lock, flags);
> > > +
> > > +			if (overflow)
> > > +				return notifier_from_errno(-EBUSY);
> > 
> > Is it possible to have a race here to create a new deadline task that
> > may work with cpus but not cpus-1? That is, if a new deadline task is
> > currently being created as a CPU is going offline, this check happens
> > first while the creation is spinning on the dl_b->lock, and it sets
> > overflow to false, then once the lock is released, the new deadline
> > task makes the condition true.
> > 
> > Should the system call have a get_online_cpus() somewhere?
> 
> No, should be all good; the entire admission control is serialized by
> that dl_b->lock, and it's a raw_spin_lock (as can be seen from the above)
> which already very much excludes hotplug.

I'm saying what stops this?


	CPU 0			CPU 1
	-----			-----
 sched_setattr()
 dl_overflow()
 cpus = __dl_span_weight()

			  cpu_down()
			  raw_spin_lock()
 raw_spin_lock() /* blocks */


			  overflow = __dl_overflow(cpus-1);
			  raw_spin_unlock();

 /* gets lock */
 __dl_overflow(cpus) /* all OK! */



			  /* cpus goes to cpus - 1 making
			     __dl_overflow() not OK anymore */


-- Steve


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 09/13] sched: Add bandwidth management for sched_dl
  2013-12-20 18:23           ` Steven Rostedt
@ 2013-12-20 18:26             ` Steven Rostedt
  2013-12-20 21:44             ` Peter Zijlstra
  1 sibling, 0 replies; 71+ messages in thread
From: Steven Rostedt @ 2013-12-20 18:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, tglx, mingo, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang, jkacur

On Fri, 20 Dec 2013 13:23:23 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> 
> 	CPU 0			CPU 1
> 	-----			-----
>  sched_setattr()
>  dl_overflow()
>  cpus = __dl_span_weight()
> 
> 			  cpu_down()
> 			  raw_spin_lock()
>  raw_spin_lock() /* blocks */
> 
> 
> 			  overflow = __dl_overflow(cpus-1);
> 			  raw_spin_unlock();
> 
>  /* gets lock */
>  __dl_overflow(cpus) /* all OK! */

Forgot to add:

 /* new deadline commitment added here */

> 
> 
> 
> 			  /* cpus goes to cpus - 1 making
> 			     __dl_overflow() not OK anymore */
> 

			also should have stated:

			"__dl_overflow(cpus-1) not OK anymore"


-- Steve

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 09/13] sched: Add bandwidth management for sched_dl
  2013-12-20 18:23           ` Steven Rostedt
  2013-12-20 18:26             ` Steven Rostedt
@ 2013-12-20 21:44             ` Peter Zijlstra
  2013-12-20 23:29               ` Steven Rostedt
  1 sibling, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2013-12-20 21:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur

On Fri, Dec 20, 2013 at 01:23:23PM -0500, Steven Rostedt wrote:
> I'm saying what stops this?

oh duh, yes.

So the below is a bit cumbersome in having to use rd->span &
cpu_active_mask because it appears rd->online is too late again.

So I think this will avoid the problem by being consistent with the cpu
count. At worst it will reject a new task that could've fit, but that's
a safe mistake to make.
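
That is, the rule becomes: read the active cpu count only while holding
dl_b->lock, so that admission control and the DOWN_PREPARE check agree
on it; roughly (as in dl_overflow() below):

	raw_spin_lock(&dl_b->lock);
	cpus = dl_bw_cpus(task_cpu(p));		/* counted under the lock */
	overflow = __dl_overflow(dl_b, cpus, 0, new_bw);
	raw_spin_unlock(&dl_b->lock);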

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1886,9 +1886,15 @@ inline struct dl_bw *dl_bw_of(int i)
 	return &cpu_rq(i)->rd->dl_bw;
 }
 
-static inline int __dl_span_weight(struct rq *rq)
+static inline int dl_bw_cpus(int i)
 {
-	return cpumask_weight(rq->rd->span);
+	struct root_domain *rd = cpu_rq(i)->rd;
+	int cpus = 0;
+
+	for_each_cpu_and(i, rd->span, cpu_active_mask)
+		cpus++;
+
+	return cpus;
 }
 #else
 inline struct dl_bw *dl_bw_of(int i)
@@ -1896,7 +1902,7 @@ inline struct dl_bw *dl_bw_of(int i)
 	return &cpu_rq(i)->dl.dl_bw;
 }
 
-static inline int __dl_span_weight(struct rq *rq)
+static inline int dl_bw_cpus(int i)
 {
 	return 1;
 }
@@ -1937,8 +1943,7 @@ static int dl_overflow(struct task_struc
 	u64 period = attr->sched_period;
 	u64 runtime = attr->sched_runtime;
 	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
-	int cpus = __dl_span_weight(task_rq(p));
-	int err = -1;
+	int cpus, err = -1;
 
 	if (new_bw == p->dl.dl_bw)
 		return 0;
@@ -1949,6 +1954,7 @@ static int dl_overflow(struct task_struc
 	 * allocated bandwidth of the container.
 	 */
 	raw_spin_lock(&dl_b->lock);
+	cpus = dl_bw_cpus(task_cpu(p));
 	if (dl_policy(policy) && !task_has_dl_policy(p) &&
 	    !__dl_overflow(dl_b, cpus, 0, new_bw)) {
 		__dl_add(dl_b, new_bw);
@@ -4523,42 +4529,6 @@ int set_cpus_allowed_ptr(struct task_str
 EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
 
 /*
- * When dealing with a -deadline task, we have to check if moving it to
- * a new CPU is possible or not. In fact, this is only true iff there
- * is enough bandwidth available on such CPU, otherwise we want the
- * whole migration procedure to fail over.
- */
-static inline
-bool set_task_cpu_dl(struct task_struct *p, unsigned int cpu)
-{
-	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
-	struct dl_bw *cpu_b = dl_bw_of(cpu);
-	int ret = 1;
-	u64 bw;
-
-	if (dl_b == cpu_b)
-		return 1;
-
-	raw_spin_lock(&dl_b->lock);
-	raw_spin_lock(&cpu_b->lock);
-
-	bw = cpu_b->bw * cpumask_weight(cpu_rq(cpu)->rd->span);
-	if (dl_bandwidth_enabled() &&
-	    bw < cpu_b->total_bw + p->dl.dl_bw) {
-		ret = 0;
-		goto unlock;
-	}
-	dl_b->total_bw -= p->dl.dl_bw;
-	cpu_b->total_bw += p->dl.dl_bw;
-
-unlock:
-	raw_spin_unlock(&cpu_b->lock);
-	raw_spin_unlock(&dl_b->lock);
-
-	return ret;
-}
-
-/*
  * Move (not current) task off this cpu, onto dest cpu. We're doing
  * this because either it can't run here any more (set_cpus_allowed()
  * away from this CPU, or CPU going down), or because we're
@@ -4590,13 +4560,6 @@ static int __migrate_task(struct task_st
 		goto fail;
 
 	/*
-	 * If p is -deadline, proceed only if there is enough
-	 * bandwidth available on dest_cpu
-	 */
-	if (unlikely(dl_task(p)) && !set_task_cpu_dl(p, dest_cpu))
-		goto fail;
-
-	/*
 	 * If we're not on a rq, the next wake-up will ensure we're
 	 * placed properly.
 	 */
@@ -4986,7 +4949,6 @@ migration_call(struct notifier_block *nf
 	struct rq *rq = cpu_rq(cpu);
 
 	switch (action & ~CPU_TASKS_FROZEN) {
-
 	case CPU_UP_PREPARE:
 		rq->calc_load_update = calc_load_update;
 		break;
@@ -5056,10 +5018,28 @@ static int sched_cpu_inactive(struct not
 	switch (action & ~CPU_TASKS_FROZEN) {
 	case CPU_DOWN_PREPARE:
 		set_cpu_active((long)hcpu, false);
-		return NOTIFY_OK;
-	default:
-		return NOTIFY_DONE;
+		break;
 	}
+
+	switch (action) {
+	case CPU_DOWN_PREPARE: /* explicitly allow suspend */
+		{
+			struct dl_bw *dl_b = dl_bw_of(cpu);
+			bool overflow;
+			int cpus;
+
+			raw_spin_lock_irqsave(&dl_b->lock, flags);
+			cpus = dl_bw_cpus(cpu);
+			overflow = __dl_overflow(dl_b, cpus, 0, 0);
+			raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+
+			if (overflow)
+				return notifier_from_errno(-EBUSY);
+		}
+		break;
+	}
+
+	return NOTIFY_OK;
 }
 
 static int __init migration_init(void)

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 09/13] sched: Add bandwidth management for sched_dl
  2013-12-20 21:44             ` Peter Zijlstra
@ 2013-12-20 23:29               ` Steven Rostedt
  2013-12-21 10:05                 ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Steven Rostedt @ 2013-12-20 23:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur

On Fri, 20 Dec 2013 22:44:13 +0100
Peter Zijlstra <peterz@infradead.org> wrote:


> @@ -5056,10 +5018,28 @@ static int sched_cpu_inactive(struct not
>  	switch (action & ~CPU_TASKS_FROZEN) {
>  	case CPU_DOWN_PREPARE:
>  		set_cpu_active((long)hcpu, false);
> -		return NOTIFY_OK;
> -	default:
> -		return NOTIFY_DONE;
> +		break;
>  	}
> +
> +	switch (action) {
> +	case CPU_DOWN_PREPARE: /* explicitly allow suspend */

Instead of the double switch (which is quite confusing), what about
just adding:

	if (!(action & CPU_TASKS_FROZEN))

I mean, the above switch gets called for both cases; this only gets
called for the one case. This case is a subset of the above. I don't
see why an if () would not be better than a double (confusing) switch().

Also, it seems that this change does not return NOTIFY_DONE if
something other than CPU_DOWN_PREPARE is passed in.

-- Steve

> +		{
> +			struct dl_bw *dl_b = dl_bw_of(cpu);
> +			bool overflow;
> +			int cpus;
> +
> +			raw_spin_lock_irqsave(&dl_b->lock, flags);
> +		       	cpus = dl_bw_cpus(cpu);
> +			overflow = __dl_overflow(dl_b, cpus, 0, 0);
> +			raw_spin_unlock_irqrestore(&dl_b->lock, flags);
> +
> +			if (overflow)
> +				return notifier_from_errno(-EBUSY);
> +		}
> +		break;
> +	}
> +
> +	return NOTIFY_OK;
>  }
>  
>  static int __init migration_init(void)


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 09/13] sched: Add bandwidth management for sched_dl
  2013-12-20 23:29               ` Steven Rostedt
@ 2013-12-21 10:05                 ` Peter Zijlstra
  2013-12-21 17:26                   ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2013-12-21 10:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur

On Fri, Dec 20, 2013 at 06:29:46PM -0500, Steven Rostedt wrote:
> On Fri, 20 Dec 2013 22:44:13 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> 
> > @@ -5056,10 +5018,28 @@ static int sched_cpu_inactive(struct not
> >  	switch (action & ~CPU_TASKS_FROZEN) {
> >  	case CPU_DOWN_PREPARE:
> >  		set_cpu_active((long)hcpu, false);
> > -		return NOTIFY_OK;
> > -	default:
> > -		return NOTIFY_DONE;
> > +		break;
> >  	}
> > +
> > +	switch (action) {
> > +	case CPU_DOWN_PREPARE: /* explicitly allow suspend */
> 
> Instead of the double switch (which is quite confusing), what about
> just adding:
> 
> 	if (!(action & CPU_TASKS_FROZEN))
> 
> I mean, the above switch gets called for both cases, this only gets
> called for the one case. This case is a subset of the above. I don't
> see why an if () would not be better than a double (confusing) switch().

I don't see the confusion in the double switch(), but sure an if would
work too I suppose.

> Also, it seems that this change also does not return NOTIFY_DONE if
> something other than CPU_DOWN_PREPARE is passed in.

Yeah, I had a look but couldn't find an actual difference between
NOTIFY_DONE and NOTIFY_OK. Maybe I missed it..

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 09/13] sched: Add bandwidth management for sched_dl
  2013-12-21 10:05                 ` Peter Zijlstra
@ 2013-12-21 17:26                   ` Peter Zijlstra
  0 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2013-12-21 17:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur



Like this then? :-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5056,10 +5019,25 @@ static int sched_cpu_inactive(struct not
 	switch (action & ~CPU_TASKS_FROZEN) {
 	case CPU_DOWN_PREPARE:
 		set_cpu_active((long)hcpu, false);
+
+		/* explicitly allow suspend */
+		if (!(action & CPU_TASKS_FROZEN)) {
+			struct dl_bw *dl_b = dl_bw_of(cpu);
+			bool overflow;
+			int cpus;
+
+			raw_spin_lock_irqsave(&dl_b->lock, flags);
+			cpus = dl_bw_cpus(cpu);
+			overflow = __dl_overflow(dl_b, cpus, 0, 0);
+			raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+
+			if (overflow)
+				return notifier_from_errno(-EBUSY);
+		}
 		return NOTIFY_OK;
-	default:
-		return NOTIFY_DONE;
 	}
+
+	return NOTIFY_DONE;
 }
 
 static int __init migration_init(void)

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [tip:sched/core] sched/deadline: Fix hotplug admission control
  2013-12-20 17:13     ` Peter Zijlstra
  2013-12-20 17:37       ` Steven Rostedt
@ 2014-01-13 15:55       ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 71+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-01-13 15:55 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, peterz, tglx

Commit-ID:  de212f18e92c952533d57c5510d2790199c75734
Gitweb:     http://git.kernel.org/tip/de212f18e92c952533d57c5510d2790199c75734
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 19 Dec 2013 11:54:45 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 13 Jan 2014 13:47:25 +0100

sched/deadline: Fix hotplug admission control

The current hotplug admission control is broken because:

  CPU_DYING -> migration_call() -> migrate_tasks() -> __migrate_task()

cannot fail and hard assumes it _will_ move all tasks off of the dying
cpu; failing this will break hotplug.

The much simpler solution is a DOWN_PREPARE handler that fails when
removing one CPU gets us below the total allocated bandwidth.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131220171343.GL2480@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 83 +++++++++++++++++++++--------------------------------
 1 file changed, 32 insertions(+), 51 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d33eb8..a549d9a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1887,9 +1887,15 @@ inline struct dl_bw *dl_bw_of(int i)
 	return &cpu_rq(i)->rd->dl_bw;
 }
 
-static inline int __dl_span_weight(struct rq *rq)
+static inline int dl_bw_cpus(int i)
 {
-	return cpumask_weight(rq->rd->span);
+	struct root_domain *rd = cpu_rq(i)->rd;
+	int cpus = 0;
+
+	for_each_cpu_and(i, rd->span, cpu_active_mask)
+		cpus++;
+
+	return cpus;
 }
 #else
 inline struct dl_bw *dl_bw_of(int i)
@@ -1897,7 +1903,7 @@ inline struct dl_bw *dl_bw_of(int i)
 	return &cpu_rq(i)->dl.dl_bw;
 }
 
-static inline int __dl_span_weight(struct rq *rq)
+static inline int dl_bw_cpus(int i)
 {
 	return 1;
 }
@@ -1938,8 +1944,7 @@ static int dl_overflow(struct task_struct *p, int policy,
 	u64 period = attr->sched_period;
 	u64 runtime = attr->sched_runtime;
 	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
-	int cpus = __dl_span_weight(task_rq(p));
-	int err = -1;
+	int cpus, err = -1;
 
 	if (new_bw == p->dl.dl_bw)
 		return 0;
@@ -1950,6 +1955,7 @@ static int dl_overflow(struct task_struct *p, int policy,
 	 * allocated bandwidth of the container.
 	 */
 	raw_spin_lock(&dl_b->lock);
+	cpus = dl_bw_cpus(task_cpu(p));
 	if (dl_policy(policy) && !task_has_dl_policy(p) &&
 	    !__dl_overflow(dl_b, cpus, 0, new_bw)) {
 		__dl_add(dl_b, new_bw);
@@ -4522,42 +4528,6 @@ out:
 EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
 
 /*
- * When dealing with a -deadline task, we have to check if moving it to
- * a new CPU is possible or not. In fact, this is only true iff there
- * is enough bandwidth available on such CPU, otherwise we want the
- * whole migration procedure to fail over.
- */
-static inline
-bool set_task_cpu_dl(struct task_struct *p, unsigned int cpu)
-{
-	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
-	struct dl_bw *cpu_b = dl_bw_of(cpu);
-	int ret = 1;
-	u64 bw;
-
-	if (dl_b == cpu_b)
-		return 1;
-
-	raw_spin_lock(&dl_b->lock);
-	raw_spin_lock(&cpu_b->lock);
-
-	bw = cpu_b->bw * cpumask_weight(cpu_rq(cpu)->rd->span);
-	if (dl_bandwidth_enabled() &&
-	    bw < cpu_b->total_bw + p->dl.dl_bw) {
-		ret = 0;
-		goto unlock;
-	}
-	dl_b->total_bw -= p->dl.dl_bw;
-	cpu_b->total_bw += p->dl.dl_bw;
-
-unlock:
-	raw_spin_unlock(&cpu_b->lock);
-	raw_spin_unlock(&dl_b->lock);
-
-	return ret;
-}
-
-/*
  * Move (not current) task off this cpu, onto dest cpu. We're doing
  * this because either it can't run here any more (set_cpus_allowed()
  * away from this CPU, or CPU going down), or because we're
@@ -4589,13 +4559,6 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
 		goto fail;
 
 	/*
-	 * If p is -deadline, proceed only if there is enough
-	 * bandwidth available on dest_cpu
-	 */
-	if (unlikely(dl_task(p)) && !set_task_cpu_dl(p, dest_cpu))
-		goto fail;
-
-	/*
 	 * If we're not on a rq, the next wake-up will ensure we're
 	 * placed properly.
 	 */
@@ -5052,13 +5015,31 @@ static int sched_cpu_active(struct notifier_block *nfb,
 static int sched_cpu_inactive(struct notifier_block *nfb,
 					unsigned long action, void *hcpu)
 {
+	unsigned long flags;
+	long cpu = (long)hcpu;
+
 	switch (action & ~CPU_TASKS_FROZEN) {
 	case CPU_DOWN_PREPARE:
-		set_cpu_active((long)hcpu, false);
+		set_cpu_active(cpu, false);
+
+		/* explicitly allow suspend */
+		if (!(action & CPU_TASKS_FROZEN)) {
+			struct dl_bw *dl_b = dl_bw_of(cpu);
+			bool overflow;
+			int cpus;
+
+			raw_spin_lock_irqsave(&dl_b->lock, flags);
+			cpus = dl_bw_cpus(cpu);
+			overflow = __dl_overflow(dl_b, cpus, 0, 0);
+			raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+
+			if (overflow)
+				return notifier_from_errno(-EBUSY);
+		}
 		return NOTIFY_OK;
-	default:
-		return NOTIFY_DONE;
 	}
+
+	return NOTIFY_DONE;
 }
 
 static int __init migration_init(void)

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI
  2013-12-17 12:27 ` [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI Peter Zijlstra
@ 2014-01-21 14:36   ` Michael Kerrisk
  2014-01-21 15:38     ` Peter Zijlstra
  2014-01-26  9:48   ` [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI Geert Uytterhoeven
  1 sibling, 1 reply; 71+ messages in thread
From: Michael Kerrisk @ 2014-01-21 14:36 UTC (permalink / raw)
  To: Peter Zijlstra, Dario Faggioli
  Cc: Thomas Gleixner, Ingo Molnar, rostedt, Oleg Nesterov, fweisbec,
	darren, johan.eker, p.faure, Linux Kernel, claudio, michael,
	fchecconi, tommaso.cucinotta, juri.lelli, nicola.manica,
	luca.abeni, dhaval.giani, hgu1972, Paul McKenney, insop.song,
	liming.wang, jkacur, Michael Kerrisk-manpages

Peter, Dario,


On Tue, Dec 17, 2013 at 1:27 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> From: Dario Faggioli <raistlin@linux.it>
>
> Add the syscalls needed for supporting scheduling algorithms
> with extended scheduling parameters (e.g., SCHED_DEADLINE).
>
> In general, it makes possible to specify a periodic/sporadic task,
> that executes for a given amount of runtime at each instance, and is
> scheduled according to the urgency of their own timing constraints,
> i.e.:
>
>  - a (maximum/typical) instance execution time,
>  - a minimum interval between consecutive instances,
>  - a time constraint by which each instance must be completed.
>
> Thus, both the data structure that holds the scheduling parameters of
> the tasks and the system calls dealing with it must be extended.
> Unfortunately, modifying the existing struct sched_param would break
> the ABI and result in potentially serious compatibility issues with
> legacy binaries.
>
> For these reasons, this patch:
>
>  - defines the new struct sched_attr, containing all the fields
>    that are necessary for specifying a task in the computational
>    model described above;
>  - defines and implements the new scheduling related syscalls that
>    manipulate it, i.e., sched_setscheduler2(), sched_setattr()
>    and sched_getattr().

Is someone (e.g., one of you) planning to write man pages for the new
sched_setattr() and sched_getattr() system calls? (Also, for the
future, please CC linux-api@vger.kernel.org on patches that change the
API, so that those of us who don't follow LKML get a heads-up about
upcoming API changes.)
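
FWIW, the kind of invocation such a page would have to document looks
roughly like the sketch below. It is built only from the struct layout
and the x86-64 syscall number quoted further down; the number, the size
constant and the required privileges are assumptions until this lands,
and the snippet is untested:

/* Sketch only: set SCHED_FIFO via the new extended-parameters ABI. */
#include <sched.h>		/* SCHED_FIFO */
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define __NR_sched_setscheduler2 316	/* x86-64 number from the quoted table */

struct sched_attr {			/* mirrors the quoted structure */
	int sched_priority;
	unsigned int sched_flags;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t size;
	uint32_t __reserved;		/* align to u64 */
};

int main(void)
{
	struct sched_attr attr = {
		.sched_priority	= 10,	/* plain RT priority; no new policy yet */
		.size		= 40,	/* SCHED_ATTR_SIZE_VER0 in the patch */
	};

	/* pid 0 == calling thread; needs CAP_SYS_NICE or a suitable RLIMIT_RTPRIO */
	if (syscall(__NR_sched_setscheduler2, 0, SCHED_FIFO, &attr))
		perror("sched_setscheduler2");
	return 0;
}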

Thanks,

Michael


> Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
> proof of concept and for developing and testing purposes. Making them
> available on other architectures is straightforward.
>
> Since no "user" for these new parameters is introduced in this patch,
> the implementation of the new system calls is just identical to their
> already existing counterpart. Future patches that implement scheduling
> policies able to exploit the new data structure must also take care of
> modifying the sched_*attr() calls accordingly with their own purposes.
>
> Cc: oleg@redhat.com
> Cc: darren@dvhart.com
> Cc: paulmck@linux.vnet.ibm.com
> Cc: dhaval.giani@gmail.com
> Cc: p.faure@akatech.ch
> Cc: fchecconi@gmail.com
> Cc: fweisbec@gmail.com
> Cc: harald.gustafsson@ericsson.com
> Cc: hgu1972@gmail.com
> Cc: insop.song@gmail.com
> Cc: rostedt@goodmis.org
> Cc: jkacur@redhat.com
> Cc: tommaso.cucinotta@sssup.it
> Cc: johan.eker@ericsson.com
> Cc: vincent.guittot@linaro.org
> Cc: liming.wang@windriver.com
> Cc: luca.abeni@unitn.it
> Cc: michael@amarulasolutions.com
> Cc: bruce.ashfield@windriver.com
> Cc: nicola.manica@disi.unitn.it
> Cc: claudio@evidence.eu.com
> Signed-off-by: Dario Faggioli <raistlin@linux.it>
> Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
> [ Twiddled the changelog. ]
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  arch/arm/include/asm/unistd.h      |    2
>  arch/arm/include/uapi/asm/unistd.h |    3
>  arch/arm/kernel/calls.S            |    3
>  arch/x86/syscalls/syscall_32.tbl   |    3
>  arch/x86/syscalls/syscall_64.tbl   |    3
>  include/linux/sched.h              |   54 ++++++++
>  include/linux/syscalls.h           |    8 +
>  kernel/sched/core.c                |  234 +++++++++++++++++++++++++++++++++++--
>  8 files changed, 298 insertions(+), 12 deletions(-)
>
> --- a/arch/arm/include/asm/unistd.h
> +++ b/arch/arm/include/asm/unistd.h
> @@ -15,7 +15,7 @@
>
>  #include <uapi/asm/unistd.h>
>
> -#define __NR_syscalls  (380)
> +#define __NR_syscalls  (383)
>  #define __ARM_NR_cmpxchg               (__ARM_NR_BASE+0x00fff0)
>
>  #define __ARCH_WANT_STAT64
> --- a/arch/arm/include/uapi/asm/unistd.h
> +++ b/arch/arm/include/uapi/asm/unistd.h
> @@ -406,6 +406,9 @@
>  #define __NR_process_vm_writev         (__NR_SYSCALL_BASE+377)
>  #define __NR_kcmp                      (__NR_SYSCALL_BASE+378)
>  #define __NR_finit_module              (__NR_SYSCALL_BASE+379)
> +#define __NR_sched_setscheduler2       (__NR_SYSCALL_BASE+380)
> +#define __NR_sched_setattr             (__NR_SYSCALL_BASE+381)
> +#define __NR_sched_getattr             (__NR_SYSCALL_BASE+382)
>
>  /*
>   * This may need to be greater than __NR_last_syscall+1 in order to
> --- a/arch/arm/kernel/calls.S
> +++ b/arch/arm/kernel/calls.S
> @@ -389,6 +389,9 @@
>                 CALL(sys_process_vm_writev)
>                 CALL(sys_kcmp)
>                 CALL(sys_finit_module)
> +/* 380 */      CALL(sys_sched_setscheduler2)
> +               CALL(sys_sched_setattr)
> +               CALL(sys_sched_getattr)
>  #ifndef syscalls_counted
>  .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
>  #define syscalls_counted
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -357,3 +357,6 @@
>  348    i386    process_vm_writev       sys_process_vm_writev           compat_sys_process_vm_writev
>  349    i386    kcmp                    sys_kcmp
>  350    i386    finit_module            sys_finit_module
> +351    i386    sched_setattr           sys_sched_setattr
> +352    i386    sched_getattr           sys_sched_getattr
> +353    i386    sched_setscheduler2     sys_sched_setscheduler2
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -320,6 +320,9 @@
>  311    64      process_vm_writev       sys_process_vm_writev
>  312    common  kcmp                    sys_kcmp
>  313    common  finit_module            sys_finit_module
> +314    common  sched_setattr           sys_sched_setattr
> +315    common  sched_getattr           sys_sched_getattr
> +316    common  sched_setscheduler2     sys_sched_setscheduler2
>
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -56,6 +56,58 @@ struct sched_param {
>
>  #include <asm/processor.h>
>
> +#define SCHED_ATTR_SIZE_VER0   40      /* sizeof first published struct */
> +
> +/*
> + * Extended scheduling parameters data structure.
> + *
> + * This is needed because the original struct sched_param can not be
> + * altered without introducing ABI issues with legacy applications
> + * (e.g., in sched_getparam()).
> + *
> + * However, the possibility of specifying more than just a priority for
> + * the tasks may be useful for a wide variety of application fields, e.g.,
> + * multimedia, streaming, automation and control, and many others.
> + *
> + * This variant (sched_attr) is meant to describe a so-called
> + * sporadic time-constrained task. In such a model a task is specified by:
> + *  - the activation period or minimum instance inter-arrival time;
> + *  - the maximum (or average, depending on the actual scheduling
> + *    discipline) computation time of all instances, a.k.a. runtime;
> + *  - the deadline (relative to the actual activation time) of each
> + *    instance.
> + * Very briefly, a periodic (sporadic) task asks for the execution of
> + * some specific computation --which is typically called an instance--
> + * (at most) every period. Moreover, each instance typically lasts no more
> + * than the runtime and must be completed by time instant t equal to
> + * the instance activation time + the deadline.
> + *
> + * This is reflected by the actual fields of the sched_attr structure:
> + *
> + *  @sched_priority     task's priority (might still be useful)
> + *  @sched_flags        for customizing the scheduler behaviour
> + *  @sched_deadline     representative of the task's deadline
> + *  @sched_runtime      representative of the task's runtime
> + *  @sched_period       representative of the task's period
> + *
> + * Given this task model, there is a multiplicity of scheduling algorithms
> + * and policies that can be used to ensure all the tasks will meet their
> + * timing constraints.
> + *
> + *  @size              size of the structure, for fwd/bwd compat.
> + */
> +struct sched_attr {
> +       int sched_priority;
> +       unsigned int sched_flags;
> +       u64 sched_runtime;
> +       u64 sched_deadline;
> +       u64 sched_period;
> +       u32 size;
> +
> +       /* Align to u64. */
> +       u32 __reserved;
> +};
> +
>  struct exec_domain;
>  struct futex_pi_state;
>  struct robust_list_head;
> @@ -1960,6 +2012,8 @@ extern int sched_setscheduler(struct tas
>                               const struct sched_param *);
>  extern int sched_setscheduler_nocheck(struct task_struct *, int,
>                                       const struct sched_param *);
> +extern int sched_setscheduler2(struct task_struct *, int,
> +                                const struct sched_attr *);
>  extern struct task_struct *idle_task(int cpu);
>  /**
>   * is_idle_task - is the specified task an idle task?
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -38,6 +38,7 @@ struct rlimit;
>  struct rlimit64;
>  struct rusage;
>  struct sched_param;
> +struct sched_attr;
>  struct sel_arg_struct;
>  struct semaphore;
>  struct sembuf;
> @@ -277,11 +278,18 @@ asmlinkage long sys_clock_nanosleep(cloc
>  asmlinkage long sys_nice(int increment);
>  asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
>                                         struct sched_param __user *param);
> +asmlinkage long sys_sched_setscheduler2(pid_t pid, int policy,
> +                                       struct sched_attr __user *attr);
>  asmlinkage long sys_sched_setparam(pid_t pid,
>                                         struct sched_param __user *param);
> +asmlinkage long sys_sched_setattr(pid_t pid,
> +                                       struct sched_attr __user *attr);
>  asmlinkage long sys_sched_getscheduler(pid_t pid);
>  asmlinkage long sys_sched_getparam(pid_t pid,
>                                         struct sched_param __user *param);
> +asmlinkage long sys_sched_getattr(pid_t pid,
> +                                       struct sched_attr __user *attr,
> +                                       unsigned int size);
>  asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
>                                         unsigned long __user *user_mask_ptr);
>  asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3023,7 +3023,8 @@ static bool check_same_owner(struct task
>  }
>
>  static int __sched_setscheduler(struct task_struct *p, int policy,
> -                               const struct sched_param *param, bool user)
> +                               const struct sched_attr *attr,
> +                               bool user)
>  {
>         int retval, oldprio, oldpolicy = -1, on_rq, running;
>         unsigned long flags;
> @@ -3053,11 +3054,11 @@ static int __sched_setscheduler(struct t
>          * 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_NORMAL,
>          * SCHED_BATCH and SCHED_IDLE is 0.
>          */
> -       if (param->sched_priority < 0 ||
> -           (p->mm && param->sched_priority > MAX_USER_RT_PRIO-1) ||
> -           (!p->mm && param->sched_priority > MAX_RT_PRIO-1))
> +       if (attr->sched_priority < 0 ||
> +           (p->mm && attr->sched_priority > MAX_USER_RT_PRIO-1) ||
> +           (!p->mm && attr->sched_priority > MAX_RT_PRIO-1))
>                 return -EINVAL;
> -       if (rt_policy(policy) != (param->sched_priority != 0))
> +       if (rt_policy(policy) != (attr->sched_priority != 0))
>                 return -EINVAL;
>
>         /*
> @@ -3073,8 +3074,8 @@ static int __sched_setscheduler(struct t
>                                 return -EPERM;
>
>                         /* can't increase priority */
> -                       if (param->sched_priority > p->rt_priority &&
> -                           param->sched_priority > rlim_rtprio)
> +                       if (attr->sched_priority > p->rt_priority &&
> +                           attr->sched_priority > rlim_rtprio)
>                                 return -EPERM;
>                 }
>
> @@ -3123,7 +3124,7 @@ static int __sched_setscheduler(struct t
>          * If not changing anything there's no need to proceed further:
>          */
>         if (unlikely(policy == p->policy && (!rt_policy(policy) ||
> -                       param->sched_priority == p->rt_priority))) {
> +                    attr->sched_priority == p->rt_priority))) {
>                 task_rq_unlock(rq, p, &flags);
>                 return 0;
>         }
> @@ -3160,7 +3161,7 @@ static int __sched_setscheduler(struct t
>
>         oldprio = p->prio;
>         prev_class = p->sched_class;
> -       __setscheduler(rq, p, policy, param->sched_priority);
> +       __setscheduler(rq, p, policy, attr->sched_priority);
>
>         if (running)
>                 p->sched_class->set_curr_task(rq);
> @@ -3188,10 +3189,20 @@ static int __sched_setscheduler(struct t
>  int sched_setscheduler(struct task_struct *p, int policy,
>                        const struct sched_param *param)
>  {
> -       return __sched_setscheduler(p, policy, param, true);
> +       struct sched_attr attr = {
> +               .sched_priority = param->sched_priority
> +       };
> +       return __sched_setscheduler(p, policy, &attr, true);
>  }
>  EXPORT_SYMBOL_GPL(sched_setscheduler);
>
> +int sched_setscheduler2(struct task_struct *p, int policy,
> +                       const struct sched_attr *attr)
> +{
> +       return __sched_setscheduler(p, policy, attr, true);
> +}
> +EXPORT_SYMBOL_GPL(sched_setscheduler2);
> +
>  /**
>   * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
>   * @p: the task in question.
> @@ -3208,7 +3219,10 @@ EXPORT_SYMBOL_GPL(sched_setscheduler);
>  int sched_setscheduler_nocheck(struct task_struct *p, int policy,
>                                const struct sched_param *param)
>  {
> -       return __sched_setscheduler(p, policy, param, false);
> +       struct sched_attr attr = {
> +               .sched_priority = param->sched_priority
> +       };
> +       return __sched_setscheduler(p, policy, &attr, false);
>  }
>
>  static int
> @@ -3233,6 +3247,97 @@ do_sched_setscheduler(pid_t pid, int pol
>         return retval;
>  }
>
> +/*
> + * Mimics kernel/events/core.c perf_copy_attr().
> + */
> +static int sched_copy_attr(struct sched_attr __user *uattr,
> +                          struct sched_attr *attr)
> +{
> +       u32 size;
> +       int ret;
> +
> +       if (!access_ok(VERIFY_WRITE, uattr, SCHED_ATTR_SIZE_VER0))
> +               return -EFAULT;
> +
> +       /*
> +        * zero the full structure, so that a short copy will be nice.
> +        */
> +       memset(attr, 0, sizeof(*attr));
> +
> +       ret = get_user(size, &uattr->size);
> +       if (ret)
> +               return ret;
> +
> +       if (size > PAGE_SIZE)   /* silly large */
> +               goto err_size;
> +
> +       if (!size)              /* abi compat */
> +               size = SCHED_ATTR_SIZE_VER0;
> +
> +       if (size < SCHED_ATTR_SIZE_VER0)
> +               goto err_size;
> +
> +       /*
> +        * If we're handed a bigger struct than we know of,
> +        * ensure all the unknown bits are 0 - i.e. new
> +        * user-space does not rely on any kernel feature
> +        * extensions we don't know about yet.
> +        */
> +       if (size > sizeof(*attr)) {
> +               unsigned char __user *addr;
> +               unsigned char __user *end;
> +               unsigned char val;
> +
> +               addr = (void __user *)uattr + sizeof(*attr);
> +               end  = (void __user *)uattr + size;
> +
> +               for (; addr < end; addr++) {
> +                       ret = get_user(val, addr);
> +                       if (ret)
> +                               return ret;
> +                       if (val)
> +                               goto err_size;
> +               }
> +               size = sizeof(*attr);
> +       }
> +
> +       ret = copy_from_user(attr, uattr, size);
> +       if (ret)
> +               return -EFAULT;
> +
> +out:
> +       return ret;
> +
> +err_size:
> +       put_user(sizeof(*attr), &uattr->size);
> +       ret = -E2BIG;
> +       goto out;
> +}
> +
> +static int
> +do_sched_setscheduler2(pid_t pid, int policy,
> +                      struct sched_attr __user *attr_uptr)
> +{
> +       struct sched_attr attr;
> +       struct task_struct *p;
> +       int retval;
> +
> +       if (!attr_uptr || pid < 0)
> +               return -EINVAL;
> +
> +       if (sched_copy_attr(attr_uptr, &attr))
> +               return -EFAULT;
> +
> +       rcu_read_lock();
> +       retval = -ESRCH;
> +       p = find_process_by_pid(pid);
> +       if (p != NULL)
> +               retval = sched_setscheduler2(p, policy, &attr);
> +       rcu_read_unlock();
> +
> +       return retval;
> +}
> +
>  /**
>   * sys_sched_setscheduler - set/change the scheduler policy and RT priority
>   * @pid: the pid in question.
> @@ -3252,6 +3357,21 @@ SYSCALL_DEFINE3(sched_setscheduler, pid_
>  }
>
>  /**
> + * sys_sched_setscheduler2 - same as above, but with extended sched_param
> + * @pid: the pid in question.
> + * @policy: new policy (could use extended sched_param).
> + * @attr: structure containing the extended parameters.
> + */
> +SYSCALL_DEFINE3(sched_setscheduler2, pid_t, pid, int, policy,
> +               struct sched_attr __user *, attr)
> +{
> +       if (policy < 0)
> +               return -EINVAL;
> +
> +       return do_sched_setscheduler2(pid, policy, attr);
> +}
> +
> +/**
>   * sys_sched_setparam - set/change the RT priority of a thread
>   * @pid: the pid in question.
>   * @param: structure containing the new RT priority.
> @@ -3264,6 +3384,17 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, p
>  }
>
>  /**
> + * sys_sched_setattr - same as above, but with extended sched_attr
> + * @pid: the pid in question.
> + * @attr: structure containing the extended parameters.
> + */
> +SYSCALL_DEFINE2(sched_setattr, pid_t, pid,
> +               struct sched_attr __user *, attr)
> +{
> +       return do_sched_setscheduler2(pid, -1, attr);
> +}
> +
> +/**
>   * sys_sched_getscheduler - get the policy (scheduling class) of a thread
>   * @pid: the pid in question.
>   *
> @@ -3329,6 +3460,87 @@ SYSCALL_DEFINE2(sched_getparam, pid_t, p
>         return retval;
>
>  out_unlock:
> +       rcu_read_unlock();
> +       return retval;
> +}
> +
> +static int sched_read_attr(struct sched_attr __user *uattr,
> +                          struct sched_attr *attr,
> +                          unsigned int usize)
> +{
> +       int ret;
> +
> +       if (!access_ok(VERIFY_WRITE, uattr, usize))
> +               return -EFAULT;
> +
> +       /*
> +        * If we're handed a smaller struct than we know of,
> +        * ensure all the unknown bits are 0 - i.e. old
> +	 * user-space does not get incomplete information.
> +        */
> +       if (usize < sizeof(*attr)) {
> +               unsigned char *addr;
> +               unsigned char *end;
> +
> +               addr = (void *)attr + usize;
> +               end  = (void *)attr + sizeof(*attr);
> +
> +               for (; addr < end; addr++) {
> +                       if (*addr)
> +                               goto err_size;
> +               }
> +
> +               attr->size = usize;
> +       }
> +
> +       ret = copy_to_user(uattr, attr, usize);
> +       if (ret)
> +               return -EFAULT;
> +
> +out:
> +       return ret;
> +
> +err_size:
> +       ret = -E2BIG;
> +       goto out;
> +}
> +
> +/**
> + * sys_sched_getattr - same as above, but with extended "sched_param"
> + * @pid: the pid in question.
> + * @attr: structure containing the extended parameters.
> + * @size: sizeof(attr) for fwd/bwd comp.
> + */
> +SYSCALL_DEFINE3(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
> +               unsigned int, size)
> +{
> +       struct sched_attr attr = {
> +               .size = sizeof(struct sched_attr),
> +       };
> +       struct task_struct *p;
> +       int retval;
> +
> +       if (!uattr || pid < 0 || size > PAGE_SIZE ||
> +           size < SCHED_ATTR_SIZE_VER0)
> +               return -EINVAL;
> +
> +       rcu_read_lock();
> +       p = find_process_by_pid(pid);
> +       retval = -ESRCH;
> +       if (!p)
> +               goto out_unlock;
> +
> +       retval = security_task_getscheduler(p);
> +       if (retval)
> +               goto out_unlock;
> +
> +       attr.sched_priority = p->rt_priority;
> +       rcu_read_unlock();
> +
> +       retval = sched_read_attr(uattr, &attr, size);
> +       return retval;
> +
> +out_unlock:
>         rcu_read_unlock();
>         return retval;
>  }
>
>



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI
  2014-01-21 14:36   ` Michael Kerrisk
@ 2014-01-21 15:38     ` Peter Zijlstra
  2014-01-21 15:46       ` Peter Zijlstra
  2014-02-14 14:13       ` Michael Kerrisk (man-pages)
  0 siblings, 2 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-01-21 15:38 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: Dario Faggioli, Thomas Gleixner, Ingo Molnar, rostedt,
	Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	Paul McKenney, insop.song, liming.wang, jkacur

On Tue, Jan 21, 2014 at 03:36:37PM +0100, Michael Kerrisk wrote:
> Peter, Dario,

> Is someone (e.g., one of you) planning to write man pages for the new
> sched_setattr() and sched_getattr() system calls? (Also, for the
> future, please CC linux-api@vger.kernel.org on patches that change the
> API, then those of us who don't follow LKML get a heads up about
> upcoming API changes.)

first draft, shamelessly stolen from SCHED_SETSCHEDULER(2).

One note on both the original as well as the below: "process" is
ambiguous; the syscalls actually apply to a single thread of a process,
not the entire process.

---


NAME
	sched_setattr, sched_getattr - set and get scheduling policy/attributes

SYNOPSIS
	#include <sched.h>

	struct sched_attr {
		u32 size;

		u32 sched_policy;
		u64 sched_flags;

		/* SCHED_NORMAL, SCHED_BATCH */
		s32 sched_nice;

		/* SCHED_FIFO, SCHED_RR */
		u32 sched_priority;

		/* SCHED_DEADLINE */
		u64 sched_runtime;
		u64 sched_deadline;
		u64 sched_period;
	};

	int sched_setattr(pid_t pid, const struct sched_attr *attr);

	int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size);

DESCRIPTION
	sched_setattr() sets both the scheduling policy and the
	associated attributes for the process whose ID is specified in
	pid.  If pid equals zero, the scheduling policy and attributes
	of the calling process will be set.  The interpretation of the
	argument attr depends on the selected policy.  Currently, Linux
	supports the following "normal" (i.e., non-real-time) scheduling
	policies:

	SCHED_OTHER	the standard "fair" time-sharing policy;

	SCHED_BATCH	for "batch" style execution of processes; and

	SCHED_IDLE	for running very low priority background jobs.

	The following "real-time" policies are also supported, for
	special time-critical applications that need precise control
	over the way in which runnable processes are selected for
	execution:

	SCHED_FIFO	a first-in, first-out policy;

	SCHED_RR	a round-robin policy; and

	SCHED_DEADLINE	a deadline policy.

	The semantics of each of these policies are detailed below.

	sched_attr::size must be set to the size of the structure, as in
	sizeof(struct sched_attr). If the provided structure is smaller
	than the kernel structure, any additional fields are assumed to
	be '0'. If the provided structure is larger than the kernel
	structure, the kernel verifies that all additional fields are
	'0'; if they are not, the syscall will fail with -E2BIG.

	sched_attr::sched_policy is the desired scheduling policy.

	sched_attr::sched_flags holds additional flags that can
	influence scheduling behaviour. Currently, as of Linux kernel
	3.14:

		SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
		to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
		on fork().

	is the only supported flag.

	sched_attr::sched_nice should only be set for SCHED_OTHER and
	SCHED_BATCH; it is the desired nice value [-20,19], see NICE(2).

	sched_attr::sched_priority should only be set for SCHED_FIFO and
	SCHED_RR; it is the desired static priority [1,99].

	sched_attr::sched_runtime
	sched_attr::sched_deadline
	sched_attr::sched_period should only be set for SCHED_DEADLINE
	and are the traditional sporadic task model parameters.
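
	As an illustration (a sketch, not intended as page text): setting
	SCHED_DEADLINE on the calling thread. There is no glibc wrapper,
	so it goes through syscall(2); the __NR_sched_setattr and
	SCHED_DEADLINE fallback values below are the x86-64/proposed-uapi
	numbers and are assumptions if your headers differ. The three
	times are in nanoseconds, the call needs appropriate privileges
	(EPERM otherwise), and the trailing 0 is the flags argument the
	interface grows later in this thread (a two-argument syscall
	simply ignores the extra register).

		#define _GNU_SOURCE
		#include <stdint.h>
		#include <stdio.h>
		#include <string.h>
		#include <unistd.h>
		#include <sys/syscall.h>

		#ifndef __NR_sched_setattr
		#define __NR_sched_setattr 314		/* x86-64; other archs differ */
		#endif
		#ifndef SCHED_DEADLINE
		#define SCHED_DEADLINE 6		/* from the proposed uapi headers */
		#endif

		struct sched_attr {
			uint32_t size;
			uint32_t sched_policy;
			uint64_t sched_flags;
			int32_t  sched_nice;		/* SCHED_OTHER, SCHED_BATCH */
			uint32_t sched_priority;	/* SCHED_FIFO, SCHED_RR */
			uint64_t sched_runtime;		/* SCHED_DEADLINE, nanoseconds */
			uint64_t sched_deadline;
			uint64_t sched_period;
		};

		int main(void)
		{
			struct sched_attr attr;

			memset(&attr, 0, sizeof(attr));	/* unused fields must be 0 */
			attr.size           = sizeof(attr);
			attr.sched_policy   = SCHED_DEADLINE;
			attr.sched_runtime  =  10 * 1000 * 1000;	/* 10ms of budget ...     */
			attr.sched_deadline =  30 * 1000 * 1000;	/* ... within 30ms ...    */
			attr.sched_period   = 100 * 1000 * 1000;	/* ... every 100ms period */

			if (syscall(__NR_sched_setattr, 0, &attr, 0) == -1) {
				/* e.g. EPERM, or EINVAL if admission control rejects the task */
				perror("sched_setattr");
				return 1;
			}
			return 0;
		}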

	sched_getattr() queries the scheduling policy currently applied
	to the process identified by pid.  If pid equals zero, the
	policy of the calling process will be retrieved.

	The size argument should reflect the size of struct sched_attr
	as known to userspace. The kernel fills out sched_attr::size to
	the size of its sched_attr structure. If the user-provided
	structure is larger, additional fields are not touched. If the
	user-provided structure is smaller, but the kernel needs to
	return values outside the provided space, the syscall will fail
	with -E2BIG.

	The other sched_attr fields are filled out as described in
	sched_setattr().


${insert SCHED_* descriptions}

    SCHED_DEADLINE: Sporadic task model deadline scheduling
	SCHED_DEADLINE is an implementation of GEDF (Global Earliest
	Deadline First) with additional CBS (Constant Bandwidth Server).
	The CBS guarantees that tasks that over-run their specified
	budget are throttled and do not affect the correct performance
	of other SCHED_DEADLINE tasks.

	SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN.

	Setting SCHED_DEADLINE can fail with -EINVAL when admission
	control tests fail.

${NOTE: should we change that to -EBUSY ? }


Other than that it's pretty much the same as the existing
SCHED_SETSCHEDULER(2) page.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI
  2014-01-21 15:38     ` Peter Zijlstra
@ 2014-01-21 15:46       ` Peter Zijlstra
  2014-01-21 16:02         ` Steven Rostedt
  2014-02-14 14:13       ` Michael Kerrisk (man-pages)
  1 sibling, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-01-21 15:46 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: Dario Faggioli, Thomas Gleixner, Ingo Molnar, rostedt,
	Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	Paul McKenney, insop.song, liming.wang, jkacur

On Tue, Jan 21, 2014 at 04:38:51PM +0100, Peter Zijlstra wrote:
>     SCHED_DEADLINE: Sporadic task model deadline scheduling
> 	SCHED_DEADLINE is an implementation of GEDF (Global Earliest
> 	Deadline First) with additional CBS (Constant Bandwidth Server).

We might want to re-word that to:

	SCHED_DEADLINE currently is an implementation of GEDF; however,
	any policy that correctly schedules the sporadic task model is
	a valid implementation.

To make sure we do not rely on the actual implementation; there are
many possible algorithms to schedule the sporadic task model.

> 	The CBS guarantees that tasks that over-run their specified
> 	budget are throttled and do not affect the correct performance
> 	of other SCHED_DEADLINE tasks.
> 
> 	SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN
> 
> 	Setting SCHED_DEADLINE can fail with -EINVAL when admission
> 	control tests fail.
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI
  2014-01-21 15:46       ` Peter Zijlstra
@ 2014-01-21 16:02         ` Steven Rostedt
  2014-01-21 16:06           ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Steven Rostedt @ 2014-01-21 16:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michael Kerrisk, Dario Faggioli, Thomas Gleixner, Ingo Molnar,
	Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	Paul McKenney, insop.song, liming.wang, jkacur

On Tue, 21 Jan 2014 16:46:03 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Jan 21, 2014 at 04:38:51PM +0100, Peter Zijlstra wrote:
> >     SCHED_DEADLINE: Sporadic task model deadline scheduling
> > 	SCHED_DEADLINE is an implementation of GEDF (Global Earliest
> > 	Deadline First) with additional CBS (Constant Bandwidth Server).
> 
> We might want to re-word that to:
> 
> 	SCHED_DEADLINE currently is an implementation of GEDF, however
> 	any policy that correctly schedules the sporadic task model is
> 	a valid implementation.
> 
> To make sure we should not rely on the actual implementation; there's
> many possible algorithms to schedule the sporadic task model.

Probably should post some links to GEDF documentation too?

-- Steve

> 
> > 	The CBS guarantees that tasks that over-run their specified
> > 	budget are throttled and do not affect the correct performance
> > 	of other SCHED_DEADLINE tasks.
> > 
> > 	SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN
> > 
> > 	Setting SCHED_DEADLINE can fail with -EINVAL when admission
> > 	control tests fail.
> > 


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI
  2014-01-21 16:02         ` Steven Rostedt
@ 2014-01-21 16:06           ` Peter Zijlstra
  2014-01-21 16:46             ` Juri Lelli
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-01-21 16:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Michael Kerrisk, Dario Faggioli, Thomas Gleixner, Ingo Molnar,
	Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	Paul McKenney, insop.song, liming.wang, jkacur

On Tue, Jan 21, 2014 at 11:02:55AM -0500, Steven Rostedt wrote:
> On Tue, 21 Jan 2014 16:46:03 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Tue, Jan 21, 2014 at 04:38:51PM +0100, Peter Zijlstra wrote:
> > >     SCHED_DEADLINE: Sporadic task model deadline scheduling
> > > 	SCHED_DEADLINE is an implementation of GEDF (Global Earliest
> > > 	Deadline First) with additional CBS (Constant Bandwidth Server).
> > 
> > We might want to re-word that to:
> > 
> > 	SCHED_DEADLINE currently is an implementation of GEDF, however
> > 	any policy that correctly schedules the sporadic task model is
> > 	a valid implementation.
> > 
> > To make sure we should not rely on the actual implementation; there's
> > many possible algorithms to schedule the sporadic task model.
> 
> Probably should post some links to GEDF documentation too?

At best I think we can do something like:

SEE ALSO
	Documentation/scheduler/sched_deadline.txt in the Linux kernel
	source tree (since kernel 3.14).

Possibly also an ISBN for a good scheduling theory book (if there exists
such a thing), but I would have to rely on others to provide such, as
my shelves are devoid of such material.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI
  2014-01-21 16:06           ` Peter Zijlstra
@ 2014-01-21 16:46             ` Juri Lelli
  0 siblings, 0 replies; 71+ messages in thread
From: Juri Lelli @ 2014-01-21 16:46 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt
  Cc: Michael Kerrisk, Dario Faggioli, Thomas Gleixner, Ingo Molnar,
	Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, Paul McKenney,
	insop.song, liming.wang, jkacur

On 01/21/2014 05:06 PM, Peter Zijlstra wrote:
> On Tue, Jan 21, 2014 at 11:02:55AM -0500, Steven Rostedt wrote:
>> On Tue, 21 Jan 2014 16:46:03 +0100
>> Peter Zijlstra <peterz@infradead.org> wrote:
>>
>>> On Tue, Jan 21, 2014 at 04:38:51PM +0100, Peter Zijlstra wrote:
>>>>     SCHED_DEADLINE: Sporadic task model deadline scheduling
>>>> 	SCHED_DEADLINE is an implementation of GEDF (Global Earliest
>>>> 	Deadline First) with additional CBS (Constant Bandwidth Server).
>>>
>>> We might want to re-word that to:
>>>
>>> 	SCHED_DEADLINE currently is an implementation of GEDF, however
>>> 	any policy that correctly schedules the sporadic task model is
>>> 	a valid implementation.
>>>
>>> To make sure we should not rely on the actual implementation; there's
>>> many possible algorithms to schedule the sporadic task model.
>>
>> Probably should post some links to GEDF documentation too?
> 
> At best I think we can do something like:
> 
> SEE ALSO
> 	Documentation/scheduler/sched_deadline.txt in the Linux kernel
> 	source tree (since kernel 3.14).
> 
> Possibly also an ISBN for a good scheduling theory book (if there exists
> such a thing), but I would have to rely on others to provide such as my
> shelfs are devoid of such material.
> 

Well, picking just one is not that easy, I'd say (among many others):

 - Handbook of Scheduling: Algorithms, Models, and Performance Analysis
   by Joseph Y-T. Leung, James H. Anderson - ISBN-10: 1584883979
   (especially chapter 30);
 - Hard Real-Time Computing Systems by Giorgio C. Buttazzo
   ISBN 978-1-4614-0675-4 (even if it is more about UP);
 - A survey of hard real-time scheduling for multiprocessor systems
   by RI Davis, A Burns - ACM Computing Surveys (CSUR), 2011
   (available at http://www-users.cs.york.ac.uk/~robdavis/papers/MPSurveyv5.0.pdf);

Probably the last one is best (as it is freely downloadable). We should
add something to the documentation too.

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI
  2013-12-17 12:27 ` [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI Peter Zijlstra
  2014-01-21 14:36   ` Michael Kerrisk
@ 2014-01-26  9:48   ` Geert Uytterhoeven
  1 sibling, 0 replies; 71+ messages in thread
From: Geert Uytterhoeven @ 2014-01-26  9:48 UTC (permalink / raw)
  To: Peter Zijlstra, Linux-Arch
  Cc: Thomas Gleixner, Ingo Molnar, Steven Rostedt, Oleg Nesterov,
	Frédéric Weisbecker, darren, johan.eker, p.faure,
	linux-kernel, Claudio Scordino, Michael Trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, nicola.manica,
	luca.abeni, Dhaval Giani, hgu1972, Paul McKenney, Dario Faggioli,
	insop.song, liming.wang, John Kacur

On Tue, Dec 17, 2013 at 1:27 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> From: Dario Faggioli <raistlin@linux.it>
>
> Add the syscalls needed for supporting scheduling algorithms
> with extended scheduling parameters (e.g., SCHED_DEADLINE).

>  - defines and implements the new scheduling related syscalls that
>    manipulate it, i.e., sched_setscheduler2(), sched_setattr()
>    and sched_getattr().

> Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
> proof of concept and for developing and testing purposes. Making them
> available on other architectures is straightforward.

Please cc linux-arch on new syscalls.

They're now in mainline, and in the end there are only 2 (sched_setattr() and
sched_getattr()).

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI
  2014-01-21 15:38     ` Peter Zijlstra
  2014-01-21 15:46       ` Peter Zijlstra
@ 2014-02-14 14:13       ` Michael Kerrisk (man-pages)
  2014-02-14 16:19         ` Peter Zijlstra
  1 sibling, 1 reply; 71+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-02-14 14:13 UTC (permalink / raw)
  To: Peter Zijlstra, Dario Faggioli
  Cc: Thomas Gleixner, Ingo Molnar, rostedt, Oleg Nesterov, fweisbec,
	darren, johan.eker, p.faure, Linux Kernel, claudio, michael,
	fchecconi, tommaso.cucinotta, juri.lelli, nicola.manica,
	luca.abeni, dhaval.giani, hgu1972, Paul McKenney, insop.song,
	liming.wang, jkacur, Michael Kerrisk

Peter, Dario,

This is a little late in the day, but I think it's an important point
to just check before this API goes final.

> SYNOPSIS
>         #include <sched.h>
>
>         struct sched_attr {
>                 u32 size;
>
>                 u32 sched_policy;
>                 u64 sched_flags;
[...]
>         };
>
>         int sched_setattr(pid_t pid, const struct sched_attr *attr);
>
>         int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size);

So, I see that there's a flags field in the structure, which allows for
some extensibility for these calls in the future. However, is it
worthwhile to consider adding a 'flags' argument in the base signature
of either of these calls, to allow for some possible extensions in the
future? (See http://lwn.net/SubscriberLink/585415/7b905c0248a158a2/ ).

Cheers,

Michael

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI
  2014-02-14 14:13       ` Michael Kerrisk (man-pages)
@ 2014-02-14 16:19         ` Peter Zijlstra
  2014-02-15 12:52           ` Ingo Molnar
                             ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-02-14 16:19 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Dario Faggioli, Thomas Gleixner, Ingo Molnar, rostedt,
	Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	Paul McKenney, insop.song, liming.wang, jkacur

On Fri, Feb 14, 2014 at 03:13:22PM +0100, Michael Kerrisk (man-pages) wrote:
> Peter, Dario,
> 
> This is a little late in the day, but I think it's an important point
> to just check before this API goes final.
> 
> > SYNOPSIS
> >         #include <sched.h>
> >
> >         struct sched_attr {
> >                 u32 size;
> >
> >                 u32 sched_policy;
> >                 u64 sched_flags;
> [...]
> >         };
> >
> >         int sched_setattr(pid_t pid, const struct sched_attr *attr);
> >
> >         int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size);
> 
> So, I that there's a flags field in the structure, which allows for
> some extensibility for these calls in the future. However, is it
> worthwhile to consider adding a 'flags' argument in the base signature
> of either of these calls, to allow for some possible extensions in the
> future. (See http://lwn.net/SubscriberLink/585415/7b905c0248a158a2/ ).

Sure why not.. I picked 'unsigned long' for the flags argument; I don't
think there's a real standard for this, I've seen: 'int' 'unsigned int'
and 'unsigned long' flags.

Please holler if there is indeed a preference and I picked the wrong
one.

BTW, do you need more text on the manpage thingy I sent you or was that
sufficient?

---
Subject: sched: Add 'flags' argument to sched_{set,get}attr() syscalls

Because of a recent syscall design debate; its deemed appropriate for
each syscall to have a flags argument for future extension; without
immediately requiring new syscalls.

Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/syscalls.h |  6 ++++--
 kernel/sched/core.c      | 11 ++++++-----
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 40ed9e9a77e5..bf41aeb09078 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -281,13 +281,15 @@ asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
 asmlinkage long sys_sched_setparam(pid_t pid,
 					struct sched_param __user *param);
 asmlinkage long sys_sched_setattr(pid_t pid,
-					struct sched_attr __user *attr);
+					struct sched_attr __user *attr,
+					unsigned long flags);
 asmlinkage long sys_sched_getscheduler(pid_t pid);
 asmlinkage long sys_sched_getparam(pid_t pid,
 					struct sched_param __user *param);
 asmlinkage long sys_sched_getattr(pid_t pid,
 					struct sched_attr __user *attr,
-					unsigned int size);
+					unsigned int size,
+					unsigned long flags);
 asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
 					unsigned long __user *user_mask_ptr);
 asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fb9764fbc537..deeaa54fdf92 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3631,13 +3631,14 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
  * @pid: the pid in question.
  * @uattr: structure containing the extended parameters.
  */
-SYSCALL_DEFINE2(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr)
+SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
+			       unsigned long, flags)
 {
 	struct sched_attr attr;
 	struct task_struct *p;
 	int retval;
 
-	if (!uattr || pid < 0)
+	if (!uattr || pid < 0 || flags)
 		return -EINVAL;
 
 	if (sched_copy_attr(uattr, &attr))
@@ -3774,8 +3775,8 @@ static int sched_read_attr(struct sched_attr __user *uattr,
  * @uattr: structure containing the extended parameters.
  * @size: sizeof(attr) for fwd/bwd comp.
  */
-SYSCALL_DEFINE3(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
-		unsigned int, size)
+SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
+		unsigned int, size, unsigned long, flags)
 {
 	struct sched_attr attr = {
 		.size = sizeof(struct sched_attr),
@@ -3784,7 +3785,7 @@ SYSCALL_DEFINE3(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
 	int retval;
 
 	if (!uattr || pid < 0 || size > PAGE_SIZE ||
-	    size < SCHED_ATTR_SIZE_VER0)
+	    size < SCHED_ATTR_SIZE_VER0 || flags)
 		return -EINVAL;
 
 	rcu_read_lock();

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI
  2014-02-14 16:19         ` Peter Zijlstra
@ 2014-02-15 12:52           ` Ingo Molnar
  2014-02-17 13:20           ` Michael Kerrisk (man-pages)
  2014-02-21 20:32           ` [tip:sched/urgent] sched: Add 'flags' argument to sched_{set, get}attr() syscalls tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 71+ messages in thread
From: Ingo Molnar @ 2014-02-15 12:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michael Kerrisk (man-pages),
	Dario Faggioli, Thomas Gleixner, Ingo Molnar, rostedt,
	Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	Paul McKenney, insop.song, liming.wang, jkacur


* Peter Zijlstra <peterz@infradead.org> wrote:

> > > SYNOPSIS
> > >         #include <sched.h>
> > >
> > >         struct sched_attr {
> > >                 u32 size;
> > >
> > >                 u32 sched_policy;
> > >                 u64 sched_flags;
> > [...]
> > >         };
> > >
> > >         int sched_setattr(pid_t pid, const struct sched_attr *attr);
> > >
> > >         int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size);
> > 
> > So, I that there's a flags field in the structure, which allows for
> > some extensibility for these calls in the future. However, is it
> > worthwhile to consider adding a 'flags' argument in the base signature
> > of either of these calls, to allow for some possible extensions in the
> > future. (See http://lwn.net/SubscriberLink/585415/7b905c0248a158a2/ ).
> 
> Sure why not.. I picked 'unsigned long' for the flags argument; I 
> don't think there's a real standard for this, I've seen: 'int' 
> 'unsigned int' and 'unsigned long' flags.
> 
> Please holler if there is indeed a preference and I picked the wrong
> one.

Yo!

So, since this is an ABI, if it's a true 64-bit flags value then 
please make it u64 - and 'unsigned int' or u32 otherwise. I don't 
think we have many (any?) 'long' argument syscall ABIs.

'unsigned long' is generally a bad choice because it's u32 on 32-bit 
platforms and u64 on 64-bit platforms.

Now, for syscall argument ABIs it's not a lethal mistake to make (as 
compared to say ABI data structures), because syscall arguments have 
their own types and width anyway, so any definition mistake can 
usually be fixed after the fact.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI
  2014-02-14 16:19         ` Peter Zijlstra
  2014-02-15 12:52           ` Ingo Molnar
@ 2014-02-17 13:20           ` Michael Kerrisk (man-pages)
  2014-04-09  9:25             ` sched_{set,get}attr() manpage Peter Zijlstra
  2014-04-28  8:18             ` Peter Zijlstra
  2014-02-21 20:32           ` [tip:sched/urgent] sched: Add 'flags' argument to sched_{set, get}attr() syscalls tip-bot for Peter Zijlstra
  2 siblings, 2 replies; 71+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-02-17 13:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mtk.manpages, Dario Faggioli, Thomas Gleixner, Ingo Molnar,
	rostedt, Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	Paul McKenney, insop.song, liming.wang, jkacur

Hello Peter,

On 02/14/2014 05:19 PM, Peter Zijlstra wrote:
> On Fri, Feb 14, 2014 at 03:13:22PM +0100, Michael Kerrisk (man-pages) wrote:
>> Peter, Dario,
>>
>> This is a little late in the day, but I think it's an important point
>> to just check before this API goes final.
>>
>>> SYNOPSIS
>>>         #include <sched.h>
>>>
>>>         struct sched_attr {
>>>                 u32 size;
>>>
>>>                 u32 sched_policy;
>>>                 u64 sched_flags;
>> [...]
>>>         };
>>>
>>>         int sched_setattr(pid_t pid, const struct sched_attr *attr);
>>>
>>>         int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size);
>>
>> So, I that there's a flags field in the structure, which allows for
>> some extensibility for these calls in the future. However, is it
>> worthwhile to consider adding a 'flags' argument in the base signature
>> of either of these calls, to allow for some possible extensions in the
>> future. (See http://lwn.net/SubscriberLink/585415/7b905c0248a158a2/ ).
> 
> Sure why not.. 

Well, it doesn't need to be added gratuitously -- just if you think there's
some nonzero chance it might prove useful in the future ;-).

> I picked 'unsigned long' for the flags argument; I don't
> think there's a real standard for this, I've seen: 'int' 'unsigned int'
> and 'unsigned long' flags.
> 
> Please holler if there is indeed a preference and I picked the wrong
> one.
> 
> BTW; do you need more text on the manpage thingy I send you or was that
> sufficient?

If you could take another pass through your existing text, to incorporate the
new flags stuff, and then send a page to me + linux-man@,
that would be great.

Cheers,

Michael


> ---
> Subject: sched: Add 'flags' argument to sched_{set,get}attr() syscalls
> 
> Because of a recent syscall design debate; its deemed appropriate for
> each syscall to have a flags argument for future extension; without
> immediately requiring new syscalls.
> 
> Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  include/linux/syscalls.h |  6 ++++--
>  kernel/sched/core.c      | 11 ++++++-----
>  2 files changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 40ed9e9a77e5..bf41aeb09078 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -281,13 +281,15 @@ asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
>  asmlinkage long sys_sched_setparam(pid_t pid,
>  					struct sched_param __user *param);
>  asmlinkage long sys_sched_setattr(pid_t pid,
> -					struct sched_attr __user *attr);
> +					struct sched_attr __user *attr,
> +					unsigned long flags);
>  asmlinkage long sys_sched_getscheduler(pid_t pid);
>  asmlinkage long sys_sched_getparam(pid_t pid,
>  					struct sched_param __user *param);
>  asmlinkage long sys_sched_getattr(pid_t pid,
>  					struct sched_attr __user *attr,
> -					unsigned int size);
> +					unsigned int size,
> +					unsigned long flags);
>  asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
>  					unsigned long __user *user_mask_ptr);
>  asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fb9764fbc537..deeaa54fdf92 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3631,13 +3631,14 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
>   * @pid: the pid in question.
>   * @uattr: structure containing the extended parameters.
>   */
> -SYSCALL_DEFINE2(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr)
> +SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
> +			       unsigned long, flags)
>  {
>  	struct sched_attr attr;
>  	struct task_struct *p;
>  	int retval;
>  
> -	if (!uattr || pid < 0)
> +	if (!uattr || pid < 0 || flags)
>  		return -EINVAL;
>  
>  	if (sched_copy_attr(uattr, &attr))
> @@ -3774,8 +3775,8 @@ static int sched_read_attr(struct sched_attr __user *uattr,
>   * @uattr: structure containing the extended parameters.
>   * @size: sizeof(attr) for fwd/bwd comp.
>   */
> -SYSCALL_DEFINE3(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
> -		unsigned int, size)
> +SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
> +		unsigned int, size, unsigned long, flags)
>  {
>  	struct sched_attr attr = {
>  		.size = sizeof(struct sched_attr),
> @@ -3784,7 +3785,7 @@ SYSCALL_DEFINE3(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
>  	int retval;
>  
>  	if (!uattr || pid < 0 || size > PAGE_SIZE ||
> -	    size < SCHED_ATTR_SIZE_VER0)
> +	    size < SCHED_ATTR_SIZE_VER0 || flags)
>  		return -EINVAL;
>  
>  	rcu_read_lock();
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [tip:sched/urgent] sched: Add 'flags' argument to sched_{set, get}attr() syscalls
  2014-02-14 16:19         ` Peter Zijlstra
  2014-02-15 12:52           ` Ingo Molnar
  2014-02-17 13:20           ` Michael Kerrisk (man-pages)
@ 2014-02-21 20:32           ` tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 71+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-02-21 20:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, mingo, hpa, mingo, peterz, tglx, mtk.manpages

Commit-ID:  6d35ab48090b10c5ea5604ed5d6e91f302dc6060
Gitweb:     http://git.kernel.org/tip/6d35ab48090b10c5ea5604ed5d6e91f302dc6060
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 14 Feb 2014 17:19:29 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 21 Feb 2014 21:27:10 +0100

sched: Add 'flags' argument to sched_{set,get}attr() syscalls

Because of a recent syscall design debate; its deemed appropriate for
each syscall to have a flags argument for future extension; without
immediately requiring new syscalls.

Cc: juri.lelli@gmail.com
Cc: Ingo Molnar <mingo@redhat.com>
Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140214161929.GL27965@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/syscalls.h |  6 ++++--
 kernel/sched/core.c      | 11 ++++++-----
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 40ed9e9..a747a77 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -281,13 +281,15 @@ asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
 asmlinkage long sys_sched_setparam(pid_t pid,
 					struct sched_param __user *param);
 asmlinkage long sys_sched_setattr(pid_t pid,
-					struct sched_attr __user *attr);
+					struct sched_attr __user *attr,
+					unsigned int flags);
 asmlinkage long sys_sched_getscheduler(pid_t pid);
 asmlinkage long sys_sched_getparam(pid_t pid,
 					struct sched_param __user *param);
 asmlinkage long sys_sched_getattr(pid_t pid,
 					struct sched_attr __user *attr,
-					unsigned int size);
+					unsigned int size,
+					unsigned int flags);
 asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
 					unsigned long __user *user_mask_ptr);
 asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a6e7470..6edbef2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3661,13 +3661,14 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
  * @pid: the pid in question.
  * @uattr: structure containing the extended parameters.
  */
-SYSCALL_DEFINE2(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr)
+SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
+			       unsigned int, flags)
 {
 	struct sched_attr attr;
 	struct task_struct *p;
 	int retval;
 
-	if (!uattr || pid < 0)
+	if (!uattr || pid < 0 || flags)
 		return -EINVAL;
 
 	if (sched_copy_attr(uattr, &attr))
@@ -3804,8 +3805,8 @@ err_size:
  * @uattr: structure containing the extended parameters.
  * @size: sizeof(attr) for fwd/bwd comp.
  */
-SYSCALL_DEFINE3(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
-		unsigned int, size)
+SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
+		unsigned int, size, unsigned int, flags)
 {
 	struct sched_attr attr = {
 		.size = sizeof(struct sched_attr),
@@ -3814,7 +3815,7 @@ SYSCALL_DEFINE3(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
 	int retval;
 
 	if (!uattr || pid < 0 || size > PAGE_SIZE ||
-	    size < SCHED_ATTR_SIZE_VER0)
+	    size < SCHED_ATTR_SIZE_VER0 || flags)
 		return -EINVAL;
 
 	rcu_read_lock();

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* sched_{set,get}attr() manpage
  2014-02-17 13:20           ` Michael Kerrisk (man-pages)
@ 2014-04-09  9:25             ` Peter Zijlstra
  2014-04-09 15:19               ` Henrik Austad
  2014-04-28  8:18             ` Peter Zijlstra
  1 sibling, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-04-09  9:25 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Dario Faggioli, Thomas Gleixner, Ingo Molnar, rostedt,
	Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	Paul McKenney, insop.song, liming.wang, jkacur, linux-man

On Mon, Feb 17, 2014 at 02:20:29PM +0100, Michael Kerrisk (man-pages) wrote:
> If your could take another pass though your existing text, to incorporate the
> new flags stuff, and then send a page to me + linux-man@
> that would be great.


Sorry, this slipped my mind. An updated version below. Heavy borrowing
from SCHED_SETSCHEDULER(2) as before.

---

NAME
	sched_setattr, sched_getattr - set and get scheduling policy/attributes

SYNOPSIS
	#include <sched.h>

	struct sched_attr {
		u32 size;
		u32 sched_policy;
		u64 sched_flags;

		/* SCHED_NORMAL, SCHED_BATCH */
		s32 sched_nice;
		/* SCHED_FIFO, SCHED_RR */
		u32 sched_priority;
		/* SCHED_DEADLINE */
		u64 sched_runtime;
		u64 sched_deadline;
		u64 sched_period;
	};
	int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags);

	int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size, unsigned int flags);

DESCRIPTION
	sched_setattr() sets both the scheduling policy and the
	associated attributes for the process whose ID is specified in
	pid.  If pid equals zero, the scheduling policy and attributes
	of the calling process will be set.  The interpretation of the
	argument attr depends on the selected policy.  Currently, Linux
	supports the following "normal" (i.e., non-real-time) scheduling
	policies:

	SCHED_OTHER	the standard "fair" time-sharing policy;

	SCHED_BATCH	for "batch" style execution of processes; and

	SCHED_IDLE	for running very low priority background jobs.

	The following "real-time" policies are also supported, for
	special time-critical applications that need precise control
	over the way in which runnable processes are selected for
	execution:

	SCHED_FIFO	a first-in, first-out policy;

	SCHED_RR	a round-robin policy; and

	SCHED_DEADLINE	a deadline policy.

	The semantics of each of these policies are detailed below.

	sched_attr::size must be set to the size of the structure, as in
	sizeof(struct sched_attr). If the provided structure is smaller
	than the kernel structure, any additional fields are assumed to
	be '0'. If the provided structure is larger than the kernel
	structure, the kernel verifies that all additional fields are
	'0'; if they are not, the syscall will fail with -E2BIG.

	sched_attr::sched_policy is the desired scheduling policy.

	sched_attr::sched_flags holds additional flags that can
	influence scheduling behaviour. Currently, as of Linux kernel
	3.14:

		SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
		to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
		on fork().

	is the only supported flag.

	sched_attr::sched_nice should only be set for SCHED_OTHER and
	SCHED_BATCH; it is the desired nice value [-20,19], see NICE(2).

	sched_attr::sched_priority should only be set for SCHED_FIFO and
	SCHED_RR; it is the desired static priority [1,99].

	sched_attr::sched_runtime
	sched_attr::sched_deadline
	sched_attr::sched_period should only be set for SCHED_DEADLINE
	and are the traditional sporadic task model parameters.

	The flags argument should be 0.
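
	For illustration (a sketch, not page text): making the calling
	thread SCHED_FIFO with SCHED_FLAG_RESET_ON_FORK set, through the
	raw syscall since no libc wrapper exists. The numeric fallbacks
	(the x86-64 syscall number and the flag value) are assumptions
	taken from current headers; adjust if yours differ. Needs
	appropriate privileges.

		#define _GNU_SOURCE
		#include <stdint.h>
		#include <stdio.h>
		#include <string.h>
		#include <unistd.h>
		#include <sched.h>			/* SCHED_FIFO */
		#include <sys/syscall.h>

		#ifndef __NR_sched_setattr
		#define __NR_sched_setattr 314		/* x86-64 */
		#endif
		#ifndef SCHED_FLAG_RESET_ON_FORK
		#define SCHED_FLAG_RESET_ON_FORK 0x01
		#endif

		struct sched_attr {
			uint32_t size;
			uint32_t sched_policy;
			uint64_t sched_flags;
			int32_t  sched_nice;
			uint32_t sched_priority;
			uint64_t sched_runtime;
			uint64_t sched_deadline;
			uint64_t sched_period;
		};

		int main(void)
		{
			struct sched_attr attr;

			memset(&attr, 0, sizeof(attr));
			attr.size           = sizeof(attr);
			attr.sched_policy   = SCHED_FIFO;
			attr.sched_priority = 10;		/* static priority, [1,99] */
			/* children of this thread fall back to SCHED_OTHER */
			attr.sched_flags    = SCHED_FLAG_RESET_ON_FORK;

			/* the last argument is the syscall-level flags; must be 0 */
			if (syscall(__NR_sched_setattr, 0, &attr, 0) == -1) {
				perror("sched_setattr");
				return 1;
			}
			return 0;
		}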

	sched_getattr() queries the scheduling policy currently applied
	to the process identified by pid.  If pid equals zero, the
	policy of the calling process will be retrieved.

	The size argument should reflect the size of struct sched_attr
	as known to userspace. The kernel fills out sched_attr::size to
	the size of its sched_attr structure. If the user-provided
	structure is larger, additional fields are not touched. If the
	user-provided structure is smaller, but the kernel needs to
	return values outside the provided space, the syscall will fail
	with -E2BIG.

	The flags argument should be 0.

	The other sched_attr fields are filled out as described in
	sched_setattr().

   Scheduling Policies
       The  scheduler  is  the  kernel  component  that decides which runnable
       process will be executed by the CPU next.  Each process has an  associ‐
       ated  scheduling  policy and a static scheduling priority, sched_prior‐
       ity; these are the settings that are modified by  sched_setscheduler().
       The  scheduler  makes its decisions based on knowledge of the scheduling
       policy and static priority of all processes on the system.

       For processes scheduled under one of  the  normal  scheduling  policies
       (SCHED_OTHER,  SCHED_IDLE,  SCHED_BATCH), sched_priority is not used in
       scheduling decisions (it must be specified as 0).

       Processes scheduled under one of the  real-time  policies  (SCHED_FIFO,
       SCHED_RR)  have  a  sched_priority  value  in  the  range 1 (low) to 99
       (high).  (As the numbers imply, real-time processes always have  higher
       priority than normal processes.)  Note well: POSIX.1-2001 only requires
       an implementation to support a minimum 32 distinct priority levels  for
       the  real-time  policies,  and  some  systems supply just this minimum.
       Portable   programs   should    use    sched_get_priority_min(2)    and
       sched_get_priority_max(2) to find the range of priorities supported for
       a particular policy.

       Conceptually, the scheduler maintains a list of runnable processes  for
       each  possible  sched_priority  value.   In  order  to  determine which
       process runs next, the scheduler looks for the nonempty list  with  the
       highest  static  priority  and  selects the process at the head of this
       list.

       A process's scheduling policy determines where it will be inserted into
       the  list  of processes with equal static priority and how it will move
       inside this list.

       All scheduling is preemptive: if a process with a higher static  prior‐
       ity  becomes  ready  to run, the currently running process will be pre‐
       empted and returned to the wait list for  its  static  priority  level.
       The  scheduling  policy only determines the ordering within the list of
       runnable processes with equal static priority.

    SCHED_DEADLINE: Sporadic task model deadline scheduling
	SCHED_DEADLINE is an implementation of GEDF (Global Earliest
	Deadline First) with additional CBS (Constant Bandwidth Server).
	The CBS guarantees that tasks that over-run their specified
	budget are throttled and do not affect the correct performance
	of other SCHED_DEADLINE tasks.

	SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN.

	Setting SCHED_DEADLINE can fail with -EINVAL when admission
	control tests fail.

   SCHED_FIFO: First In-First Out scheduling
       SCHED_FIFO can only be used with static priorities higher than 0, which
       means that when a SCHED_FIFO process becomes runnable, it will always
       immediately preempt any currently running SCHED_OTHER, SCHED_BATCH,  or
       SCHED_IDLE  process.  SCHED_FIFO is a simple scheduling algorithm with‐
       out time slicing.  For processes scheduled under the SCHED_FIFO policy,
       the following rules apply:

       *  A  SCHED_FIFO  process that has been preempted by another process of
          higher priority will stay at the head of the list for  its  priority
          and  will resume execution as soon as all processes of higher prior‐
          ity are blocked again.

       *  When a SCHED_FIFO process becomes runnable, it will be  inserted  at
          the end of the list for its priority.

       *  A  call  to  sched_setscheduler()  or sched_setparam(2) will put the
          SCHED_FIFO (or SCHED_RR) process identified by pid at the  start  of
          the  list  if it was runnable.  As a consequence, it may preempt the
          currently  running  process   if   it   has   the   same   priority.
          (POSIX.1-2001 specifies that the process should go to the end of the
          list.)

       *  A process calling sched_yield(2) will be put at the end of the list.

       No other events will move a process scheduled under the SCHED_FIFO pol‐
       icy in the wait list of runnable processes with equal static priority.

       A SCHED_FIFO process runs until either it is blocked by an I/O request,
       it  is  preempted  by  a  higher  priority   process,   or   it   calls
       sched_yield(2).

   SCHED_RR: Round Robin scheduling
       SCHED_RR  is  a simple enhancement of SCHED_FIFO.  Everything described
       above for SCHED_FIFO also applies to SCHED_RR, except that each process
       is  only  allowed  to  run  for  a maximum time quantum.  If a SCHED_RR
       process has been running for a time period equal to or longer than  the
       time  quantum,  it will be put at the end of the list for its priority.
       A SCHED_RR process that has been preempted by a higher priority process
       and  subsequently  resumes execution as a running process will complete
       the unexpired portion of its round robin time quantum.  The  length  of
       the time quantum can be retrieved using sched_rr_get_interval(2).

   SCHED_OTHER: Default Linux time-sharing scheduling
       SCHED_OTHER  can only be used at static priority 0.  SCHED_OTHER is the
       standard Linux time-sharing scheduler that is  intended  for  all  pro‐
       cesses  that  do  not  require  the  special real-time mechanisms.  The
       process to run is chosen from the static priority 0  list  based  on  a
       dynamic priority that is determined only inside this list.  The dynamic
       priority is based on the nice value (set by nice(2) or  setpriority(2))
       and  increased  for  each time quantum the process is ready to run, but
       denied to run by the scheduler.  This ensures fair progress  among  all
       SCHED_OTHER processes.

   SCHED_BATCH: Scheduling batch processes
       (Since  Linux 2.6.16.)  SCHED_BATCH can only be used at static priority
       0.  This policy is similar to SCHED_OTHER  in  that  it  schedules  the
       process  according  to  its dynamic priority (based on the nice value).
       The difference is that this policy will cause the scheduler  to  always
       assume  that the process is CPU-intensive.  Consequently, the scheduler
       will apply a small scheduling penalty with respect to wakeup behaviour,
       so that this process is mildly disfavored in scheduling decisions.

       This policy is useful for workloads that are noninteractive, but do not
       want to lower their nice value, and for workloads that want a determin‐
       istic scheduling policy without interactivity causing extra preemptions
       (between the workload's tasks).

   SCHED_IDLE: Scheduling very low priority jobs
       (Since Linux 2.6.23.)  SCHED_IDLE can only be used at  static  priority
       0; the process nice value has no influence for this policy.

       This  policy  is  intended  for  running jobs at extremely low priority
       (lower even than a +19 nice value with the SCHED_OTHER  or  SCHED_BATCH
       policies).

RETURN VALUE
	On success, sched_setattr() and sched_getattr() return 0. On
	error, -1 is returned, and errno is set appropriately.

ERRORS
       EINVAL The scheduling policy is not one  of  the  recognized  policies,
              attr is NULL, or attr does not make sense for the policy.

       EPERM  The calling process does not have appropriate privileges.

       ESRCH  The process whose ID is pid could not be found.

       E2BIG  The provided storage for struct sched_attr is either too
              big, see sched_setattr(), or too small, see sched_getattr().
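
	A usage sketch (illustrative, not page text): querying the calling
	thread with the four-argument sched_getattr(). The raw syscall
	number fallback is the x86-64 value and is an assumption; other
	architectures differ.

		#define _GNU_SOURCE
		#include <errno.h>
		#include <stdint.h>
		#include <stdio.h>
		#include <string.h>
		#include <unistd.h>
		#include <sys/syscall.h>

		#ifndef __NR_sched_getattr
		#define __NR_sched_getattr 315		/* x86-64 */
		#endif

		struct sched_attr {
			uint32_t size;
			uint32_t sched_policy;
			uint64_t sched_flags;
			int32_t  sched_nice;
			uint32_t sched_priority;
			uint64_t sched_runtime;
			uint64_t sched_deadline;
			uint64_t sched_period;
		};

		int main(void)
		{
			struct sched_attr attr;

			memset(&attr, 0, sizeof(attr));

			/* pid 0 = calling thread; the trailing 0 is the (unused) flags */
			if (syscall(__NR_sched_getattr, 0, &attr, sizeof(attr), 0) == -1) {
				if (errno == E2BIG)
					fprintf(stderr, "kernel sched_attr is larger than ours "
							"and carries fields we cannot receive\n");
				else
					perror("sched_getattr");
				return 1;
			}

			printf("policy=%u nice=%d prio=%u runtime=%llu\n",
			       (unsigned)attr.sched_policy, (int)attr.sched_nice,
			       (unsigned)attr.sched_priority,
			       (unsigned long long)attr.sched_runtime);
			return 0;
		}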

NOTES
	While the text above (and in SCHED_SETSCHEDULER(2)) talks about
	processes, in actual fact these system calls are thread specific.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-04-09  9:25             ` sched_{set,get}attr() manpage Peter Zijlstra
@ 2014-04-09 15:19               ` Henrik Austad
  2014-04-09 15:42                 ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Henrik Austad @ 2014-04-09 15:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michael Kerrisk (man-pages),
	Dario Faggioli, Thomas Gleixner, Ingo Molnar, rostedt,
	Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	Paul McKenney, insop.song, liming.wang, jkacur, linux-man

On Wed, Apr 09, 2014 at 11:25:10AM +0200, Peter Zijlstra wrote:
> On Mon, Feb 17, 2014 at 02:20:29PM +0100, Michael Kerrisk (man-pages) wrote:
> > If your could take another pass though your existing text, to incorporate the
> > new flags stuff, and then send a page to me + linux-man@
> > that would be great.
> 
> 
> Sorry, this slipped my mind. An updated version below. Heavy borrowing
> from SCHED_SETSCHEDULER(2) as before.
> 
> ---
> 
> NAME
> 	sched_setattr, sched_getattr - set and get scheduling policy/attributes
> 
> SYNOPSIS
> 	#include <sched.h>
> 
> 	struct sched_attr {
> 		u32 size;
> 		u32 sched_policy;
> 		u64 sched_flags;
> 
> 		/* SCHED_NORMAL, SCHED_BATCH */
> 		s32 sched_nice;
> 		/* SCHED_FIFO, SCHED_RR */
> 		u32 sched_priority;
> 		/* SCHED_DEADLINE */
> 		u64 sched_runtime;
> 		u64 sched_deadline;
> 		u64 sched_period;
> 	};
> 	int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags);
> 
> 	int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size, unsigned int flags);
> 
> DESCRIPTION
> 	sched_setattr() sets both the scheduling policy and the
> 	associated attributes for the process whose ID is specified in
> 	pid.  If pid equals zero, the scheduling policy and attributes
> 	of the calling process will be set.  The interpretation of the
> 	argument attr depends on the selected policy.  Currently, Linux
> 	supports the following "normal" (i.e., non-real-time) scheduling
> 	policies:
> 
> 	SCHED_OTHER	the standard "fair" time-sharing policy;
> 
> 	SCHED_BATCH	for "batch" style execution of processes; and
> 
> 	SCHED_IDLE	for running very low priority background jobs.
> 
> 	The following "real-time" policies are also supported, for

why the "'s?

> 	special time-critical applications that need precise control
> 	over the way in which runnable processes are selected for
> 	execution:
> 
> 	SCHED_FIFO	a first-in, first-out policy;
> 
> 	SCHED_RR	a round-robin policy; and
> 
> 	SCHED_DEADLINE	a deadline policy.
> 
> 	The semantics of each of these policies are detailed below.
> 
> 	sched_attr::size must be set to the size of the structure, as in
> 	sizeof(struct sched_attr), if the provided structure is smaller
> 	than the kernel structure, any additional fields are assumed
> 	'0'. If the provided structure is larger than the kernel
> 	structure, the kernel verifies all additional fields are '0' if
> 	not the syscall will fail with -E2BIG.
> 
> 	sched_attr::sched_policy the desired scheduling policy.
> 
> 	sched_attr::sched_flags additional flags that can influence
> 	scheduling behaviour. Currently as per Linux kernel 3.14:
> 
> 		SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
> 		to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
> 		on fork().
> 
> 	is the only supported flag.
> 
> 	sched_attr::sched_nice should only be set for SCHED_OTHER,
> 	SCHED_BATCH, the desired nice value [-20,19], see NICE(2).
> 
> 	sched_attr::sched_priority should only be set for SCHED_FIFO,
> 	SCHED_RR, the desired static priority [1,99].
> 
> 	sched_attr::sched_runtime
> 	sched_attr::sched_deadline
> 	sched_attr::sched_period should only be set for SCHED_DEADLINE
> 	and are the traditional sporadic task model parameters.
> 
> 	The flags argument should be 0.
> 
> 	sched_getattr() queries the scheduling policy currently applied
> 	to the process identified by pid.  If pid equals zero, the
> 	policy of the calling process will be retrieved.
> 
> 	The size argument should reflect the size of struct sched_attr
> 	as known to userspace. The kernel fills out sched_attr::size to
> 	the size of its sched_attr structure. If the user provided
> 	structure is larger, additional fields are not touched. If the
> 	user provided structure is smaller, but the kernel needs to
> 	return values outside the provided space, the syscall will fail
> 	with -E2BIG.
> 
> 	The flags argument should be 0.

What about SCHED_FLAG_RESET_ON_FORK?

> 	The other sched_attr fields are filled out as described in
> 	sched_setattr().
> 
>    Scheduling Policies
>        The  scheduler  is  the  kernel  component  that decides which runnable
>        process will be executed by the CPU next.  Each process has an  associ‐
>        ated  scheduling  policy and a static scheduling priority, sched_prior‐
>        ity; these are the settings that are modified by  sched_setscheduler().
>        The  scheduler  makes it decisions based on knowledge of the scheduling
>        policy and static priority of all processes on the system.

Isn't this last sentence redundant/slightly repetitive?

>        For processes scheduled under one of  the  normal  scheduling  policies
>        (SCHED_OTHER,  SCHED_IDLE,  SCHED_BATCH), sched_priority is not used in
>        scheduling decisions (it must be specified as 0).
> 
>        Processes scheduled under one of the  real-time  policies  (SCHED_FIFO,
>        SCHED_RR)  have  a  sched_priority  value  in  the  range 1 (low) to 99
>        (high).  (As the numbers imply, real-time processes always have  higher
>        priority than normal processes.)  Note well: POSIX.1-2001 only requires
>        an implementation to support a minimum 32 distinct priority levels  for
>        the  real-time  policies,  and  some  systems supply just this minimum.
>        Portable   programs   should    use    sched_get_priority_min(2)    and
>        sched_get_priority_max(2) to find the range of priorities supported for
>        a particular policy.
> 
>        Conceptually, the scheduler maintains a list of runnable processes  for
>        each  possible  sched_priority  value.   In  order  to  determine which
>        process runs next, the scheduler looks for the nonempty list  with  the
>        highest  static  priority  and  selects the process at the head of this
>        list.
> 
>        A process's scheduling policy determines where it will be inserted into
>        the  list  of processes with equal static priority and how it will move
>        inside this list.
> 
>        All scheduling is preemptive: if a process with a higher static  prior‐
>        ity  becomes  ready  to run, the currently running process will be pre‐
>        empted and returned to the wait list for  its  static  priority  level.
>        The  scheduling  policy only determines the ordering within the list of
>        runnable processes with equal static priority.
> 
>     SCHED_DEADLINE: Sporadic task model deadline scheduling
> 	SCHED_DEADLINE is an implementation of GEDF (Global Earliest
> 	Deadline First) with additional CBS (Constant Bandwidth Server).
> 	The CBS guarantees that tasks that over-run their specified
> 	budget are throttled and do not affect the correct performance
> 	of other SCHED_DEADLINE tasks.
> 
> 	SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN
> 
> 	Setting SCHED_DEADLINE can fail with -EINVAL when admission
> 	control tests fail.

Perhaps add a note about the deadline-class having higher priority than the 
other classes; i.e. if a deadline-task is runnable, it will preempt any 
other SCHED_(RR|FIFO) task regardless of priority?

>    SCHED_FIFO: First In-First Out scheduling
>        SCHED_FIFO can only be used with static priorities higher than 0, which
>        means that when a SCHED_FIFO processes becomes runnable, it will always
>        immediately preempt any currently running SCHED_OTHER, SCHED_BATCH,  or
>        SCHED_IDLE  process.  SCHED_FIFO is a simple scheduling algorithm with‐
>        out time slicing.  For processes scheduled under the SCHED_FIFO policy,
>        the following rules apply:
> 
>        *  A  SCHED_FIFO  process that has been preempted by another process of
>           higher priority will stay at the head of the list for  its  priority
>           and  will resume execution as soon as all processes of higher prior‐
>           ity are blocked again.
> 
>        *  When a SCHED_FIFO process becomes runnable, it will be  inserted  at
>           the end of the list for its priority.
> 
>        *  A  call  to  sched_setscheduler()  or sched_setparam(2) will put the
>           SCHED_FIFO (or SCHED_RR) process identified by pid at the  start  of
>           the  list  if it was runnable.  As a consequence, it may preempt the
>           currently  running  process   if   it   has   the   same   priority.
>           (POSIX.1-2001 specifies that the process should go to the end of the
>           list.)
> 
>        *  A process calling sched_yield(2) will be put at the end of the list.

How about the recent discussion regarding sched_yield()? Is this correct?

lkml.kernel.org/r/alpine.DEB.2.02.1403312333100.14882@ionos.tec.linutronix.de

Is this the correct place to add a note explaining the potential pitfalls 
of using sched_yield?

>        No other events will move a process scheduled under the SCHED_FIFO pol‐
>        icy in the wait list of runnable processes with equal static priority.
> 
>        A SCHED_FIFO process runs until either it is blocked by an I/O request,
>        it  is  preempted  by  a  higher  priority   process,   or   it   calls
>        sched_yield(2).
> 
>    SCHED_RR: Round Robin scheduling
>        SCHED_RR  is  a simple enhancement of SCHED_FIFO.  Everything described
>        above for SCHED_FIFO also applies to SCHED_RR, except that each process
>        is  only  allowed  to  run  for  a maximum time quantum.  If a SCHED_RR
>        process has been running for a time period equal to or longer than  the
>        time  quantum,  it will be put at the end of the list for its priority.
>        A SCHED_RR process that has been preempted by a higher priority process
>        and  subsequently  resumes execution as a running process will complete
>        the unexpired portion of its round robin time quantum.  The  length  of
>        the time quantum can be retrieved using sched_rr_get_interval(2).

-> The default is 100 ms (RR_TIMESLICE, i.e. 0.1 * HZ jiffies).

This is a question I get from time to time; having this in the manpage 
would be helpful.

>    SCHED_OTHER: Default Linux time-sharing scheduling
>        SCHED_OTHER  can only be used at static priority 0.  SCHED_OTHER is the
>        standard Linux time-sharing scheduler that is  intended  for  all  pro‐
>        cesses  that  do  not  require  the  special real-time mechanisms.  The
>        process to run is chosen from the static priority 0  list  based  on  a
>        dynamic priority that is determined only inside this list.  The dynamic
>        priority is based on the nice value (set by nice(2) or  setpriority(2))
>        and  increased  for  each time quantum the process is ready to run, but
>        denied to run by the scheduler.  This ensures fair progress  among  all
>        SCHED_OTHER processes.
> 
>    SCHED_BATCH: Scheduling batch processes
>        (Since  Linux 2.6.16.)  SCHED_BATCH can only be used at static priority
>        0.  This policy is similar to SCHED_OTHER  in  that  it  schedules  the
>        process  according  to  its dynamic priority (based on the nice value).
>        The difference is that this policy will cause the scheduler  to  always
>        assume  that the process is CPU-intensive.  Consequently, the scheduler
>        will apply a small scheduling penalty with respect to wakeup behaviour,
>        so that this process is mildly disfavored in scheduling decisions.
> 
>        This policy is useful for workloads that are noninteractive, but do not
>        want to lower their nice value, and for workloads that want a determin‐
>        istic scheduling policy without interactivity causing extra preemptions
>        (between the workload's tasks).
> 
>    SCHED_IDLE: Scheduling very low priority jobs
>        (Since Linux 2.6.23.)  SCHED_IDLE can only be used at  static  priority
>        0; the process nice value has no influence for this policy.
> 
>        This  policy  is  intended  for  running jobs at extremely low priority
>        (lower even than a +19 nice value with the SCHED_OTHER  or  SCHED_BATCH
>        policies).
> 
> RETURN VALUE
> 	On success, sched_setattr() and sched_getattr() return 0. On
> 	error, -1 is returned, and errno is set appropriately.
> 
> ERRORS
>        EINVAL The scheduling policy is not one  of  the  recognized  policies,
>               param is NULL, or param does not make sense for the policy.
> 
>        EPERM  The calling process does not have appropriate privileges.
> 
>        ESRCH  The process whose ID is pid could not be found.
> 
>        E2BIG  The provided storage for struct sched_attr is either too
>               big, see sched_setattr(), or too small, see sched_getattr().

Where's the EBUSY? It can throw this from __sched_setscheduler() when it 
checks if there's enough bandwidth to run the task.

> 
> NOTES
> 	While the text above (and in SCHED_SETSCHEDULER(2)) talks about
> 	processes, in actual fact these system calls are thread specific.


-- 
Henrik Austad

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-04-09 15:19               ` Henrik Austad
@ 2014-04-09 15:42                 ` Peter Zijlstra
  2014-04-10  7:47                   ` Juri Lelli
  2014-04-27 15:47                   ` Michael Kerrisk (man-pages)
  0 siblings, 2 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-04-09 15:42 UTC (permalink / raw)
  To: Henrik Austad
  Cc: Michael Kerrisk (man-pages),
	Dario Faggioli, Thomas Gleixner, Ingo Molnar, rostedt,
	Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	Paul McKenney, insop.song, liming.wang, jkacur, linux-man

On Wed, Apr 09, 2014 at 05:19:11PM +0200, Henrik Austad wrote:
> > 	The following "real-time" policies are also supported, for
> 
> why the "'s?

I borrowed those from SCHED_SETSCHEDULER(2).

> > 	sched_attr::sched_flags additional flags that can influence
> > 	scheduling behaviour. Currently as per Linux kernel 3.14:
> > 
> > 		SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
> > 		to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
> > 		on fork().
> > 
> > 	is the only supported flag.

...

> > 	The flags argument should be 0.
> 
> What about SCHED_FLAG_RESET_ON_FOR?

Different flags. The one is sched_attr::sched_flags, the other is
sched_setattr(.flags).

> > 	The other sched_attr fields are filled out as described in
> > 	sched_setattr().
> > 
> >    Scheduling Policies
> >        The  scheduler  is  the  kernel  component  that decides which runnable
> >        process will be executed by the CPU next.  Each process has an  associ‐
> >        ated  scheduling  policy and a static scheduling priority, sched_prior‐
> >        ity; these are the settings that are modified by  sched_setscheduler().
> >        The  scheduler  makes it decisions based on knowledge of the scheduling
> >        policy and static priority of all processes on the system.
> 
> Isn't this last sentence redundant/sliglhtly repetitive?

I borrowed that from SCHED_SETSCHEDULER(2) again.

> >     SCHED_DEADLINE: Sporadic task model deadline scheduling
> > 	SCHED_DEADLINE is an implementation of GEDF (Global Earliest
> > 	Deadline First) with additional CBS (Constant Bandwidth Server).
> > 	The CBS guarantees that tasks that over-run their specified
> > 	budget are throttled and do not affect the correct performance
> > 	of other SCHED_DEADLINE tasks.
> > 
> > 	SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN
> > 
> > 	Setting SCHED_DEADLINE can fail with -EINVAL when admission
> > 	control tests fail.
> 
> Perhaps add a note about the deadline-class having higher priority than the 
> other classes; i.e. if a deadline-task is runnable, it will preempt any 
> other SCHED_(RR|FIFO) regardless of priority?

Yes, good point, will do.

> >    SCHED_FIFO: First In-First Out scheduling
> >        SCHED_FIFO can only be used with static priorities higher than 0, which
> >        means that when a SCHED_FIFO processes becomes runnable, it will always
> >        immediately preempt any currently running SCHED_OTHER, SCHED_BATCH,  or
> >        SCHED_IDLE  process.  SCHED_FIFO is a simple scheduling algorithm with‐
> >        out time slicing.  For processes scheduled under the SCHED_FIFO policy,
> >        the following rules apply:
> > 
> >        *  A  SCHED_FIFO  process that has been preempted by another process of
> >           higher priority will stay at the head of the list for  its  priority
> >           and  will resume execution as soon as all processes of higher prior‐
> >           ity are blocked again.
> > 
> >        *  When a SCHED_FIFO process becomes runnable, it will be  inserted  at
> >           the end of the list for its priority.
> > 
> >        *  A  call  to  sched_setscheduler()  or sched_setparam(2) will put the
> >           SCHED_FIFO (or SCHED_RR) process identified by pid at the  start  of
> >           the  list  if it was runnable.  As a consequence, it may preempt the
> >           currently  running  process   if   it   has   the   same   priority.
> >           (POSIX.1-2001 specifies that the process should go to the end of the
> >           list.)
> > 
> >        *  A process calling sched_yield(2) will be put at the end of the list.
> 
> How about the recent discussion regarding sched_yield(). Is this correct?
> 
> lkml.kernel.org/r/alpine.DEB.2.02.1403312333100.14882@ionos.tec.linutronix.de
> 
> Is this the correct place to add a note explaining te potentional pitfalls 
> using sched_yield?

I'm not sure; there's a SCHED_YIELD(2) manpage to fill with that
nonsense.

Also; I realized I have not described the DEADLINE sched_yield()
behaviour.

> >        No other events will move a process scheduled under the SCHED_FIFO pol‐
> >        icy in the wait list of runnable processes with equal static priority.
> > 
> >        A SCHED_FIFO process runs until either it is blocked by an I/O request,
> >        it  is  preempted  by  a  higher  priority   process,   or   it   calls
> >        sched_yield(2).
> > 
> >    SCHED_RR: Round Robin scheduling
> >        SCHED_RR  is  a simple enhancement of SCHED_FIFO.  Everything described
> >        above for SCHED_FIFO also applies to SCHED_RR, except that each process
> >        is  only  allowed  to  run  for  a maximum time quantum.  If a SCHED_RR
> >        process has been running for a time period equal to or longer than  the
> >        time  quantum,  it will be put at the end of the list for its priority.
> >        A SCHED_RR process that has been preempted by a higher priority process
> >        and  subsequently  resumes execution as a running process will complete
> >        the unexpired portion of its round robin time quantum.  The  length  of
> >        the time quantum can be retrieved using sched_rr_get_interval(2).
> 
> -> Default is 0.1HZ ms
> 
> This is a question I get form time to time, having this in the manpage 
> would be helpful.

Again, brazenly stolen from SCHED_SETSCHEDULER(2); but yes. Also I'm not
sure I'd call RR an enhancement of anything much at all ;-)

> > ERRORS
> >        EINVAL The scheduling policy is not one  of  the  recognized  policies,
> >               param is NULL, or param does not make sense for the policy.
> > 
> >        EPERM  The calling process does not have appropriate privileges.
> > 
> >        ESRCH  The process whose ID is pid could not be found.
> > 
> >        E2BIG  The provided storage for struct sched_attr is either too
> >               big, see sched_setattr(), or too small, see sched_getattr().
> 
> Where's the EBUSY? It can throw this from __sched_setscheduler() when it 
> checks if there's enough bandwidth to run the task.

Uhhm.. it got lost :-) /me quickly adds.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-04-09 15:42                 ` Peter Zijlstra
@ 2014-04-10  7:47                   ` Juri Lelli
  2014-04-10  9:59                     ` Claudio Scordino
  2014-04-27 15:47                   ` Michael Kerrisk (man-pages)
  1 sibling, 1 reply; 71+ messages in thread
From: Juri Lelli @ 2014-04-10  7:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Henrik Austad, Michael Kerrisk (man-pages),
	Dario Faggioli, Thomas Gleixner, Ingo Molnar, rostedt,
	Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, Paul McKenney,
	insop.song, liming.wang, jkacur, linux-man

Hi all,

On Wed, 9 Apr 2014 17:42:04 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Apr 09, 2014 at 05:19:11PM +0200, Henrik Austad wrote:
> > > 	The following "real-time" policies are also supported, for
> > 
> > why the "'s?
> 
> I borrowed those from SCHED_SETSCHEDULER(2).
> 
> > > 	sched_attr::sched_flags additional flags that can influence
> > > 	scheduling behaviour. Currently as per Linux kernel 3.14:
> > > 
> > > 		SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
> > > 		to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
> > > 		on fork().
> > > 
> > > 	is the only supported flag.
> 
> ...
> 
> > > 	The flags argument should be 0.
> > 
> > What about SCHED_FLAG_RESET_ON_FOR?
> 
> Different flags. The one is sched_attr::flags the other is
> sched_setattr(.flags).
> 
> > > 	The other sched_attr fields are filled out as described in
> > > 	sched_setattr().
> > > 
> > >    Scheduling Policies
> > >        The  scheduler  is  the  kernel  component  that decides which runnable
> > >        process will be executed by the CPU next.  Each process has an  associ‐
> > >        ated  scheduling  policy and a static scheduling priority, sched_prior‐
> > >        ity; these are the settings that are modified by  sched_setscheduler().
> > >        The  scheduler  makes it decisions based on knowledge of the scheduling
> > >        policy and static priority of all processes on the system.
> > 
> > Isn't this last sentence redundant/sliglhtly repetitive?
> 
> I borrowed that from SCHED_SETSCHEDULER(2) again.
> 
> > >     SCHED_DEADLINE: Sporadic task model deadline scheduling
> > > 	SCHED_DEADLINE is an implementation of GEDF (Global Earliest
> > > 	Deadline First) with additional CBS (Constant Bandwidth Server).
> > > 	The CBS guarantees that tasks that over-run their specified
> > > 	budget are throttled and do not affect the correct performance
> > > 	of other SCHED_DEADLINE tasks.
> > > 
> > > 	SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN
> > > 
> > > 	Setting SCHED_DEADLINE can fail with -EINVAL when admission
> > > 	control tests fail.
> > 
> > Perhaps add a note about the deadline-class having higher priority than the 
> > other classes; i.e. if a deadline-task is runnable, it will preempt any 
> > other SCHED_(RR|FIFO) regardless of priority?
> 
> Yes, good point, will do.
> 
> > >    SCHED_FIFO: First In-First Out scheduling
> > >        SCHED_FIFO can only be used with static priorities higher than 0, which
> > >        means that when a SCHED_FIFO processes becomes runnable, it will always
> > >        immediately preempt any currently running SCHED_OTHER, SCHED_BATCH,  or
> > >        SCHED_IDLE  process.  SCHED_FIFO is a simple scheduling algorithm with‐
> > >        out time slicing.  For processes scheduled under the SCHED_FIFO policy,
> > >        the following rules apply:
> > > 
> > >        *  A  SCHED_FIFO  process that has been preempted by another process of
> > >           higher priority will stay at the head of the list for  its  priority
> > >           and  will resume execution as soon as all processes of higher prior‐
> > >           ity are blocked again.
> > > 
> > >        *  When a SCHED_FIFO process becomes runnable, it will be  inserted  at
> > >           the end of the list for its priority.
> > > 
> > >        *  A  call  to  sched_setscheduler()  or sched_setparam(2) will put the
> > >           SCHED_FIFO (or SCHED_RR) process identified by pid at the  start  of
> > >           the  list  if it was runnable.  As a consequence, it may preempt the
> > >           currently  running  process   if   it   has   the   same   priority.
> > >           (POSIX.1-2001 specifies that the process should go to the end of the
> > >           list.)
> > > 
> > >        *  A process calling sched_yield(2) will be put at the end of the list.
> > 
> > How about the recent discussion regarding sched_yield(). Is this correct?
> > 
> > lkml.kernel.org/r/alpine.DEB.2.02.1403312333100.14882@ionos.tec.linutronix.de
> > 
> > Is this the correct place to add a note explaining te potentional pitfalls 
> > using sched_yield?
> 
> I'm not sure; there's a SCHED_YIELD(2) manpage to fill with that
> nonsense.
> 
> Also; I realized I have not described the DEADLINE sched_yield()
> behaviour.
> 

So, for SCHED_DEADLINE we currently have this behaviour:

/*
 * Yield task semantic for -deadline tasks is:
 *
 *   get off from the CPU until our next instance, with
 *   a new runtime. This is of little use now, since we
 *   don't have a bandwidth reclaiming mechanism. Anyway,
 *   bandwidth reclaiming is planned for the future, and
 *   yield_task_dl will indicate that some spare budget
 *   is available for other task instances to use it.
 */

But, considering also the discussion above, I'm less sure now that's
what we want. Still, I think we will want some way in the future to be
able to say "I'm finished with my current job, give this remaining
runtime to someone else", like another syscall or something.

Thanks,

- Juri

> > >        No other events will move a process scheduled under the SCHED_FIFO pol‐
> > >        icy in the wait list of runnable processes with equal static priority.
> > > 
> > >        A SCHED_FIFO process runs until either it is blocked by an I/O request,
> > >        it  is  preempted  by  a  higher  priority   process,   or   it   calls
> > >        sched_yield(2).
> > > 
> > >    SCHED_RR: Round Robin scheduling
> > >        SCHED_RR  is  a simple enhancement of SCHED_FIFO.  Everything described
> > >        above for SCHED_FIFO also applies to SCHED_RR, except that each process
> > >        is  only  allowed  to  run  for  a maximum time quantum.  If a SCHED_RR
> > >        process has been running for a time period equal to or longer than  the
> > >        time  quantum,  it will be put at the end of the list for its priority.
> > >        A SCHED_RR process that has been preempted by a higher priority process
> > >        and  subsequently  resumes execution as a running process will complete
> > >        the unexpired portion of its round robin time quantum.  The  length  of
> > >        the time quantum can be retrieved using sched_rr_get_interval(2).
> > 
> > -> Default is 0.1HZ ms
> > 
> > This is a question I get form time to time, having this in the manpage 
> > would be helpful.
> 
> Again, brazenly stolen from SCHED_SETSCHEDULER(2); but yes. Also I'm not
> sure I'd call RR an enhancement of anything much at all ;-)
> 
> > > ERRORS
> > >        EINVAL The scheduling policy is not one  of  the  recognized  policies,
> > >               param is NULL, or param does not make sense for the policy.
> > > 
> > >        EPERM  The calling process does not have appropriate privileges.
> > > 
> > >        ESRCH  The process whose ID is pid could not be found.
> > > 
> > >        E2BIG  The provided storage for struct sched_attr is either too
> > >               big, see sched_setattr(), or too small, see sched_getattr().
> > 
> > Where's the EBUSY? It can throw this from __sched_setscheduler() when it 
> > checks if there's enough bandwidth to run the task.
> 
> Uhhm.. it got lost :-) /me quickly adds.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-04-10  7:47                   ` Juri Lelli
@ 2014-04-10  9:59                     ` Claudio Scordino
  0 siblings, 0 replies; 71+ messages in thread
From: Claudio Scordino @ 2014-04-10  9:59 UTC (permalink / raw)
  To: Juri Lelli, Peter Zijlstra
  Cc: Henrik Austad, Michael Kerrisk (man-pages),
	Dario Faggioli, Thomas Gleixner, Ingo Molnar, rostedt,
	Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, Paul McKenney,
	insop.song, liming.wang, jkacur, linux-man

On 10/04/2014 09:47, Juri Lelli wrote:
> Hi all,
>
> On Wed, 9 Apr 2014 17:42:04 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Wed, Apr 09, 2014 at 05:19:11PM +0200, Henrik Austad wrote:
>>>> 	The following "real-time" policies are also supported, for
>>> why the "'s?
>> I borrowed those from SCHED_SETSCHEDULER(2).
>>
>>>> 	sched_attr::sched_flags additional flags that can influence
>>>> 	scheduling behaviour. Currently as per Linux kernel 3.14:
>>>>
>>>> 		SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
>>>> 		to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
>>>> 		on fork().
>>>>
>>>> 	is the only supported flag.
>> ...
>>
>>>> 	The flags argument should be 0.
>>> What about SCHED_FLAG_RESET_ON_FOR?
>> Different flags. The one is sched_attr::flags the other is
>> sched_setattr(.flags).
>>
>>>> 	The other sched_attr fields are filled out as described in
>>>> 	sched_setattr().
>>>>
>>>>     Scheduling Policies
>>>>         The  scheduler  is  the  kernel  component  that decides which runnable
>>>>         process will be executed by the CPU next.  Each process has an  associ‐
>>>>         ated  scheduling  policy and a static scheduling priority, sched_prior‐
>>>>         ity; these are the settings that are modified by  sched_setscheduler().
>>>>         The  scheduler  makes it decisions based on knowledge of the scheduling
>>>>         policy and static priority of all processes on the system.
>>> Isn't this last sentence redundant/sliglhtly repetitive?
>> I borrowed that from SCHED_SETSCHEDULER(2) again.
>>
>>>>      SCHED_DEADLINE: Sporadic task model deadline scheduling
>>>> 	SCHED_DEADLINE is an implementation of GEDF (Global Earliest
>>>> 	Deadline First) with additional CBS (Constant Bandwidth Server).
>>>> 	The CBS guarantees that tasks that over-run their specified
>>>> 	budget are throttled and do not affect the correct performance
>>>> 	of other SCHED_DEADLINE tasks.
>>>>
>>>> 	SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN
>>>>
>>>> 	Setting SCHED_DEADLINE can fail with -EINVAL when admission
>>>> 	control tests fail.
>>> Perhaps add a note about the deadline-class having higher priority than the
>>> other classes; i.e. if a deadline-task is runnable, it will preempt any
>>> other SCHED_(RR|FIFO) regardless of priority?
>> Yes, good point, will do.
>>
>>>>     SCHED_FIFO: First In-First Out scheduling
>>>>         SCHED_FIFO can only be used with static priorities higher than 0, which
>>>>         means that when a SCHED_FIFO processes becomes runnable, it will always
>>>>         immediately preempt any currently running SCHED_OTHER, SCHED_BATCH,  or
>>>>         SCHED_IDLE  process.  SCHED_FIFO is a simple scheduling algorithm with‐
>>>>         out time slicing.  For processes scheduled under the SCHED_FIFO policy,
>>>>         the following rules apply:
>>>>
>>>>         *  A  SCHED_FIFO  process that has been preempted by another process of
>>>>            higher priority will stay at the head of the list for  its  priority
>>>>            and  will resume execution as soon as all processes of higher prior‐
>>>>            ity are blocked again.
>>>>
>>>>         *  When a SCHED_FIFO process becomes runnable, it will be  inserted  at
>>>>            the end of the list for its priority.
>>>>
>>>>         *  A  call  to  sched_setscheduler()  or sched_setparam(2) will put the
>>>>            SCHED_FIFO (or SCHED_RR) process identified by pid at the  start  of
>>>>            the  list  if it was runnable.  As a consequence, it may preempt the
>>>>            currently  running  process   if   it   has   the   same   priority.
>>>>            (POSIX.1-2001 specifies that the process should go to the end of the
>>>>            list.)
>>>>
>>>>         *  A process calling sched_yield(2) will be put at the end of the list.
>>> How about the recent discussion regarding sched_yield(). Is this correct?
>>>
>>> lkml.kernel.org/r/alpine.DEB.2.02.1403312333100.14882@ionos.tec.linutronix.de
>>>
>>> Is this the correct place to add a note explaining te potentional pitfalls
>>> using sched_yield?
>> I'm not sure; there's a SCHED_YIELD(2) manpage to fill with that
>> nonsense.
>>
>> Also; I realized I have not described the DEADLINE sched_yield()
>> behaviour.
>>
> So, for SCHED_DEADLINE we currently have this behaviour:
>
> /*
>   * Yield task semantic for -deadline tasks is:
>   *
>   *   get off from the CPU until our next instance, with
>   *   a new runtime. This is of little use now, since we
>   *   don't have a bandwidth reclaiming mechanism. Anyway,
>   *   bandwidth reclaiming is planned for the future, and
>   *   yield_task_dl will indicate that some spare budget
>   *   is available for other task instances to use it.
>   */
>
> But, considering also the discussion above, I'm less sure now that's
> what we want. Still, I think we will want some way in the future to be
> able to say "I'm finished with my current job, give this remaining
> runtime to someone else", like another syscall or something.

Hi Juri, hi Peter,

my two cents:

A syscall to block the task until its next instance is definitely useful.
This way, a periodic task doesn't have to sleep anymore: the kernel 
takes care of unblocking the task at the right moment.
This would be easier (for user-level) and more efficient too.
I don't know if using sched_yield() to get this behavior is a good 
choice or not. You have way more experience than me :)

Best,

         Claudio


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-04-09 15:42                 ` Peter Zijlstra
  2014-04-10  7:47                   ` Juri Lelli
@ 2014-04-27 15:47                   ` Michael Kerrisk (man-pages)
  2014-04-27 19:34                     ` Peter Zijlstra
  1 sibling, 1 reply; 71+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-04-27 15:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Henrik Austad, Dario Faggioli, Thomas Gleixner, Ingo Molnar,
	rostedt, Oleg Nesterov, Frédéric Weisbecker, darren,
	johan.eker, p.faure, Linux Kernel, Claudio Scordino,
	Michael Trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	nicola.manica, luca.abeni, Dhaval Giani, hgu1972, Paul McKenney,
	Insop Song, liming.wang, jkacur, linux-man, Michael Kerrisk

Hi Peter,

Following the review comments that one or two people sent, are you
planning to send in a revised version of this page? Also, is there any
test code lying about somewhere that I could play with?

Thanks,

Michael


On Wed, Apr 9, 2014 at 5:42 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Apr 09, 2014 at 05:19:11PM +0200, Henrik Austad wrote:
>> >     The following "real-time" policies are also supported, for
>>
>> why the "'s?
>
> I borrowed those from SCHED_SETSCHEDULER(2).
>
>> >     sched_attr::sched_flags additional flags that can influence
>> >     scheduling behaviour. Currently as per Linux kernel 3.14:
>> >
>> >             SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
>> >             to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
>> >             on fork().
>> >
>> >     is the only supported flag.
>
> ...
>
>> >     The flags argument should be 0.
>>
>> What about SCHED_FLAG_RESET_ON_FOR?
>
> Different flags. The one is sched_attr::flags the other is
> sched_setattr(.flags).
>
>> >     The other sched_attr fields are filled out as described in
>> >     sched_setattr().
>> >
>> >    Scheduling Policies
>> >        The  scheduler  is  the  kernel  component  that decides which runnable
>> >        process will be executed by the CPU next.  Each process has an  associ‐
>> >        ated  scheduling  policy and a static scheduling priority, sched_prior‐
>> >        ity; these are the settings that are modified by  sched_setscheduler().
>> >        The  scheduler  makes it decisions based on knowledge of the scheduling
>> >        policy and static priority of all processes on the system.
>>
>> Isn't this last sentence redundant/sliglhtly repetitive?
>
> I borrowed that from SCHED_SETSCHEDULER(2) again.
>
>> >     SCHED_DEADLINE: Sporadic task model deadline scheduling
>> >     SCHED_DEADLINE is an implementation of GEDF (Global Earliest
>> >     Deadline First) with additional CBS (Constant Bandwidth Server).
>> >     The CBS guarantees that tasks that over-run their specified
>> >     budget are throttled and do not affect the correct performance
>> >     of other SCHED_DEADLINE tasks.
>> >
>> >     SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN
>> >
>> >     Setting SCHED_DEADLINE can fail with -EINVAL when admission
>> >     control tests fail.
>>
>> Perhaps add a note about the deadline-class having higher priority than the
>> other classes; i.e. if a deadline-task is runnable, it will preempt any
>> other SCHED_(RR|FIFO) regardless of priority?
>
> Yes, good point, will do.
>
>> >    SCHED_FIFO: First In-First Out scheduling
>> >        SCHED_FIFO can only be used with static priorities higher than 0, which
>> >        means that when a SCHED_FIFO processes becomes runnable, it will always
>> >        immediately preempt any currently running SCHED_OTHER, SCHED_BATCH,  or
>> >        SCHED_IDLE  process.  SCHED_FIFO is a simple scheduling algorithm with‐
>> >        out time slicing.  For processes scheduled under the SCHED_FIFO policy,
>> >        the following rules apply:
>> >
>> >        *  A  SCHED_FIFO  process that has been preempted by another process of
>> >           higher priority will stay at the head of the list for  its  priority
>> >           and  will resume execution as soon as all processes of higher prior‐
>> >           ity are blocked again.
>> >
>> >        *  When a SCHED_FIFO process becomes runnable, it will be  inserted  at
>> >           the end of the list for its priority.
>> >
>> >        *  A  call  to  sched_setscheduler()  or sched_setparam(2) will put the
>> >           SCHED_FIFO (or SCHED_RR) process identified by pid at the  start  of
>> >           the  list  if it was runnable.  As a consequence, it may preempt the
>> >           currently  running  process   if   it   has   the   same   priority.
>> >           (POSIX.1-2001 specifies that the process should go to the end of the
>> >           list.)
>> >
>> >        *  A process calling sched_yield(2) will be put at the end of the list.
>>
>> How about the recent discussion regarding sched_yield(). Is this correct?
>>
>> lkml.kernel.org/r/alpine.DEB.2.02.1403312333100.14882@ionos.tec.linutronix.de
>>
>> Is this the correct place to add a note explaining te potentional pitfalls
>> using sched_yield?
>
> I'm not sure; there's a SCHED_YIELD(2) manpage to fill with that
> nonsense.
>
> Also; I realized I have not described the DEADLINE sched_yield()
> behaviour.
>
>> >        No other events will move a process scheduled under the SCHED_FIFO pol‐
>> >        icy in the wait list of runnable processes with equal static priority.
>> >
>> >        A SCHED_FIFO process runs until either it is blocked by an I/O request,
>> >        it  is  preempted  by  a  higher  priority   process,   or   it   calls
>> >        sched_yield(2).
>> >
>> >    SCHED_RR: Round Robin scheduling
>> >        SCHED_RR  is  a simple enhancement of SCHED_FIFO.  Everything described
>> >        above for SCHED_FIFO also applies to SCHED_RR, except that each process
>> >        is  only  allowed  to  run  for  a maximum time quantum.  If a SCHED_RR
>> >        process has been running for a time period equal to or longer than  the
>> >        time  quantum,  it will be put at the end of the list for its priority.
>> >        A SCHED_RR process that has been preempted by a higher priority process
>> >        and  subsequently  resumes execution as a running process will complete
>> >        the unexpired portion of its round robin time quantum.  The  length  of
>> >        the time quantum can be retrieved using sched_rr_get_interval(2).
>>
>> -> Default is 0.1HZ ms
>>
>> This is a question I get form time to time, having this in the manpage
>> would be helpful.
>
> Again, brazenly stolen from SCHED_SETSCHEDULER(2); but yes. Also I'm not
> sure I'd call RR an enhancement of anything much at all ;-)
>
>> > ERRORS
>> >        EINVAL The scheduling policy is not one  of  the  recognized  policies,
>> >               param is NULL, or param does not make sense for the policy.
>> >
>> >        EPERM  The calling process does not have appropriate privileges.
>> >
>> >        ESRCH  The process whose ID is pid could not be found.
>> >
>> >        E2BIG  The provided storage for struct sched_attr is either too
>> >               big, see sched_setattr(), or too small, see sched_getattr().
>>
>> Where's the EBUSY? It can throw this from __sched_setscheduler() when it
>> checks if there's enough bandwidth to run the task.
>
> Uhhm.. it got lost :-) /me quickly adds.



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-04-27 15:47                   ` Michael Kerrisk (man-pages)
@ 2014-04-27 19:34                     ` Peter Zijlstra
  2014-04-27 19:45                       ` Steven Rostedt
  2014-04-28  7:39                       ` Juri Lelli
  0 siblings, 2 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-04-27 19:34 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Henrik Austad, Dario Faggioli, Thomas Gleixner, Ingo Molnar,
	rostedt, Oleg Nesterov, Frédéric Weisbecker, darren,
	johan.eker, p.faure, Linux Kernel, Claudio Scordino,
	Michael Trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	nicola.manica, luca.abeni, Dhaval Giani, hgu1972, Paul McKenney,
	Insop Song, liming.wang, jkacur, linux-man

On Sun, Apr 27, 2014 at 05:47:25PM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Peter,
> 
> Following the review comments that one or two people sent, are you
> planning to send in a revised version of this page?

Yes, I just suck at getting around to it :-(, I'll do it first thing
tomorrow.

> Also, is there any test code lying about somewhere that I could play with?

Juri?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-04-27 19:34                     ` Peter Zijlstra
@ 2014-04-27 19:45                       ` Steven Rostedt
  2014-04-28  7:39                       ` Juri Lelli
  1 sibling, 0 replies; 71+ messages in thread
From: Steven Rostedt @ 2014-04-27 19:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michael Kerrisk (man-pages),
	Henrik Austad, Dario Faggioli, Thomas Gleixner, Ingo Molnar,
	Oleg Nesterov, Frédéric Weisbecker, darren, johan.eker,
	p.faure, Linux Kernel, Claudio Scordino, Michael Trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, nicola.manica,
	luca.abeni, Dhaval Giani, hgu1972, Paul McKenney, Insop Song,
	liming.wang, jkacur, linux-man

On Sun, 27 Apr 2014 21:34:49 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > Also, is there any test code lying about somewhere that I could play with?

I have a deadline program you can play with too:

http://rostedt.homelinux.com/private/deadline.c

-- Steve

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-04-27 19:34                     ` Peter Zijlstra
  2014-04-27 19:45                       ` Steven Rostedt
@ 2014-04-28  7:39                       ` Juri Lelli
  1 sibling, 0 replies; 71+ messages in thread
From: Juri Lelli @ 2014-04-28  7:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michael Kerrisk (man-pages),
	Henrik Austad, Dario Faggioli, Thomas Gleixner, Ingo Molnar,
	rostedt, Oleg Nesterov, Frédéric Weisbecker, darren,
	johan.eker, p.faure, Linux Kernel, Claudio Scordino,
	Michael Trimarchi, Fabio Checconi, Tommaso Cucinotta,
	nicola.manica, luca.abeni, Dhaval Giani, hgu1972, Paul McKenney,
	Insop Song, liming.wang, jkacur, linux-man

On Sun, 27 Apr 2014 21:34:49 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Sun, Apr 27, 2014 at 05:47:25PM +0200, Michael Kerrisk (man-pages) wrote:
> > Hi Peter,
> > 
> > Following the review comments that one or two people sent, are you
> > planning to send in a revised version of this page?
> 
> Yes, I just suck at getting around to it :-(, I'll do it first thing
> tomorrow.
> 
> > Also, is there any test code lying about somewhere that I could play with?
> 
> Juri?

Yes. I use these two tools:

- rt-app (to create periodic workload, also not RT/DL)
  https://github.com/gbagnoli/rt-app

- schedtool-dl (patched version of schedtool)
  https://github.com/jlelli/schedtool-dl

Both are aligned to the last interface.

Best,

- Juri

^ permalink raw reply	[flat|nested] 71+ messages in thread

* sched_{set,get}attr() manpage
  2014-02-17 13:20           ` Michael Kerrisk (man-pages)
  2014-04-09  9:25             ` sched_{set,get}attr() manpage Peter Zijlstra
@ 2014-04-28  8:18             ` Peter Zijlstra
  2014-04-29 13:08               ` Michael Kerrisk (man-pages)
  1 sibling, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-04-28  8:18 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Dario Faggioli, Thomas Gleixner, Ingo Molnar, rostedt,
	Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	Paul McKenney, insop.song, liming.wang, jkacur, linux-man

Hi Michael,

find below an updated manpage. I did not apply the comments on parts
that are identical to SCHED_SETSCHEDULER(2), in order to keep these texts
in alignment. I feel that if we change one we should also change the
other, and such a 'patch' is best done separately from the new manpage
itself.

I did add the missing EBUSY error, and amended the text where it said
we'd return EINVAL in that case.

I added a paragraph stating that SCHED_DEADLINE preempts anything else
userspace can do (with the explicit mention of userspace to leave me
wriggle room for the kernel's stop task :-).

I also did a short paragraph on the deadline sched_yield(). For further
deadline yield details we should maybe add to the SCHED_YIELD(2)
manpage.

Re juri/claudio; no, I think sched_yield() as implemented for deadline
makes sense; the only other yield semantic that would make sense for it
is a NOP, and since we have the syscall already we might as well make it
do something useful.


---

NAME
	sched_setattr, sched_getattr - set and get scheduling policy/attributes

SYNOPSIS
	#include <sched.h>

	struct sched_attr {
		u32 size;
		u32 sched_policy;
		u64 sched_flags;

		/* SCHED_NORMAL, SCHED_BATCH */
		s32 sched_nice;
		/* SCHED_FIFO, SCHED_RR */
		u32 sched_priority;
		/* SCHED_DEADLINE */
		u64 sched_runtime;
		u64 sched_deadline;
		u64 sched_period;
	};
	int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags);

	int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size, unsigned int flags);

DESCRIPTION
	sched_setattr() sets both the scheduling policy and the
	associated attributes for the process whose ID is specified in
	pid.  If pid equals zero, the scheduling policy and attributes
	of the calling process will be set.  The interpretation of the
	argument attr depends on the selected policy.  Currently, Linux
	supports the following "normal" (i.e., non-real-time) scheduling
	policies:

	SCHED_OTHER	the standard "fair" time-sharing policy;

	SCHED_BATCH	for "batch" style execution of processes; and

	SCHED_IDLE	for running very low priority background jobs.

	The following "real-time" policies are also supported, for
	special time-critical applications that need precise control
	over the way in which runnable processes are selected for
	execution:

	SCHED_FIFO	a first-in, first-out policy;

	SCHED_RR	a round-robin policy; and

	SCHED_DEADLINE	a deadline policy.

	The semantics of each of these policies are detailed below.

        sched_attr::size must be set to the size of the structure, as in
        sizeof(struct sched_attr). If the provided structure is smaller
        than the kernel structure, any additional fields are assumed to
        be '0'. If the provided structure is larger than the kernel
        structure, the kernel verifies that all additional fields are
        '0'; if not, the syscall will fail with -E2BIG.

	sched_attr::sched_policy the desired scheduling policy.

	sched_attr::sched_flags additional flags that can influence
	scheduling behaviour. Currently as per Linux kernel 3.14:

		SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
		to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
		on fork().

	is the only supported flag.

        sched_attr::sched_nice should only be set for SCHED_OTHER and
        SCHED_BATCH; it is the desired nice value [-20,19], see NICE(2).

        sched_attr::sched_priority should only be set for SCHED_FIFO and
        SCHED_RR; it is the desired static priority [1,99].

	sched_attr::sched_runtime
	sched_attr::sched_deadline
	sched_attr::sched_period should only be set for SCHED_DEADLINE
	and are the traditional sporadic task model parameters.

	The flags argument should be 0.
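
        For illustration only, a rough and untested sketch (glibc has no
        wrapper for these syscalls, so syscall(2) is used directly, and
        __NR_sched_setattr is assumed to come from the 3.14 headers; the
        10/30/100 ms parameters are arbitrary and expressed in
        nanoseconds) of a task switching itself to SCHED_DEADLINE:

        #define _GNU_SOURCE
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/syscall.h>

        /* mirrors the layout in the SYNOPSIS above */
        struct sched_attr {
                uint32_t size;
                uint32_t sched_policy;
                uint64_t sched_flags;
                int32_t  sched_nice;
                uint32_t sched_priority;
                uint64_t sched_runtime;
                uint64_t sched_deadline;
                uint64_t sched_period;
        };

        #ifndef SCHED_DEADLINE
        #define SCHED_DEADLINE 6
        #endif

        int main(void)
        {
                struct sched_attr attr;

                memset(&attr, 0, sizeof(attr));
                attr.size           = sizeof(attr);       /* mandatory */
                attr.sched_policy   = SCHED_DEADLINE;
                attr.sched_runtime  =  10 * 1000 * 1000;  /*  10 ms */
                attr.sched_deadline =  30 * 1000 * 1000;  /*  30 ms */
                attr.sched_period   = 100 * 1000 * 1000;  /* 100 ms */

                /* pid 0 means the calling thread; appropriate privileges
                 * are required, otherwise this fails with EPERM */
                if (syscall(__NR_sched_setattr, 0, &attr, 0) == -1) {
                        perror("sched_setattr");
                        return 1;
                }

                /* ... the task's periodic work would go here ... */
                return 0;
        }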

	sched_getattr() queries the scheduling policy currently applied
	to the process identified by pid.  If pid equals zero, the
	policy of the calling process will be retrieved.

	The size argument should reflect the size of struct sched_attr
	as known to userspace. The kernel fills out sched_attr::size to
	the size of its sched_attr structure. If the user provided
	structure is larger, additional fields are not touched. If the
	user provided structure is smaller, but the kernel needs to
	return values outside the provided space, the syscall will fail
	with -E2BIG.

	The flags argument should be 0.

	The other sched_attr fields are filled out as described in
	sched_setattr().
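
        The read side, under the same caveats (this reuses the struct
        sched_attr definition and headers from the sketch above, and
        assumes __NR_sched_getattr exists as well), showing the size
        handshake described above:

        void show_my_attr(void)
        {
                struct sched_attr attr;

                /* we pass our idea of the structure size; the kernel
                 * fills attr.size in with its own */
                if (syscall(__NR_sched_getattr, 0, &attr,
                            sizeof(attr), 0) == -1) {
                        perror("sched_getattr"); /* E2BIG: ours too small */
                        return;
                }

                printf("policy=%u kernel-size=%u runtime=%llu\n",
                       attr.sched_policy, attr.size,
                       (unsigned long long)attr.sched_runtime);
        }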

   Scheduling Policies
       The  scheduler  is  the  kernel  component  that decides which runnable
       process will be executed by the CPU next.  Each process has an  associ‐
       ated  scheduling  policy and a static scheduling priority, sched_prior‐
       ity; these are the settings that are modified by  sched_setscheduler().
       The  scheduler  makes its decisions based on knowledge of the scheduling
       policy and static priority of all processes on the system.

       For processes scheduled under one of  the  normal  scheduling  policies
       (SCHED_OTHER,  SCHED_IDLE,  SCHED_BATCH), sched_priority is not used in
       scheduling decisions (it must be specified as 0).

       Processes scheduled under one of the  real-time  policies  (SCHED_FIFO,
       SCHED_RR)  have  a  sched_priority  value  in  the  range 1 (low) to 99
       (high).  (As the numbers imply, real-time processes always have  higher
       priority than normal processes.)  Note well: POSIX.1-2001 only requires
       an implementation to support a minimum 32 distinct priority levels  for
       the  real-time  policies,  and  some  systems supply just this minimum.
       Portable   programs   should    use    sched_get_priority_min(2)    and
       sched_get_priority_max(2) to find the range of priorities supported for
       a particular policy.

       Conceptually, the scheduler maintains a list of runnable processes  for
       each  possible  sched_priority  value.   In  order  to  determine which
       process runs next, the scheduler looks for the nonempty list  with  the
       highest  static  priority  and  selects the process at the head of this
       list.

       A process's scheduling policy determines where it will be inserted into
       the  list  of processes with equal static priority and how it will move
       inside this list.

       All scheduling is preemptive: if a process with a higher static  prior‐
       ity  becomes  ready  to run, the currently running process will be pre‐
       empted and returned to the wait list for  its  static  priority  level.
       The  scheduling  policy only determines the ordering within the list of
       runnable processes with equal static priority.

    SCHED_DEADLINE: Sporadic task model deadline scheduling
       SCHED_DEADLINE is an implementation of GEDF (Global Earliest
       Deadline First) with additional CBS (Constant Bandwidth Server).
       The CBS guarantees that tasks that over-run their specified
       budget are throttled and do not affect the correct performance
       of other SCHED_DEADLINE tasks.

       SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN.

       Setting SCHED_DEADLINE can fail with -EBUSY when admission
       control tests fail.

       Because of the nature of (G)EDF, SCHED_DEADLINE tasks are the
       highest priority (user controllable) tasks in the system; if any
       SCHED_DEADLINE task is runnable it will preempt any
       FIFO/RR/OTHER/BATCH/IDLE task out there.

       A SCHED_DEADLINE task calling sched_yield() will 'yield' the
       current job and wait for a new period to begin.
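
       To illustrate (sketch only: it assumes the task already switched
       itself to SCHED_DEADLINE as in the DESCRIPTION sketch, and
       do_one_job() is a made-up placeholder for the application's
       per-instance work), a periodic SCHED_DEADLINE task can then be
       structured as:

       #include <sched.h>

       static void do_one_job(void)
       {
               /* application work for one instance; expected to fit
                * within sched_attr::sched_runtime */
       }

       static void periodic_loop(void)
       {
               for (;;) {
                       do_one_job();
                       /* job done: give up the remaining budget and
                        * wait for a fresh runtime at the start of the
                        * next period, as described above */
                       sched_yield();
               }
       }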

   SCHED_FIFO: First In-First Out scheduling
       SCHED_FIFO can only be used with static priorities higher than 0, which
       means that when a SCHED_FIFO process becomes runnable, it will always
       immediately preempt any currently running SCHED_OTHER, SCHED_BATCH,  or
       SCHED_IDLE  process.  SCHED_FIFO is a simple scheduling algorithm with‐
       out time slicing.  For processes scheduled under the SCHED_FIFO policy,
       the following rules apply:

       *  A  SCHED_FIFO  process that has been preempted by another process of
          higher priority will stay at the head of the list for  its  priority
          and  will resume execution as soon as all processes of higher prior‐
          ity are blocked again.

       *  When a SCHED_FIFO process becomes runnable, it will be  inserted  at
          the end of the list for its priority.

       *  A  call  to  sched_setscheduler()  or sched_setparam(2) will put the
          SCHED_FIFO (or SCHED_RR) process identified by pid at the  start  of
          the  list  if it was runnable.  As a consequence, it may preempt the
          currently  running  process   if   it   has   the   same   priority.
          (POSIX.1-2001 specifies that the process should go to the end of the
          list.)

       *  A process calling sched_yield(2) will be put at the end of the list.

       No other events will move a process scheduled under the SCHED_FIFO pol‐
       icy in the wait list of runnable processes with equal static priority.

       A SCHED_FIFO process runs until either it is blocked by an I/O request,
       it  is  preempted  by  a  higher  priority   process,   or   it   calls
       sched_yield(2).

   SCHED_RR: Round Robin scheduling
       SCHED_RR  is  a simple enhancement of SCHED_FIFO.  Everything described
       above for SCHED_FIFO also applies to SCHED_RR, except that each process
       is  only  allowed  to  run  for  a maximum time quantum.  If a SCHED_RR
       process has been running for a time period equal to or longer than  the
       time  quantum,  it will be put at the end of the list for its priority.
       A SCHED_RR process that has been preempted by a higher priority process
       and  subsequently  resumes execution as a running process will complete
       the unexpired portion of its round robin time quantum.  The  length  of
       the time quantum can be retrieved using sched_rr_get_interval(2).
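
       To make the quantum concrete at run time (sketch only; on recent
       kernels the default is 100 ms and, since Linux 3.9, it can be
       tuned via /proc/sys/kernel/sched_rr_timeslice_ms):

       #define _GNU_SOURCE
       #include <sched.h>
       #include <stdio.h>
       #include <time.h>

       int main(void)
       {
               struct timespec quantum;

               /* pid 0 means the calling thread; the result is only
                * meaningful for SCHED_RR threads */
               if (sched_rr_get_interval(0, &quantum) == -1) {
                       perror("sched_rr_get_interval");
                       return 1;
               }

               printf("SCHED_RR quantum: %ld.%09ld s\n",
                      (long)quantum.tv_sec, quantum.tv_nsec);
               return 0;
       }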

   SCHED_OTHER: Default Linux time-sharing scheduling
       SCHED_OTHER  can only be used at static priority 0.  SCHED_OTHER is the
       standard Linux time-sharing scheduler that is  intended  for  all  pro‐
       cesses  that  do  not  require  the  special real-time mechanisms.  The
       process to run is chosen from the static priority 0  list  based  on  a
       dynamic priority that is determined only inside this list.  The dynamic
       priority is based on the nice value (set by nice(2) or  setpriority(2))
       and  increased  for  each time quantum the process is ready to run, but
       denied to run by the scheduler.  This ensures fair progress  among  all
       SCHED_OTHER processes.

   SCHED_BATCH: Scheduling batch processes
       (Since  Linux 2.6.16.)  SCHED_BATCH can only be used at static priority
       0.  This policy is similar to SCHED_OTHER  in  that  it  schedules  the
       process  according  to  its dynamic priority (based on the nice value).
       The difference is that this policy will cause the scheduler  to  always
       assume  that the process is CPU-intensive.  Consequently, the scheduler
       will apply a small scheduling penalty with respect to wakeup behaviour,
       so that this process is mildly disfavored in scheduling decisions.

       This policy is useful for workloads that are noninteractive, but do not
       want to lower their nice value, and for workloads that want a determin‐
       istic scheduling policy without interactivity causing extra preemptions
       (between the workload's tasks).

   SCHED_IDLE: Scheduling very low priority jobs
       (Since Linux 2.6.23.)  SCHED_IDLE can only be used at  static  priority
       0; the process nice value has no influence for this policy.

       This  policy  is  intended  for  running jobs at extremely low priority
       (lower even than a +19 nice value with the SCHED_OTHER  or  SCHED_BATCH
       policies).

RETURN VALUE
	On success, sched_setattr() and sched_getattr() return 0. On
	error, -1 is returned, and errno is set appropriately.

ERRORS
       EINVAL The scheduling policy is not one  of  the  recognized  policies,
              attr is NULL, or attr does not make sense for the policy.

       EPERM  The calling process does not have appropriate privileges.

       ESRCH  The process whose ID is pid could not be found.

       E2BIG  The provided storage for struct sched_attr is either too
              big, see sched_setattr(), or too small, see sched_getattr().

       EBUSY  SCHED_DEADLINE admission control failure
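
       As a short sketch (reusing the struct sched_attr definition and
       raw-syscall setup from the DESCRIPTION examples above), a caller
       could tell these apart like so:

       #include <errno.h>

       static int set_deadline_checked(const struct sched_attr *attr)
       {
               if (syscall(__NR_sched_setattr, 0, attr, 0) == 0)
                       return 0;

               switch (errno) {
               case EBUSY:    /* admission control said no */
                       fprintf(stderr, "not enough deadline bandwidth\n");
                       break;
               case E2BIG:    /* size handshake failed, see above */
                       fprintf(stderr, "sched_attr size mismatch\n");
                       break;
               default:
                       perror("sched_setattr");
               }
               return -1;
       }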

NOTES
	While the text above (and in SCHED_SETSCHEDULER(2)) talks about
	processes, in actual fact these system calls are thread specific.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-04-28  8:18             ` Peter Zijlstra
@ 2014-04-29 13:08               ` Michael Kerrisk (man-pages)
  2014-04-29 14:22                 ` Peter Zijlstra
  2014-04-29 16:04                 ` Peter Zijlstra
  0 siblings, 2 replies; 71+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-04-29 13:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mtk.manpages, Dario Faggioli, Thomas Gleixner, Ingo Molnar,
	rostedt, Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	Paul McKenney, insop.song, liming.wang, jkacur, linux-man

Hi Peter,

On 04/28/2014 10:18 AM, Peter Zijlstra wrote:
> Hi Michael,
> 
> find below an updated manpage, I did not apply the comments on parts
> that are identical to SCHED_SETSCHEDULER(2) in order to keep these texts
> in alignment. I feel that if we change one we should also change the
> other, and such a 'patch' is best done separate from the new manpage
> itself.
> 
> I did add the missing EBUSY error, and amended the text where it said
> we'd return EINVAL in that case.
> 
> I added a paragraph stating that SCHED_DEADLINE preempted anything else
> userspace can do (with the explicit mention of userspace to leave me
> wriggle room for the kernel's stop task :-).
> 
> I also did a short paragraph on the deadline sched_yield(). For further
> deadline yield details we should maybe add to the SCHED_YIELD(2)
> manpage.
> 
> Re juri/claudio; no I think sched_yield() as implemented for deadline
> makes sense, no other yield semantics other than NOP makes sense for it,
> and since we have the syscall already might as well make it do something
> useful.

Thanks for the updated page. Would you be willing
to revise as per the comments below?


> NAME
> 	sched_setattr, sched_getattr - set and get scheduling policy/attributes
> 
> SYNOPSIS
> 	#include <sched.h>
> 
> 	struct sched_attr {
> 		u32 size;
> 		u32 sched_policy;
> 		u64 sched_flags;
> 
> 		/* SCHED_NORMAL, SCHED_BATCH */
> 		s32 sched_nice;
> 		/* SCHED_FIFO, SCHED_RR */
> 		u32 sched_priority;
> 		/* SCHED_DEADLINE */
> 		u64 sched_runtime;
> 		u64 sched_deadline;
> 		u64 sched_period;
> 	};
> 	int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags);
> 
> 	int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size, unsigned int flags);
> 
> DESCRIPTION
> 	sched_setattr() sets both the scheduling policy and the
> 	associated attributes for the process whose ID is specified in
> 	pid.  

Around about here, I think there needs to be a sentence explaining
that sched_setattr() provides a superset of the functionality of 
sched_setscheduler(2) and setpriority(2). I mean, it can do all that 
those two calls can do, right?

> If pid equals zero, the scheduling policy and attributes
> 	of the calling process will be set.  The interpretation of the
> 	argument attr depends on the selected policy.  Currently, Linux
> 	supports the following "normal" (i.e., non-real-time) scheduling
> 	policies:
> 
> 	SCHED_OTHER	the standard "fair" time-sharing policy;
> 
> 	SCHED_BATCH	for "batch" style execution of processes; and
> 
> 	SCHED_IDLE	for running very low priority background jobs.
> 
> 	The following "real-time" policies are also supported, for
> 	special time-critical applications that need precise control
> 	over the way in which runnable processes are selected for
> 	execution:
> 
> 	SCHED_FIFO	a first-in, first-out policy;
> 
> 	SCHED_RR	a round-robin policy; and
> 
> 	SCHED_DEADLINE	a deadline policy.
> 
> 	The semantics of each of these policies are detailed below.

The semantics of each of these policies are detailed in sched(7).

[See my comments below]

> 
> 	sched_attr::size must be set to the size of the structure, as in
> 	sizeof(struct sched_attr), if the provided structure is smaller
> 	than the kernel structure, any additional fields are assumed
> 	'0'. If the provided structure is larger than the kernel
> 	structure, the kernel verifies all additional fields are '0' if
> 	not the syscall will fail with -E2BIG.
> 
> 	sched_attr::sched_policy the desired scheduling policy.
> 
> 	sched_attr::sched_flags additional flags that can influence
> 	scheduling behaviour. Currently as per Linux kernel 3.14:
> 
> 		SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
> 		to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
> 		on fork().
> 
> 	is the only supported flag.
> 
> 	sched_attr::sched_nice should only be set for SCHED_OTHER,
> 	SCHED_BATCH, the desired nice value [-20,19], see NICE(2).
> 
> 	sched_attr::sched_priority should only be set for SCHED_FIFO,
> 	SCHED_RR, the desired static priority [1,99].
> 
> 	sched_attr::sched_runtime
> 	sched_attr::sched_deadline
> 	sched_attr::sched_period should only be set for SCHED_DEADLINE
> 	and are the traditional sporadic task model parameters.

Could you add (a lot ;-)) more detail on these three fields? Assume the
reader does not know about this traditional sporadic task model, and 
then give some explanation of what these three fields do. Probably, at
this point you can work in some statement  about the admission control
test.

[but, see my comment below. It may be that sched(7) is a better
place for this detail.]

> 	The flags argument should be 0.
> 
> 	sched_getattr() queries the scheduling policy currently applied
> 	to the process identified by pid.  If pid equals zero, the
> 	policy of the calling process will be retrieved.
> 
> 	The size argument should reflect the size of struct sched_attr
> 	as known to userspace. The kernel fills out sched_attr::size to
> 	the size of its sched_attr structure. If the user provided
> 	structure is larger, additional fields are not touched. If the
> 	user provided structure is smaller, but the kernel needs to
> 	return values outside the provided space, the syscall will fail
> 	with -E2BIG.
> 
> 	The flags argument should be 0.
> 
> 	The other sched_attr fields are filled out as described in
> 	sched_setattr().

I assume that everything between my [[[ and ]]] blocks below is taken straight 
from sched_setscheduler(2). (If that is not true, please let me know.)
This reminds me that there is a structural fault in this part of man-pages ;-).
The problem is sched_setscheduler(2) currently tries to do two things:

[a] Document the sched_setscheduler() and sched_getscheduler() system calls
[b] Provide an overview of scheduling policies and parameters.

It should really only do the former. I have now gone through the task of
separating [b] out into a separate page, sched(7), which other pages,
such as sched_setscheduler(2) and sched_setattr(2) can refer to. You
can see the current versions of sched_setscheduler.2 and sched.7 in Git
(https://www.kernel.org/doc/man-pages/download.html )

So, what I would ideally like to see

[1] A page describing the sched_setattr() and sched_getattr() APIs
[2] A piece of text describing the SCHED_DEADLINE policy, which I can
drop into sched(7).

Could you revise like that?

[[[[
>    Scheduling Policies
>        The  scheduler  is  the  kernel  component  that decides which runnable
>        process will be executed by the CPU next.  Each process has an  associ‐
>        ated  scheduling  policy and a static scheduling priority, sched_prior‐
>        ity; these are the settings that are modified by  sched_setscheduler().
>        The  scheduler  makes it decisions based on knowledge of the scheduling
>        policy and static priority of all processes on the system.
> 
>        For processes scheduled under one of  the  normal  scheduling  policies
>        (SCHED_OTHER,  SCHED_IDLE,  SCHED_BATCH), sched_priority is not used in
>        scheduling decisions (it must be specified as 0).
> 
>        Processes scheduled under one of the  real-time  policies  (SCHED_FIFO,
>        SCHED_RR)  have  a  sched_priority  value  in  the  range 1 (low) to 99
>        (high).  (As the numbers imply, real-time processes always have  higher
>        priority than normal processes.)  Note well: POSIX.1-2001 only requires
>        an implementation to support a minimum 32 distinct priority levels  for
>        the  real-time  policies,  and  some  systems supply just this minimum.
>        Portable   programs   should    use    sched_get_priority_min(2)    and
>        sched_get_priority_max(2) to find the range of priorities supported for
>        a particular policy.
> 
>        Conceptually, the scheduler maintains a list of runnable processes  for
>        each  possible  sched_priority  value.   In  order  to  determine which
>        process runs next, the scheduler looks for the nonempty list  with  the
>        highest  static  priority  and  selects the process at the head of this
>        list.
> 
>        A process's scheduling policy determines where it will be inserted into
>        the  list  of processes with equal static priority and how it will move
>        inside this list.
> 
>        All scheduling is preemptive: if a process with a higher static  prior‐
>        ity  becomes  ready  to run, the currently running process will be pre‐
>        empted and returned to the wait list for  its  static  priority  level.
>        The  scheduling  policy only determines the ordering within the list of
>        runnable processes with equal static priority.
]]]]

>     SCHED_DEADLINE: Sporadic task model deadline scheduling
>        SCHED_DEADLINE is an implementation of GEDF (Global Earliest
>        Deadline First) with additional CBS (Constant Bandwidth Server).
>        The CBS guarantees that tasks that over-run their specified
>        budget are throttled and do not affect the correct performance
>        of other SCHED_DEADLINE tasks.
> 
>        SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN
> 
>        Setting SCHED_DEADLINE can fail with -EBUSY when admission
>        control tests fail.
> 
>        Because of the nature of (G)EDF, SCHED_DEADLINE tasks are the
>        highest priority (user controllable) tasks in the system, if any
>        SCHED_DEADLINE task is runnable it will preempt anything
>        FIFO/RR/OTHER/BATCH/IDLE task out there.
> 
>        A SCHED_DEADLINE task calling sched_yield() will 'yield' the
>        current job and wait for a new period to begin.

This is the piece that could go into sched(7), but I'd like it to include
a discussion of deadline, period, and runtime.

[[[[
 
>    SCHED_FIFO: First In-First Out scheduling
>        SCHED_FIFO can only be used with static priorities higher than 0, which
>        means that when a SCHED_FIFO processes becomes runnable, it will always
>        immediately preempt any currently running SCHED_OTHER, SCHED_BATCH,  or
>        SCHED_IDLE  process.  SCHED_FIFO is a simple scheduling algorithm with‐
>        out time slicing.  For processes scheduled under the SCHED_FIFO policy,
>        the following rules apply:
> 
>        *  A  SCHED_FIFO  process that has been preempted by another process of
>           higher priority will stay at the head of the list for  its  priority
>           and  will resume execution as soon as all processes of higher prior‐
>           ity are blocked again.
> 
>        *  When a SCHED_FIFO process becomes runnable, it will be  inserted  at
>           the end of the list for its priority.
> 
>        *  A  call  to  sched_setscheduler()  or sched_setparam(2) will put the
>           SCHED_FIFO (or SCHED_RR) process identified by pid at the  start  of
>           the  list  if it was runnable.  As a consequence, it may preempt the
>           currently  running  process   if   it   has   the   same   priority.
>           (POSIX.1-2001 specifies that the process should go to the end of the
>           list.)
> 
>        *  A process calling sched_yield(2) will be put at the end of the list.
> 
>        No other events will move a process scheduled under the SCHED_FIFO pol‐
>        icy in the wait list of runnable processes with equal static priority.
> 
>        A SCHED_FIFO process runs until either it is blocked by an I/O request,
>        it  is  preempted  by  a  higher  priority   process,   or   it   calls
>        sched_yield(2).
> 
>    SCHED_RR: Round Robin scheduling
>        SCHED_RR  is  a simple enhancement of SCHED_FIFO.  Everything described
>        above for SCHED_FIFO also applies to SCHED_RR, except that each process
>        is  only  allowed  to  run  for  a maximum time quantum.  If a SCHED_RR
>        process has been running for a time period equal to or longer than  the
>        time  quantum,  it will be put at the end of the list for its priority.
>        A SCHED_RR process that has been preempted by a higher priority process
>        and  subsequently  resumes execution as a running process will complete
>        the unexpired portion of its round robin time quantum.  The  length  of
>        the time quantum can be retrieved using sched_rr_get_interval(2).
> 
>    SCHED_OTHER: Default Linux time-sharing scheduling
>        SCHED_OTHER  can only be used at static priority 0.  SCHED_OTHER is the
>        standard Linux time-sharing scheduler that is  intended  for  all  pro‐
>        cesses  that  do  not  require  the  special real-time mechanisms.  The
>        process to run is chosen from the static priority 0  list  based  on  a
>        dynamic priority that is determined only inside this list.  The dynamic
>        priority is based on the nice value (set by nice(2) or  setpriority(2))
>        and  increased  for  each time quantum the process is ready to run, but
>        denied to run by the scheduler.  This ensures fair progress  among  all
>        SCHED_OTHER processes.
> 
>    SCHED_BATCH: Scheduling batch processes
>        (Since  Linux 2.6.16.)  SCHED_BATCH can only be used at static priority
>        0.  This policy is similar to SCHED_OTHER  in  that  it  schedules  the
>        process  according  to  its dynamic priority (based on the nice value).
>        The difference is that this policy will cause the scheduler  to  always
>        assume  that the process is CPU-intensive.  Consequently, the scheduler
>        will apply a small scheduling penalty with respect to wakeup behaviour,
>        so that this process is mildly disfavored in scheduling decisions.
> 
>        This policy is useful for workloads that are noninteractive, but do not
>        want to lower their nice value, and for workloads that want a determin‐
>        istic scheduling policy without interactivity causing extra preemptions
>        (between the workload's tasks).
> 
>    SCHED_IDLE: Scheduling very low priority jobs
>        (Since Linux 2.6.23.)  SCHED_IDLE can only be used at  static  priority
>        0; the process nice value has no influence for this policy.
> 
>        This  policy  is  intended  for  running jobs at extremely low priority
>        (lower even than a +19 nice value with the SCHED_OTHER  or  SCHED_BATCH
>        policies).
]]]]

> RETURN VALUE
> 	On success, sched_setattr() and sched_getattr() return 0. On
> 	error, -1 is returned, and errno is set appropriately.
> 
> ERRORS
>        EINVAL The scheduling policy is not one  of  the  recognized  policies,
>               param is NULL, or param does not make sense for the policy.
> 
>        EPERM  The calling process does not have appropriate privileges.
> 
>        ESRCH  The process whose ID is pid could not be found.
> 
>        E2BIG  The provided storage for struct sched_attr is either too
>               big, see sched_setattr(), or too small, see sched_getattr().
> 
>        EBUSY  SCHED_DEADLINE admission control failure

The above is the only place on the page that mentions admission control.
As well as the suggestions above, it would be nice to have somewhere a
summary of how admission control is calculated.

> NOTES
> 	While the text above (and in SCHED_SETSCHEDULER(2)) talks about
> 	processes, in actual fact these system calls are thread specific.
> 

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-04-29 13:08               ` Michael Kerrisk (man-pages)
@ 2014-04-29 14:22                 ` Peter Zijlstra
  2014-04-29 16:04                 ` Peter Zijlstra
  1 sibling, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-04-29 14:22 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Dario Faggioli, Thomas Gleixner, Ingo Molnar, rostedt,
	Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	Paul McKenney, insop.song, liming.wang, jkacur, linux-man

On Tue, Apr 29, 2014 at 03:08:55PM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Peter,
> 
> On 04/28/2014 10:18 AM, Peter Zijlstra wrote:
> > Hi Michael,
> > 
> > find below an updated manpage, I did not apply the comments on parts
> > that are identical to SCHED_SETSCHEDULER(2) in order to keep these texts
> > in alignment. I feel that if we change one we should also change the
> > other, and such a 'patch' is best done separate from the new manpage
> > itself.
> > 
> > I did add the missing EBUSY error, and amended the text where it said
> > we'd return EINVAL in that case.
> > 
> > I added a paragraph stating that SCHED_DEADLINE preempted anything else
> > userspace can do (with the explicit mention of userspace to leave me
> > wriggle room for the kernel's stop task :-).
> > 
> > I also did a short paragraph on the deadline sched_yield(). For further
> > deadline yield details we should maybe add to the SCHED_YIELD(2)
> > manpage.
> > 
> > Re juri/claudio; no I think sched_yield() as implemented for deadline
> > makes sense, no other yield semantics other than NOP makes sense for it,
> > and since we have the syscall already might as well make it do something
> > useful.
> 
> Thanks for the updated page. Would you be willing
> to revise as per the comments below.

Ok.

> 
> > NAME
> > 	sched_setattr, sched_getattr - set and get scheduling policy/attributes
> > 
> > SYNOPSIS
> > 	#include <sched.h>
> > 
> > 	struct sched_attr {
> > 		u32 size;
> > 		u32 sched_policy;
> > 		u64 sched_flags;
> > 
> > 		/* SCHED_NORMAL, SCHED_BATCH */
> > 		s32 sched_nice;
> > 		/* SCHED_FIFO, SCHED_RR */
> > 		u32 sched_priority;
> > 		/* SCHED_DEADLINE */
> > 		u64 sched_runtime;
> > 		u64 sched_deadline;
> > 		u64 sched_period;
> > 	};
> > 	int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags);
> > 
> > 	int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size, unsigned int flags);
> > 
> > DESCRIPTION
> > 	sched_setattr() sets both the scheduling policy and the
> > 	associated attributes for the process whose ID is specified in
> > 	pid.  
> 
> Around about here, I think there needs to be a sentence explaining
> that sched_setattr() provides a superset of the functionality of 
> sched_setscheduler(2) and setpriority(2). I mean, it can do all that 
> those two calls can do, right?

Almost; setpriority() has the .which argument which we don't have. So
while that syscall can change the nice value for an entire process group
or user, sched_setattr() can only change the nice value for 1 task.

But yes, I can mention something along those lines.
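
To make that concrete (untested fragment; it assumes the same local
struct sched_attr and syscall(2)-based sched_setattr() wrapper that any
test program needs, since glibc provides neither):

    #include <sched.h>
    #include <sys/resource.h>

    static void renice_examples(void)
    {
            /* setpriority() can renice the caller's whole process group ... */
            setpriority(PRIO_PGRP, 0, 5);

            /* ... whereas sched_setattr() touches exactly one thread;
             * pid 0 means the calling thread */
            struct sched_attr attr = {
                    .size         = sizeof(attr),
                    .sched_policy = SCHED_OTHER,
                    .sched_nice   = 5,
            };
            sched_setattr(0, &attr, 0);
    }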

> > If pid equals zero, the scheduling policy and attributes
> > 	of the calling process will be set.  The interpretation of the
> > 	argument attr depends on the selected policy.  Currently, Linux
> > 	supports the following "normal" (i.e., non-real-time) scheduling
> > 	policies:
> > 
> > 	SCHED_OTHER	the standard "fair" time-sharing policy;
> > 
> > 	SCHED_BATCH	for "batch" style execution of processes; and
> > 
> > 	SCHED_IDLE	for running very low priority background jobs.
> > 
> > 	The following "real-time" policies are also supported, for
> > 	special time-critical applications that need precise control
> > 	over the way in which runnable processes are selected for
> > 	execution:
> > 
> > 	SCHED_FIFO	a first-in, first-out policy;
> > 
> > 	SCHED_RR	a round-robin policy; and
> > 
> > 	SCHED_DEADLINE	a deadline policy.
> > 
> > 	The semantics of each of these policies are detailed below.
> 
> The semantics of each of these policies are detailed in sched(7).

I don't appear to have SCHED(7), how new is that?

> [See my comments below]
> 
> > 
> > 	sched_attr::size must be set to the size of the structure, as in
> > 	sizeof(struct sched_attr), if the provided structure is smaller
> > 	than the kernel structure, any additional fields are assumed
> > 	'0'. If the provided structure is larger than the kernel
> > 	structure, the kernel verifies all additional fields are '0' if
> > 	not the syscall will fail with -E2BIG.
> > 
> > 	sched_attr::sched_policy the desired scheduling policy.
> > 
> > 	sched_attr::sched_flags additional flags that can influence
> > 	scheduling behaviour. Currently as per Linux kernel 3.14:
> > 
> > 		SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
> > 		to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
> > 		on fork().
> > 
> > 	is the only supported flag.
> > 
> > 	sched_attr::sched_nice should only be set for SCHED_OTHER,
> > 	SCHED_BATCH, the desired nice value [-20,19], see NICE(2).
> > 
> > 	sched_attr::sched_priority should only be set for SCHED_FIFO,
> > 	SCHED_RR, the desired static priority [1,99].
> > 
> > 	sched_attr::sched_runtime
> > 	sched_attr::sched_deadline
> > 	sched_attr::sched_period should only be set for SCHED_DEADLINE
> > 	and are the traditional sporadic task model parameters.
> 
> Could you add (a lot ;-)) more detail on these three fields? Assume the
> reader does not know about this traditional sporadic task model, and 
> then give some explanation of what these three fields do. Probably, at
> this point you can work in some statement  about the admission control
> test.
> 
> [but, see my comment below. It may be that sched(7) is a better
> place for this detail.]

Yes, I think SCHED(7) would be a better place; also I think I forgot to
put a reference in to Documentation/scheduler/sched-deadline.txt

I'll try and write something concise. This is the stuff of books, not
paragraphs :/

> > 	The flags argument should be 0.
> > 
> > 	sched_getattr() queries the scheduling policy currently applied
> > 	to the process identified by pid.  If pid equals zero, the
> > 	policy of the calling process will be retrieved.
> > 
> > 	The size argument should reflect the size of struct sched_attr
> > 	as known to userspace. The kernel fills out sched_attr::size to
> > 	the size of its sched_attr structure. If the user provided
> > 	structure is larger, additional fields are not touched. If the
> > 	user provided structure is smaller, but the kernel needs to
> > 	return values outside the provided space, the syscall will fail
> > 	with -E2BIG.
> > 
> > 	The flags argument should be 0.
> > 
> > 	The other sched_attr fields are filled out as described in
> > 	sched_setattr().
> 
> I assume that everything between my [[[ and ]]] blocks below is taken straight 
> from sched_setscheduler(2). (If that is not true, please let me know.)

That did indeed look about right.

> This reminds me that there is a structural fault in this part of man-pages ;-).
> The problem is sched_setscheduler(2) currently tries to do two things:
> 
> [a] Document the sched_setscheduler() and sched_getscheduler() system calls
> [b] Provide an overview of scheduling policies and parameters.
> 
> It should really only do the former. I have now gone through the task of
> separating [b] out into a separate page, sched(7), which other pages,
> such as sched_setscheduler(2) and sched_setattr(2) can refer to. You
> can see the current versions of sched_setscheduler.2 and sched.7 in Git
> (https://www.kernel.org/doc/man-pages/download.html )
> 
> So, what I would ideally like to see
> 
> [1] A page describing the sched_setattr() and sched_getattr() APIs
> [2] A piece of text describing the SCHED_DEADLINE policy, which I can
> drop into sched(7).
> 
> Could you revise like that?

ACK.

> [[[[

> ]]]]
> 
> >     SCHED_DEADLINE: Sporadic task model deadline scheduling
> >        SCHED_DEADLINE is an implementation of GEDF (Global Earliest
> >        Deadline First) with additional CBS (Constant Bandwidth Server).
> >        The CBS guarantees that tasks that over-run their specified
> >        budget are throttled and do not affect the correct performance
> >        of other SCHED_DEADLINE tasks.
> > 
> >        SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN
> > 
> >        Setting SCHED_DEADLINE can fail with -EBUSY when admission
> >        control tests fail.
> > 
> >        Because of the nature of (G)EDF, SCHED_DEADLINE tasks are the
> >        highest priority (user controllable) tasks in the system, if any
> >        SCHED_DEADLINE task is runnable it will preempt anything
> >        FIFO/RR/OTHER/BATCH/IDLE task out there.
> > 
> >        A SCHED_DEADLINE task calling sched_yield() will 'yield' the
> >        current job and wait for a new period to begin.
> 
> This is the piece that could go into sched(7), but I'd like it to include
> a discussion of deadline, period, and runtime.
> 
> [[[[

> ]]]]
> 
> > RETURN VALUE
> > 	On success, sched_setattr() and sched_getattr() return 0. On
> > 	error, -1 is returned, and errno is set appropriately.
> > 
> > ERRORS
> >        EINVAL The scheduling policy is not one  of  the  recognized  policies,
> >               param is NULL, or param does not make sense for the policy.
> > 
> >        EPERM  The calling process does not have appropriate privileges.
> > 
> >        ESRCH  The process whose ID is pid could not be found.
> > 
> >        E2BIG  The provided storage for struct sched_attr is either too
> >               big, see sched_setattr(), or too small, see sched_getattr().
> > 
> >        EBUSY  SCHED_DEADLINE admission control failure
> 
> The above is the only place on the page that mentions admission control.
> As well as the suggestions above, it would be nice to have somewhere a
> summary of how admission control is calculated.

I think I'll write down what admission control is without specifics.
Giving specifics pins you down on the implementation. In general
admission control enforces a bound on the schedulability of the task
set. New and interesting ways of computing schedulability are the
subject of papers each year.
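
Purely as an illustration of the kind of bound involved (this is not
the kernel's actual test, which may well be stricter and will change
over time), a necessary but not sufficient utilization check looks
roughly like:

    #include <stdbool.h>
    #include <stdint.h>

    struct dl_params {
            uint64_t runtime;       /* ns */
            uint64_t period;        /* ns */
    };

    /* necessary, not sufficient: the summed utilization must not
     * exceed the number of CPUs */
    static bool utilization_fits(const struct dl_params *t, int ntasks,
                                 int ncpus)
    {
            double u = 0.0;
            int i;

            for (i = 0; i < ntasks; i++)
                    u += (double)t[i].runtime / (double)t[i].period;

            return u <= (double)ncpus;
    }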

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-04-29 13:08               ` Michael Kerrisk (man-pages)
  2014-04-29 14:22                 ` Peter Zijlstra
@ 2014-04-29 16:04                 ` Peter Zijlstra
  2014-04-30 11:09                   ` Michael Kerrisk (man-pages)
  1 sibling, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-04-29 16:04 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Dario Faggioli, Thomas Gleixner, Ingo Molnar, rostedt,
	Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	Paul McKenney, insop.song, liming.wang, jkacur, linux-man

On Tue, Apr 29, 2014 at 03:08:55PM +0200, Michael Kerrisk (man-pages) wrote:

Juri, Dario, Can you have a look at the 2nd part; I'm not at all sure I
got the activate/release the right way around.

My current thinking was that we activate first, and then release it to
go run. But googling the terms only confused me more. I suppose its one
of those things that's not actually _that_ well defined. And I hope the
ASCII art actually clarifies things better than the terms used.

> [1] A page describing the sched_setattr() and sched_getattr() APIs

NAME
	sched_setattr, sched_getattr - set and get scheduling policy/attributes

SYNOPSIS
	#include <sched.h>

	struct sched_attr {
		u32 size;
		u32 sched_policy;
		u64 sched_flags;

		/* SCHED_NORMAL, SCHED_BATCH */
		s32 sched_nice;

		/* SCHED_FIFO, SCHED_RR */
		u32 sched_priority;

		/* SCHED_DEADLINE */
		u64 sched_runtime;
		u64 sched_deadline;
		u64 sched_period;
	};

	int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags);

	int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size, unsigned int flags);

DESCRIPTION
	sched_setattr() sets both the scheduling policy and the
	associated attributes for the process whose ID is specified in
	pid.

	sched_setattr() replaces sched_setscheduler(), sched_setparam(),
	nice() and some of setpriority().

	If pid equals zero, the scheduling policy and attributes
	of the calling process will be set.  The interpretation of the
	argument attr depends on the selected policy.  Currently, Linux
	supports the following "normal" (i.e., non-real-time) scheduling
	policies:

	SCHED_OTHER	the standard "fair" time-sharing policy;

	SCHED_BATCH	for "batch" style execution of processes; and

	SCHED_IDLE	for running very low priority background jobs.

	The following "real-time" policies are also supported, for
	special time-critical applications that need precise control
	over the way in which runnable processes are selected for
	execution:

	SCHED_FIFO	a static priority first-in, first-out policy;

	SCHED_RR	a static priority round-robin policy; and

	SCHED_DEADLINE	a dynamic priority deadline policy.

	The semantics of each of these policies are detailed in
	sched(7).

	sched_attr::size must be set to the size of the structure, as in
	sizeof(struct sched_attr), if the provided structure is smaller
	than the kernel structure, any additional fields are assumed
	'0'. If the provided structure is larger than the kernel
	structure, the kernel verifies all additional fields are '0' if
	not the syscall will fail with -E2BIG.

	sched_attr::sched_policy the desired scheduling policy.

	sched_attr::sched_flags additional flags that can influence
	scheduling behaviour. Currently as per Linux kernel 3.14:

		SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
		to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
		on fork().

	is the only supported flag.

	sched_attr::sched_nice should only be set for SCHED_OTHER,
	SCHED_BATCH, the desired nice value [-20,19], see sched(7).

	sched_attr::sched_priority should only be set for SCHED_FIFO,
	SCHED_RR, the desired static priority [1,99], see sched(7).

	sched_attr::sched_runtime
	sched_attr::sched_deadline
	sched_attr::sched_period should only be set for SCHED_DEADLINE
	and are the traditional sporadic task model parameters, see
	sched(7).

	The flags argument should be 0.

	sched_getattr() queries the scheduling policy currently applied
	to the process identified by pid.

	Similar to sched_setattr(), sched_getattr() replaces
	sched_getscheduler(), sched_getparam() and some of
	getpriority().

	If pid equals zero, the policy of the calling process will be
	retrieved.

	The size argument should reflect the size of struct sched_attr
	as known to userspace. The kernel fills out sched_attr::size to
	the size of its sched_attr structure. If the user provided
	structure is larger, additional fields are not touched. If the
	user provided structure is smaller, but the kernel needs to
	return values outside the provided space, the syscall will fail
	with -E2BIG.

	The flags argument should be 0.

	The other sched_attr fields are filled out as described in
	sched_setattr().

RETURN VALUE
	On success, sched_setattr() and sched_getattr() return 0. On
	error, -1 is returned, and errno is set appropriately.

ERRORS
       EINVAL The scheduling policy is not one  of  the  recognized  policies,
              param is NULL, or param does not make sense for the selected
	      policy.

       EPERM  The calling process does not have appropriate privileges.

       ESRCH  The process whose ID is pid could not be found.

       E2BIG  The provided storage for struct sched_attr is either too
              big, see sched_setattr(), or too small, see sched_getattr().

       EBUSY  SCHED_DEADLINE admission control failure, see sched(7).

NOTES
       While the text above (and in sched_setscheduler(2)) talks about
       processes, in actual fact these system calls are thread specific.
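
And the getattr side, with the size handshake spelled out. Same caveats
as always: struct sched_attr is the local copy of the layout in the
SYNOPSIS, the wrapper goes through syscall(2) because there is no libc
wrapper, and __NR_sched_getattr is assumed to be in the headers.
Untested:

    #include <errno.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int sched_getattr(pid_t pid, struct sched_attr *attr,
                             unsigned int size, unsigned int flags)
    {
            return syscall(__NR_sched_getattr, pid, attr, size, flags);
    }

    static void show_policy(pid_t pid)
    {
            struct sched_attr attr;

            /* pass the size userspace knows about; the kernel fills
             * attr.size with its own size and only fails with E2BIG if
             * it cannot fit the values it has to return */
            if (sched_getattr(pid, &attr, sizeof(attr), 0) == -1) {
                    if (errno == E2BIG)
                            fprintf(stderr, "our sched_attr is too small\n");
                    else
                            perror("sched_getattr");
                    return;
            }

            printf("pid %d: policy %u, kernel sched_attr size %u\n",
                   (int)pid, attr.sched_policy, attr.size);
    }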

> [2] A piece of text describing the SCHED_DEADLINE policy, which I can
> drop into sched(7).

    SCHED_DEADLINE: Sporadic task model deadline scheduling
       SCHED_DEADLINE is an implementation of GEDF (Global Earliest
       Deadline First) with additional CBS (Constant Bandwidth Server).

       A sporadic task is one that has a sequence of jobs, where each job
       is activated at most once per period [us]. Each job will have an
       absolute deadline relative to its activation before which it must
       finish its execution, and it shall at no time run longer
       than runtime [us] after its release.

              activation/wakeup       absolute deadline
              |        release        |
              v        v              v
       -------x--------x--------------x--------x-------
                       |<- Runtime -->|
              |<---------- Deadline ->|
              |<---------- Period  ----------->|

       This gives: runtime <= (rel) deadline <= period.

       The CBS guarantees that tasks that over-run their specified
       runtime are throttled and do not affect the correct performance
       of other SCHED_DEADLINE tasks.

       In general a task set of such tasks it not feasible/schedulable
       within the given constraints. Therefore we must do an admittance
       test on setting/changing SCHED_DEADLINE policy/attributes.

       This admission test calculates that the task set is
       feasible/schedulable, failing this, sched_setattr() will return
       -EBUSY.

       For example, it is required (but not sufficient) for the total
       utilization to be less or equal to the total amount of cpu time
       available. That is, since each job can maximally run for runtime
       [us] per period [us], that task's utilization is runtime/period.
       Summing this over all tasks must be less than the total amount of
       CPUs present.

       SCHED_DEADLINE tasks will fail fork(2) with -EAGAIN.

       Because of the nature of (G)EDF, SCHED_DEADLINE tasks are the
       highest priority (user controllable) tasks in the system, if any
       SCHED_DEADLINE task is runnable it will preempt anything
       FIFO/RR/OTHER/BATCH/IDLE task out there.

       A SCHED_DEADLINE task calling sched_yield() will 'yield' the
       current job and wait for a new period to begin.
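
In case it helps, the intended usage pattern for that sched_yield()
semantic is something like the sketch below; do_one_job() is obviously
hypothetical application code, and the task is assumed to have already
switched itself to SCHED_DEADLINE:

    #include <sched.h>

    extern void do_one_job(void);   /* hypothetical; must fit inside sched_runtime */

    static void deadline_loop(void)
    {
            for (;;) {
                    do_one_job();   /* over-running the budget gets throttled by the CBS */
                    sched_yield();  /* finished early: give up the rest of this job
                                     * and sleep until the next period begins */
            }
    }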


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-04-29 16:04                 ` Peter Zijlstra
@ 2014-04-30 11:09                   ` Michael Kerrisk (man-pages)
  2014-04-30 12:35                     ` Peter Zijlstra
  2014-04-30 13:09                     ` Peter Zijlstra
  0 siblings, 2 replies; 71+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-04-30 11:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mtk.manpages, Dario Faggioli, Thomas Gleixner, Ingo Molnar,
	rostedt, Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	Paul McKenney, insop.song, liming.wang, jkacur, linux-man

Hi Peter,

Thanks for the revision. More comments below. Could you revise in 
the light of those comments, and hopefully also after feedback from 
Juri and Dario?

On 04/29/2014 06:04 PM, Peter Zijlstra wrote:
> On Tue, Apr 29, 2014 at 03:08:55PM +0200, Michael Kerrisk (man-pages) wrote:
> 
> Juri, Dario, Can you have a look at the 2nd part; I'm not at all sure I
> got the activate/release the right way around.
> 
> My current thinking was that we activate first, and then release it to
> go run. But googling the terms only confused me more. I suppose its one
> of those things that's not actually _that_ well defined. And I hope the
> ASCII art actually clarifies things better than the terms used.
> 
>> [1] A page describing the sched_setattr() and sched_getattr() APIs
> 
> NAME
> 	sched_setattr, sched_getattr - set and get scheduling policy/attributes
> 
> SYNOPSIS
> 	#include <sched.h>
> 
> 	struct sched_attr {
> 		u32 size;
> 		u32 sched_policy;
> 		u64 sched_flags;
> 
> 		/* SCHED_NORMAL, SCHED_BATCH */
> 		s32 sched_nice;
> 
> 		/* SCHED_FIFO, SCHED_RR */
> 		u32 sched_priority;
> 
> 		/* SCHED_DEADLINE */
> 		u64 sched_runtime;
> 		u64 sched_deadline;
> 		u64 sched_period;
> 	};
> 
> 	int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags);
> 
> 	int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size, unsigned int flags);
> 
> DESCRIPTION
> 	sched_setattr() sets both the scheduling policy and the
> 	associated attributes for the process whose ID is specified in
> 	pid.
> 
> 	sched_setattr() replaces sched_setscheduler(), sched_setparam(),
> 	nice() and some of setpriority().
> 
> 	If pid equals zero, the scheduling policy and attributes
> 	of the calling process will be set.  The interpretation of the
> 	argument attr depends on the selected policy.  Currently, Linux
> 	supports the following "normal" (i.e., non-real-time) scheduling
> 	policies:
> 
> 	SCHED_OTHER	the standard "fair" time-sharing policy;
> 
> 	SCHED_BATCH	for "batch" style execution of processes; and
> 
> 	SCHED_IDLE	for running very low priority background jobs.
> 
> 	The following "real-time" policies are also supported, for
> 	special time-critical applications that need precise control
> 	over the way in which runnable processes are selected for
> 	execution:
> 
> 	SCHED_FIFO	a static priority first-in, first-out policy;
> 
> 	SCHED_RR	a static priority round-robin policy; and
> 
> 	SCHED_DEADLINE	a dynamic priority deadline policy.
> 
> 	The semantics of each of these policies are detailed in
> 	sched(7).
> 
> 	sched_attr::size must be set to the size of the structure, as in
> 	sizeof(struct sched_attr), if the provided structure is smaller
> 	than the kernel structure, any additional fields are assumed
> 	'0'. If the provided structure is larger than the kernel
> 	structure, the kernel verifies all additional fields are '0' if
> 	not the syscall will fail with -E2BIG.
> 
> 	sched_attr::sched_policy the desired scheduling policy.
> 
> 	sched_attr::sched_flags additional flags that can influence
> 	scheduling behaviour. Currently as per Linux kernel 3.14:
> 
> 		SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
> 		to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
> 		on fork().
> 
> 	is the only supported flag.
> 
> 	sched_attr::sched_nice should only be set for SCHED_OTHER,
> 	SCHED_BATCH, the desired nice value [-20,19], see sched(7).
> 
> 	sched_attr::sched_priority should only be set for SCHED_FIFO,
> 	SCHED_RR, the desired static priority [1,99], see sched(7).
> 
> 	sched_attr::sched_runtime
> 	sched_attr::sched_deadline
> 	sched_attr::sched_period should only be set for SCHED_DEADLINE
> 	and are the traditional sporadic task model parameters, see
> 	sched(7).

So, are these fields expressed in some unit (presumably microseconds)?
Best to mention that here.

> 	The flags argument should be 0.
> 
> 	sched_getattr() queries the scheduling policy currently applied
> 	to the process identified by pid.
> 
> 	Similar to sched_setattr(), sched_getattr() replaces
> 	sched_getscheduler(), sched_getparam() and some of
> 	getpriority().
> 
> 	If pid equals zero, the policy of the calling process will be
> 	retrieved.
> 
> 	The size argument should reflect the size of struct sched_attr
> 	as known to userspace. The kernel fills out sched_attr::size to
> 	the size of its sched_attr structure. If the user provided
> 	structure is larger, additional fields are not touched. If the
> 	user provided structure is smaller, but the kernel needs to
> 	return values outside the provided space, the syscall will fail
> 	with -E2BIG.
> 
> 	The flags argument should be 0.
> 
> 	The other sched_attr fields are filled out as described in
> 	sched_setattr().
> 
> RETURN VALUE
> 	On success, sched_setattr() and sched_getattr() return 0. On
> 	error, -1 is returned, and errno is set appropriately.
> 
> ERRORS
>        EINVAL The scheduling policy is not one  of  the  recognized  policies,
>               param is NULL, or param does not make sense for the selected
> 	      policy.
> 
>        EPERM  The calling process does not have appropriate privileges.
> 
>        ESRCH  The process whose ID is pid could not be found.
> 
>        E2BIG  The provided storage for struct sched_attr is either too
>               big, see sched_setattr(), or too small, see sched_getattr().
> 
>        EBUSY  SCHED_DEADLINE admission control failure, see sched(7).
> 
> NOTES
>        While the text above (and in sched_setscheduler(2)) talks about
>        processes, in actual fact these system calls are thread specific.
> 
>> [2] A piece of text describing the SCHED_DEADLINE policy, which I can
>> drop into sched(7).
> 
>     SCHED_DEADLINE: Sporadic task model deadline scheduling
>        SCHED_DEADLINE is an implementation of GEDF (Global Earliest
>        Deadline First) with additional CBS (Constant Bandwidth Server).
> 
>        A sporadic task is on that has a sequence of jobs, where each job
>        is activated at most once per period [us]. Each job will have an
>        absolute deadline relative to its activation before which it must
>        finish its execution, and it shall at no time run longer
>        than runtime [us] after its release.
> 
>               activation/wakeup       absolute deadline
>               |        release        |
>               v        v              v
>        -------x--------x--------------x--------x-------
>                        |<- Runtime -->|
>               |<---------- Deadline ->|
>               |<---------- Period  ----------->|
> 
>        This gives: runtime <= (rel) deadline <= period.

So, the 'sched_deadline' field in the 'sched_attr' expresses the release
deadline? (I had initially thought it was the "absolute deadline".)
Could you make this clearer in the text, please?

>        The CBS guarantees that tasks that over-run their specified
>        runtime are throttled and do not affect the correct performance
>        of other SCHED_DEADLINE tasks.
> 
>        In general a task set of such tasks it not feasible/schedulable

That last line is garbled. Could you fix, please.

Also, could you add some words to explain what you mean by 'task set'.

>        within the given constraints. Therefore we must do an admittance
>        test on setting/changing SCHED_DEADLINE policy/attributes.
> 
>        This admission test calculates that the task set is
>        feasible/schedulable, failing this, sched_setattr() will return
>        -EBUSY.
> 
>        For example, it is required (but not sufficient) for the total
>        utilization to be less or equal to the total amount of cpu time
>        available. That is, since each job can maximally run for runtime
>        [us] per period [us], that task's utilization is runtime/period.
>        Summing this over all tasks must be less than the total amount of
>        CPUs present.
> 
>        SCHED_DEADLINE tasks will fail fork(2) with -EAGAIN.

Except if SCHED_RESET_ON_FORK was set, right? If yes, that should be
mentioned here.

>        Because of the nature of (G)EDF, SCHED_DEADLINE tasks are the
>        highest priority (user controllable) tasks in the system, if any
>        SCHED_DEADLINE task is runnable it will preempt anything
>        FIFO/RR/OTHER/BATCH/IDLE task out there.
> 
>        A SCHED_DEADLINE task calling sched_yield() will 'yield' the
>        current job and wait for a new period to begin.

So, I'm trying to naively understand how this all works. If different 
processes specify different deadline periods, how does the kernel deal
with that? Is it worth adding some detail on this point?

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-04-30 11:09                   ` Michael Kerrisk (man-pages)
@ 2014-04-30 12:35                     ` Peter Zijlstra
  2014-04-30 13:09                     ` Peter Zijlstra
  1 sibling, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-04-30 12:35 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Dario Faggioli, Thomas Gleixner, Ingo Molnar, rostedt,
	Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	Paul McKenney, insop.song, liming.wang, jkacur, linux-man

On Wed, Apr 30, 2014 at 01:09:25PM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Peter,
> 
> Thanks for the revision. More comments below. Could you revise in 
> the light of those comments, and hopefully also after feedback from 
> Juri and Dario?
> 
> > 
> > 	sched_attr::sched_runtime
> > 	sched_attr::sched_deadline
> > 	sched_attr::sched_period should only be set for SCHED_DEADLINE
> > 	and are the traditional sporadic task model parameters, see
> > 	sched(7).
> 
> So, are these fields expressed in some unit (presumably microseconds)?
> Best to mention that here.

Oh wait, no its nanoseconds. Which means I should amend the text below.

> >> [2] A piece of text describing the SCHED_DEADLINE policy, which I can
> >> drop into sched(7).
> > 
> >     SCHED_DEADLINE: Sporadic task model deadline scheduling
> >        SCHED_DEADLINE is an implementation of GEDF (Global Earliest
> >        Deadline First) with additional CBS (Constant Bandwidth Server).
> > 
> >        A sporadic task is one that has a sequence of jobs, where each job
> >        is activated at most once per period [us]. Each job will have an
> >        absolute deadline relative to its activation before which it must

(A)

> >        finish its execution, and it shall at no time run longer
> >        than runtime [us] after its release.
> > 
> >               activation/wakeup       absolute deadline
> >               |        release        |
> >               v        v              v
> >        -------x--------x--------------x--------x-------
> >                        |<- Runtime -->|
> >               |<---------- Deadline ->|
> >               |<---------- Period  ----------->|
> > 
> >        This gives: runtime <= (rel) deadline <= period.
> 
> So, the 'sched_deadline' field in the 'sched_attr' expresses the release
> deadline? (I had initially thought it was the "absolute deadline".)
> Could you make this clearer in the text, please?

No, and yes, sched_attr::sched_deadline is a relative deadline wrt
the activation, like I said at (A).

So we get: absolute deadline = activation + relative deadline.

And we must be done running at that point, so the very last possible
release moment is: absolute deadline - runtime.

And therefore, it too is a release deadline, since we must not release
later than that.
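
Concretely, with made-up numbers: runtime = 10ms and (relative)
deadline = 30ms; a job activated at t=0 has its absolute deadline at
t=30ms, so the latest moment it can still be released and finish in
time is t=20ms.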

> >        The CBS guarantees that tasks that over-run their specified
> >        runtime are throttled and do not affect the correct performance
> >        of other SCHED_DEADLINE tasks.
> > 
> >        In general a task set of such tasks it not feasible/schedulable
> 
> That last line is garbled. Could you fix, please.

s/it/is/

> Also, could you add some words to explain what you mean by 'task set'.

A set of tasks? :-) In particular, all SCHED_DEADLINE tasks in the
system, which is what 'of such' refers to.

> >        within the given constraints. Therefore we must do an admittance
> >        test on setting/changing SCHED_DEADLINE policy/attributes.
> > 
> >        This admission test calculates that the task set is
> >        feasible/schedulable, failing this, sched_setattr() will return
> >        -EBUSY.
> > 
> >        For example, it is required (but not sufficient) for the total
> >        utilization to be less or equal to the total amount of cpu time
> >        available. That is, since each job can maximally run for runtime
> >        [us] per period [us], that task's utilization is runtime/period.
> >        Summing this over all tasks must be less than the total amount of
> >        CPUs present.
> > 
> >        SCHED_DEADLINE tasks will fail fork(2) with -EAGAIN.
> 
> Except if SCHED_RESET_ON_FORK was set, right? If yes, that should be
> mentioned here.

Ah, indeed.

> >        Because of the nature of (G)EDF, SCHED_DEADLINE tasks are the
> >        highest priority (user controllable) tasks in the system, if any
> >        SCHED_DEADLINE task is runnable it will preempt anything
> >        FIFO/RR/OTHER/BATCH/IDLE task out there.
> > 
> >        A SCHED_DEADLINE task calling sched_yield() will 'yield' the
> >        current job and wait for a new period to begin.
> 
> So, I'm trying to naively understand how this all works. If different 
> processes specify different deadline periods, how does the kernel deal
> with that? Is it worth adding some detail on this point?

Userspace should not rely on any implementation details there. Saying
it's a (G)EDF scheduler is maybe already too much. All that userspace
should really care about is that its tasks _should_ be scheduled such
that they meet the specified requirements.

There are multiple scheduling algorithms that can be employed to make it
so, and I don't want to pin us to whatever we chose to implement this
time.

That said, the current (G)EDF is a soft realtime scheduler in that it
guarantees a bounded tardiness (which is the time we can miss the
deadline by), but not a hard realtime one, since the bound is not 0.

Anyway, for your elucidation: assume no overhead and a UP system
(SMP is a right head-ache), and further assume that deadline ==
period. It is reasonably straightforward to see that scheduling the
task with the earliest deadline will satisfy the constraints IFF the
total utilization (\Sum runtime_i / deadline_i) <= 1.

Suppose two tasks: A := { 5, 10 } and B := { 10, 20 } with strict
periodic activation:

    A1,B1     A2        Ad2
    |         Ad1       Bd1
    v         v         v
  --AAAAABBBBBAAAAABBBBBx--
  --AAAAABBBBBBBBBBAAAAAx--

Where A# is the #th activation, Ad# is the corresponding #th deadline
before which we must have sufficient time.

Since we're perfectly synced up there is a tie and we get two possible
outcomes. But note that in either case A has gotten 2x its 5 As and B
has gotten its 10 Bs.

Non-periodic activation, and deadline != period make the thing more
interesting, but at that point I would ask Juri (or others) to refer you
to a paper/book.

Now, let me go update the texts yet again :-)

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-04-30 11:09                   ` Michael Kerrisk (man-pages)
  2014-04-30 12:35                     ` Peter Zijlstra
@ 2014-04-30 13:09                     ` Peter Zijlstra
  2014-05-03 10:43                       ` Juri Lelli
  1 sibling, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-04-30 13:09 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Dario Faggioli, Thomas Gleixner, Ingo Molnar, rostedt,
	Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	Paul McKenney, insop.song, liming.wang, jkacur, linux-man

On Wed, Apr 30, 2014 at 01:09:25PM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Peter,
> 
> Thanks for the revision. More comments below. Could you revise in 
> the light of those comments, and hopefully also after feedback from 
> Juri and Dario?

New text below; hopefully a little clearer. If not, do holler.

---
> [1] A page describing the sched_setattr() and sched_getattr() APIs

NAME
	sched_setattr, sched_getattr - set and get scheduling policy/attributes

SYNOPSIS
	#include <sched.h>

	struct sched_attr {
		u32 size;
		u32 sched_policy;
		u64 sched_flags;

		/* SCHED_NORMAL, SCHED_BATCH */
		s32 sched_nice;

		/* SCHED_FIFO, SCHED_RR */
		u32 sched_priority;

		/* SCHED_DEADLINE */
		u64 sched_runtime;
		u64 sched_deadline;
		u64 sched_period;
	};

	int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags);

	int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size, unsigned int flags);

DESCRIPTION
	sched_setattr() sets both the scheduling policy and the
	associated attributes for the process whose ID is specified in
	pid.

	sched_setattr() replaces sched_setscheduler(), sched_setparam(),
	nice() and some of setpriority().

	If pid equals zero, the scheduling policy and attributes
	of the calling process will be set.  The interpretation of the
	argument attr depends on the selected policy.  Currently, Linux
	supports the following "normal" (i.e., non-real-time) scheduling
	policies:

	SCHED_OTHER	the standard "fair" time-sharing policy;

	SCHED_BATCH	for "batch" style execution of processes; and

	SCHED_IDLE	for running very low priority background jobs.

	The following "real-time" policies are also supported, for
	special time-critical applications that need precise control
	over the way in which runnable processes are selected for
	execution:

	SCHED_FIFO	a static priority first-in, first-out policy;

	SCHED_RR	a static priority round-robin policy; and

	SCHED_DEADLINE	a dynamic priority deadline policy.

	The semantics of each of these policies are detailed in
	sched(7).

	sched_attr::size must be set to the size of the structure, as in
	sizeof(struct sched_attr). If the provided structure is smaller
	than the kernel structure, any additional fields are assumed to
	be '0'. If the provided structure is larger than the kernel
	structure, the kernel verifies that all additional fields are
	'0'; if not, the syscall will fail with -E2BIG.

	sched_attr::sched_policy the desired scheduling policy.

	sched_attr::sched_flags additional flags that can influence
	scheduling behaviour. Currently as per Linux kernel 3.14:

		SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
		to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
		on fork().

	is the only supported flag.

	sched_attr::sched_nice should only be set for SCHED_OTHER,
	SCHED_BATCH, the desired nice value [-20,19], see sched(7).

	sched_attr::sched_priority should only be set for SCHED_FIFO,
	SCHED_RR, the desired static priority [1,99], see sched(7).

	sched_attr::sched_runtime in nanoseconds,
	sched_attr::sched_deadline in nanoseconds,
	sched_attr::sched_period in nanoseconds, should only be set for
	SCHED_DEADLINE and are the traditional sporadic task model
	parameters, see sched(7).

	The flags argument should be 0.

	sched_getattr() queries the scheduling policy currently applied
	to the process identified by pid.

	Similar to sched_setattr(), sched_getattr() replaces
	sched_getscheduler(), sched_getparam() and some of
	getpriority().

	If pid equals zero, the policy of the calling process will be
	retrieved.

	The size argument should reflect the size of struct sched_attr
	as known to userspace. The kernel fills out sched_attr::size to
	the size of its sched_attr structure. If the user provided
	structure is larger, additional fields are not touched. If the
	user provided structure is smaller, but the kernel needs to
	return values outside the provided space, the syscall will fail
	with -E2BIG.

	The flags argument should be 0.

	The other sched_attr fields are filled out as described in
	sched_setattr().

RETURN VALUE
	On success, sched_setattr() and sched_getattr() return 0. On
	error, -1 is returned, and errno is set appropriately.

ERRORS
       EINVAL The scheduling policy is not one  of  the  recognized  policies,
              attr is NULL, or attr does not make sense for the selected
	      policy.

       EPERM  The calling process does not have appropriate privileges.

       ESRCH  The process whose ID is pid could not be found.

       E2BIG  The provided storage for struct sched_attr is either too
              big, see sched_setattr(), or too small, see sched_getattr().

       EBUSY  SCHED_DEADLINE admission control failure, see sched(7).

NOTES
       While the text above (and in sched_setscheduler(2)) talks about
       processes, in actual fact these system calls are thread specific.

       While the SCHED_DEADLINE parameters are in nanoseconds, current
       kernels truncate the lower 10 bits and we get an effective
       microsecond resolution.

> [2] A piece of text describing the SCHED_DEADLINE policy, which I can
> drop into sched(7).

    SCHED_DEADLINE: Sporadic task model deadline scheduling
       SCHED_DEADLINE is currently implemented using GEDF (Global
       Earliest Deadline First) with additional CBS (Constant Bandwidth
       Server).

       A sporadic task is one that has a sequence of jobs, where each job
       is activated at most once per period [ns]. Each job will have an
       absolute deadline relative to its activation before which it must
       finish its execution, and it shall at no time run longer
       than runtime [ns] after its release.

              activation/wakeup       absolute deadline
              |        release        |
              v        v              v
       -------x--------x--------------x--------x-------
                       |<- Runtime -->|
              |<---------- Deadline ->|
              |<---------- Period  ----------->|

       This gives: runtime <= (rel) deadline <= period.

       The CBS guarantees non-interference between tasks, by throttling
       tasks that attempt to over-run their specified runtime.

       In general the set of all SCHED_DEADLINE tasks is not
       feasible/schedulable within the given constraints. Therefore we
       must do an admittance test on setting/changing SCHED_DEADLINE
       policy/attributes.

       This admission test checks whether the task set is
       feasible/schedulable; failing this, sched_setattr() will return
       -EBUSY.

       For example, it is required (but not necessarily sufficient) for
       the total utilization to be less than or equal to the total number
       of CPUs available, where, since each task can maximally run for
       runtime [ns] per period [ns], that task's utilization is its
       runtime/period.

       Because we must be able to calculate admittance, SCHED_DEADLINE
       tasks are the highest priority (user controllable) tasks in the
       system; if any SCHED_DEADLINE task is runnable it will preempt
       any FIFO/RR/OTHER/BATCH/IDLE task.

       SCHED_DEADLINE tasks will fail fork(2) with -EAGAIN, except when
       the forking task has SCHED_FLAG_RESET_ON_FORK set.

       A SCHED_DEADLINE task calling sched_yield() will 'yield' the
       current job and wait for a new period to begin.
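
       (Again purely illustrative, not proposed page text: a sketch of
       switching the calling thread to SCHED_DEADLINE with made-up
       parameters of 10 ms of runtime every 100 ms. It reuses the
       sched_attr_probe declaration and headers from the earlier sketch,
       assumes SYS_sched_setattr is available, and falls back to the
       policy value current kernels use in case <sched.h> does not
       define SCHED_DEADLINE.)

	#ifndef SCHED_DEADLINE
	#define SCHED_DEADLINE 6	/* value used by current kernels */
	#endif

	static int become_deadline_task(void)
	{
		struct sched_attr_probe attr;

		memset(&attr, 0, sizeof(attr));
		attr.size           = sizeof(attr);
		attr.sched_policy   = SCHED_DEADLINE;
		attr.sched_runtime  =  10 * 1000 * 1000;	/*  10 ms */
		attr.sched_deadline = 100 * 1000 * 1000;	/* 100 ms */
		attr.sched_period   = 100 * 1000 * 1000;	/* 100 ms */

		/* runtime <= deadline <= period must hold; expect EPERM
		   without the needed privileges and EBUSY if admission
		   control rejects the task, as described above. */
		if (syscall(SYS_sched_setattr, 0, &attr, 0) == -1) {
			perror("sched_setattr");
			return -1;
		}
		return 0;
	}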


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-04-30 13:09                     ` Peter Zijlstra
@ 2014-05-03 10:43                       ` Juri Lelli
  2014-05-05  6:55                         ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 71+ messages in thread
From: Juri Lelli @ 2014-05-03 10:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michael Kerrisk (man-pages),
	Dario Faggioli, Thomas Gleixner, Ingo Molnar, rostedt,
	Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, Paul McKenney,
	insop.song, liming.wang, jkacur, linux-man

Hi,

sorry for the late reply, but I was travelling for work.

On Wed, 30 Apr 2014 15:09:37 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Apr 30, 2014 at 01:09:25PM +0200, Michael Kerrisk (man-pages) wrote:
> > Hi Peter,
> > 
> > Thanks for the revision. More comments below. Could you revise in 
> > the light of those comments, and hopefully also after feedback from 
> > Juri and Dario?
> 
> New text below; hopefully a little clearer. If not, do holler.
> 
> ---
> > [1] A page describing the sched_setattr() and sched_getattr() APIs
> 
> NAME
> 	sched_setattr, sched_getattr - set and get scheduling policy/attributes
> 
> SYNOPSIS
> 	#include <sched.h>
> 
> 	struct sched_attr {
> 		u32 size;
> 		u32 sched_policy;
> 		u64 sched_flags;
> 
> 		/* SCHED_NORMAL, SCHED_BATCH */
> 		s32 sched_nice;
> 
> 		/* SCHED_FIFO, SCHED_RR */
> 		u32 sched_priority;
> 
> 		/* SCHED_DEADLINE */
> 		u64 sched_runtime;
> 		u64 sched_deadline;
> 		u64 sched_period;
> 	};
> 
> 	int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags);
> 
> 	int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size, unsigned int flags);
> 
> DESCRIPTION
> 	sched_setattr() sets both the scheduling policy and the
> 	associated attributes for the process whose ID is specified in
> 	pid.
> 
> 	sched_setattr() replaces sched_setscheduler(), sched_setparam(),
> 	nice() and some of setpriority().
> 
> 	If pid equals zero, the scheduling policy and attributes
> 	of the calling process will be set.  The interpretation of the
> 	argument attr depends on the selected policy.  Currently, Linux
> 	supports the following "normal" (i.e., non-real-time) scheduling
> 	policies:
> 
> 	SCHED_OTHER	the standard "fair" time-sharing policy;
> 
> 	SCHED_BATCH	for "batch" style execution of processes; and
> 
> 	SCHED_IDLE	for running very low priority background jobs.
> 
> 	The following "real-time" policies are also supported, for
> 	special time-critical applications that need precise control
> 	over the way in which runnable processes are selected for
> 	execution:
> 
> 	SCHED_FIFO	a static priority first-in, first-out policy;
> 
> 	SCHED_RR	a static priority round-robin policy; and
> 
> 	SCHED_DEADLINE	a dynamic priority deadline policy.
> 
> 	The semantics of each of these policies are detailed in
> 	sched(7).
> 
> 	sched_attr::size must be set to the size of the structure, as in
> 	sizeof(struct sched_attr), if the provided structure is smaller
> 	than the kernel structure, any additional fields are assumed
> 	'0'. If the provided structure is larger than the kernel
> 	structure, the kernel verifies all additional fields are '0' if
> 	not the syscall will fail with -E2BIG.
> 
> 	sched_attr::sched_policy the desired scheduling policy.
> 
> 	sched_attr::sched_flags additional flags that can influence
> 	scheduling behaviour. Currently as per Linux kernel 3.14:
> 
> 		SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
> 		to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
> 		on fork().
> 
> 	is the only supported flag.
> 
> 	sched_attr::sched_nice should only be set for SCHED_OTHER,
> 	SCHED_BATCH, the desired nice value [-20,19], see sched(7).
> 
> 	sched_attr::sched_priority should only be set for SCHED_FIFO,
> 	SCHED_RR, the desired static priority [1,99], see sched(7).
> 
> 	sched_attr::sched_runtime in nanoseconds,
> 	sched_attr::sched_deadline in nanoseconds,
> 	sched_attr::sched_period in nanoseconds, should only be set for
> 	SCHED_DEADLINE and are the traditional sporadic task model
> 	parameters, see sched(7).
> 
> 	The flags argument should be 0.
> 
> 	sched_getattr() queries the scheduling policy currently applied
> 	to the process identified by pid.
> 
> 	Similar to sched_setattr(), sched_getattr() replaces
> 	sched_getscheduler(), sched_getparam() and some of
> 	getpriority().
> 
> 	If pid equals zero, the policy of the calling process will be
> 	retrieved.
> 
> 	The size argument should reflect the size of struct sched_attr
> 	as known to userspace. The kernel fills out sched_attr::size to
> 	the size of its sched_attr structure. If the user provided
> 	structure is larger, additional fields are not touched. If the
> 	user provided structure is smaller, but the kernel needs to
> 	return values outside the provided space, the syscall will fail
> 	with -E2BIG.
> 
> 	The flags argument should be 0.
> 
> 	The other sched_attr fields are filled out as described in
> 	sched_setattr().
> 
> RETURN VALUE
> 	On success, sched_setattr() and sched_getattr() return 0. On
> 	error, -1 is returned, and errno is set appropriately.
> 
> ERRORS
>        EINVAL The scheduling policy is not one  of  the  recognized  policies,
>               param is NULL, or param does not make sense for the selected
> 	      policy.
> 
>        EPERM  The calling process does not have appropriate privileges.
> 
>        ESRCH  The process whose ID is pid could not be found.
> 
>        E2BIG  The provided storage for struct sched_attr is either too
>               big, see sched_setattr(), or too small, see sched_getattr().
> 
>        EBUSY  SCHED_DEADLINE admission control failure, see sched(7).
> 
> NOTES
>        While the text above (and in sched_setscheduler(2)) talks about
>        processes, in actual fact these system calls are thread specific.
> 
>        While the SCHED_DEADLINE parameters are in nanoseconds, current
>        kernels truncate the lower 10 bits and we get an effective
>        microsecond resolution.
> 
> > [2] A piece of text describing the SCHED_DEADLINE policy, which I can
> > drop into sched(7).
> 

I'd tweak the following a bit, just to be sure that users understand
that the model of a task's behaviour is one thing and what you can set
using SCHED_DEADLINE is another. The two things are obviously closely
related, but in principle different settings can be used to schedule
the same task set (there is a lot of literature about optimal settings
and so on).

>     SCHED_DEADLINE: Sporadic task model deadline scheduling
>        SCHED_DEADLINE is currently implemented using GEDF (Global
>        Earliest Deadline First) with additional CBS (Constant Bandwidth
>        Server).
> 
>        A sporadic task is on that has a sequence of jobs, where each job
>        is activated at most once per period [ns]. Each job will have an
>        absolute deadline relative to its activation before which it must
>        finish its execution, and it shall at no time run longer
>        than runtime [ns] after its release.
> 

A sporadic task is one that has a sequence of jobs, where each job is
activated at most once per period. Each job has also a relative
deadline, before which it should finish execution, and a computation
time, that is the time necessary for executing the job without
interruption. The instant of time when a task wakes up, because a new
job has to be executed, is called arrival time (and it is also referred
to as request time or release time). Start time is instead the time at
which a task starts its execution. The absolute deadline is thus
obtained adding the relative deadline to the arrival time. The
following diagram clarifies these terms:

>               activation/wakeup       absolute deadline
>               |        release        |
>               v        v              v
>        -------x--------x--------------x--------x-------
>                        |<- Runtime -->|
>               |<---------- Deadline ->|
>               |<---------- Period  ----------->|
> 

               arrival/wakeup           absolute deadline
               |        start time          |
               v        v                   v
        -------x--------xoooooooooooo-------x--------x-----
                        |<- comp. ->|
               |<---------- rel. deadline ->|
               |<---------- period   --------------->|

SCHED_DEADLINE allows the user to specify three parameters (see
sched_setattr(2)): Runtime [ns], Deadline [ns] and Period [ns]. Such
parameters has not necessarily to correspond to the aforementioned
terms, while usual practise is to set Runtime to something bigger than
the average computation time (or worst-case execution time for hard
real-time tasks), Deadline to the relative deadline and Period to the
period of the task. With such a setting we would have:

               arrival/wakeup           absolute deadline
               |        start time          |
               v        v                   v
        -------x--------xoooooooooooo-------x--------x-----
                        |<- Runtime  ->|
               |<---------- Deadline ------>|
               |<---------- Period   --------------->|
 


>        This gives: runtime <= (rel) deadline <= period.
> 

It is checked that: Runtime <= Deadline <= Period.
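
(A made-up numeric example, just to make the mapping concrete: a decoder
that receives a frame every 40 ms, must finish before the next frame
arrives, and needs at most about 5 ms of CPU per frame could use, in
nanoseconds:

      .sched_runtime  =  5000000,	/*  5 ms, above the expected worst case */
      .sched_deadline = 40000000,	/* 40 ms, the relative deadline */
      .sched_period   = 40000000,	/* 40 ms, one frame time */

which satisfies Runtime <= Deadline <= Period.)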

>        The CBS guarantees non-interference between tasks, by throttling
>        tasks that attempt to over-run their specified runtime.
> 

s/runtime/Runtime to be consistent.

>        In general the set of all SCHED_DEADLINE tasks is not
>        feasible/schedulable within the given constraints. Therefore we
>        must do an admittance test on setting/changing SCHED_DEADLINE
>        policy/attributes.
> 

To guarantee some degree of timeliness we must do an admission test on
setting/changing SCHED_DEADLINE policy/attributes.


>        This admission test calculates that the task set is
>        feasible/schedulable, failing this, sched_setattr() will return
>        -EBUSY.
> 
>        For example, it is required (but not necessarily sufficient) for
>        the total utilization to be less or equal to the total amount of
>        CPUs available, where, since each task can maximally run for
>        runtime [us] per period [us], that task's utilization is its
>        runtime/period.
> 

CPUs available, where, since each task can maximally run for Runtime
per Period, that task's utilization is its Runtime/Period.

>        Because we must be able to calculate admittance SCHED_DEADLINE
>        tasks are the highest priority (user controllable) tasks in the
>        system, if any SCHED_DEADLINE task is runnable it will preempt
>        any FIFO/RR/OTHER/BATCH/IDLE task.
> 
>        SCHED_DEADLINE tasks will fail fork(2) with -EAGAIN, except when
>        the forking task has SCHED_FLAG_RESET_ON_FORK set.
> 
>        A SCHED_DEADLINE task calling sched_yield() will 'yield' the
>        current job and wait for a new period to begin.
> 

Does it look any better?

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-05-03 10:43                       ` Juri Lelli
@ 2014-05-05  6:55                         ` Michael Kerrisk (man-pages)
  2014-05-05  7:21                           ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-05-05  6:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, mtk.manpages, Dario Faggioli, Thomas Gleixner,
	Ingo Molnar, rostedt, Oleg Nesterov, fweisbec, darren,
	johan.eker, p.faure, Linux Kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, Paul McKenney, insop.song, liming.wang, jkacur,
	linux-man

Hi Peter,

Looks like a good set of comments from Juri. Could you revise and 
resubmit?

By the way, I assume you are just writing this page as raw text.
While I'd prefer to get proper man markup source, I'll add that
if you don't :-/. But, in that case, I need to know the
copyright and license you want to use. Please see
https://www.kernel.org/doc/man-pages/licenses.html

Cheers,

Michael


On 05/03/2014 12:43 PM, Juri Lelli wrote:
> Hi,
> 
> sorry for the late reply, but I was travelling for work.
> 
> On Wed, 30 Apr 2014 15:09:37 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Wed, Apr 30, 2014 at 01:09:25PM +0200, Michael Kerrisk (man-pages) wrote:
>>> Hi Peter,
>>>
>>> Thanks for the revision. More comments below. Could you revise in 
>>> the light of those comments, and hopefully also after feedback from 
>>> Juri and Dario?
>>
>> New text below; hopefully a little clearer. If not, do holler.
>>
>> ---
>>> [1] A page describing the sched_setattr() and sched_getattr() APIs
>>
>> NAME
>> 	sched_setattr, sched_getattr - set and get scheduling policy/attributes
>>
>> SYNOPSIS
>> 	#include <sched.h>
>>
>> 	struct sched_attr {
>> 		u32 size;
>> 		u32 sched_policy;
>> 		u64 sched_flags;
>>
>> 		/* SCHED_NORMAL, SCHED_BATCH */
>> 		s32 sched_nice;
>>
>> 		/* SCHED_FIFO, SCHED_RR */
>> 		u32 sched_priority;
>>
>> 		/* SCHED_DEADLINE */
>> 		u64 sched_runtime;
>> 		u64 sched_deadline;
>> 		u64 sched_period;
>> 	};
>>
>> 	int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags);
>>
>> 	int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size, unsigned int flags);
>>
>> DESCRIPTION
>> 	sched_setattr() sets both the scheduling policy and the
>> 	associated attributes for the process whose ID is specified in
>> 	pid.
>>
>> 	sched_setattr() replaces sched_setscheduler(), sched_setparam(),
>> 	nice() and some of setpriority().
>>
>> 	If pid equals zero, the scheduling policy and attributes
>> 	of the calling process will be set.  The interpretation of the
>> 	argument attr depends on the selected policy.  Currently, Linux
>> 	supports the following "normal" (i.e., non-real-time) scheduling
>> 	policies:
>>
>> 	SCHED_OTHER	the standard "fair" time-sharing policy;
>>
>> 	SCHED_BATCH	for "batch" style execution of processes; and
>>
>> 	SCHED_IDLE	for running very low priority background jobs.
>>
>> 	The following "real-time" policies are also supported, for
>> 	special time-critical applications that need precise control
>> 	over the way in which runnable processes are selected for
>> 	execution:
>>
>> 	SCHED_FIFO	a static priority first-in, first-out policy;
>>
>> 	SCHED_RR	a static priority round-robin policy; and
>>
>> 	SCHED_DEADLINE	a dynamic priority deadline policy.
>>
>> 	The semantics of each of these policies are detailed in
>> 	sched(7).
>>
>> 	sched_attr::size must be set to the size of the structure, as in
>> 	sizeof(struct sched_attr), if the provided structure is smaller
>> 	than the kernel structure, any additional fields are assumed
>> 	'0'. If the provided structure is larger than the kernel
>> 	structure, the kernel verifies all additional fields are '0' if
>> 	not the syscall will fail with -E2BIG.
>>
>> 	sched_attr::sched_policy the desired scheduling policy.
>>
>> 	sched_attr::sched_flags additional flags that can influence
>> 	scheduling behaviour. Currently as per Linux kernel 3.14:
>>
>> 		SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
>> 		to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
>> 		on fork().
>>
>> 	is the only supported flag.
>>
>> 	sched_attr::sched_nice should only be set for SCHED_OTHER,
>> 	SCHED_BATCH, the desired nice value [-20,19], see sched(7).
>>
>> 	sched_attr::sched_priority should only be set for SCHED_FIFO,
>> 	SCHED_RR, the desired static priority [1,99], see sched(7).
>>
>> 	sched_attr::sched_runtime in nanoseconds,
>> 	sched_attr::sched_deadline in nanoseconds,
>> 	sched_attr::sched_period in nanoseconds, should only be set for
>> 	SCHED_DEADLINE and are the traditional sporadic task model
>> 	parameters, see sched(7).
>>
>> 	The flags argument should be 0.
>>
>> 	sched_getattr() queries the scheduling policy currently applied
>> 	to the process identified by pid.
>>
>> 	Similar to sched_setattr(), sched_getattr() replaces
>> 	sched_getscheduler(), sched_getparam() and some of
>> 	getpriority().
>>
>> 	If pid equals zero, the policy of the calling process will be
>> 	retrieved.
>>
>> 	The size argument should reflect the size of struct sched_attr
>> 	as known to userspace. The kernel fills out sched_attr::size to
>> 	the size of its sched_attr structure. If the user provided
>> 	structure is larger, additional fields are not touched. If the
>> 	user provided structure is smaller, but the kernel needs to
>> 	return values outside the provided space, the syscall will fail
>> 	with -E2BIG.
>>
>> 	The flags argument should be 0.
>>
>> 	The other sched_attr fields are filled out as described in
>> 	sched_setattr().
>>
>> RETURN VALUE
>> 	On success, sched_setattr() and sched_getattr() return 0. On
>> 	error, -1 is returned, and errno is set appropriately.
>>
>> ERRORS
>>        EINVAL The scheduling policy is not one  of  the  recognized  policies,
>>               param is NULL, or param does not make sense for the selected
>> 	      policy.
>>
>>        EPERM  The calling process does not have appropriate privileges.
>>
>>        ESRCH  The process whose ID is pid could not be found.
>>
>>        E2BIG  The provided storage for struct sched_attr is either too
>>               big, see sched_setattr(), or too small, see sched_getattr().
>>
>>        EBUSY  SCHED_DEADLINE admission control failure, see sched(7).
>>
>> NOTES
>>        While the text above (and in sched_setscheduler(2)) talks about
>>        processes, in actual fact these system calls are thread specific.
>>
>>        While the SCHED_DEADLINE parameters are in nanoseconds, current
>>        kernels truncate the lower 10 bits and we get an effective
>>        microsecond resolution.
>>
>>> [2] A piece of text describing the SCHED_DEADLINE policy, which I can
>>> drop into sched(7).
>>
> 
> I'd tweak the following a bit, just to be sure that users understand
> that the model of a task's behaviour is one thing and what you can set
> using SCHED_DEADLINE is another. The two things are obviously closely
> related, but in principle different settings can be used to schedule
> the same task set (there is a lot of literature about optimal settings
> and so on).
> 
>>     SCHED_DEADLINE: Sporadic task model deadline scheduling
>>        SCHED_DEADLINE is currently implemented using GEDF (Global
>>        Earliest Deadline First) with additional CBS (Constant Bandwidth
>>        Server).
>>
>>        A sporadic task is on that has a sequence of jobs, where each job
>>        is activated at most once per period [ns]. Each job will have an
>>        absolute deadline relative to its activation before which it must
>>        finish its execution, and it shall at no time run longer
>>        than runtime [ns] after its release.
>>
> 
> A sporadic task is one that has a sequence of jobs, where each job is
> activated at most once per period. Each job has also a relative
> deadline, before which it should finish execution, and a computation
> time, that is the time necessary for executing the job without
> interruption. The instant of time when a task wakes up, because a new
> job has to be executed, is called arrival time (and it is also referred
> to as request time or release time). Start time is instead the time at
> which a task starts its execution. The absolute deadline is thus
> obtained adding the relative deadline to the arrival time. The
> following diagram clarifies these terms:
> 
>>               activation/wakeup       absolute deadline
>>               |        release        |
>>               v        v              v
>>        -------x--------x--------------x--------x-------
>>                        |<- Runtime -->|
>>               |<---------- Deadline ->|
>>               |<---------- Period  ----------->|
>>
> 
>                arrival/wakeup           absolute deadline
>                |        start time          |
>                v        v                   v
>         -------x--------xoooooooooooo-------x--------x-----
>                         |<- comp. ->|
>                |<---------- rel. deadline ->|
>                |<---------- period   --------------->|
> 
> SCHED_DEADLINE allows the user to specify three parameters (see
> sched_setattr(2)): Runtime [ns], Deadline [ns] and Period [ns]. Such
> parameters has not necessarily to correspond to the aforementioned
> terms, while usual practise is to set Runtime to something bigger than
> the average computation time (or worst-case execution time for hard
> real-time tasks), Deadline to the relative deadline and Period to the
> period of the task. With such a setting we would have:
> 
>                arrival/wakeup           absolute deadline
>                |        start time          |
>                v        v                   v
>         -------x--------xoooooooooooo-------x--------x-----
>                         |<- Runtime  ->|
>                |<---------- Deadline ------>|
>                |<---------- Period   --------------->|
>  
> 
> 
>>        This gives: runtime <= (rel) deadline <= period.
>>
> 
> It is checked that: Runtime <= Deadline <= Period.
> 
>>        The CBS guarantees non-interference between tasks, by throttling
>>        tasks that attempt to over-run their specified runtime.
>>
> 
> s/runtime/Runtime to be consistent.
> 
>>        In general the set of all SCHED_DEADLINE tasks is not
>>        feasible/schedulable within the given constraints. Therefore we
>>        must do an admittance test on setting/changing SCHED_DEADLINE
>>        policy/attributes.
>>
> 
> To guarantee some degree of timeliness we must do an admission test on
> setting/changing SCHED_DEADLINE policy/attributes.
> 
> 
>>        This admission test calculates that the task set is
>>        feasible/schedulable, failing this, sched_setattr() will return
>>        -EBUSY.
>>
>>        For example, it is required (but not necessarily sufficient) for
>>        the total utilization to be less or equal to the total amount of
>>        CPUs available, where, since each task can maximally run for
>>        runtime [us] per period [us], that task's utilization is its
>>        runtime/period.
>>
> 
> CPUs available, where, since each task can maximally run for Runtime
> per Period, that task's utilization is its Runtime/Period.
> 
>>        Because we must be able to calculate admittance SCHED_DEADLINE
>>        tasks are the highest priority (user controllable) tasks in the
>>        system, if any SCHED_DEADLINE task is runnable it will preempt
>>        any FIFO/RR/OTHER/BATCH/IDLE task.
>>
>>        SCHED_DEADLINE tasks will fail fork(2) with -EAGAIN, except when
>>        the forking task has SCHED_FLAG_RESET_ON_FORK set.
>>
>>        A SCHED_DEADLINE task calling sched_yield() will 'yield' the
>>        current job and wait for a new period to begin.
>>
> 
> Does it look any better?
> 
> Thanks,
> 
> - Juri
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-05-05  6:55                         ` Michael Kerrisk (man-pages)
@ 2014-05-05  7:21                           ` Peter Zijlstra
  2014-05-05  7:41                             ` Michael Kerrisk (man-pages)
  2014-05-06  8:16                             ` Peter Zijlstra
  0 siblings, 2 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-05-05  7:21 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Juri Lelli, Dario Faggioli, Thomas Gleixner, Ingo Molnar,
	rostedt, Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, Paul McKenney,
	insop.song, liming.wang, jkacur, linux-man

[-- Attachment #1: Type: text/plain, Size: 942 bytes --]

On Mon, May 05, 2014 at 08:55:28AM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Peter,
> 
> Looks like a good set of comments from Juri. Could you revise and 
> resubmit?

Yeah, I'll try and get it done today, but there's a few icky bugs
waiting for my attention as well, I'll do me bestest :-)

> By the way, I assume you are just writing this page as raw text.
> While I'd prefer to get proper man markup source, I'll add that
> if you don't :-/. 

Well, learning *roff will likely take me more time than writing this
text + all revisions so far :/ But yeah, I appreciate the grief.

Is there a TeX variant one could use to generate the *roff muck? While
my TeX isn't entirely fresh, it's at least something I've done lots of.

> But, in that case, I need to know the
> copyright and license you want to use. Please see
> https://www.kernel.org/doc/man-pages/licenses.html

GPLv2 + DOC (not v2+) sounds good.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-05-05  7:21                           ` Peter Zijlstra
@ 2014-05-05  7:41                             ` Michael Kerrisk (man-pages)
  2014-05-05  7:47                               ` Peter Zijlstra
  2014-05-06  8:16                             ` Peter Zijlstra
  1 sibling, 1 reply; 71+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-05-05  7:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mtk.manpages, Juri Lelli, Dario Faggioli, Thomas Gleixner,
	Ingo Molnar, rostedt, Oleg Nesterov, fweisbec, darren,
	johan.eker, p.faure, Linux Kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, Paul McKenney, insop.song, liming.wang, jkacur,
	linux-man

On 05/05/2014 09:21 AM, Peter Zijlstra wrote:
> On Mon, May 05, 2014 at 08:55:28AM +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Peter,
>>
>> Looks like a good set of comments from Juri. Could you revise and 
>> resubmit?
> 
> Yeah, I'll try and get it done today, but there's a few icky bugs
> waiting for my attention as well, I'll do me bestest :-)
> 
>> By the way, I assume you are just writing this page as raw text.
>> While I'd prefer to get proper man markup source, I'll add that
>> if you don't :-/. 
> 
> Well, learning *roff will likely take me more time than writing this
> text + all revisions so far :/ But yeah, I appreciate the grief.
> 
> Is there a TeX variant one could use to generate the *roff muck? While
> my TeX isn't entirely fresh, it's at least something I've done lots of.

Don't worry -- just send me the plain text; I'll do it. I appreciate 
you writing the text in the first place; I'll handle the rest--it won't
take me too long, and probably I'll find things to fix/check on the way.

>> But, in that case, I need to know the
>> copyright and license you want to use. Please see
>> https://www.kernel.org/doc/man-pages/licenses.html
> 
> GPLv2 + DOC (not v2+) sounds good.

I'm a little unclear here. Do you or don't you mean
https://www.kernel.org/doc/man-pages/licenses.html#gpl
?

(Note, I'd really prefer to stick to one of those licenses
(without variants). (My personal preference is the "verbatim"
license, which is the most widely used one.) There's already
so many licenses in man-pages...

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-05-05  7:41                             ` Michael Kerrisk (man-pages)
@ 2014-05-05  7:47                               ` Peter Zijlstra
  2014-05-05  9:53                                 ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-05-05  7:47 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Juri Lelli, Dario Faggioli, Thomas Gleixner, Ingo Molnar,
	rostedt, Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, Paul McKenney,
	insop.song, liming.wang, jkacur, linux-man

[-- Attachment #1: Type: text/plain, Size: 846 bytes --]

On Mon, May 05, 2014 at 09:41:08AM +0200, Michael Kerrisk (man-pages) wrote:
> >> But, in that case, I need to know the
> >> copyright and license you want to use. Please see
> >> https://www.kernel.org/doc/man-pages/licenses.html
> > 
> > GPLv2 + DOC (not v2+) sounds good.
> 
> I'm a little unclear here. Do you or don't you mean
> https://www.kernel.org/doc/man-pages/licenses.html#gpl
> ?

A variant without the +, just like I do my kernel code, no greater gpl
versions. However, 

> (Note, I'd really prefer to stick to one of those licenses
> (without variants). (My personal preference is the "verbatim"
> license, which is the most widely used one.) There's already
> so many licenses in man-pages...

Verbatim is OK with me I suppose. It's only text after all, who cares
about that :-)

/me runs for the hills.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-05-05  7:47                               ` Peter Zijlstra
@ 2014-05-05  9:53                                 ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 71+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-05-05  9:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mtk.manpages, Juri Lelli, Dario Faggioli, Thomas Gleixner,
	Ingo Molnar, rostedt, Oleg Nesterov, fweisbec, darren,
	johan.eker, p.faure, Linux Kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, Paul McKenney, insop.song, liming.wang, jkacur,
	linux-man

On 05/05/2014 09:47 AM, Peter Zijlstra wrote:
> On Mon, May 05, 2014 at 09:41:08AM +0200, Michael Kerrisk (man-pages) wrote:
>>>> But, in that case, I need to know the
>>>> copyright and license you want to use. Please see
>>>> https://www.kernel.org/doc/man-pages/licenses.html
>>>
>>> GPLv2 + DOC (not v2+) sounds good.
>>
>> I'm a little unclear here. Do you or don't you mean
>> https://www.kernel.org/doc/man-pages/licenses.html#gpl
>> ?
> 
> A variant without the +, just like I do my kernel code, no greater gpl
> versions. However, 
> 
>> (Note, I'd really prefer to stick to one of those licenses
>> (without variants). (My personal preference is the "verbatim"
>> license, which is the most widely used one.) There's already
>> do many licenses in in man-pages...
> 
> Verbatim is OK with me I suppose. 

And don't neglect to mention who the copyright is to please.

> It's only text after all, who cares
> about that :-)
> /me runs for the hills.

Well, apparently you care, so thanks ;-)

 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-05-05  7:21                           ` Peter Zijlstra
  2014-05-05  7:41                             ` Michael Kerrisk (man-pages)
@ 2014-05-06  8:16                             ` Peter Zijlstra
  2014-05-09  8:23                               ` Michael Kerrisk (man-pages)
  1 sibling, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-05-06  8:16 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Juri Lelli, Dario Faggioli, Thomas Gleixner, Ingo Molnar,
	rostedt, Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, Paul McKenney,
	insop.song, liming.wang, jkacur, linux-man

[-- Attachment #1: Type: text/plain, Size: 9226 bytes --]

On Mon, May 05, 2014 at 09:21:14AM +0200, Peter Zijlstra wrote:
> On Mon, May 05, 2014 at 08:55:28AM +0200, Michael Kerrisk (man-pages) wrote:
> > Hi Peter,
> > 
> > Looks like a good set of comments from Juri. Could you revise and 
> > resubmit?
> 
> Yeah, I'll try and get it done today, but there's a few icky bugs
> waiting for my attention as well, I'll do me bestest :-)

OK, not quite managed it yesterday, but here goes.

So Verbatim license, for the first part to me and whoever I borrowed
sched_setscheduler() bits from.

For the second part to me and Juri.

---

> [1] A page describing the sched_setattr() and sched_getattr() APIs

NAME
	sched_setattr, sched_getattr - set and get scheduling policy/attributes

SYNOPSIS
	#include <sched.h>

	struct sched_attr {
		u32 size;
		u32 sched_policy;
		u64 sched_flags;

		/* SCHED_NORMAL, SCHED_BATCH */
		s32 sched_nice;

		/* SCHED_FIFO, SCHED_RR */
		u32 sched_priority;

		/* SCHED_DEADLINE */
		u64 sched_runtime;
		u64 sched_deadline;
		u64 sched_period;
	};

	int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags);

	int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size, unsigned int flags);

DESCRIPTION
	sched_setattr() sets both the scheduling policy and the
	associated attributes for the process whose ID is specified in
	pid.

	sched_setattr() replaces sched_setscheduler(), sched_setparam(),
	nice() and some of setpriority().

	If pid equals zero, the scheduling policy and attributes
	of the calling process will be set.  The interpretation of the
	argument attr depends on the selected policy.  Currently, Linux
	supports the following "normal" (i.e., non-real-time) scheduling
	policies:

	SCHED_OTHER	the standard "fair" time-sharing policy;

	SCHED_BATCH	for "batch" style execution of processes; and

	SCHED_IDLE	for running very low priority background jobs.

	The following "real-time" policies are also supported, for
	special time-critical applications that need precise control
	over the way in which runnable processes are selected for
	execution:

	SCHED_FIFO	a static priority first-in, first-out policy;

	SCHED_RR	a static priority round-robin policy; and

	SCHED_DEADLINE	a dynamic priority deadline policy.

	The semantics of each of these policies are detailed in
	sched(7).

	sched_attr::size must be set to the size of the structure, as in
	sizeof(struct sched_attr). If the provided structure is smaller
	than the kernel structure, any additional fields are assumed to
	be '0'. If the provided structure is larger than the kernel
	structure, the kernel verifies that all additional fields are
	'0'; if not, the syscall will fail with -E2BIG.

	sched_attr::sched_policy the desired scheduling policy.

	sched_attr::sched_flags additional flags that can influence
	scheduling behaviour. Currently as per Linux kernel 3.14:

		SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
		to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
		on fork().

	is the only supported flag.

	sched_attr::sched_nice should only be set for SCHED_OTHER,
	SCHED_BATCH, the desired nice value [-20,19], see sched(7).

	sched_attr::sched_priority should only be set for SCHED_FIFO,
	SCHED_RR, the desired static priority [1,99], see sched(7).

	sched_attr::sched_runtime in nanoseconds,
	sched_attr::sched_deadline in nanoseconds,
	sched_attr::sched_period in nanoseconds, should only be set for
	SCHED_DEADLINE and are the traditional sporadic task model
	parameters, see sched(7).

	The flags argument should be 0.

	sched_getattr() queries the scheduling policy currently applied
	to the process identified by pid.

	Similar to sched_setattr(), sched_getattr() replaces
	sched_getscheduler(), sched_getparam() and some of
	getpriority().

	If pid equals zero, the policy of the calling process will be
	retrieved.

	The size argument should reflect the size of struct sched_attr
	as known to userspace. The kernel fills out sched_attr::size to
	the size of its sched_attr structure. If the user provided
	structure is larger, additional fields are not touched. If the
	user provided structure is smaller, but the kernel needs to
	return values outside the provided space, the syscall will fail
	with -E2BIG.

	The flags argument should be 0.

	The other sched_attr fields are filled out as described in
	sched_setattr().

RETURN VALUE
	On success, sched_setattr() and sched_getattr() return 0. On
	error, -1 is returned, and errno is set appropriately.

ERRORS
       EINVAL The scheduling policy is not one of the recognized policies,
              attr is NULL, or attr does not make sense for the selected
              policy.

       EPERM  The calling process does not have appropriate privileges.

       ESRCH  The process whose ID is pid could not be found.

       E2BIG  The provided storage for struct sched_attr is either too
              big, see sched_setattr(), or too small, see sched_getattr().

       EBUSY  SCHED_DEADLINE admission control failure, see sched(7).

NOTES
       While the text above (and in sched_setscheduler(2)) talks about
       processes, in actual fact these system calls are thread specific.

       While the SCHED_DEADLINE parameters are in nanoseconds, current
       kernels truncate the lower 10 bits and we get an effective
       microsecond resolution.
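
       (Taking the above at face value, dropping the lower 10 bits
       rounds a value down to a multiple of 1024 ns, so for example a
       requested runtime of 10000000 ns is effectively

              10000000 & ~1023 = 9999360 ns

       i.e. roughly, but not exactly, 10 ms.)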

> [2] A piece of text describing the SCHED_DEADLINE policy, which I can
> drop into sched(7).

    SCHED_DEADLINE: Sporadic task model deadline scheduling
       SCHED_DEADLINE is currently implemented using GEDF (Global
       Earliest Deadline First) with additional CBS (Constant Bandwidth
       Server).

       A sporadic task is one that has a sequence of jobs, where each
       job is activated at most once per period. Each job also has a
       relative deadline, before which it should finish execution, and a
       computation time, which is the time necessary for executing the
       job without interruption. The instant of time when a task wakes
       up, because a new job has to be executed, is called the arrival
       time (it is also referred to as the request time or release
       time). The start time is the time at which a task starts its
       execution. The absolute deadline is thus obtained by adding the
       relative deadline to the arrival time.

       The following diagram clarifies these terms:

               arrival/wakeup           absolute deadline
               |        start time          |
               v        v                   v
        -------x--------xoooooooooooo-------x--------x-----
                        |<- comp. ->|
               |<---------- rel. deadline ->|
               |<---------- period ----------------->|

       SCHED_DEADLINE allows the user to specify three parameters (see
       sched_setattr(2)): Runtime [ns], Deadline [ns] and Period [ns].
       These parameters do not necessarily have to correspond to the
       aforementioned terms; the usual practice is to set Runtime to
       something bigger than the average computation time (or the
       worst-case execution time for hard real-time tasks), Deadline to
       the relative deadline and Period to the period of the task. With
       such a setting we would have:

               arrival/wakeup           absolute deadline
               |        start time          |
               v        v                   v
        -------x--------xoooooooooooo-------x--------x-----
                        |<- Runtime -->|
               |<---------- Deadline ------>|
               |<---------- Period ----------------->|

       It is checked that: Runtime <= Deadline <= Period.

       The CBS guarantees non-interference between tasks, by throttling
       tasks that attempt to over-run their specified Runtime.

       In general the set of all SCHED_DEADLINE tasks is not
       feasible/schedulable within the given constraints. To guarantee
       some degree of timeliness we must do an admission test on
       setting/changing SCHED_DEADLINE policy/attributes.

       This admission test checks that the task set is
       feasible/schedulable; if this fails, sched_setattr() will return
       -EBUSY.

       For example, it is required (but not necessarily sufficient) for
       the total utilization to be less than or equal to the total
       number of CPUs available, where, since each task can maximally
       run for Runtime per Period, that task's utilization is its
       Runtime/Period.
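
       (Made-up numbers for illustration: two tasks with Runtime/Period
       of 10ms/100ms and 30ms/120ms give

              U = 10/100 + 30/120 = 0.10 + 0.25 = 0.35 <= 1 CPU

       so this necessary condition is met on a single CPU. The kernel's
       actual test may still be stricter; current kernels, for example,
       reserve some bandwidth for non-deadline tasks by default.)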

       Because we must be able to perform this admission calculation,
       SCHED_DEADLINE tasks are the highest-priority (user controllable)
       tasks in the system; if any SCHED_DEADLINE task is runnable it
       will preempt any FIFO/RR/OTHER/BATCH/IDLE task.

       SCHED_DEADLINE tasks will fail fork(2) with -EAGAIN, except when
       the forking task has SCHED_FLAG_RESET_ON_FORK set.
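
       (Illustrative fragment only; the flag value below is the one
       current kernels use and is defined here just in case the libc
       headers lack it:

       #ifndef SCHED_FLAG_RESET_ON_FORK
       #define SCHED_FLAG_RESET_ON_FORK	0x01
       #endif

       /* In the become_deadline_task() sketch earlier in the thread,
          adding

                 attr.sched_flags = SCHED_FLAG_RESET_ON_FORK;

          before the syscall makes children revert to SCHED_OTHER, so a
          later fork(2) in that task is not refused with EAGAIN. */ )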

       A SCHED_DEADLINE task calling sched_yield() will 'yield' the
       current job and wait for a new period to begin.


[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-05-06  8:16                             ` Peter Zijlstra
@ 2014-05-09  8:23                               ` Michael Kerrisk (man-pages)
  2014-05-09  8:53                                 ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-05-09  8:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mtk.manpages, Juri Lelli, Dario Faggioli, Thomas Gleixner,
	Ingo Molnar, rostedt, Oleg Nesterov, fweisbec, darren,
	johan.eker, p.faure, Linux Kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, Paul McKenney, insop.song, liming.wang, jkacur,
	linux-man

Hi Peter,

I'm working on this text. I see the following in kernel/sched/core.c:

[[
static int __sched_setscheduler(struct task_struct *p,
                                const struct sched_attr *attr,
                                bool user)
{
        ...

        int policy = attr->sched_policy;
        ...
        if (policy < 0) {
                reset_on_fork = p->sched_reset_on_fork;
                policy = oldpolicy = p->policy;
]]

What's a negative policy about? Is this something that should 
be documented?

Cheers,

Michael

On 05/06/2014 10:16 AM, Peter Zijlstra wrote:
> On Mon, May 05, 2014 at 09:21:14AM +0200, Peter Zijlstra wrote:
>> On Mon, May 05, 2014 at 08:55:28AM +0200, Michael Kerrisk (man-pages) wrote:
>>> Hi Peter,
>>>
>>> Looks like a good set of comments from Juri. Could you revise and 
>>> resubmit?
>>
>> Yeah, I'll try and get it done today, but there's a few icky bugs
>> waiting for my attention as well, I'll do me bestest :-)
> 
> OK, not quite managed it yesterday, but here goes.
> 
> So Verbatim license, for the first part to me and whoever I borrowed
> sched_setscheduler() bits from.
> 
> For the second part to me and Juri.
> 
> ---
> 
>> [1] A page describing the sched_setattr() and sched_getattr() APIs
> 
> NAME
> 	sched_setattr, sched_getattr - set and get scheduling policy/attributes
> 
> SYNOPSIS
> 	#include <sched.h>
> 
> 	struct sched_attr {
> 		u32 size;
> 		u32 sched_policy;
> 		u64 sched_flags;
> 
> 		/* SCHED_NORMAL, SCHED_BATCH */
> 		s32 sched_nice;
> 
> 		/* SCHED_FIFO, SCHED_RR */
> 		u32 sched_priority;
> 
> 		/* SCHED_DEADLINE */
> 		u64 sched_runtime;
> 		u64 sched_deadline;
> 		u64 sched_period;
> 	};
> 
> 	int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags);
> 
> 	int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size, unsigned int flags);
> 
> DESCRIPTION
> 	sched_setattr() sets both the scheduling policy and the
> 	associated attributes for the process whose ID is specified in
> 	pid.
> 
> 	sched_setattr() replaces sched_setscheduler(), sched_setparam(),
> 	nice() and some of setpriority().
> 
> 	If pid equals zero, the scheduling policy and attributes
> 	of the calling process will be set.  The interpretation of the
> 	argument attr depends on the selected policy.  Currently, Linux
> 	supports the following "normal" (i.e., non-real-time) scheduling
> 	policies:
> 
> 	SCHED_OTHER	the standard "fair" time-sharing policy;
> 
> 	SCHED_BATCH	for "batch" style execution of processes; and
> 
> 	SCHED_IDLE	for running very low priority background jobs.
> 
> 	The following "real-time" policies are also supported, for
> 	special time-critical applications that need precise control
> 	over the way in which runnable processes are selected for
> 	execution:
> 
> 	SCHED_FIFO	a static priority first-in, first-out policy;
> 
> 	SCHED_RR	a static priority round-robin policy; and
> 
> 	SCHED_DEADLINE	a dynamic priority deadline policy.
> 
> 	The semantics of each of these policies are detailed in
> 	sched(7).
> 
> 	sched_attr::size must be set to the size of the structure, as in
> 	sizeof(struct sched_attr). If the provided structure is smaller
> 	than the kernel structure, any additional fields are assumed to
> 	be '0'. If the provided structure is larger than the kernel
> 	structure, the kernel verifies that all additional fields are
> 	'0'; if not, the syscall will fail with -E2BIG.
> 
> 	sched_attr::sched_policy the desired scheduling policy.
> 
> 	sched_attr::sched_flags additional flags that can influence
> 	scheduling behaviour. Currently as per Linux kernel 3.14:
> 
> 		SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
> 		to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
> 		on fork().
> 
> 	is the only supported flag.
> 
> 	sched_attr::sched_nice should only be set for SCHED_OTHER,
> 	SCHED_BATCH, the desired nice value [-20,19], see sched(7).
> 
> 	sched_attr::sched_priority should only be set for SCHED_FIFO,
> 	SCHED_RR, the desired static priority [1,99], see sched(7).
> 
> 	sched_attr::sched_runtime in nanoseconds,
> 	sched_attr::sched_deadline in nanoseconds,
> 	sched_attr::sched_period in nanoseconds, should only be set for
> 	SCHED_DEADLINE and are the traditional sporadic task model
> 	parameters, see sched(7).
> 
> 	The flags argument should be 0.
> 
> 	sched_getattr() queries the scheduling policy currently applied
> 	to the process identified by pid.
> 
> 	Similar to sched_setattr(), sched_getattr() replaces
> 	sched_getscheduler(), sched_getparam() and some of
> 	getpriority().
> 
> 	If pid equals zero, the policy of the calling process will be
> 	retrieved.
> 
> 	The size argument should reflect the size of struct sched_attr
> 	as known to userspace. The kernel fills out sched_attr::size to
> 	the size of its sched_attr structure. If the user provided
> 	structure is larger, additional fields are not touched. If the
> 	user provided structure is smaller, but the kernel needs to
> 	return values outside the provided space, the syscall will fail
> 	with -E2BIG.
> 
> 	The flags argument should be 0.
> 
> 	The other sched_attr fields are filled out as described in
> 	sched_setattr().
> 
> RETURN VALUE
> 	On success, sched_setattr() and sched_getattr() return 0. On
> 	error, -1 is returned, and errno is set appropriately.
> 
> ERRORS
>        EINVAL The scheduling policy is not one of the recognized policies,
>               attr is NULL, or attr does not make sense for the selected
>               policy.
> 
>        EPERM  The calling process does not have appropriate privileges.
> 
>        ESRCH  The process whose ID is pid could not be found.
> 
>        E2BIG  The provided storage for struct sched_attr is either too
>               big, see sched_setattr(), or too small, see sched_getattr().
> 
>        EBUSY  SCHED_DEADLINE admission control failure, see sched(7).
> 
> NOTES
>        While the text above (and in sched_setscheduler(2)) talks about
>        processes, in actual fact these system calls are thread specific.
> 
>        While the SCHED_DEADLINE parameters are in nanoseconds, current
>        kernels truncate the lower 10 bits and we get an effective
>        microsecond resolution.
> 
>> [2] A piece of text describing the SCHED_DEADLINE policy, which I can
>> drop into sched(7).
> 
>     SCHED_DEADLINE: Sporadic task model deadline scheduling
>        SCHED_DEADLINE is currently implemented using GEDF (Global
>        Earliest Deadline First) with additional CBS (Constant Bandwidth
>        Server).
> 
>        A sporadic task is one that has a sequence of jobs, where each
>        job is activated at most once per period. Each job also has a
>        relative deadline, before which it should finish execution, and a
>        computation time, which is the time necessary for executing the
>        job without interruption. The instant of time when a task wakes
>        up, because a new job has to be executed, is called the arrival
>        time (it is also referred to as the request time or release
>        time). The start time is the time at which a task starts its
>        execution. The absolute deadline is thus obtained by adding the
>        relative deadline to the arrival time.
> 
>        The following diagram clarifies these terms:
> 
>                arrival/wakeup           absolute deadline
>                |        start time          |
>                v        v                   v
>         -------x--------xoooooooooooo-------x--------x-----
>                         |<- comp. ->|
>                |<---------- rel. deadline ->|
>                |<---------- period ----------------->|
> 
>        SCHED_DEADLINE allows the user to specify three parameters (see
>        sched_setattr(2)): Runtime [ns], Deadline [ns] and Period [ns].
>        These parameters do not necessarily have to correspond to the
>        aforementioned terms; the usual practice is to set Runtime to
>        something bigger than the average computation time (or the
>        worst-case execution time for hard real-time tasks), Deadline to
>        the relative deadline and Period to the period of the task. With
>        such a setting we would have:
> 
>                arrival/wakeup           absolute deadline
>                |        start time          |
>                v        v                   v
>         -------x--------xoooooooooooo-------x--------x-----
>                         |<- Runtime -->|
>                |<---------- Deadline ------>|
>                |<---------- Period ----------------->|
> 
>        It is checked that: Runtime <= Deadline <= Period.
> 
>        The CBS guarantees non-interference between tasks, by throttling
>        tasks that attempt to over-run their specified Runtime.
> 
>        In general the set of all SCHED_DEADLINE tasks is not
>        feasible/schedulable within the given constraints. To guarantee
>        some degree of timeliness we must do an admission test on
>        setting/changing SCHED_DEADLINE policy/attributes.
> 
>        This admission test checks that the task set is
>        feasible/schedulable; if this fails, sched_setattr() will return
>        -EBUSY.
> 
>        For example, it is required (but not necessarily sufficient) for
>        the total utilization to be less than or equal to the total
>        number of CPUs available, where, since each task can maximally
>        run for Runtime per Period, that task's utilization is its
>        Runtime/Period.
> 
>        Because we must be able to perform this admission calculation,
>        SCHED_DEADLINE tasks are the highest-priority (user controllable)
>        tasks in the system; if any SCHED_DEADLINE task is runnable it
>        will preempt any FIFO/RR/OTHER/BATCH/IDLE task.
> 
>        SCHED_DEADLINE tasks will fail fork(2) with -EAGAIN, except when
>        the forking task has SCHED_FLAG_RESET_ON_FORK set.
> 
>        A SCHED_DEADLINE task calling sched_yield() will 'yield' the
>        current job and wait for a new period to begin.
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-05-09  8:23                               ` Michael Kerrisk (man-pages)
@ 2014-05-09  8:53                                 ` Peter Zijlstra
  2014-05-09  9:26                                   ` Michael Kerrisk (man-pages)
                                                     ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-05-09  8:53 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Juri Lelli, Dario Faggioli, Thomas Gleixner, Ingo Molnar,
	rostedt, Oleg Nesterov, fweisbec, darren, johan.eker, p.faure,
	Linux Kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, Paul McKenney,
	insop.song, liming.wang, jkacur, linux-man

[-- Attachment #1: Type: text/plain, Size: 1875 bytes --]

On Fri, May 09, 2014 at 10:23:22AM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Peter,
> 
> I'm working on this text. I see the following in kernel/sched/core.c:
> 
> [[
> static int __sched_setscheduler(struct task_struct *p,
>                                 const struct sched_attr *attr,
>                                 bool user)
> {
>         ...
> 
>         int policy = attr->sched_policy;
>         ...
>         if (policy < 0) {
>                 reset_on_fork = p->sched_reset_on_fork;
>                 policy = oldpolicy = p->policy;
> ]]
> 
> What's a negative policy about? Is this something that should 
> be documented?

That's for sched_setparam(), which internally passes policy = -1, it
wasn't meant to be user visible, lemme double check that.

sys_sched_setscheduler() -- explicit check for policy < 0
sys_sched_setparam() -- explicitly passes policy=-1, not user visible
sys_sched_setattr() -- hmm, it looks like fail


---
Subject: sched: Disallow sched_attr::sched_policy < 0
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri May  9 10:49:03 CEST 2014

The scheduler uses policy=-1 to preserve the current policy state to
implement sys_sched_setparam(), this got exposed to userspace by
accident through sys_sched_setattr(), cure this.

Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-b4kbwz2qh21xlngdzje00t55@git.kernel.org
---
 kernel/sched/core.c |    3 +++
 1 file changed, 3 insertions(+)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3711,6 +3711,9 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pi
 	if (sched_copy_attr(uattr, &attr))
 		return -EFAULT;
 
+	if (attr.sched_policy < 0)
+		return -EINVAL;
+
 	rcu_read_lock();
 	retval = -ESRCH;
 	p = find_process_by_pid(pid);
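
(A hypothetical fragment, building on the sched_attr sketches earlier in
the thread, to show the userspace-visible effect: before this check a
policy the kernel read as negative silently fell through to "keep the
current policy"; with it the call fails cleanly. It needs <errno.h> in
addition to the earlier includes.

	attr.sched_policy = (uint32_t)-1;	/* read as -1 by the kernel */
	if (syscall(SYS_sched_setattr, 0, &attr, 0) == -1 && errno == EINVAL)
		puts("negative policy rejected");	/* patched kernel */
)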

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: sched_{set,get}attr() manpage
  2014-05-09  8:53                                 ` Peter Zijlstra
@ 2014-05-09  9:26                                   ` Michael Kerrisk (man-pages)
  2014-05-19 13:06                                   ` [tip:sched/core] sched: Disallow sched_attr::sched_policy < 0 tip-bot for Peter Zijlstra
  2014-05-22 12:25                                   ` tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 71+ messages in thread
From: Michael Kerrisk (man-pages) @ 2014-05-09  9:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, Dario Faggioli, Thomas Gleixner, Ingo Molnar,
	Steven Rostedt, Oleg Nesterov, Frédéric Weisbecker,
	Darren Hart, johan.eker, p.faure, Linux Kernel, Claudio Scordino,
	Michael Trimarchi, Fabio Checconi, Tommaso Cucinotta,
	nicola.manica, luca.abeni, Dhaval Giani, hgu1972, Paul McKenney,
	Insop Song, liming.wang, jkacur, linux-man

Hi Peter,

On Fri, May 9, 2014 at 10:53 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, May 09, 2014 at 10:23:22AM +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Peter,
>>
>> I'm working on this text. I see the following in kernel/sched/core.c:
>>
>> [[
>> static int __sched_setscheduler(struct task_struct *p,
>>                                 const struct sched_attr *attr,
>>                                 bool user)
>> {
>>         ...
>>
>>         int policy = attr->sched_policy;
>>         ...
>>         if (policy < 0) {
>>                 reset_on_fork = p->sched_reset_on_fork;
>>                 policy = oldpolicy = p->policy;
>> ]]
>>
>> What's a negative policy about? Is this something that should
>> be documented?
>
> That's for sched_setparam(), which internally passes policy = -1; it
> wasn't meant to be user visible. Lemme double check that.
>
> sys_sched_setscheduler() -- explicit check for policy < 0
> sys_sched_setparam() -- explicitly passes policy=-1, not user visible

(Ahh -- I missed that piece in sys_sched_setparam())

> sys_sched_setattr() -- hmm, it looks like the check is missing there

Yep, I saw that there was no check in sched_setattr().

As I recently said, when it comes to writing a man page, show me a new
interface, and I'll show you a bug ;-).

Thanks for the clarification.

Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>

Cheers,

Michael


> ---
> Subject: sched: Disallow sched_attr::sched_policy < 0
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri May  9 10:49:03 CEST 2014
>
> The scheduler uses policy=-1 to preserve the current policy state to
> implement sys_sched_setparam(), this got exposed to userspace by
> accident through sys_sched_setattr(), cure this.
>
> Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/n/tip-b4kbwz2qh21xlngdzje00t55@git.kernel.org
> ---
>  kernel/sched/core.c |    3 +++
>  1 file changed, 3 insertions(+)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3711,6 +3711,9 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pi
>         if (sched_copy_attr(uattr, &attr))
>                 return -EFAULT;
>
> +       if (attr.sched_policy < 0)
> +               return -EINVAL;
> +
>         rcu_read_lock();
>         retval = -ESRCH;
>         p = find_process_by_pid(pid);



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [tip:sched/core] sched: Disallow sched_attr::sched_policy < 0
  2014-05-09  8:53                                 ` Peter Zijlstra
  2014-05-09  9:26                                   ` Michael Kerrisk (man-pages)
@ 2014-05-19 13:06                                   ` tip-bot for Peter Zijlstra
  2014-05-22 12:25                                   ` tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 71+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-05-19 13:06 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, peterz, tglx, mtk.manpages

Commit-ID:  438367fbfe307f2c26ee8e490f1fb9eacb6b6b02
Gitweb:     http://git.kernel.org/tip/438367fbfe307f2c26ee8e490f1fb9eacb6b6b02
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 9 May 2014 10:49:03 +0200
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Mon, 19 May 2014 21:47:33 +0900

sched: Disallow sched_attr::sched_policy < 0

The scheduler uses policy=-1 to preserve the current policy state to
implement sys_sched_setparam(), this got exposed to userspace by
accident through sys_sched_setattr(), cure this.

Cc: stable@vger.kernel.org
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140509085311.GJ30445@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f2205f0..cdefcf7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3662,6 +3662,9 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
 	if (retval)
 		return retval;
 
+	if (attr.sched_policy < 0)
+		return -EINVAL;
+
 	rcu_read_lock();
 	retval = -ESRCH;
 	p = find_process_by_pid(pid);
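
Once this commit is applied, the rejection can be double-checked from
user space with a sketch like the one below (again assuming the
sched_attr layout from this series and __NR_sched_setattr /
__NR_sched_getattr from the kernel headers; this is a hypothetical test
program, not part of the commit). The bogus call should fail with EINVAL
and sched_getattr() should report the policy unchanged:

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;
        uint64_t sched_deadline;
        uint64_t sched_period;
};

int main(void)
{
        struct sched_attr before = { 0 }, bogus = { 0 }, after = { 0 };

        /* sched_getattr(pid, attr, size, flags); pid 0 == calling thread */
        if (syscall(__NR_sched_getattr, 0, &before, sizeof(before), 0) == -1)
                return perror("sched_getattr"), 1;

        bogus.size = sizeof(bogus);
        bogus.sched_policy = (uint32_t)-1;      /* negative once read as int */

        errno = 0;
        syscall(__NR_sched_setattr, 0, &bogus, 0);
        printf("sched_setattr(policy=-1): %s\n",
               errno == EINVAL ? "EINVAL (rejected)" : strerror(errno));

        if (syscall(__NR_sched_getattr, 0, &after, sizeof(after), 0) == -1)
                return perror("sched_getattr"), 1;

        printf("policy before=%u after=%u\n",
               before.sched_policy, after.sched_policy);
        return 0;
}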

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [tip:sched/core] sched: Disallow sched_attr::sched_policy < 0
  2014-05-09  8:53                                 ` Peter Zijlstra
  2014-05-09  9:26                                   ` Michael Kerrisk (man-pages)
  2014-05-19 13:06                                   ` [tip:sched/core] sched: Disallow sched_attr::sched_policy < 0 tip-bot for Peter Zijlstra
@ 2014-05-22 12:25                                   ` tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 71+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-05-22 12:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, peterz, stable, tglx, mtk.manpages

Commit-ID:  dbdb22754fde671dc93d2fae06f8be113d47f2fb
Gitweb:     http://git.kernel.org/tip/dbdb22754fde671dc93d2fae06f8be113d47f2fb
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 9 May 2014 10:49:03 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 22 May 2014 10:21:26 +0200

sched: Disallow sched_attr::sched_policy < 0

The scheduler uses policy=-1 to preserve the current policy state to
implement sys_sched_setparam(), this got exposed to userspace by
accident through sys_sched_setattr(), cure this.

Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: <stable@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140509085311.GJ30445@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f2205f0..cdefcf7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3662,6 +3662,9 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
 	if (retval)
 		return retval;
 
+	if (attr.sched_policy < 0)
+		return -EINVAL;
+
 	rcu_read_lock();
 	retval = -ESRCH;
 	p = find_process_by_pid(pid);

^ permalink raw reply related	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2014-05-22 12:32 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-17 12:27 [PATCH 00/13] sched, deadline: patches Peter Zijlstra
2013-12-17 12:27 ` [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI Peter Zijlstra
2014-01-21 14:36   ` Michael Kerrisk
2014-01-21 15:38     ` Peter Zijlstra
2014-01-21 15:46       ` Peter Zijlstra
2014-01-21 16:02         ` Steven Rostedt
2014-01-21 16:06           ` Peter Zijlstra
2014-01-21 16:46             ` Juri Lelli
2014-02-14 14:13       ` Michael Kerrisk (man-pages)
2014-02-14 16:19         ` Peter Zijlstra
2014-02-15 12:52           ` Ingo Molnar
2014-02-17 13:20           ` Michael Kerrisk (man-pages)
2014-04-09  9:25             ` sched_{set,get}attr() manpage Peter Zijlstra
2014-04-09 15:19               ` Henrik Austad
2014-04-09 15:42                 ` Peter Zijlstra
2014-04-10  7:47                   ` Juri Lelli
2014-04-10  9:59                     ` Claudio Scordino
2014-04-27 15:47                   ` Michael Kerrisk (man-pages)
2014-04-27 19:34                     ` Peter Zijlstra
2014-04-27 19:45                       ` Steven Rostedt
2014-04-28  7:39                       ` Juri Lelli
2014-04-28  8:18             ` Peter Zijlstra
2014-04-29 13:08               ` Michael Kerrisk (man-pages)
2014-04-29 14:22                 ` Peter Zijlstra
2014-04-29 16:04                 ` Peter Zijlstra
2014-04-30 11:09                   ` Michael Kerrisk (man-pages)
2014-04-30 12:35                     ` Peter Zijlstra
2014-04-30 13:09                     ` Peter Zijlstra
2014-05-03 10:43                       ` Juri Lelli
2014-05-05  6:55                         ` Michael Kerrisk (man-pages)
2014-05-05  7:21                           ` Peter Zijlstra
2014-05-05  7:41                             ` Michael Kerrisk (man-pages)
2014-05-05  7:47                               ` Peter Zijlstra
2014-05-05  9:53                                 ` Michael Kerrisk (man-pages)
2014-05-06  8:16                             ` Peter Zijlstra
2014-05-09  8:23                               ` Michael Kerrisk (man-pages)
2014-05-09  8:53                                 ` Peter Zijlstra
2014-05-09  9:26                                   ` Michael Kerrisk (man-pages)
2014-05-19 13:06                                   ` [tip:sched/core] sched: Disallow sched_attr::sched_policy < 0 tip-bot for Peter Zijlstra
2014-05-22 12:25                                   ` tip-bot for Peter Zijlstra
2014-02-21 20:32           ` [tip:sched/urgent] sched: Add 'flags' argument to sched_{set, get}attr() syscalls tip-bot for Peter Zijlstra
2014-01-26  9:48   ` [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI Geert Uytterhoeven
2013-12-17 12:27 ` [PATCH 02/13] sched: SCHED_DEADLINE structures & implementation Peter Zijlstra
2013-12-17 12:27 ` [PATCH 03/13] sched: SCHED_DEADLINE SMP-related data structures & logic Peter Zijlstra
2013-12-17 12:27 ` [PATCH 04/13] [PATCH 05/13] sched: SCHED_DEADLINE avg_update accounting Peter Zijlstra
2013-12-17 12:27 ` [PATCH 05/13] sched: Add period support for -deadline tasks Peter Zijlstra
2013-12-17 12:27 ` [PATCH 06/13] [PATCH 07/13] sched: Add latency tracing " Peter Zijlstra
2013-12-17 12:27 ` [PATCH 07/13] rtmutex: Turn the plist into an rb-tree Peter Zijlstra
2013-12-17 12:27 ` [PATCH 08/13] sched: Drafted deadline inheritance logic Peter Zijlstra
2013-12-17 12:27 ` [PATCH 09/13] sched: Add bandwidth management for sched_dl Peter Zijlstra
2013-12-18 16:55   ` Peter Zijlstra
2013-12-20 17:13     ` Peter Zijlstra
2013-12-20 17:37       ` Steven Rostedt
2013-12-20 17:42         ` Peter Zijlstra
2013-12-20 18:23           ` Steven Rostedt
2013-12-20 18:26             ` Steven Rostedt
2013-12-20 21:44             ` Peter Zijlstra
2013-12-20 23:29               ` Steven Rostedt
2013-12-21 10:05                 ` Peter Zijlstra
2013-12-21 17:26                   ` Peter Zijlstra
2014-01-13 15:55       ` [tip:sched/core] sched/deadline: Fix hotplug admission control tip-bot for Peter Zijlstra
2013-12-17 12:27 ` [PATCH 10/13] sched: speed up -dl pushes with a push-heap Peter Zijlstra
2013-12-17 12:27 ` [PATCH 11/13] sched: Remove sched_setscheduler2() Peter Zijlstra
2013-12-17 12:27 ` [PATCH 12/13] sched, deadline: Fixup the smp-affinity mask tests Peter Zijlstra
2013-12-17 12:27 ` [PATCH 13/13] sched, deadline: Remove the sysctl_sched_dl knobs Peter Zijlstra
2013-12-17 20:17 ` [PATCH] sched, deadline: Properly initialize def_dl_bandwidth lock Steven Rostedt
2013-12-18 10:01   ` Peter Zijlstra
2013-12-20 13:51 ` [PATCH 00/13] sched, deadline: patches Juri Lelli
2013-12-20 14:28   ` Steven Rostedt
2013-12-20 14:51   ` Peter Zijlstra
2013-12-20 15:19     ` Steven Rostedt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).