LKML Archive on lore.kernel.org
From: Raistlin <raistlin@linux.it>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>, Thomas Gleixner <tglx@linutronix.de>,
	Steven Rostedt <rostedt@goodmis.org>,
	Chris Friesen <cfriesen@nortel.com>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Darren Hart <darren@dvhart.com>, Henrik Austad <henrik@austad.us>,
	Johan Eker <johan.eker@ericsson.com>,
	"p.faure" <p.faure@akatech.ch>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Claudio Scordino <claudio@evidence.eu.com>,
	michael trimarchi <trimarchi@retis.sssup.it>,
	Fabio Checconi <fabio@gandalf.sssup.it>,
	Tommaso Cucinotta <t.cucinotta@sssup.it>,
	Juri Lelli <juri.lelli@gmail.com>,
	Nicola Manica <nicola.manica@gmail.com>,
	Luca Abeni <luca.abeni@unitn.it>
Subject: [RFC][PATCH 10/11] sched: add bandwidth management for sched_dl.
Date: Sun, 28 Feb 2010 20:27:10 +0100
Message-ID: <1267385230.13676.101.camel@Palantir>
In-Reply-To: <1267383976.13676.79.camel@Palantir>



In order for -deadline scheduling to be effective and useful, it is
important to have some method of keeping the allocation of the
available CPU bandwidth to tasks and task groups under control.
This is usually called "admission control" and if it is not performed
at all, no guarantee can be given on the actual scheduling of the
-deadline tasks.

Since RT-throttling was introduced, each task group has had a
bandwidth associated with it, calculated as a certain amount of
runtime over a period. Moreover, to make it possible to manipulate
such bandwidth, readable/writable controls have been added to both
procfs (for system wide settings) and cgroupfs (for per-group
settings).
Therefore, the same interface is being used for controlling the
bandwidth distribution to -deadline tasks and task groups, i.e.,
new controls are added, with similar names, equivalent meaning and
the same usage paradigm.

The main difference between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth of their own
(while -rt ones don't!), and thus we don't need a throttling
mechanism in the groups, which are used for nothing more than
admission control of tasks.
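
As an illustration of what "admission control only" means here, the
following is a minimal user-space sketch, not the code in this patch.
The names bw_pool, bw_admit, bw_release and NR_CPUS_SKETCH are
hypothetical; the sketch only assumes a per-CPU cap and a per-CPU
count of already allocated bandwidth, as described above.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count for the sketch */

struct bw_pool {
	uint64_t cap;				/* per-CPU bandwidth cap */
	uint64_t allocated[NR_CPUS_SKETCH];	/* bandwidth granted per CPU */
};

/* Admit a task on cpu only if its bandwidth still fits under the cap. */
static bool bw_admit(struct bw_pool *pool, int cpu, uint64_t task_bw)
{
	if (pool->allocated[cpu] + task_bw > pool->cap)
		return false;
	pool->allocated[cpu] += task_bw;
	return true;
}

/* Give the bandwidth back when the task stops being a -deadline task. */
static void bw_release(struct bw_pool *pool, int cpu, uint64_t task_bw)
{
	pool->allocated[cpu] -= task_bw;
}
```

Nothing is decremented while tasks run and no timer replenishes
anything: the pool only changes when a task is admitted or leaves.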

This patch:
 - adds system wide deadline bandwidth management by means of:
    * /proc/sys/kernel/sched_dl_runtime_us,
    * /proc/sys/kernel/sched_dl_period_us,
   that determine (i.e., runtime / period) the total bandwidth
   available on each CPU for -deadline tasks and task groups;

 - adds system wide deadline bandwidth accounting by means of:
    * /proc/sys/kernel/sched_dl_total_bw,
   that, after the index of an online CPU has been written to it,
   tells how much of the total available bandwidth of that CPU is
   currently allocated;

 - adds per-group deadline bandwidth management by means of:
    * /cgroup/<group>/cpu.dl_runtime_us,
    * /cgroup/<group>/cpu.dl_period_us,
   (same as above, but per-group);

 - adds per-group deadline bandwidth accounting by means of:
    * /cgroup/<group>/cpu.dl_total_bw,
   (same as above, but per-group);

 - couples the RT and deadline bandwidth management (at system
   level only for now), i.e., the sum of how much bandwidth is
   being devoted to -rt entities and to -deadline tasks and task
   groups must stay below 100%.
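
To make the runtime / period arithmetic and the coupling rule concrete,
here is a small user-space sketch. It mirrors the 20-bit fixed point
scale used by to_ratio() in this patch, but bw_ratio, rt_dl_fits,
BW_SHIFT and BW_UNIT are names made up for the example, and the
RUNTIME_INF case is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define BW_SHIFT 20				/* fixed point scale */
#define BW_UNIT  (UINT64_C(1) << BW_SHIFT)	/* 100% of one CPU */

/* runtime / period as a fraction of BW_UNIT; 0 when period is 0. */
static uint64_t bw_ratio(uint64_t period_us, uint64_t runtime_us)
{
	if (period_us == 0)
		return 0;
	return (runtime_us << BW_SHIFT) / period_us;
}

/* True when the -rt and -deadline shares together fit in one CPU. */
static bool rt_dl_fits(uint64_t rt_bw, uint64_t dl_bw)
{
	return rt_bw + dl_bw <= BW_UNIT;
}
```

With the defaults in this patch (50000us over 1000000us) bw_ratio()
gives 52428, i.e. about 5%. Assuming the usual RT-throttling setting of
950000us over 1000000us, the two shares still fit together, while
raising the -deadline runtime to 60000us would not.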

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/linux/sched.h |   13 +
 init/Kconfig          |   14 +
 kernel/sched.c        | 1003 ++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched_debug.c  |    3 +-
 kernel/sched_dl.c     |   16 +-
 kernel/sysctl.c       |   21 +
 6 files changed, 1054 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0b3a302..66f6872 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1329,6 +1329,7 @@ struct sched_dl_entity {
 	 */
 	u64 dl_runtime;		/* maximum runtime for each instance 	*/
 	u64 dl_deadline;	/* relative deadline of each instance	*/
+	u64 dl_bw;		/* dl_runtime / dl_deadline		*/
 
 	/*
 	 * Actual scheduling parameters. They are initialized with the
@@ -2120,6 +2121,18 @@ int sched_rt_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
 
+extern unsigned int sysctl_sched_dl_total_bw;
+extern unsigned int sysctl_sched_dl_period;
+extern int sysctl_sched_dl_runtime;
+
+int sched_dl_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+
+int sched_dl_total_bw_handler(struct ctl_table *table, int write,
+			void __user *buffer, size_t *lenp,
+			loff_t *ppos);
+
 extern unsigned int sysctl_sched_compat_yield;
 
 #ifdef CONFIG_RT_MUTEXES
diff --git a/init/Kconfig b/init/Kconfig
index 1510e17..de57415 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -473,6 +473,20 @@ config RT_GROUP_SCHED
 	  realtime bandwidth for them.
 	  See Documentation/scheduler/sched-rt-group.txt for more information.
 
+config DEADLINE_GROUP_SCHED
+	bool "Group scheduling for SCHED_DEADLINE"
+	depends on EXPERIMENTAL
+	depends on GROUP_SCHED
+	depends on CGROUPS
+	depends on !USER_SCHED
+	default n
+	help
+	  This feature lets you explicitly specify, in terms of runtime
+	  and period, the bandwidth of each task control group. This means
+	  tasks (and other groups) can be added to it only up to such
+	  "bandwidth cap", which might be useful for avoiding or
+	  controlling oversubscription.
+
 choice
 	depends on GROUP_SCHED
 	prompt "Basis for grouping tasks"
diff --git a/kernel/sched.c b/kernel/sched.c
index 87782a3..ec458ff 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -239,6 +239,97 @@ static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
 }
 #endif
 
+static unsigned long to_ratio(u64 period, u64 runtime)
+{
+	if (runtime == RUNTIME_INF)
+		return 1ULL << 20;
+
+	/*
+	 * Doing this here saves a lot of checks in all
+	 * the calling paths, and returning zero seems
+	 * safe for them anyway.
+	 */
+	if (period == 0)
+		return 0;
+
+	return div64_u64(runtime << 20, period);
+}
+
+/*
+ * To keep the bandwidth of -deadline tasks and groups under control
+ * we need some place to:
+ *  - store the maximum -deadline bandwidth of the system (the group);
+ *  - cache the fraction of that bandwidth that is currently allocated.
+ *
+ * This is all done in the data structure below. It is similar to the
+ * one used for RT-throttling (rt_bandwidth), with the main difference
+ * that, since here we are only interested in admission control, we
+ * do not decrease any runtime while the group "executes", nor do we
+ * need a timer to replenish it.
+ *
+ * With respect to SMP, the bandwidth is given on a per-CPU basis,
+ * meaning that:
+ *  - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
+ *  - dl_total_bw array contains, in the i-th element, the currently
+ *    allocated bandwidth on the i-th CPU.
+ * Moreover, groups consume bandwidth on each CPU, while tasks only
+ * consume bandwidth on the CPU they're running on.
+ * Finally, dl_total_bw_cpu is used to cache the index of dl_total_bw
+ * that will be shown the next time the proc or cgroup controls are
+ * read. It can in turn be changed by writing to its own
+ * control.
+ */
+struct dl_bandwidth {
+	raw_spinlock_t dl_runtime_lock;
+
+	/* dl_bw = dl_runtime / dl_period */
+	u64 dl_runtime;
+	u64 dl_period;
+	u64 dl_bw;
+
+	/* dl_total_bw[cpu] < dl_bw (for each cpu) */
+	int dl_total_bw_cpu;
+	u64 *dl_total_bw;
+};
+
+static struct dl_bandwidth def_dl_bandwidth;
+
+static
+int init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime)
+{
+	raw_spin_lock_init(&dl_b->dl_runtime_lock);
+	dl_b->dl_period = period;
+	dl_b->dl_runtime = runtime;
+	dl_b->dl_bw = to_ratio(period, runtime);
+
+	dl_b->dl_total_bw_cpu = 0;
+	dl_b->dl_total_bw = kzalloc(sizeof(u64) * nr_cpu_ids, GFP_KERNEL);
+	if (!dl_b->dl_total_bw)
+		return 0;
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	if (dl_b == &def_dl_bandwidth) {
+		int i;
+
+		for_each_possible_cpu(i)
+			dl_b->dl_total_bw[i] = sysctl_sched_dl_total_bw;
+	}
+#endif
+
+	return 1;
+}
+
+static inline int dl_bandwidth_enabled(void)
+{
+	return sysctl_sched_dl_runtime >= 0;
+}
+
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+static void destroy_dl_bandwidth(struct dl_bandwidth *dl_b)
+{
+	kfree(dl_b->dl_total_bw);
+}
+#endif
+
 /*
  * sched_domains_mutex serializes calls to arch_init_sched_domains,
  * detach_destroy_domains and partition_sched_domains.
@@ -278,6 +369,12 @@ struct task_group {
 	struct rt_bandwidth rt_bandwidth;
 #endif
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	struct dl_rq **dl_rq;
+
+	struct dl_bandwidth dl_bandwidth;
+#endif
+
 	struct rcu_head rcu;
 	struct list_head list;
 
@@ -312,6 +409,7 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct cfs_rq, init_tg_cfs_rq);
 static DEFINE_PER_CPU(struct sched_rt_entity, init_sched_rt_entity);
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct rt_rq, init_rt_rq_var);
 #endif /* CONFIG_RT_GROUP_SCHED */
+
 #else /* !CONFIG_USER_SCHED */
 #define root_task_group init_task_group
 #endif /* CONFIG_USER_SCHED */
@@ -500,8 +598,30 @@ struct dl_rq {
 	struct rb_node *rb_leftmost;
 
 	unsigned long dl_nr_running;
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	struct rq *rq;
+#endif
 };
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+static inline struct dl_bandwidth *task_dl_bandwidth(struct task_struct *p)
+{
+	return &task_group(p)->dl_bandwidth;
+}
+
+static inline struct dl_bandwidth *parent_dl_bandwidth(struct task_group *tg)
+{
+	if (tg->parent)
+		return &tg->parent->dl_bandwidth;
+	return &def_dl_bandwidth;
+}
+#else
+static inline struct dl_bandwidth *task_dl_bandwidth(struct task_struct *p)
+{
+	return &def_dl_bandwidth;
+}
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
+
 #ifdef CONFIG_SMP
 
 /*
@@ -879,6 +999,33 @@ static inline u64 global_rt_runtime(void)
 	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
 }
 
+/*
+ * Maximum bandwidth available for all -deadline tasks and groups
+ * (if group scheduling is configured) on each CPU.
+ *
+ * default: 5%
+ */
+unsigned int sysctl_sched_dl_period = 1000000;
+int sysctl_sched_dl_runtime = 50000;
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+unsigned int sysctl_sched_dl_total_bw = 52428;	/* 50000<<20 / 1000000 */
+#else
+unsigned int sysctl_sched_dl_total_bw = 0;
+#endif
+
+static inline u64 global_dl_period(void)
+{
+	return (u64)sysctl_sched_dl_period * NSEC_PER_USEC;
+}
+
+static inline u64 global_dl_runtime(void)
+{
+	if (sysctl_sched_dl_runtime < 0)
+		return RUNTIME_INF;
+
+	return (u64)sysctl_sched_dl_runtime * NSEC_PER_USEC;
+}
+
 #ifndef prepare_arch_switch
 # define prepare_arch_switch(next)	do { } while (0)
 #endif
@@ -2063,6 +2210,30 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
+/*
+ * When dealing with a -deadline task, we have to check if moving it to
+ * a new CPU is possible or not. In fact, this is possible only if there
+ * is enough bandwidth available on such CPU, otherwise we want the
+ * whole migration procedure to fail.
+ */
+static inline
+bool __set_task_cpu_dl(struct task_struct *p, unsigned int cpu)
+{
+	struct dl_bandwidth *dl_b = task_dl_bandwidth(p);
+
+	raw_spin_lock(&dl_b->dl_runtime_lock);
+	if (dl_b->dl_bw < dl_b->dl_total_bw[cpu] + p->dl.dl_bw) {
+		raw_spin_unlock(&dl_b->dl_runtime_lock);
+
+		return 0;
+	}
+	dl_b->dl_total_bw[task_cpu(p)] -= p->dl.dl_bw;
+	dl_b->dl_total_bw[cpu] += p->dl.dl_bw;
+	raw_spin_unlock(&dl_b->dl_runtime_lock);
+
+	return 1;
+}
+
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 {
 #ifdef CONFIG_SCHED_DEBUG
@@ -2077,6 +2248,9 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	trace_sched_migrate_task(p, new_cpu);
 
 	if (task_cpu(p) != new_cpu) {
+		if (task_has_dl_policy(p) && !__set_task_cpu_dl(p, new_cpu))
+			return;
+
 		p->se.nr_migrations++;
 		perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 1, NULL, 0);
 	}
@@ -2672,6 +2846,91 @@ void sched_fork(struct task_struct *p, int clone_flags)
 	put_cpu();
 }
 
+static inline
+void __dl_clear_task_bw(struct dl_bandwidth *dl_b, int cpu, u64 tsk_bw)
+{
+	dl_b->dl_total_bw[cpu] -= tsk_bw;
+}
+
+static inline
+void __dl_add_task_bw(struct dl_bandwidth *dl_b, int cpu, u64 tsk_bw)
+{
+	dl_b->dl_total_bw[cpu] += tsk_bw;
+}
+
+static inline
+bool __dl_check_new_task(struct dl_bandwidth *dl_b, int cpu, u64 tsk_bw)
+{
+	return dl_b->dl_runtime == RUNTIME_INF ||
+	       dl_b->dl_bw >= dl_b->dl_total_bw[cpu] + tsk_bw;
+}
+
+static inline
+bool __dl_check_chg_task(struct dl_bandwidth *dl_b,
+			 int cpu, u64 old_tsk_bw, u64 new_tsk_bw)
+{
+	return dl_b->dl_runtime == RUNTIME_INF ||
+	       dl_b->dl_bw >= dl_b->dl_total_bw[cpu] - old_tsk_bw + new_tsk_bw;
+}
+
+/*
+ * We must be sure that accepting a new task (or allowing changing the
+ * parameters of an existing one) is consistent with the bandwidth
+ * constraints. If yes, this function also updates the currently
+ * allocated bandwidth accordingly, to reflect the new situation.
+ *
+ * This function is called while holding p's rq->lock.
+ */
+static int dl_check_task_bw(struct task_struct *p, int policy,
+			    struct sched_param_ex *param_ex)
+{
+	struct dl_bandwidth *dl_b = task_dl_bandwidth(p);
+	int cpu = task_cpu(p), err = -EBUSY;
+	u64 new_tsk_bw;
+
+	raw_spin_lock(&dl_b->dl_runtime_lock);
+
+	/*
+	 * It is forbidden to create a task inside a container
+	 * that has no bandwidth.
+	 */
+	if (dl_b->dl_runtime != RUNTIME_INF && !dl_b->dl_bw) {
+		err = -EPERM;
+		goto unlock;
+	}
+
+	new_tsk_bw = to_ratio(timespec_to_ns(&param_ex->sched_deadline),
+			      timespec_to_ns(&param_ex->sched_runtime));
+	if (new_tsk_bw == p->dl.dl_bw) {
+		err = 0;
+		goto unlock;
+	}
+
+	/*
+	 * Whether a task enters, leaves, or stays -deadline but changes
+	 * its parameters, we may need to update the total allocated
+	 * bandwidth of the container accordingly.
+	 */
+	if (dl_policy(policy) && !task_has_dl_policy(p) &&
+	    __dl_check_new_task(dl_b, cpu, new_tsk_bw)) {
+		__dl_add_task_bw(dl_b, cpu, new_tsk_bw);
+		err = 0;
+	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
+		   __dl_check_chg_task(dl_b, cpu, p->dl.dl_bw, new_tsk_bw)) {
+		__dl_clear_task_bw(dl_b, cpu, p->dl.dl_bw);
+		__dl_add_task_bw(dl_b, cpu, new_tsk_bw);
+		err = 0;
+	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
+		__dl_clear_task_bw(dl_b, cpu, p->dl.dl_bw);
+		err = 0;
+	}
+unlock:
+	raw_spin_unlock(&dl_b->dl_runtime_lock);
+
+	return err;
+}
+
+
 /*
  * wake_up_new_task - wake up a newly created task for the first time.
  *
@@ -6111,10 +6370,14 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 		 * relative deadline are both set to the relative deadline
 		 * being inherited, which means the maximum boosting we are
 		 * able to provide as of now!
+		 *
+		 * Notice that it also does not count in admission control,
+		 * since its bandwidth is set to 0.
 		 */
 		if (!task_has_dl_policy(p)) {
 			p->dl.dl_runtime = dl_se_prio_to_deadline(prio);
 			p->dl.dl_deadline = dl_se_prio_to_deadline(prio);
+			p->dl.dl_bw = 0;
 			p->dl.flags = DL_NEW;
 		}
 		p->sched_class = &dl_sched_class;
@@ -6327,6 +6590,7 @@ __setparam_dl(struct task_struct *p, struct sched_param_ex *param_ex)
 
 	dl_se->dl_runtime = timespec_to_ns(&param_ex->sched_runtime);
 	dl_se->dl_deadline = timespec_to_ns(&param_ex->sched_deadline);
+	dl_se->dl_bw = to_ratio(dl_se->dl_deadline, dl_se->dl_runtime);
 	dl_se->flags = param_ex->sched_flags;
 	dl_se->flags &= ~DL_THROTTLED;
 	dl_se->flags |= DL_NEW;
@@ -6488,6 +6752,15 @@ recheck:
 			return -EPERM;
 #endif
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+		/*
+		 * And the same for -deadline tasks.
+		 */
+		if (dl_bandwidth_enabled() && dl_policy(policy) &&
+				task_group(p)->dl_bandwidth.dl_runtime == 0)
+			return -EPERM;
+#endif
+
 		retval = security_task_setscheduler(p, policy, param);
 		if (retval)
 			return retval;
@@ -6510,6 +6783,22 @@ recheck:
 		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 		goto recheck;
 	}
+	/*
+	 * If setscheduling to SCHED_DEADLINE (or changing the parameters
+	 * of a SCHED_DEADLINE task) we need to check if enough bandwidth
+	 * is available.
+	 */
+	if (dl_policy(policy) || dl_task(p)) {
+		int err;
+
+		err = dl_check_task_bw(p, policy, param_ex);
+		if (err) {
+			__task_rq_unlock(rq);
+			raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+			return err;
+		}
+	}
+
 	update_rq_clock(rq);
 	on_rq = p->se.on_rq;
 	running = task_current(rq, p);
@@ -7476,6 +7765,8 @@ again:
 		put_task_struct(mt);
 		wait_for_completion(&req.done);
 		tlb_migrate_finish(p->mm);
+		if (task_cpu(p) != req.dest_cpu)
+			return -EAGAIN;
 		return 0;
 	}
 out:
@@ -7522,8 +7813,8 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
 	if (p->se.on_rq) {
 		deactivate_task(rq_src, p, 0);
 		set_task_cpu(p, dest_cpu);
-		activate_task(rq_dest, p, 0);
-		check_preempt_curr(rq_dest, p, 0);
+		activate_task(task_rq(p), p, 0);
+		check_preempt_curr(task_rq(p), p, 0);
 	}
 done:
 	ret = 1;
@@ -9694,6 +9985,10 @@ static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
 static void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
 {
 	dl_rq->rb_root = RB_ROOT;
+
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	dl_rq->rq = rq;
+#endif
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -9755,6 +10050,15 @@ static void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
 }
 #endif
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+void init_tg_dl_entry(struct task_group *tg, struct dl_rq *dl_rq,
+		struct sched_dl_entity *dl_se, int cpu, int add,
+		struct sched_dl_entity *parent)
+{
+	tg->dl_rq[cpu] = dl_rq;
+}
+#endif
+
 void __init sched_init(void)
 {
 	int i, j;
@@ -9766,6 +10070,9 @@ void __init sched_init(void)
 #ifdef CONFIG_RT_GROUP_SCHED
 	alloc_size += 2 * nr_cpu_ids * sizeof(void **);
 #endif
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	alloc_size += nr_cpu_ids * sizeof(void **);
+#endif
 #ifdef CONFIG_USER_SCHED
 	alloc_size *= 2;
 #endif
@@ -9805,6 +10112,10 @@ void __init sched_init(void)
 		ptr += nr_cpu_ids * sizeof(void **);
 #endif /* CONFIG_USER_SCHED */
 #endif /* CONFIG_RT_GROUP_SCHED */
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+		init_task_group.dl_rq = (struct dl_rq **)ptr;
+		ptr += nr_cpu_ids * sizeof(void **);
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
 #ifdef CONFIG_CPUMASK_OFFSTACK
 		for_each_possible_cpu(i) {
 			per_cpu(load_balance_tmpmask, i) = (void *)ptr;
@@ -9819,6 +10130,8 @@ void __init sched_init(void)
 
 	init_rt_bandwidth(&def_rt_bandwidth,
 			global_rt_period(), global_rt_runtime());
+	init_dl_bandwidth(&def_dl_bandwidth,
+			global_dl_period(), global_dl_runtime());
 
 #ifdef CONFIG_RT_GROUP_SCHED
 	init_rt_bandwidth(&init_task_group.rt_bandwidth,
@@ -9829,6 +10142,11 @@ void __init sched_init(void)
 #endif /* CONFIG_USER_SCHED */
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	init_dl_bandwidth(&init_task_group.dl_bandwidth,
+			global_dl_period(), global_dl_runtime());
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
+
 #ifdef CONFIG_GROUP_SCHED
 	list_add(&init_task_group.list, &task_groups);
 	INIT_LIST_HEAD(&init_task_group.children);
@@ -9915,6 +10233,10 @@ void __init sched_init(void)
 #endif
 #endif
 
+#if defined CONFIG_DEADLINE_GROUP_SCHED && defined CONFIG_CGROUP_SCHED
+		init_tg_dl_entry(&init_task_group, &rq->dl, NULL, i, 1, NULL);
+#endif
+
 		for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
 			rq->cpu_load[j] = 0;
 #ifdef CONFIG_SMP
@@ -10307,11 +10629,89 @@ static inline void unregister_rt_sched_group(struct task_group *tg, int cpu)
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+static void free_dl_sched_group(struct task_group *tg)
+{
+	destroy_dl_bandwidth(&tg->dl_bandwidth);
+
+	kfree(tg->dl_rq);
+}
+
+int alloc_dl_sched_group(struct task_group *tg, struct task_group *parent)
+{
+	struct rq *rq;
+	int i;
+
+	tg->dl_rq = kzalloc(sizeof(struct dl_rq *) * nr_cpu_ids, GFP_KERNEL);
+	if (!tg->dl_rq)
+		return 0;
+
+	if (!init_dl_bandwidth(&tg->dl_bandwidth, global_dl_period(), 0))
+		return 0;
+
+	for_each_possible_cpu(i) {
+		rq = cpu_rq(i);
+		init_tg_dl_entry(tg, &rq->dl, NULL, i, 0, NULL);
+	}
+
+	return 1;
+}
+
+int sched_dl_can_attach(struct cgroup *cgrp, struct task_struct *tsk)
+{
+	struct task_group *tg = container_of(cgroup_subsys_state(cgrp,
+					     cpu_cgroup_subsys_id),
+					     struct task_group, css);
+	unsigned long flags;
+	struct rq *rq = task_rq_lock(tsk, &flags);
+	int ret = 1;
+
+	if (!dl_task(tsk))
+		goto unlock_rq;
+
+	raw_spin_lock(&tg->dl_bandwidth.dl_runtime_lock);
+	if (tg->dl_bandwidth.dl_runtime == RUNTIME_INF)
+		goto unlock;
+	/*
+	 * Check if the group has enough bandwidth available.
+	 */
+	if (tg->dl_bandwidth.dl_bw <
+	    tg->dl_bandwidth.dl_total_bw[task_cpu(tsk)] + tsk->dl.dl_bw)
+		ret = 0;
+unlock:
+	raw_spin_unlock(&tg->dl_bandwidth.dl_runtime_lock);
+unlock_rq:
+	task_rq_unlock(rq, &flags);
+
+	return ret;
+}
+#else /* !CONFIG_DEADLINE_GROUP_SCHED */
+static inline void free_dl_sched_group(struct task_group *tg)
+{
+}
+
+static inline
+int alloc_dl_sched_group(struct task_group *tg, struct task_group *parent)
+{
+	return 1;
+}
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
+static inline
+void register_dl_sched_group(struct task_group *tg, int cpu)
+{
+}
+
+static inline
+void unregister_dl_sched_group(struct task_group *tg, int cpu)
+{
+}
+
 #ifdef CONFIG_GROUP_SCHED
 static void free_sched_group(struct task_group *tg)
 {
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
+	free_dl_sched_group(tg);
 	kfree(tg);
 }
 
@@ -10332,10 +10732,14 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+	if (!alloc_dl_sched_group(tg, parent))
+		goto err;
+
 	spin_lock_irqsave(&task_group_lock, flags);
 	for_each_possible_cpu(i) {
 		register_fair_sched_group(tg, i);
 		register_rt_sched_group(tg, i);
+		register_dl_sched_group(tg, i);
 	}
 	list_add_rcu(&tg->list, &task_groups);
 
@@ -10365,11 +10769,26 @@ void sched_destroy_group(struct task_group *tg)
 {
 	unsigned long flags;
 	int i;
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	struct task_group *parent = tg->parent;
+
+	spin_lock_irqsave(&task_group_lock, flags);
 
+	/*
+	 * If a deadline group goes away, its bandwidth must be
+	 * freed in its parent.
+	 */
+	raw_spin_lock(&parent->dl_bandwidth.dl_runtime_lock);
+	for_each_possible_cpu(i)
+		parent->dl_bandwidth.dl_total_bw[i] -= tg->dl_bandwidth.dl_bw;
+	raw_spin_unlock(&parent->dl_bandwidth.dl_runtime_lock);
+#else
 	spin_lock_irqsave(&task_group_lock, flags);
+#endif
 	for_each_possible_cpu(i) {
 		unregister_fair_sched_group(tg, i);
 		unregister_rt_sched_group(tg, i);
+		unregister_dl_sched_group(tg, i);
 	}
 	list_del_rcu(&tg->list);
 	list_del_rcu(&tg->siblings);
@@ -10516,14 +10935,6 @@ unsigned long sched_group_shares(struct task_group *tg)
  */
 static DEFINE_MUTEX(rt_constraints_mutex);
 
-static unsigned long to_ratio(u64 period, u64 runtime)
-{
-	if (runtime == RUNTIME_INF)
-		return 1ULL << 20;
-
-	return div64_u64(runtime << 20, period);
-}
-
 /* Must be called with tasklist_lock held */
 static inline int tg_has_rt_tasks(struct task_group *tg)
 {
@@ -10617,7 +11028,7 @@ static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
 	return walk_tg_tree(tg_schedulable, tg_nop, &data);
 }
 
-static int tg_set_bandwidth(struct task_group *tg,
+static int tg_set_rt_bandwidth(struct task_group *tg,
 		u64 rt_period, u64 rt_runtime)
 {
 	int i, err = 0;
@@ -10656,7 +11067,7 @@ int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
 	if (rt_runtime_us < 0)
 		rt_runtime = RUNTIME_INF;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_runtime(struct task_group *tg)
@@ -10681,7 +11092,7 @@ int sched_group_set_rt_period(struct task_group *tg, long rt_period_us)
 	if (rt_period == 0)
 		return -EINVAL;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_period(struct task_group *tg)
@@ -10692,10 +11103,44 @@ long sched_group_rt_period(struct task_group *tg)
 	do_div(rt_period_us, NSEC_PER_USEC);
 	return rt_period_us;
 }
+#endif /* CONFIG_RT_GROUP_SCHED */
+
+/*
+ * Coupling of -rt and -dl bandwidth.
+ *
+ * Here we check, while setting the system wide bandwidth available
+ * for all -rt entities, if the new values are consistent with the
+ * system settings for the bandwidth available to -dl tasks and groups.
+ *
+ * IOW, we want to enforce that
+ *
+ *   rt_bandwidth + dl_bandwidth <= 100%
+ *
+ * is always true.
+ */
+static bool __sched_rt_dl_global_constraints(u64 global_rt_bw)
+{
+	unsigned long flags;
+	u64 global_dl_bw;
+	bool ret;
+
+	raw_spin_lock_irqsave(&def_dl_bandwidth.dl_runtime_lock, flags);
+
+	global_dl_bw = to_ratio(def_dl_bandwidth.dl_period,
+				def_dl_bandwidth.dl_runtime);
+
+	ret = global_rt_bw + global_dl_bw <=
+		to_ratio(RUNTIME_INF, RUNTIME_INF);
 
+	raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock, flags);
+
+	return ret;
+}
+
+#ifdef CONFIG_RT_GROUP_SCHED
 static int sched_rt_global_constraints(void)
 {
-	u64 runtime, period;
+	u64 runtime, period, global_bw;
 	int ret = 0;
 
 	if (sysctl_sched_rt_period <= 0)
@@ -10710,6 +11155,10 @@ static int sched_rt_global_constraints(void)
 	if (runtime > period && runtime != RUNTIME_INF)
 		return -EINVAL;
 
+	global_bw = to_ratio(period, runtime);
+	if (!__sched_rt_dl_global_constraints(global_bw))
+		return -EINVAL;
+
 	mutex_lock(&rt_constraints_mutex);
 	read_lock(&tasklist_lock);
 	ret = __rt_schedulable(NULL, 0, 0);
@@ -10732,6 +11181,7 @@ int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
 static int sched_rt_global_constraints(void)
 {
 	unsigned long flags;
+	u64 global_bw;
 	int i;
 
 	if (sysctl_sched_rt_period <= 0)
@@ -10744,6 +11194,10 @@ static int sched_rt_global_constraints(void)
 	if (sysctl_sched_rt_runtime == 0)
 		return -EBUSY;
 
+	global_bw = to_ratio(global_rt_period(), global_rt_runtime());
+	if (!__sched_rt_dl_global_constraints(global_bw))
+		return -EINVAL;
+
 	raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags);
 	for_each_possible_cpu(i) {
 		struct rt_rq *rt_rq = &cpu_rq(i)->rt;
@@ -10758,6 +11212,359 @@ static int sched_rt_global_constraints(void)
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+/* Must be called with tasklist_lock held */
+static inline int tg_has_dl_tasks(struct task_group *tg)
+{
+	struct task_struct *g, *p;
+
+	do_each_thread(g, p) {
+		if (task_has_dl_policy(p) && task_group(p) == tg)
+			return 1;
+	} while_each_thread(g, p);
+
+	return 0;
+}
+
+/*
+ * If we were RUNTIME_INF and we want to constrain our own bandwidth
+ * we must be sure that none of our children is RUNTIME_INF.
+ *
+ * Something different (i.e., letting children stay RUNTIME_INF even
+ * if we are constrained) could have been done if a different architecture
+ * were chosen for -deadline group scheduling (similar to -rt throttling).
+ */
+static int __check_children_bandwidth(struct task_group *tg, u64 dl_runtime)
+{
+	struct task_group *child;
+
+	/*
+	 * Either we were not RUNTIME_INF or we are going to
+	 * become RUNTIME_INF, so no more checking on our children
+	 * is needed.
+	 */
+	if (tg->dl_bandwidth.dl_runtime != RUNTIME_INF ||
+	    dl_runtime == RUNTIME_INF)
+		return 1;
+
+	list_for_each_entry_rcu(child, &tg->children, siblings) {
+		raw_spin_lock(&child->dl_bandwidth.dl_runtime_lock);
+		if (child->dl_bandwidth.dl_runtime == RUNTIME_INF) {
+			raw_spin_unlock(&child->dl_bandwidth.dl_runtime_lock);
+			return 0;
+		}
+		raw_spin_unlock(&child->dl_bandwidth.dl_runtime_lock);
+	}
+
+	return 1;
+}
+
+/*
+ * If we want to decrease our own bandwidth from old_tg_bw to
+ * new_tg_bw we must be sure that none of our runqueues has more
+ * allocated bandwidth than new_tg_bw.
+ *
+ * This is called holding _both_ tg's and tg's parent's bandwidth
+ * parameters locks (dl_bandwidth.dl_runtime_lock).
+ */
+static
+int __check_tg_bandwidth(struct task_group *tg, u64 new_tg_bw)
+{
+	int i;
+
+	if (new_tg_bw < tg->dl_bandwidth.dl_bw) {
+		for_each_possible_cpu(i) {
+			if (new_tg_bw < tg->dl_bandwidth.dl_total_bw[i])
+				return 0;
+		}
+	}
+
+	return 1;
+}
+
+/*
+ * Here we check if the new bandwidth parameters of the cgroup would
+ * still lead to a schedulable system.
+ *
+ * This is called holding _both_ tg's and tg's parent's bandwidth
+ * parameters locks (dl_bandwidth.dl_runtime_lock).
+ */
+static
+int __deadline_schedulable(struct task_group *tg,
+			   struct dl_bandwidth *parent_dl_b,
+			   u64 dl_runtime, u64 new_tg_bw)
+{
+	int i;
+
+	/*
+	 * RUNTIME_INF is allowed only if our parent is
+	 * RUNTIME_INF as well (see the comment to the
+	 * above function).
+	 */
+	if (parent_dl_b->dl_runtime == RUNTIME_INF)
+		return 0;
+
+	if (dl_runtime == RUNTIME_INF)
+		return -EINVAL;
+
+	if (new_tg_bw > parent_dl_b->dl_bw ||
+	    !__check_tg_bandwidth(tg, new_tg_bw) ||
+	    !__check_children_bandwidth(tg, dl_runtime))
+		return -EBUSY;
+
+	/*
+	 * The root group has no parent, but its assigned bandwidth has
+	 * to stay below the global bandwidth value given by
+	 * sysctl_sched_dl_runtime / sysctl_sched_dl_period.
+	 *
+	 * For other groups, what is required is that the sum of the bandwidths
+	 * of all the children of a group does not exceed the bandwidth of
+	 * such group.
+	 */
+	for_each_possible_cpu(i) {
+		if (parent_dl_b->dl_bw < parent_dl_b->dl_total_bw[i] -
+		    tg->dl_bandwidth.dl_bw + new_tg_bw)
+			return -EBUSY;
+	}
+
+	return 0;
+}
+
+/*
+ * This checks if the new parameters of the task group are consistent
+ * and, if yes, updates the allocated bandwidth in the higher level
+ * entity (which could be the parent cgroup or the system default).
+ */
+static int tg_set_dl_bandwidth(struct task_group *tg,
+			       u64 dl_period, u64 dl_runtime)
+{
+	struct dl_bandwidth *dl_b = &tg->dl_bandwidth,
+			*parent_dl_b = parent_dl_bandwidth(tg);
+	u64 new_tg_bw;
+	int i, err = 0;
+
+	if (dl_runtime != RUNTIME_INF && dl_runtime > dl_period)
+		return -EINVAL;
+
+	read_lock(&tasklist_lock);
+
+	if (!dl_runtime && tg_has_dl_tasks(tg)) {
+		err = -EBUSY;
+		goto runlock;
+	}
+
+	raw_spin_lock_irq(&parent_dl_b->dl_runtime_lock);
+	raw_spin_lock(&dl_b->dl_runtime_lock);
+
+	/*
+	 * Calculate the old and new bandwidth for the group and,
+	 * if different, check if the new value is consistent.
+	 */
+	new_tg_bw = to_ratio(dl_period, dl_runtime);
+	if (new_tg_bw != dl_b->dl_bw) {
+		err = __deadline_schedulable(tg, parent_dl_b,
+					     dl_runtime, new_tg_bw);
+		if (err)
+			goto unlock;
+	}
+
+	/*
+	 * If here, we can now update tg's bandwidth and tg's
+	 * parent allocated bandwidth value (on each CPU).
+	 */
+	for_each_possible_cpu(i)
+		parent_dl_b->dl_total_bw[i] += new_tg_bw - dl_b->dl_bw;
+	dl_b->dl_bw = new_tg_bw;
+	dl_b->dl_period = dl_period;
+	dl_b->dl_runtime = dl_runtime;
+
+unlock:
+	raw_spin_unlock(&dl_b->dl_runtime_lock);
+	raw_spin_unlock_irq(&parent_dl_b->dl_runtime_lock);
+runlock:
+	read_unlock(&tasklist_lock);
+
+	return err;
+}
+
+int sched_group_set_dl_runtime(struct task_group *tg, long dl_runtime_us)
+{
+	u64 dl_runtime, dl_period;
+
+	dl_period = tg->dl_bandwidth.dl_period;
+	dl_runtime = (u64)dl_runtime_us * NSEC_PER_USEC;
+	if (dl_runtime_us < 0)
+		dl_runtime = RUNTIME_INF;
+
+	return tg_set_dl_bandwidth(tg, dl_period, dl_runtime);
+}
+
+long sched_group_dl_runtime(struct task_group *tg)
+{
+	s64 dl_runtime;
+
+	raw_spin_lock_irq(&tg->dl_bandwidth.dl_runtime_lock);
+	dl_runtime = tg->dl_bandwidth.dl_runtime;
+	raw_spin_unlock_irq(&tg->dl_bandwidth.dl_runtime_lock);
+
+	if (dl_runtime == RUNTIME_INF)
+		return -1;
+
+	do_div(dl_runtime, NSEC_PER_USEC);
+	return dl_runtime;
+}
+
+int sched_group_set_dl_period(struct task_group *tg, long dl_period_us)
+{
+	u64 dl_runtime, dl_period;
+
+	dl_period = (u64)dl_period_us * NSEC_PER_USEC;
+	dl_runtime = tg->dl_bandwidth.dl_runtime;
+
+	if (dl_period == 0)
+		return -EINVAL;
+
+	return tg_set_dl_bandwidth(tg, dl_period, dl_runtime);
+}
+
+long sched_group_dl_period(struct task_group *tg)
+{
+	u64 dl_period_us;
+
+	raw_spin_lock_irq(&tg->dl_bandwidth.dl_runtime_lock);
+	dl_period_us = tg->dl_bandwidth.dl_period;
+	raw_spin_unlock_irq(&tg->dl_bandwidth.dl_runtime_lock);
+	do_div(dl_period_us, NSEC_PER_USEC);
+
+	return dl_period_us;
+}
+
+int sched_group_set_dl_total_bw(struct task_group *tg, int cpu)
+{
+	if (!cpu_online(cpu))
+		return -EINVAL;
+
+	raw_spin_lock_irq(&tg->dl_bandwidth.dl_runtime_lock);
+	tg->dl_bandwidth.dl_total_bw_cpu = cpu;
+	raw_spin_unlock_irq(&tg->dl_bandwidth.dl_runtime_lock);
+
+	return 0;
+}
+
+long sched_group_dl_total_bw(struct task_group *tg)
+{
+	int cpu = tg->dl_bandwidth.dl_total_bw_cpu;
+	u64 dl_total_bw;
+
+	raw_spin_lock_irq(&tg->dl_bandwidth.dl_runtime_lock);
+	dl_total_bw = tg->dl_bandwidth.dl_total_bw[cpu];
+	raw_spin_unlock_irq(&tg->dl_bandwidth.dl_runtime_lock);
+
+	return dl_total_bw;
+}
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
+
+static bool __sched_dl_global_constraints(void)
+{
+	u64 global_runtime = global_dl_runtime();
+	u64 global_period = global_dl_period();
+
+	return (global_period != 0 && (global_runtime == RUNTIME_INF ||
+		global_runtime <= global_period));
+}
+
+/*
+ * Coupling of -dl and -rt bandwidth.
+ *
+ * Here we check, while setting the system wide bandwidth available
+ * for -dl tasks and groups, if the new values are consistent with
+ * the system settings for the bandwidth available to -rt entities.
+ *
+ * IOW, we want to enforce that
+ *
+ *   rt_bandwidth + dl_bandwidth <= 100%
+ *
+ * is always true.
+ */
+static bool __sched_dl_rt_global_constraints(u64 global_dl_bw)
+{
+	u64 global_rt_bw;
+	bool ret;
+
+	raw_spin_lock(&def_rt_bandwidth.rt_runtime_lock);
+
+	global_rt_bw = to_ratio(ktime_to_ns(def_rt_bandwidth.rt_period),
+				def_rt_bandwidth.rt_runtime);
+
+	ret = global_rt_bw + global_dl_bw <=
+		to_ratio(RUNTIME_INF, RUNTIME_INF);
+
+	raw_spin_unlock(&def_rt_bandwidth.rt_runtime_lock);
+
+	return ret;
+}
+
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+static int sched_dl_global_constraints(void)
+{
+	int err = 0;
+	unsigned long flags;
+	struct dl_bandwidth *init_tg_dl_b = &init_task_group.dl_bandwidth;
+	u64 global_bw;
+
+	if (!__sched_dl_global_constraints())
+		return -EINVAL;
+
+	global_bw = to_ratio(global_dl_period(), global_dl_runtime());
+	if (!__sched_dl_rt_global_constraints(global_bw))
+		return -EINVAL;
+
+	/*
+	 * It is not allowed to set the global system bandwidth
+	 * below the current bandwidth of the root task group (nor
+	 * to constrain it if the root task group is RUNTIME_INF)
+	 */
+	raw_spin_lock_irqsave(&init_tg_dl_b->dl_runtime_lock, flags);
+
+	if (global_bw < init_tg_dl_b->dl_bw ||
+	    (global_dl_runtime() != RUNTIME_INF &&
+	     init_tg_dl_b->dl_runtime == RUNTIME_INF))
+		err = -EBUSY;
+
+	raw_spin_unlock_irqrestore(&init_tg_dl_b->dl_runtime_lock, flags);
+
+	return err;
+}
+#else /* !CONFIG_DEADLINE_GROUP_SCHED */
+static int sched_dl_global_constraints(void)
+{
+	int i;
+	u64 global_bw;
+
+	if (!__sched_dl_global_constraints())
+		return -EINVAL;
+
+	global_bw = to_ratio(global_dl_period(), global_dl_runtime());
+	if (!__sched_dl_rt_global_constraints(global_bw))
+		return -EINVAL;
+
+	/*
+	 * In the !DEADLINE_GROUP_SCHED case it is here that we enforce
+	 * the global system bandwidth not being set to a value smaller
+	 * than the currently allocated bandwidth on any runqueue.
+	 *
+	 * This is safe since we are called with the dl_runtime_lock
+	 * of def_dl_bandwidth held.
+	 */
+	for_each_possible_cpu(i) {
+		if (global_bw < def_dl_bandwidth.dl_total_bw[i])
+			return -EBUSY;
+	}
+
+	return 0;
+}
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
+
 int sched_rt_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos)
@@ -10788,6 +11595,72 @@ int sched_rt_handler(struct ctl_table *table, int write,
 	return ret;
 }
 
+int sched_dl_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int ret;
+	int old_period, old_runtime;
+	static DEFINE_MUTEX(mutex);
+	unsigned long flags;
+
+	mutex_lock(&mutex);
+	old_period = sysctl_sched_dl_period;
+	old_runtime = sysctl_sched_dl_runtime;
+
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		raw_spin_lock_irqsave(&def_dl_bandwidth.dl_runtime_lock,
+				      flags);
+
+		ret = sched_dl_global_constraints();
+		if (ret) {
+			sysctl_sched_dl_period = old_period;
+			sysctl_sched_dl_runtime = old_runtime;
+		} else {
+			def_dl_bandwidth.dl_period = global_dl_period();
+			def_dl_bandwidth.dl_runtime = global_dl_runtime();
+			def_dl_bandwidth.dl_bw = to_ratio(global_dl_period(),
+							  global_dl_runtime());
+		}
+
+		raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock,
+					   flags);
+	}
+	mutex_unlock(&mutex);
+
+	return ret;
+}
+
+int sched_dl_total_bw_handler(struct ctl_table *table, int write,
+			void __user *buffer, size_t *lenp,
+			loff_t *ppos)
+{
+	int ret, old_cpu, cpu;
+	static DEFINE_MUTEX(mutex);
+
+	mutex_lock(&mutex);
+	old_cpu = cpu = def_dl_bandwidth.dl_total_bw_cpu;
+
+	if (!write)
+		sysctl_sched_dl_total_bw = def_dl_bandwidth.dl_total_bw[cpu];
+
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		cpu = sysctl_sched_dl_total_bw;
+		if (!cpu_online(cpu)) {
+			cpu = old_cpu;
+			ret = -EINVAL;
+		} else
+			def_dl_bandwidth.dl_total_bw_cpu = cpu;
+	}
+	mutex_unlock(&mutex);
+
+	return ret;
+}
+
 #ifdef CONFIG_CGROUP_SCHED
 
 /* return corresponding task_group object of a cgroup */
@@ -10826,9 +11699,15 @@ cpu_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 static int
 cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
+#if defined CONFIG_DEADLINE_GROUP_SCHED || defined CONFIG_RT_GROUP_SCHED
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	if (!sched_dl_can_attach(cgrp, tsk))
+		return -EINVAL;
+#endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	if (!sched_rt_can_attach(cgroup_tg(cgrp), tsk))
 		return -EINVAL;
+#endif
 #else
 	/* We don't support RT-tasks being in separate groups */
 	if (tsk->sched_class != &fair_sched_class)
@@ -10859,11 +11738,55 @@ cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	return 0;
 }
 
+/*
+ * The bandwidth of tsk is freed from its former task
+ * group, and has to be considered occupied in the
+ * new task group.
+ */
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+static
+void __cpu_cgroup_attach_dl(struct cgroup *cgrp, struct cgroup *old_cgrp,
+			    struct task_struct *tsk)
+{
+	unsigned long flags;
+	struct task_group *tg = container_of(cgroup_subsys_state(cgrp,
+					     cpu_cgroup_subsys_id),
+					     struct task_group, css);
+	struct task_group *old_tg = container_of(cgroup_subsys_state(old_cgrp,
+						 cpu_cgroup_subsys_id),
+						 struct task_group, css);
+	struct rq *rq = task_rq_lock(tsk, &flags);
+
+	raw_spin_lock_irq(&tg->dl_bandwidth.dl_runtime_lock);
+	raw_spin_lock(&old_tg->dl_bandwidth.dl_runtime_lock);
+
+	/*
+	 * Actually move the bandwidth the task occupies
+	 * from its old to its new cgroup.
+	 */
+	tg->dl_bandwidth.dl_total_bw[task_cpu(tsk)] += tsk->dl.dl_bw;
+	old_tg->dl_bandwidth.dl_total_bw[task_cpu(tsk)] -= tsk->dl.dl_bw;
+
+	raw_spin_unlock(&old_tg->dl_bandwidth.dl_runtime_lock);
+	raw_spin_unlock_irq(&tg->dl_bandwidth.dl_runtime_lock);
+
+	task_rq_unlock(rq, &flags);
+}
+#else
+static
+void __cpu_cgroup_attach_dl(struct cgroup *cgrp, struct cgroup *old_cgrp,
+			    struct task_struct *tsk)
+{
+}
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
+
 static void
 cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 		  struct cgroup *old_cont, struct task_struct *tsk,
 		  bool threadgroup)
 {
+	__cpu_cgroup_attach_dl(cgrp, old_cont, tsk);
+
 	sched_move_task(tsk);
 	if (threadgroup) {
 		struct task_struct *c;
@@ -10914,6 +11837,41 @@ static u64 cpu_rt_period_read_uint(struct cgroup *cgrp, struct cftype *cft)
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+static int cpu_dl_runtime_write(struct cgroup *cgrp, struct cftype *cftype,
+				s64 dl_runtime_us)
+{
+	return sched_group_set_dl_runtime(cgroup_tg(cgrp), dl_runtime_us);
+}
+
+static s64 cpu_dl_runtime_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	return sched_group_dl_runtime(cgroup_tg(cgrp));
+}
+
+static int cpu_dl_period_write(struct cgroup *cgrp, struct cftype *cftype,
+			       u64 dl_period_us)
+{
+	return sched_group_set_dl_period(cgroup_tg(cgrp), dl_period_us);
+}
+
+static u64 cpu_dl_period_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	return sched_group_dl_period(cgroup_tg(cgrp));
+}
+
+static int cpu_dl_total_bw_write(struct cgroup *cgrp, struct cftype *cftype,
+				 u64 cpu)
+{
+	return sched_group_set_dl_total_bw(cgroup_tg(cgrp), cpu);
+}
+
+static u64 cpu_dl_total_bw_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	return sched_group_dl_total_bw(cgroup_tg(cgrp));
+}
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
+
 static struct cftype cpu_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -10934,6 +11892,23 @@ static struct cftype cpu_files[] = {
 		.write_u64 = cpu_rt_period_write_uint,
 	},
 #endif
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	{
+		.name = "dl_runtime_us",
+		.read_s64 = cpu_dl_runtime_read,
+		.write_s64 = cpu_dl_runtime_write,
+	},
+	{
+		.name = "dl_period_us",
+		.read_u64 = cpu_dl_period_read,
+		.write_u64 = cpu_dl_period_write,
+	},
+	{
+		.name = "dl_total_bw",
+		.read_u64 = cpu_dl_total_bw_read,
+		.write_u64 = cpu_dl_total_bw_write,
+	},
+#endif
 };
 
 static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont)
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 407a761..84d8d40 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -146,7 +146,8 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
 }
 
 #if defined(CONFIG_CGROUP_SCHED) && \
-	(defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
+	(defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED) || \
+	 defined(CONFIG_DEADLINE_GROUP_SCHED))
 static void task_group_path(struct task_group *tg, char *buf, int buflen)
 {
 	/* may be NULL if the underlying cgroup isn't fully-created yet */
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index 3613cbd..3e466cb 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -9,6 +9,10 @@
  * than their reserved bandwidth will be slowed down (and may potentially
  * miss some of their deadlines), and won't affect any other task.
  *
+ * Group scheduling, if configured, is utilized for admission control
+ * purposes, i.e., the sum of the bandwidth of tasks and groups belonging
+ * to group A must stay below A's own bandwidth.
+ *
  * Copyright (C) 2010 Dario Faggioli <raistlin@linux.it>,
  *                    Michael Trimarchi <trimarchimichael@yahoo.it>,
  *                    Fabio Checconi <fabio@gandalf.sssup.it>
@@ -658,8 +662,18 @@ static void task_fork_dl(struct task_struct *p)
 
 static void task_dead_dl(struct task_struct *p)
 {
+	struct dl_bandwidth *dl_b = task_dl_bandwidth(p);
+
+	/*
+	 * Since the task is TASK_DEAD we hope
+	 * it will not migrate or change group!
+	 */
+	raw_spin_lock_irq(&dl_b->dl_runtime_lock);
+	dl_b->dl_total_bw[task_cpu(p)] -= p->dl.dl_bw;
+	raw_spin_unlock_irq(&dl_b->dl_runtime_lock);
+
 	/*
-	 * We are not holding any lock here, so it is safe to
+	 * We are no longer holding any lock here, so it is safe to
 	 * wait for the bandwidth timer to be removed.
 	 */
 	hrtimer_cancel(&p->dl.dl_timer);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8a68b24..2f3cdba 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -358,6 +358,27 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= sched_rt_handler,
 	},
 	{
+		.procname	= "sched_dl_period_us",
+		.data		= &sysctl_sched_dl_period,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &sched_dl_handler,
+	},
+	{
+		.procname	= "sched_dl_runtime_us",
+		.data		= &sysctl_sched_dl_runtime,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &sched_dl_handler,
+	},
+	{
+		.procname	= "sched_dl_total_bw",
+		.data		= &sysctl_sched_dl_total_bw,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &sched_dl_total_bw_handler,
+	},
+	{
 		.procname	= "sched_compat_yield",
 		.data		= &sysctl_sched_compat_yield,
 		.maxlen		= sizeof(unsigned int),
-- 
1.7.0
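
For readers following the -dl/-rt coupling test done by
__sched_dl_rt_global_constraints(): below is a minimal user-space
sketch of the same check, not the kernel code. It assumes bandwidths
are fixed-point fractions with 20 fractional bits and that an infinite
runtime maps to exactly 100%; the sketch_* names, SKETCH_BW_SHIFT and
SKETCH_RUNTIME_INF are made up for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's RUNTIME_INF and ratio scale. */
#define SKETCH_RUNTIME_INF	((uint64_t)~0ULL)
#define SKETCH_BW_SHIFT		20	/* assumed fixed-point precision */

/* Bandwidth as runtime/period, scaled by 2^SKETCH_BW_SHIFT. */
static uint64_t sketch_to_ratio(uint64_t period, uint64_t runtime)
{
	/* An infinite runtime means "the whole CPU", i.e. 100%. */
	if (runtime == SKETCH_RUNTIME_INF)
		return 1ULL << SKETCH_BW_SHIFT;

	assert(period != 0);	/* a zero period is rejected before this point */
	return (runtime << SKETCH_BW_SHIFT) / period;
}

/* True when rt_bandwidth + dl_bandwidth <= 100%. */
static bool sketch_rt_dl_fit(uint64_t rt_bw, uint64_t dl_bw)
{
	uint64_t full = sketch_to_ratio(SKETCH_RUNTIME_INF, SKETCH_RUNTIME_INF);

	return rt_bw + dl_bw <= full;
}
```

With a period of 1000000us, 950000us of -rt runtime plus 50000us of
-dl runtime pass sketch_rt_dl_fit(), while 950000us plus 100000us do
not, which is the case the sysctl handler answers with -EINVAL.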

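The per-CPU accounting used by tg_set_dl_bandwidth() and
__cpu_cgroup_attach_dl() can be sketched in user space as follows.
This is only an illustration under simplifying assumptions: locking
and the schedulability test are left out, and the struct, the sketch_*
functions and SKETCH_NR_CPUS are hypothetical names, not the ones in
the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NR_CPUS 4	/* hypothetical CPU count for the sketch */

struct sketch_dl_bandwidth {
	uint64_t dl_bw;				/* bandwidth granted to the group */
	uint64_t dl_total_bw[SKETCH_NR_CPUS];	/* bandwidth allocated inside it, per CPU */
};

/*
 * Change a group's bandwidth and charge the difference to its parent
 * on every CPU. The difference is computed in unsigned arithmetic:
 * when the group shrinks it wraps modulo 2^64, and adding it still
 * yields the intended, smaller total.
 */
static void sketch_update_group_bw(struct sketch_dl_bandwidth *parent,
				   struct sketch_dl_bandwidth *tg,
				   uint64_t new_bw)
{
	int i;

	for (i = 0; i < SKETCH_NR_CPUS; i++)
		parent->dl_total_bw[i] += new_bw - tg->dl_bw;
	tg->dl_bw = new_bw;
}

/* Move one task's bandwidth from its old group to its new one on cpu. */
static void sketch_move_task_bw(struct sketch_dl_bandwidth *new_tg,
				struct sketch_dl_bandwidth *old_tg,
				int cpu, uint64_t task_bw)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	assert(old_tg->dl_total_bw[cpu] >= task_bw);

	new_tg->dl_total_bw[cpu] += task_bw;
	old_tg->dl_total_bw[cpu] -= task_bw;
}
```

The point of the sketch is that both operations are plain deltas on
the per-CPU totals, so the sum of what is allocated inside a group
never has to be recomputed by walking its tasks.
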
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

