* [RFC PATCH 0/3] SCHED_DEADLINE cgroups support
@ 2018-02-12 13:40 Juri Lelli
  2018-02-12 13:40 ` [RFC PATCH 1/3] sched/deadline: merge dl_bw into dl_bandwidth Juri Lelli
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Juri Lelli @ 2018-02-12 13:40 UTC (permalink / raw)
  To: peterz, mingo
  Cc: linux-kernel, tglx, vincent.guittot, rostedt, luca.abeni,
	claudio, tommaso.cucinotta, bristot, mathieu.poirier, tkjos,
	joelaf, morten.rasmussen, dietmar.eggemann, patrick.bellasi,
	alessio.balsini, juri.lelli

Hi,

A long time ago there was a patch [1] (written by Dario) adding DEADLINE
bandwidth management control for task groups. It was then dropped from
the set of patches that made it to mainline, both because it was outside
the bare minimum of features needed to start playing with SCHED_DEADLINE
and because quite a few discussion points remained open.

Fast forward to the present day: more features have been added, but
DEADLINE usage is still reserved to root only. Several things are still
missing before we can comfortably relax privileges, bandwidth
management for groups of tasks being one of the most important (together
with a better/safer PI mechanism, I'd say).

Another (different) attempt to add cgroup support was proposed last year
[2]. That set implemented hierarchical scheduling support (RT
entities running inside DEADLINE servers). Complexity (and maybe not
enough documentation? :) made it hard for discussion around that
proposal to take off.

Even though hierarchical scheduling is still what we want in the end,
this set tries to start getting there by adding cgroup based bandwidth
management for SCHED_DEADLINE. The following design choices have been
made (also detailed in the changelog/doc):

 - implementation _is not_ hierarchical: only single/plain DEADLINE
   entities can be handled, and they get scheduled at root rq level

 - DEADLINE_GROUP_SCHED requires RT_GROUP_SCHED (because of the points
   below)

 - DEADLINE and RT share bandwidth; therefore, DEADLINE tasks will eat
   RT bandwidth, as they do today at root level; support for
   RT_RUNTIME_SHARE is however missing, so an RT task might be able to
   exceed its group bandwidth constraint if that feature is enabled
   (more thinking required)

 - and therefore cpu.rt_runtime_us and cpu.rt_period_us still control a
   group's bandwidth; however, two additional (read-only) knobs are added
   (their value format is sketched right after this list)

     # cpu.dl_bw : maximum bandwidth available to the group on each CPU
                   (rt_runtime_us/rt_period_us)
     # cpu.dl_total_bw : current total (across CPUs) amount of bandwidth
                         allocated by the group (sum of its tasks' bandwidth)

 - parent/children/siblings rules are the same as for RT
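
As mentioned above, here is a rough sketch of the value format those
read-only knobs use: cpu.dl_bw is the runtime/period ratio in the kernel's
fixed-point encoding (this mirrors to_ratio(); BW_SHIFT being 20 is an
assumption about the current kernel, not something this set changes):

#include <stdint.h>
#include <stdio.h>

#define BW_SHIFT 20	/* assumed: the kernel's fixed-point shift for bandwidth */

/* bandwidth as a fixed-point fraction, mirroring the kernel's to_ratio() */
static uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
	return (runtime << BW_SHIFT) / period;
}

int main(void)
{
	/* e.g. rt_runtime_us = 950000 and rt_period_us = 1000000 (the defaults) */
	uint64_t bw = to_ratio(1000000, 950000);

	/* cpu.dl_bw would then report something like this (~0.95 * 2^20) */
	printf("dl_bw = %llu (%.1f%% of each CPU)\n",
	       (unsigned long long)bw, 100.0 * bw / (double)(1ULL << BW_SHIFT));
	return 0;
}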

Adding this kind of support should be useful for letting normal
users use DEADLINE, as the sys admin (with root privileges) could
reserve a fraction of the total available bandwidth for users and let
them allocate what's needed inside such a space.
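
To make the "allocate what's needed" part concrete, below is a minimal
sketch of the user side: a task asking for 10ms of runtime every 100ms via
sched_setattr(). With this set applied, such a request would also be checked
against the bandwidth of the cgroup the task lives in (struct sched_attr and
SCHED_DEADLINE are spelled out by hand since glibc provides no wrapper;
SYS_sched_setattr assumes reasonably recent headers):

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define SCHED_DEADLINE	6

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr attr = {
		.size           = sizeof(attr),
		.sched_policy   = SCHED_DEADLINE,
		.sched_runtime  =  10 * 1000 * 1000,	/*  10ms */
		.sched_deadline = 100 * 1000 * 1000,	/* 100ms */
		.sched_period   = 100 * 1000 * 1000,	/* 100ms */
	};

	/* pid 0 == current task; admission control may refuse the request */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");
	else
		pause();	/* run as a DEADLINE task until signalled */

	return 0;
}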

I'm more than sure that there are problems lurking in this set (e.g.,
too much ifdeffery) and many discussion points are still open, but I
wanted to share what I have early and see what people think about it
(and hopefully understand how to move forward).

The first patch might actually be a standalone cleanup change.

The set (based on tip/sched/core as of today) is available at:

https://github.com/jlelli/linux.git upstream/deadline/cgroup-rfc-v1

Comments and feedback are the purpose of this RFC. Thanks in advance!

Best,

- Juri

[1] https://lkml.org/lkml/2010/2/28/119
[2] https://lwn.net/Articles/718645/

Juri Lelli (3):
  sched/deadline: merge dl_bw into dl_bandwidth
  sched/deadline: add task groups bandwidth management support
  Documentation/scheduler/sched-deadline: add info about cgroup support

 Documentation/scheduler/sched-deadline.txt |  36 +++--
 init/Kconfig                               |  12 ++
 kernel/sched/autogroup.c                   |   7 +
 kernel/sched/core.c                        |  56 ++++++-
 kernel/sched/deadline.c                    | 241 +++++++++++++++++++++++------
 kernel/sched/debug.c                       |   6 +-
 kernel/sched/rt.c                          |  52 ++++++-
 kernel/sched/sched.h                       |  68 ++++----
 kernel/sched/topology.c                    |   2 +-
 9 files changed, 381 insertions(+), 99 deletions(-)

-- 
2.14.3

* [RFC PATCH 1/3] sched/deadline: merge dl_bw into dl_bandwidth
  2018-02-12 13:40 [RFC PATCH 0/3] SCHED_DEADLINE cgroups support Juri Lelli
@ 2018-02-12 13:40 ` Juri Lelli
  2018-02-12 17:34   ` Steven Rostedt
  2018-02-12 13:40 ` [RFC PATCH 2/3] sched/deadline: add task groups bandwidth management support Juri Lelli
  2018-02-12 13:40 ` [RFC PATCH 3/3] Documentation/scheduler/sched-deadline: add info about cgroup support Juri Lelli
  2 siblings, 1 reply; 10+ messages in thread
From: Juri Lelli @ 2018-02-12 13:40 UTC (permalink / raw)
  To: peterz, mingo
  Cc: linux-kernel, tglx, vincent.guittot, rostedt, luca.abeni,
	claudio, tommaso.cucinotta, bristot, mathieu.poirier, tkjos,
	joelaf, morten.rasmussen, dietmar.eggemann, patrick.bellasi,
	alessio.balsini, juri.lelli

Both dl_bandwidth and dl_bw hold information about the DEADLINE bandwidth
admitted to the system (at different levels). However, they are separate and
treated as two different beasts.

Merge them, as it makes more sense, is easier to manage, and aligns better
with RT (which already has a single rt_bandwidth).
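
For reference, the merged structure (as introduced by the sched.h hunk
below) ends up carrying both the runtime/period pair and the bandwidth
accounting that used to live in struct dl_bw:

struct dl_bandwidth {
	raw_spinlock_t	dl_runtime_lock;
	u64		dl_period;
	u64		dl_runtime;
	u64		dl_bw;		/* to_ratio(dl_period, dl_runtime) */
	u64		dl_total_bw;	/* sum of admitted tasks' bandwidth */
};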

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: linux-kernel@vger.kernel.org
---
 kernel/sched/core.c     |  2 +-
 kernel/sched/deadline.c | 84 +++++++++++++++++++++++--------------------------
 kernel/sched/debug.c    |  6 ++--
 kernel/sched/sched.h    | 48 +++++++++++-----------------
 kernel/sched/topology.c |  2 +-
 5 files changed, 63 insertions(+), 79 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee420d78e674..772a6b3239eb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4234,7 +4234,7 @@ static int __sched_setscheduler(struct task_struct *p,
 			 * will also fail if there's no bandwidth available.
 			 */
 			if (!cpumask_subset(span, &p->cpus_allowed) ||
-			    rq->rd->dl_bw.bw == 0) {
+			    rq->rd->dl_bw.dl_bw == 0) {
 				task_rq_unlock(rq, p, &rf);
 				return -EPERM;
 			}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 9bb0e0c412ec..de19bd7feddb 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -46,7 +46,7 @@ static inline int on_dl_rq(struct sched_dl_entity *dl_se)
 }
 
 #ifdef CONFIG_SMP
-static inline struct dl_bw *dl_bw_of(int i)
+static inline struct dl_bandwidth *dl_bw_of(int i)
 {
 	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
 			 "sched RCU must be held");
@@ -66,7 +66,7 @@ static inline int dl_bw_cpus(int i)
 	return cpus;
 }
 #else
-static inline struct dl_bw *dl_bw_of(int i)
+static inline struct dl_bandwidth *dl_bw_of(int i)
 {
 	return &cpu_rq(i)->dl.dl_bw;
 }
@@ -275,14 +275,14 @@ static void task_non_contending(struct task_struct *p)
 		if (dl_task(p))
 			sub_running_bw(dl_se, dl_rq);
 		if (!dl_task(p) || p->state == TASK_DEAD) {
-			struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+			struct dl_bandwidth *dl_b = dl_bw_of(task_cpu(p));
 
 			if (p->state == TASK_DEAD)
 				sub_rq_bw(&p->dl, &rq->dl);
-			raw_spin_lock(&dl_b->lock);
+			raw_spin_lock(&dl_b->dl_runtime_lock);
 			__dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
 			__dl_clear_params(p);
-			raw_spin_unlock(&dl_b->lock);
+			raw_spin_unlock(&dl_b->dl_runtime_lock);
 		}
 
 		return;
@@ -342,18 +342,11 @@ void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime)
 	raw_spin_lock_init(&dl_b->dl_runtime_lock);
 	dl_b->dl_period = period;
 	dl_b->dl_runtime = runtime;
-}
-
-void init_dl_bw(struct dl_bw *dl_b)
-{
-	raw_spin_lock_init(&dl_b->lock);
-	raw_spin_lock(&def_dl_bandwidth.dl_runtime_lock);
-	if (global_rt_runtime() == RUNTIME_INF)
-		dl_b->bw = -1;
+	if (runtime == RUNTIME_INF)
+		dl_b->dl_bw = -1;
 	else
-		dl_b->bw = to_ratio(global_rt_period(), global_rt_runtime());
-	raw_spin_unlock(&def_dl_bandwidth.dl_runtime_lock);
-	dl_b->total_bw = 0;
+		dl_b->dl_bw = to_ratio(period, runtime);
+	dl_b->dl_total_bw = 0;
 }
 
 void init_dl_rq(struct dl_rq *dl_rq)
@@ -368,7 +361,8 @@ void init_dl_rq(struct dl_rq *dl_rq)
 	dl_rq->overloaded = 0;
 	dl_rq->pushable_dl_tasks_root = RB_ROOT_CACHED;
 #else
-	init_dl_bw(&dl_rq->dl_bw);
+	init_dl_bandwidth(&dl_rq->dl_bw);
+	init_dl_bandwidth(&dl_rq->dl_bw, global_rt_period(), global_rt_runtime());
 #endif
 
 	dl_rq->running_bw = 0;
@@ -1262,7 +1256,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 	rq = task_rq_lock(p, &rf);
 
 	if (!dl_task(p) || p->state == TASK_DEAD) {
-		struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+		struct dl_bandwidth *dl_b = dl_bw_of(task_cpu(p));
 
 		if (p->state == TASK_DEAD && dl_se->dl_non_contending) {
 			sub_running_bw(&p->dl, dl_rq_of_se(&p->dl));
@@ -1270,9 +1264,9 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 			dl_se->dl_non_contending = 0;
 		}
 
-		raw_spin_lock(&dl_b->lock);
+		raw_spin_lock(&dl_b->dl_runtime_lock);
 		__dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
-		raw_spin_unlock(&dl_b->lock);
+		raw_spin_unlock(&dl_b->dl_runtime_lock);
 		__dl_clear_params(p);
 
 		goto unlock;
@@ -2223,7 +2217,7 @@ static void set_cpus_allowed_dl(struct task_struct *p,
 	 * domain (see cpuset_can_attach()).
 	 */
 	if (!cpumask_intersects(src_rd->span, new_mask)) {
-		struct dl_bw *src_dl_b;
+		struct dl_bandwidth *src_dl_b;
 
 		src_dl_b = dl_bw_of(cpu_of(rq));
 		/*
@@ -2231,9 +2225,9 @@ static void set_cpus_allowed_dl(struct task_struct *p,
 		 * off. In the worst case, sched_setattr() may temporary fail
 		 * until we complete the update.
 		 */
-		raw_spin_lock(&src_dl_b->lock);
+		raw_spin_lock(&src_dl_b->dl_runtime_lock);
 		__dl_sub(src_dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
-		raw_spin_unlock(&src_dl_b->lock);
+		raw_spin_unlock(&src_dl_b->dl_runtime_lock);
 	}
 
 	set_cpus_allowed_common(p, new_mask);
@@ -2406,7 +2400,7 @@ int sched_dl_global_validate(void)
 	u64 runtime = global_rt_runtime();
 	u64 period = global_rt_period();
 	u64 new_bw = to_ratio(period, runtime);
-	struct dl_bw *dl_b;
+	struct dl_bandwidth *dl_b;
 	int cpu, ret = 0;
 	unsigned long flags;
 
@@ -2423,10 +2417,10 @@ int sched_dl_global_validate(void)
 		rcu_read_lock_sched();
 		dl_b = dl_bw_of(cpu);
 
-		raw_spin_lock_irqsave(&dl_b->lock, flags);
-		if (new_bw < dl_b->total_bw)
+		raw_spin_lock_irqsave(&dl_b->dl_runtime_lock, flags);
+		if (new_bw < dl_b->dl_total_bw)
 			ret = -EBUSY;
-		raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+		raw_spin_unlock_irqrestore(&dl_b->dl_runtime_lock, flags);
 
 		rcu_read_unlock_sched();
 
@@ -2453,7 +2447,7 @@ void init_dl_rq_bw_ratio(struct dl_rq *dl_rq)
 void sched_dl_do_global(void)
 {
 	u64 new_bw = -1;
-	struct dl_bw *dl_b;
+	struct dl_bandwidth *dl_b;
 	int cpu;
 	unsigned long flags;
 
@@ -2470,9 +2464,9 @@ void sched_dl_do_global(void)
 		rcu_read_lock_sched();
 		dl_b = dl_bw_of(cpu);
 
-		raw_spin_lock_irqsave(&dl_b->lock, flags);
-		dl_b->bw = new_bw;
-		raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+		raw_spin_lock_irqsave(&dl_b->dl_runtime_lock, flags);
+		dl_b->dl_bw = new_bw;
+		raw_spin_unlock_irqrestore(&dl_b->dl_runtime_lock, flags);
 
 		rcu_read_unlock_sched();
 		init_dl_rq_bw_ratio(&cpu_rq(cpu)->dl);
@@ -2490,7 +2484,7 @@ void sched_dl_do_global(void)
 int sched_dl_overflow(struct task_struct *p, int policy,
 		      const struct sched_attr *attr)
 {
-	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+	struct dl_bandwidth *dl_b = dl_bw_of(task_cpu(p));
 	u64 period = attr->sched_period ?: attr->sched_deadline;
 	u64 runtime = attr->sched_runtime;
 	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
@@ -2508,7 +2502,7 @@ int sched_dl_overflow(struct task_struct *p, int policy,
 	 * its parameters, we may need to update accordingly the total
 	 * allocated bandwidth of the container.
 	 */
-	raw_spin_lock(&dl_b->lock);
+	raw_spin_lock(&dl_b->dl_runtime_lock);
 	cpus = dl_bw_cpus(task_cpu(p));
 	if (dl_policy(policy) && !task_has_dl_policy(p) &&
 	    !__dl_overflow(dl_b, cpus, 0, new_bw)) {
@@ -2537,7 +2531,7 @@ int sched_dl_overflow(struct task_struct *p, int policy,
 		 */
 		err = 0;
 	}
-	raw_spin_unlock(&dl_b->lock);
+	raw_spin_unlock(&dl_b->dl_runtime_lock);
 
 	return err;
 }
@@ -2655,14 +2649,14 @@ int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allo
 {
 	unsigned int dest_cpu = cpumask_any_and(cpu_active_mask,
 							cs_cpus_allowed);
-	struct dl_bw *dl_b;
+	struct dl_bandwidth *dl_b;
 	bool overflow;
 	int cpus, ret;
 	unsigned long flags;
 
 	rcu_read_lock_sched();
 	dl_b = dl_bw_of(dest_cpu);
-	raw_spin_lock_irqsave(&dl_b->lock, flags);
+	raw_spin_lock_irqsave(&dl_b->dl_runtime_lock, flags);
 	cpus = dl_bw_cpus(dest_cpu);
 	overflow = __dl_overflow(dl_b, cpus, 0, p->dl.dl_bw);
 	if (overflow)
@@ -2677,7 +2671,7 @@ int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allo
 		__dl_add(dl_b, p->dl.dl_bw, cpus);
 		ret = 0;
 	}
-	raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+	raw_spin_unlock_irqrestore(&dl_b->dl_runtime_lock, flags);
 	rcu_read_unlock_sched();
 	return ret;
 }
@@ -2686,18 +2680,18 @@ int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur,
 				 const struct cpumask *trial)
 {
 	int ret = 1, trial_cpus;
-	struct dl_bw *cur_dl_b;
+	struct dl_bandwidth *cur_dl_b;
 	unsigned long flags;
 
 	rcu_read_lock_sched();
 	cur_dl_b = dl_bw_of(cpumask_any(cur));
 	trial_cpus = cpumask_weight(trial);
 
-	raw_spin_lock_irqsave(&cur_dl_b->lock, flags);
-	if (cur_dl_b->bw != -1 &&
-	    cur_dl_b->bw * trial_cpus < cur_dl_b->total_bw)
+	raw_spin_lock_irqsave(&cur_dl_b->dl_runtime_lock, flags);
+	if (cur_dl_b->dl_bw != -1 &&
+	    cur_dl_b->dl_bw * trial_cpus < cur_dl_b->dl_total_bw)
 		ret = 0;
-	raw_spin_unlock_irqrestore(&cur_dl_b->lock, flags);
+	raw_spin_unlock_irqrestore(&cur_dl_b->dl_runtime_lock, flags);
 	rcu_read_unlock_sched();
 	return ret;
 }
@@ -2705,16 +2699,16 @@ int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur,
 bool dl_cpu_busy(unsigned int cpu)
 {
 	unsigned long flags;
-	struct dl_bw *dl_b;
+	struct dl_bandwidth *dl_b;
 	bool overflow;
 	int cpus;
 
 	rcu_read_lock_sched();
 	dl_b = dl_bw_of(cpu);
-	raw_spin_lock_irqsave(&dl_b->lock, flags);
+	raw_spin_lock_irqsave(&dl_b->dl_runtime_lock, flags);
 	cpus = dl_bw_cpus(cpu);
 	overflow = __dl_overflow(dl_b, cpus, 0, 0);
-	raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+	raw_spin_unlock_irqrestore(&dl_b->dl_runtime_lock, flags);
 	rcu_read_unlock_sched();
 	return overflow;
 }
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1ca0130ed4f9..cf736a30350e 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -622,7 +622,7 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
 
 void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq)
 {
-	struct dl_bw *dl_bw;
+	struct dl_bandwidth *dl_bw;
 
 	SEQ_printf(m, "\ndl_rq[%d]:\n", cpu);
 
@@ -636,8 +636,8 @@ void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq)
 #else
 	dl_bw = &dl_rq->dl_bw;
 #endif
-	SEQ_printf(m, "  .%-30s: %lld\n", "dl_bw->bw", dl_bw->bw);
-	SEQ_printf(m, "  .%-30s: %lld\n", "dl_bw->total_bw", dl_bw->total_bw);
+	SEQ_printf(m, "  .%-30s: %lld\n", "dl_bw->dl_bw", dl_bw->dl_bw);
+	SEQ_printf(m, "  .%-30s: %lld\n", "dl_bw->dl_total_bw", dl_bw->dl_total_bw);
 
 #undef PU
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2e95505e23c6..7c44c8baa98c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -213,7 +213,7 @@ void __dl_clear_params(struct task_struct *p);
 /*
  * To keep the bandwidth of -deadline tasks and groups under control
  * we need some place where:
- *  - store the maximum -deadline bandwidth of the system (the group);
+ *  - store the maximum -deadline bandwidth of the system (the domain);
  *  - cache the fraction of that bandwidth that is currently allocated.
  *
  * This is all done in the data structure below. It is similar to the
@@ -224,20 +224,16 @@ void __dl_clear_params(struct task_struct *p);
  *
  * With respect to SMP, the bandwidth is given on a per-CPU basis,
  * meaning that:
- *  - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
- *  - dl_total_bw array contains, in the i-eth element, the currently
- *    allocated bandwidth on the i-eth CPU.
- * Moreover, groups consume bandwidth on each CPU, while tasks only
- * consume bandwidth on the CPU they're running on.
- * Finally, dl_total_bw_cpu is used to cache the index of dl_total_bw
- * that will be shown the next time the proc or cgroup controls will
- * be red. It on its turn can be changed by writing on its own
- * control.
+ *  - dl_bw (< 100%) is the bandwidth of the system (domain) on each CPU;
+ *  - dl_total_bw array contains the currently allocated bandwidth on the
+ *    i-eth CPU.
  */
 struct dl_bandwidth {
 	raw_spinlock_t dl_runtime_lock;
-	u64 dl_runtime;
 	u64 dl_period;
+	u64 dl_runtime;
+	u64 dl_bw;
+	u64 dl_total_bw;
 };
 
 static inline int dl_bandwidth_enabled(void)
@@ -245,36 +241,30 @@ static inline int dl_bandwidth_enabled(void)
 	return sysctl_sched_rt_runtime >= 0;
 }
 
-struct dl_bw {
-	raw_spinlock_t lock;
-	u64 bw, total_bw;
-};
-
-static inline void __dl_update(struct dl_bw *dl_b, s64 bw);
+static inline void __dl_update(struct dl_bandwidth *dl_b, s64 bw);
 
 static inline
-void __dl_sub(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
+void __dl_sub(struct dl_bandwidth *dl_b, u64 tsk_bw, int cpus)
 {
-	dl_b->total_bw -= tsk_bw;
+	dl_b->dl_total_bw -= tsk_bw;
 	__dl_update(dl_b, (s32)tsk_bw / cpus);
 }
 
 static inline
-void __dl_add(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
+void __dl_add(struct dl_bandwidth *dl_b, u64 tsk_bw, int cpus)
 {
-	dl_b->total_bw += tsk_bw;
+	dl_b->dl_total_bw += tsk_bw;
 	__dl_update(dl_b, -((s32)tsk_bw / cpus));
 }
 
 static inline
-bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
+bool __dl_overflow(struct dl_bandwidth *dl_b, int cpus, u64 old_bw, u64 new_bw)
 {
-	return dl_b->bw != -1 &&
-	       dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
+	return dl_b->dl_bw != -1 &&
+	       dl_b->dl_bw * cpus < dl_b->dl_total_bw - old_bw + new_bw;
 }
 
 void dl_change_utilization(struct task_struct *p, u64 new_bw);
-extern void init_dl_bw(struct dl_bw *dl_b);
 extern int sched_dl_global_validate(void);
 extern void sched_dl_do_global(void);
 extern int sched_dl_overflow(struct task_struct *p, int policy,
@@ -600,7 +590,7 @@ struct dl_rq {
 	 */
 	struct rb_root_cached pushable_dl_tasks_root;
 #else
-	struct dl_bw dl_bw;
+	struct dl_bandwidth dl_bw;
 #endif
 	/*
 	 * "Active utilization" for this runqueue: increased when a
@@ -659,7 +649,7 @@ struct root_domain {
 	 */
 	cpumask_var_t dlo_mask;
 	atomic_t dlo_count;
-	struct dl_bw dl_bw;
+	struct dl_bandwidth dl_bw;
 	struct cpudl cpudl;
 
 #ifdef HAVE_RT_PUSH_IPI
@@ -2018,7 +2008,7 @@ static inline void nohz_balance_exit_idle(unsigned int cpu) { }
 
 #ifdef CONFIG_SMP
 static inline
-void __dl_update(struct dl_bw *dl_b, s64 bw)
+void __dl_update(struct dl_bandwidth *dl_b, s64 bw)
 {
 	struct root_domain *rd = container_of(dl_b, struct root_domain, dl_bw);
 	int i;
@@ -2033,7 +2023,7 @@ void __dl_update(struct dl_bw *dl_b, s64 bw)
 }
 #else
 static inline
-void __dl_update(struct dl_bw *dl_b, s64 bw)
+void __dl_update(struct dl_bandwidth *dl_b, s64 bw)
 {
 	struct dl_rq *dl = container_of(dl_b, struct dl_rq, dl_bw);
 
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 034cbed7f88b..0700f3f40445 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -276,7 +276,7 @@ static int init_rootdomain(struct root_domain *rd)
 	init_irq_work(&rd->rto_push_work, rto_push_irq_work_func);
 #endif
 
-	init_dl_bw(&rd->dl_bw);
+	init_dl_bandwidth(&rd->dl_bw, global_rt_period(), global_rt_runtime());
 	if (cpudl_init(&rd->cpudl) != 0)
 		goto free_rto_mask;
 
-- 
2.14.3

* [RFC PATCH 2/3] sched/deadline: add task groups bandwidth management support
  2018-02-12 13:40 [RFC PATCH 0/3] SCHED_DEADLINE cgroups support Juri Lelli
  2018-02-12 13:40 ` [RFC PATCH 1/3] sched/deadline: merge dl_bw into dl_bandwidth Juri Lelli
@ 2018-02-12 13:40 ` Juri Lelli
  2018-02-12 16:47   ` Tejun Heo
  2018-02-12 13:40 ` [RFC PATCH 3/3] Documentation/scheduler/sched-deadline: add info about cgroup support Juri Lelli
  2 siblings, 1 reply; 10+ messages in thread
From: Juri Lelli @ 2018-02-12 13:40 UTC (permalink / raw)
  To: peterz, mingo
  Cc: linux-kernel, tglx, vincent.guittot, rostedt, luca.abeni,
	claudio, tommaso.cucinotta, bristot, mathieu.poirier, tkjos,
	joelaf, morten.rasmussen, dietmar.eggemann, patrick.bellasi,
	alessio.balsini, juri.lelli, Tejun Heo

One of the missing features of DEADLINE (w.r.t. RT) is a way of controlling
CPU bandwidth allocation for task groups. Such a feature would be especially
useful for letting normal users use DEADLINE, as the sys admin (with root
privileges) could reserve a fraction of the total available bandwidth for
users and let them allocate what's needed inside such a space.

This patch implements cgroup support for DEADLINE, with the following design
choices:

 - implementation _is not_ hierarchical: only single/plain DEADLINE entities
   can be handled, and they get scheduled at root rq level

 - DEADLINE_GROUP_SCHED requires RT_GROUP_SCHED (because of the points below)

 - DEADLINE and RT share bandwidth; therefore, DEADLINE tasks will eat RT
   bandwidth, as they do today at root level; support for RT_RUNTIME_SHARE
   is however missing, so an RT task might be able to exceed its group
   bandwidth constraint if that feature is enabled (more thinking required)

 - and therefore cpu.rt_runtime_us and cpu.rt_period_us still control a
   group's bandwidth; however, two additional (read-only) knobs are added

     # cpu.dl_bw : maximum bandwidth available to the group on each CPU
                   (rt_runtime_us/rt_period_us)
     # cpu.dl_total_bw : current total (across CPUs) amount of bandwidth
                         allocated by the group (sum of its tasks' bandwidth)

 - parent/children/siblings rules are the same as for RT
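
To give an idea of what the group-level admission control boils down to,
here is a condensed sketch of the check performed when a DEADLINE task
enters a group (it paraphrases sched_dl_can_attach()/__dl_overflow() from
the patch below; the helper name is made up for illustration):

/* Can this group take task_bw more bandwidth, given cpus CPUs? */
static bool dl_group_can_admit(struct dl_bandwidth *dl_b, int cpus,
			       u64 task_bw)
{
	bool ok = true;

	raw_spin_lock(&dl_b->dl_runtime_lock);
	if (dl_b->dl_runtime == 0)
		ok = false;			/* no way for the task to run */
	else if (dl_b->dl_bw != -1 &&
		 dl_b->dl_bw * cpus < dl_b->dl_total_bw + task_bw)
		ok = false;			/* group bandwidth would overflow */
	else
		dl_b->dl_total_bw += task_bw;	/* reserve the task's bandwidth */
	raw_spin_unlock(&dl_b->dl_runtime_lock);

	return ok;
}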

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: linux-kernel@vger.kernel.org
---
 init/Kconfig             |  12 ++++
 kernel/sched/autogroup.c |   7 +++
 kernel/sched/core.c      |  54 +++++++++++++++-
 kernel/sched/deadline.c  | 159 ++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/rt.c        |  52 ++++++++++++++--
 kernel/sched/sched.h     |  20 +++++-
 6 files changed, 292 insertions(+), 12 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index e37f4b2a6445..c6ddda90d51f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -751,6 +751,18 @@ config RT_GROUP_SCHED
 	  realtime bandwidth for them.
 	  See Documentation/scheduler/sched-rt-group.txt for more information.
 
+config DEADLINE_GROUP_SCHED
+	bool "Group scheduling for SCHED_DEADLINE"
+	depends on CGROUP_SCHED
+	select RT_GROUP_SCHED
+	default n
+	help
+	  This feature lets you explicitly specify, in terms of runtime
+	  and period, the bandwidth of each task control group. This means
+	  tasks (and other groups) can be added to a group only up to its
+	  "bandwidth cap", which might be useful for avoiding or
+	  controlling oversubscription.
+
 endif #CGROUP_SCHED
 
 config CGROUP_PIDS
diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c
index a43df5193538..7cba2e132ac7 100644
--- a/kernel/sched/autogroup.c
+++ b/kernel/sched/autogroup.c
@@ -90,6 +90,13 @@ static inline struct autogroup *autogroup_create(void)
 	free_rt_sched_group(tg);
 	tg->rt_se = root_task_group.rt_se;
 	tg->rt_rq = root_task_group.rt_rq;
+#endif
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	/*
+	 * Similarly to the above, do the same for DEADLINE tasks.
+	 */
+	free_dl_sched_group(tg);
+	tg->dl_rq = root_task_group.dl_rq;
 #endif
 	tg->autogroup = ag;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 772a6b3239eb..8bb3e74b9486 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4225,7 +4225,8 @@ static int __sched_setscheduler(struct task_struct *p,
 #endif
 #ifdef CONFIG_SMP
 		if (dl_bandwidth_enabled() && dl_policy(policy) &&
-				!(attr->sched_flags & SCHED_FLAG_SUGOV)) {
+				!(attr->sched_flags & SCHED_FLAG_SUGOV) &&
+				!task_group_is_autogroup(task_group(p))) {
 			cpumask_t *span = rq->rd->span;
 
 			/*
@@ -5900,6 +5901,9 @@ void __init sched_init(void)
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	alloc_size += 2 * nr_cpu_ids * sizeof(void **);
+#endif
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	alloc_size += nr_cpu_ids * sizeof(void **);
 #endif
 	if (alloc_size) {
 		ptr = (unsigned long)kzalloc(alloc_size, GFP_NOWAIT);
@@ -5920,6 +5924,11 @@ void __init sched_init(void)
 		ptr += nr_cpu_ids * sizeof(void **);
 
 #endif /* CONFIG_RT_GROUP_SCHED */
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+		root_task_group.dl_rq = (struct dl_rq **)ptr;
+		ptr += nr_cpu_ids * sizeof(void **);
+
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
 	}
 #ifdef CONFIG_CPUMASK_OFFSTACK
 	for_each_possible_cpu(i) {
@@ -5941,6 +5950,11 @@ void __init sched_init(void)
 	init_rt_bandwidth(&root_task_group.rt_bandwidth,
 			global_rt_period(), global_rt_runtime());
 #endif /* CONFIG_RT_GROUP_SCHED */
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	init_dl_bandwidth(&root_task_group.dl_bandwidth,
+			global_rt_period(), global_rt_runtime());
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
+
 
 #ifdef CONFIG_CGROUP_SCHED
 	task_group_cache = KMEM_CACHE(task_group, 0);
@@ -5993,6 +6007,10 @@ void __init sched_init(void)
 #ifdef CONFIG_RT_GROUP_SCHED
 		init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
 #endif
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+		init_tg_dl_entry(&root_task_group, &rq->dl, NULL, i, NULL);
+#endif
+
 
 		for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
 			rq->cpu_load[j] = 0;
@@ -6225,6 +6243,7 @@ static void sched_free_group(struct task_group *tg)
 {
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
+	free_dl_sched_group(tg);
 	autogroup_free(tg);
 	kmem_cache_free(task_group_cache, tg);
 }
@@ -6244,6 +6263,9 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+	if (!alloc_dl_sched_group(tg, parent))
+		goto err;
+
 	return tg;
 
 err:
@@ -6427,14 +6449,20 @@ static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
 	int ret = 0;
 
 	cgroup_taskset_for_each(task, css, tset) {
+#if defined CONFIG_DEADLINE_GROUP_SCHED || defined CONFIG_RT_GROUP_SCHED
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+		if (!sched_dl_can_attach(css_tg(css), task))
+			return -EINVAL;
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
 #ifdef CONFIG_RT_GROUP_SCHED
 		if (!sched_rt_can_attach(css_tg(css), task))
 			return -EINVAL;
+#endif /* CONFIG_RT_GROUP_SCHED */
 #else
 		/* We don't support RT-tasks being in separate groups */
 		if (task->sched_class != &fair_sched_class)
 			return -EINVAL;
-#endif
+#endif /* CONFIG_DEADLINE_GROUP_SCHED || CONFIG_RT_GROUP_SCHED */
 		/*
 		 * Serialize against wake_up_new_task() such that if its
 		 * running, we're sure to observe its full state.
@@ -6750,6 +6778,18 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 	return sched_group_rt_period(css_tg(css));
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+static u64 cpu_dl_bw_read(struct cgroup_subsys_state *css,
+			  struct cftype *cft)
+{
+	return sched_group_dl_bw(css_tg(css));
+}
+static u64 cpu_dl_total_bw_read(struct cgroup_subsys_state *css,
+				struct cftype *cft)
+{
+	return sched_group_dl_total_bw(css_tg(css));
+}
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
 
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -6786,6 +6826,16 @@ static struct cftype cpu_legacy_files[] = {
 		.read_u64 = cpu_rt_period_read_uint,
 		.write_u64 = cpu_rt_period_write_uint,
 	},
+#endif
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	{
+		.name = "dl_bw",
+		.read_u64 = cpu_dl_bw_read,
+	},
+	{
+		.name = "dl_total_bw",
+		.read_u64 = cpu_dl_total_bw_read,
+	},
 #endif
 	{ }	/* Terminate */
 };
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index de19bd7feddb..25ed0a01623e 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -361,7 +361,6 @@ void init_dl_rq(struct dl_rq *dl_rq)
 	dl_rq->overloaded = 0;
 	dl_rq->pushable_dl_tasks_root = RB_ROOT_CACHED;
 #else
-	init_dl_bandwidth(&dl_rq->dl_bw);
 	init_dl_bandwidth(&dl_rq->dl_bw, global_rt_period(), global_rt_runtime());
 #endif
 
@@ -370,6 +369,129 @@ void init_dl_rq(struct dl_rq *dl_rq)
 	init_dl_rq_bw_ratio(dl_rq);
 }
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+u64 sched_group_dl_bw(struct task_group *tg)
+{
+	return tg->dl_bandwidth.dl_bw;
+}
+
+u64 sched_group_dl_total_bw(struct task_group *tg)
+{
+	return tg->dl_bandwidth.dl_total_bw;
+}
+
+/* Must be called with tasklist_lock held */
+int tg_has_dl_tasks(struct task_group *tg)
+{
+	struct task_struct *g, *p;
+
+	/*
+	 * Autogroups do not have DL tasks; see autogroup_create().
+	 */
+	if (task_group_is_autogroup(tg))
+		return 0;
+
+	do_each_thread(g, p) {
+		if (task_has_dl_policy(p) && task_group(p) == tg)
+			return 1;
+	} while_each_thread(g, p);
+
+	return 0;
+}
+
+int sched_dl_can_attach(struct task_group *tg, struct task_struct *tsk)
+{
+	int cpus, ret = 1;
+	struct rq_flags rf;
+	struct task_group *orig_tg;
+	struct rq *rq = task_rq_lock(tsk, &rf);
+
+	if (!dl_task(tsk))
+		goto unlock_rq;
+
+	/* Don't accept tasks when there is no way for them to run */
+	if (tg->dl_bandwidth.dl_runtime == 0) {
+		ret = 0;
+		goto unlock_rq;
+	}
+
+	/*
+	 * Check that the group has enough bandwidth left to accept this task.
+	 *
+	 * If there is space for the task:
+	 *   - reserve space for it in destination group
+	 *   - remove task bandwidth contribution from current group
+	 */
+	raw_spin_lock(&tg->dl_bandwidth.dl_runtime_lock);
+	cpus = dl_bw_cpus(task_cpu(tsk));
+	if (__dl_overflow(&tg->dl_bandwidth, cpus, 0, tsk->dl.dl_bw)) {
+		ret = 0;
+	} else {
+		tg->dl_bandwidth.dl_total_bw += tsk->dl.dl_bw;
+	}
+	raw_spin_unlock(&tg->dl_bandwidth.dl_runtime_lock);
+
+	/*
+	 * We managed to allocate tsk's bandwidth in the new group, so
+	 * remove it from the old one.
+	 * Doing it here is preferable to taking both
+	 * dl_runtime_locks at the same time.
+	 */
+	if (ret) {
+		orig_tg = task_group(tsk);
+		raw_spin_lock(&orig_tg->dl_bandwidth.dl_runtime_lock);
+		orig_tg->dl_bandwidth.dl_total_bw -= tsk->dl.dl_bw;
+		raw_spin_unlock(&orig_tg->dl_bandwidth.dl_runtime_lock);
+	}
+
+unlock_rq:
+	task_rq_unlock(rq, tsk, &rf);
+
+	return ret;
+}
+
+void init_tg_dl_entry(struct task_group *tg, struct dl_rq *dl_rq,
+		struct sched_dl_entity *dl_se, int cpu,
+		struct sched_dl_entity *parent)
+{
+	tg->dl_rq[cpu] = dl_rq;
+}
+
+int alloc_dl_sched_group(struct task_group *tg, struct task_group *parent)
+{
+	struct rq *rq;
+	int i;
+
+	tg->dl_rq = kzalloc(sizeof(struct dl_rq *) * nr_cpu_ids, GFP_KERNEL);
+	if (!tg->dl_rq)
+		return 0;
+
+	init_dl_bandwidth(&tg->dl_bandwidth,
+			ktime_to_ns(def_dl_bandwidth.dl_period), 0);
+
+	for_each_possible_cpu(i) {
+		rq = cpu_rq(i);
+		init_tg_dl_entry(tg, &rq->dl, NULL, i, NULL);
+	}
+
+	return 1;
+}
+
+void free_dl_sched_group(struct task_group *tg)
+{
+	kfree(tg->dl_rq);
+}
+
+#else /* !CONFIG_DEADLINE_GROUP_SCHED */
+int alloc_dl_sched_group(struct task_group *tg, struct task_group *parent)
+{
+	return 1;
+}
+
+void free_dl_sched_group(struct task_group *tg) { }
+
+#endif /*CONFIG_DEADLINE_GROUP_SCHED*/
+
 #ifdef CONFIG_SMP
 
 static inline int dl_overloaded(struct rq *rq)
@@ -1223,14 +1345,23 @@ static void update_curr_dl(struct rq *rq)
 	 * account our runtime there too, otherwise actual rt tasks
 	 * would be able to exceed the shared quota.
 	 *
-	 * Account to the root rt group for now.
+	 * Account to curr's group, or the root rt group if group scheduling
+	 * is not in use. XXX if RT_RUNTIME_SHARE is enabled we should
+	 * probably split accounting between all rd rt_rq(s), but locking is
+	 * ugly. :/
 	 *
 	 * The solution we're working towards is having the RT groups scheduled
 	 * using deadline servers -- however there's a few nasties to figure
 	 * out before that can happen.
 	 */
 	if (rt_bandwidth_enabled()) {
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+		struct rt_bandwidth *rt_b =
+			sched_rt_bandwidth_tg(task_group(curr));
+		struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, cpu_of(rq));
+#else
 		struct rt_rq *rt_rq = &rq->rt;
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
 
 		raw_spin_lock(&rt_rq->rt_runtime_lock);
 		/*
@@ -1267,6 +1398,14 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 		raw_spin_lock(&dl_b->dl_runtime_lock);
 		__dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
 		raw_spin_unlock(&dl_b->dl_runtime_lock);
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+		{
+		struct dl_bandwidth *tg_b = &task_group(p)->dl_bandwidth;
+		raw_spin_lock(&tg_b->dl_runtime_lock);
+		tg_b->dl_total_bw -= p->dl.dl_bw;
+		raw_spin_unlock(&tg_b->dl_runtime_lock);
+		}
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
 		__dl_clear_params(p);
 
 		goto unlock;
@@ -2488,7 +2627,7 @@ int sched_dl_overflow(struct task_struct *p, int policy,
 	u64 period = attr->sched_period ?: attr->sched_deadline;
 	u64 runtime = attr->sched_runtime;
 	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
-	int cpus, err = -1;
+	int cpus, err = -1, change = 0;
 
 	if (attr->sched_flags & SCHED_FLAG_SUGOV)
 		return 0;
@@ -2522,6 +2661,7 @@ int sched_dl_overflow(struct task_struct *p, int policy,
 		__dl_sub(dl_b, p->dl.dl_bw, cpus);
 		__dl_add(dl_b, new_bw, cpus);
 		dl_change_utilization(p, new_bw);
+		change = 1;
 		err = 0;
 	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
 		/*
@@ -2533,6 +2673,19 @@ int sched_dl_overflow(struct task_struct *p, int policy,
 	}
 	raw_spin_unlock(&dl_b->dl_runtime_lock);
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	/* Add new_bw to the task group p belongs to. */
+	if (!err) {
+		struct dl_bandwidth *tg_b = &task_group(p)->dl_bandwidth;
+
+		raw_spin_lock(&tg_b->dl_runtime_lock);
+		if (change)
+			tg_b->dl_total_bw -= p->dl.dl_bw;
+		tg_b->dl_total_bw += new_bw;
+		raw_spin_unlock(&tg_b->dl_runtime_lock);
+	}
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
+
 	return err;
 }
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 862a513adca3..70d7d3b71f81 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -547,7 +547,6 @@ static inline const struct cpumask *sched_rt_period_mask(void)
 }
 #endif
 
-static inline
 struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
 {
 	return container_of(rt_b, struct task_group, rt_bandwidth)->rt_rq[cpu];
@@ -558,6 +557,11 @@ static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
 	return &rt_rq->tg->rt_bandwidth;
 }
 
+struct rt_bandwidth *sched_rt_bandwidth_tg(struct task_group *tg)
+{
+	return &tg->rt_bandwidth;
+}
+
 #else /* !CONFIG_RT_GROUP_SCHED */
 
 static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)
@@ -609,7 +613,6 @@ static inline const struct cpumask *sched_rt_period_mask(void)
 	return cpu_online_mask;
 }
 
-static inline
 struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
 {
 	return &cpu_rq(cpu)->rt;
@@ -620,14 +623,20 @@ static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
 	return &def_rt_bandwidth;
 }
 
+struct rt_bandwidth *sched_rt_bandwidth_tg(struct task_group *tg)
+{
+	return &def_rt_bandwidth;
+}
+
 #endif /* CONFIG_RT_GROUP_SCHED */
 
 bool sched_rt_bandwidth_account(struct rt_rq *rt_rq)
 {
 	struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
 
-	return (hrtimer_active(&rt_b->rt_period_timer) ||
-		rt_rq->rt_time < rt_b->rt_runtime);
+	return (rt_rq->rt_nr_running &&
+		(hrtimer_active(&rt_b->rt_period_timer) ||
+		 rt_rq->rt_time < rt_b->rt_runtime));
 }
 
 #ifdef CONFIG_SMP
@@ -2423,9 +2432,14 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
 		return -EINVAL;
 
 	/*
-	 * Ensure we don't starve existing RT tasks.
+	 * Ensure we don't starve existing RT or DEADLINE tasks.
 	 */
-	if (rt_bandwidth_enabled() && !runtime && tg_has_rt_tasks(tg))
+	if (rt_bandwidth_enabled() && !runtime &&
+			(tg_has_rt_tasks(tg)
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+			 || tg_has_dl_tasks(tg)
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
+			 ))
 		return -EBUSY;
 
 	total = to_ratio(period, runtime);
@@ -2436,8 +2450,19 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
 	if (total > to_ratio(global_rt_period(), global_rt_runtime()))
 		return -EINVAL;
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	/*
+	 * If decreasing our own bandwidth we must be sure we didn't already
+	 * allocate too much bandwidth.
+	 */
+	if (total < tg->dl_bandwidth.dl_total_bw)
+		return -EBUSY;
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
+
 	/*
 	 * The sum of our children's runtime should not exceed our own.
+	 * Also check that none of our children already allocated more than
+	 * the new bandwidth we want to set for ourselves.
 	 */
 	list_for_each_entry_rcu(child, &tg->children, siblings) {
 		period = ktime_to_ns(child->rt_bandwidth.rt_period);
@@ -2448,6 +2473,11 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
 			runtime = d->rt_runtime;
 		}
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+		if (total < child->dl_bandwidth.dl_total_bw)
+			return -EBUSY;
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
+
 		sum += to_ratio(period, runtime);
 	}
 
@@ -2507,6 +2537,16 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
 		rt_rq->rt_runtime = rt_runtime;
 		raw_spin_unlock(&rt_rq->rt_runtime_lock);
 	}
+
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	raw_spin_lock(&tg->dl_bandwidth.dl_runtime_lock);
+	tg->dl_bandwidth.dl_period = tg->rt_bandwidth.rt_period;
+	tg->dl_bandwidth.dl_runtime = tg->rt_bandwidth.rt_runtime;
+	tg->dl_bandwidth.dl_bw =
+		to_ratio(tg->dl_bandwidth.dl_period,
+			 tg->dl_bandwidth.dl_runtime);
+	raw_spin_unlock(&tg->dl_bandwidth.dl_runtime_lock);
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
 	raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
 unlock:
 	read_unlock(&tasklist_lock);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7c44c8baa98c..850aacc8f241 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -285,6 +285,7 @@ extern bool dl_cpu_busy(unsigned int cpu);
 
 struct cfs_rq;
 struct rt_rq;
+struct dl_rq;
 
 extern struct list_head task_groups;
 
@@ -333,6 +334,10 @@ struct task_group {
 
 	struct rt_bandwidth rt_bandwidth;
 #endif
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	struct dl_rq **dl_rq;
+	struct dl_bandwidth dl_bandwidth;
+#endif
 
 	struct rcu_head rcu;
 	struct list_head list;
@@ -404,6 +409,19 @@ extern int sched_group_set_rt_period(struct task_group *tg, u64 rt_period_us);
 extern long sched_group_rt_runtime(struct task_group *tg);
 extern long sched_group_rt_period(struct task_group *tg);
 extern int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk);
+extern struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu);
+extern struct rt_bandwidth *sched_rt_bandwidth_tg(struct task_group *tg);
+
+extern void free_dl_sched_group(struct task_group *tg);
+extern int alloc_dl_sched_group(struct task_group *tg, struct task_group *parent);
+extern void init_tg_dl_entry(struct task_group *tg, struct dl_rq *dl_rq,
+		struct sched_dl_entity *dl_se, int cpu,
+		struct sched_dl_entity *parent);
+extern int tg_has_dl_tasks(struct task_group *tg);
+extern u64 sched_group_dl_bw(struct task_group *tg);
+extern u64 sched_group_dl_total_bw(struct task_group *tg);
+extern int sched_dl_can_attach(struct task_group *tg, struct task_struct *tsk);
+
 
 extern struct task_group *sched_create_group(struct task_group *parent);
 extern void sched_online_group(struct task_group *tg,
@@ -1194,7 +1212,7 @@ static inline struct task_group *task_group(struct task_struct *p)
 /* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
 static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 {
-#if defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED)
+#if defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_DEADLINE_GROUP_SCHED)
 	struct task_group *tg = task_group(p);
 #endif
 
-- 
2.14.3

* [RFC PATCH 3/3] Documentation/scheduler/sched-deadline: add info about cgroup support
  2018-02-12 13:40 [RFC PATCH 0/3] SCHED_DEADLINE cgroups support Juri Lelli
  2018-02-12 13:40 ` [RFC PATCH 1/3] sched/deadline: merge dl_bw into dl_bandwidth Juri Lelli
  2018-02-12 13:40 ` [RFC PATCH 2/3] sched/deadline: add task groups bandwidth management support Juri Lelli
@ 2018-02-12 13:40 ` Juri Lelli
  2 siblings, 0 replies; 10+ messages in thread
From: Juri Lelli @ 2018-02-12 13:40 UTC (permalink / raw)
  To: peterz, mingo
  Cc: linux-kernel, tglx, vincent.guittot, rostedt, luca.abeni,
	claudio, tommaso.cucinotta, bristot, mathieu.poirier, tkjos,
	joelaf, morten.rasmussen, dietmar.eggemann, patrick.bellasi,
	alessio.balsini, juri.lelli, Tejun Heo, Jonathan Corbet,
	linux-doc

Add documentation for SCHED_DEADLINE cgroup support (the
CONFIG_DEADLINE_GROUP_SCHED config option).

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: linux-kernel@vger.kernel.org
Cc: linux-doc@vger.kernel.org
---
 Documentation/scheduler/sched-deadline.txt | 36 ++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 9 deletions(-)

diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
index 8ce78f82ae23..65d55c778976 100644
--- a/Documentation/scheduler/sched-deadline.txt
+++ b/Documentation/scheduler/sched-deadline.txt
@@ -528,11 +528,8 @@ CONTENTS
  to -deadline tasks is similar to the one already used for -rt
  tasks with real-time group scheduling (a.k.a. RT-throttling - see
  Documentation/scheduler/sched-rt-group.txt), and is based on readable/
- writable control files located in procfs (for system wide settings).
- Notice that per-group settings (controlled through cgroupfs) are still not
- defined for -deadline tasks, because more discussion is needed in order to
- figure out how we want to manage SCHED_DEADLINE bandwidth at the task group
- level.
+ writable control files located in procfs (for system wide settings) and in
+ cgroupfs (per-group settings).
 
  A main difference between deadline bandwidth management and RT-throttling
  is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
@@ -553,9 +550,9 @@ CONTENTS
  For now the -rt knobs are used for -deadline admission control and the
  -deadline runtime is accounted against the -rt runtime. We realize that this
  isn't entirely desirable; however, it is better to have a small interface for
- now, and be able to change it easily later. The ideal situation (see 5.) is to
- run -rt tasks from a -deadline server; in which case the -rt bandwidth is a
- direct subset of dl_bw.
+ now, and be able to change it easily later. The ideal situation (see 6.) is to
+ run -rt tasks from a -deadline server (H-CBS); in which case the -rt bandwidth
+ is a direct subset of dl_bw.
 
  This means that, for a root_domain comprising M CPUs, -deadline tasks
  can be created while the sum of their bandwidths stays below:
@@ -623,6 +620,27 @@ CONTENTS
  make the leftoever runtime available for reclamation by other
  SCHED_DEADLINE tasks.
 
+4.4 Grouping tasks
+------------------
+
+CONFIG_DEADLINE_GROUP_SCHED depends on CONFIG_RT_GROUP_SCHED, so go on and
+read Documentation/scheduler/sched-rt-group.txt first.
+
+Enabling CONFIG_DEADLINE_GROUP_SCHED lets you explicitly manage CPU bandwidth
+for task groups.
+
+This uses the cgroup virtual file system: "<cgroup>/cpu.rt_runtime_us" and
+"<cgroup>/cpu.rt_period_us" control the CPU time reserved for each control
+group. Yes, they are the same as in CONFIG_RT_GROUP_SCHED, since RT and
+DEADLINE share the same bandwidth. In addition to these,
+CONFIG_DEADLINE_GROUP_SCHED adds "<cgroup>/cpu.dl_bw" (maximum bandwidth on
+each CPU available to the group, corresponding to
+cpu.rt_runtime_us/cpu.rt_period_us) and "<cgroup>/cpu.dl_total_bw" (the
+group's currently allocated bandwidth); both are non-writable.
+
+Group settings are checked against the same limits as in CONFIG_RT_GROUP_SCHED:
+
+   \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period
 
 5. Tasks CPU affinity
 =====================
@@ -661,7 +679,7 @@ CONTENTS
     of retaining bandwidth isolation among non-interacting tasks. This is
     being studied from both theoretical and practical points of view, and
     hopefully we should be able to produce some demonstrative code soon;
-  - (c)group based bandwidth management, and maybe scheduling;
+  - (c)group based scheduling (Hierarchical-CBS);
   - access control for non-root users (and related security concerns to
     address), which is the best way to allow unprivileged use of the mechanisms
     and how to prevent non-root users "cheat" the system?
-- 
2.14.3

* Re: [RFC PATCH 2/3] sched/deadline: add task groups bandwidth management support
  2018-02-12 13:40 ` [RFC PATCH 2/3] sched/deadline: add task groups bandwidth management support Juri Lelli
@ 2018-02-12 16:47   ` Tejun Heo
  2018-02-12 17:09     ` Juri Lelli
  0 siblings, 1 reply; 10+ messages in thread
From: Tejun Heo @ 2018-02-12 16:47 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, linux-kernel, tglx, vincent.guittot, rostedt,
	luca.abeni, claudio, tommaso.cucinotta, bristot, mathieu.poirier,
	tkjos, joelaf, morten.rasmussen, dietmar.eggemann,
	patrick.bellasi, alessio.balsini

Hello,

On Mon, Feb 12, 2018 at 02:40:29PM +0100, Juri Lelli wrote:
>  - implementation _is not_ hierarchical: only single/plain DEADLINE entities
>    can be handled, and they get scheduled at root rq level

This usually is a deal breaker and often indicates that the cgroup
filesystem is not the right interface for the feature.  Can you please
elaborate the interface with some details?

Thanks.

-- 
tejun

* Re: [RFC PATCH 2/3] sched/deadline: add task groups bandwidth management support
  2018-02-12 16:47   ` Tejun Heo
@ 2018-02-12 17:09     ` Juri Lelli
  0 siblings, 0 replies; 10+ messages in thread
From: Juri Lelli @ 2018-02-12 17:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: peterz, mingo, linux-kernel, tglx, vincent.guittot, rostedt,
	luca.abeni, claudio, tommaso.cucinotta, bristot, mathieu.poirier,
	tkjos, joelaf, morten.rasmussen, dietmar.eggemann,
	patrick.bellasi, alessio.balsini

Hi,

On 12/02/18 08:47, Tejun Heo wrote:
> Hello,
> 
> On Mon, Feb 12, 2018 at 02:40:29PM +0100, Juri Lelli wrote:
> >  - implementation _is not_ hierarchical: only single/plain DEADLINE entities
> >    can be handled, and they get scheduled at root rq level
> 
> This usually is a deal breaker and often indicates that the cgroup
> filesystem is not the right interface for the feature.  Can you please
> elaborate the interface with some details?

The interface is the same as what we have today for groups of RT tasks,
and the same rules apply. The difference is that for RT,
<group>/cpu.rt_runtime_us and <group>/cpu.rt_period_us control
RT-Throttling behaviour (fraction of CPU time and granularity), while
for DEADLINE the same interface would be used only at admission control
time (while servicing a sched_setattr(), attaching tasks to a group or
changing a group's parameters), since DEADLINE tasks have their own
throttling mechanism already.

Intended usage should be very similar. For example, a sys admin who
wants to reserve and guarantee CPU bandwidth for a group of tasks would
create a group, configure its rt_runtime_us and rt_period_us, and put
DEADLINE tasks inside it (e.g. a video/audio pipeline). Related to what I
was saying in the cover letter (i.e., non-root access to DEADLINE
scheduling), a different situation might be one where the sys admin wants
to grant a user a certain percentage of CPU time (by creating a group and
putting the user's session inside it) and also make sure the user doesn't
exceed what was granted. The user would then be free to spawn DEADLINE
tasks to service her/his needs, up to the maximum bandwidth cap set by
the sys admin.
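
Concretely, the first use case could look something like the sketch below
(hedged: it assumes a v1 cpu controller mounted at /sys/fs/cgroup/cpu and an
illustrative PID and budget; only the knob names come from the patches):

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* write a single value into a cgroup control file */
static void put(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return;
	}
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	/* carve out 20ms every 100ms for a "media" group ... */
	mkdir("/sys/fs/cgroup/cpu/media", 0755);
	put("/sys/fs/cgroup/cpu/media/cpu.rt_period_us", "100000");
	put("/sys/fs/cgroup/cpu/media/cpu.rt_runtime_us", "20000");

	/* ... move the pipeline's tasks into it (illustrative PID) ... */
	put("/sys/fs/cgroup/cpu/media/tasks", "12345");

	/* ... then cpu.dl_bw / cpu.dl_total_bw can be read back to monitor it */
	return 0;
}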

Does this make any sense and provide a bit more information?

Thanks a lot for looking at this!

Best,

- Juri

* Re: [RFC PATCH 1/3] sched/deadline: merge dl_bw into dl_bandwidth
  2018-02-12 13:40 ` [RFC PATCH 1/3] sched/deadline: merge dl_bw into dl_bandwidth Juri Lelli
@ 2018-02-12 17:34   ` Steven Rostedt
  2018-02-12 17:43     ` Juri Lelli
  0 siblings, 1 reply; 10+ messages in thread
From: Steven Rostedt @ 2018-02-12 17:34 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, linux-kernel, tglx, vincent.guittot, luca.abeni,
	claudio, tommaso.cucinotta, bristot, mathieu.poirier, tkjos,
	joelaf, morten.rasmussen, dietmar.eggemann, patrick.bellasi,
	alessio.balsini

On Mon, 12 Feb 2018 14:40:28 +0100
Juri Lelli <juri.lelli@redhat.com> wrote:

> + *  - dl_bw (< 100%) is the bandwidth of the system (domain) on each CPU;
> + *  - dl_total_bw array contains the currently allocated bandwidth on the
> + *    i-eth CPU.

The comment for dl_total_bw doesn't make sense. You mean that
dl_total_bw is the cpu's bandwidth? If so, let's not call it total,
because that would suggest it's the bandwidth of all CPUs. What about
dl_cpu_bw?

-- Steve

* Re: [RFC PATCH 1/3] sched/deadline: merge dl_bw into dl_bandwidth
  2018-02-12 17:34   ` Steven Rostedt
@ 2018-02-12 17:43     ` Juri Lelli
  2018-02-12 18:02       ` Steven Rostedt
  0 siblings, 1 reply; 10+ messages in thread
From: Juri Lelli @ 2018-02-12 17:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, mingo, linux-kernel, tglx, vincent.guittot, luca.abeni,
	claudio, tommaso.cucinotta, bristot, mathieu.poirier, tkjos,
	joelaf, morten.rasmussen, dietmar.eggemann, patrick.bellasi,
	alessio.balsini

On 12/02/18 12:34, Steven Rostedt wrote:
> On Mon, 12 Feb 2018 14:40:28 +0100
> Juri Lelli <juri.lelli@redhat.com> wrote:
> 
> > + *  - dl_bw (< 100%) is the bandwidth of the system (domain) on each CPU;
> > + *  - dl_total_bw array contains the currently allocated bandwidth on the
> > + *    i-eth CPU.
> 
> The comment for dl_total_bw doesn't make sense. You mean that
> dl_total_bw is the cpu's bandwidth? If so, let's not call it total,
> because that would suggest it's the bandwidth of all CPUs. What about
> dl_cpu_bw?

Huh, I meant to properly fix this comment (broken already in mainline),
but I only managed to do that (hopefully) in the next patch. :/

However, this surely needs to be fixed here as well. It's tracking the sum
of all tasks' bandwidth (across CPUs) admitted to the system, which is why
it's called dl_total_bw. It is incremented when a task passes sched_setattr()
and decremented when the task leaves the system or changes scheduling class.

Does it make a bit more sense? Would you still prefer a different name?

Thanks,

- Juri

* Re: [RFC PATCH 1/3] sched/deadline: merge dl_bw into dl_bandwidth
  2018-02-12 17:43     ` Juri Lelli
@ 2018-02-12 18:02       ` Steven Rostedt
  2018-02-12 18:17         ` Juri Lelli
  0 siblings, 1 reply; 10+ messages in thread
From: Steven Rostedt @ 2018-02-12 18:02 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, linux-kernel, tglx, vincent.guittot, luca.abeni,
	claudio, tommaso.cucinotta, bristot, mathieu.poirier, tkjos,
	joelaf, morten.rasmussen, dietmar.eggemann, patrick.bellasi,
	alessio.balsini

On Mon, 12 Feb 2018 18:43:12 +0100
Juri Lelli <juri.lelli@redhat.com> wrote:

> However, this surely needs to be fixed here. It's tracking the sum of
> all tasks' (across CPUs) bandwidth admitted on the system, so that's why
> it's called dl_total_bw. Incremented when a task passes sched_setattr()
> and decremented when it leaves the system or changes scheduling class.
> 
> Does it make a bit more sense? Would you still prefer a different name?

No the name is fine, the comment needs to change.

 - dl_total_bw - tracks the sum of all tasks' bandwidth across CPUs.

How's that?

-- Steve

* Re: [RFC PATCH 1/3] sched/deadline: merge dl_bw into dl_bandwidth
  2018-02-12 18:02       ` Steven Rostedt
@ 2018-02-12 18:17         ` Juri Lelli
  0 siblings, 0 replies; 10+ messages in thread
From: Juri Lelli @ 2018-02-12 18:17 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, mingo, linux-kernel, tglx, vincent.guittot, luca.abeni,
	claudio, tommaso.cucinotta, bristot, mathieu.poirier, tkjos,
	joelaf, morten.rasmussen, dietmar.eggemann, patrick.bellasi,
	alessio.balsini

On 12/02/18 13:02, Steven Rostedt wrote:
> On Mon, 12 Feb 2018 18:43:12 +0100
> Juri Lelli <juri.lelli@redhat.com> wrote:
> 
> > However, this surely needs to be fixed here. It's tracking the sum of
> > all tasks' (across CPUs) bandwidth admitted on the system, so that's why
> > it's called dl_total_bw. Incremented when a task passes sched_setattr()
> > and decremented when it leaves the system or changes scheduling class.
> > 
> > Does it make a bit more sense? Would you still prefer a different name?
> 
> No the name is fine, the comment needs to change.
> 
>  - dl_total_bw - tracks the sum of all tasks' bandwidth across CPUs.
> 
> How's that?

LGTM. I'll fix in next version.

Thanks!

- Juri
