* [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure
@ 2019-07-26 14:54 Peter Zijlstra
  2019-07-26 14:54 ` [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period Peter Zijlstra
                   ` (14 more replies)
  0 siblings, 15 replies; 63+ messages in thread
From: Peter Zijlstra @ 2019-07-26 14:54 UTC (permalink / raw)
  To: mingo, juri.lelli
  Cc: linux-kernel, dietmar.eggemann, luca.abeni, bristot, balsini,
	dvyukov, tglx, vpillai, rostedt, peterz

Hi,

So recently syzkaller ran into the big deadline/period issue (again), and I
figured I should at least propose a patch that puts limits on that -- see Patch 1.

During that discussion, SCHED_OTHER servers got mentioned (also again), and I
figured I should have a poke at that too. So I took some inspiration from
patches Alessio Balsini sent a while back and cobbled something together for
that as well.

Also included are a bunch of patches I did for core scheduling (2-8), which
I'm probably going to just merge, as they're generic cleanups. They're
included here because they make pick_next_task() simpler, which made
reasoning about the nested pick_next_task() logic inherent in servers less
of a headache. (I think it can work fine without them, but it's easier with
them in.)

Anyway, once it all compiled, it booted a kvm image all the way to userspace on
the first run, so clearly this code isn't to be trusted at all.

There are still lots of missing bits and pieces -- like changelogs -- and the
fair server isn't configurable or hooked into the bandwidth accounting, but
the foundation is there, I think.

Enjoy!



* [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period
  2019-07-26 14:54 [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure Peter Zijlstra
@ 2019-07-26 14:54 ` Peter Zijlstra
  2019-07-29  8:57   ` Juri Lelli
                     ` (3 more replies)
  2019-07-26 14:54 ` [RFC][PATCH 02/13] stop_machine: Fix stop_cpus_in_progress ordering Peter Zijlstra
                   ` (13 subsequent siblings)
  14 siblings, 4 replies; 63+ messages in thread
From: Peter Zijlstra @ 2019-07-26 14:54 UTC (permalink / raw)
  To: mingo, juri.lelli
  Cc: linux-kernel, dietmar.eggemann, luca.abeni, bristot, balsini,
	dvyukov, tglx, vpillai, rostedt, peterz


Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched/sysctl.h |    3 +++
 kernel/sched/deadline.c      |   23 +++++++++++++++++++++--
 kernel/sysctl.c              |   14 ++++++++++++++
 3 files changed, 38 insertions(+), 2 deletions(-)

--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -56,6 +56,9 @@ int sched_proc_update_handler(struct ctl
 extern unsigned int sysctl_sched_rt_period;
 extern int sysctl_sched_rt_runtime;
 
+extern unsigned int sysctl_sched_dl_period_max;
+extern unsigned int sysctl_sched_dl_period_min;
+
 #ifdef CONFIG_UCLAMP_TASK
 extern unsigned int sysctl_sched_uclamp_util_min;
 extern unsigned int sysctl_sched_uclamp_util_max;
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2597,6 +2597,14 @@ void __getparam_dl(struct task_struct *p
 }
 
 /*
+ * Default limits for DL period; on the top end we guard against small util
+ * tasks still getting ridiculously long effective runtimes, on the bottom end we
+ * guard against timer DoS.
+ */
+unsigned int sysctl_sched_dl_period_max = 1 << 22; /* ~4 seconds */
+unsigned int sysctl_sched_dl_period_min = 100;     /* 100 us */
+
+/*
  * This function validates the new parameters of a -deadline task.
  * We ask for the deadline not being zero, and greater or equal
  * than the runtime, as well as the period of being zero or
@@ -2608,6 +2616,8 @@ void __getparam_dl(struct task_struct *p
  */
 bool __checkparam_dl(const struct sched_attr *attr)
 {
+	u64 period, max, min;
+
 	/* special dl tasks don't actually use any parameter */
 	if (attr->sched_flags & SCHED_FLAG_SUGOV)
 		return true;
@@ -2631,12 +2641,21 @@ bool __checkparam_dl(const struct sched_
 	    attr->sched_period & (1ULL << 63))
 		return false;
 
+	period = attr->sched_period;
+	if (!period)
+		period = attr->sched_deadline;
+
 	/* runtime <= deadline <= period (if period != 0) */
-	if ((attr->sched_period != 0 &&
-	     attr->sched_period < attr->sched_deadline) ||
+	if (period < attr->sched_deadline ||
 	    attr->sched_deadline < attr->sched_runtime)
 		return false;
 
+	max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC;
+	min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC;
+
+	if (period < min || period > max)
+		return false;
+
 	return true;
 }
 
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -443,6 +443,20 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= sched_rt_handler,
 	},
 	{
+		.procname	= "sched_deadline_period_max_us",
+		.data		= &sysctl_sched_dl_period_max,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "sched_deadline_period_min_us",
+		.data		= &sysctl_sched_dl_period_min,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "sched_rr_timeslice_ms",
 		.data		= &sysctl_sched_rr_timeslice,
 		.maxlen		= sizeof(int),
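
For illustration only (not part of the patch): with the default limits above, a
SCHED_DEADLINE period shorter than 100us now fails the parameter check, and the
limits are tunable via /proc/sys/kernel/sched_deadline_period_{min,max}_us.
A rough userspace sketch, using the usual open-coded sched_setattr() syscall
(glibc has no wrapper); run as root, expect EINVAL once this patch is applied:

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/types.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE	6
#endif

/* Mirrors the uapi struct sched_attr layout used by sched_setattr(2). */
struct sched_attr {
	__u32 size;
	__u32 sched_policy;
	__u64 sched_flags;
	__s32 sched_nice;
	__u32 sched_priority;
	__u64 sched_runtime;
	__u64 sched_deadline;
	__u64 sched_period;
};

int main(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_DEADLINE,
		.sched_runtime	= 10 * 1000,	/* 10us */
		.sched_deadline	= 50 * 1000,	/* 50us */
		.sched_period	= 50 * 1000,	/* 50us, below the 100us default minimum */
	};

	/* Raw syscall; expected to fail with EINVAL with this patch. */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");

	return 0;
}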




* [RFC][PATCH 02/13] stop_machine: Fix stop_cpus_in_progress ordering
  2019-07-26 14:54 [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure Peter Zijlstra
  2019-07-26 14:54 ` [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period Peter Zijlstra
@ 2019-07-26 14:54 ` Peter Zijlstra
  2019-07-30 13:16   ` Phil Auld
  2019-07-30 13:22   ` Steven Rostedt
  2019-07-26 14:54 ` [RFC][PATCH 03/13] sched: Fix kerneldoc comment for ia64_set_curr_task Peter Zijlstra
                   ` (12 subsequent siblings)
  14 siblings, 2 replies; 63+ messages in thread
From: Peter Zijlstra @ 2019-07-26 14:54 UTC (permalink / raw)
  To: mingo, juri.lelli
  Cc: linux-kernel, dietmar.eggemann, luca.abeni, bristot, balsini,
	dvyukov, tglx, vpillai, rostedt, Valentin Schneider, Aaron Lu,
	keescook, Pawan Gupta, Phil Auld, torvalds, Tim Chen, fweisbec,
	subhra.mazumdar, Julien Desfossez, pjt, Nishanth Aravamudan,
	Aubrey Li, Mel Gorman, kerrnel, Paolo Bonzini,
	Peter Zijlstra (Intel)

Make sure stop_cpus_in_progress is set across the entire for loop: the added
barrier()s are compiler barriers, keeping the flag toggles from being
reordered relative to the queueing loop.

Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: keescook@chromium.org
Cc: mingo@kernel.org
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: torvalds@linux-foundation.org
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: fweisbec@gmail.com
Cc: subhra.mazumdar@oracle.com
Cc: tglx@linutronix.de
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: pjt@google.com
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
Cc: Aubrey Li <aubrey.intel@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: kerrnel@google.com
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/0fd8fd4b99b9b9aa88d8b2dff897f7fd0d88f72c.1559129225.git.vpillai@digitalocean.com
---
 kernel/stop_machine.c |    2 ++
 1 file changed, 2 insertions(+)

--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -383,6 +383,7 @@ static bool queue_stop_cpus_work(const s
 	 */
 	preempt_disable();
 	stop_cpus_in_progress = true;
+	barrier();
 	for_each_cpu(cpu, cpumask) {
 		work = &per_cpu(cpu_stopper.stop_work, cpu);
 		work->fn = fn;
@@ -391,6 +392,7 @@ static bool queue_stop_cpus_work(const s
 		if (cpu_stop_queue_work(cpu, work))
 			queued = true;
 	}
+	barrier();
 	stop_cpus_in_progress = false;
 	preempt_enable();
 




* [RFC][PATCH 03/13] sched: Fix kerneldoc comment for ia64_set_curr_task
  2019-07-26 14:54 [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure Peter Zijlstra
  2019-07-26 14:54 ` [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period Peter Zijlstra
  2019-07-26 14:54 ` [RFC][PATCH 02/13] stop_machine: Fix stop_cpus_in_progress ordering Peter Zijlstra
@ 2019-07-26 14:54 ` Peter Zijlstra
  2019-07-26 14:54 ` [RFC][PATCH 04/13] sched/{rt,deadline}: Fix set_next_task vs pick_next_task Peter Zijlstra
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2019-07-26 14:54 UTC (permalink / raw)
  To: mingo, juri.lelli
  Cc: linux-kernel, dietmar.eggemann, luca.abeni, bristot, balsini,
	dvyukov, tglx, vpillai, rostedt, peterz


Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6702,7 +6702,7 @@ struct task_struct *curr_task(int cpu)
 
 #ifdef CONFIG_IA64
 /**
- * set_curr_task - set the current task for a given CPU.
+ * ia64_set_curr_task - set the current task for a given CPU.
  * @cpu: the processor in question.
  * @p: the task pointer to set.
  *




* [RFC][PATCH 04/13] sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  2019-07-26 14:54 [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure Peter Zijlstra
                   ` (2 preceding siblings ...)
  2019-07-26 14:54 ` [RFC][PATCH 03/13] sched: Fix kerneldoc comment for ia64_set_curr_task Peter Zijlstra
@ 2019-07-26 14:54 ` Peter Zijlstra
  2019-07-29  9:25   ` Juri Lelli
  2019-07-26 14:54 ` [RFC][PATCH 05/13] sched: Add task_struct pointer to sched_class::set_curr_task Peter Zijlstra
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2019-07-26 14:54 UTC (permalink / raw)
  To: mingo, juri.lelli
  Cc: linux-kernel, dietmar.eggemann, luca.abeni, bristot, balsini,
	dvyukov, tglx, vpillai, rostedt, peterz

Because pick_next_task() implies set_curr_task() and some of the details
haven't mattered too much, some of what _should_ be in set_curr_task() ended
up in pick_next_task(); correct this.

This prepares the way for a pick_next_task() variant that does not
affect the current state, allowing remote picking.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/deadline.c |   23 ++++++++++++-----------
 kernel/sched/rt.c       |   27 ++++++++++++++-------------
 2 files changed, 26 insertions(+), 24 deletions(-)

--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1694,12 +1694,21 @@ static void start_hrtick_dl(struct rq *r
 }
 #endif
 
-static inline void set_next_task(struct rq *rq, struct task_struct *p)
+static void set_next_task_dl(struct rq *rq, struct task_struct *p)
 {
 	p->se.exec_start = rq_clock_task(rq);
 
 	/* You can't push away the running task */
 	dequeue_pushable_dl_task(rq, p);
+
+	if (hrtick_enabled(rq))
+		start_hrtick_dl(rq, p);
+
+	if (rq->curr->sched_class != &dl_sched_class)
+		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+
+	if (rq->curr != p)
+		deadline_queue_push_tasks(rq);
 }
 
 static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
@@ -1758,15 +1767,7 @@ pick_next_task_dl(struct rq *rq, struct
 
 	p = dl_task_of(dl_se);
 
-	set_next_task(rq, p);
-
-	if (hrtick_enabled(rq))
-		start_hrtick_dl(rq, p);
-
-	deadline_queue_push_tasks(rq);
-
-	if (rq->curr->sched_class != &dl_sched_class)
-		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+	set_next_task_dl(rq, p);
 
 	return p;
 }
@@ -1813,7 +1814,7 @@ static void task_fork_dl(struct task_str
 
 static void set_curr_task_dl(struct rq *rq)
 {
-	set_next_task(rq, rq->curr);
+	set_next_task_dl(rq, rq->curr);
 }
 
 #ifdef CONFIG_SMP
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1498,12 +1498,23 @@ static void check_preempt_curr_rt(struct
 #endif
 }
 
-static inline void set_next_task(struct rq *rq, struct task_struct *p)
+static inline void set_next_task_rt(struct rq *rq, struct task_struct *p)
 {
 	p->se.exec_start = rq_clock_task(rq);
 
 	/* The running task is never eligible for pushing */
 	dequeue_pushable_task(rq, p);
+
+	/*
+	 * If prev task was rt, put_prev_task() has already updated the
+	 * utilization. We only care of the case where we start to schedule a
+	 * rt task
+	 */
+	if (rq->curr->sched_class != &rt_sched_class)
+		update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+
+	if (rq->curr != p)
+		rt_queue_push_tasks(rq);
 }
 
 static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
@@ -1577,17 +1588,7 @@ pick_next_task_rt(struct rq *rq, struct
 
 	p = _pick_next_task_rt(rq);
 
-	set_next_task(rq, p);
-
-	rt_queue_push_tasks(rq);
-
-	/*
-	 * If prev task was rt, put_prev_task() has already updated the
-	 * utilization. We only care of the case where we start to schedule a
-	 * rt task
-	 */
-	if (rq->curr->sched_class != &rt_sched_class)
-		update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
+	set_next_task_rt(rq, p);
 
 	return p;
 }
@@ -2356,7 +2357,7 @@ static void task_tick_rt(struct rq *rq,
 
 static void set_curr_task_rt(struct rq *rq)
 {
-	set_next_task(rq, rq->curr);
+	set_next_task_rt(rq, rq->curr);
 }
 
 static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)




* [RFC][PATCH 05/13] sched: Add task_struct pointer to sched_class::set_curr_task
  2019-07-26 14:54 [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure Peter Zijlstra
                   ` (3 preceding siblings ...)
  2019-07-26 14:54 ` [RFC][PATCH 04/13] sched/{rt,deadline}: Fix set_next_task vs pick_next_task Peter Zijlstra
@ 2019-07-26 14:54 ` Peter Zijlstra
  2019-07-26 14:54 ` [RFC][PATCH 06/13] sched/fair: Export newidle_balance() Peter Zijlstra
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2019-07-26 14:54 UTC (permalink / raw)
  To: mingo, juri.lelli
  Cc: linux-kernel, dietmar.eggemann, luca.abeni, bristot, balsini,
	dvyukov, tglx, vpillai, rostedt, peterz

In preparation for further separating pick_next_task() and
set_curr_task(), we have to pass the actual task into set_curr_task();
while there, rename it to better pair with put_prev_task().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c      |   12 ++++++------
 kernel/sched/deadline.c  |    7 +------
 kernel/sched/fair.c      |   17 ++++++++++++++---
 kernel/sched/idle.c      |   27 +++++++++++++++------------
 kernel/sched/rt.c        |    7 +------
 kernel/sched/sched.h     |    8 +++++---
 kernel/sched/stop_task.c |   17 +++++++----------
 7 files changed, 49 insertions(+), 46 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1494,7 +1494,7 @@ void do_set_cpus_allowed(struct task_str
 	if (queued)
 		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 }
 
 /*
@@ -4273,7 +4273,7 @@ void rt_mutex_setprio(struct task_struct
 	if (queued)
 		enqueue_task(rq, p, queue_flag);
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 out_unlock:
@@ -4340,7 +4340,7 @@ void set_user_nice(struct task_struct *p
 			resched_curr(rq);
 	}
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 out_unlock:
 	task_rq_unlock(rq, p, &rf);
 }
@@ -4783,7 +4783,7 @@ static int __sched_setscheduler(struct t
 		enqueue_task(rq, p, queue_flags);
 	}
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 
@@ -5972,7 +5972,7 @@ void sched_setnuma(struct task_struct *p
 	if (queued)
 		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
 	if (running)
-		set_curr_task(rq, p);
+		set_next_task(rq, p);
 	task_rq_unlock(rq, p, &rf);
 }
 #endif /* CONFIG_NUMA_BALANCING */
@@ -6853,7 +6853,7 @@ void sched_move_task(struct task_struct
 	if (queued)
 		enqueue_task(rq, tsk, queue_flags);
 	if (running)
-		set_curr_task(rq, tsk);
+		set_next_task(rq, tsk);
 
 	task_rq_unlock(rq, tsk, &rf);
 }
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1812,11 +1812,6 @@ static void task_fork_dl(struct task_str
 	 */
 }
 
-static void set_curr_task_dl(struct rq *rq)
-{
-	set_next_task_dl(rq, rq->curr);
-}
-
 #ifdef CONFIG_SMP
 
 /* Only try algorithms three times */
@@ -2404,6 +2399,7 @@ const struct sched_class dl_sched_class
 
 	.pick_next_task		= pick_next_task_dl,
 	.put_prev_task		= put_prev_task_dl,
+	.set_next_task		= set_next_task_dl,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_dl,
@@ -2414,7 +2410,6 @@ const struct sched_class dl_sched_class
 	.task_woken		= task_woken_dl,
 #endif
 
-	.set_curr_task		= set_curr_task_dl,
 	.task_tick		= task_tick_dl,
 	.task_fork              = task_fork_dl,
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10121,9 +10121,19 @@ static void switched_to_fair(struct rq *
  * This routine is mostly called to set cfs_rq->curr field when a task
  * migrates between groups/classes.
  */
-static void set_curr_task_fair(struct rq *rq)
+static void set_next_task_fair(struct rq *rq, struct task_struct *p)
 {
-	struct sched_entity *se = &rq->curr->se;
+	struct sched_entity *se = &p->se;
+
+#ifdef CONFIG_SMP
+	if (task_on_rq_queued(p)) {
+		/*
+		 * Move the next running task to the front of the list, so our
+		 * cfs_tasks list becomes MRU one.
+		 */
+		list_move(&se->group_node, &rq->cfs_tasks);
+	}
+#endif
 
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -10394,7 +10404,9 @@ const struct sched_class fair_sched_clas
 	.check_preempt_curr	= check_preempt_wakeup,
 
 	.pick_next_task		= pick_next_task_fair,
+
 	.put_prev_task		= put_prev_task_fair,
+	.set_next_task          = set_next_task_fair,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_fair,
@@ -10407,7 +10419,6 @@ const struct sched_class fair_sched_clas
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
 
-	.set_curr_task          = set_curr_task_fair,
 	.task_tick		= task_tick_fair,
 	.task_fork		= task_fork_fair,
 
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -374,14 +374,25 @@ static void check_preempt_curr_idle(stru
 	resched_curr(rq);
 }
 
+static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
+{
+}
+
+static void set_next_task_idle(struct rq *rq, struct task_struct *next)
+{
+	update_idle_core(rq);
+	schedstat_inc(rq->sched_goidle);
+}
+
 static struct task_struct *
 pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
+	struct task_struct *next = rq->idle;
+
 	put_prev_task(rq, prev);
-	update_idle_core(rq);
-	schedstat_inc(rq->sched_goidle);
+	set_next_task_idle(rq, next);
 
-	return rq->idle;
+	return next;
 }
 
 /*
@@ -397,10 +408,6 @@ dequeue_task_idle(struct rq *rq, struct
 	raw_spin_lock_irq(&rq->lock);
 }
 
-static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
-{
-}
-
 /*
  * scheduler tick hitting a task of our scheduling class.
  *
@@ -413,10 +420,6 @@ static void task_tick_idle(struct rq *rq
 {
 }
 
-static void set_curr_task_idle(struct rq *rq)
-{
-}
-
 static void switched_to_idle(struct rq *rq, struct task_struct *p)
 {
 	BUG();
@@ -451,13 +454,13 @@ const struct sched_class idle_sched_clas
 
 	.pick_next_task		= pick_next_task_idle,
 	.put_prev_task		= put_prev_task_idle,
+	.set_next_task          = set_next_task_idle,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_idle,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
 
-	.set_curr_task          = set_curr_task_idle,
 	.task_tick		= task_tick_idle,
 
 	.get_rr_interval	= get_rr_interval_idle,
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2355,11 +2355,6 @@ static void task_tick_rt(struct rq *rq,
 	}
 }
 
-static void set_curr_task_rt(struct rq *rq)
-{
-	set_next_task_rt(rq, rq->curr);
-}
-
 static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)
 {
 	/*
@@ -2381,6 +2376,7 @@ const struct sched_class rt_sched_class
 
 	.pick_next_task		= pick_next_task_rt,
 	.put_prev_task		= put_prev_task_rt,
+	.set_next_task          = set_next_task_rt,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_rt,
@@ -2392,7 +2388,6 @@ const struct sched_class rt_sched_class
 	.switched_from		= switched_from_rt,
 #endif
 
-	.set_curr_task          = set_curr_task_rt,
 	.task_tick		= task_tick_rt,
 
 	.get_rr_interval	= get_rr_interval_rt,
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1711,6 +1711,7 @@ struct sched_class {
 					       struct task_struct *prev,
 					       struct rq_flags *rf);
 	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
+	void (*set_next_task)(struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
@@ -1725,7 +1726,6 @@ struct sched_class {
 	void (*rq_offline)(struct rq *rq);
 #endif
 
-	void (*set_curr_task)(struct rq *rq);
 	void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
 	void (*task_fork)(struct task_struct *p);
 	void (*task_dead)(struct task_struct *p);
@@ -1755,12 +1755,14 @@ struct sched_class {
 
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
+	WARN_ON_ONCE(rq->curr != prev);
 	prev->sched_class->put_prev_task(rq, prev);
 }
 
-static inline void set_curr_task(struct rq *rq, struct task_struct *curr)
+static inline void set_next_task(struct rq *rq, struct task_struct *next)
 {
-	curr->sched_class->set_curr_task(rq);
+	WARN_ON_ONCE(rq->curr != next);
+	next->sched_class->set_next_task(rq, next);
 }
 
 #ifdef CONFIG_SMP
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -23,6 +23,11 @@ check_preempt_curr_stop(struct rq *rq, s
 	/* we're never preempted */
 }
 
+static void set_next_task_stop(struct rq *rq, struct task_struct *stop)
+{
+	stop->se.exec_start = rq_clock_task(rq);
+}
+
 static struct task_struct *
 pick_next_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
@@ -32,8 +37,7 @@ pick_next_task_stop(struct rq *rq, struc
 		return NULL;
 
 	put_prev_task(rq, prev);
-
-	stop->se.exec_start = rq_clock_task(rq);
+	set_next_task_stop(rq, stop);
 
 	return stop;
 }
@@ -86,13 +90,6 @@ static void task_tick_stop(struct rq *rq
 {
 }
 
-static void set_curr_task_stop(struct rq *rq)
-{
-	struct task_struct *stop = rq->stop;
-
-	stop->se.exec_start = rq_clock_task(rq);
-}
-
 static void switched_to_stop(struct rq *rq, struct task_struct *p)
 {
 	BUG(); /* its impossible to change to this class */
@@ -128,13 +125,13 @@ const struct sched_class stop_sched_clas
 
 	.pick_next_task		= pick_next_task_stop,
 	.put_prev_task		= put_prev_task_stop,
+	.set_next_task          = set_next_task_stop,
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_stop,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 #endif
 
-	.set_curr_task          = set_curr_task_stop,
 	.task_tick		= task_tick_stop,
 
 	.get_rr_interval	= get_rr_interval_stop,
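
Not part of the patch, but for context: the core.c call sites touched above all
follow the usual change-of-class/priority pattern (as in __sched_setscheduler()
and friends), which after this rename pairs symmetrically with put_prev_task().
A condensed sketch of that pattern, with the locals named as in core.c:

	queued = task_on_rq_queued(p);
	running = task_current(rq, p);

	if (queued)
		dequeue_task(rq, p, queue_flags);
	if (running)
		put_prev_task(rq, p);

	/* ... change p's class / priority / parameters ... */

	if (queued)
		enqueue_task(rq, p, queue_flags);
	if (running)
		set_next_task(rq, p);	/* was set_curr_task(rq, p) */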




* [RFC][PATCH 06/13] sched/fair: Export newidle_balance()
  2019-07-26 14:54 [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure Peter Zijlstra
                   ` (4 preceding siblings ...)
  2019-07-26 14:54 ` [RFC][PATCH 05/13] sched: Add task_struct pointer to sched_class::set_curr_task Peter Zijlstra
@ 2019-07-26 14:54 ` Peter Zijlstra
  2019-07-26 14:54 ` [RFC][PATCH 07/13] sched: Allow put_prev_task() to drop rq->lock Peter Zijlstra
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2019-07-26 14:54 UTC (permalink / raw)
  To: mingo, juri.lelli
  Cc: linux-kernel, dietmar.eggemann, luca.abeni, bristot, balsini,
	dvyukov, tglx, vpillai, rostedt, peterz

For pick_next_task_fair() it is the newidle balance that requires
dropping the rq->lock; provided we do put_prev_task() early, we can
also detect the condition for doing newidle early.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c  |   18 ++++++++----------
 kernel/sched/sched.h |    4 ++++
 2 files changed, 12 insertions(+), 10 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3635,8 +3635,6 @@ static inline unsigned long cfs_rq_load_
 	return cfs_rq->avg.load_avg;
 }
 
-static int idle_balance(struct rq *this_rq, struct rq_flags *rf);
-
 static inline unsigned long task_util(struct task_struct *p)
 {
 	return READ_ONCE(p->se.avg.util_avg);
@@ -6848,11 +6846,10 @@ done: __maybe_unused;
 	return p;
 
 idle:
-	update_misfit_status(NULL, rq);
-	new_tasks = idle_balance(rq, rf);
+	new_tasks = newidle_balance(rq, rf);
 
 	/*
-	 * Because idle_balance() releases (and re-acquires) rq->lock, it is
+	 * Because newidle_balance() releases (and re-acquires) rq->lock, it is
 	 * possible for any higher priority task to appear. In that case we
 	 * must re-start the pick_next_entity() loop.
 	 */
@@ -9016,10 +9013,10 @@ static int load_balance(int this_cpu, st
 	ld_moved = 0;
 
 	/*
-	 * idle_balance() disregards balance intervals, so we could repeatedly
-	 * reach this code, which would lead to balance_interval skyrocketting
-	 * in a short amount of time. Skip the balance_interval increase logic
-	 * to avoid that.
+	 * newidle_balance() disregards balance intervals, so we could
+	 * repeatedly reach this code, which would lead to balance_interval
+	 * skyrocketting in a short amount of time. Skip the balance_interval
+	 * increase logic to avoid that.
 	 */
 	if (env.idle == CPU_NEWLY_IDLE)
 		goto out;
@@ -9729,7 +9726,7 @@ static inline void nohz_newidle_balance(
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
  */
-static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
+int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 {
 	unsigned long next_balance = jiffies + HZ;
 	int this_cpu = this_rq->cpu;
@@ -9737,6 +9734,7 @@ static int idle_balance(struct rq *this_
 	int pulled_task = 0;
 	u64 curr_cost = 0;
 
+	update_misfit_status(NULL, this_rq);
 	/*
 	 * We must set idle_stamp _before_ calling idle_balance(), such that we
 	 * measure the duration of idle_balance() as idle time.
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1449,10 +1449,14 @@ static inline void unregister_sched_doma
 }
 #endif
 
+extern int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
+
 #else
 
 static inline void sched_ttwu_pending(void) { }
 
+static inline int newidle_balance(struct rq *this_rq, struct rq_flags *rf) { return 0; }
+
 #endif /* CONFIG_SMP */
 
 #include "stats.h"




* [RFC][PATCH 07/13] sched: Allow put_prev_task() to drop rq->lock
  2019-07-26 14:54 [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure Peter Zijlstra
                   ` (5 preceding siblings ...)
  2019-07-26 14:54 ` [RFC][PATCH 06/13] sched/fair: Export newidle_balance() Peter Zijlstra
@ 2019-07-26 14:54 ` Peter Zijlstra
  2019-07-26 14:54 ` [RFC][PATCH 08/13] sched: Rework pick_next_task() slow-path Peter Zijlstra
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2019-07-26 14:54 UTC (permalink / raw)
  To: mingo, juri.lelli
  Cc: linux-kernel, dietmar.eggemann, luca.abeni, bristot, balsini,
	dvyukov, tglx, vpillai, rostedt, peterz

Currently the pick_next_task() loop is convoluted and ugly because of
how it can drop the rq->lock and needs to restart the picking.

For the RT/Deadline classes, put_prev_task() is where we do the balancing
pull, and we could do this before the picking loop. Make this possible.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c      |    2 +-
 kernel/sched/deadline.c  |   14 +++++++++++++-
 kernel/sched/fair.c      |    2 +-
 kernel/sched/idle.c      |    2 +-
 kernel/sched/rt.c        |   14 +++++++++++++-
 kernel/sched/sched.h     |    4 ++--
 kernel/sched/stop_task.c |    2 +-
 7 files changed, 32 insertions(+), 8 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6012,7 +6012,7 @@ static void calc_load_migrate(struct rq
 		atomic_long_add(delta, &calc_load_tasks);
 }
 
-static void put_prev_task_fake(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_fake(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 }
 
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1772,13 +1772,25 @@ pick_next_task_dl(struct rq *rq, struct
 	return p;
 }
 
-static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
+static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 {
 	update_curr_dl(rq);
 
 	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1);
 	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
+
+	if (rf && !on_dl_rq(&p->dl) && need_pull_dl_task(rq, p)) {
+		/*
+		 * This is OK, because current is on_cpu, which avoids it being
+		 * picked for load-balance and preemption/IRQs are still
+		 * disabled avoiding further scheduler activity on it and we've
+		 * not yet started the picking loop.
+		 */
+		rq_unpin_lock(rq, rf);
+		pull_dl_task(rq);
+		rq_repin_lock(rq, rf);
+	}
 }
 
 /*
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6871,7 +6871,7 @@ done: __maybe_unused;
 /*
  * Account for a descheduled task:
  */
-static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct sched_entity *se = &prev->se;
 	struct cfs_rq *cfs_rq;
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -374,7 +374,7 @@ static void check_preempt_curr_idle(stru
 	resched_curr(rq);
 }
 
-static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 }
 
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1593,7 +1593,7 @@ pick_next_task_rt(struct rq *rq, struct
 	return p;
 }
 
-static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
+static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 {
 	update_curr_rt(rq);
 
@@ -1605,6 +1605,18 @@ static void put_prev_task_rt(struct rq *
 	 */
 	if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
+
+	if (rf && !on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
+		/*
+		 * This is OK, because current is on_cpu, which avoids it being
+		 * picked for load-balance and preemption/IRQs are still
+		 * disabled avoiding further scheduler activity on it and we've
+		 * not yet started the picking loop.
+		 */
+		rq_unpin_lock(rq, rf);
+		pull_rt_task(rq);
+		rq_repin_lock(rq, rf);
+	}
 }
 
 #ifdef CONFIG_SMP
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1714,7 +1714,7 @@ struct sched_class {
 	struct task_struct * (*pick_next_task)(struct rq *rq,
 					       struct task_struct *prev,
 					       struct rq_flags *rf);
-	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
+	void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct rq_flags *rf);
 	void (*set_next_task)(struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
@@ -1760,7 +1760,7 @@ struct sched_class {
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
 	WARN_ON_ONCE(rq->curr != prev);
-	prev->sched_class->put_prev_task(rq, prev);
+	prev->sched_class->put_prev_task(rq, prev, NULL);
 }
 
 static inline void set_next_task(struct rq *rq, struct task_struct *next)
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -59,7 +59,7 @@ static void yield_task_stop(struct rq *r
 	BUG(); /* the stop task should never yield, its pointless. */
 }
 
-static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
+static void put_prev_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct task_struct *curr = rq->curr;
 	u64 delta_exec;




* [RFC][PATCH 08/13] sched: Rework pick_next_task() slow-path
  2019-07-26 14:54 [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure Peter Zijlstra
                   ` (6 preceding siblings ...)
  2019-07-26 14:54 ` [RFC][PATCH 07/13] sched: Allow put_prev_task() to drop rq->lock Peter Zijlstra
@ 2019-07-26 14:54 ` Peter Zijlstra
  2019-07-26 14:54 ` [RFC][PATCH 09/13] sched: Unify runtime accounting across classes Peter Zijlstra
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2019-07-26 14:54 UTC (permalink / raw)
  To: mingo, juri.lelli
  Cc: linux-kernel, dietmar.eggemann, luca.abeni, bristot, balsini,
	dvyukov, tglx, vpillai, rostedt, peterz

Avoid the RETRY_TASK case in the pick_next_task() slow path.

By doing the put_prev_task() early, we get the rt/deadline pull done,
and by testing rq->nr_running we know if we need newidle_balance().

This then gives a stable state to pick a task from.

Since the fast-path is fair-only, the other classes will always see
pick_next_task(.prev=NULL, .rf=NULL) and we can simplify.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c      |   19 ++++++++++++-------
 kernel/sched/deadline.c  |   30 ++----------------------------
 kernel/sched/fair.c      |    9 ++++++---
 kernel/sched/idle.c      |    4 +++-
 kernel/sched/rt.c        |   29 +----------------------------
 kernel/sched/sched.h     |   13 ++++++++-----
 kernel/sched/stop_task.c |    3 ++-
 7 files changed, 34 insertions(+), 73 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3739,7 +3739,7 @@ pick_next_task(struct rq *rq, struct tas
 
 		p = fair_sched_class.pick_next_task(rq, prev, rf);
 		if (unlikely(p == RETRY_TASK))
-			goto again;
+			goto restart;
 
 		/* Assumes fair_sched_class->next == idle_sched_class */
 		if (unlikely(!p))
@@ -3748,14 +3748,19 @@ pick_next_task(struct rq *rq, struct tas
 		return p;
 	}
 
-again:
+restart:
+	/*
+	 * Ensure that we put DL/RT tasks before the pick loop, such that they
+	 * can PULL higher prio tasks when we lower the RQ 'priority'.
+	 */
+	prev->sched_class->put_prev_task(rq, prev, rf);
+	if (!rq->nr_running)
+		newidle_balance(rq, rf);
+
 	for_each_class(class) {
-		p = class->pick_next_task(rq, prev, rf);
-		if (p) {
-			if (unlikely(p == RETRY_TASK))
-				goto again;
+		p = class->pick_next_task(rq, NULL, NULL);
+		if (p)
 			return p;
-		}
 	}
 
 	/* The idle class should always have a runnable task: */
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1729,39 +1729,13 @@ pick_next_task_dl(struct rq *rq, struct
 	struct task_struct *p;
 	struct dl_rq *dl_rq;
 
-	dl_rq = &rq->dl;
-
-	if (need_pull_dl_task(rq, prev)) {
-		/*
-		 * This is OK, because current is on_cpu, which avoids it being
-		 * picked for load-balance and preemption/IRQs are still
-		 * disabled avoiding further scheduler activity on it and we're
-		 * being very careful to re-start the picking loop.
-		 */
-		rq_unpin_lock(rq, rf);
-		pull_dl_task(rq);
-		rq_repin_lock(rq, rf);
-		/*
-		 * pull_dl_task() can drop (and re-acquire) rq->lock; this
-		 * means a stop task can slip in, in which case we need to
-		 * re-start task selection.
-		 */
-		if (rq->stop && task_on_rq_queued(rq->stop))
-			return RETRY_TASK;
-	}
+	WARN_ON_ONCE(prev || rf);
 
-	/*
-	 * When prev is DL, we may throttle it in put_prev_task().
-	 * So, we update time before we check for dl_nr_running.
-	 */
-	if (prev->sched_class == &dl_sched_class)
-		update_curr_dl(rq);
+	dl_rq = &rq->dl;
 
 	if (unlikely(!dl_rq->dl_nr_running))
 		return NULL;
 
-	put_prev_task(rq, prev);
-
 	dl_se = pick_next_dl_entity(rq, dl_rq);
 	BUG_ON(!dl_se);
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6740,7 +6740,7 @@ pick_next_task_fair(struct rq *rq, struc
 		goto idle;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	if (prev->sched_class != &fair_sched_class)
+	if (!prev || prev->sched_class != &fair_sched_class)
 		goto simple;
 
 	/*
@@ -6817,8 +6817,8 @@ pick_next_task_fair(struct rq *rq, struc
 	goto done;
 simple:
 #endif
-
-	put_prev_task(rq, prev);
+	if (prev)
+		put_prev_task(rq, prev);
 
 	do {
 		se = pick_next_entity(cfs_rq, NULL);
@@ -6846,6 +6846,9 @@ done: __maybe_unused;
 	return p;
 
 idle:
+	if (!rf)
+		return NULL;
+
 	new_tasks = newidle_balance(rq, rf);
 
 	/*
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -389,7 +389,9 @@ pick_next_task_idle(struct rq *rq, struc
 {
 	struct task_struct *next = rq->idle;
 
-	put_prev_task(rq, prev);
+	if (prev)
+		put_prev_task(rq, prev);
+
 	set_next_task_idle(rq, next);
 
 	return next;
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1554,38 +1554,11 @@ pick_next_task_rt(struct rq *rq, struct
 	struct task_struct *p;
 	struct rt_rq *rt_rq = &rq->rt;
 
-	if (need_pull_rt_task(rq, prev)) {
-		/*
-		 * This is OK, because current is on_cpu, which avoids it being
-		 * picked for load-balance and preemption/IRQs are still
-		 * disabled avoiding further scheduler activity on it and we're
-		 * being very careful to re-start the picking loop.
-		 */
-		rq_unpin_lock(rq, rf);
-		pull_rt_task(rq);
-		rq_repin_lock(rq, rf);
-		/*
-		 * pull_rt_task() can drop (and re-acquire) rq->lock; this
-		 * means a dl or stop task can slip in, in which case we need
-		 * to re-start task selection.
-		 */
-		if (unlikely((rq->stop && task_on_rq_queued(rq->stop)) ||
-			     rq->dl.dl_nr_running))
-			return RETRY_TASK;
-	}
-
-	/*
-	 * We may dequeue prev's rt_rq in put_prev_task().
-	 * So, we update time before rt_queued check.
-	 */
-	if (prev->sched_class == &rt_sched_class)
-		update_curr_rt(rq);
+	WARN_ON_ONCE(prev || rf);
 
 	if (!rt_rq->rt_queued)
 		return NULL;
 
-	put_prev_task(rq, prev);
-
 	p = _pick_next_task_rt(rq);
 
 	set_next_task_rt(rq, p);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1704,12 +1704,15 @@ struct sched_class {
 	void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
 
 	/*
-	 * It is the responsibility of the pick_next_task() method that will
-	 * return the next task to call put_prev_task() on the @prev task or
-	 * something equivalent.
+	 * Both @prev and @rf are optional and may be NULL, in which case the
+	 * caller must already have invoked put_prev_task(rq, prev, rf).
 	 *
-	 * May return RETRY_TASK when it finds a higher prio class has runnable
-	 * tasks.
+	 * Otherwise it is the responsibility of the pick_next_task() to call
+	 * put_prev_task() on the @prev task or something equivalent, IFF it
+	 * returns a next task.
+	 *
+	 * In that case (@rf != NULL) it may return RETRY_TASK when it finds a
+	 * higher prio class has runnable tasks.
 	 */
 	struct task_struct * (*pick_next_task)(struct rq *rq,
 					       struct task_struct *prev,
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -33,10 +33,11 @@ pick_next_task_stop(struct rq *rq, struc
 {
 	struct task_struct *stop = rq->stop;
 
+	WARN_ON_ONCE(prev || rf);
+
 	if (!stop || !task_on_rq_queued(stop))
 		return NULL;
 
-	put_prev_task(rq, prev);
 	set_next_task_stop(rq, stop);
 
 	return stop;
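
Not part of the patch, just a reading aid: since the core.c hunk above mixes
removals and additions, the resulting slow path (reached via 'goto restart'
from the fair fast path) reads, condensed, roughly like this:

restart:
	/*
	 * Put the previous task first so RT/DL can pull higher prio tasks
	 * when the RQ 'priority' drops, then newidle-balance if the
	 * runqueue went empty.
	 */
	prev->sched_class->put_prev_task(rq, prev, rf);
	if (!rq->nr_running)
		newidle_balance(rq, rf);

	/* Every class now gets a clean (prev=NULL, rf=NULL) pick. */
	for_each_class(class) {
		p = class->pick_next_task(rq, NULL, NULL);
		if (p)
			return p;
	}

	/* The idle class should always have a runnable task. */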




* [RFC][PATCH 09/13] sched: Unify runtime accounting across classes
  2019-07-26 14:54 [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure Peter Zijlstra
                   ` (7 preceding siblings ...)
  2019-07-26 14:54 ` [RFC][PATCH 08/13] sched: Rework pick_next_task() slow-path Peter Zijlstra
@ 2019-07-26 14:54 ` Peter Zijlstra
  2019-07-26 14:54 ` [RFC][PATCH 10/13] sched/deadline: Collect sched_dl_entity initialization Peter Zijlstra
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2019-07-26 14:54 UTC (permalink / raw)
  To: mingo, juri.lelli
  Cc: linux-kernel, dietmar.eggemann, luca.abeni, bristot, balsini,
	dvyukov, tglx, vpillai, rostedt, peterz

All classes use sched_entity::exec_start to track runtime and have
copies of the exact same code around to compute runtime.

Collapse all that.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h    |    2 -
 kernel/sched/deadline.c  |   17 ++-------------
 kernel/sched/fair.c      |   50 ++++++++++++++++++++++++++++++++++++-----------
 kernel/sched/rt.c        |   17 ++-------------
 kernel/sched/sched.h     |    2 +
 kernel/sched/stop_task.c |   16 ---------------
 6 files changed, 49 insertions(+), 55 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -433,7 +433,7 @@ struct sched_statistics {
 
 	u64				block_start;
 	u64				block_max;
-	u64				exec_max;
+	s64				exec_max;
 	u64				slice_max;
 
 	u64				nr_migrations_cold;
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1147,9 +1147,8 @@ static void update_curr_dl(struct rq *rq
 {
 	struct task_struct *curr = rq->curr;
 	struct sched_dl_entity *dl_se = &curr->dl;
-	u64 delta_exec, scaled_delta_exec;
+	s64 delta_exec, scaled_delta_exec;
 	int cpu = cpu_of(rq);
-	u64 now;
 
 	if (!dl_task(curr) || !on_dl_rq(dl_se))
 		return;
@@ -1162,23 +1161,13 @@ static void update_curr_dl(struct rq *rq
 	 * natural solution, but the full ramifications of this
 	 * approach need further study.
 	 */
-	now = rq_clock_task(rq);
-	delta_exec = now - curr->se.exec_start;
-	if (unlikely((s64)delta_exec <= 0)) {
+	delta_exec = update_curr_common(rq);
+	if (unlikely(delta_exec <= 0)) {
 		if (unlikely(dl_se->dl_yielded))
 			goto throttle;
 		return;
 	}
 
-	schedstat_set(curr->se.statistics.exec_max,
-		      max(curr->se.statistics.exec_max, delta_exec));
-
-	curr->se.sum_exec_runtime += delta_exec;
-	account_group_exec_runtime(curr, delta_exec);
-
-	curr->se.exec_start = now;
-	cgroup_account_cputime(curr, delta_exec);
-
 	if (dl_entity_is_special(dl_se))
 		return;
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -828,30 +828,58 @@ static void update_tg_load_avg(struct cf
 }
 #endif /* CONFIG_SMP */
 
+static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
+{
+	u64 now = rq_clock_task(rq);
+	s64 delta_exec;
+
+	delta_exec = now - curr->exec_start;
+	if (unlikely(delta_exec <= 0))
+		return delta_exec;
+
+	curr->exec_start = now;
+	curr->sum_exec_runtime += delta_exec;
+
+	schedstat_set(curr->statistics.exec_max,
+		      max(delta_exec, curr->statistics.exec_max));
+
+	return delta_exec;
+}
+
+/*
+ * Used by other classes to account runtime.
+ */
+s64 update_curr_common(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	s64 delta_exec;
+
+	delta_exec = update_curr_se(rq, &curr->se);
+	if (unlikely(delta_exec <= 0))
+		return delta_exec;
+
+	account_group_exec_runtime(curr, delta_exec);
+	cgroup_account_cputime(curr, delta_exec);
+
+	return delta_exec;
+}
+
 /*
  * Update the current task's runtime statistics.
  */
 static void update_curr(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *curr = cfs_rq->curr;
-	u64 now = rq_clock_task(rq_of(cfs_rq));
-	u64 delta_exec;
+	s64 delta_exec;
 
 	if (unlikely(!curr))
 		return;
 
-	delta_exec = now - curr->exec_start;
-	if (unlikely((s64)delta_exec <= 0))
+	delta_exec = update_curr_se(rq_of(cfs_rq), curr);
+	if (unlikely(delta_exec <= 0))
 		return;
 
-	curr->exec_start = now;
-
-	schedstat_set(curr->statistics.exec_max,
-		      max(delta_exec, curr->statistics.exec_max));
-
-	curr->sum_exec_runtime += delta_exec;
 	schedstat_add(cfs_rq->exec_clock, delta_exec);
-
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
 	update_min_vruntime(cfs_rq);
 
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -955,26 +955,15 @@ static void update_curr_rt(struct rq *rq
 {
 	struct task_struct *curr = rq->curr;
 	struct sched_rt_entity *rt_se = &curr->rt;
-	u64 delta_exec;
-	u64 now;
+	s64 delta_exec;
 
 	if (curr->sched_class != &rt_sched_class)
 		return;
 
-	now = rq_clock_task(rq);
-	delta_exec = now - curr->se.exec_start;
-	if (unlikely((s64)delta_exec <= 0))
+	delta_exec = update_curr_common(rq);
+	if (unlikely(delta_exec < 0))
 		return;
 
-	schedstat_set(curr->se.statistics.exec_max,
-		      max(curr->se.statistics.exec_max, delta_exec));
-
-	curr->se.sum_exec_runtime += delta_exec;
-	account_group_exec_runtime(curr, delta_exec);
-
-	curr->se.exec_start = now;
-	cgroup_account_cputime(curr, delta_exec);
-
 	if (!rt_bandwidth_enabled())
 		return;
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1689,6 +1689,8 @@ extern const u32		sched_prio_to_wmult[40
 
 #define RETRY_TASK		((void *)-1UL)
 
+extern s64 update_curr_common(struct rq *rq);
+
 struct sched_class {
 	const struct sched_class *next;
 
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -62,21 +62,7 @@ static void yield_task_stop(struct rq *r
 
 static void put_prev_task_stop(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
-	struct task_struct *curr = rq->curr;
-	u64 delta_exec;
-
-	delta_exec = rq_clock_task(rq) - curr->se.exec_start;
-	if (unlikely((s64)delta_exec < 0))
-		delta_exec = 0;
-
-	schedstat_set(curr->se.statistics.exec_max,
-			max(curr->se.statistics.exec_max, delta_exec));
-
-	curr->se.sum_exec_runtime += delta_exec;
-	account_group_exec_runtime(curr, delta_exec);
-
-	curr->se.exec_start = rq_clock_task(rq);
-	cgroup_account_cputime(curr, delta_exec);
+	update_curr_common(rq);
 }
 
 /*




* [RFC][PATCH 10/13] sched/deadline: Collect sched_dl_entity initialization
  2019-07-26 14:54 [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure Peter Zijlstra
                   ` (8 preceding siblings ...)
  2019-07-26 14:54 ` [RFC][PATCH 09/13] sched: Unify runtime accounting across classes Peter Zijlstra
@ 2019-07-26 14:54 ` Peter Zijlstra
  2019-07-26 14:54 ` [RFC][PATCH 11/13] sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity Peter Zijlstra
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2019-07-26 14:54 UTC (permalink / raw)
  To: mingo, juri.lelli
  Cc: linux-kernel, dietmar.eggemann, luca.abeni, bristot, balsini,
	dvyukov, tglx, vpillai, rostedt, peterz

Create a single function that initializes a sched_dl_entity.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c     |    5 +----
 kernel/sched/deadline.c |   22 +++++++++++++++-------
 kernel/sched/sched.h    |    6 ++----
 3 files changed, 18 insertions(+), 15 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2581,10 +2581,7 @@ static void __sched_fork(unsigned long c
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
 #endif
 
-	RB_CLEAR_NODE(&p->dl.rb_node);
-	init_dl_task_timer(&p->dl);
-	init_dl_inactive_task_timer(&p->dl);
-	__dl_clear_params(p);
+	init_dl_entity(&p->dl);
 
 	INIT_LIST_HEAD(&p->rt.run_list);
 	p->rt.timeout		= 0;
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -180,6 +180,8 @@ void dl_change_utilization(struct task_s
 	__add_rq_bw(new_bw, &rq->dl);
 }
 
+static void __dl_clear_params(struct sched_dl_entity *dl_se);
+
 /*
  * The utilization of a task cannot be immediately removed from
  * the rq active utilization (running_bw) when the task blocks.
@@ -278,7 +280,7 @@ static void task_non_contending(struct t
 				sub_rq_bw(&p->dl, &rq->dl);
 			raw_spin_lock(&dl_b->lock);
 			__dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
-			__dl_clear_params(p);
+			__dl_clear_params(dl_se);
 			raw_spin_unlock(&dl_b->lock);
 		}
 
@@ -1049,7 +1051,7 @@ static enum hrtimer_restart dl_task_time
 	return HRTIMER_NORESTART;
 }
 
-void init_dl_task_timer(struct sched_dl_entity *dl_se)
+static void init_dl_task_timer(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->dl_timer;
 
@@ -1261,7 +1263,7 @@ static enum hrtimer_restart inactive_tas
 		raw_spin_lock(&dl_b->lock);
 		__dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
 		raw_spin_unlock(&dl_b->lock);
-		__dl_clear_params(p);
+		__dl_clear_params(dl_se);
 
 		goto unlock;
 	}
@@ -1277,7 +1279,7 @@ static enum hrtimer_restart inactive_tas
 	return HRTIMER_NORESTART;
 }
 
-void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se)
+static void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->inactive_timer;
 
@@ -2633,10 +2635,8 @@ bool __checkparam_dl(const struct sched_
 /*
  * This function clears the sched_dl_entity static params.
  */
-void __dl_clear_params(struct task_struct *p)
+static void __dl_clear_params(struct sched_dl_entity *dl_se)
 {
-	struct sched_dl_entity *dl_se = &p->dl;
-
 	dl_se->dl_runtime		= 0;
 	dl_se->dl_deadline		= 0;
 	dl_se->dl_period		= 0;
@@ -2650,6 +2650,14 @@ void __dl_clear_params(struct task_struc
 	dl_se->dl_overrun		= 0;
 }
 
+void init_dl_entity(struct sched_dl_entity *dl_se)
+{
+	RB_CLEAR_NODE(&dl_se->rb_node);
+	init_dl_task_timer(dl_se);
+	init_dl_inactive_task_timer(dl_se);
+	__dl_clear_params(dl_se);
+}
+
 bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr)
 {
 	struct sched_dl_entity *dl_se = &p->dl;
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -239,8 +239,6 @@ struct rt_bandwidth {
 	unsigned int		rt_period_active;
 };
 
-void __dl_clear_params(struct task_struct *p);
-
 /*
  * To keep the bandwidth of -deadline tasks and groups under control
  * we need some place where:
@@ -1844,10 +1842,10 @@ extern void init_rt_bandwidth(struct rt_
 
 extern struct dl_bandwidth def_dl_bandwidth;
 extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime);
-extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
-extern void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se);
 extern void init_dl_rq_bw_ratio(struct dl_rq *dl_rq);
 
+extern void init_dl_entity(struct sched_dl_entity *dl_se);
+
 #define BW_SHIFT		20
 #define BW_UNIT			(1 << BW_SHIFT)
 #define RATIO_SHIFT		8




* [RFC][PATCH 11/13] sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity
  2019-07-26 14:54 [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure Peter Zijlstra
                   ` (9 preceding siblings ...)
  2019-07-26 14:54 ` [RFC][PATCH 10/13] sched/deadline: Collect sched_dl_entity initialization Peter Zijlstra
@ 2019-07-26 14:54 ` Peter Zijlstra
  2019-07-26 14:54 ` [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers Peter Zijlstra
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2019-07-26 14:54 UTC (permalink / raw)
  To: mingo, juri.lelli
  Cc: linux-kernel, dietmar.eggemann, luca.abeni, bristot, balsini,
	dvyukov, tglx, vpillai, rostedt, peterz

In preparation for introducing a !task sched_dl_entity, move the
bandwidth accounting into {en,de}queue_dl_entity().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/deadline.c |  125 ++++++++++++++++++++++++++----------------------
 kernel/sched/sched.h    |    6 ++
 2 files changed, 75 insertions(+), 56 deletions(-)

--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -236,9 +236,8 @@ static void __dl_clear_params(struct sch
  * up, and checks if the task is still in the "ACTIVE non contending"
  * state or not (in the second case, it updates running_bw).
  */
-static void task_non_contending(struct task_struct *p)
+static void task_non_contending(struct sched_dl_entity *dl_se)
 {
-	struct sched_dl_entity *dl_se = &p->dl;
 	struct hrtimer *timer = &dl_se->inactive_timer;
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
@@ -271,15 +270,18 @@ static void task_non_contending(struct t
 	 * utilization now, instead of starting a timer
 	 */
 	if ((zerolag_time < 0) || hrtimer_active(&dl_se->inactive_timer)) {
+		struct task_struct *p = dl_task_of(dl_se);
+
 		if (dl_task(p))
 			sub_running_bw(dl_se, dl_rq);
+
 		if (!dl_task(p) || p->state == TASK_DEAD) {
 			struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 
 			if (p->state == TASK_DEAD)
-				sub_rq_bw(&p->dl, &rq->dl);
+				sub_rq_bw(dl_se, &rq->dl);
 			raw_spin_lock(&dl_b->lock);
-			__dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
+			__dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
 			__dl_clear_params(dl_se);
 			raw_spin_unlock(&dl_b->lock);
 		}
@@ -288,7 +290,7 @@ static void task_non_contending(struct t
 	}
 
 	dl_se->dl_non_contending = 1;
-	get_task_struct(p);
+	get_task_struct(dl_task_of(dl_se));
 	hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL);
 }
 
@@ -1404,6 +1406,41 @@ enqueue_dl_entity(struct sched_dl_entity
 	BUG_ON(on_dl_rq(dl_se));
 
 	/*
+	 * Check if a constrained deadline task was activated
+	 * after the deadline but before the next period.
+	 * If that is the case, the task will be throttled and
+	 * the replenishment timer will be set to the next period.
+	 */
+	if (!dl_se->dl_throttled && !dl_is_implicit(dl_se))
+		dl_check_constrained_dl(dl_se);
+
+	if (flags & (ENQUEUE_RESTORE|ENQUEUE_MIGRATING)) {
+		struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+		add_rq_bw(dl_se, dl_rq);
+		add_running_bw(dl_se, dl_rq);
+	}
+
+	/*
+	 * If p is throttled, we do not enqueue it. In fact, if it exhausted
+	 * its budget it needs a replenishment and, since it now is on
+	 * its rq, the bandwidth timer callback (which clearly has not
+	 * run yet) will take care of this.
+	 * However, the active utilization does not depend on the fact
+	 * that the task is on the runqueue or not (but depends on the
+	 * task's state - in GRUB parlance, "inactive" vs "active contending").
+	 * In other words, even if a task is throttled its utilization must
+	 * be counted in the active utilization; hence, we need to call
+	 * add_running_bw().
+	 */
+	if (dl_se->dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
+		if (flags & ENQUEUE_WAKEUP)
+			task_contending(dl_se, flags);
+
+		return;
+	}
+
+	/*
 	 * If this is a wakeup or a new instance, the scheduling
 	 * parameters of the task might need updating. Otherwise,
 	 * we want a replenishment of its runtime.
@@ -1422,9 +1459,28 @@ enqueue_dl_entity(struct sched_dl_entity
 	__enqueue_dl_entity(dl_se);
 }
 
-static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 {
 	__dequeue_dl_entity(dl_se);
+
+	if (flags & (DEQUEUE_SAVE|DEQUEUE_MIGRATING)) {
+		struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+		sub_running_bw(dl_se, dl_rq);
+		sub_rq_bw(dl_se, dl_rq);
+	}
+
+	/*
+	 * This check allows to start the inactive timer (or to immediately
+	 * decrease the active utilization, if needed) in two cases:
+	 * when the task blocks and when it is terminating
+	 * (p->state == TASK_DEAD). We can handle the two cases in the same
+	 * way, because from GRUB's point of view the same thing is happening
+	 * (the task moves from "active contending" to "active non contending"
+	 * or "inactive")
+	 */
+	if (flags & DEQUEUE_SLEEP)
+		task_non_contending(dl_se);
 }
 
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
@@ -1454,38 +1510,8 @@ static void enqueue_task_dl(struct rq *r
 		return;
 	}
 
-	/*
-	 * Check if a constrained deadline task was activated
-	 * after the deadline but before the next period.
-	 * If that is the case, the task will be throttled and
-	 * the replenishment timer will be set to the next period.
-	 */
-	if (!p->dl.dl_throttled && !dl_is_implicit(&p->dl))
-		dl_check_constrained_dl(&p->dl);
-
-	if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & ENQUEUE_RESTORE) {
-		add_rq_bw(&p->dl, &rq->dl);
-		add_running_bw(&p->dl, &rq->dl);
-	}
-
-	/*
-	 * If p is throttled, we do not enqueue it. In fact, if it exhausted
-	 * its budget it needs a replenishment and, since it now is on
-	 * its rq, the bandwidth timer callback (which clearly has not
-	 * run yet) will take care of this.
-	 * However, the active utilization does not depend on the fact
-	 * that the task is on the runqueue or not (but depends on the
-	 * task's state - in GRUB parlance, "inactive" vs "active contending").
-	 * In other words, even if a task is throttled its utilization must
-	 * be counted in the active utilization; hence, we need to call
-	 * add_running_bw().
-	 */
-	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
-		if (flags & ENQUEUE_WAKEUP)
-			task_contending(&p->dl, flags);
-
-		return;
-	}
+	if (p->on_rq == TASK_ON_RQ_MIGRATING)
+		flags |= ENQUEUE_MIGRATING;
 
 	enqueue_dl_entity(&p->dl, pi_se, flags);
 
@@ -1495,31 +1521,18 @@ static void enqueue_task_dl(struct rq *r
 
 static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
-	dequeue_dl_entity(&p->dl);
+	dequeue_dl_entity(&p->dl, flags);
 	dequeue_pushable_dl_task(rq, p);
 }
 
 static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_curr_dl(rq);
-	__dequeue_task_dl(rq, p, flags);
 
-	if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & DEQUEUE_SAVE) {
-		sub_running_bw(&p->dl, &rq->dl);
-		sub_rq_bw(&p->dl, &rq->dl);
-	}
+	if (p->on_rq == TASK_ON_RQ_MIGRATING)
+		flags |= DEQUEUE_MIGRATING;
 
-	/*
-	 * This check allows to start the inactive timer (or to immediately
-	 * decrease the active utilization, if needed) in two cases:
-	 * when the task blocks and when it is terminating
-	 * (p->state == TASK_DEAD). We can handle the two cases in the same
-	 * way, because from GRUB's point of view the same thing is happening
-	 * (the task moves from "active contending" to "active non contending"
-	 * or "inactive")
-	 */
-	if (flags & DEQUEUE_SLEEP)
-		task_non_contending(p);
+	__dequeue_task_dl(rq, p, flags);
 }
 
 /*
@@ -2269,7 +2282,7 @@ static void switched_from_dl(struct rq *
 	 * will reset the task parameters.
 	 */
 	if (task_on_rq_queued(p) && p->dl.dl_runtime)
-		task_non_contending(p);
+		task_non_contending(&p->dl);
 
 	if (!task_on_rq_queued(p)) {
 		/*
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1661,6 +1661,10 @@ extern const u32		sched_prio_to_wmult[40
  * MOVE - paired with SAVE/RESTORE, explicitly does not preserve the location
  *        in the runqueue.
  *
+ * NOCLOCK - skip the update_rq_clock() (avoids double updates)
+ *
+ * MIGRATING - p->on_rq == TASK_ON_RQ_MIGRATING (used for DEADLINE)
+ *
  * ENQUEUE_HEAD      - place at front of runqueue (tail if not specified)
  * ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
  * ENQUEUE_MIGRATED  - the task was migrated during wakeup
@@ -1671,6 +1675,7 @@ extern const u32		sched_prio_to_wmult[40
 #define DEQUEUE_SAVE		0x02 /* Matches ENQUEUE_RESTORE */
 #define DEQUEUE_MOVE		0x04 /* Matches ENQUEUE_MOVE */
 #define DEQUEUE_NOCLOCK		0x08 /* Matches ENQUEUE_NOCLOCK */
+#define DEQUEUE_MIGRATING	0x80 /* Matches ENQUEUE_MIGRATING */
 
 #define ENQUEUE_WAKEUP		0x01
 #define ENQUEUE_RESTORE		0x02
@@ -1684,6 +1689,7 @@ extern const u32		sched_prio_to_wmult[40
 #else
 #define ENQUEUE_MIGRATED	0x00
 #endif
+#define ENQUEUE_MIGRATING	0x80
 
 #define RETRY_TASK		((void *)-1UL)
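
In short: after this change the task-level enqueue_task_dl()/dequeue_task_dl()
only translate p->on_rq == TASK_ON_RQ_MIGRATING into the new
ENQUEUE_MIGRATING/DEQUEUE_MIGRATING flags, while the rq_bw/running_bw
accounting and the contending/non-contending transitions move into
enqueue_dl_entity()/dequeue_dl_entity() and are driven purely by the flags
argument. That is what lets the next patch reuse the entity-level paths for
dl entities that are not backed by a task.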
 



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers
  2019-07-26 14:54 [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure Peter Zijlstra
                   ` (10 preceding siblings ...)
  2019-07-26 14:54 ` [RFC][PATCH 11/13] sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity Peter Zijlstra
@ 2019-07-26 14:54 ` Peter Zijlstra
  2019-08-07 16:31   ` Dietmar Eggemann
                     ` (2 more replies)
  2019-07-26 14:54 ` [RFC][PATCH 13/13] sched/fair: Add trivial fair server Peter Zijlstra
                   ` (2 subsequent siblings)
  14 siblings, 3 replies; 63+ messages in thread
From: Peter Zijlstra @ 2019-07-26 14:54 UTC (permalink / raw)
  To: mingo, juri.lelli
  Cc: linux-kernel, dietmar.eggemann, luca.abeni, bristot, balsini,
	dvyukov, tglx, vpillai, rostedt, peterz



Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h   |   26 +++
 kernel/sched/core.c     |    5 
 kernel/sched/deadline.c |  327 +++++++++++++++++++++++++++++++-----------------
 kernel/sched/fair.c     |    4 
 kernel/sched/sched.h    |   29 ++++
 5 files changed, 277 insertions(+), 114 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -52,12 +52,14 @@ struct robust_list_head;
 struct root_domain;
 struct rq;
 struct sched_attr;
+struct sched_dl_entity;
 struct sched_param;
 struct seq_file;
 struct sighand_struct;
 struct signal_struct;
 struct task_delay_info;
 struct task_group;
+struct task_struct;
 
 /*
  * Task state bitmask. NOTE! These bits are also
@@ -509,6 +511,9 @@ struct sched_rt_entity {
 #endif
 } __randomize_layout;
 
+typedef bool (*dl_server_has_tasks_f)(struct sched_dl_entity *);
+typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *);
+
 struct sched_dl_entity {
 	struct rb_node			rb_node;
 
@@ -561,6 +566,7 @@ struct sched_dl_entity {
 	unsigned int			dl_yielded        : 1;
 	unsigned int			dl_non_contending : 1;
 	unsigned int			dl_overrun	  : 1;
+	unsigned int			dl_server         : 1;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
@@ -575,7 +581,20 @@ struct sched_dl_entity {
 	 * timer is needed to decrease the active utilization at the correct
 	 * time.
 	 */
-	struct hrtimer inactive_timer;
+	struct hrtimer			inactive_timer;
+
+	/*
+	 * Bits for DL-server functionality. Also see the comment near
+	 * dl_server_update().
+	 *
+	 * @rq the runqueue this server is for
+	 *
+	 * @server_has_tasks() returns true if @server_pick() returns a
+	 * runnable task.
+	 */
+	struct rq			*rq;
+	dl_server_has_tasks_f		server_has_tasks;
+	dl_server_pick_f		server_pick;
 };
 
 #ifdef CONFIG_UCLAMP_TASK
@@ -688,10 +707,13 @@ struct task_struct {
 	const struct sched_class	*sched_class;
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
+	struct sched_dl_entity		dl;
+
+	struct sched_dl_entity		*server;
+
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group		*sched_task_group;
 #endif
-	struct sched_dl_entity		dl;
 
 #ifdef CONFIG_UCLAMP_TASK
 	/* Clamp values requested for a scheduling entity */
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3756,8 +3756,11 @@ pick_next_task(struct rq *rq, struct tas
 
 	for_each_class(class) {
 		p = class->pick_next_task(rq, NULL, NULL);
-		if (p)
+		if (p) {
+			if (p->sched_class == class && p->server)
+				p->server = NULL;
 			return p;
+		}
 	}
 
 	/* The idle class should always have a runnable task: */
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -20,8 +20,14 @@
 
 struct dl_bandwidth def_dl_bandwidth;
 
+static bool dl_server(struct sched_dl_entity *dl_se)
+{
+	return dl_se->dl_server;
+}
+
 static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
 {
+	BUG_ON(dl_server(dl_se));
 	return container_of(dl_se, struct task_struct, dl);
 }
 
@@ -30,14 +36,22 @@ static inline struct rq *rq_of_dl_rq(str
 	return container_of(dl_rq, struct rq, dl);
 }
 
-static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+static inline struct rq *rq_of_dl_se(struct sched_dl_entity *dl_se)
 {
-	struct task_struct *p = dl_task_of(dl_se);
-	struct rq *rq = task_rq(p);
+	struct rq *rq = dl_se->rq;
 
-	return &rq->dl;
+	if (!dl_server(dl_se))
+		rq = task_rq(dl_task_of(dl_se));
+
+	return rq;
+}
+
+static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+{
+	return &rq_of_dl_se(dl_se)->dl;
 }
 
+
 static inline int on_dl_rq(struct sched_dl_entity *dl_se)
 {
 	return !RB_EMPTY_NODE(&dl_se->rb_node);
@@ -239,8 +253,8 @@ static void __dl_clear_params(struct sch
 static void task_non_contending(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->inactive_timer;
-	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
-	struct rq *rq = rq_of_dl_rq(dl_rq);
+	struct rq *rq = rq_of_dl_se(dl_se);
+	struct dl_rq *dl_rq = &rq->dl;
 	s64 zerolag_time;
 
 	/*
@@ -270,27 +284,32 @@ static void task_non_contending(struct s
 	 * utilization now, instead of starting a timer
 	 */
 	if ((zerolag_time < 0) || hrtimer_active(&dl_se->inactive_timer)) {
-		struct task_struct *p = dl_task_of(dl_se);
-
-		if (dl_task(p))
+		if (dl_server(dl_se)) {
 			sub_running_bw(dl_se, dl_rq);
+		} else {
+			struct task_struct *p = dl_task_of(dl_se);
 
-		if (!dl_task(p) || p->state == TASK_DEAD) {
-			struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+			if (dl_task(p))
+				sub_running_bw(dl_se, dl_rq);
 
-			if (p->state == TASK_DEAD)
-				sub_rq_bw(dl_se, &rq->dl);
-			raw_spin_lock(&dl_b->lock);
-			__dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
-			__dl_clear_params(dl_se);
-			raw_spin_unlock(&dl_b->lock);
+			if (!dl_task(p) || p->state == TASK_DEAD) {
+				struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
+				if (p->state == TASK_DEAD)
+					sub_rq_bw(dl_se, &rq->dl);
+				raw_spin_lock(&dl_b->lock);
+				__dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
+				__dl_clear_params(dl_se);
+				raw_spin_unlock(&dl_b->lock);
+			}
 		}
 
 		return;
 	}
 
 	dl_se->dl_non_contending = 1;
-	get_task_struct(dl_task_of(dl_se));
+	if (!dl_server(dl_se))
+		get_task_struct(dl_task_of(dl_se));
 	hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL);
 }
 
@@ -317,8 +336,10 @@ static void task_contending(struct sched
 		 * will not touch the rq's active utilization,
 		 * so we are still safe.
 		 */
-		if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
-			put_task_struct(dl_task_of(dl_se));
+		if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1) {
+			if (!dl_server(dl_se))
+				put_task_struct(dl_task_of(dl_se));
+		}
 	} else {
 		/*
 		 * Since "dl_non_contending" is not set, the
@@ -331,10 +352,8 @@ static void task_contending(struct sched
 	}
 }
 
-static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
+static inline int is_leftmost(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 {
-	struct sched_dl_entity *dl_se = &p->dl;
-
 	return dl_rq->root.rb_leftmost == &dl_se->rb_node;
 }
 
@@ -428,8 +447,6 @@ static void inc_dl_migration(struct sche
 
 	if (p->nr_cpus_allowed > 1)
 		dl_rq->dl_nr_migratory++;
-
-	update_dl_migration(dl_rq);
 }
 
 static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
@@ -438,8 +455,6 @@ static void dec_dl_migration(struct sche
 
 	if (p->nr_cpus_allowed > 1)
 		dl_rq->dl_nr_migratory--;
-
-	update_dl_migration(dl_rq);
 }
 
 /*
@@ -607,8 +622,11 @@ static inline void deadline_queue_pull_t
 }
 #endif /* CONFIG_SMP */
 
+static void
+enqueue_dl_entity(struct sched_dl_entity *dl_se,
+		  struct sched_dl_entity *pi_se, int flags);
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
-static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se, int flags);
 static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p, int flags);
 
 /*
@@ -855,8 +873,7 @@ static inline bool dl_is_implicit(struct
 static void update_dl_entity(struct sched_dl_entity *dl_se,
 			     struct sched_dl_entity *pi_se)
 {
-	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
-	struct rq *rq = rq_of_dl_rq(dl_rq);
+	struct rq *rq = rq_of_dl_se(dl_se);
 
 	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
 	    dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) {
@@ -888,11 +905,11 @@ static inline u64 dl_next_period(struct
  * actually started or not (i.e., the replenishment instant is in
  * the future or in the past).
  */
-static int start_dl_timer(struct task_struct *p)
+static int start_dl_timer(struct sched_dl_entity *dl_se)
 {
-	struct sched_dl_entity *dl_se = &p->dl;
 	struct hrtimer *timer = &dl_se->dl_timer;
-	struct rq *rq = task_rq(p);
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
 	ktime_t now, act;
 	s64 delta;
 
@@ -926,13 +943,33 @@ static int start_dl_timer(struct task_st
 	 * and observe our state.
 	 */
 	if (!hrtimer_is_queued(timer)) {
-		get_task_struct(p);
+		if (!dl_server(dl_se))
+			get_task_struct(dl_task_of(dl_se));
 		hrtimer_start(timer, act, HRTIMER_MODE_ABS);
 	}
 
 	return 1;
 }
 
+static void __push_dl_task(struct rq *rq, struct rq_flags *rf)
+{
+#ifdef CONFIG_SMP
+	/*
+	 * Queueing this task back might have overloaded rq, check if we need
+	 * to kick someone away.
+	 */
+	if (has_pushable_dl_tasks(rq)) {
+		/*
+		 * Nothing relies on rq->lock after this, so its safe to drop
+		 * rq->lock.
+		 */
+		rq_unpin_lock(rq, rf);
+		push_dl_task(rq);
+		rq_repin_lock(rq, rf);
+	}
+#endif
+}
+
 /*
  * This is the bandwidth enforcement timer callback. If here, we know
  * a task is not on its dl_rq, since the fact that the timer was running
@@ -951,10 +988,34 @@ static enum hrtimer_restart dl_task_time
 	struct sched_dl_entity *dl_se = container_of(timer,
 						     struct sched_dl_entity,
 						     dl_timer);
-	struct task_struct *p = dl_task_of(dl_se);
+	struct task_struct *p;
 	struct rq_flags rf;
 	struct rq *rq;
 
+	if (dl_server(dl_se)) {
+		struct rq *rq = rq_of_dl_se(dl_se);
+		struct rq_flags rf;
+
+		rq_lock(rq, &rf);
+		if (dl_se->dl_throttled) {
+			sched_clock_tick();
+			update_rq_clock(rq);
+
+			if (dl_se->server_has_tasks(dl_se)) {
+				enqueue_dl_entity(dl_se, dl_se, ENQUEUE_REPLENISH);
+				resched_curr(rq);
+				__push_dl_task(rq, &rf);
+			} else {
+				replenish_dl_entity(dl_se, dl_se);
+			}
+
+		}
+		rq_unlock(rq, &rf);
+
+		return HRTIMER_NORESTART;
+	}
+
+	p = dl_task_of(dl_se);
 	rq = task_rq_lock(p, &rf);
 
 	/*
@@ -1025,21 +1086,7 @@ static enum hrtimer_restart dl_task_time
 	else
 		resched_curr(rq);
 
-#ifdef CONFIG_SMP
-	/*
-	 * Queueing this task back might have overloaded rq, check if we need
-	 * to kick someone away.
-	 */
-	if (has_pushable_dl_tasks(rq)) {
-		/*
-		 * Nothing relies on rq->lock after this, so its safe to drop
-		 * rq->lock.
-		 */
-		rq_unpin_lock(rq, &rf);
-		push_dl_task(rq);
-		rq_repin_lock(rq, &rf);
-	}
-#endif
+	__push_dl_task(rq, &rf);
 
 unlock:
 	task_rq_unlock(rq, p, &rf);
@@ -1081,12 +1128,11 @@ static void init_dl_task_timer(struct sc
  */
 static inline void dl_check_constrained_dl(struct sched_dl_entity *dl_se)
 {
-	struct task_struct *p = dl_task_of(dl_se);
-	struct rq *rq = rq_of_dl_rq(dl_rq_of_se(dl_se));
+	struct rq *rq = rq_of_dl_se(dl_se);
 
 	if (dl_time_before(dl_se->deadline, rq_clock(rq)) &&
 	    dl_time_before(rq_clock(rq), dl_next_period(dl_se))) {
-		if (unlikely(dl_se->dl_boosted || !start_dl_timer(p)))
+		if (unlikely(dl_se->dl_boosted || !start_dl_timer(dl_se)))
 			return;
 		dl_se->dl_throttled = 1;
 		if (dl_se->runtime > 0)
@@ -1143,29 +1189,10 @@ static u64 grub_reclaim(u64 delta, struc
 	return (delta * u_act) >> BW_SHIFT;
 }
 
-/*
- * Update the current task's runtime statistics (provided it is still
- * a -deadline task and has not been removed from the dl_rq).
- */
-static void update_curr_dl(struct rq *rq)
+static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec)
 {
-	struct task_struct *curr = rq->curr;
-	struct sched_dl_entity *dl_se = &curr->dl;
-	s64 delta_exec, scaled_delta_exec;
-	int cpu = cpu_of(rq);
-
-	if (!dl_task(curr) || !on_dl_rq(dl_se))
-		return;
+	s64 scaled_delta_exec;
 
-	/*
-	 * Consumed budget is computed considering the time as
-	 * observed by schedulable tasks (excluding time spent
-	 * in hardirq context, etc.). Deadlines are instead
-	 * computed using hard walltime. This seems to be the more
-	 * natural solution, but the full ramifications of this
-	 * approach need further study.
-	 */
-	delta_exec = update_curr_common(rq);
 	if (unlikely(delta_exec <= 0)) {
 		if (unlikely(dl_se->dl_yielded))
 			goto throttle;
@@ -1183,10 +1210,9 @@ static void update_curr_dl(struct rq *rq
 	 * according to current frequency and CPU maximum capacity.
 	 */
 	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
-		scaled_delta_exec = grub_reclaim(delta_exec,
-						 rq,
-						 &curr->dl);
+		scaled_delta_exec = grub_reclaim(delta_exec, rq, dl_se);
 	} else {
+		int cpu = cpu_of(rq);
 		unsigned long scale_freq = arch_scale_freq_capacity(cpu);
 		unsigned long scale_cpu = arch_scale_cpu_capacity(cpu);
 
@@ -1205,11 +1231,18 @@ static void update_curr_dl(struct rq *rq
 		    (dl_se->flags & SCHED_FLAG_DL_OVERRUN))
 			dl_se->dl_overrun = 1;
 
-		__dequeue_task_dl(rq, curr, 0);
-		if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
-			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
+		dequeue_dl_entity(dl_se, 0);
+		if (!dl_server(dl_se))
+			dequeue_pushable_dl_task(rq, dl_task_of(dl_se));
+
+		if (unlikely(dl_se->dl_boosted || !start_dl_timer(dl_se))) {
+			if (dl_server(dl_se))
+				enqueue_dl_entity(dl_se, dl_se, ENQUEUE_REPLENISH);
+			else
+				enqueue_task_dl(rq, dl_task_of(dl_se), ENQUEUE_REPLENISH);
+		}
 
-		if (!is_leftmost(curr, &rq->dl))
+		if (!is_leftmost(dl_se, &rq->dl))
 			resched_curr(rq);
 	}
 
@@ -1239,20 +1272,81 @@ static void update_curr_dl(struct rq *rq
 	}
 }
 
+void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
+{
+	update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
+}
+
+void dl_server_start(struct sched_dl_entity *dl_se)
+{
+	enqueue_dl_entity(dl_se, dl_se, ENQUEUE_WAKEUP);
+}
+
+void dl_server_stop(struct sched_dl_entity *dl_se)
+{
+	dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
+}
+
+void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
+		    dl_server_has_tasks_f has_tasks,
+		    dl_server_pick_f pick)
+{
+	dl_se->dl_server = 1;
+	dl_se->rq = rq;
+	dl_se->server_has_tasks = has_tasks;
+	dl_se->server_pick = pick;
+
+	setup_new_dl_entity(dl_se);
+}
+
+/*
+ * Update the current task's runtime statistics (provided it is still
+ * a -deadline task and has not been removed from the dl_rq).
+ */
+static void update_curr_dl(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	struct sched_dl_entity *dl_se = &curr->dl;
+	s64 delta_exec;
+
+	if (!dl_task(curr) || !on_dl_rq(dl_se))
+		return;
+
+	/*
+	 * Consumed budget is computed considering the time as
+	 * observed by schedulable tasks (excluding time spent
+	 * in hardirq context, etc.). Deadlines are instead
+	 * computed using hard walltime. This seems to be the more
+	 * natural solution, but the full ramifications of this
+	 * approach need further study.
+	 */
+	delta_exec = update_curr_common(rq);
+	update_curr_dl_se(rq, dl_se, delta_exec);
+}
+
 static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 {
 	struct sched_dl_entity *dl_se = container_of(timer,
 						     struct sched_dl_entity,
 						     inactive_timer);
-	struct task_struct *p = dl_task_of(dl_se);
+	struct task_struct *p = NULL;
 	struct rq_flags rf;
 	struct rq *rq;
 
-	rq = task_rq_lock(p, &rf);
+	if (!dl_server(dl_se)) {
+		p = dl_task_of(dl_se);
+		rq = task_rq_lock(p, &rf);
+	} else {
+		rq = dl_se->rq;
+		rq_lock(rq, &rf);
+	}
 
 	sched_clock_tick();
 	update_rq_clock(rq);
 
+	if (dl_server(dl_se))
+		goto no_task;
+
 	if (!dl_task(p) || p->state == TASK_DEAD) {
 		struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 
@@ -1269,14 +1363,21 @@ static enum hrtimer_restart inactive_tas
 
 		goto unlock;
 	}
+
+no_task:
 	if (dl_se->dl_non_contending == 0)
 		goto unlock;
 
 	sub_running_bw(dl_se, &rq->dl);
 	dl_se->dl_non_contending = 0;
 unlock:
-	task_rq_unlock(rq, p, &rf);
-	put_task_struct(p);
+
+	if (!dl_server(dl_se)) {
+		task_rq_unlock(rq, p, &rf);
+		put_task_struct(p);
+	} else {
+		rq_unlock(rq, &rf);
+	}
 
 	return HRTIMER_NORESTART;
 }
@@ -1334,29 +1435,28 @@ static inline void dec_dl_deadline(struc
 static inline
 void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 {
-	int prio = dl_task_of(dl_se)->prio;
 	u64 deadline = dl_se->deadline;
 
-	WARN_ON(!dl_prio(prio));
 	dl_rq->dl_nr_running++;
 	add_nr_running(rq_of_dl_rq(dl_rq), 1);
 
 	inc_dl_deadline(dl_rq, deadline);
-	inc_dl_migration(dl_se, dl_rq);
+	if (!dl_server(dl_se))
+		inc_dl_migration(dl_se, dl_rq);
+	update_dl_migration(dl_rq);
 }
 
 static inline
 void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 {
-	int prio = dl_task_of(dl_se)->prio;
-
-	WARN_ON(!dl_prio(prio));
 	WARN_ON(!dl_rq->dl_nr_running);
 	dl_rq->dl_nr_running--;
 	sub_nr_running(rq_of_dl_rq(dl_rq), 1);
 
 	dec_dl_deadline(dl_rq, dl_se->deadline);
-	dec_dl_migration(dl_se, dl_rq);
+	if (!dl_server(dl_se))
+		dec_dl_migration(dl_se, dl_rq);
+	update_dl_migration(dl_rq);
 }
 
 static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
@@ -1451,8 +1551,7 @@ enqueue_dl_entity(struct sched_dl_entity
 	} else if (flags & ENQUEUE_REPLENISH) {
 		replenish_dl_entity(dl_se, pi_se);
 	} else if ((flags & ENQUEUE_RESTORE) &&
-		  dl_time_before(dl_se->deadline,
-				 rq_clock(rq_of_dl_rq(dl_rq_of_se(dl_se))))) {
+		   dl_time_before(dl_se->deadline, rq_clock(rq_of_dl_se(dl_se)))) {
 		setup_new_dl_entity(dl_se);
 	}
 
@@ -1519,12 +1618,6 @@ static void enqueue_task_dl(struct rq *r
 		enqueue_pushable_dl_task(rq, p);
 }
 
-static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
-{
-	dequeue_dl_entity(&p->dl, flags);
-	dequeue_pushable_dl_task(rq, p);
-}
-
 static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_curr_dl(rq);
@@ -1532,7 +1625,8 @@ static void dequeue_task_dl(struct rq *r
 	if (p->on_rq == TASK_ON_RQ_MIGRATING)
 		flags |= DEQUEUE_MIGRATING;
 
-	__dequeue_task_dl(rq, p, flags);
+	dequeue_dl_entity(&p->dl, flags);
+	dequeue_pushable_dl_task(rq, p);
 }
 
 /*
@@ -1688,12 +1782,12 @@ static void check_preempt_curr_dl(struct
 }
 
 #ifdef CONFIG_SCHED_HRTICK
-static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
 {
-	hrtick_start(rq, p->dl.runtime);
+	hrtick_start(rq, dl_se->runtime);
 }
 #else /* !CONFIG_SCHED_HRTICK */
-static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
 {
 }
 #endif
@@ -1705,9 +1799,6 @@ static void set_next_task_dl(struct rq *
 	/* You can't push away the running task */
 	dequeue_pushable_dl_task(rq, p);
 
-	if (hrtick_enabled(rq))
-		start_hrtick_dl(rq, p);
-
 	if (rq->curr->sched_class != &dl_sched_class)
 		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
 
@@ -1737,15 +1828,29 @@ pick_next_task_dl(struct rq *rq, struct
 
 	dl_rq = &rq->dl;
 
+again:
 	if (unlikely(!dl_rq->dl_nr_running))
 		return NULL;
 
 	dl_se = pick_next_dl_entity(rq, dl_rq);
 	BUG_ON(!dl_se);
 
-	p = dl_task_of(dl_se);
+	if (dl_server(dl_se)) {
+		p = dl_se->server_pick(dl_se);
+		if (!p) {
+			// XXX should not happen, warn?!
+			dl_se->dl_yielded = 1;
+			update_curr_dl_se(rq, dl_se, 0);
+			goto again;
+		}
+		p->server = dl_se;
+	} else {
+		p = dl_task_of(dl_se);
+		set_next_task_dl(rq, p);
+	}
 
-	set_next_task_dl(rq, p);
+	if (hrtick_enabled(rq))
+		start_hrtick_dl(rq, dl_se);
 
 	return p;
 }
@@ -1790,8 +1895,8 @@ static void task_tick_dl(struct rq *rq,
 	 * be set and schedule() will start a new hrtick for the next task.
 	 */
 	if (hrtick_enabled(rq) && queued && p->dl.runtime > 0 &&
-	    is_leftmost(p, &rq->dl))
-		start_hrtick_dl(rq, p);
+	    is_leftmost(&p->dl, &rq->dl))
+		start_hrtick_dl(rq, &p->dl);
 }
 
 static void task_fork_dl(struct task_struct *p)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -860,6 +860,8 @@ s64 update_curr_common(struct rq *rq)
 
 	account_group_exec_runtime(curr, delta_exec);
 	cgroup_account_cputime(curr, delta_exec);
+	if (curr->server)
+		dl_server_update(curr->server, delta_exec);
 
 	return delta_exec;
 }
@@ -889,6 +891,8 @@ static void update_curr(struct cfs_rq *c
 		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
 		cgroup_account_cputime(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
+		if (curtask->server)
+			dl_server_update(curtask->server, delta_exec);
 	}
 
 	account_cfs_rq_runtime(cfs_rq, delta_exec);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -316,6 +316,35 @@ extern int  dl_task_can_attach(struct ta
 extern int  dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
 extern bool dl_cpu_busy(unsigned int cpu);
 
+/*
+ * SCHED_DEADLINE supports servers (nested scheduling) with the following
+ * interface:
+ *
+ *   dl_se::rq -- runqueue we belong to.
+ *
+ *   dl_se::server_has_tasks() -- used on bandwidth enforcement; we 'stop' the
+ *                                server when it runs out of tasks to run.
+ *
+ *   dl_se::server_pick() -- nested pick_next_task(); we yield the period if this
+ *                           returns NULL.
+ *
+ *   dl_server_update() -- called from update_curr_common(), propagates runtime
+ *                         to the server.
+ *
+ *   dl_server_start()
+ *   dl_server_stop()  -- start/stop the server when it has (no) tasks
+ *
+ *   dl_server_init()
+ *
+ * XXX
+ */
+extern void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec);
+extern void dl_server_start(struct sched_dl_entity *dl_se);
+extern void dl_server_stop(struct sched_dl_entity *dl_se);
+extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
+		    dl_server_has_tasks_f has_tasks,
+		    dl_server_pick_f pick);
+
 #ifdef CONFIG_CGROUP_SCHED
 
 #include <linux/cgroup.h>
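
As a reading aid for the interface above, a deliberately simplified,
stand-alone toy model of the budget accounting a server goes through
(throttle on depletion, replenish a period later). All names here (toy_server,
toy_update, toy_timer) are made up, and real CBS replenishment is tied to the
entity's deadline rather than to the throttle instant, so treat this as an
illustration of the mechanism only, not as the implementation:

#include <stdio.h>
#include <stdbool.h>

struct toy_server {
	long long runtime;		/* remaining budget, us */
	long long dl_runtime;		/* budget per period, us */
	long long dl_period;		/* period, us */
	long long next_replenish;	/* absolute time of next refill, us */
	bool throttled;
};

/* Mirrors the role of update_curr_dl_se(): burn budget, throttle when empty. */
static void toy_update(struct toy_server *s, long long now, long long delta)
{
	if (s->throttled)
		return;

	s->runtime -= delta;
	if (s->runtime <= 0) {
		s->throttled = true;
		s->next_replenish = now + s->dl_period;
	}
}

/* Mirrors the role of the dl_timer callback: refill and unthrottle. */
static void toy_timer(struct toy_server *s, long long now)
{
	if (s->throttled && now >= s->next_replenish) {
		s->runtime = s->dl_runtime;
		s->throttled = false;
	}
}

int main(void)
{
	struct toy_server s = {
		.runtime = 1000, .dl_runtime = 1000, .dl_period = 20000,
	};
	long long now;

	for (now = 0; now <= 45000; now += 500) {
		toy_timer(&s, now);
		toy_update(&s, now, 500);
		printf("t=%-6lld throttled=%d runtime=%lld\n",
		       now, s.throttled, s.runtime);
	}
	return 0;
}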



^ permalink raw reply	[flat|nested] 63+ messages in thread

* [RFC][PATCH 13/13] sched/fair: Add trivial fair server
  2019-07-26 14:54 [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure Peter Zijlstra
                   ` (11 preceding siblings ...)
  2019-07-26 14:54 ` [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers Peter Zijlstra
@ 2019-07-26 14:54 ` Peter Zijlstra
  2019-07-26 20:01 ` [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure luca abeni
  2019-09-03 14:27 ` Alessio Balsini
  14 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2019-07-26 14:54 UTC (permalink / raw)
  To: mingo, juri.lelli
  Cc: linux-kernel, dietmar.eggemann, luca.abeni, bristot, balsini,
	dvyukov, tglx, vpillai, rostedt, peterz



Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c     |    6 +++++-
 kernel/sched/deadline.c |    2 ++
 kernel/sched/fair.c     |   29 +++++++++++++++++++++++++++++
 kernel/sched/sched.h    |    6 +++++-
 4 files changed, 41 insertions(+), 2 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6501,6 +6504,7 @@ void __init sched_init(void)
 #endif /* CONFIG_SMP */
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
+		fair_server_init(rq);
 	}
 
 	set_load_weight(&init_task, false);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5239,6 +5239,9 @@ enqueue_task_fair(struct rq *rq, struct
 	 */
 	util_est_enqueue(&rq->cfs, p);
 
+	if (!rq->cfs.h_nr_running)
+		dl_server_start(&rq->fair_server);
+
 	/*
 	 * If in_iowait is set, the code below may not trigger any cpufreq
 	 * utilization updates, so do it here explicitly with the IOWAIT flag
@@ -5374,6 +5377,9 @@ static void dequeue_task_fair(struct rq
 	if (!se)
 		sub_nr_running(rq, 1);
 
+	if (!rq->cfs.h_nr_running)
+		dl_server_stop(&rq->fair_server);
+
 	util_est_dequeue(&rq->cfs, p, task_sleep);
 	hrtick_update(rq);
 }
@@ -6903,6 +6909,29 @@ done: __maybe_unused;
 	return NULL;
 }
 
+static bool fair_server_has_tasks(struct sched_dl_entity *dl_se)
+{
+	return !!dl_se->rq->cfs.nr_running;
+}
+
+static struct task_struct *fair_server_pick(struct sched_dl_entity *dl_se)
+{
+	return pick_next_task_fair(dl_se->rq, NULL, NULL);
+}
+
+void fair_server_init(struct rq *rq)
+{
+	struct sched_dl_entity *dl_se = &rq->fair_server;
+
+	init_dl_entity(dl_se);
+
+	dl_se->dl_runtime = TICK_NSEC;
+	dl_se->dl_deadline = 20 * TICK_NSEC;
+	dl_se->dl_period = 20 * TICK_NSEC;
+
+	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
+}
+
 /*
  * Account for a descheduled task:
  */
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -345,6 +345,8 @@ extern void dl_server_init(struct sched_
 		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick);
 
+extern void fair_server_init(struct rq *);
+
 #ifdef CONFIG_CGROUP_SCHED
 
 #include <linux/cgroup.h>
@@ -905,6 +907,8 @@ struct rq {
 	struct rt_rq		rt;
 	struct dl_rq		dl;
 
+	struct sched_dl_entity	fair_server;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this CPU: */
 	struct list_head	leaf_cfs_rq_list;
@@ -968,7 +972,7 @@ struct rq {
 
 	/* This is used to determine avg_idle's max value */
 	u64			max_idle_balance_cost;
-#endif
+#endif /* CONFIG_SMP */
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 	u64			prev_irq_time;
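
For a rough sense of the hard-coded defaults above: dl_runtime = TICK_NSEC
with dl_period = 20 * TICK_NSEC reserves runtime/period = 1/20, i.e. about 5%
of a CPU for the fair server regardless of HZ; with HZ=250 that works out to
4ms of budget every 80ms.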



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure
  2019-07-26 14:54 [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure Peter Zijlstra
                   ` (12 preceding siblings ...)
  2019-07-26 14:54 ` [RFC][PATCH 13/13] sched/fair: Add trivial fair server Peter Zijlstra
@ 2019-07-26 20:01 ` luca abeni
  2019-09-03 14:27 ` Alessio Balsini
  14 siblings, 0 replies; 63+ messages in thread
From: luca abeni @ 2019-07-26 20:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, linux-kernel, dietmar.eggemann, bristot,
	balsini, dvyukov, tglx, vpillai, rostedt

Hi Peter,

On Fri, 26 Jul 2019 16:54:09 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> Hi,
> 
> So recently syzcaller ran into the big deadline/period issue (again),
> and I figured I should at least propose a patch that puts limits on
> that -- see Patch 1.
> 
> During that discussion; SCHED_OTHER servers got mentioned (also
> again), and I figured I should have a poke at that too. So I took
> some inspiration from patches Alessio Balsini send a while back and
> cobbled something together for that too.

I think Patch 1 is a very good idea!

The server patches look interesting (and they seem to be much simpler
than our patchset :). I need to have a better look at them, but this
seems to be very promising.



			Thanks,
				Luca



> 
> Included are also a bunch of patches I did for core scheduling (2-8),
> which I'm probably going to just merge as they're generic cleanups.
> They're included here because they make pick_next_task() simpler and
> thereby thinking about the nested pick_next_task() logic inherent in
> servers was less of a head-ache. (I think it can work fine without
> them, but its easier with them on)
> 
> Anyway; after it all compiled it booted a kvm image all the way to
> userspace on the first run, so clearly this code isn't to be trusted
> at all.
> 
> There's still lots of missing bits and pieces -- like changelogs and
> the fair server isn't configurable or hooked into the bandwidth
> accounting, but the foundation is there I think.
> 
> Enjoy!
> 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period
  2019-07-26 14:54 ` [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period Peter Zijlstra
@ 2019-07-29  8:57   ` Juri Lelli
  2019-07-29 11:45     ` Daniel Bristot de Oliveira
  2019-08-02 17:21   ` Alessio Balsini
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 63+ messages in thread
From: Juri Lelli @ 2019-07-29  8:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, dietmar.eggemann, luca.abeni, bristot,
	balsini, dvyukov, tglx, vpillai, rostedt

Hi,

On 26/07/19 16:54, Peter Zijlstra wrote:
> 
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Luca Abeni <luca.abeni@santannapisa.it>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Dmitry Vyukov <dvyukov@google.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/sched/sysctl.h |    3 +++
>  kernel/sched/deadline.c      |   23 +++++++++++++++++++++--
>  kernel/sysctl.c              |   14 ++++++++++++++
>  3 files changed, 38 insertions(+), 2 deletions(-)
> 
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -56,6 +56,9 @@ int sched_proc_update_handler(struct ctl
>  extern unsigned int sysctl_sched_rt_period;
>  extern int sysctl_sched_rt_runtime;
>  
> +extern unsigned int sysctl_sched_dl_period_max;
> +extern unsigned int sysctl_sched_dl_period_min;
> +
>  #ifdef CONFIG_UCLAMP_TASK
>  extern unsigned int sysctl_sched_uclamp_util_min;
>  extern unsigned int sysctl_sched_uclamp_util_max;
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2597,6 +2597,14 @@ void __getparam_dl(struct task_struct *p
>  }
>  
>  /*
> + * Default limits for DL period; on the top end we guard against small util
> + * tasks still getting rediculous long effective runtimes, on the bottom end we

s/rediculous/ridiculous/

> + * guard against timer DoS.
> + */
> +unsigned int sysctl_sched_dl_period_max = 1 << 22; /* ~4 seconds */
> +unsigned int sysctl_sched_dl_period_min = 100;     /* 100 us */

These limits look sane to me. I've actually been experimenting with 10us
period tasks and throttling seemed to behave fine, but I guess 100us is
a saner default.

So, (with a few lines of changelog :)

Acked-by: Juri Lelli <juri.lelli@redhat.com>

Thanks for doing this.

Best,

Juri

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 04/13] sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  2019-07-26 14:54 ` [RFC][PATCH 04/13] sched/{rt,deadline}: Fix set_next_task vs pick_next_task Peter Zijlstra
@ 2019-07-29  9:25   ` Juri Lelli
  2019-07-29 11:15     ` Peter Zijlstra
  0 siblings, 1 reply; 63+ messages in thread
From: Juri Lelli @ 2019-07-29  9:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, dietmar.eggemann, luca.abeni, bristot,
	balsini, dvyukov, tglx, vpillai, rostedt

Hi,

On 26/07/19 16:54, Peter Zijlstra wrote:
> Because pick_next_task() implies set_curr_task() and some of the
> details haven't matter too much, some of what _should_ be in
> set_curr_task() ended up in pick_next_task, correct this.
> 
> This prepares the way for a pick_next_task() variant that does not
> affect the current state; allowing remote picking.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/deadline.c |   23 ++++++++++++-----------
>  kernel/sched/rt.c       |   27 ++++++++++++++-------------
>  2 files changed, 26 insertions(+), 24 deletions(-)
> 
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1694,12 +1694,21 @@ static void start_hrtick_dl(struct rq *r
>  }
>  #endif
>  
> -static inline void set_next_task(struct rq *rq, struct task_struct *p)
> +static void set_next_task_dl(struct rq *rq, struct task_struct *p)
>  {
>  	p->se.exec_start = rq_clock_task(rq);
>  
>  	/* You can't push away the running task */
>  	dequeue_pushable_dl_task(rq, p);
> +
> +	if (hrtick_enabled(rq))
> +		start_hrtick_dl(rq, p);
> +
> +	if (rq->curr->sched_class != &dl_sched_class)
> +		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
> +
> +	if (rq->curr != p)
> +		deadline_queue_push_tasks(rq);

It's a minor thing, but I was wondering why you added the check on curr.
deadline_queue_push_tasks() already checks if there are pushable tasks,
plus curr can still be of a different class at this point?

Thanks,

Juri

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 04/13] sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  2019-07-29  9:25   ` Juri Lelli
@ 2019-07-29 11:15     ` Peter Zijlstra
  2019-07-29 11:27       ` Juri Lelli
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2019-07-29 11:15 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, linux-kernel, dietmar.eggemann, luca.abeni, bristot,
	balsini, dvyukov, tglx, vpillai, rostedt

On Mon, Jul 29, 2019 at 11:25:19AM +0200, Juri Lelli wrote:
> Hi,
> 
> On 26/07/19 16:54, Peter Zijlstra wrote:
> > Because pick_next_task() implies set_curr_task() and some of the
> > details haven't matter too much, some of what _should_ be in
> > set_curr_task() ended up in pick_next_task, correct this.
> > 
> > This prepares the way for a pick_next_task() variant that does not
> > affect the current state; allowing remote picking.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  kernel/sched/deadline.c |   23 ++++++++++++-----------
> >  kernel/sched/rt.c       |   27 ++++++++++++++-------------
> >  2 files changed, 26 insertions(+), 24 deletions(-)
> > 
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -1694,12 +1694,21 @@ static void start_hrtick_dl(struct rq *r
> >  }
> >  #endif
> >  
> > -static inline void set_next_task(struct rq *rq, struct task_struct *p)
> > +static void set_next_task_dl(struct rq *rq, struct task_struct *p)
> >  {
> >  	p->se.exec_start = rq_clock_task(rq);
> >  
> >  	/* You can't push away the running task */
> >  	dequeue_pushable_dl_task(rq, p);
> > +
> > +	if (hrtick_enabled(rq))
> > +		start_hrtick_dl(rq, p);
> > +
> > +	if (rq->curr->sched_class != &dl_sched_class)
> > +		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
> > +
> > +	if (rq->curr != p)
> > +		deadline_queue_push_tasks(rq);
> 
> It's a minor thing, but I was wondering why you added the check on curr.
> deadline_queue_push_tasks() already checks if there are pushable tasks,
> plus curr can still be of a different class at this point?

Hmm, so by moving that code into set_next_task() it is exposed to the:

  if (queued)
    dequeue_task();
  if (running)
    put_prev_task();

  /* do stuff */

  if (queued)
    enqueue_task();
  if (running)
    set_next_task();

pattern from core.c; and in that case nothing changes. That said; I
might've gotten it wrong.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 04/13] sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  2019-07-29 11:15     ` Peter Zijlstra
@ 2019-07-29 11:27       ` Juri Lelli
  2019-07-29 13:04         ` Peter Zijlstra
  0 siblings, 1 reply; 63+ messages in thread
From: Juri Lelli @ 2019-07-29 11:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, dietmar.eggemann, luca.abeni, bristot,
	balsini, dvyukov, tglx, vpillai, rostedt

On 29/07/19 13:15, Peter Zijlstra wrote:
> On Mon, Jul 29, 2019 at 11:25:19AM +0200, Juri Lelli wrote:
> > Hi,
> > 
> > On 26/07/19 16:54, Peter Zijlstra wrote:
> > > Because pick_next_task() implies set_curr_task() and some of the
> > > details haven't matter too much, some of what _should_ be in
> > > set_curr_task() ended up in pick_next_task, correct this.
> > > 
> > > This prepares the way for a pick_next_task() variant that does not
> > > affect the current state; allowing remote picking.
> > > 
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > >  kernel/sched/deadline.c |   23 ++++++++++++-----------
> > >  kernel/sched/rt.c       |   27 ++++++++++++++-------------
> > >  2 files changed, 26 insertions(+), 24 deletions(-)
> > > 
> > > --- a/kernel/sched/deadline.c
> > > +++ b/kernel/sched/deadline.c
> > > @@ -1694,12 +1694,21 @@ static void start_hrtick_dl(struct rq *r
> > >  }
> > >  #endif
> > >  
> > > -static inline void set_next_task(struct rq *rq, struct task_struct *p)
> > > +static void set_next_task_dl(struct rq *rq, struct task_struct *p)
> > >  {
> > >  	p->se.exec_start = rq_clock_task(rq);
> > >  
> > >  	/* You can't push away the running task */
> > >  	dequeue_pushable_dl_task(rq, p);
> > > +
> > > +	if (hrtick_enabled(rq))
> > > +		start_hrtick_dl(rq, p);
> > > +
> > > +	if (rq->curr->sched_class != &dl_sched_class)
> > > +		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
> > > +
> > > +	if (rq->curr != p)
> > > +		deadline_queue_push_tasks(rq);
> > 
> > It's a minor thing, but I was wondering why you added the check on curr.
> > deadline_queue_push_tasks() already checks if there are pushable tasks,
> > plus curr can still be of a different class at this point?
> 
> Hmm, so by moving that code into set_next_task() it is exposed to the:
> 
>   if (queued)
>     dequeue_task();
>   if (running)
>     put_prev_task();
> 
>   /* do stuff */
> 
>   if (queued)
>     enqueue_task();
>   if (running)
>     set_next_task();
> 
> pattern from core.c; and in that case nothing changes. That said; I
> might've gotten it wrong.

Right. But, I was wondering about the __schedule()->pick_next_task()
case, where, say, prev (rq->curr) is RT/CFS and next (p) is DEADLINE.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period
  2019-07-29  8:57   ` Juri Lelli
@ 2019-07-29 11:45     ` Daniel Bristot de Oliveira
  0 siblings, 0 replies; 63+ messages in thread
From: Daniel Bristot de Oliveira @ 2019-07-29 11:45 UTC (permalink / raw)
  To: Juri Lelli, Peter Zijlstra
  Cc: mingo, linux-kernel, dietmar.eggemann, luca.abeni, balsini,
	dvyukov, tglx, vpillai, rostedt

On 29/07/2019 10:57, Juri Lelli wrote:
> Hi,
> 
> On 26/07/19 16:54, Peter Zijlstra wrote:
>> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
>> Cc: Luca Abeni <luca.abeni@santannapisa.it>
>> Cc: Juri Lelli <juri.lelli@redhat.com>
>> Cc: Dmitry Vyukov <dvyukov@google.com>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> ---
>>  include/linux/sched/sysctl.h |    3 +++
>>  kernel/sched/deadline.c      |   23 +++++++++++++++++++++--
>>  kernel/sysctl.c              |   14 ++++++++++++++
>>  3 files changed, 38 insertions(+), 2 deletions(-)
>>
>> --- a/include/linux/sched/sysctl.h
>> +++ b/include/linux/sched/sysctl.h
>> @@ -56,6 +56,9 @@ int sched_proc_update_handler(struct ctl
>>  extern unsigned int sysctl_sched_rt_period;
>>  extern int sysctl_sched_rt_runtime;
>>  
>> +extern unsigned int sysctl_sched_dl_period_max;
>> +extern unsigned int sysctl_sched_dl_period_min;
>> +
>>  #ifdef CONFIG_UCLAMP_TASK
>>  extern unsigned int sysctl_sched_uclamp_util_min;
>>  extern unsigned int sysctl_sched_uclamp_util_max;
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -2597,6 +2597,14 @@ void __getparam_dl(struct task_struct *p
>>  }
>>  
>>  /*
>> + * Default limits for DL period; on the top end we guard against small util
>> + * tasks still getting rediculous long effective runtimes, on the bottom end we
> s/rediculous/ridiculous/
> 
>> + * guard against timer DoS.
>> + */
>> +unsigned int sysctl_sched_dl_period_max = 1 << 22; /* ~4 seconds */
>> +unsigned int sysctl_sched_dl_period_min = 100;     /* 100 us */
> These limits look sane to me. I've actually been experimenting with 10us
> period tasks and throttling seemed to behave fine, but I guess 100us is
> a saner default.
> 
> So, (with a few lines of changelog :)
> 
> Acked-by: Juri Lelli <juri.lelli@redhat.com>


Looks sane to me too!

Acked-by: Daniel Bristot de Oliveira <bristot@redhat.com>

-- Daniel

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 04/13] sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  2019-07-29 11:27       ` Juri Lelli
@ 2019-07-29 13:04         ` Peter Zijlstra
  2019-07-29 13:17           ` Juri Lelli
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2019-07-29 13:04 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, linux-kernel, dietmar.eggemann, luca.abeni, bristot,
	balsini, dvyukov, tglx, vpillai, rostedt

On Mon, Jul 29, 2019 at 01:27:02PM +0200, Juri Lelli wrote:
> On 29/07/19 13:15, Peter Zijlstra wrote:
> > On Mon, Jul 29, 2019 at 11:25:19AM +0200, Juri Lelli wrote:
> > > Hi,
> > > 
> > > On 26/07/19 16:54, Peter Zijlstra wrote:
> > > > Because pick_next_task() implies set_curr_task() and some of the
> > > > details haven't matter too much, some of what _should_ be in
> > > > set_curr_task() ended up in pick_next_task, correct this.
> > > > 
> > > > This prepares the way for a pick_next_task() variant that does not
> > > > affect the current state; allowing remote picking.
> > > > 
> > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > > ---
> > > >  kernel/sched/deadline.c |   23 ++++++++++++-----------
> > > >  kernel/sched/rt.c       |   27 ++++++++++++++-------------
> > > >  2 files changed, 26 insertions(+), 24 deletions(-)
> > > > 
> > > > --- a/kernel/sched/deadline.c
> > > > +++ b/kernel/sched/deadline.c
> > > > @@ -1694,12 +1694,21 @@ static void start_hrtick_dl(struct rq *r
> > > >  }
> > > >  #endif
> > > >  
> > > > -static inline void set_next_task(struct rq *rq, struct task_struct *p)
> > > > +static void set_next_task_dl(struct rq *rq, struct task_struct *p)
> > > >  {
> > > >  	p->se.exec_start = rq_clock_task(rq);
> > > >  
> > > >  	/* You can't push away the running task */
> > > >  	dequeue_pushable_dl_task(rq, p);
> > > > +
> > > > +	if (hrtick_enabled(rq))
> > > > +		start_hrtick_dl(rq, p);
> > > > +
> > > > +	if (rq->curr->sched_class != &dl_sched_class)
> > > > +		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
> > > > +
> > > > +	if (rq->curr != p)
> > > > +		deadline_queue_push_tasks(rq);
> > > 
> > > It's a minor thing, but I was wondering why you added the check on curr.
> > > deadline_queue_push_tasks() already checks if there are pushable tasks,
> > > plus curr can still be of a different class at this point?
> > 
> > Hmm, so by moving that code into set_next_task() it is exposed to the:
> > 
> >   if (queued)
> >     dequeue_task();
> >   if (running)
> >     put_prev_task();
> > 
> >   /* do stuff */
> > 
> >   if (queued)
> >     enqueue_task();
> >   if (running)
> >     set_next_task();
> > 
> > pattern from core.c; and in that case nothing changes. That said; I
> > might've gotten it wrong.
> 
> Right. But, I was wondering about the __schedule()->pick_next_task()
> case, where, say, prev (rq->curr) is RT/CFS and next (p) is DEADLINE.

So we do pick_next_task() first and then set rq->curr (obviously). So
the first set_next_task() will see rq->curr != p and we'll do the push
balance stuff.

Then the above pattern will always see rq->curr == p and we'll not
trigger push balancing.

Now, looking at it, this also doesn't do push balancing when we
re-select the same task, even though we really should be doing it. So I
suppose not adding the condition, and always doing the push balance,
while wasteful, is not wrong.

Hmm?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 04/13] sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  2019-07-29 13:04         ` Peter Zijlstra
@ 2019-07-29 13:17           ` Juri Lelli
  2019-07-29 14:40             ` Peter Zijlstra
  0 siblings, 1 reply; 63+ messages in thread
From: Juri Lelli @ 2019-07-29 13:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, dietmar.eggemann, luca.abeni, bristot,
	balsini, dvyukov, tglx, vpillai, rostedt

On 29/07/19 15:04, Peter Zijlstra wrote:
> On Mon, Jul 29, 2019 at 01:27:02PM +0200, Juri Lelli wrote:
> > On 29/07/19 13:15, Peter Zijlstra wrote:
> > > On Mon, Jul 29, 2019 at 11:25:19AM +0200, Juri Lelli wrote:
> > > > Hi,
> > > > 
> > > > On 26/07/19 16:54, Peter Zijlstra wrote:
> > > > > Because pick_next_task() implies set_curr_task() and some of the
> > > > > details haven't matter too much, some of what _should_ be in
> > > > > set_curr_task() ended up in pick_next_task, correct this.
> > > > > 
> > > > > This prepares the way for a pick_next_task() variant that does not
> > > > > affect the current state; allowing remote picking.
> > > > > 
> > > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > > > ---
> > > > >  kernel/sched/deadline.c |   23 ++++++++++++-----------
> > > > >  kernel/sched/rt.c       |   27 ++++++++++++++-------------
> > > > >  2 files changed, 26 insertions(+), 24 deletions(-)
> > > > > 
> > > > > --- a/kernel/sched/deadline.c
> > > > > +++ b/kernel/sched/deadline.c
> > > > > @@ -1694,12 +1694,21 @@ static void start_hrtick_dl(struct rq *r
> > > > >  }
> > > > >  #endif
> > > > >  
> > > > > -static inline void set_next_task(struct rq *rq, struct task_struct *p)
> > > > > +static void set_next_task_dl(struct rq *rq, struct task_struct *p)
> > > > >  {
> > > > >  	p->se.exec_start = rq_clock_task(rq);
> > > > >  
> > > > >  	/* You can't push away the running task */
> > > > >  	dequeue_pushable_dl_task(rq, p);
> > > > > +
> > > > > +	if (hrtick_enabled(rq))
> > > > > +		start_hrtick_dl(rq, p);
> > > > > +
> > > > > +	if (rq->curr->sched_class != &dl_sched_class)
> > > > > +		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
> > > > > +
> > > > > +	if (rq->curr != p)
> > > > > +		deadline_queue_push_tasks(rq);
> > > > 
> > > > It's a minor thing, but I was wondering why you added the check on curr.
> > > > deadline_queue_push_tasks() already checks if there are pushable tasks,
> > > > plus curr can still be of a different class at this point?
> > > 
> > > Hmm, so by moving that code into set_next_task() it is exposed to the:
> > > 
> > >   if (queued)
> > >     dequeue_task();
> > >   if (running)
> > >     put_prev_task();
> > > 
> > >   /* do stuff */
> > > 
> > >   if (queued)
> > >     enqueue_task();
> > >   if (running)
> > >     set_next_task();
> > > 
> > > pattern from core.c; and in that case nothing changes. That said; I
> > > might've gotten it wrong.
> > 
> > Right. But, I was wondering about the __schedule()->pick_next_task()
> > case, where, say, prev (rq->curr) is RT/CFS and next (p) is DEADLINE.
> 
> So we do pick_next_task() first and then set rq->curr (obviously). So
> the first set_next_task() will see rq->curr != p and we'll do the push
> balance stuff.
> 
> Then the above pattern will always see rq->curr == p and we'll not
> trigger push balancing.
> 
> Now, looking at it, this also doesn't do push balancing when we
> re-select the same task, even though we really should be doing it. So I
> suppose not adding the condition, and always doing the push balance,
> while wasteful, is not wrong.

Right, also because deadline_queue_push_tasks() already checks if there
are tasks to potentially push around before queuing the balance
callback.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 04/13] sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  2019-07-29 13:17           ` Juri Lelli
@ 2019-07-29 14:40             ` Peter Zijlstra
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2019-07-29 14:40 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, linux-kernel, dietmar.eggemann, luca.abeni, bristot,
	balsini, dvyukov, tglx, vpillai, rostedt

On Mon, Jul 29, 2019 at 03:17:01PM +0200, Juri Lelli wrote:
> On 29/07/19 15:04, Peter Zijlstra wrote:

> > Now, looking at it, this also doesn't do push balancing when we
> > re-select the same task, even though we really should be doing it. So I
> > suppose not adding the condition, and always doing the push balance,
> > while wasteful, is not wrong.
> 
> Right, also because deadline_queue_push_tasks() already checks if there
> are tasks to potentially push around before queuing the balance
> callback.

Yes, but in the overloaded case, where there is always a task to push,
but nowhere to push it to, we can waste a 'lot' of time looking for
naught in case of extra pushes.

So in that regard the check you reference is not sufficient.

Anyway, let me change this for now.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 02/13] stop_machine: Fix stop_cpus_in_progress ordering
  2019-07-26 14:54 ` [RFC][PATCH 02/13] stop_machine: Fix stop_cpus_in_progress ordering Peter Zijlstra
@ 2019-07-30 13:16   ` Phil Auld
  2019-07-30 13:22   ` Steven Rostedt
  1 sibling, 0 replies; 63+ messages in thread
From: Phil Auld @ 2019-07-30 13:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, linux-kernel, dietmar.eggemann, luca.abeni,
	bristot, balsini, dvyukov, tglx, vpillai, rostedt,
	Valentin Schneider, Aaron Lu, keescook, Pawan Gupta, torvalds,
	Tim Chen, fweisbec, subhra.mazumdar, Julien Desfossez, pjt,
	Nishanth Aravamudan, Aubrey Li, Mel Gorman, kerrnel,
	Paolo Bonzini

On Fri, Jul 26, 2019 at 04:54:11PM +0200 Peter Zijlstra wrote:
> Make sure the entire for loop has stop_cpus_in_progress set.
> 
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Aaron Lu <aaron.lwe@gmail.com>
> Cc: keescook@chromium.org
> Cc: mingo@kernel.org
> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
> Cc: Phil Auld <pauld@redhat.com>
> Cc: torvalds@linux-foundation.org
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: fweisbec@gmail.com
> Cc: subhra.mazumdar@oracle.com
> Cc: tglx@linutronix.de
> Cc: Julien Desfossez <jdesfossez@digitalocean.com>
> Cc: pjt@google.com
> Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
> Cc: Aubrey Li <aubrey.intel@gmail.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: kerrnel@google.com
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/0fd8fd4b99b9b9aa88d8b2dff897f7fd0d88f72c.1559129225.git.vpillai@digitalocean.com
> ---
>  kernel/stop_machine.c |    2 ++
>  1 file changed, 2 insertions(+)
> 
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -383,6 +383,7 @@ static bool queue_stop_cpus_work(const s
>  	 */
>  	preempt_disable();
>  	stop_cpus_in_progress = true;
> +	barrier();
>  	for_each_cpu(cpu, cpumask) {
>  		work = &per_cpu(cpu_stopper.stop_work, cpu);
>  		work->fn = fn;
> @@ -391,6 +392,7 @@ static bool queue_stop_cpus_work(const s
>  		if (cpu_stop_queue_work(cpu, work))
>  			queued = true;
>  	}
> +	barrier();
>  	stop_cpus_in_progress = false;
>  	preempt_enable();
>  
> 
> 

This looks good.

Reviewed-by: Phil Auld <pauld@redhat.com>


-- 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 02/13] stop_machine: Fix stop_cpus_in_progress ordering
  2019-07-26 14:54 ` [RFC][PATCH 02/13] stop_machine: Fix stop_cpus_in_progress ordering Peter Zijlstra
  2019-07-30 13:16   ` Phil Auld
@ 2019-07-30 13:22   ` Steven Rostedt
  1 sibling, 0 replies; 63+ messages in thread
From: Steven Rostedt @ 2019-07-30 13:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, linux-kernel, dietmar.eggemann, luca.abeni,
	bristot, balsini, dvyukov, tglx, vpillai, Valentin Schneider,
	Aaron Lu, keescook, Pawan Gupta, Phil Auld, torvalds, Tim Chen,
	fweisbec, subhra.mazumdar, Julien Desfossez, pjt,
	Nishanth Aravamudan, Aubrey Li, Mel Gorman, kerrnel,
	Paolo Bonzini

On Fri, 26 Jul 2019 16:54:11 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> Make sure the entire for loop has stop_cpus_in_progress set.


>  kernel/stop_machine.c |    2 ++
>  1 file changed, 2 insertions(+)
> 
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -383,6 +383,7 @@ static bool queue_stop_cpus_work(const s
>  	 */
>  	preempt_disable();
>  	stop_cpus_in_progress = true;
> +	barrier();

Like smp_mb() shouldn't barrier() have a comment to what is being
ordered and why?

-- Steve

>  	for_each_cpu(cpu, cpumask) {
>  		work = &per_cpu(cpu_stopper.stop_work, cpu);
>  		work->fn = fn;
> @@ -391,6 +392,7 @@ static bool queue_stop_cpus_work(const s
>  		if (cpu_stop_queue_work(cpu, work))
>  			queued = true;
>  	}
> +	barrier();
>  	stop_cpus_in_progress = false;
>  	preempt_enable();
>  
> 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period
  2019-07-26 14:54 ` [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period Peter Zijlstra
  2019-07-29  8:57   ` Juri Lelli
@ 2019-08-02 17:21   ` Alessio Balsini
  2019-08-05 11:53     ` Peter Zijlstra
  2020-05-20 18:38   ` [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period Juri Lelli
  2020-06-16 12:21   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  3 siblings, 1 reply; 63+ messages in thread
From: Alessio Balsini @ 2019-08-02 17:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, linux-kernel, dietmar.eggemann, luca.abeni,
	bristot, dvyukov, tglx, vpillai, rostedt

Hi Peter,

This is indeed an important missing piece.

I think it would be worth having some simple checks, e.g.:
- period_min <= period_max;
- period_min >= (1ULL << DL_SCALE).
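
A stand-alone sketch of those two checks, just to make the intent concrete
(all names here are made up, and DL_SCALE is assumed to be the kernel's
value of 10):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define DL_SCALE	10		/* assumed, see kernel/sched/sched.h */
#define NSEC_PER_USEC	1000ULL

/* Validate the two sysctl values together, working in nanoseconds. */
static bool dl_period_limits_valid(uint64_t min_us, uint64_t max_us)
{
	uint64_t min_ns = min_us * NSEC_PER_USEC;
	uint64_t max_ns = max_us * NSEC_PER_USEC;

	if (min_ns > max_ns)			/* inverted range */
		return false;
	if (min_ns < (1ULL << DL_SCALE))	/* below DL resolution */
		return false;

	return true;
}

int main(void)
{
	/* current defaults: accepted */
	printf("%d\n", dl_period_limits_valid(100, 1 << 22));
	/* min > max: rejected */
	printf("%d\n", dl_period_limits_valid(500, 100));
	return 0;
}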

To be even more picky, I'm wondering if it would be a good practice to
fail the write operation if there are already dl-tasks in the system
that violate the new constraint.  I'm afraid there is no cheap way of
achieving such a check, so I think we can skip this last tricky thing for
now.

Thanks,
Alessio

On Fri, Jul 26, 2019 at 04:54:10PM +0200, Peter Zijlstra wrote:
> 
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Luca Abeni <luca.abeni@santannapisa.it>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Dmitry Vyukov <dvyukov@google.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/sched/sysctl.h |    3 +++
>  kernel/sched/deadline.c      |   23 +++++++++++++++++++++--
>  kernel/sysctl.c              |   14 ++++++++++++++
>  3 files changed, 38 insertions(+), 2 deletions(-)
> 
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -56,6 +56,9 @@ int sched_proc_update_handler(struct ctl
>  extern unsigned int sysctl_sched_rt_period;
>  extern int sysctl_sched_rt_runtime;
>  
> +extern unsigned int sysctl_sched_dl_period_max;
> +extern unsigned int sysctl_sched_dl_period_min;
> +
>  #ifdef CONFIG_UCLAMP_TASK
>  extern unsigned int sysctl_sched_uclamp_util_min;
>  extern unsigned int sysctl_sched_uclamp_util_max;
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2597,6 +2597,14 @@ void __getparam_dl(struct task_struct *p
>  }
>  
>  /*
> + * Default limits for DL period; on the top end we guard against small util
> + * tasks still getting ridiculously long effective runtimes, on the bottom end we
> + * guard against timer DoS.
> + */
> +unsigned int sysctl_sched_dl_period_max = 1 << 22; /* ~4 seconds */
> +unsigned int sysctl_sched_dl_period_min = 100;     /* 100 us */
> +
> +/*
>   * This function validates the new parameters of a -deadline task.
>   * We ask for the deadline not being zero, and greater or equal
>   * than the runtime, as well as the period of being zero or
> @@ -2608,6 +2616,8 @@ void __getparam_dl(struct task_struct *p
>   */
>  bool __checkparam_dl(const struct sched_attr *attr)
>  {
> +	u64 period, max, min;
> +
>  	/* special dl tasks don't actually use any parameter */
>  	if (attr->sched_flags & SCHED_FLAG_SUGOV)
>  		return true;
> @@ -2631,12 +2641,21 @@ bool __checkparam_dl(const struct sched_
>  	    attr->sched_period & (1ULL << 63))
>  		return false;
>  
> +	period = attr->sched_period;
> +	if (!period)
> +		period = attr->sched_deadline;
> +
>  	/* runtime <= deadline <= period (if period != 0) */
> -	if ((attr->sched_period != 0 &&
> -	     attr->sched_period < attr->sched_deadline) ||
> +	if (period < attr->sched_deadline ||
>  	    attr->sched_deadline < attr->sched_runtime)
>  		return false;
>  
> +	max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC;
> +	min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC;
> +
> +	if (period < min || period > max)
> +		return false;
> +
>  	return true;
>  }
>  
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -443,6 +443,20 @@ static struct ctl_table kern_table[] = {
>  		.proc_handler	= sched_rt_handler,
>  	},
>  	{
> +		.procname	= "sched_deadline_period_max_us",
> +		.data		= &sysctl_sched_dl_period_max,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +	{
> +		.procname	= "sched_deadline_period_min_us",
> +		.data		= &sysctl_sched_dl_period_min,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +	{
>  		.procname	= "sched_rr_timeslice_ms",
>  		.data		= &sysctl_sched_rr_timeslice,
>  		.maxlen		= sizeof(int),
> 
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period
  2019-08-02 17:21   ` Alessio Balsini
@ 2019-08-05 11:53     ` Peter Zijlstra
  2019-08-22 12:29       ` Alessio Balsini
  2019-10-23 17:17       ` [PATCH 4.4 4.9 4.14] loop: Add LOOP_SET_DIRECT_IO to compat ioctl Alessio Balsini
  0 siblings, 2 replies; 63+ messages in thread
From: Peter Zijlstra @ 2019-08-05 11:53 UTC (permalink / raw)
  To: Alessio Balsini
  Cc: mingo, juri.lelli, linux-kernel, dietmar.eggemann, luca.abeni,
	bristot, dvyukov, tglx, vpillai, rostedt

On Fri, Aug 02, 2019 at 06:21:04PM +0100, Alessio Balsini wrote:
> Hi Peter,
> 
> This is indeed an important missing piece.
> 
> I think it would be worth having some simple checks, e.g.:
> - period_min <= period_max;
> - period_min >= (1ULL << DL_SCALE).
> 
> To be even more picky, I'm wondering if it would be a good practice to
> fail the write operation if there are already dl-tasks in the system
> that violate the new constraint.  I'm afraid there is no cheap way of
> achieving such a check, so I think we can skip this last tricky thing for
> now.

Like so?

---
Subject: sched/deadline: Impose global limits on sched_attr::sched_period
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue Jul 23 16:01:29 CEST 2019

There are two DoS scenarios with SCHED_DEADLINE related to
sched_attr::sched_period:

 - since access-control only looks at utilization and density, a very
   large period can allow a very large runtime, which in turn can
   incur a very large latency to lower priority tasks.

 - for very short periods we can end up spending more time programming
   the hardware timer than actually running the task.

Mitigate these by imposing limits on the period.
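
To make the first scenario concrete, a toy calculation (illustrative numbers
only, nothing from an actual reproducer):

#include <stdio.h>

int main(void)
{
	/* An implicit-deadline task: deadline == period. */
	unsigned long long runtime_ns = 4ULL   * 1000000000ULL;	/*   4 s */
	unsigned long long period_ns  = 400ULL * 1000000000ULL;	/* 400 s */

	/*
	 * Utilization (and, with deadline == period, density) is tiny, so
	 * utilization-based admission control accepts the task ...
	 */
	double util = (double)runtime_ns / (double)period_ns;

	/* ... yet it may legally hog a CPU for a 4 second burst. */
	printf("util = %.3f, worst-case burst = %llu ns\n", util, runtime_ns);
	return 0;
}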

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
---
 include/linux/sched/sysctl.h |    7 ++++
 kernel/sched/deadline.c      |   61 +++++++++++++++++++++++++++++++++++++++++--
 kernel/sysctl.c              |   14 +++++++++
 3 files changed, 80 insertions(+), 2 deletions(-)

--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -56,6 +56,13 @@ int sched_proc_update_handler(struct ctl
 extern unsigned int sysctl_sched_rt_period;
 extern int sysctl_sched_rt_runtime;
 
+extern unsigned int sysctl_sched_dl_period_max;
+extern unsigned int sysctl_sched_dl_period_min;
+
+extern int sched_dl_period_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+
 #ifdef CONFIG_UCLAMP_TASK
 extern unsigned int sysctl_sched_uclamp_util_min;
 extern unsigned int sysctl_sched_uclamp_util_max;
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2660,6 +2660,52 @@ void __getparam_dl(struct task_struct *p
 }
 
 /*
+ * Default limits for DL period; on the top end we guard against small util
+ * tasks still getting ridiculously long effective runtimes, on the bottom end we
+ * guard against timer DoS.
+ */
+unsigned int sysctl_sched_dl_period_max = 1 << 22; /* ~4 seconds */
+unsigned int sysctl_sched_dl_period_min = 100;     /* 100 us */
+
+int sched_dl_period_handler(struct ctl_table *table, int write,
+			    void __user *buffer, size_t *lenp,
+			    loff_t *ppos)
+{
+	unsigned int old_min, old_max;
+	static DEFINE_MUTEX(mutex);
+	int ret;
+
+	mutex_lock(&mutex);
+	old_min = sysctl_sched_dl_period_min;
+	old_max = sysctl_sched_dl_period_max;
+
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		u64 min = (u64)sysctl_sched_dl_period_min * NSEC_PER_USEC;
+		u64 max = (u64)sysctl_sched_dl_period_max * NSEC_PER_USEC;
+
+		ret = -EINVAL;
+		if (min < 1ULL << DL_SCALE)
+			goto undo;
+
+		if (max < min)
+			goto undo;
+
+		ret = 0;
+	}
+
+	if (0) {
+undo:
+		sysctl_sched_dl_period_min = old_min;
+		sysctl_sched_dl_period_max = old_max;
+	}
+	mutex_unlock(&mutex);
+
+	return ret;
+}
+
+/*
  * This function validates the new parameters of a -deadline task.
  * We ask for the deadline not being zero, and greater or equal
  * than the runtime, as well as the period of being zero or
@@ -2671,6 +2717,8 @@ void __getparam_dl(struct task_struct *p
  */
 bool __checkparam_dl(const struct sched_attr *attr)
 {
+	u64 period, max, min;
+
 	/* special dl tasks don't actually use any parameter */
 	if (attr->sched_flags & SCHED_FLAG_SUGOV)
 		return true;
@@ -2694,12 +2742,21 @@ bool __checkparam_dl(const struct sched_
 	    attr->sched_period & (1ULL << 63))
 		return false;
 
+	period = attr->sched_period;
+	if (!period)
+		period = attr->sched_deadline;
+
 	/* runtime <= deadline <= period (if period != 0) */
-	if ((attr->sched_period != 0 &&
-	     attr->sched_period < attr->sched_deadline) ||
+	if (period < attr->sched_deadline ||
 	    attr->sched_deadline < attr->sched_runtime)
 		return false;
 
+	max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC;
+	min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC;
+
+	if (period < min || period > max)
+		return false;
+
 	return true;
 }
 
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -443,6 +443,20 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= sched_rt_handler,
 	},
 	{
+		.procname	= "sched_deadline_period_max_us",
+		.data		= &sysctl_sched_dl_period_max,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sched_dl_period_handler,
+	},
+	{
+		.procname	= "sched_deadline_period_min_us",
+		.data		= &sysctl_sched_dl_period_min,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sched_dl_period_handler,
+	},
+	{
 		.procname	= "sched_rr_timeslice_ms",
 		.data		= &sysctl_sched_rr_timeslice,
 		.maxlen		= sizeof(int),
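
For completeness, a user-space sketch of how the new limit would show up for a
task with a tiny utilization but a huge period (the sched_attr layout and the
SCHED_DEADLINE value are taken from the UAPI headers; SYS_sched_setattr is
assumed to be exposed by libc, and the usual privileges apply):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#define SCHED_DEADLINE	6	/* include/uapi/linux/sched.h */

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = SCHED_DEADLINE;
	attr.sched_runtime = 10ULL * 1000 * 1000;		/*  10 ms */
	attr.sched_deadline = 100ULL * 1000 * 1000;		/* 100 ms */
	attr.sched_period = 3600ULL * 1000 * 1000 * 1000;	/* 1 hour */

	/*
	 * Utilization is tiny (10 ms every hour), so the utilization/density
	 * checks pass; with the patch applied the one hour period exceeds
	 * sched_deadline_period_max_us and the call fails with EINVAL.
	 */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");
	return 0;
}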

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers
  2019-07-26 14:54 ` [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers Peter Zijlstra
@ 2019-08-07 16:31   ` Dietmar Eggemann
  2019-08-08  6:52     ` Juri Lelli
  2019-08-08  7:56     ` Peter Zijlstra
  2019-08-08  6:59   ` Juri Lelli
  2019-08-09  9:17   ` Dietmar Eggemann
  2 siblings, 2 replies; 63+ messages in thread
From: Dietmar Eggemann @ 2019-08-07 16:31 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, juri.lelli
  Cc: linux-kernel, luca.abeni, bristot, balsini, dvyukov, tglx,
	vpillai, rostedt

On 7/26/19 4:54 PM, Peter Zijlstra wrote:
> 
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

[...]

> @@ -889,6 +891,8 @@ static void update_curr(struct cfs_rq *c
>  		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
>  		cgroup_account_cputime(curtask, delta_exec);
>  		account_group_exec_runtime(curtask, delta_exec);
> +		if (curtask->server)
> +			dl_server_update(curtask->server, delta_exec);
>  	}

I get a lockdep_assert_held(&rq->lock) related warning in start_dl_timer()
when running the full stack.

...
[    0.530216] root domain span: 0-5 (max cpu_capacity = 1024)
[    0.538655] devtmpfs: initialized
[    0.556485] update_curr: rq mismatch rq[0] != rq[4]
[    0.561519] update_curr: rq mismatch rq[0] != rq[4]
[    0.566497] update_curr: rq mismatch rq[0] != rq[4]
[    0.571443] update_curr: rq mismatch rq[0] != rq[4]
[    0.576762] update_curr: rq mismatch rq[2] != rq[4]
[    0.581674] update_curr: rq mismatch rq[2] != rq[4]
[    0.586569] ------------[ cut here ]------------
[    0.591220] WARNING: CPU: 2 PID: 2 at kernel/sched/deadline.c:916 start_dl_timer+0x160/0x178
[    0.599686] Modules linked in:
[    0.602756] CPU: 2 PID: 2 Comm: kthreadd Tainted: G        W         5.3.0-rc3-00013-ga33cf033cc99-dirty #64
[    0.612620] Hardware name: ARM Juno development board (r0) (DT)
[    0.618560] pstate: 60000085 (nZCv daIf -PAN -UAO)
[    0.623369] pc : start_dl_timer+0x160/0x178
[    0.627572] lr : start_dl_timer+0x160/0x178
[    0.631768] sp : ffff000010013cb0
...
[    0.715075] Call trace:
[    0.717531]  start_dl_timer+0x160/0x178
[    0.721382]  update_curr_dl_se+0x108/0x208
[    0.725494]  dl_server_update+0x2c/0x38
[    0.729348]  update_curr+0x1b4/0x3b8
[    0.732934]  task_tick_fair+0x74/0xa88
[    0.736698]  scheduler_tick+0x94/0x110
[    0.740461]  update_process_times+0x48/0x60
...

Seems to be related to the fact that the rq can change:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4c14851a34c..5e3130a200ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -891,8 +891,17 @@ static void update_curr(struct cfs_rq *cfs_rq)
                trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
                cgroup_account_cputime(curtask, delta_exec);
                account_group_exec_runtime(curtask, delta_exec);
-               if (curtask->server)
+               if (curtask->server) {
+                       struct rq *rq = rq_of(cfs_rq);
+                       struct rq *rq2 = curtask->server->rq;
+
+                       if (rq != rq2) {
+                               printk("update_curr: rq mismatch rq[%d] != rq[%d]\n",
+                                      cpu_of(rq), cpu_of(rq2));
+                       }
+
                        dl_server_update(curtask->server, delta_exec);
+               }
        }

...



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers
  2019-08-07 16:31   ` Dietmar Eggemann
@ 2019-08-08  6:52     ` Juri Lelli
  2019-08-08  7:52       ` Dietmar Eggemann
  2019-08-08  7:56     ` Peter Zijlstra
  1 sibling, 1 reply; 63+ messages in thread
From: Juri Lelli @ 2019-08-08  6:52 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, mingo, linux-kernel, luca.abeni, bristot,
	balsini, dvyukov, tglx, vpillai, rostedt

Hi Dietmar,

On 07/08/19 18:31, Dietmar Eggemann wrote:
> On 7/26/19 4:54 PM, Peter Zijlstra wrote:
> > 
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> [...]
> 
> > @@ -889,6 +891,8 @@ static void update_curr(struct cfs_rq *c
> >  		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
> >  		cgroup_account_cputime(curtask, delta_exec);
> >  		account_group_exec_runtime(curtask, delta_exec);
> > +		if (curtask->server)
> > +			dl_server_update(curtask->server, delta_exec);
> >  	}
> 
> I get a lockdep_assert_held(&rq->lock) related warning in start_dl_timer()
> when running the full stack.
> 
> ...
> [    0.530216] root domain span: 0-5 (max cpu_capacity = 1024)
> [    0.538655] devtmpfs: initialized
> [    0.556485] update_curr: rq mismatch rq[0] != rq[4]
> [    0.561519] update_curr: rq mismatch rq[0] != rq[4]
> [    0.566497] update_curr: rq mismatch rq[0] != rq[4]
> [    0.571443] update_curr: rq mismatch rq[0] != rq[4]
> [    0.576762] update_curr: rq mismatch rq[2] != rq[4]
> [    0.581674] update_curr: rq mismatch rq[2] != rq[4]
> [    0.586569] ------------[ cut here ]------------
> [    0.591220] WARNING: CPU: 2 PID: 2 at kernel/sched/deadline.c:916 start_dl_timer+0x160/0x178
> [    0.599686] Modules linked in:
> [    0.602756] CPU: 2 PID: 2 Comm: kthreadd Tainted: G        W         5.3.0-rc3-00013-ga33cf033cc99-dirty #64
> [    0.612620] Hardware name: ARM Juno development board (r0) (DT)
> [    0.618560] pstate: 60000085 (nZCv daIf -PAN -UAO)
> [    0.623369] pc : start_dl_timer+0x160/0x178
> [    0.627572] lr : start_dl_timer+0x160/0x178
> [    0.631768] sp : ffff000010013cb0
> ...
> [    0.715075] Call trace:
> [    0.717531]  start_dl_timer+0x160/0x178
> [    0.721382]  update_curr_dl_se+0x108/0x208
> [    0.725494]  dl_server_update+0x2c/0x38
> [    0.729348]  update_curr+0x1b4/0x3b8
> [    0.732934]  task_tick_fair+0x74/0xa88
> [    0.736698]  scheduler_tick+0x94/0x110
> [    0.740461]  update_process_times+0x48/0x60
> ...
> 
> Seems to be related to the fact that the rq can change:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e4c14851a34c..5e3130a200ec 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -891,8 +891,17 @@ static void update_curr(struct cfs_rq *cfs_rq)
>                 trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
>                 cgroup_account_cputime(curtask, delta_exec);
>                 account_group_exec_runtime(curtask, delta_exec);
> -               if (curtask->server)
> +               if (curtask->server) {
> +                       struct rq *rq = rq_of(cfs_rq);
> +                       struct rq *rq2 = curtask->server->rq;
> +
> +                       if (rq != rq2) {
> +                               printk("update_curr: rq mismatch rq[%d] != rq[%d]\n",
> +                                      cpu_of(rq), cpu_of(rq2));
> +                       }
> +
>                         dl_server_update(curtask->server, delta_exec);
> +               }
>         }
> 
> ...

Yeah, I actually noticed the same thing. Some debugging seems to point to
early boot spawning of kthreads; for example, I can reliably attribute
this mismatch to ksoftirqd(s). It looks like they can avoid the
dl_server assignment in pick_next_task_dl() and this breaks things.
I still need to figure out why this happens and how to fix it, though.

Best,

Juri

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers
  2019-07-26 14:54 ` [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers Peter Zijlstra
  2019-08-07 16:31   ` Dietmar Eggemann
@ 2019-08-08  6:59   ` Juri Lelli
  2019-08-09  9:17   ` Dietmar Eggemann
  2 siblings, 0 replies; 63+ messages in thread
From: Juri Lelli @ 2019-08-08  6:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, dietmar.eggemann, luca.abeni, bristot,
	balsini, dvyukov, tglx, vpillai, rostedt

Hi,

On 26/07/19 16:54, Peter Zijlstra wrote:

[...]

> +void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
> +		    dl_server_has_tasks_f has_tasks,
> +		    dl_server_pick_f pick)
> +{
> +	dl_se->dl_server = 1;
> +	dl_se->rq = rq;
> +	dl_se->server_has_tasks = has_tasks;
> +	dl_se->server_pick = pick;
> +
> +	setup_new_dl_entity(dl_se);

I think we need to postpone calling setup_new_dl_entity() to when
dl_server_start() is called for the first time. rq_clock() needs both
a caller that holds rq lock (which we can add) and an updated clock
(which I'm not sure we can do at this point in time, or maybe we can
avoid/trick the debug check?).

Thanks,

Juri

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers
  2019-08-08  6:52     ` Juri Lelli
@ 2019-08-08  7:52       ` Dietmar Eggemann
  0 siblings, 0 replies; 63+ messages in thread
From: Dietmar Eggemann @ 2019-08-08  7:52 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Peter Zijlstra, mingo, linux-kernel, luca.abeni, bristot,
	balsini, dvyukov, tglx, vpillai, rostedt

On 8/8/19 8:52 AM, Juri Lelli wrote:
> Hi Dietmar,
> 
> On 07/08/19 18:31, Dietmar Eggemann wrote:
>> On 7/26/19 4:54 PM, Peter Zijlstra wrote:
>>>
>>>
>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>
>> [...]
>>
>>> @@ -889,6 +891,8 @@ static void update_curr(struct cfs_rq *c
>>>  		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
>>>  		cgroup_account_cputime(curtask, delta_exec);
>>>  		account_group_exec_runtime(curtask, delta_exec);
>>> +		if (curtask->server)
>>> +			dl_server_update(curtask->server, delta_exec);
>>>  	}
>>
>> I get a lockdep_assert_held(&rq->lock) related warning in start_dl_timer()
>> when running the full stack.
>>
>> ...
>> [    0.530216] root domain span: 0-5 (max cpu_capacity = 1024)
>> [    0.538655] devtmpfs: initialized
>> [    0.556485] update_curr: rq mismatch rq[0] != rq[4]
>> [    0.561519] update_curr: rq mismatch rq[0] != rq[4]
>> [    0.566497] update_curr: rq mismatch rq[0] != rq[4]
>> [    0.571443] update_curr: rq mismatch rq[0] != rq[4]
>> [    0.576762] update_curr: rq mismatch rq[2] != rq[4]
>> [    0.581674] update_curr: rq mismatch rq[2] != rq[4]
>> [    0.586569] ------------[ cut here ]------------
>> [    0.591220] WARNING: CPU: 2 PID: 2 at kernel/sched/deadline.c:916 start_dl_timer+0x160/0x178
>> [    0.599686] Modules linked in:
>> [    0.602756] CPU: 2 PID: 2 Comm: kthreadd Tainted: G        W         5.3.0-rc3-00013-ga33cf033cc99-dirty #64
>> [    0.612620] Hardware name: ARM Juno development board (r0) (DT)
>> [    0.618560] pstate: 60000085 (nZCv daIf -PAN -UAO)
>> [    0.623369] pc : start_dl_timer+0x160/0x178
>> [    0.627572] lr : start_dl_timer+0x160/0x178
>> [    0.631768] sp : ffff000010013cb0
>> ...
>> [    0.715075] Call trace:
>> [    0.717531]  start_dl_timer+0x160/0x178
>> [    0.721382]  update_curr_dl_se+0x108/0x208
>> [    0.725494]  dl_server_update+0x2c/0x38
>> [    0.729348]  update_curr+0x1b4/0x3b8
>> [    0.732934]  task_tick_fair+0x74/0xa88
>> [    0.736698]  scheduler_tick+0x94/0x110
>> [    0.740461]  update_process_times+0x48/0x60
>> ...
>>
>> Seems to be related to the fact that the rq can change:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e4c14851a34c..5e3130a200ec 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -891,8 +891,17 @@ static void update_curr(struct cfs_rq *cfs_rq)
>>                 trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
>>                 cgroup_account_cputime(curtask, delta_exec);
>>                 account_group_exec_runtime(curtask, delta_exec);
>> -               if (curtask->server)
>> +               if (curtask->server) {
>> +                       struct rq *rq = rq_of(cfs_rq);
>> +                       struct rq *rq2 = curtask->server->rq;
>> +
>> +                       if (rq != rq2) {
>> +                               printk("update_curr: rq mismatch rq[%d] != rq[%d]\n",
>> +                                      cpu_of(rq), cpu_of(rq2));
>> +                       }
>> +
>>                         dl_server_update(curtask->server, delta_exec);
>> +               }
>>         }
>>
>> ...
> 
> Yeah, I actually noticed the same thing. Some debugging seems to point to
> early boot spawning of kthreads; for example, I can reliably attribute
> this mismatch to ksoftirqd(s). It looks like they can avoid the
> dl_server assignment in pick_next_task_dl() and this breaks things.
> I still need to figure out why this happens and how to fix it, though.

Yeah, can confirm this:

...
[0.556941] update_curr: rq mismatch rq[0] != rq[4] curr=[kthreadd, 39]
[0.563722] update_curr: rq mismatch rq[0] != rq[4] curr=[netns, 39]
[0.570179] update_curr: rq mismatch rq[0] != rq[4] curr=[netns, 39]
[0.576606] update_curr: rq mismatch rq[0] != rq[4] curr=[netns, 39]
[0.583708] update_curr: rq mismatch rq[2] != rq[1] curr=[kworker/2:1, 40]
[0.590793] update_curr: rq mismatch rq[2] != rq[1] curr=[kworker/2:1, 40]
[0.615096] update_curr: rq mismatch rq[2] != rq[1] curr=[kworker/2:1, 40]
...
[0.626644] WARNING: CPU: 2 PID: 40 at kernel/sched/deadline.c:916 start_dl_timer+0x160/0x178
...

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers
  2019-08-07 16:31   ` Dietmar Eggemann
  2019-08-08  6:52     ` Juri Lelli
@ 2019-08-08  7:56     ` Peter Zijlstra
  2019-08-08  8:11       ` Dietmar Eggemann
  1 sibling, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2019-08-08  7:56 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: mingo, juri.lelli, linux-kernel, luca.abeni, bristot, balsini,
	dvyukov, tglx, vpillai, rostedt

On Wed, Aug 07, 2019 at 06:31:59PM +0200, Dietmar Eggemann wrote:
> On 7/26/19 4:54 PM, Peter Zijlstra wrote:
> > 
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> [...]
> 
> > @@ -889,6 +891,8 @@ static void update_curr(struct cfs_rq *c
> >  		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
> >  		cgroup_account_cputime(curtask, delta_exec);
> >  		account_group_exec_runtime(curtask, delta_exec);
> > +		if (curtask->server)
> > +			dl_server_update(curtask->server, delta_exec);
> >  	}
> 
> I get a lockdep_assert_held(&rq->lock) related warning in start_dl_timer()
> when running the full stack.

That would seem to imply a stale curtask->server value; the hunk below:

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3756,8 +3756,11 @@ pick_next_task(struct rq *rq, struct tas

        for_each_class(class) {
                p = class->pick_next_task(rq, NULL, NULL);
-               if (p)
+               if (p) {
+                       if (p->sched_class == class && p->server)
+                               p->server = NULL;
                        return p;
+               }
        }


Was supposed to clear p->server, but clearly something is going 'funny'.



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers
  2019-08-08  7:56     ` Peter Zijlstra
@ 2019-08-08  8:11       ` Dietmar Eggemann
  2019-08-08  8:46         ` Juri Lelli
  0 siblings, 1 reply; 63+ messages in thread
From: Dietmar Eggemann @ 2019-08-08  8:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, linux-kernel, luca.abeni, bristot, balsini,
	dvyukov, tglx, vpillai, rostedt

On 8/8/19 9:56 AM, Peter Zijlstra wrote:
> On Wed, Aug 07, 2019 at 06:31:59PM +0200, Dietmar Eggemann wrote:
>> On 7/26/19 4:54 PM, Peter Zijlstra wrote:
>>>
>>>
>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>
>> [...]
>>
>>> @@ -889,6 +891,8 @@ static void update_curr(struct cfs_rq *c
>>>  		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
>>>  		cgroup_account_cputime(curtask, delta_exec);
>>>  		account_group_exec_runtime(curtask, delta_exec);
>>> +		if (curtask->server)
>>> +			dl_server_update(curtask->server, delta_exec);
>>>  	}
>>
>> I get a lockdep_assert_held(&rq->lock) related warning in start_dl_timer()
>> when running the full stack.
> 
> That would seem to imply a stale curtask->server value; the hunk below:
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3756,8 +3756,11 @@ pick_next_task(struct rq *rq, struct tas
> 
>         for_each_class(class) {
>                 p = class->pick_next_task(rq, NULL, NULL);
> -               if (p)
> +               if (p) {
> +                       if (p->sched_class == class && p->server)
> +                               p->server = NULL;
>                         return p;
> +               }
>         }
> 
> 
> Was supposed to clear p->server, but clearly something is going 'funny'.

What about the fast path in pick_next_task()?

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bffe849b5a42..f1ea6ae16052 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3742,6 +3742,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
                if (unlikely(!p))
                        p = idle_sched_class.pick_next_task(rq, prev, rf);
 
+               if (p->sched_class == &fair_sched_class && p->server)
+                       p->server = NULL;
+
                return p;
        }

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers
  2019-08-08  8:11       ` Dietmar Eggemann
@ 2019-08-08  8:46         ` Juri Lelli
  2019-08-08  8:57           ` Dietmar Eggemann
  2019-08-08 10:31           ` Peter Zijlstra
  0 siblings, 2 replies; 63+ messages in thread
From: Juri Lelli @ 2019-08-08  8:46 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, mingo, linux-kernel, luca.abeni, bristot,
	balsini, dvyukov, tglx, vpillai, rostedt

On 08/08/19 10:11, Dietmar Eggemann wrote:
> On 8/8/19 9:56 AM, Peter Zijlstra wrote:
> > On Wed, Aug 07, 2019 at 06:31:59PM +0200, Dietmar Eggemann wrote:
> >> On 7/26/19 4:54 PM, Peter Zijlstra wrote:
> >>>
> >>>
> >>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >>
> >> [...]
> >>
> >>> @@ -889,6 +891,8 @@ static void update_curr(struct cfs_rq *c
> >>>  		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
> >>>  		cgroup_account_cputime(curtask, delta_exec);
> >>>  		account_group_exec_runtime(curtask, delta_exec);
> >>> +		if (curtask->server)
> >>> +			dl_server_update(curtask->server, delta_exec);
> >>>  	}
> >>
> >> I get a lockdep_assert_held(&rq->lock) related warning in start_dl_timer()
> >> when running the full stack.
> > 
> > That would seem to imply a stale curtask->server value; the hunk below:
> > 
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3756,8 +3756,11 @@ pick_next_task(struct rq *rq, struct tas
> > 
> >         for_each_class(class) {
> >                 p = class->pick_next_task(rq, NULL, NULL);
> > -               if (p)
> > +               if (p) {
> > +                       if (p->sched_class == class && p->server)
> > +                               p->server = NULL;
> >                         return p;
> > +               }
> >         }
> > 
> > 
> > Was supposed to clear p->server, but clearly something is going 'funny'.
> 
> What about the fast path in pick_next_task()?
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bffe849b5a42..f1ea6ae16052 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3742,6 +3742,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>                 if (unlikely(!p))
>                         p = idle_sched_class.pick_next_task(rq, prev, rf);
>  
> +               if (p->sched_class == &fair_sched_class && p->server)
> +                       p->server = NULL;
> +

Hummm, but then who sets it back to the correct server. AFAIU
update_curr() needs a ->server to do the correct DL accounting?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers
  2019-08-08  8:46         ` Juri Lelli
@ 2019-08-08  8:57           ` Dietmar Eggemann
  2019-08-08  9:27             ` Juri Lelli
  2019-08-08 10:31           ` Peter Zijlstra
  1 sibling, 1 reply; 63+ messages in thread
From: Dietmar Eggemann @ 2019-08-08  8:57 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Peter Zijlstra, mingo, linux-kernel, luca.abeni, bristot,
	balsini, dvyukov, tglx, vpillai, rostedt

On 8/8/19 10:46 AM, Juri Lelli wrote:
> On 08/08/19 10:11, Dietmar Eggemann wrote:
>> On 8/8/19 9:56 AM, Peter Zijlstra wrote:
>>> On Wed, Aug 07, 2019 at 06:31:59PM +0200, Dietmar Eggemann wrote:
>>>> On 7/26/19 4:54 PM, Peter Zijlstra wrote:
>>>>>
>>>>>
>>>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>>>
>>>> [...]
>>>>
>>>>> @@ -889,6 +891,8 @@ static void update_curr(struct cfs_rq *c
>>>>>  		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
>>>>>  		cgroup_account_cputime(curtask, delta_exec);
>>>>>  		account_group_exec_runtime(curtask, delta_exec);
>>>>> +		if (curtask->server)
>>>>> +			dl_server_update(curtask->server, delta_exec);
>>>>>  	}
>>>>
>>>> I get a lockdep_assert_held(&rq->lock) related warning in start_dl_timer()
>>>> when running the full stack.
>>>
>>> That would seem to imply a stale curtask->server value; the hunk below:
>>>
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -3756,8 +3756,11 @@ pick_next_task(struct rq *rq, struct tas
>>>
>>>         for_each_class(class) {
>>>                 p = class->pick_next_task(rq, NULL, NULL);
>>> -               if (p)
>>> +               if (p) {
>>> +                       if (p->sched_class == class && p->server)
>>> +                               p->server = NULL;
>>>                         return p;
>>> +               }
>>>         }
>>>
>>>
>>> Was supposed to clear p->server, but clearly something is going 'funny'.
>>
>> What about the fast path in pick_next_task()?
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index bffe849b5a42..f1ea6ae16052 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3742,6 +3742,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>>                 if (unlikely(!p))
>>                         p = idle_sched_class.pick_next_task(rq, prev, rf);
>>  
>> +               if (p->sched_class == &fair_sched_class && p->server)
>> +                       p->server = NULL;
>> +
> 
> Hummm, but then who sets it back to the correct server. AFAIU
> update_curr() needs a ->server to do the correct DL accounting?

Ah, OK, this would kill the whole functionality ;-)


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers
  2019-08-08  8:57           ` Dietmar Eggemann
@ 2019-08-08  9:27             ` Juri Lelli
  2019-08-08  9:45               ` Juri Lelli
  0 siblings, 1 reply; 63+ messages in thread
From: Juri Lelli @ 2019-08-08  9:27 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, mingo, linux-kernel, luca.abeni, bristot,
	balsini, dvyukov, tglx, vpillai, rostedt

On 08/08/19 10:57, Dietmar Eggemann wrote:
> On 8/8/19 10:46 AM, Juri Lelli wrote:
> > On 08/08/19 10:11, Dietmar Eggemann wrote:
> >> On 8/8/19 9:56 AM, Peter Zijlstra wrote:
> >>> On Wed, Aug 07, 2019 at 06:31:59PM +0200, Dietmar Eggemann wrote:
> >>>> On 7/26/19 4:54 PM, Peter Zijlstra wrote:
> >>>>>
> >>>>>
> >>>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >>>>
> >>>> [...]
> >>>>
> >>>>> @@ -889,6 +891,8 @@ static void update_curr(struct cfs_rq *c
> >>>>>  		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
> >>>>>  		cgroup_account_cputime(curtask, delta_exec);
> >>>>>  		account_group_exec_runtime(curtask, delta_exec);
> >>>>> +		if (curtask->server)
> >>>>> +			dl_server_update(curtask->server, delta_exec);
> >>>>>  	}
> >>>>
> >>>> I get a lockdep_assert_held(&rq->lock) related warning in start_dl_timer()
> >>>> when running the full stack.
> >>>
> >>> That would seem to imply a stale curtask->server value; the hunk below:
> >>>
> >>> --- a/kernel/sched/core.c
> >>> +++ b/kernel/sched/core.c
> >>> @@ -3756,8 +3756,11 @@ pick_next_task(struct rq *rq, struct tas
> >>>
> >>>         for_each_class(class) {
> >>>                 p = class->pick_next_task(rq, NULL, NULL);
> >>> -               if (p)
> >>> +               if (p) {
> >>> +                       if (p->sched_class == class && p->server)
> >>> +                               p->server = NULL;
> >>>                         return p;
> >>> +               }
> >>>         }
> >>>
> >>>
> >>> Was supposed to clear p->server, but clearly something is going 'funny'.
> >>
> >> What about the fast path in pick_next_task()?
> >>
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index bffe849b5a42..f1ea6ae16052 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -3742,6 +3742,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >>                 if (unlikely(!p))
> >>                         p = idle_sched_class.pick_next_task(rq, prev, rf);
> >>  
> >> +               if (p->sched_class == &fair_sched_class && p->server)
> >> +                       p->server = NULL;
> >> +
> > 
> > Hummm, but then who sets it back to the correct server. AFAIU
> > update_curr() needs a ->server to do the correct DL accounting?
> 
> Ah, OK, this would kill the whole functionality ;-)
> 

I'm thinking we could use &rq->fair_server. It seems to get past the point
we are discussing, but then the virt box becomes unresponsive (busy
loops).

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers
  2019-08-08  9:27             ` Juri Lelli
@ 2019-08-08  9:45               ` Juri Lelli
  2019-08-30 11:24                 ` Peter Zijlstra
  0 siblings, 1 reply; 63+ messages in thread
From: Juri Lelli @ 2019-08-08  9:45 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, mingo, linux-kernel, luca.abeni, bristot,
	balsini, dvyukov, tglx, vpillai, rostedt

On 08/08/19 11:27, Juri Lelli wrote:
> On 08/08/19 10:57, Dietmar Eggemann wrote:
> > On 8/8/19 10:46 AM, Juri Lelli wrote:
> > > On 08/08/19 10:11, Dietmar Eggemann wrote:
> > >> On 8/8/19 9:56 AM, Peter Zijlstra wrote:
> > >>> On Wed, Aug 07, 2019 at 06:31:59PM +0200, Dietmar Eggemann wrote:
> > >>>> On 7/26/19 4:54 PM, Peter Zijlstra wrote:
> > >>>>>
> > >>>>>
> > >>>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > >>>>
> > >>>> [...]
> > >>>>
> > >>>>> @@ -889,6 +891,8 @@ static void update_curr(struct cfs_rq *c
> > >>>>>  		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
> > >>>>>  		cgroup_account_cputime(curtask, delta_exec);
> > >>>>>  		account_group_exec_runtime(curtask, delta_exec);
> > >>>>> +		if (curtask->server)
> > >>>>> +			dl_server_update(curtask->server, delta_exec);
> > >>>>>  	}
> > >>>>
> > >>>> I get a lockdep_assert_held(&rq->lock) related warning in start_dl_timer()
> > >>>> when running the full stack.
> > >>>
> > >>> That would seem to imply a stale curtask->server value; the hunk below:
> > >>>
> > >>> --- a/kernel/sched/core.c
> > >>> +++ b/kernel/sched/core.c
> > >>> @@ -3756,8 +3756,11 @@ pick_next_task(struct rq *rq, struct tas
> > >>>
> > >>>         for_each_class(class) {
> > >>>                 p = class->pick_next_task(rq, NULL, NULL);
> > >>> -               if (p)
> > >>> +               if (p) {
> > >>> +                       if (p->sched_class == class && p->server)
> > >>> +                               p->server = NULL;
> > >>>                         return p;
> > >>> +               }
> > >>>         }
> > >>>
> > >>>
> > >>> Was supposed to clear p->server, but clearly something is going 'funny'.
> > >>
> > >> What about the fast path in pick_next_task()?
> > >>
> > >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > >> index bffe849b5a42..f1ea6ae16052 100644
> > >> --- a/kernel/sched/core.c
> > >> +++ b/kernel/sched/core.c
> > >> @@ -3742,6 +3742,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > >>                 if (unlikely(!p))
> > >>                         p = idle_sched_class.pick_next_task(rq, prev, rf);
> > >>  
> > >> +               if (p->sched_class == &fair_sched_class && p->server)
> > >> +                       p->server = NULL;
> > >> +
> > > 
> > > Hummm, but then who sets it back to the correct server. AFAIU
> > > update_curr() needs a ->server to do the correct DL accounting?
> > 
> > Ah, OK, this would kill the whole functionality ;-)
> > 
> 
> I'm thinking we could use &rq->fair_server. It seems to get past the point
> we are discussing, but then the virt box becomes unresponsive (busy
> loops).

I'd like to take that last sentence back: I was able to run a few boot +
hackbench + shutdown cycles with the following applied (I guess there were
too many debug printks around before).

--->8---
 kernel/sched/deadline.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 3fe82b8f7825..d4a20072d4c0 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1312,6 +1312,10 @@ void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
 
 void dl_server_start(struct sched_dl_entity *dl_se)
 {
+	if (!dl_server(dl_se)) {
+		dl_se->dl_server = 1;
+		setup_new_dl_entity(dl_se);
+	}
 	enqueue_dl_entity(dl_se, dl_se, ENQUEUE_WAKEUP);
 }
 
@@ -1324,12 +1328,9 @@ void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick)
 {
-	dl_se->dl_server = 1;
 	dl_se->rq = rq;
 	dl_se->server_has_tasks = has_tasks;
 	dl_se->server_pick = pick;
-
-	setup_new_dl_entity(dl_se);
 }
 
 /*
@@ -2829,6 +2830,7 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se)
 	dl_se->dl_yielded		= 0;
 	dl_se->dl_non_contending	= 0;
 	dl_se->dl_overrun		= 0;
+	dl_se->dl_server		= 0;
 }
 
 void init_dl_entity(struct sched_dl_entity *dl_se)

--->8---
 kernel/sched/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7a161472decd..355cb1382aef 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3783,6 +3783,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		if (unlikely(!p))
 			p = idle_sched_class.pick_next_task(rq, prev, rf);
 
+		if (p->sched_class == &fair_sched_class)
+			p->server = &rq->fair_server;
+
 		return p;
 	}
 

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers
  2019-08-08  8:46         ` Juri Lelli
  2019-08-08  8:57           ` Dietmar Eggemann
@ 2019-08-08 10:31           ` Peter Zijlstra
  2019-08-09  7:13             ` Juri Lelli
  1 sibling, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2019-08-08 10:31 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Dietmar Eggemann, mingo, linux-kernel, luca.abeni, bristot,
	balsini, dvyukov, tglx, vpillai, rostedt

On Thu, Aug 08, 2019 at 10:46:52AM +0200, Juri Lelli wrote:
> On 08/08/19 10:11, Dietmar Eggemann wrote:

> > What about the fast path in pick_next_task()?
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index bffe849b5a42..f1ea6ae16052 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3742,6 +3742,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >                 if (unlikely(!p))
> >                         p = idle_sched_class.pick_next_task(rq, prev, rf);
> >  
> > +               if (p->sched_class == &fair_sched_class && p->server)
> > +                       p->server = NULL;
> > +
> 
> Hummm, but then who sets it back to the correct server. AFAIU
> update_curr() needs a ->server to do the correct DL accounting?

The above looks ok.

The pick_next_task_dl() when it selects a server will set ->server.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers
  2019-08-08 10:31           ` Peter Zijlstra
@ 2019-08-09  7:13             ` Juri Lelli
  0 siblings, 0 replies; 63+ messages in thread
From: Juri Lelli @ 2019-08-09  7:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, mingo, linux-kernel, luca.abeni, bristot,
	balsini, dvyukov, tglx, vpillai, rostedt

On 08/08/19 12:31, Peter Zijlstra wrote:
> On Thu, Aug 08, 2019 at 10:46:52AM +0200, Juri Lelli wrote:
> > On 08/08/19 10:11, Dietmar Eggemann wrote:
> 
> > > What about the fast path in pick_next_task()?
> > > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index bffe849b5a42..f1ea6ae16052 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -3742,6 +3742,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > >                 if (unlikely(!p))
> > >                         p = idle_sched_class.pick_next_task(rq, prev, rf);
> > >  
> > > +               if (p->sched_class == &fair_sched_class && p->server)
> > > +                       p->server = NULL;
> > > +
> > 
> > Hummm, but then who sets it back to the correct server. AFAIU
> > update_curr() needs a ->server to do the correct DL accounting?
> 
> The above looks ok.
> 
> The pick_next_task_dl() when it selects a server will set ->server.

OK. Yesterday's IRC conversation resolved my perplexity about this
point. I tried to summarize what I've got in the comment below (which we
might want to add on top of Dietmar's fix).

--->8---
 /*
  * Since we are executing the fast path, it means that the DL
  * server was throttled and we'll now be scheduled without it. So,
  * reset p's pointer to it (accounting won't be performed, as
  * it wouldn't make sense while the server is throttled).
  */

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers
  2019-07-26 14:54 ` [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers Peter Zijlstra
  2019-08-07 16:31   ` Dietmar Eggemann
  2019-08-08  6:59   ` Juri Lelli
@ 2019-08-09  9:17   ` Dietmar Eggemann
  2019-08-09 12:16     ` Juri Lelli
  2 siblings, 1 reply; 63+ messages in thread
From: Dietmar Eggemann @ 2019-08-09  9:17 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, juri.lelli
  Cc: linux-kernel, luca.abeni, bristot, balsini, dvyukov, tglx,
	vpillai, rostedt

On 7/26/19 4:54 PM, Peter Zijlstra wrote:

[...]

> +void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
> +		    dl_server_has_tasks_f has_tasks,
> +		    dl_server_pick_f pick)
> +{
> +	dl_se->dl_server = 1;
> +	dl_se->rq = rq;
> +	dl_se->server_has_tasks = has_tasks;
> +	dl_se->server_pick = pick;
> +
> +	setup_new_dl_entity(dl_se);

IMHO, this needs rq locking because of the rq_clock(rq) in
setup_new_dl_entity().

[    0.000000] WARNING: CPU: 0 PID: 0 at kernel/sched/sched.h:1119
dl_server_init+0x118/0x178
...
[    0.000000] CPU: 0 PID: 0 Comm: swapper Tainted: G        W
5.3.0-rc3-00013-ga33cf033cc99-dirty #10
[    0.000000] Hardware name: ARM Juno development board (r0) (DT)
...
[    0.000000] Call trace:
[    0.000000]  dl_server_init+0x118/0x178
[    0.000000]  fair_server_init+0x5c/0x68
[    0.000000]  sched_init+0x2c8/0x474
[    0.000000]  start_kernel+0x290/0x514

[...]
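
Mechanically, satisfying that assertion would mean something like the fragment
below (a sketch only; as noted elsewhere in the thread the rq clock may not be
usable this early during boot, which is why the eventual fix postpones
setup_new_dl_entity() to dl_server_start() instead):

	struct rq_flags rf;

	rq_lock_irqsave(rq, &rf);
	update_rq_clock(rq);
	setup_new_dl_entity(dl_se);
	rq_unlock_irqrestore(rq, &rf);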

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers
  2019-08-09  9:17   ` Dietmar Eggemann
@ 2019-08-09 12:16     ` Juri Lelli
  0 siblings, 0 replies; 63+ messages in thread
From: Juri Lelli @ 2019-08-09 12:16 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, mingo, linux-kernel, luca.abeni, bristot,
	balsini, dvyukov, tglx, vpillai, rostedt

On 09/08/19 11:17, Dietmar Eggemann wrote:
> On 7/26/19 4:54 PM, Peter Zijlstra wrote:
> 
> [...]
> 
> > +void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
> > +		    dl_server_has_tasks_f has_tasks,
> > +		    dl_server_pick_f pick)
> > +{
> > +	dl_se->dl_server = 1;
> > +	dl_se->rq = rq;
> > +	dl_se->server_has_tasks = has_tasks;
> > +	dl_se->server_pick = pick;
> > +
> > +	setup_new_dl_entity(dl_se);
> 
> IMHO, this needs rq locking because of the rq_clock(rq) in
> setup_new_dl_entity().
> 
> [    0.000000] WARNING: CPU: 0 PID: 0 at kernel/sched/sched.h:1119
> dl_server_init+0x118/0x178
> ...
> [    0.000000] CPU: 0 PID: 0 Comm: swapper Tainted: G        W
> 5.3.0-rc3-00013-ga33cf033cc99-dirty #10
> [    0.000000] Hardware name: ARM Juno development board (r0) (DT)
> ...
> [    0.000000] Call trace:
> [    0.000000]  dl_server_init+0x118/0x178
> [    0.000000]  fair_server_init+0x5c/0x68
> [    0.000000]  sched_init+0x2c8/0x474
> [    0.000000]  start_kernel+0x290/0x514
> 
> [...]

Right... I noticed the same and had a similar thought the other day (1st
hunk): 20190808094546.GJ29310@localhost.localdomain

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period
  2019-08-05 11:53     ` Peter Zijlstra
@ 2019-08-22 12:29       ` Alessio Balsini
  2019-08-22 16:51         ` Peter Zijlstra
  2019-10-23 17:17       ` [PATCH 4.4 4.9 4.14] loop: Add LOOP_SET_DIRECT_IO to compat ioctl Alessio Balsini
  1 sibling, 1 reply; 63+ messages in thread
From: Alessio Balsini @ 2019-08-22 12:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, linux-kernel, dietmar.eggemann, luca.abeni,
	bristot, dvyukov, tglx, vpillai, rostedt, kernel-team

On Mon, Aug 05, 2019 at 01:53:09PM +0200, Peter Zijlstra wrote:
> 
> Like so?
> 

Yes, that's exactly what I meant!
What about this refactoring?

Thanks,
Alessio

---
From 459d5488acb3fac938b0f35f480a81a6e401ef92 Mon Sep 17 00:00:00 2001
From: Alessio Balsini <balsini@android.com>
Date: Thu, 22 Aug 2019 12:55:55 +0100
Subject: [PATCH] sched/deadline: Impose global limits on
 sched_attr::sched_period

There are two DoS scenarios with SCHED_DEADLINE related to
sched_attr::sched_period:

 - since access-control only looks at utilization and density, a very
   large period can allow a very large runtime, which in turn can
   incur a very large latency to lower priority tasks.

 - for very short periods we can end up spending more time programming
   the hardware timer than actually running the task.

Mitigate these by imposing limits on the period.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Alessio Balsini <balsini@android.com>
---
 include/linux/sched/sysctl.h |  7 +++++
 kernel/sched/deadline.c      | 51 ++++++++++++++++++++++++++++++++++--
 kernel/sysctl.c              | 14 ++++++++++
 3 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f7..7c8ef07e52133 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -56,6 +56,13 @@ int sched_proc_update_handler(struct ctl_table *table, int write,
 extern unsigned int sysctl_sched_rt_period;
 extern int sysctl_sched_rt_runtime;
 
+extern unsigned int sysctl_sched_dl_period_max;
+extern unsigned int sysctl_sched_dl_period_min;
+
+extern int sched_dl_period_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+
 #ifdef CONFIG_UCLAMP_TASK
 extern unsigned int sysctl_sched_uclamp_util_min;
 extern unsigned int sysctl_sched_uclamp_util_max;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0b9cbfb2b1d4f..fcdf70d9c0516 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2640,6 +2640,42 @@ void __getparam_dl(struct task_struct *p, struct sched_attr *attr)
 	attr->sched_flags = dl_se->flags;
 }
 
+/*
+ * Default limits for DL period: on the top end we guard against small util
+ * tasks still getting ridiculously long effective runtimes, on the bottom end we
+ * guard against timer DoS.
+ */
+unsigned int sysctl_sched_dl_period_max = 1 << 22; /* ~4 seconds */
+unsigned int sysctl_sched_dl_period_min = 100;     /* 100 us */
+
+int sched_dl_period_handler(struct ctl_table *table, int write,
+			    void __user *buffer, size_t *lenp,
+			    loff_t *ppos)
+{
+	unsigned int old_max, old_min;
+	static DEFINE_MUTEX(mutex);
+	int ret;
+
+	mutex_lock(&mutex);
+	old_max = sysctl_sched_dl_period_max;
+	old_min = sysctl_sched_dl_period_min;
+
+	ret = proc_douintvec(table, write, buffer, lenp, ppos);
+	if (!ret && write) {
+		u64 max = (u64)sysctl_sched_dl_period_max * NSEC_PER_USEC;
+		u64 min = (u64)sysctl_sched_dl_period_min * NSEC_PER_USEC;
+
+		if (min < 1ULL << DL_SCALE || max < min) {
+			sysctl_sched_dl_period_max = old_max;
+			sysctl_sched_dl_period_min = old_min;
+			ret = -EINVAL;
+		}
+	}
+	mutex_unlock(&mutex);
+
+	return ret;
+}
+
 /*
  * This function validates the new parameters of a -deadline task.
  * We ask for the deadline not being zero, and greater or equal
@@ -2652,6 +2688,8 @@ void __getparam_dl(struct task_struct *p, struct sched_attr *attr)
  */
 bool __checkparam_dl(const struct sched_attr *attr)
 {
+	u64 period, max, min;
+
 	/* special dl tasks don't actually use any parameter */
 	if (attr->sched_flags & SCHED_FLAG_SUGOV)
 		return true;
@@ -2675,12 +2713,21 @@ bool __checkparam_dl(const struct sched_attr *attr)
 	    attr->sched_period & (1ULL << 63))
 		return false;
 
+	period = attr->sched_period;
+	if (!period)
+		period = attr->sched_deadline;
+
 	/* runtime <= deadline <= period (if period != 0) */
-	if ((attr->sched_period != 0 &&
-	     attr->sched_period < attr->sched_deadline) ||
+	if (period < attr->sched_deadline ||
 	    attr->sched_deadline < attr->sched_runtime)
 		return false;
 
+	max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC;
+	min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC;
+
+	if (period < min || period > max)
+		return false;
+
 	return true;
 }
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 078950d9605ba..0d07e4707e9d2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -442,6 +442,20 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sched_rt_handler,
 	},
+	{
+		.procname	= "sched_deadline_period_max_us",
+		.data		= &sysctl_sched_dl_period_max,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sched_dl_period_handler,
+	},
+	{
+		.procname	= "sched_deadline_period_min_us",
+		.data		= &sysctl_sched_dl_period_min,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sched_dl_period_handler,
+	},
 	{
 		.procname	= "sched_rr_timeslice_ms",
 		.data		= &sysctl_sched_rr_timeslice,
-- 
2.23.0.187.g17f5b7556c-goog



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period
  2019-08-22 12:29       ` Alessio Balsini
@ 2019-08-22 16:51         ` Peter Zijlstra
  2019-08-31 14:41           ` Alessio Balsini
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2019-08-22 16:51 UTC (permalink / raw)
  To: Alessio Balsini
  Cc: mingo, juri.lelli, linux-kernel, dietmar.eggemann, luca.abeni,
	bristot, dvyukov, tglx, vpillai, rostedt, kernel-team

On Thu, Aug 22, 2019 at 01:29:49PM +0100, Alessio Balsini wrote:
> Yes, that's exactly what I meant!
> What about this refactoring?

Makes sense; and now I spot a race (and sched_rt_handler(), from which I
copied this, is susceptible too):

> +int sched_dl_period_handler(struct ctl_table *table, int write,
> +			    void __user *buffer, size_t *lenp,
> +			    loff_t *ppos)
> +{
> +	unsigned int old_max, old_min;
> +	static DEFINE_MUTEX(mutex);
> +	int ret;
> +
> +	mutex_lock(&mutex);
> +	old_max = sysctl_sched_dl_period_max;
> +	old_min = sysctl_sched_dl_period_min;
> +
> +	ret = proc_douintvec(table, write, buffer, lenp, ppos);

Here the sysctl_* values have already been changed; this can interleave with:

> +	if (!ret && write) {
> +		u64 max = (u64)sysctl_sched_dl_period_max * NSEC_PER_USEC;
> +		u64 min = (u64)sysctl_sched_dl_period_min * NSEC_PER_USEC;
> +
> +		if (min < 1ULL << DL_SCALE || max < min) {
> +			sysctl_sched_dl_period_max = old_max;
> +			sysctl_sched_dl_period_min = old_min;
> +			ret = -EINVAL;
> +		}
> +	}
> +	mutex_unlock(&mutex);
> +
> +	return ret;
> +}

> @@ -2675,12 +2713,21 @@ bool __checkparam_dl(const struct sched_attr *attr)
>  	    attr->sched_period & (1ULL << 63))
>  		return false;
>  
> +	period = attr->sched_period;
> +	if (!period)
> +		period = attr->sched_deadline;
> +
>  	/* runtime <= deadline <= period (if period != 0) */
> -	if ((attr->sched_period != 0 &&
> -	     attr->sched_period < attr->sched_deadline) ||
> +	if (period < attr->sched_deadline ||
>  	    attr->sched_deadline < attr->sched_runtime)
>  		return false;
>  
> +	max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC;
> +	min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC;

this, and we're using unvalidated numbers.

> +	if (period < min || period > max)
> +		return false;
> +
>  	return true;
>  }

Something like the completely untested (again, sorry) patch below ought to
cure that, I think; the same needs doing to sched_rt_handler().

---
Subject: sched/deadline: Impose global limits on sched_attr::sched_period
From: Alessio Balsini <balsini@android.com>
Date: Thu, 22 Aug 2019 13:29:49 +0100

There are two DoS scenarios with SCHED_DEADLINE related to
sched_attr::sched_period:

 - since access-control only looks at utilization and density, a very
   large period can allow a very large runtime, which in turn can
   incur a very large latency to lower priority tasks.

 - for very short periods we can end up spending more time programming
   the hardware timer than actually running the task.

Mitigate these by imposing limits on the period.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Alessio Balsini <balsini@android.com>
Cc: rostedt@goodmis.org
Cc: mingo@kernel.org
Cc: luca.abeni@santannapisa.it
Cc: bristot@redhat.com
Cc: vpillai@digitalocean.com
Cc: kernel-team@android.com
Cc: juri.lelli@redhat.com
Cc: dietmar.eggemann@arm.com
Cc: dvyukov@google.com
Cc: tglx@linutronix.de
---
 include/linux/sched/sysctl.h |    7 +++++
 kernel/sched/deadline.c      |   58 +++++++++++++++++++++++++++++++++++++++++--
 kernel/sysctl.c              |   14 ++++++++++
 3 files changed, 77 insertions(+), 2 deletions(-)

--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -56,6 +56,13 @@ int sched_proc_update_handler(struct ctl
 extern unsigned int sysctl_sched_rt_period;
 extern int sysctl_sched_rt_runtime;
 
+extern unsigned int sysctl_sched_dl_period_max;
+extern unsigned int sysctl_sched_dl_period_min;
+
+extern int sched_dl_period_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+
 #ifdef CONFIG_UCLAMP_TASK
 extern unsigned int sysctl_sched_uclamp_util_min;
 extern unsigned int sysctl_sched_uclamp_util_max;
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2633,6 +2633,49 @@ void __getparam_dl(struct task_struct *p
 }
 
 /*
+ * Default limits for DL period: on the top end we guard against small util
+ * tasks still getting ridiculously long effective runtimes, on the bottom end we
+ * guard against timer DoS.
+ */
+unsigned int sysctl_sched_dl_period_max = 1 << 22; /* ~4 seconds */
+unsigned int sysctl_sched_dl_period_min = 100;     /* 100 us */
+
+int sched_dl_period_handler(struct ctl_table *table, int write,
+			    void __user *buffer, size_t *lenp,
+			    loff_t *ppos)
+{
+	unsigned int new_max, new_min;
+	struct ctl_table new_table;
+	static DEFINE_MUTEX(mutex);
+	int ret;
+
+	mutex_lock(&mutex);
+	new_max = sysctl_sched_dl_period_max;
+	new_min = sysctl_sched_dl_period_min;
+	new_table = *table;
+	if (new_table.data == &sysctl_sched_dl_period_max)
+		new_table.data = &new_max;
+	else
+		new_table.data = &new_min;
+
+	ret = proc_douintvec(&new_table, write, buffer, lenp, ppos);
+	if (!ret && write) {
+		u64 max = (u64)new_max * NSEC_PER_USEC;
+		u64 min = (u64)new_min * NSEC_PER_USEC;
+
+		if (min > 1ULL << DL_SCALE && max > min) {
+			WRITE_ONCE(sysctl_sched_dl_period_max, new_max);
+			WRITE_ONCE(sysctl_sched_dl_period_min, new_min);
+		} else {
+			ret = -EINVAL;
+		}
+	}
+	mutex_unlock(&mutex);
+
+	return ret;
+}
+
+/*
  * This function validates the new parameters of a -deadline task.
  * We ask for the deadline not being zero, and greater or equal
  * than the runtime, as well as the period of being zero or
@@ -2644,6 +2687,8 @@ void __getparam_dl(struct task_struct *p
  */
 bool __checkparam_dl(const struct sched_attr *attr)
 {
+	u64 period, max, min;
+
 	/* special dl tasks don't actually use any parameter */
 	if (attr->sched_flags & SCHED_FLAG_SUGOV)
 		return true;
@@ -2667,12 +2712,21 @@ bool __checkparam_dl(const struct sched_
 	    attr->sched_period & (1ULL << 63))
 		return false;
 
+	period = attr->sched_period;
+	if (!period)
+		period = attr->sched_deadline;
+
 	/* runtime <= deadline <= period (if period != 0) */
-	if ((attr->sched_period != 0 &&
-	     attr->sched_period < attr->sched_deadline) ||
+	if (period < attr->sched_deadline ||
 	    attr->sched_deadline < attr->sched_runtime)
 		return false;
 
+	max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC;
+	min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC;
+
+	if (period < min || period > max)
+		return false;
+
 	return true;
 }
 
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -443,6 +443,20 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= sched_rt_handler,
 	},
 	{
+		.procname	= "sched_deadline_period_max_us",
+		.data		= &sysctl_sched_dl_period_max,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sched_dl_period_handler,
+	},
+	{
+		.procname	= "sched_deadline_period_min_us",
+		.data		= &sysctl_sched_dl_period_min,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sched_dl_period_handler,
+	},
+	{
 		.procname	= "sched_rr_timeslice_ms",
 		.data		= &sysctl_sched_rr_timeslice,
 		.maxlen		= sizeof(int),

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers
  2019-08-08  9:45               ` Juri Lelli
@ 2019-08-30 11:24                 ` Peter Zijlstra
  2019-09-06  9:36                   ` Juri Lelli
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2019-08-30 11:24 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Dietmar Eggemann, mingo, linux-kernel, luca.abeni, bristot,
	balsini, dvyukov, tglx, vpillai, rostedt

On Thu, Aug 08, 2019 at 11:45:46AM +0200, Juri Lelli wrote:
> I'd like to take this last sentence back, I was able to run a few boot +
> hackbench + shutdown cycles with the following applied (guess too much
> debug printks around before).

I've changed that slightly; the merged delta looks like:


--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3913,6 +3913,13 @@ pick_next_task(struct rq *rq, struct tas
 		if (unlikely(!p))
 			p = idle_sched_class.pick_next_task(rq, prev, rf);
 
+		/*
+		 * This is the fast path; it cannot be a DL server pick;
+		 * therefore even if @p == @prev, ->server must be NULL.
+		 */
+		if (prev->server)
+			p->server = NULL;
+
 		return p;
 	}
 
@@ -3925,13 +3932,18 @@ pick_next_task(struct rq *rq, struct tas
 	if (!rq->nr_running)
 		newidle_balance(rq, rf);
 
+	/*
+	 * We've updated @prev and no longer need the server link, clear it.
+	 * Must be done before ->pick_next_task() because that can (re)set
+	 * ->server.
+	 */
+	if (prev->server)
+		prev->server = NULL;
+
 	for_each_class(class) {
 		p = class->pick_next_task(rq, NULL, NULL);
-		if (p) {
-			if (p->sched_class == class && p->server)
-				p->server = NULL;
+		if (p)
 			return p;
-		}
 	}
 
 	/* The idle class should always have a runnable task: */
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1312,6 +1312,10 @@ void dl_server_update(struct sched_dl_en
 
 void dl_server_start(struct sched_dl_entity *dl_se)
 {
+	if (!dl_server(dl_se)) {
+		dl_se->dl_server = 1;
+		setup_new_dl_entity(dl_se);
+	}
 	enqueue_dl_entity(dl_se, dl_se, ENQUEUE_WAKEUP);
 }
 
@@ -1324,12 +1328,9 @@ void dl_server_init(struct sched_dl_enti
 		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick)
 {
-	dl_se->dl_server = 1;
 	dl_se->rq = rq;
 	dl_se->server_has_tasks = has_tasks;
 	dl_se->server_pick = pick;
-
-	setup_new_dl_entity(dl_se);
 }
 
 /*
@@ -2855,6 +2856,7 @@ static void __dl_clear_params(struct sch
 	dl_se->dl_yielded		= 0;
 	dl_se->dl_non_contending	= 0;
 	dl_se->dl_overrun		= 0;
+	dl_se->dl_server		= 0;
 }
 
 void init_dl_entity(struct sched_dl_entity *dl_se)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period
  2019-08-22 16:51         ` Peter Zijlstra
@ 2019-08-31 14:41           ` Alessio Balsini
  2019-09-02  9:16             ` Peter Zijlstra
  0 siblings, 1 reply; 63+ messages in thread
From: Alessio Balsini @ 2019-08-31 14:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, linux-kernel, dietmar.eggemann, luca.abeni,
	bristot, dvyukov, tglx, vpillai, rostedt, kernel-team

Right!

Verified that sysctl_sched_dl_period_max and sysctl_sched_dl_period_min values
are now always consistent.

I spent some time trying to figure out whether not having any mutex in
__checkparam_dl() is safe. It can surely happen that "max < min" is observed, e.g.:

          |              |               periods
User1     | User2        | checkparam_dl()  | sysctl_sched_dl_*
----------|--------------|------------------|-------------------
          |              |                  | [x, x]
p_min = 5 |              |                  |
          |              |                  | [5, x]
p_max = 5 |              |                  |
          |              |                  | [5, 5]
          | setattr(p=8) |                  |
          |              | p = 8            |
          |              | [x, 5]           |
p_max = 9 |              |                  |
          |              |                  | [5, 9]
p_min = 6 |              |                  |
          |              |                  | [6, 9]
          |              | [6, 5]           |
----------|--------------|------------------|-------------------

Sharing my thoughts: a "BUG_ON(max < min)" in __checkparam_dl() is then a
guaranteed source of explosions, but the good news is that the
"if (period < min || period > max)" check in __checkparam_dl() surely fails
if "max < min".  Also, the fact that only one of the sysctl_sched_dl_*
values is actually changed at a time helps to preserve their consistency.

But is that enough?

Well, I couldn't find any counterexample to make __checkparam_dl() pass with
wrong parameters. And the tests I made are happy.

On Thu, Aug 22, 2019 at 06:51:25PM +0200, Peter Zijlstra wrote:
>  include/linux/sched/sysctl.h |    7 +++++
>  kernel/sched/deadline.c      |   58 +++++++++++++++++++++++++++++++++++++++++--
>  kernel/sysctl.c              |   14 ++++++++++
>  3 files changed, 77 insertions(+), 2 deletions(-)
> 
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> +int sched_dl_period_handler(struct ctl_table *table, int write,
> +			    void __user *buffer, size_t *lenp,
> +			    loff_t *ppos)
> +{
> +		if (min > 1ULL << DL_SCALE && max > min) {

s/>/>=/

> +			WRITE_ONCE(sysctl_sched_dl_period_max, new_max);
> +			WRITE_ONCE(sysctl_sched_dl_period_min, new_min);

Besides the inline comment above, this is my ack to your patch.

Otherwise, here follows a slightly more convoluted version of your patch with a
few changes to sched_dl_period_handler(), summarized as:
 - handle new_table only when writing;
 - directly compare the us min and max (saves one multiplication);
 - atomically write only the sysctl_sched_dl_period_XXX which changed.

M2c.

Thanks!
-Alessio

---
From cb4481233b57e42ff9dd315811f7919168a28162 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <peterz@infradead.org>
Date: Thu, 22 Aug 2019 18:51:25 +0200
Subject: [PATCH] sched/deadline: Impose global limits on
 sched_attr::sched_period

There are two DoS scenarios with SCHED_DEADLINE related to
sched_attr::sched_period:

 - since access-control only looks at utilization and density, a very
   large period can allow a very large runtime, which in turn can
   incur a very large latency to lower priority tasks.

 - for very short periods we can end up spending more time programming
   the hardware timer than actually running the task.

Mitigate these by imposing limits on the period.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Alessio Balsini <balsini@android.com>
Cc: rostedt@goodmis.org
Cc: mingo@kernel.org
Cc: luca.abeni@santannapisa.it
Cc: bristot@redhat.com
Cc: vpillai@digitalocean.com
Cc: kernel-team@android.com
Cc: juri.lelli@redhat.com
Cc: dietmar.eggemann@arm.com
Cc: dvyukov@google.com
Cc: tglx@linutronix.de
---
 include/linux/sched/sysctl.h |  7 ++++
 kernel/sched/deadline.c      | 68 ++++++++++++++++++++++++++++++++++--
 kernel/sysctl.c              | 14 ++++++++
 3 files changed, 87 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f7..7c8ef07e52133 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -56,6 +56,13 @@ int sched_proc_update_handler(struct ctl_table *table, int write,
 extern unsigned int sysctl_sched_rt_period;
 extern int sysctl_sched_rt_runtime;
 
+extern unsigned int sysctl_sched_dl_period_max;
+extern unsigned int sysctl_sched_dl_period_min;
+
+extern int sched_dl_period_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+
 #ifdef CONFIG_UCLAMP_TASK
 extern unsigned int sysctl_sched_uclamp_util_min;
 extern unsigned int sysctl_sched_uclamp_util_max;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0b9cbfb2b1d4f..c4a6107e055c7 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2640,6 +2640,59 @@ void __getparam_dl(struct task_struct *p, struct sched_attr *attr)
 	attr->sched_flags = dl_se->flags;
 }
 
+/*
+ * Default limits for DL period: on the top end we guard against small util
+ * tasks still getting ridiculous long effective runtimes, on the bottom end we
+ * guard against timer DoS.
+ */
+unsigned int sysctl_sched_dl_period_max = 1 << 22; /* ~4.2 seconds */
+unsigned int sysctl_sched_dl_period_min = 100;     /* 100 us */
+
+int sched_dl_period_handler(struct ctl_table *table, int write,
+			    void __user *buffer, size_t *lenp,
+			    loff_t *ppos)
+{
+	static DEFINE_MUTEX(mutex);
+	int ret;
+
+	mutex_lock(&mutex);
+	if (write) {
+		/*
+		 * Use a temporary data structure to store the value read from
+		 * userspace. The sysctl_sched_dl_period_{max,min} value will
+		 * be updated only if the data is consistent.
+		 */
+		struct ctl_table new_table = *table;
+		unsigned int max, min;
+
+		if (new_table.data == &sysctl_sched_dl_period_max) {
+			new_table.data = &max;
+			min = sysctl_sched_dl_period_min;
+		} else {
+			new_table.data = &min;
+			max = sysctl_sched_dl_period_max;
+		}
+
+		ret = proc_douintvec(&new_table, write, buffer, lenp, ppos);
+		if (!ret) {
+			if (min > max ||
+			    (u64)min * NSEC_PER_USEC < 1ULL << DL_SCALE) {
+				ret = -EINVAL;
+			} else {
+				unsigned int *src = new_table.data;
+				unsigned int *dst = table->data;
+
+				WRITE_ONCE(*dst, *src);
+			}
+		}
+	} else {
+		ret = proc_douintvec(table, write, buffer, lenp, ppos);
+	}
+	mutex_unlock(&mutex);
+
+	return ret;
+}
+
 /*
  * This function validates the new parameters of a -deadline task.
  * We ask for the deadline not being zero, and greater or equal
@@ -2652,6 +2705,8 @@ void __getparam_dl(struct task_struct *p, struct sched_attr *attr)
  */
 bool __checkparam_dl(const struct sched_attr *attr)
 {
+	u64 period, max, min;
+
 	/* special dl tasks don't actually use any parameter */
 	if (attr->sched_flags & SCHED_FLAG_SUGOV)
 		return true;
@@ -2675,12 +2730,21 @@ bool __checkparam_dl(const struct sched_attr *attr)
 	    attr->sched_period & (1ULL << 63))
 		return false;
 
+	period = attr->sched_period;
+	if (!period)
+		period = attr->sched_deadline;
+
 	/* runtime <= deadline <= period (if period != 0) */
-	if ((attr->sched_period != 0 &&
-	     attr->sched_period < attr->sched_deadline) ||
+	if (period < attr->sched_deadline ||
 	    attr->sched_deadline < attr->sched_runtime)
 		return false;
 
+	max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC;
+	min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC;
+
+	if (period < min || period > max)
+		return false;
+
 	return true;
 }
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 078950d9605ba..0d07e4707e9d2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -442,6 +442,20 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sched_rt_handler,
 	},
+	{
+		.procname	= "sched_deadline_period_max_us",
+		.data		= &sysctl_sched_dl_period_max,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sched_dl_period_handler,
+	},
+	{
+		.procname	= "sched_deadline_period_min_us",
+		.data		= &sysctl_sched_dl_period_min,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sched_dl_period_handler,
+	},
 	{
 		.procname	= "sched_rr_timeslice_ms",
 		.data		= &sysctl_sched_rr_timeslice,
-- 
2.23.0.187.g17f5b7556c-goog



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period
  2019-08-31 14:41           ` Alessio Balsini
@ 2019-09-02  9:16             ` Peter Zijlstra
  2019-09-02 12:31               ` Peter Zijlstra
  2019-09-04 10:16               ` Steven Rostedt
  0 siblings, 2 replies; 63+ messages in thread
From: Peter Zijlstra @ 2019-09-02  9:16 UTC (permalink / raw)
  To: Alessio Balsini
  Cc: mingo, juri.lelli, linux-kernel, dietmar.eggemann, luca.abeni,
	bristot, dvyukov, tglx, vpillai, rostedt, kernel-team

On Sat, Aug 31, 2019 at 03:41:17PM +0100, Alessio Balsini wrote:
> Right!
> 
> Verified that sysctl_sched_dl_period_max and sysctl_sched_dl_period_min values
> are now always consistent.
> 
> I spent some time trying to figure out whether not having any mutex in
> __checkparam_dl() is safe. It can surely happen that "max < min" is observed, e.g.:
> 
>           |              |               periods
> User1     | User2        | checkparam_dl()  | sysctl_sched_dl_*
> ----------|--------------|------------------|-------------------
>           |              |                  | [x, x]
> p_min = 5 |              |                  |
>           |              |                  | [5, x]
> p_max = 5 |              |                  |
>           |              |                  | [5, 5]
>           | setattr(p=8) |                  |
>           |              | p = 8            |
>           |              | [x, 5]           |
> p_max = 9 |              |                  |
>           |              |                  | [5, 9]
> p_min = 6 |              |                  |
>           |              |                  | [6, 9]
>           |              | [6, 5]           |
> ----------|--------------|------------------|-------------------
> 
> Sharing my thoughts: a "BUG_ON(max < min)" in __checkparam_dl() is then a
> guaranteed source of explosions, but the good news is that the
> "if (period < min || period > max)" check in __checkparam_dl() surely fails
> if "max < min".  Also, the fact that only one of the sysctl_sched_dl_*
> values is actually changed at a time helps to preserve their consistency.
> 
> But is that enough?

Strictly speaking, no, I suppose it is not. We can have two changes in
between the two READ_ONCE()s and then we'd be able to observe a
violation.

The easy way to fix that is do something like:

+	synchronize_rcu();
	mutex_unlock(&mutex);

in sched_dl_period_handler(). And do:

+	preempt_disable();
	max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC;
	min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC;
+	preempt_enable();

in __checkparam_dl().

That would prohibit us from seeing two changes, and seeing only a single
change is safe.
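
Folded into the earlier patch, the end result would look roughly like this
(completely untested sketch, only the relevant tails shown):

int sched_dl_period_handler(struct ctl_table *table, int write,
			    void __user *buffer, size_t *lenp,
			    loff_t *ppos)
{
	...
	/*
	 * Wait for all preempt-disabled readers in __checkparam_dl() to
	 * finish before the next update can start; a reader can then
	 * observe at most one change to the min/max pair.
	 */
	synchronize_rcu();
	mutex_unlock(&mutex);

	return ret;
}

bool __checkparam_dl(const struct sched_attr *attr)
{
	...
	/* RCU-sched read side, paired with the synchronize_rcu() above */
	preempt_disable();
	max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC;
	min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC;
	preempt_enable();

	if (period < min || period > max)
		return false;

	return true;
}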



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period
  2019-09-02  9:16             ` Peter Zijlstra
@ 2019-09-02 12:31               ` Peter Zijlstra
  2019-09-04 10:16               ` Steven Rostedt
  1 sibling, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2019-09-02 12:31 UTC (permalink / raw)
  To: Alessio Balsini
  Cc: mingo, juri.lelli, linux-kernel, dietmar.eggemann, luca.abeni,
	bristot, dvyukov, tglx, vpillai, rostedt, kernel-team

On Mon, Sep 02, 2019 at 11:16:23AM +0200, Peter Zijlstra wrote:
> On Sat, Aug 31, 2019 at 03:41:17PM +0100, Alessio Balsini wrote:
> > Right!
> > 
> > Verified that sysctl_sched_dl_period_max and sysctl_sched_dl_period_min values
> > are now always consistent.
> > 
> > I spent some time trying to figure out whether not having any mutex in
> > __checkparam_dl() is safe. It can surely happen that "max < min" is observed, e.g.:

> > Sharing my thoughts: a "BUG_ON(max < min)" in __checkparam_dl() is then a
> > guaranteed source of explosions, but the good news is that the
> > "if (period < min || period > max)" check in __checkparam_dl() surely fails
> > if "max < min".  Also, the fact that only one of the sysctl_sched_dl_*
> > values is actually changed at a time helps to preserve their consistency.
> > 
> > But is that enough?
> 
> Strictly speaking, no, I suppose it is not. We can have two changes in
> between the two READ_ONCE()s and then we'd be able to observe a
> violation.
> 
> The easy way to fix that is do something like:
> 
> +	synchronize_rcu();
> 	mutex_unlock(&mutex);
> 
> in sched_dl_period_handler(). And do:
> 
> +	preempt_disable();
> 	max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC;
> 	min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC;
> +	preempt_enable();
> 
> in __checkparam_dl().
> 
> That would prohibit us from seeing two changes, and seeing only a single
> change is safe.

I pushed out a new version; and added patch to sched_rt_handler() on
top.

Please have a look at:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/wip-deadline

I'll move these two patches to sched/core if everything looks good.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure
  2019-07-26 14:54 [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure Peter Zijlstra
                   ` (13 preceding siblings ...)
  2019-07-26 20:01 ` [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure luca abeni
@ 2019-09-03 14:27 ` Alessio Balsini
  2019-09-04 10:50   ` Juri Lelli
  14 siblings, 1 reply; 63+ messages in thread
From: Alessio Balsini @ 2019-09-03 14:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, linux-kernel, dietmar.eggemann, luca.abeni,
	bristot, dvyukov, tglx, vpillai, rostedt

Hi Peter,

While testing your series (peterz/sched/wip-deadline 7a9e91d3fe951), I ended up
in a panic at boot on a x86_64 kvm guest, would you please have a look?  Here
attached the backtrace.
Happy to test any suggestion that fixes the issue.

Thanks,
Alessio
---
------>8------
[    0.798326] ------------[ cut here ]------------       
[    0.798328] kernel BUG at kernel/sched/deadline.c:1542!            
[    0.798335] invalid opcode: 0000 [#1] SMP PTI                                  
[    0.798339] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 5.3.0-rc6+ #28
[    0.798340] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[    0.798349] RIP: 0010:enqueue_dl_entity+0x3f8/0x440
[    0.798351] Code: ff 48 8b 85 60 0a 00 00 8b 48 28 85 c9 0f 85 99 fd ff ff c7 40 28 01 00 00 00 e9 8d fd ff ff 85 c0 75 20 f                                                                                                                             
[    0.798353] RSP: 0000:ffffb68e40154f10 EFLAGS: 00010096
[    0.798356] RAX: 0000000000000020 RBX: ffff974bc74d0c00 RCX: ffff974bc751b200
[    0.798358] RDX: 0000000000000001 RSI: ffff974bc7929410 RDI: ffff974bc7929410
[    0.798359] RBP: 0000000000000009 R08: 00000000a73eb274 R09: 00000000000f4887
[    0.798361] R10: 0000000000000000 R11: 0000000000000000 R12: ffff974bc74d0c80
[    0.798362] R13: 0000000000000000 R14: ffff974bc74d0d00 R15: 0000000000000000
[    0.798365] FS:  0000000000000000(0000) GS:ffff974bc7900000(0000) knlGS:0000000000000000
[    0.798371] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 
[    0.798372] CR2: 00000000ffffffff CR3: 000000000480a000 CR4: 00000000000006e0
[    0.798374] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.798375] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.798376] Call Trace:                       
[    0.798397]  <IRQ>                                                
[    0.798402]  enqueue_task_fair+0xe69/0x11d0   
[    0.798407]  activate_task+0x58/0x90                                
[    0.798412]  ? kvm_sched_clock_read+0xd/0x20              
[    0.798416]  ttwu_do_activate.isra.96+0x3a/0x50                 
[    0.798420]  sched_ttwu_pending+0x5e/0x90                                              
[    0.798424]  scheduler_ipi+0x9f/0x120               
[    0.798430]  reschedule_interrupt+0xf/0x20           
[    0.798432]  </IRQ>                                                                           
[    0.798436] RIP: 0010:default_idle+0x20/0x140   
[    0.798438] Code: 90 90 90 90 90 90 90 90 90 90 41 55 41 54 55 65 8b 2d f4 40 d7 70 53 0f 1f 44 00 00 e9 07 00 00 00 0f 00 5                                                                                                                             
[    0.798440] RSP: 0000:ffffb68e40083ec0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff02
[    0.798442] RAX: ffffffff8f29c250 RBX: 0000000000000004 RCX: ffff974bc7916000               
[    0.798444] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff974bc791ca80
[    0.798445] RBP: 0000000000000004 R08: 000000009d74022b R09: 0000004d7ebb5820
[    0.798447] R10: 0000000000000400 R11: 0000000000000400 R12: 0000000000000000
[    0.798448] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    0.798451]  ? __cpuidle_text_start+0x8/0x8 
[    0.798455]  do_idle+0x19e/0x210              
[    0.798458]  cpu_startup_entry+0x14/0x20
[    0.798464]  start_secondary+0x144/0x170                       
[    0.798467]  secondary_startup_64+0xa4/0xb0   
[    0.798469] Modules linked in:              
[    0.798478] ---[ end trace c2be7729c78a55ad ]---
[    0.798482] RIP: 0010:enqueue_dl_entity+0x3f8/0x440                                   
[    0.798484] Code: ff 48 8b 85 60 0a 00 00 8b 48 28 85 c9 0f 85 99 fd ff ff c7 40 28 01 00 00 00 e9 8d fd ff ff 85 c0 75 20 f                                                                                                                             
[    0.798485] RSP: 0000:ffffb68e40154f10 EFLAGS: 00010096
[    0.798487] RAX: 0000000000000020 RBX: ffff974bc74d0c00 RCX: ffff974bc751b200
[    0.798489] RDX: 0000000000000001 RSI: ffff974bc7929410 RDI: ffff974bc7929410
[    0.798490] RBP: 0000000000000009 R08: 00000000a73eb274 R09: 00000000000f4887
[    0.798491] R10: 0000000000000000 R11: 0000000000000000 R12: ffff974bc74d0c80
[    0.798493] R13: 0000000000000000 R14: ffff974bc74d0d00 R15: 0000000000000000
[    0.798495] FS:  0000000000000000(0000) GS:ffff974bc7900000(0000) knlGS:0000000000000000
[    0.798500] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.798501] CR2: 00000000ffffffff CR3: 000000000480a000 CR4: 00000000000006e0
[    0.798502] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.798504] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.798505] Kernel panic - not syncing: Fatal exception in interrupt
[    0.799522] Kernel Offset: 0xd800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[    0.875144] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
------8<------


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period
  2019-09-02  9:16             ` Peter Zijlstra
  2019-09-02 12:31               ` Peter Zijlstra
@ 2019-09-04 10:16               ` Steven Rostedt
  2019-09-04 11:30                 ` Peter Zijlstra
  1 sibling, 1 reply; 63+ messages in thread
From: Steven Rostedt @ 2019-09-04 10:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alessio Balsini, mingo, juri.lelli, linux-kernel,
	dietmar.eggemann, luca.abeni, bristot, dvyukov, tglx, vpillai,
	kernel-team

On Mon, 2 Sep 2019 11:16:23 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> in sched_dl_period_handler(). And do:
> 
> +	preempt_disable();
> 	max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC;
> 	min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC;
> +	preempt_enable();

Hmm, I'm curious. Doesn't the preempt_disable/enable() also add
compiler barriers which would remove the need for the READ_ONCE()s here?

-- Steve

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure
  2019-09-03 14:27 ` Alessio Balsini
@ 2019-09-04 10:50   ` Juri Lelli
  2019-09-04 11:32     ` Peter Zijlstra
  0 siblings, 1 reply; 63+ messages in thread
From: Juri Lelli @ 2019-09-04 10:50 UTC (permalink / raw)
  To: Alessio Balsini
  Cc: Peter Zijlstra, mingo, linux-kernel, dietmar.eggemann,
	luca.abeni, bristot, dvyukov, tglx, vpillai, rostedt

Hi Alessio,

On 03/09/19 15:27, Alessio Balsini wrote:
> Hi Peter,
> 
> While testing your series (peterz/sched/wip-deadline 7a9e91d3fe951), I ended up
> in a panic at boot on a x86_64 kvm guest, would you please have a look?  Here
> attached the backtrace.
> Happy to test any suggestion that fixes the issue.

Are you running with latest fix by Peter?

https://lore.kernel.org/lkml/20190830112437.GD2369@hirez.programming.kicks-ass.net/

It seems that his wip tree now has d3138279c7f3 on top (and the fix
above has been merged).

Not sure it fixes also what you are seeing, though.

Thanks,

Juri

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period
  2019-09-04 10:16               ` Steven Rostedt
@ 2019-09-04 11:30                 ` Peter Zijlstra
  2019-09-04 13:24                   ` Joel Fernandes
  0 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2019-09-04 11:30 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Alessio Balsini, mingo, juri.lelli, linux-kernel,
	dietmar.eggemann, luca.abeni, bristot, dvyukov, tglx, vpillai,
	kernel-team

On Wed, Sep 04, 2019 at 06:16:16AM -0400, Steven Rostedt wrote:
> On Mon, 2 Sep 2019 11:16:23 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > in sched_dl_period_handler(). And do:
> > 
> > +	preempt_disable();
> > 	max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC;
> > 	min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC;
> > +	preempt_enable();
> 
> Hmm, I'm curious. Doesn't the preempt_disable/enable() also add
> compiler barriers which would remove the need for the READ_ONCE()s here?

They do add compiler barriers; but they do not avoid the compiler
tearing stuff up.

So while Linus has declared that compilers should not be tearing shit
up, I'm hesitant to actually trust compilers much these days.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure
  2019-09-04 10:50   ` Juri Lelli
@ 2019-09-04 11:32     ` Peter Zijlstra
  0 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2019-09-04 11:32 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Alessio Balsini, mingo, linux-kernel, dietmar.eggemann,
	luca.abeni, bristot, dvyukov, tglx, vpillai, rostedt

On Wed, Sep 04, 2019 at 12:50:37PM +0200, Juri Lelli wrote:
> Hi Alessio,
> 
> On 03/09/19 15:27, Alessio Balsini wrote:
> > Hi Peter,
> > 
> > While testing your series (peterz/sched/wip-deadline 7a9e91d3fe951), I ended up
> > in a panic at boot on a x86_64 kvm guest, would you please have a look?  Here
> > attached the backtrace.
> > Happy to test any suggestion that fixes the issue.
> 
> Are you running with latest fix by Peter?
> 
> https://lore.kernel.org/lkml/20190830112437.GD2369@hirez.programming.kicks-ass.net/
> 
> It seems that his wip tree now has d3138279c7f3 on top (and the fix
> above has been merged).
> 
> Not sure it fixes also what you are seeing, though.

He likely is; but it is also very likely I messed it up somehow; I
didn't even boot that branch :/ I'll try and have a look, but I'm
running out of time before LPC.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period
  2019-09-04 11:30                 ` Peter Zijlstra
@ 2019-09-04 13:24                   ` Joel Fernandes
  2019-09-04 14:11                     ` Will Deacon
  2019-09-04 15:52                     ` Peter Zijlstra
  0 siblings, 2 replies; 63+ messages in thread
From: Joel Fernandes @ 2019-09-04 13:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Alessio Balsini, mingo, juri.lelli, linux-kernel,
	dietmar.eggemann, luca.abeni, bristot, dvyukov, tglx, vpillai,
	kernel-team, will.deacon

On Wed, Sep 04, 2019 at 01:30:38PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 04, 2019 at 06:16:16AM -0400, Steven Rostedt wrote:
> > On Mon, 2 Sep 2019 11:16:23 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > in sched_dl_period_handler(). And do:
> > > 
> > > +	preempt_disable();
> > > 	max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC;
> > > 	min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC;
> > > +	preempt_enable();
> > 
> > Hmm, I'm curious. Doesn't the preempt_disable/enable() also add
> > compiler barriers which would remove the need for the READ_ONCE()s here?
> 
> They do add compiler barriers; but they do not avoid the compiler
> tearing stuff up.

Neither does WRITE_ONCE() on some possibly buggy but currently circulating
compilers :(

As Will said in:
https://lore.kernel.org/lkml/20190821103200.kpufwtviqhpbuv2n@willie-the-truck/

void bar(u64 *x)
{
	*(volatile u64 *)x = 0xabcdef10abcdef10;
}

gives:

bar:
	mov	w1, 61200
	movk	w1, 0xabcd, lsl 16
	str	w1, [x0]
	str	w1, [x0, 4]
	ret

Speaking of which, Will, is there a plan to have compiler folks address this
tearing issue and are bugs filed somewhere? I believe aarch64 gcc is buggy,
and clang is better but is still buggy?

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period
  2019-09-04 13:24                   ` Joel Fernandes
@ 2019-09-04 14:11                     ` Will Deacon
  2019-09-04 14:35                       ` Joel Fernandes
  2019-09-04 15:52                     ` Peter Zijlstra
  1 sibling, 1 reply; 63+ messages in thread
From: Will Deacon @ 2019-09-04 14:11 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Peter Zijlstra, Steven Rostedt, Alessio Balsini, mingo,
	juri.lelli, linux-kernel, dietmar.eggemann, luca.abeni, bristot,
	dvyukov, tglx, vpillai, kernel-team, will

Hi Joel,

On Wed, Sep 04, 2019 at 09:24:18AM -0400, Joel Fernandes wrote:
> On Wed, Sep 04, 2019 at 01:30:38PM +0200, Peter Zijlstra wrote:
> > On Wed, Sep 04, 2019 at 06:16:16AM -0400, Steven Rostedt wrote:
> > > On Mon, 2 Sep 2019 11:16:23 +0200
> > > Peter Zijlstra <peterz@infradead.org> wrote:
> > > 
> > > > in sched_dl_period_handler(). And do:
> > > > 
> > > > +	preempt_disable();
> > > > 	max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC;
> > > > 	min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC;
> > > > +	preempt_enable();
> > > 
> > > Hmm, I'm curious. Doesn't the preempt_disable/enable() also add
> > > compiler barriers which would remove the need for the READ_ONCE()s here?
> > 
> > They do add compiler barriers; but they do not avoid the compiler
> > tearing stuff up.
> 
> Neither does WRITE_ONCE() on some possibly buggy but currently circulating
> compilers :(

Hmm. The example above is using READ_ONCE, which is a different kettle of
frogs.

> As Will said in:
> https://lore.kernel.org/lkml/20190821103200.kpufwtviqhpbuv2n@willie-the-truck/
> 
> void bar(u64 *x)
> {
> 	*(volatile u64 *)x = 0xabcdef10abcdef10;
> }
> 
> gives:
> 
> bar:
> 	mov	w1, 61200
> 	movk	w1, 0xabcd, lsl 16
> 	str	w1, [x0]
> 	str	w1, [x0, 4]
> 	ret
> 
> Speaking of which, Will, is there a plan to have compiler folks address this
> tearing issue and are bugs filed somewhere? I believe aarch64 gcc is buggy,
> and clang is better but is still buggy?

Well, it depends on your point of view. Personally, I think tearing a
volatile access (e.g. WRITE_ONCE) is buggy and it seems as though the GCC
developers agree:

https://gcc.gnu.org/ml/gcc-patches/2019-08/msg01500.html

so it's likely this will be fixed for AArch64 GCC. I couldn't persuade
clang to break the volatile case, so I think we're good there too.

For the non-volatile case, I don't actually consider it to be a bug,
although I sympathise with the desire to avoid a retrospective tree-wide
sweep adding random WRITE_ONCE invocations to stores that look like they
might be concurrent. In other words, I think I'd suggest:

  * The use of WRITE_ONCE in new code (probably with a comment justifying it)
  * The introduction of WRITE_ONCE to existing code where it can be shown to
    be fixing a real bug (e.g. by demonstrating that a compiler actually
    gets it wrong)

For the /vast/ majority of cases, the compiler will do the right thing
even without WRITE_ONCE, simply because that's going to be the most
performant choice as well.
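
As a concrete (untested) illustration of the first bullet, with made-up
names:

static DEFINE_MUTEX(foo_mutex);
static unsigned int foo_limit;	/* written under foo_mutex, read locklessly */

void foo_set_limit(unsigned int val)
{
	lockdep_assert_held(&foo_mutex);
	/*
	 * Paired with the READ_ONCE() in foo_within_limit(); readers are
	 * lockless, so don't give the compiler a licence to tear or fuse
	 * this store.
	 */
	WRITE_ONCE(foo_limit, val);
}

bool foo_within_limit(unsigned int x)
{
	return x <= READ_ONCE(foo_limit);
}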

Will

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period
  2019-09-04 14:11                     ` Will Deacon
@ 2019-09-04 14:35                       ` Joel Fernandes
  0 siblings, 0 replies; 63+ messages in thread
From: Joel Fernandes @ 2019-09-04 14:35 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Steven Rostedt, Alessio Balsini, mingo,
	juri.lelli, linux-kernel, dietmar.eggemann, luca.abeni, bristot,
	dvyukov, tglx, vpillai, kernel-team

On Wed, Sep 04, 2019 at 03:11:10PM +0100, Will Deacon wrote:
> Hi Joel,
> 
> On Wed, Sep 04, 2019 at 09:24:18AM -0400, Joel Fernandes wrote:
> > On Wed, Sep 04, 2019 at 01:30:38PM +0200, Peter Zijlstra wrote:
> > > On Wed, Sep 04, 2019 at 06:16:16AM -0400, Steven Rostedt wrote:
> > > > On Mon, 2 Sep 2019 11:16:23 +0200
> > > > Peter Zijlstra <peterz@infradead.org> wrote:
> > > > 
> > > > > in sched_dl_period_handler(). And do:
> > > > > 
> > > > > +	preempt_disable();
> > > > > 	max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC;
> > > > > 	min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC;
> > > > > +	preempt_enable();
> > > > 
> > > > Hmm, I'm curious. Doesn't the preempt_disable/enable() also add
> > > > compiler barriers which would remove the need for the READ_ONCE()s here?
> > > 
> > > They do add compiler barriers; but they do not avoid the compiler
> > > tearing stuff up.
> > 
> > Neither does WRITE_ONCE() on some possibly buggy but currently circulating
> > compilers :(
> 
> Hmm. The example above is using READ_ONCE, which is a different kettle of
> frogs.

True. But, I equally worry about all *-tearing frog kettles ;-)

> > As Will said in:
> > https://lore.kernel.org/lkml/20190821103200.kpufwtviqhpbuv2n@willie-the-truck/
> > 
> > void bar(u64 *x)
> > {
> > 	*(volatile u64 *)x = 0xabcdef10abcdef10;
> > }
> > 
> > gives:
> > 
> > bar:
> > 	mov	w1, 61200
> > 	movk	w1, 0xabcd, lsl 16
> > 	str	w1, [x0]
> > 	str	w1, [x0, 4]
> > 	ret
> > 
> > Speaking of which, Will, is there a plan to have compiler folks address this
> > tearing issue and are bugs filed somewhere? I believe aarch64 gcc is buggy,
> > and clang is better but is still buggy?
> 
> Well, it depends on your point of view. Personally, I think tearing a
> volatile access (e.g. WRITE_ONCE) is buggy and it seems as though the GCC
> developers agree:
> 
> https://gcc.gnu.org/ml/gcc-patches/2019-08/msg01500.html
> 
> so it's likely this will be fixed for AArch64 GCC. I couldn't persuade
> clang to break the volatile case, so think we're good there too.

Glad to know that GCC folks are looking into the issue.

Sorry if this is getting a bit off-topic. Also, does the aarch64 clang
"memset folding" issue also need to be looked into?
You had mentioned it in the same thread:
https://lore.kernel.org/lkml/20190821103200.kpufwtviqhpbuv2n@willie-the-truck/
Or, does WRITE_ONCE() resolve such memset store-merging?

> For the non-volatile case, I don't actually consider it to be a bug,
> although I sympathise with the desire to avoid a retrospective tree-wide
> sweep adding random WRITE_ONCE invocations to stores that look like they
> might be concurrent. In other words, I think I'd suggest:
> 
>   * The use of WRITE_ONCE in new code (probably with a comment justifying it)
>   * The introduction of WRITE_ONCE to existing code where it can be shown to
>     be fixing a real bug (e.g. by demonstrating that a compiler actually
>     gets it wrong)
> 
> For the /vast/ majority of cases, the compiler will do the right thing
> even without WRITE_ONCE, simply because that's going to be the most
> performant choice as well.

Thanks for the thoughts. They seem to be reasonable to me.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period
  2019-09-04 13:24                   ` Joel Fernandes
  2019-09-04 14:11                     ` Will Deacon
@ 2019-09-04 15:52                     ` Peter Zijlstra
  1 sibling, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2019-09-04 15:52 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Steven Rostedt, Alessio Balsini, mingo, juri.lelli, linux-kernel,
	dietmar.eggemann, luca.abeni, bristot, dvyukov, tglx, vpillai,
	kernel-team, will.deacon

On Wed, Sep 04, 2019 at 09:24:18AM -0400, Joel Fernandes wrote:
> On Wed, Sep 04, 2019 at 01:30:38PM +0200, Peter Zijlstra wrote:
> > On Wed, Sep 04, 2019 at 06:16:16AM -0400, Steven Rostedt wrote:
> > > On Mon, 2 Sep 2019 11:16:23 +0200
> > > Peter Zijlstra <peterz@infradead.org> wrote:
> > > 
> > > > in sched_dl_period_handler(). And do:
> > > > 
> > > > +	preempt_disable();
> > > > 	max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC;
> > > > 	min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC;
> > > > +	preempt_enable();
> > > 
> > > Hmm, I'm curious. Doesn't the preempt_disable/enable() also add
> > > compiler barriers which would remove the need for the READ_ONCE()s here?
> > 
> > They do add compiler barriers; but they do not avoid the compiler
> > tearing stuff up.
> 
> Neither does WRITE_ONCE() on some possibly buggy but currently circulating
> compilers :(

Yes, I'm aware :/

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers
  2019-08-30 11:24                 ` Peter Zijlstra
@ 2019-09-06  9:36                   ` Juri Lelli
  0 siblings, 0 replies; 63+ messages in thread
From: Juri Lelli @ 2019-09-06  9:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, mingo, linux-kernel, luca.abeni, bristot,
	balsini, dvyukov, tglx, vpillai, rostedt

Hi,

On 30/08/19 13:24, Peter Zijlstra wrote:
> On Thu, Aug 08, 2019 at 11:45:46AM +0200, Juri Lelli wrote:
> > I'd like to take this last sentence back, I was able to run a few boot +
> > hackbench + shutdown cycles with the following applied (guess too much
> > debug printks around before).
> 
> I've changed that slightly; the merged delta looks like:

It looks like I had to further apply what follows on top of your latest
wip-deadline branch (d3138279c7f3) to make kvm boot.

--->8---
 kernel/fork.c       | 2 ++
 kernel/sched/core.c | 3 +--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 2852d0e76ea3..4f404fe1d9ab 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1983,6 +1983,8 @@ static __latent_entropy struct task_struct *copy_process(
 	p->sequential_io_avg	= 0;
 #endif
 
+	p->server = NULL;
+
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
 	if (retval)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1fd2a8c68daf..d72579b93a9f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3917,8 +3917,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		 * This is the fast path; it cannot be a DL server pick;
 		 * therefore even if @p == @prev, ->server must be NULL.
 		 */
-		if (prev->server)
-			p->server = NULL;
+		p->server = NULL;
 
 		return p;
 	}

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 4.4 4.9 4.14] loop: Add LOOP_SET_DIRECT_IO to compat ioctl
  2019-08-05 11:53     ` Peter Zijlstra
  2019-08-22 12:29       ` Alessio Balsini
@ 2019-10-23 17:17       ` Alessio Balsini
  2019-10-23 17:22         ` Alessio Balsini
  2019-10-25  0:17         ` Sasha Levin
  1 sibling, 2 replies; 63+ messages in thread
From: Alessio Balsini @ 2019-10-23 17:17 UTC (permalink / raw)
  To: gregkh
  Cc: stable, linux-kernel, kernel-team, Alessio Balsini, Jens Axboe,
	Sasha Levin

[ Upstream commit fdbe4eeeb1aac219b14f10c0ed31ae5d1123e9b8 ]

Enabling Direct I/O with loop devices helps reduce memory usage by
avoiding double caching.  32-bit applications running on 64-bit systems
are currently not able to request direct I/O because LOOP_SET_DIRECT_IO
is missing from lo_compat_ioctl().

This patch fixes the compatibility issue mentioned above by exporting
LOOP_SET_DIRECT_IO as an additional lo_compat_ioctl() entry.
The input argument for this ioctl is a single long converted to a 1-bit
boolean, so compatibility is preserved.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Alessio Balsini <balsini@android.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/block/loop.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index da3902ac16c86..8aadd4d0c3a88 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1557,6 +1557,7 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
 		arg = (unsigned long) compat_ptr(arg);
 	case LOOP_SET_FD:
 	case LOOP_CHANGE_FD:
+	case LOOP_SET_DIRECT_IO:
 		err = lo_ioctl(bdev, mode, cmd, arg);
 		break;
 	default:
-- 
2.23.0.866.gb869b98d4c-goog
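
For reference, the user-space side this unblocks is simply (untested
sketch; assumes /dev/loop0 is already bound to a backing file):

#include <fcntl.h>
#include <linux/loop.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/loop0", O_RDWR);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* same call works from a 32-bit binary once the compat entry exists */
	if (ioctl(fd, LOOP_SET_DIRECT_IO, 1UL) < 0)
		perror("LOOP_SET_DIRECT_IO");
	close(fd);
	return 0;
}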


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH 4.4 4.9 4.14] loop: Add LOOP_SET_DIRECT_IO to compat ioctl
  2019-10-23 17:17       ` [PATCH 4.4 4.9 4.14] loop: Add LOOP_SET_DIRECT_IO to compat ioctl Alessio Balsini
@ 2019-10-23 17:22         ` Alessio Balsini
  2019-10-25  0:17         ` Sasha Levin
  1 sibling, 0 replies; 63+ messages in thread
From: Alessio Balsini @ 2019-10-23 17:22 UTC (permalink / raw)
  To: gregkh; +Cc: stable, linux-kernel, kernel-team, Jens Axboe, Sasha Levin

Oops, please forgive the wrong in-reply-to message id :)

Cheers,
Alessio

On Wed, Oct 23, 2019 at 06:17:36PM +0100, Alessio Balsini wrote:
> [ Upstream commit fdbe4eeeb1aac219b14f10c0ed31ae5d1123e9b8 ]
> 
> Enabling Direct I/O with loop devices helps reduce memory usage by
> avoiding double caching.  32-bit applications running on 64-bit systems
> are currently not able to request direct I/O because LOOP_SET_DIRECT_IO
> is missing from lo_compat_ioctl().
> 
> This patch fixes the compatibility issue mentioned above by exporting
> LOOP_SET_DIRECT_IO as an additional lo_compat_ioctl() entry.
> The input argument for this ioctl is a single long converted to a 1-bit
> boolean, so compatibility is preserved.
> 
> Cc: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Alessio Balsini <balsini@android.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Sasha Levin <sashal@kernel.org>
> ---
>  drivers/block/loop.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index da3902ac16c86..8aadd4d0c3a88 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1557,6 +1557,7 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
>  		arg = (unsigned long) compat_ptr(arg);
>  	case LOOP_SET_FD:
>  	case LOOP_CHANGE_FD:
> +	case LOOP_SET_DIRECT_IO:
>  		err = lo_ioctl(bdev, mode, cmd, arg);
>  		break;
>  	default:
> -- 
> 2.23.0.866.gb869b98d4c-goog
> 
> -- 
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 4.4 4.9 4.14] loop: Add LOOP_SET_DIRECT_IO to compat ioctl
  2019-10-23 17:17       ` [PATCH 4.4 4.9 4.14] loop: Add LOOP_SET_DIRECT_IO to compat ioctl Alessio Balsini
  2019-10-23 17:22         ` Alessio Balsini
@ 2019-10-25  0:17         ` Sasha Levin
  1 sibling, 0 replies; 63+ messages in thread
From: Sasha Levin @ 2019-10-25  0:17 UTC (permalink / raw)
  To: Alessio Balsini; +Cc: gregkh, stable, linux-kernel, kernel-team, Jens Axboe

On Wed, Oct 23, 2019 at 06:17:36PM +0100, Alessio Balsini wrote:
>[ Upstream commit fdbe4eeeb1aac219b14f10c0ed31ae5d1123e9b8 ]
>
>Enabling Direct I/O with loop devices helps reduce memory usage by
>avoiding double caching.  32-bit applications running on 64-bit systems
>are currently not able to request direct I/O because LOOP_SET_DIRECT_IO
>is missing from lo_compat_ioctl().
>
>This patch fixes the compatibility issue mentioned above by exporting
>LOOP_SET_DIRECT_IO as an additional lo_compat_ioctl() entry.
>The input argument for this ioctl is a single long converted to a 1-bit
>boolean, so compatibility is preserved.
>
>Cc: Jens Axboe <axboe@kernel.dk>
>Signed-off-by: Alessio Balsini <balsini@android.com>
>Signed-off-by: Jens Axboe <axboe@kernel.dk>
>Signed-off-by: Sasha Levin <sashal@kernel.org>

Queued up, thanks!

-- 
Thanks,
Sasha

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period
  2019-07-26 14:54 ` [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period Peter Zijlstra
  2019-07-29  8:57   ` Juri Lelli
  2019-08-02 17:21   ` Alessio Balsini
@ 2020-05-20 18:38   ` Juri Lelli
  2020-05-21 13:45     ` Daniel Bristot de Oliveira
  2020-06-16 12:21   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  3 siblings, 1 reply; 63+ messages in thread
From: Juri Lelli @ 2020-05-20 18:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, dietmar.eggemann, luca.abeni, bristot,
	balsini, dvyukov, tglx, vpillai, rostedt

Hi Peter,

On 26/07/19 16:54, Peter Zijlstra wrote:
> 
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Luca Abeni <luca.abeni@santannapisa.it>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Dmitry Vyukov <dvyukov@google.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/sched/sysctl.h |    3 +++
>  kernel/sched/deadline.c      |   23 +++++++++++++++++++++--
>  kernel/sysctl.c              |   14 ++++++++++++++
>  3 files changed, 38 insertions(+), 2 deletions(-)
> 
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -56,6 +56,9 @@ int sched_proc_update_handler(struct ctl
>  extern unsigned int sysctl_sched_rt_period;
>  extern int sysctl_sched_rt_runtime;
>  
> +extern unsigned int sysctl_sched_dl_period_max;
> +extern unsigned int sysctl_sched_dl_period_min;
> +
>  #ifdef CONFIG_UCLAMP_TASK
>  extern unsigned int sysctl_sched_uclamp_util_min;
>  extern unsigned int sysctl_sched_uclamp_util_max;
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2597,6 +2597,14 @@ void __getparam_dl(struct task_struct *p
>  }
>  
>  /*
> + * Default limits for DL period; on the top end we guard against small util
> + * tasks still getting ridiculous long effective runtimes, on the bottom end we
> + * guard against timer DoS.
> + */
> +unsigned int sysctl_sched_dl_period_max = 1 << 22; /* ~4 seconds */
> +unsigned int sysctl_sched_dl_period_min = 100;     /* 100 us */
> +
> +/*
>   * This function validates the new parameters of a -deadline task.
>   * We ask for the deadline not being zero, and greater or equal
>   * than the runtime, as well as the period of being zero or
> @@ -2608,6 +2616,8 @@ void __getparam_dl(struct task_struct *p
>   */
>  bool __checkparam_dl(const struct sched_attr *attr)
>  {
> +	u64 period, max, min;
> +
>  	/* special dl tasks don't actually use any parameter */
>  	if (attr->sched_flags & SCHED_FLAG_SUGOV)
>  		return true;
> @@ -2631,12 +2641,21 @@ bool __checkparam_dl(const struct sched_
>  	    attr->sched_period & (1ULL << 63))
>  		return false;
>  
> +	period = attr->sched_period;
> +	if (!period)
> +		period = attr->sched_deadline;
> +
>  	/* runtime <= deadline <= period (if period != 0) */
> -	if ((attr->sched_period != 0 &&
> -	     attr->sched_period < attr->sched_deadline) ||
> +	if (period < attr->sched_deadline ||
>  	    attr->sched_deadline < attr->sched_runtime)
>  		return false;
>  
> +	max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC;
> +	min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC;
> +
> +	if (period < min || period > max)
> +		return false;
> +
>  	return true;
>  }
>  
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -443,6 +443,20 @@ static struct ctl_table kern_table[] = {
>  		.proc_handler	= sched_rt_handler,
>  	},
>  	{
> +		.procname	= "sched_deadline_period_max_us",
> +		.data		= &sysctl_sched_dl_period_max,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +	{
> +		.procname	= "sched_deadline_period_min_us",
> +		.data		= &sysctl_sched_dl_period_min,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +	{
>  		.procname	= "sched_rr_timeslice_ms",
>  		.data		= &sysctl_sched_rr_timeslice,
>  		.maxlen		= sizeof(int),

I think this never made it upstream. And I believe both Daniel and I
were OK with it. Do you recall if any additional change was needed?

Thanks,

Juri


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period
  2020-05-20 18:38   ` [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period Juri Lelli
@ 2020-05-21 13:45     ` Daniel Bristot de Oliveira
  0 siblings, 0 replies; 63+ messages in thread
From: Daniel Bristot de Oliveira @ 2020-05-21 13:45 UTC (permalink / raw)
  To: Juri Lelli, Peter Zijlstra
  Cc: mingo, linux-kernel, dietmar.eggemann, luca.abeni, balsini,
	dvyukov, tglx, vpillai, rostedt

On 5/20/20 8:38 PM, Juri Lelli wrote:
> Hi Peter,
> 
> On 26/07/19 16:54, Peter Zijlstra wrote:
>> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
>> Cc: Luca Abeni <luca.abeni@santannapisa.it>
>> Cc: Juri Lelli <juri.lelli@redhat.com>
>> Cc: Dmitry Vyukov <dvyukov@google.com>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> ---
>>  include/linux/sched/sysctl.h |    3 +++
>>  kernel/sched/deadline.c      |   23 +++++++++++++++++++++--
>>  kernel/sysctl.c              |   14 ++++++++++++++
>>  3 files changed, 38 insertions(+), 2 deletions(-)
>>
>> --- a/include/linux/sched/sysctl.h
>> +++ b/include/linux/sched/sysctl.h
>> @@ -56,6 +56,9 @@ int sched_proc_update_handler(struct ctl
>>  extern unsigned int sysctl_sched_rt_period;
>>  extern int sysctl_sched_rt_runtime;
>>  
>> +extern unsigned int sysctl_sched_dl_period_max;
>> +extern unsigned int sysctl_sched_dl_period_min;
>> +
>>  #ifdef CONFIG_UCLAMP_TASK
>>  extern unsigned int sysctl_sched_uclamp_util_min;
>>  extern unsigned int sysctl_sched_uclamp_util_max;
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -2597,6 +2597,14 @@ void __getparam_dl(struct task_struct *p
>>  }
>>  
>>  /*
>> + * Default limits for DL period; on the top end we guard against small util
>> + * tasks still getting ridiculous long effective runtimes, on the bottom end we
>> + * guard against timer DoS.
>> + */
>> +unsigned int sysctl_sched_dl_period_max = 1 << 22; /* ~4 seconds */
>> +unsigned int sysctl_sched_dl_period_min = 100;     /* 100 us */
>> +
>> +/*
>>   * This function validates the new parameters of a -deadline task.
>>   * We ask for the deadline not being zero, and greater or equal
>>   * than the runtime, as well as the period of being zero or
>> @@ -2608,6 +2616,8 @@ void __getparam_dl(struct task_struct *p
>>   */
>>  bool __checkparam_dl(const struct sched_attr *attr)
>>  {
>> +	u64 period, max, min;
>> +
>>  	/* special dl tasks don't actually use any parameter */
>>  	if (attr->sched_flags & SCHED_FLAG_SUGOV)
>>  		return true;
>> @@ -2631,12 +2641,21 @@ bool __checkparam_dl(const struct sched_
>>  	    attr->sched_period & (1ULL << 63))
>>  		return false;
>>  
>> +	period = attr->sched_period;
>> +	if (!period)
>> +		period = attr->sched_deadline;
>> +
>>  	/* runtime <= deadline <= period (if period != 0) */
>> -	if ((attr->sched_period != 0 &&
>> -	     attr->sched_period < attr->sched_deadline) ||
>> +	if (period < attr->sched_deadline ||
>>  	    attr->sched_deadline < attr->sched_runtime)
>>  		return false;
>>  
>> +	max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC;
>> +	min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC;
>> +
>> +	if (period < min || period > max)
>> +		return false;
>> +
>>  	return true;
>>  }
>>  
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -443,6 +443,20 @@ static struct ctl_table kern_table[] = {
>>  		.proc_handler	= sched_rt_handler,
>>  	},
>>  	{
>> +		.procname	= "sched_deadline_period_max_us",
>> +		.data		= &sysctl_sched_dl_period_max,
>> +		.maxlen		= sizeof(unsigned int),
>> +		.mode		= 0644,
>> +		.proc_handler	= proc_dointvec,
>> +	},
>> +	{
>> +		.procname	= "sched_deadline_period_min_us",
>> +		.data		= &sysctl_sched_dl_period_min,
>> +		.maxlen		= sizeof(unsigned int),
>> +		.mode		= 0644,
>> +		.proc_handler	= proc_dointvec,
>> +	},
>> +	{
>>  		.procname	= "sched_rr_timeslice_ms",
>>  		.data		= &sysctl_sched_rr_timeslice,
>>  		.maxlen		= sizeof(int),
> I think this never made it upstream. And I believe both Daniel and I
> were OK with it. Do you recall if any additional change was needed?

Just to confirm, yes I am OK with it!

-- Daniel


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [tip: sched/core] sched/deadline: Impose global limits on sched_attr::sched_period
  2019-07-26 14:54 ` [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period Peter Zijlstra
                     ` (2 preceding siblings ...)
  2020-05-20 18:38   ` [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period Juri Lelli
@ 2020-06-16 12:21   ` tip-bot2 for Peter Zijlstra
  3 siblings, 0 replies; 63+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-06-16 12:21 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     b4098bfc5efb1fd7ecf40165132a1283aeea3500
Gitweb:        https://git.kernel.org/tip/b4098bfc5efb1fd7ecf40165132a1283aeea3500
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 26 Jul 2019 16:54:10 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 15 Jun 2020 14:10:04 +02:00

sched/deadline: Impose global limits on sched_attr::sched_period

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726161357.397880775@infradead.org
---
 include/linux/sched/sysctl.h |  3 +++
 kernel/sched/deadline.c      | 23 +++++++++++++++++++++--
 kernel/sysctl.c              | 14 ++++++++++++++
 3 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 660ac49..24be30a 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -61,6 +61,9 @@ int sched_proc_update_handler(struct ctl_table *table, int write,
 extern unsigned int sysctl_sched_rt_period;
 extern int sysctl_sched_rt_runtime;
 
+extern unsigned int sysctl_sched_dl_period_max;
+extern unsigned int sysctl_sched_dl_period_min;
+
 #ifdef CONFIG_UCLAMP_TASK
 extern unsigned int sysctl_sched_uclamp_util_min;
 extern unsigned int sysctl_sched_uclamp_util_max;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 504d2f5..f31964a 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2635,6 +2635,14 @@ void __getparam_dl(struct task_struct *p, struct sched_attr *attr)
 }
 
 /*
+ * Default limits for DL period; on the top end we guard against small util
+ * tasks still getting ridiculously long effective runtimes, on the bottom end we
+ * guard against timer DoS.
+ */
+unsigned int sysctl_sched_dl_period_max = 1 << 22; /* ~4 seconds */
+unsigned int sysctl_sched_dl_period_min = 100;     /* 100 us */
+
+/*
  * This function validates the new parameters of a -deadline task.
  * We ask for the deadline not being zero, and greater or equal
  * than the runtime, as well as the period of being zero or
@@ -2646,6 +2654,8 @@ void __getparam_dl(struct task_struct *p, struct sched_attr *attr)
  */
 bool __checkparam_dl(const struct sched_attr *attr)
 {
+	u64 period, max, min;
+
 	/* special dl tasks don't actually use any parameter */
 	if (attr->sched_flags & SCHED_FLAG_SUGOV)
 		return true;
@@ -2669,12 +2679,21 @@ bool __checkparam_dl(const struct sched_attr *attr)
 	    attr->sched_period & (1ULL << 63))
 		return false;
 
+	period = attr->sched_period;
+	if (!period)
+		period = attr->sched_deadline;
+
 	/* runtime <= deadline <= period (if period != 0) */
-	if ((attr->sched_period != 0 &&
-	     attr->sched_period < attr->sched_deadline) ||
+	if (period < attr->sched_deadline ||
 	    attr->sched_deadline < attr->sched_runtime)
 		return false;
 
+	max = (u64)READ_ONCE(sysctl_sched_dl_period_max) * NSEC_PER_USEC;
+	min = (u64)READ_ONCE(sysctl_sched_dl_period_min) * NSEC_PER_USEC;
+
+	if (period < min || period > max)
+		return false;
+
 	return true;
 }
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index db1ce7a..4aea67d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1780,6 +1780,20 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= sched_rt_handler,
 	},
 	{
+		.procname	= "sched_deadline_period_max_us",
+		.data		= &sysctl_sched_dl_period_max,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "sched_deadline_period_min_us",
+		.data		= &sysctl_sched_dl_period_min,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "sched_rr_timeslice_ms",
 		.data		= &sysctl_sched_rr_timeslice,
 		.maxlen		= sizeof(int),
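
Since both entries are added to kern_table, they appear (with the usual procfs
mount) as /proc/sys/kernel/sched_deadline_period_min_us and
/proc/sys/kernel/sched_deadline_period_max_us, in microseconds. A minimal
sketch of raising the ceiling, illustrative only and assuming root; the
10-second value is an arbitrary example:

/*
 * Illustrative sketch, not part of the commit: raise the global period
 * ceiling to 10 seconds.  Values are in microseconds, matching the
 * "_us" suffix of the sysctl names; __checkparam_dl() converts them to
 * nanoseconds before comparing.  Assumes procfs at /proc and root.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/sched_deadline_period_max_us", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	fprintf(f, "%u\n", 10U * 1000 * 1000);	/* 10 s, expressed in us */
	return fclose(f) ? 1 : 0;
}

Note that proc_dointvec performs no range checking, so the write itself is not
validated; the new window only matters for subsequent sched_setattr() calls,
which read the limits with READ_ONCE() at admission time.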

^ permalink raw reply related	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2020-06-16 12:22 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-26 14:54 [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure Peter Zijlstra
2019-07-26 14:54 ` [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period Peter Zijlstra
2019-07-29  8:57   ` Juri Lelli
2019-07-29 11:45     ` Daniel Bristot de Oliveira
2019-08-02 17:21   ` Alessio Balsini
2019-08-05 11:53     ` Peter Zijlstra
2019-08-22 12:29       ` Alessio Balsini
2019-08-22 16:51         ` Peter Zijlstra
2019-08-31 14:41           ` Alessio Balsini
2019-09-02  9:16             ` Peter Zijlstra
2019-09-02 12:31               ` Peter Zijlstra
2019-09-04 10:16               ` Steven Rostedt
2019-09-04 11:30                 ` Peter Zijlstra
2019-09-04 13:24                   ` Joel Fernandes
2019-09-04 14:11                     ` Will Deacon
2019-09-04 14:35                       ` Joel Fernandes
2019-09-04 15:52                     ` Peter Zijlstra
2019-10-23 17:17       ` [PATCH 4.4 4.9 4.14] loop: Add LOOP_SET_DIRECT_IO to compat ioctl Alessio Balsini
2019-10-23 17:22         ` Alessio Balsini
2019-10-25  0:17         ` Sasha Levin
2020-05-20 18:38   ` [RFC][PATCH 01/13] sched/deadline: Impose global limits on sched_attr::sched_period Juri Lelli
2020-05-21 13:45     ` Daniel Bristot de Oliveira
2020-06-16 12:21   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2019-07-26 14:54 ` [RFC][PATCH 02/13] stop_machine: Fix stop_cpus_in_progress ordering Peter Zijlstra
2019-07-30 13:16   ` Phil Auld
2019-07-30 13:22   ` Steven Rostedt
2019-07-26 14:54 ` [RFC][PATCH 03/13] sched: Fix kerneldoc comment for ia64_set_curr_task Peter Zijlstra
2019-07-26 14:54 ` [RFC][PATCH 04/13] sched/{rt,deadline}: Fix set_next_task vs pick_next_task Peter Zijlstra
2019-07-29  9:25   ` Juri Lelli
2019-07-29 11:15     ` Peter Zijlstra
2019-07-29 11:27       ` Juri Lelli
2019-07-29 13:04         ` Peter Zijlstra
2019-07-29 13:17           ` Juri Lelli
2019-07-29 14:40             ` Peter Zijlstra
2019-07-26 14:54 ` [RFC][PATCH 05/13] sched: Add task_struct pointer to sched_class::set_curr_task Peter Zijlstra
2019-07-26 14:54 ` [RFC][PATCH 06/13] sched/fair: Export newidle_balance() Peter Zijlstra
2019-07-26 14:54 ` [RFC][PATCH 07/13] sched: Allow put_prev_task() to drop rq->lock Peter Zijlstra
2019-07-26 14:54 ` [RFC][PATCH 08/13] sched: Rework pick_next_task() slow-path Peter Zijlstra
2019-07-26 14:54 ` [RFC][PATCH 09/13] sched: Unify runtime accounting across classes Peter Zijlstra
2019-07-26 14:54 ` [RFC][PATCH 10/13] sched/deadline: Collect sched_dl_entity initialization Peter Zijlstra
2019-07-26 14:54 ` [RFC][PATCH 11/13] sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity Peter Zijlstra
2019-07-26 14:54 ` [RFC][PATCH 12/13] sched/deadline: Introduce deadline servers Peter Zijlstra
2019-08-07 16:31   ` Dietmar Eggemann
2019-08-08  6:52     ` Juri Lelli
2019-08-08  7:52       ` Dietmar Eggemann
2019-08-08  7:56     ` Peter Zijlstra
2019-08-08  8:11       ` Dietmar Eggemann
2019-08-08  8:46         ` Juri Lelli
2019-08-08  8:57           ` Dietmar Eggemann
2019-08-08  9:27             ` Juri Lelli
2019-08-08  9:45               ` Juri Lelli
2019-08-30 11:24                 ` Peter Zijlstra
2019-09-06  9:36                   ` Juri Lelli
2019-08-08 10:31           ` Peter Zijlstra
2019-08-09  7:13             ` Juri Lelli
2019-08-08  6:59   ` Juri Lelli
2019-08-09  9:17   ` Dietmar Eggemann
2019-08-09 12:16     ` Juri Lelli
2019-07-26 14:54 ` [RFC][PATCH 13/13] sched/fair: Add trivial fair server Peter Zijlstra
2019-07-26 20:01 ` [RFC][PATCH 00/13] SCHED_DEADLINE server infrastructure luca abeni
2019-09-03 14:27 ` Alessio Balsini
2019-09-04 10:50   ` Juri Lelli
2019-09-04 11:32     ` Peter Zijlstra
