* [PATCH v7 0/3] sched/uclamp: new sysctl for default RT boost value
@ 2020-07-16 11:03 Qais Yousef
  2020-07-16 11:03 ` [PATCH v7 1/3] sched/uclamp: Add a new sysctl to control RT default " Qais Yousef
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Qais Yousef @ 2020-07-16 11:03 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Doug Anderson, Qais Yousef, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

Changes in v7:

	* Rebase on top of tip/sched/core
	* Hold task_rq_lock() instead of using RCU.
	* Better document that changes to p->uclamp_* require task_rq_lock()
	* Remove smp_{r,w}mb()
	* Hold the tasklist_lock with smp_mb__after_spinlock()
	* Add patch 3 which addresses a splat I've seen while testing.
	  static_branch_enable() in __setscheduler_uclamp() was causing it.
	  Move the call outside of the critical section to fix it.


*** v6 cover-letter ***

This series introduces a new sysctl_sched_uclamp_util_min_rt_default to control
at runtime the default boost value of RT tasks.

Full rationale is in patch 1 commit message.

v6 changed the approach taken in v5 [1] and earlier, moving away from the lazy
update approach that touched the fast path to a synchronous update performed
when the procfs entry is written.

for_each_process_thread() is now used to update all existing RT tasks. To
handle the race with a concurrent fork(), we introduce sched_post_fork() in
_do_fork() to ensure a concurrently forked RT task gets the right update.

To ensure the race condition is handled correctly, I wrote this small (simple!)
test program:

	https://github.com/qais-yousef/uclamp_test.git

And ran it on a 4-core x86 system and an 8-core big.LITTLE Juno-r2 system.

From the Juno-r2 runs, 10 iterations each:

Without sched_post_fork()

	# ./run.sh
	pid 3105 has 336 but default should be 337
	pid 13162 has 336 but default should be 337
	pid 23256 has 338 but default should be 339
	All forked RT tasks had the correct uclamp.min
	pid 10638 has 334 but default should be 335
	All forked RT tasks had the correct uclamp.min
	pid 30683 has 335 but default should be 336
	pid 8247 has 336 but default should be 337
	pid 18170 has 1024 but default should be 334
	pid 28274 has 336 but default should be 337

With sched_post_fork()

	# ./run.sh
	All forked RT tasks had the correct uclamp.min
	All forked RT tasks had the correct uclamp.min
	All forked RT tasks had the correct uclamp.min
	All forked RT tasks had the correct uclamp.min
	All forked RT tasks had the correct uclamp.min
	All forked RT tasks had the correct uclamp.min
	All forked RT tasks had the correct uclamp.min
	All forked RT tasks had the correct uclamp.min
	All forked RT tasks had the correct uclamp.min
	All forked RT tasks had the correct uclamp.min

Thanks

--
Qais Yousef

[1] https://lore.kernel.org/lkml/20200511154053.7822-1-qais.yousef@arm.com/

CC: Jonathan Corbet <corbet@lwn.net>
CC: Juri Lelli <juri.lelli@redhat.com>
CC: Vincent Guittot <vincent.guittot@linaro.org>
CC: Dietmar Eggemann <dietmar.eggemann@arm.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ben Segall <bsegall@google.com>
CC: Mel Gorman <mgorman@suse.de>
CC: Luis Chamberlain <mcgrof@kernel.org>
CC: Kees Cook <keescook@chromium.org>
CC: Iurii Zaikin <yzaikin@google.com>
CC: Quentin Perret <qperret@google.com>
CC: Valentin Schneider <valentin.schneider@arm.com>
CC: Patrick Bellasi <patrick.bellasi@matbug.net>
CC: Pavan Kondeti <pkondeti@codeaurora.org>
CC: linux-doc@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux-fsdevel@vger.kernel.org


Qais Yousef (3):
  sched/uclamp: Add a new sysctl to control RT default boost value
  Documentation/sysctl: Document uclamp sysctl knobs
  sched/uclamp: Fix a deadlock when enabling uclamp static key

 Documentation/admin-guide/sysctl/kernel.rst |  54 +++++++
 include/linux/sched.h                       |  10 +-
 include/linux/sched/sysctl.h                |   1 +
 include/linux/sched/task.h                  |   1 +
 kernel/fork.c                               |   1 +
 kernel/sched/core.c                         | 149 ++++++++++++++++++--
 kernel/sysctl.c                             |   7 +
 7 files changed, 208 insertions(+), 15 deletions(-)

-- 
2.17.1



* [PATCH v7 1/3] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-07-16 11:03 [PATCH v7 0/3] sched/uclamp: new sysctl for default RT boost value Qais Yousef
@ 2020-07-16 11:03 ` Qais Yousef
  2020-07-24  8:54   ` Peter Zijlstra
  2020-07-29 12:02   ` [tip: sched/core] " tip-bot2 for Qais Yousef
  2020-07-16 11:03 ` [PATCH v7 2/3] Documentation/sysctl: Document uclamp sysctl knobs Qais Yousef
  2020-07-16 11:03 ` [PATCH v7 3/3] sched/uclamp: Fix a deadlock when enabling uclamp static key Qais Yousef
  2 siblings, 2 replies; 15+ messages in thread
From: Qais Yousef @ 2020-07-16 11:03 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Doug Anderson, Qais Yousef, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

RT tasks by default run at the highest capacity/performance level. When
uclamp is selected this default behavior is retained by enforcing the
requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
value.

This is also referred to as 'the default boost value of RT tasks'.

See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").

On battery powered devices, it is desired to control this default
(currently hardcoded) behavior at runtime to reduce energy consumed by
RT tasks.

For example, for a mobile device manufacturer where the big.LITTLE
architecture is dominant, the performance of the little cores varies across
SoCs, and on high end ones the big cores could be too power hungry.

Given the diversity of SoCs, the new knob allows manufacturers to tune the
best performance/power trade-off for RT tasks on the particular hardware they
run on.

They could opt to further tune the value when the user selects
a different power saving mode or when the device is actively charging.

The runtime aspect of it further helps in creating a single kernel image
that can be run on multiple devices that require different tuning.

Keep in mind that a lot of RT tasks in the system are created by the
kernel. On Android for instance I can see over 50 RT tasks, only
a handful of which are created by the Android framework.

To let system admins and device integrators control the default behavior
globally, introduce the new sysctl_sched_uclamp_util_min_rt_default to change
the default boost value of RT tasks.

I anticipate this to be mostly in the form of modifying the init script
of a particular device.
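As a rough sketch of that kind of init-script tuning, a boot-time fragment
could look like the following. The knob path comes from the kernel/sysctl.c
hunk in this patch; the value 128 is an arbitrary illustration, not a
recommendation, and the write requires root on a kernel built with
CONFIG_UCLAMP_TASK:

```shell
# Hypothetical init-script fragment: lower the default RT boost at boot.
UCLAMP_RT_KNOB=/proc/sys/kernel/sched_util_clamp_min_rt_default

if [ -w "$UCLAMP_RT_KNOB" ]; then
	# The write synchronously updates every existing RT task; a
	# concurrently forked task is covered by sched_post_fork().
	echo 128 > "$UCLAMP_RT_KNOB"
fi
```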

To avoid polluting the fast path with unnecessary code, the approach taken is
to do the update synchronously by traversing all the existing tasks in the
system. This could race with a concurrent fork(), which is dealt with by
introducing a sched_post_fork() function that ensures the racing fork gets the
right update applied.

Tested on Juno-r2 in combination with the RT capacity awareness [1].
By default an RT task will go to the highest capacity CPU and run at the
maximum frequency, which is particularly energy inefficient on high end
mobile devices because the biggest core[s] are 'huge' and power hungry.

With this patch the RT task can be controlled to run anywhere by
default, and doesn't cause the frequency to be maximum all the time.
Yet any task that really needs to be boosted can easily escape this
default behavior by modifying its requested uclamp.min value
(p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.

[1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
CC: Jonathan Corbet <corbet@lwn.net>
CC: Juri Lelli <juri.lelli@redhat.com>
CC: Vincent Guittot <vincent.guittot@linaro.org>
CC: Dietmar Eggemann <dietmar.eggemann@arm.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ben Segall <bsegall@google.com>
CC: Mel Gorman <mgorman@suse.de>
CC: Luis Chamberlain <mcgrof@kernel.org>
CC: Kees Cook <keescook@chromium.org>
CC: Iurii Zaikin <yzaikin@google.com>
CC: Quentin Perret <qperret@google.com>
CC: Valentin Schneider <valentin.schneider@arm.com>
CC: Patrick Bellasi <patrick.bellasi@matbug.net>
CC: Pavan Kondeti <pkondeti@codeaurora.org>
CC: linux-doc@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux-fsdevel@vger.kernel.org
---
 include/linux/sched.h        |  10 ++-
 include/linux/sched/sysctl.h |   1 +
 include/linux/sched/task.h   |   1 +
 kernel/fork.c                |   1 +
 kernel/sched/core.c          | 119 +++++++++++++++++++++++++++++++++--
 kernel/sysctl.c              |   7 +++
 6 files changed, 131 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 683372943093..86d90bc40e0b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -686,9 +686,15 @@ struct task_struct {
 	struct sched_dl_entity		dl;
 
 #ifdef CONFIG_UCLAMP_TASK
-	/* Clamp values requested for a scheduling entity */
+	/*
+	 * Clamp values requested for a scheduling entity.
+	 * Must be updated with task_rq_lock() held.
+	 */
 	struct uclamp_se		uclamp_req[UCLAMP_CNT];
-	/* Effective clamp values used for a scheduling entity */
+	/*
+	 * Effective clamp values used for a scheduling entity.
+	 * Must be updated with task_rq_lock() held.
+	 */
 	struct uclamp_se		uclamp[UCLAMP_CNT];
 #endif
 
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 24be30a40814..3c31ba88aca5 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -67,6 +67,7 @@ extern unsigned int sysctl_sched_dl_period_min;
 #ifdef CONFIG_UCLAMP_TASK
 extern unsigned int sysctl_sched_uclamp_util_min;
 extern unsigned int sysctl_sched_uclamp_util_max;
+extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
 #endif
 
 #ifdef CONFIG_CFS_BANDWIDTH
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 38359071236a..e7ddab095baf 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -55,6 +55,7 @@ extern asmlinkage void schedule_tail(struct task_struct *prev);
 extern void init_idle(struct task_struct *idle, int cpu);
 
 extern int sched_fork(unsigned long clone_flags, struct task_struct *p);
+extern void sched_post_fork(struct task_struct *p);
 extern void sched_dead(struct task_struct *p);
 
 void __noreturn do_task_dead(void);
diff --git a/kernel/fork.c b/kernel/fork.c
index efc5493203ae..e75c2e41f3d1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2304,6 +2304,7 @@ static __latent_entropy struct task_struct *copy_process(
 	write_unlock_irq(&tasklist_lock);
 
 	proc_fork_connector(p);
+	sched_post_fork(p);
 	cgroup_post_fork(p, args);
 	perf_event_fork(p);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 31c07e6c00b1..e1578c3ad40c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -793,6 +793,23 @@ unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
 /* Max allowed maximum utilization */
 unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
 
+/*
+ * By default RT tasks run at the maximum performance point/capacity of the
+ * system. Uclamp enforces this by always setting UCLAMP_MIN of RT tasks to
+ * SCHED_CAPACITY_SCALE.
+ *
+ * This knob allows admins to change the default behavior when uclamp is being
+ * used. In battery powered devices, particularly, running at the maximum
+ * capacity and frequency will increase energy consumption and shorten the
+ * battery life.
+ *
+ * This knob only affects RT tasks whose uclamp_se->user_defined == false.
+ *
+ * This knob will not override the system default sched_util_clamp_min defined
+ * above.
+ */
+unsigned int sysctl_sched_uclamp_util_min_rt_default = SCHED_CAPACITY_SCALE;
+
 /* All clamps are required to be less or equal than these values */
 static struct uclamp_se uclamp_default[UCLAMP_CNT];
 
@@ -895,6 +912,64 @@ unsigned int uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
 	return uclamp_idle_value(rq, clamp_id, clamp_value);
 }
 
+static void __uclamp_sync_util_min_rt_default_locked(struct task_struct *p)
+{
+	unsigned int default_util_min;
+	struct uclamp_se *uc_se;
+
+	lockdep_assert_held(&p->pi_lock);
+
+	uc_se = &p->uclamp_req[UCLAMP_MIN];
+
+	/* Only sync if user didn't override the default */
+	if (uc_se->user_defined)
+		return;
+
+	default_util_min = sysctl_sched_uclamp_util_min_rt_default;
+	uclamp_se_set(uc_se, default_util_min, false);
+}
+
+static void __uclamp_sync_util_min_rt_default(struct task_struct *p)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+
+	if (!rt_task(p))
+		return;
+
+	/* Protect updates to p->uclamp_* */
+	rq = task_rq_lock(p, &rf);
+	__uclamp_sync_util_min_rt_default_locked(p);
+	task_rq_unlock(rq, p, &rf);
+}
+
+static void uclamp_sync_util_min_rt_default(void)
+{
+	struct task_struct *g, *p;
+
+	/*
+	 * copy_process()			sysctl_uclamp
+	 *					  uclamp_min_rt = X;
+	 *   write_lock(&tasklist_lock)		  read_lock(&tasklist_lock)
+	 *   // link thread			  smp_mb__after_spinlock()
+	 *   write_unlock(&tasklist_lock)	  read_unlock(&tasklist_lock);
+	 *   sched_post_fork()			  for_each_process_thread()
+	 *     __uclamp_sync_rt()		    __uclamp_sync_rt()
+	 *
+	 * Ensures that either sched_post_fork() will observe the new
+	 * uclamp_min_rt or for_each_process_thread() will observe the new
+	 * task.
+	 */
+	read_lock(&tasklist_lock);
+	smp_mb__after_spinlock();
+	read_unlock(&tasklist_lock);
+
+	rcu_read_lock();
+	for_each_process_thread(g, p)
+		__uclamp_sync_util_min_rt_default(p);
+	rcu_read_unlock();
+}
+
 static inline struct uclamp_se
 uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
 {
@@ -1193,12 +1268,13 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 				void *buffer, size_t *lenp, loff_t *ppos)
 {
 	bool update_root_tg = false;
-	int old_min, old_max;
+	int old_min, old_max, old_min_rt;
 	int result;
 
 	mutex_lock(&uclamp_mutex);
 	old_min = sysctl_sched_uclamp_util_min;
 	old_max = sysctl_sched_uclamp_util_max;
+	old_min_rt = sysctl_sched_uclamp_util_min_rt_default;
 
 	result = proc_dointvec(table, write, buffer, lenp, ppos);
 	if (result)
@@ -1207,7 +1283,9 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 		goto done;
 
 	if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
-	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
+	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE	||
+	    sysctl_sched_uclamp_util_min_rt_default > SCHED_CAPACITY_SCALE) {
+
 		result = -EINVAL;
 		goto undo;
 	}
@@ -1228,6 +1306,11 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 		uclamp_update_root_tg();
 	}
 
+	if (old_min_rt != sysctl_sched_uclamp_util_min_rt_default) {
+		static_branch_enable(&sched_uclamp_used);
+		uclamp_sync_util_min_rt_default();
+	}
+
 	/*
 	 * We update all RUNNABLE tasks only when task groups are in use.
 	 * Otherwise, keep it simple and do just a lazy update at each next
@@ -1239,6 +1322,7 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 undo:
 	sysctl_sched_uclamp_util_min = old_min;
 	sysctl_sched_uclamp_util_max = old_max;
+	sysctl_sched_uclamp_util_min_rt_default = old_min_rt;
 done:
 	mutex_unlock(&uclamp_mutex);
 
@@ -1275,17 +1359,20 @@ static void __setscheduler_uclamp(struct task_struct *p,
 	 */
 	for_each_clamp_id(clamp_id) {
 		struct uclamp_se *uc_se = &p->uclamp_req[clamp_id];
-		unsigned int clamp_value = uclamp_none(clamp_id);
 
 		/* Keep using defined clamps across class changes */
 		if (uc_se->user_defined)
 			continue;
 
-		/* By default, RT tasks always get 100% boost */
+		/*
+		 * RT tasks by default have a 100% boost value that can be
+		 * modified at runtime.
+		 */
 		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
-			clamp_value = uclamp_none(UCLAMP_MAX);
+			__uclamp_sync_util_min_rt_default_locked(p);
+		else
+			uclamp_se_set(uc_se, uclamp_none(clamp_id), false);
 
-		uclamp_se_set(uc_se, clamp_value, false);
 	}
 
 	if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
@@ -1308,6 +1395,10 @@ static void uclamp_fork(struct task_struct *p)
 {
 	enum uclamp_id clamp_id;
 
+	/*
+	 * We don't need to hold task_rq_lock() when updating p->uclamp_* here
+	 * as the task is still at its early fork stages.
+	 */
 	for_each_clamp_id(clamp_id)
 		p->uclamp[clamp_id].active = false;
 
@@ -1320,6 +1411,11 @@ static void uclamp_fork(struct task_struct *p)
 	}
 }
 
+static void uclamp_post_fork(struct task_struct *p)
+{
+	__uclamp_sync_util_min_rt_default(p);
+}
+
 static void __init init_uclamp_rq(struct rq *rq)
 {
 	enum uclamp_id clamp_id;
@@ -1372,6 +1468,7 @@ static inline int uclamp_validate(struct task_struct *p,
 static void __setscheduler_uclamp(struct task_struct *p,
 				  const struct sched_attr *attr) { }
 static inline void uclamp_fork(struct task_struct *p) { }
+static inline void uclamp_post_fork(struct task_struct *p) { }
 static inline void init_uclamp(void) { }
 #endif /* CONFIG_UCLAMP_TASK */
 
@@ -3074,6 +3171,11 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	return 0;
 }
 
+void sched_post_fork(struct task_struct *p)
+{
+	uclamp_post_fork(p);
+}
+
 unsigned long to_ratio(u64 period, u64 runtime)
 {
 	if (runtime == RUNTIME_INF)
@@ -5597,6 +5699,11 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
 		kattr.sched_nice = task_nice(p);
 
 #ifdef CONFIG_UCLAMP_TASK
+	/*
+	 * This could race with another potential updater, but this is fine
+	 * because it'll correctly read the old or the new value. We don't need
+	 * to guarantee who wins the race as long as it doesn't return garbage.
+	 */
 	kattr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value;
 	kattr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value;
 #endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 4aea67d3d552..1b4d2dc270a5 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1815,6 +1815,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sysctl_sched_uclamp_handler,
 	},
+	{
+		.procname	= "sched_util_clamp_min_rt_default",
+		.data		= &sysctl_sched_uclamp_util_min_rt_default,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_sched_uclamp_handler,
+	},
 #endif
 #ifdef CONFIG_SCHED_AUTOGROUP
 	{
-- 
2.17.1



* [PATCH v7 2/3] Documentation/sysctl: Document uclamp sysctl knobs
  2020-07-16 11:03 [PATCH v7 0/3] sched/uclamp: new sysctl for default RT boost value Qais Yousef
  2020-07-16 11:03 ` [PATCH v7 1/3] sched/uclamp: Add a new sysctl to control RT default " Qais Yousef
@ 2020-07-16 11:03 ` Qais Yousef
  2020-07-29 12:02   ` [tip: sched/core] " tip-bot2 for Qais Yousef
  2020-07-16 11:03 ` [PATCH v7 3/3] sched/uclamp: Fix a deadlock when enabling uclamp static key Qais Yousef
  2 siblings, 1 reply; 15+ messages in thread
From: Qais Yousef @ 2020-07-16 11:03 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Doug Anderson, Qais Yousef, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

Uclamp exposes 3 sysctl knobs:

	* sched_util_clamp_min
	* sched_util_clamp_max
	* sched_util_clamp_min_rt_default

Document them in sysctl/kernel.rst.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
CC: Jonathan Corbet <corbet@lwn.net>
CC: Juri Lelli <juri.lelli@redhat.com>
CC: Vincent Guittot <vincent.guittot@linaro.org>
CC: Dietmar Eggemann <dietmar.eggemann@arm.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ben Segall <bsegall@google.com>
CC: Mel Gorman <mgorman@suse.de>
CC: Luis Chamberlain <mcgrof@kernel.org>
CC: Kees Cook <keescook@chromium.org>
CC: Iurii Zaikin <yzaikin@google.com>
CC: Quentin Perret <qperret@google.com>
CC: Valentin Schneider <valentin.schneider@arm.com>
CC: Patrick Bellasi <patrick.bellasi@matbug.net>
CC: Pavan Kondeti <pkondeti@codeaurora.org>
CC: linux-doc@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux-fsdevel@vger.kernel.org
---
 Documentation/admin-guide/sysctl/kernel.rst | 54 +++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 83acf5025488..55bf6b4de4ec 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1062,6 +1062,60 @@ Enables/disables scheduler statistics. Enabling this feature
 incurs a small amount of overhead in the scheduler but is
 useful for debugging and performance tuning.
 
+sched_util_clamp_min:
+=====================
+
+Max allowed *minimum* utilization.
+
+Default value is 1024, which is the maximum possible value.
+
+It means that any requested uclamp.min value cannot be greater than
+sched_util_clamp_min, i.e., it is restricted to the range
+[0:sched_util_clamp_min].
+
+sched_util_clamp_max:
+=====================
+
+Max allowed *maximum* utilization.
+
+Default value is 1024, which is the maximum possible value.
+
+It means that any requested uclamp.max value cannot be greater than
+sched_util_clamp_max, i.e., it is restricted to the range
+[0:sched_util_clamp_max].
+
+sched_util_clamp_min_rt_default:
+================================
+
+By default Linux is tuned for performance, which means that RT tasks always
+run at the highest frequency and on the most capable (highest capacity) CPU
+(in heterogeneous systems).
+
+Uclamp achieves this by setting the requested uclamp.min of all RT tasks to
+1024 by default, which effectively boosts the tasks to run at the highest
+frequency and biases them to run on the biggest CPU.
+
+This knob allows admins to change the default behavior when uclamp is being
+used. In battery powered devices particularly, running at the maximum
+capacity and frequency will increase energy consumption and shorten the battery
+life.
+
+This knob is only effective for RT tasks for which the user hasn't modified
+the requested uclamp.min value via the sched_setattr() syscall.
+
+This knob will not escape the range constraint imposed by sched_util_clamp_min
+defined above.
+
+For example if
+
+	sched_util_clamp_min_rt_default = 800
+	sched_util_clamp_min = 600
+
+Then the boost will be clamped to 600 because 800 is outside of the permissible
+range of [0:600]. This could happen for instance if a powersave mode
+temporarily restricts all boosts by modifying sched_util_clamp_min. As soon as
+this restriction is lifted, the requested sched_util_clamp_min_rt_default
+will take effect.
 
 seccomp
 =======
-- 
2.17.1
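The clamping interaction documented above can be modeled in a couple of lines.
This is only a sketch of the documented behavior, not kernel code, and the
function name is made up for illustration:

```python
def effective_rt_boost(rt_default: int, util_clamp_min: int) -> int:
    """Model of the documented rule: the requested RT default boost is
    restricted to the range [0:sched_util_clamp_min]."""
    return min(rt_default, util_clamp_min)


# The example from the documentation: 800 is requested, but the system-wide
# restriction is 600, so the effective boost is 600. Once the restriction is
# lifted (back to 1024), the full 800 takes effect.
print(effective_rt_boost(800, 600))   # 600
print(effective_rt_boost(800, 1024))  # 800
```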



* [PATCH v7 3/3] sched/uclamp: Fix a deadlock when enabling uclamp static key
  2020-07-16 11:03 [PATCH v7 0/3] sched/uclamp: new sysctl for default RT boost value Qais Yousef
  2020-07-16 11:03 ` [PATCH v7 1/3] sched/uclamp: Add a new sysctl to control RT default " Qais Yousef
  2020-07-16 11:03 ` [PATCH v7 2/3] Documentation/sysctl: Document uclamp sysctl knobs Qais Yousef
@ 2020-07-16 11:03 ` Qais Yousef
  2020-07-16 11:13   ` Qais Yousef
                     ` (2 more replies)
  2 siblings, 3 replies; 15+ messages in thread
From: Qais Yousef @ 2020-07-16 11:03 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Doug Anderson, Qais Yousef, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Patrick Bellasi, Chris Redpath, Lukasz Luba, linux-kernel

The following splat was caught when setting uclamp value of a task.

	BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49

	======================================================
	in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 731, name: l_3-1
	WARNING: possible circular locking dependency detected
	5.8.0-rc4-00040-g6345b3305877-dirty #864 Not tainted
	INFO: lockdep is turned off.
	------------------------------------------------------
	l_0-0/730 is trying to acquire lock:
	irq event stamp: 150
	ffff80001343dea0 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_enable+0x1c/0x38
	el0_svc_common.constprop.4+0xe4/0x200

	but task is already holding lock:
	ffff00097ef4ca58 (&rq->lock){-.-.}-{2:2}, at: task_rq_lock+0x60/0xf0
	_raw_spin_lock_irqsave+0x38/0xa8

	which lock already depends on the new lock.

	the existing dependency chain (in reverse order) is:
	copy_process+0x620/0x18f0

	-> #1 (&rq->lock){-.-.}-{2:2}:
	0x0
	       _raw_spin_lock+0x64/0x80
	       __schedule+0x108/0x910
	CPU: 5 PID: 731 Comm: l_3-1 Not tainted 5.8.0-rc4-00040-g6345b3305877-dirty #864
	       schedule+0x7c/0x108
	       schedule_timeout+0x2b0/0x448
	Hardware name: ARM Juno development board (r2) (DT)
	       wait_for_completion_killable+0xb8/0x1a8
	       __kthread_create_on_node+0xe0/0x1c0
	Call trace:
	       kthread_create_on_node+0x8c/0xb8
	       create_worker+0xd0/0x1b8
	 dump_backtrace+0x0/0x1f0
	       workqueue_prepare_cpu+0x5c/0xa0
	       cpuhp_invoke_callback+0xe8/0xe30
	 show_stack+0x2c/0x38
	       _cpu_up+0xf4/0x1c0
	       cpu_up+0xa0/0xc0
	 dump_stack+0xf0/0x170
	       bringup_nonboot_cpus+0x88/0xc0
	       smp_init+0x34/0x90
	 ___might_sleep+0x144/0x200
	       kernel_init_freeable+0x1b8/0x338
	       kernel_init+0x18/0x118
	 __might_sleep+0x54/0x88
	       ret_from_fork+0x10/0x18

	-> #0 (cpu_hotplug_lock){++++}-{0:0}:
	 cpus_read_lock+0x2c/0x130
	       __lock_acquire+0x11a0/0x1550
	       lock_acquire+0xf8/0x460
	 static_key_enable+0x1c/0x38
	       cpus_read_lock+0x68/0x130
	       static_key_enable+0x1c/0x38
	 __sched_setscheduler+0x900/0xad8
	       __sched_setscheduler+0x900/0xad8
	       __arm64_sys_sched_setattr+0x2e0/0x4f8
	 __arm64_sys_sched_setattr+0x2e0/0x4f8
	       el0_svc_common.constprop.4+0x84/0x200
	       do_el0_svc+0x34/0xa0
	 el0_svc_common.constprop.4+0x84/0x200
	       el0_sync_handler+0x16c/0x340
	       el0_sync+0x140/0x180
	 do_el0_svc+0x34/0xa0

	other info that might help us debug this:

	 Possible unsafe locking scenario:

	 el0_sync_handler+0x16c/0x340
	       CPU0                    CPU1
	       ----                    ----
	 el0_sync+0x140/0x180
	  lock(&rq->lock);
	                               lock(cpu_hotplug_lock);
	                               lock(&rq->lock);
	  lock(cpu_hotplug_lock);

	 *** DEADLOCK ***

	3 locks held by l_0-0/730:
	 #0: ffff80001345b4d0 (&cpuset_rwsem){++++}-{0:0}, at: __sched_setscheduler+0x4c0/0xad8
	 #1: ffff00096e83b718 (&p->pi_lock){-.-.}-{2:2}, at: task_rq_lock+0x44/0xf0
	 #2: ffff00097ef4ca58 (&rq->lock){-.-.}-{2:2}, at: task_rq_lock+0x60/0xf0

	stack backtrace:
	CPU: 1 PID: 730 Comm: l_0-0 Tainted: G        W         5.8.0-rc4-00040-g6345b3305877-dirty #864
	Hardware name: ARM Juno development board (r2) (DT)
	Call trace:
	 dump_backtrace+0x0/0x1f0
	 show_stack+0x2c/0x38
	 dump_stack+0xf0/0x170
	 print_circular_bug.isra.40+0x228/0x280
	 check_noncircular+0x14c/0x1b0
	 __lock_acquire+0x11a0/0x1550
	 lock_acquire+0xf8/0x460
	 cpus_read_lock+0x68/0x130
	 static_key_enable+0x1c/0x38
	 __sched_setscheduler+0x900/0xad8
	 __arm64_sys_sched_setattr+0x2e0/0x4f8
	 el0_svc_common.constprop.4+0x84/0x200
	 do_el0_svc+0x34/0xa0
	 el0_sync_handler+0x16c/0x340
	 el0_sync+0x140/0x180

Fix by ensuring we enable the key outside of the critical section in
__sched_setscheduler().

Fixes: 46609ce22703 ("sched/uclamp: Protect uclamp fast path code with static key")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
CC: Patrick Bellasi <patrick.bellasi@matbug.net>
Cc: Chris Redpath <chris.redpath@arm.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: linux-kernel@vger.kernel.org
---
 kernel/sched/core.c | 30 +++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e1578c3ad40c..947a1f4fa112 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1348,7 +1348,7 @@ static int uclamp_validate(struct task_struct *p,
 	return 0;
 }
 
-static void __setscheduler_uclamp(struct task_struct *p,
+static bool __setscheduler_uclamp(struct task_struct *p,
 				  const struct sched_attr *attr)
 {
 	enum uclamp_id clamp_id;
@@ -1376,9 +1376,7 @@ static void __setscheduler_uclamp(struct task_struct *p,
 	}
 
 	if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
-		return;
-
-	static_branch_enable(&sched_uclamp_used);
+		return false;
 
 	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
 		uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
@@ -1389,6 +1387,8 @@ static void __setscheduler_uclamp(struct task_struct *p,
 		uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
 			      attr->sched_util_max, true);
 	}
+
+	return true;
 }
 
 static void uclamp_fork(struct task_struct *p)
@@ -1465,8 +1465,11 @@ static inline int uclamp_validate(struct task_struct *p,
 {
 	return -EOPNOTSUPP;
 }
-static void __setscheduler_uclamp(struct task_struct *p,
-				  const struct sched_attr *attr) { }
+static bool __setscheduler_uclamp(struct task_struct *p,
+				  const struct sched_attr *attr)
+{
+	return false;
+}
 static inline void uclamp_fork(struct task_struct *p) { }
 static inline void uclamp_post_fork(struct task_struct *p) { }
 static inline void init_uclamp(void) { }
@@ -5305,7 +5308,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	prev_class = p->sched_class;
 
 	__setscheduler(rq, p, attr, pi);
-	__setscheduler_uclamp(p, attr);
+
+	retval = __setscheduler_uclamp(p, attr);
 
 	if (queued) {
 		/*
@@ -5335,6 +5339,18 @@ static int __sched_setscheduler(struct task_struct *p,
 	balance_callback(rq);
 	preempt_enable();
 
+#ifdef CONFIG_UCLAMP_TASK
+	/*
+	 * Enable uclamp static key outside the critical section.
+	 * static_branch_enable() will hold cpu_hotplug_lock; if done from
+	 * critical section above which holds other locks (rq->lock namely),
+	 * it could lead to deadlock scenarios as both are popular locks and
+	 * could be acquired from different paths in different orders.
+	 */
+	if (retval)
+		static_branch_enable(&sched_uclamp_used);
+#endif
+
 	return 0;
 
 unlock:
-- 
2.17.1



* Re: [PATCH v7 3/3] sched/uclamp: Fix a deadlock when enabling uclamp static key
  2020-07-16 11:03 ` [PATCH v7 3/3] sched/uclamp: Fix a deadlock when enabling uclamp static key Qais Yousef
@ 2020-07-16 11:13   ` Qais Yousef
  2020-07-23 15:51     ` Qais Yousef
  2020-07-24  9:12   ` Peter Zijlstra
  2020-07-29 12:02   ` [tip: sched/core] " tip-bot2 for Qais Yousef
  2 siblings, 1 reply; 15+ messages in thread
From: Qais Yousef @ 2020-07-16 11:13 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Doug Anderson, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Patrick Bellasi,
	Chris Redpath, Lukasz Luba, linux-kernel

On 07/16/20 12:03, Qais Yousef wrote:

[...]

> Fix by ensuring we enable the key outside of the critical section in
> __sched_setscheduler()
> 
> Fixes: 46609ce22703 ("sched/uclamp: Protect uclamp fast path code with static key")
> Signed-off-by: Qais Yousef <qais.yousef@arm.com>

I am assuming this Fixes tag is still valid given the patch is only in
tip/sched/core, and that the hash value won't change after Linus merges
it.

Thanks

--
Qais Yousef

* Re: [PATCH v7 3/3] sched/uclamp: Fix a deadlock when enabling uclamp static key
  2020-07-16 11:13   ` Qais Yousef
@ 2020-07-23 15:51     ` Qais Yousef
  0 siblings, 0 replies; 15+ messages in thread
From: Qais Yousef @ 2020-07-23 15:51 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Doug Anderson, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Patrick Bellasi,
	Chris Redpath, Lukasz Luba, linux-kernel

On 07/16/20 12:13, Qais Yousef wrote:
> On 07/16/20 12:03, Qais Yousef wrote:
> 
> [...]
> 
> > Fix by ensuring we enable the key outside of the critical section in
> > __sched_setscheduler()
> > 
> > Fixes: 46609ce22703 ("sched/uclamp: Protect uclamp fast path code with static key")
> > Signed-off-by: Qais Yousef <qais.yousef@arm.com>
> 
> I am assuming this Fixes tag is still valid given the patch is only in
> tip/sched/core, and that the hash value won't change after Linus merges
> it.

Gentle ping that this patch is actually a fix that I happened to uncover while
testing this series.

Happy to send separately. It should apply on its own too as it has no
dependency on previous patches.

Thanks

--
Qais Yousef

* Re: [PATCH v7 1/3] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-07-16 11:03 ` [PATCH v7 1/3] sched/uclamp: Add a new sysctl to control RT default " Qais Yousef
@ 2020-07-24  8:54   ` Peter Zijlstra
  2020-07-24  9:16     ` Qais Yousef
  2020-07-29 12:02   ` [tip: sched/core] " tip-bot2 for Qais Yousef
  1 sibling, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2020-07-24  8:54 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Doug Anderson, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

On Thu, Jul 16, 2020 at 12:03:45PM +0100, Qais Yousef wrote:

Would you mind terribly if I rename things like so?

I tried and failed to come up with a shorter name in general; these
function names are somewhat unwieldy. I considered s/_default//.

---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -914,7 +914,7 @@ unsigned int uclamp_rq_max_value(struct
 	return uclamp_idle_value(rq, clamp_id, clamp_value);
 }
 
-static void __uclamp_sync_util_min_rt_default_locked(struct task_struct *p)
+static void __uclamp_update_util_min_rt_default(struct task_struct *p)
 {
 	unsigned int default_util_min;
 	struct uclamp_se *uc_se;
@@ -931,7 +931,7 @@ static void __uclamp_sync_util_min_rt_de
 	uclamp_se_set(uc_se, default_util_min, false);
 }
 
-static void __uclamp_sync_util_min_rt_default(struct task_struct *p)
+static void uclamp_update_util_min_rt_default(struct task_struct *p)
 {
 	struct rq_flags rf;
 	struct rq *rq;
@@ -941,7 +941,7 @@ static void __uclamp_sync_util_min_rt_de
 
 	/* Protect updates to p->uclamp_* */
 	rq = task_rq_lock(p, &rf);
-	__uclamp_sync_util_min_rt_default_locked(p);
+	__uclamp_update_util_min_rt_default(p);
 	task_rq_unlock(rq, p, &rf);
 }
 
@@ -968,7 +968,7 @@ static void uclamp_sync_util_min_rt_defa
 
 	rcu_read_lock();
 	for_each_process_thread(g, p)
-		__uclamp_sync_util_min_rt_default(p);
+		uclamp_update_util_min_rt_default(p);
 	rcu_read_unlock();
 }
 
@@ -1360,7 +1360,7 @@ static void __setscheduler_uclamp(struct
 		 * at runtime.
 		 */
 		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
-			__uclamp_sync_util_min_rt_default_locked(p);
+			__uclamp_update_util_min_rt_default(p);
 		else
 			uclamp_se_set(uc_se, uclamp_none(clamp_id), false);
 
@@ -1404,7 +1404,7 @@ static void uclamp_fork(struct task_stru
 
 static void uclamp_post_fork(struct task_struct *p)
 {
-	__uclamp_sync_util_min_rt_default(p);
+	uclamp_update_util_min_rt_default(p);
 }
 
 static void __init init_uclamp_rq(struct rq *rq)

* Re: [PATCH v7 3/3] sched/uclamp: Fix a deadlock when enabling uclamp static key
  2020-07-16 11:03 ` [PATCH v7 3/3] sched/uclamp: Fix a deadlock when enabling uclamp static key Qais Yousef
  2020-07-16 11:13   ` Qais Yousef
@ 2020-07-24  9:12   ` Peter Zijlstra
  2020-07-24  9:46     ` Qais Yousef
  2020-07-29 12:02   ` [tip: sched/core] " tip-bot2 for Qais Yousef
  2 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2020-07-24  9:12 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Doug Anderson, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Patrick Bellasi, Chris Redpath, Lukasz Luba, linux-kernel

On Thu, Jul 16, 2020 at 12:03:47PM +0100, Qais Yousef wrote:

I've trimmed the Changelog to read like:

---
Subject: sched/uclamp: Fix a deadlock when enabling uclamp static key
From: Qais Yousef <qais.yousef@arm.com>
Date: Thu, 16 Jul 2020 12:03:47 +0100

From: Qais Yousef <qais.yousef@arm.com>

The following splat was caught when setting uclamp value of a task:

  BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49

   cpus_read_lock+0x68/0x130
   static_key_enable+0x1c/0x38
   __sched_setscheduler+0x900/0xad8

Fix by ensuring we enable the key outside of the critical section in
__sched_setscheduler()

Fixes: 46609ce22703 ("sched/uclamp: Protect uclamp fast path code with static key")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200716110347.19553-4-qais.yousef@arm.com
---

And placed this patch first in the series

That said; don't we have a slight problem with enabling the key late
like this? It means the uclamp will not actually take effect immediately
and we'll have to wait for the next context switch ... whenever that
might be.

Should we not have enabled the key early, instead of late?

something like so perhaps?

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a2a244af9a53..c6499b2696f5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1282,8 +1282,6 @@ static void __setscheduler_uclamp(struct task_struct *p,
 	if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
 		return;
 
-	static_branch_enable(&sched_uclamp_used);
-
 	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
 		uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
 			      attr->sched_util_min, true);
@@ -5074,6 +5072,7 @@ static int __sched_setscheduler(struct task_struct *p,
 		retval = uclamp_validate(p, attr);
 		if (retval)
 			return retval;
+		static_branch_enable(&sched_uclamp_used);
 	}
 
 	if (pi)

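For context on how userspace reaches this path: a task (or a manager acting on
its behalf) requests per-task clamps with sched_setattr() and
SCHED_FLAG_UTIL_CLAMP_{MIN,MAX}, which is what triggers uclamp_validate() and
the static key enable discussed above. A minimal sketch follows; the helper
names are mine, the struct layout and flag values follow the uclamp uAPI in
include/uapi/linux/sched.h, and the raw syscall will simply return an error on
kernels built without CONFIG_UCLAMP_TASK:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Flag values from include/uapi/linux/sched.h */
#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40

/*
 * Extended sched_attr layout used by the uclamp uAPI; glibc historically
 * shipped no wrapper, so the struct is declared locally.
 */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

/* Build an attr that only requests a uclamp.min boost for a task. */
static void make_uclamp_min_attr(struct sched_attr *attr, uint32_t util_min)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->sched_flags = SCHED_FLAG_UTIL_CLAMP_MIN;
	attr->sched_util_min = util_min;
}

/* Apply to the calling task; fails on kernels without CONFIG_UCLAMP_TASK. */
static long try_set_uclamp_min(uint32_t util_min)
{
	struct sched_attr attr;

	make_uclamp_min_attr(&attr, util_min);
	return syscall(SYS_sched_setattr, 0, &attr, 0);
}
```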
* Re: [PATCH v7 1/3] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-07-24  8:54   ` Peter Zijlstra
@ 2020-07-24  9:16     ` Qais Yousef
  0 siblings, 0 replies; 15+ messages in thread
From: Qais Yousef @ 2020-07-24  9:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Doug Anderson, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

On 07/24/20 10:54, Peter Zijlstra wrote:
> On Thu, Jul 16, 2020 at 12:03:45PM +0100, Qais Yousef wrote:
> 
> Would you mind terribly if I rename things like so?

Nope, the new name is good for me too.

> 
> I tried and failed to come up with a shorter name in general; these
> function names are somewhat unwieldy. I considered s/_default//.

Can do. I'm thinking that maybe we need to sprinkle more comments then. But if
I feel the need for more comments, I can always post another patch on top :-)

If you'd like a shorter name, a slightly shorter one would be

	update_uclamp_min_rt()

Thanks

--
Qais Yousef

> 
> ---
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -914,7 +914,7 @@ unsigned int uclamp_rq_max_value(struct
>  	return uclamp_idle_value(rq, clamp_id, clamp_value);
>  }
>  
> -static void __uclamp_sync_util_min_rt_default_locked(struct task_struct *p)
> +static void __uclamp_update_util_min_rt_default(struct task_struct *p)
>  {
>  	unsigned int default_util_min;
>  	struct uclamp_se *uc_se;
> @@ -931,7 +931,7 @@ static void __uclamp_sync_util_min_rt_de
>  	uclamp_se_set(uc_se, default_util_min, false);
>  }
>  
> -static void __uclamp_sync_util_min_rt_default(struct task_struct *p)
> +static void uclamp_update_util_min_rt_default(struct task_struct *p)
>  {
>  	struct rq_flags rf;
>  	struct rq *rq;
> @@ -941,7 +941,7 @@ static void __uclamp_sync_util_min_rt_de
>  
>  	/* Protect updates to p->uclamp_* */
>  	rq = task_rq_lock(p, &rf);
> -	__uclamp_sync_util_min_rt_default_locked(p);
> +	__uclamp_update_util_min_rt_default(p);
>  	task_rq_unlock(rq, p, &rf);
>  }
>  
> @@ -968,7 +968,7 @@ static void uclamp_sync_util_min_rt_defa
>  
>  	rcu_read_lock();
>  	for_each_process_thread(g, p)
> -		__uclamp_sync_util_min_rt_default(p);
> +		uclamp_update_util_min_rt_default(p);
>  	rcu_read_unlock();
>  }
>  
> @@ -1360,7 +1360,7 @@ static void __setscheduler_uclamp(struct
>  		 * at runtime.
>  		 */
>  		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> -			__uclamp_sync_util_min_rt_default_locked(p);
> +			__uclamp_update_util_min_rt_default(p);
>  		else
>  			uclamp_se_set(uc_se, uclamp_none(clamp_id), false);
>  
> @@ -1404,7 +1404,7 @@ static void uclamp_fork(struct task_stru
>  
>  static void uclamp_post_fork(struct task_struct *p)
>  {
> -	__uclamp_sync_util_min_rt_default(p);
> +	uclamp_update_util_min_rt_default(p);
>  }
>  
>  static void __init init_uclamp_rq(struct rq *rq)

* Re: [PATCH v7 3/3] sched/uclamp: Fix a deadlock when enabling uclamp static key
  2020-07-24  9:12   ` Peter Zijlstra
@ 2020-07-24  9:46     ` Qais Yousef
  2020-07-24 10:41       ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Qais Yousef @ 2020-07-24  9:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Doug Anderson, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Patrick Bellasi, Chris Redpath, Lukasz Luba, linux-kernel

On 07/24/20 11:12, Peter Zijlstra wrote:
> On Thu, Jul 16, 2020 at 12:03:47PM +0100, Qais Yousef wrote:
> 
> I've trimmed the Changelog to read like:

+1

Should we mention the ordering issue too? Or maybe I misinterpreted the
'Possible unsafe locking scenario' part?

> 
> ---
> Subject: sched/uclamp: Fix a deadlock when enabling uclamp static key
> From: Qais Yousef <qais.yousef@arm.com>
> Date: Thu, 16 Jul 2020 12:03:47 +0100
> 
> From: Qais Yousef <qais.yousef@arm.com>
> 
> The following splat was caught when setting uclamp value of a task:
> 
>   BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49
> 
>    cpus_read_lock+0x68/0x130
>    static_key_enable+0x1c/0x38
>    __sched_setscheduler+0x900/0xad8
> 
> Fix by ensuring we enable the key outside of the critical section in
> __sched_setscheduler()
> 
> Fixes: 46609ce22703 ("sched/uclamp: Protect uclamp fast path code with static key")
> Signed-off-by: Qais Yousef <qais.yousef@arm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20200716110347.19553-4-qais.yousef@arm.com
> ---
> 
> And placed this patch first in the series
> 
> That said; don't we have a slight problem with enabling the key late
> like this? It means the uclamp will not actually take effect immediately
> and we'll have to wait for the next context switch ... whenever that
> might be.

The enabling is racy inherently. Your suggestion below is better though.
I should have scrolled more screens up!

> 
> Should we not have enabled the key early, instead of late?
> 
> something like so perhaps?

This should work, but you'll need to sprinkle ifdef around the key. Or move it
to uclamp_validate()

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e1578c3ad40c..cd3b7a25ac59 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1345,6 +1345,21 @@ static int uclamp_validate(struct task_struct *p,
 	if (upper_bound > SCHED_CAPACITY_SCALE)
 		return -EINVAL;
 
+	/*
+	 * Enable uclamp static key outside the critical section of
+	 * __sched_setscheduler().
+	 *
+	 * static_branch_enable() will hold cpu_hotplug_lock; if done from
+	 * critical section, __setscheduler_uclamp(), which holds other locks
+	 * (rq->lock namely), it could lead to deadlock scenarios as both are
+	 * popular locks and could be acquired from different paths in
+	 * different orders.
+	 *
+	 * Besides cpu_hotplug_lock could sleep, which is not allowed inside
+	 * the critical section of __sched_setscheduler().
+	 */
+	static_branch_enable(&sched_uclamp_used);
+
 	return 0;
 }
 
@@ -1378,8 +1393,6 @@ static void __setscheduler_uclamp(struct task_struct *p,
 	if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
 		return;
 
-	static_branch_enable(&sched_uclamp_used);
-
 	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
 		uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
 			      attr->sched_util_min, true);

* Re: [PATCH v7 3/3] sched/uclamp: Fix a deadlock when enabling uclamp static key
  2020-07-24  9:46     ` Qais Yousef
@ 2020-07-24 10:41       ` Peter Zijlstra
  2020-07-24 10:50         ` Qais Yousef
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2020-07-24 10:41 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Doug Anderson, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Patrick Bellasi, Chris Redpath, Lukasz Luba, linux-kernel

On Fri, Jul 24, 2020 at 10:46:50AM +0100, Qais Yousef wrote:
> On 07/24/20 11:12, Peter Zijlstra wrote:
> > On Thu, Jul 16, 2020 at 12:03:47PM +0100, Qais Yousef wrote:
> > 
> > I've trimmed the Changelog to read like:
> 
> +1
> 
> Should we mention the ordering issue too? Or maybe I misinterpreted the
> 'Possible unsafe locking scenario' part?

The lock inversion was, imo, secondary. It only existed because of the
impossible lock ordering -- taking a blocking lock inside an atomic
lock. Fixing the first avoids the second, etc. So I left it out.

> This should work, but you'll need to sprinkle ifdef around the key. Or move it
> to uclamp_validate()

Indeed, the patch now reads like:

---
Subject: sched/uclamp: Fix a deadlock when enabling uclamp static key
From: Qais Yousef <qais.yousef@arm.com>
Date: Thu, 16 Jul 2020 12:03:47 +0100

From: Qais Yousef <qais.yousef@arm.com>

The following splat was caught when setting uclamp value of a task:

  BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49

   cpus_read_lock+0x68/0x130
   static_key_enable+0x1c/0x38
   __sched_setscheduler+0x900/0xad8

Fix by ensuring we enable the key outside of the critical section in
__sched_setscheduler()

Fixes: 46609ce22703 ("sched/uclamp: Protect uclamp fast path code with static key")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200716110347.19553-4-qais.yousef@arm.com
---
 kernel/sched/core.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1252,6 +1252,15 @@ static int uclamp_validate(struct task_s
 	if (upper_bound > SCHED_CAPACITY_SCALE)
 		return -EINVAL;
 
+	/*
+	 * We have valid uclamp attributes; make sure uclamp is enabled.
+	 *
+	 * We need to do that here, because enabling static branches is a
+	 * blocking operation which obviously cannot be done while holding
+	 * scheduler locks.
+	 */
+	static_branch_enable(&sched_uclamp_used);
+
 	return 0;
 }
 
@@ -1282,8 +1291,6 @@ static void __setscheduler_uclamp(struct
 	if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
 		return;
 
-	static_branch_enable(&sched_uclamp_used);
-
 	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
 		uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
 			      attr->sched_util_min, true);

* Re: [PATCH v7 3/3] sched/uclamp: Fix a deadlock when enabling uclamp static key
  2020-07-24 10:41       ` Peter Zijlstra
@ 2020-07-24 10:50         ` Qais Yousef
  0 siblings, 0 replies; 15+ messages in thread
From: Qais Yousef @ 2020-07-24 10:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Doug Anderson, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Patrick Bellasi, Chris Redpath, Lukasz Luba, linux-kernel

On 07/24/20 12:41, Peter Zijlstra wrote:
> On Fri, Jul 24, 2020 at 10:46:50AM +0100, Qais Yousef wrote:
> > On 07/24/20 11:12, Peter Zijlstra wrote:
> > > On Thu, Jul 16, 2020 at 12:03:47PM +0100, Qais Yousef wrote:
> > > 
> > > I've trimmed the Changelog to read like:
> > 
> > +1
> > 
> > Should we mention the ordering issue too? Or maybe I misinterpreted the
> > 'Possible unsafe locking scenario' part?
> 
> The lock inversion was, imo, secondary. It only existed because of the
> impossible lock ordering -- taking a blocking lock inside an atomic
> lock. Fixing the first, avoids the second etc.. So I left it out.
> 
> > This should work, but you'll need to sprinkle ifdef around the key. Or move it
> > to uclamp_validate()
> 
> Indeed, the patch now reads like:

Maybe s/deadlock/splat/ in the subject now? I clearly focused on the secondary
thing...

Sorry you had to modify the patch that much yourself.

Thanks!

--
Qais Yousef

> 
> ---
> Subject: sched/uclamp: Fix a deadlock when enabling uclamp static key
> From: Qais Yousef <qais.yousef@arm.com>
> Date: Thu, 16 Jul 2020 12:03:47 +0100
> 
> From: Qais Yousef <qais.yousef@arm.com>
> 
> The following splat was caught when setting uclamp value of a task:
> 
>   BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49
> 
>    cpus_read_lock+0x68/0x130
>    static_key_enable+0x1c/0x38
>    __sched_setscheduler+0x900/0xad8
> 
> Fix by ensuring we enable the key outside of the critical section in
> __sched_setscheduler()
> 
> Fixes: 46609ce22703 ("sched/uclamp: Protect uclamp fast path code with static key")
> Signed-off-by: Qais Yousef <qais.yousef@arm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20200716110347.19553-4-qais.yousef@arm.com
> ---
>  kernel/sched/core.c |   11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1252,6 +1252,15 @@ static int uclamp_validate(struct task_s
>  	if (upper_bound > SCHED_CAPACITY_SCALE)
>  		return -EINVAL;
>  
> +	/*
> +	 * We have valid uclamp attributes; make sure uclamp is enabled.
> +	 *
> +	 * We need to do that here, because enabling static branches is a
> +	 * blocking operation which obviously cannot be done while holding
> +	 * scheduler locks.
> +	 */
> +	static_branch_enable(&sched_uclamp_used);
> +
>  	return 0;
>  }
>  
> @@ -1282,8 +1291,6 @@ static void __setscheduler_uclamp(struct
>  	if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
>  		return;
>  
> -	static_branch_enable(&sched_uclamp_used);
> -
>  	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
>  		uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
>  			      attr->sched_util_min, true);

* [tip: sched/core] Documentation/sysctl: Document uclamp sysctl knobs
  2020-07-16 11:03 ` [PATCH v7 2/3] Documentation/sysctl: Document uclamp sysctl knobs Qais Yousef
@ 2020-07-29 12:02   ` tip-bot2 for Qais Yousef
  0 siblings, 0 replies; 15+ messages in thread
From: tip-bot2 for Qais Yousef @ 2020-07-29 12:02 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Qais Yousef, Peter Zijlstra (Intel), x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     1f73d1abe5836bd8ffe747ff5cb7561b17ce5bc6
Gitweb:        https://git.kernel.org/tip/1f73d1abe5836bd8ffe747ff5cb7561b17ce5bc6
Author:        Qais Yousef <qais.yousef@arm.com>
AuthorDate:    Thu, 16 Jul 2020 12:03:46 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Jul 2020 13:51:48 +02:00

Documentation/sysctl: Document uclamp sysctl knobs

Uclamp exposes 3 sysctl knobs:

	* sched_util_clamp_min
	* sched_util_clamp_max
	* sched_util_clamp_min_rt_default

Document them in sysctl/kernel.rst.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200716110347.19553-3-qais.yousef@arm.com
---
 Documentation/admin-guide/sysctl/kernel.rst | 54 ++++++++++++++++++++-
 1 file changed, 54 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 83acf50..55bf6b4 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1062,6 +1062,60 @@ Enables/disables scheduler statistics. Enabling this feature
 incurs a small amount of overhead in the scheduler but is
 useful for debugging and performance tuning.
 
+sched_util_clamp_min:
+=====================
+
+Max allowed *minimum* utilization.
+
+Default value is 1024, which is the maximum possible value.
+
+It means that any requested uclamp.min value cannot be greater than
+sched_util_clamp_min, i.e., it is restricted to the range
+[0:sched_util_clamp_min].
+
+sched_util_clamp_max:
+=====================
+
+Max allowed *maximum* utilization.
+
+Default value is 1024, which is the maximum possible value.
+
+It means that any requested uclamp.max value cannot be greater than
+sched_util_clamp_max, i.e., it is restricted to the range
+[0:sched_util_clamp_max].
+
+sched_util_clamp_min_rt_default:
+================================
+
+By default Linux is tuned for performance, which means that RT tasks always
+run at the highest frequency and on the most capable (highest capacity) CPU
+(in heterogeneous systems).
+
+Uclamp achieves this by setting the requested uclamp.min of all RT tasks to
+1024 by default, which effectively boosts the tasks to run at the highest
+frequency and biases them to run on the biggest CPU.
+
+This knob allows admins to change the default behavior when uclamp is being
+used. In battery powered devices particularly, running at the maximum
+capacity and frequency will increase energy consumption and shorten the battery
+life.
+
+This knob is only effective for RT tasks whose requested uclamp.min value the
+user hasn't modified via the sched_setattr() syscall.
+
+This knob will not escape the range constraint imposed by sched_util_clamp_min
+defined above.
+
+For example if
+
+	sched_util_clamp_min_rt_default = 800
+	sched_util_clamp_min = 600
+
+Then the boost will be clamped to 600 because 800 is outside of the permissible
+range of [0:600]. This could happen for instance if a powersave mode
+temporarily restricts all boosts by modifying sched_util_clamp_min. As soon as
+this restriction is lifted, the requested sched_util_clamp_min_rt_default
+will take effect.
 
 seccomp
 =======

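The worked example in the documentation above (a requested default of 800
clamped by sched_util_clamp_min = 600) amounts to a simple range restriction.
A toy userspace model, with a function name of my own invention rather than
kernel code:

```c
#include <assert.h>

/*
 * Toy model of the documented restriction: the effective RT default boost
 * is the requested sched_util_clamp_min_rt_default, clamped to the
 * permissible range [0:sched_util_clamp_min].
 */
static unsigned int effective_rt_boost(unsigned int rt_default,
				       unsigned int util_clamp_min)
{
	return rt_default > util_clamp_min ? util_clamp_min : rt_default;
}
```

Once the powersave restriction is lifted (sched_util_clamp_min raised back to
1024 in this model), the same computation yields the full requested 800 again.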
* [tip: sched/core] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-07-16 11:03 ` [PATCH v7 1/3] sched/uclamp: Add a new sysctl to control RT default " Qais Yousef
  2020-07-24  8:54   ` Peter Zijlstra
@ 2020-07-29 12:02   ` tip-bot2 for Qais Yousef
  1 sibling, 0 replies; 15+ messages in thread
From: tip-bot2 for Qais Yousef @ 2020-07-29 12:02 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Qais Yousef, Peter Zijlstra (Intel), x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     13685c4a08fca9dd76bf53bfcbadc044ab2a08cb
Gitweb:        https://git.kernel.org/tip/13685c4a08fca9dd76bf53bfcbadc044ab2a08cb
Author:        Qais Yousef <qais.yousef@arm.com>
AuthorDate:    Thu, 16 Jul 2020 12:03:45 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Jul 2020 13:51:47 +02:00

sched/uclamp: Add a new sysctl to control RT default boost value

RT tasks by default run at the highest capacity/performance level. When
uclamp is selected this default behavior is retained by enforcing the
requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
value.

This is also referred to as 'the default boost value of RT tasks'.

See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").

On battery powered devices, it is desired to control this default
(currently hardcoded) behavior at runtime to reduce energy consumed by
RT tasks.

For example, for a mobile device manufacturer where the big.LITTLE
architecture is dominant, the performance of the little cores varies across
SoCs, and on high end ones the big cores could be too power hungry.

Given the diversity of SoCs, the new knob allows manufactures to tune
the best performance/power for RT tasks for the particular hardware they
run on.

They could opt to further tune the value when the user selects
a different power saving mode or when the device is actively charging.

The runtime aspect of it further helps in creating a single kernel image
that can be run on multiple devices that require different tuning.

Keep in mind that a lot of RT tasks in the system are created by the
kernel. On Android for instance I can see over 50 RT tasks, only
a handful of which are created by the Android framework.

To control the default behavior globally by system admins and device
integrator, introduce the new sysctl_sched_uclamp_util_min_rt_default
to change the default boost value of the RT tasks.

I anticipate this to be mostly in the form of modifying the init script
of a particular device.

To avoid polluting the fast path with unnecessary code, the approach
taken is to synchronously do the update by traversing all the existing
tasks in the system. This could race with a concurrent fork(), which is
dealt with by introducing sched_post_fork() function which will ensure
the racy fork will get the right update applied.

Tested on Juno-r2 in combination with the RT capacity awareness [1].
By default an RT task will go to the highest capacity CPU and run at the
maximum frequency, which is particularly energy inefficient on high end
mobile devices because the biggest core[s] are 'huge' and power hungry.

With this patch the RT task can be controlled to run anywhere by
default, and doesn't cause the frequency to be maximum all the time.
Yet any task that really needs to be boosted can easily escape this
default behavior by modifying its requested uclamp.min value
(p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.

[1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com
---
 include/linux/sched.h        |  10 ++-
 include/linux/sched/sysctl.h |   1 +-
 include/linux/sched/task.h   |   1 +-
 kernel/fork.c                |   1 +-
 kernel/sched/core.c          | 119 ++++++++++++++++++++++++++++++++--
 kernel/sysctl.c              |   7 ++-
 6 files changed, 131 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index adf0125..a6bf77c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -686,9 +686,15 @@ struct task_struct {
 	struct sched_dl_entity		dl;
 
 #ifdef CONFIG_UCLAMP_TASK
-	/* Clamp values requested for a scheduling entity */
+	/*
+	 * Clamp values requested for a scheduling entity.
+	 * Must be updated with task_rq_lock() held.
+	 */
 	struct uclamp_se		uclamp_req[UCLAMP_CNT];
-	/* Effective clamp values used for a scheduling entity */
+	/*
+	 * Effective clamp values used for a scheduling entity.
+	 * Must be updated with task_rq_lock() held.
+	 */
 	struct uclamp_se		uclamp[UCLAMP_CNT];
 #endif
 
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 24be30a..3c31ba8 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -67,6 +67,7 @@ extern unsigned int sysctl_sched_dl_period_min;
 #ifdef CONFIG_UCLAMP_TASK
 extern unsigned int sysctl_sched_uclamp_util_min;
 extern unsigned int sysctl_sched_uclamp_util_max;
+extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
 #endif
 
 #ifdef CONFIG_CFS_BANDWIDTH
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 3835907..e7ddab0 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -55,6 +55,7 @@ extern asmlinkage void schedule_tail(struct task_struct *prev);
 extern void init_idle(struct task_struct *idle, int cpu);
 
 extern int sched_fork(unsigned long clone_flags, struct task_struct *p);
+extern void sched_post_fork(struct task_struct *p);
 extern void sched_dead(struct task_struct *p);
 
 void __noreturn do_task_dead(void);
diff --git a/kernel/fork.c b/kernel/fork.c
index efc5493..e75c2e4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2304,6 +2304,7 @@ static __latent_entropy struct task_struct *copy_process(
 	write_unlock_irq(&tasklist_lock);
 
 	proc_fork_connector(p);
+	sched_post_fork(p);
 	cgroup_post_fork(p, args);
 	perf_event_fork(p);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e44d83f..12e1f3a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -889,6 +889,23 @@ unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
 /* Max allowed maximum utilization */
 unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
 
+/*
+ * By default RT tasks run at the maximum performance point/capacity of the
+ * system. Uclamp enforces this by always setting UCLAMP_MIN of RT tasks to
+ * SCHED_CAPACITY_SCALE.
+ *
+ * This knob allows admins to change the default behavior when uclamp is being
+ * used. In battery powered devices, particularly, running at the maximum
+ * capacity and frequency will increase energy consumption and shorten the
+ * battery life.
+ *
+ * This knob only affects RT tasks whose uclamp_se->user_defined == false.
+ *
+ * This knob will not override the system default sched_util_clamp_min defined
+ * above.
+ */
+unsigned int sysctl_sched_uclamp_util_min_rt_default = SCHED_CAPACITY_SCALE;
+
 /* All clamps are required to be less or equal than these values */
 static struct uclamp_se uclamp_default[UCLAMP_CNT];
 
@@ -991,6 +1008,64 @@ unsigned int uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
 	return uclamp_idle_value(rq, clamp_id, clamp_value);
 }
 
+static void __uclamp_update_util_min_rt_default(struct task_struct *p)
+{
+	unsigned int default_util_min;
+	struct uclamp_se *uc_se;
+
+	lockdep_assert_held(&p->pi_lock);
+
+	uc_se = &p->uclamp_req[UCLAMP_MIN];
+
+	/* Only sync if user didn't override the default */
+	if (uc_se->user_defined)
+		return;
+
+	default_util_min = sysctl_sched_uclamp_util_min_rt_default;
+	uclamp_se_set(uc_se, default_util_min, false);
+}
+
+static void uclamp_update_util_min_rt_default(struct task_struct *p)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+
+	if (!rt_task(p))
+		return;
+
+	/* Protect updates to p->uclamp_* */
+	rq = task_rq_lock(p, &rf);
+	__uclamp_update_util_min_rt_default(p);
+	task_rq_unlock(rq, p, &rf);
+}
+
+static void uclamp_sync_util_min_rt_default(void)
+{
+	struct task_struct *g, *p;
+
+	/*
+	 * copy_process()			sysctl_uclamp
+	 *					  uclamp_min_rt = X;
+	 *   write_lock(&tasklist_lock)		  read_lock(&tasklist_lock)
+	 *   // link thread			  smp_mb__after_spinlock()
+	 *   write_unlock(&tasklist_lock)	  read_unlock(&tasklist_lock);
+	 *   sched_post_fork()			  for_each_process_thread()
+	 *     uclamp_post_fork()		    uclamp_update_util_min_rt_default(p)
+	 *
+	 * Ensures that either sched_post_fork() will observe the new
+	 * uclamp_min_rt or for_each_process_thread() will observe the new
+	 * task.
+	 */
+	read_lock(&tasklist_lock);
+	smp_mb__after_spinlock();
+	read_unlock(&tasklist_lock);
+
+	rcu_read_lock();
+	for_each_process_thread(g, p)
+		uclamp_update_util_min_rt_default(p);
+	rcu_read_unlock();
+}
+
 static inline struct uclamp_se
 uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
 {
@@ -1278,12 +1353,13 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 				void *buffer, size_t *lenp, loff_t *ppos)
 {
 	bool update_root_tg = false;
-	int old_min, old_max;
+	int old_min, old_max, old_min_rt;
 	int result;
 
 	mutex_lock(&uclamp_mutex);
 	old_min = sysctl_sched_uclamp_util_min;
 	old_max = sysctl_sched_uclamp_util_max;
+	old_min_rt = sysctl_sched_uclamp_util_min_rt_default;
 
 	result = proc_dointvec(table, write, buffer, lenp, ppos);
 	if (result)
@@ -1292,7 +1368,9 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 		goto done;
 
 	if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
-	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
+	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE	||
+	    sysctl_sched_uclamp_util_min_rt_default > SCHED_CAPACITY_SCALE) {
+
 		result = -EINVAL;
 		goto undo;
 	}
@@ -1313,6 +1391,11 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 		uclamp_update_root_tg();
 	}
 
+	if (old_min_rt != sysctl_sched_uclamp_util_min_rt_default) {
+		static_branch_enable(&sched_uclamp_used);
+		uclamp_sync_util_min_rt_default();
+	}
+
 	/*
 	 * We update all RUNNABLE tasks only when task groups are in use.
 	 * Otherwise, keep it simple and do just a lazy update at each next
@@ -1324,6 +1407,7 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 undo:
 	sysctl_sched_uclamp_util_min = old_min;
 	sysctl_sched_uclamp_util_max = old_max;
+	sysctl_sched_uclamp_util_min_rt_default = old_min_rt;
 done:
 	mutex_unlock(&uclamp_mutex);
 
@@ -1369,17 +1453,20 @@ static void __setscheduler_uclamp(struct task_struct *p,
 	 */
 	for_each_clamp_id(clamp_id) {
 		struct uclamp_se *uc_se = &p->uclamp_req[clamp_id];
-		unsigned int clamp_value = uclamp_none(clamp_id);
 
 		/* Keep using defined clamps across class changes */
 		if (uc_se->user_defined)
 			continue;
 
-		/* By default, RT tasks always get 100% boost */
+		/*
+		 * RT tasks by default have a 100% boost value that can be
+		 * modified at runtime.
+		 */
 		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
-			clamp_value = uclamp_none(UCLAMP_MAX);
+			__uclamp_update_util_min_rt_default(p);
+		else
+			uclamp_se_set(uc_se, uclamp_none(clamp_id), false);
 
-		uclamp_se_set(uc_se, clamp_value, false);
 	}
 
 	if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
@@ -1400,6 +1487,10 @@ static void uclamp_fork(struct task_struct *p)
 {
 	enum uclamp_id clamp_id;
 
+	/*
+	 * We don't need to hold task_rq_lock() when updating p->uclamp_* here
+	 * as the task is still in its early fork stages.
+	 */
 	for_each_clamp_id(clamp_id)
 		p->uclamp[clamp_id].active = false;
 
@@ -1412,6 +1503,11 @@ static void uclamp_fork(struct task_struct *p)
 	}
 }
 
+static void uclamp_post_fork(struct task_struct *p)
+{
+	uclamp_update_util_min_rt_default(p);
+}
+
 static void __init init_uclamp_rq(struct rq *rq)
 {
 	enum uclamp_id clamp_id;
@@ -1462,6 +1558,7 @@ static inline int uclamp_validate(struct task_struct *p,
 static void __setscheduler_uclamp(struct task_struct *p,
 				  const struct sched_attr *attr) { }
 static inline void uclamp_fork(struct task_struct *p) { }
+static inline void uclamp_post_fork(struct task_struct *p) { }
 static inline void init_uclamp(void) { }
 #endif /* CONFIG_UCLAMP_TASK */
 
@@ -3205,6 +3302,11 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	return 0;
 }
 
+void sched_post_fork(struct task_struct *p)
+{
+	uclamp_post_fork(p);
+}
+
 unsigned long to_ratio(u64 period, u64 runtime)
 {
 	if (runtime == RUNTIME_INF)
@@ -5724,6 +5826,11 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
 		kattr.sched_nice = task_nice(p);
 
 #ifdef CONFIG_UCLAMP_TASK
+	/*
+	 * This could race with another potential updater, but this is fine
+	 * because it'll correctly read the old or the new value. We don't need
+	 * to guarantee who wins the race as long as it doesn't return garbage.
+	 */
 	kattr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value;
 	kattr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value;
 #endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 4aea67d..1b4d2dc 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1815,6 +1815,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sysctl_sched_uclamp_handler,
 	},
+	{
+		.procname	= "sched_util_clamp_min_rt_default",
+		.data		= &sysctl_sched_uclamp_util_min_rt_default,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_sched_uclamp_handler,
+	},
 #endif
 #ifdef CONFIG_SCHED_AUTOGROUP
 	{


* [tip: sched/core] sched/uclamp: Fix a deadlock when enabling uclamp static key
  2020-07-16 11:03 ` [PATCH v7 3/3] sched/uclamp: Fix a deadlock when enabling uclamp static key Qais Yousef
  2020-07-16 11:13   ` Qais Yousef
  2020-07-24  9:12   ` Peter Zijlstra
@ 2020-07-29 12:02   ` tip-bot2 for Qais Yousef
  2 siblings, 0 replies; 15+ messages in thread
From: tip-bot2 for Qais Yousef @ 2020-07-29 12:02 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Qais Yousef, Peter Zijlstra (Intel), x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     e65855a52b479f98674998cb23b21ef5a8144b04
Gitweb:        https://git.kernel.org/tip/e65855a52b479f98674998cb23b21ef5a8144b04
Author:        Qais Yousef <qais.yousef@arm.com>
AuthorDate:    Thu, 16 Jul 2020 12:03:47 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Jul 2020 13:51:47 +02:00

sched/uclamp: Fix a deadlock when enabling uclamp static key

The following splat was caught when setting uclamp value of a task:

  BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49

   cpus_read_lock+0x68/0x130
   static_key_enable+0x1c/0x38
   __sched_setscheduler+0x900/0xad8

Fix by ensuring we enable the key outside of the critical section in
__sched_setscheduler().

Fixes: 46609ce22703 ("sched/uclamp: Protect uclamp fast path code with static key")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200716110347.19553-4-qais.yousef@arm.com
---
 kernel/sched/core.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6782534..e44d83f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1346,6 +1346,15 @@ static int uclamp_validate(struct task_struct *p,
 	if (upper_bound > SCHED_CAPACITY_SCALE)
 		return -EINVAL;
 
+	/*
+	 * We have valid uclamp attributes; make sure uclamp is enabled.
+	 *
+	 * We need to do that here, because enabling static branches is a
+	 * blocking operation which obviously cannot be done while holding
+	 * scheduler locks.
+	 */
+	static_branch_enable(&sched_uclamp_used);
+
 	return 0;
 }
 
@@ -1376,8 +1385,6 @@ static void __setscheduler_uclamp(struct task_struct *p,
 	if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
 		return;
 
-	static_branch_enable(&sched_uclamp_used);
-
 	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
 		uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
 			      attr->sched_util_min, true);


Thread overview: 15+ messages
2020-07-16 11:03 [PATCH v7 0/3] sched/uclamp: new sysctl for default RT boost value Qais Yousef
2020-07-16 11:03 ` [PATCH v7 1/3] sched/uclamp: Add a new sysctl to control RT default " Qais Yousef
2020-07-24  8:54   ` Peter Zijlstra
2020-07-24  9:16     ` Qais Yousef
2020-07-29 12:02   ` [tip: sched/core] " tip-bot2 for Qais Yousef
2020-07-16 11:03 ` [PATCH v7 2/3] Documentation/sysctl: Document uclamp sysctl knobs Qais Yousef
2020-07-29 12:02   ` [tip: sched/core] " tip-bot2 for Qais Yousef
2020-07-16 11:03 ` [PATCH v7 3/3] sched/uclamp: Fix a deadlock when enabling uclamp static key Qais Yousef
2020-07-16 11:13   ` Qais Yousef
2020-07-23 15:51     ` Qais Yousef
2020-07-24  9:12   ` Peter Zijlstra
2020-07-24  9:46     ` Qais Yousef
2020-07-24 10:41       ` Peter Zijlstra
2020-07-24 10:50         ` Qais Yousef
2020-07-29 12:02   ` [tip: sched/core] " tip-bot2 for Qais Yousef
