* [PATCH v4 0/2] sched: Optionally skip uclamp logic in fast path
@ 2020-06-25 15:43 Qais Yousef
  2020-06-25 15:43 ` [PATCH v4 1/2] sched/uclamp: Fix initialization of struct uclamp_rq Qais Yousef
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Qais Yousef @ 2020-06-25 15:43 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Valentin Schneider, Qais Yousef, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Patrick Bellasi, Chris Redpath, Lukasz Luba, linux-kernel

This series attempts to address the report that the uclamp logic can be
expensive at times and shows a regression in netperf UDP_STREAM under certain
conditions.

The first patch fixes how struct uclamp_rq is initialized; it is required by
the 2nd patch, which contains the real 'fix'.

Worth noting that the root cause of the overhead is believed to be system
specific, or related to potential code/data layout issues leading to worse
I/D $ performance.

Different systems exhibited different behaviors, and the regression did
disappear in certain kernel versions while attempting to reproduce it.

More info can be found here:

https://lore.kernel.org/lkml/20200616110824.dgkkbyapn3io6wik@e107158-lin/

Having the static key seemed the best thing to do to ensure the effect of
uclamp is minimized for kernels that compile it in but don't have a userspace
that uses it. This will allow distros to distribute uclamp-capable kernels by
default without having to compromise on performance for the systems that could
be affected.
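
For reference, a minimal sketch of the opt-in static key pattern that the 2nd
patch builds on (the key name matches the patch; the two wrapper functions
below are hypothetical placeholders, not code from the series):

```c
#include <linux/jump_label.h>

/* Off by default: a compiled-in but unused uclamp stays out of the fast path. */
static DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);

/* Fast path: the branch is patched to a NOP while the key is disabled. */
static void fast_path_hook(void)
{
	if (!static_branch_unlikely(&sched_uclamp_used))
		return;
	/* ... uclamp_rq_{inc,dec}() accounting would go here ... */
}

/* Any opt-in path (sched_setattr(), sysctl, cgroup write) flips the key once. */
static void user_opted_in(void)
{
	static_branch_enable(&sched_uclamp_used);
}
```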

Changes in v4:
	* Fix broken boosting of RT tasks when static key is disabled.

Changes in v3:
	* Avoid double negatives and rename the static key to uclamp_used
	* Unconditionally enable the static key through any of the paths where
	  the user can modify the default uclamp value.
	* Use C99 named struct initializer for struct uclamp_rq which is easier
	  to read than the memset().

Changes in v2:
	* Add more info in the commit message about the result of perf diff to
	  demonstrate that the activate/deactivate_task pressure is reduced in
	  the fast path.

	* Fix sparse warning reported by the test robot.

	* Add an extra commit about using static_branch_likely() instead of
	  static_branch_unlikely().

Thanks

--
Qais Yousef

Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
CC: Patrick Bellasi <patrick.bellasi@matbug.net>
Cc: Chris Redpath <chris.redpath@arm.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: linux-kernel@vger.kernel.org

Qais Yousef (2):
  sched/uclamp: Fix initialization of struct uclamp_rq
  sched/uclamp: Protect uclamp fast path code with static key

 kernel/sched/core.c              | 80 ++++++++++++++++++++++++++++----
 kernel/sched/cpufreq_schedutil.c |  2 +-
 kernel/sched/sched.h             |  7 +++
 3 files changed, 79 insertions(+), 10 deletions(-)

-- 
2.17.1



* [PATCH v4 1/2] sched/uclamp: Fix initialization of struct uclamp_rq
  2020-06-25 15:43 [PATCH v4 0/2] sched: Optionally skip uclamp logic in fast path Qais Yousef
@ 2020-06-25 15:43 ` Qais Yousef
  2020-06-26 12:32   ` Patrick Bellasi
  2020-06-25 15:43 ` [PATCH v4 2/2] sched/uclamp: Protect uclamp fast path code with static key Qais Yousef
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Qais Yousef @ 2020-06-25 15:43 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Valentin Schneider, Qais Yousef, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Patrick Bellasi, Chris Redpath, Lukasz Luba, linux-kernel

struct uclamp_rq was zeroed out entirely in the assumption that in the first
call to uclamp_rq_inc() it would be initialized correctly in accordance with
the default settings.

But when the next patch introduces a static key to skip
uclamp_rq_{inc,dec}() until userspace opts in to use uclamp, schedutil
will fail to perform any frequency changes because
rq->uclamp[UCLAMP_MAX].value is zeroed at init and stays as such, which
means all rqs are capped to 0 by default.

Fix it by making sure we do proper initialization at init without
relying on uclamp_rq_inc() doing it later.

Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
CC: Patrick Bellasi <patrick.bellasi@matbug.net>
Cc: Chris Redpath <chris.redpath@arm.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: linux-kernel@vger.kernel.org
---
 kernel/sched/core.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8fe2ac910bed..235b2cae00a0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1248,6 +1248,20 @@ static void uclamp_fork(struct task_struct *p)
 	}
 }
 
+static void __init init_uclamp_rq(struct rq *rq)
+{
+	enum uclamp_id clamp_id;
+	struct uclamp_rq *uc_rq = rq->uclamp;
+
+	for_each_clamp_id(clamp_id) {
+		uc_rq[clamp_id] = (struct uclamp_rq) {
+			.value = uclamp_none(clamp_id)
+		};
+	}
+
+	rq->uclamp_flags = 0;
+}
+
 static void __init init_uclamp(void)
 {
 	struct uclamp_se uc_max = {};
@@ -1256,11 +1270,8 @@ static void __init init_uclamp(void)
 
 	mutex_init(&uclamp_mutex);
 
-	for_each_possible_cpu(cpu) {
-		memset(&cpu_rq(cpu)->uclamp, 0,
-				sizeof(struct uclamp_rq)*UCLAMP_CNT);
-		cpu_rq(cpu)->uclamp_flags = 0;
-	}
+	for_each_possible_cpu(cpu)
+		init_uclamp_rq(cpu_rq(cpu));
 
 	for_each_clamp_id(clamp_id) {
 		uclamp_se_set(&init_task.uclamp_req[clamp_id],
-- 
2.17.1



* [PATCH v4 2/2] sched/uclamp: Protect uclamp fast path code with static key
  2020-06-25 15:43 [PATCH v4 0/2] sched: Optionally skip uclamp logic in fast path Qais Yousef
  2020-06-25 15:43 ` [PATCH v4 1/2] sched/uclamp: Fix initialization of struct uclamp_rq Qais Yousef
@ 2020-06-25 15:43 ` Qais Yousef
  2020-06-26 12:38   ` Patrick Bellasi
  2020-06-26 10:00 ` [PATCH v4 0/2] sched: Optionally skip uclamp logic in fast path Lukasz Luba
  2020-06-26 10:27 ` Peter Zijlstra
  3 siblings, 1 reply; 11+ messages in thread
From: Qais Yousef @ 2020-06-25 15:43 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Valentin Schneider, Qais Yousef, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Patrick Bellasi, Chris Redpath, Lukasz Luba, linux-kernel

There is a report that when uclamp is enabled, a netperf UDP test
regresses compared to a kernel compiled without uclamp.

https://lore.kernel.org/lkml/20200529100806.GA3070@suse.de/

While investigating the root cause, there was no sign that the uclamp
code does anything particularly expensive, but it could suffer from bad
cache behavior under certain circumstances that are yet to be
understood.

https://lore.kernel.org/lkml/20200616110824.dgkkbyapn3io6wik@e107158-lin/

To reduce the pressure on the fast path anyway, add a static key that
will, by default, skip executing the uclamp logic in the
enqueue/dequeue_task() fast path until it's needed.

As soon as the user starts using util clamp by:

	1. Changing uclamp value of a task with sched_setattr()
	2. Modifying the default sysctl_sched_util_clamp_{min, max}
	3. Modifying the default cpu.uclamp.{min, max} value in cgroup

we flip the static key, since the user has opted in to use util clamp,
effectively re-introducing the uclamp logic in the enqueue/dequeue_task()
fast path. It stays on from that point forward until the next reboot.

This should help minimize the effect of util clamp on workloads that
don't need it but still allow distros to ship their kernels with uclamp
compiled in by default.

The SCHED_WARN_ON() in uclamp_rq_dec_id() was removed since now we can
end up with an unbalanced call to uclamp_rq_dec_id() if we flip the key
while a task is running in the rq. Since we know it is harmless, we just
quietly return if we attempt a uclamp_rq_dec_id() when
rq->uclamp[].bucket[].tasks is 0.

In schedutil, we introduce a new uclamp_is_enabled() helper which takes
the static key into account to ensure RT boosting behavior is retained.

The following results demonstrate how this helps on a 2-socket Xeon E5
2x10-cores system.

                                   nouclamp                 uclamp      uclamp-static-key
Hmean     send-64         162.43 (   0.00%)      157.84 *  -2.82%*      163.39 *   0.59%*
Hmean     send-128        324.71 (   0.00%)      314.78 *  -3.06%*      326.18 *   0.45%*
Hmean     send-256        641.55 (   0.00%)      628.67 *  -2.01%*      648.12 *   1.02%*
Hmean     send-1024      2525.28 (   0.00%)     2448.26 *  -3.05%*     2543.73 *   0.73%*
Hmean     send-2048      4836.14 (   0.00%)     4712.08 *  -2.57%*     4867.69 *   0.65%*
Hmean     send-3312      7540.83 (   0.00%)     7425.45 *  -1.53%*     7621.06 *   1.06%*
Hmean     send-4096      9124.53 (   0.00%)     8948.82 *  -1.93%*     9276.25 *   1.66%*
Hmean     send-8192     15589.67 (   0.00%)    15486.35 *  -0.66%*    15819.98 *   1.48%*
Hmean     send-16384    26386.47 (   0.00%)    25752.25 *  -2.40%*    26773.74 *   1.47%*

The perf diff between nouclamp and uclamp-static-key when uclamp is
disabled in the fast path:

     8.73%     -1.55%  [kernel.kallsyms]        [k] try_to_wake_up
     0.07%     +0.04%  [kernel.kallsyms]        [k] deactivate_task
     0.13%     -0.02%  [kernel.kallsyms]        [k] activate_task

The diff between nouclamp and uclamp-static-key when uclamp is enabled
in the fast path:

     8.73%     -0.72%  [kernel.kallsyms]        [k] try_to_wake_up
     0.13%     +0.39%  [kernel.kallsyms]        [k] activate_task
     0.07%     +0.38%  [kernel.kallsyms]        [k] deactivate_task

Reported-by: Mel Gorman <mgorman@suse.de>
Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
CC: Patrick Bellasi <patrick.bellasi@matbug.net>
Cc: Chris Redpath <chris.redpath@arm.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: linux-kernel@vger.kernel.org
---

This takes a different approach to PSI which introduces a config option

```
      CONFIG_PSI_DEFAULT_DISABLED

        Require boot parameter to enable pressure stall information
        tracking (NEW)

      boot param psi
```

via commit e0c274472d5d "psi: make disabling/enabling easier for vendor kernels"

uclamp has clearer points of entry when userspace would like to use it, so we
can automatically flip the switch if the kernel is running on a userspace that
wants to use util clamp, without any extra userspace-visible switches.

I wanted to make this dependent on schedutil being the governor too, but
besides the complexity, uclamp is also used for capacity awareness. We could
certainly construct a more complex condition, but I'm not sure it's worth it.
Open to hearing more opinions and points of view on this :)


 kernel/sched/core.c              | 59 +++++++++++++++++++++++++++++---
 kernel/sched/cpufreq_schedutil.c |  2 +-
 kernel/sched/sched.h             |  7 ++++
 3 files changed, 63 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 235b2cae00a0..e2f1fffa013c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -794,6 +794,25 @@ unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
 /* All clamps are required to be less or equal than these values */
 static struct uclamp_se uclamp_default[UCLAMP_CNT];
 
+/*
+ * This static key is used to reduce the uclamp overhead in the fast path. It
+ * only disables the call to uclamp_rq_{inc, dec}() in enqueue/dequeue_task().
+ *
+ * This allows users to continue to enable uclamp in their kernel config with
+ * minimum uclamp overhead in the fast path.
+ *
+ * As soon as userspace modifies any of the uclamp knobs, the static key is
+ * enabled, since we have an actual users that make use of uclamp
+ * functionality.
+ *
+ * The knobs that would enable this static key are:
+ *
+ *   * A task modifying its uclamp value with sched_setattr().
+ *   * An admin modifying the sysctl_sched_uclamp_{min, max} via procfs.
+ *   * An admin modifying the cgroup cpu.uclamp.{min, max}
+ */
+static DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);
+
 /* Integer rounded range for each bucket */
 #define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
 
@@ -994,9 +1013,16 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
 	lockdep_assert_held(&rq->lock);
 
 	bucket = &uc_rq->bucket[uc_se->bucket_id];
-	SCHED_WARN_ON(!bucket->tasks);
-	if (likely(bucket->tasks))
-		bucket->tasks--;
+
+	/*
+	 * This could happen if sched_uclamp_used was enabled while the
+	 * current task was running, hence we could end up with unbalanced call
+	 * to uclamp_rq_dec_id().
+	 */
+	if (unlikely(!bucket->tasks))
+		return;
+
+	bucket->tasks--;
 	uc_se->active = false;
 
 	/*
@@ -1032,6 +1058,13 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 {
 	enum uclamp_id clamp_id;
 
+	/*
+	 * Avoid any overhead until uclamp is actually used by the userspace.
+	 * Including the branch if we use static_branch_likely()
+	 */
+	if (!static_branch_unlikely(&sched_uclamp_used))
+		return;
+
 	if (unlikely(!p->sched_class->uclamp_enabled))
 		return;
 
@@ -1047,6 +1080,13 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
 {
 	enum uclamp_id clamp_id;
 
+	/*
+	 * Avoid any overhead until uclamp is actually used by the userspace.
+	 * Including the branch if we use static_branch_likely()
+	 */
+	if (!static_branch_unlikely(&sched_uclamp_used))
+		return;
+
 	if (unlikely(!p->sched_class->uclamp_enabled))
 		return;
 
@@ -1155,8 +1195,10 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 		update_root_tg = true;
 	}
 
-	if (update_root_tg)
+	if (update_root_tg) {
+		static_branch_enable(&sched_uclamp_used);
 		uclamp_update_root_tg();
+	}
 
 	/*
 	 * We update all RUNNABLE tasks only when task groups are in use.
@@ -1221,6 +1263,8 @@ static void __setscheduler_uclamp(struct task_struct *p,
 	if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
 		return;
 
+	static_branch_enable(&sched_uclamp_used);
+
 	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
 		uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
 			      attr->sched_util_min, true);
@@ -1289,6 +1333,11 @@ static void __init init_uclamp(void)
 	}
 }
 
+bool uclamp_is_enabled(void)
+{
+	return static_branch_likely(&sched_uclamp_used);
+}
+
 #else /* CONFIG_UCLAMP_TASK */
 static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
 static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
@@ -7387,6 +7436,8 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
 	if (req.ret)
 		return req.ret;
 
+	static_branch_enable(&sched_uclamp_used);
+
 	mutex_lock(&uclamp_mutex);
 	rcu_read_lock();
 
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 7fbaee24c824..3f4e296ccb67 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -210,7 +210,7 @@ unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
 	unsigned long dl_util, util, irq;
 	struct rq *rq = cpu_rq(cpu);
 
-	if (!IS_BUILTIN(CONFIG_UCLAMP_TASK) &&
+	if (!uclamp_is_enabled() &&
 	    type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
 		return max;
 	}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1d4e94c1e5fe..aff5430331e4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2371,6 +2371,8 @@ unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
 
 	return clamp(util, min_util, max_util);
 }
+
+bool uclamp_is_enabled(void);
 #else /* CONFIG_UCLAMP_TASK */
 static inline
 unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
@@ -2378,6 +2380,11 @@ unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
 {
 	return util;
 }
+
+static inline bool uclamp_is_enabled(void)
+{
+	return false;
+}
 #endif /* CONFIG_UCLAMP_TASK */
 
 #ifdef arch_scale_freq_capacity
-- 
2.17.1



* Re: [PATCH v4 0/2] sched: Optionally skip uclamp logic in fast path
  2020-06-25 15:43 [PATCH v4 0/2] sched: Optionally skip uclamp logic in fast path Qais Yousef
  2020-06-25 15:43 ` [PATCH v4 1/2] sched/uclamp: Fix initialization of struct uclamp_rq Qais Yousef
  2020-06-25 15:43 ` [PATCH v4 2/2] sched/uclamp: Protect uclamp fast path code with static key Qais Yousef
@ 2020-06-26 10:00 ` Lukasz Luba
  2020-06-26 10:27 ` Peter Zijlstra
  3 siblings, 0 replies; 11+ messages in thread
From: Lukasz Luba @ 2020-06-26 10:00 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Peter Zijlstra, Valentin Schneider, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Patrick Bellasi, Chris Redpath, linux-kernel

Hi Qais,

On 6/25/20 4:43 PM, Qais Yousef wrote:
> This series attempts to address the report that uclamp logic could be expensive
> sometimes and shows a regression in netperf UDP_STREAM under certain
> conditions.
> 
> The first patch is a fix for how struct uclamp_rq is initialized which is
> required by the 2nd patch which contains the real 'fix'.
> 
> Worth noting that the root cause of the overhead is believed to be system
> specific or related to potential certain code/data layout issues, leading to
> worse I/D $ performance.
> 
> Different systems exhibited different behaviors and the regression did
> disappear in certain kernel version while attempting to reporoduce.
> 
> More info can be found here:
> 
> https://lore.kernel.org/lkml/20200616110824.dgkkbyapn3io6wik@e107158-lin/
> 
> Having the static key seemed the best thing to do to ensure the effect of
> uclamp is minimized for kernels that compile it in but don't have a userspace
> that uses it, which will allow distros to distribute uclamp capable kernels by
> default without having to compromise on performance for some systems that could
> be affected.
> 
> Changes in v4:
> 	* Fix broken boosting of RT tasks when static key is disabled.
> 
> Changes in v3:
> 	* Avoid double negatives and rename the static key to uclamp_used
> 	* Unconditionally enable the static key through any of the paths where
> 	  the user can modify the default uclamp value.
> 	* Use C99 named struct initializer for struct uclamp_rq which is easier
> 	  to read than the memset().
> 
> Changes in v2:
> 	* Add more info in the commit message about the result of perf diff to
> 	  demonstrate that the activate/deactivate_task pressure is reduced in
> 	  the fast path.
> 
> 	* Fix sparse warning reported by the test robot.
> 
> 	* Add an extra commit about using static_branch_likely() instead of
> 	  static_branc_unlikely().
> 


I've tried this v4 series with mmtest netperf-udp (30x each UDP
size) - results are good (just double checking and making sure
the tag indicating that v4 was tested can be applied).

                      v5.7-rc7-base-noucl    v5.7-rc7-ucl-tsk-nofix    v5.7-rc7-ucl-tsk-grp-fix_v4
Hmean     send-64          62.15 (   0.00%)       59.65 *  -4.02%*       65.86 *   5.97%*
Hmean     send-128        122.88 (   0.00%)      119.37 *  -2.85%*      131.75 *   7.22%*
Hmean     send-256        244.85 (   0.00%)      234.26 *  -4.32%*      259.33 *   5.92%*
Hmean     send-1024       919.24 (   0.00%)      880.67 *  -4.20%*      979.49 *   6.55%*
Hmean     send-2048      1689.45 (   0.00%)     1647.54 *  -2.48%*     1805.21 *   6.85%*
Hmean     send-3312      2542.36 (   0.00%)     2485.23 *  -2.25%*     2658.30 *   4.56%*
Hmean     send-4096      2935.69 (   0.00%)     2861.09 *  -2.54%*     3083.08 *   5.02%*
Hmean     send-8192      4800.35 (   0.00%)     4680.09 *  -2.51%*     4984.22 *   3.83%*
Hmean     send-16384     7473.66 (   0.00%)     7349.60 *  -1.66%*     7703.88 *   3.08%*
Hmean     recv-64          62.15 (   0.00%)       59.65 *  -4.03%*       65.85 *   5.96%*
Hmean     recv-128        122.88 (   0.00%)      119.37 *  -2.85%*      131.74 *   7.21%*
Hmean     recv-256        244.84 (   0.00%)      234.26 *  -4.32%*      259.33 *   5.92%*
Hmean     recv-1024       919.24 (   0.00%)      880.67 *  -4.20%*      979.46 *   6.55%*
Hmean     recv-2048      1689.44 (   0.00%)     1647.54 *  -2.48%*     1805.17 *   6.85%*
Hmean     recv-3312      2542.36 (   0.00%)     2485.23 *  -2.25%*     2657.67 *   4.54%*
Hmean     recv-4096      2935.69 (   0.00%)     2861.09 *  -2.54%*     3082.58 *   5.00%*
Hmean     recv-8192      4800.35 (   0.00%)     4678.15 *  -2.55%*     4982.49 *   3.79%*
Hmean     recv-16384     7473.63 (   0.00%)     7349.52 *  -1.66%*     7701.53 *   3.05%*

You can add my:

Tested-by: Lukasz Luba <lukasz.luba@arm.com>

If anyone would like to see some other tests, please let me know;
maybe I can set something up.

Regards,
Lukasz


* Re: [PATCH v4 0/2] sched: Optionally skip uclamp logic in fast path
  2020-06-25 15:43 [PATCH v4 0/2] sched: Optionally skip uclamp logic in fast path Qais Yousef
                   ` (2 preceding siblings ...)
  2020-06-26 10:00 ` [PATCH v4 0/2] sched: Optionally skip uclamp logic in fast path Lukasz Luba
@ 2020-06-26 10:27 ` Peter Zijlstra
  3 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2020-06-26 10:27 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Valentin Schneider, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Patrick Bellasi, Chris Redpath, Lukasz Luba, linux-kernel

On Thu, Jun 25, 2020 at 04:43:50PM +0100, Qais Yousef wrote:
> Qais Yousef (2):
>   sched/uclamp: Fix initialization of struct uclamp_rq
>   sched/uclamp: Protect uclamp fast path code with static key
> 
>  kernel/sched/core.c              | 80 ++++++++++++++++++++++++++++----
>  kernel/sched/cpufreq_schedutil.c |  2 +-
>  kernel/sched/sched.h             |  7 +++
>  3 files changed, 79 insertions(+), 10 deletions(-)

Thanks!


* Re: [PATCH v4 1/2] sched/uclamp: Fix initialization of struct uclamp_rq
  2020-06-25 15:43 ` [PATCH v4 1/2] sched/uclamp: Fix initialization of struct uclamp_rq Qais Yousef
@ 2020-06-26 12:32   ` Patrick Bellasi
  2020-06-26 23:17     ` Valentin Schneider
  2020-06-29 12:12     ` Qais Yousef
  0 siblings, 2 replies; 11+ messages in thread
From: Patrick Bellasi @ 2020-06-26 12:32 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Peter Zijlstra, Valentin Schneider, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Chris Redpath, Lukasz Luba, linux-kernel


On Thu, Jun 25, 2020 at 17:43:51 +0200, Qais Yousef <qais.yousef@arm.com> wrote...

> struct uclamp_rq was zeroed out entirely in assumption that in the first
> call to uclamp_rq_inc() they'd be initialized correctly in accordance to
> default settings.

Perhaps I was not clear in my previous comment:

   https://lore.kernel.org/lkml/87sgekorfq.derkling@matbug.net/

when I did say:

   Does not this means the problem is more likely with
   uclamp_rq_util_with(), which should be guarded?

I did not mean that we have to guard the calls to that function but
instead that we should just make that function aware of uclamp being
opted in or not.

> But when next patch introduces a static key to skip
> uclamp_rq_{inc,dec}() until userspace opts in to use uclamp, schedutil
> will fail to perform any frequency changes because the
> rq->uclamp[UCLAMP_MAX].value is zeroed at init and stays as such. Which
> means all rqs are capped to 0 by default.

The initialization you want to do here is needed because with the
current approach you keep calling the same uclamp_rq_util_with() and
keep doing min/max aggregations even when uclamp is not opted in.
But this also means that we have min/max aggregation _when not really
required_.

> Fix it by making sure we do proper initialization at init without
> relying on uclamp_rq_inc() doing it later.

My proposal was as simple as:

---8<---
  static __always_inline
  unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
  				  struct task_struct *p)
  {
  	unsigned long min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
  	unsigned long max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
  
+       if (!static_branch_unlikely(&sched_uclamp_used))
+               return rt_task(p) ? uclamp_none(UCLAMP_MAX) : util 
  
  	if (p) {
  		min_util = max(min_util, uclamp_eff_value(p, UCLAMP_MIN));
  		max_util = max(max_util, uclamp_eff_value(p, UCLAMP_MAX));
  	}
  
  	/*
  	 * Since CPU's {min,max}_util clamps are MAX aggregated considering
  	 * RUNNABLE tasks with _different_ clamps, we can end up with an
  	 * inversion. Fix it now when the clamps are applied.
  	 */
  	if (unlikely(min_util >= max_util))
  		return min_util;
  
  	return clamp(util, min_util, max_util);
  }
---8<---

Such a small change is more self-contained IMHO and does not remove
an existing optimization like this lazy RQ initialization at first
usage.

Moreover, it can be folded into the following patch, with all the other
static key shortcuts.



* Re: [PATCH v4 2/2] sched/uclamp: Protect uclamp fast path code with static key
  2020-06-25 15:43 ` [PATCH v4 2/2] sched/uclamp: Protect uclamp fast path code with static key Qais Yousef
@ 2020-06-26 12:38   ` Patrick Bellasi
  2020-06-26 23:21     ` Valentin Schneider
  2020-06-29 12:21     ` Qais Yousef
  0 siblings, 2 replies; 11+ messages in thread
From: Patrick Bellasi @ 2020-06-26 12:38 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Peter Zijlstra, Valentin Schneider, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Chris Redpath, Lukasz Luba, linux-kernel


On Thu, Jun 25, 2020 at 17:43:52 +0200, Qais Yousef <qais.yousef@arm.com> wrote...

[...]

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 235b2cae00a0..e2f1fffa013c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -794,6 +794,25 @@ unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
>  /* All clamps are required to be less or equal than these values */
>  static struct uclamp_se uclamp_default[UCLAMP_CNT];
>  
> +/*
> + * This static key is used to reduce the uclamp overhead in the fast path. It
> + * only disables the call to uclamp_rq_{inc, dec}() in enqueue/dequeue_task().
> + *
> + * This allows users to continue to enable uclamp in their kernel config with
> + * minimum uclamp overhead in the fast path.
> + *
> + * As soon as userspace modifies any of the uclamp knobs, the static key is
> + * enabled, since we have an actual users that make use of uclamp
> + * functionality.
> + *
> + * The knobs that would enable this static key are:
> + *
> + *   * A task modifying its uclamp value with sched_setattr().
> + *   * An admin modifying the sysctl_sched_uclamp_{min, max} via procfs.
> + *   * An admin modifying the cgroup cpu.uclamp.{min, max}
> + */
> +static DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);

This is me being choosy, but given that:

---8<---
% grep -e '_used[ )]' fair.c core.c
fair.c:	return static_key_false(&__cfs_bandwidth_used);
fair.c:	static_key_slow_inc_cpuslocked(&__cfs_bandwidth_used);
fair.c:	static_key_slow_dec_cpuslocked(&__cfs_bandwidth_used);

% grep -e '_enabled[ )]' fair.c core.c
fair.c:	if (!cfs_bandwidth_used() || !cfs_rq->runtime_enabled)
fair.c:	if (!cfs_rq->runtime_enabled || cfs_rq->nr_running)
fair.c:	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
fair.c:	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
fair.c:	cfs_rq->runtime_enabled = 0;
fair.c:		cfs_rq->runtime_enabled = cfs_b->quota != RUNTIME_INF;
fair.c:		if (!cfs_rq->runtime_enabled)
fair.c:		cfs_rq->runtime_enabled = 0;
core.c:	if (static_key_false((&paravirt_steal_rq_enabled))) {
core.c:	if (unlikely(!p->sched_class->uclamp_enabled))
core.c:	if (unlikely(!p->sched_class->uclamp_enabled))
core.c:	 * Prevent race between setting of cfs_rq->runtime_enabled and
core.c:	runtime_enabled = quota != RUNTIME_INF;
core.c:	runtime_was_enabled = cfs_b->quota != RUNTIME_INF;
core.c:	if (runtime_enabled && !runtime_was_enabled)
core.c:	if (runtime_enabled)
core.c:		cfs_rq->runtime_enabled = runtime_enabled;
core.c:	if (runtime_was_enabled && !runtime_enabled)
---8<---

even just for consistency's sake, I would still prefer sched_uclamp_enabled for
the static key name.

> +
>  /* Integer rounded range for each bucket */
>  #define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
>  
> @@ -994,9 +1013,16 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
>  	lockdep_assert_held(&rq->lock);
>  
>  	bucket = &uc_rq->bucket[uc_se->bucket_id];
> -	SCHED_WARN_ON(!bucket->tasks);
> -	if (likely(bucket->tasks))
> -		bucket->tasks--;
> +
> +	/*
> +	 * This could happen if sched_uclamp_used was enabled while the
> +	 * current task was running, hence we could end up with unbalanced call
> +	 * to uclamp_rq_dec_id().
> +	 */
> +	if (unlikely(!bucket->tasks))
> +		return;
> +
> +	bucket->tasks--;
>  	uc_se->active = false;

In this chunk you are indeed changing the code.

Are we sure there are no issues with patterns like:

  enqueue(taskA)
  // uclamp gets enabled
  enqueue(taskB)
  dequeue(taskA)
  // bucket->tasks is now 0
  dequeue(taskB)

TaskB has been enqueued with uclamp enabled, thus it
has got uc_se->active=True and enforced its clamp value at RQ level.

But with your change above we don't reset that anymore.

As per my previous proposal: why not just remove the SCHED_WARN_ON?
That's the only real problem in the code above, since now we are no
longer guaranteed to have balanced inc/dec.
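
(For reference, a sketch of the alternative suggested above: the original hunk
with only the SCHED_WARN_ON() dropped, so uc_se->active is still cleared even
when the counter is already zero. Illustrative only, not a tested patch.)

```c
	bucket = &uc_rq->bucket[uc_se->bucket_id];
	/* No warning: inc/dec may be unbalanced if the key flipped mid-run. */
	if (likely(bucket->tasks))
		bucket->tasks--;
	/* Still mark the se inactive even if the bucket count was already 0. */
	uc_se->active = false;
```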

[...]

> +bool uclamp_is_enabled(void)
> +{
> +	return static_branch_likely(&sched_uclamp_used);
> +}
> +

The above and the following changes are not necessary if we just
add the guards at the beginning of uclamp_rq_util_with() instead of what
you do in PATCH1.

I think this will give better overhead removal by avoiding unnecessary
min/max clamps for both RT and CFS.
  
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 7fbaee24c824..3f4e296ccb67 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -210,7 +210,7 @@ unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
>  	unsigned long dl_util, util, irq;
>  	struct rq *rq = cpu_rq(cpu);
>  
> -	if (!IS_BUILTIN(CONFIG_UCLAMP_TASK) &&
> +	if (!uclamp_is_enabled() &&
>  	    type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
>  		return max;
>  	}
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 1d4e94c1e5fe..aff5430331e4 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2371,6 +2371,8 @@ unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
>  
>  	return clamp(util, min_util, max_util);
>  }
> +
> +bool uclamp_is_enabled(void);
>  #else /* CONFIG_UCLAMP_TASK */
>  static inline
>  unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
> @@ -2378,6 +2380,11 @@ unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
>  {
>  	return util;
>  }
> +
> +static inline bool uclamp_is_enabled(void)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_UCLAMP_TASK */
>  
>  #ifdef arch_scale_freq_capacity



* Re: [PATCH v4 1/2] sched/uclamp: Fix initialization of struct uclamp_rq
  2020-06-26 12:32   ` Patrick Bellasi
@ 2020-06-26 23:17     ` Valentin Schneider
  2020-06-29 12:12     ` Qais Yousef
  1 sibling, 0 replies; 11+ messages in thread
From: Valentin Schneider @ 2020-06-26 23:17 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Qais Yousef, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Chris Redpath, Lukasz Luba, linux-kernel


On 26/06/20 13:32, Patrick Bellasi wrote:
> On Thu, Jun 25, 2020 at 17:43:51 +0200, Qais Yousef <qais.yousef@arm.com> wrote...
>
>> struct uclamp_rq was zeroed out entirely in assumption that in the first
>> call to uclamp_rq_inc() they'd be initialized correctly in accordance to
>> default settings.
>
> Perhaps I was not clear in my previous comment:
>
>    https://lore.kernel.org/lkml/87sgekorfq.derkling@matbug.net/
>
> when I did say:
>
>    Does not this means the problem is more likely with
>    uclamp_rq_util_with(), which should be guarded?
>
> I did not mean that we have to guard the calls to that function but
> instead that we should just make that function aware of uclamp being
> opted in or not.
>
>> But when next patch introduces a static key to skip
>> uclamp_rq_{inc,dec}() until userspace opts in to use uclamp, schedutil
>> will fail to perform any frequency changes because the
>> rq->uclamp[UCLAMP_MAX].value is zeroed at init and stays as such. Which
>> means all rqs are capped to 0 by default.
>
> The initialization you wants to do here it's needed because with the
> current approach you keep calling the same uclamp_rq_util_with() and
> keep doing min/max aggregations even when uclamp is not opted in.
> But this means also that we have min/max aggregation _when not really
> required_.
>
>> Fix it by making sure we do proper initialization at init without
>> relying on uclamp_rq_inc() doing it later.
>
> My proposal was as simple as:
>
> ---8<---
>   static __always_inline
>   unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
>                                 struct task_struct *p)
>   {
>       unsigned long min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
>       unsigned long max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
>
> +       if (!static_branch_unlikely(&sched_uclamp_used))
> +               return rt_task(p) ? uclamp_none(UCLAMP_MAX) : util
>
>       if (p) {
>               min_util = max(min_util, uclamp_eff_value(p, UCLAMP_MIN));
>               max_util = max(max_util, uclamp_eff_value(p, UCLAMP_MAX));
>       }
>
>       /*
>        * Since CPU's {min,max}_util clamps are MAX aggregated considering
>        * RUNNABLE tasks with _different_ clamps, we can end up with an
>        * inversion. Fix it now when the clamps are applied.
>        */
>       if (unlikely(min_util >= max_util))
>               return min_util;
>
>       return clamp(util, min_util, max_util);
>   }
> ---8<---
>
> Such small change is more self-contained IMHO and does not remove
> an existing optimizations like this lazy RQ's initialization at first
> usage.
>
> Moreover, it can folded in the following patch, with all the other
> static keys shortcuts.

I'd have to think it over some more, but I like this in that we wouldn't
have to molest schedutil anymore.


* Re: [PATCH v4 2/2] sched/uclamp: Protect uclamp fast path code with static key
  2020-06-26 12:38   ` Patrick Bellasi
@ 2020-06-26 23:21     ` Valentin Schneider
  2020-06-29 12:21     ` Qais Yousef
  1 sibling, 0 replies; 11+ messages in thread
From: Valentin Schneider @ 2020-06-26 23:21 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Qais Yousef, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Chris Redpath, Lukasz Luba, linux-kernel


On 26/06/20 13:38, Patrick Bellasi wrote:
> On Thu, Jun 25, 2020 at 17:43:52 +0200, Qais Yousef <qais.yousef@arm.com> wrote...
>> @@ -994,9 +1013,16 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
>>      lockdep_assert_held(&rq->lock);
>>
>>      bucket = &uc_rq->bucket[uc_se->bucket_id];
>> -	SCHED_WARN_ON(!bucket->tasks);
>> -	if (likely(bucket->tasks))
>> -		bucket->tasks--;
>> +
>> +	/*
>> +	 * This could happen if sched_uclamp_used was enabled while the
>> +	 * current task was running, hence we could end up with unbalanced call
>> +	 * to uclamp_rq_dec_id().
>> +	 */
>> +	if (unlikely(!bucket->tasks))
>> +		return;
>> +
>> +	bucket->tasks--;
>>      uc_se->active = false;
>
> In this chunk you are indeed changing the code.
>
> Are we sure there are not issues with patterns like:
>
>   enqueue(taskA)
>   // uclamp gets enabled
>   enqueue(taskB)
>   dequeue(taskA)
>   // bucket->tasks is now 0
>   dequeue(taskB)
>
> TaskB has been enqueued with with uclamp enabled, thus it
> has got uc_se->active=True and enforced its clamp value at RQ level.
>
> But with your change above we don't reset that anymore.
>

Harumph indeed...

> As per my previous proposal: why not just removing the SCHED_WARN_ON?
> That's the only real problem in the code above, since now we are not
> more granted to have balanced inc/dec.
>

The SCHED_WARN_ON is gone; were you thinking of something else?


* Re: [PATCH v4 1/2] sched/uclamp: Fix initialization of struct uclamp_rq
  2020-06-26 12:32   ` Patrick Bellasi
  2020-06-26 23:17     ` Valentin Schneider
@ 2020-06-29 12:12     ` Qais Yousef
  1 sibling, 0 replies; 11+ messages in thread
From: Qais Yousef @ 2020-06-29 12:12 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Ingo Molnar, Peter Zijlstra, Valentin Schneider, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Chris Redpath, Lukasz Luba, linux-kernel

Hi Patrick

On 06/26/20 14:32, Patrick Bellasi wrote:
> 
> On Thu, Jun 25, 2020 at 17:43:51 +0200, Qais Yousef <qais.yousef@arm.com> wrote...
> 
> > struct uclamp_rq was zeroed out entirely in assumption that in the first
> > call to uclamp_rq_inc() they'd be initialized correctly in accordance to
> > default settings.
> 
> Perhaps I was not clear in my previous comment:
> 
>    https://lore.kernel.org/lkml/87sgekorfq.derkling@matbug.net/
> 
> when I did say:
> 
>    Does not this means the problem is more likely with
>    uclamp_rq_util_with(), which should be guarded?
> 
> I did not mean that we have to guard the calls to that function but
> instead that we should just make that function aware of uclamp being
> opted in or not.
> 
> > But when next patch introduces a static key to skip
> > uclamp_rq_{inc,dec}() until userspace opts in to use uclamp, schedutil
> > will fail to perform any frequency changes because the
> > rq->uclamp[UCLAMP_MAX].value is zeroed at init and stays as such. Which
> > means all rqs are capped to 0 by default.
> 
> The initialization you wants to do here it's needed because with the
> current approach you keep calling the same uclamp_rq_util_with() and
> keep doing min/max aggregations even when uclamp is not opted in.
> But this means also that we have min/max aggregation _when not really
> required_.
> 
> > Fix it by making sure we do proper initialization at init without
> > relying on uclamp_rq_inc() doing it later.
> 
> My proposal was as simple as:
> 
> ---8<---
>   static __always_inline
>   unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
>   				  struct task_struct *p)
>   {
>   	unsigned long min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
>   	unsigned long max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
>   
> +       if (!static_branch_unlikely(&sched_uclamp_used))
> +               return rt_task(p) ? uclamp_none(UCLAMP_MAX) : util 

This is the same mistake I made.

schedutil_cpu_util() is called by sugov_get_util() passing p as NULL.

292 static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)                                                                                                                                               
293 {                                                                               
294         struct rq *rq = cpu_rq(sg_cpu->cpu);                                    
295         unsigned long util = cpu_util_cfs(rq);                                  
296         unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);               
297                                                                                 
298         sg_cpu->max = max;                                                      
299         sg_cpu->bw_dl = cpu_bw_dl(rq);                                          
300                                                                                 
301         return schedutil_cpu_util(sg_cpu->cpu, util, max, FREQUENCY_UTIL, NULL);
302 }

So this will not fix the problem with RT tasks, and we have to do the check
at schedutil, which is the right abstraction layer anyway. I don't think
calling rt_rq_is_runnable() is right here; it is unnecessary duplication since
schedutil will still have to do that anyway.

When p is not NULL, the function will work correctly with regard to
uclamp_eff_get() returning the right value, including the default settings,
which is important when my patch to introduce a sysctl for default RT boosting
comes into play.

The function deserves a comment for sure though so that users are aware that
the function will not be useful if the static key is disabled. There's no
aggregation done at rq level and this can certainly catch people out.

We can add a static key check to return util directly too. I might just do
this since it might help with readability.
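
A minimal sketch of what that early return could look like, assuming the
sched_uclamp_used key is made visible here (illustrative only, not the final
patch):

```c
static __always_inline
unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
				  struct task_struct *p)
{
	unsigned long min_util, max_util;

	/*
	 * Bail out before any rq reads or min/max aggregation when userspace
	 * never opted in. Works for p == NULL too, so the RT boosting
	 * decision stays in schedutil_cpu_util().
	 */
	if (!static_branch_likely(&sched_uclamp_used))
		return util;

	min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
	max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);

	if (p) {
		min_util = max(min_util, uclamp_eff_value(p, UCLAMP_MIN));
		max_util = max(max_util, uclamp_eff_value(p, UCLAMP_MAX));
	}

	/*
	 * Since CPU's {min,max}_util clamps are MAX aggregated considering
	 * RUNNABLE tasks with _different_ clamps, we can end up with an
	 * inversion. Fix it now when the clamps are applied.
	 */
	if (unlikely(min_util >= max_util))
		return min_util;

	return clamp(util, min_util, max_util);
}
```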

Please shout if I missed something.

>   
>   	if (p) {
>   		min_util = max(min_util, uclamp_eff_value(p, UCLAMP_MIN));
>   		max_util = max(max_util, uclamp_eff_value(p, UCLAMP_MAX));
>   	}
>   
>   	/*
>   	 * Since CPU's {min,max}_util clamps are MAX aggregated considering
>   	 * RUNNABLE tasks with _different_ clamps, we can end up with an
>   	 * inversion. Fix it now when the clamps are applied.
>   	 */
>   	if (unlikely(min_util >= max_util))
>   		return min_util;
>   
>   	return clamp(util, min_util, max_util);
>   }
> ---8<---
> 
> Such small change is more self-contained IMHO and does not remove
> an existing optimizations like this lazy RQ's initialization at first

IMHO this is a bug, not an optimization. We have to write to these structs
anyway, so why not write the right value to start with? This already caught me
out and wasted over half a day of chasing it.

Besides, what are we optimizing for here? I think correctness dictates we start
with correct initial values if we can, no? How do you expect this to slow
down sched_init()?

Thanks

--
Qais Yousef

> usage.
> 
> Moreover, it can folded in the following patch, with all the other
> static keys shortcuts.
> 


* Re: [PATCH v4 2/2] sched/uclamp: Protect uclamp fast path code with static key
  2020-06-26 12:38   ` Patrick Bellasi
  2020-06-26 23:21     ` Valentin Schneider
@ 2020-06-29 12:21     ` Qais Yousef
  1 sibling, 0 replies; 11+ messages in thread
From: Qais Yousef @ 2020-06-29 12:21 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Ingo Molnar, Peter Zijlstra, Valentin Schneider, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Chris Redpath, Lukasz Luba, linux-kernel

On 06/26/20 14:38, Patrick Bellasi wrote:
> 
> On Thu, Jun 25, 2020 at 17:43:52 +0200, Qais Yousef <qais.yousef@arm.com> wrote...
> 
> [...]
> 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 235b2cae00a0..e2f1fffa013c 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -794,6 +794,25 @@ unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
> >  /* All clamps are required to be less or equal than these values */
> >  static struct uclamp_se uclamp_default[UCLAMP_CNT];
> >  
> > +/*
> > + * This static key is used to reduce the uclamp overhead in the fast path. It
> > + * only disables the call to uclamp_rq_{inc, dec}() in enqueue/dequeue_task().
> > + *
> > + * This allows users to continue to enable uclamp in their kernel config with
> > + * minimum uclamp overhead in the fast path.
> > + *
> > + * As soon as userspace modifies any of the uclamp knobs, the static key is
> > + * enabled, since we have an actual users that make use of uclamp
> > + * functionality.
> > + *
> > + * The knobs that would enable this static key are:
> > + *
> > + *   * A task modifying its uclamp value with sched_setattr().
> > + *   * An admin modifying the sysctl_sched_uclamp_{min, max} via procfs.
> > + *   * An admin modifying the cgroup cpu.uclamp.{min, max}
> > + */
> > +static DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);
> 
> This is me being choosy, but given that:

uclamp_enabled is already used. And technically uclamp is enabled but not used
here. We can use active too, but I prefer this term as it refers to userspace
opting in to use util clamp.

> 
> ---8<---
> % grep -e '_used[ )]' fair.c core.c
> fair.c:	return static_key_false(&__cfs_bandwidth_used);
> fair.c:	static_key_slow_inc_cpuslocked(&__cfs_bandwidth_used);
> fair.c:	static_key_slow_dec_cpuslocked(&__cfs_bandwidth_used);
> 
> % grep -e '_enabled[ )]' fair.c core.c
> fair.c:	if (!cfs_bandwidth_used() || !cfs_rq->runtime_enabled)
> fair.c:	if (!cfs_rq->runtime_enabled || cfs_rq->nr_running)
> fair.c:	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
> fair.c:	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
> fair.c:	cfs_rq->runtime_enabled = 0;
> fair.c:		cfs_rq->runtime_enabled = cfs_b->quota != RUNTIME_INF;
> fair.c:		if (!cfs_rq->runtime_enabled)
> fair.c:		cfs_rq->runtime_enabled = 0;
> core.c:	if (static_key_false((&paravirt_steal_rq_enabled))) {
> core.c:	if (unlikely(!p->sched_class->uclamp_enabled))
> core.c:	if (unlikely(!p->sched_class->uclamp_enabled))
> core.c:	 * Prevent race between setting of cfs_rq->runtime_enabled and
> core.c:	runtime_enabled = quota != RUNTIME_INF;
> core.c:	runtime_was_enabled = cfs_b->quota != RUNTIME_INF;
> core.c:	if (runtime_enabled && !runtime_was_enabled)
> core.c:	if (runtime_enabled)
> core.c:		cfs_rq->runtime_enabled = runtime_enabled;
> core.c:	if (runtime_was_enabled && !runtime_enabled)
> ---8<---
> 
> even just for consistency shake, I would still prefer sched_uclamp_enabled for
> the static key name.
> 
> > +
> >  /* Integer rounded range for each bucket */
> >  #define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
> >  
> > @@ -994,9 +1013,16 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
> >  	lockdep_assert_held(&rq->lock);
> >  
> >  	bucket = &uc_rq->bucket[uc_se->bucket_id];
> > -	SCHED_WARN_ON(!bucket->tasks);
> > -	if (likely(bucket->tasks))
> > -		bucket->tasks--;
> > +
> > +	/*
> > +	 * This could happen if sched_uclamp_used was enabled while the
> > +	 * current task was running, hence we could end up with unbalanced call
> > +	 * to uclamp_rq_dec_id().
> > +	 */
> > +	if (unlikely(!bucket->tasks))
> > +		return;
> > +
> > +	bucket->tasks--;
> >  	uc_se->active = false;
> 
> In this chunk you are indeed changing the code.
> 
> Are we sure there are not issues with patterns like:
> 
>   enqueue(taskA)
>   // uclamp gets enabled
>   enqueue(taskB)
>   dequeue(taskA)
>   // bucket->tasks is now 0
>   dequeue(taskB)

Hmm, to be honest I don't know if this ordering problem could happen.

> 
> TaskB has been enqueued with with uclamp enabled, thus it
> has got uc_se->active=True and enforced its clamp value at RQ level.
> 
> But with your change above we don't reset that anymore.
> 
> As per my previous proposal: why not just removing the SCHED_WARN_ON?
> That's the only real problem in the code above, since now we are not
> more granted to have balanced inc/dec.

Yeah, I agree. I'll reinstate it and add a comment about this ordering issue too.

> 
> [...]
> 
> > +bool uclamp_is_enabled(void)
> > +{
> > +	return static_branch_likely(&sched_uclamp_used);
> > +}
> > +
> 
> The above and the following changes are not necessary if we just
> add the guards at the beginning of uclamp_rq_util_with() instead of what
> you do in PATCH1.

I addressed that in patch 1, and hopefully my reasoning made it clear why we
can't do what you're suggesting in uclamp_rq_util_with().

> 
> I think this will give better overheads removal by avoiding not
> necessary min/max clamps for both RT and CFS.

As I stated earlier, uclamp_rq_util_with() is called from schedutil_cpu_util()
and find_energy_efficient_cpu(). I don't think it's an overhead if they continue
to call it since it'll not break correctness and speed is not an issue.

But I agree this could be a potential cause of confusion at least. I'll add
a comment there and probably make it return early so it's more readable and
future-proof.

Thanks

--
Qais Yousef

>   
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 7fbaee24c824..3f4e296ccb67 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -210,7 +210,7 @@ unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
> >  	unsigned long dl_util, util, irq;
> >  	struct rq *rq = cpu_rq(cpu);
> >  
> > -	if (!IS_BUILTIN(CONFIG_UCLAMP_TASK) &&
> > +	if (!uclamp_is_enabled() &&
> >  	    type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
> >  		return max;
> >  	}
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 1d4e94c1e5fe..aff5430331e4 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2371,6 +2371,8 @@ unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
> >  
> >  	return clamp(util, min_util, max_util);
> >  }
> > +
> > +bool uclamp_is_enabled(void);
> >  #else /* CONFIG_UCLAMP_TASK */
> >  static inline
> >  unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
> > @@ -2378,6 +2380,11 @@ unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
> >  {
> >  	return util;
> >  }
> > +
> > +static inline bool uclamp_is_enabled(void)
> > +{
> > +	return false;
> > +}
> >  #endif /* CONFIG_UCLAMP_TASK */
> >  
> >  #ifdef arch_scale_freq_capacity
> 
