linux-kernel.vger.kernel.org archive mirror
* [PATCH v6 0/2] sched: Optionally skip uclamp logic in fast path
@ 2020-06-30 11:21 Qais Yousef
  2020-06-30 11:21 ` [PATCH v6 1/2] sched/uclamp: Fix initialization of struct uclamp_rq Qais Yousef
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Qais Yousef @ 2020-06-30 11:21 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Valentin Schneider, Qais Yousef, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Patrick Bellasi, Chris Redpath, Lukasz Luba, linux-kernel

This series attempts to address the report that uclamp logic could be expensive
sometimes and shows a regression in netperf UDP_STREAM under certain
conditions.

The first patch is a fix for how struct uclamp_rq is initialized which is
required by the 2nd patch which contains the real 'fix'.

Worth noting that the root cause of the overhead is believed to be system
specific or related to potential code/data layout issues, leading to
worse I/D $ performance.

Different systems exhibited different behaviors and the regression did
disappear in certain kernel versions while attempting to reproduce.

More info can be found here:

https://lore.kernel.org/lkml/20200616110824.dgkkbyapn3io6wik@e107158-lin/

Having the static key seemed the best thing to do to ensure the effect of
uclamp is minimized for kernels that compile it in but don't have a userspace
that uses it, which will allow distros to distribute uclamp capable kernels by
default without having to compromise on performance for some systems that could
be affected.
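
To make this concrete, the pattern patch 2 uses boils down to the
following condensed sketch (names are the ones used in the patch; see
the actual diff for the real code):

```
/* Off by default; the fast path pays (nearly) nothing until it flips. */
DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);

static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
{
	/* Compiles down to a NOP-like jump while the key is disabled. */
	if (!static_branch_unlikely(&sched_uclamp_used))
		return;

	/* ... uclamp accounting only runs once userspace has opted in ... */
}

/* Any of the opt-in paths (sched_setattr(), sysctl, cgroup) does: */
static_branch_enable(&sched_uclamp_used);
```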

Changes in v6:
	* s/uclamp_is_enabled/uclamp_is_used/ + add comment
	* Improve the bailout condition for the case where we could end up with
	  unbalanced call of uclamp_rq_dec_id()
	* Clarify some comments.

Changes in v5:
	* Fix a race that could happen when the enqueue/dequeue of tasks
	  A and B happens out of order while sched_uclamp_used is enabled in
	  between.
	* Add more comments explaining the race and the behavior of
	  uclamp_rq_util_with(), which is now protected with a static key so
	  that it is a NOP when disabled. When no uclamp aggregation at rq
	  level is done, this function can't do much.

Changes in v4:
	* Fix broken boosting of RT tasks when static key is disabled.

Changes in v3:
	* Avoid double negatives and rename the static key to uclamp_used
	* Unconditionally enable the static key through any of the paths where
	  the user can modify the default uclamp value.
	* Use C99 named struct initializer for struct uclamp_rq which is easier
	  to read than the memset().

Changes in v2:
	* Add more info in the commit message about the result of perf diff to
	  demonstrate that the activate/deactivate_task pressure is reduced in
	  the fast path.

	* Fix sparse warning reported by the test robot.

	* Add an extra commit about using static_branch_likely() instead of
	  static_branch_unlikely().

Thanks

--
Qais Yousef

Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
CC: Patrick Bellasi <patrick.bellasi@matbug.net>
Cc: Chris Redpath <chris.redpath@arm.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: linux-kernel@vger.kernel.org


Qais Yousef (2):
  sched/uclamp: Fix initialization of struct uclamp_rq
  sched/uclamp: Protect uclamp fast path code with static key

 kernel/sched/core.c              | 95 ++++++++++++++++++++++++++++++--
 kernel/sched/cpufreq_schedutil.c |  2 +-
 kernel/sched/sched.h             | 47 +++++++++++++++-
 3 files changed, 135 insertions(+), 9 deletions(-)

-- 
2.17.1



* [PATCH v6 1/2] sched/uclamp: Fix initialization of struct uclamp_rq
  2020-06-30 11:21 [PATCH v6 0/2] sched: Optionally skip uclamp logic in fast path Qais Yousef
@ 2020-06-30 11:21 ` Qais Yousef
  2020-07-09  8:45   ` [tip: sched/core] " tip-bot2 for Qais Yousef
  2020-06-30 11:21 ` [PATCH v6 2/2] sched/uclamp: Protect uclamp fast path code with static key Qais Yousef
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Qais Yousef @ 2020-06-30 11:21 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Valentin Schneider, Qais Yousef, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Patrick Bellasi, Chris Redpath, Lukasz Luba, linux-kernel

struct uclamp_rq was zeroed out entirely in the assumption that in the
first call to uclamp_rq_inc() it would be initialized correctly in
accordance with the default settings.

But when the next patch introduces a static key to skip
uclamp_rq_{inc,dec}() until userspace opts in to use uclamp, schedutil
will fail to perform any frequency changes because
rq->uclamp[UCLAMP_MAX].value is zeroed at init and stays as such, which
means all rqs are capped to 0 by default.

Fix it by making sure we do proper initialization at init without
relying on uclamp_rq_inc() doing it later.
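
For reference, the proper per-clamp defaults come from uclamp_none(),
which is roughly (paraphrased from kernel/sched/core.c):

```
/* The 'no clamping' default for a given clamp id. */
static inline unsigned int uclamp_none(enum uclamp_id clamp_id)
{
	if (clamp_id == UCLAMP_MIN)
		return 0;		/* no minimum boost */
	return SCHED_CAPACITY_SCALE;	/* 1024: no maximum cap */
}
```

so a zeroed rq->uclamp[UCLAMP_MAX].value is the exact opposite of the
intended default.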

Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
CC: Patrick Bellasi <patrick.bellasi@matbug.net>
Cc: Chris Redpath <chris.redpath@arm.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: linux-kernel@vger.kernel.org
---
 kernel/sched/core.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8fe2ac910bed..235b2cae00a0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1248,6 +1248,20 @@ static void uclamp_fork(struct task_struct *p)
 	}
 }
 
+static void __init init_uclamp_rq(struct rq *rq)
+{
+	enum uclamp_id clamp_id;
+	struct uclamp_rq *uc_rq = rq->uclamp;
+
+	for_each_clamp_id(clamp_id) {
+		uc_rq[clamp_id] = (struct uclamp_rq) {
+			.value = uclamp_none(clamp_id)
+		};
+	}
+
+	rq->uclamp_flags = 0;
+}
+
 static void __init init_uclamp(void)
 {
 	struct uclamp_se uc_max = {};
@@ -1256,11 +1270,8 @@ static void __init init_uclamp(void)
 
 	mutex_init(&uclamp_mutex);
 
-	for_each_possible_cpu(cpu) {
-		memset(&cpu_rq(cpu)->uclamp, 0,
-				sizeof(struct uclamp_rq)*UCLAMP_CNT);
-		cpu_rq(cpu)->uclamp_flags = 0;
-	}
+	for_each_possible_cpu(cpu)
+		init_uclamp_rq(cpu_rq(cpu));
 
 	for_each_clamp_id(clamp_id) {
 		uclamp_se_set(&init_task.uclamp_req[clamp_id],
-- 
2.17.1



* [PATCH v6 2/2] sched/uclamp: Protect uclamp fast path code with static key
  2020-06-30 11:21 [PATCH v6 0/2] sched: Optionally skip uclamp logic in fast path Qais Yousef
  2020-06-30 11:21 ` [PATCH v6 1/2] sched/uclamp: Fix initialization of struct uclamp_rq Qais Yousef
@ 2020-06-30 11:21 ` Qais Yousef
  2020-06-30 17:07   ` Peter Zijlstra
  2020-07-09  8:45   ` [tip: sched/core] " tip-bot2 for Qais Yousef
  2020-07-01 16:32 ` [PATCH v6 0/2] sched: Optionally skip uclamp logic in fast path Lukasz Luba
  2020-07-03 12:09 ` Vincent Guittot
  3 siblings, 2 replies; 14+ messages in thread
From: Qais Yousef @ 2020-06-30 11:21 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Valentin Schneider, Qais Yousef, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Patrick Bellasi, Chris Redpath, Lukasz Luba, linux-kernel

There is a report that when uclamp is enabled, a netperf UDP test
regresses compared to a kernel compiled without uclamp.

https://lore.kernel.org/lkml/20200529100806.GA3070@suse.de/

While investigating the root cause, there was no sign that the uclamp
code is doing anything particularly expensive, but it could suffer from
bad cache behavior under certain circumstances that are yet to be
understood.

https://lore.kernel.org/lkml/20200616110824.dgkkbyapn3io6wik@e107158-lin/

To reduce the pressure on the fast path anyway, add a static key that,
by default, will skip executing the uclamp logic in the
enqueue/dequeue_task() fast path until it's needed.

As soon as the user starts using util clamp by:

	1. Changing the uclamp value of a task with sched_setattr()
	2. Modifying the default sysctl_sched_util_clamp_{min, max}
	3. Modifying the default cpu.uclamp.{min, max} value in cgroup

we flip the static key now that the user has opted in to use util clamp,
effectively re-introducing uclamp logic in the enqueue/dequeue_task()
fast path. It stays on from that point forward until the next reboot.

This should help minimize the effect of util clamp on workloads that
don't need it but still allow distros to ship their kernels with uclamp
compiled in by default.
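
For reference, opting in via path (1) above boils down to a
sched_setattr() call along these lines (illustrative userspace sketch;
the struct layout mirrors include/uapi/linux/sched/types.h and the clamp
values are arbitrary):

```
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_FLAG_UTIL_CLAMP_MIN
# define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
# define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
#endif

/* Mirrors include/uapi/linux/sched/types.h (v5.3+). */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

int main(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= 0,	/* SCHED_OTHER */
		.sched_flags	= SCHED_FLAG_UTIL_CLAMP_MIN |
				  SCHED_FLAG_UTIL_CLAMP_MAX,
		.sched_util_min	= 128,	/* arbitrary, range is 0..1024 */
		.sched_util_max	= 512,
	};

	/* There is no glibc wrapper for sched_setattr(); use syscall(2). */
	if (syscall(SYS_sched_setattr, 0 /* self */, &attr, 0 /* flags */))
		perror("sched_setattr");

	return 0;
}
```

Any such call goes through __setscheduler_uclamp() and flips the key.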

SCHED_WARN_ON() in uclamp_rq_dec_id() was removed since now we can end
up with an unbalanced call to uclamp_rq_dec_id() if we flip the key while
a task is running in the rq. Since we know it is harmless, we just
quietly return if we attempt a uclamp_rq_dec_id() when
rq->uclamp[].bucket[].tasks is 0.

In schedutil, we introduce a new uclamp_is_used() helper which takes the
static key into account to ensure RT boosting behavior is retained.

The following results demonstrate how this helps on a 2-socket Xeon E5
2x10-cores system.

                                   nouclamp                 uclamp      uclamp-static-key
Hmean     send-64         162.43 (   0.00%)      157.84 *  -2.82%*      163.39 *   0.59%*
Hmean     send-128        324.71 (   0.00%)      314.78 *  -3.06%*      326.18 *   0.45%*
Hmean     send-256        641.55 (   0.00%)      628.67 *  -2.01%*      648.12 *   1.02%*
Hmean     send-1024      2525.28 (   0.00%)     2448.26 *  -3.05%*     2543.73 *   0.73%*
Hmean     send-2048      4836.14 (   0.00%)     4712.08 *  -2.57%*     4867.69 *   0.65%*
Hmean     send-3312      7540.83 (   0.00%)     7425.45 *  -1.53%*     7621.06 *   1.06%*
Hmean     send-4096      9124.53 (   0.00%)     8948.82 *  -1.93%*     9276.25 *   1.66%*
Hmean     send-8192     15589.67 (   0.00%)    15486.35 *  -0.66%*    15819.98 *   1.48%*
Hmean     send-16384    26386.47 (   0.00%)    25752.25 *  -2.40%*    26773.74 *   1.47%*

The perf diff between nouclamp and uclamp-static-key when uclamp is
disabled in the fast path:

     8.73%     -1.55%  [kernel.kallsyms]        [k] try_to_wake_up
     0.07%     +0.04%  [kernel.kallsyms]        [k] deactivate_task
     0.13%     -0.02%  [kernel.kallsyms]        [k] activate_task

The diff between nouclamp and uclamp-static-key when uclamp is enabled
in the fast path:

     8.73%     -0.72%  [kernel.kallsyms]        [k] try_to_wake_up
     0.13%     +0.39%  [kernel.kallsyms]        [k] activate_task
     0.07%     +0.38%  [kernel.kallsyms]        [k] deactivate_task

Reported-by: Mel Gorman <mgorman@suse.de>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
CC: Patrick Bellasi <patrick.bellasi@matbug.net>
Cc: Chris Redpath <chris.redpath@arm.com>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: linux-kernel@vger.kernel.org
---

This takes a different approach from PSI, which introduced a config option

```
      CONFIG_PSI_DEFAULT_DISABLED

        Require boot parameter to enable pressure stall information
        tracking (NEW)

      boot param psi
```

via commit e0c274472d5d "psi: make disabling/enabling easier for vendor kernels"

uclamp has clearer points of entry when userspace would like to use it, so we
can automatically flip the switch if the kernel is running on a userspace that
wants to use util clamp, without any extra userspace-visible switches.

I wanted to make this dependent on schedutil being the governor too, but
besides the complexity, uclamp is used for capacity awareness. We could
certainly construct a more complex condition, but I'm not sure it's worth it.
Open to hearing more opinions and points of view on this :)


 kernel/sched/core.c              | 74 +++++++++++++++++++++++++++++++-
 kernel/sched/cpufreq_schedutil.c |  2 +-
 kernel/sched/sched.h             | 47 +++++++++++++++++++-
 3 files changed, 119 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 235b2cae00a0..ef8f8b421622 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -794,6 +794,26 @@ unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
 /* All clamps are required to be less or equal than these values */
 static struct uclamp_se uclamp_default[UCLAMP_CNT];
 
+/*
+ * This static key is used to reduce the uclamp overhead in the fast path. It
+ * primarily disables the call to uclamp_rq_{inc, dec}() in
+ * enqueue/dequeue_task().
+ *
+ * This allows users to continue to enable uclamp in their kernel config with
+ * minimum uclamp overhead in the fast path.
+ *
+ * As soon as userspace modifies any of the uclamp knobs, the static key is
+ * enabled, since we have actual users that make use of uclamp
+ * functionality.
+ *
+ * The knobs that would enable this static key are:
+ *
+ *   * A task modifying its uclamp value with sched_setattr().
+ *   * An admin modifying the sysctl_sched_uclamp_{min, max} via procfs.
+ *   * An admin modifying the cgroup cpu.uclamp.{min, max}
+ */
+DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);
+
 /* Integer rounded range for each bucket */
 #define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
 
@@ -993,10 +1013,38 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
 
 	lockdep_assert_held(&rq->lock);
 
+	/*
+	 * If sched_uclamp_used was enabled after task @p was enqueued,
+	 * we could end up with unbalanced call to uclamp_rq_dec_id().
+	 *
+	 * In this case the uc_se->active flag should be false since no uclamp
+	 * accounting was performed at enqueue time and we can just return
+	 * here.
+	 *
+	 * Need to be careful of the following enqueue/dequeue ordering
+	 * problem too
+	 *
+	 *	enqueue(taskA)
+	 *	// sched_uclamp_used gets enabled
+	 *	enqueue(taskB)
+	 *	dequeue(taskA)
+	 *	// Must not decrement bucket->tasks here
+	 *	dequeue(taskB)
+	 *
+	 * where we could end up with stale data in uc_se and
+	 * bucket[uc_se->bucket_id].
+	 *
+	 * The following check here eliminates the possibility of such race.
+	 */
+	if (unlikely(!uc_se->active))
+		return;
+
 	bucket = &uc_rq->bucket[uc_se->bucket_id];
+
 	SCHED_WARN_ON(!bucket->tasks);
 	if (likely(bucket->tasks))
 		bucket->tasks--;
+
 	uc_se->active = false;
 
 	/*
@@ -1032,6 +1080,15 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 {
 	enum uclamp_id clamp_id;
 
+	/*
+	 * Avoid any overhead until uclamp is actually used by the userspace.
+	 *
+	 * The condition is constructed such that a NOP is generated when
+	 * sched_uclamp_used is disabled.
+	 */
+	if (!static_branch_unlikely(&sched_uclamp_used))
+		return;
+
 	if (unlikely(!p->sched_class->uclamp_enabled))
 		return;
 
@@ -1047,6 +1104,15 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
 {
 	enum uclamp_id clamp_id;
 
+	/*
+	 * Avoid any overhead until uclamp is actually used by the userspace.
+	 *
+	 * The condition is constructed such that a NOP is generated when
+	 * sched_uclamp_used is disabled.
+	 */
+	if (!static_branch_unlikely(&sched_uclamp_used))
+		return;
+
 	if (unlikely(!p->sched_class->uclamp_enabled))
 		return;
 
@@ -1155,8 +1221,10 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 		update_root_tg = true;
 	}
 
-	if (update_root_tg)
+	if (update_root_tg) {
+		static_branch_enable(&sched_uclamp_used);
 		uclamp_update_root_tg();
+	}
 
 	/*
 	 * We update all RUNNABLE tasks only when task groups are in use.
@@ -1221,6 +1289,8 @@ static void __setscheduler_uclamp(struct task_struct *p,
 	if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
 		return;
 
+	static_branch_enable(&sched_uclamp_used);
+
 	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
 		uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
 			      attr->sched_util_min, true);
@@ -7387,6 +7457,8 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
 	if (req.ret)
 		return req.ret;
 
+	static_branch_enable(&sched_uclamp_used);
+
 	mutex_lock(&uclamp_mutex);
 	rcu_read_lock();
 
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 7fbaee24c824..dc6835bc6490 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -210,7 +210,7 @@ unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
 	unsigned long dl_util, util, irq;
 	struct rq *rq = cpu_rq(cpu);
 
-	if (!IS_BUILTIN(CONFIG_UCLAMP_TASK) &&
+	if (!uclamp_is_used() &&
 	    type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
 		return max;
 	}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1d4e94c1e5fe..d340518327a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -862,6 +862,8 @@ struct uclamp_rq {
 	unsigned int value;
 	struct uclamp_bucket bucket[UCLAMP_BUCKETS];
 };
+
+DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
 #endif /* CONFIG_UCLAMP_TASK */
 
 /*
@@ -2349,12 +2351,35 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 #ifdef CONFIG_UCLAMP_TASK
 unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);
 
+/**
+ * uclamp_rq_util_with - clamp @util with @rq and @p effective uclamp values.
+ * @rq:		The rq to clamp against. Must not be NULL.
+ * @util:	The util value to clamp.
+ * @p:		The task to clamp against. Can be NULL if you want to clamp
+ *		against @rq only.
+ *
+ * Clamps the passed @util to the max(@rq, @p) effective uclamp values.
+ *
+ * If sched_uclamp_used static key is disabled, then just return the util
+ * without any clamping since uclamp aggregation at the rq level in the fast
+ * path is disabled, rendering this operation a NOP.
+ *
+ * Use uclamp_eff_value() if you don't care about uclamp values at rq level. It
+ * will return the correct effective uclamp value of the task even if the
+ * static key is disabled.
+ */
 static __always_inline
 unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
 				  struct task_struct *p)
 {
-	unsigned long min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
-	unsigned long max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
+	unsigned long min_util;
+	unsigned long max_util;
+
+	if (!static_branch_likely(&sched_uclamp_used))
+		return util;
+
+	min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
+	max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
 
 	if (p) {
 		min_util = max(min_util, uclamp_eff_value(p, UCLAMP_MIN));
@@ -2371,6 +2396,19 @@ unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
 
 	return clamp(util, min_util, max_util);
 }
+
+/*
+ * When uclamp is compiled in, the aggregation at rq level is 'turned off'
+ * by default in the fast path and only gets turned on once userspace performs
+ * an operation that requires it.
+ *
+ * Returns true if userspace opted-in to use uclamp and aggregation at rq level
+ * hence is active.
+ */
+static inline bool uclamp_is_used(void)
+{
+	return static_branch_likely(&sched_uclamp_used);
+}
 #else /* CONFIG_UCLAMP_TASK */
 static inline
 unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
@@ -2378,6 +2416,11 @@ unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
 {
 	return util;
 }
+
+static inline bool uclamp_is_used(void)
+{
+	return false;
+}
 #endif /* CONFIG_UCLAMP_TASK */
 
 #ifdef arch_scale_freq_capacity
-- 
2.17.1



* Re: [PATCH v6 2/2] sched/uclamp: Protect uclamp fast path code with static key
  2020-06-30 11:21 ` [PATCH v6 2/2] sched/uclamp: Protect uclamp fast path code with static key Qais Yousef
@ 2020-06-30 17:07   ` Peter Zijlstra
  2020-06-30 17:55     ` Qais Yousef
  2020-07-09  8:45   ` [tip: sched/core] " tip-bot2 for Qais Yousef
  1 sibling, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2020-06-30 17:07 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Valentin Schneider, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Patrick Bellasi, Chris Redpath, Lukasz Luba, linux-kernel

On Tue, Jun 30, 2020 at 12:21:23PM +0100, Qais Yousef wrote:
> @@ -993,10 +1013,38 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
>  
>  	lockdep_assert_held(&rq->lock);
>  
> +	/*
> +	 * If sched_uclamp_used was enabled after task @p was enqueued,
> +	 * we could end up with unbalanced call to uclamp_rq_dec_id().
> +	 *
> +	 * In this case the uc_se->active flag should be false since no uclamp
> +	 * accounting was performed at enqueue time and we can just return
> +	 * here.
> +	 *
> +	 * Need to be careful of the following enqeueue/dequeue ordering
> +	 * problem too
> +	 *
> +	 *	enqueue(taskA)
> +	 *	// sched_uclamp_used gets enabled
> +	 *	enqueue(taskB)
> +	 *	dequeue(taskA)
> +	 *	// Must not decrement bukcet->tasks here
> +	 *	dequeue(taskB)
> +	 *
> +	 * where we could end up with stale data in uc_se and
> +	 * bucket[uc_se->bucket_id].
> +	 *
> +	 * The following check here eliminates the possibility of such race.
> +	 */
> +	if (unlikely(!uc_se->active))
> +		return;
> +
>  	bucket = &uc_rq->bucket[uc_se->bucket_id];
> +
>  	SCHED_WARN_ON(!bucket->tasks);
>  	if (likely(bucket->tasks))
>  		bucket->tasks--;
> +
>  	uc_se->active = false;
>  
>  	/*

> @@ -1221,6 +1289,8 @@ static void __setscheduler_uclamp(struct task_struct *p,
>  	if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
>  		return;
>  
> +	static_branch_enable(&sched_uclamp_used);
> +
>  	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
>  		uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
>  			      attr->sched_util_min, true);
> @@ -7387,6 +7457,8 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
>  	if (req.ret)
>  		return req.ret;
>  
> +	static_branch_enable(&sched_uclamp_used);
> +
>  	mutex_lock(&uclamp_mutex);
>  	rcu_read_lock();
>  

There's a fun race described in 9107c89e269d ("perf: Fix race between
event install and jump_labels"), are we sure this isn't also susceptible
to something similar?

I suspect not, but I just wanted to make sure.


* Re: [PATCH v6 2/2] sched/uclamp: Protect uclamp fast path code with static key
  2020-06-30 17:07   ` Peter Zijlstra
@ 2020-06-30 17:55     ` Qais Yousef
  2020-06-30 19:06       ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: Qais Yousef @ 2020-06-30 17:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Valentin Schneider, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Patrick Bellasi, Chris Redpath, Lukasz Luba, linux-kernel

On 06/30/20 19:07, Peter Zijlstra wrote:
> On Tue, Jun 30, 2020 at 12:21:23PM +0100, Qais Yousef wrote:
> > @@ -993,10 +1013,38 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
> >  
> >  	lockdep_assert_held(&rq->lock);
> >  
> > +	/*
> > +	 * If sched_uclamp_used was enabled after task @p was enqueued,
> > +	 * we could end up with unbalanced call to uclamp_rq_dec_id().
> > +	 *
> > +	 * In this case the uc_se->active flag should be false since no uclamp
> > +	 * accounting was performed at enqueue time and we can just return
> > +	 * here.
> > +	 *
> > +	 * Need to be careful of the following enqeueue/dequeue ordering
> > +	 * problem too
> > +	 *
> > +	 *	enqueue(taskA)
> > +	 *	// sched_uclamp_used gets enabled
> > +	 *	enqueue(taskB)
> > +	 *	dequeue(taskA)
> > +	 *	// Must not decrement bukcet->tasks here
> > +	 *	dequeue(taskB)
> > +	 *
> > +	 * where we could end up with stale data in uc_se and
> > +	 * bucket[uc_se->bucket_id].
> > +	 *
> > +	 * The following check here eliminates the possibility of such race.
> > +	 */
> > +	if (unlikely(!uc_se->active))
> > +		return;
> > +
> >  	bucket = &uc_rq->bucket[uc_se->bucket_id];
> > +
> >  	SCHED_WARN_ON(!bucket->tasks);
> >  	if (likely(bucket->tasks))
> >  		bucket->tasks--;
> > +
> >  	uc_se->active = false;
> >  
> >  	/*
> 
> > @@ -1221,6 +1289,8 @@ static void __setscheduler_uclamp(struct task_struct *p,
> >  	if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
> >  		return;
> >  
> > +	static_branch_enable(&sched_uclamp_used);
> > +
> >  	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> >  		uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
> >  			      attr->sched_util_min, true);
> > @@ -7387,6 +7457,8 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
> >  	if (req.ret)
> >  		return req.ret;
> >  
> > +	static_branch_enable(&sched_uclamp_used);
> > +
> >  	mutex_lock(&uclamp_mutex);
> >  	rcu_read_lock();
> >  
> 
> There's a fun race described in 9107c89e269d ("perf: Fix race between
> event install and jump_labels"), are we sure this isn't also susceptible
> to something similar?
> 
> I suspect not, but I just wanted to make sure.

IIUC, the worry is that not all CPUs might have observed the change in the
static key state; hence could not be running the patched
enqueue/dequeue_task(), so we could end up with some CPUs accounting for
uclamp in the enqueue/dequeue path but not others?

I was hoping this synchronization is guaranteed by the static_branch_*() call.

aarch64_insn_patch_text_nosync() in arch/arm64/kernel/insn.c performs
__flush_icache_range() after writing the new instruction.

I need to dig into what __flush_icache_range() does, but I think it'll just
flush the icache of the calling CPU. Need to dig into upper layers too.

It'd be good to know if I got you correctly first.

Thanks

--
Qais Yousef


* Re: [PATCH v6 2/2] sched/uclamp: Protect uclamp fast path code with static key
  2020-06-30 17:55     ` Qais Yousef
@ 2020-06-30 19:06       ` Peter Zijlstra
  2020-06-30 19:28         ` Qais Yousef
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2020-06-30 19:06 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Valentin Schneider, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Patrick Bellasi, Chris Redpath, Lukasz Luba, linux-kernel

On Tue, Jun 30, 2020 at 06:55:02PM +0100, Qais Yousef wrote:
> On 06/30/20 19:07, Peter Zijlstra wrote:

> > There's a fun race described in 9107c89e269d ("perf: Fix race between
> > event install and jump_labels"), are we sure this isn't also susceptible
> > to something similar?
> > 
> > I suspect not, but I just wanted to make sure.
> 
> IIUC, the worry is that not all CPUs might have observed the change in the
> static key state; hence could not be running the patched
> enqueue/dequeue_task(), so we could end up with some CPUs accounting for
> uclamp in the enqueue/dequeue path but not others?
> 
> I was hoping this synchronization is guaranteed by the static_branch_*() call.

It is, that isn't quite the problem. Looking at it more I think
commit 1dbb6704de91 ("jump_label: Fix concurrent
static_key_enable/disable()") fixed some of it.

From what I can remember there were two parts to this problem, one being
fixed by the above commit, the other being that if we enable while a
task is running we miss the switch-in event (exactly how in this patch
we miss the enqueue).

Due to the missing switch-in, the state is 'weird' and the subsequent
IPI to install a remote event didn't quite work.

So I put that sync_sched() call in to guarantee all CPUs have done a
schedule() cycle after having the key switched. This makes sure that
every running task has seen the switch-in and thus the state is as
expected.

But like I said, I think we're good, that one extra branch deals with
the half-state.
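
In code, the pattern being described is roughly (illustrative sketch, not
the exact perf hunk; the key name is made up, and synchronize_sched() has
since been folded into synchronize_rcu()):

```
	static_branch_enable(&some_event_key);
	/*
	 * Wait for every CPU to pass through schedule() so that all
	 * currently running tasks have observed the flipped key (and thus
	 * the switch-in path) before we rely on it.
	 */
	synchronize_rcu();
```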


* Re: [PATCH v6 2/2] sched/uclamp: Protect uclamp fast path code with static key
  2020-06-30 19:06       ` Peter Zijlstra
@ 2020-06-30 19:28         ` Qais Yousef
  0 siblings, 0 replies; 14+ messages in thread
From: Qais Yousef @ 2020-06-30 19:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Valentin Schneider, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Patrick Bellasi, Chris Redpath, Lukasz Luba, linux-kernel

On 06/30/20 21:06, Peter Zijlstra wrote:
> On Tue, Jun 30, 2020 at 06:55:02PM +0100, Qais Yousef wrote:
> > On 06/30/20 19:07, Peter Zijlstra wrote:
> 
> > > There's a fun race described in 9107c89e269d ("perf: Fix race between
> > > event install and jump_labels"), are we sure this isn't also susceptible
> > > to something similar?
> > > 
> > > I suspect not, but I just wanted to make sure.
> > 
> > IIUC, the worry is that not all CPUs might have observed the change in the
> > static key state; hence could not be running the patched
> > enqueue/dequeue_task(), so we could end up with some CPUs accounting for
> > uclamp in the enqueue/dequeue path but not others?
> > 
> > I was hoping this synchronization is guaranteed by the static_branch_*() call.
> 
> It is, that isn't quite the the problem. Looking at it more I think
> commit 1dbb6704de91 ("jump_label: Fix concurrent
> static_key_enable/disable()") fixed some of it.
> 
> From what I can remember there were two parts to this problem, one being
> fixed by the above commit, the other being that if we enable while a
> task is running we miss the switch-in event (exactly how in this patch
> we miss the enqueue).
> 
> Due to the missing switch-in, the state is 'weird' and the subsequent
> IPI to install a remote event didn't quite work.
> 
> So I put that sync_sched() call in to guarantee all CPUs have done a
> schedule() cycle after having the key switched. This makes sure that
> every running task has seen the switch-in and thus the state is as
> expected.
> 
> But like I said, I think we're good, that one extra branch deals with
> the half-state.

Got it, thanks.

Yes, we should be good for currently running tasks.

Thanks

--
Qais Yousef


* Re: [PATCH v6 0/2] sched: Optionally skip uclamp logic in fast path
  2020-06-30 11:21 [PATCH v6 0/2] sched: Optionally skip uclamp logic in fast path Qais Yousef
  2020-06-30 11:21 ` [PATCH v6 1/2] sched/uclamp: Fix initialization of struct uclamp_rq Qais Yousef
  2020-06-30 11:21 ` [PATCH v6 2/2] sched/uclamp: Protect uclamp fast path code with static key Qais Yousef
@ 2020-07-01 16:32 ` Lukasz Luba
  2020-07-03 12:09 ` Vincent Guittot
  3 siblings, 0 replies; 14+ messages in thread
From: Lukasz Luba @ 2020-07-01 16:32 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Peter Zijlstra, Valentin Schneider, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Patrick Bellasi, Chris Redpath, linux-kernel

Hi Qais,

On 6/30/20 12:21 PM, Qais Yousef wrote:
> This series attempts to address the report that uclamp logic could be expensive
> sometimes and shows a regression in netperf UDP_STREAM under certain
> conditions.
> 
> The first patch is a fix for how struct uclamp_rq is initialized which is
> required by the 2nd patch which contains the real 'fix'.
> 
> Worth noting that the root cause of the overhead is believed to be system
> specific or related to potential certain code/data layout issues, leading to
> worse I/D $ performance.
> 
> Different systems exhibited different behaviors and the regression did
> disappear in certain kernel version while attempting to reporoduce.
> 
> More info can be found here:
> 
> https://lore.kernel.org/lkml/20200616110824.dgkkbyapn3io6wik@e107158-lin/
> 
> Having the static key seemed the best thing to do to ensure the effect of
> uclamp is minimized for kernels that compile it in but don't have a userspace
> that uses it, which will allow distros to distribute uclamp capable kernels by
> default without having to compromise on performance for some systems that could
> be affected.
> 
> Changes in v6:
> 	* s/uclamp_is_enabled/uclamp_is_used/ + add comment
> 	* Improve the bailout condition for the case where we could end up with
> 	  unbalanced call of uclamp_rq_dec_id()
> 	* Clarify some comments.
> 

I've tried this v6 series with mmtest netperf-udp (30x each UDP
size) - the results are good.

                       v5.7-rc7-base-noucl    v5.7-rc7-ucl-tsk-nofix    v5.7-rc7-ucl-tsk-grp-fix_v6
Hmean     send-64          62.15 (   0.00%)       59.65 *  -4.02%*       65.87 *   5.99%*
Hmean     send-128        122.88 (   0.00%)      119.37 *  -2.85%*      131.76 *   7.23%*
Hmean     send-256        244.85 (   0.00%)      234.26 *  -4.32%*      259.87 *   6.14%*
Hmean     send-1024       919.24 (   0.00%)      880.67 *  -4.20%*      975.48 *   6.12%*
Hmean     send-2048      1689.45 (   0.00%)     1647.54 *  -2.48%*     1797.23 *   6.38%*
Hmean     send-3312      2542.36 (   0.00%)     2485.23 *  -2.25%*     2665.69 *   4.85%*
Hmean     send-4096      2935.69 (   0.00%)     2861.09 *  -2.54%*     3087.79 *   5.18%*
Hmean     send-8192      4800.35 (   0.00%)     4680.09 *  -2.51%*     4966.50 *   3.46%*
Hmean     send-16384     7473.66 (   0.00%)     7349.60 *  -1.66%*     7598.49 *   1.67%*
Hmean     recv-64          62.15 (   0.00%)       59.65 *  -4.03%*       65.86 *   5.97%*
Hmean     recv-128        122.88 (   0.00%)      119.37 *  -2.85%*      131.76 *   7.23%*
Hmean     recv-256        244.84 (   0.00%)      234.26 *  -4.32%*      259.80 *   6.11%*
Hmean     recv-1024       919.24 (   0.00%)      880.67 *  -4.20%*      975.42 *   6.11%*
Hmean     recv-2048      1689.44 (   0.00%)     1647.54 *  -2.48%*     1797.18 *   6.38%*
Hmean     recv-3312      2542.36 (   0.00%)     2485.23 *  -2.25%*     2665.53 *   4.84%*
Hmean     recv-4096      2935.69 (   0.00%)     2861.09 *  -2.54%*     3087.79 *   5.18%*
Hmean     recv-8192      4800.35 (   0.00%)     4678.15 *  -2.55%*     4966.48 *   3.46%*
Hmean     recv-16384     7473.63 (   0.00%)     7349.52 *  -1.66%*     7597.90 *   1.66%*

You can add:

Tested-by: Lukasz Luba <lukasz.luba@arm.com>

Regards,
Lukasz


* Re: [PATCH v6 0/2] sched: Optionally skip uclamp logic in fast path
  2020-06-30 11:21 [PATCH v6 0/2] sched: Optionally skip uclamp logic in fast path Qais Yousef
                   ` (2 preceding siblings ...)
  2020-07-01 16:32 ` [PATCH v6 0/2] sched: Optionally skip uclamp logic in fast path Lukasz Luba
@ 2020-07-03 12:09 ` Vincent Guittot
  2020-07-06 10:41   ` Qais Yousef
  3 siblings, 1 reply; 14+ messages in thread
From: Vincent Guittot @ 2020-07-03 12:09 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Peter Zijlstra, Valentin Schneider, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Patrick Bellasi, Chris Redpath, Lukasz Luba, linux-kernel

On Tue, 30 Jun 2020 at 13:22, Qais Yousef <qais.yousef@arm.com> wrote:
>
> This series attempts to address the report that uclamp logic could be expensive
> sometimes and shows a regression in netperf UDP_STREAM under certain
> conditions.
>
> The first patch is a fix for how struct uclamp_rq is initialized which is
> required by the 2nd patch which contains the real 'fix'.
>
> Worth noting that the root cause of the overhead is believed to be system
> specific or related to potential certain code/data layout issues, leading to
> worse I/D $ performance.
>
> Different systems exhibited different behaviors and the regression did
> disappear in certain kernel version while attempting to reporoduce.
>
> More info can be found here:
>
> https://lore.kernel.org/lkml/20200616110824.dgkkbyapn3io6wik@e107158-lin/
>
> Having the static key seemed the best thing to do to ensure the effect of
> uclamp is minimized for kernels that compile it in but don't have a userspace
> that uses it, which will allow distros to distribute uclamp capable kernels by
> default without having to compromise on performance for some systems that could
> be affected.
>
> Changes in v6:
>         * s/uclamp_is_enabled/uclamp_is_used/ + add comment
>         * Improve the bailout condition for the case where we could end up with
>           unbalanced call of uclamp_rq_dec_id()
>         * Clarify some comments.
>
> Changes in v5:
>         * Fix a race that could happen when order of enqueue/dequeue of tasks
>           A and B is not done in order, and sched_uclamp_used is enabled in
>           between.
>         * Add more comments explaining the race and the behavior of
>           uclamp_rq_util_with() which is now protected with a static key to be
>           a NOP. When no uclamp aggregation at rq level is done, this function
>           can't do much.
>
> Changes in v4:
>         * Fix broken boosting of RT tasks when static key is disabled.
>
> Changes in v3:
>         * Avoid double negatives and rename the static key to uclamp_used
>         * Unconditionally enable the static key through any of the paths where
>           the user can modify the default uclamp value.
>         * Use C99 named struct initializer for struct uclamp_rq which is easier
>           to read than the memset().
>
> Changes in v2:
>         * Add more info in the commit message about the result of perf diff to
>           demonstrate that the activate/deactivate_task pressure is reduced in
>           the fast path.
>
>         * Fix sparse warning reported by the test robot.
>
>         * Add an extra commit about using static_branch_likely() instead of
>           static_branch_unlikely().
>
> Thanks
>
> --
> Qais Yousef
>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> CC: Patrick Bellasi <patrick.bellasi@matbug.net>
> Cc: Chris Redpath <chris.redpath@arm.com>
> Cc: Lukasz Luba <lukasz.luba@arm.com>
> Cc: linux-kernel@vger.kernel.org
>
>
> Qais Yousef (2):
>   sched/uclamp: Fix initialization of struct uclamp_rq
>   sched/uclamp: Protect uclamp fast path code with static key

I have run the perf bench sched pipe test that I have already run
previously with this v6 and the results are similar to my previous
tests: the impact is -1.61%, similar to v2, which is better compared to
the original -3.66% without your patch

>
>  kernel/sched/core.c              | 95 ++++++++++++++++++++++++++++++--
>  kernel/sched/cpufreq_schedutil.c |  2 +-
>  kernel/sched/sched.h             | 47 +++++++++++++++-
>  3 files changed, 135 insertions(+), 9 deletions(-)
>
> --
> 2.17.1
>


* Re: [PATCH v6 0/2] sched: Optionally skip uclamp logic in fast path
  2020-07-03 12:09 ` Vincent Guittot
@ 2020-07-06 10:41   ` Qais Yousef
  2020-07-07 12:29     ` Vincent Guittot
  0 siblings, 1 reply; 14+ messages in thread
From: Qais Yousef @ 2020-07-06 10:41 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Valentin Schneider, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Patrick Bellasi, Chris Redpath, Lukasz Luba, linux-kernel

On 07/03/20 14:09, Vincent Guittot wrote:
> I have run the perf bench sched pipe that have have already run
> previously with this v6 and the results are similar to my previous
> tests:
> The impact is -1.61% similarly to v2 which is better compared the
> original -3.66% without your patch

Thanks Vincent.

Can you afford doing a capture of `perf record` and share the resulting
perf.dat with vmlinux (with debug symbols)?

Having a before/after capture would be even better.

Not sure if we can do much about this -1.61% in your case, but it'd be good to
understand why if possible. perf bench sched pipe is very sensitive to the
tiniest of changes which could be due to binary-to-binary differences.

Thanks

--
Qais Yousef


* Re: [PATCH v6 0/2] sched: Optionally skip uclamp logic in fast path
  2020-07-06 10:41   ` Qais Yousef
@ 2020-07-07 12:29     ` Vincent Guittot
  2020-07-07 13:11       ` Qais Yousef
  0 siblings, 1 reply; 14+ messages in thread
From: Vincent Guittot @ 2020-07-07 12:29 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Peter Zijlstra, Valentin Schneider, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Patrick Bellasi, Chris Redpath, Lukasz Luba, linux-kernel

On Mon, 6 Jul 2020 at 12:41, Qais Yousef <qais.yousef@arm.com> wrote:
>
> On 07/03/20 14:09, Vincent Guittot wrote:
> > I have run the perf bench sched pipe that have have already run
> > previously with this v6 and the results are similar to my previous
> > tests:
> > The impact is -1.61% similarly to v2 which is better compared the
> > original -3.66% without your patch
>
> Thanks Vincent.
>
> Can you afford doing a capture of `perf record` and share the resulting
> perf.dat with vmlinux (with debug symbols)?

Will try to make it by end of the week

>
> Having a before/after capture would be even better.
>
> Not sure if we can do  much about this -1.61% in your case, but it'd be good to
> understand why if possible. perf bench sched pipe is very sensitive to tiniest
> of changes which could be due to binary-to-binary differences.
>
> Thanks
>
> --
> Qais Yousef


* Re: [PATCH v6 0/2] sched: Optionally skip uclamp logic in fast path
  2020-07-07 12:29     ` Vincent Guittot
@ 2020-07-07 13:11       ` Qais Yousef
  0 siblings, 0 replies; 14+ messages in thread
From: Qais Yousef @ 2020-07-07 13:11 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Valentin Schneider, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Patrick Bellasi, Chris Redpath, Lukasz Luba, linux-kernel

On 07/07/20 14:29, Vincent Guittot wrote:
> On Mon, 6 Jul 2020 at 12:41, Qais Yousef <qais.yousef@arm.com> wrote:
> >
> > On 07/03/20 14:09, Vincent Guittot wrote:
> > > I have run the perf bench sched pipe that have have already run
> > > previously with this v6 and the results are similar to my previous
> > > tests:
> > > The impact is -1.61% similarly to v2 which is better compared the
> > > original -3.66% without your patch
> >
> > Thanks Vincent.
> >
> > Can you afford doing a capture of `perf record` and share the resulting
> > perf.dat with vmlinux (with debug symbols)?
> 
> Will try to make it by end of the week

Thanks! If you want a place to drop them let me know.

Cheers

--
Qais Yousef

> 
> >
> > Having a before/after capture would be even better.
> >
> > Not sure if we can do  much about this -1.61% in your case, but it'd be good to
> > understand why if possible. perf bench sched pipe is very sensitive to tiniest
> > of changes which could be due to binary-to-binary differences.
> >
> > Thanks
> >
> > --
> > Qais Yousef


* [tip: sched/core] sched/uclamp: Protect uclamp fast path code with static key
  2020-06-30 11:21 ` [PATCH v6 2/2] sched/uclamp: Protect uclamp fast path code with static key Qais Yousef
  2020-06-30 17:07   ` Peter Zijlstra
@ 2020-07-09  8:45   ` tip-bot2 for Qais Yousef
  1 sibling, 0 replies; 14+ messages in thread
From: tip-bot2 for Qais Yousef @ 2020-07-09  8:45 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Mel Gorman, Qais Yousef, Peter Zijlstra (Intel), Lukasz Luba, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     46609ce227039fd192e0ecc7d940bed587fd2c78
Gitweb:        https://git.kernel.org/tip/46609ce227039fd192e0ecc7d940bed587fd2c78
Author:        Qais Yousef <qais.yousef@arm.com>
AuthorDate:    Tue, 30 Jun 2020 12:21:23 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 08 Jul 2020 11:39:01 +02:00

sched/uclamp: Protect uclamp fast path code with static key

There is a report that when uclamp is enabled, a netperf UDP test
regresses compared to a kernel compiled without uclamp.

https://lore.kernel.org/lkml/20200529100806.GA3070@suse.de/

While investigating the root cause, there was no sign that the uclamp
code is doing anything particularly expensive, but it could suffer from
bad cache behavior under certain circumstances that are yet to be
understood.

https://lore.kernel.org/lkml/20200616110824.dgkkbyapn3io6wik@e107158-lin/

To reduce the pressure on the fast path anyway, add a static key that,
by default, will skip executing the uclamp logic in the
enqueue/dequeue_task() fast path until it's needed.

As soon as the user starts using util clamp by:

	1. Changing the uclamp value of a task with sched_setattr()
	2. Modifying the default sysctl_sched_util_clamp_{min, max}
	3. Modifying the default cpu.uclamp.{min, max} value in cgroup

we flip the static key now that the user has opted in to use util clamp,
effectively re-introducing uclamp logic in the enqueue/dequeue_task()
fast path. It stays on from that point forward until the next reboot.

This should help minimize the effect of util clamp on workloads that
don't need it but still allow distros to ship their kernels with uclamp
compiled in by default.

SCHED_WARN_ON() in uclamp_rq_dec_id() was removed since now we can end
up with an unbalanced call to uclamp_rq_dec_id() if we flip the key while
a task is running in the rq. Since we know it is harmless, we just
quietly return if we attempt a uclamp_rq_dec_id() when
rq->uclamp[].bucket[].tasks is 0.

In schedutil, we introduce a new uclamp_is_used() helper which takes the
static key into account to ensure RT boosting behavior is retained.

The following results demonstrate how this helps on a 2-socket Xeon E5
2x10-cores system.

                                   nouclamp                 uclamp      uclamp-static-key
Hmean     send-64         162.43 (   0.00%)      157.84 *  -2.82%*      163.39 *   0.59%*
Hmean     send-128        324.71 (   0.00%)      314.78 *  -3.06%*      326.18 *   0.45%*
Hmean     send-256        641.55 (   0.00%)      628.67 *  -2.01%*      648.12 *   1.02%*
Hmean     send-1024      2525.28 (   0.00%)     2448.26 *  -3.05%*     2543.73 *   0.73%*
Hmean     send-2048      4836.14 (   0.00%)     4712.08 *  -2.57%*     4867.69 *   0.65%*
Hmean     send-3312      7540.83 (   0.00%)     7425.45 *  -1.53%*     7621.06 *   1.06%*
Hmean     send-4096      9124.53 (   0.00%)     8948.82 *  -1.93%*     9276.25 *   1.66%*
Hmean     send-8192     15589.67 (   0.00%)    15486.35 *  -0.66%*    15819.98 *   1.48%*
Hmean     send-16384    26386.47 (   0.00%)    25752.25 *  -2.40%*    26773.74 *   1.47%*

The perf diff between nouclamp and uclamp-static-key when uclamp is
disabled in the fast path:

     8.73%     -1.55%  [kernel.kallsyms]        [k] try_to_wake_up
     0.07%     +0.04%  [kernel.kallsyms]        [k] deactivate_task
     0.13%     -0.02%  [kernel.kallsyms]        [k] activate_task

The diff between nouclamp and uclamp-static-key when uclamp is enabled
in the fast path:

     8.73%     -0.72%  [kernel.kallsyms]        [k] try_to_wake_up
     0.13%     +0.39%  [kernel.kallsyms]        [k] activate_task
     0.07%     +0.38%  [kernel.kallsyms]        [k] deactivate_task

Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
Reported-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20200630112123.12076-3-qais.yousef@arm.com
---
 kernel/sched/core.c              | 74 ++++++++++++++++++++++++++++++-
 kernel/sched/cpufreq_schedutil.c |  2 +-
 kernel/sched/sched.h             | 47 +++++++++++++++++++-
 3 files changed, 119 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9605db7..4cf30e4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -796,6 +796,26 @@ unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
 /* All clamps are required to be less or equal than these values */
 static struct uclamp_se uclamp_default[UCLAMP_CNT];
 
+/*
+ * This static key is used to reduce the uclamp overhead in the fast path. It
+ * primarily disables the call to uclamp_rq_{inc, dec}() in
+ * enqueue/dequeue_task().
+ *
+ * This allows users to continue to enable uclamp in their kernel config with
+ * minimum uclamp overhead in the fast path.
+ *
+ * As soon as userspace modifies any of the uclamp knobs, the static key is
+ * enabled, since we have actual users that make use of uclamp
+ * functionality.
+ *
+ * The knobs that would enable this static key are:
+ *
+ *   * A task modifying its uclamp value with sched_setattr().
+ *   * An admin modifying the sysctl_sched_uclamp_{min, max} via procfs.
+ *   * An admin modifying the cgroup cpu.uclamp.{min, max}
+ */
+DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);
+
 /* Integer rounded range for each bucket */
 #define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
 
@@ -992,10 +1012,38 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
 
 	lockdep_assert_held(&rq->lock);
 
+	/*
+	 * If sched_uclamp_used was enabled after task @p was enqueued,
+	 * we could end up with unbalanced call to uclamp_rq_dec_id().
+	 *
+	 * In this case the uc_se->active flag should be false since no uclamp
+	 * accounting was performed at enqueue time and we can just return
+	 * here.
+	 *
+	 * Need to be careful of the following enqueue/dequeue ordering
+	 * problem too
+	 *
+	 *	enqueue(taskA)
+	 *	// sched_uclamp_used gets enabled
+	 *	enqueue(taskB)
+	 *	dequeue(taskA)
+	 *	// Must not decrement bucket->tasks here
+	 *	dequeue(taskB)
+	 *
+	 * where we could end up with stale data in uc_se and
+	 * bucket[uc_se->bucket_id].
+	 *
+	 * The following check here eliminates the possibility of such race.
+	 */
+	if (unlikely(!uc_se->active))
+		return;
+
 	bucket = &uc_rq->bucket[uc_se->bucket_id];
+
 	SCHED_WARN_ON(!bucket->tasks);
 	if (likely(bucket->tasks))
 		bucket->tasks--;
+
 	uc_se->active = false;
 
 	/*
@@ -1023,6 +1071,15 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 {
 	enum uclamp_id clamp_id;
 
+	/*
+	 * Avoid any overhead until uclamp is actually used by the userspace.
+	 *
+	 * The condition is constructed such that a NOP is generated when
+	 * sched_uclamp_used is disabled.
+	 */
+	if (!static_branch_unlikely(&sched_uclamp_used))
+		return;
+
 	if (unlikely(!p->sched_class->uclamp_enabled))
 		return;
 
@@ -1038,6 +1095,15 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
 {
 	enum uclamp_id clamp_id;
 
+	/*
+	 * Avoid any overhead until uclamp is actually used by the userspace.
+	 *
+	 * The condition is constructed such that a NOP is generated when
+	 * sched_uclamp_used is disabled.
+	 */
+	if (!static_branch_unlikely(&sched_uclamp_used))
+		return;
+
 	if (unlikely(!p->sched_class->uclamp_enabled))
 		return;
 
@@ -1146,8 +1212,10 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 		update_root_tg = true;
 	}
 
-	if (update_root_tg)
+	if (update_root_tg) {
+		static_branch_enable(&sched_uclamp_used);
 		uclamp_update_root_tg();
+	}
 
 	/*
 	 * We update all RUNNABLE tasks only when task groups are in use.
@@ -1212,6 +1280,8 @@ static void __setscheduler_uclamp(struct task_struct *p,
 	if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
 		return;
 
+	static_branch_enable(&sched_uclamp_used);
+
 	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
 		uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
 			      attr->sched_util_min, true);
@@ -7436,6 +7506,8 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
 	if (req.ret)
 		return req.ret;
 
+	static_branch_enable(&sched_uclamp_used);
+
 	mutex_lock(&uclamp_mutex);
 	rcu_read_lock();
 
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 7fbaee2..dc6835b 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -210,7 +210,7 @@ unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
 	unsigned long dl_util, util, irq;
 	struct rq *rq = cpu_rq(cpu);
 
-	if (!IS_BUILTIN(CONFIG_UCLAMP_TASK) &&
+	if (!uclamp_is_used() &&
 	    type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
 		return max;
 	}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9bef2dd..b1432f6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -878,6 +878,8 @@ struct uclamp_rq {
 	unsigned int value;
 	struct uclamp_bucket bucket[UCLAMP_BUCKETS];
 };
+
+DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
 #endif /* CONFIG_UCLAMP_TASK */
 
 /*
@@ -2365,12 +2367,35 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 #ifdef CONFIG_UCLAMP_TASK
 unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);
 
+/**
+ * uclamp_rq_util_with - clamp @util with @rq and @p effective uclamp values.
+ * @rq:		The rq to clamp against. Must not be NULL.
+ * @util:	The util value to clamp.
+ * @p:		The task to clamp against. Can be NULL if you want to clamp
+ *		against @rq only.
+ *
+ * Clamps the passed @util to the max(@rq, @p) effective uclamp values.
+ *
+ * If sched_uclamp_used static key is disabled, then just return the util
+ * without any clamping since uclamp aggregation at the rq level in the fast
+ * path is disabled, rendering this operation a NOP.
+ *
+ * Use uclamp_eff_value() if you don't care about uclamp values at rq level. It
+ * will return the correct effective uclamp value of the task even if the
+ * static key is disabled.
+ */
 static __always_inline
 unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
 				  struct task_struct *p)
 {
-	unsigned long min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
-	unsigned long max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
+	unsigned long min_util;
+	unsigned long max_util;
+
+	if (!static_branch_likely(&sched_uclamp_used))
+		return util;
+
+	min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
+	max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
 
 	if (p) {
 		min_util = max(min_util, uclamp_eff_value(p, UCLAMP_MIN));
@@ -2387,6 +2412,19 @@ unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
 
 	return clamp(util, min_util, max_util);
 }
+
+/*
+ * When uclamp is compiled in, the aggregation at rq level is 'turned off'
+ * by default in the fast path and only gets turned on once userspace performs
+ * an operation that requires it.
+ *
+ * Returns true if userspace opted-in to use uclamp and aggregation at rq level
+ * hence is active.
+ */
+static inline bool uclamp_is_used(void)
+{
+	return static_branch_likely(&sched_uclamp_used);
+}
 #else /* CONFIG_UCLAMP_TASK */
 static inline
 unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
@@ -2394,6 +2432,11 @@ unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
 {
 	return util;
 }
+
+static inline bool uclamp_is_used(void)
+{
+	return false;
+}
 #endif /* CONFIG_UCLAMP_TASK */
 
 #ifdef arch_scale_freq_capacity


* [tip: sched/core] sched/uclamp: Fix initialization of struct uclamp_rq
  2020-06-30 11:21 ` [PATCH v6 1/2] sched/uclamp: Fix initialization of struct uclamp_rq Qais Yousef
@ 2020-07-09  8:45   ` tip-bot2 for Qais Yousef
  0 siblings, 0 replies; 14+ messages in thread
From: tip-bot2 for Qais Yousef @ 2020-07-09  8:45 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Qais Yousef, Peter Zijlstra (Intel),
	Valentin Schneider, Lukasz Luba, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     d81ae8aac85ca2e307d273f6dc7863a721bf054e
Gitweb:        https://git.kernel.org/tip/d81ae8aac85ca2e307d273f6dc7863a721bf054e
Author:        Qais Yousef <qais.yousef@arm.com>
AuthorDate:    Tue, 30 Jun 2020 12:21:22 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 08 Jul 2020 11:39:01 +02:00

sched/uclamp: Fix initialization of struct uclamp_rq

struct uclamp_rq was zeroed out entirely in the assumption that in the
first call to uclamp_rq_inc() it would be initialized correctly in
accordance with the default settings.

But when the next patch introduces a static key to skip
uclamp_rq_{inc,dec}() until userspace opts in to use uclamp, schedutil
will fail to perform any frequency changes because
rq->uclamp[UCLAMP_MAX].value is zeroed at init and stays as such, which
means all rqs are capped to 0 by default.

Fix it by making sure we do proper initialization at init without
relying on uclamp_rq_inc() doing it later.

Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20200630112123.12076-2-qais.yousef@arm.com
---
 kernel/sched/core.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 15c980a..9605db7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1239,6 +1239,20 @@ static void uclamp_fork(struct task_struct *p)
 	}
 }
 
+static void __init init_uclamp_rq(struct rq *rq)
+{
+	enum uclamp_id clamp_id;
+	struct uclamp_rq *uc_rq = rq->uclamp;
+
+	for_each_clamp_id(clamp_id) {
+		uc_rq[clamp_id] = (struct uclamp_rq) {
+			.value = uclamp_none(clamp_id)
+		};
+	}
+
+	rq->uclamp_flags = 0;
+}
+
 static void __init init_uclamp(void)
 {
 	struct uclamp_se uc_max = {};
@@ -1247,11 +1261,8 @@ static void __init init_uclamp(void)
 
 	mutex_init(&uclamp_mutex);
 
-	for_each_possible_cpu(cpu) {
-		memset(&cpu_rq(cpu)->uclamp, 0,
-				sizeof(struct uclamp_rq)*UCLAMP_CNT);
-		cpu_rq(cpu)->uclamp_flags = 0;
-	}
+	for_each_possible_cpu(cpu)
+		init_uclamp_rq(cpu_rq(cpu));
 
 	for_each_clamp_id(clamp_id) {
 		uclamp_se_set(&init_task.uclamp_req[clamp_id],

