LKML Archive on lore.kernel.org
* [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
@ 2020-04-03 12:30 Qais Yousef
  2020-04-03 12:30 ` [PATCH 2/2] Documentation/sysctl: Document uclamp sysctl knobs Qais Yousef
                   ` (3 more replies)
  0 siblings, 4 replies; 68+ messages in thread
From: Qais Yousef @ 2020-04-03 12:30 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Qais Yousef, Jonathan Corbet, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Quentin Perret,
	Valentin Schneider, Patrick Bellasi, Pavan Kondeti, linux-doc,
	linux-kernel, linux-fsdevel

RT tasks by default run at the highest capacity/performance level. When
uclamp is selected this default behavior is retained by enforcing the
requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE, the maximum
value.

This is also referred to as 'the default boost value of RT tasks'.

See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").

On battery powered devices, it is desired to control this default
(currently hardcoded) behavior at runtime to reduce energy consumed by
RT tasks.

For example, for a mobile device manufacturer where the big.LITTLE
architecture is dominant, the performance of the little cores varies across
SoCs, and on high end ones the big cores could be too power hungry.

Given the diversity of SoCs, the new knob allows manufacturers to tune
the best performance/power trade-off for RT tasks on the particular
hardware they run on.

They could opt to further tune the value when the user selects
a different power saving mode or when the device is actively charging.

The runtime aspect of it further helps in creating a single kernel image
that can be run on multiple devices that require different tuning.

Keep in mind that a lot of the RT tasks in the system are created by the
kernel. On Android for instance I can see over 50 RT tasks, only
a handful of which are created by the Android framework.

To allow system admins and device integrators to control this default
behavior globally, introduce the new sysctl_sched_rt_default_uclamp_util_min
to change the default boost value of RT tasks.

I anticipate this to be mostly in the form of modifying the init script
of a particular device.

Whenever the new default changes, it is applied lazily on the next
enqueue, assuming the task still uses the system default value and not
a user applied one.

Tested on Juno-r2 in combination with the RT capacity awareness [1].
By default an RT task will go to the highest capacity CPU and run at the
maximum frequency, which is particularly energy inefficient on high end
mobile devices because the biggest core[s] are 'huge' and power hungry.

With this patch the RT task can be controlled to run anywhere by
default, and doesn't cause the frequency to be maximum all the time.
Yet any task that really needs to be boosted can easily escape this
default behavior by modifying its requested uclamp.min value
(p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.

[1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
CC: Jonathan Corbet <corbet@lwn.net>
CC: Juri Lelli <juri.lelli@redhat.com>
CC: Vincent Guittot <vincent.guittot@linaro.org>
CC: Dietmar Eggemann <dietmar.eggemann@arm.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ben Segall <bsegall@google.com>
CC: Mel Gorman <mgorman@suse.de>
CC: Luis Chamberlain <mcgrof@kernel.org>
CC: Kees Cook <keescook@chromium.org>
CC: Iurii Zaikin <yzaikin@google.com>
CC: Quentin Perret <qperret@google.com>
CC: Valentin Schneider <valentin.schneider@arm.com>
CC: Patrick Bellasi <patrick.bellasi@matbug.net>
CC: Pavan Kondeti <pkondeti@codeaurora.org>
CC: linux-doc@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux-fsdevel@vger.kernel.org
---

Changes in v2:
	* Do the lazy update in uclamp_rq_inc() (Thanks Quentin)
	* Rename to: sysctl_sched_rt_default_uclamp_util_min (Quentin)
	* uclamp_rt_sync_default_util_min() now is a static function in core.c
	  with no prototype export in sched.h
	* Added docs in sysctl/kernel.rst (patch 2) (Quentin)

v1 can be found here:

	https://lore.kernel.org/lkml/20191220164838.31619-1-qais.yousef@arm.com/

Summary of v1 discussion:

Patrick has voiced a concern about the approach. AFAIU, Patrick instead
proposes to split sysctl_sched_uclamp_util_min into two, one for RT and
another for CFS, and to use the RT constraint to limit how much boost RT
tasks get globally.

If my understanding is correct, the approach proposed by Patrick doesn't
work for what we want to achieve here. Besides, it breaks the ABI.

The global per-RT-task constraint doesn't work because if I want to
disable boosting for all RT tasks by default but still allow a handful of
critical ones to be boosted, the global constraint will render any
request via the sched_setattr() syscall a NOP.

So IIUC, we'll still _always_ boost RT tasks to max, but introduce a new
knob to cap and restrict this boost, which IMHO is a convoluted way to
disable the hardcoded max boost behavior.

This approach instead gives admins a direct control over the default boost
value for RT tasks, which is exactly what we want, without any level of
indirection. It converts a hardcoded value into a sysctl variable that
sysadmins can modify at runtime.


 include/linux/sched/sysctl.h |  1 +
 kernel/sched/core.c          | 66 +++++++++++++++++++++++++++++++++---
 kernel/sysctl.c              |  7 ++++
 3 files changed, 69 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f..91204480fabc 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -59,6 +59,7 @@ extern int sysctl_sched_rt_runtime;
 #ifdef CONFIG_UCLAMP_TASK
 extern unsigned int sysctl_sched_uclamp_util_min;
 extern unsigned int sysctl_sched_uclamp_util_max;
+extern unsigned int sysctl_sched_rt_default_uclamp_util_min;
 #endif
 
 #ifdef CONFIG_CFS_BANDWIDTH
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1a9983da4408..a726b26a5056 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -797,6 +797,27 @@ unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
 /* Max allowed maximum utilization */
 unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
 
+/*
+ * By default RT tasks run at the maximum performance point/capacity of the
+ * system. Uclamp enforces this by always setting UCLAMP_MIN of RT tasks to
+ * SCHED_CAPACITY_SCALE.
+ *
+ * This knob allows admins to change the default behavior when uclamp is being
+ * used. In battery powered devices, particularly, running at the maximum
+ * capacity and frequency will increase energy consumption and shorten the
+ * battery life.
+ *
+ * This knob only affects the default value RT has when a new RT task is
+ * forked or has just changed policy to RT, given the user hasn't modified the
+ * uclamp.min value of the task via sched_setattr().
+ *
+ * This knob will not override the system default sched_util_clamp_min defined
+ * above.
+ *
+ * Any modification is applied lazily on the next RT task wakeup.
+ */
+unsigned int sysctl_sched_rt_default_uclamp_util_min = SCHED_CAPACITY_SCALE;
+
 /* All clamps are required to be less or equal than these values */
 static struct uclamp_se uclamp_default[UCLAMP_CNT];
 
@@ -924,6 +945,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
 	return uc_req;
 }
 
+static void uclamp_rt_sync_default_util_min(struct task_struct *p)
+{
+	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
+
+	if (!uc_se->user_defined)
+		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
+}
+
 unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
 {
 	struct uclamp_se uc_eff;
@@ -1030,6 +1059,12 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 	if (unlikely(!p->sched_class->uclamp_enabled))
 		return;
 
+	/*
+	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
+	 * user, we apply any new value on the next wakeup, which is here.
+	 */
+	uclamp_rt_sync_default_util_min(p);
+
 	for_each_clamp_id(clamp_id)
 		uclamp_rq_inc_id(rq, p, clamp_id);
 
@@ -1121,12 +1156,13 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 				loff_t *ppos)
 {
 	bool update_root_tg = false;
-	int old_min, old_max;
+	int old_min, old_max, old_rt_min;
 	int result;
 
 	mutex_lock(&uclamp_mutex);
 	old_min = sysctl_sched_uclamp_util_min;
 	old_max = sysctl_sched_uclamp_util_max;
+	old_rt_min = sysctl_sched_rt_default_uclamp_util_min;
 
 	result = proc_dointvec(table, write, buffer, lenp, ppos);
 	if (result)
@@ -1134,12 +1170,23 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 	if (!write)
 		goto done;
 
+	/*
+	 * The new value will be applied to all RT tasks the next time they
+	 * wakeup, assuming the task is using the system default and not a user
+	 * specified value. In the latter we shall leave the value as the user
+	 * requested.
+	 */
 	if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
 	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
 		result = -EINVAL;
 		goto undo;
 	}
 
+	if (sysctl_sched_rt_default_uclamp_util_min > SCHED_CAPACITY_SCALE) {
+		result = -EINVAL;
+		goto undo;
+	}
+
 	if (old_min != sysctl_sched_uclamp_util_min) {
 		uclamp_se_set(&uclamp_default[UCLAMP_MIN],
 			      sysctl_sched_uclamp_util_min, false);
@@ -1165,6 +1212,7 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 undo:
 	sysctl_sched_uclamp_util_min = old_min;
 	sysctl_sched_uclamp_util_max = old_max;
+	sysctl_sched_rt_default_uclamp_util_min = old_rt_min;
 done:
 	mutex_unlock(&uclamp_mutex);
 
@@ -1207,9 +1255,13 @@ static void __setscheduler_uclamp(struct task_struct *p,
 		if (uc_se->user_defined)
 			continue;
 
-		/* By default, RT tasks always get 100% boost */
+		/*
+		 * By default, RT tasks always get 100% boost, which the admins
+		 * are allowed to change via
+		 * sysctl_sched_rt_default_uclamp_util_min knob.
+		 */
 		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
-			clamp_value = uclamp_none(UCLAMP_MAX);
+			clamp_value = sysctl_sched_rt_default_uclamp_util_min;
 
 		uclamp_se_set(uc_se, clamp_value, false);
 	}
@@ -1241,9 +1293,13 @@ static void uclamp_fork(struct task_struct *p)
 	for_each_clamp_id(clamp_id) {
 		unsigned int clamp_value = uclamp_none(clamp_id);
 
-		/* By default, RT tasks always get 100% boost */
+		/*
+		 * By default, RT tasks always get 100% boost, which the admins
+		 * are allowed to change via
+		 * sysctl_sched_rt_default_uclamp_util_min knob.
+		 */
 		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
-			clamp_value = uclamp_none(UCLAMP_MAX);
+			clamp_value = sysctl_sched_rt_default_uclamp_util_min;
 
 		uclamp_se_set(&p->uclamp_req[clamp_id], clamp_value, false);
 	}
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index ad5b88a53c5a..0272ae8c6147 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -465,6 +465,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sysctl_sched_uclamp_handler,
 	},
+	{
+		.procname	= "sched_rt_default_util_clamp_min",
+		.data		= &sysctl_sched_rt_default_uclamp_util_min,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_sched_uclamp_handler,
+	},
 #endif
 #ifdef CONFIG_SCHED_AUTOGROUP
 	{
-- 
2.17.1



* [PATCH 2/2] Documentation/sysctl: Document uclamp sysctl knobs
  2020-04-03 12:30 [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value Qais Yousef
@ 2020-04-03 12:30 ` Qais Yousef
  2020-04-14 18:21 ` [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value Patrick Bellasi
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 68+ messages in thread
From: Qais Yousef @ 2020-04-03 12:30 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Qais Yousef, Jonathan Corbet, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Quentin Perret,
	Valentin Schneider, Patrick Bellasi, Pavan Kondeti, linux-doc,
	linux-kernel, linux-fsdevel

Uclamp exposes 3 sysctl knobs:

	* sched_util_clamp_min
	* sched_util_clamp_max
	* sched_rt_default_util_clamp_min

Document them in sysctl/kernel.rst.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
CC: Jonathan Corbet <corbet@lwn.net>
CC: Juri Lelli <juri.lelli@redhat.com>
CC: Vincent Guittot <vincent.guittot@linaro.org>
CC: Dietmar Eggemann <dietmar.eggemann@arm.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ben Segall <bsegall@google.com>
CC: Mel Gorman <mgorman@suse.de>
CC: Luis Chamberlain <mcgrof@kernel.org>
CC: Kees Cook <keescook@chromium.org>
CC: Iurii Zaikin <yzaikin@google.com>
CC: Quentin Perret <qperret@google.com>
CC: Valentin Schneider <valentin.schneider@arm.com>
CC: Patrick Bellasi <patrick.bellasi@matbug.net>
CC: Pavan Kondeti <pkondeti@codeaurora.org>
CC: linux-doc@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux-fsdevel@vger.kernel.org
---
 Documentation/admin-guide/sysctl/kernel.rst | 47 +++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index def074807cee..5e974cd493b5 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -935,6 +935,53 @@ Enables/disables scheduler statistics. Enabling this feature
 incurs a small amount of overhead in the scheduler but is
 useful for debugging and performance tuning.
 
+sched_util_clamp_min:
+=====================
+
+Max allowed *minimum* utilization.
+
+Default value is SCHED_CAPACITY_SCALE (1024), which is the maximum possible
+value.
+
+It means that any requested uclamp.min value cannot be greater than
+sched_util_clamp_min, ie: it is restricted to the range
+[0:sched_util_clamp_min].
+
+sched_util_clamp_max:
+=====================
+
+Max allowed *maximum* utilization.
+
+Default value is SCHED_CAPACITY_SCALE (1024), which is the maximum possible
+value.
+
+It means that any requested uclamp.max value cannot be greater than
+sched_util_clamp_max, ie: it is restricted to the range
+[0:sched_util_clamp_max].
+
+sched_rt_default_util_clamp_min:
+================================
+
+By default Linux is tuned for performance, which means that RT tasks always
+run at the highest frequency and on the most capable (highest capacity) CPU
+(in heterogeneous systems).
+
+Uclamp achieves this by setting the requested uclamp.min of all RT tasks to
+SCHED_CAPACITY_SCALE (1024) by default, which effectively boosts the tasks to
+run at the highest frequency and biases them to run on the biggest CPU.
+
+This knob allows admins to change the default behavior when uclamp is being
+used. In battery powered devices particularly, running at the maximum
+capacity and frequency will increase energy consumption and shorten the battery
+life.
+
+This knob is only effective for RT tasks whose requested uclamp.min value has
+not been modified by the user via the sched_setattr() syscall.
+
+This knob will not escape the constraint imposed by sched_util_clamp_min
+defined above.
+
+Any modification is applied lazily on the next RT task wakeup.
 
 sg-big-buff:
 ============
-- 
2.17.1



* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-04-03 12:30 [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value Qais Yousef
  2020-04-03 12:30 ` [PATCH 2/2] Documentation/sysctl: Document uclamp sysctl knobs Qais Yousef
@ 2020-04-14 18:21 ` Patrick Bellasi
  2020-04-15  7:46   ` Patrick Bellasi
                     ` (2 more replies)
  2020-04-15 10:11 ` Quentin Perret
  2020-04-20  8:29 ` Dietmar Eggemann
  3 siblings, 3 replies; 68+ messages in thread
From: Patrick Bellasi @ 2020-04-14 18:21 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Pavan Kondeti, linux-doc,
	linux-kernel, linux-fsdevel

Hi Qais!

On 03-Apr 13:30, Qais Yousef wrote:

[...]

> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index d4f6215ee03f..91204480fabc 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -59,6 +59,7 @@ extern int sysctl_sched_rt_runtime;
>  #ifdef CONFIG_UCLAMP_TASK
>  extern unsigned int sysctl_sched_uclamp_util_min;
>  extern unsigned int sysctl_sched_uclamp_util_max;
> +extern unsigned int sysctl_sched_rt_default_uclamp_util_min;

nit-pick: I would prefer to keep the same prefix as the already
existing knobs, i.e. sysctl_sched_uclamp_util_min_rt

The same change for consistency should be applied to all the following
symbols related to "uclamp_util_min_rt".

NOTE: I would not use "default" as I think that what we are doing is
exactly force setting a user_defined value for all RT tasks. More on
that later...

>  #endif
>  
>  #ifdef CONFIG_CFS_BANDWIDTH
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1a9983da4408..a726b26a5056 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -797,6 +797,27 @@ unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
>  /* Max allowed maximum utilization */
>  unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
>  
> +/*
> + * By default RT tasks run at the maximum performance point/capacity of the
> + * system. Uclamp enforces this by always setting UCLAMP_MIN of RT tasks to
> + * SCHED_CAPACITY_SCALE.
> + *
> + * This knob allows admins to change the default behavior when uclamp is being
> + * used. In battery powered devices, particularly, running at the maximum
> + * capacity and frequency will increase energy consumption and shorten the
> + * battery life.
> + *
> + * This knob only affects the default value RT has when a new RT task is
> + * forked or has just changed policy to RT, given the user hasn't modified the
> + * uclamp.min value of the task via sched_setattr().
> + *
> + * This knob will not override the system default sched_util_clamp_min defined
> + * above.
> + *
> + * Any modification is applied lazily on the next RT task wakeup.
> + */
> +unsigned int sysctl_sched_rt_default_uclamp_util_min = SCHED_CAPACITY_SCALE;
> +
>  /* All clamps are required to be less or equal than these values */
>  static struct uclamp_se uclamp_default[UCLAMP_CNT];
>  
> @@ -924,6 +945,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
>  	return uc_req;
>  }
>  
> +static void uclamp_rt_sync_default_util_min(struct task_struct *p)
> +{
> +	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];

Don't we have to filter for RT tasks only here?

> +
> +	if (!uc_se->user_defined)
> +		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);

Here you are actually setting a user-requested value, why not marking
it as that, i.e. by using true for the last parameter?

Moreover, by keeping user_defined=false I think you are not getting
what you want for RT tasks running in a nested cgroup.

Let's say a subgroup still has the util_min=1024 inherited from the
system defaults; in uclamp_tg_restrict() we will still return the max
value and not what you requested, won't we?

IOW, what about:

---8<---
static void uclamp_sync_util_min_rt(struct task_struct *p)
{
	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];

	if (likely(uc_se->user_defined || !rt_task(p)))
		return;

	uclamp_se_set(uc_se, sysctl_sched_uclamp_util_min_rt, true);
}
---8<---


> +}
> +
>  unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
>  {
>  	struct uclamp_se uc_eff;
> @@ -1030,6 +1059,12 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>  	if (unlikely(!p->sched_class->uclamp_enabled))
>  		return;
>  
> +	/*
> +	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
> +	 * user, we apply any new value on the next wakeup, which is here.
> +	 */
> +	uclamp_rt_sync_default_util_min(p);
> +
>  	for_each_clamp_id(clamp_id)
>  		uclamp_rq_inc_id(rq, p, clamp_id);
>  
> @@ -1121,12 +1156,13 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
>  				loff_t *ppos)
>  {
>  	bool update_root_tg = false;
> -	int old_min, old_max;
> +	int old_min, old_max, old_rt_min;
>  	int result;
>  
>  	mutex_lock(&uclamp_mutex);
>  	old_min = sysctl_sched_uclamp_util_min;
>  	old_max = sysctl_sched_uclamp_util_max;
> +	old_rt_min = sysctl_sched_rt_default_uclamp_util_min;

Perhaps it's just my OCD, but doesn't "old_min_rt" read better?

>  
>  	result = proc_dointvec(table, write, buffer, lenp, ppos);
>  	if (result)
> @@ -1134,12 +1170,23 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
>  	if (!write)
>  		goto done;
>  
> +	/*
> +	 * The new value will be applied to all RT tasks the next time they
> +	 * wakeup, assuming the task is using the system default and not a user
> +	 * specified value. In the latter we shall leave the value as the user
> +	 * requested.
> +	 */

Should not this comment go before the next block?

>  	if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
>  	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
>  		result = -EINVAL;
>  		goto undo;
>  	}
>  
> +	if (sysctl_sched_rt_default_uclamp_util_min > SCHED_CAPACITY_SCALE) {
> +		result = -EINVAL;
> +		goto undo;
> +	}
> +
>  	if (old_min != sysctl_sched_uclamp_util_min) {
>  		uclamp_se_set(&uclamp_default[UCLAMP_MIN],
>  			      sysctl_sched_uclamp_util_min, false);
> @@ -1165,6 +1212,7 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
>  undo:
>  	sysctl_sched_uclamp_util_min = old_min;
>  	sysctl_sched_uclamp_util_max = old_max;
> +	sysctl_sched_rt_default_uclamp_util_min = old_rt_min;
>  done:
>  	mutex_unlock(&uclamp_mutex);
>  
> @@ -1207,9 +1255,13 @@ static void __setscheduler_uclamp(struct task_struct *p,
>  		if (uc_se->user_defined)
>  			continue;
>  
> -		/* By default, RT tasks always get 100% boost */
> +		/*
> +		 * By default, RT tasks always get 100% boost, which the admins
> +		 * are allowed to change via
> +		 * sysctl_sched_rt_default_uclamp_util_min knob.
> +		 */
>  		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> -			clamp_value = uclamp_none(UCLAMP_MAX);
> +			clamp_value = sysctl_sched_rt_default_uclamp_util_min;
>
>  		uclamp_se_set(uc_se, clamp_value, false);
>  	}
> @@ -1241,9 +1293,13 @@ static void uclamp_fork(struct task_struct *p)
>  	for_each_clamp_id(clamp_id) {
>  		unsigned int clamp_value = uclamp_none(clamp_id);
>  
> -		/* By default, RT tasks always get 100% boost */
> +		/*
> +		 * By default, RT tasks always get 100% boost, which the admins
> +		 * are allowed to change via
> +		 * sysctl_sched_rt_default_uclamp_util_min knob.
> +		 */
>  		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> -			clamp_value = uclamp_none(UCLAMP_MAX);
> +			clamp_value = sysctl_sched_rt_default_uclamp_util_min;
>  

This is not required, look at this Quentin's patch:

   Message-ID: <20200414161320.251897-1-qperret@google.com>
   https://lore.kernel.org/lkml/20200414161320.251897-1-qperret@google.com/

>  		uclamp_se_set(&p->uclamp_req[clamp_id], clamp_value, false);
>  	}
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index ad5b88a53c5a..0272ae8c6147 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -465,6 +465,13 @@ static struct ctl_table kern_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= sysctl_sched_uclamp_handler,
>  	},
> +	{
> +		.procname	= "sched_rt_default_util_clamp_min",
> +		.data		= &sysctl_sched_rt_default_uclamp_util_min,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= sysctl_sched_uclamp_handler,
> +	},
>  #endif
>  #ifdef CONFIG_SCHED_AUTOGROUP
>  	{

Best,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-04-14 18:21 ` [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value Patrick Bellasi
@ 2020-04-15  7:46   ` Patrick Bellasi
  2020-04-20 15:04     ` Qais Yousef
  2020-04-20  8:24   ` Dietmar Eggemann
  2020-04-20 14:50   ` Qais Yousef
  2 siblings, 1 reply; 68+ messages in thread
From: Patrick Bellasi @ 2020-04-15  7:46 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Pavan Kondeti, linux-doc,
	linux-kernel, linux-fsdevel

On 14-Apr 20:21, Patrick Bellasi wrote:
> Hi Qais!

Hello again!

> On 03-Apr 13:30, Qais Yousef wrote:
> 
> [...]
> 
> > diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> > index d4f6215ee03f..91204480fabc 100644
> > --- a/include/linux/sched/sysctl.h
> > +++ b/include/linux/sched/sysctl.h
> > @@ -59,6 +59,7 @@ extern int sysctl_sched_rt_runtime;
> >  #ifdef CONFIG_UCLAMP_TASK
> >  extern unsigned int sysctl_sched_uclamp_util_min;
> >  extern unsigned int sysctl_sched_uclamp_util_max;
> > +extern unsigned int sysctl_sched_rt_default_uclamp_util_min;
> 
> nit-pick: I would prefer to keep the same prefix as the already
> existing knobs, i.e. sysctl_sched_uclamp_util_min_rt
> 
> The same change for consistency should be applied to all the following
> symbols related to "uclamp_util_min_rt".
> 
> NOTE: I would not use "default" as I think that what we are doing is
> exactly force setting a user_defined value for all RT tasks. More on
> that later...

Had a second thought on that...

> 
> >  #endif
> >  
> >  #ifdef CONFIG_CFS_BANDWIDTH
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 1a9983da4408..a726b26a5056 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -797,6 +797,27 @@ unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
> >  /* Max allowed maximum utilization */
> >  unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
> >  
> > +/*
> > + * By default RT tasks run at the maximum performance point/capacity of the
> > + * system. Uclamp enforces this by always setting UCLAMP_MIN of RT tasks to
> > + * SCHED_CAPACITY_SCALE.
> > + *
> > + * This knob allows admins to change the default behavior when uclamp is being
> > + * used. In battery powered devices, particularly, running at the maximum
> > + * capacity and frequency will increase energy consumption and shorten the
> > + * battery life.
> > + *
> > + * This knob only affects the default value RT has when a new RT task is
> > + * forked or has just changed policy to RT, given the user hasn't modified the
> > + * uclamp.min value of the task via sched_setattr().
> > + *
> > + * This knob will not override the system default sched_util_clamp_min defined
> > + * above.
> > + *
> > + * Any modification is applied lazily on the next RT task wakeup.
> > + */
> > +unsigned int sysctl_sched_rt_default_uclamp_util_min = SCHED_CAPACITY_SCALE;
> > +
> >  /* All clamps are required to be less or equal than these values */
> >  static struct uclamp_se uclamp_default[UCLAMP_CNT];
> >  
> > @@ -924,6 +945,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
> >  	return uc_req;
> >  }
> >  
> > +static void uclamp_rt_sync_default_util_min(struct task_struct *p)
> > +{
> > +	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> 
> Don't we have to filter for RT tasks only here?

I think this is still a valid point.

> > +
> > +	if (!uc_se->user_defined)
> > +		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
> 
> Here you are actually setting a user-requested value, why not marking
> it as that, i.e. by using true for the last parameter?

I think you don't want to set user_defined to ensure we keep updating
the value every time the task is enqueued, in case the "default"
should be updated at run-time.

> Moreover, by keeping user_defined=false I think you are not getting
> what you want for RT tasks running in a nested cgroup.
> 
> Let's say a subgroup still has the util_min=1024 inherited from the
> system defaults; in uclamp_tg_restrict() we will still return the max
> value and not what you requested, won't we?

This is also not completely true, since you could assume that if an RT
task is running in a nested group with a non-tuned uclamp_max then
that's probably what we want.

There is still a small concern due to the fact that we don't distinguish
CFS and RT tasks when it comes to cgroup clamp values, which could
potentially still generate the same issue. Let's say, for example, you
want to allow CFS tasks to be boosted to max (util_min=1024) but still
want to run RT tasks only at lower OPPs.
Not sure if that could be a use case tho.
 
> IOW, what about:
> 
> ---8<---
> static void uclamp_sync_util_min_rt(struct task_struct *p)
> {
> 	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> 
>   if (likely(uc_se->user_defined || !rt_task(p)))
>     return;
> 
>   uclamp_se_set(uc_se, sysctl_sched_uclamp_util_min_rt, true);
                                                          ^^^^
                     This should remain false as in your patch.
> }
> ---8<---

Still, I was thinking that perhaps it would be better to massage the
code above into the generation of the effective value, in uclamp_eff_get().

Since you wanna (possibly) update the value at each enqueue time,
that's what conceptually is represented by the "effective clamp
value": a value that is computed by definition at enqueue time by
aggregating all the requests and constraints.

Poking at the effective value instead of the requested value will also
fix the ambiguity above, where we set a "requested value" with
user_defined=false.

> > +}
> > +
> >  unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
> >  {
> >  	struct uclamp_se uc_eff;
> > @@ -1030,6 +1059,12 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> >  	if (unlikely(!p->sched_class->uclamp_enabled))
> >  		return;
> >  
> > +	/*
> > +	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
> > +	 * user, we apply any new value on the next wakeup, which is here.
> > +	 */
> > +	uclamp_rt_sync_default_util_min(p);
> > +
> >  	for_each_clamp_id(clamp_id)
> >  		uclamp_rq_inc_id(rq, p, clamp_id);
> >  
> > @@ -1121,12 +1156,13 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> >  				loff_t *ppos)
> >  {
> >  	bool update_root_tg = false;
> > -	int old_min, old_max;
> > +	int old_min, old_max, old_rt_min;
> >  	int result;
> >  
> >  	mutex_lock(&uclamp_mutex);
> >  	old_min = sysctl_sched_uclamp_util_min;
> >  	old_max = sysctl_sched_uclamp_util_max;
> > +	old_rt_min = sysctl_sched_rt_default_uclamp_util_min;
> 
> Perhaps it's just my OCD, but doesn't "old_min_rt" read better?
> 
> >  
> >  	result = proc_dointvec(table, write, buffer, lenp, ppos);
> >  	if (result)
> > @@ -1134,12 +1170,23 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> >  	if (!write)
> >  		goto done;
> >  
> > +	/*
> > +	 * The new value will be applied to all RT tasks the next time they
> > +	 * wakeup, assuming the task is using the system default and not a user
> > +	 * specified value. In the latter we shall leave the value as the user
> > +	 * requested.
> > +	 */
> 
> Should not this comment go before the next block?
> 
> >  	if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
> >  	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
> >  		result = -EINVAL;
> >  		goto undo;
> >  	}
> >  
> > +	if (sysctl_sched_rt_default_uclamp_util_min > SCHED_CAPACITY_SCALE) {
> > +		result = -EINVAL;
> > +		goto undo;
> > +	}
> > +
> >  	if (old_min != sysctl_sched_uclamp_util_min) {
> >  		uclamp_se_set(&uclamp_default[UCLAMP_MIN],
> >  			      sysctl_sched_uclamp_util_min, false);
> > @@ -1165,6 +1212,7 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> >  undo:
> >  	sysctl_sched_uclamp_util_min = old_min;
> >  	sysctl_sched_uclamp_util_max = old_max;
> > +	sysctl_sched_rt_default_uclamp_util_min = old_rt_min;
> >  done:
> >  	mutex_unlock(&uclamp_mutex);
> >  
> > @@ -1207,9 +1255,13 @@ static void __setscheduler_uclamp(struct task_struct *p,
> >  		if (uc_se->user_defined)
> >  			continue;
> >  
> > -		/* By default, RT tasks always get 100% boost */
> > +		/*
> > +		 * By default, RT tasks always get 100% boost, which the admins
> > +		 * are allowed to change via
> > +		 * sysctl_sched_rt_default_uclamp_util_min knob.
> > +		 */
> >  		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> > -			clamp_value = uclamp_none(UCLAMP_MAX);
> > +			clamp_value = sysctl_sched_rt_default_uclamp_util_min;
> >
> >  		uclamp_se_set(uc_se, clamp_value, false);
> >  	}
> > @@ -1241,9 +1293,13 @@ static void uclamp_fork(struct task_struct *p)
> >  	for_each_clamp_id(clamp_id) {
> >  		unsigned int clamp_value = uclamp_none(clamp_id);
> >  
> > -		/* By default, RT tasks always get 100% boost */
> > +		/*
> > +		 * By default, RT tasks always get 100% boost, which the admins
> > +		 * are allowed to change via
> > +		 * sysctl_sched_rt_default_uclamp_util_min knob.
> > +		 */
> >  		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> > -			clamp_value = uclamp_none(UCLAMP_MAX);
> > +			clamp_value = sysctl_sched_rt_default_uclamp_util_min;
> >  
> 
> This is not required, look at this Quentin's patch:
> 
>    Message-ID: <20200414161320.251897-1-qperret@google.com>
>    https://lore.kernel.org/lkml/20200414161320.251897-1-qperret@google.com/
> 
> >  		uclamp_se_set(&p->uclamp_req[clamp_id], clamp_value, false);
> >  	}
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index ad5b88a53c5a..0272ae8c6147 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -465,6 +465,13 @@ static struct ctl_table kern_table[] = {
> >  		.mode		= 0644,
> >  		.proc_handler	= sysctl_sched_uclamp_handler,
> >  	},
> > +	{
> > +		.procname	= "sched_rt_default_util_clamp_min",
> > +		.data		= &sysctl_sched_rt_default_uclamp_util_min,
> > +		.maxlen		= sizeof(unsigned int),
> > +		.mode		= 0644,
> > +		.proc_handler	= sysctl_sched_uclamp_handler,
> > +	},
> >  #endif
> >  #ifdef CONFIG_SCHED_AUTOGROUP
> >  	{

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-04-03 12:30 [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value Qais Yousef
  2020-04-03 12:30 ` [PATCH 2/2] Documentation/sysctl: Document uclamp sysctl knobs Qais Yousef
  2020-04-14 18:21 ` [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value Patrick Bellasi
@ 2020-04-15 10:11 ` Quentin Perret
  2020-04-20 15:08   ` Qais Yousef
  2020-04-20  8:29 ` Dietmar Eggemann
  3 siblings, 1 reply; 68+ messages in thread
From: Quentin Perret @ 2020-04-15 10:11 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Valentin Schneider, Patrick Bellasi, Pavan Kondeti, linux-doc,
	linux-kernel, linux-fsdevel

Hi Qais,

On Friday 03 Apr 2020 at 13:30:19 (+0100), Qais Yousef wrote:
<snip>
> +	/*
> +	 * The new value will be applied to all RT tasks the next time they
> +	 * wakeup, assuming the task is using the system default and not a user
> +	 * specified value. In the latter we shall leave the value as the user
> +	 * requested.
> +	 */
>  	if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
>  	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
>  		result = -EINVAL;
>  		goto undo;
>  	}
>  
> +	if (sysctl_sched_rt_default_uclamp_util_min > SCHED_CAPACITY_SCALE) {
> +		result = -EINVAL;
> +		goto undo;
> +	}

Hmm, checking:

	if (sysctl_sched_rt_default_uclamp_util_min > sysctl_sched_uclamp_util_min)

would probably make sense too, but then that would make writing in
sysctl_sched_uclamp_util_min cumbersome for sysadmins as they'd need to
lower the rt default first. Is that the reason for checking against
SCHED_CAPACITY_SCALE? That might deserve a comment or something.

<snip>
> @@ -1241,9 +1293,13 @@ static void uclamp_fork(struct task_struct *p)
>  	for_each_clamp_id(clamp_id) {
>  		unsigned int clamp_value = uclamp_none(clamp_id);
>  
> -		/* By default, RT tasks always get 100% boost */
> +		/*
> +		 * By default, RT tasks always get 100% boost, which the admins
> +		 * are allowed to change via
> +		 * sysctl_sched_rt_default_uclamp_util_min knob.
> +		 */
>  		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> -			clamp_value = uclamp_none(UCLAMP_MAX);
> +			clamp_value = sysctl_sched_rt_default_uclamp_util_min;
>  
>  		uclamp_se_set(&p->uclamp_req[clamp_id], clamp_value, false);
>  	}

And that, as per 20200414161320.251897-1-qperret@google.com, should not
be there :)

Otherwise the patch pretty looks good to me!

Thanks,
Quentin


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-04-14 18:21 ` [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value Patrick Bellasi
  2020-04-15  7:46   ` Patrick Bellasi
@ 2020-04-20  8:24   ` Dietmar Eggemann
  2020-04-20 15:19     ` Qais Yousef
  2020-04-20 14:50   ` Qais Yousef
  2 siblings, 1 reply; 68+ messages in thread
From: Dietmar Eggemann @ 2020-04-20  8:24 UTC (permalink / raw)
  To: Patrick Bellasi, Qais Yousef
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Quentin Perret,
	Valentin Schneider, Pavan Kondeti, linux-doc, linux-kernel,
	linux-fsdevel

On 14/04/2020 20:21, Patrick Bellasi wrote:
> Hi Qais!
> 
> On 03-Apr 13:30, Qais Yousef wrote:

[...]

>> @@ -924,6 +945,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
>>  	return uc_req;
>>  }
>>  
>> +static void uclamp_rt_sync_default_util_min(struct task_struct *p)
>> +{
>> +	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> 
> Don't we have to filter for RT tasks only here?

I think so. It's probably because it got moved from rt.c to core.c.

[...]

>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index ad5b88a53c5a..0272ae8c6147 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -465,6 +465,13 @@ static struct ctl_table kern_table[] = {
>>  		.mode		= 0644,
>>  		.proc_handler	= sysctl_sched_uclamp_handler,
>>  	},
>> +	{
>> +		.procname	= "sched_rt_default_util_clamp_min",

root@h960:~# find / -name "*util_clamp*"
/proc/sys/kernel/sched_rt_default_util_clamp_min
/proc/sys/kernel/sched_util_clamp_max
/proc/sys/kernel/sched_util_clamp_min

IMHO, keeping the common 'sched_util_clamp_' would be helpful here, e.g.

/proc/sys/kernel/sched_util_clamp_rt_default_min

[...]


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-04-03 12:30 [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value Qais Yousef
                   ` (2 preceding siblings ...)
  2020-04-15 10:11 ` Quentin Perret
@ 2020-04-20  8:29 ` Dietmar Eggemann
  2020-04-20 15:13   ` Qais Yousef
  3 siblings, 1 reply; 68+ messages in thread
From: Dietmar Eggemann @ 2020-04-20  8:29 UTC (permalink / raw)
  To: Qais Yousef, Ingo Molnar, Peter Zijlstra
  Cc: Jonathan Corbet, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Ben Segall, Mel Gorman, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Quentin Perret, Valentin Schneider,
	Patrick Bellasi, Pavan Kondeti, linux-doc, linux-kernel,
	linux-fsdevel

On 03.04.20 14:30, Qais Yousef wrote:

[...]

> @@ -924,6 +945,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
>  	return uc_req;
>  }
>  
> +static void uclamp_rt_sync_default_util_min(struct task_struct *p)
> +{
> +	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> +
> +	if (!uc_se->user_defined)
> +		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
> +}
> +
>  unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
>  {
>  	struct uclamp_se uc_eff;
> @@ -1030,6 +1059,12 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>  	if (unlikely(!p->sched_class->uclamp_enabled))
>  		return;
>  
> +	/*
> +	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
> +	 * user, we apply any new value on the next wakeup, which is here.
> +	 */
> +	uclamp_rt_sync_default_util_min(p);
> +

Does this have to be an extra function? Can we not reuse
uclamp_tg_restrict() by slightly renaming it to uclamp_restrict()?

This function would then deal with enforcing restrictions, whether related
to the system and taskgroup hierarchy or to default values (the latter only
for the RT min right now, since the others are fixed).

uclamp_eff_get() -> uclamp_restrict() is called from:

  'enqueue_task(), uclamp_update_active() -> uclamp_rq_inc() -> uclamp_rq_inc_id()' and

  'task_fits_capacity() -> clamp_task_util(), rt_task_fits_capacity() -> uclamp_eff_value()' and

  'schedutil_cpu_util(), find_energy_efficient_cpu() -> uclamp_rq_util_with() -> uclamp_eff_value()'

so there would be more check-points than the one in 'enqueue_task() -> uclamp_rq_inc()' now.

Only lightly tested:

---8<---

From: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date: Sun, 19 Apr 2020 01:20:17 +0200
Subject: [PATCH] sched/core: uclamp: Move uclamp_rt_sync_default_util_min()
 into uclamp_tg_restrict()

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/core.c | 34 +++++++++++++++-------------------
 1 file changed, 15 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8f4e0d5c7daf..6802113d6d4b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -899,12 +899,22 @@ unsigned int uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
 }
 
 static inline struct uclamp_se
-uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
+uclamp_restrict(struct task_struct *p, enum uclamp_id clamp_id)
 {
-	struct uclamp_se uc_req = p->uclamp_req[clamp_id];
-#ifdef CONFIG_UCLAMP_TASK_GROUP
-	struct uclamp_se uc_max;
+	struct uclamp_se uc_req, __maybe_unused uc_max;
+
+	if (unlikely(rt_task(p)) && clamp_id == UCLAMP_MIN &&
+	    !uc_req.user_defined) {
+		struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
+		int rt_min = sysctl_sched_rt_default_uclamp_util_min;
+
+		if (uc_se->value != rt_min)
+			uclamp_se_set(uc_se, rt_min, false);
+	}
 
+	uc_req = p->uclamp_req[clamp_id];
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
 	/*
 	 * Tasks in autogroups or root task group will be
 	 * restricted by system defaults.
@@ -933,7 +943,7 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
 static inline struct uclamp_se
 uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
 {
-	struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
+	struct uclamp_se uc_req = uclamp_restrict(p, clamp_id);
 	struct uclamp_se uc_max = uclamp_default[clamp_id];
 
 	/* System default restrictions always apply */
@@ -943,14 +953,6 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
 	return uc_req;
 }
 
-static void uclamp_rt_sync_default_util_min(struct task_struct *p)
-{
-	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
-
-	if (!uc_se->user_defined)
-		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
-}
-
 unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
 {
 	struct uclamp_se uc_eff;
@@ -1057,12 +1059,6 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 	if (unlikely(!p->sched_class->uclamp_enabled))
 		return;
 
-	/*
-	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
-	 * user, we apply any new value on the next wakeup, which is here.
-	 */
-	uclamp_rt_sync_default_util_min(p);
-
 	for_each_clamp_id(clamp_id)
 		uclamp_rq_inc_id(rq, p, clamp_id);
 
-- 
2.17.1


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-04-14 18:21 ` [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value Patrick Bellasi
  2020-04-15  7:46   ` Patrick Bellasi
  2020-04-20  8:24   ` Dietmar Eggemann
@ 2020-04-20 14:50   ` Qais Yousef
  2 siblings, 0 replies; 68+ messages in thread
From: Qais Yousef @ 2020-04-20 14:50 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Pavan Kondeti, linux-doc,
	linux-kernel, linux-fsdevel

Hi Patrick!

On 04/14/20 20:21, Patrick Bellasi wrote:
> Hi Qais!
> 
> On 03-Apr 13:30, Qais Yousef wrote:
> 
> [...]
> 
> > diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> > index d4f6215ee03f..91204480fabc 100644
> > --- a/include/linux/sched/sysctl.h
> > +++ b/include/linux/sched/sysctl.h
> > @@ -59,6 +59,7 @@ extern int sysctl_sched_rt_runtime;
> >  #ifdef CONFIG_UCLAMP_TASK
> >  extern unsigned int sysctl_sched_uclamp_util_min;
> >  extern unsigned int sysctl_sched_uclamp_util_max;
> > +extern unsigned int sysctl_sched_rt_default_uclamp_util_min;
> 
> nit-pick: I would prefer to keep the same prefix of the already
> existing knobs, i.e. sysctl_sched_uclamp_util_min_rt
> 
> The same change for consistency should be applied to all the following
> symbols related to "uclamp_util_min_rt".

All rt sysctl knobs are prefixed with 'sched_rt', so I used this for
consistency.

> 
> NOTE: I would not use "default" as I think that what we are doing is

The 'default' was suggested by Quentin as it felt more descriptive to him. I
took his 'user' point of view.

I don't mind or care about the name ultimately. This symbol is internal and not
exported to userspace.

I just think sticking to sched_rt for consistency is better. Beyond that, I'm
happy to use whatever name reviewers agree on. I think Dietmar has another
suggestion for this too.

> exactly force setting a user_defined value for all RT tasks. More on
> that later...

We are NOT force setting the user_defined value. The whole point is to have
a variable default behavior while still allowing the user to override this
default. It works exactly as it does today, except that the __hardcoded__
value is no longer hardcoded and can be modified by sysadmins at runtime.
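The distinction can be sketched in plain C. This is a hypothetical,
userspace-only model of the kernel's struct uclamp_se and of the proposed
sysctl knob; the names and values are illustrative, not the kernel
implementation:

```c
#include <assert.h>
#include <stdbool.h>

#define SCHED_CAPACITY_SCALE 1024

/*
 * Hypothetical userspace model of the kernel's per-task clamp request.
 * 'user_defined' marks a value explicitly set via sched_setattr().
 */
struct uclamp_se {
	unsigned int value;
	bool user_defined;
};

/* Runtime-tunable default, standing in for the proposed sysctl knob. */
static unsigned int rt_default_util_min = SCHED_CAPACITY_SCALE;

/*
 * Refresh a request that still tracks the system default; a value the
 * user explicitly requested is left untouched.
 */
static void sync_rt_default_util_min(struct uclamp_se *uc_se)
{
	if (!uc_se->user_defined)
		uc_se->value = rt_default_util_min;
}
```

A task that never called sched_setattr() keeps following the knob as the
admin changes it, while an explicit user request is preserved.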

> 
> >  #endif
> >  
> >  #ifdef CONFIG_CFS_BANDWIDTH
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 1a9983da4408..a726b26a5056 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -797,6 +797,27 @@ unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
> >  /* Max allowed maximum utilization */
> >  unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
> >  
> > +/*
> > + * By default RT tasks run at the maximum performance point/capacity of the
> > + * system. Uclamp enforces this by always setting UCLAMP_MIN of RT tasks to
> > + * SCHED_CAPACITY_SCALE.
> > + *
> > + * This knob allows admins to change the default behavior when uclamp is being
> > + * used. In battery powered devices, particularly, running at the maximum
> > + * capacity and frequency will increase energy consumption and shorten the
> > + * battery life.
> > + *
> > + * This knob only affects the default value RT has when a new RT task is
> > + * forked or has just changed policy to RT, given the user hasn't modified the
> > + * uclamp.min value of the task via sched_setattr().
> > + *
> > + * This knob will not override the system default sched_util_clamp_min defined
> > + * above.
> > + *
> > + * Any modification is applied lazily on the next RT task wakeup.
> > + */
> > +unsigned int sysctl_sched_rt_default_uclamp_util_min = SCHED_CAPACITY_SCALE;
> > +
> >  /* All clamps are required to be less or equal than these values */
> >  static struct uclamp_se uclamp_default[UCLAMP_CNT];
> >  
> > @@ -924,6 +945,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
> >  	return uc_req;
> >  }
> >  
> > +static void uclamp_rt_sync_default_util_min(struct task_struct *p)
> > +{
> > +	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> 
> Don't we have to filter for RT tasks only here?

Indeed!

> 
> > +
> > +	if (!uc_se->user_defined)
> > +		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
> 
> Here you are actually setting a user-requested value, why not marking
> it as that, i.e. by using true for the last parameter?

I will have to bounce the question back. The original code passed false.
So why didn't you pass true then? The fact that the value is no longer
hardcoded with this patch doesn't mean the purpose of this code has changed.

> 
> Moreover, by keeping user_defined=false I think you are not getting
> what you want for RT tasks running in a nested cgroup.
> 
> Let say a subgroup is still with the util_min=1024 inherited from the
> system defaults, in uclamp_tg_restrict() we will still return the max
> value and not what you requested for. Isn't it?

With the current code (without my patch), RT tasks' util_min is 1024 by
default and user_defined=false. How does it behave then? My patch should
retain this behavior; I can't see how it breaks it... :-/

> 
> IOW, what about:
> 
> ---8<---
> static void uclamp_sync_util_min_rt(struct task_struct *p)
> {
> 	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> 
>   if (likely(uc_se->user_defined || !rt_task(p)))
>     return;
> 
>   uclamp_se_set(uc_se, sysctl_sched_uclamp_util_min_rt, true);
> }
> ---8<---

See above.

> 
> 
> > +}
> > +
> >  unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
> >  {
> >  	struct uclamp_se uc_eff;
> > @@ -1030,6 +1059,12 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> >  	if (unlikely(!p->sched_class->uclamp_enabled))
> >  		return;
> >  
> > +	/*
> > +	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
> > +	 * user, we apply any new value on the next wakeup, which is here.
> > +	 */
> > +	uclamp_rt_sync_default_util_min(p);
> > +
> >  	for_each_clamp_id(clamp_id)
> >  		uclamp_rq_inc_id(rq, p, clamp_id);
> >  
> > @@ -1121,12 +1156,13 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> >  				loff_t *ppos)
> >  {
> >  	bool update_root_tg = false;
> > -	int old_min, old_max;
> > +	int old_min, old_max, old_rt_min;
> >  	int result;
> >  
> >  	mutex_lock(&uclamp_mutex);
> >  	old_min = sysctl_sched_uclamp_util_min;
> >  	old_max = sysctl_sched_uclamp_util_max;
> > +	old_rt_min = sysctl_sched_rt_default_uclamp_util_min;
> 
> Perhaps it's just my OCD but, doesn't "old_min_rt" read better?

:D

> 
> >  
> >  	result = proc_dointvec(table, write, buffer, lenp, ppos);
> >  	if (result)
> > @@ -1134,12 +1170,23 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> >  	if (!write)
> >  		goto done;
> >  
> > +	/*
> > +	 * The new value will be applied to all RT tasks the next time they
> > +	 * wakeup, assuming the task is using the system default and not a user
> > +	 * specified value. In the latter we shall leave the value as the user
> > +	 * requested.
> > +	 */
> 
> Should not this comment go before the next block?

+1

> 
> >  	if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
> >  	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
> >  		result = -EINVAL;
> >  		goto undo;
> >  	}
> >  
> > +	if (sysctl_sched_rt_default_uclamp_util_min > SCHED_CAPACITY_SCALE) {
> > +		result = -EINVAL;
> > +		goto undo;
> > +	}
> > +
> >  	if (old_min != sysctl_sched_uclamp_util_min) {
> >  		uclamp_se_set(&uclamp_default[UCLAMP_MIN],
> >  			      sysctl_sched_uclamp_util_min, false);
> > @@ -1165,6 +1212,7 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> >  undo:
> >  	sysctl_sched_uclamp_util_min = old_min;
> >  	sysctl_sched_uclamp_util_max = old_max;
> > +	sysctl_sched_rt_default_uclamp_util_min = old_rt_min;
> >  done:
> >  	mutex_unlock(&uclamp_mutex);
> >  
> > @@ -1207,9 +1255,13 @@ static void __setscheduler_uclamp(struct task_struct *p,
> >  		if (uc_se->user_defined)
> >  			continue;
> >  
> > -		/* By default, RT tasks always get 100% boost */
> > +		/*
> > +		 * By default, RT tasks always get 100% boost, which the admins
> > +		 * are allowed to change via
> > +		 * sysctl_sched_rt_default_uclamp_util_min knob.
> > +		 */
> >  		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> > -			clamp_value = uclamp_none(UCLAMP_MAX);
> > +			clamp_value = sysctl_sched_rt_default_uclamp_util_min;
> >
> >  		uclamp_se_set(uc_se, clamp_value, false);
> >  	}
> > @@ -1241,9 +1293,13 @@ static void uclamp_fork(struct task_struct *p)
> >  	for_each_clamp_id(clamp_id) {
> >  		unsigned int clamp_value = uclamp_none(clamp_id);
> >  
> > -		/* By default, RT tasks always get 100% boost */
> > +		/*
> > +		 * By default, RT tasks always get 100% boost, which the admins
> > +		 * are allowed to change via
> > +		 * sysctl_sched_rt_default_uclamp_util_min knob.
> > +		 */
> >  		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> > -			clamp_value = uclamp_none(UCLAMP_MAX);
> > +			clamp_value = sysctl_sched_rt_default_uclamp_util_min;
> >  
> 
> This is not required, look at this Quentin's patch:
> 
>    Message-ID: <20200414161320.251897-1-qperret@google.com>
>    https://lore.kernel.org/lkml/20200414161320.251897-1-qperret@google.com/

Yep saw it.

Thanks for taking the time to review!

Cheers

--
Qais Yousef

> 
> >  		uclamp_se_set(&p->uclamp_req[clamp_id], clamp_value, false);
> >  	}
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index ad5b88a53c5a..0272ae8c6147 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -465,6 +465,13 @@ static struct ctl_table kern_table[] = {
> >  		.mode		= 0644,
> >  		.proc_handler	= sysctl_sched_uclamp_handler,
> >  	},
> > +	{
> > +		.procname	= "sched_rt_default_util_clamp_min",
> > +		.data		= &sysctl_sched_rt_default_uclamp_util_min,
> > +		.maxlen		= sizeof(unsigned int),
> > +		.mode		= 0644,
> > +		.proc_handler	= sysctl_sched_uclamp_handler,
> > +	},
> >  #endif
> >  #ifdef CONFIG_SCHED_AUTOGROUP
> >  	{
> 
> Best,
> Patrick
> 
> -- 
> #include <best/regards.h>
> 
> Patrick Bellasi


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-04-15  7:46   ` Patrick Bellasi
@ 2020-04-20 15:04     ` Qais Yousef
  0 siblings, 0 replies; 68+ messages in thread
From: Qais Yousef @ 2020-04-20 15:04 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Pavan Kondeti, linux-doc,
	linux-kernel, linux-fsdevel

On 04/15/20 09:46, Patrick Bellasi wrote:

[...]

> > > diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> > > index d4f6215ee03f..91204480fabc 100644
> > > --- a/include/linux/sched/sysctl.h
> > > +++ b/include/linux/sched/sysctl.h
> > > @@ -59,6 +59,7 @@ extern int sysctl_sched_rt_runtime;
> > >  #ifdef CONFIG_UCLAMP_TASK
> > >  extern unsigned int sysctl_sched_uclamp_util_min;
> > >  extern unsigned int sysctl_sched_uclamp_util_max;
> > > +extern unsigned int sysctl_sched_rt_default_uclamp_util_min;
> > 
> > nit-pick: I would prefer to keep the same prefix of the already
> > existing knobs, i.e. sysctl_sched_uclamp_util_min_rt
> > 
> > The same change for consistency should be applied to all the following
> > symbols related to "uclamp_util_min_rt".
> > 
> > NOTE: I would not use "default" as I think that what we are doing is
> > exactly force setting a user_defined value for all RT tasks. More on
> > that later...
> 
> Had a second thought on that...

Sorry, I just noticed that you had a second reply to this; initially I only
saw the first one.

Still catching up after the holiday...

[...]

> > > +static void uclamp_rt_sync_default_util_min(struct task_struct *p)
> > > +{
> > > +	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> > 
> > Don't we have to filter for RT tasks only here?
> 
> I think this is still a valid point.

Yep it is.

> 
> > > +
> > > +	if (!uc_se->user_defined)
> > > +		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
> > 
> > Here you are actually setting a user-requested value, why not marking
> > it as that, i.e. by using true for the last parameter?
> 
> I think you don't want to set user_defined to ensure we keep updating
> the value every time the task is enqueued, in case the "default"
> should be updated at run-time.

Yes. I'm glad we're finally in agreement about this :-)

> 
> > Moreover, by keeping user_defined=false I think you are not getting
> > what you want for RT tasks running in a nested cgroup.
> > 
> > Let say a subgroup is still with the util_min=1024 inherited from the
> > system defaults, in uclamp_tg_restrict() we will still return the max
> > value and not what you requested for. Isn't it?
> 
> This is also not completely true since perhaps you assume that if an
> RT task is running in a nested group with a non tuned uclamp_max then
> that's probably what we want.
> 
> There is still a small concern due to the fact we don't distinguish
> CFS and RT tasks when it comes to cgroup clamp values, which
> potentially could still generate the same issue. Let say for example
> you wanna allow CFS tasks to be boosted to max (util_min=1024) but
> still want to run RT tasks only at lower OPPs.
> Not sure if that could be a use case tho.

No, we can't within the same cgroup. But sysadmins can potentially create
different cgroups to enforce the different policies.

A per sched-class uclamp control could simplify userspace if they end up with
this scenario. But given where we are now, I'm not sure how easy it would be to
stage the change.

>  
> > IOW, what about:
> > 
> > ---8<---
> > static void uclamp_sync_util_min_rt(struct task_struct *p)
> > {
> > 	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> > 
> >   if (likely(uc_se->user_defined || !rt_task(p)))
> >     return;
> > 
> >   uclamp_se_set(uc_se, sysctl_sched_uclamp_util_min_rt, true);
>                                                           ^^^^
>                      This should remain false as in your patch.
> > }
> > ---8<---
> 
> Still, I was thinking that perhaps it would be better to massage the
> code above into the generation of the effective value, in uclamp_eff_get().
> 
> Since you wanna (possibly) update the value at each enqueue time,
> that's what conceptually is represented by the "effective clamp
> value": a value that is computed by definition at enqueue time by
> aggregating all the requests and constraints.
> 
> Poking at the effective value instead of the requested value will also
> fix the ambiguity above, where we set a "requested value" with
> user_defined=false.

Okay, let me have a second look at this. I just took what we had and improved
on it. But what you say could work too. Let me try it out.
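As a rough model of what Patrick describes, the effective value can be
sketched as the requested value restricted by the system-wide maximum,
recomputed at enqueue time. This is a hypothetical, simplified stand-in for
uclamp_eff_get(), ignoring the taskgroup hierarchy:

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024

/*
 * Hypothetical simplification: the effective clamp is the task's
 * requested value restricted by the system-wide maximum allowed value.
 * Because it is recomputed on every use rather than stored back into
 * the request, a runtime change of the restriction takes effect on the
 * next computation without touching the "requested" value.
 */
static unsigned int uclamp_effective(unsigned int requested,
				     unsigned int system_restriction)
{
	return requested > system_restriction ? system_restriction
					      : requested;
}
```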

Thanks

--
Qais Yousef


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-04-15 10:11 ` Quentin Perret
@ 2020-04-20 15:08   ` Qais Yousef
  0 siblings, 0 replies; 68+ messages in thread
From: Qais Yousef @ 2020-04-20 15:08 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Valentin Schneider, Patrick Bellasi, Pavan Kondeti, linux-doc,
	linux-kernel, linux-fsdevel

Hi Quentin

On 04/15/20 11:11, Quentin Perret wrote:
> Hi Qais,
> 
> On Friday 03 Apr 2020 at 13:30:19 (+0100), Qais Yousef wrote:
> <snip>
> > +	/*
> > +	 * The new value will be applied to all RT tasks the next time they
> > +	 * wakeup, assuming the task is using the system default and not a user
> > +	 * specified value. In the latter we shall leave the value as the user
> > +	 * requested.
> > +	 */
> >  	if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
> >  	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
> >  		result = -EINVAL;
> >  		goto undo;
> >  	}
> >  
> > +	if (sysctl_sched_rt_default_uclamp_util_min > SCHED_CAPACITY_SCALE) {
> > +		result = -EINVAL;
> > +		goto undo;
> > +	}
> 
> Hmm, checking:
> 
> 	if (sysctl_sched_rt_default_uclamp_util_min > sysctl_sched_uclamp_util_min)
> 
> would probably make sense too, but then that would make writing in
> sysctl_sched_uclamp_util_min cumbersome for sysadmins as they'd need to
> lower the rt default first. Is that the reason for checking against
> SCHED_CAPACITY_SCALE? That might deserve a comment or something.

There's no need for that extra diff. That constraint will be applied
automatically when calculating the effective value.

The check for SCHED_CAPACITY_SCALE is a range check. The valid range is
[0:SCHED_CAPACITY_SCALE].

Does this answer your question? I could add a comment that all the uclamp
sysctls need to be within this range.
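The validate-then-undo flow of the handler can be modeled in a small
userspace sketch. This is a hypothetical simplification of the sysctl
handler's error path, not the kernel code:

```c
#include <assert.h>
#include <errno.h>

#define SCHED_CAPACITY_SCALE 1024

static unsigned int rt_default_util_min = SCHED_CAPACITY_SCALE;

/*
 * Model of the handler's validate-then-undo flow: the new value is
 * accepted only if it lies in [0, SCHED_CAPACITY_SCALE]; otherwise
 * the old value is restored and -EINVAL returned.
 */
static int write_rt_default_util_min(unsigned int new_value)
{
	unsigned int old_value = rt_default_util_min;

	rt_default_util_min = new_value;
	if (rt_default_util_min > SCHED_CAPACITY_SCALE) {
		rt_default_util_min = old_value;	/* undo */
		return -EINVAL;
	}
	return 0;
}
```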

> 
> <snip>
> > @@ -1241,9 +1293,13 @@ static void uclamp_fork(struct task_struct *p)
> >  	for_each_clamp_id(clamp_id) {
> >  		unsigned int clamp_value = uclamp_none(clamp_id);
> >  
> > -		/* By default, RT tasks always get 100% boost */
> > +		/*
> > +		 * By default, RT tasks always get 100% boost, which the admins
> > +		 * are allowed to change via
> > +		 * sysctl_sched_rt_default_uclamp_util_min knob.
> > +		 */
> >  		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> > -			clamp_value = uclamp_none(UCLAMP_MAX);
> > +			clamp_value = sysctl_sched_rt_default_uclamp_util_min;
> >  
> >  		uclamp_se_set(&p->uclamp_req[clamp_id], clamp_value, false);
> >  	}
> 
> And that, as per 20200414161320.251897-1-qperret@google.com, should not
> be there :)

Yep saw it. Thanks for fixing it!

> 
> Otherwise the patch pretty looks good to me!

Cheers

--
Qais Yousef


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-04-20  8:29 ` Dietmar Eggemann
@ 2020-04-20 15:13   ` Qais Yousef
  2020-04-21 11:18     ` Dietmar Eggemann
  0 siblings, 1 reply; 68+ messages in thread
From: Qais Yousef @ 2020-04-20 15:13 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Quentin Perret,
	Valentin Schneider, Patrick Bellasi, Pavan Kondeti, linux-doc,
	linux-kernel, linux-fsdevel

On 04/20/20 10:29, Dietmar Eggemann wrote:
> On 03.04.20 14:30, Qais Yousef wrote:
> 
> [...]
> 
> > @@ -924,6 +945,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
> >  	return uc_req;
> >  }
> >  
> > +static void uclamp_rt_sync_default_util_min(struct task_struct *p)
> > +{
> > +	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> > +
> > +	if (!uc_se->user_defined)
> > +		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
> > +}
> > +
> >  unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
> >  {
> >  	struct uclamp_se uc_eff;
> > @@ -1030,6 +1059,12 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> >  	if (unlikely(!p->sched_class->uclamp_enabled))
> >  		return;
> >  
> > +	/*
> > +	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
> > +	 * user, we apply any new value on the next wakeup, which is here.
> > +	 */
> > +	uclamp_rt_sync_default_util_min(p);
> > +
> 
> Does this have to be an extra function? Can we not reuse
> uclamp_tg_restrict() by slightly rename it to uclamp_restrict()?

Hmm, the thing is that we're not restricting here. On the contrary, we're
boosting, so the name would be misleading.

> 
> This function will then deal with enforcing restrictions, whether system
> and taskgroup hierarchy related or default value (latter only for rt-min
> right now since the others are fixed) related.
> 
> uclamp_eff_get() -> uclamp_restrict() is called from:
> 
>   'enqueue_task(), uclamp_update_active() -> uclamp_rq_inc() -> uclamp_rq_inc_id()' and
> 
>   'task_fits_capacity() -> clamp_task_util(), rt_task_fits_capacity() -> uclamp_eff_value()' and
> 
>   'schedutil_cpu_util(), find_energy_efficient_cpu() -> uclamp_rq_util_with() -> uclamp_eff_value()'
> 
> so there would be more check-points than the one in 'enqueue_task() -> uclamp_rq_inc()' now.

I think you're revolving around the same idea that Patrick was suggesting.
I think it is possible to do something in uclamp_eff_get() too.

Thanks

--
Qais Yousef

> 
> Only lightly tested:
> 
> ---8<---
> 
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Date: Sun, 19 Apr 2020 01:20:17 +0200
> Subject: [PATCH] sched/core: uclamp: Move uclamp_rt_sync_default_util_min()
>  into uclamp_tg_restrict()
> 
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
>  kernel/sched/core.c | 34 +++++++++++++++-------------------
>  1 file changed, 15 insertions(+), 19 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 8f4e0d5c7daf..6802113d6d4b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -899,12 +899,22 @@ unsigned int uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
>  }
>  
>  static inline struct uclamp_se
> -uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
> +uclamp_restrict(struct task_struct *p, enum uclamp_id clamp_id)
>  {
> -	struct uclamp_se uc_req = p->uclamp_req[clamp_id];
> -#ifdef CONFIG_UCLAMP_TASK_GROUP
> -	struct uclamp_se uc_max;
> +	struct uclamp_se uc_req, __maybe_unused uc_max;
> +
> +	if (unlikely(rt_task(p)) && clamp_id == UCLAMP_MIN &&
> +	    !uc_req.user_defined) {
> +		struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> +		int rt_min = sysctl_sched_rt_default_uclamp_util_min;
> +
> +		if (uc_se->value != rt_min)
> +			uclamp_se_set(uc_se, rt_min, false);
> +	}
>  
> +	uc_req = p->uclamp_req[clamp_id];
> +
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
>  	/*
>  	 * Tasks in autogroups or root task group will be
>  	 * restricted by system defaults.
> @@ -933,7 +943,7 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
>  static inline struct uclamp_se
>  uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
>  {
> -	struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
> +	struct uclamp_se uc_req = uclamp_restrict(p, clamp_id);
>  	struct uclamp_se uc_max = uclamp_default[clamp_id];
>  
>  	/* System default restrictions always apply */
> @@ -943,14 +953,6 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
>  	return uc_req;
>  }
>  
> -static void uclamp_rt_sync_default_util_min(struct task_struct *p)
> -{
> -	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> -
> -	if (!uc_se->user_defined)
> -		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
> -}
> -
>  unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
>  {
>  	struct uclamp_se uc_eff;
> @@ -1057,12 +1059,6 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>  	if (unlikely(!p->sched_class->uclamp_enabled))
>  		return;
>  
> -	/*
> -	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
> -	 * user, we apply any new value on the next wakeup, which is here.
> -	 */
> -	uclamp_rt_sync_default_util_min(p);
> -
>  	for_each_clamp_id(clamp_id)
>  		uclamp_rq_inc_id(rq, p, clamp_id);
>  
> -- 
> 2.17.1


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-04-20  8:24   ` Dietmar Eggemann
@ 2020-04-20 15:19     ` Qais Yousef
  2020-04-21  0:52       ` Steven Rostedt
  0 siblings, 1 reply; 68+ messages in thread
From: Qais Yousef @ 2020-04-20 15:19 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Patrick Bellasi, Ingo Molnar, Peter Zijlstra, Jonathan Corbet,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Pavan Kondeti, linux-doc,
	linux-kernel, linux-fsdevel

On 04/20/20 10:24, Dietmar Eggemann wrote:
> >> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> >> index ad5b88a53c5a..0272ae8c6147 100644
> >> --- a/kernel/sysctl.c
> >> +++ b/kernel/sysctl.c
> >> @@ -465,6 +465,13 @@ static struct ctl_table kern_table[] = {
> >>  		.mode		= 0644,
> >>  		.proc_handler	= sysctl_sched_uclamp_handler,
> >>  	},
> >> +	{
> >> +		.procname	= "sched_rt_default_util_clamp_min",
> 
> root@h960:~# find / -name "*util_clamp*"
> /proc/sys/kernel/sched_rt_default_util_clamp_min
> /proc/sys/kernel/sched_util_clamp_max
> /proc/sys/kernel/sched_util_clamp_min
> 
> IMHO, keeping the common 'sched_util_clamp_' would be helpful here, e.g.
> 
> /proc/sys/kernel/sched_util_clamp_rt_default_min

All RT related knobs are prefixed with 'sched_rt'. I kept the 'util_clamp_min'
coherent with the current sysctl (sched_util_clamp_min). Quentin suggested
adding 'default' to be more obvious, so I ended up with

	'sched_rt' + '_default' + '_util_clamp_min'.

I think this is the logical and most consistent form. Given that Patrick seems
to be okay with the 'default' now, does this look good to you too?

Thanks

--
Qais Yousef


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-04-20 15:19     ` Qais Yousef
@ 2020-04-21  0:52       ` Steven Rostedt
  2020-04-21 11:16         ` Dietmar Eggemann
  0 siblings, 1 reply; 68+ messages in thread
From: Steven Rostedt @ 2020-04-21  0:52 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Dietmar Eggemann, Patrick Bellasi, Ingo Molnar, Peter Zijlstra,
	Jonathan Corbet, Juri Lelli, Vincent Guittot, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Pavan Kondeti, linux-doc,
	linux-kernel, linux-fsdevel

On Mon, 20 Apr 2020 16:19:42 +0100
Qais Yousef <qais.yousef@arm.com> wrote:

> > root@h960:~# find / -name "*util_clamp*"
> > /proc/sys/kernel/sched_rt_default_util_clamp_min
> > /proc/sys/kernel/sched_util_clamp_max
> > /proc/sys/kernel/sched_util_clamp_min
> > 
> > IMHO, keeping the common 'sched_util_clamp_' would be helpful here, e.g.
> > 
> > /proc/sys/kernel/sched_util_clamp_rt_default_min  
> 
> All RT related knobs are prefixed with 'sched_rt'. I kept the 'util_clamp_min'
> coherent with the current sysctl (sched_util_clamp_min). Quentin suggested
> adding 'default' to be more obvious, so I ended up with
> 
> 	'sched_rt' + '_default' + '_util_clamp_min'.
> 
> I think this is the logical and most consistent form. Given that Patrick seems
> to be okay with the 'default' now, does this look good to you too?

There's only two files with "sched_rt" and they are tightly coupled
(they define how much an RT task may use the CPU).

My question is, is this "sched_rt_default_util_clamp_min" related in
any way to those other two files that start with "sched_rt", or is it
more related to the files that start with "sched_util_clamp"?

If the latter, then I would suggest using
"sched_util_clamp_min_rt_default", as it looks to be more related to
the "sched_util_clamp_min" than to anything else.

-- Steve


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-04-21  0:52       ` Steven Rostedt
@ 2020-04-21 11:16         ` Dietmar Eggemann
  2020-04-21 11:23           ` Qais Yousef
  0 siblings, 1 reply; 68+ messages in thread
From: Dietmar Eggemann @ 2020-04-21 11:16 UTC (permalink / raw)
  To: Steven Rostedt, Qais Yousef
  Cc: Patrick Bellasi, Ingo Molnar, Peter Zijlstra, Jonathan Corbet,
	Juri Lelli, Vincent Guittot, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Quentin Perret,
	Valentin Schneider, Pavan Kondeti, linux-doc, linux-kernel,
	linux-fsdevel

On 21/04/2020 02:52, Steven Rostedt wrote:
> On Mon, 20 Apr 2020 16:19:42 +0100
> Qais Yousef <qais.yousef@arm.com> wrote:
> 
>>> root@h960:~# find / -name "*util_clamp*"
>>> /proc/sys/kernel/sched_rt_default_util_clamp_min
>>> /proc/sys/kernel/sched_util_clamp_max
>>> /proc/sys/kernel/sched_util_clamp_min
>>>
>>> IMHO, keeping the common 'sched_util_clamp_' would be helpful here, e.g.
>>>
>>> /proc/sys/kernel/sched_util_clamp_rt_default_min  
>>
>> All RT related knobs are prefixed with 'sched_rt'. I kept the 'util_clamp_min'
>> coherent with the current sysctl (sched_util_clamp_min). Quentin suggested
>> adding 'default' to be more obvious, so I ended up with
>>
>> 	'sched_rt' + '_default' + '_util_clamp_min'.
>>
>> I think this is the logical and most consistent form. Given that Patrick seems
>> to be okay with the 'default' now, does this look good to you too?
> 
> There's only two files with "sched_rt" and they are tightly coupled
> (they define how much an RT task may use the CPU).
> 
> My question is, is this "sched_rt_default_util_clamp_min" related in
> any way to those other two files that start with "sched_rt", or is it
> more related to the files that start with "sched_util_clamp"?
> 
> If the latter, then I would suggest using
> "sched_util_clamp_min_rt_default", as it looks to be more related to
> the "sched_util_clamp_min" than to anything else.

For me it's the latter.


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-04-20 15:13   ` Qais Yousef
@ 2020-04-21 11:18     ` Dietmar Eggemann
  2020-04-21 11:27       ` Qais Yousef
  0 siblings, 1 reply; 68+ messages in thread
From: Dietmar Eggemann @ 2020-04-21 11:18 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Quentin Perret,
	Valentin Schneider, Patrick Bellasi, Pavan Kondeti, linux-doc,
	linux-kernel, linux-fsdevel

On 20/04/2020 17:13, Qais Yousef wrote:
> On 04/20/20 10:29, Dietmar Eggemann wrote:
>> On 03.04.20 14:30, Qais Yousef wrote:
>>
>> [...]
>>
>>> @@ -924,6 +945,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
>>>  	return uc_req;
>>>  }
>>>  
>>> +static void uclamp_rt_sync_default_util_min(struct task_struct *p)
>>> +{
>>> +	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
>>> +
>>> +	if (!uc_se->user_defined)
>>> +		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
>>> +}
>>> +
>>>  unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
>>>  {
>>>  	struct uclamp_se uc_eff;
>>> @@ -1030,6 +1059,12 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>>>  	if (unlikely(!p->sched_class->uclamp_enabled))
>>>  		return;
>>>  
>>> +	/*
>>> +	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
>>> +	 * user, we apply any new value on the next wakeup, which is here.
>>> +	 */
>>> +	uclamp_rt_sync_default_util_min(p);
>>> +
>>
>> Does this have to be an extra function? Can we not reuse
>> uclamp_tg_restrict() by slightly rename it to uclamp_restrict()?
> 
> Hmm the thing is that we're not restricting here. In contrary we're boosting,
> so the name would be misleading.

I always thought that we're restricting p->uclamp_req[UCLAMP_MIN].value (default 1024) to
sysctl_sched_rt_default_uclamp_util_min (0-1024)?

root@h960:~# echo 999 > /proc/sys/kernel/sched_rt_default_util_clamp_min

[  118.028582] uclamp_eff_get() [rtkit-daemon 410] tag=0 uclamp_id=0 uc_req.value=1024
[  118.036290] uclamp_eff_get() [rtkit-daemon 410] tag=1 uclamp_id=0 uc_req.value=1024
[  125.181747] uclamp_eff_get() [rtkit-daemon 410] tag=0 uclamp_id=0 uc_req.value=1024
[  125.189443] uclamp_eff_get() [rtkit-daemon 410] tag=1 uclamp_id=0 uc_req.value=1024
[  131.213211] uclamp_restrict() [rtkit-daemon 410] p->uclamp_req[0].value=999
[  131.220201] uclamp_eff_get() [rtkit-daemon 410] tag=0 uclamp_id=0 uc_req.value=999
[  131.227792] uclamp_eff_get() [rtkit-daemon 410] tag=1 uclamp_id=0 uc_req.value=999
[  137.181544] uclamp_eff_get() [rtkit-daemon 410] tag=0 uclamp_id=0 uc_req.value=999
[  137.189170] uclamp_eff_get() [rtkit-daemon 410] tag=1 uclamp_id=0 uc_req.value=999

>> This function will then deal with enforcing restrictions, whether system
>> and taskgroup hierarchy related or default value (latter only for rt-min
>> right now since the others are fixed) related.
>>
>> uclamp_eff_get() -> uclamp_restrict() is called from:
>>
>>   'enqueue_task(), uclamp_update_active() -> uclamp_rq_inc() -> uclamp_rq_inc_id()' and
>>
>>   'task_fits_capacity() -> clamp_task_util(), rt_task_fits_capacity() -> uclamp_eff_value()' and
>>
>>   'schedutil_cpu_util(), find_energy_efficient_cpu() -> uclamp_rq_util_with() -> uclamp_eff_value()'
>>
>> so there would be more check-points than the one in 'enqueue_task() -> uclamp_rq_inc()' now.
> 
> I think you're revolving around the same idea that Patrick was suggesting.
> I think it is possible to do something in uclamp_eff_get() too.

Yeah, I read https://lore.kernel.org/linux-doc/20200415074600.GA26984@darkstar again.

Everything which moves enforcing sysctl_sched_rt_default_uclamp_util_min closer to 'uclamp_eff_get() -> 
uclamp_(tg_)restrict()' is fine with me.


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-04-21 11:16         ` Dietmar Eggemann
@ 2020-04-21 11:23           ` Qais Yousef
  0 siblings, 0 replies; 68+ messages in thread
From: Qais Yousef @ 2020-04-21 11:23 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Steven Rostedt, Patrick Bellasi, Ingo Molnar, Peter Zijlstra,
	Jonathan Corbet, Juri Lelli, Vincent Guittot, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Pavan Kondeti, linux-doc,
	linux-kernel, linux-fsdevel

On 04/21/20 13:16, Dietmar Eggemann wrote:
> On 21/04/2020 02:52, Steven Rostedt wrote:
> > On Mon, 20 Apr 2020 16:19:42 +0100
> > Qais Yousef <qais.yousef@arm.com> wrote:
> > 
> >>> root@h960:~# find / -name "*util_clamp*"
> >>> /proc/sys/kernel/sched_rt_default_util_clamp_min
> >>> /proc/sys/kernel/sched_util_clamp_max
> >>> /proc/sys/kernel/sched_util_clamp_min
> >>>
> >>> IMHO, keeping the common 'sched_util_clamp_' would be helpful here, e.g.
> >>>
> >>> /proc/sys/kernel/sched_util_clamp_rt_default_min  
> >>
> >> All RT related knobs are prefixed with 'sched_rt'. I kept the 'util_clamp_min'
> >> coherent with the current sysctl (sched_util_clamp_min). Quentin suggested
> >> adding 'default' to be more obvious, so I ended up with
> >>
> >> 	'sched_rt' + '_default' + '_util_clamp_min'.
> >>
> >> I think this is the logical and most consistent form. Given that Patrick seems
> >> to be okay with the 'default' now, does this look good to you too?
> > 
> > There's only two files with "sched_rt" and they are tightly coupled
> > (they define how much an RT task may use the CPU).
> > 
> > My question is, is this "sched_rt_default_util_clamp_min" related in
> > any way to those other two files that start with "sched_rt", or is it
> > more related to the files that start with "sched_util_clamp"?
> > 
> > If the latter, then I would suggest using
> > "sched_util_clamp_min_rt_default", as it looks to be more related to
> > the "sched_util_clamp_min" than to anything else.
> 
> For me it's the latter.

The way I see it is that 'sched_rt' defines an RT class property. And running at
the max performance level is an RT class property that uclamp honours, with the
extra side effect that it allows us to tune it.

That said, I'm fine with whatever.

Patrick, Quentin, is sched_util_clamp_min_rt_default fine with you too?

Thanks

--
Qais Yousef


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-04-21 11:18     ` Dietmar Eggemann
@ 2020-04-21 11:27       ` Qais Yousef
  2020-04-22 10:59         ` Dietmar Eggemann
  0 siblings, 1 reply; 68+ messages in thread
From: Qais Yousef @ 2020-04-21 11:27 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Quentin Perret,
	Valentin Schneider, Patrick Bellasi, Pavan Kondeti, linux-doc,
	linux-kernel, linux-fsdevel

On 04/21/20 13:18, Dietmar Eggemann wrote:
> On 20/04/2020 17:13, Qais Yousef wrote:
> > On 04/20/20 10:29, Dietmar Eggemann wrote:
> >> On 03.04.20 14:30, Qais Yousef wrote:
> >>
> >> [...]
> >>
> >>> @@ -924,6 +945,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
> >>>  	return uc_req;
> >>>  }
> >>>  
> >>> +static void uclamp_rt_sync_default_util_min(struct task_struct *p)
> >>> +{
> >>> +	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> >>> +
> >>> +	if (!uc_se->user_defined)
> >>> +		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
> >>> +}
> >>> +
> >>>  unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
> >>>  {
> >>>  	struct uclamp_se uc_eff;
> >>> @@ -1030,6 +1059,12 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> >>>  	if (unlikely(!p->sched_class->uclamp_enabled))
> >>>  		return;
> >>>  
> >>> +	/*
> >>> +	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
> >>> +	 * user, we apply any new value on the next wakeup, which is here.
> >>> +	 */
> >>> +	uclamp_rt_sync_default_util_min(p);
> >>> +
> >>
> >> Does this have to be an extra function? Can we not reuse
> >> uclamp_tg_restrict() by slightly rename it to uclamp_restrict()?
> > 
> > Hmm the thing is that we're not restricting here. In contrary we're boosting,
> > so the name would be misleading.
> 
> I always thought that we're restricting p->uclamp_req[UCLAMP_MIN].value (default 1024) to
> sysctl_sched_rt_default_uclamp_util_min (0-1024)?

The way I look at it is that we're *setting* it to
sysctl_sched_rt_default_uclamp_util_min if !user_defined.

It is the restriction mechanism that ensures this set value doesn't escape
the cgroup/global restrictions setup.

> 
> root@h960:~# echo 999 > /proc/sys/kernel/sched_rt_default_util_clamp_min
> 
> [  118.028582] uclamp_eff_get() [rtkit-daemon 410] tag=0 uclamp_id=0 uc_req.value=1024
> [  118.036290] uclamp_eff_get() [rtkit-daemon 410] tag=1 uclamp_id=0 uc_req.value=1024
> [  125.181747] uclamp_eff_get() [rtkit-daemon 410] tag=0 uclamp_id=0 uc_req.value=1024
> [  125.189443] uclamp_eff_get() [rtkit-daemon 410] tag=1 uclamp_id=0 uc_req.value=1024
> [  131.213211] uclamp_restrict() [rtkit-daemon 410] p->uclamp_req[0].value=999
> [  131.220201] uclamp_eff_get() [rtkit-daemon 410] tag=0 uclamp_id=0 uc_req.value=999
> [  131.227792] uclamp_eff_get() [rtkit-daemon 410] tag=1 uclamp_id=0 uc_req.value=999
> [  137.181544] uclamp_eff_get() [rtkit-daemon 410] tag=0 uclamp_id=0 uc_req.value=999
> [  137.189170] uclamp_eff_get() [rtkit-daemon 410] tag=1 uclamp_id=0 uc_req.value=999
> 
> >> This function will then deal with enforcing restrictions, whether system
> >> and taskgroup hierarchy related or default value (latter only for rt-min
> >> right now since the others are fixed) related.
> >>
> >> uclamp_eff_get() -> uclamp_restrict() is called from:
> >>
> >>   'enqueue_task(), uclamp_update_active() -> uclamp_rq_inc() -> uclamp_rq_inc_id()' and
> >>
> >>   'task_fits_capacity() -> clamp_task_util(), rt_task_fits_capacity() -> uclamp_eff_value()' and
> >>
> >>   'schedutil_cpu_util(), find_energy_efficient_cpu() -> uclamp_rq_util_with() -> uclamp_eff_value()'
> >>
> >> so there would be more check-points than the one in 'enqueue_task() -> uclamp_rq_inc()' now.
> > 
> > I think you're revolving around the same idea that Patrick was suggesting.
> > I think it is possible to do something in uclamp_eff_get() too.
> 
> Yeah, I read https://lore.kernel.org/linux-doc/20200415074600.GA26984@darkstar again.
> 
> Everything which moves enforcing sysctl_sched_rt_default_uclamp_util_min closer to 'uclamp_eff_get() -> 
> uclamp_(tg_)restrict()' is fine with me.

Cool.

Thanks

--
Qais Yousef


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-04-21 11:27       ` Qais Yousef
@ 2020-04-22 10:59         ` Dietmar Eggemann
  2020-04-22 13:13           ` Qais Yousef
  0 siblings, 1 reply; 68+ messages in thread
From: Dietmar Eggemann @ 2020-04-22 10:59 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Quentin Perret,
	Valentin Schneider, Patrick Bellasi, Pavan Kondeti, linux-doc,
	linux-kernel, linux-fsdevel

On 21/04/2020 13:27, Qais Yousef wrote:
> On 04/21/20 13:18, Dietmar Eggemann wrote:
>> On 20/04/2020 17:13, Qais Yousef wrote:
>>> On 04/20/20 10:29, Dietmar Eggemann wrote:
>>>> On 03.04.20 14:30, Qais Yousef wrote:
>>>>
>>>> [...]
>>>>
>>>>> @@ -924,6 +945,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
>>>>>  	return uc_req;
>>>>>  }
>>>>>  
>>>>> +static void uclamp_rt_sync_default_util_min(struct task_struct *p)
>>>>> +{
>>>>> +	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
>>>>> +
>>>>> +	if (!uc_se->user_defined)
>>>>> +		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
>>>>> +}
>>>>> +
>>>>>  unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
>>>>>  {
>>>>>  	struct uclamp_se uc_eff;
>>>>> @@ -1030,6 +1059,12 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>>>>>  	if (unlikely(!p->sched_class->uclamp_enabled))
>>>>>  		return;
>>>>>  
>>>>> +	/*
>>>>> +	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
>>>>> +	 * user, we apply any new value on the next wakeup, which is here.
>>>>> +	 */
>>>>> +	uclamp_rt_sync_default_util_min(p);
>>>>> +
>>>>
>>>> Does this have to be an extra function? Can we not reuse
>>>> uclamp_tg_restrict() by slightly rename it to uclamp_restrict()?

Btw, there was an issue in my little snippet. I used uc_req.user_defined
uninitialized in uclamp_restrict().


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f3706dad32ce..7e6b2b7cd1e5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -903,12 +903,11 @@ uclamp_restrict(struct task_struct *p, enum uclamp_id clamp_id)
 {
 	struct uclamp_se uc_req, __maybe_unused uc_max;
 
-	if (unlikely(rt_task(p)) && clamp_id == UCLAMP_MIN &&
-	    !uc_req.user_defined) {
+	if (unlikely(rt_task(p)) && clamp_id == UCLAMP_MIN) {
 		struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
 		int rt_min = sysctl_sched_rt_default_uclamp_util_min;
 
-		if (uc_se->value != rt_min) {
+		if (!uc_se->user_defined && uc_se->value != rt_min) {
 			uclamp_se_set(uc_se, rt_min, false);
 			printk("uclamp_restrict() [%s %d] p->uclamp_req[%d].value=%d\n",
 			       p->comm, p->pid, clamp_id, uc_se->value);

>>> Hmm the thing is that we're not restricting here. In contrary we're boosting,
>>> so the name would be misleading.
>>
>> I always thought that we're restricting p->uclamp_req[UCLAMP_MIN].value (default 1024) to
>> sysctl_sched_rt_default_uclamp_util_min (0-1024)?
> 
> The way I look at it is that we're *setting* it to
> sysctl_sched_rt_default_uclamp_util_min if !user_defined.
> 
> The restriction mechanism that ensures this set value doesn't escape
> cgroup/global restrictions setup.

I guess we overall agree here. 

I see 3 restriction levels: (!user_defined) task -> taskgroup -> system

I see sysctl_sched_rt_default_uclamp_util_min (min_rt_default) as a
restriction at task level.

It's true that the task-level restriction is setting the value at the same time.

For CFS (id=UCLAMP_[MIN\|MAX]) and RT (id=UCLAMP_MAX) we use
uclamp_none(id), and those values (0, 1024) are fixed, so these task-level
values don't need to be further restricted.

For RT (id=UCLAMP_MIN) we use 'min_rt_default' and, since it can change,
we have to check the task-level restriction in 'uclamp_eff_get() ->
uclamp_(tg)_restrict()'.

root@h960:~# echo 999 > /proc/sys/kernel/sched_rt_default_util_clamp_min

[ 2540.507236] uclamp_eff_get() [rtkit-daemon 419] tag=0 uclamp_id=0 uc_req.value=1024
[ 2540.514947] uclamp_eff_get() [rtkit-daemon 419] tag=1 uclamp_id=0 uc_req.value=1024
[ 2548.015208] uclamp_restrict() [rtkit-daemon 419] p->uclamp_req[0].value=999

root@h960:~# echo 666 > /proc/sys/kernel/sched_util_clamp_min

[ 2548.022219] uclamp_eff_get() [rtkit-daemon 419] tag=0 uclamp_id=0 uc_req.value=999
[ 2548.029825] uclamp_eff_get() [rtkit-daemon 419] tag=1 uclamp_id=0 uc_req.value=999
[ 2553.479509] uclamp_eff_get() [rtkit-daemon 419] tag=0 uclamp_id=0 uc_max.value=666
[ 2553.487131] uclamp_eff_get() [rtkit-daemon 419] tag=1 uclamp_id=0 uc_max.value=666

Haven't tried to put an rt task into a taskgroup other than root.


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-04-22 10:59         ` Dietmar Eggemann
@ 2020-04-22 13:13           ` Qais Yousef
  0 siblings, 0 replies; 68+ messages in thread
From: Qais Yousef @ 2020-04-22 13:13 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Quentin Perret,
	Valentin Schneider, Patrick Bellasi, Pavan Kondeti, linux-doc,
	linux-kernel, linux-fsdevel

On 04/22/20 12:59, Dietmar Eggemann wrote:
> On 21/04/2020 13:27, Qais Yousef wrote:
> > On 04/21/20 13:18, Dietmar Eggemann wrote:
> >> On 20/04/2020 17:13, Qais Yousef wrote:
> >>> On 04/20/20 10:29, Dietmar Eggemann wrote:
> >>>> On 03.04.20 14:30, Qais Yousef wrote:
> >>>>
> >>>> [...]
> >>>>
> >>>>> @@ -924,6 +945,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
> >>>>>  	return uc_req;
> >>>>>  }
> >>>>>  
> >>>>> +static void uclamp_rt_sync_default_util_min(struct task_struct *p)
> >>>>> +{
> >>>>> +	struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> >>>>> +
> >>>>> +	if (!uc_se->user_defined)
> >>>>> +		uclamp_se_set(uc_se, sysctl_sched_rt_default_uclamp_util_min, false);
> >>>>> +}
> >>>>> +
> >>>>>  unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
> >>>>>  {
> >>>>>  	struct uclamp_se uc_eff;
> >>>>> @@ -1030,6 +1059,12 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> >>>>>  	if (unlikely(!p->sched_class->uclamp_enabled))
> >>>>>  		return;
> >>>>>  
> >>>>> +	/*
> >>>>> +	 * When sysctl_sched_rt_default_uclamp_util_min value is changed by the
> >>>>> +	 * user, we apply any new value on the next wakeup, which is here.
> >>>>> +	 */
> >>>>> +	uclamp_rt_sync_default_util_min(p);
> >>>>> +
> >>>>
> >>>> Does this have to be an extra function? Can we not reuse
> >>>> uclamp_tg_restrict() by slightly rename it to uclamp_restrict()?
> 
> Btw, there was an issue in my little snippet. I used uc_req.user_defined
> uninitialized in uclamp_restrict().
> 
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f3706dad32ce..7e6b2b7cd1e5 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -903,12 +903,11 @@ uclamp_restrict(struct task_struct *p, enum uclamp_id clamp_id)
>  {
>  	struct uclamp_se uc_req, __maybe_unused uc_max;
>  
> -	if (unlikely(rt_task(p)) && clamp_id == UCLAMP_MIN &&
> -	    !uc_req.user_defined) {
> +	if (unlikely(rt_task(p)) && clamp_id == UCLAMP_MIN) {
>  		struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
>  		int rt_min = sysctl_sched_rt_default_uclamp_util_min;
>  
> -		if (uc_se->value != rt_min) {
> +		if (!uc_se->user_defined && uc_se->value != rt_min) {
>  			uclamp_se_set(uc_se, rt_min, false);
>  			printk("uclamp_restrict() [%s %d] p->uclamp_req[%d].value=%d\n",
>  			       p->comm, p->pid, clamp_id, uc_se->value);
> 
> >>> Hmm the thing is that we're not restricting here. In contrary we're boosting,
> >>> so the name would be misleading.
> >>
> >> I always thought that we're restricting p->uclamp_req[UCLAMP_MIN].value (default 1024) to
> >> sysctl_sched_rt_default_uclamp_util_min (0-1024)?
> > 
> > The way I look at it is that we're *setting* it to
> > sysctl_sched_rt_default_uclamp_util_min if !user_defined.
> > 
> > The restriction mechanism that ensures this set value doesn't escape
> > cgroup/global restrictions setup.
> 
> I guess we overall agree here. 
> 
> I see 3 restriction levels: (!user_defined) task -> taskgroup -> system
> 
> I see sysctl_sched_rt_default_uclamp_util_min (min_rt_default) as a
> restriction on task level.

Hmm, from a code perspective it is a request. One that is applied by default if
the user didn't make any request.

Since restriction has a different meaning from a code point of view, I think
interchanging the two could be confusing.

A restriction, from the code's point of view, is that you can request 1024, but
if cgroup or global settings don't allow it, the system will automatically crop it.

The new sysctl doesn't result in any cropping. It is equivalent to the user
making a sched_setattr() call to change the UCLAMP_MIN value of a task. It's just
that this request is applied automatically by default, if !user_defined.

But if you meant that, in your high-level view of how it works, you think of it
as a restriction, then yeah, it could be abstracted in this way. The terminology
just conflicts with the code.

> 
> It's true that the task level restriction is setting the value at the same time.
> 
> For CFS (id=UCLAMP_[MIN\|MAX]) and RT (id=UCLAMP_MAX) we use
> uclamp_none(id) and those values (0, 1024) are fixed so these task level
> values don't need to be further restricted.

I wouldn't think of these as restrictions. If I understood what you're saying
correctly, they're default requests; by default:

	cfs_task->util_min = 0
	cfs_task->util_max = 1024

	rt_task->util_min = 1024
	rt_task->util_max = 1024

Which are the requested value.

sysctl_util_clamp_{min,max} are the default restrictions, which by default
allow tasks to request any value within the full range.

The root taskgroup will inherit this value by default. And new cgroups will
inherit from the root taskgroup.

> 
> For RT (id=UCLAMP_MIN) we use 'min_rt_default' and since it can change
> we have to check the task level restriction in 'uclamp_eff_get() ->
> uclamp_(tg)_restrict()'.

Yes. If we take the approach of applying the default request in uclamp_eff_get(),
then this must be done before the uclamp_tg_restrict() call.

> 
> root@h960:~# echo 999 > /proc/sys/kernel/sched_rt_default_util_clamp_min
> 
> [ 2540.507236] uclamp_eff_get() [rtkit-daemon 419] tag=0 uclamp_id=0 uc_req.value=1024
> [ 2540.514947] uclamp_eff_get() [rtkit-daemon 419] tag=1 uclamp_id=0 uc_req.value=1024
> [ 2548.015208] uclamp_restrict() [rtkit-daemon 419] p->uclamp_req[0].value=999
> 
> root@h960:~# echo 666 > /proc/sys/kernel/sched_util_clamp_min
> 
> [ 2548.022219] uclamp_eff_get() [rtkit-daemon 419] tag=0 uclamp_id=0 uc_req.value=999
> [ 2548.029825] uclamp_eff_get() [rtkit-daemon 419] tag=1 uclamp_id=0 uc_req.value=999
> [ 2553.479509] uclamp_eff_get() [rtkit-daemon 419] tag=0 uclamp_id=0 uc_max.value=666
> [ 2553.487131] uclamp_eff_get() [rtkit-daemon 419] tag=1 uclamp_id=0 uc_max.value=666
> 
> Haven't tried to put an rt task into a taskgroup other than root.

I do run a test that Patrick had which checks the cgroup values. One of the
tests checks if RT tasks are boosted to max, and it fails when I change the
default RT max-boost value :)

I think we're in agreement, but the terminology is probably making things a bit
confusing.

Thanks

--
Qais Yousef

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-23 15:44                             ` Qais Yousef
@ 2020-06-24  8:45                               ` Vincent Guittot
  0 siblings, 0 replies; 68+ messages in thread
From: Vincent Guittot @ 2020-06-24  8:45 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Mel Gorman, Patrick Bellasi, Dietmar Eggemann, Peter Zijlstra,
	Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli,
	Steven Rostedt, Ben Segall, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Quentin Perret, Valentin Schneider, Pavan Kondeti,
	linux-doc, linux-kernel, linux-fs

Hi Qais,

On Tue, 23 Jun 2020 at 17:44, Qais Yousef <qais.yousef@arm.com> wrote:
>
> Hi Vincent
>
> On 06/11/20 14:01, Vincent Guittot wrote:
> > On Thu, 11 Jun 2020 at 12:24, Qais Yousef <qais.yousef@arm.com> wrote:
>
> [...]
>
> > > > Strange because I have been able to trace them.
> > >
> > > On your arm platform? I can certainly see them on x86.
> >
> > yes on my arm platform
>
> Sorry for not getting back to you earlier, but I have tried several things and
> shared my results, all of which you were CCed on.
>
> I have posted a patch that protects uclamp with a static key, mind trying it on
> your platform to see if it helps you too?

I have run some tests with your latest patchset and will reply to the
email thread below.

>
> https://lore.kernel.org/lkml/20200619172011.5810-1-qais.yousef@arm.com/
>
> Thanks
>
> --
> Qais Yousef


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-11 12:01                           ` Vincent Guittot
@ 2020-06-23 15:44                             ` Qais Yousef
  2020-06-24  8:45                               ` Vincent Guittot
  0 siblings, 1 reply; 68+ messages in thread
From: Qais Yousef @ 2020-06-23 15:44 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mel Gorman, Patrick Bellasi, Dietmar Eggemann, Peter Zijlstra,
	Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli,
	Steven Rostedt, Ben Segall, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Quentin Perret, Valentin Schneider, Pavan Kondeti,
	linux-doc, linux-kernel, linux-fs

Hi Vincent

On 06/11/20 14:01, Vincent Guittot wrote:
> On Thu, 11 Jun 2020 at 12:24, Qais Yousef <qais.yousef@arm.com> wrote:

[...]

> > > Strange because I have been able to trace them.
> >
> > On your arm platform? I can certainly see them on x86.
> 
> yes on my arm platform

Sorry for not getting back to you earlier, but I have tried several things and
shared my results, all of which you were CCed on.

I have posted a patch that protects uclamp with a static key, mind trying it on
your platform to see if it helps you too?

https://lore.kernel.org/lkml/20200619172011.5810-1-qais.yousef@arm.com/

Thanks

--
Qais Yousef


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-16 11:08                   ` Qais Yousef
@ 2020-06-16 13:56                     ` Lukasz Luba
  0 siblings, 0 replies; 68+ messages in thread
From: Lukasz Luba @ 2020-06-16 13:56 UTC (permalink / raw)
  To: Qais Yousef, Mel Gorman
  Cc: Dietmar Eggemann, Peter Zijlstra, Ingo Molnar, Randy Dunlap,
	Jonathan Corbet, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Ben Segall, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel,
	chris.redpath


[snip]

Hi Mel and Qais,

I was able to synthesize results from some experiments which I conducted
on my machine. You can find them below with descriptions.

1. Description of the configuration and hardware

My machine is an HP server with 2 sockets and 24 CPUs, x86 64bit
(4 NUMA nodes, AMD Opteron 6174, L2 512KB/cpu, L3 6MB/node, RAM 40GB/node).

Results presented here come from OpenSuse 15.1 (apart from the last
experiment) with kernels built from the distro config, kernel tag v5.7-rc7.
I have created 3 kernels based on the distro config:
a) v5.7-rc7-base - default kernel build (no uclamp)
b) v5.7-rc7-ucl-tsk - base kernel + CONFIG_UCLAMP_TASK
c) v5.7-rc7-ucl-tsk-grp - base kernel + CONFIG_UCLAMP_TASK & CONFIG_UCLAMP_TASK_GROUP

2. Experiments

I have been using mmtests with the configuration you recommended.
I put the system under stress in different scenarios, to check if some
regression can be observed and under what circumstances.
The descriptions below show these different angles of attack during
mmtests: w/ or w/o NUMA pinning, using or not using perf, tracing, etc.
I have also looked a bit closer at the suspected functions,
activate_task and deactivate_task, as you will find in the
experiment descriptions.

2.1. Experiment with netperf and two kernels

These tests have been conducted without numactl force settings (all CPUs
allowed). As can be seen, the kernel with uclamp task support has worse
performance for UDP, but somehow better for TCP.

UDP tests results:
netperf-udp
                           ./v5.7-rc7-base       ./v5.7-rc7-ucl-tsk
Hmean     send-64          62.15 (   0.00%)       59.65 *  -4.02%*
Hmean     send-128        122.88 (   0.00%)      119.37 *  -2.85%*
Hmean     send-256        244.85 (   0.00%)      234.26 *  -4.32%*
Hmean     send-1024       919.24 (   0.00%)      880.67 *  -4.20%*
Hmean     send-2048      1689.45 (   0.00%)     1647.54 *  -2.48%*
Hmean     send-3312      2542.36 (   0.00%)     2485.23 *  -2.25%*
Hmean     send-4096      2935.69 (   0.00%)     2861.09 *  -2.54%*
Hmean     send-8192      4800.35 (   0.00%)     4680.09 *  -2.51%*
Hmean     send-16384     7473.66 (   0.00%)     7349.60 *  -1.66%*
Hmean     recv-64          62.15 (   0.00%)       59.65 *  -4.03%*
Hmean     recv-128        122.88 (   0.00%)      119.37 *  -2.85%*
Hmean     recv-256        244.84 (   0.00%)      234.26 *  -4.32%*
Hmean     recv-1024       919.24 (   0.00%)      880.67 *  -4.20%*
Hmean     recv-2048      1689.44 (   0.00%)     1647.54 *  -2.48%*
Hmean     recv-3312      2542.36 (   0.00%)     2485.23 *  -2.25%*
Hmean     recv-4096      2935.69 (   0.00%)     2861.09 *  -2.54%*
Hmean     recv-8192      4800.35 (   0.00%)     4678.15 *  -2.55%*
Hmean     recv-16384     7473.63 (   0.00%)     7349.52 *  -1.66%*

TCP test results:
netperf-tcp
                        ./v5.7-rc7-base    ./v5.7-rc7-ucl-tsk
Hmean     64         756.44 (   0.00%)      881.17 *  16.49%*
Hmean     128       1425.09 (   0.00%)     1558.70 *   9.38%*
Hmean     256       2292.65 (   0.00%)     2508.72 *   9.42%*
Hmean     1024      5068.70 (   0.00%)     5612.17 *  10.72%*
Hmean     2048      6506.81 (   0.00%)     6739.87 *   3.58%*
Hmean     3312      7232.42 (   0.00%)     7735.86 *   6.96%*
Hmean     4096      7597.95 (   0.00%)     7698.76 *   1.33%*
Hmean     8192      8402.80 (   0.00%)     8540.36 *   1.64%*
Hmean     16384     8841.60 (   0.00%)     9068.70 *   2.57%*

Using perf in a similar workload, the difference in activate_task and
deactivate_task is not small:
v5.7-rc7-base
      0.62%  netperf          [kernel.kallsyms]        [k] activate_task
      0.06%  netserver        [kernel.kallsyms]        [k] deactivate_task

v5.7-rc7-ucl-tsk
      3.43%  netperf          [kernel.kallsyms]        [k] activate_task
      2.39%  netserver        [kernel.kallsyms]        [k] deactivate_task

It's a starting point, just to align with others who also see some
regression.

2.2. Experiment with many tests of a single netperf-udp 64B and tracing

I have tried to measure the suspected functions, which were mentioned
many times. Here are the measurements for the functions 'activate_task' and
'deactivate_task': number of hits, total computation time, and average time
of one call. These values have been captured during one single netperf-udp
64B test, repeated many times. The tables below show processed statistics for
experiments conducted with 3 different kernels. How many times the test
has been repeated on each kernel is shown in the row called 'count'.
This is the output of the pandas data frame function describe(). In case
of confusion with the labels in the first row, please check the web for some
tutorials.

stats: fprof.base (basic kernel v5.7-rc7 nouclamp)
activate_task
                Hit    Time_us  Avg_us  s^2_us
count       138.00     138.00  138.00  138.00
mean     20,387.44  14,587.33    1.15    0.53
std     114,980.19  81,427.51    0.42    0.23
min         110.00     181.68    0.32    0.00
50%         411.00     461.55    1.32    0.54
75%         881.75     760.08    1.47    0.66
90%       2,885.60   1,302.03    1.61    0.80
95%      55,318.05  41,273.41    1.66    0.92
99%     501,660.04 358,939.04    1.77    1.09
max   1,131,457.00 798,097.30    1.80    1.42
deactivate_task
                Hit    Time_us  Avg_us  s^2_us
count       138.00     138.00  138.00  138.00
mean     81,828.83  39,991.61    0.81    0.28
std     260,130.01 126,386.89    0.28    0.14
min          97.00      92.35    0.26    0.00
50%         424.00     340.35    0.94    0.30
75%       1,062.25     684.98    1.05    0.37
90%     330,657.50 168,320.94    1.11    0.46
95%     748,920.70 359,498.23    1.15    0.51
99%   1,094,614.76 528,459.50    1.21    0.56
max   1,630,473.00 789,476.50    1.25    0.60

stats: fprof.uclamp_tsk (kernel v5.7-rc7 + uclamp tasks)
activate_task
                Hit      Time_us  Avg_us  s^2_us
count       113.00       113.00  113.00  113.00
mean     23,006.46    24,133.29    1.36    0.64
std     161,171.74   170,299.61    0.45    0.24
min          98.00       173.13    0.44    0.08
50%         369.00       575.96    1.55    0.62
75%         894.00       883.71    1.69    0.74
90%       1,941.20     1,221.70    1.77    0.90
95%       3,187.40     1,627.21    1.85    1.14
99%     431,604.88   437,291.66    1.92    1.35
max   1,631,657.00 1,729,488.00    2.16    1.35
deactivate_task
                Hit      Time_us  Avg_us  s^2_us
count       113.00       113.00  113.00  113.00
mean    108,067.93    86,020.56    1.00    0.35
std     310,429.35   246,938.68    0.33    0.15
min          89.00       102.46    0.33    0.00
50%         430.00       495.87    1.14    0.35
75%       1,361.00       823.63    1.24    0.44
90%     437,528.40   345,051.10    1.34    0.53
95%     886,978.60   696,796.74    1.40    0.58
99%   1,345,052.40 1,086,567.76    1.44    0.68
max   1,391,534.00 1,116,053.00    1.63    0.80

stats: fprof.uclamp_tsk_grp (kernel v5.7-rc7 + uclamp tasks + uclamp task group)
activate_task
                Hit      Time_us  Avg_us  s^2_us
count       273.00       273.00  273.00  273.00
mean     15,958.34    16,471.84    1.58    0.67
std     105,096.88   108,322.03    0.43    0.32
min           3.00         4.96    0.41    0.00
50%         245.00       400.23    1.70    0.64
75%         384.00       565.53    1.85    0.78
90%       1,602.00     1,069.08    1.95    0.95
95%       3,403.00     1,573.74    2.01    1.13
99%     589,484.56   604,992.57    2.11    1.75
max   1,035,866.00 1,096,975.00    2.40    3.08
deactivate_task
                Hit      Time_us  Avg_us  s^2_us
count       273.00       273.00  273.00  273.00
mean     94,607.02    63,433.12    1.02    0.34
std     325,130.91   216,844.92    0.28    0.16
min           2.00         2.79    0.29    0.00
50%         244.00       291.49    1.11    0.36
75%         496.00       448.72    1.19    0.43
90%     120,304.60    82,964.94    1.25    0.55
95%     945,480.60   626,793.58    1.33    0.60
99%   1,485,959.96 1,010,615.72    1.40    0.68
max   2,120,682.00 1,403,280.00    1.80    1.11

As you can see, the data is distributed differently, with higher
'Hit' and 'Time_us' values at around the .95 percentile for the kernels
with uclamp.

2.3. Experiment forcing test tasks to run in the same NUMA node

This experiment shows whether forcing all test tasks onto a single NUMA
node makes a difference.

netperf-udp
                           ./v5.7-rc7            ./v5.7-rc7            ./v5.7-rc7
                           base-numa0         ucl-tsk-numa0     ucl-tsk-grp-numa0
Hmean     send-64          60.99 (   0.00%)       61.19 *   0.32%*       64.58 *   5.88%*
Hmean     send-128        121.92 (   0.00%)      121.37 *  -0.45%*      128.26 *   5.20%*
Hmean     send-256        240.74 (   0.00%)      240.87 *   0.06%*      253.86 *   5.45%*
Hmean     send-1024       905.17 (   0.00%)      908.43 *   0.36%*      955.59 *   5.57%*
Hmean     send-2048      1669.18 (   0.00%)     1681.30 *   0.73%*     1752.39 *   4.99%*
Hmean     send-3312      2496.30 (   0.00%)     2510.48 *   0.57%*     2602.42 *   4.25%*
Hmean     send-4096      2914.13 (   0.00%)     2932.19 *   0.62%*     3028.83 *   3.94%*
Hmean     send-8192      4744.81 (   0.00%)     4762.90 *   0.38%*     4916.24 *   3.61%*
Hmean     send-16384     7489.47 (   0.00%)     7514.17 *   0.33%*     7570.39 *   1.08%*
Hmean     recv-64          60.98 (   0.00%)       61.18 *   0.34%*       64.54 *   5.85%*
Hmean     recv-128        121.86 (   0.00%)      121.29 *  -0.47%*      128.26 *   5.26%*
Hmean     recv-256        240.65 (   0.00%)      240.79 *   0.06%*      253.74 *   5.44%*
Hmean     recv-1024       904.65 (   0.00%)      908.20 *   0.39%*      955.58 *   5.63%*
Hmean     recv-2048      1669.18 (   0.00%)     1680.89 *   0.70%*     1752.39 *   4.99%*
Hmean     recv-3312      2495.08 (   0.00%)     2509.68 *   0.59%*     2601.31 *   4.26%*
Hmean     recv-4096      2911.66 (   0.00%)     2931.46 *   0.68%*     3028.83 *   4.02%*
Hmean     recv-8192      4738.70 (   0.00%)     4762.27 *   0.50%*     4911.90 *   3.66%*
Hmean     recv-16384     7485.81 (   0.00%)     7513.41 *   0.37%*     7569.91 *   1.12%*

netperf-tcp
                      ./v5.7-rc7            ./v5.7-rc7            ./v5.7-rc7
                      base-numa0         ucl-tsk-numa0     ucl-tsk-grp-numa0
Hmean     64         762.29 (   0.00%)      826.48 *   8.42%*      768.86 *   0.86%*
Hmean     128       1418.94 (   0.00%)     1573.76 *  10.91%*     1444.04 *   1.77%*
Hmean     256       2302.76 (   0.00%)     2518.75 *   9.38%*     2315.00 *   0.53%*
Hmean     1024      5076.92 (   0.00%)     5351.65 *   5.41%*     5061.19 *  -0.31%*
Hmean     2048      6493.42 (   0.00%)     6645.99 *   2.35%*     6493.79 *   0.01%*
Hmean     3312      7229.76 (   0.00%)     7373.29 *   1.99%*     7208.45 *  -0.29%*
Hmean     4096      7604.00 (   0.00%)     7656.45 *   0.69%*     7574.14 *  -0.39%*
Hmean     8192      8456.24 (   0.00%)     8495.95 *   0.47%*     8387.04 *  -0.82%*
Hmean     16384     8835.74 (   0.00%)     8775.17 *  -0.69%*     8837.48 *   0.02%*

Perf values of the suspected functions for each kernel, for a similar test
as above (pinned to NUMA node 0), show that there are more calls to these
functions, as usual.
  base
      0.57%  netperf          [kernel.kallsyms]        [k] activate_task
      0.11%  netserver        [kernel.kallsyms]        [k] deactivate_task
  ucl-tsk
      3.44%  netperf          [kernel.kallsyms]          [k] activate_task
      2.49%  netserver        [kernel.kallsyms]          [k] deactivate_task
  ucl-tsk-grp
      2.47%  netperf          [kernel.kallsyms]        [k] activate_task
      1.30%  netserver        [kernel.kallsyms]        [k] deactivate_task

This shows there is more work in the related functions, but somehow the
machine is able to handle it and the performance results are even better
with uclamp.

2.4. Experiment with one netperf-udp and the perf tool

Repeating the netperf-udp 64B experiment (base kernel vs uclamp task group),
with one test run a few times, I could observe in perf:
87bln vs 100bln cycles
~0.8-0.9k vs ~2.6M context-switches
~73bln vs 76-77bln instructions
task-clock stays the same: ~48s

2.5. Ubuntu server and distro kernel experiments

Here are some results from a different distro, to check whether the
regression can be observed there as well.
This experiment is for a different kernel and a different distro:
Ubuntu server 18.04, but the same machine.
The results for the kernel with uclamp task + task group (last column)
might look really bad.
After processing the results from experiment 2.2, I convinced myself
that I might just have hit a worse use case (very bad task bouncing)
during these 5 iterations of the 'netperf-udp send-128' test.
Apart from that, in general, worse performance results can be observed.

                       ./v5.6-custom-nouclamp       ./v5.6-custom-uct     ./v5.6-custom-uctg
Hmean     send-64          99.43 (   0.00%)       94.40 *  -5.06%*       90.19 *  -9.29%*
Hmean     send-128        198.81 (   0.00%)      180.91 *  -9.01%*      137.80 * -30.69%*
Hmean     send-256        393.12 (   0.00%)      341.89 * -13.03%*      332.72 * -15.36%*
Hmean     send-1024      1052.48 (   0.00%)      961.17 *  -8.68%*      961.64 *  -8.63%*
Hmean     send-2048      1935.68 (   0.00%)     1803.86 *  -6.81%*     1755.36 *  -9.32%*
Hmean     send-3312      2983.04 (   0.00%)     2806.50 *  -5.92%*     2802.44 *  -6.05%*
Hmean     send-4096      3558.37 (   0.00%)     3348.70 *  -5.89%*     3373.92 *  -5.18%*
Hmean     send-8192      5335.23 (   0.00%)     5227.89 *  -2.01%*     5277.22 *  -1.09%*
Hmean     send-16384     7552.66 (   0.00%)     7374.27 *  -2.36%*     7388.90 *  -2.17%*

3. Some hypotheses and summary

These ~1.5M extra context switches might cause an extra 3-4bln instructions,
which could consume an extra 13bln cycles.
Tasks are jumping across the CPUs more often, and context switches happen
more frequently. The functions 'activate_task' and 'deactivate_task' have
a worse total hit count and total computation time in the same netperf-udp
test, which also worsens their average time. It might be because of
pressure on caches and branch prediction. Surprisingly, the machine can
handle a higher amount of task bouncing when the tasks are pinned to one
single NUMA node.

I hope this helps you investigate this issue further and find a solution.
IMHO, guarding uclamp with a static key is a good idea.
Thank you Mel for your help with my machine configuration and setup.

Regards,
Lukasz Luba




* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-11 10:58                 ` Qais Yousef
@ 2020-06-16 11:08                   ` Qais Yousef
  2020-06-16 13:56                     ` Lukasz Luba
  0 siblings, 1 reply; 68+ messages in thread
From: Qais Yousef @ 2020-06-16 11:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dietmar Eggemann, Peter Zijlstra, Ingo Molnar, Randy Dunlap,
	Jonathan Corbet, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Ben Segall, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel,
	chris.redpath, lukasz.luba

On 06/11/20 11:58, Qais Yousef wrote:

[...]

> 
>                                     nouclam               nouclamp                  uclam                 uclamp         uclamp.disable                 uclamp                 uclamp                 uclamp
>                                    nouclamp              recompile                 uclamp                uclamp2        uclamp.disabled                    opt                   opt2           opt.disabled
> Hmean     send-64         158.07 (   0.00%)      156.99 *  -0.68%*      163.83 *   3.65%*      160.97 *   1.83%*      163.93 *   3.71%*      159.62 *   0.98%*      161.79 *   2.36%*      161.14 *   1.94%*
> Hmean     send-128        314.86 (   0.00%)      314.41 *  -0.14%*      329.05 *   4.51%*      322.88 *   2.55%*      327.88 *   4.14%*      317.56 *   0.86%*      320.72 *   1.86%*      319.62 *   1.51%*
> Hmean     send-256        629.98 (   0.00%)      625.78 *  -0.67%*      652.67 *   3.60%*      639.98 *   1.59%*      643.99 *   2.22%*      631.96 *   0.31%*      635.75 *   0.92%*      644.10 *   2.24%*
> Hmean     send-1024      2465.04 (   0.00%)     2452.29 *  -0.52%*     2554.66 *   3.64%*     2509.60 *   1.81%*     2540.71 *   3.07%*     2495.82 *   1.25%*     2490.50 *   1.03%*     2509.86 *   1.82%*
> Hmean     send-2048      4717.57 (   0.00%)     4713.17 *  -0.09%*     4923.98 *   4.38%*     4811.01 *   1.98%*     4881.87 *   3.48%*     4793.82 *   1.62%*     4820.28 *   2.18%*     4824.60 *   2.27%*
> Hmean     send-3312      7412.33 (   0.00%)     7433.42 *   0.28%*     7717.76 *   4.12%*     7522.97 *   1.49%*     7620.99 *   2.82%*     7522.89 *   1.49%*     7614.51 *   2.73%*     7568.51 *   2.11%*
> Hmean     send-4096      9021.55 (   0.00%)     8988.71 *  -0.36%*     9337.62 *   3.50%*     9075.49 *   0.60%*     9258.34 *   2.62%*     9117.17 *   1.06%*     9175.85 *   1.71%*     9079.50 *   0.64%*
> Hmean     send-8192     15370.36 (   0.00%)    15467.63 *   0.63%*    15999.52 *   4.09%*    15467.80 *   0.63%*    15978.69 *   3.96%*    15619.84 *   1.62%*    15395.09 *   0.16%*    15779.73 *   2.66%*
> Hmean     send-16384    26512.35 (   0.00%)    26498.18 *  -0.05%*    26931.86 *   1.58%*    26513.18 *   0.00%*    26873.98 *   1.36%*    26456.38 *  -0.21%*    26467.77 *  -0.17%*    26975.04 *   1.75%*

I have attempted a few other things after this.

As pointed out above, with 5.7-rc7 I can't see a regression.

The machine I'm testing on is 2 Sockets Xeon E5 2x10-Cores (40 CPUs).

If I switch to 5.6, I can see a drop (each run performed twice)

                                   nouclamp              nouclamp2                 uclamp                uclamp2
Hmean     send-64         162.43 (   0.00%)      161.46 *  -0.60%*      157.84 *  -2.82%*      158.11 *  -2.66%*
Hmean     send-128        324.71 (   0.00%)      323.88 *  -0.25%*      314.78 *  -3.06%*      314.94 *  -3.01%*
Hmean     send-256        641.55 (   0.00%)      640.22 *  -0.21%*      628.67 *  -2.01%*      631.79 *  -1.52%*
Hmean     send-1024      2525.28 (   0.00%)     2520.31 *  -0.20%*     2448.26 *  -3.05%*     2497.15 *  -1.11%*
Hmean     send-2048      4836.14 (   0.00%)     4827.47 *  -0.18%*     4712.08 *  -2.57%*     4757.70 *  -1.62%*
Hmean     send-3312      7540.83 (   0.00%)     7603.14 *   0.83%*     7425.45 *  -1.53%*     7499.87 *  -0.54%*
Hmean     send-4096      9124.53 (   0.00%)     9224.90 *   1.10%*     8948.82 *  -1.93%*     9087.20 *  -0.41%*
Hmean     send-8192     15589.67 (   0.00%)    15768.82 *   1.15%*    15486.35 *  -0.66%*    15594.53 *   0.03%*
Hmean     send-16384    26386.47 (   0.00%)    26683.64 *   1.13%*    25752.25 *  -2.40%*    26609.64 *   0.85%*

If I apply the 2 patches from my previous email, with uclamp enabled I see

                                   nouclamp              nouclamp2             uclamp-opt            uclamp-opt2
Hmean     send-64         162.43 (   0.00%)      161.46 *  -0.60%*      159.84 *  -1.60%*      160.79 *  -1.01%*
Hmean     send-128        324.71 (   0.00%)      323.88 *  -0.25%*      318.44 *  -1.93%*      321.88 *  -0.87%*
Hmean     send-256        641.55 (   0.00%)      640.22 *  -0.21%*      633.54 *  -1.25%*      640.43 *  -0.17%*
Hmean     send-1024      2525.28 (   0.00%)     2520.31 *  -0.20%*     2497.47 *  -1.10%*     2522.00 *  -0.13%*
Hmean     send-2048      4836.14 (   0.00%)     4827.47 *  -0.18%*     4773.63 *  -1.29%*     4825.31 *  -0.22%*
Hmean     send-3312      7540.83 (   0.00%)     7603.14 *   0.83%*     7512.92 *  -0.37%*     7482.66 *  -0.77%*
Hmean     send-4096      9124.53 (   0.00%)     9224.90 *   1.10%*     9076.62 *  -0.52%*     9175.58 *   0.56%*
Hmean     send-8192     15589.67 (   0.00%)    15768.82 *   1.15%*    15466.02 *  -0.79%*    15792.10 *   1.30%*
Hmean     send-16384    26386.47 (   0.00%)    26683.64 *   1.13%*    26234.79 *  -0.57%*    26459.95 *   0.28%*

Which shows that on this machine, the system is slowed down due to bad D$
behavior on access to rq->uclamp[].bucket[] and p->uclamp{_rq}[].

If I disable uclamp using the static key I get

                                   nouclamp              nouclamp2    uclamp-opt.disabled   uclamp-opt.disabled2
Hmean     send-64         162.43 (   0.00%)      161.46 *  -0.60%*      161.21 *  -0.75%*      161.05 *  -0.85%*
Hmean     send-128        324.71 (   0.00%)      323.88 *  -0.25%*      321.09 *  -1.11%*      319.72 *  -1.54%*
Hmean     send-256        641.55 (   0.00%)      640.22 *  -0.21%*      637.37 *  -0.65%*      637.82 *  -0.58%*
Hmean     send-1024      2525.28 (   0.00%)     2520.31 *  -0.20%*     2510.07 *  -0.60%*     2504.99 *  -0.80%*
Hmean     send-2048      4836.14 (   0.00%)     4827.47 *  -0.18%*     4795.29 *  -0.84%*     4788.99 *  -0.97%*
Hmean     send-3312      7540.83 (   0.00%)     7603.14 *   0.83%*     7490.27 *  -0.67%*     7498.56 *  -0.56%*
Hmean     send-4096      9124.53 (   0.00%)     9224.90 *   1.10%*     9108.73 *  -0.17%*     9196.45 *   0.79%*
Hmean     send-8192     15589.67 (   0.00%)    15768.82 *   1.15%*    15649.50 *   0.38%*    16101.68 *   3.28%*
Hmean     send-16384    26386.47 (   0.00%)    26683.64 *   1.13%*    26435.38 *   0.19%*    27199.11 *   3.08%*

I decided after this to see if this failure is observed all the way up to
5.7-rc7.

For 5.7-rc1 I get (comparing against 5.6-nouclamp)

                                   nouclamp              nouclamp2                 uclamp                uclamp2
Hmean     send-64         162.43 (   0.00%)      161.46 *  -0.60%*      155.56 *  -4.23%*      156.72 *  -3.52%*
Hmean     send-128        324.71 (   0.00%)      323.88 *  -0.25%*      311.68 *  -4.01%*      312.63 *  -3.72%*
Hmean     send-256        641.55 (   0.00%)      640.22 *  -0.21%*      616.03 *  -3.98%*      620.83 *  -3.23%*
Hmean     send-1024      2525.28 (   0.00%)     2520.31 *  -0.20%*     2441.92 *  -3.30%*     2433.83 *  -3.62%*
Hmean     send-2048      4836.14 (   0.00%)     4827.47 *  -0.18%*     4698.42 *  -2.85%*     4682.22 *  -3.18%*
Hmean     send-3312      7540.83 (   0.00%)     7603.14 *   0.83%*     7379.37 *  -2.14%*     7354.82 *  -2.47%*
Hmean     send-4096      9124.53 (   0.00%)     9224.90 *   1.10%*     8797.21 *  -3.59%*     8815.65 *  -3.39%*
Hmean     send-8192     15589.67 (   0.00%)    15768.82 *   1.15%*    15009.19 *  -3.72%*    15065.16 *  -3.36%*
Hmean     send-16384    26386.47 (   0.00%)    26683.64 *   1.13%*    25829.20 *  -2.11%*    25783.17 *  -2.29%*

For 5.7-rc2, the overhead disappears again (against 5.6-nouclamp)

                                   nouclamp              nouclamp2                 uclamp                uclamp2
Hmean     send-64         162.43 (   0.00%)      161.46 *  -0.60%*      162.97 *   0.34%*      163.31 *   0.54%*
Hmean     send-128        324.71 (   0.00%)      323.88 *  -0.25%*      323.94 *  -0.24%*      325.74 *   0.32%*
Hmean     send-256        641.55 (   0.00%)      640.22 *  -0.21%*      641.82 *   0.04%*      645.11 *   0.56%*
Hmean     send-1024      2525.28 (   0.00%)     2520.31 *  -0.20%*     2522.74 *  -0.10%*     2535.63 *   0.41%*
Hmean     send-2048      4836.14 (   0.00%)     4827.47 *  -0.18%*     4836.74 *   0.01%*     4838.62 *   0.05%*
Hmean     send-3312      7540.83 (   0.00%)     7603.14 *   0.83%*     7635.31 *   1.25%*     7613.91 *   0.97%*
Hmean     send-4096      9124.53 (   0.00%)     9224.90 *   1.10%*     9198.58 *   0.81%*     9161.53 *   0.41%*
Hmean     send-8192     15589.67 (   0.00%)    15768.82 *   1.15%*    15804.47 *   1.38%*    15755.91 *   1.07%*
Hmean     send-16384    26386.47 (   0.00%)    26683.64 *   1.13%*    26649.29 *   1.00%*    26677.46 *   1.10%*

I stopped here tbh. I thought maybe NUMA scheduling was making the uclamp
accesses more expensive in certain patterns, so I tried with numactl -N 0
(using 5.7-rc1)

                                   nouclamp              nouclamp2            uclamp-N0-1            uclamp-N0-2
Hmean     send-64         162.43 (   0.00%)      161.46 *  -0.60%*      156.26 *  -3.80%*      156.00 *  -3.96%*
Hmean     send-128        324.71 (   0.00%)      323.88 *  -0.25%*      312.20 *  -3.85%*      312.94 *  -3.63%*
Hmean     send-256        641.55 (   0.00%)      640.22 *  -0.21%*      620.29 *  -3.31%*      619.25 *  -3.48%*
Hmean     send-1024      2525.28 (   0.00%)     2520.31 *  -0.20%*     2437.59 *  -3.47%*     2433.94 *  -3.62%*
Hmean     send-2048      4836.14 (   0.00%)     4827.47 *  -0.18%*     4671.28 *  -3.41%*     4714.49 *  -2.52%*
Hmean     send-3312      7540.83 (   0.00%)     7603.14 *   0.83%*     7355.86 *  -2.45%*     7387.51 *  -2.03%*
Hmean     send-4096      9124.53 (   0.00%)     9224.90 *   1.10%*     8793.02 *  -3.63%*     8883.88 *  -2.64%*
Hmean     send-8192     15589.67 (   0.00%)    15768.82 *   1.15%*    14898.76 *  -4.43%*    14958.19 *  -4.05%*
Hmean     send-16384    26386.47 (   0.00%)    26683.64 *   1.13%*    25745.40 *  -2.43%*    25800.01 *  -2.22%*

And it had no effect. Interestingly, Lukasz saw an improvement when he tried
something similar on his machine.

Do we have any previous history of code/data layout affecting the performance
of the hot path? On the Juno board (octa-core big.LITTLE arm platform),
I could make the overhead disappear with a simple code shuffle (for
perf bench sched pipe).

I have tried putting the rq->uclamp[].bucket[] structures into their own PERCPU
variable, since the rq is read by many CPUs and I thought that might lead to bad
cache patterns given that uclamp data is mostly read by the owning CPU, but no
luck with this approach.

I am working on a proper static key patch now that disables uclamp by default
and only enables it if userspace attempts to modify any of the knobs it
provides; then we switch it on and keep it on. Testing it at the moment.

Thanks

--
Qais Yousef


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-11 10:24                         ` Qais Yousef
@ 2020-06-11 12:01                           ` Vincent Guittot
  2020-06-23 15:44                             ` Qais Yousef
  0 siblings, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2020-06-11 12:01 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Mel Gorman, Patrick Bellasi, Dietmar Eggemann, Peter Zijlstra,
	Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli,
	Steven Rostedt, Ben Segall, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Quentin Perret, Valentin Schneider, Pavan Kondeti,
	linux-doc, linux-kernel, linux-fs

On Thu, 11 Jun 2020 at 12:24, Qais Yousef <qais.yousef@arm.com> wrote:
>
> On 06/09/20 19:10, Vincent Guittot wrote:
> > On Mon, 8 Jun 2020 at 14:31, Qais Yousef <qais.yousef@arm.com> wrote:
> > >
> > > On 06/04/20 14:14, Vincent Guittot wrote:
> > >
> > > [...]
> > >
> > > > I have tried your patch and I don't see any difference compared to
> > > > previous tests. Let me give you more details of my setup:
> > > > I create 3 levels of cgroups and usually run the tests in the 4 levels
> > > > (which includes root). The result above are for the root level
> > > >
> > > > But I see a difference at other levels:
> > > >
> > > >                            root           level 1       level 2       level 3
> > > >
> > > > /w patch uclamp disable     50097         46615         43806         41078
> > > > tip uclamp enable           48706(-2.78%) 45583(-2.21%) 42851(-2.18%)
> > > > 40313(-1.86%)
> > > > /w patch uclamp enable      48882(-2.43%) 45774(-1.80%) 43108(-1.59%)
> > > > 40667(-1.00%)
> > > >
> > > > Whereas tip with uclamp stays around 2% behind tip without uclamp, the
> > > > diff of uclamp with your patch tends to decrease when we increase the
> > > > number of level
> > >
> > > So I did try to dig more into this, but I think it's either not a good
> > > reproducer, or what we're observing here is uArch-level latencies caused by the
> > > new code that seem to produce a bigger knock-on effect than they really
> > > are.
> > >
> > > First, CONFIG_FAIR_GROUP_SCHED is 'expensive', for some definition of
> > > expensive..
> >
> > yes, enabling CONFIG_FAIR_GROUP_SCHED adds an overhead
> >
> > >
> > > *** uclamp disabled/fair group enabled ***
> > >
> > >         # Executed 50000 pipe operations between two threads
> > >
> > >              Total time: 0.958 [sec]
> > >
> > >               19.177100 usecs/op
> > >                   52145 ops/sec
> > >
> > > *** uclamp disabled/fair group disabled ***
> > >
> > >         # Executed 50000 pipe operations between two threads
> > >              Total time: 0.808 [sec]
> > >
> > >              16.176200 usecs/op
> > >                  61819 ops/sec
> > >
> > > So there's a 15.6% drop in ops/sec when enabling this option. I think it's good
> > > to look at the absolute number of usecs/op; fair group adds around
> > > 3 usecs/op.
> > >
> > > I dropped FAIR_GROUP_SCHED from my config to eliminate this overhead and focus
> > > solely on uclamp overhead.
> >
> > Have you checked that both tests run at the root level ?
>
> I haven't actively moved tasks to cgroups. As I said that snippet was
> particularly bad and I didn't see that level of nesting in every call.
>
> > Your function-graph log below shows several calls to
> > update_cfs_group() which means that your trace below has not been made
> > at root level but most probably at the 3rd level and I wonder if you
> > used the same setup for running the benchmark above. This could
> > explain such huge difference because I don't have such difference on
> > my platform but more around 2%
>
> What prompted me to look at this is when you reported that even without uclamp
> the nested cgroup showed a drop at each level. I was just trying to understand
> how both affect the hot path in hope to understand the root cause of uclamp
> overhead.
>
> >
> > For uclamp disable/fair group enable/ function graph enable :  47994ops/sec
> > For uclamp disable/fair group disable/ function graph enable : 49107ops/sec
> >
> > >
> > > With uclamp enabled but no fair group I get
> > >
> > > *** uclamp enabled/fair group disabled ***
> > >
> > >         # Executed 50000 pipe operations between two threads
> > >              Total time: 0.856 [sec]
> > >
> > >              17.125740 usecs/op
> > >                  58391 ops/sec
> > >
> > > The drop is 5.5% in ops/sec, or 1 usec/op.
> > >
> > > I don't know what the expectation is here. 1 us could be a lot, but I don't
> > > think we expect the new code to take more than a few hundred ns anyway. If you
> > > add potential caching effects, reaching 1 us wouldn't be that hard.
> > >
> > > Note that in my runs I chose performance governor and use `taskset 0x2` to
> >
> > You might want to set 2 CPUs in your cpumask instead of 1 in order to
> > have 1 CPU for each thread
>
I did try that but it didn't seem to change the number. I think the 2 tasks
interleave, so running on 2 CPUs doesn't change the result. But to ease ftrace
capture, it's easier to monitor a single CPU.
>
> >
> > > force running on a big core to make sure the runs are repeatable.
> >
> > I also use the performance governor but don't pin tasks because I use SMP.
>
> Is your arm platform SMP?

Yes, all my tests are done on the Arm64 octa-core SMP system.

>
> >
> > >
> > > On Juno-r2 I managed to scrap most of the 1 us with the below patch. It seems
> > > there was weird branching behavior that affects the I$ in my case. It'd be good
> > > to try it out to see if it makes a difference for you.
> >
> > The perf are slightly worse on my setup:
> > For uclamp enable/fair group disable/ function graph enable : 48413ops/sec
> > with the patch below: 47804 ops/sec
>
I am not sure if the new code could just introduce worse cache performance
in a platform-dependent way. The evidence I have so far points in this
direction.
>
> >
> > >
> > > The I$ effect is my best educated guess. Perf doesn't catch this path and
> > > I couldn't convince it to look at cache and branch misses between 2 specific
> > > points.
> > >
> > > Other subtle code shuffling had a weird effect on the result too. One notable
> > > case: making uclamp_rq_dec() noinline gains back ~400 ns, while making
> > > uclamp_rq_inc() noinline *too* cancels this gain out :-/
> > >
> > >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 0464569f26a7..0835ee20a3c7 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -1071,13 +1071,11 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
> > >
> > >  static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> > >  {
> > > -       enum uclamp_id clamp_id;
> > > -
> > >         if (unlikely(!p->sched_class->uclamp_enabled))
> > >                 return;
> > >
> > > -       for_each_clamp_id(clamp_id)
> > > -               uclamp_rq_inc_id(rq, p, clamp_id);
> > > +       uclamp_rq_inc_id(rq, p, UCLAMP_MIN);
> > > +       uclamp_rq_inc_id(rq, p, UCLAMP_MAX);
> > >
> > >         /* Reset clamp idle holding when there is one RUNNABLE task */
> > >         if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
> > > @@ -1086,13 +1084,11 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> > >
> > >  static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
> > >  {
> > > -       enum uclamp_id clamp_id;
> > > -
> > >         if (unlikely(!p->sched_class->uclamp_enabled))
> > >                 return;
> > >
> > > -       for_each_clamp_id(clamp_id)
> > > -               uclamp_rq_dec_id(rq, p, clamp_id);
> > > +       uclamp_rq_dec_id(rq, p, UCLAMP_MIN);
> > > +       uclamp_rq_dec_id(rq, p, UCLAMP_MAX);
> > >  }
> > >
> > >  static inline void
> > >
> > >
> > > FWIW I fail to see activate/deactivate_task in perf record. They don't show up
> > > on the list, which means this microbenchmark doesn't stress them the way Mel's
> > > test does.
> >
> > Strange because I have been able to trace them.
>
> On your arm platform? I can certainly see them on x86.

yes on my arm platform

>
> Thanks

>
> --
> Qais Yousef


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-04 13:40               ` Mel Gorman
  2020-06-05 10:58                 ` Qais Yousef
@ 2020-06-11 10:58                 ` Qais Yousef
  2020-06-16 11:08                   ` Qais Yousef
  1 sibling, 1 reply; 68+ messages in thread
From: Qais Yousef @ 2020-06-11 10:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dietmar Eggemann, Peter Zijlstra, Ingo Molnar, Randy Dunlap,
	Jonathan Corbet, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Ben Segall, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

On 06/04/20 14:40, Mel Gorman wrote:
> On Wed, Jun 03, 2020 at 01:41:13PM +0100, Qais Yousef wrote:
> > > > netperf-udp
> > > >                                 ./5.7.0-rc7            ./5.7.0-rc7            ./5.7.0-rc7
> > > >                               without-clamp             with-clamp      with-clamp-tskgrp
> > > > 
> > > > Hmean     send-64         153.62 (   0.00%)      151.80 *  -1.19%*      155.60 *   1.28%*
> > > > Hmean     send-128        306.77 (   0.00%)      306.27 *  -0.16%*      309.39 *   0.85%*
> > > > Hmean     send-256        608.54 (   0.00%)      604.28 *  -0.70%*      613.42 *   0.80%*
> > > > Hmean     send-1024      2395.80 (   0.00%)     2365.67 *  -1.26%*     2409.50 *   0.57%*
> > > > Hmean     send-2048      4608.70 (   0.00%)     4544.02 *  -1.40%*     4665.96 *   1.24%*
> > > > Hmean     send-3312      7223.97 (   0.00%)     7158.88 *  -0.90%*     7331.23 *   1.48%*
> > > > Hmean     send-4096      8729.53 (   0.00%)     8598.78 *  -1.50%*     8860.47 *   1.50%*
> > > > Hmean     send-8192     14961.77 (   0.00%)    14418.92 *  -3.63%*    14908.36 *  -0.36%*
> > > > Hmean     send-16384    25799.50 (   0.00%)    25025.64 *  -3.00%*    25831.20 *   0.12%*
> > > > Hmean     recv-64         153.62 (   0.00%)      151.80 *  -1.19%*      155.60 *   1.28%*
> > > > Hmean     recv-128        306.77 (   0.00%)      306.27 *  -0.16%*      309.39 *   0.85%*
> > > > Hmean     recv-256        608.54 (   0.00%)      604.28 *  -0.70%*      613.42 *   0.80%*
> > > > Hmean     recv-1024      2395.80 (   0.00%)     2365.67 *  -1.26%*     2409.50 *   0.57%*
> > > > Hmean     recv-2048      4608.70 (   0.00%)     4544.02 *  -1.40%*     4665.95 *   1.24%*
> > > > Hmean     recv-3312      7223.97 (   0.00%)     7158.88 *  -0.90%*     7331.23 *   1.48%*
> > > > Hmean     recv-4096      8729.53 (   0.00%)     8598.78 *  -1.50%*     8860.47 *   1.50%*
> > > > Hmean     recv-8192     14961.61 (   0.00%)    14418.88 *  -3.63%*    14908.30 *  -0.36%*
> > > > Hmean     recv-16384    25799.39 (   0.00%)    25025.49 *  -3.00%*    25831.00 *   0.12%*
> > > > 
> > > > netperf-tcp
> > > >  
> > > > Hmean     64              818.65 (   0.00%)      812.98 *  -0.69%*      826.17 *   0.92%*
> > > > Hmean     128            1569.55 (   0.00%)     1555.79 *  -0.88%*     1586.94 *   1.11%*
> > > > Hmean     256            2952.86 (   0.00%)     2915.07 *  -1.28%*     2968.15 *   0.52%*
> > > > Hmean     1024          10425.91 (   0.00%)    10296.68 *  -1.24%*    10418.38 *  -0.07%*
> > > > Hmean     2048          17454.51 (   0.00%)    17369.57 *  -0.49%*    17419.24 *  -0.20%*
> > > > Hmean     3312          22509.95 (   0.00%)    22229.69 *  -1.25%*    22373.32 *  -0.61%*
> > > > Hmean     4096          25033.23 (   0.00%)    24859.59 *  -0.69%*    24912.50 *  -0.48%*
> > > > Hmean     8192          32080.51 (   0.00%)    31744.51 *  -1.05%*    31800.45 *  -0.87%*
> > > > Hmean     16384         36531.86 (   0.00%)    37064.68 *   1.46%*    37397.71 *   2.37%*
> > > > 
> > > > The diffs are smaller than on openSUSE Leap 15.1 and some of the
> > > > uclamp taskgroup results are better?
> > > > 
> > > 
> > > I don't see the stddev and coeff but these look close to borderline.
> > > Sure, they are marked with a * so it passed a significant test but it's
> > > still a very marginal difference for netperf. It's possible that the
> > > systemd configurations differ in some way that is significant for uclamp
> > > but I don't know what that is.
> > 
> > Hmm so what you're saying is that Dietmar didn't reproduce the same problem
> > you're observing? I was hoping to use that to dig more into it.
> > 
> 
> Not as such, I'm saying that for whatever reason the problem is not as
> visible with Dietmar's setup. It may be machine-specific or distribution
> specific. There are alternative suggestions for testing just the fast
> paths with a pipe test that may be clearer.

I have regained access to the same machine Dietmar ran his tests on, and I got
some weird results to share.

First I tried with the `perf bench -r 20 sched pipe -T` command to identify the
cause of the overhead. And indeed I do see the activate/deactivate_task
overhead going up when uclamp is enabled.

With uclamp run #1:

   2.44%  sched-pipe  [kernel.vmlinux]  [k] activate_task
   1.59%  sched-pipe  [kernel.vmlinux]  [k] deactivate_task

With uclamp run #2:

   4.55%  sched-pipe  [kernel.vmlinux]  [k] deactivate_task
   2.34%  sched-pipe  [kernel.vmlinux]  [k] activate_task

Without uclamp run #1:

   0.12%  sched-pipe  [kernel.vmlinux]  [k] activate_task
   0.11%  sched-pipe  [kernel.vmlinux]  [k] deactivate_task

Without uclamp run #2:

   0.11%  sched-pipe  [kernel.vmlinux]  [k] activate_task
   0.07%  sched-pipe  [kernel.vmlinux]  [k] deactivate_task

Looking at the annotation, I see in the enqueue path:

  0.08 │ c5:   mov    %ecx,%esi
  0.99 │       and    $0xfffff800,%r9d
       │       and    $0x7ff,%eax
       │       lea    0xd4(%rsi),%r13
  0.10 │       or     %r9d,%eax
       │     bucket->tasks++;
  0.03 │       lea    (%rsi,%rsi,1),%r11
       │     p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
  0.02 │       mov    %eax,0x358(%rbx,%rdi,4)
       │     bucket = &uc_rq->bucket[uc_se->bucket_id];
  0.02 │       movzbl 0x9(%rbx,%r13,4),%ecx
       │     bucket->tasks++;
  0.74 │       add    %rsi,%r11
       │     bucket = &uc_rq->bucket[uc_se->bucket_id];
  0.02 │       shr    $0x3,%cl
       │     bucket->tasks++;
       │       and    $0x7,%ecx
  0.05 │       lea    0x8(%rcx,%r11,2),%rax
  3.52 │       addq   $0x800,0x8(%r12,%rax,8)
       │     uc_se->active = true;
       │       orb    $0x40,0x9(%rbx,%r13,4)
       │     uclamp_idle_reset(rq, clamp_id, uc_se->value);
 73.34 │       movzwl 0x8(%rbx,%r13,4),%eax                     <--- XXXXX
       │     uclamp_idle_reset():
       │     if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
       │       mov    0xa0(%r12),%r9d
       │       mov    %r9d,%r10d
       │     uclamp_rq_inc_id():
       │     uclamp_idle_reset(rq, clamp_id, uc_se->value)

and at the dequeue path:

  0.07 │       mov    0x8(%rax),%ecx
       │       test   %ecx,%ecx
       │     ↑ je     60
  0.30 │       xor    %r14d,%r14d
       │     uclamp_rq_dec_id():
       │     bucket->tasks--;
       │       mov    $0xfffffffffffff800,%r15
       │     bucket = &uc_rq->bucket[uc_se->bucket_id];
  0.07 │ 90:   mov    %r14d,%ecx
       │       mov    %r14d,%r12d
  0.04 │       lea    0xd4(%rcx),%rax
 20.06 │       movzbl 0x9(%rsi,%rax,4),%r13d                     <--- XXXXX
       │     SCHED_WARN_ON(!bucket->tasks);
       │       lea    (%rcx,%rcx,1),%rax
       │       add    %rcx,%rax
       │     bucket = &uc_rq->bucket[uc_se->bucket_id];
  0.07 │       shr    $0x3,%r13b
       │     SCHED_WARN_ON(!bucket->tasks);
  0.07 │       and    $0x7,%r13d
  0.17 │       lea    0x8(%r13,%rax,2),%rax
 24.52 │       testq  $0xfffffffffffff800,0x8(%rbx,%rax,8)       <--- XXXXX
  0.34 │     ↓ je     172
.
.
.
  0.14 │       mov    %ecx,0x40(%rax)
       │     ↑ jmpq   f8
  1.25 │250:   sub    $0x8,%rcx
       │     uclamp_rq_max_value():
       │     for ( ; bucket_id >= 0; bucket_id--) {
  4.97 │       cmp    %rcx,%rax
       │     ↑ jne    22b
       │     uclamp_idle_value():
       │     return uclamp_none(UCLAMP_MIN);
       │       xor    %ecx,%ecx
       │     if (clamp_id == UCLAMP_MAX) {
  0.74 │       cmp    $0x1,%r14
       │     ↑ jne    23d
       │     uclamp_rq_dec_id():
       │     bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
 20.10 │       movzwl 0x35c(%rsi),%ecx                           <--- XXXXXX
       │     uclamp_idle_value():
       │     rq->uclamp_flags |= UCLAMP_FLAG_IDLE;
       │       orl    $0x1,0xa0(%rbx)
       │     uclamp_rq_dec_id():
       │     bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
  0.14 │       and    $0x7ff,%ecx
       │     ↑ jmp    23d


Which I interpreted as accesses to the rq->uclamp[].bucket[] and p->uclamp[] structs.

The movzwl shenanigans prompted me to remove the bitfields in case they were
causing some weird effect, and I shortened struct uclamp_bucket to reduce the
potential cache pressure and make it all fit in a single cache line.


diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4418f5cb8324..7d0acf250573 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -601,10 +601,10 @@ struct sched_dl_entity {
  * default boost can still drop its own boosting to 0%.
  */
 struct uclamp_se {
-	unsigned int value		: bits_per(SCHED_CAPACITY_SCALE);
-	unsigned int bucket_id		: bits_per(UCLAMP_BUCKETS);
-	unsigned int active		: 1;
-	unsigned int user_defined	: 1;
+	unsigned int value;
+	unsigned int bucket_id;
+	unsigned int active;
+	unsigned int user_defined;
 };
 #endif /* CONFIG_UCLAMP_TASK */
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index db3a57675ccf..a1e7080c48e8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -833,8 +833,8 @@ extern void rto_push_irq_work_func(struct irq_work *work);
  * clamp value.
  */
 struct uclamp_bucket {
-	unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
-	unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
+	unsigned short value;
+	unsigned short tasks;
 };
 
 /*


This makes the perf output look like this now:

With patch run #1:

   1.34%  sched-pipe  [kernel.vmlinux]  [k] deactivate_task
   0.44%  sched-pipe  [kernel.vmlinux]  [k] activate_task

With patch run #2:

   2.41%  sched-pipe  [kernel.vmlinux]  [k] deactivate_task
   0.32%  sched-pipe  [kernel.vmlinux]  [k] activate_task


So it seems to help the activate_task path a lot, but not as much for the
deactivate_task path. Note that activate_task path was hotter than
deactivate_task without this patch.

Further, I have tried adding a static key like you suggested for PSI.

With static key disabling uclamp:

   0.13%  sched-pipe  [kernel.vmlinux]  [k] activate_task
   0.07%  sched-pipe  [kernel.vmlinux]  [k] deactivate_task



diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f..1814baa95c81 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -84,6 +84,9 @@ extern int sched_rt_handler(struct ctl_table *table, int write,
 extern int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 				       void __user *buffer, size_t *lenp,
 				       loff_t *ppos);
+extern int sysctl_sched_uclamp_disabled(struct ctl_table *table, int write,
+				        void __user *buffer, size_t *lenp,
+				        loff_t *ppos);
 #endif
 
 extern int sysctl_numa_balancing(struct ctl_table *table, int write,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9a2fbf98fd6f..8d932b3922c9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -793,6 +793,8 @@ unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
 /* All clamps are required to be less or equal than these values */
 static struct uclamp_se uclamp_default[UCLAMP_CNT];
 
+DEFINE_STATIC_KEY_FALSE(sched_uclamp_disabled);
+
 /* Integer rounded range for each bucket */
 #define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
 
@@ -1020,6 +1022,9 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 {
 	enum uclamp_id clamp_id;
 
+	if (static_branch_likely(&sched_uclamp_disabled))
+		return;
+
 	if (unlikely(!p->sched_class->uclamp_enabled))
 		return;
 
@@ -1035,6 +1040,9 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
 {
 	enum uclamp_id clamp_id;
 
+	if (static_branch_likely(&sched_uclamp_disabled))
+		return;
+
 	if (unlikely(!p->sched_class->uclamp_enabled))
 		return;
 
@@ -1164,6 +1172,30 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 	return result;
 }
 
+int sysctl_sched_uclamp_disabled(struct ctl_table *table, int write,
+			 void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table t;
+	int err;
+	int state = static_branch_likely(&sched_uclamp_disabled);
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	t = *table;
+	t.data = &state;
+	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	if (err < 0)
+		return err;
+	if (write) {
+		if (state)
+			static_branch_enable(&sched_uclamp_disabled);
+		else
+			static_branch_disable(&sched_uclamp_disabled);
+	}
+	return err;
+}
+
 static int uclamp_validate(struct task_struct *p,
 			   const struct sched_attr *attr)
 {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8a176d8727a3..ef842cbf1f91 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -453,6 +453,15 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sysctl_sched_uclamp_handler,
 	},
+	{
+		.procname	= "sched_uclamp_disabled",
+		.data		= NULL,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_sched_uclamp_disabled,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 #endif
 #ifdef CONFIG_SCHED_AUTOGROUP
 	{



I stopped over here and decided to run your netperf test to see what
effect this has, and this is where things get weird.


Running with uclamp gives better results than without uclamp :-/


The nouclamp column is with uclamp disabled at the config level.

The '.disabled' postfix means uclamp disabled via the sysctl patch (static key).

uclamp-opt is with the above patch that improved the D$ performance for perf
bench.

I ran uclamp and uclamp-opt with the static key patch applied. I ran them twice,
as after noticing that nouclamp is worse than with uclamp I wanted to see how
the numbers differ between runs.

                                    nouclam               nouclamp                  uclam                 uclamp         uclamp.disable                 uclamp                 uclamp                 uclamp
                                   nouclamp              recompile                 uclamp                uclamp2        uclamp.disabled                    opt                   opt2           opt.disabled
Hmean     send-64         158.07 (   0.00%)      156.99 *  -0.68%*      163.83 *   3.65%*      160.97 *   1.83%*      163.93 *   3.71%*      159.62 *   0.98%*      161.79 *   2.36%*      161.14 *   1.94%*
Hmean     send-128        314.86 (   0.00%)      314.41 *  -0.14%*      329.05 *   4.51%*      322.88 *   2.55%*      327.88 *   4.14%*      317.56 *   0.86%*      320.72 *   1.86%*      319.62 *   1.51%*
Hmean     send-256        629.98 (   0.00%)      625.78 *  -0.67%*      652.67 *   3.60%*      639.98 *   1.59%*      643.99 *   2.22%*      631.96 *   0.31%*      635.75 *   0.92%*      644.10 *   2.24%*
Hmean     send-1024      2465.04 (   0.00%)     2452.29 *  -0.52%*     2554.66 *   3.64%*     2509.60 *   1.81%*     2540.71 *   3.07%*     2495.82 *   1.25%*     2490.50 *   1.03%*     2509.86 *   1.82%*
Hmean     send-2048      4717.57 (   0.00%)     4713.17 *  -0.09%*     4923.98 *   4.38%*     4811.01 *   1.98%*     4881.87 *   3.48%*     4793.82 *   1.62%*     4820.28 *   2.18%*     4824.60 *   2.27%*
Hmean     send-3312      7412.33 (   0.00%)     7433.42 *   0.28%*     7717.76 *   4.12%*     7522.97 *   1.49%*     7620.99 *   2.82%*     7522.89 *   1.49%*     7614.51 *   2.73%*     7568.51 *   2.11%*
Hmean     send-4096      9021.55 (   0.00%)     8988.71 *  -0.36%*     9337.62 *   3.50%*     9075.49 *   0.60%*     9258.34 *   2.62%*     9117.17 *   1.06%*     9175.85 *   1.71%*     9079.50 *   0.64%*
Hmean     send-8192     15370.36 (   0.00%)    15467.63 *   0.63%*    15999.52 *   4.09%*    15467.80 *   0.63%*    15978.69 *   3.96%*    15619.84 *   1.62%*    15395.09 *   0.16%*    15779.73 *   2.66%*
Hmean     send-16384    26512.35 (   0.00%)    26498.18 *  -0.05%*    26931.86 *   1.58%*    26513.18 *   0.00%*    26873.98 *   1.36%*    26456.38 *  -0.21%*    26467.77 *  -0.17%*    26975.04 *   1.75%*


Happy to try more things if you have any suggestions. I am getting a bit
stumped by the netperf results, but haven't tried to profile them. I might try
that, but thought I'd report this first as it's time consuming.

Thanks

--
Qais Yousef


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-09 17:10                       ` Vincent Guittot
@ 2020-06-11 10:24                         ` Qais Yousef
  2020-06-11 12:01                           ` Vincent Guittot
  0 siblings, 1 reply; 68+ messages in thread
From: Qais Yousef @ 2020-06-11 10:24 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mel Gorman, Patrick Bellasi, Dietmar Eggemann, Peter Zijlstra,
	Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli,
	Steven Rostedt, Ben Segall, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Quentin Perret, Valentin Schneider, Pavan Kondeti,
	linux-doc, linux-kernel, linux-fs

On 06/09/20 19:10, Vincent Guittot wrote:
> On Mon, 8 Jun 2020 at 14:31, Qais Yousef <qais.yousef@arm.com> wrote:
> >
> > On 06/04/20 14:14, Vincent Guittot wrote:
> >
> > [...]
> >
> > > I have tried your patch and I don't see any difference compared to
> > > previous tests. Let me give you more details of my setup:
> > > I create 3 levels of cgroups and usually run the tests in the 4 levels
> > > (which includes root). The result above are for the root level
> > >
> > > But I see a difference at other levels:
> > >
> > >                            root           level 1       level 2       level 3
> > >
> > > /w patch uclamp disable     50097         46615         43806         41078
> > > tip uclamp enable           48706(-2.78%) 45583(-2.21%) 42851(-2.18%)
> > > 40313(-1.86%)
> > > /w patch uclamp enable      48882(-2.43%) 45774(-1.80%) 43108(-1.59%)
> > > 40667(-1.00%)
> > >
> > > Whereas tip with uclamp stays around 2% behind tip without uclamp, the
> > > diff of uclamp with your patch tends to decrease when we increase the
> > > number of level
> >
> > So I did try to dig more into this, but I think it's either not a good
> > reproducer, or what we're observing here is uArch-level latencies caused by the
> > new code that seem to produce a bigger knock-on effect than they really
> > are.
> >
> > First, CONFIG_FAIR_GROUP_SCHED is 'expensive', for some definition of
> > expensive..
> 
> yes, enabling CONFIG_FAIR_GROUP_SCHED adds an overhead
> 
> >
> > *** uclamp disabled/fair group enabled ***
> >
> >         # Executed 50000 pipe operations between two threads
> >
> >              Total time: 0.958 [sec]
> >
> >               19.177100 usecs/op
> >                   52145 ops/sec
> >
> > *** uclamp disabled/fair group disabled ***
> >
> >         # Executed 50000 pipe operations between two threads
> >              Total time: 0.808 [sec]
> >
> >              16.176200 usecs/op
> >                  61819 ops/sec
> >
> > So there's a 15.6% drop in ops/sec when enabling this option. I think it's good
> > to look at the absolute number of usecs/op; fair group adds around
> > 3 usecs/op.
> >
> > I dropped FAIR_GROUP_SCHED from my config to eliminate this overhead and focus
> > solely on uclamp overhead.
> 
> Have you checked that both tests run at the root level ?

I haven't actively moved tasks to cgroups. As I said that snippet was
particularly bad and I didn't see that level of nesting in every call.

> Your function-graph log below shows several calls to
> update_cfs_group() which means that your trace below has not been made
> at root level but most probably at the 3rd level and I wonder if you
> used the same setup for running the benchmark above. This could
> explain such huge difference because I don't have such difference on
> my platform but more around 2%

What prompted me to look at this was your report that even without uclamp
the nested cgroups showed a drop at each level. I was just trying to understand
how both affect the hot path in hope to understand the root cause of uclamp
overhead.

> 
> For uclamp disable/fair group enable/function graph enable:  47994 ops/sec
> For uclamp disable/fair group disable/function graph enable: 49107 ops/sec
> 
> >
> > With uclamp enabled but no fair group I get
> >
> > *** uclamp enabled/fair group disabled ***
> >
> >         # Executed 50000 pipe operations between two threads
> >              Total time: 0.856 [sec]
> >
> >              17.125740 usecs/op
> >                  58391 ops/sec
> >
> > The drop is 5.5% in ops/sec. Or 1 usecs/op.
> >
> > I don't know what's the expectation here. 1 us could be a lot, but I don't
> > think we expect the new code to take more than a few hundred ns anyway. If you
> > add potential caching effects, reaching 1 us wouldn't be that hard.
> >
> > Note that in my runs I chose performance governor and use `taskset 0x2` to
> 
> You might want to set 2 CPUs in your cpumask instead of 1 in order to
> have 1 CPU for each thread

I did try that but it didn't seem to change the numbers. I think the 2 tasks
interleave, so running on 2 CPUs doesn't change the result. But to ease ftrace
capture, it's easier to monitor a single CPU.

> 
> > force running on a big core to make sure the runs are repeatable.
> 
> I also use the performance governor but don't pin tasks because I use SMP.

Is your arm platform SMP?

> 
> >
> > On Juno-r2 I managed to scrap most of the 1 us with the below patch. It seems
> > there was weird branching behavior that affects the I$ in my case. It'd be good
> > to try it out to see if it makes a difference for you.
> 
> The perf numbers are slightly worse on my setup:
> For uclamp enable/fair group disable/function graph enable: 48413 ops/sec
> with the patch below: 47804 ops/sec

I am not sure if the new code just introduces worse cache performance
in a platform-dependent way. The evidence I have so far points in this
direction.

> 
> >
> > The I$ effect is my best educated guess. Perf doesn't catch this path and
> > I couldn't convince it to look at cache and branch misses between 2 specific
> > points.
> >
> > Other subtle code shuffling did have weird effect on the result too. One worthy
> > one is making uclamp_rq_dec() noinline gains back ~400 ns. Making
> > uclamp_rq_inc() noinline *too* cancels this gain out :-/
> >
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 0464569f26a7..0835ee20a3c7 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1071,13 +1071,11 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
> >
> >  static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> >  {
> > -       enum uclamp_id clamp_id;
> > -
> >         if (unlikely(!p->sched_class->uclamp_enabled))
> >                 return;
> >
> > -       for_each_clamp_id(clamp_id)
> > -               uclamp_rq_inc_id(rq, p, clamp_id);
> > +       uclamp_rq_inc_id(rq, p, UCLAMP_MIN);
> > +       uclamp_rq_inc_id(rq, p, UCLAMP_MAX);
> >
> >         /* Reset clamp idle holding when there is one RUNNABLE task */
> >         if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
> > @@ -1086,13 +1084,11 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> >
> >  static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
> >  {
> > -       enum uclamp_id clamp_id;
> > -
> >         if (unlikely(!p->sched_class->uclamp_enabled))
> >                 return;
> >
> > -       for_each_clamp_id(clamp_id)
> > -               uclamp_rq_dec_id(rq, p, clamp_id);
> > +       uclamp_rq_dec_id(rq, p, UCLAMP_MIN);
> > +       uclamp_rq_dec_id(rq, p, UCLAMP_MAX);
> >  }
> >
> >  static inline void
> >
> >
> > FWIW I fail to see activate/deactivate_task in perf record. They don't show up
> > on the list which means this micro benchmark doesn't stress them as Mel's test
> > does.
> 
> Strange because I have been able to trace them.

On your arm platform? I can certainly see them on x86.

Thanks

--
Qais Yousef

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-08 14:44                       ` Steven Rostedt
@ 2020-06-11 10:13                         ` Qais Yousef
  0 siblings, 0 replies; 68+ messages in thread
From: Qais Yousef @ 2020-06-11 10:13 UTC (permalink / raw)
  To: Steven Rostedt, Vincent Guittot
  Cc: Mel Gorman, Patrick Bellasi, Dietmar Eggemann, Peter Zijlstra,
	Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli,
	Ben Segall, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Pavan Kondeti, linux-doc,
	linux-kernel, linux-fsdevel

On 06/08/20 10:44, Steven Rostedt wrote:
> On Mon, 8 Jun 2020 13:31:03 +0100
> Qais Yousef <qais.yousef@arm.com> wrote:
> 
> > I admit I don't know how much of these numbers is ftrace overhead. When trying
> 
> Note, if you want to get a better idea of how long a function runs, put it
> into set_ftrace_filter, and then trace it. That way you remove the overhead
> of the function graph tracer when it's nesting within a function.

Thanks for the tip!
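For the record, this is roughly what I ended up doing (assuming tracefs is
mounted at /sys/kernel/tracing and CPU1 is the big core I pin to; adjust
paths, mask and loop count for your setup):

```shell
cd /sys/kernel/tracing
echo 0 > tracing_on
# Trace only these two functions, so the function graph tracer doesn't add
# per-nested-call overhead to the reported durations.
echo activate_task deactivate_task > set_ftrace_filter
echo function_graph > current_tracer
echo 1 > tracing_on
taskset 0x2 perf bench sched pipe -l 50000
echo 0 > tracing_on
head -n 30 trace
```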

With CONFIG_FAIR_GROUP_SCHED I see (uclamp disabled)


      sched-pipe-602   [001]    73.755392: funcgraph_entry:        2.080 us   |  activate_task();
      sched-pipe-602   [001]    73.755399: funcgraph_entry:        2.000 us   |  deactivate_task();
      sched-pipe-601   [001]    73.755407: funcgraph_entry:        2.220 us   |  activate_task();
      sched-pipe-601   [001]    73.755414: funcgraph_entry:        2.020 us   |  deactivate_task();
      sched-pipe-602   [001]    73.755422: funcgraph_entry:        2.160 us   |  activate_task();
      sched-pipe-602   [001]    73.755429: funcgraph_entry:        1.920 us   |  deactivate_task();
      sched-pipe-601   [001]    73.755437: funcgraph_entry:        2.260 us   |  activate_task();
      sched-pipe-601   [001]    73.755444: funcgraph_entry:        2.080 us   |  deactivate_task();
      sched-pipe-602   [001]    73.755452: funcgraph_entry:        2.160 us   |  activate_task();
      sched-pipe-602   [001]    73.755459: funcgraph_entry:        2.080 us   |  deactivate_task();
      sched-pipe-601   [001]    73.755468: funcgraph_entry:        2.200 us   |  activate_task();
      sched-pipe-601   [001]    73.755521: funcgraph_entry:        3.160 us   |  activate_task();

update_cfs_group() overhead

      sched-pipe-622   [001]   156.790478: funcgraph_entry:        0.820 us   |  update_cfs_group();
      sched-pipe-622   [001]   156.790483: funcgraph_entry:        0.840 us   |  update_cfs_group();
      sched-pipe-622   [001]   156.790485: funcgraph_entry:        0.820 us   |  update_cfs_group();
      sched-pipe-622   [001]   156.790487: funcgraph_entry:        0.820 us   |  update_cfs_group();
      sched-pipe-622   [001]   156.790488: funcgraph_entry:        0.800 us   |  update_cfs_group();
      sched-pipe-622   [001]   156.790508: funcgraph_entry:        1.040 us   |  update_cfs_group();
      sched-pipe-622   [001]   156.790510: funcgraph_entry:        0.920 us   |  update_cfs_group();
      sched-pipe-622   [001]   156.790511: funcgraph_entry:        1.040 us   |  update_cfs_group();
      sched-pipe-622   [001]   156.790513: funcgraph_entry:        0.840 us   |  update_cfs_group();
      sched-pipe-623   [001]   156.790540: funcgraph_entry:        1.160 us   |  update_cfs_group();
      sched-pipe-623   [001]   156.790543: funcgraph_entry:        1.020 us   |  update_cfs_group();
      sched-pipe-623   [001]   156.790544: funcgraph_entry:        0.880 us   |  update_cfs_group();
      sched-pipe-623   [001]   156.790546: funcgraph_entry:        0.840 us   |  update_cfs_group();
      sched-pipe-621   [001]   156.790905: funcgraph_entry:        1.780 us   |  update_cfs_group();
      sched-pipe-621   [001]   156.790908: funcgraph_entry:        1.060 us   |  update_cfs_group();
      sched-pipe-621   [001]   156.790910: funcgraph_entry:        0.880 us   |  update_cfs_group();
      sched-pipe-621   [001]   156.790912: funcgraph_entry:        0.880 us   |  update_cfs_group();
      sched-pipe-621   [001]   156.790916: funcgraph_entry:        0.800 us   |  update_cfs_group();
      sched-pipe-621   [001]   156.790917: funcgraph_entry:        0.820 us   |  update_cfs_group();
      sched-pipe-621   [001]   156.790919: funcgraph_entry:        0.840 us   |  update_cfs_group();
      sched-pipe-621   [001]   156.790921: funcgraph_entry:        0.880 us   |  update_cfs_group();
      sched-pipe-621   [001]   156.790932: funcgraph_entry:        0.960 us   |  update_cfs_group();
      sched-pipe-621   [001]   156.790934: funcgraph_entry:        0.960 us   |  update_cfs_group();
      sched-pipe-621   [001]   156.790936: funcgraph_entry:        1.080 us   |  update_cfs_group();
      sched-pipe-621   [001]   156.790937: funcgraph_entry:        0.840 us   |  update_cfs_group();

Without CONFIG_FAIR_GROUP_SCHED and without CONFIG_UCLAMP_TASK

      sched-pipe-604   [001]    76.386078: funcgraph_entry:        1.380 us   |  activate_task();
      sched-pipe-604   [001]    76.386084: funcgraph_entry:        1.360 us   |  deactivate_task();
      sched-pipe-605   [001]    76.386091: funcgraph_entry:        1.400 us   |  activate_task();
      sched-pipe-605   [001]    76.386096: funcgraph_entry:        1.260 us   |  deactivate_task();
      sched-pipe-604   [001]    76.386104: funcgraph_entry:        1.500 us   |  activate_task();
      sched-pipe-604   [001]    76.386109: funcgraph_entry:        1.280 us   |  deactivate_task();
      sched-pipe-605   [001]    76.386117: funcgraph_entry:        1.380 us   |  activate_task();
      sched-pipe-605   [001]    76.386122: funcgraph_entry:        1.300 us   |  deactivate_task();
      sched-pipe-604   [001]    76.386130: funcgraph_entry:        1.380 us   |  activate_task();
      sched-pipe-604   [001]    76.386135: funcgraph_entry:        1.260 us   |  deactivate_task();
      sched-pipe-605   [001]    76.386142: funcgraph_entry:        1.400 us   |  activate_task();
      sched-pipe-605   [001]    76.386148: funcgraph_entry:        1.340 us   |  deactivate_task();

So approximately 800ns are added by update_cfs_group() for enqueue and dequeue.
This overhead affects both tasks in the test, so the total effect on the
measured usecs/op is

	2 * enqueue_overhead + 2 * dequeue_overhead = 4 * ~800ns = 3.2 us

Which explains the 3us drop I see when fair group config is enabled.

Applying similar analysis to uclamp

With uclamp enabled

      sched-pipe-610   [001]   173.429431: funcgraph_entry:        1.580 us   |  activate_task();
      sched-pipe-610   [001]   173.429437: funcgraph_entry:        1.440 us   |  deactivate_task();
      sched-pipe-609   [001]   173.429444: funcgraph_entry:        1.580 us   |  activate_task();
      sched-pipe-609   [001]   173.429450: funcgraph_entry:        1.440 us   |  deactivate_task();
      sched-pipe-610   [001]   173.429458: funcgraph_entry:        1.700 us   |  activate_task();
      sched-pipe-610   [001]   173.429464: funcgraph_entry:        1.460 us   |  deactivate_task();
      sched-pipe-609   [001]   173.429471: funcgraph_entry:        1.540 us   |  activate_task();
      sched-pipe-609   [001]   173.429477: funcgraph_entry:        1.460 us   |  deactivate_task();
      sched-pipe-610   [001]   173.429485: funcgraph_entry:        1.560 us   |  activate_task();
      sched-pipe-610   [001]   173.429491: funcgraph_entry:        1.500 us   |  deactivate_task();
      sched-pipe-609   [001]   173.429498: funcgraph_entry:        1.600 us   |  activate_task();
      sched-pipe-609   [001]   173.429504: funcgraph_entry:        1.460 us   |  deactivate_task();

Which adds approximately 200ns at enqueue and dequeue.

	2 * enqueue_overhead + 2 * dequeue_overhead = 4 * ~200ns = 0.8 us

Which would explain the ~1us drop I've seen with uclamp when running the
sched-pipe benchmark. Apologies for the very coarse averaging of the numbers
from my side.
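As a sanity check, the arithmetic above can be put in a tiny script (the
per-call overheads are the rough averages I read off the traces, not precise
measurements):

```python
# Each sched-pipe op wakes one task and sleeps the other; both tasks are
# affected, so every op pays ~2 extra enqueues + 2 extra dequeues.

def per_op_overhead_us(extra_ns):
    """Extra usecs/op given the added cost (in ns) of one enqueue or dequeue."""
    return (2 * extra_ns + 2 * extra_ns) / 1000.0

fair_group_us = per_op_overhead_us(800)  # ~800ns from update_cfs_group()
uclamp_us = per_op_overhead_us(200)      # ~200ns from uclamp inc/dec

print(fair_group_us)  # 3.2 -> close to the 19.18 - 16.18 usecs/op delta
print(uclamp_us)      # 0.8 -> roughly the 17.13 - 16.18 usecs/op delta
```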

As a reminder the results I reported before:


*** uclamp disabled/fair group enabled ***

        # Executed 50000 pipe operations between two threads

             Total time: 0.958 [sec]

              19.177100 usecs/op
                  52145 ops/sec

*** uclamp disabled/fair group disabled ***

        # Executed 50000 pipe operations between two threads
             Total time: 0.808 [sec]

             16.176200 usecs/op
                 61819 ops/sec

*** uclamp enabled/fair group disabled ***

        # Executed 50000 pipe operations between two threads
             Total time: 0.856 [sec]

             17.125740 usecs/op
                 58391 ops/sec


Based on my observations with code shuffling, it seems a lot of this 200ns
comes from terrible I$ performance on the particular platform I am testing on.

When I run on an x86 machine, if I interpreted the perf annotation correctly,
I see D$ misses on accessing rq->uclamp_rq.bucket[] and p->uclamp[]. But I'll
share this result in a separate email in reply to Mel.
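The x86 side of that analysis was along these lines (flags from memory, so
double-check against your perf version; resolving kernel symbols needs a
matching vmlinux):

```shell
# Sample cache misses on the CPU the benchmark is pinned to.
perf record -e cache-misses -C 1 -- taskset 0x2 perf bench sched pipe -l 50000
# See where the misses land, then drill into the enqueue/dequeue path.
perf report --stdio
perf annotate --stdio activate_task
```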

Thanks

--
Qais Yousef

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-08 12:31                     ` Qais Yousef
  2020-06-08 13:06                       ` Valentin Schneider
  2020-06-08 14:44                       ` Steven Rostedt
@ 2020-06-09 17:10                       ` Vincent Guittot
  2020-06-11 10:24                         ` Qais Yousef
  2 siblings, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2020-06-09 17:10 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Mel Gorman, Patrick Bellasi, Dietmar Eggemann, Peter Zijlstra,
	Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli,
	Steven Rostedt, Ben Segall, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Quentin Perret, Valentin Schneider, Pavan Kondeti,
	linux-doc, linux-kernel, linux-fsdevel

On Mon, 8 Jun 2020 at 14:31, Qais Yousef <qais.yousef@arm.com> wrote:
>
> On 06/04/20 14:14, Vincent Guittot wrote:
>
> [...]
>
> > I have tried your patch and I don't see any difference compared to
> > previous tests. Let me give you more details of my setup:
> > I create 3 levels of cgroups and usually run the tests in the 4 levels
> > (which includes root). The results above are for the root level
> >
> > But I see a difference at other levels:
> >
> >                            root           level 1       level 2       level 3
> >
> > /w patch uclamp disable     50097         46615         43806         41078
> > tip uclamp enable           48706(-2.78%) 45583(-2.21%) 42851(-2.18%) 40313(-1.86%)
> > /w patch uclamp enable      48882(-2.43%) 45774(-1.80%) 43108(-1.59%) 40667(-1.00%)
> >
> > Whereas tip with uclamp stays around 2% behind tip without uclamp, the
> > diff of uclamp with your patch tends to decrease when we increase the
> > number of levels
>
> So I did try to dig more into this, but I think either it's not a good
> reproducer or what we're observing here are uArch-level latencies caused by
> the new code that seem to produce a bigger knock-on effect than they really
> are.
>
> First, CONFIG_FAIR_GROUP_SCHED is 'expensive', for some definition of
> expensive..

yes, enabling CONFIG_FAIR_GROUP_SCHED adds an overhead

>
> *** uclamp disabled/fair group enabled ***
>
>         # Executed 50000 pipe operations between two threads
>
>              Total time: 0.958 [sec]
>
>               19.177100 usecs/op
>                   52145 ops/sec
>
> *** uclamp disabled/fair group disabled ***
>
>         # Executed 50000 pipe operations between two threads
>              Total time: 0.808 [sec]
>
>              16.176200 usecs/op
>                  61819 ops/sec
>
> So there's a 15.6% drop in ops/sec when enabling this option. I think it's good
> to look at the absolute number of usecs/op; fair group adds around
> 3 usecs/op.
>
> I dropped FAIR_GROUP_SCHED from my config to eliminate this overhead and focus
> solely on uclamp overhead.

Have you checked that both tests run at the root level ?
Your function-graph log below shows several calls to
update_cfs_group() which means that your trace below has not been made
at root level but most probably at the 3rd level and I wonder if you
used the same setup for running the benchmark above. This could
explain such huge difference because I don't have such difference on
my platform but more around 2%

For uclamp disable/fair group enable/function graph enable:  47994 ops/sec
For uclamp disable/fair group disable/function graph enable: 49107 ops/sec

>
> With uclamp enabled but no fair group I get
>
> *** uclamp enabled/fair group disabled ***
>
>         # Executed 50000 pipe operations between two threads
>              Total time: 0.856 [sec]
>
>              17.125740 usecs/op
>                  58391 ops/sec
>
> The drop is 5.5% in ops/sec. Or 1 usecs/op.
>
> I don't know what's the expectation here. 1 us could be a lot, but I don't
> think we expect the new code to take more than a few hundred ns anyway. If you
> add potential caching effects, reaching 1 us wouldn't be that hard.
>
> Note that in my runs I chose performance governor and use `taskset 0x2` to

You might want to set 2 CPUs in your cpumask instead of 1 in order to
have 1 CPU for each thread

> force running on a big core to make sure the runs are repeatable.

I also use the performance governor but don't pin tasks because I use SMP.

>
> On Juno-r2 I managed to scrap most of the 1 us with the below patch. It seems
> there was weird branching behavior that affects the I$ in my case. It'd be good
> to try it out to see if it makes a difference for you.

The perf numbers are slightly worse on my setup:
For uclamp enable/fair group disable/function graph enable: 48413 ops/sec
with the patch below: 47804 ops/sec

>
> The I$ effect is my best educated guess. Perf doesn't catch this path and
> I couldn't convince it to look at cache and branch misses between 2 specific
> points.
>
> Other subtle code shuffling did have weird effect on the result too. One worthy
> one is making uclamp_rq_dec() noinline gains back ~400 ns. Making
> uclamp_rq_inc() noinline *too* cancels this gain out :-/
>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0464569f26a7..0835ee20a3c7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1071,13 +1071,11 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
>
>  static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>  {
> -       enum uclamp_id clamp_id;
> -
>         if (unlikely(!p->sched_class->uclamp_enabled))
>                 return;
>
> -       for_each_clamp_id(clamp_id)
> -               uclamp_rq_inc_id(rq, p, clamp_id);
> +       uclamp_rq_inc_id(rq, p, UCLAMP_MIN);
> +       uclamp_rq_inc_id(rq, p, UCLAMP_MAX);
>
>         /* Reset clamp idle holding when there is one RUNNABLE task */
>         if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
> @@ -1086,13 +1084,11 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>
>  static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
>  {
> -       enum uclamp_id clamp_id;
> -
>         if (unlikely(!p->sched_class->uclamp_enabled))
>                 return;
>
> -       for_each_clamp_id(clamp_id)
> -               uclamp_rq_dec_id(rq, p, clamp_id);
> +       uclamp_rq_dec_id(rq, p, UCLAMP_MIN);
> +       uclamp_rq_dec_id(rq, p, UCLAMP_MAX);
>  }
>
>  static inline void
>
>
> FWIW I fail to see activate/deactivate_task in perf record. They don't show up
> on the list which means this micro benchmark doesn't stress them as Mel's test
> does.

Strange because I have been able to trace them.

>
> Worth noting that I did try running the same test on 2 vCPU VirtualBox VM and
> 64 vCPU qemu and I couldn't spot a difference when uclamp was enabled/disabled
> in these 2 environments.
>
> >
> > Beside this, it's also interesting to notice the ~6% perf impact
> > between each level for the same image
>
> Beside my observation above, I captured this function_graph when
> FAIR_GROUP_SCHED is enabled. What I pasted below is a particularly bad
> deactivation, it's not always that costly.
>
> This run was recorded with uclamp disabled.
>
> I admit I don't know how much of these numbers is ftrace overhead. When trying
> to capture similar runs for uclamp, the numbers didn't add up compared to
> running the test without ftrace generating the graph. If juno is suffering from
> bad branching costs in this path, then I suspect ftrace will amplify this as
> AFAIU it'll cause extra jumps on entry and exit.
>
>
>
>       sched-pipe-6532  [001]  9407.276302: funcgraph_entry:                   |  deactivate_task() {
>       sched-pipe-6532  [001]  9407.276302: funcgraph_entry:                   |    dequeue_task_fair() {
>       sched-pipe-6532  [001]  9407.276303: funcgraph_entry:                   |      update_curr() {
>       sched-pipe-6532  [001]  9407.276304: funcgraph_entry:        0.780 us   |        update_min_vruntime();
>       sched-pipe-6532  [001]  9407.276306: funcgraph_entry:                   |        cpuacct_charge() {
>       sched-pipe-6532  [001]  9407.276306: funcgraph_entry:        0.820 us   |          __rcu_read_lock();
>       sched-pipe-6532  [001]  9407.276308: funcgraph_entry:        0.740 us   |          __rcu_read_unlock();
>       sched-pipe-6532  [001]  9407.276309: funcgraph_exit:         3.980 us   |        }
>       sched-pipe-6532  [001]  9407.276310: funcgraph_entry:        0.720 us   |        __rcu_read_lock();
>       sched-pipe-6532  [001]  9407.276312: funcgraph_entry:        0.720 us   |        __rcu_read_unlock();
>       sched-pipe-6532  [001]  9407.276313: funcgraph_exit:         9.840 us   |      }
>       sched-pipe-6532  [001]  9407.276314: funcgraph_entry:                   |      __update_load_avg_se() {
>       sched-pipe-6532  [001]  9407.276315: funcgraph_entry:        0.720 us   |        __accumulate_pelt_segments();
>       sched-pipe-6532  [001]  9407.276316: funcgraph_exit:         2.260 us   |      }
>       sched-pipe-6532  [001]  9407.276317: funcgraph_entry:                   |      __update_load_avg_cfs_rq() {
>       sched-pipe-6532  [001]  9407.276318: funcgraph_entry:        0.860 us   |        __accumulate_pelt_segments();
>       sched-pipe-6532  [001]  9407.276319: funcgraph_exit:         2.340 us   |      }
>       sched-pipe-6532  [001]  9407.276320: funcgraph_entry:        0.760 us   |      clear_buddies();
>       sched-pipe-6532  [001]  9407.276321: funcgraph_entry:        0.800 us   |      account_entity_dequeue();
>       sched-pipe-6532  [001]  9407.276323: funcgraph_entry:        0.720 us   |      update_cfs_group();
>       sched-pipe-6532  [001]  9407.276324: funcgraph_entry:        0.740 us   |      update_min_vruntime();
>       sched-pipe-6532  [001]  9407.276326: funcgraph_entry:        0.720 us   |      set_next_buddy();
>       sched-pipe-6532  [001]  9407.276327: funcgraph_entry:                   |      __update_load_avg_se() {
>       sched-pipe-6532  [001]  9407.276328: funcgraph_entry:        0.740 us   |        __accumulate_pelt_segments();
>       sched-pipe-6532  [001]  9407.276329: funcgraph_exit:         2.220 us   |      }
>       sched-pipe-6532  [001]  9407.276330: funcgraph_entry:                   |      __update_load_avg_cfs_rq() {
>       sched-pipe-6532  [001]  9407.276331: funcgraph_entry:        0.740 us   |        __accumulate_pelt_segments();
>       sched-pipe-6532  [001]  9407.276332: funcgraph_exit:         2.180 us   |      }
>       sched-pipe-6532  [001]  9407.276333: funcgraph_entry:                   |      update_cfs_group() {
>       sched-pipe-6532  [001]  9407.276334: funcgraph_entry:                   |        reweight_entity() {
>       sched-pipe-6532  [001]  9407.276335: funcgraph_entry:                   |          update_curr() {
>       sched-pipe-6532  [001]  9407.276335: funcgraph_entry:        0.720 us   |            __calc_delta();
>       sched-pipe-6532  [001]  9407.276337: funcgraph_entry:        0.740 us   |            update_min_vruntime();
>       sched-pipe-6532  [001]  9407.276338: funcgraph_exit:         3.560 us   |          }
>       sched-pipe-6532  [001]  9407.276339: funcgraph_entry:        0.720 us   |          account_entity_dequeue();
>       sched-pipe-6532  [001]  9407.276340: funcgraph_entry:        0.720 us   |          account_entity_enqueue();
>       sched-pipe-6532  [001]  9407.276342: funcgraph_exit:         7.860 us   |        }
>       sched-pipe-6532  [001]  9407.276342: funcgraph_exit:         9.280 us   |      }
>       sched-pipe-6532  [001]  9407.276343: funcgraph_entry:                   |      __update_load_avg_se() {
>       sched-pipe-6532  [001]  9407.276344: funcgraph_entry:        0.720 us   |        __accumulate_pelt_segments();
>       sched-pipe-6532  [001]  9407.276345: funcgraph_exit:         2.180 us   |      }
>       sched-pipe-6532  [001]  9407.276346: funcgraph_entry:                   |      __update_load_avg_cfs_rq() {
>       sched-pipe-6532  [001]  9407.276347: funcgraph_entry:        0.740 us   |        __accumulate_pelt_segments();
>       sched-pipe-6532  [001]  9407.276348: funcgraph_exit:         2.180 us   |      }
>       sched-pipe-6532  [001]  9407.276349: funcgraph_entry:                   |      update_cfs_group() {
>       sched-pipe-6532  [001]  9407.276350: funcgraph_entry:                   |        reweight_entity() {
>       sched-pipe-6532  [001]  9407.276350: funcgraph_entry:                   |          update_curr() {
>       sched-pipe-6532  [001]  9407.276351: funcgraph_entry:        0.740 us   |            __calc_delta();
>       sched-pipe-6532  [001]  9407.276353: funcgraph_entry:        0.720 us   |            update_min_vruntime();
>       sched-pipe-6532  [001]  9407.276354: funcgraph_exit:         3.580 us   |          }
>       sched-pipe-6532  [001]  9407.276355: funcgraph_entry:        0.740 us   |          account_entity_dequeue();
>       sched-pipe-6532  [001]  9407.276356: funcgraph_entry:        0.720 us   |          account_entity_enqueue();
>       sched-pipe-6532  [001]  9407.276358: funcgraph_exit:         7.960 us   |        }
>       sched-pipe-6532  [001]  9407.276358: funcgraph_exit:         9.400 us   |      }
>       sched-pipe-6532  [001]  9407.276360: funcgraph_entry:                   |      __update_load_avg_se() {
>       sched-pipe-6532  [001]  9407.276360: funcgraph_entry:        0.740 us   |        __accumulate_pelt_segments();
>       sched-pipe-6532  [001]  9407.276362: funcgraph_exit:         2.220 us   |      }
>       sched-pipe-6532  [001]  9407.276362: funcgraph_entry:                   |      __update_load_avg_cfs_rq() {
>       sched-pipe-6532  [001]  9407.276363: funcgraph_entry:        0.740 us   |        __accumulate_pelt_segments();
>       sched-pipe-6532  [001]  9407.276365: funcgraph_exit:         2.160 us   |      }
>       sched-pipe-6532  [001]  9407.276366: funcgraph_entry:                   |      update_cfs_group() {
>       sched-pipe-6532  [001]  9407.276367: funcgraph_entry:                   |        reweight_entity() {
>       sched-pipe-6532  [001]  9407.276368: funcgraph_entry:                   |          update_curr() {
>       sched-pipe-6532  [001]  9407.276368: funcgraph_entry:        0.720 us   |            __calc_delta();
>       sched-pipe-6532  [001]  9407.276370: funcgraph_entry:        0.720 us   |            update_min_vruntime();
>       sched-pipe-6532  [001]  9407.276371: funcgraph_exit:         3.540 us   |          }
>       sched-pipe-6532  [001]  9407.276372: funcgraph_entry:        0.740 us   |          account_entity_dequeue();
>       sched-pipe-6532  [001]  9407.276373: funcgraph_entry:        0.720 us   |          account_entity_enqueue();
>       sched-pipe-6532  [001]  9407.276375: funcgraph_exit:         7.840 us   |        }
>       sched-pipe-6532  [001]  9407.276375: funcgraph_exit:         9.300 us   |      }
>       sched-pipe-6532  [001]  9407.276376: funcgraph_entry:        0.720 us   |      hrtick_update();
>       sched-pipe-6532  [001]  9407.276377: funcgraph_exit:       + 75.000 us  |    }
>       sched-pipe-6532  [001]  9407.276378: funcgraph_exit:       + 76.700 us  |  }
>
>
> Cheers
>
> --
> Qais Yousef

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-05 10:45                     ` Qais Yousef
@ 2020-06-09 15:29                       ` Vincent Guittot
  0 siblings, 0 replies; 68+ messages in thread
From: Vincent Guittot @ 2020-06-09 15:29 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Mel Gorman, Patrick Bellasi, Dietmar Eggemann, Peter Zijlstra,
	Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli,
	Steven Rostedt, Ben Segall, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Quentin Perret, Valentin Schneider, Pavan Kondeti,
	linux-doc, linux-kernel, linux-fsdevel

Hi Qais,

Sorry for the late reply.

On Fri, 5 Jun 2020 at 12:45, Qais Yousef <qais.yousef@arm.com> wrote:
>
> On 06/04/20 14:14, Vincent Guittot wrote:
> > I have tried your patch and I don't see any difference compared to
> > previous tests. Let me give you more details of my setup:
> > I create 3 levels of cgroups and usually run the tests in the 4 levels
> > (which includes root). The results above are for the root level
> >
> > But I see a difference at other levels:
> >
> >                            root           level 1       level 2       level 3
> >
> > /w patch uclamp disable     50097         46615         43806         41078
> > tip uclamp enable           48706(-2.78%) 45583(-2.21%) 42851(-2.18%) 40313(-1.86%)
> > /w patch uclamp enable      48882(-2.43%) 45774(-1.80%) 43108(-1.59%) 40667(-1.00%)
> >
> > Whereas tip with uclamp stays around 2% behind tip without uclamp, the
> > diff of uclamp with your patch tends to decrease when we increase the
> > number of levels
>
> Thanks for the extra info. Let me try this.
>
> If you can run perf and verify that you see activate/deactivate_task showing up
> as overhead I'd appreciate it. Just to confirm that indeed what we're seeing
> here are symptoms of the same problem Mel is seeing.

I see a call to activate_task() for each wakeup of the sched-pipe thread

>
> > Beside this, it's also interesting to notice the ~6% perf impact
> > between each level for the same image
>
> Interesting indeed.
>
> Thanks
>
> --
> Qais Yousef

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-08 12:31                     ` Qais Yousef
  2020-06-08 13:06                       ` Valentin Schneider
@ 2020-06-08 14:44                       ` Steven Rostedt
  2020-06-11 10:13                         ` Qais Yousef
  2020-06-09 17:10                       ` Vincent Guittot
  2 siblings, 1 reply; 68+ messages in thread
From: Steven Rostedt @ 2020-06-08 14:44 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Vincent Guittot, Mel Gorman, Patrick Bellasi, Dietmar Eggemann,
	Peter Zijlstra, Ingo Molnar, Randy Dunlap, Jonathan Corbet,
	Juri Lelli, Ben Segall, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Quentin Perret, Valentin Schneider, Pavan Kondeti,
	linux-doc, linux-kernel, linux-fs

On Mon, 8 Jun 2020 13:31:03 +0100
Qais Yousef <qais.yousef@arm.com> wrote:

> I admit I don't know how much of these numbers is ftrace overhead. When trying

Note, if you want to get a better idea of how long a function runs, put it
into set_ftrace_filter, and then trace it. That way you remove the overhead
of the function graph tracer when it's nesting within a function.

> to capture similar runs for uclamp, the numbers didn't add up compared to
> running the test without ftrace generating the graph. If juno is suffering from
> bad branching costs in this path, then I suspect ftrace will amplify this as
> AFAIU it'll cause extra jumps on entry and exit.
> 
> 
> 
>       sched-pipe-6532  [001]  9407.276302: funcgraph_entry:                   |  deactivate_task() {
>       sched-pipe-6532  [001]  9407.276302: funcgraph_entry:                   |    dequeue_task_fair() {
>       sched-pipe-6532  [001]  9407.276303: funcgraph_entry:                   |      update_curr() {
>       sched-pipe-6532  [001]  9407.276304: funcgraph_entry:        0.780 us   |        update_min_vruntime();
>       sched-pipe-6532  [001]  9407.276306: funcgraph_entry:                   |        cpuacct_charge() {
>       sched-pipe-6532  [001]  9407.276306: funcgraph_entry:        0.820 us   |          __rcu_read_lock();
>       sched-pipe-6532  [001]  9407.276308: funcgraph_entry:        0.740 us   |          __rcu_read_unlock();

The above is more accurate than...

>       sched-pipe-6532  [001]  9407.276309: funcgraph_exit:         3.980 us   |        }

this one. Because this one has nested tracing within it.

-- Steve


>       sched-pipe-6532  [001]  9407.276310: funcgraph_entry:        0.720 us   |        __rcu_read_lock();
>       sched-pipe-6532  [001]  9407.276312: funcgraph_entry:        0.720 us   |        __rcu_read_unlock();
>       sched-pipe-6532  [001]  9407.276313: funcgraph_exit:         9.840 us   |      }
>       sched-pipe-6532  [001]  9407.276314: funcgraph_entry:                   |      __update_load_avg_se() {
>       sched-pipe-6532  [001]  9407.276315: funcgraph_entry:        0.720 us   |        __accumulate_pelt_segments();
>       sched-pipe-6532  [001]  9407.276316: funcgraph_exit:         2.260 us   |      }
>       sched-pipe-6532  [001]  9407.276317: funcgraph_entry:                   |      __update_load_avg_cfs_rq() {
>       sched-pipe-6532  [001]  9407.276318: funcgraph_entry:        0.860 us   |        __accumulate_pelt_segments();
>       sched-pipe-6532  [001]  9407.276319: funcgraph_exit:         2.340 us   |      }

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-08 12:31                     ` Qais Yousef
@ 2020-06-08 13:06                       ` Valentin Schneider
  2020-06-08 14:44                       ` Steven Rostedt
  2020-06-09 17:10                       ` Vincent Guittot
  2 siblings, 0 replies; 68+ messages in thread
From: Valentin Schneider @ 2020-06-08 13:06 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Vincent Guittot, Mel Gorman, Patrick Bellasi, Dietmar Eggemann,
	Peter Zijlstra, Ingo Molnar, Randy Dunlap, Jonathan Corbet,
	Juri Lelli, Steven Rostedt, Ben Segall, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Quentin Perret, Pavan Kondeti,
	linux-doc, linux-kernel, linux-fs


On 08/06/20 13:31, Qais Yousef wrote:
> With uclamp enabled but no fair group I get
>
> *** uclamp enabled/fair group disabled ***
>
>       # Executed 50000 pipe operations between two threads
>            Total time: 0.856 [sec]
>
>            17.125740 usecs/op
>                58391 ops/sec
>
> The drop is 5.5% in ops/sec. Or 1 usecs/op.
>
> I don't know what's the expectation here. 1 us could be a lot, but I don't
> think we expect the new code to take more than few 100s of ns anyway. If you
> add potential caching effects, reaching 1 us wouldn't be that hard.
>

I don't think it's fair to look at the absolute delta. This being a very
hot path, cumulative overhead gets scary real quick. A 5.5% drop in work
done is more than an hour lost over a full day of processing.

> Note that in my runs I chose performance governor and use `taskset 0x2` to
> force running on a big core to make sure the runs are repeatable.
>
> On Juno-r2 I managed to scrap most of the 1 us with the below patch. It seems
> there was weird branching behavior that affects the I$ in my case. It'd be good
> to try it out to see if it makes a difference for you.
>
> The I$ effect is my best educated guess. Perf doesn't catch this path and
> I couldn't convince it to look at cache and branch misses between 2 specific
> points.
>
> Other subtle code shuffling did have weird effect on the result too. One worthy
> one is making uclamp_rq_dec() noinline gains back ~400 ns. Making
> uclamp_rq_inc() noinline *too* cancels this gain out :-/
>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0464569f26a7..0835ee20a3c7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1071,13 +1071,11 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
>
>  static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>  {
> -	enum uclamp_id clamp_id;
> -
>       if (unlikely(!p->sched_class->uclamp_enabled))
>               return;
>
> -	for_each_clamp_id(clamp_id)
> -		uclamp_rq_inc_id(rq, p, clamp_id);
> +	uclamp_rq_inc_id(rq, p, UCLAMP_MIN);
> +	uclamp_rq_inc_id(rq, p, UCLAMP_MAX);
>
>       /* Reset clamp idle holding when there is one RUNNABLE task */
>       if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
> @@ -1086,13 +1084,11 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>
>  static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
>  {
> -	enum uclamp_id clamp_id;
> -
>       if (unlikely(!p->sched_class->uclamp_enabled))
>               return;
>
> -	for_each_clamp_id(clamp_id)
> -		uclamp_rq_dec_id(rq, p, clamp_id);
> +	uclamp_rq_dec_id(rq, p, UCLAMP_MIN);
> +	uclamp_rq_dec_id(rq, p, UCLAMP_MAX);
>  }
>

That's... Surprising. Did you look at the difference in generated code?

>  static inline void
>
>
> FWIW I fail to see activate/deactivate_task in perf record. They don't show up
> on the list which means this micro benchmark doesn't stress them as Mel's test
> does.
>

You're not going to see them in perf on the Juno. They're in IRQ
disabled sections, so AFAICT it won't get sampled as you don't have
NMIs. You can turn on ARM64_PSEUDO_NMI, but you'll need a GICv3 (Ampere
eMAG, Cavium ThunderX2).

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-04 12:14                   ` Vincent Guittot
  2020-06-05 10:45                     ` Qais Yousef
@ 2020-06-08 12:31                     ` Qais Yousef
  2020-06-08 13:06                       ` Valentin Schneider
                                         ` (2 more replies)
  1 sibling, 3 replies; 68+ messages in thread
From: Qais Yousef @ 2020-06-08 12:31 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mel Gorman, Patrick Bellasi, Dietmar Eggemann, Peter Zijlstra,
	Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli,
	Steven Rostedt, Ben Segall, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Quentin Perret, Valentin Schneider, Pavan Kondeti,
	linux-doc, linux-kernel, linux-fs

On 06/04/20 14:14, Vincent Guittot wrote:

[...]

> I have tried your patch and I don't see any difference compared to
> previous tests. Let me give you more details of my setup:
> I create 3 levels of cgroups and usually run the tests in the 4 levels
> (which includes root). The result above are for the root level
> 
> But I see a difference at other levels:
> 
>                            root           level 1       level 2       level 3
> 
> /w patch uclamp disable     50097         46615         43806         41078
> tip uclamp enable           48706(-2.78%) 45583(-2.21%) 42851(-2.18%) 40313(-1.86%)
> /w patch uclamp enable      48882(-2.43%) 45774(-1.80%) 43108(-1.59%) 40667(-1.00%)
> 
> Whereas tip with uclamp stays around 2% behind tip without uclamp, the
> diff of uclamp with your patch tends to decrease when we increase the
> number of level

So I did try to dig more into this, but I think either it's not a good
reproducer, or what we're observing here are uArch-level latencies caused by
the new code that produce a bigger knock-on effect than their raw cost would
suggest.

First, CONFIG_FAIR_GROUP_SCHED is 'expensive', for some definition of
expensive...

*** uclamp disabled/fair group enabled ***

	# Executed 50000 pipe operations between two threads

	     Total time: 0.958 [sec]

	      19.177100 usecs/op
		  52145 ops/sec

*** uclamp disabled/fair group disabled ***

	# Executed 50000 pipe operations between two threads
	     Total time: 0.808 [sec]

	     16.176200 usecs/op
		 61819 ops/sec

So there's a 15.6% drop in ops/sec when enabling this option. I think it's good
to look at the absolute number of usecs/op too; fair group adds around
3 usecs/op.

I dropped FAIR_GROUP_SCHED from my config to eliminate this overhead and focus
solely on the uclamp overhead.

With uclamp enabled but no fair group I get

*** uclamp enabled/fair group disabled ***

	# Executed 50000 pipe operations between two threads
	     Total time: 0.856 [sec]

	     17.125740 usecs/op
		 58391 ops/sec

The drop is 5.5% in ops/sec. Or 1 usecs/op.

I don't know what the expectation here is. 1 us could be a lot, but I don't
think we expect the new code to take more than a few 100s of ns anyway. If you
add potential caching effects, reaching 1 us wouldn't be that hard.

Note that in my runs I chose the performance governor and used `taskset 0x2` to
force running on a big core, to make sure the runs are repeatable.

On Juno-r2 I managed to claw back most of the 1 us with the below patch. It seems
there was weird branching behavior that affects the I$ in my case. It'd be good
to try it out to see if it makes a difference for you.

The I$ effect is my best educated guess. Perf doesn't catch this path and
I couldn't convince it to look at cache and branch misses between 2 specific
points.

Other subtle code shuffling did have weird effect on the result too. One worthy
one is making uclamp_rq_dec() noinline gains back ~400 ns. Making
uclamp_rq_inc() noinline *too* cancels this gain out :-/


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0464569f26a7..0835ee20a3c7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1071,13 +1071,11 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
 
 static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 {
-	enum uclamp_id clamp_id;
-
 	if (unlikely(!p->sched_class->uclamp_enabled))
 		return;
 
-	for_each_clamp_id(clamp_id)
-		uclamp_rq_inc_id(rq, p, clamp_id);
+	uclamp_rq_inc_id(rq, p, UCLAMP_MIN);
+	uclamp_rq_inc_id(rq, p, UCLAMP_MAX);
 
 	/* Reset clamp idle holding when there is one RUNNABLE task */
 	if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
@@ -1086,13 +1084,11 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 
 static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
 {
-	enum uclamp_id clamp_id;
-
 	if (unlikely(!p->sched_class->uclamp_enabled))
 		return;
 
-	for_each_clamp_id(clamp_id)
-		uclamp_rq_dec_id(rq, p, clamp_id);
+	uclamp_rq_dec_id(rq, p, UCLAMP_MIN);
+	uclamp_rq_dec_id(rq, p, UCLAMP_MAX);
 }
 
 static inline void


FWIW I fail to see activate/deactivate_task in perf record. They don't show up
on the list which means this micro benchmark doesn't stress them as Mel's test
does.

Worth noting that I did try running the same test on 2 vCPU VirtualBox VM and
64 vCPU qemu and I couldn't spot a difference when uclamp was enabled/disabled
in these 2 environments.

> 
> Beside this, that's also interesting to notice the ~6% of perf impact
> between each level for the same image

Beside my observation above, I captured this function_graph when
FAIR_GROUP_SCHED is enabled. What I pasted below is a particularly bad
deactivation; it's not always that costly.

This run was recorded with uclamp disabled.

I admit I don't know how much of these numbers is ftrace overhead. When trying
to capture similar runs for uclamp, the numbers didn't add up compared to
running the test without ftrace generating the graph. If Juno is suffering from
bad branching costs in this path, then I suspect ftrace will amplify this as
AFAIU it'll cause extra jumps on entry and exit.



      sched-pipe-6532  [001]  9407.276302: funcgraph_entry:                   |  deactivate_task() {
      sched-pipe-6532  [001]  9407.276302: funcgraph_entry:                   |    dequeue_task_fair() {
      sched-pipe-6532  [001]  9407.276303: funcgraph_entry:                   |      update_curr() {
      sched-pipe-6532  [001]  9407.276304: funcgraph_entry:        0.780 us   |        update_min_vruntime();
      sched-pipe-6532  [001]  9407.276306: funcgraph_entry:                   |        cpuacct_charge() {
      sched-pipe-6532  [001]  9407.276306: funcgraph_entry:        0.820 us   |          __rcu_read_lock();
      sched-pipe-6532  [001]  9407.276308: funcgraph_entry:        0.740 us   |          __rcu_read_unlock();
      sched-pipe-6532  [001]  9407.276309: funcgraph_exit:         3.980 us   |        }
      sched-pipe-6532  [001]  9407.276310: funcgraph_entry:        0.720 us   |        __rcu_read_lock();
      sched-pipe-6532  [001]  9407.276312: funcgraph_entry:        0.720 us   |        __rcu_read_unlock();
      sched-pipe-6532  [001]  9407.276313: funcgraph_exit:         9.840 us   |      }
      sched-pipe-6532  [001]  9407.276314: funcgraph_entry:                   |      __update_load_avg_se() {
      sched-pipe-6532  [001]  9407.276315: funcgraph_entry:        0.720 us   |        __accumulate_pelt_segments();
      sched-pipe-6532  [001]  9407.276316: funcgraph_exit:         2.260 us   |      }
      sched-pipe-6532  [001]  9407.276317: funcgraph_entry:                   |      __update_load_avg_cfs_rq() {
      sched-pipe-6532  [001]  9407.276318: funcgraph_entry:        0.860 us   |        __accumulate_pelt_segments();
      sched-pipe-6532  [001]  9407.276319: funcgraph_exit:         2.340 us   |      }
      sched-pipe-6532  [001]  9407.276320: funcgraph_entry:        0.760 us   |      clear_buddies();
      sched-pipe-6532  [001]  9407.276321: funcgraph_entry:        0.800 us   |      account_entity_dequeue();
      sched-pipe-6532  [001]  9407.276323: funcgraph_entry:        0.720 us   |      update_cfs_group();
      sched-pipe-6532  [001]  9407.276324: funcgraph_entry:        0.740 us   |      update_min_vruntime();
      sched-pipe-6532  [001]  9407.276326: funcgraph_entry:        0.720 us   |      set_next_buddy();
      sched-pipe-6532  [001]  9407.276327: funcgraph_entry:                   |      __update_load_avg_se() {
      sched-pipe-6532  [001]  9407.276328: funcgraph_entry:        0.740 us   |        __accumulate_pelt_segments();
      sched-pipe-6532  [001]  9407.276329: funcgraph_exit:         2.220 us   |      }
      sched-pipe-6532  [001]  9407.276330: funcgraph_entry:                   |      __update_load_avg_cfs_rq() {
      sched-pipe-6532  [001]  9407.276331: funcgraph_entry:        0.740 us   |        __accumulate_pelt_segments();
      sched-pipe-6532  [001]  9407.276332: funcgraph_exit:         2.180 us   |      }
      sched-pipe-6532  [001]  9407.276333: funcgraph_entry:                   |      update_cfs_group() {
      sched-pipe-6532  [001]  9407.276334: funcgraph_entry:                   |        reweight_entity() {
      sched-pipe-6532  [001]  9407.276335: funcgraph_entry:                   |          update_curr() {
      sched-pipe-6532  [001]  9407.276335: funcgraph_entry:        0.720 us   |            __calc_delta();
      sched-pipe-6532  [001]  9407.276337: funcgraph_entry:        0.740 us   |            update_min_vruntime();
      sched-pipe-6532  [001]  9407.276338: funcgraph_exit:         3.560 us   |          }
      sched-pipe-6532  [001]  9407.276339: funcgraph_entry:        0.720 us   |          account_entity_dequeue();
      sched-pipe-6532  [001]  9407.276340: funcgraph_entry:        0.720 us   |          account_entity_enqueue();
      sched-pipe-6532  [001]  9407.276342: funcgraph_exit:         7.860 us   |        }
      sched-pipe-6532  [001]  9407.276342: funcgraph_exit:         9.280 us   |      }
      sched-pipe-6532  [001]  9407.276343: funcgraph_entry:                   |      __update_load_avg_se() {
      sched-pipe-6532  [001]  9407.276344: funcgraph_entry:        0.720 us   |        __accumulate_pelt_segments();
      sched-pipe-6532  [001]  9407.276345: funcgraph_exit:         2.180 us   |      }
      sched-pipe-6532  [001]  9407.276346: funcgraph_entry:                   |      __update_load_avg_cfs_rq() {
      sched-pipe-6532  [001]  9407.276347: funcgraph_entry:        0.740 us   |        __accumulate_pelt_segments();
      sched-pipe-6532  [001]  9407.276348: funcgraph_exit:         2.180 us   |      }
      sched-pipe-6532  [001]  9407.276349: funcgraph_entry:                   |      update_cfs_group() {
      sched-pipe-6532  [001]  9407.276350: funcgraph_entry:                   |        reweight_entity() {
      sched-pipe-6532  [001]  9407.276350: funcgraph_entry:                   |          update_curr() {
      sched-pipe-6532  [001]  9407.276351: funcgraph_entry:        0.740 us   |            __calc_delta();
      sched-pipe-6532  [001]  9407.276353: funcgraph_entry:        0.720 us   |            update_min_vruntime();
      sched-pipe-6532  [001]  9407.276354: funcgraph_exit:         3.580 us   |          }
      sched-pipe-6532  [001]  9407.276355: funcgraph_entry:        0.740 us   |          account_entity_dequeue();
      sched-pipe-6532  [001]  9407.276356: funcgraph_entry:        0.720 us   |          account_entity_enqueue();
      sched-pipe-6532  [001]  9407.276358: funcgraph_exit:         7.960 us   |        }
      sched-pipe-6532  [001]  9407.276358: funcgraph_exit:         9.400 us   |      }
      sched-pipe-6532  [001]  9407.276360: funcgraph_entry:                   |      __update_load_avg_se() {
      sched-pipe-6532  [001]  9407.276360: funcgraph_entry:        0.740 us   |        __accumulate_pelt_segments();
      sched-pipe-6532  [001]  9407.276362: funcgraph_exit:         2.220 us   |      }
      sched-pipe-6532  [001]  9407.276362: funcgraph_entry:                   |      __update_load_avg_cfs_rq() {
      sched-pipe-6532  [001]  9407.276363: funcgraph_entry:        0.740 us   |        __accumulate_pelt_segments();
      sched-pipe-6532  [001]  9407.276365: funcgraph_exit:         2.160 us   |      }
      sched-pipe-6532  [001]  9407.276366: funcgraph_entry:                   |      update_cfs_group() {
      sched-pipe-6532  [001]  9407.276367: funcgraph_entry:                   |        reweight_entity() {
      sched-pipe-6532  [001]  9407.276368: funcgraph_entry:                   |          update_curr() {
      sched-pipe-6532  [001]  9407.276368: funcgraph_entry:        0.720 us   |            __calc_delta();
      sched-pipe-6532  [001]  9407.276370: funcgraph_entry:        0.720 us   |            update_min_vruntime();
      sched-pipe-6532  [001]  9407.276371: funcgraph_exit:         3.540 us   |          }
      sched-pipe-6532  [001]  9407.276372: funcgraph_entry:        0.740 us   |          account_entity_dequeue();
      sched-pipe-6532  [001]  9407.276373: funcgraph_entry:        0.720 us   |          account_entity_enqueue();
      sched-pipe-6532  [001]  9407.276375: funcgraph_exit:         7.840 us   |        }
      sched-pipe-6532  [001]  9407.276375: funcgraph_exit:         9.300 us   |      }
      sched-pipe-6532  [001]  9407.276376: funcgraph_entry:        0.720 us   |      hrtick_update();
      sched-pipe-6532  [001]  9407.276377: funcgraph_exit:       + 75.000 us  |    }
      sched-pipe-6532  [001]  9407.276378: funcgraph_exit:       + 76.700 us  |  }


Cheers

--
Qais Yousef

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-05 11:32                     ` Qais Yousef
@ 2020-06-05 13:27                       ` Patrick Bellasi
  0 siblings, 0 replies; 68+ messages in thread
From: Patrick Bellasi @ 2020-06-05 13:27 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Vincent Guittot, Mel Gorman, Dietmar Eggemann, Peter Zijlstra,
	Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli,
	Steven Rostedt, Ben Segall, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Quentin Perret, Valentin Schneider, Pavan Kondeti,
	linux-doc, linux-kernel, linux-fs


On Fri, Jun 05, 2020 at 13:32:04 +0200, Qais Yousef <qais.yousef@arm.com> wrote...

> On 06/05/20 09:55, Patrick Bellasi wrote:
>> On Wed, Jun 03, 2020 at 18:52:00 +0200, Qais Yousef <qais.yousef@arm.com> wrote...

[...]

>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index 0464569f26a7..9f48090eb926 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -1063,10 +1063,12 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
>> >          * e.g. due to future modification, warn and fixup the expected value.
>> >          */
>> >         SCHED_WARN_ON(bucket->value > rq_clamp);
>> > +#if 0
>> >         if (bucket->value >= rq_clamp) {
>> >                 bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
>> >                 WRITE_ONCE(uc_rq->value, bkt_clamp);
>> >         }
>> > +#endif
>> 
>> Yep, that's likely where we have most of the overhead at dequeue time,
>> sine _sometimes_ we need to update the cpu's clamp value.
>> 
>> However, while running perf sched pipe, I expect:
>>  - all tasks to have the same clamp value
>>  - all CPUs to have _always_ at least one RUNNABLE task
>> 
>> Given these two conditions above, if the CPU is never "CFS idle" (i.e.
>> without RUNNABLE CFS tasks), the code above should never be triggered.
>> More on that later...
>
> So the cost is only incurred by idle cpus is what you're saying.

Not really, you pay the cost every time you need to reduce the CPU clamp
value. This can happen also on a busy CPU but only when you dequeue the
last task defining the current uclamp(cpu) value and the remaining
RUNNABLE tasks have a lower value.

>> >  }
>> >
>> >  static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>> >
>> >
>> >
>> > uclamp_rq_max_value() could be expensive as it loops over all buckets.
>> 
>> It loops over UCLAMP_CNT values which are defined to fit into a single
>
> I think you meant to say UCLAMP_BUCKETS which is defined 5 by default.

Right, UCLAMP_BUCKETS.

>> $L. That was the optimal space/time complexity compromise we found to
>> get the MAX of a set of values.
>
> It actually covers two cachelines, see below and my other email to
> Mel.

The two cache lines are covered if you consider both min and max clamps.
One single CLAMP_ID has a _size_ which fits into a single cache line.

However, to be precise:
- while uclamp_min spans a single cache line, uclamp_max is likely
  across two
- at enqueue/dequeue time we update both min/max, thus we can touch
  both cache lines

>> > Commenting this whole path out strangely doesn't just 'fix' it,
>> > but produces  better results to no-uclamp kernel :-/
>> >
>> > # ./perf bench -r 20 sched pipe -T -l 50000
>> > Without uclamp:		5039
>> > With uclamp:		4832
>> > With uclamp+patch:	5729
>> 
>> I explain it below: with that code removed you never decrease the CPU's
>> uclamp value. Thus, the first time you schedule an RT task you go to MAX
>> OPP and stay there forever.
>
> Okay.
>
>> 
>> > It might be because schedutil gets biased differently by uclamp..? If I move to
>> > performance governor these numbers almost double.
>> >
>> > I don't know. But this promoted me to look closer and
>> 
>> Just to resume, when a task is dequeued we can have only these cases:
>> 
>> - uclamp(task) < uclamp(cpu):
>>   this happens when the task was co-scheduled with other tasks with
>>   higher clamp values which are still RUNNABLE.
>>   In this case there are no uclamp(cpu) updates.
>> 
>> - uclamp(task) == uclamp(cpu):
>>   this happens when the task was one of the tasks defining the current
>>   uclamp(cpu) value, which is defined to track the MAX of the RUNNABLE
>>   tasks clamp values.
>> 
>> In this last case we _not_ always need to do a uclamp(cpu) update.
>> Indeed the update is required _only_ when that task was _the last_ task
>> defining the current uclamp(cpu) value.
>> 
>> In this case we use uclamp_rq_max_value() to do a linear scan of
>> UCLAMP_CNT values which fits into a single cache line.
>
> Again, I think you mean UCLAMP_BUCKETS here. Unless I missed something, they
> span 2 cachelines on 64-bit machines with a 64-byte cacheline size.

Correct:
- s/UCLAMP_CNT/UCLAMP_BUCKETS/
- 1 cacheline per CLAMP_ID
- the array scan works on 1 CLAMP_ID:
  - spanning 1 cache line for uclamp_min
  - spanning 2 cache lines for uclamp_max


> To be specific, I am referring to struct uclamp_rq, which defines an array of
> size UCLAMP_BUCKETS of type struct uclamp_bucket.
>
> uclamp_rq_max_value() scans the buckets for a given clamp_id (UCLAMP_MIN or
> UCLAMP_MAX).
>
> So sizeof(struct uclamp_rq) = 8 * 5 + 4 = 44; on 64bit machines.
>
> And actually the compiler introduces a 4 bytes hole, so we end up with a total
> of 48 bytes.
>
> In struct rq, we define struct uclamp_rq as an array of UCLAMP_CNT which is 2.
>
> So by default we have 2 * sizeof(struct uclamp_rq) = 96 bytes.

Right, here is the layout we get on x86 (with some context before/after):

---8<---
        /* XXX 4 bytes hole, try to pack */

        long unsigned int          nr_load_updates;      /*    32     8 */
        u64                        nr_switches;          /*    40     8 */

        /* XXX 16 bytes hole, try to pack */

        /* --- cacheline 1 boundary (64 bytes) --- */
        struct uclamp_rq           uclamp[2];            /*    64    96 */
        /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
        unsigned int               uclamp_flags;         /*   160     4 */

        /* XXX 28 bytes hole, try to pack */

        /* --- cacheline 3 boundary (192 bytes) --- */
        struct cfs_rq              cfs;                  /*   192   384 */

        /* XXX last struct has 40 bytes of padding */

        /* --- cacheline 9 boundary (576 bytes) --- */
        struct rt_rq               rt;                   /*   576  1704 */
        /* --- cacheline 35 boundary (2240 bytes) was 40 bytes ago --- */
        struct dl_rq               dl;                   /*  2280   104 */
---8<---

Considering that:

  struct uclamp_rq {
      unsigned int value;
      struct uclamp_bucket bucket[UCLAMP_BUCKETS];
  };

perhaps we can experiment by adding some padding at the end of this
struct to get also uclamp_max spanning only one cache line.

But, considering that at enqueue/dequeue we update both min and max
clamp task's counters, I don't think we get much.


>> > I think I spotted a bug where in the if condition we check for '>='
>> > instead of '>', causing us to take the supposedly impossible fail safe
>> > path.
>> 
>> The fail safe path is when the '>' condition matches, which is what the
>> SCHED_WARN_ON tell us. Indeed, we never expect uclamp(cpu) to be bigger
>> than one of its RUNNABLE tasks. If that should happen we WARN and fix
>> the cpu clamp value for the best.
>> 
>> The normal path is instead '=' and, according to by previous resume,
>> it's expected to be executed _only_ when we dequeue the last task of the
>> clamp group defining the current uclamp(cpu) value.
>
> Okay. I was mislead by the comment then. Thanks for clarifying.
>
> Can this function be broken down to deal with '=' separately from the '>' case?

The '=' case is there just for defensive programming. If something is
wrong and is not catastrophic: fix it and report. The comment tells
exactly this, perhaps we can extend it by saying that something like:
 "Normally we expect MAX to be updated only to a smaller value"
?

> IIUC, for the common '=', we always want to return uclamp_idle_value() hence
> skip the potentially expensive scan?

No, for the '=' case we want to return the new max.

In uclamp_rq_max_value() we check if there are other RUNNABLE tasks,
which will have by definition a smaller uclamp value. If we find one we
want to return their clamp value.

If there are not, we could have avoided the scan, true.
But, since uclamp tracks both RT and CFS tasks, we know that we will be
going idle thus the scan overhead should not be a big deal.

> Anyway, based on Vincent results, it doesn't seem this path is an issue for him
> and the real problem is lurking somewhere else.

Yes, likely.

The only thing perhaps worth trying is the use of unsigned instead of
long unsigned, as you proposed before, with the aim of fitting everything in a
single cache line. But honestly, I'm still quite sceptical about that
being the root cause. Worth a try tho.
We would need cache stats to prove it.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-05  7:55                   ` Patrick Bellasi
@ 2020-06-05 11:32                     ` Qais Yousef
  2020-06-05 13:27                       ` Patrick Bellasi
  0 siblings, 1 reply; 68+ messages in thread
From: Qais Yousef @ 2020-06-05 11:32 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Vincent Guittot, Mel Gorman, Dietmar Eggemann, Peter Zijlstra,
	Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli,
	Steven Rostedt, Ben Segall, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Quentin Perret, Valentin Schneider, Pavan Kondeti,
	linux-doc, linux-kernel, linux-fs

On 06/05/20 09:55, Patrick Bellasi wrote:
> 
> Hi Qais,
> 
> On Wed, Jun 03, 2020 at 18:52:00 +0200, Qais Yousef <qais.yousef@arm.com> wrote...
> 
> > On 06/03/20 16:59, Vincent Guittot wrote:
> >> When I want to stress the fast path i usually use "perf bench sched pipe -T "
> >> The tip/sched/core on my arm octo core gives the following results for
> >> 20 iterations of perf bench sched pipe -T -l 50000
> >> 
> >> all uclamp config disabled  50035.4(+/- 0.334%)
> >> all uclamp config enabled  48749.8(+/- 0.339%)   -2.64%
> 
I used to run the same test but I don't remember such big numbers :/

Yeah I remember you ran a lot of testing on this.

> 
> >> It's quite easy to reproduce and probably easier to study the impact
> >
> > Thanks Vincent. This is very useful!
> >
> > I could reproduce that on my Juno.
> >
> > One of the codepath I was suspecting seems to affect it.
> >
> >
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 0464569f26a7..9f48090eb926 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1063,10 +1063,12 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
> >          * e.g. due to future modification, warn and fixup the expected value.
> >          */
> >         SCHED_WARN_ON(bucket->value > rq_clamp);
> > +#if 0
> >         if (bucket->value >= rq_clamp) {
> >                 bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
> >                 WRITE_ONCE(uc_rq->value, bkt_clamp);
> >         }
> > +#endif
> 
> Yep, that's likely where we have most of the overhead at dequeue time,
> since _sometimes_ we need to update the cpu's clamp value.
> 
> However, while running perf sched pipe, I expect:
>  - all tasks to have the same clamp value
>  - all CPUs to have _always_ at least one RUNNABLE task
> 
> Given these two conditions above, if the CPU is never "CFS idle" (i.e.
> without RUNNABLE CFS tasks), the code above should never be triggered.
> More on that later...

So the cost is only incurred by idle CPUs, is what you're saying.

> 
> >  }
> >
> >  static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> >
> >
> >
> > uclamp_rq_max_value() could be expensive as it loops over all buckets.
> 
> It loops over UCLAMP_CNT values which are defined to fit into a single

I think you meant to say UCLAMP_BUCKETS, which is defined as 5 by default.

> $L. That was the optimal space/time complexity compromise we found to
> get the MAX of a set of values.

It actually covers two cachelines, see below and my other email to Mel.

> 
> > Commenting this whole path out strangely doesn't just 'fix' it,
> > but produces better results than the no-uclamp kernel :-/
> >
> > # ./perf bench -r 20 sched pipe -T -l 50000
> > Without uclamp:		5039
> > With uclamp:		4832
> > With uclamp+patch:	5729
> 
> I explain it below: with that code removed you never decrease the CPU's
> uclamp value. Thus, the first time you schedule an RT task you go to MAX
> OPP and stay there forever.

Okay.

> 
> > It might be because schedutil gets biased differently by uclamp..? If I move to
> > performance governor these numbers almost double.
> >
> > I don't know. But this prompted me to look closer and
> 
> Just to summarize, when a task is dequeued we can have only these cases:
> 
> - uclamp(task) < uclamp(cpu):
>   this happens when the task was co-scheduled with other tasks with
>   higher clamp values which are still RUNNABLE.
>   In this case there are no uclamp(cpu) updates.
> 
> - uclamp(task) == uclamp(cpu):
>   this happens when the task was one of the tasks defining the current
>   uclamp(cpu) value, which is defined to track the MAX of the RUNNABLE
>   tasks clamp values.
> 
> In this last case we do _not_ always need to do a uclamp(cpu) update.
> Indeed the update is required _only_ when that task was _the last_ task
> defining the current uclamp(cpu) value.
> 
> In this case we use uclamp_rq_max_value() to do a linear scan of
> UCLAMP_CNT values which fits into a single cache line.

Again, I think you mean UCLAMP_BUCKETS here. Unless I missed something, they
span 2 cachelines on 64-bit machines with a 64-byte cacheline size.

To be specific, I am referring to struct uclamp_rq, which defines an array of
size UCLAMP_BUCKETS of type struct uclamp_bucket.

uclamp_rq_max_value() scans the buckets for a given clamp_id (UCLAMP_MIN or
UCLAMP_MAX).

So sizeof(struct uclamp_rq) = 8 * 5 + 4 = 44 on 64-bit machines.

And actually the compiler introduces a 4-byte hole, so we end up with a total
of 48 bytes.

In struct rq, we define struct uclamp_rq as an array of size UCLAMP_CNT, which is 2.

So by default we have 2 * sizeof(struct uclamp_rq) = 96 bytes.

> 
> > I think I spotted a bug where in the if condition we check for '>='
> > instead of '>', causing us to take the supposedly impossible fail safe
> > path.
> 
> The fail safe path is when the '>' condition matches, which is what the
> SCHED_WARN_ON tells us. Indeed, we never expect uclamp(cpu) to be bigger
> than one of its RUNNABLE tasks. If that should happen we WARN and fix
> the cpu clamp value for the best.
> 
> The normal path is instead '=' and, according to my previous summary,
> it's expected to be executed _only_ when we dequeue the last task of the
> clamp group defining the current uclamp(cpu) value.

Okay. I was misled by the comment then. Thanks for clarifying.

Can this function be broken down to deal with '=' separately from the '>' case?

IIUC, for the common '=', we always want to return uclamp_idle_value() and
hence skip the potentially expensive scan?

Anyway, based on Vincent's results, it doesn't seem this path is an issue for him
and the real problem is lurking somewhere else.

> 
> > Mind trying with the below patch please?
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 0464569f26a7..50d66d4016ff 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1063,7 +1063,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
> >          * e.g. due to future modification, warn and fixup the expected value.
> >          */
> >         SCHED_WARN_ON(bucket->value > rq_clamp);
> > -       if (bucket->value >= rq_clamp) {
> > +       if (bucket->value > rq_clamp) {
> >                 bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
> >                 WRITE_ONCE(uc_rq->value, bkt_clamp);
> >         }
> 
> This patch is thus bogus, since we never expect to have uclamp(cpu)
> bigger than uclamp(task) and thus we will never reset the clamp value of
> a cpu.

Yeah I got confused by SCHED_WARN_ON() and the comment. Thanks for clarifying.

Cheers

--
Qais Yousef

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-04 13:40               ` Mel Gorman
@ 2020-06-05 10:58                 ` Qais Yousef
  2020-06-11 10:58                 ` Qais Yousef
  1 sibling, 0 replies; 68+ messages in thread
From: Qais Yousef @ 2020-06-05 10:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dietmar Eggemann, Peter Zijlstra, Ingo Molnar, Randy Dunlap,
	Jonathan Corbet, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Ben Segall, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

On 06/04/20 14:40, Mel Gorman wrote:
> > > > The diffs are smaller than on openSUSE Leap 15.1 and some of the
> > > > uclamp taskgroup results are better?
> > > > 
> > > 
> > > I don't see the stddev and coeff but these look close to borderline.
> > > Sure, they are marked with a * so it passed a significance test but it's
> > > still a very marginal difference for netperf. It's possible that the
> > > systemd configurations differ in some way that is significant for uclamp
> > > but I don't know what that is.
> > 
> > Hmm so what you're saying is that Dietmar didn't reproduce the same problem
> > you're observing? I was hoping to use that to dig more into it.
> > 
> 
> Not as such, I'm saying that for whatever reason the problem is not as
> visible with Dietmar's setup. It may be machine-specific or distribution
> specific. There are alternative suggestions for testing just the fast
> paths with a pipe test that may be clearer.

Unfortunately I lost access to that machine, but I will resume testing on it as
soon as it's back online.

Vincent shared more info about his setup. If I can see the same thing without
having to use a big machine that'd make it easier to debug.

> > > 
> > > > With this test setup we now can play with the uclamp code in
> > > > enqueue_task() and dequeue_task().
> > > > 
> > > 
> > > That is still true. An annotated perf profile should tell you if the
> > > uclamp code is being heavily used or if it's bailing early but it's also
> > > possible that uclamp overhead is not a big deal on your particular
> > > machine.
> > > 
> > > The possibility that either the distribution, the machine or both are
> > > critical for detecting a problem with uclamp may explain why any overhead
> > > was missed. Even if it is marginal, it still makes sense to minimise the
> > > amount of uclamp code that is executed if no limit is specified for tasks.
> > 
> > So one speculation I have about what might be causing the problem is that the
> > accesses of struct uclamp_rq are causing bad cache behavior in your case. Your
> > mmtests description of netperf says that it is sensitive to cacheline
> > bouncing.
> > 
> > Looking at struct rq, the uclamp_rq is spanning 2 cachelines
> > 
> >  29954         /* --- cacheline 1 boundary (64 bytes) --- */
> >  29955         struct uclamp_rq           uclamp[2];            /*    64    96 */
> >  29956         /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
> >  29957         unsigned int               uclamp_flags;         /*   160     4 */
> >  29958
> >  29959         /* XXX 28 bytes hole, try to pack */
> >  29960
> > 
> > Reducing struct uclamp_bucket to use unsigned int instead of unsigned long
> > helps putting it all in a single cacheline
> > 
> 
> I tried this and while it did not make much of a difference to the
> headline metric, the workload was less variable so if it's proven that
> cache line bouncing is reduced (I didn't measure it), it may have merit
> on its own even if it does not fully address the problem.

Yeah maybe if we can prove it's worth it. I'll keep it on my list to look at
after we fix the main issue first.

Thanks

--
Qais Yousef

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-04 12:14                   ` Vincent Guittot
@ 2020-06-05 10:45                     ` Qais Yousef
  2020-06-09 15:29                       ` Vincent Guittot
  2020-06-08 12:31                     ` Qais Yousef
  1 sibling, 1 reply; 68+ messages in thread
From: Qais Yousef @ 2020-06-05 10:45 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mel Gorman, Patrick Bellasi, Dietmar Eggemann, Peter Zijlstra,
	Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli,
	Steven Rostedt, Ben Segall, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Quentin Perret, Valentin Schneider, Pavan Kondeti,
	linux-doc, linux-kernel, linux-fs

On 06/04/20 14:14, Vincent Guittot wrote:
> I have tried your patch and I don't see any difference compared to
> previous tests. Let me give you more details of my setup:
> I create 3 levels of cgroups and usually run the tests in the 4 levels
> (which includes root). The results above are for the root level
> 
> But I see a difference at other levels:
> 
>                            root           level 1       level 2       level 3
> 
> /w patch uclamp disable     50097         46615         43806         41078
> tip uclamp enable           48706(-2.78%) 45583(-2.21%) 42851(-2.18%)
> 40313(-1.86%)
> /w patch uclamp enable      48882(-2.43%) 45774(-1.80%) 43108(-1.59%)
> 40667(-1.00%)
> 
> Whereas tip with uclamp stays around 2% behind tip without uclamp, the
> diff of uclamp with your patch tends to decrease when we increase the
> number of levels

Thanks for the extra info. Let me try this.

If you can run perf and verify that you see activate/deactivate_task showing up
as overhead I'd appreciate it. Just to confirm that indeed what we're seeing
here are symptoms of the same problem Mel is seeing.

> Besides this, it's also interesting to notice the ~6% perf impact
> between each level for the same image

Interesting indeed.

Thanks

--
Qais Yousef

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-03 16:52                 ` Qais Yousef
  2020-06-04 12:14                   ` Vincent Guittot
@ 2020-06-05  7:55                   ` Patrick Bellasi
  2020-06-05 11:32                     ` Qais Yousef
  1 sibling, 1 reply; 68+ messages in thread
From: Patrick Bellasi @ 2020-06-05  7:55 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Vincent Guittot, Mel Gorman, Dietmar Eggemann, Peter Zijlstra,
	Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli,
	Steven Rostedt, Ben Segall, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Quentin Perret, Valentin Schneider, Pavan Kondeti,
	linux-doc, linux-kernel, linux-fs


Hi Qais,

On Wed, Jun 03, 2020 at 18:52:00 +0200, Qais Yousef <qais.yousef@arm.com> wrote...

> On 06/03/20 16:59, Vincent Guittot wrote:
>> When I want to stress the fast path i usually use "perf bench sched pipe -T "
>> The tip/sched/core on my arm octo core gives the following results for
>> 20 iterations of perf bench sched pipe -T -l 50000
>> 
>> all uclamp config disabled  50035.4(+/- 0.334%)
>> all uclamp config enabled  48749.8(+/- 0.339%)   -2.64%

I used to run the same test, but I don't remember such big numbers :/

>> It's quite easy to reproduce and probably easier to study the impact
>
> Thanks Vincent. This is very useful!
>
> I could reproduce that on my Juno.
>
> One of the codepaths I was suspecting seems to affect it.
>
>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0464569f26a7..9f48090eb926 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1063,10 +1063,12 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
>          * e.g. due to future modification, warn and fixup the expected value.
>          */
>         SCHED_WARN_ON(bucket->value > rq_clamp);
> +#if 0
>         if (bucket->value >= rq_clamp) {
>                 bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
>                 WRITE_ONCE(uc_rq->value, bkt_clamp);
>         }
> +#endif

Yep, that's likely where we have most of the overhead at dequeue time,
since _sometimes_ we need to update the cpu's clamp value.

However, while running perf sched pipe, I expect:
 - all tasks to have the same clamp value
 - all CPUs to have _always_ at least one RUNNABLE task

Given these two conditions above, if the CPU is never "CFS idle" (i.e.
without RUNNABLE CFS tasks), the code above should never be triggered.
More on that later...

>  }
>
>  static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>
>
>
> uclamp_rq_max_value() could be expensive as it loops over all buckets.

It loops over UCLAMP_CNT values which are defined to fit into a single
$L. That was the optimal space/time complexity compromise we found to
get the MAX of a set of values.

> Commenting this whole path out strangely doesn't just 'fix' it,
> but produces better results than the no-uclamp kernel :-/
>
> # ./perf bench -r 20 sched pipe -T -l 50000
> Without uclamp:		5039
> With uclamp:		4832
> With uclamp+patch:	5729

I explain it below: with that code removed you never decrease the CPU's
uclamp value. Thus, the first time you schedule an RT task you go to MAX
OPP and stay there forever.

> It might be because schedutil gets biased differently by uclamp..? If I move to
> performance governor these numbers almost double.
>
> I don't know. But this prompted me to look closer and

Just to summarize, when a task is dequeued we can have only these cases:

- uclamp(task) < uclamp(cpu):
  this happens when the task was co-scheduled with other tasks with
  higher clamp values which are still RUNNABLE.
  In this case there are no uclamp(cpu) updates.

- uclamp(task) == uclamp(cpu):
  this happens when the task was one of the tasks defining the current
  uclamp(cpu) value, which is defined to track the MAX of the RUNNABLE
  tasks clamp values.

In this last case we do _not_ always need to do a uclamp(cpu) update.
Indeed the update is required _only_ when that task was _the last_ task
defining the current uclamp(cpu) value.

In this case we use uclamp_rq_max_value() to do a linear scan of
UCLAMP_CNT values which fits into a single cache line.

> I think I spotted a bug where in the if condition we check for '>='
> instead of '>', causing us to take the supposedly impossible fail safe
> path.

The fail safe path is when the '>' condition matches, which is what the
SCHED_WARN_ON tells us. Indeed, we never expect uclamp(cpu) to be bigger
than one of its RUNNABLE tasks. If that should happen we WARN and fix
the cpu clamp value for the best.

The normal path is instead '=' and, according to my previous summary,
it's expected to be executed _only_ when we dequeue the last task of the
clamp group defining the current uclamp(cpu) value.

> Mind trying with the below patch please?
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0464569f26a7..50d66d4016ff 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1063,7 +1063,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
>          * e.g. due to future modification, warn and fixup the expected value.
>          */
>         SCHED_WARN_ON(bucket->value > rq_clamp);
> -       if (bucket->value >= rq_clamp) {
> +       if (bucket->value > rq_clamp) {
>                 bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
>                 WRITE_ONCE(uc_rq->value, bkt_clamp);
>         }

This patch is thus bogus, since we never expect to have uclamp(cpu)
bigger than uclamp(task) and thus we will never reset the clamp value of
a cpu.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-03 12:41             ` Qais Yousef
@ 2020-06-04 13:40               ` Mel Gorman
  2020-06-05 10:58                 ` Qais Yousef
  2020-06-11 10:58                 ` Qais Yousef
  0 siblings, 2 replies; 68+ messages in thread
From: Mel Gorman @ 2020-06-04 13:40 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Dietmar Eggemann, Peter Zijlstra, Ingo Molnar, Randy Dunlap,
	Jonathan Corbet, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Ben Segall, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

On Wed, Jun 03, 2020 at 01:41:13PM +0100, Qais Yousef wrote:
> > > netperf-udp
> > >                                 ./5.7.0-rc7            ./5.7.0-rc7            ./5.7.0-rc7
> > >                               without-clamp             with-clamp      with-clamp-tskgrp
> > > 
> > > Hmean     send-64         153.62 (   0.00%)      151.80 *  -1.19%*      155.60 *   1.28%*
> > > Hmean     send-128        306.77 (   0.00%)      306.27 *  -0.16%*      309.39 *   0.85%*
> > > Hmean     send-256        608.54 (   0.00%)      604.28 *  -0.70%*      613.42 *   0.80%*
> > > Hmean     send-1024      2395.80 (   0.00%)     2365.67 *  -1.26%*     2409.50 *   0.57%*
> > > Hmean     send-2048      4608.70 (   0.00%)     4544.02 *  -1.40%*     4665.96 *   1.24%*
> > > Hmean     send-3312      7223.97 (   0.00%)     7158.88 *  -0.90%*     7331.23 *   1.48%*
> > > Hmean     send-4096      8729.53 (   0.00%)     8598.78 *  -1.50%*     8860.47 *   1.50%*
> > > Hmean     send-8192     14961.77 (   0.00%)    14418.92 *  -3.63%*    14908.36 *  -0.36%*
> > > Hmean     send-16384    25799.50 (   0.00%)    25025.64 *  -3.00%*    25831.20 *   0.12%*
> > > Hmean     recv-64         153.62 (   0.00%)      151.80 *  -1.19%*      155.60 *   1.28%*
> > > Hmean     recv-128        306.77 (   0.00%)      306.27 *  -0.16%*      309.39 *   0.85%*
> > > Hmean     recv-256        608.54 (   0.00%)      604.28 *  -0.70%*      613.42 *   0.80%*
> > > Hmean     recv-1024      2395.80 (   0.00%)     2365.67 *  -1.26%*     2409.50 *   0.57%*
> > > Hmean     recv-2048      4608.70 (   0.00%)     4544.02 *  -1.40%*     4665.95 *   1.24%*
> > > Hmean     recv-3312      7223.97 (   0.00%)     7158.88 *  -0.90%*     7331.23 *   1.48%*
> > > Hmean     recv-4096      8729.53 (   0.00%)     8598.78 *  -1.50%*     8860.47 *   1.50%*
> > > Hmean     recv-8192     14961.61 (   0.00%)    14418.88 *  -3.63%*    14908.30 *  -0.36%*
> > > Hmean     recv-16384    25799.39 (   0.00%)    25025.49 *  -3.00%*    25831.00 *   0.12%*
> > > 
> > > netperf-tcp
> > >  
> > > Hmean     64              818.65 (   0.00%)      812.98 *  -0.69%*      826.17 *   0.92%*
> > > Hmean     128            1569.55 (   0.00%)     1555.79 *  -0.88%*     1586.94 *   1.11%*
> > > Hmean     256            2952.86 (   0.00%)     2915.07 *  -1.28%*     2968.15 *   0.52%*
> > > Hmean     1024          10425.91 (   0.00%)    10296.68 *  -1.24%*    10418.38 *  -0.07%*
> > > Hmean     2048          17454.51 (   0.00%)    17369.57 *  -0.49%*    17419.24 *  -0.20%*
> > > Hmean     3312          22509.95 (   0.00%)    22229.69 *  -1.25%*    22373.32 *  -0.61%*
> > > Hmean     4096          25033.23 (   0.00%)    24859.59 *  -0.69%*    24912.50 *  -0.48%*
> > > Hmean     8192          32080.51 (   0.00%)    31744.51 *  -1.05%*    31800.45 *  -0.87%*
> > > Hmean     16384         36531.86 (   0.00%)    37064.68 *   1.46%*    37397.71 *   2.37%*
> > > 
> > > The diffs are smaller than on openSUSE Leap 15.1 and some of the
> > > uclamp taskgroup results are better?
> > > 
> > 
> > I don't see the stddev and coeff but these look close to borderline.
> > Sure, they are marked with a * so it passed a significance test but it's
> > still a very marginal difference for netperf. It's possible that the
> > systemd configurations differ in some way that is significant for uclamp
> > but I don't know what that is.
> 
> Hmm so what you're saying is that Dietmar didn't reproduce the same problem
> you're observing? I was hoping to use that to dig more into it.
> 

Not as such, I'm saying that for whatever reason the problem is not as
visible with Dietmar's setup. It may be machine-specific or distribution
specific. There are alternative suggestions for testing just the fast
paths with a pipe test that may be clearer.

> > 
> > > With this test setup we now can play with the uclamp code in
> > > enqueue_task() and dequeue_task().
> > > 
> > 
> > That is still true. An annotated perf profile should tell you if the
> > uclamp code is being heavily used or if it's bailing early but it's also
> > possible that uclamp overhead is not a big deal on your particular
> > machine.
> > 
> > The possibility that either the distribution, the machine or both are
> > critical for detecting a problem with uclamp may explain why any overhead
> > was missed. Even if it is marginal, it still makes sense to minimise the
> > amount of uclamp code that is executed if no limit is specified for tasks.
> 
> So one speculation I have about what might be causing the problem is that the
> accesses of struct uclamp_rq are causing bad cache behavior in your case. Your
> mmtests description of netperf says that it is sensitive to cacheline
> bouncing.
> 
> Looking at struct rq, the uclamp_rq is spanning 2 cachelines
> 
>  29954         /* --- cacheline 1 boundary (64 bytes) --- */
>  29955         struct uclamp_rq           uclamp[2];            /*    64    96 */
>  29956         /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
>  29957         unsigned int               uclamp_flags;         /*   160     4 */
>  29958
>  29959         /* XXX 28 bytes hole, try to pack */
>  29960
> 
> Reducing struct uclamp_bucket to use unsigned int instead of unsigned long
> helps putting it all in a single cacheline
> 

I tried this and while it did not make much of a difference to the
headline metric, the workload was less variable so if it's proven that
cache line bouncing is reduced (I didn't measure it), it may have merit
on its own even if it does not fully address the problem.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-03 16:52                 ` Qais Yousef
@ 2020-06-04 12:14                   ` Vincent Guittot
  2020-06-05 10:45                     ` Qais Yousef
  2020-06-08 12:31                     ` Qais Yousef
  2020-06-05  7:55                   ` Patrick Bellasi
  1 sibling, 2 replies; 68+ messages in thread
From: Vincent Guittot @ 2020-06-04 12:14 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Mel Gorman, Patrick Bellasi, Dietmar Eggemann, Peter Zijlstra,
	Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli,
	Steven Rostedt, Ben Segall, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Quentin Perret, Valentin Schneider, Pavan Kondeti,
	linux-doc, linux-kernel, linux-fs

On Wed, 3 Jun 2020 at 18:52, Qais Yousef <qais.yousef@arm.com> wrote:
>
> On 06/03/20 16:59, Vincent Guittot wrote:
> > When I want to stress the fast path i usually use "perf bench sched pipe -T "
> > The tip/sched/core on my arm octo core gives the following results for
> > 20 iterations of perf bench sched pipe -T -l 50000
> >
> > all uclamp config disabled  50035.4(+/- 0.334%)
> > all uclamp config enabled  48749.8(+/- 0.339%)   -2.64%
> >
> > It's quite easy to reproduce and probably easier to study the impact
>
> Thanks Vincent. This is very useful!
>
> I could reproduce that on my Juno.
>
> One of the codepaths I was suspecting seems to affect it.
>
>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0464569f26a7..9f48090eb926 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1063,10 +1063,12 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
>          * e.g. due to future modification, warn and fixup the expected value.
>          */
>         SCHED_WARN_ON(bucket->value > rq_clamp);
> +#if 0
>         if (bucket->value >= rq_clamp) {
>                 bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
>                 WRITE_ONCE(uc_rq->value, bkt_clamp);
>         }
> +#endif
>  }
>
>  static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>
>
>
> uclamp_rq_max_value() could be expensive as it loops over all buckets.
> Commenting this whole path out strangely doesn't just 'fix' it, but produces
> better results than the no-uclamp kernel :-/
>
>
>
> # ./perf bench -r 20 sched pipe -T -l 50000
> Without uclamp:         5039
> With uclamp:            4832
> With uclamp+patch:      5729
>
>
>
> It might be because schedutil gets biased differently by uclamp..? If I move to
> performance governor these numbers almost double.
>
> I don't know. But this prompted me to look closer and I think I spotted a bug
> where in the if condition we check for '>=' instead of '>', causing us to take
> the supposedly impossible fail safe path.
>
> Mind trying with the below patch please?

I have tried your patch and I don't see any difference compared to
previous tests. Let me give you more details of my setup:
I create 3 levels of cgroups and usually run the tests in the 4 levels
(which includes root). The results above are for the root level

But I see a difference at other levels:

                           root           level 1       level 2       level 3

/w patch uclamp disable     50097         46615         43806         41078
tip uclamp enable           48706(-2.78%) 45583(-2.21%) 42851(-2.18%)
40313(-1.86%)
/w patch uclamp enable      48882(-2.43%) 45774(-1.80%) 43108(-1.59%)
40667(-1.00%)

Whereas tip with uclamp stays around 2% behind tip without uclamp, the
diff of uclamp with your patch tends to decrease when we increase the
number of levels

Besides this, it's also interesting to notice the ~6% perf impact
between each level for the same image

>
>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0464569f26a7..50d66d4016ff 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1063,7 +1063,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
>          * e.g. due to future modification, warn and fixup the expected value.
>          */
>         SCHED_WARN_ON(bucket->value > rq_clamp);
> -       if (bucket->value >= rq_clamp) {
> +       if (bucket->value > rq_clamp) {
>                 bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
>                 WRITE_ONCE(uc_rq->value, bkt_clamp);
>         }
>
>
>
> Thanks
>
> --
> Qais Yousef

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-03 14:59               ` Vincent Guittot
@ 2020-06-03 16:52                 ` Qais Yousef
  2020-06-04 12:14                   ` Vincent Guittot
  2020-06-05  7:55                   ` Patrick Bellasi
  0 siblings, 2 replies; 68+ messages in thread
From: Qais Yousef @ 2020-06-03 16:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mel Gorman, Patrick Bellasi, Dietmar Eggemann, Peter Zijlstra,
	Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli,
	Steven Rostedt, Ben Segall, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Quentin Perret, Valentin Schneider, Pavan Kondeti,
	linux-doc, linux-kernel, linux-fs

On 06/03/20 16:59, Vincent Guittot wrote:
> When I want to stress the fast path i usually use "perf bench sched pipe -T "
> The tip/sched/core on my arm octo core gives the following results for
> 20 iterations of perf bench sched pipe -T -l 50000
> 
> all uclamp config disabled  50035.4(+/- 0.334%)
> all uclamp config enabled  48749.8(+/- 0.339%)   -2.64%
> 
> It's quite easy to reproduce and probably easier to study the impact

Thanks Vincent. This is very useful!

I could reproduce that on my Juno.

One of the codepaths I was suspecting seems to affect it.



diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0464569f26a7..9f48090eb926 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1063,10 +1063,12 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
         * e.g. due to future modification, warn and fixup the expected value.
         */
        SCHED_WARN_ON(bucket->value > rq_clamp);
+#if 0
        if (bucket->value >= rq_clamp) {
                bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
                WRITE_ONCE(uc_rq->value, bkt_clamp);
        }
+#endif
 }

 static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)



uclamp_rq_max_value() could be expensive as it loops over all buckets.
Commenting this whole path out strangely doesn't just 'fix' it, but produces
better results than the no-uclamp kernel :-/



# ./perf bench -r 20 sched pipe -T -l 50000
Without uclamp:		5039
With uclamp:		4832
With uclamp+patch:	5729



It might be because schedutil gets biased differently by uclamp..? If I move to
performance governor these numbers almost double.

I don't know. But this prompted me to look closer and I think I spotted a bug
where in the if condition we check for '>=' instead of '>', causing us to take
the supposedly impossible fail safe path.

Mind trying with the below patch please?



diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0464569f26a7..50d66d4016ff 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1063,7 +1063,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
         * e.g. due to future modification, warn and fixup the expected value.
         */
        SCHED_WARN_ON(bucket->value > rq_clamp);
-       if (bucket->value >= rq_clamp) {
+       if (bucket->value > rq_clamp) {
                bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
                WRITE_ONCE(uc_rq->value, bkt_clamp);
        }



Thanks

--
Qais Yousef

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-03 10:10             ` Mel Gorman
@ 2020-06-03 14:59               ` Vincent Guittot
  2020-06-03 16:52                 ` Qais Yousef
  0 siblings, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2020-06-03 14:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Patrick Bellasi, Dietmar Eggemann, Peter Zijlstra, Qais Yousef,
	Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli,
	Steven Rostedt, Ben Segall, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Quentin Perret, Valentin Schneider, Pavan Kondeti,
	linux-doc, linux-kernel, linux-fs

On Wed, 3 Jun 2020 at 12:10, Mel Gorman <mgorman@suse.de> wrote:
>
> On Wed, Jun 03, 2020 at 10:29:22AM +0200, Patrick Bellasi wrote:
> >
> > Hi Dietmar,
> > thanks for sharing these numbers.
> >
> > On Tue, Jun 02, 2020 at 18:46:00 +0200, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote...
> >
> > [...]
> >
> > > I ran these tests on 'Ubuntu 18.04 Desktop' on Intel E5-2690 v2
> > > (2 sockets * 10 cores * 2 threads) with powersave governor as:
> > >
> > > $ numactl -N 0 ./run-mmtests.sh XXX
> >
> > Great setup, it's worth ruling out all possible noise sources (freq
> > scaling, thermal throttling, NUMA scheduler, etc.).
>
> config-network-netperf-cross-socket will do the binding of the server
> and client to two CPUs that are on one socket. However, it does not take
> care to avoid HT siblings although that could be implemented. The same
> configuration should limit the CPU to C1. It does not change the governor
> but all that would take is adding "cpupower frequency-set -g performance"
> to the end of the configuration.
>
> > Wondering if disabling HT can also help here in reducing results "noise"?
> >
> > > w/ config-network-netperf-unbound.
> > >
> > > Running w/o 'numactl -N 0' gives slightly worse results.
> > >
> > > without-clamp      : CONFIG_UCLAMP_TASK is not set
> > > with-clamp         : CONFIG_UCLAMP_TASK=y,
> > >                      CONFIG_UCLAMP_TASK_GROUP is not set
> > > with-clamp-tskgrp  : CONFIG_UCLAMP_TASK=y,
> > >                      CONFIG_UCLAMP_TASK_GROUP=y
> > >
> > >
> > > netperf-udp
> > >                                 ./5.7.0-rc7            ./5.7.0-rc7            ./5.7.0-rc7
> > >                               without-clamp             with-clamp      with-clamp-tskgrp
> >
> > Can you please specify how to read the following scores? I gave it a run
> > with my local netperf and it reports Throughput, thus I would expect the
> > higher the better... but this seems to be something different.
> >
> > > Hmean     send-64         153.62 (   0.00%)      151.80 *  -1.19%*      155.60 *   1.28%*
> > > Hmean     send-128        306.77 (   0.00%)      306.27 *  -0.16%*      309.39 *   0.85%*
> > > Hmean     send-256        608.54 (   0.00%)      604.28 *  -0.70%*      613.42 *   0.80%*
> > > Hmean     send-1024      2395.80 (   0.00%)     2365.67 *  -1.26%*     2409.50 *   0.57%*
> > > Hmean     send-2048      4608.70 (   0.00%)     4544.02 *  -1.40%*     4665.96 *   1.24%*
> > > Hmean     send-3312      7223.97 (   0.00%)     7158.88 *  -0.90%*     7331.23 *   1.48%*
> > > Hmean     send-4096      8729.53 (   0.00%)     8598.78 *  -1.50%*     8860.47 *   1.50%*
> > > Hmean     send-8192     14961.77 (   0.00%)    14418.92 *  -3.63%*    14908.36 *  -0.36%*
> > > Hmean     send-16384    25799.50 (   0.00%)    25025.64 *  -3.00%*    25831.20 *   0.12%*
> >
> > If I read it as the lower the score the better, all the above results
> > tell us that with-clamp is even better, while with-clamp-tskgrp
> > is not that much worse.
> >
>
> The figures are throughput, so taking the first line:
>
> without-clamp           153.62
> with-clamp              151.80 (worse, so the percentage difference is negative)
> with-clamp-tskgrp       155.60 (better so the percentage different is positive)
>
> > The other way around (the higher the score the better) would look odd:
> > since we definitely add more code and complexity when uclamp has
> > TG support enabled, we would not expect better scores.
> >
>
> Netperf for small differences is very fickle as small differences in timing
> or code layout can make a difference. Boot-to-boot variance can also be
> an issue and bisection is generally unreliable. In this case, I relied on
> the perf annotation and differences in ftrace function_graph to determine
> that uclamp was introducing enough overhead to be considered a problem.

When I want to stress the fast path, I usually use "perf bench sched pipe -T".
The tip/sched/core on my arm octo-core gives the following results for
20 iterations of perf bench sched pipe -T -l 50000:

all uclamp config disabled  50035.4(+/- 0.334%)
all uclamp config enabled  48749.8(+/- 0.339%)   -2.64%

It's quite easy to reproduce, and probably makes it easier to study the impact.

>
> > > Hmean     recv-64         153.62 (   0.00%)      151.80 *  -1.19%*      155.60 *   1.28%*
> > > Hmean     recv-128        306.77 (   0.00%)      306.27 *  -0.16%*      309.39 *   0.85%*
> > > Hmean     recv-256        608.54 (   0.00%)      604.28 *  -0.70%*      613.42 *   0.80%*
> > > Hmean     recv-1024      2395.80 (   0.00%)     2365.67 *  -1.26%*     2409.50 *   0.57%*
> > > Hmean     recv-2048      4608.70 (   0.00%)     4544.02 *  -1.40%*     4665.95 *   1.24%*
> > > Hmean     recv-3312      7223.97 (   0.00%)     7158.88 *  -0.90%*     7331.23 *   1.48%*
> > > Hmean     recv-4096      8729.53 (   0.00%)     8598.78 *  -1.50%*     8860.47 *   1.50%*
> > > Hmean     recv-8192     14961.61 (   0.00%)    14418.88 *  -3.63%*    14908.30 *  -0.36%*
> > > Hmean     recv-16384    25799.39 (   0.00%)    25025.49 *  -3.00%*    25831.00 *   0.12%*
> > >
> > > netperf-tcp
> > >
> > > Hmean     64              818.65 (   0.00%)      812.98 *  -0.69%*      826.17 *   0.92%*
> > > Hmean     128            1569.55 (   0.00%)     1555.79 *  -0.88%*     1586.94 *   1.11%*
> > > Hmean     256            2952.86 (   0.00%)     2915.07 *  -1.28%*     2968.15 *   0.52%*
> > > Hmean     1024          10425.91 (   0.00%)    10296.68 *  -1.24%*    10418.38 *  -0.07%*
> > > Hmean     2048          17454.51 (   0.00%)    17369.57 *  -0.49%*    17419.24 *  -0.20%*
> > > Hmean     3312          22509.95 (   0.00%)    22229.69 *  -1.25%*    22373.32 *  -0.61%*
> > > Hmean     4096          25033.23 (   0.00%)    24859.59 *  -0.69%*    24912.50 *  -0.48%*
> > > Hmean     8192          32080.51 (   0.00%)    31744.51 *  -1.05%*    31800.45 *  -0.87%*
> > > Hmean     16384         36531.86 (   0.00%)    37064.68 *   1.46%*    37397.71 *   2.37%*
> > >
> > > The diffs are smaller than on openSUSE Leap 15.1 and some of the
> > > uclamp taskgroup results are better?
> > >
> > > With this test setup we now can play with the uclamp code in
> > > enqueue_task() and dequeue_task().
> > >
> > > ---
> > >
> > > W/ config-network-netperf-unbound (only netperf-udp and buffer size 64):
> > >
> > > $ perf diff 5.7.0-rc7_without-clamp/perf.data 5.7.0-rc7_with-clamp/perf.data | grep activate_task
> > >
> > > # Event 'cycles:ppp'
> > > #
> > > # Baseline  Delta Abs  Shared Object            Symbol
> > >
> > >      0.02%     +0.54%  [kernel.vmlinux]         [k] activate_task
> > >      0.02%     +0.38%  [kernel.vmlinux]         [k] deactivate_task
> > >
> > > $ perf diff 5.7.0-rc7_without-clamp/perf.data 5.7.0-rc7_with-clamp-tskgrp/perf.data | grep activate_task
> > >
> > >      0.02%     +0.35%  [kernel.vmlinux]         [k] activate_task
> > >      0.02%     +0.34%  [kernel.vmlinux]         [k] deactivate_task
> >
> > These data make more sense to me; AFAIR we measured <1% impact in the
> > wakeup path using cyclictest.
> >
>
> 1% doesn't sound like a lot but UDP_STREAM is an example of a load with
> a *lot* of wakeups so even though the impact on each individual wakeup
> is small, it builds up.
>
> > I would also suggest always reporting the overheads for
> >   __update_load_avg_cfs_rq()
> > as a reference point. We use that code quite a lot in the wakeup path
> > and it's a good proxy for relative comparisons.
> >
> >
> > > I still see 20 out of 90 tests with the warning message that the
> > > desired confidence was not achieved though.
> >
> > Where does the 90 come from? From the above table we run 9 sizes for
> > {udp-send, udp-recv, tcp} and 3 kernels. Should that not give us 81 results?
> >
> > Maybe the warnings are generated only when a test has to be repeated?
>
> The warning is issued when it could not get a reliable result within the
> iterations allowed.
>
> > > "
> > > !!! WARNING
> > > !!! Desired confidence was not achieved within the specified iterations.
> > > !!! This implies that there was variability in the test environment that
> > > !!! must be investigated before going further.
> > > !!! Confidence intervals: Throughput      : 6.727% <-- more than 5% !!!
> > > !!!                       Local CPU util  : 0.000%
> > > !!!                       Remote CPU util : 0.000%
> > > "
> > >
> > > mmtests seems to run netperf with the following '-I' and 'i' parameter
> > > hardcoded: 'netperf -t UDP_STREAM -i 3,3 -I 95,5'
> >
> > This means that we compute a score (average +/-2.5%) with 95% confidence.
> >
> > Does that not mean that every +/-2.5% difference in the results
> > above should be considered within the noise?
> >
>
> Usually yes but the impact is small enough to be within noise but
> still detectable. Where we get hurt is when there are multiple problems
> introduced where each contribute overhead that is within the noise but when
> all added together there is a regression outside the noise. Uclamp is not
> special in this respect, it just happens to be the current focus.  We met
> this type of problem before with PSI that was resolved by e0c274472d5d
> ("psi: make disabling/enabling easier for vendor kernels").
>
> > I would say that it could be useful to run with more iterations
> > and, given the small numbers we are looking at (apparently we are
> > scared by a 1% overhead), we had better use a more aggressive CI.
> >
> > What about something like:
> >
> >    netperf -t UDP_STREAM -i 3,30 -I 99,1
> >
> > ?
> >
>
> You could but the runtime of netperf will be variable, it will not be
> guaranteed to give consistent results and it may mask the true variability
> of the workload. While we could debate which is a valid approach, I
> think it makes sense to minimise the overhead of uclamp when it's not
> configured even if that means putting it behind a static branch that is
> enabled via a command-line parameter or a Kconfig that specifies whether
> it's on or off by default.
>
> --
> Mel Gorman
> SUSE Labs


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-03  9:40           ` Mel Gorman
@ 2020-06-03 12:41             ` Qais Yousef
  2020-06-04 13:40               ` Mel Gorman
  0 siblings, 1 reply; 68+ messages in thread
From: Qais Yousef @ 2020-06-03 12:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dietmar Eggemann, Peter Zijlstra, Ingo Molnar, Randy Dunlap,
	Jonathan Corbet, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Ben Segall, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

On 06/03/20 10:40, Mel Gorman wrote:
> On Tue, Jun 02, 2020 at 06:46:00PM +0200, Dietmar Eggemann wrote:
> > On 29.05.20 12:08, Mel Gorman wrote:
> > > On Thu, May 28, 2020 at 06:11:12PM +0200, Peter Zijlstra wrote:
> > >>> FWIW, I think you're referring to Mel's notice in OSPM regarding the overhead.
> > >>> Trying to see what goes on in there.
> > >>
> > >> Indeed, that one. The fact that regular distros cannot enable this
> > >> feature due to performance overhead is unfortunate. It means there is a
> > >> lot less potential for this stuff.
> > > 
> > > During that talk, I was vague about the cost, admitted I had not looked
> > > too closely at mainline performance and had since deleted the data given
> > > that the problem was first spotted in early April. If I heard someone
> > > else making statements like I did at the talk, I would consider it a bit
> > > vague, potentially FUD, possibly wrong and worth rechecking myself. In
> > > terms of distributions "cannot enable this", we could but I was unwilling
> > > to pay the cost for a feature no one has asked for yet. If they had, I
> > > would endeavour to put it behind static branches and disable it by default
> > > (like what happened for PSI). I was contacted offlist about my comments
> > > at OSPM and gathered new data to respond properly. For the record, here
> > > is an edited version of my response:
> > 
> > [...]
> > 
> > I ran these tests on 'Ubuntu 18.04 Desktop' on Intel E5-2690 v2
> > (2 sockets * 10 cores * 2 threads) with powersave governor as:
> > 
> > $ numactl -N 0 ./run-mmtests.sh XXX
> > 
> > w/ config-network-netperf-unbound.
> > 
> > Running w/o 'numactl -N 0' gives slightly worse results.
> > 
> > without-clamp      : CONFIG_UCLAMP_TASK is not set
> > with-clamp         : CONFIG_UCLAMP_TASK=y,
> >                      CONFIG_UCLAMP_TASK_GROUP is not set
> > with-clamp-tskgrp  : CONFIG_UCLAMP_TASK=y,
> >                      CONFIG_UCLAMP_TASK_GROUP=y
> > 
> > 
> > netperf-udp
> >                                 ./5.7.0-rc7            ./5.7.0-rc7            ./5.7.0-rc7
> >                               without-clamp             with-clamp      with-clamp-tskgrp
> > 
> > Hmean     send-64         153.62 (   0.00%)      151.80 *  -1.19%*      155.60 *   1.28%*
> > Hmean     send-128        306.77 (   0.00%)      306.27 *  -0.16%*      309.39 *   0.85%*
> > Hmean     send-256        608.54 (   0.00%)      604.28 *  -0.70%*      613.42 *   0.80%*
> > Hmean     send-1024      2395.80 (   0.00%)     2365.67 *  -1.26%*     2409.50 *   0.57%*
> > Hmean     send-2048      4608.70 (   0.00%)     4544.02 *  -1.40%*     4665.96 *   1.24%*
> > Hmean     send-3312      7223.97 (   0.00%)     7158.88 *  -0.90%*     7331.23 *   1.48%*
> > Hmean     send-4096      8729.53 (   0.00%)     8598.78 *  -1.50%*     8860.47 *   1.50%*
> > Hmean     send-8192     14961.77 (   0.00%)    14418.92 *  -3.63%*    14908.36 *  -0.36%*
> > Hmean     send-16384    25799.50 (   0.00%)    25025.64 *  -3.00%*    25831.20 *   0.12%*
> > Hmean     recv-64         153.62 (   0.00%)      151.80 *  -1.19%*      155.60 *   1.28%*
> > Hmean     recv-128        306.77 (   0.00%)      306.27 *  -0.16%*      309.39 *   0.85%*
> > Hmean     recv-256        608.54 (   0.00%)      604.28 *  -0.70%*      613.42 *   0.80%*
> > Hmean     recv-1024      2395.80 (   0.00%)     2365.67 *  -1.26%*     2409.50 *   0.57%*
> > Hmean     recv-2048      4608.70 (   0.00%)     4544.02 *  -1.40%*     4665.95 *   1.24%*
> > Hmean     recv-3312      7223.97 (   0.00%)     7158.88 *  -0.90%*     7331.23 *   1.48%*
> > Hmean     recv-4096      8729.53 (   0.00%)     8598.78 *  -1.50%*     8860.47 *   1.50%*
> > Hmean     recv-8192     14961.61 (   0.00%)    14418.88 *  -3.63%*    14908.30 *  -0.36%*
> > Hmean     recv-16384    25799.39 (   0.00%)    25025.49 *  -3.00%*    25831.00 *   0.12%*
> > 
> > netperf-tcp
> >  
> > Hmean     64              818.65 (   0.00%)      812.98 *  -0.69%*      826.17 *   0.92%*
> > Hmean     128            1569.55 (   0.00%)     1555.79 *  -0.88%*     1586.94 *   1.11%*
> > Hmean     256            2952.86 (   0.00%)     2915.07 *  -1.28%*     2968.15 *   0.52%*
> > Hmean     1024          10425.91 (   0.00%)    10296.68 *  -1.24%*    10418.38 *  -0.07%*
> > Hmean     2048          17454.51 (   0.00%)    17369.57 *  -0.49%*    17419.24 *  -0.20%*
> > Hmean     3312          22509.95 (   0.00%)    22229.69 *  -1.25%*    22373.32 *  -0.61%*
> > Hmean     4096          25033.23 (   0.00%)    24859.59 *  -0.69%*    24912.50 *  -0.48%*
> > Hmean     8192          32080.51 (   0.00%)    31744.51 *  -1.05%*    31800.45 *  -0.87%*
> > Hmean     16384         36531.86 (   0.00%)    37064.68 *   1.46%*    37397.71 *   2.37%*
> > 
> > The diffs are smaller than on openSUSE Leap 15.1 and some of the
> > uclamp taskgroup results are better?
> > 
> 
> I don't see the stddev and coeff but these look close to borderline.
> Sure, they are marked with a * so it passed a significance test but it's
> still a very marginal difference for netperf. It's possible that the
> systemd configurations differ in some way that is significant for uclamp
> but I don't know what that is.

Hmm so what you're saying is that Dietmar didn't reproduce the same problem
you're observing? I was hoping to use that to dig more into it.

> 
> > With this test setup we now can play with the uclamp code in
> > enqueue_task() and dequeue_task().
> > 
> 
> That is still true. An annotated perf profile should tell you if the
> uclamp code is being heavily used or if it's bailing early but it's also
> possible that uclamp overhead is not a big deal on your particular
> machine.
> 
> The possibility that either the distribution, the machine or both are
> critical for detecting a problem with uclamp may explain why any overhead
> was missed. Even if it is marginal, it still makes sense to minimise the
> amount of uclamp code that is executed if no limit is specified for tasks.

One speculation I have about what might be causing the problem is that
accesses to struct uclamp_rq are causing bad cache behavior in your case. Your
mmtests description of netperf says that it is sensitive to cacheline
bouncing.

Looking at struct rq, the uclamp members span 2 cachelines:

 29954         /* --- cacheline 1 boundary (64 bytes) --- */
 29955         struct uclamp_rq           uclamp[2];            /*    64    96 */
 29956         /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
 29957         unsigned int               uclamp_flags;         /*   160     4 */
 29958
 29959         /* XXX 28 bytes hole, try to pack */
 29960

Reducing struct uclamp_bucket to use unsigned int instead of unsigned long
helps put it all in a single cacheline:

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index db3a57675ccf..63b5397a1708 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -833,8 +833,8 @@ extern void rto_push_irq_work_func(struct irq_work *work);
  * clamp value.
  */
 struct uclamp_bucket {
-       unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
-       unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
+       unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
+       unsigned int tasks : 32 - bits_per(SCHED_CAPACITY_SCALE);
 };

 /*

 29954         /* --- cacheline 1 boundary (64 bytes) --- */
 29955         struct uclamp_rq           uclamp[2];            /*    64    48 */
 29956         unsigned int               uclamp_flags;         /*   112     4 */
 29957

Is it something worth experimenting with?

Thanks

--
Qais Yousef


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-03  8:29           ` Patrick Bellasi
@ 2020-06-03 10:10             ` Mel Gorman
  2020-06-03 14:59               ` Vincent Guittot
  0 siblings, 1 reply; 68+ messages in thread
From: Mel Gorman @ 2020-06-03 10:10 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Dietmar Eggemann, Peter Zijlstra, Qais Yousef, Ingo Molnar,
	Randy Dunlap, Jonathan Corbet, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Quentin Perret, Valentin Schneider, Pavan Kondeti,
	linux-doc, linux-kernel, linux-fsdevel

On Wed, Jun 03, 2020 at 10:29:22AM +0200, Patrick Bellasi wrote:
> 
> Hi Dietmar,
> thanks for sharing these numbers.
> 
> On Tue, Jun 02, 2020 at 18:46:00 +0200, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote...
> 
> [...]
> 
> > I ran these tests on 'Ubuntu 18.04 Desktop' on Intel E5-2690 v2
> > (2 sockets * 10 cores * 2 threads) with powersave governor as:
> >
> > $ numactl -N 0 ./run-mmtests.sh XXX
> 
> Great setup, it's worth ruling out all possible noise sources (freq
> scaling, thermal throttling, NUMA scheduler, etc.).

config-network-netperf-cross-socket will do the binding of the server
and client to two CPUs that are on one socket. However, it does not take
care to avoid HT siblings although that could be implemented. The same
configuration should limit the CPU to C1. It does not change the governor
but all that would take is adding "cpupower frequency-set -g performance"
to the end of the configuration.

> Wondering if disabling HT can also help here in reducing results "noise"?
> 
> > w/ config-network-netperf-unbound.
> >
> > Running w/o 'numactl -N 0' gives slightly worse results.
> >
> > without-clamp      : CONFIG_UCLAMP_TASK is not set
> > with-clamp         : CONFIG_UCLAMP_TASK=y,
> >                      CONFIG_UCLAMP_TASK_GROUP is not set
> > with-clamp-tskgrp  : CONFIG_UCLAMP_TASK=y,
> >                      CONFIG_UCLAMP_TASK_GROUP=y
> >
> >
> > netperf-udp
> >                                 ./5.7.0-rc7            ./5.7.0-rc7            ./5.7.0-rc7
> >                               without-clamp             with-clamp      with-clamp-tskgrp
> 
> Can you please specify how to read the following scores? I gave it a run
> with my local netperf and it reports Throughput, thus I would expect the
> higher the better... but this seems to be something different.
> 
> > Hmean     send-64         153.62 (   0.00%)      151.80 *  -1.19%*      155.60 *   1.28%*
> > Hmean     send-128        306.77 (   0.00%)      306.27 *  -0.16%*      309.39 *   0.85%*
> > Hmean     send-256        608.54 (   0.00%)      604.28 *  -0.70%*      613.42 *   0.80%*
> > Hmean     send-1024      2395.80 (   0.00%)     2365.67 *  -1.26%*     2409.50 *   0.57%*
> > Hmean     send-2048      4608.70 (   0.00%)     4544.02 *  -1.40%*     4665.96 *   1.24%*
> > Hmean     send-3312      7223.97 (   0.00%)     7158.88 *  -0.90%*     7331.23 *   1.48%*
> > Hmean     send-4096      8729.53 (   0.00%)     8598.78 *  -1.50%*     8860.47 *   1.50%*
> > Hmean     send-8192     14961.77 (   0.00%)    14418.92 *  -3.63%*    14908.36 *  -0.36%*
> > Hmean     send-16384    25799.50 (   0.00%)    25025.64 *  -3.00%*    25831.20 *   0.12%*
> 
> If I read it as the lower the score the better, all the above results
> tell us that with-clamp is even better, while with-clamp-tskgrp
> is not that much worse.
> 

The figures are throughput, so taking the first line:

without-clamp		153.62
with-clamp		151.80 (worse, so the percentage difference is negative)
with-clamp-tskgrp	155.60 (better so the percentage different is positive)

> The other way around (the higher the score the better) would look odd:
> since we definitely add more code and complexity when uclamp has
> TG support enabled, we would not expect better scores.
> 

Netperf for small differences is very fickle as small differences in timing
or code layout can make a difference. Boot-to-boot variance can also be
an issue and bisection is generally unreliable. In this case, I relied on
the perf annotation and differences in ftrace function_graph to determine
that uclamp was introducing enough overhead to be considered a problem.

> > Hmean     recv-64         153.62 (   0.00%)      151.80 *  -1.19%*      155.60 *   1.28%*
> > Hmean     recv-128        306.77 (   0.00%)      306.27 *  -0.16%*      309.39 *   0.85%*
> > Hmean     recv-256        608.54 (   0.00%)      604.28 *  -0.70%*      613.42 *   0.80%*
> > Hmean     recv-1024      2395.80 (   0.00%)     2365.67 *  -1.26%*     2409.50 *   0.57%*
> > Hmean     recv-2048      4608.70 (   0.00%)     4544.02 *  -1.40%*     4665.95 *   1.24%*
> > Hmean     recv-3312      7223.97 (   0.00%)     7158.88 *  -0.90%*     7331.23 *   1.48%*
> > Hmean     recv-4096      8729.53 (   0.00%)     8598.78 *  -1.50%*     8860.47 *   1.50%*
> > Hmean     recv-8192     14961.61 (   0.00%)    14418.88 *  -3.63%*    14908.30 *  -0.36%*
> > Hmean     recv-16384    25799.39 (   0.00%)    25025.49 *  -3.00%*    25831.00 *   0.12%*
> >
> > netperf-tcp
> >  
> > Hmean     64              818.65 (   0.00%)      812.98 *  -0.69%*      826.17 *   0.92%*
> > Hmean     128            1569.55 (   0.00%)     1555.79 *  -0.88%*     1586.94 *   1.11%*
> > Hmean     256            2952.86 (   0.00%)     2915.07 *  -1.28%*     2968.15 *   0.52%*
> > Hmean     1024          10425.91 (   0.00%)    10296.68 *  -1.24%*    10418.38 *  -0.07%*
> > Hmean     2048          17454.51 (   0.00%)    17369.57 *  -0.49%*    17419.24 *  -0.20%*
> > Hmean     3312          22509.95 (   0.00%)    22229.69 *  -1.25%*    22373.32 *  -0.61%*
> > Hmean     4096          25033.23 (   0.00%)    24859.59 *  -0.69%*    24912.50 *  -0.48%*
> > Hmean     8192          32080.51 (   0.00%)    31744.51 *  -1.05%*    31800.45 *  -0.87%*
> > Hmean     16384         36531.86 (   0.00%)    37064.68 *   1.46%*    37397.71 *   2.37%*
> >
> > The diffs are smaller than on openSUSE Leap 15.1 and some of the
> > uclamp taskgroup results are better?
> >
> > With this test setup we now can play with the uclamp code in
> > enqueue_task() and dequeue_task().
> >
> > ---
> >
> > W/ config-network-netperf-unbound (only netperf-udp and buffer size 64):
> >
> > $ perf diff 5.7.0-rc7_without-clamp/perf.data 5.7.0-rc7_with-clamp/perf.data | grep activate_task
> >
> > # Event 'cycles:ppp'
> > #
> > # Baseline  Delta Abs  Shared Object            Symbol
> >
> >      0.02%     +0.54%  [kernel.vmlinux]         [k] activate_task
> >      0.02%     +0.38%  [kernel.vmlinux]         [k] deactivate_task
> >
> > $ perf diff 5.7.0-rc7_without-clamp/perf.data 5.7.0-rc7_with-clamp-tskgrp/perf.data | grep activate_task
> >
> >      0.02%     +0.35%  [kernel.vmlinux]         [k] activate_task
> >      0.02%     +0.34%  [kernel.vmlinux]         [k] deactivate_task
> 
> These data make more sense to me; AFAIR we measured <1% impact in the
> wakeup path using cyclictest.
> 

1% doesn't sound like a lot but UDP_STREAM is an example of a load with
a *lot* of wakeups so even though the impact on each individual wakeup
is small, it builds up.

> I would also suggest always reporting the overheads for
>   __update_load_avg_cfs_rq()
> as a reference point. We use that code quite a lot in the wakeup path
> and it's a good proxy for relative comparisons.
> 
> 
> > I still see 20 out of 90 tests with the warning message that the
> > desired confidence was not achieved though.
> 
> Where does the 90 come from? From the above table we run 9 sizes for
> {udp-send, udp-recv, tcp} and 3 kernels. Should that not give us 81 results?
> 
> Maybe the warnings are generated only when a test has to be repeated?

The warning is issued when it could not get a reliable result within the
iterations allowed.

> > "
> > !!! WARNING
> > !!! Desired confidence was not achieved within the specified iterations.
> > !!! This implies that there was variability in the test environment that
> > !!! must be investigated before going further.
> > !!! Confidence intervals: Throughput      : 6.727% <-- more than 5% !!!
> > !!!                       Local CPU util  : 0.000%
> > !!!                       Remote CPU util : 0.000%
> > "
> >
> > mmtests seems to run netperf with the following '-I' and 'i' parameter
> > hardcoded: 'netperf -t UDP_STREAM -i 3,3 -I 95,5' 
> 
> This means that we compute a score (average +/-2.5%) with 95% confidence.
> 
> Does that not mean that every +/-2.5% difference in the results
> above should be considered within the noise?
> 

Usually yes but the impact is small enough to be within noise but
still detectable. Where we get hurt is when there are multiple problems
introduced where each contribute overhead that is within the noise but when
all added together there is a regression outside the noise. Uclamp is not
special in this respect, it just happens to be the current focus.  We met
this type of problem before with PSI that was resolved by e0c274472d5d
("psi: make disabling/enabling easier for vendor kernels").

> I would say that it could be useful to run with more iterations
> and, given the small numbers we are looking at (apparently we are
> scared by a 1% overhead), we had better use a more aggressive CI.
> 
> What about something like:
> 
>    netperf -t UDP_STREAM -i 3,30 -I 99,1
> 
> ?
> 

You could but the runtime of netperf will be variable, it will not be
guaranteed to give consistent results and it may mask the true variability
of the workload. While we could debate which is a valid approach, I
think it makes sense to minimise the overhead of uclamp when it's not
configured even if that means putting it behind a static branch that is
enabled via a command-line parameter or a Kconfig that specifies whether
it's on or off by default.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-02 16:46         ` Dietmar Eggemann
  2020-06-03  8:29           ` Patrick Bellasi
@ 2020-06-03  9:40           ` Mel Gorman
  2020-06-03 12:41             ` Qais Yousef
  1 sibling, 1 reply; 68+ messages in thread
From: Mel Gorman @ 2020-06-03  9:40 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Qais Yousef, Ingo Molnar, Randy Dunlap,
	Jonathan Corbet, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Ben Segall, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

On Tue, Jun 02, 2020 at 06:46:00PM +0200, Dietmar Eggemann wrote:
> On 29.05.20 12:08, Mel Gorman wrote:
> > On Thu, May 28, 2020 at 06:11:12PM +0200, Peter Zijlstra wrote:
> >>> FWIW, I think you're referring to Mel's notice in OSPM regarding the overhead.
> >>> Trying to see what goes on in there.
> >>
> >> Indeed, that one. The fact that regular distros cannot enable this
> >> feature due to performance overhead is unfortunate. It means there is a
> >> lot less potential for this stuff.
> > 
> > During that talk, I was vague about the cost, admitted I had not looked
> > too closely at mainline performance and had since deleted the data given
> > that the problem was first spotted in early April. If I heard someone
> > else making statements like I did at the talk, I would consider it a bit
> > vague, potentially FUD, possibly wrong and worth rechecking myself. In
> > terms of distributions "cannot enable this", we could but I was unwilling
> > to pay the cost for a feature no one has asked for yet. If they had, I
> > would endeavour to put it behind static branches and disable it by default
> > (like what happened for PSI). I was contacted offlist about my comments
> > at OSPM and gathered new data to respond properly. For the record, here
> > is an edited version of my response:
> 
> [...]
> 
> I ran these tests on 'Ubuntu 18.04 Desktop' on Intel E5-2690 v2
> (2 sockets * 10 cores * 2 threads) with powersave governor as:
> 
> $ numactl -N 0 ./run-mmtests.sh XXX
> 
> w/ config-network-netperf-unbound.
> 
> Running w/o 'numactl -N 0' gives slightly worse results.
> 
> without-clamp      : CONFIG_UCLAMP_TASK is not set
> with-clamp         : CONFIG_UCLAMP_TASK=y,
>                      CONFIG_UCLAMP_TASK_GROUP is not set
> with-clamp-tskgrp  : CONFIG_UCLAMP_TASK=y,
>                      CONFIG_UCLAMP_TASK_GROUP=y
> 
> 
> netperf-udp
>                                 ./5.7.0-rc7            ./5.7.0-rc7            ./5.7.0-rc7
>                               without-clamp             with-clamp      with-clamp-tskgrp
> 
> Hmean     send-64         153.62 (   0.00%)      151.80 *  -1.19%*      155.60 *   1.28%*
> Hmean     send-128        306.77 (   0.00%)      306.27 *  -0.16%*      309.39 *   0.85%*
> Hmean     send-256        608.54 (   0.00%)      604.28 *  -0.70%*      613.42 *   0.80%*
> Hmean     send-1024      2395.80 (   0.00%)     2365.67 *  -1.26%*     2409.50 *   0.57%*
> Hmean     send-2048      4608.70 (   0.00%)     4544.02 *  -1.40%*     4665.96 *   1.24%*
> Hmean     send-3312      7223.97 (   0.00%)     7158.88 *  -0.90%*     7331.23 *   1.48%*
> Hmean     send-4096      8729.53 (   0.00%)     8598.78 *  -1.50%*     8860.47 *   1.50%*
> Hmean     send-8192     14961.77 (   0.00%)    14418.92 *  -3.63%*    14908.36 *  -0.36%*
> Hmean     send-16384    25799.50 (   0.00%)    25025.64 *  -3.00%*    25831.20 *   0.12%*
> Hmean     recv-64         153.62 (   0.00%)      151.80 *  -1.19%*      155.60 *   1.28%*
> Hmean     recv-128        306.77 (   0.00%)      306.27 *  -0.16%*      309.39 *   0.85%*
> Hmean     recv-256        608.54 (   0.00%)      604.28 *  -0.70%*      613.42 *   0.80%*
> Hmean     recv-1024      2395.80 (   0.00%)     2365.67 *  -1.26%*     2409.50 *   0.57%*
> Hmean     recv-2048      4608.70 (   0.00%)     4544.02 *  -1.40%*     4665.95 *   1.24%*
> Hmean     recv-3312      7223.97 (   0.00%)     7158.88 *  -0.90%*     7331.23 *   1.48%*
> Hmean     recv-4096      8729.53 (   0.00%)     8598.78 *  -1.50%*     8860.47 *   1.50%*
> Hmean     recv-8192     14961.61 (   0.00%)    14418.88 *  -3.63%*    14908.30 *  -0.36%*
> Hmean     recv-16384    25799.39 (   0.00%)    25025.49 *  -3.00%*    25831.00 *   0.12%*
> 
> netperf-tcp
>  
> Hmean     64              818.65 (   0.00%)      812.98 *  -0.69%*      826.17 *   0.92%*
> Hmean     128            1569.55 (   0.00%)     1555.79 *  -0.88%*     1586.94 *   1.11%*
> Hmean     256            2952.86 (   0.00%)     2915.07 *  -1.28%*     2968.15 *   0.52%*
> Hmean     1024          10425.91 (   0.00%)    10296.68 *  -1.24%*    10418.38 *  -0.07%*
> Hmean     2048          17454.51 (   0.00%)    17369.57 *  -0.49%*    17419.24 *  -0.20%*
> Hmean     3312          22509.95 (   0.00%)    22229.69 *  -1.25%*    22373.32 *  -0.61%*
> Hmean     4096          25033.23 (   0.00%)    24859.59 *  -0.69%*    24912.50 *  -0.48%*
> Hmean     8192          32080.51 (   0.00%)    31744.51 *  -1.05%*    31800.45 *  -0.87%*
> Hmean     16384         36531.86 (   0.00%)    37064.68 *   1.46%*    37397.71 *   2.37%*
> 
> The diffs are smaller than on openSUSE Leap 15.1 and some of the
> uclamp taskgroup results are better?
> 

I don't see the stddev and coeff but these look close to borderline.
Sure, they are marked with a * so they passed a significance test, but
it's still a very marginal difference for netperf. It's possible that the
systemd configurations differ in some way that is significant for uclamp
but I don't know what that is.

> With this test setup we now can play with the uclamp code in
> enqueue_task() and dequeue_task().
> 

That is still true. An annotated perf profile should tell you if the
uclamp code is being heavily used or if it's bailing early but it's also
possible that uclamp overhead is not a big deal on your particular
machine.

The possibility that either the distribution, the machine or both are
critical for detecting a problem with uclamp may explain why any overhead
was missed. Even if it is marginal, it still makes sense to minimise the
amount of uclamp code that is executed if no limit is specified for tasks.
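
The point about minimising the uclamp code executed when no task specifies a limit is essentially what a static-branch guard buys (the same treatment mentioned above for PSI). Below is a userspace Python sketch of that pattern; the names and the plain boolean are illustrative stand-ins, not kernel API (in the kernel this would be a static key, so the disabled path costs a patched-out branch rather than even a load):

```python
# Userspace stand-in for guarding a hot path behind an "is this feature
# actually used?" flag. All names here are illustrative, not real kernel
# symbols.
uclamp_used = False      # set once the first clamp request arrives
rq_clamp_work = 0        # stands in for per-rq clamp bucket accounting

def uclamp_rq_inc():
    """Called on every enqueue; must be near-free when uclamp is unused."""
    global rq_clamp_work
    if not uclamp_used:  # fast bail-out: feature compiled in but never used
        return
    rq_clamp_work += 1   # placeholder for the real min/max bucket update

uclamp_rq_inc()          # enqueue before anyone requested a clamp: no work
assert rq_clamp_work == 0

uclamp_used = True       # e.g. first clamp request via sched_setattr()
uclamp_rq_inc()          # now the accounting actually runs
assert rq_clamp_work == 1
```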

> ---
> 
> W/ config-network-netperf-unbound (only netperf-udp and buffer size 64):
> 
> $ perf diff 5.7.0-rc7_without-clamp/perf.data 5.7.0-rc7_with-clamp/perf.data | grep activate_task
> 
> # Event 'cycles:ppp'
> #
> # Baseline  Delta Abs  Shared Object            Symbol
> 
>      0.02%     +0.54%  [kernel.vmlinux]         [k] activate_task
>      0.02%     +0.38%  [kernel.vmlinux]         [k] deactivate_task
> 
> $ perf diff 5.7.0-rc7_without-clamp/perf.data 5.7.0-rc7_with-clamp-tskgrp/perf.data | grep activate_task
> 
>      0.02%     +0.35%  [kernel.vmlinux]         [k] activate_task
>      0.02%     +0.34%  [kernel.vmlinux]         [k] deactivate_task
> 
> ---
> 
> I still see 20 out of 90 tests with the warning message that the
> desired confidence was not achieved though.
> 
> "
> !!! WARNING
> !!! Desired confidence was not achieved within the specified iterations.
> !!! This implies that there was variability in the test environment that
> !!! must be investigated before going further.
> !!! Confidence intervals: Throughput      : 6.727% <-- more than 5% !!!
> !!!                       Local CPU util  : 0.000%
> !!!                       Remote CPU util : 0.000%
> "
> 
> mmtests seems to run netperf with the following '-I' and '-i' parameters
> hardcoded: 'netperf -t UDP_STREAM -i 3,3 -I 95,5'

The reason is that netperf on localhost can be a bit unreliable. It also
hits problems with shared locks and atomics that do not necessarily happen
when running netperf between two physical machines. When running netperf
with something like "-I 99,1" it can take a highly variable amount of
time to run and you are left with no clue how variable it really is or
whether it's anywhere close to the "true mean".  Hence, in mmtests I
opted to run netperf multiple times with low confidence to get an idea
of how variable the test is.
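
For reference, a rough sketch of the two statistics in play here: the harmonic mean (Hmean) that mmtests reports and the relative confidence-interval width that netperf's '-I 95,5' checks against. This is not mmtests' or netperf's actual code, and the throughput samples are fabricated for illustration:

```python
import statistics
from math import sqrt

def hmean(xs):
    """Harmonic mean, as in the Hmean rows mmtests prints."""
    return len(xs) / sum(1.0 / x for x in xs)

def ci_halfwidth_pct(xs, t=2.776):
    """Half-width of the 95% CI as a percentage of the mean.

    t=2.776 is the two-sided 95% t value for 5 samples (4 dof);
    netperf's '-I 95,5' demands this stay below +/-2.5% of the mean.
    """
    m = statistics.mean(xs)
    sem = statistics.stdev(xs) / sqrt(len(xs))
    return 100.0 * t * sem / m

runs = [153.1, 154.0, 153.6, 153.9, 153.4]  # fabricated throughput samples
print(hmean(runs))
print(ci_halfwidth_pct(runs) < 2.5)  # does it meet the 95,5 criterion?
```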

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-06-02 16:46         ` Dietmar Eggemann
@ 2020-06-03  8:29           ` Patrick Bellasi
  2020-06-03 10:10             ` Mel Gorman
  2020-06-03  9:40           ` Mel Gorman
  1 sibling, 1 reply; 68+ messages in thread
From: Patrick Bellasi @ 2020-06-03  8:29 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Mel Gorman, Peter Zijlstra, Qais Yousef, Ingo Molnar,
	Randy Dunlap, Jonathan Corbet, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Quentin Perret, Valentin Schneider, Pavan Kondeti,
	linux-doc, linux-kernel, linux-fsdevel


Hi Dietmar,
thanks for sharing these numbers.

On Tue, Jun 02, 2020 at 18:46:00 +0200, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote...

[...]

> I ran these tests on 'Ubuntu 18.04 Desktop' on Intel E5-2690 v2
> (2 sockets * 10 cores * 2 threads) with powersave governor as:
>
> $ numactl -N 0 ./run-mmtests.sh XXX

Great setup, it's worth ruling out all possible noise sources (freq
scaling, thermal throttling, NUMA scheduler, etc.).
I wonder if disabling HT could also help here in reducing result "noise"?

> w/ config-network-netperf-unbound.
>
> Running w/o 'numactl -N 0' gives slightly worse results.
>
> without-clamp      : CONFIG_UCLAMP_TASK is not set
> with-clamp         : CONFIG_UCLAMP_TASK=y,
>                      CONFIG_UCLAMP_TASK_GROUP is not set
> with-clamp-tskgrp  : CONFIG_UCLAMP_TASK=y,
>                      CONFIG_UCLAMP_TASK_GROUP=y
>
>
> netperf-udp
>                                 ./5.7.0-rc7            ./5.7.0-rc7            ./5.7.0-rc7
>                               without-clamp             with-clamp      with-clamp-tskgrp

Can you please specify how to read the following scores? I gave my
local netperf a run and it reports Throughput, thus I would expect the
higher the better... but this seems to be something different.

> Hmean     send-64         153.62 (   0.00%)      151.80 *  -1.19%*      155.60 *   1.28%*
> Hmean     send-128        306.77 (   0.00%)      306.27 *  -0.16%*      309.39 *   0.85%*
> Hmean     send-256        608.54 (   0.00%)      604.28 *  -0.70%*      613.42 *   0.80%*
> Hmean     send-1024      2395.80 (   0.00%)     2365.67 *  -1.26%*     2409.50 *   0.57%*
> Hmean     send-2048      4608.70 (   0.00%)     4544.02 *  -1.40%*     4665.96 *   1.24%*
> Hmean     send-3312      7223.97 (   0.00%)     7158.88 *  -0.90%*     7331.23 *   1.48%*
> Hmean     send-4096      8729.53 (   0.00%)     8598.78 *  -1.50%*     8860.47 *   1.50%*
> Hmean     send-8192     14961.77 (   0.00%)    14418.92 *  -3.63%*    14908.36 *  -0.36%*
> Hmean     send-16384    25799.50 (   0.00%)    25025.64 *  -3.00%*    25831.20 *   0.12%*

If I read it as "the lower the score the better", all the above results
tell us that with-clamp is even better, while with-clamp-tskgrp
is not that much worse.

The other way around (the higher the score the better) would look odd:
since we definitely add more code and complexity when uclamp has
TG support enabled, we would not expect better scores.

> Hmean     recv-64         153.62 (   0.00%)      151.80 *  -1.19%*      155.60 *   1.28%*
> Hmean     recv-128        306.77 (   0.00%)      306.27 *  -0.16%*      309.39 *   0.85%*
> Hmean     recv-256        608.54 (   0.00%)      604.28 *  -0.70%*      613.42 *   0.80%*
> Hmean     recv-1024      2395.80 (   0.00%)     2365.67 *  -1.26%*     2409.50 *   0.57%*
> Hmean     recv-2048      4608.70 (   0.00%)     4544.02 *  -1.40%*     4665.95 *   1.24%*
> Hmean     recv-3312      7223.97 (   0.00%)     7158.88 *  -0.90%*     7331.23 *   1.48%*
> Hmean     recv-4096      8729.53 (   0.00%)     8598.78 *  -1.50%*     8860.47 *   1.50%*
> Hmean     recv-8192     14961.61 (   0.00%)    14418.88 *  -3.63%*    14908.30 *  -0.36%*
> Hmean     recv-16384    25799.39 (   0.00%)    25025.49 *  -3.00%*    25831.00 *   0.12%*
>
> netperf-tcp
>  
> Hmean     64              818.65 (   0.00%)      812.98 *  -0.69%*      826.17 *   0.92%*
> Hmean     128            1569.55 (   0.00%)     1555.79 *  -0.88%*     1586.94 *   1.11%*
> Hmean     256            2952.86 (   0.00%)     2915.07 *  -1.28%*     2968.15 *   0.52%*
> Hmean     1024          10425.91 (   0.00%)    10296.68 *  -1.24%*    10418.38 *  -0.07%*
> Hmean     2048          17454.51 (   0.00%)    17369.57 *  -0.49%*    17419.24 *  -0.20%*
> Hmean     3312          22509.95 (   0.00%)    22229.69 *  -1.25%*    22373.32 *  -0.61%*
> Hmean     4096          25033.23 (   0.00%)    24859.59 *  -0.69%*    24912.50 *  -0.48%*
> Hmean     8192          32080.51 (   0.00%)    31744.51 *  -1.05%*    31800.45 *  -0.87%*
> Hmean     16384         36531.86 (   0.00%)    37064.68 *   1.46%*    37397.71 *   2.37%*
>
> The diffs are smaller than on openSUSE Leap 15.1 and some of the
> uclamp taskgroup results are better?
>
> With this test setup we now can play with the uclamp code in
> enqueue_task() and dequeue_task().
>
> ---
>
> W/ config-network-netperf-unbound (only netperf-udp and buffer size 64):
>
> $ perf diff 5.7.0-rc7_without-clamp/perf.data 5.7.0-rc7_with-clamp/perf.data | grep activate_task
>
> # Event 'cycles:ppp'
> #
> # Baseline  Delta Abs  Shared Object            Symbol
>
>      0.02%     +0.54%  [kernel.vmlinux]         [k] activate_task
>      0.02%     +0.38%  [kernel.vmlinux]         [k] deactivate_task
>
> $ perf diff 5.7.0-rc7_without-clamp/perf.data 5.7.0-rc7_with-clamp-tskgrp/perf.data | grep activate_task
>
>      0.02%     +0.35%  [kernel.vmlinux]         [k] activate_task
>      0.02%     +0.34%  [kernel.vmlinux]         [k] deactivate_task

These data make more sense to me; AFAIR we measured a <1% impact in the
wakeup path using cyclictest.

I would also suggest always reporting the overhead of
  __update_load_avg_cfs_rq()
as a reference point. We use that code quite a lot in the wakeup path
and it's a good proxy for relative comparisons.


> I still see 20 out of 90 tests with the warning message that the
> desired confidence was not achieved though.

Where does the 90 come from? From the table above we run 9 sizes for
{udp-send, udp-recv, tcp} and 3 kernels. Shouldn't that give us 81 results?

Maybe the warnings are generated only when a test has to be repeated?
Thus, are all the numbers above guaranteed to be within the specified CI?

> "
> !!! WARNING
> !!! Desired confidence was not achieved within the specified iterations.
> !!! This implies that there was variability in the test environment that
> !!! must be investigated before going further.
> !!! Confidence intervals: Throughput      : 6.727% <-- more than 5% !!!
> !!!                       Local CPU util  : 0.000%
> !!!                       Remote CPU util : 0.000%
> "
>
> mmtests seems to run netperf with the following '-I' and '-i' parameters
> hardcoded: 'netperf -t UDP_STREAM -i 3,3 -I 95,5'

This means that we compute a score (average +-2.5%) with 95% confidence.

Doesn't that mean that every +-2.5% difference in the results
above should be considered noise?

I would say that it could be useful to run with more iterations
and, given the small numbers we are looking at (apparently we are
worried about a 1% overhead), we had better use a more aggressive CI.

What about something like:

   netperf -t UDP_STREAM -i 3,30 -I 99,1

?
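
As a rough feel for what the tighter "99,1" criterion demands, here is a normal-approximation sketch (netperf itself uses a t-interval over the actual iterations; the coefficient-of-variation values below are hypothetical) of how many iterations it takes before the 99% CI shrinks to +/-0.5% of the mean:

```python
from math import ceil
from statistics import NormalDist

def iterations_needed(cov_pct, conf=0.99, halfwidth_pct=0.5):
    """Iterations until the CI at `conf` shrinks to +/-halfwidth_pct of
    the mean, given a run-to-run coefficient of variation of cov_pct."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)  # ~2.576 for 99%
    return max(2, ceil((z * cov_pct / halfwidth_pct) ** 2))

for cov in (0.5, 1.0, 3.0):  # hypothetical % run-to-run variation
    print(cov, iterations_needed(cov))
```

Already at a few percent of run-to-run variation this lands well above the 30-iteration cap in '-i 3,30', which would be consistent with some tests never reaching the requested confidence.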


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-05-29 10:08       ` Mel Gorman
  2020-05-29 16:04         ` Qais Yousef
@ 2020-06-02 16:46         ` Dietmar Eggemann
  2020-06-03  8:29           ` Patrick Bellasi
  2020-06-03  9:40           ` Mel Gorman
  1 sibling, 2 replies; 68+ messages in thread
From: Dietmar Eggemann @ 2020-06-02 16:46 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra
  Cc: Qais Yousef, Ingo Molnar, Randy Dunlap, Jonathan Corbet,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Quentin Perret,
	Valentin Schneider, Patrick Bellasi, Pavan Kondeti, linux-doc,
	linux-kernel, linux-fsdevel

On 29.05.20 12:08, Mel Gorman wrote:
> On Thu, May 28, 2020 at 06:11:12PM +0200, Peter Zijlstra wrote:
>>> FWIW, I think you're referring to Mel's notice in OSPM regarding the overhead.
>>> Trying to see what goes on in there.
>>
>> Indeed, that one. The fact that regular distros cannot enable this
>> feature due to performance overhead is unfortunate. It means there is a
>> lot less potential for this stuff.
> 
> During that talk, I was vague about the cost, admitted I had not looked
> too closely at mainline performance and had since deleted the data given
> that the problem was first spotted in early April. If I heard someone
> else making statements like I did at the talk, I would consider it a bit
> vague, potentially FUD, possibly wrong and worth rechecking myself. In
> terms of distributions "cannot enable this", we could but I was unwilling
> to pay the cost for a feature no one has asked for yet. If they had, I
> would endeavour to put it behind static branches and disable it by default
> (like what happened for PSI). I was contacted offlist about my comments
> at OSPM and gathered new data to respond properly. For the record, here
> is an edited version of my response;

[...]

I ran these tests on 'Ubuntu 18.04 Desktop' on Intel E5-2690 v2
(2 sockets * 10 cores * 2 threads) with powersave governor as:

$ numactl -N 0 ./run-mmtests.sh XXX

w/ config-network-netperf-unbound.

Running w/o 'numactl -N 0' gives slightly worse results.

without-clamp      : CONFIG_UCLAMP_TASK is not set
with-clamp         : CONFIG_UCLAMP_TASK=y,
                     CONFIG_UCLAMP_TASK_GROUP is not set
with-clamp-tskgrp  : CONFIG_UCLAMP_TASK=y,
                     CONFIG_UCLAMP_TASK_GROUP=y


netperf-udp
                                ./5.7.0-rc7            ./5.7.0-rc7            ./5.7.0-rc7
                              without-clamp             with-clamp      with-clamp-tskgrp

Hmean     send-64         153.62 (   0.00%)      151.80 *  -1.19%*      155.60 *   1.28%*
Hmean     send-128        306.77 (   0.00%)      306.27 *  -0.16%*      309.39 *   0.85%*
Hmean     send-256        608.54 (   0.00%)      604.28 *  -0.70%*      613.42 *   0.80%*
Hmean     send-1024      2395.80 (   0.00%)     2365.67 *  -1.26%*     2409.50 *   0.57%*
Hmean     send-2048      4608.70 (   0.00%)     4544.02 *  -1.40%*     4665.96 *   1.24%*
Hmean     send-3312      7223.97 (   0.00%)     7158.88 *  -0.90%*     7331.23 *   1.48%*
Hmean     send-4096      8729.53 (   0.00%)     8598.78 *  -1.50%*     8860.47 *   1.50%*
Hmean     send-8192     14961.77 (   0.00%)    14418.92 *  -3.63%*    14908.36 *  -0.36%*
Hmean     send-16384    25799.50 (   0.00%)    25025.64 *  -3.00%*    25831.20 *   0.12%*
Hmean     recv-64         153.62 (   0.00%)      151.80 *  -1.19%*      155.60 *   1.28%*
Hmean     recv-128        306.77 (   0.00%)      306.27 *  -0.16%*      309.39 *   0.85%*
Hmean     recv-256        608.54 (   0.00%)      604.28 *  -0.70%*      613.42 *   0.80%*
Hmean     recv-1024      2395.80 (   0.00%)     2365.67 *  -1.26%*     2409.50 *   0.57%*
Hmean     recv-2048      4608.70 (   0.00%)     4544.02 *  -1.40%*     4665.95 *   1.24%*
Hmean     recv-3312      7223.97 (   0.00%)     7158.88 *  -0.90%*     7331.23 *   1.48%*
Hmean     recv-4096      8729.53 (   0.00%)     8598.78 *  -1.50%*     8860.47 *   1.50%*
Hmean     recv-8192     14961.61 (   0.00%)    14418.88 *  -3.63%*    14908.30 *  -0.36%*
Hmean     recv-16384    25799.39 (   0.00%)    25025.49 *  -3.00%*    25831.00 *   0.12%*

netperf-tcp
 
Hmean     64              818.65 (   0.00%)      812.98 *  -0.69%*      826.17 *   0.92%*
Hmean     128            1569.55 (   0.00%)     1555.79 *  -0.88%*     1586.94 *   1.11%*
Hmean     256            2952.86 (   0.00%)     2915.07 *  -1.28%*     2968.15 *   0.52%*
Hmean     1024          10425.91 (   0.00%)    10296.68 *  -1.24%*    10418.38 *  -0.07%*
Hmean     2048          17454.51 (   0.00%)    17369.57 *  -0.49%*    17419.24 *  -0.20%*
Hmean     3312          22509.95 (   0.00%)    22229.69 *  -1.25%*    22373.32 *  -0.61%*
Hmean     4096          25033.23 (   0.00%)    24859.59 *  -0.69%*    24912.50 *  -0.48%*
Hmean     8192          32080.51 (   0.00%)    31744.51 *  -1.05%*    31800.45 *  -0.87%*
Hmean     16384         36531.86 (   0.00%)    37064.68 *   1.46%*    37397.71 *   2.37%*

The diffs are smaller than on openSUSE Leap 15.1 and some of the
uclamp taskgroup results are better?

With this test setup we now can play with the uclamp code in
enqueue_task() and dequeue_task().

---

W/ config-network-netperf-unbound (only netperf-udp and buffer size 64):

$ perf diff 5.7.0-rc7_without-clamp/perf.data 5.7.0-rc7_with-clamp/perf.data | grep activate_task

# Event 'cycles:ppp'
#
# Baseline  Delta Abs  Shared Object            Symbol

     0.02%     +0.54%  [kernel.vmlinux]         [k] activate_task
     0.02%     +0.38%  [kernel.vmlinux]         [k] deactivate_task

$ perf diff 5.7.0-rc7_without-clamp/perf.data 5.7.0-rc7_with-clamp-tskgrp/perf.data | grep activate_task

     0.02%     +0.35%  [kernel.vmlinux]         [k] activate_task
     0.02%     +0.34%  [kernel.vmlinux]         [k] deactivate_task

---

I still see 20 out of 90 tests with the warning message that the
desired confidence was not achieved though.

"
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      : 6.727% <-- more than 5% !!!
!!!                       Local CPU util  : 0.000%
!!!                       Remote CPU util : 0.000%
"

mmtests seems to run netperf with the following '-I' and '-i' parameters
hardcoded: 'netperf -t UDP_STREAM -i 3,3 -I 95,5'

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-05-29 16:04         ` Qais Yousef
@ 2020-05-29 16:57           ` Mel Gorman
  0 siblings, 0 replies; 68+ messages in thread
From: Mel Gorman @ 2020-05-29 16:57 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Peter Zijlstra, Ingo Molnar, Randy Dunlap, Jonathan Corbet,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

> > A lot of the uclamp functions appear to be inlined so it is not
> > particularly obvious from a raw profile but it shows up in the annotated
> > profile in activate_task and dequeue_task for example. In the case of
> > dequeue_task, uclamp_rq_dec_id() is extremely expensive according to the
> > annotated profile.
> > 
> > I'm afraid I did not dig into this deeply once I knew I could just disable
> > it even within the distribution.
> 
> Could the vmlinux (with debug symbols, hopefully) and perf.data by any
> chance still be lying around to share?
> 

I didn't preserve the vmlinux files. I can recreate them if you have
problems reproducing this locally. The "perf archive" files and profile
data can be downloaded at
http://www.skynet.ie/~mel/postings/netperf-20200529/profile.tar.gz which
should be enough for an annotated profile to compare with a local run.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-05-29 16:02             ` Mel Gorman
@ 2020-05-29 16:05               ` Qais Yousef
  0 siblings, 0 replies; 68+ messages in thread
From: Qais Yousef @ 2020-05-29 16:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Randy Dunlap, Jonathan Corbet,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

On 05/29/20 17:02, Mel Gorman wrote:
> On Fri, May 29, 2020 at 04:11:18PM +0100, Qais Yousef wrote:
> > > Elsewhere in the thread, I showed some results based on 5.7 so uclamp
> > > task group existed but I had it disabled. The uclamp related parts of
> > > the kconfig were
> > > 
> > > # zgrep UCLAMP kconfig-5.7.0-rc7-with-clamp.txt.gz
> > > CONFIG_UCLAMP_TASK=y
> > > CONFIG_UCLAMP_BUCKETS_COUNT=5
> > > # CONFIG_UCLAMP_TASK_GROUP is not set
> > 
> > So you never had the TASK_GROUP part enabled when you noticed the regression?
> 
> Correct.
> 
> > Or is it the other way around, you just disabled CONFIG_UCLAMP_TASK_GROUP to
> > 'fix' it?
> > 
> 
> I disabled CONFIG_UCLAMP_TASK to "fix" it.

Okay. That eliminates one thing out at least.

Thanks

--
Qais Yousef

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-05-29 10:08       ` Mel Gorman
@ 2020-05-29 16:04         ` Qais Yousef
  2020-05-29 16:57           ` Mel Gorman
  2020-06-02 16:46         ` Dietmar Eggemann
  1 sibling, 1 reply; 68+ messages in thread
From: Qais Yousef @ 2020-05-29 16:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Randy Dunlap, Jonathan Corbet,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

On 05/29/20 11:08, Mel Gorman wrote:
> On Thu, May 28, 2020 at 06:11:12PM +0200, Peter Zijlstra wrote:
> > > FWIW, I think you're referring to Mel's notice in OSPM regarding the overhead.
> > > Trying to see what goes on in there.
> > 
> > Indeed, that one. The fact that regular distros cannot enable this
> > feature due to performance overhead is unfortunate. It means there is a
> > lot less potential for this stuff.
> 
> During that talk, I was vague about the cost, admitted I had not looked
> too closely at mainline performance and had since deleted the data given
> that the problem was first spotted in early April. If I heard someone
> else making statements like I did at the talk, I would consider it a bit
> vague, potentially FUD, possibly wrong and worth rechecking myself. In
> terms of distributions "cannot enable this", we could but I was unwilling
> to pay the cost for a feature no one has asked for yet. If they had, I
> would endeavour to put it behind static branches and disable it by default
> (like what happened for PSI). I was contacted offlist about my comments
> at OSPM and gathered new data to respond properly. For the record, here
> is an edited version of my response;

I had this impression too; that's why I made a rather humble attempt.

[...]

> # Event 'cycles:ppp'
> #
> # Baseline  Delta Abs  Shared Object             Symbol
> # ........  .........  ........................  ..............................................
> #
>      9.59%     -2.87%  [kernel.vmlinux]          [k] poll_idle
>      0.19%     +1.85%  [kernel.vmlinux]          [k] activate_task
>                +1.17%  [kernel.vmlinux]          [k] dequeue_task
>                +0.89%  [kernel.vmlinux]          [k] update_rq_clock.part.73
>      3.88%     +0.73%  [kernel.vmlinux]          [k] try_to_wake_up
>      3.17%     +0.68%  [kernel.vmlinux]          [k] __schedule
>      1.16%     -0.60%  [kernel.vmlinux]          [k] __update_load_avg_cfs_rq
>      2.20%     -0.54%  [kernel.vmlinux]          [k] resched_curr
>      2.08%     -0.29%  [kernel.vmlinux]          [k] _raw_spin_lock_irqsave
>      0.44%     -0.29%  [kernel.vmlinux]          [k] cpus_share_cache
>      1.13%     +0.23%  [kernel.vmlinux]          [k] _raw_spin_lock_bh
> 
> A lot of the uclamp functions appear to be inlined so it is not
> particularly obvious from a raw profile but it shows up in the annotated
> profile in activate_task and dequeue_task for example. In the case of
> dequeue_task, uclamp_rq_dec_id() is extremely expensive according to the
> annotated profile.
> 
> I'm afraid I did not dig into this deeply once I knew I could just disable
> it even within the distribution.

Could the vmlinux (with debug symbols, hopefully) and perf.data by any
chance still be lying around to share?

I could send you a link to drop them somewhere.

Thanks

--
Qais Yousef

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-05-29 15:11           ` Qais Yousef
@ 2020-05-29 16:02             ` Mel Gorman
  2020-05-29 16:05               ` Qais Yousef
  0 siblings, 1 reply; 68+ messages in thread
From: Mel Gorman @ 2020-05-29 16:02 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Peter Zijlstra, Ingo Molnar, Randy Dunlap, Jonathan Corbet,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

On Fri, May 29, 2020 at 04:11:18PM +0100, Qais Yousef wrote:
> > Elsewhere in the thread, I showed some results based on 5.7 so uclamp
> > task group existed but I had it disabled. The uclamp related parts of
> > the kconfig were
> > 
> > # zgrep UCLAMP kconfig-5.7.0-rc7-with-clamp.txt.gz
> > CONFIG_UCLAMP_TASK=y
> > CONFIG_UCLAMP_BUCKETS_COUNT=5
> > # CONFIG_UCLAMP_TASK_GROUP is not set
> 
> So you never had the TASK_GROUP part enabled when you noticed the regression?

Correct.

> Or is it the other way around, you just disabled CONFIG_UCLAMP_TASK_GROUP to
> 'fix' it?
> 

I disabled CONFIG_UCLAMP_TASK to "fix" it.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-05-29 10:21         ` Mel Gorman
@ 2020-05-29 15:11           ` Qais Yousef
  2020-05-29 16:02             ` Mel Gorman
  0 siblings, 1 reply; 68+ messages in thread
From: Qais Yousef @ 2020-05-29 15:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Randy Dunlap, Jonathan Corbet,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

On 05/29/20 11:21, Mel Gorman wrote:
> On Thu, May 28, 2020 at 05:51:31PM +0100, Qais Yousef wrote:
> > > Indeed, that one. The fact that regular distros cannot enable this
> > > feature due to performance overhead is unfortunate. It means there is a
> > > lot less potential for this stuff.
> > 
> > I made a humble attempt to catch the overhead but wasn't successful, so
> > at least the observation wasn't missed by us.
> > 
> 
> As with all things, it's perfectly possible I was looking at a workload
> where the cost is more obvious but given that the functions are inlined,
> it's not trivial to spot. I just happened to spot it because I was paying
> close attention to try_to_wake_up() at the time.

Indeed.

> 
> > On my Ubuntu 18.04 machine uclamp is enabled by default by the way. 5.3 kernel
> > though, so uclamp task group stuff not there yet. Should check how their server
> > distro looks like.
> > 
> 
> Elsewhere in the thread, I showed some results based on 5.7 so uclamp
> task group existed but I had it disabled. The uclamp related parts of
> the kconfig were
> 
> # zgrep UCLAMP kconfig-5.7.0-rc7-with-clamp.txt.gz
> CONFIG_UCLAMP_TASK=y
> CONFIG_UCLAMP_BUCKETS_COUNT=5
> # CONFIG_UCLAMP_TASK_GROUP is not set

So you never had the TASK_GROUP part enabled when you noticed the regression?
Or is it the other way around, you just disabled CONFIG_UCLAMP_TASK_GROUP to
'fix' it?

Thanks

--
Qais Yousef

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-05-28 16:51       ` Qais Yousef
  2020-05-28 18:29         ` Peter Zijlstra
@ 2020-05-29 10:21         ` Mel Gorman
  2020-05-29 15:11           ` Qais Yousef
  1 sibling, 1 reply; 68+ messages in thread
From: Mel Gorman @ 2020-05-29 10:21 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Peter Zijlstra, Ingo Molnar, Randy Dunlap, Jonathan Corbet,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

On Thu, May 28, 2020 at 05:51:31PM +0100, Qais Yousef wrote:
> > Indeed, that one. The fact that regular distros cannot enable this
> > feature due to performance overhead is unfortunate. It means there is a
> > lot less potential for this stuff.
> 
> I made a humble attempt to catch the overhead but wasn't successful, so
> at least the observation wasn't missed by us.
> 

As with all things, it's perfectly possible I was looking at a workload
where the cost is more obvious but given that the functions are inlined,
it's not trivial to spot. I just happened to spot it because I was paying
close attention to try_to_wake_up() at the time.

> On my Ubuntu 18.04 machine uclamp is enabled by default by the way. 5.3 kernel
> though, so uclamp task group stuff not there yet. Should check how their server
> distro looks like.
> 

Elsewhere in the thread, I showed some results based on 5.7 so uclamp
task group existed but I had it disabled. The uclamp related parts of
the kconfig were

# zgrep UCLAMP kconfig-5.7.0-rc7-with-clamp.txt.gz
CONFIG_UCLAMP_TASK=y
CONFIG_UCLAMP_BUCKETS_COUNT=5
# CONFIG_UCLAMP_TASK_GROUP is not set

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-05-28 16:11     ` Peter Zijlstra
  2020-05-28 16:51       ` Qais Yousef
@ 2020-05-29 10:08       ` Mel Gorman
  2020-05-29 16:04         ` Qais Yousef
  2020-06-02 16:46         ` Dietmar Eggemann
  1 sibling, 2 replies; 68+ messages in thread
From: Mel Gorman @ 2020-05-29 10:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Qais Yousef, Ingo Molnar, Randy Dunlap, Jonathan Corbet,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

On Thu, May 28, 2020 at 06:11:12PM +0200, Peter Zijlstra wrote:
> > FWIW, I think you're referring to Mel's notice in OSPM regarding the overhead.
> > Trying to see what goes on in there.
> 
> Indeed, that one. The fact that regular distros cannot enable this
> feature due to performance overhead is unfortunate. It means there is a
> lot less potential for this stuff.

During that talk, I was vague about the cost, admitted I had not looked
too closely at mainline performance and had since deleted the data given
that the problem was first spotted in early April. If I heard someone
else making statements like I did at the talk, I would consider it a bit
vague, potentially FUD, possibly wrong and worth rechecking myself. In
terms of distributions "cannot enable this", we could but I was unwilling
to pay the cost for a feature no one has asked for yet. If they had, I
would endeavour to put it behind static branches and disable it by default
(like what happened for PSI). I was contacted offlist about my comments
at OSPM and gathered new data to respond properly. For the record, here
is an edited version of my response;

--8<--

(Some context deleted that is not relevant)

> Does it need any special admin configuration for system
> services, cgroups, scripts, etc?

Nothing special -- out of box configuration. Tests were executed via
mmtests.

> Which mmtests config file did you use?
> 

I used network-netperf-unbound and network-netperf-cstate.
network-netperf-unbound is usually the default, but for some issues I
use the cstate configuration to limit C-states.

For a perf profile, I used network-netperf-cstate-small and
network-netperf-unbound-small to limit the amount of profile data that
was collected. Just collecting data for 64 byte buffers was enough.

> The server that I am going to configure is x86_64 numa, not arm64.

That's fine, I didn't actually test arm64 at all.

> I have a 2 socket 24 CPUs X86 server (4 NUMA nodes, AMD Opteron 6174,
> L2 512KB/cpu, L3 6MB/node, RAM 40GB/node).
> Which machine did you run it on?
> 

It was a 2-socket Haswell machine (E5-2670 v3) with 2 NUMA nodes. I used
5.7-rc7 with the openSUSE Leap 15.1 kernel configuration as a baseline.
I compared with and without uclamp enabled.

For network-netperf-unbound I see

netperf-udp
                                  5.7.0-rc7              5.7.0-rc7
                                 with-clamp          without-clamp
Hmean     send-64         238.52 (   0.00%)      257.28 *   7.87%*
Hmean     send-128        477.10 (   0.00%)      511.57 *   7.23%*
Hmean     send-256        945.53 (   0.00%)      982.50 *   3.91%*
Hmean     send-1024      3655.74 (   0.00%)     3846.98 *   5.23%*
Hmean     send-2048      6926.84 (   0.00%)     7247.04 *   4.62%*
Hmean     send-3312     10767.47 (   0.00%)    10976.73 (   1.94%)
Hmean     send-4096     12821.77 (   0.00%)    13506.03 *   5.34%*
Hmean     send-8192     22037.72 (   0.00%)    22275.29 (   1.08%)
Hmean     send-16384    35935.31 (   0.00%)    34737.63 *  -3.33%*
Hmean     recv-64         238.52 (   0.00%)      257.28 *   7.87%*
Hmean     recv-128        477.10 (   0.00%)      511.57 *   7.23%*
Hmean     recv-256        945.45 (   0.00%)      982.50 *   3.92%*
Hmean     recv-1024      3655.74 (   0.00%)     3846.98 *   5.23%*
Hmean     recv-2048      6926.84 (   0.00%)     7246.51 *   4.62%*
Hmean     recv-3312     10767.47 (   0.00%)    10975.93 (   1.94%)
Hmean     recv-4096     12821.76 (   0.00%)    13506.02 *   5.34%*
Hmean     recv-8192     22037.71 (   0.00%)    22274.55 (   1.07%)
Hmean     recv-16384    35934.82 (   0.00%)    34737.50 *  -3.33%*

netperf-tcp
                             5.7.0-rc7              5.7.0-rc7
                            with-clamp          without-clamp
Min       64        2004.71 (   0.00%)     2033.23 (   1.42%)
Min       128       3657.58 (   0.00%)     3733.35 (   2.07%)
Min       256       6063.25 (   0.00%)     6105.67 (   0.70%)
Min       1024     18152.50 (   0.00%)    18487.00 (   1.84%)
Min       2048     28544.54 (   0.00%)    29218.11 (   2.36%)
Min       3312     33962.06 (   0.00%)    36094.97 (   6.28%)
Min       4096     36234.82 (   0.00%)    38223.60 (   5.49%)
Min       8192     42324.06 (   0.00%)    43328.72 (   2.37%)
Min       16384    44323.33 (   0.00%)    45315.21 (   2.24%)
Hmean     64        2018.36 (   0.00%)     2038.53 *   1.00%*
Hmean     128       3700.12 (   0.00%)     3758.20 *   1.57%*
Hmean     256       6236.14 (   0.00%)     6212.77 (  -0.37%)
Hmean     1024     18214.97 (   0.00%)    18601.01 *   2.12%*
Hmean     2048     28749.56 (   0.00%)    29728.26 *   3.40%*
Hmean     3312     34585.50 (   0.00%)    36345.09 *   5.09%*
Hmean     4096     36777.62 (   0.00%)    38576.17 *   4.89%*
Hmean     8192     43149.08 (   0.00%)    43903.77 *   1.75%*
Hmean     16384    45478.27 (   0.00%)    46372.93 (   1.97%)

The cstate-limited config had similar results for UDP_STREAM but was
mostly indifferent for TCP_STREAM.

So for UDP_STREAM, there is a fairly sizable difference for uclamp. There
are caveats: netperf is not 100% stable from a performance perspective on
NUMA machines. That has improved quite a bit with 5.7, but it still
should be treated with care.
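For readers unfamiliar with the mmtests report format: the Hmean rows are harmonic means of the per-iteration throughputs, and the bracketed percentage is the relative change versus the baseline (the asterisks appear to mark statistically significant differences). A minimal sketch of that arithmetic, using made-up sample values rather than the real measurements above:

```python
from statistics import harmonic_mean

# Illustrative per-iteration throughput samples (NOT the real data above)
with_clamp = [236.0, 240.1, 239.5]     # baseline kernel
without_clamp = [256.0, 258.9, 257.0]  # comparison kernel

h_base = harmonic_mean(with_clamp)
h_new = harmonic_mean(without_clamp)

# Relative change shown in the bracketed column; positive means the
# comparison kernel achieved higher throughput than the baseline.
delta_pct = (h_new - h_base) / h_base * 100

print(f"Hmean baseline:   {h_base:.2f}")
print(f"Hmean comparison: {h_new:.2f}")
print(f"Delta:            {delta_pct:+.2f}%")
```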

When I first saw a problem, I was using ftrace looking for latencies and
uclamp appeared to crop up. As I didn't actually need uclamp and there was
no user request to support it, I simply dropped it from the master config
so it would get propagated to any distro we release with a 5.x kernel.

From a perf profile, it's not particularly obvious that uclamp is
involved so it could be in error but I doubt it. A diff of without vs
with looks like

# Event 'cycles:ppp'
#
# Baseline  Delta Abs  Shared Object             Symbol
# ........  .........  ........................  ..............................................
#
     9.59%     -2.87%  [kernel.vmlinux]          [k] poll_idle
     0.19%     +1.85%  [kernel.vmlinux]          [k] activate_task
               +1.17%  [kernel.vmlinux]          [k] dequeue_task
               +0.89%  [kernel.vmlinux]          [k] update_rq_clock.part.73
     3.88%     +0.73%  [kernel.vmlinux]          [k] try_to_wake_up
     3.17%     +0.68%  [kernel.vmlinux]          [k] __schedule
     1.16%     -0.60%  [kernel.vmlinux]          [k] __update_load_avg_cfs_rq
     2.20%     -0.54%  [kernel.vmlinux]          [k] resched_curr
     2.08%     -0.29%  [kernel.vmlinux]          [k] _raw_spin_lock_irqsave
     0.44%     -0.29%  [kernel.vmlinux]          [k] cpus_share_cache
     1.13%     +0.23%  [kernel.vmlinux]          [k] _raw_spin_lock_bh

A lot of the uclamp functions appear to be inlined, so it is not
particularly obvious from a raw profile, but it shows up in the annotated
profile in activate_task and dequeue_task for example. In the case of
dequeue_task, uclamp_rq_dec_id() is extremely expensive according to the
annotated profile.

I'm afraid I did not dig into this deeply once I knew I could just disable
it even within the distribution.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-05-28 18:29         ` Peter Zijlstra
  2020-05-28 19:08           ` Patrick Bellasi
  2020-05-28 19:20           ` Dietmar Eggemann
@ 2020-05-29  9:11           ` Qais Yousef
  2 siblings, 0 replies; 68+ messages in thread
From: Qais Yousef @ 2020-05-29  9:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

On 05/28/20 20:29, Peter Zijlstra wrote:
> On Thu, May 28, 2020 at 05:51:31PM +0100, Qais Yousef wrote:
> 
> > In my head, the simpler version of
> > 
> > 	if (rt_task(p) && !uc->user_defined)
> > 		// update_uclamp_min
> > 
> > Is a single branch and write to cache, so should be fast. I'm failing to see
> > how this could generate an overhead tbh, but will not argue about it :-)
> 
> Mostly true; but you also had a load of that sysctl in there, which is
> likely to be a miss, and those are expensive.

Hmm, yes, there's no guarantee the sysctl global variable will be in the
LLC, though I thought that would be the likely case.

> 
> Also; if we're going to have to optimize this, less logic is in there,
> the less we need to take out. Esp. for stuff that 'never' changes, like
> this.

Agreed.

> 
> > > It's more code, but it is all outside of the normal paths where we care
> > > about performance.
> > 
> > I am happy to take that direction if you think it's worth it. I'm thinking
> > task_woken_rt() is good. But again, maybe I am missing something.
> 
> Basic rule, if the state 'never' changes, don't touch fast paths.
> 
> Such little things can be very difficult to measure, but at some point
> they cause death-by-a-thousand-cuts.

Yeah, we're bound to reach critical mass at some point if too much bloat
creeps into the hot path.

Thanks

--
Qais Yousef

> 
> > > Indeed, that one. The fact that regular distros cannot enable this
> > > feature due to performance overhead is unfortunate. It means there is a
> > > lot less potential for this stuff.
> > 
> > I had a humble try at catching the overhead but wasn't successful. The
> > observation wasn't missed by us either, then.
> 
> Right, I remember us doing benchmarks when we introduced all this and
> clearly we missed something. It would be good if Mel can share which
> benchmark hurt most so we can go have a look.


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-05-28 18:29         ` Peter Zijlstra
  2020-05-28 19:08           ` Patrick Bellasi
@ 2020-05-28 19:20           ` Dietmar Eggemann
  2020-05-29  9:11           ` Qais Yousef
  2 siblings, 0 replies; 68+ messages in thread
From: Dietmar Eggemann @ 2020-05-28 19:20 UTC (permalink / raw)
  To: Peter Zijlstra, Qais Yousef
  Cc: Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Quentin Perret,
	Valentin Schneider, Patrick Bellasi, Pavan Kondeti, linux-doc,
	linux-kernel, linux-fsdevel

On 28/05/2020 20:29, Peter Zijlstra wrote:
> On Thu, May 28, 2020 at 05:51:31PM +0100, Qais Yousef wrote:
> 
>> In my head, the simpler version of
>>
>> 	if (rt_task(p) && !uc->user_defined)
>> 		// update_uclamp_min
>>
>> Is a single branch and write to cache, so should be fast. I'm failing to see
>> how this could generate an overhead tbh, but will not argue about it :-)
> 
> Mostly true; but you also had a load of that sysctl in there, which is
> likely to be a miss, and those are expensive.
> 
> Also; if we're going to have to optimize this, less logic is in there,
> the less we need to take out. Esp. for stuff that 'never' changes, like
> this.
> 
>>> It's more code, but it is all outside of the normal paths where we care
>>> about performance.
>>
>> I am happy to take that direction if you think it's worth it. I'm thinking
>> task_woken_rt() is good. But again, maybe I am missing something.
> 
> Basic rule, if the state 'never' changes, don't touch fast paths.
> 
> Such little things can be very difficult to measure, but at some point
> they cause death-by-a-thousand-cuts.
> 
>>> Indeed, that one. The fact that regular distros cannot enable this
>>> feature due to performance overhead is unfortunate. It means there is a
>>> lot less potential for this stuff.
>>
>> I had a humble try at catching the overhead but wasn't successful. The
>> observation wasn't missed by us either, then.
> 
> Right, I remember us doing benchmarks when we introduced all this and
> clearly we missed something. It would be good if Mel can share which
> benchmark hurt most so we can go have a look.

IIRC, it was a local mmtests netperf-udp with various buffer sizes?

At least that's what we're trying to run right now on a '2 Sockets Xeon
E5 2x10-Cores (40 CPUs)' w/ 3 different kernels: (1) wo_clamp, (2)
tsk_uclamp, (3) tskgrp_uclamp.

We currently have Ubuntu Desktop on it. I think that systemd uses
cgroups (especially the cpu controller) differently on an (Ubuntu)
Server. Maybe this has an influence here as well?


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-05-28 18:29         ` Peter Zijlstra
@ 2020-05-28 19:08           ` Patrick Bellasi
  2020-05-28 19:20           ` Dietmar Eggemann
  2020-05-29  9:11           ` Qais Yousef
  2 siblings, 0 replies; 68+ messages in thread
From: Patrick Bellasi @ 2020-05-28 19:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Qais Yousef, Giovanni Gherdovich, Ingo Molnar, Randy Dunlap,
	Jonathan Corbet, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Quentin Perret, Valentin Schneider,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel


[+Giovanni]

On Thu, May 28, 2020 at 20:29:14 +0200, Peter Zijlstra <peterz@infradead.org> wrote...

> On Thu, May 28, 2020 at 05:51:31PM +0100, Qais Yousef wrote:

>> I had a humble try at catching the overhead but wasn't successful. The
>> observation wasn't missed by us either, then.
>
> Right, I remember us doing benchmarks when we introduced all this and
> clearly we missed something. It would be good if Mel can share which
> benchmark hurt most so we can go have a look.

Indeed, it would be great to have a description of their test setup and
results. Perhaps Giovanni can also support us on that.



* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-05-28 16:51       ` Qais Yousef
@ 2020-05-28 18:29         ` Peter Zijlstra
  2020-05-28 19:08           ` Patrick Bellasi
                             ` (2 more replies)
  2020-05-29 10:21         ` Mel Gorman
  1 sibling, 3 replies; 68+ messages in thread
From: Peter Zijlstra @ 2020-05-28 18:29 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

On Thu, May 28, 2020 at 05:51:31PM +0100, Qais Yousef wrote:

> In my head, the simpler version of
> 
> 	if (rt_task(p) && !uc->user_defined)
> 		// update_uclamp_min
> 
> Is a single branch and write to cache, so should be fast. I'm failing to see
> how this could generate an overhead tbh, but will not argue about it :-)

Mostly true; but you also had a load of that sysctl in there, which is
likely to be a miss, and those are expensive.

Also; if we're going to have to optimize this, less logic is in there,
the less we need to take out. Esp. for stuff that 'never' changes, like
this.

> > It's more code, but it is all outside of the normal paths where we care
> > about performance.
> 
> I am happy to take that direction if you think it's worth it. I'm thinking
> task_woken_rt() is good. But again, maybe I am missing something.

Basic rule, if the state 'never' changes, don't touch fast paths.

Such little things can be very difficult to measure, but at some point
they cause death-by-a-thousand-cuts.

> > Indeed, that one. The fact that regular distros cannot enable this
> > feature due to performance overhead is unfortunate. It means there is a
> > lot less potential for this stuff.
> 
> I had a humble try at catching the overhead but wasn't successful. The
> observation wasn't missed by us either, then.

Right, I remember us doing benchmarks when we introduced all this and
clearly we missed something. It would be good if Mel can share which
benchmark hurt most so we can go have a look.


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-05-28 16:11     ` Peter Zijlstra
@ 2020-05-28 16:51       ` Qais Yousef
  2020-05-28 18:29         ` Peter Zijlstra
  2020-05-29 10:21         ` Mel Gorman
  2020-05-29 10:08       ` Mel Gorman
  1 sibling, 2 replies; 68+ messages in thread
From: Qais Yousef @ 2020-05-28 16:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

On 05/28/20 18:11, Peter Zijlstra wrote:
> On Thu, May 28, 2020 at 04:58:01PM +0100, Qais Yousef wrote:
> > On 05/28/20 15:23, Peter Zijlstra wrote:
> 
> > > So afaict this is directly added to the enqueue/dequeue path, and we've
> > > recently already had complaints that uclamp is too slow.
> > 
> > I wanted to keep this function simpler.
> 
> Right; I appreciate that, but as always it's a balance between simple
> and performance :-)

Sure :-)

In my head, the simpler version of

	if (rt_task(p) && !uc->user_defined)
		// update_uclamp_min

Is a single branch and write to cache, so should be fast. I'm failing to see
how this could generate an overhead tbh, but will not argue about it :-)

> 
> > > Is there really no other way?
> > 
> > There is my first attempt which performs the sync @ task_woken_rt().
> > 
> > https://lore.kernel.org/lkml/20191220164838.31619-1-qais.yousef@arm.com/
> > 
> > I can revert the sync function to the simpler version defined in that patch
> > too.
> > 
> > I can potentially move this to uclamp_eff_value() too. Will need to think more
> > if this is enough. If task_woken_rt() is good for you, I'd say that's more
> > obviously correct and better to go with it.
> 
> task_woken_rt() is better, because that only slows down RT tasks, but
> I'm thinking we can do even better by simply setting the default such
> that new tasks pick it up and then (rcu) iterating all existing tasks
> and modifying them.
> 
> It's more code, but it is all outside of the normal paths where we care
> about performance.

I am happy to take that direction if you think it's worth it. I'm thinking
task_woken_rt() is good. But again, maybe I am missing something.

> 
> > FWIW, I think you're referring to Mel's notice in OSPM regarding the overhead.
> > Trying to see what goes on in there.
> 
> Indeed, that one. The fact that regular distros cannot enable this
> feature due to performance overhead is unfortunate. It means there is a
> lot less potential for this stuff.

I had a humble try at catching the overhead but wasn't successful. The
observation wasn't missed by us either, then.

On my Ubuntu 18.04 machine uclamp is enabled by default, by the way. It's
a 5.3 kernel though, so the uclamp task group stuff isn't there yet. I
should check what their server distro config looks like.

We don't want to lose that potential!

Thanks

--
Qais Yousef


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-05-28 15:58   ` Qais Yousef
@ 2020-05-28 16:11     ` Peter Zijlstra
  2020-05-28 16:51       ` Qais Yousef
  2020-05-29 10:08       ` Mel Gorman
  0 siblings, 2 replies; 68+ messages in thread
From: Peter Zijlstra @ 2020-05-28 16:11 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

On Thu, May 28, 2020 at 04:58:01PM +0100, Qais Yousef wrote:
> On 05/28/20 15:23, Peter Zijlstra wrote:

> > So afaict this is directly added to the enqueue/dequeue path, and we've
> > recently already had complaints that uclamp is too slow.
> 
> I wanted to keep this function simpler.

Right; I appreciate that, but as always it's a balance between simple
and performance :-)

> > Is there really no other way?
> 
> There is my first attempt which performs the sync @ task_woken_rt().
> 
> https://lore.kernel.org/lkml/20191220164838.31619-1-qais.yousef@arm.com/
> 
> I can revert the sync function to the simpler version defined in that patch
> too.
> 
> I can potentially move this to uclamp_eff_value() too. Will need to think more
> if this is enough. If task_woken_rt() is good for you, I'd say that's more
> obviously correct and better to go with it.

task_woken_rt() is better, because that only slows down RT tasks, but
I'm thinking we can do even better by simply setting the default such
that new tasks pick it up and then (rcu) iterating all existing tasks
and modifying them.

It's more code, but it is all outside of the normal paths where we care
about performance.
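The shape of that suggestion, in kernel-style pseudocode (a hypothetical sketch only, not the actual patch; names like uclamp_se_set(), uclamp_req[] and for_each_process_thread() are taken from the surrounding discussion):

```c
/*
 * Sketch: push a new RT uclamp.min default to all existing tasks.
 * New tasks would pick the default up at fork, so this walk is only
 * needed when the sysctl changes -- well off any fast path.
 */
static void uclamp_update_rt_min_default(unsigned int new_min)
{
	struct task_struct *g, *p;

	rcu_read_lock();
	for_each_process_thread(g, p) {
		/* Respect explicit user requests; only touch defaults. */
		if (!rt_task(p) || p->uclamp_req[UCLAMP_MIN].user_defined)
			continue;
		uclamp_se_set(&p->uclamp_req[UCLAMP_MIN], new_min, false);
	}
	rcu_read_unlock();
}
```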

> FWIW, I think you're referring to Mel's notice in OSPM regarding the overhead.
> Trying to see what goes on in there.

Indeed, that one. The fact that regular distros cannot enable this
feature due to performance overhead is unfortunate. It means there is a
lot less potential for this stuff.


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-05-28 13:23 ` Peter Zijlstra
@ 2020-05-28 15:58   ` Qais Yousef
  2020-05-28 16:11     ` Peter Zijlstra
  0 siblings, 1 reply; 68+ messages in thread
From: Qais Yousef @ 2020-05-28 15:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

On 05/28/20 15:23, Peter Zijlstra wrote:
> On Mon, May 11, 2020 at 04:40:52PM +0100, Qais Yousef wrote:
> > +/*
> > + * By default RT tasks run at the maximum performance point/capacity of the
> > + * system. Uclamp enforces this by always setting UCLAMP_MIN of RT tasks to
> > + * SCHED_CAPACITY_SCALE.
> > + *
> > + * This knob allows admins to change the default behavior when uclamp is being
> > + * used. In battery powered devices, particularly, running at the maximum
> > + * capacity and frequency will increase energy consumption and shorten the
> > + * battery life.
> > + *
> > + * This knob only affects RT tasks that their uclamp_se->user_defined == false.
> > + *
> > + * This knob will not override the system default sched_util_clamp_min defined
> > + * above.
> > + *
> > + * Any modification is applied lazily on the next attempt to calculate the
> > + * effective value of the task.
> > + */
> > +unsigned int sysctl_sched_uclamp_util_min_rt_default = SCHED_CAPACITY_SCALE;
> > +
> >  /* All clamps are required to be less or equal than these values */
> >  static struct uclamp_se uclamp_default[UCLAMP_CNT];
> >  
> > @@ -872,6 +892,28 @@ unsigned int uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
> >  	return uclamp_idle_value(rq, clamp_id, clamp_value);
> >  }
> >  
> > +static inline void uclamp_sync_util_min_rt_default(struct task_struct *p,
> > +						   enum uclamp_id clamp_id)
> > +{
> > +	unsigned int default_util_min = sysctl_sched_uclamp_util_min_rt_default;
> > +	struct uclamp_se *uc_se;
> > +
> > +	/* Only sync for UCLAMP_MIN and RT tasks */
> > +	if (clamp_id != UCLAMP_MIN || !rt_task(p))
> > +		return;
> > +
> > +	uc_se = &p->uclamp_req[UCLAMP_MIN];
> > +
> > +	/*
> > +	 * Only sync if user didn't override the default request and the sysctl
> > +	 * knob has changed.
> > +	 */
> > +	if (uc_se->user_defined || uc_se->value == default_util_min)
> > +		return;
> > +
> > +	uclamp_se_set(uc_se, default_util_min, false);
> > +}
> 
> So afaict this is directly added to the enqueue/dequeue path, and we've
> recently already had complaints that uclamp is too slow.

I wanted to keep this function simpler.

> 
> Is there really no other way?

There is my first attempt which performs the sync @ task_woken_rt().

https://lore.kernel.org/lkml/20191220164838.31619-1-qais.yousef@arm.com/

I can revert the sync function to the simpler version defined in that patch
too.

I can potentially move this to uclamp_eff_value() too. Will need to think more
if this is enough. If task_woken_rt() is good for you, I'd say that's more
obviously correct and better to go with it.

FWIW, I think you're referring to Mel's notice in OSPM regarding the overhead.
Trying to see what goes on in there.

Thanks!

--
Qais Yousef


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-05-11 15:40 Qais Yousef
                   ` (3 preceding siblings ...)
  2020-05-18  8:31 ` Dietmar Eggemann
@ 2020-05-28 13:23 ` Peter Zijlstra
  2020-05-28 15:58   ` Qais Yousef
  4 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2020-05-28 13:23 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Ingo Molnar, Randy Dunlap, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

On Mon, May 11, 2020 at 04:40:52PM +0100, Qais Yousef wrote:
> +/*
> + * By default RT tasks run at the maximum performance point/capacity of the
> + * system. Uclamp enforces this by always setting UCLAMP_MIN of RT tasks to
> + * SCHED_CAPACITY_SCALE.
> + *
> + * This knob allows admins to change the default behavior when uclamp is being
> + * used. In battery powered devices, particularly, running at the maximum
> + * capacity and frequency will increase energy consumption and shorten the
> + * battery life.
> + *
> + * This knob only affects RT tasks that their uclamp_se->user_defined == false.
> + *
> + * This knob will not override the system default sched_util_clamp_min defined
> + * above.
> + *
> + * Any modification is applied lazily on the next attempt to calculate the
> + * effective value of the task.
> + */
> +unsigned int sysctl_sched_uclamp_util_min_rt_default = SCHED_CAPACITY_SCALE;
> +
>  /* All clamps are required to be less or equal than these values */
>  static struct uclamp_se uclamp_default[UCLAMP_CNT];
>  
> @@ -872,6 +892,28 @@ unsigned int uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
>  	return uclamp_idle_value(rq, clamp_id, clamp_value);
>  }
>  
> +static inline void uclamp_sync_util_min_rt_default(struct task_struct *p,
> +						   enum uclamp_id clamp_id)
> +{
> +	unsigned int default_util_min = sysctl_sched_uclamp_util_min_rt_default;
> +	struct uclamp_se *uc_se;
> +
> +	/* Only sync for UCLAMP_MIN and RT tasks */
> +	if (clamp_id != UCLAMP_MIN || !rt_task(p))
> +		return;
> +
> +	uc_se = &p->uclamp_req[UCLAMP_MIN];
> +
> +	/*
> +	 * Only sync if user didn't override the default request and the sysctl
> +	 * knob has changed.
> +	 */
> +	if (uc_se->user_defined || uc_se->value == default_util_min)
> +		return;
> +
> +	uclamp_se_set(uc_se, default_util_min, false);
> +}

So afaict this is directly added to the enqueue/dequeue path, and we've
recently already had complaints that uclamp is too slow.

Is there really no other way?


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-05-18  8:31 ` Dietmar Eggemann
@ 2020-05-18 16:49   ` Qais Yousef
  0 siblings, 0 replies; 68+ messages in thread
From: Qais Yousef @ 2020-05-18 16:49 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Ingo Molnar, Randy Dunlap, Jonathan Corbet,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

On 05/18/20 10:31, Dietmar Eggemann wrote:
> On 11/05/2020 17:40, Qais Yousef wrote:
> 
> [..]
> 
> > @@ -790,6 +790,26 @@ unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
> >  /* Max allowed maximum utilization */
> >  unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
> >  
> > +/*
> > + * By default RT tasks run at the maximum performance point/capacity of the
> > + * system. Uclamp enforces this by always setting UCLAMP_MIN of RT tasks to
> > + * SCHED_CAPACITY_SCALE.
> > + *
> > + * This knob allows admins to change the default behavior when uclamp is being
> > + * used. In battery powered devices, particularly, running at the maximum
> > + * capacity and frequency will increase energy consumption and shorten the
> > + * battery life.
> > + *
> > + * This knob only affects RT tasks that their uclamp_se->user_defined == false.
> 
> Nit pick: Isn't there a verb missing in this sentence?
> 
> [...]
> 
> > @@ -1114,12 +1161,13 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> >  				loff_t *ppos)
> >  {
> >  	bool update_root_tg = false;
> > -	int old_min, old_max;
> > +	int old_min, old_max, old_min_rt;
> 
> Nit pick: Order local variable declarations according to length.
> 
> [...]
> 
> > @@ -1128,7 +1176,9 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> >  		goto done;
> >  
> >  	if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
> > -	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
> > +	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE		||
> 
> Nit pick: This extra space looks weird to me.
> 
> [...]
> 
> Apart from that, LGTM
> 
> For both patches of this v5:
> 
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

Thanks Dietmar and Patrick.

Peter, let me know if you'd like me to address the nitpicks or you're okay with
this as-is.

Thanks!

--
Qais Yousef


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-05-11 15:40 Qais Yousef
                   ` (2 preceding siblings ...)
  2020-05-15 11:08 ` Patrick Bellasi
@ 2020-05-18  8:31 ` Dietmar Eggemann
  2020-05-18 16:49   ` Qais Yousef
  2020-05-28 13:23 ` Peter Zijlstra
  4 siblings, 1 reply; 68+ messages in thread
From: Dietmar Eggemann @ 2020-05-18  8:31 UTC (permalink / raw)
  To: Qais Yousef, Peter Zijlstra, Ingo Molnar
  Cc: Randy Dunlap, Jonathan Corbet, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Quentin Perret, Valentin Schneider,
	Patrick Bellasi, Pavan Kondeti, linux-doc, linux-kernel,
	linux-fsdevel

On 11/05/2020 17:40, Qais Yousef wrote:

[..]

> @@ -790,6 +790,26 @@ unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
>  /* Max allowed maximum utilization */
>  unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
>  
> +/*
> + * By default RT tasks run at the maximum performance point/capacity of the
> + * system. Uclamp enforces this by always setting UCLAMP_MIN of RT tasks to
> + * SCHED_CAPACITY_SCALE.
> + *
> + * This knob allows admins to change the default behavior when uclamp is being
> + * used. In battery powered devices, particularly, running at the maximum
> + * capacity and frequency will increase energy consumption and shorten the
> + * battery life.
> + *
> + * This knob only affects RT tasks that their uclamp_se->user_defined == false.

Nit pick: Isn't there a verb missing in this sentence?

[...]

> @@ -1114,12 +1161,13 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
>  				loff_t *ppos)
>  {
>  	bool update_root_tg = false;
> -	int old_min, old_max;
> +	int old_min, old_max, old_min_rt;

Nit pick: Order local variable declarations according to length.

[...]

> @@ -1128,7 +1176,9 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
>  		goto done;
>  
>  	if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
> -	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
> +	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE		||

Nit pick: This extra space looks weird to me.

[...]

Apart from that, LGTM

For both patches of this v5:

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-05-11 15:40 Qais Yousef
  2020-05-11 17:18 ` Qais Yousef
  2020-05-12  2:10 ` Pavan Kondeti
@ 2020-05-15 11:08 ` Patrick Bellasi
  2020-05-18  8:31 ` Dietmar Eggemann
  2020-05-28 13:23 ` Peter Zijlstra
  4 siblings, 0 replies; 68+ messages in thread
From: Patrick Bellasi @ 2020-05-15 11:08 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Peter Zijlstra, Ingo Molnar, Randy Dunlap, Jonathan Corbet,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Quentin Perret, Valentin Schneider, Pavan Kondeti,
	linux-doc, linux-kernel, linux-fsdevel


Hi Qais,
I see we are converging toward the final shape. :)

Function wise code looks ok to me now.

Lemme just point out a few more remarks and possible nit-picks.
I guess at the end it's up to you to decide if you wanna follow up with
a v6 and to the maintainers to decide how picky they wanna be.

Otherwise, FWIW, feel free to consider this a LGTM.

Best,
Patrick

On Mon, May 11, 2020 at 17:40:52 +0200, Qais Yousef <qais.yousef@arm.com> wrote...

[...]

> +static inline void uclamp_sync_util_min_rt_default(struct task_struct *p,
> +						   enum uclamp_id clamp_id)
> +{
> +	unsigned int default_util_min = sysctl_sched_uclamp_util_min_rt_default;
> +	struct uclamp_se *uc_se;
> +
> +	/* Only sync for UCLAMP_MIN and RT tasks */
> +	if (clamp_id != UCLAMP_MIN || !rt_task(p))
> +		return;
> +
> +	uc_se = &p->uclamp_req[UCLAMP_MIN];

I went back to the v3 version, where this was done above:

   https://lore.kernel.org/lkml/20200429113255.GA19464@codeaurora.org/
   Message-ID: 20200429113255.GA19464@codeaurora.org

and I still don't see why we want to keep it after this first check.

IMO it's just not required, and it makes the code a tiny bit uglier.

> +
> +	/*
> +	 * Only sync if user didn't override the default request and the sysctl
> +	 * knob has changed.
> +	 */
> +	if (uc_se->user_defined || uc_se->value == default_util_min)
> +		return;
> +

nit-pick: the two comments above are stating the obvious.

> +	uclamp_se_set(uc_se, default_util_min, false);
> +}
> +
>  static inline struct uclamp_se
>  uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
>  {
> @@ -907,8 +949,13 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
>  static inline struct uclamp_se
>  uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
>  {
> -	struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
> -	struct uclamp_se uc_max = uclamp_default[clamp_id];
> +	struct uclamp_se uc_req, uc_max;
> +
> +	/* Sync up any change to sysctl_sched_uclamp_util_min_rt_default. */

same here: the comment is stating the obvious.

Maybe even just by using a different function name we could better
document the code, e.g. uclamp_rt_restrict(p, clamp_id);

This would implicitly convey the purpose: RT tasks can be further
restricted, i.e. in addition to the TG restriction that follows.


> +	uclamp_sync_util_min_rt_default(p, clamp_id);
> +
> +	uc_req = uclamp_tg_restrict(p, clamp_id);
> +	uc_max = uclamp_default[clamp_id];
>  
>  	/* System default restrictions always apply */
>  	if (unlikely(uc_req.value > uc_max.value))

[...]



* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-05-12  2:10 ` Pavan Kondeti
@ 2020-05-12 11:46   ` Qais Yousef
  0 siblings, 0 replies; 68+ messages in thread
From: Qais Yousef @ 2020-05-12 11:46 UTC (permalink / raw)
  To: Pavan Kondeti
  Cc: Peter Zijlstra, Ingo Molnar, Randy Dunlap, Jonathan Corbet,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Quentin Perret, Valentin Schneider,
	Patrick Bellasi, linux-doc, linux-kernel, linux-fsdevel

On 05/12/20 07:40, Pavan Kondeti wrote:
> On Mon, May 11, 2020 at 04:40:52PM +0100, Qais Yousef wrote:
> > RT tasks by default run at the highest capacity/performance level. When
> > uclamp is selected this default behavior is retained by enforcing the
> > requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
> > uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
> > value.
> > 
> > This is also referred to as 'the default boost value of RT tasks'.
> > 
> > See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").
> > 
> > On battery powered devices, it is desired to control this default
> > (currently hardcoded) behavior at runtime to reduce energy consumed by
> > RT tasks.
> > 
> > For example, for a mobile device manufacturer where big.LITTLE architecture
> > is dominant, the performance of the little cores varies across SoCs, and
> > on high end ones the big cores could be too power hungry.
> > 
> > Given the diversity of SoCs, the new knob allows manufacturers to tune
> > the best performance/power for RT tasks for the particular hardware they
> > run on.
> > 
> > They could opt to further tune the value when the user selects
> > a different power saving mode or when the device is actively charging.
> > 
> > The runtime aspect of it further helps in creating a single kernel image
> > that can be run on multiple devices that require different tuning.
> > 
> > Keep in mind that a lot of RT tasks in the system are created by the
> > kernel. On Android for instance I can see over 50 RT tasks, only
> > a handful of which are created by the Android framework.
> > 
> > To control the default behavior globally by system admins and device
> > integrators, introduce the new sysctl_sched_uclamp_util_min_rt_default
> > to change the default boost value of the RT tasks.
> > 
> > I anticipate this to be mostly in the form of modifying the init script
> > of a particular device.
> > 
> > Whenever the new default changes, it'd be applied lazily on the next
> > opportunity the scheduler needs to calculate the effective uclamp.min
> > value for the task, assuming that it still uses the system default value
> > and not a user applied one.
> > 
> > Tested on Juno-r2 in combination with the RT capacity awareness [1].
> > By default an RT task will go to the highest capacity CPU and run at the
> > maximum frequency, which is particularly energy inefficient on high end
> > mobile devices because the biggest core[s] are 'huge' and power hungry.
> > 
> > With this patch the RT task can be controlled to run anywhere by
> > default, and doesn't cause the frequency to be maximum all the time.
> > Yet any task that really needs to be boosted can easily escape this
> > default behavior by modifying its requested uclamp.min value
> > (p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.
> > 
> > [1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")
> > 
> 
> I have tested this patch on SDM845 running V5.7-rc4 and it works as expected.
> 
> Default, i.e. /proc/sys/kernel/sched_util_clamp_min_rt_default = 1024.
> 
> RT task runs on BIG cluster every time at max frequency. Both effective
> and requested uclamp.min are set to 1024
> 
> With /proc/sys/kernel/sched_util_clamp_min_rt_default = 128
> 
> RT task runs on Little cluster (max capacity is 404) and frequency scaling
> happens as per the change in utilization. Both effective and requested
> uclamp are set to 128.
> 
> Feel free to add
> 
> Tested-by: Pavankumar Kondeti <pkondeti@codeaurora.org>

Thanks Pavan!

--
Qais Yousef


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-05-11 15:40 Qais Yousef
  2020-05-11 17:18 ` Qais Yousef
@ 2020-05-12  2:10 ` Pavan Kondeti
  2020-05-12 11:46   ` Qais Yousef
  2020-05-15 11:08 ` Patrick Bellasi
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 68+ messages in thread
From: Pavan Kondeti @ 2020-05-12  2:10 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Peter Zijlstra, Ingo Molnar, Randy Dunlap, Jonathan Corbet,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Quentin Perret, Valentin Schneider,
	Patrick Bellasi, linux-doc, linux-kernel, linux-fsdevel

On Mon, May 11, 2020 at 04:40:52PM +0100, Qais Yousef wrote:
> RT tasks by default run at the highest capacity/performance level. When
> uclamp is selected this default behavior is retained by enforcing the
> requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
> uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
> value.
> 
> This is also referred to as 'the default boost value of RT tasks'.
> 
> See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").
> 
> On battery powered devices, it is desired to control this default
> (currently hardcoded) behavior at runtime to reduce energy consumed by
> RT tasks.
> 
> For example, for a mobile device manufacturer where big.LITTLE architecture
> is dominant, the performance of the little cores varies across SoCs, and
> on high end ones the big cores could be too power hungry.
> 
> Given the diversity of SoCs, the new knob allows manufacturers to tune
> the best performance/power for RT tasks for the particular hardware they
> run on.
> 
> They could opt to further tune the value when the user selects
> a different power saving mode or when the device is actively charging.
> 
> The runtime aspect of it further helps in creating a single kernel image
> that can be run on multiple devices that require different tuning.
> 
> Keep in mind that a lot of RT tasks in the system are created by the
> kernel. On Android for instance I can see over 50 RT tasks, only
> a handful of which are created by the Android framework.
> 
> To control the default behavior globally by system admins and device
> integrators, introduce the new sysctl_sched_uclamp_util_min_rt_default
> to change the default boost value of the RT tasks.
> 
> I anticipate this to be mostly in the form of modifying the init script
> of a particular device.
> 
> Whenever the new default changes, it'd be applied lazily on the next
> opportunity the scheduler needs to calculate the effective uclamp.min
> value for the task, assuming that it still uses the system default value
> and not a user applied one.
> 
> Tested on Juno-r2 in combination with the RT capacity awareness [1].
> By default an RT task will go to the highest capacity CPU and run at the
> maximum frequency, which is particularly energy inefficient on high end
> mobile devices because the biggest core[s] are 'huge' and power hungry.
> 
> With this patch the RT task can be controlled to run anywhere by
> default, and doesn't cause the frequency to be maximum all the time.
> Yet any task that really needs to be boosted can easily escape this
> default behavior by modifying its requested uclamp.min value
> (p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.
> 
> [1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")
> 

I have tested this patch on SDM845 running V5.7-rc4 and it works as expected.

Default, i.e. /proc/sys/kernel/sched_util_clamp_min_rt_default = 1024.

RT task runs on BIG cluster every time at max frequency. Both effective
and requested uclamp.min are set to 1024

With /proc/sys/kernel/sched_util_clamp_min_rt_default = 128

RT task runs on Little cluster (max capacity is 404) and frequency scaling
happens as per the change in utilization. Both effective and requested
uclamp are set to 128.
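For anyone wanting to reproduce the setup above, it boils down to
writing the new knob. A rough sketch (assumes root and a kernel built
with CONFIG_UCLAMP_TASK; it is a harmless no-op otherwise):

```shell
#!/bin/sh
# Lower the default RT boost for the test, then restore the old value.
KNOB=/proc/sys/kernel/sched_util_clamp_min_rt_default

if [ -w "$KNOB" ]; then
    old=$(cat "$KNOB")
    echo 128 > "$KNOB"     # RT tasks now default to uclamp.min = 128
    echo "default RT uclamp.min: $(cat "$KNOB") (was $old)"
    echo "$old" > "$KNOB"  # restore
else
    echo "uclamp sysctl not available (need root and CONFIG_UCLAMP_TASK)"
fi
```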

Feel free to add

Tested-by: Pavankumar Kondeti <pkondeti@codeaurora.org>

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.


* Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
  2020-05-11 15:40 Qais Yousef
@ 2020-05-11 17:18 ` Qais Yousef
  2020-05-12  2:10 ` Pavan Kondeti
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 68+ messages in thread
From: Qais Yousef @ 2020-05-11 17:18 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Randy Dunlap, Jonathan Corbet, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Quentin Perret,
	Valentin Schneider, Patrick Bellasi, Pavan Kondeti, linux-doc,
	linux-kernel, linux-fsdevel

Sorry I forgot to label this as v5 in the subject line.

--
Qais Yousef

On 05/11/20 16:40, Qais Yousef wrote:
> RT tasks by default run at the highest capacity/performance level. When
> uclamp is selected this default behavior is retained by enforcing the
> requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
> uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
> value.
> 
> This is also referred to as 'the default boost value of RT tasks'.
> 
> See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").
> 
> On battery powered devices, it is desired to control this default
> (currently hardcoded) behavior at runtime to reduce energy consumed by
> RT tasks.
> 
> For example, for a mobile device manufacturer where big.LITTLE architecture
> is dominant, the performance of the little cores varies across SoCs, and
> on high end ones the big cores could be too power hungry.
> 
> Given the diversity of SoCs, the new knob allows manufacturers to tune
> the best performance/power for RT tasks for the particular hardware they
> run on.
> 
> They could opt to further tune the value when the user selects
> a different power saving mode or when the device is actively charging.
> 
> The runtime aspect of it further helps in creating a single kernel image
> that can be run on multiple devices that require different tuning.
> 
> Keep in mind that a lot of RT tasks in the system are created by the
> kernel. On Android for instance I can see over 50 RT tasks, only
> a handful of which are created by the Android framework.
> 
> To control the default behavior globally by system admins and device
> integrators, introduce the new sysctl_sched_uclamp_util_min_rt_default
> to change the default boost value of the RT tasks.
> 
> I anticipate this to be mostly in the form of modifying the init script
> of a particular device.
> 
> Whenever the new default changes, it'd be applied lazily on the next
> opportunity the scheduler needs to calculate the effective uclamp.min
> value for the task, assuming that it still uses the system default value
> and not a user applied one.
> 
> Tested on Juno-r2 in combination with the RT capacity awareness [1].
> By default an RT task will go to the highest capacity CPU and run at the
> maximum frequency, which is particularly energy inefficient on high end
> mobile devices because the biggest core[s] are 'huge' and power hungry.
> 
> With this patch the RT task can be controlled to run anywhere by
> default, and doesn't cause the frequency to be maximum all the time.
> Yet any task that really needs to be boosted can easily escape this
> default behavior by modifying its requested uclamp.min value
> (p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.
> 
> [1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")
> 
> Signed-off-by: Qais Yousef <qais.yousef@arm.com>
> CC: Jonathan Corbet <corbet@lwn.net>
> CC: Juri Lelli <juri.lelli@redhat.com>
> CC: Vincent Guittot <vincent.guittot@linaro.org>
> CC: Dietmar Eggemann <dietmar.eggemann@arm.com>
> CC: Steven Rostedt <rostedt@goodmis.org>
> CC: Ben Segall <bsegall@google.com>
> CC: Mel Gorman <mgorman@suse.de>
> CC: Luis Chamberlain <mcgrof@kernel.org>
> CC: Kees Cook <keescook@chromium.org>
> CC: Iurii Zaikin <yzaikin@google.com>
> CC: Quentin Perret <qperret@google.com>
> CC: Valentin Schneider <valentin.schneider@arm.com>
> CC: Patrick Bellasi <patrick.bellasi@matbug.net>
> CC: Pavan Kondeti <pkondeti@codeaurora.org>
> CC: linux-doc@vger.kernel.org
> CC: linux-kernel@vger.kernel.org
> CC: linux-fsdevel@vger.kernel.org
> ---
> 
> Changes in v5 (all from Patrick):
> 	* Remove use of likely/unlikely
> 	* Use short hand variable for sysctl_sched_uclamp_util_min_rt_default
> 	* Combine if conditions that checks for errors when setting
> 	  sysctl_sched_uclamp_util_min_rt_default
> 	* Fit a comment in a single line
> 
> v4 discussion:
> 
> https://lore.kernel.org/lkml/20200501114927.15248-1-qais.yousef@arm.com/
> 
>  include/linux/sched/sysctl.h |  1 +
>  kernel/sched/core.c          | 63 +++++++++++++++++++++++++++++++-----
>  kernel/sysctl.c              |  7 ++++
>  3 files changed, 63 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index d4f6215ee03f..e62cef019094 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -59,6 +59,7 @@ extern int sysctl_sched_rt_runtime;
>  #ifdef CONFIG_UCLAMP_TASK
>  extern unsigned int sysctl_sched_uclamp_util_min;
>  extern unsigned int sysctl_sched_uclamp_util_max;
> +extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
>  #endif
>  
>  #ifdef CONFIG_CFS_BANDWIDTH
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9a2fbf98fd6f..ea1e11db78bb 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -790,6 +790,26 @@ unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
>  /* Max allowed maximum utilization */
>  unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
>  
> +/*
> + * By default RT tasks run at the maximum performance point/capacity of the
> + * system. Uclamp enforces this by always setting UCLAMP_MIN of RT tasks to
> + * SCHED_CAPACITY_SCALE.
> + *
> + * This knob allows admins to change the default behavior when uclamp is being
> + * used. In battery powered devices, particularly, running at the maximum
> + * capacity and frequency will increase energy consumption and shorten the
> + * battery life.
> + *
> + * This knob only affects RT tasks that their uclamp_se->user_defined == false.
> + *
> + * This knob will not override the system default sched_util_clamp_min defined
> + * above.
> + *
> + * Any modification is applied lazily on the next attempt to calculate the
> + * effective value of the task.
> + */
> +unsigned int sysctl_sched_uclamp_util_min_rt_default = SCHED_CAPACITY_SCALE;
> +
>  /* All clamps are required to be less or equal than these values */
>  static struct uclamp_se uclamp_default[UCLAMP_CNT];
>  
> @@ -872,6 +892,28 @@ unsigned int uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
>  	return uclamp_idle_value(rq, clamp_id, clamp_value);
>  }
>  
> +static inline void uclamp_sync_util_min_rt_default(struct task_struct *p,
> +						   enum uclamp_id clamp_id)
> +{
> +	unsigned int default_util_min = sysctl_sched_uclamp_util_min_rt_default;
> +	struct uclamp_se *uc_se;
> +
> +	/* Only sync for UCLAMP_MIN and RT tasks */
> +	if (clamp_id != UCLAMP_MIN || !rt_task(p))
> +		return;
> +
> +	uc_se = &p->uclamp_req[UCLAMP_MIN];
> +
> +	/*
> +	 * Only sync if user didn't override the default request and the sysctl
> +	 * knob has changed.
> +	 */
> +	if (uc_se->user_defined || uc_se->value == default_util_min)
> +		return;
> +
> +	uclamp_se_set(uc_se, default_util_min, false);
> +}
> +
>  static inline struct uclamp_se
>  uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
>  {
> @@ -907,8 +949,13 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
>  static inline struct uclamp_se
>  uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
>  {
> -	struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
> -	struct uclamp_se uc_max = uclamp_default[clamp_id];
> +	struct uclamp_se uc_req, uc_max;
> +
> +	/* Sync up any change to sysctl_sched_uclamp_util_min_rt_default. */
> +	uclamp_sync_util_min_rt_default(p, clamp_id);
> +
> +	uc_req = uclamp_tg_restrict(p, clamp_id);
> +	uc_max = uclamp_default[clamp_id];
>  
>  	/* System default restrictions always apply */
>  	if (unlikely(uc_req.value > uc_max.value))
> @@ -1114,12 +1161,13 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
>  				loff_t *ppos)
>  {
>  	bool update_root_tg = false;
> -	int old_min, old_max;
> +	int old_min, old_max, old_min_rt;
>  	int result;
>  
>  	mutex_lock(&uclamp_mutex);
>  	old_min = sysctl_sched_uclamp_util_min;
>  	old_max = sysctl_sched_uclamp_util_max;
> +	old_min_rt = sysctl_sched_uclamp_util_min_rt_default;
>  
>  	result = proc_dointvec(table, write, buffer, lenp, ppos);
>  	if (result)
> @@ -1128,7 +1176,9 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
>  		goto done;
>  
>  	if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
> -	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
> +	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE		||
> +	    sysctl_sched_uclamp_util_min_rt_default > SCHED_CAPACITY_SCALE) {
> +
>  		result = -EINVAL;
>  		goto undo;
>  	}
> @@ -1158,6 +1208,7 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
>  undo:
>  	sysctl_sched_uclamp_util_min = old_min;
>  	sysctl_sched_uclamp_util_max = old_max;
> +	sysctl_sched_uclamp_util_min_rt_default = old_min_rt;
>  done:
>  	mutex_unlock(&uclamp_mutex);
>  
> @@ -1200,10 +1251,6 @@ static void __setscheduler_uclamp(struct task_struct *p,
>  		if (uc_se->user_defined)
>  			continue;
>  
> -		/* By default, RT tasks always get 100% boost */
> -		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> -			clamp_value = uclamp_none(UCLAMP_MAX);
> -
>  		uclamp_se_set(uc_se, clamp_value, false);
>  	}
>  
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 8a176d8727a3..64117363c502 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -453,6 +453,13 @@ static struct ctl_table kern_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= sysctl_sched_uclamp_handler,
>  	},
> +	{
> +		.procname	= "sched_util_clamp_min_rt_default",
> +		.data		= &sysctl_sched_uclamp_util_min_rt_default,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= sysctl_sched_uclamp_handler,
> +	},
>  #endif
>  #ifdef CONFIG_SCHED_AUTOGROUP
>  	{
> -- 
> 2.17.1
> 


* [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
@ 2020-05-11 15:40 Qais Yousef
  2020-05-11 17:18 ` Qais Yousef
                   ` (4 more replies)
  0 siblings, 5 replies; 68+ messages in thread
From: Qais Yousef @ 2020-05-11 15:40 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Randy Dunlap, Qais Yousef, Jonathan Corbet, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Quentin Perret, Valentin Schneider, Patrick Bellasi,
	Pavan Kondeti, linux-doc, linux-kernel, linux-fsdevel

RT tasks by default run at the highest capacity/performance level. When
uclamp is selected this default behavior is retained by enforcing the
requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
value.

This is also referred to as 'the default boost value of RT tasks'.

See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").

On battery powered devices, it is desired to control this default
(currently hardcoded) behavior at runtime to reduce energy consumed by
RT tasks.

For example, for a mobile device manufacturer where big.LITTLE architecture
is dominant, the performance of the little cores varies across SoCs, and
on high end ones the big cores could be too power hungry.

Given the diversity of SoCs, the new knob allows manufacturers to tune
the best performance/power for RT tasks for the particular hardware they
run on.

They could opt to further tune the value when the user selects
a different power saving mode or when the device is actively charging.

The runtime aspect of it further helps in creating a single kernel image
that can be run on multiple devices that require different tuning.

Keep in mind that a lot of RT tasks in the system are created by the
kernel. On Android for instance I can see over 50 RT tasks, only
a handful of which are created by the Android framework.

To control the default behavior globally by system admins and device
integrators, introduce the new sysctl_sched_uclamp_util_min_rt_default
to change the default boost value of the RT tasks.

I anticipate this to be mostly in the form of modifying the init script
of a particular device.

Whenever the new default changes, it'd be applied lazily on the next
opportunity the scheduler needs to calculate the effective uclamp.min
value for the task, assuming that it still uses the system default value
and not a user applied one.

Tested on Juno-r2 in combination with the RT capacity awareness [1].
By default an RT task will go to the highest capacity CPU and run at the
maximum frequency, which is particularly energy inefficient on high end
mobile devices because the biggest core[s] are 'huge' and power hungry.

With this patch the RT task can be controlled to run anywhere by
default, and doesn't cause the frequency to be maximum all the time.
Yet any task that really needs to be boosted can easily escape this
default behavior by modifying its requested uclamp.min value
(p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.

[1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
CC: Jonathan Corbet <corbet@lwn.net>
CC: Juri Lelli <juri.lelli@redhat.com>
CC: Vincent Guittot <vincent.guittot@linaro.org>
CC: Dietmar Eggemann <dietmar.eggemann@arm.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ben Segall <bsegall@google.com>
CC: Mel Gorman <mgorman@suse.de>
CC: Luis Chamberlain <mcgrof@kernel.org>
CC: Kees Cook <keescook@chromium.org>
CC: Iurii Zaikin <yzaikin@google.com>
CC: Quentin Perret <qperret@google.com>
CC: Valentin Schneider <valentin.schneider@arm.com>
CC: Patrick Bellasi <patrick.bellasi@matbug.net>
CC: Pavan Kondeti <pkondeti@codeaurora.org>
CC: linux-doc@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux-fsdevel@vger.kernel.org
---

Changes in v5 (all from Patrick):
	* Remove use of likely/unlikely
	* Use short hand variable for sysctl_sched_uclamp_util_min_rt_default
	* Combine if conditions that checks for errors when setting
	  sysctl_sched_uclamp_util_min_rt_default
	* Fit a comment in a single line

v4 discussion:

https://lore.kernel.org/lkml/20200501114927.15248-1-qais.yousef@arm.com/

 include/linux/sched/sysctl.h |  1 +
 kernel/sched/core.c          | 63 +++++++++++++++++++++++++++++++-----
 kernel/sysctl.c              |  7 ++++
 3 files changed, 63 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f..e62cef019094 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -59,6 +59,7 @@ extern int sysctl_sched_rt_runtime;
 #ifdef CONFIG_UCLAMP_TASK
 extern unsigned int sysctl_sched_uclamp_util_min;
 extern unsigned int sysctl_sched_uclamp_util_max;
+extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
 #endif
 
 #ifdef CONFIG_CFS_BANDWIDTH
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9a2fbf98fd6f..ea1e11db78bb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -790,6 +790,26 @@ unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
 /* Max allowed maximum utilization */
 unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
 
+/*
+ * By default RT tasks run at the maximum performance point/capacity of the
+ * system. Uclamp enforces this by always setting UCLAMP_MIN of RT tasks to
+ * SCHED_CAPACITY_SCALE.
+ *
+ * This knob allows admins to change the default behavior when uclamp is being
+ * used. In battery powered devices, particularly, running at the maximum
+ * capacity and frequency will increase energy consumption and shorten the
+ * battery life.
+ *
+ * This knob only affects RT tasks that their uclamp_se->user_defined == false.
+ *
+ * This knob will not override the system default sched_util_clamp_min defined
+ * above.
+ *
+ * Any modification is applied lazily on the next attempt to calculate the
+ * effective value of the task.
+ */
+unsigned int sysctl_sched_uclamp_util_min_rt_default = SCHED_CAPACITY_SCALE;
+
 /* All clamps are required to be less or equal than these values */
 static struct uclamp_se uclamp_default[UCLAMP_CNT];
 
@@ -872,6 +892,28 @@ unsigned int uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
 	return uclamp_idle_value(rq, clamp_id, clamp_value);
 }
 
+static inline void uclamp_sync_util_min_rt_default(struct task_struct *p,
+						   enum uclamp_id clamp_id)
+{
+	unsigned int default_util_min = sysctl_sched_uclamp_util_min_rt_default;
+	struct uclamp_se *uc_se;
+
+	/* Only sync for UCLAMP_MIN and RT tasks */
+	if (clamp_id != UCLAMP_MIN || !rt_task(p))
+		return;
+
+	uc_se = &p->uclamp_req[UCLAMP_MIN];
+
+	/*
+	 * Only sync if user didn't override the default request and the sysctl
+	 * knob has changed.
+	 */
+	if (uc_se->user_defined || uc_se->value == default_util_min)
+		return;
+
+	uclamp_se_set(uc_se, default_util_min, false);
+}
+
 static inline struct uclamp_se
 uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
 {
@@ -907,8 +949,13 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
 static inline struct uclamp_se
 uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
 {
-	struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
-	struct uclamp_se uc_max = uclamp_default[clamp_id];
+	struct uclamp_se uc_req, uc_max;
+
+	/* Sync up any change to sysctl_sched_uclamp_util_min_rt_default. */
+	uclamp_sync_util_min_rt_default(p, clamp_id);
+
+	uc_req = uclamp_tg_restrict(p, clamp_id);
+	uc_max = uclamp_default[clamp_id];
 
 	/* System default restrictions always apply */
 	if (unlikely(uc_req.value > uc_max.value))
@@ -1114,12 +1161,13 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 				loff_t *ppos)
 {
 	bool update_root_tg = false;
-	int old_min, old_max;
+	int old_min, old_max, old_min_rt;
 	int result;
 
 	mutex_lock(&uclamp_mutex);
 	old_min = sysctl_sched_uclamp_util_min;
 	old_max = sysctl_sched_uclamp_util_max;
+	old_min_rt = sysctl_sched_uclamp_util_min_rt_default;
 
 	result = proc_dointvec(table, write, buffer, lenp, ppos);
 	if (result)
@@ -1128,7 +1176,9 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 		goto done;
 
 	if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
-	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
+	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE		||
+	    sysctl_sched_uclamp_util_min_rt_default > SCHED_CAPACITY_SCALE) {
+
 		result = -EINVAL;
 		goto undo;
 	}
@@ -1158,6 +1208,7 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
 undo:
 	sysctl_sched_uclamp_util_min = old_min;
 	sysctl_sched_uclamp_util_max = old_max;
+	sysctl_sched_uclamp_util_min_rt_default = old_min_rt;
 done:
 	mutex_unlock(&uclamp_mutex);
 
@@ -1200,10 +1251,6 @@ static void __setscheduler_uclamp(struct task_struct *p,
 		if (uc_se->user_defined)
 			continue;
 
-		/* By default, RT tasks always get 100% boost */
-		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
-			clamp_value = uclamp_none(UCLAMP_MAX);
-
 		uclamp_se_set(uc_se, clamp_value, false);
 	}
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8a176d8727a3..64117363c502 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -453,6 +453,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sysctl_sched_uclamp_handler,
 	},
+	{
+		.procname	= "sched_util_clamp_min_rt_default",
+		.data		= &sysctl_sched_uclamp_util_min_rt_default,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_sched_uclamp_handler,
+	},
 #endif
 #ifdef CONFIG_SCHED_AUTOGROUP
 	{
-- 
2.17.1



end of thread, back to index

Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-03 12:30 [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value Qais Yousef
2020-04-03 12:30 ` [PATCH 2/2] Documentation/sysctl: Document uclamp sysctl knobs Qais Yousef
2020-04-14 18:21 ` [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value Patrick Bellasi
2020-04-15  7:46   ` Patrick Bellasi
2020-04-20 15:04     ` Qais Yousef
2020-04-20  8:24   ` Dietmar Eggemann
2020-04-20 15:19     ` Qais Yousef
2020-04-21  0:52       ` Steven Rostedt
2020-04-21 11:16         ` Dietmar Eggemann
2020-04-21 11:23           ` Qais Yousef
2020-04-20 14:50   ` Qais Yousef
2020-04-15 10:11 ` Quentin Perret
2020-04-20 15:08   ` Qais Yousef
2020-04-20  8:29 ` Dietmar Eggemann
2020-04-20 15:13   ` Qais Yousef
2020-04-21 11:18     ` Dietmar Eggemann
2020-04-21 11:27       ` Qais Yousef
2020-04-22 10:59         ` Dietmar Eggemann
2020-04-22 13:13           ` Qais Yousef
2020-05-11 15:40 Qais Yousef
2020-05-11 17:18 ` Qais Yousef
2020-05-12  2:10 ` Pavan Kondeti
2020-05-12 11:46   ` Qais Yousef
2020-05-15 11:08 ` Patrick Bellasi
2020-05-18  8:31 ` Dietmar Eggemann
2020-05-18 16:49   ` Qais Yousef
2020-05-28 13:23 ` Peter Zijlstra
2020-05-28 15:58   ` Qais Yousef
2020-05-28 16:11     ` Peter Zijlstra
2020-05-28 16:51       ` Qais Yousef
2020-05-28 18:29         ` Peter Zijlstra
2020-05-28 19:08           ` Patrick Bellasi
2020-05-28 19:20           ` Dietmar Eggemann
2020-05-29  9:11           ` Qais Yousef
2020-05-29 10:21         ` Mel Gorman
2020-05-29 15:11           ` Qais Yousef
2020-05-29 16:02             ` Mel Gorman
2020-05-29 16:05               ` Qais Yousef
2020-05-29 10:08       ` Mel Gorman
2020-05-29 16:04         ` Qais Yousef
2020-05-29 16:57           ` Mel Gorman
2020-06-02 16:46         ` Dietmar Eggemann
2020-06-03  8:29           ` Patrick Bellasi
2020-06-03 10:10             ` Mel Gorman
2020-06-03 14:59               ` Vincent Guittot
2020-06-03 16:52                 ` Qais Yousef
2020-06-04 12:14                   ` Vincent Guittot
2020-06-05 10:45                     ` Qais Yousef
2020-06-09 15:29                       ` Vincent Guittot
2020-06-08 12:31                     ` Qais Yousef
2020-06-08 13:06                       ` Valentin Schneider
2020-06-08 14:44                       ` Steven Rostedt
2020-06-11 10:13                         ` Qais Yousef
2020-06-09 17:10                       ` Vincent Guittot
2020-06-11 10:24                         ` Qais Yousef
2020-06-11 12:01                           ` Vincent Guittot
2020-06-23 15:44                             ` Qais Yousef
2020-06-24  8:45                               ` Vincent Guittot
2020-06-05  7:55                   ` Patrick Bellasi
2020-06-05 11:32                     ` Qais Yousef
2020-06-05 13:27                       ` Patrick Bellasi
2020-06-03  9:40           ` Mel Gorman
2020-06-03 12:41             ` Qais Yousef
2020-06-04 13:40               ` Mel Gorman
2020-06-05 10:58                 ` Qais Yousef
2020-06-11 10:58                 ` Qais Yousef
2020-06-16 11:08                   ` Qais Yousef
2020-06-16 13:56                     ` Lukasz Luba
