linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/3] A few uclamp fixes
@ 2021-06-10 15:13 Quentin Perret
  2021-06-10 15:13 ` [PATCH v2 1/3] sched: Fix UCLAMP_FLAG_IDLE setting Quentin Perret
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Quentin Perret @ 2021-06-10 15:13 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, qais.yousef,
	rickyiu, wvw, patrick.bellasi, xuewen.yan94
  Cc: linux-kernel, kernel-team, Quentin Perret

Hi all,

This series groups together v2 of a few patches I originally sent
independently of each other:

  https://lore.kernel.org/r/20210609143339.1194238-1-qperret@google.com
  https://lore.kernel.org/r/20210609170132.1386495-1-qperret@google.com
  https://lore.kernel.org/r/20210609175901.1423553-1-qperret@google.com

But since they're all touching uclamp-related things, I figured I might
as well group them in a series.

The first one is a pure fix; the other two slightly change the
sched_setattr behaviour for uclamp to make it more convenient to use,
and allow restrictions to be put on the per-task API.

Changes since v1:
 - fixed the CAP_SYS_NICE check to handle the uclamp_{min,max} = -1
   cases correctly;
 - fixed commit message of UCLAMP_FLAG_IDLE patch.

Thanks,
Quentin

Quentin Perret (3):
  sched: Fix UCLAMP_FLAG_IDLE setting
  sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS
  sched: Make uclamp changes depend on CAP_SYS_NICE

 kernel/sched/core.c | 64 +++++++++++++++++++++++++++++++++++++--------
 1 file changed, 53 insertions(+), 11 deletions(-)

-- 
2.32.0.272.g935e593368-goog


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2 1/3] sched: Fix UCLAMP_FLAG_IDLE setting
  2021-06-10 15:13 [PATCH v2 0/3] A few uclamp fixes Quentin Perret
@ 2021-06-10 15:13 ` Quentin Perret
  2021-06-10 19:05   ` Peter Zijlstra
  2021-06-10 15:13 ` [PATCH v2 2/3] sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS Quentin Perret
  2021-06-10 15:13 ` [PATCH v2 3/3] sched: Make uclamp changes depend on CAP_SYS_NICE Quentin Perret
  2 siblings, 1 reply; 20+ messages in thread
From: Quentin Perret @ 2021-06-10 15:13 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, qais.yousef,
	rickyiu, wvw, patrick.bellasi, xuewen.yan94
  Cc: linux-kernel, kernel-team, Quentin Perret

The UCLAMP_FLAG_IDLE flag is set on a runqueue when dequeueing the last
active task to maintain the last uclamp.max and prevent blocked util
from suddenly becoming visible.

However, there is an asymmetry in how the flag is set and cleared which
can lead to having the flag set whilst there are active tasks on the rq.
Specifically, the flag is cleared in the uclamp_rq_inc() path, which is
called at enqueue time, but set in uclamp_rq_dec_id() which is called
both when dequeueing a task _and_ in the update_uclamp_active() path. As
a result, when both uclamp_rq_{dec,inc}_id() are called from
update_uclamp_active(), the flag ends up being set but not cleared,
hence leaving the runqueue in a broken state.

Fix this by setting the flag in the uclamp_rq_inc_id() path to ensure
things remain symmetrical.

Fixes: e496187da710 ("sched/uclamp: Enforce last task's UCLAMP_MAX")
Reported-by: Rick Yiu <rickyiu@google.com>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Quentin Perret <qperret@google.com>
---
 kernel/sched/core.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5226cc26a095..3b213402798e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -980,6 +980,7 @@ static inline void uclamp_idle_reset(struct rq *rq, enum uclamp_id clamp_id,
 	if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
 		return;
 
+	rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
 	WRITE_ONCE(rq->uclamp[clamp_id].value, clamp_value);
 }
 
@@ -1252,10 +1253,6 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 
 	for_each_clamp_id(clamp_id)
 		uclamp_rq_inc_id(rq, p, clamp_id);
-
-	/* Reset clamp idle holding when there is one RUNNABLE task */
-	if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
-		rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
 }
 
 static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
-- 
2.32.0.272.g935e593368-goog



* [PATCH v2 2/3] sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS
  2021-06-10 15:13 [PATCH v2 0/3] A few uclamp fixes Quentin Perret
  2021-06-10 15:13 ` [PATCH v2 1/3] sched: Fix UCLAMP_FLAG_IDLE setting Quentin Perret
@ 2021-06-10 15:13 ` Quentin Perret
  2021-06-10 19:15   ` Peter Zijlstra
  2021-06-10 15:13 ` [PATCH v2 3/3] sched: Make uclamp changes depend on CAP_SYS_NICE Quentin Perret
  2 siblings, 1 reply; 20+ messages in thread
From: Quentin Perret @ 2021-06-10 15:13 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, qais.yousef,
	rickyiu, wvw, patrick.bellasi, xuewen.yan94
  Cc: linux-kernel, kernel-team, Quentin Perret

SCHED_FLAG_KEEP_PARAMS can be passed to sched_setattr to specify that
the call must not touch scheduling parameters (nice or priority). This
is particularly handy for uclamp when used in conjunction with
SCHED_FLAG_KEEP_POLICY, as that allows issuing a syscall that only
impacts uclamp values.

However, sched_setattr always checks first whether the priority and nice
values passed in sched_attr are valid, even if those never get used down
the line. This is useless at best, since userspace can trivially bypass
this check and set the uclamp values by specifying valid low priorities.
Doing so is cumbersome, however, as there is no single priority value
that skips both the RT and CFS checks at once. As such, userspace needs
to query the task policy first with e.g. sched_getattr and then set
sched_attr.sched_priority accordingly. This is racy and slower than a
single call.

As the priority and nice checks are useless when SCHED_FLAG_KEEP_PARAMS
is specified, simply inherit them in this case to match the policy
inheritance of SCHED_FLAG_KEEP_POLICY.

Reported-by: Wei Wang <wvw@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
---
 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3b213402798e..1d4aedbbcf96 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6585,6 +6585,10 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
 	rcu_read_unlock();
 
 	if (likely(p)) {
+		if (attr.sched_flags & SCHED_FLAG_KEEP_PARAMS) {
+			attr.sched_priority = p->rt_priority;
+			attr.sched_nice = task_nice(p);
+		}
 		retval = sched_setattr(p, &attr);
 		put_task_struct(p);
 	}
-- 
2.32.0.272.g935e593368-goog



* [PATCH v2 3/3] sched: Make uclamp changes depend on CAP_SYS_NICE
  2021-06-10 15:13 [PATCH v2 0/3] A few uclamp fixes Quentin Perret
  2021-06-10 15:13 ` [PATCH v2 1/3] sched: Fix UCLAMP_FLAG_IDLE setting Quentin Perret
  2021-06-10 15:13 ` [PATCH v2 2/3] sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS Quentin Perret
@ 2021-06-10 15:13 ` Quentin Perret
  2021-06-11 12:48   ` Qais Yousef
  2 siblings, 1 reply; 20+ messages in thread
From: Quentin Perret @ 2021-06-10 15:13 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot, dietmar.eggemann, qais.yousef,
	rickyiu, wvw, patrick.bellasi, xuewen.yan94
  Cc: linux-kernel, kernel-team, Quentin Perret

There is currently nothing preventing tasks from changing their per-task
clamp values in any way they like. The rationale is probably that
system administrators are still able to limit those clamps thanks to the
cgroup interface. However, this causes pain in a system where both
per-task and per-cgroup clamp values are expected to be under the
control of core system components (as is the case for Android).

To fix this, let's require CAP_SYS_NICE to increase per-task clamp
values. This allows unprivileged tasks to lower their requests, but not
increase them, which is consistent with the existing behaviour for nice
values.

Signed-off-by: Quentin Perret <qperret@google.com>
---
 kernel/sched/core.c | 55 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 48 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d4aedbbcf96..6e24daca8d53 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1430,6 +1430,11 @@ static int uclamp_validate(struct task_struct *p,
 	if (util_min != -1 && util_max != -1 && util_min > util_max)
 		return -EINVAL;
 
+	return 0;
+}
+
+static void uclamp_enable(void)
+{
 	/*
 	 * We have valid uclamp attributes; make sure uclamp is enabled.
 	 *
@@ -1438,8 +1443,32 @@ static int uclamp_validate(struct task_struct *p,
 	 * scheduler locks.
 	 */
 	static_branch_enable(&sched_uclamp_used);
+}
 
-	return 0;
+static bool uclamp_reduce(struct task_struct *p, const struct sched_attr *attr)
+{
+	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
+		int util_min = p->uclamp_req[UCLAMP_MIN].value;
+
+		if (attr->sched_util_min + 1 > util_min + 1)
+			return false;
+
+		if (rt_task(p) && attr->sched_util_min == -1 &&
+		    util_min < sysctl_sched_uclamp_util_min_rt_default)
+			return false;
+	}
+
+	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
+		int util_max = p->uclamp_req[UCLAMP_MAX].value;
+
+		if (attr->sched_util_max + 1 > util_max + 1)
+			return false;
+
+		if (attr->sched_util_max == -1 && util_max < uclamp_none(UCLAMP_MAX))
+			return false;
+	}
+
+	return true;
 }
 
 static bool uclamp_reset(const struct sched_attr *attr,
@@ -1580,6 +1609,11 @@ static inline int uclamp_validate(struct task_struct *p,
 {
 	return -EOPNOTSUPP;
 }
+static inline void uclamp_enable(void) { }
+static bool uclamp_reduce(struct task_struct *p, const struct sched_attr *attr)
+{
+	return true;
+}
 static void __setscheduler_uclamp(struct task_struct *p,
 				  const struct sched_attr *attr) { }
 static inline void uclamp_fork(struct task_struct *p) { }
@@ -6116,6 +6150,13 @@ static int __sched_setscheduler(struct task_struct *p,
 	    (rt_policy(policy) != (attr->sched_priority != 0)))
 		return -EINVAL;
 
+	/* Update task specific "requested" clamps */
+	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
+		retval = uclamp_validate(p, attr);
+		if (retval)
+			return retval;
+	}
+
 	/*
 	 * Allow unprivileged RT tasks to decrease priority:
 	 */
@@ -6165,6 +6206,10 @@ static int __sched_setscheduler(struct task_struct *p,
 		/* Normal users shall not reset the sched_reset_on_fork flag: */
 		if (p->sched_reset_on_fork && !reset_on_fork)
 			return -EPERM;
+
+		/* Can't increase util-clamps */
+		if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP && !uclamp_reduce(p, attr))
+			return -EPERM;
 	}
 
 	if (user) {
@@ -6176,12 +6221,8 @@ static int __sched_setscheduler(struct task_struct *p,
 			return retval;
 	}
 
-	/* Update task specific "requested" clamps */
-	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
-		retval = uclamp_validate(p, attr);
-		if (retval)
-			return retval;
-	}
+	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
+		uclamp_enable();
 
 	if (pi)
 		cpuset_read_lock();
-- 
2.32.0.272.g935e593368-goog



* Re: [PATCH v2 1/3] sched: Fix UCLAMP_FLAG_IDLE setting
  2021-06-10 15:13 ` [PATCH v2 1/3] sched: Fix UCLAMP_FLAG_IDLE setting Quentin Perret
@ 2021-06-10 19:05   ` Peter Zijlstra
  2021-06-11  7:25     ` Quentin Perret
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2021-06-10 19:05 UTC (permalink / raw)
  To: Quentin Perret
  Cc: mingo, vincent.guittot, dietmar.eggemann, qais.yousef, rickyiu,
	wvw, patrick.bellasi, xuewen.yan94, linux-kernel, kernel-team

On Thu, Jun 10, 2021 at 03:13:04PM +0000, Quentin Perret wrote:
> The UCLAMP_FLAG_IDLE flag is set on a runqueue when dequeueing the last
> active task to maintain the last uclamp.max and prevent blocked util
> from suddenly becoming visible.
> 
> However, there is an asymmetry in how the flag is set and cleared which
> can lead to having the flag set whilst there are active tasks on the rq.
> Specifically, the flag is cleared in the uclamp_rq_inc() path, which is
> called at enqueue time, but set in uclamp_rq_dec_id() which is called
> both when dequeueing a task _and_ in the update_uclamp_active() path. As
> a result, when both uclamp_rq_{dec,inc}_id() are called from
> update_uclamp_active(), the flag ends up being set but not cleared,
> hence leaving the runqueue in a broken state.
> 
> Fix this by setting the flag in the uclamp_rq_inc_id() path to ensure
> things remain symmetrical.

The code you moved is neither in uclamp_rq_inc_id(), although
uclamp_idle_reset() is called from there, nor does it _set_ the flag.

I'm thinking it's been a long warm day? ;-)

> 
> Fixes: e496187da710 ("sched/uclamp: Enforce last task's UCLAMP_MAX")
> Reported-by: Rick Yiu <rickyiu@google.com>
> Reviewed-by: Qais Yousef <qais.yousef@arm.com>
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  kernel/sched/core.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5226cc26a095..3b213402798e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -980,6 +980,7 @@ static inline void uclamp_idle_reset(struct rq *rq, enum uclamp_id clamp_id,
>  	if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
>  		return;
>  
> +	rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
>  	WRITE_ONCE(rq->uclamp[clamp_id].value, clamp_value);
>  }
>  
> @@ -1252,10 +1253,6 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>  
>  	for_each_clamp_id(clamp_id)
>  		uclamp_rq_inc_id(rq, p, clamp_id);
> -
> -	/* Reset clamp idle holding when there is one RUNNABLE task */
> -	if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
> -		rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
>  }
>  
>  static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
> -- 
> 2.32.0.272.g935e593368-goog
> 


* Re: [PATCH v2 2/3] sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS
  2021-06-10 15:13 ` [PATCH v2 2/3] sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS Quentin Perret
@ 2021-06-10 19:15   ` Peter Zijlstra
  2021-06-11  8:59     ` Quentin Perret
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2021-06-10 19:15 UTC (permalink / raw)
  To: Quentin Perret
  Cc: mingo, vincent.guittot, dietmar.eggemann, qais.yousef, rickyiu,
	wvw, patrick.bellasi, xuewen.yan94, linux-kernel, kernel-team

On Thu, Jun 10, 2021 at 03:13:05PM +0000, Quentin Perret wrote:
> SCHED_FLAG_KEEP_PARAMS can be passed to sched_setattr to specify that
> the call must not touch scheduling parameters (nice or priority). This
> is particularly handy for uclamp when used in conjunction with
> SCHED_FLAG_KEEP_POLICY as that allows to issue a syscall that only
> impacts uclamp values.
> 
> However, sched_setattr always checks whether the priorities and nice
> values passed in sched_attr are valid first, even if those never get
> used down the line. This is useless at best since userspace can
> trivially bypass this check to set the uclamp values by specifying low
> priorities. However, it is cumbersome to do so as there is no single
> expression of this that skips both RT and CFS checks at once. As such,
> userspace needs to query the task policy first with e.g. sched_getattr
> and then set sched_attr.sched_priority accordingly. This is racy and
> slower than a single call.
> 
> As the priority and nice checks are useless when SCHED_FLAG_KEEP_PARAMS
> is specified, simply inherit them in this case to match the policy
> inheritance of SCHED_FLAG_KEEP_POLICY.
> 
> Reported-by: Wei Wang <wvw@google.com>
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  kernel/sched/core.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3b213402798e..1d4aedbbcf96 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6585,6 +6585,10 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
>  	rcu_read_unlock();
>  
>  	if (likely(p)) {
> +		if (attr.sched_flags & SCHED_FLAG_KEEP_PARAMS) {
> +			attr.sched_priority = p->rt_priority;
> +			attr.sched_nice = task_nice(p);
> +		}
>  		retval = sched_setattr(p, &attr);
>  		put_task_struct(p);
>  	}

I don't like this much... afaict the KEEP_PARAMS clause in
__setscheduler() also covers the DL params, and you 'forgot' to copy
those.

Can't we short circuit the validation logic?


* Re: [PATCH v2 1/3] sched: Fix UCLAMP_FLAG_IDLE setting
  2021-06-10 19:05   ` Peter Zijlstra
@ 2021-06-11  7:25     ` Quentin Perret
  2021-06-17 15:27       ` Dietmar Eggemann
  0 siblings, 1 reply; 20+ messages in thread
From: Quentin Perret @ 2021-06-11  7:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, dietmar.eggemann, qais.yousef, rickyiu,
	wvw, patrick.bellasi, xuewen.yan94, linux-kernel, kernel-team

On Thursday 10 Jun 2021 at 21:05:12 (+0200), Peter Zijlstra wrote:
> On Thu, Jun 10, 2021 at 03:13:04PM +0000, Quentin Perret wrote:
> > The UCLAMP_FLAG_IDLE flag is set on a runqueue when dequeueing the last
> > active task to maintain the last uclamp.max and prevent blocked util
> > from suddenly becoming visible.
> > 
> > However, there is an asymmetry in how the flag is set and cleared which
> > can lead to having the flag set whilst there are active tasks on the rq.
> > Specifically, the flag is cleared in the uclamp_rq_inc() path, which is
> > called at enqueue time, but set in uclamp_rq_dec_id() which is called
> > both when dequeueing a task _and_ in the update_uclamp_active() path. As
> > a result, when both uclamp_rq_{dec,inc}_id() are called from
> > update_uclamp_active(), the flag ends up being set but not cleared,
> > hence leaving the runqueue in a broken state.
> > 
> > Fix this by setting the flag in the uclamp_rq_inc_id() path to ensure
> > things remain symmetrical.
> 
> The code you moved is neither in uclamp_rq_inc_id(), although
> uclamp_idle_reset() is called from there

Yep, that is what I was trying to say.

> nor does it _set_ the flag.

Ahem. That I don't have a good excuse for ...
> 
> I'm thinking it's been a long warm day? ;-)

Indeed :-)

Let me have another cup of coffee and try to write this again.

Thanks,
Quentin


* Re: [PATCH v2 2/3] sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS
  2021-06-10 19:15   ` Peter Zijlstra
@ 2021-06-11  8:59     ` Quentin Perret
  2021-06-11  9:07       ` Quentin Perret
  2021-06-11  9:20       ` Peter Zijlstra
  0 siblings, 2 replies; 20+ messages in thread
From: Quentin Perret @ 2021-06-11  8:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, dietmar.eggemann, qais.yousef, rickyiu,
	wvw, patrick.bellasi, xuewen.yan94, linux-kernel, kernel-team

On Thursday 10 Jun 2021 at 21:15:45 (+0200), Peter Zijlstra wrote:
> On Thu, Jun 10, 2021 at 03:13:05PM +0000, Quentin Perret wrote:
> > SCHED_FLAG_KEEP_PARAMS can be passed to sched_setattr to specify that
> > the call must not touch scheduling parameters (nice or priority). This
> > is particularly handy for uclamp when used in conjunction with
> > SCHED_FLAG_KEEP_POLICY as that allows to issue a syscall that only
> > impacts uclamp values.
> > 
> > However, sched_setattr always checks whether the priorities and nice
> > values passed in sched_attr are valid first, even if those never get
> > used down the line. This is useless at best since userspace can
> > trivially bypass this check to set the uclamp values by specifying low
> > priorities. However, it is cumbersome to do so as there is no single
> > expression of this that skips both RT and CFS checks at once. As such,
> > userspace needs to query the task policy first with e.g. sched_getattr
> > and then set sched_attr.sched_priority accordingly. This is racy and
> > slower than a single call.
> > 
> > As the priority and nice checks are useless when SCHED_FLAG_KEEP_PARAMS
> > is specified, simply inherit them in this case to match the policy
> > inheritance of SCHED_FLAG_KEEP_POLICY.
> > 
> > Reported-by: Wei Wang <wvw@google.com>
> > Signed-off-by: Quentin Perret <qperret@google.com>
> > ---
> >  kernel/sched/core.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 3b213402798e..1d4aedbbcf96 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6585,6 +6585,10 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
> >  	rcu_read_unlock();
> >  
> >  	if (likely(p)) {
> > +		if (attr.sched_flags & SCHED_FLAG_KEEP_PARAMS) {
> > +			attr.sched_priority = p->rt_priority;
> > +			attr.sched_nice = task_nice(p);
> > +		}
> >  		retval = sched_setattr(p, &attr);
> >  		put_task_struct(p);
> >  	}
> 
> I don't like this much... afaict the KEEP_PARAMS clause in
> __setscheduler() also covers the DL params, and you 'forgot' to copy
> those.
>
> Can't we short circuit the validation logic?

I think we can but I didn't like the look of it, because we end up
sprinkling checks all over the place. KEEP_PARAMS doesn't imply
KEEP_POLICY IIUC, and the policy and params checks are all mixed up.

But maybe that wants fixing too? I guess it could make sense to switch
policies without touching the params in some cases (e.g switching
between FIFO and RR, or BATCH and NORMAL), but I'm not sure what that
would mean for cross-sched_class transitions.


* Re: [PATCH v2 2/3] sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS
  2021-06-11  8:59     ` Quentin Perret
@ 2021-06-11  9:07       ` Quentin Perret
  2021-06-11  9:20       ` Peter Zijlstra
  1 sibling, 0 replies; 20+ messages in thread
From: Quentin Perret @ 2021-06-11  9:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, dietmar.eggemann, qais.yousef, rickyiu,
	wvw, patrick.bellasi, xuewen.yan94, linux-kernel, kernel-team

On Friday 11 Jun 2021 at 08:59:25 (+0000), Quentin Perret wrote:
> On Thursday 10 Jun 2021 at 21:15:45 (+0200), Peter Zijlstra wrote:
> > On Thu, Jun 10, 2021 at 03:13:05PM +0000, Quentin Perret wrote:
> > > SCHED_FLAG_KEEP_PARAMS can be passed to sched_setattr to specify that
> > > the call must not touch scheduling parameters (nice or priority). This
> > > is particularly handy for uclamp when used in conjunction with
> > > SCHED_FLAG_KEEP_POLICY as that allows to issue a syscall that only
> > > impacts uclamp values.
> > > 
> > > However, sched_setattr always checks whether the priorities and nice
> > > values passed in sched_attr are valid first, even if those never get
> > > used down the line. This is useless at best since userspace can
> > > trivially bypass this check to set the uclamp values by specifying low
> > > priorities. However, it is cumbersome to do so as there is no single
> > > expression of this that skips both RT and CFS checks at once. As such,
> > > userspace needs to query the task policy first with e.g. sched_getattr
> > > and then set sched_attr.sched_priority accordingly. This is racy and
> > > slower than a single call.
> > > 
> > > As the priority and nice checks are useless when SCHED_FLAG_KEEP_PARAMS
> > > is specified, simply inherit them in this case to match the policy
> > > inheritance of SCHED_FLAG_KEEP_POLICY.
> > > 
> > > Reported-by: Wei Wang <wvw@google.com>
> > > Signed-off-by: Quentin Perret <qperret@google.com>
> > > ---
> > >  kernel/sched/core.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 3b213402798e..1d4aedbbcf96 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -6585,6 +6585,10 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
> > >  	rcu_read_unlock();
> > >  
> > >  	if (likely(p)) {
> > > +		if (attr.sched_flags & SCHED_FLAG_KEEP_PARAMS) {
> > > +			attr.sched_priority = p->rt_priority;
> > > +			attr.sched_nice = task_nice(p);
> > > +		}
> > >  		retval = sched_setattr(p, &attr);
> > >  		put_task_struct(p);
> > >  	}
> > 
> > I don't like this much... afaict the KEEP_PARAMS clause in
> > __setscheduler() also covers the DL params, and you 'forgot' to copy
> > those.
> >
> > Can't we short circuit the validation logic?
> 
> I think we can but I didn't like the look of it, because we end up
> sprinkling checks all over the place. KEEP_PARAMS doesn't imply
> KEEP_POLICY IIUC, and the policy and params checks are all mixed up.
> 
> But maybe that wants fixing too? I guess it could make sense to switch
> policies without touching the params in some cases (e.g switching
> between FIFO and RR, or BATCH and NORMAL), but I'm not sure what that
> would mean for cross-sched_class transitions.

Aha, policy transitions are actually blocked in __setscheduler if
KEEP_PARAMS is set, so KEEP_PARAMS does imply KEEP_POLICY. So skipping
the checks might not be too bad, I'll have a go at it.


* Re: [PATCH v2 2/3] sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS
  2021-06-11  8:59     ` Quentin Perret
  2021-06-11  9:07       ` Quentin Perret
@ 2021-06-11  9:20       ` Peter Zijlstra
  1 sibling, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2021-06-11  9:20 UTC (permalink / raw)
  To: Quentin Perret
  Cc: mingo, vincent.guittot, dietmar.eggemann, qais.yousef, rickyiu,
	wvw, patrick.bellasi, xuewen.yan94, linux-kernel, kernel-team

On Fri, Jun 11, 2021 at 08:59:25AM +0000, Quentin Perret wrote:
> On Thursday 10 Jun 2021 at 21:15:45 (+0200), Peter Zijlstra wrote:
> > On Thu, Jun 10, 2021 at 03:13:05PM +0000, Quentin Perret wrote:
> > > SCHED_FLAG_KEEP_PARAMS can be passed to sched_setattr to specify that
> > > the call must not touch scheduling parameters (nice or priority). This
> > > is particularly handy for uclamp when used in conjunction with
> > > SCHED_FLAG_KEEP_POLICY as that allows to issue a syscall that only
> > > impacts uclamp values.
> > > 
> > > However, sched_setattr always checks whether the priorities and nice
> > > values passed in sched_attr are valid first, even if those never get
> > > used down the line. This is useless at best since userspace can
> > > trivially bypass this check to set the uclamp values by specifying low
> > > priorities. However, it is cumbersome to do so as there is no single
> > > expression of this that skips both RT and CFS checks at once. As such,
> > > userspace needs to query the task policy first with e.g. sched_getattr
> > > and then set sched_attr.sched_priority accordingly. This is racy and
> > > slower than a single call.
> > > 
> > > As the priority and nice checks are useless when SCHED_FLAG_KEEP_PARAMS
> > > is specified, simply inherit them in this case to match the policy
> > > inheritance of SCHED_FLAG_KEEP_POLICY.
> > > 
> > > Reported-by: Wei Wang <wvw@google.com>
> > > Signed-off-by: Quentin Perret <qperret@google.com>
> > > ---
> > >  kernel/sched/core.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 3b213402798e..1d4aedbbcf96 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -6585,6 +6585,10 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
> > >  	rcu_read_unlock();
> > >  
> > >  	if (likely(p)) {
> > > +		if (attr.sched_flags & SCHED_FLAG_KEEP_PARAMS) {
> > > +			attr.sched_priority = p->rt_priority;
> > > +			attr.sched_nice = task_nice(p);
> > > +		}
> > >  		retval = sched_setattr(p, &attr);
> > >  		put_task_struct(p);
> > >  	}
> > 
> > I don't like this much... afaict the KEEP_PARAMS clause in
> > __setscheduler() also covers the DL params, and you 'forgot' to copy
> > those.
> >
> > Can't we short circuit the validation logic?
> 
> I think we can but I didn't like the look of it, because we end up
> sprinkling checks all over the place. KEEP_PARAMS doesn't imply
> KEEP_POLICY IIUC, and the policy and params checks are all mixed up.
> 
> But maybe that wants fixing too? 

If you can make that code nicer, I'm all for it, it's a bit of a mess.

But failing that, I suppose the alternative is extracting something like
get_params from sched_getattr() and sharing that bit of code to do what
you do above.

> I guess it could make sense to switch
> policies without touching the params in some cases (e.g switching
> between FIFO and RR, or BATCH and NORMAL), but I'm not sure what that
> would mean for cross-sched_class transitions.

You're right, cross-class needs both.


* Re: [PATCH v2 3/3] sched: Make uclamp changes depend on CAP_SYS_NICE
  2021-06-10 15:13 ` [PATCH v2 3/3] sched: Make uclamp changes depend on CAP_SYS_NICE Quentin Perret
@ 2021-06-11 12:48   ` Qais Yousef
  2021-06-11 13:08     ` Quentin Perret
  0 siblings, 1 reply; 20+ messages in thread
From: Qais Yousef @ 2021-06-11 12:48 UTC (permalink / raw)
  To: Quentin Perret
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann, rickyiu, wvw,
	patrick.bellasi, xuewen.yan94, linux-kernel, kernel-team

On 06/10/21 15:13, Quentin Perret wrote:
> There is currently nothing preventing tasks from changing their per-task
> clamp values in anyway that they like. The rationale is probably that
> system administrators are still able to limit those clamps thanks to the
> cgroup interface. However, this causes pain in a system where both
> per-task and per-cgroup clamp values are expected to be under the
> control of core system components (as is the case for Android).
> 
> To fix this, let's require CAP_SYS_NICE to increase per-task clamp
> values. This allows unprivileged tasks to lower their requests, but not
> increase them, which is consistent with the existing behaviour for nice
> values.

Hmmm. I'm not in favour of this.

So uclamp is a performance and power management mechanism; it has no impact
on fairness AFAICT, so making it a privileged operation doesn't make sense.

We had a thought about this in the past and we didn't think there's any harm if
a task (app) wants to self manage. Yes a task could ask to run at max
performance and waste power, but anyone can generate a busy loop and waste
power too.

Now that doesn't mean your use case is not valid. I agree if there's a system
wide framework that wants to explicitly manage performance and power of tasks
via uclamp, then we can end up with 2 layers of controls overriding each
others.

Would it make more sense to have a procfs/sysfs flag that is disabled by
default that allows sys-admin to enforce a privileged uclamp access?

Something like

	/proc/sys/kernel/sched_uclamp_privileged

I think both usage scenarios are valid and giving sys-admins the power to
enforce a behavior makes more sense for me.

Unless there's a real concern in terms of security/fairness that we missed?


Cheers

--
Qais Yousef


* Re: [PATCH v2 3/3] sched: Make uclamp changes depend on CAP_SYS_NICE
  2021-06-11 12:48   ` Qais Yousef
@ 2021-06-11 13:08     ` Quentin Perret
  2021-06-11 13:26       ` Qais Yousef
  0 siblings, 1 reply; 20+ messages in thread
From: Quentin Perret @ 2021-06-11 13:08 UTC (permalink / raw)
  To: Qais Yousef
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann, rickyiu, wvw,
	patrick.bellasi, xuewen.yan94, linux-kernel, kernel-team

Hi Qais,

On Friday 11 Jun 2021 at 13:48:20 (+0100), Qais Yousef wrote:
> On 06/10/21 15:13, Quentin Perret wrote:
> > There is currently nothing preventing tasks from changing their per-task
> > clamp values in any way that they like. The rationale is probably that
> > system administrators are still able to limit those clamps thanks to the
> > cgroup interface. However, this causes pain in a system where both
> > per-task and per-cgroup clamp values are expected to be under the
> > control of core system components (as is the case for Android).
> > 
> > To fix this, let's require CAP_SYS_NICE to increase per-task clamp
> > values. This allows unprivileged tasks to lower their requests, but not
> > increase them, which is consistent with the existing behaviour for nice
> > values.
> 
> Hmmm. I'm not in favour of this.
> 
> So uclamp is a performance and power management mechanism, it has no impact on
> fairness AFAICT, so it being a privileged operation doesn't make sense.
> 
> We had a thought about this in the past and we didn't think there's any harm if
> a task (app) wants to self manage. Yes a task could ask to run at max
> performance and waste power, but anyone can generate a busy loop and waste
> power too.
> 
> Now that doesn't mean your use case is not valid. I agree if there's a system
> wide framework that wants to explicitly manage performance and power of tasks
> via uclamp, then we can end up with 2 layers of controls overriding each
> other.

Right, that's the main issue. Also, the reality is that most of the time the
'right' clamps are platform-dependent, so most userspace apps are simply
not equipped to decide what their own clamps should be.

> Would it make more sense to have a procfs/sysfs flag that is disabled by
> default that allows sys-admin to enforce a privileged uclamp access?
> 
> Something like
> 
> 	/proc/sys/kernel/sched_uclamp_privileged

Hmm, dunno, I'm not aware of anything else having a behaviour like that,
so that feels a bit odd.

> I think both usage scenarios are valid and giving sys-admins the power to
> enforce a behavior makes more sense for me.

Yes, I wouldn't mind something like that in general. I originally wanted
to suggest introducing a dedicated capability for uclamp, but that felt
a bit overkill. Now if others think this should be the way to go I'm
happy to go implement it.

Thanks,
Quentin


* Re: [PATCH v2 3/3] sched: Make uclamp changes depend on CAP_SYS_NICE
  2021-06-11 13:08     ` Quentin Perret
@ 2021-06-11 13:26       ` Qais Yousef
  2021-06-11 13:49         ` Quentin Perret
  0 siblings, 1 reply; 20+ messages in thread
From: Qais Yousef @ 2021-06-11 13:26 UTC (permalink / raw)
  To: Quentin Perret
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann, rickyiu, wvw,
	patrick.bellasi, xuewen.yan94, linux-kernel, kernel-team

Hi Quentin

On 06/11/21 13:08, Quentin Perret wrote:
> Hi Qais,
> 
> On Friday 11 Jun 2021 at 13:48:20 (+0100), Qais Yousef wrote:
> > On 06/10/21 15:13, Quentin Perret wrote:
> > > There is currently nothing preventing tasks from changing their per-task
> > > clamp values in any way that they like. The rationale is probably that
> > > system administrators are still able to limit those clamps thanks to the
> > > cgroup interface. However, this causes pain in a system where both
> > > per-task and per-cgroup clamp values are expected to be under the
> > > control of core system components (as is the case for Android).
> > > 
> > > To fix this, let's require CAP_SYS_NICE to increase per-task clamp
> > > values. This allows unprivileged tasks to lower their requests, but not
> > > increase them, which is consistent with the existing behaviour for nice
> > > values.
> > 
> > Hmmm. I'm not in favour of this.
> > 
> > So uclamp is a performance and power management mechanism, it has no impact on
> > fairness AFAICT, so it being a privileged operation doesn't make sense.
> > 
> > We had a thought about this in the past and we didn't think there's any harm if
> > a task (app) wants to self manage. Yes a task could ask to run at max
> > performance and waste power, but anyone can generate a busy loop and waste
> > power too.
> > 
> > Now that doesn't mean your use case is not valid. I agree if there's a system
> > wide framework that wants to explicitly manage performance and power of tasks
> > via uclamp, then we can end up with 2 layers of controls overriding each
> > other.
> 
> Right, that's the main issue. Also, the reality is that most of the time the
> 'right' clamps are platform-dependent, so most userspace apps are simply
> not equipped to decide what their own clamps should be.

I'd argue this is true from both a framework and an app point of view. It depends
on the application and how it would be used.

I can foresee, for example, an HTTP server wanting to use uclamp to guarantee
a QoS target, ie: X number of requests per second or a maximum of Y tail
latency. The application can try to tune (calibrate) itself without having to
have the whole system tuned or pumped up on steroids.

Or a framework could manage this on behalf of the application. Both can use
uclamp with a feedback loop to calibrate the perf requirement of the tasks to
meet a given perf/power criteria.
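For reference, the per-task interface such a feedback loop would use is
sched_setattr() with the util-clamp fields of struct sched_attr. A userspace
sketch (struct layout and flag values as found in the uapi headers of this era;
the actual call would be syscall(SYS_sched_setattr, 0, &attr, 0), omitted here):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Flag values from include/uapi/linux/sched.h. */
#define SCHED_FLAG_KEEP_POLICY		0x08
#define SCHED_FLAG_KEEP_PARAMS		0x10
#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX	0x40

/* Mirror of struct sched_attr up to the uclamp fields. */
struct sched_attr_u {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

/* Build an attr that only updates the caller's clamps, keeping the
 * scheduling policy and parameters untouched. */
static void uclamp_fill_attr(struct sched_attr_u *attr,
			     uint32_t umin, uint32_t umax)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->sched_flags = SCHED_FLAG_KEEP_POLICY |
			    SCHED_FLAG_KEEP_PARAMS |
			    SCHED_FLAG_UTIL_CLAMP_MIN |
			    SCHED_FLAG_UTIL_CLAMP_MAX;
	attr->sched_util_min = umin;
	attr->sched_util_max = umax;
}
```

The KEEP_POLICY/KEEP_PARAMS flags are what patch 2/3 of this series touches:
they let a caller adjust clamps without re-validating priority.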

If you want to do a static management, system framework would make more sense
in this case, true.

> 
> > Would it make more sense to have a procfs/sysfs flag that is disabled by
> > default that allows sys-admin to enforce a privileged uclamp access?
> > 
> > Something like
> > 
> > 	/proc/sys/kernel/sched_uclamp_privileged
> 
> Hmm, dunno, I'm not aware of anything else having a behaviour like that,
> so that feels a bit odd.

I think /proc/sys/kernel/perf_event_paranoid falls into this category.

> 
> > I think both usage scenarios are valid and giving sys-admins the power to
> > enforce a behavior makes more sense for me.
> 
> Yes, I wouldn't mind something like that in general. I originally wanted
> to suggest introducing a dedicated capability for uclamp, but that felt
> a bit overkill. Now if others think this should be the way to go I'm
> happy to go implement it.

Would be good to hear what others think for sure :)


Cheers

--
Qais Yousef


* Re: [PATCH v2 3/3] sched: Make uclamp changes depend on CAP_SYS_NICE
  2021-06-11 13:26       ` Qais Yousef
@ 2021-06-11 13:49         ` Quentin Perret
  2021-06-11 14:17           ` Qais Yousef
  0 siblings, 1 reply; 20+ messages in thread
From: Quentin Perret @ 2021-06-11 13:49 UTC (permalink / raw)
  To: Qais Yousef
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann, rickyiu, wvw,
	patrick.bellasi, xuewen.yan94, linux-kernel, kernel-team

On Friday 11 Jun 2021 at 14:26:53 (+0100), Qais Yousef wrote:
> Hi Quentin
> 
> On 06/11/21 13:08, Quentin Perret wrote:
> > Hi Qais,
> > 
> > On Friday 11 Jun 2021 at 13:48:20 (+0100), Qais Yousef wrote:
> > > On 06/10/21 15:13, Quentin Perret wrote:
> > > > There is currently nothing preventing tasks from changing their per-task
> > > > clamp values in any way that they like. The rationale is probably that
> > > > system administrators are still able to limit those clamps thanks to the
> > > > cgroup interface. However, this causes pain in a system where both
> > > > per-task and per-cgroup clamp values are expected to be under the
> > > > control of core system components (as is the case for Android).
> > > > 
> > > > To fix this, let's require CAP_SYS_NICE to increase per-task clamp
> > > > values. This allows unprivileged tasks to lower their requests, but not
> > > > increase them, which is consistent with the existing behaviour for nice
> > > > values.
> > > 
> > > Hmmm. I'm not in favour of this.
> > > 
> > > So uclamp is a performance and power management mechanism, it has no impact on
> > > fairness AFAICT, so it being a privileged operation doesn't make sense.
> > > 
> > > We had a thought about this in the past and we didn't think there's any harm if
> > > a task (app) wants to self manage. Yes a task could ask to run at max
> > > performance and waste power, but anyone can generate a busy loop and waste
> > > power too.
> > > 
> > > Now that doesn't mean your use case is not valid. I agree if there's a system
> > > wide framework that wants to explicitly manage performance and power of tasks
> > > via uclamp, then we can end up with 2 layers of controls overriding each
> > > other.
> > 
> > Right, that's the main issue. Also, the reality is that most of the time the
> > 'right' clamps are platform-dependent, so most userspace apps are simply
> > not equipped to decide what their own clamps should be.
> 
> I'd argue this is true from both a framework and an app point of view. It depends
> on the application and how it would be used.
> 
> I can foresee, for example, an HTTP server wanting to use uclamp to guarantee
> a QoS target, ie: X number of requests per second or a maximum of Y tail
> latency. The application can try to tune (calibrate) itself without having to
> have the whole system tuned or pumped up on steroids.

Right, but the problem I see with this approach is that the app only
understands its own performance, but is unable to decide what is best for
the overall system health.

Anyway, it sounds like we agree that having _some_ way of limiting this
would be useful, so we're all good.

> Or a framework could manage this on behalf of the application. Both can use
> uclamp with a feedback loop to calibrate the perf requirement of the tasks to
> meet a given perf/power criteria.
> 
> If you want to do a static management, system framework would make more sense
> in this case, true.
> 
> > 
> > > Would it make more sense to have a procfs/sysfs flag that is disabled by
> > > default that allows sys-admin to enforce a privileged uclamp access?
> > > 
> > > Something like
> > > 
> > > 	/proc/sys/kernel/sched_uclamp_privileged
> > 
> > Hmm, dunno, I'm not aware of anything else having a behaviour like that,
> > so that feels a bit odd.
> 
> I think /proc/sys/kernel/perf_event_paranoid falls into this category.

Aha, so I'm guessing this was introduced as a sysfs knob rather than a
CAP because it is a non-binary knob, but it's an interesting example.

> > 
> > > I think both usage scenarios are valid and giving sys-admins the power to
> > > enforce a behavior makes more sense for me.
> > 
> > Yes, I wouldn't mind something like that in general. I originally wanted
> > to suggest introducing a dedicated capability for uclamp, but that felt
> > a bit overkill. Now if others think this should be the way to go I'm
> > happy to go implement it.
> 
> Would be good to hear what others think for sure :)

Thinking about it a bit more, a more involved option would be to have
this patch as is, but to also introduce a new RLIMIT_UCLAMP on top of
it. The semantics could be:

  - if the clamp requested by the non-privileged task is lower than its
    existing clamp, then allow;
  - otherwise, if the requested clamp is less than UCLAMP_RLIMIT, then
    allow;
  - otherwise, deny,

And the same principle would apply to both uclamp.min and uclamp.max,
and UCLAMP_RLIMIT would default to 0.
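To make those semantics concrete, a small model in plain C (UCLAMP_RLIMIT is
hypothetical at this point, so this is a sketch of the proposed policy rather
than kernel code; the exact boundary conditions are a detail):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the proposed check, applied identically to uclamp.min
 * and uclamp.max: lowering an existing clamp is always allowed,
 * raising it is allowed only up to the rlimit. With the rlimit
 * defaulting to 0, raising clamps is denied by default.
 */
static bool uclamp_request_allowed(unsigned int cur, unsigned int req,
				   unsigned int rlimit)
{
	if (req <= cur)		/* lowering (or keeping) the clamp */
		return true;
	return req <= rlimit;	/* raising, capped by UCLAMP_RLIMIT */
}
```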

Thoughts?

Thanks,
Quentin


* Re: [PATCH v2 3/3] sched: Make uclamp changes depend on CAP_SYS_NICE
  2021-06-11 13:49         ` Quentin Perret
@ 2021-06-11 14:17           ` Qais Yousef
  2021-06-11 14:43             ` Quentin Perret
  0 siblings, 1 reply; 20+ messages in thread
From: Qais Yousef @ 2021-06-11 14:17 UTC (permalink / raw)
  To: Quentin Perret
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann, rickyiu, wvw,
	patrick.bellasi, xuewen.yan94, linux-kernel, kernel-team

On 06/11/21 13:49, Quentin Perret wrote:
> On Friday 11 Jun 2021 at 14:26:53 (+0100), Qais Yousef wrote:
> > Hi Quentin
> > 
> > On 06/11/21 13:08, Quentin Perret wrote:
> > > Hi Qais,
> > > 
> > > On Friday 11 Jun 2021 at 13:48:20 (+0100), Qais Yousef wrote:
> > > > On 06/10/21 15:13, Quentin Perret wrote:
> > > > > There is currently nothing preventing tasks from changing their per-task
> > > > > clamp values in any way that they like. The rationale is probably that
> > > > > system administrators are still able to limit those clamps thanks to the
> > > > > cgroup interface. However, this causes pain in a system where both
> > > > > per-task and per-cgroup clamp values are expected to be under the
> > > > > control of core system components (as is the case for Android).
> > > > > 
> > > > > To fix this, let's require CAP_SYS_NICE to increase per-task clamp
> > > > > values. This allows unprivileged tasks to lower their requests, but not
> > > > > increase them, which is consistent with the existing behaviour for nice
> > > > > values.
> > > > 
> > > > Hmmm. I'm not in favour of this.
> > > > 
> > > > So uclamp is a performance and power management mechanism, it has no impact on
> > > > fairness AFAICT, so it being a privileged operation doesn't make sense.
> > > > 
> > > > We had a thought about this in the past and we didn't think there's any harm if
> > > > a task (app) wants to self manage. Yes a task could ask to run at max
> > > > performance and waste power, but anyone can generate a busy loop and waste
> > > > power too.
> > > > 
> > > > Now that doesn't mean your use case is not valid. I agree if there's a system
> > > > wide framework that wants to explicitly manage performance and power of tasks
> > > > via uclamp, then we can end up with 2 layers of controls overriding each
> > > > other.
> > > 
> > > Right, that's the main issue. Also, the reality is that most of the time the
> > > 'right' clamps are platform-dependent, so most userspace apps are simply
> > > not equipped to decide what their own clamps should be.
> > 
> > I'd argue this is true from both a framework and an app point of view. It depends
> > on the application and how it would be used.
> > 
> > I can foresee, for example, an HTTP server wanting to use uclamp to guarantee
> > a QoS target, ie: X number of requests per second or a maximum of Y tail
> > latency. The application can try to tune (calibrate) itself without having to
> > have the whole system tuned or pumped up on steroids.
> 
> Right, but the problem I see with this approach is that the app only
> understands its own performance, but is unable to decide what is best for
> the overall system health.

How do you define the overall system health here? If the app causes the
system to heat up massively, it can do that with or without uclamp, no?

The app has a better understanding of what tasks it creates, what's important
and what's not, and when certain tasks need an extra boost. How would
the system know that without some cooperation from the app anyway?

If you were referring to what would happen if two apps are running
simultaneously: in my view the problem is no worse than what would happen
without them using uclamp. They could trip over each other in both cases. You
can actually argue that if they are both doing a good job at self-regulating
you could end up with a better overall result when using uclamp.

I agree one can do smarter things via a framework. I just don't find it a good
reason to limit the other use cases too.

> 
> Anyway, it sounds like we agree that having _some_ way of limiting this
> would be useful, so we're all good.

Yes. If there's a framework that is smart and wants to be the master controller
of managing uclamp requests, then it makes sense for it to have a way to ensure
apps can't try to escape what it has imposed. Otherwise it's chaos.

> 
> > Or a framework could manage this on behalf of the application. Both can use
> > uclamp with a feedback loop to calibrate the perf requirement of the tasks to
> > meet a given perf/power criteria.
> > 
> > If you want to do a static management, system framework would make more sense
> > in this case, true.
> > 
> > > 
> > > > Would it make more sense to have a procfs/sysfs flag that is disabled by
> > > > default that allows sys-admin to enforce a privileged uclamp access?
> > > > 
> > > > Something like
> > > > 
> > > > 	/proc/sys/kernel/sched_uclamp_privileged
> > > 
> > > Hmm, dunno, I'm not aware of anything else having a behaviour like that,
> > > so that feels a bit odd.
> > 
> > I think /proc/sys/kernel/perf_event_paranoid falls into this category.
> 
> Aha, so I'm guessing this was introduced as a sysfs knob rather than a
> CAP because it is a non-binary knob, but it's an interesting example.
> 
> > > 
> > > > I think both usage scenarios are valid and giving sys-admins the power to
> > > > enforce a behavior makes more sense for me.
> > > 
> > > Yes, I wouldn't mind something like that in general. I originally wanted
> > > to suggest introducing a dedicated capability for uclamp, but that felt
> > > a bit overkill. Now if others think this should be the way to go I'm
> > > happy to go implement it.
> > 
> > Would be good to hear what others think for sure :)
> 
> Thinking about it a bit more, a more involved option would be to have
> this patch as is, but to also introduce a new RLIMIT_UCLAMP on top of
> it. The semantics could be:
> 
>   - if the clamp requested by the non-privileged task is lower than its
>     existing clamp, then allow;
>   - otherwise, if the requested clamp is less than UCLAMP_RLIMIT, then
>     allow;
>   - otherwise, deny,
> 
> And the same principle would apply to both uclamp.min and uclamp.max,
> and UCLAMP_RLIMIT would default to 0.
> 
> Thoughts?

That could work. But then I'd prefer your patch to go as-is. I don't think
uclamp can do with this extra complexity in using it.

We basically want to specify whether we want to be paranoid about uclamp CAP or
not. In my view that is simple, and I can't see why it would be a big deal to
have a procfs entry to define the level of paranoia the system wants to impose.
If it is a big deal though (I would love to hear the arguments), requiring apps
that want to self-regulate to have CAP_SYS_NICE is a better approach. Though
I'd still prefer to keep uclamp ubiquitous and not enforce a specific usage
pattern.

Cheers

--
Qais Yousef


* Re: [PATCH v2 3/3] sched: Make uclamp changes depend on CAP_SYS_NICE
  2021-06-11 14:17           ` Qais Yousef
@ 2021-06-11 14:43             ` Quentin Perret
  2021-06-14 15:03               ` Qais Yousef
  0 siblings, 1 reply; 20+ messages in thread
From: Quentin Perret @ 2021-06-11 14:43 UTC (permalink / raw)
  To: Qais Yousef
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann, rickyiu, wvw,
	patrick.bellasi, xuewen.yan94, linux-kernel, kernel-team

On Friday 11 Jun 2021 at 15:17:37 (+0100), Qais Yousef wrote:
> On 06/11/21 13:49, Quentin Perret wrote:
> > Thinking about it a bit more, a more involved option would be to have
> > this patch as is, but to also introduce a new RLIMIT_UCLAMP on top of
> > it. The semantics could be:
> > 
> >   - if the clamp requested by the non-privileged task is lower than its
> >     existing clamp, then allow;
> >   - otherwise, if the requested clamp is less than UCLAMP_RLIMIT, then
> >     allow;
> >   - otherwise, deny,
> > 
> > And the same principle would apply to both uclamp.min and uclamp.max,
> > and UCLAMP_RLIMIT would default to 0.
> > 
> > Thoughts?
> 
> That could work. But then I'd prefer your patch to go as-is. I don't think
> uclamp can do with this extra complexity in using it.

Sorry I'm not sure what you mean here?

> We basically want to specify we want to be paranoid about uclamp CAP or not. In
> my view that is simple and can't see why it would be a big deal to have
> a procfs entry to define the level of paranoia the system wants to impose. If
> it is a big deal though (would love to hear the arguments);

Not saying it's a big deal, but I think there are a few arguments in
favor of using rlimit instead of a sysfs knob. It allows for a much
finer-grained configuration -- constraints can be set per-task as well as
system-wide if needed, and it is the standard way of limiting resources
that tasks can ask for.

> requiring apps that
> want to self regulate to have CAP_SYS_NICE is better approach.

Rlimit wouldn't require that though, which is also nice as CAP_SYS_NICE
grants you a lot more power than just clamps ...


* Re: [PATCH v2 3/3] sched: Make uclamp changes depend on CAP_SYS_NICE
  2021-06-11 14:43             ` Quentin Perret
@ 2021-06-14 15:03               ` Qais Yousef
  2021-06-21 10:52                 ` Quentin Perret
  0 siblings, 1 reply; 20+ messages in thread
From: Qais Yousef @ 2021-06-14 15:03 UTC (permalink / raw)
  To: Quentin Perret
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann, rickyiu, wvw,
	patrick.bellasi, xuewen.yan94, linux-kernel, kernel-team

On 06/11/21 14:43, Quentin Perret wrote:
> On Friday 11 Jun 2021 at 15:17:37 (+0100), Qais Yousef wrote:
> > On 06/11/21 13:49, Quentin Perret wrote:
> > > Thinking about it a bit more, a more involved option would be to have
> > > this patch as is, but to also introduce a new RLIMIT_UCLAMP on top of
> > > it. The semantics could be:
> > > 
> > >   - if the clamp requested by the non-privileged task is lower than its
> > >     existing clamp, then allow;
> > >   - otherwise, if the requested clamp is less than UCLAMP_RLIMIT, then
> > >     allow;
> > >   - otherwise, deny,
> > > 
> > > And the same principle would apply to both uclamp.min and uclamp.max,
> > > and UCLAMP_RLIMIT would default to 0.
> > > 
> > > Thoughts?
> > 
> > That could work. But then I'd prefer your patch to go as-is. I don't think
> > uclamp can do with this extra complexity in using it.
> 
> Sorry I'm not sure what you mean here?

Hmm. I understood this as a new flag to the sched_setattr() syscall at first,
but now I get it. You want to use the getrlimit()/setrlimit()/prlimit() API to
impose a restriction. My comment was in regard to this being a syscall
extension, which it isn't. So please ignore it.

> 
> > We basically want to specify we want to be paranoid about uclamp CAP or not. In
> > my view that is simple and can't see why it would be a big deal to have
> > a procfs entry to define the level of paranoia the system wants to impose. If
> > it is a big deal though (would love to hear the arguments);
> 
> Not saying it's a big deal, but I think there are a few arguments in
> favor of using rlimit instead of a sysfs knob. It allows for a much
> finer grain configuration  -- constraints can be set per-task as well as
> system wide if needed, and it is the standard way of limiting resources
> that tasks can ask for.

Is it system wide or per user?

> 
> > requiring apps that
> > want to self regulate to have CAP_SYS_NICE is better approach.
> 
> Rlimit wouldn't require that though, which is also nice as CAP_SYS_NICE
> grants you a lot more power than just clamps ...

Now I better understand your suggestion. It seems a viable option I agree.
I need to digest it more still though. The devil is in the details :)

Shouldn't the default be RLIM_INFINITY? ie: no limit?

We will need to add two limits, RLIMIT_UCLAMP_MIN/MAX, right?

We have the following hierarchy now:

	1. System Wide (/proc/sys/kernel/sched_util_clamp_min/max)
	2. Cgroup
	3. Per-Task

In that order of priority where 1 limits/overrides 2 and 3. And
2 limits/overrides 3.
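Roughly, the layers compose by successive restriction. A simplified model for
one clamp value (this ignores the cgroup hierarchy walk the kernel actually
performs in its effective-value resolution, so treat it as a sketch of the
priority order only):

```c
#include <assert.h>

/*
 * Simplified model of the restriction order above: the system-wide
 * knob (1) limits the cgroup (2) and per-task (3) values, and the
 * cgroup value limits the per-task one.
 */
static unsigned int uclamp_effective(unsigned int task_val,
				     unsigned int cgroup_val,
				     unsigned int syswide_val)
{
	unsigned int v = task_val;

	if (v > cgroup_val)	/* 2 limits/overrides 3 */
		v = cgroup_val;
	if (v > syswide_val)	/* 1 limits/overrides 2 and 3 */
		v = syswide_val;
	return v;
}
```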

Where do you see the RLIMIT fit in this hierarchy? It should be between 2 and
3, right? Cgroup settings should still win even if the user/processes were
limited?

If the framework decides a user can't request any boost at all (can't increase
its uclamp_min above 0), then IIUC setting the hard limit of RLIMIT_UCLAMP_MIN
to 0 would achieve that, right?

Since the framework and the task itself would go through the same
sched_setattr() call, how would the framework circumvent this limit? IIUC it
has to raise the RLIMIT_UCLAMP_MIN first then perform sched_setattr() to
request the boost value, right? Would this overhead be acceptable? It looks
considerable to me.

Also, will prlimit() allow you to go outside what was set for the user via
setrlimit()? Reading the man pages it seems to override, so that should be
fine.

For 1 (System Wide) limits, sched_setattr() requests are accepted, but the
effective uclamp is *capped by* the system wide limit.

Were you thinking RLIMIT_UCLAMP* will behave similarly? If they do, we have
consistent behavior with how the current system wide limits work; but this will
break your use case because tasks can change the requested uclamp value for
a task, albeit the effective value will be limited.

	RLIMIT_UCLAMP_MIN=512
	p->uclamp[UCLAMP_MIN] = 800	// this request is allowed but
					// Effective UCLAMP_MIN = 512

If not, then

	RLIMIT_UCLAMP_MIN=no limit
	p->uclamp[UCLAMP_MIN] = 800	// task changed its uclamp_min to 800
	RLIMIT_UCLAMP_MIN=512		// limit was lowered for task/user

what will happen to p->uclamp[UCLAMP_MIN] in this case? Will it be lowered to
match the new limit? That would be inconsistent with the current system wide
limits we already have.

Sorry, too many questions; I was mainly thinking out loud. I need to spend more
time digging into the details of how RLIMITs are imposed to understand how this
could be a good fit. I already see some friction points that need more
thinking.

Thanks

--
Qais Yousef


* Re: [PATCH v2 1/3] sched: Fix UCLAMP_FLAG_IDLE setting
  2021-06-11  7:25     ` Quentin Perret
@ 2021-06-17 15:27       ` Dietmar Eggemann
  2021-06-21 10:57         ` Quentin Perret
  0 siblings, 1 reply; 20+ messages in thread
From: Dietmar Eggemann @ 2021-06-17 15:27 UTC (permalink / raw)
  To: Quentin Perret, Peter Zijlstra
  Cc: mingo, vincent.guittot, qais.yousef, rickyiu, wvw,
	patrick.bellasi, xuewen.yan94, linux-kernel, kernel-team

On 11/06/2021 09:25, Quentin Perret wrote:
> On Thursday 10 Jun 2021 at 21:05:12 (+0200), Peter Zijlstra wrote:
>> On Thu, Jun 10, 2021 at 03:13:04PM +0000, Quentin Perret wrote:
>>> The UCLAMP_FLAG_IDLE flag is set on a runqueue when dequeueing the last
>>> active task to maintain the last uclamp.max and prevent blocked util
>>> from suddenly becoming visible.
>>>
>>> However, there is an asymmetry in how the flag is set and cleared which
>>> can lead to having the flag set whilst there are active tasks on the rq.
>>> Specifically, the flag is cleared in the uclamp_rq_inc() path, which is
>>> called at enqueue time, but set in uclamp_rq_dec_id() which is called
>>> both when dequeueing a task _and_ in the update_uclamp_active() path. As
>>> a result, when both uclamp_rq_{dec,ind}_id() are called from
>>> update_uclamp_active(), the flag ends up being set but not cleared,
>>> hence leaving the runqueue in a broken state.
>>>
>>> Fix this by setting the flag in the uclamp_rq_inc_id() path to ensure
>>> things remain symmetrical.
>>
>> The code you moved is neither in uclamp_rq_inc_id(), although
>> uclamp_idle_reset() is called from there
> 
> Yep, that is what I was trying to say.
> 
>> nor does it _set_ the flag.
> 
> Ahem. That I don't have a good excuse for ...

(A) dequeue -> set

(1) dequeue_task() -> uclamp_rq_dec() ->

(2) cpu_util_update_eff() -> ... -> uclamp_update_active() ->

uclamp_rq_dec_id()

    uclamp_rq_max_value()

        /* No tasks -- default clamp values */
        uclamp_idle_value() {

            if (clamp_id == UCLAMP_MAX)
                rq->uclamp_flags |= UCLAMP_FLAG_IDLE;  <-- set
        }

---

(B) enqueue -> clear

(1) enqueue_task() ->

uclamp_rq_inc() {

(2) cpu_util_update_eff() -> ... -> uclamp_update_active() ->

    uclamp_rq_inc_id() {

        uclamp_idle_reset() {
    						     <-- new clear
       }                                                     ^
    }                                                        |
                                                             |
    if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)                 |
        rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;       <-- old clear
}

---

uclamp_update_active()

    if (p->uclamp[clamp_id].active) {
        uclamp_rq_dec_id()            <-- (A2)
	uclamp_rq_inc_id()            <-- (B2)
    }

Is this existing asymmetry in setting the flag but not clearing it in
uclamp_update_active() the only issue this patch fixes?


* Re: [PATCH v2 3/3] sched: Make uclamp changes depend on CAP_SYS_NICE
  2021-06-14 15:03               ` Qais Yousef
@ 2021-06-21 10:52                 ` Quentin Perret
  0 siblings, 0 replies; 20+ messages in thread
From: Quentin Perret @ 2021-06-21 10:52 UTC (permalink / raw)
  To: Qais Yousef
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann, rickyiu, wvw,
	patrick.bellasi, xuewen.yan94, linux-kernel, kernel-team

Hi Qais,

Apologies for the delayed reply, I was away last week.

On Monday 14 Jun 2021 at 16:03:27 (+0100), Qais Yousef wrote:
> On 06/11/21 14:43, Quentin Perret wrote:
> > On Friday 11 Jun 2021 at 15:17:37 (+0100), Qais Yousef wrote:
> > > On 06/11/21 13:49, Quentin Perret wrote:
> > > > Thinking about it a bit more, a more involved option would be to have
> > > > this patch as is, but to also introduce a new RLIMIT_UCLAMP on top of
> > > > it. The semantics could be:
> > > > 
> > > >   - if the clamp requested by the non-privileged task is lower than its
> > > >     existing clamp, then allow;
> > > >   - otherwise, if the requested clamp is less than UCLAMP_RLIMIT, then
> > > >     allow;
> > > >   - otherwise, deny,
> > > > 
> > > > And the same principle would apply to both uclamp.min and uclamp.max,
> > > > and UCLAMP_RLIMIT would default to 0.
> > > > 
> > > > Thoughts?
> > > 
> > > That could work. But then I'd prefer your patch to go as-is. I don't think
> > > uclamp can do with this extra complexity in using it.
> > 
> > Sorry I'm not sure what you mean here?
> 
> Hmm. I understood this as a new flag to sched_setattr() syscall first, but now
> I get it. You want to use getrlimit()/setrlimit()/prlimit() API to impose
> a restriction. My comment was in regard to this being a sys call extension,
> which it isn't. So please ignore it.
> 
> > 
> > > We basically want to specify we want to be paranoid about uclamp CAP or not. In
> > > my view that is simple and can't see why it would be a big deal to have
> > > a procfs entry to define the level of paranoia the system wants to impose. If
> > > it is a big deal though (would love to hear the arguments);
> > 
> > Not saying it's a big deal, but I think there are a few arguments in
> > favor of using rlimit instead of a sysfs knob. It allows for a much
> > finer grain configuration  -- constraints can be set per-task as well as
> > system wide if needed, and it is the standard way of limiting resources
> > that tasks can ask for.
> 
> Is it system wide or per user?

Right, so calling this 'system-wide' is probably an abuse, but IIRC
rlimits are per-process, and are inherited across fork/exec. So the
usual trick to have a default value is to set the rlimits on the init
task accordingly. Android for instance already does that for a few
things, and I would guess that systemd and friends have equivalents
(though admittedly I should check that).

> > 
> > > requiring apps that
> > > want to self regulate to have CAP_SYS_NICE is better approach.
> > 
> > Rlimit wouldn't require that though, which is also nice as CAP_SYS_NICE
> > grants you a lot more power than just clamps ...
> 
> Now I better understand your suggestion. It seems a viable option I agree.
> I need to digest it more still though. The devil is in the details :)
> 
> Shouldn't the default be RLIM_INFINITY? ie: no limit?

I guess so yes.

> We will need to add two limits, RLIMIT_UCLAMP_MIN/MAX, right?

Not sure, but I was originally envisioning having only one that applies
to both min and max. In which case would we need separate ones?

> We have the following hierarchy now:
> 
> 	1. System Wide (/proc/sys/kernel/sched_util_clamp_min/max)
> 	2. Cgroup
> 	3. Per-Task
> 
> In that order of priority where 1 limits/overrides 2 and 3. And
> 2 limits/overrides 3.
> 
> Where do you see the RLIMIT fit in this hierarchy? It should be between 2 and
> 3, right? Cgroup settings should still win even if the user/processes were
> limited?

Yes, the rlimit stuff would just apply to the syscall interface.

> If the framework decided a user can't request any boost at all (can't increase
> its uclamp_min above 0). IIUC then setting the hard limit of RLIMIT_UCLAMP_MIN
> to 0 would achieve that, right?

Exactly.

> Since the framework and the task itself would go through the same
> sched_setattr() call, how would the framework circumvent this limit? IIUC it
> has to raise the RLIMIT_UCLAMP_MIN first then perform sched_setattr() to
> request the boost value, right? Would this overhead be acceptable? It looks
> considerable to me.

The framework needs to have CAP_SYS_NICE to change another process'
clamps, and generally rlimit checks don't apply to CAP_SYS_NICE-capable
processes -- see __sched_setscheduler(). So I think we should be fine.
IOW, rlimits are just constraining what unprivileged tasks are allowed
to request for themselves IIUC.

> Also, will prlimit() allow you to go outside what was set for the user via
> setrlimit()? Reading the man pages it seems to override, so that should be
> fine.

IIRC rlimits are per-process properties, not per-user, so I think we
should be fine here as well?

> For 1 (System Wide) limits, sched_setattr() requests are accepted, but the
> effective uclamp is *capped by* the system wide limit.
> 
> Were you thinking RLIMIT_UCLAMP* will behave similarly?

Nope, I was actually thinking of having the syscall return -EPERM in
this case, as we already do for nice values or RT priorities.

> If they do, we have
> consistent behavior with how the current system wide limits work; but this will
> break your use case because tasks can change the requested uclamp value for
> a task, albeit the effective value will be limited.
> 
> 	RLIMIT_UCLAMP_MIN=512
> 	p->uclamp[UCLAMP_min] = 800	// this request is allowed but
> 					// Effective UCLAMP_MIN = 512
> 
> If not, then
> 
> 	RLIMIT_UCLAMP_MIN=no limit
> 	p->uclamp[UCLAMP_min] = 800	// task changed its uclamp_min to 800
> 	RLIMIT_UCLAMP_MIN=512		// limit was lowered for task/user
> 
> what will happen to p->uclamp[UCLAMP_MIN] in this case? Will it be lowered to
> match the new limit? And this will be inconsistent with the current system wide
> limits we already have.

As per the above, if the syscall returns -EPERM we can leave the
integration with system-wide defaults and such untouched I think.

> Sorry too many questions. I was mainly thinking loudly. I need to spend more
> time to dig into the details of how RLIMITs are imposed to understand how this
> could be a good fit. I already see some friction points that needs more
> thinking.

No need to apologize, this would be a new userspace-visible interface,
so you're right that we need to think it through.

Thanks for the feedback,
Quentin

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/3] sched: Fix UCLAMP_FLAG_IDLE setting
  2021-06-17 15:27       ` Dietmar Eggemann
@ 2021-06-21 10:57         ` Quentin Perret
  0 siblings, 0 replies; 20+ messages in thread
From: Quentin Perret @ 2021-06-21 10:57 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, mingo, vincent.guittot, qais.yousef, rickyiu,
	wvw, patrick.bellasi, xuewen.yan94, linux-kernel, kernel-team

Hi Dietmar,

On Thursday 17 Jun 2021 at 17:27:56 (+0200), Dietmar Eggemann wrote:
> On 11/06/2021 09:25, Quentin Perret wrote:
> > On Thursday 10 Jun 2021 at 21:05:12 (+0200), Peter Zijlstra wrote:
> >> On Thu, Jun 10, 2021 at 03:13:04PM +0000, Quentin Perret wrote:
> >>> The UCLAMP_FLAG_IDLE flag is set on a runqueue when dequeueing the last
> >>> active task to maintain the last uclamp.max and prevent blocked util
> >>> from suddenly becoming visible.
> >>>
> >>> However, there is an asymmetry in how the flag is set and cleared which
> >>> can lead to having the flag set whilst there are active tasks on the rq.
> >>> Specifically, the flag is cleared in the uclamp_rq_inc() path, which is
> >>> called at enqueue time, but set in uclamp_rq_dec_id() which is called
> >>> both when dequeueing a task _and_ in the update_uclamp_active() path. As
> >>> a result, when both uclamp_rq_{dec,inc}_id() are called from
> >>> update_uclamp_active(), the flag ends up being set but not cleared,
> >>> hence leaving the runqueue in a broken state.
> >>>
> >>> Fix this by setting the flag in the uclamp_rq_inc_id() path to ensure
> >>> things remain symmetrical.
> >>
> >> The code you moved is neither in uclamp_rq_inc_id(), although
> >> uclamp_idle_reset() is called from there
> > 
> > Yep, that is what I was trying to say.
> > 
> >> nor does it _set_ the flag.
> > 
> > Ahem. That I don't have a good excuse for ...
> 
> (A) dequeue -> set
> 
> (1) dequeue_task() -> uclamp_rq_dec() ->
> 
> (2) cpu_util_update_eff() -> ... -> uclamp_update_active() ->
> 
> uclamp_rq_dec_id()
> 
>     uclamp_rq_max_value()
> 
>         /* No tasks -- default clamp values */
>         uclamp_idle_value() {
> 
>             if (clamp_id == UCLAMP_MAX)
>                 rq->uclamp_flags |= UCLAMP_FLAG_IDLE;  <-- set
>         }
> 
> ---
> 
> (B) enqueue -> clear
> 
> (1) enqueue_task() ->
> 
> uclamp_rq_inc() {
> 
> (2) cpu_util_update_eff() -> ... -> uclamp_update_active() ->
> 
>     uclamp_rq_inc_id() {
> 
>         uclamp_idle_reset() {
>     						     <-- new clear
>        }                                                     ^
>     }                                                        |
>                                                              |
>     if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)                 |
>         rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;       <-- old clear
> }
> 
> ---
> 
> uclamp_update_active()
> 
>     if (p->uclamp[clamp_id].active) {
>         uclamp_rq_dec_id()            <-- (A2)
> 	uclamp_rq_inc_id()            <-- (B2)
>     }
> 
> Is this existing asymmetry in setting the flag but not clearing it in
> uclamp_update_active() the only issue this patch fixes?

I think this is the root of the problem, but it can have odd symptoms.
In a bad case that can lead to hitting the WARN in uclamp_rq_dec_id
(which is how we've found the bug in the first place).

I'll try and repost this with a correct commit message soon -- still
fighting with my inbox right now.

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2021-06-21 10:57 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-10 15:13 [PATCH v2 0/3] A few uclamp fixes Quentin Perret
2021-06-10 15:13 ` [PATCH v2 1/3] sched: Fix UCLAMP_FLAG_IDLE setting Quentin Perret
2021-06-10 19:05   ` Peter Zijlstra
2021-06-11  7:25     ` Quentin Perret
2021-06-17 15:27       ` Dietmar Eggemann
2021-06-21 10:57         ` Quentin Perret
2021-06-10 15:13 ` [PATCH v2 2/3] sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS Quentin Perret
2021-06-10 19:15   ` Peter Zijlstra
2021-06-11  8:59     ` Quentin Perret
2021-06-11  9:07       ` Quentin Perret
2021-06-11  9:20       ` Peter Zijlstra
2021-06-10 15:13 ` [PATCH v2 3/3] sched: Make uclamp changes depend on CAP_SYS_NICE Quentin Perret
2021-06-11 12:48   ` Qais Yousef
2021-06-11 13:08     ` Quentin Perret
2021-06-11 13:26       ` Qais Yousef
2021-06-11 13:49         ` Quentin Perret
2021-06-11 14:17           ` Qais Yousef
2021-06-11 14:43             ` Quentin Perret
2021-06-14 15:03               ` Qais Yousef
2021-06-21 10:52                 ` Quentin Perret

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).