Re: [PATCH v2 02/12] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

From: Suren Baghdasaryan <surenb@google.com>
To: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>, Tejun Heo <tj@kernel.org>,
	"Rafael J . Wysocki" <rafael.j.wysocki@intel.com>,
	Viresh Kumar <viresh.kumar@linaro.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Paul Turner <pjt@google.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Morten Rasmussen <morten.rasmussen@arm.com>,
	Juri Lelli <juri.lelli@redhat.com>, Todd Kjos <tkjos@google.com>,
	Joel Fernandes <joelaf@google.com>,
	Steve Muckle <smuckle@google.com>
Subject: Re: [PATCH v2 02/12] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups
Date: Fri, 20 Jul 2018 17:25:31 -0700	[thread overview]
Message-ID: <CAJuCfpGvpX5U9rED+vX=wW624GLZo8f=-2kOLGrY1UJSEXRh6g@mail.gmail.com> (raw)
In-Reply-To: <20180720151156.GA31421@e110439-lin>

Hi Patrick,

On Fri, Jul 20, 2018 at 8:11 AM, Patrick Bellasi
<patrick.bellasi@arm.com> wrote:
> Hi Suren,
> thanks for the review, all good point... some more comments follow
> inline.
>
> On 19-Jul 16:51, Suren Baghdasaryan wrote:
>> On Mon, Jul 16, 2018 at 1:28 AM, Patrick Bellasi
>> <patrick.bellasi@arm.com> wrote:
>
> [...]
>
>> > +/**
>> > + * uclamp_group_available: checks if a clamp group is available
>> > + * @clamp_id: the utilization clamp index (i.e. min or max clamp)
>> > + * @group_id: the group index in the given clamp_id
>> > + *
>> > + * A clamp group is not free if there is at least one SE which is sing a clamp
>>
>> Did you mean to say "single clamp"?
>
> No, it's "...at least one SE which is USING a clamp value..."
>
>> > + * value mapped on the specified clamp_id. These SEs are reference counted by
>> > + * the se_count of a uclamp_map entry.
>> > + *
>> > + * Return: true if there are no SE's mapped on the specified clamp
>> > + *         index and group
>> > + */
>> > +static inline bool uclamp_group_available(int clamp_id, int group_id)
>> > +{
>> > +       struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
>> > +
>> > +       return (uc_map[group_id].value == UCLAMP_NONE);
>>
>> The usage of UCLAMP_NONE is very confusing to me. It was not used at
>> all in the patch where it was introduced [1/12], here it's used as a
>> clamp value and in uclamp_group_find() it's used as group_id. Please
>> clarify the usage.
>
> Yes, it's meant to represent a "clamp not valid" condition, whatever
> it's a "clamp group" or a "clamp value"... perhaps the name can be
> improved.
>
>> I also feel UCLAMP_NONE does not really belong to
>> the uclamp_id enum because other elements there are indexes in
>> uclamp_maps and this one is a special value.
>
> Right, it looks a bit misplaced, I agree. I think I tried to set it
> using a #define but there was some issues I don't remember now...
> Anyway, I'll give it another go...
>
>
>> IMHO if both *group_id*
>> and *value* need a special value (-1) to represent
>> unused/uninitialized entry it would be better to use different
>> constants. Maybe UCLAMP_VAL_NONE and UCLAMP_GROUP_NONE?
>
> Yes, maybe we can use a
>
>    #define UCLAMP_NOT_VALID -1
>
> and get rid the confusing enum entry.
>
> Will update it on v3.
>

Sounds good to me.

>> > +}
>
> [...]
>
>> > +/**
>> > + * uclamp_group_find: finds the group index of a utilization clamp group
>> > + * @clamp_id: the utilization clamp index (i.e. min or max clamping)
>> > + * @clamp_value: the utilization clamping value lookup for
>> > + *
>> > + * Verify if a group has been assigned to a certain clamp value and return
>> > + * its index to be used for accounting.
>> > + *
>> > + * Since only a limited number of utilization clamp groups are allowed, if no
>> > + * groups have been assigned for the specified value, a new group is assigned
>> > + * if possible. Otherwise an error is returned, meaning that an additional clamp
>> > + * value is not (currently) supported.
>> > + */
>> > +static int
>> > +uclamp_group_find(int clamp_id, unsigned int clamp_value)
>> > +{
>> > +       struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
>> > +       int free_group_id = UCLAMP_NONE;
>> > +       unsigned int group_id = 0;
>> > +
>> > +       for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
>> > +               /* Keep track of first free clamp group */
>> > +               if (uclamp_group_available(clamp_id, group_id)) {
>> > +                       if (free_group_id == UCLAMP_NONE)
>> > +                               free_group_id = group_id;
>> > +                       continue;
>> > +               }
>> > +               /* Return index of first group with same clamp value */
>> > +               if (uc_map[group_id].value == clamp_value)
>> > +                       return group_id;
>> > +       }
>> > +       /* Default to first free clamp group */
>> > +       if (group_id > CONFIG_UCLAMP_GROUPS_COUNT)
>>
>> Is the condition above needed? I think it's always true if you got here.
>> Also AFAIKT after the for loop you can just do:
>>
>> return (free_group_id != UCLAMP_NONE) ? free_group_id : -ENOSPC;
>
> Yes, you right... the code above can be simplified!
>
>>
>> > +               group_id = free_group_id;
>> > +       /* All clamp group already track different clamp values */
>> > +       if (group_id == UCLAMP_NONE)
>> > +               return -ENOSPC;
>> > +       return group_id;
>> > +}
>
> [...]
>
>> > +static inline void uclamp_group_put(int clamp_id, int group_id)
>> > +{
>> > +       struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
>> > +       unsigned long flags;
>> > +
>> > +       /* Ignore SE's not yet attached */
>> > +       if (group_id == UCLAMP_NONE)
>> > +               return;
>> > +
>> > +       /* Remove SE from this clamp group */
>> > +       raw_spin_lock_irqsave(&uc_map[group_id].se_lock, flags);
>> > +       uc_map[group_id].se_count -= 1;
>>
>> If uc_map[group_id].se_count was 0 before decrement you end up with
>> se_count == -1 and no reset for the element.
>
> Well... this should never happen, otherwise the refcounting is not
> working as expected.
>
> Maybe we can add (at least) a debug check and warning, something like:
>
> #ifdef SCHED_DEBUG
>         if (unlikely(uc_map[group_id].se_count == 0)) {
>                 WARN(1, "invalid clamp group [%d:%d] refcount\n",
>                                 clamp_id, group_id);
>                 uc_map[group_id].se_count = 1;
>         }
> #endif
>
>> > +       if (uc_map[group_id].se_count == 0)
>> > +               uclamp_group_reset(clamp_id, group_id);
>> > +       raw_spin_unlock_irqrestore(&uc_map[group_id].se_lock, flags);
>> > +}
>> > +
>
> [...]
>
>> >  static inline int __setscheduler_uclamp(struct task_struct *p,
>> >                                         const struct sched_attr *attr)
>> >  {
>> > +       struct uclamp_se *uc_se;
>> > +       int retval = 0;
>> > +
>> >         if (attr->sched_util_min > attr->sched_util_max)
>> >                 return -EINVAL;
>> >         if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
>> >                 return -EINVAL;
>> >
>> > -       p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
>> > -       p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
>> > +       mutex_lock(&uclamp_mutex);
>> > +
>> > +       /* Update min utilization clamp */
>> > +       uc_se = &p->uclamp[UCLAMP_MIN];
>> > +       retval |= uclamp_group_get(p, UCLAMP_MIN, uc_se,
>> > +                                  attr->sched_util_min);
>> > +
>> > +       /* Update max utilization clamp */
>> > +       uc_se = &p->uclamp[UCLAMP_MAX];
>> > +       retval |= uclamp_group_get(p, UCLAMP_MAX, uc_se,
>> > +                                  attr->sched_util_max);
>> > +
>> > +       mutex_unlock(&uclamp_mutex);
>> > +
>> > +       /*
>> > +        * If one of the two clamp values should fail,
>> > +        * let the userspace know.
>> > +        */
>> > +       if (retval)
>> > +               return -ENOSPC;
>>
>> Maybe a minor issue but this failure is ambiguous. It might mean:
>> 1. no clamp value was updated
>> 2. UCLAMP_MIN was updated but UCLAMP_MAX was not
>> 3. UCLAMP_MAX was updated but UCLAMP_MIN was not
>
> That's right, I put a bit of thought on that me too but at the end
> I've been convinced that the possibility to use a single syscall to
> set both clamps at the same time has some benefits for user-space.
>
> Maybe the current solution can be improved by supporting an (optional)
> strict semantic with an in-kernel roll-back in case one of the two
> uclamp_group_get should fail.
>
> The strict semantic with roll-back could be controller via an
> additional flag, e.g. SCHED_FLAG_UTIL_CLAMP_STRICT.
>
> When the flag is set, either we are able to set both the attributes or
> we roll-back. Otherwise, when the flag is not set, we keep the current
> behavior. i.e. we set what we can and report an error to notify
> userspace that one constraints was not enforced.
>
> The following snippet should implement this semantics:
>
> ---8<---
>
> /* Uclamp flags */
> #define SCHED_FLAG_UTIL_CLAMP_STRICT    0x11 /* Roll-back on failure */
> #define SCHED_FLAG_UTIL_CLAMP_MIN       0x12 /* Update util_min */
> #define SCHED_FLAG_UTIL_CLAMP_MAX       0x14 /* Update util_max */
> #define SCHED_FLAG_UTIL_CLAMP           ( \
>                 SCHED_FLAG_UTIL_CLAMP_MIN | SCHED_FLAG_UTIL_CLAMP_MAX)
>

Having ability to update only min or only max this way might be indeed
very useful.
Instead of rolling back on failure I would suggest to check both
inputs first to make sure there won't be any error before updating.
This would remove the need for SCHED_FLAG_UTIL_CLAMP_STRICT (which I
think any user would want to set to 1 anyway).
Looks like uclamp_group_get() can fail only if uclamp_group_find()
fails to find a slot for uclamp_value or a free slot. So one way to do
this search before update is to call uclamp_group_find() for both
UCLAMP_MIN and UCLAMP_MAX beforehand and if they succeed then pass
obtained next_group_ids into uclamp_group_get() to avoid doing the
same search twice. This requires some refactoring of
uclamp_group_get() but I think the end result would be a cleaner and
more predictable solution.

> static inline int __setscheduler_uclamp(struct task_struct *p,
>                                         const struct sched_attr *attr)
> {
>         unsigned int uclamp_value_old = 0;
>         unsigned int uclamp_value;
>         struct uclamp_se *uc_se;
>         int retval = 0;
>
>         if (attr->sched_util_min > attr->sched_util_max)
>                 return -EINVAL;
>         if (attr->sched_util_max > 100)
>                 return -EINVAL;
>
>         mutex_lock(&uclamp_mutex);
>
>         if (!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN))
>                 goto set_util_max;
>
>         uc_se = &p->uclamp[UCLAMP_MIN];
>         uclamp_value = scale_from_percent(attr->sched_util_min);
>         if (uc_se->value == uclamp_value)
>                 goto set_util_max;
>
>         /* Update min utilization clamp */
>         uclamp_value_old = uc_se->value;
>         retval |= uclamp_group_get(p, NULL, UCLAMP_MIN, uc_se, uclamp_value);
>         if (retval &&
>             attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_STRICT)
>                 goto exit_failure;
>
> set_util_max:
>
>         if (!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX))
>                 goto exit_done;
>
>         uc_se = &p->uclamp[UCLAMP_MAX];
>         uclamp_value = scale_from_percent(attr->sched_util_max);
>         if (uc_se->value == uclamp_value)
>                 goto exit_done;
>
>         /* Update max utilization clamp */
>         if (uclamp_group_get(p, NULL, UCLAMP_MAX,
>                              uc_se, uclamp_value))
>                 goto exit_rollback;
>
> exit_done:
>         mutex_unlock(&uclamp_mutex);
>         return retval;
>
> exit_rollback:
>         if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN &&
>             attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_STRICT) {
>                 uclamp_group_get(p, NULL, UCLAMP_MIN,
>                                  uc_se, uclamp_value_old);
>         }
> exit_failure:
>         mutex_unlock(&uclamp_mutex);
>
>         return -ENOSPC;
> }
>
> ---8<---
>
>
> Could that work better?
>
> The code is maybe a bit more convoluted... but perhaps it can be
> improved by encoding it in a loop.
>
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi