LKML Archive on
From: Patrick Bellasi
To: Peter Zijlstra
Cc: Juri Lelli, Ingo Molnar, Tejun Heo,
	"Rafael J . Wysocki",
	Viresh Kumar,
	Vincent Guittot,
	Paul Turner,
	Quentin Perret,
	Dietmar Eggemann,
	Morten Rasmussen,
	Todd Kjos, Joel Fernandes,
	Steve Muckle,
	Suren Baghdasaryan
Subject: Re: [PATCH v4 14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default
Date: Wed, 26 Sep 2018 11:43:55 +0100
Message-ID: <20180926104355.GA3980@e110439-lin>

On 25-Sep 17:49, Peter Zijlstra wrote:
> On Mon, Sep 24, 2018 at 04:14:00PM +0100, Patrick Bellasi wrote:


> Well, with DL there are well defined rules for what to put in and what
> to then expect.
> For this thing, not so much I feel.

Maybe you'll prove me wrong, but isn't that already what happens with
things like priorities?

When you set a prio for a CFS task you don't really know how much
more/less CPU time that task will get, because it depends on the other
tasks' prios and on tasks from higher prio classes.

The priority is thus a knob which defines an "intended behavior", a
preference, without being "legally binding" as in the case of DL
bandwidth... nevertheless we can still make good use of prios.
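
To make the priority analogy concrete, here is a minimal sketch (Python,
not kernel code). It assumes the usual rule of thumb that nice 0 maps to
a weight of 1024 and that each nice step scales the weight by roughly
1.25; the kernel uses a precomputed table, so real numbers differ
slightly. The helper names are hypothetical.

```python
NICE_0_WEIGHT = 1024  # weight conventionally assigned to nice 0


def approx_weight(nice):
    """Approximate CFS load weight: about 1.25x per nice step (assumption)."""
    return NICE_0_WEIGHT / (1.25 ** nice)


def cpu_share(nice, other_nices):
    """Fraction of one CPU a task gets when competing only with CFS tasks."""
    mine = approx_weight(nice)
    others = sum(approx_weight(n) for n in other_nices)
    return mine / (mine + others)


# The same nice value yields very different shares depending on who else runs.
alone = cpu_share(0, [])
vs_equal = cpu_share(0, [0])
vs_three = cpu_share(0, [0, 0, -5])
```

The prio of the task never changes across the three cases, yet its share
of the CPU does: the value expresses a preference, not a quantity.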


> > In the clamping case, it's still the user-space that needs to figure
> > out an optimal clamp value, while considering your performance and
> > energy efficiency goals. This can be based on an automated profiling
> > process which comes up with "optimal" clamp values.
> > 
> > In the DL case, we are perfectly fine to have a running time
> > parameter, although we don't give any precise and deterministic
> > formula to quantify it. It's up to user-space to figure out the
> > required running time for a given app and platform.
> > It's also not unrealistic the case you need to close a control loop
> > with user-space to keep updating this requirement.
> > 
> > Why the same cannot hold for clamp values ?
> The big difference is that if I request (and am granted) a runtime quota
> of a given amount, then that is what I'm guaranteed to get.

(I think not always... but that's a detail for a different discussion)

> Irrespective of the amount being sufficient for the work in question --
> which is where the platform dependency comes in.
> But what am I getting when I set a specific clamp value? What does it
> mean to set the value to 80%

Exactly, that's a good point: what "expectations" can we set for users
based on a given value?

> So far the only real meaning is when combined with the EAS OPP data, we
> get a direct translation to OPPs. Irrespective of how the utilization is
> measured and the capacity:OPP mapping established, once that's set, we
> can map a clamp value to an OPP and get meaning.

If we strictly follow this line of reasoning then we should probably
set a frequency value directly...  but still we will not be saying
anything about "expectations".

Given the current patchset, right now we can't do much more than
_biasing_ an OPP selection.
It's actually just a bias, we cannot really guarantee anything to users
based on clamping. For example, if you require util_min=1024 you are
not really guaranteed to run at the maximum capacity,
especially in the current patchset where we are not biasing task

My fear then is: since we are not really granting/enforcing anything,
why should we base such an interface on an internal implementation
detail and/or platform specific values?

Why is a slightly more abstract interface so much more confusing?

> But without that, it doesn't mean anything much at all. And that is my
> complaint. It seems to get presented as: 'random knob that might do
> something'. The range it takes as input doesn't change a thing.

Can't the "range" help in defining the perceived expectations?

If we use a percentage, IMHO it's clearer that it's a _relative_ and
_not mandatory_ interface:

  relative: because, for example, a 50% capped task is expected
            (normally) to run slower than a 50% boosted task, although
            we don't know, or care to know, the exact frequency or cpu

  not mandatory: because, for example, the 50% boosted task is not
                 guaranteed to always run at an OPP whose capacity is
                 not smaller than 512
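
A minimal sketch of the "relative" part, in Python rather than kernel
code. It assumes the percentage is scaled linearly onto the [0..1024]
capacity range; the OPP capacity table and the helper names are made up
for illustration and are not taken from the patchset. It only models the
bias on OPP selection, not what the system will actually grant.

```python
SCHED_CAPACITY_SCALE = 1024

# Hypothetical per-OPP capacities for one CPU (illustrative values only).
OPP_CAPACITIES = [256, 512, 768, 1024]


def percent_to_capacity(percent):
    """Linearly scale a [0..100] percentage onto [0..1024] (assumption)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be in [0, 100]")
    return percent * SCHED_CAPACITY_SCALE // 100


def clamp_util(util, util_min=0, util_max=SCHED_CAPACITY_SCALE):
    """Restrict a utilization value to the [util_min, util_max] range."""
    return max(util_min, min(util, util_max))


def pick_opp(util):
    """Lowest OPP capacity that fits util, else the highest one."""
    for cap in OPP_CAPACITIES:
        if cap >= util:
            return cap
    return OPP_CAPACITIES[-1]


half = percent_to_capacity(50)
# Same small task (util=200), once capped and once boosted at 50%.
capped_opp = pick_opp(clamp_util(200, util_max=half))
boosted_opp = pick_opp(clamp_util(200, util_min=half))
```

With this toy table the capped task is biased towards the 256 OPP and
the boosted one towards the 512 OPP, without the user having to know
either number.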

> > > How are you expecting people to determine what to put into the interface?

The same way people define priorities. Which means, with increasing
level of complexity:

 a) by guessing (or using the default, i.e. no clamps)

 b) by making an educated choice
   i.e. profiling your app to pick the value which makes the most sense
        given the platform and a set of optimization goals

 c) by controlling in a closed feedback loop
    i.e. by periodically measuring some app specific power/perf metric
    and tuning the clamp values to close a gap with respect to a given

I think that the complexity of both b) and c) is not really impacted
by the scale/range used... but it also does not benefit much in
"clarity" if we use capacity values instead of percentages.
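
As a rough illustration of option c), here is a minimal sketch of a
user-space proportional control loop (Python). Everything in it is
hypothetical: the measure() callback stands in for whatever app specific
metric is available, and it is assumed to grow with the clamp value.

```python
def tune_clamp(measure, target, clamp=0.0, gain=0.5, steps=20,
               lo=0.0, hi=100.0):
    """Nudge a clamp value until the measured metric approaches target.

    measure(clamp) -> float is a hypothetical app-specific metric,
    assumed to increase with the clamp (e.g. frames per second).
    """
    for _ in range(steps):
        error = target - measure(clamp)
        # Proportional step, kept inside the valid range of the knob.
        clamp = max(lo, min(hi, clamp + gain * error))
    return clamp


# Toy stand-in for a real measurement: the metric is simply 0.6 * clamp.
def fake_fps(clamp):
    return 0.6 * clamp


tuned = tune_clamp(fake_fps, target=30.0)
```

The loop only needs the knob to be monotonic and bounded; switching from
a [0..100] to a [0..1024] scale just means changing lo/hi and the gain.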

> > > Knee points, little capacity, those things make 'obvious' sense.

> > IMHO, they make "obvious" sense from a kernel-space perspective
> > exactly because they are implementation details and platform specific
> > concepts.
> > 
> > At the same time, I struggle to provide a definition of knee point and
> > I struggle to find a use-case where I can certainly say that a task
> > should be clamped exactly to the little capacity for example.
> > 
> > I'm more of the idea that the right clamp value is something a bit
> > fuzzy and possibly subject to change over time depending on the
> > specific application phase (e.g. cpu-vs-memory bounded) and/or
> > optimization goals (e.g. performance vs energy efficiency).

What do you think about my last sentence above?

> > Here we are thus at defining and agree about a "generic and abstract"
> > interface which allows user-space to feed input to kernel-space.
> > To this purpose, I think platform specific details and/or internal
> > implementation details are not "a bonus".
> But unlike DL, which has well specified behaviour, and when I know my
> platform I can compute a usable value. This doesn't seem to gain meaning
> when I know the platform.
> Or does it?

... or we don't really care about a platform specific meaning.

> If you say yes, then we need to be able to correlate to the platform
> data that gives it meaning; which would be the OPP states. And those
> come with capacity numbers.

The meaning, strictly speaking, should be just:

   I figured out (somehow) that if I set value X my app is now working
   as expected in terms of the acceptable power/performance
   optimization goal.

   I know that value X could require tuning over time depending on
   possible changes in platform status or task composition.


> > Internally, in kernel space, we use 1024 units. It's just the
> > user-space interface that speaks percentages but, as soon as a
> > percentage value is used to configure a clamp, it's translated into a
> > [0..1024] range value.
> > 
> > Is this not an acceptable compromise? We have a generic user-space
> > interface and an effective/consistent kernel-space implementation.
> I really don't see how changing the unit changes anything. Either you
> want to relate to OPPs and those are exposed in 1/1024 unit capacity
> through the EAS files, or you don't and then the knob has no meaning.
> And how the heck are we supposed to assign a value for something that
> has no meaning.
> Again, with DL we ask for time, once I know the platform I can convert
> my work into instructions and time and all makes sense.
> With this, you seem reluctant to allow us to close that loop. Why is
> that? Why not directly relate to the EAS OPPs, because that is directly
> what they end up mapping to.

I'm not really fighting against that: if people find capacities more
intuitive to use, we can certainly go for them.

My reluctance is really just about tossing out a possibly different
perspective, considering we are adding a user-space API which
certainly sets "expectations" for users.

Provided the concept is clear that it's a _non mandatory_ and
_relative_ API, then any scale should be ok... I just personally
prefer a percentage for the reasons described above.

In both cases, whoever uses the interface can certainly close a
loop... especially an automated profiling or run-time control loop.

> When I know the platform, I can convert my work into instructions and
> obtain time, I can convert my clamp into an OPP and time*OPP gives an
> energy consumption.

What you describe makes sense, it can definitely help the human user
to set a value. I'm just not convinced this will be the main usage of
such an interface... or that a single value could fit all the
run-time scenarios.

I think in real workload scenarios we will have so many tasks, some
competing, others cooperating, that it will not be possible to do the
exercise you describe above.

What we will do instead will probably be to close a profiling/control
loop from user-space and let the system figure out the optimal
value. In these cases, the platform details are just "informative"
and what we need is really just a "random knob which can do
something"... provided that "something" is a consistent mapping of the
knob values to certain scheduler actions.

> Why muddle things up and make it complicated?

I'll not push this further, really: if you are strongly of the
opinion we should use capacity, I'll drop percentages in v5.

Otherwise, if like me you are still a bit unsure about what could
be the best API, we can hope for more feedback from other folks...
maybe I can ping someone in CC?


#include <best/regards.h>

Patrick Bellasi
