From: Quentin Perret <qperret@google.com>
To: Qais Yousef <qais.yousef@arm.com>
Cc: mingo@redhat.com, peterz@infradead.org,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rickyiu@google.com, wvw@google.com, patrick.bellasi@matbug.net,
xuewen.yan94@gmail.com, linux-kernel@vger.kernel.org,
kernel-team@android.com
Subject: Re: [PATCH v2 3/3] sched: Make uclamp changes depend on CAP_SYS_NICE
Date: Mon, 21 Jun 2021 10:52:28 +0000
Message-ID: <YNBvbEv2UTzSHBP5@google.com>
In-Reply-To: <20210614150327.3humrvztv3fxurvk@e107158-lin.cambridge.arm.com>

Hi Qais,

Apologies for the delayed reply, I was away last week.

On Monday 14 Jun 2021 at 16:03:27 (+0100), Qais Yousef wrote:
> On 06/11/21 14:43, Quentin Perret wrote:
> > On Friday 11 Jun 2021 at 15:17:37 (+0100), Qais Yousef wrote:
> > > On 06/11/21 13:49, Quentin Perret wrote:
> > > > Thinking about it a bit more, a more involved option would be to have
> > > > this patch as is, but to also introduce a new RLIMIT_UCLAMP on top of
> > > > it. The semantics could be:
> > > >
> > > > - if the clamp requested by the non-privileged task is lower than its
> > > > existing clamp, then allow;
> > > > - otherwise, if the requested clamp is less than UCLAMP_RLIMIT, then
> > > > allow;
> > > > - otherwise, deny,
> > > >
> > > > And the same principle would apply to both uclamp.min and uclamp.max,
> > > > and UCLAMP_RLIMIT would default to 0.
> > > >
> > > > Thoughts?
> > >
> > > That could work. But then I'd prefer your patch to go as-is. I don't think
> > > uclamp can do with this extra complexity in using it.
> >
> > Sorry I'm not sure what you mean here?
>
> Hmm. I understood this as a new flag to sched_setattr() syscall first, but now
> I get it. You want to use getrlimit()/setrlimit()/prlimit() API to impose
> a restriction. My comment was in regard to this being a sys call extension,
> which it isn't. So please ignore it.
>
> >
> > > We basically want to specify we want to be paranoid about uclamp CAP or not. In
> > > my view that is simple and can't see why it would be a big deal to have
> > > a procfs entry to define the level of paranoia the system wants to impose. If
> > > it is a big deal though (would love to hear the arguments);
> >
> > Not saying it's a big deal, but I think there are a few arguments in
> > favor of using rlimit instead of a sysfs knob. It allows for a much
> > finer grain configuration -- constraints can be set per-task as well as
> > system wide if needed, and it is the standard way of limiting resources
> > that tasks can ask for.
>
> Is it system wide or per user?

Right, so calling this 'system-wide' is probably an abuse, but IIRC
rlimits are per-process, and are inherited across fork/exec. So the
usual trick to have a default value is to set the rlimits on the init
task accordingly. Android for instance already does that for a few
things, and I would guess that systemd and friends have equivalents
(though admittedly I should check that).

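
To illustrate the inheritance point, here is a minimal userspace sketch.
It uses the existing RLIMIT_NOFILE purely as a stand-in (no uclamp rlimit
exists today), lowers the soft limit, forks, and lets the child report
whether it sees the same value. The helper name is made up for this
example, and it only exercises fork(), not exec().

```c
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a forked child observes the soft
 * limit set by its parent, 0 if it does not, -1 on error.
 * RLIMIT_NOFILE is only a stand-in for illustration.
 */
int child_inherits_nofile_limit(void)
{
	struct rlimit old, lim;
	int status;
	pid_t pid;

	if (getrlimit(RLIMIT_NOFILE, &old) != 0)
		return -1;

	/* Lower only the soft limit so it can be restored afterwards. */
	lim = old;
	if (lim.rlim_cur > 64)
		lim.rlim_cur = 64;
	if (setrlimit(RLIMIT_NOFILE, &lim) != 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		setrlimit(RLIMIT_NOFILE, &old);
		return -1;
	}
	if (pid == 0) {
		struct rlimit seen;
		int same = getrlimit(RLIMIT_NOFILE, &seen) == 0 &&
			   seen.rlim_cur == lim.rlim_cur;

		_exit(same ? 0 : 1);
	}

	if (waitpid(pid, &status, 0) < 0) {
		setrlimit(RLIMIT_NOFILE, &old);
		return -1;
	}

	/* Put the parent's original soft limit back. */
	setrlimit(RLIMIT_NOFILE, &old);

	return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Setting a limit like this early in init is what makes it behave as a
de facto default for everything spawned afterwards.
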
> >
> > > requiring apps that
> > > want to self regulate to have CAP_SYS_NICE is better approach.
> >
> > Rlimit wouldn't require that though, which is also nice as CAP_SYS_NICE
> > grants you a lot more power than just clamps ...
>
> Now I better understand your suggestion. It seems a viable option I agree.
> I need to digest it more still though. The devil is in the details :)
>
> Shouldn't the default be RLIM_INFINITY? ie: no limit?

I guess so, yes.

> We will need to add two limits, RLIMIT_UCLAMP_MIN/MAX, right?

Not sure, but I was originally envisioning having only one that applies
to both min and max. In which case would we need separate ones?

> We have the following hierarchy now:
>
> 1. System Wide (/proc/sys/kernel/sched_util_clamp_min/max)
> 2. Cgroup
> 3. Per-Task
>
> In that order of priority where 1 limits/overrides 2 and 3. And
> 2 limits/overrides 3.
>
> Where do you see the RLIMIT fit in this hierarchy? It should be between 2 and
> 3, right? Cgroup settings should still win even if the user/processes were
> limited?

Yes, the rlimit stuff would just apply to the syscall interface.

> If the framework decided a user can't request any boost at all (can't increase
> its uclamp_min above 0). IIUC then setting the hard limit of RLIMIT_UCLAMP_MIN
> to 0 would achieve that, right?

Exactly.

> Since the framework and the task itself would go through the same
> sched_setattr() call, how would the framework circumvent this limit? IIUC it
> has to raise the RLIMIT_UCLAMP_MIN first then perform sched_setattr() to
> request the boost value, right? Would this overhead be acceptable? It looks
> considerable to me.

The framework needs to have CAP_SYS_NICE to change another process'
clamps, and generally rlimit checks don't apply to CAP_SYS_NICE-capable
processes -- see __sched_setscheduler(). So I think we should be fine.
IOW, rlimits are just constraining what unprivileged tasks are allowed
to request for themselves IIUC.

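
For reference, the existing nice-value check follows that pattern. Below
is a small userspace model of it, written from memory of the kernel's
can_nice() and nice_to_rlimit() helpers. The function name and the
boolean parameter standing in for capable(CAP_SYS_NICE) are assumptions
made for this sketch, so treat it as illustrative rather than as the
actual kernel code.

```c
#include <stdbool.h>

#define MAX_NICE 19

/* Maps nice values 19..-20 onto the RLIMIT_NICE scale 1..40. */
static long nice_to_rlimit(long nice)
{
	return MAX_NICE - nice + 1;
}

/*
 * Userspace model of the check: a privileged caller skips the rlimit
 * entirely, everybody else is bounded by RLIMIT_NICE.
 */
static bool can_nice_model(long nice, unsigned long rlimit_nice,
			   bool has_cap_sys_nice)
{
	if (has_cap_sys_nice)
		return true;

	return (unsigned long)nice_to_rlimit(nice) <= rlimit_nice;
}
```

With RLIMIT_NICE at 25, an unprivileged task can go down to nice -5 but
not to -6, while a CAP_SYS_NICE-capable one is not constrained at all.
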
> Also, will prlimit() allow you to go outside what was set for the user via
> setrlimit()? Reading the man pages it seems to override, so that should be
> fine.

IIRC rlimits are per-process properties, not per-user, so I think we
should be fine here as well?

> For 1 (System Wide) limits, sched_setattr() requests are accepted, but the
> effective uclamp is *capped by* the system wide limit.
>
> Were you thinking RLIMIT_UCLAMP* will behave similarly?

Nope, I was actually thinking of having the syscall return -EPERM in
this case, as we already do for nice values or RT priorities.

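
A rough sketch of the check being discussed, combining the rules quoted
earlier in the thread with the -EPERM behaviour. Nothing here exists in
the kernel: RLIMIT_UCLAMP, the function name and its parameters are all
hypothetical, and using inclusive comparisons (<=) is an assumption of
this sketch.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical permission check for a uclamp request made through
 * sched_setattr(). Returns 0 if the request is allowed, -EPERM if not.
 * The same check would apply to both uclamp.min and uclamp.max.
 */
static int uclamp_request_check(unsigned int requested,
				unsigned int current_clamp,
				unsigned long rlimit_uclamp,
				bool has_cap_sys_nice)
{
	/* Privileged callers are not subject to the rlimit. */
	if (has_cap_sys_nice)
		return 0;

	/* Not going above the existing clamp is always allowed. */
	if (requested <= current_clamp)
		return 0;

	/* Raising the clamp is allowed up to the (hypothetical) rlimit. */
	if (requested <= rlimit_uclamp)
		return 0;

	return -EPERM;
}
```

Since the request is rejected rather than capped, the value stored in the
task is never silently altered by the check, and the existing effective
clamp computation is left alone.
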
> If they do, we have
> consistent behavior with how the current system wide limits work; but this will
> break your use case because tasks can change the requested uclamp value for
> a task, albeit the effective value will be limited.
>
> RLIMIT_UCLAMP_MIN=512
> p->uclamp[UCLAMP_MIN] = 800 // this request is allowed but
> // Effective UCLAMP_MIN = 512
>
> If not, then
>
> RLIMIT_UCLAMP_MIN=no limit
> p->uclamp[UCLAMP_MIN] = 800 // task changed its uclamp_min to 800
> RLIMIT_UCLAMP_MIN=512 // limit was lowered for task/user
>
> what will happen to p->uclamp[UCLAMP_MIN] in this case? Will it be lowered to
> match the new limit? And this will be inconsistent with the current system wide
> limits we already have.

As per the above, if the syscall returns -EPERM we can leave the
integration with system-wide defaults and such untouched I think.

> Sorry too many questions. I was mainly thinking out loud. I need to spend more
> time to dig into the details of how RLIMITs are imposed to understand how this
> could be a good fit. I already see some friction points that need more
> thinking.

No need to apologize, this would be a new userspace-visible interface,
so you're right that we need to think it through.

Thanks for the feedback,
Quentin