Re: [PATCH v2 3/3] sched: Make uclamp changes depend on CAP_SYS_NICE

From: Quentin Perret <qperret@google.com>
To: Qais Yousef <qais.yousef@arm.com>
Cc: mingo@redhat.com, peterz@infradead.org,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rickyiu@google.com, wvw@google.com, patrick.bellasi@matbug.net,
	xuewen.yan94@gmail.com, linux-kernel@vger.kernel.org,
	kernel-team@android.com
Subject: Re: [PATCH v2 3/3] sched: Make uclamp changes depend on CAP_SYS_NICE
Date: Mon, 21 Jun 2021 10:52:28 +0000	[thread overview]
Message-ID: <YNBvbEv2UTzSHBP5@google.com> (raw)
In-Reply-To: <20210614150327.3humrvztv3fxurvk@e107158-lin.cambridge.arm.com>

Hi Qais,

Apologies for the delayed reply, I was away last week.

On Monday 14 Jun 2021 at 16:03:27 (+0100), Qais Yousef wrote:
> On 06/11/21 14:43, Quentin Perret wrote:
> > On Friday 11 Jun 2021 at 15:17:37 (+0100), Qais Yousef wrote:
> > > On 06/11/21 13:49, Quentin Perret wrote:
> > > > Thinking about it a bit more, a more involved option would be to have
> > > > this patch as is, but to also introduce a new RLIMIT_UCLAMP on top of
> > > > it. The semantics could be:
> > > > 
> > > >   - if the clamp requested by the non-privileged task is lower than its
> > > >     existing clamp, then allow;
> > > >   - otherwise, if the requested clamp is less than UCLAMP_RLIMIT, then
> > > >     allow;
> > > >   - otherwise, deny,
> > > > 
> > > > And the same principle would apply to both uclamp.min and uclamp.max,
> > > > and UCLAMP_RLIMIT would default to 0.
> > > > 
> > > > Thoughts?
> > > 
> > > That could work. But then I'd prefer your patch to go as-is. I don't think
> > > uclamp can do with this extra complexity in using it.
> > 
> > Sorry I'm not sure what you mean here?
> 
> Hmm. I understood this as a new flag to sched_setattr() syscall first, but now
> I get it. You want to use getrlimit()/setrlimit()/prlimit() API to impose
> a restriction. My comment was in regard to this being a sys call extension,
> which it isn't. So please ignore it.
> 
> > 
> > > We basically want to specify we want to be paranoid about uclamp CAP or not. In
> > > my view that is simple and can't see why it would be a big deal to have
> > > a procfs entry to define the level of paranoia the system wants to impose. If
> > > it is a big deal though (would love to hear the arguments);
> > 
> > Not saying it's a big deal, but I think there are a few arguments in
> > favor of using rlimit instead of a sysfs knob. It allows for a much
> > finer grain configuration  -- constraints can be set per-task as well as
> > system wide if needed, and it is the standard way of limiting resources
> > that tasks can ask for.
> 
> Is it system wide or per user?

Right, so calling this 'system-wide' is probably an abuse, but IIRC
rlimits are per-process, and are inherited accross fork/exec. So the
usual trick to have a default value is to set the rlimits on the init
task accordingly. Android for instance already does that for a few
things, and I would guess that systemd and friends have equivalents
(though admittedly I should check that).

> > 
> > > requiring apps that
> > > want to self regulate to have CAP_SYS_NICE is better approach.
> > 
> > Rlimit wouldn't require that though, which is also nice as CAP_SYS_NICE
> > grants you a lot more power than just clamps ...
> 
> Now I better understand your suggestion. It seems a viable option I agree.
> I need to digest it more still though. The devil is in the details :)
> 
> Shouldn't the default be RLIM_INIFINITY? ie: no limit?

I guess so yes.

> We will need to add two limit, RLIMIT_UCLAMP_MIN/MAX, right?

Not sure, but I was originally envisioning to have only one that applies
to both min and max. In which would we need separate ones?

> We have the following hierarchy now:
> 
> 	1. System Wide (/proc/sys/kerenl/sched_util_clamp_min/max)
> 	2. Cgroup
> 	3. Per-Task
> 
> In that order of priority where 1 limits/overrides 2 and 3. And
> 2 limits/overrides 3.
> 
> Where do you see the RLIMIT fit in this hierarchy? It should be between 2 and
> 3, right? Cgroup settings should still win even if the user/processes were
> limited?

Yes, the rlimit stuff would just apply the syscall interface.

> If the framework decided a user can't request any boost at all (can't increase
> its uclamp_min above 0). IIUC then setting the hard limit of RLIMIT_UCLAMP_MIN
> to 0 would achieve that, right?

Exactly.

> Since the framework and the task itself would go through the same
> sched_setattr() call, how would the framework circumvent this limit? IIUC it
> has to raise the RLIMIT_UCLAMP_MIN first then perform sched_setattr() to
> request the boost value, right? Would this overhead be acceptable? It looks
> considerable to me.

The framework needs to have CAP_SYS_NICE to change another process'
clamps, and generally rlimit checks don't apply to CAP_SYS_NICE-capable
processes -- see __sched_setscheduler(). So I think we should be fine.
IOW, rlimits are just constraining what unprivileged tasks are allowed
to request for themselves IIUC.

> Also, Will prlimit() allow you to go outside what was set for the user via
> setrlimit()? Reading the man pages it seems to override, so that should be
> fine.

IIRC rlimit are per-process properties, not per-user, so I think we
should be fine here as well?

> For 1 (System Wide) limits, sched_setattr() requests are accepted, but the
> effective uclamp is *capped by* the system wide limit.
> 
> Were you thinking RLIMIT_UCLAMP* will behave similarly?

Nope, I was actually thinking of having the syscall return -EPERM in
this case, as we already do for nice values or RT priorities.

> If they do, we have
> consistent behavior with how the current system wide limits work; but this will
> break your use case because tasks can change the requested uclamp value for
> a task, albeit the effective value will be limited.
> 
> 	RLIMIT_UCLAMP_MIN=512
> 	p->uclamp[UCLAMP_min] = 800	// this request is allowed but
> 					// Effective UCLAMP_MIN = 512
> 
> If not, then
> 
> 	RLIMIT_UCLAMP_MIN=no limit
> 	p->uclamp[UCLAMP_min] = 800	// task changed its uclamp_min to 800
> 	RLIMIT_UCLAMP_MIN=512		// limit was lowered for task/user
> 
> what will happen to p->uclamp[UCLAMP_MIN] in this case? Will it be lowered to
> match the new limit? And this will be inconsistent with the current system wide
> limits we already have.

As per the above, if the syscall returns -EPERM we can leave the
integration with system-wide defaults and such untouched I think.

> Sorry too many questions. I was mainly thinking loudly. I need to spend more
> time to dig into the details of how RLIMITs are imposed to understand how this
> could be a good fit. I already see some friction points that needs more
> thinking.

No need to apologize, this would be a new userspace-visible interface,
so you're right that we need to think it through.

Thanks for the feedback,
Quentin