On 2020-08-31, Aleksa Sarai <asarai@suse.de> wrote:
> On 2020-08-28, Sargun Dhillon <sargun@sargun.me> wrote:
> > On Fri, Aug 28, 2020 at 12:29 PM Eric W. Biederman
> > <ebiederm@xmission.com> wrote:
> > > Just to scope how much work it would be to fix rlimits
> > > so they are not a problem for user namespaces I took a quick
> > > survey.
> > >
> > > The rlimits can be found in
> > > include/uapi/asm-generic/resource.h
> > >
> > > There are a total of 16 rlimits.
> > > There are only 4 rlimits that are enforced at anything other
> > > than process granularity.
> > >
> > > RLIMIT_NPROC
> > > RLIMIT_MEMLOCK
> > > RLIMIT_SIGPENDING
> > > RLIMIT_MSGQUEUE
> > >
> > > So it should not be difficult to fix those rlimits.
> > 
> > What are your proposed semantics for what the "fix" would look like? Or
> > are you saying that once we take on Christian's proposal of 64-bit kuid
> > they would be trivial to fix? I think the reason we didn't move forward with
> > fixing it is the only real thing we could agree upon is an rlimit namespace,
> 
> From memory, we did briefly discuss how this would work in the call. I
> believe the basic idea was that the host rlimit would act as a maximum
> setting but there would be an optional lower limit that a user namespace
> could set and would be accounted separately. That way containers
> wouldn't interfere with each other's rlimit settings. I imagine this
> would be nested with user namespaces and presumably means that rlimit
> would now be attached to userns directly.
> 
> (But I might be misremembering the details of the proposal. I do
> remember Eric mentioning that the "maximum namespaces" sysctl semantics
> were a useful model to look at.)
> 
> > and then you get into a question of why do these even exist, and should
> > they just be cgroup(v2) controllers, and should calling setrlimit just
> > be a wrapper around a cgroup(v2) controller that has a map of
> > uid -> limit?
> 
> To mirror what I said when this came up in the actual discussion, the
> reason why we don't have cgroups for all of these things is that some of
> those limits aren't "real resources" and arguably should all be managed
> through kmemcg policies.
> 
> Right after getting the pids cgroup controller merged, I did mention
> adding controllers for the other rlimits and Tejun said that they didn't
> make sense to add ([1] is one of the responses I found through a quick
> search). The only reason the pids controller was merged is that you
> could still fork-bomb a system even with modest kmemcg limits.
> 
> [1]: https://lore.kernel.org/lkml/20150227114940.GB3964@htj.duckdns.org/

[2] is a more explicit NACK from Tejun in that thread.

[2]: https://lore.kernel.org/lkml/20150227170640.GK3964@htj.duckdns.org/
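
To make the nested-limit idea quoted above a bit more concrete, here is
a rough userspace sketch in Python. It is only an illustration of the
semantics as I described them from memory, not kernel code and not
anyone's actual patch. The UserNS class and its charge()/uncharge()
methods are hypothetical names: each namespace may carry an optional
lower limit, a charge is accounted in the namespace and in every
ancestor, and it fails (with rollback) if any level is already full.

```python
from typing import Optional


class UserNS:
    """Hypothetical stand-in for a user namespace with one rlimit-style counter."""

    def __init__(self, limit: Optional[int] = None,
                 parent: Optional["UserNS"] = None):
        self.limit = limit    # None means no extra limit at this level
        self.parent = parent
        self.count = 0        # charges from this namespace and its descendants

    def charge(self) -> bool:
        """Charge one unit here and in every ancestor; undo all on failure."""
        charged = []
        ns = self
        while ns is not None:
            if ns.limit is not None and ns.count >= ns.limit:
                # Some level is full: roll back what was charged so far.
                for done in charged:
                    done.count -= 1
                return False
            ns.count += 1
            charged.append(ns)
            ns = ns.parent
        return True

    def uncharge(self) -> None:
        """Release one unit here and in every ancestor."""
        ns = self
        while ns is not None:
            ns.count -= 1
            ns = ns.parent


# The host limit acts as the overall maximum; container "a" sets a lower
# limit of its own, container "b" sets none.
host = UserNS(limit=4)
a = UserNS(limit=2, parent=host)
b = UserNS(parent=host)

results_a = [a.charge() for _ in range(3)]  # third attempt hits a's limit
results_b = [b.charge() for _ in range(3)]  # third attempt hits host's limit
```

In this sketch "a" is stopped by its own limit without consuming any
more of the host's budget, while "b" is only stopped once the host
maximum is reached.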

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>