kernel-hardening.lists.openwall.com archive mirror
 help / color / mirror / Atom feed
From: ebiederm@xmission.com (Eric W. Biederman)
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>,
	 LKML <linux-kernel@vger.kernel.org>,
	 Linux Containers <containers@lists.linux-foundation.org>,
	 Kernel Hardening <kernel-hardening@lists.openwall.com>,
	 Alexey Gladkov <legion@kernel.org>,
	 Kees Cook <keescook@chromium.org>,
	 Christian Brauner <christian@brauner.io>
Subject: Re: [RFC PATCH v2 0/8] Count rlimits in each user namespace
Date: Mon, 11 Jan 2021 14:17:19 -0600	[thread overview]
Message-ID: <87a6tfp6sw.fsf@x220.int.ebiederm.org> (raw)
In-Reply-To: <CAHk-=wgXZmRu762bjSeK80+T_LTo+UP9y5rP-uvym1vquSxmBw@mail.gmail.com> (Linus Torvalds's message of "Sun, 10 Jan 2021 10:46:05 -0800")

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sun, Jan 10, 2021 at 9:34 AM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
>>
>> To address the problem, we bind rlimit counters to each user namespace. The
>> result is a tree of rlimit counters with the biggest value at the root (aka
>> init_user_ns). The rlimit counter increment/decrement occurs in the current and
>> all parent user namespaces.
>
> I'm not seeing why this is necessary.
>
> Maybe it's the right approach, but none of the patches (or this cover
> letter email) really explain it to me.
>
> I understand why you might want the _limits_ themselves would form a
> tree like this - with the "master limit" limiting the limits in the
> user namespaces under it.
>
> But I don't understand why the _counts_ should do that. The 'struct
> user_struct' should be shared across even user namespaces for the same
> user.
>
> IOW, the very example of the problem you quote seems to argue against this:
>
>> For example, there are two containers (A and B) created by one user. The
>> container A sets RLIMIT_NPROC=1 and starts one process. Everything is fine, but
>> when container B tries to do the same it will fail because the number of
>> processes is counted globally for each user and user has one process already.
>
> Note how the problem was _not_ that the _count_ was global. That part
> was fine and all good.

The problem is fundamentally that the per process RLIMIT_NPROC was
compared against the user_struct->processes.

I have only heard the problem described but I believe it is either the
RLIMIT_NPROC test in fork or at the beginning of do_execveat_common that
is failing.

From fs/exec.c line 1866:
> 	/*
> 	 * We move the actual failure in case of RLIMIT_NPROC excess from
> 	 * set*uid() to execve() because too many poorly written programs
> 	 * don't check setuid() return code.  Here we additionally recheck
> 	 * whether NPROC limit is still exceeded.
> 	 */
> 	if ((current->flags & PF_NPROC_EXCEEDED) &&
> 	    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
> 		retval = -EAGAIN;
> 		goto out_ret;
> 	}

From fs/fork.c line 1966:
> 	retval = -EAGAIN;
> 	if (atomic_read(&p->real_cred->user->processes) >=
> 			task_rlimit(p, RLIMIT_NPROC)) {
> 		if (p->real_cred->user != INIT_USER &&
> 		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
> 			goto bad_fork_free;
> 	}
> 	current->flags &= ~PF_NPROC_EXCEEDED;

In both the cases the RLIMIT_NPROC value comes from
task->signal->rlim[RLIMIT_NPROC] and the count of processes
comes from task->cred->user->processes.

> No, the problem was that the _limit_ in container A also ended up
> affecting container B.

The description I have is that both containers run the same service
that set it's RLIMIT_NPROC to 1 in both containers.

> So to me, that says that it would make sense to continue to use the
> resource counts in 'struct user_struct' (because if user A has a hard
> limit of X, then creating a new namespace shouldn't expand that
> limit), but then have the ability to make per-container changes to the
> resource limits (as long as they are within the bounds of the parent
> user namespace resource limit).

I agree that needs to work as well.

> Maybe there is some reason for this ucounts approach, but if so, I
> feel it was not explained at all.

Let me see if I can starte the example a litle more clearly.

Suppose there is a service never_fork that sets RLIMIT_NPROC runs as
never_fork_user and sets RLIMIT_NPROC to 1 in it's systemd service file.

Further suppose there is a user bob who has two containers he wants to
run: container1 and container2.  Both containers start the never_fork
service.

Bob first starts container1 and inside it the never_fork service starts.
Bob starts container2 and the never_fork service fails to start.

Does that make it clear that it is the count of the processes that would
exceed 1 if both instances of the never_fork service starts that would
be the problem?

Eric

      reply	other threads:[~2021-01-11 20:18 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-10 17:33 [RFC PATCH v2 0/8] Count rlimits in each user namespace Alexey Gladkov
2021-01-10 17:33 ` [RFC PATCH v2 1/8] Use atomic type for ucounts reference counting Alexey Gladkov
2021-01-13 16:31   ` Eric W. Biederman
2021-01-13 18:01     ` Kees Cook
2021-01-10 17:33 ` [RFC PATCH v2 2/8] Add a reference to ucounts for each user Alexey Gladkov
2021-01-13  6:33   ` 59ebc79722: kernel_BUG_at_kernel/cred.c kernel test robot
2021-01-13 16:25   ` [RFC PATCH v2 2/8] Add a reference to ucounts for each user Eric W. Biederman
2021-01-10 17:33 ` [RFC PATCH v2 3/8] Increase size of ucounts to atomic_long_t Alexey Gladkov
2021-01-10 17:33 ` [RFC PATCH v2 4/8] Move RLIMIT_NPROC counter to ucounts Alexey Gladkov
2021-01-10 17:33 ` [RFC PATCH v2 5/8] Move RLIMIT_MSGQUEUE " Alexey Gladkov
2021-01-10 17:33 ` [RFC PATCH v2 6/8] Move RLIMIT_SIGPENDING " Alexey Gladkov
2021-01-10 17:33 ` [RFC PATCH v2 7/8] Move RLIMIT_MEMLOCK " Alexey Gladkov
2021-01-10 17:33 ` [RFC PATCH v2 8/8] Move RLIMIT_NPROC check to the place where we increment the counter Alexey Gladkov
2021-01-10 18:46 ` [RFC PATCH v2 0/8] Count rlimits in each user namespace Linus Torvalds
2021-01-11 20:17   ` Eric W. Biederman [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87a6tfp6sw.fsf@x220.int.ebiederm.org \
    --to=ebiederm@xmission.com \
    --cc=christian@brauner.io \
    --cc=containers@lists.linux-foundation.org \
    --cc=gladkov.alexey@gmail.com \
    --cc=keescook@chromium.org \
    --cc=kernel-hardening@lists.openwall.com \
    --cc=legion@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).