From: Christian Brauner <christian@brauner.io>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Lutomirski <luto@kernel.org>, Jann Horn <jannh@google.com>,
Aleksa Sarai <cyphar@cyphar.com>,
Al Viro <viro@zeniv.linux.org.uk>,
Jeff Layton <jlayton@kernel.org>,
"J. Bruce Fields" <bfields@fieldses.org>,
Arnd Bergmann <arnd@arndb.de>,
David Howells <dhowells@redhat.com>,
Eric Biederman <ebiederm@xmission.com>,
Andrew Morton <akpm@linux-foundation.org>,
Alexei Starovoitov <ast@kernel.org>,
Kees Cook <keescook@chromium.org>,
Tycho Andersen <tycho@tycho.ws>,
David Drysdale <drysdale@google.com>,
Chanho Min <chanho.min@lge.com>, Oleg Nesterov <oleg@redhat.com>,
Aleksa Sarai <asarai@suse.de>,
Linux Containers <containers@lists.linux-foundation.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Linux API <linux-api@vger.kernel.org>,
kernel list <linux-kernel@vger.kernel.org>,
linux-arch <linux-arch@vger.kernel.org>
Subject: Re: [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters
Date: Sun, 12 May 2019 12:19:35 +0200 [thread overview]
Message-ID: <20190512101933.lqafnc2zu46ocyhe@brauner.io> (raw)
In-Reply-To: <CAHk-=wg3+3GfHsHdB4o78jNiPh_5ShrzxBuTN-Y8EZfiFMhCvw@mail.gmail.com>
On Sat, May 11, 2019 at 07:08:49PM -0400, Linus Torvalds wrote:
> [ on mobile again, power is off and the wifi doesn't work, so I'm reading
> email on my phone and apologizing once more for horrible html email.. ]
>
> On Sat, May 11, 2019, 18:40 Andy Lutomirski <luto@kernel.org> wrote:
>
> >
> > a) Change all my UIDs and GIDs to match a container, enter that
> > container's namespaces, and run some binary in the container's
For the namespace part, an idea I had and presented at LPC a while ago
was to make setns() interpret the nstype argument as a flag argument and
to introduce an fd that can refer to multiple namespaces at the same
time. This way you could do:
setns(namespaces_fd, CLONE_NEWNS | CLONE_NEWUSER | CLONE_NEWPID)
that could still be done. But I since have come to think that there's a
better way now that we have CLONE_PIDFD. We should just make setns()
take a pidfd as argument and then be able to call:
setns(pidfd, 0);
which would cause the calling thread to join all of the namespaces of
the process referred to by pidfd at the same time. That really shouldn't
be hard to do. I could probably get this going rather quickly and it
would really help out container managers a lot.
> > filesystem, all atomically enough that I don't need to worry about
> > accidentally leaking privileges into the container. A
> > super-duper-non-dumpable mode would kind of allow this, but I'd worry
> > that there's some other hole besides ptrace() and /proc/self.
> >
>
> So I would expect that you'd want to do this even *without* doing an execve
> at all, which is why I still don't think this is actually about
> spawn/execve at all.
>
> But I agree that what you that what you want sounds reasonable. But I think
I have pitched an api like that a while ago (see [1]) - when I first
started this fd for processes thing - that would allow you to do things
like that. The gist was:
1. int pidfd_create aka CLONE_PIDFD now
will return an fd that creates a process context. The fd returned by
does not refer to any existing process and has not actually been
started yet. So non-configuration operations on it or trying to
interact with it would fail with e.g. ESRCH/EINVAL.
We essentially have this now with CLONE_PIDFD. The bit that is missing
is an "unscheduled" process.
2. int pidfd_config
takes a CLONE_PIDFD and can be used to configure a process context
before it is alive.
Some things that I would like to be able to do with this syscall are:
- configure signals
- set clone flags
- write idmappings if the process runs in a new user namespace
- configure what happens when all procfds referring to the process are gone
- ...
3. int pidfd_info
4. int pidfd_manage
Parts of that are captured in pidfd_send_signal().
Just to get a very rough feel for this without detailing parameters right now:
/* process should have own mountns */
pidfd_config(fd, PROC_SET_NAMESPACE, CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWUSR, <potentially additional arguments>)
/* process should get SIGKILL when all procfds are closed */
pidfd_config(fd, PROC_SET_CLOSE, SIGKILL, <potentially additional arguments>)
After the caller is done configuring the process there would be a final step:
pidfd_config(fd, PROC_CREATE, 0, <potentially additional arguments>)
which would create the process and (either as return value or through a
parameter) return the pid of the newly created process.
I had more thoughts on this and had started prototyping some of it but
then there wasn't much interest it seemed. Maybe because it's crazy.
[1]: https://lore.kernel.org/lkml/20181118174148.nvkc4ox2uorfatbm@brauner.io/
> the "dumpable" flag has always been a complete hack, and very unreliable
> with random meanings (and random ways to then get around it).
>
> We have actually had lots of people wanting to run "lists of system calls"
> kinds of things. Sometimes for performance reasons, sometimes for other
> random reasons Maybe that "atomicity" angle would be another one, although
> we would have to make the setuid etc things be something that happens at
> the end.
>
> So rather than "spawn", is much rather see people think about ways to just
> batch operations in general, rather than think it is about batching things
> just before a process change.
>
> b) Change all my UIDs and GIDs to match a container, enter that
> > container's namespaces, and run some binary that is *not* in the
> > container's filesystem.
> >
>
> Hey, you could probably do something very close to that by opening the
> executable you want to run first, making it O_CLOEXEC, then doing all the
> other things (which now makes the executable inaccessible), and finally
> doing execveat() on the file descriptor..
I think that's somewhat similar to what I've suggested above.
>
> I say "something very close" because even though it's O_CLOEXEC, only the
> file would be closed, and /proc/self/exe would still exist.
>
> But we really could make that execveat of a O_CLOEXEC file thing also
> disable access to /proc/*/exe, and it sounds like a sane and simple
> extension in general to do..
>
> Linus
>
> >
next prev parent reply other threads:[~2019-05-12 10:19 UTC|newest]
Thread overview: 32+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-05-06 16:54 [PATCH v6 0/6] namei: resolveat(2) path resolution restriction API Aleksa Sarai
2019-05-06 16:54 ` [PATCH v6 1/6] namei: split out nd->dfd handling to dirfd_path_init Aleksa Sarai
2019-05-06 16:54 ` [PATCH v6 2/6] namei: O_BENEATH-style path resolution flags Aleksa Sarai
2019-05-06 16:54 ` [PATCH v6 3/6] namei: LOOKUP_IN_ROOT: chroot-like path resolution Aleksa Sarai
2019-05-06 16:54 ` [PATCH v6 4/6] namei: aggressively check for nd->root escape on ".." resolution Aleksa Sarai
2019-05-06 16:54 ` [PATCH v6 5/6] binfmt_*: scope path resolution of interpreters Aleksa Sarai
2019-05-06 18:37 ` Jann Horn
2019-05-06 19:17 ` Aleksa Sarai
2019-05-06 23:41 ` Andy Lutomirski
2019-05-08 0:54 ` Aleksa Sarai
2019-05-10 20:41 ` Jann Horn
2019-05-10 21:20 ` Andy Lutomirski
2019-05-10 22:55 ` Jann Horn
2019-05-10 23:36 ` Christian Brauner
2019-05-11 15:49 ` Aleksa Sarai
2019-05-11 17:00 ` Andy Lutomirski
2019-05-11 17:21 ` Linus Torvalds
2019-05-11 17:26 ` Linus Torvalds
2019-05-11 17:31 ` Aleksa Sarai
2019-05-11 17:43 ` Linus Torvalds
2019-05-11 17:48 ` Christian Brauner
2019-05-11 18:00 ` Aleksa Sarai
2019-05-11 22:39 ` Andy Lutomirski
[not found] ` <CAHk-=wg3+3GfHsHdB4o78jNiPh_5ShrzxBuTN-Y8EZfiFMhCvw@mail.gmail.com>
2019-05-12 10:19 ` Christian Brauner [this message]
[not found] ` <9CD2B97D-A6BD-43BE-9040-B410D996A195@amacapital.net>
2019-05-12 10:44 ` Linus Torvalds
2019-05-12 13:35 ` Aleksa Sarai
2019-05-12 13:38 ` Aleksa Sarai
2019-05-12 14:34 ` Andy Lutomirski
2019-05-11 17:26 ` Aleksa Sarai
2019-05-08 0:38 ` Eric W. Biederman
2019-05-10 20:10 ` Jann Horn
2019-05-06 16:54 ` [PATCH v6 6/6] namei: resolveat(2) syscall Aleksa Sarai
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190512101933.lqafnc2zu46ocyhe@brauner.io \
--to=christian@brauner.io \
--cc=akpm@linux-foundation.org \
--cc=arnd@arndb.de \
--cc=asarai@suse.de \
--cc=ast@kernel.org \
--cc=bfields@fieldses.org \
--cc=chanho.min@lge.com \
--cc=containers@lists.linux-foundation.org \
--cc=cyphar@cyphar.com \
--cc=dhowells@redhat.com \
--cc=drysdale@google.com \
--cc=ebiederm@xmission.com \
--cc=jannh@google.com \
--cc=jlayton@kernel.org \
--cc=keescook@chromium.org \
--cc=linux-api@vger.kernel.org \
--cc=linux-arch@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=luto@kernel.org \
--cc=oleg@redhat.com \
--cc=torvalds@linux-foundation.org \
--cc=tycho@tycho.ws \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).