Re: [PATCH v2] initramfs: Support unpacking directly to tmpfs

From: Emily Shepherd <emily@redcoat.dev>
To: Rob Landley <rob@landley.net>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
	initramfs@vger.kernel.org,
	"Thomas Strömberg" <t+github@chainguard.dev>,
	"Anders Björklund" <anders.f.bjorklund@gmail.com>,
	"Giuseppe Scrivano" <giuseppe@scrivano.org>,
	"Al Viro" <viro@zeniv.linux.org.uk>,
	"Christoph Hellwig" <hch@lst.de>, "Jens Axboe" <axboe@kernel.dk>
Subject: Re: [PATCH v2] initramfs: Support unpacking directly to tmpfs
Date: Fri, 1 Dec 2023 23:37:43 +0000
Message-ID: <l6s4ngxtj6v7n5z6npcj4gmvr6x6vrvldwsfmyufovrj6teu3r@vajcufu3lltc>
In-Reply-To: <dbeb7b4b-b5e6-15ba-7e74-f9ffdd07059b@landley.net>

On Fri, Dec 01, 2023 at 04:02:50PM -0600, Rob Landley wrote:
>You are reasoning backwards from your solution and not thinking about the
>design. I don't think you're addressing the real issue.
>
>Right now "separate" container namespaces all share a common rootfs instance.
>They do NOT share a common init task, even though before containers that was
>universal. You can have your own PID namespace, which starts _empty_.
>
>Your mount tree in a container does NOT start empty. From the clone(2) man page:
>
>  If  CLONE_NEWNS  is  set,  the  cloned child is started in a new
>  mount namespace, initialized with a copy of the namespace of the parent.
>
>Defaulting to having everything in it and removing what you don't want to keep
>is very different from what PID or UID namespaces do, and is causing you
>problems. Doing a chroot is basically an overmount, the other mount points are
>still there in your tree and accessible if you try hard enough, and rootfs is
>common to all containers. Mitigating this requires cleanup work that isn't
>always even possible to fully do (ala rootfs actually being used, which does
>happen a lot today and it's always accessible if a static process forking its
>own mount namespace does enough umounts, which can then act as a
>cifs/nfs/9p/rsync server out to the parent or some such).
>
>Logically, extending the kernel to have a CLONE_NEWROOTFS where it gets a _new_
>ramfs or tmpfs instance, unique to that namespace, at the root of a new empty
>mount tree, is the logical fix. There is then design work around "so what API do
>you use to populate it" which could range from "the first int below child_stack
>is the fd of a cpio.gz to extract into it and then it launches an /init out of
>there the way the host linux boots" through "the new child starts suspended ala
>vfork/ptrace and then the parent process initializes it and unblocks it" to "the
>init task is running the executable from the host context that called clone and
>has inherited the existing open filehandles from the host context, although
>despite the openat() family being in posix-2008 we sadly don't appear to have a
>mountat()...". I dunno. That's design work to properly fix the issue.
>
>You don't want to address the design problem, you want to add a special case
>workaround for your current issue. You see doing that as a "design fix". I do not.

I think this is a good point - I definitely agree that the weird
hackiness that runtimes have to do to set up their mount namespaces
properly is suboptimal.

The hypothetical CLONE_NEWROOTFS that you suggest is a superior
approach - not only because it would better do what containers
actually want, but also because it would do it with fewer syscalls and
less flapping!

As an aside: I take your point RE rootfs being shared. The general 
concern is normally that information from the host might leak if 
containers can read the host root, so sharing an empty rootfs is less of 
a concern, but again the theoretical case of information sharing between 
containers by writing to the shared rootfs is an interesting one too.

>Fine. Moving on. I still think a dedicated CONFIG entry is a bad way to do the
>silly thing. Specifying the silly thing on the kernel command line seems less bad.
>
>Checking for "root=tmpfs" to trigger the silly thing seems less bad to me,
>although I note that init/do_mounts.c function init_rootfs() already _is_
>checking for that (and there's a pending patch to tweak it), so... be aware.

My original reasoning for having it as a build option was that running
directly from initramfs is often done when you're embedding the
initramfs to create a unified kernel. As a result, it is something that
you'd only really care to turn on or off at build time.

Having said that, I have no strong opinion on that.

>That's the part I don't understand. It _seems_ like what you were saying. Not
>"this hasn't been working fine for everyone else for the past 15 years already",
>but "I think it should have been designed a different way 20 years ago, and
>would like to change it to match my opinion".

I have to say I struggle to understand where to go from here... as I
said above, I do like the CLONE_NEWROOTFS suggestion (and it was
actually something I was batting around for my own project) but that
feels like a _way more_ specialised feature.

And now you are saying that apparently we _shouldn't_ make a relatively
small change to initramfs because it's worked fine for years, but we
should add a much larger patch to clone() which has also worked for many
years? I shouldn't question how initramfs works because you were there 
when it was written [1], but we should question all the devs who decided 
on CLONE_NEWNS over CLONE_NEWROOTFS?

I'm not saying we shouldn't, but help me out here - how can I tell 
what's "reasonable" to question and what isn't?

[1]: https://media.tenor.com/lR9rjwXjL50AAAAC/deep-magic-lion.gif

>LOTS of embedded people have used the existing initramfs, and it's accumulated a
>BUNCH of weirdness over the years. Did you know you can concatenate multiple
>cpio.gz files and the kernel loader will accept them as one big archive?

I did, yes.
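
(For anyone following the thread who hasn't run into this: below is a
minimal Python sketch of the concatenation behaviour, as I understand
it. The helper names newc_entry, cpio_gz and list_names are made up for
illustration, the file names are arbitrary, and list_names only mimics
how a concatenated stream can be walked; it is not the kernel's
unpacker.)

```python
import gzip


def newc_entry(name: str, data: bytes = b"", mode: int = 0o100644, ino: int = 0) -> bytes:
    """Build one cpio "newc" record: 110-byte ASCII header, name, data."""
    name_bytes = name.encode() + b"\0"  # namesize counts the trailing NUL
    # Field order: ino, mode, uid, gid, nlink, mtime, filesize,
    # devmajor, devminor, rdevmajor, rdevminor, namesize, check
    fields = [ino, mode, 0, 0, 1, 0, len(data), 0, 0, 0, 0, len(name_bytes), 0]
    out = b"070701" + b"".join(b"%08X" % f for f in fields) + name_bytes
    out += b"\0" * (-len(out) % 4)  # header + name padded to 4 bytes
    out += data
    out += b"\0" * (-len(out) % 4)  # data padded to 4 bytes
    return out


def cpio_gz(files: dict[str, bytes]) -> bytes:
    """Pack files into a complete, gzip-compressed newc archive."""
    body = b"".join(
        newc_entry(name, data, ino=i + 1) for i, (name, data) in enumerate(files.items())
    )
    body += newc_entry("TRAILER!!!", mode=0)  # end-of-archive marker
    return gzip.compress(body)


def list_names(raw: bytes) -> list[str]:
    """Walk an uncompressed stream of one or more newc archives."""
    names, pos = [], 0
    while pos + 110 <= len(raw):
        if raw[pos:pos + 6] != b"070701":
            pos += 4  # skip padding between archives
            continue
        fields = [int(raw[pos + 6 + 8 * i:pos + 14 + 8 * i], 16) for i in range(13)]
        filesize, namesize = fields[6], fields[11]
        name_start = pos + 110
        name = raw[name_start:name_start + namesize - 1].decode()
        data_start = (name_start + namesize + 3) & ~3
        pos = (data_start + filesize + 3) & ~3
        if name != "TRAILER!!!":  # a trailer ends one archive, not the stream
            names.append(name)
    return names


# Two independent archives, simply appended byte-for-byte.
base = cpio_gz({"init": b"#!/bin/sh\n"})
extra = cpio_gz({"hostname": b"box\n"})
combined = base + extra
```

Each piece keeps its own TRAILER!!! record and its own gzip member, so
an overlay can be appended to an existing image without repacking it.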

>Are you suggesting I don't understand because I'm not "one of us"?

No, and I am sorry that I phrased that poorly. I merely meant that there 
are a hell of a lot of different build options and systems within the 
kernel, and it is perhaps not unreasonable to suggest that it is not a 
requirement that everyone intimately understands all of them all of the 
time.

>You are not the first person to use this plumbing. "Everybody _really_ wants
>what I think it should always have been like, but nobody's mentioned it in the
>past 20 years" is a strange position to take. Earlier you said "the fact that
>the desirable path is" as a universal statement rather than a personal opinion.
>Desirable to who? Judged as "fact" by who?

I meant for container runtimes. Most are quite opinionated about not 
doing mount --move . / && chroot(.), strictly preferring pivot_root 
instead.
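
To make the two approaches concrete, here is a rough Python/ctypes
sketch of each. It is an illustration under assumptions, not what any
particular runtime does: the function names are hypothetical, the
syscall numbers are the ones I believe apply to x86_64 and aarch64, and
both functions need CAP_SYS_ADMIN inside a private mount namespace with
new_root already a mount point.

```python
import ctypes
import os
import platform

MS_MOVE = 0x2000   # from <sys/mount.h>
MNT_DETACH = 2     # lazy unmount flag for umount2()
# Assumption: pivot_root syscall numbers for two common architectures.
SYS_PIVOT_ROOT = {"x86_64": 155, "aarch64": 41}

libc = ctypes.CDLL(None, use_errno=True)


def _check(ret: int) -> None:
    """Raise OSError from errno if a libc call returned non-zero."""
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def switch_root_move_chroot(new_root: str) -> None:
    """The 'mount --move . / && chroot(.)' approach."""
    os.chdir(new_root)
    _check(libc.mount(ctypes.c_char_p(b"."), ctypes.c_char_p(b"/"),
                      None, ctypes.c_ulong(MS_MOVE), None))
    os.chroot(".")
    os.chdir("/")


def switch_root_pivot(new_root: str) -> None:
    """The pivot_root approach, using the pivot_root(".", ".") idiom."""
    nr = SYS_PIVOT_ROOT[platform.machine()]
    os.chdir(new_root)
    _check(libc.syscall(ctypes.c_long(nr),
                        ctypes.c_char_p(b"."), ctypes.c_char_p(b".")))
    # The old root is now stacked at ".", so detach it lazily.
    _check(libc.umount2(ctypes.c_char_p(b"."), ctypes.c_int(MNT_DETACH)))
    os.chdir("/")
```

The difference that matters here is that the first variant leaves the
old root mounted underneath the new one, while the second removes it
from the namespace entirely. pivot_root(2) documents that the initial
rootfs cannot be pivoted, which is why the second variant does not work
when the container's parent is running straight out of initramfs.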

-- 
Emily Shepherd

Red Coat Development Limited