From: Emily Shepherd <emily@redcoat.dev>
To: Rob Landley <rob@landley.net>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
initramfs@vger.kernel.org,
"Thomas Strömberg" <t+github@chainguard.dev>,
"Anders Björklund" <anders.f.bjorklund@gmail.com>,
"Giuseppe Scrivano" <giuseppe@scrivano.org>,
"Al Viro" <viro@zeniv.linux.org.uk>,
"Christoph Hellwig" <hch@lst.de>, "Jens Axboe" <axboe@kernel.dk>
Subject: Re: [PATCH v2] initramfs: Support unpacking directly to tmpfs
Date: Thu, 30 Nov 2023 03:31:38 +0000
Message-ID: <37yuynohcuve46jhgzbz24ip6yb2lqvwcn6gpxwxpw6msgtk4b@7dgqfkdtjngb>
In-Reply-To: <cec90924-e7ec-377c-fb02-e0f25ab9db73@landley.net>
On Wed, Nov 29, 2023 at 02:53:49PM -0600, Rob Landley wrote:
>I'm assuming you can do process-local unmounts to prune what you'd be
>overmounting? (switch_root didn't _have_ to delete the old initramfs contents,
>it did it to free space. Containers similarly would remove privileges as part of
>their setup, and that includes mounts the new context shouldn't have.)
I think we are talking at cross purposes here, so let's get back on track:
The aim is to do a pivot_root within a container's mount namespace. In
order to do this, we need at least two layers of "root" in the host
mount namespace. If there is just one (i.e. rootfs), we can't pivot_root,
because rootfs cannot be removed.
Yes, you can switch_root or chroot or do any number of things, but that
is not relevant. That is not the way container runtimes tend to work.
The desirable outcome is pivot_root.
>From reading the link's summary, it seems like "unmount the old
>inherited /proc
>and /sys before you "mount --move newfs /" seems like it would have been a fix?
It was the patch that was made in the container runtimes at the time,
yes. This does not change the fact that the _desirable_ path is
pivot_root.
>Lazy unmount it (which never affects a process's open files, including
>the "/"
>and "." symlinks in each process), then mount --move so the visibility hides it,
>then teach the kernel that "overmounted" lets lazy unmounts go. (Which it
>_might_ already do if the reference count falls to 0 because of "." and "/"
>leaving, although you'd have to make sure no other open file descriptors
>referenced it in your current namespace from /dev entries and just plain
>inherited filehandles...)
>
>But it seems doable?
You can unmount child mounts, sure, but if your root is rootfs, you
can't unmount it. The aim of this change is to make unmounting the host
root more convenient, by ensuring there is a blank rootfs below it.
>Lemme guess, the child does something like:
>
>for (i = 0; i<32767; i++) close(i);
>mkdir("sub/blah")
>mount("sub", "sub", "tmpfs");
>chdir("sub");
>umount(".", MNT_DETACH);
>chroot("blah");
>chdir("../../../..");
>chroot(".")
>readdir();
I have told you already that the chroot, chdir .. trick does not work
within containers. This code snippet has nothing to do with this patch
or this discussion at all.
>
>> This would not occur with pivot_root.
>
>It would not occur if the filesystem had been removed from the current mount
>namespace by other means, either. (Or if the kernel got the test right, which
>you're saying it does now.)
You can't remove it if it's rootfs. If your host's root is rootfs, as it
would be if you run directly from initramfs, you can't unmount it.
>Back when I was trying to get /dev/console to work properly with
>init=/bin/sh I
>didn't ask for a kconfig option to make the initial task be PID 2 with PID 1
>(and its magic signal-blocking properties and inability to call reboot() and
>session ID 0 and orphaned zombies reparented to it so on) being a second idle
>task. I figured out how to make my userspace do the right thing.
>
>This really seems like an "init starts as PID 2" solution, which is a weird
>thing to have a dedicated build-time kernel config option for.
I am afraid I do not understand this point at all. This change is not
requesting anything like this.
>You want to use rootfs but not use rootfs. It's very zen.
No, I want an initramfs, I just don't want it on rootfs. If I mount a
block device as root, that wouldn't be rootfs either.
>If you'd like to go "I want to have the kernel automatically mount a
>freshly
>formatted ext4 filesystem and then have the kernel extract the cpio archive into
>that instead, because it's more convenient for me to have the kernel do this for
>me than doing it in userspace"...
Not what this patch is suggesting.
The kernel already supports mounting a block device, in lieu of a
userspace init doing it, via the root= parameter. Are you suggesting its
support of that is inappropriate?
>If you fix the mount --move issues you could bind mount your current
>directory,
>cd $PWD, and then --move mount it to /
>
>I think you're addressing the wrong issue.
No, I'm fixing the fact that container runtimes want to pivot_root, and
can't when running directly from initramfs, as the initramfs is extracted
to rootfs.
>As with the trivial patch to have init= launch PID 2, the cognitive
>load of
>explaining to people WHY the config option exists and when somebody might have
>wanted to use it in the kernel you're trying to forward port in a design you
>inherited from somebody who isn't around anymore is itself a form of design
>complexity. It's a special case _adding_ a design wart.
Is it possible that the reasoning for why this is important would be much
more apparent to people in the container space?
I disagree that this introduces a design wart. On the contrary, I
believe it adds the option to make initramfs more consistent with the
other root setup methods:
1. kernel mounted block device via root= results in a nominally empty
rootfs and a block device on top with the root file system in it.
pivot_root can be used.
2. initramfs which performs some init, mounts a block device, then
switches root to it. This results in a nominally empty rootfs and a
block device on top with the root file system in it. pivot_root can be
used.
3. initramfs which contains an embedded root filesystem to be used
directly. Results in a rootfs with the root file system in it with
nothing on top. pivot_root cannot be used.
This patch simply changes point 3, to be more in line with the others:
3. initramfs which contains an embedded root filesystem to be used
directly. Would result in a nominally empty rootfs with tmpfs on top
with the root filesystem in it. pivot_root can be used.
>You're asking the kernel to create a second empty ramfs or tmpfs
>instance, and
>instead of checking an existing argument like "root=tmpfs" you're changing the
>kernel's behavior with a dedicated config option that does a specific thing.
If we want to set this behaviour via a kernel parameter, we can do that
:)
>What happens if somebody sets that config option and then goes
>root=/dev/sda2
>
>In theory making the rootfs directory neither readable nor executable to the PID
>you've mapped root to in the container is another approach.
Incorrect, please reread the patch.
>Your "one and only time" is an awful lot of embedded systems. It's a
>common use
>case. The point of having initramfs be tmpfs is you can _persist_ in using it as
>your root filesystem without an errant log file filling up memory and hanging
>the system (a problem with ramfs).
We are not in disagreement on this point. In fact the irony is that we
are actually in strong agreement here. Leaving the root in the initramfs
_is_ a useful and commonly used flow - this change simply means to make
that flow more compatible with container runtimes.
>Whatever your container stuff is
Love it or hate it, lots of stuff runs on containers now. The kernel has
made plenty of changes to better facilitate containers.
>it won't be
>able to run on any of those existing systems that keeps initramfs populated with
>files. So again why have it be a config option: if you're going to change the
>behavior, change it for EVERYBODY or your stuff will need a special kernel
>configuration in order to run.
Sure, if we think it's more appropriate to just do this always (not via a
build option) or gated behind a kernel parameter, we can do that.
>
>Heck, Debian populates initramfs with a cpio.gz file as part of its normal boot
>process:
>
>$ zcat /boot/initrd.img-4.19.0-22-amd64 | toybox file -
>-: ASCII cpio archive (SVR4 with no CRC)
>
>Has done for over a decade. You're saying debian can clean up but your stuff
>can't be expected to.
No, that is not what I'm saying.
>If you want a NULLFS, that is a design change. Maybe ask for the design
>change
>so THAT can be discussed. Your config option seems like a partial fix at best,
>and the kernel has enough abandoned partial fixes needing legacy support.
We already have what you call a nullfs. It's defined in
init/noinitramfs.c and usr/default_cpio_list, and it's what you get if
you call switch_root within the initramfs.
In most runtime situations, rootfs _is_ what you'd call a nullfs. So
yes, sure: I want a nullfs when my root filesystem lives inside the
initramfs too. Like I'd get if I'm mounting with root= and like I'd get
if initramfs calls switch_root.
--
Emily Shepherd
Red Coat Development Limited