* [PATCH v2] initramfs: Support unpacking directly to tmpfs
@ 2023-11-29  9:00 Emily Shepherd
  2023-11-29 16:38 ` Rob Landley
  0 siblings, 1 reply; 10+ messages in thread
From: Emily Shepherd @ 2023-11-29  9:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: initramfs, Thomas Strömberg, Anders Björklund,
	Giuseppe Scrivano, Al Viro, Christoph Hellwig, Jens Axboe,
	Rob Landley

For systems which run directly from initramfs, it is not possible to use
pivot_root without first changing root. This is because of the
intentional design choice that rootfs, which is where initramfs is
unpacked to, cannot be unmounted.

pivot_root is an important feature for creating containers and the
alternative (mounting the new root over the top of the old with MS_MOVE
and then calling chroot) is not favoured by most container runtimes
[1][2] as it does not completely remove the host system mounts from the
mount namespace.

The general workaround, when running directly from initramfs, is to
have init mount a new tmpfs, copy everything out of rootfs, and then
switch_root [3][4]. This is only required when running directly from the
initramfs as all other methods of acquiring a root device (having the
kernel mount a root device directly via the root= parameter, or using
initramfs to mount and then switch_root to a new root) leave an empty
rootfs at the top of the mount stack.

This commit adds a new build option - EMPTY_ROOTFS, available when
initrd/initramfs is enabled. When selected, rather than unpacking the
inbuilt / bootloader provided initramfs directly into rootfs, the kernel
will mount a new tmpfs/ramfs over the top of the rootfs and unpack to
that instead, leaving an empty rootfs at the top of the stack. This
removes the need to have init copy everything as a workaround.

[1]: https://github.com/opencontainers/runc/blob/95a93c132cf179a017312e22a954f137e8237c4e/man/runc-create.8.md?plain=1#L27
[2]: https://github.com/containers/crun/blob/8e8d7972f738f28294cd5c16091d136ca278759e/crun.1.md?plain=1#L103
[3]: https://github.com/tinycorelinux/Core-scripts/blob/dbb24bf42a0a9935b18e66a0b936266b2244251b/init#L13
[4]: https://github.com/kubernetes/minikube/blob/master/deploy/iso/minikube-iso/board/minikube/x86_64/rootfs-overlay/init#L6

Signed-off-by: Emily Shepherd <emily@redcoat.dev>
---
v2:
  - Fix formatting error in patch
  - Update overmount_rootfs() return type to void
  - cc relevant kernel devs based on blame of init files
  - cc OCI container runtime devs who have supported no-pivot options
  - cc small / embedded linux devs who have mitigated this by copying 
    root
  - tweak to changelog: clarify why no-pivot is not recommended
  - tweak to changelog: include missing reference to minikube's rootfs 
    mitigation
---
 init/Kconfig     | 13 +++++++++++++
 init/do_mounts.c | 23 +++++++++++++++++++++++
 init/do_mounts.h |  6 ++++++
 init/initramfs.c |  4 ++++
 4 files changed, 46 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index 6d35728b94b2b..bf15bd08abdc2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1299,6 +1299,19 @@ config BLK_DEV_INITRD
 
 if BLK_DEV_INITRD
 
+config EMPTY_ROOTFS
+	bool "Mount initramfs over empty rootfs"
+	help
+		Normally initramfs is unpacked directly into the rootfs. When this
+		option is enabled, initramfs is instead unpacked into a tmpfs
+		mounted on top of a permanently empty rootfs.
+
+		This is mostly useful for embedded operating systems, running
+		directly from initramfs, which need to make use of pivot_root (for
+		example systems running containers).
+
+		If unsure, say N.
+
 source "usr/Kconfig"
 
 endif
diff --git a/init/do_mounts.c b/init/do_mounts.c
index 5dfd30b13f485..7cf106cf976db 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -514,3 +514,26 @@ void __init init_rootfs(void)
 		(!root_fs_names || strstr(root_fs_names, "tmpfs")))
 		is_tmpfs = true;
 }
+
+#ifdef CONFIG_EMPTY_ROOTFS
+void __init overmount_rootfs(void) {
+	int err;
+
+	err = init_mkdir("/root", 0700);
+	if (err != 0)
+		goto out;
+
+	err = init_mount("rootfs", "/root", is_tmpfs ? "tmpfs" : "ramfs", 0, NULL);
+	if (err != 0)
+		goto out;
+
+	init_chdir("/root");
+	init_mount(".", "/", NULL, MS_MOVE, NULL);
+	init_chroot(".");
+
+	return;
+
+out:
+	printk(KERN_WARNING "Failed to mount over rootfs\n");
+}
+#endif /* CONFIG_EMPTY_ROOTFS */
diff --git a/init/do_mounts.h b/init/do_mounts.h
index 15e372b00ce70..3a261f1ae0d64 100644
--- a/init/do_mounts.h
+++ b/init/do_mounts.h
@@ -41,3 +41,9 @@ static inline bool initrd_load(char *root_device_name)
 	}
 
 #endif
+
+#ifdef CONFIG_EMPTY_ROOTFS
+void __init overmount_rootfs(void);
+#else
+static inline void __init overmount_rootfs(void) { return; }
+#endif
diff --git a/init/initramfs.c b/init/initramfs.c
index 8d0fd946cdd2b..76525108a39d2 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -19,6 +19,8 @@
 #include <linux/task_work.h>
 #include <linux/umh.h>
 
+#include "do_mounts.h"
+
 static __initdata bool csum_present;
 static __initdata u32 io_csum;
 
@@ -688,6 +690,8 @@ static void __init populate_initrd_image(char *err)
 
 static void __init do_populate_rootfs(void *unused, async_cookie_t cookie)
 {
+	overmount_rootfs();
+
 	/* Load the built in initramfs */
 	char *err = unpack_to_rootfs(__initramfs_start, __initramfs_size);
 	if (err)
-- 2.42.0



* Re: [PATCH v2] initramfs: Support unpacking directly to tmpfs
  2023-11-29  9:00 [PATCH v2] initramfs: Support unpacking directly to tmpfs Emily Shepherd
@ 2023-11-29 16:38 ` Rob Landley
  2023-11-29 17:48   ` Emily Shepherd
  0 siblings, 1 reply; 10+ messages in thread
From: Rob Landley @ 2023-11-29 16:38 UTC (permalink / raw)
  To: Emily Shepherd, Andrew Morton
  Cc: initramfs, Thomas Strömberg, Anders Björklund,
	Giuseppe Scrivano, Al Viro, Christoph Hellwig, Jens Axboe

On 11/29/23 03:00, Emily Shepherd wrote:
> For systems which run directly from initramfs, it is not possible to use
> pivot_root without first changing root. This is because of the
> intentional design choice that rootfs, which is where initramfs is
> unpacked to, cannot be unmounted.
> 
> pivot_root is an important feature for creating containers

Because nobody's ever wanted to fix chroot() so mkdir("sub", 0777);
chroot("sub"); chdir("../../../../.."); chroot("."); wouldn't escape it, so they
repurposed a syscall intended to dispose of initial ramdisks that does things
like traverse the process list and close/reopen all the open files pointing to
the old filesystem (including "." and "/") so they point to the new filesystem.

Personally, I always found THAT a bit awkward, but there wasn't a "clean" system
call that _only_ did what the container guys wanted (patch the mount tree). I
would have thought you could use "mount --move . /" to nerf the cd ../../.. but
for some reason it didn't work (I forget why) and nobody wanted to fix that either.

(By the way, I used pivot_root() on ramfs back when it DID move it, which then
allowed you to unmount it, at which point the kernel locked up as the doubly
linked list traversal kept going until they hit the initramfs entry that was
"always there" and thus reliably terminated the list... Yeah, that got fixed,
now the pivot_root returns an error. Being an initramfs early adopter was
"interesting"...)

> and the
> alternative (mounting the new root over the top of the old with MS_MOVE
> and then calling chroot) is not favoured by most container runtimes
> [1][2] as it does not completely remove the host system mounts from the
> mount namespace.
> 
> The general workaround, when running directly from initramfs, is to
> have init mount a new tmpfs, copy everything out of rootfs, and then
> switch_root [3][4].

Which is why I added switch_root to busybox 18 years ago, yes. (I thought I got
the idea from klibc but I'm not finding it in their git repo. I didn't invent
it, there was an existing one somewhere, my 2005 busybox commit comment credits
run_init.c from "kconfig" which can't be right. I did rename it to be more
obviously analogous to pivot_root...)

> This is only required when running directly from the
> initramfs as all other methods of acquiring a root device (having the
> kernel mount a root device directly via the root= parameter, or using
> initramfs to mount and then switch_root to a new root) leave an empty
> rootfs at the top of the mount stack.

If you don't use rootfs you don't have to empty it, yes.

You could use an old-style initrd which would be mounted over the root
filesystem and which you could switch_root away from and then unmount. Then
pivot_root() could actually perform its as-designed function, although last I
checked it wasn't fully container-aware so tended to have fairly awkward global
impact if you ran it inside a container without being VERY careful. (Maybe it's
been fixed since?)

You could also have your own tar.xz in rootfs with a tiny busybox/toybox root to
extract it into the subdir so "cp -ax" didn't have a 2X memory high water mark.
You could even have a little static binary to call so you don't even need a
shell. Off the top of my head:

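#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
  mkdir("blah", 0777);
  mount("newroot", "blah", "tmpfs", 0, "noswap,size=37%,huge=within_size");
  /* Unpack the archive into the new tmpfs; _exit() if the exec fails. */
  if (!fork()) { execlp("tar", "tar", "xpf", "blah.txz", "-C", "blah", (char *)0); _exit(127); }
  else wait(NULL);
  /* Delete the old copies to free their memory. */
  if (!fork()) { execlp("rm", "rm", "blah.txz", "init", "tar", "xzcat", (char *)0); _exit(127); }
  else wait(NULL);
  if (chdir("/blah") || chroot(".") || execl("/init", "init", (char *)0))
    perror("switch to new root");
  for (;;) pause(); /* PID 1 must never exit */
}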

Statically linked against musl-libc that's not likely to be more than 32k, it's
all syscalls. The tar and xzcat binaries are a bit bigger, but not unreasonable
in either busybox or toybox...

Or you could petition to add -x to mv I suppose. I could add it to toybox
tomorrow if you like? (And probably send a patch to Denys for busybox?)

> This commit adds a new build option - EMPTY_ROOTFS, available when
> initrd/initramfs is enabled. When selected, rather than unpacking the
> inbuilt / bootloader provided initramfs directly into rootfs, the kernel
> will mount a new tmpfs/ramfs over the top of the rootfs and unpack to
> that instead, leaving an empty rootfs at the top of the stack. This
> removes the need to have init copy everything as a workaround.

How is it a "workaround"? The userspace tool is as old as initramfs.

Your real complaint seems to be that a single ramfs instance is shared between
container instances, even when the PID 1 init process isn't. What you're
"working around" is incomplete container namespace separation, and you're doing
so by adding yet another kernel config option. You are _adding_ a workaround to
the kernel.

If you still need to complicate the kernel, wouldn't it make more sense to add a
runtime check for rootfstype=redundant or some such, and have _that_ do the
overmount (without needing a config symbol to micromanage a weird corner case
behavior)? If it's _init code it should be freed before launching PID 1...

Rob


* Re: [PATCH v2] initramfs: Support unpacking directly to tmpfs
  2023-11-29 16:38 ` Rob Landley
@ 2023-11-29 17:48   ` Emily Shepherd
  2023-11-29 20:53     ` Rob Landley
  0 siblings, 1 reply; 10+ messages in thread
From: Emily Shepherd @ 2023-11-29 17:48 UTC (permalink / raw)
  To: Rob Landley
  Cc: Andrew Morton, initramfs, Thomas Strömberg,
	Anders Björklund, Giuseppe Scrivano, Al Viro,
	Christoph Hellwig, Jens Axboe

On Wed, Nov 29, 2023 at 10:38:48AM -0600, Rob Landley wrote:
>Because nobody's ever wanted to fix chroot() so mkdir("sub", 0777);
>chroot("sub"); chdir("../../../../.."); chroot("."); wouldn't escape it
>
>I
>would have thought you could use "mount --move . /" to nerf the cd ../../.. but
>for some reason it didn't work (I forget why) and nobody wanted to fix that either.

Actually move mounting the desired new root over the top of the old does 
mitigate the chroot & chdir attack. The main reason, I believe, that the 
runtime maintainers don't like that option is that, despite being 
"inaccessible", the old mount tree still exists in the container's mount 
namespace. This has led to issues such as the sysfs/procfs issue [1]: In 
summary, that attack worked by a process within a container creating a 
new userns, and giving that CAP_SYS_ADMIN.

In such cases, the kernel had protections in place to ensure that, even 
with the SYS_ADMIN capability, the process in the new userns wasn't 
allowed to mount proc or sysfs, unless a fully visible mount of 
proc/sysfs already exists in the process' mount namespace.

There was a bug in the kernel's visibility check - it checked if each 
instance of proc/sysfs in the mount namespace had been over mounted, or 
any of its subdirectories had, but forgot to check if any of its root 
directories had. This resulted in the original root's /proc / /sys 
mounts counting as visible, even though they weren't, which allowed the 
child userns to mount a fully unmasked instance and gain access to 
things it shouldn't.

Now, this was fixed in 7e96c1b0e0f495; however, my assumption, and I don't 
want to speak on behalf of all runtime maintainers here, is that the 
advice to prefer pivot_root is because of the increased risk of bugs 
like these. When using pivot_root, the old root is able to be completely 
unmounted from the container's mount namespace after the pivot which, 
from a security perspective, gives better peace of mind.
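
As a rough illustration of that sequence (not taken from any particular
runtime), the pivot-then-unmount step could look like the C sketch below. The
function name pivot_into() is made up for this example. The
pivot_root(".", ".") idiom is the one described in the pivot_root(2) manual
page, and the sketch assumes new_root is already a mount point inside a
private mount namespace. When the caller's root is still rootfs, the
pivot_root call is expected to fail (EINVAL, according to the manual page),
which is the limitation this patch is about.

```c
#define _GNU_SOURCE
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: make new_root the root mount and drop the old one.
 * Returns 0 on success, -1 with errno set on failure.
 */
int pivot_into(const char *new_root)
{
	if (chdir(new_root) < 0)
		return -1;

	/*
	 * glibc provides no pivot_root() wrapper, so go through syscall(2).
	 * With "." for both arguments the old root ends up stacked on top
	 * of the new root at "/".
	 */
	if (syscall(SYS_pivot_root, ".", ".") < 0)
		return -1;

	/* Lazily detach the old root, which is now the top mount at ".". */
	if (umount2(".", MNT_DETACH) < 0)
		return -1;

	return chdir("/");
}
```

After a successful call the old root is no longer part of the caller's mount
namespace, rather than merely hidden underneath the new root.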

There is at least one other fringe exploit that I am aware of when 
running in containers not using pivot root - this involves process 1 
within a container unmounting its root with MNT_DETACH. While this 
doesn't always allow that process itself to break out fully, it does 
allow subsequent calls to exec within the container to leak information 
about the host's root file system. This would not occur with pivot_root.

[1]: https://github.com/opencontainers/runc/pull/1962

>If you don't use rootfs you don't have to empty it, yes.

The point I meant was that this brings the initramfs flow in line with 
the other root approaches: for initrd, kernel handled root= mounts, and 
initramfs switch_root setups, rootfs exists because it has to at the top 
of the stack. For initramfs embedded systems, rootfs exists because it 
is the root - embedded linux actually using the rootfs as a root is the 
outlying behaviour.

>You could use an old-style initrd which would be mounted over the root
>filesystem and which you could switch_root away from and then 
>unmount.

You could, but isn't initramfs a more modern way to pack files than the 
initrd? And is it not reasonable to bring (or at least give the option 
for) the initramfs flow to be a bit more like the initrd flow? (Ie, with 
an empty rootfs).

>pivot_root() could actually perform its as-designed function, although 
>last I
>checked it wasn't fully container-aware so tended to have fairly awkward global
>impact if you ran it inside a container without being VERY careful. (Maybe it's
>been fixed since?)

Most container runtimes that I am aware of would run a container within 
their own mount namespace so pivot_root should be safe from the rest of 
the system's point of view. Indeed pivot_root is the preferred option 
for container runtimes but cannot be used when running directly from 
rootfs.

>
>You could also have your own tar.xz in rootfs with a tiny busybox/toybox root to
>extract it into the subdir so "cp -ax" didn't have a 2X memory high water mark.
>You could even have a little static binary to call so you don't even need a
>shell. Off the top of my head:
>
>void main(void)
>{
>  mkdir("blah", 0777);
>  mount("newroot", "blah", "tmpfs", 0, "noswap,size=37%,huge=within_size");
>  if (!fork()) exec("tar", "tar", "xpC", "blah", "blah.txz", NULL);
>  else wait();
>  if (!fork()) exec("rm", "rm", "blah.txz", "init", "tar", "xzcat", NULL);
>  else wait();
>  if (chdir("/blah") || chroot(".") || exec("/init")) complain_and_hang();
>}
>
>Statically linked against musl-libc that's not likely to be more than 32k, it's
>all syscalls. The tar and xzcat binaries are a bit bigger, but not unreasonable
>in either busybox or toybox...

Yep, I completely get this - and it is a good point. This is definitely 
a gray area on the "what should the kernel do vs what should we let 
userspace init handle". My reasoning for including it in the kernel is 
that all of the userspace init options to handle this (ie untarring 
something, or copying everything over straight away) amount to "double 
zipping" or moving something that the kernel has just extracted. This is 
a bit of a shame to require userspace to do, especially when it is a 
trivial patch to just have the kernel extract the initramfs to where we 
want it in the first place.

>
>Or you could petition to add -x to mv I suppose. I could add it to toybox
>tomorrow if you like? (And probably send a patch to Denys for busybox?)

I'm not sure how adding it to busybox would help - as you have already 
shown, there are existing userspace workarounds (and I referred to two 
others in the patch's changelog: the tiny core linux and minikube init 
examples) so I'm not sure we need more?

>How is it a "workaround"? The userspace tool is as old as initramfs.

Because it takes a thing that has just been extracted and moves it 
somewhere else. That is a workaround for it not being in the right 
place.

>Your real complaint seems to be that a single ramfs instance is shared 
>between
>container instances, even when the PID 1 init process isn't.

Well, when rootfs is empty, it doesn't really matter that it's shared 
with all mount namespaces. My issue isn't with that, it's that the 
embedded initramfs flow is the one and only time that rootfs can't be 
relied upon to be empty.

>What you're
>"working around" is incomplete container namespace separation, and you're doing
>so by adding yet another kernel config option. You are _adding_ a workaround to
>the kernel.

What you are calling incomplete container namespace separation is the 
kernel's inability to unmount rootfs ever? I don't think that's a flaw - 
the logic for it makes perfect sense, you always have a rootfs so that 
you don't accidentally empty the mount tree. What doesn't make sense is 
then using that rootfs for anything more than that "stopper" under a 
"real" root - that's where the problems come in when attempting to swap 
roots for containers.

>If you still need to complicate the kernel, wouldn't it make more sense 
>to add a
>runtime check for rootfstype=redundant or some such, and have _that_ do the
>overmount (without needing a config symbol to micromanage a weird corner case
>behavior)? If it's _init code it should be freed before launching PID 1...

The context that I'm talking about is situations where the init process 
within initramfs doesn't hand over to another init. This is for embedded 
initramfs situations.

I could do another version of the patch to check in the kernel for a 
rootfstype parameter if you like and work off of that rather than a 
build flag? Or would you not want that check within the kernel at all?

-- 
Emily Shepherd

Red Coat Development Limited


* Re: [PATCH v2] initramfs: Support unpacking directly to tmpfs
  2023-11-29 17:48   ` Emily Shepherd
@ 2023-11-29 20:53     ` Rob Landley
  2023-11-30  3:31       ` Emily Shepherd
  0 siblings, 1 reply; 10+ messages in thread
From: Rob Landley @ 2023-11-29 20:53 UTC (permalink / raw)
  To: Emily Shepherd
  Cc: Andrew Morton, initramfs, Thomas Strömberg,
	Anders Björklund, Giuseppe Scrivano, Al Viro,
	Christoph Hellwig, Jens Axboe

On 11/29/23 11:48, Emily Shepherd wrote:
> On Wed, Nov 29, 2023 at 10:38:48AM -0600, Rob Landley wrote:
>>Because nobody's ever wanted to fix chroot() so mkdir("sub", 0777);
>>chroot("sub"); chdir("../../../../.."); chroot("."); wouldn't escape it
>>
>>I
>>would have thought you could use "mount --move . /" to nerf the cd ../../.. but
>>for some reason it didn't work (I forget why) and nobody wanted to fix that either.
> 
> Actually move mounting the desired new root over the top of the old does 
> mitigate the chroot & chdir attack. The main reason, I believe, that the 
> runtime maintainers don't like that option is that, despite being 
> "inaccessible", the old mount tree still exists in the container's mount 
> namespace.

I'm assuming you can do process-local unmounts to prune what you'd be
overmounting? (switch_root didn't _have_ to delete the old initramfs contents,
it did it to free space. Containers similarly would remove privileges as part of
their setup, and that includes mounts the new context shouldn't have.)

Alas, mount separation was implemented in multiple stages with MS_SHARED
predating CLONE_NEWNS, plus fun with --bind mounts and /proc and /sys leaking
various forms of weird (and an explicit refusal to have namespace aware devtmpfs
instances which just seems _sad_), and I'm never confident there isn't some new
thing I'm missing or old thing I've forgotten/missed...

> This has led to issues such as the sysfs/procfs issue [1]: In 
> summary, that attack worked by a process within a container creating a 
> new userns, and giving that CAP_SYS_ADMIN.

From reading the link's summary, it seems like unmounting the old inherited /proc
and /sys before you "mount --move newfs /" would have been a fix?
Dunno, wasn't following it...

> In such cases, the kernel had protections in place to ensure that, even 
> with the SYS_ADMIN capability, the process in the new userns wasn't 
> allowed to mount proc or sysfs, unless a fully visible mount of 
> proc/sysfs already exists in the process' mount namespace.
> 
> There was a bug in the kernel's visibility check - it checked if each 
> instance of proc/sysfs in the mount namespace had been over mounted, or 
> any of its subdirectories had, but forgot to check if any of its root 
> directories had. This resulted in the original root's /proc / /sys 
> mounts counting as visible, even though they weren't, which allowed the 
> child userns to mount a fully unmasked instance and gain access to 
> things it shouldn't.
> 
> Now, this was fixed in 7e96c1b0e0f495; however, my assumption, and I don't 
> want to speak on behalf of all runtime maintainers here, is that the 
> advice to prefer pivot_root is because of the increased risk of bugs 
> like these. When using pivot_root, the old root is able to be completely 
> unmounted from the container's mount namespace after the pivot which, 
> from a security perspective, gives better peace of mind.

Lazy unmount it (which never affects a process's open files, including the "/"
and "." symlinks in each process), then mount --move so the visibility hides it,
then teach the kernel that "overmounted" lets lazy unmounts go. (Which it
_might_ already do if the reference count falls to 0 because of "." and "/"
leaving, although you'd have to make sure no other open file descriptors
referenced it in your current namespace from /dev entries and just plain
inherited filehandles...)

But it seems doable?

> There is at least one other fringe exploit that I am aware of when 
> running in containers not using pivot root - this involves process 1 
> within a container unmounting its root with MNT_DETACH. While this 
> doesn't always allow that process itself to break out fully, it does 
> allow subsequent calls to exec within the container to leak information 
> about the host's root file system.

Lemme guess, the child does something like:

for (i = 0; i < 32767; i++) close(i);
mkdir("sub/blah", 0777);
mount("sub", "sub", "tmpfs", 0, NULL);
chdir("sub");
umount2(".", MNT_DETACH);
chroot("blah");
chdir("../../../..");
chroot(".");
readdir(opendir("."));

> This would not occur with pivot_root.

It would not occur if the filesystem had been removed from the current mount
namespace by other means, either. (Or if the kernel got the test right, which
you're saying it does now.)

> [1]: https://github.com/opencontainers/runc/pull/1962
> 
>>If you don't use rootfs you don't have to empty it, yes.
> 
> The point I meant was that this brings the initramfs flow in line with 
> the other root approaches: for initrd, kernel handled root= mounts, and 
> initramfs switch_root setups, rootfs exists because it has to at the top 
> of the stack. For initramfs embedded systems, rootfs exists because it 
> is the root - embedded linux actually using the rootfs as a root is the 
> outlying behaviour.

Back when I was trying to get /dev/console to work properly with init=/bin/sh I
didn't ask for a kconfig option to make the initial task be PID 2 with PID 1
(and its magic signal-blocking properties and inability to call reboot() and
session ID 0 and orphaned zombies reparented to it and so on) being a second idle
task. I figured out how to make my userspace do the right thing.

This really seems like an "init starts as PID 2" solution, which is a weird
thing to have a dedicated build-time kernel config option for.

>>You could use an old-style initrd which would be mounted over the root
>>filesystem and which you could switch_root away from and then 
>>unmount.
> 
> You could,

So there's already a(nother) fix.

> but isn't initramfs a more modern way to pack files than the 
> initrd?

It's a different archive format. Feed it a squashfs if you're just gonna pivot
away soon, that's created like an archiver.

> And is it not reasonable to bring (or at least give the option 
> for) the initramfs flow to be a bit more like the initrd flow? (Ie, with 
> an empty rootfs).

You want to use rootfs but not use rootfs. It's very zen.

"initramfs" is an extractor to populate "rootfs" before launching PID 1. That
was the design. (From Al Viro I think.)

If you'd like to go "I want to have the kernel automatically mount a freshly
formatted ext4 filesystem and then have the kernel extract the cpio archive into
that instead, because it's more convenient for me to have the kernel do this for
me than doing it in userspace"...

*shrug* It's not my call and I can't predict what the kernel devs will do these
days, but... ow?

>>pivot_root() could actually perform its as-designed function, although 
>>last I
>>checked it wasn't fully container-aware so tended to have fairly awkward global
>>impact if you ran it inside a container without being VERY careful. (Maybe it's
>>been fixed since?)
> 
> Most container runtimes that I am aware of would run a container within 
> their own mount namespace so pivot_root should be safe from the rest of 
> the system's point of view.

The first time I ran pivot_root within a CLONE_NEWNS it chrooted all the
processes in the entire system from / into the new root, because I hadn't run it
in a CLONE_NEWPID namespace and it didn't have a filter for the fact those
processes couldn't SEE the new root because they weren't in its namespace. (I
mean I _think_ that's what happened, the system hung pretty fast and I had to
reboot it.)

Hopefully it's less brittle now, but I haven't retried recently.

> Indeed pivot_root is the preferred option 
> for container runtimes but cannot be used when running directly from 
> rootfs.

If you fix the mount --move issues you could bind mount your current directory,
cd $PWD, and then --move mount it to /.

I think you're addressing the wrong issue.

>>Statically linked against musl-libc that's not likely to be more than 32k, it's
>>all syscalls. The tar and xzcat binaries are a bit bigger, but not unreasonable
>>in either busybox or toybox...
> 
> Yep, I completely get this - and it is a good point. This is definitely 
> a gray area on the "what should the kernel do vs what should we let 
> userspace init handle".

Back in the day I wrote busybox's switch_root and
Documentation/filesystems/ramfs-rootfs-initramfs.txt and I recently had a long
argument with somebody about how my 310 line bash script
(https://github.com/landley/toybox/blob/master/mkroot/mkroot.sh) that builds
Linux systems from source for a dozen architectures and boots them to a shell
prompt under qemu was using (for one of the architectures) a static cpio.gz
linked into the KERNEL IMAGE to populate initramfs and this was just a CRAZY
thing that NOBODY EVER DOES... (No really,
https://landley.net/notes-2023.html#12-05-2023).

So my perspective is apparently a bit skewed. Maybe I'm too close to the problem
to see it...

> My reasoning for including it in the kernel is 
> that all of the userspace init options to handle this (ie untarring 
> something, or copying everything over straight away) amount to "double 
> zipping" or moving something that the kernel has just extracted. This is 
> a bit of a shame to require userspace to do, especially when it is a 
> trivial patch to just have the kernel extract the initramfs to where we 
> want it in the first place.

As with the trivial patch to have init= launch PID 2, the cognitive load of
explaining to people WHY the config option exists and when somebody might have
wanted to use it in the kernel you're trying to forward port in a design you
inherited from somebody who isn't around anymore is itself a form of design
complexity. It's a special case _adding_ a design wart.

Different kernels working different ways with a bunch of special cases replacing
design don't get _better_ over time. "This is what that's for. Except not really..."

>>Or you could petition to add -x to mv I suppose. I could add it to toybox
>>tomorrow if you like? (And probably send a patch to Denys for busybox?)
> 
> I'm not sure how adding it to busybox would help - as you have already 
> shown, there are existing userspace workarounds (and I referred to two 
> others in the patch's changelog: the tiny core linux and minikube init 
> examples) so I'm not sure we need more?

Because mv doesn't extract twice and the memory high water mark isn't
significantly higher. If you cp -a and then rm -r the memory usage high water
mark is "all the files briefly exist twice". With mv across filesystems, only
the largest single file existing twice is the memory usage high water mark.
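
The two high water marks can be put into numbers with a small C sketch. This
only models the behaviour as described above (copy everything and then delete,
versus move one file at a time); the function names are made up for the
example, and real tools and filesystems add overhead that is ignored here.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * cp -a followed by rm -r: nothing is deleted until everything has been
 * copied, so at the peak every file exists twice.
 */
uint64_t peak_copy_then_remove(const uint64_t *sizes, size_t n)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++)
		total += sizes[i];
	return 2 * total;
}

/*
 * mv across filesystems, one file at a time: each file is copied and then
 * unlinked before the next one starts, so only the file currently in
 * flight exists twice. The worst case is the largest file.
 */
uint64_t peak_move_one_by_one(const uint64_t *sizes, size_t n)
{
	uint64_t total = 0, largest = 0;

	for (size_t i = 0; i < n; i++) {
		total += sizes[i];
		if (sizes[i] > largest)
			largest = sizes[i];
	}
	return total + largest;
}
```

For three files of 10, 40 and 25 units, the first approach peaks at 150 units
and the second at 115.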

A couple years after I taught initramfs to mount tmpfs instead of just ramfs
(which was in 2013), somebody came to me needing to force it BACK to ramfs
because their cpio wouldn't extract into tmpfs... because if you don't specify
arguments to tmpfs then the "size=" defaults to 50% of available memory and
their cpio.gz extracted to a little over 60%. (But the system still worked
fine... with ramfs. With tmpfs the cpio.gz extract aborted when the filesystem
refused further writes, and since I hadn't wired up the rootflags= plumbing
(ramfs didn't take any flags) there was no way to tell the tmpfs instance to
allow a larger size.)
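
The arithmetic behind that failure is simple enough to sketch in C. The helper
names below are made up for illustration, and the calculation ignores page
rounding and per-file overhead, so treat it as an estimate rather than what
tmpfs computes internally.

```c
#include <stdbool.h>
#include <stdint.h>

/* Approximate capacity of a tmpfs mounted with size=<percent>%. */
uint64_t tmpfs_limit(uint64_t ram_bytes, unsigned int percent)
{
	return ram_bytes / 100 * percent;
}

/* Would an archive that unpacks to unpacked_bytes fit in that tmpfs? */
bool archive_fits(uint64_t ram_bytes, unsigned int percent,
		  uint64_t unpacked_bytes)
{
	return unpacked_bytes <= tmpfs_limit(ram_bytes, percent);
}
```

With 1000 MiB of RAM, an archive that unpacks to 610 MiB does not fit under
the default 50% limit but would fit if the instance could be mounted with
something like size=70%.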

Alas, that's another one of the patches that I couldn't get the linux-kernel
bureaucracy to notice. I don't think it ever did go upstream. Yeah, in
init/do_mounts.c it looks like root_mount_data is only ever used in
mount_root_generic()'s call to do_mount_root() but not passed through to
rootfs_init_fs_context() and the call to shmem_init_fs_context().

*shrug* The usual...

>>How is it a "workaround"? The userspace tool is as old as initramfs.
> 
> Because it takes a thing that has just been extracted and moves it 
> somewhere else. That is a workaround for it not being in the right 
> place.

You're asking the kernel to create a second empty ramfs or tmpfs instance, and
instead of checking an existing argument like "root=tmpfs" you're changing the
kernel's behavior with a dedicated config option that does a specific thing.

What happens if somebody sets that config option and then goes root=/dev/sda2?

In theory making the rootfs directory neither readable nor executable to the PID
you've mapped root to in the container is another approach. There's a LOT you can
already do in userspace about this...

>>Your real complaint seems to be that a single ramfs instance is shared 
>>between
>>container instances, even when the PID 1 init process isn't.
> 
> Well, when rootfs is empty, it doesn't really matter that it's shared 
> with all mount namespaces.

Yes, that is your "workaround" to the real problem.

> My issue isn't with that, it's that the 
> embedded initramfs flow is the one and only time that rootfs can't be 
> relied upon to be empty.

Your "one and only time" is an awful lot of embedded systems. It's a common use
case. The point of having initramfs be tmpfs is you can _persist_ in using it as
your root filesystem without an errant log file filling up memory and hanging
the system (a problem with ramfs). Whatever your container stuff is, it won't be
able to run on any of those existing systems that keep initramfs populated with
files. So again why have it be a config option: if you're going to change the
behavior, change it for EVERYBODY or your stuff will need a special kernel
configuration in order to run.

Heck, Debian populates initramfs with a cpio.gz file as part of its normal boot
process:

$ zcat /boot/initrd.img-4.19.0-22-amd64  | toybox file -
-: ASCII cpio archive (SVR4 with no CRC)

Has done for over a decade. You're saying debian can clean up but your stuff
can't be expected to.

You want rootfs to be a NULLFS instead of ramfs. You don't seem to want it to
actually _be_ a filesystem. Even with your "fix", containers could communicate
with each _other_ through it if it becomes accessible. If a container can get
access to an empty initramfs and write into it, it can ask/answer the question
"Are there any other containers on this machine running stux24" and then coordinate.

Really seems to me like you're addressing the wrong issue. You want a design
change to Linux, and you're phrasing it as a config option. The design change
has side effects and "support both modes, forever" is the proposed answer to that.

>>What you're
>>"working around" is incomplete container namespace separation, and you're doing
>>so by adding yet another kernel config option. You are _adding_ a workaround to
>>the kernel.
> 
> What you are calling incomplete container namespace separation is the 
> kernel's inability to unmount rootfs ever? I don't think that's a flaw - 
> the logic for it makes perfect sense, you always have a rootfs so that 
> you don't accidentally empty the mount tree. What doesn't make sense is 
> then using that rootfs for anything more than that "stopper" under a 
> "real" root - that's where the problems come in when attempting to swap 
> roots for containers.

If you want a NULLFS, that is a design change. Maybe ask for the design change
so THAT can be discussed. Your config option seems like a partial fix at best,
and the kernel has enough abandoned partial fixes needing legacy support.

>>If you still need to complicate the kernel, wouldn't it make more sense 
>>to add a
>>runtime check for rootfstype=redundant or some such, and have _that_ do the
>>overmount (without needing a config symbol to micromanage a weird corner case
>>behavior)? If it's _init code it should be freed before launching PID 1...
> 
> The context that I'm talking about is situations where the init process 
> within initramfs doesn't hand over to another init. This is for embedded 
> initramfs situations.

If your package grows a dependency on a new kernel symbol, then it can only be
installed into certain kernels, and it's the _explaining_why_ that's the problem
for me.

Embedded initramfs situations are actually quite common in my (admittedly weird)
experience.

For context, my mkroot script above builds systems that boot to a shell prompt
mostly[1] under qemu running out of initmpfs, which means I deal with "system
runs out of rootfs" pretty much every day. I even ship them, extract any of
https://landley.net/bin/mkroot/latest/ and ./run-qemu.sh for example.

I then run https://github.com/landley/toybox/blob/master/mkroot/testroot.sh
against the results which launches them all for basic regression smoketesting
that the network and block devices and so on work (on each new toolchain, linux,
toybox, and qemu version). That part works today.

Next up I'm trying to get it to run the full toybox regression test suite (make
tests) but I need to do more work on the toybox shell for it to be a proper bash
replacement, and then there's a lot of "how _do_ I test insmod, how _do_ I test
ps" design work that I can't properly start until I have a known environment
running as root... that can run the test suite. (The toybox shell isn't quite
finished yet, and nerfing a test suite bash can already run seems
counterproductive. Working on it...)

And THEN I'm trying to get it to build Linux From Scratch in an automated
fashion using the native compilers that scripts/mcm-buildall.sh produces (the
*-native.sqf squashfs images in https://landley.net/bin/toolchains/latest/),
which I actually already DID in a previous life...

Back when I was maintaining busybox, I was working to get Linux From Scratch to
build natively under a busybox-based system built from seven packages (gcc,
binutils, linux, make, busybox, uClibc, bash), and I succeeded (with LFS 6.8):

https://landley.net/aboriginal/about.html
https://github.com/landley/control-images/tree/master/images/lfs-bootstrap

After I got that working, distributions like Alpine Linux built themselves
around busybox, because now a simple busybox-based system can provide a full
build environment you can add arbitrary packages to by building them from
source. Trying to make that work is _why_ I wound up maintaining busybox back in
the day. It's also why I got into initramfs early, because my old "append the
root filesystem and teach lilo to load the initrd image from a file starting at
an offset" hack for getting the kernel and root filesystem into the same file
back in the https://landley.net/aboriginal/old/ days was... well for one thing
the lilo maintainer wouldn't take my "offset=" argument patch upstream, and grub
was already replacing lilo, and I needed a DIFFERENT hack for User Mode Linux...

Now I'm trying to do it all again with toybox instead of busybox, and also with
80% of the earlier project replaced by a 300 line bash script because I figured
out how to do it in a simpler way. Along the way I switched from uClibc to
musl-libc, user mode linux to qemu, from a zisofs root filesystem to initramfs
(and then implemented initmpfs because somebody needed to)...

Anyway, if you're wondering why I popped up in your cc: list... :)

[1]  The exception is the sh2eb board which builds a kernel for actual hardware,
a j-core FPGA with a ROM bootloader that knows how to run a vmlinux but does NOT
know how to load an external cpio.gz the way qemu's built-in bootloader does, so
I statically linked the cpio.gz into the kernel, that's what the BUILTIN=1 in
that build script does. (Which that person I was arguing with kind of bounced
off when he saw it, as against the natural order of things, or something? The
thing is that board had use cases that needed to do a chain-of-custody thing,
the hardware will cryptographically validate the vmlinux it loads but then the
running linux system has to validate anything else it loads and having the
cpio.gz built in to only one "kernel" image saved a step in a small ROM, and the
potential problem of detecting out-of-sync files if there's more than one. New
kernel with old userspace or vice versa, dunno if it's an attack vector but not
going there.)

> I could do another version of the patch to check in the kernel for a 
> rootfstype parameter if you like and work off of that rather than a 
> build flag? Or would you not want that check within the kernel at all?

I think checking root=tmpfs or rootfstype=overlay or whatever user interface
seems natural to you is a better approach, yes.

Adding a special purpose config option that requires a lot of backstory is a
potential additional cognitive load on everyone learning to configure Linux
beyond the "ignore everything you don't recognize" level (let alone modify that
part of the code). The embedded people especially are the ones who have to learn
why they don't need it.

Runtime flags can be flipped later, build time flags you have to get right
before you ship.

I personally care because if you look at (for example)
https://github.com/landley/toybox/blob/master/mkroot/mkroot.sh#L232 everything
it knows about the s390x platform is on those three lines. The KCONF= is CSV
symbols that get expanded into the device-specific CONFIG_BLAH=y (or
CONFIG_BLAH="whatever" if the CSV symbol already has an = in it), which are
appended to the generic config symbols on
https://github.com/landley/toybox/blob/master/mkroot/mkroot.sh#L270 that apply
to all boards (BLK_DEV_LOOP and EXT4_FS and so on), which is expanded into a
miniconfig, ala:

https://landley.net/aboriginal/FAQ.html#dev_miniconfig
https://lwn.net/Articles/161086/

Which is then expanded into a full kernel .config via "make allnoconfig
KCONFIG_ALLCONFIG=mini.conf".
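
As a rough illustration of the CSV expansion described above, here is a minimal
sketch. It is not the code from mkroot.sh; the function name expand_kconf and
the CMDLINE value are assumptions made up for the example.

```shell
# Hypothetical re-creation of the idea, not the actual mkroot.sh code:
# expand "A,B=val" into CONFIG_A=y and CONFIG_B=val, one symbol per line.
expand_kconf() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r sym; do
    [ -z "$sym" ] && continue
    case "$sym" in
      *=*) echo "CONFIG_$sym" ;;     # symbol already carries its own value
      *)   echo "CONFIG_$sym=y" ;;   # bare symbol: just switch it on
    esac
  done
}

# BLK_DEV_LOOP and EXT4_FS are named in the text; CMDLINE is illustrative only.
expand_kconf 'BLK_DEV_LOOP,EXT4_FS,CMDLINE="console=ttyS0"'
```

The resulting lines are what would be written to mini.conf before running the
"make allnoconfig KCONFIG_ALLCONFIG=mini.conf" step.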

Which means that when I'm adding support for a new board, I do look at every
symbol that's set to try to understand what it does and whether it's needed. I
have some tools to help (like
https://github.com/landley/aboriginal/blob/master/more/miniconfig.sh to convert
a big .config file into a miniconfig in an ugly but automated fashion (yank each
line and feed it back through allnoconfig to see if the result changes or not;
if it doesn't change, the line wasn't needed)). A miniconfig is literally just the
list of symbols that you'd have to set if you started from allnoconfig and let
the dependency resolver do its thing.
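
The "yank each line and see if the result changes" loop can be sketched as
below. This is a simplified illustration, not the real miniconfig.sh: the
resolver is passed in as a command, and fake_expand is a made-up stand-in (in
which CONFIG_A=y implies CONFIG_B=y) so the sketch runs without a kernel tree.

```shell
# shrink_config EXPAND FULL: print the lines of file FULL that are still needed.
# EXPAND is any command that reads a candidate miniconfig on stdin and prints
# the full config it resolves to. With a kernel tree that would be something
# built around "make allnoconfig KCONFIG_ALLCONFIG=..."; here it is a stand-in.
shrink_config() {
  expand=$1
  full=$2
  want=$("$expand" < "$full")
  mini=$(cat "$full")
  while IFS= read -r line; do
    trial=$(printf '%s\n' "$mini" | grep -vxF -- "$line" || true)
    if [ "$(printf '%s\n' "$trial" | "$expand")" = "$want" ]; then
      mini=$trial   # dropping the line changed nothing, so it wasn't needed
    fi
  done < "$full"
  printf '%s\n' "$mini"
}

# Made-up resolver for demonstration: selecting A also switches on B.
fake_expand() {
  cfg=$(cat)
  { printf '%s\n' "$cfg"
    case "$cfg" in *CONFIG_A=y*) echo CONFIG_B=y ;; esac
  } | sort -u
}

tmp=$(mktemp)
printf '%s\n' CONFIG_A=y CONFIG_B=y CONFIG_C=y > "$tmp"
shrink_config fake_expand "$tmp"   # CONFIG_B=y is dropped: A already implies it
rm -f "$tmp"
```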

Again, I may be weird. But I mostly hang out in the embedded space, where there
are an awful lot of weird people who do work in this space. (And sadly most of
them literally cannot be PAID to interact with what the linux-kernel community
has become over the past ~15 years, so the viewpoint tends to be a bit
chronically under-represented here.)

Rob


* Re: [PATCH v2] initramfs: Support unpacking directly to tmpfs
  2023-11-29 20:53     ` Rob Landley
@ 2023-11-30  3:31       ` Emily Shepherd
  2023-12-01 22:02         ` Rob Landley
  0 siblings, 1 reply; 10+ messages in thread
From: Emily Shepherd @ 2023-11-30  3:31 UTC (permalink / raw)
  To: Rob Landley
  Cc: Andrew Morton, initramfs, Thomas Strömberg,
	Anders Björklund, Giuseppe Scrivano, Al Viro,
	Christoph Hellwig, Jens Axboe

On Wed, Nov 29, 2023 at 02:53:49PM -0600, Rob Landley wrote:
>I'm assuming you can do process-local unmounts to prune what you'd be
>overmounting? (switch_root didn't _have_ to delete the old initramfs contents,
>it did it to free space. Containers similarly would remove privileges as part of
>their setup, and that includes mounts the new context shouldn't have.)

I think we are talking at cross purposes here, let's get back on track: 
The aim is to do a pivot_root within a container's mount namespace. In 
order to do this, we need at least two layers of "root" in the host 
mount namespaces. If there is just one (ie rootfs), we can't pivot_root, 
because rootfs cannot be removed.

Yes, you can switch_root or chroot or do any number of things, but that 
is not relevant. That is not the way container runtimes tend to work. 
The desirable outcome is pivot_root.

>From reading the link's summary, it seems like "unmount the old 
>inherited /proc
>and /sys before you "mount --move newfs /" seems like it would have been a fix?

It was the patch that was made in the container runtimes at the time, 
yes. This does not change the fact that the _desirable_ path is 
pivot_root.

>Lazy unmount it (which never affects a process's open files, including 
>the "/"
>and "." symlinks in each process), then mount --move so the visibility hides it,
>then teach the kernel that "overmounted" lets lazy unmounts go. (Which it
>_might_ already do if the reference count falls to 0 because of "." and "/"
>leaving, although you'd have to make sure no other open file descriptors
>referenced it in your current namespace from /dev entries and just plain
>inherited filehandles...)
>
>But it seems doable?

You can unmount child mounts, sure, but if your root is rootfs, you 
can't unmount it. The aim of this change is to make unmounting the host 
root more convenient, by ensuring there is a blank rootfs below it.

>Lemme guess, the child does something like:
>
>for (i = 0; i<32767; i++) close(i);
>mkdir("sub/blah")
>mount("sub", "sub", "tmpfs");
>chdir("sub");
>umount(".", MNT_DETACH);
>chroot("blah");
>chdir("../../../..");
>chroot(".")
>readdir();

I have told you already that the chroot, chdir .. trick does not work 
within containers. This code snippet has nothing to do with this patch 
or this discussion at all.

>
>> This would not occur with pivot_root.
>
>It would not occur if the filesystem had been removed from the current mount
>namespace by other means, either. (Or if the kernel got the test right, which
>you're saying it does now.)

You can't remove it if it's rootfs. If your host's root is rootfs, as it 
would be if you run directly from initramfs, you can't unmount it.

>Back when I was trying to get /dev/console to work properly with 
>init=/bin/sh I
>didn't ask for a kconfig option to make the initial task be PID 2 with PID 1
>(and its magic signal-blocking properties and inability to call reboot() and
>session ID 0 and orphaned zombies reparented to it so on) being a second idle
>task. I figured out how to make my userspace do the right thing.
>
>This really seems like an "init starts as PID 2" solution, which is a weird
>thing to have a dedicated build-time kernel config option for.

I am afraid I do not understand this point at all. This change is not 
requesting anything like this.

>You want to use rootfs but not use rootfs. It's very zen.

No, I want an initramfs, I just don't want it on rootfs. If I mount a 
block device as root, that wouldn't be rootfs either.

>If you'd like to go "I want to have the kernel automatically mount a 
>freshly
>formatted ext4 filesystem and then have the kernel extract the cpio archive into
>that instead, because it's more convenient for me to have the kernel do this for
>me than doing it in userspace"...

Not what this patch is suggesting.

The kernel already supports mounting a block device, in lieu of a 
userspace init doing it, via the root= parameter. Are you suggesting its 
support of that is inappropriate?

>If you fix the mount --move issues you could bind mount your current 
>directory,
>cd $PWD, and then --move mount it to /
>
>I think you're addressing the wrong issue.

No, I'm fixing the fact that container runtimes want to pivot_root, and 
can't when running directly from initramfs, as this extracts to rootfs.

>As with the trivial patch to have init= launch PID 2, the cognitive 
>load of
>explaining to people WHY the config option exists and when somebody might have
>wanted to use it in the kernel you're trying to forward port in a design you
>inherited from somebody who isn't around anymore is itself a form of design
>complexity. It's a special case _adding_ a design wart.

Is it possible that the reasoning of why this is important would be much 
more apparent to people in the container space?

I disagree that this introduces a design wart. On the contrary, I 
believe it adds the option to make initramfs more consistent with the 
other root setup methods:

1. kernel mounted block device via root= results in a nominally empty 
rootfs and a block device on top with the root file system in it. 
pivot_root can be used.
2. initramfs which performs some init, mounts a block device, then 
switches root to it. This results in a nominally empty rootfs and a 
block device on top with the root file system in it. pivot_root can be 
used.
3. initramfs which contains an embedded root filesystem to be used 
directly. Results in a rootfs with the root file system in it with 
nothing on top. pivot_root cannot be used.

This patch simply changes point 3, to be more in line with the others:

3. initramfs which contains an embedded root filesystem to be used 
directly. Would result in a nominally empty rootfs with tmpfs on top 
with the root filesystem in it. pivot_root can be used.
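
One rough way to tell which of these cases a running system is in: a minimal
sketch that reads mountinfo-format text and prints the filesystem type of the
mount visible at "/". It assumes the kernel reports the initramfs mount with
type "rootfs" there, which is worth verifying on your own system; root_fstype
is a made-up name.

```shell
# Print the filesystem type of the last mount listed at "/" in mountinfo-format
# input. Fields: id parent dev root mountpoint opts [optional...] - fstype src opts
# Hypothetical helper for illustration only.
root_fstype() {
  awk '$5 == "/" { for (i = 6; i <= NF; i++) if ($i == "-") { t = $(i + 1); break } }
       END { print t }'
}

# Case 3 as described above: only rootfs is mounted at "/".
printf '1 1 0:2 / / rw - rootfs rootfs rw\n' | root_fstype
```

On a live system this would be run as "root_fstype < /proc/self/mountinfo". If
it prints rootfs, the system is in case 3 and pivot_root(2) is documented to
refuse; any other type means something has been mounted on top.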

>You're asking the kernel to create a second empty ramfs or tmpfs 
>instance, and
>instead of checking an existing argument like "root=tmpfs" you're changing the
>kernel's behavior with a dedicated config option that does a specific thing.

If we want to set this behaviour via a kernel parameter, we can do that 
:)

>What happens if somebody sets that config option and then goes 
>root=/dev/sda2
>
>In theory making the rootfs directory neither readable nor executable to the PID
>you've mapped root to in the container is another approach.

Incorrect, please reread the patch.

>Your "one and only time" is an awful lot of embedded systems. It's a 
>common use
>case. The point of having initramfs be tmpfs is you can _persist_ in using it as
>your root filesystem without an errant log file filling up memory and hanging
>the system (a problem with ramfs).

We are not in disagreement on this point. In fact the irony is that we 
are actually in strong agreement here. Leaving the root in the initramfs 
_is_ a useful and commonly used flow - this change simply means to make 
that flow more compatible with container runtimes.

>Whatever your container stuff is

Love it or hate it, lots of stuff runs on containers now. The kernel has 
made plenty of changes to better facilitate containers.

>it won't be
>able to run on any of those existing systems that keeps initramfs populated with
>files. So again why have it be a config option: if you're going to change the
>behavior, change it for EVERYBODY or your stuff will need a special kernel
>configuration in order to run.

Sure, if we think it's more appropriate to just do this always (not via a 
build option) or gated behind a kernel parameter, we can do that.

>
>Heck, Debian populates initramfs with a cpio.gz file as part of its normal boot
>process:
>
>$ zcat /boot/initrd.img-4.19.0-22-amd64  | toybox file -
>-: ASCII cpio archive (SVR4 with no CRC)
>
>Has done for over a decade. You're saying debian can clean up but your stuff
>can't be expected to.

No, that is not what I'm saying.

>If you want a NULLFS, that is a design change. Maybe ask for the design 
>change
>so THAT can be discussed. Your config option seems like a partial fix at best,
>and the kernel has enough abandoned partial fixes needing legacy support.

We already have what you call a nullfs. It's defined in 
init/noinitramfs.c and usr/default_cpio_list, and it's what you get if 
you call switch_root within the initramfs.

In most runtime situations, rootfs _is_ what you'd call a nullfs. So 
yes, sure: I want a nullfs when my root filesystem lives inside the 
initramfs too. Like I'd get if I'm mounting with root= and like I'd get 
if initramfs calls switch_root.

-- 
Emily Shepherd

Red Coat Development Limited


* Re: [PATCH v2] initramfs: Support unpacking directly to tmpfs
  2023-11-30  3:31       ` Emily Shepherd
@ 2023-12-01 22:02         ` Rob Landley
  2023-12-01 23:37           ` Emily Shepherd
  0 siblings, 1 reply; 10+ messages in thread
From: Rob Landley @ 2023-12-01 22:02 UTC (permalink / raw)
  To: Emily Shepherd
  Cc: Andrew Morton, initramfs, Thomas Strömberg,
	Anders Björklund, Giuseppe Scrivano, Al Viro,
	Christoph Hellwig, Jens Axboe

On 11/29/23 21:31, Emily Shepherd wrote:
> On Wed, Nov 29, 2023 at 02:53:49PM -0600, Rob Landley wrote:
>>I'm assuming you can do process-local unmounts to prune what you'd be
>>overmounting? (switch_root didn't _have_ to delete the old initramfs contents,
>>it did it to free space. Containers similarly would remove privileges as part of
>>their setup, and that includes mounts the new context shouldn't have.)
> 
> I think we are talking at cross purposes here, let's get back on track: 
> The aim is to do a pivot_root within a container's mount namespace.

Your aim is. You are starting from your conclusion and reasoning backwards.

> In 
> order to do this, we need at least two layers of "root" in the host 
> mount namespaces. If there is just one (ie rootfs), we can't pivot_root, 
> because rootfs cannot be removed.
> 
> Yes, you can switch_root or chroot or do any number of things, but that 
> is not relevant.

Those approaches have worked fine for many people for many years.

Possibly I'm biased, I thought using pivot_root was silly when I learned
containers were doing it back in 2011:

  https://landley.net/notes-2011.html#02-06-2011

(It's a pity the sourceforge link doesn't work anymore. Cloud rot.)

> That is not the way container runtimes tend to work. 
> The desirable outcome is pivot_root.

Still reasoning backwards from your conclusion. But sure, let's accept for the
sake of argument that you're stuck with a bad legacy decision in userspace. Is a
kernel config option the best way to compensate vs checking root=tmpfs?

>>From reading the link's summary, it seems like "unmount the old 
>>inherited /proc
>>and /sys before you "mount --move newfs /" seems like it would have been a fix?
> 
> It was the patch that was made in the container runtimes at the time, 
> yes. This does not change the fact that the _desirable_ path is 
> pivot_root.

https://en.wikipedia.org/wiki/Proof_by_assertion

>>> This would not occur with pivot_root.
>>
>>It would not occur if the filesystem had been removed from the current mount
>>namespace by other means, either. (Or if the kernel got the test right, which
>>you're saying it does now.)
> 
> You can't remove it if it's rootfs. If your host's root is rootfs, as it 
> would be if you run directly from initramfs, you can't unmount it.

You can't remove rootfs even if you do overmount something on top of it. You can
only delete its contents, it's still a writeable filesystem shared between all
container instances. A userspace tool was provided to delete its contents many
years ago. Said tool has been in util-linux since 2009
(https://github.com/util-linux/util-linux/commit/711ea7307d54) and they copied
the name from https://git.busybox.net/busybox/commit/?id=0f34a821ab99 from 2005.

If "the same rootfs is potentially exposed into every container namespace, which
is a problem even after it's been emptied" is the issue, your patch doesn't
address it. If it's NOT the issue, userspace has been able to achieve the state
you want without your patch since initramfs was created.

Your argument is "I don't want to". Which isn't a deal-breaker, but is the
context in which I'm reacting to the design change.

>>Back when I was trying to get /dev/console to work properly with 
>>init=/bin/sh I
>>didn't ask for a kconfig option to make the initial task be PID 2 with PID 1
>>(and its magic signal-blocking properties and inability to call reboot() and
>>session ID 0 and orphaned zombies reparented to it so on) being a second idle
>>task. I figured out how to make my userspace do the right thing.
>>
>>This really seems like an "init starts as PID 2" solution, which is a weird
>>thing to have a dedicated build-time kernel config option for.
> 
> I am afraid I do not understand this point at all. This change is not 
> requesting anything like this.

If I wanted the kernel to launch an init task for me, but I didn't want that
init task to be on PID 1, and I said that PID 1 has a bunch of strange
properties so it's inappropriate for that to run my code, and I proposed a patch
with a dedicated config option to do that, this would be analogous to the patch
you've presented.

If someone responded "you can call fork()" and I tried to come up with reasons
other than "I don't want to"... at that point we're in a Jim Jefferies routine.

And no this isn't an entirely hypothetical case, this was me scratching my head
for more than a year back in the busybox days, pestering people on IRC while
working out how to write oneit.c:

http://lists.busybox.net/pipermail/busybox/2008-November/067722.html

>>You want to use rootfs but not use rootfs. It's very zen.
> 
> No, I want an initramfs, I just don't want it on rootfs.

There's ways of dealing with that. Have been for a while. I would respond to a
dedicated CONFIG_INIT_ADJUST to have PID 1 be a second idle task and put init=
on PID 2 fairly similarly.

> If I mount a 
> block device as root, that wouldn't be rootfs either.

I am aware of that, yes. I think I suggested it.

You are reasoning backwards from your solution and not thinking about the
design. I don't think you're addressing the real issue.

Right now "separate" container namespaces all share a common rootfs instance.
They do NOT share a common init task, even though before containers that was
universal. You can have your own PID namespace, which starts _empty_.

Your mount tree in a container does NOT start empty. From the clone(2) man page:

  If  CLONE_NEWNS  is  set,  the  cloned child is started in a new
  mount namespace, initialized with a copy of the namespace of the parent.

Defaulting to having everything in it and removing what you don't want to keep
is very different from what PID or UID namespaces do, and is causing you
problems. Doing a chroot is basically an overmount, the other mount points are
still there in your tree and accessible if you try hard enough, and rootfs is
common to all containers. Mitigating this requires cleanup work that isn't
always even possible to fully do (ala rootfs actually being used, which does
happen a lot today and it's always accessible if a static process forking its
own mount namespace does enough umounts, which can then act as a
cifs/nfs/9p/rsync server out to the parent or some such).

Extending the kernel to have a CLONE_NEWROOTFS where it gets a _new_
ramfs or tmpfs instance, unique to that namespace, at the root of a new empty
mount tree, is the logical fix. There is then design work around "so what API do
you use to populate it" which could range from "the first int below child_stack
is the fd of a cpio.gz to extract into it and then it launches an /init out of
there the way the host linux boots" through "the new child starts suspended ala
vfork/ptrace and then the parent process initializes it and unblocks it" to "the
init task is running the executable from the host context that called clone and
has inherited the existing open filehandles from the host context, although
despite the openat() family being in posix-2008 we sadly don't appear to have a
mountat()...". I dunno. That's design work to properly fix the issue.

You don't want to address the design problem, you want to add a special case
workaround for your current issue. You see doing that as a "design fix". I do not.

>>If you'd like to go "I want to have the kernel automatically mount a 
>>freshly
>>formatted ext4 filesystem and then have the kernel extract the cpio archive into
>>that instead, because it's more convenient for me to have the kernel do this for
>>me than doing it in userspace"...
> 
> Not what this patch is suggesting.

Yes, I've noticed.

My objection is "that may not be the right fix at the design level" and your
response is "my patch didn't implement what you're talking about".

> The kernel already supports mounting a block device, in lieu of a 
> userspace init doing it, via the root= parameter. Are you suggesting its 
> support of that is inappropriate?
> 
>>If you fix the mount --move issues you could bind mount your current 
>>directory,
>>cd $PWD, and then --move mount it to /
>>
>>I think you're addressing the wrong issue.
> 
> No, I'm fixing the fact that container runtimes want to pivot_root, and 
> can't when running directly from initramfs, as this extracts to rootfs.

Which you can already do with switch_root, with initrd, with a static binary
calling mount/mv/chroot, and probably other ways none of which would be you
exclusively reasoning backwards from your conclusion. You don't seem to want to
do that, and only want to talk about your solution not talk about the problem.

Fine. Moving on. I still think a dedicated CONFIG entry is a bad way to do the
silly thing. Specifying the silly thing on the kernel command line seems less bad.

>>As with the trivial patch to have init= launch PID 2, the cognitive 
>>load of
>>explaining to people WHY the config option exists and when somebody might have
>>wanted to use it in the kernel you're trying to forward port in a design you
>>inherited from somebody who isn't around anymore is itself a form of design
>>complexity. It's a special case _adding_ a design wart.
> 
> Is it possible that the reasoning of why this is important would be much 
> more apparent to people in the container space?

Are you suggesting I don't understand because I'm not "one of us"?

Personally I think the problem is more likely that I _predate_ you, and had a
small hand in creating this entire area, and thus don't have the same hardwired
assumptions about this being the only possible way it could ever have been done.

In 2011 while working the OpenVZ booth at Scale along with Kir Kolyshkin (the
OpenVZ userspace maintainer, we were coworkers when I did a contract at
Parallels), I came up with the phrase "chroot on steroids" as part of my 30
second pitch explaining the conceptual difference between virtualization and
containerization to attendees who stopped at the booth to ask what we offered.
(I wrote down "chroot on steroids" back in
https://landley.net/notes-2011.html#19-04-2011 and got con crud at scale in
https://landley.livejournal.com/53863.html . Yeah, sometimes I do marketing on
the side. Startups need everybody to do everything. Containers were just SO COOL
and nobody KNEW about them back then.) I also wrote various documentation like
https://landley.net/lxc/ while adding container support to things like
CIFS in commit f1d0c998653f.

The team I worked with at Parallels was porting container technology from
OpenVZ into Linux, which is how vanilla Linux got containers. (The line of
development dated back to a Russian bank moving from mainframes to linux in 1999
and writing in-house mainframe-like extensions to Linux. I asked Kir, who had
worked for said bank before the tech was spun out. Don't ask me the details of
the 12 year old conversation, but he's still around if you want to ask him. He
works at Red Hat these days.)

Later I attended the https://github.com/Fewbytes/rubber-docker talk live at
linuxconf.au in 2017 and have collected links like
https://blog.lizzie.io/linux-containers-in-500-loc.html ever since. I've meant
to add basic container support to toybox but it's been fairly far down on my
todo list, partly because I talked the android developers' ears off about
containers back when they merged toybox in 2015 (ala
http://lists.landley.net/pipermail/toybox-landley.net/2015-February/015135.html)
and a year later they wrote minijail (ala https://lwn.net/Articles/700557/) so
they've got their own plumbing now that I'd have to be compatible with. And once
I implemented unshare+nsenter in toybox I've mostly just done stuff like
"sudo env -i USER=root TERM=linux SHELL=/bin/bash LANG=$LANG
PATH=/bin:/sbin:/usr/bin:/usr/sbin unshare -Cimnpuf chroot debootstrap" which is
usually good enough for my personal quick and dirty use cases.

I watched the rise of docker, was in the audience at ELC when the systemd guys
announced rocket, still think the initial cgroup filesystem had a better design
than cgroup2 (it could NEST, and why they don't properly instance devtmpfs in
containers I still don't understand). I vaguely followed the multiple variants
of flatpak that emerged because Ulrich Drepper hated static linking and too many
people swallowed that line of nonsense. I remember when /proc/self/exe being
inherited from the host became an exploit vector (an area I had previous
interest in for different reasons ala https://lkml.org/lkml/2017/9/12/175)...

But mostly what I was paying attention to was the checkpoint/restore stuff in
case that was reasonable to implement in toybox, largely because of prior
interest: back in 2002
https://lkml.iu.edu/hypermail/linux/kernel/0206.2/0610.html and
https://lkml.iu.edu/hypermail/linux/kernel/0206.2/0835.html got reblogged a lot
by various lwn.net competitors, so people would ask _me_ about it and it was so
nice when we finally got plumbing that let people _implement_ it. I mean OpenVZ
already had live migration working great at the 2011 SCALE demo, that's what Kir
was showing people at the other end of the table (and what I handed them off to
once I got them spun up on the basic "containers are cool and it's not the same
as VMs, all this 'cloud' stuff they're building from beowulf should pivot from
VM to container ASAP"), but it took a LOT of redesign to chip off pieces and get
them into vanilla. (For one thing OpenVZ had added a bunch of new syscalls and
Linus wanted to use synthetic filesystems instead.)

You can argue that I'm _out_of_date_ if you like. (Probably true, on most
things. Jack of all trades, master of none. Usually I know who to ask.) But I
have some familiarity with, and reason to care about, both container
infrastructure and Linux early boot.

> I disagree that this introduces a design wart. On the contrary, I 
> believe it adds the option to make initramfs more consistent with the 
> other root setup methods:

I think it's silly, and I think having a build-time CONFIG option dedicated
to doing something silly is not ideal.

Checking for "root=tmpfs" to trigger the silly thing seems less bad to me,
although I note that init/do_mounts.c function init_rootfs() already _is_
checking for that (and there's a pending patch to tweak it), so... be aware.

> 1. kernel mounted block device via root= results in a nominally empty 
> rootfs and a block device on top with the root file system in it. 
> pivot_root can be used.
> 2. initramfs which performs some init, mounts a block device, then 
> switches root to it. This results in a nominally empty rootfs and a 
> block device on top with the root file system in it. pivot_root can be 
> used.
> 3. initramfs which contains an embedded root filesystem to be used 
> directly. Results in a rootfs with the root file system in it with 
> nothing on top. pivot_root cannot be used.
> 
> This patch simply changes point 3, to be more in line with the others:

I believe I have basic context for Linux early boot.

>>You're asking the kernel to create a second empty ramfs or tmpfs 
>>instance, and
>>instead of checking an existing argument like "root=tmpfs" you're changing the
>>kernel's behavior with a dedicated config option that does a specific thing.
> 
> If we want to set this behaviour via a kernel parameter, we can do that 
> :)

As a convenience feature, a kernel parameter makes more sense to me. I still
think it's a silly thing to want to do, but it means I don't have to teach
people to fish it out of analogous defconfigs when they're learning to do board
bringup. (The hardware vendor usually has a Linux BSP on some variant of a demo
board, otherwise the people speccing the product wouldn't have selected that
chipset for a Linux project. There are exceptions, but sane management doesn't
throw newbies at them.)

>>Whatever your container stuff is
> 
> Love it or hate it, lots of stuff runs on containers now. The kernel has 
> made plenty of changes to better facilitate containers.

Yes. I was there. Helping make that happen was, briefly, my day job.

>>it won't be
>>able to run on any of those existing systems that keeps initramfs populated with
>>files. So again why have it be a config option: if you're going to change the
>>behavior, change it for EVERYBODY or your stuff will need a special kernel
>>configuration in order to run.
> 
> Sure, if we think its more appropriate to just do this always (not via a 
> build option) or gated behind a kernel parameter, we can do that.

I think doing it unconditionally will break existing users. And be unnecessary
for 95% of the existing users of initramfs.

Heck, I didn't expect the CONFIG_DEVTMPFS_MOUNT patch above would break debian's
init ala https://lkml.iu.edu/hypermail/linux/kernel/1705.2/05813.html because
they had a broken error handling path in their init script that had never
triggered before, and when it did trigger it did something stupid and crashed
the system. And yet that blocked the patch, and then adding a workaround for
debian hit
https://lore.kernel.org/linux-input/d6a5ba05-5de2-24e0-49ae-437058001b37@landley.net/
and even though
https://lkml.iu.edu/hypermail/linux/kernel/1709.2/02400.html has almost
certainly happened by now the patch is still maintained out of tree...

Also, my inittmpfs patches in 2013 only used tmpfs instead of ramfs under
specific circumstances, and left it as ramfs in others (mostly determined by the
root= and rootfstype= arguments). Despite that, I got email from somebody who
switched their system to tmpfs and had it break because the cpio.gz they were
extracting ate about 60% of the kernel's memory (which worked fine in their use
case) and the tmpfs mounts default to size=50%, so it worked for ramfs but not
tmpfs. And since ramfs takes no arguments rootflags= was never wired up to be
passed through there to override the size=, and I _still_ haven't bothered to do
it (the bug reporter went back to ramfs, and it's a static variable in
init/do_mounts.c and exporting it over through rootfs_init_fs_context() requires
touching header files so wasn't a 5 minute job... If somebody else wants to,
feel free.)
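
The size mismatch described above comes down to simple arithmetic: tmpfs
defaults to size=50% of RAM, so an archive that unpacks to about 60% of memory
fits in ramfs (which has no limit) but not in a default tmpfs. A minimal sketch
of that check as a shell function follows; the function name and the kB figures
are illustrative assumptions, not anything taken from the kernel.

```shell
# fits_default_tmpfs MEM_TOTAL_KB UNPACKED_KB
# Hypothetical helper: succeeds (exit 0) if an unpacked archive of
# UNPACKED_KB would fit under tmpfs's default size=50% limit on a
# machine with MEM_TOTAL_KB of RAM. ramfs has no such limit.
fits_default_tmpfs() {
    mem_total_kb=$1
    unpacked_kb=$2
    limit_kb=$((mem_total_kb / 2))   # tmpfs default: half of RAM
    [ "$unpacked_kb" -le "$limit_kb" ]
}

# Example: 1 GiB of RAM, archive unpacking to roughly 60% of it.
if ! fits_default_tmpfs 1048576 629146; then
    echo "unpacked archive exceeds the default tmpfs size limit"
fi
```

On a real system the first argument could come from the MemTotal line of
/proc/meminfo, which is reported in kB.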

>>Heck, Debian populates initramfs with a cpio.gz file as part of its normal boot
>>process:
>>
>>$ zcat /boot/initrd.img-4.19.0-22-amd64  | toybox file -
>>-: ASCII cpio archive (SVR4 with no CRC)
>>
>>Has done for over a decade. You're saying debian can clean up but your stuff
>>can't be expected to.
> 
> No, that is not what I'm saying.

That's the part I don't understand. It _seems_ like what you were saying. Not
"this hasn't been working fine for everyone else for the past 15 years already",
but "I think it should have been designed a different way 20 years ago, and
would like to change it to match my opinion".

You're not unlocking a new capability you can't do without this patch. You just
think it should always have worked differently than it does.

I've submitted a few convenience patches myself over the years. Commit
595a22acee26 eventually made it in, and I've maintained
https://lkml.iu.edu/hypermail/linux/kernel/2302.2/05597.html out of tree since
https://lkml.iu.edu/hypermail/linux/kernel/1606.2/05686.html . I am not
conceptually opposed to convenience patches. (Nor am I the final decision maker
here, by the way...)

I'm just not seeing how the old way is _hard_ when the tool for it is in
util-linux (as I said, I may be biased), and I don't think this addresses the
underlying issue of the same rootfs being globally visible in all containers
when the same init task isn't (and is similarly "kernel panics if this exits"
important).

>>If you want a NULLFS, that is a design change. Maybe ask for the design 
>>change
>>so THAT can be discussed. Your config option seems like a partial fix at best,
>>and the kernel has enough abandoned partial fixes needing legacy support.
> 
> We already have what you call a nullfs. It's defined in 
> init/noinitramfs.c and usr/default_cpio_list, and its what you get if 
> you call switch_root within the initramfs.

I mean a filesystem type that maintains no state and _cannot_ be written to,
which your patch doesn't get because the rootfs that's there _can_ still be
written to if you try.

Hiding stuff in undermounts is classic 1970's student shenanigans, at best
you're making it less obvious. The ramfs instance that's always mounted and
never written to is still mapped into every container namespace, and the fact it
_can_ be written to is non-obvious.

Note: that's basically what Linux had before initramfs was invented. The root=
mount was mandatory and there was nothing under it. There _being_ a rootfs
instead was a design decision made 20 years ago by other people, which you seem
to be trying to revert.

> In most runtime situations, rootfs _is_ what you'd call a nullfs.

Where "most" apparently does not include the category of "routers", which are
one of the most numerous types of Linux system in the world. (According to
https://www.bitdefender.com/blog/hotforsecurity/current-routers-use-eol-linux-kernel-chock-full-vulnerabilities/
91% of the routers they tested ran linux, and just the residential router market was
$11 billion in 2023, which assuming an average sale price of $200 is about 55 million
units. The PC market shipped 68 million units but most of those run windows or
macos.)

LOTS of embedded people have used the existing initramfs, and it's accumulated a
BUNCH of weirdness over the years. Did you know you can concatenate multiple
cpio.gz files and the kernel loader will accept them as one big archive? Except
gnu cpio adds runs of NUL bytes to the end, which broke the parser at one point,
and then the android guys used a tool that reproduced the gnu behavior, and then
I got added to the bug report...

http://lists.landley.net/pipermail/toybox-landley.net/2021-April/028464.html

Every one of those archives was extracted into the initramfs we have now.
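
As a sketch of the concatenation trick mentioned above: since the kernel
accepts several cpio.gz archives back to back, combining them is plain
byte-wise concatenation. The function and file names below are made up for
illustration.

```shell
# combine_initramfs OUTPUT INPUT...
# Hypothetical helper: byte-wise concatenate several compressed cpio
# archives into one file. No repacking is done; the inputs are simply
# appended in the order given.
combine_initramfs() {
    out=$1
    shift
    cat "$@" > "$out"
}

# Usage (file names are placeholders):
#   combine_initramfs combined.cpio.gz base.cpio.gz extras.cpio.gz
```

Each input can still be inspected on its own beforehand, for example with
"zcat base.cpio.gz | cpio -t".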

There have also been multiple threads about adding xattr support to cpio and the
initramfs extractor (without which selinux bringup is WAY more awkward, and yes
I get cc'd on them ala
https://lkml.iu.edu/hypermail/linux/kernel/1905.2/06559.html), which so far keep
petering out without resolution...

You are not the first person to use this plumbing. "Everybody _really_ wants
what I think it should always have been like, but nobody's mentioned it in the
past 20 years" is a strange position to take. Earlier you said "the fact that
the desirable path is" as a universal statement rather than a personal opinion.
Desirable to who? Judged as "fact" by who?

> So 
> yes, sure: I want a nullfs when my root filesystem lives inside the 
> initramfs too.

You didn't understand what I meant by nullfs, I tried to clarify above.

> Like I'd get if I'm mounting with root= and like I'd get 
> if initramfs calls switch_root.

I.E. like you can already get in multiple ways from userspace without this patch.

A special root=tmpfs or similar workaround to change the behavior is small and
ignorable enough that I personally wouldn't object to it, if it didn't break
anything else. I still don't think it's the right approach, but you do you.

(For one thing, I can just NOT document "root=tmpfs" and it's invisible, and
doesn't cause a problem. Where CONFIG_THINGY is a thing people see in menuconfig
and I have to explain why they DON'T need it. In a previous life I was (again
briefly) linux-kernel documentation maintainer, and still produce a lot of it in
other contexts. My "oh goddess, not another variant to explain during early
boot" reaction may be unique, I dunno, but... sigh. It is not entirely without
cost.)

Rob


* Re: [PATCH v2] initramfs: Support unpacking directly to tmpfs
  2023-12-01 22:02         ` Rob Landley
@ 2023-12-01 23:37           ` Emily Shepherd
  2023-12-02  5:40             ` Rob Landley
  0 siblings, 1 reply; 10+ messages in thread
From: Emily Shepherd @ 2023-12-01 23:37 UTC (permalink / raw)
  To: Rob Landley
  Cc: Andrew Morton, initramfs, Thomas Strömberg,
	Anders Björklund, Giuseppe Scrivano, Al Viro,
	Christoph Hellwig, Jens Axboe

On Fri, Dec 01, 2023 at 04:02:50PM -0600, Rob Landley wrote:
>You are reasoning backwards from your solution and not thinking about 
>the
>design. I don't think you're addressing the real issue.
>
>Right now "separate" container namespaces all share a common rootfs instance.
>They do NOT share a common init task, even though before containers that was
>universal. You can have your own PID namespace, which starts _empty_.
>
>Your mount tree in a container does NOT start empty. From the clone(2) man page:
>
>  If  CLONE_NEWNS  is  set,  the  cloned child is started in a new
>  mount namespace, initialized with a copy of the namespace of the parent.
>
>Defaulting to having everything in it and removing what you don't want to keep
>is very different from what PID or UID namespaces do, and is causing you
>problems. Doing a chroot is basically an overmount, the other mount points are
>still there in your tree and accessible if you try hard enough, and rootfs is
>common to all containers. Mitigating this requires cleanup work that isn't
>always even possible to fully do (ala rootfs actually being used, which does
>happen a lot today and it's always accessible if a static process forking its
>own mount namespace does enough umounts, which can then act as a
>cifs/nfs/9p/rsync server out to the parent or some such).
>
>Logically, extending the kernel to have a CLONE_NEWROOTFS where it gets a _new_
>ramfs or tmpfs instance, unique to that namespace, at the root of a new empty
>mount tree, is the logical fix. There is then design work around "so what API do
>you use to populate it" which could range from "the first int below child_stack
>is the fd of a cpio.gz to extract into it and then it launches an /init out of
>there the way the host linux boots" through "the new child starts suspended ala
>vfork/ptrace and then the parent process initializes it and unblocks it" to "the
>init task is running the executable from the host context that called clone and
>has inherited the existing open filehandles from the host context, although
>despite the openat() family being in posix-2008 we sadly don't appear to have a
>mountat()...". I dunno. That's design work to properly fix the issue.
>
>You don't want to address the design problem, you want to add a special case
>workaround for your current issue. You see doing that as a "design fix". I do not.

I think this is a good point - I definitely agree that the weird 
hackiness that runtimes have to do to set up their mount namespaces 
properly is suboptimal.

The hypothetical CLONE_NEWROOTFS that you suggest is a superior 
approach - not only would it better do what containers actually want, 
it would also do it with fewer syscalls and less flapping!

As an aside: I take your point RE rootfs being shared. The general 
concern is normally that information from the host might leak if 
containers can read the host root, so sharing an empty rootfs is less of 
a concern, but again the theoretical case of information sharing between 
containers by writing to the shared rootfs is an interesting one too.

>Fine. Moving on. I still think a dedicated CONFIG entry is a bad way to do the
>silly thing. Specifying the silly thing on the kernel command line seems less bad.
>
>Checking for "root=tmpfs" to trigger the silly thing seems less bad to 
>me,
>although I note that init/do_mounts.c function init_rootfs() already _is_
>checking for that (and there's a pending patch to tweak it), so... be aware.

My original reasoning for having it as a build option was that running 
directly from initramfs is often something that's done if you're 
embedding the initramfs to create a unified kernel. As a result, it is 
something that you'd only really care to turn on or off at build time.

Having said that, I have no strong opinion on that.

>That's the part I don't understand. It _seems_ like what you were 
>saying. Not
>"this hasn't been working fine for everyone else for the past 15 years already",
>but "I think it should have been designed a different way 20 years ago, and
>would like to change it to match my opinion".

I have to say I struggle to understand where to go from here... as I 
said above, I do like the CLONE_NEWROOTFS suggestion (and it was 
actually something I was batting around for my own project) but that 
feels like a _way more_ specialised feature.

And now you are saying that apparently we _shouldn't_ make a relatively 
small change to initramfs because it's worked fine for years, but we 
should add a much larger patch to clone() which has also worked for many 
years? I shouldn't question how initramfs works because you were there 
when it was written [1], but we should question all the devs who decided 
on CLONE_NEWNS over CLONE_NEWROOTFS?

I'm not saying we shouldn't, but help me out here - how can I tell 
what's "reasonable" to question and what isn't?

[1]: https://media.tenor.com/lR9rjwXjL50AAAAC/deep-magic-lion.gif

>LOTS of embedded people have used the existing initramfs, and it's accumulated a
>BUNCH of weirdness over the years. Did you know you can concatenate multiple
>cpio.gz files and the kernel loader will accept them as one big 
>archive?

I did, yes.

>Are you suggesting I don't understand because I'm not "one of us"?

No, and I am sorry that I phrased that poorly. I merely meant that there 
are a hell of a lot of different build options and systems within the 
kernel, and it is perhaps not unreasonable to suggest that it is not a 
requirement that everyone intimately understands all of them all of the 
time.

>You are not the first person to use this plumbing. "Everybody _really_ 
>wants
>what I think it should always have been like, but nobody's mentioned it in the
>past 20 years" is a strange position to take. Earlier you said "the fact that
>the desirable path is" as a universal statement rather than a personal opinion.
>Desirable to who? Judged as "fact" by who?

I meant for container runtimes. Most are quite opinionated about not 
doing mount --move . / && chroot(.), strictly preferring pivot_root 
instead.
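
For reference, the pivot_root sequence that runtimes prefer looks roughly like
the following when written as shell. This is a sketch, not taken from any
particular runtime: the function name and the .oldroot directory are
illustrative, it needs root inside its own mount namespace, and the pivot_root
step is the one that fails with EINVAL when the current root is still rootfs.

```shell
# enter_new_root NEWROOT
# Hypothetical helper: switch the calling shell's root to NEWROOT with
# pivot_root, then detach the old root so host mounts are no longer
# reachable from this mount namespace.
enter_new_root() {
    newroot=$1
    [ -d "$newroot" ] || { echo "not a directory: $newroot" >&2; return 1; }

    mount --make-rprivate / &&            # pivot_root refuses shared mounts
    mount --bind "$newroot" "$newroot" && # new root must be a mount point
    cd "$newroot" &&
    mkdir -p .oldroot &&
    pivot_root . .oldroot &&              # old root now appears at /.oldroot
    umount -l /.oldroot &&                # lazily detach the old root
    rmdir /.oldroot
}

# Usage (path is a placeholder, run inside "unshare -m" as root):
#   enter_new_root /srv/container-root
```

By contrast, the "mount --move . / && chroot ." route leaves the old mounts
underneath the new root instead of detaching them.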

-- 
Emily Shepherd

Red Coat Development Limited


* Re: [PATCH v2] initramfs: Support unpacking directly to tmpfs
  2023-12-01 23:37           ` Emily Shepherd
@ 2023-12-02  5:40             ` Rob Landley
  2023-12-02 23:27               ` Emily Shepherd
  0 siblings, 1 reply; 10+ messages in thread
From: Rob Landley @ 2023-12-02  5:40 UTC (permalink / raw)
  To: Emily Shepherd
  Cc: Andrew Morton, initramfs, Thomas Strömberg,
	Anders Björklund, Giuseppe Scrivano, Al Viro,
	Christoph Hellwig, Jens Axboe

On 12/1/23 17:37, Emily Shepherd wrote:
> I have to say I struggle to understand where to go from here... as I 
> said above, I do like the CLONE_NEWROOTFS suggestion (and it was 
> actually something I was batting around for my own project) but that 
> feels like a _way more_ specialised feature.
> 
> And now you are saying that apparently we _shouldn't_ make a relatively 
> small change to initramfs because it's worked fine for years, but we 
> should add a much larger patch to clone() which has also worked for many 
> years?

No, "the perfect is the enemy of the good" applies. Blocking a small fix to
force a large fix isn't always reasonable.

I just dislike adding special cases. Right now when you root=/dev/sda1 you
specify what to overmount on rootfs at runtime, so if you're going to overmount
something _else_ that seems the layer to do it at. A CONFIG_BLAH=y option to do
a similar thing (changing the semantics from a different area entirely) seems
painful, especially as a workaround for a self-inflicted issue in a userspace
package.

I would very much like the kernel to get _simpler_ over time. It isn't, and it
won't, and eventually we'll start over with
https://www.youtube.com/watch?v=Ce1pMlZO_mI or something. But I can at least try
to push back a bit to _slow_ the descent. (Unix lasted about 1 billion seconds,
which rolled over Saturday September 8, 2001. The second billion seconds of the
unix clock rolls over Tuesday, May 17 2033, which roughly coincides with Linus
hitting retirement age. Not a profound observation, just a frame of reference...)

Posix outlived Unix v7 and System V and so on because it defined an interface
that could have its implementation swapped out. Posix is a TERRIBLE standard
because if you implement a posix-only system it can't boot (no "init" command)
and can't access filesystems (no "mount" command), but it's the subset people
could agree upon. (Politics: they needed something with holes big enough to
drive OS/360 and Windows NT through or else the big boys couldn't get federal
procurement contracts while FIPS-151-2 was in force if they didn't nominally
comply. Even 1980's Apple came out with A/UX, no really:
https://www.youtube.com/watch?v=nwrTTXOg-KI )

A multiple-choice interface is harder to get test coverage on in a single
implementation, let alone an IETF-style bakeoff where
https://docs.freebsd.org/en/books/handbook/linuxemu/ and
https://learn.microsoft.com/en-us/windows/wsl/ and
https://9to5google.com/2021/02/12/google-fuchsia-os-android-linux-programs-starnix/
and so on all agree on a documented set of interfaces that can run the same
code. There may be some API pruning once everybody young enough to remember when
"the GPL" was a single thing (instead of Samba and Linux being unable to share
code even though they implement two ends of the same protocol and are both GPL)
has aged out of the productive flow, at which point you may not be able to _pay_
enough younguns to touch the modern equivalent of cobol...

I say this as someone who has written a gnu-compatible sed implementation
TWICE, once in busybox, once in toybox. And lamented for a standard that
actually MEANS something both times. I have long threads with Bash maintainer
Chet Ramey about weird corner cases of Bash
(http://lists.landley.net/pipermail/toybox-landley.net/2023-June/029616.html)
because I'm implementing a bash compatible shell from scratch in toybox, and the
closest I have to a "standard" is the bash man page which does not always
document what bash actually _does_. (Alas, Chet keeps FIXING things I bring up,
which he considers progress and I consider making bash a moving target...)

Anyway, this sort of thing tends to be on my mind a lot. If you assume an
ABI/API is gonna get extracted from this with a new implementation stuck under
it someday (as has happened before), "which bits will definitely get pruned but
probably cause collateral damage" is a question that comes up. I expect "a
minimal host system capable of running containers" to be a fairly EARLY cloning
target...

> I shouldn't question how initramfs works because you were there 
> when it was written [1], but we should question all the devs who decided 
> on CLONE_NEWNS over CLONE_NEWROOTFS?

Oh no, please question it. Question everything.

And I only started paying attention to this one a little _after_ it was written.
Early adopter, not author. Reported various bugs, wrote the documentation I'd
wanted to read, genericized the userspace tooling a bit... But it was Al Viro's
baby.

Speaking of which, all the http://www.uwsg.iu.edu/ links in said docs still work
if you switch them to https://lkml.iu.edu/ and leave the rest under it. I should
push a patch, but the linux development community chased all the hobbyists who
used to fix that sort of thing away at least ten years ago, sometime before
https://lwn.net/Articles/563578/ so nothing that doesn't affect Red IBM Hat's
bottom line really gets addressed in vanilla these days. They just sort of
linger if it's not worth billable hours for a career engineer to do on the
clock, run through Jira, and check off on the spreadsheet in the standup. Those
links have been broken for YEARS. I fixed my local copy. People occasionally
email me and I tell them the update. But nobody's tried to push a patch through
the signed-off-by in triplicate with the 47 files in Documentation/process
including 873 lines of submitting-patches.rst and a 24 step submit-checklist.rst
which sort of assume you've read contribution-maturity-model.rst and "The
lifecycle of a patch" section out of 2.Process.rst and...

At least until the network admin running kernel.org gets his way and closes down
the open mailing list, replaced by one that only approved people are allowed to
join:

https://social.kernel.org/objects/9b3adb80-4198-4c86-abbd-aa3c58700975

And then they stop taking patches by email:

https://social.kernel.org/objects/fbda91b8-f865-4ee5-9a40-22a2c70479f4

*shrug* See above about me waiting to see what replaces all this when it rolls
to a stop...

> I'm not saying we shouldn't, but help me out here - how can I tell 
> what's "reasonable" to question and what isn't?

Everything is reasonable to question. Not always helpful, but reasonable.

> I merely meant that there 
> are a hell of a lot of different build options and systems within the 
> kernel, and it is perhaps not unreasonable to suggest that it is not a 
> requirement that everyone intimately understands all of them all of the 
> time.

I'm weird enough to still _try_. At least in the parts common to the systems I'm
building on a dozen different architectures. (_Everybody_ has to go through
early boot.)

I'm currently trying to get vanilla u-boot, linux, and devuan debootstrap to run
on the orange pi 3b because I don't trust anything that keeps its repo on
"huaweicloud" to _not_ have spyware in it because Xi Who Must Be Obeyed ordered
it so. The hardware was put together by some very nice engineers, who seem to
have pushed support upstream into the various vanilla projects, so I _should_ be
able to get all-vanilla to work on this. (Unlike raspberry pi which is still
binary blobs as far as the bootloader can see and a forked kernel.) But in order
to build a fully capable u-boot for this board I need an or1k cross-compiler
because the power controller needs firmware, which they provide the source code
to but somebody actually made an openrisc ASIC (really!) to control the power,
so you need to compile it with an or1k cross compiler to make the firmware to
load into it, and if u-boot doesn't initialize this hardware, the Linux kernel
it hands off to can't suspend or reboot the board from software:

  https://github.com/u-boot/u-boot/blob/master/board/sunxi/README.sunxi64#L64

The problem is, if I get distracted by that, and then go "hey, hexagon finally
has qemu-system emulation now" (ala
https://github.com/quic/toolchain_for_hexagon/commit/8a8923bd6c6a) and so on, if
I don't come back to other projects for a couple releases stuff's bit-rotted
behind my back and I have to bisect and reverse engineer it.

The change under discussion here is a case where explaining the design context
behind this distinction, let alone the decision to change it, is multiple
minutes for a domain expert to unpack the backstory for you, and hours if not
days to pick apart yourself. It changes what the design IS. I personally already
_know_ (some of?) the backstory, but I don't expect other people to, and really
don't look forward to having to document it.

>>You are not the first person to use this plumbing. "Everybody _really_ 
>>wants
>>what I think it should always have been like, but nobody's mentioned it in the
>>past 20 years" is a strange position to take. Earlier you said "the fact that
>>the desirable path is" as a universal statement rather than a personal opinion.
>>Desirable to who? Judged as "fact" by who?
> 
> I meant for container runtimes. Most are quite opinionated about not 
> doing mount --move . / && chroot(.), strictly preferring pivot_root 
> instead.

Indeed. They want to start with an empty mount tree, and they don't want to
umount all the stuff they inherited. It's an understandable desire, but
repurposing pivot_root for this was not exactly an elegant solution, as this
thread is just one aspect of.

People get so stuck defending a solution they forget what the problem was.

Rob


* Re: [PATCH v2] initramfs: Support unpacking directly to tmpfs
  2023-12-02  5:40             ` Rob Landley
@ 2023-12-02 23:27               ` Emily Shepherd
  0 siblings, 0 replies; 10+ messages in thread
From: Emily Shepherd @ 2023-12-02 23:27 UTC (permalink / raw)
  To: Rob Landley
  Cc: Andrew Morton, initramfs, Thomas Strömberg,
	Anders Björklund, Giuseppe Scrivano, Al Viro,
	Christoph Hellwig, Jens Axboe

On Fri, Dec 01, 2023 at 11:40:37PM -0600, Rob Landley wrote:
>No, "the perfect is the enemy of the good" applies. Blocking a small 
>fix to
>force a large fix isn't always reasonable.
>
>I just dislike adding special cases. Right now when you root=/dev/sda1 you
>specify what to overmount on rootfs at runtime, so if you're going to overmount
>something _else_ that seems the layer to do it at. A CONFIG_BLAH=y option to do
>a similar thing (changing the semantics from a different area entirely) seems
>painful, especially as a workaround for a self-inflicted issue in a userspace
>package.

Understood.

>At least until the network admin running kernel.org gets his way and closes down
>the open mailing list, replaced by one that only approved people are allowed to
>join:
>
>https://social.kernel.org/objects/9b3adb80-4198-4c86-abbd-aa3c58700975

Haha wow, I have to say: reading the LKML discussion about this is so 
surreal. Lots of people are suggesting features that have been standard 
in all of the git hosting platforms for years like they are new. I 
really don't understand the reluctance to move to one of the existing 
platforms. Most support outputting plaintext patches, or raising / 
responding to patches via email for people who like that flow. Seems 
like a win win, as it would come with all the extra patch / pull request 
management for free.

>I'm weird enough to still _try_. At least in the parts common to the 
>systems I'm
>building on a dozen different architectures. (_Everybody_ has to go through
>early boot.)

Fair point :)

>I'm currently trying to get vanilla u-boot, linux, and devuan 
>debootstrap to run
>on the orange pi 3b.

Nice. My current project started life on the raspberry pi :) I wanted to 
run containers, but quickly became frustrated at all the extra stuff 
that Raspberry Pi OS was running, which was slowing everything down for 
no reason, so I went on a crusade to make a much more minimal system - 
it looks like there were actually some similarities with mkroot! Great 
minds think alike, I suppose :) Anyway, I have now moved over to support 
amd64 too as I realised what I'd built could boot to kubernetes faster 
than AWS' tailored images can, so looking at options there.

>(Unlike raspberry pi which is still
>binary blobs as far as the bootloader can see and a forked kernel.)

Gah, the weird binary blobs really, really bug me on the RPI. Not least 
because the whole point of the Pis was meant to be a learning / 
exploration tool, so having a "just trust me bro" blackbox of a 
bootloader is so absurd.

>The change under discussion here is a case where explaining the design context
>behind this distinction, let alone the decision to change it, is multiple
>minutes for a domain expert to unpack the backstory for you, and hours if not
>days to pick apart yourself. It changes what the design IS. I personally already
>_know_ (some of?) the backstory, but I don't expect other people to, and really
>don't look forward to having to document it.

It looks like this patch isn't going to go anywhere - I'll keep it in my 
own tree for the moment, as it is useful for me, but may play around 
with the CLONE_NEWROOTFS idea if I have time - certainly would be 
interesting to see how easy it would be to create proper independent mount 
namespaces (cue: something random falling over!).

-- 
Emily Shepherd

Red Coat Development Limited


* Re: [PATCH v2] initramfs: Support unpacking directly to tmpfs
@ 2023-12-19 19:22 Askar Safin
  0 siblings, 0 replies; 10+ messages in thread
From: Askar Safin @ 2023-12-19 19:22 UTC (permalink / raw)
  To: emily; +Cc: initramfs, rob

Hi, Emily.

I propose this solution: at the very beginning of your initramfs's init, do
the equivalent of this:

mkdir root
mount --bind / root
cd root
mount --move . /
chroot .

And then do everything else.

This will create a second mount of the initramfs. Everything will look
the same, but pivot_root will work.

Also, you can do this: create a file named, say, "preinit" and put it
in the initramfs. Write the code above to this file and put
"exec /init" at the end of it. Of course, "preinit" could be written in
shell, C or any other language. Add "rdinit=/preinit" to the kernel
command line. The kernel will then execute preinit first; preinit will
duplicate the initramfs mount and then execute the initramfs's actual
init.
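
As a rough illustration of the preinit idea above, here is a minimal
sketch that writes a shell "preinit" into an initramfs staging directory
at build time. Like the suggestion itself, it is untested on a real boot.
The STAGING path is a hypothetical name, not something from this thread,
and the sketch assumes the initramfs ships /bin/sh plus mount and chroot
tools that understand --bind and --move. One deliberate adaptation: the
chroot(8) command needs a program to run, so the "chroot ." step and the
final "exec /init" are combined into a single line.

```shell
#!/bin/sh
# Sketch only: generate a "preinit" script for an initramfs.
# STAGING is a hypothetical staging directory (an assumption, adjust it
# to wherever your initramfs contents are assembled).
STAGING="${STAGING:-./initramfs-staging}"
mkdir -p "$STAGING"

cat > "$STAGING/preinit" <<'EOF'
#!/bin/sh
# Untested: duplicate the initramfs mount so that pivot_root can work
# later, then hand over to the real init.
mkdir -p /root
mount --bind / /root
cd /root
mount --move . /
# Adaptation: chroot(8) requires a command, so the chroot step and
# "exec /init" are merged here. Arguments passed by the kernel are
# forwarded to the real init.
exec chroot . /init "$@"
EOF
chmod +x "$STAGING/preinit"
```

With the file packed into the initramfs, booting with "rdinit=/preinit"
should run it before the real /init, as described above.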

I didn't test this, but I'm nearly sure it will work. If you want, I
can test this.

You can also put "rdinit=/preinit" in CONFIG_CMDLINE. As far as I
understand, CONFIG_CMDLINE will be merged with the command line provided
by the bootloader, but I'm not sure. You can also link a small initramfs
containing /preinit into the kernel image. Again, as far as I understand,
it will be merged with the initramfs provided by the bootloader. Thus, a
kernel with CONFIG_CMDLINE and a linked-in initramfs containing /preinit
will behave very similarly to a kernel with your patch. Of course, the
initial mount will not be empty, but I think this is a minor point.

Of course, instead of the "mount --bind" trick you can achieve the same
thing with the "cp + rm" solution.

Also, I remember seeing a patch similar to yours on the Linux mailing
lists. It was rejected, too. If you want, I can try to find it.

I suggest the solution described above. But let me also provide some
alternatives. You can implement a patch similar to yours, but one that
works unconditionally. This would solve the problem once and for all.
Ideally the initial mount would be nullfs, as suggested by Rob, i.e. a
file system which has no state at all, to make sure containers cannot
exchange data. I think such a filesystem is easy to create. All
operations would be no-ops. Look at the "fs" directory in the kernel
tree and write something similar.

Another way is to make pivot_root work with the initial mount. I think
this will be hard.

Note that Rob Landley is absent from the MAINTAINERS file (
https://elixir.bootlin.com/linux/latest/source/MAINTAINERS ), so he
doesn't decide whether a patch will be accepted. (I don't decide
either. I'm not a kernel developer; I just happened to find this thread
and decided to answer.)

If you ever want to start a new discussion or send a new patch, then
please send it to LKML, not to initramfs@vger.kernel.org . As you can
see here: https://lore.kernel.org/initramfs/ , new discussions are
started only about once a month at initramfs@vger.kernel.org , so I
think very few people will see your message.

Feel free to ask me any questions.

-- 
Askar Safin



Thread overview: 10+ messages
2023-11-29  9:00 [PATCH v2] initramfs: Support unpacking directly to tmpfs Emily Shepherd
2023-11-29 16:38 ` Rob Landley
2023-11-29 17:48   ` Emily Shepherd
2023-11-29 20:53     ` Rob Landley
2023-11-30  3:31       ` Emily Shepherd
2023-12-01 22:02         ` Rob Landley
2023-12-01 23:37           ` Emily Shepherd
2023-12-02  5:40             ` Rob Landley
2023-12-02 23:27               ` Emily Shepherd
2023-12-19 19:22 Askar Safin
