linux-fsdevel.vger.kernel.org archive mirror
* [PATCH] mnt: add support for non-rootfs initramfs
@ 2020-03-05 19:35 Ignat Korchagin
  2020-03-05 20:21 ` Al Viro
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Ignat Korchagin @ 2020-03-05 19:35 UTC (permalink / raw)
  To: viro, linux-fsdevel, linux-kernel; +Cc: Ignat Korchagin, kernel-team

The main need for this is to support container runtimes on stateless Linux
systems (pivot_root system call from initramfs).

Normally, the task of initramfs is to mount and switch to a "real" root
filesystem. However, on stateless systems (booting over the network) it is just
convenient to have your "real" filesystem as initramfs from the start.

This, however, breaks different container runtimes, because they usually use
pivot_root system call after creating their mount namespace. But pivot_root does
not work from initramfs, because initramfs runs from rootfs, which is the root
of the mount tree and can't be unmounted.

One can solve this problem from userspace, but it is much more cumbersome. We
either have to create a multilayered archive for initramfs, where the outer
layer creates a tmpfs filesystem, unpacks the inner layer into it, switches
root and properly cleans up the old rootfs; or we have to use the keepinitrd
kernel cmdline option, unpack initramfs to rootfs, run a script to create our
target tmpfs root, unpack the same initramfs there, switch root to it and again
properly clean up the old root. The second option unpacks the same archive
twice and also wastes memory, because the kernel keeps the compressed initramfs
image indefinitely.

With this change we can ask the kernel (by specifying nonroot_initramfs kernel
cmdline option) to create a "leaf" tmpfs mount for us and switch root to it
before the initramfs handling code, so initramfs gets unpacked directly into
the "leaf" tmpfs with rootfs being empty and no need to clean up anything.

Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
---
 fs/namespace.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index 85b5f7bea82e..a1ec862e8146 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3701,6 +3701,49 @@ static void __init init_mount_tree(void)
 	set_fs_root(current->fs, &root);
 }
 
+#if IS_ENABLED(CONFIG_TMPFS)
+static int __initdata nonroot_initramfs;
+
+static int __init nonroot_initramfs_param(char *str)
+{
+	if (*str)
+		return 0;
+	nonroot_initramfs = 1;
+	return 1;
+}
+__setup("nonroot_initramfs", nonroot_initramfs_param);
+
+static void __init init_nonroot_initramfs(void)
+{
+	int err;
+
+	if (!nonroot_initramfs)
+		return;
+
+	err = ksys_mkdir("/root", 0700);
+	if (err < 0)
+		goto out;
+
+	err = do_mount("tmpfs", "/root", "tmpfs", 0, NULL);
+	if (err)
+		goto out;
+
+	err = ksys_chdir("/root");
+	if (err)
+		goto out;
+
+	err = do_mount(".", "/", NULL, MS_MOVE, NULL);
+	if (err)
+		goto out;
+
+	err = ksys_chroot(".");
+	if (!err)
+		return;
+out:
+	printk(KERN_WARNING "Failed to create a non-root filesystem for initramfs\n");
+}
+#endif /* IS_ENABLED(CONFIG_TMPFS) */
+
 void __init mnt_init(void)
 {
 	int err;
@@ -3734,6 +3777,10 @@ void __init mnt_init(void)
 	shmem_init();
 	init_rootfs();
 	init_mount_tree();
+
+#if IS_ENABLED(CONFIG_TMPFS)
+	init_nonroot_initramfs();
+#endif
 }
 
 void put_mnt_ns(struct mnt_namespace *ns)
-- 
2.20.1
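
For illustration only: a minimal userspace sketch, in C, of how an init
program might check whether the machine was booted with the flag. The helper
names cmdline_has_bare_flag() and booted_with_nonroot_initramfs() are
hypothetical and are not part of the patch. The match mirrors
nonroot_initramfs_param() above, which rejects anything that follows the flag
name (such as "=1"). Quoting on the command line is not handled here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if `flag` appears in `cmdline` as a bare, whitespace-delimited
 * word (no "=value" and no trailing characters), 0 otherwise.
 */
int cmdline_has_bare_flag(const char *cmdline, const char *flag)
{
	size_t flen;
	const char *p = cmdline;

	assert(cmdline != NULL && flag != NULL);

	flen = strlen(flag);
	if (flen == 0)
		return 0;

	while (*p) {
		size_t wlen;

		/* Skip the separators in front of the next word. */
		p += strspn(p, " \t\n");
		wlen = strcspn(p, " \t\n");

		/* Only an exact, full-word match counts. */
		if (wlen == flen && strncmp(p, flag, flen) == 0)
			return 1;
		p += wlen;
	}
	return 0;
}

/*
 * Hypothetical helper: read /proc/cmdline and look for the flag.
 * Returns 1 or 0, or -1 if /proc/cmdline cannot be opened.
 */
int booted_with_nonroot_initramfs(void)
{
	char buf[4096];
	size_t n;
	FILE *f = fopen("/proc/cmdline", "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';

	return cmdline_has_bare_flag(buf, "nonroot_initramfs");
}
```

An init program could call booted_with_nonroot_initramfs() early on and fall
back to a userspace workaround when it does not return 1.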



* Re: [PATCH] mnt: add support for non-rootfs initramfs
  2020-03-05 19:35 [PATCH] mnt: add support for non-rootfs initramfs Ignat Korchagin
@ 2020-03-05 20:21 ` Al Viro
  2020-03-05 22:45   ` Ignat Korchagin
  2020-03-05 21:09 ` James Bottomley
  2020-03-11 14:01 ` Ignat Korchagin
  2 siblings, 1 reply; 9+ messages in thread
From: Al Viro @ 2020-03-05 20:21 UTC (permalink / raw)
  To: Ignat Korchagin; +Cc: linux-fsdevel, linux-kernel, kernel-team

On Thu, Mar 05, 2020 at 07:35:11PM +0000, Ignat Korchagin wrote:
> The main need for this is to support container runtimes on stateless Linux
> system (pivot_root system call from initramfs).
> 
> Normally, the task of initramfs is to mount and switch to a "real" root
> filesystem. However, on stateless systems (booting over the network) it is just
> convenient to have your "real" filesystem as initramfs from the start.
> 
> This, however, breaks different container runtimes, because they usually use
> pivot_root system call after creating their mount namespace. But pivot_root does
> not work from initramfs, because initramfs runs from rootfs, which is the root
> of the mount tree and can't be unmounted.
> 
> One can solve this problem from userspace, but it is much more cumbersome. We
> either have to create a multilayered archive for initramfs, where the outer
> layer creates a tmpfs filesystem and unpacks the inner layer, switches root and
> does not forget to properly cleanup the old rootfs. Or we need to use keepinitrd
> kernel cmdline option, unpack initramfs to rootfs, run a script to create our
> target tmpfs root, unpack the same initramfs there, switch root to it and again
> properly cleanup the old root, thus unpacking the same archive twice and also
> wasting memory, because kernel stores compressed initramfs image indefinitely.
> 
> With this change we can ask the kernel (by specifying nonroot_initramfs kernel
> cmdline option) to create a "leaf" tmpfs mount for us and switch root to it
> before the initramfs handling code, so initramfs gets unpacked directly into
> the "leaf" tmpfs with rootfs being empty and no need to clean up anything.

IDGI.  Why not simply this as the first thing from your userland:
	mount("/", "/", NULL, MS_BIND | MS_REC, NULL);
	chdir("/..");
	chroot(".");
3 syscalls and you should be all set...
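
A self-contained sketch of the three calls above, wrapped in a C function
with basic error reporting. The names rebind_root() and rebind_step_name()
are hypothetical. The calls need CAP_SYS_ADMIN and CAP_SYS_CHROOT, so this is
meant to be run early by init rather than by an unprivileged process.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <unistd.h>

/* Step names, indexed by the return value of rebind_root(). */
static const char *const rebind_steps[] = { "ok", "mount", "chdir", "chroot" };

/* Map a rebind_root() return value to the name of the call it refers to. */
const char *rebind_step_name(int step)
{
	assert(step >= 0 && step <= 3);
	return rebind_steps[step];
}

/*
 * Bind-mount "/" over itself, change directory through "/.." as in the
 * suggestion above, then chroot to the current directory.
 * Returns 0 on success, or the 1-based number of the call that failed
 * (errno is left as set by that call).
 */
int rebind_root(void)
{
	if (mount("/", "/", NULL, MS_BIND | MS_REC, NULL) < 0)
		return 1;
	if (chdir("/..") < 0)
		return 2;
	if (chroot(".") < 0)
		return 3;
	return 0;
}
```

A caller could report a failure with something like
perror(rebind_step_name(rc)) when rc is non-zero.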


* Re: [PATCH] mnt: add support for non-rootfs initramfs
  2020-03-05 19:35 [PATCH] mnt: add support for non-rootfs initramfs Ignat Korchagin
  2020-03-05 20:21 ` Al Viro
@ 2020-03-05 21:09 ` James Bottomley
  2020-03-05 22:21   ` Arvind Sankar
  2020-03-11 14:01 ` Ignat Korchagin
  2 siblings, 1 reply; 9+ messages in thread
From: James Bottomley @ 2020-03-05 21:09 UTC (permalink / raw)
  To: Ignat Korchagin, viro, linux-fsdevel, linux-kernel; +Cc: kernel-team

On Thu, 2020-03-05 at 19:35 +0000, Ignat Korchagin wrote:
> The main need for this is to support container runtimes on stateless
> Linux system (pivot_root system call from initramfs).
> 
> Normally, the task of initramfs is to mount and switch to a "real"
> root filesystem. However, on stateless systems (booting over the
> network) it is just convenient to have your "real" filesystem as
> initramfs from the start.
> 
> This, however, breaks different container runtimes, because they
> usually use pivot_root system call after creating their mount
> namespace. But pivot_root does not work from initramfs, because
> initramfs runs from rootfs, which is the root of the mount tree and
> can't be unmounted.

Can you say more about why this is a problem?  We use pivot_root to
pivot from the initramfs rootfs to the newly discovered and mounted
real root ... the same mechanism should work for a container (mount
namespace) running from initramfs ... why doesn't it?

The sequence usually looks like: create and enter a mount namespace,
build a tmpfs for the container in some $root directory then do


    cd $root
    mkdir old-root
    pivot_root . old-root
    mount --make-rprivate /old-root
    umount -l /old-root
    rmdir /old-root

Once that's done you're disconnected from the initramfs root.  The
sequence is really no accident because it's what the initramfs would
have done to pivot to the new root anyway (that's where container
people got it from).


James
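
For readers who prefer C to shell, the sequence quoted above can be sketched
roughly as follows. The function name pivot_into() is hypothetical. The sketch
assumes the caller has already entered a new mount namespace and that
new_root is a mount point, as in the description above. pivot_root is invoked
through syscall(2) here. Whether the pivot_root step succeeds depends on the
mount tree; that is the question this thread is about.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Pivot the calling process into new_root and detach the old root.
 * Returns 0 on success, or the 1-based number of the step that failed
 * (errno is left as set by the failing call).
 */
int pivot_into(const char *new_root)
{
	assert(new_root != NULL);

	/* cd $root */
	if (chdir(new_root) < 0)
		return 1;
	/* mkdir old-root (an already existing directory is tolerated) */
	if (mkdir("old-root", 0700) < 0 && errno != EEXIST)
		return 2;
	/* pivot_root . old-root */
	if (syscall(SYS_pivot_root, ".", "old-root") < 0)
		return 3;
	/* Make sure the working directory is inside the new root. */
	if (chdir("/") < 0)
		return 4;
	/* mount --make-rprivate /old-root */
	if (mount(NULL, "/old-root", NULL, MS_REC | MS_PRIVATE, NULL) < 0)
		return 5;
	/* umount -l /old-root */
	if (umount2("/old-root", MNT_DETACH) < 0)
		return 6;
	/* rmdir /old-root */
	if (rmdir("/old-root") < 0)
		return 7;
	return 0;
}
```

A return value of 3 with errno set to EINVAL would indicate that the
pivot_root call itself was rejected.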



* Re: [PATCH] mnt: add support for non-rootfs initramfs
  2020-03-05 21:09 ` James Bottomley
@ 2020-03-05 22:21   ` Arvind Sankar
  2020-03-05 22:53     ` Ignat Korchagin
  0 siblings, 1 reply; 9+ messages in thread
From: Arvind Sankar @ 2020-03-05 22:21 UTC (permalink / raw)
  To: James Bottomley
  Cc: Ignat Korchagin, viro, linux-fsdevel, linux-kernel, kernel-team

On Thu, Mar 05, 2020 at 01:09:10PM -0800, James Bottomley wrote:
> On Thu, 2020-03-05 at 19:35 +0000, Ignat Korchagin wrote:
> > The main need for this is to support container runtimes on stateless
> > Linux system (pivot_root system call from initramfs).
> > 
> > Normally, the task of initramfs is to mount and switch to a "real"
> > root filesystem. However, on stateless systems (booting over the
> > network) it is just convenient to have your "real" filesystem as
> > initramfs from the start.
> > 
> > This, however, breaks different container runtimes, because they
> > usually use pivot_root system call after creating their mount
> > namespace. But pivot_root does not work from initramfs, because
> > initramfs runs from rootfs, which is the root of the mount tree and
> > can't be unmounted.
> 
> Can you say more about why this is a problem?  We use pivot_root to
> pivot from the initramfs rootfs to the newly discovered and mounted
> real root ... the same mechanism should work for a container (mount
> namespace) running from initramfs ... why doesn't it?

Not sure how it interacts with mount namespaces, but we don't use
pivot_root to go from rootfs to the real root. We use switch_root, which
moves the new root onto the old / using mount with MS_MOVE and then
chroot to it.

https://www.kernel.org/doc/Documentation/filesystems/ramfs-rootfs-initramfs.txt

> 
> The sequence usually looks like: create and enter a mount namespace,
> build a tmpfs for the container in some $root directory then do
> 
> 
>     cd $root
>     mkdir old-root
>     pivot_root . old-root
>     mount --make-rprivate /old-root
>     umount -l /old-root
>     rmdir /old-root
> 
> Once that's done you're disconnected from the initramfs root.  The
> sequence is really no accident because it's what the initramfs would
> have done to pivot to the new root anyway (that's where container
> people got it from).
> 
> 
> James
> 


* Re: [PATCH] mnt: add support for non-rootfs initramfs
  2020-03-05 20:21 ` Al Viro
@ 2020-03-05 22:45   ` Ignat Korchagin
  0 siblings, 0 replies; 9+ messages in thread
From: Ignat Korchagin @ 2020-03-05 22:45 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel, kernel-team

On Thu, Mar 5, 2020 at 8:21 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Thu, Mar 05, 2020 at 07:35:11PM +0000, Ignat Korchagin wrote:
> > The main need for this is to support container runtimes on stateless Linux
> > system (pivot_root system call from initramfs).
> >
> > Normally, the task of initramfs is to mount and switch to a "real" root
> > filesystem. However, on stateless systems (booting over the network) it is just
> > convenient to have your "real" filesystem as initramfs from the start.
> >
> > This, however, breaks different container runtimes, because they usually use
> > pivot_root system call after creating their mount namespace. But pivot_root does
> > not work from initramfs, because initramfs runs from rootfs, which is the root
> > of the mount tree and can't be unmounted.
> >
> > One can solve this problem from userspace, but it is much more cumbersome. We
> > either have to create a multilayered archive for initramfs, where the outer
> > layer creates a tmpfs filesystem and unpacks the inner layer, switches root and
> > does not forget to properly cleanup the old rootfs. Or we need to use keepinitrd
> > kernel cmdline option, unpack initramfs to rootfs, run a script to create our
> > target tmpfs root, unpack the same initramfs there, switch root to it and again
> > properly cleanup the old root, thus unpacking the same archive twice and also
> > wasting memory, because kernel stores compressed initramfs image indefinitely.
> >
> > With this change we can ask the kernel (by specifying nonroot_initramfs kernel
> > cmdline option) to create a "leaf" tmpfs mount for us and switch root to it
> > before the initramfs handling code, so initramfs gets unpacked directly into
> > the "leaf" tmpfs with rootfs being empty and no need to clean up anything.
>
> IDGI.  Why not simply this as the first thing from your userland:
>         mount("/", "/", NULL, MS_BIND | MS_REC, NULL);
>         chdir("/..");
>         chroot(".");
> 3 syscalls and you should be all set...

(sorry for duplicate - didn't press "reply all" the first time)
Container people really prefer pivot_root over chroot due to some
security concerns around chroot. As far as my (probably limited)
understanding goes, while the above approach will make it work, it will
have the same security implications as just using chroot: we trick the
system into performing pivot_root, but we don't get rid of the actual
host root filesystem in the cloned namespace.


* Re: [PATCH] mnt: add support for non-rootfs initramfs
  2020-03-05 22:21   ` Arvind Sankar
@ 2020-03-05 22:53     ` Ignat Korchagin
  0 siblings, 0 replies; 9+ messages in thread
From: Ignat Korchagin @ 2020-03-05 22:53 UTC (permalink / raw)
  To: Arvind Sankar, James Bottomley
  Cc: Al Viro, linux-fsdevel, linux-kernel, kernel-team

On Thu, Mar 5, 2020 at 10:21 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> On Thu, Mar 05, 2020 at 01:09:10PM -0800, James Bottomley wrote:
> > On Thu, 2020-03-05 at 19:35 +0000, Ignat Korchagin wrote:
> > > The main need for this is to support container runtimes on stateless
> > > Linux system (pivot_root system call from initramfs).
> > >
> > > Normally, the task of initramfs is to mount and switch to a "real"
> > > root filesystem. However, on stateless systems (booting over the
> > > network) it is just convenient to have your "real" filesystem as
> > > initramfs from the start.
> > >
> > > This, however, breaks different container runtimes, because they
> > > usually use pivot_root system call after creating their mount
> > > namespace. But pivot_root does not work from initramfs, because
> > > initramfs runs from rootfs, which is the root of the mount tree and
> > > can't be unmounted.
> >
> > Can you say more about why this is a problem?  We use pivot_root to
> > pivot from the initramfs rootfs to the newly discovered and mounted
> > real root ... the same mechanism should work for a container (mount
> > namespace) running from initramfs ... why doesn't it?
>
> Not sure how it interacts with mount namespaces, but we don't use
> pivot_root to go from rootfs to the real root. We use switch_root, which
> moves the new root onto the old / using mount with MS_MOVE and then
> chroot to it.
>
> https://www.kernel.org/doc/Documentation/filesystems/ramfs-rootfs-initramfs.txt
>
> >
> > The sequence usually looks like: create and enter a mount namespace,
> > build a tmpfs for the container in some $root directory then do
> >
> >
> >     cd $root
> >     mkdir old-root
> >     pivot_root . old-root
> >     mount --make-rprivate /old-root
> >     umount -l /old-root
> >     rmdir /old-root
> >
> > Once that's done you're disconnected from the initramfs root.  The
> > sequence is really no accident because it's what the initramfs would
> > have done to pivot to the new root anyway (that's where container
> > people got it from).
> >
> >
> > James
> >

Yes, to add to Arvind's point: the above sequence will only work for
"old style" initrd (block ramdisk with some filesystem image on top),
but will not work for the "new style" initramfs (just a disguised
tmpfs). The sequence will fail on "pivot_root" with EINVAL (see
pivot_root(2)). In fact, this patch conceptually tries to get the same
behaviour as with "old style" initrd. Currently, if you use initrd:
1. The kernel will create an empty "dummy" initramfs
2. Create a ramdisk
3. Unpack the FS image into the ramdisk
4. Mount the disk
5. Do switch_root/move etc

So we have the initial mount tree as: rootfs->some_initrd_fs
(and pivot_root works here and you get an empty rootfs by default)

With this option we end up with something similar: rootfs->tmpfs
and rootfs is empty, because the kernel never unpacked anything there.
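
To make the two mount trees easier to tell apart from userspace, here is a
small C sketch that inspects /proc/self/mountinfo. The function names are
hypothetical, and it is an assumption of this sketch (not something stated in
this thread) that a process running directly from initramfs sees a line with
mount point "/" and filesystem type "rootfs" there. The field layout follows
proc(5).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Check one mountinfo line. Layout, per proc(5):
 *   ID parent maj:min root mount-point options [optional...] - fstype source superopts
 * Returns 1 if the line describes a mount at "/" of type "rootfs", else 0.
 */
int mountinfo_line_is_rootfs_root(const char *line)
{
	char mount_point[256];
	char fstype[64];
	const char *sep;

	assert(line != NULL);

	/* Skip fields 1-4; field 5 is the mount point. */
	if (sscanf(line, "%*s %*s %*s %*s %255s", mount_point) != 1)
		return 0;

	/* The filesystem type is the first field after the " - " separator. */
	sep = strstr(line, " - ");
	if (!sep || sscanf(sep + 3, "%63s", fstype) != 1)
		return 0;

	return strcmp(mount_point, "/") == 0 && strcmp(fstype, "rootfs") == 0;
}

/*
 * Scan /proc/self/mountinfo for such a line.
 * Returns 1 if found, 0 if not, -1 if the file cannot be opened.
 */
int root_is_rootfs(void)
{
	char line[4096];
	int found = 0;
	FILE *f = fopen("/proc/self/mountinfo", "r");

	if (!f)
		return -1;
	while (!found && fgets(line, sizeof(line), f))
		found = mountinfo_line_is_rootfs_root(line);
	fclose(f);
	return found;
}
```

A container runtime could use a check like this to decide up front whether it
needs a workaround before calling pivot_root.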


* Re: [PATCH] mnt: add support for non-rootfs initramfs
  2020-03-05 19:35 [PATCH] mnt: add support for non-rootfs initramfs Ignat Korchagin
  2020-03-05 20:21 ` Al Viro
  2020-03-05 21:09 ` James Bottomley
@ 2020-03-11 14:01 ` Ignat Korchagin
  2 siblings, 0 replies; 9+ messages in thread
From: Ignat Korchagin @ 2020-03-11 14:01 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel, linux-kernel

Following up on this: is there a way to move this forward somehow, or
are we stuck with complex userspace code to achieve the same, like
https://git.kernel.org/pub/scm/libs/klibc/klibc.git/tree/usr/kinit/run-init/runinitlib.c,
but for tmpfs?

Just FYI, here is an example of a fix for a security issue, which is
caused by using chroot vs pivot_root in containers:
https://github.com/opencontainers/runc/commit/28a697cce3e4f905dca700eda81d681a30eef9cd

Alternatively, if the use-case is not generic enough, we could keep
the patch to ourselves - just would appreciate some advice/potential
concerns with this approach which we might have overlooked.

Thanks,
Ignat

On Thu, Mar 5, 2020 at 7:35 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
>
> The main need for this is to support container runtimes on stateless Linux
> system (pivot_root system call from initramfs).
>
> Normally, the task of initramfs is to mount and switch to a "real" root
> filesystem. However, on stateless systems (booting over the network) it is just
> convenient to have your "real" filesystem as initramfs from the start.
>
> This, however, breaks different container runtimes, because they usually use
> pivot_root system call after creating their mount namespace. But pivot_root does
> not work from initramfs, because initramfs runs from rootfs, which is the root
> of the mount tree and can't be unmounted.
>
> One can solve this problem from userspace, but it is much more cumbersome. We
> either have to create a multilayered archive for initramfs, where the outer
> layer creates a tmpfs filesystem and unpacks the inner layer, switches root and
> does not forget to properly cleanup the old rootfs. Or we need to use keepinitrd
> kernel cmdline option, unpack initramfs to rootfs, run a script to create our
> target tmpfs root, unpack the same initramfs there, switch root to it and again
> properly cleanup the old root, thus unpacking the same archive twice and also
> wasting memory, because kernel stores compressed initramfs image indefinitely.
>
> With this change we can ask the kernel (by specifying nonroot_initramfs kernel
> cmdline option) to create a "leaf" tmpfs mount for us and switch root to it
> before the initramfs handling code, so initramfs gets unpacked directly into
> the "leaf" tmpfs with rootfs being empty and no need to clean up anything.
>
> Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
> ---
>  fs/namespace.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 47 insertions(+)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 85b5f7bea82e..a1ec862e8146 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -3701,6 +3701,49 @@ static void __init init_mount_tree(void)
>         set_fs_root(current->fs, &root);
>  }
>
> +#if IS_ENABLED(CONFIG_TMPFS)
> +static int __initdata nonroot_initramfs;
> +
> +static int __init nonroot_initramfs_param(char *str)
> +{
> +       if (*str)
> +               return 0;
> +       nonroot_initramfs = 1;
> +       return 1;
> +}
> +__setup("nonroot_initramfs", nonroot_initramfs_param);
> +
> +static void __init init_nonroot_initramfs(void)
> +{
> +       int err;
> +
> +       if (!nonroot_initramfs)
> +               return;
> +
> +       err = ksys_mkdir("/root", 0700);
> +       if (err < 0)
> +               goto out;
> +
> +       err = do_mount("tmpfs", "/root", "tmpfs", 0, NULL);
> +       if (err)
> +               goto out;
> +
> +       err = ksys_chdir("/root");
> +       if (err)
> +               goto out;
> +
> +       err = do_mount(".", "/", NULL, MS_MOVE, NULL);
> +       if (err)
> +               goto out;
> +
> +       err = ksys_chroot(".");
> +       if (!err)
> +               return;
> +out:
> +       printk(KERN_WARNING "Failed to create a non-root filesystem for initramfs\n");
> +}
> +#endif /* IS_ENABLED(CONFIG_TMPFS) */
> +
>  void __init mnt_init(void)
>  {
>         int err;
> @@ -3734,6 +3777,10 @@ void __init mnt_init(void)
>         shmem_init();
>         init_rootfs();
>         init_mount_tree();
> +
> +#if IS_ENABLED(CONFIG_TMPFS)
> +       init_nonroot_initramfs();
> +#endif
>  }
>
>  void put_mnt_ns(struct mnt_namespace *ns)
> --
> 2.20.1
>


* [PATCH] mnt: add support for non-rootfs initramfs
  2021-09-14 17:09 graham
@ 2021-09-14 17:09 ` graham
  0 siblings, 0 replies; 9+ messages in thread
From: graham @ 2021-09-14 17:09 UTC (permalink / raw)
  To: graham, Jonathan Corbet, Alexander Viro
  Cc: Ignat Korchagin, linux-doc, linux-kernel, linux-fsdevel

From: Ignat Korchagin <ignat@cloudflare.com>

The main need for this is to support container runtimes on stateless Linux
systems (pivot_root system call from initramfs).

Normally, the task of initramfs is to mount and switch to a "real" root
filesystem. However, on stateless systems (booting over the network) it is
just convenient to have your "real" filesystem as initramfs from the start.

This, however, breaks different container runtimes, because they usually
use pivot_root system call after creating their mount namespace. But
pivot_root does not work from initramfs, because initramfs runs from
rootfs, which is the root of the mount tree and can't be unmounted.

One workaround is to do:

  mount --bind / /

However, that defeats one of the purposes of using pivot_root in the
cloned containers: getting rid of the host root filesystem, should the code
somehow escape the chroot.

There is a way to solve this problem from userspace, but it is much more
cumbersome:
  * either we have to create a multilayered archive for initramfs, where
    the outer layer creates a tmpfs filesystem, unpacks the inner layer,
    switches root and properly cleans up the old rootfs
  * or we need to use the keepinitrd kernel cmdline option, unpack initramfs
    to rootfs, run a script to create our target tmpfs root, unpack the
    same initramfs there, switch root to it and again properly clean up
    the old root, thus unpacking the same archive twice and also wasting
    memory, because the kernel stores the compressed initramfs image
    indefinitely.

With this change we can ask the kernel (by specifying nonroot_initramfs
kernel cmdline option) to create a "leaf" tmpfs mount for us and switch
root to it before the initramfs handling code, so initramfs gets unpacked
directly into the "leaf" tmpfs with rootfs being empty and no need to
clean up anything.

This also brings the behaviour in line with the older style initrd, where
the initrd is located on some leaf filesystem in the mount tree and rootfs
remains empty.

Co-developed-by: Graham Christensen <graham@determinate.systems>
Signed-off-by: Graham Christensen <graham@determinate.systems>
Tested-by: Graham Christensen <graham@determinate.systems>
Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
---
 .../admin-guide/kernel-parameters.txt         |  9 +++-
 fs/namespace.c                                | 48 +++++++++++++++++++
 2 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 91ba391f9b32..bfbc904ad751 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3517,11 +3517,18 @@
 	nomfgpt		[X86-32] Disable Multi-Function General Purpose
 			Timer usage (for AMD Geode machines).
 
+	nomodule        Disable module load
+
 	nonmi_ipi	[X86] Disable using NMI IPIs during panic/reboot to
 			shutdown the other cpus.  Instead use the REBOOT_VECTOR
 			irq.
 
-	nomodule	Disable module load
+	nonroot_initramfs
+			[KNL] Create an additional tmpfs filesystem under rootfs
+			and unpack initramfs there instead of the rootfs itself.
+			This is useful for stateless systems, which run directly
+			from initramfs, create mount namespaces and use
+			"pivot_root" system call.
 
 	nopat		[X86] Disable PAT (page attribute table extension of
 			pagetables) support.
diff --git a/fs/namespace.c b/fs/namespace.c
index 659a8f39c61a..c639ea9feb66 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -18,6 +18,7 @@
 #include <linux/cred.h>
 #include <linux/idr.h>
 #include <linux/init.h>		/* init_rootfs */
+#include <linux/init_syscalls.h> /* init_chdir, init_chroot, init_mkdir */
 #include <linux/fs_struct.h>	/* get_fs_root et.al. */
 #include <linux/fsnotify.h>	/* fsnotify_vfsmount_delete */
 #include <linux/file.h>
@@ -4302,6 +4303,49 @@ static void __init init_mount_tree(void)
 	set_fs_root(current->fs, &root);
 }
 
+#if IS_ENABLED(CONFIG_TMPFS)
+static int __initdata nonroot_initramfs;
+
+static int __init nonroot_initramfs_param(char *str)
+{
+	if (*str)
+		return 0;
+	nonroot_initramfs = 1;
+	return 1;
+}
+__setup("nonroot_initramfs", nonroot_initramfs_param);
+
+static void __init init_nonroot_initramfs(void)
+{
+	int err;
+
+	if (!nonroot_initramfs)
+		return;
+
+	err = init_mkdir("/root", 0700);
+	if (err < 0)
+		goto out;
+
+	err = init_mount("tmpfs", "/root", "tmpfs", 0, NULL);
+	if (err)
+		goto out;
+
+	err = init_chdir("/root");
+	if (err)
+		goto out;
+
+	err = init_mount(".", "/", NULL, MS_MOVE, NULL);
+	if (err)
+		goto out;
+
+	err = init_chroot(".");
+	if (!err)
+		return;
+out:
+	pr_warn("Failed to create a non-root filesystem for initramfs\n");
+}
+#endif /* IS_ENABLED(CONFIG_TMPFS) */
+
 void __init mnt_init(void)
 {
 	int err;
@@ -4335,6 +4379,10 @@ void __init mnt_init(void)
 	shmem_init();
 	init_rootfs();
 	init_mount_tree();
+
+#if IS_ENABLED(CONFIG_TMPFS)
+	init_nonroot_initramfs();
+#endif
 }
 
 void put_mnt_ns(struct mnt_namespace *ns)
-- 
2.32.0


