initramfs.vger.kernel.org archive mirror
* [RFC] initoverlayfs - a scalable initial filesystem
@ 2023-12-08 17:59 Eric Curtin
  2023-12-09 12:46 ` Luca Boccassi
  2023-12-11  9:57 ` Lennart Poettering
  0 siblings, 2 replies; 49+ messages in thread
From: Eric Curtin @ 2023-12-08 17:59 UTC (permalink / raw)
  To: systemd-devel, initramfs
  Cc: Stephen Smoogen, Yariv Rachmani, Daniel Walsh, Douglas Landgraf

We have been working on a new initial filesystem called initoverlayfs.
It is a new filesystem that provides a more scalable approach to
initial filesystems as opposed to just using initrds. We are writing
this RFC to the systemd and dracut mailing lists (feel free to forward
to UAPI group also) because although this solution works without
changing the code in these projects, it operates in the same area as
systemd, udev, dracut, etc. and uses these tools.

Brief context:
--------------

initoverlayfs by default uses transient overlays rather than tmpfs to
create throwaway filesystems early in the boot sequence.

Why?

An initramfs has to be decompressed and copied to a tmpfs up front
before it can be used. This results in a situation where you end up
paying for every byte in an initrd in boot performance, even the ones
you don't use in a given boot.

This leads to a fear of using languages that produce larger binaries
in early boot, of reusing libraries, etc. In some cases, reimplemented,
minified versions of software components already present in the rootfs
are used instead.

By contrast, initoverlayfs uses erofs (with compression) and
overlayfs for the initial filesystem, so you only pay for the bytes
you actually use.
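
As a rough illustration of the difference, here is a toy cost model in
Python. It is only a sketch: the function names, the throughput figure
and the used fraction are assumptions made up for the example, not
measurements from initoverlayfs.

```python
# Toy cost model. All names and numbers are illustrative assumptions,
# not measurements and not part of initoverlayfs.

def initramfs_cost_ms(image_mib: float, unpack_mib_per_s: float) -> float:
    """Up-front cost: the whole image is decompressed and copied."""
    return image_mib / unpack_mib_per_s * 1000.0


def lazy_cost_ms(image_mib: float, used_fraction: float,
                 read_mib_per_s: float) -> float:
    """Lazy cost: only the part of the image that is read is paid for."""
    if not 0.0 <= used_fraction <= 1.0:
        raise ValueError("used_fraction must be between 0 and 1")
    return image_mib * used_fraction / read_mib_per_s * 1000.0


if __name__ == "__main__":
    throughput = 100.0  # MiB/s, assumed identical for both paths
    used = 0.25         # assumed share of the image touched in one boot
    for size in (32.0, 64.0, 128.0):
        eager = initramfs_cost_ms(size, throughput)
        lazy = lazy_cost_ms(size, used, throughput)
        print(f"{size:6.0f} MiB image: eager {eager:7.1f} ms, lazy {lazy:7.1f} ms")
```

In this model the eager cost grows with the image size regardless of
what a boot touches, while the lazy cost grows only with the bytes read.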

There is also increased pressure from certain industries, like
automotive, to start essential services early in the boot sequence.

Requirements:
-------------

An init system
An initramfs building tool
A device manager
overlayfs

Nothing that you wouldn't find in most Linux distributions today.

Design:
-------

Here is the boot sequence with initoverlayfs integrated. The
mini-initramfs contains just enough to get storage drivers loaded and
storage devices initialized. storage-init is a process that is not
designed to replace init; it does just enough to initialize storage
(it performs a targeted udev trigger on storage), switches to
initoverlayfs as root and then executes init.

```
fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs

fw -> bootloader -> kernel -> storage-init   -> init ----------------->
```
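
The storage-init steps described above can be sketched as a dry-run
plan in Python. This is not the actual storage-init implementation: the
function name, the image path, the mount points and the exact command
lines are assumptions chosen for illustration, and the upper and work
directories are assumed to already exist on some writable filesystem.
The sketch only builds the command list and never executes it.

```python
# Dry-run sketch of the storage-init sequence. Hypothetical names and
# paths throughout; this is not the real storage-init code.
import shlex
from typing import List


def storage_init_plan(image: str,
                      lower: str = "/run/lower",
                      upper: str = "/run/upper",
                      work: str = "/run/work",
                      newroot: str = "/initoverlayfs",
                      init: str = "/sbin/init") -> List[List[str]]:
    """Return the commands, in order, without running any of them."""
    overlay_opts = f"lowerdir={lower},upperdir={upper},workdir={work}"
    return [
        # Targeted udev trigger: only block (storage) devices.
        ["udevadm", "trigger", "--type=devices", "--subsystem-match=block"],
        # Wait until the triggered events have been processed.
        ["udevadm", "settle"],
        # Mount the erofs image read-only; data is read on demand.
        ["mount", "-t", "erofs", "-o", "ro,loop", image, lower],
        # Stack a writable, throwaway overlay on top of the erofs mount.
        ["mount", "-t", "overlay", "overlay", "-o", overlay_opts, newroot],
        # Switch to the overlay as root and execute init.
        ["switch_root", newroot, init],
    ]


if __name__ == "__main__":
    # "/boot/initoverlayfs.img" is a placeholder path, not a documented one.
    for cmd in storage_init_plan("/boot/initoverlayfs.img"):
        print(shlex.join(cmd))
```

Keeping the plan as data makes the order of operations easy to inspect
without root privileges or real devices.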

Benefits:
---------

Scalability: You can put less emphasis on keeping this initial
filesystem small, as you will only pay for the bytes you read. This
probably matters more than the raw performance described in the next
point.

Performance: As this shrinks the initramfs to contain only the most
basic storage initialization tasks, Linux userspace starts earlier
than it would with an initramfs alone. All the other software that
requires an early throwaway filesystem is executed from the
initoverlayfs. On a Raspberry Pi 4 with an SD card, systemd starts
~300ms faster, and on a Raspberry Pi 4 with an NVMe SSD attached over
USB, systemd starts ~500ms faster. On some devices, starting Linux
userspace early can expose a slowly initializing storage driver and
lead to a slower boot, because with just an initramfs this slow driver
is masked by the time spent on decompression and copying. But a
computer is only as fast as its slowest component, so if you care
about super fast boots, you need to optimize your storage drivers.

Flexibility: It is now easier to consider using fatter languages like
Rust, etc. Using graphics libraries, camera libraries, libevent, glib,
C++, etc. in early boot can also be considered, as you don't have to
decompress and copy this data up front. This also makes initrd software
easier to maintain, with more consolidation between rootfs
implementations and initial filesystem implementations of components.

Changes required in other projects:
-----------------------------------

There are no major changes required in other projects. Tools like
systemd-analyze might need to be updated to recognize this boot
sequence more accurately, because it has no awareness of
initoverlayfs.

Future plans:
-------------

We intend to propose this to Fedora, CentOS Stream, ostree and
non-ostree variants as we continue this project.

Feel free to try:
-----------------

It should work on most standard 3 partition non-ostree Fedora and
CentOS 9 installs (note: the CentOS 9 kernel does not support erofs
compression, so Fedora is a better playground today). It's still in an
alpha/beta state. Although I successfully dogfood this on my laptop and
we have tried this on a couple of different pieces of hardware and
VMs, maybe run this on a non-critical piece of hardware or a VM for
the next few weeks if you want to try :)

git repo:

https://github.com/containers/initoverlayfs

Also check out the README.md; there are some graphs and other information there:

https://github.com/containers/initoverlayfs/blob/main/README.md

rpm available in copr:

dnf copr enable @centos-automotive-sig/next
dnf install initoverlayfs
initoverlayfs-install

Is mise le meas/Regards,

Eric Curtin


* Re: [RFC] initoverlayfs - a scalable initial filesystem
@ 2023-12-18 21:59 Askar Safin
       [not found] ` <CAOgh=FyA94-7YqGpsAqVQjadegRusoAvRhD=t-ipzVWN0CiJRQ@mail.gmail.com>
  0 siblings, 1 reply; 49+ messages in thread
From: Askar Safin @ 2023-12-18 21:59 UTC (permalink / raw)
  To: ecurtin; +Cc: initramfs, systemd-devel

Hi. Unfortunately, it is not clear enough from
https://github.com/containers/initoverlayfs how exactly the
second-stage early filesystem is mounted. So, please, add that
information to the README. Let me describe how I understand this.

First, the init program from the (small) first-stage early filesystem
mounts the boot/ESP partition, where the second-stage early filesystem
image (i.e. erofs) is located. Then that init program mounts that
erofs image, without copying the whole erofs image into memory. In
other words, if some part of the erofs image is not accessed, then not
only is it not decompressed, it is not even loaded from disk into
memory at all. Is my understanding correct?

-- 
Askar Safin


Thread overview: 49+ messages
2023-12-08 17:59 [RFC] initoverlayfs - a scalable initial filesystem Eric Curtin
2023-12-09 12:46 ` Luca Boccassi
2023-12-09 14:42   ` Eric Curtin
2023-12-09 14:56     ` Andrei Borzenkov
2023-12-09 15:07       ` Eric Curtin
2023-12-09 15:22         ` Daan De Meyer
2023-12-09 15:46           ` Eric Curtin
2023-12-09 17:19         ` Luca Boccassi
2023-12-09 17:24           ` Eric Curtin
2023-12-09 17:46             ` Luca Boccassi
2023-12-09 17:57               ` Eric Curtin
2023-12-09 18:11                 ` Luca Boccassi
2023-12-09 18:26                   ` Eric Curtin
2023-12-11  9:57 ` Lennart Poettering
2023-12-11 10:07   ` Lennart Poettering
2023-12-11 11:20   ` Eric Curtin
2023-12-11 11:28     ` Eric Curtin
2023-12-11 11:42       ` Eric Curtin
2023-12-11 11:58         ` Lennart Poettering
2023-12-11 11:51       ` Lennart Poettering
2023-12-11 12:48         ` Eric Curtin
2023-12-11 12:52           ` Eric Curtin
2023-12-12 17:37           ` Lennart Poettering
2023-12-12 17:40           ` Lennart Poettering
2023-12-12 19:05             ` Demi Marie Obenour
2023-12-11 16:28   ` Demi Marie Obenour
2023-12-11 17:03     ` Eric Curtin
2023-12-11 17:46       ` Demi Marie Obenour
2023-12-12 18:00       ` Lennart Poettering
2023-12-12 20:34         ` Nils Kattenbeck
2023-12-12 20:48           ` Eric Curtin
2023-12-12 21:02           ` Lennart Poettering
2023-12-12 22:01             ` Nils Kattenbeck
2023-12-13  9:03               ` Lennart Poettering
2023-12-14  1:17                 ` Nils Kattenbeck
2023-12-16 14:34                   ` Lennart Poettering
2023-12-11 17:33     ` Neal Gompa
2023-12-11 20:15     ` Luca Boccassi
2023-12-11 20:43       ` Demi Marie Obenour
2023-12-11 20:58         ` Luca Boccassi
2023-12-11 21:20           ` Demi Marie Obenour
2023-12-11 21:45             ` Luca Boccassi
2023-12-12  3:47               ` Paul Menzel
2023-12-12  3:56               ` Paul Menzel
2023-12-12 15:26               ` Paul Menzel
2023-12-11 21:24           ` Eric Curtin
2023-12-12 17:50     ` Lennart Poettering
2023-12-18 21:59 Askar Safin
     [not found] ` <CAOgh=FyA94-7YqGpsAqVQjadegRusoAvRhD=t-ipzVWN0CiJRQ@mail.gmail.com>
2023-12-18 23:31   ` Askar Safin
