initramfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC] initoverlayfs - a scalable initial filesystem
@ 2023-12-08 17:59 Eric Curtin
  2023-12-09 12:46 ` Luca Boccassi
  2023-12-11  9:57 ` Lennart Poettering
  0 siblings, 2 replies; 49+ messages in thread
From: Eric Curtin @ 2023-12-08 17:59 UTC (permalink / raw)
  To: systemd-devel, initramfs
  Cc: Stephen Smoogen, Yariv Rachmani, Daniel Walsh, Douglas Landgraf

We have been working on a new initial filesystem called initoverlayfs.
It is a new filesystem that provides a more scalable approach to
initial filesystems as opposed to just using initrds. We are writing
this RFC to the systemd and dracut mailing lists (feel free to forward
to UAPI group also) because although this solution works without
changing the code in these projects, it operates in the same area as
systemd, udev, dracut, etc. and uses these tools.

Brief context:
--------------

initoverlayfs by default uses transient overlays rather than tmpfs to
create throwaway filesystems early in the boot sequence.

Why?

An initramfs has to be decompressed and copied to a tmpfs up front
before it can be used. This results in a situation where you end up
paying for every byte in an initrd in boot performance, even the ones
you don't use in a given boot.

This leads to a fear of using languages that result in larger binaries
sizes early boot, reusing libraries, etc. In some cases, reimplemented
minified versions of software components present in the rootfs are
used.

Alternatively, initoverlayfs uses erofs (with compression) and
overlayfs to achieve this, so you only pay for the bytes you actually
use.

There is also increased pressure from certain industries like
automotive, to start essential services in a boot sequence early.

Requirements:
-------------

An init system
An initramfs building tool
A device manager
overlayfs

Nothing that you wouldn't find in most Linux distributions today.

Design:
-------

Here is the boot sequence with initoverlayfs integrated, the
mini-initramfs contains just enough to get storage drivers loaded and
storage devices initialized. storage-init is a process that is not
designed to replace init, it does just enough to initialize storage
(performs a targeted udev trigger on storage), switches to
initoverlayfs as root and then executes init.

```
fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs

fw -> bootloader -> kernel -> storage-init   -> init ----------------->
```

Benefits:
---------

Scalability: You can put less emphasis on keeping this initial
filesystem small as you will only pay for the bytes you read. This is
probably the bigger picture than raw performance in the next point.

Performance: As this minifies the initramfs to contain only the most
basic storage initialization tasks, linux userspace starts earlier
than it would using just initramfs alone. Leaving all the other
software that require early throwaway filesystems to be executed in
the initoverlayfs. In the case of a Raspberry Pi 4 with sd card, it
leads to systemd starting ~300ms faster and in the case of a Raspberry
Pi 4 with NVMe SSD drive over USB it leads to systemd starting ~500ms
faster. There are some devices that by starting Linux userspace early,
you can expose a slowly initializing storage driver, leading to a
slower boot as with just an initramfs you mask this slow driver by
spending this time on decompression and copying. But a computer is
only as fast as it's slowest component, so if you care about super
fast boots, you need to optimize your storage drivers.

Flexibility: It is now easier to consider using fatter languages like
Rust, etc. Using libraries like graphics libraries, camera libraries,
libevent, glib, C++, etc. early boot can be considered. As you don't
have to decompress and copy this data upfront. This leads to easier to
maintain initrd software also, with more consolidation between rootfs
impelmentations and initial filesystem implementations of components.

Changes required in other projects:
-----------------------------------

There are no major changes required in other projects. Tools like
systemd-analyze might need to be updated to recognize this boot
sequence more accurately, because it has no awareness of
initoverlayfs.

Future plans:
-------------

We intend to propose this to Fedora, CentOS Stream, ostree and
non-ostree variants as we continue this project.

Feel free to try:
-----------------

It should work on most standard 3 partition non-ostree Fedora and
CentOS 9 installs (note: CentOS 9 kernel does not support erofs
compression, so Fedora is a better playground today). It's still in
alpha/beta state I guess. Although I successfully dogfood this on my
laptop and we hard tried this on a couple of different pieces of
hardware and VMs... Maybe run this on a non-critical piece of hardware
or a VM for the next few weeks if you want to try :)

git repo:

https://github.com/containers/initoverlayfs

Also checkout the README.md, there are some graphs and other information there:

https://github.com/containers/initoverlayfs/blob/main/README.md

rpm available in copr:

dnf copr enable @centos-automotive-sig/next
dnf install initoverlayfs
initoverlayfs-install

Is mise le meas/Regards,

Eric Curtin


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-08 17:59 [RFC] initoverlayfs - a scalable initial filesystem Eric Curtin
@ 2023-12-09 12:46 ` Luca Boccassi
  2023-12-09 14:42   ` Eric Curtin
  2023-12-11  9:57 ` Lennart Poettering
  1 sibling, 1 reply; 49+ messages in thread
From: Luca Boccassi @ 2023-12-09 12:46 UTC (permalink / raw)
  To: Eric Curtin
  Cc: systemd-devel, initramfs, Yariv Rachmani, Stephen Smoogen,
	Douglas Landgraf

On Fri, 8 Dec 2023 at 19:00, Eric Curtin <ecurtin@redhat.com> wrote:
>
> We have been working on a new initial filesystem called initoverlayfs.
> It is a new filesystem that provides a more scalable approach to
> initial filesystems as opposed to just using initrds. We are writing
> this RFC to the systemd and dracut mailing lists (feel free to forward
> to UAPI group also) because although this solution works without
> changing the code in these projects, it operates in the same area as
> systemd, udev, dracut, etc. and uses these tools.

It seems to me everything you described already exists? If you want to
avoid having an initrd -> rootfs transition, you can already do that -
the initrd code paths run because there's /etc/initrd-release, omit
that and the transition/phase is avoided. If you want to have an
overlay with r/o images, you can already do that with sysexts. You'll
need to reimplement and maintain separately TPM support, LUKS support,
fido2, etc etc

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-09 12:46 ` Luca Boccassi
@ 2023-12-09 14:42   ` Eric Curtin
  2023-12-09 14:56     ` Andrei Borzenkov
  0 siblings, 1 reply; 49+ messages in thread
From: Eric Curtin @ 2023-12-09 14:42 UTC (permalink / raw)
  To: Luca Boccassi
  Cc: systemd-devel, initramfs, Yariv Rachmani, Stephen Smoogen,
	Douglas Landgraf

On Sat, 9 Dec 2023 at 12:46, Luca Boccassi <bluca@debian.org> wrote:
>
> On Fri, 8 Dec 2023 at 19:00, Eric Curtin <ecurtin@redhat.com> wrote:
> >
> > We have been working on a new initial filesystem called initoverlayfs.
> > It is a new filesystem that provides a more scalable approach to
> > initial filesystems as opposed to just using initrds. We are writing
> > this RFC to the systemd and dracut mailing lists (feel free to forward
> > to UAPI group also) because although this solution works without
> > changing the code in these projects, it operates in the same area as
> > systemd, udev, dracut, etc. and uses these tools.
>
> It seems to me everything you described already exists? If you want to
> avoid having an initrd -> rootfs transition, you can already do that -

You need a initrd -> rootfs transition for generic linux operating
systems right? Or else you start building all sorts of things directly
into the kernel which isn't really scalable.

> the initrd code paths run because there's /etc/initrd-release, omit
> that and the transition/phase is avoided. If you want to have an
> overlay with r/o images, you can already do that with sysexts. You'll
> need to reimplement and maintain separately TPM support, LUKS support,
> fido2, etc etc

This is intended to be something you can use with or without sysexts,
not a competing alternative. There will be some reimplementations, but
our hope is to minimize that, leave as much as possible to systemd,
initoverlayfs stage, etc. where you don't pay the upfront cost for
decompressing and copying all the bytes.

We are open to executing minified systemd libraries/binaries in the
minified initramfs, we do that in the current version of storage-init
by calling systemd udev binaries.

>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-09 14:42   ` Eric Curtin
@ 2023-12-09 14:56     ` Andrei Borzenkov
  2023-12-09 15:07       ` Eric Curtin
  0 siblings, 1 reply; 49+ messages in thread
From: Andrei Borzenkov @ 2023-12-09 14:56 UTC (permalink / raw)
  To: Eric Curtin, Luca Boccassi
  Cc: systemd-devel, initramfs, Yariv Rachmani, Stephen Smoogen,
	Douglas Landgraf

On 09.12.2023 17:42, Eric Curtin wrote:
> On Sat, 9 Dec 2023 at 12:46, Luca Boccassi <bluca@debian.org> wrote:
>>
>> On Fri, 8 Dec 2023 at 19:00, Eric Curtin <ecurtin@redhat.com> wrote:
>>>
>>> We have been working on a new initial filesystem called initoverlayfs.
>>> It is a new filesystem that provides a more scalable approach to
>>> initial filesystems as opposed to just using initrds. We are writing
>>> this RFC to the systemd and dracut mailing lists (feel free to forward
>>> to UAPI group also) because although this solution works without
>>> changing the code in these projects, it operates in the same area as
>>> systemd, udev, dracut, etc. and uses these tools.
>>
>> It seems to me everything you described already exists? If you want to
>> avoid having an initrd -> rootfs transition, you can already do that -
> 
> You need a initrd -> rootfs transition for generic linux operating
> systems right?

No, you do not. Nothing stops you from running off initramfs (today you 
do not really have init*RAM Disk* - the content of initrd is unpacked 
into initramfs.

> Or else you start building all sorts of things directly
> into the kernel which isn't really scalable.
>

See above.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-09 14:56     ` Andrei Borzenkov
@ 2023-12-09 15:07       ` Eric Curtin
  2023-12-09 15:22         ` Daan De Meyer
  2023-12-09 17:19         ` Luca Boccassi
  0 siblings, 2 replies; 49+ messages in thread
From: Eric Curtin @ 2023-12-09 15:07 UTC (permalink / raw)
  To: Andrei Borzenkov
  Cc: Luca Boccassi, systemd-devel, initramfs, Yariv Rachmani,
	Stephen Smoogen, Douglas Landgraf

On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
>
> On 09.12.2023 17:42, Eric Curtin wrote:
> > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi <bluca@debian.org> wrote:
> >>
> >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin <ecurtin@redhat.com> wrote:
> >>>
> >>> We have been working on a new initial filesystem called initoverlayfs.
> >>> It is a new filesystem that provides a more scalable approach to
> >>> initial filesystems as opposed to just using initrds. We are writing
> >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> >>> to UAPI group also) because although this solution works without
> >>> changing the code in these projects, it operates in the same area as
> >>> systemd, udev, dracut, etc. and uses these tools.
> >>
> >> It seems to me everything you described already exists? If you want to
> >> avoid having an initrd -> rootfs transition, you can already do that -
> >
> > You need a initrd -> rootfs transition for generic linux operating
> > systems right?
>
> No, you do not. Nothing stops you from running off initramfs (today you
> do not really have init*RAM Disk* - the content of initrd is unpacked
> into initramfs.

Apologies if I am misinterpreting this response, I use terms initrd
and initramfs
interchangeably (not technically correct, but it's common to do this). The
point is to avoid unpacking as much as possible, because in many initrds
the majority of the software need not be unpacked, but is designed to work
with throwaway initial filesystems.

>
> > Or else you start building all sorts of things directly
> > into the kernel which isn't really scalable.
> >
>
> See above.
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-09 15:07       ` Eric Curtin
@ 2023-12-09 15:22         ` Daan De Meyer
  2023-12-09 15:46           ` Eric Curtin
  2023-12-09 17:19         ` Luca Boccassi
  1 sibling, 1 reply; 49+ messages in thread
From: Daan De Meyer @ 2023-12-09 15:22 UTC (permalink / raw)
  To: Eric Curtin
  Cc: Andrei Borzenkov, initramfs, systemd-devel, Stephen Smoogen,
	Yariv Rachmani, Douglas Landgraf, Luca Boccassi

> We have been working on a new initial filesystem called initoverlayfs.
> It is a new filesystem that provides a more scalable approach to
> initial filesystems as opposed to just using initrds. We are writing
> this RFC to the systemd and dracut mailing lists (feel free to forward
> to UAPI group also) because although this solution works without
> changing the code in these projects, it operates in the same area as
> systemd, udev, dracut, etc. and uses these tools.

I like the concept of using erofs instead of a compressed cpio and we have
been discussing doing something similar within systemd. I very much dislike
the implementation though. I believe this should be implemented natively within
the Linux kernel instead of hacking around the missing kernel support
in userspace.

If the kernel would add support for supplying an erofs initramfs
instead of a cpio
initramfs, put a writable tmpfs on top of it and would unpack any
extra cpios provided
by the bootloader on top of the tmpfs, then there wouldn't be any need
for initoverlayfs.

Before adopting anything like this I believe there should be a serious
effort to get
this implemented within Linux itself. Only if that turns out to be
impossible should
we fall back to exploring userspace only solutions.

Cheers,

Daan


On Sat, 9 Dec 2023 at 16:08, Eric Curtin <ecurtin@redhat.com> wrote:
>
> On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> >
> > On 09.12.2023 17:42, Eric Curtin wrote:
> > > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi <bluca@debian.org> wrote:
> > >>
> > >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin <ecurtin@redhat.com> wrote:
> > >>>
> > >>> We have been working on a new initial filesystem called initoverlayfs.
> > >>> It is a new filesystem that provides a more scalable approach to
> > >>> initial filesystems as opposed to just using initrds. We are writing
> > >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> > >>> to UAPI group also) because although this solution works without
> > >>> changing the code in these projects, it operates in the same area as
> > >>> systemd, udev, dracut, etc. and uses these tools.
> > >>
> > >> It seems to me everything you described already exists? If you want to
> > >> avoid having an initrd -> rootfs transition, you can already do that -
> > >
> > > You need a initrd -> rootfs transition for generic linux operating
> > > systems right?
> >
> > No, you do not. Nothing stops you from running off initramfs (today you
> > do not really have init*RAM Disk* - the content of initrd is unpacked
> > into initramfs.
>
> Apologies if I am misinterpreting this response, I use terms initrd
> and initramfs
> interchangeably (not technically correct, but it's common to do this). The
> point is to avoid unpacking as much as possible, because in many initrds
> the majority of the software need not be unpacked, but is designed to work
> with throwaway initial filesystems.
>
> >
> > > Or else you start building all sorts of things directly
> > > into the kernel which isn't really scalable.
> > >
> >
> > See above.
> >
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-09 15:22         ` Daan De Meyer
@ 2023-12-09 15:46           ` Eric Curtin
  0 siblings, 0 replies; 49+ messages in thread
From: Eric Curtin @ 2023-12-09 15:46 UTC (permalink / raw)
  To: Daan De Meyer
  Cc: Andrei Borzenkov, initramfs, systemd-devel, Stephen Smoogen,
	Yariv Rachmani, Douglas Landgraf, Luca Boccassi

On Sat, 9 Dec 2023 at 15:23, Daan De Meyer <daan.j.demeyer@gmail.com> wrote:
>
> > We have been working on a new initial filesystem called initoverlayfs.
> > It is a new filesystem that provides a more scalable approach to
> > initial filesystems as opposed to just using initrds. We are writing
> > this RFC to the systemd and dracut mailing lists (feel free to forward
> > to UAPI group also) because although this solution works without
> > changing the code in these projects, it operates in the same area as
> > systemd, udev, dracut, etc. and uses these tools.
>
> I like the concept of using erofs instead of a compressed cpio and we have
> been discussing doing something similar within systemd. I very much dislike
> the implementation though. I believe this should be implemented natively within
> the Linux kernel instead of hacking around the missing kernel support
> in userspace.
>

I'm not against eventually implementing this in kernelspace, it's
something I've thought about. Implementing in userspace made more
sense to start as a lot of this tooling is much easier to work with in
userspace. It was much faster to write this in userspace to prove the
benefits, test, etc.

It is easier to maintain and develop software in userspace though. So
we would need to have serious thought on why we are pushing this into
kernelspace, what are the benefits, etc.

> If the kernel would add support for supplying an erofs initramfs
> instead of a cpio
> initramfs, put a writable tmpfs on top of it and would unpack any
> extra cpios provided
> by the bootloader on top of the tmpfs, then there wouldn't be any need
> for initoverlayfs.

Do we have to unpack extra cpio's, could that be optional? Mounting
erofs with transient overlay is really fast. Of course if people want
to do that it's fine :)


>
> Before adopting anything like this I believe there should be a serious
> effort to get
> this implemented within Linux itself. Only if that turns out to be
> impossible should
> we fall back to exploring userspace only solutions.
>
> Cheers,
>
> Daan
>
>
> On Sat, 9 Dec 2023 at 16:08, Eric Curtin <ecurtin@redhat.com> wrote:
> >
> > On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> > >
> > > On 09.12.2023 17:42, Eric Curtin wrote:
> > > > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi <bluca@debian.org> wrote:
> > > >>
> > > >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin <ecurtin@redhat.com> wrote:
> > > >>>
> > > >>> We have been working on a new initial filesystem called initoverlayfs.
> > > >>> It is a new filesystem that provides a more scalable approach to
> > > >>> initial filesystems as opposed to just using initrds. We are writing
> > > >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> > > >>> to UAPI group also) because although this solution works without
> > > >>> changing the code in these projects, it operates in the same area as
> > > >>> systemd, udev, dracut, etc. and uses these tools.
> > > >>
> > > >> It seems to me everything you described already exists? If you want to
> > > >> avoid having an initrd -> rootfs transition, you can already do that -
> > > >
> > > > You need a initrd -> rootfs transition for generic linux operating
> > > > systems right?
> > >
> > > No, you do not. Nothing stops you from running off initramfs (today you
> > > do not really have init*RAM Disk* - the content of initrd is unpacked
> > > into initramfs.
> >
> > Apologies if I am misinterpreting this response, I use terms initrd
> > and initramfs
> > interchangeably (not technically correct, but it's common to do this). The
> > point is to avoid unpacking as much as possible, because in many initrds
> > the majority of the software need not be unpacked, but is designed to work
> > with throwaway initial filesystems.
> >
> > >
> > > > Or else you start building all sorts of things directly
> > > > into the kernel which isn't really scalable.
> > > >
> > >
> > > See above.
> > >
> >
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-09 15:07       ` Eric Curtin
  2023-12-09 15:22         ` Daan De Meyer
@ 2023-12-09 17:19         ` Luca Boccassi
  2023-12-09 17:24           ` Eric Curtin
  1 sibling, 1 reply; 49+ messages in thread
From: Luca Boccassi @ 2023-12-09 17:19 UTC (permalink / raw)
  To: Eric Curtin
  Cc: Andrei Borzenkov, systemd-devel, initramfs, Yariv Rachmani,
	Stephen Smoogen, Douglas Landgraf

On Sat, 9 Dec 2023 at 15:08, Eric Curtin <ecurtin@redhat.com> wrote:
>
> On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> >
> > On 09.12.2023 17:42, Eric Curtin wrote:
> > > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi <bluca@debian.org> wrote:
> > >>
> > >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin <ecurtin@redhat.com> wrote:
> > >>>
> > >>> We have been working on a new initial filesystem called initoverlayfs.
> > >>> It is a new filesystem that provides a more scalable approach to
> > >>> initial filesystems as opposed to just using initrds. We are writing
> > >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> > >>> to UAPI group also) because although this solution works without
> > >>> changing the code in these projects, it operates in the same area as
> > >>> systemd, udev, dracut, etc. and uses these tools.
> > >>
> > >> It seems to me everything you described already exists? If you want to
> > >> avoid having an initrd -> rootfs transition, you can already do that -
> > >
> > > You need a initrd -> rootfs transition for generic linux operating
> > > systems right?
> >
> > No, you do not. Nothing stops you from running off initramfs (today you
> > do not really have init*RAM Disk* - the content of initrd is unpacked
> > into initramfs.
>
> Apologies if I am misinterpreting this response, I use terms initrd
> and initramfs
> interchangeably (not technically correct, but it's common to do this). The
> point is to avoid unpacking as much as possible, because in many initrds
> the majority of the software need not be unpacked, but is designed to work
> with throwaway initial filesystems.

sd-stub already supports having a small initrd shipped in the UKI,
that is extended via sysexts, and systemd already supports running
from it, without any transition to a final rootfs. What else do you
need? What problem is this attempting to solve?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-09 17:19         ` Luca Boccassi
@ 2023-12-09 17:24           ` Eric Curtin
  2023-12-09 17:46             ` Luca Boccassi
  0 siblings, 1 reply; 49+ messages in thread
From: Eric Curtin @ 2023-12-09 17:24 UTC (permalink / raw)
  To: Luca Boccassi
  Cc: Andrei Borzenkov, systemd-devel, initramfs, Yariv Rachmani,
	Stephen Smoogen, Douglas Landgraf

On Sat, 9 Dec 2023 at 17:19, Luca Boccassi <bluca@debian.org> wrote:
>
> On Sat, 9 Dec 2023 at 15:08, Eric Curtin <ecurtin@redhat.com> wrote:
> >
> > On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> > >
> > > On 09.12.2023 17:42, Eric Curtin wrote:
> > > > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi <bluca@debian.org> wrote:
> > > >>
> > > >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin <ecurtin@redhat.com> wrote:
> > > >>>
> > > >>> We have been working on a new initial filesystem called initoverlayfs.
> > > >>> It is a new filesystem that provides a more scalable approach to
> > > >>> initial filesystems as opposed to just using initrds. We are writing
> > > >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> > > >>> to UAPI group also) because although this solution works without
> > > >>> changing the code in these projects, it operates in the same area as
> > > >>> systemd, udev, dracut, etc. and uses these tools.
> > > >>
> > > >> It seems to me everything you described already exists? If you want to
> > > >> avoid having an initrd -> rootfs transition, you can already do that -
> > > >
> > > > You need a initrd -> rootfs transition for generic linux operating
> > > > systems right?
> > >
> > > No, you do not. Nothing stops you from running off initramfs (today you
> > > do not really have init*RAM Disk* - the content of initrd is unpacked
> > > into initramfs.
> >
> > Apologies if I am misinterpreting this response, I use terms initrd
> > and initramfs
> > interchangeably (not technically correct, but it's common to do this). The
> > point is to avoid unpacking as much as possible, because in many initrds
> > the majority of the software need not be unpacked, but is designed to work
> > with throwaway initial filesystems.
>
> sd-stub already supports having a small initrd shipped in the UKI,
> that is extended via sysexts, and systemd already supports running
> from it, without any transition to a final rootfs. What else do you
> need? What problem is this attempting to solve?

I must give sd-stub a try. The bootloader I most commonly work with (and is one
of the target platforms this is intended for) isn't UEFI, we need something more
portable.

Is mise le meas/Regards,

Eric Curtin

>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-09 17:24           ` Eric Curtin
@ 2023-12-09 17:46             ` Luca Boccassi
  2023-12-09 17:57               ` Eric Curtin
  0 siblings, 1 reply; 49+ messages in thread
From: Luca Boccassi @ 2023-12-09 17:46 UTC (permalink / raw)
  To: Eric Curtin
  Cc: Andrei Borzenkov, systemd-devel, initramfs, Yariv Rachmani,
	Stephen Smoogen, Douglas Landgraf

On Sat, 9 Dec 2023 at 17:25, Eric Curtin <ecurtin@redhat.com> wrote:
>
> On Sat, 9 Dec 2023 at 17:19, Luca Boccassi <bluca@debian.org> wrote:
> >
> > On Sat, 9 Dec 2023 at 15:08, Eric Curtin <ecurtin@redhat.com> wrote:
> > >
> > > On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> > > >
> > > > On 09.12.2023 17:42, Eric Curtin wrote:
> > > > > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi <bluca@debian.org> wrote:
> > > > >>
> > > > >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin <ecurtin@redhat.com> wrote:
> > > > >>>
> > > > >>> We have been working on a new initial filesystem called initoverlayfs.
> > > > >>> It is a new filesystem that provides a more scalable approach to
> > > > >>> initial filesystems as opposed to just using initrds. We are writing
> > > > >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> > > > >>> to UAPI group also) because although this solution works without
> > > > >>> changing the code in these projects, it operates in the same area as
> > > > >>> systemd, udev, dracut, etc. and uses these tools.
> > > > >>
> > > > >> It seems to me everything you described already exists? If you want to
> > > > >> avoid having an initrd -> rootfs transition, you can already do that -
> > > > >
> > > > > You need a initrd -> rootfs transition for generic linux operating
> > > > > systems right?
> > > >
> > > > No, you do not. Nothing stops you from running off initramfs (today you
> > > > do not really have init*RAM Disk* - the content of initrd is unpacked
> > > > into initramfs.
> > >
> > > Apologies if I am misinterpreting this response, I use terms initrd
> > > and initramfs
> > > interchangeably (not technically correct, but it's common to do this). The
> > > point is to avoid unpacking as much as possible, because in many initrds
> > > the majority of the software need not be unpacked, but is designed to work
> > > with throwaway initial filesystems.
> >
> > sd-stub already supports having a small initrd shipped in the UKI,
> > that is extended via sysexts, and systemd already supports running
> > from it, without any transition to a final rootfs. What else do you
> > need? What problem is this attempting to solve?
>
> I must give sd-stub a try. The bootloader I most commonly work with (and is one
> of the target platforms this is intended for) isn't UEFI, we need something more
> portable.

Do we, though? All modern hardware platforms (and VMs) that matter are
UEFI. Why would any of this be needed for legacy hardware platforms?
The existing mechanisms can work just fine on those until they reach
EOL, they won't stop working.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-09 17:46             ` Luca Boccassi
@ 2023-12-09 17:57               ` Eric Curtin
  2023-12-09 18:11                 ` Luca Boccassi
  0 siblings, 1 reply; 49+ messages in thread
From: Eric Curtin @ 2023-12-09 17:57 UTC (permalink / raw)
  To: Luca Boccassi
  Cc: Andrei Borzenkov, systemd-devel, initramfs, Yariv Rachmani,
	Stephen Smoogen, Douglas Landgraf

On Sat, 9 Dec 2023 at 17:46, Luca Boccassi <bluca@debian.org> wrote:
>
> On Sat, 9 Dec 2023 at 17:25, Eric Curtin <ecurtin@redhat.com> wrote:
> >
> > On Sat, 9 Dec 2023 at 17:19, Luca Boccassi <bluca@debian.org> wrote:
> > >
> > > On Sat, 9 Dec 2023 at 15:08, Eric Curtin <ecurtin@redhat.com> wrote:
> > > >
> > > > On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> > > > >
> > > > > On 09.12.2023 17:42, Eric Curtin wrote:
> > > > > > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi <bluca@debian.org> wrote:
> > > > > >>
> > > > > >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin <ecurtin@redhat.com> wrote:
> > > > > >>>
> > > > > >>> We have been working on a new initial filesystem called initoverlayfs.
> > > > > >>> It is a new filesystem that provides a more scalable approach to
> > > > > >>> initial filesystems as opposed to just using initrds. We are writing
> > > > > >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> > > > > >>> to UAPI group also) because although this solution works without
> > > > > >>> changing the code in these projects, it operates in the same area as
> > > > > >>> systemd, udev, dracut, etc. and uses these tools.
> > > > > >>
> > > > > >> It seems to me everything you described already exists? If you want to
> > > > > >> avoid having an initrd -> rootfs transition, you can already do that -
> > > > > >
> > > > > > You need a initrd -> rootfs transition for generic linux operating
> > > > > > systems right?
> > > > >
> > > > > No, you do not. Nothing stops you from running off initramfs (today you
> > > > > do not really have init*RAM Disk* - the content of initrd is unpacked
> > > > > into initramfs.
> > > >
> > > > Apologies if I am misinterpreting this response, I use terms initrd
> > > > and initramfs
> > > > interchangeably (not technically correct, but it's common to do this). The
> > > > point is to avoid unpacking as much as possible, because in many initrds
> > > > the majority of the software need not be unpacked, but is designed to work
> > > > with throwaway initial filesystems.
> > >
> > > sd-stub already supports having a small initrd shipped in the UKI,
> > > that is extended via sysexts, and systemd already supports running
> > > from it, without any transition to a final rootfs. What else do you
> > > need? What problem is this attempting to solve?
> >
> > I must give sd-stub a try. The bootloader I most commonly work with (and is one
> > of the target platforms this is intended for) isn't UEFI, we need something more
> > portable.
>
> Do we, though? All modern hardware platforms (and VMs) that matter are
> UEFI. Why would any of this be needed for legacy hardware platforms?
> The existing mechanisms can work just fine on those until they reach
> EOL, they won't stop working.

Respectfully, this is not true. Especially on ARM platforms. I would
like it to be true, but it's not true today.

I should have expanded, we are not trying to avoid transitioning to a
final rootfs, the goal is to transition to a final rootfs. But not to decompress
and copy all the bytes to a tmpfs up front, rather use something like erofs,
overlayfs, etc. sysexts uses erofs+overlayfs, but it's designed with
a different goal in mind.

>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-09 17:57               ` Eric Curtin
@ 2023-12-09 18:11                 ` Luca Boccassi
  2023-12-09 18:26                   ` Eric Curtin
  0 siblings, 1 reply; 49+ messages in thread
From: Luca Boccassi @ 2023-12-09 18:11 UTC (permalink / raw)
  To: Eric Curtin
  Cc: Andrei Borzenkov, systemd-devel, initramfs, Yariv Rachmani,
	Stephen Smoogen, Douglas Landgraf

On Sat, 9 Dec 2023 at 17:58, Eric Curtin <ecurtin@redhat.com> wrote:
>
> On Sat, 9 Dec 2023 at 17:46, Luca Boccassi <bluca@debian.org> wrote:
> >
> > On Sat, 9 Dec 2023 at 17:25, Eric Curtin <ecurtin@redhat.com> wrote:
> > >
> > > On Sat, 9 Dec 2023 at 17:19, Luca Boccassi <bluca@debian.org> wrote:
> > > >
> > > > On Sat, 9 Dec 2023 at 15:08, Eric Curtin <ecurtin@redhat.com> wrote:
> > > > >
> > > > > On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> > > > > >
> > > > > > On 09.12.2023 17:42, Eric Curtin wrote:
> > > > > > > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi <bluca@debian.org> wrote:
> > > > > > >>
> > > > > > >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin <ecurtin@redhat.com> wrote:
> > > > > > >>>
> > > > > > >>> We have been working on a new initial filesystem called initoverlayfs.
> > > > > > >>> It is a new filesystem that provides a more scalable approach to
> > > > > > >>> initial filesystems as opposed to just using initrds. We are writing
> > > > > > >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> > > > > > >>> to UAPI group also) because although this solution works without
> > > > > > >>> changing the code in these projects, it operates in the same area as
> > > > > > >>> systemd, udev, dracut, etc. and uses these tools.
> > > > > > >>
> > > > > > >> It seems to me everything you described already exists? If you want to
> > > > > > >> avoid having an initrd -> rootfs transition, you can already do that -
> > > > > > >
> > > > > > > You need a initrd -> rootfs transition for generic linux operating
> > > > > > > systems right?
> > > > > >
> > > > > > No, you do not. Nothing stops you from running off initramfs (today you
> > > > > > do not really have init*RAM Disk* - the content of initrd is unpacked
> > > > > > into initramfs.
> > > > >
> > > > > Apologies if I am misinterpreting this response, I use terms initrd
> > > > > and initramfs
> > > > > interchangeably (not technically correct, but it's common to do this). The
> > > > > point is to avoid unpacking as much as possible, because in many initrds
> > > > > the majority of the software need not be unpacked, but is designed to work
> > > > > with throwaway initial filesystems.
> > > >
> > > > sd-stub already supports having a small initrd shipped in the UKI,
> > > > that is extended via sysexts, and systemd already supports running
> > > > from it, without any transition to a final rootfs. What else do you
> > > > need? What problem is this attempting to solve?
> > >
> > > I must give sd-stub a try. The bootloader I most commonly work with (and is one
> > > of the target platforms this is intended for) isn't UEFI, we need something more
> > > portable.
> >
> > Do we, though? All modern hardware platforms (and VMs) that matter are
> > UEFI. Why would any of this be needed for legacy hardware platforms?
> > The existing mechanisms can work just fine on those until they reach
> > EOL, they won't stop working.
>
> Respectfully, this is not true. Especially on ARM platforms. I would
> like it to be true, but it's not true today.

Where any of this would actually matter, they mostly do, and where
they don't one can put together uboot with uefi mode.

> I should have expanded, we are not trying to avoid transitioning to a
> final rootfs, the goal is to transition to a final rootfs. But not to decompress
> and copy all the bytes to a tmpfs up front, rather use something like erofs,
> overlayfs, etc. sysexts uses erofs+overlayfs, but it's designed with
> a different goal in mind.

In what way is the goal different?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-09 18:11                 ` Luca Boccassi
@ 2023-12-09 18:26                   ` Eric Curtin
  0 siblings, 0 replies; 49+ messages in thread
From: Eric Curtin @ 2023-12-09 18:26 UTC (permalink / raw)
  To: Luca Boccassi
  Cc: Andrei Borzenkov, systemd-devel, initramfs, Yariv Rachmani,
	Stephen Smoogen, Douglas Landgraf

On Sat, 9 Dec 2023 at 18:12, Luca Boccassi <bluca@debian.org> wrote:
>
> On Sat, 9 Dec 2023 at 17:58, Eric Curtin <ecurtin@redhat.com> wrote:
> >
> > On Sat, 9 Dec 2023 at 17:46, Luca Boccassi <bluca@debian.org> wrote:
> > >
> > > On Sat, 9 Dec 2023 at 17:25, Eric Curtin <ecurtin@redhat.com> wrote:
> > > >
> > > > On Sat, 9 Dec 2023 at 17:19, Luca Boccassi <bluca@debian.org> wrote:
> > > > >
> > > > > On Sat, 9 Dec 2023 at 15:08, Eric Curtin <ecurtin@redhat.com> wrote:
> > > > > >
> > > > > > On Sat, 9 Dec 2023 at 14:56, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> > > > > > >
> > > > > > > On 09.12.2023 17:42, Eric Curtin wrote:
> > > > > > > > On Sat, 9 Dec 2023 at 12:46, Luca Boccassi <bluca@debian.org> wrote:
> > > > > > > >>
> > > > > > > >> On Fri, 8 Dec 2023 at 19:00, Eric Curtin <ecurtin@redhat.com> wrote:
> > > > > > > >>>
> > > > > > > >>> We have been working on a new initial filesystem called initoverlayfs.
> > > > > > > >>> It is a new filesystem that provides a more scalable approach to
> > > > > > > >>> initial filesystems as opposed to just using initrds. We are writing
> > > > > > > >>> this RFC to the systemd and dracut mailing lists (feel free to forward
> > > > > > > >>> to UAPI group also) because although this solution works without
> > > > > > > >>> changing the code in these projects, it operates in the same area as
> > > > > > > >>> systemd, udev, dracut, etc. and uses these tools.
> > > > > > > >>
> > > > > > > >> It seems to me everything you described already exists? If you want to
> > > > > > > >> avoid having an initrd -> rootfs transition, you can already do that -
> > > > > > > >
> > > > > > > > You need a initrd -> rootfs transition for generic linux operating
> > > > > > > > systems right?
> > > > > > >
> > > > > > > No, you do not. Nothing stops you from running off initramfs (today you
> > > > > > > do not really have init*RAM Disk* - the content of initrd is unpacked
> > > > > > > into initramfs.
> > > > > >
> > > > > > Apologies if I am misinterpreting this response, I use terms initrd
> > > > > > and initramfs
> > > > > > interchangeably (not technically correct, but it's common to do this). The
> > > > > > point is to avoid unpacking as much as possible, because in many initrds
> > > > > > the majority of the software need not be unpacked, but is designed to work
> > > > > > with throwaway initial filesystems.
> > > > >
> > > > > sd-stub already supports having a small initrd shipped in the UKI,
> > > > > that is extended via sysexts, and systemd already supports running
> > > > > from it, without any transition to a final rootfs. What else do you
> > > > > need? What problem is this attempting to solve?
> > > >
> > > > I must give sd-stub a try. The bootloader I most commonly work with (and is one
> > > > of the target platforms this is intended for) isn't UEFI, we need something more
> > > > portable.
> > >
> > > Do we, though? All modern hardware platforms (and VMs) that matter are
> > > UEFI. Why would any of this be needed for legacy hardware platforms?
> > > The existing mechanisms can work just fine on those until they reach
> > > EOL, they won't stop working.
> >
> > Respectfully, this is not true. Especially on ARM platforms. I would
> > like it to be true, but it's not true today.
>
> Where any of this would actually matter, they mostly do, and where
> they don't one can put together uboot with uefi mode.

When you are trying to improve boot performance, introducing another
layer of bootloader with uboot doesn't help. You also have to port
every hardware platform you encounter to uboot. And if you can solve
the problem in the Linux stack somewhere rather than the bootloader.
Why would we choose to fix the problem in the bootloader?

>
> > I should have expanded, we are not trying to avoid transitioning to a
> > final rootfs, the goal is to transition to a final rootfs. But not to decompress
> > and copy all the bytes to a tmpfs up front, rather use something like erofs,
> > overlayfs, etc. sysexts uses erofs+overlayfs, but it's designed with
> > a different goal in mind.
>
> In what way is the goal different?

This project is basically build an initrd, but put it in a
erofs+overlayfs alternatively (technically it builds a really small
initrd to initialize some basic storage drivers etc. and build a
second initrd in an erofs format). All existing software that we've
tested "just works" with this approach, including all the systemd
stuff. And you can do transparent decompression with lz4hc
alternatively. It also means you don't have to be as afraid of
bloating your initial filesystem, because minimizing initrd's is
tedious work.

>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-08 17:59 [RFC] initoverlayfs - a scalable initial filesystem Eric Curtin
  2023-12-09 12:46 ` Luca Boccassi
@ 2023-12-11  9:57 ` Lennart Poettering
  2023-12-11 10:07   ` Lennart Poettering
                     ` (2 more replies)
  1 sibling, 3 replies; 49+ messages in thread
From: Lennart Poettering @ 2023-12-11  9:57 UTC (permalink / raw)
  To: Eric Curtin
  Cc: systemd-devel, initramfs, Yariv Rachmani, Stephen Smoogen,
	Douglas Landgraf

On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:

> Here is the boot sequence with initoverlayfs integrated, the
> mini-initramfs contains just enough to get storage drivers loaded and
> storage devices initialized. storage-init is a process that is not
> designed to replace init, it does just enough to initialize storage
> (performs a targeted udev trigger on storage), switches to
> initoverlayfs as root and then executes init.
>
> ```
> fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
>
> fw -> bootloader -> kernel -> storage-init   -> init ----------------->
> ```

I am not sure I follow what these chains are supposed to mean? Why are
there two lines?

So, I generally would agree that the current initrd scheme is not
ideal, and we have been discussing better approaches. But I am not
sure your approach really is useful on generic systems for two
reasons:

1. no security model? you need to authenticate your initrd in
   2023. There's no execuse to not doing that anymore these days. Not
   in automotive, and not anywhere else really.

2. no way to deal with complex storage? i.e. people use FDE, want to
   unlock their root disks with TPM2 and similar things. People use
   RAID, LVM, and all that mess.

Actually the above are kinda the same problem in a way: you need
complex storage, but if you need that you kinda need udev, and
services, and then also systemd and all that other stuff, and that's
why the system works like the system works right now.

Whenever you devise a system like yours by cutting corners, and
declaring that you don't want TPM, you don't want signed initrds, you
don't want to support weird storage, you just solve your problem in a
very specific way, ignoring the big picture. Which is OK, *if* you can
actually really work without all that and are willing to maintain the
solution for your specific problem only.

As I understand you are trying to solve multiple problems at once
here, and I think one should start with figuring out clearly what
those are before trying to address them, maybe without compromising on
security. So my guess is you want to address the following:

1. You don't want the whole big initrd to be read off disk on every
   boot, but only the parts of it that are actually needed.

2. You don't want the whole big initrd to be fully decompressed on every
   boot, but only the parts of it that are actually needed.

3. You want to share data between root fs and initrd

4. You want to save some boot time by not bringing up an init system
   in the initrd once, then tearing it down again, and starting it
   again from the root fs.

For the items listed above I think you can find different solutions
which do not necessarily compromise security as much.

So, in the list above you could address the latter three like this:

2. Use an erofs rather than a packed cpio as initrd. Make the boot
   loader load the erofs into contigous memory, then use memmap=X!Y on
   the kernel cmdline to synthesize a block device from that, which
   you then mount directly (without any initrd) via
   root=/dev/pmem0. This means yout boot loader will still load the
   whole image into memory, but only decompress the bits actually
   neeed. (It also has some other nice benefits I like, such as an
   immutable rootfs, which tmpfs-based initrds don't have.)

3. Simply never transition to the root fs, don't marke the initrds in
   systemd's eyes as an initrd (specifically: don't add an
   /etc/initrd-release file to it). Instead, just merge resources of
   the root fs into your initrd fs via overlayfs. systemd has
   infrastructure for this: "systemd-sysext". It takes immutable,
   authenticated erofs images (with verity, we call them "DDIs",
   i.e. "discoverable disk images") that it overlays into /usr/. [You
   could also very nicely combine this approach with systemd's
   portable services, and npsawn containers, which operate on the same
   authenticated images]. At MSFT we have a major product that works
   exactly like this: the OS runs off a rootfs that is loaded as an
   initrd, and everything that runs on top of this are just these
   verity disk images, using overlayfs and portable services.

4. The proposal in 3 also addresses goal 4.

Which leaves item 1, which is a bit harder to address. We have been
discussing this off an on internally too. A generic solution to this
is hard. My current thinking for this could be something like this,
covering the UEFI world: support sticking a DDI for the main initrd in
the ESP. The ESP is per definition unencrypted and unauthenticated,
but otherwise relatively well defined, i.e. known to be vfat and
discoverable via UUID on a GPT disk. So: build a minimal
single-process initrd into the kernel (i.e. UKI) that has exactly the
storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
drivers, and dm-verity. Then have a PID 1 that does exactly enough to
jump into the rootfs stored in the ESP. That latter then has proper
file system drivers, storage drivers, crypto stack, and can unlock the
real root. This would still be a pretty specific solution to one set
of devices though, as it could not cover network boots (i.e. where
there is just no ESP to boot from), but I think this could be kept
relatively close, as the logic in that case could just fall back into
loading the DDI that normally would still in the ESP fully into
memory.

(If you are focussing on systems lacking UEFI, then replace the word
"ESP" in the above with a similar concept, i.e. a well discoverable,
unauthenticated relatively simple file system, such as vfat).

Anyway, I can't tell you how to solve your specific problems, but if
there's one thing I'd suggest you to keep in mind then it's the
security angle, i.e. keep in mind from the beginning how
authentication of every component of your process shall work, how
unatteneded disk encryption shall operate and how measurement shall
work. Security must be built into things from the beginning, not be
added as an afterthought.

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11  9:57 ` Lennart Poettering
@ 2023-12-11 10:07   ` Lennart Poettering
  2023-12-11 11:20   ` Eric Curtin
  2023-12-11 16:28   ` Demi Marie Obenour
  2 siblings, 0 replies; 49+ messages in thread
From: Lennart Poettering @ 2023-12-11 10:07 UTC (permalink / raw)
  To: Eric Curtin
  Cc: systemd-devel, initramfs, Yariv Rachmani, Stephen Smoogen,
	Douglas Landgraf

On Mo, 11.12.23 10:57, Lennart Poettering (mzerqung@0pointer.de) wrote:

> Which leaves item 1, which is a bit harder to address. We have been
> discussing this off an on internally too. A generic solution to this
> is hard. My current thinking for this could be something like this,
> covering the UEFI world: support sticking a DDI for the main initrd in
> the ESP. The ESP is per definition unencrypted and unauthenticated,
> but otherwise relatively well defined, i.e. known to be vfat and
> discoverable via UUID on a GPT disk. So: build a minimal
> single-process initrd into the kernel (i.e. UKI) that has exactly the
> storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> jump into the rootfs stored in the ESP. That latter then has proper
> file system drivers, storage drivers, crypto stack, and can unlock the
> real root. This would still be a pretty specific solution to one set
> of devices though, as it could not cover network boots (i.e. where
> there is just no ESP to boot from), but I think this could be kept
> relatively close, as the logic in that case could just fall back into
> loading the DDI that normally would still in the ESP fully into
> memory.

BTW, one thing I would like to emphasize though. i think this item is
really the last thing you should focus on. If your OS never
transitions out of the initrd, and gets its payload merged in via
DDIs, then the root fs should be reasonably small enough and "fully
used at boot" (i.e. every sector read anyway) that doing this extra
work of finding a split-out DDI on the ESP is entirely unnecessary and
just a waste of time (both of developer time and boot time).

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11  9:57 ` Lennart Poettering
  2023-12-11 10:07   ` Lennart Poettering
@ 2023-12-11 11:20   ` Eric Curtin
  2023-12-11 11:28     ` Eric Curtin
  2023-12-11 16:28   ` Demi Marie Obenour
  2 siblings, 1 reply; 49+ messages in thread
From: Eric Curtin @ 2023-12-11 11:20 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: systemd-devel, initramfs, Yariv Rachmani, Stephen Smoogen,
	Douglas Landgraf

On Mon, 11 Dec 2023 at 10:06, Lennart Poettering <mzerqung@0pointer.de> wrote:
>
> On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
>
> > Here is the boot sequence with initoverlayfs integrated, the
> > mini-initramfs contains just enough to get storage drivers loaded and
> > storage devices initialized. storage-init is a process that is not
> > designed to replace init, it does just enough to initialize storage
> > (performs a targeted udev trigger on storage), switches to
> > initoverlayfs as root and then executes init.
> >
> > ```
> > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> >
> > fw -> bootloader -> kernel -> storage-init   -> init ----------------->
> > ```
>
> I am not sure I follow what these chains are supposed to mean? Why are
> there two lines?

The top line is the filesystem transition, the bottom is more like a
process perspective. Will make this clearer in future.

>
> So, I generally would agree that the current initrd scheme is not
> ideal, and we have been discussing better approaches. But I am not
> sure your approach really is useful on generic systems for two
> reasons:
>
> 1. no security model? you need to authenticate your initrd in
>    2023. There's no execuse to not doing that anymore these days. Not
>    in automotive, and not anywhere else really.

Yes you are right, there is no excuse, the plan was to mount using
dm-verity most likely with the details from the initramfs, but
admittedly we had not looked into that into great detail.

>
> 2. no way to deal with complex storage? i.e. people use FDE, want to
>    unlock their root disks with TPM2 and similar things. People use
>    RAID, LVM, and all that mess.

We had 3 thoughts on this:

1. Just worry about the common use-cases and leave everyone else
fallback to the approaches we use today.
2. Try and split up systemd to make it even smaller. We do use
systemd-udev in the small initramfs storage-init process so far.
3. Reimplement some things? But as little as possible, on a case by
case basis, we certainly don't want to fall into the trap of rewriting
systemd that's for sure, systemd does these things very well.

Tbh, if we try and implement this in kernelspace a lot of these
questions go away. You just teach the kernel to deal with the
filesystem image early (say erofs or whatever other filesystem) and
have that data where initramfs data currently is. You still pay for
the initial read, but you still save a bunch of kernel time.

>
> Actually the above are kinda the same problem in a way: you need
> complex storage, but if you need that you kinda need udev, and
> services, and then also systemd and all that other stuff, and that's
> why the system works like the system works right now.

True, but there is also a bunch of stuff in current initrd's today
that aren't required to mount basic storage, but are designed around
the whole idea of having an early throwaway filesystem.

>
> Whenever you devise a system like yours by cutting corners, and
> declaring that you don't want TPM, you don't want signed initrds, you
> don't want to support weird storage, you just solve your problem in a
> very specific way, ignoring the big picture. Which is OK, *if* you can
> actually really work without all that and are willing to maintain the
> solution for your specific problem only.
>
> As I understand you are trying to solve multiple problems at once
> here, and I think one should start with figuring out clearly what
> those are before trying to address them, maybe without compromising on
> security. So my guess is you want to address the following:
>
> 1. You don't want the whole big initrd to be read off disk on every
>    boot, but only the parts of it that are actually needed.
>
> 2. You don't want the whole big initrd to be fully decompressed on every
>    boot, but only the parts of it that are actually needed.
>
> 3. You want to share data between root fs and initrd
>
> 4. You want to save some boot time by not bringing up an init system
>    in the initrd once, then tearing it down again, and starting it
>    again from the root fs.

It's mainly the top 3 that were the goals. And that people have the
freedom to consider using heavier weight generic libraries, tools,
etc. if they want. You want to use Rust (or languages X, Y, Z) to
write something early boot, go ahead! You'll only pay the cost for the
larger binary if you actually use it. The week I started tinkering at
this, there was a mini-debate on whether we should include glib or not
in the initrd. And we are regularly under pressure to reduce boot time
at the moment.

Number 4 was a convenient way to do an early version of this, stick a
process in between systemd and the kernel. But it turns out, it works
very well, the only problem is the reimplementation problem really.

Theoretically this could be systemd-storage-init -> systemd also. Or
systemd and dlopen more libraries as they become available later down
the line.

>
> For the items listed above I think you can find different solutions
> which do not necessarily compromise security as much.
>
> So, in the list above you could address the latter three like this:
>
> 2. Use an erofs rather than a packed cpio as initrd. Make the boot
>    loader load the erofs into contigous memory, then use memmap=X!Y on
>    the kernel cmdline to synthesize a block device from that, which
>    you then mount directly (without any initrd) via
>    root=/dev/pmem0. This means yout boot loader will still load the
>    whole image into memory, but only decompress the bits actually
>    neeed. (It also has some other nice benefits I like, such as an
>    immutable rootfs, which tmpfs-based initrds don't have.)

Yes, lets explore this approach with the kernel community to gather
their thoughts. I'm still happy I did the userspace version first,
even if we end up doing it in kernelspace because it allowed me to
test on various pieces of hardware to see if the benefits are genuine
and they are....

>
> 3. Simply never transition to the root fs, don't marke the initrds in
>    systemd's eyes as an initrd (specifically: don't add an
>    /etc/initrd-release file to it). Instead, just merge resources of
>    the root fs into your initrd fs via overlayfs. systemd has
>    infrastructure for this: "systemd-sysext". It takes immutable,
>    authenticated erofs images (with verity, we call them "DDIs",
>    i.e. "discoverable disk images") that it overlays into /usr/. [You
>    could also very nicely combine this approach with systemd's
>    portable services, and npsawn containers, which operate on the same
>    authenticated images]. At MSFT we have a major product that works
>    exactly like this: the OS runs off a rootfs that is loaded as an
>    initrd, and everything that runs on top of this are just these
>    verity disk images, using overlayfs and portable services.
>
> 4. The proposal in 3 also addresses goal 4.
>

I'm hoping we can benefit both use cases, the case where you want to
transition to a rootfs and the case where you never want to transition
to a rootfs.

> Which leaves item 1, which is a bit harder to address. We have been
> discussing this off an on internally too. A generic solution to this
> is hard. My current thinking for this could be something like this,
> covering the UEFI world: support sticking a DDI for the main initrd in
> the ESP. The ESP is per definition unencrypted and unauthenticated,
> but otherwise relatively well defined, i.e. known to be vfat and
> discoverable via UUID on a GPT disk. So: build a minimal
> single-process initrd into the kernel (i.e. UKI) that has exactly the
> storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> jump into the rootfs stored in the ESP. That latter then has proper
> file system drivers, storage drivers, crypto stack, and can unlock the
> real root. This would still be a pretty specific solution to one set
> of devices though, as it could not cover network boots (i.e. where
> there is just no ESP to boot from), but I think this could be kept
> relatively close, as the logic in that case could just fall back into
> loading the DDI that normally would still in the ESP fully into
> memory.
>

I'm certainly a little biased here because I work with ARM, I would
like it to be UEFI world, but it's not and convincing every SoC vendor
you must use UEFI is hard. I know a UEFI covering solution only would
not have much value for my team at least.

> (If you are focussing on systems lacking UEFI, then replace the word
> "ESP" in the above with a similar concept, i.e. a well discoverable,
> unauthenticated relatively simple file system, such as vfat).

Yeah, agree, this baseline, I think, is common enough to assume. Like
Android Boot Images as an example are basically a UKI binary stuff in
a boot partition.

>
> Anyway, I can't tell you how to solve your specific problems, but if
> there's one thing I'd suggest you to keep in mind then it's the
> security angle, i.e. keep in mind from the beginning how
> authentication of every component of your process shall work, how
> unatteneded disk encryption shall operate and how measurement shall
> work. Security must be built into things from the beginning, not be
> added as an afterthought.

Yes and we certainly want something that fits with the UKI models and
the other commonplace models around.

>
> Lennart
>
> --
> Lennart Poettering, Berlin
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11 11:20   ` Eric Curtin
@ 2023-12-11 11:28     ` Eric Curtin
  2023-12-11 11:42       ` Eric Curtin
  2023-12-11 11:51       ` Lennart Poettering
  0 siblings, 2 replies; 49+ messages in thread
From: Eric Curtin @ 2023-12-11 11:28 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: systemd-devel, initramfs, Yariv Rachmani, Stephen Smoogen,
	Douglas Landgraf

On Mon, 11 Dec 2023 at 11:20, Eric Curtin <ecurtin@redhat.com> wrote:
>
> On Mon, 11 Dec 2023 at 10:06, Lennart Poettering <mzerqung@0pointer.de> wrote:
> >
> > On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
> >
> > > Here is the boot sequence with initoverlayfs integrated, the
> > > mini-initramfs contains just enough to get storage drivers loaded and
> > > storage devices initialized. storage-init is a process that is not
> > > designed to replace init, it does just enough to initialize storage
> > > (performs a targeted udev trigger on storage), switches to
> > > initoverlayfs as root and then executes init.
> > >
> > > ```
> > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > >
> > > fw -> bootloader -> kernel -> storage-init   -> init ----------------->
> > > ```
> >
> > I am not sure I follow what these chains are supposed to mean? Why are
> > there two lines?
>
> The top line is the filesystem transition, the bottom is more like a
> process perspective. Will make this clearer in future.
>
> >
> > So, I generally would agree that the current initrd scheme is not
> > ideal, and we have been discussing better approaches. But I am not
> > sure your approach really is useful on generic systems for two
> > reasons:
> >
> > 1. no security model? you need to authenticate your initrd in
> >    2023. There's no execuse to not doing that anymore these days. Not
> >    in automotive, and not anywhere else really.
>
> Yes you are right, there is no excuse, the plan was to mount using
> dm-verity most likely with the details from the initramfs, but
> admittedly we had not looked into that into great detail.
>
> >
> > 2. no way to deal with complex storage? i.e. people use FDE, want to
> >    unlock their root disks with TPM2 and similar things. People use
> >    RAID, LVM, and all that mess.
>
> We had 3 thoughts on this:
>
> 1. Just worry about the common use-cases and leave everyone else
> fallback to the approaches we use today.
> 2. Try and split up systemd to make it even smaller. We do use
> systemd-udev in the small initramfs storage-init process so far.
> 3. Reimplement some things? But as little as possible, on a case by
> case basis, we certainly don't want to fall into the trap of rewriting
> systemd that's for sure, systemd does these things very well.
>
> Tbh, if we try and implement this in kernelspace a lot of these
> questions go away. You just teach the kernel to deal with the
> filesystem image early (say erofs or whatever other filesystem) and
> have that data where initramfs data currently is. You still pay for
> the initial read, but you still save a bunch of kernel time.
>
> >
> > Actually the above are kinda the same problem in a way: you need
> > complex storage, but if you need that you kinda need udev, and
> > services, and then also systemd and all that other stuff, and that's
> > why the system works like the system works right now.
>
> True, but there is also a bunch of stuff in current initrd's today
> that aren't required to mount basic storage, but are designed around
> the whole idea of having an early throwaway filesystem.
>
> >
> > Whenever you devise a system like yours by cutting corners, and
> > declaring that you don't want TPM, you don't want signed initrds, you
> > don't want to support weird storage, you just solve your problem in a
> > very specific way, ignoring the big picture. Which is OK, *if* you can
> > actually really work without all that and are willing to maintain the
> > solution for your specific problem only.
> >
> > As I understand you are trying to solve multiple problems at once
> > here, and I think one should start with figuring out clearly what
> > those are before trying to address them, maybe without compromising on
> > security. So my guess is you want to address the following:
> >
> > 1. You don't want the whole big initrd to be read off disk on every
> >    boot, but only the parts of it that are actually needed.
> >
> > 2. You don't want the whole big initrd to be fully decompressed on every
> >    boot, but only the parts of it that are actually needed.
> >
> > 3. You want to share data between root fs and initrd
> >
> > 4. You want to save some boot time by not bringing up an init system
> >    in the initrd once, then tearing it down again, and starting it
> >    again from the root fs.
>
> It's mainly the top 3 that were the goals. And that people have the
> freedom to consider using heavier weight generic libraries, tools,
> etc. if they want. You want to use Rust (or languages X, Y, Z) to
> write something early boot, go ahead! You'll only pay the cost for the
> larger binary if you actually use it. The week I started tinkering at
> this, there was a mini-debate on whether we should include glib or not
> in the initrd. And we are regularly under pressure to reduce boot time
> at the moment.
>
> Number 4 was a convenient way to do an early version of this, stick a
> process in between systemd and the kernel. But it turns out, it works
> very well, the only problem is the reimplementation problem really.
>
> Theoretically this could be systemd-storage-init -> systemd also. Or
> systemd and dlopen more libraries as they become available later down
> the line.
>
> >
> > For the items listed above I think you can find different solutions
> > which do not necessarily compromise security as much.
> >
> > So, in the list above you could address the latter three like this:
> >
> > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> >    loader load the erofs into contigous memory, then use memmap=X!Y on
> >    the kernel cmdline to synthesize a block device from that, which
> >    you then mount directly (without any initrd) via
> >    root=/dev/pmem0. This means yout boot loader will still load the
> >    whole image into memory, but only decompress the bits actually
> >    neeed. (It also has some other nice benefits I like, such as an
> >    immutable rootfs, which tmpfs-based initrds don't have.)

What I am unsure about here, is the "make the bootloader load the
erofs into contiguous memory" part. I wonder could we try and use the
existing initramfs data as is. I dunno if
bootloaders make much assumptions about the format of that data, worst
case scenario we could encapsulate erofs in the initramfs, cpio looking
data. Teach the kernel not to decompress and process the whole
thing and mount it like an erofs alternatively. Does this sound crazy
or reasonable?
Sometimes you cannot change the code in a bootloader and it would be
nice if we could avoid introducing another layer of bootloader.


>
> Yes, lets explore this approach with the kernel community to gather
> their thoughts. I'm still happy I did the userspace version first,
> even if we end up doing it in kernelspace because it allowed me to
> test on various pieces of hardware to see if the benefits are genuine
> and they are....
>
> >
> > 3. Simply never transition to the root fs, don't marke the initrds in
> >    systemd's eyes as an initrd (specifically: don't add an
> >    /etc/initrd-release file to it). Instead, just merge resources of
> >    the root fs into your initrd fs via overlayfs. systemd has
> >    infrastructure for this: "systemd-sysext". It takes immutable,
> >    authenticated erofs images (with verity, we call them "DDIs",
> >    i.e. "discoverable disk images") that it overlays into /usr/. [You
> >    could also very nicely combine this approach with systemd's
> >    portable services, and npsawn containers, which operate on the same
> >    authenticated images]. At MSFT we have a major product that works
> >    exactly like this: the OS runs off a rootfs that is loaded as an
> >    initrd, and everything that runs on top of this are just these
> >    verity disk images, using overlayfs and portable services.
> >
> > 4. The proposal in 3 also addresses goal 4.
> >
>
> I'm hoping we can benefit both use cases, the case where you want to
> transition to a rootfs and the case where you never want to transition
> to a rootfs.
>
> > Which leaves item 1, which is a bit harder to address. We have been
> > discussing this off an on internally too. A generic solution to this
> > is hard. My current thinking for this could be something like this,
> > covering the UEFI world: support sticking a DDI for the main initrd in
> > the ESP. The ESP is per definition unencrypted and unauthenticated,
> > but otherwise relatively well defined, i.e. known to be vfat and
> > discoverable via UUID on a GPT disk. So: build a minimal
> > single-process initrd into the kernel (i.e. UKI) that has exactly the
> > storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> > drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> > jump into the rootfs stored in the ESP. That latter then has proper
> > file system drivers, storage drivers, crypto stack, and can unlock the
> > real root. This would still be a pretty specific solution to one set
> > of devices though, as it could not cover network boots (i.e. where
> > there is just no ESP to boot from), but I think this could be kept
> > relatively close, as the logic in that case could just fall back into
> > loading the DDI that normally would still in the ESP fully into
> > memory.
> >
>
> I'm certainly a little biased here because I work with ARM, I would
> like it to be UEFI world, but it's not and convincing every SoC vendor
> you must use UEFI is hard. I know a UEFI covering solution only would
> not have much value for my team at least.
>
> > (If you are focussing on systems lacking UEFI, then replace the word
> > "ESP" in the above with a similar concept, i.e. a well discoverable,
> > unauthenticated relatively simple file system, such as vfat).
>
> Yeah, agree, this baseline, I think, is common enough to assume. Like
> Android Boot Images as an example are basically a UKI binary stuff in
> a boot partition.
>
> >
> > Anyway, I can't tell you how to solve your specific problems, but if
> > there's one thing I'd suggest you to keep in mind then it's the
> > security angle, i.e. keep in mind from the beginning how
> > authentication of every component of your process shall work, how
> > unatteneded disk encryption shall operate and how measurement shall
> > work. Security must be built into things from the beginning, not be
> > added as an afterthought.
>
> Yes and we certainly want something that fits with the UKI models and
> the other commonplace models around.
>
> >
> > Lennart
> >
> > --
> > Lennart Poettering, Berlin
> >


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11 11:28     ` Eric Curtin
@ 2023-12-11 11:42       ` Eric Curtin
  2023-12-11 11:58         ` Lennart Poettering
  2023-12-11 11:51       ` Lennart Poettering
  1 sibling, 1 reply; 49+ messages in thread
From: Eric Curtin @ 2023-12-11 11:42 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: systemd-devel, initramfs, Yariv Rachmani, Stephen Smoogen,
	Douglas Landgraf

I am also thinking, what is the difference between "make the
bootloader load the erofs into contiguous memory" part and doing
something like storage-init.

They are similar approaches, introduce something in the middle to
handle the erofs.

Is mise le meas/Regards,

Eric Curtin

On Mon, 11 Dec 2023 at 11:28, Eric Curtin <ecurtin@redhat.com> wrote:
>
> On Mon, 11 Dec 2023 at 11:20, Eric Curtin <ecurtin@redhat.com> wrote:
> >
> > On Mon, 11 Dec 2023 at 10:06, Lennart Poettering <mzerqung@0pointer.de> wrote:
> > >
> > > On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
> > >
> > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > storage devices initialized. storage-init is a process that is not
> > > > designed to replace init, it does just enough to initialize storage
> > > > (performs a targeted udev trigger on storage), switches to
> > > > initoverlayfs as root and then executes init.
> > > >
> > > > ```
> > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > >
> > > > fw -> bootloader -> kernel -> storage-init   -> init ----------------->
> > > > ```
> > >
> > > I am not sure I follow what these chains are supposed to mean? Why are
> > > there two lines?
> >
> > The top line is the filesystem transition, the bottom is more like a
> > process perspective. Will make this clearer in future.
> >
> > >
> > > So, I generally would agree that the current initrd scheme is not
> > > ideal, and we have been discussing better approaches. But I am not
> > > sure your approach really is useful on generic systems for two
> > > reasons:
> > >
> > > 1. no security model? you need to authenticate your initrd in
> > >    2023. There's no execuse to not doing that anymore these days. Not
> > >    in automotive, and not anywhere else really.
> >
> > Yes you are right, there is no excuse, the plan was to mount using
> > dm-verity most likely with the details from the initramfs, but
> > admittedly we had not looked into that into great detail.
> >
> > >
> > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > >    unlock their root disks with TPM2 and similar things. People use
> > >    RAID, LVM, and all that mess.
> >
> > We had 3 thoughts on this:
> >
> > 1. Just worry about the common use-cases and leave everyone else
> > fallback to the approaches we use today.
> > 2. Try and split up systemd to make it even smaller. We do use
> > systemd-udev in the small initramfs storage-init process so far.
> > 3. Reimplement some things? But as little as possible, on a case by
> > case basis, we certainly don't want to fall into the trap of rewriting
> > systemd that's for sure, systemd does these things very well.
> >
> > Tbh, if we try and implement this in kernelspace a lot of these
> > questions go away. You just teach the kernel to deal with the
> > filesystem image early (say erofs or whatever other filesystem) and
> > have that data where initramfs data currently is. You still pay for
> > the initial read, but you still save a bunch of kernel time.
> >
> > >
> > > Actually the above are kinda the same problem in a way: you need
> > > complex storage, but if you need that you kinda need udev, and
> > > services, and then also systemd and all that other stuff, and that's
> > > why the system works like the system works right now.
> >
> > True, but there is also a bunch of stuff in current initrd's today
> > that aren't required to mount basic storage, but are designed around
> > the whole idea of having an early throwaway filesystem.
> >
> > >
> > > Whenever you devise a system like yours by cutting corners, and
> > > declaring that you don't want TPM, you don't want signed initrds, you
> > > don't want to support weird storage, you just solve your problem in a
> > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > actually really work without all that and are willing to maintain the
> > > solution for your specific problem only.
> > >
> > > As I understand you are trying to solve multiple problems at once
> > > here, and I think one should start with figuring out clearly what
> > > those are before trying to address them, maybe without compromising on
> > > security. So my guess is you want to address the following:
> > >
> > > 1. You don't want the whole big initrd to be read off disk on every
> > >    boot, but only the parts of it that are actually needed.
> > >
> > > 2. You don't want the whole big initrd to be fully decompressed on every
> > >    boot, but only the parts of it that are actually needed.
> > >
> > > 3. You want to share data between root fs and initrd
> > >
> > > 4. You want to save some boot time by not bringing up an init system
> > >    in the initrd once, then tearing it down again, and starting it
> > >    again from the root fs.
> >
> > It's mainly the top 3 that were the goals. And that people have the
> > freedom to consider using heavier weight generic libraries, tools,
> > etc. if they want. You want to use Rust (or languages X, Y, Z) to
> > write something early boot, go ahead! You'll only pay the cost for the
> > larger binary if you actually use it. The week I started tinkering at
> > this, there was a mini-debate on whether we should include glib or not
> > in the initrd. And we are regularly under pressure to reduce boot time
> > at the moment.
> >
> > Number 4 was a convenient way to do an early version of this, stick a
> > process in between systemd and the kernel. But it turns out, it works
> > very well, the only problem is the reimplementation problem really.
> >
> > Theoretically this could be systemd-storage-init -> systemd also. Or
> > systemd and dlopen more libraries as they become available later down
> > the line.
> >
> > >
> > > For the items listed above I think you can find different solutions
> > > which do not necessarily compromise security as much.
> > >
> > > So, in the list above you could address the latter three like this:
> > >
> > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > >    loader load the erofs into contigous memory, then use memmap=X!Y on
> > >    the kernel cmdline to synthesize a block device from that, which
> > >    you then mount directly (without any initrd) via
> > >    root=/dev/pmem0. This means yout boot loader will still load the
> > >    whole image into memory, but only decompress the bits actually
> > >    neeed. (It also has some other nice benefits I like, such as an
> > >    immutable rootfs, which tmpfs-based initrds don't have.)
>
> What I am unsure about here, is the "make the bootloader load the
> erofs into contiguous memory" part. I wonder could we try and use the
> existing initramfs data as is. I dunno if
> bootloaders make much assumptions about the format of that data, worst
> case scenario we could encapsulate erofs in the initramfs, cpio looking
> data. Teach the kernel not to decompress and process the whole
> thing and mount it like an erofs alternatively. Does this sound crazy
> or reasonable?
> Sometimes you cannot change the code in a bootloader and it would be
> nice if we could avoid introducing another layer of bootloader.
>
>
> >
> > Yes, lets explore this approach with the kernel community to gather
> > their thoughts. I'm still happy I did the userspace version first,
> > even if we end up doing it in kernelspace because it allowed me to
> > test on various pieces of hardware to see if the benefits are genuine
> > and they are....
> >
> > >
> > > 3. Simply never transition to the root fs, don't marke the initrds in
> > >    systemd's eyes as an initrd (specifically: don't add an
> > >    /etc/initrd-release file to it). Instead, just merge resources of
> > >    the root fs into your initrd fs via overlayfs. systemd has
> > >    infrastructure for this: "systemd-sysext". It takes immutable,
> > >    authenticated erofs images (with verity, we call them "DDIs",
> > >    i.e. "discoverable disk images") that it overlays into /usr/. [You
> > >    could also very nicely combine this approach with systemd's
> > >    portable services, and npsawn containers, which operate on the same
> > >    authenticated images]. At MSFT we have a major product that works
> > >    exactly like this: the OS runs off a rootfs that is loaded as an
> > >    initrd, and everything that runs on top of this are just these
> > >    verity disk images, using overlayfs and portable services.
> > >
> > > 4. The proposal in 3 also addresses goal 4.
> > >
> >
> > I'm hoping we can benefit both use cases, the case where you want to
> > transition to a rootfs and the case where you never want to transition
> > to a rootfs.
> >
> > > Which leaves item 1, which is a bit harder to address. We have been
> > > discussing this off an on internally too. A generic solution to this
> > > is hard. My current thinking for this could be something like this,
> > > covering the UEFI world: support sticking a DDI for the main initrd in
> > > the ESP. The ESP is per definition unencrypted and unauthenticated,
> > > but otherwise relatively well defined, i.e. known to be vfat and
> > > discoverable via UUID on a GPT disk. So: build a minimal
> > > single-process initrd into the kernel (i.e. UKI) that has exactly the
> > > storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> > > drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> > > jump into the rootfs stored in the ESP. That latter then has proper
> > > file system drivers, storage drivers, crypto stack, and can unlock the
> > > real root. This would still be a pretty specific solution to one set
> > > of devices though, as it could not cover network boots (i.e. where
> > > there is just no ESP to boot from), but I think this could be kept
> > > relatively close, as the logic in that case could just fall back into
> > > loading the DDI that normally would still in the ESP fully into
> > > memory.
> > >
> >
> > I'm certainly a little biased here because I work with ARM, I would
> > like it to be UEFI world, but it's not and convincing every SoC vendor
> > you must use UEFI is hard. I know a UEFI covering solution only would
> > not have much value for my team at least.
> >
> > > (If you are focussing on systems lacking UEFI, then replace the word
> > > "ESP" in the above with a similar concept, i.e. a well discoverable,
> > > unauthenticated relatively simple file system, such as vfat).
> >
> > Yeah, agree, this baseline, I think, is common enough to assume. Like
> > Android Boot Images as an example are basically a UKI binary stuff in
> > a boot partition.
> >
> > >
> > > Anyway, I can't tell you how to solve your specific problems, but if
> > > there's one thing I'd suggest you to keep in mind then it's the
> > > security angle, i.e. keep in mind from the beginning how
> > > authentication of every component of your process shall work, how
> > > unatteneded disk encryption shall operate and how measurement shall
> > > work. Security must be built into things from the beginning, not be
> > > added as an afterthought.
> >
> > Yes and we certainly want something that fits with the UKI models and
> > the other commonplace models around.
> >
> > >
> > > Lennart
> > >
> > > --
> > > Lennart Poettering, Berlin
> > >


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11 11:28     ` Eric Curtin
  2023-12-11 11:42       ` Eric Curtin
@ 2023-12-11 11:51       ` Lennart Poettering
  2023-12-11 12:48         ` Eric Curtin
  1 sibling, 1 reply; 49+ messages in thread
From: Lennart Poettering @ 2023-12-11 11:51 UTC (permalink / raw)
  To: Eric Curtin
  Cc: systemd-devel, initramfs, Yariv Rachmani, Stephen Smoogen,
	Douglas Landgraf

On Mo, 11.12.23 11:28, Eric Curtin (ecurtin@redhat.com) wrote:

> > > For the items listed above I think you can find different solutions
> > > which do not necessarily compromise security as much.
> > >
> > > So, in the list above you could address the latter three like this:
> > >
> > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > >    loader load the erofs into contigous memory, then use memmap=X!Y on
> > >    the kernel cmdline to synthesize a block device from that, which
> > >    you then mount directly (without any initrd) via
> > >    root=/dev/pmem0. This means yout boot loader will still load the
> > >    whole image into memory, but only decompress the bits actually
> > >    neeed. (It also has some other nice benefits I like, such as an
> > >    immutable rootfs, which tmpfs-based initrds don't have.)
>
> What I am unsure about here, is the "make the bootloader load the
> erofs into contiguous memory" part. I wonder could we try and use the
> existing initramfs data as is.

Today's initrds are packed cpio archives of an OS file system
hierarchy. What I proposed means you'd have to put the OS file system
hiearchy into an erofs image instead. Which is a trivial operation,
just unpack and repack.

Note that there are two concepts of "initrd" out there.

a) from the kernel perspective an initrd/initramfs (which both are
   badly named, because its a tmpfs these days) is that packed cpio
   archive that is unpacked into a tmpfs, and then jumped into.

b) from systemd's perspective an initrd is an OS image that carries an
   /etc/initrd-release file. If that file exists then systemd will not
   boot up the system regularly, but instead just prepare everything
   that it can transition into some other root fs.

While most often in real life the initrds currently qualify under both
definitions. But there's no reason to always do this. You can also
have images the kernel would consider an initrd, but systemd does not,
which is something we use in the "USI" concept, i.e. "unified system
images", which are basically UKIs (large UKIs) with a complete rootfs
that is the main system of the OS. And you can also do it the other
way round, which is potentially what I am suggesting to you here: use
an erofs image that would not be considered an initrd by the kernel,
but that systemd would consider one, and transition out of.

> I dunno if
> bootloaders make much assumptions about the format of that data, worst
> case scenario we could encapsulate erofs in the initramfs, cpio looking
> data.

boot loaders generally don't bother with the cpio, it's just "data"
for them. Compression algorithms have changed in the past, and it only
mattered that the kernel could decompress it, the boot loader doesn't care.

> Teach the kernel not to decompress and process the whole
> thing and mount it like an erofs alternatively. Does this sound crazy
> or reasonable?

You are re-inventing the traditional "initrd" logic of the kernel
which was a ramdisk (i.e. a block device /dev/ram0), that was filled
with some fs of your choice loaded by the boot loader.

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11 11:42       ` Eric Curtin
@ 2023-12-11 11:58         ` Lennart Poettering
  0 siblings, 0 replies; 49+ messages in thread
From: Lennart Poettering @ 2023-12-11 11:58 UTC (permalink / raw)
  To: Eric Curtin
  Cc: systemd-devel, initramfs, Yariv Rachmani, Stephen Smoogen,
	Douglas Landgraf

On Mo, 11.12.23 11:42, Eric Curtin (ecurtin@redhat.com) wrote:

> I am also thinking, what is the difference between "make the
> bootloader load the erofs into contiguous memory" part and doing
> something like storage-init.

Well, from my PoV there's value in reducing the stages of the boot
process, and reducing the amount of storage stacks you need in the
mix. Hence, the boot loader can load stuff from disk into memory
anyway, it always has done that, typically the kernel and the
initrd. just swapping out the format of the initrd to get better
behaviour is relatively cheap there, means no additional storage
logic, no additional stage of the boot. You basically only have "boot
loader" (which loads kernel and initrd), and the "host os" (which runs
of the final rootfs).

Otoh if you let your storage-init load the initrd, then you basically
have a third step in the middle, which shares a lot of props with the
last step, but also is distinct. I mean, you probably would reinvent
your own udev and DM stack for that, to get verity in the mix (because
that depends on DM, and udev, to some degree)

In my ideal model, initrds are just part of the UKI btw, so they end
up being loaded together with the rest of the kernel, and need no
verity becaused signed along with the UKI itself.

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11 11:51       ` Lennart Poettering
@ 2023-12-11 12:48         ` Eric Curtin
  2023-12-11 12:52           ` Eric Curtin
                             ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Eric Curtin @ 2023-12-11 12:48 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: systemd-devel, initramfs, Yariv Rachmani, Stephen Smoogen,
	Douglas Landgraf

On Mon, 11 Dec 2023 at 11:51, Lennart Poettering <lennart@poettering.net> wrote:
>
> On Mo, 11.12.23 11:28, Eric Curtin (ecurtin@redhat.com) wrote:
>
> > > > For the items listed above I think you can find different solutions
> > > > which do not necessarily compromise security as much.
> > > >
> > > > So, in the list above you could address the latter three like this:
> > > >
> > > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > > >    loader load the erofs into contigous memory, then use memmap=X!Y on
> > > >    the kernel cmdline to synthesize a block device from that, which
> > > >    you then mount directly (without any initrd) via
> > > >    root=/dev/pmem0. This means yout boot loader will still load the
> > > >    whole image into memory, but only decompress the bits actually
> > > >    neeed. (It also has some other nice benefits I like, such as an
> > > >    immutable rootfs, which tmpfs-based initrds don't have.)
> >
> > What I am unsure about here, is the "make the bootloader load the
> > erofs into contiguous memory" part. I wonder could we try and use the
> > existing initramfs data as is.
>
> Today's initrds are packed cpio archives of an OS file system
> hierarchy. What I proposed means you'd have to put the OS file system
> hiearchy into an erofs image instead. Which is a trivial operation,
> just unpack and repack.
>
> Note that there are two concepts of "initrd" out there.
>
> a) from the kernel perspective an initrd/initramfs (which both are
>    badly named, because its a tmpfs these days) is that packed cpio
>    archive that is unpacked into a tmpfs, and then jumped into.
>
> b) from systemd's perspective an initrd is an OS image that carries an
>    /etc/initrd-release file. If that file exists then systemd will not
>    boot up the system regularly, but instead just prepare everything
>    that it can transition into some other root fs.
>
> While most often in real life the initrds currently qualify under both
> definitions. But there's no reason to always do this. You can also
> have images the kernel would consider an initrd, but systemd does not,
> which is something we use in the "USI" concept, i.e. "unified system
> images", which are basically UKIs (large UKIs) with a complete rootfs
> that is the main system of the OS. And you can also do it the other
> way round, which is potentially what I am suggesting to you here: use
> an erofs image that would not be considered an initrd by the kernel,
> but that systemd would consider one, and transition out of.
>
> > I dunno if
> > bootloaders make much assumptions about the format of that data, worst
> > case scenario we could encapsulate erofs in the initramfs, cpio looking
> > data.
>
> boot loaders generally don't bother with the cpio, it's just "data"
> for them. Compression algorithms have changed in the past, and it only
> mattered that the kernel could decompress it, the boot loader doesn't care.
>
> > Teach the kernel not to decompress and process the whole
> > thing and mount it like an erofs alternatively. Does this sound crazy
> > or reasonable?
>
> You are re-inventing the traditional "initrd" logic of the kernel
> which was a ramdisk (i.e. a block device /dev/ram0), that was filled
> with some fs of your choice loaded by the boot loader.

Sort of yes, but preferably using that __initramfs_start /
initrd_start buffer as is without copying any bytes anywhere else and
without teaching the bootloaders to do things.

The "memmap=" approach you suggested sounds like what we are thinking,
but do you think we could do this without teaching bootloaders to do
new things?

Although the nice thing about a storage-init like approach is there's
basically zero copies up front. What storage-init is trying to be, is
a tool to just call systemd storage things, without also inheriting
all the systemd stack.

>
> Lennart
>
> --
> Lennart Poettering, Berlin
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11 12:48         ` Eric Curtin
@ 2023-12-11 12:52           ` Eric Curtin
  2023-12-12 17:37           ` Lennart Poettering
  2023-12-12 17:40           ` Lennart Poettering
  2 siblings, 0 replies; 49+ messages in thread
From: Eric Curtin @ 2023-12-11 12:52 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: systemd-devel, initramfs, Yariv Rachmani, Stephen Smoogen,
	Douglas Landgraf

On Mon, 11 Dec 2023 at 12:48, Eric Curtin <ecurtin@redhat.com> wrote:
>
> On Mon, 11 Dec 2023 at 11:51, Lennart Poettering <lennart@poettering.net> wrote:
> >
> > On Mo, 11.12.23 11:28, Eric Curtin (ecurtin@redhat.com) wrote:
> >
> > > > > For the items listed above I think you can find different solutions
> > > > > which do not necessarily compromise security as much.
> > > > >
> > > > > So, in the list above you could address the latter three like this:
> > > > >
> > > > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > > > >    loader load the erofs into contigous memory, then use memmap=X!Y on
> > > > >    the kernel cmdline to synthesize a block device from that, which
> > > > >    you then mount directly (without any initrd) via
> > > > >    root=/dev/pmem0. This means yout boot loader will still load the
> > > > >    whole image into memory, but only decompress the bits actually
> > > > >    neeed. (It also has some other nice benefits I like, such as an
> > > > >    immutable rootfs, which tmpfs-based initrds don't have.)
> > >
> > > What I am unsure about here, is the "make the bootloader load the
> > > erofs into contiguous memory" part. I wonder could we try and use the
> > > existing initramfs data as is.
> >
> > Today's initrds are packed cpio archives of an OS file system
> > hierarchy. What I proposed means you'd have to put the OS file system
> > hiearchy into an erofs image instead. Which is a trivial operation,
> > just unpack and repack.
> >
> > Note that there are two concepts of "initrd" out there.
> >
> > a) from the kernel perspective an initrd/initramfs (which both are
> >    badly named, because its a tmpfs these days) is that packed cpio
> >    archive that is unpacked into a tmpfs, and then jumped into.
> >
> > b) from systemd's perspective an initrd is an OS image that carries an
> >    /etc/initrd-release file. If that file exists then systemd will not
> >    boot up the system regularly, but instead just prepare everything
> >    that it can transition into some other root fs.
> >
> > While most often in real life the initrds currently qualify under both
> > definitions. But there's no reason to always do this. You can also
> > have images the kernel would consider an initrd, but systemd does not,
> > which is something we use in the "USI" concept, i.e. "unified system
> > images", which are basically UKIs (large UKIs) with a complete rootfs
> > that is the main system of the OS. And you can also do it the other
> > way round, which is potentially what I am suggesting to you here: use
> > an erofs image that would not be considered an initrd by the kernel,
> > but that systemd would consider one, and transition out of.
> >
> > > I dunno if
> > > bootloaders make much assumptions about the format of that data, worst
> > > case scenario we could encapsulate erofs in the initramfs, cpio looking
> > > data.
> >
> > boot loaders generally don't bother with the cpio, it's just "data"
> > for them. Compression algorithms have changed in the past, and it only
> > mattered that the kernel could decompress it, the boot loader doesn't care.
> >
> > > Teach the kernel not to decompress and process the whole
> > > thing and mount it like an erofs alternatively. Does this sound crazy
> > > or reasonable?
> >
> > You are re-inventing the traditional "initrd" logic of the kernel
> > which was a ramdisk (i.e. a block device /dev/ram0), that was filled
> > with some fs of your choice loaded by the boot loader.
>
> Sort of yes, but preferably using that __initramfs_start /
> initrd_start buffer as is without copying any bytes anywhere else and
> without teaching the bootloaders to do things.
>
> The "memmap=" approach you suggested sounds like what we are thinking,
> but do you think we could do this without teaching bootloaders to do
> new things?

Like could we do that with a "initrd3.0=on" karg and it just uses the
__initramfs_start and __initramfs_size to memmap? (that probably
wouldn't be the arg name, it's just for description purposes here,
maybe it's even a build time flag, etc.)

>
> Although the nice thing about a storage-init like approach is there's
> basically zero copies up front. What storage-init is trying to be, is
> a tool to just call systemd storage things, without also inheriting
> all the systemd stack.
>
> >
> > Lennart
> >
> > --
> > Lennart Poettering, Berlin
> >


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11  9:57 ` Lennart Poettering
  2023-12-11 10:07   ` Lennart Poettering
  2023-12-11 11:20   ` Eric Curtin
@ 2023-12-11 16:28   ` Demi Marie Obenour
  2023-12-11 17:03     ` Eric Curtin
                       ` (3 more replies)
  2 siblings, 4 replies; 49+ messages in thread
From: Demi Marie Obenour @ 2023-12-11 16:28 UTC (permalink / raw)
  To: Lennart Poettering, Eric Curtin
  Cc: Yariv Rachmani, initramfs, systemd-devel, Stephen Smoogen,
	Douglas Landgraf, Qubes OS Development Mailing List

[-- Attachment #1: Type: text/plain, Size: 8760 bytes --]

On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
> 
> > Here is the boot sequence with initoverlayfs integrated, the
> > mini-initramfs contains just enough to get storage drivers loaded and
> > storage devices initialized. storage-init is a process that is not
> > designed to replace init, it does just enough to initialize storage
> > (performs a targeted udev trigger on storage), switches to
> > initoverlayfs as root and then executes init.
> >
> > ```
> > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> >
> > fw -> bootloader -> kernel -> storage-init   -> init ----------------->
> > ```
> 
> I am not sure I follow what these chains are supposed to mean? Why are
> there two lines?
> 
> So, I generally would agree that the current initrd scheme is not
> ideal, and we have been discussing better approaches. But I am not
> sure your approach really is useful on generic systems for two
> reasons:
> 
> 1. no security model? you need to authenticate your initrd in
>    2023. There's no execuse to not doing that anymore these days. Not
>    in automotive, and not anywhere else really.
> 
> 2. no way to deal with complex storage? i.e. people use FDE, want to
>    unlock their root disks with TPM2 and similar things. People use
>    RAID, LVM, and all that mess.
> 
> Actually the above are kinda the same problem in a way: you need
> complex storage, but if you need that you kinda need udev, and
> services, and then also systemd and all that other stuff, and that's
> why the system works like the system works right now.
> 
> Whenever you devise a system like yours by cutting corners, and
> declaring that you don't want TPM, you don't want signed initrds, you
> don't want to support weird storage, you just solve your problem in a
> very specific way, ignoring the big picture. Which is OK, *if* you can
> actually really work without all that and are willing to maintain the
> solution for your specific problem only.
> 
> As I understand you are trying to solve multiple problems at once
> here, and I think one should start with figuring out clearly what
> those are before trying to address them, maybe without compromising on
> security. So my guess is you want to address the following:
> 
> 1. You don't want the whole big initrd to be read off disk on every
>    boot, but only the parts of it that are actually needed.
> 
> 2. You don't want the whole big initrd to be fully decompressed on every
>    boot, but only the parts of it that are actually needed.
> 
> 3. You want to share data between root fs and initrd
> 
> 4. You want to save some boot time by not bringing up an init system
>    in the initrd once, then tearing it down again, and starting it
>    again from the root fs.
> 
> For the items listed above I think you can find different solutions
> which do not necessarily compromise security as much.
> 
> So, in the list above you could address the latter three like this:
> 
> 2. Use an erofs rather than a packed cpio as initrd. Make the boot
>    loader load the erofs into contigous memory, then use memmap=X!Y on
>    the kernel cmdline to synthesize a block device from that, which
>    you then mount directly (without any initrd) via
>    root=/dev/pmem0. This means yout boot loader will still load the
>    whole image into memory, but only decompress the bits actually
>    neeed. (It also has some other nice benefits I like, such as an
>    immutable rootfs, which tmpfs-based initrds don't have.)
> 
> 3. Simply never transition to the root fs, don't marke the initrds in
>    systemd's eyes as an initrd (specifically: don't add an
>    /etc/initrd-release file to it). Instead, just merge resources of
>    the root fs into your initrd fs via overlayfs. systemd has
>    infrastructure for this: "systemd-sysext". It takes immutable,
>    authenticated erofs images (with verity, we call them "DDIs",
>    i.e. "discoverable disk images") that it overlays into /usr/. [You
>    could also very nicely combine this approach with systemd's
>    portable services, and npsawn containers, which operate on the same
>    authenticated images]. At MSFT we have a major product that works
>    exactly like this: the OS runs off a rootfs that is loaded as an
>    initrd, and everything that runs on top of this are just these
>    verity disk images, using overlayfs and portable services.
> 
> 4. The proposal in 3 also addresses goal 4.
> 
> Which leaves item 1, which is a bit harder to address. We have been
> discussing this off an on internally too. A generic solution to this
> is hard. My current thinking for this could be something like this,
> covering the UEFI world: support sticking a DDI for the main initrd in
> the ESP. The ESP is per definition unencrypted and unauthenticated,
> but otherwise relatively well defined, i.e. known to be vfat and
> discoverable via UUID on a GPT disk. So: build a minimal
> single-process initrd into the kernel (i.e. UKI) that has exactly the
> storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> jump into the rootfs stored in the ESP. That latter then has proper
> file system drivers, storage drivers, crypto stack, and can unlock the
> real root. This would still be a pretty specific solution to one set
> of devices though, as it could not cover network boots (i.e. where
> there is just no ESP to boot from), but I think this could be kept
> relatively close, as the logic in that case could just fall back into
> loading the DDI that normally would still in the ESP fully into
> memory.

I don't think this is "a pretty specific solution to one set of devices"
_at all_.  To the contrary, it is _exactly_ what I want to see desktop
systems moving to in the future.

It solves the problem of large firmware images.  It solves the problem
of device-specific configuration, because one can use a file on the EFI
system partition that is read by userspace and either treated as
untrusted or TPM-signed.  It means that one have a complete set of
recovery tools in the event of a problem, rather than being limited to
whatever one can squeese into an initramfs.  One can even include a full
GUI stack (with accessibility support!), rather than just Plymouth.  For
Qubes OS, one can include enough of the Xen and Qubes toolstack to even
launch virtual machines, allowing the use of USB devices and networking
for recovery purposes.  It even means that one can use a FIDO2 token to
unlock the hard drive without a USB stack on the host.  And because the
initramfs _only_ needs to load the boot extension volume, it can be
very, _very_ small, which works great with using Linux as a coreboot
payload.

The only problem I can see that this does not solve is network boot, but
that is very much a niche use case when compared to the millions of
Fedora or Debian desktop installs, or even the tens of thousands of
Qubes OS installs.  Furthermore, I would _much_ rather network boot be
handled by userspace and kexec, rather than the closed source UEFI network
stack.

It does require some care when upgrading, as the dm-verity image and the
UKI cannot both be updated atomically, but one can solve that by first
writing the new dm-verity image to a separate location.  The UKI will
try both both the old and new locations for the dm-verity image and
rename the new image over the old one on success.  The wrong image will
simply fail to mount as its root hash will be wrong.

This even allows Apple-esque boot policies to be implemented on
commodity hardware, provided that the system firmware is sufficiently
hardened.  It won't be as good as what Apple does, but it will be a huge
win from what is possible today.

> (If you are focussing on systems lacking UEFI, then replace the word
> "ESP" in the above with a similar concept, i.e. a well discoverable,
> unauthenticated relatively simple file system, such as vfat).
> 
> Anyway, I can't tell you how to solve your specific problems, but if
> there's one thing I'd suggest you to keep in mind then it's the
> security angle, i.e. keep in mind from the beginning how
> authentication of every component of your process shall work, how
> unatteneded disk encryption shall operate and how measurement shall
> work. Security must be built into things from the beginning, not be
> added as an afterthought.

As a Qubes OS developer and a security researcher, thank you.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11 16:28   ` Demi Marie Obenour
@ 2023-12-11 17:03     ` Eric Curtin
  2023-12-11 17:46       ` Demi Marie Obenour
  2023-12-12 18:00       ` Lennart Poettering
  2023-12-11 17:33     ` Neal Gompa
                       ` (2 subsequent siblings)
  3 siblings, 2 replies; 49+ messages in thread
From: Eric Curtin @ 2023-12-11 17:03 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Lennart Poettering, Yariv Rachmani, initramfs, systemd-devel,
	Stephen Smoogen, Douglas Landgraf,
	Qubes OS Development Mailing List

On Mon, 11 Dec 2023 at 16:36, Demi Marie Obenour
<demi@invisiblethingslab.com> wrote:
>
> On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
> >
> > > Here is the boot sequence with initoverlayfs integrated, the
> > > mini-initramfs contains just enough to get storage drivers loaded and
> > > storage devices initialized. storage-init is a process that is not
> > > designed to replace init, it does just enough to initialize storage
> > > (performs a targeted udev trigger on storage), switches to
> > > initoverlayfs as root and then executes init.
> > >
> > > ```
> > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > >
> > > fw -> bootloader -> kernel -> storage-init   -> init ----------------->
> > > ```
> >
> > I am not sure I follow what these chains are supposed to mean? Why are
> > there two lines?
> >
> > So, I generally would agree that the current initrd scheme is not
> > ideal, and we have been discussing better approaches. But I am not
> > sure your approach really is useful on generic systems for two
> > reasons:
> >
> > 1. no security model? you need to authenticate your initrd in
> >    2023. There's no execuse to not doing that anymore these days. Not
> >    in automotive, and not anywhere else really.
> >
> > 2. no way to deal with complex storage? i.e. people use FDE, want to
> >    unlock their root disks with TPM2 and similar things. People use
> >    RAID, LVM, and all that mess.
> >
> > Actually the above are kinda the same problem in a way: you need
> > complex storage, but if you need that you kinda need udev, and
> > services, and then also systemd and all that other stuff, and that's
> > why the system works like the system works right now.
> >
> > Whenever you devise a system like yours by cutting corners, and
> > declaring that you don't want TPM, you don't want signed initrds, you
> > don't want to support weird storage, you just solve your problem in a
> > very specific way, ignoring the big picture. Which is OK, *if* you can
> > actually really work without all that and are willing to maintain the
> > solution for your specific problem only.
> >
> > As I understand you are trying to solve multiple problems at once
> > here, and I think one should start with figuring out clearly what
> > those are before trying to address them, maybe without compromising on
> > security. So my guess is you want to address the following:
> >
> > 1. You don't want the whole big initrd to be read off disk on every
> >    boot, but only the parts of it that are actually needed.
> >
> > 2. You don't want the whole big initrd to be fully decompressed on every
> >    boot, but only the parts of it that are actually needed.
> >
> > 3. You want to share data between root fs and initrd
> >
> > 4. You want to save some boot time by not bringing up an init system
> >    in the initrd once, then tearing it down again, and starting it
> >    again from the root fs.
> >
> > For the items listed above I think you can find different solutions
> > which do not necessarily compromise security as much.
> >
> > So, in the list above you could address the latter three like this:
> >
> > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> >    loader load the erofs into contigous memory, then use memmap=X!Y on
> >    the kernel cmdline to synthesize a block device from that, which
> >    you then mount directly (without any initrd) via
> >    root=/dev/pmem0. This means yout boot loader will still load the
> >    whole image into memory, but only decompress the bits actually
> >    neeed. (It also has some other nice benefits I like, such as an
> >    immutable rootfs, which tmpfs-based initrds don't have.)
> >
> > 3. Simply never transition to the root fs, don't marke the initrds in
> >    systemd's eyes as an initrd (specifically: don't add an
> >    /etc/initrd-release file to it). Instead, just merge resources of
> >    the root fs into your initrd fs via overlayfs. systemd has
> >    infrastructure for this: "systemd-sysext". It takes immutable,
> >    authenticated erofs images (with verity, we call them "DDIs",
> >    i.e. "discoverable disk images") that it overlays into /usr/. [You
> >    could also very nicely combine this approach with systemd's
> >    portable services, and npsawn containers, which operate on the same
> >    authenticated images]. At MSFT we have a major product that works
> >    exactly like this: the OS runs off a rootfs that is loaded as an
> >    initrd, and everything that runs on top of this are just these
> >    verity disk images, using overlayfs and portable services.
> >
> > 4. The proposal in 3 also addresses goal 4.
> >
> > Which leaves item 1, which is a bit harder to address. We have been
> > discussing this off an on internally too. A generic solution to this
> > is hard. My current thinking for this could be something like this,
> > covering the UEFI world: support sticking a DDI for the main initrd in
> > the ESP. The ESP is per definition unencrypted and unauthenticated,
> > but otherwise relatively well defined, i.e. known to be vfat and
> > discoverable via UUID on a GPT disk. So: build a minimal
> > single-process initrd into the kernel (i.e. UKI) that has exactly the
> > storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> > drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> > jump into the rootfs stored in the ESP. That latter then has proper
> > file system drivers, storage drivers, crypto stack, and can unlock the
> > real root. This would still be a pretty specific solution to one set
> > of devices though, as it could not cover network boots (i.e. where
> > there is just no ESP to boot from), but I think this could be kept
> > relatively close, as the logic in that case could just fall back into
> > loading the DDI that normally would still in the ESP fully into
> > memory.
>
> I don't think this is "a pretty specific solution to one set of devices"
> _at all_.  To the contrary, it is _exactly_ what I want to see desktop
> systems moving to in the future.
>
> It solves the problem of large firmware images.  It solves the problem
> of device-specific configuration, because one can use a file on the EFI
> system partition that is read by userspace and either treated as
> untrusted or TPM-signed.  It means that one have a complete set of
> recovery tools in the event of a problem, rather than being limited to
> whatever one can squeese into an initramfs.  One can even include a full
> GUI stack (with accessibility support!), rather than just Plymouth.  For

plymouth is very interesting in that it has it's own graphics stack, event loop
implementations, etc. A lot of the initrd software is like this.
plymouth is one of
the examples I think of in my head of something that could benefit from being
able to use more generic things. At least it's an easy example to explain to
people.

> Qubes OS, one can include enough of the Xen and Qubes toolstack to even
> launch virtual machines, allowing the use of USB devices and networking
> for recovery purposes.  It even means that one can use a FIDO2 token to
> unlock the hard drive without a USB stack on the host.  And because the
> initramfs _only_ needs to load the boot extension volume, it can be
> very, _very_ small, which works great with using Linux as a coreboot
> payload.
>
> The only problem I can see that this does not solve is network boot, but
> that is very much a niche use case when compared to the millions of
> Fedora or Debian desktop installs, or even the tens of thousands of
> Qubes OS installs.  Furthermore, I would _much_ rather network boot be
> handled by userspace and kexec, rather than the closed source UEFI network
> stack.

A generic approach is hard, I think it's worth discussing which type of boots
you should actually care about milliseconds of performance for. It would be nice
if we had an init system that contained the binary data to do the minimum for
standard Fedora, Debian installs and everything else was an extension whether
that's sysexts, dlopen, a new binary to execute etc.

If the network is ingrained in your boot stack like this, I'm guessing
you probably
don't care about boot performance. Should we come up with a new technique?

Automotive has an expectation for really fast boots, like 2 seconds, in standard
desktops installs there's some expectation as you interface directly
with a human,
but for other installs how much expectation is there?

Or can we just fall back to existing techniques for installs like network boot?

Is mise le meas/Regards,

Eric Curtin

>
> It does require some care when upgrading, as the dm-verity image and the
> UKI cannot both be updated atomically, but one can solve that by first
> writing the new dm-verity image to a separate location.  The UKI will
> try both both the old and new locations for the dm-verity image and
> rename the new image over the old one on success.  The wrong image will
> simply fail to mount as its root hash will be wrong.
>
> This even allows Apple-esque boot policies to be implemented on
> commodity hardware, provided that the system firmware is sufficiently
> hardened.  It won't be as good as what Apple does, but it will be a huge
> win from what is possible today.
>
> > (If you are focussing on systems lacking UEFI, then replace the word
> > "ESP" in the above with a similar concept, i.e. a well discoverable,
> > unauthenticated relatively simple file system, such as vfat).
> >
> > Anyway, I can't tell you how to solve your specific problems, but if
> > there's one thing I'd suggest you to keep in mind then it's the
> > security angle, i.e. keep in mind from the beginning how
> > authentication of every component of your process shall work, how
> > unatteneded disk encryption shall operate and how measurement shall
> > work. Security must be built into things from the beginning, not be
> > added as an afterthought.
>
> As a Qubes OS developer and a security researcher, thank you.
> --
> Sincerely,
> Demi Marie Obenour (she/her/hers)
> Invisible Things Lab


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11 16:28   ` Demi Marie Obenour
  2023-12-11 17:03     ` Eric Curtin
@ 2023-12-11 17:33     ` Neal Gompa
  2023-12-11 20:15     ` Luca Boccassi
  2023-12-12 17:50     ` Lennart Poettering
  3 siblings, 0 replies; 49+ messages in thread
From: Neal Gompa @ 2023-12-11 17:33 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Lennart Poettering, Eric Curtin, initramfs, systemd-devel,
	Stephen Smoogen, Qubes OS Development Mailing List,
	Yariv Rachmani, Douglas Landgraf

On Mon, Dec 11, 2023 at 12:30 PM Demi Marie Obenour
<demi@invisiblethingslab.com> wrote:
>
> On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
> >
> > > Here is the boot sequence with initoverlayfs integrated, the
> > > mini-initramfs contains just enough to get storage drivers loaded and
> > > storage devices initialized. storage-init is a process that is not
> > > designed to replace init, it does just enough to initialize storage
> > > (performs a targeted udev trigger on storage), switches to
> > > initoverlayfs as root and then executes init.
> > >
> > > ```
> > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > >
> > > fw -> bootloader -> kernel -> storage-init   -> init ----------------->
> > > ```
> >
> > I am not sure I follow what these chains are supposed to mean? Why are
> > there two lines?
> >
> > So, I generally would agree that the current initrd scheme is not
> > ideal, and we have been discussing better approaches. But I am not
> > sure your approach really is useful on generic systems for two
> > reasons:
> >
> > 1. no security model? you need to authenticate your initrd in
> >    2023. There's no execuse to not doing that anymore these days. Not
> >    in automotive, and not anywhere else really.
> >
> > 2. no way to deal with complex storage? i.e. people use FDE, want to
> >    unlock their root disks with TPM2 and similar things. People use
> >    RAID, LVM, and all that mess.
> >
> > Actually the above are kinda the same problem in a way: you need
> > complex storage, but if you need that you kinda need udev, and
> > services, and then also systemd and all that other stuff, and that's
> > why the system works like the system works right now.
> >
> > Whenever you devise a system like yours by cutting corners, and
> > declaring that you don't want TPM, you don't want signed initrds, you
> > don't want to support weird storage, you just solve your problem in a
> > very specific way, ignoring the big picture. Which is OK, *if* you can
> > actually really work without all that and are willing to maintain the
> > solution for your specific problem only.
> >
> > As I understand you are trying to solve multiple problems at once
> > here, and I think one should start with figuring out clearly what
> > those are before trying to address them, maybe without compromising on
> > security. So my guess is you want to address the following:
> >
> > 1. You don't want the whole big initrd to be read off disk on every
> >    boot, but only the parts of it that are actually needed.
> >
> > 2. You don't want the whole big initrd to be fully decompressed on every
> >    boot, but only the parts of it that are actually needed.
> >
> > 3. You want to share data between root fs and initrd
> >
> > 4. You want to save some boot time by not bringing up an init system
> >    in the initrd once, then tearing it down again, and starting it
> >    again from the root fs.
> >
> > For the items listed above I think you can find different solutions
> > which do not necessarily compromise security as much.
> >
> > So, in the list above you could address the latter three like this:
> >
> > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> >    loader load the erofs into contigous memory, then use memmap=X!Y on
> >    the kernel cmdline to synthesize a block device from that, which
> >    you then mount directly (without any initrd) via
> >    root=/dev/pmem0. This means yout boot loader will still load the
> >    whole image into memory, but only decompress the bits actually
> >    neeed. (It also has some other nice benefits I like, such as an
> >    immutable rootfs, which tmpfs-based initrds don't have.)
> >
> > 3. Simply never transition to the root fs, don't marke the initrds in
> >    systemd's eyes as an initrd (specifically: don't add an
> >    /etc/initrd-release file to it). Instead, just merge resources of
> >    the root fs into your initrd fs via overlayfs. systemd has
> >    infrastructure for this: "systemd-sysext". It takes immutable,
> >    authenticated erofs images (with verity, we call them "DDIs",
> >    i.e. "discoverable disk images") that it overlays into /usr/. [You
> >    could also very nicely combine this approach with systemd's
> >    portable services, and npsawn containers, which operate on the same
> >    authenticated images]. At MSFT we have a major product that works
> >    exactly like this: the OS runs off a rootfs that is loaded as an
> >    initrd, and everything that runs on top of this are just these
> >    verity disk images, using overlayfs and portable services.
> >
> > 4. The proposal in 3 also addresses goal 4.
> >
> > Which leaves item 1, which is a bit harder to address. We have been
> > discussing this off an on internally too. A generic solution to this
> > is hard. My current thinking for this could be something like this,
> > covering the UEFI world: support sticking a DDI for the main initrd in
> > the ESP. The ESP is per definition unencrypted and unauthenticated,
> > but otherwise relatively well defined, i.e. known to be vfat and
> > discoverable via UUID on a GPT disk. So: build a minimal
> > single-process initrd into the kernel (i.e. UKI) that has exactly the
> > storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> > drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> > jump into the rootfs stored in the ESP. That latter then has proper
> > file system drivers, storage drivers, crypto stack, and can unlock the
> > real root. This would still be a pretty specific solution to one set
> > of devices though, as it could not cover network boots (i.e. where
> > there is just no ESP to boot from), but I think this could be kept
> > relatively close, as the logic in that case could just fall back into
> > loading the DDI that normally would still in the ESP fully into
> > memory.
>
> I don't think this is "a pretty specific solution to one set of devices"
> _at all_.  To the contrary, it is _exactly_ what I want to see desktop
> systems moving to in the future.
>
> It solves the problem of large firmware images.  It solves the problem
> of device-specific configuration, because one can use a file on the EFI
> system partition that is read by userspace and either treated as
> untrusted or TPM-signed.  It means that one have a complete set of
> recovery tools in the event of a problem, rather than being limited to
> whatever one can squeese into an initramfs.  One can even include a full
> GUI stack (with accessibility support!), rather than just Plymouth.  For
> Qubes OS, one can include enough of the Xen and Qubes toolstack to even
> launch virtual machines, allowing the use of USB devices and networking
> for recovery purposes.  It even means that one can use a FIDO2 token to
> unlock the hard drive without a USB stack on the host.  And because the
> initramfs _only_ needs to load the boot extension volume, it can be
> very, _very_ small, which works great with using Linux as a coreboot
> payload.
>
> The only problem I can see that this does not solve is network boot, but
> that is very much a niche use case when compared to the millions of
> Fedora or Debian desktop installs, or even the tens of thousands of
> Qubes OS installs.  Furthermore, I would _much_ rather network boot be
> handled by userspace and kexec, rather than the closed source UEFI network
> stack.
>

Network boot is fairly common in some industries for workstations. In
particular, the film industry does this a fair bit to leverage
switching between workstation and renderfarm modes for workstation
hardware.


-- 
真実はいつも一つ!/ Always, there's only one truth!

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11 17:03     ` Eric Curtin
@ 2023-12-11 17:46       ` Demi Marie Obenour
  2023-12-12 18:00       ` Lennart Poettering
  1 sibling, 0 replies; 49+ messages in thread
From: Demi Marie Obenour @ 2023-12-11 17:46 UTC (permalink / raw)
  To: Eric Curtin
  Cc: Lennart Poettering, Yariv Rachmani, initramfs, systemd-devel,
	Stephen Smoogen, Douglas Landgraf,
	Qubes OS Development Mailing List

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On Mon, Dec 11, 2023 at 05:03:13PM +0000, Eric Curtin wrote:
> On Mon, 11 Dec 2023 at 16:36, Demi Marie Obenour
> <demi@invisiblethingslab.com> wrote:
> >
> > On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > > On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
> > >
> > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > storage devices initialized. storage-init is a process that is not
> > > > designed to replace init, it does just enough to initialize storage
> > > > (performs a targeted udev trigger on storage), switches to
> > > > initoverlayfs as root and then executes init.
> > > >
> > > > ```
> > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > >
> > > > fw -> bootloader -> kernel -> storage-init   -> init ----------------->
> > > > ```
> > >
> > > I am not sure I follow what these chains are supposed to mean? Why are
> > > there two lines?
> > >
> > > So, I generally would agree that the current initrd scheme is not
> > > ideal, and we have been discussing better approaches. But I am not
> > > sure your approach really is useful on generic systems for two
> > > reasons:
> > >
> > > 1. no security model? you need to authenticate your initrd in
> > >    2023. There's no execuse to not doing that anymore these days. Not
> > >    in automotive, and not anywhere else really.
> > >
> > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > >    unlock their root disks with TPM2 and similar things. People use
> > >    RAID, LVM, and all that mess.
> > >
> > > Actually the above are kinda the same problem in a way: you need
> > > complex storage, but if you need that you kinda need udev, and
> > > services, and then also systemd and all that other stuff, and that's
> > > why the system works like the system works right now.
> > >
> > > Whenever you devise a system like yours by cutting corners, and
> > > declaring that you don't want TPM, you don't want signed initrds, you
> > > don't want to support weird storage, you just solve your problem in a
> > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > actually really work without all that and are willing to maintain the
> > > solution for your specific problem only.
> > >
> > > As I understand you are trying to solve multiple problems at once
> > > here, and I think one should start with figuring out clearly what
> > > those are before trying to address them, maybe without compromising on
> > > security. So my guess is you want to address the following:
> > >
> > > 1. You don't want the whole big initrd to be read off disk on every
> > >    boot, but only the parts of it that are actually needed.
> > >
> > > 2. You don't want the whole big initrd to be fully decompressed on every
> > >    boot, but only the parts of it that are actually needed.
> > >
> > > 3. You want to share data between root fs and initrd
> > >
> > > 4. You want to save some boot time by not bringing up an init system
> > >    in the initrd once, then tearing it down again, and starting it
> > >    again from the root fs.
> > >
> > > For the items listed above I think you can find different solutions
> > > which do not necessarily compromise security as much.
> > >
> > > So, in the list above you could address the latter three like this:
> > >
> > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > >    loader load the erofs into contigous memory, then use memmap=X!Y on
> > >    the kernel cmdline to synthesize a block device from that, which
> > >    you then mount directly (without any initrd) via
> > >    root=/dev/pmem0. This means yout boot loader will still load the
> > >    whole image into memory, but only decompress the bits actually
> > >    neeed. (It also has some other nice benefits I like, such as an
> > >    immutable rootfs, which tmpfs-based initrds don't have.)
> > >
> > > 3. Simply never transition to the root fs, don't marke the initrds in
> > >    systemd's eyes as an initrd (specifically: don't add an
> > >    /etc/initrd-release file to it). Instead, just merge resources of
> > >    the root fs into your initrd fs via overlayfs. systemd has
> > >    infrastructure for this: "systemd-sysext". It takes immutable,
> > >    authenticated erofs images (with verity, we call them "DDIs",
> > >    i.e. "discoverable disk images") that it overlays into /usr/. [You
> > >    could also very nicely combine this approach with systemd's
> > >    portable services, and npsawn containers, which operate on the same
> > >    authenticated images]. At MSFT we have a major product that works
> > >    exactly like this: the OS runs off a rootfs that is loaded as an
> > >    initrd, and everything that runs on top of this are just these
> > >    verity disk images, using overlayfs and portable services.
> > >
> > > 4. The proposal in 3 also addresses goal 4.
> > >
> > > Which leaves item 1, which is a bit harder to address. We have been
> > > discussing this off an on internally too. A generic solution to this
> > > is hard. My current thinking for this could be something like this,
> > > covering the UEFI world: support sticking a DDI for the main initrd in
> > > the ESP. The ESP is per definition unencrypted and unauthenticated,
> > > but otherwise relatively well defined, i.e. known to be vfat and
> > > discoverable via UUID on a GPT disk. So: build a minimal
> > > single-process initrd into the kernel (i.e. UKI) that has exactly the
> > > storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> > > drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> > > jump into the rootfs stored in the ESP. That latter then has proper
> > > file system drivers, storage drivers, crypto stack, and can unlock the
> > > real root. This would still be a pretty specific solution to one set
> > > of devices though, as it could not cover network boots (i.e. where
> > > there is just no ESP to boot from), but I think this could be kept
> > > relatively close, as the logic in that case could just fall back into
> > > loading the DDI that normally would still in the ESP fully into
> > > memory.
> >
> > I don't think this is "a pretty specific solution to one set of devices"
> > _at all_.  To the contrary, it is _exactly_ what I want to see desktop
> > systems moving to in the future.
> >
> > It solves the problem of large firmware images.  It solves the problem
> > of device-specific configuration, because one can use a file on the EFI
> > system partition that is read by userspace and either treated as
> > untrusted or TPM-signed.  It means that one have a complete set of
> > recovery tools in the event of a problem, rather than being limited to
> > whatever one can squeese into an initramfs.  One can even include a full
> > GUI stack (with accessibility support!), rather than just Plymouth.  For
> 
> plymouth is very interesting in that it has it's own graphics stack, event loop
> implementations, etc. A lot of the initrd software is like this.
> plymouth is one of
> the examples I think of in my head of something that could benefit from being
> able to use more generic things. At least it's an easy example to explain to
> people.

Indeed so.  There is still the concern of startup time, which
GPU-accelerated programs in particular are often not great at.

> > Qubes OS, one can include enough of the Xen and Qubes toolstack to even
> > launch virtual machines, allowing the use of USB devices and networking
> > for recovery purposes.  It even means that one can use a FIDO2 token to
> > unlock the hard drive without a USB stack on the host.  And because the
> > initramfs _only_ needs to load the boot extension volume, it can be
> > very, _very_ small, which works great with using Linux as a coreboot
> > payload.
> >
> > The only problem I can see that this does not solve is network boot, but
> > that is very much a niche use case when compared to the millions of
> > Fedora or Debian desktop installs, or even the tens of thousands of
> > Qubes OS installs.  Furthermore, I would _much_ rather network boot be
> > handled by userspace and kexec, rather than the closed source UEFI network
> > stack.
> 
> A generic approach is hard, I think it's worth discussing which type of boots
> you should actually care about milliseconds of performance for. It would be nice
> if we had an init system that contained the binary data to do the minimum for
> standard Fedora, Debian installs and everything else was an extension whether
> that's sysexts, dlopen, a new binary to execute etc.
> 
> If the network is ingrained in your boot stack like this, I'm guessing
> you probably
> don't care about boot performance. Should we come up with a new technique?
> 
> Automotive has an expectation for really fast boots, like 2 seconds, in standard
> desktops installs there's some expectation as you interface directly
> with a human,
> but for other installs how much expectation is there?
> 
> Or can we just fall back to existing techniques for installs like network boot?

I wouldn't say that people doing network boot don't care about boot
performance, mostly because I have been on the other side of similar
arguments before [1].  However, I don't think this technique needs to
support network boot.
- -- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[1]: Qubes OS doesn't expose GPU acceleration to VMs.  This is not
     because the developers don't care about graphics performance, but
     because GPUs and especially their driver stacks have a very large
     attack surface.  Work is being done to address this, but even once
     Qubes OS does support GPU acceleration, it will need to be off by
     default, at least initially.
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCgAdFiEEdodNnxM2uiJZBxxxsoi1X/+cIsEFAmV3SuMACgkQsoi1X/+c
IsHH6RAAhMQl/nw2jdZ4tlwxX/zqib3Tfzdo1p9a5VOkSobrvV7qbG0DWVrqe+vH
NKU1xy6FGqPexKjLoGlxWXgPN5rQKvkFXSgRaRefcqGn190WRjqexF0euu26GYTx
AfOEWC1hywoyXUR2LMygEMpodA0ZvZffIZcovmjjr4OeXiSc5aAUrHQ2PabHZaET
BL4jfeNikjw6sA2UdpviMRzb1OVEGZDD96XDSbVz/8tOBcZZNePz+FQXnHqTpcLk
DrBtx4l5noeUYingzxmw4MQZYYPr3kC4+DQtQr7zxv8D0UE9g8lIcpektqMvgoON
88FwVOa4TgTij7vG2f4BGCrZjE7PiPPo5BRb+MtjlZMtrhwdI4IwXY8q4EANWUnw
8nM+952nffVVQjpBtKRsXPZ3glAjvUuqHT8GzfWYYu8y8Dar9c3U4aQSTCJspkz3
jBsPAatFSjdBvlE6OtmyYco92K3A9g6WXzkw5t+/yaljBOddEkxEAw8+Lo1dCqrn
zK+vSFhcGpYodsHFQY0w9kAZ2+6HBX2nZaEmD6ka3furRussm7D4Z36lx1D/pi68
BL4aAFFLaEQ0jD8jqtjVZ2JYpUQufzwrnsNPTZ97WTEKd2F/zM/S09WjFsaOfVIO
F95Eqk0YMHP+krDEcXvm34EZ3PeRGlVm1fz4ttjw8XEekwwB5QU=
=HR07
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11 16:28   ` Demi Marie Obenour
  2023-12-11 17:03     ` Eric Curtin
  2023-12-11 17:33     ` Neal Gompa
@ 2023-12-11 20:15     ` Luca Boccassi
  2023-12-11 20:43       ` Demi Marie Obenour
  2023-12-12 17:50     ` Lennart Poettering
  3 siblings, 1 reply; 49+ messages in thread
From: Luca Boccassi @ 2023-12-11 20:15 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Lennart Poettering, Eric Curtin, initramfs, systemd-devel,
	Stephen Smoogen, Qubes OS Development Mailing List,
	Yariv Rachmani, Douglas Landgraf

On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour
<demi@invisiblethingslab.com> wrote:
>
> On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
> >
> > > Here is the boot sequence with initoverlayfs integrated, the
> > > mini-initramfs contains just enough to get storage drivers loaded and
> > > storage devices initialized. storage-init is a process that is not
> > > designed to replace init, it does just enough to initialize storage
> > > (performs a targeted udev trigger on storage), switches to
> > > initoverlayfs as root and then executes init.
> > >
> > > ```
> > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > >
> > > fw -> bootloader -> kernel -> storage-init   -> init ----------------->
> > > ```
> >
> > I am not sure I follow what these chains are supposed to mean? Why are
> > there two lines?
> >
> > So, I generally would agree that the current initrd scheme is not
> > ideal, and we have been discussing better approaches. But I am not
> > sure your approach really is useful on generic systems for two
> > reasons:
> >
> > 1. no security model? you need to authenticate your initrd in
> >    2023. There's no execuse to not doing that anymore these days. Not
> >    in automotive, and not anywhere else really.
> >
> > 2. no way to deal with complex storage? i.e. people use FDE, want to
> >    unlock their root disks with TPM2 and similar things. People use
> >    RAID, LVM, and all that mess.
> >
> > Actually the above are kinda the same problem in a way: you need
> > complex storage, but if you need that you kinda need udev, and
> > services, and then also systemd and all that other stuff, and that's
> > why the system works like the system works right now.
> >
> > Whenever you devise a system like yours by cutting corners, and
> > declaring that you don't want TPM, you don't want signed initrds, you
> > don't want to support weird storage, you just solve your problem in a
> > very specific way, ignoring the big picture. Which is OK, *if* you can
> > actually really work without all that and are willing to maintain the
> > solution for your specific problem only.
> >
> > As I understand you are trying to solve multiple problems at once
> > here, and I think one should start with figuring out clearly what
> > those are before trying to address them, maybe without compromising on
> > security. So my guess is you want to address the following:
> >
> > 1. You don't want the whole big initrd to be read off disk on every
> >    boot, but only the parts of it that are actually needed.
> >
> > 2. You don't want the whole big initrd to be fully decompressed on every
> >    boot, but only the parts of it that are actually needed.
> >
> > 3. You want to share data between root fs and initrd
> >
> > 4. You want to save some boot time by not bringing up an init system
> >    in the initrd once, then tearing it down again, and starting it
> >    again from the root fs.
> >
> > For the items listed above I think you can find different solutions
> > which do not necessarily compromise security as much.
> >
> > So, in the list above you could address the latter three like this:
> >
> > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> >    loader load the erofs into contigous memory, then use memmap=X!Y on
> >    the kernel cmdline to synthesize a block device from that, which
> >    you then mount directly (without any initrd) via
> >    root=/dev/pmem0. This means yout boot loader will still load the
> >    whole image into memory, but only decompress the bits actually
> >    neeed. (It also has some other nice benefits I like, such as an
> >    immutable rootfs, which tmpfs-based initrds don't have.)
> >
> > 3. Simply never transition to the root fs, don't marke the initrds in
> >    systemd's eyes as an initrd (specifically: don't add an
> >    /etc/initrd-release file to it). Instead, just merge resources of
> >    the root fs into your initrd fs via overlayfs. systemd has
> >    infrastructure for this: "systemd-sysext". It takes immutable,
> >    authenticated erofs images (with verity, we call them "DDIs",
> >    i.e. "discoverable disk images") that it overlays into /usr/. [You
> >    could also very nicely combine this approach with systemd's
> >    portable services, and npsawn containers, which operate on the same
> >    authenticated images]. At MSFT we have a major product that works
> >    exactly like this: the OS runs off a rootfs that is loaded as an
> >    initrd, and everything that runs on top of this are just these
> >    verity disk images, using overlayfs and portable services.
> >
> > 4. The proposal in 3 also addresses goal 4.
> >
> > Which leaves item 1, which is a bit harder to address. We have been
> > discussing this off an on internally too. A generic solution to this
> > is hard. My current thinking for this could be something like this,
> > covering the UEFI world: support sticking a DDI for the main initrd in
> > the ESP. The ESP is per definition unencrypted and unauthenticated,
> > but otherwise relatively well defined, i.e. known to be vfat and
> > discoverable via UUID on a GPT disk. So: build a minimal
> > single-process initrd into the kernel (i.e. UKI) that has exactly the
> > storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> > drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> > jump into the rootfs stored in the ESP. That latter then has proper
> > file system drivers, storage drivers, crypto stack, and can unlock the
> > real root. This would still be a pretty specific solution to one set
> > of devices though, as it could not cover network boots (i.e. where
> > there is just no ESP to boot from), but I think this could be kept
> > relatively close, as the logic in that case could just fall back into
> > loading the DDI that normally would still in the ESP fully into
> > memory.
>
> I don't think this is "a pretty specific solution to one set of devices"
> _at all_.  To the contrary, it is _exactly_ what I want to see desktop
> systems moving to in the future.
>
> It solves the problem of large firmware images.  It solves the problem
> of device-specific configuration, because one can use a file on the EFI
> system partition that is read by userspace and either treated as
> untrusted or TPM-signed.

All those problems are already solved, without inventing a new shell
scripting solution - we have DDIs and credentials. This is the exact
opposite of the direction we are pursuing: we want to _kill_ all these
initrd-specific infrastructure, tools, build systems, dependency
management and so on, because they are difficult to maintain, they
create a completely different environment that what is "normally" ran,
and they end up reinventing everything the 'normal' image does. We
want to build initrds from packages - as in normal distribution
packages, not special sauce initrd-only packages, so that the same
code and the same configuration is used everywhere, in different
runtime modes. Because that's what distributions are good to do:
creating package-based ecosystems, with good tooling, infrastructure
and so on.

The end goal is to build images without initramfs-tools/dracut and
just using packages, not to stick yet another glue script in front of
them, that needs yet more special initrd-only arcane magic to put
together, in order to save a handful of KBs.

And for ancient, legacy platforms that do not support modern APIs, the
old ways will still be there, and can be used. Nobody is going to take
away grub and dracut from the internet, if you got some special corner
case where you want to use it it will still be there, but the fact
that such corner cases exist cannot stop the rest of the ecosystem
that is targeted to modern hardware from evolving into something
better, more maintainable and more straightforward.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11 20:15     ` Luca Boccassi
@ 2023-12-11 20:43       ` Demi Marie Obenour
  2023-12-11 20:58         ` Luca Boccassi
  0 siblings, 1 reply; 49+ messages in thread
From: Demi Marie Obenour @ 2023-12-11 20:43 UTC (permalink / raw)
  To: Luca Boccassi
  Cc: Lennart Poettering, Eric Curtin, initramfs, systemd-devel,
	Stephen Smoogen, Yariv Rachmani, Douglas Landgraf

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On Mon, Dec 11, 2023 at 08:15:27PM +0000, Luca Boccassi wrote:
> On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour
> <demi@invisiblethingslab.com> wrote:
> >
> > On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > > On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
> > >
> > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > storage devices initialized. storage-init is a process that is not
> > > > designed to replace init, it does just enough to initialize storage
> > > > (performs a targeted udev trigger on storage), switches to
> > > > initoverlayfs as root and then executes init.
> > > >
> > > > ```
> > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > >
> > > > fw -> bootloader -> kernel -> storage-init   -> init ----------------->
> > > > ```
> > >
> > > I am not sure I follow what these chains are supposed to mean? Why are
> > > there two lines?
> > >
> > > So, I generally would agree that the current initrd scheme is not
> > > ideal, and we have been discussing better approaches. But I am not
> > > sure your approach really is useful on generic systems for two
> > > reasons:
> > >
> > > 1. no security model? you need to authenticate your initrd in
> > >    2023. There's no execuse to not doing that anymore these days. Not
> > >    in automotive, and not anywhere else really.
> > >
> > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > >    unlock their root disks with TPM2 and similar things. People use
> > >    RAID, LVM, and all that mess.
> > >
> > > Actually the above are kinda the same problem in a way: you need
> > > complex storage, but if you need that you kinda need udev, and
> > > services, and then also systemd and all that other stuff, and that's
> > > why the system works like the system works right now.
> > >
> > > Whenever you devise a system like yours by cutting corners, and
> > > declaring that you don't want TPM, you don't want signed initrds, you
> > > don't want to support weird storage, you just solve your problem in a
> > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > actually really work without all that and are willing to maintain the
> > > solution for your specific problem only.
> > >
> > > As I understand you are trying to solve multiple problems at once
> > > here, and I think one should start with figuring out clearly what
> > > those are before trying to address them, maybe without compromising on
> > > security. So my guess is you want to address the following:
> > >
> > > 1. You don't want the whole big initrd to be read off disk on every
> > >    boot, but only the parts of it that are actually needed.
> > >
> > > 2. You don't want the whole big initrd to be fully decompressed on every
> > >    boot, but only the parts of it that are actually needed.
> > >
> > > 3. You want to share data between root fs and initrd
> > >
> > > 4. You want to save some boot time by not bringing up an init system
> > >    in the initrd once, then tearing it down again, and starting it
> > >    again from the root fs.
> > >
> > > For the items listed above I think you can find different solutions
> > > which do not necessarily compromise security as much.
> > >
> > > So, in the list above you could address the latter three like this:
> > >
> > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > >    loader load the erofs into contigous memory, then use memmap=X!Y on
> > >    the kernel cmdline to synthesize a block device from that, which
> > >    you then mount directly (without any initrd) via
> > >    root=/dev/pmem0. This means yout boot loader will still load the
> > >    whole image into memory, but only decompress the bits actually
> > >    neeed. (It also has some other nice benefits I like, such as an
> > >    immutable rootfs, which tmpfs-based initrds don't have.)
> > >
> > > 3. Simply never transition to the root fs, don't marke the initrds in
> > >    systemd's eyes as an initrd (specifically: don't add an
> > >    /etc/initrd-release file to it). Instead, just merge resources of
> > >    the root fs into your initrd fs via overlayfs. systemd has
> > >    infrastructure for this: "systemd-sysext". It takes immutable,
> > >    authenticated erofs images (with verity, we call them "DDIs",
> > >    i.e. "discoverable disk images") that it overlays into /usr/. [You
> > >    could also very nicely combine this approach with systemd's
> > >    portable services, and npsawn containers, which operate on the same
> > >    authenticated images]. At MSFT we have a major product that works
> > >    exactly like this: the OS runs off a rootfs that is loaded as an
> > >    initrd, and everything that runs on top of this are just these
> > >    verity disk images, using overlayfs and portable services.
> > >
> > > 4. The proposal in 3 also addresses goal 4.
> > >
> > > Which leaves item 1, which is a bit harder to address. We have been
> > > discussing this off an on internally too. A generic solution to this
> > > is hard. My current thinking for this could be something like this,
> > > covering the UEFI world: support sticking a DDI for the main initrd in
> > > the ESP. The ESP is per definition unencrypted and unauthenticated,
> > > but otherwise relatively well defined, i.e. known to be vfat and
> > > discoverable via UUID on a GPT disk. So: build a minimal
> > > single-process initrd into the kernel (i.e. UKI) that has exactly the
> > > storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> > > drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> > > jump into the rootfs stored in the ESP. That latter then has proper
> > > file system drivers, storage drivers, crypto stack, and can unlock the
> > > real root. This would still be a pretty specific solution to one set
> > > of devices though, as it could not cover network boots (i.e. where
> > > there is just no ESP to boot from), but I think this could be kept
> > > relatively close, as the logic in that case could just fall back into
> > > loading the DDI that normally would still in the ESP fully into
> > > memory.
> >
> > I don't think this is "a pretty specific solution to one set of devices"
> > _at all_.  To the contrary, it is _exactly_ what I want to see desktop
> > systems moving to in the future.
> >
> > It solves the problem of large firmware images.  It solves the problem
> > of device-specific configuration, because one can use a file on the EFI
> > system partition that is read by userspace and either treated as
> > untrusted or TPM-signed.
> 
> All those problems are already solved, without inventing a new shell
> scripting solution - we have DDIs and credentials. This is the exact
> opposite of the direction we are pursuing: we want to _kill_ all these
> initrd-specific infrastructure, tools, build systems, dependency
> management and so on, because they are difficult to maintain, they
> create a completely different environment that what is "normally" ran,
> and they end up reinventing everything the 'normal' image does. We
> want to build initrds from packages - as in normal distribution
> packages, not special sauce initrd-only packages, so that the same
> code and the same configuration is used everywhere, in different
> runtime modes. Because that's what distributions are good to do:
> creating package-based ecosystems, with good tooling, infrastructure
> and so on.
> 
> The end goal is to build images without initramfs-tools/dracut and
> just using packages, not to stick yet another glue script in front of
> them, that needs yet more special initrd-only arcane magic to put
> together, in order to save a handful of KBs.

The initramfs being a RAM filesystem is exactly why keeping it small is
so critical.  Lennart's suggestion solves this problem by eagerly
loading an image from disk, which is much less size-constrained.  One
would use distribution packages to build this on-disk image.

> And for ancient, legacy platforms that do not support modern APIs, the
> old ways will still be there, and can be used. Nobody is going to take
> away grub and dracut from the internet, if you got some special corner
> case where you want to use it it will still be there, but the fact
> that such corner cases exist cannot stop the rest of the ecosystem
> that is targeted to modern hardware from evolving into something
> better, more maintainable and more straightforward.

The problem is not that UEFI is not usable in automotive systems.  The
problem is that U-Boot (or any other UEFI implementation) is an extra
stage in the boot process, slows things down, and has more attack
surface.
- -- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCgAdFiEEdodNnxM2uiJZBxxxsoi1X/+cIsEFAmV3dGwACgkQsoi1X/+c
IsGAxg//SME2795YgGWdruCwKs3D3s78MChJ18zx7DKAkIl24bETMHr7fBF0kOf/
nGKgl5VFEFNL+nEKVXNstLPTqP50BdGUShJqz7A7JVXYkpoc7+3WGmd7ZkjUXpXJ
l+37aJEo+U11Vew84LBvpckR63oshoeCr/cJrcnDaNK5NyqN9vhDXHSgJ6lu+8bh
gC7LnAhmvyB0g+vL0QpzNijNyM7nDg9zCzlP3cOYiyLj5cb4MoLL9TAZPsK0oy2q
UagW+5keJxJfY5ffdAWFpqg2UeY/7cPU5H/rkdkUFbaE9Dk8VLVsTFq6Zk5arUGw
8/CJptX2rD3DsFM+yWgizKC7Tnb9DGNZPB5ORZFem26nrNYmBz58NupDWW5HCNo9
OuPO3ASREb6z1XGmrnD1Dc8ExyTczn/zwp+x/qEDtmn8fmhDGuknwQ9D0mZ6XgO4
DuA9q4aKldgOT5wjflTaSSLkjvzaV81m1wGtxvMDdJlrmturU0GsRTeL/RpK9Dsj
BtgfvSfy+FC0uUxXSJQo/dvJmfnFHQFKss/HDf6nJJMvT20fzT+XbNljzVWLRsr3
f3suT56nIQ7oorRlgnpaCN7uQeyMKkMY7CWQtgqLGkp6c27ObfuUREFDl9KZWoF6
pI61gAVGzKmwSwJlFYHohkqMJlcqln27UX2aspQ52PMeGNwOPVM=
=mLxX
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11 20:43       ` Demi Marie Obenour
@ 2023-12-11 20:58         ` Luca Boccassi
  2023-12-11 21:20           ` Demi Marie Obenour
  2023-12-11 21:24           ` Eric Curtin
  0 siblings, 2 replies; 49+ messages in thread
From: Luca Boccassi @ 2023-12-11 20:58 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Lennart Poettering, Eric Curtin, initramfs, systemd-devel,
	Stephen Smoogen, Yariv Rachmani, Douglas Landgraf

On Mon, 11 Dec 2023 at 20:43, Demi Marie Obenour
<demi@invisiblethingslab.com> wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA512
>
> On Mon, Dec 11, 2023 at 08:15:27PM +0000, Luca Boccassi wrote:
> > On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour
> > <demi@invisiblethingslab.com> wrote:
> > >
> > > On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > > > On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
> > > >
> > > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > > storage devices initialized. storage-init is a process that is not
> > > > > designed to replace init, it does just enough to initialize storage
> > > > > (performs a targeted udev trigger on storage), switches to
> > > > > initoverlayfs as root and then executes init.
> > > > >
> > > > > ```
> > > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > > >
> > > > > fw -> bootloader -> kernel -> storage-init   -> init ----------------->
> > > > > ```
> > > >
> > > > I am not sure I follow what these chains are supposed to mean? Why are
> > > > there two lines?
> > > >
> > > > So, I generally would agree that the current initrd scheme is not
> > > > ideal, and we have been discussing better approaches. But I am not
> > > > sure your approach really is useful on generic systems for two
> > > > reasons:
> > > >
> > > > 1. no security model? you need to authenticate your initrd in
> > > >    2023. There's no execuse to not doing that anymore these days. Not
> > > >    in automotive, and not anywhere else really.
> > > >
> > > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > > >    unlock their root disks with TPM2 and similar things. People use
> > > >    RAID, LVM, and all that mess.
> > > >
> > > > Actually the above are kinda the same problem in a way: you need
> > > > complex storage, but if you need that you kinda need udev, and
> > > > services, and then also systemd and all that other stuff, and that's
> > > > why the system works like the system works right now.
> > > >
> > > > Whenever you devise a system like yours by cutting corners, and
> > > > declaring that you don't want TPM, you don't want signed initrds, you
> > > > don't want to support weird storage, you just solve your problem in a
> > > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > > actually really work without all that and are willing to maintain the
> > > > solution for your specific problem only.
> > > >
> > > > As I understand you are trying to solve multiple problems at once
> > > > here, and I think one should start with figuring out clearly what
> > > > those are before trying to address them, maybe without compromising on
> > > > security. So my guess is you want to address the following:
> > > >
> > > > 1. You don't want the whole big initrd to be read off disk on every
> > > >    boot, but only the parts of it that are actually needed.
> > > >
> > > > 2. You don't want the whole big initrd to be fully decompressed on every
> > > >    boot, but only the parts of it that are actually needed.
> > > >
> > > > 3. You want to share data between root fs and initrd
> > > >
> > > > 4. You want to save some boot time by not bringing up an init system
> > > >    in the initrd once, then tearing it down again, and starting it
> > > >    again from the root fs.
> > > >
> > > > For the items listed above I think you can find different solutions
> > > > which do not necessarily compromise security as much.
> > > >
> > > > So, in the list above you could address the latter three like this:
> > > >
> > > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > > >    loader load the erofs into contigous memory, then use memmap=X!Y on
> > > >    the kernel cmdline to synthesize a block device from that, which
> > > >    you then mount directly (without any initrd) via
> > > >    root=/dev/pmem0. This means yout boot loader will still load the
> > > >    whole image into memory, but only decompress the bits actually
> > > >    neeed. (It also has some other nice benefits I like, such as an
> > > >    immutable rootfs, which tmpfs-based initrds don't have.)
> > > >
> > > > 3. Simply never transition to the root fs, don't marke the initrds in
> > > >    systemd's eyes as an initrd (specifically: don't add an
> > > >    /etc/initrd-release file to it). Instead, just merge resources of
> > > >    the root fs into your initrd fs via overlayfs. systemd has
> > > >    infrastructure for this: "systemd-sysext". It takes immutable,
> > > >    authenticated erofs images (with verity, we call them "DDIs",
> > > >    i.e. "discoverable disk images") that it overlays into /usr/. [You
> > > >    could also very nicely combine this approach with systemd's
> > > >    portable services, and npsawn containers, which operate on the same
> > > >    authenticated images]. At MSFT we have a major product that works
> > > >    exactly like this: the OS runs off a rootfs that is loaded as an
> > > >    initrd, and everything that runs on top of this are just these
> > > >    verity disk images, using overlayfs and portable services.
> > > >
> > > > 4. The proposal in 3 also addresses goal 4.
> > > >
> > > > Which leaves item 1, which is a bit harder to address. We have been
> > > > discussing this off an on internally too. A generic solution to this
> > > > is hard. My current thinking for this could be something like this,
> > > > covering the UEFI world: support sticking a DDI for the main initrd in
> > > > the ESP. The ESP is per definition unencrypted and unauthenticated,
> > > > but otherwise relatively well defined, i.e. known to be vfat and
> > > > discoverable via UUID on a GPT disk. So: build a minimal
> > > > single-process initrd into the kernel (i.e. UKI) that has exactly the
> > > > storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> > > > drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> > > > jump into the rootfs stored in the ESP. That latter then has proper
> > > > file system drivers, storage drivers, crypto stack, and can unlock the
> > > > real root. This would still be a pretty specific solution to one set
> > > > of devices though, as it could not cover network boots (i.e. where
> > > > there is just no ESP to boot from), but I think this could be kept
> > > > relatively close, as the logic in that case could just fall back into
> > > > loading the DDI that normally would still in the ESP fully into
> > > > memory.
> > >
> > > I don't think this is "a pretty specific solution to one set of devices"
> > > _at all_.  To the contrary, it is _exactly_ what I want to see desktop
> > > systems moving to in the future.
> > >
> > > It solves the problem of large firmware images.  It solves the problem
> > > of device-specific configuration, because one can use a file on the EFI
> > > system partition that is read by userspace and either treated as
> > > untrusted or TPM-signed.
> >
> > All those problems are already solved, without inventing a new shell
> > scripting solution - we have DDIs and credentials. This is the exact
> > opposite of the direction we are pursuing: we want to _kill_ all these
> > initrd-specific infrastructure, tools, build systems, dependency
> > management and so on, because they are difficult to maintain, they
> > create a completely different environment that what is "normally" ran,
> > and they end up reinventing everything the 'normal' image does. We
> > want to build initrds from packages - as in normal distribution
> > packages, not special sauce initrd-only packages, so that the same
> > code and the same configuration is used everywhere, in different
> > runtime modes. Because that's what distributions are good to do:
> > creating package-based ecosystems, with good tooling, infrastructure
> > and so on.
> >
> > The end goal is to build images without initramfs-tools/dracut and
> > just using packages, not to stick yet another glue script in front of
> > them, that needs yet more special initrd-only arcane magic to put
> > together, in order to save a handful of KBs.
>
> The initramfs being a RAM filesystem is exactly why keeping it small is
> so critical.  Lennart's suggestion solves this problem by eagerly
> loading an image from disk, which is much less size-constrained.  One
> would use distribution packages to build this on-disk image.

This is already solved by using extension DDIs for optional packages.

> > And for ancient, legacy platforms that do not support modern APIs, the
> > old ways will still be there, and can be used. Nobody is going to take
> > away grub and dracut from the internet, if you got some special corner
> > case where you want to use it it will still be there, but the fact
> > that such corner cases exist cannot stop the rest of the ecosystem
> > that is targeted to modern hardware from evolving into something
> > better, more maintainable and more straightforward.
>
> The problem is not that UEFI is not usable in automotive systems.  The
> problem is that U-Boot (or any other UEFI implementation) is an extra
> stage in the boot process, slows things down, and has more attack
> surface.

Whatever firmware you use will have an attack surface, the interface
it provides - whether legacy bios or uefi-based - is irrelevant for
that. Skipping or reimplementing all the verity, tpm, etc logic also
increases the attack surface, as does adding initrd-only code that is
never tested and exercised outside of that limited context. If you are
running with legacy bios on ancient hardware you also will likely lack
tpm, secure boot, and so on, so it's all moot, any security argument
goes out of the window. If anybody cares about platform security, then
a tpm-capable and secureboot-capable firmware with a modern, usable
interface like uefi, running the same code in initrd and full system,
using dm-verity everywhere, is pretty much the best one can do.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11 20:58         ` Luca Boccassi
@ 2023-12-11 21:20           ` Demi Marie Obenour
  2023-12-11 21:45             ` Luca Boccassi
  2023-12-11 21:24           ` Eric Curtin
  1 sibling, 1 reply; 49+ messages in thread
From: Demi Marie Obenour @ 2023-12-11 21:20 UTC (permalink / raw)
  To: Luca Boccassi
  Cc: Lennart Poettering, Eric Curtin, initramfs, systemd-devel,
	Stephen Smoogen, Yariv Rachmani, Douglas Landgraf

[-- Attachment #1: Type: text/plain, Size: 11154 bytes --]

On Mon, Dec 11, 2023 at 08:58:58PM +0000, Luca Boccassi wrote:
> On Mon, 11 Dec 2023 at 20:43, Demi Marie Obenour
> <demi@invisiblethingslab.com> wrote:
> >
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA512
> >
> > On Mon, Dec 11, 2023 at 08:15:27PM +0000, Luca Boccassi wrote:
> > > On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour
> > > <demi@invisiblethingslab.com> wrote:
> > > >
> > > > On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > > > > On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
> > > > >
> > > > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > > > storage devices initialized. storage-init is a process that is not
> > > > > > designed to replace init, it does just enough to initialize storage
> > > > > > (performs a targeted udev trigger on storage), switches to
> > > > > > initoverlayfs as root and then executes init.
> > > > > >
> > > > > > ```
> > > > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > > > >
> > > > > > fw -> bootloader -> kernel -> storage-init   -> init ----------------->
> > > > > > ```
> > > > >
> > > > > I am not sure I follow what these chains are supposed to mean? Why are
> > > > > there two lines?
> > > > >
> > > > > So, I generally would agree that the current initrd scheme is not
> > > > > ideal, and we have been discussing better approaches. But I am not
> > > > > sure your approach really is useful on generic systems for two
> > > > > reasons:
> > > > >
> > > > > 1. no security model? you need to authenticate your initrd in
> > > > >    2023. There's no execuse to not doing that anymore these days. Not
> > > > >    in automotive, and not anywhere else really.
> > > > >
> > > > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > > > >    unlock their root disks with TPM2 and similar things. People use
> > > > >    RAID, LVM, and all that mess.
> > > > >
> > > > > Actually the above are kinda the same problem in a way: you need
> > > > > complex storage, but if you need that you kinda need udev, and
> > > > > services, and then also systemd and all that other stuff, and that's
> > > > > why the system works like the system works right now.
> > > > >
> > > > > Whenever you devise a system like yours by cutting corners, and
> > > > > declaring that you don't want TPM, you don't want signed initrds, you
> > > > > don't want to support weird storage, you just solve your problem in a
> > > > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > > > actually really work without all that and are willing to maintain the
> > > > > solution for your specific problem only.
> > > > >
> > > > > As I understand you are trying to solve multiple problems at once
> > > > > here, and I think one should start with figuring out clearly what
> > > > > those are before trying to address them, maybe without compromising on
> > > > > security. So my guess is you want to address the following:
> > > > >
> > > > > 1. You don't want the whole big initrd to be read off disk on every
> > > > >    boot, but only the parts of it that are actually needed.
> > > > >
> > > > > 2. You don't want the whole big initrd to be fully decompressed on every
> > > > >    boot, but only the parts of it that are actually needed.
> > > > >
> > > > > 3. You want to share data between root fs and initrd
> > > > >
> > > > > 4. You want to save some boot time by not bringing up an init system
> > > > >    in the initrd once, then tearing it down again, and starting it
> > > > >    again from the root fs.
> > > > >
> > > > > For the items listed above I think you can find different solutions
> > > > > which do not necessarily compromise security as much.
> > > > >
> > > > > So, in the list above you could address the latter three like this:
> > > > >
> > > > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > > > >    loader load the erofs into contigous memory, then use memmap=X!Y on
> > > > >    the kernel cmdline to synthesize a block device from that, which
> > > > >    you then mount directly (without any initrd) via
> > > > >    root=/dev/pmem0. This means yout boot loader will still load the
> > > > >    whole image into memory, but only decompress the bits actually
> > > > >    neeed. (It also has some other nice benefits I like, such as an
> > > > >    immutable rootfs, which tmpfs-based initrds don't have.)
> > > > >
> > > > > 3. Simply never transition to the root fs, don't marke the initrds in
> > > > >    systemd's eyes as an initrd (specifically: don't add an
> > > > >    /etc/initrd-release file to it). Instead, just merge resources of
> > > > >    the root fs into your initrd fs via overlayfs. systemd has
> > > > >    infrastructure for this: "systemd-sysext". It takes immutable,
> > > > >    authenticated erofs images (with verity, we call them "DDIs",
> > > > >    i.e. "discoverable disk images") that it overlays into /usr/. [You
> > > > >    could also very nicely combine this approach with systemd's
> > > > >    portable services, and npsawn containers, which operate on the same
> > > > >    authenticated images]. At MSFT we have a major product that works
> > > > >    exactly like this: the OS runs off a rootfs that is loaded as an
> > > > >    initrd, and everything that runs on top of this are just these
> > > > >    verity disk images, using overlayfs and portable services.
> > > > >
> > > > > 4. The proposal in 3 also addresses goal 4.
> > > > >
> > > > > Which leaves item 1, which is a bit harder to address. We have been
> > > > > discussing this off an on internally too. A generic solution to this
> > > > > is hard. My current thinking for this could be something like this,
> > > > > covering the UEFI world: support sticking a DDI for the main initrd in
> > > > > the ESP. The ESP is per definition unencrypted and unauthenticated,
> > > > > but otherwise relatively well defined, i.e. known to be vfat and
> > > > > discoverable via UUID on a GPT disk. So: build a minimal
> > > > > single-process initrd into the kernel (i.e. UKI) that has exactly the
> > > > > storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> > > > > drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> > > > > jump into the rootfs stored in the ESP. That latter then has proper
> > > > > file system drivers, storage drivers, crypto stack, and can unlock the
> > > > > real root. This would still be a pretty specific solution to one set
> > > > > of devices though, as it could not cover network boots (i.e. where
> > > > > there is just no ESP to boot from), but I think this could be kept
> > > > > relatively close, as the logic in that case could just fall back into
> > > > > loading the DDI that normally would still in the ESP fully into
> > > > > memory.
> > > >
> > > > I don't think this is "a pretty specific solution to one set of devices"
> > > > _at all_.  To the contrary, it is _exactly_ what I want to see desktop
> > > > systems moving to in the future.
> > > >
> > > > It solves the problem of large firmware images.  It solves the problem
> > > > of device-specific configuration, because one can use a file on the EFI
> > > > system partition that is read by userspace and either treated as
> > > > untrusted or TPM-signed.
> > >
> > > All those problems are already solved, without inventing a new shell
> > > scripting solution - we have DDIs and credentials. This is the exact
> > > opposite of the direction we are pursuing: we want to _kill_ all these
> > > initrd-specific infrastructure, tools, build systems, dependency
> > > management and so on, because they are difficult to maintain, they
> > > create a completely different environment that what is "normally" ran,
> > > and they end up reinventing everything the 'normal' image does. We
> > > want to build initrds from packages - as in normal distribution
> > > packages, not special sauce initrd-only packages, so that the same
> > > code and the same configuration is used everywhere, in different
> > > runtime modes. Because that's what distributions are good to do:
> > > creating package-based ecosystems, with good tooling, infrastructure
> > > and so on.
> > >
> > > The end goal is to build images without initramfs-tools/dracut and
> > > just using packages, not to stick yet another glue script in front of
> > > them, that needs yet more special initrd-only arcane magic to put
> > > together, in order to save a handful of KBs.
> >
> > The initramfs being a RAM filesystem is exactly why keeping it small is
> > so critical.  Lennart's suggestion solves this problem by eagerly
> > loading an image from disk, which is much less size-constrained.  One
> > would use distribution packages to build this on-disk image.
> 
> This is already solved by using extension DDIs for optional packages.

What about non-optional packages?  The goal is to _require_ the on-disk
image to boot, so that full-featured UI toolkits can be used to e.g.
prompt for LUKS passphrases.  Ideally, the initramfs would be as minimal
as possible.

> > > And for ancient, legacy platforms that do not support modern APIs, the
> > > old ways will still be there, and can be used. Nobody is going to take
> > > away grub and dracut from the internet, if you got some special corner
> > > case where you want to use it it will still be there, but the fact
> > > that such corner cases exist cannot stop the rest of the ecosystem
> > > that is targeted to modern hardware from evolving into something
> > > better, more maintainable and more straightforward.
> >
> > The problem is not that UEFI is not usable in automotive systems.  The
> > problem is that U-Boot (or any other UEFI implementation) is an extra
> > stage in the boot process, slows things down, and has more attack
> > surface.
> 
> Whatever firmware you use will have an attack surface, the interface
> it provides - whether legacy bios or uefi-based - is irrelevant for
> that. Skipping or reimplementing all the verity, tpm, etc logic also
> increases the attack surface, as does adding initrd-only code that is
> never tested and exercised outside of that limited context. If you are
> running with legacy bios on ancient hardware you also will likely lack
> tpm, secure boot, and so on, so it's all moot, any security argument
> goes out of the window. If anybody cares about platform security, then
> a tpm-capable and secureboot-capable firmware with a modern, usable
> interface like uefi, running the same code in initrd and full system,
> using dm-verity everywhere, is pretty much the best one can do.

Neither Chrome OS devices nor Macs with Apple silicon use UEFI, and both
have better platform security than any UEFI-based device on the market I
am aware of.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11 20:58         ` Luca Boccassi
  2023-12-11 21:20           ` Demi Marie Obenour
@ 2023-12-11 21:24           ` Eric Curtin
  1 sibling, 0 replies; 49+ messages in thread
From: Eric Curtin @ 2023-12-11 21:24 UTC (permalink / raw)
  To: Luca Boccassi
  Cc: Demi Marie Obenour, Lennart Poettering, initramfs, systemd-devel,
	Stephen Smoogen, Yariv Rachmani, Douglas Landgraf

On Mon, 11 Dec 2023 at 20:59, Luca Boccassi <bluca@debian.org> wrote:
>
> On Mon, 11 Dec 2023 at 20:43, Demi Marie Obenour
> <demi@invisiblethingslab.com> wrote:
> >
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA512
> >
> > On Mon, Dec 11, 2023 at 08:15:27PM +0000, Luca Boccassi wrote:
> > > On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour
> > > <demi@invisiblethingslab.com> wrote:
> > > >
> > > > On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > > > > On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
> > > > >
> > > > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > > > storage devices initialized. storage-init is a process that is not
> > > > > > designed to replace init, it does just enough to initialize storage
> > > > > > (performs a targeted udev trigger on storage), switches to
> > > > > > initoverlayfs as root and then executes init.
> > > > > >
> > > > > > ```
> > > > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > > > >
> > > > > > fw -> bootloader -> kernel -> storage-init   -> init ----------------->
> > > > > > ```
> > > > >
> > > > > I am not sure I follow what these chains are supposed to mean? Why are
> > > > > there two lines?
> > > > >
> > > > > So, I generally would agree that the current initrd scheme is not
> > > > > ideal, and we have been discussing better approaches. But I am not
> > > > > sure your approach really is useful on generic systems for two
> > > > > reasons:
> > > > >
> > > > > 1. no security model? you need to authenticate your initrd in
> > > > >    2023. There's no execuse to not doing that anymore these days. Not
> > > > >    in automotive, and not anywhere else really.
> > > > >
> > > > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > > > >    unlock their root disks with TPM2 and similar things. People use
> > > > >    RAID, LVM, and all that mess.
> > > > >
> > > > > Actually the above are kinda the same problem in a way: you need
> > > > > complex storage, but if you need that you kinda need udev, and
> > > > > services, and then also systemd and all that other stuff, and that's
> > > > > why the system works like the system works right now.
> > > > >
> > > > > Whenever you devise a system like yours by cutting corners, and
> > > > > declaring that you don't want TPM, you don't want signed initrds, you
> > > > > don't want to support weird storage, you just solve your problem in a
> > > > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > > > actually really work without all that and are willing to maintain the
> > > > > solution for your specific problem only.
> > > > >
> > > > > As I understand you are trying to solve multiple problems at once
> > > > > here, and I think one should start with figuring out clearly what
> > > > > those are before trying to address them, maybe without compromising on
> > > > > security. So my guess is you want to address the following:
> > > > >
> > > > > 1. You don't want the whole big initrd to be read off disk on every
> > > > >    boot, but only the parts of it that are actually needed.
> > > > >
> > > > > 2. You don't want the whole big initrd to be fully decompressed on every
> > > > >    boot, but only the parts of it that are actually needed.
> > > > >
> > > > > 3. You want to share data between root fs and initrd
> > > > >
> > > > > 4. You want to save some boot time by not bringing up an init system
> > > > >    in the initrd once, then tearing it down again, and starting it
> > > > >    again from the root fs.
> > > > >
> > > > > For the items listed above I think you can find different solutions
> > > > > which do not necessarily compromise security as much.
> > > > >
> > > > > So, in the list above you could address the latter three like this:
> > > > >
> > > > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > > > >    loader load the erofs into contigous memory, then use memmap=X!Y on
> > > > >    the kernel cmdline to synthesize a block device from that, which
> > > > >    you then mount directly (without any initrd) via
> > > > >    root=/dev/pmem0. This means yout boot loader will still load the
> > > > >    whole image into memory, but only decompress the bits actually
> > > > >    neeed. (It also has some other nice benefits I like, such as an
> > > > >    immutable rootfs, which tmpfs-based initrds don't have.)
> > > > >
> > > > > 3. Simply never transition to the root fs, don't marke the initrds in
> > > > >    systemd's eyes as an initrd (specifically: don't add an
> > > > >    /etc/initrd-release file to it). Instead, just merge resources of
> > > > >    the root fs into your initrd fs via overlayfs. systemd has
> > > > >    infrastructure for this: "systemd-sysext". It takes immutable,
> > > > >    authenticated erofs images (with verity, we call them "DDIs",
> > > > >    i.e. "discoverable disk images") that it overlays into /usr/. [You
> > > > >    could also very nicely combine this approach with systemd's
> > > > >    portable services, and npsawn containers, which operate on the same
> > > > >    authenticated images]. At MSFT we have a major product that works
> > > > >    exactly like this: the OS runs off a rootfs that is loaded as an
> > > > >    initrd, and everything that runs on top of this are just these
> > > > >    verity disk images, using overlayfs and portable services.
> > > > >
> > > > > 4. The proposal in 3 also addresses goal 4.
> > > > >
> > > > > Which leaves item 1, which is a bit harder to address. We have been
> > > > > discussing this off an on internally too. A generic solution to this
> > > > > is hard. My current thinking for this could be something like this,
> > > > > covering the UEFI world: support sticking a DDI for the main initrd in
> > > > > the ESP. The ESP is per definition unencrypted and unauthenticated,
> > > > > but otherwise relatively well defined, i.e. known to be vfat and
> > > > > discoverable via UUID on a GPT disk. So: build a minimal
> > > > > single-process initrd into the kernel (i.e. UKI) that has exactly the
> > > > > storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> > > > > drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> > > > > jump into the rootfs stored in the ESP. That latter then has proper
> > > > > file system drivers, storage drivers, crypto stack, and can unlock the
> > > > > real root. This would still be a pretty specific solution to one set
> > > > > of devices though, as it could not cover network boots (i.e. where
> > > > > there is just no ESP to boot from), but I think this could be kept
> > > > > relatively close, as the logic in that case could just fall back into
> > > > > loading the DDI that normally would still in the ESP fully into
> > > > > memory.
> > > >
> > > > I don't think this is "a pretty specific solution to one set of devices"
> > > > _at all_.  To the contrary, it is _exactly_ what I want to see desktop
> > > > systems moving to in the future.
> > > >
> > > > It solves the problem of large firmware images.  It solves the problem
> > > > of device-specific configuration, because one can use a file on the EFI
> > > > system partition that is read by userspace and either treated as
> > > > untrusted or TPM-signed.
> > >
> > > All those problems are already solved, without inventing a new shell
> > > scripting solution - we have DDIs and credentials. This is the exact
> > > opposite of the direction we are pursuing: we want to _kill_ all these
> > > initrd-specific infrastructure, tools, build systems, dependency
> > > management and so on, because they are difficult to maintain, they
> > > create a completely different environment that what is "normally" ran,
> > > and they end up reinventing everything the 'normal' image does. We
> > > want to build initrds from packages - as in normal distribution
> > > packages, not special sauce initrd-only packages, so that the same
> > > code and the same configuration is used everywhere, in different
> > > runtime modes. Because that's what distributions are good to do:
> > > creating package-based ecosystems, with good tooling, infrastructure
> > > and so on.
> > >
> > > The end goal is to build images without initramfs-tools/dracut and
> > > just using packages, not to stick yet another glue script in front of
> > > them, that needs yet more special initrd-only arcane magic to put
> > > together, in order to save a handful of KBs.
> >
> > The initramfs being a RAM filesystem is exactly why keeping it small is
> > so critical.  Lennart's suggestion solves this problem by eagerly
> > loading an image from disk, which is much less size-constrained.  One
> > would use distribution packages to build this on-disk image.
>
> This is already solved by using extension DDIs for optional packages.
>
> > > And for ancient, legacy platforms that do not support modern APIs, the
> > > old ways will still be there, and can be used. Nobody is going to take
> > > away grub and dracut from the internet, if you got some special corner
> > > case where you want to use it it will still be there, but the fact
> > > that such corner cases exist cannot stop the rest of the ecosystem
> > > that is targeted to modern hardware from evolving into something
> > > better, more maintainable and more straightforward.
> >
> > The problem is not that UEFI is not usable in automotive systems.  The
> > problem is that U-Boot (or any other UEFI implementation) is an extra
> > stage in the boot process, slows things down, and has more attack
> > surface.
>
> Whatever firmware you use will have an attack surface, the interface
> it provides - whether legacy bios or uefi-based - is irrelevant for
> that. Skipping or reimplementing all the verity, tpm, etc logic also
> increases the attack surface, as does adding initrd-only code that is
> never tested and exercised outside of that limited context. If you are
> running with legacy bios on ancient hardware you also will likely lack
> tpm, secure boot, and so on, so it's all moot, any security argument
> goes out of the window. If anybody cares about platform security, then
> a tpm-capable and secureboot-capable firmware with a modern, usable
> interface like uefi, running the same code in initrd and full system,
> using dm-verity everywhere, is pretty much the best one can do.

I am unsure how many new systems are being developed with legacy BIOS,
but alternative firmware platforms do exist that are just as secure as
UEFI, Android Boot Image format is one for example. I am pretty sure
UKIs took influence from that format either directly or indirectly.

Everything in x86 is UEFI, but other architectures like ARM are important.

And when you are deploying on ARM it can be pretty hard to tell
partners how to boot pre-Linux kernel if you are an OS distributor,
which makes it pretty hard to assume grub, sd-boot, sd-stub, etc.

But what you can do is design from Linux kernel boot onwards to the
best of your ability, and I think kernel-space and user-space could
benefit from just decompressing the bytes you use.

Whether we use dracut or something that composes initramfs using rpms,
These structures/containers/etc. are just filesystems at the end of
the day.

Is mise le meas/Regards,

Eric Curtin


>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11 21:20           ` Demi Marie Obenour
@ 2023-12-11 21:45             ` Luca Boccassi
  2023-12-12  3:47               ` Paul Menzel
                                 ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Luca Boccassi @ 2023-12-11 21:45 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Lennart Poettering, Eric Curtin, initramfs, systemd-devel,
	Stephen Smoogen, Yariv Rachmani, Douglas Landgraf

On Mon, 11 Dec 2023 at 21:20, Demi Marie Obenour
<demi@invisiblethingslab.com> wrote:
>
> On Mon, Dec 11, 2023 at 08:58:58PM +0000, Luca Boccassi wrote:
> > On Mon, 11 Dec 2023 at 20:43, Demi Marie Obenour
> > <demi@invisiblethingslab.com> wrote:
> > >
> > > -----BEGIN PGP SIGNED MESSAGE-----
> > > Hash: SHA512
> > >
> > > On Mon, Dec 11, 2023 at 08:15:27PM +0000, Luca Boccassi wrote:
> > > > On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour
> > > > <demi@invisiblethingslab.com> wrote:
> > > > >
> > > > > On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > > > > > On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
> > > > > >
> > > > > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > > > > storage devices initialized. storage-init is a process that is not
> > > > > > > designed to replace init, it does just enough to initialize storage
> > > > > > > (performs a targeted udev trigger on storage), switches to
> > > > > > > initoverlayfs as root and then executes init.
> > > > > > >
> > > > > > > ```
> > > > > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > > > > >
> > > > > > > fw -> bootloader -> kernel -> storage-init   -> init ----------------->
> > > > > > > ```
> > > > > >
> > > > > > I am not sure I follow what these chains are supposed to mean? Why are
> > > > > > there two lines?
> > > > > >
> > > > > > So, I generally would agree that the current initrd scheme is not
> > > > > > ideal, and we have been discussing better approaches. But I am not
> > > > > > sure your approach really is useful on generic systems for two
> > > > > > reasons:
> > > > > >
> > > > > > 1. no security model? you need to authenticate your initrd in
> > > > > >    2023. There's no execuse to not doing that anymore these days. Not
> > > > > >    in automotive, and not anywhere else really.
> > > > > >
> > > > > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > > > > >    unlock their root disks with TPM2 and similar things. People use
> > > > > >    RAID, LVM, and all that mess.
> > > > > >
> > > > > > Actually the above are kinda the same problem in a way: you need
> > > > > > complex storage, but if you need that you kinda need udev, and
> > > > > > services, and then also systemd and all that other stuff, and that's
> > > > > > why the system works like the system works right now.
> > > > > >
> > > > > > Whenever you devise a system like yours by cutting corners, and
> > > > > > declaring that you don't want TPM, you don't want signed initrds, you
> > > > > > don't want to support weird storage, you just solve your problem in a
> > > > > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > > > > actually really work without all that and are willing to maintain the
> > > > > > solution for your specific problem only.
> > > > > >
> > > > > > As I understand you are trying to solve multiple problems at once
> > > > > > here, and I think one should start with figuring out clearly what
> > > > > > those are before trying to address them, maybe without compromising on
> > > > > > security. So my guess is you want to address the following:
> > > > > >
> > > > > > 1. You don't want the whole big initrd to be read off disk on every
> > > > > >    boot, but only the parts of it that are actually needed.
> > > > > >
> > > > > > 2. You don't want the whole big initrd to be fully decompressed on every
> > > > > >    boot, but only the parts of it that are actually needed.
> > > > > >
> > > > > > 3. You want to share data between root fs and initrd
> > > > > >
> > > > > > 4. You want to save some boot time by not bringing up an init system
> > > > > >    in the initrd once, then tearing it down again, and starting it
> > > > > >    again from the root fs.
> > > > > >
> > > > > > For the items listed above I think you can find different solutions
> > > > > > which do not necessarily compromise security as much.
> > > > > >
> > > > > > So, in the list above you could address the latter three like this:
> > > > > >
> > > > > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > > > > >    loader load the erofs into contigous memory, then use memmap=X!Y on
> > > > > >    the kernel cmdline to synthesize a block device from that, which
> > > > > >    you then mount directly (without any initrd) via
> > > > > >    root=/dev/pmem0. This means yout boot loader will still load the
> > > > > >    whole image into memory, but only decompress the bits actually
> > > > > >    neeed. (It also has some other nice benefits I like, such as an
> > > > > >    immutable rootfs, which tmpfs-based initrds don't have.)
> > > > > >
> > > > > > 3. Simply never transition to the root fs, don't marke the initrds in
> > > > > >    systemd's eyes as an initrd (specifically: don't add an
> > > > > >    /etc/initrd-release file to it). Instead, just merge resources of
> > > > > >    the root fs into your initrd fs via overlayfs. systemd has
> > > > > >    infrastructure for this: "systemd-sysext". It takes immutable,
> > > > > >    authenticated erofs images (with verity, we call them "DDIs",
> > > > > >    i.e. "discoverable disk images") that it overlays into /usr/. [You
> > > > > >    could also very nicely combine this approach with systemd's
> > > > > >    portable services, and npsawn containers, which operate on the same
> > > > > >    authenticated images]. At MSFT we have a major product that works
> > > > > >    exactly like this: the OS runs off a rootfs that is loaded as an
> > > > > >    initrd, and everything that runs on top of this are just these
> > > > > >    verity disk images, using overlayfs and portable services.
> > > > > >
> > > > > > 4. The proposal in 3 also addresses goal 4.
> > > > > >
> > > > > > Which leaves item 1, which is a bit harder to address. We have been
> > > > > > discussing this off an on internally too. A generic solution to this
> > > > > > is hard. My current thinking for this could be something like this,
> > > > > > covering the UEFI world: support sticking a DDI for the main initrd in
> > > > > > the ESP. The ESP is per definition unencrypted and unauthenticated,
> > > > > > but otherwise relatively well defined, i.e. known to be vfat and
> > > > > > discoverable via UUID on a GPT disk. So: build a minimal
> > > > > > single-process initrd into the kernel (i.e. UKI) that has exactly the
> > > > > > storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> > > > > > drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> > > > > > jump into the rootfs stored in the ESP. That latter then has proper
> > > > > > file system drivers, storage drivers, crypto stack, and can unlock the
> > > > > > real root. This would still be a pretty specific solution to one set
> > > > > > of devices though, as it could not cover network boots (i.e. where
> > > > > > there is just no ESP to boot from), but I think this could be kept
> > > > > > relatively close, as the logic in that case could just fall back into
> > > > > > loading the DDI that normally would still in the ESP fully into
> > > > > > memory.
> > > > >
> > > > > I don't think this is "a pretty specific solution to one set of devices"
> > > > > _at all_.  To the contrary, it is _exactly_ what I want to see desktop
> > > > > systems moving to in the future.
> > > > >
> > > > > It solves the problem of large firmware images.  It solves the problem
> > > > > of device-specific configuration, because one can use a file on the EFI
> > > > > system partition that is read by userspace and either treated as
> > > > > untrusted or TPM-signed.
> > > >
> > > > All those problems are already solved, without inventing a new shell
> > > > scripting solution - we have DDIs and credentials. This is the exact
> > > > opposite of the direction we are pursuing: we want to _kill_ all these
> > > > initrd-specific infrastructure, tools, build systems, dependency
> > > > management and so on, because they are difficult to maintain, they
> > > > create a completely different environment that what is "normally" ran,
> > > > and they end up reinventing everything the 'normal' image does. We
> > > > want to build initrds from packages - as in normal distribution
> > > > packages, not special sauce initrd-only packages, so that the same
> > > > code and the same configuration is used everywhere, in different
> > > > runtime modes. Because that's what distributions are good to do:
> > > > creating package-based ecosystems, with good tooling, infrastructure
> > > > and so on.
> > > >
> > > > The end goal is to build images without initramfs-tools/dracut and
> > > > just using packages, not to stick yet another glue script in front of
> > > > them, that needs yet more special initrd-only arcane magic to put
> > > > together, in order to save a handful of KBs.
> > >
> > > The initramfs being a RAM filesystem is exactly why keeping it small is
> > > so critical.  Lennart's suggestion solves this problem by eagerly
> > > loading an image from disk, which is much less size-constrained.  One
> > > would use distribution packages to build this on-disk image.
> >
> > This is already solved by using extension DDIs for optional packages.
>
> What about non-optional packages?  The goal is to _require_ the on-disk
> image to boot, so that full-featured UI toolkits can be used to e.g.
> prompt for LUKS passphrases.  Ideally, the initramfs would be as minimal
> as possible.

You can use DDIs for anything you want, outside of systemd itself

> > > > And for ancient, legacy platforms that do not support modern APIs, the
> > > > old ways will still be there, and can be used. Nobody is going to take
> > > > away grub and dracut from the internet, if you got some special corner
> > > > case where you want to use it it will still be there, but the fact
> > > > that such corner cases exist cannot stop the rest of the ecosystem
> > > > that is targeted to modern hardware from evolving into something
> > > > better, more maintainable and more straightforward.
> > >
> > > The problem is not that UEFI is not usable in automotive systems.  The
> > > problem is that U-Boot (or any other UEFI implementation) is an extra
> > > stage in the boot process, slows things down, and has more attack
> > > surface.
> >
> > Whatever firmware you use will have an attack surface, the interface
> > it provides - whether legacy bios or uefi-based - is irrelevant for
> > that. Skipping or reimplementing all the verity, tpm, etc logic also
> > increases the attack surface, as does adding initrd-only code that is
> > never tested and exercised outside of that limited context. If you are
> > running with legacy bios on ancient hardware you also will likely lack
> > tpm, secure boot, and so on, so it's all moot, any security argument
> > goes out of the window. If anybody cares about platform security, then
> > a tpm-capable and secureboot-capable firmware with a modern, usable
> > interface like uefi, running the same code in initrd and full system,
> > using dm-verity everywhere, is pretty much the best one can do.
>
> Neither Chrome OS devices nor Macs with Apple silicon use UEFI, and both
> have better platform security than any UEFI-based device on the market I
> am aware of.

We are talking about Linux distributions here. If one wants to use
proprietary systems, sure, there are better things out there, but
that's off topic.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11 21:45             ` Luca Boccassi
@ 2023-12-12  3:47               ` Paul Menzel
  2023-12-12  3:56               ` Paul Menzel
  2023-12-12 15:26               ` Paul Menzel
  2 siblings, 0 replies; 49+ messages in thread
From: Paul Menzel @ 2023-12-12  3:47 UTC (permalink / raw)
  To: Luca Boccassi
  Cc: Demi Marie Obenour, initramfs, systemd-devel, Eric Curtin,
	Yariv Rachmani, Lennart Poettering, Douglas Landgraf,
	Stephen Smoogen

Dear Luca,


Am 11.12.23 um 22:45 schrieb Luca Boccassi:
> On Mon, 11 Dec 2023 at 21:20, Demi Marie Obenour wrote:
>>
>> On Mon, Dec 11, 2023 at 08:58:58PM +0000, Luca Boccassi wrote:
>>> On Mon, 11 Dec 2023 at 20:43, Demi Marie Obenour wrote:

>>>> On Mon, Dec 11, 2023 at 08:15:27PM +0000, Luca Boccassi wrote:
>>>>> On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour

>>>>>> On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
>>>>>>> On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:

[…]

>>>>> And for ancient, legacy platforms that do not support modern APIs, the
>>>>> old ways will still be there, and can be used. Nobody is going to take
>>>>> away grub and dracut from the internet, if you got some special corner
>>>>> case where you want to use it it will still be there, but the fact
>>>>> that such corner cases exist cannot stop the rest of the ecosystem
>>>>> that is targeted to modern hardware from evolving into something
>>>>> better, more maintainable and more straightforward.
>>>>
>>>> The problem is not that UEFI is not usable in automotive systems.  The
>>>> problem is that U-Boot (or any other UEFI implementation) is an extra
>>>> stage in the boot process, slows things down, and has more attack
>>>> surface.
>>>
>>> Whatever firmware you use will have an attack surface, the interface
>>> it provides - whether legacy bios or uefi-based - is irrelevant for
>>> that. Skipping or reimplementing all the verity, tpm, etc logic also
>>> increases the attack surface, as does adding initrd-only code that is
>>> never tested and exercised outside of that limited context. If you are
>>> running with legacy bios on ancient hardware you also will likely lack
>>> tpm, secure boot, and so on, so it's all moot, any security argument
>>> goes out of the window. If anybody cares about platform security, then
>>> a tpm-capable and secureboot-capable firmware with a modern, usable
>>> interface like uefi, running the same code in initrd and full system,
>>> using dm-verity everywhere, is pretty much the best one can do.
>>
>> Neither Chrome OS devices nor Macs with Apple silicon use UEFI, and both
>> have better platform security than any UEFI-based device on the market I
>> am aware of.
> 
> We are talking about Linux distributions here. If one wants to use
> proprietary systems, sure, there are better things out there, but
> that's off topic.

In what way is ChromeOS more proprietary than the other GNU/Linux 
distributions, that allow to install the Chrome browser?


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11 21:45             ` Luca Boccassi
  2023-12-12  3:47               ` Paul Menzel
@ 2023-12-12  3:56               ` Paul Menzel
  2023-12-12 15:26               ` Paul Menzel
  2 siblings, 0 replies; 49+ messages in thread
From: Paul Menzel @ 2023-12-12  3:56 UTC (permalink / raw)
  To: Luca Boccassi
  Cc: Demi Marie Obenour, initramfs, systemd-devel, Eric Curtin,
	Yariv Rachmani, Lennart Poettering, Douglas Landgraf,
	Stephen Smoogen

Dear Luca,


Am 11.12.23 um 22:45 schrieb Luca Boccassi:
> On Mon, 11 Dec 2023 at 21:20, Demi Marie Obenour wrote:
>>
>> On Mon, Dec 11, 2023 at 08:58:58PM +0000, Luca Boccassi wrote:
>>> On Mon, 11 Dec 2023 at 20:43, Demi Marie Obenour wrote:

>>>> On Mon, Dec 11, 2023 at 08:15:27PM +0000, Luca Boccassi wrote:
>>>>> On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour

>>>>>> On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
>>>>>>> On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:

[…]

>>>>> And for ancient, legacy platforms that do not support modern APIs, the
>>>>> old ways will still be there, and can be used. Nobody is going to take
>>>>> away grub and dracut from the internet, if you got some special corner
>>>>> case where you want to use it it will still be there, but the fact
>>>>> that such corner cases exist cannot stop the rest of the ecosystem
>>>>> that is targeted to modern hardware from evolving into something
>>>>> better, more maintainable and more straightforward.
>>>>
>>>> The problem is not that UEFI is not usable in automotive systems.  The
>>>> problem is that U-Boot (or any other UEFI implementation) is an extra
>>>> stage in the boot process, slows things down, and has more attack
>>>> surface.
>>>
>>> Whatever firmware you use will have an attack surface, the interface
>>> it provides - whether legacy bios or uefi-based - is irrelevant for
>>> that. Skipping or reimplementing all the verity, tpm, etc logic also
>>> increases the attack surface, as does adding initrd-only code that is
>>> never tested and exercised outside of that limited context. If you are
>>> running with legacy bios on ancient hardware you also will likely lack
>>> tpm, secure boot, and so on, so it's all moot, any security argument
>>> goes out of the window. If anybody cares about platform security, then
>>> a tpm-capable and secureboot-capable firmware with a modern, usable
>>> interface like uefi, running the same code in initrd and full system,
>>> using dm-verity everywhere, is pretty much the best one can do.
>>
>> Neither Chrome OS devices nor Macs with Apple silicon use UEFI, and both
>> have better platform security than any UEFI-based device on the market I
>> am aware of.
> 
> We are talking about Linux distributions here. If one wants to use
> proprietary systems, sure, there are better things out there, but
> that's off topic.

In what way is ChromeOS more proprietary than the other GNU/Linux 
distributions, that allow to install the Chrome browser?


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11 21:45             ` Luca Boccassi
  2023-12-12  3:47               ` Paul Menzel
  2023-12-12  3:56               ` Paul Menzel
@ 2023-12-12 15:26               ` Paul Menzel
  2 siblings, 0 replies; 49+ messages in thread
From: Paul Menzel @ 2023-12-12 15:26 UTC (permalink / raw)
  To: Luca Boccassi
  Cc: Demi Marie Obenour, initramfs, systemd-devel, Eric Curtin,
	Yariv Rachmani, Lennart Poettering, Douglas Landgraf,
	Stephen Smoogen

[Sorry for the spam to the people in Cc. Now the real address.]

Dear Luca,


Am 11.12.23 um 22:45 schrieb Luca Boccassi:
> On Mon, 11 Dec 2023 at 21:20, Demi Marie Obenour wrote:
>>
>> On Mon, Dec 11, 2023 at 08:58:58PM +0000, Luca Boccassi wrote:
>>> On Mon, 11 Dec 2023 at 20:43, Demi Marie Obenour wrote:

>>>> On Mon, Dec 11, 2023 at 08:15:27PM +0000, Luca Boccassi wrote:
>>>>> On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour

>>>>>> On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
>>>>>>> On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:

[…]

>>>>> And for ancient, legacy platforms that do not support modern APIs, the
>>>>> old ways will still be there, and can be used. Nobody is going to take
>>>>> away grub and dracut from the internet, if you got some special corner
>>>>> case where you want to use it it will still be there, but the fact
>>>>> that such corner cases exist cannot stop the rest of the ecosystem
>>>>> that is targeted to modern hardware from evolving into something
>>>>> better, more maintainable and more straightforward.
>>>>
>>>> The problem is not that UEFI is not usable in automotive systems.  The
>>>> problem is that U-Boot (or any other UEFI implementation) is an extra
>>>> stage in the boot process, slows things down, and has more attack
>>>> surface.
>>>
>>> Whatever firmware you use will have an attack surface, the interface
>>> it provides - whether legacy bios or uefi-based - is irrelevant for
>>> that. Skipping or reimplementing all the verity, tpm, etc logic also
>>> increases the attack surface, as does adding initrd-only code that is
>>> never tested and exercised outside of that limited context. If you are
>>> running with legacy bios on ancient hardware you also will likely lack
>>> tpm, secure boot, and so on, so it's all moot, any security argument
>>> goes out of the window. If anybody cares about platform security, then
>>> a tpm-capable and secureboot-capable firmware with a modern, usable
>>> interface like uefi, running the same code in initrd and full system,
>>> using dm-verity everywhere, is pretty much the best one can do.
>>
>> Neither Chrome OS devices nor Macs with Apple silicon use UEFI, and both
>> have better platform security than any UEFI-based device on the market I
>> am aware of.
> 
> We are talking about Linux distributions here. If one wants to use
> proprietary systems, sure, there are better things out there, but
> that's off topic.

In what way is ChromeOS more proprietary than the other GNU/Linux 
distributions, that allow to install the Chrome browser?


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11 12:48         ` Eric Curtin
  2023-12-11 12:52           ` Eric Curtin
@ 2023-12-12 17:37           ` Lennart Poettering
  2023-12-12 17:40           ` Lennart Poettering
  2 siblings, 0 replies; 49+ messages in thread
From: Lennart Poettering @ 2023-12-12 17:37 UTC (permalink / raw)
  To: Eric Curtin
  Cc: systemd-devel, initramfs, Yariv Rachmani, Stephen Smoogen,
	Douglas Landgraf

On Mo, 11.12.23 12:48, Eric Curtin (ecurtin@redhat.com) wrote:

> Sort of yes, but preferably using that __initramfs_start /
> initrd_start buffer as is without copying any bytes anywhere else and
> without teaching the bootloaders to do things.
>
> The "memmap=" approach you suggested sounds like what we are thinking,
> but do you think we could do this without teaching bootloaders to do
> new things?

Well, in a standard UEFI world it would suffice to teach the memmap=
logic to the stub that is glued in front of the kernel. For example,
make sd-stub find the erofs initrd in the UKI, then trivially
synthesize a memmap= switch and append it to the kernel command line.

but of course, you don't believe in UEFI or good boot loaders, so you
kinda dug your own grave here...

(The main reason why sd-stub doesn't actually support erofs-initrds,
is that sd-stub also generates initrd cpios on the fly, to pass
credentials and system extension images to the kernel, and you can't
really mix erofs and cpio initrds into one)

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11 12:48         ` Eric Curtin
  2023-12-11 12:52           ` Eric Curtin
  2023-12-12 17:37           ` Lennart Poettering
@ 2023-12-12 17:40           ` Lennart Poettering
  2023-12-12 19:05             ` Demi Marie Obenour
  2 siblings, 1 reply; 49+ messages in thread
From: Lennart Poettering @ 2023-12-12 17:40 UTC (permalink / raw)
  To: Eric Curtin
  Cc: systemd-devel, initramfs, Yariv Rachmani, Stephen Smoogen,
	Douglas Landgraf

On Mo, 11.12.23 12:48, Eric Curtin (ecurtin@redhat.com) wrote:

> Although the nice thing about a storage-init like approach is there's
> basically zero copies up front. What storage-init is trying to be, is
> a tool to just call systemd storage things, without also inheriting
> all the systemd stack.

Just to make this clear: using things like systemd-cryptsetup outside
of the systemd stack is not going to work once you leave trivial
setups. i.e. the TPM hookup involves multiple services these days, and
it's not going to get any simpler. i.e. systemd-tpm2-setup,
systemd-pcrextend, systemd-pcrlock and so on. I am sorry, but doing
reasonable disk encryption with TPM involved means you either buy into
the whole systemd offer (i.e. with the service manager) or you have to
rewrite your own systemd.

But maybe I am misunderstanding what you are saying here.

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11 16:28   ` Demi Marie Obenour
                       ` (2 preceding siblings ...)
  2023-12-11 20:15     ` Luca Boccassi
@ 2023-12-12 17:50     ` Lennart Poettering
  3 siblings, 0 replies; 49+ messages in thread
From: Lennart Poettering @ 2023-12-12 17:50 UTC (permalink / raw)
  To: Demi Marie Obenour
  Cc: Eric Curtin, initramfs, systemd-devel, Stephen Smoogen,
	Qubes OS Development Mailing List, Yariv Rachmani,
	Douglas Landgraf

On Mo, 11.12.23 11:28, Demi Marie Obenour (demi@invisiblethingslab.com) wrote:

> I don't think this is "a pretty specific solution to one set of devices"
> _at all_.  To the contrary, it is _exactly_ what I want to see desktop
> systems moving to in the future.
>
> It solves the problem of large firmware images.  It solves the problem
> of device-specific configuration, because one can use a file on the EFI
> system partition that is read by userspace and either treated as
> untrusted or TPM-signed.  It means that one have a complete set of
> recovery tools in the event of a problem, rather than being limited to
> whatever one can squeese into an initramfs.  One can even include a full
> GUI stack (with accessibility support!), rather than just Plymouth.  For
> Qubes OS, one can include enough of the Xen and Qubes toolstack to even
> launch virtual machines, allowing the use of USB devices and networking
> for recovery purposes.  It even means that one can use a FIDO2 token to
> unlock the hard drive without a USB stack on the host.  And because the
> initramfs _only_ needs to load the boot extension volume, it can be
> very, _very_ small, which works great with using Linux as a coreboot
> payload.

systemd's "system extension" concept ("sysexts") already allow you to
do all that. The stuff I was fantasizing about would only change one
thing: instead of sd-stub from uefi mode already putting the sysexts
you installed into memory for the initrd to consume, it would be some
proto-initrd that would do so. This does not really change what you
can do with this, but mostly is just an optimization, reducing iops
and memory use a bit, and thus boot time latency.

> The only problem I can see that this does not solve is network boot, but
> that is very much a niche use case when compared to the millions of
> Fedora or Debian desktop installs, or even the tens of thousands of
> Qubes OS installs.  Furthermore, I would _much_ rather network boot be
> handled by userspace and kexec, rather than the closed source UEFI network
> stack.

Well, somebody's niche is somebody else's common case. In VM/cloud/server
scenarios network booting is not that "niche" as it might be on the desktop.

> It does require some care when upgrading, as the dm-verity image and the
> UKI cannot both be updated atomically, but one can solve that by first
> writing the new dm-verity image to a separate location.  The UKI will
> try both both the old and new locations for the dm-verity image and
> rename the new image over the old one on success.  The wrong image will
> simply fail to mount as its root hash will be wrong.

systemd-sysext already covers this just fine: you can encode in their
"extension-release" file to which base images they match up, and
systemd-syext will then find the right one to apply, and ignore the
others. Thus just make sure you drop in the sysexts fist, and the UKI
last and things should be perfectly robust.

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-11 17:03     ` Eric Curtin
  2023-12-11 17:46       ` Demi Marie Obenour
@ 2023-12-12 18:00       ` Lennart Poettering
  2023-12-12 20:34         ` Nils Kattenbeck
  1 sibling, 1 reply; 49+ messages in thread
From: Lennart Poettering @ 2023-12-12 18:00 UTC (permalink / raw)
  To: Eric Curtin
  Cc: Demi Marie Obenour, Yariv Rachmani, initramfs, systemd-devel,
	Stephen Smoogen, Douglas Landgraf,
	Qubes OS Development Mailing List

On Mo, 11.12.23 17:03, Eric Curtin (ecurtin@redhat.com) wrote:

> A generic approach is hard, I think it's worth discussing which type of boots
> you should actually care about milliseconds of performance for. It would be nice
> if we had an init system that contained the binary data to do the minimum for
> standard Fedora, Debian installs and everything else was an extension whether
> that's sysexts, dlopen, a new binary to execute etc.
>
> If the network is ingrained in your boot stack like this, I'm
> guessing you probably don't care about boot performance.

Uh, I am not sure that's really true. People boot up VMs on demand,
based on network traffic. They sure care about latency and boot
times. I mean people care about firecracker and these things precisely
because it brings the of off-to-IP to a minimum.

> Automotive has an expectation for really fast boots, like 2 seconds, in standard
> desktops installs there's some expectation as you interface directly
> with a human,
> but for other installs how much expectation is there?

AFAIR in particular in cars there's quite som functionality you
probaly want to move very early in boot. Which yells to me that you
want a service manager super early. Which again suggests to me that
the first initrd that runs should probably already cover that.

If I were you I'd probably focus on a design like this: ship a basic
systemd in an initrd. Complete enough to find the harddisk, and to run
the other services that are absolutely necessary this early. Then,
once you found the disk, look for sysext images on it, and apply them
all on top of the initrd's root fs you are already running with. Never
transition anywhere else.

The try to optimize the initrd a bit by making it an erofs/memmap
thing and so on. And make sure the initrd only contains stuff you
always need, so that reading it all into memory is necessary anyway,
and hence any approach that tries to run even the initrd off a disk
image won't be necessary becuase you need to read everything anyway.

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-12 17:40           ` Lennart Poettering
@ 2023-12-12 19:05             ` Demi Marie Obenour
  0 siblings, 0 replies; 49+ messages in thread
From: Demi Marie Obenour @ 2023-12-12 19:05 UTC (permalink / raw)
  To: Lennart Poettering, Eric Curtin
  Cc: Yariv Rachmani, initramfs, systemd-devel, Stephen Smoogen,
	Douglas Landgraf

[-- Attachment #1: Type: text/plain, Size: 1549 bytes --]

On Tue, Dec 12, 2023 at 06:40:32PM +0100, Lennart Poettering wrote:
> On Mo, 11.12.23 12:48, Eric Curtin (ecurtin@redhat.com) wrote:
> 
> > Although the nice thing about a storage-init like approach is there's
> > basically zero copies up front. What storage-init is trying to be, is
> > a tool to just call systemd storage things, without also inheriting
> > all the systemd stack.
> 
> Just to make this clear: using things like systemd-cryptsetup outside
> of the systemd stack is not going to work once you leave trivial
> setups. i.e. the TPM hookup involves multiple services these days, and
> it's not going to get any simpler. i.e. systemd-tpm2-setup,
> systemd-pcrextend, systemd-pcrlock and so on. I am sorry, but doing
> reasonable disk encryption with TPM involved means you either buy into
> the whole systemd offer (i.e. with the service manager) or you have to
> rewrite your own systemd.
> 
> But maybe I am misunderstanding what you are saying here.

I think a key factor here is that the initial suggestion was for
automotive use cases.  One can have a vastly simpler system if one is
willing to deliver hardware-specific images, rather than trying to have
a single image that supports many different hardware models.  Automotive
and other embedded systemd understandably do not want to pay for
complexity that they do not need, and which is present to support
features (such as supporting arbitrary hardware) they will never use.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-12 18:00       ` Lennart Poettering
@ 2023-12-12 20:34         ` Nils Kattenbeck
  2023-12-12 20:48           ` Eric Curtin
  2023-12-12 21:02           ` Lennart Poettering
  0 siblings, 2 replies; 49+ messages in thread
From: Nils Kattenbeck @ 2023-12-12 20:34 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Eric Curtin, initramfs, systemd-devel, Stephen Smoogen,
	Qubes OS Development Mailing List, Yariv Rachmani,
	Douglas Landgraf

Hi, while I have been following this thread passively for now I also
wanted to chime in.

> (The main reason why sd-stub doesn't actually support erofs-initrds,
> is that sd-stub also generates initrd cpios on the fly, to pass
> credentials and system extension images to the kernel, and you can't
> really mix erofs and cpio initrds into one)

What prevents one from mixing the two (especially given that the
hypothetical erofs initrd support does not yet exist)?
Or are you talking about mixing this with your memmap+root=/dev/pmem suggestion?

> The try to optimize the initrd a bit by making it an erofs/memmap
> thing and so on. And make sure the initrd only contains stuff you
> always need, so that reading it all into memory is necessary anyway,
> and hence any approach that tries to run even the initrd off a disk
> image won't be necessary becuase you need to read everything anyway.

Having to ensure that the initrd is as small as possible is definitely
no easy task.
Furthermore unless one has total control over the devices, or even if
there are only a few hardware revisions, parts of the initrd might not
be used.
Even if everything is the same there are codes paths which might not
be taken during usual operation. An example would be services similar
to the new systemd-bsod which are only triggered in emergencies.
Having these in the cpio means that they will always be read and
decompressed.
Using sysexts also has the drawback that each and every one of them
has to be decompressed. I might be mistaken but I expect that this
will be the case even if the extension-release in the sysext results
in it being discarded which is obviously another big drawback.

Regardless, even if every single file within the cpio archive (and
potential sysexts) is used, erofs still has a distinct advantage over
cpio!
With cpio everything has to be decompressed and read up front. With
erofs this is not the case.
Only the fs header has to be read at first as files are decompressed on demand.
This means that critical stuff can be started earlier as it does not
have to wait for decompression of stuff only needed later on.
For example an initrd-only (i.e. not pivolint root), graphical system
could start all background services long before the UI starts and
accesses large asset files.

I agree that this splitting up into another micro-initrd just for some
storage stuff etc (which I still have not groked completely) does not
seem to offer any advantages to what we have today. *However*, I
certainly think that standardizing and supporting some kind of erofs
based initrd would gain some advantages.

On the other hand this feels like going back to an old ramdisk again.
This goes beyond my knowledge but based on the kernel docs most
drawbacks of ramdisks would not apply to an approach with erofs. Also
maybe the more flexible loopback devices could be used(?) which might
alleviate some problems.

-- This block device was of fixed size, so the filesystem mounted on
it was of fixed size.
   -> Should not be of concern as it is readonly anyhow.
-- Using a ram disk also required unnecessarily copying memory from
the fake block device into the page cache (and copying changes back
out), as well as creating and destroying dentries.
   -> (?) This one I am actually not too sure about and supersedes my
knowledge on tmpfs, vfs (and its cache layers), erofs caching, and
loopback devices).
-- Plus it needed a filesystem driver (such as ext2) to format and
interpret this data.
   -> erofs is already included in most initrds (and is not too big if
it is not)

Regards, Nils

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-12 20:34         ` Nils Kattenbeck
@ 2023-12-12 20:48           ` Eric Curtin
  2023-12-12 21:02           ` Lennart Poettering
  1 sibling, 0 replies; 49+ messages in thread
From: Eric Curtin @ 2023-12-12 20:48 UTC (permalink / raw)
  To: Nils Kattenbeck
  Cc: Lennart Poettering, initramfs, systemd-devel, Stephen Smoogen,
	Qubes OS Development Mailing List, Yariv Rachmani,
	Douglas Landgraf

On Tue, 12 Dec 2023 at 20:35, Nils Kattenbeck <nilskemail@gmail.com> wrote:
>
> Hi, while I have been following this thread passively for now I also
> wanted to chime in.
>
> > (The main reason why sd-stub doesn't actually support erofs-initrds,
> > is that sd-stub also generates initrd cpios on the fly, to pass
> > credentials and system extension images to the kernel, and you can't
> > really mix erofs and cpio initrds into one)
>
> What prevents one from mixing the two (especially given that the
> hypothetical erofs initrd support does not yet exist)?
> Or are you talking about mixing this with your memmap+root=/dev/pmem suggestion?
>
> > The try to optimize the initrd a bit by making it an erofs/memmap
> > thing and so on. And make sure the initrd only contains stuff you
> > always need, so that reading it all into memory is necessary anyway,
> > and hence any approach that tries to run even the initrd off a disk
> > image won't be necessary becuase you need to read everything anyway.
>
> Having to ensure that the initrd is as small as possible is definitely
> no easy task.
> Furthermore unless one has total control over the devices, or even if
> there are only a few hardware revisions, parts of the initrd might not
> be used.
> Even if everything is the same there are codes paths which might not
> be taken during usual operation. An example would be services similar
> to the new systemd-bsod which are only triggered in emergencies.
> Having these in the cpio means that they will always be read and
> decompressed.
> Using sysexts also has the drawback that each and every one of them
> has to be decompressed. I might be mistaken but I expect that this
> will be the case even if the extension-release in the sysext results
> in it being discarded which is obviously another big drawback.
>
> Regardless, even if every single file within the cpio archive (and
> potential sysexts) is used, erofs still has a distinct advantage over
> cpio!
> With cpio everything has to be decompressed and read up front. With
> erofs this is not the case.
> Only the fs header has to be read at first as files are decompressed on demand.
> This means that critical stuff can be started earlier as it does not
> have to wait for decompression of stuff only needed later on.
> For example an initrd-only (i.e. not pivolint root), graphical system
> could start all background services long before the UI starts and
> accesses large asset files.
>
> I agree that this splitting up into another micro-initrd just for some
> storage stuff etc (which I still have not groked completely) does not
> seem to offer any advantages to what we have today. *However*, I
> certainly think that standardizing and supporting some kind of erofs
> based initrd would gain some advantages.

Are we sure? A bunch of stuff in modern initrd's today have nothing to
do with mounting storage. I've proved there's benefit to that with the
data on the initoverlayfs page, you save ~300ms on systemd start time
on a Raspberry Pi 4 with an sd card, if you use an NVMe drive over USB
on a Raspberry Pi 4 it's even more... ~500ms. I wouldn't say that's
insignificant. You still get all the functionality of the fully
fledged initramfs when systemd starts but you save between 300ms and
500ms.

>
> On the other hand this feels like going back to an old ramdisk again.
> This goes beyond my knowledge but based on the kernel docs most
> drawbacks of ramdisks would not apply to an approach with erofs. Also
> maybe the more flexible loopback devices could be used(?) which might
> alleviate some problems.

For the record, this is what we are doing for initoverlayfs at the
moment, mounting "/boot" partition and then loopback. There are
significant advantages as there are few bytes read until you start
using initoverlayfs.

/boot/initramfs-6.5.12-200.fc38.x86_64.img
/boot/initoverlayfs-6.5.12-200.fc38.x86_64.img

>
> -- This block device was of fixed size, so the filesystem mounted on
> it was of fixed size.
>    -> Should not be of concern as it is readonly anyhow.
> -- Using a ram disk also required unnecessarily copying memory from
> the fake block device into the page cache (and copying changes back
> out), as well as creating and destroying dentries.
>    -> (?) This one I am actually not too sure about and supersedes my
> knowledge on tmpfs, vfs (and its cache layers), erofs caching, and
> loopback devices).
> -- Plus it needed a filesystem driver (such as ext2) to format and
> interpret this data.
>    -> erofs is already included in most initrds (and is not too big if
> it is not)
>
> Regards, Nils
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-12 20:34         ` Nils Kattenbeck
  2023-12-12 20:48           ` Eric Curtin
@ 2023-12-12 21:02           ` Lennart Poettering
  2023-12-12 22:01             ` Nils Kattenbeck
  1 sibling, 1 reply; 49+ messages in thread
From: Lennart Poettering @ 2023-12-12 21:02 UTC (permalink / raw)
  To: Nils Kattenbeck
  Cc: Eric Curtin, initramfs, systemd-devel, Stephen Smoogen,
	Qubes OS Development Mailing List, Yariv Rachmani,
	Douglas Landgraf

On Di, 12.12.23 21:34, Nils Kattenbeck (nilskemail@gmail.com) wrote:

> Hi, while I have been following this thread passively for now I also
> wanted to chime in.
>
> > (The main reason why sd-stub doesn't actually support erofs-initrds,
> > is that sd-stub also generates initrd cpios on the fly, to pass
> > credentials and system extension images to the kernel, and you can't
> > really mix erofs and cpio initrds into one)
>
> What prevents one from mixing the two (especially given that the
> hypothetical erofs initrd support does not yet exist)?
> Or are you talking about mixing this with your memmap+root=/dev/pmem
> suggestion?

If you have 7 cpio initrds then the kernel will allocate a tmpfs and
unpack them all into it, one after the other, on top of each other,
and then jumps into the result.

if you have an erofs and 7 cpio initds, what are you going to do? You
cannot extract into an erofs, it's immutable. You'd need something
like overlayfs, but that would require (at least for now) an
additional step in userspace, which is something to avoid.

Alternatively (and preferred by me) would support a mode where it
would unpack any cpios it gets into a tmpfs, and then pass an fsopen()
fd to that to the executable it then invokes from the erofs. the
executable could then mount that somewhere if it wants. But this would
require a kenrel patch.

> Even if everything is the same there are codes paths which might not
> be taken during usual operation. An example would be services similar
> to the new systemd-bsod which are only triggered in emergencies.
> Having these in the cpio means that they will always be read and
> decompressed.

systemd-bsod is tiny though, less than 8K compressed here. Not sure it
is a good example.

> Using sysexts also has the drawback that each and every one of them
> has to be decompressed. I might be mistaken but I expect that this
> will be the case even if the extension-release in the sysext results
> in it being discarded which is obviously another big drawback.

sysexts are erofs or squashfs file systems with verity backing. Only
the sectors you access are decompressed.

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-12 21:02           ` Lennart Poettering
@ 2023-12-12 22:01             ` Nils Kattenbeck
  2023-12-13  9:03               ` Lennart Poettering
  0 siblings, 1 reply; 49+ messages in thread
From: Nils Kattenbeck @ 2023-12-12 22:01 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Eric Curtin, initramfs, systemd-devel, Stephen Smoogen,
	Qubes OS Development Mailing List, Yariv Rachmani,
	Douglas Landgraf

On Tue, Dec 12, 2023 at 10:02 PM Lennart Poettering
<lennart@poettering.net> wrote:
>
> If you have 7 cpio initrds then the kernel will allocate a tmpfs and
> unpack them all into it, one after the other, on top of each other,
> and then jumps into the result.
>
> if you have an erofs and 7 cpio initds, what are you going to do? You
> cannot extract into an erofs, it's immutable. You'd need something
> like overlayfs, but that would require (at least for now) an
> additional step in userspace, which is something to avoid.
>
> Alternatively (and preferred by me) would support a mode where it
> would unpack any cpios it gets into a tmpfs, and then pass an fsopen()
> fd to that to the executable it then invokes from the erofs. the
> executable could then mount that somewhere if it wants. But this would
> require a kenrel patch.

Such a kernel patch would likely be the more advanced method.
I also saw that they now wrote to the LKML to potentially discuss
something like this.
The method with an overlaysfs would likely be easier for init systems
to use but also less customizable.

> > Even if everything is the same there are codes paths which might not
> > be taken during usual operation. An example would be services similar
> > to the new systemd-bsod which are only triggered in emergencies.
> > Having these in the cpio means that they will always be read and
> > decompressed.
>
> systemd-bsod is tiny though, less than 8K compressed here. Not sure it
> is a good example.

Yes that is right though it is the first and most universal thing
which came to mind.
A better example would be something like a fleet management SDK (in
Java or a similar language with a runtime) which phones to a
management server indicating a boot failure and publishing crash logs.

> > Using sysexts also has the drawback that each and every one of them
> > has to be decompressed. I might be mistaken but I expect that this
> > will be the case even if the extension-release in the sysext results
> > in it being discarded which is obviously another big drawback.
>
> sysexts are erofs or squashfs file systems with verity backing. Only
> the sectors you access are decompressed.

Okay I forgot that they were erofs based and mentioned cpio archives
so I assumed they would be one.
Do they need to be fully read from disk to generate the cpio archive?

> Lennart
>
> --
> Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-12 22:01             ` Nils Kattenbeck
@ 2023-12-13  9:03               ` Lennart Poettering
  2023-12-14  1:17                 ` Nils Kattenbeck
  0 siblings, 1 reply; 49+ messages in thread
From: Lennart Poettering @ 2023-12-13  9:03 UTC (permalink / raw)
  To: Nils Kattenbeck
  Cc: Eric Curtin, initramfs, systemd-devel, Stephen Smoogen,
	Qubes OS Development Mailing List, Yariv Rachmani,
	Douglas Landgraf

On Di, 12.12.23 23:01, Nils Kattenbeck (nilskemail@gmail.com) wrote:

> > sysexts are erofs or squashfs file systems with verity backing. Only
> > the sectors you access are decompressed.
>
> Okay I forgot that they were erofs based and mentioned cpio archives
> so I assumed they would be one.
> Do they need to be fully read from disk to generate the cpio archive?

erofs is a file system, cpio is a serialized archive. Two different
things. The discussion here is whether to pass the initrd to the
kernel as one or the other. But noone is suggesting to convert one to
the other at boot time.

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-13  9:03               ` Lennart Poettering
@ 2023-12-14  1:17                 ` Nils Kattenbeck
  2023-12-16 14:34                   ` Lennart Poettering
  0 siblings, 1 reply; 49+ messages in thread
From: Nils Kattenbeck @ 2023-12-14  1:17 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Eric Curtin, initramfs, systemd-devel, Stephen Smoogen,
	Yariv Rachmani, Douglas Landgraf

On Wed, Dec 13, 2023 at 10:03 AM Lennart Poettering
<lennart@poettering.net> wrote:
>
> On Di, 12.12.23 23:01, Nils Kattenbeck (nilskemail@gmail.com) wrote:
>
> > > sysexts are erofs or squashfs file systems with verity backing. Only
> > > the sectors you access are decompressed.
> >
> > Okay I forgot that they were erofs based and mentioned cpio archives
> > so I assumed they would be one.
> > Do they need to be fully read from disk to generate the cpio archive?
>
> erofs is a file system, cpio is a serialized archive. Two different
> things. The discussion here is whether to pass the initrd to the
> kernel as one or the other. But noone is suggesting to convert one to
> the other at boot time.

I was referring to the following line from sd-stub's man page: "The
following resources are passed as initrd cpio archives to the booted
kernel: [...] /.extra/sysext/*.raw [...]". I assume the initrd
containing the sysexts has to be created at some point?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
  2023-12-14  1:17                 ` Nils Kattenbeck
@ 2023-12-16 14:34                   ` Lennart Poettering
  0 siblings, 0 replies; 49+ messages in thread
From: Lennart Poettering @ 2023-12-16 14:34 UTC (permalink / raw)
  To: Nils Kattenbeck
  Cc: Eric Curtin, initramfs, systemd-devel, Stephen Smoogen,
	Yariv Rachmani, Douglas Landgraf

On Do, 14.12.23 02:17, Nils Kattenbeck (nilskemail@gmail.com) wrote:

> On Wed, Dec 13, 2023 at 10:03 AM Lennart Poettering
> <lennart@poettering.net> wrote:
> >
> > On Di, 12.12.23 23:01, Nils Kattenbeck (nilskemail@gmail.com) wrote:
> >
> > > > sysexts are erofs or squashfs file systems with verity backing. Only
> > > > the sectors you access are decompressed.
> > >
> > > Okay I forgot that they were erofs based and mentioned cpio archives
> > > so I assumed they would be one.
> > > Do they need to be fully read from disk to generate the cpio archive?
> >
> > erofs is a file system, cpio is a serialized archive. Two different
> > things. The discussion here is whether to pass the initrd to the
> > kernel as one or the other. But noone is suggesting to convert one to
> > the other at boot time.
>
> I was referring to the following line from sd-stub's man page: "The
> following resources are passed as initrd cpio archives to the booted
> kernel: [...] /.extra/sysext/*.raw [...]". I assume the initrd
> containing the sysexts has to be created at some point?

These cpios are created on-the-fly and placed into memory and passed
to the invoked kernel. And yes, for that the data they contian needs
to be read off disk first.

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
       [not found] ` <CAOgh=FyA94-7YqGpsAqVQjadegRusoAvRhD=t-ipzVWN0CiJRQ@mail.gmail.com>
@ 2023-12-18 23:31   ` Askar Safin
  0 siblings, 0 replies; 49+ messages in thread
From: Askar Safin @ 2023-12-18 23:31 UTC (permalink / raw)
  To: Eric Curtin; +Cc: initramfs

> Yes, your understanding is correct
Cool! Then I think your current solution is better than other
solutions proposed in the thread. I. e. it is better than to have the
bootloader to load erofs image into memory, because we don't want this
extra copy

--
Askar Safin

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC] initoverlayfs - a scalable initial filesystem
@ 2023-12-18 21:59 Askar Safin
       [not found] ` <CAOgh=FyA94-7YqGpsAqVQjadegRusoAvRhD=t-ipzVWN0CiJRQ@mail.gmail.com>
  0 siblings, 1 reply; 49+ messages in thread
From: Askar Safin @ 2023-12-18 21:59 UTC (permalink / raw)
  To: ecurtin; +Cc: initramfs, systemd-devel

Hi. Unfortunately, this is not clear enough from
https://github.com/containers/initoverlayfs how exactly the
second-stage early filesystem is mounted. So, please, add that
information to README. Let me describe how I understand this.

First, init program from (small) first-stage early filesystem mounts
boot/ESP partition, where second-stage early filesystem image (i. e.
erofs) is located. Then that init program mounts that erofs image.
Without copying the whole erofs image into memory. In other words, if
some part of erofs image is not accessed, then not only it is not
uncompressed, it even is not loaded from disk to memory at all. Is my
understanding correct?

-- 
Askar Safin

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2023-12-18 23:32 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-08 17:59 [RFC] initoverlayfs - a scalable initial filesystem Eric Curtin
2023-12-09 12:46 ` Luca Boccassi
2023-12-09 14:42   ` Eric Curtin
2023-12-09 14:56     ` Andrei Borzenkov
2023-12-09 15:07       ` Eric Curtin
2023-12-09 15:22         ` Daan De Meyer
2023-12-09 15:46           ` Eric Curtin
2023-12-09 17:19         ` Luca Boccassi
2023-12-09 17:24           ` Eric Curtin
2023-12-09 17:46             ` Luca Boccassi
2023-12-09 17:57               ` Eric Curtin
2023-12-09 18:11                 ` Luca Boccassi
2023-12-09 18:26                   ` Eric Curtin
2023-12-11  9:57 ` Lennart Poettering
2023-12-11 10:07   ` Lennart Poettering
2023-12-11 11:20   ` Eric Curtin
2023-12-11 11:28     ` Eric Curtin
2023-12-11 11:42       ` Eric Curtin
2023-12-11 11:58         ` Lennart Poettering
2023-12-11 11:51       ` Lennart Poettering
2023-12-11 12:48         ` Eric Curtin
2023-12-11 12:52           ` Eric Curtin
2023-12-12 17:37           ` Lennart Poettering
2023-12-12 17:40           ` Lennart Poettering
2023-12-12 19:05             ` Demi Marie Obenour
2023-12-11 16:28   ` Demi Marie Obenour
2023-12-11 17:03     ` Eric Curtin
2023-12-11 17:46       ` Demi Marie Obenour
2023-12-12 18:00       ` Lennart Poettering
2023-12-12 20:34         ` Nils Kattenbeck
2023-12-12 20:48           ` Eric Curtin
2023-12-12 21:02           ` Lennart Poettering
2023-12-12 22:01             ` Nils Kattenbeck
2023-12-13  9:03               ` Lennart Poettering
2023-12-14  1:17                 ` Nils Kattenbeck
2023-12-16 14:34                   ` Lennart Poettering
2023-12-11 17:33     ` Neal Gompa
2023-12-11 20:15     ` Luca Boccassi
2023-12-11 20:43       ` Demi Marie Obenour
2023-12-11 20:58         ` Luca Boccassi
2023-12-11 21:20           ` Demi Marie Obenour
2023-12-11 21:45             ` Luca Boccassi
2023-12-12  3:47               ` Paul Menzel
2023-12-12  3:56               ` Paul Menzel
2023-12-12 15:26               ` Paul Menzel
2023-12-11 21:24           ` Eric Curtin
2023-12-12 17:50     ` Lennart Poettering
2023-12-18 21:59 Askar Safin
     [not found] ` <CAOgh=FyA94-7YqGpsAqVQjadegRusoAvRhD=t-ipzVWN0CiJRQ@mail.gmail.com>
2023-12-18 23:31   ` Askar Safin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).