* Re: [RFC] initoverlayfs - a scalable initial filesystem
2023-12-11 16:28 ` Demi Marie Obenour
@ 2023-12-11 17:03 ` Eric Curtin
2023-12-11 17:46 ` Demi Marie Obenour
2023-12-12 18:00 ` Lennart Poettering
2023-12-11 17:33 ` Neal Gompa
` (2 subsequent siblings)
3 siblings, 2 replies; 49+ messages in thread
From: Eric Curtin @ 2023-12-11 17:03 UTC (permalink / raw)
To: Demi Marie Obenour
Cc: Lennart Poettering, Yariv Rachmani, initramfs, systemd-devel,
Stephen Smoogen, Douglas Landgraf,
Qubes OS Development Mailing List
On Mon, 11 Dec 2023 at 16:36, Demi Marie Obenour
<demi@invisiblethingslab.com> wrote:
>
> On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
> >
> > > Here is the boot sequence with initoverlayfs integrated, the
> > > mini-initramfs contains just enough to get storage drivers loaded and
> > > storage devices initialized. storage-init is a process that is not
> > > designed to replace init, it does just enough to initialize storage
> > > (performs a targeted udev trigger on storage), switches to
> > > initoverlayfs as root and then executes init.
> > >
> > > ```
> > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > >
> > > fw -> bootloader -> kernel -> storage-init -> init ----------------->
> > > ```
> >
> > I am not sure I follow what these chains are supposed to mean? Why are
> > there two lines?
> >
> > So, I generally would agree that the current initrd scheme is not
> > ideal, and we have been discussing better approaches. But I am not
> > sure your approach really is useful on generic systems for two
> > reasons:
> >
> > 1. no security model? you need to authenticate your initrd in
> > 2023. There's no excuse for not doing that anymore these days. Not
> > in automotive, and not anywhere else really.
> >
> > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > unlock their root disks with TPM2 and similar things. People use
> > RAID, LVM, and all that mess.
> >
> > Actually the above are kinda the same problem in a way: you need
> > complex storage, but if you need that you kinda need udev, and
> > services, and then also systemd and all that other stuff, and that's
> > why the system works like the system works right now.
> >
> > Whenever you devise a system like yours by cutting corners, and
> > declaring that you don't want TPM, you don't want signed initrds, you
> > don't want to support weird storage, you just solve your problem in a
> > very specific way, ignoring the big picture. Which is OK, *if* you can
> > actually really work without all that and are willing to maintain the
> > solution for your specific problem only.
> >
> > As I understand you are trying to solve multiple problems at once
> > here, and I think one should start with figuring out clearly what
> > those are before trying to address them, maybe without compromising on
> > security. So my guess is you want to address the following:
> >
> > 1. You don't want the whole big initrd to be read off disk on every
> > boot, but only the parts of it that are actually needed.
> >
> > 2. You don't want the whole big initrd to be fully decompressed on every
> > boot, but only the parts of it that are actually needed.
> >
> > 3. You want to share data between root fs and initrd
> >
> > 4. You want to save some boot time by not bringing up an init system
> > in the initrd once, then tearing it down again, and starting it
> > again from the root fs.
> >
> > For the items listed above I think you can find different solutions
> > which do not necessarily compromise security as much.
> >
> > So, in the list above you could address the latter three like this:
> >
> > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > loader load the erofs into contiguous memory, then use memmap=X!Y on
> > the kernel cmdline to synthesize a block device from that, which
> > you then mount directly (without any initrd) via
> > root=/dev/pmem0. This means your boot loader will still load the
> > whole image into memory, but only decompress the bits actually
> > needed. (It also has some other nice benefits I like, such as an
> > immutable rootfs, which tmpfs-based initrds don't have.)
> >
> > 3. Simply never transition to the root fs, don't mark the initrd in
> > systemd's eyes as an initrd (specifically: don't add an
> > /etc/initrd-release file to it). Instead, just merge resources of
> > the root fs into your initrd fs via overlayfs. systemd has
> > infrastructure for this: "systemd-sysext". It takes immutable,
> > authenticated erofs images (with verity, we call them "DDIs",
> > i.e. "discoverable disk images") that it overlays into /usr/. [You
> > could also very nicely combine this approach with systemd's
> > portable services, and nspawn containers, which operate on the same
> > authenticated images]. At MSFT we have a major product that works
> > exactly like this: the OS runs off a rootfs that is loaded as an
> > initrd, and everything that runs on top of this are just these
> > verity disk images, using overlayfs and portable services.
> >
> > 4. The proposal in 3 also addresses goal 4.
> >
> > Which leaves item 1, which is a bit harder to address. We have been
> > discussing this off and on internally too. A generic solution to this
> > is hard. My current thinking for this could be something like this,
> > covering the UEFI world: support sticking a DDI for the main initrd in
> > the ESP. The ESP is per definition unencrypted and unauthenticated,
> > but otherwise relatively well defined, i.e. known to be vfat and
> > discoverable via UUID on a GPT disk. So: build a minimal
> > single-process initrd into the kernel (i.e. UKI) that has exactly the
> > storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> > drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> > jump into the rootfs stored in the ESP. That latter then has proper
> > file system drivers, storage drivers, crypto stack, and can unlock the
> > real root. This would still be a pretty specific solution to one set
> > of devices though, as it could not cover network boots (i.e. where
> > there is just no ESP to boot from), but I think this could be kept
> > relatively close, as the logic in that case could just fall back into
> > loading the DDI that normally would sit in the ESP fully into
> > memory.
>
> I don't think this is "a pretty specific solution to one set of devices"
> _at all_. To the contrary, it is _exactly_ what I want to see desktop
> systems moving to in the future.
>
> It solves the problem of large firmware images. It solves the problem
> of device-specific configuration, because one can use a file on the EFI
> system partition that is read by userspace and either treated as
> untrusted or TPM-signed. It means that one has a complete set of
> recovery tools in the event of a problem, rather than being limited to
> whatever one can squeeze into an initramfs. One can even include a full
> GUI stack (with accessibility support!), rather than just Plymouth. For
Plymouth is very interesting in that it has its own graphics stack, event loop
implementations, etc. A lot of the initrd software is like this. Plymouth is
one of the examples I think of as something that could benefit from being able
to use more generic components. At least it's an easy example to explain to
people.
> Qubes OS, one can include enough of the Xen and Qubes toolstack to even
> launch virtual machines, allowing the use of USB devices and networking
> for recovery purposes. It even means that one can use a FIDO2 token to
> unlock the hard drive without a USB stack on the host. And because the
> initramfs _only_ needs to load the boot extension volume, it can be
> very, _very_ small, which works great with using Linux as a coreboot
> payload.
>
> The only problem I can see that this does not solve is network boot, but
> that is very much a niche use case when compared to the millions of
> Fedora or Debian desktop installs, or even the tens of thousands of
> Qubes OS installs. Furthermore, I would _much_ rather network boot be
> handled by userspace and kexec, rather than the closed source UEFI network
> stack.
A generic approach is hard. I think it's worth discussing which types of boot
you should actually care about milliseconds of performance for. It would be
nice if we had an init system that contained only the binary data needed to do
the minimum for standard Fedora and Debian installs, with everything else being
an extension, whether that's sysexts, dlopen, a new binary to execute, etc.

If the network is ingrained in your boot stack like this, I'm guessing you
probably don't care about boot performance. Should we come up with a new
technique?

Automotive has an expectation of really fast boots, around 2 seconds. In
standard desktop installs there's some expectation, as you interface directly
with a human, but for other installs how much expectation is there?

Or can we just fall back to existing techniques for installs like network boot?

Is mise le meas/Regards,
Eric Curtin
>
> It does require some care when upgrading, as the dm-verity image and the
> UKI cannot both be updated atomically, but one can solve that by first
> writing the new dm-verity image to a separate location. The UKI will
> try both the old and new locations for the dm-verity image and
> rename the new image over the old one on success. The wrong image will
> simply fail to mount as its root hash will be wrong.
>
> This even allows Apple-esque boot policies to be implemented on
> commodity hardware, provided that the system firmware is sufficiently
> hardened. It won't be as good as what Apple does, but it will be a huge
> win from what is possible today.
>
> > (If you are focussing on systems lacking UEFI, then replace the word
> > "ESP" in the above with a similar concept, i.e. a well discoverable,
> > unauthenticated relatively simple file system, such as vfat).
> >
> > Anyway, I can't tell you how to solve your specific problems, but if
> > there's one thing I'd suggest you to keep in mind then it's the
> > security angle, i.e. keep in mind from the beginning how
> > authentication of every component of your process shall work, how
> > unattended disk encryption shall operate and how measurement shall
> > work. Security must be built into things from the beginning, not be
> > added as an afterthought.
>
> As a Qubes OS developer and a security researcher, thank you.
> --
> Sincerely,
> Demi Marie Obenour (she/her/hers)
> Invisible Things Lab
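
The upgrade ordering described in the quoted text above (write the new dm-verity image to a separate location, try both locations at boot, rename the new image over the old one on success) can be sketched roughly as follows. This is only an illustration under stated assumptions: the file names, `pick_image`, and `demo` are hypothetical, and a plain SHA-256 over the whole file stands in for a real dm-verity root hash, which is computed over a hash tree.

```python
import hashlib
import os
import tempfile


def root_hash(path):
    """Stand-in for a dm-verity root hash: SHA-256 of the whole file (assumption)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def pick_image(expected_hash, old_path, new_path):
    """Try the staged (new) and current (old) locations; return the usable path.

    An image whose hash does not match is skipped, mirroring "the wrong image
    will simply fail to mount". If the staged image matches, it is renamed over
    the old one so that the next boot finds it in the usual place.
    """
    for candidate in (new_path, old_path):
        if os.path.exists(candidate) and root_hash(candidate) == expected_hash:
            if candidate == new_path:
                os.replace(new_path, old_path)  # rename staged image over the old one
            return old_path
    raise RuntimeError("no image matches the expected root hash")


def demo():
    """Boot once with the old UKI, then once with the new UKI."""
    with tempfile.TemporaryDirectory() as d:
        old_path = os.path.join(d, "initrd.ddi")       # hypothetical file name
        new_path = os.path.join(d, "initrd.ddi.new")   # hypothetical staging name
        with open(old_path, "wb") as f:
            f.write(b"old image")
        with open(new_path, "wb") as f:
            f.write(b"new image")
        old_hash = hashlib.sha256(b"old image").hexdigest()
        new_hash = hashlib.sha256(b"new image").hexdigest()

        # The old UKI is still installed: the staged image is ignored.
        with open(pick_image(old_hash, old_path, new_path), "rb") as f:
            first_boot = f.read()
        staged_after_first = os.path.exists(new_path)

        # The new UKI is installed: the staged image is accepted and renamed.
        with open(pick_image(new_hash, old_path, new_path), "rb") as f:
            second_boot = f.read()
        staged_after_second = os.path.exists(new_path)

        return first_boot, staged_after_first, second_boot, staged_after_second
```

Whichever UKI ends up booting, exactly one of the two images matches its embedded hash, so the image and the UKI never need to be replaced in a single atomic step.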
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] initoverlayfs - a scalable initial filesystem
2023-12-11 17:03 ` Eric Curtin
@ 2023-12-11 17:46 ` Demi Marie Obenour
2023-12-12 18:00 ` Lennart Poettering
1 sibling, 0 replies; 49+ messages in thread
From: Demi Marie Obenour @ 2023-12-11 17:46 UTC (permalink / raw)
To: Eric Curtin
Cc: Lennart Poettering, Yariv Rachmani, initramfs, systemd-devel,
Stephen Smoogen, Douglas Landgraf,
Qubes OS Development Mailing List
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
On Mon, Dec 11, 2023 at 05:03:13PM +0000, Eric Curtin wrote:
> On Mon, 11 Dec 2023 at 16:36, Demi Marie Obenour
> <demi@invisiblethingslab.com> wrote:
> >
> > On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > > On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
> > >
> > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > storage devices initialized. storage-init is a process that is not
> > > > designed to replace init, it does just enough to initialize storage
> > > > (performs a targeted udev trigger on storage), switches to
> > > > initoverlayfs as root and then executes init.
> > > >
> > > > ```
> > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > >
> > > > fw -> bootloader -> kernel -> storage-init -> init ----------------->
> > > > ```
> > >
> > > I am not sure I follow what these chains are supposed to mean? Why are
> > > there two lines?
> > >
> > > So, I generally would agree that the current initrd scheme is not
> > > ideal, and we have been discussing better approaches. But I am not
> > > sure your approach really is useful on generic systems for two
> > > reasons:
> > >
> > > 1. no security model? you need to authenticate your initrd in
> > > 2023. There's no excuse for not doing that anymore these days. Not
> > > in automotive, and not anywhere else really.
> > >
> > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > > unlock their root disks with TPM2 and similar things. People use
> > > RAID, LVM, and all that mess.
> > >
> > > Actually the above are kinda the same problem in a way: you need
> > > complex storage, but if you need that you kinda need udev, and
> > > services, and then also systemd and all that other stuff, and that's
> > > why the system works like the system works right now.
> > >
> > > Whenever you devise a system like yours by cutting corners, and
> > > declaring that you don't want TPM, you don't want signed initrds, you
> > > don't want to support weird storage, you just solve your problem in a
> > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > actually really work without all that and are willing to maintain the
> > > solution for your specific problem only.
> > >
> > > As I understand you are trying to solve multiple problems at once
> > > here, and I think one should start with figuring out clearly what
> > > those are before trying to address them, maybe without compromising on
> > > security. So my guess is you want to address the following:
> > >
> > > 1. You don't want the whole big initrd to be read off disk on every
> > > boot, but only the parts of it that are actually needed.
> > >
> > > 2. You don't want the whole big initrd to be fully decompressed on every
> > > boot, but only the parts of it that are actually needed.
> > >
> > > 3. You want to share data between root fs and initrd
> > >
> > > 4. You want to save some boot time by not bringing up an init system
> > > in the initrd once, then tearing it down again, and starting it
> > > again from the root fs.
> > >
> > > For the items listed above I think you can find different solutions
> > > which do not necessarily compromise security as much.
> > >
> > > So, in the list above you could address the latter three like this:
> > >
> > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > > loader load the erofs into contiguous memory, then use memmap=X!Y on
> > > the kernel cmdline to synthesize a block device from that, which
> > > you then mount directly (without any initrd) via
> > > root=/dev/pmem0. This means your boot loader will still load the
> > > whole image into memory, but only decompress the bits actually
> > > needed. (It also has some other nice benefits I like, such as an
> > > immutable rootfs, which tmpfs-based initrds don't have.)
> > >
> > > 3. Simply never transition to the root fs, don't mark the initrd in
> > > systemd's eyes as an initrd (specifically: don't add an
> > > /etc/initrd-release file to it). Instead, just merge resources of
> > > the root fs into your initrd fs via overlayfs. systemd has
> > > infrastructure for this: "systemd-sysext". It takes immutable,
> > > authenticated erofs images (with verity, we call them "DDIs",
> > > i.e. "discoverable disk images") that it overlays into /usr/. [You
> > > could also very nicely combine this approach with systemd's
> > > portable services, and nspawn containers, which operate on the same
> > > authenticated images]. At MSFT we have a major product that works
> > > exactly like this: the OS runs off a rootfs that is loaded as an
> > > initrd, and everything that runs on top of this are just these
> > > verity disk images, using overlayfs and portable services.
> > >
> > > 4. The proposal in 3 also addresses goal 4.
> > >
> > > Which leaves item 1, which is a bit harder to address. We have been
> > > discussing this off and on internally too. A generic solution to this
> > > is hard. My current thinking for this could be something like this,
> > > covering the UEFI world: support sticking a DDI for the main initrd in
> > > the ESP. The ESP is per definition unencrypted and unauthenticated,
> > > but otherwise relatively well defined, i.e. known to be vfat and
> > > discoverable via UUID on a GPT disk. So: build a minimal
> > > single-process initrd into the kernel (i.e. UKI) that has exactly the
> > > storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> > > drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> > > jump into the rootfs stored in the ESP. That latter then has proper
> > > file system drivers, storage drivers, crypto stack, and can unlock the
> > > real root. This would still be a pretty specific solution to one set
> > > of devices though, as it could not cover network boots (i.e. where
> > > there is just no ESP to boot from), but I think this could be kept
> > > relatively close, as the logic in that case could just fall back into
> > > loading the DDI that normally would sit in the ESP fully into
> > > memory.
> >
> > I don't think this is "a pretty specific solution to one set of devices"
> > _at all_. To the contrary, it is _exactly_ what I want to see desktop
> > systems moving to in the future.
> >
> > It solves the problem of large firmware images. It solves the problem
> > of device-specific configuration, because one can use a file on the EFI
> > system partition that is read by userspace and either treated as
> > untrusted or TPM-signed. It means that one has a complete set of
> > recovery tools in the event of a problem, rather than being limited to
> > whatever one can squeeze into an initramfs. One can even include a full
> > GUI stack (with accessibility support!), rather than just Plymouth. For
>
> Plymouth is very interesting in that it has its own graphics stack, event loop
> implementations, etc. A lot of the initrd software is like this. Plymouth is
> one of the examples I think of as something that could benefit from being able
> to use more generic components. At least it's an easy example to explain to
> people.
Indeed so. There is still the concern of startup time, which
GPU-accelerated programs in particular are often not great at.
> > Qubes OS, one can include enough of the Xen and Qubes toolstack to even
> > launch virtual machines, allowing the use of USB devices and networking
> > for recovery purposes. It even means that one can use a FIDO2 token to
> > unlock the hard drive without a USB stack on the host. And because the
> > initramfs _only_ needs to load the boot extension volume, it can be
> > very, _very_ small, which works great with using Linux as a coreboot
> > payload.
> >
> > The only problem I can see that this does not solve is network boot, but
> > that is very much a niche use case when compared to the millions of
> > Fedora or Debian desktop installs, or even the tens of thousands of
> > Qubes OS installs. Furthermore, I would _much_ rather network boot be
> > handled by userspace and kexec, rather than the closed source UEFI network
> > stack.
>
> A generic approach is hard, I think it's worth discussing which type of boots
> you should actually care about milliseconds of performance for. It would be nice
> if we had an init system that contained the binary data to do the minimum for
> standard Fedora, Debian installs and everything else was an extension whether
> that's sysexts, dlopen, a new binary to execute etc.
>
> If the network is ingrained in your boot stack like this, I'm guessing you
> probably don't care about boot performance. Should we come up with a new
> technique?
>
> Automotive has an expectation of really fast boots, around 2 seconds. In
> standard desktop installs there's some expectation, as you interface directly
> with a human, but for other installs how much expectation is there?
>
> Or can we just fall back to existing techniques for installs like network boot?
I wouldn't say that people doing network boot don't care about boot
performance, mostly because I have been on the other side of similar
arguments before [1]. However, I don't think this technique needs to
support network boot.
- --
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
[1]: Qubes OS doesn't expose GPU acceleration to VMs. This is not
because the developers don't care about graphics performance, but
because GPUs and especially their driver stacks have a very large
attack surface. Work is being done to address this, but even once
Qubes OS does support GPU acceleration, it will need to be off by
default, at least initially.
-----BEGIN PGP SIGNATURE-----
iQIzBAEBCgAdFiEEdodNnxM2uiJZBxxxsoi1X/+cIsEFAmV3SuMACgkQsoi1X/+c
IsHH6RAAhMQl/nw2jdZ4tlwxX/zqib3Tfzdo1p9a5VOkSobrvV7qbG0DWVrqe+vH
NKU1xy6FGqPexKjLoGlxWXgPN5rQKvkFXSgRaRefcqGn190WRjqexF0euu26GYTx
AfOEWC1hywoyXUR2LMygEMpodA0ZvZffIZcovmjjr4OeXiSc5aAUrHQ2PabHZaET
BL4jfeNikjw6sA2UdpviMRzb1OVEGZDD96XDSbVz/8tOBcZZNePz+FQXnHqTpcLk
DrBtx4l5noeUYingzxmw4MQZYYPr3kC4+DQtQr7zxv8D0UE9g8lIcpektqMvgoON
88FwVOa4TgTij7vG2f4BGCrZjE7PiPPo5BRb+MtjlZMtrhwdI4IwXY8q4EANWUnw
8nM+952nffVVQjpBtKRsXPZ3glAjvUuqHT8GzfWYYu8y8Dar9c3U4aQSTCJspkz3
jBsPAatFSjdBvlE6OtmyYco92K3A9g6WXzkw5t+/yaljBOddEkxEAw8+Lo1dCqrn
zK+vSFhcGpYodsHFQY0w9kAZ2+6HBX2nZaEmD6ka3furRussm7D4Z36lx1D/pi68
BL4aAFFLaEQ0jD8jqtjVZ2JYpUQufzwrnsNPTZ97WTEKd2F/zM/S09WjFsaOfVIO
F95Eqk0YMHP+krDEcXvm34EZ3PeRGlVm1fz4ttjw8XEekwwB5QU=
=HR07
-----END PGP SIGNATURE-----
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] initoverlayfs - a scalable initial filesystem
2023-12-11 17:03 ` Eric Curtin
2023-12-11 17:46 ` Demi Marie Obenour
@ 2023-12-12 18:00 ` Lennart Poettering
2023-12-12 20:34 ` Nils Kattenbeck
1 sibling, 1 reply; 49+ messages in thread
From: Lennart Poettering @ 2023-12-12 18:00 UTC (permalink / raw)
To: Eric Curtin
Cc: Demi Marie Obenour, Yariv Rachmani, initramfs, systemd-devel,
Stephen Smoogen, Douglas Landgraf,
Qubes OS Development Mailing List
On Mo, 11.12.23 17:03, Eric Curtin (ecurtin@redhat.com) wrote:
> A generic approach is hard, I think it's worth discussing which type of boots
> you should actually care about milliseconds of performance for. It would be nice
> if we had an init system that contained the binary data to do the minimum for
> standard Fedora, Debian installs and everything else was an extension whether
> that's sysexts, dlopen, a new binary to execute etc.
>
> If the network is ingrained in your boot stack like this, I'm
> guessing you probably don't care about boot performance.
Uh, I am not sure that's really true. People boot up VMs on demand,
based on network traffic. They sure care about latency and boot
times. I mean people care about Firecracker and these things precisely
because they bring the time from off to IP to a minimum.
> Automotive has an expectation for really fast boots, like 2 seconds, in
> standard desktops installs there's some expectation as you interface directly
> with a human, but for other installs how much expectation is there?
AFAIR in particular in cars there's quite some functionality you
probably want to move very early in boot. Which yells to me that you
want a service manager super early. Which again suggests to me that
the first initrd that runs should probably already cover that.
If I were you I'd probably focus on a design like this: ship a basic
systemd in an initrd. Complete enough to find the hard disk, and to run
the other services that are absolutely necessary this early. Then,
once you have found the disk, look for sysext images on it, and apply them
all on top of the initrd's root fs you are already running with. Never
transition anywhere else.
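
As a toy model of the "apply extension images on top of the initrd's root fs" idea, the lookup behaviour of stacked layers can be sketched with plain dictionaries. This is not how systemd-sysext is implemented (it relies on overlayfs in the kernel); the layer names, the paths, and the precedence order between extensions are assumptions made purely for illustration.

```python
from collections import ChainMap


def merge_layers(base, extensions):
    """Return a merged view in which later extensions shadow earlier ones and the base.

    ChainMap searches its maps left to right, so the last extension goes first.
    The underlying dictionaries are not copied or modified.
    """
    return ChainMap(*reversed(extensions), base)


# Hypothetical contents: path -> name of the layer that provides it.
initrd_usr = {"/usr/bin/init": "initrd", "/usr/lib/libc.so": "initrd"}
ext_storage = {"/usr/bin/cryptsetup": "storage-ext"}
ext_gui = {"/usr/bin/compositor": "gui-ext", "/usr/lib/libc.so": "gui-ext"}

merged = merge_layers(initrd_usr, [ext_storage, ext_gui])

for path in sorted(merged):
    print(path, "<-", merged[path])
```

The point of the sketch is that the base layer stays untouched and keeps running; extensions only add to or shadow what is visible, so there is no second root to transition into.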
Then try to optimize the initrd a bit by making it an erofs/memmap
thing and so on. And make sure the initrd only contains stuff you
always need, so that reading it all into memory is necessary anyway,
and hence any approach that tries to run even the initrd off a disk
image won't be necessary because you need to read everything anyway.
Lennart
--
Lennart Poettering, Berlin
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] initoverlayfs - a scalable initial filesystem
2023-12-12 18:00 ` Lennart Poettering
@ 2023-12-12 20:34 ` Nils Kattenbeck
2023-12-12 20:48 ` Eric Curtin
2023-12-12 21:02 ` Lennart Poettering
0 siblings, 2 replies; 49+ messages in thread
From: Nils Kattenbeck @ 2023-12-12 20:34 UTC (permalink / raw)
To: Lennart Poettering
Cc: Eric Curtin, initramfs, systemd-devel, Stephen Smoogen,
Qubes OS Development Mailing List, Yariv Rachmani,
Douglas Landgraf
Hi, while I have been following this thread passively for now I also
wanted to chime in.
> (The main reason why sd-stub doesn't actually support erofs-initrds,
> is that sd-stub also generates initrd cpios on the fly, to pass
> credentials and system extension images to the kernel, and you can't
> really mix erofs and cpio initrds into one)
What prevents one from mixing the two (especially given that the
hypothetical erofs initrd support does not yet exist)?
Or are you talking about mixing this with your memmap+root=/dev/pmem suggestion?
> Then try to optimize the initrd a bit by making it an erofs/memmap
> thing and so on. And make sure the initrd only contains stuff you
> always need, so that reading it all into memory is necessary anyway,
> and hence any approach that tries to run even the initrd off a disk
> image won't be necessary because you need to read everything anyway.
Having to ensure that the initrd is as small as possible is definitely
no easy task.
Furthermore unless one has total control over the devices, or even if
there are only a few hardware revisions, parts of the initrd might not
be used.
Even if everything is the same there are code paths which might not
be taken during usual operation. An example would be services similar
to the new systemd-bsod which are only triggered in emergencies.
Having these in the cpio means that they will always be read and
decompressed.
Using sysexts also has the drawback that each and every one of them
has to be decompressed. I might be mistaken but I expect that this
will be the case even if the extension-release in the sysext results
in it being discarded which is obviously another big drawback.
Regardless, even if every single file within the cpio archive (and
potential sysexts) is used, erofs still has a distinct advantage over
cpio!
With cpio everything has to be decompressed and read up front. With
erofs this is not the case.
Only the fs header has to be read at first as files are decompressed on demand.
This means that critical stuff can be started earlier as it does not
have to wait for decompression of stuff only needed later on.
For example an initrd-only (i.e. not pivoting root), graphical system
could start all background services long before the UI starts and
accesses large asset files.
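
The difference between unpacking everything up front and decompressing on demand can be illustrated with a small model. This is a rough sketch only: the class names and file sizes are made up, `zlib` stands in for whatever compressor is actually used, and per-file compression is a simplification (erofs compresses in clusters rather than whole files).

```python
import zlib

# Hypothetical initrd contents: a small init, a driver, and large UI assets.
FILES = {
    "init": b"i" * 1_000,
    "storage-driver": b"s" * 5_000,
    "ui-assets": b"a" * 500_000,
}


class EagerArchive:
    """cpio-like model: one compressed stream, unpacked in full on first use."""

    def __init__(self, files):
        self._offsets = {}
        blob = b""
        for name, data in files.items():
            self._offsets[name] = (len(blob), len(data))
            blob += data
        self._packed = zlib.compress(blob)
        self._unpacked = None
        self.bytes_decompressed = 0

    def read(self, name):
        if self._unpacked is None:  # the first access pays for the whole archive
            self._unpacked = zlib.decompress(self._packed)
            self.bytes_decompressed += len(self._unpacked)
        start, size = self._offsets[name]
        return self._unpacked[start:start + size]


class LazyImage:
    """erofs-like model: each file is decompressed only when it is first read."""

    def __init__(self, files):
        self._packed = {name: zlib.compress(data) for name, data in files.items()}
        self._cache = {}
        self.bytes_decompressed = 0

    def read(self, name):
        if name not in self._cache:
            self._cache[name] = zlib.decompress(self._packed[name])
            self.bytes_decompressed += len(self._cache[name])
        return self._cache[name]


eager = EagerArchive(FILES)
lazy = LazyImage(FILES)
eager.read("init")
lazy.read("init")
print("eager:", eager.bytes_decompressed, "bytes decompressed before init can run")
print("lazy: ", lazy.bytes_decompressed, "bytes decompressed before init can run")
```

In this model both variants eventually decompress the same amount if every file is used; what changes is how much work sits in front of the first process that needs to start.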
I agree that this splitting up into another micro-initrd just for some
storage stuff etc (which I still have not grokked completely) does not
seem to offer any advantages over what we have today. *However*, I
certainly think that standardizing and supporting some kind of erofs
based initrd would bring some advantages.
On the other hand this feels like going back to an old ramdisk again.
This goes beyond my knowledge but based on the kernel docs most
drawbacks of ramdisks would not apply to an approach with erofs. Also
maybe the more flexible loopback devices could be used(?) which might
alleviate some problems.
-- This block device was of fixed size, so the filesystem mounted on
it was of fixed size.
-> Should not be of concern as it is readonly anyhow.
-- Using a ram disk also required unnecessarily copying memory from
the fake block device into the page cache (and copying changes back
out), as well as creating and destroying dentries.
-> (?) This one I am actually not too sure about, as it exceeds my
knowledge of tmpfs, vfs (and its cache layers), erofs caching, and
loopback devices.
-- Plus it needed a filesystem driver (such as ext2) to format and
interpret this data.
-> erofs is already included in most initrds (and is not too big if
it is not)
Regards, Nils
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] initoverlayfs - a scalable initial filesystem
2023-12-12 20:34 ` Nils Kattenbeck
@ 2023-12-12 20:48 ` Eric Curtin
2023-12-12 21:02 ` Lennart Poettering
1 sibling, 0 replies; 49+ messages in thread
From: Eric Curtin @ 2023-12-12 20:48 UTC (permalink / raw)
To: Nils Kattenbeck
Cc: Lennart Poettering, initramfs, systemd-devel, Stephen Smoogen,
Qubes OS Development Mailing List, Yariv Rachmani,
Douglas Landgraf
On Tue, 12 Dec 2023 at 20:35, Nils Kattenbeck <nilskemail@gmail.com> wrote:
>
> Hi, while I have been following this thread passively for now I also
> wanted to chime in.
>
> > (The main reason why sd-stub doesn't actually support erofs-initrds,
> > is that sd-stub also generates initrd cpios on the fly, to pass
> > credentials and system extension images to the kernel, and you can't
> > really mix erofs and cpio initrds into one)
>
> What prevents one from mixing the two (especially given that the
> hypothetical erofs initrd support does not yet exist)?
> Or are you talking about mixing this with your memmap+root=/dev/pmem suggestion?
>
> > Then try to optimize the initrd a bit by making it an erofs/memmap
> > thing and so on. And make sure the initrd only contains stuff you
> > always need, so that reading it all into memory is necessary anyway,
> > and hence any approach that tries to run even the initrd off a disk
> > image won't be necessary because you need to read everything anyway.
>
> Having to ensure that the initrd is as small as possible is definitely
> no easy task.
> Furthermore unless one has total control over the devices, or even if
> there are only a few hardware revisions, parts of the initrd might not
> be used.
> Even if everything is the same there are code paths which might not
> be taken during usual operation. An example would be services similar
> to the new systemd-bsod which are only triggered in emergencies.
> Having these in the cpio means that they will always be read and
> decompressed.
> Using sysexts also has the drawback that each and every one of them
> has to be decompressed. I might be mistaken but I expect that this
> will be the case even if the extension-release in the sysext results
> in it being discarded which is obviously another big drawback.
>
> Regardless, even if every single file within the cpio archive (and
> potential sysexts) is used, erofs still has a distinct advantage over
> cpio!
> With cpio everything has to be decompressed and read up front. With
> erofs this is not the case.
> Only the fs header has to be read at first as files are decompressed on demand.
> This means that critical stuff can be started earlier as it does not
> have to wait for decompression of stuff only needed later on.
> For example an initrd-only (i.e. not pivoting root), graphical system
> could start all background services long before the UI starts and
> accesses large asset files.
>
> I agree that this splitting up into another micro-initrd just for some
> storage stuff etc (which I still have not grokked completely) does not
> seem to offer any advantages over what we have today. *However*, I
> certainly think that standardizing and supporting some kind of erofs
> based initrd would bring some advantages.
Are we sure? A lot of what is in modern initrds today has nothing to
do with mounting storage. I've shown the benefit of splitting that out
with the data on the initoverlayfs page: you save ~300ms on systemd start
time on a Raspberry Pi 4 with an SD card, and if you use an NVMe drive over
USB on a Raspberry Pi 4 it's even more, ~500ms. I wouldn't say that's
insignificant. You still get all the functionality of the fully
fledged initramfs when systemd starts, but you save between 300ms and
500ms.
>
> On the other hand this feels like going back to an old ramdisk again.
> This goes beyond my knowledge but based on the kernel docs most
> drawbacks of ramdisks would not apply to an approach with erofs. Also
> maybe the more flexible loopback devices could be used(?) which might
> alleviate some problems.
For the record, this is what we are doing for initoverlayfs at the
moment: mounting the "/boot" partition and then using loopback. There
are significant advantages, as only a few bytes are read before you
start using initoverlayfs.
/boot/initramfs-6.5.12-200.fc38.x86_64.img
/boot/initoverlayfs-6.5.12-200.fc38.x86_64.img
>
> -- This block device was of fixed size, so the filesystem mounted on
> it was of fixed size.
> -> Should not be of concern as it is readonly anyhow.
> -- Using a ram disk also required unnecessarily copying memory from
> the fake block device into the page cache (and copying changes back
> out), as well as creating and destroying dentries.
> -> (?) This one I am actually not too sure about; it exceeds my
> knowledge of tmpfs, vfs (and its cache layers), erofs caching, and
> loopback devices.
> -- Plus it needed a filesystem driver (such as ext2) to format and
> interpret this data.
> -> erofs is already included in most initrds (and is not too big if
> it is not)
>
> Regards, Nils
>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] initoverlayfs - a scalable initial filesystem
2023-12-12 20:34 ` Nils Kattenbeck
2023-12-12 20:48 ` Eric Curtin
@ 2023-12-12 21:02 ` Lennart Poettering
2023-12-12 22:01 ` Nils Kattenbeck
1 sibling, 1 reply; 49+ messages in thread
From: Lennart Poettering @ 2023-12-12 21:02 UTC (permalink / raw)
To: Nils Kattenbeck
Cc: Eric Curtin, initramfs, systemd-devel, Stephen Smoogen,
Qubes OS Development Mailing List, Yariv Rachmani,
Douglas Landgraf
On Di, 12.12.23 21:34, Nils Kattenbeck (nilskemail@gmail.com) wrote:
> Hi, while I have been following this thread passively for now I also
> wanted to chime in.
>
> > (The main reason why sd-stub doesn't actually support erofs-initrds,
> > is that sd-stub also generates initrd cpios on the fly, to pass
> > credentials and system extension images to the kernel, and you can't
> > really mix erofs and cpio initrds into one)
>
> What prevents one from mixing the two (especially given that the
> hypothetical erofs initrd support does not yet exist)?
> Or are you talking about mixing this with your memmap+root=/dev/pmem
> suggestion?
If you have 7 cpio initrds then the kernel will allocate a tmpfs and
unpack them all into it, one after the other, on top of each other,
and then jump into the result.
If you have an erofs and 7 cpio initrds, what are you going to do? You
cannot extract into an erofs, it's immutable. You'd need something
like overlayfs, but that would require (at least for now) an
additional step in userspace, which is something to avoid.
Alternatively (and preferred by me), the kernel would support a mode
where it would unpack any cpios it gets into a tmpfs, and then pass an
fsopen() fd for that to the executable it then invokes from the erofs.
The executable could then mount that somewhere if it wants. But this
would require a kernel patch.
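To make the two behaviours concrete, here is a toy model in Python. It is only a sketch: the dicts stand in for file trees, the paths and contents are made up for illustration, and nothing here touches real kernel or overlayfs APIs. `unpack_initrds()` mimics cpios being unpacked on top of each other into one tmpfs, and `overlay_lookup()` mimics what an overlayfs would give you if the base were an immutable erofs.

```python
from types import MappingProxyType

def unpack_initrds(archives):
    """Unpack archives in order into one tree; later entries replace earlier ones."""
    tmpfs = {}
    for archive in archives:
        tmpfs.update(archive)
    return tmpfs

def overlay_lookup(path, upper, lower):
    """Resolve a path overlayfs-style: upper layer first, then the immutable lower layer."""
    if path in upper:
        return upper[path]
    return lower.get(path)

# Hypothetical archive contents, for illustration only.
base = {"/init": "init from main initrd", "/etc/os-release": "base"}
sysext = {"/.extra/sysext/example.raw": "extension image"}
override = {"/etc/os-release": "override"}

# Today: every cpio lands in the same tmpfs, the last writer wins.
tree = unpack_initrds([base, sysext, override])

# Hypothetical erofs case: the base is read-only, so the extra cpios
# go into a separate tmpfs that is consulted first.
erofs = MappingProxyType(base)
extras = unpack_initrds([sysext, override])

print(tree["/etc/os-release"])
print(overlay_lookup("/init", extras, erofs))
print(overlay_lookup("/etc/os-release", extras, erofs))
```

The point of the second half is that the erofs layer is never written to; anything the cpios add or replace has to live in a layer next to it.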
> Even if everything is the same there are code paths which might not
> be taken during usual operation. An example would be services similar
> to the new systemd-bsod which are only triggered in emergencies.
> Having these in the cpio means that they will always be read and
> decompressed.
systemd-bsod is tiny though, less than 8K compressed here. Not sure it
is a good example.
> Using sysexts also has the drawback that each and every one of them
> has to be decompressed. I might be mistaken but I expect that this
> will be the case even if the extension-release in the sysext results
> in it being discarded which is obviously another big drawback.
sysexts are erofs or squashfs file systems with verity backing. Only
the sectors you access are decompressed.
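As a rough illustration of that difference, here is a toy Python model. It is a simplification under stated assumptions: real erofs compresses in clusters rather than per file, and the class names and file sizes below are invented for the example. It only counts how many bytes get decompressed before the first file can be read.

```python
import zlib

class UpFrontArchive:
    """cpio-like: one compressed stream that must be decompressed in full."""
    def __init__(self, files):
        self.index = {}
        offset = 0
        for name, data in files.items():
            self.index[name] = (offset, len(data))
            offset += len(data)
        self.compressed = zlib.compress(b"".join(files.values()))
        self.bytes_decompressed = 0
        self._blob = None

    def read(self, name):
        if self._blob is None:
            # The first access pays for the whole archive.
            self._blob = zlib.decompress(self.compressed)
            self.bytes_decompressed += len(self._blob)
        offset, size = self.index[name]
        return self._blob[offset:offset + size]

class OnDemandImage:
    """erofs-like: each piece is compressed separately and decompressed on access."""
    def __init__(self, files):
        self.chunks = {name: zlib.compress(data) for name, data in files.items()}
        self.bytes_decompressed = 0

    def read(self, name):
        data = zlib.decompress(self.chunks[name])
        self.bytes_decompressed += len(data)
        return data

# Made-up sizes: a small init, a small emergency tool, one large asset file.
files = {"init": b"i" * 1000, "bsod": b"b" * 8000, "assets": b"a" * 1_000_000}
cpio_like = UpFrontArchive(files)
erofs_like = OnDemandImage(files)

cpio_like.read("init")
erofs_like.read("init")
print(cpio_like.bytes_decompressed, erofs_like.bytes_decompressed)
```

In this model, reading only "init" costs the full archive in the first case and just that one file in the second.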
Lennart
--
Lennart Poettering, Berlin
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] initoverlayfs - a scalable initial filesystem
2023-12-12 21:02 ` Lennart Poettering
@ 2023-12-12 22:01 ` Nils Kattenbeck
2023-12-13 9:03 ` Lennart Poettering
0 siblings, 1 reply; 49+ messages in thread
From: Nils Kattenbeck @ 2023-12-12 22:01 UTC (permalink / raw)
To: Lennart Poettering
Cc: Eric Curtin, initramfs, systemd-devel, Stephen Smoogen,
Qubes OS Development Mailing List, Yariv Rachmani,
Douglas Landgraf
On Tue, Dec 12, 2023 at 10:02 PM Lennart Poettering
<lennart@poettering.net> wrote:
>
> If you have 7 cpio initrds then the kernel will allocate a tmpfs and
> unpack them all into it, one after the other, on top of each other,
> and then jump into the result.
>
> If you have an erofs and 7 cpio initrds, what are you going to do? You
> cannot extract into an erofs, it's immutable. You'd need something
> like overlayfs, but that would require (at least for now) an
> additional step in userspace, which is something to avoid.
>
> Alternatively (and preferred by me), the kernel would support a mode
> where it would unpack any cpios it gets into a tmpfs, and then pass an
> fsopen() fd for that to the executable it then invokes from the erofs.
> The executable could then mount that somewhere if it wants. But this
> would require a kernel patch.
Such a kernel patch would likely be the more advanced method.
I also saw that they now wrote to the LKML to potentially discuss
something like this.
The method with an overlayfs would likely be easier for init systems
to use but also less customizable.
> > Even if everything is the same there are code paths which might not
> > be taken during usual operation. An example would be services similar
> > to the new systemd-bsod which are only triggered in emergencies.
> > Having these in the cpio means that they will always be read and
> > decompressed.
>
> systemd-bsod is tiny though, less than 8K compressed here. Not sure it
> is a good example.
Yes, that is right, though it was the first and most universal thing
that came to mind.
A better example would be something like a fleet management SDK (in
Java or a similar language with a runtime) which phones home to a
management server, reporting a boot failure and publishing crash logs.
> > Using sysexts also has the drawback that each and every one of them
> > has to be decompressed. I might be mistaken but I expect that this
> > will be the case even if the extension-release in the sysext results
> > in it being discarded which is obviously another big drawback.
>
> sysexts are erofs or squashfs file systems with verity backing. Only
> the sectors you access are decompressed.
Okay, I forgot that they are erofs based; since cpio archives were
mentioned, I assumed they would be one.
Do they need to be fully read from disk to generate the cpio archive?
> Lennart
>
> --
> Lennart Poettering, Berlin
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] initoverlayfs - a scalable initial filesystem
2023-12-12 22:01 ` Nils Kattenbeck
@ 2023-12-13 9:03 ` Lennart Poettering
2023-12-14 1:17 ` Nils Kattenbeck
0 siblings, 1 reply; 49+ messages in thread
From: Lennart Poettering @ 2023-12-13 9:03 UTC (permalink / raw)
To: Nils Kattenbeck
Cc: Eric Curtin, initramfs, systemd-devel, Stephen Smoogen,
Qubes OS Development Mailing List, Yariv Rachmani,
Douglas Landgraf
On Di, 12.12.23 23:01, Nils Kattenbeck (nilskemail@gmail.com) wrote:
> > sysexts are erofs or squashfs file systems with verity backing. Only
> > the sectors you access are decompressed.
>
> Okay I forgot that they were erofs based and mentioned cpio archives
> so I assumed they would be one.
> Do they need to be fully read from disk to generate the cpio archive?
erofs is a file system, cpio is a serialized archive. Two different
things. The discussion here is whether to pass the initrd to the
kernel as one or the other. But no one is suggesting to convert one to
the other at boot time.
Lennart
--
Lennart Poettering, Berlin
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] initoverlayfs - a scalable initial filesystem
2023-12-13 9:03 ` Lennart Poettering
@ 2023-12-14 1:17 ` Nils Kattenbeck
2023-12-16 14:34 ` Lennart Poettering
0 siblings, 1 reply; 49+ messages in thread
From: Nils Kattenbeck @ 2023-12-14 1:17 UTC (permalink / raw)
To: Lennart Poettering
Cc: Eric Curtin, initramfs, systemd-devel, Stephen Smoogen,
Yariv Rachmani, Douglas Landgraf
On Wed, Dec 13, 2023 at 10:03 AM Lennart Poettering
<lennart@poettering.net> wrote:
>
> On Di, 12.12.23 23:01, Nils Kattenbeck (nilskemail@gmail.com) wrote:
>
> > > sysexts are erofs or squashfs file systems with verity backing. Only
> > > the sectors you access are decompressed.
> >
> > Okay I forgot that they were erofs based and mentioned cpio archives
> > so I assumed they would be one.
> > Do they need to be fully read from disk to generate the cpio archive?
>
> erofs is a file system, cpio is a serialized archive. Two different
> things. The discussion here is whether to pass the initrd to the
> kernel as one or the other. But no one is suggesting to convert one to
> the other at boot time.
I was referring to the following line from sd-stub's man page: "The
following resources are passed as initrd cpio archives to the booted
kernel: [...] /.extra/sysext/*.raw [...]". I assume the initrd
containing the sysexts has to be created at some point?
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] initoverlayfs - a scalable initial filesystem
2023-12-14 1:17 ` Nils Kattenbeck
@ 2023-12-16 14:34 ` Lennart Poettering
0 siblings, 0 replies; 49+ messages in thread
From: Lennart Poettering @ 2023-12-16 14:34 UTC (permalink / raw)
To: Nils Kattenbeck
Cc: Eric Curtin, initramfs, systemd-devel, Stephen Smoogen,
Yariv Rachmani, Douglas Landgraf
On Do, 14.12.23 02:17, Nils Kattenbeck (nilskemail@gmail.com) wrote:
> On Wed, Dec 13, 2023 at 10:03 AM Lennart Poettering
> <lennart@poettering.net> wrote:
> >
> > On Di, 12.12.23 23:01, Nils Kattenbeck (nilskemail@gmail.com) wrote:
> >
> > > > sysexts are erofs or squashfs file systems with verity backing. Only
> > > > the sectors you access are decompressed.
> > >
> > > Okay I forgot that they were erofs based and mentioned cpio archives
> > > so I assumed they would be one.
> > > Do they need to be fully read from disk to generate the cpio archive?
> >
> > erofs is a file system, cpio is a serialized archive. Two different
> > things. The discussion here is whether to pass the initrd to the
> > kernel as one or the other. But no one is suggesting to convert one to
> > the other at boot time.
>
> I was referring to the following line from sd-stub's man page: "The
> following resources are passed as initrd cpio archives to the booted
> kernel: [...] /.extra/sysext/*.raw [...]". I assume the initrd
> containing the sysexts has to be created at some point?
These cpios are created on-the-fly and placed into memory and passed
to the invoked kernel. And yes, for that the data they contain needs
to be read off disk first.
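For a sense of what "created on-the-fly and placed into memory" involves, here is a minimal Python sketch that assembles a "newc" format cpio archive in a memory buffer. It is not sd-stub's code (which is C running in the UEFI environment); the function names are invented, the file name in the usage line is made up, and a real initrd cpio would also carry entries for parent directories, which this sketch leaves out.

```python
import io

def _pad4(buf):
    """Pad the buffer with NUL bytes up to the next 4-byte boundary."""
    buf.write(b"\0" * (-buf.tell() % 4))

def _entry(buf, name, data, mode, ino):
    """Append one newc entry: 110-byte ASCII header, NUL-terminated name, data."""
    name_bytes = name.encode() + b"\0"
    # Field order: ino, mode, uid, gid, nlink, mtime, filesize,
    # devmajor, devminor, rdevmajor, rdevminor, namesize, check.
    fields = [ino, mode, 0, 0, 1, 0, len(data), 0, 0, 0, 0, len(name_bytes), 0]
    buf.write(b"070701" + b"".join(b"%08X" % f for f in fields))
    buf.write(name_bytes)
    _pad4(buf)
    buf.write(data)
    _pad4(buf)

def build_cpio(files):
    """Serialize {name: bytes} into a newc cpio archive held entirely in memory."""
    buf = io.BytesIO()
    for ino, (name, data) in enumerate(files.items(), start=1):
        _entry(buf, name, data, 0o100644, ino)  # regular file, mode 0644
    _entry(buf, "TRAILER!!!", b"", 0, 0)  # end-of-archive marker
    return buf.getvalue()

archive = build_cpio({"hello.txt": b"hi"})
print(len(archive))
```

Note that `build_cpio()` needs the complete contents of every file as input, which is the point above: the payload has to be read off disk in full before the archive can be handed to the kernel.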
Lennart
--
Lennart Poettering, Berlin
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] initoverlayfs - a scalable initial filesystem
2023-12-11 16:28 ` Demi Marie Obenour
2023-12-11 17:03 ` Eric Curtin
@ 2023-12-11 17:33 ` Neal Gompa
2023-12-11 20:15 ` Luca Boccassi
2023-12-12 17:50 ` Lennart Poettering
3 siblings, 0 replies; 49+ messages in thread
From: Neal Gompa @ 2023-12-11 17:33 UTC (permalink / raw)
To: Demi Marie Obenour
Cc: Lennart Poettering, Eric Curtin, initramfs, systemd-devel,
Stephen Smoogen, Qubes OS Development Mailing List,
Yariv Rachmani, Douglas Landgraf
On Mon, Dec 11, 2023 at 12:30 PM Demi Marie Obenour
<demi@invisiblethingslab.com> wrote:
>
> On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
> >
> > > Here is the boot sequence with initoverlayfs integrated, the
> > > mini-initramfs contains just enough to get storage drivers loaded and
> > > storage devices initialized. storage-init is a process that is not
> > > designed to replace init, it does just enough to initialize storage
> > > (performs a targeted udev trigger on storage), switches to
> > > initoverlayfs as root and then executes init.
> > >
> > > ```
> > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > >
> > > fw -> bootloader -> kernel -> storage-init -> init ----------------->
> > > ```
> >
> > I am not sure I follow what these chains are supposed to mean? Why are
> > there two lines?
> >
> > So, I generally would agree that the current initrd scheme is not
> > ideal, and we have been discussing better approaches. But I am not
> > sure your approach really is useful on generic systems for two
> > reasons:
> >
> > 1. no security model? you need to authenticate your initrd in
> > 2023. There's no excuse to not doing that anymore these days. Not
> > in automotive, and not anywhere else really.
> >
> > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > unlock their root disks with TPM2 and similar things. People use
> > RAID, LVM, and all that mess.
> >
> > Actually the above are kinda the same problem in a way: you need
> > complex storage, but if you need that you kinda need udev, and
> > services, and then also systemd and all that other stuff, and that's
> > why the system works like the system works right now.
> >
> > Whenever you devise a system like yours by cutting corners, and
> > declaring that you don't want TPM, you don't want signed initrds, you
> > don't want to support weird storage, you just solve your problem in a
> > very specific way, ignoring the big picture. Which is OK, *if* you can
> > actually really work without all that and are willing to maintain the
> > solution for your specific problem only.
> >
> > As I understand you are trying to solve multiple problems at once
> > here, and I think one should start with figuring out clearly what
> > those are before trying to address them, maybe without compromising on
> > security. So my guess is you want to address the following:
> >
> > 1. You don't want the whole big initrd to be read off disk on every
> > boot, but only the parts of it that are actually needed.
> >
> > 2. You don't want the whole big initrd to be fully decompressed on every
> > boot, but only the parts of it that are actually needed.
> >
> > 3. You want to share data between root fs and initrd
> >
> > 4. You want to save some boot time by not bringing up an init system
> > in the initrd once, then tearing it down again, and starting it
> > again from the root fs.
> >
> > For the items listed above I think you can find different solutions
> > which do not necessarily compromise security as much.
> >
> > So, in the list above you could address the latter three like this:
> >
> > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > loader load the erofs into contiguous memory, then use memmap=X!Y on
> > the kernel cmdline to synthesize a block device from that, which
> > you then mount directly (without any initrd) via
> > root=/dev/pmem0. This means your boot loader will still load the
> > whole image into memory, but only decompress the bits actually
> > needed. (It also has some other nice benefits I like, such as an
> > immutable rootfs, which tmpfs-based initrds don't have.)
> >
> > 3. Simply never transition to the root fs, don't mark the initrds in
> > systemd's eyes as an initrd (specifically: don't add an
> > /etc/initrd-release file to it). Instead, just merge resources of
> > the root fs into your initrd fs via overlayfs. systemd has
> > infrastructure for this: "systemd-sysext". It takes immutable,
> > authenticated erofs images (with verity, we call them "DDIs",
> > i.e. "discoverable disk images") that it overlays into /usr/. [You
> > could also very nicely combine this approach with systemd's
> > portable services, and nspawn containers, which operate on the same
> > authenticated images]. At MSFT we have a major product that works
> > exactly like this: the OS runs off a rootfs that is loaded as an
> > initrd, and everything that runs on top of this are just these
> > verity disk images, using overlayfs and portable services.
> >
> > 4. The proposal in 3 also addresses goal 4.
> >
> > Which leaves item 1, which is a bit harder to address. We have been
> > discussing this off and on internally too. A generic solution to this
> > is hard. My current thinking for this could be something like this,
> > covering the UEFI world: support sticking a DDI for the main initrd in
> > the ESP. The ESP is per definition unencrypted and unauthenticated,
> > but otherwise relatively well defined, i.e. known to be vfat and
> > discoverable via UUID on a GPT disk. So: build a minimal
> > single-process initrd into the kernel (i.e. UKI) that has exactly the
> > storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> > drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> > jump into the rootfs stored in the ESP. That latter then has proper
> > file system drivers, storage drivers, crypto stack, and can unlock the
> > real root. This would still be a pretty specific solution to one set
> > of devices though, as it could not cover network boots (i.e. where
> > there is just no ESP to boot from), but I think this could be kept
> > relatively close, as the logic in that case could just fall back into
> > loading the DDI that normally would still in the ESP fully into
> > memory.
>
> I don't think this is "a pretty specific solution to one set of devices"
> _at all_. To the contrary, it is _exactly_ what I want to see desktop
> systems moving to in the future.
>
> It solves the problem of large firmware images. It solves the problem
> of device-specific configuration, because one can use a file on the EFI
> system partition that is read by userspace and either treated as
> untrusted or TPM-signed. It means that one has a complete set of
> recovery tools in the event of a problem, rather than being limited to
> whatever one can squeeze into an initramfs. One can even include a full
> GUI stack (with accessibility support!), rather than just Plymouth. For
> Qubes OS, one can include enough of the Xen and Qubes toolstack to even
> launch virtual machines, allowing the use of USB devices and networking
> for recovery purposes. It even means that one can use a FIDO2 token to
> unlock the hard drive without a USB stack on the host. And because the
> initramfs _only_ needs to load the boot extension volume, it can be
> very, _very_ small, which works great with using Linux as a coreboot
> payload.
>
> The only problem I can see that this does not solve is network boot, but
> that is very much a niche use case when compared to the millions of
> Fedora or Debian desktop installs, or even the tens of thousands of
> Qubes OS installs. Furthermore, I would _much_ rather network boot be
> handled by userspace and kexec, rather than the closed source UEFI network
> stack.
>
Network boot is fairly common in some industries for workstations. In
particular, the film industry does this a fair bit to leverage
switching between workstation and renderfarm modes for workstation
hardware.
--
真実はいつも一つ!/ Always, there's only one truth!
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] initoverlayfs - a scalable initial filesystem
2023-12-11 16:28 ` Demi Marie Obenour
2023-12-11 17:03 ` Eric Curtin
2023-12-11 17:33 ` Neal Gompa
@ 2023-12-11 20:15 ` Luca Boccassi
2023-12-11 20:43 ` Demi Marie Obenour
2023-12-12 17:50 ` Lennart Poettering
3 siblings, 1 reply; 49+ messages in thread
From: Luca Boccassi @ 2023-12-11 20:15 UTC (permalink / raw)
To: Demi Marie Obenour
Cc: Lennart Poettering, Eric Curtin, initramfs, systemd-devel,
Stephen Smoogen, Qubes OS Development Mailing List,
Yariv Rachmani, Douglas Landgraf
On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour
<demi@invisiblethingslab.com> wrote:
>
> On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
> >
> > > Here is the boot sequence with initoverlayfs integrated, the
> > > mini-initramfs contains just enough to get storage drivers loaded and
> > > storage devices initialized. storage-init is a process that is not
> > > designed to replace init, it does just enough to initialize storage
> > > (performs a targeted udev trigger on storage), switches to
> > > initoverlayfs as root and then executes init.
> > >
> > > ```
> > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > >
> > > fw -> bootloader -> kernel -> storage-init -> init ----------------->
> > > ```
> >
> > I am not sure I follow what these chains are supposed to mean? Why are
> > there two lines?
> >
> > So, I generally would agree that the current initrd scheme is not
> > ideal, and we have been discussing better approaches. But I am not
> > sure your approach really is useful on generic systems for two
> > reasons:
> >
> > 1. no security model? you need to authenticate your initrd in
> > 2023. There's no excuse to not doing that anymore these days. Not
> > in automotive, and not anywhere else really.
> >
> > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > unlock their root disks with TPM2 and similar things. People use
> > RAID, LVM, and all that mess.
> >
> > Actually the above are kinda the same problem in a way: you need
> > complex storage, but if you need that you kinda need udev, and
> > services, and then also systemd and all that other stuff, and that's
> > why the system works like the system works right now.
> >
> > Whenever you devise a system like yours by cutting corners, and
> > declaring that you don't want TPM, you don't want signed initrds, you
> > don't want to support weird storage, you just solve your problem in a
> > very specific way, ignoring the big picture. Which is OK, *if* you can
> > actually really work without all that and are willing to maintain the
> > solution for your specific problem only.
> >
> > As I understand you are trying to solve multiple problems at once
> > here, and I think one should start with figuring out clearly what
> > those are before trying to address them, maybe without compromising on
> > security. So my guess is you want to address the following:
> >
> > 1. You don't want the whole big initrd to be read off disk on every
> > boot, but only the parts of it that are actually needed.
> >
> > 2. You don't want the whole big initrd to be fully decompressed on every
> > boot, but only the parts of it that are actually needed.
> >
> > 3. You want to share data between root fs and initrd
> >
> > 4. You want to save some boot time by not bringing up an init system
> > in the initrd once, then tearing it down again, and starting it
> > again from the root fs.
> >
> > For the items listed above I think you can find different solutions
> > which do not necessarily compromise security as much.
> >
> > So, in the list above you could address the latter three like this:
> >
> > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > loader load the erofs into contiguous memory, then use memmap=X!Y on
> > the kernel cmdline to synthesize a block device from that, which
> > you then mount directly (without any initrd) via
> > root=/dev/pmem0. This means your boot loader will still load the
> > whole image into memory, but only decompress the bits actually
> > needed. (It also has some other nice benefits I like, such as an
> > immutable rootfs, which tmpfs-based initrds don't have.)
> >
> > 3. Simply never transition to the root fs, don't mark the initrds in
> > systemd's eyes as an initrd (specifically: don't add an
> > /etc/initrd-release file to it). Instead, just merge resources of
> > the root fs into your initrd fs via overlayfs. systemd has
> > infrastructure for this: "systemd-sysext". It takes immutable,
> > authenticated erofs images (with verity, we call them "DDIs",
> > i.e. "discoverable disk images") that it overlays into /usr/. [You
> > could also very nicely combine this approach with systemd's
> > portable services, and nspawn containers, which operate on the same
> > authenticated images]. At MSFT we have a major product that works
> > exactly like this: the OS runs off a rootfs that is loaded as an
> > initrd, and everything that runs on top of this are just these
> > verity disk images, using overlayfs and portable services.
> >
> > 4. The proposal in 3 also addresses goal 4.
> >
> > Which leaves item 1, which is a bit harder to address. We have been
> > discussing this off and on internally too. A generic solution to this
> > is hard. My current thinking for this could be something like this,
> > covering the UEFI world: support sticking a DDI for the main initrd in
> > the ESP. The ESP is per definition unencrypted and unauthenticated,
> > but otherwise relatively well defined, i.e. known to be vfat and
> > discoverable via UUID on a GPT disk. So: build a minimal
> > single-process initrd into the kernel (i.e. UKI) that has exactly the
> > storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> > drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> > jump into the rootfs stored in the ESP. That latter then has proper
> > file system drivers, storage drivers, crypto stack, and can unlock the
> > real root. This would still be a pretty specific solution to one set
> > of devices though, as it could not cover network boots (i.e. where
> > there is just no ESP to boot from), but I think this could be kept
> > relatively close, as the logic in that case could just fall back into
> > loading the DDI that normally would still in the ESP fully into
> > memory.
>
> I don't think this is "a pretty specific solution to one set of devices"
> _at all_. To the contrary, it is _exactly_ what I want to see desktop
> systems moving to in the future.
>
> It solves the problem of large firmware images. It solves the problem
> of device-specific configuration, because one can use a file on the EFI
> system partition that is read by userspace and either treated as
> untrusted or TPM-signed.
All those problems are already solved, without inventing a new shell
scripting solution - we have DDIs and credentials. This is the exact
opposite of the direction we are pursuing: we want to _kill_ all this
initrd-specific infrastructure, tools, build systems, dependency
management and so on, because they are difficult to maintain, they
create a completely different environment than what is "normally" run,
and they end up reinventing everything the 'normal' image does. We
want to build initrds from packages - as in normal distribution
packages, not special sauce initrd-only packages, so that the same
code and the same configuration is used everywhere, in different
runtime modes. Because that's what distributions are good at:
creating package-based ecosystems, with good tooling, infrastructure
and so on.
The end goal is to build images without initramfs-tools/dracut and
just using packages, not to stick yet another glue script in front of
them, that needs yet more special initrd-only arcane magic to put
together, in order to save a handful of KBs.
And for ancient, legacy platforms that do not support modern APIs, the
old ways will still be there, and can be used. Nobody is going to take
away grub and dracut from the internet; if you have some special corner
case where you want to use them, they will still be there. But the fact
that such corner cases exist cannot stop the rest of the ecosystem
that is targeted at modern hardware from evolving into something
better, more maintainable and more straightforward.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] initoverlayfs - a scalable initial filesystem
2023-12-11 20:15 ` Luca Boccassi
@ 2023-12-11 20:43 ` Demi Marie Obenour
2023-12-11 20:58 ` Luca Boccassi
0 siblings, 1 reply; 49+ messages in thread
From: Demi Marie Obenour @ 2023-12-11 20:43 UTC (permalink / raw)
To: Luca Boccassi
Cc: Lennart Poettering, Eric Curtin, initramfs, systemd-devel,
Stephen Smoogen, Yariv Rachmani, Douglas Landgraf
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
On Mon, Dec 11, 2023 at 08:15:27PM +0000, Luca Boccassi wrote:
> On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour
> <demi@invisiblethingslab.com> wrote:
> >
> > On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > > On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
> > >
> > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > storage devices initialized. storage-init is a process that is not
> > > > designed to replace init, it does just enough to initialize storage
> > > > (performs a targeted udev trigger on storage), switches to
> > > > initoverlayfs as root and then executes init.
> > > >
> > > > ```
> > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > >
> > > > fw -> bootloader -> kernel -> storage-init -> init ----------------->
> > > > ```
> > >
> > > I am not sure I follow what these chains are supposed to mean? Why are
> > > there two lines?
> > >
> > > So, I generally would agree that the current initrd scheme is not
> > > ideal, and we have been discussing better approaches. But I am not
> > > sure your approach really is useful on generic systems for two
> > > reasons:
> > >
> > > 1. no security model? you need to authenticate your initrd in
> > > 2023. There's no excuse to not doing that anymore these days. Not
> > > in automotive, and not anywhere else really.
> > >
> > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > > unlock their root disks with TPM2 and similar things. People use
> > > RAID, LVM, and all that mess.
> > >
> > > Actually the above are kinda the same problem in a way: you need
> > > complex storage, but if you need that you kinda need udev, and
> > > services, and then also systemd and all that other stuff, and that's
> > > why the system works like the system works right now.
> > >
> > > Whenever you devise a system like yours by cutting corners, and
> > > declaring that you don't want TPM, you don't want signed initrds, you
> > > don't want to support weird storage, you just solve your problem in a
> > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > actually really work without all that and are willing to maintain the
> > > solution for your specific problem only.
> > >
> > > As I understand you are trying to solve multiple problems at once
> > > here, and I think one should start with figuring out clearly what
> > > those are before trying to address them, maybe without compromising on
> > > security. So my guess is you want to address the following:
> > >
> > > 1. You don't want the whole big initrd to be read off disk on every
> > > boot, but only the parts of it that are actually needed.
> > >
> > > 2. You don't want the whole big initrd to be fully decompressed on every
> > > boot, but only the parts of it that are actually needed.
> > >
> > > 3. You want to share data between root fs and initrd
> > >
> > > 4. You want to save some boot time by not bringing up an init system
> > > in the initrd once, then tearing it down again, and starting it
> > > again from the root fs.
> > >
> > > For the items listed above I think you can find different solutions
> > > which do not necessarily compromise security as much.
> > >
> > > So, in the list above you could address the latter three like this:
> > >
> > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > > loader load the erofs into contiguous memory, then use memmap=X!Y on
> > > the kernel cmdline to synthesize a block device from that, which
> > > you then mount directly (without any initrd) via
> > > root=/dev/pmem0. This means your boot loader will still load the
> > > whole image into memory, but only decompress the bits actually
> > > needed. (It also has some other nice benefits I like, such as an
> > > immutable rootfs, which tmpfs-based initrds don't have.)
> > >
> > > 3. Simply never transition to the root fs, don't marke the initrds in
> > > systemd's eyes as an initrd (specifically: don't add an
> > > /etc/initrd-release file to it). Instead, just merge resources of
> > > the root fs into your initrd fs via overlayfs. systemd has
> > > infrastructure for this: "systemd-sysext". It takes immutable,
> > > authenticated erofs images (with verity, we call them "DDIs",
> > > i.e. "discoverable disk images") that it overlays into /usr/. [You
> > > could also very nicely combine this approach with systemd's
> > > portable services, and nspawn containers, which operate on the same
> > > authenticated images]. At MSFT we have a major product that works
> > > exactly like this: the OS runs off a rootfs that is loaded as an
> > > initrd, and everything that runs on top of this are just these
> > > verity disk images, using overlayfs and portable services.
> > >
> > > 4. The proposal in 3 also addresses goal 4.
> > >
> > > Which leaves item 1, which is a bit harder to address. We have been
> > > discussing this off and on internally too. A generic solution to this
> > > is hard. My current thinking for this could be something like this,
> > > covering the UEFI world: support sticking a DDI for the main initrd in
> > > the ESP. The ESP is per definition unencrypted and unauthenticated,
> > > but otherwise relatively well defined, i.e. known to be vfat and
> > > discoverable via UUID on a GPT disk. So: build a minimal
> > > single-process initrd into the kernel (i.e. UKI) that has exactly the
> > > storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> > > drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> > > jump into the rootfs stored in the ESP. That latter then has proper
> > > file system drivers, storage drivers, crypto stack, and can unlock the
> > > real root. This would still be a pretty specific solution to one set
> > > of devices though, as it could not cover network boots (i.e. where
> > > there is just no ESP to boot from), but I think this could be kept
> > > relatively close, as the logic in that case could just fall back into
> > > loading the DDI that normally would sit in the ESP fully into
> > > memory.
> >
> > I don't think this is "a pretty specific solution to one set of devices"
> > _at all_. To the contrary, it is _exactly_ what I want to see desktop
> > systems moving to in the future.
> >
> > It solves the problem of large firmware images. It solves the problem
> > of device-specific configuration, because one can use a file on the EFI
> > system partition that is read by userspace and either treated as
> > untrusted or TPM-signed.
>
> All those problems are already solved, without inventing a new shell
> scripting solution - we have DDIs and credentials. This is the exact
> opposite of the direction we are pursuing: we want to _kill_ all these
> initrd-specific infrastructure, tools, build systems, dependency
> management and so on, because they are difficult to maintain, they
> create a completely different environment than what is "normally" run,
> and they end up reinventing everything the 'normal' image does. We
> want to build initrds from packages - as in normal distribution
> packages, not special sauce initrd-only packages, so that the same
> code and the same configuration is used everywhere, in different
> runtime modes. Because that's what distributions are good to do:
> creating package-based ecosystems, with good tooling, infrastructure
> and so on.
>
> The end goal is to build images without initramfs-tools/dracut and
> just using packages, not to stick yet another glue script in front of
> them, that needs yet more special initrd-only arcane magic to put
> together, in order to save a handful of KBs.

The initramfs being a RAM filesystem is exactly why keeping it small is
so critical. Lennart's suggestion solves this problem by eagerly
loading an image from disk, which is much less size-constrained. One
would use distribution packages to build this on-disk image.

> And for ancient, legacy platforms that do not support modern APIs, the
> old ways will still be there, and can be used. Nobody is going to take
> away grub and dracut from the internet, if you got some special corner
> case where you want to use it it will still be there, but the fact
> that such corner cases exist cannot stop the rest of the ecosystem
> that is targeted to modern hardware from evolving into something
> better, more maintainable and more straightforward.

The problem is not that UEFI is not usable in automotive systems. The
problem is that U-Boot (or any other UEFI implementation) is an extra
stage in the boot process, slows things down, and has more attack
surface.

--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
* Re: [RFC] initoverlayfs - a scalable initial filesystem
2023-12-11 20:43 ` Demi Marie Obenour
@ 2023-12-11 20:58 ` Luca Boccassi
2023-12-11 21:20 ` Demi Marie Obenour
2023-12-11 21:24 ` Eric Curtin
0 siblings, 2 replies; 49+ messages in thread
From: Luca Boccassi @ 2023-12-11 20:58 UTC (permalink / raw)
To: Demi Marie Obenour
Cc: Lennart Poettering, Eric Curtin, initramfs, systemd-devel,
Stephen Smoogen, Yariv Rachmani, Douglas Landgraf
On Mon, 11 Dec 2023 at 20:43, Demi Marie Obenour
<demi@invisiblethingslab.com> wrote:
>
> On Mon, Dec 11, 2023 at 08:15:27PM +0000, Luca Boccassi wrote:
> > On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour
> > <demi@invisiblethingslab.com> wrote:
> > >
> > > On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > > > On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
> > > >
> > > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > > storage devices initialized. storage-init is a process that is not
> > > > > designed to replace init, it does just enough to initialize storage
> > > > > (performs a targeted udev trigger on storage), switches to
> > > > > initoverlayfs as root and then executes init.
> > > > >
> > > > > ```
> > > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > > >
> > > > > fw -> bootloader -> kernel -> storage-init -> init ----------------->
> > > > > ```
> > > >
> > > > I am not sure I follow what these chains are supposed to mean? Why are
> > > > there two lines?
> > > >
> > > > So, I generally would agree that the current initrd scheme is not
> > > > ideal, and we have been discussing better approaches. But I am not
> > > > sure your approach really is useful on generic systems for two
> > > > reasons:
> > > >
> > > > 1. no security model? you need to authenticate your initrd in
> > > > > 2023. There's no excuse for not doing that anymore these days. Not
> > > > in automotive, and not anywhere else really.
> > > >
> > > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > > > unlock their root disks with TPM2 and similar things. People use
> > > > RAID, LVM, and all that mess.
> > > >
> > > > Actually the above are kinda the same problem in a way: you need
> > > > complex storage, but if you need that you kinda need udev, and
> > > > services, and then also systemd and all that other stuff, and that's
> > > > why the system works like the system works right now.
> > > >
> > > > Whenever you devise a system like yours by cutting corners, and
> > > > declaring that you don't want TPM, you don't want signed initrds, you
> > > > don't want to support weird storage, you just solve your problem in a
> > > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > > actually really work without all that and are willing to maintain the
> > > > solution for your specific problem only.
> > > >
> > > > As I understand you are trying to solve multiple problems at once
> > > > here, and I think one should start with figuring out clearly what
> > > > those are before trying to address them, maybe without compromising on
> > > > security. So my guess is you want to address the following:
> > > >
> > > > 1. You don't want the whole big initrd to be read off disk on every
> > > > boot, but only the parts of it that are actually needed.
> > > >
> > > > 2. You don't want the whole big initrd to be fully decompressed on every
> > > > boot, but only the parts of it that are actually needed.
> > > >
> > > > 3. You want to share data between root fs and initrd
> > > >
> > > > 4. You want to save some boot time by not bringing up an init system
> > > > in the initrd once, then tearing it down again, and starting it
> > > > again from the root fs.
> > > >
> > > > For the items listed above I think you can find different solutions
> > > > which do not necessarily compromise security as much.
> > > >
> > > > So, in the list above you could address the latter three like this:
> > > >
> > > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > > > > loader load the erofs into contiguous memory, then use memmap=X!Y on
> > > > > the kernel cmdline to synthesize a block device from that, which
> > > > > you then mount directly (without any initrd) via
> > > > > root=/dev/pmem0. This means your boot loader will still load the
> > > > > whole image into memory, but only decompress the bits actually
> > > > > needed. (It also has some other nice benefits I like, such as an
> > > > > immutable rootfs, which tmpfs-based initrds don't have.)
> > > > >
> > > > > 3. Simply never transition to the root fs, don't mark the initrd in
> > > > systemd's eyes as an initrd (specifically: don't add an
> > > > /etc/initrd-release file to it). Instead, just merge resources of
> > > > the root fs into your initrd fs via overlayfs. systemd has
> > > > infrastructure for this: "systemd-sysext". It takes immutable,
> > > > authenticated erofs images (with verity, we call them "DDIs",
> > > > i.e. "discoverable disk images") that it overlays into /usr/. [You
> > > > could also very nicely combine this approach with systemd's
> > > > > portable services, and nspawn containers, which operate on the same
> > > > authenticated images]. At MSFT we have a major product that works
> > > > exactly like this: the OS runs off a rootfs that is loaded as an
> > > > initrd, and everything that runs on top of this are just these
> > > > verity disk images, using overlayfs and portable services.
> > > >
> > > > 4. The proposal in 3 also addresses goal 4.
> > > >
> > > > Which leaves item 1, which is a bit harder to address. We have been
> > > > > discussing this off and on internally too. A generic solution to this
> > > > is hard. My current thinking for this could be something like this,
> > > > covering the UEFI world: support sticking a DDI for the main initrd in
> > > > the ESP. The ESP is per definition unencrypted and unauthenticated,
> > > > but otherwise relatively well defined, i.e. known to be vfat and
> > > > discoverable via UUID on a GPT disk. So: build a minimal
> > > > single-process initrd into the kernel (i.e. UKI) that has exactly the
> > > > storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> > > > drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> > > > jump into the rootfs stored in the ESP. That latter then has proper
> > > > file system drivers, storage drivers, crypto stack, and can unlock the
> > > > real root. This would still be a pretty specific solution to one set
> > > > of devices though, as it could not cover network boots (i.e. where
> > > > there is just no ESP to boot from), but I think this could be kept
> > > > relatively close, as the logic in that case could just fall back into
> > > > > loading the DDI that normally would sit in the ESP fully into
> > > > memory.
> > >
> > > I don't think this is "a pretty specific solution to one set of devices"
> > > _at all_. To the contrary, it is _exactly_ what I want to see desktop
> > > systems moving to in the future.
> > >
> > > It solves the problem of large firmware images. It solves the problem
> > > of device-specific configuration, because one can use a file on the EFI
> > > system partition that is read by userspace and either treated as
> > > untrusted or TPM-signed.
> >
> > All those problems are already solved, without inventing a new shell
> > scripting solution - we have DDIs and credentials. This is the exact
> > opposite of the direction we are pursuing: we want to _kill_ all these
> > initrd-specific infrastructure, tools, build systems, dependency
> > management and so on, because they are difficult to maintain, they
> > > create a completely different environment than what is "normally" run,
> > and they end up reinventing everything the 'normal' image does. We
> > want to build initrds from packages - as in normal distribution
> > packages, not special sauce initrd-only packages, so that the same
> > code and the same configuration is used everywhere, in different
> > runtime modes. Because that's what distributions are good to do:
> > creating package-based ecosystems, with good tooling, infrastructure
> > and so on.
> >
> > The end goal is to build images without initramfs-tools/dracut and
> > just using packages, not to stick yet another glue script in front of
> > them, that needs yet more special initrd-only arcane magic to put
> > together, in order to save a handful of KBs.
>
> The initramfs being a RAM filesystem is exactly why keeping it small is
> so critical. Lennart's suggestion solves this problem by eagerly
> loading an image from disk, which is much less size-constrained. One
> would use distribution packages to build this on-disk image.

This is already solved by using extension DDIs for optional packages.
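
As a rough illustration of what makes an extension DDI safe to layer on
a given base image: going by the systemd-sysext documentation, an
extension is merged only when the extension-release metadata it ships
matches the host's os-release. The Python sketch below is a simplified
model of that check, not systemd's code. The function names are made up
for illustration, and the architecture check is left out.

```python
def parse_release(text):
    """Parse os-release style KEY=VALUE lines into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip().strip('"').strip("'")
    return fields


def extension_matches_host(host, ext):
    """Simplified model of the sysext compatibility rule (an assumption
    based on the documented behaviour, not a copy of the real logic)."""
    ext_id = ext.get("ID")
    if ext_id == "_any":
        # The extension declares itself distribution-independent.
        return True
    if not ext_id or ext_id != host.get("ID"):
        return False
    host_level = host.get("SYSEXT_LEVEL")
    ext_level = ext.get("SYSEXT_LEVEL")
    if host_level and ext_level:
        # Both sides declare an extension API level: compare those.
        return host_level == ext_level
    host_version = host.get("VERSION_ID")
    if host_version:
        # Otherwise fall back to comparing the OS version as a whole.
        return ext.get("VERSION_ID") == host_version
    # Host carries no version information (e.g. a rolling release).
    return True


host = parse_release('ID=fedora\nVERSION_ID="39"\n')
ext = parse_release("ID=fedora\nVERSION_ID=39\n")
print(extension_matches_host(host, ext))
```

The point of the check is that the base image can stay small while
optional packages ship as separate images that are only accepted on the
OS version they were built for.
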
> > And for ancient, legacy platforms that do not support modern APIs, the
> > old ways will still be there, and can be used. Nobody is going to take
> > away grub and dracut from the internet, if you got some special corner
> > case where you want to use it it will still be there, but the fact
> > that such corner cases exist cannot stop the rest of the ecosystem
> > that is targeted to modern hardware from evolving into something
> > better, more maintainable and more straightforward.
>
> The problem is not that UEFI is not usable in automotive systems. The
> problem is that U-Boot (or any other UEFI implementation) is an extra
> stage in the boot process, slows things down, and has more attack
> surface.

Whatever firmware you use will have an attack surface; the interface
it provides, whether legacy BIOS or UEFI-based, is irrelevant to
that. Skipping or reimplementing all the verity, TPM, etc. logic also
increases the attack surface, as does adding initrd-only code that is
never tested or exercised outside of that limited context. If you are
running with legacy BIOS on ancient hardware, you will likely also lack
TPM, secure boot, and so on, so it's all moot and any security argument
goes out of the window. If anybody cares about platform security, then
TPM-capable and secure-boot-capable firmware with a modern, usable
interface like UEFI, running the same code in initrd and full system,
using dm-verity everywhere, is pretty much the best one can do.
* Re: [RFC] initoverlayfs - a scalable initial filesystem
2023-12-11 20:58 ` Luca Boccassi
@ 2023-12-11 21:20 ` Demi Marie Obenour
2023-12-11 21:45 ` Luca Boccassi
2023-12-11 21:24 ` Eric Curtin
1 sibling, 1 reply; 49+ messages in thread
From: Demi Marie Obenour @ 2023-12-11 21:20 UTC (permalink / raw)
To: Luca Boccassi
Cc: Lennart Poettering, Eric Curtin, initramfs, systemd-devel,
Stephen Smoogen, Yariv Rachmani, Douglas Landgraf
On Mon, Dec 11, 2023 at 08:58:58PM +0000, Luca Boccassi wrote:
> On Mon, 11 Dec 2023 at 20:43, Demi Marie Obenour
> <demi@invisiblethingslab.com> wrote:
> >
> > On Mon, Dec 11, 2023 at 08:15:27PM +0000, Luca Boccassi wrote:
> > > On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour
> > > <demi@invisiblethingslab.com> wrote:
> > > >
> > > > On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > > > > On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
> > > > >
> > > > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > > > storage devices initialized. storage-init is a process that is not
> > > > > > designed to replace init, it does just enough to initialize storage
> > > > > > (performs a targeted udev trigger on storage), switches to
> > > > > > initoverlayfs as root and then executes init.
> > > > > >
> > > > > > ```
> > > > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > > > >
> > > > > > fw -> bootloader -> kernel -> storage-init -> init ----------------->
> > > > > > ```
> > > > >
> > > > > I am not sure I follow what these chains are supposed to mean? Why are
> > > > > there two lines?
> > > > >
> > > > > So, I generally would agree that the current initrd scheme is not
> > > > > ideal, and we have been discussing better approaches. But I am not
> > > > > sure your approach really is useful on generic systems for two
> > > > > reasons:
> > > > >
> > > > > 1. no security model? you need to authenticate your initrd in
> > > > > > 2023. There's no excuse for not doing that anymore these days. Not
> > > > > in automotive, and not anywhere else really.
> > > > >
> > > > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > > > > unlock their root disks with TPM2 and similar things. People use
> > > > > RAID, LVM, and all that mess.
> > > > >
> > > > > Actually the above are kinda the same problem in a way: you need
> > > > > complex storage, but if you need that you kinda need udev, and
> > > > > services, and then also systemd and all that other stuff, and that's
> > > > > why the system works like the system works right now.
> > > > >
> > > > > Whenever you devise a system like yours by cutting corners, and
> > > > > declaring that you don't want TPM, you don't want signed initrds, you
> > > > > don't want to support weird storage, you just solve your problem in a
> > > > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > > > actually really work without all that and are willing to maintain the
> > > > > solution for your specific problem only.
> > > > >
> > > > > As I understand you are trying to solve multiple problems at once
> > > > > here, and I think one should start with figuring out clearly what
> > > > > those are before trying to address them, maybe without compromising on
> > > > > security. So my guess is you want to address the following:
> > > > >
> > > > > 1. You don't want the whole big initrd to be read off disk on every
> > > > > boot, but only the parts of it that are actually needed.
> > > > >
> > > > > 2. You don't want the whole big initrd to be fully decompressed on every
> > > > > boot, but only the parts of it that are actually needed.
> > > > >
> > > > > 3. You want to share data between root fs and initrd
> > > > >
> > > > > 4. You want to save some boot time by not bringing up an init system
> > > > > in the initrd once, then tearing it down again, and starting it
> > > > > again from the root fs.
> > > > >
> > > > > For the items listed above I think you can find different solutions
> > > > > which do not necessarily compromise security as much.
> > > > >
> > > > > So, in the list above you could address the latter three like this:
> > > > >
> > > > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > > > > > loader load the erofs into contiguous memory, then use memmap=X!Y on
> > > > > > the kernel cmdline to synthesize a block device from that, which
> > > > > > you then mount directly (without any initrd) via
> > > > > > root=/dev/pmem0. This means your boot loader will still load the
> > > > > > whole image into memory, but only decompress the bits actually
> > > > > > needed. (It also has some other nice benefits I like, such as an
> > > > > > immutable rootfs, which tmpfs-based initrds don't have.)
> > > > > >
> > > > > > 3. Simply never transition to the root fs, don't mark the initrd in
> > > > > systemd's eyes as an initrd (specifically: don't add an
> > > > > /etc/initrd-release file to it). Instead, just merge resources of
> > > > > the root fs into your initrd fs via overlayfs. systemd has
> > > > > infrastructure for this: "systemd-sysext". It takes immutable,
> > > > > authenticated erofs images (with verity, we call them "DDIs",
> > > > > i.e. "discoverable disk images") that it overlays into /usr/. [You
> > > > > could also very nicely combine this approach with systemd's
> > > > > > portable services, and nspawn containers, which operate on the same
> > > > > authenticated images]. At MSFT we have a major product that works
> > > > > exactly like this: the OS runs off a rootfs that is loaded as an
> > > > > initrd, and everything that runs on top of this are just these
> > > > > verity disk images, using overlayfs and portable services.
> > > > >
> > > > > 4. The proposal in 3 also addresses goal 4.
> > > > >
> > > > > Which leaves item 1, which is a bit harder to address. We have been
> > > > > > discussing this off and on internally too. A generic solution to this
> > > > > is hard. My current thinking for this could be something like this,
> > > > > covering the UEFI world: support sticking a DDI for the main initrd in
> > > > > the ESP. The ESP is per definition unencrypted and unauthenticated,
> > > > > but otherwise relatively well defined, i.e. known to be vfat and
> > > > > discoverable via UUID on a GPT disk. So: build a minimal
> > > > > single-process initrd into the kernel (i.e. UKI) that has exactly the
> > > > > storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> > > > > drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> > > > > jump into the rootfs stored in the ESP. That latter then has proper
> > > > > file system drivers, storage drivers, crypto stack, and can unlock the
> > > > > real root. This would still be a pretty specific solution to one set
> > > > > of devices though, as it could not cover network boots (i.e. where
> > > > > there is just no ESP to boot from), but I think this could be kept
> > > > > relatively close, as the logic in that case could just fall back into
> > > > > > loading the DDI that normally would sit in the ESP fully into
> > > > > memory.
> > > >
> > > > I don't think this is "a pretty specific solution to one set of devices"
> > > > _at all_. To the contrary, it is _exactly_ what I want to see desktop
> > > > systems moving to in the future.
> > > >
> > > > It solves the problem of large firmware images. It solves the problem
> > > > of device-specific configuration, because one can use a file on the EFI
> > > > system partition that is read by userspace and either treated as
> > > > untrusted or TPM-signed.
> > >
> > > All those problems are already solved, without inventing a new shell
> > > scripting solution - we have DDIs and credentials. This is the exact
> > > opposite of the direction we are pursuing: we want to _kill_ all these
> > > initrd-specific infrastructure, tools, build systems, dependency
> > > management and so on, because they are difficult to maintain, they
> > > > create a completely different environment than what is "normally" run,
> > > and they end up reinventing everything the 'normal' image does. We
> > > want to build initrds from packages - as in normal distribution
> > > packages, not special sauce initrd-only packages, so that the same
> > > code and the same configuration is used everywhere, in different
> > > runtime modes. Because that's what distributions are good to do:
> > > creating package-based ecosystems, with good tooling, infrastructure
> > > and so on.
> > >
> > > The end goal is to build images without initramfs-tools/dracut and
> > > just using packages, not to stick yet another glue script in front of
> > > them, that needs yet more special initrd-only arcane magic to put
> > > together, in order to save a handful of KBs.
> >
> > The initramfs being a RAM filesystem is exactly why keeping it small is
> > so critical. Lennart's suggestion solves this problem by eagerly
> > loading an image from disk, which is much less size-constrained. One
> > would use distribution packages to build this on-disk image.
>
> This is already solved by using extension DDIs for optional packages.

What about non-optional packages? The goal is to _require_ the on-disk
image to boot, so that full-featured UI toolkits can be used to e.g.
prompt for LUKS passphrases. Ideally, the initramfs would be as minimal
as possible.

> > > And for ancient, legacy platforms that do not support modern APIs, the
> > > old ways will still be there, and can be used. Nobody is going to take
> > > away grub and dracut from the internet, if you got some special corner
> > > case where you want to use it it will still be there, but the fact
> > > that such corner cases exist cannot stop the rest of the ecosystem
> > > that is targeted to modern hardware from evolving into something
> > > better, more maintainable and more straightforward.
> >
> > The problem is not that UEFI is not usable in automotive systems. The
> > problem is that U-Boot (or any other UEFI implementation) is an extra
> > stage in the boot process, slows things down, and has more attack
> > surface.
>
> Whatever firmware you use will have an attack surface, the interface
> it provides - whether legacy bios or uefi-based - is irrelevant for
> that. Skipping or reimplementing all the verity, tpm, etc logic also
> increases the attack surface, as does adding initrd-only code that is
> never tested and exercised outside of that limited context. If you are
> running with legacy bios on ancient hardware you also will likely lack
> tpm, secure boot, and so on, so it's all moot, any security argument
> goes out of the window. If anybody cares about platform security, then
> a tpm-capable and secureboot-capable firmware with a modern, usable
> interface like uefi, running the same code in initrd and full system,
> using dm-verity everywhere, is pretty much the best one can do.

Neither Chrome OS devices nor Macs with Apple silicon use UEFI, and both
have better platform security than any UEFI-based device on the market I
am aware of.

--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
* Re: [RFC] initoverlayfs - a scalable initial filesystem
2023-12-11 21:20 ` Demi Marie Obenour
@ 2023-12-11 21:45 ` Luca Boccassi
2023-12-12 3:47 ` Paul Menzel
` (2 more replies)
0 siblings, 3 replies; 49+ messages in thread
From: Luca Boccassi @ 2023-12-11 21:45 UTC (permalink / raw)
To: Demi Marie Obenour
Cc: Lennart Poettering, Eric Curtin, initramfs, systemd-devel,
Stephen Smoogen, Yariv Rachmani, Douglas Landgraf
On Mon, 11 Dec 2023 at 21:20, Demi Marie Obenour
<demi@invisiblethingslab.com> wrote:
>
> On Mon, Dec 11, 2023 at 08:58:58PM +0000, Luca Boccassi wrote:
> > On Mon, 11 Dec 2023 at 20:43, Demi Marie Obenour
> > <demi@invisiblethingslab.com> wrote:
> > >
> > > On Mon, Dec 11, 2023 at 08:15:27PM +0000, Luca Boccassi wrote:
> > > > On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour
> > > > <demi@invisiblethingslab.com> wrote:
> > > > >
> > > > > On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > > > > > On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
> > > > > >
> > > > > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > > > > storage devices initialized. storage-init is a process that is not
> > > > > > > designed to replace init, it does just enough to initialize storage
> > > > > > > (performs a targeted udev trigger on storage), switches to
> > > > > > > initoverlayfs as root and then executes init.
> > > > > > >
> > > > > > > ```
> > > > > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > > > > >
> > > > > > > fw -> bootloader -> kernel -> storage-init -> init ----------------->
> > > > > > > ```
> > > > > >
> > > > > > I am not sure I follow what these chains are supposed to mean? Why are
> > > > > > there two lines?
> > > > > >
> > > > > > So, I generally would agree that the current initrd scheme is not
> > > > > > ideal, and we have been discussing better approaches. But I am not
> > > > > > sure your approach really is useful on generic systems for two
> > > > > > reasons:
> > > > > >
> > > > > > 1. no security model? you need to authenticate your initrd in
> > > > > > > 2023. There's no excuse for not doing that anymore these days. Not
> > > > > > in automotive, and not anywhere else really.
> > > > > >
> > > > > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > > > > > unlock their root disks with TPM2 and similar things. People use
> > > > > > RAID, LVM, and all that mess.
> > > > > >
> > > > > > Actually the above are kinda the same problem in a way: you need
> > > > > > complex storage, but if you need that you kinda need udev, and
> > > > > > services, and then also systemd and all that other stuff, and that's
> > > > > > why the system works like the system works right now.
> > > > > >
> > > > > > Whenever you devise a system like yours by cutting corners, and
> > > > > > declaring that you don't want TPM, you don't want signed initrds, you
> > > > > > don't want to support weird storage, you just solve your problem in a
> > > > > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > > > > actually really work without all that and are willing to maintain the
> > > > > > solution for your specific problem only.
> > > > > >
> > > > > > As I understand you are trying to solve multiple problems at once
> > > > > > here, and I think one should start with figuring out clearly what
> > > > > > those are before trying to address them, maybe without compromising on
> > > > > > security. So my guess is you want to address the following:
> > > > > >
> > > > > > 1. You don't want the whole big initrd to be read off disk on every
> > > > > > boot, but only the parts of it that are actually needed.
> > > > > >
> > > > > > 2. You don't want the whole big initrd to be fully decompressed on every
> > > > > > boot, but only the parts of it that are actually needed.
> > > > > >
> > > > > > 3. You want to share data between root fs and initrd
> > > > > >
> > > > > > 4. You want to save some boot time by not bringing up an init system
> > > > > > in the initrd once, then tearing it down again, and starting it
> > > > > > again from the root fs.
> > > > > >
> > > > > > For the items listed above I think you can find different solutions
> > > > > > which do not necessarily compromise security as much.
> > > > > >
> > > > > > So, in the list above you could address the latter three like this:
> > > > > >
> > > > > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > > > > > > loader load the erofs into contiguous memory, then use memmap=X!Y on
> > > > > > > the kernel cmdline to synthesize a block device from that, which
> > > > > > > you then mount directly (without any initrd) via
> > > > > > > root=/dev/pmem0. This means your boot loader will still load the
> > > > > > > whole image into memory, but only decompress the bits actually
> > > > > > > needed. (It also has some other nice benefits I like, such as an
> > > > > > immutable rootfs, which tmpfs-based initrds don't have.)
> > > > > >
> > > > > > > 3. Simply never transition to the root fs, don't mark the initrd in
> > > > > > systemd's eyes as an initrd (specifically: don't add an
> > > > > > /etc/initrd-release file to it). Instead, just merge resources of
> > > > > > the root fs into your initrd fs via overlayfs. systemd has
> > > > > > infrastructure for this: "systemd-sysext". It takes immutable,
> > > > > > authenticated erofs images (with verity, we call them "DDIs",
> > > > > > i.e. "discoverable disk images") that it overlays into /usr/. [You
> > > > > > could also very nicely combine this approach with systemd's
> > > > > > > portable services, and nspawn containers, which operate on the same
> > > > > > authenticated images]. At MSFT we have a major product that works
> > > > > > exactly like this: the OS runs off a rootfs that is loaded as an
> > > > > > initrd, and everything that runs on top of this are just these
> > > > > > verity disk images, using overlayfs and portable services.
> > > > > >
> > > > > > 4. The proposal in 3 also addresses goal 4.
> > > > > >
> > > > > > Which leaves item 1, which is a bit harder to address. We have been
> > > > > > > discussing this off and on internally too. A generic solution to this
> > > > > > is hard. My current thinking for this could be something like this,
> > > > > > covering the UEFI world: support sticking a DDI for the main initrd in
> > > > > > the ESP. The ESP is per definition unencrypted and unauthenticated,
> > > > > > but otherwise relatively well defined, i.e. known to be vfat and
> > > > > > discoverable via UUID on a GPT disk. So: build a minimal
> > > > > > single-process initrd into the kernel (i.e. UKI) that has exactly the
> > > > > > storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> > > > > > drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> > > > > > jump into the rootfs stored in the ESP. That latter then has proper
> > > > > > file system drivers, storage drivers, crypto stack, and can unlock the
> > > > > > real root. This would still be a pretty specific solution to one set
> > > > > > of devices though, as it could not cover network boots (i.e. where
> > > > > > there is just no ESP to boot from), but I think this could be kept
> > > > > > relatively close, as the logic in that case could just fall back into
> > > > > > > loading the DDI that normally would sit in the ESP fully into
> > > > > > memory.
> > > > >
> > > > > I don't think this is "a pretty specific solution to one set of devices"
> > > > > _at all_. To the contrary, it is _exactly_ what I want to see desktop
> > > > > systems moving to in the future.
> > > > >
> > > > > It solves the problem of large firmware images. It solves the problem
> > > > > of device-specific configuration, because one can use a file on the EFI
> > > > > system partition that is read by userspace and either treated as
> > > > > untrusted or TPM-signed.
> > > >
> > > > All those problems are already solved, without inventing a new shell
> > > > scripting solution - we have DDIs and credentials. This is the exact
> > > > opposite of the direction we are pursuing: we want to _kill_ all these
> > > > initrd-specific infrastructure, tools, build systems, dependency
> > > > management and so on, because they are difficult to maintain, they
> > > > create a completely different environment than what is "normally" run,
> > > > and they end up reinventing everything the 'normal' image does. We
> > > > want to build initrds from packages - as in normal distribution
> > > > packages, not special sauce initrd-only packages, so that the same
> > > > code and the same configuration is used everywhere, in different
> > > > runtime modes. Because that's what distributions are good to do:
> > > > creating package-based ecosystems, with good tooling, infrastructure
> > > > and so on.
> > > >
> > > > The end goal is to build images without initramfs-tools/dracut and
> > > > just using packages, not to stick yet another glue script in front of
> > > > them, that needs yet more special initrd-only arcane magic to put
> > > > together, in order to save a handful of KBs.
> > >
> > > The initramfs being a RAM filesystem is exactly why keeping it small is
> > > so critical. Lennart's suggestion solves this problem by eagerly
> > > loading an image from disk, which is much less size-constrained. One
> > > would use distribution packages to build this on-disk image.
> >
> > This is already solved by using extension DDIs for optional packages.
>
> What about non-optional packages? The goal is to _require_ the on-disk
> image to boot, so that full-featured UI toolkits can be used to e.g.
> prompt for LUKS passphrases. Ideally, the initramfs would be as minimal
> as possible.
You can use DDIs for anything you want, outside of systemd itself.
> > > > And for ancient, legacy platforms that do not support modern APIs, the
> > > > old ways will still be there, and can be used. Nobody is going to take
> > > > away grub and dracut from the internet, if you got some special corner
> > > > case where you want to use it it will still be there, but the fact
> > > > that such corner cases exist cannot stop the rest of the ecosystem
> > > > that is targeted to modern hardware from evolving into something
> > > > better, more maintainable and more straightforward.
> > >
> > > The problem is not that UEFI is not usable in automotive systems. The
> > > problem is that U-Boot (or any other UEFI implementation) is an extra
> > > stage in the boot process, slows things down, and has more attack
> > > surface.
> >
> > Whatever firmware you use will have an attack surface, the interface
> > it provides - whether legacy bios or uefi-based - is irrelevant for
> > that. Skipping or reimplementing all the verity, tpm, etc logic also
> > increases the attack surface, as does adding initrd-only code that is
> > never tested and exercised outside of that limited context. If you are
> > running with legacy bios on ancient hardware you also will likely lack
> > tpm, secure boot, and so on, so it's all moot, any security argument
> > goes out of the window. If anybody cares about platform security, then
> > a tpm-capable and secureboot-capable firmware with a modern, usable
> > interface like uefi, running the same code in initrd and full system,
> > using dm-verity everywhere, is pretty much the best one can do.
>
> Neither Chrome OS devices nor Macs with Apple silicon use UEFI, and both
> have better platform security than any UEFI-based device on the market I
> am aware of.
We are talking about Linux distributions here. If one wants to use
proprietary systems, sure, there are better things out there, but
that's off topic.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] initoverlayfs - a scalable initial filesystem
2023-12-11 21:45 ` Luca Boccassi
@ 2023-12-12 3:47 ` Paul Menzel
2023-12-12 3:56 ` Paul Menzel
2023-12-12 15:26 ` Paul Menzel
2 siblings, 0 replies; 49+ messages in thread
From: Paul Menzel @ 2023-12-12 3:47 UTC (permalink / raw)
To: Luca Boccassi
Cc: Demi Marie Obenour, initramfs, systemd-devel, Eric Curtin,
Yariv Rachmani, Lennart Poettering, Douglas Landgraf,
Stephen Smoogen
Dear Luca,
On 11.12.23 at 22:45, Luca Boccassi wrote:
> On Mon, 11 Dec 2023 at 21:20, Demi Marie Obenour wrote:
>>
>> On Mon, Dec 11, 2023 at 08:58:58PM +0000, Luca Boccassi wrote:
>>> On Mon, 11 Dec 2023 at 20:43, Demi Marie Obenour wrote:
>>>> On Mon, Dec 11, 2023 at 08:15:27PM +0000, Luca Boccassi wrote:
>>>>> On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour
>>>>>> On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
>>>>>>> On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
[…]
>>>>> And for ancient, legacy platforms that do not support modern APIs, the
>>>>> old ways will still be there, and can be used. Nobody is going to take
>>>>> away grub and dracut from the internet, if you got some special corner
>>>>> case where you want to use it it will still be there, but the fact
>>>>> that such corner cases exist cannot stop the rest of the ecosystem
>>>>> that is targeted to modern hardware from evolving into something
>>>>> better, more maintainable and more straightforward.
>>>>
>>>> The problem is not that UEFI is not usable in automotive systems. The
>>>> problem is that U-Boot (or any other UEFI implementation) is an extra
>>>> stage in the boot process, slows things down, and has more attack
>>>> surface.
>>>
>>> Whatever firmware you use will have an attack surface, the interface
>>> it provides - whether legacy bios or uefi-based - is irrelevant for
>>> that. Skipping or reimplementing all the verity, tpm, etc logic also
>>> increases the attack surface, as does adding initrd-only code that is
>>> never tested and exercised outside of that limited context. If you are
>>> running with legacy bios on ancient hardware you also will likely lack
>>> tpm, secure boot, and so on, so it's all moot, any security argument
>>> goes out of the window. If anybody cares about platform security, then
>>> a tpm-capable and secureboot-capable firmware with a modern, usable
>>> interface like uefi, running the same code in initrd and full system,
>>> using dm-verity everywhere, is pretty much the best one can do.
>>
>> Neither Chrome OS devices nor Macs with Apple silicon use UEFI, and both
>> have better platform security than any UEFI-based device on the market I
>> am aware of.
>
> We are talking about Linux distributions here. If one wants to use
> proprietary systems, sure, there are better things out there, but
> that's off topic.
In what way is ChromeOS more proprietary than other GNU/Linux
distributions that allow installing the Chrome browser?
Kind regards,
Paul
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] initoverlayfs - a scalable initial filesystem
2023-12-11 21:45 ` Luca Boccassi
2023-12-12 3:47 ` Paul Menzel
@ 2023-12-12 3:56 ` Paul Menzel
2023-12-12 15:26 ` Paul Menzel
2 siblings, 0 replies; 49+ messages in thread
From: Paul Menzel @ 2023-12-12 3:56 UTC (permalink / raw)
To: Luca Boccassi
Cc: Demi Marie Obenour, initramfs, systemd-devel, Eric Curtin,
Yariv Rachmani, Lennart Poettering, Douglas Landgraf,
Stephen Smoogen
Dear Luca,
On 11.12.23 at 22:45, Luca Boccassi wrote:
> On Mon, 11 Dec 2023 at 21:20, Demi Marie Obenour wrote:
>>
>> On Mon, Dec 11, 2023 at 08:58:58PM +0000, Luca Boccassi wrote:
>>> On Mon, 11 Dec 2023 at 20:43, Demi Marie Obenour wrote:
>>>> On Mon, Dec 11, 2023 at 08:15:27PM +0000, Luca Boccassi wrote:
>>>>> On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour
>>>>>> On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
>>>>>>> On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
[…]
>>>>> And for ancient, legacy platforms that do not support modern APIs, the
>>>>> old ways will still be there, and can be used. Nobody is going to take
>>>>> away grub and dracut from the internet, if you got some special corner
>>>>> case where you want to use it it will still be there, but the fact
>>>>> that such corner cases exist cannot stop the rest of the ecosystem
>>>>> that is targeted to modern hardware from evolving into something
>>>>> better, more maintainable and more straightforward.
>>>>
>>>> The problem is not that UEFI is not usable in automotive systems. The
>>>> problem is that U-Boot (or any other UEFI implementation) is an extra
>>>> stage in the boot process, slows things down, and has more attack
>>>> surface.
>>>
>>> Whatever firmware you use will have an attack surface, the interface
>>> it provides - whether legacy bios or uefi-based - is irrelevant for
>>> that. Skipping or reimplementing all the verity, tpm, etc logic also
>>> increases the attack surface, as does adding initrd-only code that is
>>> never tested and exercised outside of that limited context. If you are
>>> running with legacy bios on ancient hardware you also will likely lack
>>> tpm, secure boot, and so on, so it's all moot, any security argument
>>> goes out of the window. If anybody cares about platform security, then
>>> a tpm-capable and secureboot-capable firmware with a modern, usable
>>> interface like uefi, running the same code in initrd and full system,
>>> using dm-verity everywhere, is pretty much the best one can do.
>>
>> Neither Chrome OS devices nor Macs with Apple silicon use UEFI, and both
>> have better platform security than any UEFI-based device on the market I
>> am aware of.
>
> We are talking about Linux distributions here. If one wants to use
> proprietary systems, sure, there are better things out there, but
> that's off topic.
In what way is ChromeOS more proprietary than other GNU/Linux
distributions that allow installing the Chrome browser?
Kind regards,
Paul
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] initoverlayfs - a scalable initial filesystem
2023-12-11 21:45 ` Luca Boccassi
2023-12-12 3:47 ` Paul Menzel
2023-12-12 3:56 ` Paul Menzel
@ 2023-12-12 15:26 ` Paul Menzel
2 siblings, 0 replies; 49+ messages in thread
From: Paul Menzel @ 2023-12-12 15:26 UTC (permalink / raw)
To: Luca Boccassi
Cc: Demi Marie Obenour, initramfs, systemd-devel, Eric Curtin,
Yariv Rachmani, Lennart Poettering, Douglas Landgraf,
Stephen Smoogen
[Sorry for the spam to the people in Cc. Now the real address.]
Dear Luca,
On 11.12.23 at 22:45, Luca Boccassi wrote:
> On Mon, 11 Dec 2023 at 21:20, Demi Marie Obenour wrote:
>>
>> On Mon, Dec 11, 2023 at 08:58:58PM +0000, Luca Boccassi wrote:
>>> On Mon, 11 Dec 2023 at 20:43, Demi Marie Obenour wrote:
>>>> On Mon, Dec 11, 2023 at 08:15:27PM +0000, Luca Boccassi wrote:
>>>>> On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour
>>>>>> On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
>>>>>>> On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
[…]
>>>>> And for ancient, legacy platforms that do not support modern APIs, the
>>>>> old ways will still be there, and can be used. Nobody is going to take
>>>>> away grub and dracut from the internet, if you got some special corner
>>>>> case where you want to use it it will still be there, but the fact
>>>>> that such corner cases exist cannot stop the rest of the ecosystem
>>>>> that is targeted to modern hardware from evolving into something
>>>>> better, more maintainable and more straightforward.
>>>>
>>>> The problem is not that UEFI is not usable in automotive systems. The
>>>> problem is that U-Boot (or any other UEFI implementation) is an extra
>>>> stage in the boot process, slows things down, and has more attack
>>>> surface.
>>>
>>> Whatever firmware you use will have an attack surface, the interface
>>> it provides - whether legacy bios or uefi-based - is irrelevant for
>>> that. Skipping or reimplementing all the verity, tpm, etc logic also
>>> increases the attack surface, as does adding initrd-only code that is
>>> never tested and exercised outside of that limited context. If you are
>>> running with legacy bios on ancient hardware you also will likely lack
>>> tpm, secure boot, and so on, so it's all moot, any security argument
>>> goes out of the window. If anybody cares about platform security, then
>>> a tpm-capable and secureboot-capable firmware with a modern, usable
>>> interface like uefi, running the same code in initrd and full system,
>>> using dm-verity everywhere, is pretty much the best one can do.
>>
>> Neither Chrome OS devices nor Macs with Apple silicon use UEFI, and both
>> have better platform security than any UEFI-based device on the market I
>> am aware of.
>
> We are talking about Linux distributions here. If one wants to use
> proprietary systems, sure, there are better things out there, but
> that's off topic.
In what way is ChromeOS more proprietary than other GNU/Linux
distributions that allow installing the Chrome browser?
Kind regards,
Paul
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] initoverlayfs - a scalable initial filesystem
2023-12-11 20:58 ` Luca Boccassi
2023-12-11 21:20 ` Demi Marie Obenour
@ 2023-12-11 21:24 ` Eric Curtin
1 sibling, 0 replies; 49+ messages in thread
From: Eric Curtin @ 2023-12-11 21:24 UTC (permalink / raw)
To: Luca Boccassi
Cc: Demi Marie Obenour, Lennart Poettering, initramfs, systemd-devel,
Stephen Smoogen, Yariv Rachmani, Douglas Landgraf
On Mon, 11 Dec 2023 at 20:59, Luca Boccassi <bluca@debian.org> wrote:
>
> On Mon, 11 Dec 2023 at 20:43, Demi Marie Obenour
> <demi@invisiblethingslab.com> wrote:
> >
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA512
> >
> > On Mon, Dec 11, 2023 at 08:15:27PM +0000, Luca Boccassi wrote:
> > > On Mon, 11 Dec 2023 at 17:30, Demi Marie Obenour
> > > <demi@invisiblethingslab.com> wrote:
> > > >
> > > > On Mon, Dec 11, 2023 at 10:57:58AM +0100, Lennart Poettering wrote:
> > > > > On Fr, 08.12.23 17:59, Eric Curtin (ecurtin@redhat.com) wrote:
> > > > >
> > > > > > Here is the boot sequence with initoverlayfs integrated, the
> > > > > > mini-initramfs contains just enough to get storage drivers loaded and
> > > > > > storage devices initialized. storage-init is a process that is not
> > > > > > designed to replace init, it does just enough to initialize storage
> > > > > > (performs a targeted udev trigger on storage), switches to
> > > > > > initoverlayfs as root and then executes init.
> > > > > >
> > > > > > ```
> > > > > > fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
> > > > > >
> > > > > > fw -> bootloader -> kernel -> storage-init -> init ----------------->
> > > > > > ```
> > > > >
> > > > > I am not sure I follow what these chains are supposed to mean? Why are
> > > > > there two lines?
> > > > >
> > > > > So, I generally would agree that the current initrd scheme is not
> > > > > ideal, and we have been discussing better approaches. But I am not
> > > > > sure your approach really is useful on generic systems for two
> > > > > reasons:
> > > > >
> > > > > 1. no security model? you need to authenticate your initrd in
> > > > > > 2023. There's no excuse to not doing that anymore these days. Not
> > > > > in automotive, and not anywhere else really.
> > > > >
> > > > > 2. no way to deal with complex storage? i.e. people use FDE, want to
> > > > > unlock their root disks with TPM2 and similar things. People use
> > > > > RAID, LVM, and all that mess.
> > > > >
> > > > > Actually the above are kinda the same problem in a way: you need
> > > > > complex storage, but if you need that you kinda need udev, and
> > > > > services, and then also systemd and all that other stuff, and that's
> > > > > why the system works like the system works right now.
> > > > >
> > > > > Whenever you devise a system like yours by cutting corners, and
> > > > > declaring that you don't want TPM, you don't want signed initrds, you
> > > > > don't want to support weird storage, you just solve your problem in a
> > > > > very specific way, ignoring the big picture. Which is OK, *if* you can
> > > > > actually really work without all that and are willing to maintain the
> > > > > solution for your specific problem only.
> > > > >
> > > > > As I understand you are trying to solve multiple problems at once
> > > > > here, and I think one should start with figuring out clearly what
> > > > > those are before trying to address them, maybe without compromising on
> > > > > security. So my guess is you want to address the following:
> > > > >
> > > > > 1. You don't want the whole big initrd to be read off disk on every
> > > > > boot, but only the parts of it that are actually needed.
> > > > >
> > > > > 2. You don't want the whole big initrd to be fully decompressed on every
> > > > > boot, but only the parts of it that are actually needed.
> > > > >
> > > > > 3. You want to share data between root fs and initrd
> > > > >
> > > > > 4. You want to save some boot time by not bringing up an init system
> > > > > in the initrd once, then tearing it down again, and starting it
> > > > > again from the root fs.
> > > > >
> > > > > For the items listed above I think you can find different solutions
> > > > > which do not necessarily compromise security as much.
> > > > >
> > > > > So, in the list above you could address the latter three like this:
> > > > >
> > > > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > > > > > loader load the erofs into contiguous memory, then use memmap=X!Y on
> > > > > > the kernel cmdline to synthesize a block device from that, which
> > > > > > you then mount directly (without any initrd) via
> > > > > > root=/dev/pmem0. This means your boot loader will still load the
> > > > > > whole image into memory, but only decompress the bits actually
> > > > > > needed. (It also has some other nice benefits I like, such as an
> > > > > immutable rootfs, which tmpfs-based initrds don't have.)
> > > > >
> > > > > > 3. Simply never transition to the root fs, don't mark the initrd in
> > > > > systemd's eyes as an initrd (specifically: don't add an
> > > > > /etc/initrd-release file to it). Instead, just merge resources of
> > > > > the root fs into your initrd fs via overlayfs. systemd has
> > > > > infrastructure for this: "systemd-sysext". It takes immutable,
> > > > > authenticated erofs images (with verity, we call them "DDIs",
> > > > > i.e. "discoverable disk images") that it overlays into /usr/. [You
> > > > > could also very nicely combine this approach with systemd's
> > > > > > portable services, and nspawn containers, which operate on the same
> > > > > authenticated images]. At MSFT we have a major product that works
> > > > > exactly like this: the OS runs off a rootfs that is loaded as an
> > > > > initrd, and everything that runs on top of this are just these
> > > > > verity disk images, using overlayfs and portable services.
> > > > >
> > > > > 4. The proposal in 3 also addresses goal 4.
> > > > >
> > > > > Which leaves item 1, which is a bit harder to address. We have been
> > > > > > discussing this off and on internally too. A generic solution to this
> > > > > is hard. My current thinking for this could be something like this,
> > > > > covering the UEFI world: support sticking a DDI for the main initrd in
> > > > > the ESP. The ESP is per definition unencrypted and unauthenticated,
> > > > > but otherwise relatively well defined, i.e. known to be vfat and
> > > > > discoverable via UUID on a GPT disk. So: build a minimal
> > > > > single-process initrd into the kernel (i.e. UKI) that has exactly the
> > > > > storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> > > > > drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> > > > > jump into the rootfs stored in the ESP. That latter then has proper
> > > > > file system drivers, storage drivers, crypto stack, and can unlock the
> > > > > real root. This would still be a pretty specific solution to one set
> > > > > of devices though, as it could not cover network boots (i.e. where
> > > > > there is just no ESP to boot from), but I think this could be kept
> > > > > relatively close, as the logic in that case could just fall back into
> > > > > > loading the DDI that normally would sit in the ESP fully into
> > > > > memory.
> > > >
> > > > I don't think this is "a pretty specific solution to one set of devices"
> > > > _at all_. To the contrary, it is _exactly_ what I want to see desktop
> > > > systems moving to in the future.
> > > >
> > > > It solves the problem of large firmware images. It solves the problem
> > > > of device-specific configuration, because one can use a file on the EFI
> > > > system partition that is read by userspace and either treated as
> > > > untrusted or TPM-signed.
> > >
> > > All those problems are already solved, without inventing a new shell
> > > scripting solution - we have DDIs and credentials. This is the exact
> > > opposite of the direction we are pursuing: we want to _kill_ all these
> > > initrd-specific infrastructure, tools, build systems, dependency
> > > management and so on, because they are difficult to maintain, they
> > > create a completely different environment than what is "normally" run,
> > > and they end up reinventing everything the 'normal' image does. We
> > > want to build initrds from packages - as in normal distribution
> > > packages, not special sauce initrd-only packages, so that the same
> > > code and the same configuration is used everywhere, in different
> > > runtime modes. Because that's what distributions are good to do:
> > > creating package-based ecosystems, with good tooling, infrastructure
> > > and so on.
> > >
> > > The end goal is to build images without initramfs-tools/dracut and
> > > just using packages, not to stick yet another glue script in front of
> > > them, that needs yet more special initrd-only arcane magic to put
> > > together, in order to save a handful of KBs.
> >
> > The initramfs being a RAM filesystem is exactly why keeping it small is
> > so critical. Lennart's suggestion solves this problem by eagerly
> > loading an image from disk, which is much less size-constrained. One
> > would use distribution packages to build this on-disk image.
>
> This is already solved by using extension DDIs for optional packages.
>
> > > And for ancient, legacy platforms that do not support modern APIs, the
> > > old ways will still be there, and can be used. Nobody is going to take
> > > away grub and dracut from the internet, if you got some special corner
> > > case where you want to use it it will still be there, but the fact
> > > that such corner cases exist cannot stop the rest of the ecosystem
> > > that is targeted to modern hardware from evolving into something
> > > better, more maintainable and more straightforward.
> >
> > The problem is not that UEFI is not usable in automotive systems. The
> > problem is that U-Boot (or any other UEFI implementation) is an extra
> > stage in the boot process, slows things down, and has more attack
> > surface.
>
> Whatever firmware you use will have an attack surface, the interface
> it provides - whether legacy bios or uefi-based - is irrelevant for
> that. Skipping or reimplementing all the verity, tpm, etc logic also
> increases the attack surface, as does adding initrd-only code that is
> never tested and exercised outside of that limited context. If you are
> running with legacy bios on ancient hardware you also will likely lack
> tpm, secure boot, and so on, so it's all moot, any security argument
> goes out of the window. If anybody cares about platform security, then
> a tpm-capable and secureboot-capable firmware with a modern, usable
> interface like uefi, running the same code in initrd and full system,
> using dm-verity everywhere, is pretty much the best one can do.
I am unsure how many new systems are being developed with legacy BIOS,
but alternative firmware platforms do exist that are just as secure as
UEFI; the Android Boot Image format is one example. I am pretty sure
UKIs took influence from that format, either directly or indirectly.
Everything on x86 is UEFI, but other architectures like ARM are important.
And when you are deploying on ARM as an OS distributor, it can be pretty
hard to tell partners how to boot before the Linux kernel starts, which
makes it pretty hard to assume grub, sd-boot, sd-stub, etc.
But what you can do is design from Linux kernel boot onwards to the
best of your ability, and I think kernel-space and user-space could
benefit from just decompressing the bytes you use.
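As a rough illustration of "just decompressing the bytes you use", here
is a toy Python sketch. It is not how erofs or initoverlayfs is
implemented; `BLOCK_SIZE`, `compress_blocks` and `LazyImage` are
hypothetical names made up for this example, and zlib stands in for
whatever compressor a real image format would use. The image is
compressed in independent fixed-size blocks, and a read only
decompresses the blocks it touches:

```python
import zlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for illustration


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Compress each fixed-size block independently so it can be read alone."""
    return [
        zlib.compress(data[offset:offset + block_size])
        for offset in range(0, len(data), block_size)
    ]


class LazyImage:
    """Toy image reader that decompresses a block only when a read touches it."""

    def __init__(self, blocks: list[bytes], block_size: int = BLOCK_SIZE):
        self.blocks = blocks
        self.block_size = block_size
        self.cache: dict[int, bytes] = {}  # block index -> decompressed bytes

    def _block(self, index: int) -> bytes:
        # Decompress on first access, then serve later reads from the cache.
        if index not in self.cache:
            self.cache[index] = zlib.decompress(self.blocks[index])
        return self.cache[index]

    def read(self, offset: int, length: int) -> bytes:
        if length <= 0:
            return b""
        first = offset // self.block_size
        last = min((offset + length - 1) // self.block_size, len(self.blocks) - 1)
        chunk = b"".join(self._block(i) for i in range(first, last + 1))
        start = offset - first * self.block_size
        return chunk[start:start + length]
```

With a 16 KiB image, `LazyImage(compress_blocks(data)).read(5000, 100)`
decompresses one block out of four; a packed, compressed cpio has to be
unpacked in full before anything in it can be used.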
Whether we use dracut or something that composes initramfs using rpms,
these structures/containers/etc. are just filesystems at the end of
the day.
Is mise le meas/Regards,
Eric Curtin
>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [RFC] initoverlayfs - a scalable initial filesystem
2023-12-11 16:28 ` Demi Marie Obenour
` (2 preceding siblings ...)
2023-12-11 20:15 ` Luca Boccassi
@ 2023-12-12 17:50 ` Lennart Poettering
3 siblings, 0 replies; 49+ messages in thread
From: Lennart Poettering @ 2023-12-12 17:50 UTC (permalink / raw)
To: Demi Marie Obenour
Cc: Eric Curtin, initramfs, systemd-devel, Stephen Smoogen,
Qubes OS Development Mailing List, Yariv Rachmani,
Douglas Landgraf
On Mo, 11.12.23 11:28, Demi Marie Obenour (demi@invisiblethingslab.com) wrote:
> I don't think this is "a pretty specific solution to one set of devices"
> _at all_. To the contrary, it is _exactly_ what I want to see desktop
> systems moving to in the future.
>
> It solves the problem of large firmware images. It solves the problem
> of device-specific configuration, because one can use a file on the EFI
> system partition that is read by userspace and either treated as
> untrusted or TPM-signed. It means that one has a complete set of
> recovery tools in the event of a problem, rather than being limited to
> whatever one can squeeze into an initramfs. One can even include a full
> GUI stack (with accessibility support!), rather than just Plymouth. For
> Qubes OS, one can include enough of the Xen and Qubes toolstack to even
> launch virtual machines, allowing the use of USB devices and networking
> for recovery purposes. It even means that one can use a FIDO2 token to
> unlock the hard drive without a USB stack on the host. And because the
> initramfs _only_ needs to load the boot extension volume, it can be
> very, _very_ small, which works great with using Linux as a coreboot
> payload.
systemd's "system extension" concept ("sysexts") already allows you to
do all that. The stuff I was fantasizing about would only change one
thing: instead of sd-stub from uefi mode already putting the sysexts
you installed into memory for the initrd to consume, it would be some
proto-initrd that would do so. This does not really change what you
can do with this, but mostly is just an optimization, reducing iops
and memory use a bit, and thus boot time latency.
> The only problem I can see that this does not solve is network boot, but
> that is very much a niche use case when compared to the millions of
> Fedora or Debian desktop installs, or even the tens of thousands of
> Qubes OS installs. Furthermore, I would _much_ rather network boot be
> handled by userspace and kexec, rather than the closed source UEFI network
> stack.
Well, somebody's niche is somebody else's common case. In VM/cloud/server
scenarios network booting is not as "niche" as it might be on the desktop.
> It does require some care when upgrading, as the dm-verity image and the
> UKI cannot both be updated atomically, but one can solve that by first
> writing the new dm-verity image to a separate location. The UKI will
> try both the old and new locations for the dm-verity image and
> rename the new image over the old one on success. The wrong image will
> simply fail to mount as its root hash will be wrong.
systemd-sysext already covers this just fine: you can encode in their
"extension-release" file which base images they match up with, and
systemd-sysext will then find the right one to apply, and ignore the
others. Thus just make sure you drop in the sysexts first, and the UKI
last and things should be perfectly robust.
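The matching can be sketched roughly as follows. This is a simplified
Python reading of the rule described in the systemd-sysext man page, not
systemd's actual code: `parse_release` and `extension_matches` are
hypothetical helpers, and the ARCHITECTURE= check and other corner cases
are left out. The extension's ID= must equal the host's (or be "_any");
then SYSEXT_LEVEL= is compared if both sides set it, otherwise
VERSION_ID=:

```python
def parse_release(text: str) -> dict[str, str]:
    """Parse os-release style KEY=VALUE lines, skipping comments and blanks."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip().strip("\"'")
    return fields


def extension_matches(host: dict[str, str], ext: dict[str, str]) -> bool:
    """Simplified check: does this extension-release fit the host os-release?"""
    ext_id = ext.get("ID")
    if ext_id == "_any":
        return True  # extension declares itself compatible with any host
    if ext_id != host.get("ID"):
        return False
    host_level, ext_level = host.get("SYSEXT_LEVEL"), ext.get("SYSEXT_LEVEL")
    if host_level and ext_level:
        return host_level == ext_level
    # Fall back to comparing the OS version as a whole.
    return host.get("VERSION_ID") == ext.get("VERSION_ID")
```

So with an old and a new sysext both on disk, only the one whose
extension-release matches the booted image passes this check, and the
other is skipped.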
Lennart
--
Lennart Poettering, Berlin
^ permalink raw reply [flat|nested] 49+ messages in thread