Re: [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem

From: Dave Chinner <david@fromorbit.com>
To: Christian Brauner <brauner@kernel.org>
Cc: Giuseppe Scrivano <gscrivan@redhat.com>,
	Amir Goldstein <amir73il@gmail.com>,
	Gao Xiang <hsiangkao@linux.alibaba.com>,
	Alexander Larsson <alexl@redhat.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Miklos Szeredi <miklos@szeredi.hu>,
	Yurii Zubrytskyi <zyy@google.com>,
	Eugene Zemtsov <ezemtsov@google.com>,
	Vivek Goyal <vgoyal@redhat.com>,
	Al Viro <viro@zeniv.linux.org.uk>
Subject: Re: [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem
Date: Wed, 18 Jan 2023 11:22:42 +1100
Message-ID: <20230118002242.GB937597@dread.disaster.area>
In-Reply-To: <20230117152756.jbwmeq724potyzju@wittgenstein>

On Tue, Jan 17, 2023 at 04:27:56PM +0100, Christian Brauner wrote:
> On Tue, Jan 17, 2023 at 02:56:56PM +0100, Giuseppe Scrivano wrote:
> > Christian Brauner <brauner@kernel.org> writes:
> > 2) no multi repo support:
> > 
> > Both reflinks and hardlinks do not work across mount points, so we
> 
> Just fwiw, afaict reflinks work across mount points since at least 5.18.

They might work for NFS server *file clones* across different exports
within the same NFS server (or server cluster), but they most
certainly don't work across mountpoints for local filesystems, or
across different types of filesystems.

I'm not here to advocate for composefs as the right solution, I'm
just pointing out that the proposed alternatives do not, in any way,
have the same critical behavioural characteristics that composefs
provides to container orchestration systems and hence do not solve
the problems that composefs is attempting to solve.

In short: any solution that requires userspace to create a new
filesystem hierarchy one file at a time via standard syscall
mechanisms is not going to perform acceptably at scale - that's a
major problem that composefs addresses.

The whole problem with file copying to create images - even with
reflink or hardlinks avoiding data copying - is the overhead of
creating and destroying those copies in the first place. A reflink
copy of tens of thousands of files in a complex directory
structure is not free - each individual reflink has a time, CPU,
memory and IO cost to it. The teardown cost is similar - the only
way to remove the "container image" built with reflinks is "rm -rf",
and that has significant time, CPU, memory and IO costs associated
with it as well.
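
To make the per-file cost concrete, here is a minimal Python sketch of
what building an image tree out of reflinks looks like from userspace.
The names reflink_file and clone_tree are hypothetical. The FICLONE
value is _IOW(0x94, 9, int) as encoded on x86 and arm (other
architectures may encode it differently), and the ioctl only succeeds
when source and destination live on the same reflink-capable
filesystem. Symlinks, ownership, xattrs and timestamps are ignored
here; handling them adds further syscalls per object.

```python
import fcntl
import os

# _IOW(0x94, 9, int) from linux/fs.h; value assumed for x86/arm encodings.
FICLONE = 0x40049409


def reflink_file(src, dst):
    """Clone one file with the FICLONE ioctl (needs a reflink-capable fs)."""
    with open(src, "rb") as s, open(dst, "wb") as d:
        fcntl.ioctl(d.fileno(), FICLONE, s.fileno())


def clone_tree(src_root, dst_root, clone=reflink_file):
    """Recreate src_root under dst_root, one object at a time.

    Returns the number of per-object operations (mkdir or clone) issued,
    which grows linearly with the size of the image.
    """
    ops = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target, exist_ok=True)  # one mkdir per directory
        ops += 1
        for name in filenames:
            # open + open + ioctl + close + close for every single file
            clone(os.path.join(dirpath, name), os.path.join(target, name))
            ops += 1
    return ops
```

Every one of those operations has to be undone again by the unlink
and rmdir calls that "rm -rf" issues at teardown.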

Further, you can't ship container images to remote hosts using
reflink copies - they can only be created at runtime on the host
that the container will be instantiated on. IOWs, the entire cost of
reflink copies for container instances must be taken at container
instantiation and destruction time.

When you have container instances that might only be needed for a
few seconds, taking half a minute to set up the container instance
and then another half a minute to tear it down just isn't viable -
we need instantiation and teardown times in the order of a second or
two.

From my reading of the code, composefs is based around the concept
of a verifiable "shipping manifest", where the filesystem namespace
presented to users by the kernel is derived from the manifest rather
than from some other filesystem namespace. Overlay, reflinks, etc
all use some other filesystem namespace to generate the container
namespace that links to the common data, whilst composefs uses the
manifest for that.

The use of a manifest file means there is almost zero container setup
overhead - ship the manifest file, mount it, all done - and zero
teardown overhead as unmounting the filesystem is all that is needed
to remove all traces of the container instance from the system.

In having a custom manifest format, the manifest can easily contain
verification information alongside the pointer to the content the
namespace should expose. i.e. the manifest references a secure
content addressed repository that is protected by fsverity and
contains the fsverity digests itself. Hence it doesn't rely on the
repository to self-verify; it ensures that the repository files
actually contain the data the manifest expects them to contain.
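
As a rough illustration of that idea (not the composefs on-disk
format), here is a Python sketch in which a manifest maps namespace
paths to expected content digests and each repository object is
checked against the digest the manifest carries. The names
object_path and verify_manifest and the two-character fan-out layout
are assumptions for the example. Plain SHA-256 over the file contents
stands in for the fsverity digest, which is really a Merkle tree based
value computed and enforced by the kernel.

```python
import hashlib
import os


def object_path(repo, digest):
    """Hypothetical content-addressed layout: <repo>/<2 hex chars>/<rest>."""
    return os.path.join(repo, digest[:2], digest[2:])


def verify_manifest(manifest, repo):
    """Return the manifest paths whose backing object is missing or altered.

    manifest maps a namespace path to the expected hex digest of its data.
    The digest comes from the manifest, not from the repository, so a
    tampered repository object cannot vouch for itself.
    """
    bad = []
    for path, digest in manifest.items():
        try:
            with open(object_path(repo, digest), "rb") as f:
                actual = hashlib.sha256(f.read()).hexdigest()
        except FileNotFoundError:
            bad.append(path)
            continue
        if actual != digest:
            bad.append(path)
    return bad
```

Two paths with identical content share one repository object, which
is where the opportunistic sharing in the subject line comes from.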

Hence if the composefs kernel module is provided with a mechanism
for validating the chain of trust for the manifest file that a user
is trying to mount, then we just don't care who the mounting user
is.  This architecture is a viable path to rootless mounting of
pre-built third party container images.

Also, with the host's content addressed repository being managed
separately by the trusted host and distro package management, the
manifest is not unique to a single container host. The distro can
build manifests so that containers are running known, signed and
verified container images built by the distro. The container
orchestration software or admin could also build manifests on demand
and sign them.

If the manifest is not signed, not signed with a key loaded
into the kernel keyring, or does not pass verification, then we
simply fall back to root-in-the-init-ns permissions being required
to mount the manifest. This fallback is exactly the same security
model we have for every other type of filesystem image that the
linux kernel can mount - we trust root not to be mounting malicious
images.
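
The fallback rule above can be written down as a small decision
function. This is only a model of the policy being proposed here, not
code from the composefs patches; may_mount and its parameter names
are hypothetical.

```python
def may_mount(signed, key_in_keyring, verifies, is_init_ns_root):
    """Model of the proposed policy for mounting a manifest.

    A manifest is trusted only if it is signed, the signing key is loaded
    in the kernel keyring, and it passes verification. Anything else
    falls back to requiring root in the init namespace.
    """
    trusted = signed and key_in_keyring and verifies
    if trusted:
        return True  # we don't care who the mounting user is
    return is_init_ns_root  # same rule as any other filesystem image
```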

Essentially, I don't think any of the filesystems in the linux
kernel currently provide a viable solution to the problem that
composefs is trying to solve. We need a different way of solving the
ephemeral container namespace creation and destruction overhead
problem. Composefs provides a mechanism that not only solves this
problem and potentially several others, but is also easy to
retrofit into existing production container stacks.

As such, I think composefs is definitely worth further time and
investment as a unique line of filesystem development for Linux.
Solve the chain of trust problem (i.e. crypto signing for the
manifest files) and we potentially have game changing container
infrastructure in a couple of thousand lines of code...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com