Re: Initial patches for Incremental FS

From: Eugene Zemtsov <ezemtsov@google.com>
To: linux-fsdevel@vger.kernel.org
Cc: Al Viro <viro@zeniv.linux.org.uk>,
	tytso@mit.edu, Amir Goldstein <amir73il@gmail.com>,
	miklos@szeredi.hu, richard.weinberger@gmail.com
Subject: Re: Initial patches for Incremental FS
Date: Thu, 2 May 2019 21:23:31 -0700
Message-ID: <CAK8JDrFZW1jwOmhq+YVDPJi9jWWrCRkwpqQ085EouVSyzw-1cg@mail.gmail.com>
In-Reply-To: <20190502132623.GU23075@ZenIV.linux.org.uk>

On Thu, May 2, 2019 at 6:26 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Why not CODA, though, with local fs as cache?

On Thu, May 2, 2019 at 4:20 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> This sounds very useful.
>
> Why does it have to be a new special-purpose Linux virtual file?
> Why not FUSE, which is meant for this purpose?
> Those are things that you should explain when you are proposing a new
> filesystem,
> but I will answer for you - because FUSE page fault will incur high
> latency also after
> blocks are locally available in your backend store. Right?
>
> How about fscache support for FUSE then?
> You can even write your own fscache backend if the existing ones don't
> fit your needs for some reason.
>
> Piling logic into the kernel is not the answer.
> Adding the missing interfaces to the kernel is the answer.
>

Thanks for the interest and feedback. What I dreaded most was silence.

I probably should have given a bit more detail in the introductory email.
Important features we’re aiming for:

1. An attempt to read a missing data block gives a userspace data loader a
chance to fetch it. Once a block is loaded (in advance or after a page fault),
it is saved to local backing storage, and subsequent reads of the same block
are served directly by the kernel. [Implemented]

2. Block level compression. It saves space on a device, while still allowing
very granular loading and mapping. Less granular compression would trigger
loading of more data than absolutely necessary, and that’s the thing we
want to avoid. [Implemented]

3. Block level integrity verification. The signature scheme is similar to
dm-verity or fs-verity. In other words, each file has a Merkle tree with
crypto-digests of 4KB blocks. The root digest is signed with RSASSA or ECDSA.
Each time a data block is read, its digest is calculated and checked against
the Merkle tree; if the check fails, the read operation fails as well.
Ideally I’d like to use the fs-verity API for that. [Not implemented yet.]

4. New files can be pushed into incremental-fs “externally” when an app needs
a new resource or a binary. This is needed for situations when a new resource
or a new version of code is available, e.g. a user just changed the system
language to Spanish, or a developer rolled out an app update.
Things change over time and this means that we can’t just incrementally
load a precooked ext4 image and mount it via a loopback device. [Implemented]

5. No need to support writes or file resizing. It eliminates a lot of
complexity.
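
As an illustration of items 1 and 2, here is a minimal userspace sketch in
Python. It is not the driver code: the names BlockStore, push_block, read_block
and fake_loader are hypothetical, the 4KB block size is assumed, and zlib only
stands in for whatever compressor is actually used. It shows a missing block
being fetched once through a loader, stored compressed on its own, and then
served locally on later reads.

```python
import zlib
from typing import Callable, Dict

BLOCK_SIZE = 4096  # assumed block size for this sketch


class BlockStore:
    """Toy model of a backing store that fills missing blocks on demand."""

    def __init__(self, loader: Callable[[int], bytes]):
        self._loader = loader                 # hypothetical userspace data loader
        self._blocks: Dict[int, bytes] = {}   # block index -> compressed bytes
        self.loader_calls = 0                 # how often the loader was needed

    def push_block(self, index: int, data: bytes) -> None:
        """Store a block (e.g. loaded in advance); each block is compressed alone."""
        self._blocks[index] = zlib.compress(data)

    def read_block(self, index: int) -> bytes:
        """Return one block, asking the loader only if it is missing locally."""
        if index not in self._blocks:
            self.loader_calls += 1
            self.push_block(index, self._loader(index))
        return zlib.decompress(self._blocks[index])


def fake_loader(index: int) -> bytes:
    # Stand-in for a network fetch: deterministic, compressible content.
    return bytes([index % 256]) * BLOCK_SIZE


store = BlockStore(fake_loader)
first = store.read_block(7)    # miss: the loader is asked for block 7
second = store.read_block(7)   # hit: served from the local store
```

Because every block is compressed independently, reading block 7 never
requires fetching or decompressing its neighbours, which is the point of
compressing at block granularity.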

Not all of these features are implemented yet, but all of them will be
needed to achieve our goals:
 - Apps can be delivered incrementally without having to wait for extra data.
   At the same time, given enough time, an app can be downloaded fully, with
   no need to keep a connection open after that.
 - An app’s integrity should be verifiable without having to read all its
   blocks.
 - Local storage and battery need to be conserved.
 - App binaries and resources can change over time.
   Such changes are triggered by external events.
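
To show why a Merkle tree lets integrity be checked without reading every
block, here is a small Python sketch. It is an illustration only: the binary
tree, the duplicate-last-digest padding and SHA-256 are assumptions chosen for
brevity, not the layout dm-verity or fs-verity use, and the signature check on
the root digest is left out (the root is assumed to be already trusted). The
helper names build_tree, proof_for and verify_block are hypothetical.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4096  # 4KB data blocks


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(blocks: List[bytes]) -> List[List[bytes]]:
    """Return the tree levels, leaf digests first; the last level is the root."""
    levels = [[h(b) for b in blocks]]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur.append(cur[-1])  # pad odd levels by repeating the last digest
        levels.append([h(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def proof_for(levels: List[List[bytes]], index: int) -> List[bytes]:
    """Collect the sibling digests on the path from block `index` to the root."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])
        index //= 2
    return path


def verify_block(block: bytes, index: int, path: List[bytes], root: bytes) -> bool:
    """Recompute the root from one block plus its sibling digests."""
    digest = h(block)
    for sibling in path:
        if index % 2:
            digest = h(sibling + digest)
        else:
            digest = h(digest + sibling)
        index //= 2
    return digest == root


blocks = [bytes([i]) * BLOCK_SIZE for i in range(5)]
levels = build_tree(blocks)
root = levels[-1][0]                      # assumed to be verified via signature
path = proof_for(levels, 4)
ok = verify_block(blocks[4], 4, path, root)
tampered = verify_block(b"\x00" * BLOCK_SIZE, 4, path, root)
```

Verifying block 4 touches only that block and a handful of digests, so the
cost grows with the tree height rather than with the file size.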

I’d like to comment on proposed alternative solutions:

FUSE
We have a FUSE-based prototype. Though functional, it turned out to be battery
hungry, and its read performance left much to be desired.
Our measurements roughly matched the results in the article linked from
incrementalfs.rst in PATCH 1.

In this thread Amir Goldstein correctly pointed out that FUSE’s constant
overhead keeps hurting an app’s performance even when all blocks are
available locally. But that is not all: FUSE needs to be involved in every
readdir() and stat() call, and to our surprise we learned that many apps do
directory traversals and stat() calls much more often than seems reasonable.

Moreover, Android has a bit of recent history with FUSE. A big chunk of the
Android directory tree (“external storage”) used to be mounted via FUSE.
That didn’t turn out to be a great approach, and it was eventually replaced by
a kernel module.

I reckon the changes we’d need to make to FUSE for it to support the things
mentioned above would be, to put it mildly, very substantial. And having to
stay as generic as FUSE (i.e. supporting writes etc.) would make the task much
more complicated than it is now.

Coda
Indeed, it is somewhat similar to what we need. But according to Coda’s
documentation, it fetches a whole file the first time it is accessed,
which is the opposite of what we need. It is not obvious that adding all
the things above to Coda would be simpler than creating a separate driver,
especially if Coda needs to keep supporting all of its existing features.

userfaultfd
As far as I can see, this would only work for mmap-ed files.
read() and readdir() calls would never return the right results.

-- 
Thanks,
Eugene Zemtsov.