From: Yurii Zubrytskyi <zyy@google.com>
To: Eugene Zemtsov <ezemtsov@google.com>,
	amir73il@gmail.com, linux-fsdevel@vger.kernel.org,
	miklos@szeredi.hu
Subject: Re: Initial patches for Incremental FS
Date: Mon, 20 May 2019 18:32:36 -0700	[thread overview]
Message-ID: <CAJeUaNCvr=X-cc+B3rsunKcdC6yHSGGa4G+8X+n8OxGKHeE3zQ@mail.gmail.com> (raw)
In-Reply-To: <CAK8JDrEQnXTcCtAPkb+S4r4hORiKh_yX=0A0A=LYSVKUo_n4OA@mail.gmail.com>

On Thu, May 9, 2019 at 1:15 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> I think you have made the right choice for you and for the product you are
> working on to use an isolated module to provide this functionality.
>
> But I assume the purpose of your posting was to request upstream inclusion,
> community code review, etc. This is not likely to happen when the
> implementation and design choices are derived from Employer needs vs.
> the community needs. Sure, you can get high level design review, which is
> what *this* is, but I reckon not much more.
>
> This discussion has several references to community projects that can benefit
> from this functionality, but not in its current form.
>
> This development model has worked well in the past for Android and the Android
> user base leverage could help to get you a ticket to staging, but eventually,
> those modules (e.g. ashmem) often do get replaced with more community oriented
> APIs.
>

Hi fsdevel,

I'm Yurii; I work with Eugene on the same team and the same project.

I want to explain how we ended up with a custom filesystem instead of
trying to improve FUSE for everyone, and why we think (possibly
incorrectly) that it may still be quite useful to the community.
Since the project goal was to allow (nearly) instant deployment of apps
from the dev environment to an Android phone, we were hoping to stick
with a plain FUSE filesystem, and that's what we did at first. But it
turned out that even with the best tuning it was still really slow and
battery-hungry (phones drained energy faster than they charged over the
cord). By that point we had already collected profiles of the
filesystem usage and figured out which features are essential to make
it usable for streaming:
1. Random reads are by far the most common -> 4 KB reads are the unit
   we have to support, and we cannot afford a trip to user mode on each
   of them.
2. Android tends to list the app directory and stat the files in it
   often -> these operations need to be cached in the kernel as well.
3. Because the reads are *random*, streaming files sequentially isn't
   optimal -> we need to collect a read log on the first deployment and
   stream blocks in that order on subsequent incremental builds.
4. Devices have small flash storage, but we need to deploy uncompressed
   game images for speed and mmap access -> support storing individual
   4 KB blocks compressed.
4.1. The host computer is much better at compression -> support
   streaming already-compressed blocks directly into the filesystem
   storage, without recompressing on the phone.
5. Android has to verify an app's signature at installation -> need to
   support per-block signing and lazy verification.
5.1. For big games even the per-block signature data can be huge, so we
   need to stream the signatures themselves as well (see the sketch
   right after this list).
6. The development cycle is usually edit-build-try-edit-... -> need to
   support delta patches against existing files.
7. File names of installed apps are standardized and differ from what
   they were on the host -> must be able to store a user-supplied 'key'
   next to each file to identify it.
8. Files never change -> no need for complex code to handle mutable
   data in the filesystem.
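
To give a rough sense of what requirements 4, 4.1 and 5 mean on disk
(and why 5.1 matters: a 4 GiB image split into 4 KiB blocks is
1,048,576 blocks, so even bare 32-byte SHA-256 hashes add up to 32 MiB
of verification data), here is a minimal sketch of a per-block metadata
entry. The names, sizes and layout are illustrative assumptions, not
the actual incfs backing-file format:

/*
 * Illustrative sketch only -- field names, sizes and layout are
 * assumptions for discussion, not the real incfs backing-file format.
 */
#include <stdint.h>

#define SKETCH_BLOCK_SIZE 4096
#define SKETCH_HASH_SIZE  32            /* e.g. a SHA-256 digest */

/* How a single 4 KiB data block is stored (requirements 4, 4.1, 5). */
enum sketch_block_flags {
	SKETCH_BLOCK_COMPRESSED  = 1 << 0, /* kept exactly as compressed on the host */
	SKETCH_BLOCK_HASH_LOADED = 1 << 1, /* its hash has already been streamed in */
};

/* One entry of a per-file block map. */
struct sketch_block_entry {
	uint64_t backing_offset;         /* where the (possibly compressed) data lives */
	uint32_t stored_size;            /* compressed size; SKETCH_BLOCK_SIZE if raw */
	uint32_t flags;                  /* sketch_block_flags */
	uint8_t  hash[SKETCH_HASH_SIZE]; /* per-block hash, verified lazily on first read */
};

A block whose data hasn't been streamed yet would simply have no valid
entry, which is what lets the read path decide whether to serve it from
local storage or ask user space for it.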

In the end we saw only two ways to make all of this work: either take
sdcardfs as a base and extend it, or change FUSE to support a cache in
the kernel; as you can imagine, the sdcardfs route was thrown out of
the window immediately after looking at the code. But after learning
some FUSE internals and studying its code, we found that to make it do
all of the listed things we'd basically have to implement a totally new
filesystem inside it. The only real use of FUSE that remained was
sending FUSE_INIT and the occasional read request. Everything else
required, first of all, a cache object inside FUSE that intercepts
every message before it goes to user mode, plus new specialized
commands initiated by user mode (e.g. prefetching data that hasn't been
requested yet, or streaming hashes in). Some things didn't even make
sense for a generic use case (e.g. keeping a limited circular buffer of
read blocks in the kernel that the user can query and flush).
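
To make that "circular buffer of read blocks" idea a bit more concrete,
here is a minimal sketch, assuming a fixed-size per-mount ring that the
kernel appends to on every block read and user space drains to learn
the access order (requirement 3). The structure and function names are
assumptions for illustration, not the actual incfs interface:

/*
 * Illustrative sketch of a per-mount read log -- a fixed-size ring the
 * kernel fills on each data-block read and user space drains through
 * some hypothetical ioctl/read interface. Not the actual incfs API.
 */
#include <linux/spinlock.h>
#include <linux/ktime.h>

#define SKETCH_READ_LOG_CAPACITY 4096   /* records kept before old ones are dropped */

struct sketch_read_record {
	u32 file_id;                    /* which file was read */
	u32 block_index;                /* which 4 KiB block within that file */
	u64 timestamp_us;               /* when the read happened */
};

struct sketch_read_log {
	spinlock_t lock;
	u32 head;                       /* next slot to write */
	u32 tail;                       /* oldest record not yet drained */
	struct sketch_read_record records[SKETCH_READ_LOG_CAPACITY];
};

/* Called on every data-block read; the oldest entries are overwritten when full. */
static void sketch_log_read(struct sketch_read_log *log, u32 file_id, u32 block_index)
{
	unsigned long flags;

	spin_lock_irqsave(&log->lock, flags);
	log->records[log->head % SKETCH_READ_LOG_CAPACITY] = (struct sketch_read_record) {
		.file_id = file_id,
		.block_index = block_index,
		.timestamp_us = ktime_to_us(ktime_get()),
	};
	log->head++;
	if (log->head - log->tail > SKETCH_READ_LOG_CAPACITY)
		log->tail = log->head - SKETCH_READ_LOG_CAPACITY;
	spin_unlock_irqrestore(&log->lock, flags);
}

A userspace installer would drain this ring after the first run and use
the recorded order to stream blocks proactively on the next incremental
deployment.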

In the end, after several tries, we came to the conclusion that the
set of original requirements is so specific that, funnily enough,
anyone who wants to build a lazy-loading experience would hit most of
them, while anyone doing something else would miss most of them. That
is the main reason to go with a separate specialized driver module, and
the reason to share it with the community: we have a feeling that
people will benefit from a high-quality implementation of lazy loading
in the kernel, and we will benefit from the community's support and
guidance.

Again, we are all human and can be wrong at any step when drawing
conclusions. E.g. we didn't know about the fscache subsystem and were
only planning to create a cache object inside FUSE instead. But for now
I still feel that our original analysis stands, and that in the long
run a specialized filesystem serves its users much better than several
scattered changes in other places, which would all end up looking like
the same filesystem split into three parts and adapted to the
interfaces those places force onto it. What's more, those changes and
interfaces would look quite strange on their own, when not used
together.

Please tell me what you think about this whole thing. We care about the
feature in general, not about keeping it exactly as we've coded it
right now. If you feel that an fscache interface covering all of the
FUSE user-mode messages and accommodating the requirements above would
be useful beyond streaming, we'll investigate that route further.

Thank you, and sorry for the long email.

--
Thanks, Yurii
