Re: Initial patches for Incremental FS

From: Miklos Szeredi <miklos@szeredi.hu>
To: Yurii Zubrytskyi <zyy@google.com>
Cc: Eugene Zemtsov <ezemtsov@google.com>,
	Amir Goldstein <amir73il@gmail.com>,
	linux-fsdevel@vger.kernel.org
Subject: Re: Initial patches for Incremental FS
Date: Thu, 30 May 2019 11:22:54 +0200
Message-ID: <CAJfpeguys2P9q5EpE3GzKHcOS9GVLO9Fj9HB3JBLw36eax+NkQ@mail.gmail.com>
In-Reply-To: <CAJeUaNBn0gA6eApgOu=n2uoy+6PbOR_xjTdVvc+StvOKGA-i=Q@mail.gmail.com>

On Wed, May 29, 2019 at 11:06 PM Yurii Zubrytskyi <zyy@google.com> wrote:

> Yes, and this was _exactly_ our first plan, and it mitigates the read
> performance issue. The reasons why we didn't move forward with it are that
> we figured out all other requirements, and fixing each of those needs
> another change in FUSE, up to the level when FUSE interface becomes 50%
> dedicated to our specific goal:
> 1. MAP message would have to support data compression (with different
> algorithms), hash verification (same thing) with hash streaming (because
> even the Merkle tree for a 5GB file is huge, and can't be preloaded
> at once)

With the proposed FUSE solution the following sequences would occur:

kernel: if index for given block is missing, send MAP message
  userspace: if data/hash is missing for given block then download data/hash
  userspace: send MAP reply
kernel: decompress data and verify hash based on index

The kernel would not be involved in streaming either data or hashes; it
would only work with data/hash that has already been downloaded.
Right?
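
To make the sequence above concrete, here is a minimal userspace model of it in Python. It is only a sketch under stated assumptions: the class names, the dictionary-based MAP reply, the 4096-byte block size, zlib compression and a flat per-block SHA-256 (instead of a Merkle tree) are all illustrative choices, not the actual FUSE protocol or the Incremental FS format.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed block size; the thread does not state one


class UserspaceServer:
    """Stands in for the FUSE daemon: downloads blocks, answers MAP requests."""

    def __init__(self, remote_blocks):
        self.remote = remote_blocks  # block number -> plaintext (the "network")
        self.store = bytearray()     # local backing store of compressed blocks
        self.downloaded = {}         # block number -> mapping entry

    def handle_map(self, block_no):
        # Download data/hash only if it is missing for this block.
        if block_no not in self.downloaded:
            plain = self.remote[block_no]
            packed = zlib.compress(plain)
            self.downloaded[block_no] = {
                "offset": len(self.store),
                "size": len(packed),
                "compression": "zlib",
                "hash": hashlib.sha256(plain).digest(),
            }
            self.store += packed
        # MAP reply: where the block lives and how to verify it.
        return self.downloaded[block_no]


class Kernel:
    """Stands in for the kernel side: keeps an index, never downloads."""

    def __init__(self, server):
        self.server = server
        self.index = {}        # block number -> mapping entry
        self.map_requests = 0  # counts round trips to userspace

    def read_block(self, block_no):
        entry = self.index.get(block_no)
        if entry is None:
            # Index entry missing: send MAP and wait for the reply.
            self.map_requests += 1
            entry = self.server.handle_map(block_no)
            self.index[block_no] = entry
        # Decompress and verify using only already-downloaded data.
        start = entry["offset"]
        raw = bytes(self.server.store[start:start + entry["size"]])
        plain = zlib.decompress(raw)
        if hashlib.sha256(plain).digest() != entry["hash"]:
            raise IOError("hash mismatch for block %d" % block_no)
        return plain
```

In this model a second read of the same block is served from the kernel-side index without another MAP round trip, which is the property the question above is checking.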

Or is your implementation doing streamed decompress/hash or partial blocks?

>   1.1. Mapping memory usage can get out of hand pretty quickly: it has to
> be at least (offset + size + compression type + hash location + hash size +
> hash kind) per each block. I'm not even thinking about multiple storage files
> here. For that 5GB file (that's a debug APK for some Android game we're
> targeting) we have 1.3M blocks, so ~16 bytes * 1.3M = ~20M of index only,
> without actual overhead for the lookup table.
> If the kernel code owns and manages its own on-disk data store and the
> format, this index can be loaded and discarded on demand there.
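
The index estimate quoted above can be reproduced with a few lines of arithmetic. This is a sketch that assumes "5GB" means 5 GiB and that blocks are 4 KiB (the block size is not stated in the message, but it is the size that yields roughly 1.3M blocks); the 16-byte entry size is the figure from the message.

```python
FILE_SIZE = 5 * 2**30  # "5GB" read as 5 GiB (assumption)
BLOCK_SIZE = 4096      # assumption: 4 KiB blocks
ENTRY_SIZE = 16        # "~16 bytes" per block, taken from the message


def index_size(file_size, block_size=BLOCK_SIZE, entry_size=ENTRY_SIZE):
    """Return (number of blocks, index size in bytes) for a file."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks, blocks * entry_size


blocks, index_bytes = index_size(FILE_SIZE)
print(blocks, index_bytes / 2**20)  # block count and index size in MiB
```

Under these assumptions the result is 1,310,720 blocks and 20 MiB of index, before any lookup-table overhead.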

Why does the kernel have to know the on-disk format to be able to load
and discard parts of the index on-demand?  It only needs to know which
blocks were accessed recently and which not so recently.
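
As an illustration of that point, here is a minimal sketch of an index cache that loads and discards chunks of the index purely by access recency. Everything in it is hypothetical: `IndexCache`, the `load_chunk` callback (which in the FUSE model would correspond to asking userspace for mapping entries) and the chunk size are illustrative names and values, and the entries themselves are treated as opaque.

```python
from collections import OrderedDict


class IndexCache:
    """LRU cache of index chunks; entry contents are opaque to the cache."""

    def __init__(self, load_chunk, capacity, blocks_per_chunk=1024):
        self.load_chunk = load_chunk  # callback: chunk number -> list of entries
        self.capacity = capacity      # maximum number of chunks kept in memory
        self.blocks_per_chunk = blocks_per_chunk
        self.chunks = OrderedDict()   # chunk number -> entries, oldest first
        self.loads = 0                # how many times a chunk had to be loaded

    def lookup(self, block_no):
        chunk_no, slot = divmod(block_no, self.blocks_per_chunk)
        if chunk_no in self.chunks:
            # Recently used: move to the "most recent" end.
            self.chunks.move_to_end(chunk_no)
        else:
            # Missing: load on demand, then discard the least recently used.
            self.loads += 1
            self.chunks[chunk_no] = self.load_chunk(chunk_no)
            if len(self.chunks) > self.capacity:
                self.chunks.popitem(last=False)
        return self.chunks[chunk_no][slot]
```

The cache never parses what `load_chunk` returns; it only tracks which chunks were touched recently, so the on-disk format can stay entirely on the other side of the callback.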

> > There's also work currently ongoing in optimizing the overhead of
> > userspace roundtrip.  The most promising thing appears to be matching
> > up the CPU for the userspace server with that of the task doing the
> > request.  This can apparently result in a 60-500% speed improvement.
>
> That sounds almost too good to be true, and will be really cool.
> Do you have any patches or git remote available in any compilable state to
> try the optimization out? Android has quite complicated hardware config
> and I want to see how this works, especially with our model where
> several processes may send requests into the same filesystem FD.

Currently it's only a bunch of hacks, no proper interfaces yet.

I'll let you know once there's something useful for testing with a
real filesystem.

BTW, which interface does your fuse filesystem use?  Libfuse?  Raw device?

Thanks,
Miklos