From: Amir Goldstein <>
To: Sargun Dhillon <>
Cc: overlayfs <>,
	Alessio Balsini <>
Subject: Re: Lazy Loading Layers (Userfaultfd for filesystems?)
Date: Tue, 26 Jan 2021 07:18:29 +0200
Message-ID: <>
In-Reply-To: <20210125194848.GA12389@ircssh-2.c.rugged-nimbus-611.internal>

On Mon, Jan 25, 2021 at 9:54 PM Sargun Dhillon <> wrote:
> One of the projects I'm playing with for containers is lazy-loading of layers.
> We've found that less than 10% of the files on a layer actually get used, which
> is an unfortunate waste. It also means in some cases downloading ~100s of MB, or
> ~1s of GB of files before starting a container workload. This is unfortunate.
> It would be nice if there was a way to start a container workload, and have
> it so that if it tries to access an unpopulated (not yet downloaded) part
> of the filesystem, the access blocks until it is populated. This is trivial to do
> if the "lowest" layer is FUSE, where one can just stall in userspace on
> loads. Unfortunately, AFAIK, there's not a good way to swap out the FUSE
> filesystem with the "real" filesystem once it's done fully populating,
> and you have to pay for the full FUSE cost on each read / write.

Unless you use FUSE_PASSTHROUGH.

The only catch is that in the current v12 patchset, a passthrough-capable
FUSE is declared non-stackable by setting
s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH.
This was not done deliberately to deny stacking overlay on top of
passthrough-capable FUSE, but to deny stacking passthrough FUSE filesystems
on top of each other.
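
To illustrate why that setting blocks overlay on top of a passthrough-capable
FUSE, here is a toy Python model of the stacking-depth check. It is a sketch,
not kernel code: the function name overlay_stack_depth() and the error text
are illustrative, and the rule modeled (an overlay's depth is the deepest
layer's depth plus one, capped at FILESYSTEM_MAX_STACK_DEPTH, which is 2)
is a simplification of what overlayfs does at mount time.

```python
# Toy model of the overlayfs stacking-depth check; not kernel code.
FILESYSTEM_MAX_STACK_DEPTH = 2  # value of the kernel constant


def overlay_stack_depth(layer_depths):
    """Return the depth an overlay mount over these layers would get.

    Raises ValueError when the result would exceed the maximum, which is
    the case where the real mount is refused.
    """
    depth = max(layer_depths) + 1
    if depth > FILESYSTEM_MAX_STACK_DEPTH:
        raise ValueError("maximum fs stacking depth exceeded")
    return depth


plain_fs = 0                                   # e.g. XFS, or FUSE without passthrough
passthrough_fuse = FILESYSTEM_MAX_STACK_DEPTH  # what the v12 patchset sets

print(overlay_stack_depth([plain_fs, plain_fs]))  # 1
try:
    overlay_stack_depth([plain_fs, passthrough_fuse])
except ValueError as err:
    print("mount refused:", err)
```

In this model, relaxing the limitation would amount to giving passthrough
FUSE a depth below the maximum while still refusing passthrough on top of
passthrough by some other check.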

I mentioned in one of the reviews that this limitation could become
a problem if someone were to do exactly what you are trying to do.
It should not be a problem to relax this limitation; it just did not feel fair
to demand that for the initial version of passthrough FUSE, before there was an
actual use case. I am sure you will be able to lift that limitation if it stands
in your way.

> I've tossed around:
> 1. Mutable lowerdirs and having something like this:
> layer0 --> Writeable space
> layer1 --> Real XFS filesystem
> layer2 --> FUSE FS
> and if there is a "miss" on layer 1, it will then look it up on
> layer 2 while layer 1 is being populated. Then the FUSE FS can block.

How would you verify that mutating the lowerdir doesn't result in
"undefined behavior"?
It would be nice if, for some images, you could fetch a "metacopy" image from
some "meta" image repository to use as layer1. Is that a possibility
for your use case?
At least if the only mutation allowed on layer1 were a data copy up, it would
be pretty easy to show that overlayfs behavior is well defined.
When FUSE knows that the data in the real fs file has been populated, it can
remove the metacopy xattr and invalidate the fuse dentry, causing the ovl
dentry to be invalidated; the re-lookup will then construct the ovl dentry
without the FUSE layer.
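
The effect of dropping the metacopy xattr can be sketched with a small Python
model of how a lower stack is built. This is a simplification under stated
assumptions: lower_stack() and the layer names are hypothetical, and the model
ignores redirects, whiteouts and directories. It only captures the rule that
lookup keeps descending while the file found carries the metacopy xattr
(trusted.overlay.metacopy) and stops at the first layer that holds data.

```python
METACOPY = "trusted.overlay.metacopy"


def lower_stack(name, layers):
    """Return the names of the layers that end up in the file's lower stack.

    layers is a list of (layer_name, {path: set_of_xattrs}), topmost first.
    """
    stack = []
    for layer_name, files in layers:
        if name not in files:
            continue
        stack.append(layer_name)
        if METACOPY not in files[name]:
            break  # this layer has the data, no need to look further down
    return stack


layer1 = {"bin/tool": {METACOPY}}   # metadata only, data not yet downloaded
layer2 = {"bin/tool": set()}        # FUSE layer that can fetch the data
layers = [("layer1-xfs", layer1), ("layer2-fuse", layer2)]

print(lower_stack("bin/tool", layers))  # ['layer1-xfs', 'layer2-fuse']

# Data copy up finished on layer1: remove the xattr, then look up again.
layer1["bin/tool"].discard(METACOPY)
print(lower_stack("bin/tool", layers))  # ['layer1-xfs']
```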

> This is neat, but it requires the FUSE FS to always be up, and incurs
> a userspace bounce on every miss.

You may be able to shut down the FUSE fs eventually. At the end of the
population process, issue a "layer shutdown" ioctl to overlayfs that will
mark the layer as shut down. ovl_revalidate() will invalidate any ovl dentry
with a shut down layer in its lower stack and ovl_lookup()/ovl_path_next()
will skip lower stack dentries in shut down layers.
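
A rough Python model of the proposed behavior follows. The "layer shutdown"
ioctl does not exist; this only sketches the two rules described above, with
hypothetical names (Layer, lookup, revalidate) standing in for the roles of
ovl_lookup() and ovl_revalidate().

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    files: set
    shut_down: bool = False  # would be set by the proposed ioctl


def lookup(name, layers):
    """Build a lower stack, skipping layers that are marked shut down."""
    return [l for l in layers if not l.shut_down and name in l.files]


def revalidate(stack):
    """A cached stack stays valid only if none of its layers is shut down."""
    return not any(l.shut_down for l in stack)


xfs = Layer("layer1-xfs", {"bin/tool"})
fuse = Layer("layer2-fuse", {"bin/tool"})
layers = [xfs, fuse]

cached = lookup("bin/tool", layers)
print([l.name for l in cached])   # ['layer1-xfs', 'layer2-fuse']

fuse.shut_down = True             # population finished
print(revalidate(cached))         # False, so the dentry is looked up again
print([l.name for l in lookup("bin/tool", layers)])  # ['layer1-xfs']
```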

When there are no more open files from fuse and no more ovl dentries
with the fuse layer in their lower stack, the fuse layer mnt refcount should
drop to 2(?) and it should be possible to carefully release the root ovl
dentry lower stack entry and finally the layer itself.
A refcount on the layer will probably be the correct pattern to use.
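
The refcount pattern could look roughly like this in a Python sketch. LayerRef
and its methods are hypothetical and only illustrate the ordering: dentries
and open files each hold a reference, and the layer is released when the last
holder, here the overlay mount itself, drops its own.

```python
class LayerRef:
    """Hypothetical refcounted handle on a lower layer (not a kernel API)."""

    def __init__(self, name):
        self.name = name
        self.refs = 1           # reference held by the overlay mount itself
        self.released = False

    def get(self):
        if self.released:
            raise RuntimeError("layer already released")
        self.refs += 1

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.released = True  # last user is gone, safe to drop the layer


fuse_layer = LayerRef("layer2-fuse")
fuse_layer.get()   # an ovl dentry with this layer in its lower stack
fuse_layer.get()   # an open file on this layer

fuse_layer.put()   # dentry invalidated after the layer was shut down
fuse_layer.put()   # file closed
print(fuse_layer.released)  # False, the mount still holds its reference
fuse_layer.put()   # overlay drops its own reference
print(fuse_layer.released)  # True
```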

> It also means things like metadata only copies don't work.

I can see there are some feature limitations due to FUSE having no UUID,
but this should be solvable too.

> Does anyone have a suggestion of a mechanism to handle this? I've looked into
> swapping out layers on the fly, and what it would take to add a mechanism like
> userfaultfd to overlayfs, but I was wondering if anything like this was already
> built, or if someone has thought it through more than me.

I've seen many projects that try to do similar things but not using overlayfs:
Android Incremental FS, ExtFUSE, libprojfs.

If I were to tackle this task, I would choose to enhance FUSE_PASSTHROUGH
to be able to passthrough for more than just read/write, to the point that
it could eventually satisfy the requirements of all those projects above,
something that I have discussed with Alessio in the past.

When that happens, you might as well call passthrough FUSE "Userfaultfd for
filesystems" if you wish ;-)


Thread overview: 5+ messages
2021-01-25 19:48 Lazy Loading Layers (Userfaultfd for filesystems?) Sargun Dhillon
2021-01-26  5:18 ` Amir Goldstein [this message]
2021-01-26 13:12   ` Alessio Balsini
2023-05-29 15:15 ` Detaching lower layers (Was: Lazy Loading Layers) Amir Goldstein
2023-05-29 17:50   ` Rodrigo Campos
