linux-unionfs.vger.kernel.org archive mirror
* Lazy Loading Layers (Userfaultfd for filesystems?)
@ 2021-01-25 19:48 Sargun Dhillon
  2021-01-26  5:18 ` Amir Goldstein
  2023-05-29 15:15 ` Detaching lower layers (Was: Lazy Loading Layers) Amir Goldstein
  0 siblings, 2 replies; 5+ messages in thread
From: Sargun Dhillon @ 2021-01-25 19:48 UTC (permalink / raw)
  To: linux-unionfs

One of the projects I'm playing with for containers is lazy-loading of layers.
We've found that less than 10% of the files on a layer actually get used, which
is an unfortunate waste. It also means in some cases downloading hundreds of MB,
or a few GB, of files before a container workload can start.

It would be nice if there were a way to start a container workload such that
if it tries to access an unpopulated (not yet downloaded) part of the
filesystem, the access blocks until that part has been downloaded. This is trivial to do
if the "lowest" layer is FUSE, where one can just stall in userspace on
loads. Unfortunately, AFAIK, there's not a good way to swap out the FUSE
filesystem with the "real" filesystem once it's done fully populating,
and you have to pay for the full FUSE cost on each read / write.

I've tossed around:
1. Mutable lowerdirs and having something like this:

layer0 --> Writeable space
layer1 --> Real XFS filesystem
layer2 --> FUSE FS

and if there is a "miss" on layer 1, it will then look it up on
layer 2 while layer 1 is being populated. Then the FUSE FS can block.
This is neat, but it requires the FUSE FS to always be up, and incurs
a userspace bounce on every miss.

It also means things like metadata-only copies don't work.
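The fallthrough behavior of that scheme can be sketched in a few lines of Python (a toy model with invented names; plain dicts stand in for the real filesystems, and a dict access stands in for the blocking FUSE fetch):

```python
# Toy model of the proposed three-layer lookup: a miss in layer1 falls
# through to layer2 (the FUSE layer), which blocks until it has populated
# layer1 with the requested file.

class LazyUnion:
    def __init__(self, layer2_remote):
        self.layer0 = {}              # writable upper layer
        self.layer1 = {}              # real fs, populated on demand
        self.layer2 = layer2_remote   # FUSE-backed remote content

    def read(self, name):
        if name in self.layer0:       # upper layer wins
            return self.layer0[name]
        if name in self.layer1:       # already populated locally
            return self.layer1[name]
        # Miss: "block" while the FUSE layer fetches, then populate layer1.
        data = self.layer2[name]      # stand-in for a blocking remote fetch
        self.layer1[name] = data
        return data

union = LazyUnion({"etc/app.conf": b"remote bytes"})
assert union.read("etc/app.conf") == b"remote bytes"
assert "etc/app.conf" in union.layer1   # second read is served locally
```

Every cold read still takes the userspace bounce through layer2, which is exactly the cost described above.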

Does anyone have a suggestion of a mechanism to handle this? I've looked into 
swapping out layers on the fly, and what it would take to add a mechanism like 
userfaultfd to overlayfs, but I was wondering if anything like this was already 
built, or if someone has thought it through more than me.


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Lazy Loading Layers (Userfaultfd for filesystems?)
  2021-01-25 19:48 Lazy Loading Layers (Userfaultfd for filesystems?) Sargun Dhillon
@ 2021-01-26  5:18 ` Amir Goldstein
  2021-01-26 13:12   ` Alessio Balsini
  2023-05-29 15:15 ` Detaching lower layers (Was: Lazy Loading Layers) Amir Goldstein
  1 sibling, 1 reply; 5+ messages in thread
From: Amir Goldstein @ 2021-01-26  5:18 UTC (permalink / raw)
  To: Sargun Dhillon; +Cc: overlayfs, Alessio Balsini

On Mon, Jan 25, 2021 at 9:54 PM Sargun Dhillon <sargun@sargun.me> wrote:
>
> One of the projects I'm playing with for containers is lazy-loading of layers.
> We've found that less than 10% of the files on a layer actually get used, which
> is an unfortunate waste. It also means in some cases downloading hundreds of MB,
> or a few GB, of files before a container workload can start.
>
> It would be nice if there were a way to start a container workload such that
> if it tries to access an unpopulated (not yet downloaded) part of the
> filesystem, the access blocks until that part has been downloaded. This is trivial to do
> if the "lowest" layer is FUSE, where one can just stall in userspace on
> loads. Unfortunately, AFAIK, there's not a good way to swap out the FUSE
> filesystem with the "real" filesystem once it's done fully populating,
> and you have to pay for the full FUSE cost on each read / write.

Unless you used FUSE_PASSTHROUGH:

https://lore.kernel.org/linux-fsdevel/20210125153057.3623715-1-balsini@android.com/

Only in the current v12 patchset, a passthrough-capable FUSE is declared
non-stackable by setting s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH.
This wasn't done deliberately to deny stacking an overlay on top of a
passthrough-capable FUSE, but to deny stacking passthrough FUSE filesystems
on top of each other.

I mentioned in one of the reviews that this limitation could become
a problem if someone were to do exactly what you are trying to do.
It should not be a problem to relax this limitation; it just did not feel fair
to demand that for the initial version of passthrough FUSE, before there was
an actual use case. I am sure you will be able to lift that limitation if it
stands in your way.


>
> I've tossed around:
> 1. Mutable lowerdirs and having something like this:
>
> layer0 --> Writeable space
> layer1 --> Real XFS filesystem
> layer2 --> FUSE FS
>
> and if there is a "miss" on layer 1, it will then look it up on
> layer 2 while layer 1 is being populated. Then the FUSE FS can block.

Interesting.
How would you verify that mutating the lowerdir doesn't result in
"undefined behavior"?
It would be nice if for some images, you could fetch a "metacopy" image from
some "meta" image repository, to use as layer1. Is that a possibility
for your use case?
At least if the only mutation allowed on layer1 was a data copy up, it would
be pretty easy to show that overlayfs behavior will be well defined.
When FUSE knows that the data in the real fs file has been populated, it can
remove the metacopy xattr and invalidate the FUSE dentry, causing an ovl
dentry invalidate; the subsequent re-lookup will construct the ovl dentry
without the FUSE layer.
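That copy-up-then-detach flow can be modelled roughly as follows (a toy Python sketch with invented names; a boolean stands in for the metacopy xattr, and the dentry machinery is reduced to a lookup function):

```python
# Toy model of the metacopy flow sketched above: layer1 starts with
# metadata-only files (marked "metacopy"); once FUSE has filled in the
# data, the mark is dropped and a re-lookup no longer needs the FUSE layer.

class File:
    def __init__(self, data=None, metacopy=False):
        self.data = data
        self.metacopy = metacopy      # stands in for the metacopy xattr

def lookup(layer1, fuse_layer, name):
    f = layer1[name]
    if f.metacopy:
        return (f, fuse_layer)        # lower stack still includes FUSE
    return (f, None)                  # FUSE layer dropped from the stack

def populate(layer1, fuse_layer, name):
    f = layer1[name]
    f.data = fuse_layer[name]         # FUSE fills in the real data...
    f.metacopy = False                # ...then drops the metacopy mark,
                                      # invalidating the old dentry

fuse = {"bin/app": b"payload"}
layer1 = {"bin/app": File(metacopy=True)}

_, lower = lookup(layer1, fuse, "bin/app")
assert lower is fuse                  # before population: FUSE in the stack
populate(layer1, fuse, "bin/app")
_, lower = lookup(layer1, fuse, "bin/app")
assert lower is None                  # after re-lookup: FUSE detached
```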

> This is neat, but it requires the FUSE FS to always be up, and incurs
> a userspace bounce on every miss.
>

You may be able to shut down the FUSE fs eventually. At the end of the
population process, issue a "layer shutdown" ioctl to overlayfs that will
mark the layer as shut down. ovl_revalidate() will invalidate any ovl dentry
with a shut-down layer in its lower stack, and ovl_lookup()/ovl_path_next()
will skip lower-stack dentries in shut-down layers.

When there are no more open files from FUSE and no more ovl dentries
with the FUSE layer in their lower stack, the FUSE layer mnt refcount should
drop to 2(?) and it should be possible to carefully release the root ovl
dentry's lower stack entry and finally the layer itself.
A refcount on the layer will probably be the correct pattern to use.
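A rough model of the shutdown/refcount idea (toy Python with invented names; real overlayfs dentries and mnt refcounts are far more involved):

```python
# Toy model of the "layer shutdown" idea: once a layer is marked shut
# down, revalidation drops dentries that reference it in their lower
# stack, and the layer becomes releasable when its refcount reaches zero.

class Layer:
    def __init__(self, name):
        self.name = name
        self.refcount = 0
        self.shutdown = False

class Dentry:
    def __init__(self, lower_stack):
        self.lower_stack = list(lower_stack)
        for layer in self.lower_stack:
            layer.refcount += 1

    def revalidate(self):
        # Invalid if any layer in the lower stack was shut down;
        # drop the references so the layers can be released.
        if any(l.shutdown for l in self.lower_stack):
            for l in self.lower_stack:
                l.refcount -= 1
            self.lower_stack = []
            return False
        return True

xfs, fuse = Layer("xfs"), Layer("fuse")
d = Dentry([xfs, fuse])
fuse.shutdown = True                  # the "layer shutdown" ioctl
assert d.revalidate() is False        # dentry invalidated on revalidate
assert fuse.refcount == 0             # layer can now be released
```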

> It also means things like metadata only copies don't work.
>

Why?
I can see there are some feature limitations due to FUSE having no UUID,
but those should be solvable too.

> Does anyone have a suggestion of a mechanism to handle this? I've looked into
> swapping out layers on the fly, and what it would take to add a mechanism like
> userfaultfd to overlayfs, but I was wondering if anything like this was already
> built, or if someone has thought it through more than me.
>

I've seen many projects that try to do similar things but not using overlayfs:
Android Incremental FS, ExtFUSE, libprojfs.

If I were to tackle this task, I would choose to enhance FUSE_PASSTHROUGH
to be able to pass through more than just read/write, to the point that it
could eventually satisfy the requirements of all those projects above -
something that I have discussed with Alessio in the past.

When that happens, you might as well call passthrough FUSE "Userfaultfd for
filesystems" if you wish ;-)

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Lazy Loading Layers (Userfaultfd for filesystems?)
  2021-01-26  5:18 ` Amir Goldstein
@ 2021-01-26 13:12   ` Alessio Balsini
  0 siblings, 0 replies; 5+ messages in thread
From: Alessio Balsini @ 2021-01-26 13:12 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Sargun Dhillon, overlayfs, Alessio Balsini

Thanks Amir for looping me into the discussion.

On Tue, Jan 26, 2021 at 07:18:29AM +0200, Amir Goldstein wrote:
> On Mon, Jan 25, 2021 at 9:54 PM Sargun Dhillon <sargun@sargun.me> wrote:
> >
> > One of the projects I'm playing with for containers is lazy-loading of layers.
> > We've found that less than 10% of the files on a layer actually get used, which
> > is an unfortunate waste. It also means in some cases downloading hundreds of
> > MB, or a few GB, of files before a container workload can start.
> >
> > It would be nice if there were a way to start a container workload such that
> > if it tries to access an unpopulated (not yet downloaded) part of the
> > filesystem, the access blocks until that part has been downloaded. This is trivial to do
> > if the "lowest" layer is FUSE, where one can just stall in userspace on
> > loads. Unfortunately, AFAIK, there's not a good way to swap out the FUSE
> > filesystem with the "real" filesystem once it's done fully populating,
> > and you have to pay for the full FUSE cost on each read / write.

Sargun's use case has some similarities with IncFS
(https://source.android.com/devices/architecture/kernel/incfs).
The main purpose of IncFS is not to save space, but to allow the user to
open apps as soon as possible, by making them accessible even while only
partially downloaded. Without going off topic with implementation
details, IncFS also needs to handle extra things like live data
(de)compression, and this is where it diverges from Sargun's idea.
The reason why I mention this is that the first IncFS prototypes were
based on FUSE, but because of the performance regression introduced by
the FUSE daemon round-trip we were forced to proceed with a separate
kernel module implementation.

> 
> Unless you used FUSE_PASSTHROUGH:
> 
> https://lore.kernel.org/linux-fsdevel/20210125153057.3623715-1-balsini@android.com/
> 
> Only in the current v12 patchset, a passthrough-capable FUSE is declared
> non-stackable by setting s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH.
> This wasn't done deliberately to deny stacking an overlay on top of a
> passthrough-capable FUSE, but to deny stacking passthrough FUSE filesystems
> on top of each other.
> 
> I mentioned in one of the reviews that this limitation could become
> a problem if someone were to do exactly what you are trying to do.
> It should not be a problem to relax this limitation; it just did not feel fair
> to demand that for the initial version of passthrough FUSE, before there was
> an actual use case. I am sure you will be able to lift that limitation if it
> stands in your way.
> 
> 

It would be nice to see that FUSE passthrough can be helpful in this
scenario as well.
As Amir mentioned, the stacking limitation of this first passthrough
implementation was deliberately chosen to be very strict as a safety
measure, but nothing prevents us from relaxing it in the future, as
stacking becomes mandatory for certain use cases and after we properly
analyze all the corner cases.

> >
> > I've tossed around:
> > 1. Mutable lowerdirs and having something like this:
> >
> > layer0 --> Writeable space
> > layer1 --> Real XFS filesystem
> > layer2 --> FUSE FS
> >
> > and if there is a "miss" on layer 1, it will then look it up on
> > layer 2 while layer 1 is being populated. Then the FUSE FS can block.
> 
> Interesting.
> How would you verify that mutating the lowerdir doesn't result in
> "undefined behavior"?
> It would be nice if for some images, you could fetch a "metacopy" image from
> some "meta" image repository, to use as layer1. Is that a possibility
> for your use case?
> At least if the only mutation allowed on layer1 was a data copy up, it would
> be pretty easy to show that overlayfs behavior will be well defined.
> When FUSE knows that the data in the real fs file has been populated, it can
> remove the metacopy xattr and invalidate the FUSE dentry, causing an ovl
> dentry invalidate; the subsequent re-lookup will construct the ovl dentry
> without the FUSE layer.
> 
> > This is neat, but it requires the FUSE FS to always be up, and incurs
> > a userspace bounce on every miss.
> >
> 
> You may be able to shut down the FUSE fs eventually. At the end of the
> population process, issue a "layer shutdown" ioctl to overlayfs that will
> mark the layer as shut down. ovl_revalidate() will invalidate any ovl dentry
> with a shut-down layer in its lower stack, and ovl_lookup()/ovl_path_next()
> will skip lower-stack dentries in shut-down layers.
>
> When there are no more open files from FUSE and no more ovl dentries
> with the FUSE layer in their lower stack, the FUSE layer mnt refcount should
> drop to 2(?) and it should be possible to carefully release the root ovl
> dentry's lower stack entry and finally the layer itself.
> A refcount on the layer will probably be the correct pattern to use.
> 
> > It also means things like metadata only copies don't work.
> >
> 
> Why?
> I can see there are some feature limitations due to FUSE having no UUID,
> but those should be solvable too.
> 
> > Does anyone have a suggestion of a mechanism to handle this? I've looked into
> > swapping out layers on the fly, and what it would take to add a mechanism like
> > userfaultfd to overlayfs, but I was wondering if anything like this was already
> > built, or if someone has thought it through more than me.
> >
> 
> I've seen many projects that try to do similar things but not using overlayfs:
> Android Incremental FS, ExtFUSE, libprojfs.
> 
> If I were to tackle this task, I would choose to enhance FUSE_PASSTHROUGH
> to be able to pass through more than just read/write, to the point that it
> could eventually satisfy the requirements of all those projects above -
> something that I have discussed with Alessio in the past.
> 
> When that happens, you might as well call passthrough FUSE "Userfaultfd for
> filesystems" if you wish ;-)
> 
> Thanks,
> Amir.

Thanks for advocating the use of FUSE passthrough! :)
Sargun, if read/write performance is your main concern, the current
version of FUSE passthrough should already do the trick. You can also
find a libfuse repository in the list that contains the minimal changes
needed to enable it in your fs.

My TODO list already has a bunch of further extensions, e.g. passthrough
for directory operations, but I'm currently blocked on getting the series
merged upstream. This is both because I would love the community to
start exploring FUSE passthrough and come up with additional feature
requests that would help me prioritize what comes next, and to avoid
accumulating too much tech debt: working on top of out-of-tree changes,
I risk that all my FUSE passthrough extension work will never come to
life. So, fingers crossed that I got everything right with this V12! :)

Thanks,
Alessio


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Detaching lower layers (Was: Lazy Loading Layers)
  2021-01-25 19:48 Lazy Loading Layers (Userfaultfd for filesystems?) Sargun Dhillon
  2021-01-26  5:18 ` Amir Goldstein
@ 2023-05-29 15:15 ` Amir Goldstein
  2023-05-29 17:50   ` Rodrigo Campos
  1 sibling, 1 reply; 5+ messages in thread
From: Amir Goldstein @ 2023-05-29 15:15 UTC (permalink / raw)
  To: Sargun Dhillon; +Cc: overlayfs, Miklos Szeredi

On Mon, Jan 25, 2021 at 9:54 PM Sargun Dhillon <sargun@sargun.me> wrote:
>
> One of the projects I'm playing with for containers is lazy-loading of layers.
> We've found that less than 10% of the files on a layer actually get used, which
> is an unfortunate waste. It also means in some cases downloading hundreds of
> MB, or a few GB, of files before a container workload can start.
>
> It would be nice if there were a way to start a container workload such that
> if it tries to access an unpopulated (not yet downloaded) part of the
> filesystem, the access blocks until that part has been downloaded. This is trivial to do
> if the "lowest" layer is FUSE, where one can just stall in userspace on
> loads. Unfortunately, AFAIK, there's not a good way to swap out the FUSE
> filesystem with the "real" filesystem once it's done fully populating,
> and you have to pay for the full FUSE cost on each read / write.
>
> I've tossed around:
> 1. Mutable lowerdirs and having something like this:
>
> layer0 --> Writeable space
> layer1 --> Real XFS filesystem
> layer2 --> FUSE FS
>
> and if there is a "miss" on layer 1, it will then look it up on
> layer 2 while layer 1 is being populated. Then the FUSE FS can block.
> This is neat, but it requires the FUSE FS to always be up, and incurs
> a userspace bounce on every miss.
>
> It also means things like metadata only copies don't work.
>
> Does anyone have a suggestion of a mechanism to handle this? I've looked into
> swapping out layers on the fly, and what it would take to add a mechanism like
> userfaultfd to overlayfs, but I was wondering if anything like this was already
> built, or if someone has thought it through more than me.
>

Hi Sargun,

I believe that this is the use case that you asked me about at LSFMM,
at least the lower part, layer1+layer2. Is that correct?

You did not mention three layers in the use case that you described.
Is that because you decided that layer0 and layer1 can be combined?

Technically, you can also set up a nested overlay where the lower overlay,
layer1+layer2, only does the lazy loading of the remote read-only layer,
and the upper overlay is composed of layer0+ovl(layer1+layer2), but this
nested overlay configuration has some limitations.

Anyway, I have talked with Miklos about the use case that requires
detaching the lowermost FUSE layer eventually and the solution that
we discussed was to gradually "opaquify" directories whose entire
descendant hierarchy is fully copied up at readdir time.

I have prepared POC patches for this design:

https://github.com/amir73il/linux/commits/ovl-xino-nofollow

This was tested using the following patch to unionmount-testsuite:

https://github.com/amir73il/unionmount-testsuite/commits/ovl-xino-nofollow

commit 026e73c37f3993f56e76128a267e54faedf2322c
Author: Amir Goldstein <amir73il@gmail.com>
Date:   Mon May 29 17:01:55 2023 +0300

    Test detaching lower fs

    Test that with xino=nofollow, after copying up all files and listing
    all the directories in DFS order, the lower fs can be detached.

    Signed-off-by: Amir Goldstein <amir73il@gmail.com>

diff --git a/mount_union.py b/mount_union.py
index e905b83..4fad5dd 100644
--- a/mount_union.py
+++ b/mount_union.py
@@ -54,3 +54,13 @@ def mount_union(ctx):
         ctx.note_upper_fs(upper_mntroot, testdir, union_mntroot + "/f")
         ctx.note_lower_layers(lower_mntroot)
         ctx.note_upper_layer(upperdir)
+        if cfg.is_xino():
+            # Copy up everything, set all dirs opaque and then detach lower fs.
+            # Instead of iterating in DFS order we iterate 4 times as the depth
+            # of the dataset tree - on every iteration, level 4-i becomes opaque.
+            system("chown -R 0.0 " + union_mntroot)
+            system("find " + union_mntroot + " -inum 0")
+            system("find " + union_mntroot + " -inum 0")
+            system("find " + union_mntroot + " -inum 0")
+            system("find " + union_mntroot + " -inum 0")
+            system("xfs_io -x -c shutdown " + lower_mntroot)
diff --git a/run b/run
index 3a6efc3..f8116c1 100755
--- a/run
+++ b/run
@@ -219,7 +219,7 @@ if redirect_dir is False:

 # Auto-upgrade xino=auto to xino=on for kernel < v5.7
 if xino:
-    cfg.add_mntopt("xino=on")
+    cfg.add_mntopt("xino=nofollow")

--

It should be pretty self-explanatory: after mounting the overlay, all lower
files are copied up using chown -R (no metacopy), and then the overlay is
iterated several times, until all the merge directory iterations notice that
there is nothing interesting in the lower dirs, so they all become opaque.
At this point, the lowest xfs layer is shut down and the tests are run.
With the 4*find iterations, none of the tests get EIO.
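The reason one pass per tree level is needed can be modelled in a few lines (toy Python, invented names): a directory is checked for opacity when it is visited, before descending, so opacity climbs one level per full pass:

```python
# Toy model of why 4 find passes suffice for a depth-4 tree: a directory
# becomes opaque only when all of its children are already opaque, and
# each top-down pass (like readdir order) pushes opacity up one level.

class Dir:
    def __init__(self, children=()):
        self.children = list(children)
        self.opaque = False

def visit(d):
    # One "find" pass over the tree, checking each dir before descending.
    if all(c.opaque for c in d.children):
        d.opaque = True
    for c in d.children:
        visit(c)

def all_opaque(d):
    return d.opaque and all(all_opaque(c) for c in d.children)

# A directory chain of depth 4, mirroring the 4*find in the test patch.
root = Dir([Dir([Dir([Dir()])])])
passes = 0
while not all_opaque(root):
    visit(root)
    passes += 1
assert passes == 4    # one pass per level of the tree
```

On the first pass only the (childless) leaves turn opaque; each later pass picks up the next level above them.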

This does not mean that the lower xfs can be cleanly unmounted - there may
still be references to dentries/inodes from the lower fs, but overlayfs
never calls any filesystem methods on the lower dentries/inodes -
specifically, lookup misses in the upper dir do not end up looking in the
lower dir.

The reason that I used an opt-in mount option (xino=nofollow) to enable this
functionality is that even after all files have been copied up, overlayfs
currently accesses one bit of information from the lower fs: it calls
getattr() to get st_ino from the lower file/directory in order to preserve
st_ino across copy up.

I used an opt-in mount option to allow st_ino to change across copy up.
I hope this change of behavior is acceptable for your use case.
Note that after the completion of the migration process (e.g. chown -R + 4*find)
all inode numbers are stabilized.

Are you interested in testing these patches?
If you indicate that they are useful to you, I can post them for review,
and in that case, I would appreciate it if you could write the xfstests
for the feature.

Thanks,
Amir.

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: Detaching lower layers (Was: Lazy Loading Layers)
  2023-05-29 15:15 ` Detaching lower layers (Was: Lazy Loading Layers) Amir Goldstein
@ 2023-05-29 17:50   ` Rodrigo Campos
  0 siblings, 0 replies; 5+ messages in thread
From: Rodrigo Campos @ 2023-05-29 17:50 UTC (permalink / raw)
  To: Amir Goldstein, Sargun Dhillon; +Cc: overlayfs, Miklos Szeredi

On 5/29/23 17:15, Amir Goldstein wrote:
> On Mon, Jan 25, 2021 at 9:54 PM Sargun Dhillon <sargun@sargun.me> wrote:
>>
>> One of the projects I'm playing with for containers is lazy-loading of layers.
>> We've found that less than 10% of the files on a layer actually get used, which
>> is an unfortunate waste. It also means in some cases downloading hundreds of
>> MB, or a few GB, of files before a container workload can start.
>>
>> It would be nice if there were a way to start a container workload such that
>> if it tries to access an unpopulated (not yet downloaded) part of the
>> filesystem, the access blocks until that part has been downloaded. This is trivial to do
>> if the "lowest" layer is FUSE, where one can just stall in userspace on
>> loads. Unfortunately, AFAIK, there's not a good way to swap out the FUSE
>> filesystem with the "real" filesystem once it's done fully populating,
>> and you have to pay for the full FUSE cost on each read / write.
>>
>> I've tossed around:
>> 1. Mutable lowerdirs and having something like this:
>>
>> layer0 --> Writeable space
>> layer1 --> Real XFS filesystem
>> layer2 --> FUSE FS
>>
>> and if there is a "miss" on layer 1, it will then look it up on
>> layer 2 while layer 1 is being populated. Then the FUSE FS can block.
>> This is neat, but it requires the FUSE FS to always be up, and incurs
>> a userspace bounce on every miss.

Interesting.

I haven't checked the patches yet, but does the patchset "FUSE BPF: A
Stacked Filesystem Extension for FUSE" help with your use case, Sargun?

Best,
Rodrigo

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-05-29 17:52 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-25 19:48 Lazy Loading Layers (Userfaultfd for filesystems?) Sargun Dhillon
2021-01-26  5:18 ` Amir Goldstein
2021-01-26 13:12   ` Alessio Balsini
2023-05-29 15:15 ` Detaching lower layers (Was: Lazy Loading Layers) Amir Goldstein
2023-05-29 17:50   ` Rodrigo Campos
