* [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
@ 2023-02-27  9:22 Alexander Larsson
  2023-02-27 10:45 ` Gao Xiang
                   ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Alexander Larsson @ 2023-02-27  9:22 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-fsdevel

Hello,

Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
Composefs filesystem. It is an opportunistically sharing, validating
image-based filesystem, targeting usecases like validated ostree
rootfs:es, validated container images that share common files, as well
as other image based usecases.

During the discussions in the composefs proposal (as seen on LWN[3])
it has been proposed that (with some changes to overlayfs), similar
behaviour can be achieved by combining the overlayfs
"overlay.redirect" xattr with a read-only filesystem such as erofs.

There are pros and cons to both these approaches, and the discussion
about their respective value has sometimes been heated. We would like
to have an in-person discussion at the summit, ideally also involving
more of the filesystem development community, so that we can reach
some consensus on what is the best approach.

Good participants would be at least: Alexander Larsson, Giuseppe
Scrivano, Amir Goldstein, David Chinner, Gao Xiang, Miklos Szeredi,
Jingbo Xu.

[1] https://github.com/containers/composefs
[2] https://lore.kernel.org/lkml/cover.1674227308.git.alexl@redhat.com/
[3] https://lwn.net/SubscriberLink/922851/45ed93154f336f73/

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Alexander Larsson                                            Red Hat, Inc
       alexl@redhat.com            alexander.larsson@gmail.com
He's a lounge-singing crooked cowboy on his last day in the job. She's a
psychotic nymphomaniac single mother prone to fits of savage,
blood-crazed rage. They fight crime!


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-02-27  9:22 [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay Alexander Larsson
@ 2023-02-27 10:45 ` Gao Xiang
  2023-02-27 10:58   ` Christian Brauner
  2023-03-01  3:47   ` Jingbo Xu
  2023-02-27 11:37 ` Jingbo Xu
  2023-03-03 13:57 ` Alexander Larsson
  2 siblings, 2 replies; 42+ messages in thread
From: Gao Xiang @ 2023-02-27 10:45 UTC (permalink / raw)
  To: Alexander Larsson, lsf-pc; +Cc: linux-fsdevel, Christian Brauner, Jingbo Xu


(+cc Jingbo Xu and Christian Brauner)

On 2023/2/27 17:22, Alexander Larsson wrote:
> Hello,
> 
> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
> Composefs filesystem. It is an opportunistically sharing, validating
> image-based filesystem, targeting usecases like validated ostree
> rootfs:es, validated container images that share common files, as well
> as other image based usecases.
> 
> During the discussions in the composefs proposal (as seen on LWN[3])
> it has been proposed that (with some changes to overlayfs), similar
> behaviour can be achieved by combining the overlayfs
> "overlay.redirect" xattr with a read-only filesystem such as erofs.
> 
> There are pros and cons to both these approaches, and the discussion
> about their respective value has sometimes been heated. We would like
> to have an in-person discussion at the summit, ideally also involving
> more of the filesystem development community, so that we can reach
> some consensus on what is the best approach.
> 
> Good participants would be at least: Alexander Larsson, Giuseppe
> Scrivano, Amir Goldstein, David Chinner, Gao Xiang, Miklos Szeredi,
> Jingbo Xu

I'd be happy to discuss this at LSF/MM/BPF this year. Also, we've tracked
down the root cause of the performance gap:

composefs reads its symlink-like payload data using
cfs_read_vdata_path(), which involves kernel_read() and triggers heuristic
readahead of dir data (which also lands in the composefs vdata area
together with the payload), so most composefs dir I/O is already done
in advance by that heuristic readahead.  As far as we know, almost no
existing in-kernel local fs has such heuristic readahead, and if we added
similar logic, EROFS could do better than composefs.

We've also tried random stat()s on about 500~1000 files in the tree you
shared (rather than just "ls -lR"), and EROFS did almost the same as or
better than composefs.  I expect further analysis (including blktrace)
can be shared by Jingbo later.

Not sure if Christian Brauner would like to discuss this new stacked fs
with on-disk metadata as well (especially the userns side, since that's
somewhat on the composefs roadmap as well).

Thanks,
Gao Xiang

> 
> [1] https://github.com/containers/composefs
> [2] https://lore.kernel.org/lkml/cover.1674227308.git.alexl@redhat.com/
> [3] https://lwn.net/SubscriberLink/922851/45ed93154f336f73/
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-02-27 10:45 ` Gao Xiang
@ 2023-02-27 10:58   ` Christian Brauner
  2023-04-27 16:11     ` [Lsf-pc] " Amir Goldstein
  2023-03-01  3:47   ` Jingbo Xu
  1 sibling, 1 reply; 42+ messages in thread
From: Christian Brauner @ 2023-02-27 10:58 UTC (permalink / raw)
  To: Gao Xiang; +Cc: Alexander Larsson, lsf-pc, linux-fsdevel, Jingbo Xu

On Mon, Feb 27, 2023 at 06:45:50PM +0800, Gao Xiang wrote:
> 
> (+cc Jingbo Xu and Christian Brauner)
> 
> On 2023/2/27 17:22, Alexander Larsson wrote:
> > Hello,
> > 
> > Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
> > Composefs filesystem. It is an opportunistically sharing, validating
> > image-based filesystem, targeting usecases like validated ostree
> > rootfs:es, validated container images that share common files, as well
> > as other image based usecases.
> > 
> > During the discussions in the composefs proposal (as seen on LWN[3])
> > it has been proposed that (with some changes to overlayfs), similar
> > behaviour can be achieved by combining the overlayfs
> > "overlay.redirect" xattr with a read-only filesystem such as erofs.
> > 
> > There are pros and cons to both these approaches, and the discussion
> > about their respective value has sometimes been heated. We would like
> > to have an in-person discussion at the summit, ideally also involving
> > more of the filesystem development community, so that we can reach
> > some consensus on what is the best approach.
> > 
> > Good participants would be at least: Alexander Larsson, Giuseppe
> > Scrivano, Amir Goldstein, David Chinner, Gao Xiang, Miklos Szeredi,
> > Jingbo Xu
> I'd be happy to discuss this at LSF/MM/BPF this year. Also we've addressed
> the root cause of the performance gap is that
> 
> composefs read some data symlink-like payload data by using
> cfs_read_vdata_path() which involves kernel_read() and trigger heuristic
> readahead of dir data (which is also landed in composefs vdata area
> together with payload), so that most composefs dir I/O is already done
> in advance by heuristic  readahead.  And we think almost all exist
> in-kernel local fses doesn't have such heuristic readahead and if we add
> the similar stuff, EROFS could do better than composefs.
> 
> Also we've tried random stat()s about 500~1000 files in the tree you shared
> (rather than just "ls -lR") and EROFS did almost the same or better than
> composefs.  I guess further analysis (including blktrace) could be shown by
> Jingbo later.
> 
> Not sure if Christian Brauner would like to discuss this new stacked fs

I'll be at lsfmm in any case and already got my invite a while ago. I
intend to give some updates about a few vfs things and I can talk about
this as well.

Thanks, Gao!

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-02-27  9:22 [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay Alexander Larsson
  2023-02-27 10:45 ` Gao Xiang
@ 2023-02-27 11:37 ` Jingbo Xu
  2023-03-03 13:57 ` Alexander Larsson
  2 siblings, 0 replies; 42+ messages in thread
From: Jingbo Xu @ 2023-02-27 11:37 UTC (permalink / raw)
  To: Alexander Larsson, lsf-pc; +Cc: linux-fsdevel



On 2/27/23 5:22 PM, Alexander Larsson wrote:
> Hello,
> 
> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
> Composefs filesystem. It is an opportunistically sharing, validating
> image-based filesystem, targeting usecases like validated ostree
> rootfs:es, validated container images that share common files, as well
> as other image based usecases.
> 
> During the discussions in the composefs proposal (as seen on LWN[3])
> it has been proposed that (with some changes to overlayfs), similar
> behaviour can be achieved by combining the overlayfs
> "overlay.redirect" xattr with a read-only filesystem such as erofs.
> 
> There are pros and cons to both these approaches, and the discussion
> about their respective value has sometimes been heated. We would like
> to have an in-person discussion at the summit, ideally also involving
> more of the filesystem development community, so that we can reach
> some consensus on what is the best approach.
> 
> Good participants would be at least: Alexander Larsson, Giuseppe
> Scrivano, Amir Goldstein, David Chinner, Gao Xiang, Miklos Szeredi,
> Jingbo Xu.
> 
> [1] https://github.com/containers/composefs
> [2] https://lore.kernel.org/lkml/cover.1674227308.git.alexl@redhat.com/
> [3] https://lwn.net/SubscriberLink/922851/45ed93154f336f73/
> 

I'm quite interested in the topic and would be glad to attend the
discussion if possible.

Thanks.

-- 
Thanks,
Jingbo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-02-27 10:45 ` Gao Xiang
  2023-02-27 10:58   ` Christian Brauner
@ 2023-03-01  3:47   ` Jingbo Xu
  2023-03-03 14:41     ` Alexander Larsson
  1 sibling, 1 reply; 42+ messages in thread
From: Jingbo Xu @ 2023-03-01  3:47 UTC (permalink / raw)
  To: Gao Xiang, Alexander Larsson, Christian Brauner, Amir Goldstein
  Cc: linux-fsdevel, lsf-pc

Hi all,

On 2/27/23 6:45 PM, Gao Xiang wrote:
> 
> (+cc Jingbo Xu and Christian Brauner)
> 
> On 2023/2/27 17:22, Alexander Larsson wrote:
>> Hello,
>>
>> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
>> Composefs filesystem. It is an opportunistically sharing, validating
>> image-based filesystem, targeting usecases like validated ostree
>> rootfs:es, validated container images that share common files, as well
>> as other image based usecases.
>>
>> During the discussions in the composefs proposal (as seen on LWN[3])
>> it has been proposed that (with some changes to overlayfs), similar
>> behaviour can be achieved by combining the overlayfs
>> "overlay.redirect" xattr with a read-only filesystem such as erofs.
>>
>> There are pros and cons to both these approaches, and the discussion
>> about their respective value has sometimes been heated. We would like
>> to have an in-person discussion at the summit, ideally also involving
>> more of the filesystem development community, so that we can reach
>> some consensus on what is the best approach.
>>
>> Good participants would be at least: Alexander Larsson, Giuseppe
>> Scrivano, Amir Goldstein, David Chinner, Gao Xiang, Miklos Szeredi,
>> Jingbo Xu
> I'd be happy to discuss this at LSF/MM/BPF this year. Also we've addressed
> the root cause of the performance gap is that
> 
> composefs read some data symlink-like payload data by using
> cfs_read_vdata_path() which involves kernel_read() and trigger heuristic
> readahead of dir data (which is also landed in composefs vdata area
> together with payload), so that most composefs dir I/O is already done
> in advance by heuristic  readahead.  And we think almost all exist
> in-kernel local fses doesn't have such heuristic readahead and if we add
> the similar stuff, EROFS could do better than composefs.
> 
> Also we've tried random stat()s about 500~1000 files in the tree you shared
> (rather than just "ls -lR") and EROFS did almost the same or better than
> composefs.  I guess further analysis (including blktrace) could be shown by
> Jingbo later.
> 

The link path strings and dirents are stored mixed together in a so-called
vdata (variable data) section[1] in composefs, sometimes even in the same
block (figured out by dumping the composefs image).  When doing a lookup,
composefs resolves the link path.  It reads the link path string
from the vdata section through kernel_read(), and along the way the
dirents in the following blocks are also read in by the heuristic
readahead algorithm in kernel_read().  I believe this greatly benefits
the performance of a workload like "ls -lR".



Test on Subset of Files
=======================

I also tested the performance of running stat(1) on a random subset of
the files in the tested image[2], selected with "find
<root_directory_of_tested_image> -type f -printf "%p\n" | sort -R | head
-n <lines>".
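
(Roughly, the runs can be reproduced like this; the mount point, file
count and use of stat(1) via xargs are placeholders rather than the
exact script used:)

```
find /mnt/test -type f -printf "%p\n" | sort -R | head -n 1000 > /tmp/files
echo 3 > /proc/sys/vm/drop_caches         # for the "uncached" numbers
time xargs -a /tmp/files stat > /dev/null
```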

					      | uncached| cached
					      |  (ms)	|  (ms)
----------------------------------------------|---------|--------
(1900 files)
composefs				      | 352	| 15
erofs (raw disk) 			      | 355 	| 16
erofs (DIRECT loop) 			      | 367 	| 16
erofs (DIRECT loop) + overlayfs(lazyfollowup) | 379 	| 16
erofs (BUFFER loop) 			      | 85 	| 16
erofs (BUFFER loop) + overlayfs(lazyfollowup) | 96 	| 16

(1000 files)
composefs				      | 311	| 9
erofs (DIRECT loop)			      | 260	| 9
erofs (raw disk) 			      | 255 	| 9
erofs (DIRECT loop) + overlayfs(lazyfollowup) | 262 	| 9.7
erofs (BUFFER loop) 			      | 71 	| 9
erofs (BUFFER loop) + overlayfs(lazyfollowup) | 77 	| 9.4

(500 files)
composefs				      | 258	| 5.5
erofs (DIRECT loop)			      | 180	| 5.5
erofs (raw disk) 			      | 179 	| 5.5
erofs (DIRECT loop) + overlayfs(lazyfollowup) | 182 	| 5.9
erofs (BUFFER loop) 			      | 55 	| 5.7
erofs (BUFFER loop) + overlayfs(lazyfollowup) | 60 	| 5.8


Here I tested erofs alone (without overlayfs) and erofs+overlayfs.  The
erofs code base tested is the latest upstream, without any extra
optimization.

It can be seen that, as the number of stat()ed files decreases, erofs
gradually performs better than composefs.  This indicates that the
heuristic readahead in kernel_read() plays an important role in the
final performance statistics of this workload.



blktrace Log
============

To further verify that the heuristic readahead in kernel_read() reads
ahead dirents for composefs, I dumped the blktrace log while
composefs was accessing the manifest file.

Composefs is mounted on "/mnt/cps", and then I ran the following three
commands sequentially.

```
# ls -l /mnt/cps/etc/NetworkManager
# ls -l /mnt/cps/etc/pki
# strace ls /mnt/cps/etc/pki/pesign-rh-test
```
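
(The traces were captured with blktrace/blkparse against the block device
holding the manifest file (a loop device, judging by the 7,0 major:minor
below), along these lines; the device name is a placeholder:)

```
blktrace -d /dev/loop0 -o - | blkparse -i - > cfs-blktrace.log
```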


The blktrace output for each of the above three commands is shown below:

```
# blktrace output for "ls -l /mnt/cps/etc/NetworkManager"
  7,0   66        1     0.000000000     0  C   R 9136 + 8 [0]
  7,0   66        2     0.000302905     0  C   R 8 + 8 [0]
  7,0   66        3     0.000506568     0  C   R 9144 + 8 [0]
  7,0   66        4     0.000968212     0  C   R 9152 + 8 [0]
  7,0   66        5     0.001054728     0  C   R 48 + 8 [0]
  7,0   66        6     0.001422439     0  C  RA 9296 + 32 [0]
  7,0   66        7     0.002019686     0  C  RA 9328 + 128 [0]
  7,0   53        4     0.000006260  9052  Q   R 8 + 8 [ls]
  7,0   53        5     0.000006699  9052  G   R 8 + 8 [ls]
  7,0   53        6     0.000006892  9052  D   R 8 + 8 [ls]
  7,0   53        7     0.000308009  9052  Q   R 9144 + 8 [ls]
  7,0   53        8     0.000308552  9052  G   R 9144 + 8 [ls]
  7,0   53        9     0.000308780  9052  D   R 9144 + 8 [ls]
  7,0   53       10     0.000893060  9052  Q   R 9152 + 8 [ls]
  7,0   53       11     0.000893604  9052  G   R 9152 + 8 [ls]
  7,0   53       12     0.000893964  9052  D   R 9152 + 8 [ls]
  7,0   53       13     0.000975783  9052  Q   R 48 + 8 [ls]
  7,0   53       14     0.000976134  9052  G   R 48 + 8 [ls]
  7,0   53       15     0.000976286  9052  D   R 48 + 8 [ls]
  7,0   53       16     0.001061486  9052  Q  RA 9296 + 32 [ls]
  7,0   53       17     0.001061892  9052  G  RA 9296 + 32 [ls]
  7,0   53       18     0.001062066  9052  P   N [ls]
  7,0   53       19     0.001062282  9052  D  RA 9296 + 32 [ls]
  7,0   53       20     0.001433106  9052  Q  RA 9328 + 128 [ls]
<--readahead dirents of "/mnt/cps/etc/pki/pesign-rh-test" directory
  7,0   53       21     0.001433613  9052  G  RA 9328 + 128 [ls]
  7,0   53       22     0.001433742  9052  P   N [ls]
  7,0   53       23     0.001433888  9052  D  RA 9328 + 128 [ls]

# blktrace output for "ls -l /mnt/cps/etc/pki"
  7,0   66        8    56.301287076     0  C   R 32 + 8 [0]
  7,0   66        9    56.301580752     0  C   R 9160 + 8 [0]
  7,0   66       10    56.301666669     0  C   R 96 + 8 [0]
  7,0   53       24    56.300902079  9065  Q   R 32 + 8 [ls]
  7,0   53       25    56.300904047  9065  G   R 32 + 8 [ls]
  7,0   53       26    56.300904720  9065  D   R 32 + 8 [ls]
  7,0   53       27    56.301478055  9065  Q   R 9160 + 8 [ls]
  7,0   53       28    56.301478831  9065  G   R 9160 + 8 [ls]
  7,0   53       29    56.301479147  9065  D   R 9160 + 8 [ls]
  7,0   53       30    56.301588701  9065  Q   R 96 + 8 [ls]
  7,0   53       31    56.301589461  9065  G   R 96 + 8 [ls]
  7,0   53       32    56.301589836  9065  D   R 96 + 8 [ls]

# no output for "strace ls /mnt/cps/etc/pki/pesign-rh-test"
```

I found that blktrace output is produced when running the first two
commands, i.e. "ls -l /mnt/cps/etc/NetworkManager" and "ls -l
/mnt/cps/etc/pki", while there is no blktrace output when running the
last command, i.e. "strace ls /mnt/cps/etc/pki/pesign-rh-test".

Let's look at the blktrace log for the first command, i.e. "ls -l
/mnt/cps/etc/NetworkManager".  There's a readahead on sector 9328 with a
length of 128 sectors.


It can be seen from the filefrag output of the manifest file, i.e.
large.composefs, that the manifest file is stored on disk starting
at sector 8, and thus the readahead range starts at sector 9320 (9328 -
8) within the manifest file.

```
# filefrag -v -b512 large.composefs
File size of large.composefs is 8998590 (17576 blocks of 512 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   17567:          8..     17575:  17568:
   1:  8994816.. 8998589:          0..      3773:   3774:    8998912:
last,not_aligned,inline,eof
large.composefs: 2 extents found
```


I dumped the manifest file with the tool from [3], enhanced to also
print the sector address of the vdata section for each file.  For
directories, the corresponding vdata section is used to store dirents.

```
|---pesign-rh-test, block 9320(1)/  <-- dirents in pesign-rh-test
|----cert9.db [etc/pki/pesign-rh-test/cert9.db], block 9769(1)
|----key4.db [etc/pki/pesign-rh-test/key4.db], block 9769(1)
|----pkcs11.txt [etc/pki/pesign-rh-test/pkcs11.txt], block 9769(1)
```

It can be seen that the dirents of the "/mnt/cps/etc/pki/pesign-rh-test"
directory are placed at sector 9320 relative to the start of the manifest
file, which has already been read ahead when running "ls -l
/mnt/cps/etc/NetworkManager".  This explains why no IO is submitted
when reading the dirents of the "/mnt/cps/etc/pki/pesign-rh-test"
directory.



[1]
https://lore.kernel.org/lkml/20baca7da01c285b2a77c815c9d4b3080ce4b279.1674227308.git.alexl@redhat.com/
[2] https://my.owndrive.com/index.php/s/irHJXRpZHtT3a5i
[3] https://github.com/containers/composefs

-- 
Thanks,
Jingbo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-02-27  9:22 [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay Alexander Larsson
  2023-02-27 10:45 ` Gao Xiang
  2023-02-27 11:37 ` Jingbo Xu
@ 2023-03-03 13:57 ` Alexander Larsson
  2023-03-03 15:13   ` Gao Xiang
                     ` (2 more replies)
  2 siblings, 3 replies; 42+ messages in thread
From: Alexander Larsson @ 2023-03-03 13:57 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jingbo Xu,
	Gao Xiang, Giuseppe Scrivano, Dave Chinner, Vivek Goyal,
	Miklos Szeredi

On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote:
>
> Hello,
>
> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
> Composefs filesystem. It is an opportunistically sharing, validating
> image-based filesystem, targeting usecases like validated ostree
> rootfs:es, validated container images that share common files, as well
> as other image based usecases.
>
> During the discussions in the composefs proposal (as seen on LWN[3])
> it has been proposed that (with some changes to overlayfs), similar
> behaviour can be achieved by combining the overlayfs
> "overlay.redirect" xattr with a read-only filesystem such as erofs.
>
> There are pros and cons to both these approaches, and the discussion
> about their respective value has sometimes been heated. We would like
> to have an in-person discussion at the summit, ideally also involving
> more of the filesystem development community, so that we can reach
> some consensus on what is the best approach.

In order to better understand the behaviour and requirements of the
overlayfs+erofs approach I spent some time implementing direct support
for erofs in libcomposefs. So, with current HEAD of
github.com/containers/composefs you can now do:

$ mkcompose --digest-store=objects --format=erofs source-dir image.erofs

This will produce an object store with the backing files, and an erofs
file with the required overlayfs xattrs, including a made-up one
called "overlay.fs-verity" containing the expected fs-verity digest
for the lower dir. It also adds the required whiteouts to cover the
00-ff dirs from the lower dir.

These erofs files are ordered similarly to the composefs files, and we
give similar guarantees about their reproducibility, etc. So, they
should be apples-to-apples comparable with the composefs images.
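
(To inspect the result, the image can be loop-mounted and its xattrs
dumped; the path and xattr namespace shown here are illustrative rather
than exact:)

```
mount -t erofs -o ro,loop image.erofs /mnt/erofs
getfattr -d -m - /mnt/erofs/usr/bin/bash    # shows overlay.redirect etc.
```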

Given this, I ran another set of performance tests on the original cs9
rootfs dataset, again measuring the time of `ls -lR`. I also tried to
measure the memory use like this:

# echo 3 > /proc/sys/vm/drop_caches
# systemd-run --scope sh -c 'ls -lR mountpoint > /dev/null; cat $(cat
/proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'

These are the alternatives I tried:

xfs: the source of the image, regular dir on xfs
erofs: the image.erofs above, on loopback
erofs dio: the image.erofs above, on loopback with --direct-io=on
ovl: erofs above combined with overlayfs
ovl dio: erofs dio above combined with overlayfs
cfs: composefs mount of image.cfs

All tests use the same objects dir, stored on xfs. The erofs and
overlay implementations are from a stock 6.1.13 kernel, and the composefs
module is from GitHub HEAD.

I tried loopback both with and without the direct-io option, because
without direct-io enabled the kernel will double-cache the loopbacked
data, as per[1].
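
(Concretely, the setups compared correspond roughly to the following;
device names, paths and the exact overlayfs options are my best guess
rather than the literal commands used:)

```
# erofs / "erofs dio": loop device without or with direct I/O
LOOP=$(losetup -f --show image.erofs)                    # buffered
# LOOP=$(losetup -f --show --direct-io=on image.erofs)   # direct I/O
mount -t erofs -o ro "$LOOP" /mnt/erofs

# "ovl": overlayfs on top, with the object store as an extra lower layer
# so that the overlay.redirect xattrs can resolve into it
mount -t overlay overlay -o ro,redirect_dir=on,metacopy=on,lowerdir=/mnt/erofs:/path/to/objects /mnt/ovl
```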

The produced images are:
 8.9M image.cfs
11.3M image.erofs

And gives these results:
           | Cold cache | Warm cache | Mem use
           |   (msec)   |   (msec)   |  (mb)
-----------+------------+------------+---------
xfs        |   1449     |    442     |    54
erofs      |    700     |    391     |    45
erofs dio  |    939     |    400     |    45
ovl        |   1827     |    530     |   130
ovl dio    |   2156     |    531     |   130
cfs        |    689     |    389     |    51

I also ran the same tests in a VM that had the latest kernel including
the lazyfollow patches (ovl lazy in the table, not using direct-io),
this one ext4 based:

           | Cold cache | Warm cache | Mem use
           |   (msec)   |   (msec)   |  (mb)
-----------+------------+------------+---------
ext4       |   1135     |    394     |    54
erofs      |    715     |    401     |    46
erofs dio  |    922     |    401     |    45
ovl        |   1412     |    515     |   148
ovl dio    |   1810     |    532     |   149
ovl lazy   |   1063     |    523     |    87
cfs        |    719     |    463     |    51

Things noticeable in the results:

* composefs and erofs (by itself) perform roughly similarly. This is
  not necessarily news, and results from Jingbo Xu match this.

* Erofs on top of direct-io enabled loopback causes quite a drop in
  performance, which I don't really understand. Especially since it's
  reporting the same memory use as non-direct io. I guess the
  double-caching in the latter case isn't properly attributed to the
  cgroup, so the difference is not measured. However, why would the
  double cache improve performance?  Maybe I'm not completely
  understanding how these things interact.

* Stacking overlay on top of erofs causes about 100msec slower
  warm-cache times compared to all non-overlay approaches, and much
  more in the cold cache case. The cold cache performance is helped
  significantly by the lazyfollow patches, but the warm cache overhead
  remains.

* The use of overlayfs more than doubles memory use, probably
  because of all the extra inodes and dentries in action for the
  various layers. The lazyfollow patches helps, but only partially.

* Even though overlayfs+erofs is slower than cfs and raw erofs, it is
  not that much slower (~25%) than the pure xfs/ext4 directory, which
  is a pretty good baseline for comparisons. It is even faster when
  using lazyfollow on ext4.

* The erofs images are slightly larger than the equivalent composefs
  images.

In summary: The performance of composefs is somewhat better than the
best erofs+ovl combination, although the overlay approach is not
significantly worse than the baseline of a regular directory, except
that it uses a bit more memory.

On top of the above pure performance based comparisons I would like to
re-state some of the other advantages of composefs compared to the
overlay approach:

* composefs is namespaceable, in the sense that you can use it (given
  mount capabilities) inside a namespace (such as a container) without
  access to non-namespaced resources like loopback or device-mapper
  devices. (There was work on fixing this with loopfs, but that seems
  to have stalled.)

* While it is not in the current design, the simplicity of the format
  and lack of loopback makes it at least theoretically possible that
  composefs can be made usable in a rootless fashion at some point in
  the future.

And of course, there are disadvantages to composefs too. Primarily
that it is more code, increasing the maintenance burden and the risk of
security problems. Composefs is particularly burdensome because it is a
stacking filesystem, and these have historically been shown to be hard
to get right.


The question now is what is the best approach overall? For my own
primary usecase of making a verifying ostree root filesystem, the
overlay approach (with the lazyfollow work finished) is, while not
ideal, good enough.

But I know for the people who are more interested in using composefs
for containers the eventual goal of rootless support is very
important. So, on behalf of them I guess the question is: Is there
ever any chance that something like composefs could work rootlessly?
Or conversely: Is there some way to get rootless support from the
overlay approach? Opinions? Ideas?


[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bc07c10a3603a5ab3ef01ba42b3d41f9ac63d1b6



--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                Red Hat, Inc
       alexl@redhat.com         alexander.larsson@gmail.com


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-01  3:47   ` Jingbo Xu
@ 2023-03-03 14:41     ` Alexander Larsson
  2023-03-03 15:48       ` Gao Xiang
  0 siblings, 1 reply; 42+ messages in thread
From: Alexander Larsson @ 2023-03-03 14:41 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: Gao Xiang, Christian Brauner, Amir Goldstein, linux-fsdevel, lsf-pc

On Wed, Mar 1, 2023 at 4:47 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
> Hi all,
>
> On 2/27/23 6:45 PM, Gao Xiang wrote:
> >
> > (+cc Jingbo Xu and Christian Brauner)
> >
> > On 2023/2/27 17:22, Alexander Larsson wrote:
> >> Hello,
> >>
> >> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
> >> Composefs filesystem. It is an opportunistically sharing, validating
> >> image-based filesystem, targeting usecases like validated ostree
> >> rootfs:es, validated container images that share common files, as well
> >> as other image based usecases.
> >>
> >> During the discussions in the composefs proposal (as seen on LWN[3])
> >> it has been proposed that (with some changes to overlayfs), similar
> >> behaviour can be achieved by combining the overlayfs
> >> "overlay.redirect" xattr with a read-only filesystem such as erofs.
> >>
> >> There are pros and cons to both these approaches, and the discussion
> >> about their respective value has sometimes been heated. We would like
> >> to have an in-person discussion at the summit, ideally also involving
> >> more of the filesystem development community, so that we can reach
> >> some consensus on what is the best approach.
> >>
> >> Good participants would be at least: Alexander Larsson, Giuseppe
> >> Scrivano, Amir Goldstein, David Chinner, Gao Xiang, Miklos Szeredi,
> >> Jingbo Xu
> > I'd be happy to discuss this at LSF/MM/BPF this year. Also we've addressed
> > the root cause of the performance gap is that
> >
> > composefs read some data symlink-like payload data by using
> > cfs_read_vdata_path() which involves kernel_read() and trigger heuristic
> > readahead of dir data (which is also landed in composefs vdata area
> > together with payload), so that most composefs dir I/O is already done
> > in advance by heuristic  readahead.  And we think almost all exist
> > in-kernel local fses doesn't have such heuristic readahead and if we add
> > the similar stuff, EROFS could do better than composefs.
> >
> > Also we've tried random stat()s about 500~1000 files in the tree you shared
> > (rather than just "ls -lR") and EROFS did almost the same or better than
> > composefs.  I guess further analysis (including blktrace) could be shown by
> > Jingbo later.
> >
>
> The link path string and dirents are mix stored in a so-called vdata
> (variable data) section[1] in composefs, sometimes even in the same
> block (figured out by dumping the composefs image).  When doing lookup,
> composefs will resolve the link path.  It will read the link path string
> from vdata section through kernel_read(), along which those dirents in
> the following blocks are also read in by the heuristic readahead
> algorithm in kernel_read().  I believe this will much benefit the
> performance in the workload like "ls -lR".

This is interesting stuff, and honestly I'm a bit surprised other
filesystems don't try to readahead directory metadata to some degree
too. It seems inherent to all filesystems that they try to pack
related metadata near each other, so readahead would probably be
useful even for read-write filesystems, although even more so for
read-only filesystems (due to lack of fragmentation).

But anyway, this is sort of beside the current issue. There is nothing
inherent in composefs that makes it have to do readahead like this,
and correspondingly, if it is a good idea to do it, erofs could do it
too.

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                Red Hat, Inc
       alexl@redhat.com         alexander.larsson@gmail.com


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-03 13:57 ` Alexander Larsson
@ 2023-03-03 15:13   ` Gao Xiang
  2023-03-03 17:37     ` Gao Xiang
  2023-03-07 10:15     ` Christian Brauner
  2023-03-04  0:46   ` Jingbo Xu
  2023-03-06 11:33   ` Alexander Larsson
  2 siblings, 2 replies; 42+ messages in thread
From: Gao Xiang @ 2023-03-03 15:13 UTC (permalink / raw)
  To: Alexander Larsson, lsf-pc
  Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jingbo Xu,
	Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi

Hi Alexander,

On 2023/3/3 21:57, Alexander Larsson wrote:
> On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote:
>>
>> Hello,
>>
>> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
>> Composefs filesystem. It is an opportunistically sharing, validating
>> image-based filesystem, targeting usecases like validated ostree
>> rootfs:es, validated container images that share common files, as well
>> as other image based usecases.
>>
>> During the discussions in the composefs proposal (as seen on LWN[3])
>> it has been proposed that (with some changes to overlayfs), similar
>> behaviour can be achieved by combining the overlayfs
>> "overlay.redirect" xattr with a read-only filesystem such as erofs.
>>
>> There are pros and cons to both these approaches, and the discussion
>> about their respective value has sometimes been heated. We would like
>> to have an in-person discussion at the summit, ideally also involving
>> more of the filesystem development community, so that we can reach
>> some consensus on what is the best approach.
> 
> In order to better understand the behaviour and requirements of the
> overlayfs+erofs approach I spent some time implementing direct support
> for erofs in libcomposefs. So, with current HEAD of
> github.com/containers/composefs you can now do:
> 
> $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs

Thank you for taking the time to work on EROFS support.  I don't have
time to play with it yet since I'd like to get erofs-utils 1.6 out
these days and will then work on some new stuff such as !pagesize block
sizes, as I said previously.

> 
> This will produce an object store with the backing files, and a erofs
> file with the required overlayfs xattrs, including a made up one
> called "overlay.fs-verity" containing the expected fs-verity digest
> for the lower dir. It also adds the required whiteouts to cover the
> 00-ff dirs from the lower dir.
> 
> These erofs files are ordered similarly to the composefs files, and we
> give similar guarantees about their reproducibility, etc. So, they
> should be apples-to-apples comparable with the composefs images.
> 
> Given this, I ran another set of performance tests on the original cs9
> rootfs dataset, again measuring the time of `ls -lR`. I also tried to
> measure the memory use like this:
> 
> # echo 3 > /proc/sys/vm/drop_caches
> # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat
> /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'
> 
> These are the alternatives I tried:
> 
> xfs: the source of the image, regular dir on xfs
> erofs: the image.erofs above, on loopback
> erofs dio: the image.erofs above, on loopback with --direct-io=on
> ovl: erofs above combined with overlayfs
> ovl dio: erofs dio above combined with overlayfs
> cfs: composefs mount of image.cfs
> 
> All tests use the same objects dir, stored on xfs. The erofs and
> overlay implementations are from a stock 6.1.13 kernel, and composefs
> module is from github HEAD.
> 
> I tried loopback both with and without the direct-io option, because
> without direct-io enabled the kernel will double-cache the loopbacked
> data, as per[1].
> 
> The produced images are:
>   8.9M image.cfs
> 11.3M image.erofs
> 
> And gives these results:
>             | Cold cache | Warm cache | Mem use
>             |   (msec)   |   (msec)   |  (mb)
> -----------+------------+------------+---------
> xfs        |   1449     |    442     |    54
> erofs      |    700     |    391     |    45
> erofs dio  |    939     |    400     |    45
> ovl        |   1827     |    530     |   130
> ovl dio    |   2156     |    531     |   130
> cfs        |    689     |    389     |    51
> 
> I also ran the same tests in a VM that had the latest kernel including
> the lazyfollow patches (ovl lazy in the table, not using direct-io),
> this one ext4 based:
> 
>             | Cold cache | Warm cache | Mem use
>             |   (msec)   |   (msec)   |  (mb)
> -----------+------------+------------+---------
> ext4       |   1135     |    394     |    54
> erofs      |    715     |    401     |    46
> erofs dio  |    922     |    401     |    45
> ovl        |   1412     |    515     |   148
> ovl dio    |   1810     |    532     |   149
> ovl lazy   |   1063     |    523     |    87
> cfs        |    719     |    463     |    51
> 
> Things noticeable in the results:
> 
> * composefs and erofs (by itself) perform roughly  similar. This is
>    not necessarily news, and results from Jingbo Xu match this.
> 
> * Erofs on top of direct-io enabled loopback causes quite a drop in
>    performance, which I don't really understand. Especially since its
>    reporting the same memory use as non-direct io. I guess the
>    double-cacheing in the later case isn't properly attributed to the
>    cgroup so the difference is not measured. However, why would the
>    double cache improve performance?  Maybe I'm not completely
>    understanding how these things interact.

We've already analysed this: the root cause is that composefs
uses kernel_read() to read its path data, so unrelated metadata
(such as dir data) is read in along with it.  Such heuristic readahead
is unusual for local fses (obviously almost all in-kernel
filesystems don't use kernel_read() to read their metadata; although
some filesystems do read ahead related extent metadata when reading an
inode, they at least do _not_ work like kernel_read()).  And
double caching introduces almost the same effect as kernel_read()
(assuming you have read some of the loop device source code).

I do hope you have already read Jingbo's latest test results, which
show how badly readahead performs if fs metadata is only partially,
randomly used (stat of < 1500 files):
https://lore.kernel.org/r/83829005-3f12-afac-9d05-8ba721a80b4d@linux.alibaba.com

You could also explicitly _disable_ readahead for the composefs
manifest file (because all EROFS metadata reads are done without
readahead) and see how it works then.

Again, if your workload is just "ls -lR", my answer is "just async
readahead the whole manifest file / loop device" at mount time.
That will give you the best result.  But I'm not sure that is the
real use case you are proposing.
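
(From userspace, that kind of mount-time prefetch could look roughly like
this; file and device names are placeholders:)

```
# warm the whole (small) metadata image into the page cache, asynchronously
cat image.cfs > /dev/null &

# for a loop-mounted image, the block-level readahead window can also be raised
blockdev --setra 4096 /dev/loop0
```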

> 
> * Stacking overlay on top of erofs causes about 100msec slower
>    warm-cache times compared to all non-overlay approaches, and much
>    more in the cold cache case. The cold cache performance is helped
>    significantly by the lazyfollow patches, but the warm cache overhead
>    remains.
> 
> * The use of overlayfs more than doubles memory use, probably
>    because of all the extra inodes and dentries in action for the
>    various layers. The lazyfollow patches helps, but only partially.
> 
> * Even though overlayfs+erofs is slower than cfs and raw erofs, it is
>    not that much slower (~25%) than the pure xfs/ext4 directory, which
>    is a pretty good baseline for comparisons. It is even faster when
>    using lazyfollow on ext4.
> 
> * The erofs images are slightly larger than the equivalent composefs
>    image.
> 
> In summary: The performance of composefs is somewhat better than the
> best erofs+ovl combination, although the overlay approach is not
> significantly worse than the baseline of a regular directory, except
> that it uses a bit more memory.
> 
> On top of the above pure performance based comparisons I would like to
> re-state some of the other advantages of composefs compared to the
> overlay approach:
> 
> * composefs is namespaceable, in the sense that you can use it (given
>    mount capabilities) inside a namespace (such as a container) without
>    access to non-namespaced resources like loopback or device-mapper
>    devices. (There was work on fixing this with loopfs, but that seems
>    to have stalled.)
> 
> * While it is not in the current design, the simplicity of the format
>    and lack of loopback makes it at least theoretically possible that
>    composefs can be made usable in a rootless fashion at some point in
>    the future.

Have you considered sending some commands to /dev/cachefiles to configure
a daemonless dir and mounting the erofs image directly using "erofs over
fscache", but in a daemonless way?  That is ongoing work on our side.

IMHO, I don't find file-based interfaces particularly attractive.
Historically, I recall the practice has been to "avoid directly reading
files in the kernel", which is why I think almost all local fses don't
work on files directly and loopback devices are the usual way for these
use cases.  If loopback devices are not okay for you, how about improving
loopback devices, which would benefit almost all local fses?

> 
> And of course, there are disadvantages to composefs too. Primarily
> being more code, increasing maintenance burden and risk of security
> problems. Composefs is particularly burdensome because it is a
> stacking filesystem and these have historically been shown to be hard
> to get right.
> 
> 
> The question now is what is the best approach overall? For my own
> primary usecase of making a verifying ostree root filesystem, the
> overlay approach (with the lazyfollow work finished) is, while not
> ideal, good enough.

So your criterion is still "ls -lR", and your use case is still just
pure read-only, without any writable stuff?

Anyway, I'm really happy to work with you on your ostree use cases
as always, as long as all the corner cases are worked out by the
community.

> 
> But I know for the people who are more interested in using composefs
> for containers the eventual goal of rootless support is very
> important. So, on behalf of them I guess the question is: Is there
> ever any chance that something like composefs could work rootlessly?
> Or conversely: Is there some way to get rootless support from the
> overlay approach? Opinions? Ideas?

Honestly, I wanted to give a proper answer when Giuseppe asked me
the same question.  My current view is simply "that question is
almost the same for all in-kernel fses with some on-disk format".

If you think the EROFS compression part is too complex and useless for
your use cases, okay, I think we could add a new mount option called
"nocompress" so that part can be explicitly avoided at runtime. But
that still doesn't help with the original question on my side.

Thanks,
Gao Xiang

> 
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bc07c10a3603a5ab3ef01ba42b3d41f9ac63d1b6
> 
> 
> 
> --
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>   Alexander Larsson                                Red Hat, Inc
>         alexl@redhat.com         alexander.larsson@gmail.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-03 14:41     ` Alexander Larsson
@ 2023-03-03 15:48       ` Gao Xiang
  0 siblings, 0 replies; 42+ messages in thread
From: Gao Xiang @ 2023-03-03 15:48 UTC (permalink / raw)
  To: Alexander Larsson, Jingbo Xu
  Cc: Christian Brauner, Amir Goldstein, linux-fsdevel, lsf-pc



On 2023/3/3 22:41, Alexander Larsson wrote:
> On Wed, Mar 1, 2023 at 4:47 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>> Hi all,
>>
>> On 2/27/23 6:45 PM, Gao Xiang wrote:
>>>
>>> (+cc Jingbo Xu and Christian Brauner)
>>>
>>> On 2023/2/27 17:22, Alexander Larsson wrote:
>>>> Hello,
>>>>
>>>> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
>>>> Composefs filesystem. It is an opportunistically sharing, validating
>>>> image-based filesystem, targeting usecases like validated ostree
>>>> rootfs:es, validated container images that share common files, as well
>>>> as other image based usecases.
>>>>
>>>> During the discussions in the composefs proposal (as seen on LWN[3])
>>>> it has been proposed that (with some changes to overlayfs), similar
>>>> behaviour can be achieved by combining the overlayfs
>>>> "overlay.redirect" xattr with a read-only filesystem such as erofs.
>>>>
>>>> There are pros and cons to both these approaches, and the discussion
>>>> about their respective value has sometimes been heated. We would like
>>>> to have an in-person discussion at the summit, ideally also involving
>>>> more of the filesystem development community, so that we can reach
>>>> some consensus on what is the best approach.
>>>>
>>>> Good participants would be at least: Alexander Larsson, Giuseppe
>>>> Scrivano, Amir Goldstein, David Chinner, Gao Xiang, Miklos Szeredi,
>>>> Jingbo Xu
>>> I'd be happy to discuss this at LSF/MM/BPF this year. Also we've addressed
>>> the root cause of the performance gap is that
>>>
>>> composefs read some data symlink-like payload data by using
>>> cfs_read_vdata_path() which involves kernel_read() and trigger heuristic
>>> readahead of dir data (which is also landed in composefs vdata area
>>> together with payload), so that most composefs dir I/O is already done
>>> in advance by heuristic  readahead.  And we think almost all exist
>>> in-kernel local fses doesn't have such heuristic readahead and if we add
>>> the similar stuff, EROFS could do better than composefs.
>>>
>>> Also we've tried random stat()s about 500~1000 files in the tree you shared
>>> (rather than just "ls -lR") and EROFS did almost the same or better than
>>> composefs.  I guess further analysis (including blktrace) could be shown by
>>> Jingbo later.
>>>
>>
>> The link path string and dirents are mix stored in a so-called vdata
>> (variable data) section[1] in composefs, sometimes even in the same
>> block (figured out by dumping the composefs image).  When doing lookup,
>> composefs will resolve the link path.  It will read the link path string
>> from vdata section through kernel_read(), along which those dirents in
>> the following blocks are also read in by the heuristic readahead
>> algorithm in kernel_read().  I believe this will much benefit the
>> performance in the workload like "ls -lR".
> 
> This is interesting stuff, and honestly I'm a bit surprised other
> filesystems don't try to readahead directory metadata to some degree
> too. It seems inherent to all filesystems that they try to pack
> related metadata near each other, so readahead would probably be
> useful even for read-write filesystems, although even more so for
> read-only filesystems (due to lack of fragmentation).

As I wrote before, IMHO, local filesystems read data in some basic
unit (for example the block size); if other unrelated metadata happens
to be read in the same shot, of course it can be read together.

Some local filesystems could also read more related metadata when
reading inodes.  But that is based on a logical relationship rather
than on the in-kernel readahead algorithm.

> 
> But anyway, this is sort of beside the current issue. There is nothing
> inherent in composefs that makes it have to do readahead like this,
> and correspondingly, if it is a good idea to do it, erofs could do it
> too,

I don't think in-tree EROFS should do random, irrelevant readahead
like kernel_read() does without proof, since it could have bad results
for small random file access.  If we did such a thing, I'm afraid it
would also be irresponsible to all end users already using EROFS in
production.

Again, "ls -lR" is not the whole world, no? If you care about
startup time, the FAST '16 Slacker paper implied only 6.4% of that
data [1] is read.  Even though it was mainly about lazy pulling, that
number is almost the same as the startup I/O in our cloud containers too.

[1] https://www.usenix.org/conference/fast16/technical-sessions/presentation/harter

Thanks,
Gao Xiang

> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-03 15:13   ` Gao Xiang
@ 2023-03-03 17:37     ` Gao Xiang
  2023-03-04 14:59       ` Colin Walters
  2023-03-07 10:15     ` Christian Brauner
  1 sibling, 1 reply; 42+ messages in thread
From: Gao Xiang @ 2023-03-03 17:37 UTC (permalink / raw)
  To: Alexander Larsson, lsf-pc
  Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jingbo Xu,
	Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi



On 2023/3/3 23:13, Gao Xiang wrote:

...

>>
>> And of course, there are disadvantages to composefs too. Primarily
>> being more code, increasing maintenance burden and risk of security
>> problems. Composefs is particularly burdensome because it is a
>> stacking filesystem and these have historically been shown to be hard
>> to get right.

Going off on a bit of a tangent from that: I do think you will
eventually find a fully-functional read-only filesystem useful.

For example with EROFS you could,

  - keep composefs model files as your main use cases;

  - keep some small files such as "VERSION" or "README" inline;

  - refer to some parts of blobs (such as tar data) directly in
    addition to whole files, which also seems a useful use case
    for OCI containers;

  - deploy all of the above to raw disks and other media as well;

  - etc.
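
(For anyone who wants to experiment with the above, a plain EROFS image
is built with mkfs.erofs from erofs-utils; the file names are
placeholders:)

```
mkfs.erofs image.erofs source-dir/           # uncompressed image
mkfs.erofs -zlz4hc image.erofs source-dir/   # LZ4HC-compressed image
```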

Actually since you're container guys, I would like to mention
a way to directly reuse OCI tar data and not sure if you
have some interest as well, that is just to generate EROFS
metadata which could point to the tar blobs so that data itself
is still the original tar, but we could add fsverity + IMMUTABLE
to these blobs rather than the individual untared files.

The main advantages over the current way (podman, containerd) are
  - it saves untar and snapshot GC time;
  - OCI layer diff IDs in the OCI spec [1] are guaranteed;
  - it is in-kernel mountable with runtime verification;
  - such tars can be mounted in secure containers in the same way
    as well.
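
(A minimal sketch of the fsverity + IMMUTABLE sealing mentioned above,
assuming the fsverity tool from fsverity-utils and a backing filesystem
with fs-verity support; the blob name is a placeholder:)

```
fsverity enable layer.tar      # seal the blob; its content can no longer change
fsverity measure layer.tar     # print the digest to record for later verification
chattr +i layer.tar            # additionally mark the blob file immutable
```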

Personally, I've been working on EROFS from the end of 2017 until
now, for many years, although the time I can spend varies due
to other ongoing work.  I've always believed a read-only approach is
about more than just pure space saving, so I've devoted almost all
my extra leisure time to this.

Honestly, I do hope more people become interested in EROFS
in addition to the original Android use cases, because the overall
intention is much the same, and I'm happy to help where I can
and to avoid another random fs dump into the Linux kernel (of course
not, though.)

[1] https://github.com/opencontainers/image-spec/blob/main/config.md

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-03 13:57 ` Alexander Larsson
  2023-03-03 15:13   ` Gao Xiang
@ 2023-03-04  0:46   ` Jingbo Xu
  2023-03-06 11:33   ` Alexander Larsson
  2 siblings, 0 replies; 42+ messages in thread
From: Jingbo Xu @ 2023-03-04  0:46 UTC (permalink / raw)
  To: Alexander Larsson, lsf-pc
  Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Gao Xiang,
	Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi



On 3/3/23 9:57 PM, Alexander Larsson wrote:
> On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote:
> 
> * Erofs on top of direct-io enabled loopback causes quite a drop in
>   performance, which I don't really understand. Especially since its
>   reporting the same memory use as non-direct io. I guess the
>   double-cacheing in the later case isn't properly attributed to the
>   cgroup so the difference is not measured. However, why would the
>   double cache improve performance?  Maybe I'm not completely
>   understanding how these things interact.
> 

Loop in BUFFERED mode actually calls .read_iter() of the backing file to
read from it, e.g. ext4_file_read_iter()->generic_file_read_iter(),
where heuristic readahead is also done.
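
(The mode of a given loop device can be checked and flipped from
userspace when comparing the two cases; the device name is a
placeholder:)

```
cat /sys/block/loop0/loop/dio        # 0 = buffered, 1 = direct I/O
losetup --direct-io=on /dev/loop0    # switch an existing loop device to direct I/O
```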

-- 
Thanks,
Jingbo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-03 17:37     ` Gao Xiang
@ 2023-03-04 14:59       ` Colin Walters
  2023-03-04 15:29         ` Gao Xiang
  0 siblings, 1 reply; 42+ messages in thread
From: Colin Walters @ 2023-03-04 14:59 UTC (permalink / raw)
  To: Gao Xiang, Alexander Larsson, lsf-pc
  Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jingbo Xu,
	Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi



On Fri, Mar 3, 2023, at 12:37 PM, Gao Xiang wrote:
> 
> Actually since you're container guys, I would like to mention
> a way to directly reuse OCI tar data and not sure if you
> have some interest as well, that is just to generate EROFS
> metadata which could point to the tar blobs so that data itself
> is still the original tar, but we could add fsverity + IMMUTABLE
> to these blobs rather than the individual untared files.

>   - OCI layer diff IDs in the OCI spec [1] are guaranteed;

The https://github.com/vbatts/tar-split approach addresses this problem domain adequately I think.

Correct me if I'm wrong, but having erofs point to underlying tar wouldn't by default get us page cache sharing or even the "opportunistic" disk sharing that composefs brings, unless userspace did something like attempting to dedup files in the tar stream via hashing and using reflinks on the underlying fs.  And then doing reflinks would require alignment inside the stream, right?  The https://fedoraproject.org/wiki/Changes/RPMCoW change is very similar in that it's proposing a modification of the RPM format to 4k align files in the stream for this reason.  But that's exactly it, then it's a new tweaked format and not identical to what came before, so the "compatibility" rationale is actually weakened a lot.




^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-04 14:59       ` Colin Walters
@ 2023-03-04 15:29         ` Gao Xiang
  2023-03-04 16:22           ` Gao Xiang
  2023-03-07  1:00           ` Colin Walters
  0 siblings, 2 replies; 42+ messages in thread
From: Gao Xiang @ 2023-03-04 15:29 UTC (permalink / raw)
  To: Colin Walters, Alexander Larsson, lsf-pc
  Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jingbo Xu,
	Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi

Hi Colin,

On 2023/3/4 22:59, Colin Walters wrote:
> 
> 
> On Fri, Mar 3, 2023, at 12:37 PM, Gao Xiang wrote:
>>
>> Actually since you're container guys, I would like to mention
>> a way to directly reuse OCI tar data and not sure if you
>> have some interest as well, that is just to generate EROFS
>> metadata which could point to the tar blobs so that data itself
>> is still the original tar, but we could add fsverity + IMMUTABLE
>> to these blobs rather than the individual untared files.
> 
>>    - OCI layer diff IDs in the OCI spec [1] are guaranteed;
> 
> The https://github.com/vbatts/tar-split approach addresses this problem domain adequately I think.

Thanks for the interest and comment.

I'm not aware of this project, and I'm not sure if tar-split
helps with mounting tar files; maybe I'm missing something?

As for EROFS, as long as we support subpage block sizes, it's
entirely possible to refer to the original tar data without any tar
stream modification.

> 
> Correct me if I'm wrong, but having erofs point to underlying tar wouldn't by default get us page cache sharing or even the "opportunistic" disk sharing that composefs brings, unless userspace did something like attempting to dedup files in the tar stream via hashing and using reflinks on the underlying fs.  And then doing reflinks would require alignment inside the stream, right?  The https://fedoraproject.org/wiki/Changes/RPMCoW change is very similar in that it's proposing a modification of the RPM format to 4k align files in the 

hmmm.. I don't think userspace needs to dedupe files in the
tar stream.

> stream for this reason.  But that's exactly it, then it's a new tweaked format and not identical to what came before, so the "compatibility" rationale is actually weakened a lot.
> 
>

As you said, "opportunistic" finer-grained disk sharing inside the tar
streams has to be resolved by reflinks or other mechanisms in the
underlying filesystems (like XFS, or virtual devices like device mapper).

That's not because EROFS cannot do on-disk dedupe, just because in this
way EROFS can only use the original tar blobs, so EROFS is not the right
place to resolve the on-disk sharing.  However, since the original tar
blob is used here, the tar stream data stays unchanged (with the same
diffID) while the container is running.

As a kernel filesystem, if two files are equal, we could treat them
as sharing the same inode address space, even if they actually have
slightly different inode metadata (uid, gid, mode, nlink, etc).  That is
entirely possible for an in-kernel filesystem even though the Linux
kernel currently doesn't implement finer-grained page cache sharing, so
EROFS could support page-cache sharing of files across tar streams if
needed.

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-04 15:29         ` Gao Xiang
@ 2023-03-04 16:22           ` Gao Xiang
  2023-03-07  1:00           ` Colin Walters
  1 sibling, 0 replies; 42+ messages in thread
From: Gao Xiang @ 2023-03-04 16:22 UTC (permalink / raw)
  To: Colin Walters, Alexander Larsson, lsf-pc
  Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jingbo Xu,
	Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi



On 2023/3/4 23:29, Gao Xiang wrote:
> Hi Colin,
> 
> On 2023/3/4 22:59, Colin Walters wrote:
>>
>>
>> On Fri, Mar 3, 2023, at 12:37 PM, Gao Xiang wrote:
>>>
>>> Actually since you're container guys, I would like to mention
>>> a way to directly reuse OCI tar data and not sure if you
>>> have some interest as well, that is just to generate EROFS
>>> metadata which could point to the tar blobs so that data itself
>>> is still the original tar, but we could add fsverity + IMMUTABLE
>>> to these blobs rather than the individual untared files.
>>
>>>    - OCI layer diff IDs in the OCI spec [1] are guaranteed;
>>
>> The https://github.com/vbatts/tar-split approach addresses this problem domain adequately I think.
> 
> Thanks for the interest and comment.
> 
> I'm not aware of this project, and I'm not sure if tar-split
> helps mount tar stuffs, maybe I'm missing something?
> 
> As for EROFS, as long as we support subpage block size, it's
> entirely possible to refer the original tar data without tar
> stream modification.
> 
>>
>> Correct me if I'm wrong, but having erofs point to underlying tar wouldn't by default get us page cache sharing or even the "opportunistic" disk sharing that composefs brings, unless userspace did something like attempting to dedup files in the tar stream via hashing and using reflinks on the underlying fs.  And then doing reflinks would require alignment inside the stream, right?  The https://fedoraproject.org/wiki/Changes/RPMCoW change is very similar in that it's proposing a modification of the RPM format to 4k align files in the 
> 
> hmmm.. I think userspace don't need to dedupe files in the
> tar stream.
> 
>> stream for this reason.  But that's exactly it, then it's a new tweaked format and not identical to what came before, so the "compatibility" rationale is actually weakened a lot.
>>
>>
> 
> As you said, "opportunistic" finer disk sharing inside all tar
> streams can be resolved by reflink or other stuffs by the underlay
> filesystems (like XFS, or virtual devices like device mapper).
> 
> Not bacause EROFS cannot do on-disk dedupe, just because in this
> way EROFS can only use the original tar blobs, and EROFS is not
> the guy to resolve the on-disk sharing stuff.  However, here since
> the original tar blob is used, so that the tar stream data is
> unchanged (with the same diffID) when the container is running.
> 
> As a kernel filesystem, if two files are equal, we could treat them
> in the same inode address space, even they are actually with slightly
> different inode metadata (uid, gid, mode, nlink, etc).  That is
> entirely possible as an in-kernel filesystem even currently linux
> kernel doesn't implement finer page cache sharing, so EROFS can
> support page-cache sharing of files in all tar streams if needed.

By the way, in case of misunderstanding, the currently workable ways
of Linux page cache sharing don't _strictly_ require that the real
inode be one and the same inode (which is what stackable filesystems
like overlayfs rely on); they only need the shared data to be laid
out consecutively in a single address space, which means:

   1) we could reuse the blob (tar stream) address space to share
      page cache, which is actually what Jingbo did for fscache
      page cache sharing:
      https://lore.kernel.org/r/20230203030143.73105-1-jefflexu@linux.alibaba.com

   2) we could create a virtual inode (or reuse the address space of
      one of the real inodes) to share data between real inodes.

Either way can provide page cache sharing between inodes with the
same data across different filesystems, and both are practical
without any extra linux-mm improvements.
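
For illustration, a deliberately simplified sketch of option 1): serving
the data of a "virtual" file straight out of the backing blob inode's page
cache, so every file that maps onto the same blob range shares the same
pages.  This is not real erofs code; the helper name, missing locking and
error handling are all made up:

#include <linux/pagemap.h>
#include <linux/highmem.h>
#include <linux/mm.h>

static int demo_read_from_blob(struct inode *blob_inode, loff_t blob_pos,
			       char *dst, size_t len)
{
	struct address_space *mapping = blob_inode->i_mapping;

	while (len) {
		pgoff_t index = blob_pos >> PAGE_SHIFT;
		size_t off = offset_in_page(blob_pos);
		size_t chunk = PAGE_SIZE - off;
		struct page *page;

		if (chunk > len)
			chunk = len;
		/* populates the blob's page cache (shared by all users) */
		page = read_mapping_page(mapping, index, NULL);
		if (IS_ERR(page))
			return PTR_ERR(page);
		memcpy_from_page(dst, page, off, chunk);
		put_page(page);
		dst += chunk;
		blob_pos += chunk;
		len -= chunk;
	}
	return 0;
}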

thanks,
Gao Xiang

> 
> Thanks,
> Gao Xiang

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-03 13:57 ` Alexander Larsson
  2023-03-03 15:13   ` Gao Xiang
  2023-03-04  0:46   ` Jingbo Xu
@ 2023-03-06 11:33   ` Alexander Larsson
  2023-03-06 12:15     ` Gao Xiang
  2023-03-06 15:49     ` Jingbo Xu
  2 siblings, 2 replies; 42+ messages in thread
From: Alexander Larsson @ 2023-03-06 11:33 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jingbo Xu,
	Gao Xiang, Giuseppe Scrivano, Dave Chinner, Vivek Goyal,
	Miklos Szeredi

On Fri, Mar 3, 2023 at 2:57 PM Alexander Larsson <alexl@redhat.com> wrote:
>
> On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote:
> >
> > Hello,
> >
> > Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
> > Composefs filesystem. It is an opportunistically sharing, validating
> > image-based filesystem, targeting usecases like validated ostree
> > rootfs:es, validated container images that share common files, as well
> > as other image based usecases.
> >
> > During the discussions in the composefs proposal (as seen on LWN[3])
> > is has been proposed that (with some changes to overlayfs), similar
> > behaviour can be achieved by combining the overlayfs
> > "overlay.redirect" xattr with an read-only filesystem such as erofs.
> >
> > There are pros and cons to both these approaches, and the discussion
> > about their respective value has sometimes been heated. We would like
> > to have an in-person discussion at the summit, ideally also involving
> > more of the filesystem development community, so that we can reach
> > some consensus on what is the best apporach.
>
> In order to better understand the behaviour and requirements of the
> overlayfs+erofs approach I spent some time implementing direct support
> for erofs in libcomposefs. So, with current HEAD of
> github.com/containers/composefs you can now do:
>
> $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs
>
> This will produce an object store with the backing files, and a erofs
> file with the required overlayfs xattrs, including a made up one
> called "overlay.fs-verity" containing the expected fs-verity digest
> for the lower dir. It also adds the required whiteouts to cover the
> 00-ff dirs from the lower dir.
>
> These erofs files are ordered similarly to the composefs files, and we
> give similar guarantees about their reproducibility, etc. So, they
> should be apples-to-apples comparable with the composefs images.
>
> Given this, I ran another set of performance tests on the original cs9
> rootfs dataset, again measuring the time of `ls -lR`. I also tried to
> measure the memory use like this:
>
> # echo 3 > /proc/sys/vm/drop_caches
> # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat
> /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'
>
> These are the alternatives I tried:
>
> xfs: the source of the image, regular dir on xfs
> erofs: the image.erofs above, on loopback
> erofs dio: the image.erofs above, on loopback with --direct-io=on
> ovl: erofs above combined with overlayfs
> ovl dio: erofs dio above combined with overlayfs
> cfs: composefs mount of image.cfs
>
> All tests use the same objects dir, stored on xfs. The erofs and
> overlay implementations are from a stock 6.1.13 kernel, and composefs
> module is from github HEAD.
>
> I tried loopback both with and without the direct-io option, because
> without direct-io enabled the kernel will double-cache the loopbacked
> data, as per[1].
>
> The produced images are:
>  8.9M image.cfs
> 11.3M image.erofs
>
> And gives these results:
>            | Cold cache | Warm cache | Mem use
>            |   (msec)   |   (msec)   |  (mb)
> -----------+------------+------------+---------
> xfs        |   1449     |    442     |    54
> erofs      |    700     |    391     |    45
> erofs dio  |    939     |    400     |    45
> ovl        |   1827     |    530     |   130
> ovl dio    |   2156     |    531     |   130
> cfs        |    689     |    389     |    51

It has been noted that the readahead done by kernel_read() may cause
read-ahead of unrelated data into memory which skews the results in
favour of workloads that consume all the filesystem metadata (such as
the ls -lR usecase of the above test). In the table above this favours
composefs (which uses kernel_read in some codepaths) as well as
non-dio erofs (non-dio loopback device uses readahead too).

I updated composefs to not use kernel_read here:
  https://github.com/containers/composefs/pull/105

And a new kernel patch-set based on this is available at:
  https://github.com/alexlarsson/linux/tree/composefs

The resulting table is now (dropping the non-dio erofs):

           | Cold cache | Warm cache | Mem use
           |   (msec)   |   (msec)   |  (mb)
-----------+------------+------------+---------
xfs        |   1449     |    442     |   54
erofs dio  |    939     |    400     |   45
ovl dio    |   2156     |    531     |  130
cfs        |    833     |    398     |   51

           | Cold cache | Warm cache | Mem use
           |   (msec)   |   (msec)   |  (mb)
-----------+------------+------------+---------
ext4       |   1135     |    394     |   54
erofs dio  |    922     |    401     |   45
ovl dio    |   1810     |    532     |  149
ovl lazy   |   1063     |    523     |  87
cfs        |    768     |    459     |  51

So, while cfs is somewhat worse now for this particular usecase, my
overall analysis still stands.

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                Red Hat, Inc
       alexl@redhat.com         alexander.larsson@gmail.com


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-06 11:33   ` Alexander Larsson
@ 2023-03-06 12:15     ` Gao Xiang
  2023-03-06 15:49     ` Jingbo Xu
  1 sibling, 0 replies; 42+ messages in thread
From: Gao Xiang @ 2023-03-06 12:15 UTC (permalink / raw)
  To: Alexander Larsson, lsf-pc
  Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jingbo Xu,
	Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi



On 2023/3/6 19:33, Alexander Larsson wrote:
> On Fri, Mar 3, 2023 at 2:57 PM Alexander Larsson <alexl@redhat.com> wrote:
>>
>> On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote:
>>>
>>> Hello,
>>>
>>> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
>>> Composefs filesystem. It is an opportunistically sharing, validating
>>> image-based filesystem, targeting usecases like validated ostree
>>> rootfs:es, validated container images that share common files, as well
>>> as other image based usecases.
>>>
>>> During the discussions in the composefs proposal (as seen on LWN[3])
>>> is has been proposed that (with some changes to overlayfs), similar
>>> behaviour can be achieved by combining the overlayfs
>>> "overlay.redirect" xattr with an read-only filesystem such as erofs.
>>>
>>> There are pros and cons to both these approaches, and the discussion
>>> about their respective value has sometimes been heated. We would like
>>> to have an in-person discussion at the summit, ideally also involving
>>> more of the filesystem development community, so that we can reach
>>> some consensus on what is the best apporach.
>>
>> In order to better understand the behaviour and requirements of the
>> overlayfs+erofs approach I spent some time implementing direct support
>> for erofs in libcomposefs. So, with current HEAD of
>> github.com/containers/composefs you can now do:
>>
>> $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs
>>
>> This will produce an object store with the backing files, and a erofs
>> file with the required overlayfs xattrs, including a made up one
>> called "overlay.fs-verity" containing the expected fs-verity digest
>> for the lower dir. It also adds the required whiteouts to cover the
>> 00-ff dirs from the lower dir.
>>
>> These erofs files are ordered similarly to the composefs files, and we
>> give similar guarantees about their reproducibility, etc. So, they
>> should be apples-to-apples comparable with the composefs images.
>>
>> Given this, I ran another set of performance tests on the original cs9
>> rootfs dataset, again measuring the time of `ls -lR`. I also tried to
>> measure the memory use like this:
>>
>> # echo 3 > /proc/sys/vm/drop_caches
>> # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat
>> /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'
>>
>> These are the alternatives I tried:
>>
>> xfs: the source of the image, regular dir on xfs
>> erofs: the image.erofs above, on loopback
>> erofs dio: the image.erofs above, on loopback with --direct-io=on
>> ovl: erofs above combined with overlayfs
>> ovl dio: erofs dio above combined with overlayfs
>> cfs: composefs mount of image.cfs
>>
>> All tests use the same objects dir, stored on xfs. The erofs and
>> overlay implementations are from a stock 6.1.13 kernel, and composefs
>> module is from github HEAD.
>>
>> I tried loopback both with and without the direct-io option, because
>> without direct-io enabled the kernel will double-cache the loopbacked
>> data, as per[1].
>>
>> The produced images are:
>>   8.9M image.cfs
>> 11.3M image.erofs
>>
>> And gives these results:
>>             | Cold cache | Warm cache | Mem use
>>             |   (msec)   |   (msec)   |  (mb)
>> -----------+------------+------------+---------
>> xfs        |   1449     |    442     |    54
>> erofs      |    700     |    391     |    45
>> erofs dio  |    939     |    400     |    45
>> ovl        |   1827     |    530     |   130
>> ovl dio    |   2156     |    531     |   130
>> cfs        |    689     |    389     |    51
> 
> It has been noted that the readahead done by kernel_read() may cause
> read-ahead of unrelated data into memory which skews the results in
> favour of workloads that consume all the filesystem metadata (such as
> the ls -lR usecase of the above test). In the table above this favours
> composefs (which uses kernel_read in some codepaths) as well as
> non-dio erofs (non-dio loopback device uses readahead too).
> 
> I updated composefs to not use kernel_read here:
>    https://github.com/containers/composefs/pull/105
> 
> And a new kernel patch-set based on this is available at:
>    https://github.com/alexlarsson/linux/tree/composefs
> 
> The resulting table is now (dropping the non-dio erofs):
> 
>             | Cold cache | Warm cache | Mem use
>             |   (msec)   |   (msec)   |  (mb)
> -----------+------------+------------+---------
> xfs        |   1449     |    442     |   54
> erofs dio  |    939     |    400     |   45
> ovl dio    |   2156     |    531     |  130
> cfs        |    833     |    398     |   51
> 
>             | Cold cache | Warm cache | Mem use
>             |   (msec)   |   (msec)   |  (mb)
> -----------+------------+------------+---------
> ext4       |   1135     |    394     |   54
> erofs dio  |    922     |    401     |   45
> ovl dio    |   1810     |    532     |  149
> ovl lazy   |   1063     |    523     |  87
> cfs        |    768     |    459     |  51
> 
> So, while cfs is somewhat worse now for this particular usecase, my
> overall analysis still stands.

We will investigate it later.  Also, you might still need to test some
random workloads other than "ls -lR" (such as stat-ing ~1000 files
randomly [1]) rather than completely ignoring my and Jingbo's comments,
or at least explain why "ls -lR" is the only benchmark that counts on
your side.
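
For reference, a rough sketch of such a random-stat microbenchmark (the
file list comes from stdin and the 1000-file cap is arbitrary); it could
be fed with something like "find mountpoint -type f | ./randstat":

/* Sketch: read a newline-separated file list from stdin, shuffle it,
 * and time lstat() on the first 1000 entries. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>

int main(void)
{
	static char *paths[1u << 20];
	size_t n = 0, i;
	char line[4096];
	struct timespec t0, t1;

	while (n < (1u << 20) && fgets(line, sizeof(line), stdin)) {
		line[strcspn(line, "\n")] = '\0';
		paths[n++] = strdup(line);
	}
	srand(time(NULL));
	for (i = n; i > 1; i--) {		/* Fisher-Yates shuffle */
		size_t j = (size_t)rand() % i;
		char *tmp = paths[i - 1];

		paths[i - 1] = paths[j];
		paths[j] = tmp;
	}
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < n && i < 1000; i++) {
		struct stat st;

		lstat(paths[i], &st);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("%ld us\n", (t1.tv_sec - t0.tv_sec) * 1000000L +
			   (t1.tv_nsec - t0.tv_nsec) / 1000L);
	return 0;
}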

My point is simple.  If you see a chance to get EROFS improved in
some respects, we do hope we can improve your "ls -lR" case as much
as possible without hurting random access.  Or, if you'd rather
upstream a new file-based stackable filesystem for this ostree-specific
use case for whatever KPIs, I don't think we can reach any conclusion
here, and I cannot be of any help to you since I'm not the one to
decide that.

You're addressing a very specific workload ("ls -lR"), and EROFS as
well as EROFS + overlayfs doesn't perform that badly there compared
with Composefs, even without further tuning and even though EROFS
doesn't directly use file-based interfaces.

Thanks,
Gao Xiang

[1] https://lore.kernel.org/r/83829005-3f12-afac-9d05-8ba721a80b4d@linux.alibaba.com

> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-06 11:33   ` Alexander Larsson
  2023-03-06 12:15     ` Gao Xiang
@ 2023-03-06 15:49     ` Jingbo Xu
  2023-03-06 16:09       ` Alexander Larsson
  2023-03-07 10:00       ` Jingbo Xu
  1 sibling, 2 replies; 42+ messages in thread
From: Jingbo Xu @ 2023-03-06 15:49 UTC (permalink / raw)
  To: Alexander Larsson, lsf-pc
  Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Gao Xiang,
	Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi



On 3/6/23 7:33 PM, Alexander Larsson wrote:
> On Fri, Mar 3, 2023 at 2:57 PM Alexander Larsson <alexl@redhat.com> wrote:
>>
>> On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote:
>>>
>>> Hello,
>>>
>>> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
>>> Composefs filesystem. It is an opportunistically sharing, validating
>>> image-based filesystem, targeting usecases like validated ostree
>>> rootfs:es, validated container images that share common files, as well
>>> as other image based usecases.
>>>
>>> During the discussions in the composefs proposal (as seen on LWN[3])
>>> is has been proposed that (with some changes to overlayfs), similar
>>> behaviour can be achieved by combining the overlayfs
>>> "overlay.redirect" xattr with an read-only filesystem such as erofs.
>>>
>>> There are pros and cons to both these approaches, and the discussion
>>> about their respective value has sometimes been heated. We would like
>>> to have an in-person discussion at the summit, ideally also involving
>>> more of the filesystem development community, so that we can reach
>>> some consensus on what is the best apporach.
>>
>> In order to better understand the behaviour and requirements of the
>> overlayfs+erofs approach I spent some time implementing direct support
>> for erofs in libcomposefs. So, with current HEAD of
>> github.com/containers/composefs you can now do:
>>
>> $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs
>>
>> This will produce an object store with the backing files, and a erofs
>> file with the required overlayfs xattrs, including a made up one
>> called "overlay.fs-verity" containing the expected fs-verity digest
>> for the lower dir. It also adds the required whiteouts to cover the
>> 00-ff dirs from the lower dir.
>>
>> These erofs files are ordered similarly to the composefs files, and we
>> give similar guarantees about their reproducibility, etc. So, they
>> should be apples-to-apples comparable with the composefs images.
>>
>> Given this, I ran another set of performance tests on the original cs9
>> rootfs dataset, again measuring the time of `ls -lR`. I also tried to
>> measure the memory use like this:
>>
>> # echo 3 > /proc/sys/vm/drop_caches
>> # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat
>> /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'
>>
>> These are the alternatives I tried:
>>
>> xfs: the source of the image, regular dir on xfs
>> erofs: the image.erofs above, on loopback
>> erofs dio: the image.erofs above, on loopback with --direct-io=on
>> ovl: erofs above combined with overlayfs
>> ovl dio: erofs dio above combined with overlayfs
>> cfs: composefs mount of image.cfs
>>
>> All tests use the same objects dir, stored on xfs. The erofs and
>> overlay implementations are from a stock 6.1.13 kernel, and composefs
>> module is from github HEAD.
>>
>> I tried loopback both with and without the direct-io option, because
>> without direct-io enabled the kernel will double-cache the loopbacked
>> data, as per[1].
>>
>> The produced images are:
>>  8.9M image.cfs
>> 11.3M image.erofs
>>
>> And gives these results:
>>            | Cold cache | Warm cache | Mem use
>>            |   (msec)   |   (msec)   |  (mb)
>> -----------+------------+------------+---------
>> xfs        |   1449     |    442     |    54
>> erofs      |    700     |    391     |    45
>> erofs dio  |    939     |    400     |    45
>> ovl        |   1827     |    530     |   130
>> ovl dio    |   2156     |    531     |   130
>> cfs        |    689     |    389     |    51
> 
> It has been noted that the readahead done by kernel_read() may cause
> read-ahead of unrelated data into memory which skews the results in
> favour of workloads that consume all the filesystem metadata (such as
> the ls -lR usecase of the above test). In the table above this favours
> composefs (which uses kernel_read in some codepaths) as well as
> non-dio erofs (non-dio loopback device uses readahead too).
> 
> I updated composefs to not use kernel_read here:
>   https://github.com/containers/composefs/pull/105
> 
> And a new kernel patch-set based on this is available at:
>   https://github.com/alexlarsson/linux/tree/composefs
> 
> The resulting table is now (dropping the non-dio erofs):
> 
>            | Cold cache | Warm cache | Mem use
>            |   (msec)   |   (msec)   |  (mb)
> -----------+------------+------------+---------
> xfs        |   1449     |    442     |   54
> erofs dio  |    939     |    400     |   45
> ovl dio    |   2156     |    531     |  130
> cfs        |    833     |    398     |   51
> 
>            | Cold cache | Warm cache | Mem use
>            |   (msec)   |   (msec)   |  (mb)
> -----------+------------+------------+---------
> ext4       |   1135     |    394     |   54
> erofs dio  |    922     |    401     |   45
> ovl dio    |   1810     |    532     |  149
> ovl lazy   |   1063     |    523     |  87
> cfs        |    768     |    459     |  51
> 
> So, while cfs is somewhat worse now for this particular usecase, my
> overall analysis still stands.
> 

Hi,

I tested your patch removing kernel_read(), and here is the statistics
tested in my environment.


Setup
======
CPU: x86_64 Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
Disk: cloud disk, 11800 IOPS upper limit
OS: Linux v6.2
FS of backing objects: xfs


Image size
===========
8.6M large.composefs (with --compute-digest)
8.9M large.erofs (mkfs.erofs)
11M  large.cps.in.erofs (mkfs.composefs --compute-digest --format=erofs)


Perf of "ls -lR"
================
                                                   | uncached | cached
                                                   |   (ms)   |  (ms)
---------------------------------------------------|----------|--------
composefs                                          |   519    |  178
erofs (mkfs.erofs, DIRECT loop)                    |   497    |  192
erofs (mkfs.composefs --format=erofs, DIRECT loop) |   536    |  199

I tested the performance of "ls -lR" on the whole tree of
cs9-developer-rootfs.  It seems that the performance of erofs (generated
from mkfs.erofs) is slightly better than that of composefs, while the
performance of erofs generated from mkfs.composefs is slightly worse
than that of composefs.

The uncached numbers differ somewhat from those given by Alexander
Larsson.  I think it may be due to the different test environment, as
my test machine is a server with robust performance and cloud-disk
storage.

It's just a simple test without further analysis, as it's a bit late for
me :)



-- 
Thanks,
Jingbo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-06 15:49     ` Jingbo Xu
@ 2023-03-06 16:09       ` Alexander Larsson
  2023-03-06 16:17         ` Gao Xiang
  2023-03-07 10:00       ` Jingbo Xu
  1 sibling, 1 reply; 42+ messages in thread
From: Alexander Larsson @ 2023-03-06 16:09 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: lsf-pc, linux-fsdevel, Amir Goldstein, Christian Brauner,
	Gao Xiang, Giuseppe Scrivano, Dave Chinner, Vivek Goyal,
	Miklos Szeredi

On Mon, Mar 6, 2023 at 4:49 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> On 3/6/23 7:33 PM, Alexander Larsson wrote:
> > On Fri, Mar 3, 2023 at 2:57 PM Alexander Larsson <alexl@redhat.com> wrote:
> >>
> >> On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote:
> >>>
> >>> Hello,
> >>>
> >>> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
> >>> Composefs filesystem. It is an opportunistically sharing, validating
> >>> image-based filesystem, targeting usecases like validated ostree
> >>> rootfs:es, validated container images that share common files, as well
> >>> as other image based usecases.
> >>>
> >>> During the discussions in the composefs proposal (as seen on LWN[3])
> >>> is has been proposed that (with some changes to overlayfs), similar
> >>> behaviour can be achieved by combining the overlayfs
> >>> "overlay.redirect" xattr with an read-only filesystem such as erofs.
> >>>
> >>> There are pros and cons to both these approaches, and the discussion
> >>> about their respective value has sometimes been heated. We would like
> >>> to have an in-person discussion at the summit, ideally also involving
> >>> more of the filesystem development community, so that we can reach
> >>> some consensus on what is the best apporach.
> >>
> >> In order to better understand the behaviour and requirements of the
> >> overlayfs+erofs approach I spent some time implementing direct support
> >> for erofs in libcomposefs. So, with current HEAD of
> >> github.com/containers/composefs you can now do:
> >>
> >> $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs
> >>
> >> This will produce an object store with the backing files, and a erofs
> >> file with the required overlayfs xattrs, including a made up one
> >> called "overlay.fs-verity" containing the expected fs-verity digest
> >> for the lower dir. It also adds the required whiteouts to cover the
> >> 00-ff dirs from the lower dir.
> >>
> >> These erofs files are ordered similarly to the composefs files, and we
> >> give similar guarantees about their reproducibility, etc. So, they
> >> should be apples-to-apples comparable with the composefs images.
> >>
> >> Given this, I ran another set of performance tests on the original cs9
> >> rootfs dataset, again measuring the time of `ls -lR`. I also tried to
> >> measure the memory use like this:
> >>
> >> # echo 3 > /proc/sys/vm/drop_caches
> >> # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat
> >> /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'
> >>
> >> These are the alternatives I tried:
> >>
> >> xfs: the source of the image, regular dir on xfs
> >> erofs: the image.erofs above, on loopback
> >> erofs dio: the image.erofs above, on loopback with --direct-io=on
> >> ovl: erofs above combined with overlayfs
> >> ovl dio: erofs dio above combined with overlayfs
> >> cfs: composefs mount of image.cfs
> >>
> >> All tests use the same objects dir, stored on xfs. The erofs and
> >> overlay implementations are from a stock 6.1.13 kernel, and composefs
> >> module is from github HEAD.
> >>
> >> I tried loopback both with and without the direct-io option, because
> >> without direct-io enabled the kernel will double-cache the loopbacked
> >> data, as per[1].
> >>
> >> The produced images are:
> >>  8.9M image.cfs
> >> 11.3M image.erofs
> >>
> >> And gives these results:
> >>            | Cold cache | Warm cache | Mem use
> >>            |   (msec)   |   (msec)   |  (mb)
> >> -----------+------------+------------+---------
> >> xfs        |   1449     |    442     |    54
> >> erofs      |    700     |    391     |    45
> >> erofs dio  |    939     |    400     |    45
> >> ovl        |   1827     |    530     |   130
> >> ovl dio    |   2156     |    531     |   130
> >> cfs        |    689     |    389     |    51
> >
> > It has been noted that the readahead done by kernel_read() may cause
> > read-ahead of unrelated data into memory which skews the results in
> > favour of workloads that consume all the filesystem metadata (such as
> > the ls -lR usecase of the above test). In the table above this favours
> > composefs (which uses kernel_read in some codepaths) as well as
> > non-dio erofs (non-dio loopback device uses readahead too).
> >
> > I updated composefs to not use kernel_read here:
> >   https://github.com/containers/composefs/pull/105
> >
> > And a new kernel patch-set based on this is available at:
> >   https://github.com/alexlarsson/linux/tree/composefs
> >
> > The resulting table is now (dropping the non-dio erofs):
> >
> >            | Cold cache | Warm cache | Mem use
> >            |   (msec)   |   (msec)   |  (mb)
> > -----------+------------+------------+---------
> > xfs        |   1449     |    442     |   54
> > erofs dio  |    939     |    400     |   45
> > ovl dio    |   2156     |    531     |  130
> > cfs        |    833     |    398     |   51
> >
> >            | Cold cache | Warm cache | Mem use
> >            |   (msec)   |   (msec)   |  (mb)
> > -----------+------------+------------+---------
> > ext4       |   1135     |    394     |   54
> > erofs dio  |    922     |    401     |   45
> > ovl dio    |   1810     |    532     |  149
> > ovl lazy   |   1063     |    523     |  87
> > cfs        |    768     |    459     |  51
> >
> > So, while cfs is somewhat worse now for this particular usecase, my
> > overall analysis still stands.
> >
>
> Hi,
>
> I tested your patch removing kernel_read(), and here is the statistics
> tested in my environment.
>
>
> Setup
> ======
> CPU: x86_64 Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
> Disk: cloud disk, 11800 IOPS upper limit
> OS: Linux v6.2
> FS of backing objects: xfs
>
>
> Image size
> ===========
> 8.6M large.composefs (with --compute-digest)
> 8.9M large.erofs (mkfs.erofs)
> 11M  large.cps.in.erofs (mkfs.composefs --compute-digest --format=erofs)
>
>
> Perf of "ls -lR"
> ================
>                                               | uncached| cached
>                                               |  (ms)   |  (ms)
> ----------------------------------------------|---------|--------
> composefs                                          | 519        | 178
> erofs (mkfs.erofs, DIRECT loop)                    | 497        | 192
> erofs (mkfs.composefs --format=erofs, DIRECT loop) | 536        | 199
>
> I tested the performance of "ls -lR" on the whole tree of
> cs9-developer-rootfs.  It seems that the performance of erofs (generated
> from mkfs.erofs) is slightly better than that of composefs.  While the
> performance of erofs generated from mkfs.composefs is slightly worse
> that that of composefs.

I suspect that the reason for the lower performance of the
mkfs.composefs image is the overlay.fs-verity xattr added to all the
files. It makes the image larger, and that means more I/O.

> The uncached performance is somewhat slightly different with that given
> by Alexander Larsson.  I think it may be due to different test
> environment, as my test machine is a server with robust performance,
> with cloud disk as storage.
>
> It's just a simple test without further analysis, as it's a bit late for
> me :)

Yeah, and for the record, I'm not claiming that my tests contain any
high degree of analysis or rigour either. They are short simple test
runs that give a rough estimate of the overall performance of metadata
operations. What is interesting here is if there are large or
unexpected differences, and from that point of view our results are
basically the same.

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                Red Hat, Inc
       alexl@redhat.com         alexander.larsson@gmail.com


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-06 16:09       ` Alexander Larsson
@ 2023-03-06 16:17         ` Gao Xiang
  2023-03-07  8:21           ` Alexander Larsson
  0 siblings, 1 reply; 42+ messages in thread
From: Gao Xiang @ 2023-03-06 16:17 UTC (permalink / raw)
  To: Alexander Larsson, Jingbo Xu
  Cc: lsf-pc, linux-fsdevel, Amir Goldstein, Christian Brauner,
	Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi



On 2023/3/7 00:09, Alexander Larsson wrote:
> On Mon, Mar 6, 2023 at 4:49 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>> On 3/6/23 7:33 PM, Alexander Larsson wrote:
>>> On Fri, Mar 3, 2023 at 2:57 PM Alexander Larsson <alexl@redhat.com> wrote:
>>>>
>>>> On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
>>>>> Composefs filesystem. It is an opportunistically sharing, validating
>>>>> image-based filesystem, targeting usecases like validated ostree
>>>>> rootfs:es, validated container images that share common files, as well
>>>>> as other image based usecases.
>>>>>
>>>>> During the discussions in the composefs proposal (as seen on LWN[3])
>>>>> is has been proposed that (with some changes to overlayfs), similar
>>>>> behaviour can be achieved by combining the overlayfs
>>>>> "overlay.redirect" xattr with an read-only filesystem such as erofs.
>>>>>
>>>>> There are pros and cons to both these approaches, and the discussion
>>>>> about their respective value has sometimes been heated. We would like
>>>>> to have an in-person discussion at the summit, ideally also involving
>>>>> more of the filesystem development community, so that we can reach
>>>>> some consensus on what is the best apporach.
>>>>
>>>> In order to better understand the behaviour and requirements of the
>>>> overlayfs+erofs approach I spent some time implementing direct support
>>>> for erofs in libcomposefs. So, with current HEAD of
>>>> github.com/containers/composefs you can now do:
>>>>
>>>> $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs
>>>>
>>>> This will produce an object store with the backing files, and a erofs
>>>> file with the required overlayfs xattrs, including a made up one
>>>> called "overlay.fs-verity" containing the expected fs-verity digest
>>>> for the lower dir. It also adds the required whiteouts to cover the
>>>> 00-ff dirs from the lower dir.
>>>>
>>>> These erofs files are ordered similarly to the composefs files, and we
>>>> give similar guarantees about their reproducibility, etc. So, they
>>>> should be apples-to-apples comparable with the composefs images.
>>>>
>>>> Given this, I ran another set of performance tests on the original cs9
>>>> rootfs dataset, again measuring the time of `ls -lR`. I also tried to
>>>> measure the memory use like this:
>>>>
>>>> # echo 3 > /proc/sys/vm/drop_caches
>>>> # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat
>>>> /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'
>>>>
>>>> These are the alternatives I tried:
>>>>
>>>> xfs: the source of the image, regular dir on xfs
>>>> erofs: the image.erofs above, on loopback
>>>> erofs dio: the image.erofs above, on loopback with --direct-io=on
>>>> ovl: erofs above combined with overlayfs
>>>> ovl dio: erofs dio above combined with overlayfs
>>>> cfs: composefs mount of image.cfs
>>>>
>>>> All tests use the same objects dir, stored on xfs. The erofs and
>>>> overlay implementations are from a stock 6.1.13 kernel, and composefs
>>>> module is from github HEAD.
>>>>
>>>> I tried loopback both with and without the direct-io option, because
>>>> without direct-io enabled the kernel will double-cache the loopbacked
>>>> data, as per[1].
>>>>
>>>> The produced images are:
>>>>   8.9M image.cfs
>>>> 11.3M image.erofs
>>>>
>>>> And gives these results:
>>>>             | Cold cache | Warm cache | Mem use
>>>>             |   (msec)   |   (msec)   |  (mb)
>>>> -----------+------------+------------+---------
>>>> xfs        |   1449     |    442     |    54
>>>> erofs      |    700     |    391     |    45
>>>> erofs dio  |    939     |    400     |    45
>>>> ovl        |   1827     |    530     |   130
>>>> ovl dio    |   2156     |    531     |   130
>>>> cfs        |    689     |    389     |    51
>>>
>>> It has been noted that the readahead done by kernel_read() may cause
>>> read-ahead of unrelated data into memory which skews the results in
>>> favour of workloads that consume all the filesystem metadata (such as
>>> the ls -lR usecase of the above test). In the table above this favours
>>> composefs (which uses kernel_read in some codepaths) as well as
>>> non-dio erofs (non-dio loopback device uses readahead too).
>>>
>>> I updated composefs to not use kernel_read here:
>>>    https://github.com/containers/composefs/pull/105
>>>
>>> And a new kernel patch-set based on this is available at:
>>>    https://github.com/alexlarsson/linux/tree/composefs
>>>
>>> The resulting table is now (dropping the non-dio erofs):
>>>
>>>             | Cold cache | Warm cache | Mem use
>>>             |   (msec)   |   (msec)   |  (mb)
>>> -----------+------------+------------+---------
>>> xfs        |   1449     |    442     |   54
>>> erofs dio  |    939     |    400     |   45
>>> ovl dio    |   2156     |    531     |  130
>>> cfs        |    833     |    398     |   51
>>>
>>>             | Cold cache | Warm cache | Mem use
>>>             |   (msec)   |   (msec)   |  (mb)
>>> -----------+------------+------------+---------
>>> ext4       |   1135     |    394     |   54
>>> erofs dio  |    922     |    401     |   45
>>> ovl dio    |   1810     |    532     |  149
>>> ovl lazy   |   1063     |    523     |  87
>>> cfs        |    768     |    459     |  51
>>>
>>> So, while cfs is somewhat worse now for this particular usecase, my
>>> overall analysis still stands.
>>>
>>
>> Hi,
>>
>> I tested your patch removing kernel_read(), and here is the statistics
>> tested in my environment.
>>
>>
>> Setup
>> ======
>> CPU: x86_64 Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
>> Disk: cloud disk, 11800 IOPS upper limit
>> OS: Linux v6.2
>> FS of backing objects: xfs
>>
>>
>> Image size
>> ===========
>> 8.6M large.composefs (with --compute-digest)
>> 8.9M large.erofs (mkfs.erofs)
>> 11M  large.cps.in.erofs (mkfs.composefs --compute-digest --format=erofs)
>>
>>
>> Perf of "ls -lR"
>> ================
>>                                                | uncached| cached
>>                                                |  (ms)   |  (ms)
>> ----------------------------------------------|---------|--------
>> composefs                                          | 519        | 178
>> erofs (mkfs.erofs, DIRECT loop)                    | 497        | 192
>> erofs (mkfs.composefs --format=erofs, DIRECT loop) | 536        | 199
>>
>> I tested the performance of "ls -lR" on the whole tree of
>> cs9-developer-rootfs.  It seems that the performance of erofs (generated
>> from mkfs.erofs) is slightly better than that of composefs.  While the
>> performance of erofs generated from mkfs.composefs is slightly worse
>> that that of composefs.
> 
> I suspect that the reason for the lower performance of mkfs.composefs
> is the added overlay.fs-verity xattr to all the files. It makes the
> image larger, and that means more i/o.

Actually, you could move overlay.fs-verity into the EROFS shared xattr
area (or even overlay.redirect, though that depends) if needed, which
could save some I/O for your workloads.

Shared xattrs can be used in this way as well if you care about such a
minor difference; actually, I think inline xattrs for your workload are
only really meaningful for selinux labels and capabilities.
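
To put a rough number on that, a tiny userspace sketch of the per-inode
cost, assuming the entry layout from erofs_fs.h (a 4-byte
erofs_xattr_entry header, 4-byte entry alignment, 4-byte shared xattr
ids) and a raw 32-byte sha256 value for the made-up overlay.fs-verity
xattr:

#include <stdio.h>

/* Approximate bytes one xattr costs when stored inline in the inode:
 * 4-byte entry header + name + value, rounded up to 4 bytes. */
static unsigned int inline_xattr_bytes(unsigned int name_len,
				       unsigned int value_len)
{
	unsigned int sz = 4 + name_len + value_len;

	return (sz + 3) & ~3u;
}

int main(void)
{
	/* "trusted." is expressed via e_name_index, so only the rest of the
	 * name is stored; the value is assumed to be a raw 32-byte sha256. */
	unsigned int inline_cost =
		inline_xattr_bytes(sizeof("overlay.fs-verity") - 1, 32);

	printf("inline: %u bytes per inode, shared ref: 4 bytes per inode\n",
	       inline_cost);
	return 0;
}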

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-04 15:29         ` Gao Xiang
  2023-03-04 16:22           ` Gao Xiang
@ 2023-03-07  1:00           ` Colin Walters
  2023-03-07  3:10             ` Gao Xiang
  1 sibling, 1 reply; 42+ messages in thread
From: Colin Walters @ 2023-03-07  1:00 UTC (permalink / raw)
  To: Gao Xiang, Alexander Larsson, lsf-pc
  Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jingbo Xu,
	Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi



On Sat, Mar 4, 2023, at 10:29 AM, Gao Xiang wrote:
> Hi Colin,
>
> On 2023/3/4 22:59, Colin Walters wrote:
>> 
>> 
>> On Fri, Mar 3, 2023, at 12:37 PM, Gao Xiang wrote:
>>>
>>> Actually since you're container guys, I would like to mention
>>> a way to directly reuse OCI tar data and not sure if you
>>> have some interest as well, that is just to generate EROFS
>>> metadata which could point to the tar blobs so that data itself
>>> is still the original tar, but we could add fsverity + IMMUTABLE
>>> to these blobs rather than the individual untared files.
>> 
>>>    - OCI layer diff IDs in the OCI spec [1] are guaranteed;
>> 
>> The https://github.com/vbatts/tar-split approach addresses this problem domain adequately I think.
>
> Thanks for the interest and comment.
>
> I'm not aware of this project, and I'm not sure if tar-split
> helps mount tar stuffs, maybe I'm missing something?

Not directly; it's widely used in the container ecosystem (podman/docker etc.) to split off the original bit-for-bit tar stream metadata content from the actually large data (particularly regular files) so that one can write the files to a regular underlying fs (xfs/ext4/etc.) and use overlayfs on top.   Then it helps reverse the process and reconstruct the original tar stream for pushes, for exactly the reason you mention.

Slightly OT but a whole reason we're having this conversation now is definitely rooted in the original Docker inventor having the idea of *deriving* or layering on top of previous images, which is not part of dpkg/rpm or squashfs or raw disk images etc.  Inherent in this is the idea that we're not talking about *a* filesystem - we're talking about filesystem*s* plural and how they're wired together and stacked.

It's really only very simplistic use cases for which a single read-only filesystem suffices.  They exist - e.g. people booting things like Tails OS https://tails.boum.org/ on one of those USB sticks with a physical write protection switch, etc. 

But that approach makes every OS update very expensive - most use cases really want fast and efficient incremental in-place OS updates and a clear distinct split between OS filesystem and app filesystems.   But without also forcing separate size management onto both.

> Not bacause EROFS cannot do on-disk dedupe, just because in this
> way EROFS can only use the original tar blobs, and EROFS is not
> the guy to resolve the on-disk sharing stuff.  

Right, agree; this ties into my larger point above that no one technology/filesystem is the sole solution in the general case.

> As a kernel filesystem, if two files are equal, we could treat them
> in the same inode address space, even they are actually with slightly
> different inode metadata (uid, gid, mode, nlink, etc).  That is
> entirely possible as an in-kernel filesystem even currently linux
> kernel doesn't implement finer page cache sharing, so EROFS can
> support page-cache sharing of files in all tar streams if needed.

Hmmm.  I should clarify here I have zero kernel patches, I'm a userspace developer (on container and OS updates, for which I'd like a unified stack).  But it seems to me that while you're right that it would be technically possible for a single filesystem to do this, in practice it would require some sort of virtual sub-filesystem internally.  And at that point, it does seem more elegant to me to make that stacking explicit, more like how composefs is doing it.  

That said I think there's a lot of legitimate debate here, and I hope we can continue doing so productively!



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-07  1:00           ` Colin Walters
@ 2023-03-07  3:10             ` Gao Xiang
  0 siblings, 0 replies; 42+ messages in thread
From: Gao Xiang @ 2023-03-07  3:10 UTC (permalink / raw)
  To: Colin Walters, Alexander Larsson, lsf-pc
  Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Jingbo Xu,
	Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi



On 2023/3/7 09:00, Colin Walters wrote:
> 
> 
> On Sat, Mar 4, 2023, at 10:29 AM, Gao Xiang wrote:
>> Hi Colin,
>>
>> On 2023/3/4 22:59, Colin Walters wrote:
>>>
>>>
>>> On Fri, Mar 3, 2023, at 12:37 PM, Gao Xiang wrote:
>>>>
>>>> Actually since you're container guys, I would like to mention
>>>> a way to directly reuse OCI tar data and not sure if you
>>>> have some interest as well, that is just to generate EROFS
>>>> metadata which could point to the tar blobs so that data itself
>>>> is still the original tar, but we could add fsverity + IMMUTABLE
>>>> to these blobs rather than the individual untared files.
>>>
>>>>     - OCI layer diff IDs in the OCI spec [1] are guaranteed;
>>>
>>> The https://github.com/vbatts/tar-split approach addresses this problem domain adequately I think.
>>
>> Thanks for the interest and comment.
>>
>> I'm not aware of this project, and I'm not sure if tar-split
>> helps mount tar stuffs, maybe I'm missing something?
> 
> Not directly; it's widely used in the container ecosystem (podman/docker etc.) to split off the original bit-for-bit tar stream metadata content from the actually large data (particularly regular files) so that one can write the files to a regular underlying fs (xfs/ext4/etc.) and use overlayfs on top.   Then it helps reverse the process and reconstruct the original tar stream for pushes, for exactly the reason you mention.
> 
> Slightly OT but a whole reason we're having this conversation now is definitely rooted in the original Docker inventor having the idea of *deriving* or layering on top of previous images, which is not part of dpkg/rpm or squashfs or raw disk images etc.  Inherent in this is the idea that we're not talking about *a* filesystem - we're talking about filesystem*s* plural and how they're wired together and stacked.

Yes, as you said, if you think the actual OCI standard (or Docker,
whatever) is all about layering, then there is a possibility of
directly using the original layer for mounting without any conversion
(such as "untar", or converting to another blob format that could
support 4k reflink dedupe).

I believe this can avoid the untar time and the snapshot gc problems
that concern users, for example in our cloud with thousands of
containers launching/running/gcing at the same time.

> 
> It's really only very simplistic use cases for which a single read-only filesystem suffices.  They exist - e.g. people booting things like Tails OS https://tails.boum.org/ on one of those USB sticks with a physical write protection switch, etc.

I cannot access the website. If you need physical write
protection, then a read-only filesystem written to physical
media is needed.  The EROFS manifest can be placed on a raw
disk (for write protection and hardware integrity checks)
or on another local filesystem.  It depends on the actual
detailed requirement.

> 
> But that approach makes every OS update very expensive - most use cases really want fast and efficient incremental in-place OS updates and a clear distinct split between OS filesystem and app filesystems.   But without also forcing separate size management onto both.
> 
>> Not bacause EROFS cannot do on-disk dedupe, just because in this
>> way EROFS can only use the original tar blobs, and EROFS is not
>> the guy to resolve the on-disk sharing stuff.
> 
> Right, agree; this ties into my larger point above that no one technology/filesystem is the sole solution in the general case.

Anyway, if you consider an _untar_ approach, you could also
consider a conversion approach (as you said, padding to 4k).

Since the OCI standard is all about layering, you could
pad to 4k and then do data dedupe with:
   - the data blobs themselves (as recent projects like
     Nydus with EROFS do);
   - reflink-enabled filesystems (such as XFS or btrfs).

Untar behaves almost the same as the conversion approach,
except that the conversion approach doesn't produce massive
numbers of files/dirs on the underlying filesystem and then
gc those massive files/dirs again.

To be clear, since you are the original OSTree author, I'm not
promoting alternative approaches for you here.  I believe every
practical engineering project has advantages and disadvantages.
For example, even git is moving more and more toward a packed
object store, and I guess OSTree could also adopt some packed
format for efficient distribution, at least to some extent.

Here I just want to say that the on-disk EROFS format (like that
of other widely used kernel filesystems) is not designed only for
specific use cases such as OSTree, tar blobs or whatever, or for
specific media (block-based, file-based, etc.).

As far as I can see, EROFS+overlay has already supported the
OSTree composefs-like use cases for two years and has landed in
many distros, while other local kernel filesystems don't behave
particularly well under the "ls -lR" workload.

> 
>> As a kernel filesystem, if two files are equal, we could treat them
>> in the same inode address space, even they are actually with slightly
>> different inode metadata (uid, gid, mode, nlink, etc).  That is
>> entirely possible as an in-kernel filesystem even currently linux
>> kernel doesn't implement finer page cache sharing, so EROFS can
>> support page-cache sharing of files in all tar streams if needed.
> 
> Hmmm.  I should clarify here I have zero kernel patches, I'm a userspace developer (on container and OS updates, for which I'd like a unified stack).  But it seems to me that while you're right that it would be technically possible for a single filesystem to do this, in practice it would require some sort of virtual sub-filesystem internally.  And at that point, it does seem more elegant to me to make that stacking explicit, more like how composefs is doing it.

Since you said you're a userspace developer, I just need to
clarify that internal inodes are very common among local
filesystems; to my knowledge, btrfs and f2fs, in addition to
EROFS, all have such internal inodes in order to make use of the
kernel page cache.

One advantage over the stackable way is this:  with the stackable
way, you have to explicitly open the backing file, which takes
extra time to look up the dcache/icache and possibly the on-disk
hierarchy.  By contrast, if you share page cache through the
original tar blobs, you don't need to do another open at all.
Sure, that isn't measured by "ls -lR", but it does impact
end users.

Again, what I'm trying to say is that I'm neither in favor of nor
against any particular user-space distribution solution, be it
OSTree or something else.  Nydus is just one userspace example of
using EROFS, which I persuaded them to adopt.  Besides, EROFS has
already landed on all mainstream in-market Android smartphones,
and I hope it can get more attention and adoption across various
use cases, and that more developers will join us.

> 
> That said I think there's a lot of legitimate debate here, and I hope we can continue doing so productively!

Thanks. As a kernel filesystem developer for many years, I hope
our design (or at least my own) can be used more widely.  So
again, I'm not against your OSTree design, and I believe all
concrete distribution approaches have pros and cons.

Thanks,
Gao Xiang

> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-06 16:17         ` Gao Xiang
@ 2023-03-07  8:21           ` Alexander Larsson
  2023-03-07  8:33             ` Gao Xiang
  0 siblings, 1 reply; 42+ messages in thread
From: Alexander Larsson @ 2023-03-07  8:21 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Jingbo Xu, lsf-pc, linux-fsdevel, Amir Goldstein,
	Christian Brauner, Giuseppe Scrivano, Dave Chinner, Vivek Goyal,
	Miklos Szeredi

On Mon, Mar 6, 2023 at 5:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:

> >> I tested the performance of "ls -lR" on the whole tree of
> >> cs9-developer-rootfs.  It seems that the performance of erofs (generated
> >> from mkfs.erofs) is slightly better than that of composefs.  While the
> >> performance of erofs generated from mkfs.composefs is slightly worse
> >> that that of composefs.
> >
> > I suspect that the reason for the lower performance of mkfs.composefs
> > is the added overlay.fs-verity xattr to all the files. It makes the
> > image larger, and that means more i/o.
>
> Actually you could move overlay.fs-verity to EROFS shared xattr area (or
> even overlay.redirect but it depends) if needed, which could save some
> I/Os for your workloads.
>
> shared xattrs can be used in this way as well if you care such minor
> difference, actually I think inlined xattrs for your workload are just
> meaningful for selinux labels and capabilities.

Really? Could you expand on this? Because I would think it would be
sort of the opposite. In my usecase, the erofs fs will be read by
overlayfs, which will probably access overlay.* pretty often.  At the
very least it will load overlay.metacopy and overlay.redirect for
every lookup.

I guess it depends on how the verity support in overlayfs would work.
If it delays access to overlay.verity until open time, then it would
make sense to move it to the shared area.

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                Red Hat, Inc
       alexl@redhat.com         alexander.larsson@gmail.com


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-07  8:21           ` Alexander Larsson
@ 2023-03-07  8:33             ` Gao Xiang
  2023-03-07  8:48               ` Gao Xiang
  2023-03-07  9:07               ` Alexander Larsson
  0 siblings, 2 replies; 42+ messages in thread
From: Gao Xiang @ 2023-03-07  8:33 UTC (permalink / raw)
  To: Alexander Larsson
  Cc: Jingbo Xu, lsf-pc, linux-fsdevel, Amir Goldstein,
	Christian Brauner, Giuseppe Scrivano, Dave Chinner, Vivek Goyal,
	Miklos Szeredi



On 2023/3/7 16:21, Alexander Larsson wrote:
> On Mon, Mar 6, 2023 at 5:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> 
>>>> I tested the performance of "ls -lR" on the whole tree of
>>>> cs9-developer-rootfs.  It seems that the performance of erofs (generated
>>>> from mkfs.erofs) is slightly better than that of composefs.  While the
>>>> performance of erofs generated from mkfs.composefs is slightly worse
>>>> that that of composefs.
>>>
>>> I suspect that the reason for the lower performance of mkfs.composefs
>>> is the added overlay.fs-verity xattr to all the files. It makes the
>>> image larger, and that means more i/o.
>>
>> Actually you could move overlay.fs-verity to EROFS shared xattr area (or
>> even overlay.redirect but it depends) if needed, which could save some
>> I/Os for your workloads.
>>
>> shared xattrs can be used in this way as well if you care such minor
>> difference, actually I think inlined xattrs for your workload are just
>> meaningful for selinux labels and capabilities.
> 
> Really? Could you expand on this, because I would think it will be
> sort of the opposite. In my usecase, the erofs fs will be read by
> overlayfs, which will probably access overlay.* pretty often.  At the
> very least it will load overlay.metacopy and overlay.redirect for
> every lookup.

Really.  In that way, it behaves much like the current composefs
on-disk arrangement (the composefs vdata area).

In that way an extra I/O is needed for verification, but it can only
happen when actually opening the file (so "ls -lR" is not impacted),
and the on-disk inodes are more compact.

All EROFS xattrs will be cached in memory so that accessing
overlay.* pretty often is not greatly impacted due to no real I/Os
(IOWs, only some CPU time is consumed).

> 
> I guess it depends on how the verity support in overlayfs would work.
> If it delays access to overlay.verity until open time, then it would
> make sense to move it to the shared area.

I think it could work just like what composefs does; it wouldn't be
hard to add a dozen or so new lines to overlayfs, along the lines of:

static int cfs_open_file(struct inode *inode, struct file *file)
{
...
	/* If metadata records a digest for the file, ensure it is there
	 * and correct before using the contents.
	 */
	if (cino->inode_data.has_digest &&
	    fsi->verity_check >= CFS_VERITY_CHECK_IF_SPECIFIED) {
		...

		res = fsverity_get_digest(d_inode(backing_dentry),
					  verity_digest, &verity_algo);
		if (res < 0) {
			pr_warn("WARNING: composefs backing file '%pd' has no fs-verity digest\n",
				backing_dentry);
			return -EIO;
		}
		if (verity_algo != HASH_ALGO_SHA256 ||
		    memcmp(cino->inode_data.digest, verity_digest,
			   SHA256_DIGEST_SIZE) != 0) {
			pr_warn("WARNING: composefs backing file '%pd' has the wrong fs-verity digest\n",
				backing_dentry);
			return -EIO;
		}
		...
	}
...
}

Is this stacked fsverity feature really hard?

Thanks,
Gao Xiang

> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-07  8:33             ` Gao Xiang
@ 2023-03-07  8:48               ` Gao Xiang
  2023-03-07  9:07               ` Alexander Larsson
  1 sibling, 0 replies; 42+ messages in thread
From: Gao Xiang @ 2023-03-07  8:48 UTC (permalink / raw)
  To: Alexander Larsson
  Cc: Jingbo Xu, lsf-pc, linux-fsdevel, Amir Goldstein,
	Christian Brauner, Giuseppe Scrivano, Dave Chinner, Vivek Goyal,
	Miklos Szeredi



On 2023/3/7 16:33, Gao Xiang wrote:
> 
> 
> On 2023/3/7 16:21, Alexander Larsson wrote:
>> On Mon, Mar 6, 2023 at 5:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
>>>>> I tested the performance of "ls -lR" on the whole tree of
>>>>> cs9-developer-rootfs.  It seems that the performance of erofs (generated
>>>>> from mkfs.erofs) is slightly better than that of composefs.  While the
>>>>> performance of erofs generated from mkfs.composefs is slightly worse
>>>>> that that of composefs.
>>>>
>>>> I suspect that the reason for the lower performance of mkfs.composefs
>>>> is the added overlay.fs-verity xattr to all the files. It makes the
>>>> image larger, and that means more i/o.
>>>
>>> Actually you could move overlay.fs-verity to EROFS shared xattr area (or
>>> even overlay.redirect but it depends) if needed, which could save some
>>> I/Os for your workloads.
>>>
>>> shared xattrs can be used in this way as well if you care such minor
>>> difference, actually I think inlined xattrs for your workload are just
>>> meaningful for selinux labels and capabilities.
>>
>> Really? Could you expand on this, because I would think it will be
>> sort of the opposite. In my usecase, the erofs fs will be read by
>> overlayfs, which will probably access overlay.* pretty often.  At the
>> very least it will load overlay.metacopy and overlay.redirect for
>> every lookup.
> 
> Really.  In that way, it will behave much similiar to composefs on-disk
> arrangement now (in composefs vdata area).
> 
> Because in that way, although an extra I/O is needed for verification,
> and it can only happen when actually opening the file (so "ls -lR" is
> not impacted.) But on-disk inodes are more compact.
> 
> All EROFS xattrs will be cached in memory so that accessing

    ^ all accessed xattrs in EROFS

Sorry if that caused any misunderstanding.

> overlay.* pretty often is not greatly impacted due to no real I/Os
> (IOWs, only some CPU time is consumed).
> 
>>
>> I guess it depends on how the verity support in overlayfs would work.
>> If it delays access to overlay.verity until open time, then it would
>> make sense to move it to the shared area.
> 
> I think it could be just like what composefs does, it's not hard to
> add just new dozen lines to overlayfs like:
> 
> static int cfs_open_file(struct inode *inode, struct file *file)
> {
> ...
>      /* If metadata records a digest for the file, ensure it is there
>       * and correct before using the contents.
>       */
>      if (cino->inode_data.has_digest &&
>          fsi->verity_check >= CFS_VERITY_CHECK_IF_SPECIFIED) {
>          ...
> 
>          res = fsverity_get_digest(d_inode(backing_dentry),
>                        verity_digest, &verity_algo);
>          if (res < 0) {
>              pr_warn("WARNING: composefs backing file '%pd' has no fs-verity digest\n",
>                  backing_dentry);
>              return -EIO;
>          }
>          if (verity_algo != HASH_ALGO_SHA256 ||
>              memcmp(cino->inode_data.digest, verity_digest,
>                 SHA256_DIGEST_SIZE) != 0) {
>              pr_warn("WARNING: composefs backing file '%pd' has the wrong fs-verity digest\n",
>                  backing_dentry);
>              return -EIO;
>          }
>          ...
>      }
> ...
> }
> 
> Is this stacked fsverity feature really hard?
> 
> Thanks,
> Gao Xiang
> 
>>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-07  8:33             ` Gao Xiang
  2023-03-07  8:48               ` Gao Xiang
@ 2023-03-07  9:07               ` Alexander Larsson
  2023-03-07  9:26                 ` Gao Xiang
  1 sibling, 1 reply; 42+ messages in thread
From: Alexander Larsson @ 2023-03-07  9:07 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Jingbo Xu, lsf-pc, linux-fsdevel, Amir Goldstein,
	Christian Brauner, Giuseppe Scrivano, Dave Chinner, Vivek Goyal,
	Miklos Szeredi

On Tue, Mar 7, 2023 at 9:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
>
>
> On 2023/3/7 16:21, Alexander Larsson wrote:
> > On Mon, Mar 6, 2023 at 5:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> >
> >>>> I tested the performance of "ls -lR" on the whole tree of
> >>>> cs9-developer-rootfs.  It seems that the performance of erofs (generated
> >>>> from mkfs.erofs) is slightly better than that of composefs.  While the
> >>>> performance of erofs generated from mkfs.composefs is slightly worse
> >>>> that that of composefs.
> >>>
> >>> I suspect that the reason for the lower performance of mkfs.composefs
> >>> is the added overlay.fs-verity xattr to all the files. It makes the
> >>> image larger, and that means more i/o.
> >>
> >> Actually you could move overlay.fs-verity to EROFS shared xattr area (or
> >> even overlay.redirect but it depends) if needed, which could save some
> >> I/Os for your workloads.
> >>
> >> shared xattrs can be used in this way as well if you care such minor
> >> difference, actually I think inlined xattrs for your workload are just
> >> meaningful for selinux labels and capabilities.
> >
> > Really? Could you expand on this, because I would think it will be
> > sort of the opposite. In my usecase, the erofs fs will be read by
> > overlayfs, which will probably access overlay.* pretty often.  At the
> > very least it will load overlay.metacopy and overlay.redirect for
> > every lookup.
>
> Really.  In that way, it will behave much similiar to composefs on-disk
> arrangement now (in composefs vdata area).
>
> Because in that way, although an extra I/O is needed for verification,
> and it can only happen when actually opening the file (so "ls -lR" is
> not impacted.) But on-disk inodes are more compact.
>
> All EROFS xattrs will be cached in memory so that accessing
> overlay.* pretty often is not greatly impacted due to no real I/Os
> (IOWs, only some CPU time is consumed).

So, I tried moving the overlay.digest xattr to the shared area, but
actually this made the performance worse for the ls case. I have not
looked into the cause in detail, but my guess is that ls looks for the
acl xattr, and such a negative lookup will cause erofs to look at all
the shared xattrs for the inode, which means they all end up being
loaded anyway. Of course, this will only affect ls (or other cases
that read the acl), so it's perhaps a bit uncommon.

Did you ever consider putting a bloom filter in the h_reserved area of
erofs_xattr_ibody_header? Then it could return early without i/o
operations for keys that are not set for the inode. Not sure what the
computational cost of that would be though.
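
Just to make the idea concrete (this is only a toy userspace sketch of
the membership test, not erofs code; the filter size, hash choice and
names are all made up): a tiny per-inode filter over the xattr names
would let a negative lookup bail out before touching the shared xattr
area, at the cost of a couple of hashes per query:

/* Toy demo of a per-inode xattr-name bloom filter: a clear bit means
 * the name is definitely absent, so the shared-xattr walk (and its
 * I/O) can be skipped entirely. */
#include <stdint.h>
#include <stdio.h>

#define FILTER_BITS 64	/* e.g. one u64 carved out of h_reserved */

static uint64_t hash_name(const char *name, uint64_t seed)
{
	uint64_t h = seed ^ 0xcbf29ce484222325ULL;	/* FNV-1a style */

	while (*name)
		h = (h ^ (uint8_t)*name++) * 0x100000001b3ULL;
	return h;
}

static uint64_t filter_add(uint64_t filter, const char *name)
{
	filter |= 1ULL << (hash_name(name, 1) % FILTER_BITS);
	filter |= 1ULL << (hash_name(name, 2) % FILTER_BITS);
	return filter;
}

static int filter_may_contain(uint64_t filter, const char *name)
{
	return (filter & (1ULL << (hash_name(name, 1) % FILTER_BITS))) &&
	       (filter & (1ULL << (hash_name(name, 2) % FILTER_BITS)));
}

int main(void)
{
	/* mkfs would compute this per inode and store it in the header */
	uint64_t filter = 0;

	filter = filter_add(filter, "trusted.overlay.metacopy");
	filter = filter_add(filter, "trusted.overlay.redirect");

	printf("acl maybe set: %d\n",
	       filter_may_contain(filter, "system.posix_acl_access"));
	printf("redirect maybe set: %d\n",
	       filter_may_contain(filter, "trusted.overlay.redirect"));
	return 0;
}

With two hash functions and 64 bits this stays cheap to compute at mkfs
time and to test at lookup time, and the only cost of a false positive
is falling back to the current behaviour.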

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                Red Hat, Inc
       alexl@redhat.com         alexander.larsson@gmail.com


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-07  9:07               ` Alexander Larsson
@ 2023-03-07  9:26                 ` Gao Xiang
  2023-03-07  9:38                   ` Gao Xiang
  2023-03-07  9:46                   ` Alexander Larsson
  0 siblings, 2 replies; 42+ messages in thread
From: Gao Xiang @ 2023-03-07  9:26 UTC (permalink / raw)
  To: Alexander Larsson
  Cc: Jingbo Xu, lsf-pc, linux-fsdevel, Amir Goldstein,
	Christian Brauner, Giuseppe Scrivano, Dave Chinner, Vivek Goyal,
	Miklos Szeredi



On 2023/3/7 17:07, Alexander Larsson wrote:
> On Tue, Mar 7, 2023 at 9:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2023/3/7 16:21, Alexander Larsson wrote:
>>> On Mon, Mar 6, 2023 at 5:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>>
>>>>>> I tested the performance of "ls -lR" on the whole tree of
>>>>>> cs9-developer-rootfs.  It seems that the performance of erofs (generated
>>>>>> from mkfs.erofs) is slightly better than that of composefs.  While the
>>>>>> performance of erofs generated from mkfs.composefs is slightly worse
>>>>>> that that of composefs.
>>>>>
>>>>> I suspect that the reason for the lower performance of mkfs.composefs
>>>>> is the added overlay.fs-verity xattr to all the files. It makes the
>>>>> image larger, and that means more i/o.
>>>>
>>>> Actually you could move overlay.fs-verity to EROFS shared xattr area (or
>>>> even overlay.redirect but it depends) if needed, which could save some
>>>> I/Os for your workloads.
>>>>
>>>> shared xattrs can be used in this way as well if you care such minor
>>>> difference, actually I think inlined xattrs for your workload are just
>>>> meaningful for selinux labels and capabilities.
>>>
>>> Really? Could you expand on this, because I would think it will be
>>> sort of the opposite. In my usecase, the erofs fs will be read by
>>> overlayfs, which will probably access overlay.* pretty often.  At the
>>> very least it will load overlay.metacopy and overlay.redirect for
>>> every lookup.
>>
>> Really.  In that way, it will behave much similiar to composefs on-disk
>> arrangement now (in composefs vdata area).
>>
>> Because in that way, although an extra I/O is needed for verification,
>> and it can only happen when actually opening the file (so "ls -lR" is
>> not impacted.) But on-disk inodes are more compact.
>>
>> All EROFS xattrs will be cached in memory so that accessing
>> overlay.* pretty often is not greatly impacted due to no real I/Os
>> (IOWs, only some CPU time is consumed).
> 
> So, I tried moving the overlay.digest xattr to the shared area, but
> actually this made the performance worse for the ls case. I have not

That is quite strange.  We'd like to look into it if needed.  BTW, did you
test EROFS with acl enabled all the time?

> looked into the cause in detail, but my guess is that ls looks for the
> acl xattr, and such a negative lookup will cause erofs to look at all
> the shared xattrs for the inode, which means they all end up being
> loaded anyway. Of course, this will only affect ls (or other cases
> that read the acl), so its perhaps a bit uncommon.

Yeah, in addition to that, I guess real acls could land in the inlined
xattrs as well if they exist...

> 
> Did you ever consider putting a bloom filter in the h_reserved area of
> erofs_xattr_ibody_header? Then it could return early without i/o
> operations for keys that are not set for the inode. Not sure what the
> computational cost of that would be though.

Good idea!  Let me think about it, but enabling "noacl" mount
option isn't prefered if acl is not needed in your use cases.
Optimizing negative xattr lookups might need more on-disk
improvements; we haven't cared that much about xattr performance
so far (although "overlay.redirect" and "overlay.digest" seem fine
for composefs use cases).

BTW, if you have more interest in this direction, we could get in
touch in a more effective way than community emails to improve
EROFS, except for the userns stuff (I know it's useful but I don't
have the answers; maybe, as Christian said, we could develop a new
vfs feature to delegate a filesystem mount to an unprivileged
user [1].  I think that way is much safer for kernel fses with an
on-disk format.)

[1] https://lore.kernel.org/r/20230126082228.rweg75ztaexykejv@wittgenstein

Thanks,
Gao Xiang
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-07  9:26                 ` Gao Xiang
@ 2023-03-07  9:38                   ` Gao Xiang
  2023-03-07  9:56                     ` Alexander Larsson
  2023-03-07  9:46                   ` Alexander Larsson
  1 sibling, 1 reply; 42+ messages in thread
From: Gao Xiang @ 2023-03-07  9:38 UTC (permalink / raw)
  To: Alexander Larsson
  Cc: Jingbo Xu, lsf-pc, linux-fsdevel, Amir Goldstein,
	Christian Brauner, Giuseppe Scrivano, Dave Chinner, Vivek Goyal,
	Miklos Szeredi



On 2023/3/7 17:26, Gao Xiang wrote:
> 
> 
> On 2023/3/7 17:07, Alexander Larsson wrote:
>> On Tue, Mar 7, 2023 at 9:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>>
>>>
>>>
>>> On 2023/3/7 16:21, Alexander Larsson wrote:
>>>> On Mon, Mar 6, 2023 at 5:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>>>
>>>>>>> I tested the performance of "ls -lR" on the whole tree of
>>>>>>> cs9-developer-rootfs.  It seems that the performance of erofs (generated
>>>>>>> from mkfs.erofs) is slightly better than that of composefs.  While the
>>>>>>> performance of erofs generated from mkfs.composefs is slightly worse
>>>>>>> that that of composefs.
>>>>>>
>>>>>> I suspect that the reason for the lower performance of mkfs.composefs
>>>>>> is the added overlay.fs-verity xattr to all the files. It makes the
>>>>>> image larger, and that means more i/o.
>>>>>
>>>>> Actually you could move overlay.fs-verity to EROFS shared xattr area (or
>>>>> even overlay.redirect but it depends) if needed, which could save some
>>>>> I/Os for your workloads.
>>>>>
>>>>> shared xattrs can be used in this way as well if you care such minor
>>>>> difference, actually I think inlined xattrs for your workload are just
>>>>> meaningful for selinux labels and capabilities.
>>>>
>>>> Really? Could you expand on this, because I would think it will be
>>>> sort of the opposite. In my usecase, the erofs fs will be read by
>>>> overlayfs, which will probably access overlay.* pretty often.  At the
>>>> very least it will load overlay.metacopy and overlay.redirect for
>>>> every lookup.
>>>
>>> Really.  In that way, it will behave much similiar to composefs on-disk
>>> arrangement now (in composefs vdata area).
>>>
>>> Because in that way, although an extra I/O is needed for verification,
>>> and it can only happen when actually opening the file (so "ls -lR" is
>>> not impacted.) But on-disk inodes are more compact.
>>>
>>> All EROFS xattrs will be cached in memory so that accessing
>>> overlay.* pretty often is not greatly impacted due to no real I/Os
>>> (IOWs, only some CPU time is consumed).
>>
>> So, I tried moving the overlay.digest xattr to the shared area, but
>> actually this made the performance worse for the ls case. I have not
> 
> That is much strange.  We'd like to open it up if needed.  BTW, did you
> test EROFS with acl enabled all the time?
> 
>> looked into the cause in detail, but my guess is that ls looks for the
>> acl xattr, and such a negative lookup will cause erofs to look at all
>> the shared xattrs for the inode, which means they all end up being
>> loaded anyway. Of course, this will only affect ls (or other cases
>> that read the acl), so its perhaps a bit uncommon.
> 
> Yeah, in addition to that, I guess real acls could be landed in inlined
> xattrs as well if exists...
> 
>>
>> Did you ever consider putting a bloom filter in the h_reserved area of
>> erofs_xattr_ibody_header? Then it could return early without i/o
>> operations for keys that are not set for the inode. Not sure what the
>> computational cost of that would be though.
> 
> Good idea!  Let me think about it, but enabling "noacl" mount
> option isn't prefered if acl is no needed in your use cases.

           ^ is preferred.

> Optimizing negative xattr lookups might need more on-disk
> improvements which we didn't care about xattrs more. (although
> "overlay.redirect" and "overlay.digest" seems fine for
> composefs use cases.)

Or we could just add a FEATURE_COMPAT_NOACL on-disk feature flag to
disable ACLs explicitly if the image doesn't have any ACLs.  At least
that would be useful for your use cases.

Thanks,
Gao Xiang

> 
> BTW, if you have more interest in this way, we could get in
> touch in a more effective way to improve EROFS in addition to
> community emails except for the userns stuff (I know it's useful
> but I don't know the answers, maybe as Chistian said, we could
> develop a new vfs feature to delegate a filesystem mount to an
> unprivileged one [1].  I think it's much safer in that way for
> kernel fses with on-disk format.)
> 
> [1] https://lore.kernel.org/r/20230126082228.rweg75ztaexykejv@wittgenstein
> 
> Thanks,
> Gao Xiang
>>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-07  9:26                 ` Gao Xiang
  2023-03-07  9:38                   ` Gao Xiang
@ 2023-03-07  9:46                   ` Alexander Larsson
  2023-03-07 10:01                     ` Gao Xiang
  1 sibling, 1 reply; 42+ messages in thread
From: Alexander Larsson @ 2023-03-07  9:46 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Jingbo Xu, lsf-pc, linux-fsdevel, Amir Goldstein,
	Christian Brauner, Giuseppe Scrivano, Dave Chinner, Vivek Goyal,
	Miklos Szeredi

On Tue, Mar 7, 2023 at 10:26 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> On 2023/3/7 17:07, Alexander Larsson wrote:
> > On Tue, Mar 7, 2023 at 9:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> On 2023/3/7 16:21, Alexander Larsson wrote:
> >>> On Mon, Mar 6, 2023 at 5:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> >>>
> >>>>>> I tested the performance of "ls -lR" on the whole tree of
> >>>>>> cs9-developer-rootfs.  It seems that the performance of erofs (generated
> >>>>>> from mkfs.erofs) is slightly better than that of composefs.  While the
> >>>>>> performance of erofs generated from mkfs.composefs is slightly worse
> >>>>>> that that of composefs.
> >>>>>
> >>>>> I suspect that the reason for the lower performance of mkfs.composefs
> >>>>> is the added overlay.fs-verity xattr to all the files. It makes the
> >>>>> image larger, and that means more i/o.
> >>>>
> >>>> Actually you could move overlay.fs-verity to EROFS shared xattr area (or
> >>>> even overlay.redirect but it depends) if needed, which could save some
> >>>> I/Os for your workloads.
> >>>>
> >>>> shared xattrs can be used in this way as well if you care such minor
> >>>> difference, actually I think inlined xattrs for your workload are just
> >>>> meaningful for selinux labels and capabilities.
> >>>
> >>> Really? Could you expand on this, because I would think it will be
> >>> sort of the opposite. In my usecase, the erofs fs will be read by
> >>> overlayfs, which will probably access overlay.* pretty often.  At the
> >>> very least it will load overlay.metacopy and overlay.redirect for
> >>> every lookup.
> >>
> >> Really.  In that way, it will behave much similiar to composefs on-disk
> >> arrangement now (in composefs vdata area).
> >>
> >> Because in that way, although an extra I/O is needed for verification,
> >> and it can only happen when actually opening the file (so "ls -lR" is
> >> not impacted.) But on-disk inodes are more compact.
> >>
> >> All EROFS xattrs will be cached in memory so that accessing
> >> overlay.* pretty often is not greatly impacted due to no real I/Os
> >> (IOWs, only some CPU time is consumed).
> >
> > So, I tried moving the overlay.digest xattr to the shared area, but
> > actually this made the performance worse for the ls case. I have not
>
> That is much strange.  We'd like to open it up if needed.  BTW, did you
> test EROFS with acl enabled all the time?

These were all with acl enabled.

And, to test this, I compared "ls -lR" and "ls -ZR", which do the same
per-file syscalls, except the latter doesn't try to read the
system.posix_acl_access xattr. The result is:

xattr:        inlined | not inlined   (times in msec)
------------+---------+------------
ls -lR cold |  708    |  721
ls -lR warm |  415    |  412
ls -ZR cold |  522    |  512
ls -ZR warm |  283    |  279

In the -ZR case the out-of-band digest is a win, but not in the -lR
case, which seems to mean the failed lookup of the acl xattr is to
blame here.

Also, very interesting is the fact that the warm cache difference
between these two is so large. I guess that is because most other
inode data is cached, but the xattr lookups are not. If you could
cache negative xattr lookups, that seems like a large win. This could
be done either via a bloom filter in the disk format or maybe even
just some in-memory negative lookup cache for the inode, maybe even
special-casing the acl xattrs.
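
For the in-memory side, the VFS already caches the POSIX ACL pointers
per inode (i_acl/i_default_acl), so one cheap variant would be for the
filesystem to record "no ACLs here" at inode-read time whenever it can
prove it, e.g. when the inode has no xattrs at all; later
system.posix_acl_* lookups then short-circuit in the VFS without
reaching the filesystem's xattr code. Very rough sketch (not actual
erofs code, and it hand-waves over where "no ACL xattrs" is known):

#include <linux/fs.h>
#include <linux/posix_acl.h>	/* cache_no_acl() */

/* Hypothetical hook in the on-disk inode read path. */
static void demo_maybe_mark_no_acl(struct inode *inode,
				   unsigned int xattr_isize)
{
	/*
	 * If this inode provably carries no ACL xattrs (here: it has no
	 * xattrs at all; a per-inode filter could widen this), tell the
	 * VFS so that future ACL lookups return immediately instead of
	 * walking the shared xattr area.
	 */
	if (!xattr_isize)
		cache_no_acl(inode);
}

That would cover the common "ls reads the ACL and it isn't there" case
without any on-disk format change.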

> > looked into the cause in detail, but my guess is that ls looks for the
> > acl xattr, and such a negative lookup will cause erofs to look at all
> > the shared xattrs for the inode, which means they all end up being
> > loaded anyway. Of course, this will only affect ls (or other cases
> > that read the acl), so its perhaps a bit uncommon.
>
> Yeah, in addition to that, I guess real acls could be landed in inlined
> xattrs as well if exists...

Yeah, but that doesn't help with the case where they don't exist.

> BTW, if you have more interest in this way, we could get in
> touch in a more effective way to improve EROFS in addition to
> community emails except for the userns stuff

I don't really have time to do any real erofs-specific work. These are
just some ideas that I got from looking at these results.

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                Red Hat, Inc
       alexl@redhat.com         alexander.larsson@gmail.com


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-07  9:38                   ` Gao Xiang
@ 2023-03-07  9:56                     ` Alexander Larsson
  2023-03-07 10:06                       ` Gao Xiang
  0 siblings, 1 reply; 42+ messages in thread
From: Alexander Larsson @ 2023-03-07  9:56 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Jingbo Xu, lsf-pc, linux-fsdevel, Amir Goldstein,
	Christian Brauner, Giuseppe Scrivano, Dave Chinner, Vivek Goyal,
	Miklos Szeredi

On Tue, Mar 7, 2023 at 10:38 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
> On 2023/3/7 17:26, Gao Xiang wrote:
> >
> >
> > On 2023/3/7 17:07, Alexander Larsson wrote:
> >> On Tue, Mar 7, 2023 at 9:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> >>>
> >>>
> >>>
> >>> On 2023/3/7 16:21, Alexander Larsson wrote:
> >>>> On Mon, Mar 6, 2023 at 5:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
> >>>>
> >>>>>>> I tested the performance of "ls -lR" on the whole tree of
> >>>>>>> cs9-developer-rootfs.  It seems that the performance of erofs (generated
> >>>>>>> from mkfs.erofs) is slightly better than that of composefs.  While the
> >>>>>>> performance of erofs generated from mkfs.composefs is slightly worse
> >>>>>>> that that of composefs.
> >>>>>>
> >>>>>> I suspect that the reason for the lower performance of mkfs.composefs
> >>>>>> is the added overlay.fs-verity xattr to all the files. It makes the
> >>>>>> image larger, and that means more i/o.
> >>>>>
> >>>>> Actually you could move overlay.fs-verity to EROFS shared xattr area (or
> >>>>> even overlay.redirect but it depends) if needed, which could save some
> >>>>> I/Os for your workloads.
> >>>>>
> >>>>> shared xattrs can be used in this way as well if you care such minor
> >>>>> difference, actually I think inlined xattrs for your workload are just
> >>>>> meaningful for selinux labels and capabilities.
> >>>>
> >>>> Really? Could you expand on this, because I would think it will be
> >>>> sort of the opposite. In my usecase, the erofs fs will be read by
> >>>> overlayfs, which will probably access overlay.* pretty often.  At the
> >>>> very least it will load overlay.metacopy and overlay.redirect for
> >>>> every lookup.
> >>>
> >>> Really.  In that way, it will behave much similiar to composefs on-disk
> >>> arrangement now (in composefs vdata area).
> >>>
> >>> Because in that way, although an extra I/O is needed for verification,
> >>> and it can only happen when actually opening the file (so "ls -lR" is
> >>> not impacted.) But on-disk inodes are more compact.
> >>>
> >>> All EROFS xattrs will be cached in memory so that accessing
> >>> overlay.* pretty often is not greatly impacted due to no real I/Os
> >>> (IOWs, only some CPU time is consumed).
> >>
> >> So, I tried moving the overlay.digest xattr to the shared area, but
> >> actually this made the performance worse for the ls case. I have not
> >
> > That is much strange.  We'd like to open it up if needed.  BTW, did you
> > test EROFS with acl enabled all the time?
> >
> >> looked into the cause in detail, but my guess is that ls looks for the
> >> acl xattr, and such a negative lookup will cause erofs to look at all
> >> the shared xattrs for the inode, which means they all end up being
> >> loaded anyway. Of course, this will only affect ls (or other cases
> >> that read the acl), so its perhaps a bit uncommon.
> >
> > Yeah, in addition to that, I guess real acls could be landed in inlined
> > xattrs as well if exists...
> >
> >>
> >> Did you ever consider putting a bloom filter in the h_reserved area of
> >> erofs_xattr_ibody_header? Then it could return early without i/o
> >> operations for keys that are not set for the inode. Not sure what the
> >> computational cost of that would be though.
> >
> > Good idea!  Let me think about it, but enabling "noacl" mount
> > option isn't prefered if acl is no needed in your use cases.
>
>            ^ is preferred.

That is probably the right approach for the composefs usecase. But
even when you want acls, typically only a few files have acls
set, so it might be interesting to handle the negative acl lookup case
more efficiently.

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                Red Hat, Inc
       alexl@redhat.com         alexander.larsson@gmail.com


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-06 15:49     ` Jingbo Xu
  2023-03-06 16:09       ` Alexander Larsson
@ 2023-03-07 10:00       ` Jingbo Xu
  1 sibling, 0 replies; 42+ messages in thread
From: Jingbo Xu @ 2023-03-07 10:00 UTC (permalink / raw)
  To: Alexander Larsson, lsf-pc
  Cc: linux-fsdevel, Amir Goldstein, Christian Brauner, Gao Xiang,
	Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi



On 3/6/23 11:49 PM, Jingbo Xu wrote:
> 
> 
> On 3/6/23 7:33 PM, Alexander Larsson wrote:
>> On Fri, Mar 3, 2023 at 2:57 PM Alexander Larsson <alexl@redhat.com> wrote:
>>>
>>> On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote:
>>>>
>>>> Hello,
>>>>
>>>> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
>>>> Composefs filesystem. It is an opportunistically sharing, validating
>>>> image-based filesystem, targeting usecases like validated ostree
>>>> rootfs:es, validated container images that share common files, as well
>>>> as other image based usecases.
>>>>
>>>> During the discussions in the composefs proposal (as seen on LWN[3])
>>>> is has been proposed that (with some changes to overlayfs), similar
>>>> behaviour can be achieved by combining the overlayfs
>>>> "overlay.redirect" xattr with an read-only filesystem such as erofs.
>>>>
>>>> There are pros and cons to both these approaches, and the discussion
>>>> about their respective value has sometimes been heated. We would like
>>>> to have an in-person discussion at the summit, ideally also involving
>>>> more of the filesystem development community, so that we can reach
>>>> some consensus on what is the best apporach.
>>>
>>> In order to better understand the behaviour and requirements of the
>>> overlayfs+erofs approach I spent some time implementing direct support
>>> for erofs in libcomposefs. So, with current HEAD of
>>> github.com/containers/composefs you can now do:
>>>
>>> $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs
>>>
>>> This will produce an object store with the backing files, and a erofs
>>> file with the required overlayfs xattrs, including a made up one
>>> called "overlay.fs-verity" containing the expected fs-verity digest
>>> for the lower dir. It also adds the required whiteouts to cover the
>>> 00-ff dirs from the lower dir.
>>>
>>> These erofs files are ordered similarly to the composefs files, and we
>>> give similar guarantees about their reproducibility, etc. So, they
>>> should be apples-to-apples comparable with the composefs images.
>>>
>>> Given this, I ran another set of performance tests on the original cs9
>>> rootfs dataset, again measuring the time of `ls -lR`. I also tried to
>>> measure the memory use like this:
>>>
>>> # echo 3 > /proc/sys/vm/drop_caches
>>> # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat
>>> /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'
>>>
>>> These are the alternatives I tried:
>>>
>>> xfs: the source of the image, regular dir on xfs
>>> erofs: the image.erofs above, on loopback
>>> erofs dio: the image.erofs above, on loopback with --direct-io=on
>>> ovl: erofs above combined with overlayfs
>>> ovl dio: erofs dio above combined with overlayfs
>>> cfs: composefs mount of image.cfs
>>>
>>> All tests use the same objects dir, stored on xfs. The erofs and
>>> overlay implementations are from a stock 6.1.13 kernel, and composefs
>>> module is from github HEAD.
>>>
>>> I tried loopback both with and without the direct-io option, because
>>> without direct-io enabled the kernel will double-cache the loopbacked
>>> data, as per[1].
>>>
>>> The produced images are:
>>>  8.9M image.cfs
>>> 11.3M image.erofs
>>>
>>> And gives these results:
>>>            | Cold cache | Warm cache | Mem use
>>>            |   (msec)   |   (msec)   |  (mb)
>>> -----------+------------+------------+---------
>>> xfs        |   1449     |    442     |    54
>>> erofs      |    700     |    391     |    45
>>> erofs dio  |    939     |    400     |    45
>>> ovl        |   1827     |    530     |   130
>>> ovl dio    |   2156     |    531     |   130
>>> cfs        |    689     |    389     |    51
>>
>> It has been noted that the readahead done by kernel_read() may cause
>> read-ahead of unrelated data into memory which skews the results in
>> favour of workloads that consume all the filesystem metadata (such as
>> the ls -lR usecase of the above test). In the table above this favours
>> composefs (which uses kernel_read in some codepaths) as well as
>> non-dio erofs (non-dio loopback device uses readahead too).
>>
>> I updated composefs to not use kernel_read here:
>>   https://github.com/containers/composefs/pull/105
>>
>> And a new kernel patch-set based on this is available at:
>>   https://github.com/alexlarsson/linux/tree/composefs
>>
>> The resulting table is now (dropping the non-dio erofs):
>>
>>            | Cold cache | Warm cache | Mem use
>>            |   (msec)   |   (msec)   |  (mb)
>> -----------+------------+------------+---------
>> xfs        |   1449     |    442     |   54
>> erofs dio  |    939     |    400     |   45
>> ovl dio    |   2156     |    531     |  130
>> cfs        |    833     |    398     |   51
>>
>>            | Cold cache | Warm cache | Mem use
>>            |   (msec)   |   (msec)   |  (mb)
>> -----------+------------+------------+---------
>> ext4       |   1135     |    394     |   54
>> erofs dio  |    922     |    401     |   45
>> ovl dio    |   1810     |    532     |  149
>> ovl lazy   |   1063     |    523     |  87
>> cfs        |    768     |    459     |  51
>>
>> So, while cfs is somewhat worse now for this particular usecase, my
>> overall analysis still stands.
>>
> 
> Hi,
> 
> I tested your patch removing kernel_read(), and here is the statistics
> tested in my environment.
> 
> 
> Setup
> ======
> CPU: x86_64 Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
> Disk: cloud disk, 11800 IOPS upper limit
> OS: Linux v6.2
> FS of backing objects: xfs
> 
> 
> Image size
> ===========
> 8.6M large.composefs (with --compute-digest)
> 8.9M large.erofs (mkfs.erofs)
> 11M  large.cps.in.erofs (mkfs.composefs --compute-digest --format=erofs)
> 
> 
> Perf of "ls -lR"
> ================
> 					      | uncached| cached
> 					      |  (ms)	|  (ms)
> ----------------------------------------------|---------|--------
> composefs				      	   | 519	| 178
> erofs (mkfs.erofs, DIRECT loop) 	     	   | 497 	| 192
> erofs (mkfs.composefs --format=erofs, DIRECT loop) | 536 	| 199
> 
> I tested the performance of "ls -lR" on the whole tree of
> cs9-developer-rootfs.  It seems that the performance of erofs (generated
> from mkfs.erofs) is slightly better than that of composefs.  While the
> performance of erofs generated from mkfs.composefs is slightly worse
> that that of composefs.
> 
> The uncached performance is somewhat slightly different with that given
> by Alexander Larsson.  I think it may be due to different test
> environment, as my test machine is a server with robust performance,
> with cloud disk as storage.
> 
> It's just a simple test without further analysis, as it's a bit late for
> me :)
> 

Forgot to mention that all the erofs images (no matter whether generated
from mkfs.erofs or mkfs.composefs) are mounted with "-o noacl", as
composefs has not implemented acl support yet.


-- 
Thanks,
Jingbo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-07  9:46                   ` Alexander Larsson
@ 2023-03-07 10:01                     ` Gao Xiang
  0 siblings, 0 replies; 42+ messages in thread
From: Gao Xiang @ 2023-03-07 10:01 UTC (permalink / raw)
  To: Alexander Larsson
  Cc: Jingbo Xu, lsf-pc, linux-fsdevel, Amir Goldstein,
	Christian Brauner, Giuseppe Scrivano, Dave Chinner, Vivek Goyal,
	Miklos Szeredi



On 2023/3/7 17:46, Alexander Larsson wrote:
> On Tue, Mar 7, 2023 at 10:26 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>> On 2023/3/7 17:07, Alexander Larsson wrote:
>>> On Tue, Mar 7, 2023 at 9:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2023/3/7 16:21, Alexander Larsson wrote:
>>>>> On Mon, Mar 6, 2023 at 5:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>>>>
>>>>>>>> I tested the performance of "ls -lR" on the whole tree of
>>>>>>>> cs9-developer-rootfs.  It seems that the performance of erofs (generated
>>>>>>>> from mkfs.erofs) is slightly better than that of composefs.  While the
>>>>>>>> performance of erofs generated from mkfs.composefs is slightly worse
>>>>>>>> that that of composefs.
>>>>>>>
>>>>>>> I suspect that the reason for the lower performance of mkfs.composefs
>>>>>>> is the added overlay.fs-verity xattr to all the files. It makes the
>>>>>>> image larger, and that means more i/o.
>>>>>>
>>>>>> Actually you could move overlay.fs-verity to EROFS shared xattr area (or
>>>>>> even overlay.redirect but it depends) if needed, which could save some
>>>>>> I/Os for your workloads.
>>>>>>
>>>>>> shared xattrs can be used in this way as well if you care such minor
>>>>>> difference, actually I think inlined xattrs for your workload are just
>>>>>> meaningful for selinux labels and capabilities.
>>>>>
>>>>> Really? Could you expand on this, because I would think it will be
>>>>> sort of the opposite. In my usecase, the erofs fs will be read by
>>>>> overlayfs, which will probably access overlay.* pretty often.  At the
>>>>> very least it will load overlay.metacopy and overlay.redirect for
>>>>> every lookup.
>>>>
>>>> Really.  In that way, it will behave much similiar to composefs on-disk
>>>> arrangement now (in composefs vdata area).
>>>>
>>>> Because in that way, although an extra I/O is needed for verification,
>>>> and it can only happen when actually opening the file (so "ls -lR" is
>>>> not impacted.) But on-disk inodes are more compact.
>>>>
>>>> All EROFS xattrs will be cached in memory so that accessing
>>>> overlay.* pretty often is not greatly impacted due to no real I/Os
>>>> (IOWs, only some CPU time is consumed).
>>>
>>> So, I tried moving the overlay.digest xattr to the shared area, but
>>> actually this made the performance worse for the ls case. I have not
>>
>> That is much strange.  We'd like to open it up if needed.  BTW, did you
>> test EROFS with acl enabled all the time?
> 
> These were all with acl enabled.
> 
> And, to test this, I compared "ls -lR" and "ls -ZR", which do the same
> per-file syscalls, except the later doesn't try to read the
> system.posix_acl_access xattr. The result is:
> 
> xattr:        inlined | not inlined
> ------------+---------+------------
> ls -lR cold |  708    |  721
> ls -lR warm |  415    |  412
> ls -ZR cold |  522    |  512
> ls -ZR warm |  283    |  279
> 
> In the ZR case the out-of band digest is a win, but not in the lR
> case, which seems to mean the failed lookup of the acl xattr is to
> blame here.
> 
> Also, very interesting is the fact that the warm cache difference for
> these to is so large. I guess that is because most other inode data is
> cached, but the xattrs lookups are not. If you could cache negative
> xattr lookups that seems like a large win. This can be either via a
> bloom cache in the disk format or maybe even just some in-memory
> negative lookup caches for the inode, maybe even special casing the
> acl xattrs.

Yes, agreed.  Actually we haven't spent much time looking at the ACL
impact, because almost all generic fses (such as ext4, XFS, btrfs,
etc.) implement ACLs.  But you could use "-o noacl" to disable it if
needed with the current codebase.

> 
>>> looked into the cause in detail, but my guess is that ls looks for the
>>> acl xattr, and such a negative lookup will cause erofs to look at all
>>> the shared xattrs for the inode, which means they all end up being
>>> loaded anyway. Of course, this will only affect ls (or other cases
>>> that read the acl), so its perhaps a bit uncommon.
>>
>> Yeah, in addition to that, I guess real acls could be landed in inlined
>> xattrs as well if exists...
> 
> Yeah, but that doesn't help with the case where they don't exist.
> 
>> BTW, if you have more interest in this way, we could get in
>> touch in a more effective way to improve EROFS in addition to
>> community emails except for the userns stuff
> 
> I don't really have time to do any real erofs specific work. These are
> just some ideas that i got looking at these results.

I don't want you guys to do any EROFS-specific work.  I just want to
confirm your real requirement (so I can improve this) and the final
goal of this discussion.

At least on my side, after a long discussion and comparison, EROFS and
composefs are very similar from many points of view, except for some
interfaces (though when EROFS was raised as an option we didn't have a
better choice for good performance, since you'd already partially
benchmarked other fses).  And since composefs doesn't implement acl
now, if you use "-o noacl" to mount EROFS it could perform even
better.  So I think there's no need to discuss the "ls -lR" stuff here
anymore; if you disagree, we could take more time to investigate this.

In other words, the EROFS on-disk format and loopback devices are not a
performance bottleneck even for the "ls -lR" workload.  We could improve
negative xattr lookups as a concrete outcome of this discussion.

Thanks,
Gao Xiang

> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-07  9:56                     ` Alexander Larsson
@ 2023-03-07 10:06                       ` Gao Xiang
  0 siblings, 0 replies; 42+ messages in thread
From: Gao Xiang @ 2023-03-07 10:06 UTC (permalink / raw)
  To: Alexander Larsson
  Cc: Jingbo Xu, lsf-pc, linux-fsdevel, Amir Goldstein,
	Christian Brauner, Giuseppe Scrivano, Dave Chinner, Vivek Goyal,
	Miklos Szeredi



On 2023/3/7 17:56, Alexander Larsson wrote:
> On Tue, Mar 7, 2023 at 10:38 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
>> On 2023/3/7 17:26, Gao Xiang wrote:
>>>
>>>
>>> On 2023/3/7 17:07, Alexander Larsson wrote:
>>>> On Tue, Mar 7, 2023 at 9:34 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2023/3/7 16:21, Alexander Larsson wrote:
>>>>>> On Mon, Mar 6, 2023 at 5:17 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>>>>>
>>>>>>>>> I tested the performance of "ls -lR" on the whole tree of
>>>>>>>>> cs9-developer-rootfs.  It seems that the performance of erofs (generated
>>>>>>>>> from mkfs.erofs) is slightly better than that of composefs.  While the
>>>>>>>>> performance of erofs generated from mkfs.composefs is slightly worse
>>>>>>>>> that that of composefs.
>>>>>>>>
>>>>>>>> I suspect that the reason for the lower performance of mkfs.composefs
>>>>>>>> is the added overlay.fs-verity xattr to all the files. It makes the
>>>>>>>> image larger, and that means more i/o.
>>>>>>>
>>>>>>> Actually you could move overlay.fs-verity to EROFS shared xattr area (or
>>>>>>> even overlay.redirect but it depends) if needed, which could save some
>>>>>>> I/Os for your workloads.
>>>>>>>
>>>>>>> shared xattrs can be used in this way as well if you care such minor
>>>>>>> difference, actually I think inlined xattrs for your workload are just
>>>>>>> meaningful for selinux labels and capabilities.
>>>>>>
>>>>>> Really? Could you expand on this, because I would think it will be
>>>>>> sort of the opposite. In my usecase, the erofs fs will be read by
>>>>>> overlayfs, which will probably access overlay.* pretty often.  At the
>>>>>> very least it will load overlay.metacopy and overlay.redirect for
>>>>>> every lookup.
>>>>>
>>>>> Really.  In that way, it will behave much similiar to composefs on-disk
>>>>> arrangement now (in composefs vdata area).
>>>>>
>>>>> Because in that way, although an extra I/O is needed for verification,
>>>>> and it can only happen when actually opening the file (so "ls -lR" is
>>>>> not impacted.) But on-disk inodes are more compact.
>>>>>
>>>>> All EROFS xattrs will be cached in memory so that accessing
>>>>> overlay.* pretty often is not greatly impacted due to no real I/Os
>>>>> (IOWs, only some CPU time is consumed).
>>>>
>>>> So, I tried moving the overlay.digest xattr to the shared area, but
>>>> actually this made the performance worse for the ls case. I have not
>>>
>>> That is much strange.  We'd like to open it up if needed.  BTW, did you
>>> test EROFS with acl enabled all the time?
>>>
>>>> looked into the cause in detail, but my guess is that ls looks for the
>>>> acl xattr, and such a negative lookup will cause erofs to look at all
>>>> the shared xattrs for the inode, which means they all end up being
>>>> loaded anyway. Of course, this will only affect ls (or other cases
>>>> that read the acl), so its perhaps a bit uncommon.
>>>
>>> Yeah, in addition to that, I guess real acls could be landed in inlined
>>> xattrs as well if exists...
>>>
>>>>
>>>> Did you ever consider putting a bloom filter in the h_reserved area of
>>>> erofs_xattr_ibody_header? Then it could return early without i/o
>>>> operations for keys that are not set for the inode. Not sure what the
>>>> computational cost of that would be though.
>>>
>>> Good idea!  Let me think about it, but enabling "noacl" mount
>>> option isn't prefered if acl is no needed in your use cases.
>>
>>             ^ is preferred.
> 
> That is probably the right approach for the composefs usecase. But
> even when you want acls, typically only just a few files have acls
> set, so it might be interesting to handle the negative acl lookup case
> more efficiently.

Let me find some time to improve this with bloom filters.  It won't be
hard, and I'd also like to improve some other parts of the on-disk
format together with this xattr enhancement.  Thanks for your input!

Thanks,
Gao Xiang

> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-03 15:13   ` Gao Xiang
  2023-03-03 17:37     ` Gao Xiang
@ 2023-03-07 10:15     ` Christian Brauner
  2023-03-07 11:03       ` Gao Xiang
                         ` (2 more replies)
  1 sibling, 3 replies; 42+ messages in thread
From: Christian Brauner @ 2023-03-07 10:15 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Alexander Larsson, lsf-pc, linux-fsdevel, Amir Goldstein,
	Jingbo Xu, Giuseppe Scrivano, Dave Chinner, Vivek Goyal,
	Miklos Szeredi

On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote:
> Hi Alexander,
> 
> On 2023/3/3 21:57, Alexander Larsson wrote:
> > On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote:
> > > 
> > > Hello,
> > > 
> > > Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
> > > Composefs filesystem. It is an opportunistically sharing, validating
> > > image-based filesystem, targeting usecases like validated ostree
> > > rootfs:es, validated container images that share common files, as well
> > > as other image based usecases.
> > > 
> > > During the discussions in the composefs proposal (as seen on LWN[3])
> > > is has been proposed that (with some changes to overlayfs), similar
> > > behaviour can be achieved by combining the overlayfs
> > > "overlay.redirect" xattr with an read-only filesystem such as erofs.
> > > 
> > > There are pros and cons to both these approaches, and the discussion
> > > about their respective value has sometimes been heated. We would like
> > > to have an in-person discussion at the summit, ideally also involving
> > > more of the filesystem development community, so that we can reach
> > > some consensus on what is the best apporach.
> > 
> > In order to better understand the behaviour and requirements of the
> > overlayfs+erofs approach I spent some time implementing direct support
> > for erofs in libcomposefs. So, with current HEAD of
> > github.com/containers/composefs you can now do:
> > 
> > $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs
> 
> Thanks you for taking time on working on EROFS support.  I don't have
> time to play with it yet since I'd like to work out erofs-utils 1.6
> these days and will work on some new stuffs such as !pagesize block
> size as I said previously.
> 
> > 
> > This will produce an object store with the backing files, and a erofs
> > file with the required overlayfs xattrs, including a made up one
> > called "overlay.fs-verity" containing the expected fs-verity digest
> > for the lower dir. It also adds the required whiteouts to cover the
> > 00-ff dirs from the lower dir.
> > 
> > These erofs files are ordered similarly to the composefs files, and we
> > give similar guarantees about their reproducibility, etc. So, they
> > should be apples-to-apples comparable with the composefs images.
> > 
> > Given this, I ran another set of performance tests on the original cs9
> > rootfs dataset, again measuring the time of `ls -lR`. I also tried to
> > measure the memory use like this:
> > 
> > # echo 3 > /proc/sys/vm/drop_caches
> > # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat
> > /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'
> > 
> > These are the alternatives I tried:
> > 
> > xfs: the source of the image, regular dir on xfs
> > erofs: the image.erofs above, on loopback
> > erofs dio: the image.erofs above, on loopback with --direct-io=on
> > ovl: erofs above combined with overlayfs
> > ovl dio: erofs dio above combined with overlayfs
> > cfs: composefs mount of image.cfs
> > 
> > All tests use the same objects dir, stored on xfs. The erofs and
> > overlay implementations are from a stock 6.1.13 kernel, and composefs
> > module is from github HEAD.
> > 
> > I tried loopback both with and without the direct-io option, because
> > without direct-io enabled the kernel will double-cache the loopbacked
> > data, as per[1].
> > 
> > The produced images are:
> >   8.9M image.cfs
> > 11.3M image.erofs
> > 
> > And gives these results:
> >             | Cold cache | Warm cache | Mem use
> >             |   (msec)   |   (msec)   |  (mb)
> > -----------+------------+------------+---------
> > xfs        |   1449     |    442     |    54
> > erofs      |    700     |    391     |    45
> > erofs dio  |    939     |    400     |    45
> > ovl        |   1827     |    530     |   130
> > ovl dio    |   2156     |    531     |   130
> > cfs        |    689     |    389     |    51
> > 
> > I also ran the same tests in a VM that had the latest kernel including
> > the lazyfollow patches (ovl lazy in the table, not using direct-io),
> > this one ext4 based:
> > 
> >             | Cold cache | Warm cache | Mem use
> >             |   (msec)   |   (msec)   |  (mb)
> > -----------+------------+------------+---------
> > ext4       |   1135     |    394     |    54
> > erofs      |    715     |    401     |    46
> > erofs dio  |    922     |    401     |    45
> > ovl        |   1412     |    515     |   148
> > ovl dio    |   1810     |    532     |   149
> > ovl lazy   |   1063     |    523     |    87
> > cfs        |    719     |    463     |    51
> > 
> > Things noticeable in the results:
> > 
> > * composefs and erofs (by itself) perform roughly  similar. This is
> >    not necessarily news, and results from Jingbo Xu match this.
> > 
> > * Erofs on top of direct-io enabled loopback causes quite a drop in
> >    performance, which I don't really understand. Especially since its
> >    reporting the same memory use as non-direct io. I guess the
> >    double-cacheing in the later case isn't properly attributed to the
> >    cgroup so the difference is not measured. However, why would the
> >    double cache improve performance?  Maybe I'm not completely
> >    understanding how these things interact.
> 
> We've already analysed the root cause of composefs is that composefs
> uses a kernel_read() to read its path while irrelevant metadata
> (such as dir data) is read together.  Such heuristic readahead is a
> unusual stuff for all local fses (obviously almost all in-kernel
> filesystems don't use kernel_read() to read their metadata. Although
> some filesystems could readahead some related extent metadata when
> reading inode, they at least does _not_ work as kernel_read().) But
> double caching will introduce almost the same impact as kernel_read()
> (assuming you read some source code of loop device.)
> 
> I do hope you already read what Jingbo's latest test results, and that
> test result shows how bad readahead performs if fs metadata is
> partially randomly used (stat < 1500 files):
> https://lore.kernel.org/r/83829005-3f12-afac-9d05-8ba721a80b4d@linux.alibaba.com
> 
> Also you could explicitly _disable_ readahead for composefs
> manifiest file (because all EROFS metadata read is without
> readahead), and let's see how it works then.
> 
> Again, if your workload is just "ls -lR".  My answer is "just async
> readahead the whole manifest file / loop device together" when
> mounting.  That will give the best result to you.  But I'm not sure
> that is the real use case you propose.
> 
> > 
> > * Stacking overlay on top of erofs causes about 100msec slower
> >    warm-cache times compared to all non-overlay approaches, and much
> >    more in the cold cache case. The cold cache performance is helped
> >    significantly by the lazyfollow patches, but the warm cache overhead
> >    remains.
> > 
> > * The use of overlayfs more than doubles memory use, probably
> >    because of all the extra inodes and dentries in action for the
> >    various layers. The lazyfollow patches helps, but only partially.
> > 
> > * Even though overlayfs+erofs is slower than cfs and raw erofs, it is
> >    not that much slower (~25%) than the pure xfs/ext4 directory, which
> >    is a pretty good baseline for comparisons. It is even faster when
> >    using lazyfollow on ext4.
> > 
> > * The erofs images are slightly larger than the equivalent composefs
> >    image.
> > 
> > In summary: The performance of composefs is somewhat better than the
> > best erofs+ovl combination, although the overlay approach is not
> > significantly worse than the baseline of a regular directory, except
> > that it uses a bit more memory.
> > 
> > On top of the above pure performance based comparisons I would like to
> > re-state some of the other advantages of composefs compared to the
> > overlay approach:
> > 
> > * composefs is namespaceable, in the sense that you can use it (given
> >    mount capabilities) inside a namespace (such as a container) without
> >    access to non-namespaced resources like loopback or device-mapper
> >    devices. (There was work on fixing this with loopfs, but that seems
> >    to have stalled.)
> > 
> > * While it is not in the current design, the simplicity of the format
> >    and lack of loopback makes it at least theoretically possible that
> >    composefs can be made usable in a rootless fashion at some point in
> >    the future.
> Do you consider sending some commands to /dev/cachefiles to configure
> a daemonless dir and mount erofs image directly by using "erofs over
> fscache" but in a daemonless way?  That is an ongoing stuff on our side.
> 
> IMHO, I don't think file-based interfaces are quite a charmful stuff.
> Historically I recalled some practice is to "avoid directly reading
> files in kernel" so that I think almost all local fses don't work on
> files directl and loopback devices are all the ways for these use
> cases.  If loopback devices are not okay to you, how about improving
> loopback devices and that will benefit to almost all local fses.
> 
> > 
> > And of course, there are disadvantages to composefs too. Primarily
> > being more code, increasing maintenance burden and risk of security
> > problems. Composefs is particularly burdensome because it is a
> > stacking filesystem and these have historically been shown to be hard
> > to get right.
> > 
> > 
> > The question now is what is the best approach overall? For my own
> > primary usecase of making a verifying ostree root filesystem, the
> > overlay approach (with the lazyfollow work finished) is, while not
> > ideal, good enough.
> 
> So your judgement is still "ls -lR" and your use case is still just
> pure read-only and without writable stuff?
> 
> Anyway, I'm really happy to work with you on your ostree use cases
> as always, as long as all corner cases work out by the community.
> 
> > 
> > But I know for the people who are more interested in using composefs
> > for containers the eventual goal of rootless support is very
> > important. So, on behalf of them I guess the question is: Is there
> > ever any chance that something like composefs could work rootlessly?
> > Or conversely: Is there some way to get rootless support from the
> > overlay approach? Opinions? Ideas?
> 
> Honestly, I do want to get a proper answer when Giuseppe asked me
> the same question.  My current view is simply "that question is
> almost the same for all in-kernel fses with some on-disk format".

As far as I'm concerned, filesystems with an on-disk format will not be
made mountable by unprivileged containers. And I don't think I'm alone
in that view. The idea that ever more parts of the kernel with a massive
attack surface, such as a filesystem, need to vouch for their safety in
the face of every rando having access to
unshare --mount --user --map-root is a dead end and will just end up
trapping us in a neverending cycle of security bugs. (Every single bug
that's found after making that fs mountable from an unprivileged
container will be treated as a security bug no matter whether it's
justified or not. So this is also a good way to ruin your filesystem's
reputation.)

And honestly, if we set the precedent that it's fine for one filesystem
with an on-disk format to be able to be mounted by unprivileged
containers, then other filesystems will eventually want to do this as well.

At the rate we currently add filesystems that's just a matter of time
even if none of the existing ones would also want to do it. And then
we're left arguing that this was just an exception for one super
special, super safe, unexploitable filesystem with an on-disk format.

Imho, none of this is appealing. I don't want to slowly keep building a
future where we end up running fuzzers in unprivileged containers to
generate random images to crash the kernel.

I have more arguments why I don't think this is a path we will ever go down
but I don't want this to detract from the legitimate ask of making it
possible to mount trusted images from within unprivileged containers.
Because I think that's perfectly legitimate.

However, I don't think that this is something the kernel needs to solve
other than providing the necessary infrastructure so that this can be
solved in userspace.

Off-list, Amir had pointed to a blog post I wrote last week (cf. [1])
where I explained how we currently mount into mount namespaces of
unprivileged containers, which had been quite a difficult problem before
the new mount api. But now it's become almost comically trivial. I mean,
there's stuff that will still be good to have but overall all the bits
are already there.

Imho, delegated mounting should be done by a system service that is
responsible for all the steps that require privileges. So for most
filesystems not mountable by unprivileged users this would amount to:

fd_fs = fsopen("xfs")
fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm")
fsconfig(FSCONFIG_CMD_CREATE)
fd_mnt = fsmount(fd_fs)
// Only required for attributes that require privileges against the sb
// of the filesystem such as idmapped mounts
mount_setattr(fd_mnt, ...)

and then the fd_mnt can be sent to the container, which can then attach
it wherever it wants to. The system level service doesn't even need to
change namespaces via setns(fd_userns|fd_mntns) like I illustrated in
the post I did. It's sufficient to send it via an AF_UNIX socket that's
exposed to the container, for example.
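
To make this concrete, here is a rough sketch of both halves. This is
just an illustration (the helper names are made up), assuming a libc
that exposes the new mount api wrappers (glibc >= 2.36); error handling,
the socket setup and any policy checks are elided:

#define _GNU_SOURCE
#include <fcntl.h>      /* AT_FDCWD */
#include <string.h>
#include <sys/mount.h>  /* fsopen(), fsconfig(), fsmount(), move_mount() */
#include <sys/socket.h>
#include <sys/uio.h>

/* Privileged service: create a detached mount, never attach it here. */
static void send_mount_fd(int sock)
{
	int fd_fs = fsopen("xfs", FSOPEN_CLOEXEC);
	fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/sm/sm", 0);
	fsconfig(fd_fs, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
	int fd_mnt = fsmount(fd_fs, FSMOUNT_CLOEXEC, 0);

	/* Ship the detached mount fd over AF_UNIX via SCM_RIGHTS. */
	char dummy = 'm';
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	memset(&u, 0, sizeof(u));
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type  = SCM_RIGHTS;
	cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd_mnt, sizeof(int));
	sendmsg(sock, &msg, 0);
}

/* Unprivileged container: receive fd_mnt (not shown) and attach it
 * wherever it likes in its own mount namespace. */
static void attach_mount_fd(int fd_mnt)
{
	move_mount(fd_mnt, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
}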

Of course, this system level service would be integrated with mount(8)
directly over a well-defined protocol. And this would be nestable as
well by e.g., bind-mounting the AF_UNIX socket.

And we do already support a rudimentary form of such integration through
systemd. For example via mount -t ddi (cf. [2]) which makes it possible
to mount discoverable disk images (ddi). But that's just an
illustration. 

This should be integrated with mount(8) and should be a simple protocol
over varlink or another lightweight IPC mechanism that can be
implemented by systemd-mountd (the name I coined for lack of
imagination when I came up with this) or by some other component if
platforms like k8s really want to do their own thing.

This also allows us to extend this feature to the whole system btw and
to all filesystems at once. Because it means that if systemd-mountd is
told what images to trust (based on location, from a specific registry,
signature, or whatever) then this isn't just useful for unprivileged
containers but also for regular users on the host that want to mount
stuff.

This is what we're currently working on.
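
Just to illustrate what such a trust policy could look like (and this is
purely an assumption on my part, not something systemd-mountd actually
implements), a service could, for example, compare an image file's
fs-verity digest against an allow-list before it ever calls fsopen():

#include <fcntl.h>
#include <linux/fsverity.h>  /* struct fsverity_digest, FS_IOC_MEASURE_VERITY */
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define MAX_DIGEST_SIZE 64  /* big enough for SHA-512 */

/* Returns true if the image at @path has fs-verity enabled and its
 * measured digest matches the expected digest @want of length @want_len. */
static bool image_digest_matches(const char *path,
				 const unsigned char *want, size_t want_len)
{
	bool ok = false;
	int fd = open(path, O_RDONLY | O_CLOEXEC);
	struct fsverity_digest *d = malloc(sizeof(*d) + MAX_DIGEST_SIZE);

	if (fd >= 0 && d) {
		d->digest_size = MAX_DIGEST_SIZE;
		if (ioctl(fd, FS_IOC_MEASURE_VERITY, d) == 0)
			ok = d->digest_size == want_len &&
			     memcmp(d->digest, want, want_len) == 0;
	}
	free(d);
	if (fd >= 0)
		close(fd);
	return ok;
}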

(There's stuff that we can do to make this more powerful __if__ we need
to. One example would probably be that we _could_ make it possible to
mark a superblock as being owned by a specific namespace, with similar
permission checks to what we currently do for idmapped mounts
(privileged in the superblock of the fs, privileged over the ns to
delegate to, etc.). IOW,

fd_fs = fsopen("xfs")
fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm")
fsconfig(FSCONFIG_SET_FD, "owner", fd_container_userns)

which completely sidesteps the issue of making that on-disk filesystem
mountable by unpriv users.

But let me say that this is completely unnecessary today as you can do:

fd_fs = fsopen("xfs")
fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm")
fsconfig(FSCONFIG_CMD_CREATE)
fd_mnt = fsmount(fd_fs)
mount_setattr(fd_mnt, MOUNT_ATTR_IDMAP)

which changes ownership across the whole filesystem. The only time you
really want what I mention here is if you want to delegate control over
__every single ioctl and potentially destructive operation associated
with that filesystem__ to an unprivileged container, which is almost
never what you want.)
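
For completeness, the mount_setattr(fd_mnt, MOUNT_ATTR_IDMAP) shorthand
above spelled out - a sketch, again assuming the glibc >= 2.36 wrappers,
where fd_userns is assumed to be an fd referring to the container's user
namespace (e.g. opened from /proc/<pid>/ns/user):

struct mount_attr attr = {
	.attr_set  = MOUNT_ATTR_IDMAP,
	.userns_fd = fd_userns,
};
mount_setattr(fd_mnt, "", AT_EMPTY_PATH, &attr, sizeof(attr));

The mount still has to be detached (i.e. straight from fsmount() and not
yet attached anywhere) for this to succeed.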

[1]: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html
[2]: https://github.com/systemd/systemd/pull/26695

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-07 10:15     ` Christian Brauner
@ 2023-03-07 11:03       ` Gao Xiang
  2023-03-07 12:09       ` Alexander Larsson
  2023-03-07 13:38       ` Jeff Layton
  2 siblings, 0 replies; 42+ messages in thread
From: Gao Xiang @ 2023-03-07 11:03 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Larsson, lsf-pc, linux-fsdevel, Amir Goldstein,
	Jingbo Xu, Giuseppe Scrivano, Dave Chinner, Vivek Goyal,
	Miklos Szeredi

Hi Christian,

On 2023/3/7 18:15, Christian Brauner wrote:
> On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote:
>> Hi Alexander,
>>
>> On 2023/3/3 21:57, Alexander Larsson wrote:
>>> On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote:
>>>>
>>>> Hello,
>>>>
>>>> Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
>>>> Composefs filesystem. It is an opportunistically sharing, validating
>>>> image-based filesystem, targeting usecases like validated ostree
>>>> rootfs:es, validated container images that share common files, as well
>>>> as other image based usecases.
>>>>
>>>> During the discussions in the composefs proposal (as seen on LWN[3])
>>>> is has been proposed that (with some changes to overlayfs), similar
>>>> behaviour can be achieved by combining the overlayfs
>>>> "overlay.redirect" xattr with an read-only filesystem such as erofs.
>>>>
>>>> There are pros and cons to both these approaches, and the discussion
>>>> about their respective value has sometimes been heated. We would like
>>>> to have an in-person discussion at the summit, ideally also involving
>>>> more of the filesystem development community, so that we can reach
>>>> some consensus on what is the best apporach.
>>>
>>> In order to better understand the behaviour and requirements of the
>>> overlayfs+erofs approach I spent some time implementing direct support
>>> for erofs in libcomposefs. So, with current HEAD of
>>> github.com/containers/composefs you can now do:
>>>
>>> $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs
>>
>> Thanks you for taking time on working on EROFS support.  I don't have
>> time to play with it yet since I'd like to work out erofs-utils 1.6
>> these days and will work on some new stuffs such as !pagesize block
>> size as I said previously.
>>
>>>
>>> This will produce an object store with the backing files, and a erofs
>>> file with the required overlayfs xattrs, including a made up one
>>> called "overlay.fs-verity" containing the expected fs-verity digest
>>> for the lower dir. It also adds the required whiteouts to cover the
>>> 00-ff dirs from the lower dir.
>>>
>>> These erofs files are ordered similarly to the composefs files, and we
>>> give similar guarantees about their reproducibility, etc. So, they
>>> should be apples-to-apples comparable with the composefs images.
>>>
>>> Given this, I ran another set of performance tests on the original cs9
>>> rootfs dataset, again measuring the time of `ls -lR`. I also tried to
>>> measure the memory use like this:
>>>
>>> # echo 3 > /proc/sys/vm/drop_caches
>>> # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat
>>> /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'
>>>
>>> These are the alternatives I tried:
>>>
>>> xfs: the source of the image, regular dir on xfs
>>> erofs: the image.erofs above, on loopback
>>> erofs dio: the image.erofs above, on loopback with --direct-io=on
>>> ovl: erofs above combined with overlayfs
>>> ovl dio: erofs dio above combined with overlayfs
>>> cfs: composefs mount of image.cfs
>>>
>>> All tests use the same objects dir, stored on xfs. The erofs and
>>> overlay implementations are from a stock 6.1.13 kernel, and composefs
>>> module is from github HEAD.
>>>
>>> I tried loopback both with and without the direct-io option, because
>>> without direct-io enabled the kernel will double-cache the loopbacked
>>> data, as per[1].
>>>
>>> The produced images are:
>>>    8.9M image.cfs
>>> 11.3M image.erofs
>>>
>>> And gives these results:
>>>              | Cold cache | Warm cache | Mem use
>>>              |   (msec)   |   (msec)   |  (mb)
>>> -----------+------------+------------+---------
>>> xfs        |   1449     |    442     |    54
>>> erofs      |    700     |    391     |    45
>>> erofs dio  |    939     |    400     |    45
>>> ovl        |   1827     |    530     |   130
>>> ovl dio    |   2156     |    531     |   130
>>> cfs        |    689     |    389     |    51
>>>
>>> I also ran the same tests in a VM that had the latest kernel including
>>> the lazyfollow patches (ovl lazy in the table, not using direct-io),
>>> this one ext4 based:
>>>
>>>              | Cold cache | Warm cache | Mem use
>>>              |   (msec)   |   (msec)   |  (mb)
>>> -----------+------------+------------+---------
>>> ext4       |   1135     |    394     |    54
>>> erofs      |    715     |    401     |    46
>>> erofs dio  |    922     |    401     |    45
>>> ovl        |   1412     |    515     |   148
>>> ovl dio    |   1810     |    532     |   149
>>> ovl lazy   |   1063     |    523     |    87
>>> cfs        |    719     |    463     |    51
>>>
>>> Things noticeable in the results:
>>>
>>> * composefs and erofs (by itself) perform roughly  similar. This is
>>>     not necessarily news, and results from Jingbo Xu match this.
>>>
>>> * Erofs on top of direct-io enabled loopback causes quite a drop in
>>>     performance, which I don't really understand. Especially since its
>>>     reporting the same memory use as non-direct io. I guess the
>>>     double-cacheing in the later case isn't properly attributed to the
>>>     cgroup so the difference is not measured. However, why would the
>>>     double cache improve performance?  Maybe I'm not completely
>>>     understanding how these things interact.
>>
>> We've already analysed the root cause of composefs is that composefs
>> uses a kernel_read() to read its path while irrelevant metadata
>> (such as dir data) is read together.  Such heuristic readahead is a
>> unusual stuff for all local fses (obviously almost all in-kernel
>> filesystems don't use kernel_read() to read their metadata. Although
>> some filesystems could readahead some related extent metadata when
>> reading inode, they at least does _not_ work as kernel_read().) But
>> double caching will introduce almost the same impact as kernel_read()
>> (assuming you read some source code of loop device.)
>>
>> I do hope you already read what Jingbo's latest test results, and that
>> test result shows how bad readahead performs if fs metadata is
>> partially randomly used (stat < 1500 files):
>> https://lore.kernel.org/r/83829005-3f12-afac-9d05-8ba721a80b4d@linux.alibaba.com
>>
>> Also you could explicitly _disable_ readahead for composefs
>> manifiest file (because all EROFS metadata read is without
>> readahead), and let's see how it works then.
>>
>> Again, if your workload is just "ls -lR".  My answer is "just async
>> readahead the whole manifest file / loop device together" when
>> mounting.  That will give the best result to you.  But I'm not sure
>> that is the real use case you propose.
>>
>>>
>>> * Stacking overlay on top of erofs causes about 100msec slower
>>>     warm-cache times compared to all non-overlay approaches, and much
>>>     more in the cold cache case. The cold cache performance is helped
>>>     significantly by the lazyfollow patches, but the warm cache overhead
>>>     remains.
>>>
>>> * The use of overlayfs more than doubles memory use, probably
>>>     because of all the extra inodes and dentries in action for the
>>>     various layers. The lazyfollow patches helps, but only partially.
>>>
>>> * Even though overlayfs+erofs is slower than cfs and raw erofs, it is
>>>     not that much slower (~25%) than the pure xfs/ext4 directory, which
>>>     is a pretty good baseline for comparisons. It is even faster when
>>>     using lazyfollow on ext4.
>>>
>>> * The erofs images are slightly larger than the equivalent composefs
>>>     image.
>>>
>>> In summary: The performance of composefs is somewhat better than the
>>> best erofs+ovl combination, although the overlay approach is not
>>> significantly worse than the baseline of a regular directory, except
>>> that it uses a bit more memory.
>>>
>>> On top of the above pure performance based comparisons I would like to
>>> re-state some of the other advantages of composefs compared to the
>>> overlay approach:
>>>
>>> * composefs is namespaceable, in the sense that you can use it (given
>>>     mount capabilities) inside a namespace (such as a container) without
>>>     access to non-namespaced resources like loopback or device-mapper
>>>     devices. (There was work on fixing this with loopfs, but that seems
>>>     to have stalled.)
>>>
>>> * While it is not in the current design, the simplicity of the format
>>>     and lack of loopback makes it at least theoretically possible that
>>>     composefs can be made usable in a rootless fashion at some point in
>>>     the future.
>> Do you consider sending some commands to /dev/cachefiles to configure
>> a daemonless dir and mount erofs image directly by using "erofs over
>> fscache" but in a daemonless way?  That is an ongoing stuff on our side.
>>
>> IMHO, I don't think file-based interfaces are quite a charmful stuff.
>> Historically I recalled some practice is to "avoid directly reading
>> files in kernel" so that I think almost all local fses don't work on
>> files directl and loopback devices are all the ways for these use
>> cases.  If loopback devices are not okay to you, how about improving
>> loopback devices and that will benefit to almost all local fses.
>>
>>>
>>> And of course, there are disadvantages to composefs too. Primarily
>>> being more code, increasing maintenance burden and risk of security
>>> problems. Composefs is particularly burdensome because it is a
>>> stacking filesystem and these have historically been shown to be hard
>>> to get right.
>>>
>>>
>>> The question now is what is the best approach overall? For my own
>>> primary usecase of making a verifying ostree root filesystem, the
>>> overlay approach (with the lazyfollow work finished) is, while not
>>> ideal, good enough.
>>
>> So your judgement is still "ls -lR" and your use case is still just
>> pure read-only and without writable stuff?
>>
>> Anyway, I'm really happy to work with you on your ostree use cases
>> as always, as long as all corner cases work out by the community.
>>
>>>
>>> But I know for the people who are more interested in using composefs
>>> for containers the eventual goal of rootless support is very
>>> important. So, on behalf of them I guess the question is: Is there
>>> ever any chance that something like composefs could work rootlessly?
>>> Or conversely: Is there some way to get rootless support from the
>>> overlay approach? Opinions? Ideas?
>>
>> Honestly, I do want to get a proper answer when Giuseppe asked me
>> the same question.  My current view is simply "that question is
>> almost the same for all in-kernel fses with some on-disk format".
> 
> As far as I'm concerned filesystems with on-disk format will not be made
> mountable by unprivileged containers. And I don't think I'm alone in
> that view. The idea that ever more parts of the kernel with a massive
> attack surface such as a filesystem need to vouchesafe for the safety in
> the face of every rando having access to
> unshare --mount --user --map-root is a dead end and will just end up
> trapping us in a neverending cycle of security bugs (Because every
> single bug that's found after making that fs mountable from an
> unprivileged container will be treated as a security bug no matter if
> justified or not. So this is also a good way to ruin your filesystem's
> reputation.).
> 
> And honestly, if we set the precedent that it's fine for one filesystem
> with an on-disk format to be able to be mounted by unprivileged
> containers then other filesystems eventually want to do this as well.
> 
> At the rate we currently add filesystems that's just a matter of time
> even if none of the existing ones would also want to do it. And then
> we're left arguing that this was just an exception for one super
> special, super safe, unexploitable filesystem with an on-disk format.

Yes, +1.  That's partly why I didn't answer immediately: I'd like to
find a chance to get more people interested in EROFS, so I was hoping
this point would (also) be made by other filesystem folks.

> 
> Imho, none of this is appealing. I don't want to slowly keep building a
> future where we end up running fuzzers in unprivileged container to
> generate random images to crash the kernel.

Even fuzzers don't guarantee this unless we completely freeze the fs
code; otherwise any useful improvement will, in principle, need much
deeper and longer fuzzing.  Honestly, I'm not sure it could even keep
up with the release schedule, let alone prove the code bug-free.

> 
> I have more arguments why I don't think is a path we will ever go down
> but I don't want this to detract from the legitimate ask of making it
> possible to mount trusted images from within unprivileged containers.
> Because I think that's perfectly legitimate.
> 
> However, I don't think that this is something the kernel needs to solve
> other than providing the necessary infrastructure so that this can be
> solved in userspace.

Yes, I think that's the right principle, as long as we have a way to do
things in userspace effectively.

> 
> Off-list, Amir had pointed to a blog I wrote last week (cf. [1]) where I
> explained how we currently mount into mount namespaces of unprivileged
> cotainers which had been quite a difficult problem before the new mount
> api. But now it's become almost comically trivial. I mean, there's stuff
> that will still be good to have but overall all the bits are already
> there.
> 
> Imho, delegated mounting should be done by a system service that is
> responsible for all the steps that require privileges. So for most
> filesytems not mountable by unprivileged user this would amount to:
> 
> fd_fs = fsopen("xfs")
> fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm")
> fsconfig(FSCONFIG_CMD_CREATE)
> fd_mnt = fsmount(fd_fs)
> // Only required for attributes that require privileges against the sb
> // of the filesystem such as idmapped mounts
> mount_setattr(fd_mnt, ...)
> 
> and then the fd_mnt can be sent to the container which can then attach
> it wherever it wants to. The system level service doesn't even need to
> change namespaces via setns(fd_userns|fd_mntns) like I illustrated in
> the post I did. It's sufficient if we sent it via AF_UNIX for example
> that's exposed to the container.
> 
> Of course, this system level service would be integrated with mount(8)
> directly over a well-defined protocol. And this would be nestable as
> well by e.g., bind-mounting the AF_UNIX socket.
> 
> And we do already support a rudimentary form of such integration through
> systemd. For example via mount -t ddi (cf. [2]) which makes it possible
> to mount discoverable disk images (ddi). But that's just an
> illustration.
> 
> This should be integrated with mount(8) and should be a simply protocol
> over varlink or another lightweight ipc mechanism that can be
> implemented by systemd-mountd (which is how I coined this for lack of
> imagination when I came up with this) or by some other component if
> platforms like k8s really want to do their own thing.
> 
> This also allows us to extend this feature to the whole system btw and
> to all filesystems at once. Because it means that if systemd-mountd is
> told what images to trust (based on location, from a specific registry,
> signature, or whatever) then this isn't just useful for unprivileged
> containers but also for regular users on the host that want to mount
> stuff.
> 
> This is what we're currently working on.
> 
> (There's stuff that we can do to make this more powerful __if__ we need
> to. One example would probably that we _could_ make it possible to mark
> a superblock as being owned by a specific namespace with similar
> permission checks as what we currently do for idmapped mounts
> (privileged in the superblock of the fs, privileged over the ns to
> delegate to etc). IOW,
> 
> fd_fs = fsopen("xfs")
> fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm")
> fsconfig(FSCONFIG_SET_FD, "owner", fd_container_userns)
> 
> which completely sidesteps the issue of making that on-disk filesystem
> mountable by unpriv users.
> 
> But let me say that this is completely unnecessary today as you can do:
> 
> fd_fs = fsopen("xfs")
> fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm")
> fsconfig(FSCONFIG_CMD_CREATE)
> fd_mnt = fsmount(fd_fs)
> mount_setattr(fd_mnt, MOUNT_ATTR_IDMAP)
> 
> which changes ownership across the whole filesystem. The only time you
> really want what I mention here is if you want to delegate control over
> __every single ioctl and potentially destructive operation associated
> with that filesystem__ to an unprivileged container which is almost
> never what you want.)

Good to know this.  I do hope it can be resolved by the userspace
approach as you said.  So is there some barrier to doing it like this,
such that we would still have to bother with FS_USERNS_MOUNT for fses
with an on-disk format?  The delegated control you describe is useful
at least on my side, and we hope some system-wide service can help with
this since our cloud might need it in the future as well.

Thanks,
Gao Xiang

> 
> [1]: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html
> [2]: https://github.com/systemd/systemd/pull/26695

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-07 10:15     ` Christian Brauner
  2023-03-07 11:03       ` Gao Xiang
@ 2023-03-07 12:09       ` Alexander Larsson
  2023-03-07 12:55         ` Gao Xiang
  2023-03-07 15:16         ` Christian Brauner
  2023-03-07 13:38       ` Jeff Layton
  2 siblings, 2 replies; 42+ messages in thread
From: Alexander Larsson @ 2023-03-07 12:09 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Gao Xiang, lsf-pc, linux-fsdevel, Amir Goldstein, Jingbo Xu,
	Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi

On Tue, Mar 7, 2023 at 11:16 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote:
> > Hi Alexander,
> >
> > On 2023/3/3 21:57, Alexander Larsson wrote:
> > > On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote:

> > > But I know for the people who are more interested in using composefs
> > > for containers the eventual goal of rootless support is very
> > > important. So, on behalf of them I guess the question is: Is there
> > > ever any chance that something like composefs could work rootlessly?
> > > Or conversely: Is there some way to get rootless support from the
> > > overlay approach? Opinions? Ideas?
> >
> > Honestly, I do want to get a proper answer when Giuseppe asked me
> > the same question.  My current view is simply "that question is
> > almost the same for all in-kernel fses with some on-disk format".
>
> As far as I'm concerned filesystems with on-disk format will not be made
> mountable by unprivileged containers. And I don't think I'm alone in
> that view. The idea that ever more parts of the kernel with a massive
> attack surface such as a filesystem need to vouchesafe for the safety in
> the face of every rando having access to
> unshare --mount --user --map-root is a dead end and will just end up
> trapping us in a neverending cycle of security bugs (Because every
> single bug that's found after making that fs mountable from an
> unprivileged container will be treated as a security bug no matter if
> justified or not. So this is also a good way to ruin your filesystem's
> reputation.).
>
> And honestly, if we set the precedent that it's fine for one filesystem
> with an on-disk format to be able to be mounted by unprivileged
> containers then other filesystems eventually want to do this as well.
>
> At the rate we currently add filesystems that's just a matter of time
> even if none of the existing ones would also want to do it. And then
> we're left arguing that this was just an exception for one super
> special, super safe, unexploitable filesystem with an on-disk format.
>
> Imho, none of this is appealing. I don't want to slowly keep building a
> future where we end up running fuzzers in unprivileged container to
> generate random images to crash the kernel.
>
> I have more arguments why I don't think is a path we will ever go down
> but I don't want this to detract from the legitimate ask of making it
> possible to mount trusted images from within unprivileged containers.
> Because I think that's perfectly legitimate.
>
> However, I don't think that this is something the kernel needs to solve
> other than providing the necessary infrastructure so that this can be
> solved in userspace.

So, I completely understand this point of view. And, since I'm not
really hearing any other viewpoint from the linux vfs developers, it
seems to be a shared opinion. So, it seems like further work on the
kernel side of composefs isn't really useful anymore, and I will focus
my work on the overlayfs side. Maybe we can even drop the summit topic
to avoid a bunch of unnecessary travel?

That said, even though I understand (and even agree with) your
worries, I feel it is kind of unfortunate that we end up with
(essentially) a setuid helper approach for this. Because it feels like
we're giving up on a useful feature (trustless unprivileged mounts)
that the kernel could *theoretically* deliver, but a setuid helper
can't. Sure, if you have a closed system you can limit what images can
get mounted to images signed by a trusted key, but it won't work well
for things like user-built images or publicly available images.
Unfortunately practicalities kinda outweigh theoretical advantages.

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                Red Hat, Inc
       alexl@redhat.com         alexander.larsson@gmail.com


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-07 12:09       ` Alexander Larsson
@ 2023-03-07 12:55         ` Gao Xiang
  2023-03-07 15:16         ` Christian Brauner
  1 sibling, 0 replies; 42+ messages in thread
From: Gao Xiang @ 2023-03-07 12:55 UTC (permalink / raw)
  To: Alexander Larsson, Christian Brauner
  Cc: lsf-pc, linux-fsdevel, Amir Goldstein, Jingbo Xu,
	Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi



On 2023/3/7 20:09, Alexander Larsson wrote:
> On Tue, Mar 7, 2023 at 11:16 AM Christian Brauner <brauner@kernel.org> wrote:
>>
>> On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote:
>>> Hi Alexander,
>>>
>>> On 2023/3/3 21:57, Alexander Larsson wrote:
>>>> On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote:
> 
>>>> But I know for the people who are more interested in using composefs
>>>> for containers the eventual goal of rootless support is very
>>>> important. So, on behalf of them I guess the question is: Is there
>>>> ever any chance that something like composefs could work rootlessly?
>>>> Or conversely: Is there some way to get rootless support from the
>>>> overlay approach? Opinions? Ideas?
>>>
>>> Honestly, I do want to get a proper answer when Giuseppe asked me
>>> the same question.  My current view is simply "that question is
>>> almost the same for all in-kernel fses with some on-disk format".
>>
>> As far as I'm concerned filesystems with on-disk format will not be made
>> mountable by unprivileged containers. And I don't think I'm alone in
>> that view. The idea that ever more parts of the kernel with a massive
>> attack surface such as a filesystem need to vouchesafe for the safety in
>> the face of every rando having access to
>> unshare --mount --user --map-root is a dead end and will just end up
>> trapping us in a neverending cycle of security bugs (Because every
>> single bug that's found after making that fs mountable from an
>> unprivileged container will be treated as a security bug no matter if
>> justified or not. So this is also a good way to ruin your filesystem's
>> reputation.).
>>
>> And honestly, if we set the precedent that it's fine for one filesystem
>> with an on-disk format to be able to be mounted by unprivileged
>> containers then other filesystems eventually want to do this as well.
>>
>> At the rate we currently add filesystems that's just a matter of time
>> even if none of the existing ones would also want to do it. And then
>> we're left arguing that this was just an exception for one super
>> special, super safe, unexploitable filesystem with an on-disk format.
>>
>> Imho, none of this is appealing. I don't want to slowly keep building a
>> future where we end up running fuzzers in unprivileged container to
>> generate random images to crash the kernel.
>>
>> I have more arguments why I don't think is a path we will ever go down
>> but I don't want this to detract from the legitimate ask of making it
>> possible to mount trusted images from within unprivileged containers.
>> Because I think that's perfectly legitimate.
>>
>> However, I don't think that this is something the kernel needs to solve
>> other than providing the necessary infrastructure so that this can be
>> solved in userspace.
> 
> So, I completely understand this point of view. And, since I'm not
> really hearing any other viewpoint from the linux vfs developers it
> seems to be a shared opinion. So, it seems like further work on the
> kernel side of composefs isn't really useful anymore, and I will focus
> my work on the overlayfs side. Maybe we can even drop the summit topic
> to avoid a bunch of unnecessary travel?
I am still looking forward to seeing you there, since I'd like to
devote my time to anything that could make EROFS better and more
useful (I'm always active in the Linux FS community).  Even if you
folks finally decide not to give EROFS a chance, I'm still happy to
get your further input, since I think an immutable filesystem can
serve the whole Linux ecosystem better than the current status quo.

I'm very sorry that I didn't have a chance to go to FOSDEM 23 due to
unexpected visa issues for travelling to Belgium at that time.

> 
> That said, even though I understand (and even agree) with your
> worries, I feel it is kind of unfortunate that we end up with
> (essentially) a setuid helper approach for this. Because it feels like
> we're giving up on a useful feature (trustless unprivileged mounts)
> that the kernel could *theoretically* deliver, but a setuid helper
> can't. Sure, if you have a closed system you can limit what images can
> get mounted to images signed by a trusted key, but it won't work well
> for things like user built images or publically available images.
> Unfortunately practicalities kinda outweigh theoretical advantages.

In principle, I think _trusted_ unprivileged mounts in the kernel
could be done to some degree.  But before that, it first needs very
hard proof of why userspace cannot do this.  As long as there is any
chance of doing it effectively in userspace, it becomes another story.

I'm somewhat against untrusted unprivileged mounts of actual on-disk
formats, and always have been.  Why?  Simply because FUSE is a pure,
simple protocol, and overlayfs uses very limited xattrs without
on-disk data.

If we have some filesystem with on-disk data, a problem inside it
causes not only panics but also deadlocks, livelocks, DoS, or even
corrupted memory due to the on-disk format.  In principle, we could
freeze all the code without any feature enhancement, but that is hard
in practice since users want useful new on-disk features all the time.

Thanks,
Gao Xiang

> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-07 10:15     ` Christian Brauner
  2023-03-07 11:03       ` Gao Xiang
  2023-03-07 12:09       ` Alexander Larsson
@ 2023-03-07 13:38       ` Jeff Layton
  2023-03-08 10:37         ` Christian Brauner
  2 siblings, 1 reply; 42+ messages in thread
From: Jeff Layton @ 2023-03-07 13:38 UTC (permalink / raw)
  To: Christian Brauner, Gao Xiang
  Cc: Alexander Larsson, lsf-pc, linux-fsdevel, Amir Goldstein,
	Jingbo Xu, Giuseppe Scrivano, Dave Chinner, Vivek Goyal,
	Miklos Szeredi

On Tue, 2023-03-07 at 11:15 +0100, Christian Brauner wrote:
> On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote:
> > Hi Alexander,
> > 
> > On 2023/3/3 21:57, Alexander Larsson wrote:
> > > On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote:
> > > > 
> > > > Hello,
> > > > 
> > > > Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
> > > > Composefs filesystem. It is an opportunistically sharing, validating
> > > > image-based filesystem, targeting usecases like validated ostree
> > > > rootfs:es, validated container images that share common files, as well
> > > > as other image based usecases.
> > > > 
> > > > During the discussions in the composefs proposal (as seen on LWN[3])
> > > > is has been proposed that (with some changes to overlayfs), similar
> > > > behaviour can be achieved by combining the overlayfs
> > > > "overlay.redirect" xattr with an read-only filesystem such as erofs.
> > > > 
> > > > There are pros and cons to both these approaches, and the discussion
> > > > about their respective value has sometimes been heated. We would like
> > > > to have an in-person discussion at the summit, ideally also involving
> > > > more of the filesystem development community, so that we can reach
> > > > some consensus on what is the best apporach.
> > > 
> > > In order to better understand the behaviour and requirements of the
> > > overlayfs+erofs approach I spent some time implementing direct support
> > > for erofs in libcomposefs. So, with current HEAD of
> > > github.com/containers/composefs you can now do:
> > > 
> > > $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs
> > 
> > Thanks you for taking time on working on EROFS support.  I don't have
> > time to play with it yet since I'd like to work out erofs-utils 1.6
> > these days and will work on some new stuffs such as !pagesize block
> > size as I said previously.
> > 
> > > 
> > > This will produce an object store with the backing files, and a erofs
> > > file with the required overlayfs xattrs, including a made up one
> > > called "overlay.fs-verity" containing the expected fs-verity digest
> > > for the lower dir. It also adds the required whiteouts to cover the
> > > 00-ff dirs from the lower dir.
> > > 
> > > These erofs files are ordered similarly to the composefs files, and we
> > > give similar guarantees about their reproducibility, etc. So, they
> > > should be apples-to-apples comparable with the composefs images.
> > > 
> > > Given this, I ran another set of performance tests on the original cs9
> > > rootfs dataset, again measuring the time of `ls -lR`. I also tried to
> > > measure the memory use like this:
> > > 
> > > # echo 3 > /proc/sys/vm/drop_caches
> > > # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat
> > > /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'
> > > 
> > > These are the alternatives I tried:
> > > 
> > > xfs: the source of the image, regular dir on xfs
> > > erofs: the image.erofs above, on loopback
> > > erofs dio: the image.erofs above, on loopback with --direct-io=on
> > > ovl: erofs above combined with overlayfs
> > > ovl dio: erofs dio above combined with overlayfs
> > > cfs: composefs mount of image.cfs
> > > 
> > > All tests use the same objects dir, stored on xfs. The erofs and
> > > overlay implementations are from a stock 6.1.13 kernel, and composefs
> > > module is from github HEAD.
> > > 
> > > I tried loopback both with and without the direct-io option, because
> > > without direct-io enabled the kernel will double-cache the loopbacked
> > > data, as per[1].
> > > 
> > > The produced images are:
> > >   8.9M image.cfs
> > > 11.3M image.erofs
> > > 
> > > And gives these results:
> > >             | Cold cache | Warm cache | Mem use
> > >             |   (msec)   |   (msec)   |  (mb)
> > > -----------+------------+------------+---------
> > > xfs        |   1449     |    442     |    54
> > > erofs      |    700     |    391     |    45
> > > erofs dio  |    939     |    400     |    45
> > > ovl        |   1827     |    530     |   130
> > > ovl dio    |   2156     |    531     |   130
> > > cfs        |    689     |    389     |    51
> > > 
> > > I also ran the same tests in a VM that had the latest kernel including
> > > the lazyfollow patches (ovl lazy in the table, not using direct-io),
> > > this one ext4 based:
> > > 
> > >             | Cold cache | Warm cache | Mem use
> > >             |   (msec)   |   (msec)   |  (mb)
> > > -----------+------------+------------+---------
> > > ext4       |   1135     |    394     |    54
> > > erofs      |    715     |    401     |    46
> > > erofs dio  |    922     |    401     |    45
> > > ovl        |   1412     |    515     |   148
> > > ovl dio    |   1810     |    532     |   149
> > > ovl lazy   |   1063     |    523     |    87
> > > cfs        |    719     |    463     |    51
> > > 
> > > Things noticeable in the results:
> > > 
> > > * composefs and erofs (by itself) perform roughly  similar. This is
> > >    not necessarily news, and results from Jingbo Xu match this.
> > > 
> > > * Erofs on top of direct-io enabled loopback causes quite a drop in
> > >    performance, which I don't really understand. Especially since its
> > >    reporting the same memory use as non-direct io. I guess the
> > >    double-cacheing in the later case isn't properly attributed to the
> > >    cgroup so the difference is not measured. However, why would the
> > >    double cache improve performance?  Maybe I'm not completely
> > >    understanding how these things interact.
> > 
> > We've already analysed the root cause of composefs is that composefs
> > uses a kernel_read() to read its path while irrelevant metadata
> > (such as dir data) is read together.  Such heuristic readahead is a
> > unusual stuff for all local fses (obviously almost all in-kernel
> > filesystems don't use kernel_read() to read their metadata. Although
> > some filesystems could readahead some related extent metadata when
> > reading inode, they at least does _not_ work as kernel_read().) But
> > double caching will introduce almost the same impact as kernel_read()
> > (assuming you read some source code of loop device.)
> > 
> > I do hope you already read what Jingbo's latest test results, and that
> > test result shows how bad readahead performs if fs metadata is
> > partially randomly used (stat < 1500 files):
> > https://lore.kernel.org/r/83829005-3f12-afac-9d05-8ba721a80b4d@linux.alibaba.com
> > 
> > Also you could explicitly _disable_ readahead for composefs
> > manifiest file (because all EROFS metadata read is without
> > readahead), and let's see how it works then.
> > 
> > Again, if your workload is just "ls -lR".  My answer is "just async
> > readahead the whole manifest file / loop device together" when
> > mounting.  That will give the best result to you.  But I'm not sure
> > that is the real use case you propose.
> > 
> > > 
> > > * Stacking overlay on top of erofs causes about 100msec slower
> > >    warm-cache times compared to all non-overlay approaches, and much
> > >    more in the cold cache case. The cold cache performance is helped
> > >    significantly by the lazyfollow patches, but the warm cache overhead
> > >    remains.
> > > 
> > > * The use of overlayfs more than doubles memory use, probably
> > >    because of all the extra inodes and dentries in action for the
> > >    various layers. The lazyfollow patches helps, but only partially.
> > > 
> > > * Even though overlayfs+erofs is slower than cfs and raw erofs, it is
> > >    not that much slower (~25%) than the pure xfs/ext4 directory, which
> > >    is a pretty good baseline for comparisons. It is even faster when
> > >    using lazyfollow on ext4.
> > > 
> > > * The erofs images are slightly larger than the equivalent composefs
> > >    image.
> > > 
> > > In summary: The performance of composefs is somewhat better than the
> > > best erofs+ovl combination, although the overlay approach is not
> > > significantly worse than the baseline of a regular directory, except
> > > that it uses a bit more memory.
> > > 
> > > On top of the above pure performance based comparisons I would like to
> > > re-state some of the other advantages of composefs compared to the
> > > overlay approach:
> > > 
> > > * composefs is namespaceable, in the sense that you can use it (given
> > >    mount capabilities) inside a namespace (such as a container) without
> > >    access to non-namespaced resources like loopback or device-mapper
> > >    devices. (There was work on fixing this with loopfs, but that seems
> > >    to have stalled.)
> > > 
> > > * While it is not in the current design, the simplicity of the format
> > >    and lack of loopback makes it at least theoretically possible that
> > >    composefs can be made usable in a rootless fashion at some point in
> > >    the future.
> > Do you consider sending some commands to /dev/cachefiles to configure
> > a daemonless dir and mount erofs image directly by using "erofs over
> > fscache" but in a daemonless way?  That is an ongoing stuff on our side.
> > 
> > IMHO, I don't think file-based interfaces are quite a charmful stuff.
> > Historically I recalled some practice is to "avoid directly reading
> > files in kernel" so that I think almost all local fses don't work on
> > files directl and loopback devices are all the ways for these use
> > cases.  If loopback devices are not okay to you, how about improving
> > loopback devices and that will benefit to almost all local fses.
> > 
> > > 
> > > And of course, there are disadvantages to composefs too. Primarily
> > > being more code, increasing maintenance burden and risk of security
> > > problems. Composefs is particularly burdensome because it is a
> > > stacking filesystem and these have historically been shown to be hard
> > > to get right.
> > > 
> > > 
> > > The question now is what is the best approach overall? For my own
> > > primary usecase of making a verifying ostree root filesystem, the
> > > overlay approach (with the lazyfollow work finished) is, while not
> > > ideal, good enough.
> > 
> > So your judgement is still "ls -lR" and your use case is still just
> > pure read-only and without writable stuff?
> > 
> > Anyway, I'm really happy to work with you on your ostree use cases
> > as always, as long as all corner cases work out by the community.
> > 
> > > 
> > > But I know for the people who are more interested in using composefs
> > > for containers the eventual goal of rootless support is very
> > > important. So, on behalf of them I guess the question is: Is there
> > > ever any chance that something like composefs could work rootlessly?
> > > Or conversely: Is there some way to get rootless support from the
> > > overlay approach? Opinions? Ideas?
> > 
> > Honestly, I do want to get a proper answer when Giuseppe asked me
> > the same question.  My current view is simply "that question is
> > almost the same for all in-kernel fses with some on-disk format".
> 
> As far as I'm concerned filesystems with on-disk format will not be made
> mountable by unprivileged containers. And I don't think I'm alone in
> that view.
> 

You're absolutely not alone in that view. This is even more unsafe with
network and clustered filesystems, as you're trusting remote hardware
that is accessible by more than just the local host. We have had
long-standing open requests to allow unprivileged users to mount
arbitrary remote filesystems, and I've never seen a way to do that
safely.

> The idea that ever more parts of the kernel with a massive
> attack surface such as a filesystem need to vouchesafe for the safety in
> the face of every rando having access to
> unshare --mount --user --map-root is a dead end and will just end up
> trapping us in a neverending cycle of security bugs (Because every
> single bug that's found after making that fs mountable from an
> unprivileged container will be treated as a security bug no matter if
> justified or not. So this is also a good way to ruin your filesystem's
> reputation.).
> 
> And honestly, if we set the precedent that it's fine for one filesystem
> with an on-disk format to be able to be mounted by unprivileged
> containers then other filesystems eventually want to do this as well.
> 
> At the rate we currently add filesystems that's just a matter of time
> even if none of the existing ones would also want to do it. And then
> we're left arguing that this was just an exception for one super
> special, super safe, unexploitable filesystem with an on-disk format.
> 
> Imho, none of this is appealing. I don't want to slowly keep building a
> future where we end up running fuzzers in unprivileged container to
> generate random images to crash the kernel.
> 
> I have more arguments why I don't think is a path we will ever go down
> but I don't want this to detract from the legitimate ask of making it
> possible to mount trusted images from within unprivileged containers.
> Because I think that's perfectly legitimate.
> 
> However, I don't think that this is something the kernel needs to solve
> other than providing the necessary infrastructure so that this can be
> solved in userspace.
> 
> Off-list, Amir had pointed to a blog I wrote last week (cf. [1]) where I
> explained how we currently mount into mount namespaces of unprivileged
> cotainers which had been quite a difficult problem before the new mount
> api. But now it's become almost comically trivial. I mean, there's stuff
> that will still be good to have but overall all the bits are already
> there.
> 
> Imho, delegated mounting should be done by a system service that is
> responsible for all the steps that require privileges. So for most
> filesytems not mountable by unprivileged user this would amount to:
> 
> fd_fs = fsopen("xfs")
> fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm")
> fsconfig(FSCONFIG_CMD_CREATE)
> fd_mnt = fsmount(fd_fs)
> // Only required for attributes that require privileges against the sb
> // of the filesystem such as idmapped mounts
> mount_setattr(fd_mnt, ...)
> 
> and then the fd_mnt can be sent to the container which can then attach
> it wherever it wants to. The system level service doesn't even need to
> change namespaces via setns(fd_userns|fd_mntns) like I illustrated in
> the post I did. It's sufficient if we sent it via AF_UNIX for example
> that's exposed to the container.
> 
> Of course, this system level service would be integrated with mount(8)
> directly over a well-defined protocol. And this would be nestable as
> well by e.g., bind-mounting the AF_UNIX socket.
> 
> And we do already support a rudimentary form of such integration through
> systemd. For example via mount -t ddi (cf. [2]) which makes it possible
> to mount discoverable disk images (ddi). But that's just an
> illustration. 
> 
> This should be integrated with mount(8) and should be a simply protocol
> over varlink or another lightweight ipc mechanism that can be
> implemented by systemd-mountd (which is how I coined this for lack of
> imagination when I came up with this) or by some other component if
> platforms like k8s really want to do their own thing.
> 
> This also allows us to extend this feature to the whole system btw and
> to all filesystems at once. Because it means that if systemd-mountd is
> told what images to trust (based on location, from a specific registry,
> signature, or whatever) then this isn't just useful for unprivileged
> containers but also for regular users on the host that want to mount
> stuff.
> 
> This is what we're currently working on.
> 

This is a very cool idea, and sounds like a reasonable way forward. I'd
be interested to hear more about this (and in particular what sort of
security model and use-cases you envision for this).

> (There's stuff that we can do to make this more powerful __if__ we need
> to. One example would probably that we _could_ make it possible to mark
> a superblock as being owned by a specific namespace with similar
> permission checks as what we currently do for idmapped mounts
> (privileged in the superblock of the fs, privileged over the ns to
> delegate to etc). IOW,
> 
> fd_fs = fsopen("xfs")
> fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm")
> fsconfig(FSCONFIG_SET_FD, "owner", fd_container_userns)
> 
> which completely sidesteps the issue of making that on-disk filesystem
> mountable by unpriv users.
> 
> But let me say that this is completely unnecessary today as you can do:
> 
> fd_fs = fsopen("xfs")
> fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm")
> fsconfig(FSCONFIG_CMD_CREATE)
> fd_mnt = fsmount(fd_fs)
> mount_setattr(fd_mnt, MOUNT_ATTR_IDMAP)
> 
> which changes ownership across the whole filesystem. The only time you
> really want what I mention here is if you want to delegate control over
> __every single ioctl and potentially destructive operation associated
> with that filesystem__ to an unprivileged container which is almost
> never what you want.)
> 
> [1]: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html
> [2]: https://github.com/systemd/systemd/pull/26695

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-07 12:09       ` Alexander Larsson
  2023-03-07 12:55         ` Gao Xiang
@ 2023-03-07 15:16         ` Christian Brauner
  2023-03-07 19:33           ` Giuseppe Scrivano
  1 sibling, 1 reply; 42+ messages in thread
From: Christian Brauner @ 2023-03-07 15:16 UTC (permalink / raw)
  To: Alexander Larsson
  Cc: Gao Xiang, lsf-pc, linux-fsdevel, Amir Goldstein, Jingbo Xu,
	Giuseppe Scrivano, Dave Chinner, Vivek Goyal, Miklos Szeredi,
	Seth Forshee

On Tue, Mar 07, 2023 at 01:09:57PM +0100, Alexander Larsson wrote:
> On Tue, Mar 7, 2023 at 11:16 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote:
> > > Hi Alexander,
> > >
> > > On 2023/3/3 21:57, Alexander Larsson wrote:
> > > > On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote:
> 
> > > > But I know for the people who are more interested in using composefs
> > > > for containers the eventual goal of rootless support is very
> > > > important. So, on behalf of them I guess the question is: Is there
> > > > ever any chance that something like composefs could work rootlessly?
> > > > Or conversely: Is there some way to get rootless support from the
> > > > overlay approach? Opinions? Ideas?
> > >
> > > Honestly, I do want to get a proper answer when Giuseppe asked me
> > > the same question.  My current view is simply "that question is
> > > almost the same for all in-kernel fses with some on-disk format".
> >
> > As far as I'm concerned filesystems with on-disk format will not be made
> > mountable by unprivileged containers. And I don't think I'm alone in
> > that view. The idea that ever more parts of the kernel with a massive
> > attack surface such as a filesystem need to vouchesafe for the safety in
> > the face of every rando having access to
> > unshare --mount --user --map-root is a dead end and will just end up
> > trapping us in a neverending cycle of security bugs (Because every
> > single bug that's found after making that fs mountable from an
> > unprivileged container will be treated as a security bug no matter if
> > justified or not. So this is also a good way to ruin your filesystem's
> > reputation.).
> >
> > And honestly, if we set the precedent that it's fine for one filesystem
> > with an on-disk format to be able to be mounted by unprivileged
> > containers then other filesystems eventually want to do this as well.
> >
> > At the rate we currently add filesystems that's just a matter of time
> > even if none of the existing ones would also want to do it. And then
> > we're left arguing that this was just an exception for one super
> > special, super safe, unexploitable filesystem with an on-disk format.
> >
> > Imho, none of this is appealing. I don't want to slowly keep building a
> > future where we end up running fuzzers in unprivileged container to
> > generate random images to crash the kernel.
> >
> > I have more arguments why I don't think is a path we will ever go down
> > but I don't want this to detract from the legitimate ask of making it
> > possible to mount trusted images from within unprivileged containers.
> > Because I think that's perfectly legitimate.
> >
> > However, I don't think that this is something the kernel needs to solve
> > other than providing the necessary infrastructure so that this can be
> > solved in userspace.
> 
> So, I completely understand this point of view. And, since I'm not
> really hearing any other viewpoint from the linux vfs developers it
> seems to be a shared opinion. So, it seems like further work on the
> kernel side of composefs isn't really useful anymore, and I will focus
> my work on the overlayfs side. Maybe we can even drop the summit topic
> to avoid a bunch of unnecessary travel?
> 
> That said, even though I understand (and even agree) with your
> worries, I feel it is kind of unfortunate that we end up with
> (essentially) a setuid helper approach for this. Because it feels like
> we're giving up on a useful feature (trustless unprivileged mounts)
> that the kernel could *theoretically* deliver, but a setuid helper
> can't. Sure, if you have a closed system you can limit what images can
> get mounted to images signed by a trusted key, but it won't work well
> for things like user built images or publically available images.
> Unfortunately practicalities kinda outweigh theoretical advantages.

Characterizing this as a setuid helper approach feels a bit like
negative branding. :)

But just in case there's a misunderstanding of any form let me clarify
that systemd doesn't produce set*id binaries in any form; never has,
never will.

It's also good to remember that in order to even use unprivileged
containers with meaningful idmappings, __two__ set*id binaries -
new*idmap - have to be used, with an extremely clunky and frankly
unusable id delegation policy expressed through these weird
/etc/sub*id files.  Which apparently everyone is happy to use.

What we're talking about here, however, is a first-class system service
capable of expressing meaningful security policy (e.g., image signed by
a key in the kernel keyring, polkit, ...). And such well-scoped local
services are a good thing.

This mentality of shoving ever more functionality under
unshare --user --map-root needs to really take a good hard look at
itself. Because it fundamentally assumes that unshare --user --map-root
is a sufficiently complex security policy to cover everything from
exposing complex network settings to complex filesystem settings to
unprivileged users.

To this day I'm not even sure whether having ramfs mountable by
unprivileged users isn't just a trivial DoS vector that nobody really
considers important enough.

(This is not aimed at you in any form; I used to think this was a
future worth building myself, but I think it's become sufficiently
clear that this just doesn't work, especially as our expectations
around security and integrity become ever greater.)

Fwiw, Lennart is in the middle of implementing this so we can showcase
this asap.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-07 15:16         ` Christian Brauner
@ 2023-03-07 19:33           ` Giuseppe Scrivano
  2023-03-08 10:31             ` Christian Brauner
  0 siblings, 1 reply; 42+ messages in thread
From: Giuseppe Scrivano @ 2023-03-07 19:33 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Larsson, Gao Xiang, lsf-pc, linux-fsdevel,
	Amir Goldstein, Jingbo Xu, Dave Chinner, Vivek Goyal,
	Miklos Szeredi, Seth Forshee

Christian Brauner <brauner@kernel.org> writes:

> On Tue, Mar 07, 2023 at 01:09:57PM +0100, Alexander Larsson wrote:
>> On Tue, Mar 7, 2023 at 11:16 AM Christian Brauner <brauner@kernel.org> wrote:
>> >
>> > On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote:
>> > > Hi Alexander,
>> > >
>> > > On 2023/3/3 21:57, Alexander Larsson wrote:
>> > > > On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote:
>> 
>> > > > But I know for the people who are more interested in using composefs
>> > > > for containers the eventual goal of rootless support is very
>> > > > important. So, on behalf of them I guess the question is: Is there
>> > > > ever any chance that something like composefs could work rootlessly?
>> > > > Or conversely: Is there some way to get rootless support from the
>> > > > overlay approach? Opinions? Ideas?
>> > >
>> > > Honestly, I do want to get a proper answer when Giuseppe asked me
>> > > the same question.  My current view is simply "that question is
>> > > almost the same for all in-kernel fses with some on-disk format".
>> >
>> > As far as I'm concerned filesystems with on-disk format will not be made
>> > mountable by unprivileged containers. And I don't think I'm alone in
>> > that view. The idea that ever more parts of the kernel with a massive
>> > attack surface such as a filesystem need to vouchesafe for the safety in
>> > the face of every rando having access to
>> > unshare --mount --user --map-root is a dead end and will just end up
>> > trapping us in a neverending cycle of security bugs (Because every
>> > single bug that's found after making that fs mountable from an
>> > unprivileged container will be treated as a security bug no matter if
>> > justified or not. So this is also a good way to ruin your filesystem's
>> > reputation.).
>> >
>> > And honestly, if we set the precedent that it's fine for one filesystem
>> > with an on-disk format to be able to be mounted by unprivileged
>> > containers then other filesystems eventually want to do this as well.
>> >
>> > At the rate we currently add filesystems that's just a matter of time
>> > even if none of the existing ones would also want to do it. And then
>> > we're left arguing that this was just an exception for one super
>> > special, super safe, unexploitable filesystem with an on-disk format.
>> >
>> > Imho, none of this is appealing. I don't want to slowly keep building a
>> > future where we end up running fuzzers in unprivileged container to
>> > generate random images to crash the kernel.
>> >
>> > I have more arguments why I don't think is a path we will ever go down
>> > but I don't want this to detract from the legitimate ask of making it
>> > possible to mount trusted images from within unprivileged containers.
>> > Because I think that's perfectly legitimate.
>> >
>> > However, I don't think that this is something the kernel needs to solve
>> > other than providing the necessary infrastructure so that this can be
>> > solved in userspace.
>> 
>> So, I completely understand this point of view. And, since I'm not
>> really hearing any other viewpoint from the linux vfs developers it
>> seems to be a shared opinion. So, it seems like further work on the
>> kernel side of composefs isn't really useful anymore, and I will focus
>> my work on the overlayfs side. Maybe we can even drop the summit topic
>> to avoid a bunch of unnecessary travel?
>> 
>> That said, even though I understand (and even agree) with your
>> worries, I feel it is kind of unfortunate that we end up with
>> (essentially) a setuid helper approach for this. Because it feels like
>> we're giving up on a useful feature (trustless unprivileged mounts)
>> that the kernel could *theoretically* deliver, but a setuid helper
>> can't. Sure, if you have a closed system you can limit what images can
>> get mounted to images signed by a trusted key, but it won't work well
>> for things like user built images or publically available images.
>> Unfortunately practicalities kinda outweigh theoretical advantages.
>
> Characterzing this as a setuid helper approach feels a bit like negative
> branding. :)
>
> But just in case there's a misunderstanding of any form let me clarify
> that systemd doesn't produce set*id binaries in any form; never has,
> never will.
>
> It's also good to remember that in order to even use unprivileged
> containers with meaningful idmappings __two__ set*id binaries -
> new*idmap - with an extremely clunky, and frankly unusable id delegation
> policy expressed through these weird /etc/sub*id files have to be used.
> Which apparently everyone is happy to use.
>
> What we're talking about here however is a first class system service
> capable of expressing meaningful security policy (e.g., image signed by
> a key in the kernel keyring, polkit, ...). And such well-scoped local
> services are a good thing.

there are some disadvantages too:

- while the impact on system services is negligible, using the proposed
  approach could slow down container startup.
  It is somewhat similar to the issue we currently have with cgroups,
  where manually creating a cgroup is faster than going through dbus and
  systemd (see the sketch after this list).  IMHO, the kernel could
  easily verify the image signature without relying on an additional
  userland service when mounting it from a user namespace.

- it won't be usable from a containerized build system.  It is common to
  build container images inside of a container (so that they can be
  built in a cluster).  To use the systemd approach, we'll need to
  access systemd on the host from the container.
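
To make the cgroup comparison above concrete: the "manual" path is
essentially one mkdir plus one write against cgroupfs, with no IPC
round-trip to a manager.  A minimal sketch, assuming cgroup v2 and a
delegated subtree; the group name is made up:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create a child cgroup and move the calling process into it by
 * writing its pid to cgroup.procs. */
static void move_self_to_new_cgroup(void)
{
	char buf[32];
	int fd;

	mkdir("/sys/fs/cgroup/mygroup", 0755);

	fd = open("/sys/fs/cgroup/mygroup/cgroup.procs", O_WRONLY);
	snprintf(buf, sizeof(buf), "%d\n", (int)getpid());
	write(fd, buf, strlen(buf));
	close(fd);
}

Going through a manager instead means at least one IPC round-trip per
operation.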

> This mentality of shoving ever more functionality under
> unshare --user --map-root needs to really take a good hard look at
> itself. Because it fundamentally assumes that unshare --user --map-root
> is a sufficiently complex security policy to cover everything from
> exposing complex network settings to complex filesystem settings to
> unprivileged users.
>
> To this day I'm not even sure if having ramfs mountable by unprivileged
> users isn't just a trivial dos vector that just nobody really considers
> important enough.
>
> (This is not aimed in any form at you because I used to think that this
> is a future worth building in the past myself but I think it's become
> sufficiently clear that this just doesn't work especially when our
> expectations around security and integrity become ever greater.)
>
> Fwiw, Lennart is in the middle of implementing this so we can showcase
> this asap.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-07 19:33           ` Giuseppe Scrivano
@ 2023-03-08 10:31             ` Christian Brauner
  0 siblings, 0 replies; 42+ messages in thread
From: Christian Brauner @ 2023-03-08 10:31 UTC (permalink / raw)
  To: Giuseppe Scrivano
  Cc: Alexander Larsson, Gao Xiang, lsf-pc, linux-fsdevel,
	Amir Goldstein, Jingbo Xu, Dave Chinner, Vivek Goyal,
	Miklos Szeredi, Seth Forshee, Jeff Layton

On Tue, Mar 07, 2023 at 08:33:29PM +0100, Giuseppe Scrivano wrote:
> Christian Brauner <brauner@kernel.org> writes:
> 
> > On Tue, Mar 07, 2023 at 01:09:57PM +0100, Alexander Larsson wrote:
> >> On Tue, Mar 7, 2023 at 11:16 AM Christian Brauner <brauner@kernel.org> wrote:
> >> >
> >> > On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote:
> >> > > Hi Alexander,
> >> > >
> >> > > On 2023/3/3 21:57, Alexander Larsson wrote:
> >> > > > On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote:
> >> 
> >> > > > But I know for the people who are more interested in using composefs
> >> > > > for containers the eventual goal of rootless support is very
> >> > > > important. So, on behalf of them I guess the question is: Is there
> >> > > > ever any chance that something like composefs could work rootlessly?
> >> > > > Or conversely: Is there some way to get rootless support from the
> >> > > > overlay approach? Opinions? Ideas?
> >> > >
> >> > > Honestly, I do want to get a proper answer when Giuseppe asked me
> >> > > the same question.  My current view is simply "that question is
> >> > > almost the same for all in-kernel fses with some on-disk format".
> >> >
> >> > As far as I'm concerned filesystems with on-disk format will not be made
> >> > mountable by unprivileged containers. And I don't think I'm alone in
> >> > that view. The idea that ever more parts of the kernel with a massive
> >> > attack surface such as a filesystem need to vouchesafe for the safety in
> >> > the face of every rando having access to
> >> > unshare --mount --user --map-root is a dead end and will just end up
> >> > trapping us in a neverending cycle of security bugs (Because every
> >> > single bug that's found after making that fs mountable from an
> >> > unprivileged container will be treated as a security bug no matter if
> >> > justified or not. So this is also a good way to ruin your filesystem's
> >> > reputation.).
> >> >
> >> > And honestly, if we set the precedent that it's fine for one filesystem
> >> > with an on-disk format to be able to be mounted by unprivileged
> >> > containers then other filesystems eventually want to do this as well.
> >> >
> >> > At the rate we currently add filesystems that's just a matter of time
> >> > even if none of the existing ones would also want to do it. And then
> >> > we're left arguing that this was just an exception for one super
> >> > special, super safe, unexploitable filesystem with an on-disk format.
> >> >
> >> > Imho, none of this is appealing. I don't want to slowly keep building a
> >> > future where we end up running fuzzers in unprivileged container to
> >> > generate random images to crash the kernel.
> >> >
> >> > I have more arguments why I don't think is a path we will ever go down
> >> > but I don't want this to detract from the legitimate ask of making it
> >> > possible to mount trusted images from within unprivileged containers.
> >> > Because I think that's perfectly legitimate.
> >> >
> >> > However, I don't think that this is something the kernel needs to solve
> >> > other than providing the necessary infrastructure so that this can be
> >> > solved in userspace.
> >> 
> >> So, I completely understand this point of view. And, since I'm not
> >> really hearing any other viewpoint from the linux vfs developers it
> >> seems to be a shared opinion. So, it seems like further work on the
> >> kernel side of composefs isn't really useful anymore, and I will focus
> >> my work on the overlayfs side. Maybe we can even drop the summit topic
> >> to avoid a bunch of unnecessary travel?
> >> 
> >> That said, even though I understand (and even agree) with your
> >> worries, I feel it is kind of unfortunate that we end up with
> >> (essentially) a setuid helper approach for this. Because it feels like
> >> we're giving up on a useful feature (trustless unprivileged mounts)
> >> that the kernel could *theoretically* deliver, but a setuid helper
> >> can't. Sure, if you have a closed system you can limit what images can
> >> get mounted to images signed by a trusted key, but it won't work well
> >> for things like user built images or publically available images.
> >> Unfortunately practicalities kinda outweigh theoretical advantages.
> >
> > Characterzing this as a setuid helper approach feels a bit like negative
> > branding. :)
> >
> > But just in case there's a misunderstanding of any form let me clarify
> > that systemd doesn't produce set*id binaries in any form; never has,
> > never will.
> >
> > It's also good to remember that in order to even use unprivileged
> > containers with meaningful idmappings __two__ set*id binaries -
> > new*idmap - with an extremely clunky, and frankly unusable id delegation
> > policy expressed through these weird /etc/sub*id files have to be used.
> > Which apparently everyone is happy to use.
> >
> > What we're talking about here however is a first class system service
> > capable of expressing meaningful security policy (e.g., image signed by
> > a key in the kernel keyring, polkit, ...). And such well-scoped local
> > services are a good thing.
> 
> there are some disadvantages too:
> 
> - while the impact on system services is negligible, using the proposed
>   approach could slow down container startup.
>   It is somehow similar to the issue we currently have with cgroups,
>   where manually creating a cgroup is faster than going through dbus and
>   systemd.  IMHO, the kernel could easily verify the image signature

This will use varlink; dbus would be optional and only be involved if
a service wanted to use polkit for trust. Signatures would be the main
way. Efficiency is of course something that is at the forefront.

That said, note that big chunks of mounting are already serialized on
the namespace lock (mount propagation et al) and the mount lock
(properties, parent-child relationships, mountpoint etc.), so it's not
as if this is a particularly fast operation to begin with.

Mounting is expensive in the kernel, especially with mount propagation
in the mix. If you have a thousand containers all calling mount at the
same time, with mount propagation between them for a big mount tree,
that'll be costly. IOW, the cost of mounting isn't paid in userspace.

>   without relying on an additional userland service when mounting it
>   from a user namespace.
> 
> - it won't be usable from a containerized build system.  It is common to
>   build container images inside of a container (so that they can be
>   built in a cluster).  To use the systemd approach, we'll need to
>   access systemd on the host from the container.

I don't see why that would be a problem; in fact, I consider it the
proper design. And I've explained in the earlier mail that we even have
nesting in mind right away.

You mentioned the cgroup delegation model above, and it's a good
example. The whole point of pressure stall information (PSI) for the
memory controller, for example, is the realization that instead of
pushing the policy for how to handle memory pressure ever deeper into
the kernel, it's better to expose the necessary infrastructure to
userspace, which can then implement policies tailored to the workload.
The kernel isn't suited to expressing such fine-grained policies. And
eBPF for containers will end up being managed in a similar way, with a
system service that implements the policy for attaching eBPF programs
to containers.

The mounting of filesystem images, network filesystems and so on is
imho a similar problem. The policy for when a filesystem mount should
be allowed is something that, at the end of the day, belongs in a
userspace system-level service. The use-cases are just too many, and
the filesystems too distinct and too complex, to be covered by the
kernel. The advantage is also that with a system-level service we can
extend this ability to all filesystems at once and to regular users on
the system.

In order to give the security and resource guarantees that a modern
system needs, the various services need to integrate with one another,
and that may involve asking for privileged operations to be performed.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-03-07 13:38       ` Jeff Layton
@ 2023-03-08 10:37         ` Christian Brauner
  0 siblings, 0 replies; 42+ messages in thread
From: Christian Brauner @ 2023-03-08 10:37 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Gao Xiang, Alexander Larsson, lsf-pc, linux-fsdevel,
	Amir Goldstein, Jingbo Xu, Giuseppe Scrivano, Dave Chinner,
	Vivek Goyal, Miklos Szeredi

On Tue, Mar 07, 2023 at 08:38:58AM -0500, Jeff Layton wrote:
> On Tue, 2023-03-07 at 11:15 +0100, Christian Brauner wrote:
> > On Fri, Mar 03, 2023 at 11:13:51PM +0800, Gao Xiang wrote:
> > > Hi Alexander,
> > > 
> > > On 2023/3/3 21:57, Alexander Larsson wrote:
> > > > On Mon, Feb 27, 2023 at 10:22 AM Alexander Larsson <alexl@redhat.com> wrote:
> > > > > 
> > > > > Hello,
> > > > > 
> > > > > Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
> > > > > Composefs filesystem. It is an opportunistically sharing, validating
> > > > > image-based filesystem, targeting usecases like validated ostree
> > > > > rootfs:es, validated container images that share common files, as well
> > > > > as other image based usecases.
> > > > > 
> > > > > During the discussions in the composefs proposal (as seen on LWN[3])
> > > > > is has been proposed that (with some changes to overlayfs), similar
> > > > > behaviour can be achieved by combining the overlayfs
> > > > > "overlay.redirect" xattr with an read-only filesystem such as erofs.
> > > > > 
> > > > > There are pros and cons to both these approaches, and the discussion
> > > > > about their respective value has sometimes been heated. We would like
> > > > > to have an in-person discussion at the summit, ideally also involving
> > > > > more of the filesystem development community, so that we can reach
> > > > > some consensus on what is the best apporach.
> > > > 
> > > > In order to better understand the behaviour and requirements of the
> > > > overlayfs+erofs approach I spent some time implementing direct support
> > > > for erofs in libcomposefs. So, with current HEAD of
> > > > github.com/containers/composefs you can now do:
> > > > 
> > > > $ mkcompose --digest-store=objects --format=erofs source-dir image.erofs
> > > 
> > > Thanks you for taking time on working on EROFS support.  I don't have
> > > time to play with it yet since I'd like to work out erofs-utils 1.6
> > > these days and will work on some new stuffs such as !pagesize block
> > > size as I said previously.
> > > 
> > > > 
> > > > This will produce an object store with the backing files, and a erofs
> > > > file with the required overlayfs xattrs, including a made up one
> > > > called "overlay.fs-verity" containing the expected fs-verity digest
> > > > for the lower dir. It also adds the required whiteouts to cover the
> > > > 00-ff dirs from the lower dir.
> > > > 
> > > > These erofs files are ordered similarly to the composefs files, and we
> > > > give similar guarantees about their reproducibility, etc. So, they
> > > > should be apples-to-apples comparable with the composefs images.
> > > > 
> > > > Given this, I ran another set of performance tests on the original cs9
> > > > rootfs dataset, again measuring the time of `ls -lR`. I also tried to
> > > > measure the memory use like this:
> > > > 
> > > > # echo 3 > /proc/sys/vm/drop_caches
> > > > # systemd-run --scope sh -c 'ls -lR mountpoint' > /dev/null; cat $(cat
> > > > /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak'
> > > > 
> > > > These are the alternatives I tried:
> > > > 
> > > > xfs: the source of the image, regular dir on xfs
> > > > erofs: the image.erofs above, on loopback
> > > > erofs dio: the image.erofs above, on loopback with --direct-io=on
> > > > ovl: erofs above combined with overlayfs
> > > > ovl dio: erofs dio above combined with overlayfs
> > > > cfs: composefs mount of image.cfs
> > > > 
> > > > All tests use the same objects dir, stored on xfs. The erofs and
> > > > overlay implementations are from a stock 6.1.13 kernel, and composefs
> > > > module is from github HEAD.
> > > > 
> > > > I tried loopback both with and without the direct-io option, because
> > > > without direct-io enabled the kernel will double-cache the loopbacked
> > > > data, as per[1].
> > > > 
> > > > The produced images are:
> > > >   8.9M image.cfs
> > > > 11.3M image.erofs
> > > > 
> > > > And gives these results:
> > > >             | Cold cache | Warm cache | Mem use
> > > >             |   (msec)   |   (msec)   |  (mb)
> > > > -----------+------------+------------+---------
> > > > xfs        |   1449     |    442     |    54
> > > > erofs      |    700     |    391     |    45
> > > > erofs dio  |    939     |    400     |    45
> > > > ovl        |   1827     |    530     |   130
> > > > ovl dio    |   2156     |    531     |   130
> > > > cfs        |    689     |    389     |    51
> > > > 
> > > > I also ran the same tests in a VM that had the latest kernel including
> > > > the lazyfollow patches (ovl lazy in the table, not using direct-io),
> > > > this one ext4 based:
> > > > 
> > > >             | Cold cache | Warm cache | Mem use
> > > >             |   (msec)   |   (msec)   |  (mb)
> > > > -----------+------------+------------+---------
> > > > ext4       |   1135     |    394     |    54
> > > > erofs      |    715     |    401     |    46
> > > > erofs dio  |    922     |    401     |    45
> > > > ovl        |   1412     |    515     |   148
> > > > ovl dio    |   1810     |    532     |   149
> > > > ovl lazy   |   1063     |    523     |    87
> > > > cfs        |    719     |    463     |    51
> > > > 
> > > > Things noticeable in the results:
> > > > 
> > > > * composefs and erofs (by itself) perform roughly  similar. This is
> > > >    not necessarily news, and results from Jingbo Xu match this.
> > > > 
> > > > * Erofs on top of direct-io enabled loopback causes quite a drop in
> > > >    performance, which I don't really understand. Especially since its
> > > >    reporting the same memory use as non-direct io. I guess the
> > > >    double-cacheing in the later case isn't properly attributed to the
> > > >    cgroup so the difference is not measured. However, why would the
> > > >    double cache improve performance?  Maybe I'm not completely
> > > >    understanding how these things interact.
> > > 
> > > We've already analysed the root cause of composefs is that composefs
> > > uses a kernel_read() to read its path while irrelevant metadata
> > > (such as dir data) is read together.  Such heuristic readahead is a
> > > unusual stuff for all local fses (obviously almost all in-kernel
> > > filesystems don't use kernel_read() to read their metadata. Although
> > > some filesystems could readahead some related extent metadata when
> > > reading inode, they at least does _not_ work as kernel_read().) But
> > > double caching will introduce almost the same impact as kernel_read()
> > > (assuming you read some source code of loop device.)
> > > 
> > > I do hope you already read what Jingbo's latest test results, and that
> > > test result shows how bad readahead performs if fs metadata is
> > > partially randomly used (stat < 1500 files):
> > > https://lore.kernel.org/r/83829005-3f12-afac-9d05-8ba721a80b4d@linux.alibaba.com
> > > 
> > > Also you could explicitly _disable_ readahead for composefs
> > > manifiest file (because all EROFS metadata read is without
> > > readahead), and let's see how it works then.
> > > 
> > > Again, if your workload is just "ls -lR".  My answer is "just async
> > > readahead the whole manifest file / loop device together" when
> > > mounting.  That will give the best result to you.  But I'm not sure
> > > that is the real use case you propose.
> > > 
> > > > 
> > > > * Stacking overlay on top of erofs causes about 100msec slower
> > > >    warm-cache times compared to all non-overlay approaches, and much
> > > >    more in the cold cache case. The cold cache performance is helped
> > > >    significantly by the lazyfollow patches, but the warm cache overhead
> > > >    remains.
> > > > 
> > > > * The use of overlayfs more than doubles memory use, probably
> > > >    because of all the extra inodes and dentries in action for the
> > > >    various layers. The lazyfollow patches helps, but only partially.
> > > > 
> > > > * Even though overlayfs+erofs is slower than cfs and raw erofs, it is
> > > >    not that much slower (~25%) than the pure xfs/ext4 directory, which
> > > >    is a pretty good baseline for comparisons. It is even faster when
> > > >    using lazyfollow on ext4.
> > > > 
> > > > * The erofs images are slightly larger than the equivalent composefs
> > > >    image.
> > > > 
> > > > In summary: The performance of composefs is somewhat better than the
> > > > best erofs+ovl combination, although the overlay approach is not
> > > > significantly worse than the baseline of a regular directory, except
> > > > that it uses a bit more memory.
> > > > 
> > > > On top of the above pure performance based comparisons I would like to
> > > > re-state some of the other advantages of composefs compared to the
> > > > overlay approach:
> > > > 
> > > > * composefs is namespaceable, in the sense that you can use it (given
> > > >    mount capabilities) inside a namespace (such as a container) without
> > > >    access to non-namespaced resources like loopback or device-mapper
> > > >    devices. (There was work on fixing this with loopfs, but that seems
> > > >    to have stalled.)
> > > > 
> > > > * While it is not in the current design, the simplicity of the format
> > > >    and lack of loopback makes it at least theoretically possible that
> > > >    composefs can be made usable in a rootless fashion at some point in
> > > >    the future.
> > > Do you consider sending some commands to /dev/cachefiles to configure
> > > a daemonless dir and mount erofs image directly by using "erofs over
> > > fscache" but in a daemonless way?  That is an ongoing stuff on our side.
> > > 
> > > IMHO, I don't think file-based interfaces are quite a charmful stuff.
> > > Historically I recalled some practice is to "avoid directly reading
> > > files in kernel" so that I think almost all local fses don't work on
> > > files directl and loopback devices are all the ways for these use
> > > cases.  If loopback devices are not okay to you, how about improving
> > > loopback devices and that will benefit to almost all local fses.
> > > 
> > > > 
> > > > And of course, there are disadvantages to composefs too. Primarily
> > > > being more code, increasing maintenance burden and risk of security
> > > > problems. Composefs is particularly burdensome because it is a
> > > > stacking filesystem and these have historically been shown to be hard
> > > > to get right.
> > > > 
> > > > 
> > > > The question now is what is the best approach overall? For my own
> > > > primary usecase of making a verifying ostree root filesystem, the
> > > > overlay approach (with the lazyfollow work finished) is, while not
> > > > ideal, good enough.
> > > 
> > > So your judgement is still "ls -lR" and your use case is still just
> > > pure read-only and without writable stuff?
> > > 
> > > Anyway, I'm really happy to work with you on your ostree use cases
> > > as always, as long as all corner cases work out by the community.
> > > 
> > > > 
> > > > But I know for the people who are more interested in using composefs
> > > > for containers the eventual goal of rootless support is very
> > > > important. So, on behalf of them I guess the question is: Is there
> > > > ever any chance that something like composefs could work rootlessly?
> > > > Or conversely: Is there some way to get rootless support from the
> > > > overlay approach? Opinions? Ideas?
> > > 
> > > Honestly, I do want to get a proper answer when Giuseppe asked me
> > > the same question.  My current view is simply "that question is
> > > almost the same for all in-kernel fses with some on-disk format".
> > 
> > As far as I'm concerned filesystems with on-disk format will not be made
> > mountable by unprivileged containers. And I don't think I'm alone in
> > that view.
> > 
> 
> You're absolutely not alone in that view. This is even more unsafe with
> network and clustered filesystems, as you're trusting remote hardware
> that is accessible by other users than just the local host. We have had
> long-standing open requests to allow unprivileged users to mount
> arbitrary remote filesystems, and I've never seen a way to do that
> safely.
> 
> > The idea that ever more parts of the kernel with a massive
> > attack surface such as a filesystem need to vouchesafe for the safety in
> > the face of every rando having access to
> > unshare --mount --user --map-root is a dead end and will just end up
> > trapping us in a neverending cycle of security bugs (Because every
> > single bug that's found after making that fs mountable from an
> > unprivileged container will be treated as a security bug no matter if
> > justified or not. So this is also a good way to ruin your filesystem's
> > reputation.).
> > 
> > And honestly, if we set the precedent that it's fine for one filesystem
> > with an on-disk format to be able to be mounted by unprivileged
> > containers then other filesystems eventually want to do this as well.
> > 
> > At the rate we currently add filesystems that's just a matter of time
> > even if none of the existing ones would also want to do it. And then
> > we're left arguing that this was just an exception for one super
> > special, super safe, unexploitable filesystem with an on-disk format.
> > 
> > Imho, none of this is appealing. I don't want to slowly keep building a
> > future where we end up running fuzzers in unprivileged container to
> > generate random images to crash the kernel.
> > 
> > I have more arguments why I don't think is a path we will ever go down
> > but I don't want this to detract from the legitimate ask of making it
> > possible to mount trusted images from within unprivileged containers.
> > Because I think that's perfectly legitimate.
> > 
> > However, I don't think that this is something the kernel needs to solve
> > other than providing the necessary infrastructure so that this can be
> > solved in userspace.
> > 
> > Off-list, Amir had pointed to a blog I wrote last week (cf. [1]) where I
> > explained how we currently mount into mount namespaces of unprivileged
> > cotainers which had been quite a difficult problem before the new mount
> > api. But now it's become almost comically trivial. I mean, there's stuff
> > that will still be good to have but overall all the bits are already
> > there.
> > 
> > Imho, delegated mounting should be done by a system service that is
> > responsible for all the steps that require privileges. So for most
> > filesytems not mountable by unprivileged user this would amount to:
> > 
> > fd_fs = fsopen("xfs")
> > fsconfig(FSCONFIG_SET_STRING, "source", "/sm/sm")
> > fsconfig(FSCONFIG_CMD_CREATE)
> > fd_mnt = fsmount(fd_fs)
> > // Only required for attributes that require privileges against the sb
> > // of the filesystem such as idmapped mounts
> > mount_setattr(fd_mnt, ...)
> > 
> > and then the fd_mnt can be sent to the container which can then attach
> > it wherever it wants to. The system level service doesn't even need to
> > change namespaces via setns(fd_userns|fd_mntns) like I illustrated in
> > the post I did. It's sufficient if we sent it via AF_UNIX for example
> > that's exposed to the container.
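
To make that concrete, here is a minimal sketch of both halves,
following the quoted xfs example.  The socket setup, error handling
and the optional mount_setattr() step are left out, the function
names are made up, and raw syscall(2) wrappers are used since glibc
doesn't ship stubs for the new mount API:

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>
#include <linux/mount.h>

static int sys_fsopen(const char *fs, unsigned int flags)
{
	return syscall(SYS_fsopen, fs, flags);
}
static int sys_fsconfig(int fd, unsigned int cmd, const char *key,
			const char *val, int aux)
{
	return syscall(SYS_fsconfig, fd, cmd, key, val, aux);
}
static int sys_fsmount(int fd, unsigned int flags, unsigned int attrs)
{
	return syscall(SYS_fsmount, fd, flags, attrs);
}
static int sys_move_mount(int from_dfd, const char *from, int to_dfd,
			  const char *to, unsigned int flags)
{
	return syscall(SYS_move_mount, from_dfd, from, to_dfd, to, flags);
}

/* Privileged service: build a detached mount and pass the fd over an
 * AF_UNIX socket shared with the container. */
static void service_send_mount(int sock, const char *source)
{
	int fd_fs, fd_mnt;
	char c = 'm';
	struct iovec iov = { .iov_base = &c, .iov_len = 1 };
	union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	fd_fs = sys_fsopen("xfs", FSOPEN_CLOEXEC);
	sys_fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", source, 0);
	sys_fsconfig(fd_fs, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
	fd_mnt = sys_fsmount(fd_fs, FSMOUNT_CLOEXEC, 0);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd_mnt, sizeof(int));
	sendmsg(sock, &msg, 0);
}

/* Unprivileged container: receive the fd and attach it wherever it
 * likes in its own mount namespace. */
static void container_attach_mount(int sock, const char *target)
{
	int fd_mnt;
	char c;
	struct iovec iov = { .iov_base = &c, .iov_len = 1 };
	union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};

	recvmsg(sock, &msg, 0);
	memcpy(&fd_mnt, CMSG_DATA(CMSG_FIRSTHDR(&msg)), sizeof(int));
	sys_move_mount(fd_mnt, "", AT_FDCWD, target, MOVE_MOUNT_F_EMPTY_PATH);
}

The only thing crossing the trust boundary is the detached mount fd;
the container never parses or mounts the image itself.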
> > 
> > Of course, this system level service would be integrated with mount(8)
> > directly over a well-defined protocol. And this would be nestable as
> > well by e.g., bind-mounting the AF_UNIX socket.
> > 
> > And we do already support a rudimentary form of such integration through
> > systemd. For example via mount -t ddi (cf. [2]) which makes it possible
> > to mount discoverable disk images (ddi). But that's just an
> > illustration. 
> > 
> > This should be integrated with mount(8) and should be a simply protocol
> > over varlink or another lightweight ipc mechanism that can be
> > implemented by systemd-mountd (which is how I coined this for lack of
> > imagination when I came up with this) or by some other component if
> > platforms like k8s really want to do their own thing.
> > 
> > This also allows us to extend this feature to the whole system btw and
> > to all filesystems at once. Because it means that if systemd-mountd is
> > told what images to trust (based on location, from a specific registry,
> > signature, or whatever) then this isn't just useful for unprivileged
> > containers but also for regular users on the host that want to mount
> > stuff.
> > 
> > This is what we're currently working on.
> > 
> 
> This is a very cool idea, and sounds like a reasonable way forward. I'd
> be interested to hear more about this (and in particular what sort of
> security model and use-cases you envision for this).

I convinced Lennart to put this at the top of his todo list, so he'll
hopefully finish the first implementation within the next week and put
up a PR. By LSFMM we should be able to demo this.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay
  2023-02-27 10:58   ` Christian Brauner
@ 2023-04-27 16:11     ` Amir Goldstein
  0 siblings, 0 replies; 42+ messages in thread
From: Amir Goldstein @ 2023-04-27 16:11 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Gao Xiang, linux-fsdevel, Jingbo Xu, lsf-pc, Alexander Larsson

On Mon, Feb 27, 2023 at 12:59 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Mon, Feb 27, 2023 at 06:45:50PM +0800, Gao Xiang wrote:
> >
> > (+cc Jingbo Xu and Christian Brauner)
> >
> > On 2023/2/27 17:22, Alexander Larsson wrote:
> > > Hello,
> > >
> > > Recently Giuseppe Scrivano and I have worked on[1] and proposed[2] the
> > > Composefs filesystem. It is an opportunistically sharing, validating
> > > image-based filesystem, targeting usecases like validated ostree
> > > rootfs:es, validated container images that share common files, as well
> > > as other image based usecases.
> > >
> > > During the discussions in the composefs proposal (as seen on LWN[3])
> > > is has been proposed that (with some changes to overlayfs), similar
> > > behaviour can be achieved by combining the overlayfs
> > > "overlay.redirect" xattr with an read-only filesystem such as erofs.
> > >
> > > There are pros and cons to both these approaches, and the discussion
> > > about their respective value has sometimes been heated. We would like
> > > to have an in-person discussion at the summit, ideally also involving
> > > more of the filesystem development community, so that we can reach
> > > some consensus on what is the best apporach.
> > >
> > > Good participants would be at least: Alexander Larsson, Giuseppe
> > > Scrivano, Amir Goldstein, David Chinner, Gao Xiang, Miklos Szeredi,
> > > Jingbo Xu
> > I'd be happy to discuss this at LSF/MM/BPF this year. Also we've addressed
> > the root cause of the performance gap is that
> >
> > composefs read some data symlink-like payload data by using
> > cfs_read_vdata_path() which involves kernel_read() and trigger heuristic
> > readahead of dir data (which is also landed in composefs vdata area
> > together with payload), so that most composefs dir I/O is already done
> > in advance by heuristic  readahead.  And we think almost all exist
> > in-kernel local fses doesn't have such heuristic readahead and if we add
> > the similar stuff, EROFS could do better than composefs.
> >
> > Also we've tried random stat()s about 500~1000 files in the tree you shared
> > (rather than just "ls -lR") and EROFS did almost the same or better than
> > composefs.  I guess further analysis (including blktrace) could be shown by
> > Jingbo later.
> >
> > Not sure if Christian Brauner would like to discuss this new stacked fs
>
> I'll be at lsfmm in any case and already got my invite a while ago. I
> intend to give some updates about a few vfs things and I can talk about
> this as well.
>

FYI, I scheduled a ~30min session led by Alexander on remaining
composefs topics, another ~30min session led by Gao on EROFS topics,
and another session for Christian dedicated to mounting images inside
userns.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2023-04-27 16:11 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-27  9:22 [LSF/MM/BFP TOPIC] Composefs vs erofs+overlay Alexander Larsson
2023-02-27 10:45 ` Gao Xiang
2023-02-27 10:58   ` Christian Brauner
2023-04-27 16:11     ` [Lsf-pc] " Amir Goldstein
2023-03-01  3:47   ` Jingbo Xu
2023-03-03 14:41     ` Alexander Larsson
2023-03-03 15:48       ` Gao Xiang
2023-02-27 11:37 ` Jingbo Xu
2023-03-03 13:57 ` Alexander Larsson
2023-03-03 15:13   ` Gao Xiang
2023-03-03 17:37     ` Gao Xiang
2023-03-04 14:59       ` Colin Walters
2023-03-04 15:29         ` Gao Xiang
2023-03-04 16:22           ` Gao Xiang
2023-03-07  1:00           ` Colin Walters
2023-03-07  3:10             ` Gao Xiang
2023-03-07 10:15     ` Christian Brauner
2023-03-07 11:03       ` Gao Xiang
2023-03-07 12:09       ` Alexander Larsson
2023-03-07 12:55         ` Gao Xiang
2023-03-07 15:16         ` Christian Brauner
2023-03-07 19:33           ` Giuseppe Scrivano
2023-03-08 10:31             ` Christian Brauner
2023-03-07 13:38       ` Jeff Layton
2023-03-08 10:37         ` Christian Brauner
2023-03-04  0:46   ` Jingbo Xu
2023-03-06 11:33   ` Alexander Larsson
2023-03-06 12:15     ` Gao Xiang
2023-03-06 15:49     ` Jingbo Xu
2023-03-06 16:09       ` Alexander Larsson
2023-03-06 16:17         ` Gao Xiang
2023-03-07  8:21           ` Alexander Larsson
2023-03-07  8:33             ` Gao Xiang
2023-03-07  8:48               ` Gao Xiang
2023-03-07  9:07               ` Alexander Larsson
2023-03-07  9:26                 ` Gao Xiang
2023-03-07  9:38                   ` Gao Xiang
2023-03-07  9:56                     ` Alexander Larsson
2023-03-07 10:06                       ` Gao Xiang
2023-03-07  9:46                   ` Alexander Larsson
2023-03-07 10:01                     ` Gao Xiang
2023-03-07 10:00       ` Jingbo Xu
