* [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem
@ 2023-01-20 15:23 Alexander Larsson
  2023-01-20 15:23 ` [PATCH v3 1/6] fsverity: Export fsverity_get_digest Alexander Larsson
                   ` (6 more replies)
  0 siblings, 7 replies; 87+ messages in thread
From: Alexander Larsson @ 2023-01-20 15:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-kernel, gscrivan, david, brauner, viro, Alexander Larsson

Giuseppe Scrivano and I have recently been working on a new project we
call composefs. This is the first time we propose this publicly and we
would like some feedback on it.

At its core, composefs is a way to construct and use read-only images
that are used much like you would use e.g. loop-back mounted squashfs
images. On top of this, composefs has two fundamental features. First,
it allows sharing of file data (both on disk and in page cache) between
images, and secondly it has dm-verity-like validation on read.

Let me start with a minimal example of how this can be used, before
going into the details:

Suppose we have this source for an image:

rootfs/
├── dir
│   └── another_a
├── file_a
└── file_b

We can then use this to generate an image file and a set of
content-addressed backing files:

# mkcomposefs --digest-store=objects rootfs/ rootfs.img
# ls -l rootfs.img objects/*/*
-rw-------. 1 root root   10 Nov 18 13:20 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4
-rw-------. 1 root root   10 Nov 18 13:20 objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
-rw-r--r--. 1 root root 4228 Nov 18 13:20 rootfs.img

The rootfs.img file contains all information about directory and file
metadata plus references to the backing files by name. We can now mount
this and look at the result:

# mount -t composefs rootfs.img -o basedir=objects /mnt
# ls /mnt/
dir  file_a  file_b
# cat /mnt/file_a
content_a

When reading this file the kernel is actually reading the backing file,
in a fashion similar to overlayfs. Since the backing file is
content-addressed, the objects directory can be shared for multiple
images, and any files that happen to have the same content are shared.
I refer to this as opportunistic sharing, as it differs from the more
coarse-grained explicit sharing used by e.g. container base images.

The next step is the validation. Note how the object files have
fs-verity enabled. In fact, they are named by their fs-verity digest:

# fsverity digest objects/*/*
sha256:02927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4
sha256:cc3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f

The generated filesystem image may contain the expected digest for the
backing files.
When the backing file digest is incorrect, the open will fail, and if
the open succeeds, any other on-disk file changes will be detected by
fs-verity:

# cat objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
content_a
# rm -f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
# echo modified > objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
# cat /mnt/file_a
WARNING: composefs backing file '3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f' unexpectedly had no fs-verity digest
cat: /mnt/file_a: Input/output error

This re-uses the existing fs-verity functionality to protect against
changes in file contents, while adding on top of it protection against
changes in filesystem metadata and structure. I.e. it protects against
replacing an fs-verity enabled file or modifying file permissions or
xattrs.

To be fully verified we need another step: we use fs-verity on the
image itself. Then we pass the expected digest on the mount command
line (which will be verified at mount time):

# fsverity enable rootfs.img
# fsverity digest rootfs.img
sha256:da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 rootfs.img
# mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt

So, given a trusted set of mount options (say, unlocked from a TPM), we
have a fully verified filesystem tree mounted, with opportunistic
fine-grained sharing of identical files.

So, why do we want this?

There are two initial users. First of all we want to use the
opportunistic sharing for the podman container image base layer. The
idea is to use a composefs mount as the lower directory in an overlay
mount, with the upper directory being the container work dir. This will
allow automatic file-level disk and page-cache sharing between any two
images, independent of details like the permissions and timestamps of
the files.

Secondly we are interested in using the verification aspects of
composefs in the ostree project. Ostree already supports a
content-addressed object store, but it is currently referenced by
hardlink farms. The object store and the trees that reference it are
signed and verified at download time, but there is no runtime
verification. If we replace the hardlink farm with a composefs image
that points into the existing object store, we can use composefs to
implement runtime verification. In fact, the tooling to create
composefs images is 100% reproducible, so all we need is to add the
composefs image fs-verity digest into the ostree commit. Then the image
can be reconstructed from the ostree commit info, generating a file
with the same fs-verity digest.

These are the use cases we're currently interested in, but there seems
to be a breadth of other possible uses. For example, many systems use
loopback mounts for images (like lxc or snap), and these could take
advantage of the opportunistic sharing. We've also talked about using
fuse to implement a local cache for the backing files. I.e. you would
have the second basedir be a fuse filesystem. On lookup failure in the
first basedir it downloads the file and saves it in the first basedir
for later lookups. There are many interesting possibilities here.

The patch series contains some documentation on the file format and how
to use the filesystem.
The userspace tools (and a standalone kernel module) are available
here:
  https://github.com/containers/composefs

Initial work on ostree integration is here:
  https://github.com/ostreedev/ostree/pull/2640

Changes since v2:
- Simplified filesystem format to use fixed-size inodes. This resulted
  in simpler (now < 2k lines) code as well as higher performance at the
  cost of slightly (~40%) larger images.
- We now use multi-page mappings from the page cache, which removes
  limits on sizes of xattrs and makes the dirent handling code simpler.
- Added more documentation about the on-disk file format.
- General cleanups based on review comments.

Changes since v1:
- Fixed some minor compiler warnings
- Fixed build with !CONFIG_MMU
- Documentation fixes from review by Bagas Sanjaya
- Code style and cleanup from review by Brian Masney
- Use existing kernel helpers for hex digit conversion
- Use kmap_local_page() instead of deprecated kmap()

Alexander Larsson (6):
  fsverity: Export fsverity_get_digest
  composefs: Add on-disk layout header
  composefs: Add descriptor parsing code
  composefs: Add filesystem implementation
  composefs: Add documentation
  composefs: Add kconfig and build support

 Documentation/filesystems/composefs.rst | 159 +++++
 Documentation/filesystems/index.rst     |   1 +
 fs/Kconfig                              |   1 +
 fs/Makefile                             |   1 +
 fs/composefs/Kconfig                    |  18 +
 fs/composefs/Makefile                   |   5 +
 fs/composefs/cfs-internals.h            |  55 ++
 fs/composefs/cfs-reader.c               | 720 +++++++++++++++++++++++
 fs/composefs/cfs.c                      | 750 ++++++++++++++++++++++++
 fs/composefs/cfs.h                      | 172 ++++++
 fs/verity/measure.c                     |   1 +
 11 files changed, 1883 insertions(+)
 create mode 100644 Documentation/filesystems/composefs.rst
 create mode 100644 fs/composefs/Kconfig
 create mode 100644 fs/composefs/Makefile
 create mode 100644 fs/composefs/cfs-internals.h
 create mode 100644 fs/composefs/cfs-reader.c
 create mode 100644 fs/composefs/cfs.c
 create mode 100644 fs/composefs/cfs.h

-- 
2.39.0

^ permalink raw reply	[flat|nested] 87+ messages in thread
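To make the podman use case from the cover letter concrete, the
overlay-over-composefs setup could look roughly like the sketch below,
written with plain mount(2) calls. All paths here are made up for the
illustration and are not defined by these patches:

/* Sketch of the container base-layer idea: a composefs image becomes the
 * read-only lower layer of an overlayfs mount, and the container gets a
 * normal writable upper dir. Paths are placeholders.
 */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* Read-only image, backed by the shared content-addressed store */
	if (mount("/var/lib/store/rootfs.img", "/run/cfs-lower", "composefs",
		  MS_RDONLY, "basedir=/var/lib/store/objects") < 0) {
		perror("mount composefs");
		return 1;
	}

	/* Writable container view on top of it */
	if (mount("overlay", "/run/container-root", "overlay", 0,
		  "lowerdir=/run/cfs-lower,"
		  "upperdir=/var/lib/containers/upper,"
		  "workdir=/var/lib/containers/work") < 0) {
		perror("mount overlay");
		return 1;
	}

	printf("container rootfs at /run/container-root\n");
	return 0;
}

Because the lower layer is the composefs mount rather than a hardlinked
or copied layer directory, identical files in unrelated images end up
backed by the same object and the same page cache.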
* [PATCH v3 1/6] fsverity: Export fsverity_get_digest 2023-01-20 15:23 [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem Alexander Larsson @ 2023-01-20 15:23 ` Alexander Larsson 2023-01-20 15:23 ` [PATCH v3 2/6] composefs: Add on-disk layout header Alexander Larsson ` (5 subsequent siblings) 6 siblings, 0 replies; 87+ messages in thread From: Alexander Larsson @ 2023-01-20 15:23 UTC (permalink / raw) To: linux-fsdevel Cc: linux-kernel, gscrivan, david, brauner, viro, Alexander Larsson Composefs needs to call this when built in module form, so we need to export the symbol. This uses EXPORT_SYMBOL_GPL like the other fsverity functions do. Signed-off-by: Alexander Larsson <alexl@redhat.com> --- fs/verity/measure.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/verity/measure.c b/fs/verity/measure.c index 5c79ea1b2468..875d143e0c7e 100644 --- a/fs/verity/measure.c +++ b/fs/verity/measure.c @@ -85,3 +85,4 @@ int fsverity_get_digest(struct inode *inode, *alg = hash_alg->algo_id; return 0; } +EXPORT_SYMBOL_GPL(fsverity_get_digest); -- 2.39.0 ^ permalink raw reply related [flat|nested] 87+ messages in thread
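For context, a minimal sketch of the kind of module call site this
export enables; it mirrors how cfs_init_ctx() and cfs_open_file() later
in this series use the function, but the helper name here is made up:

/* Hypothetical helper: compare an inode's fs-verity digest against an
 * expected sha256 value using the newly exported fsverity_get_digest().
 */
#include <linux/fs.h>
#include <linux/fsverity.h>
#include <linux/string.h>
#include <crypto/sha2.h>

static int check_verity_digest(struct inode *inode, const u8 *expected)
{
	u8 digest[FS_VERITY_MAX_DIGEST_SIZE];
	enum hash_algo algo;
	int err;

	/* Fails if fs-verity is not enabled on this inode */
	err = fsverity_get_digest(inode, digest, &algo);
	if (err < 0)
		return err;

	if (algo != HASH_ALGO_SHA256 ||
	    memcmp(expected, digest, SHA256_DIGEST_SIZE) != 0)
		return -EINVAL;

	return 0;
}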
* [PATCH v3 2/6] composefs: Add on-disk layout header
  2023-01-20 15:23 [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem Alexander Larsson
  2023-01-20 15:23 ` [PATCH v3 1/6] fsverity: Export fsverity_get_digest Alexander Larsson
@ 2023-01-20 15:23 ` Alexander Larsson
  2023-01-20 15:23 ` [PATCH v3 3/6] composefs: Add descriptor parsing code Alexander Larsson
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 87+ messages in thread
From: Alexander Larsson @ 2023-01-20 15:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-kernel, gscrivan, david, brauner, viro, Alexander Larsson

This header contains the on-disk layout of the composefs file format.

The basic format is a simple superblock with a version and magic
number at the start for filetype detection. After that comes a table
of inodes (indexed by inode number) containing all the fixed-size
inode data. After the inodes (at an offset specified in the
superblock) is a variable data section that is linked to by the
inodes for:

 * symlink targets
 * backing filenames
 * xattrs
 * dirents

The goal of this file format is to be simple and efficient to decode
when mapped directly from the page cache. This allows an easy to
understand and maintain codebase.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
Co-developed-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
---
 fs/composefs/cfs.h | 172 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 172 insertions(+)
 create mode 100644 fs/composefs/cfs.h

diff --git a/fs/composefs/cfs.h b/fs/composefs/cfs.h
new file mode 100644
index 000000000000..9209b80dd6ca
--- /dev/null
+++ b/fs/composefs/cfs.h
@@ -0,0 +1,172 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * composefs
+ *
+ * Copyright (C) 2021 Giuseppe Scrivano
+ * Copyright (C) 2022 Alexander Larsson
+ *
+ * This file is released under the GPL.
+ */
+
+#ifndef _CFS_H
+#define _CFS_H
+
+#include <asm/byteorder.h>
+#include <crypto/sha2.h>
+#include <linux/fs.h>
+#include <linux/stat.h>
+#include <linux/types.h>
+
+/* Descriptor file layout:
+ *
+ * +-----------------------+
+ * | cfs_superblock        |
+ * |  vdata_offset         |---|
+ * +-----------------------+   |
+ * | Inode table           |   |
+ * |  N * cfs_inode_data   |   |
+ * +-----------------------+   |
+ * | Variable data section |<--/
+ * | Used for:             |
+ * |  symlink targets      |
+ * |  backing file paths   |
+ * |  dirents              |
+ * |  xattrs               |
+ * |  digests              |
+ * +-----------------------+
+ *
+ * The superblock is at the start of the file, and the inode table
+ * directly follows it. The variable data section is found via
+ * vdata_offset, and all sections are 32bit aligned. All data is
+ * little endian.
+ *
+ * The inode table is a table of fixed size cfs_inode_data elements.
+ * The filesystem inode numbers are 32bit indexes into this table.
+ * Actual file content (for regular files) is referenced by a backing
+ * file path which is looked up relative to a given base dir.
+ *
+ * All variable size data are stored in the variable data section and
+ * are referenced using cfs_vdata (64bit offset from the start of the
+ * vdata section and 32bit lengths).
+ *
+ * Directory dirent data is stored in one 32bit aligned vdata chunk,
+ * starting with a table of fixed size cfs_dirents, which is
+ * followed by a string table. The dirents reference the strings by
+ * offsets from the string table. The dirents are sorted for efficient
+ * binary search lookups.
+ *
+ * Xattrs data are stored in a 32bit aligned vdata chunk. 
This is + * a table of cfs_xattr, followed by the key/value data. The + * xattrs are sorted by key. Note that many inodes can reference + * the same xattr data. + */ + +/* Current (and atm only) version of the image format. */ +#define CFS_VERSION 1 + +#define CFS_MAGIC 0xc078629aU + +#define CFS_SUPERBLOCK_OFFSET 0 +#define CFS_INODE_TABLE_OFFSET sizeof(struct cfs_superblock) +#define CFS_INODE_SIZE sizeof(struct cfs_inode_data) +#define CFS_DIRENT_SIZE sizeof(struct cfs_dirent) +#define CFS_XATTR_ELEM_SIZE sizeof(struct cfs_xattr_element) +#define CFS_ROOT_INO 0 + +/* Fits at least the root inode */ +#define CFS_DESCRIPTOR_MIN_SIZE \ + (sizeof(struct cfs_superblock) + sizeof(struct cfs_inode_data)) + +/* More that this would overflow header size computation */ +#define CFS_MAX_DIRENTS (U32_MAX / CFS_DIRENT_SIZE - 1) + +#define CFS_MAX_XATTRS U16_MAX + +struct cfs_superblock { + __le32 version; /* CFS_VERSION */ + __le32 magic; /* CFS_MAGIC */ + + /* Offset of the variable data section from start of file */ + __le64 vdata_offset; + + /* For future use, and makes superblock 128 bytes to align + * inode table on cacheline boundary on most arches. + */ + __le32 unused[28]; +} __packed; + +struct cfs_vdata { + __le64 off; /* Offset into variable data section */ + __le32 len; +} __packed; + +struct cfs_inode_data { + __le32 st_mode; /* File type and mode. */ + __le32 st_nlink; /* Number of hard links, only for regular files. */ + __le32 st_uid; /* User ID of owner. */ + __le32 st_gid; /* Group ID of owner. */ + __le32 st_rdev; /* Device ID (if special file). */ + __le64 st_size; /* Size of file */ + __le64 st_mtim_sec; + __le32 st_mtim_nsec; + __le64 st_ctim_sec; + __le32 st_ctim_nsec; + + /* References to variable storage area: */ + + /* per-type variable data: + * S_IFDIR: dirents + * S_IFREG: backing file pathnem + * S_IFLNLK; symlink target + */ + struct cfs_vdata variable_data; + + struct cfs_vdata xattrs; + struct cfs_vdata digest; /* Expected fs-verity digest of backing file */ + + /* For future use, and makes inode_data 96 bytes which + * is semi-aligned with cacheline sizes. + */ + __le32 unused[2]; +} __packed; + +struct cfs_dirent { + __le32 inode_num; /* Index in inode table */ + __le32 name_offset; /* Offset from end of cfs_dir_header */ + u8 name_len; + u8 d_type; + u16 _padding; +} __packed; + +/* Directory entries, stored in variable data section, 32bit aligned, + * followed by name string table + */ +struct cfs_dir_header { + __le32 n_dirents; + struct cfs_dirent dirents[]; +} __packed; + +static inline size_t cfs_dir_header_size(size_t n_dirents) +{ + return sizeof(struct cfs_dir_header) + n_dirents * CFS_DIRENT_SIZE; +} + +struct cfs_xattr_element { + __le16 key_length; + __le16 value_length; +} __packed; + +/* Xattrs, stored in variable data section , 32bit aligned, followed + * by key/value table + */ +struct cfs_xattr_header { + __le16 n_attr; + struct cfs_xattr_element attr[0]; +} __packed; + +static inline size_t cfs_xattr_header_size(size_t n_element) +{ + return sizeof(struct cfs_xattr_header) + n_element * CFS_XATTR_ELEM_SIZE; +} + +#endif -- 2.39.0 ^ permalink raw reply related [flat|nested] 87+ messages in thread
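As a rough userspace illustration of the layout in this header (not
part of the series; the 128-byte superblock and 96-byte inode sizes are
hard-coded from the comments above rather than taken from the struct
definitions), locating the record for inode N in an mmap'ed descriptor
is plain offset arithmetic:

/* Userspace sketch: given a descriptor mapped at "base" with size "len",
 * return a pointer to the fixed-size record for "inode_num", or NULL if
 * it lies outside the inode table. vdata_offset is the little-endian u64
 * at byte 8 of the superblock.
 */
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <endian.h>

#define CFS_SB_SIZE	128	/* sizeof(struct cfs_superblock) */
#define CFS_INO_SIZE	96	/* sizeof(struct cfs_inode_data) */

static const uint8_t *cfs_inode_record(const uint8_t *base, size_t len,
				       uint32_t inode_num)
{
	uint64_t vdata_offset, inode_offset;

	if (len < CFS_SB_SIZE)
		return NULL;

	memcpy(&vdata_offset, base + 8, sizeof(vdata_offset));
	vdata_offset = le64toh(vdata_offset);
	if (vdata_offset > len)
		return NULL;

	/* The inode table directly follows the superblock and ends where
	 * the variable data section begins.
	 */
	inode_offset = CFS_SB_SIZE + (uint64_t)inode_num * CFS_INO_SIZE;
	if (inode_offset + CFS_INO_SIZE > vdata_offset)
		return NULL;

	return base + inode_offset;
}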
* [PATCH v3 3/6] composefs: Add descriptor parsing code 2023-01-20 15:23 [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem Alexander Larsson 2023-01-20 15:23 ` [PATCH v3 1/6] fsverity: Export fsverity_get_digest Alexander Larsson 2023-01-20 15:23 ` [PATCH v3 2/6] composefs: Add on-disk layout header Alexander Larsson @ 2023-01-20 15:23 ` Alexander Larsson 2023-01-20 15:23 ` [PATCH v3 4/6] composefs: Add filesystem implementation Alexander Larsson ` (3 subsequent siblings) 6 siblings, 0 replies; 87+ messages in thread From: Alexander Larsson @ 2023-01-20 15:23 UTC (permalink / raw) To: linux-fsdevel Cc: linux-kernel, gscrivan, david, brauner, viro, Alexander Larsson This adds the code to load and decode the filesystem descriptor file format. We open the descriptor at mount time and keep the struct file * around. Most accesses to it happens via cfs_get_buf() which reads the descriptor data directly from the page cache. Although in a few cases (like when we need to directly copy data) we use kernel_read() instead. Signed-off-by: Alexander Larsson <alexl@redhat.com> Co-developed-by: Giuseppe Scrivano <gscrivan@redhat.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> --- fs/composefs/cfs-internals.h | 55 +++ fs/composefs/cfs-reader.c | 720 +++++++++++++++++++++++++++++++++++ 2 files changed, 775 insertions(+) create mode 100644 fs/composefs/cfs-internals.h create mode 100644 fs/composefs/cfs-reader.c diff --git a/fs/composefs/cfs-internals.h b/fs/composefs/cfs-internals.h new file mode 100644 index 000000000000..3524b977c8a8 --- /dev/null +++ b/fs/composefs/cfs-internals.h @@ -0,0 +1,55 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _CFS_INTERNALS_H +#define _CFS_INTERNALS_H + +#include "cfs.h" + +#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */ + +struct cfs_inode_extra_data { + char *path_payload; /* Real pathname for files, target for symlinks */ + + u64 xattrs_offset; + u32 xattrs_len; + + u64 dirents_offset; + u32 dirents_len; + + bool has_digest; + u8 digest[SHA256_DIGEST_SIZE]; /* fs-verity digest */ +}; + +struct cfs_context { + struct file *descriptor; + u64 vdata_offset; + u32 num_inodes; + + u64 descriptor_len; +}; + +int cfs_init_ctx(const char *descriptor_path, const u8 *required_digest, + struct cfs_context *ctx); + +void cfs_ctx_put(struct cfs_context *ctx); + +int cfs_init_inode(struct cfs_context *ctx, u32 inode_num, struct inode *inode, + struct cfs_inode_extra_data *data); + +ssize_t cfs_list_xattrs(struct cfs_context *ctx, + struct cfs_inode_extra_data *inode_data, char *names, + size_t size); +int cfs_get_xattr(struct cfs_context *ctx, struct cfs_inode_extra_data *inode_data, + const char *name, void *value, size_t size); + +typedef bool (*cfs_dir_iter_cb)(void *private, const char *name, int namelen, + u64 ino, unsigned int dtype); + +int cfs_dir_iterate(struct cfs_context *ctx, u64 index, + struct cfs_inode_extra_data *inode_data, loff_t first, + cfs_dir_iter_cb cb, void *private); + +int cfs_dir_lookup(struct cfs_context *ctx, u64 index, + struct cfs_inode_extra_data *inode_data, const char *name, + size_t name_len, u64 *index_out); + +#endif diff --git a/fs/composefs/cfs-reader.c b/fs/composefs/cfs-reader.c new file mode 100644 index 000000000000..6ff7d3e70d39 --- /dev/null +++ b/fs/composefs/cfs-reader.c @@ -0,0 +1,720 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * composefs + * + * Copyright (C) 2021 Giuseppe Scrivano + * Copyright (C) 2022 Alexander Larsson + * + * This file is released under the GPL. 
+ */ + +#include "cfs-internals.h" + +#include <linux/file.h> +#include <linux/fsverity.h> +#include <linux/pagemap.h> +#include <linux/vmalloc.h> + +/* When mapping buffers via page arrays this is an "arbitrary" limit + * to ensure we're not ballooning memory use for the page array and + * mapping. On a 4k page, 64bit machine this limit will make the page + * array fit in one page, and will allow a mapping of 2MB. When + * applied to e.g. dirents this will allow more than 27000 filenames + * of length 64, which seems ok. If we need to support more, at that + * point we should probably fall back to an approach that maps pages + * incrementally. + */ +#define CFS_BUF_MAXPAGES 512 + +#define CFS_BUF_PREALLOC_SIZE 4 + +/* Check if the element, which is supposed to be offset from section_start + * actually fits in the section starting at section_start ending at section_end, + * and doesn't wrap. + */ +static bool cfs_is_in_section(u64 section_start, u64 section_end, + u64 element_offset, u64 element_size) +{ + u64 element_end; + u64 element_start; + + element_start = section_start + element_offset; + if (element_start < section_start || element_start >= section_end) + return false; + + element_end = element_start + element_size; + if (element_end < element_start || element_end > section_end) + return false; + + return true; +} + +struct cfs_buf { + struct page **pages; + size_t n_pages; + void *base; + + /* Used as "pages" above to avoid allocation for small buffers */ + struct page *prealloc[CFS_BUF_PREALLOC_SIZE]; +}; + +static void cfs_buf_put(struct cfs_buf *buf) +{ + if (buf->pages) { + if (buf->n_pages == 1) + kunmap_local(buf->base); + else + vm_unmap_ram(buf->base, buf->n_pages); + for (size_t i = 0; i < buf->n_pages; i++) + put_page(buf->pages[i]); + if (buf->n_pages > CFS_BUF_PREALLOC_SIZE) + kfree(buf->pages); + buf->pages = NULL; + } +} + +/* Map data from anywhere in the descriptor */ +static void *cfs_get_buf(struct cfs_context *ctx, u64 offset, u32 size, + struct cfs_buf *buf) +{ + struct inode *inode = ctx->descriptor->f_inode; + struct address_space *const mapping = inode->i_mapping; + size_t n_pages, read_pages; + u64 index, last_index; + struct page **pages; + void *base; + + if (buf->pages) + return ERR_PTR(-EINVAL); + + if (!cfs_is_in_section(0, ctx->descriptor_len, offset, size) || size == 0) + return ERR_PTR(-EFSCORRUPTED); + + index = offset >> PAGE_SHIFT; + last_index = (offset + size - 1) >> PAGE_SHIFT; + n_pages = last_index - index + 1; + + if (n_pages > CFS_BUF_MAXPAGES) + return ERR_PTR(-ENOMEM); + + if (n_pages > CFS_BUF_PREALLOC_SIZE) { + pages = kmalloc_array(n_pages, sizeof(struct page *), GFP_KERNEL); + if (!pages) + return ERR_PTR(-ENOMEM); + } else { + /* Avoid allocation in common (small) cases */ + pages = buf->prealloc; + } + + for (read_pages = 0; read_pages < n_pages; read_pages++) { + struct page *page = + read_cache_page(mapping, index + read_pages, NULL, NULL); + if (IS_ERR(page)) + goto nomem; + pages[read_pages] = page; + } + + if (n_pages == 1) { + base = kmap_local_page(pages[0]); + } else { + base = vm_map_ram(pages, n_pages, -1); + if (!base) + goto nomem; + } + + buf->pages = pages; + buf->n_pages = n_pages; + buf->base = base; + + return base + (offset & (PAGE_SIZE - 1)); + +nomem: + for (size_t i = 0; i < read_pages; i++) + put_page(pages[i]); + if (n_pages > CFS_BUF_PREALLOC_SIZE) + kfree(pages); + + return ERR_PTR(-ENOMEM); +} + +/* Map data from the inode table */ +static void *cfs_get_inode_buf(struct cfs_context *ctx, u64 offset, u32 
len, + struct cfs_buf *buf) +{ + if (!cfs_is_in_section(CFS_INODE_TABLE_OFFSET, ctx->vdata_offset, offset, len)) + return ERR_PTR(-EINVAL); + + return cfs_get_buf(ctx, CFS_INODE_TABLE_OFFSET + offset, len, buf); +} + +/* Map data from the variable data section */ +static void *cfs_get_vdata_buf(struct cfs_context *ctx, u64 offset, u32 len, + struct cfs_buf *buf) +{ + if (!cfs_is_in_section(ctx->vdata_offset, ctx->descriptor_len, offset, len)) + return ERR_PTR(-EINVAL); + + return cfs_get_buf(ctx, ctx->vdata_offset + offset, len, buf); +} + +/* Read data from anywhere in the descriptor */ +static void *cfs_read_data(struct cfs_context *ctx, u64 offset, u32 size, u8 *dest) +{ + loff_t pos = offset; + size_t copied; + + if (!cfs_is_in_section(0, ctx->descriptor_len, offset, size)) + return ERR_PTR(-EFSCORRUPTED); + + copied = 0; + while (copied < size) { + ssize_t bytes; + + bytes = kernel_read(ctx->descriptor, dest + copied, + size - copied, &pos); + if (bytes < 0) + return ERR_PTR(bytes); + if (bytes == 0) + return ERR_PTR(-EINVAL); + + copied += bytes; + } + + if (copied != size) + return ERR_PTR(-EFSCORRUPTED); + return dest; +} + +/* Read data from the variable data section */ +static void *cfs_read_vdata(struct cfs_context *ctx, u64 offset, u32 len, char *buf) +{ + void *res; + + if (!cfs_is_in_section(ctx->vdata_offset, ctx->descriptor_len, offset, len)) + return ERR_PTR(-EINVAL); + + res = cfs_read_data(ctx, ctx->vdata_offset + offset, len, buf); + if (IS_ERR(res)) + return ERR_CAST(res); + + return buf; +} + +/* Allocate, read and null-terminate paths from the variable data section */ +static char *cfs_read_vdata_path(struct cfs_context *ctx, u64 offset, u32 len) +{ + char *path; + void *res; + + if (len > PATH_MAX) + return ERR_PTR(-EINVAL); + + path = kmalloc(len + 1, GFP_KERNEL); + if (!path) + return ERR_PTR(-ENOMEM); + + res = cfs_read_vdata(ctx, offset, len, path); + if (IS_ERR(res)) { + kfree(path); + return ERR_CAST(res); + } + + /* zero terminate */ + path[len] = 0; + + return path; +} + +int cfs_init_ctx(const char *descriptor_path, const u8 *required_digest, + struct cfs_context *ctx_out) +{ + u8 verity_digest[FS_VERITY_MAX_DIGEST_SIZE]; + struct cfs_superblock superblock_buf; + struct cfs_superblock *superblock; + enum hash_algo verity_algo; + struct cfs_context ctx; + struct file *descriptor; + u64 num_inodes; + loff_t i_size; + int res; + + descriptor = filp_open(descriptor_path, O_RDONLY, 0); + if (IS_ERR(descriptor)) + return PTR_ERR(descriptor); + + if (required_digest) { + res = fsverity_get_digest(d_inode(descriptor->f_path.dentry), + verity_digest, &verity_algo); + if (res < 0) { + pr_err("ERROR: composefs descriptor has no fs-verity digest\n"); + goto fail; + } + if (verity_algo != HASH_ALGO_SHA256 || + memcmp(required_digest, verity_digest, SHA256_DIGEST_SIZE) != 0) { + pr_err("ERROR: composefs descriptor has wrong fs-verity digest\n"); + res = -EINVAL; + goto fail; + } + } + + i_size = i_size_read(file_inode(descriptor)); + if (i_size <= CFS_DESCRIPTOR_MIN_SIZE) { + res = -EINVAL; + goto fail; + } + + /* Need this temporary ctx for cfs_read_data() */ + ctx.descriptor = descriptor; + ctx.descriptor_len = i_size; + + superblock = cfs_read_data(&ctx, CFS_SUPERBLOCK_OFFSET, + sizeof(struct cfs_superblock), + (u8 *)&superblock_buf); + if (IS_ERR(superblock)) { + res = PTR_ERR(superblock); + goto fail; + } + + ctx.vdata_offset = le64_to_cpu(superblock->vdata_offset); + + /* Some basic validation of the format */ + if (le32_to_cpu(superblock->version) != CFS_VERSION 
|| + le32_to_cpu(superblock->magic) != CFS_MAGIC || + /* vdata is in file */ + ctx.vdata_offset > ctx.descriptor_len || + ctx.vdata_offset <= CFS_INODE_TABLE_OFFSET || + /* vdata is aligned */ + ctx.vdata_offset % 4 != 0) { + res = -EFSCORRUPTED; + goto fail; + } + + num_inodes = (ctx.vdata_offset - CFS_INODE_TABLE_OFFSET) / CFS_INODE_SIZE; + if (num_inodes > U32_MAX) { + res = -EFSCORRUPTED; + goto fail; + } + ctx.num_inodes = num_inodes; + + *ctx_out = ctx; + return 0; + +fail: + fput(descriptor); + return res; +} + +void cfs_ctx_put(struct cfs_context *ctx) +{ + if (ctx->descriptor) { + fput(ctx->descriptor); + ctx->descriptor = NULL; + } +} + +static bool cfs_validate_filename(const char *name, size_t name_len) +{ + if (name_len == 0) + return false; + + if (name_len == 1 && name[0] == '.') + return false; + + if (name_len == 2 && name[0] == '.' && name[1] == '.') + return false; + + if (memchr(name, '/', name_len)) + return false; + + return true; +} + +int cfs_init_inode(struct cfs_context *ctx, u32 inode_num, struct inode *inode, + struct cfs_inode_extra_data *inode_data) +{ + struct cfs_buf vdata_buf = { NULL }; + struct cfs_inode_data *disk_data; + char *path_payload = NULL; + void *res; + int ret = 0; + u64 variable_data_off; + u32 variable_data_len; + u64 digest_off; + u32 digest_len; + u32 st_type; + u64 size; + + if (inode_num >= ctx->num_inodes) + return -EFSCORRUPTED; + + disk_data = cfs_get_inode_buf(ctx, inode_num * CFS_INODE_SIZE, + CFS_INODE_SIZE, &vdata_buf); + if (IS_ERR(disk_data)) + return PTR_ERR(disk_data); + + inode->i_ino = inode_num; + + inode->i_mode = le32_to_cpu(disk_data->st_mode); + set_nlink(inode, le32_to_cpu(disk_data->st_nlink)); + inode->i_uid = make_kuid(current_user_ns(), le32_to_cpu(disk_data->st_uid)); + inode->i_gid = make_kgid(current_user_ns(), le32_to_cpu(disk_data->st_gid)); + inode->i_rdev = le32_to_cpu(disk_data->st_rdev); + + size = le64_to_cpu(disk_data->st_size); + i_size_write(inode, size); + inode_set_bytes(inode, size); + + inode->i_mtime.tv_sec = le64_to_cpu(disk_data->st_mtim_sec); + inode->i_mtime.tv_nsec = le32_to_cpu(disk_data->st_mtim_nsec); + inode->i_ctime.tv_sec = le64_to_cpu(disk_data->st_ctim_sec); + inode->i_ctime.tv_nsec = le32_to_cpu(disk_data->st_ctim_nsec); + inode->i_atime = inode->i_mtime; + + variable_data_off = le64_to_cpu(disk_data->variable_data.off); + variable_data_len = le32_to_cpu(disk_data->variable_data.len); + + st_type = inode->i_mode & S_IFMT; + if (st_type == S_IFDIR) { + inode_data->dirents_offset = variable_data_off; + inode_data->dirents_len = variable_data_len; + } else if ((st_type == S_IFLNK || st_type == S_IFREG) && + variable_data_len > 0) { + path_payload = cfs_read_vdata_path(ctx, variable_data_off, + variable_data_len); + if (IS_ERR(path_payload)) { + ret = PTR_ERR(path_payload); + goto fail; + } + inode_data->path_payload = path_payload; + } + + if (st_type == S_IFLNK) { + /* Symbolic link must have a non-empty target */ + if (!inode_data->path_payload || *inode_data->path_payload == 0) { + ret = -EFSCORRUPTED; + goto fail; + } + } else if (st_type == S_IFREG) { + /* Regular file must have backing file except empty files */ + if ((inode_data->path_payload && size == 0) || + (!inode_data->path_payload && size > 0)) { + ret = -EFSCORRUPTED; + goto fail; + } + } + + inode_data->xattrs_offset = le64_to_cpu(disk_data->xattrs.off); + inode_data->xattrs_len = le32_to_cpu(disk_data->xattrs.len); + + if (inode_data->xattrs_len != 0) { + /* Validate xattr size */ + if (inode_data->xattrs_len < 
sizeof(struct cfs_xattr_header)) { + ret = -EFSCORRUPTED; + goto fail; + } + } + + digest_off = le64_to_cpu(disk_data->digest.off); + digest_len = le32_to_cpu(disk_data->digest.len); + + if (digest_len > 0) { + if (digest_len != SHA256_DIGEST_SIZE) { + ret = -EFSCORRUPTED; + goto fail; + } + + res = cfs_read_vdata(ctx, digest_off, digest_len, inode_data->digest); + if (IS_ERR(res)) { + ret = PTR_ERR(res); + goto fail; + } + inode_data->has_digest = true; + } + + cfs_buf_put(&vdata_buf); + return 0; + +fail: + cfs_buf_put(&vdata_buf); + return ret; +} + +ssize_t cfs_list_xattrs(struct cfs_context *ctx, + struct cfs_inode_extra_data *inode_data, char *names, + size_t size) +{ + const struct cfs_xattr_header *xattrs; + struct cfs_buf vdata_buf = { NULL }; + size_t n_xattrs = 0; + u8 *data, *data_end; + ssize_t copied = 0; + + if (inode_data->xattrs_len == 0) + return 0; + + /* xattrs_len basic size req was verified in cfs_init_inode_data */ + + xattrs = cfs_get_vdata_buf(ctx, inode_data->xattrs_offset, + inode_data->xattrs_len, &vdata_buf); + if (IS_ERR(xattrs)) + return PTR_ERR(xattrs); + + n_xattrs = le16_to_cpu(xattrs->n_attr); + if (n_xattrs == 0 || n_xattrs > CFS_MAX_XATTRS || + inode_data->xattrs_len < cfs_xattr_header_size(n_xattrs)) { + copied = -EFSCORRUPTED; + goto exit; + } + + data = ((u8 *)xattrs) + cfs_xattr_header_size(n_xattrs); + data_end = ((u8 *)xattrs) + inode_data->xattrs_len; + + for (size_t i = 0; i < n_xattrs; i++) { + const struct cfs_xattr_element *e = &xattrs->attr[i]; + u16 this_value_len = le16_to_cpu(e->value_length); + u16 this_key_len = le16_to_cpu(e->key_length); + const char *this_key; + + if (this_key_len > XATTR_NAME_MAX || + /* key and data needs to fit in data */ + data_end - data < this_key_len + this_value_len) { + copied = -EFSCORRUPTED; + goto exit; + } + + this_key = data; + data += this_key_len + this_value_len; + + if (size) { + if (size - copied < this_key_len + 1) { + copied = -E2BIG; + goto exit; + } + + memcpy(names + copied, this_key, this_key_len); + names[copied + this_key_len] = '\0'; + } + + copied += this_key_len + 1; + } + +exit: + cfs_buf_put(&vdata_buf); + + return copied; +} + +int cfs_get_xattr(struct cfs_context *ctx, struct cfs_inode_extra_data *inode_data, + const char *name, void *value, size_t size) +{ + struct cfs_xattr_header *xattrs; + struct cfs_buf vdata_buf = { NULL }; + size_t name_len = strlen(name); + size_t n_xattrs = 0; + u8 *data, *data_end; + int res; + + if (inode_data->xattrs_len == 0) + return -ENODATA; + + /* xattrs_len minimal size req was verified in cfs_init_inode_data */ + + xattrs = cfs_get_vdata_buf(ctx, inode_data->xattrs_offset, + inode_data->xattrs_len, &vdata_buf); + if (IS_ERR(xattrs)) + return PTR_ERR(xattrs); + + n_xattrs = le16_to_cpu(xattrs->n_attr); + if (n_xattrs == 0 || n_xattrs > CFS_MAX_XATTRS || + inode_data->xattrs_len < cfs_xattr_header_size(n_xattrs)) { + res = -EFSCORRUPTED; + goto exit; + } + + data = ((u8 *)xattrs) + cfs_xattr_header_size(n_xattrs); + data_end = ((u8 *)xattrs) + inode_data->xattrs_len; + + for (size_t i = 0; i < n_xattrs; i++) { + const struct cfs_xattr_element *e = &xattrs->attr[i]; + u16 this_value_len = le16_to_cpu(e->value_length); + u16 this_key_len = le16_to_cpu(e->key_length); + const char *this_key, *this_value; + + if (this_key_len > XATTR_NAME_MAX || + /* key and data needs to fit in data */ + data_end - data < this_key_len + this_value_len) { + res = -EFSCORRUPTED; + goto exit; + } + + this_key = data; + this_value = data + this_key_len; + data += 
this_key_len + this_value_len; + + if (this_key_len != name_len || memcmp(this_key, name, name_len) != 0) + continue; + + if (size > 0) { + if (size < this_value_len) { + res = -E2BIG; + goto exit; + } + memcpy(value, this_value, this_value_len); + } + + res = this_value_len; + goto exit; + } + + res = -ENODATA; + +exit: + cfs_buf_put(&vdata_buf); + return res; +} + +/* This is essentially strmcp() for non-null-terminated strings */ +static inline int memcmp2(const void *a, const size_t a_size, const void *b, + size_t b_size) +{ + size_t common_size = min(a_size, b_size); + int res; + + res = memcmp(a, b, common_size); + if (res != 0 || a_size == b_size) + return res; + + return a_size < b_size ? -1 : 1; +} + +int cfs_dir_iterate(struct cfs_context *ctx, u64 index, + struct cfs_inode_extra_data *inode_data, loff_t first, + cfs_dir_iter_cb cb, void *private) +{ + struct cfs_buf vdata_buf = { NULL }; + const struct cfs_dir_header *dir; + u32 n_dirents; + char *namedata, *namedata_end; + loff_t pos; + int res; + + if (inode_data->dirents_len == 0) + return 0; + + dir = cfs_get_vdata_buf(ctx, inode_data->dirents_offset, + inode_data->dirents_len, &vdata_buf); + if (IS_ERR(dir)) + return PTR_ERR(dir); + + n_dirents = le32_to_cpu(dir->n_dirents); + if (n_dirents == 0 || n_dirents > CFS_MAX_DIRENTS || + inode_data->dirents_len < cfs_dir_header_size(n_dirents)) { + res = -EFSCORRUPTED; + goto exit; + } + + if (first >= n_dirents) { + res = 0; + goto exit; + } + + namedata = ((u8 *)dir) + cfs_dir_header_size(n_dirents); + namedata_end = ((u8 *)dir) + inode_data->dirents_len; + pos = 0; + for (size_t i = 0; i < n_dirents; i++) { + const struct cfs_dirent *dirent = &dir->dirents[i]; + char *dirent_name = + (char *)namedata + le32_to_cpu(dirent->name_offset); + size_t dirent_name_len = dirent->name_len; + + /* name needs to fit in namedata */ + if (dirent_name >= namedata_end || + namedata_end - dirent_name < dirent_name_len) { + res = -EFSCORRUPTED; + goto exit; + } + + if (!cfs_validate_filename(dirent_name, dirent_name_len)) { + res = -EFSCORRUPTED; + goto exit; + } + + if (pos++ < first) + continue; + + if (!cb(private, dirent_name, dirent_name_len, + le32_to_cpu(dirent->inode_num), dirent->d_type)) { + break; + } + } + + res = 0; +exit: + cfs_buf_put(&vdata_buf); + return res; +} + +int cfs_dir_lookup(struct cfs_context *ctx, u64 index, + struct cfs_inode_extra_data *inode_data, const char *name, + size_t name_len, u64 *index_out) +{ + struct cfs_buf vdata_buf = { NULL }; + const struct cfs_dir_header *dir; + u32 start_dirent, end_dirent, n_dirents; + char *namedata, *namedata_end; + int cmp, res; + + if (inode_data->dirents_len == 0) + return 0; + + dir = cfs_get_vdata_buf(ctx, inode_data->dirents_offset, + inode_data->dirents_len, &vdata_buf); + if (IS_ERR(dir)) + return PTR_ERR(dir); + + n_dirents = le32_to_cpu(dir->n_dirents); + if (n_dirents == 0 || n_dirents > CFS_MAX_DIRENTS || + inode_data->dirents_len < cfs_dir_header_size(n_dirents)) { + res = -EFSCORRUPTED; + goto exit; + } + + namedata = ((u8 *)dir) + cfs_dir_header_size(n_dirents); + namedata_end = ((u8 *)dir) + inode_data->dirents_len; + + start_dirent = 0; + end_dirent = n_dirents - 1; + while (start_dirent <= end_dirent) { + int mid_dirent = start_dirent + (end_dirent - start_dirent) / 2; + const struct cfs_dirent *dirent = &dir->dirents[mid_dirent]; + char *dirent_name = + (char *)namedata + le32_to_cpu(dirent->name_offset); + size_t dirent_name_len = dirent->name_len; + + /* name needs to fit in namedata */ + if (dirent_name >= 
namedata_end || + namedata_end - dirent_name < dirent_name_len) { + res = -EFSCORRUPTED; + goto exit; + } + + cmp = memcmp2(name, name_len, dirent_name, dirent_name_len); + if (cmp == 0) { + *index_out = le32_to_cpu(dirent->inode_num); + res = 1; + goto exit; + } + + if (cmp > 0) + start_dirent = mid_dirent + 1; + else + end_dirent = mid_dirent - 1; + } + + /* not found */ + res = 0; + +exit: + cfs_buf_put(&vdata_buf); + return res; +} -- 2.39.0 ^ permalink raw reply related [flat|nested] 87+ messages in thread
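One property of the dirent chunk that is easy to miss from the reader
code alone: cfs_dir_lookup() does a binary search using memcmp2(), so
whatever writes the image must emit the dirents already sorted in
exactly that order. A userspace comparator with matching semantics
might look like the sketch below (an illustration, not taken from the
actual mkcomposefs tool):

/* qsort() comparator matching the ordering cfs_dir_lookup() expects:
 * byte-wise comparison over the common length, with the shorter name
 * sorting first on a tie.
 */
#include <stdlib.h>
#include <string.h>

struct dirent_name {
	const char *name;
	size_t len;
};

static int dirent_name_cmp(const void *a, const void *b)
{
	const struct dirent_name *da = a, *db = b;
	size_t common = da->len < db->len ? da->len : db->len;
	int r = memcmp(da->name, db->name, common);

	if (r != 0 || da->len == db->len)
		return r;
	return da->len < db->len ? -1 : 1;
}

/* Usage: qsort(names, n_names, sizeof(struct dirent_name), dirent_name_cmp); */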
* [PATCH v3 4/6] composefs: Add filesystem implementation
  2023-01-20 15:23 [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem Alexander Larsson
                   ` (2 preceding siblings ...)
  2023-01-20 15:23 ` [PATCH v3 3/6] composefs: Add descriptor parsing code Alexander Larsson
@ 2023-01-20 15:23 ` Alexander Larsson
  2023-01-20 15:23 ` [PATCH v3 5/6] composefs: Add documentation Alexander Larsson
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 87+ messages in thread
From: Alexander Larsson @ 2023-01-20 15:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-kernel, gscrivan, david, brauner, viro, Alexander Larsson

This is the basic inode and filesystem implementation.

We open a private mount for the specified base directories at the time
of mount, and all backing data files are looked up with this mount as
root, to protect against changes in the mount tables, or links
pointing outside the base directories.

Access to the backing data is done with the caller's credentials, so
the backing files need to be readable by all users of the filesystem.
We also use open_with_fake_path() to ensure the backing filename is
not visible in /proc/.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
Co-developed-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
---
 fs/composefs/cfs.c | 750 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 750 insertions(+)
 create mode 100644 fs/composefs/cfs.c

diff --git a/fs/composefs/cfs.c b/fs/composefs/cfs.c
new file mode 100644
index 000000000000..cd54d3751186
--- /dev/null
+++ b/fs/composefs/cfs.c
@@ -0,0 +1,750 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * composefs
+ *
+ * Copyright (C) 2000 Linus Torvalds.
+ *               2000 Transmeta Corp.
+ * Copyright (C) 2021 Giuseppe Scrivano
+ * Copyright (C) 2022 Alexander Larsson
+ *
+ * This file is released under the GPL. 
+ */ + +#include <linux/exportfs.h> +#include <linux/fsverity.h> +#include <linux/fs_parser.h> +#include <linux/module.h> +#include <linux/namei.h> +#include <linux/seq_file.h> +#include <linux/version.h> +#include <linux/xattr.h> +#include <linux/statfs.h> + +#include "cfs-internals.h" + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Giuseppe Scrivano <gscrivan@redhat.com>"); + +#define CFS_MAX_STACK 500 + +/* Backing file fs-verity check policy, ordered in strictness */ +enum cfs_verity_policy { + CFS_VERITY_CHECK_NONE = 0, /* Never verify digest */ + CFS_VERITY_CHECK_IF_SPECIFIED = 1, /* Verify if specified in image */ + CFS_VERITY_CHECK_REQUIRED = 2, /* Always verify, fail if not specified in image */ +}; + +#define CFS_VERITY_CHECK_MAX_POLICY 2 + +struct cfs_info { + struct cfs_context cfs_ctx; + + char *base_path; + + size_t n_bases; + struct vfsmount **bases; + + enum cfs_verity_policy verity_check; + bool has_digest; + u8 digest[SHA256_DIGEST_SIZE]; /* fs-verity digest */ +}; + +struct cfs_inode { + struct inode vfs_inode; + struct cfs_inode_extra_data inode_data; +}; + +static inline struct cfs_inode *CFS_I(struct inode *inode) +{ + return container_of(inode, struct cfs_inode, vfs_inode); +} + +static struct file empty_file; + +static const struct file_operations cfs_file_operations; + +static const struct super_operations cfs_ops; +static const struct file_operations cfs_dir_operations; +static const struct inode_operations cfs_dir_inode_operations; +static const struct inode_operations cfs_file_inode_operations; +static const struct inode_operations cfs_link_inode_operations; + +static const struct xattr_handler *cfs_xattr_handlers[]; + +static const struct address_space_operations cfs_aops = { + .direct_IO = noop_direct_IO, +}; + +static ssize_t cfs_listxattr(struct dentry *dentry, char *names, size_t size); + +/* split array of basedirs at ':', copied from overlayfs. 
*/ +static unsigned int cfs_split_basedirs(char *str) +{ + unsigned int ctr = 1; + char *s, *d; + + for (s = d = str;; s++, d++) { + if (*s == '\\') { + s++; + } else if (*s == ':') { + *d = '\0'; + ctr++; + continue; + } + *d = *s; + if (!*s) + break; + } + return ctr; +} + +static struct inode *cfs_make_inode(struct cfs_context *ctx, struct super_block *sb, + ino_t ino_num, const struct inode *dir) +{ + struct inode *inode; + struct cfs_inode *cino; + int ret; + + inode = new_inode(sb); + if (!inode) + return ERR_PTR(-ENOMEM); + + cino = CFS_I(inode); + + ret = cfs_init_inode(ctx, ino_num, inode, &cino->inode_data); + if (ret < 0) + goto fail; + + inode_init_owner(&init_user_ns, inode, dir, inode->i_mode); + inode->i_mapping->a_ops = &cfs_aops; + + switch (inode->i_mode & S_IFMT) { + case S_IFREG: + inode->i_op = &cfs_file_inode_operations; + inode->i_fop = &cfs_file_operations; + break; + case S_IFLNK: + inode->i_link = cino->inode_data.path_payload; + inode->i_op = &cfs_link_inode_operations; + inode->i_fop = &cfs_file_operations; + break; + case S_IFDIR: + inode->i_op = &cfs_dir_inode_operations; + inode->i_fop = &cfs_dir_operations; + break; + case S_IFCHR: + case S_IFBLK: + if (current_user_ns() != &init_user_ns) { + ret = -EPERM; + goto fail; + } + fallthrough; + default: + inode->i_op = &cfs_file_inode_operations; + init_special_inode(inode, inode->i_mode, inode->i_rdev); + break; + } + + return inode; + +fail: + iput(inode); + return ERR_PTR(ret); +} + +static bool cfs_iterate_cb(void *private, const char *name, int name_len, + u64 ino, unsigned int dtype) +{ + struct dir_context *ctx = private; + + if (!dir_emit(ctx, name, name_len, ino, dtype)) + return 0; + + ctx->pos++; + return 1; +} + +static int cfs_iterate(struct file *file, struct dir_context *ctx) +{ + struct inode *inode = file->f_inode; + struct cfs_info *fsi = inode->i_sb->s_fs_info; + struct cfs_inode *cino = CFS_I(inode); + + if (!dir_emit_dots(file, ctx)) + return 0; + + return cfs_dir_iterate(&fsi->cfs_ctx, inode->i_ino, &cino->inode_data, + ctx->pos - 2, cfs_iterate_cb, ctx); +} + +static struct dentry *cfs_lookup(struct inode *dir, struct dentry *dentry, + unsigned int flags) +{ + struct cfs_info *fsi = dir->i_sb->s_fs_info; + struct cfs_inode *cino = CFS_I(dir); + struct inode *inode = NULL; + u64 index; + int ret; + + if (dentry->d_name.len > NAME_MAX) + return ERR_PTR(-ENAMETOOLONG); + + ret = cfs_dir_lookup(&fsi->cfs_ctx, dir->i_ino, &cino->inode_data, + dentry->d_name.name, dentry->d_name.len, &index); + if (ret) { + if (ret < 0) + return ERR_PTR(ret); + inode = cfs_make_inode(&fsi->cfs_ctx, dir->i_sb, index, dir); + } + + return d_splice_alias(inode, dentry); +} + +static const struct file_operations cfs_dir_operations = { + .llseek = generic_file_llseek, + .read = generic_read_dir, + .iterate_shared = cfs_iterate, +}; + +static const struct inode_operations cfs_dir_inode_operations = { + .lookup = cfs_lookup, + .listxattr = cfs_listxattr, +}; + +static const struct inode_operations cfs_link_inode_operations = { + .get_link = simple_get_link, + .listxattr = cfs_listxattr, +}; + +static int digest_from_string(const char *digest_str, u8 *digest) +{ + int res; + + res = hex2bin(digest, digest_str, SHA256_DIGEST_SIZE); + if (res < 0) + return res; + + if (digest_str[2 * SHA256_DIGEST_SIZE] != 0) + return -EINVAL; /* Too long string */ + + return 0; +} + +/* + * Display the mount options in /proc/mounts. 
+ */ +static int cfs_show_options(struct seq_file *m, struct dentry *root) +{ + struct cfs_info *fsi = root->d_sb->s_fs_info; + + if (fsi->base_path) + seq_show_option(m, "basedir", fsi->base_path); + if (fsi->has_digest) + seq_printf(m, ",digest=%*phN", SHA256_DIGEST_SIZE, fsi->digest); + if (fsi->verity_check != 0) + seq_printf(m, ",verity_check=%u", fsi->verity_check); + + return 0; +} + +static struct kmem_cache *cfs_inode_cachep; + +static struct inode *cfs_alloc_inode(struct super_block *sb) +{ + struct cfs_inode *cino = alloc_inode_sb(sb, cfs_inode_cachep, GFP_KERNEL); + + if (!cino) + return NULL; + + memset(&cino->inode_data, 0, sizeof(cino->inode_data)); + + return &cino->vfs_inode; +} + +static void cfs_free_inode(struct inode *inode) +{ + struct cfs_inode *cino = CFS_I(inode); + + kfree(cino->inode_data.path_payload); + kmem_cache_free(cfs_inode_cachep, cino); +} + +static void cfs_put_super(struct super_block *sb) +{ + struct cfs_info *fsi = sb->s_fs_info; + + cfs_ctx_put(&fsi->cfs_ctx); + if (fsi->bases) { + kern_unmount_array(fsi->bases, fsi->n_bases); + kfree(fsi->bases); + } + kfree(fsi->base_path); + + kfree(fsi); +} + +static int cfs_statfs(struct dentry *dentry, struct kstatfs *buf) +{ + struct cfs_info *fsi = dentry->d_sb->s_fs_info; + int err = 0; + + /* We return the free space, etc from the first base dir. */ + if (fsi->n_bases > 0) { + struct path root = { .mnt = fsi->bases[0], + .dentry = fsi->bases[0]->mnt_root }; + err = vfs_statfs(&root, buf); + } + + if (!err) { + buf->f_namelen = NAME_MAX; + buf->f_type = dentry->d_sb->s_magic; + } + + return err; +} + +static const struct super_operations cfs_ops = { + .statfs = cfs_statfs, + .drop_inode = generic_delete_inode, + .show_options = cfs_show_options, + .put_super = cfs_put_super, + .alloc_inode = cfs_alloc_inode, + .free_inode = cfs_free_inode, +}; + +enum cfs_param { + Opt_base_path, + Opt_digest, + Opt_verity_check, +}; + +const struct fs_parameter_spec cfs_parameters[] = { + fsparam_string("basedir", Opt_base_path), + fsparam_string("digest", Opt_digest), + fsparam_u32("verity_check", Opt_verity_check), + {} +}; + +static int cfs_parse_param(struct fs_context *fc, struct fs_parameter *param) +{ + struct cfs_info *fsi = fc->s_fs_info; + struct fs_parse_result result; + int opt, r; + + opt = fs_parse(fc, cfs_parameters, param, &result); + if (opt == -ENOPARAM) + return vfs_parse_fs_param_source(fc, param); + if (opt < 0) + return opt; + + switch (opt) { + case Opt_base_path: + kfree(fsi->base_path); + /* Take ownership. 
*/ + fsi->base_path = param->string; + param->string = NULL; + break; + case Opt_digest: + r = digest_from_string(param->string, fsi->digest); + if (r < 0) + return r; + fsi->has_digest = true; + fsi->verity_check = CFS_VERITY_CHECK_REQUIRED; /* Default to full verity check */ + break; + case Opt_verity_check: + if (result.uint_32 > CFS_VERITY_CHECK_MAX_POLICY) + return invalfc(fc, "Invalid verity_check mode"); + fsi->verity_check = result.uint_32; + break; + } + + return 0; +} + +static struct vfsmount *resolve_basedir(const char *name) +{ + struct path path = {}; + struct vfsmount *mnt; + int err = -EINVAL; + + if (!*name) { + pr_err("empty basedir\n"); + return ERR_PTR(-EINVAL); + } + err = kern_path(name, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &path); + if (err) { + pr_err("failed to resolve '%s': %i\n", name, err); + return ERR_PTR(-EINVAL); + } + + mnt = clone_private_mount(&path); + path_put(&path); + if (!IS_ERR(mnt)) { + /* Don't inherit atime flags */ + mnt->mnt_flags &= ~(MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME); + } + + return mnt; +} + +static int cfs_fill_super(struct super_block *sb, struct fs_context *fc) +{ + struct cfs_info *fsi = sb->s_fs_info; + struct vfsmount **bases = NULL; + size_t numbasedirs = 0; + struct inode *inode; + struct vfsmount *mnt; + int ret; + + /* Set up the inode allocator early */ + sb->s_op = &cfs_ops; + sb->s_flags |= SB_RDONLY; + sb->s_magic = CFS_MAGIC; + sb->s_xattr = cfs_xattr_handlers; + + if (fsi->base_path) { + char *lower, *splitlower = NULL; + + ret = -ENOMEM; + splitlower = kstrdup(fsi->base_path, GFP_KERNEL); + if (!splitlower) + goto fail; + + ret = -EINVAL; + numbasedirs = cfs_split_basedirs(splitlower); + if (numbasedirs > CFS_MAX_STACK) { + pr_err("too many lower directories, limit is %d\n", + CFS_MAX_STACK); + kfree(splitlower); + goto fail; + } + + ret = -ENOMEM; + bases = kcalloc(numbasedirs, sizeof(struct vfsmount *), GFP_KERNEL); + if (!bases) { + kfree(splitlower); + goto fail; + } + + lower = splitlower; + for (size_t i = 0; i < numbasedirs; i++) { + mnt = resolve_basedir(lower); + if (IS_ERR(mnt)) { + ret = PTR_ERR(mnt); + kfree(splitlower); + goto fail; + } + bases[i] = mnt; + + lower = strchr(lower, '\0') + 1; + } + kfree(splitlower); + } + + /* Must be inited before calling cfs_get_inode. */ + ret = cfs_init_ctx(fc->source, fsi->has_digest ? 
fsi->digest : NULL, + &fsi->cfs_ctx); + if (ret < 0) + goto fail; + + inode = cfs_make_inode(&fsi->cfs_ctx, sb, CFS_ROOT_INO, NULL); + if (IS_ERR(inode)) { + ret = PTR_ERR(inode); + goto fail; + } + sb->s_root = d_make_root(inode); + + ret = -ENOMEM; + if (!sb->s_root) + goto fail; + + sb->s_maxbytes = MAX_LFS_FILESIZE; + sb->s_blocksize = PAGE_SIZE; + sb->s_blocksize_bits = PAGE_SHIFT; + + fsi->bases = bases; + fsi->n_bases = numbasedirs; + return 0; +fail: + if (bases) { + for (size_t i = 0; i < numbasedirs; i++) { + if (bases[i]) + kern_unmount(bases[i]); + } + kfree(bases); + } + cfs_ctx_put(&fsi->cfs_ctx); + return ret; +} + +static int cfs_get_tree(struct fs_context *fc) +{ + return get_tree_nodev(fc, cfs_fill_super); +} + +static const struct fs_context_operations cfs_context_ops = { + .parse_param = cfs_parse_param, + .get_tree = cfs_get_tree, +}; + +static struct file *open_base_file(struct cfs_info *fsi, struct inode *inode, + struct file *file) +{ + struct cfs_inode *cino = CFS_I(inode); + struct file *real_file; + char *real_path = cino->inode_data.path_payload; + + for (size_t i = 0; i < fsi->n_bases; i++) { + real_file = file_open_root_mnt(fsi->bases[i], real_path, + file->f_flags, 0); + if (real_file != ERR_PTR(-ENOENT)) + return real_file; + } + + return ERR_PTR(-ENOENT); +} + +static int cfs_open_file(struct inode *inode, struct file *file) +{ + struct cfs_info *fsi = inode->i_sb->s_fs_info; + struct cfs_inode *cino = CFS_I(inode); + char *real_path = cino->inode_data.path_payload; + struct file *faked_file; + struct file *real_file; + + if (file->f_flags & (O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_TRUNC)) + return -EROFS; + + if (!real_path) { + file->private_data = &empty_file; + return 0; + } + + if (fsi->verity_check >= CFS_VERITY_CHECK_REQUIRED && + !cino->inode_data.has_digest) { + pr_warn("WARNING: composefs image file '%pD' specified no fs-verity digest\n", + file); + return -EIO; + } + + real_file = open_base_file(fsi, inode, file); + + if (IS_ERR(real_file)) + return PTR_ERR(real_file); + + /* If metadata records a digest for the file, ensure it is there + * and correct before using the contents. 
+ */ + if (cino->inode_data.has_digest && + fsi->verity_check >= CFS_VERITY_CHECK_IF_SPECIFIED) { + u8 verity_digest[FS_VERITY_MAX_DIGEST_SIZE]; + enum hash_algo verity_algo; + int res; + + res = fsverity_get_digest(d_inode(real_file->f_path.dentry), + verity_digest, &verity_algo); + if (res < 0) { + pr_warn("WARNING: composefs backing file '%pD' has no fs-verity digest\n", + real_file); + fput(real_file); + return -EIO; + } + if (verity_algo != HASH_ALGO_SHA256 || + memcmp(cino->inode_data.digest, verity_digest, + SHA256_DIGEST_SIZE) != 0) { + pr_warn("WARNING: composefs backing file '%pD' has the wrong fs-verity digest\n", + real_file); + fput(real_file); + return -EIO; + } + } + + faked_file = open_with_fake_path(&file->f_path, file->f_flags, + real_file->f_inode, current_cred()); + fput(real_file); + + if (IS_ERR(faked_file)) + return PTR_ERR(faked_file); + + file->private_data = faked_file; + return 0; +} + +#ifdef CONFIG_MMU +static unsigned long cfs_mmu_get_unmapped_area(struct file *file, unsigned long addr, + unsigned long len, unsigned long pgoff, + unsigned long flags) +{ + struct file *realfile = file->private_data; + + if (realfile == &empty_file) + return 0; + + return current->mm->get_unmapped_area(file, addr, len, pgoff, flags); +} +#endif + +static int cfs_release_file(struct inode *inode, struct file *file) +{ + struct file *realfile = file->private_data; + + if (WARN_ON(!realfile)) + return -EIO; + + if (realfile == &empty_file) + return 0; + + fput(realfile); + + return 0; +} + +static int cfs_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct file *realfile = file->private_data; + int ret; + + if (realfile == &empty_file) + return 0; + + if (!realfile->f_op->mmap) + return -ENODEV; + + if (WARN_ON(file != vma->vm_file)) + return -EIO; + + vma_set_file(vma, realfile); + + ret = call_mmap(vma->vm_file, vma); + + return ret; +} + +static ssize_t cfs_read_iter(struct kiocb *iocb, struct iov_iter *iter) +{ + struct file *file = iocb->ki_filp; + struct file *realfile = file->private_data; + int ret; + + if (realfile == &empty_file) + return 0; + + if (!realfile->f_op->read_iter) + return -ENODEV; + + iocb->ki_filp = realfile; + ret = call_read_iter(realfile, iocb, iter); + iocb->ki_filp = file; + + return ret; +} + +static int cfs_fadvise(struct file *file, loff_t offset, loff_t len, int advice) +{ + struct file *realfile = file->private_data; + + if (realfile == &empty_file) + return 0; + + return vfs_fadvise(realfile, offset, len, advice); +} + +static int cfs_getxattr(const struct xattr_handler *handler, + struct dentry *unused2, struct inode *inode, + const char *name, void *value, size_t size) +{ + struct cfs_info *fsi = inode->i_sb->s_fs_info; + struct cfs_inode *cino = CFS_I(inode); + + return cfs_get_xattr(&fsi->cfs_ctx, &cino->inode_data, name, value, size); +} + +static ssize_t cfs_listxattr(struct dentry *dentry, char *names, size_t size) +{ + struct inode *inode = d_inode(dentry); + struct cfs_info *fsi = inode->i_sb->s_fs_info; + struct cfs_inode *cino = CFS_I(inode); + + return cfs_list_xattrs(&fsi->cfs_ctx, &cino->inode_data, names, size); +} + +static const struct file_operations cfs_file_operations = { + .read_iter = cfs_read_iter, + .mmap = cfs_mmap, + .fadvise = cfs_fadvise, + .fsync = noop_fsync, + .splice_read = generic_file_splice_read, + .llseek = generic_file_llseek, +#ifdef CONFIG_MMU + .get_unmapped_area = cfs_mmu_get_unmapped_area, +#endif + .release = cfs_release_file, + .open = cfs_open_file, +}; + +static const struct xattr_handler 
cfs_xattr_handler = { + .prefix = "", /* catch all */ + .get = cfs_getxattr, +}; + +static const struct xattr_handler *cfs_xattr_handlers[] = { + &cfs_xattr_handler, + NULL, +}; + +static const struct inode_operations cfs_file_inode_operations = { + .listxattr = cfs_listxattr, +}; + +static int cfs_init_fs_context(struct fs_context *fc) +{ + struct cfs_info *fsi; + + fsi = kzalloc(sizeof(*fsi), GFP_KERNEL); + if (!fsi) + return -ENOMEM; + + fc->s_fs_info = fsi; + fc->ops = &cfs_context_ops; + return 0; +} + +static struct file_system_type cfs_type = { + .owner = THIS_MODULE, + .name = "composefs", + .init_fs_context = cfs_init_fs_context, + .parameters = cfs_parameters, + .kill_sb = kill_anon_super, +}; + +static void cfs_inode_init_once(void *foo) +{ + struct cfs_inode *cino = foo; + + inode_init_once(&cino->vfs_inode); +} + +static int __init init_cfs(void) +{ + cfs_inode_cachep = kmem_cache_create( + "cfs_inode", sizeof(struct cfs_inode), 0, + (SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD | SLAB_ACCOUNT), + cfs_inode_init_once); + if (!cfs_inode_cachep) + return -ENOMEM; + + return register_filesystem(&cfs_type); +} + +static void __exit exit_cfs(void) +{ + unregister_filesystem(&cfs_type); + + /* Ensure all RCU free inodes are safe to be destroyed. */ + rcu_barrier(); + + kmem_cache_destroy(cfs_inode_cachep); +} + +module_init(init_cfs); +module_exit(exit_cfs); -- 2.39.0 ^ permalink raw reply related [flat|nested] 87+ messages in thread
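Putting the mount options from this patch together: a fully locked-down
mount pins the descriptor digest and requires every backing file to
carry a matching fs-verity digest. Note that per cfs_parse_param(),
passing digest= already defaults verity_check to 2, so the explicit
option below is only for clarity. The paths and digest value are
placeholders reused from the cover letter example, not anything the
patches define:

/* Sketch: request a composefs mount with the strictest backing-file
 * policy (verity_check=2). Paths and digest are placeholders.
 */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	const char *opts =
		"basedir=/var/lib/store/objects,"
		"digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76,"
		"verity_check=2";

	if (mount("/var/lib/store/rootfs.img", "/mnt", "composefs",
		  MS_RDONLY, opts) < 0) {
		perror("mount composefs");
		return 1;
	}
	return 0;
}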
* [PATCH v3 5/6] composefs: Add documentation 2023-01-20 15:23 [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem Alexander Larsson ` (3 preceding siblings ...) 2023-01-20 15:23 ` [PATCH v3 4/6] composefs: Add filesystem implementation Alexander Larsson @ 2023-01-20 15:23 ` Alexander Larsson 2023-01-21 2:19 ` Bagas Sanjaya 2023-01-20 15:23 ` [PATCH v3 6/6] composefs: Add kconfig and build support Alexander Larsson 2023-01-20 19:44 ` [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem Amir Goldstein 6 siblings, 1 reply; 87+ messages in thread From: Alexander Larsson @ 2023-01-20 15:23 UTC (permalink / raw) To: linux-fsdevel Cc: linux-kernel, gscrivan, david, brauner, viro, Alexander Larsson, linux-doc Add documentation about the composefs filesystem and how to use it. Signed-off-by: Alexander Larsson <alexl@redhat.com> --- Documentation/filesystems/composefs.rst | 159 ++++++++++++++++++++++++ Documentation/filesystems/index.rst | 1 + 2 files changed, 160 insertions(+) create mode 100644 Documentation/filesystems/composefs.rst diff --git a/Documentation/filesystems/composefs.rst b/Documentation/filesystems/composefs.rst new file mode 100644 index 000000000000..f270a66f4204 --- /dev/null +++ b/Documentation/filesystems/composefs.rst @@ -0,0 +1,159 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +Composefs Filesystem +==================== + +Introduction +============ + +Composefs is a read-only file system that is backed by regular files +(rather than a block device). It is designed to help easily share +content between different directory trees, such as container images in +a local store or ostree checkouts. In addition, it has support for +integrity validation of file content and directory metadata, in an +efficient way (using fs-verity). + +The filesystem mount source is a binary blob called the descriptor. It +contains all the inode and directory entry data for the entire +filesystem. However, instead of storing the file content, each regular +file inode stores a relative path name, and the filesystem gets the +file content by looking up that path name in a set +of base directories. + +Given such a descriptor called "image.cfs" and a directory with files +called "/dir", you can mount it like:: + + mount -t composefs image.cfs -o basedir=/dir /mnt + +Content sharing +=============== + +Suppose you have a single basedir where the files are content +addressed (i.e. named by content digest), and a set of composefs +descriptors using this basedir. Any file that happens to be shared +between two images (same content, so same digest) will now only be +stored once on the disk. + +Such sharing is possible even if the metadata for the file in the +image differs (common reasons for metadata difference are mtime, +permissions, xattrs, etc.). The sharing is also anonymous in the sense +that you can't tell a shared file in the mounted tree from a +non-shared file (for example by looking at the link count, as you could for a +hardlinked file). + +In addition, any shared files that are actively in use will share +page cache, because the page cache for the file contents will be +addressed by the backing file in the basedir. This means (for example) +that shared libraries between images will only be mmapped once across +all mounts.
+ +Integrity validation +==================== + +Composefs uses :doc:`fs-verity <fsverity>` for integrity validation, +and extends it by making the validation also apply to the directory +metadata. This happens on two levels: validation of the descriptor +and validation of the backing files. + +For descriptor validation, the idea is that you enable fs-verity on +the descriptor file, which seals it against changes that would affect the +directory metadata. Additionally, you can pass a "digest" mount option, +which composefs verifies against the fs-verity digest of the descriptor. Such +an option could be embedded in a trusted source (like a signed kernel +command line) and be used as a root of trust if using composefs for the +root filesystem. + +For file validation, the descriptor can contain digests for each +backing file, and you can enable fs-verity on them too. Composefs will +validate the digests before using the backing files. This means any +(accidental or malicious) modification of the basedir will be detected +at the time the file is used. + +Expected use-cases +================== + +Container Image Storage +``````````````````````` + +Typically a container image is stored as a set of "layer" directories, +merged into one mount by using overlayfs. The lower layers are +the read-only image and the upper layer is the writable directory of a +running container. Multiple uses of the same layer can be shared this +way, but it is hard to share individual files between unrelated layers. + +Using composefs, we can instead use a shared, content-addressed +store for all the images in the system, and use composefs +for the read-only image of each container, pointing into the +shared store. Then for a running container we use an overlayfs +with the lower dir being the composefs mount and the upper dir being +the writable directory. + + +Ostree root filesystem validation +````````````````````````````````` + +Ostree uses a content-addressed on-disk store for file content, +allowing efficient updates and sharing of content. However, to actually +use this as a root filesystem, it needs to create a real +"chroot-style" directory containing hard links into the store. The +store itself is validated when created, but once the hard-link +directory is created, nothing validates the directory structure for +post-creation changes. + +Instead of a chroot we can use composefs. A composefs image pointing +to the object store is created, then fs-verity is enabled for +everything and the descriptor digest is encoded in the +kernel command line. This allows booting a trusted system where +all directory metadata and file content is validated lazily on use. + + +Mount options +============= + +basedir + A colon-separated list of directories to use as a base when resolving + relative content paths. + +verity_check=[0,1,2] + When to verify backing file fs-verity: + + * 0: never verify + * 1: verify if a digest is specified in the image + * 2: always verify the file (and require digests in the image) + +digest + An fs-verity sha256 digest that the descriptor file must match. If set, + "verity_check" defaults to 2. + + +Filesystem format +================= + +The descriptor consists of three sections: superblock, +inodes and variable data. All data in the file is stored in +little-endian form. + +The superblock starts at the beginning of the file and contains the +version, magic value, and offsets to the variable data section. + +The inode table starts at a fixed location right after the +header. It is an array of fixed size inode data.
The first inode +is the root inode, and inode numbers are indexes into this array. + +The variable data section is stored after the inode section, and you +can find it from the offset in the header. It contains paths, digests, +dirents and xattr data. The xattrs are referred to by offset and size +in the xattr attribute in the inode data. Each xattr data block can be used +by many inodes in the filesystem. + +For more details, see cfs.h. + +Tools +===== + +Tools for composefs can be found at https://github.com/containers/composefs + +There is a mkcomposefs tool, which can be used to create images on the +CLI, and a library that applications can use to create composefs +images. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index bee63d42e5ec..9b7cf136755d 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -75,6 +75,7 @@ Documentation for filesystem implementations. cifs/index ceph coda + composefs configfs cramfs dax -- 2.39.0 ^ permalink raw reply related [flat|nested] 87+ messages in thread
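Pulling the mount options above together, here is a minimal sketch of a fully verified mount that resolves content from a local object store first and falls back to a second store; the paths and the digest placeholder are illustrative assumptions, not values from a real run:

# fsverity enable image.cfs
# mount -t composefs image.cfs -o basedir=/local/objects:/backup/objects,verity_check=2,digest=<descriptor-fsverity-digest> /mnt

With verity_check=2, every backing file must have a digest recorded in the image and fs-verity enabled on disk, so a missing or tampered backing file fails at open time with an I/O error instead of being silently trusted.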
* Re: [PATCH v3 5/6] composefs: Add documentation 2023-01-20 15:23 ` [PATCH v3 5/6] composefs: Add documentation Alexander Larsson @ 2023-01-21 2:19 ` Bagas Sanjaya 0 siblings, 0 replies; 87+ messages in thread From: Bagas Sanjaya @ 2023-01-21 2:19 UTC (permalink / raw) To: Alexander Larsson, linux-fsdevel Cc: linux-kernel, gscrivan, david, brauner, viro, linux-doc [-- Attachment #1: Type: text/plain, Size: 285 bytes --] On Fri, Jan 20, 2023 at 04:23:33PM +0100, Alexander Larsson wrote: > +For more details, see cfs.h. > + "See a code comment describing the descriptor file layout in fs/composefs/cfs.h for details." Otherwise LGTM. -- An old man doll... just what I always wanted! - Clara [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 87+ messages in thread
* [PATCH v3 6/6] composefs: Add kconfig and build support 2023-01-20 15:23 [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem Alexander Larsson ` (4 preceding siblings ...) 2023-01-20 15:23 ` [PATCH v3 5/6] composefs: Add documentation Alexander Larsson @ 2023-01-20 15:23 ` Alexander Larsson 2023-01-20 19:44 ` [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem Amir Goldstein 6 siblings, 0 replies; 87+ messages in thread From: Alexander Larsson @ 2023-01-20 15:23 UTC (permalink / raw) To: linux-fsdevel Cc: linux-kernel, gscrivan, david, brauner, viro, Alexander Larsson This commit adds Makefile and Kconfig for composefs, and updates Makefile and Kconfig files in the fs directory Signed-off-by: Alexander Larsson <alexl@redhat.com> --- fs/Kconfig | 1 + fs/Makefile | 1 + fs/composefs/Kconfig | 18 ++++++++++++++++++ fs/composefs/Makefile | 5 +++++ 4 files changed, 25 insertions(+) create mode 100644 fs/composefs/Kconfig create mode 100644 fs/composefs/Makefile diff --git a/fs/Kconfig b/fs/Kconfig index 2685a4d0d353..de8493fc2b1e 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -127,6 +127,7 @@ source "fs/quota/Kconfig" source "fs/autofs/Kconfig" source "fs/fuse/Kconfig" source "fs/overlayfs/Kconfig" +source "fs/composefs/Kconfig" menu "Caches" diff --git a/fs/Makefile b/fs/Makefile index 4dea17840761..d16974e02468 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -137,3 +137,4 @@ obj-$(CONFIG_EFIVAR_FS) += efivarfs/ obj-$(CONFIG_EROFS_FS) += erofs/ obj-$(CONFIG_VBOXSF_FS) += vboxsf/ obj-$(CONFIG_ZONEFS_FS) += zonefs/ +obj-$(CONFIG_COMPOSEFS_FS) += composefs/ diff --git a/fs/composefs/Kconfig b/fs/composefs/Kconfig new file mode 100644 index 000000000000..88c5b55380e6 --- /dev/null +++ b/fs/composefs/Kconfig @@ -0,0 +1,18 @@ +# SPDX-License-Identifier: GPL-2.0-only + +config COMPOSEFS_FS + tristate "Composefs filesystem support" + select EXPORTFS + help + Composefs is a filesystem that allows combining file content from + existing regular files with a metadata directory structure from + a separate binary file. This is useful to share file content between + many different directory trees, such as in a local container image store. + + Composefs also allows using fs-verity to validate the content of the + content-files as well as the metadata file which allows dm-verity + like validation with the flexibility of regular files. + + For more information see Documentation/filesystems/composefs.rst + + If unsure, say N. diff --git a/fs/composefs/Makefile b/fs/composefs/Makefile new file mode 100644 index 000000000000..eac8445e7d25 --- /dev/null +++ b/fs/composefs/Makefile @@ -0,0 +1,5 @@ +# SPDX-License-Identifier: GPL-2.0-only + +obj-$(CONFIG_COMPOSEFS_FS) += composefs.o + +composefs-objs += cfs-reader.o cfs.o -- 2.39.0 ^ permalink raw reply related [flat|nested] 87+ messages in thread
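As a practical note on the Kconfig and Makefile bits above, building and loading the module follows the usual in-tree workflow; a minimal sketch, assuming a kernel tree with this series applied that is otherwise configured and built for the running system:

# scripts/config --module COMPOSEFS_FS
# make olddefconfig
# make -j$(nproc) modules
# make modules_install
# modprobe composefs
# grep composefs /proc/filesystems

The last command should list "composefs" once the filesystem type has been registered.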
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-20 15:23 [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem Alexander Larsson ` (5 preceding siblings ...) 2023-01-20 15:23 ` [PATCH v3 6/6] composefs: Add kconfig and build support Alexander Larsson @ 2023-01-20 19:44 ` Amir Goldstein 2023-01-20 22:18 ` Giuseppe Scrivano 2023-01-23 17:56 ` Alexander Larsson 6 siblings, 2 replies; 87+ messages in thread From: Amir Goldstein @ 2023-01-20 19:44 UTC (permalink / raw) To: Alexander Larsson Cc: linux-fsdevel, linux-kernel, gscrivan, david, brauner, viro, Vivek Goyal, Miklos Szeredi On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <alexl@redhat.com> wrote: > > Giuseppe Scrivano and I have recently been working on a new project we > call composefs. This is the first time we propose this publically and > we would like some feedback on it. > > At its core, composefs is a way to construct and use read only images > that are used similar to how you would use e.g. loop-back mounted > squashfs images. On top of this composefs has two fundamental > features. First it allows sharing of file data (both on disk and in > page cache) between images, and secondly it has dm-verity like > validation on read. > > Let me first start with a minimal example of how this can be used, > before going into the details: > > Suppose we have this source for an image: > > rootfs/ > ├── dir > │ └── another_a > ├── file_a > └── file_b > > We can then use this to generate an image file and a set of > content-addressed backing files: > > # mkcomposefs --digest-store=objects rootfs/ rootfs.img > # ls -l rootfs.img objects/*/* > -rw-------. 1 root root 10 Nov 18 13:20 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 > -rw-------. 1 root root 10 Nov 18 13:20 objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f > -rw-r--r--. 1 root root 4228 Nov 18 13:20 rootfs.img > > The rootfs.img file contains all information about directory and file > metadata plus references to the backing files by name. We can now > mount this and look at the result: > > # mount -t composefs rootfs.img -o basedir=objects /mnt > # ls /mnt/ > dir file_a file_b > # cat /mnt/file_a > content_a > > When reading this file the kernel is actually reading the backing > file, in a fashion similar to overlayfs. Since the backing file is > content-addressed, the objects directory can be shared for multiple > images, and any files that happen to have the same content are > shared. I refer to this as opportunistic sharing, as it is different > than the more course-grained explicit sharing used by e.g. container > base images. > > The next step is the validation. Note how the object files have > fs-verity enabled. In fact, they are named by their fs-verity digest: > > # fsverity digest objects/*/* > sha256:02927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 > sha256:cc3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f > > The generated filesystm image may contain the expected digest for the > backing files. 
When the backing file digest is incorrect, the open > will fail, and if the open succeeds, any other on-disk file-changes > will be detected by fs-verity: > > # cat objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f > content_a > # rm -f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f > # echo modified > objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f > # cat /mnt/file_a > WARNING: composefs backing file '3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f' unexpectedly had no fs-verity digest > cat: /mnt/file_a: Input/output error > > This re-uses the existing fs-verity functionallity to protect against > changes in file contents, while adding on top of it protection against > changes in filesystem metadata and structure. I.e. protecting against > replacing a fs-verity enabled file or modifying file permissions or > xattrs. > > To be fully verified we need another step: we use fs-verity on the > image itself. Then we pass the expected digest on the mount command > line (which will be verified at mount time): > > # fsverity enable rootfs.img > # fsverity digest rootfs.img > sha256:da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 rootfs.img > # mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt > > So, given a trusted set of mount options (say unlocked from TPM), we > have a fully verified filesystem tree mounted, with opportunistic > finegrained sharing of identical files. > > So, why do we want this? There are two initial users. First of all we > want to use the opportunistic sharing for the podman container image > baselayer. The idea is to use a composefs mount as the lower directory > in an overlay mount, with the upper directory being the container work > dir. This will allow automatical file-level disk and page-cache > sharning between any two images, independent of details like the > permissions and timestamps of the files. > > Secondly we are interested in using the verification aspects of > composefs in the ostree project. Ostree already supports a > content-addressed object store, but it is currently referenced by > hardlink farms. The object store and the trees that reference it are > signed and verified at download time, but there is no runtime > verification. If we replace the hardlink farm with a composefs image > that points into the existing object store we can use the verification > to implement runtime verification. > > In fact, the tooling to create composefs images is 100% reproducible, > so all we need is to add the composefs image fs-verity digest into the > ostree commit. Then the image can be reconstructed from the ostree > commit info, generating a file with the same fs-verity digest. > > These are the usecases we're currently interested in, but there seems > to be a breadth of other possible uses. For example, many systems use > loopback mounts for images (like lxc or snap), and these could take > advantage of the opportunistic sharing. We've also talked about using > fuse to implement a local cache for the backing files. I.e. you would > have the second basedir be a fuse filesystem. On lookup failure in the > first basedir it downloads the file and saves it in the first basedir > for later lookups. There are many interesting possibilities here. > > The patch series contains some documentation on the file format and > how to use the filesystem. 
> > The userspace tools (and a standalone kernel module) is available > here: > https://github.com/containers/composefs > > Initial work on ostree integration is here: > https://github.com/ostreedev/ostree/pull/2640 > > Changes since v2: > - Simplified filesystem format to use fixed size inodes. This resulted > in simpler (now < 2k lines) code as well as higher performance at > the cost of slightly (~40%) larger images. > - We now use multi-page mappings from the page cache, which removes > limits on sizes of xattrs and makes the dirent handling code simpler. > - Added more documentation about the on-disk file format. > - General cleanups based on review comments. > Hi Alexander, I must say that I am a little bit puzzled by this v3. Gao, Christian and myself asked you questions on v2 that are not mentioned in v3 at all. To sum it up, please do not propose composefs without explaining what are the barriers for achieving the exact same outcome with the use of a read-only overlayfs with two lower layer - uppermost with erofs containing the metadata files, which include trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer to the lowermost layer containing the content files. Any current functionality gap in erofs and/or in overlayfs cannot be considered as a reason to maintain a new filesystem driver unless you come up with an explanation why closing that functionality gap is not possible or why the erofs+overlayfs alternative would be inferior to maintaining a new filesystem driver. From the conversations so far, it does not seem like Gao thinks that the functionality gap in erofs cannot be closed and I don't see why the functionality gap in overlayfs cannot be closed. Are we missing something? Thanks, Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
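For readers trying to picture the alternative Amir describes: it amounts to mounting a data-less metadata layer on top of the content store, read-only, with overlayfs following the metacopy/redirect xattrs. A rough, untested sketch with assumed image and directory names (option spellings as in the overlayfs documentation; the exact recipe may differ):

# mount -t erofs metadata.erofs /layers/meta
# mount -t overlay overlay -o metacopy=on,redirect_dir=follow,lowerdir=/layers/meta:/objects /mnt

Here /layers/meta carries only metadata plus trusted.overlay.metacopy and trusted.overlay.redirect xattrs pointing at content files in /objects, so reads are redirected to the lowermost layer in a way broadly similar to how composefs redirects to its basedir.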
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-20 19:44 ` [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem Amir Goldstein @ 2023-01-20 22:18 ` Giuseppe Scrivano 2023-01-21 3:08 ` Gao Xiang 2023-01-21 10:57 ` Amir Goldstein 2023-01-23 17:56 ` Alexander Larsson 1 sibling, 2 replies; 87+ messages in thread From: Giuseppe Scrivano @ 2023-01-20 22:18 UTC (permalink / raw) To: Amir Goldstein Cc: Alexander Larsson, linux-fsdevel, linux-kernel, david, brauner, viro, Vivek Goyal, Miklos Szeredi Hi Amir, Amir Goldstein <amir73il@gmail.com> writes: > On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <alexl@redhat.com> wrote: >> >> Giuseppe Scrivano and I have recently been working on a new project we >> call composefs. This is the first time we propose this publically and >> we would like some feedback on it. >> >> At its core, composefs is a way to construct and use read only images >> that are used similar to how you would use e.g. loop-back mounted >> squashfs images. On top of this composefs has two fundamental >> features. First it allows sharing of file data (both on disk and in >> page cache) between images, and secondly it has dm-verity like >> validation on read. >> >> Let me first start with a minimal example of how this can be used, >> before going into the details: >> >> Suppose we have this source for an image: >> >> rootfs/ >> ├── dir >> │ └── another_a >> ├── file_a >> └── file_b >> >> We can then use this to generate an image file and a set of >> content-addressed backing files: >> >> # mkcomposefs --digest-store=objects rootfs/ rootfs.img >> # ls -l rootfs.img objects/*/* >> -rw-------. 1 root root 10 Nov 18 13:20 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 >> -rw-------. 1 root root 10 Nov 18 13:20 objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f >> -rw-r--r--. 1 root root 4228 Nov 18 13:20 rootfs.img >> >> The rootfs.img file contains all information about directory and file >> metadata plus references to the backing files by name. We can now >> mount this and look at the result: >> >> # mount -t composefs rootfs.img -o basedir=objects /mnt >> # ls /mnt/ >> dir file_a file_b >> # cat /mnt/file_a >> content_a >> >> When reading this file the kernel is actually reading the backing >> file, in a fashion similar to overlayfs. Since the backing file is >> content-addressed, the objects directory can be shared for multiple >> images, and any files that happen to have the same content are >> shared. I refer to this as opportunistic sharing, as it is different >> than the more course-grained explicit sharing used by e.g. container >> base images. >> >> The next step is the validation. Note how the object files have >> fs-verity enabled. In fact, they are named by their fs-verity digest: >> >> # fsverity digest objects/*/* >> sha256:02927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 >> sha256:cc3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f >> >> The generated filesystm image may contain the expected digest for the >> backing files. 
When the backing file digest is incorrect, the open >> will fail, and if the open succeeds, any other on-disk file-changes >> will be detected by fs-verity: >> >> # cat objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f >> content_a >> # rm -f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f >> # echo modified > objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f >> # cat /mnt/file_a >> WARNING: composefs backing file '3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f' unexpectedly had no fs-verity digest >> cat: /mnt/file_a: Input/output error >> >> This re-uses the existing fs-verity functionallity to protect against >> changes in file contents, while adding on top of it protection against >> changes in filesystem metadata and structure. I.e. protecting against >> replacing a fs-verity enabled file or modifying file permissions or >> xattrs. >> >> To be fully verified we need another step: we use fs-verity on the >> image itself. Then we pass the expected digest on the mount command >> line (which will be verified at mount time): >> >> # fsverity enable rootfs.img >> # fsverity digest rootfs.img >> sha256:da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 rootfs.img >> # mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt >> >> So, given a trusted set of mount options (say unlocked from TPM), we >> have a fully verified filesystem tree mounted, with opportunistic >> finegrained sharing of identical files. >> >> So, why do we want this? There are two initial users. First of all we >> want to use the opportunistic sharing for the podman container image >> baselayer. The idea is to use a composefs mount as the lower directory >> in an overlay mount, with the upper directory being the container work >> dir. This will allow automatical file-level disk and page-cache >> sharning between any two images, independent of details like the >> permissions and timestamps of the files. >> >> Secondly we are interested in using the verification aspects of >> composefs in the ostree project. Ostree already supports a >> content-addressed object store, but it is currently referenced by >> hardlink farms. The object store and the trees that reference it are >> signed and verified at download time, but there is no runtime >> verification. If we replace the hardlink farm with a composefs image >> that points into the existing object store we can use the verification >> to implement runtime verification. >> >> In fact, the tooling to create composefs images is 100% reproducible, >> so all we need is to add the composefs image fs-verity digest into the >> ostree commit. Then the image can be reconstructed from the ostree >> commit info, generating a file with the same fs-verity digest. >> >> These are the usecases we're currently interested in, but there seems >> to be a breadth of other possible uses. For example, many systems use >> loopback mounts for images (like lxc or snap), and these could take >> advantage of the opportunistic sharing. We've also talked about using >> fuse to implement a local cache for the backing files. I.e. you would >> have the second basedir be a fuse filesystem. On lookup failure in the >> first basedir it downloads the file and saves it in the first basedir >> for later lookups. There are many interesting possibilities here. 
>> >> The patch series contains some documentation on the file format and >> how to use the filesystem. >> >> The userspace tools (and a standalone kernel module) is available >> here: >> https://github.com/containers/composefs >> >> Initial work on ostree integration is here: >> https://github.com/ostreedev/ostree/pull/2640 >> >> Changes since v2: >> - Simplified filesystem format to use fixed size inodes. This resulted >> in simpler (now < 2k lines) code as well as higher performance at >> the cost of slightly (~40%) larger images. >> - We now use multi-page mappings from the page cache, which removes >> limits on sizes of xattrs and makes the dirent handling code simpler. >> - Added more documentation about the on-disk file format. >> - General cleanups based on review comments. >> > > Hi Alexander, > > I must say that I am a little bit puzzled by this v3. > Gao, Christian and myself asked you questions on v2 > that are not mentioned in v3 at all. > > To sum it up, please do not propose composefs without explaining > what are the barriers for achieving the exact same outcome with > the use of a read-only overlayfs with two lower layer - > uppermost with erofs containing the metadata files, which include > trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer > to the lowermost layer containing the content files. I think Dave explained quite well why using overlay is not comparable to what composefs does. One big difference is that overlay still requires at least a syscall for each file in the image, and then we need the equivalent of "rm -rf" to clean it up. It is somehow acceptable for long-running services, but it is not for "serverless" containers where images/containers are created and destroyed frequently. So even in the case we already have all the image files available locally, we still need to create a checkout with the final structure we need for the image. I also don't see how overlay would solve the verified image problem. We would have the same problem we have today with fs-verity as it can only validate a single file but not the entire directory structure. Changes that affect the layer containing the trusted.overlay.{metacopy,redirect} xattrs won't be noticed. There are at the moment two ways to handle container images, both somehow guided by the available file systems in the kernel. - A single image mounted as a block device. - A list of tarballs (OCI image) that are unpacked and mounted as overlay layers. One big advantage of the block devices model is that you can use dm-verity, this is something we miss today with OCI container images that use overlay. What we are proposing with composefs is a way to have "dm-verity" style validation based on fs-verity and the possibility to share individual files instead of layers. These files can also be on different file systems, which is something not possible with the block device model. The composefs manifest blob could be generated remotely and signed. A client would need just to validate the signature for the manifest blob and from there retrieve the files that are not in the local CAS (even from an insecure source) and mount directly the manifest file. Regards, Giuseppe > Any current functionality gap in erofs and/or in overlayfs > cannot be considered as a reason to maintain a new filesystem > driver unless you come up with an explanation why closing that > functionality gap is not possible or why the erofs+overlayfs alternative > would be inferior to maintaining a new filesystem driver. 
> > From the conversations so far, it does not seem like Gao thinks > that the functionality gap in erofs cannot be closed and I don't > see why the functionality gap in overlayfs cannot be closed. > > Are we missing something? > > Thanks, > Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-20 22:18 ` Giuseppe Scrivano @ 2023-01-21 3:08 ` Gao Xiang 2023-01-21 16:19 ` Giuseppe Scrivano 2023-01-21 10:57 ` Amir Goldstein 1 sibling, 1 reply; 87+ messages in thread From: Gao Xiang @ 2023-01-21 3:08 UTC (permalink / raw) To: Giuseppe Scrivano, Amir Goldstein Cc: Alexander Larsson, linux-fsdevel, linux-kernel, david, brauner, viro, Vivek Goyal, Miklos Szeredi, Linus Torvalds On 2023/1/21 06:18, Giuseppe Scrivano wrote: > Hi Amir, > > Amir Goldstein <amir73il@gmail.com> writes: > >> On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <alexl@redhat.com> wrote: ... >>> >> >> Hi Alexander, >> >> I must say that I am a little bit puzzled by this v3. >> Gao, Christian and myself asked you questions on v2 >> that are not mentioned in v3 at all. >> >> To sum it up, please do not propose composefs without explaining >> what are the barriers for achieving the exact same outcome with >> the use of a read-only overlayfs with two lower layer - >> uppermost with erofs containing the metadata files, which include >> trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer >> to the lowermost layer containing the content files. > > I think Dave explained quite well why using overlay is not comparable to > what composefs does. > > One big difference is that overlay still requires at least a syscall for > each file in the image, and then we need the equivalent of "rm -rf" to > clean it up. It is somehow acceptable for long-running services, but it > is not for "serverless" containers where images/containers are created > and destroyed frequently. So even in the case we already have all the > image files available locally, we still need to create a checkout with > the final structure we need for the image. > > I also don't see how overlay would solve the verified image problem. We > would have the same problem we have today with fs-verity as it can only > validate a single file but not the entire directory structure. Changes > that affect the layer containing the trusted.overlay.{metacopy,redirect} > xattrs won't be noticed. > > There are at the moment two ways to handle container images, both somehow > guided by the available file systems in the kernel. > > - A single image mounted as a block device. > - A list of tarballs (OCI image) that are unpacked and mounted as > overlay layers. > > One big advantage of the block devices model is that you can use > dm-verity, this is something we miss today with OCI container images > that use overlay. > > What we are proposing with composefs is a way to have "dm-verity" style > validation based on fs-verity and the possibility to share individual > files instead of layers. These files can also be on different file > systems, which is something not possible with the block device model. That is not a new idea honestly, including chain of trust. Even laterly out-of-tree incremental fs using fs-verity for this as well, except that it's in a real self-contained way. > > The composefs manifest blob could be generated remotely and signed. A > client would need just to validate the signature for the manifest blob > and from there retrieve the files that are not in the local CAS (even > from an insecure source) and mount directly the manifest file. Back to the topic, after thinking something I have to make a compliment for reference. 
First, EROFS had the same internal dissussion and decision at that time almost _two years ago_ (June 2021), it means: a) Some internal people really suggested EROFS could develop an entire new file-based in-kernel local cache subsystem (as you called local CAS, whatever) with stackable file interface so that the exist Nydus image service [1] (as ostree, and maybe ostree can use it as well) don't need to modify anything to use exist blobs; b) Reuse exist fscache/cachefiles; The reason why we (especially me) finally selected b) because: - see the people discussion of Google's original Incremental FS topic [2] [3] in 2019, as Amir already mentioned. At that time all fs folks really like to reuse exist subsystem for in-kernel caching rather than reinvent another new in-kernel wheel for local cache. [ Reinventing a new wheel is not hard (fs or caching), just makes Linux more fragmented. Especially a new filesystem is just proposed to generate images full of massive massive new magical symlinks with *overriden* uid/gid/permissions to replace regular files. ] - in-kernel cache implementation usually met several common potential security issues; reusing exist subsystem can make all fses addressed them and benefited from it. - Usually an exist widely-used userspace implementation is never an excuse for a new in-kernel feature. Although David Howells is always quite busy these months to develop new netfs interface, otherwise (we think) we should already support failover, multiple daemon/dirs, daemonless and more. I know that you guys repeatedly say it's a self-contained stackable fs and has few code (the same words as Incfs folks [3] said four years ago already), four reasons make it weak IMHO: - I think core EROFS is about 2~3 kLOC as well if compression, sysfs and fscache are all code-truncated. Also, it's always welcome that all people could submit patches for cleaning up. I always do such cleanups from time to time and makes it better. - "Few code lines" is somewhat weak because people do develop new features, layout after upstream. Such claim is usually _NOT_ true in the future if you guys do more to optimize performance, new layout or even do your own lazy pulling with your local CAS codebase in the future unless you *promise* you once dump the code, and do bugfix only like Christian said [4]. From LWN.net comments, I do see the opposite possibility that you'd like to develop new features later. - In the past, all in-tree kernel filesystems were designed and implemented without some user-space specific indication, including Nydus and ostree (I did see a lot of discussion between folks before in ociv2 brainstorm [5]). That is why EROFS selected exist in-kernel fscache and made userspace Nydus adapt it: even (here called) manifest on-disk format --- EROFS call primary device --- they call Nydus bootstrap; I'm not sure why it becomes impossible for ... ($$$$). In addition, if fscache is used, it can also use fsverity_get_digest() to enable fsverity for non-on-demand files. But again I think even Google's folks think that is (somewhat) broken so that they added fs-verity to its incFS in a self-contained way in Feb 2021 [6]. Finally, again, I do hope a LSF/MM discussion for this new overlay model (full of massive magical symlinks to override permission.) 
[1] https://github.com/dragonflyoss/image-service [2] https://lore.kernel.org/r/CAK8JDrFZW1jwOmhq+YVDPJi9jWWrCRkwpqQ085EouVSyzw-1cg@mail.gmail.com/ [3] https://lore.kernel.org/r/CAK8JDrGRzA+yphpuX+GQ0syRwF_p2Fora+roGCnYqB5E1eOmXA@mail.gmail.com/ [4] https://lore.kernel.org/r/20230117101202.4v4zxuj2tbljogbx@wittgenstein/ [5] https://hackmd.io/@cyphar/ociv2-brainstorm [6] https://android-review.googlesource.com/c/kernel/common/+/1444521 Thanks, Gao Xiang > > Regards, > Giuseppe > >> Any current functionality gap in erofs and/or in overlayfs >> cannot be considered as a reason to maintain a new filesystem >> driver unless you come up with an explanation why closing that >> functionality gap is not possible or why the erofs+overlayfs alternative >> would be inferior to maintaining a new filesystem driver. >> >> From the conversations so far, it does not seem like Gao thinks >> that the functionality gap in erofs cannot be closed and I don't >> see why the functionality gap in overlayfs cannot be closed. >> >> Are we missing something? >> >> Thanks, >> Amir. > ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-21 3:08 ` Gao Xiang @ 2023-01-21 16:19 ` Giuseppe Scrivano 2023-01-21 17:15 ` Gao Xiang 0 siblings, 1 reply; 87+ messages in thread From: Giuseppe Scrivano @ 2023-01-21 16:19 UTC (permalink / raw) To: Gao Xiang Cc: Amir Goldstein, Alexander Larsson, linux-fsdevel, linux-kernel, david, brauner, viro, Vivek Goyal, Miklos Szeredi, Linus Torvalds Gao Xiang <hsiangkao@linux.alibaba.com> writes: > On 2023/1/21 06:18, Giuseppe Scrivano wrote: >> Hi Amir, >> Amir Goldstein <amir73il@gmail.com> writes: >> >>> On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <alexl@redhat.com> wrote: > > ... > >>>> >>> >>> Hi Alexander, >>> >>> I must say that I am a little bit puzzled by this v3. >>> Gao, Christian and myself asked you questions on v2 >>> that are not mentioned in v3 at all. >>> >>> To sum it up, please do not propose composefs without explaining >>> what are the barriers for achieving the exact same outcome with >>> the use of a read-only overlayfs with two lower layer - >>> uppermost with erofs containing the metadata files, which include >>> trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer >>> to the lowermost layer containing the content files. >> I think Dave explained quite well why using overlay is not >> comparable to >> what composefs does. >> One big difference is that overlay still requires at least a syscall >> for >> each file in the image, and then we need the equivalent of "rm -rf" to >> clean it up. It is somehow acceptable for long-running services, but it >> is not for "serverless" containers where images/containers are created >> and destroyed frequently. So even in the case we already have all the >> image files available locally, we still need to create a checkout with >> the final structure we need for the image. >> I also don't see how overlay would solve the verified image problem. >> We >> would have the same problem we have today with fs-verity as it can only >> validate a single file but not the entire directory structure. Changes >> that affect the layer containing the trusted.overlay.{metacopy,redirect} >> xattrs won't be noticed. >> There are at the moment two ways to handle container images, both >> somehow >> guided by the available file systems in the kernel. >> - A single image mounted as a block device. >> - A list of tarballs (OCI image) that are unpacked and mounted as >> overlay layers. >> One big advantage of the block devices model is that you can use >> dm-verity, this is something we miss today with OCI container images >> that use overlay. >> What we are proposing with composefs is a way to have "dm-verity" >> style >> validation based on fs-verity and the possibility to share individual >> files instead of layers. These files can also be on different file >> systems, which is something not possible with the block device model. > > That is not a new idea honestly, including chain of trust. Even laterly > out-of-tree incremental fs using fs-verity for this as well, except that > it's in a real self-contained way. > >> The composefs manifest blob could be generated remotely and signed. >> A >> client would need just to validate the signature for the manifest blob >> and from there retrieve the files that are not in the local CAS (even >> from an insecure source) and mount directly the manifest file. > > > Back to the topic, after thinking something I have to make a > compliment for reference. 
> > First, EROFS had the same internal dissussion and decision at > that time almost _two years ago_ (June 2021), it means: > > a) Some internal people really suggested EROFS could develop > an entire new file-based in-kernel local cache subsystem > (as you called local CAS, whatever) with stackable file > interface so that the exist Nydus image service [1] (as > ostree, and maybe ostree can use it as well) don't need to > modify anything to use exist blobs; > > b) Reuse exist fscache/cachefiles; > > The reason why we (especially me) finally selected b) because: > > - see the people discussion of Google's original Incremental > FS topic [2] [3] in 2019, as Amir already mentioned. At > that time all fs folks really like to reuse exist subsystem > for in-kernel caching rather than reinvent another new > in-kernel wheel for local cache. > > [ Reinventing a new wheel is not hard (fs or caching), just > makes Linux more fragmented. Especially a new filesystem > is just proposed to generate images full of massive massive > new magical symlinks with *overriden* uid/gid/permissions > to replace regular files. ] > > - in-kernel cache implementation usually met several common > potential security issues; reusing exist subsystem can > make all fses addressed them and benefited from it. > > - Usually an exist widely-used userspace implementation is > never an excuse for a new in-kernel feature. > > Although David Howells is always quite busy these months to > develop new netfs interface, otherwise (we think) we should > already support failover, multiple daemon/dirs, daemonless and > more. we have not added any new cache system. overlay does "layer deduplication" and in similar way composefs does "file deduplication". That is not a built-in feature, it is just a side effect of how things are packed together. Using fscache seems like a good idea and it has many advantages but it is a centralized cache mechanism and it looks like a potential problem when you think about allowing mounts from a user namespace. As you know as I've contacted you, I've looked at EROFS in the past and tried to get our use cases to work with it before thinking about submitting composefs upstream. From what I could see EROFS and composefs use two different approaches to solve a similar problem, but it is not possible to do exactly with EROFS what we are trying to do. To oversimplify it: I see EROFS as a block device that uses fscache, and composefs as an overlay for files instead of directories. Sure composefs is quite simple and you could embed the composefs features in EROFS and let EROFS behave as composefs when provided a similar manifest file. But how is that any better than having a separate implementation that does just one thing well instead of merging different paradigms together? > I know that you guys repeatedly say it's a self-contained > stackable fs and has few code (the same words as Incfs > folks [3] said four years ago already), four reasons make it > weak IMHO: > > - I think core EROFS is about 2~3 kLOC as well if > compression, sysfs and fscache are all code-truncated. > > Also, it's always welcome that all people could submit > patches for cleaning up. I always do such cleanups > from time to time and makes it better. > > - "Few code lines" is somewhat weak because people do > develop new features, layout after upstream. 
> > Such claim is usually _NOT_ true in the future if you > guys do more to optimize performance, new layout or even > do your own lazy pulling with your local CAS codebase in > the future unless > you *promise* you once dump the code, and do bugfix > only like Christian said [4]. > > From LWN.net comments, I do see the opposite > possibility that you'd like to develop new features > later. > > - In the past, all in-tree kernel filesystems were > designed and implemented without some user-space > specific indication, including Nydus and ostree (I did > see a lot of discussion between folks before in ociv2 > brainstorm [5]). Since you are mentioning OCI: Potentially composefs can be the file system that enables something very close to "ociv2", but it won't need to be called v2 since it is completely compatible with the current OCI image format. It won't require a different image format, just a seekable tarball that is compatible with old "v1" clients and we need to provide the composefs manifest file. The seekable tarball allows individual files to be retrieved. OCI clients will not need to pull the entire tarball, but only the individual files that are not already present in the local CAS. They won't also need to create the overlay layout at all, as we do today, since it is already described with the composefs manifest file. The manifest is portable on different machines with different configurations, as you can use multiple CAS when mounting composefs. Some users might have a local CAS, some others could have a secondary CAS on a network file system and composefs support all these configurations with the same signed manifest file. > That is why EROFS selected exist in-kernel fscache and > made userspace Nydus adapt it: > > even (here called) manifest on-disk format --- > EROFS call primary device --- > they call Nydus bootstrap; > > I'm not sure why it becomes impossible for ... ($$$$). I am not sure what you mean, care to elaborate? > In addition, if fscache is used, it can also use > fsverity_get_digest() to enable fsverity for non-on-demand > files. > > But again I think even Google's folks think that is > (somewhat) broken so that they added fs-verity to its incFS > in a self-contained way in Feb 2021 [6]. > > Finally, again, I do hope a LSF/MM discussion for this new > overlay model (full of massive magical symlinks to override > permission.) you keep pointing it out but nobody is overriding any permission. The "symlinks" as you call them are just a way to refer to the payload files so they can be shared among different mounts. It is the same idea used by "overlay metacopy" and nobody is complaining about it being a security issue (because it is not). The files in the CAS are owned by the user that creates the mount, so there is no need to circumvent any permission check to access them. We use fs-verity for these files to make sure they are not modified by a malicious user that could get access to them (e.g. a container breakout). 
Regards, Giuseppe > > [1] https://github.com/dragonflyoss/image-service > [2] https://lore.kernel.org/r/CAK8JDrFZW1jwOmhq+YVDPJi9jWWrCRkwpqQ085EouVSyzw-1cg@mail.gmail.com/ > [3] https://lore.kernel.org/r/CAK8JDrGRzA+yphpuX+GQ0syRwF_p2Fora+roGCnYqB5E1eOmXA@mail.gmail.com/ > [4] https://lore.kernel.org/r/20230117101202.4v4zxuj2tbljogbx@wittgenstein/ > [5] https://hackmd.io/@cyphar/ociv2-brainstorm > [6] https://android-review.googlesource.com/c/kernel/common/+/1444521 > > Thanks, > Gao Xiang > >> Regards, >> Giuseppe >> >>> Any current functionality gap in erofs and/or in overlayfs >>> cannot be considered as a reason to maintain a new filesystem >>> driver unless you come up with an explanation why closing that >>> functionality gap is not possible or why the erofs+overlayfs alternative >>> would be inferior to maintaining a new filesystem driver. >>> >>> From the conversations so far, it does not seem like Gao thinks >>> that the functionality gap in erofs cannot be closed and I don't >>> see why the functionality gap in overlayfs cannot be closed. >>> >>> Are we missing something? >>> >>> Thanks, >>> Amir. >> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-21 16:19 ` Giuseppe Scrivano @ 2023-01-21 17:15 ` Gao Xiang 2023-01-21 22:34 ` Giuseppe Scrivano 0 siblings, 1 reply; 87+ messages in thread From: Gao Xiang @ 2023-01-21 17:15 UTC (permalink / raw) To: Giuseppe Scrivano Cc: Amir Goldstein, Alexander Larsson, linux-fsdevel, linux-kernel, david, brauner, viro, Vivek Goyal, Miklos Szeredi, Linus Torvalds On 2023/1/22 00:19, Giuseppe Scrivano wrote: > Gao Xiang <hsiangkao@linux.alibaba.com> writes: > >> On 2023/1/21 06:18, Giuseppe Scrivano wrote: >>> Hi Amir, >>> Amir Goldstein <amir73il@gmail.com> writes: >>> >>>> On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <alexl@redhat.com> wrote: >> >> ... >> >>>>> >>>> >>>> Hi Alexander, >>>> >>>> I must say that I am a little bit puzzled by this v3. >>>> Gao, Christian and myself asked you questions on v2 >>>> that are not mentioned in v3 at all. >>>> >>>> To sum it up, please do not propose composefs without explaining >>>> what are the barriers for achieving the exact same outcome with >>>> the use of a read-only overlayfs with two lower layer - >>>> uppermost with erofs containing the metadata files, which include >>>> trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer >>>> to the lowermost layer containing the content files. >>> I think Dave explained quite well why using overlay is not >>> comparable to >>> what composefs does. >>> One big difference is that overlay still requires at least a syscall >>> for >>> each file in the image, and then we need the equivalent of "rm -rf" to >>> clean it up. It is somehow acceptable for long-running services, but it >>> is not for "serverless" containers where images/containers are created >>> and destroyed frequently. So even in the case we already have all the >>> image files available locally, we still need to create a checkout with >>> the final structure we need for the image. >>> I also don't see how overlay would solve the verified image problem. >>> We >>> would have the same problem we have today with fs-verity as it can only >>> validate a single file but not the entire directory structure. Changes >>> that affect the layer containing the trusted.overlay.{metacopy,redirect} >>> xattrs won't be noticed. >>> There are at the moment two ways to handle container images, both >>> somehow >>> guided by the available file systems in the kernel. >>> - A single image mounted as a block device. >>> - A list of tarballs (OCI image) that are unpacked and mounted as >>> overlay layers. >>> One big advantage of the block devices model is that you can use >>> dm-verity, this is something we miss today with OCI container images >>> that use overlay. >>> What we are proposing with composefs is a way to have "dm-verity" >>> style >>> validation based on fs-verity and the possibility to share individual >>> files instead of layers. These files can also be on different file >>> systems, which is something not possible with the block device model. >> >> That is not a new idea honestly, including chain of trust. Even laterly >> out-of-tree incremental fs using fs-verity for this as well, except that >> it's in a real self-contained way. >> >>> The composefs manifest blob could be generated remotely and signed. >>> A >>> client would need just to validate the signature for the manifest blob >>> and from there retrieve the files that are not in the local CAS (even >>> from an insecure source) and mount directly the manifest file. 
>> >> >> Back to the topic, after thinking something I have to make a >> compliment for reference. >> >> First, EROFS had the same internal dissussion and decision at >> that time almost _two years ago_ (June 2021), it means: >> >> a) Some internal people really suggested EROFS could develop >> an entire new file-based in-kernel local cache subsystem >> (as you called local CAS, whatever) with stackable file >> interface so that the exist Nydus image service [1] (as >> ostree, and maybe ostree can use it as well) don't need to >> modify anything to use exist blobs; >> >> b) Reuse exist fscache/cachefiles; >> >> The reason why we (especially me) finally selected b) because: >> >> - see the people discussion of Google's original Incremental >> FS topic [2] [3] in 2019, as Amir already mentioned. At >> that time all fs folks really like to reuse exist subsystem >> for in-kernel caching rather than reinvent another new >> in-kernel wheel for local cache. >> >> [ Reinventing a new wheel is not hard (fs or caching), just >> makes Linux more fragmented. Especially a new filesystem >> is just proposed to generate images full of massive massive >> new magical symlinks with *overriden* uid/gid/permissions >> to replace regular files. ] >> >> - in-kernel cache implementation usually met several common >> potential security issues; reusing exist subsystem can >> make all fses addressed them and benefited from it. >> >> - Usually an exist widely-used userspace implementation is >> never an excuse for a new in-kernel feature. >> >> Although David Howells is always quite busy these months to >> develop new netfs interface, otherwise (we think) we should >> already support failover, multiple daemon/dirs, daemonless and >> more. > > we have not added any new cache system. overlay does "layer > deduplication" and in similar way composefs does "file deduplication". > That is not a built-in feature, it is just a side effect of how things > are packed together. > > Using fscache seems like a good idea and it has many advantages but it > is a centralized cache mechanism and it looks like a potential problem > when you think about allowing mounts from a user namespace. I think Christian [1] had the same feeling of my own at that time: "I'm pretty skeptical of this plan whether we should add more filesystems that are mountable by unprivileged users. FUSE and Overlayfs are adventurous enough and they don't have their own on-disk format. The track record of bugs exploitable due to userns isn't making this very attractive." Yes, you could add fs-verity, but EROFS could add fs-verity (or just use dm-verity) as well, but it doesn't change _anything_ about concerns of "allowing mounts from a user namespace". > > As you know as I've contacted you, I've looked at EROFS in the past > and tried to get our use cases to work with it before thinking about > submitting composefs upstream. > > From what I could see EROFS and composefs use two different approaches > to solve a similar problem, but it is not possible to do exactly with > EROFS what we are trying to do. To oversimplify it: I see EROFS as a > block device that uses fscache, and composefs as an overlay for files > instead of directories. I don't think so honestly. EROFS "Multiple device" feature is actually "multiple blobs" feature if you really think "device" is block device. Primary device -- primary blob -- "composefs manifest blob" Blob device -- data blobs -- "composefs backing files" any difference? 
> > Sure composefs is quite simple and you could embed the composefs > features in EROFS and let EROFS behave as composefs when provided a > similar manifest file. But how is that any better than having a EROFS always has such feature since v5.16, we called primary device, or Nydus concept --- "bootstrap file". > separate implementation that does just one thing well instead of merging > different paradigms together? It's exist fs on-disk compatible (people can deploy the same image to wider scenarios), or you could modify/enhacnce any in-kernel local fs to do so like I already suggested, such as enhancing "fs/romfs" and make it maintained again due to this magic symlink feature (because composefs don't have other on-disk requirements other than a symlink path and a SHA256 verity digest from its original requirement. Any local fs can be enhanced like this.) > >> I know that you guys repeatedly say it's a self-contained >> stackable fs and has few code (the same words as Incfs >> folks [3] said four years ago already), four reasons make it >> weak IMHO: >> >> - I think core EROFS is about 2~3 kLOC as well if >> compression, sysfs and fscache are all code-truncated. >> >> Also, it's always welcome that all people could submit >> patches for cleaning up. I always do such cleanups >> from time to time and makes it better. >> >> - "Few code lines" is somewhat weak because people do >> develop new features, layout after upstream. >> >> Such claim is usually _NOT_ true in the future if you >> guys do more to optimize performance, new layout or even >> do your own lazy pulling with your local CAS codebase in >> the future unless >> you *promise* you once dump the code, and do bugfix >> only like Christian said [4]. >> >> From LWN.net comments, I do see the opposite >> possibility that you'd like to develop new features >> later. >> >> - In the past, all in-tree kernel filesystems were >> designed and implemented without some user-space >> specific indication, including Nydus and ostree (I did >> see a lot of discussion between folks before in ociv2 >> brainstorm [5]). > > Since you are mentioning OCI: > > Potentially composefs can be the file system that enables something very > close to "ociv2", but it won't need to be called v2 since it is > completely compatible with the current OCI image format. > > It won't require a different image format, just a seekable tarball that > is compatible with old "v1" clients and we need to provide the composefs > manifest file. May I ask did you really look into what Nydus + EROFS already did (as you mentioned we discussed before)? Your "composefs manifest file" is exactly "Nydus bootstrap file", see: https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-design.md "Rafs is a filesystem image containing a separated metadata blob and several data-deduplicated content-addressable data blobs. In a typical rafs filesystem, the metadata is stored in bootstrap while the data is stored in blobfile. ... bootstrap: The metadata is a merkle tree (I think that is typo, should be filesystem tree) whose nodes represents a regular filesystem's directory/file a leaf node refers to a file and contains hash value of its file data. Root node and internal nodes refer to directories and contain the hash value of their children nodes." Nydus is already supported "It won't require a different image format, just a seekable tarball that is compatible with old "v1" clients and we need to provide the composefs manifest file." feature in v2.2 and will be released later. 
> > The seekable tarball allows individual files to be retrieved. OCI > clients will not need to pull the entire tarball, but only the individual > files that are not already present in the local CAS. They won't also need > to create the overlay layout at all, as we do today, since it is already > described with the composefs manifest file. > > The manifest is portable on different machines with different > configurations, as you can use multiple CAS when mounting composefs. > > Some users might have a local CAS, some others could have a secondary > CAS on a network file system and composefs support all these > configurations with the same signed manifest file. > >> That is why EROFS selected exist in-kernel fscache and >> made userspace Nydus adapt it: >> >> even (here called) manifest on-disk format --- >> EROFS call primary device --- >> they call Nydus bootstrap; >> >> I'm not sure why it becomes impossible for ... ($$$$). > > I am not sure what you mean, care to elaborate? I just meant these concepts are actually the same concept with different names and: Nydus is a 2020 stuff; EROFS + primary device is a 2021-mid stuff. > >> In addition, if fscache is used, it can also use >> fsverity_get_digest() to enable fsverity for non-on-demand >> files. >> >> But again I think even Google's folks think that is >> (somewhat) broken so that they added fs-verity to its incFS >> in a self-contained way in Feb 2021 [6]. >> >> Finally, again, I do hope a LSF/MM discussion for this new >> overlay model (full of massive magical symlinks to override >> permission.) > > you keep pointing it out but nobody is overriding any permission. The > "symlinks" as you call them are just a way to refer to the payload files > so they can be shared among different mounts. It is the same idea used > by "overlay metacopy" and nobody is complaining about it being a > security issue (because it is not). See overlay documentation clearly wrote such metacopy behavior: https://docs.kernel.org/filesystems/overlayfs.html " Do not use metacopy=on with untrusted upper/lower directories. Otherwise it is possible that an attacker can create a handcrafted file with appropriate REDIRECT and METACOPY xattrs, and gain access to file on lower pointed by REDIRECT. This should not be possible on local system as setting “trusted.” xattrs will require CAP_SYS_ADMIN. But it should be possible for untrusted layers like from a pen drive. " Do we really need such behavior working on another fs especially with on-disk format? At least Christian said, "FUSE and Overlayfs are adventurous enough and they don't have their own on-disk format." > > The files in the CAS are owned by the user that creates the mount, so > there is no need to circumvent any permission check to access them. > We use fs-verity for these files to make sure they are not modified by a > malicious user that could get access to them (e.g. a container breakout). fs-verity is not always enforcing and it's broken here if fsverity is not supported in underlay fses, that is another my arguable point. Thanks, Gao Xiang [1] https://lore.kernel.org/linux-fsdevel/20230117152756.jbwmeq724potyzju@wittgenstein/ > > Regards, > Giuseppe > >> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-21 17:15 ` Gao Xiang @ 2023-01-21 22:34 ` Giuseppe Scrivano 2023-01-22 0:39 ` Gao Xiang 0 siblings, 1 reply; 87+ messages in thread From: Giuseppe Scrivano @ 2023-01-21 22:34 UTC (permalink / raw) To: Gao Xiang Cc: Amir Goldstein, Alexander Larsson, linux-fsdevel, linux-kernel, david, brauner, viro, Vivek Goyal, Miklos Szeredi, Linus Torvalds Gao Xiang <hsiangkao@linux.alibaba.com> writes: > On 2023/1/22 00:19, Giuseppe Scrivano wrote: >> Gao Xiang <hsiangkao@linux.alibaba.com> writes: >> >>> On 2023/1/21 06:18, Giuseppe Scrivano wrote: >>>> Hi Amir, >>>> Amir Goldstein <amir73il@gmail.com> writes: >>>> >>>>> On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <alexl@redhat.com> wrote: >>> >>> ... >>> >>>>>> >>>>> >>>>> Hi Alexander, >>>>> >>>>> I must say that I am a little bit puzzled by this v3. >>>>> Gao, Christian and myself asked you questions on v2 >>>>> that are not mentioned in v3 at all. >>>>> >>>>> To sum it up, please do not propose composefs without explaining >>>>> what are the barriers for achieving the exact same outcome with >>>>> the use of a read-only overlayfs with two lower layer - >>>>> uppermost with erofs containing the metadata files, which include >>>>> trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer >>>>> to the lowermost layer containing the content files. >>>> I think Dave explained quite well why using overlay is not >>>> comparable to >>>> what composefs does. >>>> One big difference is that overlay still requires at least a syscall >>>> for >>>> each file in the image, and then we need the equivalent of "rm -rf" to >>>> clean it up. It is somehow acceptable for long-running services, but it >>>> is not for "serverless" containers where images/containers are created >>>> and destroyed frequently. So even in the case we already have all the >>>> image files available locally, we still need to create a checkout with >>>> the final structure we need for the image. >>>> I also don't see how overlay would solve the verified image problem. >>>> We >>>> would have the same problem we have today with fs-verity as it can only >>>> validate a single file but not the entire directory structure. Changes >>>> that affect the layer containing the trusted.overlay.{metacopy,redirect} >>>> xattrs won't be noticed. >>>> There are at the moment two ways to handle container images, both >>>> somehow >>>> guided by the available file systems in the kernel. >>>> - A single image mounted as a block device. >>>> - A list of tarballs (OCI image) that are unpacked and mounted as >>>> overlay layers. >>>> One big advantage of the block devices model is that you can use >>>> dm-verity, this is something we miss today with OCI container images >>>> that use overlay. >>>> What we are proposing with composefs is a way to have "dm-verity" >>>> style >>>> validation based on fs-verity and the possibility to share individual >>>> files instead of layers. These files can also be on different file >>>> systems, which is something not possible with the block device model. >>> >>> That is not a new idea honestly, including chain of trust. Even laterly >>> out-of-tree incremental fs using fs-verity for this as well, except that >>> it's in a real self-contained way. >>> >>>> The composefs manifest blob could be generated remotely and signed. 
>>>> A >>>> client would need just to validate the signature for the manifest blob >>>> and from there retrieve the files that are not in the local CAS (even >>>> from an insecure source) and mount directly the manifest file. >>> >>> >>> Back to the topic, after thinking something I have to make a >>> compliment for reference. >>> >>> First, EROFS had the same internal dissussion and decision at >>> that time almost _two years ago_ (June 2021), it means: >>> >>> a) Some internal people really suggested EROFS could develop >>> an entire new file-based in-kernel local cache subsystem >>> (as you called local CAS, whatever) with stackable file >>> interface so that the exist Nydus image service [1] (as >>> ostree, and maybe ostree can use it as well) don't need to >>> modify anything to use exist blobs; >>> >>> b) Reuse exist fscache/cachefiles; >>> >>> The reason why we (especially me) finally selected b) because: >>> >>> - see the people discussion of Google's original Incremental >>> FS topic [2] [3] in 2019, as Amir already mentioned. At >>> that time all fs folks really like to reuse exist subsystem >>> for in-kernel caching rather than reinvent another new >>> in-kernel wheel for local cache. >>> >>> [ Reinventing a new wheel is not hard (fs or caching), just >>> makes Linux more fragmented. Especially a new filesystem >>> is just proposed to generate images full of massive massive >>> new magical symlinks with *overriden* uid/gid/permissions >>> to replace regular files. ] >>> >>> - in-kernel cache implementation usually met several common >>> potential security issues; reusing exist subsystem can >>> make all fses addressed them and benefited from it. >>> >>> - Usually an exist widely-used userspace implementation is >>> never an excuse for a new in-kernel feature. >>> >>> Although David Howells is always quite busy these months to >>> develop new netfs interface, otherwise (we think) we should >>> already support failover, multiple daemon/dirs, daemonless and >>> more. >> we have not added any new cache system. overlay does "layer >> deduplication" and in similar way composefs does "file deduplication". >> That is not a built-in feature, it is just a side effect of how things >> are packed together. >> Using fscache seems like a good idea and it has many advantages but >> it >> is a centralized cache mechanism and it looks like a potential problem >> when you think about allowing mounts from a user namespace. > > I think Christian [1] had the same feeling of my own at that time: > > "I'm pretty skeptical of this plan whether we should add more filesystems > that are mountable by unprivileged users. FUSE and Overlayfs are > adventurous enough and they don't have their own on-disk format. The > track record of bugs exploitable due to userns isn't making this > very attractive." > > Yes, you could add fs-verity, but EROFS could add fs-verity (or just use > dm-verity) as well, but it doesn't change _anything_ about concerns of > "allowing mounts from a user namespace". I've mentioned that as a potential feature we could add in future, given the simplicity of the format and that it uses a CAS for its data instead of fscache. Each user can have and use their own store to mount the images. At this point it is just a wish from userspace, as it would improve a few real use cases we have. Having the possibility to run containers without root privileges is a big deal for many users, look at Flatpak apps for example, or rootless Podman. 
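Just to illustrate that wish (this is not something the current patches allow, and the paths below are made up), the idea would be that an unprivileged user with their own object store could eventually do something like:

  $ unshare -r -m
  # mount -t composefs image.cfs -o basedir=$HOME/.local/share/objects /mnt

with the backing files owned by, and only readable by, that user.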
Mounting and validating images would be a big security improvement. It is something that is not possible at the moment, as fs-verity doesn't cover the directory structure and dm-verity seems out of reach from a user namespace. Composefs delegates the entire logic of dealing with files to the underlying file system in a similar way to overlay. Forging the inode metadata from a user namespace mount doesn't look like an insurmountable problem either, since it is already possible with a FUSE filesystem. So the proposal/wish here is to have a very simple format that at some point could be considered safe to mount from a user namespace, in addition to overlay and FUSE. >> As you know as I've contacted you, I've looked at EROFS in the past >> and tried to get our use cases to work with it before thinking about >> submitting composefs upstream. >> From what I could see EROFS and composefs use two different >> approaches >> to solve a similar problem, but it is not possible to do exactly with >> EROFS what we are trying to do. To oversimplify it: I see EROFS as a >> block device that uses fscache, and composefs as an overlay for files >> instead of directories. > > I don't think so honestly. EROFS "Multiple device" feature is > actually "multiple blobs" feature if you really think "device" > is block device. > > Primary device -- primary blob -- "composefs manifest blob" > Blob device -- data blobs -- "composefs backing files" > > any difference? I wouldn't expect any substantial difference between two RO file systems. Please correct me if I am wrong: EROFS uses 16 bits for the blob device ID, so if we map each file to a single blob device we are kind of limited on how many files we can have. Sure, this is just an artificial limit and can be bumped in a future version, but the major difference remains: EROFS uses the blob device through fscache while the composefs files are looked up in the specified repositories. >> Sure composefs is quite simple and you could embed the composefs >> features in EROFS and let EROFS behave as composefs when provided a >> similar manifest file. But how is that any better than having a > > EROFS always has such feature since v5.16, we called primary device, > or Nydus concept --- "bootstrap file". > >> separate implementation that does just one thing well instead of merging >> different paradigms together? > > It's exist fs on-disk compatible (people can deploy the same image > to wider scenarios), or you could modify/enhacnce any in-kernel local > fs to do so like I already suggested, such as enhancing "fs/romfs" and > make it maintained again due to this magic symlink feature > > (because composefs don't have other on-disk requirements other than > a symlink path and a SHA256 verity digest from its original > requirement. Any local fs can be enhanced like this.) > >> >>> I know that you guys repeatedly say it's a self-contained >>> stackable fs and has few code (the same words as Incfs >>> folks [3] said four years ago already), four reasons make it >>> weak IMHO: >>> >>> - I think core EROFS is about 2~3 kLOC as well if >>> compression, sysfs and fscache are all code-truncated. >>> >>> Also, it's always welcome that all people could submit >>> patches for cleaning up. I always do such cleanups >>> from time to time and makes it better. >>> >>> - "Few code lines" is somewhat weak because people do >>> develop new features, layout after upstream.
>>> >>> Such claim is usually _NOT_ true in the future if you >>> guys do more to optimize performance, new layout or even >>> do your own lazy pulling with your local CAS codebase in >>> the future unless >>> you *promise* you once dump the code, and do bugfix >>> only like Christian said [4]. >>> >>> From LWN.net comments, I do see the opposite >>> possibility that you'd like to develop new features >>> later. >>> >>> - In the past, all in-tree kernel filesystems were >>> designed and implemented without some user-space >>> specific indication, including Nydus and ostree (I did >>> see a lot of discussion between folks before in ociv2 >>> brainstorm [5]). >> Since you are mentioning OCI: >> Potentially composefs can be the file system that enables something >> very >> close to "ociv2", but it won't need to be called v2 since it is >> completely compatible with the current OCI image format. >> It won't require a different image format, just a seekable tarball >> that >> is compatible with old "v1" clients and we need to provide the composefs >> manifest file. > > May I ask did you really look into what Nydus + EROFS already did (as you > mentioned we discussed before)? > > Your "composefs manifest file" is exactly "Nydus bootstrap file", see: > https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-design.md > > "Rafs is a filesystem image containing a separated metadata blob and > several data-deduplicated content-addressable data blobs. In a typical > rafs filesystem, the metadata is stored in bootstrap while the data > is stored in blobfile. > ... > > bootstrap: The metadata is a merkle tree (I think that is typo, should be > filesystem tree) whose nodes represents a regular filesystem's > directory/file a leaf node refers to a file and contains hash value of > its file data. > Root node and internal nodes refer to directories and contain the > hash value > of their children nodes." > > Nydus is already supported "It won't require a different image format, just > a seekable tarball that is compatible with old "v1" clients and we need to > provide the composefs manifest file." feature in v2.2 and will be released > later. Nydus is not using a tarball compatible with OCI v1. It defines a media type "application/vnd.oci.image.layer.nydus.blob.v1", that means it is not compatible with existing clients that don't know about it and you need special handling for that. Anyway, let's not bother LKML folks with these userspace details. It has no relevance to the kernel and what file systems do. >> The seekable tarball allows individual files to be retrieved. OCI >> clients will not need to pull the entire tarball, but only the individual >> files that are not already present in the local CAS. They won't also need >> to create the overlay layout at all, as we do today, since it is already >> described with the composefs manifest file. >> The manifest is portable on different machines with different >> configurations, as you can use multiple CAS when mounting composefs. >> Some users might have a local CAS, some others could have a >> secondary >> CAS on a network file system and composefs support all these >> configurations with the same signed manifest file. >> >>> That is why EROFS selected exist in-kernel fscache and >>> made userspace Nydus adapt it: >>> >>> even (here called) manifest on-disk format --- >>> EROFS call primary device --- >>> they call Nydus bootstrap; >>> >>> I'm not sure why it becomes impossible for ... ($$$$). 
>> I am not sure what you mean, care to elaborate? > > I just meant these concepts are actually the same concept with > different names and: > Nydus is a 2020 stuff; CRFS[1] is 2019 stuff. > EROFS + primary device is a 2021-mid stuff. > >>> In addition, if fscache is used, it can also use >>> fsverity_get_digest() to enable fsverity for non-on-demand >>> files. >>> >>> But again I think even Google's folks think that is >>> (somewhat) broken so that they added fs-verity to its incFS >>> in a self-contained way in Feb 2021 [6]. >>> >>> Finally, again, I do hope a LSF/MM discussion for this new >>> overlay model (full of massive magical symlinks to override >>> permission.) >> you keep pointing it out but nobody is overriding any permission. >> The >> "symlinks" as you call them are just a way to refer to the payload files >> so they can be shared among different mounts. It is the same idea used >> by "overlay metacopy" and nobody is complaining about it being a >> security issue (because it is not). > > See overlay documentation clearly wrote such metacopy behavior: > https://docs.kernel.org/filesystems/overlayfs.html > > " > Do not use metacopy=on with untrusted upper/lower directories. > Otherwise it is possible that an attacker can create a handcrafted file > with appropriate REDIRECT and METACOPY xattrs, and gain access to file > on lower pointed by REDIRECT. This should not be possible on local > system as setting “trusted.” xattrs will require CAP_SYS_ADMIN. But > it should be possible for untrusted layers like from a pen drive. > " > > Do we really need such behavior working on another fs especially with > on-disk format? At least Christian said, > "FUSE and Overlayfs are adventurous enough and they don't have their > own on-disk format." If users want to do something really weird then they can always find a way but the composefs lookup is limited under the directories specified at mount time, so it is not possible to access any file outside the repository. >> The files in the CAS are owned by the user that creates the mount, >> so >> there is no need to circumvent any permission check to access them. >> We use fs-verity for these files to make sure they are not modified by a >> malicious user that could get access to them (e.g. a container breakout). > > fs-verity is not always enforcing and it's broken here if fsverity is not > supported in underlay fses, that is another my arguable point. It is a trade-off. It is up to the user to pick a configuration that allows using fs-verity if they care about this feature. Regards, Giuseppe [1] https://github.com/google/crfs > Thanks, > Gao Xiang > > [1] https://lore.kernel.org/linux-fsdevel/20230117152756.jbwmeq724potyzju@wittgenstein/ > >> Regards, >> Giuseppe ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-21 22:34 ` Giuseppe Scrivano @ 2023-01-22 0:39 ` Gao Xiang 2023-01-22 9:01 ` Giuseppe Scrivano 0 siblings, 1 reply; 87+ messages in thread From: Gao Xiang @ 2023-01-22 0:39 UTC (permalink / raw) To: Giuseppe Scrivano Cc: Amir Goldstein, Alexander Larsson, linux-fsdevel, linux-kernel, david, brauner, viro, Vivek Goyal, Miklos Szeredi, Linus Torvalds On 2023/1/22 06:34, Giuseppe Scrivano wrote: > Gao Xiang <hsiangkao@linux.alibaba.com> writes: > >> On 2023/1/22 00:19, Giuseppe Scrivano wrote: >>> Gao Xiang <hsiangkao@linux.alibaba.com> writes: >>> >>>> On 2023/1/21 06:18, Giuseppe Scrivano wrote: >>>>> Hi Amir, >>>>> Amir Goldstein <amir73il@gmail.com> writes: >>>>> >>>>>> On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <alexl@redhat.com> wrote: >>>> >>>> ... >>>> >>>>>>> >>>>>> >>>>>> Hi Alexander, >>>>>> >>>>>> I must say that I am a little bit puzzled by this v3. >>>>>> Gao, Christian and myself asked you questions on v2 >>>>>> that are not mentioned in v3 at all. >>>>>> >>>>>> To sum it up, please do not propose composefs without explaining >>>>>> what are the barriers for achieving the exact same outcome with >>>>>> the use of a read-only overlayfs with two lower layer - >>>>>> uppermost with erofs containing the metadata files, which include >>>>>> trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer >>>>>> to the lowermost layer containing the content files. >>>>> I think Dave explained quite well why using overlay is not >>>>> comparable to >>>>> what composefs does. >>>>> One big difference is that overlay still requires at least a syscall >>>>> for >>>>> each file in the image, and then we need the equivalent of "rm -rf" to >>>>> clean it up. It is somehow acceptable for long-running services, but it >>>>> is not for "serverless" containers where images/containers are created >>>>> and destroyed frequently. So even in the case we already have all the >>>>> image files available locally, we still need to create a checkout with >>>>> the final structure we need for the image. >>>>> I also don't see how overlay would solve the verified image problem. >>>>> We >>>>> would have the same problem we have today with fs-verity as it can only >>>>> validate a single file but not the entire directory structure. Changes >>>>> that affect the layer containing the trusted.overlay.{metacopy,redirect} >>>>> xattrs won't be noticed. >>>>> There are at the moment two ways to handle container images, both >>>>> somehow >>>>> guided by the available file systems in the kernel. >>>>> - A single image mounted as a block device. >>>>> - A list of tarballs (OCI image) that are unpacked and mounted as >>>>> overlay layers. >>>>> One big advantage of the block devices model is that you can use >>>>> dm-verity, this is something we miss today with OCI container images >>>>> that use overlay. >>>>> What we are proposing with composefs is a way to have "dm-verity" >>>>> style >>>>> validation based on fs-verity and the possibility to share individual >>>>> files instead of layers. These files can also be on different file >>>>> systems, which is something not possible with the block device model. >>>> >>>> That is not a new idea honestly, including chain of trust. Even laterly >>>> out-of-tree incremental fs using fs-verity for this as well, except that >>>> it's in a real self-contained way. >>>> >>>>> The composefs manifest blob could be generated remotely and signed. 
>>>>> A >>>>> client would need just to validate the signature for the manifest blob >>>>> and from there retrieve the files that are not in the local CAS (even >>>>> from an insecure source) and mount directly the manifest file. >>>> >>>> >>>> Back to the topic, after thinking something I have to make a >>>> compliment for reference. >>>> >>>> First, EROFS had the same internal dissussion and decision at >>>> that time almost _two years ago_ (June 2021), it means: >>>> >>>> a) Some internal people really suggested EROFS could develop >>>> an entire new file-based in-kernel local cache subsystem >>>> (as you called local CAS, whatever) with stackable file >>>> interface so that the exist Nydus image service [1] (as >>>> ostree, and maybe ostree can use it as well) don't need to >>>> modify anything to use exist blobs; >>>> >>>> b) Reuse exist fscache/cachefiles; >>>> >>>> The reason why we (especially me) finally selected b) because: >>>> >>>> - see the people discussion of Google's original Incremental >>>> FS topic [2] [3] in 2019, as Amir already mentioned. At >>>> that time all fs folks really like to reuse exist subsystem >>>> for in-kernel caching rather than reinvent another new >>>> in-kernel wheel for local cache. >>>> >>>> [ Reinventing a new wheel is not hard (fs or caching), just >>>> makes Linux more fragmented. Especially a new filesystem >>>> is just proposed to generate images full of massive massive >>>> new magical symlinks with *overriden* uid/gid/permissions >>>> to replace regular files. ] >>>> >>>> - in-kernel cache implementation usually met several common >>>> potential security issues; reusing exist subsystem can >>>> make all fses addressed them and benefited from it. >>>> >>>> - Usually an exist widely-used userspace implementation is >>>> never an excuse for a new in-kernel feature. >>>> >>>> Although David Howells is always quite busy these months to >>>> develop new netfs interface, otherwise (we think) we should >>>> already support failover, multiple daemon/dirs, daemonless and >>>> more. >>> we have not added any new cache system. overlay does "layer >>> deduplication" and in similar way composefs does "file deduplication". >>> That is not a built-in feature, it is just a side effect of how things >>> are packed together. >>> Using fscache seems like a good idea and it has many advantages but >>> it >>> is a centralized cache mechanism and it looks like a potential problem >>> when you think about allowing mounts from a user namespace. >> >> I think Christian [1] had the same feeling of my own at that time: >> >> "I'm pretty skeptical of this plan whether we should add more filesystems >> that are mountable by unprivileged users. FUSE and Overlayfs are >> adventurous enough and they don't have their own on-disk format. The >> track record of bugs exploitable due to userns isn't making this >> very attractive." >> >> Yes, you could add fs-verity, but EROFS could add fs-verity (or just use >> dm-verity) as well, but it doesn't change _anything_ about concerns of >> "allowing mounts from a user namespace". > > I've mentioned that as a potential feature we could add in future, given > the simplicity of the format and that it uses a CAS for its data instead > of fscache. Each user can have and use their own store to mount the > images. > > At this point it is just a wish from userspace, as it would improve a > few real use cases we have. 
> > Having the possibility to run containers without root privileges is a > big deal for many users, look at Flatpak apps for example, or rootless > Podman. Mounting and validating images would be a a big security > improvement. It is something that is not possible at the moment as > fs-verity doesn't cover the directory structure and dm-verity seems out > of reach from a user namespace. > > Composefs delegates the entire logic of dealing with files to the > underlying file system in a similar way to overlay. > > Forging the inode metadata from a user namespace mount doesn't look > like an insurmountable problem as well since it is already possible > with a FUSE filesystem. > > So the proposal/wish here is to have a very simple format, that at some > point could be considered safe to mount from a user namespace, in > addition to overlay and FUSE. My response is quite similar to https://lore.kernel.org/r/CAJfpeguyajzHwhae=4PWLF4CUBorwFWeybO-xX6UBD2Ekg81fg@mail.gmail.com/ > > >>> As you know as I've contacted you, I've looked at EROFS in the past >>> and tried to get our use cases to work with it before thinking about >>> submitting composefs upstream. >>> From what I could see EROFS and composefs use two different >>> approaches >>> to solve a similar problem, but it is not possible to do exactly with >>> EROFS what we are trying to do. To oversimplify it: I see EROFS as a >>> block device that uses fscache, and composefs as an overlay for files >>> instead of directories. >> >> I don't think so honestly. EROFS "Multiple device" feature is >> actually "multiple blobs" feature if you really think "device" >> is block device. >> >> Primary device -- primary blob -- "composefs manifest blob" >> Blob device -- data blobs -- "composefs backing files" >> >> any difference? > > I wouldn't expect any substancial difference between two RO file > systems. > > Please correct me if I am wrong: EROFS uses 16 bits for the blob device > ID, so if we map each file to a single blob device we are kind of > limited on how many files we can have. I was here just to represent "composefs manifest file" concept rather than device ID. > Sure this is just an artificial limit and can be bumped in a future > version but the major difference remains: EROFS uses the blob device > through fscache while the composefs files are looked up in the specified > repositories. No, fscache can also open any cookie when opening file. Again, even with fscache, EROFS doesn't need to modify _any_ on-disk format to: - record a "cookie id" for such special "magical symlink" with a similar symlink on-disk format (or whatever on-disk format with data, just with a new on-disk flag); - open such "cookie id" on demand when opening such EROFS file just as any other network fses. I don't think blob device is limited here. some difference now? > >>> Sure composefs is quite simple and you could embed the composefs >>> features in EROFS and let EROFS behave as composefs when provided a >>> similar manifest file. But how is that any better than having a >> >> EROFS always has such feature since v5.16, we called primary device, >> or Nydus concept --- "bootstrap file". >> >>> separate implementation that does just one thing well instead of merging >>> different paradigms together? 
>> >> It's exist fs on-disk compatible (people can deploy the same image >> to wider scenarios), or you could modify/enhacnce any in-kernel local >> fs to do so like I already suggested, such as enhancing "fs/romfs" and >> make it maintained again due to this magic symlink feature >> >> (because composefs don't have other on-disk requirements other than >> a symlink path and a SHA256 verity digest from its original >> requirement. Any local fs can be enhanced like this.) >> >>> >>>> I know that you guys repeatedly say it's a self-contained >>>> stackable fs and has few code (the same words as Incfs >>>> folks [3] said four years ago already), four reasons make it >>>> weak IMHO: >>>> >>>> - I think core EROFS is about 2~3 kLOC as well if >>>> compression, sysfs and fscache are all code-truncated. >>>> >>>> Also, it's always welcome that all people could submit >>>> patches for cleaning up. I always do such cleanups >>>> from time to time and makes it better. >>>> >>>> - "Few code lines" is somewhat weak because people do >>>> develop new features, layout after upstream. >>>> >>>> Such claim is usually _NOT_ true in the future if you >>>> guys do more to optimize performance, new layout or even >>>> do your own lazy pulling with your local CAS codebase in >>>> the future unless >>>> you *promise* you once dump the code, and do bugfix >>>> only like Christian said [4]. >>>> >>>> From LWN.net comments, I do see the opposite >>>> possibility that you'd like to develop new features >>>> later. >>>> >>>> - In the past, all in-tree kernel filesystems were >>>> designed and implemented without some user-space >>>> specific indication, including Nydus and ostree (I did >>>> see a lot of discussion between folks before in ociv2 >>>> brainstorm [5]). >>> Since you are mentioning OCI: >>> Potentially composefs can be the file system that enables something >>> very >>> close to "ociv2", but it won't need to be called v2 since it is >>> completely compatible with the current OCI image format. >>> It won't require a different image format, just a seekable tarball >>> that >>> is compatible with old "v1" clients and we need to provide the composefs >>> manifest file. >> >> May I ask did you really look into what Nydus + EROFS already did (as you >> mentioned we discussed before)? >> >> Your "composefs manifest file" is exactly "Nydus bootstrap file", see: >> https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-design.md >> >> "Rafs is a filesystem image containing a separated metadata blob and >> several data-deduplicated content-addressable data blobs. In a typical >> rafs filesystem, the metadata is stored in bootstrap while the data >> is stored in blobfile. >> ... >> >> bootstrap: The metadata is a merkle tree (I think that is typo, should be >> filesystem tree) whose nodes represents a regular filesystem's >> directory/file a leaf node refers to a file and contains hash value of >> its file data. >> Root node and internal nodes refer to directories and contain the >> hash value >> of their children nodes." >> >> Nydus is already supported "It won't require a different image format, just >> a seekable tarball that is compatible with old "v1" clients and we need to >> provide the composefs manifest file." feature in v2.2 and will be released >> later. > > Nydus is not using a tarball compatible with OCI v1. 
> > It defines a media type "application/vnd.oci.image.layer.nydus.blob.v1", that > means it is not compatible with existing clients that don't know about > it and you need special handling for that. I am not sure what you're saying: "media type" is rather off topic here. If you are saying that "mkcomposefs" is done on the server side, what is the media type of such manifest files? And why couldn't Nydus do it in the same way? https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-zran.md > > Anyway, let's not bother LKML folks with these userspace details. It > has no relevance to the kernel and what file systems do. I'd like to avoid that as well; I didn't say anything about userspace details, I just wanted to say that "a merged filesystem tree is also _not_ a new idea of composefs", not anything about "media type", etc. > > >>> The seekable tarball allows individual files to be retrieved. OCI >>> clients will not need to pull the entire tarball, but only the individual >>> files that are not already present in the local CAS. They won't also need >>> to create the overlay layout at all, as we do today, since it is already >>> described with the composefs manifest file. >>> The manifest is portable on different machines with different >>> configurations, as you can use multiple CAS when mounting composefs. >>> Some users might have a local CAS, some others could have a >>> secondary >>> CAS on a network file system and composefs support all these >>> configurations with the same signed manifest file. >>> >>>> That is why EROFS selected exist in-kernel fscache and >>>> made userspace Nydus adapt it: >>>> >>>> even (here called) manifest on-disk format --- >>>> EROFS call primary device --- >>>> they call Nydus bootstrap; >>>> >>>> I'm not sure why it becomes impossible for ... ($$$$). >>> I am not sure what you mean, care to elaborate? >> >> I just meant these concepts are actually the same concept with >> different names and: >> Nydus is a 2020 stuff; > > CRFS[1] is 2019 stuff. Does CRFS have anything similar to a merged filesystem tree? Here we are talking about a local CAS: I am not aware of CRFS having anything similar to it. > >> EROFS + primary device is a 2021-mid stuff. >> >>>> In addition, if fscache is used, it can also use >>>> fsverity_get_digest() to enable fsverity for non-on-demand >>>> files. >>>> >>>> But again I think even Google's folks think that is >>>> (somewhat) broken so that they added fs-verity to its incFS >>>> in a self-contained way in Feb 2021 [6]. >>>> >>>> Finally, again, I do hope a LSF/MM discussion for this new >>>> overlay model (full of massive magical symlinks to override >>>> permission.) >>> you keep pointing it out but nobody is overriding any permission. >>> The >>> "symlinks" as you call them are just a way to refer to the payload files >>> so they can be shared among different mounts. It is the same idea used >>> by "overlay metacopy" and nobody is complaining about it being a >>> security issue (because it is not). >> >> See overlay documentation clearly wrote such metacopy behavior: >> https://docs.kernel.org/filesystems/overlayfs.html >> >> " >> Do not use metacopy=on with untrusted upper/lower directories. >> Otherwise it is possible that an attacker can create a handcrafted file >> with appropriate REDIRECT and METACOPY xattrs, and gain access to file >> on lower pointed by REDIRECT. This should not be possible on local >> system as setting “trusted.” xattrs will require CAP_SYS_ADMIN. But >> it should be possible for untrusted layers like from a pen drive.
>> " >> >> Do we really need such behavior working on another fs especially with >> on-disk format? At least Christian said, >> "FUSE and Overlayfs are adventurous enough and they don't have their >> own on-disk format." > > If users want to do something really weird then they can always find a > way but the composefs lookup is limited under the directories specified > at mount time, so it is not possible to access any file outside the > repository. > > >>> The files in the CAS are owned by the user that creates the mount, >>> so >>> there is no need to circumvent any permission check to access them. >>> We use fs-verity for these files to make sure they are not modified by a >>> malicious user that could get access to them (e.g. a container breakout). >> >> fs-verity is not always enforcing and it's broken here if fsverity is not >> supported in underlay fses, that is another my arguable point. > > It is a trade-off. It is up to the user to pick a configuration that > allows using fs-verity if they care about this feature. I don't think fsverity is optional with your plan. I wrote this all because it seems I didn't mention the original motivation to use fscache in v2: kernel already has such in-kernel local cache, and people liked to use it in 2019 rather than another stackable way (as mentioned in incremental fs thread.) Thanks, Gao Xiang > > Regards, > Giuseppe > > [1] https://github.com/google/crfs ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-22 0:39 ` Gao Xiang @ 2023-01-22 9:01 ` Giuseppe Scrivano 2023-01-22 9:32 ` Giuseppe Scrivano 0 siblings, 1 reply; 87+ messages in thread From: Giuseppe Scrivano @ 2023-01-22 9:01 UTC (permalink / raw) To: Gao Xiang Cc: Amir Goldstein, Alexander Larsson, linux-fsdevel, linux-kernel, david, brauner, viro, Vivek Goyal, Miklos Szeredi, Linus Torvalds Gao Xiang <hsiangkao@linux.alibaba.com> writes: > On 2023/1/22 06:34, Giuseppe Scrivano wrote: >> Gao Xiang <hsiangkao@linux.alibaba.com> writes: >> >>> On 2023/1/22 00:19, Giuseppe Scrivano wrote: >>>> Gao Xiang <hsiangkao@linux.alibaba.com> writes: >>>> >>>>> On 2023/1/21 06:18, Giuseppe Scrivano wrote: >>>>>> Hi Amir, >>>>>> Amir Goldstein <amir73il@gmail.com> writes: >>>>>> >>>>>>> On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <alexl@redhat.com> wrote: >>>>> >>>>> ... >>>>> >>>>>>>> >>>>>>> >>>>>>> Hi Alexander, >>>>>>> >>>>>>> I must say that I am a little bit puzzled by this v3. >>>>>>> Gao, Christian and myself asked you questions on v2 >>>>>>> that are not mentioned in v3 at all. >>>>>>> >>>>>>> To sum it up, please do not propose composefs without explaining >>>>>>> what are the barriers for achieving the exact same outcome with >>>>>>> the use of a read-only overlayfs with two lower layer - >>>>>>> uppermost with erofs containing the metadata files, which include >>>>>>> trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer >>>>>>> to the lowermost layer containing the content files. >>>>>> I think Dave explained quite well why using overlay is not >>>>>> comparable to >>>>>> what composefs does. >>>>>> One big difference is that overlay still requires at least a syscall >>>>>> for >>>>>> each file in the image, and then we need the equivalent of "rm -rf" to >>>>>> clean it up. It is somehow acceptable for long-running services, but it >>>>>> is not for "serverless" containers where images/containers are created >>>>>> and destroyed frequently. So even in the case we already have all the >>>>>> image files available locally, we still need to create a checkout with >>>>>> the final structure we need for the image. >>>>>> I also don't see how overlay would solve the verified image problem. >>>>>> We >>>>>> would have the same problem we have today with fs-verity as it can only >>>>>> validate a single file but not the entire directory structure. Changes >>>>>> that affect the layer containing the trusted.overlay.{metacopy,redirect} >>>>>> xattrs won't be noticed. >>>>>> There are at the moment two ways to handle container images, both >>>>>> somehow >>>>>> guided by the available file systems in the kernel. >>>>>> - A single image mounted as a block device. >>>>>> - A list of tarballs (OCI image) that are unpacked and mounted as >>>>>> overlay layers. >>>>>> One big advantage of the block devices model is that you can use >>>>>> dm-verity, this is something we miss today with OCI container images >>>>>> that use overlay. >>>>>> What we are proposing with composefs is a way to have "dm-verity" >>>>>> style >>>>>> validation based on fs-verity and the possibility to share individual >>>>>> files instead of layers. These files can also be on different file >>>>>> systems, which is something not possible with the block device model. >>>>> >>>>> That is not a new idea honestly, including chain of trust. 
Even laterly >>>>> out-of-tree incremental fs using fs-verity for this as well, except that >>>>> it's in a real self-contained way. >>>>> >>>>>> The composefs manifest blob could be generated remotely and signed. >>>>>> A >>>>>> client would need just to validate the signature for the manifest blob >>>>>> and from there retrieve the files that are not in the local CAS (even >>>>>> from an insecure source) and mount directly the manifest file. >>>>> >>>>> >>>>> Back to the topic, after thinking something I have to make a >>>>> compliment for reference. >>>>> >>>>> First, EROFS had the same internal dissussion and decision at >>>>> that time almost _two years ago_ (June 2021), it means: >>>>> >>>>> a) Some internal people really suggested EROFS could develop >>>>> an entire new file-based in-kernel local cache subsystem >>>>> (as you called local CAS, whatever) with stackable file >>>>> interface so that the exist Nydus image service [1] (as >>>>> ostree, and maybe ostree can use it as well) don't need to >>>>> modify anything to use exist blobs; >>>>> >>>>> b) Reuse exist fscache/cachefiles; >>>>> >>>>> The reason why we (especially me) finally selected b) because: >>>>> >>>>> - see the people discussion of Google's original Incremental >>>>> FS topic [2] [3] in 2019, as Amir already mentioned. At >>>>> that time all fs folks really like to reuse exist subsystem >>>>> for in-kernel caching rather than reinvent another new >>>>> in-kernel wheel for local cache. >>>>> >>>>> [ Reinventing a new wheel is not hard (fs or caching), just >>>>> makes Linux more fragmented. Especially a new filesystem >>>>> is just proposed to generate images full of massive massive >>>>> new magical symlinks with *overriden* uid/gid/permissions >>>>> to replace regular files. ] >>>>> >>>>> - in-kernel cache implementation usually met several common >>>>> potential security issues; reusing exist subsystem can >>>>> make all fses addressed them and benefited from it. >>>>> >>>>> - Usually an exist widely-used userspace implementation is >>>>> never an excuse for a new in-kernel feature. >>>>> >>>>> Although David Howells is always quite busy these months to >>>>> develop new netfs interface, otherwise (we think) we should >>>>> already support failover, multiple daemon/dirs, daemonless and >>>>> more. >>>> we have not added any new cache system. overlay does "layer >>>> deduplication" and in similar way composefs does "file deduplication". >>>> That is not a built-in feature, it is just a side effect of how things >>>> are packed together. >>>> Using fscache seems like a good idea and it has many advantages but >>>> it >>>> is a centralized cache mechanism and it looks like a potential problem >>>> when you think about allowing mounts from a user namespace. >>> >>> I think Christian [1] had the same feeling of my own at that time: >>> >>> "I'm pretty skeptical of this plan whether we should add more filesystems >>> that are mountable by unprivileged users. FUSE and Overlayfs are >>> adventurous enough and they don't have their own on-disk format. The >>> track record of bugs exploitable due to userns isn't making this >>> very attractive." >>> >>> Yes, you could add fs-verity, but EROFS could add fs-verity (or just use >>> dm-verity) as well, but it doesn't change _anything_ about concerns of >>> "allowing mounts from a user namespace". >> I've mentioned that as a potential feature we could add in future, >> given >> the simplicity of the format and that it uses a CAS for its data instead >> of fscache. 
Each user can have and use their own store to mount the >> images. >> At this point it is just a wish from userspace, as it would improve >> a >> few real use cases we have. >> Having the possibility to run containers without root privileges is >> a >> big deal for many users, look at Flatpak apps for example, or rootless >> Podman. Mounting and validating images would be a a big security >> improvement. It is something that is not possible at the moment as >> fs-verity doesn't cover the directory structure and dm-verity seems out >> of reach from a user namespace. >> Composefs delegates the entire logic of dealing with files to the >> underlying file system in a similar way to overlay. >> Forging the inode metadata from a user namespace mount doesn't look >> like an insurmountable problem as well since it is already possible >> with a FUSE filesystem. >> So the proposal/wish here is to have a very simple format, that at >> some >> point could be considered safe to mount from a user namespace, in >> addition to overlay and FUSE. > > My response is quite similar to > https://lore.kernel.org/r/CAJfpeguyajzHwhae=4PWLF4CUBorwFWeybO-xX6UBD2Ekg81fg@mail.gmail.com/ I don't see how that applies to what I said about unprivileged mounts, except the part about lazy download where I agree with Miklos that should be handled through FUSE and that is something possible with composefs: mount -t composefs composefs -obasedir=/path/to/store:/mnt/fuse /mnt/cfs where /mnt/fuse is handled by a FUSE file system that takes care of loading the files from the remote server, and possibly write them to /path/to/store once they are completed. So each user could have their "lazy download" without interfering with other users or the centralized cache. >> >>>> As you know as I've contacted you, I've looked at EROFS in the past >>>> and tried to get our use cases to work with it before thinking about >>>> submitting composefs upstream. >>>> From what I could see EROFS and composefs use two different >>>> approaches >>>> to solve a similar problem, but it is not possible to do exactly with >>>> EROFS what we are trying to do. To oversimplify it: I see EROFS as a >>>> block device that uses fscache, and composefs as an overlay for files >>>> instead of directories. >>> >>> I don't think so honestly. EROFS "Multiple device" feature is >>> actually "multiple blobs" feature if you really think "device" >>> is block device. >>> >>> Primary device -- primary blob -- "composefs manifest blob" >>> Blob device -- data blobs -- "composefs backing files" >>> >>> any difference? >> I wouldn't expect any substancial difference between two RO file >> systems. >> Please correct me if I am wrong: EROFS uses 16 bits for the blob >> device >> ID, so if we map each file to a single blob device we are kind of >> limited on how many files we can have. > > I was here just to represent "composefs manifest file" concept rather than > device ID. > >> Sure this is just an artificial limit and can be bumped in a future >> version but the major difference remains: EROFS uses the blob device >> through fscache while the composefs files are looked up in the specified >> repositories. > > No, fscache can also open any cookie when opening file. 
Again, even with > fscache, EROFS doesn't need to modify _any_ on-disk format to: > > - record a "cookie id" for such special "magical symlink" with a similar > symlink on-disk format (or whatever on-disk format with data, just with > a new on-disk flag); > > - open such "cookie id" on demand when opening such EROFS file just as > any other network fses. I don't think blob device is limited here. > > some difference now? recording the "cookie id" is done by a singleton userspace daemon that controls the cachefiles device and requires one operation for each file before the image can be mounted. Is that the case or I misunderstood something? >> >>>> Sure composefs is quite simple and you could embed the composefs >>>> features in EROFS and let EROFS behave as composefs when provided a >>>> similar manifest file. But how is that any better than having a >>> >>> EROFS always has such feature since v5.16, we called primary device, >>> or Nydus concept --- "bootstrap file". >>> >>>> separate implementation that does just one thing well instead of merging >>>> different paradigms together? >>> >>> It's exist fs on-disk compatible (people can deploy the same image >>> to wider scenarios), or you could modify/enhacnce any in-kernel local >>> fs to do so like I already suggested, such as enhancing "fs/romfs" and >>> make it maintained again due to this magic symlink feature >>> >>> (because composefs don't have other on-disk requirements other than >>> a symlink path and a SHA256 verity digest from its original >>> requirement. Any local fs can be enhanced like this.) >>> >>>> >>>>> I know that you guys repeatedly say it's a self-contained >>>>> stackable fs and has few code (the same words as Incfs >>>>> folks [3] said four years ago already), four reasons make it >>>>> weak IMHO: >>>>> >>>>> - I think core EROFS is about 2~3 kLOC as well if >>>>> compression, sysfs and fscache are all code-truncated. >>>>> >>>>> Also, it's always welcome that all people could submit >>>>> patches for cleaning up. I always do such cleanups >>>>> from time to time and makes it better. >>>>> >>>>> - "Few code lines" is somewhat weak because people do >>>>> develop new features, layout after upstream. >>>>> >>>>> Such claim is usually _NOT_ true in the future if you >>>>> guys do more to optimize performance, new layout or even >>>>> do your own lazy pulling with your local CAS codebase in >>>>> the future unless >>>>> you *promise* you once dump the code, and do bugfix >>>>> only like Christian said [4]. >>>>> >>>>> From LWN.net comments, I do see the opposite >>>>> possibility that you'd like to develop new features >>>>> later. >>>>> >>>>> - In the past, all in-tree kernel filesystems were >>>>> designed and implemented without some user-space >>>>> specific indication, including Nydus and ostree (I did >>>>> see a lot of discussion between folks before in ociv2 >>>>> brainstorm [5]). >>>> Since you are mentioning OCI: >>>> Potentially composefs can be the file system that enables something >>>> very >>>> close to "ociv2", but it won't need to be called v2 since it is >>>> completely compatible with the current OCI image format. >>>> It won't require a different image format, just a seekable tarball >>>> that >>>> is compatible with old "v1" clients and we need to provide the composefs >>>> manifest file. >>> >>> May I ask did you really look into what Nydus + EROFS already did (as you >>> mentioned we discussed before)? 
>>> >>> Your "composefs manifest file" is exactly "Nydus bootstrap file", see: >>> https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-design.md >>> >>> "Rafs is a filesystem image containing a separated metadata blob and >>> several data-deduplicated content-addressable data blobs. In a typical >>> rafs filesystem, the metadata is stored in bootstrap while the data >>> is stored in blobfile. >>> ... >>> >>> bootstrap: The metadata is a merkle tree (I think that is typo, should be >>> filesystem tree) whose nodes represents a regular filesystem's >>> directory/file a leaf node refers to a file and contains hash value of >>> its file data. >>> Root node and internal nodes refer to directories and contain the >>> hash value >>> of their children nodes." >>> >>> Nydus is already supported "It won't require a different image format, just >>> a seekable tarball that is compatible with old "v1" clients and we need to >>> provide the composefs manifest file." feature in v2.2 and will be released >>> later. >> Nydus is not using a tarball compatible with OCI v1. >> It defines a media type >> "application/vnd.oci.image.layer.nydus.blob.v1", that >> means it is not compatible with existing clients that don't know about >> it and you need special handling for that. > > I am not sure what you're saying: "media type" is quite out of topic here. > > If you said "mkcomposefs" is done in the server side, what is the media > type of such manifest files? > > And why not Nydus cannot do in the same way? > https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-zran.md > I am not talking about the manifest or the bootstrap file, I am talking about the data blobs. >> Anyway, let's not bother LKML folks with these userspace details. >> It >> has no relevance to the kernel and what file systems do. > > I'd like to avoid, I did't say anything about userspace details, I just would > like to say > "merged filesystem tree is also _not_ a new idea of composefs" > not "media type", etc. > >> >>>> The seekable tarball allows individual files to be retrieved. OCI >>>> clients will not need to pull the entire tarball, but only the individual >>>> files that are not already present in the local CAS. They won't also need >>>> to create the overlay layout at all, as we do today, since it is already >>>> described with the composefs manifest file. >>>> The manifest is portable on different machines with different >>>> configurations, as you can use multiple CAS when mounting composefs. >>>> Some users might have a local CAS, some others could have a >>>> secondary >>>> CAS on a network file system and composefs support all these >>>> configurations with the same signed manifest file. >>>> >>>>> That is why EROFS selected exist in-kernel fscache and >>>>> made userspace Nydus adapt it: >>>>> >>>>> even (here called) manifest on-disk format --- >>>>> EROFS call primary device --- >>>>> they call Nydus bootstrap; >>>>> >>>>> I'm not sure why it becomes impossible for ... ($$$$). >>>> I am not sure what you mean, care to elaborate? >>> >>> I just meant these concepts are actually the same concept with >>> different names and: >>> Nydus is a 2020 stuff; >> CRFS[1] is 2019 stuff. > > Does CRFS have anything similiar to a merged filesystem tree? > > Here we talked about local CAS: > I have no idea CRFS has anything similar to it. yes it does and it uses it with a FUSE file system. So neither composefs nor EROFS have invented anything here. Anyway, does it really matter who made what first? 
I don't see how it helps to understand if there are relevant differences in composefs to justify its presence in the kernel. >> >>> EROFS + primary device is a 2021-mid stuff. >>> >>>>> In addition, if fscache is used, it can also use >>>>> fsverity_get_digest() to enable fsverity for non-on-demand >>>>> files. >>>>> >>>>> But again I think even Google's folks think that is >>>>> (somewhat) broken so that they added fs-verity to its incFS >>>>> in a self-contained way in Feb 2021 [6]. >>>>> >>>>> Finally, again, I do hope a LSF/MM discussion for this new >>>>> overlay model (full of massive magical symlinks to override >>>>> permission.) >>>> you keep pointing it out but nobody is overriding any permission. >>>> The >>>> "symlinks" as you call them are just a way to refer to the payload files >>>> so they can be shared among different mounts. It is the same idea used >>>> by "overlay metacopy" and nobody is complaining about it being a >>>> security issue (because it is not). >>> >>> See overlay documentation clearly wrote such metacopy behavior: >>> https://docs.kernel.org/filesystems/overlayfs.html >>> >>> " >>> Do not use metacopy=on with untrusted upper/lower directories. >>> Otherwise it is possible that an attacker can create a handcrafted file >>> with appropriate REDIRECT and METACOPY xattrs, and gain access to file >>> on lower pointed by REDIRECT. This should not be possible on local >>> system as setting “trusted.” xattrs will require CAP_SYS_ADMIN. But >>> it should be possible for untrusted layers like from a pen drive. >>> " >>> >>> Do we really need such behavior working on another fs especially with >>> on-disk format? At least Christian said, >>> "FUSE and Overlayfs are adventurous enough and they don't have their >>> own on-disk format." >> If users want to do something really weird then they can always find >> a >> way but the composefs lookup is limited under the directories specified >> at mount time, so it is not possible to access any file outside the >> repository. >> >>>> The files in the CAS are owned by the user that creates the mount, >>>> so >>>> there is no need to circumvent any permission check to access them. >>>> We use fs-verity for these files to make sure they are not modified by a >>>> malicious user that could get access to them (e.g. a container breakout). >>> >>> fs-verity is not always enforcing and it's broken here if fsverity is not >>> supported in underlay fses, that is another my arguable point. >> It is a trade-off. It is up to the user to pick a configuration >> that >> allows using fs-verity if they care about this feature. > > I don't think fsverity is optional with your plan. yes it is optional. without fs-verity it would behave the same as today with overlay mounts without any fs-verity. How does validation work in EROFS for files served from fscache and that are on a remote file system? > I wrote this all because it seems I didn't mention the original motivation > to use fscache in v2: kernel already has such in-kernel local cache, and > people liked to use it in 2019 rather than another stackable way (as > mentioned in incremental fs thread.) still for us the stackable way works better. > Thanks, > Gao Xiang > >> Regards, >> Giuseppe >> [1] https://github.com/google/crfs ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-22 9:01 ` Giuseppe Scrivano @ 2023-01-22 9:32 ` Giuseppe Scrivano 2023-01-24 0:08 ` Gao Xiang 0 siblings, 1 reply; 87+ messages in thread From: Giuseppe Scrivano @ 2023-01-22 9:32 UTC (permalink / raw) To: Gao Xiang Cc: Amir Goldstein, Alexander Larsson, linux-fsdevel, linux-kernel, david, brauner, viro, Vivek Goyal, Miklos Szeredi, Linus Torvalds Giuseppe Scrivano <gscrivan@redhat.com> writes: > Gao Xiang <hsiangkao@linux.alibaba.com> writes: > >> On 2023/1/22 06:34, Giuseppe Scrivano wrote: >>> Gao Xiang <hsiangkao@linux.alibaba.com> writes: >>> >>>> On 2023/1/22 00:19, Giuseppe Scrivano wrote: >>>>> Gao Xiang <hsiangkao@linux.alibaba.com> writes: >>>>> >>>>>> On 2023/1/21 06:18, Giuseppe Scrivano wrote: >>>>>>> Hi Amir, >>>>>>> Amir Goldstein <amir73il@gmail.com> writes: >>>>>>> >>>>>>>> On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <alexl@redhat.com> wrote: >>>>>> >>>>>> ... >>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> Hi Alexander, >>>>>>>> >>>>>>>> I must say that I am a little bit puzzled by this v3. >>>>>>>> Gao, Christian and myself asked you questions on v2 >>>>>>>> that are not mentioned in v3 at all. >>>>>>>> >>>>>>>> To sum it up, please do not propose composefs without explaining >>>>>>>> what are the barriers for achieving the exact same outcome with >>>>>>>> the use of a read-only overlayfs with two lower layer - >>>>>>>> uppermost with erofs containing the metadata files, which include >>>>>>>> trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer >>>>>>>> to the lowermost layer containing the content files. >>>>>>> I think Dave explained quite well why using overlay is not >>>>>>> comparable to >>>>>>> what composefs does. >>>>>>> One big difference is that overlay still requires at least a syscall >>>>>>> for >>>>>>> each file in the image, and then we need the equivalent of "rm -rf" to >>>>>>> clean it up. It is somehow acceptable for long-running services, but it >>>>>>> is not for "serverless" containers where images/containers are created >>>>>>> and destroyed frequently. So even in the case we already have all the >>>>>>> image files available locally, we still need to create a checkout with >>>>>>> the final structure we need for the image. >>>>>>> I also don't see how overlay would solve the verified image problem. >>>>>>> We >>>>>>> would have the same problem we have today with fs-verity as it can only >>>>>>> validate a single file but not the entire directory structure. Changes >>>>>>> that affect the layer containing the trusted.overlay.{metacopy,redirect} >>>>>>> xattrs won't be noticed. >>>>>>> There are at the moment two ways to handle container images, both >>>>>>> somehow >>>>>>> guided by the available file systems in the kernel. >>>>>>> - A single image mounted as a block device. >>>>>>> - A list of tarballs (OCI image) that are unpacked and mounted as >>>>>>> overlay layers. >>>>>>> One big advantage of the block devices model is that you can use >>>>>>> dm-verity, this is something we miss today with OCI container images >>>>>>> that use overlay. >>>>>>> What we are proposing with composefs is a way to have "dm-verity" >>>>>>> style >>>>>>> validation based on fs-verity and the possibility to share individual >>>>>>> files instead of layers. These files can also be on different file >>>>>>> systems, which is something not possible with the block device model. >>>>>> >>>>>> That is not a new idea honestly, including chain of trust. 
Even laterly >>>>>> out-of-tree incremental fs using fs-verity for this as well, except that >>>>>> it's in a real self-contained way. >>>>>> >>>>>>> The composefs manifest blob could be generated remotely and signed. >>>>>>> A >>>>>>> client would need just to validate the signature for the manifest blob >>>>>>> and from there retrieve the files that are not in the local CAS (even >>>>>>> from an insecure source) and mount directly the manifest file. >>>>>> >>>>>> >>>>>> Back to the topic, after thinking something I have to make a >>>>>> compliment for reference. >>>>>> >>>>>> First, EROFS had the same internal dissussion and decision at >>>>>> that time almost _two years ago_ (June 2021), it means: >>>>>> >>>>>> a) Some internal people really suggested EROFS could develop >>>>>> an entire new file-based in-kernel local cache subsystem >>>>>> (as you called local CAS, whatever) with stackable file >>>>>> interface so that the exist Nydus image service [1] (as >>>>>> ostree, and maybe ostree can use it as well) don't need to >>>>>> modify anything to use exist blobs; >>>>>> >>>>>> b) Reuse exist fscache/cachefiles; >>>>>> >>>>>> The reason why we (especially me) finally selected b) because: >>>>>> >>>>>> - see the people discussion of Google's original Incremental >>>>>> FS topic [2] [3] in 2019, as Amir already mentioned. At >>>>>> that time all fs folks really like to reuse exist subsystem >>>>>> for in-kernel caching rather than reinvent another new >>>>>> in-kernel wheel for local cache. >>>>>> >>>>>> [ Reinventing a new wheel is not hard (fs or caching), just >>>>>> makes Linux more fragmented. Especially a new filesystem >>>>>> is just proposed to generate images full of massive massive >>>>>> new magical symlinks with *overriden* uid/gid/permissions >>>>>> to replace regular files. ] >>>>>> >>>>>> - in-kernel cache implementation usually met several common >>>>>> potential security issues; reusing exist subsystem can >>>>>> make all fses addressed them and benefited from it. >>>>>> >>>>>> - Usually an exist widely-used userspace implementation is >>>>>> never an excuse for a new in-kernel feature. >>>>>> >>>>>> Although David Howells is always quite busy these months to >>>>>> develop new netfs interface, otherwise (we think) we should >>>>>> already support failover, multiple daemon/dirs, daemonless and >>>>>> more. >>>>> we have not added any new cache system. overlay does "layer >>>>> deduplication" and in similar way composefs does "file deduplication". >>>>> That is not a built-in feature, it is just a side effect of how things >>>>> are packed together. >>>>> Using fscache seems like a good idea and it has many advantages but >>>>> it >>>>> is a centralized cache mechanism and it looks like a potential problem >>>>> when you think about allowing mounts from a user namespace. >>>> >>>> I think Christian [1] had the same feeling of my own at that time: >>>> >>>> "I'm pretty skeptical of this plan whether we should add more filesystems >>>> that are mountable by unprivileged users. FUSE and Overlayfs are >>>> adventurous enough and they don't have their own on-disk format. The >>>> track record of bugs exploitable due to userns isn't making this >>>> very attractive." >>>> >>>> Yes, you could add fs-verity, but EROFS could add fs-verity (or just use >>>> dm-verity) as well, but it doesn't change _anything_ about concerns of >>>> "allowing mounts from a user namespace". 
>>> I've mentioned that as a potential feature we could add in future, >>> given >>> the simplicity of the format and that it uses a CAS for its data instead >>> of fscache. Each user can have and use their own store to mount the >>> images. >>> At this point it is just a wish from userspace, as it would improve >>> a >>> few real use cases we have. >>> Having the possibility to run containers without root privileges is >>> a >>> big deal for many users, look at Flatpak apps for example, or rootless >>> Podman. Mounting and validating images would be a a big security >>> improvement. It is something that is not possible at the moment as >>> fs-verity doesn't cover the directory structure and dm-verity seems out >>> of reach from a user namespace. >>> Composefs delegates the entire logic of dealing with files to the >>> underlying file system in a similar way to overlay. >>> Forging the inode metadata from a user namespace mount doesn't look >>> like an insurmountable problem as well since it is already possible >>> with a FUSE filesystem. >>> So the proposal/wish here is to have a very simple format, that at >>> some >>> point could be considered safe to mount from a user namespace, in >>> addition to overlay and FUSE. >> >> My response is quite similar to >> https://lore.kernel.org/r/CAJfpeguyajzHwhae=4PWLF4CUBorwFWeybO-xX6UBD2Ekg81fg@mail.gmail.com/ > > I don't see how that applies to what I said about unprivileged mounts, > except the part about lazy download where I agree with Miklos that > should be handled through FUSE and that is something possible with > composefs: > > mount -t composefs composefs -obasedir=/path/to/store:/mnt/fuse /mnt/cfs > > where /mnt/fuse is handled by a FUSE file system that takes care of > loading the files from the remote server, and possibly write them to > /path/to/store once they are completed. > > So each user could have their "lazy download" without interfering with > other users or the centralized cache. > >>> >>>>> As you know as I've contacted you, I've looked at EROFS in the past >>>>> and tried to get our use cases to work with it before thinking about >>>>> submitting composefs upstream. >>>>> From what I could see EROFS and composefs use two different >>>>> approaches >>>>> to solve a similar problem, but it is not possible to do exactly with >>>>> EROFS what we are trying to do. To oversimplify it: I see EROFS as a >>>>> block device that uses fscache, and composefs as an overlay for files >>>>> instead of directories. >>>> >>>> I don't think so honestly. EROFS "Multiple device" feature is >>>> actually "multiple blobs" feature if you really think "device" >>>> is block device. >>>> >>>> Primary device -- primary blob -- "composefs manifest blob" >>>> Blob device -- data blobs -- "composefs backing files" >>>> >>>> any difference? >>> I wouldn't expect any substancial difference between two RO file >>> systems. >>> Please correct me if I am wrong: EROFS uses 16 bits for the blob >>> device >>> ID, so if we map each file to a single blob device we are kind of >>> limited on how many files we can have. >> >> I was here just to represent "composefs manifest file" concept rather than >> device ID. >> >>> Sure this is just an artificial limit and can be bumped in a future >>> version but the major difference remains: EROFS uses the blob device >>> through fscache while the composefs files are looked up in the specified >>> repositories. >> >> No, fscache can also open any cookie when opening file. 
Again, even with >> fscache, EROFS doesn't need to modify _any_ on-disk format to: >> >> - record a "cookie id" for such special "magical symlink" with a similar >> symlink on-disk format (or whatever on-disk format with data, just with >> a new on-disk flag); >> >> - open such "cookie id" on demand when opening such EROFS file just as >> any other network fses. I don't think blob device is limited here. >> >> some difference now? > > recording the "cookie id" is done by a singleton userspace daemon that > controls the cachefiles device and requires one operation for each file > before the image can be mounted. > > Is that the case or I misunderstood something? > >>> >>>>> Sure composefs is quite simple and you could embed the composefs >>>>> features in EROFS and let EROFS behave as composefs when provided a >>>>> similar manifest file. But how is that any better than having a >>>> >>>> EROFS always has such feature since v5.16, we called primary device, >>>> or Nydus concept --- "bootstrap file". >>>> >>>>> separate implementation that does just one thing well instead of merging >>>>> different paradigms together? >>>> >>>> It's exist fs on-disk compatible (people can deploy the same image >>>> to wider scenarios), or you could modify/enhacnce any in-kernel local >>>> fs to do so like I already suggested, such as enhancing "fs/romfs" and >>>> make it maintained again due to this magic symlink feature >>>> >>>> (because composefs don't have other on-disk requirements other than >>>> a symlink path and a SHA256 verity digest from its original >>>> requirement. Any local fs can be enhanced like this.) >>>> >>>>> >>>>>> I know that you guys repeatedly say it's a self-contained >>>>>> stackable fs and has few code (the same words as Incfs >>>>>> folks [3] said four years ago already), four reasons make it >>>>>> weak IMHO: >>>>>> >>>>>> - I think core EROFS is about 2~3 kLOC as well if >>>>>> compression, sysfs and fscache are all code-truncated. >>>>>> >>>>>> Also, it's always welcome that all people could submit >>>>>> patches for cleaning up. I always do such cleanups >>>>>> from time to time and makes it better. >>>>>> >>>>>> - "Few code lines" is somewhat weak because people do >>>>>> develop new features, layout after upstream. >>>>>> >>>>>> Such claim is usually _NOT_ true in the future if you >>>>>> guys do more to optimize performance, new layout or even >>>>>> do your own lazy pulling with your local CAS codebase in >>>>>> the future unless >>>>>> you *promise* you once dump the code, and do bugfix >>>>>> only like Christian said [4]. >>>>>> >>>>>> From LWN.net comments, I do see the opposite >>>>>> possibility that you'd like to develop new features >>>>>> later. >>>>>> >>>>>> - In the past, all in-tree kernel filesystems were >>>>>> designed and implemented without some user-space >>>>>> specific indication, including Nydus and ostree (I did >>>>>> see a lot of discussion between folks before in ociv2 >>>>>> brainstorm [5]). >>>>> Since you are mentioning OCI: >>>>> Potentially composefs can be the file system that enables something >>>>> very >>>>> close to "ociv2", but it won't need to be called v2 since it is >>>>> completely compatible with the current OCI image format. >>>>> It won't require a different image format, just a seekable tarball >>>>> that >>>>> is compatible with old "v1" clients and we need to provide the composefs >>>>> manifest file. >>>> >>>> May I ask did you really look into what Nydus + EROFS already did (as you >>>> mentioned we discussed before)? 
>>>> >>>> Your "composefs manifest file" is exactly "Nydus bootstrap file", see: >>>> https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-design.md >>>> >>>> "Rafs is a filesystem image containing a separated metadata blob and >>>> several data-deduplicated content-addressable data blobs. In a typical >>>> rafs filesystem, the metadata is stored in bootstrap while the data >>>> is stored in blobfile. >>>> ... >>>> >>>> bootstrap: The metadata is a merkle tree (I think that is typo, should be >>>> filesystem tree) whose nodes represents a regular filesystem's >>>> directory/file a leaf node refers to a file and contains hash value of >>>> its file data. >>>> Root node and internal nodes refer to directories and contain the >>>> hash value >>>> of their children nodes." >>>> >>>> Nydus is already supported "It won't require a different image format, just >>>> a seekable tarball that is compatible with old "v1" clients and we need to >>>> provide the composefs manifest file." feature in v2.2 and will be released >>>> later. >>> Nydus is not using a tarball compatible with OCI v1. >>> It defines a media type >>> "application/vnd.oci.image.layer.nydus.blob.v1", that >>> means it is not compatible with existing clients that don't know about >>> it and you need special handling for that. >> >> I am not sure what you're saying: "media type" is quite out of topic here. >> >> If you said "mkcomposefs" is done in the server side, what is the media >> type of such manifest files? >> >> And why not Nydus cannot do in the same way? >> https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-zran.md >> > > I am not talking about the manifest or the bootstrap file, I am talking > about the data blobs. > >>> Anyway, let's not bother LKML folks with these userspace details. >>> It >>> has no relevance to the kernel and what file systems do. >> >> I'd like to avoid, I did't say anything about userspace details, I just would >> like to say >> "merged filesystem tree is also _not_ a new idea of composefs" >> not "media type", etc. >> >>> >>>>> The seekable tarball allows individual files to be retrieved. OCI >>>>> clients will not need to pull the entire tarball, but only the individual >>>>> files that are not already present in the local CAS. They won't also need >>>>> to create the overlay layout at all, as we do today, since it is already >>>>> described with the composefs manifest file. >>>>> The manifest is portable on different machines with different >>>>> configurations, as you can use multiple CAS when mounting composefs. >>>>> Some users might have a local CAS, some others could have a >>>>> secondary >>>>> CAS on a network file system and composefs support all these >>>>> configurations with the same signed manifest file. >>>>> >>>>>> That is why EROFS selected exist in-kernel fscache and >>>>>> made userspace Nydus adapt it: >>>>>> >>>>>> even (here called) manifest on-disk format --- >>>>>> EROFS call primary device --- >>>>>> they call Nydus bootstrap; >>>>>> >>>>>> I'm not sure why it becomes impossible for ... ($$$$). >>>>> I am not sure what you mean, care to elaborate? >>>> >>>> I just meant these concepts are actually the same concept with >>>> different names and: >>>> Nydus is a 2020 stuff; >>> CRFS[1] is 2019 stuff. >> >> Does CRFS have anything similiar to a merged filesystem tree? >> >> Here we talked about local CAS: >> I have no idea CRFS has anything similar to it. > > yes it does and it uses it with a FUSE file system. 
So neither > composefs nor EROFS have invented anything here. > > Anyway, does it really matter who made what first? I don't see how it > helps to understand if there are relevant differences in composefs to > justify its presence in the kernel. > >>> >>>> EROFS + primary device is a 2021-mid stuff. >>>> >>>>>> In addition, if fscache is used, it can also use >>>>>> fsverity_get_digest() to enable fsverity for non-on-demand >>>>>> files. >>>>>> >>>>>> But again I think even Google's folks think that is >>>>>> (somewhat) broken so that they added fs-verity to its incFS >>>>>> in a self-contained way in Feb 2021 [6]. >>>>>> >>>>>> Finally, again, I do hope a LSF/MM discussion for this new >>>>>> overlay model (full of massive magical symlinks to override >>>>>> permission.) >>>>> you keep pointing it out but nobody is overriding any permission. >>>>> The >>>>> "symlinks" as you call them are just a way to refer to the payload files >>>>> so they can be shared among different mounts. It is the same idea used >>>>> by "overlay metacopy" and nobody is complaining about it being a >>>>> security issue (because it is not). >>>> >>>> See overlay documentation clearly wrote such metacopy behavior: >>>> https://docs.kernel.org/filesystems/overlayfs.html >>>> >>>> " >>>> Do not use metacopy=on with untrusted upper/lower directories. >>>> Otherwise it is possible that an attacker can create a handcrafted file >>>> with appropriate REDIRECT and METACOPY xattrs, and gain access to file >>>> on lower pointed by REDIRECT. This should not be possible on local >>>> system as setting “trusted.” xattrs will require CAP_SYS_ADMIN. But >>>> it should be possible for untrusted layers like from a pen drive. >>>> " >>>> >>>> Do we really need such behavior working on another fs especially with >>>> on-disk format? At least Christian said, >>>> "FUSE and Overlayfs are adventurous enough and they don't have their >>>> own on-disk format." >>> If users want to do something really weird then they can always find >>> a >>> way but the composefs lookup is limited under the directories specified >>> at mount time, so it is not possible to access any file outside the >>> repository. >>> >>>>> The files in the CAS are owned by the user that creates the mount, >>>>> so >>>>> there is no need to circumvent any permission check to access them. >>>>> We use fs-verity for these files to make sure they are not modified by a >>>>> malicious user that could get access to them (e.g. a container breakout). >>>> >>>> fs-verity is not always enforcing and it's broken here if fsverity is not >>>> supported in underlay fses, that is another my arguable point. >>> It is a trade-off. It is up to the user to pick a configuration >>> that >>> allows using fs-verity if they care about this feature. >> >> I don't think fsverity is optional with your plan. > > yes it is optional. without fs-verity it would behave the same as today > with overlay mounts without any fs-verity. > > How does validation work in EROFS for files served from fscache and that > are on a remote file system? nevermind my last question, I guess it would still go through the block device in EROFS. This is clearly a point in favor of a block device approach that a stacking file system like overlay or composefs cannot achieve without support from the underlying file system. 
> >> I wrote this all because it seems I didn't mention the original motivation >> to use fscache in v2: kernel already has such in-kernel local cache, and >> people liked to use it in 2019 rather than another stackable way (as >> mentioned in incremental fs thread.) > > still for us the stackable way works better. > >> Thanks, >> Gao Xiang >> >>> Regards, >>> Giuseppe >>> [1] https://github.com/google/crfs ^ permalink raw reply [flat|nested] 87+ messages in thread
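For illustration, the lazy-download arrangement described in the message above could look roughly like this, where fetchfs stands for a hypothetical FUSE file system that downloads missing objects on lookup (only the colon-separated basedir syntax comes from the message itself):

# fetchfs --populate=/var/lib/store /mnt/fuse
# mount -t composefs rootfs.img -o basedir=/var/lib/store:/mnt/fuse /mnt/cfs

Lookups are served from /var/lib/store when the object is already present; on a miss composefs falls back to /mnt/fuse, and the FUSE implementation can fetch the object and also write it into /var/lib/store so that later lookups stay local.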
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-22 9:32 ` Giuseppe Scrivano @ 2023-01-24 0:08 ` Gao Xiang 0 siblings, 0 replies; 87+ messages in thread From: Gao Xiang @ 2023-01-24 0:08 UTC (permalink / raw) To: Giuseppe Scrivano, Amir Goldstein Cc: Alexander Larsson, linux-fsdevel, linux-kernel, david, brauner, viro, Vivek Goyal, Miklos Szeredi, Linus Torvalds On 2023/1/22 17:32, Giuseppe Scrivano wrote: > Giuseppe Scrivano <gscrivan@redhat.com> writes: > ... >> >> How does validation work in EROFS for files served from fscache and that >> are on a remote file system? > > nevermind my last question, I guess it would still go through the block > device in EROFS. > This is clearly a point in favor of a block device approach that a > stacking file system like overlay or composefs cannot achieve without > support from the underlying file system. nevermind my last answer; thinking about Amir's advice, you could just use the FUSE+overlayfs option for this. I wonder if such an option can meet all your requirements (including unprivileged mounts) without adding new on-disk formats to the kernel just to support unprivileged mounts. If there are still missing features, you could enhance FUSE or overlayfs. Thanks, Gao Xiang ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-20 22:18 ` Giuseppe Scrivano 2023-01-21 3:08 ` Gao Xiang @ 2023-01-21 10:57 ` Amir Goldstein 2023-01-21 15:01 ` Giuseppe Scrivano 1 sibling, 1 reply; 87+ messages in thread From: Amir Goldstein @ 2023-01-21 10:57 UTC (permalink / raw) To: Giuseppe Scrivano Cc: Alexander Larsson, linux-fsdevel, linux-kernel, david, brauner, viro, Vivek Goyal, Miklos Szeredi On Sat, Jan 21, 2023 at 12:18 AM Giuseppe Scrivano <gscrivan@redhat.com> wrote: > > Hi Amir, > > Amir Goldstein <amir73il@gmail.com> writes: > > > On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <alexl@redhat.com> wrote: > >> > >> Giuseppe Scrivano and I have recently been working on a new project we > >> call composefs. This is the first time we propose this publically and > >> we would like some feedback on it. > >> > >> At its core, composefs is a way to construct and use read only images > >> that are used similar to how you would use e.g. loop-back mounted > >> squashfs images. On top of this composefs has two fundamental > >> features. First it allows sharing of file data (both on disk and in > >> page cache) between images, and secondly it has dm-verity like > >> validation on read. > >> > >> Let me first start with a minimal example of how this can be used, > >> before going into the details: > >> > >> Suppose we have this source for an image: > >> > >> rootfs/ > >> ├── dir > >> │ └── another_a > >> ├── file_a > >> └── file_b > >> > >> We can then use this to generate an image file and a set of > >> content-addressed backing files: > >> > >> # mkcomposefs --digest-store=objects rootfs/ rootfs.img > >> # ls -l rootfs.img objects/*/* > >> -rw-------. 1 root root 10 Nov 18 13:20 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 > >> -rw-------. 1 root root 10 Nov 18 13:20 objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f > >> -rw-r--r--. 1 root root 4228 Nov 18 13:20 rootfs.img > >> > >> The rootfs.img file contains all information about directory and file > >> metadata plus references to the backing files by name. We can now > >> mount this and look at the result: > >> > >> # mount -t composefs rootfs.img -o basedir=objects /mnt > >> # ls /mnt/ > >> dir file_a file_b > >> # cat /mnt/file_a > >> content_a > >> > >> When reading this file the kernel is actually reading the backing > >> file, in a fashion similar to overlayfs. Since the backing file is > >> content-addressed, the objects directory can be shared for multiple > >> images, and any files that happen to have the same content are > >> shared. I refer to this as opportunistic sharing, as it is different > >> than the more course-grained explicit sharing used by e.g. container > >> base images. > >> > >> The next step is the validation. Note how the object files have > >> fs-verity enabled. In fact, they are named by their fs-verity digest: > >> > >> # fsverity digest objects/*/* > >> sha256:02927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 > >> sha256:cc3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f > >> > >> The generated filesystm image may contain the expected digest for the > >> backing files. 
When the backing file digest is incorrect, the open > >> will fail, and if the open succeeds, any other on-disk file-changes > >> will be detected by fs-verity: > >> > >> # cat objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f > >> content_a > >> # rm -f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f > >> # echo modified > objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f > >> # cat /mnt/file_a > >> WARNING: composefs backing file '3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f' unexpectedly had no fs-verity digest > >> cat: /mnt/file_a: Input/output error > >> > >> This re-uses the existing fs-verity functionallity to protect against > >> changes in file contents, while adding on top of it protection against > >> changes in filesystem metadata and structure. I.e. protecting against > >> replacing a fs-verity enabled file or modifying file permissions or > >> xattrs. > >> > >> To be fully verified we need another step: we use fs-verity on the > >> image itself. Then we pass the expected digest on the mount command > >> line (which will be verified at mount time): > >> > >> # fsverity enable rootfs.img > >> # fsverity digest rootfs.img > >> sha256:da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 rootfs.img > >> # mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt > >> > >> So, given a trusted set of mount options (say unlocked from TPM), we > >> have a fully verified filesystem tree mounted, with opportunistic > >> finegrained sharing of identical files. > >> > >> So, why do we want this? There are two initial users. First of all we > >> want to use the opportunistic sharing for the podman container image > >> baselayer. The idea is to use a composefs mount as the lower directory > >> in an overlay mount, with the upper directory being the container work > >> dir. This will allow automatical file-level disk and page-cache > >> sharning between any two images, independent of details like the > >> permissions and timestamps of the files. > >> > >> Secondly we are interested in using the verification aspects of > >> composefs in the ostree project. Ostree already supports a > >> content-addressed object store, but it is currently referenced by > >> hardlink farms. The object store and the trees that reference it are > >> signed and verified at download time, but there is no runtime > >> verification. If we replace the hardlink farm with a composefs image > >> that points into the existing object store we can use the verification > >> to implement runtime verification. > >> > >> In fact, the tooling to create composefs images is 100% reproducible, > >> so all we need is to add the composefs image fs-verity digest into the > >> ostree commit. Then the image can be reconstructed from the ostree > >> commit info, generating a file with the same fs-verity digest. > >> > >> These are the usecases we're currently interested in, but there seems > >> to be a breadth of other possible uses. For example, many systems use > >> loopback mounts for images (like lxc or snap), and these could take > >> advantage of the opportunistic sharing. We've also talked about using > >> fuse to implement a local cache for the backing files. I.e. you would > >> have the second basedir be a fuse filesystem. On lookup failure in the > >> first basedir it downloads the file and saves it in the first basedir > >> for later lookups. 
There are many interesting possibilities here. > >> > >> The patch series contains some documentation on the file format and > >> how to use the filesystem. > >> > >> The userspace tools (and a standalone kernel module) is available > >> here: > >> https://github.com/containers/composefs > >> > >> Initial work on ostree integration is here: > >> https://github.com/ostreedev/ostree/pull/2640 > >> > >> Changes since v2: > >> - Simplified filesystem format to use fixed size inodes. This resulted > >> in simpler (now < 2k lines) code as well as higher performance at > >> the cost of slightly (~40%) larger images. > >> - We now use multi-page mappings from the page cache, which removes > >> limits on sizes of xattrs and makes the dirent handling code simpler. > >> - Added more documentation about the on-disk file format. > >> - General cleanups based on review comments. > >> > > > > Hi Alexander, > > > > I must say that I am a little bit puzzled by this v3. > > Gao, Christian and myself asked you questions on v2 > > that are not mentioned in v3 at all. > > > > To sum it up, please do not propose composefs without explaining > > what are the barriers for achieving the exact same outcome with > > the use of a read-only overlayfs with two lower layer - > > uppermost with erofs containing the metadata files, which include > > trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer > > to the lowermost layer containing the content files. > > I think Dave explained quite well why using overlay is not comparable to > what composefs does. > Where? Can I get a link please? If there are good reasons why composefs is superior to erofs+overlayfs Please include them in the submission, since several developers keep raising the same questions - that is all I ask. > One big difference is that overlay still requires at least a syscall for > each file in the image, and then we need the equivalent of "rm -rf" to > clean it up. It is somehow acceptable for long-running services, but it > is not for "serverless" containers where images/containers are created > and destroyed frequently. So even in the case we already have all the > image files available locally, we still need to create a checkout with > the final structure we need for the image. > I think you did not understand my suggestion: overlay read-only mount: layer 1: erofs mount of a precomposed image (same as mkcomposefs) layer 2: any pre-existing fs path with /blocks repository layer 3: any per-existing fs path with /blocks repository ... The mkcomposefs flow is exactly the same in this suggestion the upper layer image is created without any syscalls and removed without any syscalls. Overlayfs already has the feature of redirecting from upper layer to relative paths in lower layers. > I also don't see how overlay would solve the verified image problem. We > would have the same problem we have today with fs-verity as it can only > validate a single file but not the entire directory structure. Changes > that affect the layer containing the trusted.overlay.{metacopy,redirect} > xattrs won't be noticed. > The entire erofs image would be fsverified including the overlayfs xattrs. That is exactly the same model as composefs. I am not even saying that your model is wrong, only that you are within reach of implementing it with existing subsystems. > There are at the moment two ways to handle container images, both somehow > guided by the available file systems in the kernel. > > - A single image mounted as a block device. 
> - A list of tarballs (OCI image) that are unpacked and mounted as > overlay layers. > > One big advantage of the block devices model is that you can use > dm-verity, this is something we miss today with OCI container images > that use overlay. > > What we are proposing with composefs is a way to have "dm-verity" style > validation based on fs-verity and the possibility to share individual > files instead of layers. These files can also be on different file > systems, which is something not possible with the block device model. > > The composefs manifest blob could be generated remotely and signed. A > client would need just to validate the signature for the manifest blob > and from there retrieve the files that are not in the local CAS (even > from an insecure source) and mount directly the manifest file. > Excellent description of the problem. I agree that we need a hybrid solution between the block and tarball image model. All I am saying is that this solution can use existing kernel components and existing established on-disk formats (erofs+overlayfs). What was missing all along was the userspace component (i.e. composefs) and I am very happy that you guys are working on this project. These userspace tools could be useful for other use cases. For example, overlayfs is able to describe a large directory rename with redirect xattr since v4.9, but image composing tools do not make use of that, so an OCI image describing a large dir rename will currently contain all the files within. Once again, you may or may not be able to use erofs and overlayfs out of the box for your needs, but so far I did not see any functionality gap that is not possible to close. Please let me know if you know of such gaps or if my proposal does not meet the goals of composefs. Thanks, Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
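As a rough sketch of the layering proposed above, the uppermost metadata layer could be prepared as a plain directory and then packed into an erofs image; everything here is illustrative (directory and image names are made up, size/ownership details are glossed over, and it assumes mkfs.erofs is run as root so that the trusted.* xattrs are read and recorded):

# mkdir -p metadata-layer/dir
# touch metadata-layer/file_a
# chmod 644 metadata-layer/file_a
# setfattr -n trusted.overlay.metacopy metadata-layer/file_a
# setfattr -n trusted.overlay.redirect -v "/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f" metadata-layer/file_a
# mkfs.erofs metadata.erofs metadata-layer/

The metacopy xattr marks file_a as metadata-only, and the absolute redirect is resolved against the layers below, so with lowerdir=<erofs mount>:<objects store> the file data comes from the content-addressed blob while mode, ownership and the remaining xattrs come from the erofs layer. The client-side erofs and overlay mounts are then the two commands given later in the thread.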
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-21 10:57 ` Amir Goldstein @ 2023-01-21 15:01 ` Giuseppe Scrivano 2023-01-21 15:54 ` Amir Goldstein 0 siblings, 1 reply; 87+ messages in thread From: Giuseppe Scrivano @ 2023-01-21 15:01 UTC (permalink / raw) To: Amir Goldstein Cc: Alexander Larsson, linux-fsdevel, linux-kernel, david, brauner, viro, Vivek Goyal, Miklos Szeredi Amir Goldstein <amir73il@gmail.com> writes: > On Sat, Jan 21, 2023 at 12:18 AM Giuseppe Scrivano <gscrivan@redhat.com> wrote: >> >> Hi Amir, >> >> Amir Goldstein <amir73il@gmail.com> writes: >> >> > On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <alexl@redhat.com> wrote: >> >> >> >> Giuseppe Scrivano and I have recently been working on a new project we >> >> call composefs. This is the first time we propose this publically and >> >> we would like some feedback on it. >> >> >> >> At its core, composefs is a way to construct and use read only images >> >> that are used similar to how you would use e.g. loop-back mounted >> >> squashfs images. On top of this composefs has two fundamental >> >> features. First it allows sharing of file data (both on disk and in >> >> page cache) between images, and secondly it has dm-verity like >> >> validation on read. >> >> >> >> Let me first start with a minimal example of how this can be used, >> >> before going into the details: >> >> >> >> Suppose we have this source for an image: >> >> >> >> rootfs/ >> >> ├── dir >> >> │ └── another_a >> >> ├── file_a >> >> └── file_b >> >> >> >> We can then use this to generate an image file and a set of >> >> content-addressed backing files: >> >> >> >> # mkcomposefs --digest-store=objects rootfs/ rootfs.img >> >> # ls -l rootfs.img objects/*/* >> >> -rw-------. 1 root root 10 Nov 18 13:20 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 >> >> -rw-------. 1 root root 10 Nov 18 13:20 objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f >> >> -rw-r--r--. 1 root root 4228 Nov 18 13:20 rootfs.img >> >> >> >> The rootfs.img file contains all information about directory and file >> >> metadata plus references to the backing files by name. We can now >> >> mount this and look at the result: >> >> >> >> # mount -t composefs rootfs.img -o basedir=objects /mnt >> >> # ls /mnt/ >> >> dir file_a file_b >> >> # cat /mnt/file_a >> >> content_a >> >> >> >> When reading this file the kernel is actually reading the backing >> >> file, in a fashion similar to overlayfs. Since the backing file is >> >> content-addressed, the objects directory can be shared for multiple >> >> images, and any files that happen to have the same content are >> >> shared. I refer to this as opportunistic sharing, as it is different >> >> than the more course-grained explicit sharing used by e.g. container >> >> base images. >> >> >> >> The next step is the validation. Note how the object files have >> >> fs-verity enabled. In fact, they are named by their fs-verity digest: >> >> >> >> # fsverity digest objects/*/* >> >> sha256:02927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 >> >> sha256:cc3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f >> >> >> >> The generated filesystm image may contain the expected digest for the >> >> backing files. 
When the backing file digest is incorrect, the open >> >> will fail, and if the open succeeds, any other on-disk file-changes >> >> will be detected by fs-verity: >> >> >> >> # cat objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f >> >> content_a >> >> # rm -f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f >> >> # echo modified > objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f >> >> # cat /mnt/file_a >> >> WARNING: composefs backing file '3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f' unexpectedly had no fs-verity digest >> >> cat: /mnt/file_a: Input/output error >> >> >> >> This re-uses the existing fs-verity functionallity to protect against >> >> changes in file contents, while adding on top of it protection against >> >> changes in filesystem metadata and structure. I.e. protecting against >> >> replacing a fs-verity enabled file or modifying file permissions or >> >> xattrs. >> >> >> >> To be fully verified we need another step: we use fs-verity on the >> >> image itself. Then we pass the expected digest on the mount command >> >> line (which will be verified at mount time): >> >> >> >> # fsverity enable rootfs.img >> >> # fsverity digest rootfs.img >> >> sha256:da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 rootfs.img >> >> # mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt >> >> >> >> So, given a trusted set of mount options (say unlocked from TPM), we >> >> have a fully verified filesystem tree mounted, with opportunistic >> >> finegrained sharing of identical files. >> >> >> >> So, why do we want this? There are two initial users. First of all we >> >> want to use the opportunistic sharing for the podman container image >> >> baselayer. The idea is to use a composefs mount as the lower directory >> >> in an overlay mount, with the upper directory being the container work >> >> dir. This will allow automatical file-level disk and page-cache >> >> sharning between any two images, independent of details like the >> >> permissions and timestamps of the files. >> >> >> >> Secondly we are interested in using the verification aspects of >> >> composefs in the ostree project. Ostree already supports a >> >> content-addressed object store, but it is currently referenced by >> >> hardlink farms. The object store and the trees that reference it are >> >> signed and verified at download time, but there is no runtime >> >> verification. If we replace the hardlink farm with a composefs image >> >> that points into the existing object store we can use the verification >> >> to implement runtime verification. >> >> >> >> In fact, the tooling to create composefs images is 100% reproducible, >> >> so all we need is to add the composefs image fs-verity digest into the >> >> ostree commit. Then the image can be reconstructed from the ostree >> >> commit info, generating a file with the same fs-verity digest. >> >> >> >> These are the usecases we're currently interested in, but there seems >> >> to be a breadth of other possible uses. For example, many systems use >> >> loopback mounts for images (like lxc or snap), and these could take >> >> advantage of the opportunistic sharing. We've also talked about using >> >> fuse to implement a local cache for the backing files. I.e. you would >> >> have the second basedir be a fuse filesystem. 
On lookup failure in the >> >> first basedir it downloads the file and saves it in the first basedir >> >> for later lookups. There are many interesting possibilities here. >> >> >> >> The patch series contains some documentation on the file format and >> >> how to use the filesystem. >> >> >> >> The userspace tools (and a standalone kernel module) is available >> >> here: >> >> https://github.com/containers/composefs >> >> >> >> Initial work on ostree integration is here: >> >> https://github.com/ostreedev/ostree/pull/2640 >> >> >> >> Changes since v2: >> >> - Simplified filesystem format to use fixed size inodes. This resulted >> >> in simpler (now < 2k lines) code as well as higher performance at >> >> the cost of slightly (~40%) larger images. >> >> - We now use multi-page mappings from the page cache, which removes >> >> limits on sizes of xattrs and makes the dirent handling code simpler. >> >> - Added more documentation about the on-disk file format. >> >> - General cleanups based on review comments. >> >> >> > >> > Hi Alexander, >> > >> > I must say that I am a little bit puzzled by this v3. >> > Gao, Christian and myself asked you questions on v2 >> > that are not mentioned in v3 at all. >> > >> > To sum it up, please do not propose composefs without explaining >> > what are the barriers for achieving the exact same outcome with >> > the use of a read-only overlayfs with two lower layer - >> > uppermost with erofs containing the metadata files, which include >> > trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer >> > to the lowermost layer containing the content files. >> >> I think Dave explained quite well why using overlay is not comparable to >> what composefs does. >> > > Where? Can I get a link please? I am referring to this message: https://lore.kernel.org/lkml/20230118002242.GB937597@dread.disaster.area/ > If there are good reasons why composefs is superior to erofs+overlayfs > Please include them in the submission, since several developers keep > raising the same questions - that is all I ask. > >> One big difference is that overlay still requires at least a syscall for >> each file in the image, and then we need the equivalent of "rm -rf" to >> clean it up. It is somehow acceptable for long-running services, but it >> is not for "serverless" containers where images/containers are created >> and destroyed frequently. So even in the case we already have all the >> image files available locally, we still need to create a checkout with >> the final structure we need for the image. >> > > I think you did not understand my suggestion: > > overlay read-only mount: > layer 1: erofs mount of a precomposed image (same as mkcomposefs) > layer 2: any pre-existing fs path with /blocks repository > layer 3: any per-existing fs path with /blocks repository > ... > > The mkcomposefs flow is exactly the same in this suggestion > the upper layer image is created without any syscalls and > removed without any syscalls. mkcomposefs is supposed to be used server side, when the image is built. The clients that will mount the image don't have to create it (at least for images that will provide the manifest). So this is quite different as in the overlay model we must create the layout, that is the equivalent of the composefs manifest, on any node the image is pulled to. > Overlayfs already has the feature of redirecting from upper layer > to relative paths in lower layers. Could you please provide more information on how you would compose the overlay image first? 
From what I can see, it still requires at least one syscall for each file in the image to be created and these images are not portable to a different machine. Should we always make "/blocks" a whiteout to prevent it from leaking into the container? And what prevents files under "/blocks" from being replaced with a different version? I think fs-verity on the EROFS image itself won't cover it. >> I also don't see how overlay would solve the verified image problem. We >> would have the same problem we have today with fs-verity as it can only >> validate a single file but not the entire directory structure. Changes >> that affect the layer containing the trusted.overlay.{metacopy,redirect} >> xattrs won't be noticed. >> > > The entire erofs image would be fsverified including the overlayfs xattrs. > That is exactly the same model as composefs. > I am not even saying that your model is wrong, only that you are within > reach of implementing it with existing subsystems. now we can do: mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt that is quite useful for mounting the OS image, as in the OSTree case. How would that be possible with the setup you are proposing? Would overlay gain a new "digest=" kind of option to validate its first layer? >> There are at the moment two ways to handle container images, both somehow >> guided by the available file systems in the kernel. >> >> - A single image mounted as a block device. >> - A list of tarballs (OCI image) that are unpacked and mounted as >> overlay layers. >> >> One big advantage of the block devices model is that you can use >> dm-verity, this is something we miss today with OCI container images >> that use overlay. >> >> What we are proposing with composefs is a way to have "dm-verity" style >> validation based on fs-verity and the possibility to share individual >> files instead of layers. These files can also be on different file >> systems, which is something not possible with the block device model. >> >> The composefs manifest blob could be generated remotely and signed. A >> client would need just to validate the signature for the manifest blob >> and from there retrieve the files that are not in the local CAS (even >> from an insecure source) and mount directly the manifest file. >> > > Excellent description of the problem. > I agree that we need a hybrid solution between the block > and tarball image model. > > All I am saying is that this solution can use existing kernel > components and existing established on-disk formats > (erofs+overlayfs). > > What was missing all along was the userspace component > (i.e. composefs) and I am very happy that you guys are > working on this project. > > These userspace tools could be useful for other use cases. > For example, overlayfs is able to describe a large directory > rename with redirect xattr since v4.9, but image composing > tools do not make use of that, so an OCI image describing a > large dir rename will currently contain all the files within. > > Once again, you may or may not be able to use erofs and > overlayfs out of the box for your needs, but so far I did not > see any functionality gap that is not possible to close. > > Please let me know if you know of such gaps or if my > proposal does not meet the goals of composefs. thanks for your helpful comments. Regards, Giuseppe ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-21 15:01 ` Giuseppe Scrivano @ 2023-01-21 15:54 ` Amir Goldstein 2023-01-21 16:26 ` Gao Xiang 0 siblings, 1 reply; 87+ messages in thread From: Amir Goldstein @ 2023-01-21 15:54 UTC (permalink / raw) To: Giuseppe Scrivano Cc: Alexander Larsson, linux-fsdevel, linux-kernel, david, brauner, viro, Vivek Goyal, Miklos Szeredi On Sat, Jan 21, 2023 at 5:01 PM Giuseppe Scrivano <gscrivan@redhat.com> wrote: > > Amir Goldstein <amir73il@gmail.com> writes: > > > On Sat, Jan 21, 2023 at 12:18 AM Giuseppe Scrivano <gscrivan@redhat.com> wrote: > >> > >> Hi Amir, > >> > >> Amir Goldstein <amir73il@gmail.com> writes: > >> > >> > On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <alexl@redhat.com> wrote: > >> >> > >> >> Giuseppe Scrivano and I have recently been working on a new project we > >> >> call composefs. This is the first time we propose this publically and > >> >> we would like some feedback on it. > >> >> > >> >> At its core, composefs is a way to construct and use read only images > >> >> that are used similar to how you would use e.g. loop-back mounted > >> >> squashfs images. On top of this composefs has two fundamental > >> >> features. First it allows sharing of file data (both on disk and in > >> >> page cache) between images, and secondly it has dm-verity like > >> >> validation on read. > >> >> > >> >> Let me first start with a minimal example of how this can be used, > >> >> before going into the details: > >> >> > >> >> Suppose we have this source for an image: > >> >> > >> >> rootfs/ > >> >> ├── dir > >> >> │ └── another_a > >> >> ├── file_a > >> >> └── file_b > >> >> > >> >> We can then use this to generate an image file and a set of > >> >> content-addressed backing files: > >> >> > >> >> # mkcomposefs --digest-store=objects rootfs/ rootfs.img > >> >> # ls -l rootfs.img objects/*/* > >> >> -rw-------. 1 root root 10 Nov 18 13:20 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 > >> >> -rw-------. 1 root root 10 Nov 18 13:20 objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f > >> >> -rw-r--r--. 1 root root 4228 Nov 18 13:20 rootfs.img > >> >> > >> >> The rootfs.img file contains all information about directory and file > >> >> metadata plus references to the backing files by name. We can now > >> >> mount this and look at the result: > >> >> > >> >> # mount -t composefs rootfs.img -o basedir=objects /mnt > >> >> # ls /mnt/ > >> >> dir file_a file_b > >> >> # cat /mnt/file_a > >> >> content_a > >> >> > >> >> When reading this file the kernel is actually reading the backing > >> >> file, in a fashion similar to overlayfs. Since the backing file is > >> >> content-addressed, the objects directory can be shared for multiple > >> >> images, and any files that happen to have the same content are > >> >> shared. I refer to this as opportunistic sharing, as it is different > >> >> than the more course-grained explicit sharing used by e.g. container > >> >> base images. > >> >> > >> >> The next step is the validation. Note how the object files have > >> >> fs-verity enabled. 
In fact, they are named by their fs-verity digest: > >> >> > >> >> # fsverity digest objects/*/* > >> >> sha256:02927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 > >> >> sha256:cc3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f > >> >> > >> >> The generated filesystm image may contain the expected digest for the > >> >> backing files. When the backing file digest is incorrect, the open > >> >> will fail, and if the open succeeds, any other on-disk file-changes > >> >> will be detected by fs-verity: > >> >> > >> >> # cat objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f > >> >> content_a > >> >> # rm -f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f > >> >> # echo modified > objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f > >> >> # cat /mnt/file_a > >> >> WARNING: composefs backing file '3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f' unexpectedly had no fs-verity digest > >> >> cat: /mnt/file_a: Input/output error > >> >> > >> >> This re-uses the existing fs-verity functionallity to protect against > >> >> changes in file contents, while adding on top of it protection against > >> >> changes in filesystem metadata and structure. I.e. protecting against > >> >> replacing a fs-verity enabled file or modifying file permissions or > >> >> xattrs. > >> >> > >> >> To be fully verified we need another step: we use fs-verity on the > >> >> image itself. Then we pass the expected digest on the mount command > >> >> line (which will be verified at mount time): > >> >> > >> >> # fsverity enable rootfs.img > >> >> # fsverity digest rootfs.img > >> >> sha256:da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 rootfs.img > >> >> # mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt > >> >> > >> >> So, given a trusted set of mount options (say unlocked from TPM), we > >> >> have a fully verified filesystem tree mounted, with opportunistic > >> >> finegrained sharing of identical files. > >> >> > >> >> So, why do we want this? There are two initial users. First of all we > >> >> want to use the opportunistic sharing for the podman container image > >> >> baselayer. The idea is to use a composefs mount as the lower directory > >> >> in an overlay mount, with the upper directory being the container work > >> >> dir. This will allow automatical file-level disk and page-cache > >> >> sharning between any two images, independent of details like the > >> >> permissions and timestamps of the files. > >> >> > >> >> Secondly we are interested in using the verification aspects of > >> >> composefs in the ostree project. Ostree already supports a > >> >> content-addressed object store, but it is currently referenced by > >> >> hardlink farms. The object store and the trees that reference it are > >> >> signed and verified at download time, but there is no runtime > >> >> verification. If we replace the hardlink farm with a composefs image > >> >> that points into the existing object store we can use the verification > >> >> to implement runtime verification. > >> >> > >> >> In fact, the tooling to create composefs images is 100% reproducible, > >> >> so all we need is to add the composefs image fs-verity digest into the > >> >> ostree commit. 
Then the image can be reconstructed from the ostree > >> >> commit info, generating a file with the same fs-verity digest. > >> >> > >> >> These are the usecases we're currently interested in, but there seems > >> >> to be a breadth of other possible uses. For example, many systems use > >> >> loopback mounts for images (like lxc or snap), and these could take > >> >> advantage of the opportunistic sharing. We've also talked about using > >> >> fuse to implement a local cache for the backing files. I.e. you would > >> >> have the second basedir be a fuse filesystem. On lookup failure in the > >> >> first basedir it downloads the file and saves it in the first basedir > >> >> for later lookups. There are many interesting possibilities here. > >> >> > >> >> The patch series contains some documentation on the file format and > >> >> how to use the filesystem. > >> >> > >> >> The userspace tools (and a standalone kernel module) is available > >> >> here: > >> >> https://github.com/containers/composefs > >> >> > >> >> Initial work on ostree integration is here: > >> >> https://github.com/ostreedev/ostree/pull/2640 > >> >> > >> >> Changes since v2: > >> >> - Simplified filesystem format to use fixed size inodes. This resulted > >> >> in simpler (now < 2k lines) code as well as higher performance at > >> >> the cost of slightly (~40%) larger images. > >> >> - We now use multi-page mappings from the page cache, which removes > >> >> limits on sizes of xattrs and makes the dirent handling code simpler. > >> >> - Added more documentation about the on-disk file format. > >> >> - General cleanups based on review comments. > >> >> > >> > > >> > Hi Alexander, > >> > > >> > I must say that I am a little bit puzzled by this v3. > >> > Gao, Christian and myself asked you questions on v2 > >> > that are not mentioned in v3 at all. > >> > > >> > To sum it up, please do not propose composefs without explaining > >> > what are the barriers for achieving the exact same outcome with > >> > the use of a read-only overlayfs with two lower layer - > >> > uppermost with erofs containing the metadata files, which include > >> > trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer > >> > to the lowermost layer containing the content files. > >> > >> I think Dave explained quite well why using overlay is not comparable to > >> what composefs does. > >> > > > > Where? Can I get a link please? > > I am referring to this message: https://lore.kernel.org/lkml/20230118002242.GB937597@dread.disaster.area/ > That is a good explanation why the current container runtime overlay storage driver is inadequate, because the orchestration requires untar of OCI tarball image before mounting overlayfs. It is not a kernel issue, it is a userspace issue, because userspace does not utilize overlayfs driver features that are now 6 years old (redirect_dir) and 4 years old (metacopy). I completely agree that reflink and hardlinks are not a viable solution to ephemeral containers. > > If there are good reasons why composefs is superior to erofs+overlayfs > > Please include them in the submission, since several developers keep > > raising the same questions - that is all I ask. > > > >> One big difference is that overlay still requires at least a syscall for > >> each file in the image, and then we need the equivalent of "rm -rf" to > >> clean it up. It is somehow acceptable for long-running services, but it > >> is not for "serverless" containers where images/containers are created > >> and destroyed frequently. 
So even in the case we already have all the > >> image files available locally, we still need to create a checkout with > >> the final structure we need for the image. > >> > > > > I think you did not understand my suggestion: > > > > overlay read-only mount: > > layer 1: erofs mount of a precomposed image (same as mkcomposefs) > > layer 2: any pre-existing fs path with /blocks repository > > layer 3: any per-existing fs path with /blocks repository > > ... > > > > The mkcomposefs flow is exactly the same in this suggestion > > the upper layer image is created without any syscalls and > > removed without any syscalls. > > mkcomposefs is supposed to be used server side, when the image is built. > The clients that will mount the image don't have to create it (at least > for images that will provide the manifest). > > So this is quite different as in the overlay model we must create the > layout, that is the equivalent of the composefs manifest, on any node > the image is pulled to. > You don't need to re-create the erofs manifest on the client. Unless I am completely missing something, the flow that I am suggesting is drop-in replacement to what you have done. IIUC, you invented an on-disk format for composefs manifest. Is there anything preventing you from using the existing erofs on-disk format to pack the manifest file? The files in the manifest would be inodes with no blocks, only with size and attributes and overlay xattrs with references to the real object blocks, same as you would do with mkcomposefs. Is it not? Maybe what I am missing is how are the blob objects distributed? Are they also shipped as composefs image bundles? That can still be the case with erofs images that may contain both blobs with data and metadata files referencing blobs in older images. > > Overlayfs already has the feature of redirecting from upper layer > > to relative paths in lower layers. > > Could you please provide more information on how you would compose the > overlay image first? > > From what I can see, it still requires at least one syscall for each > file in the image to be created and these images are not portable to a > different machine. Terminology nuance - you do not create an overlayfs image on the server you create an erofs image on the server, exactly as you would create a composefs image on the server. The shipped overlay "image" would then be the erofs image with references to prereqisite images that contain the blobs and the digest of the erofs image. # mount -t composefs rootfs.img -o basedir=objects /mnt client will do: # mount -t erofs rootfs.img -o digest=da.... /metadata # mount -t overlay -o ro,metacopy=on,lowerdir=/metadata:/objects /mnt > > Should we always make "/blocks" a whiteout to prevent it is leaked in > the container? That would be the simplest option, yes. If needed we can also make it a hidden layer whose objects never appear in the namespace and can only be referenced from an upper layer redirection. > > And what prevents files under "/blocks" to be replaced with a different > version? I think fs-verity on the EROFS image itself won't cover it. > I think that part should be added to the overlayfs kernel driver. We could enhance overlayfs to include optional "overlay.verity" digest on the metacopy upper files to be fed into fsverity when opening lower blob files that reside on an fsverity supported filesystem. I am not an expert in trust chains, but I think this is equivalent to how composefs driver was going to solve the same problem? 
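To make the server-side half of this suggestion concrete, here is a rough sketch of how one entry of such a metadata-only tree could be prepared before packing it with mkfs.erofs. It is only an illustration: the file name and digest are taken from the example image earlier in the thread, the trusted.overlay.verity xattr is the proposal being discussed here (it does not exist in overlayfs today), and details such as whether setfattr without -v really stores a zero-length value and whether mkfs.erofs carries trusted.* xattrs over unchanged should be double-checked.

#!/bin/sh
# Sketch: build one entry of a metadata-only tree for the erofs+overlayfs layout.
meta=meta-root
digest=cc3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
obj=cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f

mkdir -p "$meta/dir"

# Sparse, data-less file: correct metadata and size, contents live in /objects.
touch "$meta/file_a"
truncate -s 10 "$meta/file_a"
chmod 0644 "$meta/file_a"
setfattr -n trusted.overlay.metacopy "$meta/file_a"              # zero-length xattr
setfattr -n trusted.overlay.redirect -v "/$obj" "$meta/file_a"
# Proposed (not yet existing) xattr carrying the fs-verity digest of the
# backing object, so the kernel could verify it when the file is opened.
setfattr -n trusted.overlay.verity -v "sha256:$digest" "$meta/file_a"

# Whiteouts covering the object store's top-level names so they do not
# leak into the merged root.
for d in objects/*; do
    mknod "$meta/$(basename "$d")" c 0 0
done

# Pack the tree into the image that is shipped to clients.
mkfs.erofs rootfs-meta.erofs "$meta"

The client then only mounts this image together with the shared object store; nothing per-file has to be created on the node the image is pulled to.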
> >> I also don't see how overlay would solve the verified image problem. We > >> would have the same problem we have today with fs-verity as it can only > >> validate a single file but not the entire directory structure. Changes > >> that affect the layer containing the trusted.overlay.{metacopy,redirect} > >> xattrs won't be noticed. > >> > > > > The entire erofs image would be fsverified including the overlayfs xattrs. > > That is exactly the same model as composefs. > > I am not even saying that your model is wrong, only that you are within > > reach of implementing it with existing subsystems. > > now we can do: > > mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt > > that is quite useful for mounting the OS image, as is the OSTree case. > > How would that be possible with the setup you are proposing? Would > overlay gain a new "digest=" kind of option to validate its first layer? > Overlayfs job is to merge the layers. The first layer would first need to be mounted as erofs, so I think that the option digest= would need to be added to erofs. Then, any content in the erofs mount (which is the first overlay layer) would be verified by fsverity and overlayfs job would be to feed the digest found in "overlay.verity" xattrs inside the erofs layer when accessing files in the blob lower (or hidden) layer. Does this make sense to you? Or is there still something that I am missing or misunderstanding about the use case? Thanks, Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
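Spelling out the client side of that: until something like an erofs "digest=" mount option exists, the whole-image check can only be done by a mount helper in userspace before mounting, roughly as below -- which is exactly the weaker trust model being debated here. Image names, mount points and the helper itself are illustrative.

#!/bin/sh
# Sketch: verify the shipped metadata image, then assemble the overlay.
expected="$1"        # fs-verity digest of the image, from a trusted source

actual=$(fsverity measure rootfs-meta.erofs | awk '{print $1}')
[ "$actual" = "sha256:$expected" ] || { echo "image digest mismatch" >&2; exit 1; }

mkdir -p /metadata /mnt
mount -t erofs -o ro,loop rootfs-meta.erofs /metadata
mount -t overlay overlay \
    -o ro,metacopy=on,redirect_dir=follow,lowerdir=/metadata:/objects /mnt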
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-21 15:54 ` Amir Goldstein @ 2023-01-21 16:26 ` Gao Xiang 0 siblings, 0 replies; 87+ messages in thread From: Gao Xiang @ 2023-01-21 16:26 UTC (permalink / raw) To: Amir Goldstein, Giuseppe Scrivano Cc: Alexander Larsson, linux-fsdevel, linux-kernel, david, brauner, viro, Vivek Goyal, Miklos Szeredi On 2023/1/21 23:54, Amir Goldstein wrote: > On Sat, Jan 21, 2023 at 5:01 PM Giuseppe Scrivano <gscrivan@redhat.com> wrote: >> >> Amir Goldstein <amir73il@gmail.com> writes: >> >>> On Sat, Jan 21, 2023 at 12:18 AM Giuseppe Scrivano <gscrivan@redhat.com> wrote: >>>> >>>> Hi Amir, >>>> >>>> Amir Goldstein <amir73il@gmail.com> writes: >>>> >>>>> On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <alexl@redhat.com> wrote: >>>>>> >>>>>> Giuseppe Scrivano and I have recently been working on a new project we >>>>>> call composefs. This is the first time we propose this publically and >>>>>> we would like some feedback on it. >>>>>> >>>>>> At its core, composefs is a way to construct and use read only images >>>>>> that are used similar to how you would use e.g. loop-back mounted >>>>>> squashfs images. On top of this composefs has two fundamental >>>>>> features. First it allows sharing of file data (both on disk and in >>>>>> page cache) between images, and secondly it has dm-verity like >>>>>> validation on read. >>>>>> >>>>>> Let me first start with a minimal example of how this can be used, >>>>>> before going into the details: >>>>>> >>>>>> Suppose we have this source for an image: >>>>>> >>>>>> rootfs/ >>>>>> ├── dir >>>>>> │ └── another_a >>>>>> ├── file_a >>>>>> └── file_b >>>>>> >>>>>> We can then use this to generate an image file and a set of >>>>>> content-addressed backing files: >>>>>> >>>>>> # mkcomposefs --digest-store=objects rootfs/ rootfs.img >>>>>> # ls -l rootfs.img objects/*/* >>>>>> -rw-------. 1 root root 10 Nov 18 13:20 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 >>>>>> -rw-------. 1 root root 10 Nov 18 13:20 objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f >>>>>> -rw-r--r--. 1 root root 4228 Nov 18 13:20 rootfs.img >>>>>> >>>>>> The rootfs.img file contains all information about directory and file >>>>>> metadata plus references to the backing files by name. We can now >>>>>> mount this and look at the result: >>>>>> >>>>>> # mount -t composefs rootfs.img -o basedir=objects /mnt >>>>>> # ls /mnt/ >>>>>> dir file_a file_b >>>>>> # cat /mnt/file_a >>>>>> content_a >>>>>> >>>>>> When reading this file the kernel is actually reading the backing >>>>>> file, in a fashion similar to overlayfs. Since the backing file is >>>>>> content-addressed, the objects directory can be shared for multiple >>>>>> images, and any files that happen to have the same content are >>>>>> shared. I refer to this as opportunistic sharing, as it is different >>>>>> than the more course-grained explicit sharing used by e.g. container >>>>>> base images. >>>>>> >>>>>> The next step is the validation. Note how the object files have >>>>>> fs-verity enabled. 
In fact, they are named by their fs-verity digest: >>>>>> >>>>>> # fsverity digest objects/*/* >>>>>> sha256:02927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 >>>>>> sha256:cc3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f >>>>>> >>>>>> The generated filesystm image may contain the expected digest for the >>>>>> backing files. When the backing file digest is incorrect, the open >>>>>> will fail, and if the open succeeds, any other on-disk file-changes >>>>>> will be detected by fs-verity: >>>>>> >>>>>> # cat objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f >>>>>> content_a >>>>>> # rm -f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f >>>>>> # echo modified > objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f >>>>>> # cat /mnt/file_a >>>>>> WARNING: composefs backing file '3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f' unexpectedly had no fs-verity digest >>>>>> cat: /mnt/file_a: Input/output error >>>>>> >>>>>> This re-uses the existing fs-verity functionallity to protect against >>>>>> changes in file contents, while adding on top of it protection against >>>>>> changes in filesystem metadata and structure. I.e. protecting against >>>>>> replacing a fs-verity enabled file or modifying file permissions or >>>>>> xattrs. >>>>>> >>>>>> To be fully verified we need another step: we use fs-verity on the >>>>>> image itself. Then we pass the expected digest on the mount command >>>>>> line (which will be verified at mount time): >>>>>> >>>>>> # fsverity enable rootfs.img >>>>>> # fsverity digest rootfs.img >>>>>> sha256:da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 rootfs.img >>>>>> # mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt >>>>>> >>>>>> So, given a trusted set of mount options (say unlocked from TPM), we >>>>>> have a fully verified filesystem tree mounted, with opportunistic >>>>>> finegrained sharing of identical files. >>>>>> >>>>>> So, why do we want this? There are two initial users. First of all we >>>>>> want to use the opportunistic sharing for the podman container image >>>>>> baselayer. The idea is to use a composefs mount as the lower directory >>>>>> in an overlay mount, with the upper directory being the container work >>>>>> dir. This will allow automatical file-level disk and page-cache >>>>>> sharning between any two images, independent of details like the >>>>>> permissions and timestamps of the files. >>>>>> >>>>>> Secondly we are interested in using the verification aspects of >>>>>> composefs in the ostree project. Ostree already supports a >>>>>> content-addressed object store, but it is currently referenced by >>>>>> hardlink farms. The object store and the trees that reference it are >>>>>> signed and verified at download time, but there is no runtime >>>>>> verification. If we replace the hardlink farm with a composefs image >>>>>> that points into the existing object store we can use the verification >>>>>> to implement runtime verification. >>>>>> >>>>>> In fact, the tooling to create composefs images is 100% reproducible, >>>>>> so all we need is to add the composefs image fs-verity digest into the >>>>>> ostree commit. 
Then the image can be reconstructed from the ostree >>>>>> commit info, generating a file with the same fs-verity digest. >>>>>> >>>>>> These are the usecases we're currently interested in, but there seems >>>>>> to be a breadth of other possible uses. For example, many systems use >>>>>> loopback mounts for images (like lxc or snap), and these could take >>>>>> advantage of the opportunistic sharing. We've also talked about using >>>>>> fuse to implement a local cache for the backing files. I.e. you would >>>>>> have the second basedir be a fuse filesystem. On lookup failure in the >>>>>> first basedir it downloads the file and saves it in the first basedir >>>>>> for later lookups. There are many interesting possibilities here. >>>>>> >>>>>> The patch series contains some documentation on the file format and >>>>>> how to use the filesystem. >>>>>> >>>>>> The userspace tools (and a standalone kernel module) is available >>>>>> here: >>>>>> https://github.com/containers/composefs >>>>>> >>>>>> Initial work on ostree integration is here: >>>>>> https://github.com/ostreedev/ostree/pull/2640 >>>>>> >>>>>> Changes since v2: >>>>>> - Simplified filesystem format to use fixed size inodes. This resulted >>>>>> in simpler (now < 2k lines) code as well as higher performance at >>>>>> the cost of slightly (~40%) larger images. >>>>>> - We now use multi-page mappings from the page cache, which removes >>>>>> limits on sizes of xattrs and makes the dirent handling code simpler. >>>>>> - Added more documentation about the on-disk file format. >>>>>> - General cleanups based on review comments. >>>>>> >>>>> >>>>> Hi Alexander, >>>>> >>>>> I must say that I am a little bit puzzled by this v3. >>>>> Gao, Christian and myself asked you questions on v2 >>>>> that are not mentioned in v3 at all. >>>>> >>>>> To sum it up, please do not propose composefs without explaining >>>>> what are the barriers for achieving the exact same outcome with >>>>> the use of a read-only overlayfs with two lower layer - >>>>> uppermost with erofs containing the metadata files, which include >>>>> trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer >>>>> to the lowermost layer containing the content files. >>>> >>>> I think Dave explained quite well why using overlay is not comparable to >>>> what composefs does. >>>> >>> >>> Where? Can I get a link please? >> >> I am referring to this message: https://lore.kernel.org/lkml/20230118002242.GB937597@dread.disaster.area/ >> > > That is a good explanation why the current container runtime > overlay storage driver is inadequate, because the orchestration > requires untar of OCI tarball image before mounting overlayfs. > > It is not a kernel issue, it is a userspace issue, because userspace > does not utilize overlayfs driver features that are now 6 years > old (redirect_dir) and 4 years old (metacopy). > > I completely agree that reflink and hardlinks are not a viable solution > to ephemeral containers. > >>> If there are good reasons why composefs is superior to erofs+overlayfs >>> Please include them in the submission, since several developers keep >>> raising the same questions - that is all I ask. >>> >>>> One big difference is that overlay still requires at least a syscall for >>>> each file in the image, and then we need the equivalent of "rm -rf" to >>>> clean it up. It is somehow acceptable for long-running services, but it >>>> is not for "serverless" containers where images/containers are created >>>> and destroyed frequently. 
So even in the case we already have all the >>>> image files available locally, we still need to create a checkout with >>>> the final structure we need for the image. >>>> >>> >>> I think you did not understand my suggestion: >>> >>> overlay read-only mount: >>> layer 1: erofs mount of a precomposed image (same as mkcomposefs) >>> layer 2: any pre-existing fs path with /blocks repository >>> layer 3: any per-existing fs path with /blocks repository >>> ... >>> >>> The mkcomposefs flow is exactly the same in this suggestion >>> the upper layer image is created without any syscalls and >>> removed without any syscalls. >> >> mkcomposefs is supposed to be used server side, when the image is built. >> The clients that will mount the image don't have to create it (at least >> for images that will provide the manifest). >> >> So this is quite different as in the overlay model we must create the >> layout, that is the equivalent of the composefs manifest, on any node >> the image is pulled to. >> > > You don't need to re-create the erofs manifest on the client. > Unless I am completely missing something, the flow that I am > suggesting is drop-in replacement to what you have done. > > IIUC, you invented an on-disk format for composefs manifest. > Is there anything preventing you from using the existing > erofs on-disk format to pack the manifest file? > The files in the manifest would be inodes with no blocks, only > with size and attributes and overlay xattrs with references to > the real object blocks, same as you would do with mkcomposefs. > Is it not? Yes, some EROFS special images work as all regular files with empty data and some overlay "trusted" xattrs included as lower dir would be ok. > > Maybe what I am missing is how are the blob objects distributed? > Are they also shipped as composefs image bundles? > That can still be the case with erofs images that may contain both > blobs with data and metadata files referencing blobs in older images. Maybe just empty regular files in EROFS (or whatever else fs) with a magic "trusted.overlay.blablabla" xattr to point to the real file. > >>> Overlayfs already has the feature of redirecting from upper layer >>> to relative paths in lower layers. >> >> Could you please provide more information on how you would compose the >> overlay image first? >> >> From what I can see, it still requires at least one syscall for each >> file in the image to be created and these images are not portable to a >> different machine. > > Terminology nuance - you do not create an overlayfs image on the server > you create an erofs image on the server, exactly as you would create > a composefs image on the server. > > The shipped overlay "image" would then be the erofs image with > references to prereqisite images that contain the blobs and the digest > of the erofs image. > > # mount -t composefs rootfs.img -o basedir=objects /mnt > > client will do: > > # mount -t erofs rootfs.img -o digest=da.... /metadata > # mount -t overlay -o ro,metacopy=on,lowerdir=/metadata:/objects /mnt Currently maybe not even introduce "-o digest", just loop+dm-verity for such manifest is already ok. > >> >> Should we always make "/blocks" a whiteout to prevent it is leaked in >> the container? > > That would be the simplest option, yes. > If needed we can also make it a hidden layer whose objects > never appear in the namespace and can only be referenced > from an upper layer redirection. > >> >> And what prevents files under "/blocks" to be replaced with a different >> version? 
I think fs-verity on the EROFS image itself won't cover it. >> > > I think that part should be added to the overlayfs kernel driver. > We could enhance overlayfs to include optional "overlay.verity" digest > on the metacopy upper files to be fed into fsverity when opening lower > blob files that reside on an fsverity supported filesystem. Agreed, another overlayfs "trusted.overlay.verity" xattr in EROFS (or whatever else fs) for each empty regular file to do the same fsverity_get_digest() trick. That would have the same impact IMO. Thanks, Gao Xiang ... > > Thanks, > Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
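A userspace approximation of that per-file check (the fsverity_get_digest() comparison the kernel would do) can be sketched as follows; the paths are illustrative and the trusted.overlay.verity name and value format are still hypothetical at this point:

#!/bin/sh
# Sketch: compare the digest recorded in the metadata layer with the actual
# fs-verity measurement of the backing object before trusting it.
meta_file=/metadata/file_a
obj_file=/objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f

want=$(getfattr --only-values -n trusted.overlay.verity "$meta_file")
have=$(fsverity measure "$obj_file" | awk '{print $1}')

if [ "$want" != "$have" ]; then
    echo "verity mismatch for $obj_file, refusing to use it" >&2
    exit 1
fi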
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-20 19:44 ` [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem Amir Goldstein 2023-01-20 22:18 ` Giuseppe Scrivano @ 2023-01-23 17:56 ` Alexander Larsson 2023-01-23 23:59 ` Gao Xiang 2023-01-24 3:24 ` Amir Goldstein 1 sibling, 2 replies; 87+ messages in thread From: Alexander Larsson @ 2023-01-23 17:56 UTC (permalink / raw) To: Amir Goldstein Cc: linux-fsdevel, linux-kernel, gscrivan, david, brauner, viro, Vivek Goyal, Miklos Szeredi On Fri, 2023-01-20 at 21:44 +0200, Amir Goldstein wrote: > On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <alexl@redhat.com> > wrote: > > > > Giuseppe Scrivano and I have recently been working on a new project > > we > > call composefs. This is the first time we propose this publically > > and > > we would like some feedback on it. > > > > Hi Alexander, > > I must say that I am a little bit puzzled by this v3. > Gao, Christian and myself asked you questions on v2 > that are not mentioned in v3 at all. I got lots of good feedback from Dave Chinner on V2 that caused rather large changes to simplify the format. So I wanted the new version with those changes out to continue that review. I think also having that simplified version will be helpful for the general discussion. > To sum it up, please do not propose composefs without explaining > what are the barriers for achieving the exact same outcome with > the use of a read-only overlayfs with two lower layer - > uppermost with erofs containing the metadata files, which include > trusted.overlay.metacopy and trusted.overlay.redirect xattrs that > refer to the lowermost layer containing the content files. So, to be more precise, and so that everyone is on the same page, lemme state the two options in full. For both options, we have a directory "objects" with content-addressed backing files (i.e. files named by sha256). In this directory all files have fs-verity enabled. Additionally there is an image file which you downloaded to the system that somehow references the objects directory by relative filenames. Composefs option: The image file has fs-verity enabled. To use the image, you mount it with options "basedir=objects,digest=$imagedigest". Overlayfs option: The image file is a loopback image of a gpt disk with two partitions, one partition contains the dm-verity hashes, and the other contains some read-only filesystem. The read-only filesystem has regular versions of directories and symlinks, but for regular files it has sparse files with the xattrs "trusted.overlay.metacopy" and "trusted.overlay.redirect" set, the later containing a string like like "/de/adbeef..." referencing a backing file in the "objects" directory. In addition, the image also contains overlayfs whiteouts to cover any toplevel filenames from the objects directory that would otherwise appear if objects is used as a lower dir. To use this you loopback mount the file, and use dm-verity to set up the combined partitions, which you then mount somewhere. Then you mount an overlayfs with options: "metacopy=on,redirect_dir=follow,lowerdir=veritydev:objects" I would say both versions of this can work. There are some minor technical issues with the overlay option: * To get actual verification of the backing files you would need to add support to overlayfs for an "trusted.overlay.digest" xattrs, with behaviour similar to composefs. 
* mkfs.erofs doesn't support sparse files (not sure if the kernel code does), which means it is not a good option for the backing all these sparse files. Squashfs seems to support this though, so that is an option. However, the main issue I have with the overlayfs approach is that it is sort of clumsy and over-complex. Basically, the composefs approach is laser focused on read-only images, whereas the overlayfs approach just chains together technologies that happen to work, but also do a lot of other stuff. The result is that it is more work to use it, it uses more kernel objects (mounts, dm devices, loopbacks) and it has worse performance. To measure performance I created a largish image (2.6 GB centos9 rootfs) and mounted it via composefs, as well as overlay-over-squashfs, both backed by the same objects directory (on xfs). If I clear all caches between each run, a `ls -lR` run on composefs runs in around 700 msec: # hyperfine -i -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR cfs-mount" Benchmark 1: ls -lR cfs-mount Time (mean ± σ): 701.0 ms ± 21.9 ms [User: 153.6 ms, System: 373.3 ms] Range (min … max): 662.3 ms … 725.3 ms 10 runs Whereas same with overlayfs takes almost four times as long: # hyperfine -i -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR ovl-mount" Benchmark 1: ls -lR ovl-mount Time (mean ± σ): 2.738 s ± 0.029 s [User: 0.176 s, System: 1.688 s] Range (min … max): 2.699 s … 2.787 s 10 runs With page cache between runs the difference is smaller, but still there: # hyperfine "ls -lR cfs-mnt" Benchmark 1: ls -lR cfs-mnt Time (mean ± σ): 390.1 ms ± 3.7 ms [User: 140.9 ms, System: 247.1 ms] Range (min … max): 381.5 ms … 393.9 ms 10 runs vs # hyperfine -i "ls -lR ovl-mount" Benchmark 1: ls -lR ovl-mount Time (mean ± σ): 431.5 ms ± 1.2 ms [User: 124.3 ms, System: 296.9 ms] Range (min … max): 429.4 ms … 433.3 ms 10 runs This isn't all that strange, as overlayfs does a lot more work for each lookup, including multiple name lookups as well as several xattr lookups, whereas composefs just does a single lookup in a pre-computed table. But, given that we don't need any of the other features of overlayfs here, this performance loss seems rather unnecessary. I understand that there is a cost to adding more code, but efficiently supporting containers and other forms of read-only images is a pretty important usecase for Linux these days, and having something tailored for that seems pretty useful to me, even considering the code duplication. I also understand Cristians worry about stacking filesystem, having looked a bit more at the overlayfs code. But, since composefs doesn't really expose the metadata or vfs structure of the lower directories it is much simpler in a fundamental way. -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- =-=-= Alexander Larsson Red Hat, Inc alexl@redhat.com alexander.larsson@gmail.com He's a fast talking sweet-toothed farmboy who must take medication to keep him sane. She's a wealthy streetsmart magician's assistant who dreams of becoming Elvis. They fight crime! ^ permalink raw reply [flat|nested] 87+ messages in thread
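For readers who want to try the overlayfs variant described in the message above, the loopback + dm-verity chain would be assembled roughly like this on the client. The partition layout, device names and root-hash handling are illustrative; the root hash would come from 'veritysetup format' when the image is built.

#!/bin/sh
# Sketch: bring up the dm-verity protected metadata image, then overlay it.
root_hash="$1"                                  # from 'veritysetup format' at build time

dev=$(losetup -f --show -P large-image.img)     # -P scans the GPT partitions
veritysetup open "${dev}p1" verityimg "${dev}p2" "$root_hash"

mkdir -p /verity /mnt
mount -o ro /dev/mapper/verityimg /verity
mount -t overlay overlay \
    -o ro,metacopy=on,redirect_dir=follow,lowerdir=/verity:/objects /mnt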
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-23 17:56 ` Alexander Larsson @ 2023-01-23 23:59 ` Gao Xiang 2023-01-24 3:24 ` Amir Goldstein 1 sibling, 0 replies; 87+ messages in thread From: Gao Xiang @ 2023-01-23 23:59 UTC (permalink / raw) To: Alexander Larsson, Amir Goldstein Cc: linux-fsdevel, linux-kernel, gscrivan, david, brauner, viro, Vivek Goyal, Miklos Szeredi On 2023/1/24 01:56, Alexander Larsson wrote: > On Fri, 2023-01-20 at 21:44 +0200, Amir Goldstein wrote: >> On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <alexl@redhat.com> >> wrote: >>> >>> Giuseppe Scrivano and I have recently been working on a new project >>> we >>> call composefs. This is the first time we propose this publically >>> and >>> we would like some feedback on it. >>> >> >> Hi Alexander, >> >> I must say that I am a little bit puzzled by this v3. >> Gao, Christian and myself asked you questions on v2 >> that are not mentioned in v3 at all. > > I got lots of good feedback from Dave Chinner on V2 that caused rather > large changes to simplify the format. So I wanted the new version with > those changes out to continue that review. I think also having that > simplified version will be helpful for the general discussion. > >> To sum it up, please do not propose composefs without explaining >> what are the barriers for achieving the exact same outcome with >> the use of a read-only overlayfs with two lower layer - >> uppermost with erofs containing the metadata files, which include >> trusted.overlay.metacopy and trusted.overlay.redirect xattrs that >> refer to the lowermost layer containing the content files. > ... > > I would say both versions of this can work. There are some minor > technical issues with the overlay option: > > * To get actual verification of the backing files you would need to > add support to overlayfs for an "trusted.overlay.digest" xattrs, with > behaviour similar to composefs. > > * mkfs.erofs doesn't support sparse files (not sure if the kernel code > does), which means it is not a good option for the backing all these > sparse files. Squashfs seems to support this though, so that is an > option. EROFS support chunk-based files, you actually can use this feature to do sparse files if really needed. Currently Android use cases and OCI v1 both doesn't need this feature, but you can simply use ext4, I don't think squashfs here is a good option since it doesn't optimize anything about directory lookup. > > However, the main issue I have with the overlayfs approach is that it > is sort of clumsy and over-complex. Basically, the composefs approach > is laser focused on read-only images, whereas the overlayfs approach > just chains together technologies that happen to work, but also do a > lot of other stuff. The result is that it is more work to use it, it > uses more kernel objects (mounts, dm devices, loopbacks) and it has > worse performance. > > To measure performance I created a largish image (2.6 GB centos9 > rootfs) and mounted it via composefs, as well as overlay-over-squashfs, > both backed by the same objects directory (on xfs). 
> > If I clear all caches between each run, a `ls -lR` run on composefs > runs in around 700 msec: > > # hyperfine -i -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR cfs-mount" > Benchmark 1: ls -lR cfs-mount > Time (mean ± σ): 701.0 ms ± 21.9 ms [User: 153.6 ms, System: 373.3 ms] > Range (min … max): 662.3 ms … 725.3 ms 10 runs > > Whereas same with overlayfs takes almost four times as long: > > # hyperfine -i -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR ovl-mount" > Benchmark 1: ls -lR ovl-mount > Time (mean ± σ): 2.738 s ± 0.029 s [User: 0.176 s, System: 1.688 s] > Range (min … max): 2.699 s … 2.787 s 10 runs > > With page cache between runs the difference is smaller, but still > there: > > # hyperfine "ls -lR cfs-mnt" > Benchmark 1: ls -lR cfs-mnt > Time (mean ± σ): 390.1 ms ± 3.7 ms [User: 140.9 ms, System: 247.1 ms] > Range (min … max): 381.5 ms … 393.9 ms 10 runs > > vs > > # hyperfine -i "ls -lR ovl-mount" > Benchmark 1: ls -lR ovl-mount > Time (mean ± σ): 431.5 ms ± 1.2 ms [User: 124.3 ms, System: 296.9 ms] > Range (min … max): 429.4 ms … 433.3 ms 10 runs > > This isn't all that strange, as overlayfs does a lot more work for > each lookup, including multiple name lookups as well as several xattr > lookups, whereas composefs just does a single lookup in a pre-computed > table. But, given that we don't need any of the other features of > overlayfs here, this performance loss seems rather unnecessary. You should use ext4 to make a try first. Thanks, Gao Xiang ^ permalink raw reply [flat|nested] 87+ messages in thread
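For anyone rerunning the comparison with a different metadata filesystem, the candidate images can be rebuilt from the same tree roughly as below. This is only a sketch: meta-root/ stands for the prepared tree of sparse metadata files, the mkfs options should be checked against the local tool versions, and how faithfully each tool preserves trusted.* xattrs needs verifying.

#!/bin/sh
# Sketch: pack one metadata tree into each candidate read-only filesystem.
mksquashfs meta-root/ large.squashfs -noappend

mkfs.erofs large.erofs meta-root/

truncate -s 1G large.ext4
mkfs.ext4 -F -d meta-root/ large.ext4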
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-23 17:56 ` Alexander Larsson 2023-01-23 23:59 ` Gao Xiang @ 2023-01-24 3:24 ` Amir Goldstein 2023-01-24 13:10 ` Alexander Larsson 1 sibling, 1 reply; 87+ messages in thread From: Amir Goldstein @ 2023-01-24 3:24 UTC (permalink / raw) To: Alexander Larsson Cc: linux-fsdevel, linux-kernel, gscrivan, david, brauner, viro, Vivek Goyal, Miklos Szeredi On Mon, Jan 23, 2023 at 7:56 PM Alexander Larsson <alexl@redhat.com> wrote: > > On Fri, 2023-01-20 at 21:44 +0200, Amir Goldstein wrote: > > On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <alexl@redhat.com> > > wrote: > > > > > > Giuseppe Scrivano and I have recently been working on a new project > > > we > > > call composefs. This is the first time we propose this publically > > > and > > > we would like some feedback on it. > > > > > > > Hi Alexander, > > > > I must say that I am a little bit puzzled by this v3. > > Gao, Christian and myself asked you questions on v2 > > that are not mentioned in v3 at all. > > I got lots of good feedback from Dave Chinner on V2 that caused rather > large changes to simplify the format. So I wanted the new version with > those changes out to continue that review. I think also having that > simplified version will be helpful for the general discussion. > That's ok. I was not puzzled about why you posted v3. I was puzzled by why you did not mention anything about the alternatives to adding a new filesystem that were discussed on v2 and argue in favor of the new filesystem option. If you post another version, please make sure to include a good explanation for that. > > To sum it up, please do not propose composefs without explaining > > what are the barriers for achieving the exact same outcome with > > the use of a read-only overlayfs with two lower layer - > > uppermost with erofs containing the metadata files, which include > > trusted.overlay.metacopy and trusted.overlay.redirect xattrs that > > refer to the lowermost layer containing the content files. > > So, to be more precise, and so that everyone is on the same page, lemme > state the two options in full. > > For both options, we have a directory "objects" with content-addressed > backing files (i.e. files named by sha256). In this directory all > files have fs-verity enabled. Additionally there is an image file > which you downloaded to the system that somehow references the objects > directory by relative filenames. > > Composefs option: > > The image file has fs-verity enabled. To use the image, you mount it > with options "basedir=objects,digest=$imagedigest". > > Overlayfs option: > > The image file is a loopback image of a gpt disk with two partitions, > one partition contains the dm-verity hashes, and the other contains > some read-only filesystem. > > The read-only filesystem has regular versions of directories and > symlinks, but for regular files it has sparse files with the xattrs > "trusted.overlay.metacopy" and "trusted.overlay.redirect" set, the > later containing a string like like "/de/adbeef..." referencing a > backing file in the "objects" directory. In addition, the image also > contains overlayfs whiteouts to cover any toplevel filenames from the > objects directory that would otherwise appear if objects is used as > a lower dir. > > To use this you loopback mount the file, and use dm-verity to set up > the combined partitions, which you then mount somewhere. 
Then you > mount an overlayfs with options: > "metacopy=on,redirect_dir=follow,lowerdir=veritydev:objects" > > I would say both versions of this can work. There are some minor > technical issues with the overlay option: > > * To get actual verification of the backing files you would need to > add support to overlayfs for an "trusted.overlay.digest" xattrs, with > behaviour similar to composefs. > > * mkfs.erofs doesn't support sparse files (not sure if the kernel code > does), which means it is not a good option for the backing all these > sparse files. Squashfs seems to support this though, so that is an > option. > Fair enough. Wasn't expecting for things to work without any changes. Let's first agree that these alone are not a good enough reason to introduce a new filesystem. Let's move on.. > However, the main issue I have with the overlayfs approach is that it > is sort of clumsy and over-complex. Basically, the composefs approach > is laser focused on read-only images, whereas the overlayfs approach > just chains together technologies that happen to work, but also do a > lot of other stuff. The result is that it is more work to use it, it > uses more kernel objects (mounts, dm devices, loopbacks) and it has Up to this point, it's just hand waving, and a bit annoying if I am being honest. overlayfs+metacopy feature were created for the containers use case for very similar set of requirements - they do not just "happen to work" for the same use case. Please stick to technical arguments when arguing in favor of the new "laser focused" filesystem option. > worse performance. > > To measure performance I created a largish image (2.6 GB centos9 > rootfs) and mounted it via composefs, as well as overlay-over-squashfs, > both backed by the same objects directory (on xfs). > > If I clear all caches between each run, a `ls -lR` run on composefs > runs in around 700 msec: > > # hyperfine -i -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR cfs-mount" > Benchmark 1: ls -lR cfs-mount > Time (mean ± σ): 701.0 ms ± 21.9 ms [User: 153.6 ms, System: 373.3 ms] > Range (min … max): 662.3 ms … 725.3 ms 10 runs > > Whereas same with overlayfs takes almost four times as long: No it is not overlayfs, it is overlayfs+squashfs, please stick to facts. As Gao wrote, squashfs does not optimize directory lookup. You can run a test with ext4 for POC as Gao suggested. I am sure that mkfs.erofs sparse file support can be added if needed. > > # hyperfine -i -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR ovl-mount" > Benchmark 1: ls -lR ovl-mount > Time (mean ± σ): 2.738 s ± 0.029 s [User: 0.176 s, System: 1.688 s] > Range (min … max): 2.699 s … 2.787 s 10 runs > > With page cache between runs the difference is smaller, but still > there: It is the dentry cache that mostly matters for this test and please use hyerfine -w 1 to warmup dentry cache for correct measurement of warm cache lookup. I guess these test runs started with warm cache? but it wasn't mentioned explicitly. 
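Concretely, the cold-cache and warm-cache runs being asked for would look something like this (mount point name as in the runs above; the first drops caches before every timed run, the second does one untimed warmup run so dentries and inodes are already cached):

hyperfine -i -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR cfs-mnt"
hyperfine -i -w 1 "ls -lR cfs-mnt"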
> > # hyperfine "ls -lR cfs-mnt" > Benchmark 1: ls -lR cfs-mnt > Time (mean ± σ): 390.1 ms ± 3.7 ms [User: 140.9 ms, System: 247.1 ms] > Range (min … max): 381.5 ms … 393.9 ms 10 runs > > vs > > # hyperfine -i "ls -lR ovl-mount" > Benchmark 1: ls -lR ovl-mount > Time (mean ± σ): 431.5 ms ± 1.2 ms [User: 124.3 ms, System: 296.9 ms] > Range (min … max): 429.4 ms … 433.3 ms 10 runs > > This isn't all that strange, as overlayfs does a lot more work for > each lookup, including multiple name lookups as well as several xattr > lookups, whereas composefs just does a single lookup in a pre-computed Seriously, "multiple name lookups"? Overlayfs does exactly one lookup for anything but first level subdirs and for sparse files it does the exact same lookup in /objects as composefs. Enough with the hand waving please. Stick to hard facts. > table. But, given that we don't need any of the other features of > overlayfs here, this performance loss seems rather unnecessary. > > I understand that there is a cost to adding more code, but efficiently > supporting containers and other forms of read-only images is a pretty > important usecase for Linux these days, and having something tailored > for that seems pretty useful to me, even considering the code > duplication. > > > > I also understand Cristians worry about stacking filesystem, having > looked a bit more at the overlayfs code. But, since composefs doesn't > really expose the metadata or vfs structure of the lower directories it > is much simpler in a fundamental way. > I agree that composefs is simpler than overlayfs and that its security model is simpler, but this is not the relevant question. The question is what are the benefits to the prospect users of composefs that justify this new filesystem driver if overlayfs already implements the needed functionality. The only valid technical argument I could gather from your email is - 10% performance improvement in warm cache ls -lR on a 2.6 GB centos9 rootfs image compared to overlayfs+squashfs. I am not counting the cold cache results until we see results of a modern ro-image fs. Considering that most real life workloads include reading the data and that most of the time inodes and dentries are cached, IMO, the 10% ls -lR improvement is not a good enough reason for a new "laser focused" filesystem driver. Correct me if I am wrong, but isn't the use case of ephemeral containers require that composefs is layered under a writable tmpfs using overlayfs? If that is the case then the warm cache comparison is incorrect as well. To argue for the new filesystem you will need to compare ls -lR of overlay{tmpfs,composefs,xfs} vs. overlay{tmpfs,erofs,xfs} Alexander, On a more personal note, I know this discussion has been a bit stormy, but am not trying to fight you. I think that {mk,}composefs is a wonderful thing that will improve the life of many users. But mount -t composefs vs. mount -t overlayfs is insignificant to those users, so we just need to figure out based on facts and numbers, which is the best technical alternative. Thanks, Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
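The two stacked configurations that comparison asks for would be set up roughly as follows. This is a sketch only: directory and image names are illustrative, and the tmpfs upper/work directories model the writable layer of an ephemeral container.

#!/bin/sh
# Common writable layer for the ephemeral container case.
mount -t tmpfs tmpfs /ephemeral
mkdir -p /ephemeral/upper-a /ephemeral/work-a /ephemeral/upper-b /ephemeral/work-b
mkdir -p /composefs /metadata /ctr-a /ctr-b

# Variant A: overlay{tmpfs,composefs,xfs}
mount -t composefs rootfs.img -o basedir=objects /composefs
mount -t overlay overlay \
    -o lowerdir=/composefs,upperdir=/ephemeral/upper-a,workdir=/ephemeral/work-a /ctr-a

# Variant B: overlay{tmpfs,erofs,xfs}
mount -t erofs -o ro,loop rootfs-meta.erofs /metadata
mount -t overlay overlay \
    -o metacopy=on,redirect_dir=follow,lowerdir=/metadata:/objects,upperdir=/ephemeral/upper-b,workdir=/ephemeral/work-b /ctr-b

Timing "ls -lR" (and some data-reading workload) on /ctr-a vs /ctr-b would then compare like with like for this use case.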
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-24 3:24 ` Amir Goldstein @ 2023-01-24 13:10 ` Alexander Larsson 2023-01-24 14:40 ` Gao Xiang 2023-01-24 19:06 ` Amir Goldstein 0 siblings, 2 replies; 87+ messages in thread From: Alexander Larsson @ 2023-01-24 13:10 UTC (permalink / raw) To: Amir Goldstein Cc: linux-fsdevel, linux-kernel, gscrivan, david, brauner, viro, Vivek Goyal, Miklos Szeredi [-- Attachment #1: Type: text/plain, Size: 16697 bytes --] On Tue, 2023-01-24 at 05:24 +0200, Amir Goldstein wrote: > On Mon, Jan 23, 2023 at 7:56 PM Alexander Larsson <alexl@redhat.com> > wrote: > > > > On Fri, 2023-01-20 at 21:44 +0200, Amir Goldstein wrote: > > > On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson > > > <alexl@redhat.com> > > > wrote: > > > > > > > > Giuseppe Scrivano and I have recently been working on a new > > > > project > > > > we > > > > call composefs. This is the first time we propose this > > > > publically > > > > and > > > > we would like some feedback on it. > > > > > > > > > > Hi Alexander, > > > > > > I must say that I am a little bit puzzled by this v3. > > > Gao, Christian and myself asked you questions on v2 > > > that are not mentioned in v3 at all. > > > > I got lots of good feedback from Dave Chinner on V2 that caused > > rather > > large changes to simplify the format. So I wanted the new version > > with > > those changes out to continue that review. I think also having that > > simplified version will be helpful for the general discussion. > > > > That's ok. > I was not puzzled about why you posted v3. > I was puzzled by why you did not mention anything about the > alternatives to adding a new filesystem that were discussed on > v2 and argue in favor of the new filesystem option. > If you post another version, please make sure to include a good > explanation for that. Sure, I will add something to the next version. But like, there was already a discussion about this, duplicating that discussion in the v3 announcement when the v2->v3 changes are unrelated to it doesn't seem like it makes a ton of difference. > > > To sum it up, please do not propose composefs without explaining > > > what are the barriers for achieving the exact same outcome with > > > the use of a read-only overlayfs with two lower layer - > > > uppermost with erofs containing the metadata files, which include > > > trusted.overlay.metacopy and trusted.overlay.redirect xattrs that > > > refer to the lowermost layer containing the content files. > > > > So, to be more precise, and so that everyone is on the same page, > > lemme > > state the two options in full. > > > > For both options, we have a directory "objects" with content- > > addressed > > backing files (i.e. files named by sha256). In this directory all > > files have fs-verity enabled. Additionally there is an image file > > which you downloaded to the system that somehow references the > > objects > > directory by relative filenames. > > > > Composefs option: > > > > The image file has fs-verity enabled. To use the image, you mount > > it > > with options "basedir=objects,digest=$imagedigest". > > > > Overlayfs option: > > > > The image file is a loopback image of a gpt disk with two > > partitions, > > one partition contains the dm-verity hashes, and the other > > contains > > some read-only filesystem. 
> > > > The read-only filesystem has regular versions of directories and > > symlinks, but for regular files it has sparse files with the > > xattrs > > "trusted.overlay.metacopy" and "trusted.overlay.redirect" set, the > > later containing a string like like "/de/adbeef..." referencing a > > backing file in the "objects" directory. In addition, the image > > also > > contains overlayfs whiteouts to cover any toplevel filenames from > > the > > objects directory that would otherwise appear if objects is used > > as > > a lower dir. > > > > To use this you loopback mount the file, and use dm-verity to set > > up > > the combined partitions, which you then mount somewhere. Then you > > mount an overlayfs with options: > > "metacopy=on,redirect_dir=follow,lowerdir=veritydev:objects" > > > > I would say both versions of this can work. There are some minor > > technical issues with the overlay option: > > > > * To get actual verification of the backing files you would need to > > add support to overlayfs for an "trusted.overlay.digest" xattrs, > > with > > behaviour similar to composefs. > > > > * mkfs.erofs doesn't support sparse files (not sure if the kernel > > code > > does), which means it is not a good option for the backing all > > these > > sparse files. Squashfs seems to support this though, so that is an > > option. > > > > Fair enough. > Wasn't expecting for things to work without any changes. > Let's first agree that these alone are not a good enough reason to > introduce a new filesystem. > Let's move on.. Yeah. > > However, the main issue I have with the overlayfs approach is that > > it > > is sort of clumsy and over-complex. Basically, the composefs > > approach > > is laser focused on read-only images, whereas the overlayfs > > approach > > just chains together technologies that happen to work, but also do > > a > > lot of other stuff. The result is that it is more work to use it, > > it > > uses more kernel objects (mounts, dm devices, loopbacks) and it has > > Up to this point, it's just hand waving, and a bit annoying if I am > being honest. > overlayfs+metacopy feature were created for the containers use case > for very similar set of requirements - they do not just "happen to > work" > for the same use case. > Please stick to technical arguments when arguing in favor of the new > "laser focused" filesystem option. > > > worse performance. > > > > To measure performance I created a largish image (2.6 GB centos9 > > rootfs) and mounted it via composefs, as well as overlay-over- > > squashfs, > > both backed by the same objects directory (on xfs). > > > > If I clear all caches between each run, a `ls -lR` run on composefs > > runs in around 700 msec: > > > > # hyperfine -i -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR cfs- > > mount" > > Benchmark 1: ls -lR cfs-mount > > Time (mean ± σ): 701.0 ms ± 21.9 ms [User: 153.6 ms, > > System: 373.3 ms] > > Range (min … max): 662.3 ms … 725.3 ms 10 runs > > > > Whereas same with overlayfs takes almost four times as long: > > No it is not overlayfs, it is overlayfs+squashfs, please stick to > facts. > As Gao wrote, squashfs does not optimize directory lookup. > You can run a test with ext4 for POC as Gao suggested. > I am sure that mkfs.erofs sparse file support can be added if needed. New measurements follow, they now include also erofs over loopback, although that isn't strictly fair, because that image is much larger due to the fact that it didn't store the files sparsely. 
It also includes a version where the topmost lower is directly on the backing xfs (i.e. not via loopback). I attached the scripts used to create the images and do the profiling in case anyone wants to reproduce. Here are the results (on x86-64, xfs base fs): overlayfs + loopback squashfs - uncached Benchmark 1: ls -lR mnt-ovl Time (mean ± σ): 2.483 s ± 0.029 s [User: 0.167 s, System: 1.656 s] Range (min … max): 2.427 s … 2.530 s 10 runs overlayfs + loopback squashfs - cached Benchmark 1: ls -lR mnt-ovl Time (mean ± σ): 429.2 ms ± 4.6 ms [User: 123.6 ms, System: 295.0 ms] Range (min … max): 421.2 ms … 435.3 ms 10 runs overlayfs + loopback ext4 - uncached Benchmark 1: ls -lR mnt-ovl Time (mean ± σ): 4.332 s ± 0.060 s [User: 0.204 s, System: 3.150 s] Range (min … max): 4.261 s … 4.442 s 10 runs overlayfs + loopback ext4 - cached Benchmark 1: ls -lR mnt-ovl Time (mean ± σ): 528.3 ms ± 4.0 ms [User: 143.4 ms, System: 381.2 ms] Range (min … max): 521.1 ms … 536.4 ms 10 runs overlayfs + loopback erofs - uncached Benchmark 1: ls -lR mnt-ovl Time (mean ± σ): 3.045 s ± 0.127 s [User: 0.198 s, System: 1.129 s] Range (min … max): 2.926 s … 3.338 s 10 runs overlayfs + loopback erofs - cached Benchmark 1: ls -lR mnt-ovl Time (mean ± σ): 516.9 ms ± 5.7 ms [User: 139.4 ms, System: 374.0 ms] Range (min … max): 503.6 ms … 521.9 ms 10 runs overlayfs + direct - uncached Benchmark 1: ls -lR mnt-ovl Time (mean ± σ): 2.562 s ± 0.028 s [User: 0.199 s, System: 1.129 s] Range (min … max): 2.497 s … 2.585 s 10 runs overlayfs + direct - cached Benchmark 1: ls -lR mnt-ovl Time (mean ± σ): 524.5 ms ± 1.6 ms [User: 148.7 ms, System: 372.2 ms] Range (min … max): 522.8 ms … 527.8 ms 10 runs composefs - uncached Benchmark 1: ls -lR mnt-fs Time (mean ± σ): 681.4 ms ± 14.1 ms [User: 154.4 ms, System: 369.9 ms] Range (min … max): 652.5 ms … 703.2 ms 10 runs composefs - cached Benchmark 1: ls -lR mnt-fs Time (mean ± σ): 390.8 ms ± 4.7 ms [User: 144.7 ms, System: 243.7 ms] Range (min … max): 382.8 ms … 399.1 ms 10 runs For the uncached case, composefs is still almost four times faster than the fastest overlay combo (squashfs), and the non-squashfs versions are strictly slower. For the cached case the difference is less (10%) but with similar order of performance. For size comparison, here are the resulting images: 8.6M large.composefs 2.5G large.erofs 200M large.ext4 2.6M large.squashfs > > # hyperfine -i -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR ovl- > > mount" > > Benchmark 1: ls -lR ovl-mount > > Time (mean ± σ): 2.738 s ± 0.029 s [User: 0.176 s, > > System: 1.688 s] > > Range (min … max): 2.699 s … 2.787 s 10 runs > > > > With page cache between runs the difference is smaller, but still > > there: > > It is the dentry cache that mostly matters for this test and please > use hyerfine -w 1 to warmup dentry cache for correct measurement > of warm cache lookup. I'm not sure why the dentry cache case would be more important? Starting a new container will very often not have cached the image. To me the interesting case is for a new image, but with some existing page cache for the backing files directory. That seems to model staring a new image in an active container host, but its somewhat hard to test that case. > I guess these test runs started with warm cache? but it wasn't > mentioned explicitly. Yes, they were warm (because I ran the previous test before it). But, the new profile script explicitly adds -w 1. 
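For clarity, the "overlayfs + direct" variant in these numbers skips the loopback image entirely: the metacopy/redirect tree sits as a plain directory on the backing xfs and overlayfs points straight at it, roughly (directory names illustrative):

mount -t overlay overlay \
    -o ro,metacopy=on,redirect_dir=follow,lowerdir=/srv/meta-root:/srv/objects mnt-ovl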
> > # hyperfine "ls -lR cfs-mnt" > > Benchmark 1: ls -lR cfs-mnt > > Time (mean ± σ): 390.1 ms ± 3.7 ms [User: 140.9 ms, > > System: 247.1 ms] > > Range (min … max): 381.5 ms … 393.9 ms 10 runs > > > > vs > > > > # hyperfine -i "ls -lR ovl-mount" > > Benchmark 1: ls -lR ovl-mount > > Time (mean ± σ): 431.5 ms ± 1.2 ms [User: 124.3 ms, > > System: 296.9 ms] > > Range (min … max): 429.4 ms … 433.3 ms 10 runs > > > > This isn't all that strange, as overlayfs does a lot more work for > > each lookup, including multiple name lookups as well as several > > xattr > > lookups, whereas composefs just does a single lookup in a pre- > > computed > > Seriously, "multiple name lookups"? > Overlayfs does exactly one lookup for anything but first level > subdirs > and for sparse files it does the exact same lookup in /objects as > composefs. > Enough with the hand waving please. Stick to hard facts. With the discussed layout, in a stat() call on a regular file, ovl_lookup() will do lookups on both the sparse file and the backing file, whereas cfs_dir_lookup() will just map some page cache pages and do a binary search. Of course if you actually open the file, then cfs_open_file() would do the equivalent lookups in /objects. But that is often not what happens, for example in "ls -l". Additionally, these extra lookups will cause extra memory use, as you need dentries and inodes for the erofs/squashfs inodes in addition to the overlay inodes. > > table. But, given that we don't need any of the other features of > > overlayfs here, this performance loss seems rather unnecessary. > > > > I understand that there is a cost to adding more code, but > > efficiently > > supporting containers and other forms of read-only images is a > > pretty > > important usecase for Linux these days, and having something > > tailored > > for that seems pretty useful to me, even considering the code > > duplication. > > > > > > > > I also understand Cristians worry about stacking filesystem, having > > looked a bit more at the overlayfs code. But, since composefs > > doesn't > > really expose the metadata or vfs structure of the lower > > directories it > > is much simpler in a fundamental way. > > > > I agree that composefs is simpler than overlayfs and that its > security > model is simpler, but this is not the relevant question. > The question is what are the benefits to the prospect users of > composefs > that justify this new filesystem driver if overlayfs already > implements > the needed functionality. > > The only valid technical argument I could gather from your email is - > 10% performance improvement in warm cache ls -lR on a 2.6 GB > centos9 rootfs image compared to overlayfs+squashfs. > > I am not counting the cold cache results until we see results of > a modern ro-image fs. They are all strictly worse than squashfs in the above testing. > Considering that most real life workloads include reading the data > and that most of the time inodes and dentries are cached, IMO, > the 10% ls -lR improvement is not a good enough reason > for a new "laser focused" filesystem driver. > > Correct me if I am wrong, but isn't the use case of ephemeral > containers require that composefs is layered under a writable tmpfs > using overlayfs? > > If that is the case then the warm cache comparison is incorrect > as well. To argue for the new filesystem you will need to compare > ls -lR of overlay{tmpfs,composefs,xfs} vs. overlay{tmpfs,erofs,xfs} That very much depends. 
For the ostree rootfs usecase there would be no writable layer, and for containers I'm personally primarily interested in "--readonly" containers (i.e. without a writable layer) in my current automobile/embedded work. For many container cases, however, that is true, and no doubt that would make the overhead of overlayfs less of an issue. > Alexander, > > On a more personal note, I know this discussion has been a bit > stormy, but am not trying to fight you. I'm overall not getting a warm fuzzy feeling from this discussion. Getting weird complaints that I'm somehow "stealing" functions or weird "who did $foo first" arguments, for instance. You haven't personally attacked me like that, but some of your comments can feel rather pointy, especially in the context of a stormy thread like this. I'm just not used to kernel development workflows, so have patience with me if I do things wrong. > I think that {mk,}composefs is a wonderful thing that will improve > the life of many users. > But mount -t composefs vs. mount -t overlayfs is insignificant > to those users, so we just need to figure out based on facts > and numbers, which is the best technical alternative. In reality things are never as easy as one thing strictly being technically best. There is always a multitude of considerations. Is composefs technically better if it uses less memory and performs better for a particular usecase? Or is overlayfs technically better because it is useful for more usecases and already exists? A judgement needs to be made depending on things like complexity/maintainability of the new fs, ease of use, measured performance differences, relative importance of particular performance measurements, and importance of the specific usecase. It is my belief that the advantages of composefs outweigh the cost of the code duplication, but I understand the point of view of a maintainer of an existing codebase and that saying "no" is often the right thing. I will continue to try to argue for my point of view, but will try to make it as factual as possible. -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- =-=-= Alexander Larsson Red Hat, Inc alexl@redhat.com alexander.larsson@gmail.com He's a shy shark-wrestling librarian whom everyone believes is mad. She's an enchanted tempestuous stripper operating on the wrong side of the law. They fight crime! [-- Attachment #2: mkhack.sh --] [-- Type: application/x-shellscript, Size: 1208 bytes --] [-- Attachment #3: profile.sh --] [-- Type: application/x-shellscript, Size: 758 bytes --] ^ permalink raw reply [flat|nested] 87+ messages in thread
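The extra dentry/inode footprint mentioned above can be eyeballed from the slab caches after a tree walk, along the following lines. This is very coarse, the slab cache names are quoted from memory, and other activity on the system adds noise, so treat it only as a sketch.

#!/bin/sh
# Sketch: walk the stacked overlay mount and look at which inode slabs grew.
echo 2 > /proc/sys/vm/drop_caches          # free cached dentries and inodes
ls -lR mnt-ovl > /dev/null
grep -E '^(dentry|ovl_inode|erofs_inode|squashfs_inode_cache) ' /proc/slabinfo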
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-24 13:10 ` Alexander Larsson @ 2023-01-24 14:40 ` Gao Xiang 2023-01-24 19:06 ` Amir Goldstein 1 sibling, 0 replies; 87+ messages in thread From: Gao Xiang @ 2023-01-24 14:40 UTC (permalink / raw) To: Alexander Larsson, Amir Goldstein Cc: linux-fsdevel, linux-kernel, gscrivan, david, brauner, viro, Vivek Goyal, Miklos Szeredi On 2023/1/24 21:10, Alexander Larsson wrote: > On Tue, 2023-01-24 at 05:24 +0200, Amir Goldstein wrote: >> On Mon, Jan 23, 2023 at 7:56 PM Alexander Larsson <alexl@redhat.com> ... >> >> No it is not overlayfs, it is overlayfs+squashfs, please stick to >> facts. >> As Gao wrote, squashfs does not optimize directory lookup. >> You can run a test with ext4 for POC as Gao suggested. >> I am sure that mkfs.erofs sparse file support can be added if needed. > > New measurements follow, they now include also erofs over loopback, > although that isn't strictly fair, because that image is much larger > due to the fact that it didn't store the files sparsely. It also > includes a version where the topmost lower is directly on the backing > xfs (i.e. not via loopback). I attached the scripts used to create the > images and do the profiling in case anyone wants to reproduce. > > Here are the results (on x86-64, xfs base fs): > > overlayfs + loopback squashfs - uncached > Benchmark 1: ls -lR mnt-ovl > Time (mean ± σ): 2.483 s ± 0.029 s [User: 0.167 s, System: 1.656 s] > Range (min … max): 2.427 s … 2.530 s 10 runs > > overlayfs + loopback squashfs - cached > Benchmark 1: ls -lR mnt-ovl > Time (mean ± σ): 429.2 ms ± 4.6 ms [User: 123.6 ms, System: 295.0 ms] > Range (min … max): 421.2 ms … 435.3 ms 10 runs > > overlayfs + loopback ext4 - uncached > Benchmark 1: ls -lR mnt-ovl > Time (mean ± σ): 4.332 s ± 0.060 s [User: 0.204 s, System: 3.150 s] > Range (min … max): 4.261 s … 4.442 s 10 runs > > overlayfs + loopback ext4 - cached > Benchmark 1: ls -lR mnt-ovl > Time (mean ± σ): 528.3 ms ± 4.0 ms [User: 143.4 ms, System: 381.2 ms] > Range (min … max): 521.1 ms … 536.4 ms 10 runs > > overlayfs + loopback erofs - uncached > Benchmark 1: ls -lR mnt-ovl > Time (mean ± σ): 3.045 s ± 0.127 s [User: 0.198 s, System: 1.129 s] > Range (min … max): 2.926 s … 3.338 s 10 runs > > overlayfs + loopback erofs - cached > Benchmark 1: ls -lR mnt-ovl > Time (mean ± σ): 516.9 ms ± 5.7 ms [User: 139.4 ms, System: 374.0 ms] > Range (min … max): 503.6 ms … 521.9 ms 10 runs > > overlayfs + direct - uncached > Benchmark 1: ls -lR mnt-ovl > Time (mean ± σ): 2.562 s ± 0.028 s [User: 0.199 s, System: 1.129 s] > Range (min … max): 2.497 s … 2.585 s 10 runs > > overlayfs + direct - cached > Benchmark 1: ls -lR mnt-ovl > Time (mean ± σ): 524.5 ms ± 1.6 ms [User: 148.7 ms, System: 372.2 ms] > Range (min … max): 522.8 ms … 527.8 ms 10 runs > > composefs - uncached > Benchmark 1: ls -lR mnt-fs > Time (mean ± σ): 681.4 ms ± 14.1 ms [User: 154.4 ms, System: 369.9 ms] > Range (min … max): 652.5 ms … 703.2 ms 10 runs > > composefs - cached > Benchmark 1: ls -lR mnt-fs > Time (mean ± σ): 390.8 ms ± 4.7 ms [User: 144.7 ms, System: 243.7 ms] > Range (min … max): 382.8 ms … 399.1 ms 10 runs > > For the uncached case, composefs is still almost four times faster than > the fastest overlay combo (squashfs), and the non-squashfs versions are > strictly slower. For the cached case the difference is less (10%) but > with similar order of performance. 
> > For size comparison, here are the resulting images: > > 8.6M large.composefs > 2.5G large.erofs > 200M large.ext4 > 2.6M large.squashfs Ok, I have to say I'm a bit surprised by these results. Just a wild guess: `ls -lR` is a sequential access pattern, so compressed data (assuming you use compression) benefits from it. I can't point at a definite cause before looking into it more. EROFS is impacted since EROFS on-disk inodes are not arranged together with the current mkfs.erofs implementation (that is just a userspace implementation detail; if people really care about it, I will refine the implementation), and I will also implement such sparse files later so that the on-disk inodes won't be impacted either (I'm on vacation, but I will try my best). From the overall results, I honestly don't know where the main bottleneck is: maybe, just like what you said, it is the overlayfs overhead; or maybe it is a bottleneck in the loopback device. So it's much better to also show some results of "ls -lR" without overlayfs stacked on top. IMHO, Amir's main point is always [1] "w.r.t overlayfs, I am not even sure that anything needs to be modified in the driver. overlayfs already supports "metacopy" feature which means that an upper layer could be composed in a way that the file content would be read from an arbitrary path in lower fs, e.g. objects/cc/XXX. " I think there is nothing wrong with it (except for fsverity). From the results, such functionality can indeed already be achieved by overlayfs + some local fs with some user-space adaptation. And it was not mentioned in the RFC or in v2. So without the fs-verity requirement, your proposal is currently mainly resolving a performance issue of an existing in-kernel approach (except for unprivileged mounts). It would be much better to describe in the cover letter the original problem and why overlayfs + (a local fs or FUSE for metadata) doesn't meet the requirements. That would make much more sense than the current cover letter. Thanks, Gao Xiang [1] https://lore.kernel.org/r/CAOQ4uxh34udueT-+Toef6TmTtyLjFUnSJs=882DH=HxADX8pKw@mail.gmail.com/ ^ permalink raw reply [flat|nested] 87+ messages in thread
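As a rough sketch of the measurement asked for here, the image filesystems could be benchmarked without overlayfs stacked on top, reusing the images from the size comparison and the drop_caches/hyperfine method used elsewhere in this thread (the mount point is illustrative):

# mount -t erofs -o ro,loop large.erofs mnt-erofs-only
# hyperfine -i -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR mnt-erofs-only"
# hyperfine -w 1 "ls -lR mnt-erofs-only"

The same two runs against loopback mounts of large.squashfs and large.ext4 would show how much of the gap is due to overlayfs itself and how much to the underlying image filesystem.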
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-24 13:10 ` Alexander Larsson 2023-01-24 14:40 ` Gao Xiang @ 2023-01-24 19:06 ` Amir Goldstein 2023-01-25 4:18 ` Dave Chinner 2023-01-25 9:37 ` Alexander Larsson 1 sibling, 2 replies; 87+ messages in thread From: Amir Goldstein @ 2023-01-24 19:06 UTC (permalink / raw) To: Alexander Larsson Cc: linux-fsdevel, linux-kernel, gscrivan, david, brauner, viro, Vivek Goyal, Miklos Szeredi On Tue, Jan 24, 2023 at 3:13 PM Alexander Larsson <alexl@redhat.com> wrote: > > On Tue, 2023-01-24 at 05:24 +0200, Amir Goldstein wrote: > > On Mon, Jan 23, 2023 at 7:56 PM Alexander Larsson <alexl@redhat.com> > > wrote: > > > > > > On Fri, 2023-01-20 at 21:44 +0200, Amir Goldstein wrote: > > > > On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson > > > > <alexl@redhat.com> > > > > wrote: > > > > > > > > > > Giuseppe Scrivano and I have recently been working on a new > > > > > project > > > > > we > > > > > call composefs. This is the first time we propose this > > > > > publically > > > > > and > > > > > we would like some feedback on it. > > > > > > > > > > > > > Hi Alexander, > > > > > > > > I must say that I am a little bit puzzled by this v3. > > > > Gao, Christian and myself asked you questions on v2 > > > > that are not mentioned in v3 at all. > > > > > > I got lots of good feedback from Dave Chinner on V2 that caused > > > rather > > > large changes to simplify the format. So I wanted the new version > > > with > > > those changes out to continue that review. I think also having that > > > simplified version will be helpful for the general discussion. > > > > > > > That's ok. > > I was not puzzled about why you posted v3. > > I was puzzled by why you did not mention anything about the > > alternatives to adding a new filesystem that were discussed on > > v2 and argue in favor of the new filesystem option. > > If you post another version, please make sure to include a good > > explanation for that. > > Sure, I will add something to the next version. But like, there was > already a discussion about this, duplicating that discussion in the v3 > announcement when the v2->v3 changes are unrelated to it doesn't seem > like it makes a ton of difference. > > > > > To sum it up, please do not propose composefs without explaining > > > > what are the barriers for achieving the exact same outcome with > > > > the use of a read-only overlayfs with two lower layer - > > > > uppermost with erofs containing the metadata files, which include > > > > trusted.overlay.metacopy and trusted.overlay.redirect xattrs that > > > > refer to the lowermost layer containing the content files. > > > > > > So, to be more precise, and so that everyone is on the same page, > > > lemme > > > state the two options in full. > > > > > > For both options, we have a directory "objects" with content- > > > addressed > > > backing files (i.e. files named by sha256). In this directory all > > > files have fs-verity enabled. Additionally there is an image file > > > which you downloaded to the system that somehow references the > > > objects > > > directory by relative filenames. > > > > > > Composefs option: > > > > > > The image file has fs-verity enabled. To use the image, you mount > > > it > > > with options "basedir=objects,digest=$imagedigest". 
> > > > > > Overlayfs option: > > > > > > The image file is a loopback image of a gpt disk with two > > > partitions, > > > one partition contains the dm-verity hashes, and the other > > > contains > > > some read-only filesystem. > > > > > > The read-only filesystem has regular versions of directories and > > > symlinks, but for regular files it has sparse files with the > > > xattrs > > > "trusted.overlay.metacopy" and "trusted.overlay.redirect" set, the > > > later containing a string like like "/de/adbeef..." referencing a > > > backing file in the "objects" directory. In addition, the image > > > also > > > contains overlayfs whiteouts to cover any toplevel filenames from > > > the > > > objects directory that would otherwise appear if objects is used > > > as > > > a lower dir. > > > > > > To use this you loopback mount the file, and use dm-verity to set > > > up > > > the combined partitions, which you then mount somewhere. Then you > > > mount an overlayfs with options: > > > "metacopy=on,redirect_dir=follow,lowerdir=veritydev:objects" > > > > > > I would say both versions of this can work. There are some minor > > > technical issues with the overlay option: > > > > > > * To get actual verification of the backing files you would need to > > > add support to overlayfs for an "trusted.overlay.digest" xattrs, > > > with > > > behaviour similar to composefs. > > > > > > * mkfs.erofs doesn't support sparse files (not sure if the kernel > > > code > > > does), which means it is not a good option for the backing all > > > these > > > sparse files. Squashfs seems to support this though, so that is an > > > option. > > > > > > > Fair enough. > > Wasn't expecting for things to work without any changes. > > Let's first agree that these alone are not a good enough reason to > > introduce a new filesystem. > > Let's move on.. > > Yeah. > > > > However, the main issue I have with the overlayfs approach is that > > > it > > > is sort of clumsy and over-complex. Basically, the composefs > > > approach > > > is laser focused on read-only images, whereas the overlayfs > > > approach > > > just chains together technologies that happen to work, but also do > > > a > > > lot of other stuff. The result is that it is more work to use it, > > > it > > > uses more kernel objects (mounts, dm devices, loopbacks) and it has > > > > Up to this point, it's just hand waving, and a bit annoying if I am > > being honest. > > overlayfs+metacopy feature were created for the containers use case > > for very similar set of requirements - they do not just "happen to > > work" > > for the same use case. > > Please stick to technical arguments when arguing in favor of the new > > "laser focused" filesystem option. > > > > > worse performance. > > > > > > To measure performance I created a largish image (2.6 GB centos9 > > > rootfs) and mounted it via composefs, as well as overlay-over- > > > squashfs, > > > both backed by the same objects directory (on xfs). > > > > > > If I clear all caches between each run, a `ls -lR` run on composefs > > > runs in around 700 msec: > > > > > > # hyperfine -i -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR cfs- > > > mount" > > > Benchmark 1: ls -lR cfs-mount > > > Time (mean ± σ): 701.0 ms ± 21.9 ms [User: 153.6 ms, > > > System: 373.3 ms] > > > Range (min … max): 662.3 ms … 725.3 ms 10 runs > > > > > > Whereas same with overlayfs takes almost four times as long: > > > > No it is not overlayfs, it is overlayfs+squashfs, please stick to > > facts. 
> > As Gao wrote, squashfs does not optimize directory lookup. > > You can run a test with ext4 for POC as Gao suggested. > > I am sure that mkfs.erofs sparse file support can be added if needed. > > New measurements follow, they now include also erofs over loopback, > although that isn't strictly fair, because that image is much larger > due to the fact that it didn't store the files sparsely. It also > includes a version where the topmost lower is directly on the backing > xfs (i.e. not via loopback). I attached the scripts used to create the > images and do the profiling in case anyone wants to reproduce. > > Here are the results (on x86-64, xfs base fs): > > overlayfs + loopback squashfs - uncached > Benchmark 1: ls -lR mnt-ovl > Time (mean ± σ): 2.483 s ± 0.029 s [User: 0.167 s, System: 1.656 s] > Range (min … max): 2.427 s … 2.530 s 10 runs > > overlayfs + loopback squashfs - cached > Benchmark 1: ls -lR mnt-ovl > Time (mean ± σ): 429.2 ms ± 4.6 ms [User: 123.6 ms, System: 295.0 ms] > Range (min … max): 421.2 ms … 435.3 ms 10 runs > > overlayfs + loopback ext4 - uncached > Benchmark 1: ls -lR mnt-ovl > Time (mean ± σ): 4.332 s ± 0.060 s [User: 0.204 s, System: 3.150 s] > Range (min … max): 4.261 s … 4.442 s 10 runs > > overlayfs + loopback ext4 - cached > Benchmark 1: ls -lR mnt-ovl > Time (mean ± σ): 528.3 ms ± 4.0 ms [User: 143.4 ms, System: 381.2 ms] > Range (min … max): 521.1 ms … 536.4 ms 10 runs > > overlayfs + loopback erofs - uncached > Benchmark 1: ls -lR mnt-ovl > Time (mean ± σ): 3.045 s ± 0.127 s [User: 0.198 s, System: 1.129 s] > Range (min … max): 2.926 s … 3.338 s 10 runs > > overlayfs + loopback erofs - cached > Benchmark 1: ls -lR mnt-ovl > Time (mean ± σ): 516.9 ms ± 5.7 ms [User: 139.4 ms, System: 374.0 ms] > Range (min … max): 503.6 ms … 521.9 ms 10 runs > > overlayfs + direct - uncached > Benchmark 1: ls -lR mnt-ovl > Time (mean ± σ): 2.562 s ± 0.028 s [User: 0.199 s, System: 1.129 s] > Range (min … max): 2.497 s … 2.585 s 10 runs > > overlayfs + direct - cached > Benchmark 1: ls -lR mnt-ovl > Time (mean ± σ): 524.5 ms ± 1.6 ms [User: 148.7 ms, System: 372.2 ms] > Range (min … max): 522.8 ms … 527.8 ms 10 runs > > composefs - uncached > Benchmark 1: ls -lR mnt-fs > Time (mean ± σ): 681.4 ms ± 14.1 ms [User: 154.4 ms, System: 369.9 ms] > Range (min … max): 652.5 ms … 703.2 ms 10 runs > > composefs - cached > Benchmark 1: ls -lR mnt-fs > Time (mean ± σ): 390.8 ms ± 4.7 ms [User: 144.7 ms, System: 243.7 ms] > Range (min … max): 382.8 ms … 399.1 ms 10 runs > > For the uncached case, composefs is still almost four times faster than > the fastest overlay combo (squashfs), and the non-squashfs versions are > strictly slower. For the cached case the difference is less (10%) but > with similar order of performance. > > For size comparison, here are the resulting images: > > 8.6M large.composefs > 2.5G large.erofs > 200M large.ext4 > 2.6M large.squashfs > Nice. Clearly, mkfs.ext4 and mkfs.erofs are not optimized for space. Note that Android has make_ext4fs which can create a compact ro ext4 image without a journal. 
Found this project that builds it outside of Android, but did not test: https://github.com/iglunix/make_ext4fs > > > # hyperfine -i -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR ovl- > > > mount" > > > Benchmark 1: ls -lR ovl-mount > > > Time (mean ± σ): 2.738 s ± 0.029 s [User: 0.176 s, > > > System: 1.688 s] > > > Range (min … max): 2.699 s … 2.787 s 10 runs > > > > > > With page cache between runs the difference is smaller, but still > > > there: > > > > It is the dentry cache that mostly matters for this test and please > > use hyerfine -w 1 to warmup dentry cache for correct measurement > > of warm cache lookup. > > I'm not sure why the dentry cache case would be more important? > Starting a new container will very often not have cached the image. > > To me the interesting case is for a new image, but with some existing > page cache for the backing files directory. That seems to model staring > a new image in an active container host, but its somewhat hard to test > that case. > ok, you can argue that faster cold cache ls -lR is important for starting new images. I think you will be asked to show a real life container use case where that benchmark really matters. > > I guess these test runs started with warm cache? but it wasn't > > mentioned explicitly. > > Yes, they were warm (because I ran the previous test before it). But, > the new profile script explicitly adds -w 1. > > > > # hyperfine "ls -lR cfs-mnt" > > > Benchmark 1: ls -lR cfs-mnt > > > Time (mean ± σ): 390.1 ms ± 3.7 ms [User: 140.9 ms, > > > System: 247.1 ms] > > > Range (min … max): 381.5 ms … 393.9 ms 10 runs > > > > > > vs > > > > > > # hyperfine -i "ls -lR ovl-mount" > > > Benchmark 1: ls -lR ovl-mount > > > Time (mean ± σ): 431.5 ms ± 1.2 ms [User: 124.3 ms, > > > System: 296.9 ms] > > > Range (min … max): 429.4 ms … 433.3 ms 10 runs > > > > > > This isn't all that strange, as overlayfs does a lot more work for > > > each lookup, including multiple name lookups as well as several > > > xattr > > > lookups, whereas composefs just does a single lookup in a pre- > > > computed > > > > Seriously, "multiple name lookups"? > > Overlayfs does exactly one lookup for anything but first level > > subdirs > > and for sparse files it does the exact same lookup in /objects as > > composefs. > > Enough with the hand waving please. Stick to hard facts. > > With the discussed layout, in a stat() call on a regular file, > ovl_lookup() will do lookups on both the sparse file and the backing > file, whereas cfs_dir_lookup() will just map some page cache pages and > do a binary search. > > Of course if you actually open the file, then cfs_open_file() would do > the equivalent lookups in /objects. But that is often not what happens, > for example in "ls -l". > > Additionally, these extra lookups will cause extra memory use, as you > need dentries and inodes for the erofs/squashfs inodes in addition to > the overlay inodes. > I see. composefs is really very optimized for ls -lR. Now only need to figure out if real users start a container and do ls -lR without reading many files is a real life use case. > > > table. But, given that we don't need any of the other features of > > > overlayfs here, this performance loss seems rather unnecessary. 
> > > > > > I understand that there is a cost to adding more code, but > > > efficiently > > > supporting containers and other forms of read-only images is a > > > pretty > > > important usecase for Linux these days, and having something > > > tailored > > > for that seems pretty useful to me, even considering the code > > > duplication. > > > > > > > > > > > > I also understand Cristians worry about stacking filesystem, having > > > looked a bit more at the overlayfs code. But, since composefs > > > doesn't > > > really expose the metadata or vfs structure of the lower > > > directories it > > > is much simpler in a fundamental way. > > > > > > > I agree that composefs is simpler than overlayfs and that its > > security > > model is simpler, but this is not the relevant question. > > The question is what are the benefits to the prospect users of > > composefs > > that justify this new filesystem driver if overlayfs already > > implements > > the needed functionality. > > > > The only valid technical argument I could gather from your email is - > > 10% performance improvement in warm cache ls -lR on a 2.6 GB > > centos9 rootfs image compared to overlayfs+squashfs. > > > > I am not counting the cold cache results until we see results of > > a modern ro-image fs. > > They are all strictly worse than squashfs in the above testing. > It's interesting to know why and if an optimized mkfs.erofs mkfs.ext4 would have done any improvement. > > Considering that most real life workloads include reading the data > > and that most of the time inodes and dentries are cached, IMO, > > the 10% ls -lR improvement is not a good enough reason > > for a new "laser focused" filesystem driver. > > > > Correct me if I am wrong, but isn't the use case of ephemeral > > containers require that composefs is layered under a writable tmpfs > > using overlayfs? > > > > If that is the case then the warm cache comparison is incorrect > > as well. To argue for the new filesystem you will need to compare > > ls -lR of overlay{tmpfs,composefs,xfs} vs. overlay{tmpfs,erofs,xfs} > > That very much depends. For the ostree rootfs uscase there would be no > writable layer, and for containers I'm personally primarily interested > in "--readonly" containers (i.e. without an writable layer) in my > current automobile/embedded work. For many container cases however, > that is true, and no doubt that would make the overhead of overlayfs > less of a issue. > > > Alexander, > > > > On a more personal note, I know this discussion has been a bit > > stormy, but am not trying to fight you. > > I'm overall not getting a warm fuzzy feeling from this discussion. > Getting weird complaints that I'm somehow "stealing" functions or weird > "who did $foo first" arguments for instance. You haven't personally > attacked me like that, but some of your comments can feel rather > pointy, especially in the context of a stormy thread like this. I'm > just not used to kernel development workflows, so have patience with me > if I do things wrong. > Fair enough. As long as the things that we discussed are duly mentioned in future posts, I'll do my best to be less pointy. > > I think that {mk,}composefs is a wonderful thing that will improve > > the life of many users. > > But mount -t composefs vs. mount -t overlayfs is insignificant > > to those users, so we just need to figure out based on facts > > and numbers, which is the best technical alternative. > > In reality things are never as easy as one thing strictly being > technically best. 
There is always a multitude of considerations. Is > composefs technically better if it uses less memory and performs better > for a particular usecase? Or is overlayfs technically better because it > is useful for more usecases and already exists? A judgement needs to be > made depending on things like complexity/maintainability of the new fs, > ease of use, measured performance differences, relative importance of > particular performance measurements, and importance of the specific > usecase. > > It is my belief that the advantages of composefs outweight the cost of > the code duplication, but I understand the point of view of a > maintainer of an existing codebase and that saying "no" is often the > right thing. I will continue to try to argue for my point of view, but > will try to make it as factual as possible. > Improving overlayfs and erofs has additional advantages - improving performance and size of erofs image may benefit many other users regardless of the ephemeral containers use case, so indeed, there are many aspects to consider. Thanks, Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
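To make the overlayfs option described above more concrete, here is a rough sketch of how one regular file in the metadata layer could be assembled by hand before packing the tree into a read-only image; the file name and redirect target are illustrative, $SIZE stands for the apparent size of the backing file, and the whiteouts mentioned earlier are omitted:

# truncate -s "$SIZE" meta/file_a
# setfattr -n trusted.overlay.metacopy meta/file_a
# setfattr -n trusted.overlay.redirect -v "/de/adbeef..." meta/file_a

The resulting tree, packed into an erofs or squashfs image and protected with dm-verity, would then be mounted together with the objects directory roughly as:

# mount -t overlay overlay -o ro,metacopy=on,redirect_dir=follow,lowerdir=meta:objects /mnt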
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-24 19:06 ` Amir Goldstein @ 2023-01-25 4:18 ` Dave Chinner 2023-01-25 8:32 ` Amir Goldstein 2023-01-25 9:37 ` Alexander Larsson 1 sibling, 1 reply; 87+ messages in thread From: Dave Chinner @ 2023-01-25 4:18 UTC (permalink / raw) To: Amir Goldstein Cc: Alexander Larsson, linux-fsdevel, linux-kernel, gscrivan, brauner, viro, Vivek Goyal, Miklos Szeredi On Tue, Jan 24, 2023 at 09:06:13PM +0200, Amir Goldstein wrote: > On Tue, Jan 24, 2023 at 3:13 PM Alexander Larsson <alexl@redhat.com> wrote: > > On Tue, 2023-01-24 at 05:24 +0200, Amir Goldstein wrote: > > > On Mon, Jan 23, 2023 at 7:56 PM Alexander Larsson <alexl@redhat.com> > > > wrote: > > > > On Fri, 2023-01-20 at 21:44 +0200, Amir Goldstein wrote: > > > > > On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson > > > > > <alexl@redhat.com> > > > > > wrote: > > I'm not sure why the dentry cache case would be more important? > > Starting a new container will very often not have cached the image. > > > > To me the interesting case is for a new image, but with some existing > > page cache for the backing files directory. That seems to model staring > > a new image in an active container host, but its somewhat hard to test > > that case. > > > > ok, you can argue that faster cold cache ls -lR is important > for starting new images. > I think you will be asked to show a real life container use case where > that benchmark really matters. I've already described the real world production system bottlenecks that composefs is designed to overcome in a previous thread. Please go back an read this: https://lore.kernel.org/linux-fsdevel/20230118002242.GB937597@dread.disaster.area/ Cold cache performance dominates the runtime of short lived containers as well as high density container hosts being run to their container level memory limits. `ls -lR` is just a microbenchmark that demonstrates how much better composefs cold cache behaviour is than the alternatives being proposed.... This might also help explain why my initial review comments focussed on getting rid of optional format features, straight lining the processing, changing the format or search algorithms so more sequential cacheline accesses occurred resulting in less memory stalls, etc. i.e. reductions in cold cache lookup overhead will directly translate into faster container workload spin up. > > > > This isn't all that strange, as overlayfs does a lot more work for > > > > each lookup, including multiple name lookups as well as several > > > > xattr > > > > lookups, whereas composefs just does a single lookup in a pre- > > > > computed > > > > > > Seriously, "multiple name lookups"? > > > Overlayfs does exactly one lookup for anything but first level > > > subdirs > > > and for sparse files it does the exact same lookup in /objects as > > > composefs. > > > Enough with the hand waving please. Stick to hard facts. > > > > With the discussed layout, in a stat() call on a regular file, > > ovl_lookup() will do lookups on both the sparse file and the backing > > file, whereas cfs_dir_lookup() will just map some page cache pages and > > do a binary search. > > > > Of course if you actually open the file, then cfs_open_file() would do > > the equivalent lookups in /objects. But that is often not what happens, > > for example in "ls -l". 
> > > > Additionally, these extra lookups will cause extra memory use, as you > > need dentries and inodes for the erofs/squashfs inodes in addition to > > the overlay inodes. > > I see. composefs is really very optimized for ls -lR. No, composefs is optimised for minimal namespace and inode resolution overhead. 'ls -lR' does a lot of these operations, and therefore you see the efficiency of the design being directly exposed.... > Now only need to figure out if real users start a container and do ls -lR > without reading many files is a real life use case. I've been using 'ls -lR' and 'find . -ctime 1' to benchmark cold cache directory iteration and inode lookup performance for roughly 20 years. The benchmarks I run *never* read file data, nor is that desired - they are pure directory and inode lookup micro-benchmarks used to analyse VFS and filesystem directory and inode lookup performance. I have been presenting such measurements and patches improving performance of these micro-benchmarks to the XFS and fsdevel lists for over 15 years and I have *never* had to justify that what I'm measuring is a "real world workload" to anyone. Ever. Complaining about the real world relevancy of the presented benchmark might be considered applying a double standard, wouldn't you agree? -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 87+ messages in thread
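The cold cache micro-benchmarks referred to here boil down to something like the following, assuming /mnt is the image mount under test; note that neither command reads any file data:

# echo 3 > /proc/sys/vm/drop_caches
# time ls -lR /mnt > /dev/null
# echo 3 > /proc/sys/vm/drop_caches
# time find /mnt -ctime 1 > /dev/null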
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-25 4:18 ` Dave Chinner @ 2023-01-25 8:32 ` Amir Goldstein 2023-01-25 10:08 ` Alexander Larsson 2023-01-25 10:39 ` Giuseppe Scrivano 0 siblings, 2 replies; 87+ messages in thread From: Amir Goldstein @ 2023-01-25 8:32 UTC (permalink / raw) To: Dave Chinner Cc: Alexander Larsson, linux-fsdevel, linux-kernel, gscrivan, brauner, viro, Vivek Goyal, Miklos Szeredi On Wed, Jan 25, 2023 at 6:18 AM Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Jan 24, 2023 at 09:06:13PM +0200, Amir Goldstein wrote: > > On Tue, Jan 24, 2023 at 3:13 PM Alexander Larsson <alexl@redhat.com> wrote: > > > On Tue, 2023-01-24 at 05:24 +0200, Amir Goldstein wrote: > > > > On Mon, Jan 23, 2023 at 7:56 PM Alexander Larsson <alexl@redhat.com> > > > > wrote: > > > > > On Fri, 2023-01-20 at 21:44 +0200, Amir Goldstein wrote: > > > > > > On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson > > > > > > <alexl@redhat.com> > > > > > > wrote: > > > I'm not sure why the dentry cache case would be more important? > > > Starting a new container will very often not have cached the image. > > > > > > To me the interesting case is for a new image, but with some existing > > > page cache for the backing files directory. That seems to model staring > > > a new image in an active container host, but its somewhat hard to test > > > that case. > > > > > > > ok, you can argue that faster cold cache ls -lR is important > > for starting new images. > > I think you will be asked to show a real life container use case where > > that benchmark really matters. > > I've already described the real world production system bottlenecks > that composefs is designed to overcome in a previous thread. > > Please go back an read this: > > https://lore.kernel.org/linux-fsdevel/20230118002242.GB937597@dread.disaster.area/ > I've read it and now re-read it. Most of the post talks about the excess time of creating the namespace, which is addressed by erofs+overlayfs. I guess you mean this requirement: "When you have container instances that might only be needed for a few seconds, taking half a minute to set up the container instance and then another half a minute to tear it down just isn't viable - we need instantiation and teardown times in the order of a second or two." Forgive for not being part of the containers world, so I have to ask - Which real life use case requires instantiation and teardown times in the order of a second? What is the order of number of files in the manifest of those ephemeral images? The benchmark was done on a 2.6GB centos9 image. My very minimal understanding of containers world, is that A large centos9 image would be used quite often on a client so it would be deployed as created inodes in disk filesystem and the ephemeral images are likely to be small changes on top of those large base images. Furthermore, the ephmeral images would likely be composed of cenos9 + several layers, so the situation of single composefs image as large as centos9 is highly unlikely. Am I understanding the workflow correctly? If I am, then I would rather see benchmarks with images that correspond with the real life use case that drives composefs, such as small manifests and/or composefs in combination with overlayfs as it would be used more often. > Cold cache performance dominates the runtime of short lived > containers as well as high density container hosts being run to > their container level memory limits. 
`ls -lR` is just a > microbenchmark that demonstrates how much better composefs cold > cache behaviour is than the alternatives being proposed.... > > This might also help explain why my initial review comments focussed > on getting rid of optional format features, straight lining the > processing, changing the format or search algorithms so more > sequential cacheline accesses occurred resulting in less memory > stalls, etc. i.e. reductions in cold cache lookup overhead will > directly translate into faster container workload spin up. > I agree that this technology is novel and understand why it results in faster cold cache lookup. I do not know erofs enough to say if similar techniques could be applied to optimize erofs lookup at mkfs.erofs time, but I can guess that this optimization was never attempted. > > > > > This isn't all that strange, as overlayfs does a lot more work for > > > > > each lookup, including multiple name lookups as well as several > > > > > xattr > > > > > lookups, whereas composefs just does a single lookup in a pre- > > > > > computed > > > > > > > > Seriously, "multiple name lookups"? > > > > Overlayfs does exactly one lookup for anything but first level > > > > subdirs > > > > and for sparse files it does the exact same lookup in /objects as > > > > composefs. > > > > Enough with the hand waving please. Stick to hard facts. > > > > > > With the discussed layout, in a stat() call on a regular file, > > > ovl_lookup() will do lookups on both the sparse file and the backing > > > file, whereas cfs_dir_lookup() will just map some page cache pages and > > > do a binary search. > > > > > > Of course if you actually open the file, then cfs_open_file() would do > > > the equivalent lookups in /objects. But that is often not what happens, > > > for example in "ls -l". > > > > > > Additionally, these extra lookups will cause extra memory use, as you > > > need dentries and inodes for the erofs/squashfs inodes in addition to > > > the overlay inodes. > > > > I see. composefs is really very optimized for ls -lR. > > No, composefs is optimised for minimal namespace and inode > resolution overhead. 'ls -lR' does a lot of these operations, and > therefore you see the efficiency of the design being directly > exposed.... > > > Now only need to figure out if real users start a container and do ls -lR > > without reading many files is a real life use case. > > I've been using 'ls -lR' and 'find . -ctime 1' to benchmark cold > cache directory iteration and inode lookup performance for roughly > 20 years. The benchmarks I run *never* read file data, nor is that > desired - they are pure directory and inode lookup micro-benchmarks > used to analyse VFS and filesystem directory and inode lookup > performance. > > I have been presenting such measurements and patches improving > performance of these microbnechmarks to the XFS and fsdevel lists > over 15 years and I have *never* had to justify that what I'm > measuring is a "real world workload" to anyone. Ever. > > Complaining about real world relevancy of the presented benchmark > might be considered applying a double standard, wouldn't you agree? > I disagree. Perhaps my comment was misunderstood. The cold cache benchmark is certainly relevant for composefs comparison and I expect to see it in future submissions. The point I am trying to drive is this: There are two alternatives on the table: 1. Add fs/composefs 2. Improve erofs and overlayfs Functionally, I think we all agree that both alternatives should work. 
Option #1 will take much less effort from composefs authors, so it is understandable that they would do their best to argue in its favor. Option #2 is preferred for long term maintenance reasons, which is why vfs/erofs/overlayfs developers argue in favor of it. The only factor that remains that could shift the balance inside this gray area is the actual performance numbers. And back to my point: the not so simple decision between the two options, by whoever makes this decision, should be based on a real life example of performance improvement and not on a microbenchmark. In my limited experience, a real life example means composefs as a layer in overlayfs. I did not see those numbers and it is clear that they will not be as impressive as the bare composefs numbers, so proposing composefs needs to include those numbers as well. Alexander did claim that he has real life use cases for bare readonly composefs images, but he did not say what the sizes of the manifests in those images are and he did not say whether these use cases also require startup and teardown on the order of seconds. It looks like the different POVs are now well understood by all parties and that we are in the process of fine tuning the information that needs to be presented for making the best decision based on facts. This discussion, which was on a collision course at the beginning, now looks like it is on a converging course - this makes me happy. Thanks, Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-25 8:32 ` Amir Goldstein @ 2023-01-25 10:08 ` Alexander Larsson 2023-01-25 10:43 ` Amir Goldstein 2023-01-25 10:39 ` Giuseppe Scrivano 1 sibling, 1 reply; 87+ messages in thread From: Alexander Larsson @ 2023-01-25 10:08 UTC (permalink / raw) To: Amir Goldstein, Dave Chinner Cc: linux-fsdevel, linux-kernel, gscrivan, brauner, viro, Vivek Goyal, Miklos Szeredi On Wed, 2023-01-25 at 10:32 +0200, Amir Goldstein wrote: > On Wed, Jan 25, 2023 at 6:18 AM Dave Chinner <david@fromorbit.com> > wrote: > > > > > > > > I've already described the real world production system bottlenecks > > that composefs is designed to overcome in a previous thread. > > > > Please go back an read this: > > > > https://lore.kernel.org/linux-fsdevel/20230118002242.GB937597@dread.disaster.area/ > > > > I've read it and now re-read it. > Most of the post talks about the excess time of creating the > namespace, > which is addressed by erofs+overlayfs. > > I guess you mean this requirement: > "When you have container instances that might only be needed for a > few seconds, taking half a minute to set up the container instance > and then another half a minute to tear it down just isn't viable - > we need instantiation and teardown times in the order of a second or > two." > > Forgive for not being part of the containers world, so I have to ask > - > Which real life use case requires instantiation and teardown times in > the order of a second? > > What is the order of number of files in the manifest of those > ephemeral > images? > > The benchmark was done on a 2.6GB centos9 image. What does this matter? We want to measure a particular kind of operation, so, we use a sample with a lot of those operations. What would it help running some operation on a smaller image that does much less of the critical operations. That would just make it harder to see the data for all the noise. Nobody is saying that reading all the metadata in a 2.6GB image is something a container would do. It is however doing lots of the operations that constrains container startup, and it allows us to compare the performance of these operation between different alternatives. > My very minimal understanding of containers world, is that > A large centos9 image would be used quite often on a client so it > would be deployed as created inodes in disk filesystem > and the ephemeral images are likely to be small changes > on top of those large base images. > > Furthermore, the ephmeral images would likely be composed > of cenos9 + several layers, so the situation of single composefs > image as large as centos9 is highly unlikely. > > Am I understanding the workflow correctly? In a composefs based container storage implementation one would likely not use a layered approach for the "derived" images. Since all file content is shared anyway its more useful to just combine the metadata of the layers into a single composefs image. It is not going to be very large anyway, and it will make lookups much faster as you don't need to do all the negative lookups in the upper layers when looking for files in the base layer. > If I am, then I would rather see benchmarks with images > that correspond with the real life use case that drives composefs, > such as small manifests and/or composefs in combination with > overlayfs as it would be used more often. I feel like there is a constant moving of the goal post here. 
I've provided lots of raw performance numbers, and explained that they are important to our usecases, there has to be an end to how detailed they need to be. I'm not interested in implementing a complete container runtime based on overlayfs just to show that it performs poorly. > > Cold cache performance dominates the runtime of short lived > > containers as well as high density container hosts being run to > > their container level memory limits. `ls -lR` is just a > > microbenchmark that demonstrates how much better composefs cold > > cache behaviour is than the alternatives being proposed.... > > > > This might also help explain why my initial review comments > > focussed > > on getting rid of optional format features, straight lining the > > processing, changing the format or search algorithms so more > > sequential cacheline accesses occurred resulting in less memory > > stalls, etc. i.e. reductions in cold cache lookup overhead will > > directly translate into faster container workload spin up. > > > > I agree that this technology is novel and understand why it results > in faster cold cache lookup. > I do not know erofs enough to say if similar techniques could be > applied to optimize erofs lookup at mkfs.erofs time, but I can guess > that this optimization was never attempted. > > > > > On the contrary, erofs lookup is very similar to composefs. There is nothing magical about it, we're talking about pre-computed, static lists of names. What you do is you sort the names, put them in a compact seek-free form, and then you binary search on them. Composefs v3 has some changes to make larger directories slightly more efficient (no chunking), but the general performance should be comparable. I believe Gao said that mkfs.erofs could do slightly better at how data arranged so that related things are closer to each other. That may help some, but I don't think this is gonna be a massive difference. > > > > > -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- =-=-= Alexander Larsson Red Hat, Inc alexl@redhat.com alexander.larsson@gmail.com He's a short-sighted guerilla filmmaker with a winning smile and a way with the ladies. She's a scantily clad mutant magician's assistant living on borrowed time. They fight crime! ^ permalink raw reply [flat|nested] 87+ messages in thread
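A sketch of the non-layered approach described above, where each image, base or derived, gets its own flattened composefs metadata file while all of them reference the same content-addressed object store (paths and image names are illustrative):

# mkcomposefs --digest-store=objects base-rootfs/ base.cfs
# mkcomposefs --digest-store=objects derived-rootfs/ derived.cfs
# mount -t composefs derived.cfs -o basedir=objects /run/container-rootfs

Since file content is deduplicated in the objects directory, the per-image metadata files stay small, and a lookup never has to fall through multiple layers.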
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-25 10:08 ` Alexander Larsson @ 2023-01-25 10:43 ` Amir Goldstein 0 siblings, 0 replies; 87+ messages in thread From: Amir Goldstein @ 2023-01-25 10:43 UTC (permalink / raw) To: Alexander Larsson Cc: Dave Chinner, linux-fsdevel, linux-kernel, gscrivan, brauner, viro, Vivek Goyal, Miklos Szeredi On Wed, Jan 25, 2023 at 12:08 PM Alexander Larsson <alexl@redhat.com> wrote: > > On Wed, 2023-01-25 at 10:32 +0200, Amir Goldstein wrote: > > On Wed, Jan 25, 2023 at 6:18 AM Dave Chinner <david@fromorbit.com> > > wrote: > > > > > > > > > > > > I've already described the real world production system bottlenecks > > > that composefs is designed to overcome in a previous thread. > > > > > > Please go back an read this: > > > > > > https://lore.kernel.org/linux-fsdevel/20230118002242.GB937597@dread.disaster.area/ > > > > > > > I've read it and now re-read it. > > Most of the post talks about the excess time of creating the > > namespace, > > which is addressed by erofs+overlayfs. > > > > I guess you mean this requirement: > > "When you have container instances that might only be needed for a > > few seconds, taking half a minute to set up the container instance > > and then another half a minute to tear it down just isn't viable - > > we need instantiation and teardown times in the order of a second or > > two." > > > > Forgive for not being part of the containers world, so I have to ask > > - > > Which real life use case requires instantiation and teardown times in > > the order of a second? > > > > What is the order of number of files in the manifest of those > > ephemeral > > images? > > > > The benchmark was done on a 2.6GB centos9 image. > > What does this matter? We want to measure a particular kind of > operation, so, we use a sample with a lot of those operations. What > would it help running some operation on a smaller image that does much > less of the critical operations. That would just make it harder to see > the data for all the noise. Nobody is saying that reading all the > metadata in a 2.6GB image is something a container would do. It is > however doing lots of the operations that constrains container startup, > and it allows us to compare the performance of these operation between > different alternatives. > When talking about performance improvements sometimes the absolute numbers matter just as well as the percentage. You write that: "The primary KPI is cold boot performance, because there are legal requirements for the entire system to boot in 2 seconds." so the size of the image does matter. If for the automotive use case, a centos9-like image needs to boot in 2 seconds and you show that you can accomplish that with composefs and cannot accomplish that with overlayfs+composefs then you have a pretty strong argument, with very few performance numbers ;-) > > My very minimal understanding of containers world, is that > > A large centos9 image would be used quite often on a client so it > > would be deployed as created inodes in disk filesystem > > and the ephemeral images are likely to be small changes > > on top of those large base images. > > > > Furthermore, the ephmeral images would likely be composed > > of cenos9 + several layers, so the situation of single composefs > > image as large as centos9 is highly unlikely. > > > > Am I understanding the workflow correctly? 
> > In a composefs based container storage implementation one would likely > not use a layered approach for the "derived" images. Since all file > content is shared anyway its more useful to just combine the metadata > of the layers into a single composefs image. It is not going to be very > large anyway, and it will make lookups much faster as you don't need to > do all the negative lookups in the upper layers when looking for files > in the base layer. > Aha! that is something that wasn't clear to me - that the idea is to change the image distribution so that there are many "data layers" but the "metadata layers" are merged on the server, so the client uses only one. Maybe I am slow and maybe this part needs to be explained better. > > If I am, then I would rather see benchmarks with images > > that correspond with the real life use case that drives composefs, > > such as small manifests and/or composefs in combination with > > overlayfs as it would be used more often. > > I feel like there is a constant moving of the goal post here. I've > provided lots of raw performance numbers, and explained that they are > important to our usecases, there has to be an end to how detailed they > need to be. I'm not interested in implementing a complete container > runtime based on overlayfs just to show that it performs poorly. > Alexander, be patient. This is the process everyone that wants to upstream a new fs/subsystem/feature has to go through and everyone that wants to publish an academic paper has to go through. The reviewers are also in a learning process and you cannot expect reviewers to have all the questions ready for you on V1 and not have other questions pop up as their understanding of the problem space evolves. Note that my request was conditional to "if my understanding of the workflow is correct". Since you explained that your workflow does not include overlayfs you do not need to provide the benchmark of overlayfs+composefs, but if you intend to use the argument that "It is also quite typical to have shortlived containers in cloud workloads, and startup time there is very important." then you have to be honest about it and acknowledge that those short lived containers are not readonly, so if you want to use this use case when arguing for composefs, please do provide the performance numbers that correspond with this use case. > > > Cold cache performance dominates the runtime of short lived > > > containers as well as high density container hosts being run to > > > their container level memory limits. `ls -lR` is just a > > > microbenchmark that demonstrates how much better composefs cold > > > cache behaviour is than the alternatives being proposed.... > > > > > > This might also help explain why my initial review comments > > > focussed > > > on getting rid of optional format features, straight lining the > > > processing, changing the format or search algorithms so more > > > sequential cacheline accesses occurred resulting in less memory > > > stalls, etc. i.e. reductions in cold cache lookup overhead will > > > directly translate into faster container workload spin up. > > > > > > > I agree that this technology is novel and understand why it results > > in faster cold cache lookup. > > I do not know erofs enough to say if similar techniques could be > > applied to optimize erofs lookup at mkfs.erofs time, but I can guess > > that this optimization was never attempted. > > > > > > > > On the contrary, erofs lookup is very similar to composefs. 
There is > nothing magical about it, we're talking about pre-computed, static > lists of names. What you do is you sort the names, put them in a > compact seek-free form, and then you binary search on them. Composefs > v3 has some changes to make larger directories slightly more efficient > (no chunking), but the general performance should be comparable. > > I believe Gao said that mkfs.erofs could do slightly better at how data > arranged so that related things are closer to each other. That may help > some, but I don't think this is gonna be a massive difference. Cool, so for readonly images, it is down to a performance comparison of overlayfs vs. composefs and to be fair overlayfs will also have a single upper metadata layer. May the best fs win! Thanks, Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-25 8:32 ` Amir Goldstein 2023-01-25 10:08 ` Alexander Larsson @ 2023-01-25 10:39 ` Giuseppe Scrivano 2023-01-25 11:17 ` Amir Goldstein 1 sibling, 1 reply; 87+ messages in thread From: Giuseppe Scrivano @ 2023-01-25 10:39 UTC (permalink / raw) To: Amir Goldstein Cc: Dave Chinner, Alexander Larsson, linux-fsdevel, linux-kernel, brauner, viro, Vivek Goyal, Miklos Szeredi Amir Goldstein <amir73il@gmail.com> writes: > On Wed, Jan 25, 2023 at 6:18 AM Dave Chinner <david@fromorbit.com> wrote: >> >> On Tue, Jan 24, 2023 at 09:06:13PM +0200, Amir Goldstein wrote: >> > On Tue, Jan 24, 2023 at 3:13 PM Alexander Larsson <alexl@redhat.com> wrote: >> > > On Tue, 2023-01-24 at 05:24 +0200, Amir Goldstein wrote: >> > > > On Mon, Jan 23, 2023 at 7:56 PM Alexander Larsson <alexl@redhat.com> >> > > > wrote: >> > > > > On Fri, 2023-01-20 at 21:44 +0200, Amir Goldstein wrote: >> > > > > > On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson >> > > > > > <alexl@redhat.com> >> > > > > > wrote: >> > > I'm not sure why the dentry cache case would be more important? >> > > Starting a new container will very often not have cached the image. >> > > >> > > To me the interesting case is for a new image, but with some existing >> > > page cache for the backing files directory. That seems to model staring >> > > a new image in an active container host, but its somewhat hard to test >> > > that case. >> > > >> > >> > ok, you can argue that faster cold cache ls -lR is important >> > for starting new images. >> > I think you will be asked to show a real life container use case where >> > that benchmark really matters. >> >> I've already described the real world production system bottlenecks >> that composefs is designed to overcome in a previous thread. >> >> Please go back an read this: >> >> https://lore.kernel.org/linux-fsdevel/20230118002242.GB937597@dread.disaster.area/ >> > > I've read it and now re-read it. > Most of the post talks about the excess time of creating the namespace, > which is addressed by erofs+overlayfs. > > I guess you mean this requirement: > "When you have container instances that might only be needed for a > few seconds, taking half a minute to set up the container instance > and then another half a minute to tear it down just isn't viable - > we need instantiation and teardown times in the order of a second or > two." > > Forgive for not being part of the containers world, so I have to ask - > Which real life use case requires instantiation and teardown times in > the order of a second? > > What is the order of number of files in the manifest of those ephemeral > images? > > The benchmark was done on a 2.6GB centos9 image. > > My very minimal understanding of containers world, is that > A large centos9 image would be used quite often on a client so it > would be deployed as created inodes in disk filesystem > and the ephemeral images are likely to be small changes > on top of those large base images. > > Furthermore, the ephmeral images would likely be composed > of cenos9 + several layers, so the situation of single composefs > image as large as centos9 is highly unlikely. > > Am I understanding the workflow correctly? > > If I am, then I would rather see benchmarks with images > that correspond with the real life use case that drives composefs, > such as small manifests and/or composefs in combination with > overlayfs as it would be used more often. 
> >> Cold cache performance dominates the runtime of short lived >> containers as well as high density container hosts being run to >> their container level memory limits. `ls -lR` is just a >> microbenchmark that demonstrates how much better composefs cold >> cache behaviour is than the alternatives being proposed.... >> >> This might also help explain why my initial review comments focussed >> on getting rid of optional format features, straight lining the >> processing, changing the format or search algorithms so more >> sequential cacheline accesses occurred resulting in less memory >> stalls, etc. i.e. reductions in cold cache lookup overhead will >> directly translate into faster container workload spin up. >> > > I agree that this technology is novel and understand why it results > in faster cold cache lookup. > I do not know erofs enough to say if similar techniques could be > applied to optimize erofs lookup at mkfs.erofs time, but I can guess > that this optimization was never attempted. As Dave mentioned, containers in a cluster usually run with low memory limits to increase density of how many containers can run on a single host. I've done some tests to get some numbers on the memory usage. Please let me know if you've any comment on the method I've used to read the memory usage, if you've any better suggestion please let me know. I am using a Fedora container image, but I think the image used is not relevant, as the memory used should increase linearly to the image size for both setups. I am using systemd-run --scope to get a new cgroup, the system uses cgroupv2. For this first test I am using a RO mount both for composefs and erofs+overlayfs. # echo 3 > /proc/sys/vm/drop_caches # \time systemd-run --scope sh -c 'ls -lR /mnt/composefs > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' Running scope as unit: run-r482ec1c3024a4a8b9d2a369bf5dc6df3.scope 16367616 0.03user 0.54system 0:00.71elapsed 80%CPU (0avgtext+0avgdata 7552maxresident)k 10592inputs+0outputs (28major+1273minor)pagefaults 0swaps # echo 3 > /proc/sys/vm/drop_caches # \time systemd-run --scope sh -c 'ls -lR /mnt/erofs-overlay > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' Running scope as unit: run-r5f0f599053c349669e5c1ecacaa037b6.scope 48390144 0.04user 1.03system 0:01.81elapsed 59%CPU (0avgtext+0avgdata 7552maxresident)k 30776inputs+0outputs (28major+1269minor)pagefaults 0swaps the erofs+overlay setup takes 2.5 times to complete and it uses 3 times the memory used by composefs. The second test involves a RW mount for composefs. For the erofs+overlay setup I've just added an upperdir and workdir to the overlay mount, while for composefs I create a completely new overlay mount that uses the composefs mount as the lower layer. 
# echo 3 > /proc/sys/vm/drop_caches # \time systemd-run --scope sh -c 'ls -lR /mnt/composefs-overlay > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' Running scope as unit: run-r23519c8048704e5b84a1355f131d9d93.scope 31014912 0.05user 1.15system 0:01.38elapsed 87%CPU (0avgtext+0avgdata 7552maxresident)k 10944inputs+0outputs (28major+1282minor)pagefaults 0swaps # echo 3 > /proc/sys/vm/drop_caches # \time systemd-run --scope sh -c 'ls -lR /mnt/erofs-overlay > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' Running scope as unit: run-rdbccf045f3124e379cec00273638db08.scope 48308224 0.07user 2.04system 0:03.22elapsed 65%CPU (0avgtext+0avgdata 7424maxresident)k 30720inputs+0outputs (28major+1273minor)pagefaults 0swaps so the erofs+overlay setup still takes more time (almost 2.5 times) and uses more memory (slightly more than 1.5 times) >> > > > > This isn't all that strange, as overlayfs does a lot more work for >> > > > > each lookup, including multiple name lookups as well as several >> > > > > xattr >> > > > > lookups, whereas composefs just does a single lookup in a pre- >> > > > > computed >> > > > >> > > > Seriously, "multiple name lookups"? >> > > > Overlayfs does exactly one lookup for anything but first level >> > > > subdirs >> > > > and for sparse files it does the exact same lookup in /objects as >> > > > composefs. >> > > > Enough with the hand waving please. Stick to hard facts. >> > > >> > > With the discussed layout, in a stat() call on a regular file, >> > > ovl_lookup() will do lookups on both the sparse file and the backing >> > > file, whereas cfs_dir_lookup() will just map some page cache pages and >> > > do a binary search. >> > > >> > > Of course if you actually open the file, then cfs_open_file() would do >> > > the equivalent lookups in /objects. But that is often not what happens, >> > > for example in "ls -l". >> > > >> > > Additionally, these extra lookups will cause extra memory use, as you >> > > need dentries and inodes for the erofs/squashfs inodes in addition to >> > > the overlay inodes. >> > >> > I see. composefs is really very optimized for ls -lR. >> >> No, composefs is optimised for minimal namespace and inode >> resolution overhead. 'ls -lR' does a lot of these operations, and >> therefore you see the efficiency of the design being directly >> exposed.... >> >> > Now only need to figure out if real users start a container and do ls -lR >> > without reading many files is a real life use case. >> >> I've been using 'ls -lR' and 'find . -ctime 1' to benchmark cold >> cache directory iteration and inode lookup performance for roughly >> 20 years. The benchmarks I run *never* read file data, nor is that >> desired - they are pure directory and inode lookup micro-benchmarks >> used to analyse VFS and filesystem directory and inode lookup >> performance. >> >> I have been presenting such measurements and patches improving >> performance of these microbnechmarks to the XFS and fsdevel lists >> over 15 years and I have *never* had to justify that what I'm >> measuring is a "real world workload" to anyone. Ever. >> >> Complaining about real world relevancy of the presented benchmark >> might be considered applying a double standard, wouldn't you agree? >> > > I disagree. > Perhaps my comment was misunderstood. > > The cold cache benchmark is certainly relevant for composefs > comparison and I expect to see it in future submissions. 
> > The point I am trying to drive is this: > There are two alternatives on the table: > 1. Add fs/composefs > 2. Improve erofs and overlayfs > > Functionally, I think we all agree that both alternatives should work. > > Option #1 will take much less effort from composefs authors, so it is > understandable that they would do their best to argue in its favor. > > Option #2 is prefered for long term maintenance reasons, which is > why vfs/erofs/overlayfs developers argue in favor of it. > > The only factor that remains that could shift the balance inside > this gray area are the actual performance numbers. > > And back to my point: the not so simple decision between the > two options, by whoever makes this decision, should be based > on a real life example of performance improvement and not of > a microbenchamk. > > In my limited experience, a real life example means composefs > as a layer in overlayfs. > > I did not see those numbers and it is clear that they will not be > as impressive as the bare composefs numbers, so proposing > composefs needs to include those numbers as well. > > Alexander did claim that he has real life use cases for bare readonly > composefs images, but he did not say what the size of the manifests > in those images are and he did not say whether these use cases > also require startup and teardown in orders of seconds. > > It looks like the different POV are now well understood by all parties > and that we are in the process of fine tuning the information that > needs to be presented for making the best decision based on facts. > > This discussion, which was on a collision course at the beginning, > looks like it is in a converging course - this makes me happy. > > Thanks, > Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-25 10:39 ` Giuseppe Scrivano @ 2023-01-25 11:17 ` Amir Goldstein 2023-01-25 12:30 ` Giuseppe Scrivano 0 siblings, 1 reply; 87+ messages in thread From: Amir Goldstein @ 2023-01-25 11:17 UTC (permalink / raw) To: Giuseppe Scrivano Cc: Dave Chinner, Alexander Larsson, linux-fsdevel, linux-kernel, brauner, viro, Vivek Goyal, Miklos Szeredi On Wed, Jan 25, 2023 at 12:39 PM Giuseppe Scrivano <gscrivan@redhat.com> wrote: > > Amir Goldstein <amir73il@gmail.com> writes: > > > On Wed, Jan 25, 2023 at 6:18 AM Dave Chinner <david@fromorbit.com> wrote: > >> > >> On Tue, Jan 24, 2023 at 09:06:13PM +0200, Amir Goldstein wrote: > >> > On Tue, Jan 24, 2023 at 3:13 PM Alexander Larsson <alexl@redhat.com> wrote: > >> > > On Tue, 2023-01-24 at 05:24 +0200, Amir Goldstein wrote: > >> > > > On Mon, Jan 23, 2023 at 7:56 PM Alexander Larsson <alexl@redhat.com> > >> > > > wrote: > >> > > > > On Fri, 2023-01-20 at 21:44 +0200, Amir Goldstein wrote: > >> > > > > > On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson > >> > > > > > <alexl@redhat.com> > >> > > > > > wrote: > >> > > I'm not sure why the dentry cache case would be more important? > >> > > Starting a new container will very often not have cached the image. > >> > > > >> > > To me the interesting case is for a new image, but with some existing > >> > > page cache for the backing files directory. That seems to model staring > >> > > a new image in an active container host, but its somewhat hard to test > >> > > that case. > >> > > > >> > > >> > ok, you can argue that faster cold cache ls -lR is important > >> > for starting new images. > >> > I think you will be asked to show a real life container use case where > >> > that benchmark really matters. > >> > >> I've already described the real world production system bottlenecks > >> that composefs is designed to overcome in a previous thread. > >> > >> Please go back an read this: > >> > >> https://lore.kernel.org/linux-fsdevel/20230118002242.GB937597@dread.disaster.area/ > >> > > > > I've read it and now re-read it. > > Most of the post talks about the excess time of creating the namespace, > > which is addressed by erofs+overlayfs. > > > > I guess you mean this requirement: > > "When you have container instances that might only be needed for a > > few seconds, taking half a minute to set up the container instance > > and then another half a minute to tear it down just isn't viable - > > we need instantiation and teardown times in the order of a second or > > two." > > > > Forgive for not being part of the containers world, so I have to ask - > > Which real life use case requires instantiation and teardown times in > > the order of a second? > > > > What is the order of number of files in the manifest of those ephemeral > > images? > > > > The benchmark was done on a 2.6GB centos9 image. > > > > My very minimal understanding of containers world, is that > > A large centos9 image would be used quite often on a client so it > > would be deployed as created inodes in disk filesystem > > and the ephemeral images are likely to be small changes > > on top of those large base images. > > > > Furthermore, the ephmeral images would likely be composed > > of cenos9 + several layers, so the situation of single composefs > > image as large as centos9 is highly unlikely. > > > > Am I understanding the workflow correctly? 
> > > > If I am, then I would rather see benchmarks with images > > that correspond with the real life use case that drives composefs, > > such as small manifests and/or composefs in combination with > > overlayfs as it would be used more often. > > > >> Cold cache performance dominates the runtime of short lived > >> containers as well as high density container hosts being run to > >> their container level memory limits. `ls -lR` is just a > >> microbenchmark that demonstrates how much better composefs cold > >> cache behaviour is than the alternatives being proposed.... > >> > >> This might also help explain why my initial review comments focussed > >> on getting rid of optional format features, straight lining the > >> processing, changing the format or search algorithms so more > >> sequential cacheline accesses occurred resulting in less memory > >> stalls, etc. i.e. reductions in cold cache lookup overhead will > >> directly translate into faster container workload spin up. > >> > > > > I agree that this technology is novel and understand why it results > > in faster cold cache lookup. > > I do not know erofs enough to say if similar techniques could be > > applied to optimize erofs lookup at mkfs.erofs time, but I can guess > > that this optimization was never attempted. > > As Dave mentioned, containers in a cluster usually run with low memory > limits to increase density of how many containers can run on a single Good selling point. > host. I've done some tests to get some numbers on the memory usage. > > Please let me know if you've any comment on the method I've used to read > the memory usage, if you've any better suggestion please let me know. > > I am using a Fedora container image, but I think the image used is not > relevant, as the memory used should increase linearly to the image size > for both setups. > > I am using systemd-run --scope to get a new cgroup, the system uses > cgroupv2. > > For this first test I am using a RO mount both for composefs and > erofs+overlayfs. > > # echo 3 > /proc/sys/vm/drop_caches > # \time systemd-run --scope sh -c 'ls -lR /mnt/composefs > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' > Running scope as unit: run-r482ec1c3024a4a8b9d2a369bf5dc6df3.scope > 16367616 > 0.03user 0.54system 0:00.71elapsed 80%CPU (0avgtext+0avgdata 7552maxresident)k > 10592inputs+0outputs (28major+1273minor)pagefaults 0swaps > > # echo 3 > /proc/sys/vm/drop_caches > # \time systemd-run --scope sh -c 'ls -lR /mnt/erofs-overlay > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' > Running scope as unit: run-r5f0f599053c349669e5c1ecacaa037b6.scope > 48390144 > 0.04user 1.03system 0:01.81elapsed 59%CPU (0avgtext+0avgdata 7552maxresident)k > 30776inputs+0outputs (28major+1269minor)pagefaults 0swaps > > the erofs+overlay setup takes 2.5 times to complete and it uses 3 times > the memory used by composefs. > > The second test involves a RW mount for composefs. > > For the erofs+overlay setup I've just added an upperdir and workdir to > the overlay mount, while for composefs I create a completely new overlay > mount that uses the composefs mount as the lower layer. 
> > # echo 3 > /proc/sys/vm/drop_caches > # \time systemd-run --scope sh -c 'ls -lR /mnt/composefs-overlay > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' > Running scope as unit: run-r23519c8048704e5b84a1355f131d9d93.scope > 31014912 > 0.05user 1.15system 0:01.38elapsed 87%CPU (0avgtext+0avgdata 7552maxresident)k > 10944inputs+0outputs (28major+1282minor)pagefaults 0swaps > > # echo 3 > /proc/sys/vm/drop_caches > # \time systemd-run --scope sh -c 'ls -lR /mnt/erofs-overlay > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' > Running scope as unit: run-rdbccf045f3124e379cec00273638db08.scope > 48308224 > 0.07user 2.04system 0:03.22elapsed 65%CPU (0avgtext+0avgdata 7424maxresident)k > 30720inputs+0outputs (28major+1273minor)pagefaults 0swaps > > so the erofs+overlay setup still takes more time (almost 2.5 times) and > uses more memory (slightly more than 1.5 times) > That's an important comparison. Thanks for running it. Based on Alexander's explanation about the differences between overlayfs lookup vs. composefs lookup of a regular "metacopy" file, I just need to point out that the same optimization (lazy lookup of the lower data file on open) can be done in overlayfs as well. (*) currently, overlayfs needs to lookup the lower file also for st_blocks. I am not saying that it should be done or that Miklos will agree to make this change in overlayfs, but that seems to be the major difference. getxattr may have some extra cost depending on in-inode xattr format of erofs, but specifically, the metacopy getxattr can be avoided if this is a special overlayfs RO mount that is marked as EVERYTHING IS METACOPY. I don't expect you guys to now try to hack overlayfs and explore this path to completion. My expectation is that this information will be clearly visible to anyone reviewing future submission, e.g.: - This is the comparison we ran... - This is the reason that composefs gives better results... - It MAY be possible to optimize erofs/overlayfs to get to similar results, but we did not try to do that It is especially important IMO to get the ACK of both Gao and Miklos on your analysis, because remember than when this thread started, you did not know about the metacopy option and your main argument was saving the time it takes to create the overlayfs layer files in the filesystem, because you were missing some technical background on overlayfs. I hope that after you are done being annoyed by all the chores we put you guys up to, you will realize that they help you build your case for the final submission... Thanks, Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
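For anyone trying to reproduce the comparison above, the two trees were presumably assembled roughly as follows (a sketch only: the image names, mount points and exact overlayfs options are assumptions, not taken from the thread):

composefs, as in the cover letter:

# mount -t composefs rootfs.img -o basedir=/objects /mnt/composefs

erofs+overlayfs, with a metadata-only erofs layer whose files carry overlay metacopy/redirect xattrs pointing into the same content-addressed object store:

# mount -t erofs -o ro rootfs-meta.erofs /mnt/erofs
# mount -t overlay overlay -o ro,metacopy=on,redirect_dir=follow,lowerdir=/mnt/erofs:/objects /mnt/erofs-overlay

The RW numbers quoted above would then add an upperdir/workdir to the erofs+overlay mount, and stack a plain overlay on top of /mnt/composefs for the composefs case.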
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-25 11:17 ` Amir Goldstein @ 2023-01-25 12:30 ` Giuseppe Scrivano 2023-01-25 12:46 ` Amir Goldstein 0 siblings, 1 reply; 87+ messages in thread From: Giuseppe Scrivano @ 2023-01-25 12:30 UTC (permalink / raw) To: Amir Goldstein Cc: Dave Chinner, Alexander Larsson, linux-fsdevel, linux-kernel, brauner, viro, Vivek Goyal, Miklos Szeredi Amir Goldstein <amir73il@gmail.com> writes: > On Wed, Jan 25, 2023 at 12:39 PM Giuseppe Scrivano <gscrivan@redhat.com> wrote: >> >> Amir Goldstein <amir73il@gmail.com> writes: >> >> > On Wed, Jan 25, 2023 at 6:18 AM Dave Chinner <david@fromorbit.com> wrote: >> >> >> >> On Tue, Jan 24, 2023 at 09:06:13PM +0200, Amir Goldstein wrote: >> >> > On Tue, Jan 24, 2023 at 3:13 PM Alexander Larsson <alexl@redhat.com> wrote: >> >> > > On Tue, 2023-01-24 at 05:24 +0200, Amir Goldstein wrote: >> >> > > > On Mon, Jan 23, 2023 at 7:56 PM Alexander Larsson <alexl@redhat.com> >> >> > > > wrote: >> >> > > > > On Fri, 2023-01-20 at 21:44 +0200, Amir Goldstein wrote: >> >> > > > > > On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson >> >> > > > > > <alexl@redhat.com> >> >> > > > > > wrote: >> >> > > I'm not sure why the dentry cache case would be more important? >> >> > > Starting a new container will very often not have cached the image. >> >> > > >> >> > > To me the interesting case is for a new image, but with some existing >> >> > > page cache for the backing files directory. That seems to model staring >> >> > > a new image in an active container host, but its somewhat hard to test >> >> > > that case. >> >> > > >> >> > >> >> > ok, you can argue that faster cold cache ls -lR is important >> >> > for starting new images. >> >> > I think you will be asked to show a real life container use case where >> >> > that benchmark really matters. >> >> >> >> I've already described the real world production system bottlenecks >> >> that composefs is designed to overcome in a previous thread. >> >> >> >> Please go back an read this: >> >> >> >> https://lore.kernel.org/linux-fsdevel/20230118002242.GB937597@dread.disaster.area/ >> >> >> > >> > I've read it and now re-read it. >> > Most of the post talks about the excess time of creating the namespace, >> > which is addressed by erofs+overlayfs. >> > >> > I guess you mean this requirement: >> > "When you have container instances that might only be needed for a >> > few seconds, taking half a minute to set up the container instance >> > and then another half a minute to tear it down just isn't viable - >> > we need instantiation and teardown times in the order of a second or >> > two." >> > >> > Forgive for not being part of the containers world, so I have to ask - >> > Which real life use case requires instantiation and teardown times in >> > the order of a second? >> > >> > What is the order of number of files in the manifest of those ephemeral >> > images? >> > >> > The benchmark was done on a 2.6GB centos9 image. >> > >> > My very minimal understanding of containers world, is that >> > A large centos9 image would be used quite often on a client so it >> > would be deployed as created inodes in disk filesystem >> > and the ephemeral images are likely to be small changes >> > on top of those large base images. >> > >> > Furthermore, the ephmeral images would likely be composed >> > of cenos9 + several layers, so the situation of single composefs >> > image as large as centos9 is highly unlikely. >> > >> > Am I understanding the workflow correctly? 
>> > >> > If I am, then I would rather see benchmarks with images >> > that correspond with the real life use case that drives composefs, >> > such as small manifests and/or composefs in combination with >> > overlayfs as it would be used more often. >> > >> >> Cold cache performance dominates the runtime of short lived >> >> containers as well as high density container hosts being run to >> >> their container level memory limits. `ls -lR` is just a >> >> microbenchmark that demonstrates how much better composefs cold >> >> cache behaviour is than the alternatives being proposed.... >> >> >> >> This might also help explain why my initial review comments focussed >> >> on getting rid of optional format features, straight lining the >> >> processing, changing the format or search algorithms so more >> >> sequential cacheline accesses occurred resulting in less memory >> >> stalls, etc. i.e. reductions in cold cache lookup overhead will >> >> directly translate into faster container workload spin up. >> >> >> > >> > I agree that this technology is novel and understand why it results >> > in faster cold cache lookup. >> > I do not know erofs enough to say if similar techniques could be >> > applied to optimize erofs lookup at mkfs.erofs time, but I can guess >> > that this optimization was never attempted. >> >> As Dave mentioned, containers in a cluster usually run with low memory >> limits to increase density of how many containers can run on a single > > Good selling point. > >> host. I've done some tests to get some numbers on the memory usage. >> >> Please let me know if you've any comment on the method I've used to read >> the memory usage, if you've any better suggestion please let me know. >> >> I am using a Fedora container image, but I think the image used is not >> relevant, as the memory used should increase linearly to the image size >> for both setups. >> >> I am using systemd-run --scope to get a new cgroup, the system uses >> cgroupv2. >> >> For this first test I am using a RO mount both for composefs and >> erofs+overlayfs. >> >> # echo 3 > /proc/sys/vm/drop_caches >> # \time systemd-run --scope sh -c 'ls -lR /mnt/composefs > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' >> Running scope as unit: run-r482ec1c3024a4a8b9d2a369bf5dc6df3.scope >> 16367616 >> 0.03user 0.54system 0:00.71elapsed 80%CPU (0avgtext+0avgdata 7552maxresident)k >> 10592inputs+0outputs (28major+1273minor)pagefaults 0swaps >> >> # echo 3 > /proc/sys/vm/drop_caches >> # \time systemd-run --scope sh -c 'ls -lR /mnt/erofs-overlay > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' >> Running scope as unit: run-r5f0f599053c349669e5c1ecacaa037b6.scope >> 48390144 >> 0.04user 1.03system 0:01.81elapsed 59%CPU (0avgtext+0avgdata 7552maxresident)k >> 30776inputs+0outputs (28major+1269minor)pagefaults 0swaps >> >> the erofs+overlay setup takes 2.5 times to complete and it uses 3 times >> the memory used by composefs. >> >> The second test involves a RW mount for composefs. >> >> For the erofs+overlay setup I've just added an upperdir and workdir to >> the overlay mount, while for composefs I create a completely new overlay >> mount that uses the composefs mount as the lower layer. 
>> >> # echo 3 > /proc/sys/vm/drop_caches >> # \time systemd-run --scope sh -c 'ls -lR /mnt/composefs-overlay > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' >> Running scope as unit: run-r23519c8048704e5b84a1355f131d9d93.scope >> 31014912 >> 0.05user 1.15system 0:01.38elapsed 87%CPU (0avgtext+0avgdata 7552maxresident)k >> 10944inputs+0outputs (28major+1282minor)pagefaults 0swaps >> >> # echo 3 > /proc/sys/vm/drop_caches >> # \time systemd-run --scope sh -c 'ls -lR /mnt/erofs-overlay > /dev/null; cat $(cat /proc/self/cgroup | sed -e "s|0::|/sys/fs/cgroup|")/memory.peak' >> Running scope as unit: run-rdbccf045f3124e379cec00273638db08.scope >> 48308224 >> 0.07user 2.04system 0:03.22elapsed 65%CPU (0avgtext+0avgdata 7424maxresident)k >> 30720inputs+0outputs (28major+1273minor)pagefaults 0swaps >> >> so the erofs+overlay setup still takes more time (almost 2.5 times) and >> uses more memory (slightly more than 1.5 times) >> > > That's an important comparison. Thanks for running it. > > Based on Alexander's explanation about the differences between overlayfs > lookup vs. composefs lookup of a regular "metacopy" file, I just need to > point out that the same optimization (lazy lookup of the lower data > file on open) > can be done in overlayfs as well. > (*) currently, overlayfs needs to lookup the lower file also for st_blocks. > > I am not saying that it should be done or that Miklos will agree to make > this change in overlayfs, but that seems to be the major difference. > getxattr may have some extra cost depending on in-inode xattr format > of erofs, but specifically, the metacopy getxattr can be avoided if this > is a special overlayfs RO mount that is marked as EVERYTHING IS > METACOPY. > > I don't expect you guys to now try to hack overlayfs and explore > this path to completion. > My expectation is that this information will be clearly visible to anyone > reviewing future submission, e.g.: > > - This is the comparison we ran... > - This is the reason that composefs gives better results... > - It MAY be possible to optimize erofs/overlayfs to get to similar results, > but we did not try to do that > > It is especially important IMO to get the ACK of both Gao and Miklos > on your analysis, because remember than when this thread started, > you did not know about the metacopy option and your main argument > was saving the time it takes to create the overlayfs layer files in the > filesystem, because you were missing some technical background on overlayfs. we knew about metacopy, which we already use in our tools to create mapped image copies when idmapped mounts are not available, and also knew about the other new features in overlayfs. For example, the "volatile" feature which was mentioned in your Overlayfs-containers-lpc-2020 talk, was only submitted upstream after begging Miklos and Vivek for months. I had a PoC that I used and tested locally and asked for their help to get it integrated at the file system layer, using seccomp for the same purpose would have been more complex and prone to errors when dealing with external bind mounts containing persistent data. The only missing bit, at least from my side, was to consider an image that contains only overlay metadata as something we could distribute. I previously mentioned my wish of using it from a user namespace, the goal seems more challenging with EROFS or any other block devices. 
I don't know about the difficulty of getting overlay metacopy working in a user namespace, even though it would be helpful for other use cases as well. Thanks, Giuseppe > > I hope that after you are done being annoyed by all the chores we put > you guys up to, you will realize that they help you build your case for > the final submission... > > Thanks, > Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
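The "volatile" overlayfs feature mentioned above is the mount option of the same name: the upper layer is treated as throwaway, syncs are skipped, and a crash simply discards it. A minimal sketch (the paths are made up):

# mount -t overlay overlay -o lowerdir=/mnt/composefs,upperdir=/run/ctr/upper,workdir=/run/ctr/work,volatile /mnt/ctr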
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-25 12:30 ` Giuseppe Scrivano @ 2023-01-25 12:46 ` Amir Goldstein 2023-01-25 13:10 ` Giuseppe Scrivano 2023-01-25 15:24 ` Christian Brauner 0 siblings, 2 replies; 87+ messages in thread From: Amir Goldstein @ 2023-01-25 12:46 UTC (permalink / raw) To: Giuseppe Scrivano Cc: Dave Chinner, Alexander Larsson, linux-fsdevel, linux-kernel, brauner, viro, Vivek Goyal, Miklos Szeredi > > > > Based on Alexander's explanation about the differences between overlayfs > > lookup vs. composefs lookup of a regular "metacopy" file, I just need to > > point out that the same optimization (lazy lookup of the lower data > > file on open) > > can be done in overlayfs as well. > > (*) currently, overlayfs needs to lookup the lower file also for st_blocks. > > > > I am not saying that it should be done or that Miklos will agree to make > > this change in overlayfs, but that seems to be the major difference. > > getxattr may have some extra cost depending on in-inode xattr format > > of erofs, but specifically, the metacopy getxattr can be avoided if this > > is a special overlayfs RO mount that is marked as EVERYTHING IS > > METACOPY. > > > > I don't expect you guys to now try to hack overlayfs and explore > > this path to completion. > > My expectation is that this information will be clearly visible to anyone > > reviewing future submission, e.g.: > > > > - This is the comparison we ran... > > - This is the reason that composefs gives better results... > > - It MAY be possible to optimize erofs/overlayfs to get to similar results, > > but we did not try to do that > > > > It is especially important IMO to get the ACK of both Gao and Miklos > > on your analysis, because remember than when this thread started, > > you did not know about the metacopy option and your main argument > > was saving the time it takes to create the overlayfs layer files in the > > filesystem, because you were missing some technical background on overlayfs. > > we knew about metacopy, which we already use in our tools to create > mapped image copies when idmapped mounts are not available, and also > knew about the other new features in overlayfs. For example, the > "volatile" feature which was mentioned in your > Overlayfs-containers-lpc-2020 talk, was only submitted upstream after > begging Miklos and Vivek for months. I had a PoC that I used and tested > locally and asked for their help to get it integrated at the file > system layer, using seccomp for the same purpose would have been more > complex and prone to errors when dealing with external bind mounts > containing persistent data. > > The only missing bit, at least from my side, was to consider an image > that contains only overlay metadata as something we could distribute. > I'm glad that I was able to point this out to you, because now the comparison between the overlayfs and composefs options is more fair. > I previously mentioned my wish of using it from a user namespace, the > goal seems more challenging with EROFS or any other block devices. I > don't know about the difficulty of getting overlay metacopy working in a > user namespace, even though it would be helpful for other use cases as > well. > There is no restriction of metacopy in user namespace. overlayfs needs to be mounted with -o userxattr and the overlay xattrs needs to use user.overlay. prefix. w.r.t. 
the implied claim that the composefs on-disk format is simple enough
that it could be made robust against exploits, I will remain silent and
let others speak up, but I advise you to take cover, because this is an
explosive topic ;)

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 87+ messages in thread
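For anyone not familiar with the userxattr mode referred to above: with -o userxattr, overlayfs looks for its markers in the user.overlay.* namespace instead of trusted.overlay.*, which an unprivileged user is allowed to set. A sketch, assuming an unprivileged overlay mount inside a user namespace (the directory names are made up):

$ setfattr -n user.overlay.opaque -v y upper/etc
$ mount -t overlay overlay -o userxattr,lowerdir=lower,upperdir=upper,workdir=work merged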
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-25 12:46 ` Amir Goldstein @ 2023-01-25 13:10 ` Giuseppe Scrivano 2023-01-25 18:07 ` Amir Goldstein 2023-01-25 15:24 ` Christian Brauner 1 sibling, 1 reply; 87+ messages in thread From: Giuseppe Scrivano @ 2023-01-25 13:10 UTC (permalink / raw) To: Amir Goldstein Cc: Dave Chinner, Alexander Larsson, linux-fsdevel, linux-kernel, brauner, viro, Vivek Goyal, Miklos Szeredi Amir Goldstein <amir73il@gmail.com> writes: >> > >> > Based on Alexander's explanation about the differences between overlayfs >> > lookup vs. composefs lookup of a regular "metacopy" file, I just need to >> > point out that the same optimization (lazy lookup of the lower data >> > file on open) >> > can be done in overlayfs as well. >> > (*) currently, overlayfs needs to lookup the lower file also for st_blocks. >> > >> > I am not saying that it should be done or that Miklos will agree to make >> > this change in overlayfs, but that seems to be the major difference. >> > getxattr may have some extra cost depending on in-inode xattr format >> > of erofs, but specifically, the metacopy getxattr can be avoided if this >> > is a special overlayfs RO mount that is marked as EVERYTHING IS >> > METACOPY. >> > >> > I don't expect you guys to now try to hack overlayfs and explore >> > this path to completion. >> > My expectation is that this information will be clearly visible to anyone >> > reviewing future submission, e.g.: >> > >> > - This is the comparison we ran... >> > - This is the reason that composefs gives better results... >> > - It MAY be possible to optimize erofs/overlayfs to get to similar results, >> > but we did not try to do that >> > >> > It is especially important IMO to get the ACK of both Gao and Miklos >> > on your analysis, because remember than when this thread started, >> > you did not know about the metacopy option and your main argument >> > was saving the time it takes to create the overlayfs layer files in the >> > filesystem, because you were missing some technical background on overlayfs. >> >> we knew about metacopy, which we already use in our tools to create >> mapped image copies when idmapped mounts are not available, and also >> knew about the other new features in overlayfs. For example, the >> "volatile" feature which was mentioned in your >> Overlayfs-containers-lpc-2020 talk, was only submitted upstream after >> begging Miklos and Vivek for months. I had a PoC that I used and tested >> locally and asked for their help to get it integrated at the file >> system layer, using seccomp for the same purpose would have been more >> complex and prone to errors when dealing with external bind mounts >> containing persistent data. >> >> The only missing bit, at least from my side, was to consider an image >> that contains only overlay metadata as something we could distribute. >> > > I'm glad that I was able to point this out to you, because now the comparison > between the overlayfs and composefs options is more fair. > >> I previously mentioned my wish of using it from a user namespace, the >> goal seems more challenging with EROFS or any other block devices. I >> don't know about the difficulty of getting overlay metacopy working in a >> user namespace, even though it would be helpful for other use cases as >> well. >> > > There is no restriction of metacopy in user namespace. > overlayfs needs to be mounted with -o userxattr and the overlay > xattrs needs to use user.overlay. prefix. 
if I specify both userxattr and metacopy=on then the mount ends up in the following check: if (config->userxattr) { [...] if (config->metacopy && metacopy_opt) { pr_err("conflicting options: userxattr,metacopy=on\n"); return -EINVAL; } } to me it looks like it was done on purpose to prevent metacopy from a user namespace, but I don't know the reason for sure. > w.r.t. the implied claim that composefs on-disk format is simple enough > so it could be made robust enough to avoid exploits, I will remain > silent and let others speak up, but I advise you to take cover, > because this is an explosive topic ;) > > Thanks, > Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
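A sketch of what hitting that check looks like from a user namespace (the directory names and the mount(8) error text are illustrative; the kernel log line is the pr_err() from the code quoted above):

$ unshare --user --map-root-user --mount
# mount -t overlay overlay -o userxattr,metacopy=on,lowerdir=lower,upperdir=upper,workdir=work merged
mount: merged: wrong fs type, bad option, bad superblock on overlay, ...
# dmesg | tail -n1
overlayfs: conflicting options: userxattr,metacopy=on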
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-25 13:10 ` Giuseppe Scrivano @ 2023-01-25 18:07 ` Amir Goldstein 2023-01-25 19:45 ` Giuseppe Scrivano 0 siblings, 1 reply; 87+ messages in thread From: Amir Goldstein @ 2023-01-25 18:07 UTC (permalink / raw) To: Giuseppe Scrivano Cc: Dave Chinner, Alexander Larsson, linux-fsdevel, linux-kernel, brauner, viro, Vivek Goyal, Miklos Szeredi > >> I previously mentioned my wish of using it from a user namespace, the > >> goal seems more challenging with EROFS or any other block devices. I > >> don't know about the difficulty of getting overlay metacopy working in a > >> user namespace, even though it would be helpful for other use cases as > >> well. > >> > > > > There is no restriction of metacopy in user namespace. > > overlayfs needs to be mounted with -o userxattr and the overlay > > xattrs needs to use user.overlay. prefix. > > if I specify both userxattr and metacopy=on then the mount ends up in > the following check: > > if (config->userxattr) { > [...] > if (config->metacopy && metacopy_opt) { > pr_err("conflicting options: userxattr,metacopy=on\n"); > return -EINVAL; > } > } > Right, my bad. > to me it looks like it was done on purpose to prevent metacopy from a > user namespace, but I don't know the reason for sure. > With hand crafted metacopy, an unpriv user can chmod any files to anything by layering another file with different mode on top of it.... Not sure how the composefs security model intends to handle this scenario with userns mount, but it sounds like a similar problem. Thanks, Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-25 18:07 ` Amir Goldstein @ 2023-01-25 19:45 ` Giuseppe Scrivano 2023-01-25 20:23 ` Amir Goldstein 0 siblings, 1 reply; 87+ messages in thread From: Giuseppe Scrivano @ 2023-01-25 19:45 UTC (permalink / raw) To: Amir Goldstein Cc: Dave Chinner, Alexander Larsson, linux-fsdevel, linux-kernel, brauner, viro, Vivek Goyal, Miklos Szeredi Amir Goldstein <amir73il@gmail.com> writes: >> >> I previously mentioned my wish of using it from a user namespace, the >> >> goal seems more challenging with EROFS or any other block devices. I >> >> don't know about the difficulty of getting overlay metacopy working in a >> >> user namespace, even though it would be helpful for other use cases as >> >> well. >> >> >> > >> > There is no restriction of metacopy in user namespace. >> > overlayfs needs to be mounted with -o userxattr and the overlay >> > xattrs needs to use user.overlay. prefix. >> >> if I specify both userxattr and metacopy=on then the mount ends up in >> the following check: >> >> if (config->userxattr) { >> [...] >> if (config->metacopy && metacopy_opt) { >> pr_err("conflicting options: userxattr,metacopy=on\n"); >> return -EINVAL; >> } >> } >> > > Right, my bad. > >> to me it looks like it was done on purpose to prevent metacopy from a >> user namespace, but I don't know the reason for sure. >> > > With hand crafted metacopy, an unpriv user can chmod > any files to anything by layering another file with different > mode on top of it.... I might be missing something obvious about metacopy, so please correct me if I am wrong, but I don't see how it is any different than just copying the file and chowning it. Of course, as long as overlay uses the same security model so that a file that wasn't originally possible to access must be still blocked, even if referenced through metacopy. > Not sure how the composefs security model intends to handle > this scenario with userns mount, but it sounds like a similar > problem. composefs, if it is going to be used from a user namespace, should be doing the same check as overlay and do not allow accessing files that weren't accessible before. It could be even stricter than overlay, and expect the payload files to be owned by the user who mounted the file system (or be world readable) instead of any ID mapped inside the user namespace. Thanks, Giuseppe ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-25 19:45 ` Giuseppe Scrivano @ 2023-01-25 20:23 ` Amir Goldstein 2023-01-25 20:29 ` Amir Goldstein 2023-01-27 15:57 ` [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem Vivek Goyal 0 siblings, 2 replies; 87+ messages in thread From: Amir Goldstein @ 2023-01-25 20:23 UTC (permalink / raw) To: Giuseppe Scrivano Cc: Dave Chinner, Alexander Larsson, linux-fsdevel, linux-kernel, brauner, viro, Vivek Goyal, Miklos Szeredi On Wed, Jan 25, 2023 at 9:45 PM Giuseppe Scrivano <gscrivan@redhat.com> wrote: > > Amir Goldstein <amir73il@gmail.com> writes: > > >> >> I previously mentioned my wish of using it from a user namespace, the > >> >> goal seems more challenging with EROFS or any other block devices. I > >> >> don't know about the difficulty of getting overlay metacopy working in a > >> >> user namespace, even though it would be helpful for other use cases as > >> >> well. > >> >> > >> > > >> > There is no restriction of metacopy in user namespace. > >> > overlayfs needs to be mounted with -o userxattr and the overlay > >> > xattrs needs to use user.overlay. prefix. > >> > >> if I specify both userxattr and metacopy=on then the mount ends up in > >> the following check: > >> > >> if (config->userxattr) { > >> [...] > >> if (config->metacopy && metacopy_opt) { > >> pr_err("conflicting options: userxattr,metacopy=on\n"); > >> return -EINVAL; > >> } > >> } > >> > > > > Right, my bad. > > > >> to me it looks like it was done on purpose to prevent metacopy from a > >> user namespace, but I don't know the reason for sure. > >> > > > > With hand crafted metacopy, an unpriv user can chmod > > any files to anything by layering another file with different > > mode on top of it.... > > I might be missing something obvious about metacopy, so please correct > me if I am wrong, but I don't see how it is any different than just > copying the file and chowning it. Of course, as long as overlay uses > the same security model so that a file that wasn't originally possible > to access must be still blocked, even if referenced through metacopy. > You're right. The reason for mutual exclusion maybe related to the comment in ovl_check_metacopy_xattr() about EACCES. Need to check with Vivek or Miklos. But get this - you do not need metacopy=on to follow lower inode. It should work without metacopy=on. metacopy=on only instructs overlayfs whether to copy up data or only metadata when changing metadata of lower object, so it is not relevant for readonly mount. Thanks, Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-25 20:23 ` Amir Goldstein @ 2023-01-25 20:29 ` Amir Goldstein 2023-01-26 5:26 ` userns mount and metacopy redirects (Was: Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem) Amir Goldstein 2023-01-27 15:57 ` [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem Vivek Goyal 1 sibling, 1 reply; 87+ messages in thread From: Amir Goldstein @ 2023-01-25 20:29 UTC (permalink / raw) To: Giuseppe Scrivano Cc: Dave Chinner, Alexander Larsson, linux-fsdevel, linux-kernel, brauner, viro, Vivek Goyal, Miklos Szeredi On Wed, Jan 25, 2023 at 10:23 PM Amir Goldstein <amir73il@gmail.com> wrote: > > On Wed, Jan 25, 2023 at 9:45 PM Giuseppe Scrivano <gscrivan@redhat.com> wrote: > > > > Amir Goldstein <amir73il@gmail.com> writes: > > > > >> >> I previously mentioned my wish of using it from a user namespace, the > > >> >> goal seems more challenging with EROFS or any other block devices. I > > >> >> don't know about the difficulty of getting overlay metacopy working in a > > >> >> user namespace, even though it would be helpful for other use cases as > > >> >> well. > > >> >> > > >> > > > >> > There is no restriction of metacopy in user namespace. > > >> > overlayfs needs to be mounted with -o userxattr and the overlay > > >> > xattrs needs to use user.overlay. prefix. > > >> > > >> if I specify both userxattr and metacopy=on then the mount ends up in > > >> the following check: > > >> > > >> if (config->userxattr) { > > >> [...] > > >> if (config->metacopy && metacopy_opt) { > > >> pr_err("conflicting options: userxattr,metacopy=on\n"); > > >> return -EINVAL; > > >> } > > >> } > > >> > > > > > > Right, my bad. > > > > > >> to me it looks like it was done on purpose to prevent metacopy from a > > >> user namespace, but I don't know the reason for sure. > > >> > > > > > > With hand crafted metacopy, an unpriv user can chmod > > > any files to anything by layering another file with different > > > mode on top of it.... > > > > I might be missing something obvious about metacopy, so please correct > > me if I am wrong, but I don't see how it is any different than just > > copying the file and chowning it. Of course, as long as overlay uses > > the same security model so that a file that wasn't originally possible > > to access must be still blocked, even if referenced through metacopy. > > > > You're right. > The reason for mutual exclusion maybe related to the > comment in ovl_check_metacopy_xattr() about EACCES. > Need to check with Vivek or Miklos. > > But get this - you do not need metacopy=on to follow lower inode. > It should work without metacopy=on. > metacopy=on only instructs overlayfs whether to copy up data > or only metadata when changing metadata of lower object, so it is > not relevant for readonly mount. > However, you do need redirect=follow and that one is only mutually exclusive with userxattr. Again, need to ask Miklos whether that could be relaxed under some conditions. Thanks, Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
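In mount-option terms, what is being described here is a read-only overlay that merely follows redirects, without enabling metacopy copy-up behaviour, along the lines of the sketch below (the layer paths are assumptions, and whether this really follows metacopy origins without metacopy=on is questioned further down in the thread):

# mount -t overlay overlay -o ro,redirect_dir=follow,lowerdir=/mnt/meta:/objects /mnt/ro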
* userns mount and metacopy redirects (Was: Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem) 2023-01-25 20:29 ` Amir Goldstein @ 2023-01-26 5:26 ` Amir Goldstein 2023-01-26 8:22 ` Christian Brauner 0 siblings, 1 reply; 87+ messages in thread From: Amir Goldstein @ 2023-01-26 5:26 UTC (permalink / raw) To: Giuseppe Scrivano Cc: Alexander Larsson, Christian Brauner, Vivek Goyal, Miklos Szeredi, overlayfs [spawning overlayfs sub-topic] On Wed, Jan 25, 2023 at 10:29 PM Amir Goldstein <amir73il@gmail.com> wrote: > > On Wed, Jan 25, 2023 at 10:23 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > On Wed, Jan 25, 2023 at 9:45 PM Giuseppe Scrivano <gscrivan@redhat.com> wrote: > > > > > > Amir Goldstein <amir73il@gmail.com> writes: > > > > > > >> >> I previously mentioned my wish of using it from a user namespace, the > > > >> >> goal seems more challenging with EROFS or any other block devices. I For those who are starting to read here, the context is userns mounting of overlayfs with a lower EROFS layer containing metacopy references to lower data blobs in another fs (a.k.a the composefs model). IMO, mounting a readonly image of whatever on-disk format is a very high risk for userns mount. A privileged mount helper that verifies and mounts the EROFS layer sounds like a more feasible solution. > > > >> >> don't know about the difficulty of getting overlay metacopy working in a > > > >> >> user namespace, even though it would be helpful for other use cases as > > > >> >> well. > > > >> >> > > > >> > > > > >> > There is no restriction of metacopy in user namespace. > > > >> > overlayfs needs to be mounted with -o userxattr and the overlay > > > >> > xattrs needs to use user.overlay. prefix. > > > >> > > > >> if I specify both userxattr and metacopy=on then the mount ends up in > > > >> the following check: > > > >> > > > >> if (config->userxattr) { > > > >> [...] > > > >> if (config->metacopy && metacopy_opt) { > > > >> pr_err("conflicting options: userxattr,metacopy=on\n"); > > > >> return -EINVAL; > > > >> } > > > >> } > > > >> > > > > > > > > Right, my bad. > > > > > > > >> to me it looks like it was done on purpose to prevent metacopy from a > > > >> user namespace, but I don't know the reason for sure. > > > >> > > > > > > > > With hand crafted metacopy, an unpriv user can chmod > > > > any files to anything by layering another file with different > > > > mode on top of it.... > > > > > > I might be missing something obvious about metacopy, so please correct > > > me if I am wrong, but I don't see how it is any different than just > > > copying the file and chowning it. Of course, as long as overlay uses > > > the same security model so that a file that wasn't originally possible > > > to access must be still blocked, even if referenced through metacopy. > > > > > > > You're right. > > The reason for mutual exclusion maybe related to the > > comment in ovl_check_metacopy_xattr() about EACCES. > > Need to check with Vivek or Miklos. > > > > But get this - you do not need metacopy=on to follow lower inode. > > It should work without metacopy=on. > > metacopy=on only instructs overlayfs whether to copy up data > > or only metadata when changing metadata of lower object, so it is > > not relevant for readonly mount. > > > > However, you do need redirect=follow and that one is only mutually > exclusive with userxattr. > Again, need to ask Miklos whether that could be relaxed under > some conditions. 
>

I can see some possible problems with userns mounts and redirects:
- referencing the same directory inode from different paths
- referencing the same inode from different paths with a wrong nlink
  count and inconsistent metadata

However, I think a mode that only follows a redirect from a lower
metacopy file to its data should be safe for a userns mount.

In this special case (a lower metacopy file) we may also be able to
implement the lazy lookup of the data file on open to optimize
'find' performance, but we need to figure out what to do with the
st_blocks reported by stat() in that case.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 87+ messages in thread
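The st_blocks issue mentioned above is that stat() on a metacopy file is expected to report the disk usage of the data file, so even metadata-only consumers such as 'ls -s' or du force the data lookup today. An illustration (the path and numbers are made up):

$ stat -c 'size=%s blocks=%b' /mnt/erofs-overlay/usr/bin/bash
size=1234376 blocks=2416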
* Re: userns mount and metacopy redirects (Was: Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem) 2023-01-26 5:26 ` userns mount and metacopy redirects (Was: Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem) Amir Goldstein @ 2023-01-26 8:22 ` Christian Brauner 0 siblings, 0 replies; 87+ messages in thread From: Christian Brauner @ 2023-01-26 8:22 UTC (permalink / raw) To: Amir Goldstein Cc: Giuseppe Scrivano, Alexander Larsson, Vivek Goyal, Miklos Szeredi, overlayfs On Thu, Jan 26, 2023 at 07:26:49AM +0200, Amir Goldstein wrote: > [spawning overlayfs sub-topic] > > On Wed, Jan 25, 2023 at 10:29 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > On Wed, Jan 25, 2023 at 10:23 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > On Wed, Jan 25, 2023 at 9:45 PM Giuseppe Scrivano <gscrivan@redhat.com> wrote: > > > > > > > > Amir Goldstein <amir73il@gmail.com> writes: > > > > > > > > >> >> I previously mentioned my wish of using it from a user namespace, the > > > > >> >> goal seems more challenging with EROFS or any other block devices. I > > For those who are starting to read here, the context is userns mounting > of overlayfs with a lower EROFS layer containing metacopy references to > lower data blobs in another fs (a.k.a the composefs model). > > IMO, mounting a readonly image of whatever on-disk format > is a very high risk for userns mount. > A privileged mount helper that verifies and mounts the EROFS > layer sounds like a more feasible solution. Very much agreed. This filesystem specific userns mountable stuff where filesystems with any kind of on-disk format guarantees the safety is not something we should support. I'm starting to think about how to make it possible for a privileged process to delegate/allow a filesystem mount to an unprivileged one. The policy belongs in userspace. Something which I've talked about before a few years ago but now I actually have time to work on this. ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-25 20:23 ` Amir Goldstein 2023-01-25 20:29 ` Amir Goldstein @ 2023-01-27 15:57 ` Vivek Goyal 1 sibling, 0 replies; 87+ messages in thread From: Vivek Goyal @ 2023-01-27 15:57 UTC (permalink / raw) To: Amir Goldstein Cc: Giuseppe Scrivano, Dave Chinner, Alexander Larsson, linux-fsdevel, linux-kernel, brauner, viro, Miklos Szeredi On Wed, Jan 25, 2023 at 10:23:08PM +0200, Amir Goldstein wrote: > On Wed, Jan 25, 2023 at 9:45 PM Giuseppe Scrivano <gscrivan@redhat.com> wrote: > > > > Amir Goldstein <amir73il@gmail.com> writes: > > > > >> >> I previously mentioned my wish of using it from a user namespace, the > > >> >> goal seems more challenging with EROFS or any other block devices. I > > >> >> don't know about the difficulty of getting overlay metacopy working in a > > >> >> user namespace, even though it would be helpful for other use cases as > > >> >> well. > > >> >> > > >> > > > >> > There is no restriction of metacopy in user namespace. > > >> > overlayfs needs to be mounted with -o userxattr and the overlay > > >> > xattrs needs to use user.overlay. prefix. > > >> > > >> if I specify both userxattr and metacopy=on then the mount ends up in > > >> the following check: > > >> > > >> if (config->userxattr) { > > >> [...] > > >> if (config->metacopy && metacopy_opt) { > > >> pr_err("conflicting options: userxattr,metacopy=on\n"); > > >> return -EINVAL; > > >> } > > >> } > > >> > > > > > > Right, my bad. > > > > > >> to me it looks like it was done on purpose to prevent metacopy from a > > >> user namespace, but I don't know the reason for sure. > > >> > > > > > > With hand crafted metacopy, an unpriv user can chmod > > > any files to anything by layering another file with different > > > mode on top of it.... > > > > I might be missing something obvious about metacopy, so please correct > > me if I am wrong, but I don't see how it is any different than just > > copying the file and chowning it. Of course, as long as overlay uses > > the same security model so that a file that wasn't originally possible > > to access must be still blocked, even if referenced through metacopy. > > > > You're right. > The reason for mutual exclusion maybe related to the > comment in ovl_check_metacopy_xattr() about EACCES. > Need to check with Vivek or Miklos. > > But get this - you do not need metacopy=on to follow lower inode. > It should work without metacopy=on. > metacopy=on only instructs overlayfs whether to copy up data > or only metadata when changing metadata of lower object, so it is > not relevant for readonly mount. I think you might need metacopy=on even to just follow lower inode. I see following in ovl_lookup(). if ((uppermetacopy || d.metacopy) && !ofs->config.metacopy) { dput(this); err = -EPERM; pr_warn_ratelimited("refusing to follow metacopy origin for (%pd2)\n", dentry); goto out_put; } W.r.t allowing metacopy=on from inside userns, I never paid much attention to this as I never needed it. But this might be interesting to look into it now if it is needed. Thanks Vivek ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-25 12:46 ` Amir Goldstein 2023-01-25 13:10 ` Giuseppe Scrivano @ 2023-01-25 15:24 ` Christian Brauner 2023-01-25 16:05 ` Giuseppe Scrivano 1 sibling, 1 reply; 87+ messages in thread From: Christian Brauner @ 2023-01-25 15:24 UTC (permalink / raw) To: Amir Goldstein Cc: Giuseppe Scrivano, Dave Chinner, Alexander Larsson, linux-fsdevel, linux-kernel, viro, Vivek Goyal, Miklos Szeredi On Wed, Jan 25, 2023 at 02:46:59PM +0200, Amir Goldstein wrote: > > > > > > Based on Alexander's explanation about the differences between overlayfs > > > lookup vs. composefs lookup of a regular "metacopy" file, I just need to > > > point out that the same optimization (lazy lookup of the lower data > > > file on open) > > > can be done in overlayfs as well. > > > (*) currently, overlayfs needs to lookup the lower file also for st_blocks. > > > > > > I am not saying that it should be done or that Miklos will agree to make > > > this change in overlayfs, but that seems to be the major difference. > > > getxattr may have some extra cost depending on in-inode xattr format > > > of erofs, but specifically, the metacopy getxattr can be avoided if this > > > is a special overlayfs RO mount that is marked as EVERYTHING IS > > > METACOPY. > > > > > > I don't expect you guys to now try to hack overlayfs and explore > > > this path to completion. > > > My expectation is that this information will be clearly visible to anyone > > > reviewing future submission, e.g.: > > > > > > - This is the comparison we ran... > > > - This is the reason that composefs gives better results... > > > - It MAY be possible to optimize erofs/overlayfs to get to similar results, > > > but we did not try to do that > > > > > > It is especially important IMO to get the ACK of both Gao and Miklos > > > on your analysis, because remember than when this thread started, > > > you did not know about the metacopy option and your main argument > > > was saving the time it takes to create the overlayfs layer files in the > > > filesystem, because you were missing some technical background on overlayfs. > > > > we knew about metacopy, which we already use in our tools to create > > mapped image copies when idmapped mounts are not available, and also > > knew about the other new features in overlayfs. For example, the > > "volatile" feature which was mentioned in your > > Overlayfs-containers-lpc-2020 talk, was only submitted upstream after > > begging Miklos and Vivek for months. I had a PoC that I used and tested > > locally and asked for their help to get it integrated at the file > > system layer, using seccomp for the same purpose would have been more > > complex and prone to errors when dealing with external bind mounts > > containing persistent data. > > > > The only missing bit, at least from my side, was to consider an image > > that contains only overlay metadata as something we could distribute. > > > > I'm glad that I was able to point this out to you, because now the comparison > between the overlayfs and composefs options is more fair. > > > I previously mentioned my wish of using it from a user namespace, the > > goal seems more challenging with EROFS or any other block devices. I > > don't know about the difficulty of getting overlay metacopy working in a > > user namespace, even though it would be helpful for other use cases as > > well. 
>

If you decide to try and make this work with overlayfs I can try to cut
out time and help with both review and patches. Because I can see this
being beneficial for use-cases we have with systemd as well and actually
being used by us, as we do make heavy use of overlayfs already and will
probably do even more so in the future on top of erofs.

(As a sidenote, in the future, idmapped mounts can be made usable from
userns and there's a todo item and ideas for this on
https://uapi-group.org/kernel-features. Additionally, I want users to
have the ability to use them without any userns in the mix at all. Not
just because there are legitimate users that don't need to allocate a
userns at all, but also because then we can do stuff like map down a
range of ids to a single id (what nfs would probably call "squashing")
and other stuff.)

Christian

^ permalink raw reply	[flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-25 15:24 ` Christian Brauner @ 2023-01-25 16:05 ` Giuseppe Scrivano 0 siblings, 0 replies; 87+ messages in thread From: Giuseppe Scrivano @ 2023-01-25 16:05 UTC (permalink / raw) To: Christian Brauner Cc: Amir Goldstein, Dave Chinner, Alexander Larsson, linux-fsdevel, linux-kernel, viro, Vivek Goyal, Miklos Szeredi Christian Brauner <brauner@kernel.org> writes: > On Wed, Jan 25, 2023 at 02:46:59PM +0200, Amir Goldstein wrote: >> > > >> > > Based on Alexander's explanation about the differences between overlayfs >> > > lookup vs. composefs lookup of a regular "metacopy" file, I just need to >> > > point out that the same optimization (lazy lookup of the lower data >> > > file on open) >> > > can be done in overlayfs as well. >> > > (*) currently, overlayfs needs to lookup the lower file also for st_blocks. >> > > >> > > I am not saying that it should be done or that Miklos will agree to make >> > > this change in overlayfs, but that seems to be the major difference. >> > > getxattr may have some extra cost depending on in-inode xattr format >> > > of erofs, but specifically, the metacopy getxattr can be avoided if this >> > > is a special overlayfs RO mount that is marked as EVERYTHING IS >> > > METACOPY. >> > > >> > > I don't expect you guys to now try to hack overlayfs and explore >> > > this path to completion. >> > > My expectation is that this information will be clearly visible to anyone >> > > reviewing future submission, e.g.: >> > > >> > > - This is the comparison we ran... >> > > - This is the reason that composefs gives better results... >> > > - It MAY be possible to optimize erofs/overlayfs to get to similar results, >> > > but we did not try to do that >> > > >> > > It is especially important IMO to get the ACK of both Gao and Miklos >> > > on your analysis, because remember than when this thread started, >> > > you did not know about the metacopy option and your main argument >> > > was saving the time it takes to create the overlayfs layer files in the >> > > filesystem, because you were missing some technical background on overlayfs. >> > >> > we knew about metacopy, which we already use in our tools to create >> > mapped image copies when idmapped mounts are not available, and also >> > knew about the other new features in overlayfs. For example, the >> > "volatile" feature which was mentioned in your >> > Overlayfs-containers-lpc-2020 talk, was only submitted upstream after >> > begging Miklos and Vivek for months. I had a PoC that I used and tested >> > locally and asked for their help to get it integrated at the file >> > system layer, using seccomp for the same purpose would have been more >> > complex and prone to errors when dealing with external bind mounts >> > containing persistent data. >> > >> > The only missing bit, at least from my side, was to consider an image >> > that contains only overlay metadata as something we could distribute. >> > >> >> I'm glad that I was able to point this out to you, because now the comparison >> between the overlayfs and composefs options is more fair. >> >> > I previously mentioned my wish of using it from a user namespace, the >> > goal seems more challenging with EROFS or any other block devices. I >> > don't know about the difficulty of getting overlay metacopy working in a >> > user namespace, even though it would be helpful for other use cases as >> > well. 
>> >

> If you decide to try and make this work with overlayfs I can try to cut
> out time and help with both review and patches. Because I can see this
> being beneficial for use-cases we have with systemd as well and actually
> being used by us, as we do make heavy use of overlayfs already and will
> probably do even more so in the future on top of erofs.
>
> (As a sidenote, in the future, idmapped mounts can be made usable from
> userns and there's a todo item and ideas for this on
> https://uapi-group.org/kernel-features.

we won't need metacopy to clone images once idmapped mounts work in a
userns, but I think it is still good to have, if possible.  Not only for
the interesting combination Amir suggested, but also for speeding up a
bunch of other operations that currently end up in a complete copy-up.

> Additionally, I want users to have the ability to use them without any
> userns in the mix at all. Not just because there are legitimate users
> that don't need to allocate a userns at all, but also because then we can
> do stuff like map down a range of ids to a single id (what nfs would
> probably call "squashing") and other stuff.)

^ permalink raw reply	[flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-24 19:06 ` Amir Goldstein 2023-01-25 4:18 ` Dave Chinner @ 2023-01-25 9:37 ` Alexander Larsson 2023-01-25 10:05 ` Gao Xiang 1 sibling, 1 reply; 87+ messages in thread From: Alexander Larsson @ 2023-01-25 9:37 UTC (permalink / raw) To: Amir Goldstein Cc: linux-fsdevel, linux-kernel, gscrivan, david, brauner, viro, Vivek Goyal, Miklos Szeredi On Tue, 2023-01-24 at 21:06 +0200, Amir Goldstein wrote: > On Tue, Jan 24, 2023 at 3:13 PM Alexander Larsson <alexl@redhat.com> > wrote: > > > > For the uncached case, composefs is still almost four times faster > > than > > the fastest overlay combo (squashfs), and the non-squashfs versions > > are > > strictly slower. For the cached case the difference is less (10%) > > but > > with similar order of performance. > > > > For size comparison, here are the resulting images: > > > > 8.6M large.composefs > > 2.5G large.erofs > > 200M large.ext4 > > 2.6M large.squashfs > > > > Nice. > Clearly, mkfs.ext4 and mkfs.erofs are not optimized for space. For different reasons. Ext4 is meant to be writable post creation, so it makes different choices wrt on-disk layout. Erofs is due to the lack of sparse files, so when it copied the sparse files into it they were made huge files full of zeros. > Note that Android has make_ext4fs which can create a compact > ro ext4 image without a journal. > Found this project that builds it outside of Android, but did not > test: > https://github.com/iglunix/make_ext4fs It doesn't seem to support either whiteout files or sparse files. > > > > # hyperfine -i -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR > > > > ovl- > > > > mount" > > > > Benchmark 1: ls -lR ovl-mount > > > > Time (mean ± σ): 2.738 s ± 0.029 s [User: 0.176 s, > > > > System: 1.688 s] > > > > Range (min … max): 2.699 s … 2.787 s 10 runs > > > > > > > > With page cache between runs the difference is smaller, but > > > > still > > > > there: > > > > > > It is the dentry cache that mostly matters for this test and > > > please > > > use hyerfine -w 1 to warmup dentry cache for correct measurement > > > of warm cache lookup. > > > > I'm not sure why the dentry cache case would be more important? > > Starting a new container will very often not have cached the image. > > > > To me the interesting case is for a new image, but with some > > existing > > page cache for the backing files directory. That seems to model > > staring > > a new image in an active container host, but its somewhat hard to > > test > > that case. > > > > ok, you can argue that faster cold cache ls -lR is important > for starting new images. > I think you will be asked to show a real life container use case > where > that benchmark really matters. > > My current work is in automotive, which wants to move to a containerized workload in the car. The primary KPI is cold boot performance, because there are legal requirements for the entire system to boot in 2 seconds. It is also quite typical to have shortlived containers in cloud workloads, and startup time there is very important. In fact, the last few months I've been primarily spending on optimizing container startup performance (as can be seen in the massive improvements to this in the upcoming podman 4.4). I'm obviously not saying that containers will actually recursively list the container contents on start. However they will do all sorts of cold cache metadata operation to resolve library dependencies, find config files, etc. 
Just strace any typical userspace app and see for yourself. A ls -lR is a simplified version of this kind of workload. > > > I guess these test runs started with warm cache? but it wasn't > > > mentioned explicitly. > > > > Yes, they were warm (because I ran the previous test before it). > > But, > > the new profile script explicitly adds -w 1. > > > > > > # hyperfine "ls -lR cfs-mnt" > > > > Benchmark 1: ls -lR cfs-mnt > > > > Time (mean ± σ): 390.1 ms ± 3.7 ms [User: 140.9 ms, > > > > System: 247.1 ms] > > > > Range (min … max): 381.5 ms … 393.9 ms 10 runs > > > > > > > > vs > > > > > > > > # hyperfine -i "ls -lR ovl-mount" > > > > Benchmark 1: ls -lR ovl-mount > > > > Time (mean ± σ): 431.5 ms ± 1.2 ms [User: 124.3 ms, > > > > System: 296.9 ms] > > > > Range (min … max): 429.4 ms … 433.3 ms 10 runs > > > > > > > > This isn't all that strange, as overlayfs does a lot more work > > > > for > > > > each lookup, including multiple name lookups as well as several > > > > xattr > > > > lookups, whereas composefs just does a single lookup in a pre- > > > > computed > > > > > > Seriously, "multiple name lookups"? > > > Overlayfs does exactly one lookup for anything but first level > > > subdirs > > > and for sparse files it does the exact same lookup in /objects as > > > composefs. > > > Enough with the hand waving please. Stick to hard facts. > > > > With the discussed layout, in a stat() call on a regular file, > > ovl_lookup() will do lookups on both the sparse file and the > > backing > > file, whereas cfs_dir_lookup() will just map some page cache pages > > and > > do a binary search. > > > > Of course if you actually open the file, then cfs_open_file() would > > do > > the equivalent lookups in /objects. But that is often not what > > happens, > > for example in "ls -l". > > > > Additionally, these extra lookups will cause extra memory use, as > > you > > need dentries and inodes for the erofs/squashfs inodes in addition > > to > > the overlay inodes. > > > > I see. composefs is really very optimized for ls -lR. > Now only need to figure out if real users start a container and do ls > -lR > without reading many files is a real life use case. A read-only filesystem does basically two things: metadata lookups and file content loading. Composefs hands off the content loading to the backing filesystem, so obviously then the design will focus on the remaining part. So, yes, this means optimizing for "ls -lR". > > > > table. But, given that we don't need any of the other features > > > > of > > > > overlayfs here, this performance loss seems rather unnecessary. > > > > > > > > I understand that there is a cost to adding more code, but > > > > efficiently > > > > supporting containers and other forms of read-only images is a > > > > pretty > > > > important usecase for Linux these days, and having something > > > > tailored > > > > for that seems pretty useful to me, even considering the code > > > > duplication. > > > > > > > > > > > > > > > > I also understand Cristians worry about stacking filesystem, > > > > having > > > > looked a bit more at the overlayfs code. But, since composefs > > > > doesn't > > > > really expose the metadata or vfs structure of the lower > > > > directories it > > > > is much simpler in a fundamental way. > > > > > > > > > > I agree that composefs is simpler than overlayfs and that its > > > security > > > model is simpler, but this is not the relevant question. 
> > > The question is what are the benefits to the prospect users of > > > composefs > > > that justify this new filesystem driver if overlayfs already > > > implements > > > the needed functionality. > > > > > > The only valid technical argument I could gather from your email > > > is - > > > 10% performance improvement in warm cache ls -lR on a 2.6 GB > > > centos9 rootfs image compared to overlayfs+squashfs. > > > > > > I am not counting the cold cache results until we see results of > > > a modern ro-image fs. > > > > They are all strictly worse than squashfs in the above testing. > > > > It's interesting to know why and if an optimized mkfs.erofs > mkfs.ext4 would have done any improvement.

Even the non-loopback mounted (direct xfs backed) version performed worse than the squashfs one. I'm sure an erofs with sparse files would do better due to a more compact file, but I don't really see how it would perform significantly differently from the squashfs code. Yes, squashfs lookup is linear in directory length, while erofs is log(n), but the directories are not so huge that this would dominate the runtime.

To get an estimate of this I made a broken version of the erofs image, where the metacopy files are actually 0 bytes in size rather than sparse. This made the erofs file 18M instead, and gained 10% in the cold cache case. This, while good, is not nearly enough to matter compared to the others.

I don't think the base performance here is really much dependent on the backing filesystem. An ls -lR workload is just a measurement of the actual (i.e. non-dcache) performance of the filesystem implementation of lookup and iterate, and overlayfs just has more work to do here, especially in terms of the amount of i/o needed.

> > > Considering that most real life workloads include reading the > > > data > > > and that most of the time inodes and dentries are cached, IMO, > > > the 10% ls -lR improvement is not a good enough reason > > > for a new "laser focused" filesystem driver. > > > > > > Correct me if I am wrong, but isn't the use case of ephemeral > > > containers require that composefs is layered under a writable > > > tmpfs > > > using overlayfs? > > > > > > If that is the case then the warm cache comparison is incorrect > > > as well. To argue for the new filesystem you will need to compare > > > ls -lR of overlay{tmpfs,composefs,xfs} vs. > > > overlay{tmpfs,erofs,xfs} > > > > That very much depends. For the ostree rootfs usecase there would be > > no > > writable layer, and for containers I'm personally primarily > > interested > > in "--readonly" containers (i.e. without a writable layer) in my > > current automobile/embedded work. For many container cases however, > > that is true, and no doubt that would make the overhead of > > overlayfs > > less of an issue. > > > > > Alexander, > > > > > > On a more personal note, I know this discussion has been a bit > > > stormy, but I am not trying to fight you. > > > > I'm overall not getting a warm fuzzy feeling from this discussion. > > Getting weird complaints that I'm somehow "stealing" functions or > > weird > > "who did $foo first" arguments for instance. You haven't personally > > attacked me like that, but some of your comments can feel rather > > pointy, especially in the context of a stormy thread like this. I'm > > just not used to kernel development workflows, so have patience > > with me > > if I do things wrong. > > > > Fair enough.
> As long as the things that we discussed are duly > mentioned in future posts, I'll do my best to be less pointy. Thanks! > > > I think that {mk,}composefs is a wonderful thing that will > > > improve > > > the life of many users. > > > But mount -t composefs vs. mount -t overlayfs is insignificant > > > to those users, so we just need to figure out based on facts > > > and numbers, which is the best technical alternative. > > > > In reality things are never as easy as one thing strictly being > > technically best. There is always a multitude of considerations. Is > > composefs technically better if it uses less memory and performs > > better > > for a particular usecase? Or is overlayfs technically better > > because it > > is useful for more usecases and already exists? A judgement needs > > to be > > made depending on things like complexity/maintainability of the new > > fs, > > ease of use, measured performance differences, relative importance > > of > > particular performance measurements, and importance of the specific > > usecase. > > > > It is my belief that the advantages of composefs outweight the cost > > of > > the code duplication, but I understand the point of view of a > > maintainer of an existing codebase and that saying "no" is often > > the > > right thing. I will continue to try to argue for my point of view, > > but > > will try to make it as factual as possible. > > > > Improving overlayfs and erofs has additional advantages - > improving performance and size of erofs image may benefit > many other users regardless of the ephemeral containers > use case, so indeed, there are many aspects to consider. Yes. -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- =-=-= Alexander Larsson Red Hat, Inc alexl@redhat.com alexander.larsson@gmail.com He's a hate-fuelled umbrella-wielding card sharp fleeing from a secret government programme. She's a violent antique-collecting lawyer in the wrong place at the wrong time. They fight crime! ^ permalink raw reply [flat|nested] 87+ messages in thread
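A note on the overlay{tmpfs,composefs} stacking discussed in the exchange above: the following is a rough sketch of how such an ephemeral container setup could look, with composefs as the read-only, verified lower layer and a throw-away tmpfs as the writable upper layer. All paths and the image name are made-up examples; only the composefs basedir option and the standard overlayfs lowerdir/upperdir/workdir options follow interfaces discussed in this thread.

    #!/bin/bash
    # Sketch: ephemeral container rootfs = composefs (read-only, verified)
    # below a tmpfs (writable, discarded on exit), combined with overlayfs.
    # rootfs.img, /objects and /run/ctr/* are hypothetical example paths.

    mkdir -p /run/ctr/lower /run/ctr/rw /run/ctr/merged
    mount -t composefs rootfs.img -o basedir=/objects /run/ctr/lower

    mount -t tmpfs tmpfs /run/ctr/rw
    mkdir -p /run/ctr/rw/upper /run/ctr/rw/work

    mount -t overlay overlay \
        -o lowerdir=/run/ctr/lower,upperdir=/run/ctr/rw/upper,workdir=/run/ctr/rw/work \
        /run/ctr/merged

Unmounting the overlay and the tmpfs discards all writes, while the composefs image and the shared object store stay untouched.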
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-25 9:37 ` Alexander Larsson @ 2023-01-25 10:05 ` Gao Xiang 2023-01-25 10:15 ` Alexander Larsson 0 siblings, 1 reply; 87+ messages in thread From: Gao Xiang @ 2023-01-25 10:05 UTC (permalink / raw) To: Alexander Larsson, Amir Goldstein Cc: linux-fsdevel, linux-kernel, gscrivan, david, brauner, viro, Vivek Goyal, Miklos Szeredi

On 2023/1/25 17:37, Alexander Larsson wrote: > On Tue, 2023-01-24 at 21:06 +0200, Amir Goldstein wrote: >> On Tue, Jan 24, 2023 at 3:13 PM Alexander Larsson <alexl@redhat.com> ... >>> >>> They are all strictly worse than squashfs in the above testing. >>> >> >> It's interesting to know why and if an optimized mkfs.erofs >> mkfs.ext4 would have done any improvement. > > Even the non-loopback mounted (direct xfs backed) version performed > worse than the squashfs one. I'm sure a erofs with sparse files would > do better due to a more compact file, but I don't really see how it > would perform significantly different than the squashfs code. Yes, > squashfs lookup is linear in directory length, while erofs is log(n), > but the directories are not so huge that this would dominate the > runtime. > > To get an estimate of this I made a broken version of the erofs image, > where the metacopy files are actually 0 byte size rather than sparse. > This made the erofs file 18M instead, and gained 10% in the cold cache > case. This, while good, is not near enough to matter compared to the > others. > > I don't think the base performance here is really much dependent on the > backing filesystem. An ls -lR workload is just a measurement of the > actual (i.e. non-dcache) performance of the filesystem implementation > of lookup and iterate, and overlayfs just has more work to do here, > especially in terms of the amount of i/o needed.

I will put together a formal mkfs.erofs version in one or two days since we're celebrating Lunar New Year now.

Since you don't have more I/O traces for analysis, I have to make another wild guess.

Could you help benchmark your v2 too? I'm not sure if such performance also exists in v2. The reason I guess this is that it seems that you read all dir inode pages when doing the first lookup, which can benefit sequential dir access.

I'm not sure if EROFS can reach a similar number by forcing readahead on dirs to read all dir data at once as well.

Apart from that I don't see a significant difference; at least personally I'd like to know where such a huge difference could come from. I don't think it is all because of the read-only on-disk format difference.

Thanks, Gao Xiang ^ permalink raw reply [flat|nested] 87+ messages in thread
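For anyone wanting to supply the I/O traces asked for above, one possible way to capture them for the cold-cache ls -lR run is sketched below. The loop device and mount point are assumptions and would have to match the image actually under test; the blktrace/blkparse invocation is standard tooling, not something taken from this thread.

    #!/bin/bash
    # Sketch: record block-level I/O of the cold-cache ls -lR workload.
    # /dev/loop0 is assumed to be the loop device backing the image mounted
    # at /mnt/img; adjust both to the actual setup.

    echo 3 > /proc/sys/vm/drop_caches

    # Stream trace events from the loop device into a text log.
    blktrace -d /dev/loop0 -o - | blkparse -i - > lsr-io-trace.txt &
    trace_pid=$!

    ls -lR /mnt/img > /dev/null

    kill "$trace_pid"
    wait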
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-25 10:05 ` Gao Xiang @ 2023-01-25 10:15 ` Alexander Larsson 2023-01-27 10:24 ` Gao Xiang 0 siblings, 1 reply; 87+ messages in thread From: Alexander Larsson @ 2023-01-25 10:15 UTC (permalink / raw) To: Gao Xiang, Amir Goldstein Cc: linux-fsdevel, linux-kernel, gscrivan, david, brauner, viro, Vivek Goyal, Miklos Szeredi On Wed, 2023-01-25 at 18:05 +0800, Gao Xiang wrote: > > > On 2023/1/25 17:37, Alexander Larsson wrote: > > On Tue, 2023-01-24 at 21:06 +0200, Amir Goldstein wrote: > > > On Tue, Jan 24, 2023 at 3:13 PM Alexander Larsson > > > <alexl@redhat.com> > > ... > > > > > > > > > They are all strictly worse than squashfs in the above testing. > > > > > > > > > > It's interesting to know why and if an optimized mkfs.erofs > > > mkfs.ext4 would have done any improvement. > > > > Even the non-loopback mounted (direct xfs backed) version performed > > worse than the squashfs one. I'm sure a erofs with sparse files > > would > > do better due to a more compact file, but I don't really see how it > > would perform significantly different than the squashfs code. Yes, > > squashfs lookup is linear in directory length, while erofs is > > log(n), > > but the directories are not so huge that this would dominate the > > runtime. > > > > To get an estimate of this I made a broken version of the erofs > > image, > > where the metacopy files are actually 0 byte size rather than > > sparse. > > This made the erofs file 18M instead, and gained 10% in the cold > > cache > > case. This, while good, is not near enough to matter compared to > > the > > others. > > > > I don't think the base performance here is really much dependent on > > the > > backing filesystem. An ls -lR workload is just a measurement of the > > actual (i.e. non-dcache) performance of the filesystem > > implementation > > of lookup and iterate, and overlayfs just has more work to do here, > > especially in terms of the amount of i/o needed. > > I will form a formal mkfs.erofs version in one or two days since > we're > cerebrating Lunar New year now. > > Since you don't have more I/O traces for analysis, I have to do > another > wild guess. > > Could you help benchmark your v2 too? I'm not sure if such > performance also exists in v2. The reason why I guess as this is > that it seems that you read all dir inode pages when doing the first > lookup, it can benefit to seq dir access. > > I'm not sure if EROFS can make a similar number by doing forcing > readahead on dirs to read all dir data at once as well. > > Apart from that I don't see significant difference, at least > personally > I'd like to know where it could have such huge difference. I don't > think that is all because of read-only on-disk format differnce. I think the performance difference between v2 and v3 would be rather minor in this case, because I don't think a lot of the directories are large enough to be split in chunks. I also don't believe erofs and composefs should fundamentally differ much in performance here, given that both use a compact binary searchable layout for dirents. However, the full comparison is "composefs" vs "overlayfs + erofs", and in that case composefs wins. -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- =-=-= Alexander Larsson Red Hat, Inc alexl@redhat.com alexander.larsson@gmail.com He's an obese Catholic messiah who knows the secret of the alien invasion. She's a provocative Bolivian single mother living on borrowed time. 
They fight crime! ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-25 10:15 ` Alexander Larsson @ 2023-01-27 10:24 ` Gao Xiang 2023-02-01 4:28 ` Jingbo Xu 0 siblings, 1 reply; 87+ messages in thread From: Gao Xiang @ 2023-01-27 10:24 UTC (permalink / raw) To: Alexander Larsson, Amir Goldstein Cc: linux-fsdevel, linux-kernel, gscrivan, david, brauner, viro, Vivek Goyal, Miklos Szeredi On 2023/1/25 18:15, Alexander Larsson wrote: > On Wed, 2023-01-25 at 18:05 +0800, Gao Xiang wrote: >> >> >> On 2023/1/25 17:37, Alexander Larsson wrote: >>> On Tue, 2023-01-24 at 21:06 +0200, Amir Goldstein wrote: >>>> On Tue, Jan 24, 2023 at 3:13 PM Alexander Larsson >>>> <alexl@redhat.com> >> >> ... >> >>>>> >>>>> They are all strictly worse than squashfs in the above testing. >>>>> >>>> >>>> It's interesting to know why and if an optimized mkfs.erofs >>>> mkfs.ext4 would have done any improvement. >>> >>> Even the non-loopback mounted (direct xfs backed) version performed >>> worse than the squashfs one. I'm sure a erofs with sparse files >>> would >>> do better due to a more compact file, but I don't really see how it >>> would perform significantly different than the squashfs code. Yes, >>> squashfs lookup is linear in directory length, while erofs is >>> log(n), >>> but the directories are not so huge that this would dominate the >>> runtime. >>> >>> To get an estimate of this I made a broken version of the erofs >>> image, >>> where the metacopy files are actually 0 byte size rather than >>> sparse. >>> This made the erofs file 18M instead, and gained 10% in the cold >>> cache >>> case. This, while good, is not near enough to matter compared to >>> the >>> others. >>> >>> I don't think the base performance here is really much dependent on >>> the >>> backing filesystem. An ls -lR workload is just a measurement of the >>> actual (i.e. non-dcache) performance of the filesystem >>> implementation >>> of lookup and iterate, and overlayfs just has more work to do here, >>> especially in terms of the amount of i/o needed. >> >> I will form a formal mkfs.erofs version in one or two days since >> we're >> cerebrating Lunar New year now. I've made a version and did some test, it can be fetched from: git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git -b experimental this feature can be used with -Ededupe or --chunksize=# (assuming that all sparse files are holed, so that each file will only has one chunk.) >> >> Since you don't have more I/O traces for analysis, I have to do >> another >> wild guess. >> >> Could you help benchmark your v2 too? I'm not sure if such >> performance also exists in v2. The reason why I guess as this is >> that it seems that you read all dir inode pages when doing the first >> lookup, it can benefit to seq dir access. >> >> I'm not sure if EROFS can make a similar number by doing forcing >> readahead on dirs to read all dir data at once as well. >> >> Apart from that I don't see significant difference, at least >> personally >> I'd like to know where it could have such huge difference. I don't >> think that is all because of read-only on-disk format differnce. > > I think the performance difference between v2 and v3 would be rather > minor in this case, because I don't think a lot of the directories are > large enough to be split in chunks. I also don't believe erofs and > composefs should fundamentally differ much in performance here, given > that both use a compact binary searchable layout for dirents. 
> However, > the full comparison is "composefs" vs "overlayfs + erofs", and in that > case composefs wins.

I'm still on vacation. I will play with composefs personally to get more insights when I'm back, but it would be much better if some datasets for this could be provided as well (assuming the dataset can be shared publicly.)

Thanks, Gao Xiang > ^ permalink raw reply [flat|nested] 87+ messages in thread
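To make the experimental tooling above concrete, here is a sketch of building that erofs-utils branch and producing sparse-file-friendly images. The branch URL and the -Ededupe/--chunksize options come from the message above; the source directory name, output file names and the 4096-byte chunk size are illustrative assumptions.

    #!/bin/bash
    # Sketch: build the experimental erofs-utils and create test images from a
    # hypothetical rootfs-metacopy/ tree (prepared with overlay xattrs and
    # sparse files, as in the earlier benchmarks in this thread).

    git clone -b experimental \
        https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git
    cd erofs-utils
    ./autogen.sh && ./configure && make

    # Deduplicate identical chunks across files ...
    ./mkfs/mkfs.erofs -Ededupe large-dedupe.erofs ../rootfs-metacopy/
    # ... or give each (hole-only) file a single chunk of the given size.
    ./mkfs/mkfs.erofs --chunksize=4096 large-chunk.erofs ../rootfs-metacopy/

    # Loopback mount for benchmarking.
    mount -t erofs -o loop large-chunk.erofs /mnt/erofs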
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-01-27 10:24 ` Gao Xiang @ 2023-02-01 4:28 ` Jingbo Xu 2023-02-01 7:44 ` Amir Goldstein 2023-02-01 9:46 ` Alexander Larsson 0 siblings, 2 replies; 87+ messages in thread From: Jingbo Xu @ 2023-02-01 4:28 UTC (permalink / raw) To: Gao Xiang, Alexander Larsson, Amir Goldstein, gscrivan, brauner Cc: linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Miklos Szeredi Hi all, There are some updated performance statistics with different combinations on my test environment if you are interested. On 1/27/23 6:24 PM, Gao Xiang wrote: > ... > > I've made a version and did some test, it can be fetched from: > git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git -b > experimental > Setup ====== CPU: x86_64 Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz Disk: 6800 IOPS upper limit OS: Linux v6.2 (with composefs v3 patchset) I build erofs/squashfs images following the scripts attached on [1], with each file in the rootfs tagged with "metacopy" and "redirect" xattr. The source rootfs is from the docker image of tensorflow [2]. The erofs images are built with mkfs.erofs with support for sparse file added [3]. [1] https://lore.kernel.org/linux-fsdevel/5fb32a1297821040edd8c19ce796fc0540101653.camel@redhat.com/ [2] https://hub.docker.com/layers/tensorflow/tensorflow/2.10.0/images/sha256-7f9f23ce2473eb52d17fe1b465c79c3a3604047343e23acc036296f512071bc9?context=explore [3] https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/commit/?h=experimental&id=7c49e8b195ad90f6ca9dfccce9f6e3e39a8676f6 Image size =========== 6.4M large.composefs 5.7M large.composefs.w/o.digest (w/o --compute-digest) 6.2M large.erofs 5.2M large.erofs.T0 (with -T0, i.e. w/o nanosecond timestamp) 1.7M large.squashfs 5.8M large.squashfs.uncompressed (with -noI -noD -noF -noX) (large.erofs.T0 is built without nanosecond timestamp, so that we get smaller disk inode size (same with squashfs).) Runtime Perf ============= The "uncached" column is tested with: hyperfine -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR $MNTPOINT" While the "cached" column is tested with: hyperfine -w 1 "ls -lR $MNTPOINT" erofs and squashfs are mounted with loopback device. 
| uncached(ms)| cached(ms) ----------------------------------|-------------|----------- composefs (with digest) | 326 | 135 erofs (w/o -T0) | 264 | 172 erofs (w/o -T0) + overlayfs | 651 | 238 squashfs (compressed) | 538 | 211 squashfs (compressed) + overlayfs | 968 | 302 squashfs (uncompressed) | 406 | 172 squashfs (uncompressed)+overlayfs | 833 | 264 Following on are the detailed test statistics: composefs(with digest) - uncached Benchmark 1: ls -lR /mnt/cps Time (mean ± σ): 326.0 ms ± 6.1 ms [User: 64.1 ms, System: 126.0 ms] Range (min … max): 316.3 ms … 334.5 ms 10 runs composefs(with digest) - cached Benchmark 1: ls -lR /mnt/cps Time (mean ± σ): 135.5 ms ± 4.1 ms [User: 59.9 ms, System: 74.8 ms] Range (min … max): 129.5 ms … 144.8 ms 21 runs loopback erofs(w/o -T0) - uncached Benchmark 1: ls -lR /mnt/bootstrap Time (mean ± σ): 264.1 ms ± 2.1 ms [User: 66.7 ms, System: 166.2 ms] Range (min … max): 261.0 ms … 267.5 ms 10 runs loopback erofs(w/o -T0) - cached Benchmark 1: ls -lR /mnt/bootstrap Time (mean ± σ): 172.3 ms ± 3.9 ms [User: 59.3 ms, System: 112.2 ms] Range (min … max): 166.5 ms … 180.8 ms 17 runs overlayfs + loopback erofs(w/o -T0) - uncached Benchmark 1: ls -lR /mnt/ovl/mntdir Time (mean ± σ): 651.8 ms ± 8.8 ms [User: 74.2 ms, System: 391.1 ms] Range (min … max): 632.6 ms … 665.8 ms 10 runs overlayfs + loopback erofs(w/o -T0) - cached Benchmark 1: ls -lR /mnt/ovl/mntdir Time (mean ± σ): 238.1 ms ± 7.7 ms [User: 63.4 ms, System: 173.4 ms] Range (min … max): 226.7 ms … 251.2 ms 12 runs loopback squashfs (compressed) - uncached Benchmark 1: ls -lR /mnt/squashfs-compressed/bootstrap Time (mean ± σ): 538.4 ms ± 2.4 ms [User: 67.8 ms, System: 410.3 ms] Range (min … max): 535.6 ms … 543.6 ms 10 runs loopback squashfs (compressed) - cached Benchmark 1: ls -lR /mnt/squashfs-compressed/bootstrap Time (mean ± σ): 211.3 ms ± 2.9 ms [User: 61.2 ms, System: 141.3 ms] Range (min … max): 206.5 ms … 216.1 ms 13 runs overlayfs + loopback squashfs (compressed) - uncached Benchmark 1: ls -lR /mnt/squashfs-compressed/mntdir Time (mean ± σ): 968.0 ms ± 7.1 ms [User: 78.4 ms, System: 675.7 ms] Range (min … max): 956.4 ms … 977.2 ms 10 runs overlayfs + loopback squashfs (compressed) - cached Benchmark 1: ls -lR /mnt/squashfs-compressed/mntdir Time (mean ± σ): 302.6 ms ± 6.7 ms [User: 67.3 ms, System: 225.6 ms] Range (min … max): 292.4 ms … 312.3 ms 10 runs loopback squashfs (uncompressed) - uncached Benchmark 1: ls -lR /mnt/squashfs-uncompressed/bootstrap Time (mean ± σ): 406.6 ms ± 3.9 ms [User: 69.2 ms, System: 273.3 ms] Range (min … max): 400.3 ms … 414.2 ms 10 runs loopback squashfs (uncompressed) - cached Benchmark 1: ls -lR /mnt/squashfs-uncompressed/bootstrap Time (mean ± σ): 172.8 ms ± 3.2 ms [User: 61.9 ms, System: 101.6 ms] Range (min … max): 168.6 ms … 178.9 ms 16 runs overlayfs + loopback squashfs (uncompressed) - uncached Benchmark 1: ls -lR /mnt/squashfs-uncompressed/mntdir Time (mean ± σ): 833.4 ms ± 8.0 ms [User: 74.1 ms, System: 539.7 ms] Range (min … max): 820.7 ms … 844.3 ms 10 runs overlayfs + loopback squashfs (uncompressed) - cached Benchmark 1: ls -lR /mnt/squashfs-uncompressed/mntdir Time (mean ± σ): 264.4 ms ± 7.2 ms [User: 68.2 ms, System: 186.2 ms] Range (min … max): 256.5 ms … 277.1 ms 10 runs -- Thanks, Jingbo ^ permalink raw reply [flat|nested] 87+ messages in thread
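For readers who have not opened the script referenced in [1], the "+ overlayfs" rows above roughly correspond to a setup like the sketch below: each regular file in the read-only image carries overlay metacopy/redirect xattrs pointing into a content-addressed object store, and overlayfs is mounted read-only on top with metacopy enabled. Everything here is an illustrative approximation (the paths, the use of fsverity's --compact digest output as the object name, the exact overlay options); the script in [1] is the authoritative version.

    #!/bin/bash
    # Sketch: tag every regular file in a staging tree so that overlayfs
    # redirects its data into a content-addressed object store, then make the
    # file itself sparse.  staging/ and /basedir/objects/... are hypothetical.

    cd staging || exit 1
    find . -type f | while read -r f; do
        digest=$(fsverity digest --compact "$f")       # object named by fs-verity digest
        size=$(stat -c %s "$f")
        setfattr -n trusted.overlay.metacopy "$f"      # zero-length metacopy marker
        setfattr -n trusted.overlay.redirect \
                 -v "/objects/${digest:0:2}/${digest:2}" "$f"
        truncate -s 0 "$f" && truncate -s "$size" "$f" # keep size, drop data blocks
    done

    # After running mkfs.erofs/mksquashfs on staging/ and loopback-mounting the
    # result at /mnt/img, stack a read-only overlay so redirects resolve under
    # /basedir (which is assumed to contain the objects/ store):
    mount -t overlay overlay \
        -o metacopy=on,redirect_dir=follow,lowerdir=/mnt/img:/basedir \
        /mnt/ovl/mntdir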
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-01 4:28 ` Jingbo Xu @ 2023-02-01 7:44 ` Amir Goldstein 2023-02-01 8:59 ` Jingbo Xu 2023-02-01 9:46 ` Alexander Larsson 1 sibling, 1 reply; 87+ messages in thread From: Amir Goldstein @ 2023-02-01 7:44 UTC (permalink / raw) To: Jingbo Xu Cc: Gao Xiang, Alexander Larsson, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Miklos Szeredi On Wed, Feb 1, 2023 at 6:28 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote: > > Hi all, > > There are some updated performance statistics with different > combinations on my test environment if you are interested. > Cool report! > > On 1/27/23 6:24 PM, Gao Xiang wrote: > > ... > > > > I've made a version and did some test, it can be fetched from: > > git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git -b > > experimental > > > > Setup > ====== > CPU: x86_64 Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz > Disk: 6800 IOPS upper limit > OS: Linux v6.2 (with composefs v3 patchset) > > I build erofs/squashfs images following the scripts attached on [1], > with each file in the rootfs tagged with "metacopy" and "redirect" xattr. > > The source rootfs is from the docker image of tensorflow [2]. > > The erofs images are built with mkfs.erofs with support for sparse file > added [3]. > > [1] > https://lore.kernel.org/linux-fsdevel/5fb32a1297821040edd8c19ce796fc0540101653.camel@redhat.com/ > [2] > https://hub.docker.com/layers/tensorflow/tensorflow/2.10.0/images/sha256-7f9f23ce2473eb52d17fe1b465c79c3a3604047343e23acc036296f512071bc9?context=explore > [3] > https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/commit/?h=experimental&id=7c49e8b195ad90f6ca9dfccce9f6e3e39a8676f6 > > > > Image size > =========== > 6.4M large.composefs > 5.7M large.composefs.w/o.digest (w/o --compute-digest) > 6.2M large.erofs > 5.2M large.erofs.T0 (with -T0, i.e. w/o nanosecond timestamp) > 1.7M large.squashfs > 5.8M large.squashfs.uncompressed (with -noI -noD -noF -noX) > > (large.erofs.T0 is built without nanosecond timestamp, so that we get > smaller disk inode size (same with squashfs).) > > > Runtime Perf > ============= > > The "uncached" column is tested with: > hyperfine -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR $MNTPOINT" > > > While the "cached" column is tested with: > hyperfine -w 1 "ls -lR $MNTPOINT" > > > erofs and squashfs are mounted with loopback device. > > > | uncached(ms)| cached(ms) > ----------------------------------|-------------|----------- > composefs (with digest) | 326 | 135 > erofs (w/o -T0) | 264 | 172 > erofs (w/o -T0) + overlayfs | 651 | 238 This is a nice proof of the overlayfs "early lookup" overhead. As I wrote, this overhead could be optimized by doing "lazy lookup" on open like composefs does. Here is a suggestion for a simple test variant that could be used to approximate the expected improvement - if you set all the metacopy files in erofs to redirect to the same lower block, most of the lower lookup time will be amortized because all but the first lower lookup are cached. If you get a performance number with erofs + overlayfs that are close to composefs performance numbers, it will prove the point that same functionality and performance could be achieved by modifying ovelrayfs/mkfs.erofs. Thanks, Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-01 7:44 ` Amir Goldstein @ 2023-02-01 8:59 ` Jingbo Xu 2023-02-01 9:52 ` Alexander Larsson 0 siblings, 1 reply; 87+ messages in thread From: Jingbo Xu @ 2023-02-01 8:59 UTC (permalink / raw) To: Amir Goldstein Cc: Gao Xiang, Alexander Larsson, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Miklos Szeredi On 2/1/23 3:44 PM, Amir Goldstein wrote: > On Wed, Feb 1, 2023 at 6:28 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote: >> >> Hi all, >> >> There are some updated performance statistics with different >> combinations on my test environment if you are interested. >> > > Cool report! > >> >> On 1/27/23 6:24 PM, Gao Xiang wrote: >>> ... >>> >>> I've made a version and did some test, it can be fetched from: >>> git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git -b >>> experimental >>> >> >> Setup >> ====== >> CPU: x86_64 Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz >> Disk: 6800 IOPS upper limit >> OS: Linux v6.2 (with composefs v3 patchset) >> >> I build erofs/squashfs images following the scripts attached on [1], >> with each file in the rootfs tagged with "metacopy" and "redirect" xattr. >> >> The source rootfs is from the docker image of tensorflow [2]. >> >> The erofs images are built with mkfs.erofs with support for sparse file >> added [3]. >> >> [1] >> https://lore.kernel.org/linux-fsdevel/5fb32a1297821040edd8c19ce796fc0540101653.camel@redhat.com/ >> [2] >> https://hub.docker.com/layers/tensorflow/tensorflow/2.10.0/images/sha256-7f9f23ce2473eb52d17fe1b465c79c3a3604047343e23acc036296f512071bc9?context=explore >> [3] >> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/commit/?h=experimental&id=7c49e8b195ad90f6ca9dfccce9f6e3e39a8676f6 >> >> >> >> Image size >> =========== >> 6.4M large.composefs >> 5.7M large.composefs.w/o.digest (w/o --compute-digest) >> 6.2M large.erofs >> 5.2M large.erofs.T0 (with -T0, i.e. w/o nanosecond timestamp) >> 1.7M large.squashfs >> 5.8M large.squashfs.uncompressed (with -noI -noD -noF -noX) >> >> (large.erofs.T0 is built without nanosecond timestamp, so that we get >> smaller disk inode size (same with squashfs).) >> >> >> Runtime Perf >> ============= >> >> The "uncached" column is tested with: >> hyperfine -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR $MNTPOINT" >> >> >> While the "cached" column is tested with: >> hyperfine -w 1 "ls -lR $MNTPOINT" >> >> >> erofs and squashfs are mounted with loopback device. >> >> >> | uncached(ms)| cached(ms) >> ----------------------------------|-------------|----------- >> composefs (with digest) | 326 | 135 >> erofs (w/o -T0) | 264 | 172 >> erofs (w/o -T0) + overlayfs | 651 | 238 > > This is a nice proof of the overlayfs "early lookup" overhead. > As I wrote, this overhead could be optimized by doing "lazy lookup" > on open like composefs does. > > Here is a suggestion for a simple test variant that could be used to > approximate the expected improvement - > if you set all the metacopy files in erofs to redirect to the same > lower block, most of the lower lookup time will be amortized > because all but the first lower lookup are cached. > If you get a performance number with erofs + overlayfs that are > close to composefs performance numbers, it will prove the point > that same functionality and performance could be achieved by > modifying ovelrayfs/mkfs.erofs. 
> I redid the test with the suggestion from Amir, with all files inside the erofs layer redirected to the same lower block, e.g. "/objects/00/014430a0b489d101c8a103ef829dd258448a13eb48b4d1e9ff0731d1e82b92". The result is shown in the fourth line.

| uncached(ms)| cached(ms) ----------------------------------|-------------|----------- composefs (with digest) | 326 | 135 erofs (w/o -T0) | 264 | 172 erofs (w/o -T0) + overlayfs | 651 | 238 erofs (hacked and redirect to one | | lower block) + overlayfs | 400 | 230

It seems that the "lazy lookup" in overlayfs indeed helps in this situation.

The performance gap in the cached situation (especially comparing composefs and standalone erofs) is still under investigation, and I will see if there's any hint from perf diff.

overlayfs + loopback erofs (redirect to the same lower block) - uncached Benchmark 1: ls -lR /mnt/ovl/mntdir Time (mean ± σ): 399.5 ms ± 3.8 ms [User: 69.9 ms, System: 298.1 ms] Range (min … max): 394.3 ms … 403.7 ms 10 runs

overlayfs + loopback erofs(w/o -T0) - cached Benchmark 1: ls -lR /mnt/ovl/mntdir Time (mean ± σ): 230.5 ms ± 5.7 ms [User: 63.8 ms, System: 165.6 ms] Range (min … max): 220.4 ms … 240.2 ms 12 runs

-- Thanks, Jingbo ^ permalink raw reply [flat|nested] 87+ messages in thread
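As a reference for reproducing the fourth line above, the hack can presumably be expressed as pointing every file's redirect xattr at one and the same object before rebuilding the image. A sketch follows; the object path is taken from the message above, everything else (the staging tree and the rebuild step) is assumed.

    #!/bin/bash
    # Sketch: make every metacopy file redirect to a single lower object so
    # that overlayfs' lower lookup is amortized after the first cache miss.
    # staging/ is the hypothetical source tree used to build the erofs image.

    one_object="/objects/00/014430a0b489d101c8a103ef829dd258448a13eb48b4d1e9ff0731d1e82b92"

    cd staging || exit 1
    find . -type f -exec setfattr -n trusted.overlay.redirect -v "$one_object" {} +

    # Rebuild the erofs image from staging/ and rerun the ls -lR benchmarks.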
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-01 8:59 ` Jingbo Xu @ 2023-02-01 9:52 ` Alexander Larsson 2023-02-01 12:39 ` Jingbo Xu 0 siblings, 1 reply; 87+ messages in thread From: Alexander Larsson @ 2023-02-01 9:52 UTC (permalink / raw) To: Jingbo Xu, Amir Goldstein Cc: Gao Xiang, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Miklos Szeredi On Wed, 2023-02-01 at 16:59 +0800, Jingbo Xu wrote: > > I redid the test with suggestion from Amir, with all files inside the > erofs layer are redirected to the same lower block, e.g. > "/objects/00/014430a0b489d101c8a103ef829dd258448a13eb48b4d1e9ff0731d1 > e82b92". > > The result is shown in the fourth line. > > | uncached(ms)| cached(ms) > ----------------------------------|-------------|----------- > composefs (with digest) | 326 | 135 > erofs (w/o -T0) | 264 | 172 > erofs (w/o -T0) + overlayfs | 651 | 238 > erofs (hacked and redirect to one | | > lower block) + overlayfs | 400 | 230 > > It seems that the "lazy lookup" in overlayfs indeed optimizes in this > situation. > > > The performance gap in cached situation (especially comparing > composefs > and standalone erofs) is still under investigation and I will see if > there's any hint by perf diff. The fact that plain erofs is faster than composefs uncached, but slower cached is very strange. Also, see my other mail where erofs+ovl cached is slower than squashfs+ovl cached for me. Something seems to be off with the cached erofs case... -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- =-=-= Alexander Larsson Red Hat, Inc alexl@redhat.com alexander.larsson@gmail.com He's a sword-wielding alcoholic barbarian She's a pregnant snooty nun who dreams of becoming Elvis. They fight crime! ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-01 9:52 ` Alexander Larsson @ 2023-02-01 12:39 ` Jingbo Xu 0 siblings, 0 replies; 87+ messages in thread From: Jingbo Xu @ 2023-02-01 12:39 UTC (permalink / raw) To: Alexander Larsson, Amir Goldstein Cc: Gao Xiang, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Miklos Szeredi On 2/1/23 5:52 PM, Alexander Larsson wrote: > On Wed, 2023-02-01 at 16:59 +0800, Jingbo Xu wrote: >> >> I redid the test with suggestion from Amir, with all files inside the >> erofs layer are redirected to the same lower block, e.g. >> "/objects/00/014430a0b489d101c8a103ef829dd258448a13eb48b4d1e9ff0731d1 >> e82b92". >> >> The result is shown in the fourth line. >> >> | uncached(ms)| cached(ms) >> ----------------------------------|-------------|----------- >> composefs (with digest) | 326 | 135 >> erofs (w/o -T0) | 264 | 172 >> erofs (w/o -T0) + overlayfs | 651 | 238 >> erofs (hacked and redirect to one | | >> lower block) + overlayfs | 400 | 230 >> >> It seems that the "lazy lookup" in overlayfs indeed optimizes in this >> situation. >> >> >> The performance gap in cached situation (especially comparing >> composefs >> and standalone erofs) is still under investigation and I will see if >> there's any hint by perf diff. > > The fact that plain erofs is faster than composefs uncached, but slower > cached is very strange. Also, see my other mail where erofs+ovl cached > is slower than squashfs+ovl cached for me. Something seems to be off > with the cached erofs case... > I tested erofs with ACL disabled (see fourth line). | uncached(ms)| cached(ms) ----------------------------------|-------------|----------- composefs (with digest) | 326 | 135 squashfs (uncompressed) | 406 | 172 erofs (w/o -T0) | 264 | 172 erofs (w/o -T0, mount with noacl) | 225 | 141 The remained perf difference in cached situation might be noisy and may be due to the difference of test environment. -- Thanks, Jingbo ^ permalink raw reply [flat|nested] 87+ messages in thread
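The noacl comparison in the fourth line above can be reproduced with something like the sketch below, in the same hyperfine style used throughout the thread; the image and mount point names are assumptions carried over from the earlier setup.

    #!/bin/bash
    # Sketch: rerun the benchmark with erofs ACL support disabled at mount
    # time, to rule out ACL xattr lookups as the source of the cached-case gap.
    # large.erofs and /mnt/erofs are hypothetical names from the earlier setup.

    umount /mnt/erofs 2>/dev/null
    mount -t erofs -o loop,noacl large.erofs /mnt/erofs

    hyperfine -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR /mnt/erofs"   # uncached
    hyperfine -w 1 "ls -lR /mnt/erofs"                                     # cached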
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-01 4:28 ` Jingbo Xu 2023-02-01 7:44 ` Amir Goldstein @ 2023-02-01 9:46 ` Alexander Larsson 2023-02-01 10:01 ` Gao Xiang ` (2 more replies) 1 sibling, 3 replies; 87+ messages in thread From: Alexander Larsson @ 2023-02-01 9:46 UTC (permalink / raw) To: Jingbo Xu, Gao Xiang, Amir Goldstein, gscrivan, brauner Cc: linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Miklos Szeredi On Wed, 2023-02-01 at 12:28 +0800, Jingbo Xu wrote: > Hi all, > > There are some updated performance statistics with different > combinations on my test environment if you are interested. > > > On 1/27/23 6:24 PM, Gao Xiang wrote: > > ... > > > > I've made a version and did some test, it can be fetched from: > > git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git > > -b > > experimental > > > > Setup > ====== > CPU: x86_64 Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz > Disk: 6800 IOPS upper limit > OS: Linux v6.2 (with composefs v3 patchset) For the record, what was the filesystem backing the basedir files? > I build erofs/squashfs images following the scripts attached on [1], > with each file in the rootfs tagged with "metacopy" and "redirect" > xattr. > > The source rootfs is from the docker image of tensorflow [2]. > > The erofs images are built with mkfs.erofs with support for sparse > file > added [3]. > > [1] > https://lore.kernel.org/linux-fsdevel/5fb32a1297821040edd8c19ce796fc0540101653.camel@redhat.com/ > [2] > https://hub.docker.com/layers/tensorflow/tensorflow/2.10.0/images/sha256-7f9f23ce2473eb52d17fe1b465c79c3a3604047343e23acc036296f512071bc9?context=explore > [3] > https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/commit/?h=experimental&id=7c49e8b195ad90f6ca9dfccce9f6e3e39a8676f6 > > > > Image size > =========== > 6.4M large.composefs > 5.7M large.composefs.w/o.digest (w/o --compute-digest) > 6.2M large.erofs > 5.2M large.erofs.T0 (with -T0, i.e. w/o nanosecond timestamp) > 1.7M large.squashfs > 5.8M large.squashfs.uncompressed (with -noI -noD -noF -noX) > > (large.erofs.T0 is built without nanosecond timestamp, so that we get > smaller disk inode size (same with squashfs).) > > > Runtime Perf > ============= > > The "uncached" column is tested with: > hyperfine -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR $MNTPOINT" > > > While the "cached" column is tested with: > hyperfine -w 1 "ls -lR $MNTPOINT" > > > erofs and squashfs are mounted with loopback device. > > > | uncached(ms)| cached(ms) > ----------------------------------|-------------|----------- > composefs (with digest) | 326 | 135 > erofs (w/o -T0) | 264 | 172 > erofs (w/o -T0) + overlayfs | 651 | 238 > squashfs (compressed) | 538 | 211 > squashfs (compressed) + overlayfs | 968 | 302 Clearly erofs with sparse files is the best fs now for the ro-fs + overlay case. But still, we can see that the additional cost of the overlayfs layer is not negligible. According to amir this could be helped by a special composefs-like mode in overlayfs, but its unclear what performance that would reach, and we're then talking net new development that further complicates the overlayfs codebase. Its not clear to me which alternative is easier to develop/maintain. Also, the difference between cached and uncached here is less than in my tests. Probably because my test image was larger. 
With the test image I use, the results are: | uncached(ms)| cached(ms) ----------------------------------|-------------|----------- composefs (with digest) | 681 | 390 erofs (w/o -T0) + overlayfs | 1788 | 532 squashfs (compressed) + overlayfs | 2547 | 443 I gotta say it is weird though that squashfs performed better than erofs in the cached case. May be worth looking into. The test data I'm using is available here: https://my.owndrive.com/index.php/s/irHJXRpZHtT3a5i -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- =-=-= Alexander Larsson Red Hat, Inc alexl@redhat.com alexander.larsson@gmail.com He's a lonely flyboy grifter living undercover at Ringling Bros. Circus. She's a virginal thirtysomething former first lady looking for love in all the wrong places. They fight crime! ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-01 9:46 ` Alexander Larsson @ 2023-02-01 10:01 ` Gao Xiang 2023-02-01 11:22 ` Gao Xiang 0 siblings, 1 reply; 87+ messages in thread From: Gao Xiang @ 2023-02-01 10:01 UTC (permalink / raw) To: Alexander Larsson, Jingbo Xu, Amir Goldstein, gscrivan, brauner Cc: linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Miklos Szeredi

On 2023/2/1 17:46, Alexander Larsson wrote: ... >> >> | uncached(ms)| cached(ms) >> ----------------------------------|-------------|----------- >> composefs (with digest) | 326 | 135 >> erofs (w/o -T0) | 264 | 172 >> erofs (w/o -T0) + overlayfs | 651 | 238 >> squashfs (compressed) | 538 | 211 >> squashfs (compressed) + overlayfs | 968 | 302 > > > Clearly erofs with sparse files is the best fs now for the ro-fs + > overlay case. But still, we can see that the additional cost of the > overlayfs layer is not negligible. > > According to amir this could be helped by a special composefs-like mode > in overlayfs, but its unclear what performance that would reach, and > we're then talking net new development that further complicates the > overlayfs codebase. Its not clear to me which alternative is easier to > develop/maintain. > > Also, the difference between cached and uncached here is less than in > my tests. Probably because my test image was larger. With the test > image I use, the results are: > > | uncached(ms)| cached(ms) > ----------------------------------|-------------|----------- > composefs (with digest) | 681 | 390 > erofs (w/o -T0) + overlayfs | 1788 | 532 > squashfs (compressed) + overlayfs | 2547 | 443 > > > I gotta say it is weird though that squashfs performed better than > erofs in the cached case. May be worth looking into. The test data I'm > using is available here:

As another wild guess, cached performance is just vfs stuff.

I think the performance difference may be due to ACL (since both composefs and squashfs don't support ACL). I already asked Jingbo to get more "perf data" to analyze this, but he's busy with other stuff right now.

Again, my overall point is quite simple, as always: currently composefs is a read-only filesystem with massive numbers of symlink-like files. It behaves as a subset of all generic read-only filesystems, just for this specific use case.

In fact there are many options to improve this (much like Amir said before): 1) improve overlayfs, and then it can be used with any local fs; 2) enhance erofs to support this (even without on-disk change); 3) introduce fs/composefs;

In addition to option 1), option 2) has many benefits as well, since your manifest files could then contain real regular files in addition to the composefs model.

Even if you guys still consider 3), I'm not sure the codebase will stay as it is, i.e. that you will only do bugfixes and never add any new features, like I said. So eventually, I still think it is another read-only fs which is very similar to EROFS with the compression part cut out.

Thanks, Gao Xiang

> > https://my.owndrive.com/index.php/s/irHJXRpZHtT3a5i > > ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-01 10:01 ` Gao Xiang @ 2023-02-01 11:22 ` Gao Xiang 2023-02-02 6:37 ` Amir Goldstein 0 siblings, 1 reply; 87+ messages in thread From: Gao Xiang @ 2023-02-01 11:22 UTC (permalink / raw) To: Alexander Larsson, Jingbo Xu, Amir Goldstein, gscrivan, brauner Cc: linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Miklos Szeredi On 2023/2/1 18:01, Gao Xiang wrote: > > > On 2023/2/1 17:46, Alexander Larsson wrote: > > ... > >>> >>> | uncached(ms)| cached(ms) >>> ----------------------------------|-------------|----------- >>> composefs (with digest) | 326 | 135 >>> erofs (w/o -T0) | 264 | 172 >>> erofs (w/o -T0) + overlayfs | 651 | 238 >>> squashfs (compressed) | 538 | 211 >>> squashfs (compressed) + overlayfs | 968 | 302 >> >> >> Clearly erofs with sparse files is the best fs now for the ro-fs + >> overlay case. But still, we can see that the additional cost of the >> overlayfs layer is not negligible. >> >> According to amir this could be helped by a special composefs-like mode >> in overlayfs, but its unclear what performance that would reach, and >> we're then talking net new development that further complicates the >> overlayfs codebase. Its not clear to me which alternative is easier to >> develop/maintain. >> >> Also, the difference between cached and uncached here is less than in >> my tests. Probably because my test image was larger. With the test >> image I use, the results are: >> >> | uncached(ms)| cached(ms) >> ----------------------------------|-------------|----------- >> composefs (with digest) | 681 | 390 >> erofs (w/o -T0) + overlayfs | 1788 | 532 >> squashfs (compressed) + overlayfs | 2547 | 443 >> >> >> I gotta say it is weird though that squashfs performed better than >> erofs in the cached case. May be worth looking into. The test data I'm >> using is available here: > > As another wild guess, cached performance is a just vfs-stuff. > > I think the performance difference may be due to ACL (since both > composefs and squashfs don't support ACL). I already asked Jingbo > to get more "perf data" to analyze this but he's now busy in another > stuff. > > Again, my overall point is quite simple as always, currently > composefs is a read-only filesystem with massive symlink-like files. > It behaves as a subset of all generic read-only filesystems just > for this specific use cases. > > In facts there are many options to improve this (much like Amir > said before): > 1) improve overlayfs, and then it can be used with any local fs; > > 2) enhance erofs to support this (even without on-disk change); > > 3) introduce fs/composefs; > > In addition to option 1), option 2) has many benefits as well, since > your manifest files can save real regular files in addition to composefs > model. (add some words..) My first response at that time (on Slack) was "kindly request Giuseppe to ask in the fsdevel mailing list if this new overlay model and use cases is feasable", if so, I'm much happy to integrate in to EROFS (in a cooperative way) in several ways: - just use EROFS symlink layout and open such file in a stacked way; or (now) - just identify overlayfs "trusted.overlay.redirect" in EROFS itself and open file so such image can be both used for EROFS only and EROFS + overlayfs. 
If that had happened, then I think the overlayfs "metacopy" option could also have been pointed out by other fs community people later (since I'm not an overlay expert), but I'm not sure why these options eventually became impossible and were not even mentioned at all.

Or if you guys really don't want to use EROFS for whatever reason (EROFS is completely open source, and used and contributed to by many vendors), you could improve squashfs, ext4, or other existing local fses for this new use case (since they don't need any on-disk change either, for example by using some xattr); I don't think it's really hard.

And as for what you said in the other reply, " On the contrary, erofs lookup is very similar to composefs. There is nothing magical about it, we're talking about pre-computed, static lists of names. What you do is you sort the names, put them in a compact seek-free form, and then you binary search on them. Composefs v3 has some changes to make larger directories slightly more efficient (no chunking), but the general performance should be comparable. ": core EROFS dates from 2017-2018, when we addressed the common issues of generic read-only use cases.

Also, consider wanting to read all dir data and pin such pages in memory at once: if you run into an AI dataset with (typically) 10 million samples or more in a dir, you will suffer on the many devices with limited memory. Those are precisely EROFS's original target users.

I'm not sure how the upstream kernel filesystem community handles cases like this (also, a few days ago I heard of another new in-kernel one called "tarfs", which implements tar in ~500 loc (maybe), from the confidential container guys, but I don't really know how an unaligned, unseekable tape archive format like tar can work effectively without block-aligned data.)

Anyway, that is all I can do for your use cases right now.

> > Even if you guys still consider 3), I'm not sure that is all codebase > you will just do bugfix and don't add any new features like what I > said. So eventually, I still think that is another read-only fs which > is much similar to compressed-part-truncated EROFS. > > > Thanks, > Gao Xiang > > >> https://my.owndrive.com/index.php/s/irHJXRpZHtT3a5i >> >> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-01 11:22 ` Gao Xiang @ 2023-02-02 6:37 ` Amir Goldstein 2023-02-02 7:17 ` Gao Xiang 0 siblings, 1 reply; 87+ messages in thread From: Amir Goldstein @ 2023-02-02 6:37 UTC (permalink / raw) To: Gao Xiang Cc: Alexander Larsson, Jingbo Xu, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Miklos Szeredi On Wed, Feb 1, 2023 at 1:22 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: > > > > On 2023/2/1 18:01, Gao Xiang wrote: > > > > > > On 2023/2/1 17:46, Alexander Larsson wrote: > > > > ... > > > >>> > >>> | uncached(ms)| cached(ms) > >>> ----------------------------------|-------------|----------- > >>> composefs (with digest) | 326 | 135 > >>> erofs (w/o -T0) | 264 | 172 > >>> erofs (w/o -T0) + overlayfs | 651 | 238 > >>> squashfs (compressed) | 538 | 211 > >>> squashfs (compressed) + overlayfs | 968 | 302 > >> > >> > >> Clearly erofs with sparse files is the best fs now for the ro-fs + > >> overlay case. But still, we can see that the additional cost of the > >> overlayfs layer is not negligible. > >> > >> According to amir this could be helped by a special composefs-like mode > >> in overlayfs, but its unclear what performance that would reach, and > >> we're then talking net new development that further complicates the > >> overlayfs codebase. Its not clear to me which alternative is easier to > >> develop/maintain. > >> > >> Also, the difference between cached and uncached here is less than in > >> my tests. Probably because my test image was larger. With the test > >> image I use, the results are: > >> > >> | uncached(ms)| cached(ms) > >> ----------------------------------|-------------|----------- > >> composefs (with digest) | 681 | 390 > >> erofs (w/o -T0) + overlayfs | 1788 | 532 > >> squashfs (compressed) + overlayfs | 2547 | 443 > >> > >> > >> I gotta say it is weird though that squashfs performed better than > >> erofs in the cached case. May be worth looking into. The test data I'm > >> using is available here: > > > > As another wild guess, cached performance is a just vfs-stuff. > > > > I think the performance difference may be due to ACL (since both > > composefs and squashfs don't support ACL). I already asked Jingbo > > to get more "perf data" to analyze this but he's now busy in another > > stuff. > > > > Again, my overall point is quite simple as always, currently > > composefs is a read-only filesystem with massive symlink-like files. > > It behaves as a subset of all generic read-only filesystems just > > for this specific use cases. > > > > In facts there are many options to improve this (much like Amir > > said before): > > 1) improve overlayfs, and then it can be used with any local fs; > > > > 2) enhance erofs to support this (even without on-disk change); > > > > 3) introduce fs/composefs; > > > > In addition to option 1), option 2) has many benefits as well, since > > your manifest files can save real regular files in addition to composefs > > model. > > (add some words..) 
> > My first response at that time (on Slack) was "kindly request > Giuseppe to ask in the fsdevel mailing list if this new overlay model > and use cases is feasable", if so, I'm much happy to integrate in to > EROFS (in a cooperative way) in several ways: > > - just use EROFS symlink layout and open such file in a stacked way; > > or (now) > > - just identify overlayfs "trusted.overlay.redirect" in EROFS itself > and open file so such image can be both used for EROFS only and > EROFS + overlayfs. > > If that happened, then I think the overlayfs "metacopy" option can > also be shown by other fs community people later (since I'm not an > overlay expert), but I'm not sure why they becomes impossible finally > and even not mentioned at all. > > Or if you guys really don't want to use EROFS for whatever reasons > (EROFS is completely open-source, used, contributed by many vendors), > you could improve squashfs, ext4, or other exist local fses with this > new use cases (since they don't need any on-disk change as well, for > example, by using some xattr), I don't think it's really hard. >

Engineering-wise, merging composefs features into EROFS would be the simplest option and FWIW, my personal preference.

However, you need to be aware that this will bring into EROFS vfs considerations, such as s_stack_depth nesting (which AFAICS is not incremented in composefs?). It's not the end of the world, but this is no longer a plain fs-over-block game. There's a whole new class of bugs (that syzbot is very eager to explore) so you need to ask yourself whether this is a direction you want to lead EROFS towards.

Giuseppe expressed his plans to make use of the composefs method inside userns one day. It is not a hard dependency, but I believe that keeping the "RO efficient verifiable image format" functionality (EROFS) separate from "userns composition of verifiable images" (overlayfs) may benefit the userns mount goal in the long term.

Thanks, Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-02 6:37 ` Amir Goldstein @ 2023-02-02 7:17 ` Gao Xiang 2023-02-02 7:37 ` Gao Xiang 0 siblings, 1 reply; 87+ messages in thread From: Gao Xiang @ 2023-02-02 7:17 UTC (permalink / raw) To: Amir Goldstein Cc: Alexander Larsson, Jingbo Xu, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Miklos Szeredi On 2023/2/2 14:37, Amir Goldstein wrote: > On Wed, Feb 1, 2023 at 1:22 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: >> >> >> >> On 2023/2/1 18:01, Gao Xiang wrote: >>> >>> >>> On 2023/2/1 17:46, Alexander Larsson wrote: >>> >>> ... >>> >>>>> >>>>> | uncached(ms)| cached(ms) >>>>> ----------------------------------|-------------|----------- >>>>> composefs (with digest) | 326 | 135 >>>>> erofs (w/o -T0) | 264 | 172 >>>>> erofs (w/o -T0) + overlayfs | 651 | 238 >>>>> squashfs (compressed) | 538 | 211 >>>>> squashfs (compressed) + overlayfs | 968 | 302 >>>> >>>> >>>> Clearly erofs with sparse files is the best fs now for the ro-fs + >>>> overlay case. But still, we can see that the additional cost of the >>>> overlayfs layer is not negligible. >>>> >>>> According to amir this could be helped by a special composefs-like mode >>>> in overlayfs, but its unclear what performance that would reach, and >>>> we're then talking net new development that further complicates the >>>> overlayfs codebase. Its not clear to me which alternative is easier to >>>> develop/maintain. >>>> >>>> Also, the difference between cached and uncached here is less than in >>>> my tests. Probably because my test image was larger. With the test >>>> image I use, the results are: >>>> >>>> | uncached(ms)| cached(ms) >>>> ----------------------------------|-------------|----------- >>>> composefs (with digest) | 681 | 390 >>>> erofs (w/o -T0) + overlayfs | 1788 | 532 >>>> squashfs (compressed) + overlayfs | 2547 | 443 >>>> >>>> >>>> I gotta say it is weird though that squashfs performed better than >>>> erofs in the cached case. May be worth looking into. The test data I'm >>>> using is available here: >>> >>> As another wild guess, cached performance is a just vfs-stuff. >>> >>> I think the performance difference may be due to ACL (since both >>> composefs and squashfs don't support ACL). I already asked Jingbo >>> to get more "perf data" to analyze this but he's now busy in another >>> stuff. >>> >>> Again, my overall point is quite simple as always, currently >>> composefs is a read-only filesystem with massive symlink-like files. >>> It behaves as a subset of all generic read-only filesystems just >>> for this specific use cases. >>> >>> In facts there are many options to improve this (much like Amir >>> said before): >>> 1) improve overlayfs, and then it can be used with any local fs; >>> >>> 2) enhance erofs to support this (even without on-disk change); >>> >>> 3) introduce fs/composefs; >>> >>> In addition to option 1), option 2) has many benefits as well, since >>> your manifest files can save real regular files in addition to composefs >>> model. >> >> (add some words..) 
>> >> My first response at that time (on Slack) was "kindly request >> Giuseppe to ask in the fsdevel mailing list if this new overlay model >> and use cases is feasable", if so, I'm much happy to integrate in to >> EROFS (in a cooperative way) in several ways: >> >> - just use EROFS symlink layout and open such file in a stacked way; >> >> or (now) >> >> - just identify overlayfs "trusted.overlay.redirect" in EROFS itself >> and open file so such image can be both used for EROFS only and >> EROFS + overlayfs. >> >> If that happened, then I think the overlayfs "metacopy" option can >> also be shown by other fs community people later (since I'm not an >> overlay expert), but I'm not sure why they becomes impossible finally >> and even not mentioned at all. >> >> Or if you guys really don't want to use EROFS for whatever reasons >> (EROFS is completely open-source, used, contributed by many vendors), >> you could improve squashfs, ext4, or other exist local fses with this >> new use cases (since they don't need any on-disk change as well, for >> example, by using some xattr), I don't think it's really hard. >> > > Engineering-wise, merging composefs features into EROFS > would be the simplest option and FWIW, my personal preference. > > However, you need to be aware that this will bring into EROFS > vfs considerations, such as s_stack_depth nesting (which AFAICS > is not see incremented composefs?). It's not the end of the world, but this > is no longer plain fs over block game. There's a whole new class of bugs > (that syzbot is very eager to explore) so you need to ask yourself whether > this is a direction you want to lead EROFS towards. I'd like to make a separate Kconfig option for this. I suggest it because composefs is currently very similar to EROFS, but it has no way to keep real regular files (even a README, VERSION or Changelog in these images) inside its manifest files (as composefs calls them). Even its on-disk super block has no UUID now [1], no boot sector for booting, and no room for potential hybrid formats such as tar + EROFS or cpio + EROFS. I'm not sure those potential new on-disk features will stay unneeded even for a future composefs. But if composefs later supports such on-disk features, that brings it even closer to EROFS, and I see no disadvantage in making the two actually on-disk compatible (much like ext2 and ext4). The only real difference now is the I/O interface for the manifest file itself -- bio vs. file -- but EROFS can also be distributed on raw block devices, while composefs can't. Also, I'd like to separate core EROFS from the advanced features (people interested in working on this are always welcome) and from a composefs-like model; if people don't need any of the EROFS advanced features, those could be explicitly disabled at compile time. > > Giuseppe expressed his plans to make use of the composefs method > inside userns one day. It is not a hard dependency, but I believe that > keeping the "RO efficient verifiable image format" functionality (EROFS) > separate from "userns composition of verifiable images" (overlayfs) > may benefit the userns mount goal in the long term. If that is needed, I'm very happy to work out a more detailed path for this in a discussion at LSF/MM/BPF 2023: how we get this (userns) working reliably in practice.
As for lines of code, the core EROFS on-disk format is quite simple (I don't think total LOC is a barrier) -- see fs/erofs/data.c, fs/erofs/namei.c and fs/erofs/dir.c, or the on-disk structures erofs_super_block, erofs_inode_compact, erofs_inode_extended and erofs_dirent. By contrast, fs/erofs/super.c, which is just used to enable the EROFS advanced features, is almost 1000 LOC now. But most of that code is quite trivial, and I don't think it makes any difference to the userns plan. Thanks, Gao Xiang [1] https://lore.kernel.org/r/CAOQ4uxjm7i+uO4o4470ACctsft1m18EiUpxBfCeT-Wyqf1FAYg@mail.gmail.com/ > > Thanks, > Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-02 7:17 ` Gao Xiang @ 2023-02-02 7:37 ` Gao Xiang 2023-02-03 11:32 ` Alexander Larsson 0 siblings, 1 reply; 87+ messages in thread From: Gao Xiang @ 2023-02-02 7:37 UTC (permalink / raw) To: Amir Goldstein Cc: Alexander Larsson, Jingbo Xu, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Miklos Szeredi On 2023/2/2 15:17, Gao Xiang wrote: > > > On 2023/2/2 14:37, Amir Goldstein wrote: >> On Wed, Feb 1, 2023 at 1:22 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: >>> >>> >>> >>> On 2023/2/1 18:01, Gao Xiang wrote: >>>> >>>> >>>> On 2023/2/1 17:46, Alexander Larsson wrote: >>>> >>>> ... >>>> >>>>>> >>>>>> | uncached(ms)| cached(ms) >>>>>> ----------------------------------|-------------|----------- >>>>>> composefs (with digest) | 326 | 135 >>>>>> erofs (w/o -T0) | 264 | 172 >>>>>> erofs (w/o -T0) + overlayfs | 651 | 238 >>>>>> squashfs (compressed) | 538 | 211 >>>>>> squashfs (compressed) + overlayfs | 968 | 302 >>>>> >>>>> >>>>> Clearly erofs with sparse files is the best fs now for the ro-fs + >>>>> overlay case. But still, we can see that the additional cost of the >>>>> overlayfs layer is not negligible. >>>>> >>>>> According to amir this could be helped by a special composefs-like mode >>>>> in overlayfs, but its unclear what performance that would reach, and >>>>> we're then talking net new development that further complicates the >>>>> overlayfs codebase. Its not clear to me which alternative is easier to >>>>> develop/maintain. >>>>> >>>>> Also, the difference between cached and uncached here is less than in >>>>> my tests. Probably because my test image was larger. With the test >>>>> image I use, the results are: >>>>> >>>>> | uncached(ms)| cached(ms) >>>>> ----------------------------------|-------------|----------- >>>>> composefs (with digest) | 681 | 390 >>>>> erofs (w/o -T0) + overlayfs | 1788 | 532 >>>>> squashfs (compressed) + overlayfs | 2547 | 443 >>>>> >>>>> >>>>> I gotta say it is weird though that squashfs performed better than >>>>> erofs in the cached case. May be worth looking into. The test data I'm >>>>> using is available here: >>>> >>>> As another wild guess, cached performance is a just vfs-stuff. >>>> >>>> I think the performance difference may be due to ACL (since both >>>> composefs and squashfs don't support ACL). I already asked Jingbo >>>> to get more "perf data" to analyze this but he's now busy in another >>>> stuff. >>>> >>>> Again, my overall point is quite simple as always, currently >>>> composefs is a read-only filesystem with massive symlink-like files. >>>> It behaves as a subset of all generic read-only filesystems just >>>> for this specific use cases. >>>> >>>> In facts there are many options to improve this (much like Amir >>>> said before): >>>> 1) improve overlayfs, and then it can be used with any local fs; >>>> >>>> 2) enhance erofs to support this (even without on-disk change); >>>> >>>> 3) introduce fs/composefs; >>>> >>>> In addition to option 1), option 2) has many benefits as well, since >>>> your manifest files can save real regular files in addition to composefs >>>> model. >>> >>> (add some words..) 
>>> >>> My first response at that time (on Slack) was "kindly request >>> Giuseppe to ask in the fsdevel mailing list if this new overlay model >>> and use cases is feasable", if so, I'm much happy to integrate in to >>> EROFS (in a cooperative way) in several ways: >>> >>> - just use EROFS symlink layout and open such file in a stacked way; >>> >>> or (now) >>> >>> - just identify overlayfs "trusted.overlay.redirect" in EROFS itself >>> and open file so such image can be both used for EROFS only and >>> EROFS + overlayfs. >>> >>> If that happened, then I think the overlayfs "metacopy" option can >>> also be shown by other fs community people later (since I'm not an >>> overlay expert), but I'm not sure why they becomes impossible finally >>> and even not mentioned at all. >>> >>> Or if you guys really don't want to use EROFS for whatever reasons >>> (EROFS is completely open-source, used, contributed by many vendors), >>> you could improve squashfs, ext4, or other exist local fses with this >>> new use cases (since they don't need any on-disk change as well, for >>> example, by using some xattr), I don't think it's really hard. >>> >> >> Engineering-wise, merging composefs features into EROFS >> would be the simplest option and FWIW, my personal preference. >> >> However, you need to be aware that this will bring into EROFS >> vfs considerations, such as s_stack_depth nesting (which AFAICS >> is not see incremented composefs?). It's not the end of the world, but this >> is no longer plain fs over block game. There's a whole new class of bugs >> (that syzbot is very eager to explore) so you need to ask yourself whether >> this is a direction you want to lead EROFS towards. > > I'd like to make a seperated Kconfig for this. I consider this just because > currently composefs is much similar to EROFS but it doesn't have some ability > to keep real regular file (even some README, VERSION or Changelog in these > images) in its (composefs-called) manifest files. Even its on-disk super block > doesn't have a UUID now [1] and some boot sector for booting or some potential > hybird formats such as tar + EROFS, cpio + EROFS. > > I'm not sure if those potential new on-disk features is unneeded even for > future composefs. But if composefs laterly supports such on-disk features, > that makes composefs closer to EROFS even more. I don't see disadvantage to > make these actual on-disk compatible (like ext2 and ext4). > > The only difference now is manifest file itself I/O interface -- bio vs file. > but EROFS can be distributed to raw block devices as well, composefs can't. > > Also, I'd like to seperate core-EROFS from advanced features (or people who > are interested to work on this are always welcome) and composefs-like model, > if people don't tend to use any EROFS advanced features, it could be disabled > from compiling explicitly. Apart from that, I still fail to get some thoughts (apart from unprivileged mounts) how EROFS + overlayfs combination fails on automative real workloads aside from "ls -lR" (readdir + stat). And eventually we still need overlayfs for most use cases to do writable stuffs, anyway, it needs some words to describe why such < 1s difference is very very important to the real workload as you already mentioned before. And with overlayfs lazy lookup, I think it can be close to ~100ms or better. > >> >> Giuseppe expressed his plans to make use of the composefs method >> inside userns one day. 
It is not a hard dependency, but I believe that >> keeping the "RO efficient verifiable image format" functionality (EROFS) >> separate from "userns composition of verifiable images" (overlayfs) >> may benefit the userns mount goal in the long term. > > If that is needed, I'm very happy to get more detailed path of this from > some discussion in LSF/MM/BPF 2023: how we get this (userns) reliably in > practice. > > As of code lines, core EROFS on-disk format is quite simple (I don't think > total LOC is a barrier), if you see > fs/erofs/data.c > fs/erofs/namei.c > fs/erofs/dir.c > > or > erofs_super_block > erofs_inode_compact > erofs_inode_extended > erofs_dirent > > but for example, fs/erofs/super.c which is just used to enable EROFS advanced > features is almost 1000LOC now. But most code is quite trivial, I don't think > these can cause any difference to userns plan. > > Thanks, > Gao Xiang > > [1] https://lore.kernel.org/r/CAOQ4uxjm7i+uO4o4470ACctsft1m18EiUpxBfCeT-Wyqf1FAYg@mail.gmail.com/ > >> >> Thanks, >> Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-02 7:37 ` Gao Xiang @ 2023-02-03 11:32 ` Alexander Larsson 2023-02-03 12:46 ` Amir Goldstein 0 siblings, 1 reply; 87+ messages in thread From: Alexander Larsson @ 2023-02-03 11:32 UTC (permalink / raw) To: Gao Xiang, Amir Goldstein Cc: Jingbo Xu, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Miklos Szeredi On Thu, 2023-02-02 at 15:37 +0800, Gao Xiang wrote: > > > On 2023/2/2 15:17, Gao Xiang wrote: > > > > > > On 2023/2/2 14:37, Amir Goldstein wrote: > > > On Wed, Feb 1, 2023 at 1:22 PM Gao Xiang > > > <hsiangkao@linux.alibaba.com> wrote: > > > > > > > > > > > > > > > > On 2023/2/1 18:01, Gao Xiang wrote: > > > > > > > > > > > > > > > On 2023/2/1 17:46, Alexander Larsson wrote: > > > > > > > > > > ... > > > > > > > > > > > > > > > > > > > | uncached(ms)| > > > > > > > cached(ms) > > > > > > > ----------------------------------|-------------|-------- > > > > > > > --- > > > > > > > composefs (with digest) | 326 | 135 > > > > > > > erofs (w/o -T0) | 264 | 172 > > > > > > > erofs (w/o -T0) + overlayfs | 651 | 238 > > > > > > > squashfs (compressed) | 538 | 211 > > > > > > > squashfs (compressed) + overlayfs | 968 | 302 > > > > > > > > > > > > > > > > > > Clearly erofs with sparse files is the best fs now for the > > > > > > ro-fs + > > > > > > overlay case. But still, we can see that the additional > > > > > > cost of the > > > > > > overlayfs layer is not negligible. > > > > > > > > > > > > According to amir this could be helped by a special > > > > > > composefs-like mode > > > > > > in overlayfs, but its unclear what performance that would > > > > > > reach, and > > > > > > we're then talking net new development that further > > > > > > complicates the > > > > > > overlayfs codebase. Its not clear to me which alternative > > > > > > is easier to > > > > > > develop/maintain. > > > > > > > > > > > > Also, the difference between cached and uncached here is > > > > > > less than in > > > > > > my tests. Probably because my test image was larger. With > > > > > > the test > > > > > > image I use, the results are: > > > > > > > > > > > > | uncached(ms)| > > > > > > cached(ms) > > > > > > ----------------------------------|-------------|---------- > > > > > > - > > > > > > composefs (with digest) | 681 | 390 > > > > > > erofs (w/o -T0) + overlayfs | 1788 | 532 > > > > > > squashfs (compressed) + overlayfs | 2547 | 443 > > > > > > > > > > > > > > > > > > I gotta say it is weird though that squashfs performed > > > > > > better than > > > > > > erofs in the cached case. May be worth looking into. The > > > > > > test data I'm > > > > > > using is available here: > > > > > > > > > > As another wild guess, cached performance is a just vfs- > > > > > stuff. > > > > > > > > > > I think the performance difference may be due to ACL (since > > > > > both > > > > > composefs and squashfs don't support ACL). I already asked > > > > > Jingbo > > > > > to get more "perf data" to analyze this but he's now busy in > > > > > another > > > > > stuff. > > > > > > > > > > Again, my overall point is quite simple as always, currently > > > > > composefs is a read-only filesystem with massive symlink-like > > > > > files. > > > > > It behaves as a subset of all generic read-only filesystems > > > > > just > > > > > for this specific use cases. 
> > > > > > > > > > In facts there are many options to improve this (much like > > > > > Amir > > > > > said before): > > > > > 1) improve overlayfs, and then it can be used with any > > > > > local fs; > > > > > > > > > > 2) enhance erofs to support this (even without on-disk > > > > > change); > > > > > > > > > > 3) introduce fs/composefs; > > > > > > > > > > In addition to option 1), option 2) has many benefits as > > > > > well, since > > > > > your manifest files can save real regular files in addition > > > > > to composefs > > > > > model. > > > > > > > > (add some words..) > > > > > > > > My first response at that time (on Slack) was "kindly request > > > > Giuseppe to ask in the fsdevel mailing list if this new overlay > > > > model > > > > and use cases is feasable", if so, I'm much happy to integrate > > > > in to > > > > EROFS (in a cooperative way) in several ways: > > > > > > > > - just use EROFS symlink layout and open such file in a > > > > stacked way; > > > > > > > > or (now) > > > > > > > > - just identify overlayfs "trusted.overlay.redirect" in > > > > EROFS itself > > > > and open file so such image can be both used for EROFS > > > > only and > > > > EROFS + overlayfs. > > > > > > > > If that happened, then I think the overlayfs "metacopy" option > > > > can > > > > also be shown by other fs community people later (since I'm not > > > > an > > > > overlay expert), but I'm not sure why they becomes impossible > > > > finally > > > > and even not mentioned at all. > > > > > > > > Or if you guys really don't want to use EROFS for whatever > > > > reasons > > > > (EROFS is completely open-source, used, contributed by many > > > > vendors), > > > > you could improve squashfs, ext4, or other exist local fses > > > > with this > > > > new use cases (since they don't need any on-disk change as > > > > well, for > > > > example, by using some xattr), I don't think it's really hard. > > > > > > > > > > Engineering-wise, merging composefs features into EROFS > > > would be the simplest option and FWIW, my personal preference. > > > > > > However, you need to be aware that this will bring into EROFS > > > vfs considerations, such as s_stack_depth nesting (which AFAICS > > > is not see incremented composefs?). It's not the end of the > > > world, but this > > > is no longer plain fs over block game. There's a whole new class > > > of bugs > > > (that syzbot is very eager to explore) so you need to ask > > > yourself whether > > > this is a direction you want to lead EROFS towards. > > > > I'd like to make a seperated Kconfig for this. I consider this > > just because > > currently composefs is much similar to EROFS but it doesn't have > > some ability > > to keep real regular file (even some README, VERSION or Changelog > > in these > > images) in its (composefs-called) manifest files. Even its on-disk > > super block > > doesn't have a UUID now [1] and some boot sector for booting or > > some potential > > hybird formats such as tar + EROFS, cpio + EROFS. > > > > I'm not sure if those potential new on-disk features is unneeded > > even for > > future composefs. But if composefs laterly supports such on-disk > > features, > > that makes composefs closer to EROFS even more. I don't see > > disadvantage to > > make these actual on-disk compatible (like ext2 and ext4). > > > > The only difference now is manifest file itself I/O interface -- > > bio vs file. > > but EROFS can be distributed to raw block devices as well, > > composefs can't. 
> > > > Also, I'd like to seperate core-EROFS from advanced features (or > > people who > > are interested to work on this are always welcome) and composefs- > > like model, > > if people don't tend to use any EROFS advanced features, it could > > be disabled > > from compiling explicitly. > > Apart from that, I still fail to get some thoughts (apart from > unprivileged > mounts) how EROFS + overlayfs combination fails on automative real > workloads > aside from "ls -lR" (readdir + stat). > > And eventually we still need overlayfs for most use cases to do > writable > stuffs, anyway, it needs some words to describe why such < 1s > difference is > very very important to the real workload as you already mentioned > before. > > And with overlayfs lazy lookup, I think it can be close to ~100ms or > better. > If we had an overlay.fs-verity xattr, then I think there are no individual features lacking for it to work for the automotive usecase I'm working on. Nor for the OCI container usecase. However, the possibility of doing something doesn't mean it is the better technical solution. The container usecase is very important in real world Linux use today, and as such it makes sense to have a technically excellent solution for it, not just a workable solution. Obviously we all have different viewpoints of what that is, but these are the reasons why I think a composefs solution is better: * It is faster than all other approaches for the one thing it actually needs to do (lookup and readdir performance). Other kinds of performance (file i/o speed, etc) is up to the backing filesystem anyway. Even if there are possible approaches to make overlayfs perform better here (the "lazy lookup" idea) it will not reach the performance of composefs, while further complicating the overlayfs codebase. (btw, did someone ask Miklos what he thinks of that idea?) For the automotive usecase we have strict cold-boot time requirements that make cold-cache performance very important to us. Of course, there is no simple time requirements for the specific case of listing files in an image, but any improvement in cold-cache performance for both the ostree rootfs and the containers started during boot will be worth its weight in gold trying to reach these hard KPIs. * It uses less memory, as we don't need the extra inodes that comes with the overlayfs mount. (See profiling data in giuseppes mail[1]). The use of loopback vs directly reading the image file from page cache also have effects on memory use. Normally we have both the loopback file in page cache, plus the block cache for the loopback device. We could use loopback with O_DIRECT, but then we don't use the page cache for the image file, which I think could have performance implications. * The userspace API complexity of the combined overlayfs approach is much greater than for composefs, with more moving pieces. For composefs, all you need is a single mount syscall for set up. For the overlay approach you would need to first create a loopback device, then create a dm-verity device-mapper device from it, then mount the readonly fs, then mount the overlayfs. All this complexity has a cost in terms of setup/teardown performance, userspace complexity and overall memory use. Are any of these a hard blocker for the feature? Not really, but I would find it sad to use an (imho) worse solution. The other mentioned approach is to extend EROFS with composefs features. 
For this to be interesting to me it would have to include: * Direct reading of the image from page cache (not via loopback) * Ability to verify fs-verity digest of that image file * Support for stacked content files in a set of specified basedirs (not using fscache). * Verification of expected fs-verity digest for these basedir files Anything less than this and I think the overlayfs+erofs approach is a better choice. However, this is essentially just proposing we re-implement all the composefs code with a different name. And then we get a filesystem supporting *both* stacking and traditional block device use, which seems a bit weird to me. It will certainly make the erofs code more complex having to support all these combinations. Also, given the harsh arguments and accusations towards me on the list I don't feel very optimistic about how well such a cooperation would work. (A note about Kconfig options: I'm totally uninterested in using a custom build of erofs. We always use a standard distro kernel that has to support all possible uses of erofs, so we can't ship a neutered version of it.) [1] https://lore.kernel.org/lkml/87wn5ac2z6.fsf@redhat.com/ -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- =-=-= Alexander Larsson Red Hat, Inc alexl@redhat.com alexander.larsson@gmail.com He's a world-famous day-dreaming cop on his last day in the job. She's a plucky streetsmart wrestler descended from a line of powerful witches. They fight crime! ^ permalink raw reply [flat|nested] 87+ messages in thread
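To make the setup-complexity comparison above concrete, here is a minimal sketch of the two userspace paths. It is illustrative only: the composefs line uses the mount options proposed for composefs, while the erofs+overlayfs side assumes a metadata-only erofs image (metadata.erofs), a pre-generated dm-verity hash tree (metadata.hashtree) with root hash $ROOT_HASH, and a content store under /srv/data/objects; the device names, file names and overlay options are assumptions, and the exact lowerdir layout depends on how the redirect xattrs were generated.

  # composefs: a single mount call performs verification and setup
  mount -t composefs rootfs.img -o basedir=objects,digest=$IMG_DIGEST /mnt

  # erofs + dm-verity + overlayfs: several cooperating steps
  loopdev=$(losetup -f --show metadata.erofs)                            # 1. attach a loop device to the image
  veritysetup open "$loopdev" cfs-meta metadata.hashtree "$ROOT_HASH"    # 2. create the dm-verity device
  mount -t erofs -o ro /dev/mapper/cfs-meta /run/cfs-meta                # 3. mount the read-only metadata fs
  mount -t overlay overlay -o ro,metacopy=on,redirect_dir=follow \
        -o lowerdir=/run/cfs-meta:/srv/data /mnt                         # 4. overlay that resolves redirects into the data layer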
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-03 11:32 ` Alexander Larsson @ 2023-02-03 12:46 ` Amir Goldstein 2023-02-03 15:09 ` Gao Xiang 2023-02-06 12:43 ` Alexander Larsson 0 siblings, 2 replies; 87+ messages in thread From: Amir Goldstein @ 2023-02-03 12:46 UTC (permalink / raw) To: Alexander Larsson, Miklos Szeredi Cc: Gao Xiang, Jingbo Xu, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Josef Bacik > > > > Engineering-wise, merging composefs features into EROFS > > > > would be the simplest option and FWIW, my personal preference. > > > > > > > > However, you need to be aware that this will bring into EROFS > > > > vfs considerations, such as s_stack_depth nesting (which AFAICS > > > > is not see incremented composefs?). It's not the end of the > > > > world, but this > > > > is no longer plain fs over block game. There's a whole new class > > > > of bugs > > > > (that syzbot is very eager to explore) so you need to ask > > > > yourself whether > > > > this is a direction you want to lead EROFS towards. > > > > > > I'd like to make a seperated Kconfig for this. I consider this > > > just because > > > currently composefs is much similar to EROFS but it doesn't have > > > some ability > > > to keep real regular file (even some README, VERSION or Changelog > > > in these > > > images) in its (composefs-called) manifest files. Even its on-disk > > > super block > > > doesn't have a UUID now [1] and some boot sector for booting or > > > some potential > > > hybird formats such as tar + EROFS, cpio + EROFS. > > > > > > I'm not sure if those potential new on-disk features is unneeded > > > even for > > > future composefs. But if composefs laterly supports such on-disk > > > features, > > > that makes composefs closer to EROFS even more. I don't see > > > disadvantage to > > > make these actual on-disk compatible (like ext2 and ext4). > > > > > > The only difference now is manifest file itself I/O interface -- > > > bio vs file. > > > but EROFS can be distributed to raw block devices as well, > > > composefs can't. > > > > > > Also, I'd like to seperate core-EROFS from advanced features (or > > > people who > > > are interested to work on this are always welcome) and composefs- > > > like model, > > > if people don't tend to use any EROFS advanced features, it could > > > be disabled > > > from compiling explicitly. > > > > Apart from that, I still fail to get some thoughts (apart from > > unprivileged > > mounts) how EROFS + overlayfs combination fails on automative real > > workloads > > aside from "ls -lR" (readdir + stat). > > > > And eventually we still need overlayfs for most use cases to do > > writable > > stuffs, anyway, it needs some words to describe why such < 1s > > difference is > > very very important to the real workload as you already mentioned > > before. > > > > And with overlayfs lazy lookup, I think it can be close to ~100ms or > > better. > > > > If we had an overlay.fs-verity xattr, then I think there are no > individual features lacking for it to work for the automotive usecase > I'm working on. Nor for the OCI container usecase. However, the > possibility of doing something doesn't mean it is the better technical > solution. > > The container usecase is very important in real world Linux use today, > and as such it makes sense to have a technically excellent solution for > it, not just a workable solution. 
Obviously we all have different > viewpoints of what that is, but these are the reasons why I think a > composefs solution is better: > > * It is faster than all other approaches for the one thing it actually > needs to do (lookup and readdir performance). Other kinds of > performance (file i/o speed, etc) is up to the backing filesystem > anyway. > > Even if there are possible approaches to make overlayfs perform better > here (the "lazy lookup" idea) it will not reach the performance of > composefs, while further complicating the overlayfs codebase. (btw, did > someone ask Miklos what he thinks of that idea?) > Well, Miklos was CCed (now in TO:) I did ask him specifically about relaxing -ouserxarr,metacopy,redirect: https://lore.kernel.org/linux-unionfs/20230126082228.rweg75ztaexykejv@wittgenstein/T/#mc375df4c74c0d41aa1a2251c97509c6522487f96 but no response on that yet. TBH, in the end, Miklos really is the one who is going to have the most weight on the outcome. If Miklos is interested in adding this functionality to overlayfs, you are going to have a VERY hard sell, trying to merge composefs as an independent expert filesystem. The community simply does not approve of this sort of fragmentation unless there is a very good reason to do that. > For the automotive usecase we have strict cold-boot time requirements > that make cold-cache performance very important to us. Of course, there > is no simple time requirements for the specific case of listing files > in an image, but any improvement in cold-cache performance for both the > ostree rootfs and the containers started during boot will be worth its > weight in gold trying to reach these hard KPIs. > > * It uses less memory, as we don't need the extra inodes that comes > with the overlayfs mount. (See profiling data in giuseppes mail[1]). Understood, but we will need profiling data with the optimized ovl (or with the single blob hack) to compare the relevant alternatives. > > The use of loopback vs directly reading the image file from page cache > also have effects on memory use. Normally we have both the loopback > file in page cache, plus the block cache for the loopback device. We > could use loopback with O_DIRECT, but then we don't use the page cache > for the image file, which I think could have performance implications. > I am not sure this is correct. The loop blockdev page cache can be used, for reading metadata, can it not? But that argument is true for EROFS and for almost every other fs that could be mounted with -oloop. If the loopdev overhead is a problem and O_DIRECT is not a good enough solution, then you should work on a generic solution that all fs could use. > * The userspace API complexity of the combined overlayfs approach is > much greater than for composefs, with more moving pieces. For > composefs, all you need is a single mount syscall for set up. For the > overlay approach you would need to first create a loopback device, then > create a dm-verity device-mapper device from it, then mount the > readonly fs, then mount the overlayfs. Userspace API complexity has never been and will never be a reason for making changes in the kernel, let alone add a new filesystem driver. Userspace API complexity can be hidden behind a userspace expert library. You can even create a mount.composefs helper that users can use mount -t composefs that sets up erofs+overlayfs behind the scenes. Similarly, mkfs.composefs can be an alias to mkfs.erofs with a specific set of preset options, much like mkfs.ext* family. 
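For what it's worth, such a helper is easy to sketch: mount(8) executes /sbin/mount.<type> for filesystem types it does not know, so a hypothetical mount.composefs could hide the erofs+overlayfs plumbing entirely. Everything below is illustrative only -- the helper name, the option parsing and the chosen overlay options are assumptions, not an existing tool, and dm-verity setup is omitted for brevity:

  #!/bin/sh -e
  # hypothetical /sbin/mount.composefs: invoked by mount(8) as
  #   mount.composefs <image> <dir> -o <options>
  img=$1; mnt=$2; shift 2
  [ "$1" = "-o" ] && opts=$2
  basedir=$(printf '%s\n' "$opts" | tr ',' '\n' | sed -n 's/^basedir=//p')
  meta=$(mktemp -d)
  mount -t erofs -o ro,loop "$img" "$meta"          # metadata-only image
  exec mount -t overlay overlay \
       -o "ro,metacopy=on,redirect_dir=follow,lowerdir=$meta:$basedir" "$mnt"

A mkfs.composefs wrapper around mkfs.erofs could be shipped next to it, so that users never have to see the individual pieces.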
> All this complexity has a cost > in terms of setup/teardown performance, userspace complexity and > overall memory use. > This claim needs to be quantified *after* the proposed improvements (or equivalent hack) to existing subsystems. > Are any of these a hard blocker for the feature? Not really, but I > would find it sad to use an (imho) worse solution. > I respect your emotion and it is not uncommon for people to want to see their creation merged as is, but from personal experience, it is often a much better option for you, to have your code merge into an existing subsystem. I think if you knew all the advantages, you would have fought for this option yourself ;) > > > The other mentioned approach is to extend EROFS with composefs > features. For this to be interesting to me it would have to include: > > * Direct reading of the image from page cache (not via loopback) > * Ability to verify fs-verity digest of that image file > * Support for stacked content files in a set of specified basedirs > (not using fscache). > * Verification of expected fs-verity digest for these basedir files > > Anything less than this and I think the overlayfs+erofs approach is a > better choice. > > However, this is essentially just proposing we re-implement all the > composefs code with a different name. And then we get a filesystem > supporting *both* stacking and traditional block device use, which > seems a bit weird to me. It will certainly make the erofs code more > complex having to support all these combinations. Also, given the harsh > arguments and accusations towards me on the list I don't feel very > optimistic about how well such a cooperation would work. > I understand why you write that and I am sorry that you feel this way. This is a good opportunity to urge you and Giuseppe again to request an invite to LSFMM [1] and propose composefs vs. erofs+ovl as a TOPIC. Meeting the developers in person is often the best way to understand each other in situations just like this one where the email discussions fail to remain on a purely technical level and our emotions get involved. It is just too hard to express emotions accurately in emails and people are so very often misunderstood when that happens. I guarantee you that it is much more pleasant to argue with people over email after you have met them in person ;) Thanks, Amir. [1] https://lore.kernel.org/linux-fsdevel/Y9qBs82f94aV4%2F78@localhost.localdomain/ ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-03 12:46 ` Amir Goldstein @ 2023-02-03 15:09 ` Gao Xiang 2023-02-05 19:06 ` Amir Goldstein 2023-02-06 12:43 ` Alexander Larsson 1 sibling, 1 reply; 87+ messages in thread From: Gao Xiang @ 2023-02-03 15:09 UTC (permalink / raw) To: Amir Goldstein, Alexander Larsson, Miklos Szeredi Cc: Jingbo Xu, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Josef Bacik On 2023/2/3 20:46, Amir Goldstein wrote: >>>>> Engineering-wise, merging composefs features into EROFS >>>>> would be the simplest option and FWIW, my personal preference. >>>>> >>>>> However, you need to be aware that this will bring into EROFS >>>>> vfs considerations, such as s_stack_depth nesting (which AFAICS >>>>> is not see incremented composefs?). It's not the end of the >>>>> world, but this >>>>> is no longer plain fs over block game. There's a whole new class >>>>> of bugs >>>>> (that syzbot is very eager to explore) so you need to ask >>>>> yourself whether >>>>> this is a direction you want to lead EROFS towards. >>>> >>>> I'd like to make a seperated Kconfig for this. I consider this >>>> just because >>>> currently composefs is much similar to EROFS but it doesn't have >>>> some ability >>>> to keep real regular file (even some README, VERSION or Changelog >>>> in these >>>> images) in its (composefs-called) manifest files. Even its on-disk >>>> super block >>>> doesn't have a UUID now [1] and some boot sector for booting or >>>> some potential >>>> hybird formats such as tar + EROFS, cpio + EROFS. >>>> >>>> I'm not sure if those potential new on-disk features is unneeded >>>> even for >>>> future composefs. But if composefs laterly supports such on-disk >>>> features, >>>> that makes composefs closer to EROFS even more. I don't see >>>> disadvantage to >>>> make these actual on-disk compatible (like ext2 and ext4). >>>> >>>> The only difference now is manifest file itself I/O interface -- >>>> bio vs file. >>>> but EROFS can be distributed to raw block devices as well, >>>> composefs can't. >>>> >>>> Also, I'd like to seperate core-EROFS from advanced features (or >>>> people who >>>> are interested to work on this are always welcome) and composefs- >>>> like model, >>>> if people don't tend to use any EROFS advanced features, it could >>>> be disabled >>>> from compiling explicitly. >>> >>> Apart from that, I still fail to get some thoughts (apart from >>> unprivileged >>> mounts) how EROFS + overlayfs combination fails on automative real >>> workloads >>> aside from "ls -lR" (readdir + stat). >>> >>> And eventually we still need overlayfs for most use cases to do >>> writable >>> stuffs, anyway, it needs some words to describe why such < 1s >>> difference is >>> very very important to the real workload as you already mentioned >>> before. >>> >>> And with overlayfs lazy lookup, I think it can be close to ~100ms or >>> better. >>> >> >> If we had an overlay.fs-verity xattr, then I think there are no >> individual features lacking for it to work for the automotive usecase >> I'm working on. Nor for the OCI container usecase. However, the >> possibility of doing something doesn't mean it is the better technical >> solution. >> >> The container usecase is very important in real world Linux use today, >> and as such it makes sense to have a technically excellent solution for >> it, not just a workable solution. 
>> Obviously we all have different >> viewpoints of what that is, but these are the reasons why I think a >> composefs solution is better: >> >> * It is faster than all other approaches for the one thing it actually >> needs to do (lookup and readdir performance). Other kinds of >> performance (file i/o speed, etc) is up to the backing filesystem >> anyway. >> >> Even if there are possible approaches to make overlayfs perform better >> here (the "lazy lookup" idea) it will not reach the performance of >> composefs, while further complicating the overlayfs codebase. (btw, did >> someone ask Miklos what he thinks of that idea?) >> > > Well, Miklos was CCed (now in TO:) > I did ask him specifically about relaxing -ouserxarr,metacopy,redirect: > https://lore.kernel.org/linux-unionfs/20230126082228.rweg75ztaexykejv@wittgenstein/T/#mc375df4c74c0d41aa1a2251c97509c6522487f96 > but no response on that yet. > > TBH, in the end, Miklos really is the one who is going to have the most > weight on the outcome. > > If Miklos is interested in adding this functionality to overlayfs, you are going > to have a VERY hard sell, trying to merge composefs as an independent > expert filesystem. The community simply does not approve of this sort of > fragmentation unless there is a very good reason to do that. > >> For the automotive usecase we have strict cold-boot time requirements >> that make cold-cache performance very important to us. Of course, there >> is no simple time requirements for the specific case of listing files >> in an image, but any improvement in cold-cache performance for both the >> ostree rootfs and the containers started during boot will be worth its >> weight in gold trying to reach these hard KPIs. >> >> * It uses less memory, as we don't need the extra inodes that comes >> with the overlayfs mount. (See profiling data in giuseppes mail[1]). > > Understood, but we will need profiling data with the optimized ovl > (or with the single blob hack) to compare the relevant alternatives. My little request again: could you benchmark your real workload rather than the "ls -lR" stuff? If your hard KPI is really what you say it is, why not benchmark the real workload now and write a detailed analysis for everyone that explains why it's a _must_ to upstream a new stacked fs for this? My own argument is that I don't see in-tree fses that were designed around one narrow, specific use case; maybe there was something like _omfs_, but I don't know who uses omfs now. Ostree is admittedly a stronger case since it has a massive user base now, but there is already a replacement (erofs+ovl) and other options. If you want extreme performance, EROFS could have an inline writable overlay layer rather than introducing another overlayfs on top, but why would we need that? By the same logic, union mounts could have been introduced into the VFS back then rather than the overlayfs approach that was actually chosen. > >> >> The use of loopback vs directly reading the image file from page cache >> also have effects on memory use. Normally we have both the loopback >> file in page cache, plus the block cache for the loopback device. We >> could use loopback with O_DIRECT, but then we don't use the page cache >> for the image file, which I think could have performance implications. As a kernel filesystem developer, I really don't know what you mean here. Also, we've already tested all of these combinations internally recently, and the manifest files don't even show a heavy I/O pattern.
Loopback device here is not a sensitive stuff (except that you really care <10ms difference due to loopback device.) >> > > I am not sure this is correct. The loop blockdev page cache can be used, > for reading metadata, can it not? > But that argument is true for EROFS and for almost every other fs > that could be mounted with -oloop. > If the loopdev overhead is a problem and O_DIRECT is not a good enough > solution, then you should work on a generic solution that all fs could use. > >> * The userspace API complexity of the combined overlayfs approach is >> much greater than for composefs, with more moving pieces. For >> composefs, all you need is a single mount syscall for set up. For the >> overlay approach you would need to first create a loopback device, then >> create a dm-verity device-mapper device from it, then mount the >> readonly fs, then mount the overlayfs. > > Userspace API complexity has never been and will never be a reason > for making changes in the kernel, let alone add a new filesystem driver. > Userspace API complexity can be hidden behind a userspace expert library. > You can even create a mount.composefs helper that users can use > mount -t composefs that sets up erofs+overlayfs behind the scenes. > > Similarly, mkfs.composefs can be an alias to mkfs.erofs with a specific > set of preset options, much like mkfs.ext* family. > >> All this complexity has a cost >> in terms of setup/teardown performance, userspace complexity and >> overall memory use. >> > > This claim needs to be quantified *after* the proposed improvements > (or equivalent hack) to existing subsystems. > >> Are any of these a hard blocker for the feature? Not really, but I >> would find it sad to use an (imho) worse solution. >> > > I respect your emotion and it is not uncommon for people to want > to see their creation merged as is, but from personal experience, > it is often a much better option for you, to have your code merge into > an existing subsystem. I think if you knew all the advantages, you > would have fought for this option yourself ;) > >> >> >> The other mentioned approach is to extend EROFS with composefs >> features. For this to be interesting to me it would have to include: >> >> * Direct reading of the image from page cache (not via loopback) >> * Ability to verify fs-verity digest of that image file >> * Support for stacked content files in a set of specified basedirs >> (not using fscache). >> * Verification of expected fs-verity digest for these basedir files >> >> Anything less than this and I think the overlayfs+erofs approach is a >> better choice. >> >> However, this is essentially just proposing we re-implement all the >> composefs code with a different name. And then we get a filesystem I don't think loopback here is really an issue, really, for your workload. It cannot become an issue since real metadata I/O access is almost random access. >> supporting *both* stacking and traditional block device use, which >> seems a bit weird to me. It will certainly make the erofs code more >> complex having to support all these combinations. Also, given the harsh >> arguments and accusations towards me on the list I don't feel very >> optimistic about how well such a cooperation would work. >> > > I understand why you write that and I am sorry that you feel this way. > This is a good opportunity to urge you and Giuseppe again to request > an invite to LSFMM [1] and propose composefs vs. erofs+ovl as a TOPIC. 
> > Meeting the developers in person is often the best way to understand each > other in situations just like this one where the email discussions fail to > remain on a purely technical level and our emotions get involved. > It is just too hard to express emotions accurately in emails and people are > so very often misunderstood when that happens. > > I guarantee you that it is much more pleasant to argue with people over email > after you have met them in person ;) What I'd like to say has already been said in the previous emails; there is no need to repeat it. Since my tourist visa application to Belgium was refused, I had no way to speak at FOSDEM 23 on site. Also, I'm not the one who decides whether or not to add such a new filesystem; these are just my own comments. I'm very happy to discuss the userns stuff at LSF/MM/BPF 2023 if that is possible. As for EROFS, at the time we carefully analyzed all the existing in-kernel read-only fses and designed it carefully together with Miao Xie (who was a btrfs developer for many years), Chao Yu (an f2fs developer for many years) and others, with efforts including (but not limited to): - shipping it in billions of products; - writing an ATC paper with microbenchmarks and real workloads; - speaking at LSF/MM/BPF 2019. Best wishes that you won't add any on-disk features beyond the userns stuff, and that you keep the codebase below 2k LoC as you have always promised. Thanks, Gao Xiang > Thanks, > Amir. > > [1] https://lore.kernel.org/linux-fsdevel/Y9qBs82f94aV4%2F78@localhost.localdomain/ ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-03 15:09 ` Gao Xiang @ 2023-02-05 19:06 ` Amir Goldstein 2023-02-06 7:59 ` Amir Goldstein ` (3 more replies) 0 siblings, 4 replies; 87+ messages in thread From: Amir Goldstein @ 2023-02-05 19:06 UTC (permalink / raw) To: Alexander Larsson Cc: Miklos Szeredi, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Josef Bacik, Gao Xiang, Jingbo Xu > >>> Apart from that, I still fail to get some thoughts (apart from > >>> unprivileged > >>> mounts) how EROFS + overlayfs combination fails on automative real > >>> workloads > >>> aside from "ls -lR" (readdir + stat). > >>> > >>> And eventually we still need overlayfs for most use cases to do > >>> writable > >>> stuffs, anyway, it needs some words to describe why such < 1s > >>> difference is > >>> very very important to the real workload as you already mentioned > >>> before. > >>> > >>> And with overlayfs lazy lookup, I think it can be close to ~100ms or > >>> better. > >>> > >> > >> If we had an overlay.fs-verity xattr, then I think there are no > >> individual features lacking for it to work for the automotive usecase > >> I'm working on. Nor for the OCI container usecase. However, the > >> possibility of doing something doesn't mean it is the better technical > >> solution. > >> > >> The container usecase is very important in real world Linux use today, > >> and as such it makes sense to have a technically excellent solution for > >> it, not just a workable solution. Obviously we all have different > >> viewpoints of what that is, but these are the reasons why I think a > >> composefs solution is better: > >> > >> * It is faster than all other approaches for the one thing it actually > >> needs to do (lookup and readdir performance). Other kinds of > >> performance (file i/o speed, etc) is up to the backing filesystem > >> anyway. > >> > >> Even if there are possible approaches to make overlayfs perform better > >> here (the "lazy lookup" idea) it will not reach the performance of > >> composefs, while further complicating the overlayfs codebase. (btw, did > >> someone ask Miklos what he thinks of that idea?) > >> > > > > Well, Miklos was CCed (now in TO:) > > I did ask him specifically about relaxing -ouserxarr,metacopy,redirect: > > https://lore.kernel.org/linux-unionfs/20230126082228.rweg75ztaexykejv@wittgenstein/T/#mc375df4c74c0d41aa1a2251c97509c6522487f96 > > but no response on that yet. > > > > TBH, in the end, Miklos really is the one who is going to have the most > > weight on the outcome. > > > > If Miklos is interested in adding this functionality to overlayfs, you are going > > to have a VERY hard sell, trying to merge composefs as an independent > > expert filesystem. The community simply does not approve of this sort of > > fragmentation unless there is a very good reason to do that. > > > >> For the automotive usecase we have strict cold-boot time requirements > >> that make cold-cache performance very important to us. Of course, there > >> is no simple time requirements for the specific case of listing files > >> in an image, but any improvement in cold-cache performance for both the > >> ostree rootfs and the containers started during boot will be worth its > >> weight in gold trying to reach these hard KPIs. > >> > >> * It uses less memory, as we don't need the extra inodes that comes > >> with the overlayfs mount. (See profiling data in giuseppes mail[1]). 
> > > > Understood, but we will need profiling data with the optimized ovl > > (or with the single blob hack) to compare the relevant alternatives. > > My little request again, could you help benchmark on your real workload > rather than "ls -lR" stuff? If your hard KPI is really what as you > said, why not just benchmark the real workload now and write a detailed > analysis to everyone to explain it's a _must_ that we should upstream > a new stacked fs for this? > I agree that benchmarking the actual KPI (boot time) will have a much stronger impact and help to build a much stronger case for composefs if you can prove that the boot time difference really matters. In order to test boot time on fair grounds, I prepared for you a POC branch with overlayfs lazy lookup: https://github.com/amir73il/linux/commits/ovl-lazy-lowerdata It is very lightly tested, but should be sufficient for the benchmark. Note that: 1. You need to opt-in with redirect_dir=lazyfollow,metacopy=on 2. The lazyfollow POC only works with read-only overlay that has two lower dirs (1 metadata layer and one data blobs layer) 3. The data layer must be a local blockdev fs (i.e. not a network fs) 4. Only absolute path redirects are lazy (e.g. "/objects/cc/3da...") These limitations could be easily lifted with a bit more work. If any of those limitations stand in your way for running the benchmark let me know and I'll see what I can do. If there is any issue with the POC branch, please let me know. Thanks, Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
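For reference, a mount sequence for benchmarking against this POC might look roughly as follows. This is a sketch based only on the notes above -- the image name, mount points and data-layer path are assumptions; the only options taken from the POC description are redirect_dir=lazyfollow and metacopy=on, and the overlay has to be read-only with one metadata layer and one data layer:

  mount -t erofs -o ro,loop metadata.erofs /run/meta        # metadata-only layer carrying absolute redirects
  mount -t overlay overlay -o ro,metacopy=on,redirect_dir=lazyfollow \
        -o lowerdir=/run/meta:/srv/data /mnt                # /srv/data holds the content-addressed blobs

With lazyfollow, redirects such as "/objects/cc/3da..." are resolved in the data layer only when a file's data is first accessed, not during lookup or readdir.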
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-05 19:06 ` Amir Goldstein @ 2023-02-06 7:59 ` Amir Goldstein 2023-02-06 10:35 ` Miklos Szeredi ` (2 subsequent siblings) 3 siblings, 0 replies; 87+ messages in thread From: Amir Goldstein @ 2023-02-06 7:59 UTC (permalink / raw) To: Alexander Larsson Cc: Miklos Szeredi, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Josef Bacik, Gao Xiang, Jingbo Xu On Sun, Feb 5, 2023 at 9:06 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > >>> Apart from that, I still fail to get some thoughts (apart from > > >>> unprivileged > > >>> mounts) how EROFS + overlayfs combination fails on automative real > > >>> workloads > > >>> aside from "ls -lR" (readdir + stat). > > >>> > > >>> And eventually we still need overlayfs for most use cases to do > > >>> writable > > >>> stuffs, anyway, it needs some words to describe why such < 1s > > >>> difference is > > >>> very very important to the real workload as you already mentioned > > >>> before. > > >>> > > >>> And with overlayfs lazy lookup, I think it can be close to ~100ms or > > >>> better. > > >>> > > >> > > >> If we had an overlay.fs-verity xattr, then I think there are no > > >> individual features lacking for it to work for the automotive usecase > > >> I'm working on. Nor for the OCI container usecase. However, the > > >> possibility of doing something doesn't mean it is the better technical > > >> solution. > > >> > > >> The container usecase is very important in real world Linux use today, > > >> and as such it makes sense to have a technically excellent solution for > > >> it, not just a workable solution. Obviously we all have different > > >> viewpoints of what that is, but these are the reasons why I think a > > >> composefs solution is better: > > >> > > >> * It is faster than all other approaches for the one thing it actually > > >> needs to do (lookup and readdir performance). Other kinds of > > >> performance (file i/o speed, etc) is up to the backing filesystem > > >> anyway. > > >> > > >> Even if there are possible approaches to make overlayfs perform better > > >> here (the "lazy lookup" idea) it will not reach the performance of > > >> composefs, while further complicating the overlayfs codebase. (btw, did > > >> someone ask Miklos what he thinks of that idea?) > > >> > > > > > > Well, Miklos was CCed (now in TO:) > > > I did ask him specifically about relaxing -ouserxarr,metacopy,redirect: > > > https://lore.kernel.org/linux-unionfs/20230126082228.rweg75ztaexykejv@wittgenstein/T/#mc375df4c74c0d41aa1a2251c97509c6522487f96 > > > but no response on that yet. > > > > > > TBH, in the end, Miklos really is the one who is going to have the most > > > weight on the outcome. > > > > > > If Miklos is interested in adding this functionality to overlayfs, you are going > > > to have a VERY hard sell, trying to merge composefs as an independent > > > expert filesystem. The community simply does not approve of this sort of > > > fragmentation unless there is a very good reason to do that. > > > > > >> For the automotive usecase we have strict cold-boot time requirements > > >> that make cold-cache performance very important to us. Of course, there > > >> is no simple time requirements for the specific case of listing files > > >> in an image, but any improvement in cold-cache performance for both the > > >> ostree rootfs and the containers started during boot will be worth its > > >> weight in gold trying to reach these hard KPIs. 
> > >> > > >> * It uses less memory, as we don't need the extra inodes that comes > > >> with the overlayfs mount. (See profiling data in giuseppes mail[1]). > > > > > > Understood, but we will need profiling data with the optimized ovl > > > (or with the single blob hack) to compare the relevant alternatives. > > > > My little request again, could you help benchmark on your real workload > > rather than "ls -lR" stuff? If your hard KPI is really what as you > > said, why not just benchmark the real workload now and write a detailed > > analysis to everyone to explain it's a _must_ that we should upstream > > a new stacked fs for this? > > > > I agree that benchmarking the actual KPI (boot time) will have > a much stronger impact and help to build a much stronger case > for composefs if you can prove that the boot time difference really matters. > > In order to test boot time on fair grounds, I prepared for you a POC > branch with overlayfs lazy lookup: > https://github.com/amir73il/linux/commits/ovl-lazy-lowerdata > > It is very lightly tested, but should be sufficient for the benchmark. > Note that: > 1. You need to opt-in with redirect_dir=lazyfollow,metacopy=on > 2. The lazyfollow POC only works with read-only overlay that > has two lower dirs (1 metadata layer and one data blobs layer) > 3. The data layer must be a local blockdev fs (i.e. not a network fs) > 4. Only absolute path redirects are lazy (e.g. "/objects/cc/3da...") Forgot to mention that 5. The redirect path should be a realpath within the local fs - symlinks are not followed. > > These limitations could be easily lifted with a bit more work. > If any of those limitations stand in your way for running the benchmark > let me know and I'll see what I can do. > > If there is any issue with the POC branch, please let me know. > Thanks, Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-05 19:06 ` Amir Goldstein 2023-02-06 7:59 ` Amir Goldstein @ 2023-02-06 10:35 ` Miklos Szeredi 2023-02-06 13:30 ` Amir Goldstein 2023-02-06 12:51 ` [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem Alexander Larsson 2023-02-07 8:12 ` Jingbo Xu 3 siblings, 1 reply; 87+ messages in thread From: Miklos Szeredi @ 2023-02-06 10:35 UTC (permalink / raw) To: Amir Goldstein Cc: Alexander Larsson, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Josef Bacik, Gao Xiang, Jingbo Xu On Sun, 5 Feb 2023 at 20:06, Amir Goldstein <amir73il@gmail.com> wrote: > > > >>> Apart from that, I still fail to get some thoughts (apart from > > >>> unprivileged > > >>> mounts) how EROFS + overlayfs combination fails on automative real > > >>> workloads > > >>> aside from "ls -lR" (readdir + stat). > > >>> > > >>> And eventually we still need overlayfs for most use cases to do > > >>> writable > > >>> stuffs, anyway, it needs some words to describe why such < 1s > > >>> difference is > > >>> very very important to the real workload as you already mentioned > > >>> before. > > >>> > > >>> And with overlayfs lazy lookup, I think it can be close to ~100ms or > > >>> better. > > >>> > > >> > > >> If we had an overlay.fs-verity xattr, then I think there are no > > >> individual features lacking for it to work for the automotive usecase > > >> I'm working on. Nor for the OCI container usecase. However, the > > >> possibility of doing something doesn't mean it is the better technical > > >> solution. > > >> > > >> The container usecase is very important in real world Linux use today, > > >> and as such it makes sense to have a technically excellent solution for > > >> it, not just a workable solution. Obviously we all have different > > >> viewpoints of what that is, but these are the reasons why I think a > > >> composefs solution is better: > > >> > > >> * It is faster than all other approaches for the one thing it actually > > >> needs to do (lookup and readdir performance). Other kinds of > > >> performance (file i/o speed, etc) is up to the backing filesystem > > >> anyway. > > >> > > >> Even if there are possible approaches to make overlayfs perform better > > >> here (the "lazy lookup" idea) it will not reach the performance of > > >> composefs, while further complicating the overlayfs codebase. (btw, did > > >> someone ask Miklos what he thinks of that idea?) > > >> > > > > > > Well, Miklos was CCed (now in TO:) > > > I did ask him specifically about relaxing -ouserxarr,metacopy,redirect: > > > https://lore.kernel.org/linux-unionfs/20230126082228.rweg75ztaexykejv@wittgenstein/T/#mc375df4c74c0d41aa1a2251c97509c6522487f96 > > > but no response on that yet. > > > > > > TBH, in the end, Miklos really is the one who is going to have the most > > > weight on the outcome. > > > > > > If Miklos is interested in adding this functionality to overlayfs, you are going > > > to have a VERY hard sell, trying to merge composefs as an independent > > > expert filesystem. The community simply does not approve of this sort of > > > fragmentation unless there is a very good reason to do that. > > > > > >> For the automotive usecase we have strict cold-boot time requirements > > >> that make cold-cache performance very important to us. 
Of course, there > > >> is no simple time requirements for the specific case of listing files > > >> in an image, but any improvement in cold-cache performance for both the > > >> ostree rootfs and the containers started during boot will be worth its > > >> weight in gold trying to reach these hard KPIs. > > >> > > >> * It uses less memory, as we don't need the extra inodes that comes > > >> with the overlayfs mount. (See profiling data in giuseppes mail[1]). > > > > > > Understood, but we will need profiling data with the optimized ovl > > > (or with the single blob hack) to compare the relevant alternatives. > > > > My little request again, could you help benchmark on your real workload > > rather than "ls -lR" stuff? If your hard KPI is really what as you > > said, why not just benchmark the real workload now and write a detailed > > analysis to everyone to explain it's a _must_ that we should upstream > > a new stacked fs for this? > > > > I agree that benchmarking the actual KPI (boot time) will have > a much stronger impact and help to build a much stronger case > for composefs if you can prove that the boot time difference really matters. > > In order to test boot time on fair grounds, I prepared for you a POC > branch with overlayfs lazy lookup: > https://github.com/amir73il/linux/commits/ovl-lazy-lowerdata Sorry about being late to the party... Can you give a little detail about what exactly this does? Thanks, Miklos ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-06 10:35 ` Miklos Szeredi @ 2023-02-06 13:30 ` Amir Goldstein 2023-02-06 16:34 ` Miklos Szeredi 0 siblings, 1 reply; 87+ messages in thread From: Amir Goldstein @ 2023-02-06 13:30 UTC (permalink / raw) To: Miklos Szeredi Cc: Alexander Larsson, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Josef Bacik, Gao Xiang, Jingbo Xu > > > My little request again, could you help benchmark on your real workload > > > rather than "ls -lR" stuff? If your hard KPI is really what as you > > > said, why not just benchmark the real workload now and write a detailed > > > analysis to everyone to explain it's a _must_ that we should upstream > > > a new stacked fs for this? > > > > > > > I agree that benchmarking the actual KPI (boot time) will have > > a much stronger impact and help to build a much stronger case > > for composefs if you can prove that the boot time difference really matters. > > > > In order to test boot time on fair grounds, I prepared for you a POC > > branch with overlayfs lazy lookup: > > https://github.com/amir73il/linux/commits/ovl-lazy-lowerdata > > Sorry about being late to the party... > > Can you give a little detail about what exactly this does? > Consider a container image distribution system, with base images and derived images and instructions on how to compose these images using overlayfs or other methods. Consider a derived image L3 that depends on images L2, L1. With the composefs methodology, the image distribution server splits each image into metadata-only (metacopy) images M3, M2, M1 and their underlying data images containing content addressable blobs D3, D2, D1. The image distribution server goes on to merge the metadata layers on the server, so U3 = M3 + M2 + M1. In order to start image L3, the container client will unpack the data layers D3, D2, D1 to the local fs normally, but the server-merged U3 metadata image will be distributed as a read-only fsverity signed image that can be mounted by mount -t composefs U3.img (much like mount -t erofs -o loop U3.img). The composefs image format contains a "redirect" instruction to the data blob path and an fsverity signature that can be used to verify the redirected data content. When the composefs authors proposed to merge composefs, Gao and I pointed out that the same functionality can be achieved with minimal changes using erofs+overlayfs. The composefs authors have presented ls -lR time and memory usage benchmarks that demonstrate how composefs performs better than erofs+overlayfs in this workload and explained that the lookup of the data blobs is what takes the extra time and memory in the erofs+overlayfs ls -lR test. The lazyfollow POC optimizes out the lowerdata lookup for the ls -lR benchmark, so that composefs could be compared to erofs+overlayfs. To answer Alexander's question: > Cool. I'll play around with this. Does this need to be an opt-in > option in the final version? It feels like this could be useful to > improve performance in general for overlayfs, for example when > metacopy is used in container layers. I think lazyfollow could be enabled by default after we have hashed out all the bugs and corner cases and, most importantly, removed the POC limitation of lower-only overlay. The feedback that the composefs authors are asking from you is whether you will agree to consider adding the "lazyfollow lower data" optimization and the "fsverity signature for metacopy" feature to overlayfs?
If you do agree, then I think they should invest their resources in making those improvements to overlayfs and perhaps other improvements to erofs, rather than proposing a new specialized filesystem. Thanks, Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
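To make the client-side flow that Amir describes above a bit more concrete, a rough sketch follows; all image names, directories and mount points are illustrative (they are not taken from the thread), and the overlayfs variant assumes the lazyfollow POC options quoted earlier. The composefs variant mounts the server-merged, fs-verity signed metadata image and lets it redirect into the locally unpacked blobs:

# mount -t composefs U3.img -o basedir=/var/lib/blobs/objects /mnt/L3

The erofs+overlayfs variant of the same idea mounts the merged metadata image as an erofs layer and lets overlayfs follow its absolute-path redirects into the blob store:

# mount -t erofs -o ro,loop U3.erofs.img /run/L3-meta
# mount -t overlay overlay -o redirect_dir=lazyfollow,metacopy=on,lowerdir=/run/L3-meta:/var/lib/blobs /mnt/L3

In both cases the data layers D1-D3 are assumed to have been unpacked into the local blob store beforehand; only the handling of the metadata image differs between the two approaches.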
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-06 13:30 ` Amir Goldstein @ 2023-02-06 16:34 ` Miklos Szeredi 2023-02-06 17:16 ` Amir Goldstein 0 siblings, 1 reply; 87+ messages in thread From: Miklos Szeredi @ 2023-02-06 16:34 UTC (permalink / raw) To: Amir Goldstein Cc: Alexander Larsson, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Josef Bacik, Gao Xiang, Jingbo Xu On Mon, 6 Feb 2023 at 14:31, Amir Goldstein <amir73il@gmail.com> wrote: > > > > > My little request again, could you help benchmark on your real workload > > > > rather than "ls -lR" stuff? If your hard KPI is really what as you > > > > said, why not just benchmark the real workload now and write a detailed > > > > analysis to everyone to explain it's a _must_ that we should upstream > > > > a new stacked fs for this? > > > > > > > > > > I agree that benchmarking the actual KPI (boot time) will have > > > a much stronger impact and help to build a much stronger case > > > for composefs if you can prove that the boot time difference really matters. > > > > > > In order to test boot time on fair grounds, I prepared for you a POC > > > branch with overlayfs lazy lookup: > > > https://github.com/amir73il/linux/commits/ovl-lazy-lowerdata > > > > Sorry about being late to the party... > > > > Can you give a little detail about what exactly this does? > > > > Consider a container image distribution system, with base images > and derived images and instruction on how to compose these images > using overlayfs or other methods. > > Consider a derived image L3 that depends on images L2, L1. > > With the composefs methodology, the image distribution server splits > each image is split into metadata only (metacopy) images M3, M2, M1 > and their underlying data images containing content addressable blobs > D3, D2, D1. > > The image distribution server goes on to merge the metadata layers > on the server, so U3 = M3 + M2 + M1. > > In order to start image L3, the container client will unpack the data layers > D3, D2, D1 to local fs normally, but the server merged U3 metadata image > will be distributed as a read-only fsverity signed image that can be mounted > by mount -t composefs U3.img (much like mount -t erofs -o loop U3.img). > > The composefs image format contains "redirect" instruction to the data blob > path and an fsverity signature that can be used to verify the redirected data > content. > > When composefs authors proposed to merge composefs, Gao and me > pointed out that the same functionality can be achieved with minimal changes > using erofs+overlayfs. > > Composefs authors have presented ls -lR time and memory usage benchmarks > that demonstrate how composefs performs better that erofs+overlayfs in > this workload and explained that the lookup of the data blobs is what takes > the extra time and memory in the erofs+overlayfs ls -lR test. > > The lazyfollow POC optimizes-out the lowerdata lookup for the ls -lR > benchmark, so that composefs could be compared to erofs+overlayfs. Got it, thanks. > > To answer Alexander's question: > > > Cool. I'll play around with this. Does this need to be an opt-in > > option in the final version? It feels like this could be useful to > > improve performance in general for overlayfs, for example when > > metacopy is used in container layers. > > I think lazyfollow could be enabled by default after we hashed out > all the bugs and corner cases and most importantly remove the > POC limitation of lower-only overlay. 
> > The feedback that composefs authors are asking from you > is whether you will agree to consider adding the "lazyfollow > lower data" optimization and "fsverity signature for metacopy" > feature to overlayfs? > > If you do agree, then I think they should invest their resources > in making those improvements to overlayfs and perhaps > other improvements to erofs, rather than proposing a new > specialized filesystem. Lazy follow seems to make sense. Why does it need to be optional? Does it have any advantage to *not* do lazy follow? Not sure I follow the fsverity requirement. For the overlay+erofs case isn't it enough to verify the erofs image? Thanks, Miklos > > Thanks, > Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-06 16:34 ` Miklos Szeredi @ 2023-02-06 17:16 ` Amir Goldstein 2023-02-06 18:17 ` Amir Goldstein ` (2 more replies) 0 siblings, 3 replies; 87+ messages in thread From: Amir Goldstein @ 2023-02-06 17:16 UTC (permalink / raw) To: Miklos Szeredi Cc: Alexander Larsson, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Josef Bacik, Gao Xiang, Jingbo Xu On Mon, Feb 6, 2023 at 6:34 PM Miklos Szeredi <miklos@szeredi.hu> wrote: > > On Mon, 6 Feb 2023 at 14:31, Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > > My little request again, could you help benchmark on your real workload > > > > > rather than "ls -lR" stuff? If your hard KPI is really what as you > > > > > said, why not just benchmark the real workload now and write a detailed > > > > > analysis to everyone to explain it's a _must_ that we should upstream > > > > > a new stacked fs for this? > > > > > > > > > > > > > I agree that benchmarking the actual KPI (boot time) will have > > > > a much stronger impact and help to build a much stronger case > > > > for composefs if you can prove that the boot time difference really matters. > > > > > > > > In order to test boot time on fair grounds, I prepared for you a POC > > > > branch with overlayfs lazy lookup: > > > > https://github.com/amir73il/linux/commits/ovl-lazy-lowerdata > > > > > > Sorry about being late to the party... > > > > > > Can you give a little detail about what exactly this does? > > > > > > > Consider a container image distribution system, with base images > > and derived images and instruction on how to compose these images > > using overlayfs or other methods. > > > > Consider a derived image L3 that depends on images L2, L1. > > > > With the composefs methodology, the image distribution server splits > > each image is split into metadata only (metacopy) images M3, M2, M1 > > and their underlying data images containing content addressable blobs > > D3, D2, D1. > > > > The image distribution server goes on to merge the metadata layers > > on the server, so U3 = M3 + M2 + M1. > > > > In order to start image L3, the container client will unpack the data layers > > D3, D2, D1 to local fs normally, but the server merged U3 metadata image > > will be distributed as a read-only fsverity signed image that can be mounted > > by mount -t composefs U3.img (much like mount -t erofs -o loop U3.img). > > > > The composefs image format contains "redirect" instruction to the data blob > > path and an fsverity signature that can be used to verify the redirected data > > content. > > > > When composefs authors proposed to merge composefs, Gao and me > > pointed out that the same functionality can be achieved with minimal changes > > using erofs+overlayfs. > > > > Composefs authors have presented ls -lR time and memory usage benchmarks > > that demonstrate how composefs performs better that erofs+overlayfs in > > this workload and explained that the lookup of the data blobs is what takes > > the extra time and memory in the erofs+overlayfs ls -lR test. > > > > The lazyfollow POC optimizes-out the lowerdata lookup for the ls -lR > > benchmark, so that composefs could be compared to erofs+overlayfs. > > Got it, thanks. > > > > > To answer Alexander's question: > > > > > Cool. I'll play around with this. Does this need to be an opt-in > > > option in the final version? 
It feels like this could be useful to > > > improve performance in general for overlayfs, for example when > > > metacopy is used in container layers. > > > > I think lazyfollow could be enabled by default after we hashed out > > all the bugs and corner cases and most importantly remove the > > POC limitation of lower-only overlay. > > > > The feedback that composefs authors are asking from you > > is whether you will agree to consider adding the "lazyfollow > > lower data" optimization and "fsverity signature for metacopy" > > feature to overlayfs? > > > > If you do agree, the I think they should invest their resources > > in making those improvements to overlayfs and perhaps > > other improvements to erofs, rather than proposing a new > > specialized filesystem. > > Lazy follow seems to make sense. Why does it need to be optional? It doesn't. > Does it have any advantage to *not* do lazy follow? > Not that I can think of. > Not sure I follow the fsverity requirement. For overlay+erofs case > itsn't it enough to verify the erofs image? > it's not overlay{erofs+erofs} it's overlay{erofs+ext4} (or another fs-verity [1] supporting fs) the lower layer is a mutable fs with /objects/ dir containing the blobs. The way to ensure the integrity of erofs is to setup dm-verity at erofs mount time. The way to ensure the integrity of the blobs is to store an fs-verity signature of each blob file in trusted.overlay.verify xattr on the metacopy and for overlayfs to enable fsverity on the blob file before allowing access to the lowerdata. At least this is my understanding of the security model. Thanks, Amir. [1] https://www.kernel.org/doc/html/latest/filesystems/fsverity.html ^ permalink raw reply [flat|nested] 87+ messages in thread
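As a rough userspace illustration of the security model Amir describes above (all paths and values are made up, and trusted.overlay.verify is only the xattr name proposed in this message, not something overlayfs implements today): the metadata layer is anchored with dm-verity below the erofs mount,

# veritysetup format meta.erofs.img meta.hash.img
# veritysetup open meta.erofs.img meta-verified meta.hash.img <root-hash-printed-by-format>
# mount -t erofs -o ro /dev/mapper/meta-verified /lower/meta

while each blob has fs-verity enabled and its digest recorded, at image build time, in the proposed xattr on the corresponding metacopy file:

# fsverity enable objects/cc/3da...
# fsverity digest objects/cc/3da...
# setfattr -n trusted.overlay.verify -v "sha256:<digest>" meta-layer-root/file_a

Overlayfs would then be expected to enable/verify fs-verity on the blob against the recorded digest before allowing access to the lowerdata.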
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-06 17:16 ` Amir Goldstein @ 2023-02-06 18:17 ` Amir Goldstein 2023-02-06 19:32 ` Miklos Szeredi 2023-04-03 19:00 ` Lazy lowerdata lookup and data-only layers (Was: Re: Composefs:) Amir Goldstein 2 siblings, 0 replies; 87+ messages in thread From: Amir Goldstein @ 2023-02-06 18:17 UTC (permalink / raw) To: Miklos Szeredi Cc: Alexander Larsson, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Josef Bacik, Gao Xiang, Jingbo Xu, overlayfs [+overlayfs list] On Mon, Feb 6, 2023 at 7:16 PM Amir Goldstein <amir73il@gmail.com> wrote: > > On Mon, Feb 6, 2023 at 6:34 PM Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > On Mon, 6 Feb 2023 at 14:31, Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > > > > My little request again, could you help benchmark on your real workload > > > > > > rather than "ls -lR" stuff? If your hard KPI is really what as you > > > > > > said, why not just benchmark the real workload now and write a detailed > > > > > > analysis to everyone to explain it's a _must_ that we should upstream > > > > > > a new stacked fs for this? > > > > > > > > > > > > > > > > I agree that benchmarking the actual KPI (boot time) will have > > > > > a much stronger impact and help to build a much stronger case > > > > > for composefs if you can prove that the boot time difference really matters. > > > > > > > > > > In order to test boot time on fair grounds, I prepared for you a POC > > > > > branch with overlayfs lazy lookup: > > > > > https://github.com/amir73il/linux/commits/ovl-lazy-lowerdata > > > > > > > > Sorry about being late to the party... > > > > > > > > Can you give a little detail about what exactly this does? > > > > > > > > > > Consider a container image distribution system, with base images > > > and derived images and instruction on how to compose these images > > > using overlayfs or other methods. > > > > > > Consider a derived image L3 that depends on images L2, L1. > > > > > > With the composefs methodology, the image distribution server splits > > > each image is split into metadata only (metacopy) images M3, M2, M1 > > > and their underlying data images containing content addressable blobs > > > D3, D2, D1. > > > > > > The image distribution server goes on to merge the metadata layers > > > on the server, so U3 = M3 + M2 + M1. > > > > > > In order to start image L3, the container client will unpack the data layers > > > D3, D2, D1 to local fs normally, but the server merged U3 metadata image > > > will be distributed as a read-only fsverity signed image that can be mounted > > > by mount -t composefs U3.img (much like mount -t erofs -o loop U3.img). > > > > > > The composefs image format contains "redirect" instruction to the data blob > > > path and an fsverity signature that can be used to verify the redirected data > > > content. > > > > > > When composefs authors proposed to merge composefs, Gao and me > > > pointed out that the same functionality can be achieved with minimal changes > > > using erofs+overlayfs. > > > > > > Composefs authors have presented ls -lR time and memory usage benchmarks > > > that demonstrate how composefs performs better that erofs+overlayfs in > > > this workload and explained that the lookup of the data blobs is what takes > > > the extra time and memory in the erofs+overlayfs ls -lR test. 
> > > > > > The lazyfollow POC optimizes-out the lowerdata lookup for the ls -lR > > > benchmark, so that composefs could be compared to erofs+overlayfs. > > > > Got it, thanks. > > > > > > > > To answer Alexander's question: > > > > > > > Cool. I'll play around with this. Does this need to be an opt-in > > > > option in the final version? It feels like this could be useful to > > > > improve performance in general for overlayfs, for example when > > > > metacopy is used in container layers. > > > > > > I think lazyfollow could be enabled by default after we hashed out > > > all the bugs and corner cases and most importantly remove the > > > POC limitation of lower-only overlay. > > > > > > The feedback that composefs authors are asking from you > > > is whether you will agree to consider adding the "lazyfollow > > > lower data" optimization and "fsverity signature for metacopy" > > > feature to overlayfs? > > > > > > If you do agree, the I think they should invest their resources > > > in making those improvements to overlayfs and perhaps > > > other improvements to erofs, rather than proposing a new > > > specialized filesystem. > > > > Lazy follow seems to make sense. Why does it need to be optional? > > It doesn't. > > > Does it have any advantage to *not* do lazy follow? > > > > Not that I can think of. > > > Not sure I follow the fsverity requirement. For overlay+erofs case > > itsn't it enough to verify the erofs image? > > > > it's not overlay{erofs+erofs} > it's overlay{erofs+ext4} (or another fs-verity [1] supporting fs) > the lower layer is a mutable fs with /objects/ dir containing > the blobs. > > The way to ensure the integrity of erofs is to setup dm-verity at > erofs mount time. > > The way to ensure the integrity of the blobs is to store an fs-verity > signature of each blob file in trusted.overlay.verify xattr on the > metacopy and for overlayfs to enable fsverity on the blob file before > allowing access to the lowerdata. > Perhaps I should have mentioned that the lower /objects dir, despite being mutable (mostly append-only) is shared among several overlays. This technically breaks the law of no modification to lower layer, but the /objects dir itself is a whiteout in the metadata layer, so the blobs are only accessible via absolute path redirect and there is no /objects overlay dir, so there is no readdir cache to invalidate. Naturally, the content addressable blobs are not expected to be renamed/unlinked while an overlayfs that references them is mounted. Thanks, Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
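A small clarification for readers less familiar with the overlayfs on-disk format: a whiteout is a character device node with device number 0/0, so "the /objects dir itself is a whiteout in the metadata layer" means the metadata layer is built with something along these lines (the layer path is illustrative):

# mknod /build/meta-layer/objects c 0 0

which hides /objects from the merged directory tree while the blobs below it stay reachable through the absolute-path redirects.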
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-06 17:16 ` Amir Goldstein 2023-02-06 18:17 ` Amir Goldstein @ 2023-02-06 19:32 ` Miklos Szeredi 2023-02-06 20:06 ` Amir Goldstein 2023-04-03 19:00 ` Lazy lowerdata lookup and data-only layers (Was: Re: Composefs:) Amir Goldstein 2 siblings, 1 reply; 87+ messages in thread From: Miklos Szeredi @ 2023-02-06 19:32 UTC (permalink / raw) To: Amir Goldstein Cc: Alexander Larsson, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Josef Bacik, Gao Xiang, Jingbo Xu On Mon, 6 Feb 2023 at 18:16, Amir Goldstein <amir73il@gmail.com> wrote: > it's not overlay{erofs+erofs} > it's overlay{erofs+ext4} (or another fs-verity [1] supporting fs) > the lower layer is a mutable fs with /objects/ dir containing > the blobs. > > The way to ensure the integrity of erofs is to setup dm-verity at > erofs mount time. > > The way to ensure the integrity of the blobs is to store an fs-verity > signature of each blob file in trusted.overlay.verify xattr on the > metacopy and for overlayfs to enable fsverity on the blob file before > allowing access to the lowerdata. > > At least this is my understanding of the security model. So this should work out of the box, right? Thanks, Miklos ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-06 19:32 ` Miklos Szeredi @ 2023-02-06 20:06 ` Amir Goldstein 2023-02-07 8:12 ` Alexander Larsson 0 siblings, 1 reply; 87+ messages in thread From: Amir Goldstein @ 2023-02-06 20:06 UTC (permalink / raw) To: Miklos Szeredi Cc: Alexander Larsson, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Josef Bacik, Gao Xiang, Jingbo Xu On Mon, Feb 6, 2023 at 9:32 PM Miklos Szeredi <miklos@szeredi.hu> wrote: > > On Mon, 6 Feb 2023 at 18:16, Amir Goldstein <amir73il@gmail.com> wrote: > > > it's not overlay{erofs+erofs} > > it's overlay{erofs+ext4} (or another fs-verity [1] supporting fs) > > the lower layer is a mutable fs with /objects/ dir containing > > the blobs. > > > > The way to ensure the integrity of erofs is to setup dm-verity at > > erofs mount time. > > > > The way to ensure the integrity of the blobs is to store an fs-verity > > signature of each blob file in trusted.overlay.verify xattr on the > > metacopy and for overlayfs to enable fsverity on the blob file before > > allowing access to the lowerdata. > > > > At least this is my understanding of the security model. > > So this should work out of the box, right? > Mostly. IIUC, overlayfs just needs to verify the signature on open to fulfill the chain of trust, see cfs_open_file(): https://lore.kernel.org/linux-fsdevel/9b799ec7e403ba814e7bc097b1e8bd5f7662d596.1674227308.git.alexl@redhat.com/ Thanks, Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-06 20:06 ` Amir Goldstein @ 2023-02-07 8:12 ` Alexander Larsson 0 siblings, 0 replies; 87+ messages in thread From: Alexander Larsson @ 2023-02-07 8:12 UTC (permalink / raw) To: Amir Goldstein Cc: Miklos Szeredi, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Josef Bacik, Gao Xiang, Jingbo Xu On Mon, Feb 6, 2023 at 9:06 PM Amir Goldstein <amir73il@gmail.com> wrote: > > On Mon, Feb 6, 2023 at 9:32 PM Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > On Mon, 6 Feb 2023 at 18:16, Amir Goldstein <amir73il@gmail.com> wrote: > > > > > it's not overlay{erofs+erofs} > > > it's overlay{erofs+ext4} (or another fs-verity [1] supporting fs) > > > the lower layer is a mutable fs with /objects/ dir containing > > > the blobs. > > > > > > The way to ensure the integrity of erofs is to setup dm-verity at > > > erofs mount time. > > > > > > The way to ensure the integrity of the blobs is to store an fs-verity > > > signature of each blob file in trusted.overlay.verify xattr on the > > > metacopy and for overlayfs to enable fsverity on the blob file before > > > allowing access to the lowerdata. > > > > > > At least this is my understanding of the security model. > > > > So this should work out of the box, right? > > > > Mostly. IIUC, overlayfs just needs to verify the signature on > open to fulfill the chain of trust, see cfs_open_file(): > https://lore.kernel.org/linux-fsdevel/9b799ec7e403ba814e7bc097b1e8bd5f7662d596.1674227308.git.alexl@redhat.com/ Yeah, we need to add an "overlay.digest" xattr which if specified contains the expected fs-verity digest of the content file for the metacopy file. We also need to export fsverity_get_digest for module use: https://lore.kernel.org/linux-fsdevel/f5f292caee6b288d39112486ee1b2daef590c3ec.1674227308.git.alexl@redhat.com/ -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Alexander Larsson Red Hat, Inc alexl@redhat.com alexander.larsson@gmail.com ^ permalink raw reply [flat|nested] 87+ messages in thread
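A hedged sketch of the check being discussed, expressed with userspace tools; the trusted.overlay.digest name follows Alexander's "overlay.digest" proposal above and does not exist yet, and the file names are illustrative:

# getfattr -n trusted.overlay.digest /lower/meta/file_a
# fsverity digest /lower/data/objects/cc/3da...

The digest stored in the xattr is expected to match the fs-verity digest of the backing blob; in the kernel, overlayfs would make the equivalent comparison on open using the exported fsverity_get_digest(), and fail the open if the two differ.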
* Lazy lowerdata lookup and data-only layers (Was: Re: Composefs:) 2023-02-06 17:16 ` Amir Goldstein 2023-02-06 18:17 ` Amir Goldstein 2023-02-06 19:32 ` Miklos Szeredi @ 2023-04-03 19:00 ` Amir Goldstein 2023-04-11 15:50 ` Miklos Szeredi 2 siblings, 1 reply; 87+ messages in thread From: Amir Goldstein @ 2023-04-03 19:00 UTC (permalink / raw) To: Miklos Szeredi; +Cc: Alexander Larsson, overlayfs > > > > > > I think lazyfollow could be enabled by default after we hashed out > > > all the bugs and corner cases and most importantly remove the > > > POC limitation of lower-only overlay. > > > [...] > > > > > > > Lazy follow seems to make sense. Why does it need to be optional? > > It doesn't. > > > Does it have any advantage to *not* do lazy follow? > > > > Not that I can think of. Miklos, I completed writing the lazy lookup patches [1]. It wasn't trivial and the first versions had many traps that took time to trip on, so I've made some design choices to make it safer and easier to land an initial improvement that will cater to the composefs use case. The main design choice has to do with making lazy lowerdata lookup completely opt-in by defining a new type of data-only layers, such as the content addressable lower layer of composefs. The request for the data-only layers came from Alexander. The current patches only do lazy lookup in data-only layers and the lookup in data-only layers is always lazy. Data-only layers have some other advantages, for example, multiple data-only uuid-less layers are allowed. Please see the text below taken from the patches. What do you think about this direction? Alexander has started to test these patches. If he finds no issues and if you have no objections to the concept, then I will post the patches for wider review. Thanks, Amir. [1] https://github.com/amir73il/linux/commits/ovl-lazy-lowerdata-rc2 Data-only lower layers ---------------------- With the "metacopy" feature enabled, an overlayfs regular file may be a composition of information from up to three different layers: 1) metadata from a file in the upper layer 2) st_ino and st_dev object identifier from a file in a lower layer 3) data from a file in another lower layer (further below) The "lower data" file can be on any lower layer, except for the top most lower layer. Below the top most lower layer, any number of the lowermost layers may be defined as "data-only" lower layers, using the double colon ("::") separator. For example: mount -t overlay overlay -olowerdir=/lower1::/lower2:/lower3 /merged The paths of files in the "data-only" lower layers are not visible in the merged overlayfs directories and the metadata and st_ino/st_dev of files in the "data-only" lower layers are not visible in overlayfs inodes. Only the data of the files in the "data-only" lower layers may be visible when a "metacopy" file in one of the lower layers above it has a "redirect" to the absolute path of the "lower data" file in the "data-only" lower layer. ^ permalink raw reply [flat|nested] 87+ messages in thread
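To tie this documentation back to the composefs-style layout discussed earlier in the thread, an illustrative example follows; the directory names and the redirect target are invented, and trusted.overlay.metacopy / trusted.overlay.redirect are the existing overlayfs xattrs. Suppose /meta/file_a is a zero-length metacopy stub whose trusted.overlay.redirect xattr contains "/objects/cc/3da...", and /data is the blob store holding objects/:

# mount -t overlay overlay -o metacopy=on,lowerdir=/meta::/data /merged
# cat /merged/file_a

The path /data/objects/... never appears under /merged and contributes no st_ino/st_dev, but the content read back for file_a comes from the blob that the redirect points to in the data-only layer.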
* Re: Lazy lowerdata lookup and data-only layers (Was: Re: Composefs:) 2023-04-03 19:00 ` Lazy lowerdata lookup and data-only layers (Was: Re: Composefs:) Amir Goldstein @ 2023-04-11 15:50 ` Miklos Szeredi 2023-04-12 14:06 ` Amir Goldstein 0 siblings, 1 reply; 87+ messages in thread From: Miklos Szeredi @ 2023-04-11 15:50 UTC (permalink / raw) To: Amir Goldstein; +Cc: Alexander Larsson, overlayfs On Mon, 3 Apr 2023 at 21:00, Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > > > > I think lazyfollow could be enabled by default after we hashed out > > > > all the bugs and corner cases and most importantly remove the > > > > POC limitation of lower-only overlay. > > > > > [...] > > > > > > > > > > Lazy follow seems to make sense. Why does it need to be optional? > > > > It doesn't. > > > > > Does it have any advantage to *not* do lazy follow? > > > > > > > Not that I can think of. > > Miklos, > > I completed writing the lazy lookup patches [1]. > > It wasn't trivial and the first versions had many traps that took time to > trip on, so I've made some design choices to make it safer and easier to > land an initial improvement that will cater the composefs use case. > > The main design choice has to do with making lazy lowerdata lookup > completely opt-in by defining a new type of data-only layers, such as > the content addressable lower layer of composefs. > The request for the data-only layers came from Alexander. > > The current patches only do lazy lookup in data-only layers and the lookup > in data-only layers is always lazy. > > Data-only layers have some other advantages, for example, multiple > data-only uuid-less layers are allowed. > Please see the text below taken from the patches. > > What do you think about this direction? > > Alexander has started to test these patches. > If he finds no issues and if you have no objections to the concept, > then I will post the patches for wider review. > > > Thanks, > Amir. > > [1] https://github.com/amir73il/linux/commits/ovl-lazy-lowerdata-rc2 > > Data-only lower layers > ---------------------- > > With "metacopy" feature enabled, an overlayfs regular file may be a > composition of information from up to three different layers: > > 1) metadata from a file in the upper layer > > 2) st_ino and st_dev object identifier from a file in a lower layer > > 3) data from a file in another lower layer (further below) > > The "lower data" file can be on any lower layer, except from the top most > lower layer. > > Below the top most lower layer, any number of lower most layers may be > defined as "data-only" lower layers, using the double collon ("::") separator. > > For example: > > mount -t overlay overlay -olowerdir=/lower1::/lower2:/lower3 /merged What are the rules? Is "do1::do2::lower" allowed? Is "do1::lower1:do2::lower2 allowed? > > The paths of files in the "data-only" lower layers are not visible in the > merged overlayfs directories and the metadata and st_ino/st_dev of files > in the "data-only" lower layers are not visible in overlayfs inodes. > > Only the data of the files in the "data-only" lower layers may be visible > when a "metacopy" file in one of the lower layers above it, has a "redirect" > to the absolute path of the "lower data" file in the "data-only" lower layer. Okay. Thanks, Miklos ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: Lazy lowerdata lookup and data-only layers (Was: Re: Composefs:) 2023-04-11 15:50 ` Miklos Szeredi @ 2023-04-12 14:06 ` Amir Goldstein 2023-04-12 14:20 ` Miklos Szeredi 0 siblings, 1 reply; 87+ messages in thread From: Amir Goldstein @ 2023-04-12 14:06 UTC (permalink / raw) To: Miklos Szeredi; +Cc: Alexander Larsson, overlayfs On Tue, Apr 11, 2023 at 6:50 PM Miklos Szeredi <miklos@szeredi.hu> wrote: > > On Mon, 3 Apr 2023 at 21:00, Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > > > > > > > I think lazyfollow could be enabled by default after we hashed out > > > > > all the bugs and corner cases and most importantly remove the > > > > > POC limitation of lower-only overlay. > > > > > > > [...] > > > > > > > > > > > > > Lazy follow seems to make sense. Why does it need to be optional? > > > > > > It doesn't. > > > > > > > Does it have any advantage to *not* do lazy follow? > > > > > > > > > > Not that I can think of. > > > > Miklos, > > > > I completed writing the lazy lookup patches [1]. > > > > It wasn't trivial and the first versions had many traps that took time to > > trip on, so I've made some design choices to make it safer and easier to > > land an initial improvement that will cater the composefs use case. > > > > The main design choice has to do with making lazy lowerdata lookup > > completely opt-in by defining a new type of data-only layers, such as > > the content addressable lower layer of composefs. > > The request for the data-only layers came from Alexander. > > > > The current patches only do lazy lookup in data-only layers and the lookup > > in data-only layers is always lazy. > > > > Data-only layers have some other advantages, for example, multiple > > data-only uuid-less layers are allowed. > > Please see the text below taken from the patches. > > > > What do you think about this direction? > > > > Alexander has started to test these patches. > > If he finds no issues and if you have no objections to the concept, > > then I will post the patches for wider review. > > > > > > Thanks, > > Amir. > > > > [1] https://github.com/amir73il/linux/commits/ovl-lazy-lowerdata-rc2 > > > > Data-only lower layers > > ---------------------- > > > > With "metacopy" feature enabled, an overlayfs regular file may be a > > composition of information from up to three different layers: > > > > 1) metadata from a file in the upper layer > > > > 2) st_ino and st_dev object identifier from a file in a lower layer > > > > 3) data from a file in another lower layer (further below) > > > > The "lower data" file can be on any lower layer, except from the top most > > lower layer. > > > > Below the top most lower layer, any number of lower most layers may be > > defined as "data-only" lower layers, using the double collon ("::") separator. > > > > For example: > > > > mount -t overlay overlay -olowerdir=/lower1::/lower2:/lower3 /merged > > What are the rules? > > Is "do1::do2::lower" allowed? > Is "do1::lower1:do2::lower2 allowed? > To elaborate: lowerdir="lo1:lo2:lo3::do1:do2:do3" is allowed :: must have non-zero lower layers on the left side and non-zero data-only layers on the right side. Actually, this feature originates from a request from Alexander to respect opaque root dir in lower layers, but I preferred to make this change of behavior opt-in so it can be tested by userspace. 
I took it one step further than the opaque root dir request - the lookup in data-only layers is a generic vfs_path_lookup() of an absolute path redirect from one of the lowerdirs, with no checking of redirect/metacopy/opaque xattrs. And then I only implemented lazy lookup for the lookup in those new data-only layers, which made things simpler. Please see the patches I just posted for details [1]. Thanks, Amir. [1] https://lore.kernel.org/linux-unionfs/20230412135412.1684197-1-amir73il@gmail.com/ ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: Lazy lowerdata lookup and data-only layers (Was: Re: Composefs:) 2023-04-12 14:06 ` Amir Goldstein @ 2023-04-12 14:20 ` Miklos Szeredi 2023-04-12 15:41 ` Amir Goldstein 0 siblings, 1 reply; 87+ messages in thread From: Miklos Szeredi @ 2023-04-12 14:20 UTC (permalink / raw) To: Amir Goldstein; +Cc: Alexander Larsson, overlayfs On Wed, 12 Apr 2023 at 16:07, Amir Goldstein <amir73il@gmail.com> wrote: > To elaborate: > > lowerdir="lo1:lo2:lo3::do1:do2:do3" is allowed > > :: must have non-zero lower layers on the left side > and non-zero data-only layers on the right side. Okay. Can you please add this to the documentation? > > Actually, this feature originates from a request from Alexander to > respect opaque root dir in lower layers, but I preferred to make this > change of behavior opt-in so it can be tested by userspace. Not sure I get that. Does "opaque root dir" mean that only absolute redirects can access layers below such a layer? I guess that's not something that works today. Or am I mistaken? I also don't get what you mean by testing in userspace. Can you elaborate? > > I took it one step further than the opaque root dir request - > the lookup in data-only layers is a generic vfs_path_lookup() of an > absolute path redirect from one of the lowerdirs, with no > checking of redirect/metacopy/opaque xattrs. > > And then I only implemented lazy lookup for the lookup > in those new data-only layers, which made things simpler. Okay, makes sense. If someone hits this limitation, then we can always start thinking about generalizing this feature. > Please see the patches I just posted for details [1]. Will do. Thanks, Miklos ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: Lazy lowerdata lookup and data-only layers (Was: Re: Composefs:) 2023-04-12 14:20 ` Miklos Szeredi @ 2023-04-12 15:41 ` Amir Goldstein 0 siblings, 0 replies; 87+ messages in thread From: Amir Goldstein @ 2023-04-12 15:41 UTC (permalink / raw) To: Miklos Szeredi; +Cc: Alexander Larsson, overlayfs On Wed, Apr 12, 2023 at 5:21 PM Miklos Szeredi <miklos@szeredi.hu> wrote: > > On Wed, 12 Apr 2023 at 16:07, Amir Goldstein <amir73il@gmail.com> wrote: > > > To elaborate: > > > > lowerdir="lo1:lo2:lo3::do1:do2:do3" is allowed > > > > :: must have non-zero lower layers on the left side > > and non-zero data-only layers on the right side. > > Okay. Can you please add this to the documentation? > OK. > > > > Actually, this feature originates from a request from Alexander to > > respect opaque root dir in lower layers, but I preferred to make this > > change of behavior opt-in so it can be tested by userspace. > > Not sure I get that. Does "opaque root dir" mean that only absolute > redirects can access layers below such a layer? Yes, that's what he wanted: to hide the subdirs being redirected to from the namespace. > > I guess that's not something that works today. Or am I mistaken? You are not mistaken, it is not working today. > > I also don't get what you mean by testing in userspace. Can you elaborate? > > > > > I took it one step further than the opaque root dir request - > > the lookup in data-only layers is a generic vfs_path_lookup() of an > > absolute path redirect from one of the lowerdirs, with no > > checking of redirect/metacopy/opaque xattrs. > > > > And then I only implemented lazy lookup for the lookup > > in those new data-only layers, which made things simpler. > > Okay, makes sense. If someone hits this limitation, then we can > always start thinking about generalizing this feature. Yap, that's what I thought. Thanks, Amir. ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-05 19:06 ` Amir Goldstein 2023-02-06 7:59 ` Amir Goldstein 2023-02-06 10:35 ` Miklos Szeredi @ 2023-02-06 12:51 ` Alexander Larsson 2023-02-07 8:12 ` Jingbo Xu 3 siblings, 0 replies; 87+ messages in thread From: Alexander Larsson @ 2023-02-06 12:51 UTC (permalink / raw) To: Amir Goldstein Cc: Miklos Szeredi, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Josef Bacik, Gao Xiang, Jingbo Xu On Sun, Feb 5, 2023 at 8:06 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > >>> Apart from that, I still fail to get some thoughts (apart from > > >>> unprivileged > > >>> mounts) how EROFS + overlayfs combination fails on automative real > > >>> workloads > > >>> aside from "ls -lR" (readdir + stat). > > >>> > > >>> And eventually we still need overlayfs for most use cases to do > > >>> writable > > >>> stuffs, anyway, it needs some words to describe why such < 1s > > >>> difference is > > >>> very very important to the real workload as you already mentioned > > >>> before. > > >>> > > >>> And with overlayfs lazy lookup, I think it can be close to ~100ms or > > >>> better. > > >>> > > >> > > >> If we had an overlay.fs-verity xattr, then I think there are no > > >> individual features lacking for it to work for the automotive usecase > > >> I'm working on. Nor for the OCI container usecase. However, the > > >> possibility of doing something doesn't mean it is the better technical > > >> solution. > > >> > > >> The container usecase is very important in real world Linux use today, > > >> and as such it makes sense to have a technically excellent solution for > > >> it, not just a workable solution. Obviously we all have different > > >> viewpoints of what that is, but these are the reasons why I think a > > >> composefs solution is better: > > >> > > >> * It is faster than all other approaches for the one thing it actually > > >> needs to do (lookup and readdir performance). Other kinds of > > >> performance (file i/o speed, etc) is up to the backing filesystem > > >> anyway. > > >> > > >> Even if there are possible approaches to make overlayfs perform better > > >> here (the "lazy lookup" idea) it will not reach the performance of > > >> composefs, while further complicating the overlayfs codebase. (btw, did > > >> someone ask Miklos what he thinks of that idea?) > > >> > > > > > > Well, Miklos was CCed (now in TO:) > > > I did ask him specifically about relaxing -ouserxarr,metacopy,redirect: > > > https://lore.kernel.org/linux-unionfs/20230126082228.rweg75ztaexykejv@wittgenstein/T/#mc375df4c74c0d41aa1a2251c97509c6522487f96 > > > but no response on that yet. > > > > > > TBH, in the end, Miklos really is the one who is going to have the most > > > weight on the outcome. > > > > > > If Miklos is interested in adding this functionality to overlayfs, you are going > > > to have a VERY hard sell, trying to merge composefs as an independent > > > expert filesystem. The community simply does not approve of this sort of > > > fragmentation unless there is a very good reason to do that. > > > > > >> For the automotive usecase we have strict cold-boot time requirements > > >> that make cold-cache performance very important to us. 
Of course, there > > >> is no simple time requirements for the specific case of listing files > > >> in an image, but any improvement in cold-cache performance for both the > > >> ostree rootfs and the containers started during boot will be worth its > > >> weight in gold trying to reach these hard KPIs. > > >> > > >> * It uses less memory, as we don't need the extra inodes that comes > > >> with the overlayfs mount. (See profiling data in giuseppes mail[1]). > > > > > > Understood, but we will need profiling data with the optimized ovl > > > (or with the single blob hack) to compare the relevant alternatives. > > > > My little request again, could you help benchmark on your real workload > > rather than "ls -lR" stuff? If your hard KPI is really what as you > > said, why not just benchmark the real workload now and write a detailed > > analysis to everyone to explain it's a _must_ that we should upstream > > a new stacked fs for this? > > > > I agree that benchmarking the actual KPI (boot time) will have > a much stronger impact and help to build a much stronger case > for composefs if you can prove that the boot time difference really matters. I will not be able to produce any full comparisons of a car booting with this. First of all its customer internal data, and secondly its not something that is currently at a stage that is finished enough to do such a benchmark. For this discussion, consider it more a weak example of why cold-cache performance is important in many cases. > In order to test boot time on fair grounds, I prepared for you a POC > branch with overlayfs lazy lookup: > https://github.com/amir73il/linux/commits/ovl-lazy-lowerdata Cool. I'll play around with this. Does this need to be an opt-in option in the final version? It feels like this could be useful to improve performance in general for overlayfs, for example when metacopy is used in container layers. > It is very lightly tested, but should be sufficient for the benchmark. > Note that: > 1. You need to opt-in with redirect_dir=lazyfollow,metacopy=on > 2. The lazyfollow POC only works with read-only overlay that > has two lower dirs (1 metadata layer and one data blobs layer) > 3. The data layer must be a local blockdev fs (i.e. not a network fs) > 4. Only absolute path redirects are lazy (e.g. "/objects/cc/3da...") > > These limitations could be easily lifted with a bit more work. > If any of those limitations stand in your way for running the benchmark > let me know and I'll see what I can do. > > If there is any issue with the POC branch, please let me know. > > Thanks, > Amir. > -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Alexander Larsson Red Hat, Inc alexl@redhat.com alexander.larsson@gmail.com ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-05 19:06 ` Amir Goldstein ` (2 preceding siblings ...) 2023-02-06 12:51 ` [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem Alexander Larsson @ 2023-02-07 8:12 ` Jingbo Xu 3 siblings, 0 replies; 87+ messages in thread From: Jingbo Xu @ 2023-02-07 8:12 UTC (permalink / raw) To: Amir Goldstein, Alexander Larsson Cc: Miklos Szeredi, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Josef Bacik, Gao Xiang [-- Attachment #1: Type: text/plain, Size: 9135 bytes --] Hi amir and all, On 2/6/23 3:06 AM, Amir Goldstein wrote: >>>>> Apart from that, I still fail to get some thoughts (apart from >>>>> unprivileged >>>>> mounts) how EROFS + overlayfs combination fails on automative real >>>>> workloads >>>>> aside from "ls -lR" (readdir + stat). >>>>> >>>>> And eventually we still need overlayfs for most use cases to do >>>>> writable >>>>> stuffs, anyway, it needs some words to describe why such < 1s >>>>> difference is >>>>> very very important to the real workload as you already mentioned >>>>> before. >>>>> >>>>> And with overlayfs lazy lookup, I think it can be close to ~100ms or >>>>> better. >>>>> >>>> >>>> If we had an overlay.fs-verity xattr, then I think there are no >>>> individual features lacking for it to work for the automotive usecase >>>> I'm working on. Nor for the OCI container usecase. However, the >>>> possibility of doing something doesn't mean it is the better technical >>>> solution. >>>> >>>> The container usecase is very important in real world Linux use today, >>>> and as such it makes sense to have a technically excellent solution for >>>> it, not just a workable solution. Obviously we all have different >>>> viewpoints of what that is, but these are the reasons why I think a >>>> composefs solution is better: >>>> >>>> * It is faster than all other approaches for the one thing it actually >>>> needs to do (lookup and readdir performance). Other kinds of >>>> performance (file i/o speed, etc) is up to the backing filesystem >>>> anyway. >>>> >>>> Even if there are possible approaches to make overlayfs perform better >>>> here (the "lazy lookup" idea) it will not reach the performance of >>>> composefs, while further complicating the overlayfs codebase. (btw, did >>>> someone ask Miklos what he thinks of that idea?) >>>> >>> >>> Well, Miklos was CCed (now in TO:) >>> I did ask him specifically about relaxing -ouserxarr,metacopy,redirect: >>> https://lore.kernel.org/linux-unionfs/20230126082228.rweg75ztaexykejv@wittgenstein/T/#mc375df4c74c0d41aa1a2251c97509c6522487f96 >>> but no response on that yet. >>> >>> TBH, in the end, Miklos really is the one who is going to have the most >>> weight on the outcome. >>> >>> If Miklos is interested in adding this functionality to overlayfs, you are going >>> to have a VERY hard sell, trying to merge composefs as an independent >>> expert filesystem. The community simply does not approve of this sort of >>> fragmentation unless there is a very good reason to do that. >>> >>>> For the automotive usecase we have strict cold-boot time requirements >>>> that make cold-cache performance very important to us. 
Of course, there >>>> is no simple time requirements for the specific case of listing files >>>> in an image, but any improvement in cold-cache performance for both the >>>> ostree rootfs and the containers started during boot will be worth its >>>> weight in gold trying to reach these hard KPIs. >>>> >>>> * It uses less memory, as we don't need the extra inodes that comes >>>> with the overlayfs mount. (See profiling data in giuseppes mail[1]). >>> >>> Understood, but we will need profiling data with the optimized ovl >>> (or with the single blob hack) to compare the relevant alternatives. >> >> My little request again, could you help benchmark on your real workload >> rather than "ls -lR" stuff? If your hard KPI is really what as you >> said, why not just benchmark the real workload now and write a detailed >> analysis to everyone to explain it's a _must_ that we should upstream >> a new stacked fs for this? >> > > I agree that benchmarking the actual KPI (boot time) will have > a much stronger impact and help to build a much stronger case > for composefs if you can prove that the boot time difference really matters. > > In order to test boot time on fair grounds, I prepared for you a POC > branch with overlayfs lazy lookup: > https://github.com/amir73il/linux/commits/ovl-lazy-lowerdata > > It is very lightly tested, but should be sufficient for the benchmark. > Note that: > 1. You need to opt-in with redirect_dir=lazyfollow,metacopy=on > 2. The lazyfollow POC only works with read-only overlay that > has two lower dirs (1 metadata layer and one data blobs layer) > 3. The data layer must be a local blockdev fs (i.e. not a network fs) > 4. Only absolute path redirects are lazy (e.g. "/objects/cc/3da...") > > These limitations could be easily lifted with a bit more work. > If any of those limitations stand in your way for running the benchmark > let me know and I'll see what I can do. > Thanks for the lazyfollow POC, I updated the perf test with overlayfs lazyfollow enabled. | uncached(ms)| cached(ms) ----------------------------------------------------+-----+---- composefs | 404 | 181 composefs (readahead of manifest disabled) | 523 | 178 erofs (loop BUFFER) | 300 | 188 erofs (loop DIRECT) | 486 | 188 erofs (loop DIRECT + ra manifest) | 292 | 190 erofs (loop BUFFER) +overlayfs(lazyfollowup) | 502 | 286 erofs (loop DIRECT) +overlayfs(lazyfollowup) | 686 | 285 erofs (loop DIRECT+ra manifest)+overlayfs(lazyfollowup) | 484 | 300 I find that composefs behaves better than purely erofs (loop DIRECT), e.g. 404ms vs 486ms in uncached situation, somewhat because composefs reads part of metadata by buffered kernel_read() and thus the builtin readahead is performed on the manifest file. With the readahead for the manifest disabled, the performance gets much worse. Erofs can also use similar optimization of readahead the manifest file when accessing the metadata if really needed. An example POC implementation is inlined in the bottom of this mail. Considering the workload of "ls -lR" will read basically the full content of the manifest file, plusing the manifest file size is just ~10MB, the POC implementation just performs async readahead upon the manifest with a fixed step of 128KB. With this opt-in, erofs performs somewhat better in uncached situation. I have to admit that this much depends on the range and the step size of the readahead, but at least it indicates that erofs can get comparable performance with similar optimization. 
Besides, as mentioned in [1], in composefs the on-disk inode under one directory is arranged closer than erofs, which means the submitted IO when doing "ls -l" in erofs is more random than that in composefs, somewhat affecting the performance. It can be possibly fixed by improving mkfs.erofs if the gap (~80ms) really matters. The inode id arrangement under the root directory of tested rootfs is shown as an attachment, and the tested erofs image of erofs+overlayfs is constructed from the script (mkhack.sh) attached in [2] offered by Alexander. To summarize: For composefs and erofs, they are quite similar and the performance is also comparable (with the same optimization). But when comparing composefs and erofs+overlayfs(lazyfollowup), at least in the workload of "ls -lR", the combination of erofs and overlayfs costs ~100ms more in both cached and uncached situation. If such ~100ms diff really matters, erofs could resolve "redirect" xattr itself, in which case overlayfs is not involved and then the performance shall be comparable with composefs, but i'm not sure if it's worthwhile considering the results are already close. Besides the rootfs is read-only in this case, and if we need writable layer anyway, overlayfs still needs to be introduced. [1] https://lore.kernel.org/lkml/1d65be2f-6d3a-13c6-4982-66bbb0f9b530@linux.alibaba.com/ [2] https://lore.kernel.org/linux-fsdevel/5fb32a1297821040edd8c19ce796fc0540101653.camel@redhat.com/ diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c index d3b8736fa124..e74e24e00b49 100644 --- a/fs/erofs/inode.c +++ b/fs/erofs/inode.c @@ -21,6 +21,10 @@ static void *erofs_read_inode(struct erofs_buf *buf, struct erofs_inode_compact *dic; struct erofs_inode_extended *die, *copied = NULL; unsigned int ifmt; + struct folio *folio; + struct inode *binode = sb->s_bdev->bd_inode; + struct address_space *mapping = binode->i_mapping; + pgoff_t index = inode_loc >> PAGE_SHIFT; int err; blkaddr = erofs_blknr(inode_loc); @@ -29,6 +33,16 @@ static void *erofs_read_inode(struct erofs_buf *buf, erofs_dbg("%s, reading inode nid %llu at %u of blkaddr %u", __func__, vi->nid, *ofs, blkaddr); + folio = filemap_get_folio(mapping, index); + if (!folio || !folio_test_uptodate(folio)) { + loff_t isize = i_size_read(binode); + pgoff_t end_index = (isize - 1) >> PAGE_SHIFT; + unsigned long nr_to_read = min_t(unsigned long, end_index - index, 32); + DEFINE_READAHEAD(ractl, NULL, NULL, mapping, index); + + page_cache_ra_unbounded(&ractl, nr_to_read, 0); + } + kaddr = erofs_read_metabuf(buf, sb, blkaddr, EROFS_KMAP); if (IS_ERR(kaddr)) { erofs_err(sb, "failed to get inode (nid: %llu) page, err %ld", -- Thanks, Jingbo [-- Attachment #2: inode_arrange.txt --] [-- Type: text/plain, Size: 16164 bytes --] composefs: ========= # ls -il /mnt/cps /mnt/cps: 1 dr-xr-xr-x 2 root root 6 8月 10 2021 afs 2 lrwxrwxrwx 1 root root 7 8月 10 2021 bin -> usr/bin 3 dr-xr-xr-x 5 root root 270 1月 17 18:15 boot 4 drwxr-xr-x 2 root root 6 1月 17 18:14 dev 5 drwxr-xr-x 87 root root 4096 1月 17 18:15 etc 6 drwxr-xr-x 3 root root 19 1月 17 18:15 home 7 lrwxrwxrwx 1 root root 7 8月 10 2021 lib -> usr/lib 8 lrwxrwxrwx 1 root root 9 8月 10 2021 lib64 -> usr/lib64 9 drwxr-xr-x 2 root root 6 8月 10 2021 media 10 drwxr-xr-x 2 root root 6 8月 10 2021 mnt 11 drwxr-xr-x 3 root root 16 1月 17 18:14 opt 12 drwxr-xr-x 2 root root 6 1月 17 18:14 proc 13 dr-xr-x--- 3 root root 103 1月 17 18:15 root 14 drwxr-xr-x 14 root root 187 1月 17 18:15 run 15 lrwxrwxrwx 1 root root 8 8月 10 2021 sbin -> usr/sbin 16 drwxr-xr-x 2 root root 6 8月 10 2021 srv 
17 drwxr-xr-x 2 root root 6 1月 17 18:14 sys 18 drwxrwxrwt 3 root root 29 1月 17 18:15 tmp 19 drwxr-xr-x 12 root root 144 1月 17 18:14 usr 20 drwxr-xr-x 18 root root 235 1月 17 18:14 var erofs: ==== # ls -il /mnt/erofs-raw/bootstrap /mnt/erofs-raw/bootstrap: 1149 crw-r--r-- 1 root root 0, 0 2月 2 11:00 00 1016 crw-r--r-- 1 root root 0, 0 2月 2 11:00 01 1018 crw-r--r-- 1 root root 0, 0 2月 2 11:00 02 1020 crw-r--r-- 1 root root 0, 0 2月 2 11:00 03 1022 crw-r--r-- 1 root root 0, 0 2月 2 11:00 04 1152 crw-r--r-- 1 root root 0, 0 2月 2 11:00 05 1154 crw-r--r-- 1 root root 0, 0 2月 2 11:00 06 1156 crw-r--r-- 1 root root 0, 0 2月 2 11:00 07 1158 crw-r--r-- 1 root root 0, 0 2月 2 11:00 08 1160 crw-r--r-- 1 root root 0, 0 2月 2 11:00 09 1162 crw-r--r-- 1 root root 0, 0 2月 2 11:00 0a 1164 crw-r--r-- 1 root root 0, 0 2月 2 11:00 0b 1166 crw-r--r-- 1 root root 0, 0 2月 2 11:00 0c 1168 crw-r--r-- 1 root root 0, 0 2月 2 11:00 0d 1170 crw-r--r-- 1 root root 0, 0 2月 2 11:00 0e 1172 crw-r--r-- 1 root root 0, 0 2月 2 11:00 0f 1174 crw-r--r-- 1 root root 0, 0 2月 2 11:00 10 1176 crw-r--r-- 1 root root 0, 0 2月 2 11:00 11 1178 crw-r--r-- 1 root root 0, 0 2月 2 11:00 12 1180 crw-r--r-- 1 root root 0, 0 2月 2 11:00 13 1182 crw-r--r-- 1 root root 0, 0 2月 2 11:00 14 1184 crw-r--r-- 1 root root 0, 0 2月 2 11:00 15 1186 crw-r--r-- 1 root root 0, 0 2月 2 11:00 16 1188 crw-r--r-- 1 root root 0, 0 2月 2 11:00 17 1190 crw-r--r-- 1 root root 0, 0 2月 2 11:00 18 1192 crw-r--r-- 1 root root 0, 0 2月 2 11:00 19 1194 crw-r--r-- 1 root root 0, 0 2月 2 11:00 1a 1196 crw-r--r-- 1 root root 0, 0 2月 2 11:00 1b 1198 crw-r--r-- 1 root root 0, 0 2月 2 11:00 1c 1200 crw-r--r-- 1 root root 0, 0 2月 2 11:00 1d 1202 crw-r--r-- 1 root root 0, 0 2月 2 11:00 1e 1204 crw-r--r-- 1 root root 0, 0 2月 2 11:00 1f 1206 crw-r--r-- 1 root root 0, 0 2月 2 11:00 20 1208 crw-r--r-- 1 root root 0, 0 2月 2 11:00 21 1210 crw-r--r-- 1 root root 0, 0 2月 2 11:00 22 1212 crw-r--r-- 1 root root 0, 0 2月 2 11:00 23 1214 crw-r--r-- 1 root root 0, 0 2月 2 11:00 24 1216 crw-r--r-- 1 root root 0, 0 2月 2 11:00 25 1218 crw-r--r-- 1 root root 0, 0 2月 2 11:00 26 1220 crw-r--r-- 1 root root 0, 0 2月 2 11:00 27 1222 crw-r--r-- 1 root root 0, 0 2月 2 11:00 28 1224 crw-r--r-- 1 root root 0, 0 2月 2 11:00 29 1226 crw-r--r-- 1 root root 0, 0 2月 2 11:00 2a 1228 crw-r--r-- 1 root root 0, 0 2月 2 11:00 2b 1230 crw-r--r-- 1 root root 0, 0 2月 2 11:00 2c 1232 crw-r--r-- 1 root root 0, 0 2月 2 11:00 2d 1234 crw-r--r-- 1 root root 0, 0 2月 2 11:00 2e 1236 crw-r--r-- 1 root root 0, 0 2月 2 11:00 2f 1238 crw-r--r-- 1 root root 0, 0 2月 2 11:00 30 1240 crw-r--r-- 1 root root 0, 0 2月 2 11:00 31 1242 crw-r--r-- 1 root root 0, 0 2月 2 11:00 32 1244 crw-r--r-- 1 root root 0, 0 2月 2 11:00 33 1246 crw-r--r-- 1 root root 0, 0 2月 2 11:00 34 1248 crw-r--r-- 1 root root 0, 0 2月 2 11:00 35 1250 crw-r--r-- 1 root root 0, 0 2月 2 11:00 36 1252 crw-r--r-- 1 root root 0, 0 2月 2 11:00 37 1254 crw-r--r-- 1 root root 0, 0 2月 2 11:00 38 1256 crw-r--r-- 1 root root 0, 0 2月 2 11:00 39 1258 crw-r--r-- 1 root root 0, 0 2月 2 11:00 3a 1260 crw-r--r-- 1 root root 0, 0 2月 2 11:00 3b 1262 crw-r--r-- 1 root root 0, 0 2月 2 11:00 3c 1264 crw-r--r-- 1 root root 0, 0 2月 2 11:00 3d 1266 crw-r--r-- 1 root root 0, 0 2月 2 11:00 3e 1268 crw-r--r-- 1 root root 0, 0 2月 2 11:00 3f 1270 crw-r--r-- 1 root root 0, 0 2月 2 11:00 40 1272 crw-r--r-- 1 root root 0, 0 2月 2 11:00 41 1274 crw-r--r-- 1 root root 0, 0 2月 2 11:00 42 1276 crw-r--r-- 1 root root 0, 0 2月 2 11:00 43 1280 crw-r--r-- 1 root root 0, 0 2月 2 11:00 44 1278 crw-r--r-- 1 root root 0, 0 2月 2 11:00 45 1282 
crw-r--r-- 1 root root 0, 0 2月 2 11:00 46 1284 crw-r--r-- 1 root root 0, 0 2月 2 11:00 47 1286 crw-r--r-- 1 root root 0, 0 2月 2 11:00 48 1288 crw-r--r-- 1 root root 0, 0 2月 2 11:00 49 1290 crw-r--r-- 1 root root 0, 0 2月 2 11:00 4a 1292 crw-r--r-- 1 root root 0, 0 2月 2 11:00 4b 1294 crw-r--r-- 1 root root 0, 0 2月 2 11:00 4c 1296 crw-r--r-- 1 root root 0, 0 2月 2 11:00 4d 1298 crw-r--r-- 1 root root 0, 0 2月 2 11:00 4e 1300 crw-r--r-- 1 root root 0, 0 2月 2 11:00 4f 1302 crw-r--r-- 1 root root 0, 0 2月 2 11:00 50 1304 crw-r--r-- 1 root root 0, 0 2月 2 11:00 51 1306 crw-r--r-- 1 root root 0, 0 2月 2 11:00 52 1308 crw-r--r-- 1 root root 0, 0 2月 2 11:00 53 1310 crw-r--r-- 1 root root 0, 0 2月 2 11:00 54 1312 crw-r--r-- 1 root root 0, 0 2月 2 11:00 55 1314 crw-r--r-- 1 root root 0, 0 2月 2 11:00 56 1316 crw-r--r-- 1 root root 0, 0 2月 2 11:00 57 1318 crw-r--r-- 1 root root 0, 0 2月 2 11:00 58 1320 crw-r--r-- 1 root root 0, 0 2月 2 11:00 59 1322 crw-r--r-- 1 root root 0, 0 2月 2 11:00 5a 1324 crw-r--r-- 1 root root 0, 0 2月 2 11:00 5b 1326 crw-r--r-- 1 root root 0, 0 2月 2 11:00 5c 1328 crw-r--r-- 1 root root 0, 0 2月 2 11:00 5d 1330 crw-r--r-- 1 root root 0, 0 2月 2 11:00 5e 1332 crw-r--r-- 1 root root 0, 0 2月 2 11:00 5f 1334 crw-r--r-- 1 root root 0, 0 2月 2 11:00 60 1336 crw-r--r-- 1 root root 0, 0 2月 2 11:00 61 1338 crw-r--r-- 1 root root 0, 0 2月 2 11:00 62 1340 crw-r--r-- 1 root root 0, 0 2月 2 11:00 63 1342 crw-r--r-- 1 root root 0, 0 2月 2 11:00 64 1344 crw-r--r-- 1 root root 0, 0 2月 2 11:00 65 1346 crw-r--r-- 1 root root 0, 0 2月 2 11:00 66 1348 crw-r--r-- 1 root root 0, 0 2月 2 11:00 67 1350 crw-r--r-- 1 root root 0, 0 2月 2 11:00 68 1352 crw-r--r-- 1 root root 0, 0 2月 2 11:00 69 1354 crw-r--r-- 1 root root 0, 0 2月 2 11:00 6a 1356 crw-r--r-- 1 root root 0, 0 2月 2 11:00 6b 1358 crw-r--r-- 1 root root 0, 0 2月 2 11:00 6c 1360 crw-r--r-- 1 root root 0, 0 2月 2 11:00 6d 1362 crw-r--r-- 1 root root 0, 0 2月 2 11:00 6e 1364 crw-r--r-- 1 root root 0, 0 2月 2 11:00 6f 1366 crw-r--r-- 1 root root 0, 0 2月 2 11:00 70 1368 crw-r--r-- 1 root root 0, 0 2月 2 11:00 71 1370 crw-r--r-- 1 root root 0, 0 2月 2 11:00 72 1372 crw-r--r-- 1 root root 0, 0 2月 2 11:00 73 1374 crw-r--r-- 1 root root 0, 0 2月 2 11:00 74 1376 crw-r--r-- 1 root root 0, 0 2月 2 11:00 75 1378 crw-r--r-- 1 root root 0, 0 2月 2 11:00 76 1380 crw-r--r-- 1 root root 0, 0 2月 2 11:00 77 1382 crw-r--r-- 1 root root 0, 0 2月 2 11:00 78 1384 crw-r--r-- 1 root root 0, 0 2月 2 11:00 79 1386 crw-r--r-- 1 root root 0, 0 2月 2 11:00 7a 1388 crw-r--r-- 1 root root 0, 0 2月 2 11:00 7b 1390 crw-r--r-- 1 root root 0, 0 2月 2 11:00 7c 1392 crw-r--r-- 1 root root 0, 0 2月 2 11:00 7d 1394 crw-r--r-- 1 root root 0, 0 2月 2 11:00 7e 1396 crw-r--r-- 1 root root 0, 0 2月 2 11:00 7f 1398 crw-r--r-- 1 root root 0, 0 2月 2 11:00 80 1400 crw-r--r-- 1 root root 0, 0 2月 2 11:00 81 1402 crw-r--r-- 1 root root 0, 0 2月 2 11:00 82 1404 crw-r--r-- 1 root root 0, 0 2月 2 11:00 83 1408 crw-r--r-- 1 root root 0, 0 2月 2 11:00 84 1406 crw-r--r-- 1 root root 0, 0 2月 2 11:00 85 1410 crw-r--r-- 1 root root 0, 0 2月 2 11:00 86 1412 crw-r--r-- 1 root root 0, 0 2月 2 11:00 87 1414 crw-r--r-- 1 root root 0, 0 2月 2 11:00 88 1416 crw-r--r-- 1 root root 0, 0 2月 2 11:00 89 1418 crw-r--r-- 1 root root 0, 0 2月 2 11:00 8a 1420 crw-r--r-- 1 root root 0, 0 2月 2 11:00 8b 1422 crw-r--r-- 1 root root 0, 0 2月 2 11:00 8c 1424 crw-r--r-- 1 root root 0, 0 2月 2 11:00 8d 1426 crw-r--r-- 1 root root 0, 0 2月 2 11:00 8e 1428 crw-r--r-- 1 root root 0, 0 2月 2 11:00 8f 1430 crw-r--r-- 1 root root 0, 0 2月 2 11:00 90 1432 crw-r--r-- 1 root root 0, 0 
2月 2 11:00 91 1434 crw-r--r-- 1 root root 0, 0 2月 2 11:00 92 1436 crw-r--r-- 1 root root 0, 0 2月 2 11:00 93 1438 crw-r--r-- 1 root root 0, 0 2月 2 11:00 94 1440 crw-r--r-- 1 root root 0, 0 2月 2 11:00 95 1442 crw-r--r-- 1 root root 0, 0 2月 2 11:00 96 1444 crw-r--r-- 1 root root 0, 0 2月 2 11:00 97 1446 crw-r--r-- 1 root root 0, 0 2月 2 11:00 98 1448 crw-r--r-- 1 root root 0, 0 2月 2 11:00 99 1450 crw-r--r-- 1 root root 0, 0 2月 2 11:00 9a 1452 crw-r--r-- 1 root root 0, 0 2月 2 11:00 9b 1454 crw-r--r-- 1 root root 0, 0 2月 2 11:00 9c 1456 crw-r--r-- 1 root root 0, 0 2月 2 11:00 9d 1458 crw-r--r-- 1 root root 0, 0 2月 2 11:00 9e 1460 crw-r--r-- 1 root root 0, 0 2月 2 11:00 9f 1462 crw-r--r-- 1 root root 0, 0 2月 2 11:00 a0 1464 crw-r--r-- 1 root root 0, 0 2月 2 11:00 a1 1466 crw-r--r-- 1 root root 0, 0 2月 2 11:00 a2 1468 crw-r--r-- 1 root root 0, 0 2月 2 11:00 a3 1470 crw-r--r-- 1 root root 0, 0 2月 2 11:00 a4 1472 crw-r--r-- 1 root root 0, 0 2月 2 11:00 a5 1474 crw-r--r-- 1 root root 0, 0 2月 2 11:00 a6 1476 crw-r--r-- 1 root root 0, 0 2月 2 11:00 a7 1478 crw-r--r-- 1 root root 0, 0 2月 2 11:00 a8 1480 crw-r--r-- 1 root root 0, 0 2月 2 11:00 a9 1482 crw-r--r-- 1 root root 0, 0 2月 2 11:00 aa 1484 crw-r--r-- 1 root root 0, 0 2月 2 11:00 ab 1486 crw-r--r-- 1 root root 0, 0 2月 2 11:00 ac 1488 crw-r--r-- 1 root root 0, 0 2月 2 11:00 ad 1490 crw-r--r-- 1 root root 0, 0 2月 2 11:00 ae 1492 crw-r--r-- 1 root root 0, 0 2月 2 11:00 af 1494 dr-xr-xr-x 2 root root 27 8月 10 2021 afs 1497 crw-r--r-- 1 root root 0, 0 2月 2 11:00 b0 1499 crw-r--r-- 1 root root 0, 0 2月 2 11:00 b1 1501 crw-r--r-- 1 root root 0, 0 2月 2 11:00 b2 1503 crw-r--r-- 1 root root 0, 0 2月 2 11:00 b3 1505 crw-r--r-- 1 root root 0, 0 2月 2 11:00 b4 1507 crw-r--r-- 1 root root 0, 0 2月 2 11:00 b5 1509 crw-r--r-- 1 root root 0, 0 2月 2 11:00 b6 1511 crw-r--r-- 1 root root 0, 0 2月 2 11:00 b7 1513 crw-r--r-- 1 root root 0, 0 2月 2 11:00 b8 1515 crw-r--r-- 1 root root 0, 0 2月 2 11:00 b9 1517 crw-r--r-- 1 root root 0, 0 2月 2 11:00 ba 1519 crw-r--r-- 1 root root 0, 0 2月 2 11:00 bb 1521 crw-r--r-- 1 root root 0, 0 2月 2 11:00 bc 1523 crw-r--r-- 1 root root 0, 0 2月 2 11:00 bd 1525 crw-r--r-- 1 root root 0, 0 2月 2 11:00 be 1527 crw-r--r-- 1 root root 0, 0 2月 2 11:00 bf 1529 lrwxrwxrwx 1 root root 7 8月 10 2021 bin -> usr/bin 1536 dr-xr-xr-x 5 root root 323 1月 17 18:15 boot 1661 crw-r--r-- 1 root root 0, 0 2月 2 11:00 c0 1674 crw-r--r-- 1 root root 0, 0 2月 2 11:00 c1 1676 crw-r--r-- 1 root root 0, 0 2月 2 11:00 c2 1678 crw-r--r-- 1 root root 0, 0 2月 2 11:00 c3 1680 crw-r--r-- 1 root root 0, 0 2月 2 11:00 c4 1682 crw-r--r-- 1 root root 0, 0 2月 2 11:00 c5 1684 crw-r--r-- 1 root root 0, 0 2月 2 11:00 c6 1686 crw-r--r-- 1 root root 0, 0 2月 2 11:00 c7 1688 crw-r--r-- 1 root root 0, 0 2月 2 11:00 c8 1690 crw-r--r-- 1 root root 0, 0 2月 2 11:00 c9 1692 crw-r--r-- 1 root root 0, 0 2月 2 11:00 ca 1694 crw-r--r-- 1 root root 0, 0 2月 2 11:00 cb 1696 crw-r--r-- 1 root root 0, 0 2月 2 11:00 cc 1698 crw-r--r-- 1 root root 0, 0 2月 2 11:00 cd 1700 crw-r--r-- 1 root root 0, 0 2月 2 11:00 ce 1702 crw-r--r-- 1 root root 0, 0 2月 2 11:00 cf 1704 crw-r--r-- 1 root root 0, 0 2月 2 11:00 d0 1706 crw-r--r-- 1 root root 0, 0 2月 2 11:00 d1 1708 crw-r--r-- 1 root root 0, 0 2月 2 11:00 d2 1710 crw-r--r-- 1 root root 0, 0 2月 2 11:00 d3 1712 crw-r--r-- 1 root root 0, 0 2月 2 11:00 d4 1714 crw-r--r-- 1 root root 0, 0 2月 2 11:00 d5 1716 crw-r--r-- 1 root root 0, 0 2月 2 11:00 d6 1718 crw-r--r-- 1 root root 0, 0 2月 2 11:00 d7 1720 crw-r--r-- 1 root root 0, 0 2月 2 11:00 d8 1722 crw-r--r-- 1 root root 0, 0 2月 2 11:00 d9 1724 
crw-r--r-- 1 root root 0, 0 2月 2 11:00 da 1726 crw-r--r-- 1 root root 0, 0 2月 2 11:00 db 1728 crw-r--r-- 1 root root 0, 0 2月 2 11:00 dc 1730 crw-r--r-- 1 root root 0, 0 2月 2 11:00 dd 1732 crw-r--r-- 1 root root 0, 0 2月 2 11:00 de 1734 drwxr-xr-x 2 root root 27 1月 17 18:14 dev 1737 crw-r--r-- 1 root root 0, 0 2月 2 11:00 df 1739 crw-r--r-- 1 root root 0, 0 2月 2 11:00 e0 1741 crw-r--r-- 1 root root 0, 0 2月 2 11:00 e1 1743 crw-r--r-- 1 root root 0, 0 2月 2 11:00 e2 1745 crw-r--r-- 1 root root 0, 0 2月 2 11:00 e3 1747 crw-r--r-- 1 root root 0, 0 2月 2 11:00 e4 1749 crw-r--r-- 1 root root 0, 0 2月 2 11:00 e5 1751 crw-r--r-- 1 root root 0, 0 2月 2 11:00 e6 1753 crw-r--r-- 1 root root 0, 0 2月 2 11:00 e7 1755 crw-r--r-- 1 root root 0, 0 2月 2 11:00 e8 1757 crw-r--r-- 1 root root 0, 0 2月 2 11:00 e9 1759 crw-r--r-- 1 root root 0, 0 2月 2 11:00 ea 1761 crw-r--r-- 1 root root 0, 0 2月 2 11:00 eb 1763 crw-r--r-- 1 root root 0, 0 2月 2 11:00 ec 1765 crw-r--r-- 1 root root 0, 0 2月 2 11:00 ed 1767 crw-r--r-- 1 root root 0, 0 2月 2 11:00 ee 1769 crw-r--r-- 1 root root 0, 0 2月 2 11:00 ef 1792 drwxr-xr-x 87 root root 3255 1月 17 18:15 etc 4862 crw-r--r-- 1 root root 0, 0 2月 2 11:00 f0 4734 crw-r--r-- 1 root root 0, 0 2月 2 11:00 f1 2942 crw-r--r-- 1 root root 0, 0 2月 2 11:00 f2 4990 crw-r--r-- 1 root root 0, 0 2月 2 11:00 f3 4350 crw-r--r-- 1 root root 0, 0 2月 2 11:00 f4 3710 crw-r--r-- 1 root root 0, 0 2月 2 11:00 f5 2174 crw-r--r-- 1 root root 0, 0 2月 2 11:00 f6 6114 crw-r--r-- 1 root root 0, 0 2月 2 11:00 f7 6116 crw-r--r-- 1 root root 0, 0 2月 2 11:00 f8 6118 crw-r--r-- 1 root root 0, 0 2月 2 11:00 f9 6120 crw-r--r-- 1 root root 0, 0 2月 2 11:00 fa 6122 crw-r--r-- 1 root root 0, 0 2月 2 11:00 fb 6124 crw-r--r-- 1 root root 0, 0 2月 2 11:00 fc 6126 crw-r--r-- 1 root root 0, 0 2月 2 11:00 fd 6128 crw-r--r-- 1 root root 0, 0 2月 2 11:00 fe 6130 crw-r--r-- 1 root root 0, 0 2月 2 11:00 ff 6132 drwxr-xr-x 3 root root 44 1月 17 18:15 home 6162 lrwxrwxrwx 1 root root 7 8月 10 2021 lib -> usr/lib 6165 lrwxrwxrwx 1 root root 9 8月 10 2021 lib64 -> usr/lib64 6168 drwxr-xr-x 2 root root 27 8月 10 2021 media 6171 drwxr-xr-x 2 root root 27 8月 10 2021 mnt 6174 drwxr-xr-x 3 root root 41 1月 17 18:14 opt 15221 drwxr-xr-x 2 root root 27 1月 17 18:14 proc 15224 dr-xr-x--- 3 root root 148 1月 17 18:15 root 15259 drwxr-xr-x 14 root root 260 1月 17 18:15 run 15339 lrwxrwxrwx 1 root root 8 8月 10 2021 sbin -> usr/sbin 15342 drwxr-xr-x 2 root root 27 8月 10 2021 srv 15345 drwxr-xr-x 2 root root 27 1月 17 18:14 sys 15348 drwxrwxrwt 3 root root 54 1月 17 18:15 tmp 15366 drwxr-xr-x 12 root root 209 1月 17 18:14 usr 280740 drwxr-xr-x 18 root root 332 1月 17 18:14 var ^ permalink raw reply related [flat|nested] 87+ messages in thread
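A note on methodology for the cached vs. uncached "ls -lR" numbers discussed above: the measurements in this thread are gathered with hyperfine, dropping the kernel caches before every run for the uncached case. A minimal sketch, assuming $MNTPOINT is whichever mount point (composefs, erofs, or erofs+overlayfs) is under test:

  # uncached: drop page/dentry/inode caches before each timed run
  hyperfine -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR $MNTPOINT"

  # cached: one warm-up run, then time with the caches populated
  hyperfine -w 1 "ls -lR $MNTPOINT"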
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-03 12:46 ` Amir Goldstein 2023-02-03 15:09 ` Gao Xiang @ 2023-02-06 12:43 ` Alexander Larsson 2023-02-06 13:27 ` Gao Xiang 1 sibling, 1 reply; 87+ messages in thread From: Alexander Larsson @ 2023-02-06 12:43 UTC (permalink / raw) To: Amir Goldstein Cc: Miklos Szeredi, Gao Xiang, Jingbo Xu, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Josef Bacik On Fri, Feb 3, 2023 at 1:47 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > Engineering-wise, merging composefs features into EROFS > > > > > would be the simplest option and FWIW, my personal preference. > > > > > > > > > > However, you need to be aware that this will bring into EROFS > > > > > vfs considerations, such as s_stack_depth nesting (which AFAICS > > > > > is not see incremented composefs?). It's not the end of the > > > > > world, but this > > > > > is no longer plain fs over block game. There's a whole new class > > > > > of bugs > > > > > (that syzbot is very eager to explore) so you need to ask > > > > > yourself whether > > > > > this is a direction you want to lead EROFS towards. > > > > > > > > I'd like to make a seperated Kconfig for this. I consider this > > > > just because > > > > currently composefs is much similar to EROFS but it doesn't have > > > > some ability > > > > to keep real regular file (even some README, VERSION or Changelog > > > > in these > > > > images) in its (composefs-called) manifest files. Even its on-disk > > > > super block > > > > doesn't have a UUID now [1] and some boot sector for booting or > > > > some potential > > > > hybird formats such as tar + EROFS, cpio + EROFS. > > > > > > > > I'm not sure if those potential new on-disk features is unneeded > > > > even for > > > > future composefs. But if composefs laterly supports such on-disk > > > > features, > > > > that makes composefs closer to EROFS even more. I don't see > > > > disadvantage to > > > > make these actual on-disk compatible (like ext2 and ext4). > > > > > > > > The only difference now is manifest file itself I/O interface -- > > > > bio vs file. > > > > but EROFS can be distributed to raw block devices as well, > > > > composefs can't. > > > > > > > > Also, I'd like to seperate core-EROFS from advanced features (or > > > > people who > > > > are interested to work on this are always welcome) and composefs- > > > > like model, > > > > if people don't tend to use any EROFS advanced features, it could > > > > be disabled > > > > from compiling explicitly. > > > > > > Apart from that, I still fail to get some thoughts (apart from > > > unprivileged > > > mounts) how EROFS + overlayfs combination fails on automative real > > > workloads > > > aside from "ls -lR" (readdir + stat). > > > > > > And eventually we still need overlayfs for most use cases to do > > > writable > > > stuffs, anyway, it needs some words to describe why such < 1s > > > difference is > > > very very important to the real workload as you already mentioned > > > before. > > > > > > And with overlayfs lazy lookup, I think it can be close to ~100ms or > > > better. > > > > > > > If we had an overlay.fs-verity xattr, then I think there are no > > individual features lacking for it to work for the automotive usecase > > I'm working on. Nor for the OCI container usecase. However, the > > possibility of doing something doesn't mean it is the better technical > > solution. 
> > > > The container usecase is very important in real world Linux use today, > > and as such it makes sense to have a technically excellent solution for > > it, not just a workable solution. Obviously we all have different > > viewpoints of what that is, but these are the reasons why I think a > > composefs solution is better: > > > > * It is faster than all other approaches for the one thing it actually > > needs to do (lookup and readdir performance). Other kinds of > > performance (file i/o speed, etc) is up to the backing filesystem > > anyway. > > > > Even if there are possible approaches to make overlayfs perform better > > here (the "lazy lookup" idea) it will not reach the performance of > > composefs, while further complicating the overlayfs codebase. (btw, did > > someone ask Miklos what he thinks of that idea?) > > > > Well, Miklos was CCed (now in TO:) > I did ask him specifically about relaxing -ouserxarr,metacopy,redirect: > https://lore.kernel.org/linux-unionfs/20230126082228.rweg75ztaexykejv@wittgenstein/T/#mc375df4c74c0d41aa1a2251c97509c6522487f96 > but no response on that yet. > > TBH, in the end, Miklos really is the one who is going to have the most > weight on the outcome. > > If Miklos is interested in adding this functionality to overlayfs, you are going > to have a VERY hard sell, trying to merge composefs as an independent > expert filesystem. The community simply does not approve of this sort of > fragmentation unless there is a very good reason to do that. Yeah, if overlayfs get close to similar performance it does make more sense to use that. Lets see what miklos says. > > For the automotive usecase we have strict cold-boot time requirements > > that make cold-cache performance very important to us. Of course, there > > is no simple time requirements for the specific case of listing files > > in an image, but any improvement in cold-cache performance for both the > > ostree rootfs and the containers started during boot will be worth its > > weight in gold trying to reach these hard KPIs. > > > > * It uses less memory, as we don't need the extra inodes that comes > > with the overlayfs mount. (See profiling data in giuseppes mail[1]). > > Understood, but we will need profiling data with the optimized ovl > (or with the single blob hack) to compare the relevant alternatives. > > > > > The use of loopback vs directly reading the image file from page cache > > also have effects on memory use. Normally we have both the loopback > > file in page cache, plus the block cache for the loopback device. We > > could use loopback with O_DIRECT, but then we don't use the page cache > > for the image file, which I think could have performance implications. > > > > I am not sure this is correct. The loop blockdev page cache can be used, > for reading metadata, can it not? > But that argument is true for EROFS and for almost every other fs > that could be mounted with -oloop. > If the loopdev overhead is a problem and O_DIRECT is not a good enough > solution, then you should work on a generic solution that all fs could use. > > > * The userspace API complexity of the combined overlayfs approach is > > much greater than for composefs, with more moving pieces. For > > composefs, all you need is a single mount syscall for set up. For the > > overlay approach you would need to first create a loopback device, then > > create a dm-verity device-mapper device from it, then mount the > > readonly fs, then mount the overlayfs. 
> > Userspace API complexity has never been and will never be a reason > for making changes in the kernel, let alone add a new filesystem driver. > Userspace API complexity can be hidden behind a userspace expert library. > You can even create a mount.composefs helper that users can use > mount -t composefs that sets up erofs+overlayfs behind the scenes. I don't really care that it's more work for userspace to set it up, that can clearly always be hidden behind some abstraction. However, all this complexity is part of the reason why the combination use more memory and perform less well. It also gets in the way when using the code in more complex, stacked ways. For example, you need have /dev/loop and /dev/mapper/control available to be able to loopback mount a dm-verify using erofs image. This means it is not by default doable in typical sandbox/containers environments without adding access to additional global (potentially quite unsafe, in the case of dev-mapper) devices. Again, not a showstopper, but also not great. I guess we could use fs-verity for loopback mounted files though, which drops the dependency on dev-mapper. This makes it quite a lot better, but loopback is still a global non-namespaced resource. At some point loopfs was proposed to make namespaced loopback possible, but that seems to have gotten nowhere unfortunately. > Similarly, mkfs.composefs can be an alias to mkfs.erofs with a specific > set of preset options, much like mkfs.ext* family. > > > All this complexity has a cost > > in terms of setup/teardown performance, userspace complexity and > > overall memory use. > > > > This claim needs to be quantified *after* the proposed improvements > (or equivalent hack) to existing subsystems. > > > Are any of these a hard blocker for the feature? Not really, but I > > would find it sad to use an (imho) worse solution. > > > > I respect your emotion and it is not uncommon for people to want > to see their creation merged as is, but from personal experience, > it is often a much better option for you, to have your code merge into > an existing subsystem. I think if you knew all the advantages, you > would have fought for this option yourself ;) I'm gonna do some more experimenting with the erofs+overlayfs approach to get a better idea for the complete solution. One problem I ran into is that erofs seems to only support mounting filesystem images that are created with the native page size. This means I can't mount a erofs image created on a 4k page-size machine on an arm64 mac with 64k pages. That doesn't seem great. Maybe this limitation can be lifted from the erofs code though. > > The other mentioned approach is to extend EROFS with composefs > > features. For this to be interesting to me it would have to include: > > > > * Direct reading of the image from page cache (not via loopback) > > * Ability to verify fs-verity digest of that image file > > * Support for stacked content files in a set of specified basedirs > > (not using fscache). > > * Verification of expected fs-verity digest for these basedir files > > > > Anything less than this and I think the overlayfs+erofs approach is a > > better choice. > > > > However, this is essentially just proposing we re-implement all the > > composefs code with a different name. And then we get a filesystem > > supporting *both* stacking and traditional block device use, which > > seems a bit weird to me. It will certainly make the erofs code more > > complex having to support all these combinations. 
Also, given the harsh > > arguments and accusations towards me on the list I don't feel very > > optimistic about how well such a cooperation would work. > > > > I understand why you write that and I am sorry that you feel this way. > This is a good opportunity to urge you and Giuseppe again to request > an invite to LSFMM [1] and propose composefs vs. erofs+ovl as a TOPIC. > > Meeting the developers in person is often the best way to understand each > other in situations just like this one where the email discussions fail to > remain on a purely technical level and our emotions get involved. > It is just too hard to express emotions accurately in emails and people are > so very often misunderstood when that happens. > > I guarantee you that it is much more pleasant to argue with people over email > after you have met them in person ;) I'll try to see if this works in my schedule. But, yeah, in-person discussions would probably speed things up. -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Alexander Larsson Red Hat, Inc alexl@redhat.com alexander.larsson@gmail.com ^ permalink raw reply [flat|nested] 87+ messages in thread
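For concreteness, the multi-step setup described above for the erofs+overlayfs alternative looks roughly like the sketch below. Paths and names are made up; it assumes the dm-verity hash tree for the erofs image (image.verity, root hash in $ROOT_HASH) was created beforehand with "veritysetup format", and that /basedir is the directory tree holding the content-addressed backing files. The overlayfs options follow the metacopy/redirect scheme discussed in this thread; the verified composefs equivalent of all of this is a single mount call.

  datadev=$(losetup -r -f --show image.erofs)     # loop device for the metadata image
  hashdev=$(losetup -r -f --show image.verity)    # loop device for the dm-verity hash tree
  veritysetup open "$datadev" verity-lower "$hashdev" "$ROOT_HASH"
  mount -t erofs -o ro /dev/mapper/verity-lower /lower
  mount -t overlay overlay -o ro,metacopy=on,redirect_dir=on,lowerdir=/lower:/basedir /merged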
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-06 12:43 ` Alexander Larsson @ 2023-02-06 13:27 ` Gao Xiang 2023-02-06 15:31 ` Alexander Larsson 0 siblings, 1 reply; 87+ messages in thread From: Gao Xiang @ 2023-02-06 13:27 UTC (permalink / raw) To: Alexander Larsson, Amir Goldstein Cc: Miklos Szeredi, Jingbo Xu, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Josef Bacik On 2023/2/6 20:43, Alexander Larsson wrote: > On Fri, Feb 3, 2023 at 1:47 PM Amir Goldstein <amir73il@gmail.com> wrote: >> >>>>>> Engineering-wise, merging composefs features into EROFS >>>>>> would be the simplest option and FWIW, my personal preference. >>>>>> >>>>>> However, you need to be aware that this will bring into EROFS >>>>>> vfs considerations, such as s_stack_depth nesting (which AFAICS >>>>>> is not see incremented composefs?). It's not the end of the >>>>>> world, but this >>>>>> is no longer plain fs over block game. There's a whole new class >>>>>> of bugs >>>>>> (that syzbot is very eager to explore) so you need to ask >>>>>> yourself whether >>>>>> this is a direction you want to lead EROFS towards. >>>>> >>>>> I'd like to make a seperated Kconfig for this. I consider this >>>>> just because >>>>> currently composefs is much similar to EROFS but it doesn't have >>>>> some ability >>>>> to keep real regular file (even some README, VERSION or Changelog >>>>> in these >>>>> images) in its (composefs-called) manifest files. Even its on-disk >>>>> super block >>>>> doesn't have a UUID now [1] and some boot sector for booting or >>>>> some potential >>>>> hybird formats such as tar + EROFS, cpio + EROFS. >>>>> >>>>> I'm not sure if those potential new on-disk features is unneeded >>>>> even for >>>>> future composefs. But if composefs laterly supports such on-disk >>>>> features, >>>>> that makes composefs closer to EROFS even more. I don't see >>>>> disadvantage to >>>>> make these actual on-disk compatible (like ext2 and ext4). >>>>> >>>>> The only difference now is manifest file itself I/O interface -- >>>>> bio vs file. >>>>> but EROFS can be distributed to raw block devices as well, >>>>> composefs can't. >>>>> >>>>> Also, I'd like to seperate core-EROFS from advanced features (or >>>>> people who >>>>> are interested to work on this are always welcome) and composefs- >>>>> like model, >>>>> if people don't tend to use any EROFS advanced features, it could >>>>> be disabled >>>>> from compiling explicitly. >>>> >>>> Apart from that, I still fail to get some thoughts (apart from >>>> unprivileged >>>> mounts) how EROFS + overlayfs combination fails on automative real >>>> workloads >>>> aside from "ls -lR" (readdir + stat). >>>> >>>> And eventually we still need overlayfs for most use cases to do >>>> writable >>>> stuffs, anyway, it needs some words to describe why such < 1s >>>> difference is >>>> very very important to the real workload as you already mentioned >>>> before. >>>> >>>> And with overlayfs lazy lookup, I think it can be close to ~100ms or >>>> better. >>>> >>> >>> If we had an overlay.fs-verity xattr, then I think there are no >>> individual features lacking for it to work for the automotive usecase >>> I'm working on. Nor for the OCI container usecase. However, the >>> possibility of doing something doesn't mean it is the better technical >>> solution. 
>>> >>> The container usecase is very important in real world Linux use today, >>> and as such it makes sense to have a technically excellent solution for >>> it, not just a workable solution. Obviously we all have different >>> viewpoints of what that is, but these are the reasons why I think a >>> composefs solution is better: >>> >>> * It is faster than all other approaches for the one thing it actually >>> needs to do (lookup and readdir performance). Other kinds of >>> performance (file i/o speed, etc) is up to the backing filesystem >>> anyway. >>> >>> Even if there are possible approaches to make overlayfs perform better >>> here (the "lazy lookup" idea) it will not reach the performance of >>> composefs, while further complicating the overlayfs codebase. (btw, did >>> someone ask Miklos what he thinks of that idea?) >>> >> >> Well, Miklos was CCed (now in TO:) >> I did ask him specifically about relaxing -ouserxarr,metacopy,redirect: >> https://lore.kernel.org/linux-unionfs/20230126082228.rweg75ztaexykejv@wittgenstein/T/#mc375df4c74c0d41aa1a2251c97509c6522487f96 >> but no response on that yet. >> >> TBH, in the end, Miklos really is the one who is going to have the most >> weight on the outcome. >> >> If Miklos is interested in adding this functionality to overlayfs, you are going >> to have a VERY hard sell, trying to merge composefs as an independent >> expert filesystem. The community simply does not approve of this sort of >> fragmentation unless there is a very good reason to do that. > > Yeah, if overlayfs get close to similar performance it does make more > sense to use that. Lets see what miklos says. > >>> For the automotive usecase we have strict cold-boot time requirements >>> that make cold-cache performance very important to us. Of course, there >>> is no simple time requirements for the specific case of listing files >>> in an image, but any improvement in cold-cache performance for both the >>> ostree rootfs and the containers started during boot will be worth its >>> weight in gold trying to reach these hard KPIs. >>> >>> * It uses less memory, as we don't need the extra inodes that comes >>> with the overlayfs mount. (See profiling data in giuseppes mail[1]). >> >> Understood, but we will need profiling data with the optimized ovl >> (or with the single blob hack) to compare the relevant alternatives. >> >>> >>> The use of loopback vs directly reading the image file from page cache >>> also have effects on memory use. Normally we have both the loopback >>> file in page cache, plus the block cache for the loopback device. We >>> could use loopback with O_DIRECT, but then we don't use the page cache >>> for the image file, which I think could have performance implications. >>> >> >> I am not sure this is correct. The loop blockdev page cache can be used, >> for reading metadata, can it not? >> But that argument is true for EROFS and for almost every other fs >> that could be mounted with -oloop. >> If the loopdev overhead is a problem and O_DIRECT is not a good enough >> solution, then you should work on a generic solution that all fs could use. >> >>> * The userspace API complexity of the combined overlayfs approach is >>> much greater than for composefs, with more moving pieces. For >>> composefs, all you need is a single mount syscall for set up. For the >>> overlay approach you would need to first create a loopback device, then >>> create a dm-verity device-mapper device from it, then mount the >>> readonly fs, then mount the overlayfs. 
>> >> Userspace API complexity has never been and will never be a reason >> for making changes in the kernel, let alone add a new filesystem driver. >> Userspace API complexity can be hidden behind a userspace expert library. >> You can even create a mount.composefs helper that users can use >> mount -t composefs that sets up erofs+overlayfs behind the scenes. > > I don't really care that it's more work for userspace to set it up, > that can clearly always be hidden behind some abstraction. > > However, all this complexity is part of the reason why the combination > use more memory and perform less well. It also gets in the way when > using the code in more complex, stacked ways. For example, you need > have /dev/loop and /dev/mapper/control available to be able to > loopback mount a dm-verify using erofs image. This means it is not by > default doable in typical sandbox/containers environments without > adding access to additional global (potentially quite unsafe, in the > case of dev-mapper) devices. > > Again, not a showstopper, but also not great. > > I guess we could use fs-verity for loopback mounted files though, > which drops the dependency on dev-mapper. This makes it quite a lot > better, but loopback is still a global non-namespaced resource. At > some point loopfs was proposed to make namespaced loopback possible, > but that seems to have gotten nowhere unfortunately. Yes, in principle, fsverity could be used as well as long as those digests are checked before mounting so that dm-verity is not needed. > >> Similarly, mkfs.composefs can be an alias to mkfs.erofs with a specific >> set of preset options, much like mkfs.ext* family. >> >>> All this complexity has a cost >>> in terms of setup/teardown performance, userspace complexity and >>> overall memory use. >>> >> >> This claim needs to be quantified *after* the proposed improvements >> (or equivalent hack) to existing subsystems. >> >>> Are any of these a hard blocker for the feature? Not really, but I >>> would find it sad to use an (imho) worse solution. >>> >> >> I respect your emotion and it is not uncommon for people to want >> to see their creation merged as is, but from personal experience, >> it is often a much better option for you, to have your code merge into >> an existing subsystem. I think if you knew all the advantages, you >> would have fought for this option yourself ;) > > I'm gonna do some more experimenting with the erofs+overlayfs approach > to get a better idea for the complete solution. > > One problem I ran into is that erofs seems to only support mounting > filesystem images that are created with the native page size. This > means I can't mount a erofs image created on a 4k page-size machine on > an arm64 mac with 64k pages. That doesn't seem great. Maybe this > limitation can be lifted from the erofs code though. Honestly, EROFS 64k support has been in our roadmap for a quite long time, and it has been almost done for the uncompressed part apart from replacing EROFS_BLKSIZ to erofs_blksiz(sb). Currently it's not urgent just because our Cloud environment always use 4k PAGE_SIZE, but it seems Android will consider 16k pagesize as well, so yes, we will support !4k page size for the uncompressed part in the near future. But it seems that arm64 RHEL 9 switched back to 4k page size? > >>> The other mentioned approach is to extend EROFS with composefs >>> features. 
For this to be interesting to me it would have to include: >>> >>> * Direct reading of the image from page cache (not via loopback) >>> * Ability to verify fs-verity digest of that image file >>> * Support for stacked content files in a set of specified basedirs >>> (not using fscache). >>> * Verification of expected fs-verity digest for these basedir files >>> >>> Anything less than this and I think the overlayfs+erofs approach is a >>> better choice. >>> >>> However, this is essentially just proposing we re-implement all the >>> composefs code with a different name. And then we get a filesystem >>> supporting *both* stacking and traditional block device use, which >>> seems a bit weird to me. It will certainly make the erofs code more >>> complex having to support all these combinations. Also, given the harsh >>> arguments and accusations towards me on the list I don't feel very >>> optimistic about how well such a cooperation would work. >>> >> >> I understand why you write that and I am sorry that you feel this way. >> This is a good opportunity to urge you and Giuseppe again to request >> an invite to LSFMM [1] and propose composefs vs. erofs+ovl as a TOPIC. >> >> Meeting the developers in person is often the best way to understand each >> other in situations just like this one where the email discussions fail to >> remain on a purely technical level and our emotions get involved. >> It is just too hard to express emotions accurately in emails and people are >> so very often misunderstood when that happens. >> >> I guarantee you that it is much more pleasant to argue with people over email >> after you have met them in person ;) > > I'll try to see if this works in my schedule. But, yeah, in-person > discussions would probably speed things up. Jingbo has been investigated in the latest performance numbers, currently, it seems O_DIRECT loop device vs composefs manifest file is that some composefs reads (like inode reads) are used by using kernel_read() with buffered I/O, so that kernel_read() will have builtin readahead, while EROFS just uses bdev + page cache sync interface so that it causes some difference, but EROFS could do readahead as well for dir data/inode read if needed. Consider currently the common manifest files are quite small (~10MB), so the readahead policy can be adapted honestly. Jingbo is off work now, but he could post some latest "ls -lR" numbers tomorrow if needed. Thanks, Gao Xiang > ^ permalink raw reply [flat|nested] 87+ messages in thread
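On the point above that fs-verity could stand in for dm-verity as long as the image digest is checked before mounting: a minimal userspace sketch of such a check, assuming the erofs image already has fs-verity enabled and $EXPECTED holds a digest (in "sha256:..." form) obtained from a trusted source such as a signed manifest. File names are illustrative only.

  # ask the kernel for the fs-verity digest of the already-enabled image file
  actual=$(fsverity measure image.erofs | awk '{print $1}')
  [ "$actual" = "$EXPECTED" ] || { echo "image digest mismatch" >&2; exit 1; }

  # only then loop-mount it read-only; reads of the backing file keep going
  # through fs-verity, so later on-disk modification is still detected
  mount -t erofs -o ro,loop image.erofs /lower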
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-06 13:27 ` Gao Xiang @ 2023-02-06 15:31 ` Alexander Larsson 0 siblings, 0 replies; 87+ messages in thread From: Alexander Larsson @ 2023-02-06 15:31 UTC (permalink / raw) To: Gao Xiang Cc: Amir Goldstein, Miklos Szeredi, Jingbo Xu, gscrivan, brauner, linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Josef Bacik On Mon, Feb 6, 2023 at 2:27 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: > On 2023/2/6 20:43, Alexander Larsson wrote: > > > > One problem I ran into is that erofs seems to only support mounting > > filesystem images that are created with the native page size. This > > means I can't mount a erofs image created on a 4k page-size machine on > > an arm64 mac with 64k pages. That doesn't seem great. Maybe this > > limitation can be lifted from the erofs code though. > > Honestly, EROFS 64k support has been in our roadmap for a quite long > time, and it has been almost done for the uncompressed part apart from > replacing EROFS_BLKSIZ to erofs_blksiz(sb). Good, as long as it is on the roadmap. > Currently it's not urgent just because our Cloud environment always use > 4k PAGE_SIZE, but it seems Android will consider 16k pagesize as well, so > yes, we will support !4k page size for the uncompressed part in the near > future. But it seems that arm64 RHEL 9 switched back to 4k page size? Honestly I'm not following it all that closely, but I think Fedora was at least talking about 64k pages. -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Alexander Larsson Red Hat, Inc alexl@redhat.com alexander.larsson@gmail.com ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-01 9:46 ` Alexander Larsson 2023-02-01 10:01 ` Gao Xiang @ 2023-02-01 12:06 ` Jingbo Xu 2023-02-02 4:57 ` Jingbo Xu 2 siblings, 0 replies; 87+ messages in thread From: Jingbo Xu @ 2023-02-01 12:06 UTC (permalink / raw) To: Alexander Larsson, Gao Xiang, Amir Goldstein, gscrivan, brauner Cc: linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Miklos Szeredi On 2/1/23 5:46 PM, Alexander Larsson wrote: > On Wed, 2023-02-01 at 12:28 +0800, Jingbo Xu wrote: >> Hi all, >> >> There are some updated performance statistics with different >> combinations on my test environment if you are interested. >> >> >> On 1/27/23 6:24 PM, Gao Xiang wrote: >>> ... >>> >>> I've made a version and did some test, it can be fetched from: >>> git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git >>> -b >>> experimental >>> >> >> Setup >> ====== >> CPU: x86_64 Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz >> Disk: 6800 IOPS upper limit >> OS: Linux v6.2 (with composefs v3 patchset) > > For the record, what was the filesystem backing the basedir files? ext4 -- Thanks, Jingbo ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-01 9:46 ` Alexander Larsson 2023-02-01 10:01 ` Gao Xiang 2023-02-01 12:06 ` Jingbo Xu @ 2023-02-02 4:57 ` Jingbo Xu 2023-02-02 4:59 ` Jingbo Xu 2 siblings, 1 reply; 87+ messages in thread From: Jingbo Xu @ 2023-02-02 4:57 UTC (permalink / raw) To: Alexander Larsson, Gao Xiang, Amir Goldstein, gscrivan, brauner Cc: linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Miklos Szeredi On 2/1/23 5:46 PM, Alexander Larsson wrote: > On Wed, 2023-02-01 at 12:28 +0800, Jingbo Xu wrote: >> Hi all, >> >> There are some updated performance statistics with different >> combinations on my test environment if you are interested. >> >> >> On 1/27/23 6:24 PM, Gao Xiang wrote: >>> ... >>> >>> I've made a version and did some test, it can be fetched from: >>> git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git >>> -b >>> experimental >>> >> >> Setup >> ====== >> CPU: x86_64 Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz >> Disk: 6800 IOPS upper limit >> OS: Linux v6.2 (with composefs v3 patchset) > > For the record, what was the filesystem backing the basedir files? > >> I build erofs/squashfs images following the scripts attached on [1], >> with each file in the rootfs tagged with "metacopy" and "redirect" >> xattr. >> >> The source rootfs is from the docker image of tensorflow [2]. >> >> The erofs images are built with mkfs.erofs with support for sparse >> file >> added [3]. >> >> [1] >> https://lore.kernel.org/linux-fsdevel/5fb32a1297821040edd8c19ce796fc0540101653.camel@redhat.com/ >> [2] >> https://hub.docker.com/layers/tensorflow/tensorflow/2.10.0/images/sha256-7f9f23ce2473eb52d17fe1b465c79c3a3604047343e23acc036296f512071bc9?context=explore >> [3] >> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/commit/?h=experimental&id=7c49e8b195ad90f6ca9dfccce9f6e3e39a8676f6 >> >> >> >> Image size >> =========== >> 6.4M large.composefs >> 5.7M large.composefs.w/o.digest (w/o --compute-digest) >> 6.2M large.erofs >> 5.2M large.erofs.T0 (with -T0, i.e. w/o nanosecond timestamp) >> 1.7M large.squashfs >> 5.8M large.squashfs.uncompressed (with -noI -noD -noF -noX) >> >> (large.erofs.T0 is built without nanosecond timestamp, so that we get >> smaller disk inode size (same with squashfs).) >> >> >> Runtime Perf >> ============= >> >> The "uncached" column is tested with: >> hyperfine -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR $MNTPOINT" >> >> >> While the "cached" column is tested with: >> hyperfine -w 1 "ls -lR $MNTPOINT" >> >> >> erofs and squashfs are mounted with loopback device. >> >> >> | uncached(ms)| cached(ms) >> ----------------------------------|-------------|----------- >> composefs (with digest) | 326 | 135 >> erofs (w/o -T0) | 264 | 172 >> erofs (w/o -T0) + overlayfs | 651 | 238 >> squashfs (compressed) | 538 | 211 >> squashfs (compressed) + overlayfs | 968 | 302 > > > Clearly erofs with sparse files is the best fs now for the ro-fs + > overlay case. But still, we can see that the additional cost of the > overlayfs layer is not negligible. > > According to amir this could be helped by a special composefs-like mode > in overlayfs, but its unclear what performance that would reach, and > we're then talking net new development that further complicates the > overlayfs codebase. Its not clear to me which alternative is easier to > develop/maintain. > > Also, the difference between cached and uncached here is less than in > my tests. Probably because my test image was larger. 
> With the test image I use, the results are:
>
>                                   | uncached(ms)| cached(ms)
> ----------------------------------|-------------|-----------
> composefs (with digest)           | 681         | 390
> erofs (w/o -T0) + overlayfs       | 1788        | 532
> squashfs (compressed) + overlayfs | 2547        | 443
>
> I gotta say it is weird though that squashfs performed better than
> erofs in the cached case. May be worth looking into. The test data I'm
> using is available here:
>
> https://my.owndrive.com/index.php/s/irHJXRpZHtT3a5i

Hi,

I also tested with the rootfs you provided.

Setup
======
CPU: x86_64 Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
Disk: 11800 IOPS upper limit
OS: Linux v6.2 (with composefs v3 patchset)
FS of backing objects: xfs

Image size
===========
8.6M large.composefs (with --compute-digest)
7.6M large.composefs.wo.digest (w/o --compute-digest)
8.9M large.erofs
7.4M large.erofs.T0 (with -T0, i.e. w/o nanosecond timestamp)
2.6M large.squashfs.compressed
8.2M large.squashfs.uncompressed (with -noI -noD -noF -noX)

Runtime Perf
=============
The "uncached" column is tested with:
hyperfine -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR $MNTPOINT"

While the "cached" column is tested with:
hyperfine -w 1 "ls -lR $MNTPOINT"

erofs and squashfs are mounted with a loopback device.

                                  | uncached(ms)| cached(ms)
----------------------------------|-------------|-----------
composefs                         | 408         | 176
erofs                             | 308         | 190
erofs + overlayfs                 | 1097        | 294
erofs.hack                        | 298         | 187
erofs.hack + overlayfs            | 524         | 283
squashfs (compressed)             | 770         | 265
squashfs (compressed) + overlayfs | 1600        | 372
squashfs (uncompressed)           | 646         | 223
squashfs (uncompressed)+overlayfs | 1480        | 330

- all erofs mounted with "noacl"
- composefs: using large.composefs
- erofs: using large.erofs
- erofs.hack: using large.erofs.hack, where each file in the erofs layer redirects to the same lower file, e.g. "/objects/00/02bef8682cac782594e542d1ec6e031b9f7ac40edcfa6a1eb6d15d3b1ab126", to evaluate the potential composefs-like "lazy lookup" optimization in overlayfs
- squashfs (compressed): using large.squashfs.compressed
- squashfs (uncompressed): using large.squashfs.uncompressed

--
Thanks,
Jingbo

^ permalink raw reply [flat|nested] 87+ messages in thread
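For readers unfamiliar with the image-building step referenced above ("each file ... tagged with metacopy and redirect xattr"): before mkfs.erofs is run, every regular file in the source tree is turned into a sparse placeholder carrying overlayfs xattrs that point at its content-addressed backing object. The following is only a rough sketch of the idea, with a made-up file path; the redirect value is the one quoted above, and the exact xattr payloads and sizing are whatever Alexander's mkhack.sh script (referenced earlier in the thread) actually does.

  f="usr/bin/example"                     # hypothetical file inside the source tree
  obj="/objects/00/02bef8682cac782594e542d1ec6e031b9f7ac40edcfa6a1eb6d15d3b1ab126"

  setfattr -n trusted.overlay.metacopy "$f"            # mark as a metadata-only copy
  setfattr -n trusted.overlay.redirect -v "$obj" "$f"  # point at the backing object

  # keep the apparent size but drop the data, leaving a sparse placeholder
  # (this is why mkfs.erofs needed sparse-file support)
  size=$(stat -c %s "$f")
  truncate -s 0 "$f"
  truncate -s "$size" "$f"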
* Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem 2023-02-02 4:57 ` Jingbo Xu @ 2023-02-02 4:59 ` Jingbo Xu 0 siblings, 0 replies; 87+ messages in thread From: Jingbo Xu @ 2023-02-02 4:59 UTC (permalink / raw) To: Alexander Larsson, Gao Xiang, Amir Goldstein, gscrivan, brauner Cc: linux-fsdevel, linux-kernel, david, viro, Vivek Goyal, Miklos Szeredi On 2/2/23 12:57 PM, Jingbo Xu wrote: > > > On 2/1/23 5:46 PM, Alexander Larsson wrote: >> On Wed, 2023-02-01 at 12:28 +0800, Jingbo Xu wrote: >>> Hi all, >>> >>> There are some updated performance statistics with different >>> combinations on my test environment if you are interested. >>> >>> >>> On 1/27/23 6:24 PM, Gao Xiang wrote: >>>> ... >>>> >>>> I've made a version and did some test, it can be fetched from: >>>> git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git >>>> -b >>>> experimental >>>> >>> >>> Setup >>> ====== >>> CPU: x86_64 Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz >>> Disk: 6800 IOPS upper limit >>> OS: Linux v6.2 (with composefs v3 patchset) >> >> For the record, what was the filesystem backing the basedir files? >> >>> I build erofs/squashfs images following the scripts attached on [1], >>> with each file in the rootfs tagged with "metacopy" and "redirect" >>> xattr. >>> >>> The source rootfs is from the docker image of tensorflow [2]. >>> >>> The erofs images are built with mkfs.erofs with support for sparse >>> file >>> added [3]. >>> >>> [1] >>> https://lore.kernel.org/linux-fsdevel/5fb32a1297821040edd8c19ce796fc0540101653.camel@redhat.com/ >>> [2] >>> https://hub.docker.com/layers/tensorflow/tensorflow/2.10.0/images/sha256-7f9f23ce2473eb52d17fe1b465c79c3a3604047343e23acc036296f512071bc9?context=explore >>> [3] >>> https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/commit/?h=experimental&id=7c49e8b195ad90f6ca9dfccce9f6e3e39a8676f6 >>> >>> >>> >>> Image size >>> =========== >>> 6.4M large.composefs >>> 5.7M large.composefs.w/o.digest (w/o --compute-digest) >>> 6.2M large.erofs >>> 5.2M large.erofs.T0 (with -T0, i.e. w/o nanosecond timestamp) >>> 1.7M large.squashfs >>> 5.8M large.squashfs.uncompressed (with -noI -noD -noF -noX) >>> >>> (large.erofs.T0 is built without nanosecond timestamp, so that we get >>> smaller disk inode size (same with squashfs).) >>> >>> >>> Runtime Perf >>> ============= >>> >>> The "uncached" column is tested with: >>> hyperfine -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR $MNTPOINT" >>> >>> >>> While the "cached" column is tested with: >>> hyperfine -w 1 "ls -lR $MNTPOINT" >>> >>> >>> erofs and squashfs are mounted with loopback device. >>> >>> >>> | uncached(ms)| cached(ms) >>> ----------------------------------|-------------|----------- >>> composefs (with digest) | 326 | 135 >>> erofs (w/o -T0) | 264 | 172 >>> erofs (w/o -T0) + overlayfs | 651 | 238 >>> squashfs (compressed) | 538 | 211 >>> squashfs (compressed) + overlayfs | 968 | 302 >> >> >> Clearly erofs with sparse files is the best fs now for the ro-fs + >> overlay case. But still, we can see that the additional cost of the >> overlayfs layer is not negligible. >> >> According to amir this could be helped by a special composefs-like mode >> in overlayfs, but its unclear what performance that would reach, and >> we're then talking net new development that further complicates the >> overlayfs codebase. Its not clear to me which alternative is easier to >> develop/maintain. >> >> Also, the difference between cached and uncached here is less than in >> my tests. 
Probably because my test image was larger. With the test >> image I use, the results are: >> >> | uncached(ms)| cached(ms) >> ----------------------------------|-------------|----------- >> composefs (with digest) | 681 | 390 >> erofs (w/o -T0) + overlayfs | 1788 | 532 >> squashfs (compressed) + overlayfs | 2547 | 443 >> >> >> I gotta say it is weird though that squashfs performed better than >> erofs in the cached case. May be worth looking into. The test data I'm >> using is available here: >> >> https://my.owndrive.com/index.php/s/irHJXRpZHtT3a5i >> >> > > Hi, > > I also tested upon the rootfs you given. > > > Setup > ====== > CPU: x86_64 Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz > Disk: 11800 IOPS upper limit > OS: Linux v6.2 (with composefs v3 patchset) > FS of backing objects: xfs > > > Image size > =========== > 8.6M large.composefs (with --compute-digest) > 7.6M large.composefs.wo.digest (w/o --compute-digest) > 8.9M large.erofs > 7.4M large.erofs.T0 (with -T0, i.e. w/o nanosecond timestamp) > 2.6M large.squashfs.compressed > 8.2M large.squashfs.uncompressed (with -noI -noD -noF -noX) > > > Runtime Perf > ============= > > The "uncached" column is tested with: > hyperfine -p "echo 3 > /proc/sys/vm/drop_caches" "ls -lR $MNTPOINT" > > > While the "cached" column is tested with: > hyperfine -w 1 "ls -lR $MNTPOINT" > > > erofs and squashfs are mounted with loopback device. > > | uncached(ms)| cached(ms) > ----------------------------------|-------------|----------- > composefs | 408 | 176 > erofs | 308 | 190 > erofs + overlayfs | 1097 | 294 > erofs.hack | 298 | 187 > erofs.hack + overlayfs | 524 | 283 > squashfs (compressed) | 770 | 265 > squashfs (compressed) + overlayfs | 1600 | 372 > squashfs (uncompressed) | 646 | 223 > squashfs (uncompressed)+overlayfs | 1480 | 330 > > - all erofs mounted with "noacl" > - composefs: using large.composefs > - erofs: using large.erofs > - erofs.hack: using large.erofs.hack where each file in the erofs layer > redirecting to the same lower block, e.g. > "/objects/00/02bef8682cac782594e542d1ec6e031b9f7ac40edcfa6a1eb6d15d3b1ab126", > to evaluate the potential optimization of composefs like "lazy lookup" ^ composefs-like "lazy lookup" in overlayfs ... > in overlayfs > - squashfs (compressed): using large.squashfs.compressed > - squashfs (uncompressed): using large.squashfs.uncompressed > > -- Thanks, Jingbo ^ permalink raw reply [flat|nested] 87+ messages in thread