* [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem
@ 2023-01-13 15:33 Alexander Larsson
  2023-01-13 15:33 ` [PATCH v2 1/6] fsverity: Export fsverity_get_digest Alexander Larsson
                   ` (6 more replies)
  0 siblings, 7 replies; 34+ messages in thread
From: Alexander Larsson @ 2023-01-13 15:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, gscrivan, Alexander Larsson

Giuseppe Scrivano and I have recently been working on a new project we
call composefs. This is the first time we are proposing it publicly,
and we would like some feedback on it.

At its core, composefs is a way to construct and use read-only images
that are used similarly to how you would use e.g. loopback-mounted
squashfs images. On top of this, composefs has two fundamental
features: first, it allows sharing of file data (both on disk and in
the page cache) between images, and secondly, it provides dm-verity-like
validation on read.

Let me start with a minimal example of how this can be used, before
going into the details:

Suppose we have this source for an image:

rootfs/
├── dir
│   └── another_a
├── file_a
└── file_b

We can then use this to generate an image file and a set of
content-addressed backing files:

# mkcomposefs --digest-store=objects rootfs/ rootfs.img
# ls -l rootfs.img objects/*/*
-rw-------. 1 root root   10 Nov 18 13:20 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4
-rw-------. 1 root root   10 Nov 18 13:20 objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
-rw-r--r--. 1 root root 4228 Nov 18 13:20 rootfs.img

The rootfs.img file contains all the directory and file metadata,
plus references to the backing files by name. We can now mount this
and look at the result:

# mount -t composefs rootfs.img -o basedir=objects /mnt
# ls  /mnt/
dir  file_a  file_b
# cat /mnt/file_a
content_a

When reading this file, the kernel is actually reading the backing
file, in a fashion similar to overlayfs. Since the backing files are
content-addressed, the objects directory can be shared between
multiple images, and any files that happen to have the same content
are shared. I refer to this as opportunistic sharing, as it is
different from the more coarse-grained explicit sharing used by
e.g. container base images.
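
For example (the rootfs2/ source and the /mnt2 mount point are made up
for illustration), a second image generated against the same object
store only adds objects whose content is new, and identical files in
the two mounts are then served from the same backing objects, and
therefore from the same page cache:

# mkcomposefs --digest-store=objects rootfs2/ rootfs2.img
# mount -t composefs rootfs2.img -o basedir=objects /mnt2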

The next step is validation. Note that the object files have
fs-verity enabled; in fact, they are named by their fs-verity digest:

# fsverity digest objects/*/*
sha256:02927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4
sha256:cc3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f

The generated filesystem image may contain the expected digest for
the backing files. When a backing file's digest is incorrect, the open
will fail, and if the open succeeds, any other on-disk file changes
will be detected by fs-verity:

# cat objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
content_a
# rm -f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
# echo modified > objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
# cat /mnt/file_a
WARNING: composefs backing file '3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f' unexpectedly had no fs-verity digest
cat: /mnt/file_a: Input/output error

This reuses the existing fs-verity functionality to protect against
changes in file contents, while adding on top of it protection against
changes in filesystem metadata and structure, i.e. against replacing
an fs-verity-enabled file or modifying file permissions or xattrs.
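
For example (a sketch reusing the object names from above; the exact
error output may differ), replacing one backing object with another
file that itself has fs-verity enabled is still rejected, because its
digest no longer matches the one recorded in the image:

# mv objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 \
     objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
# cat /mnt/file_a
cat: /mnt/file_a: Input/output error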

To be fully verified we need another step: we enable fs-verity on the
image itself and pass the expected digest on the mount command line,
where it is verified at mount time:

# fsverity enable rootfs.img
# fsverity digest rootfs.img
sha256:da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 rootfs.img
# mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt

So, given a trusted set of mount options (say, unsealed from a TPM),
we have a fully verified filesystem tree mounted, with opportunistic
fine-grained sharing of identical files.

So, why do we want this? There are two initial users. First of all,
we want to use the opportunistic sharing for the podman container
image base layer. The idea is to use a composefs mount as the lower
directory in an overlay mount, with the upper directory being the
container work dir. This allows automatic file-level disk and
page-cache sharing between any two images, independent of details
like the permissions and timestamps of the files.
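
A minimal sketch of that setup (the base.img name and the upper/work
paths are made up for illustration):

# mount -t composefs base.img -o basedir=objects /lower
# mount -t overlay overlay -o lowerdir=/lower,upperdir=/upper,workdir=/work /merged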

Secondly, we are interested in using the verification aspects of
composefs in the ostree project. Ostree already supports a
content-addressed object store, but it is currently referenced by
hardlink farms. The object store and the trees that reference it are
signed and verified at download time, but there is no runtime
verification. If we replace the hardlink farm with a composefs image
that points into the existing object store, we can use composefs to
implement runtime verification.

In fact, the tooling to create composefs images is 100% reproducible,
so all we need is to record the composefs image's fs-verity digest in
the ostree commit. The image can then be reconstructed from the ostree
commit info, producing a file with the same fs-verity digest.
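
For instance (a sketch, with checkout/ standing in for a checkout of
the ostree commit), a client can rebuild the image locally and check
that the reported digest matches the one recorded in the commit:

# mkcomposefs --digest-store=objects checkout/ rootfs.img
# fsverity digest rootfs.img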

These are the use cases we're currently interested in, but there
seems to be a breadth of other possible uses. For example, many
systems use loopback mounts for images (like lxc or snap), and these
could take advantage of the opportunistic sharing. We've also talked
about using fuse to implement a local cache for the backing files:
the second basedir would be a fuse filesystem, and on a lookup failure
in the first basedir the file is downloaded and saved in the first
basedir for later lookups. There are many interesting possibilities
here.
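
As a sketch of that idea (the paths and the fuse filesystem itself are
hypothetical), the image would be mounted with two basedirs, the local
cache first and the fuse-backed remote store second:

# mount -t composefs rootfs.img -o basedir=/var/cache/objects:/mnt/remote-objects /mnt

The kernel tries the basedirs in order when opening a backing file, so
the fuse filesystem is only consulted when the local cache misses.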

The patch series contains some documentation on the file format and
how to use the filesystem.

The userspace tools (and a standalone kernel module) are available
here:
  https://github.com/containers/composefs

Initial work on ostree integration is here:
  https://github.com/ostreedev/ostree/pull/2640

Changes since v1:
- Fixed some minor compiler warnings
- Fixed build with !CONFIG_MMU
- Documentation fixes from review by Bagas Sanjaya
- Code style and cleanup from review by Brian Masney
- Use existing kernel helpers for hex digit conversion
- Use kmap_local_page() instead of deprecated kmap()

Alexander Larsson (6):
  fsverity: Export fsverity_get_digest
  composefs: Add on-disk layout
  composefs: Add descriptor parsing code
  composefs: Add filesystem implementation
  composefs: Add documentation
  composefs: Add kconfig and build support

 Documentation/filesystems/composefs.rst | 169 +++++
 Documentation/filesystems/index.rst     |   1 +
 fs/Kconfig                              |   1 +
 fs/Makefile                             |   1 +
 fs/composefs/Kconfig                    |  18 +
 fs/composefs/Makefile                   |   5 +
 fs/composefs/cfs-internals.h            |  63 ++
 fs/composefs/cfs-reader.c               | 927 ++++++++++++++++++++++++
 fs/composefs/cfs.c                      | 903 +++++++++++++++++++++++
 fs/composefs/cfs.h                      | 203 ++++++
 fs/verity/measure.c                     |   1 +
 11 files changed, 2292 insertions(+)
 create mode 100644 Documentation/filesystems/composefs.rst
 create mode 100644 fs/composefs/Kconfig
 create mode 100644 fs/composefs/Makefile
 create mode 100644 fs/composefs/cfs-internals.h
 create mode 100644 fs/composefs/cfs-reader.c
 create mode 100644 fs/composefs/cfs.c
 create mode 100644 fs/composefs/cfs.h

-- 
2.39.0



* [PATCH v2 1/6] fsverity: Export fsverity_get_digest
  2023-01-13 15:33 [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem Alexander Larsson
@ 2023-01-13 15:33 ` Alexander Larsson
  2023-01-13 15:33 ` [PATCH v2 2/6] composefs: Add on-disk layout Alexander Larsson
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 34+ messages in thread
From: Alexander Larsson @ 2023-01-13 15:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, gscrivan, Alexander Larsson

Composefs needs to call this when built as a module, so we need to
export the symbol. This uses EXPORT_SYMBOL_GPL, like the other
fsverity functions do.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
---
 fs/verity/measure.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/verity/measure.c b/fs/verity/measure.c
index 5c79ea1b2468..875d143e0c7e 100644
--- a/fs/verity/measure.c
+++ b/fs/verity/measure.c
@@ -85,3 +85,4 @@ int fsverity_get_digest(struct inode *inode,
 	*alg = hash_alg->algo_id;
 	return 0;
 }
+EXPORT_SYMBOL_GPL(fsverity_get_digest);
-- 
2.39.0



* [PATCH v2 2/6] composefs: Add on-disk layout
  2023-01-13 15:33 [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem Alexander Larsson
  2023-01-13 15:33 ` [PATCH v2 1/6] fsverity: Export fsverity_get_digest Alexander Larsson
@ 2023-01-13 15:33 ` Alexander Larsson
  2023-01-16  1:29   ` Dave Chinner
  2023-01-13 15:33 ` [PATCH v2 3/6] composefs: Add descriptor parsing code Alexander Larsson
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Alexander Larsson @ 2023-01-13 15:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, gscrivan, Alexander Larsson

This commit adds the on-disk layout header file of composefs.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
Co-developed-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
---
 fs/composefs/cfs.h | 203 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 203 insertions(+)
 create mode 100644 fs/composefs/cfs.h

diff --git a/fs/composefs/cfs.h b/fs/composefs/cfs.h
new file mode 100644
index 000000000000..658df728e366
--- /dev/null
+++ b/fs/composefs/cfs.h
@@ -0,0 +1,203 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * composefs
+ *
+ * Copyright (C) 2021 Giuseppe Scrivano
+ * Copyright (C) 2022 Alexander Larsson
+ *
+ * This file is released under the GPL.
+ */
+
+#ifndef _CFS_H
+#define _CFS_H
+
+#include <asm/byteorder.h>
+#include <crypto/sha2.h>
+#include <linux/fs.h>
+#include <linux/stat.h>
+#include <linux/types.h>
+
+#define CFS_VERSION 1
+
+#define CFS_MAGIC 0xc078629aU
+
+#define CFS_MAX_DIR_CHUNK_SIZE 4096
+#define CFS_MAX_XATTRS_SIZE 4096
+
+static inline int cfs_digest_from_payload(const char *payload, size_t payload_len,
+					  u8 digest_out[SHA256_DIGEST_SIZE])
+{
+	const char *p, *end;
+	u8 last_digit = 0;
+	int digit = 0;
+	size_t n_nibbles = 0;
+
+	/* This parses payloads (i.e. path names) that are "essentially" a
+	 * digest into that digest (used when the DIGEST_FROM_PAYLOAD flag is
+	 * set). "Essentially" means that we ignore hierarchical structure as
+	 * well as any extension. So, for example "ef/deadbeef.file" would
+	 * match the (too short) digest "efdeadbeef".
+	 *
+	 * This allows images to avoid storing both the digest and the pathname,
+	 * yet work with pre-existing object store formats of various kinds.
+	 */
+
+	end = payload + payload_len;
+	for (p = payload; p != end; p++) {
+		/* Skip subdir structure */
+		if (*p == '/')
+			continue;
+
+		/* Break at (and ignore) extension */
+		if (*p == '.')
+			break;
+
+		if (n_nibbles == SHA256_DIGEST_SIZE * 2)
+			return -EINVAL; /* Too long */
+
+		digit = hex_to_bin(*p);
+		if (digit == -1)
+			return -EINVAL; /* Not hex digit */
+
+		n_nibbles++;
+		if ((n_nibbles % 2) == 0)
+			digest_out[n_nibbles / 2 - 1] = (last_digit << 4) | digit;
+		last_digit = digit;
+	}
+
+	if (n_nibbles != SHA256_DIGEST_SIZE * 2)
+		return -EINVAL; /* Too short */
+
+	return 0;
+}
+
+struct cfs_vdata_s {
+	u64 off;
+	u32 len;
+} __packed;
+
+struct cfs_header_s {
+	u8 version;
+	u8 unused1;
+	u16 unused2;
+
+	u32 magic;
+	u64 data_offset;
+	u64 root_inode;
+
+	u64 unused3[2];
+} __packed;
+
+enum cfs_inode_flags {
+	CFS_INODE_FLAGS_NONE = 0,
+	CFS_INODE_FLAGS_PAYLOAD = 1 << 0,
+	CFS_INODE_FLAGS_MODE = 1 << 1,
+	CFS_INODE_FLAGS_NLINK = 1 << 2,
+	CFS_INODE_FLAGS_UIDGID = 1 << 3,
+	CFS_INODE_FLAGS_RDEV = 1 << 4,
+	CFS_INODE_FLAGS_TIMES = 1 << 5,
+	CFS_INODE_FLAGS_TIMES_NSEC = 1 << 6,
+	CFS_INODE_FLAGS_LOW_SIZE = 1 << 7, /* Low 32bit of st_size */
+	CFS_INODE_FLAGS_HIGH_SIZE = 1 << 8, /* High 32bit of st_size */
+	CFS_INODE_FLAGS_XATTRS = 1 << 9,
+	CFS_INODE_FLAGS_DIGEST = 1 << 10, /* fs-verity sha256 digest */
+	CFS_INODE_FLAGS_DIGEST_FROM_PAYLOAD = 1 << 11, /* Compute digest from payload */
+};
+
+#define CFS_INODE_FLAG_CHECK(_flag, _name)                                     \
+	(((_flag) & (CFS_INODE_FLAGS_##_name)) != 0)
+#define CFS_INODE_FLAG_CHECK_SIZE(_flag, _name, _size)                         \
+	(CFS_INODE_FLAG_CHECK(_flag, _name) ? (_size) : 0)
+
+#define CFS_INODE_DEFAULT_MODE 0100644
+#define CFS_INODE_DEFAULT_NLINK 1
+#define CFS_INODE_DEFAULT_NLINK_DIR 2
+#define CFS_INODE_DEFAULT_UIDGID 0
+#define CFS_INODE_DEFAULT_RDEV 0
+#define CFS_INODE_DEFAULT_TIMES 0
+
+struct cfs_inode_s {
+	u32 flags;
+
+	/* Optional data: (selected by flags) */
+
+	/* This is the size of the type specific data that comes directly after
+	 * the inode in the file. Of this type:
+	 *
+	 * directory: cfs_dir_s
+	 * regular file: the backing filename
+	 * symlink: the target link
+	 *
+	 * Canonically payload_length is 0 for empty dir/file/symlink.
+	 */
+	u32 payload_length;
+
+	u32 st_mode; /* File type and mode.  */
+	u32 st_nlink; /* Number of hard links, only for regular files.  */
+	u32 st_uid; /* User ID of owner.  */
+	u32 st_gid; /* Group ID of owner.  */
+	u32 st_rdev; /* Device ID (if special file).  */
+	u64 st_size; /* Size of file, only used for regular files */
+
+	struct cfs_vdata_s xattrs; /* ref to variable data */
+
+	u8 digest[SHA256_DIGEST_SIZE]; /* fs-verity digest */
+
+	struct timespec64 st_mtim; /* Time of last modification.  */
+	struct timespec64 st_ctim; /* Time of last status change.  */
+};
+
+static inline u32 cfs_inode_encoded_size(u32 flags)
+{
+	return sizeof(u32) /* flags */ +
+	       CFS_INODE_FLAG_CHECK_SIZE(flags, PAYLOAD, sizeof(u32)) +
+	       CFS_INODE_FLAG_CHECK_SIZE(flags, MODE, sizeof(u32)) +
+	       CFS_INODE_FLAG_CHECK_SIZE(flags, NLINK, sizeof(u32)) +
+	       CFS_INODE_FLAG_CHECK_SIZE(flags, UIDGID, sizeof(u32) + sizeof(u32)) +
+	       CFS_INODE_FLAG_CHECK_SIZE(flags, RDEV, sizeof(u32)) +
+	       CFS_INODE_FLAG_CHECK_SIZE(flags, TIMES, sizeof(u64) * 2) +
+	       CFS_INODE_FLAG_CHECK_SIZE(flags, TIMES_NSEC, sizeof(u32) * 2) +
+	       CFS_INODE_FLAG_CHECK_SIZE(flags, LOW_SIZE, sizeof(u32)) +
+	       CFS_INODE_FLAG_CHECK_SIZE(flags, HIGH_SIZE, sizeof(u32)) +
+	       CFS_INODE_FLAG_CHECK_SIZE(flags, XATTRS, sizeof(u64) + sizeof(u32)) +
+	       CFS_INODE_FLAG_CHECK_SIZE(flags, DIGEST, SHA256_DIGEST_SIZE);
+}
+
+struct cfs_dentry_s {
+	/* Index of struct cfs_inode_s */
+	u64 inode_index;
+	u8 d_type;
+	u8 name_len;
+	u16 name_offset;
+} __packed;
+
+struct cfs_dir_chunk_s {
+	u16 n_dentries;
+	u16 chunk_size;
+	u64 chunk_offset;
+} __packed;
+
+struct cfs_dir_s {
+	u32 n_chunks;
+	struct cfs_dir_chunk_s chunks[];
+} __packed;
+
+#define cfs_dir_size(_n_chunks)                                                \
+	(sizeof(struct cfs_dir_s) + (_n_chunks) * sizeof(struct cfs_dir_chunk_s))
+
+/* xattr representation.  */
+struct cfs_xattr_element_s {
+	u16 key_length;
+	u16 value_length;
+} __packed;
+
+struct cfs_xattr_header_s {
+	u16 n_attr;
+	struct cfs_xattr_element_s attr[];
+} __packed;
+
+#define cfs_xattr_header_size(_n_element)                                      \
+	(sizeof(struct cfs_xattr_header_s) +                                   \
+	 (_n_element) * sizeof(struct cfs_xattr_element_s))
+
+#endif
-- 
2.39.0



* [PATCH v2 3/6] composefs: Add descriptor parsing code
  2023-01-13 15:33 [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem Alexander Larsson
  2023-01-13 15:33 ` [PATCH v2 1/6] fsverity: Export fsverity_get_digest Alexander Larsson
  2023-01-13 15:33 ` [PATCH v2 2/6] composefs: Add on-disk layout Alexander Larsson
@ 2023-01-13 15:33 ` Alexander Larsson
  2023-01-13 15:33 ` [PATCH v2 4/6] composefs: Add filesystem implementation Alexander Larsson
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 34+ messages in thread
From: Alexander Larsson @ 2023-01-13 15:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, gscrivan, Alexander Larsson

This adds the code to load and decode the filesystem descriptor file
format.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
Co-developed-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
---
 fs/composefs/cfs-internals.h |  63 +++
 fs/composefs/cfs-reader.c    | 927 +++++++++++++++++++++++++++++++++++
 2 files changed, 990 insertions(+)
 create mode 100644 fs/composefs/cfs-internals.h
 create mode 100644 fs/composefs/cfs-reader.c

diff --git a/fs/composefs/cfs-internals.h b/fs/composefs/cfs-internals.h
new file mode 100644
index 000000000000..007f40a95e51
--- /dev/null
+++ b/fs/composefs/cfs-internals.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _CFS_INTERNALS_H
+#define _CFS_INTERNALS_H
+
+#include "cfs.h"
+
+#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
+
+#define CFS_N_PRELOAD_DIR_CHUNKS 4
+
+struct cfs_inode_data_s {
+	u32 payload_length;
+	char *path_payload; /* Real pathname for files, target for symlinks */
+	u32 n_dir_chunks;
+	struct cfs_dir_chunk_s preloaded_dir_chunks[CFS_N_PRELOAD_DIR_CHUNKS];
+
+	u64 xattrs_offset;
+	u32 xattrs_len;
+
+	bool has_digest;
+	u8 digest[SHA256_DIGEST_SIZE]; /* fs-verity digest */
+};
+
+struct cfs_context_s {
+	struct cfs_header_s header;
+	struct file *descriptor;
+
+	u64 descriptor_len;
+};
+
+int cfs_init_ctx(const char *descriptor_path, const u8 *required_digest,
+		 struct cfs_context_s *ctx);
+
+void cfs_ctx_put(struct cfs_context_s *ctx);
+
+void cfs_inode_data_put(struct cfs_inode_data_s *inode_data);
+
+struct cfs_inode_s *cfs_get_root_ino(struct cfs_context_s *ctx,
+				     struct cfs_inode_s *ino_buf, u64 *index);
+
+struct cfs_inode_s *cfs_get_ino_index(struct cfs_context_s *ctx, u64 index,
+				      struct cfs_inode_s *buffer);
+
+int cfs_init_inode_data(struct cfs_context_s *ctx, struct cfs_inode_s *ino,
+			u64 index, struct cfs_inode_data_s *data);
+
+ssize_t cfs_list_xattrs(struct cfs_context_s *ctx, struct cfs_inode_data_s *inode_data,
+			char *names, size_t size);
+int cfs_get_xattr(struct cfs_context_s *ctx, struct cfs_inode_data_s *inode_data,
+		  const char *name, void *value, size_t size);
+
+typedef bool (*cfs_dir_iter_cb)(void *private, const char *name, int namelen,
+				u64 ino, unsigned int dtype);
+
+int cfs_dir_iterate(struct cfs_context_s *ctx, u64 index,
+		    struct cfs_inode_data_s *inode_data, loff_t first,
+		    cfs_dir_iter_cb cb, void *private);
+
+int cfs_dir_lookup(struct cfs_context_s *ctx, u64 index,
+		   struct cfs_inode_data_s *inode_data, const char *name,
+		   size_t name_len, u64 *index_out);
+
+#endif
diff --git a/fs/composefs/cfs-reader.c b/fs/composefs/cfs-reader.c
new file mode 100644
index 000000000000..e68bfd0fca98
--- /dev/null
+++ b/fs/composefs/cfs-reader.c
@@ -0,0 +1,927 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * composefs
+ *
+ * Copyright (C) 2021 Giuseppe Scrivano
+ * Copyright (C) 2022 Alexander Larsson
+ *
+ * This file is released under the GPL.
+ */
+
+#include "cfs-internals.h"
+
+#include <linux/file.h>
+#include <linux/fsverity.h>
+#include <linux/pagemap.h>
+#include <linux/unaligned/packed_struct.h>
+
+struct cfs_buf {
+	struct page *page;
+	void *base;
+};
+
+static void cfs_buf_put(struct cfs_buf *buf)
+{
+	if (buf->page) {
+		if (buf->base)
+			kunmap_local(buf->base);
+		put_page(buf->page);
+		buf->base = NULL;
+		buf->page = NULL;
+	}
+}
+
+static void *cfs_get_buf(struct cfs_context_s *ctx, u64 offset, u32 size,
+			 struct cfs_buf *buf)
+{
+	struct inode *inode = ctx->descriptor->f_inode;
+	struct address_space *const mapping = inode->i_mapping;
+	u32 page_offset = offset & (PAGE_SIZE - 1);
+	u64 index = offset >> PAGE_SHIFT;
+	struct page *page = buf->page;
+
+	if (offset > ctx->descriptor_len)
+		return ERR_PTR(-EFSCORRUPTED);
+
+	if ((offset + size < offset) || (offset + size > ctx->descriptor_len))
+		return ERR_PTR(-EFSCORRUPTED);
+
+	if (size > PAGE_SIZE)
+		return ERR_PTR(-EINVAL);
+
+	if (PAGE_SIZE - page_offset < size)
+		return ERR_PTR(-EINVAL);
+
+	if (!page || page->index != index) {
+		cfs_buf_put(buf);
+
+		page = read_cache_page(mapping, index, NULL, NULL);
+		if (IS_ERR(page))
+			return page;
+
+		buf->page = page;
+		buf->base = kmap_local_page(page);
+	}
+
+	return buf->base + page_offset;
+}
+
+static void *cfs_read_data(struct cfs_context_s *ctx, u64 offset, u64 size, u8 *dest)
+{
+	loff_t pos = offset;
+	size_t copied;
+
+	if (offset > ctx->descriptor_len)
+		return ERR_PTR(-EFSCORRUPTED);
+
+	if ((offset + size < offset) || (offset + size > ctx->descriptor_len))
+		return ERR_PTR(-EFSCORRUPTED);
+
+	copied = 0;
+	while (copied < size) {
+		ssize_t bytes;
+
+		bytes = kernel_read(ctx->descriptor, dest + copied,
+				    size - copied, &pos);
+		if (bytes < 0)
+			return ERR_PTR(bytes);
+		if (bytes == 0)
+			return ERR_PTR(-EINVAL);
+
+		copied += bytes;
+	}
+
+	if (copied != size)
+		return ERR_PTR(-EFSCORRUPTED);
+	return dest;
+}
+
+int cfs_init_ctx(const char *descriptor_path, const u8 *required_digest,
+		 struct cfs_context_s *ctx_out)
+{
+	u8 verity_digest[FS_VERITY_MAX_DIGEST_SIZE];
+	struct cfs_header_s *header;
+	enum hash_algo verity_algo;
+	struct cfs_context_s ctx;
+	struct file *descriptor;
+	loff_t i_size;
+	int res;
+
+	descriptor = filp_open(descriptor_path, O_RDONLY, 0);
+	if (IS_ERR(descriptor))
+		return PTR_ERR(descriptor);
+
+	if (required_digest) {
+		res = fsverity_get_digest(d_inode(descriptor->f_path.dentry),
+					  verity_digest, &verity_algo);
+		if (res < 0) {
+			pr_err("ERROR: composefs descriptor has no fs-verity digest\n");
+			goto fail;
+		}
+		if (verity_algo != HASH_ALGO_SHA256 ||
+		    memcmp(required_digest, verity_digest, SHA256_DIGEST_SIZE) != 0) {
+			pr_err("ERROR: composefs descriptor has wrong fs-verity digest\n");
+			res = -EINVAL;
+			goto fail;
+		}
+	}
+
+	i_size = i_size_read(file_inode(descriptor));
+	if (i_size <= (sizeof(struct cfs_header_s) + sizeof(struct cfs_inode_s))) {
+		res = -EINVAL;
+		goto fail;
+	}
+
+	/* Need this temporary ctx for cfs_read_data() */
+	ctx.descriptor = descriptor;
+	ctx.descriptor_len = i_size;
+
+	header = cfs_read_data(&ctx, 0, sizeof(struct cfs_header_s),
+			       (u8 *)&ctx.header);
+	if (IS_ERR(header)) {
+		res = PTR_ERR(header);
+		goto fail;
+	}
+	header->magic = le32_to_cpu(header->magic);
+	header->data_offset = le64_to_cpu(header->data_offset);
+	header->root_inode = le64_to_cpu(header->root_inode);
+
+	if (header->magic != CFS_MAGIC || header->data_offset > ctx.descriptor_len ||
+	    sizeof(struct cfs_header_s) + header->root_inode > ctx.descriptor_len) {
+		res = -EINVAL;
+		goto fail;
+	}
+
+	*ctx_out = ctx;
+	return 0;
+
+fail:
+	fput(descriptor);
+	return res;
+}
+
+void cfs_ctx_put(struct cfs_context_s *ctx)
+{
+	if (ctx->descriptor) {
+		fput(ctx->descriptor);
+		ctx->descriptor = NULL;
+	}
+}
+
+static void *cfs_get_inode_data(struct cfs_context_s *ctx, u64 offset, u64 size,
+				u8 *dest)
+{
+	return cfs_read_data(ctx, offset + sizeof(struct cfs_header_s), size, dest);
+}
+
+static void *cfs_get_inode_data_max(struct cfs_context_s *ctx, u64 offset,
+				    u64 max_size, u64 *read_size, u8 *dest)
+{
+	u64 remaining = ctx->descriptor_len - sizeof(struct cfs_header_s);
+	u64 size;
+
+	if (offset > remaining)
+		return ERR_PTR(-EINVAL);
+	remaining -= offset;
+
+	/* Read at most remaining bytes, and no more than max_size */
+	size = min(remaining, max_size);
+	*read_size = size;
+
+	return cfs_get_inode_data(ctx, offset, size, dest);
+}
+
+static void *cfs_get_inode_payload_w_len(struct cfs_context_s *ctx,
+					 u32 payload_length, u64 index,
+					 u8 *dest, u64 offset, size_t len)
+{
+	/* Payload is stored before the inode, check it fits */
+	if (payload_length > index)
+		return ERR_PTR(-EINVAL);
+
+	if (offset > payload_length)
+		return ERR_PTR(-EINVAL);
+
+	if (offset + len > payload_length)
+		return ERR_PTR(-EINVAL);
+
+	return cfs_get_inode_data(ctx, index - payload_length + offset, len, dest);
+}
+
+static void *cfs_get_inode_payload(struct cfs_context_s *ctx,
+				   struct cfs_inode_s *ino, u64 index, u8 *dest)
+{
+	return cfs_get_inode_payload_w_len(ctx, ino->payload_length, index,
+					   dest, 0, ino->payload_length);
+}
+
+static void *cfs_get_vdata_buf(struct cfs_context_s *ctx, u64 offset, u32 len,
+			       struct cfs_buf *buf)
+{
+	if (offset > ctx->descriptor_len - ctx->header.data_offset)
+		return ERR_PTR(-EINVAL);
+
+	if (len > ctx->descriptor_len - ctx->header.data_offset - offset)
+		return ERR_PTR(-EINVAL);
+
+	return cfs_get_buf(ctx, ctx->header.data_offset + offset, len, buf);
+}
+
+static u32 cfs_read_u32(u8 **data)
+{
+	u32 v = le32_to_cpu(__get_unaligned_cpu32(*data));
+	*data += sizeof(u32);
+	return v;
+}
+
+static u64 cfs_read_u64(u8 **data)
+{
+	u64 v = le64_to_cpu(__get_unaligned_cpu64(*data));
+	*data += sizeof(u64);
+	return v;
+}
+
+struct cfs_inode_s *cfs_get_ino_index(struct cfs_context_s *ctx, u64 index,
+				      struct cfs_inode_s *ino)
+{
+	/* Buffer that fits the maximal encoded size: */
+	u8 buffer[sizeof(struct cfs_inode_s)];
+	u64 offset = index;
+	u64 inode_size;
+	u64 read_size;
+	u8 *data;
+
+	data = cfs_get_inode_data_max(ctx, offset, sizeof(buffer), &read_size, buffer);
+	if (IS_ERR(data))
+		return ERR_CAST(data);
+
+	/* Need to fit at least flags to decode */
+	if (read_size < sizeof(u32))
+		return ERR_PTR(-EFSCORRUPTED);
+
+	memset(ino, 0, sizeof(*ino));
+	ino->flags = cfs_read_u32(&data);
+
+	inode_size = cfs_inode_encoded_size(ino->flags);
+	/* Shouldn't happen, but let's check */
+	if (inode_size > sizeof(buffer))
+		return ERR_PTR(-EFSCORRUPTED);
+
+	if (CFS_INODE_FLAG_CHECK(ino->flags, PAYLOAD))
+		ino->payload_length = cfs_read_u32(&data);
+	else
+		ino->payload_length = 0;
+
+	if (CFS_INODE_FLAG_CHECK(ino->flags, MODE))
+		ino->st_mode = cfs_read_u32(&data);
+	else
+		ino->st_mode = CFS_INODE_DEFAULT_MODE;
+
+	if (CFS_INODE_FLAG_CHECK(ino->flags, NLINK)) {
+		ino->st_nlink = cfs_read_u32(&data);
+	} else {
+		if ((ino->st_mode & S_IFMT) == S_IFDIR)
+			ino->st_nlink = CFS_INODE_DEFAULT_NLINK_DIR;
+		else
+			ino->st_nlink = CFS_INODE_DEFAULT_NLINK;
+	}
+
+	if (CFS_INODE_FLAG_CHECK(ino->flags, UIDGID)) {
+		ino->st_uid = cfs_read_u32(&data);
+		ino->st_gid = cfs_read_u32(&data);
+	} else {
+		ino->st_uid = CFS_INODE_DEFAULT_UIDGID;
+		ino->st_gid = CFS_INODE_DEFAULT_UIDGID;
+	}
+
+	if (CFS_INODE_FLAG_CHECK(ino->flags, RDEV))
+		ino->st_rdev = cfs_read_u32(&data);
+	else
+		ino->st_rdev = CFS_INODE_DEFAULT_RDEV;
+
+	if (CFS_INODE_FLAG_CHECK(ino->flags, TIMES)) {
+		ino->st_mtim.tv_sec = cfs_read_u64(&data);
+		ino->st_ctim.tv_sec = cfs_read_u64(&data);
+	} else {
+		ino->st_mtim.tv_sec = CFS_INODE_DEFAULT_TIMES;
+		ino->st_ctim.tv_sec = CFS_INODE_DEFAULT_TIMES;
+	}
+
+	if (CFS_INODE_FLAG_CHECK(ino->flags, TIMES_NSEC)) {
+		ino->st_mtim.tv_nsec = cfs_read_u32(&data);
+		ino->st_ctim.tv_nsec = cfs_read_u32(&data);
+	} else {
+		ino->st_mtim.tv_nsec = 0;
+		ino->st_ctim.tv_nsec = 0;
+	}
+
+	if (CFS_INODE_FLAG_CHECK(ino->flags, LOW_SIZE))
+		ino->st_size = cfs_read_u32(&data);
+	else
+		ino->st_size = 0;
+
+	if (CFS_INODE_FLAG_CHECK(ino->flags, HIGH_SIZE))
+		ino->st_size += (u64)cfs_read_u32(&data) << 32;
+
+	if (CFS_INODE_FLAG_CHECK(ino->flags, XATTRS)) {
+		ino->xattrs.off = cfs_read_u64(&data);
+		ino->xattrs.len = cfs_read_u32(&data);
+	} else {
+		ino->xattrs.off = 0;
+		ino->xattrs.len = 0;
+	}
+
+	if (CFS_INODE_FLAG_CHECK(ino->flags, DIGEST)) {
+		memcpy(ino->digest, data, SHA256_DIGEST_SIZE);
+		data += SHA256_DIGEST_SIZE;
+	}
+
+	return ino;
+}
+
+struct cfs_inode_s *cfs_get_root_ino(struct cfs_context_s *ctx,
+				     struct cfs_inode_s *ino_buf, u64 *index)
+{
+	u64 root_ino = ctx->header.root_inode;
+
+	*index = root_ino;
+	return cfs_get_ino_index(ctx, root_ino, ino_buf);
+}
+
+static int cfs_get_digest(struct cfs_context_s *ctx, struct cfs_inode_s *ino,
+			  const char *payload, u8 digest_out[SHA256_DIGEST_SIZE])
+{
+	int r;
+
+	if (CFS_INODE_FLAG_CHECK(ino->flags, DIGEST)) {
+		memcpy(digest_out, ino->digest, SHA256_DIGEST_SIZE);
+		return 1;
+	}
+
+	if (payload && CFS_INODE_FLAG_CHECK(ino->flags, DIGEST_FROM_PAYLOAD)) {
+		r = cfs_digest_from_payload(payload, ino->payload_length, digest_out);
+		if (r < 0)
+			return r;
+		return 1;
+	}
+
+	return 0;
+}
+
+static bool cfs_validate_filename(const char *name, size_t name_len)
+{
+	if (name_len == 0)
+		return false;
+
+	if (name_len == 1 && name[0] == '.')
+		return false;
+
+	if (name_len == 2 && name[0] == '.' && name[1] == '.')
+		return false;
+
+	if (memchr(name, '/', name_len))
+		return false;
+
+	return true;
+}
+
+static struct cfs_dir_s *cfs_dir_read_chunk_header(struct cfs_context_s *ctx,
+						   size_t payload_length,
+						   u64 index, u8 *chunk_buf,
+						   size_t chunk_buf_size,
+						   size_t max_n_chunks)
+{
+	struct cfs_dir_s *dir;
+	size_t n_chunks;
+
+	/* Payload and buffer should be large enough to fit the n_chunks */
+	if (payload_length < sizeof(struct cfs_dir_s) ||
+	    chunk_buf_size < sizeof(struct cfs_dir_s))
+		return ERR_PTR(-EFSCORRUPTED);
+
+	/* Make sure we fit max_n_chunks in buffer before reading it */
+	if (chunk_buf_size < cfs_dir_size(max_n_chunks))
+		return ERR_PTR(-EINVAL);
+
+	dir = cfs_get_inode_payload_w_len(ctx, payload_length, index, chunk_buf,
+					  0, min(chunk_buf_size, payload_length));
+	if (IS_ERR(dir))
+		return ERR_CAST(dir);
+
+	n_chunks = le32_to_cpu(dir->n_chunks);
+	dir->n_chunks = n_chunks;
+
+	/* Don't support n_chunks == 0, the canonical version of that is payload_length == 0 */
+	if (n_chunks == 0)
+		return ERR_PTR(-EFSCORRUPTED);
+
+	if (payload_length != cfs_dir_size(n_chunks))
+		return ERR_PTR(-EFSCORRUPTED);
+
+	max_n_chunks = min(n_chunks, max_n_chunks);
+
+	/* Verify data (up to max_n_chunks) */
+	for (size_t i = 0; i < max_n_chunks; i++) {
+		struct cfs_dir_chunk_s *chunk = &dir->chunks[i];
+
+		chunk->n_dentries = le16_to_cpu(chunk->n_dentries);
+		chunk->chunk_size = le16_to_cpu(chunk->chunk_size);
+		chunk->chunk_offset = le64_to_cpu(chunk->chunk_offset);
+
+		if (chunk->chunk_size < sizeof(struct cfs_dentry_s) * chunk->n_dentries)
+			return ERR_PTR(-EFSCORRUPTED);
+
+		if (chunk->chunk_size > CFS_MAX_DIR_CHUNK_SIZE)
+			return ERR_PTR(-EFSCORRUPTED);
+
+		if (chunk->n_dentries == 0)
+			return ERR_PTR(-EFSCORRUPTED);
+
+		if (chunk->chunk_size == 0)
+			return ERR_PTR(-EFSCORRUPTED);
+
+		if (chunk->chunk_offset > ctx->descriptor_len - ctx->header.data_offset)
+			return ERR_PTR(-EFSCORRUPTED);
+	}
+
+	return dir;
+}
+
+static char *cfs_dup_payload_path(struct cfs_context_s *ctx,
+				  struct cfs_inode_s *ino, u64 index)
+{
+	const char *v;
+	u8 *path;
+
+	if ((ino->st_mode & S_IFMT) != S_IFREG && (ino->st_mode & S_IFMT) != S_IFLNK)
+		return ERR_PTR(-EINVAL);
+
+	if (ino->payload_length == 0 || ino->payload_length > PATH_MAX)
+		return ERR_PTR(-EFSCORRUPTED);
+
+	path = kmalloc(ino->payload_length + 1, GFP_KERNEL);
+	if (!path)
+		return ERR_PTR(-ENOMEM);
+
+	v = cfs_get_inode_payload(ctx, ino, index, path);
+	if (IS_ERR(v)) {
+		kfree(path);
+		return ERR_CAST(v);
+	}
+
+	/* zero terminate */
+	path[ino->payload_length] = 0;
+
+	return (char *)path;
+}
+
+int cfs_init_inode_data(struct cfs_context_s *ctx, struct cfs_inode_s *ino,
+			u64 index, struct cfs_inode_data_s *inode_data)
+{
+	u8 buf[cfs_dir_size(CFS_N_PRELOAD_DIR_CHUNKS)];
+	char *path_payload = NULL;
+	struct cfs_dir_s *dir;
+	int ret = 0;
+
+	inode_data->payload_length = ino->payload_length;
+
+	if ((ino->st_mode & S_IFMT) != S_IFDIR || ino->payload_length == 0) {
+		inode_data->n_dir_chunks = 0;
+	} else {
+		u32 n_chunks;
+
+		dir = cfs_dir_read_chunk_header(ctx, ino->payload_length, index,
+						buf, sizeof(buf),
+						CFS_N_PRELOAD_DIR_CHUNKS);
+		if (IS_ERR(dir))
+			return PTR_ERR(dir);
+
+		n_chunks = dir->n_chunks;
+		inode_data->n_dir_chunks = n_chunks;
+
+		for (size_t i = 0; i < n_chunks && i < CFS_N_PRELOAD_DIR_CHUNKS; i++)
+			inode_data->preloaded_dir_chunks[i] = dir->chunks[i];
+	}
+
+	if ((ino->st_mode & S_IFMT) == S_IFLNK ||
+	    ((ino->st_mode & S_IFMT) == S_IFREG && ino->payload_length > 0)) {
+		path_payload = cfs_dup_payload_path(ctx, ino, index);
+		if (IS_ERR(path_payload)) {
+			ret = PTR_ERR(path_payload);
+			goto fail;
+		}
+	}
+	inode_data->path_payload = path_payload;
+
+	ret = cfs_get_digest(ctx, ino, path_payload, inode_data->digest);
+	if (ret < 0)
+		goto fail;
+
+	inode_data->has_digest = ret != 0;
+
+	inode_data->xattrs_offset = ino->xattrs.off;
+	inode_data->xattrs_len = ino->xattrs.len;
+
+	if (inode_data->xattrs_len != 0) {
+		/* Validate xattr size */
+		if (inode_data->xattrs_len < sizeof(struct cfs_xattr_header_s) ||
+		    inode_data->xattrs_len > CFS_MAX_XATTRS_SIZE) {
+			ret = -EFSCORRUPTED;
+			goto fail;
+		}
+	}
+
+	return 0;
+
+fail:
+	cfs_inode_data_put(inode_data);
+	return ret;
+}
+
+void cfs_inode_data_put(struct cfs_inode_data_s *inode_data)
+{
+	inode_data->n_dir_chunks = 0;
+	kfree(inode_data->path_payload);
+	inode_data->path_payload = NULL;
+}
+
+ssize_t cfs_list_xattrs(struct cfs_context_s *ctx,
+			struct cfs_inode_data_s *inode_data, char *names, size_t size)
+{
+	const struct cfs_xattr_header_s *xattrs;
+	struct cfs_buf vdata_buf = { NULL };
+	size_t n_xattrs = 0;
+	u8 *data, *data_end;
+	ssize_t copied = 0;
+
+	if (inode_data->xattrs_len == 0)
+		return 0;
+
+	/* xattrs_len basic size req was verified in cfs_init_inode_data */
+
+	xattrs = cfs_get_vdata_buf(ctx, inode_data->xattrs_offset,
+				   inode_data->xattrs_len, &vdata_buf);
+	if (IS_ERR(xattrs))
+		return PTR_ERR(xattrs);
+
+	n_xattrs = le16_to_cpu(xattrs->n_attr);
+
+	/* Verify that array fits */
+	if (inode_data->xattrs_len < cfs_xattr_header_size(n_xattrs)) {
+		copied = -EFSCORRUPTED;
+		goto exit;
+	}
+
+	data = ((u8 *)xattrs) + cfs_xattr_header_size(n_xattrs);
+	data_end = ((u8 *)xattrs) + inode_data->xattrs_len;
+
+	for (size_t i = 0; i < n_xattrs; i++) {
+		const struct cfs_xattr_element_s *e = &xattrs->attr[i];
+		u16 this_value_len = le16_to_cpu(e->value_length);
+		u16 this_key_len = le16_to_cpu(e->key_length);
+		const char *this_key;
+
+		if (this_key_len > XATTR_NAME_MAX ||
+		    /* key and value need to fit in data */
+		    data_end - data < this_key_len + this_value_len) {
+			copied = -EFSCORRUPTED;
+			goto exit;
+		}
+
+		this_key = data;
+		data += this_key_len + this_value_len;
+
+		if (size) {
+			if (size - copied < this_key_len + 1) {
+				copied = -E2BIG;
+				goto exit;
+			}
+
+			memcpy(names + copied, this_key, this_key_len);
+			names[copied + this_key_len] = '\0';
+		}
+
+		copied += this_key_len + 1;
+	}
+
+exit:
+	cfs_buf_put(&vdata_buf);
+
+	return copied;
+}
+
+int cfs_get_xattr(struct cfs_context_s *ctx, struct cfs_inode_data_s *inode_data,
+		  const char *name, void *value, size_t size)
+{
+	struct cfs_xattr_header_s *xattrs;
+	struct cfs_buf vdata_buf = { NULL };
+	size_t name_len = strlen(name);
+	size_t n_xattrs = 0;
+	u8 *data, *data_end;
+	int res;
+
+	if (inode_data->xattrs_len == 0)
+		return -ENODATA;
+
+	/* xattrs_len basic size req was verified in cfs_init_inode_data */
+
+	xattrs = cfs_get_vdata_buf(ctx, inode_data->xattrs_offset,
+				   inode_data->xattrs_len, &vdata_buf);
+	if (IS_ERR(xattrs))
+		return PTR_ERR(xattrs);
+
+	n_xattrs = le16_to_cpu(xattrs->n_attr);
+
+	/* Verify that array fits */
+	if (inode_data->xattrs_len < cfs_xattr_header_size(n_xattrs)) {
+		res = -EFSCORRUPTED;
+		goto exit;
+	}
+
+	data = ((u8 *)xattrs) + cfs_xattr_header_size(n_xattrs);
+	data_end = ((u8 *)xattrs) + inode_data->xattrs_len;
+
+	for (size_t i = 0; i < n_xattrs; i++) {
+		const struct cfs_xattr_element_s *e = &xattrs->attr[i];
+		u16 this_value_len = le16_to_cpu(e->value_length);
+		u16 this_key_len = le16_to_cpu(e->key_length);
+		const char *this_key, *this_value;
+
+		if (this_key_len > XATTR_NAME_MAX ||
+		    /* key and value need to fit in data */
+		    data_end - data < this_key_len + this_value_len) {
+			res = -EFSCORRUPTED;
+			goto exit;
+		}
+
+		this_key = data;
+		this_value = data + this_key_len;
+		data += this_key_len + this_value_len;
+
+		if (this_key_len != name_len || memcmp(this_key, name, name_len) != 0)
+			continue;
+
+		if (size > 0) {
+			if (size < this_value_len) {
+				res = -E2BIG;
+				goto exit;
+			}
+			memcpy(value, this_value, this_value_len);
+		}
+
+		res = this_value_len;
+		goto exit;
+	}
+
+	res = -ENODATA;
+
+exit:
+	return res;
+}
+
+static struct cfs_dir_s *
+cfs_dir_read_chunk_header_alloc(struct cfs_context_s *ctx, u64 index,
+				struct cfs_inode_data_s *inode_data)
+{
+	size_t chunk_buf_size = cfs_dir_size(inode_data->n_dir_chunks);
+	struct cfs_dir_s *dir;
+	u8 *chunk_buf;
+
+	chunk_buf = kmalloc(chunk_buf_size, GFP_KERNEL);
+	if (!chunk_buf)
+		return ERR_PTR(-ENOMEM);
+
+	dir = cfs_dir_read_chunk_header(ctx, inode_data->payload_length, index,
+					chunk_buf, chunk_buf_size,
+					inode_data->n_dir_chunks);
+	if (IS_ERR(dir)) {
+		kfree(chunk_buf);
+		return ERR_CAST(dir);
+	}
+
+	return dir;
+}
+
+static struct cfs_dir_chunk_s *
+cfs_dir_get_chunk_info(struct cfs_context_s *ctx, u64 index,
+		       struct cfs_inode_data_s *inode_data, void **chunks_buf)
+{
+	struct cfs_dir_s *full_dir;
+
+	if (inode_data->n_dir_chunks <= CFS_N_PRELOAD_DIR_CHUNKS) {
+		*chunks_buf = NULL;
+		return inode_data->preloaded_dir_chunks;
+	}
+
+	full_dir = cfs_dir_read_chunk_header_alloc(ctx, index, inode_data);
+	if (IS_ERR(full_dir))
+		return ERR_CAST(full_dir);
+
+	*chunks_buf = full_dir;
+	return full_dir->chunks;
+}
+
+static inline int memcmp2(const void *a, const size_t a_size, const void *b,
+			  size_t b_size)
+{
+	size_t common_size = min(a_size, b_size);
+	int res;
+
+	res = memcmp(a, b, common_size);
+	if (res != 0 || a_size == b_size)
+		return res;
+
+	return a_size < b_size ? -1 : 1;
+}
+
+int cfs_dir_iterate(struct cfs_context_s *ctx, u64 index,
+		    struct cfs_inode_data_s *inode_data, loff_t first,
+		    cfs_dir_iter_cb cb, void *private)
+{
+	struct cfs_buf vdata_buf = { NULL };
+	struct cfs_dir_chunk_s *chunks;
+	struct cfs_dentry_s *dentries;
+	char *namedata, *namedata_end;
+	void *chunks_buf;
+	size_t n_chunks;
+	loff_t pos;
+	int res;
+
+	n_chunks = inode_data->n_dir_chunks;
+	if (n_chunks == 0)
+		return 0;
+
+	chunks = cfs_dir_get_chunk_info(ctx, index, inode_data, &chunks_buf);
+	if (IS_ERR(chunks))
+		return PTR_ERR(chunks);
+
+	pos = 0;
+	for (size_t i = 0; i < n_chunks; i++) {
+		/* Chunks info are verified/converted in cfs_dir_read_chunk_header */
+		u64 chunk_offset = chunks[i].chunk_offset;
+		size_t chunk_size = chunks[i].chunk_size;
+		size_t n_dentries = chunks[i].n_dentries;
+
+		/* Do we need to look at this chunk */
+		if (first >= pos + n_dentries) {
+			pos += n_dentries;
+			continue;
+		}
+
+		/* Read chunk dentries from page cache */
+		dentries = cfs_get_vdata_buf(ctx, chunk_offset, chunk_size,
+					     &vdata_buf);
+		if (IS_ERR(dentries)) {
+			res = PTR_ERR(dentries);
+			goto exit;
+		}
+
+		namedata = ((char *)dentries) +
+			   sizeof(struct cfs_dentry_s) * n_dentries;
+		namedata_end = ((char *)dentries) + chunk_size;
+
+		for (size_t j = 0; j < n_dentries; j++) {
+			struct cfs_dentry_s *dentry = &dentries[j];
+			size_t dentry_name_len = dentry->name_len;
+			char *dentry_name = (char *)namedata + dentry->name_offset;
+
+			/* name needs to fit in namedata */
+			if (dentry_name >= namedata_end ||
+			    namedata_end - dentry_name < dentry_name_len) {
+				res = -EFSCORRUPTED;
+				goto exit;
+			}
+
+			if (!cfs_validate_filename(dentry_name, dentry_name_len)) {
+				res = -EFSCORRUPTED;
+				goto exit;
+			}
+
+			if (pos++ < first)
+				continue;
+
+			if (!cb(private, dentry_name, dentry_name_len,
+				le64_to_cpu(dentry->inode_index), dentry->d_type)) {
+				res = 0;
+				goto exit;
+			}
+		}
+	}
+
+	res = 0;
+exit:
+	kfree(chunks_buf);
+	cfs_buf_put(&vdata_buf);
+	return res;
+}
+
+#define BEFORE_CHUNK 1
+#define AFTER_CHUNK 2
+// -1 => error, 0 == hit, 1 == name is before chunk, 2 == name is after chunk
+static int cfs_dir_lookup_in_chunk(const char *name, size_t name_len,
+				   struct cfs_dentry_s *dentries,
+				   size_t n_dentries, char *namedata,
+				   char *namedata_end, u64 *index_out)
+{
+	int start_dentry, end_dentry;
+	int cmp;
+
+	// This should not happen in a valid fs, and if it does we don't know if
+	// the name is before or after the chunk.
+	if (n_dentries == 0)
+		return -EFSCORRUPTED;
+
+	start_dentry = 0;
+	end_dentry = n_dentries - 1;
+	while (start_dentry <= end_dentry) {
+		int mid_dentry = start_dentry + (end_dentry - start_dentry) / 2;
+		struct cfs_dentry_s *dentry = &dentries[mid_dentry];
+		char *dentry_name = (char *)namedata + dentry->name_offset;
+		size_t dentry_name_len = dentry->name_len;
+
+		/* name needs to fit in namedata */
+		if (dentry_name >= namedata_end ||
+		    namedata_end - dentry_name < dentry_name_len) {
+			return -EFSCORRUPTED;
+		}
+
+		cmp = memcmp2(name, name_len, dentry_name, dentry_name_len);
+		if (cmp == 0) {
+			*index_out = le64_to_cpu(dentry->inode_index);
+			return 0;
+		}
+
+		if (cmp > 0)
+			start_dentry = mid_dentry + 1;
+		else
+			end_dentry = mid_dentry - 1;
+	}
+
+	return cmp > 0 ? AFTER_CHUNK : BEFORE_CHUNK;
+}
+
+int cfs_dir_lookup(struct cfs_context_s *ctx, u64 index,
+		   struct cfs_inode_data_s *inode_data, const char *name,
+		   size_t name_len, u64 *index_out)
+{
+	int n_chunks, start_chunk, end_chunk;
+	struct cfs_buf vdata_buf = { NULL };
+	char *namedata, *namedata_end;
+	struct cfs_dir_chunk_s *chunks;
+	struct cfs_dentry_s *dentries;
+	void *chunks_buf;
+	int res, r;
+
+	n_chunks = inode_data->n_dir_chunks;
+	if (n_chunks == 0)
+		return 0;
+
+	chunks = cfs_dir_get_chunk_info(ctx, index, inode_data, &chunks_buf);
+	if (IS_ERR(chunks))
+		return PTR_ERR(chunks);
+
+	start_chunk = 0;
+	end_chunk = n_chunks - 1;
+
+	while (start_chunk <= end_chunk) {
+		int mid_chunk = start_chunk + (end_chunk - start_chunk) / 2;
+
+		/* Chunks info are verified/converted in cfs_dir_read_chunk_header */
+		u64 chunk_offset = chunks[mid_chunk].chunk_offset;
+		size_t chunk_size = chunks[mid_chunk].chunk_size;
+		size_t n_dentries = chunks[mid_chunk].n_dentries;
+
+		/* Read chunk dentries from page cache */
+		dentries = cfs_get_vdata_buf(ctx, chunk_offset, chunk_size,
+					     &vdata_buf);
+		if (IS_ERR(dentries)) {
+			res = PTR_ERR(dentries);
+			goto exit;
+		}
+
+		namedata = ((char *)dentries) + sizeof(struct cfs_dentry_s) * n_dentries;
+		namedata_end = ((char *)dentries) + chunk_size;
+
+		r = cfs_dir_lookup_in_chunk(name, name_len, dentries, n_dentries,
+					    namedata, namedata_end, index_out);
+		if (r < 0) {
+			res = r; /* error */
+			goto exit;
+		} else if (r == 0) {
+			res = 1; /* found it */
+			goto exit;
+		} else if (r == AFTER_CHUNK) {
+			start_chunk = mid_chunk + 1;
+		} else { /* before */
+			end_chunk = mid_chunk - 1;
+		}
+	}
+
+	/* not found */
+	res = 0;
+
+exit:
+	kfree(chunks_buf);
+	cfs_buf_put(&vdata_buf);
+	return res;
+}
-- 
2.39.0



* [PATCH v2 4/6] composefs: Add filesystem implementation
  2023-01-13 15:33 [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem Alexander Larsson
                   ` (2 preceding siblings ...)
  2023-01-13 15:33 ` [PATCH v2 3/6] composefs: Add descriptor parsing code Alexander Larsson
@ 2023-01-13 15:33 ` Alexander Larsson
  2023-01-13 21:55   ` kernel test robot
  2023-01-16 22:07   ` Al Viro
  2023-01-13 15:33 ` [PATCH v2 5/6] composefs: Add documentation Alexander Larsson
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 34+ messages in thread
From: Alexander Larsson @ 2023-01-13 15:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, gscrivan, Alexander Larsson

This is the basic inode and filesystem implementation.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
Co-developed-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
---
 fs/composefs/cfs.c | 903 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 903 insertions(+)
 create mode 100644 fs/composefs/cfs.c

diff --git a/fs/composefs/cfs.c b/fs/composefs/cfs.c
new file mode 100644
index 000000000000..b3c0adb69983
--- /dev/null
+++ b/fs/composefs/cfs.c
@@ -0,0 +1,903 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * composefs
+ *
+ * Copyright (C) 2000 Linus Torvalds.
+ *               2000 Transmeta Corp.
+ * Copyright (C) 2021 Giuseppe Scrivano
+ * Copyright (C) 2022 Alexander Larsson
+ *
+ * This file is released under the GPL.
+ */
+
+#include <linux/exportfs.h>
+#include <linux/fsverity.h>
+#include <linux/fs_parser.h>
+#include <linux/module.h>
+#include <linux/namei.h>
+#include <linux/seq_file.h>
+#include <linux/version.h>
+#include <linux/xattr.h>
+#include <linux/statfs.h>
+
+#include "cfs-internals.h"
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Giuseppe Scrivano <gscrivan@redhat.com>");
+
+#define CFS_MAX_STACK 500
+
+#define FILEID_CFS 0x91
+
+struct cfs_info {
+	struct cfs_context_s cfs_ctx;
+
+	char *base_path;
+
+	size_t n_bases;
+	struct vfsmount **bases;
+
+	u32 verity_check; /* 0 == none, 1 == if specified in image, 2 == require in image */
+	bool has_digest;
+	u8 digest[SHA256_DIGEST_SIZE]; /* fs-verity digest */
+};
+
+struct cfs_inode {
+	/* must be first for clear in cfs_alloc_inode to work */
+	struct inode vfs_inode;
+
+	struct cfs_inode_data_s inode_data;
+};
+
+static inline struct cfs_inode *CFS_I(struct inode *inode)
+{
+	return container_of(inode, struct cfs_inode, vfs_inode);
+}
+
+static struct file empty_file;
+
+static const struct file_operations cfs_file_operations;
+
+static const struct super_operations cfs_ops;
+static const struct file_operations cfs_dir_operations;
+static const struct inode_operations cfs_dir_inode_operations;
+static const struct inode_operations cfs_file_inode_operations;
+static const struct inode_operations cfs_link_inode_operations;
+
+static const struct xattr_handler *cfs_xattr_handlers[];
+static const struct export_operations cfs_export_operations;
+
+static const struct address_space_operations cfs_aops = {
+	.direct_IO = noop_direct_IO,
+};
+
+static ssize_t cfs_listxattr(struct dentry *dentry, char *names, size_t size);
+
+/* copied from overlayfs.  */
+static unsigned int cfs_split_basedirs(char *str)
+{
+	unsigned int ctr = 1;
+	char *s, *d;
+
+	for (s = d = str;; s++, d++) {
+		if (*s == '\\') {
+			s++;
+		} else if (*s == ':') {
+			*d = '\0';
+			ctr++;
+			continue;
+		}
+		*d = *s;
+		if (!*s)
+			break;
+	}
+	return ctr;
+}
+
+static struct inode *cfs_make_inode(struct cfs_context_s *ctx,
+				    struct super_block *sb, ino_t ino_num,
+				    struct cfs_inode_s *ino, const struct inode *dir)
+{
+	struct cfs_inode_data_s inode_data = { 0 };
+	struct cfs_xattr_header_s *xattrs = NULL;
+	struct inode *inode = NULL;
+	struct cfs_inode *cino;
+	int ret, res;
+
+	res = cfs_init_inode_data(ctx, ino, ino_num, &inode_data);
+	if (res < 0)
+		return ERR_PTR(res);
+
+	inode = new_inode(sb);
+	if (inode) {
+		inode_init_owner(&init_user_ns, inode, dir, ino->st_mode);
+		inode->i_mapping->a_ops = &cfs_aops;
+
+		cino = CFS_I(inode);
+		cino->inode_data = inode_data;
+
+		inode->i_ino = ino_num;
+		set_nlink(inode, ino->st_nlink);
+		inode->i_rdev = ino->st_rdev;
+		inode->i_uid = make_kuid(current_user_ns(), ino->st_uid);
+		inode->i_gid = make_kgid(current_user_ns(), ino->st_gid);
+		inode->i_mode = ino->st_mode;
+		inode->i_atime = ino->st_mtim;
+		inode->i_mtime = ino->st_mtim;
+		inode->i_ctime = ino->st_ctim;
+
+		switch (ino->st_mode & S_IFMT) {
+		case S_IFREG:
+			inode->i_op = &cfs_file_inode_operations;
+			inode->i_fop = &cfs_file_operations;
+			inode->i_size = ino->st_size;
+			break;
+		case S_IFLNK:
+			inode->i_link = cino->inode_data.path_payload;
+			inode->i_op = &cfs_link_inode_operations;
+			inode->i_fop = &cfs_file_operations;
+			break;
+		case S_IFDIR:
+			inode->i_op = &cfs_dir_inode_operations;
+			inode->i_fop = &cfs_dir_operations;
+			inode->i_size = 4096;
+			break;
+		case S_IFCHR:
+		case S_IFBLK:
+			if (current_user_ns() != &init_user_ns) {
+				ret = -EPERM;
+				goto fail;
+			}
+			fallthrough;
+		default:
+			inode->i_op = &cfs_file_inode_operations;
+			init_special_inode(inode, ino->st_mode, ino->st_rdev);
+			break;
+		}
+	}
+	return inode;
+
+fail:
+	if (inode)
+		iput(inode);
+	kfree(xattrs);
+	cfs_inode_data_put(&inode_data);
+	return ERR_PTR(ret);
+}
+
+static struct inode *cfs_get_root_inode(struct super_block *sb)
+{
+	struct cfs_info *fsi = sb->s_fs_info;
+	struct cfs_inode_s ino_buf;
+	struct cfs_inode_s *ino;
+	u64 index;
+
+	ino = cfs_get_root_ino(&fsi->cfs_ctx, &ino_buf, &index);
+	if (IS_ERR(ino))
+		return ERR_CAST(ino);
+
+	return cfs_make_inode(&fsi->cfs_ctx, sb, index, ino, NULL);
+}
+
+static bool cfs_iterate_cb(void *private, const char *name, int name_len,
+			   u64 ino, unsigned int dtype)
+{
+	struct dir_context *ctx = private;
+
+	if (!dir_emit(ctx, name, name_len, ino, dtype))
+		return 0;
+
+	ctx->pos++;
+	return 1;
+}
+
+static int cfs_iterate(struct file *file, struct dir_context *ctx)
+{
+	struct inode *inode = file->f_inode;
+	struct cfs_info *fsi = inode->i_sb->s_fs_info;
+	struct cfs_inode *cino = CFS_I(inode);
+
+	if (!dir_emit_dots(file, ctx))
+		return 0;
+
+	return cfs_dir_iterate(&fsi->cfs_ctx, inode->i_ino, &cino->inode_data,
+			       ctx->pos - 2, cfs_iterate_cb, ctx);
+}
+
+static struct dentry *cfs_lookup(struct inode *dir, struct dentry *dentry,
+				 unsigned int flags)
+{
+	struct cfs_info *fsi = dir->i_sb->s_fs_info;
+	struct cfs_inode *cino = CFS_I(dir);
+	struct cfs_inode_s ino_buf;
+	struct cfs_inode_s *ino_s;
+	struct inode *inode;
+	u64 index;
+	int ret;
+
+	if (dentry->d_name.len > NAME_MAX)
+		return ERR_PTR(-ENAMETOOLONG);
+
+	ret = cfs_dir_lookup(&fsi->cfs_ctx, dir->i_ino, &cino->inode_data,
+			     dentry->d_name.name, dentry->d_name.len, &index);
+	if (ret < 0)
+		return ERR_PTR(ret);
+	if (ret == 0)
+		goto return_negative;
+
+	ino_s = cfs_get_ino_index(&fsi->cfs_ctx, index, &ino_buf);
+	if (IS_ERR(ino_s))
+		return ERR_CAST(ino_s);
+
+	inode = cfs_make_inode(&fsi->cfs_ctx, dir->i_sb, index, ino_s, dir);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	return d_splice_alias(inode, dentry);
+
+return_negative:
+	d_add(dentry, NULL);
+	return NULL;
+}
+
+static const struct file_operations cfs_dir_operations = {
+	.llseek = generic_file_llseek,
+	.read = generic_read_dir,
+	.iterate_shared = cfs_iterate,
+};
+
+static const struct inode_operations cfs_dir_inode_operations = {
+	.lookup = cfs_lookup,
+	.listxattr = cfs_listxattr,
+};
+
+static const struct inode_operations cfs_link_inode_operations = {
+	.get_link = simple_get_link,
+	.listxattr = cfs_listxattr,
+};
+
+static int digest_from_string(const char *digest_str, u8 *digest)
+{
+	int res;
+
+	res = hex2bin(digest, digest_str, SHA256_DIGEST_SIZE);
+	if (res < 0)
+		return res;
+
+	if (digest_str[2 * SHA256_DIGEST_SIZE] != 0)
+		return -EINVAL; /* Too long string */
+
+	return 0;
+}
+
+/*
+ * Display the mount options in /proc/mounts.
+ */
+static int cfs_show_options(struct seq_file *m, struct dentry *root)
+{
+	struct cfs_info *fsi = root->d_sb->s_fs_info;
+
+	if (fsi->base_path)
+		seq_show_option(m, "basedir", fsi->base_path);
+	if (fsi->has_digest)
+		seq_printf(m, ",digest=%*phN", SHA256_DIGEST_SIZE, fsi->digest);
+	if (fsi->verity_check != 0)
+		seq_printf(m, ",verity_check=%u", fsi->verity_check);
+
+	return 0;
+}
+
+static struct kmem_cache *cfs_inode_cachep;
+
+static struct inode *cfs_alloc_inode(struct super_block *sb)
+{
+	struct cfs_inode *cino = alloc_inode_sb(sb, cfs_inode_cachep, GFP_KERNEL);
+
+	if (!cino)
+		return NULL;
+
+	memset((u8 *)cino + sizeof(struct inode), 0,
+	       sizeof(struct cfs_inode) - sizeof(struct inode));
+
+	return &cino->vfs_inode;
+}
+
+static void cfs_destroy_inode(struct inode *inode)
+{
+	struct cfs_inode *cino = CFS_I(inode);
+
+	cfs_inode_data_put(&cino->inode_data);
+}
+
+static void cfs_free_inode(struct inode *inode)
+{
+	struct cfs_inode *cino = CFS_I(inode);
+
+	kmem_cache_free(cfs_inode_cachep, cino);
+}
+
+static void cfs_put_super(struct super_block *sb)
+{
+	struct cfs_info *fsi = sb->s_fs_info;
+
+	cfs_ctx_put(&fsi->cfs_ctx);
+	if (fsi->bases) {
+		kern_unmount_array(fsi->bases, fsi->n_bases);
+		kfree(fsi->bases);
+	}
+	kfree(fsi->base_path);
+
+	kfree(fsi);
+}
+
+static int cfs_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+	struct cfs_info *fsi = dentry->d_sb->s_fs_info;
+	int err = 0;
+
+	/* We return the free space, etc from the first base dir. */
+	if (fsi->n_bases > 0) {
+		struct path root = { .mnt = fsi->bases[0],
+				     .dentry = fsi->bases[0]->mnt_root };
+		err = vfs_statfs(&root, buf);
+	}
+
+	if (!err) {
+		buf->f_namelen = NAME_MAX;
+		buf->f_type = dentry->d_sb->s_magic;
+	}
+
+	return err;
+}
+
+static const struct super_operations cfs_ops = {
+	.statfs = cfs_statfs,
+	.drop_inode = generic_delete_inode,
+	.show_options = cfs_show_options,
+	.put_super = cfs_put_super,
+	.destroy_inode = cfs_destroy_inode,
+	.alloc_inode = cfs_alloc_inode,
+	.free_inode = cfs_free_inode,
+};
+
+enum cfs_param {
+	Opt_base_path,
+	Opt_digest,
+	Opt_verity_check,
+};
+
+const struct fs_parameter_spec cfs_parameters[] = {
+	fsparam_string("basedir", Opt_base_path),
+	fsparam_string("digest", Opt_digest),
+	fsparam_u32("verity_check", Opt_verity_check),
+	{}
+};
+
+static int cfs_parse_param(struct fs_context *fc, struct fs_parameter *param)
+{
+	struct cfs_info *fsi = fc->s_fs_info;
+	struct fs_parse_result result;
+	int opt, r;
+
+	opt = fs_parse(fc, cfs_parameters, param, &result);
+	if (opt == -ENOPARAM)
+		return vfs_parse_fs_param_source(fc, param);
+	if (opt < 0)
+		return opt;
+
+	switch (opt) {
+	case Opt_base_path:
+		kfree(fsi->base_path);
+		/* Take ownership.  */
+		fsi->base_path = param->string;
+		param->string = NULL;
+		break;
+	case Opt_digest:
+		r = digest_from_string(param->string, fsi->digest);
+		if (r < 0)
+			return r;
+		fsi->has_digest = true;
+		fsi->verity_check = 2; /* Default to full verity check */
+		break;
+	case Opt_verity_check:
+		if (result.uint_32 > 2)
+			return invalfc(fc, "Invalid verity_check mode");
+		fsi->verity_check = result.uint_32;
+		break;
+	}
+
+	return 0;
+}
+
+static struct vfsmount *resolve_basedir(const char *name)
+{
+	struct path path = {};
+	struct vfsmount *mnt;
+	int err = -EINVAL;
+
+	if (!*name) {
+		pr_err("empty basedir\n");
+		goto out;
+	}
+	err = kern_path(name, LOOKUP_FOLLOW, &path);
+	if (err) {
+		pr_err("failed to resolve '%s': %i\n", name, err);
+		goto out;
+	}
+
+	mnt = clone_private_mount(&path);
+	err = PTR_ERR(mnt);
+	if (IS_ERR(mnt)) {
+		pr_err("failed to clone basedir\n");
+		goto out_put;
+	}
+
+	path_put(&path);
+
+	/* Don't inherit atime flags */
+	mnt->mnt_flags &= ~(MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME);
+
+	return mnt;
+
+out_put:
+	path_put(&path);
+out:
+	return ERR_PTR(err);
+}
+
+static int cfs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+	struct cfs_info *fsi = sb->s_fs_info;
+	struct vfsmount **bases = NULL;
+	size_t numbasedirs = 0;
+	struct inode *inode;
+	struct vfsmount *mnt;
+	int ret;
+
+	if (sb->s_root)
+		return -EINVAL;
+
+	/* Set up the inode allocator early */
+	sb->s_op = &cfs_ops;
+	sb->s_flags |= SB_RDONLY;
+	sb->s_magic = CFS_MAGIC;
+	sb->s_xattr = cfs_xattr_handlers;
+	sb->s_export_op = &cfs_export_operations;
+
+	if (fsi->base_path) {
+		char *lower, *splitlower = NULL;
+
+		ret = -ENOMEM;
+		splitlower = kstrdup(fsi->base_path, GFP_KERNEL);
+		if (!splitlower)
+			goto fail;
+
+		ret = -EINVAL;
+		numbasedirs = cfs_split_basedirs(splitlower);
+		if (numbasedirs > CFS_MAX_STACK) {
+			pr_err("too many lower directories, limit is %d\n",
+			       CFS_MAX_STACK);
+			kfree(splitlower);
+			goto fail;
+		}
+
+		ret = -ENOMEM;
+		bases = kcalloc(numbasedirs, sizeof(struct vfsmount *), GFP_KERNEL);
+		if (!bases) {
+			kfree(splitlower);
+			goto fail;
+		}
+
+		lower = splitlower;
+		for (size_t i = 0; i < numbasedirs; i++) {
+			mnt = resolve_basedir(lower);
+			if (IS_ERR(mnt)) {
+				ret = PTR_ERR(mnt);
+				kfree(splitlower);
+				goto fail;
+			}
+			bases[i] = mnt;
+
+			lower = strchr(lower, '\0') + 1;
+		}
+		kfree(splitlower);
+	}
+
+	/* Must be inited before calling cfs_get_inode.  */
+	ret = cfs_init_ctx(fc->source, fsi->has_digest ? fsi->digest : NULL,
+			   &fsi->cfs_ctx);
+	if (ret < 0)
+		goto fail;
+
+	inode = cfs_get_root_inode(sb);
+	if (IS_ERR(inode)) {
+		ret = PTR_ERR(inode);
+		goto fail;
+	}
+	sb->s_root = d_make_root(inode);
+
+	ret = -ENOMEM;
+	if (!sb->s_root)
+		goto fail;
+
+	sb->s_maxbytes = MAX_LFS_FILESIZE;
+	sb->s_blocksize = PAGE_SIZE;
+	sb->s_blocksize_bits = PAGE_SHIFT;
+
+	sb->s_time_gran = 1;
+
+	fsi->bases = bases;
+	fsi->n_bases = numbasedirs;
+	return 0;
+fail:
+	if (bases) {
+		for (size_t i = 0; i < numbasedirs; i++) {
+			if (bases[i])
+				kern_unmount(bases[i]);
+		}
+		kfree(bases);
+	}
+	cfs_ctx_put(&fsi->cfs_ctx);
+	return ret;
+}
+
+static int cfs_get_tree(struct fs_context *fc)
+{
+	return get_tree_nodev(fc, cfs_fill_super);
+}
+
+static const struct fs_context_operations cfs_context_ops = {
+	.parse_param = cfs_parse_param,
+	.get_tree = cfs_get_tree,
+};
+
+static struct file *open_base_file(struct cfs_info *fsi, struct inode *inode,
+				   struct file *file)
+{
+	struct cfs_inode *cino = CFS_I(inode);
+	struct file *real_file;
+	char *real_path = cino->inode_data.path_payload;
+
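+	/* Try each base dir in order: return the first successful open,
+	 * or the first failure other than -ENOENT.
+	 */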
+	for (size_t i = 0; i < fsi->n_bases; i++) {
+		real_file = file_open_root_mnt(fsi->bases[i], real_path,
+					       file->f_flags, 0);
+		if (!IS_ERR(real_file) || PTR_ERR(real_file) != -ENOENT)
+			return real_file;
+	}
+
+	return ERR_PTR(-ENOENT);
+}
+
+static int cfs_open_file(struct inode *inode, struct file *file)
+{
+	struct cfs_info *fsi = inode->i_sb->s_fs_info;
+	struct cfs_inode *cino = CFS_I(inode);
+	char *real_path = cino->inode_data.path_payload;
+	struct file *faked_file;
+	struct file *real_file;
+
+	if (WARN_ON(!file))
+		return -EIO;
+
+	if (file->f_flags & (O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_TRUNC))
+		return -EROFS;
+
+	if (!real_path) {
+		file->private_data = &empty_file;
+		return 0;
+	}
+
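+	/* With verity_check=2 the image must specify an fs-verity digest
+	 * for every backing file it references; refuse the open otherwise.
+	 */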
+	if (fsi->verity_check >= 2 && !cino->inode_data.has_digest) {
+		pr_warn("WARNING: composefs image file '%pd' specified no fs-verity digest\n",
+			file->f_path.dentry);
+		return -EIO;
+	}
+
+	real_file = open_base_file(fsi, inode, file);
+
+	if (IS_ERR(real_file))
+		return PTR_ERR(real_file);
+
+	/* If metadata records a digest for the file, ensure it is there
+	 * and correct before using the contents.
+	 */
+	if (cino->inode_data.has_digest && fsi->verity_check >= 1) {
+		u8 verity_digest[FS_VERITY_MAX_DIGEST_SIZE];
+		enum hash_algo verity_algo;
+		int res;
+
+		res = fsverity_get_digest(d_inode(real_file->f_path.dentry),
+					  verity_digest, &verity_algo);
+		if (res < 0) {
+			pr_warn("WARNING: composefs backing file '%pd' has no fs-verity digest\n",
+				real_file->f_path.dentry);
+			fput(real_file);
+			return -EIO;
+		}
+		if (verity_algo != HASH_ALGO_SHA256 ||
+		    memcmp(cino->inode_data.digest, verity_digest,
+			   SHA256_DIGEST_SIZE) != 0) {
+			pr_warn("WARNING: composefs backing file '%pd' has the wrong fs-verity digest\n",
+				real_file->f_path.dentry);
+			fput(real_file);
+			return -EIO;
+		}
+	}
+
+	faked_file = open_with_fake_path(&file->f_path, file->f_flags,
+					 real_file->f_inode, current_cred());
+	fput(real_file);
+
+	if (IS_ERR(faked_file))
+		return PTR_ERR(faked_file);
+
+	file->private_data = faked_file;
+	return 0;
+}
+
+#ifdef CONFIG_MMU
+static unsigned long cfs_mmu_get_unmapped_area(struct file *file, unsigned long addr,
+					       unsigned long len, unsigned long pgoff,
+					       unsigned long flags)
+{
+	struct file *realfile = file->private_data;
+
+	if (realfile == &empty_file)
+		return 0;
+
+	return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+}
+#endif
+
+static int cfs_release_file(struct inode *inode, struct file *file)
+{
+	struct file *realfile = file->private_data;
+
+	if (WARN_ON(!realfile))
+		return -EIO;
+
+	if (realfile == &empty_file)
+		return 0;
+
+	fput(file->private_data);
+
+	return 0;
+}
+
+static int cfs_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct file *realfile = file->private_data;
+	int ret;
+
+	if (realfile == &empty_file)
+		return 0;
+
+	if (!realfile->f_op->mmap)
+		return -ENODEV;
+
+	if (WARN_ON(file != vma->vm_file))
+		return -EIO;
+
+	vma_set_file(vma, realfile);
+
+	ret = call_mmap(vma->vm_file, vma);
+
+	return ret;
+}
+
+static ssize_t cfs_read_iter(struct kiocb *iocb, struct iov_iter *iter)
+{
+	struct file *file = iocb->ki_filp;
+	struct file *realfile = file->private_data;
+	int ret;
+
+	if (realfile == &empty_file)
+		return 0;
+
+	if (!realfile->f_op->read_iter)
+		return -ENODEV;
+
+	iocb->ki_filp = realfile;
+	ret = call_read_iter(realfile, iocb, iter);
+	iocb->ki_filp = file;
+
+	return ret;
+}
+
+static int cfs_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
+{
+	struct file *realfile = file->private_data;
+
+	if (realfile == &empty_file)
+		return 0;
+
+	return vfs_fadvise(realfile, offset, len, advice);
+}
+
+static int cfs_encode_fh(struct inode *inode, u32 *fh, int *max_len,
+			 struct inode *parent)
+{
+	u32 generation;
+	int len = 3;
+	u64 nodeid;
+
+	if (*max_len < len) {
+		*max_len = len;
+		return FILEID_INVALID;
+	}
+
+	nodeid = inode->i_ino;
+	generation = inode->i_generation;
+
+	fh[0] = (u32)(nodeid >> 32);
+	fh[1] = (u32)(nodeid & 0xffffffff);
+	fh[2] = generation;
+
+	*max_len = len;
+
+	return FILEID_CFS;
+}
+
+static struct dentry *cfs_fh_to_dentry(struct super_block *sb, struct fid *fid,
+				       int fh_len, int fh_type)
+{
+	struct cfs_info *fsi = sb->s_fs_info;
+	struct inode *ino;
+	u64 inode_index;
+	u32 generation;
+
+	if (fh_type != FILEID_CFS || fh_len < 3)
+		return NULL;
+
+	inode_index = (u64)(fid->raw[0]) << 32;
+	inode_index |= fid->raw[1];
+	generation = fid->raw[2];
+
+	ino = ilookup(sb, inode_index);
+	if (!ino) {
+		struct cfs_inode_s inode_buf;
+		struct cfs_inode_s *inode;
+
+		inode = cfs_get_ino_index(&fsi->cfs_ctx, inode_index, &inode_buf);
+		if (IS_ERR(inode))
+			return ERR_CAST(inode);
+
+		ino = cfs_make_inode(&fsi->cfs_ctx, sb, inode_index, inode, NULL);
+		if (IS_ERR(ino))
+			return ERR_CAST(ino);
+	}
+	if (ino->i_generation != generation) {
+		iput(ino);
+		return ERR_PTR(-ESTALE);
+	}
+	return d_obtain_alias(ino);
+}
+
+static struct dentry *cfs_fh_to_parent(struct super_block *sb, struct fid *fid,
+				       int fh_len, int fh_type)
+{
+	return ERR_PTR(-EACCES);
+}
+
+static int cfs_get_name(struct dentry *parent, char *name, struct dentry *child)
+{
+	WARN_ON_ONCE(1);
+	return -EIO;
+}
+
+static struct dentry *cfs_get_parent(struct dentry *dentry)
+{
+	WARN_ON_ONCE(1);
+	return ERR_PTR(-EIO);
+}
+
+static const struct export_operations cfs_export_operations = {
+	.fh_to_dentry = cfs_fh_to_dentry,
+	.fh_to_parent = cfs_fh_to_parent,
+	.encode_fh = cfs_encode_fh,
+	.get_parent = cfs_get_parent,
+	.get_name = cfs_get_name,
+};
+
+static int cfs_getxattr(const struct xattr_handler *handler,
+			struct dentry *unused2, struct inode *inode,
+			const char *name, void *value, size_t size)
+{
+	struct cfs_info *fsi = inode->i_sb->s_fs_info;
+	struct cfs_inode *cino = CFS_I(inode);
+
+	return cfs_get_xattr(&fsi->cfs_ctx, &cino->inode_data, name, value, size);
+}
+
+static ssize_t cfs_listxattr(struct dentry *dentry, char *names, size_t size)
+{
+	struct inode *inode = d_inode(dentry);
+	struct cfs_info *fsi = inode->i_sb->s_fs_info;
+	struct cfs_inode *cino = CFS_I(inode);
+
+	return cfs_list_xattrs(&fsi->cfs_ctx, &cino->inode_data, names, size);
+}
+
+static const struct file_operations cfs_file_operations = {
+	.read_iter = cfs_read_iter,
+	.mmap = cfs_mmap,
+	.fadvise = cfs_fadvise,
+	.fsync = noop_fsync,
+	.splice_read = generic_file_splice_read,
+	.llseek = generic_file_llseek,
+#ifdef CONFIG_MMU
+	.get_unmapped_area = cfs_mmu_get_unmapped_area,
+#endif
+	.release = cfs_release_file,
+	.open = cfs_open_file,
+};
+
+static const struct xattr_handler cfs_xattr_handler = {
+	.prefix = "", /* catch all */
+	.get = cfs_getxattr,
+};
+
+static const struct xattr_handler *cfs_xattr_handlers[] = {
+	&cfs_xattr_handler,
+	NULL,
+};
+
+static const struct inode_operations cfs_file_inode_operations = {
+	.setattr = simple_setattr,
+	.getattr = simple_getattr,
+
+	.listxattr = cfs_listxattr,
+};
+
+static int cfs_init_fs_context(struct fs_context *fc)
+{
+	struct cfs_info *fsi;
+
+	fsi = kzalloc(sizeof(*fsi), GFP_KERNEL);
+	if (!fsi)
+		return -ENOMEM;
+
+	fc->s_fs_info = fsi;
+	fc->ops = &cfs_context_ops;
+	return 0;
+}
+
+static struct file_system_type cfs_type = {
+	.owner = THIS_MODULE,
+	.name = "composefs",
+	.init_fs_context = cfs_init_fs_context,
+	.parameters = cfs_parameters,
+	.kill_sb = kill_anon_super,
+};
+
+static void cfs_inode_init_once(void *foo)
+{
+	struct cfs_inode *cino = foo;
+
+	inode_init_once(&cino->vfs_inode);
+}
+
+static int __init init_cfs(void)
+{
+	cfs_inode_cachep = kmem_cache_create(
+		"cfs_inode", sizeof(struct cfs_inode), 0,
+		(SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD | SLAB_ACCOUNT),
+		cfs_inode_init_once);
+	if (!cfs_inode_cachep)
+		return -ENOMEM;
+
+	return register_filesystem(&cfs_type);
+}
+
+static void __exit exit_cfs(void)
+{
+	unregister_filesystem(&cfs_type);
+
+	/* Ensure all RCU free inodes are safe to be destroyed. */
+	rcu_barrier();
+
+	kmem_cache_destroy(cfs_inode_cachep);
+}
+
+module_init(init_cfs);
+module_exit(exit_cfs);
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v2 5/6] composefs: Add documentation
  2023-01-13 15:33 [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem Alexander Larsson
                   ` (3 preceding siblings ...)
  2023-01-13 15:33 ` [PATCH v2 4/6] composefs: Add filesystem implementation Alexander Larsson
@ 2023-01-13 15:33 ` Alexander Larsson
  2023-01-14  3:20   ` Bagas Sanjaya
  2023-01-13 15:33 ` [PATCH v2 6/6] composefs: Add kconfig and build support Alexander Larsson
  2023-01-16  4:44 ` [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem Gao Xiang
  6 siblings, 1 reply; 34+ messages in thread
From: Alexander Larsson @ 2023-01-13 15:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, gscrivan, Alexander Larsson, linux-doc

Adds documentation about the composefs filesystem and
how to use it.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
---
 Documentation/filesystems/composefs.rst | 169 ++++++++++++++++++++++++
 Documentation/filesystems/index.rst     |   1 +
 2 files changed, 170 insertions(+)
 create mode 100644 Documentation/filesystems/composefs.rst

diff --git a/Documentation/filesystems/composefs.rst b/Documentation/filesystems/composefs.rst
new file mode 100644
index 000000000000..306f0e2e22ba
--- /dev/null
+++ b/Documentation/filesystems/composefs.rst
@@ -0,0 +1,169 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+Composefs Filesystem
+====================
+
+Introduction
+============
+
+Composefs is a read-only file system that is backed by regular files
+(rather than a block device). It is designed to help easily share
+content between different directory trees, such as container images in
+a local store or ostree checkouts. In addition it also has support for
+integrity validation of file content and directory metadata, in an
+efficient way (using fs-verity).
+
+The filesystem mount source is a binary blob called the descriptor. It
+contains all the inode and directory entry data for the entire
+filesystem. However, instead of storing the file content, each regular
+file inode stores a relative path name, and the filesystem gets the
+file content by looking up that filename in a set of base
+directories.
+
+Given such a descriptor called "image.cfs" and a directory with files
+called "/dir" you can mount it like::
+
+  mount -t composefs image.cfs -o basedir=/dir /mnt
+
+Content sharing
+===============
+
+Suppose you have a single basedir where the files are content
+addressed (i.e. named by content digest), and a set of composefs
+descriptors using this basedir. Any file that happen to be shared
+between two images (same content, so same digest) will now only be
+stored once on the disk.
+
+Such sharing is possible even if the metadata for the file in the
+image differs (common reasons for metadata difference are mtime,
+permissions, xattrs, etc). The sharing is also anonymous in the sense
+that you can't tell the difference on the mounted files from a
+non-shared file (for example by looking at the link count for a
+hardlinked file).
+
+In addition, any shared files that are actively in use will share
+page-cache, because the page cache for the file contents will be
+addressed by the backing file in the basedir. This means (for example)
+that shared libraries common to several images will only be mmapped
+once across all mounts.
+
+Integrity validation
+====================
+
+Composefs uses :doc:`fs-verity <fsverity>` for integrity validation,
+and extends it by making the validation also apply to the directory
+metadata.  This happens on two levels, validation of the descriptor
+and validation of the backing files.
+
+For descriptor validation, the idea is that you enable fs-verity on
+the descriptor file which seals it from changes that would affect the
+directory metadata. Additionally you can pass a `digest` mount option,
+which composefs verifies against the descriptor fs-verity
+measure. Such a mount option could be encoded in a trusted source
+(like a signed kernel command line) and be used as a root of trust if
+using composefs for the root filesystem.
+
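+For illustration, a minimal verified mount might look like this::
+
+  fsverity enable image.cfs
+  mount -t composefs image.cfs -o basedir=/dir,digest=<digest> /mnt
+
+where <digest> stands in for the fs-verity digest of image.cfs, as
+reported by "fsverity digest image.cfs".
+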
+For file validation, the descriptor can contain digest for each
+backing file, and you can enable fs-verity on the backing
+files. Composefs will validate the digest before using the backing
+files. This means any (accidental or malicious) modification of the
+basedir will be detected at the time the file is used.
+
+Expected use-cases
+==================
+
+Container Image Storage
+```````````````````````
+
+Typically a container image is stored as a set of "layer"
+directories. merged into one mount by using overlayfs.  The lower
+layers are read-only image content and the upper layer is the
+writable state of a running container. Multiple uses of the same
+layer can be shared this way, but it is hard to share individual
+files between unrelated layers.
+
+Using composefs, we can instead use a shared, content-addressed
+store for all the images in the system, and use a composefs image
+for the read-only image content of each image, pointing into the
+shared store. Then for a running container we use an overlayfs
+with the lower dir being the composefs and the upper dir being
+the writable state.
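+
+A rough sketch of such a setup (the directory names here are only
+examples) could be::
+
+  mount -t composefs image.cfs -o basedir=/containers/objects /run/img
+  mount -t overlay overlay \
+      -o lowerdir=/run/img,upperdir=/state/upper,workdir=/state/work \
+      /run/container-root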
+
+
+Ostree root filesystem validation
+`````````````````````````````````
+
+Ostree uses a content-addressed on-disk store for file content,
+allowing efficient updates and sharing of content. However to actually
+use these as a root filesystem it needs to create a real
+"chroot-style" directory, containing hard links into the store. The
+store itself is validated when created, but once the hard-link
+directory is created, nothing validates the directory structure of
+that.
+
+Instead of a chroot we can we can use composefs. We create a composefs
+image pointing into the object store, enable fs-verity for everything
+and encode the fs-verity digest of the descriptor in the
+kernel-command line. This will allow booting a trusted system where
+all directory metadata and file content is validated lazily at use.
+
+
+Mount options
+=============
+
+basedir
+    A colon separated list of directories to use as a base when resolving
+    relative content paths.
+
+verity_check=[0,1,2]
+    When to verify backing file fs-verity: 0 == never, 1 == if specified in
+    image, 2 == always and require it in image.
+
+digest
+    A fs-verity sha256 digest that the descriptor file must match. If set,
+    `verity_check` defaults to 2.
+
+
+Filesystem format
+=================
+
+The descriptor contains three sections: header,
+inodes and variable data. All data in the file is stored in
+little-endian form.
+
+The header starts at the beginning of the file and contains version,
+magic value, offsets to the variable data and the root inode nr.
+
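+As a concrete reference, in format version 1 the header corresponds to
+the following structure from the proposed fs/composefs/cfs.h (all data
+is little-endian on disk)::
+
+  struct cfs_header_s {
+          u8 version;
+          u8 unused1;
+          u16 unused2;
+
+          u32 magic;
+          u64 data_offset;
+          u64 root_inode;
+
+          u64 unused3[2];
+  } __packed;
+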
+The inode section starts at a fixed location right after the
+header. It is an array of inode data, where for each inode there is
+first a variable length chunk and then a fixed size chunk. An inode nr
+is the offset in the inode data to the start of the fixed chunk.
+
+The fixed inode chunk starts with a flag that tells what parts of the
+inode are stored in the file (meaning it is only the maximal size that
+is fixed). After that the various inode attributes are serialized in
+order, such as mode, ownership, xattrs, and payload length. The
+payload length attribute gives the size of the variable chunk.
+
+The inode variable chunk contains different things depending on the
+file type.  For regular files it is the backing filename. For symlinks
+it is the symlink target. For directories it is a list of references to
+dentries, stored in chunks of maximum 4k. The dentry chunks themselves
+are stored in the variable data section.
+
+The variable data section is stored after the inode section, and you
+can find it from the offset in the header. It contains dentry and
+xattr data. The xattrs are referred to by offset and size in the
+xattr attribute in the inode data. Each xattr data can be used by many
+inodes in the filesystem. The variable data chunks are all smaller than
+a page (4K) and are padded to not span pages.
+
+Tools
+=====
+
+Tools for composefs can be found at https://github.com/containers/composefs
+
+There is a mkcomposefs tool which can be used to create images on the
+CLI, and a library that applications can use to create composefs
+images.
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index bee63d42e5ec..9b7cf136755d 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -75,6 +75,7 @@ Documentation for filesystem implementations.
    cifs/index
    ceph
    coda
+   composefs
    configfs
    cramfs
    dax
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v2 6/6] composefs: Add kconfig and build support
  2023-01-13 15:33 [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem Alexander Larsson
                   ` (4 preceding siblings ...)
  2023-01-13 15:33 ` [PATCH v2 5/6] composefs: Add documentation Alexander Larsson
@ 2023-01-13 15:33 ` Alexander Larsson
  2023-01-16  4:44 ` [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem Gao Xiang
  6 siblings, 0 replies; 34+ messages in thread
From: Alexander Larsson @ 2023-01-13 15:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, gscrivan, Alexander Larsson

This commit adds Makefile and Kconfig for composefs, and
updates the Makefile and Kconfig files in the fs directory.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
---
 fs/Kconfig            |  1 +
 fs/Makefile           |  1 +
 fs/composefs/Kconfig  | 18 ++++++++++++++++++
 fs/composefs/Makefile |  5 +++++
 4 files changed, 25 insertions(+)
 create mode 100644 fs/composefs/Kconfig
 create mode 100644 fs/composefs/Makefile

diff --git a/fs/Kconfig b/fs/Kconfig
index 2685a4d0d353..de8493fc2b1e 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -127,6 +127,7 @@ source "fs/quota/Kconfig"
 source "fs/autofs/Kconfig"
 source "fs/fuse/Kconfig"
 source "fs/overlayfs/Kconfig"
+source "fs/composefs/Kconfig"
 
 menu "Caches"
 
diff --git a/fs/Makefile b/fs/Makefile
index 4dea17840761..d16974e02468 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -137,3 +137,4 @@ obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
 obj-$(CONFIG_EROFS_FS)		+= erofs/
 obj-$(CONFIG_VBOXSF_FS)		+= vboxsf/
 obj-$(CONFIG_ZONEFS_FS)		+= zonefs/
+obj-$(CONFIG_COMPOSEFS_FS)	+= composefs/
diff --git a/fs/composefs/Kconfig b/fs/composefs/Kconfig
new file mode 100644
index 000000000000..88c5b55380e6
--- /dev/null
+++ b/fs/composefs/Kconfig
@@ -0,0 +1,18 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+config COMPOSEFS_FS
+	tristate "Composefs filesystem support"
+	select EXPORTFS
+	help
+	  Composefs is a filesystem that allows combining file content from
+	  existing regular files with a metadata directory structure from
+	  a separate binary file. This is useful to share file content between
+	  many different directory trees, such as in a local container image store.
+
+	  Composefs also allows using fs-verity to validate the content of the
+	  content files as well as the metadata file, which allows dm-verity-like
+	  validation with the flexibility of regular files.
+
+	  For more information see Documentation/filesystems/composefs.rst
+
+	  If unsure, say N.
diff --git a/fs/composefs/Makefile b/fs/composefs/Makefile
new file mode 100644
index 000000000000..eac8445e7d25
--- /dev/null
+++ b/fs/composefs/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+obj-$(CONFIG_COMPOSEFS_FS) += composefs.o
+
+composefs-objs += cfs-reader.o cfs.o
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 4/6] composefs: Add filesystem implementation
  2023-01-13 15:33 ` [PATCH v2 4/6] composefs: Add filesystem implementation Alexander Larsson
@ 2023-01-13 21:55   ` kernel test robot
  2023-01-16 22:07   ` Al Viro
  1 sibling, 0 replies; 34+ messages in thread
From: kernel test robot @ 2023-01-13 21:55 UTC (permalink / raw)
  To: Alexander Larsson, linux-fsdevel
  Cc: oe-kbuild-all, linux-kernel, gscrivan, Alexander Larsson

Hi Alexander,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on fscrypt/fsverity]
[also build test WARNING on linus/master v6.2-rc3 next-20230113]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Alexander-Larsson/fsverity-Export-fsverity_get_digest/20230113-234920
base:   https://git.kernel.org/pub/scm/fs/fscrypt/fscrypt.git fsverity
patch link:    https://lore.kernel.org/r/ee96ab52b9d2ab58e7b793e34ce5dc956686ada9.1673623253.git.alexl%40redhat.com
patch subject: [PATCH v2 4/6] composefs: Add filesystem implementation
reproduce:
        make versioncheck

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

versioncheck warnings: (new ones prefixed by >>)
   INFO PATH=/opt/cross/clang/bin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
   /usr/bin/timeout -k 100 3h /usr/bin/make W=1 --keep-going HOSTCC=gcc-11 CC=gcc-11 -j32 ARCH=x86_64 versioncheck
   find ./* \( -name SCCS -o -name BitKeeper -o -name .svn -o -name CVS -o -name .pc -o -name .hg -o -name .git \) -prune -o \
   	-name '*.[hcS]' -type f -print | sort \
   	| xargs perl -w ./scripts/checkversion.pl
   ./drivers/accessibility/speakup/genmap.c: 13 linux/version.h not needed.
   ./drivers/accessibility/speakup/makemapdata.c: 13 linux/version.h not needed.
   ./drivers/net/ethernet/qlogic/qede/qede.h: 10 linux/version.h not needed.
   ./drivers/net/ethernet/qlogic/qede/qede_ethtool.c: 7 linux/version.h not needed.
   ./drivers/scsi/cxgbi/libcxgbi.h: 27 linux/version.h not needed.
   ./drivers/scsi/mpi3mr/mpi3mr.h: 32 linux/version.h not needed.
   ./drivers/scsi/qedi/qedi_dbg.h: 14 linux/version.h not needed.
   ./drivers/soc/tegra/cbb/tegra-cbb.c: 19 linux/version.h not needed.
   ./drivers/soc/tegra/cbb/tegra194-cbb.c: 26 linux/version.h not needed.
   ./drivers/soc/tegra/cbb/tegra234-cbb.c: 27 linux/version.h not needed.
   ./drivers/staging/media/atomisp/include/linux/atomisp.h: 23 linux/version.h not needed.
>> ./fs/composefs/cfs.c: 19 linux/version.h not needed.
   ./init/version-timestamp.c: 5 linux/version.h not needed.
   ./samples/trace_events/trace_custom_sched.c: 11 linux/version.h not needed.
   ./sound/soc/codecs/cs42l42.c: 14 linux/version.h not needed.
   ./tools/lib/bpf/bpf_helpers.h: 289: need linux/version.h
   ./tools/perf/tests/bpf-script-example.c: 60: need linux/version.h
   ./tools/perf/tests/bpf-script-test-kbuild.c: 21: need linux/version.h
   ./tools/perf/tests/bpf-script-test-prologue.c: 47: need linux/version.h
   ./tools/perf/tests/bpf-script-test-relocation.c: 51: need linux/version.h
   ./tools/testing/selftests/bpf/progs/dev_cgroup.c: 9 linux/version.h not needed.
   ./tools/testing/selftests/bpf/progs/netcnt_prog.c: 3 linux/version.h not needed.
   ./tools/testing/selftests/bpf/progs/test_map_lock.c: 4 linux/version.h not needed.
   ./tools/testing/selftests/bpf/progs/test_send_signal_kern.c: 4 linux/version.h not needed.
   ./tools/testing/selftests/bpf/progs/test_spin_lock.c: 4 linux/version.h not needed.
   ./tools/testing/selftests/bpf/progs/test_tcp_estats.c: 37 linux/version.h not needed.
   ./tools/testing/selftests/wireguard/qemu/init.c: 27 linux/version.h not needed.

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 5/6] composefs: Add documentation
  2023-01-13 15:33 ` [PATCH v2 5/6] composefs: Add documentation Alexander Larsson
@ 2023-01-14  3:20   ` Bagas Sanjaya
  2023-01-16 12:38     ` Alexander Larsson
  0 siblings, 1 reply; 34+ messages in thread
From: Bagas Sanjaya @ 2023-01-14  3:20 UTC (permalink / raw)
  To: Alexander Larsson, linux-fsdevel; +Cc: linux-kernel, gscrivan, linux-doc

[-- Attachment #1: Type: text/plain, Size: 14916 bytes --]

On Fri, Jan 13, 2023 at 04:33:58PM +0100, Alexander Larsson wrote:
> Adds documentation about the composefs filesystem and
> how to use it.

s/Adds documentation/Add documentation/

> diff --git a/Documentation/filesystems/composefs.rst b/Documentation/filesystems/composefs.rst
> new file mode 100644
> index 000000000000..306f0e2e22ba
> --- /dev/null
> +++ b/Documentation/filesystems/composefs.rst
> @@ -0,0 +1,169 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +====================
> +Composefs Filesystem
> +====================
> +
> +Introduction
> +============
> +
> +Composefs is a read-only file system that is backed by regular files
> +(rather than a block device). It is designed to help easily share
> +content between different directory trees, such as container images in
> +a local store or ostree checkouts. In addition it also has support for
> +integrity validation of file content and directory metadata, in an
> +efficient way (using fs-verity).
> +
> +The filesystem mount source is a binary blob called the descriptor. It
> +contains all the inode and directory entry data for the entire
> +filesystem. However, instead of storing the file content, each regular
> +file inode stores a relative path name, and the filesystem gets the
> +file content by looking up that filename in a set of base
> +directories.
> +
> +Given such a descriptor called "image.cfs" and a directory with files
> +called "/dir" you can mount it like::
> +
> +  mount -t composefs image.cfs -o basedir=/dir /mnt
> +
> +Content sharing
> +===============
> +
> +Suppose you have a single basedir where the files are content
> +addressed (i.e. named by content digest), and a set of composefs
> +descriptors using this basedir. Any file that happens to be shared
> +between two images (same content, so same digest) will now only be
> +stored once on the disk.
> +
> +Such sharing is possible even if the metadata for the file in the
> +image differs (common reasons for metadata difference are mtime,
> +permissions, xattrs, etc). The sharing is also anonymous in the sense
> +that you can't tell the difference on the mounted files from a
> +non-shared file (for example by looking at the link count for a
> +hardlinked file).
> +
> +In addition, any shared files that are actively in use will share
> +page-cache, because the page cache for the file contents will be
> +addressed by the backing file in the basedir. This means (for example)
> +that shared libraries common to several images will only be mmapped
> +once across all mounts.
> +
> +Integrity validation
> +====================
> +
> +Composefs uses :doc:`fs-verity <fsverity>` for integrity validation,
> +and extends it by making the validation also apply to the directory
> +metadata.  This happens on two levels, validation of the descriptor
> +and validation of the backing files.
> +
> +For descriptor validation, the idea is that you enable fs-verity on
> +the descriptor file which seals it from changes that would affect the
> +directory metadata. Additionally you can pass a `digest` mount option,
> +which composefs verifies against the descriptor fs-verity
> +measure. Such a mount option could be encoded in a trusted source
> +(like a signed kernel command line) and be used as a root of trust if
> +using composefs for the root filesystem.

Quote mount option names (like other keywords for consistency):

---- >8 ----
diff --git a/Documentation/filesystems/composefs.rst b/Documentation/filesystems/composefs.rst
index c96f9b99d72979..cc65945e3d5302 100644
--- a/Documentation/filesystems/composefs.rst
+++ b/Documentation/filesystems/composefs.rst
@@ -58,7 +58,7 @@ and validation of the backing files.
 
 For descriptor validation, the idea is that you enable fs-verity on
 the descriptor file which seals it from changes that would affect the
-directory metadata. Additionally you can pass a `digest` mount option,
+directory metadata. Additionally you can pass a "digest" mount option,
 which composefs verifies against the descriptor fs-verity
 measure. Such a mount option could be encoded in a trusted source
 (like a signed kernel command line) and be used as a root of trust if
@@ -125,7 +125,7 @@ verity_check=[0,1,2]
 
 digest
     A fs-verity sha256 digest that the descriptor file must match. If set,
-    `verity_check` defaults to 2.
+    "verity_check" defaults to 2.
 
 
 Filesystem format

> +
> +For file validation, the descriptor can contain digest for each
> +backing file, and you can enable fs-verity on the backing
> +files. Composefs will validate the digest before using the backing
> +files. This means any (accidental or malicious) modification of the
> +basedir will be detected at the time the file is used.
> +
> +Expected use-cases
> +==================
> +
> +Container Image Storage
> +```````````````````````
> +
> +Typically a container image is stored as a set of "layer"
> +directories. merged into one mount by using overlayfs.  The lower
> +layers are read-only image content and the upper layer is the
> +writable state of a running container. Multiple uses of the same
> +layer can be shared this way, but it is hard to share individual
> +files between unrelated layers.
> +
> +Using composefs, we can instead use a shared, content-addressed
> +store for all the images in the system, and use a composefs image
> +for the read-only image content of each image, pointing into the
> +shared store. Then for a running container we use an overlayfs
> +with the lower dir being the composefs and the upper dir being
> +the writable state.
> +
> +
> +Ostree root filesystem validation
> +`````````````````````````````````
> +
> +Ostree uses a content-addressed on-disk store for file content,
> +allowing efficient updates and sharing of content. However to actually
> +use these as a root filesystem it needs to create a real
> +"chroot-style" directory, containing hard links into the store. The
> +store itself is validated when created, but once the hard-link
> +directory is created, nothing validates the directory structure of
> +that.
> +
> +Instead of a chroot we can we can use composefs. We create a composefs
> +image pointing into the object store, enable fs-verity for everything
> +and encode the fs-verity digest of the descriptor in the
> +kernel-command line. This will allow booting a trusted system where
> +all directory metadata and file content is validated lazily at use.
> +
> +
> +Mount options
> +=============
> +
> +basedir
> +    A colon separated list of directories to use as a base when resolving
> +    relative content paths.
> +
> +verity_check=[0,1,2]
> +    When to verify backing file fs-verity: 0 == never, 1 == if specified in
> +    image, 2 == always and require it in image.

I think bullet lists should do the job for verity_check values:

---- >8 ----
diff --git a/Documentation/filesystems/composefs.rst b/Documentation/filesystems/composefs.rst
index 306f0e2e22baf5..c96f9b99d72979 100644
--- a/Documentation/filesystems/composefs.rst
+++ b/Documentation/filesystems/composefs.rst
@@ -117,8 +117,11 @@ basedir
     relative content paths.
 
 verity_check=[0,1,2]
-    When to verify backing file fs-verity: 0 == never, 1 == if specified in
-    image, 2 == always and require it in image.
+    When to verify backing file fs-verity:
+
+    * 0: never
+    * 1: if specified in image
+    * 2: always and require it in image.
 
 digest
     A fs-verity sha256 digest that the descriptor file must match. If set,

> +
> +digest
> +    A fs-verity sha256 digest that the descriptor file must match. If set,
> +    `verity_check` defaults to 2.
> +
> +
> +Filesystem format
> +=================
> +
> +The descriptor contains three sections: header,
> +inodes and variable data. All data in the file is stored in
> +little-endian form.
> +
> +The header starts at the beginning of the file and contains version,
> +magic value, offsets to the variable data and the root inode nr.
> +
> +The inode section starts at a fixed location right after the
> +header. It is an array of inode data, where for each inode there is
> +first a variable length chunk and then a fixed size chunk. An inode nr
> +is the offset in the inode data to the start of the fixed chunk.
> +
> +The fixed inode chunk starts with a flag that tells what parts of the
> +inode are stored in the file (meaning it is only the maximal size that
> +is fixed). After that the various inode attributes are serialized in
> +order, such as mode, ownership, xattrs, and payload length. The
> +payload length attribute gives the size of the variable chunk.
> +
> +The inode variable chunk contains different things depending on the
> +file type.  For regular files it is the backing filename. For symlinks
> +it is the symlink target. For directories it is a list of references to
> +dentries, stored in chunks of maximum 4k. The dentry chunks themselves
> +are stored in the variable data section.
> +
> +The variable data section is stored after the inode section, and you
> +can find it from the offset in the header. It contains dentry and
> +xattr data. The xattrs are referred to by offset and size in the
> +xattr attribute in the inode data. Each xattr data can be used by many
> +inodes in the filesystem. The variable data chunks are all smaller than
> +a page (4K) and are padded to not span pages.
> +
> +Tools
> +=====
> +
> +Tools for composefs can be found at https://github.com/containers/composefs
> +
> +There is a mkcomposefs tool which can be used to create images on the
> +CLI, and a library that applications can use to create composefs
> +images.

The rest can be slightly reworded:

---- >8 ----
diff --git a/Documentation/filesystems/composefs.rst b/Documentation/filesystems/composefs.rst
index cc65945e3d5302..9bd5a6f4e5d676 100644
--- a/Documentation/filesystems/composefs.rst
+++ b/Documentation/filesystems/composefs.rst
@@ -59,16 +59,16 @@ and validation of the backing files.
 For descriptor validation, the idea is that you enable fs-verity on
 the descriptor file which seals it from changes that would affect the
 directory metadata. Additionally you can pass a "digest" mount option,
-which composefs verifies against the descriptor fs-verity
-measure. Such a mount option could be encoded in a trusted source
-(like a signed kernel command line) and be used as a root of trust if
-using composefs for the root filesystem.
+which composefs verifies against the descriptor fs-verity measure. Such
+an option could be embedded in a trusted source (like a signed kernel
+command line) and be used as a root of trust if using composefs for the
+root filesystem.
 
 For file validation, the descriptor can contain digest for each
-backing file, and you can enable fs-verity on the backing
-files. Composefs will validate the digest before using the backing
-files. This means any (accidental or malicious) modification of the
-basedir will be detected at the time the file is used.
+backing file, and you can enable fs-verity on them too. Composefs will
+validate the digest before using the backing files. This means any
+(accidental or malicious) modification of the basedir will be detected
+at the time the file is used.
 
 Expected use-cases
 ==================
@@ -76,19 +76,18 @@ Expected use-cases
 Container Image Storage
 ```````````````````````
 
-Typically a container image is stored as a set of "layer"
-directories. merged into one mount by using overlayfs.  The lower
-layers are read-only image content and the upper layer is the
-writable state of a running container. Multiple uses of the same
-layer can be shared this way, but it is hard to share individual
-files between unrelated layers.
+Typically a container image is stored as a set of "layer" directories,
+merged into one mount by using overlayfs.  The lower layers are
+read-only image and the upper layer is the writable directory of a
+running container. Multiple uses of the same layer can be shared this
+way, but it is hard to share individual files between unrelated layers.
 
 Using composefs, we can instead use a shared, content-addressed
-store for all the images in the system, and use a composefs image
-for the read-only image content of each image, pointing into the
+store for all the images in the system, and use composefs
+for the read-only image of each container, pointing into the
 shared store. Then for a running container we use an overlayfs
 with the lower dir being the composefs and the upper dir being
-the writable state.
+the writable directory.
 
 
 Ostree root filesystem validation
@@ -99,12 +98,12 @@ allowing efficient updates and sharing of content. However to actually
 use these as a root filesystem it needs to create a real
 "chroot-style" directory, containing hard links into the store. The
 store itself is validated when created, but once the hard-link
-directory is created, nothing validates the directory structure of
-that.
+directory is created, the directory structure is impossible to
+verify.
 
-Instead of a chroot we can we can use composefs. We create a composefs
-image pointing into the object store, enable fs-verity for everything
-and encode the fs-verity digest of the descriptor in the
+Instead of a chroot we can use composefs. The composefs image pointing
+to the object store is created, then fs-verity is enabled for
+everything and the descriptor digest is encoded in the
 kernel-command line. This will allow booting a trusted system where
 all directory metadata and file content is validated lazily at use.
 
@@ -119,9 +118,9 @@ basedir
 verity_check=[0,1,2]
     When to verify backing file fs-verity:
 
-    * 0: never
-    * 1: if specified in image
-    * 2: always and require it in image.
+    * 0: never verify
+    * 1: if the digest is specified in the image
+    * 2: always verify the image (and requires verification).
 
 digest
     A fs-verity sha256 digest that the descriptor file must match. If set,
@@ -147,7 +146,7 @@ The fixed inode chunk starts with a flag that tells what parts of the
 inode are stored in the file (meaning it is only the maximal size that
 is fixed). After that the various inode attributes are serialized in
 order, such as mode, ownership, xattrs, and payload length. The
-payload length attribute gives the size of the variable chunk.
+latter attribute gives the size of the variable chunk.
 
 The inode variable chunk contains different things depending on the
 file type.  For regular files it is the backing filename. For symlinks
 
Thanks.

-- 
An old man doll... just what I always wanted! - Clara

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 2/6] composefs: Add on-disk layout
  2023-01-13 15:33 ` [PATCH v2 2/6] composefs: Add on-disk layout Alexander Larsson
@ 2023-01-16  1:29   ` Dave Chinner
  2023-01-16 11:00     ` Alexander Larsson
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2023-01-16  1:29 UTC (permalink / raw)
  To: Alexander Larsson; +Cc: linux-fsdevel, linux-kernel, gscrivan

On Fri, Jan 13, 2023 at 04:33:55PM +0100, Alexander Larsson wrote:
> This commit adds the on-disk layout header file of composefs.

This isn't really a useful commit message.

Perhaps it should actually explain what the overall goals of the
on-disk format are - space usage, complexity trade-offs, potential
issues with validation of variable payload sections, etc.

> 
> Signed-off-by: Alexander Larsson <alexl@redhat.com>
> Co-developed-by: Giuseppe Scrivano <gscrivan@redhat.com>
> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
> ---
>  fs/composefs/cfs.h | 203 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 203 insertions(+)
>  create mode 100644 fs/composefs/cfs.h
> 
> diff --git a/fs/composefs/cfs.h b/fs/composefs/cfs.h
> new file mode 100644
> index 000000000000..658df728e366
> --- /dev/null
> +++ b/fs/composefs/cfs.h
> @@ -0,0 +1,203 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * composefs
> + *
> + * Copyright (C) 2021 Giuseppe Scrivano
> + * Copyright (C) 2022 Alexander Larsson
> + *
> + * This file is released under the GPL.
> + */
> +
> +#ifndef _CFS_H
> +#define _CFS_H
> +
> +#include <asm/byteorder.h>
> +#include <crypto/sha2.h>
> +#include <linux/fs.h>
> +#include <linux/stat.h>
> +#include <linux/types.h>
> +
> +#define CFS_VERSION 1

This should start with a description of the on-disk format for the
version 1 format.


> +
> +#define CFS_MAGIC 0xc078629aU
> +
> +#define CFS_MAX_DIR_CHUNK_SIZE 4096
> +#define CFS_MAX_XATTRS_SIZE 4096

How do we store 64kB xattrs in this format if the max attr size is
4096 bytes? Or is that the maximum total xattr storage?

A comment telling us what these limits are would be nice.

> +
> +static inline int cfs_digest_from_payload(const char *payload, size_t payload_len,
> +					  u8 digest_out[SHA256_DIGEST_SIZE])
> +{
> +	const char *p, *end;
> +	u8 last_digit = 0;
> +	int digit = 0;
> +	size_t n_nibbles = 0;
> +
> +	/* This handles payloads (i.e. path names) that are "essentially" a
> +	 * digest as the digest (if the DIGEST_FROM_PAYLOAD flag is set). The
> +	 * "essential" part means that we ignore hierarchical structure as well
> +	 * as any extension. So, for example "ef/deadbeef.file" would match the
> +	 * (too short) digest "efdeadbeef".
> +	 *
> +	 * This allows images to avoid storing both the digest and the pathname,
> +	 * yet work with pre-existing object store formats of various kinds.
> +	 */
> +
> +	end = payload + payload_len;
> +	for (p = payload; p != end; p++) {
> +		/* Skip subdir structure */
> +		if (*p == '/')
> +			continue;
> +
> +		/* Break at (and ignore) extension */
> +		if (*p == '.')
> +			break;
> +
> +		if (n_nibbles == SHA256_DIGEST_SIZE * 2)
> +			return -EINVAL; /* Too long */
> +
> +		digit = hex_to_bin(*p);
> +		if (digit == -1)
> +			return -EINVAL; /* Not hex digit */
> +
> +		n_nibbles++;
> +		if ((n_nibbles % 2) == 0)
> +			digest_out[n_nibbles / 2 - 1] = (last_digit << 4) | digit;
> +		last_digit = digit;
> +	}
> +
> +	if (n_nibbles != SHA256_DIGEST_SIZE * 2)
> +		return -EINVAL; /* Too short */
> +
> +	return 0;
> +}

Too big to be a inline function.

> +
> +struct cfs_vdata_s {

Drop the "_s" suffix to indicate the type is a structure - that's
waht "struct" tells us.

> +	u64 off;
> +	u32 len;

If these are on-disk format structures, why aren't the defined as
using the specific endian they are encoded in? i.e. __le64, __le32,
etc? Otherwise a file built on a big endian machine won't be
readable on a little endian machine (and vice versa).

> +} __packed;
> +
> +struct cfs_header_s {
> +	u8 version;
> +	u8 unused1;
> +	u16 unused2;

Why are you hyper-optimising these structures for minimal space
usage? This is 2023 - we can use a __le32 for the version number,
the magic number and then leave....
> +
> +	u32 magic;
> +	u64 data_offset;
> +	u64 root_inode;
> +
> +	u64 unused3[2];

a whole heap of space to round it up to at least a CPU cacheline
size using something like "__le64 unused[15]".

That way we don't need packed structures nor do we care about having
weird little holes in the structures to fill....

> +} __packed;
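
Something like this, say (just a sketch to illustrate the idea, not a
concrete proposal for the exact layout):

	struct cfs_header {
		__le32	magic;
		__le32	version;
		__le64	data_offset;
		__le64	root_inode;
		__le64	unused[15];	/* rounds the header up past a cacheline */
	};
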
> +
> +enum cfs_inode_flags {
> +	CFS_INODE_FLAGS_NONE = 0,
> +	CFS_INODE_FLAGS_PAYLOAD = 1 << 0,
> +	CFS_INODE_FLAGS_MODE = 1 << 1,
> +	CFS_INODE_FLAGS_NLINK = 1 << 2,
> +	CFS_INODE_FLAGS_UIDGID = 1 << 3,
> +	CFS_INODE_FLAGS_RDEV = 1 << 4,
> +	CFS_INODE_FLAGS_TIMES = 1 << 5,
> +	CFS_INODE_FLAGS_TIMES_NSEC = 1 << 6,
> +	CFS_INODE_FLAGS_LOW_SIZE = 1 << 7, /* Low 32bit of st_size */
> +	CFS_INODE_FLAGS_HIGH_SIZE = 1 << 8, /* High 32bit of st_size */

Why do we need to complicate things by splitting the inode size
like this?

> +	CFS_INODE_FLAGS_XATTRS = 1 << 9,
> +	CFS_INODE_FLAGS_DIGEST = 1 << 10, /* fs-verity sha256 digest */
> +	CFS_INODE_FLAGS_DIGEST_FROM_PAYLOAD = 1 << 11, /* Compute digest from payload */
> +};
> +
> +#define CFS_INODE_FLAG_CHECK(_flag, _name)                                     \
> +	(((_flag) & (CFS_INODE_FLAGS_##_name)) != 0)

Check what about a flag? If this is a "check that a feature is set",
then open coding it is better, but if you must do it like this, then
please use static inline functions like:

	if (cfs_inode_has_xattrs(inode->flags)) {
		.....
	}

> +#define CFS_INODE_FLAG_CHECK_SIZE(_flag, _name, _size)                         \
> +	(CFS_INODE_FLAG_CHECK(_flag, _name) ? (_size) : 0)

This doesn't seem particularly useful, because you've still got to
test is the return value is valid. i.e.

	size = CFS_INODE_FLAG_CHECK_SIZE(inode->flags, XATTRS, 32);
	if (size == 32) {
		/* got xattrs, decode! */
	}

vs
	if (cfs_inode_has_xattrs(inode->flags)) {
		/* decode! */
	}



> +
> +#define CFS_INODE_DEFAULT_MODE 0100644
> +#define CFS_INODE_DEFAULT_NLINK 1
> +#define CFS_INODE_DEFAULT_NLINK_DIR 2
> +#define CFS_INODE_DEFAULT_UIDGID 0
> +#define CFS_INODE_DEFAULT_RDEV 0
> +#define CFS_INODE_DEFAULT_TIMES 0

Where do these get used? Are they on disk defaults or something
else? (comment, please!)

> +struct cfs_inode_s {
> +	u32 flags;
> +
> +	/* Optional data: (selected by flags) */

Why would you make them optional given that all the fields are still
defined in the structure?

It's much simpler just to decode the entire structure into memory
than to have to check each flag value to determine if a field needs
to be decoded...

> +	/* This is the size of the type specific data that comes directly after
> +	 * the inode in the file. Of this type:
> +	 *
> +	 * directory: cfs_dir_s
> +	 * regular file: the backing filename
> +	 * symlink: the target link
> +	 *
> +	 * Canonically payload_length is 0 for empty dir/file/symlink.
> +	 */
> +	u32 payload_length;

How do you have an empty symlink?

> +	u32 st_mode; /* File type and mode.  */
> +	u32 st_nlink; /* Number of hard links, only for regular files.  */
> +	u32 st_uid; /* User ID of owner.  */
> +	u32 st_gid; /* Group ID of owner.  */
> +	u32 st_rdev; /* Device ID (if special file).  */
> +	u64 st_size; /* Size of file, only used for regular files */
> +
> +	struct cfs_vdata_s xattrs; /* ref to variable data */

This is in the payload that follows the inode?  Is it included in
the payload_length above?

If not, where is this stuff located, how do we validate it points to
the correct place in the on-disk format file, the xattrs belong to
this specific inode, etc? I think that's kinda important to
describe, because xattrs often contain important security
information...


> +
> +	u8 digest[SHA256_DIGEST_SIZE]; /* fs-verity digest */

Why would you have this in the on-disk structure, then also have
"digest from payload" that allows the digest to be in the payload
section of the inode data?

> +
> +	struct timespec64 st_mtim; /* Time of last modification.  */
> +	struct timespec64 st_ctim; /* Time of last status change.  */
> +};

This really feels like an in-memory format inode, not an on-disk
format inode, because this:

> +
> +static inline u32 cfs_inode_encoded_size(u32 flags)
> +{
> +	return sizeof(u32) /* flags */ +
> +	       CFS_INODE_FLAG_CHECK_SIZE(flags, PAYLOAD, sizeof(u32)) +
> +	       CFS_INODE_FLAG_CHECK_SIZE(flags, MODE, sizeof(u32)) +
> +	       CFS_INODE_FLAG_CHECK_SIZE(flags, NLINK, sizeof(u32)) +
> +	       CFS_INODE_FLAG_CHECK_SIZE(flags, UIDGID, sizeof(u32) + sizeof(u32)) +
> +	       CFS_INODE_FLAG_CHECK_SIZE(flags, RDEV, sizeof(u32)) +
> +	       CFS_INODE_FLAG_CHECK_SIZE(flags, TIMES, sizeof(u64) * 2) +
> +	       CFS_INODE_FLAG_CHECK_SIZE(flags, TIMES_NSEC, sizeof(u32) * 2) +
> +	       CFS_INODE_FLAG_CHECK_SIZE(flags, LOW_SIZE, sizeof(u32)) +
> +	       CFS_INODE_FLAG_CHECK_SIZE(flags, HIGH_SIZE, sizeof(u32)) +
> +	       CFS_INODE_FLAG_CHECK_SIZE(flags, XATTRS, sizeof(u64) + sizeof(u32)) +
> +	       CFS_INODE_FLAG_CHECK_SIZE(flags, DIGEST, SHA256_DIGEST_SIZE);
> +}

looks like the on-disk format is an encoded format hyper-optimised
for minimal storage space usage?

Without comments to explain it, I'm not exactly sure what is stored
in the on-disk format inodes, nor the layout of the variable
payload section or how payload sections are defined and verified.

Seems overly complex to me - it's far simpler just to have a fixed
inode structure and just decode it directly into the in-memory
structure when it is read....

> +struct cfs_dentry_s {
> +	/* Index of struct cfs_inode_s */

Not a useful (or correct!) comment :/

Also, the typical term for this on disk structure in a filesystem is
a "dirent", and this is also what readdir() returns to userspace.
dentry is typically used internally in the kernel to refer to the
VFS cache layer objects, not the filesystem dirents the VFS layers
look up to populate it's dentry cache.

> +	u64 inode_index;
> +	u8 d_type;
> +	u8 name_len;
> +	u16 name_offset;

What's this name_offset refer to? 

> +} __packed;
> +
> +struct cfs_dir_chunk_s {
> +	u16 n_dentries;
> +	u16 chunk_size;
> +	u64 chunk_offset;

What's this chunk offset refer to?

> +} __packed;
> +
> +struct cfs_dir_s {
> +	u32 n_chunks;
> +	struct cfs_dir_chunk_s chunks[];
> +} __packed;

So directory data is packed in discrete chunks? Given that this is a
static directory format, and the size of the directory is known at
image creation time, why does the storage need to be chunked?

> +
> +#define cfs_dir_size(_n_chunks)                                                \
> +	(sizeof(struct cfs_dir_s) + (_n_chunks) * sizeof(struct cfs_dir_chunk_s))

static inline, at least.

Also, this appears to be the size of the encoded directory
header, not the size of the directory itself. cfs_dir_header_size(),
perhaps, to match the cfs_xattr_header_size() function that does the
same thing?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem
  2023-01-13 15:33 [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem Alexander Larsson
                   ` (5 preceding siblings ...)
  2023-01-13 15:33 ` [PATCH v2 6/6] composefs: Add kconfig and build support Alexander Larsson
@ 2023-01-16  4:44 ` Gao Xiang
  2023-01-16  9:30   ` Alexander Larsson
  6 siblings, 1 reply; 34+ messages in thread
From: Gao Xiang @ 2023-01-16  4:44 UTC (permalink / raw)
  To: Alexander Larsson, linux-fsdevel; +Cc: linux-kernel, gscrivan

Hi Alexander and folks,

On 2023/1/13 23:33, Alexander Larsson wrote:
> Giuseppe Scrivano and I have recently been working on a new project we
> call composefs. This is the first time we propose this publically and
> we would like some feedback on it.
> 
> At its core, composefs is a way to construct and use read only images
> that are used similar to how you would use e.g. loop-back mounted
> squashfs images. On top of this composefs has two fundamental
> features. First it allows sharing of file data (both on disk and in
> page cache) between images, and secondly it has dm-verity like
> validation on read.
> 
> Let me first start with a minimal example of how this can be used,
> before going into the details:
> 
> Suppose we have this source for an image:
> 
> rootfs/
> ├── dir
> │   └── another_a
> ├── file_a
> └── file_b
> 
> We can then use this to generate an image file and a set of
> content-addressed backing files:
> 
> # mkcomposefs --digest-store=objects rootfs/ rootfs.img
> # ls -l rootfs.img objects/*/*
> -rw-------. 1 root root   10 Nov 18 13:20 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4
> -rw-------. 1 root root   10 Nov 18 13:20 objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
> -rw-r--r--. 1 root root 4228 Nov 18 13:20 rootfs.img
> 
> The rootfs.img file contains all information about directory and file
> metadata plus references to the backing files by name. We can now
> mount this and look at the result:
> 
> # mount -t composefs rootfs.img -o basedir=objects /mnt
> # ls  /mnt/
> dir  file_a  file_b
> # cat /mnt/file_a
> content_a
> 
> When reading this file the kernel is actually reading the backing
> file, in a fashion similar to overlayfs. Since the backing file is
> content-addressed, the objects directory can be shared for multiple
> images, and any files that happen to have the same content are
> shared. I refer to this as opportunistic sharing, as it is different
> than the more course-grained explicit sharing used by e.g. container
> base images.
> 


I'd like to say sorry about my comments in the LWN.net article.  If it
helps the community, my own concern about this new overlay model (which
is different from overlayfs, since overlayfs doesn't present different
permissions from the original files) was somewhat of a security issue
(as I told Giuseppe Scrivano before, when he initially found me on
Slack):

As the composefs on-disk format shows:

struct cfs_inode_s {

         ...

	u32 st_mode; /* File type and mode.  */
	u32 st_nlink; /* Number of hard links, only for regular files.  */
	u32 st_uid; /* User ID of owner.  */
	u32 st_gid; /* Group ID of owner.  */

         ...
};

It seems Composefs can override the uid / gid and mode bits of the
original file.

    considering a rootfs image:
      ├── /bin
      │   └── su

/bin/su has the SUID bit set in the Composefs inode metadata, but I
didn't find anything that would prevent the ostree object "objects/abc"
from actually being replaced with the data of /bin/sh if the composefs
fsverity feature is disabled (it doesn't seem that composefs enforces
enabling fsverity, according to the documentation).

I think that could lead to a _privilege escalation attack_ if such a
SUID file is replaced with some root shell.  Administrators cannot keep
watch over these SUID files all the time, because such files can also
be replaced at runtime.

Composefs may assume that such a content-addressed directory is always
managed by ostree.  But considering it could later be an upstream fs, I
think we cannot always tell people "no, don't use it this way, it
doesn't work" if people use Composefs with an untrusted repo (maybe
even without ostree).

That was my own concern at the time when Giuseppe Scrivano asked me
to enhance EROFS in this way, and I requested that he discuss this on
the fsdevel mailing list in order to resolve it, but that didn't
happen.

Otherwise, EROFS could face such an issue as well, which is why I think
it needs to be discussed first.


> The next step is the validation. Note how the object files have
> fs-verity enabled. In fact, they are named by their fs-verity digest:
> 
> # fsverity digest objects/*/*
> sha256:02927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4
> sha256:cc3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
> 
> The generated filesystm image may contain the expected digest for the
> backing files. When the backing file digest is incorrect, the open
> will fail, and if the open succeeds, any other on-disk file-changes
> will be detected by fs-verity:
> 
> # cat objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
> content_a
> # rm -f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
> # echo modified > objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
> # cat /mnt/file_a
> WARNING: composefs backing file '3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f' unexpectedly had no fs-verity digest
> cat: /mnt/file_a: Input/output error
> 
> This re-uses the existing fs-verity functionality to protect against
> changes in file contents, while adding on top of it protection against
> changes in filesystem metadata and structure. I.e. protecting against
> replacing a fs-verity enabled file or modifying file permissions or
> xattrs.
> 
> To be fully verified we need another step: we use fs-verity on the
> image itself. Then we pass the expected digest on the mount command
> line (which will be verified at mount time):
> 
> # fsverity enable rootfs.img
> # fsverity digest rootfs.img
> sha256:da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 rootfs.img
> # mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt
> 


It seems that Composefs uses fsverity_get_digest() to do the fs-verity
check.  If Composefs uses a symlink-like payload to redirect a file to
another underlying file, that underlying file can live on any other
filesystem.

I can see how Composefs could work with ext4, btrfs, f2fs, and later
XFS, but I'm not sure how it could work with overlayfs, FUSE, or other
network filesystems.  That could limit the use cases as well.

Apart from the above, I think EROFS could implement this in about
300~500 new lines of code (as Giuseppe mentioned when he contacted
me), and so could squashfs or overlayfs.

I'm very happy to implement such a model if it can be shown to be safe
(I'd also like to say here that by no means do I dislike ostree), and
I'm also glad if folks want to introduce a new filesystem for this, as
long as this overlay model is proven safe.

Hopefully it helps.

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem
  2023-01-16  4:44 ` [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem Gao Xiang
@ 2023-01-16  9:30   ` Alexander Larsson
  2023-01-16 10:19     ` Gao Xiang
  0 siblings, 1 reply; 34+ messages in thread
From: Alexander Larsson @ 2023-01-16  9:30 UTC (permalink / raw)
  To: Gao Xiang, linux-fsdevel; +Cc: linux-kernel, gscrivan

On Mon, 2023-01-16 at 12:44 +0800, Gao Xiang wrote:
> Hi Alexander and folks,
> 
> I'd like to say sorry about comments in LWN.net article.  If it helps
> to the community,  my own concern about this new overlay model was
> (which is different from overlayfs since overlayfs doesn't have
>   different permission of original files) somewhat a security issue
> (as
> I told Giuseppe Scrivano before when he initially found me on slack):
> 
> As composefs on-disk shown:
> 
> struct cfs_inode_s {
> 
>          ...
> 
>         u32 st_mode; /* File type and mode.  */
>         u32 st_nlink; /* Number of hard links, only for regular
> files.  */
>         u32 st_uid; /* User ID of owner.  */
>         u32 st_gid; /* Group ID of owner.  */
> 
>          ...
> };
> 
> It seems Composefs can override uid / gid and mode bits of the
> original file
> 
>     considering a rootfs image:
>       ├── /bin
>       │   └── su
> 
> /bin/su has SUID bit set in the Composefs inode metadata, but I
> didn't
> find some clues if ostree "objects/abc" could be actually replaced
> with data of /bin/sh if composefs fsverity feature is disabled (it
> doesn't seem composefs enforcely enables fsverity according to
> documentation).
> 
> I think that could cause _privilege escalation attack_ of these SUID
> files is replaced with some root shell.  Administrators cannot keep
> all the time of these SUID files because such files can also be
> replaced at runtime.
> 
> Composefs may assume that ostree is always for such content-addressed
> directory.  But if considering it could laterly be an upstream fs, I
> think we cannot always tell people "no, don't use this way, it
> doesn't
> work" if people use Composefs under an untrusted repo (maybe even
> without ostree).
> 
> That was my own concern at that time when Giuseppe Scrivano told me
> to enhance EROFS as this way, and I requested him to discuss this in
> the fsdevel mailing list in order to resolve this, but it doesn't
> happen.
> 
> Otherwise, EROFS could face such issue as well, that is why I think
> it needs to be discussed first.

I mean, you're not wrong about this being possible. But I don't see
that this is necessarily a new problem. For example, consider the case
of loopback mounting an ext4 filesystem containing a setuid /bin/su
file. If you have the right permissions, nothing prohibits you from
modifying the loopback mounted file and replacing the content of the su
file with a copy of bash.

In both these cases, the security of the system is fully defined by the
filesystem permissions of the backing file data. I think viewing
composefs as a "new type" of overlayfs gets the wrong idea across. It's
more similar to a "new type" of loopback mount. In particular, the
backing file metadata is completely unrelated to the metadata exposed
by the filesystem, which means that you can choose to protect the
backing files (and directories) in ways which protect against changes
from non-privileged users.

Note: The above assumes that mounting either a loopback mount or a
composefs image is a privileged operation. Allowing unprivileged mounts
is a very different thing.

> > To be fully verified we need another step: we use fs-verity on the
> > image itself. Then we pass the expected digest on the mount command
> > line (which will be verified at mount time):
> > 
> > # fsverity enable rootfs.img
> > # fsverity digest rootfs.img
> > sha256:da42003782992856240a3e25264b19601016114775debd80c01620260af8
> > 6a76 rootfs.img
> > # mount -t composefs rootfs.img -o
> > basedir=objects,digest=da42003782992856240a3e25264b19601016114775de
> > bd80c01620260af86a76 /mnt
> > 
> 
> 
> It seems that Composefs uses fsverity_get_digest() to do fsverity
> check.  If Composefs uses symlink-like payload to redirect a file to
> another underlayfs file, such underlayfs file can exist in any other
> fses.
> 
> I can see Composefs could work with ext4, btrfs, f2fs, and later XFS
> but I'm not sure how it could work with overlayfs, FUSE, or other
> network fses.  That could limit the use cases as well.

Yes, if you choose to store backing files on a non-fs-verity enabled
filesystem you cannot use the fs-verity feature. But this is just a
decision users of composefs have to take if they wish to use this
particular feature. I think re-using fs-verity like this is a better
approach than re-implementing verity.

> Except for the above, I think EROFS could implement this in about
> 300~500 new lines of code as Giuseppe found me, or squashfs or
> overlayfs.
> 
> I'm very happy to implement such model if it can be proved as safe
> (I'd also like to say here by no means I dislike ostree) and I'm
> also glad if folks feel like to introduce a new file system for
> this as long as this overlay model is proved as safe.

My personal target usecase is that of the ostree trusted root
filesystem, and it has a lot of specific requirements that lead to
choices in the design of composefs. I took a look at EROFS a while ago,
and I think that even with some verity-like feature it would not fit
this usecase. 

EROFS does indeed do some of the file-sharing aspects of composefs with
its use of fs-cache (although the current n_chunk limit would need to
be raised). However, I think there are two problems with this. 

First of all is the complexity of having to involve a userspace for the
cache. For trusted boot to work we have to have all the cachefs
userspace machinery on the (signed) initrd, and then have to properly
transition this across the pivot-root into the full os boot. I'm sure
it is technically *possible*, but it is very complex and a pain to set
up and maintain.

Secondly, the use of fs-cache doesn't stack, as there can only be one
cachefs agent. For example, mixing an ostree EROFS boot with a
container backend using EROFS isn't possible (at least without deep
integration between the two userspaces).

Also, if we ignore the file sharing aspects there is the question of how
to actually integrate a new digest-based image format with the pre-
existing ostree formats and distribution mechanisms. If we just replace
everything with distributing a signed image file then we can easily use
existing technology (say dm-verity + squashfs + loopback). However,
this would be essentially A/B booting and we would lose all the
advantages of ostree. 

Instead what we have done with composefs is to make filesystem image
generation from the ostree repository 100% reproducible. Then we can
keep the entire pre-existing ostree distribution mechanism and on-disk
repo format, adding just a single piece of metadata to the ostree
commit, containing the composefs toplevel digest. Then the client can
easily and efficiently re-generate the composefs image locally, and
boot into it specifying the trusted not-locally-generated digest. A
filesystem that doesn't have this reproducibility feature isn't going
to be possible to integrate with ostree without enormous changes to
ostree, and a filesystem more complex than composefs will have a hard
time giving such guarantees.


-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
=-=-=
 Alexander Larsson                                            Red Hat,
Inc 
       alexl@redhat.com            alexander.larsson@gmail.com 
He's an unconventional gay card sharp moving from town to town, helping
folk in trouble. She's a virginal goth bounty hunter descended from a 
line of powerful witches. They fight crime! 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem
  2023-01-16  9:30   ` Alexander Larsson
@ 2023-01-16 10:19     ` Gao Xiang
  2023-01-16 12:33       ` Alexander Larsson
  0 siblings, 1 reply; 34+ messages in thread
From: Gao Xiang @ 2023-01-16 10:19 UTC (permalink / raw)
  To: Alexander Larsson, linux-fsdevel; +Cc: linux-kernel, gscrivan

Hi Alexander,

On 2023/1/16 17:30, Alexander Larsson wrote:
> On Mon, 2023-01-16 at 12:44 +0800, Gao Xiang wrote:
>> Hi Alexander and folks,
>>
>> I'd like to say sorry about comments in LWN.net article.  If it helps
>> to the community,  my own concern about this new overlay model was
>> (which is different from overlayfs since overlayfs doesn't have
>>    different permission of original files) somewhat a security issue
>> (as
>> I told Giuseppe Scrivano before when he initially found me on slack):
>>
>> As composefs on-disk shown:
>>
>> struct cfs_inode_s {
>>
>>           ...
>>
>>          u32 st_mode; /* File type and mode.  */
>>          u32 st_nlink; /* Number of hard links, only for regular
>> files.  */
>>          u32 st_uid; /* User ID of owner.  */
>>          u32 st_gid; /* Group ID of owner.  */
>>
>>           ...
>> };
>>
>> It seems Composefs can override uid / gid and mode bits of the
>> original file
>>
>>      considering a rootfs image:
>>        ├── /bin
>>        │   └── su
>>
>> /bin/su has SUID bit set in the Composefs inode metadata, but I
>> didn't
>> find some clues if ostree "objects/abc" could be actually replaced
>> with data of /bin/sh if composefs fsverity feature is disabled (it
>> doesn't seem composefs enforcely enables fsverity according to
>> documentation).
>>
>> I think that could cause _privilege escalation attack_ of these SUID
>> files is replaced with some root shell.  Administrators cannot keep
>> all the time of these SUID files because such files can also be
>> replaced at runtime.
>>
>> Composefs may assume that ostree is always for such content-addressed
>> directory.  But if considering it could laterly be an upstream fs, I
>> think we cannot always tell people "no, don't use this way, it
>> doesn't
>> work" if people use Composefs under an untrusted repo (maybe even
>> without ostree).
>>
>> That was my own concern at that time when Giuseppe Scrivano told me
>> to enhance EROFS as this way, and I requested him to discuss this in
>> the fsdevel mailing list in order to resolve this, but it doesn't
>> happen.
>>
>> Otherwise, EROFS could face such issue as well, that is why I think
>> it needs to be discussed first.
> 
> I mean, you're not wrong about this being possible. But I don't see
> that this is necessarily a new problem. For example, consider the case
> of loopback mounting an ext4 filesystem containing a setuid /bin/su
> file. If you have the right permissions, nothing prohibits you from
> modifying the loopback mounted file and replacing the content of the su
> file with a copy of bash.
> 
> In both these cases, the security of the system is fully defined by the
> filesystem permissions of the backing file data. I think viewing
> composefs as a "new type" of overlayfs gets the wrong idea across. Its
> more similar to a "new type" of loopback mount. In particular, the
> backing file metadata is completely unrelated to the metadata exposed
> by the filesystem, which means that you can chose to protect the
> backing files (and directories) in ways which protect against changes
> from non-privileged users.
> 
> Note: The above assumes that mounting either a loopback mount or a
> composefs image is a privileged operation. Allowing unprivileged mounts
> is a very different thing.

Thanks for the reply.  If I understand correctly, I can answer some of
your questions.  Hopefully this helps everyone interested.

Let's leave unprivileged mounts aside for now, although Giuseppe told
me earlier that it is also a future step for Composefs.  I don't know
how that could work reliably if a fs has some on-disk format; we can
discuss it later.

I think with a loopback mount, the loopback file is quite under the
admin's control (take an ext4 loopback mount as an example: there is
only one file to access when setting up the loopback device, and that
file is also opened when the loopback mount is set up, so it cannot be
replaced.

If you enable fs-verity on the loopback file beforehand, it cannot be
modified either).


But IMHO, composefs introduces a new model where a stackable
filesystem can point to a massive number of files under an arbitrary
directory, as ostree does (in principle, files in such a directory can
even be bind-mounted later).  The original userspace ostree strictly
follows the underlying filesystem's permission checks, but Composefs
can override uid/gid/permissions instead.

That is also why we selected fscache in the first place to manage all
local cache data for EROFS: such a content-addressed directory is
fully under the control of in-kernel fscache, instead of being an
arbitrary directory created and handed over by some userspace program.

If you are interested in looking into the current in-kernel fscache
behavior, I think it is quite similar to what ostree does now.

It just needs new features like
   - multiple directories;
   - daemonless
to match.

> 
>>> To be fully verified we need another step: we use fs-verity on the
>>> image itself. Then we pass the expected digest on the mount command
>>> line (which will be verified at mount time):
>>>
>>> # fsverity enable rootfs.img
>>> # fsverity digest rootfs.img
>>> sha256:da42003782992856240a3e25264b19601016114775debd80c01620260af8
>>> 6a76 rootfs.img
>>> # mount -t composefs rootfs.img -o
>>> basedir=objects,digest=da42003782992856240a3e25264b19601016114775de
>>> bd80c01620260af86a76 /mnt
>>>
>>
>>
>> It seems that Composefs uses fsverity_get_digest() to do fsverity
>> check.  If Composefs uses symlink-like payload to redirect a file to
>> another underlayfs file, such underlayfs file can exist in any other
>> fses.
>>
>> I can see Composefs could work with ext4, btrfs, f2fs, and later XFS
>> but I'm not sure how it could work with overlayfs, FUSE, or other
>> network fses.  That could limit the use cases as well.
> 
> Yes, if you chose to store backing files on a non-fs-verity enabled
> filesystem you cannot use the fs-verity feature. But this is just a
> decision users of composefs have to take if they wish to use this
> particular feature. I think re-using fs-verity like this is a better
> approach than re-implementing verity.
> 
>> Except for the above, I think EROFS could implement this in about
>> 300~500 new lines of code as Giuseppe found me, or squashfs or
>> overlayfs.
>>
>> I'm very happy to implement such model if it can be proved as safe
>> (I'd also like to say here by no means I dislike ostree) and I'm
>> also glad if folks feel like to introduce a new file system for
>> this as long as this overlay model is proved as safe.
> 
> My personal target usecase is that of the ostree trusted root
> filesystem, and it has a lot of specific requirements that lead to
> choices in the design of composefs. I took a look at EROFS a while ago,
> and I think that even with some verify-like feature it would not fit
> this usecase.
> 
> EROFS does indeed do some of the file-sharing aspects of composefs with
> its use of fs-cache (although the current n_chunk limit would need to
> be raised). However, I think there are two problems with this.
> 
> First of all is the complexity of having to involve a userspace for the
> cache. For trusted boot to work we have to have all the cachefs
> userspace machinery on the (signed) initrd, and then have to properly
> transition this across the pivot-root into the full os boot. I'm sure
> it is technically *possible*, but it is very complex and a pain to set
> up and maintain.
> 
> Secondly, the use of fs-cache doesn't stack, as there can only be one
> cachefs agent. For example, mixing an ostree EROFS boot with a
> container backend using EROFS isn't possible (at least without deep
> integration between the two userspaces).

The reasons above are all limitations of the current fscache
implementation:

  - First, if such an overlay model really works, EROFS can do it
without the fscache feature as well to integrate with userspace
ostree.  But even then, I hope this new feature can land in overlayfs
rather than somewhere else, since overlayfs has a native writable
layer, so we wouldn't need another overlayfs mount at all for writing;

  - Second, as I mentioned above, the limitation is how fscache
behaves now, not how fscache will behave.  I did discuss with David
Howells that he would also like to develop multiple-directory and
daemonless features for network fses.

> 
> Also, f we ignore the file sharing aspects there is the question of how
> to actually integrate a new digest-based image format with the pre-
> existing ostree formats and distribution mechanisms. If we just replace
> everything with distributing a signed image file then we can easily use
> existing technology (say dm-verity + squashfs + loopback). However,
> this would be essentially A/B booting and we would lose all the
> advantages of ostree.

EROFS can now do data deduplication, and later page cache sharing as well.

> 
> Instead what we have done with composefs is to make filesystem image
> generation from the ostree repository 100% reproducible. Then we can

EROFS is 100% reproducible as well.

> keep the entire pre-existing ostree distribution mechanism and on-disk
> repo format, adding just a single piece of metadata to the ostree
> commit, containing the composefs toplevel digest. Then the client can
> easily and efficiently re-generate the composefs image locally, and
> boot into it specifying the trusted not-locally-generated digest. A
> filesystem that doesn't have this reproduceability feature isn't going
> to be possible to integrate with ostree without enormous changes to
> ostree, and a filesystem more complex that composefs will have a hard
> time giving such guarantees.

I'm not sure why EROFS would not be good at this; I could also make an
EROFS version of the same thing Composefs does, with a symlink-like
path attached to each regular file.  And ostree could make use of it
as well.

But really, personally I think the issue above is different from
loopback devices and may need to be resolved first.  And if possible,
I hope it could be a new overlayfs feature for everyone.

Thanks,
Gao Xiang

> 
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 2/6] composefs: Add on-disk layout
  2023-01-16  1:29   ` Dave Chinner
@ 2023-01-16 11:00     ` Alexander Larsson
  2023-01-16 23:06       ` Dave Chinner
  0 siblings, 1 reply; 34+ messages in thread
From: Alexander Larsson @ 2023-01-16 11:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel, gscrivan

On Mon, 2023-01-16 at 12:29 +1100, Dave Chinner wrote:
> On Fri, Jan 13, 2023 at 04:33:55PM +0100, Alexander Larsson wrote:
> > This commit adds the on-disk layout header file of composefs.
> 
> This isn't really a useful commit message.
> 
> Perhaps it should actually explain what the overall goals of the
> on-disk format are - space usage, complexity trade-offs, potential
> issues with validation of variable payload sections, etc.
> 

I agree, I will flesh it out. But, as per the discussion below, one of
the overall goals is to keep the on-disk file size low.

> > Signed-off-by: Alexander Larsson <alexl@redhat.com>
> > Co-developed-by: Giuseppe Scrivano <gscrivan@redhat.com>
> > Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
> > ---
> >  fs/composefs/cfs.h | 203
> > +++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 203 insertions(+)
> >  create mode 100644 fs/composefs/cfs.h
> > 
> > diff --git a/fs/composefs/cfs.h b/fs/composefs/cfs.h
> > new file mode 100644
> > index 000000000000..658df728e366
> > --- /dev/null
> > +++ b/fs/composefs/cfs.h
> > @@ -0,0 +1,203 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * composefs
> > + *
> > + * Copyright (C) 2021 Giuseppe Scrivano
> > + * Copyright (C) 2022 Alexander Larsson
> > + *
> > + * This file is released under the GPL.
> > + */
> > +
> > +#ifndef _CFS_H
> > +#define _CFS_H
> > +
> > +#include <asm/byteorder.h>
> > +#include <crypto/sha2.h>
> > +#include <linux/fs.h>
> > +#include <linux/stat.h>
> > +#include <linux/types.h>
> > +
> > +#define CFS_VERSION 1
> 
> This should start with a description of the on-disk format for the
> version 1 format.

There are some format descriptions in the later documentation patch.
What is the general approach here: do we document in the header, or in
a separate doc file? For example, I don't see many format descriptions
in the xfs headers. I mean, I should probably add *some* info here for easier
reading of the stuff below, but I don't feel like headers are a great
place for docs.

> > +
> > +#define CFS_MAGIC 0xc078629aU
> > +
> > +#define CFS_MAX_DIR_CHUNK_SIZE 4096
> > +#define CFS_MAX_XATTRS_SIZE 4096
> 
> How do we store 64kB xattrs in this format if the max attr size is
> 4096 bytes? Or is that the maximum total xattr storage?

This is a current limitation of the composefs file format. I am aware
that the kernel maximum size is 64k, but I'm not sure what use this
would have in a read-only filesystem image in practice. I could extend
this limit with some added complexity, but would it be worth the
increase in complexity?

> A comment telling us what these limits are would be nice.
> 

Sure.

> > +
> > +static inline int cfs_digest_from_payload(const char *payload,
> > size_t payload_len,
> > +                                         u8
> > digest_out[SHA256_DIGEST_SIZE])
> > +{
> > +       const char *p, *end;
> > +       u8 last_digit = 0;
> > +       int digit = 0;
> > +       size_t n_nibbles = 0;
> > +
> > +       /* This handles payloads (i.e. path names) that are
> > "essentially" a
> > +        * digest as the digest (if the DIGEST_FROM_PAYLOAD flag is
> > set). The
> > +        * "essential" part means that we ignore hierarchical
> > structure as well
> > +        * as any extension. So, for example "ef/deadbeef.file"
> > would match the
> > +        * (too short) digest "efdeadbeef".
> > +        *
> > +        * This allows images to avoid storing both the digest and
> > the pathname,
> > +        * yet work with pre-existing object store formats of
> > various kinds.
> > +        */
> > +
> > +       end = payload + payload_len;
> > +       for (p = payload; p != end; p++) {
> > +               /* Skip subdir structure */
> > +               if (*p == '/')
> > +                       continue;
> > +
> > +               /* Break at (and ignore) extension */
> > +               if (*p == '.')
> > +                       break;
> > +
> > +               if (n_nibbles == SHA256_DIGEST_SIZE * 2)
> > +                       return -EINVAL; /* Too long */
> > +
> > +               digit = hex_to_bin(*p);
> > +               if (digit == -1)
> > +                       return -EINVAL; /* Not hex digit */
> > +
> > +               n_nibbles++;
> > +               if ((n_nibbles % 2) == 0)
> > +                       digest_out[n_nibbles / 2 - 1] = (last_digit
> > << 4) | digit;
> > +               last_digit = digit;
> > +       }
> > +
> > +       if (n_nibbles != SHA256_DIGEST_SIZE * 2)
> > +               return -EINVAL; /* Too short */
> > +
> > +       return 0;
> > +}
> 
> Too big to be a inline function.
> 

Yeah, I'm aware of this. I mainly put it in the header as the
implementation of it is sort of part of the on-disk format. But, I can
move it to a .c file instead.


> > +
> > +struct cfs_vdata_s {
> 
> Drop the "_s" suffix to indicate the type is a structure - that's
> what "struct" tells us.

Sure.

> > +       u64 off;
> > +       u32 len;
> 
> If these are on-disk format structures, why aren't the defined as
> using the specific endian they are encoded in? i.e. __le64, __le32,
> etc? Otherwise a file built on a big endian machine won't be
> readable on a little endian machine (and vice versa).

On disk all fields are little endian. However, when we read them from
disk we convert them using e.g. le32_to_cpu(), and then we use the same
structure in memory, with native endian. So, it seems wrong to mark
them as little endian.
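
For example, the read helpers are roughly of this shape (a simplified
sketch, not the exact composefs code; I use get_unaligned_le32() here
since the packed on-disk data may be unaligned):

	/* Decode one little-endian u32 from the on-disk buffer and
	 * advance the cursor; the value is kept native-endian in the
	 * in-memory structures from here on. */
	static u32 cfs_read_u32(const u8 **data)
	{
		u32 v = get_unaligned_le32(*data);

		*data += sizeof(u32);
		return v;
	}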

> 
> > +} __packed;
> > +
> > +struct cfs_header_s {
> > +       u8 version;
> > +       u8 unused1;
> > +       u16 unused2;
> 
> Why are you hyper-optimising these structures for minimal space
> usage? This is 2023 - we can use a __le32 for the version number,
> the magic number and then leave....
>
> > +
> > +       u32 magic;
> > +       u64 data_offset;
> > +       u64 root_inode;
> > +
> > +       u64 unused3[2];
> 
> a whole heap of space to round it up to at least a CPU cacheline
> size using something like "__le64 unused[15]".
> 
> That way we don't need packed structures nor do we care about having
> weird little holes in the structures to fill....

Sure.

> > +} __packed;
> > +
> > +enum cfs_inode_flags {
> > +       CFS_INODE_FLAGS_NONE = 0,
> > +       CFS_INODE_FLAGS_PAYLOAD = 1 << 0,
> > +       CFS_INODE_FLAGS_MODE = 1 << 1,
> > +       CFS_INODE_FLAGS_NLINK = 1 << 2,
> > +       CFS_INODE_FLAGS_UIDGID = 1 << 3,
> > +       CFS_INODE_FLAGS_RDEV = 1 << 4,
> > +       CFS_INODE_FLAGS_TIMES = 1 << 5,
> > +       CFS_INODE_FLAGS_TIMES_NSEC = 1 << 6,
> > +       CFS_INODE_FLAGS_LOW_SIZE = 1 << 7, /* Low 32bit of st_size
> > */
> > +       CFS_INODE_FLAGS_HIGH_SIZE = 1 << 8, /* High 32bit of
> > st_size */
> 
> Why do we need to complicate things by splitting the inode size
> like this?
> 

The goal is to minimize the image size for a typical rootfs or
container image. Almost no files in such images are > 4GB. 

Also, we don't just "not decode" the items with the flag not set, they
are not even stored on disk.
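
To illustrate, the decode side reassembles st_size roughly like this
(sketch only, surrounding decode code omitted):

	u64 size_low = 0, size_high = 0;

	if (CFS_INODE_FLAG_CHECK(ino->flags, LOW_SIZE))
		size_low = cfs_read_u32(&data);
	if (CFS_INODE_FLAG_CHECK(ino->flags, HIGH_SIZE))
		size_high = cfs_read_u32(&data);
	ino->st_size = (size_high << 32) | size_low;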

> > +       CFS_INODE_FLAGS_XATTRS = 1 << 9,
> > +       CFS_INODE_FLAGS_DIGEST = 1 << 10, /* fs-verity sha256
> > digest */
> > +       CFS_INODE_FLAGS_DIGEST_FROM_PAYLOAD = 1 << 11, /* Compute
> > digest from payload */
> > +};
> > +
> > +#define CFS_INODE_FLAG_CHECK(_flag,
> > _name)                                     \
> > +       (((_flag) & (CFS_INODE_FLAGS_##_name)) != 0)
> 
> Check what about a flag? If this is a "check that a feature is set",
> then open coding it better, but if you must do it like this, then
> please use static inline functions like:
> 
>         if (cfs_inode_has_xattrs(inode->flags)) {
>                 .....
>         }
> 

The check is whether the flag is set, so maybe CFS_INODE_FLAG_IS_SET is
a better name. This is used only when decoding the on-disk version of
the inode to the in-memory one, which is a bunch of:

	if (CFS_INODE_FLAG_CHECK(ino->flags, THE_FIELD))
		ino->the_field = cfs_read_u32(&data);
	else
		ino->the_field = THE_FIELD_DEFAULT;

I can easily open-code these checks, although I'm not sure it makes a
great difference either way.
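
If I went with helpers along the lines you suggest, it would look
something like this (sketch only):

	static inline bool cfs_inode_has_uidgid(u32 flags)
	{
		return (flags & CFS_INODE_FLAGS_UIDGID) != 0;
	}

	...

	if (cfs_inode_has_uidgid(ino->flags)) {
		ino->st_uid = cfs_read_u32(&data);
		ino->st_gid = cfs_read_u32(&data);
	} else {
		ino->st_uid = CFS_INODE_DEFAULT_UIDGID;
		ino->st_gid = CFS_INODE_DEFAULT_UIDGID;
	}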

> > +#define CFS_INODE_FLAG_CHECK_SIZE(_flag, _name,
> > _size)                         \
> > +       (CFS_INODE_FLAG_CHECK(_flag, _name) ? (_size) : 0)
> 
> This doesn't seem particularly useful, because you've still got to
> test is the return value is valid. i.e.
> 
>         size = CFS_INODE_FLAG_CHECK_SIZE(inode->flags, XATTRS, 32);
>         if (size == 32) {
>                 /* got xattrs, decode! */
>         }
> 
> vs
>         if (cfs_inode_has_xattrs(inode->flags)) {
>                 /* decode! */
>         }

This macro is only used by the cfs_inode_encoded_size() function that
computes the size of the on-disk format of an inode, given its flags:

static inline u32 cfs_inode_encoded_size(u32 flags)
{
	return sizeof(u32) /* flags */ +
	       CFS_INODE_FLAG_CHECK_SIZE(flags, PAYLOAD, sizeof(u32))
+
	       CFS_INODE_FLAG_CHECK_SIZE(flags, MODE, sizeof(u32)) +
	       CFS_INODE_FLAG_CHECK_SIZE(flags, NLINK, sizeof(u32)) +
...

It is only useful in the sense that it makes this function easy to
read/write. I should maybe move the definition of the macro to that
function.

> 
> > +
> > +#define CFS_INODE_DEFAULT_MODE 0100644
> > +#define CFS_INODE_DEFAULT_NLINK 1
> > +#define CFS_INODE_DEFAULT_NLINK_DIR 2
> > +#define CFS_INODE_DEFAULT_UIDGID 0
> > +#define CFS_INODE_DEFAULT_RDEV 0
> > +#define CFS_INODE_DEFAULT_TIMES 0
> 
> Where do these get used? Are they on disk defaults or something
> else? (comment, please!)

They are the defaults that are used when inode fields on disk are
missing. I'll add some comments.

> > +struct cfs_inode_s {
> > +       u32 flags;
> > +
> > +       /* Optional data: (selected by flags) */
> 
> Why would you make them optional given that all the fields are still
> defined in the structure?
> 
> It's much simpler just to decode the entire structure into memory
> than to have to check each flag value to determine if a field needs
> to be decoded...
> 

I guess I need to clarify these comments a bit, but they are optional
on-disk, and decoded and extended with the above defaults by
cfs_get_ino_index() when read into memory. So, they are not optional in
memory.


> > +       /* This is the size of the type specific data that comes
> > directly after
> > +        * the inode in the file. Of this type:
> > +        *
> > +        * directory: cfs_dir_s
> > +        * regular file: the backing filename
> > +        * symlink: the target link
> > +        *
> > +        * Canonically payload_length is 0 for empty
> > dir/file/symlink.
> > +        */
> > +       u32 payload_length;
> 
> How do you have an empty symlink?

In terms of the file format, empty would mean a zero length target
string. But you're right that this isn't allowed. I'll change this
comment.

> > +       u32 st_mode; /* File type and mode.  */
> > +       u32 st_nlink; /* Number of hard links, only for regular
> > files.  */
> > +       u32 st_uid; /* User ID of owner.  */
> > +       u32 st_gid; /* Group ID of owner.  */
> > +       u32 st_rdev; /* Device ID (if special file).  */
> > +       u64 st_size; /* Size of file, only used for regular files
> > */
> > +
> > +       struct cfs_vdata_s xattrs; /* ref to variable data */
> 
> This is in the payload that follows the inode?  Is it included in
> the payload_length above?
> 
> If not, where is this stuff located, how do we validate it points to
> the correct place in the on-disk format file, the xattrs belong to
> this specific inode, etc? I think that's kinda important to
> describe, because xattrs often contain important security
> information...

No, all inodes are packed into the initial part of the file, each
containing a flags field, a variable-size (determined by the flags)
chunk of fixed-size elements, and a variable-size payload. The payload
is either the symlink target for symlinks, or the path of the backing
file for regular files. Other data, such as xattrs and dirents, is
stored in a separate part of the file, and the offsets for those in
the inode refer to offsets into that area.
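
Roughly, the overall image layout is (simplified):

	+---------------------------------------------------------------+
	| cfs_header_s                                                   |
	+---------------------------------------------------------------+
	| inode section: per inode, packed: flags, optional fixed-size   |
	| fields (selected by flags), payload (backing path / symlink    |
	| target)                                                        |
	+---------------------------------------------------------------+
	| variable data section: xattrs and dir chunks, referenced from  |
	| the inodes by offset/length                                    |
	+---------------------------------------------------------------+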

> 
> > +
> > +       u8 digest[SHA256_DIGEST_SIZE]; /* fs-verity digest */
> 
> Why would you have this in the on-disk structure, then also have
> "digest from payload" that allows the digest to be in the payload
> section of the inode data?

The payload is normally the path to the backing file, and then you need
to store the verity digest separately. This is what would be needed
when using this with ostree for instance, because we have an existing
backing file repo format we can't change. However, if your backing
store files are stored by their fs-verity digest already (which is the
default for mkcomposefs), then we can set this flag and avoid storing
the digest unnecessarily.
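
In other words, at decode time the logic is roughly (sketch;
"stored_digest" stands for wherever the on-disk digest was read from):

	if (CFS_INODE_FLAG_CHECK(ino->flags, DIGEST_FROM_PAYLOAD)) {
		/* Backing file is named by its fs-verity digest, so
		 * derive the digest from the payload path. */
		ret = cfs_digest_from_payload(payload, payload_len,
					      ino->digest);
		if (ret < 0)
			return ret;
	} else if (CFS_INODE_FLAG_CHECK(ino->flags, DIGEST)) {
		/* Digest is stored explicitly; payload is an arbitrary
		 * backing path (e.g. an ostree object). */
		memcpy(ino->digest, stored_digest, SHA256_DIGEST_SIZE);
	}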

> > +
> > +       struct timespec64 st_mtim; /* Time of last modification. 
> > */
> > +       struct timespec64 st_ctim; /* Time of last status change. 
> > */
> > +};
> 
> This really feels like an in-memory format inode, not an on-disk
> format inode, because this:
> 
> > +
> > +static inline u32 cfs_inode_encoded_size(u32 flags)
> > +{
> > +       return sizeof(u32) /* flags */ +
> > +              CFS_INODE_FLAG_CHECK_SIZE(flags, PAYLOAD,
> > sizeof(u32)) +
> > +              CFS_INODE_FLAG_CHECK_SIZE(flags, MODE, sizeof(u32))
> > +
> > +              CFS_INODE_FLAG_CHECK_SIZE(flags, NLINK, sizeof(u32))
> > +
> > +              CFS_INODE_FLAG_CHECK_SIZE(flags, UIDGID, sizeof(u32)
> > + sizeof(u32)) +
> > +              CFS_INODE_FLAG_CHECK_SIZE(flags, RDEV, sizeof(u32))
> > +
> > +              CFS_INODE_FLAG_CHECK_SIZE(flags, TIMES, sizeof(u64)
> > * 2) +
> > +              CFS_INODE_FLAG_CHECK_SIZE(flags, TIMES_NSEC,
> > sizeof(u32) * 2) +
> > +              CFS_INODE_FLAG_CHECK_SIZE(flags, LOW_SIZE,
> > sizeof(u32)) +
> > +              CFS_INODE_FLAG_CHECK_SIZE(flags, HIGH_SIZE,
> > sizeof(u32)) +
> > +              CFS_INODE_FLAG_CHECK_SIZE(flags, XATTRS, sizeof(u64)
> > + sizeof(u32)) +
> > +              CFS_INODE_FLAG_CHECK_SIZE(flags, DIGEST,
> > SHA256_DIGEST_SIZE);
> > +}
> 
> looks like the on-disk format is an encoded format hyper-optimised
> for minimal storage space usage?

Yes.


> Without comments to explain it, I'm not exactly sure what is stored
> in the on-disk format inodes, nor the layout of the variable
> payload section or how payload sections are defined and verified.
> 
> Seems overly complex to me - it's far simpler just to have a fixed
> inode structure and just decode it directly into the in-memory
> structure when it is read....

We have a variable-size on-disk inode structure (for size reasons),
which we decode directly into the above in-memory structure when read.
So I don't think we're that far from what you expect. However, yes,
this could easily be explained better.

> 
> > +struct cfs_dentry_s {
> > +       /* Index of struct cfs_inode_s */
> 
> Not a useful (or correct!) comment :/

It's not really incorrect, but I agree it's not necessarily a great
comment. At this specific offset in the inode section we can decode the
cfs_inode_s that this inode refers to, and this offset is also the
inode number of the inode.

> Also, the typical term for this on disk structure in a filesystem is
> a "dirent", and this is also what readdir() returns to userspace.
> dentry is typically used internally in the kernel to refer to the
> VFS cache layer objects, not the filesystem dirents the VFS layers
> look up to populate it's dentry cache.
> 

Yeah, i'll rename it.

> > +       u64 inode_index;
> > +       u8 d_type;
> > +       u8 name_len;
> > +       u16 name_offset;
> 
> What's this name_offset refer to? 

Dirents are stored in chunks, each chunk < 4k. Each chunk is a list of
these dirents, followed by the strings for the names; the name_offset
is the offset from the start of the chunk to the name.
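
Within a single chunk, a lookup is then conceptually (illustrative
sketch, not the actual lookup code):

	/* 'chunk' points at the start of one dir chunk; the dirent
	 * array sits at the front, the names follow after it. */
	for (i = 0; i < n_dentries; i++) {
		const char *entry_name = chunk + dentries[i].name_offset;

		if (dentries[i].name_len == name_len &&
		    memcmp(entry_name, name, name_len) == 0)
			return dentries[i].inode_index;
	}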

> > +} __packed;
> > +
> > +struct cfs_dir_chunk_s {
> > +       u16 n_dentries;
> > +       u16 chunk_size;
> > +       u64 chunk_offset;
> 
> What's this chunk offset refer to?
> 

This is the offset in the "variable data" section of the image. This
section follows the packed inode data section. Again, better comments
needed.

> > +} __packed;
> > +
> > +struct cfs_dir_s {
> > +       u32 n_chunks;
> > +       struct cfs_dir_chunk_s chunks[];
> > +} __packed;
> 
> So directory data is packed in discrete chunks? Given that this is a
> static directory format, and the size of the directory is known at
> image creation time, why does the storage need to be chunked?

We chunk the data such that each chunk fits inside a single page in the
image file. I did this to make accessing image data directly from the
page cache easier. We can just kmap_local_page() each chunk and treat
it as a non-split continuous dirent array, then move on to the next
chunk in the next page. If we had dirent data spanning multiple pages
then we would either need to map the pages consecutively (which seems
hard/costly) or have complex in-kernel code to handle the case where a
dirent straddles two pages.
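
So the directory read path can walk the chunks one page at a time,
something like (sketch; cfs_get_chunk_page() is a hypothetical helper
here that returns a referenced page from the image file's page cache):

	for (i = 0; i < n_chunks; i++) {
		struct page *page = cfs_get_chunk_page(inode, i);
		void *chunk = kmap_local_page(page);

		/* Treat 'chunk' as one contiguous dirent array. */
		...

		kunmap_local(chunk);
		put_page(page);
	}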

> > +
> > +#define
> > cfs_dir_size(_n_chunks)                                            
> >     \
> > +       (sizeof(struct cfs_dir_s) + (_n_chunks) * sizeof(struct
> > cfs_dir_chunk_s))
> 
> static inline, at least.
> 
> Also, this appears to be the size of the encoded directory
> header, not the size of the directory itself. cfs_dir_header_size(),
> perhaps, to match the cfs_xattr_header_size() function that does the
> same thing?

Yeah, that makes sense.

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
=-=-=
 Alexander Larsson                                            Red Hat,
Inc 
       alexl@redhat.com            alexander.larsson@gmail.com 
He's a suicidal hunchbacked cyborg who knows the secret of the alien 
invasion. She's a time-travelling out-of-work snake charmer with a song
in her heart and a spring in her step. They fight crime! 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem
  2023-01-16 10:19     ` Gao Xiang
@ 2023-01-16 12:33       ` Alexander Larsson
  2023-01-16 13:26         ` Gao Xiang
  0 siblings, 1 reply; 34+ messages in thread
From: Alexander Larsson @ 2023-01-16 12:33 UTC (permalink / raw)
  To: Gao Xiang, linux-fsdevel; +Cc: linux-kernel, gscrivan

On Mon, 2023-01-16 at 18:19 +0800, Gao Xiang wrote:
> Hi Alexander,
> 
> On 2023/1/16 17:30, Alexander Larsson wrote:
> > 
> > I mean, you're not wrong about this being possible. But I don't see
> > that this is necessarily a new problem. For example, consider the
> > case
> > of loopback mounting an ext4 filesystem containing a setuid /bin/su
> > file. If you have the right permissions, nothing prohibits you from
> > modifying the loopback mounted file and replacing the content of
> > the su
> > file with a copy of bash.
> > 
> > In both these cases, the security of the system is fully defined by
> > the
> > filesystem permissions of the backing file data. I think viewing
> > composefs as a "new type" of overlayfs gets the wrong idea across.
> > Its
> > more similar to a "new type" of loopback mount. In particular, the
> > backing file metadata is completely unrelated to the metadata
> > exposed
> > by the filesystem, which means that you can chose to protect the
> > backing files (and directories) in ways which protect against
> > changes
> > from non-privileged users.
> > 
> > Note: The above assumes that mounting either a loopback mount or a
> > composefs image is a privileged operation. Allowing unprivileged
> > mounts
> > is a very different thing.
> 
> Thanks for the reply.  I think if I understand correctly, I could
> answer some of your questions.  Hopefully help to everyone
> interested.
> 
> Let's avoid thinking unprivileged mounts first, although Giuseppe
> told
> me earilier that is also a future step of Composefs. But I don't know
> how it could work reliably if a fs has some on-disk format, we could
> discuss it later.
> 
> I think as a loopback mount, such loopback files are quite under
> control
> (take ext4 loopback mount as an example, each ext4 has the only one
> file
>   to access when setting up loopback devices and such loopback file
> was
>   also opened when setting up loopback mount so it cannot be
> replaced.
> 
>   If you enables fsverity for such loopback mount before, it cannot
> be
>   modified as well) by admins.
> 
> 
> But IMHO, here composefs shows a new model that some stackable
> filesystem can point to massive files under a random directory as
> what
> ostree does (even files in such directory can be bind-mounted later
> in
> principle).  But the original userspace ostree strictly follows
> underlayfs permission check but Composefs can override
> uid/gid/permission instead.

Suppose you have:

-rw-r--r-- root root image.ext4
-rw-r--r-- root root image.composefs
drwxr--r-- root root objects/
-rw-r--r-- root root objects/backing.file

Are you saying it is easier for someone to modify backing.file than
image.ext4? 

I argue it is not, but composefs takes some steps to avoid issues here.
At mount time, when the basedir ("objects/" above) argument is parsed,
we resolve that path and then create a private vfsmount for it: 

 resolve_basedir(path) {
        ...
	mnt = clone_private_mount(&path);
        ...
 }

 fsi->bases[i] = resolve_basedir(path);

Then we open backing files with this mount as root:

 real_file = file_open_root_mnt(fsi->bases[i], real_path,
 			        file->f_flags, 0);

This will never resolve outside the initially specified basedir, even
with symlinks or whatever. It will also not be affected by later mount
changes in the original mount namespace, as this is a private mount. 

This is the same mechanism that overlayfs uses for its upper dirs.

I would argue that anyone who has rights to modify the contents of
files in "objects" (supposing they were created with sane permissions)
would also have rights to modify "image.ext4".

> That is also why we selected fscache at the first time to manage all
> local cache data for EROFS, since such content-defined directory is
> quite under control by in-kernel fscache instead of selecting a
> random directory created and given by some userspace program.
> 
> If you are interested in looking info the current in-kernel fscache
> behavior, I think that is much similar as what ostree does now.
> 
> It just needs new features like
>    - multiple directories;
>    - daemonless
> to match.
> 

Obviously everything can be extended to support everything. But
composefs is very small and simple (2128 lines of code), while at the
same time being easy to use (just mount it with one syscall) and needing
no complex userspace machinery or configuration. But even without the
above feature additions, fscache + cachefiles is 7982 lines, plus erofs
is 9075 lines, and then on top of that you need userspace integration
to even use the thing.

Don't take me wrong, EROFS is great for its usecases, but I don't
really think it is the right choice for my usecase.

> > > 
> > Secondly, the use of fs-cache doesn't stack, as there can only be
> > one
> > cachefs agent. For example, mixing an ostree EROFS boot with a
> > container backend using EROFS isn't possible (at least without deep
> > integration between the two userspaces).
> 
> The reasons above are all current fscache implementation limitation:
> 
>   - First, if such overlay model really works, EROFS can do it
> without
> fscache feature as well to integrate userspace ostree.  But even that
> I hope this new feature can be landed in overlayfs rather than some
> other ways since it has native writable layer so we don't need
> another
> overlayfs mount at all for writing;

I don't think it is the right approach for overlayfs to integrate
something like image support. Merging the two codebases would
complicate both while adding costs to users who need only support for
one of the features. I think reusing and stacking separate features is
a better idea than combining them. 

> 
> > 
> > Instead what we have done with composefs is to make filesystem
> > image
> > generation from the ostree repository 100% reproducible. Then we
> > can
> 
> EROFS is all 100% reproduciable as well.
> 


Really, so if I today, on fedora 36 run:
# tar xvf oci-image.tar
# mkfs.erofs oci-dir/ oci.erofs

And then in 5 years, if someone on debian 13 runs the same, with the
same tar file, then both oci.erofs files will have the same sha256
checksum?

How do you handle things like different versions or builds of
compression libraries creating different results? Do you guarantee to
not add any new backwards compat changes by default, or change any
default options? Do you guarantee that the files are read from "oci-
dir" in the same order each time? It doesn't look like it.

> 
> But really, personally I think the issue above is different from
> loopback devices and may need to be resolved first. And if possible,
> I hope it could be an new overlayfs feature for everyone.

Yeah. Independent of composefs, I think EROFS would be better if you
could just point it to a chunk directory at mount time rather than
having to route everything through a system-wide global cachefs
singleton. I understand that cachefs does help with the on-demand
download aspect, but when you don't need that it is just in the way.


-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
=-=-=
 Alexander Larsson                                            Red Hat,
Inc 
       alexl@redhat.com            alexander.larsson@gmail.com 
He's a one-legged guitar-strumming firefighter who hides his scarred
face 
behind a mask. She's an orphaned gypsy lawyer with a flame-thrower.
They 
fight crime! 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 5/6] composefs: Add documentation
  2023-01-14  3:20   ` Bagas Sanjaya
@ 2023-01-16 12:38     ` Alexander Larsson
  0 siblings, 0 replies; 34+ messages in thread
From: Alexander Larsson @ 2023-01-16 12:38 UTC (permalink / raw)
  To: Bagas Sanjaya, linux-fsdevel; +Cc: linux-kernel, gscrivan, linux-doc

On Sat, 2023-01-14 at 10:20 +0700, Bagas Sanjaya wrote:
> On Fri, Jan 13, 2023 at 04:33:58PM +0100, Alexander Larsson wrote:
> > Adds documentation about the composefs filesystem and
> > how to use it.
> 
> s/Adds documentation/Add documentation/
> 

Thanks, I'll apply your proposals in the next version.

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
=-=-=
 Alexander Larsson                                            Red Hat,
Inc 
       alexl@redhat.com            alexander.larsson@gmail.com 
He's a shy pirate astronaut on his last day in the job. She's a plucky 
paranoid museum curator with an MBA from Harvard. They fight crime! 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem
  2023-01-16 12:33       ` Alexander Larsson
@ 2023-01-16 13:26         ` Gao Xiang
  2023-01-16 14:18           ` Giuseppe Scrivano
  2023-01-16 15:27           ` Alexander Larsson
  0 siblings, 2 replies; 34+ messages in thread
From: Gao Xiang @ 2023-01-16 13:26 UTC (permalink / raw)
  To: Alexander Larsson, linux-fsdevel; +Cc: linux-kernel, gscrivan



On 2023/1/16 20:33, Alexander Larsson wrote:
> On Mon, 2023-01-16 at 18:19 +0800, Gao Xiang wrote:
>> Hi Alexander,
>>
>> On 2023/1/16 17:30, Alexander Larsson wrote:
>>>
>>> I mean, you're not wrong about this being possible. But I don't see
>>> that this is necessarily a new problem. For example, consider the
>>> case
>>> of loopback mounting an ext4 filesystem containing a setuid /bin/su
>>> file. If you have the right permissions, nothing prohibits you from
>>> modifying the loopback mounted file and replacing the content of
>>> the su
>>> file with a copy of bash.
>>>
>>> In both these cases, the security of the system is fully defined by
>>> the
>>> filesystem permissions of the backing file data. I think viewing
>>> composefs as a "new type" of overlayfs gets the wrong idea across.
>>> Its
>>> more similar to a "new type" of loopback mount. In particular, the
>>> backing file metadata is completely unrelated to the metadata
>>> exposed
>>> by the filesystem, which means that you can chose to protect the
>>> backing files (and directories) in ways which protect against
>>> changes
>>> from non-privileged users.
>>>
>>> Note: The above assumes that mounting either a loopback mount or a
>>> composefs image is a privileged operation. Allowing unprivileged
>>> mounts
>>> is a very different thing.
>>
>> Thanks for the reply.  I think if I understand correctly, I could
>> answer some of your questions.  Hopefully help to everyone
>> interested.
>>
>> Let's avoid thinking unprivileged mounts first, although Giuseppe
>> told
>> me earilier that is also a future step of Composefs. But I don't know
>> how it could work reliably if a fs has some on-disk format, we could
>> discuss it later.
>>
>> I think as a loopback mount, such loopback files are quite under
>> control
>> (take ext4 loopback mount as an example, each ext4 has the only one
>> file
>>    to access when setting up loopback devices and such loopback file
>> was
>>    also opened when setting up loopback mount so it cannot be
>> replaced.
>>
>>    If you enables fsverity for such loopback mount before, it cannot
>> be
>>    modified as well) by admins.
>>
>>
>> But IMHO, here composefs shows a new model that some stackable
>> filesystem can point to massive files under a random directory as
>> what
>> ostree does (even files in such directory can be bind-mounted later
>> in
>> principle).  But the original userspace ostree strictly follows
>> underlayfs permission check but Composefs can override
>> uid/gid/permission instead.
> 
> Suppose you have:
> 
> -rw-r--r-- root root image.ext4
> -rw-r--r-- root root image.composefs
> drwxr--r-- root root objects/
> -rw-r--r-- root root objects/backing.file
> 
> Are you saying it is easier for someone to modify backing.file than
> image.ext4?
> 
> I argue it is not, but composefs takes some steps to avoid issues here.
> At mount time, when the basedir ("objects/" above) argument is parsed,
> we resolve that path and then create a private vfsmount for it:
> 
>   resolve_basedir(path) {
>          ...
> 	mnt = clone_private_mount(&path);
>          ...
>   }
> 
>   fsi->bases[i] = resolve_basedir(path);
> 
> Then we open backing files with this mount as root:
> 
>   real_file = file_open_root_mnt(fsi->bases[i], real_path,
>   			        file->f_flags, 0);
> 
> This will never resolve outside the initially specified basedir, even
> with symlinks or whatever. It will also not be affected by later mount
> changes in the original mount namespace, as this is a private mount.
> 
> This is the same mechanism that overlayfs uses for its upper dirs.

OK.  I have no problem with this part.

> 
> I would argue that anyone who has rights to modify the contents of
> files in "objects" (supposing they were created with sane permissions)
> would also have rights to modify "image.ext4".

But you don't have any permission check for files in such an
"objects/" directory in the composefs source code, do you?

As I said in my original reply, don't assume that random users or
malicious people only pass in what you expect or behave the way you
expect.  Sometimes they don't, and I think in-kernel fses should
handle such cases by design.  Obviously, any system written by humans
can have unexpected bugs, but that is another story.  I think in
general it needs to have such a design at least.

> 
>> That is also why we selected fscache at the first time to manage all
>> local cache data for EROFS, since such content-defined directory is
>> quite under control by in-kernel fscache instead of selecting a
>> random directory created and given by some userspace program.
>>
>> If you are interested in looking info the current in-kernel fscache
>> behavior, I think that is much similar as what ostree does now.
>>
>> It just needs new features like
>>     - multiple directories;
>>     - daemonless
>> to match.
>>
> 
> Obviously everything can be extended to support everything. But
> composefs is very small and simple (2128 lines of code), while at the
> same time being easy to use (just mount it with one syscall) and needs
> no complex userspace machinery and configuration. But even without the
> above feature additions fscache + cachefiles is 7982 lines, plus erofs
> is 9075 lines, and then on top of that you need userspace integration
> to even use the thing.

I've already replied to this in the LWN.net comments.  EROFS can
handle both device-based and file-based images.  It can handle FSDAX,
compression, data deduplication, rolling-hash finer-grained compressed
data deduplication, etc.  Of course, for your use cases you can just
turn them off with Kconfig; I think such code is useless for your use
cases as well.

And as a team effort over these years, EROFS has always accepted
useful features from other people.  I've also always been working on
cleaning up EROFS, but as long as it gains more features, the code
will of course grow.

Also take your project -- flatpak -- for example: I don't think the
total line count of the current version is the same as the original
version.

Also, will you always keep the Composefs source code below 2.5k LoC?

> 
> Don't take me wrong, EROFS is great for its usecases, but I don't
> really think it is the right choice for my usecase.
> 
>>>>
>>> Secondly, the use of fs-cache doesn't stack, as there can only be
>>> one
>>> cachefs agent. For example, mixing an ostree EROFS boot with a
>>> container backend using EROFS isn't possible (at least without deep
>>> integration between the two userspaces).
>>
>> The reasons above are all current fscache implementation limitation:
>>
>>    - First, if such overlay model really works, EROFS can do it
>> without
>> fscache feature as well to integrate userspace ostree.  But even that
>> I hope this new feature can be landed in overlayfs rather than some
>> other ways since it has native writable layer so we don't need
>> another
>> overlayfs mount at all for writing;
> 
> I don't think it is the right approach for overlayfs to integrate
> something like image support. Merging the two codebases would
> complicate both while adding costs to users who need only support for
> one of the features. I think reusing and stacking separate features is
> a better idea than combining them.

Why? overlayfs could have metadata support as well, if they'd like
to support advanced features like partial copy-up without fscache
support.

> 
>>
>>>
>>> Instead what we have done with composefs is to make filesystem
>>> image
>>> generation from the ostree repository 100% reproducible. Then we
>>> can
>>
>> EROFS is all 100% reproduciable as well.
>>
> 
> 
> Really, so if I today, on fedora 36 run:
> # tar xvf oci-image.tar
> # mkfs.erofs oci-dir/ oci.erofs
> 
> And then in 5 years, if someone on debian 13 runs the same, with the
> same tar file, then both oci.erofs files will have the same sha256
> checksum?

Why wouldn't it?  Reproducible builds are a MUST for Android use cases
as well.

Yes, it may break between versions by mistake, but I think
reproducible builds are basic functionality for all image
use cases.

> 
> How do you handle things like different versions or builds of
> compression libraries creating different results? Do you guarantee to
> not add any new backwards compat changes by default, or change any
> default options? Do you guarantee that the files are read from "oci-
> dir" in the same order each time? It doesn't look like it.

If you want to put it like that, why wouldn't mkcomposefs have the
same issue, namely that it may be broken by some bug?

> 
>>
>> But really, personally I think the issue above is different from
>> loopback devices and may need to be resolved first. And if possible,
>> I hope it could be an new overlayfs feature for everyone.
> 
> Yeah. Independent of composefs, I think EROFS would be better if you
> could just point it to a chunk directory at mount time rather than
> having to route everything through a system-wide global cachefs
> singleton. I understand that cachefs does help with the on-demand
> download aspect, but when you don't need that it is just in the way.

Just look at your reply to Dave's review: it seems that the way the
composefs directory on-disk format works is also quite similar to
EROFS, see:

https://docs.kernel.org/filesystems/erofs.html -- Directories

a block vs a chunk = dirent + names

cfs_dir_lookup -> erofs_namei + find_target_block_classic;
cfs_dir_lookup_in_chunk -> find_target_dirent.

Yes, great projects can occasionally be quite similar to each other,
not to mention open-source projects ;)

Anyway, I'm not opposed to Composefs if folks really like a
new read-only filesystem for this. That is almost all I'd like
to say about Composefs formally, have fun!

Thanks,
Gao Xiang

> 
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem
  2023-01-16 13:26         ` Gao Xiang
@ 2023-01-16 14:18           ` Giuseppe Scrivano
  2023-01-16 15:27           ` Alexander Larsson
  1 sibling, 0 replies; 34+ messages in thread
From: Giuseppe Scrivano @ 2023-01-16 14:18 UTC (permalink / raw)
  To: Gao Xiang; +Cc: Alexander Larsson, linux-fsdevel, linux-kernel

Gao Xiang <hsiangkao@linux.alibaba.com> writes:

> On 2023/1/16 20:33, Alexander Larsson wrote:
>> On Mon, 2023-01-16 at 18:19 +0800, Gao Xiang wrote:
>>> Hi Alexander,
>>>
>>> On 2023/1/16 17:30, Alexander Larsson wrote:
>>>>
>>>> I mean, you're not wrong about this being possible. But I don't see
>>>> that this is necessarily a new problem. For example, consider the
>>>> case
>>>> of loopback mounting an ext4 filesystem containing a setuid /bin/su
>>>> file. If you have the right permissions, nothing prohibits you from
>>>> modifying the loopback mounted file and replacing the content of
>>>> the su
>>>> file with a copy of bash.
>>>>
>>>> In both these cases, the security of the system is fully defined by
>>>> the
>>>> filesystem permissions of the backing file data. I think viewing
>>>> composefs as a "new type" of overlayfs gets the wrong idea across.
>>>> Its
>>>> more similar to a "new type" of loopback mount. In particular, the
>>>> backing file metadata is completely unrelated to the metadata
>>>> exposed
>>>> by the filesystem, which means that you can chose to protect the
>>>> backing files (and directories) in ways which protect against
>>>> changes
>>>> from non-privileged users.
>>>>
>>>> Note: The above assumes that mounting either a loopback mount or a
>>>> composefs image is a privileged operation. Allowing unprivileged
>>>> mounts
>>>> is a very different thing.
>>>
>>> Thanks for the reply.  I think if I understand correctly, I could
>>> answer some of your questions.  Hopefully help to everyone
>>> interested.
>>>
>>> Let's avoid thinking unprivileged mounts first, although Giuseppe
>>> told
>>> me earilier that is also a future step of Composefs. But I don't know
>>> how it could work reliably if a fs has some on-disk format, we could
>>> discuss it later.
>>>
>>> I think as a loopback mount, such loopback files are quite under
>>> control
>>> (take ext4 loopback mount as an example, each ext4 has the only one
>>> file
>>>    to access when setting up loopback devices and such loopback file
>>> was
>>>    also opened when setting up loopback mount so it cannot be
>>> replaced.
>>>
>>>    If you enables fsverity for such loopback mount before, it cannot
>>> be
>>>    modified as well) by admins.
>>>
>>>
>>> But IMHO, here composefs shows a new model that some stackable
>>> filesystem can point to massive files under a random directory as
>>> what
>>> ostree does (even files in such directory can be bind-mounted later
>>> in
>>> principle).  But the original userspace ostree strictly follows
>>> underlayfs permission check but Composefs can override
>>> uid/gid/permission instead.
>> Suppose you have:
>> -rw-r--r-- root root image.ext4
>> -rw-r--r-- root root image.composefs
>> drwxr--r-- root root objects/
>> -rw-r--r-- root root objects/backing.file
>> Are you saying it is easier for someone to modify backing.file than
>> image.ext4?
>> I argue it is not, but composefs takes some steps to avoid issues
>> here.
>> At mount time, when the basedir ("objects/" above) argument is parsed,
>> we resolve that path and then create a private vfsmount for it:
>>   resolve_basedir(path) {
>>          ...
>> 	mnt = clone_private_mount(&path);
>>          ...
>>   }
>>   fsi->bases[i] = resolve_basedir(path);
>> Then we open backing files with this mount as root:
>>   real_file = file_open_root_mnt(fsi->bases[i], real_path,
>>   			        file->f_flags, 0);
>> This will never resolve outside the initially specified basedir,
>> even
>> with symlinks or whatever. It will also not be affected by later mount
>> changes in the original mount namespace, as this is a private mount.
>> This is the same mechanism that overlayfs uses for its upper dirs.
>
> Ok.  I have no problem of this part.
>
>> I would argue that anyone who has rights to modify the contents of
>> files in "objects" (supposing they were created with sane permissions)
>> would also have rights to modify "image.ext4".
>
> But you don't have any permission check for files in such
> "objects/" directory in composefs source code, do you?
>
> As I said in my original reply, don't assume random users or
> malicious people just passing in or behaving like your expected
> way.  Sometimes they're not but I think in-kernel fses should
> handle such cases by design.  Obviously, any system written by
> human can cause unexpected bugs, but that is another story.
> I think in general it needs to have such design at least.

What malicious people are you worried about?

composefs is usable only in the initial user namespace for now, so only
root can use it, and root has the responsibility to use trusted files.

>> 
>>> That is also why we selected fscache at the first time to manage all
>>> local cache data for EROFS, since such content-defined directory is
>>> quite under control by in-kernel fscache instead of selecting a
>>> random directory created and given by some userspace program.
>>>
>>> If you are interested in looking info the current in-kernel fscache
>>> behavior, I think that is much similar as what ostree does now.
>>>
>>> It just needs new features like
>>>     - multiple directories;
>>>     - daemonless
>>> to match.
>>>
>> Obviously everything can be extended to support everything. But
>> composefs is very small and simple (2128 lines of code), while at the
>> same time being easy to use (just mount it with one syscall) and needs
>> no complex userspace machinery and configuration. But even without the
>> above feature additions fscache + cachefiles is 7982 lines, plus erofs
>> is 9075 lines, and then on top of that you need userspace integration
>> to even use the thing.
>
> I've replied this in the comment of LWN.net.  EROFS can handle both
> device-based or file-based images. It can handle FSDAX, compression,
> data deduplication, rolling-hash finer compressed data duplication,
> etc.  Of course, for your use cases, you can just turn them off by
> Kconfig, I think such code is useless to your use cases as well.
>
> And as a team work these years, EROFS always accept useful features
> from other people.  And I've been always working on cleaning up
> EROFS, but as long as it gains more features, the code can expand
> of course.
>
> Also take your project -- flatpak for example, I don't think the
> total line of current version is as same as the original version.
>
> Also you will always maintain Composefs source code below 2.5k Loc?
>
>> Don't take me wrong, EROFS is great for its usecases, but I don't
>> really think it is the right choice for my usecase.
>> 
>>>>>
>>>> Secondly, the use of fs-cache doesn't stack, as there can only be
>>>> one
>>>> cachefs agent. For example, mixing an ostree EROFS boot with a
>>>> container backend using EROFS isn't possible (at least without deep
>>>> integration between the two userspaces).
>>>
>>> The reasons above are all current fscache implementation limitation:
>>>
>>>    - First, if such overlay model really works, EROFS can do it
>>> without
>>> fscache feature as well to integrate userspace ostree.  But even that
>>> I hope this new feature can be landed in overlayfs rather than some
>>> other ways since it has native writable layer so we don't need
>>> another
>>> overlayfs mount at all for writing;
>> I don't think it is the right approach for overlayfs to integrate
>> something like image support. Merging the two codebases would
>> complicate both while adding costs to users who need only support for
>> one of the features. I think reusing and stacking separate features is
>> a better idea than combining them.
>
> Why? overlayfs could have metadata support as well, if they'd like
> to support advanced features like partial copy-up without fscache
> support.
>
>> 
>>>
>>>>
>>>> Instead what we have done with composefs is to make filesystem
>>>> image
>>>> generation from the ostree repository 100% reproducible. Then we
>>>> can
>>>
>>> EROFS is all 100% reproduciable as well.
>>>
>> Really, so if I today, on fedora 36 run:
>> # tar xvf oci-image.tar
>> # mkfs.erofs oci-dir/ oci.erofs
>> And then in 5 years, if someone on debian 13 runs the same, with the
>> same tar file, then both oci.erofs files will have the same sha256
>> checksum?
>
> Why it doesn't?  Reproducable builds is a MUST for Android use cases
> as well.
>
> Yes, it may break between versions by mistake, but I think
> reproducable builds is a basic functionalaity for all image
> use cases.
>
>> How do you handle things like different versions or builds of
>> compression libraries creating different results? Do you guarantee to
>> not add any new backwards compat changes by default, or change any
>> default options? Do you guarantee that the files are read from "oci-
>> dir" in the same order each time? It doesn't look like it.
>
> If you'd like to say like that, why mkcomposefs doesn't have the
> same issue that it may be broken by some bug.
>
>> 
>>>
>>> But really, personally I think the issue above is different from
>>> loopback devices and may need to be resolved first. And if possible,
>>> I hope it could be an new overlayfs feature for everyone.
>> Yeah. Independent of composefs, I think EROFS would be better if you
>> could just point it to a chunk directory at mount time rather than
>> having to route everything through a system-wide global cachefs
>> singleton. I understand that cachefs does help with the on-demand
>> download aspect, but when you don't need that it is just in the way.
>
> Just check your reply to Dave's review, it seems that how
> composefs dir on-disk format works is also much similar to
> EROFS as well, see:
>
> https://docs.kernel.org/filesystems/erofs.html -- Directories
>
> a block vs a chunk = dirent + names
>
> cfs_dir_lookup -> erofs_namei + find_target_block_classic;
> cfs_dir_lookup_in_chunk -> find_target_dirent.
>
> Yes, great projects could be much similar to each other
> occasionally, not to mention opensource projects ;)
>
> Anyway, I'm not opposed to Composefs if folks really like a
> new read-only filesystem for this. That is almost all I'd like
> to say about Composefs formally, have fun!
>
> Thanks,
> Gao Xiang
>
>> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem
  2023-01-16 13:26         ` Gao Xiang
  2023-01-16 14:18           ` Giuseppe Scrivano
@ 2023-01-16 15:27           ` Alexander Larsson
  2023-01-17  0:12             ` Gao Xiang
  1 sibling, 1 reply; 34+ messages in thread
From: Alexander Larsson @ 2023-01-16 15:27 UTC (permalink / raw)
  To: Gao Xiang, linux-fsdevel; +Cc: linux-kernel, gscrivan

On Mon, 2023-01-16 at 21:26 +0800, Gao Xiang wrote:
> 
> 
> On 2023/1/16 20:33, Alexander Larsson wrote:
> > 
> > 
> > Suppose you have:
> > 
> > -rw-r--r-- root root image.ext4
> > -rw-r--r-- root root image.composefs
> > drwxr--r-- root root objects/
> > -rw-r--r-- root root objects/backing.file
> > 
> > Are you saying it is easier for someone to modify backing.file than
> > image.ext4?
> > 
> > I argue it is not, but composefs takes some steps to avoid issues
> > here.
> > At mount time, when the basedir ("objects/" above) argument is
> > parsed,
> > we resolve that path and then create a private vfsmount for it:
> > 
> >   resolve_basedir(path) {
> >          ...
> >         mnt = clone_private_mount(&path);
> >          ...
> >   }
> > 
> >   fsi->bases[i] = resolve_basedir(path);
> > 
> > Then we open backing files with this mount as root:
> > 
> >   real_file = file_open_root_mnt(fsi->bases[i], real_path,
> >                                 file->f_flags, 0);
> > 
> > This will never resolve outside the initially specified basedir,
> > even
> > with symlinks or whatever. It will also not be affected by later
> > mount
> > changes in the original mount namespace, as this is a private
> > mount.
> > 
> > This is the same mechanism that overlayfs uses for its upper dirs.
> 
> Ok.  I have no problem of this part.
> 
> > 
> > I would argue that anyone who has rights to modify the contents of
> > files in "objects" (supposing they were created with sane
> > permissions)
> > would also have rights to modify "image.ext4".
> 
> But you don't have any permission check for files in such
> "objects/" directory in composefs source code, do you?

I don't see how permission checks would make any difference to
anyone's ability to modify the image. Do you mean the kernel should
validate the basedir so that it has sane permissions rather than
trusting the user? That seems weird to me.

Or do you mean that someone would create a composefs image that
references a file they could not otherwise read, and then use it as a
basedir in a composefs mount to read the file? Such a mount can only
happen if you are root, and it can only read files inside that
particular directory. However, maybe we should use the caller's
credentials to ensure that they are allowed to read the backing file,
just in case. That can't hurt.
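
For what it's worth, a rough sketch of such an extra check, placed in
cfs_open_file() right after the backing file has been opened (whether
it adds anything on top of the permission checking that
file_open_root_mnt() already does with the caller's credentials is an
open question):

	/* Double-check, with the caller's credentials, that the backing
	 * file itself is readable, independent of the composefs inode
	 * metadata. */
	ret = inode_permission(&init_user_ns, file_inode(real_file), MAY_READ);
	if (ret) {
		fput(real_file);
		return ret;
	}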

> As I said in my original reply, don't assume random users or
> malicious people just passing in or behaving like your expected
> way.  Sometimes they're not but I think in-kernel fses should
> handle such cases by design.  Obviously, any system written by
> human can cause unexpected bugs, but that is another story.
> I think in general it needs to have such design at least.

You need to be root to mount a fs, an operation which is generally
unsafe (because few filesystems are completely resistant to hostile
filesystem data). Therefore I think we can expect a certain amount of
sanity in its use, such as "don't pass in directories that are world
writable".

> > 
> > > That is also why we selected fscache at the first time to manage
> > > all
> > > local cache data for EROFS, since such content-defined directory
> > > is
> > > quite under control by in-kernel fscache instead of selecting a
> > > random directory created and given by some userspace program.
> > > 
> > > If you are interested in looking info the current in-kernel
> > > fscache
> > > behavior, I think that is much similar as what ostree does now.
> > > 
> > > It just needs new features like
> > >     - multiple directories;
> > >     - daemonless
> > > to match.
> > > 
> > 
> > Obviously everything can be extended to support everything. But
> > composefs is very small and simple (2128 lines of code), while at
> > the
> > same time being easy to use (just mount it with one syscall) and
> > needs
> > no complex userspace machinery and configuration. But even without
> > the
> > above feature additions fscache + cachefiles is 7982 lines, plus
> > erofs
> > is 9075 lines, and then on top of that you need userspace
> > integration
> > to even use the thing.
> 
> I've replied this in the comment of LWN.net.  EROFS can handle both
> device-based or file-based images. It can handle FSDAX, compression,
> data deduplication, rolling-hash finer compressed data duplication,
> etc.  Of course, for your use cases, you can just turn them off by
> Kconfig, I think such code is useless to your use cases as well.
>
> And as a team work these years, EROFS always accept useful features
> from other people.  And I've been always working on cleaning up
> EROFS, but as long as it gains more features, the code can expand
> of course.
> 
> Also take your project -- flatpak for example, I don't think the
> total line of current version is as same as the original version.
> 
> Also you will always maintain Composefs source code below 2.5k Loc?
> 
> > 
> > Don't take me wrong, EROFS is great for its usecases, but I don't
> > really think it is the right choice for my usecase.
> > 
> > > > > 
> > > > Secondly, the use of fs-cache doesn't stack, as there can only
> > > > be
> > > > one
> > > > cachefs agent. For example, mixing an ostree EROFS boot with a
> > > > container backend using EROFS isn't possible (at least without
> > > > deep
> > > > integration between the two userspaces).
> > > 
> > > The reasons above are all current fscache implementation
> > > limitation:
> > > 
> > >    - First, if such overlay model really works, EROFS can do it
> > > without
> > > fscache feature as well to integrate userspace ostree.  But even
> > > that
> > > I hope this new feature can be landed in overlayfs rather than
> > > some
> > > other ways since it has native writable layer so we don't need
> > > another
> > > overlayfs mount at all for writing;
> > 
> > I don't think it is the right approach for overlayfs to integrate
> > something like image support. Merging the two codebases would
> > complicate both while adding costs to users who need only support
> > for
> > one of the features. I think reusing and stacking separate features
> > is
> > a better idea than combining them.
> 
> Why? overlayfs could have metadata support as well, if they'd like
> to support advanced features like partial copy-up without fscache
> support.
> 
> > 
> > > 
> > > > 
> > > > Instead what we have done with composefs is to make filesystem
> > > > image
> > > > generation from the ostree repository 100% reproducible. Then
> > > > we
> > > > can
> > > 
> > > EROFS is all 100% reproduciable as well.
> > > 
> > 
> > 
> > Really, so if I today, on fedora 36 run:
> > # tar xvf oci-image.tar
> > # mkfs.erofs oci-dir/ oci.erofs
> > 
> > And then in 5 years, if someone on debian 13 runs the same, with
> > the
> > same tar file, then both oci.erofs files will have the same sha256
> > checksum?
> 
> Why it doesn't?  Reproducable builds is a MUST for Android use cases
> as well.

Those are not quite the same requirements. A reproducible build in the
traditional sense is limited to a particular build configuration. You
define a set of tools for the build, use the same ones for each
build, and get a fixed output. You don't expect to be able to change
e.g. the compiler and get the same result. Similarly, it is often the
case that different builds or versions of compression libraries give
different results, so you can't expect to use e.g. a different libz and
get identical images.

> Yes, it may break between versions by mistake, but I think
> reproducable builds is a basic functionalaity for all image
> use cases.
> 
> > 
> > How do you handle things like different versions or builds of
> > compression libraries creating different results? Do you guarantee
> > to
> > not add any new backwards compat changes by default, or change any
> > default options? Do you guarantee that the files are read from
> > "oci-
> > dir" in the same order each time? It doesn't look like it.
> 
> If you'd like to say like that, why mkcomposefs doesn't have the
> same issue that it may be broken by some bug.
> 

libcomposefs defines a normalized form for everything (file order,
xattr order, etc.) and carefully normalizes everything such that we can
guarantee these properties. It is possible that some detail was missed,
because we're humans. But it was a very conscious and deliberate design
choice that is deeply encoded in the code and format. For example, this
is why we don't use compression but try to minimize size in other ways.
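
As a rough userspace illustration of the kind of normalization this
means (not the actual libcomposefs code), every name list is sorted
with a plain bytewise comparison before it is serialized, so the
output is independent of readdir order and of the build host's locale:

#include <stdlib.h>
#include <string.h>

struct entry {
	const char *name;
};

/* strcmp compares raw bytes: no locale, no collation, so the
 * resulting order is identical on every build host. */
static int cmp_entry_name(const void *a, const void *b)
{
	const struct entry *ea = a, *eb = b;

	return strcmp(ea->name, eb->name);
}

static void normalize_entry_order(struct entry *entries, size_t n)
{
	qsort(entries, n, sizeof(*entries), cmp_entry_name);
}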

> > > 
> > > But really, personally I think the issue above is different from
> > > loopback devices and may need to be resolved first. And if
> > > possible,
> > > I hope it could be an new overlayfs feature for everyone.
> > 
> > Yeah. Independent of composefs, I think EROFS would be better if
> > you
> > could just point it to a chunk directory at mount time rather than
> > having to route everything through a system-wide global cachefs
> > singleton. I understand that cachefs does help with the on-demand
> > download aspect, but when you don't need that it is just in the
> > way.
> 
> Just check your reply to Dave's review, it seems that how
> composefs dir on-disk format works is also much similar to
> EROFS as well, see:
> 
> https://docs.kernel.org/filesystems/erofs.html -- Directories
> 
> a block vs a chunk = dirent + names
> 
> cfs_dir_lookup -> erofs_namei + find_target_block_classic;
> cfs_dir_lookup_in_chunk -> find_target_dirent.

Yeah, the dirent layout looks very similar. I guess great minds think
alike! My approach was simpler initially, but it kinda converged on
this when I started optimizing the kernel lookup code with binary
search.
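
For reference, the per-chunk lookup then boils down to an ordinary
binary search over name-sorted dirents, roughly like the sketch below
(cfs_dirent and cfs_dirent_cmp_name are stand-ins for the real on-disk
accessors, not the actual composefs code):

/* Find 'name' in a chunk whose dirents are sorted by name; returns the
 * dirent index or -1 if the name is not present in this chunk. */
static int lookup_in_chunk(const struct cfs_dirent *dirents, int n,
			   const char *name, size_t namelen)
{
	int low = 0, high = n - 1;

	while (low <= high) {
		int mid = low + (high - low) / 2;
		int cmp = cfs_dirent_cmp_name(&dirents[mid], name, namelen);

		if (cmp == 0)
			return mid;
		if (cmp < 0)
			low = mid + 1;
		else
			high = mid - 1;
	}
	return -1;
}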

> Yes, great projects could be much similar to each other
> occasionally, not to mention opensource projects ;)
> 
> Anyway, I'm not opposed to Composefs if folks really like a
> new read-only filesystem for this. That is almost all I'd like
> to say about Composefs formally, have fun!
> 
> Thanks,
> Gao Xiang

Cool, thanks for the feedback.


-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                          Red Hat, Inc
       alexl@redhat.com            alexander.larsson@gmail.com
He's a maverick guitar-strumming senator with a passion for fast cars.
She's an orphaned winged angel with her own daytime radio talk show.
They fight crime!


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 4/6] composefs: Add filesystem implementation
  2023-01-13 15:33 ` [PATCH v2 4/6] composefs: Add filesystem implementation Alexander Larsson
  2023-01-13 21:55   ` kernel test robot
@ 2023-01-16 22:07   ` Al Viro
  2023-01-17 13:29     ` Alexander Larsson
  1 sibling, 1 reply; 34+ messages in thread
From: Al Viro @ 2023-01-16 22:07 UTC (permalink / raw)
  To: Alexander Larsson; +Cc: linux-fsdevel, linux-kernel, gscrivan

	Several random observations:

> +static struct inode *cfs_make_inode(struct cfs_context_s *ctx,
> +				    struct super_block *sb, ino_t ino_num,
> +				    struct cfs_inode_s *ino, const struct inode *dir)
> +{
> +	struct cfs_inode_data_s inode_data = { 0 };
> +	struct cfs_xattr_header_s *xattrs = NULL;
> +	struct inode *inode = NULL;
> +	struct cfs_inode *cino;
> +	int ret, res;

I would suggest
	if (IS_ERR(ino))
		return ERR_CAST(ino);
here.  The callers get simpler that way, AFAICS.

> +	res = cfs_init_inode_data(ctx, ino, ino_num, &inode_data);
> +	if (res < 0)
> +		return ERR_PTR(res);
> +
> +	inode = new_inode(sb);
> +	if (inode) {
> +		inode_init_owner(&init_user_ns, inode, dir, ino->st_mode);
> +		inode->i_mapping->a_ops = &cfs_aops;
> +
> +		cino = CFS_I(inode);
> +		cino->inode_data = inode_data;
> +
> +		inode->i_ino = ino_num;
> +		set_nlink(inode, ino->st_nlink);
> +		inode->i_rdev = ino->st_rdev;
> +		inode->i_uid = make_kuid(current_user_ns(), ino->st_uid);
> +		inode->i_gid = make_kgid(current_user_ns(), ino->st_gid);
> +		inode->i_mode = ino->st_mode;
> +		inode->i_atime = ino->st_mtim;
> +		inode->i_mtime = ino->st_mtim;
> +		inode->i_ctime = ino->st_ctim;
> +
> +		switch (ino->st_mode & S_IFMT) {
> +		case S_IFREG:
> +			inode->i_op = &cfs_file_inode_operations;
> +			inode->i_fop = &cfs_file_operations;
> +			inode->i_size = ino->st_size;
> +			break;
> +		case S_IFLNK:
> +			inode->i_link = cino->inode_data.path_payload;
> +			inode->i_op = &cfs_link_inode_operations;
> +			inode->i_fop = &cfs_file_operations;
> +			break;
> +		case S_IFDIR:
> +			inode->i_op = &cfs_dir_inode_operations;
> +			inode->i_fop = &cfs_dir_operations;
> +			inode->i_size = 4096;
> +			break;
> +		case S_IFCHR:
> +		case S_IFBLK:
> +			if (current_user_ns() != &init_user_ns) {
> +				ret = -EPERM;
> +				goto fail;
> +			}
> +			fallthrough;
> +		default:
> +			inode->i_op = &cfs_file_inode_operations;
> +			init_special_inode(inode, ino->st_mode, ino->st_rdev);
> +			break;
> +		}
> +	}
> +	return inode;
> +
> +fail:
> +	if (inode)
> +		iput(inode);

Huh?  Just how do we get here with NULL inode?  While we are at it,
NULL on -ENOMEM is fine when it's the only error that can happen;
here, OTOH...

> +	kfree(xattrs);
> +	cfs_inode_data_put(&inode_data);
> +	return ERR_PTR(ret);
> +}
> +
> +static struct inode *cfs_get_root_inode(struct super_block *sb)
> +{
> +	struct cfs_info *fsi = sb->s_fs_info;
> +	struct cfs_inode_s ino_buf;
> +	struct cfs_inode_s *ino;
> +	u64 index;
> +
> +	ino = cfs_get_root_ino(&fsi->cfs_ctx, &ino_buf, &index);
> +	if (IS_ERR(ino))
> +		return ERR_CAST(ino);

See what I mean re callers?

> +	return cfs_make_inode(&fsi->cfs_ctx, sb, index, ino, NULL);
> +}

> +static struct dentry *cfs_lookup(struct inode *dir, struct dentry *dentry,
> +				 unsigned int flags)
> +{
> +	struct cfs_info *fsi = dir->i_sb->s_fs_info;
> +	struct cfs_inode *cino = CFS_I(dir);
> +	struct cfs_inode_s ino_buf;
> +	struct cfs_inode_s *ino_s;
> +	struct inode *inode;
> +	u64 index;
> +	int ret;
> +
> +	if (dentry->d_name.len > NAME_MAX)
> +		return ERR_PTR(-ENAMETOOLONG);
> +
> +	ret = cfs_dir_lookup(&fsi->cfs_ctx, dir->i_ino, &cino->inode_data,
> +			     dentry->d_name.name, dentry->d_name.len, &index);
> +	if (ret < 0)
> +		return ERR_PTR(ret);
> +	if (ret == 0)
> +		goto return_negative;
> +
> +	ino_s = cfs_get_ino_index(&fsi->cfs_ctx, index, &ino_buf);
> +	if (IS_ERR(ino_s))
> +		return ERR_CAST(ino_s);
> +
> +	inode = cfs_make_inode(&fsi->cfs_ctx, dir->i_sb, index, ino_s, dir);
> +	if (IS_ERR(inode))
> +		return ERR_CAST(inode);
> +
> +	return d_splice_alias(inode, dentry);
> +
> +return_negative:
> +	d_add(dentry, NULL);
> +	return NULL;
> +}

Ugh...  One problem here is that out of memory in new_inode() translates into
successful negative lookup.  Another...

	struct inode *inode = NULL;

	if (dentry->d_name.len > NAME_MAX)
		return ERR_PTR(-ENAMETOOLONG);

	ret = cfs_dir_lookup(&fsi->cfs_ctx, dir->i_ino, &cino->inode_data,
			     dentry->d_name.name, dentry->d_name.len, &index);
	if (ret) {
		if (ret < 0)
			return ERR_PTR(ret);
		ino_s = cfs_get_ino_index(&fsi->cfs_ctx, index, &ino_buf);
		inode = cfs_make_inode(&fsi->cfs_ctx, dir->i_sb, index, ino_s, dir);
	}
	return d_splice_alias(inode, dentry);

is all you really need.  d_splice_alias() will do the right thing if given
ERR_PTR()...

> +{
> +	struct cfs_inode *cino = alloc_inode_sb(sb, cfs_inode_cachep, GFP_KERNEL);
> +
> +	if (!cino)
> +		return NULL;
> +
> +	memset((u8 *)cino + sizeof(struct inode), 0,
> +	       sizeof(struct cfs_inode) - sizeof(struct inode));

Huh?  What's wrong with memset(&cino->inode_data, 0, sizeof(cino->inode_data))?
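
(With that, the whole helper -- presumably the ->alloc_inode() callback
-- could read roughly as below; 'cfs_alloc_inode' and the 'vfs_inode'
field name are assumptions about naming.)

static struct inode *cfs_alloc_inode(struct super_block *sb)
{
	struct cfs_inode *cino = alloc_inode_sb(sb, cfs_inode_cachep, GFP_KERNEL);

	if (!cino)
		return NULL;

	/* Only the filesystem-private part needs zeroing; the embedded
	 * struct inode is set up separately by the VFS. */
	memset(&cino->inode_data, 0, sizeof(cino->inode_data));
	return &cino->vfs_inode;
}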

> +static void cfs_destroy_inode(struct inode *inode)
> +{
> +	struct cfs_inode *cino = CFS_I(inode);
> +
> +	cfs_inode_data_put(&cino->inode_data);
> +}

Umm...  Any reason that can't be done from your ->free_inode()?  Looks like
nothing in there needs to be synchronous...  For that matter, what's wrong
with simply kfree(cino->inode_data.path_payload) from cfs_free_inode(),
just before it frees cino itself?
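
I.e. something along these lines, assuming cfs_free_inode() is the
existing ->free_inode() callback that returns the object to
cfs_inode_cachep:

static void cfs_free_inode(struct inode *inode)
{
	struct cfs_inode *cino = CFS_I(inode);

	/* Nothing here needs to happen synchronously in ->destroy_inode(),
	 * so the path payload can be freed together with the inode object
	 * after the RCU grace period. */
	kfree(cino->inode_data.path_payload);
	kmem_cache_free(cfs_inode_cachep, cino);
}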

> +static void cfs_put_super(struct super_block *sb)
> +{
> +	struct cfs_info *fsi = sb->s_fs_info;
> +
> +	cfs_ctx_put(&fsi->cfs_ctx);
> +	if (fsi->bases) {
> +		kern_unmount_array(fsi->bases, fsi->n_bases);
> +		kfree(fsi->bases);
> +	}
> +	kfree(fsi->base_path);
> +
> +	kfree(fsi);
> +}

> +static struct vfsmount *resolve_basedir(const char *name)
> +{
> +	struct path path = {};
> +	struct vfsmount *mnt;
> +	int err = -EINVAL;
> +
> +	if (!*name) {
> +		pr_err("empty basedir\n");

		return ERR_PTR(-EINVAL);

> +		goto out;
> +	}
> +	err = kern_path(name, LOOKUP_FOLLOW, &path);

Are you sure you don't want LOOKUP_DIRECTORY added here?

> +	if (err) {
> +		pr_err("failed to resolve '%s': %i\n", name, err);

		return ERR_PTR(err);

> +		goto out;
> +	}
> +
> +	mnt = clone_private_mount(&path);
> +	err = PTR_ERR(mnt);
> +	if (IS_ERR(mnt)) {
> +		pr_err("failed to clone basedir\n");
> +		goto out_put;
> +	}
> +
> +	path_put(&path);

	mnt = clone_private_mount(&path);
	path_put(&path);
	/* Don't inherit atime flags */
	if (!IS_ERR(mnt))
		mnt->mnt_flags &= ~(MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME);
	return mnt;

I'm not saying that gotos are to be religiously avoided, but here they
make it harder to follow...

> +
> +	/* Don't inherit atime flags */
> +	mnt->mnt_flags &= ~(MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME);
> +
> +	return mnt;
> +
> +out_put:
> +	path_put(&path);
> +out:
> +	return ERR_PTR(err);
> +}

> +
> +static int cfs_fill_super(struct super_block *sb, struct fs_context *fc)
> +{
> +	struct cfs_info *fsi = sb->s_fs_info;
> +	struct vfsmount **bases = NULL;
> +	size_t numbasedirs = 0;
> +	struct inode *inode;
> +	struct vfsmount *mnt;
> +	int ret;
> +
> +	if (sb->s_root)
> +		return -EINVAL;

Wha...?  How could it ever get called with non-NULL ->s_root?


> +static struct file *open_base_file(struct cfs_info *fsi, struct inode *inode,
> +				   struct file *file)
> +{
> +	struct cfs_inode *cino = CFS_I(inode);
> +	struct file *real_file;
> +	char *real_path = cino->inode_data.path_payload;
> +
> +	for (size_t i = 0; i < fsi->n_bases; i++) {
> +		real_file = file_open_root_mnt(fsi->bases[i], real_path,
> +					       file->f_flags, 0);
> +		if (!IS_ERR(real_file) || PTR_ERR(real_file) != -ENOENT)
> +			return real_file;

That's a strange way to spell if (real_file != ERR_PTR(-ENOENT))...
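
I.e. the loop body would simply become:

		real_file = file_open_root_mnt(fsi->bases[i], real_path,
					       file->f_flags, 0);
		if (real_file != ERR_PTR(-ENOENT))
			return real_file;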

> +static int cfs_open_file(struct inode *inode, struct file *file)
> +{
> +	struct cfs_info *fsi = inode->i_sb->s_fs_info;
> +	struct cfs_inode *cino = CFS_I(inode);
> +	char *real_path = cino->inode_data.path_payload;
> +	struct file *faked_file;
> +	struct file *real_file;
> +
> +	if (WARN_ON(!file))
> +		return -EIO;

Huh?

> +	if (file->f_flags & (O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_TRUNC))
> +		return -EROFS;
> +
> +	if (!real_path) {
> +		file->private_data = &empty_file;
> +		return 0;
> +	}
> +
> +	if (fsi->verity_check >= 2 && !cino->inode_data.has_digest) {
> +		pr_warn("WARNING: composefs image file '%pd' specified no fs-verity digest\n",
> +			file->f_path.dentry);

%pD with file, please, both here and later.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 2/6] composefs: Add on-disk layout
  2023-01-16 11:00     ` Alexander Larsson
@ 2023-01-16 23:06       ` Dave Chinner
  2023-01-17 12:11         ` Alexander Larsson
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2023-01-16 23:06 UTC (permalink / raw)
  To: Alexander Larsson; +Cc: linux-fsdevel, linux-kernel, gscrivan

On Mon, Jan 16, 2023 at 12:00:03PM +0100, Alexander Larsson wrote:
> On Mon, 2023-01-16 at 12:29 +1100, Dave Chinner wrote:
> > On Fri, Jan 13, 2023 at 04:33:55PM +0100, Alexander Larsson wrote:
> > > This commit adds the on-disk layout header file of composefs.
> > 
> > This isn't really a useful commit message.
> > 
> > Perhaps it should actually explain what the overall goals of the
> > on-disk format are - space usage, complexity trade-offs, potential
> > issues with validation of variable payload sections, etc.
> > 
> 
> I agree, will flesh it out. But, as for below discussions, one of the
> overall goals is to keep the on-disk file size low.
> 
> > > Signed-off-by: Alexander Larsson <alexl@redhat.com>
> > > Co-developed-by: Giuseppe Scrivano <gscrivan@redhat.com>
> > > Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
> > > ---
> > >  fs/composefs/cfs.h | 203
> > > +++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 203 insertions(+)
> > >  create mode 100644 fs/composefs/cfs.h
> > > 
> > > diff --git a/fs/composefs/cfs.h b/fs/composefs/cfs.h
> > > new file mode 100644
> > > index 000000000000..658df728e366
> > > --- /dev/null
> > > +++ b/fs/composefs/cfs.h
> > > @@ -0,0 +1,203 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +/*
> > > + * composefs
> > > + *
> > > + * Copyright (C) 2021 Giuseppe Scrivano
> > > + * Copyright (C) 2022 Alexander Larsson
> > > + *
> > > + * This file is released under the GPL.
> > > + */
> > > +
> > > +#ifndef _CFS_H
> > > +#define _CFS_H
> > > +
> > > +#include <asm/byteorder.h>
> > > +#include <crypto/sha2.h>
> > > +#include <linux/fs.h>
> > > +#include <linux/stat.h>
> > > +#include <linux/types.h>
> > > +
> > > +#define CFS_VERSION 1
> > 
> > This should start with a description of the on-disk format for the
> > version 1 format.
> 
> There are some format descriptions in the later document patch. What is
> the general approach here, do we document in the header, or in separate
> doc file? For example, I don't see much of format descriptions in the
> xfs headers. I mean, I should probably add *some* info here for easier
> reading of the stuff below, but I don't feel like headers are a great
> place for docs.

It's fine to describe the format in the docs, but when reading the
code there needs to be at least an overview of the structure the code
is implementing, so that the code makes some sense without having to
go find the external place where the format is documented.

> > > +
> > > +#define CFS_MAGIC 0xc078629aU
> > > +
> > > +#define CFS_MAX_DIR_CHUNK_SIZE 4096
> > > +#define CFS_MAX_XATTRS_SIZE 4096
> > 
> > How do we store 64kB xattrs in this format if the max attr size is
> > 4096 bytes? Or is that the maximum total xattr storage?
> 
> This is a current limitation of the composefs file format.

Yes, but is that 4kB limit the maximum size of a single xattr, or is
it the total xattr storage space for an inode?

> I am aware
> that the kernel maximum size is 64k,

For a single xattr, yes. Hence my question....

> > > +static inline int cfs_digest_from_payload(const char *payload,
> > > size_t payload_len,
> > > +                                         u8
> > > digest_out[SHA256_DIGEST_SIZE])
.....
> > Too big to be a inline function.
> 
> Yeah, I'm aware of this. I mainly put it in the header as the
> implementation of it is sort of part of the on-disk format. But, I can
> move it to a .c file instead.

Please do - it's really part of the reader implementation, not the
structure definition.

> > > +struct cfs_vdata_s {
> > 
> > Drop the "_s" suffix to indicate the type is a structure - that's
> > waht "struct" tells us.
> 
> Sure.
> 
> > > +       u64 off;
> > > +       u32 len;
> > 
> > If these are on-disk format structures, why aren't the defined as
> > using the specific endian they are encoded in? i.e. __le64, __le32,
> > etc? Otherwise a file built on a big endian machine won't be
> > readable on a little endian machine (and vice versa).
> 
> On disk all fields are little endian. However, when we read them from
> disk we convert them using e.g. le32_to_cpu(), and then we use the same
> structure in memory, with native endian. So, it seems wrong to mark
> them as little endian.

Then these structures do not define "on-disk format". Looking a bit
further through the patchset, these are largely intermediate
structures that are read once to instantiate objects in memory, then
never used again. The cfs_inode_s is a good example of this - I'll
come back to that.

> 
> > 
> > > +} __packed;
> > > +
> > > +struct cfs_header_s {
> > > +       u8 version;
> > > +       u8 unused1;
> > > +       u16 unused2;
> > 
> > Why are you hyper-optimising these structures for minimal space
> > usage? This is 2023 - we can use a __le32 for the version number,
> > the magic number and then leave....
> >
> > > +
> > > +       u32 magic;
> > > +       u64 data_offset;
> > > +       u64 root_inode;
> > > +
> > > +       u64 unused3[2];
> > 
> > a whole heap of space to round it up to at least a CPU cacheline
> > size using something like "__le64 unused[15]".
> > 
> > That way we don't need packed structures nor do we care about having
> > weird little holes in the structures to fill....
> 
> Sure.

FWIW, now that I see how this is used, this header kinda defines what
we'd call the superblock in the on-disk format of a filesystem. It's
at a fixed location in the image file, so there should be a #define
somewhere in this file to document its fixed location.

Also, if this is the in-memory representation of the structure and
not the actual on-disk format, why does it need padding or packing,
or even store the magic number at all?

i.e. this information could simply be stored in a few fields in the cfs
superblock structure that wraps the vfs superblock, and the
superblock read function could decode straight into those fields...


> > > +} __packed;
> > > +
> > > +enum cfs_inode_flags {
> > > +       CFS_INODE_FLAGS_NONE = 0,
> > > +       CFS_INODE_FLAGS_PAYLOAD = 1 << 0,
> > > +       CFS_INODE_FLAGS_MODE = 1 << 1,
> > > +       CFS_INODE_FLAGS_NLINK = 1 << 2,
> > > +       CFS_INODE_FLAGS_UIDGID = 1 << 3,
> > > +       CFS_INODE_FLAGS_RDEV = 1 << 4,
> > > +       CFS_INODE_FLAGS_TIMES = 1 << 5,
> > > +       CFS_INODE_FLAGS_TIMES_NSEC = 1 << 6,
> > > +       CFS_INODE_FLAGS_LOW_SIZE = 1 << 7, /* Low 32bit of st_size
> > > */
> > > +       CFS_INODE_FLAGS_HIGH_SIZE = 1 << 8, /* High 32bit of
> > > st_size */
> > 
> > Why do we need to complicate things by splitting the inode size
> > like this?
> > 
> 
> The goal is to minimize the image size for a typical rootfs or
> container image. Almost zero files in any such images are > 4GB. 

Sure, but how much space does this typically save, versus how much
complexity it adds to runtime decoding of inodes?

I mean, in a dense container system the critical resources that need
to be saved are runtime memory and the CPU overhead of operations, not
storage space. Saving 30-40 bytes of storage space per inode
means a typical image might be a few MB smaller, but given that the
image file is not storing data we're only talking about images that
use maybe 500 bytes of data per inode. Storage space for images
is not a limiting factor, nor is network transmission (because of
compression), so it comes back to runtime CPU and memory usage.

The inodes are decoded out of the page cache, so the memory for the
raw inode information is volatile and reclaimed when needed.
Similarly, the VFS inode built from this information is reclaimable
when not in use, too. So the only real overhead for runtime is the
decoding time to find the inode in the image file and then decode
it.

Given that the decoding of the inode is all branches and not
straight-line code, it cannot be well optimised and the CPU branch
predictor is not going to get it right every time. Straight-line
code that decodes every field, whether it is zero or not, is going
to be faster.

Further, with a fixed size inode in the image file, the inode table
can be entirely fixed size, getting rid of the whole unaligned data
retrieval problem that code currently has (yes, all that
"le32_to_cpu(__get_unaligned(__le32, data))" code) because we can
ensure that all the inode fields are aligned in the data pages. This
will significantly speed up decoding into the in-memory inode
structures.
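
For example, a fixed-size, naturally aligned on-disk inode record
along these lines (purely a sketch of the idea, not a proposed layout)
lets the reader index the inode table directly and decode it with
straight-line code:

struct cfs_disk_inode {
	__le64	st_size;
	__le64	st_mtim_sec;
	__le64	xattrs_off;		/* into the variable data section */
	__le64	backing_path_off;	/* ditto */
	__le32	st_mode;
	__le32	st_nlink;
	__le32	st_uid;
	__le32	st_gid;
	__le32	st_rdev;
	__le32	st_mtim_nsec;
	__le32	xattrs_len;
	__le32	backing_path_len;
};	/* 64 bytes: inode N starts at inode_table_off + N * sizeof(...) */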

And to take it another step, the entire struct cfs_inode_s structure
could go away - it is entirely a temporary structure used to shuffle
data from the on-disk encoded format to the the initialisation of
the VFS inode. The on-disk inode data could be decoded directly into
the VFS inode after it has been instantiated, rather than decoding
the inode from the backing file and the instantiating the in-memory
inode.

i.e. instead of:

cfs_lookup()
	cfs_dir_lookup(&index)
	cfs_get_ino_index(index, &inode_s)
		cfs_get_inode_data_max(index, &data)
		inode_s->st_.... = cfs_read_....(&data);
		inode_s->st_.... = cfs_read_....(&data);
		inode_s->st_.... = cfs_read_....(&data);
		inode_s->st_.... = cfs_read_....(&data);
	cfs_make_inode(inode_s, &vfs_inode)
		inode = new_inode(sb)
		inode->i_... = inode_s->st_....;
		inode->i_... = inode_s->st_....;
		inode->i_... = inode_s->st_....;
		inode->i_... = inode_s->st_....;

You could collapse this straight down to:

cfs_lookup()
	cfs_dir_lookup(&index)
	cfs_make_inode(index, &vfs_inode)
		inode = new_inode(sb)
		cfs_get_inode_data_max(index, &data)
		inode->i_... = cfs_read_....(&data);
		inode->i_... = cfs_read_....(&data);
		inode->i_... = cfs_read_....(&data);
		inode->i_... = cfs_read_....(&data);

This removes an intermediate layer from the inode instantiation
fast path completely. And if the inode table is fixed size and
always well aligned, then the cfs_make_inode() code that sets up the
VFS inode is almost unchanged from what it is now. There are no new
branches, the file image format is greatly simplified, and the
runtime overhead of instantiating inodes is significantly reduced.
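
As a concrete sketch of that collapsed path (using the kind of
fixed-size record sketched above; cfs_get_disk_inode() is a
hypothetical accessor returning an aligned pointer to the on-disk
record in the page cache):

static struct inode *cfs_make_inode(struct super_block *sb, u64 index)
{
	struct cfs_info *fsi = sb->s_fs_info;
	struct cfs_disk_inode *di;
	struct inode *inode;

	inode = new_inode(sb);
	if (!inode)
		return ERR_PTR(-ENOMEM);

	di = cfs_get_disk_inode(&fsi->cfs_ctx, index);
	if (IS_ERR(di)) {
		iput(inode);
		return ERR_CAST(di);
	}

	/* Straight-line decode, no per-field flag branches. */
	inode->i_ino = index;
	inode->i_mode = le32_to_cpu(di->st_mode);
	set_nlink(inode, le32_to_cpu(di->st_nlink));
	inode->i_uid = make_kuid(current_user_ns(), le32_to_cpu(di->st_uid));
	inode->i_gid = make_kgid(current_user_ns(), le32_to_cpu(di->st_gid));
	inode->i_size = le64_to_cpu(di->st_size);
	/* ... remaining fields, plus i_op/i_fop setup by file type ... */

	return inode;
}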

Similar things can be done with the rest of the "descriptor"
abstraction - the intermediate in-memory structures can be placed
directly in the cfs_inode structure that wraps the VFS inode, and
the initialisation of them can call the decoding code directly
instead of using intermediate structures as is currently done.

This will remove a chunk of code from the implementation and make it
run faster....

> Also, we don't just "not decode" the items with the flag not set, they
> are not even stored on disk.

Yup, and I think that is a mistake - premature optimisation and all
that...

> 
> > > +       CFS_INODE_FLAGS_XATTRS = 1 << 9,
> > > +       CFS_INODE_FLAGS_DIGEST = 1 << 10, /* fs-verity sha256
> > > digest */
> > > +       CFS_INODE_FLAGS_DIGEST_FROM_PAYLOAD = 1 << 11, /* Compute
> > > digest from payload */
> > > +};
> > > +
> > > +#define CFS_INODE_FLAG_CHECK(_flag,
> > > _name)                                     \
> > > +       (((_flag) & (CFS_INODE_FLAGS_##_name)) != 0)
> > 
> > Check what about a flag? If this is a "check that a feature is set",
> > then open coding it better, but if you must do it like this, then
> > please use static inline functions like:
> > 
> >         if (cfs_inode_has_xattrs(inode->flags)) {
> >                 .....
> >         }
> > 
> 
> The check is if the flag is set, so maybe CFS_INODE_FLAG_IS_SET is a
> better name. This is used only when decoding the on-disk version of the
> inode to the in memory one, which is a bunch of:
> 
> 	if (CFS_INODE_FLAG_CHECK(ino->flags, THE_FIELD))
> 		ino->the_field = cfs_read_u32(&data);
> 	else
> 		ino->the_field = THE_FIELD_DEFUALT;
> 
> I can easily open-code these checks, although I'm not sure it makes a
> great difference either way.

If they are used only once, then they should be open coded. But I
think the whole "optional inode fields" stuff should just go away
entirely at this point...

> > > +#define CFS_INODE_DEFAULT_MODE 0100644
> > > +#define CFS_INODE_DEFAULT_NLINK 1
> > > +#define CFS_INODE_DEFAULT_NLINK_DIR 2
> > > +#define CFS_INODE_DEFAULT_UIDGID 0
> > > +#define CFS_INODE_DEFAULT_RDEV 0
> > > +#define CFS_INODE_DEFAULT_TIMES 0
> > 
> > Where do these get used? Are they on disk defaults or something
> > else? (comment, please!)
> 
> They are the defaults that are used when inode fields on disk are
> missing. I'll add some comments.

They go away entirely with fixed size on-disk inodes.

> > > +       u32 st_mode; /* File type and mode.  */
> > > +       u32 st_nlink; /* Number of hard links, only for regular
> > > files.  */
> > > +       u32 st_uid; /* User ID of owner.  */
> > > +       u32 st_gid; /* Group ID of owner.  */
> > > +       u32 st_rdev; /* Device ID (if special file).  */
> > > +       u64 st_size; /* Size of file, only used for regular files
> > > */
> > > +
> > > +       struct cfs_vdata_s xattrs; /* ref to variable data */
> > 
> > This is in the payload that follows the inode?  Is it included in
> > the payload_length above?
> > 
> > If not, where is this stuff located, how do we validate it points to
> > the correct place in the on-disk format file, the xattrs belong to
> > this specific inode, etc? I think that's kinda important to
> > describe, because xattrs often contain important security
> > information...
> 
> No, all inodes are packed into the initial part of the file, each
> containing a flags set, a variable size (from flags) chunk of fixed
> size elements and an variable size payload. The payload is either the
> target symlink for symlinks, or the path of the backing file for
> regular files.

Ok, I think you need to stop calling that a "payload", then. It's
the path name to the backing file. The backing file is only relevant
for S_IFREG and S_IFLNK types - directories don't need path names
as they only contain pointers to other inodes in the image file.
Types like S_IFIFO, S_IFBLK, etc should not have backing files,
either - they should just be instantiated as the correct type in the
VFS inode and not require any backing file interactions at all...

Hence I think this "payload" should be called something like
"backing path" or something similar.

> Other data, such as xattrs and dirents are stored in a
> separate part of the file and the offsets for those in the inode refer
> to offsets into that area.

So "variable data" and "payload" are different sections in the file
format? You haven't defined what these names mean anywhere in this
file (see my original comment about describing the format in the
code), so it's hard to understand the difference without any actual
reference....

> > Why would you have this in the on-disk structure, then also have
> > "digest from payload" that allows the digest to be in the payload
> > section of the inode data?
> 
> The payload is normally the path to the backing file, and then you need
> to store the verity digest separately. This is what would be needed
> when using this with ostree for instance, because we have an existing
> backing file repo format we can't change.

*nod*

> However, if your backing
> store files are stored by their fs-verity digest already (which is the
> default for mkcomposefs), then we can set this flag and avoid storing
> the digest unnecessary.

So if you name your files according to the fsverity digest using
FS_VERITY_HASH_ALG_SHA256, you have to either completely rebuild
the repository if you want to change to FS_VERITY_HASH_ALG_SHA512
or you have to move to storing the fsverity digest in the image
files anyway?

Seems much better to me to require the repo to use an independent
content index rather than try to use the same hash to index the
contents and detect tampering at the same time. This, once again,
seems like an attempt to minimise file image size at the expense of
everything else and I'm very much not convinced this is the right
tradeoff to be making for modern computing technologies.

.....

> > > +struct cfs_dir_s {
> > > +       u32 n_chunks;
> > > +       struct cfs_dir_chunk_s chunks[];
> > > +} __packed;
> > 
> > So directory data is packed in discrete chunks? Given that this is a
> > static directory format, and the size of the directory is known at
> > image creation time, why does the storage need to be chunked?
> 
> We chunk the data such that each chunk fits inside a single page in the
> image file. I did this to make accessing image data directly from the
> page cache easier.

Hmmmm. So you defined a -block size- that matched the x86-64 -page
size- to avoid page cache issues.  Now, what about ARM or POWER,
which can have 64kB page sizes?

IOWs, "page size" is not the same on all machines, whilst the
on-disk format for a filesystem image needs to be the same on all
machines. Hence it appears that this:

> > > +#define CFS_MAX_DIR_CHUNK_SIZE 4096

should actually be defined in terms of the block size for the
filesystem image, and the size of these dir chunks should be
recorded in the superblock of the filesystem image. That way it
is clear that the image has a specific chunk size, and it also paves
the way for supporting more efficient directory structures using
larger-than-page-size chunks in the future.
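
One way that could look, reusing the existing header fields from this
patch and just taking over one of the unused bytes (illustrative only):

struct cfs_header_s {
	u8 version;
	u8 block_size_shift;	/* dir chunk size = 1 << block_size_shift,
				 * recorded in the image instead of being
				 * assumed equal to the reader's PAGE_SIZE */
	u16 unused2;

	u32 magic;
	u64 data_offset;
	u64 root_inode;

	u64 unused3[2];
} __packed;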

> We can just kmap_page_local() each chunk and treat
> it as a non-split continuous dirent array, then move on to the next
> chunk in the next page.

OK.

> If we had dirent data spanning multiple pages
> then we would either need to map the pages consecutively (which seems
> hard/costly) or have complex in-kernel code to handle the case where a
> dirent straddles two pages.

Actually pretty easy - we do this with XFS for multi-page directory
buffers. We just use vm_map_ram() on a page array at the moment,
but in the near future there will be other options based on
multipage folios.

That is, the page cache now stores folios rather than pages, and is
capable of using contiguous multi-page folios in the cache. As a
result, multipage folios could be used to cache multi-page
structures in the page cache and efficiently map them as a whole.

That mapping code isn't there yet - kmap_local_folio() only maps the
page within the folio at the offset given - but the foundation is
there for supporting this functionality natively....

I certainly wouldn't be designing a new filesystem these days that
has its on-disk format constrained by the x86-64 4kB page size...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem
  2023-01-16 15:27           ` Alexander Larsson
@ 2023-01-17  0:12             ` Gao Xiang
  2023-01-17  7:05               ` Amir Goldstein
  0 siblings, 1 reply; 34+ messages in thread
From: Gao Xiang @ 2023-01-17  0:12 UTC (permalink / raw)
  To: Alexander Larsson, linux-fsdevel; +Cc: linux-kernel, gscrivan



On 2023/1/16 23:27, Alexander Larsson wrote:
> On Mon, 2023-01-16 at 21:26 +0800, Gao Xiang wrote:

I will stop talking about this overlay permission model, since
there are more experienced folks working on this, although the SUID
stuff still looks dangerous to me as an end-user:  IMHO, it's hard
for me to identify the proper sub-sub-subdir UID/GID in "objects"
at runtime, and such directories can be nested quite deep, which is
different from a local fs with loopback devices or from overlayfs.
I don't know what improper sub-sub-subdir UID/GID in "objects"
could cause.

It seems that ostree currently uses "root" all the time for such
"objects" subdirs, but I don't know.

>>>
>>>>
>>>>>
>>>>> Instead what we have done with composefs is to make filesystem
>>>>> image
>>>>> generation from the ostree repository 100% reproducible. Then
>>>>> we
>>>>> can
>>>>
>>>> EROFS is all 100% reproduciable as well.
>>>>
>>>
>>>
>>> Really, so if I today, on fedora 36 run:
>>> # tar xvf oci-image.tar
>>> # mkfs.erofs oci-dir/ oci.erofs
>>>
>>> And then in 5 years, if someone on debian 13 runs the same, with
>>> the
>>> same tar file, then both oci.erofs files will have the same sha256
>>> checksum?
>>
>> Why it doesn't?  Reproducable builds is a MUST for Android use cases
>> as well.
> 
> That is not quite the same requirements. A reproducible build in the
> traditional sense is limited to a particular build configuration. You
> define a set of tools for the build, and use the same ones for each
> build, and get a fixed output. You don't expect to be able to change
> e.g. the compiler and get the same result. Similarly, it is often the
> case that different builds or versions of compression libraries gives
> different results, so you can't expect to use e.g. a different libz and
> get identical images.
> 
>> Yes, it may break between versions by mistake, but I think
>> reproducable builds is a basic functionalaity for all image
>> use cases.
>>
>>>
>>> How do you handle things like different versions or builds of
>>> compression libraries creating different results? Do you guarantee
>>> to
>>> not add any new backwards compat changes by default, or change any
>>> default options? Do you guarantee that the files are read from
>>> "oci-
>>> dir" in the same order each time? It doesn't look like it.
>>
>> If you'd like to say like that, why mkcomposefs doesn't have the
>> same issue that it may be broken by some bug.
>>
> 
> libcomposefs defines a normalized form for everything like file order,
> xattr orders, etc, and carefully normalizes everything such that we can
> guarantee these properties. It is possible that some detail was missed,
> because we're humans. But it was a very conscious and deliberate design
> choice that is deeply encoded in the code and format. For example, this
> is why we don't use compression but try to minimize size in other ways.

EROFS is reproducible since its dirents are all sorted as required
by its on-disk definition.  And its xattrs are also sorted if
images need to be reproducible.

I don't know what the difference between these two kinds of
reproducible builds is.  EROFS is designed for golden images: if
you pass in a set of configuration options to mkfs.erofs, it
should produce the same output; otherwise those are real
bugs and need to be fixed.

Compression algorithms can generate different output between
versions, and generally compressed data is stable for most
compression algorithms within a specific version, but that is
another story.

EROFS can live without compression.

> 
>>>>
>>>> But really, personally I think the issue above is different from
>>>> loopback devices and may need to be resolved first. And if
>>>> possible,
>>>> I hope it could be an new overlayfs feature for everyone.
>>>
>>> Yeah. Independent of composefs, I think EROFS would be better if
>>> you
>>> could just point it to a chunk directory at mount time rather than
>>> having to route everything through a system-wide global cachefs
>>> singleton. I understand that cachefs does help with the on-demand
>>> download aspect, but when you don't need that it is just in the
>>> way.
>>
>> Just check your reply to Dave's review, it seems that how
>> composefs dir on-disk format works is also much similar to
>> EROFS as well, see:
>>
>> https://docs.kernel.org/filesystems/erofs.html -- Directories
>>
>> a block vs a chunk = dirent + names
>>
>> cfs_dir_lookup -> erofs_namei + find_target_block_classic;
>> cfs_dir_lookup_in_chunk -> find_target_dirent.
> 
> Yeah, the dirent layout looks very similar. I guess great minds think
> alike! My approach was simpler initially, but it kinda converged on
> this when I started optimizing the kernel lookup code with binary
> search.
> 
>> Yes, great projects could be much similar to each other
>> occasionally, not to mention opensource projects ;)
>>
>> Anyway, I'm not opposed to Composefs if folks really like a
>> new read-only filesystem for this. That is almost all I'd like
>> to say about Composefs formally, have fun!

Because, anyway, there is not much I can do about it, considering
open-source projects can also fork, so (maybe) such is life.

To me it rather looks like another, incomplete EROFS from several
points of view.  Also see:
https://lore.kernel.org/all/1b192a85-e1da-0925-ef26-178b93d0aa45@plexistor.com/T/#u

I will keep making EROFS better, as I promised the community from
the beginning.

Thanks,
Gao Xiang

>>
>> Thanks,
>> Gao Xiang
> 
> Cool, thanks for the feedback.
> 
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem
  2023-01-17  0:12             ` Gao Xiang
@ 2023-01-17  7:05               ` Amir Goldstein
  2023-01-17 10:12                 ` Christian Brauner
  0 siblings, 1 reply; 34+ messages in thread
From: Amir Goldstein @ 2023-01-17  7:05 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Alexander Larsson, linux-fsdevel, linux-kernel, gscrivan,
	Miklos Szeredi, Yurii Zubrytskyi, Eugene Zemtsov, Vivek Goyal

> It seems rather another an incomplete EROFS from several points
> of view.  Also see:
> https://lore.kernel.org/all/1b192a85-e1da-0925-ef26-178b93d0aa45@plexistor.com/T/#u
>

Ironically, ZUFS is one of two new filesystems that were discussed at LSFMM19,
where the community reactions rhyme with the reactions to composefs.
The discussion on Incremental FS resembles the composefs case even more [1].
AFAIK, Android is still maintaining Incremental FS out-of-tree.

Alexander and Giuseppe,

I'd like to join Gao in saying that I think it is in the best interest
of everyone, composefs developers and prospective users included, if
the composefs requirements drove improvements to existing kernel
subsystems rather than adding a custom filesystem driver that partly
duplicates other subsystems.

Especially so when the modifications to existing components
(erofs and overlayfs) appear to be relatively minor, and the maintainer
of erofs is receptive to new features and happy to collaborate with you.

W.r.t. overlayfs, I am not even sure that anything needs to be modified
in the driver. Overlayfs already supports the "metacopy" feature, which
means that an upper layer could be composed in a way that the file
content would be read from an arbitrary path in the lower fs, e.g.
objects/cc/XXX.

I gave a talk at LPC a few years back about overlayfs and container
images [2]. The emphasis was that the overlayfs driver supports many new
features, but userland tools for building advanced overlayfs images
based on those new features are nowhere to be found.

I may be wrong, but it looks to me like composefs could potentially
fill this void, without having to modify the overlayfs driver at all,
or maybe just a little bit.
Please start a discussion with overlayfs developers about missing driver
features if you have any.

Overall, this sounds like a fun discussion to have at LSF/MM/BPF 2023 [3],
so you are most welcome to submit a topic proposal for
"opportunistically sharing verified image filesystem".

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/CAK8JDrGRzA+yphpuX+GQ0syRwF_p2Fora+roGCnYqB5E1eOmXA@mail.gmail.com/
[2] https://lpc.events/event/7/contributions/639/attachments/501/969/Overlayfs-containers-lpc-2020.pdf
[3] https://lore.kernel.org/linux-fsdevel/Y7hDVliKq+PzY1yY@localhost.localdomain/


* Re: [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem
  2023-01-17  7:05               ` Amir Goldstein
@ 2023-01-17 10:12                 ` Christian Brauner
  2023-01-17 10:30                   ` Gao Xiang
                                     ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Christian Brauner @ 2023-01-17 10:12 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Gao Xiang, Alexander Larsson, linux-fsdevel, linux-kernel,
	gscrivan, Miklos Szeredi, Yurii Zubrytskyi, Eugene Zemtsov,
	Vivek Goyal

On Tue, Jan 17, 2023 at 09:05:53AM +0200, Amir Goldstein wrote:
> > It seems rather another an incomplete EROFS from several points
> > of view.  Also see:
> > https://lore.kernel.org/all/1b192a85-e1da-0925-ef26-178b93d0aa45@plexistor.com/T/#u
> >
> 
> Ironically, ZUFS is one of two new filesystems that were discussed in LSFMM19,
> where the community reactions rhyme with the reactions to composefs.
> The discussion on Incremental FS resembles composefs case even more [1].
> AFAIK, Android is still maintaining Incremental FS out-of-tree.
> 
> Alexander and Giuseppe,
> 
> I'd like to join Gao is saying that I think it is in the best interest
> of everyone,
> composefs developers and prospect users included,
> if the composefs requirements would drive improvement to existing
> kernel subsystems rather than adding a custom filesystem driver
> that partly duplicates other subsystems.
> 
> Especially so, when the modifications to existing components
> (erofs and overlayfs) appear to be relatively minor and the maintainer
> of erofs is receptive to new features and happy to collaborate with you.
> 
> w.r.t overlayfs, I am not even sure that anything needs to be modified
> in the driver.
> overlayfs already supports "metacopy" feature which means that an upper layer
> could be composed in a way that the file content would be read from an arbitrary
> path in lower fs, e.g. objects/cc/XXX.
> 
> I gave a talk on LPC a few years back about overlayfs and container images [2].
> The emphasis was that overlayfs driver supports many new features, but userland
> tools for building advanced overlayfs images based on those new features are
> nowhere to be found.
> 
> I may be wrong, but it looks to me like composefs could potentially
> fill this void,
> without having to modify the overlayfs driver at all, or maybe just a
> little bit.
> Please start a discussion with overlayfs developers about missing driver
> features if you have any.

It's surprising that I and others weren't Cc'ed on this, given that we
had a meeting with the main developers and a few others where we said
the same thing. I hadn't followed this.

We have at least 58 filesystems currently in the kernel (and that's a
conservative count, just going by obvious directories and ignoring most
virtual filesystems).

A not-insignificant portion is probably slowly rotting away, with few
fixes coming in, few users, and not much attention being paid to
syzkaller reports for them if they show up. I haven't quantified this,
of course.

Taking a new filesystem into the kernel in the worst case means that
it's dumped there once and will slowly become unmaintained. Then
we'll have a few users for the next 20 years and we can't reasonably
deprecate it. (Maybe that's another good topic: how should we fade out
filesystems?)

Of course, for most fs developers it probably doesn't matter how many
other filesystems there are in the kernel (aside from maybe competing
for the same users).

But for developers who touch the vfs, every new filesystem may increase
the cost of maintaining and reworking existing functionality, or adding
new functionality. It makes it more likely that we accumulate hacks, add
workarounds, or are flat-out unable to kill off infrastructure that
should reasonably go away. Maybe this is an unfair complaint, but just
from experience a new filesystem potentially means one or two weeks to
make a larger vfs change.

I want to stress that I'm not at all saying "no more new fs" but we
should be hesitant before we merge new filesystems into the kernel.

Especially for filesystems that are tailored to special use-cases.
Every few years another filesystem tailored to container use-cases shows
up. And frankly, a good portion of the issues that they are trying to
solve are caused by design choices in userspace.

And I have to say I'm especially NAK-friendly about anything that comes
even close to yet another stacking filesystem or anything that layers
on top of a lower filesystem/mount such as ecryptfs, ksmbd, and
overlayfs. They are hard to get right, with lots of corner cases, and
they cause the most headaches when making vfs changes.


* Re: [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem
  2023-01-17 10:12                 ` Christian Brauner
@ 2023-01-17 10:30                   ` Gao Xiang
  2023-01-17 13:56                   ` Giuseppe Scrivano
  2023-01-20  9:22                   ` Alexander Larsson
  2 siblings, 0 replies; 34+ messages in thread
From: Gao Xiang @ 2023-01-17 10:30 UTC (permalink / raw)
  To: Christian Brauner, Amir Goldstein
  Cc: Alexander Larsson, linux-fsdevel, linux-kernel, gscrivan,
	Miklos Szeredi, Yurii Zubrytskyi, Eugene Zemtsov, Vivek Goyal

Hi Amir and Christian,

On 2023/1/17 18:12, Christian Brauner wrote:
> On Tue, Jan 17, 2023 at 09:05:53AM +0200, Amir Goldstein wrote:
>>> It seems rather another an incomplete EROFS from several points
>>> of view.  Also see:
>>> https://lore.kernel.org/all/1b192a85-e1da-0925-ef26-178b93d0aa45@plexistor.com/T/#u
>>>
>>
>> Ironically, ZUFS is one of two new filesystems that were discussed in LSFMM19,
>> where the community reactions rhyme with the reactions to composefs.
>> The discussion on Incremental FS resembles composefs case even more [1].
>> AFAIK, Android is still maintaining Incremental FS out-of-tree.
>>
>> Alexander and Giuseppe,
>>
>> I'd like to join Gao is saying that I think it is in the best interest
>> of everyone,
>> composefs developers and prospect users included,
>> if the composefs requirements would drive improvement to existing
>> kernel subsystems rather than adding a custom filesystem driver
>> that partly duplicates other subsystems.
>>
>> Especially so, when the modifications to existing components
>> (erofs and overlayfs) appear to be relatively minor and the maintainer
>> of erofs is receptive to new features and happy to collaborate with you.
>>
>> w.r.t overlayfs, I am not even sure that anything needs to be modified
>> in the driver.
>> overlayfs already supports "metacopy" feature which means that an upper layer
>> could be composed in a way that the file content would be read from an arbitrary
>> path in lower fs, e.g. objects/cc/XXX.
>>
>> I gave a talk on LPC a few years back about overlayfs and container images [2].
>> The emphasis was that overlayfs driver supports many new features, but userland
>> tools for building advanced overlayfs images based on those new features are
>> nowhere to be found.
>>
>> I may be wrong, but it looks to me like composefs could potentially
>> fill this void,
>> without having to modify the overlayfs driver at all, or maybe just a
>> little bit.
>> Please start a discussion with overlayfs developers about missing driver
>> features if you have any.
> 

...

> 
> I want to stress that I'm not at all saying "no more new fs" but we
> should be hesitant before we merge new filesystems into the kernel.
> 
> Especially for filesystems that are tailored to special use-cases.
> Every few years another filesystem tailored to container use-cases shows
> up. And frankly, a good portion of the issues that they are trying to
> solve are caused by design choices in userspace.
> 
> And I have to say I'm especially NAK-friendly about anything that comes
> even close to yet another stacking filesystems or anything that layers
> on top of a lower filesystem/mount such as ecryptfs, ksmbd, and
> overlayfs. They are hard to get right, with lots of corner cases and
> they cause the most headaches when making vfs changes.

That is also my original (little) request, if such an overlay model is
correct...

In principle, it's not hard for EROFS, since EROFS already has a
symlink on-disk layout; the difference is just applying it to all
regular files (even without on-disk changes, though maybe we would need
to optimize it if there are other special requirements for specific use
cases like ostree), and making EROFS work in a stackable way... That is
honestly not hard (and on-disk compatible)...

But I'm not sure whether it's wise for EROFS to support this now,
without a proper overlay model that has been carefully discussed.

So if there could be some discussion of this overlay model at
LSF/MM/BPF, I'd like to attend (thanks!). I would support doing it in
overlayfs (if possible), but it seems EROFS could do it as well, as
long as there are enough constraints to draw a conclusion...

Thanks,
Gao Xiang


> 


* Re: [PATCH v2 2/6] composefs: Add on-disk layout
  2023-01-16 23:06       ` Dave Chinner
@ 2023-01-17 12:11         ` Alexander Larsson
  2023-01-18  3:08           ` Dave Chinner
  0 siblings, 1 reply; 34+ messages in thread
From: Alexander Larsson @ 2023-01-17 12:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-kernel, gscrivan

On Tue, 2023-01-17 at 10:06 +1100, Dave Chinner wrote:
> On Mon, Jan 16, 2023 at 12:00:03PM +0100, Alexander Larsson wrote:
> > On Mon, 2023-01-16 at 12:29 +1100, Dave Chinner wrote:
> > > On Fri, Jan 13, 2023 at 04:33:55PM +0100, Alexander Larsson
> > > wrote:
> > > > This commit adds the on-disk layout header file of composefs.
> > > 
> > > This isn't really a useful commit message.
> > > 
> > > Perhaps it should actually explain what the overall goals of the
> > > on-disk format are - space usage, complexity trade-offs,
> > > potential
> > > issues with validation of variable payload sections, etc.
> > > 
> > 
> > I agree, will flesh it out. But, as for below discussions, one of
> > the
> > overall goals is to keep the on-disk file size low.
> > 
> > > > Signed-off-by: Alexander Larsson <alexl@redhat.com>
> > > > Co-developed-by: Giuseppe Scrivano <gscrivan@redhat.com>
> > > > Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
> > > > ---
> > > >  fs/composefs/cfs.h | 203
> > > > +++++++++++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 203 insertions(+)
> > > >  create mode 100644 fs/composefs/cfs.h
> > > > 
> > > > diff --git a/fs/composefs/cfs.h b/fs/composefs/cfs.h
> > > > new file mode 100644
> > > > index 000000000000..658df728e366
> > > > --- /dev/null
> > > > +++ b/fs/composefs/cfs.h
> > > > @@ -0,0 +1,203 @@
> > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > +/*
> > > > + * composefs
> > > > + *
> > > > + * Copyright (C) 2021 Giuseppe Scrivano
> > > > + * Copyright (C) 2022 Alexander Larsson
> > > > + *
> > > > + * This file is released under the GPL.
> > > > + */
> > > > +
> > > > +#ifndef _CFS_H
> > > > +#define _CFS_H
> > > > +
> > > > +#include <asm/byteorder.h>
> > > > +#include <crypto/sha2.h>
> > > > +#include <linux/fs.h>
> > > > +#include <linux/stat.h>
> > > > +#include <linux/types.h>
> > > > +
> > > > +#define CFS_VERSION 1
> > > 
> > > This should start with a description of the on-disk format for
> > > the
> > > version 1 format.
> > 
> > There are some format descriptions in the later document patch.
> > What is
> > the general approach here, do we document in the header, or in
> > separate
> > doc file? For example, I don't see much of format descriptions in
> > the
> > xfs headers. I mean, I should probably add *some* info here for
> > easier
> > reading of the stuff below, but I don't feel like headers are a
> > great
> > place for docs.
> 
> it's fine to describe the format in the docs, but when reading the
> code there needs to at least an overview of the structure the code
> is implementing so that the code makes some sense without having to
> go find the external place the format is documented.

Yeah, I'll try to make the format overall better commented in the next
series.

> > > 
> > > > +#define CFS_MAGIC 0xc078629aU
> > > > +
> > > > +#define CFS_MAX_DIR_CHUNK_SIZE 4096
> > > > +#define CFS_MAX_XATTRS_SIZE 4096
> > > 
> > > How do we store 64kB xattrs in this format if the max attr size
> > > is
> > > 4096 bytes? Or is that the maximum total xattr storage?
> > 
> > This is a current limitation of the composefs file format.
> 
> Yes, but is that 4kB limit the maximum size of a single xattr, or is
> it the total xattr storage space for an inode?

Currently it is actually the total xattr storage. I've never seen
container images, or rootfs images in general, use any large amount of
xattrs. However, given the discussion below on multi-page mappings,
maybe it's possible to easily drop this limit.

> > I am aware
> > that the kernel maximum size is 64k,
> 
> For a single xattr, yes. Hence my question....
> 
> > > > +static inline int cfs_digest_from_payload(const char *payload,
> > > > size_t payload_len,
> > > > +                                         u8
> > > > digest_out[SHA256_DIGEST_SIZE])
> .....
> > > Too big to be a inline function.
> > 
> > Yeah, I'm aware of this. I mainly put it in the header as the
> > implementation of it is sort of part of the on-disk format. But, I
> > can
> > move it to a .c file instead.
> 
> Please do - it's really part of the reader implementation, not the
> structure definition.
> 
> > > > +struct cfs_vdata_s {
> > > 
> > > Drop the "_s" suffix to indicate the type is a structure - that's
> > > waht "struct" tells us.
> > 
> > Sure.
> > 
> > > > +       u64 off;
> > > > +       u32 len;
> > > 
> > > If these are on-disk format structures, why aren't the defined as
> > > using the specific endian they are encoded in? i.e. __le64,
> > > __le32,
> > > etc? Otherwise a file built on a big endian machine won't be
> > > readable on a little endian machine (and vice versa).
> > 
> > On disk all fields are little endian. However, when we read them
> > from
> > disk we convert them using e.g. le32_to_cpu(), and then we use the
> > same
> > structure in memory, with native endian. So, it seems wrong to mark
> > them as little endian.
> 
> Then these structures do not define "on-disk format". Looking a bit
> further through the patchset, these are largely intermediate
> structures that are read once to instatiate objects in memory, then
> never used again. The cfs_inode_s is a good example of this - I'll
> come back to that.

The header/superblock is actually just read from the fs as-is, as are
most of the other structures. Only the inode data is packed.
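
To make that concrete, here is a rough sketch of the split being
suggested (hypothetical names, not the structs in this series): the
on-disk variant is declared in the endianness it is encoded in, and a
small helper converts it once into the native-endian in-memory form:

/* Sketch only: hypothetical names, not code from this series. */
struct cfs_vdata_disk {			/* on-disk: always little endian */
	__le64 off;
	__le32 len;
} __packed;

struct cfs_vdata {			/* in-memory: native endian */
	u64 off;
	u32 len;
};

static inline void cfs_vdata_from_disk(struct cfs_vdata *to,
				       const struct cfs_vdata_disk *from)
{
	to->off = le64_to_cpu(from->off);
	to->len = le32_to_cpu(from->len);
}

That keeps sparse's endianness checking on the on-disk side, while the
rest of the code only ever sees native-endian values.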

> > > > +} __packed;
> > > > +
> > > > +struct cfs_header_s {
> > > > +       u8 version;
> > > > +       u8 unused1;
> > > > +       u16 unused2;
> > > 
> > > Why are you hyper-optimising these structures for minimal space
> > > usage? This is 2023 - we can use a __le32 for the version number,
> > > the magic number and then leave....
> > > 
> > > > +
> > > > +       u32 magic;
> > > > +       u64 data_offset;
> > > > +       u64 root_inode;
> > > > +
> > > > +       u64 unused3[2];
> > > 
> > > a whole heap of space to round it up to at least a CPU cacheline
> > > size using something like "__le64 unused[15]".
> > > 
> > > That way we don't need packed structures nor do we care about
> > > having
> > > weird little holes in the structures to fill....
> > 
> > Sure.
> 
> FWIW, now I see how this is used, this header kinda defines what
> we'd call the superblock in the on-disk format of a filesystem. It's
> at a fixed location in the image file, so there should be a #define
> somewhere in this file to document it's fixed location.

It is at offset zero. I don't really think that needs a define, does
it? Maybe a comment though.

> Also, if this is the in-memory representation of the structure and
> not the actual on-disk format, why does it even need padding,
> packing or even store the magic number?

In this case it is the on-disk format though.

> i.e. this information could simply be stored in a few fields in the
> cfs
> superblock structure that wraps the vfs superblock, and the
> superblock read function could decode straight into those fields...

We just read this header from disk, validate the magic, convert the
fields to native endian, and then copy the few used fields (data_offset
and root_inode) into the vfs superblock structure.
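
For reference, a naturally aligned, fixed-size variant along those
lines could look something like this (sketch only; the exact field
layout is hypothetical and not what this series currently ships):

/* Sketch: the header/superblock lives at offset 0 of the image file. */
#define CFS_HEADER_OFFSET	0

struct cfs_header {
	__le32 version;
	__le32 magic;
	__le64 data_offset;
	__le64 root_inode;
	__le64 unused[5];	/* pad to 64 bytes for future fields */
};

Everything is naturally aligned, so no __packed is needed, and the
mount-time code would still just read this block, check
le32_to_cpu(magic) against CFS_MAGIC, and copy data_offset/root_inode
into the in-memory superblock, much like it does today.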


> > > > +} __packed;
> > > > +
> > > > +enum cfs_inode_flags {
> > > > +       CFS_INODE_FLAGS_NONE = 0,
> > > > +       CFS_INODE_FLAGS_PAYLOAD = 1 << 0,
> > > > +       CFS_INODE_FLAGS_MODE = 1 << 1,
> > > > +       CFS_INODE_FLAGS_NLINK = 1 << 2,
> > > > +       CFS_INODE_FLAGS_UIDGID = 1 << 3,
> > > > +       CFS_INODE_FLAGS_RDEV = 1 << 4,
> > > > +       CFS_INODE_FLAGS_TIMES = 1 << 5,
> > > > +       CFS_INODE_FLAGS_TIMES_NSEC = 1 << 6,
> > > > +       CFS_INODE_FLAGS_LOW_SIZE = 1 << 7, /* Low 32bit of
> > > > st_size
> > > > */
> > > > +       CFS_INODE_FLAGS_HIGH_SIZE = 1 << 8, /* High 32bit of
> > > > st_size */
> > > 
> > > Why do we need to complicate things by splitting the inode size
> > > like this?
> > > 
> > 
> > The goal is to minimize the image size for a typical rootfs or
> > container image. Almost zero files in any such images are > 4GB. 
> 
> Sure, but how much space does this typically save, versus how much
> complexity it adds to runtime decoding of inodes?
> 
> I mean, in a dense container system the critical resources that need
> to be saved is runtime memory and CPU overhead of operations, not
> the storage space. Saving a 30-40 bytes of storage space per inode
> means a typical image might ber a few MB smaller, but given the
> image file is not storing data we're only talking about images the
> use maybe 500 bytes of data per inode. Storage space for images
> is not a limiting factor, nor is network transmission (because
> compression), so it comes back to runtime CPU and memory usage.

Here are some example sizes of composefs images with the current packed
inodes: 

6.2M cs9-developer-rootfs.composefs
2.1M cs9-minimal-rootfs.composefs
1.2M fedora-37-container.composefs
433K ubuntu-22.04-container.composefs

If we set all the flags for the inodes (i.e. fixed size inodes) we get:

8.8M cs9-developer-rootfs.composefs
3.0M cs9-minimal-rootfs.composefs
1.6M fedora-37-container.composefs
625K ubuntu-22.04-container.composefs

So, images are about 40% larger with fixed size inodes.

> The inodes are decoded out of the page cache, so the memory for the
> raw inode information is volatile and reclaimed when needed.
> Similarly, the VFS inode built from this information is reclaimable
> when not in use, too. So the only real overhead for runtime is the
> decoding time to find the inode in the image file and then decode
> it.

I disagree with this characterization. It is true that the page cache
is volatile, but if the inode data is 40% larger, less of it fits in the
page cache, and there is additional overhead whenever it has to be read
back from disk. So decoding time is not the only thing that affects
overhead.

Additionally, just by being larger and less dense, more data has to be
read from disk, which itself is slower.

> Given the decoding of the inode -all branches- and is not
> straight-line code, it cannot be well optimised and the CPU branch
> predictor is not going to get it right every time. Straight line
> code that decodes every field whether it is zero or not is going to
> be faster.
>
> Further, with a fixed size inode in the image file, the inode table
> can be entirely fixed size, getting rid of the whole unaligned data
> retreival problem that code currently has (yes, all that
> "le32_to_cpu(__get_unaligned(__le32, data)" code) because we can
> ensure that all the inode fields are aligned in the data pages. This
> will significantly speed up decoding into the in-memory inode
> structures.

I agree it could be faster. But is inode decode actually the limiting
factor, compared to things like disk i/o or better use of page cache?

> And to take it another step, the entire struct cfs_inode_s structure
> could go away - it is entirely a temporary structure used to shuffle
> data from the on-disk encoded format to the the initialisation of
> the VFS inode. The on-disk inode data could be decoded directly into
> the VFS inode after it has been instantiated, rather than decoding
> the inode from the backing file and the instantiating the in-memory
> inode.
> 
> i.e. instead of:
> 
> cfs_lookup()
>         cfs_dir_lookup(&index)
>         cfs_get_ino_index(index, &inode_s)
>                 cfs_get_inode_data_max(index, &data)
>                 inode_s->st_.... = cfs_read_....(&data);
>                 inode_s->st_.... = cfs_read_....(&data);
>                 inode_s->st_.... = cfs_read_....(&data);
>                 inode_s->st_.... = cfs_read_....(&data);
>         cfs_make_inode(inode_s, &vfs_inode)
>                 inode = new_inode(sb)
>                 inode->i_... = inode_s->st_....;
>                 inode->i_... = inode_s->st_....;
>                 inode->i_... = inode_s->st_....;
>                 inode->i_... = inode_s->st_....;
> 
> You could collapse this straight down to:
> 
> cfs_lookup()
>         cfs_dir_lookup(&index)
>         cfs_make_inode(index, &vfs_inode)
>                 inode = new_inode(sb)
>                 cfs_get_inode_data_max(index, &data)
>                 inode->i_... = cfs_read_....(&data);
>                 inode->i_... = cfs_read_....(&data);
>                 inode->i_... = cfs_read_....(&data);
>                 inode->i_... = cfs_read_....(&data);
> 
> This removes an intermediately layer from the inode instantiation
> fast path completely. ANd if the inode table is fixed size and
> always well aligned, then the cfs_make_inode() code that sets up the
> VFS inode is almost unchanged from what it is now. There are no new
> branches, the file image format is greatly simplified, and the
> runtime overhead of instantiating inodes is significantly reduced.

I'm not sure the performance win is clear compared to the extra size,
as generally inodes are only decoded once and kept around in memory for
most of their use. However, I agree that there are clear advantages in
simplifying the format. That makes it easier to maintain and
understand. I'll give this some thought.

> Similar things can be done with the rest of the "descriptor"
> abstraction - the intermediate in-memory structures can be placed
> directly in the cfs_inode structure that wraps the VFS inode, and
> the initialisation of them can call the decoding code directly
> instead of using intermediate structures as is currently done.
> 
> This will remove a chunk of code from the implemenation and make it
> run faster....
> 
> > Also, we don't just "not decode" the items with the flag not set,
> > they
> > are not even stored on disk.
> 
> Yup, and I think that is a mistake - premature optimisation and all
> that...
> 
> > 
> > > > +       CFS_INODE_FLAGS_XATTRS = 1 << 9,
> > > > +       CFS_INODE_FLAGS_DIGEST = 1 << 10, /* fs-verity sha256
> > > > digest */
> > > > +       CFS_INODE_FLAGS_DIGEST_FROM_PAYLOAD = 1 << 11, /*
> > > > Compute
> > > > digest from payload */
> > > > +};
> > > > +
> > > > +#define CFS_INODE_FLAG_CHECK(_flag,
> > > > _name)                                     \
> > > > +       (((_flag) & (CFS_INODE_FLAGS_##_name)) != 0)
> > > 
> > > Check what about a flag? If this is a "check that a feature is
> > > set",
> > > then open coding it better, but if you must do it like this, then
> > > please use static inline functions like:
> > > 
> > >         if (cfs_inode_has_xattrs(inode->flags)) {
> > >                 .....
> > >         }
> > > 
> > 
> > The check is if the flag is set, so maybe CFS_INODE_FLAG_IS_SET is
> > a
> > better name. This is used only when decoding the on-disk version of
> > the
> > inode to the in memory one, which is a bunch of:
> > 
> >         if (CFS_INODE_FLAG_CHECK(ino->flags, THE_FIELD))
> >                 ino->the_field = cfs_read_u32(&data);
> >         else
> >                 ino->the_field = THE_FIELD_DEFUALT;
> > 
> > I can easily open-code these checks, although I'm not sure it makes
> > a
> > great difference either way.
> 
> If they are used only once, then it should be open coded. But I
> think the whole "optional inode fields" stuff should just go away
> entirely at this point...
> 
> > > > +#define CFS_INODE_DEFAULT_MODE 0100644
> > > > +#define CFS_INODE_DEFAULT_NLINK 1
> > > > +#define CFS_INODE_DEFAULT_NLINK_DIR 2
> > > > +#define CFS_INODE_DEFAULT_UIDGID 0
> > > > +#define CFS_INODE_DEFAULT_RDEV 0
> > > > +#define CFS_INODE_DEFAULT_TIMES 0
> > > 
> > > Where do these get used? Are they on disk defaults or something
> > > else? (comment, please!)
> > 
> > They are the defaults that are used when inode fields on disk are
> > missing. I'll add some comments.
> 
> They go away entirely with fixed size on-disk inodes.
> 
> > > > +       u32 st_mode; /* File type and mode.  */
> > > > +       u32 st_nlink; /* Number of hard links, only for regular
> > > > files.  */
> > > > +       u32 st_uid; /* User ID of owner.  */
> > > > +       u32 st_gid; /* Group ID of owner.  */
> > > > +       u32 st_rdev; /* Device ID (if special file).  */
> > > > +       u64 st_size; /* Size of file, only used for regular
> > > > files
> > > > */
> > > > +
> > > > +       struct cfs_vdata_s xattrs; /* ref to variable data */
> > > 
> > > This is in the payload that follows the inode?  Is it included in
> > > the payload_length above?
> > > 
> > > If not, where is this stuff located, how do we validate it points
> > > to
> > > the correct place in the on-disk format file, the xattrs belong
> > > to
> > > this specific inode, etc? I think that's kinda important to
> > > describe, because xattrs often contain important security
> > > information...
> > 
> > No, all inodes are packed into the initial part of the file, each
> > containing a flags set, a variable size (from flags) chunk of fixed
> > size elements and an variable size payload. The payload is either
> > the
> > target symlink for symlinks, or the path of the backing file for
> > regular files.
> 
> Ok, I think you need to stop calling that a "payload", then. It's
> the path name to the backing file. The backing file is only relevant
> for S_IFREG and S_IFLINK types - directories don't need path names
> as they only contain pointers to other inodes in the image file.
> Types like S_IFIFO, S_IFBLK, etc should not have backing files,
> either - they should just be instantiated as the correct type in the
> VFS inode and not require any backing file interactions at all...
> 
> Hence I think this "payload" should be called something like
> "backing path" or something similar.

Yeah, that may be better.

> 
> .....
> 
> > > > +struct cfs_dir_s {
> > > > +       u32 n_chunks;
> > > > +       struct cfs_dir_chunk_s chunks[];
> > > > +} __packed;
> > > 
> > > So directory data is packed in discrete chunks? Given that this
> > > is a
> > > static directory format, and the size of the directory is known
> > > at
> > > image creation time, why does the storage need to be chunked?
> > 
> > We chunk the data such that each chunk fits inside a single page in
> > the
> > image file. I did this to make accessing image data directly from
> > the
> > page cache easier.
> 
> Hmmmm. So you defined a -block size- that matched the x86-64 -page
> size- to avoid page cache issues.  Now, what about ARM or POWER
> which has 64kB page sizes?
> 
> IOWs, "page size" is not the same on all machines, whilst the
> on-disk format for a filesystem image needs to be the same on all
> machines. Hence it appears that this:
> 
> > > > +#define CFS_MAX_DIR_CHUNK_SIZE 4096
> 
> should actually be defined in terms of the block size for the
> filesystem image, and this size of these dir chunks should be
> recorded in the superblock of the filesystem image. That way it
> is clear that the image has a specific chunk size, and it also paves
> the way for supporting more efficient directory structures using
> larger-than-page size chunks in future.

Yes, it's true that assuming a (minimum) 4k page size is wasteful on
some arches, but it would be hard to read a filesystem created for 64k
pages on a 4k page machine, which is not ideal. However, w.r.t. your
comment on multi-page mappings, maybe we can just drop these limits
entirely. I'll have a look at that.

> > If we had dirent data spanning multiple pages
> > then we would either need to map the pages consecutively (which
> > seems
> > hard/costly) or have complex in-kernel code to handle the case
> > where a
> > dirent straddles two pages.
> 
> Actually pretty easy - we do this with XFS for multi-page directory
> buffers. We just use vm_map_ram() on a page array at the moment,
> but in the near future there will be other options based on
> multipage folios.
> 
> That is, the page cache now stores folios rather than pages, and is
> capable of using contiguous multi-page folios in the cache. As a
> result, multipage folios could be used to cache multi-page
> structures in the page cache and efficiently map them as a whole.
> 
> That mapping code isn't there yet - kmap_local_folio() only maps the
> page within the folio at the offset given - but the foundation is
> there for supporting this functionality natively....
> 
> I certainly wouldn't be designing a new filesystem these days that
> has it's on-disk format constrained by the x86-64 4kB page size...

Yes, I agree. I'm gonna look at using multi-page mapping for both
dirents and xattr data, which should completely drop these limits, as
well as get rid of the dirent chunking.
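
To make sure I understand the approach, a rough sketch of what I have
in mind (hypothetical helper name, error unwinding omitted): read the
backing pages through the page cache and map them as one virtually
contiguous range with vm_map_ram(), so a dirent or xattr structure can
straddle page boundaries without any chunking:

/* Sketch only (assumes <linux/pagemap.h> and <linux/vmalloc.h>). */
static void *cfs_map_contig(struct address_space *mapping, pgoff_t first,
			    unsigned int nr_pages, struct page **pages)
{
	unsigned int i;

	for (i = 0; i < nr_pages; i++) {
		struct page *page = read_mapping_page(mapping, first + i, NULL);

		if (IS_ERR(page))
			return NULL;	/* real code: put the pages read so far */
		pages[i] = page;
	}

	/* map the (possibly discontiguous) pages virtually contiguously */
	return vm_map_ram(pages, nr_pages, NUMA_NO_NODE);
}

The caller would later undo this with vm_unmap_ram() and put_page() on
each page.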

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                            Red Hat, Inc
       alexl@redhat.com            alexander.larsson@gmail.com
He's a witless playboy firefighter fleeing from a secret government
programme. She's a ditzy streetsmart socialite from the wrong side of the
tracks. They fight crime!



* Re: [PATCH v2 4/6] composefs: Add filesystem implementation
  2023-01-16 22:07   ` Al Viro
@ 2023-01-17 13:29     ` Alexander Larsson
  0 siblings, 0 replies; 34+ messages in thread
From: Alexander Larsson @ 2023-01-17 13:29 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, linux-kernel, gscrivan

On Mon, 2023-01-16 at 22:07 +0000, Al Viro wrote:
>         Several random observations:

Thanks, I'll integrate this in the next version.

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                            Red Hat, Inc
       alexl@redhat.com            alexander.larsson@gmail.com
He's a scrappy ninja firefighter possessed of the uncanny powers of an
insect. She's a blind punk former first lady with an evil twin sister.
They fight crime!



* Re: [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem
  2023-01-17 10:12                 ` Christian Brauner
  2023-01-17 10:30                   ` Gao Xiang
@ 2023-01-17 13:56                   ` Giuseppe Scrivano
  2023-01-17 14:28                     ` Gao Xiang
  2023-01-17 15:27                     ` Christian Brauner
  2023-01-20  9:22                   ` Alexander Larsson
  2 siblings, 2 replies; 34+ messages in thread
From: Giuseppe Scrivano @ 2023-01-17 13:56 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Amir Goldstein, Gao Xiang, Alexander Larsson, linux-fsdevel,
	linux-kernel, Miklos Szeredi, Yurii Zubrytskyi, Eugene Zemtsov,
	Vivek Goyal, Al Viro

Christian Brauner <brauner@kernel.org> writes:

> On Tue, Jan 17, 2023 at 09:05:53AM +0200, Amir Goldstein wrote:
>> > It seems rather another an incomplete EROFS from several points
>> > of view.  Also see:
>> > https://lore.kernel.org/all/1b192a85-e1da-0925-ef26-178b93d0aa45@plexistor.com/T/#u
>> >
>> 
>> Ironically, ZUFS is one of two new filesystems that were discussed in LSFMM19,
>> where the community reactions rhyme with the reactions to composefs.
>> The discussion on Incremental FS resembles composefs case even more [1].
>> AFAIK, Android is still maintaining Incremental FS out-of-tree.
>> 
>> Alexander and Giuseppe,
>> 
>> I'd like to join Gao is saying that I think it is in the best interest
>> of everyone,
>> composefs developers and prospect users included,
>> if the composefs requirements would drive improvement to existing
>> kernel subsystems rather than adding a custom filesystem driver
>> that partly duplicates other subsystems.
>> 
>> Especially so, when the modifications to existing components
>> (erofs and overlayfs) appear to be relatively minor and the maintainer
>> of erofs is receptive to new features and happy to collaborate with you.
>> 
>> w.r.t overlayfs, I am not even sure that anything needs to be modified
>> in the driver.
>> overlayfs already supports "metacopy" feature which means that an upper layer
>> could be composed in a way that the file content would be read from an arbitrary
>> path in lower fs, e.g. objects/cc/XXX.
>> 
>> I gave a talk on LPC a few years back about overlayfs and container images [2].
>> The emphasis was that overlayfs driver supports many new features, but userland
>> tools for building advanced overlayfs images based on those new features are
>> nowhere to be found.
>> 
>> I may be wrong, but it looks to me like composefs could potentially
>> fill this void,
>> without having to modify the overlayfs driver at all, or maybe just a
>> little bit.
>> Please start a discussion with overlayfs developers about missing driver
>> features if you have any.
>
> Surprising that I and others weren't Cced on this given that we had a
> meeting with the main developers and a few others where we had said the
> same thing. I hadn't followed this. 

well that wasn't done on purpose, sorry for that.

After our meeting, I thought it was clear that we have different needs
for our use cases and that we were going to submit composefs upstream,
as we did, to gather some feedback from the wider community.

Of course we looked at overlay before we decided to upstream composefs.

Some of the use cases we have in mind are not easily doable, and some
others are not possible at all.  Metacopy is a good starting point, but
from user space it works quite differently from what we can do with
composefs.

Let's assume we have a git-like repository with a bunch of files stored
by their checksum, which can be shared among different containers.

Using the overlayfs model:

1) We need to create the final image layout, either using reflinks or
hardlinks:

- reflinks: we can reflect a correct st_nlink value for the inode but we
  lose page cache sharing.

- hardlinks: they make st_nlink bogus.  Another problem is that overlay
  expects the lower layer to never change, and now st_nlink can change
  for files in other lower layers.

These operations have a cost.  Even if all the files are already
available locally, we still need at least one operation per file to
create it, and more than one if we start tweaking the inode metadata
(see the sketch of the per-file work after this list).

2) No multi-repo support:

Neither reflinks nor hardlinks work across mount points, so we cannot
have images that span multiple file systems; one common use case is to
have a network file system that shares some images/files and to be able
to use files from there when they are available.

At the moment we deduplicate entire layers, and with overlay we can do
something like the following without problems:

# mount -t overlay overlay -o lowerdir=/first/disk/layer1:/second/disk/layer2 /mnt

but this won't work at the file granularity we are looking at.  So in
this case we would need to do a full copy of the files that are not on
the same file system.

3) No support for fs-verity.  I have no idea how overlay could ever
support it; it doesn't fit there.  If we want this feature we need to
look at another RO file system.
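
To make the per-file cost in 1) concrete, materializing a single image
entry as a reflink from user space looks roughly like this (sketch
only; real code also needs chown/utimes/xattr fixups and proper error
unwinding):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/fs.h>		/* FICLONE */

/* one open+open+ioctl+fchmod round trip per file in the image */
static int reflink_one(const char *object, const char *target, mode_t mode)
{
	int src = open(object, O_RDONLY);
	int dst = open(target, O_WRONLY | O_CREAT | O_EXCL, 0600);
	int ret = -1;

	if (src >= 0 && dst >= 0 &&
	    ioctl(dst, FICLONE, src) == 0 &&	/* fails across file systems */
	    fchmod(dst, mode) == 0)
		ret = 0;

	if (src >= 0)
		close(src);
	if (dst >= 0)
		close(dst);
	return ret;
}

Multiply that by the tens of thousands of files in a typical image and
the setup cost adds up quickly.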

We looked at EROFS, since it is already upstream, but it is quite
different from what we are doing, as Alex already pointed out.

Sure, we could bloat EROFS and add all the new features there (after
all, composefs is quite simple), but I don't see how that is any cleaner
than having a simple file system that does just one thing.

On top of what was already said: I wish that at some point we could do
all of this from a user namespace.  That is the main reason for having
a simple on-disk format for composefs.  This seems much more difficult
to achieve with EROFS given its complexity.

> We have at least 58 filesystems currently in the kernel (and that's a
> conservative count just based on going by obvious directories and
> ignoring most virtual filesystems).
>
> A non-insignificant portion is probably slowly rotting away with little
> fixes coming in, with few users, and not much attention is being paid to
> syzkaller reports for them if they show up. I haven't quantified this of
> course.
>
> Taking in a new filesystems into kernel in the worst case means that
> it's being dumped there once and will slowly become unmaintained. Then
> we'll have a few users for the next 20 years and we can't reasonably
> deprecate it (Maybe that's another good topic: How should we fade out
> filesystems.).
>
> Of course, for most fs developers it probably doesn't matter how many
> other filesystems there are in the kernel (aside from maybe competing
> for the same users).
>
> But for developers who touch the vfs every new filesystems may increase
> the cost of maintaining and reworking existing functionality, or adding
> new functionality. Making it more likely to accumulate hacks, adding
> workarounds, or flatout being unable to kill off infrastructure that
> should reasonably go away. Maybe this is an unfair complaint but just
> from experience a new filesystem potentially means one or two weeks to
> make a larger vfs change.
>
> I want to stress that I'm not at all saying "no more new fs" but we
> should be hesitant before we merge new filesystems into the kernel.
>
> Especially for filesystems that are tailored to special use-cases.
> Every few years another filesystem tailored to container use-cases shows
> up. And frankly, a good portion of the issues that they are trying to
> solve are caused by design choices in userspace.

Having a way to deprecate file systems seems like a good idea in
general, and IMHO makes more sense than blocking new components that
can be useful to some users.

We are aware the bar for a new file system is high, and we were
expecting criticism and pushback, but so far it doesn't seem there is
another way to achieve what we are trying to do.

> And I have to say I'm especially NAK-friendly about anything that comes
> even close to yet another stacking filesystems or anything that layers
> on top of a lower filesystem/mount such as ecryptfs, ksmbd, and
> overlayfs. They are hard to get right, with lots of corner cases and
> they cause the most headaches when making vfs changes.



* Re: [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem
  2023-01-17 13:56                   ` Giuseppe Scrivano
@ 2023-01-17 14:28                     ` Gao Xiang
  2023-01-17 15:27                     ` Christian Brauner
  1 sibling, 0 replies; 34+ messages in thread
From: Gao Xiang @ 2023-01-17 14:28 UTC (permalink / raw)
  To: Giuseppe Scrivano, Christian Brauner
  Cc: Amir Goldstein, Alexander Larsson, linux-fsdevel, linux-kernel,
	Miklos Szeredi, Yurii Zubrytskyi, Eugene Zemtsov, Vivek Goyal,
	Al Viro



On 2023/1/17 21:56, Giuseppe Scrivano wrote:
> Christian Brauner <brauner@kernel.org> writes:
> 

...

> 
> We looked at EROFS since it is already upstream but it is quite
> different than what we are doing as Alex already pointed out.
> 

Sigh...  please kindly help me understand what the difference is if
EROFS uses a symlink-style layout for each regular inode?

Some questions from me about this new overlay permission model, once
again:

What's the difference between symlinks (maybe with some limitations)
and this new overlay model? I'm not sure why symlink permission bits
are ignored (AFAIK)?  I haven't thought it through much further, since
I'm not that experienced in the unionfs field, but if possible I'm quite
happy to learn new stuff as a newbie filesystem developer and gain
more knowledge, if it could be a topic at LSF/MM/BPF 2023.

> Sure we could bloat EROFS and add all the new features there, after all
> composefs is quite simple, but I don't see how this is any cleaner than
> having a simple file system that does just one thing.

Also, if I have time, I could do a code-truncated EROFS without any of
the features that are useless specifically for ostree use cases.  Or I
could just separate out all of the code that is useless for
ostree-specific use cases by using Kconfig.

If you don't want to use EROFS for whatever reason, I'm not opposed
to that (you could also use another in-kernel local filesystem for this
as well).  Apart from this new overlay model, I just tried to point out
how it works similarly to EROFS.

> 
> On top of what was already said: I wish at some point we can do all of
> this from a user namespace.  That is the main reason for having an easy
> on-disk format for composefs.  This seems much more difficult to achieve
> with EROFS given its complexity.

Why?


[ Gao Xiang: this time I will try my best to stop talking about EROFS
   under the Composefs patchset, because I'd like to avoid showing up
   here in the first place (unless such a permission model has never
   been discussed until now)...

   Even though the cover letter never mentioned EROFS at all. ]

Thanks,
Gao Xiang


* Re: [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem
  2023-01-17 13:56                   ` Giuseppe Scrivano
  2023-01-17 14:28                     ` Gao Xiang
@ 2023-01-17 15:27                     ` Christian Brauner
  2023-01-18  0:22                       ` Dave Chinner
  1 sibling, 1 reply; 34+ messages in thread
From: Christian Brauner @ 2023-01-17 15:27 UTC (permalink / raw)
  To: Giuseppe Scrivano
  Cc: Amir Goldstein, Gao Xiang, Alexander Larsson, linux-fsdevel,
	linux-kernel, Miklos Szeredi, Yurii Zubrytskyi, Eugene Zemtsov,
	Vivek Goyal, Al Viro

On Tue, Jan 17, 2023 at 02:56:56PM +0100, Giuseppe Scrivano wrote:
> Christian Brauner <brauner@kernel.org> writes:
> 
> > On Tue, Jan 17, 2023 at 09:05:53AM +0200, Amir Goldstein wrote:
> >> > It seems rather another an incomplete EROFS from several points
> >> > of view.  Also see:
> >> > https://lore.kernel.org/all/1b192a85-e1da-0925-ef26-178b93d0aa45@plexistor.com/T/#u
> >> >
> >> 
> >> Ironically, ZUFS is one of two new filesystems that were discussed in LSFMM19,
> >> where the community reactions rhyme with the reactions to composefs.
> >> The discussion on Incremental FS resembles composefs case even more [1].
> >> AFAIK, Android is still maintaining Incremental FS out-of-tree.
> >> 
> >> Alexander and Giuseppe,
> >> 
> >> I'd like to join Gao is saying that I think it is in the best interest
> >> of everyone,
> >> composefs developers and prospect users included,
> >> if the composefs requirements would drive improvement to existing
> >> kernel subsystems rather than adding a custom filesystem driver
> >> that partly duplicates other subsystems.
> >> 
> >> Especially so, when the modifications to existing components
> >> (erofs and overlayfs) appear to be relatively minor and the maintainer
> >> of erofs is receptive to new features and happy to collaborate with you.
> >> 
> >> w.r.t overlayfs, I am not even sure that anything needs to be modified
> >> in the driver.
> >> overlayfs already supports "metacopy" feature which means that an upper layer
> >> could be composed in a way that the file content would be read from an arbitrary
> >> path in lower fs, e.g. objects/cc/XXX.
> >> 
> >> I gave a talk on LPC a few years back about overlayfs and container images [2].
> >> The emphasis was that overlayfs driver supports many new features, but userland
> >> tools for building advanced overlayfs images based on those new features are
> >> nowhere to be found.
> >> 
> >> I may be wrong, but it looks to me like composefs could potentially
> >> fill this void,
> >> without having to modify the overlayfs driver at all, or maybe just a
> >> little bit.
> >> Please start a discussion with overlayfs developers about missing driver
> >> features if you have any.
> >
> > Surprising that I and others weren't Cced on this given that we had a
> > meeting with the main developers and a few others where we had said the
> > same thing. I hadn't followed this. 
> 
> well that wasn't done on purpose, sorry for that.

I understand. I was just surprised given that I very much work on the
vfs on a day to day basis.

> 
> After our meeting, I thought it was clear that we have different needs
> for our use cases and that we were going to submit composefs upstream,
> as we did, to gather some feedbacks from the wider community.
> 
> Of course we looked at overlay before we decided to upstream composefs.
> 
> Some of the use cases we have in mind are not easily doable, some others
> are not possible at all.  metacopy is a good starting point, but from
> user space it works quite differently than what we can do with
> composefs.
> 
> Let's assume we have a git like repository with a bunch of files stored
> by their checksum and that they can be shared among different containers.
> 
> Using the overlayfs model:
> 
> 1) We need to create the final image layout, either using reflinks or
> hardlinks:
> 
> - reflinks: we can reflect a correct st_nlink value for the inode but we
>   lose page cache sharing.
> 
> - hardlinks: make the st_nlink bogus.  Another problem is that overlay
>   expects the lower layer to never change and now st_nlink can change
>   for files in other lower layers.
> 
> These operations have a cost.  Even if we all the files are already
> available locally, we still need at least one operation per file to
> create it, and more than one if we start tweaking the inode metadata.

Which you now encode in a manifest file that changes properties on a
per-file basis without any vfs involvement, which makes me pretty uneasy.

If you combine overlayfs with idmapped mounts, you can already change
ownership on a fairly granular basis.

If you need additional per-file ownership, use overlayfs, which gives you
the ability to change file attributes on a per-file, per-container basis.

> 
> 2) no multi repo support:
> 
> Both reflinks and hardlinks do not work across mount points, so we

Just fwiw, afaict reflinks work across mount points since at least 5.18.

> cannot have images that span multiple file systems; one common use case
> is to have a network file system to share some images/files and be able
> to use files from there when they are available.
> 
> At the moment we deduplicate entire layers, and with overlay we can do
> something like the following without problems:
> 
> # mount overlay -t overlay -olowerdir=/first/disk/layer1:/second/disk/layer2
> 
> but this won't work with the files granularity we are looking at.  So in
> this case we need to do a full copy of the files that are not on the
> same file system.
> 
> 3) no support for fs-verity.  No idea how overlay could ever support it,
> it doesn't fit there.  If we want this feature we need to look at
> another RO file system.
> 
> We looked at EROFS since it is already upstream but it is quite
> different than what we are doing as Alex already pointed out.
> 
> Sure we could bloat EROFS and add all the new features there, after all
> composefs is quite simple, but I don't see how this is any cleaner than
> having a simple file system that does just one thing.
> 
> On top of what was already said: I wish at some point we can do all of
> this from a user namespace.  That is the main reason for having an easy
> on-disk format for composefs.  This seems much more difficult to achieve

I'm pretty skeptical of this plan of adding more filesystems
that are mountable by unprivileged users. FUSE and overlayfs are
adventurous enough, and they don't have their own on-disk format. The
track record of bugs exploitable due to userns isn't making this
very attractive.


* Re: [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem
  2023-01-17 15:27                     ` Christian Brauner
@ 2023-01-18  0:22                       ` Dave Chinner
  2023-01-18  1:27                         ` Gao Xiang
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2023-01-18  0:22 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Giuseppe Scrivano, Amir Goldstein, Gao Xiang, Alexander Larsson,
	linux-fsdevel, linux-kernel, Miklos Szeredi, Yurii Zubrytskyi,
	Eugene Zemtsov, Vivek Goyal, Al Viro

On Tue, Jan 17, 2023 at 04:27:56PM +0100, Christian Brauner wrote:
> On Tue, Jan 17, 2023 at 02:56:56PM +0100, Giuseppe Scrivano wrote:
> > Christian Brauner <brauner@kernel.org> writes:
> > 2) no multi repo support:
> > 
> > Both reflinks and hardlinks do not work across mount points, so we
> 
> Just fwiw, afaict reflinks work across mount points since at least 5.18.

That might work for NFS server *file clones* across different exports
within the same NFS server (or server cluster), but they most
certainly don't work across mountpoints for local filesystems, or
across different types of filesystems.

I'm not here to advocate composefs as the right solution; I'm
just pointing out that the proposed alternatives do not, in any way,
have the same critical behavioural characteristics that composefs
provides to container orchestration systems, and hence do not solve the
problems that composefs is attempting to solve.

In short: any solution that requires userspace to create a new
filesystem hierarchy one file at a time via standard syscall
mechanisms is not going to perform acceptably at scale - that's a
major problem that composefs addresses.

The whole problem with file copying to create images - even with
reflinks or hardlinks avoiding data copying - is the overhead of
creating and destroying those copies in the first place. A reflink
copy of tens of thousands of files in a complex directory
structure is not free - each individual reflink has a time, CPU,
memory and IO cost to it. The teardown cost is similar - the only
way to remove the "container image" built with reflinks is "rm -rf",
and that has significant time, CPU, memory and IO costs associated
with it as well.

Further, you can't ship container images to remote hosts using
reflink copies - they can only be created at runtime on the host
that the container will be instantiated on. IOWs, the entire cost of
reflink copies for container instances must be taken at container
instantiation and destruction time.

When you have container instances that might only be needed for a
few seconds, taking half a minute to set up the container instance
and then another half a minute to tear it down just isn't viable -
we need instantiation and teardown times in the order of a second or
two.

From my reading of the code, composefs is based around the concept
of a verifiable "shipping manifest", where the filesystem namespace
presented to users by the kernel is derived from the manifest rather
than from some other filesystem namespace. Overlayfs, reflinks, etc.
all use some other filesystem namespace to generate the container
namespace that links to the common data, whilst composefs uses the
manifest for that.

The use of a manifest file means there is almost zero container setup
overhead - ship the manifest file, mount it, all done - and zero
teardown overhead, as unmounting the filesystem is all that is needed
to remove all traces of the container instance from the system.

By having a custom manifest format, the manifest can easily contain
verification information alongside the pointer to the content the
namespace should expose. I.e. the manifest references a secure
content-addressed repository that is protected by fsverity, and the
manifest itself contains the fsverity digests. Hence it doesn't rely on
the repository to self-verify; it actually ensures that the repository
files contain the data the manifest expects them to contain.

Hence if the composefs kernel module is provided with a mechanism
for validating the chain of trust for the manifest file that a user
is trying to mount, then we just don't care who the mounting user
is.  This architecture is a viable path to rootless mounting of
pre-built third party container images.

Also, with the host's content-addressed repository being managed
separately by the trusted host and distro package management, the
manifest need not be unique to a single container host. The distro can
build manifests so that containers are running known, signed and
verified container images built by the distro. The container
orchestration software or admin could also build manifests on demand
and sign them.

If the manifest is not signed, not signed with a key loaded
into the kernel keyring, or does not pass verification, then we
simply fall back to root-in-the-init-ns permissions being required
to mount the manifest. This fallback is exactly the same security
model we have for every other type of filesystem image that the
linux kernel can mount - we trust root not to be mounting malicious
images.
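
In code terms that fallback is tiny - something like this hypothetical
check at mount time (sketch only; the signature-verification side is
the part that actually needs designing):

/* Sketch: hypothetical helper, not from the patchset. */
static int cfs_may_mount(bool manifest_verified)
{
	if (manifest_verified)
		return 0;	/* trusted manifest: any mounter is fine */

	/* untrusted manifest: same rule as any other image - root only */
	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;

	return 0;
}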

Essentially, I don't think any of the filesystems in the Linux
kernel currently provide a viable solution to the problem that
composefs is trying to solve. We need a different way of solving the
ephemeral container namespace creation and destruction overhead
problem. Composefs provides a mechanism that solves this problem and
potentially several others, whilst also being easy to retrofit into
existing production container stacks.

As such, I think composefs is definitely worth further time and
investment as a unique line of filesystem development for Linux.
Solve the chain of trust problem (i.e. crypto signing for the
manifest files) and we potentially have game changing container
infrastructure in a couple of thousand lines of code...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem
  2023-01-18  0:22                       ` Dave Chinner
@ 2023-01-18  1:27                         ` Gao Xiang
  0 siblings, 0 replies; 34+ messages in thread
From: Gao Xiang @ 2023-01-18  1:27 UTC (permalink / raw)
  To: Dave Chinner, Christian Brauner
  Cc: Giuseppe Scrivano, Amir Goldstein, Alexander Larsson,
	linux-fsdevel, linux-kernel, Miklos Szeredi, Yurii Zubrytskyi,
	Eugene Zemtsov, Vivek Goyal, Al Viro



On 2023/1/18 08:22, Dave Chinner wrote:
> On Tue, Jan 17, 2023 at 04:27:56PM +0100, Christian Brauner wrote:
>> On Tue, Jan 17, 2023 at 02:56:56PM +0100, Giuseppe Scrivano wrote:
>>> Christian Brauner <brauner@kernel.org> writes:
>>> 2) no multi repo support:
>>>
>>> Both reflinks and hardlinks do not work across mount points, so we
>>
>> Just fwiw, afaict reflinks work across mount points since at least 5.18.
> 

...

> 
> As such, I think composefs is definitely worth further time and
> investment as a unique line of filesystem development for Linux.
> Solve the chain of trust problem (i.e. crypto signing for the
> manifest files) and we potentially have game changing container
> infrastructure in a couple of thousand lines of code...

I think this is the last time I will write anything in this v2
patchset thread.  At a quick glance at the current v2 patchset:
   
   1) struct cfs_buf {  -> struct erofs_buf;

   2) cfs_buf_put -> erofs_put_metabuf;

   3) cfs_get_buf -> erofs_bread -> (but erofs_read_metabuf() in
                                        v5.17 is much closer);
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs/erofs/data.c?h=linux-5.17.y

   4) cfs_dentry_s -> erofs_dirent;

   ...

Also, it drops the EROFS __leXX on-disk types and uses plain uXX
types instead, which is buggy with respect to endianness.

It replaces the iomap/fscache interface with a stackable file
interface, it doesn't have ACL support, and I don't have time to
look into anything more.

That is my current view of the current Composefs. Yes, you can
use/fork any code from open-source projects, but right now it
looks like an immature, truncated copy of EROFS, and its cover
letter never mentions EROFS at all.

I'd suggest you refactor the similar code (if you claim this is not
another EROFS) before it is actually upstreamed, otherwise I would
feel uneasy as well.  Apart from that, again, I have no objection
if folks want a new read-only stackable filesystem like this.

Apart from the codebase, I do hope there can be some discussion of
this topic at LSF/MM/BPF 2023, as Amir suggested, because I don't
think this overlay model is really safe without fs-verity
enforcement.

Thanks, all, for your time.  I'm done.

Thanks,
Gao Xiang

> 
> Cheers,
> 
> Dave.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 2/6] composefs: Add on-disk layout
  2023-01-17 12:11         ` Alexander Larsson
@ 2023-01-18  3:08           ` Dave Chinner
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2023-01-18  3:08 UTC (permalink / raw)
  To: Alexander Larsson; +Cc: linux-fsdevel, linux-kernel, gscrivan

On Tue, Jan 17, 2023 at 01:11:33PM +0100, Alexander Larsson wrote:
> On Tue, 2023-01-17 at 10:06 +1100, Dave Chinner wrote:
> > On Mon, Jan 16, 2023 at 12:00:03PM +0100, Alexander Larsson wrote:
> > > On Mon, 2023-01-16 at 12:29 +1100, Dave Chinner wrote:
> > > > On Fri, Jan 13, 2023 at 04:33:55PM +0100, Alexander Larsson
> > > > wrote:
> > > > > +} __packed;
> > > > > +
> > > > > +struct cfs_header_s {
> > > > > +       u8 version;
> > > > > +       u8 unused1;
> > > > > +       u16 unused2;
> > > > 
> > > > Why are you hyper-optimising these structures for minimal space
> > > > usage? This is 2023 - we can use a __le32 for the version number,
> > > > the magic number and then leave....
> > > > 
> > > > > +
> > > > > +       u32 magic;
> > > > > +       u64 data_offset;
> > > > > +       u64 root_inode;
> > > > > +
> > > > > +       u64 unused3[2];
> > > > 
> > > > a whole heap of space to round it up to at least a CPU cacheline
> > > > size using something like "__le64 unused[15]".
> > > > 
> > > > That way we don't need packed structures nor do we care about
> > > > having
> > > > weird little holes in the structures to fill....
> > > 
> > > Sure.
> > 
> > FWIW, now I see how this is used, this header kinda defines what
> > we'd call the superblock in the on-disk format of a filesystem. It's
> > at a fixed location in the image file, so there should be a #define
> > somewhere in this file to document its fixed location.
> 
> It is at offset zero. I don't really think that needs a define, does
> it? Maybe a comment though.

Having the code use magic numbers for accessing fixed structures
(e.g. the hard coded 0 in the superblock read function)
is generally considered bad form.

If someone needs to understand how an image file is laid out, where
do they look to find where structures are physically located? Should
it be defined in a header file that is easy to find, or should they
have to read all the code to find where the magic number is embedded
in the code that defines the location of critical structures?
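
For example, something as small as this would do (illustrative name,
not taken from the actual patchset):

	/* The composefs header/superblock is stored at the start of the image. */
	#define CFS_HEADER_OFFSET	0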


> > Also, if this is the in-memory representation of the structure and
> > not the actual on-disk format, why does it even need padding,
> > packing or even store the magic number?
> 
> In this case it is the on-disk format though.

Yeah, that wasn't obvious at first glance.

> > > > > +} __packed;
> > > > > +
> > > > > +enum cfs_inode_flags {
> > > > > +       CFS_INODE_FLAGS_NONE = 0,
> > > > > +       CFS_INODE_FLAGS_PAYLOAD = 1 << 0,
> > > > > +       CFS_INODE_FLAGS_MODE = 1 << 1,
> > > > > +       CFS_INODE_FLAGS_NLINK = 1 << 2,
> > > > > +       CFS_INODE_FLAGS_UIDGID = 1 << 3,
> > > > > +       CFS_INODE_FLAGS_RDEV = 1 << 4,
> > > > > +       CFS_INODE_FLAGS_TIMES = 1 << 5,
> > > > > +       CFS_INODE_FLAGS_TIMES_NSEC = 1 << 6,
> > > > > +       CFS_INODE_FLAGS_LOW_SIZE = 1 << 7, /* Low 32bit of
> > > > > st_size
> > > > > */
> > > > > +       CFS_INODE_FLAGS_HIGH_SIZE = 1 << 8, /* High 32bit of
> > > > > st_size */
> > > > 
> > > > Why do we need to complicate things by splitting the inode size
> > > > like this?
> > > > 
> > > 
> > > The goal is to minimize the image size for a typical rootfs or
> > > container image. Almost zero files in any such images are > 4GB. 
> > 
> > Sure, but how much space does this typically save, versus how much
> > complexity it adds to runtime decoding of inodes?
> > 
> > I mean, in a dense container system the critical resources that need
> > to be saved are runtime memory and CPU overhead of operations, not
> > the storage space. Saving 30-40 bytes of storage space per inode
> > means a typical image might be a few MB smaller, but given the
> > image file is not storing data we're only talking about images that
> > use maybe 500 bytes of data per inode. Storage space for images
> > is not a limiting factor, nor is network transmission (because
> > compression), so it comes back to runtime CPU and memory usage.
> 
> Here are some example sizes of composefs images with the current packed
> inodes: 
> 
> 6.2M cs9-developer-rootfs.composefs
> 2.1M cs9-minimal-rootfs.composefs
> 1.2M fedora-37-container.composefs
> 433K ubuntu-22.04-container.composefs
> 
> If we set all the flags for the inodes (i.e. fixed size inodes) we get:
> 
> 8.8M cs9-developer-rootfs.composefs
> 3.0M cs9-minimal-rootfs.composefs
> 1.6M fedora-37-container.composefs
> 625K ubuntu-22.04-container.composefs
> 
> So, images are about 40% larger with fixed size inodes.

40% sounds like a lot, but considering the size magnitude of the
image files I'd say we just don't care about a few hundred KB to a
couple of MB of extra space usage. Indeed, we'll use much more than
40% extra space on XFS internally via speculative EOF preallocation
when writing those files to disk....

Also, I don't think that this is an issue for shipping them across
the network or archiving the images for the long term: compression
should remove most of the extra zeros.

Hence I'm still not convinced that the complexity of conditional
field storage is worth the decrease in image file size...
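
To make the trade-off concrete, here is a sketch of the two decode
styles (hypothetical variable and struct names, not the actual
composefs decoder):

	/* Conditional decode: each optional field costs a flag test and a
	 * branch, and the offset of every field depends on which flags
	 * precede it, so nothing can be loaded from a fixed offset. */
	if (flags & CFS_INODE_FLAGS_UIDGID) {
		uid = get_unaligned_le32(data); data += 4;
		gid = get_unaligned_le32(data); data += 4;
	} else {
		uid = default_uid;
		gid = default_gid;
	}

	/* Fixed-size decode: straight-line, naturally aligned loads. */
	uid = le32_to_cpu(disk_inode->uid);
	gid = le32_to_cpu(disk_inode->gid);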

> > The inodes are decoded out of the page cache, so the memory for the
> > raw inode information is volatile and reclaimed when needed.
> > Similarly, the VFS inode built from this information is reclaimable
> > when not in use, too. So the only real overhead for runtime is the
> > decoding time to find the inode in the image file and then decode
> > it.
> 
> I disagree with this characterization. It is true that page cache is
> volatile, but if you can fit 40% less inode data in the page cache then
> there is additional overhead where you need to read this from disk. So,
> decoding time is not the only thing that affects overhead.

True, but the page cache is a secondary cache for inodes - if you
are relying on secondary caches for performance then you've already
lost because it means the primary cache is not functioning
effectively for your production workload.

> Additionally, just by being larger and less dense, more data has to be
> read from disk, which itself is slower.

That's a surprisingly common fallacy.

e.g. we can do a 64kB read IO for only 5% more time and CPU cost
than a 4kB read IO. This means we can pull 16x as much information
into the cache for almost no extra cost. This has been true since
spinning disks were invented more than 4 decades ago, but it's still
true with modern SSDs (for different reasons).

A 64kB IO is going to allow more inodes to be brought into the cache
for effectively the same IO cost, yet it provides a 16x improvement
in subsequent cache hit probability compared to doing 4kB IO. In
comparison, saving 40% in object size only improves the cache hit
probability for the same IO by ~1.5x....
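
To put rough numbers on that, using the ~500 bytes of inode data per
object mentioned above: a 4kB IO brings in roughly 8 such objects,
while a 64kB IO brings in roughly 128 - the 16x. Shrinking the
objects by 40% (to roughly 300 bytes) only lifts the 4kB figure to
about 13 objects per IO, i.e. the ~1.5x.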

Hence I don't consider object density a primary issue for
secondary IO caches; what matters is how many objects you can bring
into cache per IO, and how likely a primary level cache miss for
those objects will be in the near future before memory reclaim
removes them from the cache again.

As an example of this, the XFS inode allocation layout and caching
architecture is from the early 1990s, and it is a direct embodiment
of the above principle. We move inodes in and out of the
secondary cache in clusters of 32 inodes (16KB IOs) because it is
much more CPU and IO efficient than doing it in 4kB IOs... 

> > Given that decoding of the inode is all branches and not
> > straight-line code, it cannot be well optimised and the CPU branch
> > predictor is not going to get it right every time. Straight-line
> > code that decodes every field whether it is zero or not is going to
> > be faster.
> >
> > Further, with a fixed size inode in the image file, the inode table
> > can be entirely fixed size, getting rid of the whole unaligned data
> > retrieval problem that code currently has (yes, all that
> > "le32_to_cpu(__get_unaligned(__le32, data)" code) because we can
> > ensure that all the inode fields are aligned in the data pages. This
> > will significantly speed up decoding into the in-memory inode
> > structures.
> 
> I agree it could be faster. But is inode decode actually the limiting
> factor, compared to things like disk i/o or better use of page cache?

The limiting factor in filesystem lookup paths tends to be CPU usage.
It's spread across many parts of the kernel, but every bit we can
save makes a difference. Especially on a large server running
thousands of containers - the less CPU we use doing inode lookup and
instantiation, the more CPU there is for the user workloads. We are
rarely IO limited on machines like this, and as SSDs get even faster
in the near future, that's going to be even less of a problem than
it is now.

> > > > > +struct cfs_dir_s {
> > > > > +       u32 n_chunks;
> > > > > +       struct cfs_dir_chunk_s chunks[];
> > > > > +} __packed;
> > > > 
> > > > So directory data is packed in discrete chunks? Given that this
> > > > is a
> > > > static directory format, and the size of the directory is known
> > > > at
> > > > image creation time, why does the storage need to be chunked?
> > > 
> > > We chunk the data such that each chunk fits inside a single page in
> > > the
> > > image file. I did this to make accessing image data directly from
> > > the
> > > page cache easier.
> > 
> > Hmmmm. So you defined a -block size- that matched the x86-64 -page
> > size- to avoid page cache issues.  Now, what about ARM or POWER
> > which has 64kB page sizes?
> > 
> > IOWs, "page size" is not the same on all machines, whilst the
> > on-disk format for a filesystem image needs to be the same on all
> > machines. Hence it appears that this:
> > 
> > > > > +#define CFS_MAX_DIR_CHUNK_SIZE 4096
> > 
> > should actually be defined in terms of the block size for the
> > filesystem image, and the size of these dir chunks should be
> > recorded in the superblock of the filesystem image. That way it
> > is clear that the image has a specific chunk size, and it also paves
> > the way for supporting more efficient directory structures using
> > larger-than-page size chunks in future.
> 
> Yes, it's true that assuming a (min) 4k page size is wasteful on some
> arches, but it would be hard to read a filesystem created for 64k pages
> on a 4k page machine, which is not ideal. However, wrt your comment on
> multi-page mappings, maybe we can just totally drop these limits. I'll
> have a look at that.

It's not actually that hard - just read in all the pages into the
page cache, look them up, map them, do the operation, unmap them.

After all, you already have a cfs_buf that you could store a page
array in, and then you have an object that you can use for single
pages (on a 64kB machine) or 16 pages (on a 4kB page machine) without
the code that is walking the buffers caring about the underlying page
size. This is exactly what we do with the struct xfs_buf. :)
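
For reference, a rough sketch of that buffer-mapping pattern -
illustrative only, loosely modelled on the XFS approach rather than
taken from either codebase:

	/* Sketch: map an N-page buffer to a virtually contiguous address.
	 * A single page can use kmap_local_page(); multiple pages can be
	 * mapped with vm_map_ram().  (The unmap path would have to
	 * distinguish the two cases accordingly; declarations come from
	 * linux/highmem.h and linux/vmalloc.h.) */
	struct buf_sketch {
		struct page	*pages[16];	/* e.g. 16 x 4kB = one 64kB chunk */
		unsigned int	nr_pages;
		void		*addr;		/* virtually contiguous mapping */
	};

	static int buf_map_sketch(struct buf_sketch *buf)
	{
		if (buf->nr_pages == 1) {
			buf->addr = kmap_local_page(buf->pages[0]);
			return 0;
		}
		buf->addr = vm_map_ram(buf->pages, buf->nr_pages, NUMA_NO_NODE);
		return buf->addr ? 0 : -ENOMEM;
	}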

> 
> > > If we had dirent data spanning multiple pages
> > > then we would either need to map the pages consecutively (which
> > > seems
> > > hard/costly) or have complex in-kernel code to handle the case
> > > where a
> > > dirent straddles two pages.
> > 
> > Actually pretty easy - we do this with XFS for multi-page directory
> > buffers. We just use vm_map_ram() on a page array at the moment,
> > but in the near future there will be other options based on
> > multipage folios.
> > 
> > That is, the page cache now stores folios rather than pages, and is
> > capable of using contiguous multi-page folios in the cache. As a
> > result, multipage folios could be used to cache multi-page
> > structures in the page cache and efficiently map them as a whole.
> > 
> > That mapping code isn't there yet - kmap_local_folio() only maps the
> > page within the folio at the offset given - but the foundation is
> > there for supporting this functionality natively....
> > 
> > I certainly wouldn't be designing a new filesystem these days that
> > has its on-disk format constrained by the x86-64 4kB page size...
> 
> Yes, I agree. I'm gonna look at using multi-page mapping for both
> dirents and xattr data, which should completely drop these limits, as
> well as get rid of the dirent chunking.

That will be interesting to see :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem
  2023-01-17 10:12                 ` Christian Brauner
  2023-01-17 10:30                   ` Gao Xiang
  2023-01-17 13:56                   ` Giuseppe Scrivano
@ 2023-01-20  9:22                   ` Alexander Larsson
  2 siblings, 0 replies; 34+ messages in thread
From: Alexander Larsson @ 2023-01-20  9:22 UTC (permalink / raw)
  To: Christian Brauner, Amir Goldstein
  Cc: Gao Xiang, linux-fsdevel, linux-kernel, gscrivan, Miklos Szeredi,
	Yurii Zubrytskyi, Eugene Zemtsov, Vivek Goyal

On Tue, 2023-01-17 at 11:12 +0100, Christian Brauner wrote:
> On Tue, Jan 17, 2023 at 09:05:53AM +0200, Amir Goldstein wrote:
> > > It seems rather like another incomplete EROFS from several points
> > > of view.  Also see:
> > > https://lore.kernel.org/all/1b192a85-e1da-0925-ef26-178b93d0aa45@plexistor.com/T/#u
> > > 
> > 
> > Ironically, ZUFS is one of two new filesystems that were discussed
> > at LSFMM19, where the community reactions rhyme with the reactions
> > to composefs.
> > The discussion on Incremental FS resembles the composefs case even
> > more [1].
> > AFAIK, Android is still maintaining Incremental FS out-of-tree.
> > 
> > Alexander and Giuseppe,
> > 
> > I'd like to join Gao in saying that I think it is in the best
> > interest of everyone, composefs developers and prospective users
> > included, if the composefs requirements would drive improvements to
> > existing kernel subsystems rather than adding a custom filesystem
> > driver that partly duplicates other subsystems.
> > 
> > Especially so, when the modifications to existing components
> > (erofs and overlayfs) appear to be relatively minor and the
> > maintainer
> > of erofs is receptive to new features and happy to collaborate with
> > you.
> > 
> > w.r.t. overlayfs, I am not even sure that anything needs to be
> > modified in the driver.
> > overlayfs already supports the "metacopy" feature, which means that
> > an upper layer could be composed in a way that the file content
> > would be read from an arbitrary path in the lower fs, e.g.
> > objects/cc/XXX.
> > 
> > I gave a talk at LPC a few years back about overlayfs and container
> > images [2].
> > The emphasis was that the overlayfs driver supports many new
> > features, but userland tools for building advanced overlayfs images
> > based on those new features are nowhere to be found.
> > 
> > I may be wrong, but it looks to me like composefs could potentially
> > fill this void, without having to modify the overlayfs driver at
> > all, or maybe just a little bit.
> > Please start a discussion with overlayfs developers about missing
> > driver features if you have any.
> 
> Surprising that I and others weren't Cced on this given that we had a
> meeting with the main developers and a few others where we had said
> the
> same thing. I hadn't followed this. 

Sorry about that, I'm just not very used to the kernel submission
mechanism. I'll CC you on the next version.

> 
> We have at least 58 filesystems currently in the kernel (and that's a
> conservative count just based on going by obvious directories and
> ignoring most virtual filesystems).
> 
> A non-insignificant portion is probably slowly rotting away with
> little in the way of fixes coming in, few users, and not much
> attention being paid to syzkaller reports for them if they show up. I
> haven't quantified this, of course.
> 
> Taking a new filesystem into the kernel in the worst case means that
> it's being dumped there once and will slowly become unmaintained. Then
> we'll have a few users for the next 20 years and we can't reasonably
> deprecate it (Maybe that's another good topic: How should we fade out
> filesystems?).
> 
> Of course, for most fs developers it probably doesn't matter how many
> other filesystems there are in the kernel (aside from maybe competing
> for the same users).
> 
> But for developers who touch the vfs, every new filesystem may
> increase the cost of maintaining and reworking existing
> functionality, or of adding new functionality. That makes it more
> likely to accumulate hacks, add workarounds, or be flat-out unable to
> kill off infrastructure that should reasonably go away. Maybe this is
> an unfair complaint, but just from experience a new filesystem
> potentially means one or two weeks to make a larger vfs change.
> 
> I want to stress that I'm not at all saying "no more new fs" but we
> should be hesitant before we merge new filesystems into the kernel.

Well, it sure reads as "no more new fs" to me. But I understand that
there is hesitation towards this. The new version will be even simpler
(based on feedback from Dave), weighing in at < 2000 lines. Hopefully
this will make it easier to review and maintain, somewhat countering
the cost of yet another filesystem.

> Especially for filesystems that are tailored to special use-cases.
> Every few years another filesystem tailored to container use-cases
> shows
> up. And frankly, a good portion of the issues that they are trying to
> solve are caused by design choices in userspace.

Well, we have at least two use cases, but sure, it is not a general
purpose filesystem.

> And I have to say I'm especially NAK-friendly about anything that
> comes even close to yet another stacking filesystem or anything that
> layers on top of a lower filesystem/mount such as ecryptfs, ksmbd,
> and overlayfs. They are hard to get right, with lots of corner cases,
> and they cause the most headaches when making vfs changes.

I can't disagree here, because I'm not a vfs maintainer, but I will say
that composefs is fundamentally much simpler than these examples: first
because it is completely read-only, and second because it doesn't
rely on the lower filesystem for anything but file content (i.e. lower
fs metadata or directory structure doesn't affect the upper fs).

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                           Red Hat, Inc
       alexl@redhat.com            alexander.larsson@gmail.com
He's a jaded white trash astronaut haunted by an iconic dead American
confidante. She's a brilliant extravagant femme fatale who can talk to
animals. They fight crime!


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2023-01-20  9:25 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-13 15:33 [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem Alexander Larsson
2023-01-13 15:33 ` [PATCH v2 1/6] fsverity: Export fsverity_get_digest Alexander Larsson
2023-01-13 15:33 ` [PATCH v2 2/6] composefs: Add on-disk layout Alexander Larsson
2023-01-16  1:29   ` Dave Chinner
2023-01-16 11:00     ` Alexander Larsson
2023-01-16 23:06       ` Dave Chinner
2023-01-17 12:11         ` Alexander Larsson
2023-01-18  3:08           ` Dave Chinner
2023-01-13 15:33 ` [PATCH v2 3/6] composefs: Add descriptor parsing code Alexander Larsson
2023-01-13 15:33 ` [PATCH v2 4/6] composefs: Add filesystem implementation Alexander Larsson
2023-01-13 21:55   ` kernel test robot
2023-01-16 22:07   ` Al Viro
2023-01-17 13:29     ` Alexander Larsson
2023-01-13 15:33 ` [PATCH v2 5/6] composefs: Add documentation Alexander Larsson
2023-01-14  3:20   ` Bagas Sanjaya
2023-01-16 12:38     ` Alexander Larsson
2023-01-13 15:33 ` [PATCH v2 6/6] composefs: Add kconfig and build support Alexander Larsson
2023-01-16  4:44 ` [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem Gao Xiang
2023-01-16  9:30   ` Alexander Larsson
2023-01-16 10:19     ` Gao Xiang
2023-01-16 12:33       ` Alexander Larsson
2023-01-16 13:26         ` Gao Xiang
2023-01-16 14:18           ` Giuseppe Scrivano
2023-01-16 15:27           ` Alexander Larsson
2023-01-17  0:12             ` Gao Xiang
2023-01-17  7:05               ` Amir Goldstein
2023-01-17 10:12                 ` Christian Brauner
2023-01-17 10:30                   ` Gao Xiang
2023-01-17 13:56                   ` Giuseppe Scrivano
2023-01-17 14:28                     ` Gao Xiang
2023-01-17 15:27                     ` Christian Brauner
2023-01-18  0:22                       ` Dave Chinner
2023-01-18  1:27                         ` Gao Xiang
2023-01-20  9:22                   ` Alexander Larsson
