Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
* Initial patches for Incremental FS
@ 2019-05-02  4:03 ezemtsov
  2019-05-02  4:03 ` [PATCH 1/6] incfs: Add first files of incrementalfs ezemtsov
                   ` (6 more replies)
  0 siblings, 7 replies; 33+ messages in thread
From: ezemtsov @ 2019-05-02  4:03 UTC (permalink / raw)
  To: linux-fsdevel, ezemtsov; +Cc: tytso

Hi All,

Please take a look at Incremental FS.

Incremental FS is special-purpose Linux virtual file system that allows
execution of a program while its binary and resource files are still being
lazily downloaded over the network, USB etc. It is focused on incremental
delivery for a small number (under 100) of big files (more than 10 megabytes each).
Incremental FS doesn’t allow direct writes into files and, once loaded, file
content never changes. Incremental FS doesn’t use a block device, instead it
saves data into a backing file located on a regular file-system.

What’s it for?

It allows running big Android apps before their binaries and resources are
fully loaded to an Android device. If an app reads something not loaded yet,
it needs to wait for the data block to be fetched, but in most cases hot blocks
can be loaded in advance and apps can run smoothly and almost instantly.

More details can be found in Documentation/filesystems/incremental.fs

Coming up next:
[PATCH 1/6] incfs: Add first files of incrementalfs
[PATCH 2/6] incfs: Backing file format
[PATCH 3/6] incfs: Management of in-memory FS data structures
[PATCH 4/6] incfs: Integration with VFS layer
[PATCH 5/6] incfs: sample data loader for incremental-fs
[PATCH 6/6] incfs: Integration tests for incremental-fs

Thanks,
Eugene.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 1/6] incfs: Add first files of incrementalfs
  2019-05-02  4:03 Initial patches for Incremental FS ezemtsov
@ 2019-05-02  4:03 ` ezemtsov
  2019-05-02 19:06   ` Miklos Szeredi
                     ` (4 more replies)
  2019-05-02  4:03 ` [PATCH 2/6] incfs: Backing file format ezemtsov
                   ` (5 subsequent siblings)
  6 siblings, 5 replies; 33+ messages in thread
From: ezemtsov @ 2019-05-02  4:03 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tytso, Eugene Zemtsov

From: Eugene Zemtsov <ezemtsov@google.com>

- fs/incfs dir
- Kconfig (CONFIG_INCREMENTAL_FS)
- Makefile
- Module and file system initialization and clean up code
- New MAINTAINERS entry
- Add incrementalfs.h UAPI header
- Register ioctl range in ioctl-numbers.txt
- Documentation

Signed-off-by: Eugene Zemtsov <ezemtsov@google.com>
---
 Documentation/filesystems/incrementalfs.rst | 452 ++++++++++++++++++++
 Documentation/ioctl/ioctl-number.txt        |   1 +
 MAINTAINERS                                 |   7 +
 fs/Kconfig                                  |   1 +
 fs/Makefile                                 |   1 +
 fs/incfs/Kconfig                            |  10 +
 fs/incfs/Makefile                           |   4 +
 fs/incfs/main.c                             |  85 ++++
 fs/incfs/vfs.c                              |  37 ++
 include/uapi/linux/incrementalfs.h          | 189 ++++++++
 10 files changed, 787 insertions(+)
 create mode 100644 Documentation/filesystems/incrementalfs.rst
 create mode 100644 fs/incfs/Kconfig
 create mode 100644 fs/incfs/Makefile
 create mode 100644 fs/incfs/main.c
 create mode 100644 fs/incfs/vfs.c
 create mode 100644 include/uapi/linux/incrementalfs.h

diff --git a/Documentation/filesystems/incrementalfs.rst b/Documentation/filesystems/incrementalfs.rst
new file mode 100644
index 000000000000..682e3dcb6b5a
--- /dev/null
+++ b/Documentation/filesystems/incrementalfs.rst
@@ -0,0 +1,452 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+Incremental File System
+=======================
+
+Overview
+========
+Incremental FS is special-purpose Linux virtual file system that allows
+execution of a program while its binary and resource files are still being
+lazily downloaded over the network, USB etc. It is focused on incremental
+delivery for a small number (under 100) of big files (more than 10 megabytes).
+Incremental FS doesn’t allow direct writes into files and, once loaded, file
+content never changes. Incremental FS doesn’t use a block device, instead it
+saves data into a backing file located on a regular file-system.
+
+But why?
+--------
+To allow running **big** Android apps before their binaries and resources are
+fully downloaded to an Android device. If an app reads something not loaded yet,
+it needs to wait for the data block to be fetched, but in most cases hot blocks
+can be loaded in advance.
+
+Workflow
+--------
+A userspace process, called a data loader, mounts an instance of incremental-fs
+giving it a file descriptor on an underlying file system (like ext4 or f2fs).
+Incremental-fs reads content (if any) of this backing file and interprets it as
+a file system image with files, directories and data blocks. At this point
+the data loader can declare new files to be shown by incremental-fs.
+
+A process is started from a binary located on incremental-fs.
+All reads are served directly from the backing file
+without roundtrips into userspace. If the process accesses a data block that was
+not originally present in the backing file, the read operation waits.
+
+Meanwhile the data loader can feed new data blocks to incremental-fs by calling
+write() on a special .cmd pseudo-file. The data loader can request information
+about pending reads by calling poll() and read() on the .cmd pseudo-file.
+This mechanism allows the data loader to serve most urgently needed data first.
+Once a data block is given to incremental-fs, it saves it to the backing file
+and unblocks all the reads waiting for this block.
+
+Eventually all data for all files is uploaded by the data loader, and saved by
+incremental-fs into the backing file. At that moment the data loader is not
+needed any longer. The backing file will play the role of a complete
+filesystem image for all future runs of the program.
+
+Non-goals
+---------
+* Allowing direct writes by the executing processes into files on incremental-fs
+* Allowing the data loader change file size or content after it was loaded.
+* Having more than a couple hundred files and directories.
+
+
+Features
+========
+
+Read-only, but not unchanging
+-----------------------------
+On the surface a mount directory of incremental-fs would look similar to
+a read-only instance of network file system: files and directories can be
+listed and read, but can’t be directly created or modified via creat() or
+write(). At the same time the data loader can make changes to a directory
+structure via external ioctl-s. i.e. link and unlink files and directories
+(if they empty). Data can't be changed this way, once a file block is loaded
+there is no way to change it.
+
+Filesystem image in a backing file
+----------------------------------
+Instead of using a block device, all data and metadata is stored in a
+backing file provided as a mount parameter. The backing file is located on
+an underlying file system (like ext4 or f2fs). Such approach is very similar
+to what might be achieved by using loopback device with a traditional file
+system, but it avoids extra set-up steps and indirections. It also allows
+incremental-fs image to dynamically grow as new files and data come without
+having to do any extra steps for resizing.
+
+If the backing file contains data at the moment when incremental-fs is mounted,
+content of the backing file is being interpreted as filesystem image.
+New files and data can still be added through the external interface,
+and they will be saved to the backing file.
+
+Data compression
+----------------
+Incremental-fs can store compressed data. In this case each 4KB data block is
+compressed separately. Data blocks can be provided to incremental-fs by
+the data loader in a compressed form. Incremental-fs uncompresses blocks
+each time a executing process reads it (modulo page cache). Compression also
+takes care of blocks composed of all zero bytes removing necessity to handle
+this case separately.
+
+Partially present files
+-----------------------
+Data in the files consists of 4KB blocks, each block can be present or absent.
+Unlike in sparse files, reading an absent block doesn’t return all zeros.
+It waits for the data block to be loaded via the ioctl interface
+(respecting a timeout). Once a data block is loaded it never disappears
+and can’t be changed or erased from a file. This ability to frictionlessly
+wait for temporary missing data is the main feature of incremental-fs.
+
+Hard links. Multiple names for the same file
+--------------------------------------------
+Like all traditional UNIX file systems, incremental-fs supports hard links,
+i.e. different file names in different directories can refer to the same file.
+As mentioned above new hard links can be created and removed via
+the ioctl interface, but actual data files are immutable, modulo partial
+data loading. Each directory can only have at most one name referencing it.
+
+Inspection of incremental-fs internal state
+-------------------------------------------
+poll() and read() on the .cmd pseudo-file allow data loaders to get a list of
+read operations stalled due to lack of a data block (pending reads).
+
+
+Application Programming Interface
+=================================
+
+Regular file system interface
+-----------------------------
+Executing process access files and directories via regular Linux file interface:
+open, read, close etc. All the intricacies of data loading a file representation
+are hidden from them.
+
+External .cmd file interface
+----------------------------
+When incremental-fs is mounted, a mount directory contains a pseudo-file
+called '.cmd'. The data loader will open this file and call read(), write(),
+poll() and ioctl() on it inspect and change state of incremental-fs.
+
+poll() and read() are used by the data loader to wait for pending reads to
+appear and obtain an array of ``struct incfs_pending_read_info``.
+
+write() is used by the data loader to feed new data blocks to incremental-fs.
+A data buffer given to write() is interpreted as an array of
+``struct incfs_new_data_block``. Structs in the array describe locations and
+properties of data blocks loaded with this write() call.
+
+``ioctl(INCFS_IOC_PROCESS_INSTRUCTION)`` is used to change structure of
+incremental-fs. It receives an pointer to ``struct incfs_instruction``
+where type field can have be one of the following values.
+
+**INCFS_INSTRUCTION_NEW_FILE**
+Creates an inode (a file or a directory) without a name.
+It assumes ``incfs_new_file_instruction.file`` is populated with details.
+
+**INCFS_INSTRUCTION_ADD_DIR_ENTRY**
+Creates a name (aka hardlink) for an inode in a directory.
+A directory can't have more than one hardlink pointing to it, but files can be
+linked from different directories.
+It assumes ``incfs_new_file_instruction.dir_entry`` is populated with details.
+
+**INCFS_INSTRUCTION_REMOVE_DIR_ENTRY**
+Remove a name (aka hardlink) for a file from a directory.
+Only empty directories can be unlinked.
+It assumes ``incfs_new_file_instruction.dir_entry`` is populated with details.
+
+For more details see in uapi/linux/incrementalfs.h and samples below.
+
+Supported mount options
+-----------------------
+See ``fs/incfs/options.c`` for more details.
+
+    * ``backing_fd=<unsigned int>``
+        Required. A file descriptor of a backing file opened by the process
+        calling mount(2). This descriptor can be closed after mount returns.
+
+    * ``read_timeout_msc=<unsigned int>``
+        Default: 1000. Timeout in milliseconds before a read operation fails
+        if no data found in the backing file or provided by the data loader.
+
+Sysfs files
+-----------
+``/sys/fs/incremental-fs/version`` - a current version of the filesystem.
+One ASCII encoded positive integer number with a new line at the end.
+
+
+Examples
+--------
+See ``sample_data_loader.c`` for a complete implementation of a data loader.
+
+Mount incremental-fs
+~~~~~~~~~~~~~~~~~~~~
+
+::
+
+    int mount_fs(char *mount_dir, char *backing_file, int timeout_msc)
+    {
+        static const char fs_name[] = INCFS_NAME;
+        char mount_options[512];
+        int backing_fd;
+        int result;
+
+        backing_fd = open(backing_file, O_RDWR);
+        if (backing_fd == -1) {
+            perror("Error in opening backing file");
+            return 1;
+        }
+
+        snprintf(mount_options, ARRAY_SIZE(mount_options),
+            "backing_fd=%u,read_timeout_msc=%u", backing_fd, timeout_msc);
+
+        result = mount(fs_name, mount_dir, fs_name, 0, mount_options);
+        if (result != 0)
+            perror("Error mounting fs.");
+        return result;
+    }
+
+Open .cmd file
+~~~~~~~~~~~~~~
+
+::
+
+    int open_commands_file(char *mount_dir)
+    {
+        char cmd_file[255];
+        int cmd_fd;
+
+        snprintf(cmd_file, ARRAY_SIZE(cmd_file), "%s/.cmd", mount_dir);
+        cmd_fd = open(cmd_file, O_RDWR);
+        if (cmd_fd < 0)
+            perror("Can't open commands file");
+        return cmd_fd;
+    }
+
+Add a file to the file system
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+    int create_file(int cmd_fd, char *filename, int *ino_out, size_t size)
+    {
+        int ret = 0;
+        __u16 ino = 0;
+        struct incfs_instruction inst = {
+                .version = INCFS_HEADER_VER,
+                .type = INCFS_INSTRUCTION_NEW_FILE,
+                .file = {
+                    .size = size,
+                    .mode = S_IFREG | 0555,
+                }
+        };
+
+        ret = ioctl(cmd_fd, INCFS_IOC_PROCESS_INSTRUCTION, &inst);
+        if (ret)
+            return -errno;
+
+        ino = inst.file.ino_out;
+        inst = (struct incfs_instruction){
+                .version = INCFS_HEADER_VER,
+                .type = INCFS_INSTRUCTION_ADD_DIR_ENTRY,
+                .dir_entry = {
+                    .dir_ino = INCFS_ROOT_INODE,
+                    .child_ino = ino,
+                    .name = ptr_to_u64(filename),
+                    .name_len = strlen(filename)
+                }
+            };
+        ret = ioctl(cmd_fd, INCFS_IOC_PROCESS_INSTRUCTION, &inst);
+        if (ret)
+            return -errno;
+        *ino_out = ino;
+        return 0;
+    }
+
+Load data into a file
+~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+    int cmd_fd = open_commands_file(path_to_mount_dir);
+    char *data = get_some_data();
+    struct incfs_new_data_block block;
+    int err;
+
+    block.file_ino = file_ino;
+    block.block_index = 0;
+    block.compression = COMPRESSION_NONE;
+    block.data = (__u64)data;
+    block.data_len = INCFS_DATA_FILE_BLOCK_SIZE;
+
+    err = write(cmd_fd, &block, sizeof(block));
+
+
+Get an array of pending reads
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+    int poll_res = 0;
+    struct incfs_pending_read_info reads[10];
+    int cmd_fd = open_commands_file(path_to_mount_dir);
+    struct pollfd pollfd = {
+        .fd = cmd_fd,
+        .events = POLLIN
+    };
+
+    poll_res = poll(&pollfd, 1, timeout);
+    if (poll_res > 0 && (pollfd.revents | POLLIN)) {
+        ssize_t read_res = read(cmd_fd, reads, sizeof(reads));
+        if (read_res > 0)
+            printf("Waiting reads %ld\n", read_res / sizeof(reads[0]));
+    }
+
+
+
+Ondisk format
+=============
+
+General principles
+------------------
+* The backbone of the incremental-fs ondisk format is an append only linked
+  list of metadata blocks. Each metadata block contains an offset of the next
+  one. These blocks describe files and directories on the
+  file system. They also represent actions of adding and removing file names
+  (hard links).
+  Every time incremental-fs instance is mounted, it reads through this list
+  to recreate filesystem's state in memory. An offset of the first record in the
+  metadata list is stored in the superblock at the beginning of the backing
+  file.
+
+* Most of the backing file is taken by data areas and blockmaps.
+  Since data blocks can be compressed and have different sizes,
+  single per-file data area can't be pre-allocated. That's why blockmaps are
+  needed in order to find a location and size of each data block in
+  the backing file. Each time a file is created, a corresponding block map is
+  allocated to store future offsets of data blocks.
+
+  Whenever a data block is given by data loader to incremental-fs:
+    - A data area with the given block is appended to the end of
+      the backing file.
+    - A record in the blockmap for the given block index is updated to reflect
+      its location, size, and compression algorithm.
+
+Important format details
+------------------------
+Ondisk structures are defined in the ``format.h`` file. They are all packed
+and use little-endian order.
+A backing file must start with ``incfs_super_block`` with ``s_magic`` field
+equal to 0x5346434e49 "INCFS".
+
+Metadata records:
+
+* ``incfs_inode`` - metadata record to declare a file or a directory.
+                    ``incfs_inode.i_mode`` determents if it is a file
+                    or a directory.
+* ``incfs_blockmap_entry`` - metadata record that specifies size and location
+                            of a blockmap area for a given file. This area
+                            contains an array of ``incfs_blockmap_entry``-s.
+* ``incfs_dir_action`` - metadata record that specifies changes made to a
+                    to a directory structure, e.g. add or remove a hardlink.
+* ``incfs_md_header`` - header of a metadata record. It's always a part
+                    of other structures and served purpose of metadata
+                    bookkeeping.
+
+Other ondisk structures:
+
+* ``incfs_super_block`` - backing file header
+* ``incfs_blockmap_entry`` - a record in a blockmap area that describes size
+                        and location of a data block.
+* Data blocks dont have any particular structure, they are written to the backing
+  file in a raw form as they come from a data loader.
+
+
+Backing file layout
+-------------------
+::
+
+              +-------------------------------------------+
+              |            incfs_super_block              |]---+
+              +-------------------------------------------+    |
+              |                 metadata                  |<---+
+              |                incfs_inode                |]---+
+              +-------------------------------------------+    |
+                        .........................              |
+              +-------------------------------------------+    |   metadata
+     +------->|               blockmap area               |    |  list links
+     |        |          [incfs_blockmap_entry]           |    |
+     |        |          [incfs_blockmap_entry]           |    |
+     |        |          [incfs_blockmap_entry]           |    |
+     |    +--[|          [incfs_blockmap_entry]           |    |
+     |    |   |          [incfs_blockmap_entry]           |    |
+     |    |   |          [incfs_blockmap_entry]           |    |
+     |    |   +-------------------------------------------+    |
+     |    |             .........................              |
+     |    |   +-------------------------------------------+    |
+     |    |   |                 metadata                  |<---+
+     +----|--[|               incfs_blockmap              |]---+
+          |   +-------------------------------------------+    |
+          |             .........................              |
+          |   +-------------------------------------------+    |
+          +-->|                 data block                |    |
+              +-------------------------------------------+    |
+                        .........................              |
+              +-------------------------------------------+    |
+              |                 metadata                  |<---+
+              |             incfs_dir_action              |
+              +-------------------------------------------+
+
+Unreferenced files and absence of garbage collection
+----------------------------------------------------
+Described file format can produce files that don't have any names for them in
+any directories. Incremental-fs takes no steps to prevent such situations or
+reclaim space occupied by such files in the backing file. If garbage collection
+is needed it has to be implemented as a separate userspace tool.
+
+
+Design alternatives
+===================
+
+Why isn't incremental-fs implemented via FUSE?
+----------------------------------------------
+TLDR: FUSE-based filesystems add 20-80% of performance overhead for target
+scenarios, and increase power use on mobile beyond acceptable limit
+for widespread deployment. A custom kernel filesystem is the way to overcome
+these limitations.
+
+From the theoretical side of things, FUSE filesystem adds some overhead to
+each filesystem operation that’s not handled by OS page cache:
+
+    * When an IO request arrives to FUSE driver (D), it puts it into a queue
+      that runs on a separate kernel thread
+    * Then another separate user-mode handler process (H) has to run,
+      potentially after a context switch, to read the request from the queue.
+      Reading the request adds a kernel-user mode transition to the handling.
+    * (H) sends the IO request to kernel to handle it on some underlying storage
+      filesystem. This adds a user-kernel and kernel-user mode transition
+      pair to the handling.
+    * (H) then responds to the FUSE request via a write(2) call.
+      Writing the response is another user-kernel mode transition.
+    * (D) needs to read the response from (H) when its kernel thread runs
+      and forward it to the user
+
+Together, the scenario adds 2 extra user-kernel-user mode transition pairs,
+and potentially has up to 3 additional context switches for the FUSE kernel
+thread and the user-mode handler to start running for each IO request on the
+filesystem.
+This overhead can vary from unnoticeable to unmanageable, depending on the
+target scenario. But it will always burn extra power via CPU staying longer
+in non-idle state, handling context switches and mode transitions.
+One important goal for the new filesystem is to be able to handle each page
+read separately on demand, because we don't want to wait and download more data
+than absolutely necessary. Thus readahead would need to be disabled completely.
+This increases the number of separate IO requests and the FUSE related overhead
+by almost 32x (128KB readahead limit vs 4KB individual block operations)
+
+For more info see a 2017 USENIX research paper:
+To FUSE or Not to FUSE: Performance of User-Space File Systems
+Bharath Kumar Reddy Vangoor, Stony Brook University;
+Vasily Tarasov, IBM Research-Almaden;
+Erez Zadok, Stony Brook University
+https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf
diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index c9558146ac58..a5f8e0eaff91 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -227,6 +227,7 @@ Code  Seq#(hex)	Include File		Comments
 'f'	00-0F	fs/ocfs2/ocfs2_fs.h	conflict!
 'g'	00-0F	linux/usb/gadgetfs.h
 'g'	20-2F	linux/usb/g_printer.h
+'g'	30-3F	include/uapi/linux/incrementalfs.h
 'h'	00-7F				conflict! Charon filesystem
 					<mailto:zapman@interlan.net>
 'h'	00-1F	linux/hpet.h		conflict!
diff --git a/MAINTAINERS b/MAINTAINERS
index 5c38f21aee78..c92ad89ee5e5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7630,6 +7630,13 @@ F:	Documentation/hwmon/ina2xx
 F:	drivers/hwmon/ina2xx.c
 F:	include/linux/platform_data/ina2xx.h

+INCREMENTAL FILESYSTEM
+M:	Eugene Zemtsov <ezemtsov@google.com>
+S:	Supported
+F:	fs/incfs/
+F:	include/uapi/linux/incrementalfs.h
+F:	Documentation/filesystems/incrementalfs.rst
+
 INDUSTRY PACK SUBSYSTEM (IPACK)
 M:	Samuel Iglesias Gonsalvez <siglesias@igalia.com>
 M:	Jens Taprogge <jens.taprogge@taprogge.org>
diff --git a/fs/Kconfig b/fs/Kconfig
index 3e6d3101f3ff..19f89c936209 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -119,6 +119,7 @@ source "fs/quota/Kconfig"
 source "fs/autofs/Kconfig"
 source "fs/fuse/Kconfig"
 source "fs/overlayfs/Kconfig"
+source "fs/incfs/Kconfig"

 menu "Caches"

diff --git a/fs/Makefile b/fs/Makefile
index 427fec226fae..08c6b827df1a 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -108,6 +108,7 @@ obj-$(CONFIG_AUTOFS_FS)		+= autofs/
 obj-$(CONFIG_ADFS_FS)		+= adfs/
 obj-$(CONFIG_FUSE_FS)		+= fuse/
 obj-$(CONFIG_OVERLAY_FS)	+= overlayfs/
+obj-$(CONFIG_INCREMENTAL_FS)	+= incfs/
 obj-$(CONFIG_ORANGEFS_FS)       += orangefs/
 obj-$(CONFIG_UDF_FS)		+= udf/
 obj-$(CONFIG_SUN_OPENPROMFS)	+= openpromfs/
diff --git a/fs/incfs/Kconfig b/fs/incfs/Kconfig
new file mode 100644
index 000000000000..a810131deed0
--- /dev/null
+++ b/fs/incfs/Kconfig
@@ -0,0 +1,10 @@
+config INCREMENTAL_FS
+	tristate "Incremental file system support"
+	depends on BLOCK && CRC32
+	help
+	  Incremental FS is a read-only virtual file system that facilitates execution
+	  of programs while their binaries are still being lazily downloaded over the
+	  network, USB or pigeon post.
+
+	  To compile this file system support as a module, choose M here: the
+	  module will be called incrementalfs.
\ No newline at end of file
diff --git a/fs/incfs/Makefile b/fs/incfs/Makefile
new file mode 100644
index 000000000000..7892196c634f
--- /dev/null
+++ b/fs/incfs/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_INCREMENTAL_FS)	+= incrementalfs.o
+
+incrementalfs-y := main.o vfs.o
\ No newline at end of file
diff --git a/fs/incfs/main.c b/fs/incfs/main.c
new file mode 100644
index 000000000000..07e1952ede9e
--- /dev/null
+++ b/fs/incfs/main.c
@@ -0,0 +1,85 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2018 Google LLC
+ */
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/module.h>
+
+#include <uapi/linux/incrementalfs.h>
+
+#define INCFS_CORE_VERSION 1
+
+extern struct file_system_type incfs_fs_type;
+
+static struct kobject *sysfs_root;
+
+static ssize_t version_show(struct kobject *kobj,
+			    struct kobj_attribute *attr, char *buff)
+{
+	return snprintf(buff, PAGE_SIZE, "%d\n", INCFS_CORE_VERSION);
+}
+
+static struct kobj_attribute version_attr = __ATTR_RO(version);
+
+static struct attribute *attributes[] = {
+	&version_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group attr_group = {
+	.attrs = attributes,
+};
+
+static int __init init_sysfs(void)
+{
+	int res = 0;
+
+	sysfs_root = kobject_create_and_add(INCFS_NAME, fs_kobj);
+	if (!sysfs_root)
+		return -ENOMEM;
+
+	res = sysfs_create_group(sysfs_root, &attr_group);
+	if (res) {
+		kobject_put(sysfs_root);
+		sysfs_root = NULL;
+	}
+	return res;
+}
+
+static void cleanup_sysfs(void)
+{
+	if (sysfs_root) {
+		sysfs_remove_group(sysfs_root, &attr_group);
+		kobject_put(sysfs_root);
+		sysfs_root = NULL;
+	}
+}
+
+static int __init init_incfs_module(void)
+{
+	int err = 0;
+
+	err = init_sysfs();
+	if (err)
+		return err;
+
+	err = register_filesystem(&incfs_fs_type);
+	if (err)
+		cleanup_sysfs();
+
+	return err;
+}
+
+static void __exit cleanup_incfs_module(void)
+{
+	cleanup_sysfs();
+	unregister_filesystem(&incfs_fs_type);
+}
+
+module_init(init_incfs_module);
+module_exit(cleanup_incfs_module);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Eugene Zemtsov <ezemtsov@google.com>");
+MODULE_DESCRIPTION("Incremental File System");
diff --git a/fs/incfs/vfs.c b/fs/incfs/vfs.c
new file mode 100644
index 000000000000..2e71f0edf8a1
--- /dev/null
+++ b/fs/incfs/vfs.c
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2018 Google LLC
+ */
+#include <linux/blkdev.h>
+#include <linux/fs.h>
+
+#include <uapi/linux/incrementalfs.h>
+
+static struct dentry *mount_fs(struct file_system_type *type, int flags,
+			       const char *dev_name, void *data);
+static void kill_sb(struct super_block *sb);
+
+struct file_system_type incfs_fs_type = {
+	.owner = THIS_MODULE,
+	.name = INCFS_NAME,
+	.mount = mount_fs,
+	.kill_sb = kill_sb,
+	.fs_flags = 0
+};
+
+static int fill_super_block(struct super_block *sb, void *data, int silent)
+{
+	return 0;
+}
+
+static struct dentry *mount_fs(struct file_system_type *type, int flags,
+			       const char *dev_name, void *data)
+{
+	return mount_nodev(type, flags, data, fill_super_block);
+}
+
+static void kill_sb(struct super_block *sb)
+{
+	generic_shutdown_super(sb);
+}
+
diff --git a/include/uapi/linux/incrementalfs.h b/include/uapi/linux/incrementalfs.h
new file mode 100644
index 000000000000..5bcf66ac852b
--- /dev/null
+++ b/include/uapi/linux/incrementalfs.h
@@ -0,0 +1,189 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Userspace interface for Incremental FS.
+ *
+ * Incremental FS is special-purpose Linux virtual file system that allows
+ * execution of a program while its binary and resource files are still being
+ * lazily downloaded over the network, USB etc.
+ *
+ * Copyright 2019 Google LLC
+ */
+#ifndef _UAPI_LINUX_INCREMENTALFS_H
+#define _UAPI_LINUX_INCREMENTALFS_H
+
+#include <linux/limits.h>
+#include <linux/ioctl.h>
+#include <linux/types.h>
+
+/* ===== constants ===== */
+#define INCFS_NAME "incremental-fs"
+#define INCFS_MAGIC_NUMBER (0x5346434e49ul)
+#define INCFS_DATA_FILE_BLOCK_SIZE 4096
+#define INCFS_HEADER_VER 1
+
+#define INCFS_MAX_FILES 1000
+#define INCFS_COMMAND_INODE 1
+#define INCFS_ROOT_INODE 2
+
+#define INCFS_IOCTL_BASE_CODE 'g'
+
+/* ===== ioctl requests on command file ===== */
+
+/* Make changes to the file system via incfs instructions. */
+#define INCFS_IOC_PROCESS_INSTRUCTION \
+	_IOWR(INCFS_IOCTL_BASE_CODE, 30, struct incfs_instruction)
+
+enum incfs_compression_alg { COMPRESSION_NONE = 0, COMPRESSION_LZ4 = 1 };
+
+/*
+ * Description of a pending read. A pending read - a read call by
+ * a userspace program for which the filesystem currently doesn't have data.
+ *
+ * This structs can be read from .cmd file to obtain a set of reads which
+ * are currently pending.
+ */
+struct incfs_pending_read_info {
+	/* Inode number of a file that is being read from. */
+	__aligned_u64 file_ino;
+
+	/* Index of a file block that is being read. */
+	__u32 block_index;
+
+	/* A serial number of this pending read. */
+	__u32 serial_number;
+};
+
+/*
+ * A struct to be written into a .cmd file to provide a data block for a file.
+ */
+struct incfs_new_data_block {
+	/* Inode number of a file this block belongs to. */
+	__aligned_u64 file_ino;
+
+	/* Index of a data block. */
+	__u32 block_index;
+
+	/* Length of data */
+	__u32 data_len;
+
+	/*
+	 * A pointer ot an actual data for the block.
+	 *
+	 * Equivalent to: __u8 *data;
+	 */
+	__aligned_u64 data;
+
+	/*
+	 * Compression algorithm used to compress the data block.
+	 * Values from enum incfs_compression_alg.
+	 */
+	__u32 compression;
+
+	__u32 reserved1;
+
+	__aligned_u64 reserved2;
+};
+
+enum incfs_instruction_type {
+	INCFS_INSTRUCTION_NOOP = 0,
+	INCFS_INSTRUCTION_NEW_FILE = 1,
+	INCFS_INSTRUCTION_ADD_DIR_ENTRY = 3,
+	INCFS_INSTRUCTION_REMOVE_DIR_ENTRY = 4,
+};
+
+/*
+ * Create a new file or directory.
+ * Corresponds to INCFS_INSTRUCTION_NEW_FILE
+ */
+struct incfs_new_file_instruction {
+	/*
+	 * [Out param. Populated by the kernel after ioctl.]
+	 * Inode number of a newly created file.
+	 */
+	__aligned_u64 ino_out;
+
+	/*
+	 * Total size of the new file. Ignored if S_ISDIR(mode).
+	 */
+	__aligned_u64 size;
+
+	/*
+	 * File mode. Permissions and dir flag.
+	 */
+	__u16 mode;
+
+	__u16 reserved1;
+
+	__u32 reserved2;
+
+	__aligned_u64 reserved3;
+
+	__aligned_u64 reserved4;
+
+	__aligned_u64 reserved5;
+
+	__aligned_u64 reserved6;
+
+	__aligned_u64 reserved7;
+};
+
+/*
+ * Create or remove a name (aka hardlink) for a file in a directory.
+ * Corresponds to
+ * INCFS_INSTRUCTION_ADD_DIR_ENTRY,
+ * INCFS_INSTRUCTION_REMOVE_DIR_ENTRY
+ */
+struct incfs_dir_entry_instruction {
+	/* Inode number of a directory to add/remove a file to/from. */
+	__aligned_u64 dir_ino;
+
+	/* File to add/remove. */
+	__aligned_u64 child_ino;
+
+	/* Length of name field */
+	__u32 name_len;
+
+	__u32 reserved1;
+
+	/*
+	 * A pointer to the name characters of a file to add/remove
+	 *
+	 * Equivalent to: char *name;
+	 */
+	__aligned_u64 name;
+
+	__aligned_u64 reserved2;
+
+	__aligned_u64 reserved3;
+
+	__aligned_u64 reserved4;
+
+	__aligned_u64 reserved5;
+};
+
+/*
+ * An Incremental FS instruction is the way for userspace
+ * to
+ *   - create files and directories
+ *   - show and hide files in the directory structure
+ */
+struct incfs_instruction {
+	/* Populate with INCFS_HEADER_VER */
+	__u32 version;
+
+	/*
+	 * Type - what this instruction actually does.
+	 * Values from enum incfs_instruction_type.
+	 */
+	__u32 type;
+
+	union {
+		struct incfs_new_file_instruction file;
+		struct incfs_dir_entry_instruction dir_entry;
+
+		/* Hard limit on the instruction body size in the future. */
+		__u8 reserved[64];
+	};
+};
+
+#endif /* _UAPI_LINUX_INCREMENTALFS_H */
--
2.21.0.593.g511ec345e18-goog


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 2/6] incfs: Backing file format
  2019-05-02  4:03 Initial patches for Incremental FS ezemtsov
  2019-05-02  4:03 ` [PATCH 1/6] incfs: Add first files of incrementalfs ezemtsov
@ 2019-05-02  4:03 ` ezemtsov
  2019-05-02  4:03 ` [PATCH 3/6] incfs: Management of in-memory FS data structures ezemtsov
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 33+ messages in thread
From: ezemtsov @ 2019-05-02  4:03 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tytso, Eugene Zemtsov

From: Eugene Zemtsov <ezemtsov@google.com>

- Read and write logic for ondisk backing file format (aka incfs image)
- Add format.c and format.h
- Utils in internal.h

Signed-off-by: Eugene Zemtsov <ezemtsov@google.com>
---
 fs/incfs/Makefile   |   2 +-
 fs/incfs/format.c   | 554 ++++++++++++++++++++++++++++++++++++++++++++
 fs/incfs/format.h   | 294 +++++++++++++++++++++++
 fs/incfs/internal.h |  31 +++
 4 files changed, 880 insertions(+), 1 deletion(-)
 create mode 100644 fs/incfs/format.c
 create mode 100644 fs/incfs/format.h
 create mode 100644 fs/incfs/internal.h

diff --git a/fs/incfs/Makefile b/fs/incfs/Makefile
index 7892196c634f..cdea18c7213e 100644
--- a/fs/incfs/Makefile
+++ b/fs/incfs/Makefile
@@ -1,4 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_INCREMENTAL_FS)	+= incrementalfs.o

-incrementalfs-y := main.o vfs.o
\ No newline at end of file
+incrementalfs-y := main.o vfs.o format.o
diff --git a/fs/incfs/format.c b/fs/incfs/format.c
new file mode 100644
index 000000000000..a0e6ecec09d3
--- /dev/null
+++ b/fs/incfs/format.c
@@ -0,0 +1,554 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2018 Google LLC
+ */
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/types.h>
+#include <linux/mutex.h>
+#include <linux/mm.h>
+#include <linux/falloc.h>
+#include <linux/slab.h>
+#include <linux/crc32.h>
+
+#include "format.h"
+
+struct backing_file_context *incfs_alloc_bfc(struct file *backing_file)
+{
+	struct backing_file_context *result = NULL;
+
+	result = kzalloc(sizeof(*result), GFP_NOFS);
+	if (!result)
+		return ERR_PTR(-ENOMEM);
+
+	result->bc_file = backing_file;
+	mutex_init(&result->bc_mutex);
+	return result;
+}
+
+void incfs_free_bfc(struct backing_file_context *bfc)
+{
+	if (!bfc)
+		return;
+
+	if (bfc->bc_file)
+		fput(bfc->bc_file);
+
+	mutex_destroy(&bfc->bc_mutex);
+	kfree(bfc);
+}
+
+loff_t incfs_get_end_offset(struct file *f)
+{
+	/*
+	 * This function assumes that file size and the end-offset
+	 * are the same. This is not always true.
+	 */
+	return i_size_read(file_inode(f));
+}
+
+/*
+ * Truncate the tail of the file to the given length.
+ * Used to rollback partially successful multistep writes.
+ */
+static int truncate_backing_file(struct backing_file_context *bfc,
+				loff_t new_end)
+{
+	struct inode *inode = NULL;
+	struct dentry *dentry = NULL;
+	loff_t old_end = 0;
+	struct iattr attr;
+	int result = 0;
+
+	if (!bfc)
+		return -EFAULT;
+
+	LOCK_REQUIRED(bfc->bc_mutex);
+
+	if (!bfc->bc_file)
+		return -EFAULT;
+
+	old_end = incfs_get_end_offset(bfc->bc_file);
+	if (old_end == new_end)
+		return 0;
+	if (old_end < new_end)
+		return -EINVAL;
+
+	inode = bfc->bc_file->f_inode;
+	dentry = bfc->bc_file->f_path.dentry;
+
+	attr.ia_size = new_end;
+	attr.ia_valid = ATTR_SIZE;
+
+	inode_lock(inode);
+	result = notify_change(dentry, &attr, NULL);
+	inode_unlock(inode);
+
+	return result;
+}
+
+/* Append a given number of zero bytes to the end of the backing file. */
+static int append_zeros(struct backing_file_context *bfc, size_t len)
+{
+	loff_t file_size = 0;
+	loff_t new_last_byte_offset = 0;
+	int res = 0;
+
+	if (!bfc)
+		return -EFAULT;
+
+	if (len == 0)
+		return -EINVAL;
+
+	LOCK_REQUIRED(bfc->bc_mutex);
+
+	/*
+	 * Allocate only one byte at the new desired end of the file.
+	 * It will increase file size and create a zeroed area of
+	 * a given size.
+	 */
+	file_size = incfs_get_end_offset(bfc->bc_file);
+	new_last_byte_offset = file_size + len - 1;
+	res = vfs_fallocate(bfc->bc_file, 0, new_last_byte_offset, 1);
+	if (res)
+		return res;
+
+	res = vfs_fsync_range(bfc->bc_file, file_size, file_size + len, 1);
+	return res;
+}
+
+static int write_to_bf(struct backing_file_context *bfc, const void *buf,
+			size_t count, loff_t pos, bool sync)
+{
+	ssize_t res = 0;
+	loff_t p = pos;
+
+	res = kernel_write(bfc->bc_file, buf, count, &p);
+	if (res < 0)
+		return res;
+	if (res != count)
+		return -EIO;
+
+	if (sync)
+		return vfs_fsync_range(bfc->bc_file, pos, pos + count, 1);
+
+	return 0;
+}
+
+static u32 calc_md_crc(struct incfs_md_header *record)
+{
+	u32 result = 0;
+	__le32 saved_crc = record->h_record_crc;
+	__le64 saved_md_offset = record->h_next_md_offset;
+	size_t record_size = min_t(size_t, le16_to_cpu(record->h_record_size),
+				INCFS_MAX_METADATA_RECORD_SIZE);
+
+	/* Zero fields which needs to be excluded from CRC calculation. */
+	record->h_record_crc = 0;
+	record->h_next_md_offset = 0;
+	result = crc32(0, record, record_size);
+
+	/* Restore excluded fields. */
+	record->h_record_crc = saved_crc;
+	record->h_next_md_offset = saved_md_offset;
+
+	return result;
+}
+
+/*
+ * Append a given metadata record to the backing file and update a previous
+ * record to add the new record the the metadata list.
+ */
+static int append_md_to_backing_file(struct backing_file_context *bfc,
+			      struct incfs_md_header *record)
+{
+	int result = 0;
+	loff_t record_offset;
+	loff_t file_pos;
+	__le64 new_md_offset;
+	size_t record_size;
+
+	if (!bfc || !record)
+		return -EFAULT;
+
+	if (bfc->bc_last_md_record_offset < 0)
+		return -EINVAL;
+
+	LOCK_REQUIRED(bfc->bc_mutex);
+
+	record_size = le16_to_cpu(record->h_record_size);
+	file_pos = incfs_get_end_offset(bfc->bc_file);
+	record->h_prev_md_offset = bfc->bc_last_md_record_offset;
+	record->h_next_md_offset = 0;
+	record->h_record_crc = cpu_to_le32(calc_md_crc(record));
+
+	/* Write the metadata record to the end of the backing file */
+	record_offset = file_pos;
+	new_md_offset = cpu_to_le64(record_offset);
+	result = write_to_bf(bfc, record, record_size, file_pos, true);
+	if (result)
+		return result;
+
+	/* Update next metadata offset in a previous record or a superblock. */
+	if (bfc->bc_last_md_record_offset) {
+		/*
+		 * Find a place in the previous md record where new record's
+		 * offset needs to be saved.
+		 */
+		file_pos = bfc->bc_last_md_record_offset +
+			offsetof(struct incfs_md_header, h_next_md_offset);
+	} else {
+		/* No metadata yet, file a place to update in the superblock. */
+		file_pos = offsetof(struct incfs_super_block,
+				s_first_md_offset);
+	}
+	result = write_to_bf(bfc, &new_md_offset, sizeof(new_md_offset),
+				file_pos, true);
+	if (result)
+		return result;
+
+	bfc->bc_last_md_record_offset = record_offset;
+	return result;
+}
+
+/* Append incfs_inode metadata record to the backing file. */
+int incfs_write_inode_to_backing_file(struct backing_file_context *bfc, u64 ino,
+				u64 size, u16 mode)
+{
+	struct incfs_inode disk_inode = {};
+
+	if (!bfc)
+		return -EFAULT;
+
+	LOCK_REQUIRED(bfc->bc_mutex);
+	disk_inode.i_header.h_md_entry_type = INCFS_MD_INODE;
+	disk_inode.i_header.h_record_size = cpu_to_le16(sizeof(disk_inode));
+	disk_inode.i_header.h_next_md_offset = cpu_to_le64(0);
+	disk_inode.i_no = cpu_to_le64(ino);
+	disk_inode.i_size = cpu_to_le64(size);
+	disk_inode.i_mode = cpu_to_le16(mode);
+	disk_inode.i_flags = cpu_to_le32(0);
+
+	return append_md_to_backing_file(bfc, &disk_inode.i_header);
+}
+
+/* Append incfs_dir_action metadata record to the backing file. */
+int incfs_write_dir_action(struct backing_file_context *bfc, u64 dir_ino,
+		     u64 dentry_ino, enum incfs_dir_action_type type,
+		     struct mem_range name)
+{
+	struct incfs_dir_action action = {};
+	u8 name_len = min_t(u8, INCFS_MAX_NAME_LEN, name.len);
+
+	if (!bfc)
+		return -EFAULT;
+
+	LOCK_REQUIRED(bfc->bc_mutex);
+	action.da_header.h_md_entry_type = INCFS_MD_DIR_ACTION;
+	action.da_header.h_record_size = cpu_to_le16(sizeof(action));
+	action.da_header.h_next_md_offset = cpu_to_le64(0);
+	action.da_dir_inode = cpu_to_le64(dir_ino);
+	action.da_entry_inode = cpu_to_le64(dentry_ino);
+	action.da_type = (__u8)type;
+	action.da_name_len = name_len;
+	memcpy(action.da_name, name.data, name_len);
+
+	return append_md_to_backing_file(bfc, &action.da_header);
+}
+
+/*
+ * Reserve 0-filled space for the blockmap body, and append
+ * incfs_blockmap metadata record pointing to it.
+ */
+int incfs_write_blockmap_to_backing_file(struct backing_file_context *bfc,
+				u64 ino, u32 block_count, loff_t *map_base_off)
+{
+	struct incfs_blockmap blockmap = {};
+	int result = 0;
+	loff_t file_end = 0;
+	size_t map_size = block_count * sizeof(struct incfs_blockmap_entry);
+
+	if (!bfc)
+		return -EFAULT;
+
+	blockmap.m_header.h_md_entry_type = INCFS_MD_BLOCK_MAP;
+	blockmap.m_header.h_record_size = cpu_to_le16(sizeof(blockmap));
+	blockmap.m_header.h_next_md_offset = cpu_to_le64(0);
+	blockmap.m_inode = cpu_to_le64(ino);
+	blockmap.m_block_count = cpu_to_le32(block_count);
+
+	LOCK_REQUIRED(bfc->bc_mutex);
+
+	/* Reserve 0-filled space for the blockmap body in the backing file. */
+	file_end = incfs_get_end_offset(bfc->bc_file);
+	result = append_zeros(bfc, map_size);
+	if (result)
+		return result;
+
+	/* Write blockmap metadata record pointing to the body written above. */
+	blockmap.m_base_offset = cpu_to_le64(file_end);
+	result = append_md_to_backing_file(bfc, &blockmap.m_header);
+	if (result) {
+		/* Error, rollback file changes */
+		truncate_backing_file(bfc, file_end);
+	} else if (map_base_off) {
+		*map_base_off = file_end;
+	}
+
+	return result;
+}
+
+/*
+ * Write a backing file header (superblock).
+ * It should always be called only on empty file.
+ * incfs_super_block.s_first_md_offset is 0 for now, but will be updated
+ * once first metadata record is added.
+ */
+int incfs_write_sb_to_backing_file(struct backing_file_context *bfc)
+{
+	struct incfs_super_block sb = {};
+	loff_t file_pos = 0;
+
+	if (!bfc)
+		return -EFAULT;
+
+	sb.s_magic = cpu_to_le64(INCFS_MAGIC_NUMBER);
+	sb.s_version = cpu_to_le64(INCFS_FORMAT_CURRENT_VER);
+	sb.s_super_block_size = cpu_to_le16(sizeof(sb));
+	sb.s_first_md_offset = cpu_to_le64(0);
+	sb.s_data_block_size = cpu_to_le16(INCFS_DATA_FILE_BLOCK_SIZE);
+
+	LOCK_REQUIRED(bfc->bc_mutex);
+
+	file_pos = incfs_get_end_offset(bfc->bc_file);
+	if (file_pos != 0)
+		return -EEXIST;
+
+	return write_to_bf(bfc, &sb, sizeof(sb), file_pos, true);
+}
+
+/* Write a given data block and update file's blockmap to point it. */
+int incfs_write_data_block_to_backing_file(struct backing_file_context *bfc,
+				     struct mem_range block, int block_index,
+				     loff_t bm_base_off, u16 flags, u32 crc)
+{
+	struct incfs_blockmap_entry bm_entry = {};
+	int result = 0;
+	loff_t data_offset = 0;
+	loff_t bm_entry_off =
+		bm_base_off + sizeof(struct incfs_blockmap_entry) * block_index;
+
+	if (!bfc)
+		return -EFAULT;
+
+	if (block.len >= (1 << 16) || block_index < 0)
+		return -EINVAL;
+
+	LOCK_REQUIRED(bfc->bc_mutex);
+
+	data_offset = incfs_get_end_offset(bfc->bc_file);
+	if (data_offset <= bm_entry_off) {
+		/* Blockmap entry is beyond the file's end. It is not normal. */
+		return -EINVAL;
+	}
+
+	/* Write the block data at the end of the backing file. */
+	result = write_to_bf(bfc, block.data, block.len, data_offset, false);
+	if (result)
+		return result;
+
+	/* Update the blockmap to point to the newly written data. */
+	bm_entry.me_data_offset_lo = cpu_to_le32((u32)data_offset);
+	bm_entry.me_data_offset_hi = cpu_to_le16((u16)(data_offset >> 32));
+	bm_entry.me_data_size = cpu_to_le16((u16)block.len);
+	bm_entry.me_flags = cpu_to_le16(flags);
+	bm_entry.me_data_crc = cpu_to_le32(crc);
+
+	result = write_to_bf(bfc, &bm_entry, sizeof(bm_entry),
+				bm_entry_off, false);
+
+	return result;
+}
+
+/* Initialize a new image in a given backing file. */
+int incfs_make_empty_backing_file(struct backing_file_context *bfc)
+{
+	int result = 0;
+
+	if (!bfc || !bfc->bc_file)
+		return -EFAULT;
+
+	result = mutex_lock_interruptible(&bfc->bc_mutex);
+	if (result)
+		goto out;
+
+	result = truncate_backing_file(bfc, 0);
+	if (result)
+		goto out;
+
+	result = incfs_write_sb_to_backing_file(bfc);
+out:
+	mutex_unlock(&bfc->bc_mutex);
+	return result;
+}
+
+int incfs_read_blockmap_entry(struct backing_file_context *bfc, int block_index,
+			loff_t bm_base_off,
+			struct incfs_blockmap_entry *bm_entry)
+{
+	loff_t bm_entry_off =
+		bm_base_off + sizeof(struct incfs_blockmap_entry) * block_index;
+	const size_t bytes_to_read = sizeof(struct incfs_blockmap_entry);
+	int result = 0;
+
+	if (!bfc || !bm_entry)
+		return -EFAULT;
+
+	if (block_index < 0 || bm_base_off <= 0)
+		return -ENODATA;
+
+	result = kernel_read(bfc->bc_file, bm_entry, bytes_to_read,
+			     &bm_entry_off);
+	if (result < 0)
+		return result;
+	if (result < bytes_to_read)
+		return -EIO;
+	return 0;
+}
+
+int incfs_read_superblock(struct backing_file_context *bfc,
+				loff_t *first_md_off)
+{
+	loff_t pos = 0;
+	ssize_t bytes_read = 0;
+	struct incfs_super_block sb = {};
+
+	if (!bfc || !first_md_off)
+		return -EFAULT;
+
+	LOCK_REQUIRED(bfc->bc_mutex);
+	bytes_read = kernel_read(bfc->bc_file, &sb, sizeof(sb), &pos);
+	if (bytes_read < 0)
+		return bytes_read;
+
+	if (bytes_read < sizeof(sb))
+		return -EBADMSG;
+
+	if (le64_to_cpu(sb.s_magic) != INCFS_MAGIC_NUMBER)
+		return -EILSEQ;
+
+	if (le64_to_cpu(sb.s_version) > INCFS_FORMAT_CURRENT_VER)
+		return -EILSEQ;
+
+	if (le16_to_cpu(sb.s_data_block_size) != INCFS_DATA_FILE_BLOCK_SIZE)
+		return -EILSEQ;
+
+	if (le16_to_cpu(sb.s_super_block_size) > sizeof(sb))
+		return -EILSEQ;
+
+	*first_md_off = le64_to_cpu(sb.s_first_md_offset);
+	return 0;
+}
+
+/*
+ * Read through metadata records from the backing file one by one
+ * and call provided metadata handlers.
+ */
+int incfs_read_next_metadata_record(struct backing_file_context *bfc,
+			      struct metadata_handler *handler)
+{
+	loff_t pos = 0;
+	const ssize_t max_md_size = INCFS_MAX_METADATA_RECORD_SIZE;
+	ssize_t bytes_read = 0;
+	size_t md_record_size = 0;
+	loff_t next_record = 0;
+	loff_t prev_record = 0;
+	int res = 0;
+	struct incfs_md_header *md_hdr = NULL;
+
+	if (!bfc || !handler)
+		return -EFAULT;
+
+	LOCK_REQUIRED(bfc->bc_mutex);
+
+	if (handler->md_record_offset == 0)
+		return -EPERM;
+
+	memset(&handler->md_buffer, 0, max_md_size);
+	pos = handler->md_record_offset;
+	bytes_read = kernel_read(bfc->bc_file, (u8 *)&handler->md_buffer,
+				 max_md_size, &pos);
+	if (bytes_read < 0)
+		return bytes_read;
+	if (bytes_read < sizeof(*md_hdr))
+		return -EBADMSG;
+
+	md_hdr = &handler->md_buffer.md_header;
+	next_record = le64_to_cpu(md_hdr->h_next_md_offset);
+	prev_record = le64_to_cpu(md_hdr->h_prev_md_offset);
+	md_record_size = le16_to_cpu(md_hdr->h_record_size);
+
+	if (md_record_size > max_md_size) {
+		pr_warn("incfs: The record is too large. Size: %ld",
+				md_record_size);
+		return -EBADMSG;
+	}
+
+	if (bytes_read < md_record_size) {
+		pr_warn("incfs: The record hasn't been fully read.");
+		return -EBADMSG;
+	}
+
+	if (next_record <= handler->md_record_offset && next_record != 0) {
+		pr_warn("incfs: Next record (%lld) points back in file.",
+			next_record);
+		return -EBADMSG;
+	}
+
+	if (prev_record != handler->md_prev_record_offset) {
+		pr_warn("incfs: Metadata chain has been corrupted.");
+		return -EBADMSG;
+	}
+
+	if (le32_to_cpu(md_hdr->h_record_crc) != calc_md_crc(md_hdr)) {
+		pr_warn("incfs: Metadata CRC mismatch.");
+		return -EBADMSG;
+	}
+
+	switch (md_hdr->h_md_entry_type) {
+	case INCFS_MD_NONE:
+		break;
+	case INCFS_MD_INODE:
+		if (handler->handle_inode)
+			res = handler->handle_inode(&handler->md_buffer.inode,
+						    handler);
+		break;
+	case INCFS_MD_BLOCK_MAP:
+		if (handler->handle_blockmap)
+			res = handler->handle_blockmap(
+				&handler->md_buffer.blockmap, handler);
+		break;
+	case INCFS_MD_DIR_ACTION:
+		if (handler->handle_dir_action)
+			res = handler->handle_dir_action(
+				&handler->md_buffer.dir_action, handler);
+		break;
+	default:
+		res = -ENOTSUPP;
+		break;
+	}
+
+	if (!res) {
+		if (next_record == 0) {
+			/*
+			 * Zero offset for the next record means that the last
+			 * metadata record has just been processed.
+			 */
+			bfc->bc_last_md_record_offset =
+				handler->md_record_offset;
+		}
+		handler->md_prev_record_offset = handler->md_record_offset;
+		handler->md_record_offset = next_record;
+	}
+	return res;
+}
diff --git a/fs/incfs/format.h b/fs/incfs/format.h
new file mode 100644
index 000000000000..2c2114bdd08f
--- /dev/null
+++ b/fs/incfs/format.h
@@ -0,0 +1,294 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright 2018 Google LLC
+ */
+
+/*
+ * Overview
+ * --------
+ * The backbone of the incremental-fs ondisk format is an append only linked
+ * list of metadata blocks. Each metadata block contains an offset of the next
+ * one. These blocks describe files and directories on the
+ * file system. They also represent actions of adding and removing file names
+ * (hard links).
+ *
+ * Every time incremental-fs instance is mounted, it reads through this list
+ * to recreate filesystem's state in memory. An offset of the first record in
+ * the metadata list is stored in the superblock at the beginning of the backing
+ * file.
+ *
+ * Most of the backing file is taken by data areas and blockmaps.
+ * Since data blocks can be compressed and have different sizes,
+ * single per-file data area can't be pre-allocated. That's why blockmaps are
+ * needed in order to find a location and size of each data block in
+ * the backing file. Each time a file is created, a corresponding block map is
+ * allocated to store future offsets of data blocks.
+ *
+ * Whenever a data block is given by data loader to incremental-fs:
+ *   - A data area with the given block is appended to the end of
+ *     the backing file.
+ *   - A record in the blockmap for the given block index is updated to reflect
+ *     its location, size, and compression algorithm.
+
+ * Metadata records
+ * ----------------
+ * incfs_inode - metadata record to declare a file or a directory.
+ *                    incfs_inode.i_mode determents if it is a file
+ *                    or a directory.
+ * incfs_blockmap_entry - metadata record that specifies size and location
+ *                           of a blockmap area for a given file. This area
+ *                           contains an array of incfs_blockmap_entry-s.
+ * incfs_dir_action - metadata record that specifies changes made to a
+ *                   to a directory structure, e.g. add or remove a hardlink.
+ *
+ * Metadata header
+ * ---------------
+ * incfs_md_header - header of a metadata record. It's always a part
+ *                   of other structures and served purpose of metadata
+ *                   bookkeeping.
+ *
+ *              +-----------------------------------------------+       ^
+ *              |            incfs_md_header                    |       |
+ *              | 1. type of body(INODE, BLOCKMAP, DIR ACTION..)|       |
+ *              | 2. size of the whole record header + body     |       |
+ *              | 3. CRC the whole record header + body         |       |
+ *              | 4. offset of the previous md record           |]------+
+ *              | 5. offset of the next md record (md link)     |]---+
+ *              +-----------------------------------------------+    |
+ *              |  Metadata record body with useful data        |    |
+ *              +-----------------------------------------------+    |
+ *                                                                   +--->
+ *
+ * Other ondisk structures
+ * -----------------------
+ * incfs_super_block - backing file header
+ * incfs_blockmap_entry - a record in a blockmap area that describes size
+ *                       and location of a data block.
+ * Data blocks dont have any particular structure, they are written to the
+ * backing file in a raw form as they come from a data loader.
+ *
+ * Backing file layout
+ * -------------------
+ *
+ *
+ *              +-------------------------------------------+
+ *              |            incfs_super_block              |]---+
+ *              +-------------------------------------------+    |
+ *              |                 metadata                  |<---+
+ *              |                incfs_inode                |]---+
+ *              +-------------------------------------------+    |
+ *                        .........................              |
+ *              +-------------------------------------------+    |   metadata
+ *     +------->|               blockmap area               |    |  list links
+ *     |        |          [incfs_blockmap_entry]           |    |
+ *     |        |          [incfs_blockmap_entry]           |    |
+ *     |        |          [incfs_blockmap_entry]           |    |
+ *     |    +--[|          [incfs_blockmap_entry]           |    |
+ *     |    |   |          [incfs_blockmap_entry]           |    |
+ *     |    |   |          [incfs_blockmap_entry]           |    |
+ *     |    |   +-------------------------------------------+    |
+ *     |    |             .........................              |
+ *     |    |   +-------------------------------------------+    |
+ *     |    |   |                 metadata                  |<---+
+ *     +----|--[|               incfs_blockmap              |]---+
+ *          |   +-------------------------------------------+    |
+ *          |             .........................              |
+ *          |   +-------------------------------------------+    |
+ *          +-->|                 data block                |    |
+ *              +-------------------------------------------+    |
+ *                        .........................              |
+ *              +-------------------------------------------+    |
+ *              |                 metadata                  |<---+
+ *              |             incfs_dir_action              |
+ *              +-------------------------------------------+
+ */
+#ifndef _INCFS_FORMAT_H
+#define _INCFS_FORMAT_H
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <uapi/linux/incrementalfs.h>
+
+#include "internal.h"
+
+#define INCFS_MAX_NAME_LEN 255
+#define INCFS_FORMAT_V1 1
+#define INCFS_FORMAT_CURRENT_VER INCFS_FORMAT_V1
+
+enum incfs_metadata_type {
+	INCFS_MD_NONE = 0,
+	INCFS_MD_INODE = 1,
+	INCFS_MD_BLOCK_MAP = 2,
+	INCFS_MD_DIR_ACTION = 3
+};
+
+/* Header included at the beginning of all metadata records on the disk. */
+struct incfs_md_header {
+	__u8 h_md_entry_type;
+
+	/*
+	 * Size of the metadata record.
+	 * (e.g. inode, dir entry etc) not just this struct.
+	 */
+	__le16 h_record_size;
+
+	/*
+	 * CRC32 of the metadata record.
+	 * (e.g. inode, dir entry etc) not just this struct.
+	 */
+	__le32 h_record_crc;
+
+	/* Offset of the next metadata entry if any */
+	__le64 h_next_md_offset;
+
+	/* Offset of the previous metadata entry if any */
+	__le64 h_prev_md_offset;
+
+} __packed;
+
+/* Backing file header */
+struct incfs_super_block {
+	__le64 s_magic; /* Magic signature: INCFS_MAGIC_NUMBER */
+	__le64 s_version; /* Format version: INCFS_FORMAT_CURRENT_VER */
+	__le16 s_super_block_size; /* sizeof(incfs_super_block) */
+	__le32 s_flags; /* Reserved for future use. */
+	__le64 s_first_md_offset; /* Offset of the first metadata record */
+	__le16 s_data_block_size; /* INCFS_DATA_FILE_BLOCK_SIZE */
+} __packed;
+
+/* Metadata record for files and directories. Type = INCFS_MD_INODE */
+struct incfs_inode {
+	struct incfs_md_header i_header;
+	__le64 i_no; /* inode number */
+	__le64 i_size; /* Full size of the file's content */
+	__le16 i_mode; /* File mode */
+	__le32 i_flags; /* Reserved for future use. */
+} __packed;
+
+enum incfs_block_map_entry_flags {
+	INCFS_BLOCK_COMPRESSED_LZ4 = (1 << 0),
+};
+
+/* Block map entry pointing to an actual location of the data block. */
+struct incfs_blockmap_entry {
+	/* Offset of the actual data block. Lower 32 bits */
+	__le32 me_data_offset_lo;
+
+	/* Offset of the actual data block. Higher 16 bits */
+	__le16 me_data_offset_hi;
+
+	/* How many bytes the data actually occupies in the backing file */
+	__le16 me_data_size;
+
+	/* Block flags from incfs_block_map_entry_flags */
+	__u16 me_flags;
+
+	/* CRC32 of the block's data */
+	__le32 me_data_crc;
+} __packed;
+
+/* Metadata record for locations of file blocks. Type = INCFS_MD_BLOCK_MAP */
+struct incfs_blockmap {
+	struct incfs_md_header m_header;
+	/* inode of a file this map belongs to */
+	__le64 m_inode;
+
+	/* Base offset of the array of incfs_blockmap_entry */
+	__le64 m_base_offset;
+
+	/* Size of the map entry array in blocks */
+	__le32 m_block_count;
+} __packed;
+
+enum incfs_dir_action_type {
+	INCFS_DIRA_NONE = 0,
+	INCFS_DIRA_ADD_ENTRY = 1,
+	INCFS_DIRA_REMOVE_ENTRY = 2,
+};
+
+/* Metadata record of directory content change. Type = INCFS_MD_DIR_ACTION */
+struct incfs_dir_action {
+	struct incfs_md_header da_header;
+	__le64 da_dir_inode; /* Parent directory inode number */
+	__le64 da_entry_inode; /* File/subdirectory inode number */
+	__u8 da_type; /* One of enums incfs_dir_action_type */
+	__u8 da_name_len; /* Name length */
+	char da_name[INCFS_MAX_NAME_LEN]; /* File name */
+} __packed;
+
+/* State of the backing file. */
+struct backing_file_context {
+	/* Protects writes to bc_file */
+	struct mutex bc_mutex;
+
+	/* File object to read data from */
+	struct file *bc_file;
+
+	/*
+	 * Offset of the last known metadata record in the backing file.
+	 * 0 means there are no metadata records.
+	 */
+	loff_t bc_last_md_record_offset;
+};
+
+struct metadata_handler {
+	loff_t md_record_offset;
+	loff_t md_prev_record_offset;
+	void *context;
+
+	union {
+		struct incfs_md_header md_header;
+		struct incfs_inode inode;
+		struct incfs_blockmap blockmap;
+		struct incfs_dir_action dir_action;
+	} md_buffer;
+
+	int (*handle_inode)(struct incfs_inode *inode,
+			    struct metadata_handler *handler);
+	int (*handle_blockmap)(struct incfs_blockmap *bm,
+			       struct metadata_handler *handler);
+	int (*handle_dir_action)(struct incfs_dir_action *da,
+				 struct metadata_handler *handler);
+};
+#define INCFS_MAX_METADATA_RECORD_SIZE \
+	FIELD_SIZEOF(struct metadata_handler, md_buffer)
+
+loff_t incfs_get_end_offset(struct file *f);
+
+/* Backing file context management */
+struct backing_file_context *incfs_alloc_bfc(struct file *backing_file);
+
+void incfs_free_bfc(struct backing_file_context *bfc);
+
+/* Writing stuff */
+int incfs_write_inode_to_backing_file(struct backing_file_context *bfc, u64 ino,
+				      u64 size, u16 mode);
+
+int incfs_write_dir_action(struct backing_file_context *bfc, u64 dir_ino,
+			   u64 dentry_ino, enum incfs_dir_action_type type,
+			   struct mem_range name);
+
+int incfs_write_blockmap_to_backing_file(struct backing_file_context *bfc,
+					 u64 ino, u32 block_count,
+					 loff_t *map_base_off);
+
+int incfs_write_sb_to_backing_file(struct backing_file_context *bfc);
+
+int incfs_write_data_block_to_backing_file(struct backing_file_context *bfc,
+					   struct mem_range block,
+					   int block_index, loff_t bm_base_off,
+					   u16 flags, u32 crc);
+
+int incfs_make_empty_backing_file(struct backing_file_context *bfc);
+
+/* Reading stuff */
+int incfs_read_superblock(struct backing_file_context *bfc,
+			  loff_t *first_md_off);
+
+int incfs_read_blockmap_entry(struct backing_file_context *bfc, int block_index,
+			      loff_t bm_base_off,
+			      struct incfs_blockmap_entry *bm_entry);
+
+int incfs_read_next_metadata_record(struct backing_file_context *bfc,
+				    struct metadata_handler *handler);
+
+#endif /* _INCFS_FORMAT_H */
diff --git a/fs/incfs/internal.h b/fs/incfs/internal.h
new file mode 100644
index 000000000000..de8b6240e347
--- /dev/null
+++ b/fs/incfs/internal.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright 2018 Google LLC
+ */
+#ifndef _INCFS_INTERNAL_H
+#define _INCFS_INTERNAL_H
+#include <linux/types.h>
+
+struct mem_range {
+	u8 *data;
+	size_t len;
+};
+
+static inline struct mem_range range(u8 *data, size_t len)
+{
+	return (struct mem_range){ .data = data, .len = len };
+}
+
+#ifdef DEBUG
+#define LOCK_REQUIRED(lock)                                                    \
+	do {                                                                   \
+		if (!mutex_is_locked(&(lock))) {                               \
+			pr_err(#lock " must be taken");                        \
+			panic("Lock not taken.");                              \
+		}                                                              \
+	} while (0)
+#else
+#define LOCK_REQUIRED(lock)
+#endif
+
+#endif /* _INCFS_INTERNAL_H */
--
2.21.0.593.g511ec345e18-goog


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 3/6] incfs: Management of in-memory FS data structures
  2019-05-02  4:03 Initial patches for Incremental FS ezemtsov
  2019-05-02  4:03 ` [PATCH 1/6] incfs: Add first files of incrementalfs ezemtsov
  2019-05-02  4:03 ` [PATCH 2/6] incfs: Backing file format ezemtsov
@ 2019-05-02  4:03 ` ezemtsov
  2019-05-02  4:03 ` [PATCH 4/6] incfs: Integration with VFS layer ezemtsov
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 33+ messages in thread
From: ezemtsov @ 2019-05-02  4:03 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tytso, Eugene Zemtsov

From: Eugene Zemtsov <ezemtsov@google.com>

- Data structures for files, dirs, blocks, segments etc.
- Reading and uncompressing data blocks
- Waiting for temporarily missing data blocks
- Pending reads reporting
- Processing incfs instructions coming from ioctl
- Processing metadata blocks read from the backing file

Signed-off-by: Eugene Zemtsov <ezemtsov@google.com>
---
 fs/incfs/Makefile    |    2 +-
 fs/incfs/data_mgmt.c | 1312 ++++++++++++++++++++++++++++++++++++++++++
 fs/incfs/data_mgmt.h |  213 +++++++
 3 files changed, 1526 insertions(+), 1 deletion(-)
 create mode 100644 fs/incfs/data_mgmt.c
 create mode 100644 fs/incfs/data_mgmt.h

diff --git a/fs/incfs/Makefile b/fs/incfs/Makefile
index cdea18c7213e..19250a09348e 100644
--- a/fs/incfs/Makefile
+++ b/fs/incfs/Makefile
@@ -1,4 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_INCREMENTAL_FS)	+= incrementalfs.o

-incrementalfs-y := main.o vfs.o format.o
+incrementalfs-y := main.o vfs.o format.o data_mgmt.o
diff --git a/fs/incfs/data_mgmt.c b/fs/incfs/data_mgmt.c
new file mode 100644
index 000000000000..c19b0cbae2d8
--- /dev/null
+++ b/fs/incfs/data_mgmt.c
@@ -0,0 +1,1312 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2019 Google LLC
+ */
+#include <linux/gfp.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/jiffies.h>
+#include <linux/mm.h>
+#include <linux/lz4.h>
+#include <linux/rhashtable.h>
+#include <linux/crc32.h>
+
+#include "data_mgmt.h"
+
+#define INCFS_MIN_FILE_INODE INCFS_ROOT_INODE
+#define INCFS_MAX_FILE_INODE (INCFS_MIN_FILE_INODE + (1 << 30))
+
+static u32 ino_hash(const void *data, u32 len, u32 seed);
+
+static struct rhashtable_params node_map_params = {
+	.nelem_hint		= 20,
+	.key_len		= FIELD_SIZEOF(struct inode_info, n_ino),
+	.key_offset		= offsetof(struct inode_info, n_ino),
+	.head_offset		= offsetof(struct inode_info, n_hash_list),
+	.automatic_shrinking	= false,
+	.hashfn = ino_hash
+};
+
+struct mount_info *incfs_alloc_mount_info(struct super_block *sb,
+					struct file *backing_file)
+{
+	struct mount_info *mi = NULL;
+	int error = 0;
+
+	mi = kzalloc(sizeof(*mi), GFP_NOFS);
+	if (!mi) {
+		error = -ENOMEM;
+		goto err;
+	}
+
+	error = rhashtable_init(&mi->mi_nodes, &node_map_params);
+	if (error)
+		goto err;
+
+	mi->mi_bf_context = incfs_alloc_bfc(backing_file);
+	if (IS_ERR(mi->mi_bf_context)) {
+		error = PTR_ERR(mi->mi_bf_context);
+		mi->mi_bf_context = NULL;
+		goto err;
+	}
+
+	mi->mi_sb = sb;
+
+	/* Initialize root dir */
+	mi->mi_root.d_node.n_ino = INCFS_ROOT_INODE;
+	mi->mi_root.d_node.n_mount_info = mi;
+	mi->mi_root.d_node.n_type = INCFS_NODE_DIR;
+	mi->mi_root.d_node.n_mode = S_IFDIR | 0555;
+	INIT_LIST_HEAD(&mi->mi_root.d_entries_head);
+	INIT_LIST_HEAD(&mi->mi_root.d_node.n_parent_links_head);
+	mi->mi_next_ino = INCFS_ROOT_INODE + 1;
+
+	error = rhashtable_insert_fast(&mi->mi_nodes,
+					&mi->mi_root.d_node.n_hash_list,
+					node_map_params);
+	if (error)
+		goto err;
+
+	spin_lock_init(&mi->pending_reads_counters_lock);
+	mutex_init(&mi->mi_nodes_mutex);
+	mutex_init(&mi->mi_dir_ops_mutex);
+	init_waitqueue_head(&mi->mi_pending_reads_notif_wq);
+	return mi;
+err:
+
+	if (mi) {
+		rhashtable_destroy(&mi->mi_nodes);
+
+		if (mi->mi_bf_context)
+			incfs_free_bfc(mi->mi_bf_context);
+
+		kfree(mi);
+	}
+	return ERR_PTR(error);
+}
+
+static bool is_valid_inode(int ino)
+{
+	return ino >= INCFS_MIN_FILE_INODE && ino <= INCFS_MAX_FILE_INODE;
+}
+
+static u32 ino_hash(const void *data, u32 len, u32 seed)
+{
+	const int *ino = data;
+
+	return (u32)(*ino) ^ seed;
+}
+
+static void data_file_segment_init(struct data_file_segment *segment)
+{
+	INIT_LIST_HEAD(&segment->reads_list_head);
+	init_waitqueue_head(&segment->new_data_arrival_wq);
+	mutex_init(&segment->reads_mutex);
+	mutex_init(&segment->blockmap_mutex);
+}
+
+static void data_file_segment_destroy(struct data_file_segment *segment)
+{
+	list_del(&segment->reads_list_head);
+	mutex_destroy(&segment->reads_mutex);
+	mutex_destroy(&segment->blockmap_mutex);
+}
+
+static void free_data_file(struct data_file *df)
+{
+	int i;
+
+	if (!df)
+		return;
+
+	for (i = 0; i < ARRAY_SIZE(df->df_segments); i++)
+		data_file_segment_destroy(&df->df_segments[i]);
+	kfree(df);
+}
+
+/*
+ * Adds a new file to the mount_info and
+ * returns an error code (!NULL) in case of an error.
+ */
+static struct data_file *add_data_file(struct mount_info *mi, int ino,
+					loff_t size, umode_t mode)
+{
+	struct data_file *df = NULL;
+	int error = 0;
+	int i;
+
+	if (!mi)
+		return ERR_PTR(-EFAULT);
+
+	if (!is_valid_inode(ino))
+		return ERR_PTR(-EINVAL);
+
+	LOCK_REQUIRED(mi->mi_nodes_mutex);
+
+	if (rhashtable_lookup_fast(&mi->mi_nodes, &ino, node_map_params))
+		return ERR_PTR(-EEXIST);
+
+	df = kzalloc(sizeof(*df), GFP_NOFS);
+	if (!df)
+		return ERR_PTR(-ENOMEM);
+
+	df->df_node.n_ino = ino;
+	df->df_node.n_type = INCFS_NODE_FILE;
+	df->df_node.n_mode = (mode & 0555) | S_IFREG;
+	df->df_node.n_mount_info = mi;
+	INIT_LIST_HEAD(&df->df_node.n_parent_links_head);
+
+	df->df_size = size;
+	if (size > 0)
+		df->df_block_count =
+			1 + (size - 1) / INCFS_DATA_FILE_BLOCK_SIZE;
+
+	for (i = 0; i < ARRAY_SIZE(df->df_segments); i++)
+		data_file_segment_init(&df->df_segments[i]);
+
+	error = rhashtable_insert_fast(&mi->mi_nodes,
+					&df->df_node.n_hash_list,
+					node_map_params);
+	if (error) {
+		free_data_file(df);
+		return ERR_PTR(error);
+	}
+	return df;
+}
+
+static void free_dir_entry(struct dir_entry_info *entry)
+{
+	if (!entry)
+		return;
+
+	kfree(entry->de_name.data);
+	kfree(entry);
+}
+
+static void free_dir(struct directory *dir)
+{
+	struct dir_entry_info *entry = NULL;
+	struct dir_entry_info *tmp = NULL;
+
+	if (!dir)
+		return;
+
+	list_for_each_entry_safe(entry, tmp, &dir->d_entries_head,
+				  de_entries_list) {
+		free_dir_entry(entry);
+	}
+
+	kfree(dir);
+}
+
+static void hashtable_free_node(void *ptr, void *arg)
+{
+	struct mount_info *mi = arg;
+	struct inode_info *node = ptr;
+	struct data_file *df = incfs_get_file_from_node(node);
+	struct directory *dir = NULL;
+
+	if (df) {
+		free_data_file(df);
+		return;
+	}
+
+	dir = incfs_get_dir_from_node(node);
+	if (dir && dir != &mi->mi_root)
+		free_dir(dir);
+}
+
+void incfs_free_mount_info(struct mount_info *mi)
+{
+	if (!mi)
+		return;
+
+	if (mi->mi_bf_context)
+		incfs_free_bfc(mi->mi_bf_context);
+
+	rhashtable_free_and_destroy(&mi->mi_nodes, hashtable_free_node, mi);
+	mutex_destroy(&mi->mi_nodes_mutex);
+	mutex_destroy(&mi->mi_dir_ops_mutex);
+	kfree(mi);
+}
+
+static struct directory *add_dir(struct mount_info *mi, int ino, umode_t mode)
+{
+	struct directory *result = NULL;
+	int error = 0;
+
+	if (!mi)
+		return ERR_PTR(-EFAULT);
+
+	if (!is_valid_inode(ino))
+		return ERR_PTR(-EINVAL);
+
+	LOCK_REQUIRED(mi->mi_nodes_mutex);
+
+	if (rhashtable_lookup_fast(&mi->mi_nodes, &ino, node_map_params))
+		return ERR_PTR(-EEXIST);
+
+	result = kzalloc(sizeof(*result), GFP_NOFS);
+	if (!result)
+		return ERR_PTR(-ENOMEM);
+
+	result->d_node.n_ino = ino;
+	result->d_node.n_type = INCFS_NODE_DIR;
+	result->d_node.n_mode = (mode & 0555) | S_IFDIR;
+	result->d_node.n_mount_info = mi;
+	INIT_LIST_HEAD(&result->d_entries_head);
+	INIT_LIST_HEAD(&result->d_node.n_parent_links_head);
+
+	error = rhashtable_insert_fast(&mi->mi_nodes,
+					&result->d_node.n_hash_list,
+					node_map_params);
+	if (error) {
+		free_dir(result);
+		return ERR_PTR(error);
+	}
+	return result;
+}
+
+static struct dir_entry_info *add_dir_entry(struct directory *dir,
+				     const char *name, size_t name_len,
+				     struct inode_info *child)
+{
+	struct dir_entry_info *result = NULL;
+	struct dir_entry_info *entry = NULL;
+	struct mount_info *mi = NULL;
+	int error = 0;
+
+	if (!dir || !child || !name)
+		return ERR_PTR(-EFAULT);
+
+	if ((child->n_ino == INCFS_ROOT_INODE) ||
+		(child->n_ino == dir->d_node.n_ino))
+		return ERR_PTR(-EINVAL);
+
+	mi = dir->d_node.n_mount_info;
+
+	result = kzalloc(sizeof(*result), GFP_NOFS);
+	if (!result) {
+		error = -ENOMEM;
+		goto err;
+	}
+
+	result->de_parent = dir;
+	result->de_child = child;
+	result->de_name.len = name_len;
+	result->de_name.data = kstrndup(name, name_len, GFP_NOFS);
+	if (!result->de_name.data) {
+		error = -ENOMEM;
+		goto err;
+	}
+
+	mutex_lock(&mi->mi_dir_ops_mutex);
+	list_for_each_entry(entry, &dir->d_entries_head, de_entries_list) {
+		if (incfs_equal_ranges(range((u8 *)name, name_len),
+				       entry->de_name)) {
+			error = -EEXIST;
+			goto err;
+		}
+	}
+
+	if (child->n_type == INCFS_NODE_DIR) {
+		/*
+		 * Directories are not allowed to be referenced from more
+		 * than one parent directory. If parent link list is not
+		 * empty we can't create another name for this directory.
+		 */
+		if (!list_empty(&child->n_parent_links_head)) {
+			error = -EMLINK;
+			goto err;
+		}
+	}
+	/* Adding to the child's list of all links pointing to it. */
+	list_add_tail(&result->de_backlink_list,
+		&child->n_parent_links_head);
+
+	/* Adding to the dentry list's end to preserve insertion order. */
+	list_add_tail(&result->de_entries_list, &dir->d_entries_head);
+	atomic_inc(&dir->d_version);
+
+	mutex_unlock(&mi->mi_dir_ops_mutex);
+	return result;
+
+err:
+	mutex_unlock(&mi->mi_dir_ops_mutex);
+	if (result) {
+		kfree(result->de_name.data);
+		kfree(result);
+	}
+
+	return ERR_PTR(error);
+}
+
+static int remove_dir_entry(struct directory *dir,
+			const char *name, size_t name_len)
+{
+	struct dir_entry_info *entry = NULL;
+	struct dir_entry_info *iter = NULL;
+	struct directory *subdir = NULL;
+	struct mount_info *mi = NULL;
+	int result = 0;
+
+	if (!dir || !name)
+		return -EFAULT;
+
+	mi = dir->d_node.n_mount_info;
+	mutex_lock(&mi->mi_dir_ops_mutex);
+	list_for_each_entry(iter, &dir->d_entries_head, de_entries_list) {
+		if (incfs_equal_ranges(range((u8 *)name, name_len),
+					iter->de_name)) {
+			entry = iter;
+			break;
+		}
+	}
+
+	if (!entry) {
+		result = -ENOENT;
+		goto out;
+	}
+
+	subdir = incfs_get_dir_from_node(entry->de_child);
+	if (subdir && !list_empty(&subdir->d_entries_head)) {
+		/* Can't remove a dir entry for not empty directory. */
+		result = -ENOTEMPTY;
+		goto out;
+	}
+
+	list_del(&entry->de_backlink_list);
+	list_del(&entry->de_entries_list);
+
+	free_dir_entry(entry);
+	atomic_inc(&dir->d_version);
+
+out:
+	mutex_unlock(&mi->mi_dir_ops_mutex);
+	return result;
+}
+
+static struct data_file_segment *get_file_segment(struct data_file *df,
+					   int block_index)
+{
+	int seg_idx = block_index % ARRAY_SIZE(df->df_segments);
+
+	return &df->df_segments[seg_idx];
+}
+
+static struct pending_read *alloc_pending_read(void)
+{
+	struct pending_read *result = NULL;
+
+	result = kzalloc(sizeof(*result), GFP_NOFS);
+	if (!result)
+		return NULL;
+
+	INIT_LIST_HEAD(&result->reads_list);
+	return result;
+}
+
+static bool is_read_done(struct pending_read *read)
+{
+	/*
+	 * A barrier to make sure that updated value of read->done
+	 * is properly reloaded each time we try to wake up or just before
+	 * sleeping on new_data_arrival_wq.
+	 */
+	smp_mb__before_atomic();
+	return atomic_read(&read->done) != 0;
+}
+
+static void set_read_done(struct pending_read *read)
+{
+	atomic_inc(&read->done);
+	/*
+	 * A barrier to make sure that a new value of read->done
+	 * is globally visible.
+	 */
+	smp_mb__after_atomic();
+}
+
+struct inode_info *incfs_get_node_by_name(struct directory *dir,
+					  const char *name, int *dir_ver_out)
+{
+	struct mount_info *mi = NULL;
+	struct dir_entry_info *entry = NULL;
+	struct inode_info *result = NULL;
+	size_t len = 0;
+
+	if (!dir || !name)
+		return NULL;
+
+	mi = dir->d_node.n_mount_info;
+	len = strlen(name);
+
+	mutex_lock(&mi->mi_dir_ops_mutex);
+	list_for_each_entry(entry, &dir->d_entries_head, de_entries_list) {
+		if (incfs_equal_ranges(entry->de_name,
+					range((u8 *)name, len))) {
+			result = entry->de_child;
+			break;
+		}
+	}
+	if (dir_ver_out)
+		*dir_ver_out = atomic_read(&dir->d_version);
+	mutex_unlock(&mi->mi_dir_ops_mutex);
+	return result;
+}
+
+struct data_file *incfs_get_file_from_node(struct inode_info *node)
+{
+	if (!node || node->n_type != INCFS_NODE_FILE)
+		return NULL;
+	return container_of(node, struct data_file, df_node);
+}
+
+struct directory *incfs_get_dir_from_node(struct inode_info *node)
+{
+	if (!node || node->n_type != INCFS_NODE_DIR)
+		return NULL;
+	return container_of(node, struct directory, d_node);
+}
+
+struct inode_info *incfs_get_node_by_ino(struct mount_info *mi, int ino)
+{
+	if (!mi)
+		return NULL;
+
+	LOCK_REQUIRED(mi->mi_nodes_mutex);
+	return rhashtable_lookup_fast(&mi->mi_nodes, &ino, node_map_params);
+}
+
+struct data_file *incfs_get_file_by_ino(struct mount_info *mi, int ino)
+{
+	return incfs_get_file_from_node(incfs_get_node_by_ino(mi, ino));
+}
+
+struct directory *incfs_get_dir_by_ino(struct mount_info *mi, int ino)
+{
+	return incfs_get_dir_from_node(incfs_get_node_by_ino(mi, ino));
+}
+
+static int get_data_file_block(struct data_file *df, int index,
+			struct data_file_block *res_block)
+{
+	struct incfs_blockmap_entry bme = {};
+	struct backing_file_context *bfc = NULL;
+	loff_t blockmap_off = 0;
+	u16 flags = 0;
+	int error = 0;
+
+	if (!df || !res_block)
+		return -EFAULT;
+
+	blockmap_off = atomic64_read(&df->df_blockmap_off);
+	bfc = df->df_node.n_mount_info->mi_bf_context;
+
+	if (index < 0 || index >= df->df_block_count || blockmap_off == 0)
+		return -EINVAL;
+
+	error = incfs_read_blockmap_entry(bfc, index, blockmap_off, &bme);
+	if (error)
+		return error;
+
+	flags = le16_to_cpu(bme.me_flags);
+	res_block->db_backing_file_data_offset =
+		le16_to_cpu(bme.me_data_offset_hi);
+	res_block->db_backing_file_data_offset <<= 32;
+	res_block->db_backing_file_data_offset |=
+		le32_to_cpu(bme.me_data_offset_lo);
+	res_block->db_stored_size = le16_to_cpu(bme.me_data_size);
+	res_block->db_crc = le32_to_cpu(bme.me_data_crc);
+	res_block->db_comp_alg = (flags & INCFS_BLOCK_COMPRESSED_LZ4) ?
+					 COMPRESSION_LZ4 :
+					 COMPRESSION_NONE;
+	return 0;
+}
+
+static int notify_pending_reads(struct data_file_segment *segment, int index)
+{
+	struct pending_read *entry = NULL;
+
+	if (!segment || index < 0)
+		return -EINVAL;
+
+	/* Notify pending reads waiting for this block. */
+	mutex_lock(&segment->reads_mutex);
+	list_for_each_entry(entry, &segment->reads_list_head, reads_list) {
+		if (entry->block_index == index)
+			set_read_done(entry);
+	}
+	mutex_unlock(&segment->reads_mutex);
+	wake_up_all(&segment->new_data_arrival_wq);
+	return 0;
+}
+
+/*
+ * Quickly checks if there are pending reads with a serial number larger
+ * than a given one.
+ */
+bool incfs_fresh_pending_reads_exist(struct mount_info *mi, int last_number)
+{
+	bool result = false;
+
+	spin_lock(&mi->pending_reads_counters_lock);
+	result = (mi->mi_last_pending_read_number > last_number) &&
+		 (mi->mi_pending_reads_count > 0);
+	spin_unlock(&mi->pending_reads_counters_lock);
+	return result;
+}
+
+static bool is_data_block_present(struct data_file_block *block)
+{
+	return (block->db_backing_file_data_offset != 0) &&
+	       (block->db_stored_size != 0);
+}
+
+/*
+ * Notifies a given data file about pending read from a given block.
+ * Returns a new pending read entry.
+ */
+static struct pending_read *add_pending_read(struct data_file *df,
+						int block_index)
+{
+	struct pending_read *result = NULL;
+	struct data_file_segment *segment = NULL;
+	struct mount_info *mi = NULL;
+
+	WARN_ON(!df);
+	segment = get_file_segment(df, block_index);
+	mi = df->df_node.n_mount_info;
+
+	WARN_ON(!segment);
+	WARN_ON(!mi);
+
+	result = alloc_pending_read();
+	if (!result)
+		return NULL;
+
+	result->block_index = block_index;
+
+	mutex_lock(&segment->reads_mutex);
+
+	spin_lock(&mi->pending_reads_counters_lock);
+	result->serial_number = ++mi->mi_last_pending_read_number;
+	mi->mi_pending_reads_count++;
+	spin_unlock(&mi->pending_reads_counters_lock);
+
+	list_add(&result->reads_list, &segment->reads_list_head);
+	mutex_unlock(&segment->reads_mutex);
+
+	wake_up_all(&mi->mi_pending_reads_notif_wq);
+	return result;
+}
+
+/* Notifies a given data file that pending read is completed. */
+static void remove_pending_read(struct data_file *df, struct pending_read *read)
+{
+	struct data_file_segment *segment = NULL;
+	struct mount_info *mi = NULL;
+
+	if (!df || !read) {
+		WARN_ON(!df);
+		WARN_ON(!read);
+		return;
+	}
+
+	segment = get_file_segment(df, read->block_index);
+	mi = df->df_node.n_mount_info;
+
+	WARN_ON(!segment);
+	WARN_ON(!mi);
+
+	mutex_lock(&segment->reads_mutex);
+	list_del(&read->reads_list);
+
+	spin_lock(&mi->pending_reads_counters_lock);
+	mi->mi_pending_reads_count--;
+	spin_unlock(&mi->pending_reads_counters_lock);
+	mutex_unlock(&segment->reads_mutex);
+
+	kfree(read);
+}
+
+static int wait_for_data_block(struct data_file *df, int block_index,
+			int timeout_ms, struct data_file_block *res_block)
+{
+	struct data_file_block block = {};
+	struct data_file_segment *segment = NULL;
+	struct pending_read *read = NULL;
+	int error = 0;
+	int wait_res = 0;
+
+	if (!df || !res_block)
+		return -EFAULT;
+
+	if (block_index < 0 || block_index >= df->df_block_count)
+		return -EINVAL;
+
+	if (atomic64_read(&df->df_blockmap_off) <= 0)
+		return -ENODATA;
+
+	segment = get_file_segment(df, block_index);
+	WARN_ON(!segment);
+
+	error = mutex_lock_interruptible(&segment->blockmap_mutex);
+	if (error)
+		return error;
+
+	/* Look up the given block */
+	error = get_data_file_block(df, block_index, &block);
+
+	/* If it's not found, create a pending read */
+	if (!error && !is_data_block_present(&block))
+		read = add_pending_read(df, block_index);
+
+	mutex_unlock(&segment->blockmap_mutex);
+	if (error)
+		return error;
+
+	/* If the block was found, just return it. No need to wait. */
+	if (is_data_block_present(&block)) {
+		*res_block = block;
+		return 0;
+	}
+
+	if (!read)
+		return -ENOMEM;
+
+	/* Wait for notifications about block's arrival */
+	wait_res =
+		wait_event_interruptible_timeout(segment->new_data_arrival_wq,
+						 (is_read_done(read)),
+						 msecs_to_jiffies(timeout_ms));
+
+	/* Woke up, the pending read is nor longer needed. */
+	remove_pending_read(df, read);
+	read = NULL;
+
+	if (wait_res == 0) {
+		/* Wait has timed out */
+		return -ETIME;
+	}
+	if (wait_res < 0) {
+		/*
+		 * Only ERESTARTSYS is really expected here when a signal
+		 * comes while we wait.
+		 */
+		return wait_res;
+	}
+
+	error = mutex_lock_interruptible(&segment->blockmap_mutex);
+	if (error)
+		return error;
+
+	/*
+	 * Re-read block's info now, it has just arrived and
+	 * should be available.
+	 */
+	error = get_data_file_block(df, block_index, &block);
+	if (!error) {
+		if (is_data_block_present(&block))
+			*res_block = block;
+		else {
+			/*
+			 * Somehow wait finished successfully bug block still
+			 * can't be found. It's not normal.
+			 */
+			pr_warn("Wait succeeded, but block %d:%d not found.",
+				df->df_node.n_ino, block_index);
+			error = -ENODATA;
+		}
+	}
+
+	mutex_unlock(&segment->blockmap_mutex);
+	return error;
+}
+
+int incfs_collect_pending_reads(struct mount_info *mi, int sn_lowerbound,
+			  struct incfs_pending_read_info *reads, int reads_size)
+{
+	int i = 0;
+	int reported_reads = 0;
+	bool stop = true;
+	int start_sn = 0;
+	int start_count = 0;
+	struct rhashtable_iter iter;
+	struct inode_info *node;
+	int error = 0;
+
+	if (!mi)
+		return -EFAULT;
+
+	mutex_lock(&mi->mi_nodes_mutex);
+
+	spin_lock(&mi->pending_reads_counters_lock);
+	start_sn = mi->mi_last_pending_read_number;
+	start_count = mi->mi_pending_reads_count;
+	spin_unlock(&mi->pending_reads_counters_lock);
+
+	stop = (reads_size == 0 || start_count == 0);
+
+	rhashtable_walk_enter(&mi->mi_nodes, &iter);
+	rhashtable_walk_start(&iter);
+
+	while (!stop && (node = rhashtable_walk_next(&iter))) {
+		struct data_file *df = NULL;
+
+		if (IS_ERR(node)) {
+			error = PTR_ERR(node);
+			break;
+		}
+		df = incfs_get_file_from_node(node);
+		if (!df)
+			continue;
+
+		rhashtable_walk_stop(&iter);
+		for (i = 0; i < SEGMENTS_PER_FILE && !stop; i++) {
+			struct data_file_segment *segment = &df->df_segments[i];
+			struct pending_read *entry = NULL;
+
+			mutex_lock(&segment->reads_mutex);
+			list_for_each_entry(entry, &segment->reads_list_head,
+					     reads_list) {
+				if (entry->serial_number <= sn_lowerbound)
+					continue;
+				/*
+				 * Skip over pending reads that were not here at
+				 * the beggining of the collection process.
+				 * They will be addressed during a next call.
+				 *
+				 * If this is not done, and all pending reads
+				 * are reported, then there might be a race
+				 * between this code and pending reads being
+				 * added to other segmeents/files.
+				 *
+				 * Skipping everything newer than read number
+				 * known at the beggining guaranties consistent
+				 * snapshot of pending reads across all files
+				 * and segments. Is saves us from having to
+				 * instoduce a big contended lock for
+				 * everything.
+				 */
+				if (entry->serial_number > start_sn)
+					continue;
+
+				reads[reported_reads].file_ino =
+					df->df_node.n_ino;
+				reads[reported_reads].block_index =
+					entry->block_index;
+				reads[reported_reads].serial_number =
+					entry->serial_number;
+
+				reported_reads++;
+				stop = (reported_reads >= reads_size) ||
+					(reported_reads >= start_count);
+				if (stop)
+					break;
+			}
+			mutex_unlock(&segment->reads_mutex);
+		}
+		rhashtable_walk_start(&iter);
+	}
+
+	rhashtable_walk_stop(&iter);
+	rhashtable_walk_exit(&iter);
+	mutex_unlock(&mi->mi_nodes_mutex);
+	return error ? error : reported_reads;
+}
+
+static ssize_t decompress(struct mem_range src, struct mem_range dst)
+{
+	int result = LZ4_decompress_safe(src.data, dst.data, src.len, dst.len);
+
+	if (result < 0)
+		return -EBADMSG;
+
+	return result;
+}
+
+static ssize_t read_with_crc(struct file *f, void *buf, size_t len,
+				loff_t pos, u32 expected_crc)
+{
+	ssize_t result = 0;
+	u32 buf_crc = 0;
+
+	result = kernel_read(f, buf, len, &pos);
+	if (result == len) {
+		buf_crc = crc32(0, buf, len);
+		if (buf_crc != expected_crc) {
+			const char *name = f->f_path.dentry->d_name.name;
+
+			pr_warn_once("incfs: Data CRC mismatch in %s. %u %u",
+				name, buf_crc, expected_crc);
+			return -EBADMSG;
+		}
+	}
+	return result;
+}
+
+ssize_t incfs_read_data_file_block(struct mem_range dst, struct data_file *df,
+			     int index)
+{
+	loff_t pos;
+	ssize_t result;
+	size_t bytes_to_read;
+	u8 *decomp_buffer;
+	struct mount_info *mi = NULL;
+	struct file *bf = NULL;
+	const size_t decomp_buf_size = 2 * INCFS_DATA_FILE_BLOCK_SIZE;
+	struct data_file_block block = {};
+	int timeout_ms = 0;
+
+	if (!dst.data || !df)
+		return -EFAULT;
+
+	mi = df->df_node.n_mount_info;
+	bf = mi->mi_bf_context->bc_file;
+	timeout_ms = mi->mi_options.read_timeout_ms;
+
+	result = wait_for_data_block(df, index, timeout_ms, &block);
+	if (result < 0)
+		return result;
+
+	pos = block.db_backing_file_data_offset;
+	if (block.db_comp_alg == COMPRESSION_NONE) {
+		bytes_to_read = min(dst.len, block.db_stored_size);
+		result = read_with_crc(bf, dst.data, bytes_to_read,
+					pos, block.db_crc);
+
+		/* Some data was read, but not enough */
+		if (result >= 0 && result != bytes_to_read)
+			result = -EIO;
+	} else {
+		decomp_buffer = (u8 *)__get_free_pages(
+			GFP_NOFS, get_order(decomp_buf_size));
+		if (!decomp_buffer)
+			return -ENOMEM;
+
+		bytes_to_read = min(decomp_buf_size, block.db_stored_size);
+		result = read_with_crc(bf, decomp_buffer, bytes_to_read,
+					pos, block.db_crc);
+		if (result == bytes_to_read) {
+			result = decompress(range(decomp_buffer, bytes_to_read),
+					    dst);
+			if (result < 0) {
+				const char *name =
+						bf->f_path.dentry->d_name.name;
+
+				pr_warn_once("incfs: Decompression error. %s",
+					name);
+			}
+		} else if (result >= 0) {
+			/* Some data was read, but not enough */
+			result = -EIO;
+		}
+
+		free_pages((unsigned long)decomp_buffer,
+			   get_order(decomp_buf_size));
+	}
+
+	return result;
+}
+
+int incfs_process_new_data_block(struct mount_info *mi,
+			   struct incfs_new_data_block *block,
+			   u8 *data)
+{
+	struct backing_file_context *bfc = NULL;
+	struct data_file *df = NULL;
+	struct data_file_segment *segment = NULL;
+	struct data_file_block existing_block = {};
+	u16 flags = 0;
+	u32 crc = 0;
+	int error = 0;
+
+	if (!mi || !block)
+		return -EFAULT;
+	bfc = mi->mi_bf_context;
+
+	mutex_lock(&mi->mi_nodes_mutex);
+	df = incfs_get_file_by_ino(mi, block->file_ino);
+	mutex_unlock(&mi->mi_nodes_mutex);
+
+	if (!df)
+		return -ENOENT;
+	if (block->block_index >= df->df_block_count)
+		return -ERANGE;
+	segment = get_file_segment(df, block->block_index);
+	if (!segment)
+		return -EFAULT;
+	if (block->compression == COMPRESSION_LZ4)
+		flags |= INCFS_BLOCK_COMPRESSED_LZ4;
+
+
+	crc = crc32(0, data, block->data_len);
+	error = mutex_lock_interruptible(&segment->blockmap_mutex);
+	if (error)
+		return error;
+
+	error = get_data_file_block(df, block->block_index, &existing_block);
+	if (error)
+		goto unlock;
+	if (is_data_block_present(&existing_block)) {
+		/* Block is already present, nothing to do here */
+		goto unlock;
+	}
+
+	error = mutex_lock_interruptible(&bfc->bc_mutex);
+	if (!error) {
+		error = incfs_write_data_block_to_backing_file(
+			bfc, range(data, block->data_len),
+			block->block_index, atomic64_read(&df->df_blockmap_off),
+			flags, crc);
+		mutex_unlock(&bfc->bc_mutex);
+	}
+	if (!error)
+		error = notify_pending_reads(segment, block->block_index);
+
+unlock:
+	mutex_unlock(&segment->blockmap_mutex);
+	return error;
+}
+
+int incfs_process_new_file_inst(struct mount_info *mi,
+			  struct incfs_new_file_instruction *inst)
+{
+	struct directory *new_dir = NULL;
+	struct data_file *new_file = NULL;
+	struct backing_file_context *bfc = NULL;
+	u16 mode = 0;
+	int error = 0;
+
+	if (!mi || !inst)
+		return -EFAULT;
+
+	bfc = mi->mi_bf_context;
+	error = mutex_lock_interruptible(&bfc->bc_mutex);
+	if (error)
+		return error;
+
+	/* Create and register in-memory dir or data_file objects */
+	mutex_lock(&mi->mi_nodes_mutex);
+	if (atomic_read(&mi->mi_nodes.nelems) >= INCFS_MAX_FILES) {
+		/* File system already has too many files. */
+		error = -ENFILE;
+	} else if (S_ISREG(inst->mode)) {
+		/* Create a regular file. */
+		inst->ino_out = mi->mi_next_ino;
+		new_file = add_data_file(mi, inst->ino_out, inst->size,
+			inst->mode);
+
+		if (IS_ERR_OR_NULL(new_file))
+			error = PTR_ERR(new_file);
+		else {
+			mi->mi_next_ino++;
+			mode = new_file->df_node.n_mode;
+		}
+	} else if (S_ISDIR(inst->mode)) {
+		/* Create a directory. */
+		inst->ino_out = mi->mi_next_ino;
+		new_dir = add_dir(mi, inst->ino_out, inst->mode);
+
+		if (IS_ERR_OR_NULL(new_dir))
+			error = PTR_ERR(new_dir);
+		else {
+			mi->mi_next_ino++;
+			mode = new_dir->d_node.n_mode;
+		}
+	} else
+		error = -EINVAL;
+	mutex_unlock(&mi->mi_nodes_mutex);
+	if (error)
+		goto out;
+
+	/* Write inode to the backing file */
+	error = incfs_write_inode_to_backing_file(bfc, inst->ino_out,
+					inst->size, mode);
+	if (error)
+		goto out;
+
+	/* If it's a data file, also reserve space for the block map. */
+	if (new_file && new_file->df_block_count > 0) {
+		loff_t bm_base_off = 0;
+
+		error = incfs_write_blockmap_to_backing_file(bfc,
+						       new_file->df_node.n_ino,
+						       new_file->df_block_count,
+						       &bm_base_off);
+		if (error)
+			goto out;
+		atomic64_set(&new_file->df_blockmap_off, bm_base_off);
+	}
+out:
+	mutex_unlock(&bfc->bc_mutex);
+	return error;
+}
+
+int incfs_process_new_dir_entry_inst(struct mount_info *mi,
+			       enum incfs_instruction_type type,
+			       struct incfs_dir_entry_instruction *inst,
+			       char *name)
+{
+	struct backing_file_context *bfc = NULL;
+	int error = 0;
+
+	if (!mi || !inst)
+		return -EFAULT;
+
+	bfc = mi->mi_bf_context;
+	error = mutex_lock_interruptible(&bfc->bc_mutex);
+	if (error)
+		return error;
+
+	switch (type) {
+	case INCFS_INSTRUCTION_ADD_DIR_ENTRY: {
+		struct dir_entry_info *dentry = NULL;
+		struct inode_info *child = NULL;
+		struct directory *parent = NULL;
+
+		/* Find nodes that we want to connect */
+		mutex_lock(&mi->mi_nodes_mutex);
+		parent = incfs_get_dir_by_ino(mi, inst->dir_ino);
+		child = incfs_get_node_by_ino(mi, inst->child_ino);
+		mutex_unlock(&mi->mi_nodes_mutex);
+		if (!child || !parent) {
+			error = -ENOENT;
+			goto out;
+		}
+
+		/* Put a dir/file into a parent dir object in memory */
+		dentry = add_dir_entry(parent, name, inst->name_len, child);
+		if (IS_ERR_OR_NULL(dentry)) {
+			error = PTR_ERR(dentry);
+			goto out;
+		}
+
+		/* Save record about the dir entry to the backing file */
+		error = incfs_write_dir_action(bfc, inst->dir_ino,
+				inst->child_ino, INCFS_DIRA_ADD_ENTRY,
+				dentry->de_name);
+		break;
+	}
+	case INCFS_INSTRUCTION_REMOVE_DIR_ENTRY: {
+		struct directory *dir = NULL;
+
+		/* Find nodes that we want to connect */
+		mutex_lock(&mi->mi_nodes_mutex);
+		dir = incfs_get_dir_by_ino(mi, inst->dir_ino);
+		mutex_unlock(&mi->mi_nodes_mutex);
+
+		if (!dir) {
+			error = -ENOENT;
+			goto out;
+		}
+
+		/* Remove dir entry from the dir object in memory */
+		error = remove_dir_entry(dir, name, inst->name_len);
+		if (error)
+			goto out;
+
+		/* Save record about the dir entry to the backing file */
+		error = incfs_write_dir_action(
+			bfc, dir->d_node.n_ino, inst->child_ino,
+			INCFS_DIRA_REMOVE_ENTRY,
+			range((u8 *)name, inst->name_len));
+		break;
+	}
+	default:
+		error = -ENOTSUPP;
+		break;
+	}
+
+out:
+	mutex_unlock(&bfc->bc_mutex);
+	return error;
+}
+
+static int process_inode_md(struct incfs_inode *inode,
+			    struct metadata_handler *handler)
+{
+	struct mount_info *mi = handler->context;
+	int error = 0;
+	u64 ino = le64_to_cpu(inode->i_no);
+	u64 size = le64_to_cpu(inode->i_size);
+	u16 mode = le16_to_cpu(inode->i_mode);
+
+	if (!mi)
+		return -EFAULT;
+
+	mutex_lock(&mi->mi_nodes_mutex);
+	if (S_ISREG(mode)) {
+		struct data_file *df = add_data_file(mi, ino, size, mode);
+
+		if (!df)
+			error = -EFAULT;
+		else if (IS_ERR(df))
+			error = PTR_ERR(df);
+	} else if (S_ISDIR(mode)) {
+		struct directory *dir = add_dir(mi, ino, mode);
+
+		if (!dir)
+			error = -EFAULT;
+		else if (IS_ERR(dir))
+			error = PTR_ERR(dir);
+	} else
+		error = -EINVAL;
+
+	if (!error && ino >= mi->mi_next_ino)
+		mi->mi_next_ino = ino + 1;
+	mutex_unlock(&mi->mi_nodes_mutex);
+	return error;
+}
+
+static int process_blockmap_md(struct incfs_blockmap *bm,
+			       struct metadata_handler *handler)
+{
+	struct mount_info *mi = handler->context;
+	struct data_file *df = NULL;
+	int error = 0;
+	u64 ino = le64_to_cpu(bm->m_inode);
+	loff_t base_off = le64_to_cpu(bm->m_base_offset);
+	u32 block_count = le32_to_cpu(bm->m_block_count);
+
+	if (!mi)
+		return -EFAULT;
+
+	mutex_lock(&mi->mi_nodes_mutex);
+	df = incfs_get_file_by_ino(mi, ino);
+	mutex_unlock(&mi->mi_nodes_mutex);
+
+	if (!df)
+		return -ENOENT;
+
+	if (df->df_block_count != block_count)
+		return -EBADFD;
+
+	if (atomic64_cmpxchg(&df->df_blockmap_off, 0, base_off) != 0)
+		error = -EBADFD;
+
+	return error;
+}
+
+static int process_dir_action_md(struct incfs_dir_action *da,
+				 struct metadata_handler *handler)
+{
+	struct mount_info *mi = handler->context;
+	struct directory *dir = NULL;
+	u64 dir_ino = le64_to_cpu(da->da_dir_inode);
+	u64 entry_ino = le64_to_cpu(da->da_entry_inode);
+	u8 type = da->da_type;
+	u8 name_len = da->da_name_len;
+	char *name = da->da_name;
+	int result = 0;
+
+	if (!mi)
+		return -EFAULT;
+
+	switch (type) {
+	case INCFS_DIRA_NONE:
+		result = 0;
+		break;
+	case INCFS_DIRA_ADD_ENTRY: {
+		struct inode_info *node = NULL;
+		struct dir_entry_info *dentry = NULL;
+
+		mutex_lock(&mi->mi_nodes_mutex);
+		dir = incfs_get_dir_by_ino(mi, dir_ino);
+		node = incfs_get_node_by_ino(mi, entry_ino);
+		mutex_unlock(&mi->mi_nodes_mutex);
+
+		if (!dir || !node)
+			return -ENOENT;
+
+		dentry = add_dir_entry(dir, name, name_len, node);
+		if (IS_ERR_OR_NULL(dentry))
+			return PTR_ERR(dentry);
+		break;
+	}
+
+	case INCFS_DIRA_REMOVE_ENTRY: {
+		mutex_lock(&mi->mi_nodes_mutex);
+		dir = incfs_get_dir_by_ino(mi, dir_ino);
+		mutex_unlock(&mi->mi_nodes_mutex);
+
+		if (!dir)
+			return -ENOENT;
+
+		result = remove_dir_entry(dir, name, name_len);
+		break;
+	}
+	default:
+		result = -ENOTSUPP;
+	}
+	return result;
+}
+
+int incfs_scan_backing_file(struct mount_info *mi)
+{
+	struct metadata_handler *handler = NULL;
+	int result = 0;
+	int records_count = 0;
+	int error = 0;
+	struct backing_file_context *bfc = NULL;
+
+	if (!mi || !mi->mi_bf_context)
+		return -EFAULT;
+
+	bfc = mi->mi_bf_context;
+
+	handler = kzalloc(sizeof(*handler), GFP_NOFS);
+	if (!handler)
+		return -ENOMEM;
+
+	/* No writing to the backing file while it's being scanned. */
+	error = mutex_lock_interruptible(&bfc->bc_mutex);
+	if (error)
+		goto out;
+
+	/* Reading superblock */
+	error = incfs_read_superblock(bfc, &handler->md_record_offset);
+	if (error)
+		goto unlock;
+
+	handler->context = mi;
+	handler->handle_inode = process_inode_md;
+	handler->handle_blockmap = process_blockmap_md;
+	handler->handle_dir_action = process_dir_action_md;
+
+	pr_debug("Starting reading incfs-metadata records at offset %lld",
+		 handler->md_record_offset);
+	while (handler->md_record_offset > 0) {
+		error = incfs_read_next_metadata_record(bfc, handler);
+		if (error) {
+			pr_warn("incfs: Error during reading incfs-metadata record. Offset: %lld Record #%d Error code: %d",
+				handler->md_record_offset, records_count + 1,
+				-error);
+			break;
+		}
+		records_count++;
+	}
+	if (error) {
+		pr_debug("Error %d after reading %d incfs-metadata records.",
+			 -error, records_count);
+		result = error;
+	} else {
+		pr_debug("Finished reading %d incfs-metadata records.",
+			 records_count);
+		result = records_count;
+	}
+unlock:
+	mutex_unlock(&bfc->bc_mutex);
+out:
+	kfree(handler);
+	return result;
+}
+
+bool incfs_equal_ranges(struct mem_range lhs, struct mem_range rhs)
+{
+	if (lhs.len != rhs.len)
+		return false;
+	return memcmp(lhs.data, rhs.data, lhs.len) == 0;
+}
diff --git a/fs/incfs/data_mgmt.h b/fs/incfs/data_mgmt.h
new file mode 100644
index 000000000000..d849e262cf84
--- /dev/null
+++ b/fs/incfs/data_mgmt.h
@@ -0,0 +1,213 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright 2019 Google LLC
+ */
+#ifndef _INCFS_DATA_MGMT_H
+#define _INCFS_DATA_MGMT_H
+
+#include <linux/fs.h>
+#include <linux/types.h>
+#include <linux/mutex.h>
+#include <linux/completion.h>
+#include <linux/wait.h>
+#include <linux/rhashtable-types.h>
+
+#include "internal.h"
+#include "format.h"
+
+#define SEGMENTS_PER_FILE 5
+
+struct data_file_block {
+	loff_t db_backing_file_data_offset;
+
+	size_t db_stored_size;
+
+	u32 db_crc;
+
+	enum incfs_compression_alg db_comp_alg;
+};
+
+struct pending_read {
+	struct list_head reads_list;
+
+	int block_index;
+
+	int serial_number;
+
+	atomic_t done;
+};
+
+struct data_file_segment {
+	wait_queue_head_t new_data_arrival_wq;
+
+	/* Protects reads and writes from the blockmap */
+	/* Good candidate for read/write mutex */
+	struct mutex blockmap_mutex;
+
+	/* Protects reads_list_head */
+	struct mutex reads_mutex;
+
+	/* List of active pending_read objects */
+	struct list_head reads_list_head;
+};
+
+struct mount_info;
+
+enum incfs_node_type { INCFS_NODE_FILE = 0, INCFS_NODE_DIR = 1 };
+
+/* Common parts between data files and dirs. */
+struct inode_info {
+	struct mount_info *n_mount_info; /* Mount this file belongs to */
+
+	/* Hash bucket list for mount_info.mi_nodes */
+	struct rhash_head n_hash_list;
+
+	/* List of dir_entry_info pointing to this node */
+	struct list_head n_parent_links_head;
+
+	int n_ino;
+
+	umode_t n_mode;
+
+	u8 n_type; /* Node type values from enum incfs_node_type */
+};
+
+struct data_file {
+	struct inode_info df_node;
+
+	/*
+	 * Array of segments used to reduce lock contention for the file.
+	 * Segment is chosen for a block depends on the block's index.
+	 */
+	struct data_file_segment df_segments[SEGMENTS_PER_FILE];
+
+	/* Base offset of the block map. */
+	atomic64_t df_blockmap_off;
+
+	/* File size in bytes */
+	loff_t df_size;
+
+	int df_block_count; /* File size in DATA_FILE_BLOCK_SIZE blocks */
+};
+
+struct directory {
+	struct inode_info d_node;
+
+	/* List of struct dir_entry_info belonging to this directory */
+	struct list_head d_entries_head;
+
+	atomic_t d_version;
+};
+
+struct dir_entry_info {
+	struct list_head de_entries_list;
+
+	struct list_head de_backlink_list;
+
+	struct mem_range de_name;
+
+	struct inode_info *de_child;
+
+	struct directory *de_parent;
+};
+
+struct mount_options {
+	unsigned int backing_fd;
+	unsigned int read_timeout_ms;
+};
+
+struct mount_info {
+	struct super_block *mi_sb;
+	struct mount_options mi_options;
+
+	/*
+	 * Protects operations with directory entries, basically it
+	 * protects all instances of lists:
+	 *   - directory.d_entries_head
+	 *   - inode_info.n_parent_links_head
+	 */
+	struct mutex mi_dir_ops_mutex;
+
+	/* Protects mi_nodes, mi_next_ino, and mi_root */
+	struct mutex mi_nodes_mutex;
+
+	/* State of the backing file */
+	struct backing_file_context *mi_bf_context;
+
+	/*
+	 * Hashtable (int ino) -> (struct inode_info)
+	 */
+	struct rhashtable mi_nodes;
+
+	/* Directory entry for the filesystem root */
+	struct directory mi_root;
+
+	/* Node number to allocate next */
+	int mi_next_ino;
+
+	/* Protects mi_last_pending_read_number and mi_pending_reads_count */
+	spinlock_t pending_reads_counters_lock;
+
+	/*
+	 * A queue of waiters who want to be notified about new pending reads.
+	 */
+	wait_queue_head_t mi_pending_reads_notif_wq;
+
+	/*
+	 * Last serial number that was assigned to a pending read.
+	 * 0 means no pending reads have been seen yet.
+	 */
+	int mi_last_pending_read_number;
+
+	/* Total number of reads waiting on data from all files */
+	int mi_pending_reads_count;
+};
+
+/* mount_info functions */
+struct mount_info *incfs_alloc_mount_info(struct super_block *sb,
+					struct file *backing_file);
+void incfs_free_mount_info(struct mount_info *mi);
+
+bool incfs_fresh_pending_reads_exist(struct mount_info *mi, int last_number);
+
+struct inode_info *incfs_get_node_by_name(struct directory *dir,
+					const char *name, int *dir_ver_out);
+struct data_file *incfs_get_file_from_node(struct inode_info *node);
+struct directory *incfs_get_dir_from_node(struct inode_info *node);
+struct inode_info *incfs_get_node_by_ino(struct mount_info *mi, int ino);
+struct data_file *incfs_get_file_by_ino(struct mount_info *mi, int ino);
+struct directory *incfs_get_dir_by_ino(struct mount_info *mi, int ino);
+
+ssize_t incfs_read_data_file_block(struct mem_range dst, struct data_file *df,
+			     int index);
+
+/*
+ * Collects pending reads and saves them into the array (reads/reads_size).
+ * Only reads with serial_number > sn_lowerbound are reported.
+ * Returns how many reads were saved into the array.
+ */
+int incfs_collect_pending_reads(struct mount_info *mi, int sn_lowerbound,
+			  struct incfs_pending_read_info *reads,
+			  int reads_size);
+
+/* Instructions processing */
+int incfs_process_new_file_inst(struct mount_info *mi,
+			  struct incfs_new_file_instruction *inst);
+int incfs_process_new_dir_entry_inst(struct mount_info *mi,
+			       enum incfs_instruction_type type,
+			       struct incfs_dir_entry_instruction *inst,
+			       char *name);
+
+int incfs_process_new_data_block(struct mount_info *mi,
+			   struct incfs_new_data_block *block,
+			   u8 *data);
+
+/*
+ * Scans whole backing file for metadata records.
+ * Returns an error or a number of processed metadata records.
+ */
+int incfs_scan_backing_file(struct mount_info *mi);
+
+bool incfs_equal_ranges(struct mem_range lhs, struct mem_range rhs);
+
+#endif /* _INCFS_DATA_MGMT_H */
--
2.21.0.593.g511ec345e18-goog


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 4/6] incfs: Integration with VFS layer
  2019-05-02  4:03 Initial patches for Incremental FS ezemtsov
                   ` (2 preceding siblings ...)
  2019-05-02  4:03 ` [PATCH 3/6] incfs: Management of in-memory FS data structures ezemtsov
@ 2019-05-02  4:03 ` ezemtsov
  2019-05-02  4:03 ` [PATCH 6/6] incfs: Integration tests for incremental-fs ezemtsov
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 33+ messages in thread
From: ezemtsov @ 2019-05-02  4:03 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tytso, Eugene Zemtsov

From: Eugene Zemtsov <ezemtsov@google.com>

Implementation of VFS callbacks for
- Reading data pages
- Traversing dir structure
- Handling ioctl-s
- Handling .cmd file reads and writes
- Mounting/unmounting file system

Signed-off-by: Eugene Zemtsov <ezemtsov@google.com>
---
 fs/incfs/vfs.c | 834 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 834 insertions(+)

diff --git a/fs/incfs/vfs.c b/fs/incfs/vfs.c
index 2e71f0edf8a1..7b453f19b543 100644
--- a/fs/incfs/vfs.c
+++ b/fs/incfs/vfs.c
@@ -4,12 +4,50 @@
  */
 #include <linux/blkdev.h>
 #include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/file.h>
+#include <linux/mm.h>
+#include <linux/mount.h>
+#include <linux/kernel.h>
+#include <linux/pagemap.h>
+#include <linux/string.h>
+#include <linux/parser.h>
+#include <linux/seq_file.h>
+#include <linux/poll.h>

 #include <uapi/linux/incrementalfs.h>
+#include "data_mgmt.h"

+#define READ_EXEC_FILE_MODE 0555
+#define READ_WRITE_FILE_MODE 0666
+
+static int remount_fs(struct super_block *sb, int *flags, char *data);
 static struct dentry *mount_fs(struct file_system_type *type, int flags,
 			       const char *dev_name, void *data);
+static struct dentry *dir_lookup(struct inode *dir_inode, struct dentry *dentry,
+				 unsigned int flags);
+static int iterate_incfs_dir(struct file *file, struct dir_context *ctx);
+static int read_one_page(struct file *f, struct page *page);
+static ssize_t command_write(struct file *f, const char __user *buf,
+			size_t size, loff_t *offset);
+static ssize_t command_read(struct file *f, char __user *buf, size_t len,
+			    loff_t *ppos);
+static __poll_t command_poll(struct file *file, poll_table *wait);
+static int command_open(struct inode *inode, struct file *file);
+static int command_release(struct inode *, struct file *);
+
 static void kill_sb(struct super_block *sb);
+static int dentry_revalidate(struct dentry *dentry, unsigned int flags);
+static int dentry_revalidate_weak(struct dentry *dentry, unsigned int flags);
+static long dispatch_ioctl(struct file *f, unsigned int req, unsigned long arg);
+static int show_options(struct seq_file *, struct dentry *);
+static int show_devname(struct seq_file *, struct dentry *);
+
+/* State of an open .cmd file, unique for each file descriptor. */
+struct command_file_state {
+	/* A serial number of the last pending read obtained from this file. */
+	int last_pending_read_sn;
+};

 struct file_system_type incfs_fs_type = {
 	.owner = THIS_MODULE,
@@ -19,9 +57,785 @@ struct file_system_type incfs_fs_type = {
 	.fs_flags = 0
 };

+static const struct super_operations incfs_super_ops = {
+	.statfs = simple_statfs,
+	.remount_fs = remount_fs,
+	.show_options = show_options,
+	.show_devname = show_devname
+};
+
+static const struct inode_operations incfs_dir_inode_ops = {
+	.lookup = dir_lookup,
+};
+
+static const struct file_operations incfs_dir_fops = {
+	.llseek = generic_file_llseek,
+	.read = generic_read_dir,
+	.iterate = iterate_incfs_dir,
+};
+
+static const struct dentry_operations incfs_dentry_ops = {
+	.d_revalidate = dentry_revalidate,
+	.d_weak_revalidate = dentry_revalidate_weak,
+};
+
+static const struct address_space_operations incfs_address_space_ops = {
+	.readpage = read_one_page,
+};
+
+static const struct file_operations incfs_file_ops = {
+	.read_iter = generic_file_read_iter,
+	.mmap = generic_file_mmap,
+	.splice_read = generic_file_splice_read,
+	.llseek = generic_file_llseek
+};
+
+static const struct file_operations incfs_command_file_ops = {
+	.read = command_read,
+	.write = command_write,
+	.poll = command_poll,
+	.open = command_open,
+	.release = command_release,
+	.llseek = noop_llseek,
+	.unlocked_ioctl = dispatch_ioctl,
+	.compat_ioctl = dispatch_ioctl
+};
+
+static const struct inode_operations incfs_file_inode_ops = {
+	.setattr = simple_setattr,
+	.getattr = simple_getattr,
+};
+
+static const char command_file_name[] = ".cmd";
+static struct mem_range command_file_name_range = {
+	.data = (u8 *)command_file_name,
+	.len = ARRAY_SIZE(command_file_name) - 1
+};
+static struct mem_range dot_range = {
+	.data = (u8 *)".",
+	.len = 1
+};
+static struct mem_range dotdot_range = {
+	.data = (u8 *)"..",
+	.len = 2
+};
+
+enum parse_parameter { Opt_backing_fd, Opt_read_timeout, Opt_err };
+static const match_table_t option_tokens = {
+	{ Opt_backing_fd, "backing_fd=%u" },
+	{ Opt_read_timeout, "read_timeout_ms=%u" },
+	{ Opt_err, NULL }
+};
+
+static struct super_block *file_superblock(struct file *f)
+{
+	struct inode *inode;
+
+	inode = file_inode(f);
+	return inode->i_sb;
+}
+
+static struct mount_info *get_mount_info(struct super_block *sb)
+{
+	struct mount_info *result = sb->s_fs_info;
+
+	WARN_ON(!result);
+	return result;
+}
+
+static int validate_name(struct mem_range name)
+{
+	int i = 0;
+
+	if (name.len > INCFS_MAX_NAME_LEN)
+		return -ENAMETOOLONG;
+
+	if (incfs_equal_ranges(dot_range, name) ||
+	    incfs_equal_ranges(dotdot_range, name) ||
+	    incfs_equal_ranges(command_file_name_range, name))
+		return -EINVAL;
+
+	for (i = 0; i < name.len; i++)
+		if (name.data[i] == 0 || name.data[i] == '/')
+			return -EINVAL;
+
+	return 0;
+}
+
+static int read_one_page(struct file *f, struct page *page)
+{
+	loff_t offset = 0;
+	loff_t size = 0;
+	ssize_t bytes_to_read = 0;
+	ssize_t read_result = 0;
+	struct inode *inode = page->mapping->host;
+	int block_index = 0;
+	int result = 0;
+	struct data_file *df = NULL;
+	void *page_start = kmap(page);
+
+	offset = page_offset(page);
+	block_index = offset / INCFS_DATA_FILE_BLOCK_SIZE;
+	if (offset & (INCFS_DATA_FILE_BLOCK_SIZE - 1)) {
+		/*
+		 * Page offset must be a multiplier of
+		 * INCFS_DATA_FILE_BLOCK_SIZE
+		 */
+		pr_warn("incfs: Not aligned read from a file %d at offset %lld",
+			(int)inode->i_ino, offset);
+		result = -EINVAL;
+		goto out;
+	}
+
+	size = i_size_read(inode);
+	df = incfs_get_file_from_node((struct inode_info *)inode->i_private);
+	if (!df) {
+		result = -EBADF;
+		goto out;
+	}
+
+	if (offset < size) {
+		bytes_to_read = min_t(loff_t, size - offset, PAGE_SIZE);
+		read_result = incfs_read_data_file_block(
+			range(page_start, bytes_to_read), df, block_index);
+	} else {
+		bytes_to_read = 0;
+		read_result = 0;
+	}
+
+	if (read_result < 0)
+		result = read_result;
+	else if (read_result < PAGE_SIZE)
+		zero_user(page, read_result, PAGE_SIZE - read_result);
+
+out:
+	if (result == 0)
+		SetPageUptodate(page);
+	else
+		SetPageError(page);
+
+	flush_dcache_page(page);
+	kunmap(page);
+	unlock_page(page);
+	return result;
+}
+
+static long ioctl_process_instructions(struct mount_info *mi, void __user *arg)
+{
+	struct incfs_instruction inst = {};
+	int error = 0;
+	const ssize_t data_buf_size = 2 * INCFS_DATA_FILE_BLOCK_SIZE;
+	bool copy_inst_back = false;
+	struct incfs_instruction __user *inst_usr_ptr = arg;
+	u8 *data_buf = NULL;
+
+	data_buf = (u8 *)__get_free_pages(GFP_NOFS,
+					  get_order(data_buf_size));
+	if (!data_buf)
+		return -ENOMEM;
+
+	/*
+	 * Make sure that incfs_instruction doesn't have
+	 * anything beyond reserved.
+	 */
+	BUILD_BUG_ON(sizeof(struct incfs_instruction) >
+		offsetof(struct incfs_instruction, reserved) +
+		sizeof(inst.reserved));
+	if (copy_from_user(&inst, inst_usr_ptr, sizeof(inst)) > 0) {
+		error = -EINVAL;
+		goto out;
+	}
+
+	if (inst.version != INCFS_HEADER_VER)
+		return -ENOTSUPP;
+
+	switch (inst.type) {
+	case INCFS_INSTRUCTION_NEW_FILE: {
+		error = incfs_process_new_file_inst(mi, &inst.file);
+		copy_inst_back = true;
+		break;
+	}
+	case INCFS_INSTRUCTION_ADD_DIR_ENTRY:
+	case INCFS_INSTRUCTION_REMOVE_DIR_ENTRY: {
+		if (inst.dir_entry.name_len > data_buf_size) {
+			error = -E2BIG;
+			break;
+		}
+		if (copy_from_user(data_buf,
+				u64_to_user_ptr(inst.dir_entry.name),
+				inst.dir_entry.name_len)) {
+			error = -EFAULT;
+			break;
+		}
+		error = validate_name(range(data_buf,
+					inst.dir_entry.name_len));
+		if (error)
+			break;
+
+		error = incfs_process_new_dir_entry_inst(mi, inst.type,
+							&inst.dir_entry,
+							(char *)data_buf);
+		break;
+	}
+	default:
+		error = -EINVAL;
+		break;
+	}
+
+	if (!error && copy_inst_back) {
+		/*
+		 * Copy instruction back to populate _out fields.
+		 */
+		if (copy_to_user(inst_usr_ptr, &inst, sizeof(inst)))
+			error = -EFAULT;
+	}
+out:
+	if (data_buf)
+		free_pages((unsigned long)data_buf, get_order(data_buf_size));
+	return error;
+}
+
+static long dispatch_ioctl(struct file *f, unsigned int req, unsigned long arg)
+{
+	struct mount_info *mi = get_mount_info(file_superblock(f));
+
+	switch (req) {
+	case INCFS_IOC_PROCESS_INSTRUCTION:
+		return ioctl_process_instructions(mi, (void __user *)arg);
+	default:
+		return -EINVAL;
+	}
+}
+
+static int command_open(struct inode *inode, struct file *file)
+{
+	struct command_file_state *cmd_state = NULL;
+
+	cmd_state = kzalloc(sizeof(*cmd_state), GFP_NOFS);
+	if (!cmd_state)
+		return -ENOMEM;
+
+	file->private_data = cmd_state;
+	return 0;
+}
+
+static int command_release(struct inode *inode, struct file *file)
+{
+	kfree(file->private_data);
+	return 0;
+}
+
+static ssize_t command_write(struct file *f, const char __user *buf,
+			size_t size, loff_t *offset)
+{
+	struct mount_info *mi = get_mount_info(file_superblock(f));
+	const ssize_t data_buf_size = 2 * INCFS_DATA_FILE_BLOCK_SIZE;
+	size_t block_count = size / sizeof(struct incfs_new_data_block);
+	struct incfs_new_data_block __user *usr_blocks =
+				(struct incfs_new_data_block __user *)buf;
+	u8 *data_buf = NULL;
+	ssize_t error = 0;
+	int i = 0;
+
+	data_buf = (u8 *)__get_free_pages(GFP_NOFS,
+					  get_order(data_buf_size));
+	if (!data_buf)
+		return -ENOMEM;
+
+	for (i = 0; i < block_count; i++) {
+		struct incfs_new_data_block block = {};
+
+		if (copy_from_user(&block, &usr_blocks[i], sizeof(block)) > 0) {
+			error = -EFAULT;
+			break;
+		}
+
+		if (block.data_len > data_buf_size) {
+			error = -E2BIG;
+			break;
+		}
+		if (copy_from_user(data_buf, u64_to_user_ptr(block.data),
+					block.data_len) > 0) {
+			error = -EFAULT;
+			break;
+		}
+		block.data = 0; /* To make sure nobody uses it. */
+		error = incfs_process_new_data_block(mi, &block, data_buf);
+		if (error)
+			break;
+	}
+
+	if (data_buf)
+		free_pages((unsigned long)data_buf, get_order(data_buf_size));
+	*offset = 0;
+
+	/*
+	 * Only report the error if no records were processed, otherwise
+	 * just return how many were processed successfully.
+	 */
+	if (i == 0)
+		return error;
+
+	return i * sizeof(struct incfs_new_data_block);
+}
+
+static ssize_t command_read(struct file *f, char __user *buf, size_t len,
+			    loff_t *ppos)
+{
+	struct command_file_state *cmd_state = f->private_data;
+	struct mount_info *mi = get_mount_info(file_superblock(f));
+	struct incfs_pending_read_info *reads_buf = NULL;
+	size_t reads_to_collect = len / sizeof(*reads_buf);
+	int last_known_read_sn = READ_ONCE(cmd_state->last_pending_read_sn);
+	int new_max_sn = last_known_read_sn;
+	int reads_collected = 0;
+	ssize_t result = 0;
+	int i = 0;
+
+	if (!incfs_fresh_pending_reads_exist(mi, last_known_read_sn))
+		return 0;
+
+	reads_buf = (struct incfs_pending_read_info *)get_zeroed_page(
+		GFP_NOFS);
+	if (!reads_buf)
+		return -ENOMEM;
+
+	reads_to_collect = min_t(size_t, PAGE_SIZE / sizeof(*reads_buf),
+				reads_to_collect);
+
+	reads_collected = incfs_collect_pending_reads(
+		mi, last_known_read_sn, reads_buf, reads_to_collect);
+	if (reads_collected < 0) {
+		result = reads_collected;
+		goto out;
+	}
+
+	for (i = 0; i < reads_collected; i++)
+		if (reads_buf[i].serial_number > new_max_sn)
+			new_max_sn = reads_buf[i].serial_number;
+
+	/*
+	 * Just to make sure that we don't accidentally copy more data
+	 * to reads buffer than userspace can handle.
+	 */
+	reads_collected = min_t(size_t, reads_collected, reads_to_collect);
+	result = reads_collected * sizeof(*reads_buf);
+
+	/* Copy reads info to the userspace buffer */
+	if (copy_to_user(buf, reads_buf, result)) {
+		result = -EFAULT;
+		goto out;
+	}
+
+	 WRITE_ONCE(cmd_state->last_pending_read_sn, new_max_sn);
+	 *ppos = 0;
+out:
+	if (reads_buf)
+		free_page((unsigned long)reads_buf);
+	return result;
+}
+
+static __poll_t command_poll(struct file *file, poll_table *wait)
+{
+	struct command_file_state *cmd_state = file->private_data;
+	struct mount_info *mi = get_mount_info(file_superblock(file));
+	__poll_t ret = 0;
+
+	poll_wait(file, &mi->mi_pending_reads_notif_wq, wait);
+	if (incfs_fresh_pending_reads_exist(mi,
+		cmd_state->last_pending_read_sn))
+		ret = EPOLLIN | EPOLLRDNORM;
+
+	return ret;
+}
+
+static struct timespec64 backing_file_time(struct super_block *sb)
+{
+	struct timespec64 zero_time = { .tv_sec = 0, .tv_nsec = 0 };
+	struct mount_info *mi = get_mount_info(sb);
+	struct inode *backing_inode = NULL;
+
+	backing_inode = file_inode(mi->mi_bf_context->bc_file);
+	if (!backing_inode)
+		return zero_time;
+	return backing_inode->i_ctime;
+}
+
+static struct inode *get_inode_for_incfs_node(struct super_block *sb,
+					      struct inode_info *n_info)
+{
+	unsigned long ino = n_info->n_ino;
+	struct inode *inode = iget_locked(sb, ino);
+
+	if (!inode)
+		return NULL;
+
+	if (inode->i_state & I_NEW) {
+		inode->i_ctime = backing_file_time(sb);
+		inode->i_mtime = inode->i_ctime;
+		inode->i_atime = inode->i_ctime;
+		inode->i_ino = ino;
+		inode->i_private = n_info;
+		inode_init_owner(inode, NULL, n_info->n_mode);
+
+		switch (n_info->n_type) {
+		case INCFS_NODE_FILE: {
+			struct data_file *df = incfs_get_file_from_node(n_info);
+
+			inode->i_size = df->df_size;
+			inode->i_blocks = df->df_block_count;
+			inode->i_mapping->a_ops = &incfs_address_space_ops;
+			inode->i_op = &incfs_file_inode_ops;
+			inode->i_fop = &incfs_file_ops;
+			break;
+		}
+		case INCFS_NODE_DIR:
+			inode->i_size = 0;
+			inode->i_blocks = 1;
+			inode->i_mapping->a_ops = &incfs_address_space_ops;
+			inode->i_op = &incfs_dir_inode_ops;
+			inode->i_fop = &incfs_dir_fops;
+			break;
+
+			break;
+		default:
+			pr_warn("incfs: Unknown inode type");
+			break;
+		}
+
+		unlock_new_inode(inode);
+	}
+
+	return inode;
+}
+
+static struct inode *get_inode_for_commands(struct super_block *sb)
+{
+	struct inode *inode = iget_locked(sb, INCFS_COMMAND_INODE);
+
+	if (!inode)
+		return NULL;
+
+	if (inode->i_state & I_NEW) {
+		inode->i_ctime = backing_file_time(sb);
+		inode->i_mtime = inode->i_ctime;
+		inode->i_atime = inode->i_ctime;
+		inode->i_size = 0;
+		inode->i_ino = INCFS_COMMAND_INODE;
+		inode->i_private = NULL;
+
+		inode_init_owner(inode, NULL, S_IFREG | READ_WRITE_FILE_MODE);
+
+		inode->i_op = &incfs_file_inode_ops;
+		inode->i_fop = &incfs_command_file_ops;
+
+		unlock_new_inode(inode);
+	}
+
+	return inode;
+}
+
+static int iterate_incfs_dir(struct file *file, struct dir_context *ctx)
+{
+	struct inode *inode = file_inode(file);
+	struct directory *dir = NULL;
+	struct mount_info *mi = NULL;
+	struct dir_entry_info *entry;
+	loff_t entries_found = 0;
+	loff_t aux_entries_count = 2; // 2 for "." and ".."
+
+	dir = incfs_get_dir_from_node((struct inode_info *)inode->i_private);
+	if (!dir)
+		return -EFAULT;
+
+	if (!dir_emit_dots(file, ctx))
+		return 0;
+
+	mi = dir->d_node.n_mount_info;
+	if (ctx->pos == 2 && dir->d_node.n_ino == INCFS_ROOT_INODE) {
+		if (!dir_emit(ctx, command_file_name,
+			      ARRAY_SIZE(command_file_name) - 1,
+			      INCFS_COMMAND_INODE, DT_REG))
+			return 0;
+		ctx->pos++;
+		aux_entries_count++; //Aux entry for the .cmd file
+	}
+
+	mutex_lock(&mi->mi_dir_ops_mutex);
+	list_for_each_entry(entry, &dir->d_entries_head, de_entries_list) {
+		unsigned int type = (entry->de_child->n_type == INCFS_NODE_DIR)
+					? DT_DIR : DT_REG;
+
+		entries_found++;
+		if (entries_found > ctx->pos - aux_entries_count) {
+			if (!dir_emit(ctx, entry->de_name.data,
+					entry->de_name.len,
+					entry->de_child->n_ino, type))
+				break;
+			ctx->pos++;
+		}
+	}
+	mutex_unlock(&mi->mi_dir_ops_mutex);
+	return 0;
+}
+
+static struct dentry *dir_lookup(struct inode *dir_inode, struct dentry *dentry,
+				 unsigned int flags)
+{
+	struct inode *result = NULL;
+	struct super_block *sb = dir_inode->i_sb;
+	struct mount_info *mi = get_mount_info(sb);
+	int dir_ver = 0;
+	struct mem_range name_rng = range((u8 *)dentry->d_name.name,
+						dentry->d_name.len);
+
+	if (incfs_equal_ranges(dot_range, name_rng))
+		result = dir_inode;
+	else if (incfs_equal_ranges(dotdot_range, name_rng)) {
+		struct directory *parent_dir = NULL;
+
+		mutex_lock(&mi->mi_nodes_mutex);
+		parent_dir = incfs_get_dir_by_ino(mi, parent_ino(dentry));
+		if (parent_dir)
+			result = get_inode_for_incfs_node(sb,
+							&parent_dir->d_node);
+		mutex_unlock(&mi->mi_nodes_mutex);
+	} else if (incfs_equal_ranges(command_file_name_range, name_rng)) {
+		result = get_inode_for_commands(sb);
+	} else {
+		struct directory *dir = NULL;
+		struct inode_info *n_info = NULL;
+
+		mutex_lock(&mi->mi_nodes_mutex);
+		dir = incfs_get_dir_from_node(
+			(struct inode_info *)dir_inode->i_private);
+		n_info = incfs_get_node_by_name(dir, dentry->d_name.name,
+						&dir_ver);
+		if (n_info)
+			result = get_inode_for_incfs_node(sb, n_info);
+
+		mutex_unlock(&mi->mi_nodes_mutex);
+	}
+	dentry->d_fsdata = (void *)(long)dir_ver;
+	d_add(dentry, result);
+	return NULL;
+}
+
+static int parse_options(struct mount_options *opts, char *str)
+{
+	substring_t args[MAX_OPT_ARGS];
+	int value;
+	char *position;
+
+	if (opts == NULL)
+		return -EFAULT;
+
+	opts->backing_fd = 0;
+	opts->read_timeout_ms = 1000; /* Default: 1s */
+	if (str == NULL || *str == 0)
+		return 0;
+
+	while ((position = strsep(&str, ",")) != NULL) {
+		int token;
+
+		if (!*position)
+			continue;
+
+		token = match_token(position, option_tokens, args);
+
+		switch (token) {
+		case Opt_backing_fd:
+			if (match_int(&args[0], &value))
+				return -EINVAL;
+			opts->backing_fd = value;
+			break;
+		case Opt_read_timeout:
+			if (match_int(&args[0], &value))
+				return -EINVAL;
+			opts->read_timeout_ms = value;
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int remount_fs(struct super_block *sb, int *flags, char *data)
+{
+	struct mount_info *mi = get_mount_info(sb);
+	struct mount_options options;
+	int err = 0;
+
+	sync_filesystem(sb);
+	err = parse_options(&options, (char *)data);
+	if (err)
+		return err;
+
+	if (mi->mi_options.read_timeout_ms != options.read_timeout_ms) {
+		mi->mi_options.read_timeout_ms = options.read_timeout_ms;
+		pr_info("New Incremental-fs timeout_ms=%d",
+			options.read_timeout_ms);
+	}
+
+	return 0;
+}
+
+static int dentry_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	int dentry_ver = (int)(long)dentry->d_fsdata;
+	struct inode *inode = NULL;
+	struct dentry *parent = NULL;
+	struct directory *parent_dir = NULL;
+
+	if (flags & LOOKUP_RCU)
+		return -ECHILD;
+
+	parent = dget_parent(dentry);
+	parent_dir = incfs_get_dir_from_node((struct inode_info *)
+						d_inode(parent)->i_private);
+	dput(parent);
+
+	if (!parent_dir)
+		return 0;
+
+	/*
+	 * Reload globally visible parent dir version. If it hasn't changed
+	 * since the dentry had been created, it must be still valid.
+	 */
+	smp_mb__before_atomic();
+	if (dentry_ver == atomic_read(&parent_dir->d_version))
+		return 1;
+
+	/* Root dentry is always valid. */
+	inode = d_inode(dentry);
+	if (inode && inode->i_ino == INCFS_ROOT_INODE)
+		return 1;
+
+	return 0;
+}
+
+static int dentry_revalidate_weak(struct dentry *dentry, unsigned int flags)
+{
+	/*
+	 * Weak version of revalidate only needs to make sure that inode
+	 * is still okay. Incremental-fs never deletes inodes, so no need
+	 * for extra steps here.
+	 */
+	struct inode *inode = d_inode(dentry);
+
+	if (!inode || !inode->i_private)
+		return 0;
+	return 1;
+}
+
 static int fill_super_block(struct super_block *sb, void *data, int silent)
 {
+	struct mount_options options;
+	struct inode *inode = NULL;
+	struct mount_info *mi = NULL;
+	struct file *backing_file = NULL;
+	const char *file_name = NULL;
+	int result = 0;
+
+	sb->s_op = &incfs_super_ops;
+	sb->s_d_op = &incfs_dentry_ops;
+	sb->s_flags |= S_NOATIME;
+	sb->s_magic = INCFS_MAGIC_NUMBER;
+	sb->s_time_gran = 1;
+	sb->s_blocksize = INCFS_DATA_FILE_BLOCK_SIZE;
+	sb->s_blocksize_bits = blksize_bits(sb->s_blocksize);
+	sb->s_maxbytes = MAX_LFS_FILESIZE;
+
+	BUILD_BUG_ON(PAGE_SIZE != INCFS_DATA_FILE_BLOCK_SIZE);
+
+	result = parse_options(&options, (char *)data);
+	if (result != 0)
+		goto err;
+
+	if (options.backing_fd == 0) {
+		pr_err("Backing FD not set, filesystem can't be mounted.");
+		result = -EBADFD;
+		goto err;
+	}
+
+	backing_file = fget(options.backing_fd);
+	if (!backing_file) {
+		pr_err("Invalid backing FD: %d", options.backing_fd);
+		result = -EBADFD;
+		goto err;
+	}
+
+	mi = incfs_alloc_mount_info(sb, backing_file);
+	if (IS_ERR_OR_NULL(mi)) {
+		result = PTR_ERR(mi);
+		mi = NULL;
+		goto err;
+	}
+
+	mi->mi_options = options;
+	sb->s_fs_info = mi;
+	file_name = mi->mi_bf_context->bc_file->f_path.dentry->d_name.name;
+
+	inode = new_inode(sb);
+	if (inode) {
+		inode->i_ino = INCFS_ROOT_INODE;
+		inode->i_ctime = backing_file_time(sb);
+		inode->i_mtime = inode->i_ctime;
+		inode->i_atime = inode->i_ctime;
+		inode->i_private = &mi->mi_root.d_node;
+
+		inode->i_op = &incfs_dir_inode_ops;
+		inode->i_fop = &incfs_dir_fops;
+
+		inode_init_owner(inode, NULL, S_IFDIR | READ_EXEC_FILE_MODE);
+	}
+
+	sb->s_root = d_make_root(inode);
+	if (!sb->s_root) {
+		result = -ENOMEM;
+		goto err;
+	}
+
+	if (incfs_get_end_offset(mi->mi_bf_context->bc_file) > 0) {
+		int found_mds = 0;
+
+		/*
+		 * Backing file has data,
+		 * let's try to interpret it as inc-fs image.
+		 */
+		found_mds = incfs_scan_backing_file(mi);
+		if (found_mds < 0) {
+			result = found_mds;
+			pr_err("Backing file '%s' scan error: %d",
+				file_name, -result);
+			goto err;
+		}
+	} else {
+		/*
+		 * No data in the backing file,
+		 * let's initialize a new image.
+		 */
+		result = incfs_make_empty_backing_file(mi->mi_bf_context);
+		if (result < 0) {
+			pr_err("Backing file '%s' initialization error: %d",
+				file_name, -result);
+			goto err;
+		}
+	}
 	return 0;
+err:
+	sb->s_fs_info = NULL;
+	incfs_free_mount_info(mi);
+	if (!mi && backing_file) {
+		/*
+		 * Close backing_file only if mount_info was never created.
+		 * Otherwise it's closed in incfs_free_mount_info.
+		 */
+		fput(backing_file);
+	}
+	return result;
 }

 static struct dentry *mount_fs(struct file_system_type *type, int flags,
@@ -32,6 +846,26 @@ static struct dentry *mount_fs(struct file_system_type *type, int flags,

 static void kill_sb(struct super_block *sb)
 {
+	struct mount_info *mi = sb->s_fs_info;
+
+	incfs_free_mount_info(mi);
 	generic_shutdown_super(sb);
 }

+static int show_devname(struct seq_file *m, struct dentry *root)
+{
+	struct mount_info *mi = get_mount_info(root->d_sb);
+	const char *backing_file =
+			mi->mi_bf_context->bc_file->f_path.dentry->d_name.name;
+
+	seq_puts(m, backing_file);
+	return 0;
+}
+
+static int show_options(struct seq_file *m, struct dentry *root)
+{
+	struct mount_info *mi = get_mount_info(root->d_sb);
+
+	seq_printf(m, ",read_timeout_ms=%u", mi->mi_options.read_timeout_ms);
+	return 0;
+}
--
2.21.0.593.g511ec345e18-goog


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 6/6] incfs: Integration tests for incremental-fs
  2019-05-02  4:03 Initial patches for Incremental FS ezemtsov
                   ` (3 preceding siblings ...)
  2019-05-02  4:03 ` [PATCH 4/6] incfs: Integration with VFS layer ezemtsov
@ 2019-05-02  4:03 ` ezemtsov
  2019-05-02 11:19 ` Initial patches for Incremental FS Amir Goldstein
  2019-05-02 13:47 ` J. R. Okajima
  6 siblings, 0 replies; 33+ messages in thread
From: ezemtsov @ 2019-05-02  4:03 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: tytso, Eugene Zemtsov

From: Eugene Zemtsov <ezemtsov@google.com>

Testing main use cases for Incremental FS.
Things like:
	- interaction between consuments and producers
	- basic dir operations
	- mounting a backing file with existing data

Signed-off-by: Eugene Zemtsov <ezemtsov@google.com>
---
 tools/testing/selftests/Makefile              |    1 +
 .../selftests/filesystems/incfs/.gitignore    |    1 +
 .../selftests/filesystems/incfs/Makefile      |   12 +
 .../selftests/filesystems/incfs/config        |    1 +
 .../selftests/filesystems/incfs/incfs_test.c  | 1603 +++++++++++++++++
 .../selftests/filesystems/incfs/utils.c       |  159 ++
 .../selftests/filesystems/incfs/utils.h       |   39 +
 7 files changed, 1816 insertions(+)
 create mode 100644 tools/testing/selftests/filesystems/incfs/.gitignore
 create mode 100644 tools/testing/selftests/filesystems/incfs/Makefile
 create mode 100644 tools/testing/selftests/filesystems/incfs/config
 create mode 100644 tools/testing/selftests/filesystems/incfs/incfs_test.c
 create mode 100644 tools/testing/selftests/filesystems/incfs/utils.c
 create mode 100644 tools/testing/selftests/filesystems/incfs/utils.h

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 971fc8428117..78fd8590cede 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -11,6 +11,7 @@ TARGETS += efivarfs
 TARGETS += exec
 TARGETS += filesystems
 TARGETS += filesystems/binderfs
+TARGETS += filesystems/incfs
 TARGETS += firmware
 TARGETS += ftrace
 TARGETS += futex
diff --git a/tools/testing/selftests/filesystems/incfs/.gitignore b/tools/testing/selftests/filesystems/incfs/.gitignore
new file mode 100644
index 000000000000..4cba9c219a92
--- /dev/null
+++ b/tools/testing/selftests/filesystems/incfs/.gitignore
@@ -0,0 +1 @@
+incfs_test
\ No newline at end of file
diff --git a/tools/testing/selftests/filesystems/incfs/Makefile b/tools/testing/selftests/filesystems/incfs/Makefile
new file mode 100644
index 000000000000..493efc23fa33
--- /dev/null
+++ b/tools/testing/selftests/filesystems/incfs/Makefile
@@ -0,0 +1,12 @@
+# SPDX-License-Identifier: GPL-2.0
+CFLAGS += -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -Wall
+CFLAGS += -I../../../../../usr/include/
+CFLAGS += -I../../../../include/uapi/
+CFLAGS += -I../../../../lib
+
+EXTRA_SOURCES := utils.c ../../../../lib/lz4.c
+TEST_GEN_PROGS := incfs_test
+
+include ../../lib.mk
+
+$(OUTPUT)/incfs_test: incfs_test.c $(EXTRA_SOURCES)
diff --git a/tools/testing/selftests/filesystems/incfs/config b/tools/testing/selftests/filesystems/incfs/config
new file mode 100644
index 000000000000..b6749837a318
--- /dev/null
+++ b/tools/testing/selftests/filesystems/incfs/config
@@ -0,0 +1 @@
+CONFIG_INCREMENTAL_FS=y
\ No newline at end of file
diff --git a/tools/testing/selftests/filesystems/incfs/incfs_test.c b/tools/testing/selftests/filesystems/incfs/incfs_test.c
new file mode 100644
index 000000000000..6c2797970f77
--- /dev/null
+++ b/tools/testing/selftests/filesystems/incfs/incfs_test.c
@@ -0,0 +1,1603 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2018 Google LLC
+ */
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <sys/mount.h>
+#include <errno.h>
+#include <sys/wait.h>
+#include <alloca.h>
+#include <string.h>
+#include <stdio.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include "../../kselftest.h"
+
+#include "lz4.h"
+#include "utils.h"
+#define TEST_FAILURE 1
+#define TEST_SUCCESS 0
+
+struct test_file {
+	int ino;
+	char *name;
+	off_t size;
+};
+
+struct test_files_set {
+	struct test_file *files;
+	int files_count;
+};
+
+struct test_files_set get_test_files_set(void)
+{
+	static struct test_file files[] = {
+			{ .name = "file_one_byte", .size = 1 },
+			{ .name = "file_one_block",
+			.size = INCFS_DATA_FILE_BLOCK_SIZE },
+			{ .name = "file_one_and_a_half_blocks",
+			.size = INCFS_DATA_FILE_BLOCK_SIZE +
+				INCFS_DATA_FILE_BLOCK_SIZE / 2 },
+			{ .name = "file_three",
+			.size = 300 * INCFS_DATA_FILE_BLOCK_SIZE + 3 },
+			{ .name = "file_four",
+			.size = 400 * INCFS_DATA_FILE_BLOCK_SIZE + 7 },
+			{ .name = "file_five",
+			.size = 500 * INCFS_DATA_FILE_BLOCK_SIZE + 7 },
+			{ .name = "file_six",
+			.size = 600 * INCFS_DATA_FILE_BLOCK_SIZE + 7 },
+			{ .name = "file_seven",
+			.size = 700 * INCFS_DATA_FILE_BLOCK_SIZE + 7 },
+			{ .name = "file_eight",
+			.size = 800 * INCFS_DATA_FILE_BLOCK_SIZE + 7 },
+			{ .name = "file_nine",
+			.size = 900 * INCFS_DATA_FILE_BLOCK_SIZE + 7 },
+			{ .name = "file_big",
+			.size = 500 * 1024 * 1024 }
+		};
+	return (struct test_files_set){
+		.files = files,
+		.files_count = ARRAY_SIZE(files)
+	};
+}
+
+struct test_files_set get_small_test_files_set(void)
+{
+	static struct test_file files[] = {
+			{ .name = "file_one_byte", .size = 1 },
+			{ .name = "file_one_block",
+			.size = INCFS_DATA_FILE_BLOCK_SIZE },
+			{ .name = "file_one_and_a_half_blocks",
+			.size = INCFS_DATA_FILE_BLOCK_SIZE +
+				INCFS_DATA_FILE_BLOCK_SIZE / 2 },
+			{ .name = "file_three",
+			.size = 300 * INCFS_DATA_FILE_BLOCK_SIZE + 3 },
+			{ .name = "file_four",
+			.size = 400 * INCFS_DATA_FILE_BLOCK_SIZE + 7 }
+		};
+	return (struct test_files_set){
+		.files = files,
+		.files_count = ARRAY_SIZE(files)
+	};
+}
+
+static int get_file_block_seed(int file, int block)
+{
+	return 7919 * file + block;
+}
+
+static loff_t min(loff_t a, loff_t b)
+{
+	return a < b ? a : b;
+}
+
+static pid_t flush_and_fork(void)
+{
+	fflush(stdout);
+	return fork();
+}
+
+static void print_error(char *msg)
+{
+	ksft_print_msg("%s: %s\n", msg, strerror(errno));
+}
+
+static int wait_for_process(pid_t pid)
+{
+	int status;
+	int wait_res;
+
+	wait_res = waitpid(pid, &status, 0);
+	if (wait_res <= 0) {
+		print_error("Can't wait for the child");
+		return -EINVAL;
+	}
+	if (!WIFEXITED(status)) {
+		ksft_print_msg("Unexpected child status pid=%d\n", pid);
+		return -EINVAL;
+	}
+	status = WEXITSTATUS(status);
+	if (status != 0)
+		return status;
+	return 0;
+}
+
+static void rnd_buf(uint8_t *data, size_t len, unsigned int seed)
+{
+	int i;
+
+	for (i = 0; i < len; i++) {
+		seed = 1103515245 * seed + 12345;
+		data[i] = (uint8_t)(seed >> (i % 13));
+	}
+}
+
+struct file_and_block {
+	struct test_file *file;
+	int block_index;
+};
+
+static int emit_test_blocks(int fd, struct file_and_block *blocks, int count)
+{
+	uint8_t data[INCFS_DATA_FILE_BLOCK_SIZE];
+	uint8_t comp_data[2 * INCFS_DATA_FILE_BLOCK_SIZE];
+	int block_count = (count > 32) ? 32 : count;
+	int data_buf_size = 2 * INCFS_DATA_FILE_BLOCK_SIZE
+					* block_count;
+	uint8_t *data_buf = malloc(data_buf_size);
+	uint8_t *current_data = data_buf;
+	uint8_t *data_end = data_buf + data_buf_size;
+	struct incfs_new_data_block *block_buf =
+			calloc(block_count, sizeof(*block_buf));
+	ssize_t write_res = 0;
+	int error = 0;
+	int i = 0;
+	int blocks_written = 0;
+
+	for (i = 0; i < block_count; i++) {
+		int block_index = blocks[i].block_index;
+		struct test_file *file = blocks[i].file;
+		bool compress = (file->ino + block_index) % 2 == 0;
+		int seed = get_file_block_seed(file->ino, block_index);
+		off_t block_offset =
+			((off_t)block_index) * INCFS_DATA_FILE_BLOCK_SIZE;
+		size_t block_size = 0;
+
+		if (block_offset > file->size) {
+			error = -EINVAL;
+			break;
+		} else {
+			if (file->size - block_offset
+					> INCFS_DATA_FILE_BLOCK_SIZE)
+				block_size = INCFS_DATA_FILE_BLOCK_SIZE;
+			else
+				block_size = file->size - block_offset;
+		}
+
+		rnd_buf(data, block_size, seed);
+		if (compress) {
+			size_t comp_size = LZ4_compress_default((char *)data,
+				(char *)comp_data, block_size,
+				ARRAY_SIZE(comp_data));
+
+			if (comp_size <= 0) {
+				error = -EBADMSG;
+				break;
+			}
+			if (current_data + comp_size > data_end) {
+				error = -ENOMEM;
+				break;
+			}
+			memcpy(current_data, comp_data, comp_size);
+			block_size = comp_size;
+			block_buf[i].compression = COMPRESSION_LZ4;
+		} else {
+			if (current_data + block_size > data_end) {
+				error = -ENOMEM;
+				break;
+			}
+			memcpy(current_data, data, block_size);
+			block_buf[i].compression = COMPRESSION_NONE;
+		}
+
+		block_buf[i].file_ino = file->ino;
+		block_buf[i].block_index = block_index;
+		block_buf[i].data_len = block_size;
+		block_buf[i].data = ptr_to_u64(current_data);
+		block_buf[i].compression =
+			compress ? COMPRESSION_LZ4 : COMPRESSION_NONE;
+		current_data += block_size;
+	}
+
+	if (!error) {
+		write_res = write(fd, block_buf, sizeof(*block_buf) * i);
+		if (write_res < 0)
+			error = -errno;
+		else
+			blocks_written = write_res / sizeof(*block_buf);
+	}
+	if (error) {
+		ksft_print_msg("Writing data block error. Write returned: %d. Error:%s\n",
+				write_res, strerror(-error));
+	}
+	free(block_buf);
+	free(data_buf);
+	return (error < 0) ? error : blocks_written;
+}
+
+static int emit_test_block(int fd, struct test_file *file, int block_index)
+{
+	struct file_and_block blk = {
+		.file = file,
+		.block_index = block_index
+	};
+	int res = 0;
+
+	res = emit_test_blocks(fd, &blk, 1);
+	if (res == 0)
+		return -EINVAL;
+	if (res == 1)
+		return 0;
+	return res;
+}
+
+static void shuffle(int array[], int count, unsigned int seed)
+{
+	int i;
+
+	for (i = 0; i < count - 1; i++) {
+		int items_left = count - i;
+		int shuffle_index;
+		int v;
+
+		seed = 1103515245 * seed + 12345;
+		shuffle_index = i + seed % items_left;
+
+		v = array[shuffle_index];
+		array[shuffle_index] = array[i];
+		array[i] = v;
+	}
+}
+
+static int emit_test_file_data(int fd, struct test_file *file)
+{
+	int i;
+	int block_cnt = 1 + (file->size - 1) / INCFS_DATA_FILE_BLOCK_SIZE;
+	int *block_indexes = NULL;
+	struct file_and_block *blocks = NULL;
+	int result = 0;
+	int blocks_written = 0;
+
+	if (file->size == 0)
+		return 0;
+
+	blocks = calloc(block_cnt, sizeof(*blocks));
+	block_indexes = calloc(block_cnt, sizeof(*block_indexes));
+	for (i = 0; i < block_cnt; i++)
+		block_indexes[i] = i;
+
+	shuffle(block_indexes, block_cnt, file->ino);
+	for (i = 0; i < block_cnt; i++) {
+		blocks[i].block_index = block_indexes[i];
+		blocks[i].file = file;
+	}
+
+	for (i = 0; i < block_cnt; i += blocks_written) {
+		blocks_written = emit_test_blocks(fd,
+				blocks + i,
+				block_cnt - i);
+		if (blocks_written < 0) {
+			result = blocks_written;
+			goto out;
+		}
+		if (blocks_written == 0) {
+			result = -EIO;
+			goto out;
+		}
+	}
+out:
+	free(blocks);
+	free(block_indexes);
+	return result;
+}
+
+static loff_t read_whole_file(char *filename)
+{
+	int fd = -1;
+	loff_t result;
+	loff_t bytes_read = 0;
+	uint8_t buff[16 * 1024];
+
+	fd = open(filename, O_RDONLY);
+	if (fd <= 0)
+		return fd;
+
+	while (1) {
+		int read_result = read(fd, buff, ARRAY_SIZE(buff));
+
+		if (read_result < 0) {
+			print_error("Error during reading from a file.");
+			result = -errno;
+			goto cleanup;
+		} else if (read_result == 0)
+			break;
+
+		bytes_read += read_result;
+	}
+	result = bytes_read;
+
+cleanup:
+	close(fd);
+	return result;
+}
+
+
+static int read_test_file(uint8_t *buf, size_t len,
+			char *filename, int block_idx)
+{
+	int fd = -1;
+	int result;
+	int bytes_read = 0;
+	size_t bytes_to_read = len;
+	off_t offset = ((off_t)block_idx) * INCFS_DATA_FILE_BLOCK_SIZE;
+
+	fd = open(filename, O_RDONLY);
+	if (fd <= 0)
+		return fd;
+
+	if (lseek(fd, offset, SEEK_SET) != offset) {
+		print_error("Seek error");
+		return -errno;
+	}
+
+	while (bytes_read < bytes_to_read) {
+		int read_result =
+			read(fd, buf + bytes_read, bytes_to_read - bytes_read);
+		if (read_result < 0) {
+			result = -errno;
+			goto cleanup;
+		} else if (read_result == 0)
+			break;
+
+		bytes_read += read_result;
+	}
+	result = bytes_read;
+
+cleanup:
+	close(fd);
+	return result;
+}
+
+static int open_test_backing_file(char *mount_dir, bool delete)
+{
+	char backing_file_name[255];
+	int backing_fd;
+
+	snprintf(backing_file_name, ARRAY_SIZE(backing_file_name), "%s.img",
+		 mount_dir);
+	backing_fd = open(backing_file_name, O_CREAT | O_RDWR | O_TRUNC, 0666);
+	if (backing_fd < 0)
+		print_error("Can't open backing file");
+	else if (delete) {
+		/* Once backing file was opened, it's safe to delete it ;) */
+		remove(backing_file_name);
+	}
+	return backing_fd;
+}
+
+static int open_existing_test_backing_file(char *mount_dir, bool delete)
+{
+	char backing_file_name[255];
+	int backing_fd;
+
+	snprintf(backing_file_name, ARRAY_SIZE(backing_file_name), "%s.img",
+		 mount_dir);
+	backing_fd = open(backing_file_name, O_RDWR);
+	if (backing_fd < 0)
+		print_error("Can't open backing file");
+	else if (delete) {
+		/* Once backing file was opened, it's safe to delete it ;) */
+		remove(backing_file_name);
+	}
+	return backing_fd;
+}
+
+static int validate_test_file_content_with_seed(char *mount_dir,
+					 struct test_file *file,
+					 unsigned int shuffle_seed)
+{
+	int error = -1;
+	char *filename = concat_file_name(mount_dir, file->name);
+	off_t size = file->size;
+	loff_t actual_size = get_file_size(filename);
+	int block_cnt = 1 + (size - 1) / INCFS_DATA_FILE_BLOCK_SIZE;
+	int *block_indexes = NULL;
+	int i;
+
+	block_indexes = alloca(sizeof(int) * block_cnt);
+	for (i = 0; i < block_cnt; i++)
+		block_indexes[i] = i;
+
+	if (shuffle_seed != 0)
+		shuffle(block_indexes, block_cnt, shuffle_seed);
+
+	if (actual_size != size) {
+		ksft_print_msg("File size doesn't match. name: %s expected size:%ld actual size:%ld\n",
+		       filename, size, actual_size);
+		error = -1;
+		goto failure;
+	}
+
+	for (i = 0; i < block_cnt; i++) {
+		int block_idx = block_indexes[i];
+		uint8_t expected_block[INCFS_DATA_FILE_BLOCK_SIZE];
+		uint8_t actual_block[INCFS_DATA_FILE_BLOCK_SIZE];
+		int seed = get_file_block_seed(file->ino, block_idx);
+		size_t bytes_to_compare =
+			min((off_t)INCFS_DATA_FILE_BLOCK_SIZE,
+			size - ((off_t)block_idx) * INCFS_DATA_FILE_BLOCK_SIZE);
+		int read_result =
+			read_test_file(actual_block, INCFS_DATA_FILE_BLOCK_SIZE,
+				       filename, block_idx);
+		if (read_result < 0) {
+			ksft_print_msg("Error reading block %d from file %s. Error: %s\n",
+			       block_idx, filename, strerror(-read_result));
+			error = read_result;
+			goto failure;
+		}
+		rnd_buf(expected_block, INCFS_DATA_FILE_BLOCK_SIZE, seed);
+		if (memcmp(expected_block, actual_block, bytes_to_compare)) {
+			ksft_print_msg("File contents don't match. name: %s block:%d\n",
+			       file->name, block_idx);
+			error = -2;
+			goto failure;
+		}
+	}
+	free(filename);
+	return 0;
+
+failure:
+	free(filename);
+	return error;
+}
+
+static int validate_test_file_content(char *mount_dir, struct test_file *file)
+{
+	return validate_test_file_content_with_seed(mount_dir, file, 0);
+}
+
+static int dynamic_files_and_data_test(char *mount_dir)
+{
+	struct test_files_set test = get_test_files_set();
+	const int file_num = test.files_count;
+	const int missing_file_idx = 5;
+	int backing_fd = -1, cmd_fd = -1;
+	int i;
+
+	backing_fd = open_test_backing_file(mount_dir, true);
+	if (backing_fd < 0)
+		goto failure;
+
+	/* Mount FS and release the backing file. */
+	if (mount_fs(mount_dir, backing_fd, 50) != 0)
+		goto failure;
+	close(backing_fd);
+
+	cmd_fd = open_commands_file(mount_dir);
+	if (cmd_fd < 0)
+		goto failure;
+
+	/* Check that test files don't exist in the filesystem. */
+	for (i = 0; i < file_num; i++) {
+		struct test_file *file = &test.files[i];
+		char *filename = concat_file_name(mount_dir, file->name);
+
+		if (access(filename, F_OK) != -1) {
+			ksft_print_msg("File %s somehow already exists in a clean FS.\n",
+			       filename);
+			goto failure;
+		}
+		free(filename);
+	}
+
+	/* Write test data into the command file. */
+	for (i = 0; i < file_num; i++) {
+		struct test_file *file = &test.files[i];
+		int res;
+
+		res = emit_file(cmd_fd, file->name,
+			&file->ino, INCFS_ROOT_INODE, file->size);
+		if (res < 0) {
+			ksft_print_msg("Error %s emiting file %s.\n",
+					strerror(-res), file->name);
+			goto failure;
+		}
+
+		/* Skip writing data to one file so we can check */
+		/* that it's missing later. */
+		if (i == missing_file_idx)
+			continue;
+
+		res = emit_test_file_data(cmd_fd, file);
+		if (res) {
+			ksft_print_msg("Error %s emiting data for %s.\n",
+					strerror(-res), file->name);
+			goto failure;
+		}
+	}
+
+	/* Validate contents of the FS */
+	for (i = 0; i < file_num; i++) {
+		struct test_file *file = &test.files[i];
+
+		if (i == missing_file_idx) {
+			/* No data has been written to this file. */
+			/* Check for read error; */
+			uint8_t buf;
+			char *filename =
+				concat_file_name(mount_dir, file->name);
+			int res = read_test_file(&buf, 1, filename, 0);
+
+			free(filename);
+			if (res > 0) {
+				ksft_print_msg("Data present, even though never writtern.\n");
+				goto failure;
+			}
+			if (res != -ETIME) {
+				ksft_print_msg("Wrong error code: %d.\n", res);
+				goto failure;
+			}
+		} else {
+			if (validate_test_file_content(mount_dir, file) < 0)
+				goto failure;
+		}
+	}
+
+	close(cmd_fd);
+	cmd_fd = -1;
+	if (umount(mount_dir) != 0) {
+		print_error("Can't unmout FS");
+		goto failure;
+	}
+
+	return TEST_SUCCESS;
+
+failure:
+	close(cmd_fd);
+	umount(mount_dir);
+	return TEST_FAILURE;
+}
+
+static int errors_on_overwrite_test(char *mount_dir)
+{
+	struct test_files_set test = get_small_test_files_set();
+	const int file_num = test.files_count;
+	int backing_fd = -1, cmd_fd = -1;
+	int i, bidx;
+
+	backing_fd = open_test_backing_file(mount_dir, true);
+	if (backing_fd < 0)
+		goto failure;
+
+	/* Mount FS and release the backing file. */
+	if (mount_fs(mount_dir, backing_fd, 50) != 0)
+		goto failure;
+	close(backing_fd);
+
+	cmd_fd = open_commands_file(mount_dir);
+	if (cmd_fd < 0)
+		goto failure;
+
+	/* Write test data into the command file. */
+	for (i = 0; i < file_num; i++) {
+		struct test_file *file = &test.files[i];
+		int emit_res;
+
+		emit_res = emit_file(cmd_fd, file->name, &file->ino,
+				     INCFS_ROOT_INODE, file->size);
+		if (emit_res < 0)
+			goto failure;
+
+		emit_res = emit_test_file_data(cmd_fd, file);
+		if (emit_res)
+			goto failure;
+	}
+
+	/* Write again, this time all writes should fail. */
+	for (i = 0; i < file_num; i++) {
+		struct test_file *file = &test.files[i];
+		int emit_res;
+
+		emit_res = emit_file(cmd_fd, file->name, &file->ino,
+				     INCFS_ROOT_INODE, file->size);
+		if (emit_res != -EEXIST) {
+			ksft_print_msg("Repeated file %s wasn't reported.\n",
+			       file->name);
+			goto failure;
+		}
+
+		for (bidx = 0; bidx * INCFS_DATA_FILE_BLOCK_SIZE < file->size;
+		     bidx++) {
+			emit_res = emit_test_block(cmd_fd, file, bidx);
+
+			/* Repeated blocks are ignored without an error */
+			if (emit_res < 0) {
+				ksft_print_msg("Repeated block was reported. err:%s\n",
+				       strerror(-emit_res));
+				goto failure;
+			}
+		}
+	}
+
+	/* Validate contents of the FS */
+	for (i = 0; i < file_num; i++) {
+		struct test_file *file = &test.files[i];
+
+		if (validate_test_file_content(mount_dir, file) < 0)
+			goto failure;
+	}
+
+	close(cmd_fd);
+	cmd_fd = -1;
+	if (umount(mount_dir) != 0) {
+		print_error("Can't unmout FS");
+		goto failure;
+	}
+
+	return TEST_SUCCESS;
+
+failure:
+	close(cmd_fd);
+	umount(mount_dir);
+	return TEST_FAILURE;
+}
+
+static int work_after_remount_test(char *mount_dir)
+{
+	struct test_files_set test = get_test_files_set();
+	const int file_num = test.files_count;
+	const int file_num_stage1 = file_num / 2;
+	const int file_num_stage2 = file_num;
+	int i = 0;
+	int backing_fd = -1, cmd_fd = -1;
+
+	backing_fd = open_test_backing_file(mount_dir, false);
+	if (backing_fd < 0)
+		goto failure;
+
+	/* Mount FS and release the backing file. */
+	if (mount_fs(mount_dir, backing_fd, 50) != 0)
+		goto failure;
+	close(backing_fd);
+	backing_fd = -1;
+
+	cmd_fd = open_commands_file(mount_dir);
+	if (cmd_fd < 0)
+		goto failure;
+
+	/* Write first half of the data into the command file. (stage 1) */
+	for (i = 0; i < file_num_stage1; i++) {
+		struct test_file *file = &test.files[i];
+
+		emit_file(cmd_fd, file->name, &file->ino, INCFS_ROOT_INODE,
+			  file->size);
+		if (emit_test_file_data(cmd_fd, file))
+			goto failure;
+	}
+
+	/* Unmount and mount again, to see that data is persistent. */
+	close(cmd_fd);
+	cmd_fd = -1;
+	if (umount(mount_dir) != 0) {
+		print_error("Can't unmout FS");
+		goto failure;
+	}
+	backing_fd = open_existing_test_backing_file(mount_dir, false);
+	if (backing_fd < 0)
+		goto failure;
+	if (mount_fs(mount_dir, backing_fd, 50) != 0)
+		goto failure;
+	close(backing_fd);
+	backing_fd = -1;
+
+	cmd_fd = open_commands_file(mount_dir);
+	if (cmd_fd < 0)
+		goto failure;
+
+	/* Write the second half of the data into the command file. (stage 2) */
+	for (; i < file_num_stage2; i++) {
+		struct test_file *file = &test.files[i];
+
+		emit_file(cmd_fd, file->name, &file->ino, INCFS_ROOT_INODE,
+			  file->size);
+		if (emit_test_file_data(cmd_fd, file))
+			goto failure;
+	}
+
+	/* Validate contents of the FS */
+	for (i = 0; i < file_num_stage2; i++) {
+		struct test_file *file = &test.files[i];
+
+		if (validate_test_file_content(mount_dir, file) < 0)
+			goto failure;
+	}
+
+	/* Hide all files */
+	for (i = 0; i < file_num; i++) {
+		struct test_file *file = &test.files[i];
+		char *filename = concat_file_name(mount_dir, file->name);
+
+		if (access(filename, F_OK) != 0) {
+			ksft_print_msg("File %s is not visible.\n", filename);
+			goto failure;
+		}
+
+		unlink_node(cmd_fd, INCFS_ROOT_INODE, file->name);
+
+		if (access(filename, F_OK) != -1) {
+			ksft_print_msg("File %s is still visible.\n", filename);
+			goto failure;
+		}
+		free(filename);
+	}
+
+	/* Unmount and mount again, to see that unlinked files stay unlinked. */
+	close(cmd_fd);
+	cmd_fd = -1;
+	if (umount(mount_dir) != 0) {
+		print_error("Can't unmout FS");
+		goto failure;
+	}
+	backing_fd = open_existing_test_backing_file(mount_dir, true);
+	if (backing_fd < 0)
+		goto failure;
+	if (mount_fs(mount_dir, backing_fd, 50) != 0)
+		goto failure;
+	close(backing_fd);
+	backing_fd = -1;
+
+	cmd_fd = open_commands_file(mount_dir);
+	if (cmd_fd < 0)
+		goto failure;
+
+	/* Validate all hidden files are still hidden. */
+	for (i = 0; i < file_num; i++) {
+		struct test_file *file = &test.files[i];
+		char *filename = concat_file_name(mount_dir, file->name);
+
+		if (access(filename, F_OK) != -1) {
+			ksft_print_msg("File %s is still visible.\n", filename);
+			goto failure;
+		}
+		free(filename);
+	}
+
+	/* Final unmount */
+	close(cmd_fd);
+	cmd_fd = -1;
+	if (umount(mount_dir) != 0) {
+		print_error("Can't unmout FS");
+		goto failure;
+	}
+
+	return TEST_SUCCESS;
+
+failure:
+	close(cmd_fd);
+	close(backing_fd);
+	umount(mount_dir);
+	return TEST_FAILURE;
+}
+
+static int validate_dir(char *dir_path, struct dirent *entries, int count)
+{
+	DIR *dir;
+	struct dirent *dp;
+	int result = 0;
+	int matching_entries = 0;
+
+	dir = opendir(dir_path);
+	if (!dir) {
+		result = -errno;
+		goto out;
+	}
+
+	while ((dp = readdir(dir))) {
+		int i;
+
+		if (!strcmp(dp->d_name, ".") || !strcmp(dp->d_name, ".."))
+			continue;
+
+		for (i = 0; i < count; i++) {
+			struct dirent *entry = entries + i;
+
+			if ((dp->d_ino == entry->d_ino) &&
+			    (strcmp(dp->d_name, entry->d_name) == 0) &&
+			    (dp->d_type == entry->d_type)) {
+				matching_entries++;
+				break;
+			}
+		}
+	}
+	result = count - matching_entries;
+
+out:
+	if (dir)
+		closedir(dir);
+	return result;
+}
+
+/* Test for:
+ *  1. No more than one hardlink can be created for a dir.
+ *  2. Only an empty dir can be unlinked.
+ */
+static int dirs_corner_cases(char *mount_dir)
+{
+	int dir1_ino = 0;
+	int dir2_ino = 0;
+	int backing_fd = -1, cmd_fd = -1;
+	char dirname1[] = "dir1";
+	char *dir_path1 = concat_file_name(mount_dir, dirname1);
+	char dirname2[] = "dir2";
+	char *dir_path2 = concat_file_name(dir_path1, dirname2);
+	struct stat st = {};
+	int ret;
+	struct incfs_instruction inst = {};
+
+	backing_fd = open_test_backing_file(mount_dir, true);
+	if (backing_fd < 0)
+		goto failure;
+
+	/* Mount FS and release the backing file. */
+	if (mount_fs(mount_dir, backing_fd, 50) != 0)
+		goto failure;
+	close(backing_fd);
+	backing_fd = -1;
+
+	cmd_fd = open_commands_file(mount_dir);
+	if (cmd_fd < 0)
+		goto failure;
+
+	/* Create dir1 node. */
+	inst = (struct incfs_instruction) {
+			.type = INCFS_INSTRUCTION_NEW_FILE,
+			.file = {
+				.size = 0,
+				.mode = S_IFDIR | 0555,
+			}
+	};
+	ret = send_md_instruction(cmd_fd, &inst);
+	dir1_ino = inst.file.ino_out;
+	if (ret)
+		goto failure;
+
+	/* Create dir2 node. */
+	inst = (struct incfs_instruction) {
+			.type = INCFS_INSTRUCTION_NEW_FILE,
+			.file = {
+				.size = 0,
+				.mode = S_IFDIR | 0555,
+			}
+	};
+	ret = send_md_instruction(cmd_fd, &inst);
+	dir2_ino = inst.file.ino_out;
+	if (ret)
+		goto failure;
+
+	/* Try to put dir1 into itself. */
+	inst = (struct incfs_instruction){
+			.type = INCFS_INSTRUCTION_ADD_DIR_ENTRY,
+			.dir_entry = {
+				.dir_ino = dir1_ino,
+				.child_ino = dir1_ino,
+				.name = ptr_to_u64(dirname1),
+				.name_len = strlen(dirname1)
+			}
+	};
+	ret = send_md_instruction(cmd_fd, &inst);
+	if (ret != -EINVAL)
+		goto failure;
+
+	/* Try to put root into dir1. */
+	inst = (struct incfs_instruction){
+			.type = INCFS_INSTRUCTION_ADD_DIR_ENTRY,
+			.dir_entry = {
+				.dir_ino = dir1_ino,
+				.child_ino = INCFS_ROOT_INODE,
+				.name = ptr_to_u64(dirname1),
+				.name_len = strlen(dirname1)
+			}
+	};
+	ret = send_md_instruction(cmd_fd, &inst);
+	if (ret != -EINVAL)
+		goto failure;
+
+	/* Put dir1 into root. */
+	inst = (struct incfs_instruction){
+			.type = INCFS_INSTRUCTION_ADD_DIR_ENTRY,
+			.dir_entry = {
+				.dir_ino = INCFS_ROOT_INODE,
+				.child_ino = dir1_ino,
+				.name = ptr_to_u64(dirname1),
+				.name_len = strlen(dirname1)
+			}
+	};
+	ret = send_md_instruction(cmd_fd, &inst);
+	if (ret)
+		goto failure;
+
+	/* Check dir1 is visible. */
+	if (stat(dir_path1, &st) != 0 || st.st_ino != dir1_ino) {
+		print_error("stat failed for dir1");
+		goto failure;
+	}
+
+	/* Put dir2 into dir1. */
+	inst = (struct incfs_instruction){
+			.type = INCFS_INSTRUCTION_ADD_DIR_ENTRY,
+			.dir_entry = {
+				.dir_ino = dir1_ino,
+				.child_ino = dir2_ino,
+				.name = ptr_to_u64(dirname2),
+				.name_len = strlen(dirname2)
+			}
+	};
+	ret = send_md_instruction(cmd_fd, &inst);
+	if (ret)
+		goto failure;
+
+	/* Check dir2 is visible. */
+	if (stat(dir_path2, &st) != 0 || st.st_ino != dir2_ino) {
+		print_error("stat failed for dir2");
+		goto failure;
+	}
+
+	/* Try to create a loop. Put dir2 into dir1. */
+	inst = (struct incfs_instruction){
+			.type = INCFS_INSTRUCTION_ADD_DIR_ENTRY,
+			.dir_entry = {
+				.dir_ino = dir2_ino,
+				.child_ino = dir1_ino,
+				.name = ptr_to_u64(dirname1),
+				.name_len = strlen(dirname1)
+			}
+	};
+	ret = send_md_instruction(cmd_fd, &inst);
+	if (ret != -EMLINK) {
+		ksft_print_msg("Loop creation test filed. %s\n",
+					strerror(-ret));
+		goto failure;
+	}
+
+	/* Try to unlink dir1 without removing dir2 first. */
+	ret = unlink_node(cmd_fd, INCFS_ROOT_INODE, dirname1);
+	if (ret != -ENOTEMPTY) {
+		ksft_print_msg("Unlinked non empty dir: %s\n", strerror(-ret));
+		goto failure;
+	}
+
+	ret = unlink_node(cmd_fd, dir1_ino, dirname2);
+	if (ret)
+		goto failure;
+
+	ret = unlink_node(cmd_fd, INCFS_ROOT_INODE, dirname1);
+	if (ret)
+		goto failure;
+
+	close(cmd_fd);
+	cmd_fd = -1;
+	if (umount(mount_dir) != 0) {
+		print_error("Can't unmout FS");
+		goto failure;
+	}
+	free(dir_path1);
+	free(dir_path2);
+
+	return TEST_SUCCESS;
+
+failure:
+	close(cmd_fd);
+	close(backing_fd);
+	umount(mount_dir);
+	return TEST_FAILURE;
+}
+
+static int directory_structure_test(char *mount_dir)
+{
+	int dir_ino = 0;
+	int file_ino = 0;
+	int backing_fd = -1, cmd_fd = -1;
+	int mismatch_count = 0;
+	char dirname[] = "dir";
+	char filename[] = "file";
+	char *dir_path = concat_file_name(mount_dir, dirname);
+	char *file_path = concat_file_name(dir_path, filename);
+	struct stat st;
+	int ret;
+
+	backing_fd = open_test_backing_file(mount_dir, true);
+	if (backing_fd < 0)
+		goto failure;
+
+	/* Mount FS and release the backing file. */
+	if (mount_fs(mount_dir, backing_fd, 50) != 0)
+		goto failure;
+	close(backing_fd);
+	backing_fd = -1;
+
+	cmd_fd = open_commands_file(mount_dir);
+	if (cmd_fd < 0)
+		goto failure;
+
+	/* Write test data into the command file. */
+	ret = emit_dir(cmd_fd, dirname, &dir_ino, INCFS_ROOT_INODE);
+	if (ret < 0) {
+		ksft_print_msg("Error creating a dir: %s\n", strerror(-ret));
+		goto failure;
+	}
+	ret = emit_file(cmd_fd, filename, &file_ino, dir_ino, 0);
+	if (ret < 0) {
+		ksft_print_msg("Error creating a file: %s\n", strerror(-ret));
+		goto failure;
+	}
+
+	/* Validate directory structure */
+	{
+		struct dirent dir_dentry = { .d_ino = dir_ino,
+					     .d_type = DT_DIR,
+					     .d_name = "dir" };
+		struct dirent cmd_dentry = { .d_ino = INCFS_COMMAND_INODE,
+					     .d_type = DT_REG,
+					     .d_name = ".cmd" };
+		struct dirent file_dentry = { .d_ino = file_ino,
+					      .d_type = DT_REG,
+					      .d_name = "file" };
+		struct dirent root_entries[] = { cmd_dentry, dir_dentry };
+		struct dirent dir_entries[] = { file_dentry };
+
+		mismatch_count = validate_dir(mount_dir, root_entries,
+					      ARRAY_SIZE(root_entries));
+		if (mismatch_count) {
+			ksft_print_msg("Root validatoin failed. Mismatch %d",
+			       mismatch_count);
+			goto failure;
+		}
+
+		mismatch_count = validate_dir(dir_path, dir_entries,
+					      ARRAY_SIZE(dir_entries));
+		if (mismatch_count) {
+			ksft_print_msg("Subdir validatoin failed. Mismatch %d",
+			       mismatch_count);
+			goto failure;
+		}
+	}
+
+	/* Validate file inode */
+	if (stat(file_path, &st) != 0) {
+		print_error("stat failed");
+		goto failure;
+	}
+
+	if (st.st_ino != file_ino) {
+		ksft_print_msg("Unexpected file inode.");
+		goto failure;
+	}
+
+	if (st.st_size != 0) {
+		ksft_print_msg("Unexpected file size.");
+		goto failure;
+	}
+
+	ret = unlink_node(cmd_fd, dir_ino, filename);
+	if (ret < 0) {
+		ksft_print_msg("Error unlinking a file: %s\n", strerror(-ret));
+		goto failure;
+	}
+
+	/* Validate directory structure */
+	{
+		struct dirent dir_entries[0] = {};
+
+		mismatch_count = validate_dir(dir_path, dir_entries, 0);
+		if (mismatch_count) {
+			ksft_print_msg("Second subdir validatoin failed. Mismatch %d",
+			       mismatch_count);
+			goto failure;
+		}
+
+		if (access(file_path, F_OK) != -1) {
+			ksft_print_msg("Unlinked file is still visible");
+			goto failure;
+		}
+	}
+
+	close(cmd_fd);
+	cmd_fd = -1;
+	if (umount(mount_dir) != 0) {
+		print_error("Can't unmout FS");
+		goto failure;
+	}
+	free(file_path);
+	free(dir_path);
+
+	return TEST_SUCCESS;
+
+failure:
+	close(cmd_fd);
+	close(backing_fd);
+	umount(mount_dir);
+	return TEST_FAILURE;
+}
+
+static int data_producer(int fd, struct test_files_set *test_set)
+{
+	int ret = 0;
+	int timeout_ms = 1000;
+	struct incfs_pending_read_info prs[100] = {};
+	int prs_size = ARRAY_SIZE(prs);
+
+	while ((ret = wait_for_pending_reads(fd, timeout_ms,
+						prs, prs_size)) > 0) {
+		struct file_and_block blocks[ARRAY_SIZE(prs)] = {};
+		int read_count = ret;
+		int i;
+
+		for (i = 0; i < read_count; i++) {
+			int j = 0;
+
+			for (j = 0; j < test_set->files_count; j++) {
+				if (test_set->files[j].ino == prs[i].file_ino)
+					blocks[i].file = &test_set->files[j];
+			}
+			blocks[i].block_index = prs[i].block_index;
+		}
+
+		ret = emit_test_blocks(fd, blocks, read_count);
+		if (ret < 0) {
+			ksft_print_msg("Emitting test data error: %s\n",
+				strerror(-ret));
+			return ret;
+		}
+	}
+	return ret;
+}
+
+
+static int multiple_providers_test(char *mount_dir)
+{
+	struct test_files_set test = get_test_files_set();
+	const int file_num = test.files_count;
+	const int producer_count = 5;
+	int backing_fd = -1, cmd_fd = -1;
+	int status;
+	int i;
+	pid_t *producer_pids = alloca(producer_count * sizeof(pid_t));
+
+	backing_fd = open_test_backing_file(mount_dir, true);
+	if (backing_fd < 0)
+		goto failure;
+
+	/* Mount FS and release the backing file. */
+	if (mount_fs(mount_dir, backing_fd, 10000) != 0)
+		goto failure;
+	close(backing_fd);
+
+	cmd_fd = open_commands_file(mount_dir);
+	if (cmd_fd < 0)
+		goto failure;
+
+	/* Tell FS about the files, without actually providing the data. */
+	for (i = 0; i < file_num; i++) {
+		struct test_file *file = &test.files[i];
+
+		if (emit_file(cmd_fd, file->name, &file->ino, INCFS_ROOT_INODE,
+			      file->size) < 0)
+			goto failure;
+	}
+
+	/* Start producer processes */
+	for (i = 0; i < producer_count; i++) {
+		pid_t producer_pid = flush_and_fork();
+
+		if (producer_pid == 0) {
+			int ret;
+			/*
+			 * This is a child that should provide data to
+			 * pending reads.
+			 */
+
+			ret = data_producer(cmd_fd, &test);
+			exit(-ret);
+		} else if (producer_pid > 0) {
+			producer_pids[i] = producer_pid;
+		} else {
+			print_error("Fork error");
+			goto failure;
+		}
+	}
+
+	/* Validate FS content */
+	for (i = 0; i < file_num; i++) {
+		struct test_file *file = &test.files[i];
+		char *filename = concat_file_name(mount_dir, file->name);
+		loff_t read_result = read_whole_file(filename);
+
+		free(filename);
+		if (read_result != file->size) {
+			ksft_print_msg("Error validating file %s. Result: %ld\n",
+				file->name, read_result);
+			goto failure;
+		}
+	}
+
+	/* Check that all producers has finished with 0 exit status */
+	for (i = 0; i < producer_count; i++) {
+		status = wait_for_process(producer_pids[i]);
+		if (status != 0) {
+			ksft_print_msg("Producer %d failed with code (%s)\n",
+			       i, strerror(status));
+			goto failure;
+		}
+	}
+
+	close(cmd_fd);
+	cmd_fd = -1;
+	if (umount(mount_dir) != 0) {
+		print_error("Can't unmout FS");
+		goto failure;
+	}
+
+	return TEST_SUCCESS;
+
+failure:
+	close(cmd_fd);
+	umount(mount_dir);
+	return TEST_FAILURE;
+}
+
+static int concurrent_reads_and_writes_test(char *mount_dir)
+{
+	struct test_files_set test = get_test_files_set();
+	const int file_num = test.files_count;
+	/* Validate each file from that many child processes. */
+	const int child_multiplier = 3;
+	int backing_fd = -1, cmd_fd = -1;
+	int status;
+	int i;
+	pid_t producer_pid;
+	pid_t *child_pids = alloca(child_multiplier * file_num * sizeof(pid_t));
+
+	backing_fd = open_test_backing_file(mount_dir, true);
+	if (backing_fd < 0)
+		goto failure;
+
+	/* Mount FS and release the backing file. */
+	if (mount_fs(mount_dir, backing_fd, 10000) != 0)
+		goto failure;
+	close(backing_fd);
+
+	cmd_fd = open_commands_file(mount_dir);
+	if (cmd_fd < 0)
+		goto failure;
+
+	/* Tell FS about the files, without actually providing the data. */
+	for (i = 0; i < file_num; i++) {
+		struct test_file *file = &test.files[i];
+
+		if (emit_file(cmd_fd, file->name, &file->ino, INCFS_ROOT_INODE,
+			      file->size) < 0)
+			goto failure;
+	}
+
+	/* Start child processes acessing data in the files */
+	for (i = 0; i < file_num * child_multiplier; i++) {
+		struct test_file *file = &test.files[i / child_multiplier];
+		pid_t child_pid = flush_and_fork();
+
+		if (child_pid == 0) {
+			/* This is a child process, do the data validation. */
+			int ret = validate_test_file_content_with_seed(
+				mount_dir, file, i);
+			if (ret >= 0) {
+				/* Zero exit status if data is valid. */
+				exit(0);
+			}
+
+			/* Positive status if validation error found. */
+			exit(-ret);
+		} else if (child_pid > 0) {
+			child_pids[i] = child_pid;
+		} else {
+			print_error("Fork error");
+			goto failure;
+		}
+	}
+
+	producer_pid = flush_and_fork();
+	if (producer_pid == 0) {
+		int ret;
+		/*
+		 * This is a child that should provide data to
+		 * pending reads.
+		 */
+
+		ret = data_producer(cmd_fd, &test);
+		exit(-ret);
+	} else {
+		status = wait_for_process(producer_pid);
+		if (status != 0) {
+			ksft_print_msg("Data produces failed. %d(%s) ", status,
+			       strerror(status));
+			goto failure;
+		}
+	}
+
+	/* Check that all children has finished with 0 exit status */
+	for (i = 0; i < file_num * child_multiplier; i++) {
+		struct test_file *file = &test.files[i / child_multiplier];
+
+		status = wait_for_process(child_pids[i]);
+		if (status != 0) {
+			ksft_print_msg("Validation for the file %s failed with code %d (%s)\n",
+			       file->name, status, strerror(status));
+			goto failure;
+		}
+	}
+
+	/* Check that there are no pending reads left */
+	{
+		struct incfs_pending_read_info prs[1] = {};
+		int timeout = 0;
+		int read_count = wait_for_pending_reads(cmd_fd, timeout, prs,
+							ARRAY_SIZE(prs));
+
+		if (read_count) {
+			ksft_print_msg("Pending reads pending when all data written\n");
+			goto failure;
+		}
+	}
+
+	close(cmd_fd);
+	cmd_fd = -1;
+	if (umount(mount_dir) != 0) {
+		print_error("Can't unmout FS");
+		goto failure;
+	}
+
+	return TEST_SUCCESS;
+
+failure:
+	close(cmd_fd);
+	umount(mount_dir);
+	return TEST_FAILURE;
+}
+
+static int child_procs_waiting_for_data_test(char *mount_dir)
+{
+	struct test_files_set test = get_test_files_set();
+	const int file_num = test.files_count;
+	int backing_fd = -1, cmd_fd = -1;
+	int i;
+	pid_t *child_pids = alloca(file_num * sizeof(pid_t));
+
+	backing_fd = open_test_backing_file(mount_dir, true);
+	if (backing_fd < 0)
+		goto failure;
+
+	/* Mount FS and release the backing file. (10s wait time) */
+	if (mount_fs(mount_dir, backing_fd, 10000) != 0)
+		goto failure;
+	close(backing_fd);
+
+	cmd_fd = open_commands_file(mount_dir);
+	if (cmd_fd < 0)
+		goto failure;
+
+	/* Tell FS about the files, without actually providing the data. */
+	for (i = 0; i < file_num; i++) {
+		struct test_file *file = &test.files[i];
+
+		emit_file(cmd_fd, file->name, &file->ino, INCFS_ROOT_INODE,
+			  file->size);
+	}
+
+	/* Start child processes acessing data in the files */
+	for (i = 0; i < file_num; i++) {
+		struct test_file *file = &test.files[i];
+		pid_t child_pid = flush_and_fork();
+
+		if (child_pid == 0) {
+			/* This is a child process, do the data validation. */
+			int ret = validate_test_file_content(mount_dir, file);
+
+			if (ret >= 0) {
+				/* Zero exit status if data is valid. */
+				exit(0);
+			}
+
+			/* Positive status if validation error found. */
+			exit(-ret);
+		} else if (child_pid > 0) {
+			child_pids[i] = child_pid;
+		} else {
+			print_error("Fork error");
+			goto failure;
+		}
+	}
+
+	/* Write test data into the command file. */
+	for (i = 0; i < file_num; i++) {
+		struct test_file *file = &test.files[i];
+
+		if (emit_test_file_data(cmd_fd, file))
+			goto failure;
+	}
+
+	/* Check that all children has finished with 0 exit status */
+	for (i = 0; i < file_num; i++) {
+		struct test_file *file = &test.files[i];
+		int status = wait_for_process(child_pids[i]);
+
+		if (status != 0) {
+			ksft_print_msg("Validation for the file %s failed with code %d (%s)\n",
+			       file->name, status, strerror(status));
+			goto failure;
+		}
+	}
+
+	close(cmd_fd);
+	cmd_fd = -1;
+	if (umount(mount_dir) != 0) {
+		print_error("Can't unmout FS");
+		goto failure;
+	}
+
+	return TEST_SUCCESS;
+
+failure:
+	close(cmd_fd);
+	umount(mount_dir);
+	return TEST_FAILURE;
+}
+
+static int file_count_limit(char *mount_dir)
+{
+	int file_ino = 0;
+	int i;
+	int backing_fd = -1, cmd_fd = -1;
+	char filename[100];
+	char file_path[100];
+	int ret;
+
+	backing_fd = open_test_backing_file(mount_dir, true);
+	if (backing_fd < 0)
+		goto failure;
+
+	if (mount_fs(mount_dir, backing_fd, 50) != 0)
+		goto failure;
+	close(backing_fd);
+	backing_fd = -1;
+
+	cmd_fd = open_commands_file(mount_dir);
+	if (cmd_fd < 0)
+		goto failure;
+
+	/*
+	 * Create INCFS_MAX_FILES - 1 files as see that everything works.
+	 * One inode is already taken by the root dir.
+	 */
+	for (i = 0; i < INCFS_MAX_FILES - 1; i++) {
+		struct stat st;
+
+		sprintf(filename, "file_%d", i);
+		sprintf(file_path, "%s/%s", mount_dir, filename);
+		ret = emit_file(cmd_fd, filename, &file_ino,
+				INCFS_ROOT_INODE, 0);
+		if (ret < 0) {
+			ksft_print_msg("Error creating a file: %s (%s)\n",
+				filename, strerror(-ret));
+			goto failure;
+		}
+
+		if (stat(file_path, &st) != 0) {
+			print_error("stat failed");
+			goto failure;
+		}
+	}
+
+	ret = emit_file(cmd_fd, "over_limit_file", &file_ino,
+			INCFS_ROOT_INODE, 0);
+	if (ret != -ENFILE) {
+		ksft_print_msg("Too many files were allowed to be cerated.\n");
+		goto failure;
+	}
+
+	close(cmd_fd);
+	cmd_fd = -1;
+	if (umount(mount_dir) != 0) {
+		print_error("Can't unmout FS");
+		goto failure;
+	}
+	return TEST_SUCCESS;
+
+failure:
+	close(cmd_fd);
+	close(backing_fd);
+	umount(mount_dir);
+	return TEST_FAILURE;
+}
+
+static char *setup_mount_dir()
+{
+	struct stat st;
+	char *current_dir = get_current_dir_name();
+	char *mount_dir = concat_file_name(current_dir,
+						"incfs_test_mount_dir");
+
+	free(current_dir);
+	if (stat(mount_dir, &st) == 0) {
+		if (S_ISDIR(st.st_mode))
+			return mount_dir;
+
+		ksft_print_msg("%s is a file, not a dir.\n", mount_dir);
+		return NULL;
+	}
+
+	if (mkdir(mount_dir, 0777)) {
+		print_error("Can't create mount dir.");
+		return NULL;
+	}
+
+	return mount_dir;
+}
+
+int main(int argc, char *argv[])
+{
+	char *mount_dir = NULL;
+	int fails = 0;
+
+	ksft_print_header();
+
+	if (geteuid() != 0)
+		ksft_print_msg("Not a root, might fail to mount.\n");
+
+	mount_dir = setup_mount_dir();
+	if (mount_dir == NULL)
+		ksft_exit_fail_msg("Can't create a mount dir\n");
+
+#define RUN_TEST(test)                                                         \
+	do {                                                                   \
+		ksft_print_msg("Running " #test "\n");                         \
+		if (test(mount_dir) == TEST_SUCCESS)                           \
+			ksft_test_result_pass(#test "\n");                     \
+		else {                                                         \
+			ksft_test_result_fail(#test "\n");                     \
+			fails++;                                               \
+		}                                                              \
+	} while (0)
+
+	RUN_TEST(directory_structure_test);
+	RUN_TEST(dirs_corner_cases);
+	RUN_TEST(file_count_limit);
+	RUN_TEST(work_after_remount_test);
+	RUN_TEST(child_procs_waiting_for_data_test);
+	RUN_TEST(errors_on_overwrite_test);
+	RUN_TEST(concurrent_reads_and_writes_test);
+	RUN_TEST(multiple_providers_test);
+	RUN_TEST(dynamic_files_and_data_test);
+
+#undef RUN_TEST
+	umount2(mount_dir, MNT_FORCE);
+	rmdir(mount_dir);
+
+	if (fails > 0)
+		ksft_exit_pass();
+	else
+		ksft_exit_pass();
+	return 0;
+}
diff --git a/tools/testing/selftests/filesystems/incfs/utils.c b/tools/testing/selftests/filesystems/incfs/utils.c
new file mode 100644
index 000000000000..1f83b97225de
--- /dev/null
+++ b/tools/testing/selftests/filesystems/incfs/utils.c
@@ -0,0 +1,159 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2018 Google LLC
+ */
+#include <stdio.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <sys/mount.h>
+#include <errno.h>
+#include <string.h>
+#include <poll.h>
+
+#include "utils.h"
+
+int mount_fs(char *mount_dir, int backing_fd, int read_timeout_ms)
+{
+	static const char fs_name[] = INCFS_NAME;
+	char mount_options[512];
+	int result;
+
+	snprintf(mount_options, ARRAY_SIZE(mount_options),
+		 "backing_fd=%u,read_timeout_ms=%u",
+		 backing_fd, read_timeout_ms);
+
+	result = mount(fs_name, mount_dir, fs_name, 0, mount_options);
+	if (result != 0)
+		perror("Error mounting fs.");
+	return result;
+}
+
+int unlink_node(int fd, int parent_ino, char *filename)
+{
+	struct incfs_instruction inst = {
+			.type = INCFS_INSTRUCTION_REMOVE_DIR_ENTRY,
+			.dir_entry = {
+				.dir_ino = parent_ino,
+				.name = ptr_to_u64(filename),
+				.name_len = strlen(filename)
+			}
+	};
+
+	return send_md_instruction(fd, &inst);
+}
+
+int emit_node(int fd, char *filename, int *ino_out, int parent_ino,
+		size_t size, mode_t mode)
+{
+	int ret = 0;
+	__u64 ino = 0;
+	struct incfs_instruction inst = {
+			.type = INCFS_INSTRUCTION_NEW_FILE,
+			.file = {
+				.size = size,
+				.mode = mode,
+			}
+	};
+
+	ret = send_md_instruction(fd, &inst);
+	if (ret)
+		return ret;
+
+	ino = inst.file.ino_out;
+	inst = (struct incfs_instruction){
+			.type = INCFS_INSTRUCTION_ADD_DIR_ENTRY,
+			.dir_entry = {
+				.dir_ino = parent_ino,
+				.child_ino = ino,
+				.name = ptr_to_u64(filename),
+				.name_len = strlen(filename)
+			}
+		};
+	ret = send_md_instruction(fd, &inst);
+	if (ret)
+		return ret;
+	*ino_out = ino;
+	return 0;
+}
+
+
+int emit_dir(int fd, char *filename, int *ino_out, int parent_ino)
+{
+	return emit_node(fd, filename, ino_out, parent_ino, 0, S_IFDIR | 0555);
+}
+
+int emit_file(int fd, char *filename, int *ino_out, int parent_ino, size_t size)
+{
+	return emit_node(fd, filename, ino_out, parent_ino, size,
+				S_IFREG | 0555);
+}
+
+int send_md_instruction(int cmd_fd, struct incfs_instruction *inst)
+{
+	inst->version = INCFS_HEADER_VER;
+	if (ioctl(cmd_fd, INCFS_IOC_PROCESS_INSTRUCTION, inst) == 0)
+		return 0;
+	return -errno;
+}
+
+loff_t get_file_size(char *name)
+{
+	struct stat st;
+
+	if (stat(name, &st) == 0)
+		return st.st_size;
+	return -ENOENT;
+}
+
+int open_commands_file(char *mount_dir)
+{
+	char cmd_file[255];
+	int cmd_fd;
+
+	snprintf(cmd_file, ARRAY_SIZE(cmd_file), "%s/.cmd", mount_dir);
+	cmd_fd = open(cmd_file, O_RDWR);
+	if (cmd_fd < 0)
+		perror("Can't open commands file");
+	return cmd_fd;
+}
+
+int wait_for_pending_reads(int fd, int timeout_ms,
+	struct incfs_pending_read_info *prs, int prs_count)
+{
+	ssize_t read_res = 0;
+
+	if (timeout_ms > 0) {
+		int poll_res = 0;
+		struct pollfd pollfd = {
+			.fd = fd,
+			.events = POLLIN
+		};
+
+		poll_res = poll(&pollfd, 1, timeout_ms);
+		if (poll_res < 0)
+			return -errno;
+		if (poll_res == 0)
+			return 0;
+		if (!(pollfd.revents | POLLIN))
+			return 0;
+	}
+
+	read_res = read(fd, prs, prs_count * sizeof(*prs));
+	if (read_res < 0)
+		return -errno;
+
+	return read_res / sizeof(*prs);
+}
+
+char *concat_file_name(char *dir, char *file)
+{
+	char full_name[FILENAME_MAX] = "";
+
+	if (snprintf(full_name, ARRAY_SIZE(full_name), "%s/%s", dir, file) < 0)
+		return NULL;
+	return strdup(full_name);
+}
diff --git a/tools/testing/selftests/filesystems/incfs/utils.h b/tools/testing/selftests/filesystems/incfs/utils.h
new file mode 100644
index 000000000000..c3423fe01857
--- /dev/null
+++ b/tools/testing/selftests/filesystems/incfs/utils.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright 2019 Google LLC
+ */
+#include <stdbool.h>
+#include <sys/stat.h>
+
+#include "../../include/uapi/linux/incrementalfs.h"
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof(arr[0]))
+
+#ifdef __LP64__
+#define ptr_to_u64(p) ((__u64)p)
+#else
+#define ptr_to_u64(p) ((__u64)(__u32)p)
+#endif
+
+int mount_fs(char *mount_dir, int backing_fd, int read_timeout_ms);
+
+int send_md_instruction(int cmd_fd, struct incfs_instruction *inst);
+
+int emit_node(int fd, char *filename, int *ino_out, int parent_ino,
+		size_t size, mode_t mode);
+
+int emit_dir(int fd, char *filename, int *ino_out, int parent_ino);
+
+int emit_file(int fd, char *filename, int *ino_out, int parent_ino,
+		size_t size);
+
+int unlink_node(int fd, int parent_ino, char *filename);
+
+loff_t get_file_size(char *name);
+
+int open_commands_file(char *mount_dir);
+
+int wait_for_pending_reads(int fd, int timeout_ms,
+	struct incfs_pending_read_info *prs, int prs_count);
+
+char *concat_file_name(char *dir, char *file);
--
2.21.0.593.g511ec345e18-goog


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Initial patches for Incremental FS
  2019-05-02  4:03 Initial patches for Incremental FS ezemtsov
                   ` (4 preceding siblings ...)
  2019-05-02  4:03 ` [PATCH 6/6] incfs: Integration tests for incremental-fs ezemtsov
@ 2019-05-02 11:19 ` Amir Goldstein
  2019-05-02 13:10   ` Theodore Ts'o
  2019-05-02 18:16   ` Richard Weinberger
  2019-05-02 13:47 ` J. R. Okajima
  6 siblings, 2 replies; 33+ messages in thread
From: Amir Goldstein @ 2019-05-02 11:19 UTC (permalink / raw)
  To: ezemtsov; +Cc: linux-fsdevel, Theodore Tso, Miklos Szeredi

On Thu, May 2, 2019 at 12:04 AM <ezemtsov@google.com> wrote:
>
> Hi All,
>
> Please take a look at Incremental FS.
>
> Incremental FS is special-purpose Linux virtual file system that allows
> execution of a program while its binary and resource files are still being
> lazily downloaded over the network, USB etc. It is focused on incremental
> delivery for a small number (under 100) of big files (more than 10 megabytes each).
> Incremental FS doesn’t allow direct writes into files and, once loaded, file
> content never changes. Incremental FS doesn’t use a block device, instead it
> saves data into a backing file located on a regular file-system.
>
> What’s it for?
>
> It allows running big Android apps before their binaries and resources are
> fully loaded to an Android device. If an app reads something not loaded yet,
> it needs to wait for the data block to be fetched, but in most cases hot blocks
> can be loaded in advance and apps can run smoothly and almost instantly.

This sounds very useful.

Why does it have to be a new special-purpose Linux virtual file?
Why not FUSE, which is meant for this purpose?
Those are things that you should explain when you are proposing a new
filesystem,
but I will answer for you - because FUSE page fault will incur high
latency also after
blocks are locally available in your backend store. Right?

How about fscache support for FUSE then?
You can even write your own fscache backend if the existing ones don't
fit your needs for some reason.

Do you know of the project https://vfsforgit.org/?
Not exactly the same use case but very similar.
There is ongoing work on a Linux port developed by GitHub.com:
https://github.com/github/libprojfs

Piling logic into the kernel is not the answer.
Adding the missing interfaces to the kernel is the answer.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Initial patches for Incremental FS
  2019-05-02 11:19 ` Initial patches for Incremental FS Amir Goldstein
@ 2019-05-02 13:10   ` Theodore Ts'o
  2019-05-02 13:26     ` Al Viro
  2019-05-02 13:46     ` Amir Goldstein
  2019-05-02 18:16   ` Richard Weinberger
  1 sibling, 2 replies; 33+ messages in thread
From: Theodore Ts'o @ 2019-05-02 13:10 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: ezemtsov, linux-fsdevel, Miklos Szeredi

On Thu, May 02, 2019 at 07:19:52AM -0400, Amir Goldstein wrote:
> 
> This sounds very useful.
> 
> Why does it have to be a new special-purpose Linux virtual file?
> Why not FUSE, which is meant for this purpose?
> Those are things that you should explain when you are proposing a new
> filesystem,
> but I will answer for you - because FUSE page fault will incur high
> latency also after
> blocks are locally available in your backend store. Right?

From the documentation file in the first patch:

+Why isn't incremental-fs implemented via FUSE?
+----------------------------------------------
+TLDR: FUSE-based filesystems add 20-80% of performance overhead for target
+scenarios, and increase power use on mobile beyond acceptable limit
+for widespread deployment. A custom kernel filesystem is the way to overcome
+these limitations.
+

There are several paragraphs of more detail which I leave for the
interested reader to review....

						- Ted

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Initial patches for Incremental FS
  2019-05-02 13:10   ` Theodore Ts'o
@ 2019-05-02 13:26     ` Al Viro
  2019-05-03  4:23       ` Eugene Zemtsov
  2019-05-02 13:46     ` Amir Goldstein
  1 sibling, 1 reply; 33+ messages in thread
From: Al Viro @ 2019-05-02 13:26 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Amir Goldstein, ezemtsov, linux-fsdevel, Miklos Szeredi

On Thu, May 02, 2019 at 09:10:34AM -0400, Theodore Ts'o wrote:

> +Why isn't incremental-fs implemented via FUSE?
> +----------------------------------------------
> +TLDR: FUSE-based filesystems add 20-80% of performance overhead for target
> +scenarios, and increase power use on mobile beyond acceptable limit
> +for widespread deployment. A custom kernel filesystem is the way to overcome
> +these limitations.
> +
> 
> There are several paragraphs of more detail which I leave for the
> interested reader to review....

Why not CODA, though, with local fs as cache?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Initial patches for Incremental FS
  2019-05-02 13:10   ` Theodore Ts'o
  2019-05-02 13:26     ` Al Viro
@ 2019-05-02 13:46     ` Amir Goldstein
  1 sibling, 0 replies; 33+ messages in thread
From: Amir Goldstein @ 2019-05-02 13:46 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: ezemtsov, linux-fsdevel, Miklos Szeredi

On Thu, May 2, 2019 at 9:10 AM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Thu, May 02, 2019 at 07:19:52AM -0400, Amir Goldstein wrote:
> >
> > This sounds very useful.
> >
> > Why does it have to be a new special-purpose Linux virtual file?
> > Why not FUSE, which is meant for this purpose?
> > Those are things that you should explain when you are proposing a new
> > filesystem,
> > but I will answer for you - because FUSE page fault will incur high
> > latency also after
> > blocks are locally available in your backend store. Right?
>
> From the documentation file in the first patch:
>
> +Why isn't incremental-fs implemented via FUSE?
> +----------------------------------------------
> +TLDR: FUSE-based filesystems add 20-80% of performance overhead for target
> +scenarios, and increase power use on mobile beyond acceptable limit
> +for widespread deployment. A custom kernel filesystem is the way to overcome
> +these limitations.
> +
>

Fair enough. I didn't think FUSE could be an alternative as-is.
I am familiar with USENIX paper.
The question is if we need to re-intent the wheel or try to improve the wheel.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Initial patches for Incremental FS
  2019-05-02  4:03 Initial patches for Incremental FS ezemtsov
                   ` (5 preceding siblings ...)
  2019-05-02 11:19 ` Initial patches for Incremental FS Amir Goldstein
@ 2019-05-02 13:47 ` J. R. Okajima
  6 siblings, 0 replies; 33+ messages in thread
From: J. R. Okajima @ 2019-05-02 13:47 UTC (permalink / raw)
  To: ezemtsov; +Cc: linux-fsdevel, tytso

ezemtsov@google.com:
> Incremental FS is special-purpose Linux virtual file system that allows
> execution of a program while its binary and resource files are still being
> lazily downloaded over the network, USB etc. It is focused on incremental
	:::

I had developed a very similar userspace daemon many years ago which is
called ULOOP.  As you can guess it is based upon the loopback block
device in stead of the filesystem.

(from the readme file)
----------------------------------------
1. sample for HTTP
Simple 'make' will build ./drivers/block/uloop.ko and ./ulohttp.
Ulohttp application behaves like losetup(8). Additionally, ulohttp is
an actual daemon which handles I/O request.
Here is a syntax.

ulohttp [-b bitmap] [-c cache] device URL

The device is /dev/loopN and the URL is a URL for fs-image file via
HTTP. The http server must support byte range (Range: header).
The bitmap is a new filename or previously specified as the bitmap for
the same URL. Its filesize will be 'the size of the specified fs-image
/ pagesize (usually 4k) / bits in a byte (8)', and round-up to
pagesize.
The cache is a new filename or previously specified as the cache for
the same URL. Its filesize will be 'the size of the specified
fs-image', and round-up to pagesize.
Note that both the bitmap and the cache are re-usable as long as you
don't change the filedata and URL.

When someone reads from the specified /dev/loopN, or accesses a file
on a filesystem after mounting /dev/loopN, ULOOP driver first checks
the corresponding bit in the bitmap file. When the bit is not set,
which means the block is not retrieved yet, it passes the offset and
size of the I/O request to ulohttp daemon.
Ulohttp converts the offset and the size into HTTP GET request with
Range header and send it to the http server.
Retriving the data from the http server, ulohttp stores it to the
cache file, and tells ULOOP driver that the HTTP transfer completes.
Then the ULOOP driver sets the corresponding bit in the bitmap, and
finishes the I/O/request.

In other words, it is equivalent to this operation.
$ wget URL_for_fsimage
$ sudo mount -o loop retrieved_fsimage /mnt
But ULOOP driver and ulohttp retrieves only the data (block) on-demand,
and stores into the cache file. The first access to a block is slow
since it involves HTTP GET, but the next access to the same block is
fast since it is in the local cache file. In this case, the behaviour
is equivalent to the simple /dev/loop device.

----------------------------------------

If you are interested, then try
https://sourceforge.net/p/aufs/aufs-util/ci/aufs4.14/tree/sample/uloop/

It is just for your information.


J. R. Okajima

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Initial patches for Incremental FS
  2019-05-02 11:19 ` Initial patches for Incremental FS Amir Goldstein
  2019-05-02 13:10   ` Theodore Ts'o
@ 2019-05-02 18:16   ` Richard Weinberger
  2019-05-02 18:33     ` Richard Weinberger
  1 sibling, 1 reply; 33+ messages in thread
From: Richard Weinberger @ 2019-05-02 18:16 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: ezemtsov, linux-fsdevel, Theodore Tso, Miklos Szeredi

On Thu, May 2, 2019 at 1:21 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Thu, May 2, 2019 at 12:04 AM <ezemtsov@google.com> wrote:
> >
> > Hi All,
> >
> > Please take a look at Incremental FS.
> >
> > Incremental FS is special-purpose Linux virtual file system that allows
> > execution of a program while its binary and resource files are still being
> > lazily downloaded over the network, USB etc. It is focused on incremental
> > delivery for a small number (under 100) of big files (more than 10 megabytes each).
> > Incremental FS doesn’t allow direct writes into files and, once loaded, file
> > content never changes. Incremental FS doesn’t use a block device, instead it
> > saves data into a backing file located on a regular file-system.
> >
> > What’s it for?
> >
> > It allows running big Android apps before their binaries and resources are
> > fully loaded to an Android device. If an app reads something not loaded yet,
> > it needs to wait for the data block to be fetched, but in most cases hot blocks
> > can be loaded in advance and apps can run smoothly and almost instantly.
>
> This sounds very useful.
>
> Why does it have to be a new special-purpose Linux virtual file?
> Why not FUSE, which is meant for this purpose?
> Those are things that you should explain when you are proposing a new
> filesystem,
> but I will answer for you - because FUSE page fault will incur high
> latency also after
> blocks are locally available in your backend store. Right?
>
> How about fscache support for FUSE then?
> You can even write your own fscache backend if the existing ones don't
> fit your needs for some reason.
>
> Do you know of the project https://vfsforgit.org/?
> Not exactly the same use case but very similar.
> There is ongoing work on a Linux port developed by GitHub.com:
> https://github.com/github/libprojfs
>
> Piling logic into the kernel is not the answer.
> Adding the missing interfaces to the kernel is the answer.

I wonder whether userfaultfd can but used for that use-case too?

-- 
Thanks,
//richard

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Initial patches for Incremental FS
  2019-05-02 18:16   ` Richard Weinberger
@ 2019-05-02 18:33     ` Richard Weinberger
  0 siblings, 0 replies; 33+ messages in thread
From: Richard Weinberger @ 2019-05-02 18:33 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: ezemtsov, linux-fsdevel, Theodore Tso, Miklos Szeredi

On Thu, May 2, 2019 at 8:16 PM Richard Weinberger
<richard.weinberger@gmail.com> wrote:
> > Piling logic into the kernel is not the answer.
> > Adding the missing interfaces to the kernel is the answer.
>
> I wonder whether userfaultfd can but used for that use-case too?

...hit the send button too eary.

My thought is, userfaultfd is used to support live migration of VMs such that
pages from the remote side are loaded on demand.
Sounds a little like the android app use-case, hm?

The loader (ld-linux) runs the app and using userfaultfd missing pages
get downloaded on demand.

-- 
Thanks,
//richard

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/6] incfs: Add first files of incrementalfs
  2019-05-02  4:03 ` [PATCH 1/6] incfs: Add first files of incrementalfs ezemtsov
@ 2019-05-02 19:06   ` Miklos Szeredi
  2019-05-02 20:41   ` Randy Dunlap
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 33+ messages in thread
From: Miklos Szeredi @ 2019-05-02 19:06 UTC (permalink / raw)
  To: ezemtsov; +Cc: linux-fsdevel, Theodore Ts'o

On Thu, May 2, 2019 at 12:03 AM <ezemtsov@google.com> wrote:

> +Design alternatives
> +===================
> +
> +Why isn't incremental-fs implemented via FUSE?
> +----------------------------------------------
> +TLDR: FUSE-based filesystems add 20-80% of performance overhead for target
> +scenarios, and increase power use on mobile beyond acceptable limit
> +for widespread deployment. A custom kernel filesystem is the way to overcome
> +these limitations.

he 80% performance overhead sounds bad.   As fuse maintainer I'd
really be interested in finding out the causes.

> +
> +From the theoretical side of things, FUSE filesystem adds some overhead to
> +each filesystem operation that’s not handled by OS page cache:
> +
> +    * When an IO request arrives to FUSE driver (D), it puts it into a queue
> +      that runs on a separate kernel thread

 The queue is run on a *user* thread, there's no intermediate kernel
thread involved.

> +    * Then another separate user-mode handler process (H) has to run,
> +      potentially after a context switch, to read the request from the queue.

Yes.   How is it different from the data loader doing read(2) on .cmd?

> +      Reading the request adds a kernel-user mode transition to the handling.
> +    * (H) sends the IO request to kernel to handle it on some underlying storage
> +      filesystem. This adds a user-kernel and kernel-user mode transition
> +      pair to the handling.
> +    * (H) then responds to the FUSE request via a write(2) call.
> +      Writing the response is another user-kernel mode transition.
> +    * (D) needs to read the response from (H) when its kernel thread runs
> +      and forward it to the user

Again, you've just described exactly the same thing for data loader
and .cmd.  Why is the fuse case different?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/6] incfs: Add first files of incrementalfs
  2019-05-02  4:03 ` [PATCH 1/6] incfs: Add first files of incrementalfs ezemtsov
  2019-05-02 19:06   ` Miklos Szeredi
@ 2019-05-02 20:41   ` Randy Dunlap
  2019-05-07 15:57   ` Jann Horn
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 33+ messages in thread
From: Randy Dunlap @ 2019-05-02 20:41 UTC (permalink / raw)
  To: ezemtsov, linux-fsdevel; +Cc: tytso

On 5/1/19 9:03 PM, ezemtsov@google.com wrote:
> From: Eugene Zemtsov <ezemtsov@google.com>
> 
> - fs/incfs dir
> - Kconfig (CONFIG_INCREMENTAL_FS)
> - Makefile
> - Module and file system initialization and clean up code
> - New MAINTAINERS entry
> - Add incrementalfs.h UAPI header
> - Register ioctl range in ioctl-numbers.txt
> - Documentation
> 
> Signed-off-by: Eugene Zemtsov <ezemtsov@google.com>

Hi,
This is just Documentation comments...

> ---
>  Documentation/filesystems/incrementalfs.rst | 452 ++++++++++++++++++++
>  Documentation/ioctl/ioctl-number.txt        |   1 +
>  MAINTAINERS                                 |   7 +
>  fs/Kconfig                                  |   1 +
>  fs/Makefile                                 |   1 +
>  fs/incfs/Kconfig                            |  10 +
>  fs/incfs/Makefile                           |   4 +
>  fs/incfs/main.c                             |  85 ++++
>  fs/incfs/vfs.c                              |  37 ++
>  include/uapi/linux/incrementalfs.h          | 189 ++++++++
>  10 files changed, 787 insertions(+)
>  create mode 100644 Documentation/filesystems/incrementalfs.rst
>  create mode 100644 fs/incfs/Kconfig
>  create mode 100644 fs/incfs/Makefile
>  create mode 100644 fs/incfs/main.c
>  create mode 100644 fs/incfs/vfs.c
>  create mode 100644 include/uapi/linux/incrementalfs.h
> 
> diff --git a/Documentation/filesystems/incrementalfs.rst b/Documentation/filesystems/incrementalfs.rst
> new file mode 100644
> index 000000000000..682e3dcb6b5a
> --- /dev/null
> +++ b/Documentation/filesystems/incrementalfs.rst
> @@ -0,0 +1,452 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=======================
> +Incremental File System
> +=======================
> +
> +Overview
> +========
> +Incremental FS is special-purpose Linux virtual file system that allows
> +execution of a program while its binary and resource files are still being
> +lazily downloaded over the network, USB etc. It is focused on incremental
> +delivery for a small number (under 100) of big files (more than 10 megabytes).
> +Incremental FS doesn’t allow direct writes into files and, once loaded, file
> +content never changes. Incremental FS doesn’t use a block device, instead it
> +saves data into a backing file located on a regular file-system.
> +
> +But why?
> +--------
> +To allow running **big** Android apps before their binaries and resources are
> +fully downloaded to an Android device. If an app reads something not loaded yet,
> +it needs to wait for the data block to be fetched, but in most cases hot blocks
> +can be loaded in advance.
> +
> +Workflow
> +--------
> +A userspace process, called a data loader, mounts an instance of incremental-fs
> +giving it a file descriptor on an underlying file system (like ext4 or f2fs).
> +Incremental-fs reads content (if any) of this backing file and interprets it as
> +a file system image with files, directories and data blocks. At this point
> +the data loader can declare new files to be shown by incremental-fs.
> +
> +A process is started from a binary located on incremental-fs.
> +All reads are served directly from the backing file
> +without roundtrips into userspace. If the process accesses a data block that was
> +not originally present in the backing file, the read operation waits.
> +
> +Meanwhile the data loader can feed new data blocks to incremental-fs by calling
> +write() on a special .cmd pseudo-file. The data loader can request information
> +about pending reads by calling poll() and read() on the .cmd pseudo-file.
> +This mechanism allows the data loader to serve most urgently needed data first.
> +Once a data block is given to incremental-fs, it saves it to the backing file
> +and unblocks all the reads waiting for this block.
> +
> +Eventually all data for all files is uploaded by the data loader, and saved by
> +incremental-fs into the backing file. At that moment the data loader is not
> +needed any longer. The backing file will play the role of a complete
> +filesystem image for all future runs of the program.
> +
> +Non-goals
> +---------
> +* Allowing direct writes by the executing processes into files on incremental-fs
> +* Allowing the data loader change file size or content after it was loaded.
> +* Having more than a couple hundred files and directories.
> +
> +
> +Features
> +========
> +
> +Read-only, but not unchanging
> +-----------------------------
> +On the surface a mount directory of incremental-fs would look similar to
> +a read-only instance of network file system: files and directories can be
> +listed and read, but can’t be directly created or modified via creat() or
> +write(). At the same time the data loader can make changes to a directory
> +structure via external ioctl-s. i.e. link and unlink files and directories
> +(if they empty). Data can't be changed this way, once a file block is loaded

                                               way; once

> +there is no way to change it.
> +
> +Filesystem image in a backing file
> +----------------------------------
> +Instead of using a block device, all data and metadata is stored in a

                                                          are stored

> +backing file provided as a mount parameter. The backing file is located on
> +an underlying file system (like ext4 or f2fs). Such approach is very similar
> +to what might be achieved by using loopback device with a traditional file
> +system, but it avoids extra set-up steps and indirections. It also allows
> +incremental-fs image to dynamically grow as new files and data come without
> +having to do any extra steps for resizing.
> +
> +If the backing file contains data at the moment when incremental-fs is mounted,
> +content of the backing file is being interpreted as filesystem image.

                          file is interpreted as

> +New files and data can still be added through the external interface,
> +and they will be saved to the backing file.
> +
> +Data compression
> +----------------
> +Incremental-fs can store compressed data. In this case each 4KB data block is
> +compressed separately. Data blocks can be provided to incremental-fs by
> +the data loader in a compressed form. Incremental-fs uncompresses blocks
> +each time a executing process reads it (modulo page cache). Compression also

             an executing process

> +takes care of blocks composed of all zero bytes removing necessity to handle
> +this case separately.
> +
> +Partially present files
> +-----------------------
> +Data in the files consists of 4KB blocks, each block can be present or absent.

                                     blocks; each block

> +Unlike in sparse files, reading an absent block doesn’t return all zeros.
> +It waits for the data block to be loaded via the ioctl interface
> +(respecting a timeout). Once a data block is loaded it never disappears
> +and can’t be changed or erased from a file. This ability to frictionlessly
> +wait for temporary missing data is the main feature of incremental-fs.
> +
> +Hard links. Multiple names for the same file
> +--------------------------------------------
> +Like all traditional UNIX file systems, incremental-fs supports hard links,
> +i.e. different file names in different directories can refer to the same file.
> +As mentioned above new hard links can be created and removed via
> +the ioctl interface, but actual data files are immutable, modulo partial
> +data loading. Each directory can only have at most one name referencing it.
> +
> +Inspection of incremental-fs internal state
> +-------------------------------------------
> +poll() and read() on the .cmd pseudo-file allow data loaders to get a list of
> +read operations stalled due to lack of a data block (pending reads).
> +
> +
> +Application Programming Interface
> +=================================
> +
> +Regular file system interface
> +-----------------------------
> +Executing process access files and directories via regular Linux file interface:

                     accesses

> +open, read, close etc. All the intricacies of data loading a file representation
> +are hidden from them.
> +
> +External .cmd file interface
> +----------------------------
> +When incremental-fs is mounted, a mount directory contains a pseudo-file
> +called '.cmd'. The data loader will open this file and call read(), write(),
> +poll() and ioctl() on it inspect and change state of incremental-fs.
> +
> +poll() and read() are used by the data loader to wait for pending reads to
> +appear and obtain an array of ``struct incfs_pending_read_info``.
> +
> +write() is used by the data loader to feed new data blocks to incremental-fs.
> +A data buffer given to write() is interpreted as an array of
> +``struct incfs_new_data_block``. Structs in the array describe locations and
> +properties of data blocks loaded with this write() call.
> +
> +``ioctl(INCFS_IOC_PROCESS_INSTRUCTION)`` is used to change structure of
> +incremental-fs. It receives an pointer to ``struct incfs_instruction``

                               a pointer

> +where type field can have be one of the following values.
> +
> +**INCFS_INSTRUCTION_NEW_FILE**
> +Creates an inode (a file or a directory) without a name.
> +It assumes ``incfs_new_file_instruction.file`` is populated with details.
> +
> +**INCFS_INSTRUCTION_ADD_DIR_ENTRY**
> +Creates a name (aka hardlink) for an inode in a directory.
> +A directory can't have more than one hardlink pointing to it, but files can be
> +linked from different directories.
> +It assumes ``incfs_new_file_instruction.dir_entry`` is populated with details.
> +
> +**INCFS_INSTRUCTION_REMOVE_DIR_ENTRY**
> +Remove a name (aka hardlink) for a file from a directory.
> +Only empty directories can be unlinked.
> +It assumes ``incfs_new_file_instruction.dir_entry`` is populated with details.
> +
> +For more details see in uapi/linux/incrementalfs.h and samples below.
> +
> +Supported mount options
> +-----------------------
> +See ``fs/incfs/options.c`` for more details.
> +
> +    * ``backing_fd=<unsigned int>``
> +        Required. A file descriptor of a backing file opened by the process
> +        calling mount(2). This descriptor can be closed after mount returns.
> +
> +    * ``read_timeout_msc=<unsigned int>``
> +        Default: 1000. Timeout in milliseconds before a read operation fails
> +        if no data found in the backing file or provided by the data loader.
> +
> +Sysfs files
> +-----------
> +``/sys/fs/incremental-fs/version`` - a current version of the filesystem.
> +One ASCII encoded positive integer number with a new line at the end.
> +
> +
> +Examples
> +--------
> +See ``sample_data_loader.c`` for a complete implementation of a data loader.
> +
> +Mount incremental-fs
> +~~~~~~~~~~~~~~~~~~~~
> +
> +::
> +
> +    int mount_fs(char *mount_dir, char *backing_file, int timeout_msc)
> +    {
> +        static const char fs_name[] = INCFS_NAME;
> +        char mount_options[512];
> +        int backing_fd;
> +        int result;
> +
> +        backing_fd = open(backing_file, O_RDWR);
> +        if (backing_fd == -1) {
> +            perror("Error in opening backing file");
> +            return 1;
> +        }
> +
> +        snprintf(mount_options, ARRAY_SIZE(mount_options),
> +            "backing_fd=%u,read_timeout_msc=%u", backing_fd, timeout_msc);
> +
> +        result = mount(fs_name, mount_dir, fs_name, 0, mount_options);
> +        if (result != 0)
> +            perror("Error mounting fs.");
> +        return result;
> +    }
> +
> +Open .cmd file
> +~~~~~~~~~~~~~~
> +
> +::
> +
> +    int open_commands_file(char *mount_dir)
> +    {
> +        char cmd_file[255];
> +        int cmd_fd;
> +
> +        snprintf(cmd_file, ARRAY_SIZE(cmd_file), "%s/.cmd", mount_dir);
> +        cmd_fd = open(cmd_file, O_RDWR);
> +        if (cmd_fd < 0)
> +            perror("Can't open commands file");
> +        return cmd_fd;
> +    }
> +
> +Add a file to the file system
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +::
> +
> +    int create_file(int cmd_fd, char *filename, int *ino_out, size_t size)
> +    {
> +        int ret = 0;
> +        __u16 ino = 0;
> +        struct incfs_instruction inst = {
> +                .version = INCFS_HEADER_VER,
> +                .type = INCFS_INSTRUCTION_NEW_FILE,
> +                .file = {
> +                    .size = size,
> +                    .mode = S_IFREG | 0555,
> +                }
> +        };
> +
> +        ret = ioctl(cmd_fd, INCFS_IOC_PROCESS_INSTRUCTION, &inst);
> +        if (ret)
> +            return -errno;
> +
> +        ino = inst.file.ino_out;
> +        inst = (struct incfs_instruction){
> +                .version = INCFS_HEADER_VER,
> +                .type = INCFS_INSTRUCTION_ADD_DIR_ENTRY,
> +                .dir_entry = {
> +                    .dir_ino = INCFS_ROOT_INODE,
> +                    .child_ino = ino,
> +                    .name = ptr_to_u64(filename),
> +                    .name_len = strlen(filename)
> +                }
> +            };
> +        ret = ioctl(cmd_fd, INCFS_IOC_PROCESS_INSTRUCTION, &inst);
> +        if (ret)
> +            return -errno;
> +        *ino_out = ino;
> +        return 0;
> +    }
> +
> +Load data into a file
> +~~~~~~~~~~~~~~~~~~~~~
> +
> +::
> +
> +    int cmd_fd = open_commands_file(path_to_mount_dir);
> +    char *data = get_some_data();
> +    struct incfs_new_data_block block;
> +    int err;
> +
> +    block.file_ino = file_ino;
> +    block.block_index = 0;
> +    block.compression = COMPRESSION_NONE;
> +    block.data = (__u64)data;
> +    block.data_len = INCFS_DATA_FILE_BLOCK_SIZE;
> +
> +    err = write(cmd_fd, &block, sizeof(block));
> +
> +
> +Get an array of pending reads
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +::
> +
> +    int poll_res = 0;
> +    struct incfs_pending_read_info reads[10];
> +    int cmd_fd = open_commands_file(path_to_mount_dir);
> +    struct pollfd pollfd = {
> +        .fd = cmd_fd,
> +        .events = POLLIN
> +    };
> +
> +    poll_res = poll(&pollfd, 1, timeout);
> +    if (poll_res > 0 && (pollfd.revents | POLLIN)) {
> +        ssize_t read_res = read(cmd_fd, reads, sizeof(reads));
> +        if (read_res > 0)
> +            printf("Waiting reads %ld\n", read_res / sizeof(reads[0]));
> +    }
> +
> +
> +
> +Ondisk format
> +=============
> +
> +General principles
> +------------------
> +* The backbone of the incremental-fs ondisk format is an append only linked
> +  list of metadata blocks. Each metadata block contains an offset of the next
> +  one. These blocks describe files and directories on the
> +  file system. They also represent actions of adding and removing file names
> +  (hard links).
> +  Every time incremental-fs instance is mounted, it reads through this list
> +  to recreate filesystem's state in memory. An offset of the first record in the
> +  metadata list is stored in the superblock at the beginning of the backing
> +  file.
> +
> +* Most of the backing file is taken by data areas and blockmaps.
> +  Since data blocks can be compressed and have different sizes,
> +  single per-file data area can't be pre-allocated. That's why blockmaps are
> +  needed in order to find a location and size of each data block in
> +  the backing file. Each time a file is created, a corresponding block map is
> +  allocated to store future offsets of data blocks.
> +
> +  Whenever a data block is given by data loader to incremental-fs:
> +    - A data area with the given block is appended to the end of
> +      the backing file.
> +    - A record in the blockmap for the given block index is updated to reflect
> +      its location, size, and compression algorithm.
> +
> +Important format details
> +------------------------
> +Ondisk structures are defined in the ``format.h`` file. They are all packed
> +and use little-endian order.
> +A backing file must start with ``incfs_super_block`` with ``s_magic`` field
> +equal to 0x5346434e49 "INCFS".
> +
> +Metadata records:
> +
> +* ``incfs_inode`` - metadata record to declare a file or a directory.
> +                    ``incfs_inode.i_mode`` determents if it is a file

                                              determines

> +                    or a directory.
> +* ``incfs_blockmap_entry`` - metadata record that specifies size and location
> +                            of a blockmap area for a given file. This area
> +                            contains an array of ``incfs_blockmap_entry``-s.
> +* ``incfs_dir_action`` - metadata record that specifies changes made to a
> +                    to a directory structure, e.g. add or remove a hardlink.
> +* ``incfs_md_header`` - header of a metadata record. It's always a part
> +                    of other structures and served purpose of metadata

?                                              serves

> +                    bookkeeping.
> +
> +Other ondisk structures:
> +
> +* ``incfs_super_block`` - backing file header
> +* ``incfs_blockmap_entry`` - a record in a blockmap area that describes size
> +                        and location of a data block.
> +* Data blocks dont have any particular structure, they are written to the backing

                 don't

> +  file in a raw form as they come from a data loader.
> +
> +
> +Backing file layout
> +-------------------
> +::
> +
> +              +-------------------------------------------+
> +              |            incfs_super_block              |]---+
> +              +-------------------------------------------+    |
> +              |                 metadata                  |<---+
> +              |                incfs_inode                |]---+
> +              +-------------------------------------------+    |
> +                        .........................              |
> +              +-------------------------------------------+    |   metadata
> +     +------->|               blockmap area               |    |  list links
> +     |        |          [incfs_blockmap_entry]           |    |
> +     |        |          [incfs_blockmap_entry]           |    |
> +     |        |          [incfs_blockmap_entry]           |    |
> +     |    +--[|          [incfs_blockmap_entry]           |    |
> +     |    |   |          [incfs_blockmap_entry]           |    |
> +     |    |   |          [incfs_blockmap_entry]           |    |
> +     |    |   +-------------------------------------------+    |
> +     |    |             .........................              |
> +     |    |   +-------------------------------------------+    |
> +     |    |   |                 metadata                  |<---+
> +     +----|--[|               incfs_blockmap              |]---+
> +          |   +-------------------------------------------+    |
> +          |             .........................              |
> +          |   +-------------------------------------------+    |
> +          +-->|                 data block                |    |
> +              +-------------------------------------------+    |
> +                        .........................              |
> +              +-------------------------------------------+    |
> +              |                 metadata                  |<---+
> +              |             incfs_dir_action              |
> +              +-------------------------------------------+
> +
> +Unreferenced files and absence of garbage collection
> +----------------------------------------------------
> +Described file format can produce files that don't have any names for them in
> +any directories. Incremental-fs takes no steps to prevent such situations or
> +reclaim space occupied by such files in the backing file. If garbage collection
> +is needed it has to be implemented as a separate userspace tool.
> +
> +
> +Design alternatives
> +===================
> +
> +Why isn't incremental-fs implemented via FUSE?
> +----------------------------------------------
> +TLDR: FUSE-based filesystems add 20-80% of performance overhead for target
> +scenarios, and increase power use on mobile beyond acceptable limit
> +for widespread deployment. A custom kernel filesystem is the way to overcome
> +these limitations.
> +
> +From the theoretical side of things, FUSE filesystem adds some overhead to
> +each filesystem operation that’s not handled by OS page cache:
> +
> +    * When an IO request arrives to FUSE driver (D), it puts it into a queue
> +      that runs on a separate kernel thread
> +    * Then another separate user-mode handler process (H) has to run,
> +      potentially after a context switch, to read the request from the queue.
> +      Reading the request adds a kernel-user mode transition to the handling.
> +    * (H) sends the IO request to kernel to handle it on some underlying storage
> +      filesystem. This adds a user-kernel and kernel-user mode transition
> +      pair to the handling.
> +    * (H) then responds to the FUSE request via a write(2) call.
> +      Writing the response is another user-kernel mode transition.
> +    * (D) needs to read the response from (H) when its kernel thread runs
> +      and forward it to the user
> +
> +Together, the scenario adds 2 extra user-kernel-user mode transition pairs,
> +and potentially has up to 3 additional context switches for the FUSE kernel
> +thread and the user-mode handler to start running for each IO request on the
> +filesystem.
> +This overhead can vary from unnoticeable to unmanageable, depending on the
> +target scenario. But it will always burn extra power via CPU staying longer
> +in non-idle state, handling context switches and mode transitions.
> +One important goal for the new filesystem is to be able to handle each page
> +read separately on demand, because we don't want to wait and download more data
> +than absolutely necessary. Thus readahead would need to be disabled completely.
> +This increases the number of separate IO requests and the FUSE related overhead
> +by almost 32x (128KB readahead limit vs 4KB individual block operations)
> +
> +For more info see a 2017 USENIX research paper:
> +To FUSE or Not to FUSE: Performance of User-Space File Systems
> +Bharath Kumar Reddy Vangoor, Stony Brook University;
> +Vasily Tarasov, IBM Research-Almaden;
> +Erez Zadok, Stony Brook University
> +https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf


-- 
~Randy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Initial patches for Incremental FS
  2019-05-02 13:26     ` Al Viro
@ 2019-05-03  4:23       ` Eugene Zemtsov
  2019-05-03  5:19         ` Amir Goldstein
                           ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Eugene Zemtsov @ 2019-05-03  4:23 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Al Viro, tytso, Amir Goldstein, miklos, richard.weinberger

On Thu, May 2, 2019 at 6:26 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Why not CODA, though, with local fs as cache?

On Thu, May 2, 2019 at 4:20 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> This sounds very useful.
>
> Why does it have to be a new special-purpose Linux virtual file?
> Why not FUSE, which is meant for this purpose?
> Those are things that you should explain when you are proposing a new
> filesystem,
> but I will answer for you - because FUSE page fault will incur high
> latency also after
> blocks are locally available in your backend store. Right?
>
> How about fscache support for FUSE then?
> You can even write your own fscache backend if the existing ones don't
> fit your needs for some reason.
>
> Piling logic into the kernel is not the answer.
> Adding the missing interfaces to the kernel is the answer.
>

Thanks for the interest and feedback. What I dreaded most was silence.

Probably I should have given a bit more details in the introductory email.
Important features we’re aiming for:

1. An attempt to read a missing data block gives a userspace data loader a
chance to fetch it. Once a block is loaded (in advance or after a page fault)
it is saved into a local backing storage and following reads of the same block
are done directly by the kernel. [Implemented]

2. Block level compression. It saves space on a device, while still allowing
very granular loading and mapping. Less granular compression would trigger
loading of more data than absolutely necessary, and that’s the thing we
want to avoid. [Implemented]

3. Block level integrity verification. The signature scheme is similar to
DMverity or fs-verity. In other words, each file has a Merkle tree with
crypto-digests of 4KB blocks. The root digest is signed with RSASSA or ECDSA.
Each time a data block is read digest is calculated and checked with the
Merkle tree, if the signature check fails the read operation fails as well.
Ideally I’d like to use fs-verity API for that. [Not implemented yet.]

4. New files can be pushed into incremental-fs “externally” when an app needs
a new resource or a binary. This is needed for situations when a new resource
or a new version of code is available, e.g. a user just changed the system
language to Spanish, or a developer rolled out an app update.
Things change over time and this means that we can’t just incrementally
load a precooked ext4 image and mount it via a loopback device.   [Implemented]

5. No need to support writes or file resizing. It eliminates a lot of
complexity.

Currently not all of these features are implemented yet, but they all will be
needed to achieve our goals:
 - Apps can be delivered incrementally without having to wait for extra data.
   At the same time given enough time the app can be downloaded fully without
   having to keep a connection open after that.
- App’s integrity should be verifiable without having to read all its blocks.
- Local storage and battery need to be conserved.
- Apps binaries and resources can change over time.
   Such changes are triggered by external events.

I’d like to comment on proposed alternative solutions:

FUSE
We have a FUSE based prototype and though functional it turned out to be battery
hungry and read performance leaving much to be desired.
Our measurements were roughly corresponding to results in the article
I link in PATCH 1 incrementalfs.rst

In this thread Amir Goldstein absolutely correctly pointed out that FUSE’s
constant overhead keeps hurting app’s performance even when all blocks are
available locally. But not only that, FUSE needs to be involved with each
readdir() and stat() call. And to our surprise we learned that many apps do
directory traversals and stat()-s much more often that it seems reasonable.

Moreover, Android has a bit of a recent history with FUSE. A big chunk of
Android directory tree (“external storage”) use to be mounted via FUSE.
It didn’t turn out to be a great approach and it was eventually replaced by
a kernel module.

I reckon the amount of changes that we’d need to introduce to FUSE in order
to make it support things mentioned above will be, to put it mildly,
very substantial. And having to be as generic as FUSE (i.e. support writes etc)
will make the task much more complicated than it is now.

Coda
Indeed it is somewhat similar to what we need. But according to Coda’s
documentation it fetches a whole file first time it is accessed,
which is opposite of what we need. It is not really obvious that adding all
the things above to Coda would be simpler than creating a separate driver.
Especially if Coda needs to keep supporting all of its existing features.

userfaultfd
As far as I can see this would only work for mmap-ed files.
All read() and readdir() calls would never return right results.




-- 
Thanks,
Eugene Zemtsov.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Initial patches for Incremental FS
  2019-05-03  4:23       ` Eugene Zemtsov
@ 2019-05-03  5:19         ` Amir Goldstein
  2019-05-08 20:09           ` Eugene Zemtsov
  2019-05-03  7:23         ` Richard Weinberger
  2019-05-03 10:22         ` Miklos Szeredi
  2 siblings, 1 reply; 33+ messages in thread
From: Amir Goldstein @ 2019-05-03  5:19 UTC (permalink / raw)
  To: Eugene Zemtsov
  Cc: linux-fsdevel, Al Viro, Theodore Tso, Miklos Szeredi, Richard Weinberger

On Fri, May 3, 2019 at 12:23 AM Eugene Zemtsov <ezemtsov@google.com> wrote:
>
> On Thu, May 2, 2019 at 6:26 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Why not CODA, though, with local fs as cache?
>
> On Thu, May 2, 2019 at 4:20 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > This sounds very useful.
> >
> > Why does it have to be a new special-purpose Linux virtual file?
> > Why not FUSE, which is meant for this purpose?
> > Those are things that you should explain when you are proposing a new
> > filesystem,
> > but I will answer for you - because FUSE page fault will incur high
> > latency also after
> > blocks are locally available in your backend store. Right?
> >
> > How about fscache support for FUSE then?
> > You can even write your own fscache backend if the existing ones don't
> > fit your needs for some reason.
> >
> > Piling logic into the kernel is not the answer.
> > Adding the missing interfaces to the kernel is the answer.
> >
>
> Thanks for the interest and feedback. What I dreaded most was silence.
>
> Probably I should have given a bit more details in the introductory email.
> Important features we’re aiming for:
>
> 1. An attempt to read a missing data block gives a userspace data loader a
> chance to fetch it. Once a block is loaded (in advance or after a page fault)
> it is saved into a local backing storage and following reads of the same block
> are done directly by the kernel. [Implemented]
>
> 2. Block level compression. It saves space on a device, while still allowing
> very granular loading and mapping. Less granular compression would trigger
> loading of more data than absolutely necessary, and that’s the thing we
> want to avoid. [Implemented]
>
> 3. Block level integrity verification. The signature scheme is similar to
> DMverity or fs-verity. In other words, each file has a Merkle tree with
> crypto-digests of 4KB blocks. The root digest is signed with RSASSA or ECDSA.
> Each time a data block is read digest is calculated and checked with the
> Merkle tree, if the signature check fails the read operation fails as well.
> Ideally I’d like to use fs-verity API for that. [Not implemented yet.]
>
> 4. New files can be pushed into incremental-fs “externally” when an app needs
> a new resource or a binary. This is needed for situations when a new resource
> or a new version of code is available, e.g. a user just changed the system
> language to Spanish, or a developer rolled out an app update.
> Things change over time and this means that we can’t just incrementally
> load a precooked ext4 image and mount it via a loopback device.   [Implemented]
>
> 5. No need to support writes or file resizing. It eliminates a lot of
> complexity.
>
> Currently not all of these features are implemented yet, but they all will be
> needed to achieve our goals:
>  - Apps can be delivered incrementally without having to wait for extra data.
>    At the same time given enough time the app can be downloaded fully without
>    having to keep a connection open after that.
> - App’s integrity should be verifiable without having to read all its blocks.
> - Local storage and battery need to be conserved.
> - Apps binaries and resources can change over time.
>    Such changes are triggered by external events.
>

This really sounds to me like the properties of a network filesystem
with local cache. It seems that you did a thorough research, but
I am not sure that you examined the fscache option properly.
Remember, if an existing module does not meet your needs,
it does not mean that creating a new module is the right answer.
It may be that extending an existing module is something that
everyone, including yourself will benefit from.

> I’d like to comment on proposed alternative solutions:
>
> FUSE
> We have a FUSE based prototype and though functional it turned out to be battery
> hungry and read performance leaving much to be desired.
> Our measurements were roughly corresponding to results in the article
> I link in PATCH 1 incrementalfs.rst
>
> In this thread Amir Goldstein absolutely correctly pointed out that FUSE’s
> constant overhead keeps hurting app’s performance even when all blocks are
> available locally. But not only that, FUSE needs to be involved with each
> readdir() and stat() call. And to our surprise we learned that many apps do
> directory traversals and stat()-s much more often that it seems reasonable.
>

That is a real problem. Alas readdir cache, recently added probably solves
your problem since your directory changes are infrequent.
stat cache also exists, but will be used depending on policy of mount options.
I am sure you can come up with caching policy that will meet your needs
and AFAIK FUSE protocol supports invalidating cache entries from server
(i.e. on "external" changes).

> Moreover, Android has a bit of a recent history with FUSE. A big chunk of
> Android directory tree (“external storage”) use to be mounted via FUSE.
> It didn’t turn out to be a great approach and it was eventually replaced by
> a kernel module.
>

I am aware of that history.
I suspect the decision to write sdcardfs followed similar logic to the one
that has lead you to write incfs.

> I reckon the amount of changes that we’d need to introduce to FUSE in order
> to make it support things mentioned above will be, to put it mildly,
> very substantial. And having to be as generic as FUSE (i.e. support writes etc)
> will make the task much more complicated than it is now.
>

Maybe. We won't know until you explore this option. Will we?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Initial patches for Incremental FS
  2019-05-03  4:23       ` Eugene Zemtsov
  2019-05-03  5:19         ` Amir Goldstein
@ 2019-05-03  7:23         ` Richard Weinberger
  2019-05-03 10:22         ` Miklos Szeredi
  2 siblings, 0 replies; 33+ messages in thread
From: Richard Weinberger @ 2019-05-03  7:23 UTC (permalink / raw)
  To: Eugene Zemtsov
  Cc: linux-fsdevel, Al Viro, tytso, Amir Goldstein, Miklos Szeredi,
	Richard Weinberger

Eugene,

----- Ursprüngliche Mail -----
> userfaultfd
> As far as I can see this would only work for mmap-ed files.
> All read() and readdir() calls would never return right results.

Yep. For lazy loading a program that should be okay.
But now with more details on your use-case I agree with Amir,
a cached network filesystem makes more sense.

Thanks,
//richard

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Initial patches for Incremental FS
  2019-05-03  4:23       ` Eugene Zemtsov
  2019-05-03  5:19         ` Amir Goldstein
  2019-05-03  7:23         ` Richard Weinberger
@ 2019-05-03 10:22         ` Miklos Szeredi
  2 siblings, 0 replies; 33+ messages in thread
From: Miklos Szeredi @ 2019-05-03 10:22 UTC (permalink / raw)
  To: Eugene Zemtsov
  Cc: linux-fsdevel, Al Viro, Theodore Ts'o, Amir Goldstein,
	Richard Weinberger

On Fri, May 3, 2019 at 12:23 AM Eugene Zemtsov <ezemtsov@google.com> wrote:
>
> On Thu, May 2, 2019 at 6:26 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Why not CODA, though, with local fs as cache?
>
> On Thu, May 2, 2019 at 4:20 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > This sounds very useful.
> >
> > Why does it have to be a new special-purpose Linux virtual file?
> > Why not FUSE, which is meant for this purpose?
> > Those are things that you should explain when you are proposing a new
> > filesystem,
> > but I will answer for you - because FUSE page fault will incur high
> > latency also after
> > blocks are locally available in your backend store. Right?
> >
> > How about fscache support for FUSE then?
> > You can even write your own fscache backend if the existing ones don't
> > fit your needs for some reason.
> >
> > Piling logic into the kernel is not the answer.
> > Adding the missing interfaces to the kernel is the answer.
> >
>
> Thanks for the interest and feedback. What I dreaded most was silence.
>
> Probably I should have given a bit more details in the introductory email.
> Important features we’re aiming for:
>
> 1. An attempt to read a missing data block gives a userspace data loader a
> chance to fetch it. Once a block is loaded (in advance or after a page fault)
> it is saved into a local backing storage and following reads of the same block
> are done directly by the kernel. [Implemented]
>
> 2. Block level compression. It saves space on a device, while still allowing
> very granular loading and mapping. Less granular compression would trigger
> loading of more data than absolutely necessary, and that’s the thing we
> want to avoid. [Implemented]
>
> 3. Block level integrity verification. The signature scheme is similar to
> DMverity or fs-verity. In other words, each file has a Merkle tree with
> crypto-digests of 4KB blocks. The root digest is signed with RSASSA or ECDSA.
> Each time a data block is read digest is calculated and checked with the
> Merkle tree, if the signature check fails the read operation fails as well.
> Ideally I’d like to use fs-verity API for that. [Not implemented yet.]
>
> 4. New files can be pushed into incremental-fs “externally” when an app needs
> a new resource or a binary. This is needed for situations when a new resource
> or a new version of code is available, e.g. a user just changed the system
> language to Spanish, or a developer rolled out an app update.
> Things change over time and this means that we can’t just incrementally
> load a precooked ext4 image and mount it via a loopback device.   [Implemented]
>
> 5. No need to support writes or file resizing. It eliminates a lot of
> complexity.
>
> Currently not all of these features are implemented yet, but they all will be
> needed to achieve our goals:
>  - Apps can be delivered incrementally without having to wait for extra data.
>    At the same time given enough time the app can be downloaded fully without
>    having to keep a connection open after that.
> - App’s integrity should be verifiable without having to read all its blocks.
> - Local storage and battery need to be conserved.
> - Apps binaries and resources can change over time.
>    Such changes are triggered by external events.
>

Good summary.  I understand the requirements better now.

I still have issues with this design, because it looks very android
specific.  For example I know that lazy download  is something
actually being heavily used by distributed computing (see cernvm-fs)
so it's not a specific requirement of android.   By bundling these
features together into a kernel module you are basically limiting the
user base and hence possibly missing out on some of the advantages of
having a more varied user base.

I wonder how much of the performance issues with the fuse prototype
was because of 4k reads/disabling re adahead?   I know you require
that for the data loading part, but it would be trivial to turn that
behavior off once everything is in place.   Does the prototype do
that?  Have you tried doing that?  Is the prototype in a good enough
shape to perhaps move it to a public repository for review?

I'm also wondering about some of the features you describle above.
Why a new block fs?  A normal fs (ext4) provides most of those things:
you can add files to it, etc...  The one thing it doesn't provide is
compression, and that's because it's hard for the non-incremental
case.   So do we really need a new disk format for this?  Or can the
missing compression feature (perhaps with limits) be implemented in
ext4/f2fs?  In that case we even can take that work off of fuse and
just leave the loading to the fuse part. Cernvm-fs does that with a
fuse fs on the lower layer that does  lazy downloading, and putting
already downloaded files in an upper layer of overlayfs for faster
access, but it's possible that there's a better way of doing that not
involving even overlayfs.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/6] incfs: Add first files of incrementalfs
  2019-05-02  4:03 ` [PATCH 1/6] incfs: Add first files of incrementalfs ezemtsov
  2019-05-02 19:06   ` Miklos Szeredi
  2019-05-02 20:41   ` Randy Dunlap
@ 2019-05-07 15:57   ` Jann Horn
  2019-05-07 17:13   ` Greg KH
  2019-05-07 17:18   ` Greg KH
  4 siblings, 0 replies; 33+ messages in thread
From: Jann Horn @ 2019-05-07 15:57 UTC (permalink / raw)
  To: ezemtsov; +Cc: linux-fsdevel, Theodore Y. Ts'o, Linux API

+linux-api

On Tue, May 7, 2019 at 4:23 PM <ezemtsov@google.com> wrote:
> - fs/incfs dir
> - Kconfig (CONFIG_INCREMENTAL_FS)
> - Makefile
> - Module and file system initialization and clean up code
> - New MAINTAINERS entry
> - Add incrementalfs.h UAPI header
> - Register ioctl range in ioctl-numbers.txt
> - Documentation
>
> Signed-off-by: Eugene Zemtsov <ezemtsov@google.com>
[...]
> diff --git a/Documentation/filesystems/incrementalfs.rst b/Documentation/filesystems/incrementalfs.rst
> new file mode 100644
> index 000000000000..682e3dcb6b5a
> --- /dev/null
> +++ b/Documentation/filesystems/incrementalfs.rst
> @@ -0,0 +1,452 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=======================
> +Incremental File System
> +=======================
> +
> +Overview
> +========
> +Incremental FS is special-purpose Linux virtual file system that allows
> +execution of a program while its binary and resource files are still being
> +lazily downloaded over the network, USB etc. It is focused on incremental
> +delivery for a small number (under 100) of big files (more than 10 megabytes).
> +Incremental FS doesn’t allow direct writes into files and, once loaded, file
> +content never changes. Incremental FS doesn’t use a block device, instead it
> +saves data into a backing file located on a regular file-system.
> +
> +But why?
> +--------
> +To allow running **big** Android apps before their binaries and resources are
> +fully downloaded to an Android device. If an app reads something not loaded yet,
> +it needs to wait for the data block to be fetched, but in most cases hot blocks
> +can be loaded in advance.

But the idea is that eventually, the complete application will be
downloaded, right? You're not planning to download the last few chunks
of an app on demand weeks after it has been installed?

> +Workflow
> +--------
> +A userspace process, called a data loader, mounts an instance of incremental-fs
> +giving it a file descriptor on an underlying file system (like ext4 or f2fs).
> +Incremental-fs reads content (if any) of this backing file and interprets it as
> +a file system image with files, directories and data blocks. At this point
> +the data loader can declare new files to be shown by incremental-fs.
> +
> +A process is started from a binary located on incremental-fs.
> +All reads are served directly from the backing file
> +without roundtrips into userspace. If the process accesses a data block that was
> +not originally present in the backing file, the read operation waits.
> +
> +Meanwhile the data loader can feed new data blocks to incremental-fs by calling
> +write() on a special .cmd pseudo-file. The data loader can request information
> +about pending reads by calling poll() and read() on the .cmd pseudo-file.
> +This mechanism allows the data loader to serve most urgently needed data first.
> +Once a data block is given to incremental-fs, it saves it to the backing file
> +and unblocks all the reads waiting for this block.
> +
> +Eventually all data for all files is uploaded by the data loader, and saved by
> +incremental-fs into the backing file. At that moment the data loader is not
> +needed any longer. The backing file will play the role of a complete
> +filesystem image for all future runs of the program.

This means that for all future runs, you still need to mount an incfs
instance to be able to access application files, even when the
complete application has been downloaded already, right? Wouldn't it
be nicer if, once the complete application has been downloaded, you
could stop using a shim layer entirely? That way, the performance of
the shim layer would also matter less.

Is there a reason why this thing is not backed by a normal directory
hierarchy on the backing file system, instead of the single image file
you're proposing?

> +External .cmd file interface
> +----------------------------
> +When incremental-fs is mounted, a mount directory contains a pseudo-file
> +called '.cmd'. The data loader will open this file and call read(), write(),
> +poll() and ioctl() on it inspect and change state of incremental-fs.
> +
> +poll() and read() are used by the data loader to wait for pending reads to
> +appear and obtain an array of ``struct incfs_pending_read_info``.
> +
> +write() is used by the data loader to feed new data blocks to incremental-fs.
> +A data buffer given to write() is interpreted as an array of
> +``struct incfs_new_data_block``. Structs in the array describe locations and
> +properties of data blocks loaded with this write() call.

You can't do that. A write() handler must not interpret written data
as pointers; that must be handled with an ioctl instead.

> +``ioctl(INCFS_IOC_PROCESS_INSTRUCTION)`` is used to change structure of
> +incremental-fs. It receives an pointer to ``struct incfs_instruction``
> +where type field can have be one of the following values.
> +
> +**INCFS_INSTRUCTION_NEW_FILE**
> +Creates an inode (a file or a directory) without a name.
> +It assumes ``incfs_new_file_instruction.file`` is populated with details.
> +
> +**INCFS_INSTRUCTION_ADD_DIR_ENTRY**
> +Creates a name (aka hardlink) for an inode in a directory.
> +A directory can't have more than one hardlink pointing to it, but files can be
> +linked from different directories.
> +It assumes ``incfs_new_file_instruction.dir_entry`` is populated with details.
> +
> +**INCFS_INSTRUCTION_REMOVE_DIR_ENTRY**
> +Remove a name (aka hardlink) for a file from a directory.
> +Only empty directories can be unlinked.
> +It assumes ``incfs_new_file_instruction.dir_entry`` is populated with details.

What is the usecase for removing directory entries?

With the API you're proposing, you're always going to want to populate
the entire directory hierarchy before running an application from
incfs, because otherwise lookups and readdir might fail in a way the
application doesn't expect, right?

> +For more details see in uapi/linux/incrementalfs.h and samples below.
> +
> +Supported mount options
> +-----------------------
> +See ``fs/incfs/options.c`` for more details.
> +
> +    * ``backing_fd=<unsigned int>``
> +        Required. A file descriptor of a backing file opened by the process
> +        calling mount(2). This descriptor can be closed after mount returns.
> +
> +    * ``read_timeout_msc=<unsigned int>``
> +        Default: 1000. Timeout in milliseconds before a read operation fails
> +        if no data found in the backing file or provided by the data loader.

So... if I run an application from this incremental file system, and
the application page faults on a page that hasn't been loaded yet, and
my phone happens to not have connectivity for a second because it's
moving between wifi and cellular or whatever, the application will
crash?

> +Open .cmd file
> +~~~~~~~~~~~~~~
> +
> +::
> +
> +    int open_commands_file(char *mount_dir)
> +    {
> +        char cmd_file[255];
> +        int cmd_fd;
> +
> +        snprintf(cmd_file, ARRAY_SIZE(cmd_file), "%s/.cmd", mount_dir);
> +        cmd_fd = open(cmd_file, O_RDWR);
> +        if (cmd_fd < 0)
> +            perror("Can't open commands file");
> +        return cmd_fd;
> +    }

How is access control for this supposed to work? The command file is
created with mode 0666, so does that mean that any instance of the
application can write arbitrary code into chunks that haven't been
loaded yet, modulo SELinux?

> +Design alternatives
> +===================
> +
> +Why isn't incremental-fs implemented via FUSE?
> +----------------------------------------------
> +TLDR: FUSE-based filesystems add 20-80% of performance overhead for target
> +scenarios

Really?

> and increase power use on mobile beyond acceptable limit
> +for widespread deployment.

From what I can tell, you only really need this thing to be active
while the application is still being downloaded - and at that point in
time, you're shoving packets over a wireless connection, checking data
integrity, writing blocks to disk, and so on, right? Does FUSE add
noticeable power use to that?

> A custom kernel filesystem is the way to overcome
> +these limitations.

I doubt that. I see two main alternatives that I think would both be better:

1. Use a FUSE filesystem to trap writes while files are being
downloaded, then switch to native ext4.
2. Add an eBPF hook in the ext4 read path. The hook would take the
inode number and the offset as input and return a value that indicates
whether the kernel should let the read go through or block the read
and send a notification to userspace over a file descriptor. Sort of
like userfaultfd, except with an eBPF-based fastpath. (And to deal
with readahead, you could perhaps add a flag that is passed through to
the read code to say "this is readahead", and then throw an error
instead of blocking the read.)

> +From the theoretical side of things, FUSE filesystem adds some overhead to
> +each filesystem operation that’s not handled by OS page cache:

How many filesystem operations do you have during application download
that are not handled by the OS page cache?

> +    * When an IO request arrives to FUSE driver (D), it puts it into a queue
> +      that runs on a separate kernel thread
> +    * Then another separate user-mode handler process (H) has to run,
> +      potentially after a context switch, to read the request from the queue.
> +      Reading the request adds a kernel-user mode transition to the handling.
> +    * (H) sends the IO request to kernel to handle it on some underlying storage
> +      filesystem. This adds a user-kernel and kernel-user mode transition
> +      pair to the handling.
> +    * (H) then responds to the FUSE request via a write(2) call.
> +      Writing the response is another user-kernel mode transition.
> +    * (D) needs to read the response from (H) when its kernel thread runs
> +      and forward it to the user
> +
> +Together, the scenario adds 2 extra user-kernel-user mode transition pairs,
> +and potentially has up to 3 additional context switches for the FUSE kernel
> +thread and the user-mode handler to start running for each IO request on the
> +filesystem.
> +This overhead can vary from unnoticeable to unmanageable, depending on the
> +target scenario.

Is the overhead of extra context switches really "unmanageable"
compared to the latency of storage?

> But it will always burn extra power via CPU staying longer
> +in non-idle state, handling context switches and mode transitions.
> +One important goal for the new filesystem is to be able to handle each page
> +read separately on demand, because we don't want to wait and download more data
> +than absolutely necessary. Thus readahead would need to be disabled completely.
> +This increases the number of separate IO requests and the FUSE related overhead
> +by almost 32x (128KB readahead limit vs 4KB individual block operations)

You could implement the readahead in the FUSE filesystem, no? Check if
adjacent blocks are already available, and if so, shove them into the
page cache without waiting for the kernel to ask for them?

> +For more info see a 2017 USENIX research paper:
> +To FUSE or Not to FUSE: Performance of User-Space File Systems
> +Bharath Kumar Reddy Vangoor, Stony Brook University;
> +Vasily Tarasov, IBM Research-Almaden;
> +Erez Zadok, Stony Brook University
> +https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf

From that paper, the workloads that are interesting for you are either
the seq-rd-1th-1f or the rnd-rd-1th-1f workloads, right?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/6] incfs: Add first files of incrementalfs
  2019-05-02  4:03 ` [PATCH 1/6] incfs: Add first files of incrementalfs ezemtsov
                     ` (2 preceding siblings ...)
  2019-05-07 15:57   ` Jann Horn
@ 2019-05-07 17:13   ` Greg KH
  2019-05-07 17:18   ` Greg KH
  4 siblings, 0 replies; 33+ messages in thread
From: Greg KH @ 2019-05-07 17:13 UTC (permalink / raw)
  To: ezemtsov; +Cc: linux-fsdevel, tytso

On Wed, May 01, 2019 at 09:03:26PM -0700, ezemtsov@google.com wrote:
> +Sysfs files
> +-----------
> +``/sys/fs/incremental-fs/version`` - a current version of the filesystem.
> +One ASCII encoded positive integer number with a new line at the end.

sysfs (no "S" please) documentation goes in Documentation/ABI/, not
burried in a random file somewhere else :)

Also, "filesystem version" does not make much sense, does it?  Why does
userspace care about this, all they really care about is the _kernel_
version.  We do not independantly version the individual components of
the kernel, that way lies madness.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/6] incfs: Add first files of incrementalfs
  2019-05-02  4:03 ` [PATCH 1/6] incfs: Add first files of incrementalfs ezemtsov
                     ` (3 preceding siblings ...)
  2019-05-07 17:13   ` Greg KH
@ 2019-05-07 17:18   ` Greg KH
  4 siblings, 0 replies; 33+ messages in thread
From: Greg KH @ 2019-05-07 17:18 UTC (permalink / raw)
  To: ezemtsov; +Cc: linux-fsdevel, tytso

But since you did write some sysfs code, might as well review it so you
know how to do it better next time around :)

On Wed, May 01, 2019 at 09:03:26PM -0700, ezemtsov@google.com wrote:
> +static struct kobject *sysfs_root;
> +
> +static ssize_t version_show(struct kobject *kobj,
> +			    struct kobj_attribute *attr, char *buff)
> +{
> +	return snprintf(buff, PAGE_SIZE, "%d\n", INCFS_CORE_VERSION);

Hint about sysfs, it's "one value per file", so you NEVER care about the
size of the buffer because you "know" your single little number will
always fit.

So this should be:
	return sprintf(buff, "%d\n", INCFS_CORE_VERSION);

Yes, code checkers hate it, send them my way, I'll be glad to point out
their folly :)

> +static struct kobj_attribute version_attr = __ATTR_RO(version);
> +
> +static struct attribute *attributes[] = {
> +	&version_attr.attr,
> +	NULL,
> +};
> +
> +static const struct attribute_group attr_group = {
> +	.attrs = attributes,
> +};

ATTRIBUTE_GROUP()?

> +static int __init init_sysfs(void)
> +{
> +	int res = 0;

No need to set to 0 here.

> +
> +	sysfs_root = kobject_create_and_add(INCFS_NAME, fs_kobj);
> +	if (!sysfs_root)
> +		return -ENOMEM;
> +
> +	res = sysfs_create_group(sysfs_root, &attr_group);
> +	if (res) {
> +		kobject_put(sysfs_root);
> +		sysfs_root = NULL;
> +	}
> +	return res;

To be extra "fancy", there's no real need to create a kobject for your
filesystem if all you are doing is creating a subdir and some individual
attributes.  Just add a "named" group to the parent kobject.  Can be
done in a single line, no need for having to deal with fancy error
cases.

> +}
> +
> +static void cleanup_sysfs(void)
> +{
> +	if (sysfs_root) {
> +		sysfs_remove_group(sysfs_root, &attr_group);
> +		kobject_put(sysfs_root);
> +		sysfs_root = NULL;

Why set it to NULL?

> +	}
> +}

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Initial patches for Incremental FS
  2019-05-03  5:19         ` Amir Goldstein
@ 2019-05-08 20:09           ` Eugene Zemtsov
  2019-05-09  8:15             ` Amir Goldstein
  0 siblings, 1 reply; 33+ messages in thread
From: Eugene Zemtsov @ 2019-05-08 20:09 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, Al Viro, Theodore Tso, Miklos Szeredi, Richard Weinberger

> This really sounds to me like the properties of a network filesystem
> with local cache. It seems that you did a thorough research, but
> I am not sure that you examined the fscache option properly.
> Remember, if an existing module does not meet your needs,
> it does not mean that creating a new module is the right answer.
> It may be that extending an existing module is something that
> everyone, including yourself will benefit from.

> I am sure you can come up with caching policy that will meet your needs
> and AFAIK FUSE protocol supports invalidating cache entries from server
> (i.e. on "external" changes).

You’re right. On a very high level it looks quite plausible that incfs can be
replaced by a combination of
1. fscache interface change to accomodate compression, hashes etc
2. a new fscache backend
3. a FUSE change, that would allow FUSE to load data to fscache and server data
    from directly fscache.

After it is all done, FUSE and fscache will have more features and support more
use cases for years to come. But this approach is not without
tradeoffs, features
increase support burden and FUSE interface changes are almost
impossible to deprecate.

On the other hand we have a simple self-contained module, which handles
incremental app loading for Android. All in all, incfs currently has
about 6KLOC,
where only 3.5KLOC is actual kernel code. It is not likely to be used “as is”
for other purposes, but it doesn’t increase already significant complexity of
fscache, FUSE, and VFS. People working with those components won’t need to fret
about extra hooks and corner cases created for incremental app loading.
If for some reason incfs doesn’t gain wide adoption, it can be relatively
painlessly removed from the kernel.

Having a standalone module is very important for me on a yet another level.
It helps in porting it to older kernels. Patches scattered across fs/ substree
will be less portable and self contained. (BTW this is the reason to have
a version file in sysfs - new versions of incfs can be backported to
older kernels.)

Hopefully this will clarify why I think that VFS interface is the right boundary
for incremental-fs. It is sufficiently low-level to achieve all
goals of incremental app loading, but at the same time sufficiently isolated
not to meddle with the rest of the kernel.

Thoughts?

Thanks,
Eugene.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Initial patches for Incremental FS
  2019-05-08 20:09           ` Eugene Zemtsov
@ 2019-05-09  8:15             ` Amir Goldstein
       [not found]               ` <CAK8JDrEQnXTcCtAPkb+S4r4hORiKh_yX=0A0A=LYSVKUo_n4OA@mail.gmail.com>
  0 siblings, 1 reply; 33+ messages in thread
From: Amir Goldstein @ 2019-05-09  8:15 UTC (permalink / raw)
  To: Eugene Zemtsov
  Cc: linux-fsdevel, Al Viro, Theodore Tso, Miklos Szeredi, Richard Weinberger

On Wed, May 8, 2019 at 11:10 PM Eugene Zemtsov <ezemtsov@google.com> wrote:
>
> > This really sounds to me like the properties of a network filesystem
> > with local cache. It seems that you did a thorough research, but
> > I am not sure that you examined the fscache option properly.
> > Remember, if an existing module does not meet your needs,
> > it does not mean that creating a new module is the right answer.
> > It may be that extending an existing module is something that
> > everyone, including yourself will benefit from.
>
> > I am sure you can come up with caching policy that will meet your needs
> > and AFAIK FUSE protocol supports invalidating cache entries from server
> > (i.e. on "external" changes).
>
> You’re right. On a very high level it looks quite plausible that incfs can be
> replaced by a combination of
> 1. fscache interface change to accomodate compression, hashes etc
> 2. a new fscache backend
> 3. a FUSE change, that would allow FUSE to load data to fscache and server data
>     from directly fscache.
>
> After it is all done, FUSE and fscache will have more features and support more
> use cases for years to come. But this approach is not without
> tradeoffs, features
> increase support burden and FUSE interface changes are almost
> impossible to deprecate.
>
> On the other hand we have a simple self-contained module, which handles
> incremental app loading for Android. All in all, incfs currently has
> about 6KLOC,
> where only 3.5KLOC is actual kernel code. It is not likely to be used “as is”
> for other purposes, but it doesn’t increase already significant complexity of
> fscache, FUSE, and VFS. People working with those components won’t need to fret
> about extra hooks and corner cases created for incremental app loading.
> If for some reason incfs doesn’t gain wide adoption, it can be relatively
> painlessly removed from the kernel.
>

If you add NEW fscache APIs you won't risk breaking the old ones.
You certainly won't make VFS more complex because you won't be
changing VFS.
You know what, even if you do submit incfs a new user space fs, without FUSE,
I'd rather that you used fscache frontend/backend design, so that at
least it will
make it easier for someone else in the community to take the backend parts
and add fronend support to FUSE or any other network fs.

And FYI, since fscache is an internal kernel API, the NEW interfaces could
be just as painlessly removed if the incfs *backend* doesn't gain any adoption.

> Having a standalone module is very important for me on a yet another level.
> It helps in porting it to older kernels. Patches scattered across fs/ substree
> will be less portable and self contained. (BTW this is the reason to have
> a version file in sysfs - new versions of incfs can be backported to
> older kernels.)
>
> Hopefully this will clarify why I think that VFS interface is the right boundary
> for incremental-fs. It is sufficiently low-level to achieve all
> goals of incremental app loading, but at the same time sufficiently isolated
> not to meddle with the rest of the kernel.
>
> Thoughts?
>

I think you have made the right choice for you and for the product you are
working on to use an isolated module to provide this functionality.

But I assume the purpose of your posting was to request upstream inclusion,
community code review, etc. This is not likely to happen when the
implementation and design choices are derived from Employer needs vs.
the community needs. Sure, you can get high level design review, which is
what *this* is, but I recon not much more.

This discussion has several references to community projects that can benefit
from this functionality, but not in its current form.

This development model has worked well in the past for Android and the Android
user base leverage could help to get you a ticket to staging, but eventually,
those modules (e.g. ashmem) often do get replaced with more community oriented
APIs.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Initial patches for Incremental FS
       [not found]               ` <CAK8JDrEQnXTcCtAPkb+S4r4hORiKh_yX=0A0A=LYSVKUo_n4OA@mail.gmail.com>
@ 2019-05-21  1:32                 ` Yurii Zubrytskyi
  2019-05-22  8:32                   ` Miklos Szeredi
  2019-05-22 10:54                   ` Amir Goldstein
  0 siblings, 2 replies; 33+ messages in thread
From: Yurii Zubrytskyi @ 2019-05-21  1:32 UTC (permalink / raw)
  To: Eugene Zemtsov, amir73il, linux-fsdevel, miklos

On Thu, May 9, 2019 at 1:15 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> I think you have made the right choice for you and for the product you are
> working on to use an isolated module to provide this functionality.
>
> But I assume the purpose of your posting was to request upstream inclusion,
> community code review, etc. This is not likely to happen when the
> implementation and design choices are derived from Employer needs vs.
> the community needs. Sure, you can get high level design review, which is
> what *this* is, but I recon not much more.
>
> This discussion has several references to community projects that can benefit
> from this functionality, but not in its current form.
>
> This development model has worked well in the past for Android and the Android
> user base leverage could help to get you a ticket to staging, but eventually,
> those modules (e.g. ashmem) often do get replaced with more community oriented
> APIs.
>

Hi fsdevel
I'm Yurii, and I work with Eugene on the same team and the same project.

I want to explain how we ended up with a custom filesystem instead of
trying to improve FUSE for everyone, and
why we think (maybe incorrectly) that it may be still pretty useful
for the community.
As the project goal was to allow instant (-ish) deployment of apps
from the dev environment to Android phone, we were hoping to
stick with plain FUSE filesystem, and that's what we've done at first.
But it turned out that even with the best tuning it was still
really slow and battery-hungry (phones spent energy faster than they
were charging over the cord).
At this point we've already collected the profiles for the filesystem
usage, and also figured out what features are essential
to make it usable for streaming:
1. Random reads are the most common -> 4kb-sized read is the size we
have to support, and may not go to usermode on each of those
2. Android tends to list the app directory and stat files in it often
-> these operations need to be cached in kernel as well
3. Because of *random* reads streaming files sequentially isn't
optimal -> need to be able to collect read logs from first deployment
    and stream in that order next time on incremental builds
4. Devices have small flash cards, need to deploy uncompressed game
images for speed and mmap access ->
    support storing 4kb blocks compressed
4.1. Host computer is much better at compression -> support streaming
compressed blocks into the filesystem storage directly, without
       recompression on the phone
5. Android has to verify app signature for installation -> need to
support per-block signing and lazy verification
5.1. For big games even per-block signature data can be huge, so need
to stream even the signatures
6. Development cycle is usually edit-build-try-edit-... -> need to
support delta-patches from existing files
7. File names for installed apps are standard and different from what
they were on the host ->
    must be able to store user-supplied 'key' next to each file to identify it
8. Files never change -> no need to have complex code for mutable data
in the filesystem

In the end, we saw only two ways how to make all of this work: either
take sdcardfs as a base and extend it, or change FUSE to
support cache in kernel; and as you can imagine, sdcardfs route got
thrown out of the window immediately after looking at the code.
But after learning some FUSE internals and its code what we found out
is that to make it do all the listed things we'd basically have
to implement a totally new filesystem inside of it. The only real use
of FUSE that remained was to send FUSE_INIT, and occasional
read requests. Everything else required, first of all, making a cache
object inside FUSE intercept every message before it goes to the
user mode, and also adding new specialized commands initiated by the
usermode (e.g. prefetching data that hasn't been requested
yet, or streaming hashes in). Some things even didn't make sense for a
generic usecase (e.g. having a limited circular buffer of read
blocks in kernel that user can ask for and flush).

In the end, after several tries we just came to a conclusion that the
very set of original requirements is so specific that, funny enough,
anyone who wants to create a lazy-loading experience would hit most of
them, while anyone who's doing something else, would miss
most of them. That's the main reason to go with a separate specialized
driver module, and the reason to share it with the community -
we have a feeling that people will benefit from a high-quality
implementation of lazy loading in kernel, and we will benefit from the
community support and guiding.

Again, we all are human and can be wrong at any step when making
conclusions. E.g. we didn't know about the fscache subsystem,
and were only planning to create a cache object inside FUSE instead.
But for now I still feel that our original research stands, and
that in the long run specialized filesystem serves its users much
better than several scattered changes in other places that all
pretty much look like the same filesystem split into three parts and
adopted to the interfaces those places force onto it. Even more,
those changes and interfaces look quite strange on their own, when not
used together.

Please tell me what you think about this whole thing. We do care about
the feature in general, not about making it
look as we've coded it right now. If you feel that making fscache
interface that covers the whole FUSE usermode
messages + allows for those requirements is useful beyond streaming,
we'll investigate that route further.

Thank you, and sorry for a long email

--
Thanks, Yurii

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Initial patches for Incremental FS
  2019-05-21  1:32                 ` Yurii Zubrytskyi
@ 2019-05-22  8:32                   ` Miklos Szeredi
  2019-05-22 17:25                     ` Yurii Zubrytskyi
  2019-05-22 10:54                   ` Amir Goldstein
  1 sibling, 1 reply; 33+ messages in thread
From: Miklos Szeredi @ 2019-05-22  8:32 UTC (permalink / raw)
  To: Yurii Zubrytskyi; +Cc: Eugene Zemtsov, Amir Goldstein, linux-fsdevel

On Tue, May 21, 2019 at 3:32 AM Yurii Zubrytskyi <zyy@google.com> wrote:
>
> On Thu, May 9, 2019 at 1:15 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > I think you have made the right choice for you and for the product you are
> > working on to use an isolated module to provide this functionality.
> >
> > But I assume the purpose of your posting was to request upstream inclusion,
> > community code review, etc. This is not likely to happen when the
> > implementation and design choices are derived from Employer needs vs.
> > the community needs. Sure, you can get high level design review, which is
> > what *this* is, but I recon not much more.
> >
> > This discussion has several references to community projects that can benefit
> > from this functionality, but not in its current form.
> >
> > This development model has worked well in the past for Android and the Android
> > user base leverage could help to get you a ticket to staging, but eventually,
> > those modules (e.g. ashmem) often do get replaced with more community oriented
> > APIs.
> >
>
> Hi fsdevel
> I'm Yurii, and I work with Eugene on the same team and the same project.
>
> I want to explain how we ended up with a custom filesystem instead of
> trying to improve FUSE for everyone, and
> why we think (maybe incorrectly) that it may be still pretty useful
> for the community.
> As the project goal was to allow instant (-ish) deployment of apps
> from the dev environment to Android phone, we were hoping to
> stick with plain FUSE filesystem, and that's what we've done at first.
> But it turned out that even with the best tuning it was still
> really slow and battery-hungry (phones spent energy faster than they
> were charging over the cord).
> At this point we've already collected the profiles for the filesystem
> usage, and also figured out what features are essential
> to make it usable for streaming:
> 1. Random reads are the most common -> 4kb-sized read is the size we
> have to support, and may not go to usermode on each of those
> 2. Android tends to list the app directory and stat files in it often
> -> these operations need to be cached in kernel as well
> 3. Because of *random* reads streaming files sequentially isn't
> optimal -> need to be able to collect read logs from first deployment
>     and stream in that order next time on incremental builds
> 4. Devices have small flash cards, need to deploy uncompressed game
> images for speed and mmap access ->
>     support storing 4kb blocks compressed
> 4.1. Host computer is much better at compression -> support streaming
> compressed blocks into the filesystem storage directly, without
>        recompression on the phone
> 5. Android has to verify app signature for installation -> need to
> support per-block signing and lazy verification
> 5.1. For big games even per-block signature data can be huge, so need
> to stream even the signatures
> 6. Development cycle is usually edit-build-try-edit-... -> need to
> support delta-patches from existing files
> 7. File names for installed apps are standard and different from what
> they were on the host ->
>     must be able to store user-supplied 'key' next to each file to identify it
> 8. Files never change -> no need to have complex code for mutable data
> in the filesystem
>
> In the end, we saw only two ways how to make all of this work: either
> take sdcardfs as a base and extend it, or change FUSE to
> support cache in kernel; and as you can imagine, sdcardfs route got
> thrown out of the window immediately after looking at the code.
> But after learning some FUSE internals and its code what we found out
> is that to make it do all the listed things we'd basically have
> to implement a totally new filesystem inside of it. The only real use
> of FUSE that remained was to send FUSE_INIT, and occasional
> read requests. Everything else required, first of all, making a cache
> object inside FUSE intercept every message before it goes to the
> user mode, and also adding new specialized commands initiated by the
> usermode (e.g. prefetching data that hasn't been requested
> yet, or streaming hashes in). Some things even didn't make sense for a
> generic usecase (e.g. having a limited circular buffer of read
> blocks in kernel that user can ask for and flush).

Hang on, fuse does use caches in the kernel (page cache,
dcache/icache).  The issue is probably not lack of cache, it's how the
caches are primed and used.  Did you disable these caches?  Did you
not disable invalidation for data, metadata and dcache?  In recent
kernels we added caching readdir as well.  The only objects not cached
are (non-acl) xattrs.   Do you have those?     Re prefetching data:
there's the NOTIFY_STORE message.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Initial patches for Incremental FS
  2019-05-21  1:32                 ` Yurii Zubrytskyi
  2019-05-22  8:32                   ` Miklos Szeredi
@ 2019-05-22 10:54                   ` Amir Goldstein
  1 sibling, 0 replies; 33+ messages in thread
From: Amir Goldstein @ 2019-05-22 10:54 UTC (permalink / raw)
  To: Yurii Zubrytskyi; +Cc: Eugene Zemtsov, linux-fsdevel, Miklos Szeredi

On Tue, May 21, 2019 at 4:32 AM Yurii Zubrytskyi <zyy@google.com> wrote:
>
> On Thu, May 9, 2019 at 1:15 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > I think you have made the right choice for you and for the product you are
> > working on to use an isolated module to provide this functionality.
> >
> > But I assume the purpose of your posting was to request upstream inclusion,
> > community code review, etc. This is not likely to happen when the
> > implementation and design choices are derived from Employer needs vs.
> > the community needs. Sure, you can get high level design review, which is
> > what *this* is, but I recon not much more.
> >
> > This discussion has several references to community projects that can benefit
> > from this functionality, but not in its current form.
> >
> > This development model has worked well in the past for Android and the Android
> > user base leverage could help to get you a ticket to staging, but eventually,
> > those modules (e.g. ashmem) often do get replaced with more community oriented
> > APIs.
> >
>
> Hi fsdevel
> I'm Yurii, and I work with Eugene on the same team and the same project.
>
> I want to explain how we ended up with a custom filesystem instead of
> trying to improve FUSE for everyone, and
> why we think (maybe incorrectly) that it may be still pretty useful
> for the community.
> As the project goal was to allow instant (-ish) deployment of apps
> from the dev environment to Android phone, we were hoping to
> stick with plain FUSE filesystem, and that's what we've done at first.
> But it turned out that even with the best tuning it was still
> really slow and battery-hungry (phones spent energy faster than they
> were charging over the cord).
> At this point we've already collected the profiles for the filesystem
> usage, and also figured out what features are essential
> to make it usable for streaming:
> 1. Random reads are the most common -> 4kb-sized read is the size we
> have to support, and may not go to usermode on each of those
> 2. Android tends to list the app directory and stat files in it often
> -> these operations need to be cached in kernel as well
> 3. Because of *random* reads streaming files sequentially isn't
> optimal -> need to be able to collect read logs from first deployment
>     and stream in that order next time on incremental builds
> 4. Devices have small flash cards, need to deploy uncompressed game
> images for speed and mmap access ->
>     support storing 4kb blocks compressed
> 4.1. Host computer is much better at compression -> support streaming
> compressed blocks into the filesystem storage directly, without
>        recompression on the phone

Aha. It wasn't clear to me that block aligned decompression of a specific
compression format was part of the filesystem.
eromfs (also for Android) also provides new block aligned decompression
subsystem and I have heard the maintainer say that the decompression
engine could be moved into a library like fs/crypt so that other filesystem
could support the same on-the-fly decryption engine.

> 5. Android has to verify app signature for installation -> need to
> support per-block signing and lazy verification
> 5.1. For big games even per-block signature data can be huge, so need
> to stream even the signatures
> 6. Development cycle is usually edit-build-try-edit-... -> need to
> support delta-patches from existing files
> 7. File names for installed apps are standard and different from what
> they were on the host ->
>     must be able to store user-supplied 'key' next to each file to identify it
> 8. Files never change -> no need to have complex code for mutable data
> in the filesystem
>
> In the end, we saw only two ways how to make all of this work: either
> take sdcardfs as a base and extend it, or change FUSE to
> support cache in kernel; and as you can imagine, sdcardfs route got
> thrown out of the window immediately after looking at the code.
> But after learning some FUSE internals and its code what we found out
> is that to make it do all the listed things we'd basically have
> to implement a totally new filesystem inside of it. The only real use
> of FUSE that remained was to send FUSE_INIT, and occasional
> read requests. Everything else required, first of all, making a cache
> object inside FUSE intercept every message before it goes to the
> user mode, and also adding new specialized commands initiated by the
> usermode (e.g. prefetching data that hasn't been requested
> yet, or streaming hashes in). Some things even didn't make sense for a
> generic usecase (e.g. having a limited circular buffer of read
> blocks in kernel that user can ask for and flush).

As Miklos pointed out and I as well in a previous message, it
appears that you missed a few caching capabilities of FUSE -
Those capabilities may or may not make sense to extend using
fscache or by other means - I haven't studied this myself, but the
study you publish does not convince me that this option has been
fully exhausted.

>
> In the end, after several tries we just came to a conclusion that the
> very set of original requirements is so specific that, funny enough,
> anyone who wants to create a lazy-loading experience would hit most of
> them, while anyone who's doing something else, would miss
> most of them. That's the main reason to go with a separate specialized
> driver module, and the reason to share it with the community -
> we have a feeling that people will benefit from a high-quality
> implementation of lazy loading in kernel, and we will benefit from the
> community support and guiding.
>
> Again, we all are human and can be wrong at any step when making
> conclusions. E.g. we didn't know about the fscache subsystem,
> and were only planning to create a cache object inside FUSE instead.
> But for now I still feel that our original research stands, and
> that in the long run specialized filesystem serves its users much

"serves its users" - that's the difference between our perspective.
I am not thinking only on users of incfs as you defined it, but on a wider
variety of users that need the functionality of user space filesystem with
kernel "fast path" optimizations.

> better than several scattered changes in other places that all
> pretty much look like the same filesystem split into three parts and
> adopted to the interfaces those places force onto it. Even more,
> those changes and interfaces look quite strange on their own, when not
> used together.
>
> Please tell me what you think about this whole thing. We do care about
> the feature in general, not about making it
> look as we've coded it right now. If you feel that making fscache
> interface that covers the whole FUSE usermode
> messages + allows for those requirements is useful beyond streaming,
> we'll investigate that route further.
>

Let me answer that with an anecdote.
The very first FUSE filesystem was AVFS (http://avf.sourceforge.net/).
It provides a VFS interface to archives (e.g. zip).
It is quite amusing that this is exactly what incfs in meant to provide
(APK is a zip file, apparently with some new block aligned compression?).

Now forget Android and APK download and imagine a huge tar archive
on a slow tape device, which can be browsed using AVFS/FUSE.
In that case, would AVFS users benefit from local disk caching of
pieces of archive read from the tape? listing read from the tape? (Yes)
Would AVFS users benefit from storing the compressed blocks in
local disk cache and decompressing them on first read? (Sure why not).

The decision of whether or not incfs functionality should be built into
FUSE needs to take into account all the FUSE filesystems out there.
How many of them would benefit from the extended functionality?
My personal guess is that the answer is "a lot".

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Initial patches for Incremental FS
  2019-05-22  8:32                   ` Miklos Szeredi
@ 2019-05-22 17:25                     ` Yurii Zubrytskyi
  2019-05-23  4:25                       ` Miklos Szeredi
  0 siblings, 1 reply; 33+ messages in thread
From: Yurii Zubrytskyi @ 2019-05-22 17:25 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Eugene Zemtsov, Amir Goldstein, linux-fsdevel

> Hang on, fuse does use caches in the kernel (page cache,
> dcache/icache).  The issue is probably not lack of cache, it's how the
> caches are primed and used.  Did you disable these caches?  Did you
> not disable invalidation for data, metadata and dcache?  In recent
> kernels we added caching readdir as well.  The only objects not cached
> are (non-acl) xattrs.   Do you have those?
Android (which is our primary use case) is constantly under memory
pressure, so caches
don't actually last long. Our experience with FOPEN_KEEP_CACHE has
shown that pages are
evicted more often than the files are getting reopened, so it doesn't
help. FUSE has to re-read
the data from the backing store all the time.
We didn't use xattrs for the FUSE-based implementation, but ended up
requiring a similar thing in
the Incremental FS, so the final design would have to include them.

> Re prefetching data:
> there's the NOTIFY_STORE message.
To add to the previous point, we do not have the data for prefetching,
as we're loading it page-by-page
from the host. We had to disable readahead for FUSE completely,
otherwise even USB3 isn't fast enough
to deliver data in that big chunks in time, and applications keep
hanging on page faults.

Overall, better caching doesn't save much *on Android*; what would
work is a full-blown data storage system inside
FUSE kernel code, that can intercept requests before they go into user
mode and process them completely. That's how
we could keep the data out of RAM but still get rid of that extra
context switch and kernel-user transition.
But this also means that FUSE becomes damn too much aware of the
specific storage format and all its features, and
basically gets specialized implementation of one of its filesystem
inside the generic FUSE code.
Even if we separate that out, the kernel API between the storage and
FUSE ended up being complete VFS API copy,
with some additions to send data blocks and Merkle tree blocks in. The
code is truly if we stuff the Incremental FS into
FUSE instead of mounting it directly.

-- 
Thanks, Yurii

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Initial patches for Incremental FS
  2019-05-22 17:25                     ` Yurii Zubrytskyi
@ 2019-05-23  4:25                       ` Miklos Szeredi
  2019-05-29 21:06                         ` Yurii Zubrytskyi
  0 siblings, 1 reply; 33+ messages in thread
From: Miklos Szeredi @ 2019-05-23  4:25 UTC (permalink / raw)
  To: Yurii Zubrytskyi; +Cc: Eugene Zemtsov, Amir Goldstein, linux-fsdevel

On Wed, May 22, 2019 at 7:25 PM Yurii Zubrytskyi <zyy@google.com> wrote:
>
> > Hang on, fuse does use caches in the kernel (page cache,
> > dcache/icache).  The issue is probably not lack of cache, it's how the
> > caches are primed and used.  Did you disable these caches?  Did you
> > not disable invalidation for data, metadata and dcache?  In recent
> > kernels we added caching readdir as well.  The only objects not cached
> > are (non-acl) xattrs.   Do you have those?
> Android (which is our primary use case) is constantly under memory
> pressure, so caches
> don't actually last long. Our experience with FOPEN_KEEP_CACHE has
> shown that pages are
> evicted more often than the files are getting reopened, so it doesn't
> help. FUSE has to re-read
> the data from the backing store all the time.

What would benefit many fuse applications is to let the kernel
transfer data to/from a given location (i.e. offset within a file).
So instead of transferring data directly in the READ/WRITE messages,
there would be a MAP message that would return information about where
the data resides (list of extents+extra parameters for
compression/encryption).  The returned information could be generic
enough for your needs, I think.  The fuse kernel module would cache
this mapping, and could keep the mapping around for possibly much
longer than the data itself, since it would require orders of
magnitude less memory. This would not only be saving memory copies,
but also the number of round trips to userspace.

There's also work currently ongoing in optimizing the overhead of
userspace roundtrip.  The most promising thing appears to be matching
up the CPU for the userspace server with that of the task doing the
request.  This can apparently result in  60-500% speed improvement.

> We didn't use xattrs for the FUSE-based implementation, but ended up
> requiring a similar thing in
> the Incremental FS, so the final design would have to include them.
>
> > Re prefetching data:
> > there's the NOTIFY_STORE message.
> To add to the previous point, we do not have the data for prefetching,
> as we're loading it page-by-page
> from the host. We had to disable readahead for FUSE completely,
> otherwise even USB3 isn't fast enough

Understood.  Did you re-enable readahead for the case when the file
has been fully downloaded?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Initial patches for Incremental FS
  2019-05-23  4:25                       ` Miklos Szeredi
@ 2019-05-29 21:06                         ` Yurii Zubrytskyi
  2019-05-30  9:22                           ` Miklos Szeredi
  0 siblings, 1 reply; 33+ messages in thread
From: Yurii Zubrytskyi @ 2019-05-29 21:06 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Eugene Zemtsov, Amir Goldstein, linux-fsdevel

On Wed, May 22, 2019 at 9:25 PM Miklos Szeredi <miklos@szeredi.hu> wrote:

> What would benefit many fuse applications is to let the kernel
> transfer data to/from a given location (i.e. offset within a file).
> So instead of transferring data directly in the READ/WRITE messages,
> there would be a MAP message that would return information about where
> the data resides (list of extents+extra parameters for
> compression/encryption).  The returned information could be generic
> enough for your needs, I think.  The fuse kernel module would cache
> this mapping, and could keep the mapping around for possibly much
> longer than the data itself, since it would require orders of
> magnitude less memory. This would not only be saving memory copies,
> but also the number of round trips to userspace.

Yes, and this was _exactly_ our first plan, and it mitigates the read
performance
issue. The reasons why we didn't move forward with it are that we figured out
all other requirements, and fixing each of those needs another change in
FUSE, up to the level when FUSE interface becomes 50% dedicated to
our specific goal:
1. MAP message would have to support data compression (with different
algorithms), hash verification (same thing) with hash streaming (because
even the Merkle tree for a 5GB file is huge, and can't be preloaded
at once)
  1.1. Mapping memory usage can get out of hands pretty quickly: it has to
be at least (offset + size + compression type + hash location + hash size +
hash kind) per each block. I'm not even thinking about multiple storage files
here. For that 5GB file (that's a debug APK for some Android game we're
targeting) we have 1.3M blocks, so ~16 bytes *1.3M = 20M of index only,
without actual overhead for the lookup table.
If the kernel code owns and manages its own on-disk data store and the
format, this index can be loaded and discarded on demand there.

2. We need the same kind of a MAP message but for the directory structure
and for stat(2) calls - Android does way too many of these, and has no
intention to fix it. These caches need to be dynamically sized as well
(as I said, standard kernel caches don't hold anything long enough on
Android because of the usual thing when all memory is used by running
apps)

3. Several smaller features would have to be added, again with their own
interface and specific code in FUSE
3.1 E.g. collecting logs of all block reads - we're planning to have a ring
buffer of configurable size there, and a way to request its content from the
 user space; this doesn't look that useful for other FUSE users, and may
actually be a serious security hole there. We'd not need it at all if FUSE
was calling into user space on each read, so here it's almost like we're
fighting ourselves and making two opposing changes in FUSE

4. All these features are much easier to implement for a readonly
filesystem (cache invalidation is a big thing). But if we limit them in FUSE
to readonly mode we'd make half of its interface dedicated to even
smaller use case.

> There's also work currently ongoing in optimizing the overhead of
> userspace roundtrip.  The most promising thing appears to be matching
> up the CPU for the userspace server with that of the task doing the
> request.  This can apparently result in  60-500% speed improvement.

That sounds almost too good to be true, and will be really cool.
Do you have any patches or git remote available in any compilable state to
try the optimization out? Android has quite complicated hardware config
and I want to see how this works, especially with our model where
several processes may send requests into the same filesystem FD.

> Understood.  Did you re-enable readahead for the case when the file
> has been fully downloaded?

Yes, and it doesn't really help - readahead wants a bunch of blocks
together, but those are scattered around the backing image because they
arrived separately and at different times. So usermode process still has to
issue multiple read commands to respond to a single FUSE read(ahead)
request, which is still slow. Even worse thing happens if CPU was in
reduced frequency idling mode at that time (which is normal for mobile) -
it takes couple hundred ms to ramp it up, and during that time latency
is huge (milliseconds / block)

Overall, I see that it is possible to change FUSE in a way that meets our
needs, but I'm not sure if that kind of change keeps FUSE interface
friendly for all existing and new uses. The set of requirements is so big
and the mobile platform constraints are so harsh that _as efficient as
possible_ and _generic_ do, unfortunately, contradict each other.

Please tell me if you see it differently, and if you have some better ideas
on how to change FUSE in a simpler way
--
Thanks, Yurii

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Initial patches for Incremental FS
  2019-05-29 21:06                         ` Yurii Zubrytskyi
@ 2019-05-30  9:22                           ` Miklos Szeredi
  2019-05-30 22:45                             ` Yurii Zubrytskyi
  0 siblings, 1 reply; 33+ messages in thread
From: Miklos Szeredi @ 2019-05-30  9:22 UTC (permalink / raw)
  To: Yurii Zubrytskyi; +Cc: Eugene Zemtsov, Amir Goldstein, linux-fsdevel

On Wed, May 29, 2019 at 11:06 PM Yurii Zubrytskyi <zyy@google.com> wrote:

> Yes, and this was _exactly_ our first plan, and it mitigates the read
> performance
> issue. The reasons why we didn't move forward with it are that we figured out
> all other requirements, and fixing each of those needs another change in
> FUSE, up to the level when FUSE interface becomes 50% dedicated to
> our specific goal:
> 1. MAP message would have to support data compression (with different
> algorithms), hash verification (same thing) with hash streaming (because
> even the Merkle tree for a 5GB file is huge, and can't be preloaded
> at once)

With the proposed FUSE solution the following sequences would occur:

kernel: if index for given block is missing, send MAP message
  userspace: if data/hash is missing for given block then download data/hash
  userspace: send MAP reply
kernel: decompress data and verify hash based on index

The kernel would not be involved in either streaming data or hash, it
would only work with data/hash that has already been downloaded.
Right?

Or is your implementation doing streamed decompress/hash or partial blocks?

>   1.1. Mapping memory usage can get out of hands pretty quickly: it has to
> be at least (offset + size + compression type + hash location + hash size +
> hash kind) per each block. I'm not even thinking about multiple storage files
> here. For that 5GB file (that's a debug APK for some Android game we're
> targeting) we have 1.3M blocks, so ~16 bytes *1.3M = 20M of index only,
> without actual overhead for the lookup table.
> If the kernel code owns and manages its own on-disk data store and the
> format, this index can be loaded and discarded on demand there.

Why does the kernel have to know the on-disk format to be able to load
and discard parts of the index on-demand?  It only needs to know which
blocks were accessed recently and which not so recently.

> > There's also work currently ongoing in optimizing the overhead of
> > userspace roundtrip.  The most promising thing appears to be matching
> > up the CPU for the userspace server with that of the task doing the
> > request.  This can apparently result in  60-500% speed improvement.
>
> That sounds almost too good to be true, and will be really cool.
> Do you have any patches or git remote available in any compilable state to
> try the optimization out? Android has quite complicated hardware config
> and I want to see how this works, especially with our model where
> several processes may send requests into the same filesystem FD.

Currently it's only a bunch of hacks, no proper interfaces yet.

I'll let you know once there's something useful for testing with a
real filesystem.

BTW, which interface does your fuse filesystem use?  Libfuse?  Raw device?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Initial patches for Incremental FS
  2019-05-30  9:22                           ` Miklos Szeredi
@ 2019-05-30 22:45                             ` Yurii Zubrytskyi
  2019-05-31  9:02                               ` Miklos Szeredi
  0 siblings, 1 reply; 33+ messages in thread
From: Yurii Zubrytskyi @ 2019-05-30 22:45 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Eugene Zemtsov, Amir Goldstein, linux-fsdevel

> With the proposed FUSE solution the following sequences would occur:
>
> kernel: if index for given block is missing, send MAP message
>   userspace: if data/hash is missing for given block then download data/hash
>   userspace: send MAP reply
> kernel: decompress data and verify hash based on index
>
> The kernel would not be involved in either streaming data or hash, it
> would only work with data/hash that has already been downloaded.
> Right?
>
> Or is your implementation doing streamed decompress/hash or partial blocks?
> ...
> Why does the kernel have to know the on-disk format to be able to load
> and discard parts of the index on-demand?  It only needs to know which
> blocks were accessed recently and which not so recently.
>
(1) You're correct, only the userspace deals with all streaming.
Kernel then sees full blocks of data (usually LZ4-compressed) and
blocks of hashes
We'd need to give the location of the hash tree instead of the
individual hash here though - verification has to go all the way to
the top and even check the signature there. And the same 5 GB file
would have over 40 MB of hashes (32 bytes of SHA2 for each 4K block),
so those have to be read from disk as well.
Overall, let's just imagine a phone with 100 apps, 100MB each,
installed this way. That ends up being ~10GB of data, so we'd need _at
least_ 40 MB for the index and 80 MB for hashes *in kernel*. Android
now fights for each megabyte of RAM used in the system services, so
FUSE won't be able to cache that, going back to the user mode for
almost all reads again.
(1 and 2) ... If FUSE were to know the on-disk format it would be able
to simply parse and read it when needed, with as little memory
footprint as it can. Requesting this data from the usermode every time
with little caching defeats the whole purpose of the change.

> BTW, which interface does your fuse filesystem use?  Libfuse?  Raw device?
Yes, our code interacts with the raw FUSE fd via poll/read/write
calls. We have tried the multithreaded approach via duping the control
fd and FUSE_DEV_IOC_CLONE, but it didn't give much improvement -
Android apps aren't usually use multithreaded, so there's at most two
pending reads at once. I've seen 10 once, but that was some kind of
miractle

And again, we have not even looked at the directory structure and stat
caching yet, neither interface nor memory usage. For a general case we
have to make direct disk reads from kernel and this forces even bigger
part of the disk format to be defined there. The end result is what
we've got when researching FUSE - a huge chunk of FUSE gets
overspecialized to handle our own way of using it end to end, with no
real configurability (because making it configurable makes that code
even bigger and more complex)

--
Thanks, Yurii

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Initial patches for Incremental FS
  2019-05-30 22:45                             ` Yurii Zubrytskyi
@ 2019-05-31  9:02                               ` Miklos Szeredi
  0 siblings, 0 replies; 33+ messages in thread
From: Miklos Szeredi @ 2019-05-31  9:02 UTC (permalink / raw)
  To: Yurii Zubrytskyi; +Cc: Eugene Zemtsov, Amir Goldstein, linux-fsdevel

On Fri, May 31, 2019 at 12:45 AM Yurii Zubrytskyi <zyy@google.com> wrote:

> We'd need to give the location of the hash tree instead of the
> individual hash here though - verification has to go all the way to
> the top and even check the signature there. And the same 5 GB file
> would have over 40 MB of hashes (32 bytes of SHA2 for each 4K block),
> so those have to be read from disk as well.

As I think Eugene mentioned, dealing with the hash tree should be done
by the verity subsystem anyway.

> Overall, let's just imagine a phone with 100 apps, 100MB each,
> installed this way. That ends up being ~10GB of data, so we'd need _at
> least_ 40 MB for the index and 80 MB for hashes *in kernel*.

Seriously?  Are those 100 apps accessing that 10G simultaneously?

I really don't know the usage pattern of those apps, but I can imagine
that some games do quite a lot of paging data in and out.   And my
guess is that most of those page-ins will still be sequential, and so
getting a pageful of index from userspace will allow the kernel to
serve quite a few reads without having to go back to userspace.

My guess is that even really tiny amount of caching (e.g. one page of
index per open file) will get 90% or more of the possible performance
improvement.  But those are all just guesses.  If you say this is not
the right direction for your project, fine.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, back to index

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-02  4:03 Initial patches for Incremental FS ezemtsov
2019-05-02  4:03 ` [PATCH 1/6] incfs: Add first files of incrementalfs ezemtsov
2019-05-02 19:06   ` Miklos Szeredi
2019-05-02 20:41   ` Randy Dunlap
2019-05-07 15:57   ` Jann Horn
2019-05-07 17:13   ` Greg KH
2019-05-07 17:18   ` Greg KH
2019-05-02  4:03 ` [PATCH 2/6] incfs: Backing file format ezemtsov
2019-05-02  4:03 ` [PATCH 3/6] incfs: Management of in-memory FS data structures ezemtsov
2019-05-02  4:03 ` [PATCH 4/6] incfs: Integration with VFS layer ezemtsov
2019-05-02  4:03 ` [PATCH 6/6] incfs: Integration tests for incremental-fs ezemtsov
2019-05-02 11:19 ` Initial patches for Incremental FS Amir Goldstein
2019-05-02 13:10   ` Theodore Ts'o
2019-05-02 13:26     ` Al Viro
2019-05-03  4:23       ` Eugene Zemtsov
2019-05-03  5:19         ` Amir Goldstein
2019-05-08 20:09           ` Eugene Zemtsov
2019-05-09  8:15             ` Amir Goldstein
     [not found]               ` <CAK8JDrEQnXTcCtAPkb+S4r4hORiKh_yX=0A0A=LYSVKUo_n4OA@mail.gmail.com>
2019-05-21  1:32                 ` Yurii Zubrytskyi
2019-05-22  8:32                   ` Miklos Szeredi
2019-05-22 17:25                     ` Yurii Zubrytskyi
2019-05-23  4:25                       ` Miklos Szeredi
2019-05-29 21:06                         ` Yurii Zubrytskyi
2019-05-30  9:22                           ` Miklos Szeredi
2019-05-30 22:45                             ` Yurii Zubrytskyi
2019-05-31  9:02                               ` Miklos Szeredi
2019-05-22 10:54                   ` Amir Goldstein
2019-05-03  7:23         ` Richard Weinberger
2019-05-03 10:22         ` Miklos Szeredi
2019-05-02 13:46     ` Amir Goldstein
2019-05-02 18:16   ` Richard Weinberger
2019-05-02 18:33     ` Richard Weinberger
2019-05-02 13:47 ` J. R. Okajima

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git