Re: [PATCH 1/6] incfs: Add first files of incrementalfs

From: Randy Dunlap <rdunlap@infradead.org>
To: ezemtsov@google.com, linux-fsdevel@vger.kernel.org
Cc: tytso@mit.edu
Subject: Re: [PATCH 1/6] incfs: Add first files of incrementalfs
Date: Thu, 2 May 2019 13:41:42 -0700	[thread overview]
Message-ID: <e36f89ef-2060-9cb6-2b3f-0b246c9208d2@infradead.org> (raw)
In-Reply-To: <20190502040331.81196-2-ezemtsov@google.com>

On 5/1/19 9:03 PM, ezemtsov@google.com wrote:
> From: Eugene Zemtsov <ezemtsov@google.com>
> 
> - fs/incfs dir
> - Kconfig (CONFIG_INCREMENTAL_FS)
> - Makefile
> - Module and file system initialization and clean up code
> - New MAINTAINERS entry
> - Add incrementalfs.h UAPI header
> - Register ioctl range in ioctl-numbers.txt
> - Documentation
> 
> Signed-off-by: Eugene Zemtsov <ezemtsov@google.com>

Hi,
This is just Documentation comments...

> ---
>  Documentation/filesystems/incrementalfs.rst | 452 ++++++++++++++++++++
>  Documentation/ioctl/ioctl-number.txt        |   1 +
>  MAINTAINERS                                 |   7 +
>  fs/Kconfig                                  |   1 +
>  fs/Makefile                                 |   1 +
>  fs/incfs/Kconfig                            |  10 +
>  fs/incfs/Makefile                           |   4 +
>  fs/incfs/main.c                             |  85 ++++
>  fs/incfs/vfs.c                              |  37 ++
>  include/uapi/linux/incrementalfs.h          | 189 ++++++++
>  10 files changed, 787 insertions(+)
>  create mode 100644 Documentation/filesystems/incrementalfs.rst
>  create mode 100644 fs/incfs/Kconfig
>  create mode 100644 fs/incfs/Makefile
>  create mode 100644 fs/incfs/main.c
>  create mode 100644 fs/incfs/vfs.c
>  create mode 100644 include/uapi/linux/incrementalfs.h
> 
> diff --git a/Documentation/filesystems/incrementalfs.rst b/Documentation/filesystems/incrementalfs.rst
> new file mode 100644
> index 000000000000..682e3dcb6b5a
> --- /dev/null
> +++ b/Documentation/filesystems/incrementalfs.rst
> @@ -0,0 +1,452 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=======================
> +Incremental File System
> +=======================
> +
> +Overview
> +========
> +Incremental FS is special-purpose Linux virtual file system that allows
> +execution of a program while its binary and resource files are still being
> +lazily downloaded over the network, USB etc. It is focused on incremental
> +delivery for a small number (under 100) of big files (more than 10 megabytes).
> +Incremental FS doesn’t allow direct writes into files and, once loaded, file
> +content never changes. Incremental FS doesn’t use a block device, instead it
> +saves data into a backing file located on a regular file-system.
> +
> +But why?
> +--------
> +To allow running **big** Android apps before their binaries and resources are
> +fully downloaded to an Android device. If an app reads something not loaded yet,
> +it needs to wait for the data block to be fetched, but in most cases hot blocks
> +can be loaded in advance.
> +
> +Workflow
> +--------
> +A userspace process, called a data loader, mounts an instance of incremental-fs
> +giving it a file descriptor on an underlying file system (like ext4 or f2fs).
> +Incremental-fs reads content (if any) of this backing file and interprets it as
> +a file system image with files, directories and data blocks. At this point
> +the data loader can declare new files to be shown by incremental-fs.
> +
> +A process is started from a binary located on incremental-fs.
> +All reads are served directly from the backing file
> +without roundtrips into userspace. If the process accesses a data block that was
> +not originally present in the backing file, the read operation waits.
> +
> +Meanwhile the data loader can feed new data blocks to incremental-fs by calling
> +write() on a special .cmd pseudo-file. The data loader can request information
> +about pending reads by calling poll() and read() on the .cmd pseudo-file.
> +This mechanism allows the data loader to serve most urgently needed data first.
> +Once a data block is given to incremental-fs, it saves it to the backing file
> +and unblocks all the reads waiting for this block.
> +
> +Eventually all data for all files is uploaded by the data loader, and saved by
> +incremental-fs into the backing file. At that moment the data loader is not
> +needed any longer. The backing file will play the role of a complete
> +filesystem image for all future runs of the program.
> +
> +Non-goals
> +---------
> +* Allowing direct writes by the executing processes into files on incremental-fs
> +* Allowing the data loader change file size or content after it was loaded.
> +* Having more than a couple hundred files and directories.
> +
> +
> +Features
> +========
> +
> +Read-only, but not unchanging
> +-----------------------------
> +On the surface a mount directory of incremental-fs would look similar to
> +a read-only instance of network file system: files and directories can be
> +listed and read, but can’t be directly created or modified via creat() or
> +write(). At the same time the data loader can make changes to a directory
> +structure via external ioctl-s. i.e. link and unlink files and directories
> +(if they empty). Data can't be changed this way, once a file block is loaded

                                               way; once

> +there is no way to change it.
> +
> +Filesystem image in a backing file
> +----------------------------------
> +Instead of using a block device, all data and metadata is stored in a

                                                          are stored

> +backing file provided as a mount parameter. The backing file is located on
> +an underlying file system (like ext4 or f2fs). Such approach is very similar
> +to what might be achieved by using loopback device with a traditional file
> +system, but it avoids extra set-up steps and indirections. It also allows
> +incremental-fs image to dynamically grow as new files and data come without
> +having to do any extra steps for resizing.
> +
> +If the backing file contains data at the moment when incremental-fs is mounted,
> +content of the backing file is being interpreted as filesystem image.

                          file is interpreted as

> +New files and data can still be added through the external interface,
> +and they will be saved to the backing file.
> +
> +Data compression
> +----------------
> +Incremental-fs can store compressed data. In this case each 4KB data block is
> +compressed separately. Data blocks can be provided to incremental-fs by
> +the data loader in a compressed form. Incremental-fs uncompresses blocks
> +each time a executing process reads it (modulo page cache). Compression also

             an executing process

> +takes care of blocks composed of all zero bytes removing necessity to handle
> +this case separately.
> +
> +Partially present files
> +-----------------------
> +Data in the files consists of 4KB blocks, each block can be present or absent.

                                     blocks; each block

> +Unlike in sparse files, reading an absent block doesn’t return all zeros.
> +It waits for the data block to be loaded via the ioctl interface
> +(respecting a timeout). Once a data block is loaded it never disappears
> +and can’t be changed or erased from a file. This ability to frictionlessly
> +wait for temporary missing data is the main feature of incremental-fs.
> +
> +Hard links. Multiple names for the same file
> +--------------------------------------------
> +Like all traditional UNIX file systems, incremental-fs supports hard links,
> +i.e. different file names in different directories can refer to the same file.
> +As mentioned above new hard links can be created and removed via
> +the ioctl interface, but actual data files are immutable, modulo partial
> +data loading. Each directory can only have at most one name referencing it.
> +
> +Inspection of incremental-fs internal state
> +-------------------------------------------
> +poll() and read() on the .cmd pseudo-file allow data loaders to get a list of
> +read operations stalled due to lack of a data block (pending reads).
> +
> +
> +Application Programming Interface
> +=================================
> +
> +Regular file system interface
> +-----------------------------
> +Executing process access files and directories via regular Linux file interface:

                     accesses

> +open, read, close etc. All the intricacies of data loading a file representation
> +are hidden from them.
> +
> +External .cmd file interface
> +----------------------------
> +When incremental-fs is mounted, a mount directory contains a pseudo-file
> +called '.cmd'. The data loader will open this file and call read(), write(),
> +poll() and ioctl() on it inspect and change state of incremental-fs.
> +
> +poll() and read() are used by the data loader to wait for pending reads to
> +appear and obtain an array of ``struct incfs_pending_read_info``.
> +
> +write() is used by the data loader to feed new data blocks to incremental-fs.
> +A data buffer given to write() is interpreted as an array of
> +``struct incfs_new_data_block``. Structs in the array describe locations and
> +properties of data blocks loaded with this write() call.
> +
> +``ioctl(INCFS_IOC_PROCESS_INSTRUCTION)`` is used to change structure of
> +incremental-fs. It receives an pointer to ``struct incfs_instruction``

                               a pointer

> +where type field can have be one of the following values.
> +
> +**INCFS_INSTRUCTION_NEW_FILE**
> +Creates an inode (a file or a directory) without a name.
> +It assumes ``incfs_new_file_instruction.file`` is populated with details.
> +
> +**INCFS_INSTRUCTION_ADD_DIR_ENTRY**
> +Creates a name (aka hardlink) for an inode in a directory.
> +A directory can't have more than one hardlink pointing to it, but files can be
> +linked from different directories.
> +It assumes ``incfs_new_file_instruction.dir_entry`` is populated with details.
> +
> +**INCFS_INSTRUCTION_REMOVE_DIR_ENTRY**
> +Remove a name (aka hardlink) for a file from a directory.
> +Only empty directories can be unlinked.
> +It assumes ``incfs_new_file_instruction.dir_entry`` is populated with details.
> +
> +For more details see in uapi/linux/incrementalfs.h and samples below.
> +
> +Supported mount options
> +-----------------------
> +See ``fs/incfs/options.c`` for more details.
> +
> +    * ``backing_fd=<unsigned int>``
> +        Required. A file descriptor of a backing file opened by the process
> +        calling mount(2). This descriptor can be closed after mount returns.
> +
> +    * ``read_timeout_msc=<unsigned int>``
> +        Default: 1000. Timeout in milliseconds before a read operation fails
> +        if no data found in the backing file or provided by the data loader.
> +
> +Sysfs files
> +-----------
> +``/sys/fs/incremental-fs/version`` - a current version of the filesystem.
> +One ASCII encoded positive integer number with a new line at the end.
> +
> +
> +Examples
> +--------
> +See ``sample_data_loader.c`` for a complete implementation of a data loader.
> +
> +Mount incremental-fs
> +~~~~~~~~~~~~~~~~~~~~
> +
> +::
> +
> +    int mount_fs(char *mount_dir, char *backing_file, int timeout_msc)
> +    {
> +        static const char fs_name[] = INCFS_NAME;
> +        char mount_options[512];
> +        int backing_fd;
> +        int result;
> +
> +        backing_fd = open(backing_file, O_RDWR);
> +        if (backing_fd == -1) {
> +            perror("Error in opening backing file");
> +            return 1;
> +        }
> +
> +        snprintf(mount_options, ARRAY_SIZE(mount_options),
> +            "backing_fd=%u,read_timeout_msc=%u", backing_fd, timeout_msc);
> +
> +        result = mount(fs_name, mount_dir, fs_name, 0, mount_options);
> +        if (result != 0)
> +            perror("Error mounting fs.");
> +        return result;
> +    }
> +
> +Open .cmd file
> +~~~~~~~~~~~~~~
> +
> +::
> +
> +    int open_commands_file(char *mount_dir)
> +    {
> +        char cmd_file[255];
> +        int cmd_fd;
> +
> +        snprintf(cmd_file, ARRAY_SIZE(cmd_file), "%s/.cmd", mount_dir);
> +        cmd_fd = open(cmd_file, O_RDWR);
> +        if (cmd_fd < 0)
> +            perror("Can't open commands file");
> +        return cmd_fd;
> +    }
> +
> +Add a file to the file system
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +::
> +
> +    int create_file(int cmd_fd, char *filename, int *ino_out, size_t size)
> +    {
> +        int ret = 0;
> +        __u16 ino = 0;
> +        struct incfs_instruction inst = {
> +                .version = INCFS_HEADER_VER,
> +                .type = INCFS_INSTRUCTION_NEW_FILE,
> +                .file = {
> +                    .size = size,
> +                    .mode = S_IFREG | 0555,
> +                }
> +        };
> +
> +        ret = ioctl(cmd_fd, INCFS_IOC_PROCESS_INSTRUCTION, &inst);
> +        if (ret)
> +            return -errno;
> +
> +        ino = inst.file.ino_out;
> +        inst = (struct incfs_instruction){
> +                .version = INCFS_HEADER_VER,
> +                .type = INCFS_INSTRUCTION_ADD_DIR_ENTRY,
> +                .dir_entry = {
> +                    .dir_ino = INCFS_ROOT_INODE,
> +                    .child_ino = ino,
> +                    .name = ptr_to_u64(filename),
> +                    .name_len = strlen(filename)
> +                }
> +            };
> +        ret = ioctl(cmd_fd, INCFS_IOC_PROCESS_INSTRUCTION, &inst);
> +        if (ret)
> +            return -errno;
> +        *ino_out = ino;
> +        return 0;
> +    }
> +
> +Load data into a file
> +~~~~~~~~~~~~~~~~~~~~~
> +
> +::
> +
> +    int cmd_fd = open_commands_file(path_to_mount_dir);
> +    char *data = get_some_data();
> +    struct incfs_new_data_block block;
> +    int err;
> +
> +    block.file_ino = file_ino;
> +    block.block_index = 0;
> +    block.compression = COMPRESSION_NONE;
> +    block.data = (__u64)data;
> +    block.data_len = INCFS_DATA_FILE_BLOCK_SIZE;
> +
> +    err = write(cmd_fd, &block, sizeof(block));
> +
> +
> +Get an array of pending reads
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +::
> +
> +    int poll_res = 0;
> +    struct incfs_pending_read_info reads[10];
> +    int cmd_fd = open_commands_file(path_to_mount_dir);
> +    struct pollfd pollfd = {
> +        .fd = cmd_fd,
> +        .events = POLLIN
> +    };
> +
> +    poll_res = poll(&pollfd, 1, timeout);
> +    if (poll_res > 0 && (pollfd.revents | POLLIN)) {
> +        ssize_t read_res = read(cmd_fd, reads, sizeof(reads));
> +        if (read_res > 0)
> +            printf("Waiting reads %ld\n", read_res / sizeof(reads[0]));
> +    }
> +
> +
> +
> +Ondisk format
> +=============
> +
> +General principles
> +------------------
> +* The backbone of the incremental-fs ondisk format is an append only linked
> +  list of metadata blocks. Each metadata block contains an offset of the next
> +  one. These blocks describe files and directories on the
> +  file system. They also represent actions of adding and removing file names
> +  (hard links).
> +  Every time incremental-fs instance is mounted, it reads through this list
> +  to recreate filesystem's state in memory. An offset of the first record in the
> +  metadata list is stored in the superblock at the beginning of the backing
> +  file.
> +
> +* Most of the backing file is taken by data areas and blockmaps.
> +  Since data blocks can be compressed and have different sizes,
> +  single per-file data area can't be pre-allocated. That's why blockmaps are
> +  needed in order to find a location and size of each data block in
> +  the backing file. Each time a file is created, a corresponding block map is
> +  allocated to store future offsets of data blocks.
> +
> +  Whenever a data block is given by data loader to incremental-fs:
> +    - A data area with the given block is appended to the end of
> +      the backing file.
> +    - A record in the blockmap for the given block index is updated to reflect
> +      its location, size, and compression algorithm.
> +
> +Important format details
> +------------------------
> +Ondisk structures are defined in the ``format.h`` file. They are all packed
> +and use little-endian order.
> +A backing file must start with ``incfs_super_block`` with ``s_magic`` field
> +equal to 0x5346434e49 "INCFS".
> +
> +Metadata records:
> +
> +* ``incfs_inode`` - metadata record to declare a file or a directory.
> +                    ``incfs_inode.i_mode`` determents if it is a file

                                              determines

> +                    or a directory.
> +* ``incfs_blockmap_entry`` - metadata record that specifies size and location
> +                            of a blockmap area for a given file. This area
> +                            contains an array of ``incfs_blockmap_entry``-s.
> +* ``incfs_dir_action`` - metadata record that specifies changes made to a
> +                    to a directory structure, e.g. add or remove a hardlink.
> +* ``incfs_md_header`` - header of a metadata record. It's always a part
> +                    of other structures and served purpose of metadata

?                                              serves

> +                    bookkeeping.
> +
> +Other ondisk structures:
> +
> +* ``incfs_super_block`` - backing file header
> +* ``incfs_blockmap_entry`` - a record in a blockmap area that describes size
> +                        and location of a data block.
> +* Data blocks dont have any particular structure, they are written to the backing

                 don't

> +  file in a raw form as they come from a data loader.
> +
> +
> +Backing file layout
> +-------------------
> +::
> +
> +              +-------------------------------------------+
> +              |            incfs_super_block              |]---+
> +              +-------------------------------------------+    |
> +              |                 metadata                  |<---+
> +              |                incfs_inode                |]---+
> +              +-------------------------------------------+    |
> +                        .........................              |
> +              +-------------------------------------------+    |   metadata
> +     +------->|               blockmap area               |    |  list links
> +     |        |          [incfs_blockmap_entry]           |    |
> +     |        |          [incfs_blockmap_entry]           |    |
> +     |        |          [incfs_blockmap_entry]           |    |
> +     |    +--[|          [incfs_blockmap_entry]           |    |
> +     |    |   |          [incfs_blockmap_entry]           |    |
> +     |    |   |          [incfs_blockmap_entry]           |    |
> +     |    |   +-------------------------------------------+    |
> +     |    |             .........................              |
> +     |    |   +-------------------------------------------+    |
> +     |    |   |                 metadata                  |<---+
> +     +----|--[|               incfs_blockmap              |]---+
> +          |   +-------------------------------------------+    |
> +          |             .........................              |
> +          |   +-------------------------------------------+    |
> +          +-->|                 data block                |    |
> +              +-------------------------------------------+    |
> +                        .........................              |
> +              +-------------------------------------------+    |
> +              |                 metadata                  |<---+
> +              |             incfs_dir_action              |
> +              +-------------------------------------------+
> +
> +Unreferenced files and absence of garbage collection
> +----------------------------------------------------
> +Described file format can produce files that don't have any names for them in
> +any directories. Incremental-fs takes no steps to prevent such situations or
> +reclaim space occupied by such files in the backing file. If garbage collection
> +is needed it has to be implemented as a separate userspace tool.
> +
> +
> +Design alternatives
> +===================
> +
> +Why isn't incremental-fs implemented via FUSE?
> +----------------------------------------------
> +TLDR: FUSE-based filesystems add 20-80% of performance overhead for target
> +scenarios, and increase power use on mobile beyond acceptable limit
> +for widespread deployment. A custom kernel filesystem is the way to overcome
> +these limitations.
> +
> +From the theoretical side of things, FUSE filesystem adds some overhead to
> +each filesystem operation that’s not handled by OS page cache:
> +
> +    * When an IO request arrives to FUSE driver (D), it puts it into a queue
> +      that runs on a separate kernel thread
> +    * Then another separate user-mode handler process (H) has to run,
> +      potentially after a context switch, to read the request from the queue.
> +      Reading the request adds a kernel-user mode transition to the handling.
> +    * (H) sends the IO request to kernel to handle it on some underlying storage
> +      filesystem. This adds a user-kernel and kernel-user mode transition
> +      pair to the handling.
> +    * (H) then responds to the FUSE request via a write(2) call.
> +      Writing the response is another user-kernel mode transition.
> +    * (D) needs to read the response from (H) when its kernel thread runs
> +      and forward it to the user
> +
> +Together, the scenario adds 2 extra user-kernel-user mode transition pairs,
> +and potentially has up to 3 additional context switches for the FUSE kernel
> +thread and the user-mode handler to start running for each IO request on the
> +filesystem.
> +This overhead can vary from unnoticeable to unmanageable, depending on the
> +target scenario. But it will always burn extra power via CPU staying longer
> +in non-idle state, handling context switches and mode transitions.
> +One important goal for the new filesystem is to be able to handle each page
> +read separately on demand, because we don't want to wait and download more data
> +than absolutely necessary. Thus readahead would need to be disabled completely.
> +This increases the number of separate IO requests and the FUSE related overhead
> +by almost 32x (128KB readahead limit vs 4KB individual block operations)
> +
> +For more info see a 2017 USENIX research paper:
> +To FUSE or Not to FUSE: Performance of User-Space File Systems
> +Bharath Kumar Reddy Vangoor, Stony Brook University;
> +Vasily Tarasov, IBM Research-Almaden;
> +Erez Zadok, Stony Brook University
> +https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf

-- 
~Randy