From: Randy Dunlap <rdunlap@infradead.org>
To: ezemtsov@google.com, linux-fsdevel@vger.kernel.org
Cc: tytso@mit.edu
Subject: Re: [PATCH 1/6] incfs: Add first files of incrementalfs
Date: Thu, 2 May 2019 13:41:42 -0700 [thread overview]
Message-ID: <e36f89ef-2060-9cb6-2b3f-0b246c9208d2@infradead.org> (raw)
In-Reply-To: <20190502040331.81196-2-ezemtsov@google.com>
On 5/1/19 9:03 PM, ezemtsov@google.com wrote:
> From: Eugene Zemtsov <ezemtsov@google.com>
>
> - fs/incfs dir
> - Kconfig (CONFIG_INCREMENTAL_FS)
> - Makefile
> - Module and file system initialization and clean up code
> - New MAINTAINERS entry
> - Add incrementalfs.h UAPI header
> - Register ioctl range in ioctl-numbers.txt
> - Documentation
>
> Signed-off-by: Eugene Zemtsov <ezemtsov@google.com>
Hi,
This is just Documentation comments...
> ---
> Documentation/filesystems/incrementalfs.rst | 452 ++++++++++++++++++++
> Documentation/ioctl/ioctl-number.txt | 1 +
> MAINTAINERS | 7 +
> fs/Kconfig | 1 +
> fs/Makefile | 1 +
> fs/incfs/Kconfig | 10 +
> fs/incfs/Makefile | 4 +
> fs/incfs/main.c | 85 ++++
> fs/incfs/vfs.c | 37 ++
> include/uapi/linux/incrementalfs.h | 189 ++++++++
> 10 files changed, 787 insertions(+)
> create mode 100644 Documentation/filesystems/incrementalfs.rst
> create mode 100644 fs/incfs/Kconfig
> create mode 100644 fs/incfs/Makefile
> create mode 100644 fs/incfs/main.c
> create mode 100644 fs/incfs/vfs.c
> create mode 100644 include/uapi/linux/incrementalfs.h
>
> diff --git a/Documentation/filesystems/incrementalfs.rst b/Documentation/filesystems/incrementalfs.rst
> new file mode 100644
> index 000000000000..682e3dcb6b5a
> --- /dev/null
> +++ b/Documentation/filesystems/incrementalfs.rst
> @@ -0,0 +1,452 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=======================
> +Incremental File System
> +=======================
> +
> +Overview
> +========
> +Incremental FS is special-purpose Linux virtual file system that allows
> +execution of a program while its binary and resource files are still being
> +lazily downloaded over the network, USB etc. It is focused on incremental
> +delivery for a small number (under 100) of big files (more than 10 megabytes).
> +Incremental FS doesn’t allow direct writes into files and, once loaded, file
> +content never changes. Incremental FS doesn’t use a block device, instead it
> +saves data into a backing file located on a regular file-system.
> +
> +But why?
> +--------
> +To allow running **big** Android apps before their binaries and resources are
> +fully downloaded to an Android device. If an app reads something not loaded yet,
> +it needs to wait for the data block to be fetched, but in most cases hot blocks
> +can be loaded in advance.
> +
> +Workflow
> +--------
> +A userspace process, called a data loader, mounts an instance of incremental-fs
> +giving it a file descriptor on an underlying file system (like ext4 or f2fs).
> +Incremental-fs reads content (if any) of this backing file and interprets it as
> +a file system image with files, directories and data blocks. At this point
> +the data loader can declare new files to be shown by incremental-fs.
> +
> +A process is started from a binary located on incremental-fs.
> +All reads are served directly from the backing file
> +without roundtrips into userspace. If the process accesses a data block that was
> +not originally present in the backing file, the read operation waits.
> +
> +Meanwhile the data loader can feed new data blocks to incremental-fs by calling
> +write() on a special .cmd pseudo-file. The data loader can request information
> +about pending reads by calling poll() and read() on the .cmd pseudo-file.
> +This mechanism allows the data loader to serve most urgently needed data first.
> +Once a data block is given to incremental-fs, it saves it to the backing file
> +and unblocks all the reads waiting for this block.
> +
> +Eventually all data for all files is uploaded by the data loader, and saved by
> +incremental-fs into the backing file. At that moment the data loader is not
> +needed any longer. The backing file will play the role of a complete
> +filesystem image for all future runs of the program.
> +
> +Non-goals
> +---------
> +* Allowing direct writes by the executing processes into files on incremental-fs
> +* Allowing the data loader change file size or content after it was loaded.
> +* Having more than a couple hundred files and directories.
> +
> +
> +Features
> +========
> +
> +Read-only, but not unchanging
> +-----------------------------
> +On the surface a mount directory of incremental-fs would look similar to
> +a read-only instance of network file system: files and directories can be
> +listed and read, but can’t be directly created or modified via creat() or
> +write(). At the same time the data loader can make changes to a directory
> +structure via external ioctl-s. i.e. link and unlink files and directories
> +(if they empty). Data can't be changed this way, once a file block is loaded
way; once
> +there is no way to change it.
> +
> +Filesystem image in a backing file
> +----------------------------------
> +Instead of using a block device, all data and metadata is stored in a
are stored
> +backing file provided as a mount parameter. The backing file is located on
> +an underlying file system (like ext4 or f2fs). Such approach is very similar
> +to what might be achieved by using loopback device with a traditional file
> +system, but it avoids extra set-up steps and indirections. It also allows
> +incremental-fs image to dynamically grow as new files and data come without
> +having to do any extra steps for resizing.
> +
> +If the backing file contains data at the moment when incremental-fs is mounted,
> +content of the backing file is being interpreted as filesystem image.
file is interpreted as
> +New files and data can still be added through the external interface,
> +and they will be saved to the backing file.
> +
> +Data compression
> +----------------
> +Incremental-fs can store compressed data. In this case each 4KB data block is
> +compressed separately. Data blocks can be provided to incremental-fs by
> +the data loader in a compressed form. Incremental-fs uncompresses blocks
> +each time a executing process reads it (modulo page cache). Compression also
an executing process
> +takes care of blocks composed of all zero bytes removing necessity to handle
> +this case separately.
> +
> +Partially present files
> +-----------------------
> +Data in the files consists of 4KB blocks, each block can be present or absent.
blocks; each block
> +Unlike in sparse files, reading an absent block doesn’t return all zeros.
> +It waits for the data block to be loaded via the ioctl interface
> +(respecting a timeout). Once a data block is loaded it never disappears
> +and can’t be changed or erased from a file. This ability to frictionlessly
> +wait for temporary missing data is the main feature of incremental-fs.
> +
> +Hard links. Multiple names for the same file
> +--------------------------------------------
> +Like all traditional UNIX file systems, incremental-fs supports hard links,
> +i.e. different file names in different directories can refer to the same file.
> +As mentioned above new hard links can be created and removed via
> +the ioctl interface, but actual data files are immutable, modulo partial
> +data loading. Each directory can only have at most one name referencing it.
> +
> +Inspection of incremental-fs internal state
> +-------------------------------------------
> +poll() and read() on the .cmd pseudo-file allow data loaders to get a list of
> +read operations stalled due to lack of a data block (pending reads).
> +
> +
> +Application Programming Interface
> +=================================
> +
> +Regular file system interface
> +-----------------------------
> +Executing process access files and directories via regular Linux file interface:
accesses
> +open, read, close etc. All the intricacies of data loading a file representation
> +are hidden from them.
> +
> +External .cmd file interface
> +----------------------------
> +When incremental-fs is mounted, a mount directory contains a pseudo-file
> +called '.cmd'. The data loader will open this file and call read(), write(),
> +poll() and ioctl() on it inspect and change state of incremental-fs.
> +
> +poll() and read() are used by the data loader to wait for pending reads to
> +appear and obtain an array of ``struct incfs_pending_read_info``.
> +
> +write() is used by the data loader to feed new data blocks to incremental-fs.
> +A data buffer given to write() is interpreted as an array of
> +``struct incfs_new_data_block``. Structs in the array describe locations and
> +properties of data blocks loaded with this write() call.
> +
> +``ioctl(INCFS_IOC_PROCESS_INSTRUCTION)`` is used to change structure of
> +incremental-fs. It receives an pointer to ``struct incfs_instruction``
a pointer
> +where type field can have be one of the following values.
> +
> +**INCFS_INSTRUCTION_NEW_FILE**
> +Creates an inode (a file or a directory) without a name.
> +It assumes ``incfs_new_file_instruction.file`` is populated with details.
> +
> +**INCFS_INSTRUCTION_ADD_DIR_ENTRY**
> +Creates a name (aka hardlink) for an inode in a directory.
> +A directory can't have more than one hardlink pointing to it, but files can be
> +linked from different directories.
> +It assumes ``incfs_new_file_instruction.dir_entry`` is populated with details.
> +
> +**INCFS_INSTRUCTION_REMOVE_DIR_ENTRY**
> +Remove a name (aka hardlink) for a file from a directory.
> +Only empty directories can be unlinked.
> +It assumes ``incfs_new_file_instruction.dir_entry`` is populated with details.
> +
> +For more details see in uapi/linux/incrementalfs.h and samples below.
> +
> +Supported mount options
> +-----------------------
> +See ``fs/incfs/options.c`` for more details.
> +
> + * ``backing_fd=<unsigned int>``
> + Required. A file descriptor of a backing file opened by the process
> + calling mount(2). This descriptor can be closed after mount returns.
> +
> + * ``read_timeout_msc=<unsigned int>``
> + Default: 1000. Timeout in milliseconds before a read operation fails
> + if no data found in the backing file or provided by the data loader.
> +
> +Sysfs files
> +-----------
> +``/sys/fs/incremental-fs/version`` - a current version of the filesystem.
> +One ASCII encoded positive integer number with a new line at the end.
> +
> +
> +Examples
> +--------
> +See ``sample_data_loader.c`` for a complete implementation of a data loader.
> +
> +Mount incremental-fs
> +~~~~~~~~~~~~~~~~~~~~
> +
> +::
> +
> + int mount_fs(char *mount_dir, char *backing_file, int timeout_msc)
> + {
> + static const char fs_name[] = INCFS_NAME;
> + char mount_options[512];
> + int backing_fd;
> + int result;
> +
> + backing_fd = open(backing_file, O_RDWR);
> + if (backing_fd == -1) {
> + perror("Error in opening backing file");
> + return 1;
> + }
> +
> + snprintf(mount_options, ARRAY_SIZE(mount_options),
> + "backing_fd=%u,read_timeout_msc=%u", backing_fd, timeout_msc);
> +
> + result = mount(fs_name, mount_dir, fs_name, 0, mount_options);
> + if (result != 0)
> + perror("Error mounting fs.");
> + return result;
> + }
> +
> +Open .cmd file
> +~~~~~~~~~~~~~~
> +
> +::
> +
> + int open_commands_file(char *mount_dir)
> + {
> + char cmd_file[255];
> + int cmd_fd;
> +
> + snprintf(cmd_file, ARRAY_SIZE(cmd_file), "%s/.cmd", mount_dir);
> + cmd_fd = open(cmd_file, O_RDWR);
> + if (cmd_fd < 0)
> + perror("Can't open commands file");
> + return cmd_fd;
> + }
> +
> +Add a file to the file system
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +::
> +
> + int create_file(int cmd_fd, char *filename, int *ino_out, size_t size)
> + {
> + int ret = 0;
> + __u16 ino = 0;
> + struct incfs_instruction inst = {
> + .version = INCFS_HEADER_VER,
> + .type = INCFS_INSTRUCTION_NEW_FILE,
> + .file = {
> + .size = size,
> + .mode = S_IFREG | 0555,
> + }
> + };
> +
> + ret = ioctl(cmd_fd, INCFS_IOC_PROCESS_INSTRUCTION, &inst);
> + if (ret)
> + return -errno;
> +
> + ino = inst.file.ino_out;
> + inst = (struct incfs_instruction){
> + .version = INCFS_HEADER_VER,
> + .type = INCFS_INSTRUCTION_ADD_DIR_ENTRY,
> + .dir_entry = {
> + .dir_ino = INCFS_ROOT_INODE,
> + .child_ino = ino,
> + .name = ptr_to_u64(filename),
> + .name_len = strlen(filename)
> + }
> + };
> + ret = ioctl(cmd_fd, INCFS_IOC_PROCESS_INSTRUCTION, &inst);
> + if (ret)
> + return -errno;
> + *ino_out = ino;
> + return 0;
> + }
> +
> +Load data into a file
> +~~~~~~~~~~~~~~~~~~~~~
> +
> +::
> +
> + int cmd_fd = open_commands_file(path_to_mount_dir);
> + char *data = get_some_data();
> + struct incfs_new_data_block block;
> + int err;
> +
> + block.file_ino = file_ino;
> + block.block_index = 0;
> + block.compression = COMPRESSION_NONE;
> + block.data = (__u64)data;
> + block.data_len = INCFS_DATA_FILE_BLOCK_SIZE;
> +
> + err = write(cmd_fd, &block, sizeof(block));
> +
> +
> +Get an array of pending reads
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +::
> +
> + int poll_res = 0;
> + struct incfs_pending_read_info reads[10];
> + int cmd_fd = open_commands_file(path_to_mount_dir);
> + struct pollfd pollfd = {
> + .fd = cmd_fd,
> + .events = POLLIN
> + };
> +
> + poll_res = poll(&pollfd, 1, timeout);
> + if (poll_res > 0 && (pollfd.revents | POLLIN)) {
> + ssize_t read_res = read(cmd_fd, reads, sizeof(reads));
> + if (read_res > 0)
> + printf("Waiting reads %ld\n", read_res / sizeof(reads[0]));
> + }
> +
> +
> +
> +Ondisk format
> +=============
> +
> +General principles
> +------------------
> +* The backbone of the incremental-fs ondisk format is an append only linked
> + list of metadata blocks. Each metadata block contains an offset of the next
> + one. These blocks describe files and directories on the
> + file system. They also represent actions of adding and removing file names
> + (hard links).
> + Every time incremental-fs instance is mounted, it reads through this list
> + to recreate filesystem's state in memory. An offset of the first record in the
> + metadata list is stored in the superblock at the beginning of the backing
> + file.
> +
> +* Most of the backing file is taken by data areas and blockmaps.
> + Since data blocks can be compressed and have different sizes,
> + single per-file data area can't be pre-allocated. That's why blockmaps are
> + needed in order to find a location and size of each data block in
> + the backing file. Each time a file is created, a corresponding block map is
> + allocated to store future offsets of data blocks.
> +
> + Whenever a data block is given by data loader to incremental-fs:
> + - A data area with the given block is appended to the end of
> + the backing file.
> + - A record in the blockmap for the given block index is updated to reflect
> + its location, size, and compression algorithm.
> +
> +Important format details
> +------------------------
> +Ondisk structures are defined in the ``format.h`` file. They are all packed
> +and use little-endian order.
> +A backing file must start with ``incfs_super_block`` with ``s_magic`` field
> +equal to 0x5346434e49 "INCFS".
> +
> +Metadata records:
> +
> +* ``incfs_inode`` - metadata record to declare a file or a directory.
> + ``incfs_inode.i_mode`` determents if it is a file
determines
> + or a directory.
> +* ``incfs_blockmap_entry`` - metadata record that specifies size and location
> + of a blockmap area for a given file. This area
> + contains an array of ``incfs_blockmap_entry``-s.
> +* ``incfs_dir_action`` - metadata record that specifies changes made to a
> + to a directory structure, e.g. add or remove a hardlink.
> +* ``incfs_md_header`` - header of a metadata record. It's always a part
> + of other structures and served purpose of metadata
? serves
> + bookkeeping.
> +
> +Other ondisk structures:
> +
> +* ``incfs_super_block`` - backing file header
> +* ``incfs_blockmap_entry`` - a record in a blockmap area that describes size
> + and location of a data block.
> +* Data blocks dont have any particular structure, they are written to the backing
don't
> + file in a raw form as they come from a data loader.
> +
> +
> +Backing file layout
> +-------------------
> +::
> +
> + +-------------------------------------------+
> + | incfs_super_block |]---+
> + +-------------------------------------------+ |
> + | metadata |<---+
> + | incfs_inode |]---+
> + +-------------------------------------------+ |
> + ......................... |
> + +-------------------------------------------+ | metadata
> + +------->| blockmap area | | list links
> + | | [incfs_blockmap_entry] | |
> + | | [incfs_blockmap_entry] | |
> + | | [incfs_blockmap_entry] | |
> + | +--[| [incfs_blockmap_entry] | |
> + | | | [incfs_blockmap_entry] | |
> + | | | [incfs_blockmap_entry] | |
> + | | +-------------------------------------------+ |
> + | | ......................... |
> + | | +-------------------------------------------+ |
> + | | | metadata |<---+
> + +----|--[| incfs_blockmap |]---+
> + | +-------------------------------------------+ |
> + | ......................... |
> + | +-------------------------------------------+ |
> + +-->| data block | |
> + +-------------------------------------------+ |
> + ......................... |
> + +-------------------------------------------+ |
> + | metadata |<---+
> + | incfs_dir_action |
> + +-------------------------------------------+
> +
> +Unreferenced files and absence of garbage collection
> +----------------------------------------------------
> +Described file format can produce files that don't have any names for them in
> +any directories. Incremental-fs takes no steps to prevent such situations or
> +reclaim space occupied by such files in the backing file. If garbage collection
> +is needed it has to be implemented as a separate userspace tool.
> +
> +
> +Design alternatives
> +===================
> +
> +Why isn't incremental-fs implemented via FUSE?
> +----------------------------------------------
> +TLDR: FUSE-based filesystems add 20-80% of performance overhead for target
> +scenarios, and increase power use on mobile beyond acceptable limit
> +for widespread deployment. A custom kernel filesystem is the way to overcome
> +these limitations.
> +
> +From the theoretical side of things, FUSE filesystem adds some overhead to
> +each filesystem operation that’s not handled by OS page cache:
> +
> + * When an IO request arrives to FUSE driver (D), it puts it into a queue
> + that runs on a separate kernel thread
> + * Then another separate user-mode handler process (H) has to run,
> + potentially after a context switch, to read the request from the queue.
> + Reading the request adds a kernel-user mode transition to the handling.
> + * (H) sends the IO request to kernel to handle it on some underlying storage
> + filesystem. This adds a user-kernel and kernel-user mode transition
> + pair to the handling.
> + * (H) then responds to the FUSE request via a write(2) call.
> + Writing the response is another user-kernel mode transition.
> + * (D) needs to read the response from (H) when its kernel thread runs
> + and forward it to the user
> +
> +Together, the scenario adds 2 extra user-kernel-user mode transition pairs,
> +and potentially has up to 3 additional context switches for the FUSE kernel
> +thread and the user-mode handler to start running for each IO request on the
> +filesystem.
> +This overhead can vary from unnoticeable to unmanageable, depending on the
> +target scenario. But it will always burn extra power via CPU staying longer
> +in non-idle state, handling context switches and mode transitions.
> +One important goal for the new filesystem is to be able to handle each page
> +read separately on demand, because we don't want to wait and download more data
> +than absolutely necessary. Thus readahead would need to be disabled completely.
> +This increases the number of separate IO requests and the FUSE related overhead
> +by almost 32x (128KB readahead limit vs 4KB individual block operations)
> +
> +For more info see a 2017 USENIX research paper:
> +To FUSE or Not to FUSE: Performance of User-Space File Systems
> +Bharath Kumar Reddy Vangoor, Stony Brook University;
> +Vasily Tarasov, IBM Research-Almaden;
> +Erez Zadok, Stony Brook University
> +https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf
--
~Randy
next prev parent reply other threads:[~2019-05-02 20:41 UTC|newest]
Thread overview: 33+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-05-02 4:03 Initial patches for Incremental FS ezemtsov
2019-05-02 4:03 ` [PATCH 1/6] incfs: Add first files of incrementalfs ezemtsov
2019-05-02 19:06 ` Miklos Szeredi
2019-05-02 20:41 ` Randy Dunlap [this message]
2019-05-07 15:57 ` Jann Horn
2019-05-07 17:13 ` Greg KH
2019-05-07 17:18 ` Greg KH
2019-05-02 4:03 ` [PATCH 2/6] incfs: Backing file format ezemtsov
2019-05-02 4:03 ` [PATCH 3/6] incfs: Management of in-memory FS data structures ezemtsov
2019-05-02 4:03 ` [PATCH 4/6] incfs: Integration with VFS layer ezemtsov
2019-05-02 4:03 ` [PATCH 6/6] incfs: Integration tests for incremental-fs ezemtsov
2019-05-02 11:19 ` Initial patches for Incremental FS Amir Goldstein
2019-05-02 13:10 ` Theodore Ts'o
2019-05-02 13:26 ` Al Viro
2019-05-03 4:23 ` Eugene Zemtsov
2019-05-03 5:19 ` Amir Goldstein
2019-05-08 20:09 ` Eugene Zemtsov
2019-05-09 8:15 ` Amir Goldstein
[not found] ` <CAK8JDrEQnXTcCtAPkb+S4r4hORiKh_yX=0A0A=LYSVKUo_n4OA@mail.gmail.com>
2019-05-21 1:32 ` Yurii Zubrytskyi
2019-05-22 8:32 ` Miklos Szeredi
2019-05-22 17:25 ` Yurii Zubrytskyi
2019-05-23 4:25 ` Miklos Szeredi
2019-05-29 21:06 ` Yurii Zubrytskyi
2019-05-30 9:22 ` Miklos Szeredi
2019-05-30 22:45 ` Yurii Zubrytskyi
2019-05-31 9:02 ` Miklos Szeredi
2019-05-22 10:54 ` Amir Goldstein
2019-05-03 7:23 ` Richard Weinberger
2019-05-03 10:22 ` Miklos Szeredi
2019-05-02 13:46 ` Amir Goldstein
2019-05-02 18:16 ` Richard Weinberger
2019-05-02 18:33 ` Richard Weinberger
2019-05-02 13:47 ` J. R. Okajima
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=e36f89ef-2060-9cb6-2b3f-0b246c9208d2@infradead.org \
--to=rdunlap@infradead.org \
--cc=ezemtsov@google.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).