From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from merlin.infradead.org ([205.233.59.134]:54708 "EHLO merlin.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030435AbeCSUoU (ORCPT ); Mon, 19 Mar 2018 16:44:20 -0400 Subject: Re: [RFC v2 01/83] Introduction and documentation of NOVA filesystem. To: Andiry Xu , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-nvdimm@lists.01.org Cc: dan.j.williams@intel.com, andy.rudoff@intel.com, coughlan@redhat.com, swanson@cs.ucsd.edu, david@fromorbit.com, jack@suse.com, swhiteho@redhat.com, miklos@szeredi.hu, andiry.xu@gmail.com, Andiry Xu References: <1520705944-6723-1-git-send-email-jix024@eng.ucsd.edu> <1520705944-6723-2-git-send-email-jix024@eng.ucsd.edu> From: Randy Dunlap Message-ID: <3ca76af0-a8aa-6dec-242f-b031e6eb4710@infradead.org> Date: Mon, 19 Mar 2018 13:43:42 -0700 MIME-Version: 1.0 In-Reply-To: <1520705944-6723-2-git-send-email-jix024@eng.ucsd.edu> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit Sender: linux-fsdevel-owner@vger.kernel.org List-ID: On 03/10/2018 10:17 AM, Andiry Xu wrote: > From: Andiry Xu > > NOVA is a log-structured file system tailored for byte-addressable non-volatile memories. > It was designed and developed at the Non-Volatile Systems Laboratory in the Computer > Science and Engineering Department at the University of California, San Diego. > Its primary authors are Andiry Xu , Lu Zhang > , and Steven Swanson . > > These two papers provide a detailed, high-level description of NOVA's design goals and approach: > > NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories > In The 14th USENIX Conference on File and Storage Technologies (FAST '16) > (http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf) > > NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System > In The 26th ACM Symposium on Operating Systems Principles (SOSP '17) > (http://cseweb.ucsd.edu/~swanson/papers/SOSP2017-NOVAFortis.pdf) > > This patchset contains features from the FAST paper. We leave NOVA-Fortis features, > such as snapshot, metadata and data replication and RAID parity for > future submission. > > Signed-off-by: Andiry Xu > --- > Documentation/filesystems/00-INDEX | 2 + > Documentation/filesystems/nova.txt | 498 +++++++++++++++++++++++++++++++++++++ > MAINTAINERS | 8 + > 3 files changed, 508 insertions(+) > create mode 100644 Documentation/filesystems/nova.txt > diff --git a/Documentation/filesystems/nova.txt b/Documentation/filesystems/nova.txt > new file mode 100644 > index 0000000..4728f50 > --- /dev/null > +++ b/Documentation/filesystems/nova.txt > @@ -0,0 +1,498 @@ > +The NOVA Filesystem > +=================== > + > +NOn-Volatile memory Accelerated file system (NOVA) is a DAX file system > +designed to provide a high performance and production-ready file system > +tailored for byte-addressable non-volatile memories (e.g., NVDIMMs > +and Intel's soon-to-be-released 3DXPoint DIMMs). > +NOVA combines design elements from many other file systems > +and adapts conventional log-structured file system techniques to > +exploit the fast random access that NVMs provide. In particular, NOVA maintains > +separate logs for each inode to improve concurrency, and stores file data > +outside the log to minimize log size and reduce garbage collection costs. NOVA's > +logs provide metadata and data atomicity and focus on simplicity and > +reliability, keeping complex metadata structures in DRAM to accelerate lookup > +operations. > + > +NOVA was developed by the Non-Volatile Systems Laboratory (NVSL) in > +the Computer Science and Engineering Department at the University of > +California, San Diego. > + > +A more thorough discussion of NOVA's design is avaialable in these two papers: available > + > +NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories > +Jian Xu and Steven Swanson > +In The 14th USENIX Conference on File and Storage Technologies (FAST '16) > + > +NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System > +Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, > +Tamires Brito Da Silva, Andy Rudoff and Steven Swanson > +In The 26th ACM Symposium on Operating Systems Principles (SOSP '17) > + > +This version of NOVA contains features from the FAST paper. > +NOVA-Fortis features, such as snapshot, metadata and data protection and replication > +are left for future submission. > + > +The main NOVA features include: > + > + * POSIX semantics > + * Directly access (DAX) byte-addressable NVMM without page caching > + * Per-CPU NVMM pool to maximize concurrency > + * Strong consistency guarantees with 8-byte atomic stores > + > + > +Filesystem Design > +================= > + > +NOVA divides NVMM into several regions. NOVA's 512B superblock contains global (prefer:) 512-byte > +file system information and the recovery inode. The recovery inode represents a > +special file that stores recovery information (e.g., the list of unallocated > +NVMM pages). NOVA divides its inode tables into per-CPU stripes. It also > +provides per-CPU journals for complex file operations that involve multiple > +inodes. The rest of the available NVMM stores logs and file data. > + > +NOVA is log-structured and stores a separate log for each inode to maximize > +concurrency and provide atomicity for operations that affect a single file. The > +logs only store metadata and comprise a linked list of 4 KB pages. Log entries > +are small – between 32 and 64 bytes. Logs are generally non-contiguous, and log > +pages may reside anywhere in NVMM. > + > +NOVA keeps copies of most file metadata in DRAM during normal > +operations, eliminating the need to access metadata in NVMM during reads. > + > +NOVA supports both copy-on-write and in-place file data updates and appends > +metadata about the write to the log. For operations that affect multiple inodes inodes, > +NOVA uses lightweight, fixed-length journals –one per core. -- one per core. > + > +NOVA divides the allocatable NVMM into multiple regions, one region per CPU > +core. A per-core allocator manages each of the regions, minimizing contention > +during memory allocation. > + > +After a system crash, NOVA must scan all the logs to rebuild the memory > +allocator state. Since, there are many logs, NOVA aggressively parallelizes the Since there are > +scan. > + > + > +Building and using NOVA > +======================= > + > +To build NOVA, build the kernel with PMEM (`CONFIG_BLK_DEV_PMEM`), > +DAX (`CONFIG_FS_DAX`) and NOVA (`CONFIG_NOVA_FS`) support. Install as usual. > + > +NOVA runs on a pmem non-volatile memory region. You can create one of these > +regions with the `memmap` kernel command line option. For instance, adding > +`memmap=16G!8G` to the kernel boot parameters will reserve 16GB memory starting > +from address 8GB, and the kernel will create a `pmem0` block device under the > +`/dev` directory. > + > +After the OS has booted, you can initialize a NOVA instance with the following commands: > + > + > +# modprobe nova > +# mount -t NOVA -o init /dev/pmem0 /mnt/nova Hmph, unique in upper-case-ness (at least for in-tree fs-es). Would you consider "nova" instead? > + > + > +The above commands create a NOVA instance on `/dev/pmem0` and mounts it on > +`/mnt/nova`. > + > +NOVA support several module command line options: supports > + > + * measure_timing: Measure the timing of file system operations for profiling (default: 0) > + > + * inplace_data_updates: Update data in place rather than with COW (default: 0) > + > +To recover an existing NOVA instance, mount NOVA without the init option, for example: > + > +# mount -t NOVA /dev/pmem0 /mnt/nova > + > + > +Sysfs support > +------------- > + > +NOVA provides sysfs support to enable user to get/set information of enable a user or enable users And the line above ends with a trailing space. Please check/remove all of those. > +a running NOVA instance. > +After mount, NOVA creates four entries under proc directory /proc/fs/nova/pmem#/: Above uses lower-case "nova" in /proc/fs/nova/... but the examples below use NOVA. nova is preferred (IMO). > + > +timing_stats IO_stats allocator gc > + > +Show NOVA file operation timing statistics: > +# cat /proc/fs/NOVA/pmem#/timing_stats > + > +Clear timing statistics: > +# echo 1 > /proc/fs/NOVA/pmem#/timing_stats > + > +Show NOVA I/O statistics: > +# cat /proc/fs/NOVA/pmem#/IO_stats > + > +Clear I/O statistics: > +# echo 1 > /proc/fs/NOVA/pmem#/IO_stats > + > +Show NOVA allocator information: > +# cat /proc/fs/NOVA/pmem#/allocator > + > +Manual garbage collection: > +# echo #inode_number > /proc/fs/NOVA/pmem#/gc > + > + > +Source File Structure > +===================== > + > + * nova_def.h/nova.h > + Defines NOVA macros and key inline functions. > + > + * balloc.{h,c} > + NOVA's pmem allocator implementation. > + > + * bbuild.c > + Implements recovery routines to restore the in-use inode list and the NVMM > + allocator information. > + > + * dax.c > + Implements DAX read/write and mmap functions to access file data. NOVA uses > + copy-on-write to modify file pages by default, unless inplace data update is > + enabled at mount-time. > + > + * dir.c > + Contains functions to create, update, and remove NOVA dentries. > + > + * file.c > + Implements file-related operations such as open, fallocate, llseek, fsync, > + and flush. > + > + * gc.c > + NOVA's garbage collection functions. > + > + * inode.{h,c} > + Creates, reads, and frees NOVA inode tables and inodes. > + > + * ioctl.c > + Implements some ioctl commands to call NOVA's internal functions. > + > + * journal.{h,c} > + For operations that affect multiple inodes NOVA uses lightweight, > + fixed-length journals – one per core. This file contains functions to > + create and manage the lite journals. > + > + * log.{h,c} > + Functions to manipulate NOVA inode logs, including log page allocation, log > + entry creation, commit, modification, and deletion. > + > + * namei.c > + Functions to create/remove files, directories, and links. It also looks for > + the NOVA inode number for a given path name. > + > + * rebuild.c > + When mounting NOVA, rebuild NOVA inodes from its logs. > + > + * stats.{h,c} > + Provide routines to gather and print NOVA usage statistics. > + > + * super.{h,c} > + Super block structures and NOVA FS layout and entry points for NOVA > + mounting and unmounting, initializing or recovering the NOVA super block > + and other global file system information. > + > + * symlink.c > + Implements functions to create and read symbolic links in the filesystem. > + > + * sysfs.c > + Implements sysfs entries to take user inputs for printing NOVA statistics. s/sysfs/procfs/ > + > + > +Filesystem Layout > +================= > + > +A NOVA file systems resides in single PMEM device. ***** > +NOVA divides the device into 4KB blocks. 4 KB {or use 4KB way up above here} > + > + block > ++---------------------------------------------------------+ > +| 0 | primary super block (struct nova_super_block) | > ++---------------------------------------------------------+ > +| 1 | Reserved inodes | > ++---------------------------------------------------------+ > +| 2 - 15 | reserved | > ++---------------------------------------------------------+ > +| 16 - 31 | Inode table pointers | > ++---------------------------------------------------------+ > +| 32 - 47 | Journal pointers | > ++---------------------------------------------------------+ > +| 48 - 63 | reserved | > ++---------------------------------------------------------+ > +| ... | log and data pages | > ++---------------------------------------------------------+ > +| n-2 | replica reserved Inodes | > ++---------------------------------------------------------+ > +| n-1 | replica super block | > ++---------------------------------------------------------+ > + > + > + > +Superblock and Associated Structures > +==================================== > + > +The beginning of the PMEM device hold the super block and its associated holds > +tables. These include reserved inodes, a table of pointers to the journals > +NOVA uses for complex operations, and pointers to inodes tables. NOVA > +maintains replicas of the super block and reserved inodes in the last two > +blocks of the PMEM area. > + > + > +Block Allocator/Free Lists > +========================== > + > +NOVA uses per-CPU allocators to manage free PMEM blocks. On initialization,> +NOVA divides the range of blocks in the PMEM device among the CPUs, and those > +blocks are managed solely by that CPU. We call these ranges of "allocation regions". > +Each allocator maintains a red-black tree of unallocated ranges (struct > +nova_range_node). > + > +Allocation Functions > +-------------------- > + > +NOVA allocate PMEM blocks using two mechanisms: allocates > + > +1. Static allocation as defined in super.h > + > +2. Allocation for log and data pages via nova_new_log_blocks() and > +nova_new_data_blocks(). > + > + > +PMEM Address Translation > +------------------------ > + > +In NOVA's persistent data structures, memory locations are given as offsets > +from the beginning of the PMEM region. nova_get_block() translates offsets to > +PMEM addresses. nova_get_addr_off() performs the reverse translation. > + > + > +Inodes > +====== > + > +NOVA maintains per-CPU inode tables, and inode numbers are striped across the > +tables (i.e., inos 0, n, 2n,... on cpu 0; inos 1, n + 1, 2n + 1, ... on cpu 1). > + > +The inodes themselves live in a set of linked lists (one per CPU) of 2MB > +blocks. The last 8 bytes of each block points to the next block. Pointers to > +heads of these list live in PMEM block INODE_TABLE_START. lists > +Additional space for inodes is allocated on demand. > + > +To allocate inodes, NOVA maintains a per-cpu "inuse_list" in DRAM holds a RB s/cpu/CPU/g s/a RB/an RB/ but that isn't quite a sentence. Please fix it. > +tree that holds ranges of allocated inode numbers. > + > + > +Logs > +==== > + > +NOVA maintains a log for each inode that records updates to the inode's > +metadata and holds pointers to the file data. NOVA makes updates to file data > +and metadata atomic by atomically appending log entries to the log. > + > +Each inode contains pointers to head and tail of the inode's log. When the log > +grows past the end of the last page, nova allocates additional space. For > +short logs (less than 1MB) , it doubles the length. For longer logs, it adds a > +fixed amount of additional space (1MB). > + > +Log space is reclaimed during garbage collection. > + > +Log Entries > +----------- > + > +There are four kinds of log entry, documented in log.h. The log entries have > +several entries in common: > + > + 1. 'epoch_id' gives the epoch during which the log entry was created. > + Creating a snapshot increments the epoch_id for the file systems. file system. (?) or do multiple epochs (snapshots) => multiple fs-es? > + Currently disabled (always zero). > + > + 2. 'trans_id' is per-inode, monotone increasing, number assigned each > + log entry. It provides an ordering over FS operations on a single inode. > + > + 3. 'invalid' is true if the effects of this entry are dead and the log > + entry can be garbage collected. > + > + 4. 'csum' is a CRC32 checksum for the entry. Currently it is disabled. > + > +Log structure > +------------- > + > +The logs comprise a linked list of PMEM blocks. The tail of each block > +contains some metadata about the block and pointers to the next block and > +block's replica (struct nova_inode_page_tail). > + > ++----------------+ > +| log entry | > ++----------------+ > +| log entry | > ++----------------+ > +| ... | > ++----------------+ > +| tail | > +| metadata | > +| -> next block | > ++----------------+ > + > + > +Journals > +======== > + > +NOVA uses a lightweight journaling mechanisms to provide atomicity for mechanism > +operations that modify more than one on inode. The journals providing logging end of that "sentence" (above) is confusing or missing something. > +for two operations: > + > +1. Single word updates (JOURNAL_ENTRY) > +2. Copying inodes (JOURNAL_INODE) > + > +The journals are undo logs: NOVA creates the journal entries for an operation, > +and if the operation does not complete due to a system failure, the recovery > +process rolls back the changes using the journal entries. > + > +To commit, NOVA drops the log. > + > +NOVA maintains one journal per CPU. The head and tail pointers for each > +journal live in a reserved page near the beginning of the file system. > + > +During recovery, NOVA scans the journals and undoes the operations described by > +each entry. > + > + > +File and Directory Access > +========================= > + > +To access file data via read(), NOVA maintains a radix tree in DRAM for each > +inode (nova_inode_info_header.tree) that maps file offsets to write log > +entries. For directories, the same tree maps a hash of filenames to their > +corresponding dentry. > + > +In both cases, the nova populates the tree when the file or directory is opened the nova fs (?) > +by scanning its log. > + > + > +MMap and DAX > +============ > + > +NOVA leverages the kernel's DAX mechanisms for mmap and file data access. > +NOVA supports DAX-style mmap, i.e. mapping NVM pages directly to the > +application's address space. > + > + > +Garbage Collection > +================== > + > +NOVA recovers log space with a two-phase garbage collection system. When a log > +reaches the end of its allocated pages, NOVA allocates more space. Then, the > +fast GC algorithm scans the log to remove pages that have no valid entries. > +Then, it estimates how many pages the logs valid entries would fill. If this > +is less than half the number of pages in the log, the second GC phase copies > +the valid entries to new pages. > + > +For example (V=valid; I=invalid): > + > ++---+ +---+ +---+ > +| I | | I | | V | > ++---+ +---+ Thorough +---+ > +| V | | V | GC | V | > ++---+ +---+ =====> +---+ > +| I | | I | | V | > ++---+ +---+ +---+ > +| V | | V | | V | > ++---+ +---+ +---+ > + | | > + V V > ++---+ +---+ > +| I | | V | > ++---+ +---+ > +| I | fast GC | I | > ++---+ ====> +---+ > +| I | | I | > ++---+ +---+ > +| I | | V | > ++---+ +---+ > + | > + V > ++---+ > +| V | > ++---+ > +| I | > ++---+ > +| I | > ++---+ > +| V | > ++---+ > + > + > +Umount and Recovery > +=================== > + > +Clean umount/mount > +------------------ > + > +On a clean unmount, NOVA saves the contents of many of its DRAM data structures > +to PMEM to accelerate the next mount: > + > +1. NOVA stores the allocator state for each of the per-cpu allocators to the > + log of a reserved inode (NOVA_BLOCK_NODE_INO). > + > +2. NOVA stores the per-CPU lists of alive inodes (the inuse_list) to the > + NOVA_BLOCK_INODELIST_INO reserved inode. > + > +After a clean unmount, the following mount restores these data and then > +invalidates them. > + > +Recovery after failures > +----------------------- > + > +In case of a unclean dismount (e.g., system crash), NOVA must rebuild these of an unclean > +DRAM structures by scanning the inode logs. NOVA log scanning is fast because > +per-CPU inode tables and per-inode logs allow for parallel recovery. > + > +The number of live log entries in an inode log is roughly the number of extents > +in the file. As a result, NOVA only needs to scan a small fraction of the NVMM > +during recovery. > + > +The NOVA failure recovery consists of two steps: > + > +First, NOVA checks its lite weight journals and rolls back any uncommitted should be one word: lightweight (or liteweight) > +transactions to restore the file system to a consistent state. > + > +Second, NOVA starts a recovery thread on each CPU and scans the inode tables in > +parallel, performing log scanning for every valid inode in the inode table. > +NOVA use different recovery mechanisms for directory inodes and file inodes: and file inodes. > +For a directory inode, NOVA scans the log's linked list to enumerate the pages > +it occupies, but it does not inspect the log's contents. For a file inode, > +NOVA reads the write entries in the log to enumerate the data pages. > + > +During the recovery scan NOVA builds a bitmap of occupied pages, and rebuilds > +the allocator based on the result. After this process completes, the file > +system is ready to accept new requests. > + > +During the same scan, it rebuilds the list of available inodes. > + > + > +Gaps, Missing Features, and Development Status > +============================================== > + > +Although NOVA is a fully-functional file system, there is still much work left > +to be done. In particular, (at least) the following items are currently missing: > + > +1. Snapshot, metadata and data replication and protection are left for future submission. > +2. There is no mkfs or fsck utility (`mount` takes `-o init` to create a NOVA file system). > +3. NOVA only works on x86-64 kernels. > +4. NOVA does not currently support extended attributes or ACL. > +5. NOVA doesn't provide quota support. > +6. Moving NOVA file systems between machines with different numbers of CPUs does not work. You could artificially limit the number of "known" CPUs so that a NOVA fs could be moved from a 16-CPU system to an 8-CPU system by telling NOVA to use only 8 CPUs (as an example). Just a thought. > + > +None of these are fundamental limitations of NOVA's design. > + > +NOVA is complete and robust enough to run a range of complex applications, but > +it is not yet ready for production use. Our current focus is on adding a few > +missing features from the list above and finding/fixing bugs. > + > + > +Hacking and Contributing > +======================== > + > +If you find bugs, please report them at https://github.com/NVSL/linux-nova/issues. > + > +If you have other questions or suggestions you can contact the NOVA developers > +at cse-nova-hackers@eng.ucsd.edu. -- ~Randy