* [RFC v2 00/83] NOVA: a new file system for persistent memory
@ 2018-03-10 18:17 Andiry Xu
From: Andiry Xu @ 2018-03-10 18:17 UTC
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu

This is the second version of the RFC patch series that implements
NOVA (NOn-Volatile memory Accelerated file system), a new file system built for PMEM.

NOVA's goal is to provide a high-performance, production-ready
file system tailored for byte-addressable non-volatile memories (e.g., NVDIMMs
and Intel's soon-to-be-released 3D XPoint DIMMs).
NOVA was developed at the Non-Volatile Systems Laboratory in the Computer
Science and Engineering Department at the University of California, San Diego.
Its primary authors are Andiry Xu, Lu Zhang, and Steven Swanson.
NOVA is stable enough to run complex applications, but there is substantial
work left to do.  This RFC is intended to gather feedback to guide its
development toward eventual inclusion upstream.
The patches are based on Linux 4.16-rc4.

Changes from v1:

* Remove snapshot, metadata replication and data parity for future submission.
  This significantly reduces complexity and LOC: 22129 -> 13834.

* Break down the code in a more reviewer-friendly way:
  The patchset starts with a simple skeleton and adds more features gradually.
  Each patch leaves the tree in a compilable and working state,
  and is self-contained and small, so it is easier to review.

* Fix bugs so that NOVA passes xfstests:


NOVA is primarily a log-structured file system, but rather than maintaining a
single global log for the entire file system, it maintains a separate log for
each inode.  NOVA breaks the logs into 4KB pages, which need not be
contiguous in memory.  The logs contain only metadata.
File data pages reside outside the log, and log entries for write operations
point to the data pages they modify.  File modifications can be done either
as in-place updates or as copy-on-write (COW) updates; COW provides atomic
file updates.
For file operations that involve multiple inodes, NOVA uses small, fixed-size
redo logs to atomically append log entries to the logs of the inodes involved.
This structure keeps logs small and makes garbage collection very fast.  It also
enables enormous parallelism during recovery from an unclean unmount, since
threads can scan logs in parallel.
Documentation/filesystems/nova.txt contains some lower-level implementation and
usage information.  A more thorough discussion of NOVA's goals and design is
available in two papers:
NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories
Jian Xu and Steven Swanson
Published in FAST 2016

NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System
Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah,
Amit Borase, Tamires Brito Da Silva, Andy Rudoff, Steven Swanson
Published in SOSP 2017

This version contains the features from the FAST paper. We leave the NOVA-Fortis
features for a future submission.
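
As a rough illustration of the per-inode log layout described above (not code
from the patches), the following user-space C sketch models a log as a chain
of 4KB pages that need not be contiguous, holding metadata entries that point
at data pages stored elsewhere.  All names (`sketch_log`, `sketch_entry`,
`sketch_log_append`, `sketch_log_free`) are hypothetical, and the entry layout
is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical fixed-size metadata entry: records which file range a
 * write covered and where its data page lives (outside the log). */
struct sketch_entry {
	uint64_t file_offset;	/* byte offset in the file */
	uint64_t data_page;	/* location of the data page */
	uint64_t length;	/* bytes written */
	uint64_t reserved;
};

#define SKETCH_ENTRIES_PER_PAGE \
	((SKETCH_PAGE_SIZE - sizeof(void *)) / sizeof(struct sketch_entry))

/* One log page; pages are chained, so they can live anywhere. */
struct sketch_log_page {
	struct sketch_entry entries[SKETCH_ENTRIES_PER_PAGE];
	struct sketch_log_page *next;
};

/* Per-inode log: head page, tail page, and next free slot in the tail. */
struct sketch_log {
	struct sketch_log_page *head;
	struct sketch_log_page *tail;
	size_t tail_used;
	size_t pages;
};

/* Append one entry; allocate and link a new page when the tail is full.
 * Returns 0 on success, -1 on allocation failure. */
int sketch_log_append(struct sketch_log *log, struct sketch_entry e)
{
	if (!log->tail || log->tail_used == SKETCH_ENTRIES_PER_PAGE) {
		struct sketch_log_page *p = calloc(1, sizeof(*p));

		if (!p)
			return -1;
		/* A log page must fit in one 4KB page. */
		assert(sizeof(*p) <= SKETCH_PAGE_SIZE);
		if (log->tail)
			log->tail->next = p;
		else
			log->head = p;
		log->tail = p;
		log->tail_used = 0;
		log->pages++;
	}
	log->tail->entries[log->tail_used++] = e;
	return 0;
}

/* Free every page in the chain and reset the log to empty. */
void sketch_log_free(struct sketch_log *log)
{
	struct sketch_log_page *p = log->head;

	while (p) {
		struct sketch_log_page *next = p->next;

		free(p);
		p = next;
	}
	log->head = log->tail = NULL;
	log->tail_used = 0;
	log->pages = 0;
}
```

Because each inode owns its own chain, appends to different inodes touch
disjoint pages, which is what would let recovery threads scan logs in parallel.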

Build and Run

To build NOVA, build the kernel with PMEM (`CONFIG_BLK_DEV_PMEM`),
DAX (`CONFIG_FS_DAX`) and NOVA (`CONFIG_NOVA_FS`) support.  Install as usual.

NOVA runs on a pmem non-volatile memory region created by the memmap kernel
option.  For instance, adding 'memmap=16G!8G' to the kernel boot parameters
reserves 16GB of memory starting at address 8GB, and the kernel creates a pmem0
block device under the /dev directory.
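
To make the memmap arithmetic concrete, here is a small user-space C sketch
that parses a `size!start` string such as `16G!8G` into a byte range.  It is
not the kernel's parser; `sketch_parse_size` and `sketch_parse_memmap` are
hypothetical helpers that handle only decimal numbers with K/M/G suffixes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a decimal number with an optional K/M/G suffix into bytes.
 * On return, *end points just past the parsed text. */
uint64_t sketch_parse_size(const char *s, const char **end)
{
	char *p;
	uint64_t v = strtoull(s, &p, 10);

	switch (*p) {
	case 'G': case 'g':
		v <<= 30;
		p++;
		break;
	case 'M': case 'm':
		v <<= 20;
		p++;
		break;
	case 'K': case 'k':
		v <<= 10;
		p++;
		break;
	default:
		break;
	}
	*end = p;
	return v;
}

/* Parse "size!start".  Returns 0 on success, -1 on malformed input. */
int sketch_parse_memmap(const char *arg, uint64_t *size, uint64_t *start)
{
	const char *p;

	assert(arg && size && start);

	*size = sketch_parse_size(arg, &p);
	if (p == arg || *p != '!')
		return -1;	/* missing size or missing '!' separator */

	arg = p + 1;
	*start = sketch_parse_size(arg, &p);
	if (p == arg || *p != '\0')
		return -1;	/* missing start or trailing garbage */

	return 0;
}
```

For `16G!8G` this yields a size of 16 << 30 bytes and a start of 8 << 30
bytes, i.e. the region covers addresses from 8GB up to (but not including) 24GB.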

After the OS has booted, initialize a NOVA instance with the following commands:

# modprobe nova
# mount -t NOVA -o init /dev/pmem0 /mnt/nova

The above commands create a NOVA instance on /dev/pmem0 and mount it on
/mnt/nova. Currently NOVA does not have mkfs or fsck support.


Compared to other DAX file systems such as ext4-DAX and xfs-DAX,
NOVA provides fine-grained, byte-granularity metadata operations,
and it performs better in metadata-intensive and write-intensive applications.
NOVA also excels at the append-fsync access pattern, i.e. write-ahead logging,
which is very common in DBMSs and key-value stores.
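
For readers unfamiliar with it, the append-fsync access pattern mentioned
above looks roughly like this from user space.  This is a generic POSIX
sketch, not code from NOVA or from any of the benchmarked applications;
`sketch_wal_append` is a hypothetical helper, and the log path is whatever the
caller chooses.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Append one record to a write-ahead log and make it durable before
 * returning.  Returns 0 on success, -1 on error. */
int sketch_wal_append(const char *path, const void *rec, size_t len)
{
	const char *p = rec;
	int fd;

	assert(path && rec);

	fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
	if (fd < 0)
		return -1;

	/* write() may be short; loop until the whole record is written. */
	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			close(fd);
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}

	/* The record only counts as committed once fsync() succeeds. */
	if (fsync(fd) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}
```

A real database would keep the descriptor open across records rather than
reopening the file each time; the sketch reopens it only to stay short.  Every
commit pays for one small append plus one fsync, which is why fsync overhead
dominates in this pattern.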

The following tests were performed on an Intel i7-3770K with 16GB DRAM
and 8GB PMEM emulated with DRAM. The kernel is 4.16-rc4 64-bit on Ubuntu 16.04.
Performance may vary on different platforms.

Filebench throughput (ops/s):
		xfs-DAX	ext4-DAX	NOVA
Fileserver	86971	177826		334166
Varmail		148032	288033		999794
Webserver	370245	370144		374130
Webproxy	315084	737544		927216

Webserver is read-intensive and all the file systems have similar performance.

SQLite test:
SQLite has four journaling modes:
Delete: delete the undo log file after transaction commit
Truncate: truncate the undo log file to zero after transaction commit
Persist: write a flag at the beginning of the log file after transaction commit
WAL: write-ahead logging

SQLite insert (transactions/s):
		xfs-DAX	ext4-DAX	NOVA
Delete		18525	23615		45289
Truncate	21930	26391		52046	
Persist		58053	56106		50554
WAL		38622	62703		85395

NOVA performs worse than the others in Persist mode because it does
copy-on-write for writes, and so writes a full 4KB page even for sub-page writes.
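
The sub-page cost can be made concrete with a little arithmetic.  Assuming, as
stated above, that a copy-on-write write rewrites every 4KB page it touches,
the following C sketch computes how many bytes end up being written for a
request with a given offset and length.  `sketch_cow_bytes` is a hypothetical
helper, not a NOVA function.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL

/* Bytes physically written when every touched page is copied in full. */
uint64_t sketch_cow_bytes(uint64_t offset, uint64_t len)
{
	uint64_t first, last;

	if (len == 0)
		return 0;

	/* Guard against offset + len wrapping around. */
	assert(offset + len >= offset);

	first = offset / SKETCH_PAGE_SIZE;		/* first page touched */
	last = (offset + len - 1) / SKETCH_PAGE_SIZE;	/* last page touched */

	return (last - first + 1) * SKETCH_PAGE_SIZE;
}
```

Under this model a 100-byte write at offset 0 costs 4096 bytes, roughly 41
times the requested size, and a 200-byte write that straddles a page boundary
costs 8192 bytes.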

Redis: fsync the WAL file after every set.
Redis set throughput (trans/s):
xfs-DAX	ext4-DAX	NOVA
49771	88308		102560

RocksDB fillunique test (ops/s):
		xfs-DAX	ext4-DAX	NOVA
WAL sync	33563	62066		295655
WAL nosync	254533	288106		393713

Both ext4-DAX and xfs-DAX suffer from high fsync overhead.

More test results are available in the two NOVA papers.

NOVA uses per-inode logging, per-CPU inode table and journal to avoid lock contention.
We use the FxMark test suite (
to test the filesystem scalability. The result is at



Andiry Xu (83):
  Introduction and documentation of NOVA filesystem.
  Add nova_def.h.
  Add super.h.
  NOVA inode definition.
  Add NOVA filesystem definitions and useful helper routines.
  Add inode get/read methods.
  Initialize inode_info and rebuild inode information in nova_iget().
  NOVA superblock operations.
  Add Kconfig and Makefile
  Add superblock integrity check.
  Add timing and I/O statistics for performance analysis and profiling.
  Add timing for mount and init.
  Add remount_fs and show_options methods.
  Add range node kmem cache.
  Add free list data structure.
  Initialize block map and free lists in nova_init().
  Add statfs support.
  Add freelist statistics printing.
  Add pmem block free routines.
  Pmem block allocation routines.
  Add log structure.
  Inode log pages allocation and reclamation.
  Save allocator to pmem in put_super.
  Initialize and allocate inode table.
  Support get normal inode address and inode table extension.
  Add inode_map to track inuse inodes.
  Save the inode inuse list to pmem upon umount
  Add NOVA address space operations
  Add write_inode and dirty_inode routines.
  New NOVA inode allocation.
  Add new vfs inode allocation.
  Add log entry definitions.
  Inode log and entry printing for debug purpose.
  Journal: NOVA light weight journal definitions.
  Journal: Lite journal helper routines.
  Journal: Lite journal recovery.
  Journal: Lite journal create and commit.
  Journal: NOVA lite journal initialization.
  Log operation: dentry append.
  Log operation: file write entry append.
  Log operation: setattr entry append
  Log operation: link change append.
  Log operation: in-place update log entry
  Log operation: invalidate log entries
  Log operation: file inode log lookup and assign
  Dir: Add Directory radix tree insert/remove methods.
  Dir: Add initial dentries when initializing a directory inode log.
  Dir: Readdir operation.
  Dir: Append create/remove dentry.
  Inode: Add nova_evict_inode.
  Rebuild: directory inode.
  Rebuild: file inode.
  Namei: lookup.
  Namei: create and mknod.
  Namei: mkdir
  Namei: link and unlink.
  Namei: rmdir
  Namei: rename
  Namei: setattr
  Add special inode operations.
  Super: Add nova_export_ops.
  File: getattr and file inode operations
  File operation: llseek.
  File operation: open, fsync, flush.
  File operation: read.
  Super: Add file write item cache.
  Dax: commit list of file write items to log.
  File operation: copy-on-write write.
  Super: Add module param inplace_data_updates.
  File operation: Inplace write.
  Symlink support.
  File operation: fallocate.
  Dax: Add iomap operations.
  File operation: Mmap.
  File operation: read/write iter.
  Ioctl support.
  GC: Fast garbage collection.
  GC: Thorough garbage collection.
  Normal recovery.
  Failure recovery: bitmap operations.
  Failure recovery: Inode pages recovery routines.
  Failure recovery: Per-CPU recovery.
  Sysfs support.

 Documentation/filesystems/00-INDEX |    2 +
 Documentation/filesystems/nova.txt |  498 +++++++++++++
 MAINTAINERS                        |    8 +
 fs/Kconfig                         |    2 +
 fs/Makefile                        |    1 +
 fs/nova/Kconfig                    |   15 +
 fs/nova/Makefile                   |    8 +
 fs/nova/balloc.c                   |  730 ++++++++++++++++++
 fs/nova/balloc.h                   |   96 +++
 fs/nova/bbuild.c                   | 1437 ++++++++++++++++++++++++++++++++++++
 fs/nova/bbuild.h                   |   28 +
 fs/nova/dax.c                      |  970 ++++++++++++++++++++++++
 fs/nova/dir.c                      |  520 +++++++++++++
 fs/nova/file.c                     |  728 ++++++++++++++++++
 fs/nova/gc.c                       |  459 ++++++++++++
 fs/nova/inode.c                    | 1310 ++++++++++++++++++++++++++++++++
 fs/nova/inode.h                    |  277 +++++++
 fs/nova/ioctl.c                    |  184 +++++
 fs/nova/journal.c                  |  412 +++++++++++
 fs/nova/journal.h                  |   56 ++
 fs/nova/log.c                      | 1111 ++++++++++++++++++++++++++++
 fs/nova/log.h                      |  417 +++++++++++
 fs/nova/namei.c                    |  848 +++++++++++++++++++++
 fs/nova/nova.h                     |  566 ++++++++++++++
 fs/nova/nova_def.h                 |  128 ++++
 fs/nova/rebuild.c                  |  499 +++++++++++++
 fs/nova/stats.c                    |  600 +++++++++++++++
 fs/nova/stats.h                    |  178 +++++
 fs/nova/super.c                    | 1063 ++++++++++++++++++++++++++
 fs/nova/super.h                    |  171 +++++
 fs/nova/symlink.c                  |  133 ++++
 fs/nova/sysfs.c                    |  379 ++++++++++
 32 files changed, 13834 insertions(+)
 create mode 100644 Documentation/filesystems/nova.txt
 create mode 100644 fs/nova/Kconfig
 create mode 100644 fs/nova/Makefile
 create mode 100644 fs/nova/balloc.c
 create mode 100644 fs/nova/balloc.h
 create mode 100644 fs/nova/bbuild.c
 create mode 100644 fs/nova/bbuild.h
 create mode 100644 fs/nova/dax.c
 create mode 100644 fs/nova/dir.c
 create mode 100644 fs/nova/file.c
 create mode 100644 fs/nova/gc.c
 create mode 100644 fs/nova/inode.c
 create mode 100644 fs/nova/inode.h
 create mode 100644 fs/nova/ioctl.c
 create mode 100644 fs/nova/journal.c
 create mode 100644 fs/nova/journal.h
 create mode 100644 fs/nova/log.c
 create mode 100644 fs/nova/log.h
 create mode 100644 fs/nova/namei.c
 create mode 100644 fs/nova/nova.h
 create mode 100644 fs/nova/nova_def.h
 create mode 100644 fs/nova/rebuild.c
 create mode 100644 fs/nova/stats.c
 create mode 100644 fs/nova/stats.h
 create mode 100644 fs/nova/super.c
 create mode 100644 fs/nova/super.h
 create mode 100644 fs/nova/symlink.c
 create mode 100644 fs/nova/sysfs.c

