* [RFC 00/16] NOVA: a new file system for persistent memory
@ 2017-08-03  7:48 ` Steven Swanson
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson

This is an RFC patch series that implements NOVA (NOn-Volatile memory
Accelerated file system), a new file system built for PMEM.

NOVA's goal is to provide a high-performance, full-featured, production-ready
file system tailored for byte-addressable non-volatile memories (e.g., NVDIMMs
and Intel's soon-to-be-released 3D XPoint DIMMs).  It combines design elements
from many other file systems to provide a combination of high performance,
strong consistency guarantees, and comprehensive data protection.  NOVA supports
DAX-style mmap, and making DAX perform well is a first-order priority in NOVA's
design.

NOVA was developed at the Non-Volatile Systems Laboratory in the Computer
Science and Engineering Department at the University of California, San Diego.
Its primary authors are Andiry Xu <jix024@eng.ucsd.edu>, Lu Zhang
<luzh@eng.ucsd.edu>, and Steven Swanson <swanson@eng.ucsd.edu>.

NOVA is stable enough to run complex applications, but there is substantial
work left to do.  This RFC is intended to gather feedback to guide its
development toward eventual inclusion upstream.

The patches are relative to Linux 4.12.

Overview
========

NOVA is primarily a log-structured file system, but rather than maintain a
single global log for the entire file system, it maintains separate logs for
each file (inode).  NOVA breaks the logs into 4KB pages, which need not be
contiguous in memory.  The logs contain only metadata.

File data pages reside outside the log, and log entries for write operations
point to data pages they modify.  File modification uses copy-on-write (COW) to
provide atomic file updates.

For file operations that involve multiple inodes, NOVA uses small, fixed-sized
redo logs to atomically append log entries to the logs of the inodes involved.

This structure keeps logs small and makes garbage collection very fast.  It also
enables enormous parallelism during recovery from an unclean unmount, since
threads can scan logs in parallel.

NOVA replicates and checksums all metadata structures and protects file data
with RAID-4-style parity.  It supports checkpoints to facilitate backups.

Documentation/filesystems/nova.txt contains some lower-level implementation and
usage information.  A more thorough discussion of NOVA's goals and design is
available in two papers:

NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories
http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf
Jian Xu and Steven Swanson
Published in FAST 2016

Hardening the NOVA File System
http://cseweb.ucsd.edu/~swanson/papers/TechReport2017HardenedNOVA.pdf
Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase,
Tamires Brito Da Silva, Andy Rudoff, Steven Swanson
UCSD-CSE Techreport CS2017-1018

-steve


---

Steven Swanson (16):
      NOVA: Documentation
      NOVA: Superblock and fs layout
      NOVA: PMEM allocation system
      NOVA: Inode operations and structures
      NOVA: Log data structures and operations
      NOVA: Lite-weight journaling for complex ops
      NOVA: File and directory operations
      NOVA: Garbage collection
      NOVA: DAX code
      NOVA: File data protection
      NOVA: Snapshot support
      NOVA: Recovery code
      NOVA: Sysfs and ioctl
      NOVA: Read-only pmem devices
      NOVA: Performance measurement
      NOVA: Build infrastructure


 Documentation/filesystems/00-INDEX |    2 
 Documentation/filesystems/nova.txt |  771 +++++++++++++++++
 MAINTAINERS                        |    8 
 README.md                          |  173 ++++
 arch/x86/include/asm/io.h          |    1 
 arch/x86/mm/fault.c                |   11 
 arch/x86/mm/ioremap.c              |   25 -
 drivers/nvdimm/pmem.c              |   14 
 fs/Kconfig                         |    2 
 fs/Makefile                        |    1 
 fs/nova/Kconfig                    |   15 
 fs/nova/Makefile                   |    9 
 fs/nova/balloc.c                   |  827 +++++++++++++++++++
 fs/nova/balloc.h                   |  118 +++
 fs/nova/bbuild.c                   | 1602 ++++++++++++++++++++++++++++++++++++
 fs/nova/checksum.c                 |  912 ++++++++++++++++++++
 fs/nova/dax.c                      | 1346 ++++++++++++++++++++++++++++++
 fs/nova/dir.c                      |  760 +++++++++++++++++
 fs/nova/file.c                     |  943 +++++++++++++++++++++
 fs/nova/gc.c                       |  739 +++++++++++++++++
 fs/nova/inode.c                    | 1467 +++++++++++++++++++++++++++++++++
 fs/nova/inode.h                    |  389 +++++++++
 fs/nova/ioctl.c                    |  185 ++++
 fs/nova/journal.c                  |  474 +++++++++++
 fs/nova/journal.h                  |   61 +
 fs/nova/log.c                      | 1411 ++++++++++++++++++++++++++++++++
 fs/nova/log.h                      |  333 +++++++
 fs/nova/mprotect.c                 |  604 ++++++++++++++
 fs/nova/mprotect.h                 |  190 ++++
 fs/nova/namei.c                    |  919 +++++++++++++++++++++
 fs/nova/nova.h                     | 1137 ++++++++++++++++++++++++++
 fs/nova/nova_def.h                 |  154 +++
 fs/nova/parity.c                   |  411 +++++++++
 fs/nova/perf.c                     |  594 +++++++++++++
 fs/nova/perf.h                     |   96 ++
 fs/nova/rebuild.c                  |  847 +++++++++++++++++++
 fs/nova/snapshot.c                 | 1407 ++++++++++++++++++++++++++++++++
 fs/nova/snapshot.h                 |   98 ++
 fs/nova/stats.c                    |  685 +++++++++++++++
 fs/nova/stats.h                    |  218 +++++
 fs/nova/super.c                    | 1222 +++++++++++++++++++++++++++
 fs/nova/super.h                    |  216 +++++
 fs/nova/symlink.c                  |  153 +++
 fs/nova/sysfs.c                    |  543 ++++++++++++
 include/linux/io.h                 |    2 
 include/linux/mm.h                 |    2 
 include/linux/mm_types.h           |    3 
 kernel/memremap.c                  |   24 +
 mm/memory.c                        |    2 
 mm/mmap.c                          |    1 
 mm/mprotect.c                      |   13 
 51 files changed, 22129 insertions(+), 11 deletions(-)
 create mode 100644 Documentation/filesystems/nova.txt
 create mode 100644 README.md
 create mode 100644 fs/nova/Kconfig
 create mode 100644 fs/nova/Makefile
 create mode 100644 fs/nova/balloc.c
 create mode 100644 fs/nova/balloc.h
 create mode 100644 fs/nova/bbuild.c
 create mode 100644 fs/nova/checksum.c
 create mode 100644 fs/nova/dax.c
 create mode 100644 fs/nova/dir.c
 create mode 100644 fs/nova/file.c
 create mode 100644 fs/nova/gc.c
 create mode 100644 fs/nova/inode.c
 create mode 100644 fs/nova/inode.h
 create mode 100644 fs/nova/ioctl.c
 create mode 100644 fs/nova/journal.c
 create mode 100644 fs/nova/journal.h
 create mode 100644 fs/nova/log.c
 create mode 100644 fs/nova/log.h
 create mode 100644 fs/nova/mprotect.c
 create mode 100644 fs/nova/mprotect.h
 create mode 100644 fs/nova/namei.c
 create mode 100644 fs/nova/nova.h
 create mode 100644 fs/nova/nova_def.h
 create mode 100644 fs/nova/parity.c
 create mode 100644 fs/nova/perf.c
 create mode 100644 fs/nova/perf.h
 create mode 100644 fs/nova/rebuild.c
 create mode 100644 fs/nova/snapshot.c
 create mode 100644 fs/nova/snapshot.h
 create mode 100644 fs/nova/stats.c
 create mode 100644 fs/nova/stats.h
 create mode 100644 fs/nova/super.c
 create mode 100644 fs/nova/super.h
 create mode 100644 fs/nova/symlink.c
 create mode 100644 fs/nova/sysfs.c

--
Signature

* [RFC 01/16] NOVA: Documentation
  2017-08-03  7:48 ` Steven Swanson
@ 2017-08-03  7:48   ` Steven Swanson
  -1 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson

A brief overview is in README.md.

Implementation and usage details are in Documentation/filesystems/nova.txt.

These two papers provide a detailed, high-level description of NOVA's design goals and approach:

   NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories (http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf)

   Hardening the NOVA File System (http://cseweb.ucsd.edu/~swanson/papers/TechReport2017HardenedNOVA.pdf)

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 Documentation/filesystems/00-INDEX |    2 
 Documentation/filesystems/nova.txt |  771 ++++++++++++++++++++++++++++++++++++
 MAINTAINERS                        |    8 
 README.md                          |  173 ++++++++
 4 files changed, 954 insertions(+)
 create mode 100644 Documentation/filesystems/nova.txt
 create mode 100644 README.md

diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index b7bd6c9009cc..dc5c72273957 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -95,6 +95,8 @@ nfs/
 	- nfs-related documentation.
 nilfs2.txt
 	- info and mount options for the NILFS2 filesystem.
+nova.txt
+	- info on the NOVA filesystem.
 ntfs.txt
 	- info and mount options for the NTFS filesystem (Windows NT).
 ocfs2.txt
diff --git a/Documentation/filesystems/nova.txt b/Documentation/filesystems/nova.txt
new file mode 100644
index 000000000000..af90da1c3fb1
--- /dev/null
+++ b/Documentation/filesystems/nova.txt
@@ -0,0 +1,771 @@
+The NOVA Filesystem
+===================
+
+NOVA is a DAX file system designed to maximize performance on hybrid DRAM and
+non-volatile main memory (NVMM) systems while providing strong consistency
+guarantees. NOVA adapts conventional log-structured file system techniques to
+exploit the fast random access that NVMs provide. In particular, it maintains
+separate logs for each inode to improve concurrency, and stores file data
+outside the log to minimize log size and reduce garbage collection costs. NOVA's
+logs provide metadata, data, and mmap atomicity and focus on simplicity and
+reliability, keeping complex metadata structures in DRAM to accelerate lookup
+operations.
+
+The main NOVA features include:
+
+  * POSIX semantics
+  * Direct access (DAX) to byte-addressable NVMM without page caching
+  * Per-CPU NVMM pool to maximize concurrency
+  * Strong consistency guarantees with 8-byte atomic stores
+  * Full filesystem snapshot with DAX-mmap support
+  * Checksums on metadata and file data (crc32c)
+  * Full metadata replication and RAID-5 parity per file page
+  * Online filesystem integrity check and corruption recovery
+
+Filesystem Design
+=================
+NOVA divides NVMM into five regions. NOVA's 512 B superblock contains global
+file system information and the recovery inode. The recovery inode represents a
+special file that stores recovery information (e.g., the list of unallocated
+NVMM pages). NOVA divides its inode tables into per-CPU stripes. It also
+provides per-CPU journals for complex file operations that involve multiple
+inodes. The rest of the available NVMM stores logs and file data.
+
+NOVA is log-structured and stores a separate log for each inode to maximize
+concurrency and provide atomicity for operations that affect a single file. The
+logs only store metadata and comprise a linked list of 4 KB pages. Log entries
+are small – between 32 and 64 bytes. Logs are generally non-contiguous, and log
+pages may reside anywhere in NVMM.
+
+NOVA keeps read-only copies of most file metadata in DRAM during normal
+operations, eliminating the need to access metadata in NVMM during reads.
+
+NOVA uses copy-on-write to provide atomic updates for file data and appends
+metadata about the write to the log. For operations that affect multiple inodes,
+NOVA uses lightweight, fixed-length journals – one per core.
+
+NOVA divides the allocatable NVMM into multiple regions, one region per CPU
+core. A per-core allocator manages each of the regions, minimizing contention
+during memory allocation.
+
+After a system crash, NOVA must scan all the logs to rebuild the memory
+allocator state. Since there are many logs, NOVA aggressively parallelizes the
+scan.
+
+Using NOVA
+==========
+
+NOVA runs on a pmem non-volatile memory region.  You can create one of these
+regions with the `memmap` kernel command line option.  For instance, adding
+`memmap=16G!8G` to the kernel boot parameters will reserve 16GB memory starting
+from address 8GB, and the kernel will create a `pmem0` block device under the
+`/dev` directory.
+
+After the OS has booted, you can initialize a NOVA instance with the following commands:
+
+
+# modprobe nova
+# mount -t NOVA -o init /dev/pmem0 /mnt/ramdisk
+
+
+The above commands create a NOVA instance on `/dev/pmem0` and mount it on
+`/mnt/ramdisk`.
+
+Nova supports several module command-line options:
+
+ * metadata_csum: Enable metadata replication and checksums (default: 0)
+
+ * data_csum: Compute checksums on file data. (default: 0)
+
+ * data_parity: Compute parity for file data. (default: 0)
+
+ * inplace_data_updates:  Update data in place rather than with COW (default: 0)
+
+ * wprotect: Make PMEM unwritable and then use CR0.WP to enable writes as
+   needed (default: 0).  With wprotect=1, you must also load the nd_pmem
+   module with readonly=1 (e.g., modprobe nd_pmem readonly=1).
+
+For instance, to enable all of Nova's data protection features:
+
+# modprobe nova metadata_csum=1\
+  	       data_csum=1\
+	       data_parity=1\
+	       wprotect=1
+
+Currently, remounting file systems with different combinations of options may
+not work.
+
+To recover an existing NOVA instance, mount NOVA without the init option, for example:
+
+# mount -t NOVA /dev/pmem0 /mnt/ramdisk
+
+### Taking Snapshots
+
+To create a snapshot:
+
+# echo 1 > /proc/fs/NOVA/<device>/create_snapshot
+
+To list the current snapshots:
+
+# cat /proc/fs/NOVA/<device>/snapshots
+
+To mount a snapshot, mount NOVA and specify the snapshot index, for example:
+
+# mount -t NOVA -o snapshot=<index> /dev/pmem0 /mnt/ramdisk
+
+Users should not write to the file system after mounting a snapshot.
+
+Source File Structure
+=====================
+
+  * nova_def.h/nova.h
+    Defines NOVA macros and key inline functions.
+    
+  * balloc.{h,c}
+    NOVA's block allocator implementation.
+    
+  * bbuild.c
+    Implements recovery routines to restore the in-use inode list, the NVMM
+    allocator information, and the snapshot table.
+
+  * checksum.c
+    Contains checksum-related functions to compute and verify checksums on NOVA
+    data structures and file pages, and also performs recovery actions when
+    corruptions are detected.
+
+  * dax.c
+    Implements DAX read/write functions to access file data. NOVA uses
+    copy-on-write to modify file pages by default, unless inplace data update is
+    enabled at mount-time. There are also functions to update and verify the
+    file data integrity information.
+
+  * dir.c
+    Contains functions to create, update, and remove NOVA dentries.
+
+  * file.c
+    Implements file-related operations such as open, fallocate, llseek, fsync,
+    and flush.
+
+  * gc.c
+    NOVA's garbage collection functions. 
+
+  * inode.{h,c}
+    Creates, reads, and frees NOVA inode tables and inodes.
+
+  * ioctl.c
+    Implements some ioctl commands to call NOVA's internal functions.
+
+  * journal.{h,c}
+    For operations that affect multiple inodes, NOVA uses lightweight,
+    fixed-length journals – one per core. This file contains functions to
+    create and manage the lite journals.
+
+  * log.{h,c}
+    Functions to manipulate NOVA inode logs, including log page allocation, log
+    entry creation, commit, modification, and deletion.
+
+  * mprotect.{h,c}
+    Implements inline functions to enable/disable writing to different NOVA
+    data structures.
+    
+  * namei.c
+    Functions to create/remove files, directories, and links. It also looks for
+    the NOVA inode number for a given path name.
+
+  * parity.c
+    Functions to compute file page parity bits. Each file page is striped into
+    equally sized segments (or strips), and one parity strip is calculated using
+    the RAID-5 method. A function to restore a broken data strip is also
+    implemented in this file.
+
+  * perf.{h,c}
+    Function performance measurements. Defines function IDs and call
+    prototypes, and measures the performance of primitive functions,
+    including memory copy functions for DRAM and NVMM, checksum functions,
+    and XOR parity functions.
+
+  * rebuild.c
+    When mounting NOVA after a crash, rebuilds NOVA inodes from its logs. There
+    are also functions to re-calculate checksums and parity bits for file pages
+    that were mmapped during the crash.
+
+  * snapshot.{h,c}
+    Code and data structures for taking snapshots.
+    
+  * stats.h
+    Defines data structures and macros used to gather NOVA usage
+    statistics.
+
+  * stats.c
+    Implements routines to gather and print NOVA usage statistics.
+
+  * super.{h,c}
+    Super block structures and the Nova FS layout, plus entry points for
+    mounting and unmounting NOVA and for initializing or recovering the super
+    block and other global file system information.
+
+  * symlink.c
+    Implements functions to create and read symbolic links in the filesystem.
+
+  * sysfs.c
+    Implements sysfs entries to take user inputs for taking snapshots, printing
+    NOVA statistics, and measuring function performance.
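
The parity scheme described for parity.c above (equally sized strips plus one parity strip, with the ability to restore a broken strip) can be illustrated with plain XOR parity. This is a minimal sketch, not code from the patch: the function names, the strip count, and the choice to pass the parity strip as a separate buffer are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helpers, not NOVA's API.  A "page" is treated as nstrips
 * consecutive strips of strip_size bytes each.
 */

/* Compute the parity strip as the byte-wise XOR of all data strips. */
static void compute_parity(const uint8_t *page, size_t strip_size,
			   size_t nstrips, uint8_t *parity)
{
	memset(parity, 0, strip_size);
	for (size_t s = 0; s < nstrips; s++)
		for (size_t i = 0; i < strip_size; i++)
			parity[i] ^= page[s * strip_size + i];
}

/*
 * Rebuild strip `bad` in place: XOR of the parity strip with every
 * surviving data strip yields the missing strip.
 */
static void restore_strip(uint8_t *page, size_t strip_size, size_t nstrips,
			  size_t bad, const uint8_t *parity)
{
	uint8_t *dst = page + bad * strip_size;

	memcpy(dst, parity, strip_size);
	for (size_t s = 0; s < nstrips; s++) {
		if (s == bad)
			continue;
		for (size_t i = 0; i < strip_size; i++)
			dst[i] ^= page[s * strip_size + i];
	}
}
```

A single parity strip can repair only one damaged strip per page, so a scheme like this needs some way to identify which strip is bad; the per-strip checksums mentioned for checksum.c would be the natural candidate, though that pairing is an inference here.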
+
+
+FS Layout
+=========
+
+A Nova file system resides in a single PMEM device. Nova divides the device into
+4KB blocks.
+
+ block
++-----------------------------------------------------+
+|  0  | primary super block (struct nova_super_block) |
++-----------------------------------------------------+
+|  1  | Reserved inodes                               |
++-----------------------------------------------------+
+|  2  | reserved                                      |
++-----------------------------------------------------+
+|  3  | Journal pointers                              |
++-----------------------------------------------------+
+| 4-5 | Inode pointer tables                          |
++-----------------------------------------------------+
+|  6  | reserved                                      |
++-----------------------------------------------------+
+|  7  | reserved                                      |
++-----------------------------------------------------+
+| ... | data pages                                    |
++-----------------------------------------------------+
+| n-2 | replica reserved Inodes                       |
++-----------------------------------------------------+
+| n-1 | replica super block                           |
++-----------------------------------------------------+
+
+
+
+Superblock and Associated Structures
+====================================
+
+The beginning of the PMEM device holds the super block and its associated
+tables.  These include reserved inodes, a table of pointers to the journals
+Nova uses for complex operations, and pointers to inode tables.  Nova
+maintains replicas of the super block and reserved inodes in the last two
+blocks of the PMEM area.
+
+
+Block Allocator/Free Lists
+==========================
+
+Nova uses per-CPU allocators to manage free PMEM blocks.  On initialization,
+NOVA divides the range of blocks in the PMEM device among the CPUs, and each
+CPU solely manages its own range.  We call these ranges "allocation regions".
+
+Some of the blocks in an allocation region have fixed roles.  Here's the
+layout:
+
++-------------------------------+
+| data checksum blocks          |
++-------------------------------+
+| data parity blocks            |
++-------------------------------+
+|                               |
+| Allocatable blocks            |
+|                               |
++-------------------------------+
+| replica data parity blocks    |
++-------------------------------+
+| replica data checksum blocks  |
++-------------------------------+
+
+The first and last allocation regions also contain the super block, inode
+tables, etc. and their replicas, respectively.
+
+Each allocator maintains a red-black tree of unallocated ranges (struct
+nova_range_node).
+
+Allocation Functions
+--------------------
+
+Nova allocates PMEM blocks using two mechanisms:
+
+1.  Static allocation as defined in super.h
+
+2.  Allocation for log and data pages via nova_new_log_blocks() and
+nova_new_data_blocks().
+
+Both of these functions allow the caller to control whether the allocator
+prefers higher addresses for allocation or lower addresses.  We use this to
+encourage metadata structures and their replicas to be far from one another.
+
+PMEM Address Translation
+------------------------
+
+In Nova's persistent data structures, memory locations are given as offsets
+from the beginning of the PMEM region.  nova_get_block() translates offsets to
+PMEM addresses.  nova_get_addr_off() performs the reverse translation.
+
+
+Inodes
+======
+
+Nova maintains per-CPU inode tables, and inode numbers are striped across the
+tables (i.e., inos 0, n, 2n,... on cpu 0; inos 1, n + 1, 2n + 1, ... on cpu 1).
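
The striping rule can be written as a pair of integer operations. This is an illustrative sketch only; the helper names are hypothetical and `ncpus` corresponds to the n used in the text.

```c
#include <stdint.h>

/* Hypothetical helpers; only the mapping rule comes from the text above. */

/* Which per-CPU inode table holds inode number `ino`. */
static inline uint32_t ino_to_cpu(uint64_t ino, uint32_t ncpus)
{
	return (uint32_t)(ino % ncpus);
}

/* Position of `ino` within that CPU's table (0 for the first, 1 for the next). */
static inline uint64_t ino_to_slot(uint64_t ino, uint32_t ncpus)
{
	return ino / ncpus;
}

/* Inverse mapping: the inode number stored at (cpu, slot). */
static inline uint64_t slot_to_ino(uint32_t cpu, uint64_t slot, uint32_t ncpus)
{
	return slot * ncpus + cpu;
}
```

With four CPUs, for instance, inode 9 maps to CPU 1, slot 2.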
+
+The inodes themselves live in a set of linked lists (one per CPU) of 2MB
+blocks.  The last 8 bytes of each block point to the next block.  Pointers to
+the heads of these lists live in PMEM block INODE_TABLE0_START and are
+replicated in PMEM block INODE_TABLE1_START.  Additional space for inodes is
+allocated on demand.
+
+To allocate inodes, Nova maintains a per-CPU "inuse_list" in DRAM, which holds
+a red-black tree of ranges of unallocated inode numbers.
+
+Logs
+====
+
+Nova maintains a log for each inode that records updates to the inode's
+metadata and holds pointers to the file data.  Nova makes updates to file data
+and metadata atomic by atomically appending log entries to the log.
+
+Each inode contains pointers to the head and tail of the inode's log.  When the
+log grows past the end of the last page, Nova allocates additional space.  For
+short logs (less than 1MB), it doubles the length.  For longer logs, it adds a
+fixed amount of additional space (1MB).
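
The growth policy can be sketched in terms of 4KB log pages, where 1MB corresponds to 256 pages. This is an illustration under stated assumptions, not the patch's code: the macro and function names are hypothetical, and starting an empty log with a single page is an assumption.

```c
#include <stdint.h>

#define LOG_PAGE_SIZE	 4096u				/* 4KB log pages */
#define LOG_EXTEND_LIMIT ((1u << 20) / LOG_PAGE_SIZE)	/* 1MB = 256 pages */

/* Number of pages to add to a log that currently has cur_pages pages. */
static inline uint32_t log_pages_to_add(uint32_t cur_pages)
{
	if (cur_pages == 0)
		return 1;		/* assumption: a new log starts with one page */
	if (cur_pages < LOG_EXTEND_LIMIT)
		return cur_pages;	/* short log: double the length */
	return LOG_EXTEND_LIMIT;	/* long log: add a fixed 1MB */
}
```

Doubling keeps the number of extensions small for short logs, while the 1MB cap bounds how much space a single extension can claim for a long one.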
+
+Log space is reclaimed during garbage collection.
+
+Log Entries
+-----------
+
+There are eight kinds of log entry, documented in log.h.  The log entries have
+several fields in common:
+
+   1.  'epoch_id' gives the epoch during which the log entry was created.
+   Creating a snapshot increments the epoch_id for the file system.
+
+   2.  'trans_id' is a filesystem-wide, monotonically increasing number assigned
+   to each log entry.  It provides an ordering over all FS operations.
+
+   3.  'invalid' is true if the effects of this entry are dead and the log
+   entry can be garbage collected.
+
+   4.  'csum' is a CRC32 checksum for the entry.
+
+Log structure
+-------------
+
+The logs comprise a linked list of PMEM blocks.  The tail of each block
+contains some metadata about the block and pointers to the next block and
+the block's replica (struct nova_inode_page_tail).
+
++----------------+
+| log entry      |
++----------------+
+| log entry      |
++----------------+
+| ...            |
++----------------+
+| tail           |
+|  metadata      |
+|  -> next block |
++----------------+
+
+
+Journals
+========
+
+Nova uses a lightweight journaling mechanism to provide atomicity for
+operations that modify more than one inode.  The journals provide logging
+for two operations:
+
+1.  Single word updates (JOURNAL_ENTRY)
+2.  Copying inodes (JOURNAL_INODE)
+                                                  
+The journals are undo logs: Nova creates the journal entries for an operation,
+and if the operation does not complete due to a system failure, the recovery
+process rolls back the changes using the journal entries.
+
+To commit, Nova drops the log.
+
+Nova maintains one journal per CPU.  The head and tail pointers for each
+journal live in a reserved page near the beginning of the file system.  
+
+During recovery, Nova scans the journals and undoes the operations described by
+each entry.
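+
+The undo-log behavior described above can be sketched for the single-word
+case as follows.  This is a DRAM-only illustration under stated assumptions:
+struct undo_journal, JOURNAL_SLOTS, and the function names are hypothetical,
+and a real PMEM journal would also need to persist each entry before
+performing the store it protects.
+
+```c
+#include <assert.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#define JOURNAL_SLOTS 8            /* hypothetical capacity */
+
+struct undo_entry {
+    uint64_t *addr;                /* word that is about to be modified */
+    uint64_t  old;                 /* value to restore on rollback */
+};
+
+struct undo_journal {
+    struct undo_entry slots[JOURNAL_SLOTS];
+    size_t head;                   /* first live entry */
+    size_t tail;                   /* one past the last live entry */
+};
+
+/* Record the old value of *addr, then perform the update. */
+static void journal_update(struct undo_journal *j, uint64_t *addr,
+                           uint64_t val)
+{
+    assert(j->tail < JOURNAL_SLOTS);
+    j->slots[j->tail].addr = addr;
+    j->slots[j->tail].old = *addr;
+    j->tail++;
+    *addr = val;
+}
+
+/* Commit: drop the live entries so recovery has nothing to undo. */
+static void journal_commit(struct undo_journal *j)
+{
+    j->head = j->tail;
+}
+
+/* Recovery: undo the live entries, newest first. */
+static void journal_recover(struct undo_journal *j)
+{
+    while (j->tail > j->head) {
+        j->tail--;
+        *j->slots[j->tail].addr = j->slots[j->tail].old;
+    }
+}
+```
+
+Rolling back newest-first matters when the same word is journaled twice in
+one operation: the oldest recorded value is the one that ends up restored.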
+
+
+File and Directory Access
+=========================
+
+To access file data via read(), Nova maintains a radix tree in DRAM for each
+inode (nova_inode_info_header.tree) that maps file offsets to write log
+entries.  For directories, the same tree maps a hash of filenames to their
+corresponding dentry.
+
+In both cases, Nova populates the tree when the file or directory is opened
+by scanning its log.
+
+MMap and DAX
+============
+
+NOVA leverages the kernel's DAX mechanisms for mmap and file data access.  Nova
+maintains a red-black tree in DRAM (nova_inode_info_header.vma_tree) to track
+which portions of a file have been mapped.
+
+Garbage Collection
+==================
+
+Nova recovers log space with a two-phase garbage collection system.  When a log
+reaches the end of its allocated pages, Nova allocates more space.  Then, the
+fast GC algorithm scans the log to remove pages that have no valid entries.
+Then, it estimates how many pages the log's valid entries would fill.  If this
+is less than half the number of pages in the log, the second GC phase copies
+the valid entries to new pages.
+
+For example (V=valid; I=invalid):
+
++---+          +---+	        +---+
+| I |	       | I |  	      	| V |
++---+	       +---+  Thorough	+---+
+| V |	       | V |  	 GC   	| V |
++---+	       +---+   =====> 	+---+
+| I |	       | I |  	      	| V |
++---+	       +---+	        +---+
+| V |	       | V |  	        | V |
++---+	       +---+            +---+	
+  |	         |	       
+  V	         V             
++---+	       +---+ 	       
+| I |	       | V | 	       
++---+	       +---+ 	       
+| I | fast GC  | I | 	       
++---+  ====>   +---+ 	       
+| I |	       | I | 	       
++---+	       +---+ 	       
+| I |	       | V | 	       
++---+	       +---+ 	       
+  |	       	
+  V	       	
++---+	       	
+| V |	       	
++---+	       	
+| I |	       	
++---+	       	
+| I |	       	
++---+	       	
+| V |	       	
++---+            
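+
+The two phases can be sketched as below, with each log page reduced to a
+count of its valid entries.  The function names are hypothetical, and
+ENTRIES_PER_PAGE is set to 4 only to match the four-slot pages drawn in the
+diagram.  Which page count the "less than half" test is measured against is
+left to the caller here, since the text does not pin that down.
+
+```c
+#include <assert.h>
+#include <stddef.h>
+
+#define ENTRIES_PER_PAGE 4   /* matches the diagram, not the real layout */
+
+/*
+ * Fast GC: drop pages that hold no valid entries.  valid[i] is the number
+ * of valid entries in page i; the array is compacted in place and the
+ * number of surviving pages is returned.
+ */
+static size_t fast_gc(unsigned valid[], size_t npages)
+{
+    size_t kept = 0;
+
+    for (size_t i = 0; i < npages; i++) {
+        assert(valid[i] <= ENTRIES_PER_PAGE);
+        if (valid[i] > 0)
+            valid[kept++] = valid[i];
+    }
+    return kept;
+}
+
+/*
+ * Thorough GC is worthwhile when the valid entries, packed densely, would
+ * fill fewer than half of log_pages.
+ */
+static int thorough_gc_worthwhile(size_t valid_entries, size_t log_pages)
+{
+    size_t needed = (valid_entries + ENTRIES_PER_PAGE - 1) / ENTRIES_PER_PAGE;
+
+    return 2 * needed < log_pages;
+}
+```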
+
+
+Replication and Checksums
+=========================
+
+Nova protects data and metadata from corruption due to media errors and
+"scribbles" -- software errors in the kernel that may overwrite Nova data.
+
+Replication
+-----------
+
+Nova replicates all PMEM metadata structures (there are a few exceptions, which
+are WIP).  For each structure, there is a primary and an "alternate" (denoted
+as "alter" in the code).  To ensure that Nova can recover a consistent copy of
+the data in case of a failure, Nova first updates the primary and issues a
+persist barrier to ensure that the data is written to NVMM.  Then it does the
+same for the alternate.
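+
+A minimal sketch of that update order follows.  struct meta and
+replicated_update() are hypothetical, and persist_barrier() is only a
+counting stand-in for the cache-line flush and fence that real PMEM code
+would issue.
+
+```c
+#include <assert.h>
+#include <stdint.h>
+#include <string.h>
+
+/* Hypothetical metadata structure, used only for illustration. */
+struct meta {
+    uint64_t size;
+    uint64_t mtime;
+};
+
+static unsigned barriers;    /* counts barriers so ordering can be checked */
+
+/* Stand-in for a persist barrier; real code would flush and fence. */
+static void persist_barrier(void)
+{
+    barriers++;
+}
+
+/* Update the primary, make it durable, then repeat for the alternate. */
+static void replicated_update(struct meta *primary, struct meta *alter,
+                              const struct meta *val)
+{
+    assert(primary != alter);
+
+    memcpy(primary, val, sizeof(*val));
+    persist_barrier();       /* primary is durable before alter changes */
+
+    memcpy(alter, val, sizeof(*val));
+    persist_barrier();
+}
+```
+
+Because the two copies are never modified in the same barrier interval, a
+crash at any point leaves at least one of them in a consistent state.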
+
+Detection
+---------
+
+Nova uses two techniques to detect data corruption.  For media errors, Nova
+should always use memcpy_from_pmem() to read data from PMEM, usually by
+copying the PMEM data structure into DRAM.
+
+To detect software-caused corruption, Nova uses CRC32 checksums.  All the PMEM
+data structures in Nova include a csum field for this purpose.  Nova also
+computes CRC32 checksums for each 512-byte slice of each data page.
+
+The checksums are stored in dedicated pages in each CPU's allocation region.
+
+                                                          replica
+                                                 parity   parity 	
+					         page	  page	  
+            +---+---+---+---+---+---+---+---+    +---+    +---+       
+data page 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |    | 0 |    | 0 |  	
+            +---+---+---+---+---+---+---+---+    +---+    +---+  	
+data page 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |    | 1 |    | 1 |  	
+            +---+---+---+---+---+---+---+---+    +---+    +---+  	
+data page 2 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |    | 0 |    | 0 |  	
+            +---+---+---+---+---+---+---+---+    +---+    +---+  	
+data page 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |    | 0 |    | 0 |  	
+            +---+---+---+---+---+---+---+---+    +---+    +---+  	
+    ...                    ...                    ...      ...   
+
+Recovery
+--------
+
+Nova uses replication to support recovery of metadata structures and
+RAID4-style parity to recover corrupted data.
+
+If Nova detects corruption of a metadata structure, it restores the structure
+using the replica.
+
+If it detects a corrupt slice of a data page, it uses RAID4-style recovery to
+restore it.  The CRC32 checksums for the page slices are replicated.
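+
+The parity recovery step can be illustrated with plain XOR over the slices of
+one page.  This sketch assumes a 4KB page split into eight 512-byte slices
+with a single parity slice per page; the function names are hypothetical and
+the real on-media layout may differ.
+
+```c
+#include <assert.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <string.h>
+
+#define SLICE_SIZE      512
+#define SLICES_PER_PAGE 8
+#define PAGE_BYTES      (SLICE_SIZE * SLICES_PER_PAGE)
+
+/* parity[i] becomes the XOR of byte i of every slice in the page. */
+static void compute_parity(const uint8_t *page, uint8_t *parity)
+{
+    memset(parity, 0, SLICE_SIZE);
+    for (size_t s = 0; s < SLICES_PER_PAGE; s++)
+        for (size_t i = 0; i < SLICE_SIZE; i++)
+            parity[i] ^= page[s * SLICE_SIZE + i];
+}
+
+/* Rebuild slice 'bad' from the parity slice and the surviving slices. */
+static void recover_slice(uint8_t *page, const uint8_t *parity, size_t bad)
+{
+    uint8_t *dst = page + bad * SLICE_SIZE;
+
+    assert(bad < SLICES_PER_PAGE);
+    memcpy(dst, parity, SLICE_SIZE);
+    for (size_t s = 0; s < SLICES_PER_PAGE; s++) {
+        if (s == bad)
+            continue;
+        for (size_t i = 0; i < SLICE_SIZE; i++)
+            dst[i] ^= page[s * SLICE_SIZE + i];
+    }
+}
+```
+
+A single parity slice can rebuild only one bad slice per page, and the
+per-slice checksums are what identify which slice that is.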
+
+Cautious allocation
+-------------------
+
+To maximize its resilience to software scribbles, Nova allocates metadata
+structures and their replicas far from one another.  It tries to allocate the
+primary copy at a low address and the replica at a high address within the PMEM
+region.
+
+Write Protection
+----------------
+
+Finally, Nova can prevent unintended writes to PMEM by mapping the entire
+PMEM device as read-only and then disabling _all_ write protection by clearing
+the WP bit in the CR0 control register when Nova needs to perform a write.  The
+wprotect mount-time option controls this behavior.
+
+To map the PMEM device as read-only, we have added a readonly module command
+line option to nd_pmem.  There is probably a better approach to achieving this
+goal. 
+
+Unsafe modes
+============
+
+Nova supports modes that disable some of the protections it provides in order
+to improve performance.
+
+File data
+---------
+
+Nova can disable parity and/or checksums on file data (options 'data_parity=0'
+and 'data_csum=0').  Without parity, Nova can detect but not recover from
+data corruption.  Without checksums, Nova will still detect and recover from
+media errors, but not scribbles.
+
+Nova also supports in-place file updates (option: inplace_data_updates=1).
+This breaks atomicity for writes, but improves performance, especially for
+sub-page writes, since these require a full page COW in the default mode.
+
+Metadata
+--------
+
+Nova can disable metadata checksums and replication (option 'metadata_csum=0').
+
+
+Snapshots
+=========
+
+Nova supports snapshots to facilitate backups.
+
+Taking a snapshot
+-----------------
+
+Each Nova file system has a current epoch_id in the super block and each log
+entry has the epoch_id attached to it at creation.  When the user creates a
+snapshot, Nova increments the epoch_id for the file system and the old epoch_id
+identifies the moment the snapshot was taken.
+
+Nova records the epoch_id and a timestamp in a new log entry (struct
+snapshot_info_log_entry) and appends it to the log of the reserved snapshot
+inode (NOVA_SNAPSHOT_INODE) in the superblock.
+
+Nova also maintains a radix tree (nova_sb_info.snapshot_info_tree) of struct
+snapshot_info in DRAM indexed by epoch_id.
+
+Nova also marks all mmap'd pages as read-only and uses COW to preserve file
+contents after the snapshot.
+
+Tracking Live Data
+------------------
+
+Supporting snapshots requires Nova to preserve file contents from previous
+snapshots while also being able to recover the space a snapshot occupied after
+its deletion.
+
+Preserving file contents requires a small change to how Nova implements write
+operations.  To perform a write, Nova appends a write log entry to the file's
+log.  The log entry includes pointers to newly-allocated and populated NVMM
+pages that hold the written data.  If the write overwrites existing data, Nova
+locates the previous write log entry for that portion of the file, and performs
+an "epoch check" that compares the old log entry's epoch_id to the file
+system's current epoch_id.  If the comparison matches, the old write log entry
+and the file data blocks it points to no longer belong to any snapshot, and
+Nova reclaims the data blocks.
+
+If the epoch_ids do not match, then the data in the old log entry belongs to
+an earlier snapshot and Nova leaves the log entry in place.
+
+Determining when to reclaim data belonging to deleted snapshots requires
+additional bookkeeping.  For each snapshot, Nova maintains a "snapshot log"
+that records the inodes and blocks that belong to that snapshot, but are not
+part of the current file system image.
+
+Nova populates the snapshot log during the epoch check: If the epoch_ids for
+the new and old log entries do not match, it appends a log entry (either struct
+snapshot_inode_entry or struct snapshot_file_write_entry) to the snapshot log
+that the old log entry belongs to.  The log entry contains a pointer to the old
+log entry, and the filesystem's current epoch_id as the delete_epoch_id.
+
+To delete a snapshot, Nova removes the snapshot from the list of live snapshots
+and appends its log to the following snapshot's log.  Then, a background thread
+traverses the combined log and reclaims dead inode/data based on the delete
+epoch_id: If the delete epoch_id for an entry in the log is less than or equal
+to the snapshot's epoch_id, it means the log entry and/or the associated data
+blocks are now dead.
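+
+The two comparisons described in this section can be written down directly.
+The enum and function names below are hypothetical; only the comparisons
+themselves are taken from the text.
+
+```c
+#include <assert.h>
+#include <stdint.h>
+
+enum overwrite_action {
+    RECLAIM_NOW,         /* old data belongs to no snapshot */
+    KEEP_FOR_SNAPSHOT    /* old data belongs to an earlier snapshot */
+};
+
+/* Epoch check performed when a write overwrites existing data. */
+static enum overwrite_action epoch_check(uint64_t old_entry_epoch,
+                                         uint64_t current_epoch)
+{
+    assert(old_entry_epoch <= current_epoch);  /* epochs only increase */
+
+    return old_entry_epoch == current_epoch ? RECLAIM_NOW
+                                            : KEEP_FOR_SNAPSHOT;
+}
+
+/*
+ * Test applied to each entry while traversing the combined snapshot log
+ * after a snapshot has been deleted.
+ */
+static int snapshot_entry_is_dead(uint64_t delete_epoch_id,
+                                  uint64_t snapshot_epoch_id)
+{
+    return delete_epoch_id <= snapshot_epoch_id;
+}
+```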
+
+Snapshots and DAX
+-----------------
+
+Taking consistent snapshots while applications are modifying files using
+DAX-style mmap requires NOVA to reckon with the order in which stores to NVMM
+become persistent (i.e., reach physical NVMM so they will survive a system
+failure).  These applications rely on the processor's "memory persistence
+model" [http://dl.acm.org/citation.cfm?id=2665671.2665712] to make guarantees
+about when and in what order stores become persistent.  These guarantees allow
+applications to restore their data to a consistent state during recovery
+from a system failure.
+
+From the application's perspective, reading a snapshot is equivalent to
+recovering from a system failure.  In both cases, the contents of the
+memory-mapped file reflect its state at a moment when application operations
+might be in-flight and when the application had no chance to shut down cleanly.
+
+A naive approach to checkpointing mmap()'d files in NOVA would simply mark each
+of the read/write mapped pages as read-only and then do copy-on-write when a
+store occurs to preserve the old pages as part of the snapshot.
+
+However, this approach can leave the snapshot in an inconsistent state:
+Setting the page to read-only captures its contents for the snapshot, and the
+kernel requires NOVA to set the pages as read-only one at a time.  So, if the
+order in which NOVA marks pages as read-only is incompatible with the ordering
+that the application requires, the snapshot will contain an inconsistent
+version of the file.
+
+To resolve this problem, when NOVA starts marking pages as read-only, it blocks
+page faults to the read-only mmap()'d pages until it has marked all the pages
+read-only and finished taking the snapshot.
+
+More detail is available in the technical report referenced at the top of this
+document.
+
+We have implemented this functionality in NOVA by adding the 'original_write'
+flag to struct vm_area_struct that tracks whether the vm_area_struct is created
+with write permission, but has been marked read-only in the course of taking a
+snapshot.  We have also added a 'dax_cow' operation to struct
+vm_operations_struct that the page fault handler runs when applications write
+to a page with original_write = 1.  NOVA's dax_cow operation
+(nova_restore_page_write()) performs the COW, maps the page to a new physical
+page and allows writing.
+
+Saving Snapshot State
+---------------------
+
+During a clean shutdown, Nova stores the snapshot information to PMEM.
+
+Nova reserves an inode for storing snapshot information.  The log for the inode
+contains an entry for each snapshot (struct snapshot_info_log_entry).  On
+shutdown, Nova allocates one page (struct snapshot_nvmm_page) to store an array
+of struct snapshot_nvmm_list.
+
+Each of these lists (one per CPU) contains head and tail pointers to a linked
+list of blocks (just like an inode log).  The lists contain a struct
+snapshot_file_write_entry or struct snapshot_inode_entry for each operation
+that modified file data or an inode.
+
+Superblock
++--------------------+
+|   ...              |
++--------------------+
+| Reserved Inodes    |
++---+----------------+
+|   |     ...        |
++---+----------------+
+| 7 | Snapshot Inode |
+|   | head           |
++---+----------------+
+        /
+       /
+      / 
++---------+---------+---------+
+|  Snap   |  Snap   |  Snap   |
+| epoch=1 | epoch=4 | epoch=11|
+|         |         |         |
+|nvmm_page|nvmm_page|nvmm_page|
++---------+---------+---------+
+     |
+     |
++----------+   +--------+--------+
+|  cpu 0   |   | snap 	| snap   |	
+|   head   |-->| inode	| write	 |
+|          |   | entry  | entry  |      
+|          |   +--------+--------+
++----------+   +--------+--------+
+|  cpu 1   |   | snap 	| snap   |
+|   head   |-->| write	| write	 |
+|          |   | entry  | entry  |
+|          |   +--------+--------+
++----------+ 
+|    ...   | 
++----------+   +--------+
+|  cpu 128 |   | snap 	|
+|   head   |-->| inode	|
+|          |   | entry  |
+|          |   +--------+
++----------+
+
+
+Umount and Recovery
+===================
+
+Clean umount/mount
+------------------
+
+On a clean unmount, Nova saves the contents of many of its DRAM data structures
+to PMEM to accelerate the next mount:
+
+1. Nova stores the allocator state for each of the per-cpu allocators to the
+   log of a reserved inode (NOVA_BLOCK_NODE_INO).
+    
+2. Nova stores the per-CPU lists of available inodes (the inuse_list) to the
+   NOVA_BLOCK_INODELIST1_INO reserved inode.
+
+3. Nova stores the snapshot state to PMEM as described above.
+
+After a clean unmount, the following mount restores these data and then
+invalidates them.
+
+Recovery after failures
+------------------------
+
+In case of an unclean unmount (e.g., system crash), Nova must rebuild these
+DRAM structures by scanning the inode logs.  Nova log scanning is fast because
+per-CPU inode tables and per-inode logs allow for parallel recovery.
+
+The number of live log entries in an inode log is roughly the number of extents
+in the file.  As a result, Nova only needs to scan a small fraction of the NVMM
+during recovery.
+
+The Nova failure recovery consists of two steps:
+
+First, Nova checks its lightweight journals and rolls back any uncommitted
+transactions to restore the file system to a consistent state.
+
+Second, Nova starts a recovery thread on each CPU and scans the inode tables in
+parallel, performing log scanning for every valid inode in the inode table.
+Nova uses different recovery mechanisms for directory inodes and file inodes:
+For a directory inode, Nova scans the log's linked list to enumerate the pages
+it occupies, but it does not inspect the log's contents.  For a file inode,
+Nova reads the write entries in the log to enumerate the data pages.
+
+During the recovery scan, Nova builds a bitmap of occupied pages and rebuilds
+the allocator based on the result.  After this process completes, the file
+system is ready to accept new requests.
+
+During the same scan, it rebuilds the snapshot information and the list of
+available inodes.
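+
+The bitmap step of the recovery scan can be sketched as follows.  NPAGES,
+struct extent, and the function names are hypothetical: each extent stands
+for the run of pages named by one log entry, and the device is shrunk to 64
+pages to keep the example small.
+
+```c
+#include <assert.h>
+#include <limits.h>
+#include <stddef.h>
+
+#define NPAGES        64     /* hypothetical device size in pages */
+#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
+#define BITMAP_WORDS  ((NPAGES + BITS_PER_WORD - 1) / BITS_PER_WORD)
+
+/* Run of pages referenced by one log entry. */
+struct extent {
+    size_t start;
+    size_t len;
+};
+
+/* Mark every page of the extent as occupied. */
+static void mark_extent(unsigned long *bitmap, struct extent e)
+{
+    assert(e.start + e.len <= NPAGES);
+    for (size_t p = e.start; p < e.start + e.len; p++)
+        bitmap[p / BITS_PER_WORD] |= 1UL << (p % BITS_PER_WORD);
+}
+
+/* Pages left unmarked after the scan go back to the allocator. */
+static size_t count_free(const unsigned long *bitmap)
+{
+    size_t free_pages = 0;
+
+    for (size_t p = 0; p < NPAGES; p++)
+        if (!(bitmap[p / BITS_PER_WORD] & (1UL << (p % BITS_PER_WORD))))
+            free_pages++;
+    return free_pages;
+}
+```
+
+Setting bits is idempotent, so pages that the scan visits more than once are
+still counted once.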
+
diff --git a/MAINTAINERS b/MAINTAINERS
index 767e9d202adf..cfcee556acc6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9108,6 +9108,14 @@ F:	drivers/power/supply/bq27xxx_battery_i2c.c
 F:	drivers/power/supply/isp1704_charger.c
 F:	drivers/power/supply/rx51_battery.c
 
+NOVA FILE SYSTEM
+M:	Andiry Xu <jix024@cs.ucsd.edu>
+M:	Steven Swanson <swanson@cs.ucsd.edu>
+L:	linux-fsdevel@vger.kernel.org
+L:	linux-nvdimm@lists.01.org
+F:	Documentation/filesystems/nova.txt
+F:	fs/nova/
+
 NTB DRIVER CORE
 M:	Jon Mason <jdmason@kudzu.us>
 M:	Dave Jiang <dave.jiang@intel.com>
diff --git a/README.md b/README.md
new file mode 100644
index 000000000000..4f778e99a79e
--- /dev/null
+++ b/README.md
@@ -0,0 +1,173 @@
+# NOVA: NOn-Volatile memory Accelerated log-structured file system
+
+NOVA's goal is to provide a high-performance, full-featured, production-ready
+file system tailored for byte-addressable non-volatile memories (e.g., NVDIMMs
+and Intel's soon-to-be-released 3DXpoint DIMMs).  It combines design elements
+from many other file systems to provide a combination of high-performance,
+strong consistency guarantees, and comprehensive data protection.  NOVA supports
+DAX-style mmap, and making DAX perform well is a first-order priority in NOVA's
+design.  NOVA was developed by the [Non-Volatile Systems Laboratory][NVSL] in
+the [Computer Science and Engineering Department][CSE] at the [University of
+California, San Diego][UCSD].
+
+
+NOVA is primarily a log-structured file system, but rather than maintain a
+single global log for the entire file system, it maintains separate logs for
+each file (inode).  NOVA breaks the logs into 4KB pages, which need not be
+contiguous in memory.  The logs only contain metadata.
+
+File data pages reside outside the log, and log entries for write operations
+point to data pages they modify.  File modification uses copy-on-write (COW) to
+provide atomic file updates.
+
+For file operations that involve multiple inodes, NOVA uses small, fixed-sized
+redo logs to atomically append log entries to the logs of the inodes involved.
+
+This structure keeps logs small and makes garbage collection very fast.  It also
+enables enormous parallelism during recovery from an unclean unmount, since
+threads can scan logs in parallel.
+
+NOVA replicates and checksums all metadata structures and protects file data
+with RAID-4-style parity.  It supports checkpoints to facilitate backups.
+
+A more thorough discussion of NOVA's design is available in these two papers:
+
+**NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories** 
+[PDF](http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf)<br>
+*Jian Xu and Steven Swanson*<br>
+Published in [FAST 2016][FAST2016]
+
+**Hardening the NOVA File System**
+[PDF](http://cseweb.ucsd.edu/~swanson/papers/TechReport2017HardenedNOVA.pdf)<br>
+*Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff, Steven Swanson*<br>
+UCSD-CSE Techreport CS2017-1018
+
+Read on for further details about NOVA's overall design and its current status.
+
+### Compatibility with Other File Systems
+
+NOVA aims to be compatible with other Linux file systems.  To help verify that it achieves this we run several test suites against NOVA each night.
+
+* The latest version of XFSTests. ([Current failures](https://github.com/NVSL/linux-nova/issues?q=is%3Aopen+is%3Aissue+label%3AXFSTests))
+* The [Linux Test Project](https://linux-test-project.github.io/) file system tests.
+* The [fstest POSIX test suite][POSIXtest].
+
+Currently, nearly all of these tests pass for the `master` branch, and we have
+run complex programs on NOVA.  There are, of course, many bugs left to fix.
+
+NOVA uses the standard PMEM kernel interfaces for accessing and managing
+persistent memory.
+
+### Atomicity
+
+By default, NOVA makes all metadata and file data operations atomic.
+
+Strong atomicity guarantees make it easier to build reliable applications on
+NOVA, and NOVA can provide these guarantees without sacrificing much
+performance because NVDIMMs support very fast random access.
+
+NOVA also supports "unsafe data" and "unsafe metadata" modes that
+improve performance in some cases and allow for non-atomic updates of file
+data and metadata, respectively.
+
+### Data Protection
+
+NOVA aims to protect data against both misdirected writes in the kernel (which
+can easily "scribble" over the contents of an NVDIMM) as well as media errors.
+
+NOVA protects all of its metadata data structures with a combination of
+replication and checksums.  It protects file data using RAID-5 style parity.
+
+NOVA can detect data corruption by verifying checksums on each access and by
+catching and handling machine check exceptions (MCEs) that arise when the
+system's memory controller detects an uncorrectable media error.
+
+We use a fault injection tool that allows testing of these recovery mechanisms.
+
+To facilitate backups, NOVA can take snapshots of the current filesystem state
+that can be mounted read-only while the current file system is mounted
+read-write.
+
+The tech report listed above describes the design of NOVA's data protection system in detail.
+
+### DAX Support
+
+Supporting DAX efficiently is a core feature of NOVA, and one of the challenges
+in designing NOVA is reconciling DAX support, which aims to avoid file system
+intervention when file data changes, with other features that require such
+intervention.
+
+NOVA's philosophy with respect to DAX is that when a program uses DAX mmap to
+modify a file, the program must take full responsibility for that data and
+NOVA must ensure that the memory will behave as expected.  At other times, the
+file system provides protection.  This approach has several implications:
+
+1. Implementing `msync()` in user space works fine.
+
+2. While a file is mmap'd, it is not protected by NOVA's RAID-style parity
+mechanism, because protecting it would be too expensive.  When the file is
+unmapped and/or during file system recovery, protection is restored.
+
+3. The snapshot mechanism must be careful about the order in which it adds
+pages to the file's snapshot image.
+
+### Performance
+
+The research paper and technical report referenced above compare NOVA's
+performance to other file systems.  In almost all cases, NOVA outperforms other
+DAX-enabled file systems.  A notable exception is sub-page updates which incur
+COW overheads for the entire page.
+
+The technical report also illustrates the trade-offs between our protection
+mechanisms and performance.
+
+## Gaps, Missing Features, and Development Status
+
+Although NOVA is a fully-functional file system, there is still much work left
+to be done.  In particular, (at least) the following items are currently missing:
+
+1.  There is no mkfs or fsck utility (`mount` takes `-o init` to create a NOVA file system)
+2.  NOVA doesn't scrub data to prevent corruption from accumulating in infrequently accessed data.
+3.  NOVA doesn't read bad block information on mount and attempt recovery of the affected data.
+4.  NOVA only works on x86-64 kernels.
+5.  NOVA does not currently support extended attributes or ACLs.
+6.  NOVA does not currently prevent writes to mounted snapshots.
+7.  Using `write()` to modify pages that are mmap'd is not supported.
+8.  NOVA doesn't provide quota support.
+9.  Moving NOVA file systems between machines with different numbers of CPUs does not work.
+10. Remounting a NOVA file system with different mount options may fail.
+
+None of these are fundamental limitations of NOVA's design.  Additional bugs
+and issues are listed [here](https://github.com/NVSL/linux-nova/issues).
+
+NOVA is complete and robust enough to run a range of complex applications, but
+it is not yet ready for production use.  Our current focus is on adding a few
+missing features listed above and finding/fixing bugs.
+
+## Building and Using NOVA
+
+This repo contains a version of the Linux kernel with NOVA included.  You
+should be able to build and install it just as you would the mainline Linux
+source.
+
+### Building NOVA
+
+To build NOVA, build the kernel with PMEM (`CONFIG_BLK_DEV_PMEM`), DAX (`CONFIG_FS_DAX`) and NOVA (`CONFIG_NOVA_FS`) support.  Install as usual.
+
+## Hacking and Contributing
+
+The NOVA source code is almost completely contained in the `fs/nova` directory.
+The exceptions are some small changes in the kernel's memory management system
+to support checkpointing.
+
+`Documentation/filesystems/nova.txt` describes the internals of Nova in more detail.
+
+If you find bugs, please [report them](https://github.com/NVSL/linux-nova/issues).
+
+If you have other questions or suggestions you can contact the NOVA developers at [cse-nova-hackers@eng.ucsd.edu](mailto:cse-nova-hackers@eng.ucsd.edu).
+
+
+[NVSL]: http://nvsl.ucsd.edu/ "http://nvsl.ucsd.edu"
+[POSIXtest]: http://www.tuxera.com/community/posix-test-suite/ 
+[FAST2016]: https://www.usenix.org/conference/fast16/technical-sessions
+[CSE]: http://cs.ucsd.edu
+[UCSD]: http://www.ucsd.edu
\ No newline at end of file


* [RFC 01/16] NOVA: Documentation
@ 2017-08-03  7:48   ` Steven Swanson
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson, dan.j.williams

A brief overview is in README.md.

Implementation and usage details are in Documentation/filesystems/nova.txt.

These two papers provide a detailed, high-level description of NOVA's design goals and approach:

   NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories (http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf)

   Hardening the NOVA File System (http://cseweb.ucsd.edu/~swanson/papers/TechReport2017HardenedNOVA.pdf)

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 Documentation/filesystems/00-INDEX |    2 
 Documentation/filesystems/nova.txt |  771 ++++++++++++++++++++++++++++++++++++
 MAINTAINERS                        |    8 
 README.md                          |  173 ++++++++
 4 files changed, 954 insertions(+)
 create mode 100644 Documentation/filesystems/nova.txt
 create mode 100644 README.md

diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index b7bd6c9009cc..dc5c72273957 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -95,6 +95,8 @@ nfs/
 	- nfs-related documentation.
 nilfs2.txt
 	- info and mount options for the NILFS2 filesystem.
+nova.txt
+	- info on the NOVA filesystem.
 ntfs.txt
 	- info and mount options for the NTFS filesystem (Windows NT).
 ocfs2.txt
diff --git a/Documentation/filesystems/nova.txt b/Documentation/filesystems/nova.txt
new file mode 100644
index 000000000000..af90da1c3fb1
--- /dev/null
+++ b/Documentation/filesystems/nova.txt
@@ -0,0 +1,771 @@
+The NOVA Filesystem
+===================
+
+NOVA is a DAX file system designed to maximize performance on hybrid DRAM and
+non-volatile main memory (NVMM) systems while providing strong consistency
+guarantees. NOVA adapts conventional log-structured file system techniques to
+exploit the fast random access that NVMs provide. In particular, it maintains
+separate logs for each inode to improve concurrency, and stores file data
+outside the log to minimize log size and reduce garbage collection costs. NOVA's
+logs provide metadata, data, and mmap atomicity and focus on simplicity and
+reliability, keeping complex metadata structures in DRAM to accelerate lookup
+operations.
+
+The main NOVA features include:
+
+  * POSIX semantics
+  * Direct access (DAX) to byte-addressable NVMM without page caching
+  * Per-CPU NVMM pool to maximize concurrency
+  * Strong consistency guarantees with 8-byte atomic stores
+  * Full filesystem snapshot with DAX-mmap support
+  * Checksums on metadata and file data (crc32c)
+  * Full metadata replication and RAID-5 parity per file page
+  * Online filesystem integrity check and corruption recovery
+
+Filesystem Design
+=================
+NOVA divides NVMM into five regions. NOVA's 512 B superblock contains global
+file system information and the recovery inode. The recovery inode represents a
+special file that stores recovery information (e.g., the list of unallocated
+NVMM pages). NOVA divides its inode tables into per-CPU stripes. It also
+provides per-CPU journals for complex file operations that involve multiple
+inodes. The rest of the available NVMM stores logs and file data.
+
+NOVA is log-structured and stores a separate log for each inode to maximize
+concurrency and provide atomicity for operations that affect a single file. The
+logs only store metadata and comprise a linked list of 4 KB pages. Log entries
+are small – between 32 and 64 bytes. Logs are generally non-contiguous, and log
+pages may reside anywhere in NVMM.
+
+NOVA keeps read-only copies of most file metadata in DRAM during normal
+operations, eliminating the need to access metadata in NVMM during reads.
+
+NOVA uses copy-on-write to provide atomic updates for file data and appends
+metadata about the write to the log. For operations that affect multiple inodes
+NOVA uses lightweight, fixed-length journals – one per core.
+
+NOVA divides the allocatable NVMM into multiple regions, one region per CPU
+core. A per-core allocator manages each of the regions, minimizing contention
+during memory allocation.
+
+After a system crash, NOVA must scan all the logs to rebuild the memory
+allocator state. Since there are many logs, NOVA aggressively parallelizes the
+scan.
+
+Using NOVA
+==========
+
+NOVA runs on a pmem non-volatile memory region.  You can create one of these
+regions with the `memmap` kernel command line option.  For instance, adding
+`memmap=16G!8G` to the kernel boot parameters will reserve 16GB of memory
+starting from address 8GB, and the kernel will create a `pmem0` block device
+under the `/dev` directory.
+
+After the OS has booted, you can initialize a NOVA instance with the following commands:
+
+
+# modprobe nova
+# mount -t NOVA -o init /dev/pmem0 /mnt/ramdisk
+
+
+The above commands create a NOVA instance on `/dev/pmem0` and mount it on
+`/mnt/ramdisk`.
+
+Nova supports several module command line options:
+
+ * metadata_csum: Enable metadata replication and checksums (default: 0)
+
+ * data_csum: Compute checksums on file data. (default: 0)
+
+ * data_parity: Compute parity for file data. (default: 0)
+
+ * inplace_data_updates:  Update data in place rather than with COW (default: 0)
+
+ * wprotect: Make PMEM unwritable and then use CR0.WP to enable writes as
+   needed (default: 0).  With wprotect=1, you must also load the nd_pmem
+   module with readonly=1 (e.g., modprobe nd_pmem readonly=1).
+
+For instance, to enable all of Nova's data protection features:
+
+# modprobe nova metadata_csum=1 \
+                data_csum=1 \
+                data_parity=1 \
+                wprotect=1
+
+Currently, remounting file systems with different combinations of options may
+not work.
+
+To recover an existing NOVA instance, mount NOVA without the init option, for example:
+
+# mount -t NOVA /dev/pmem0 /mnt/ramdisk
+
+Taking Snapshots
+----------------
+
+To create a snapshot:
+
+# echo 1 > /proc/fs/NOVA/<device>/create_snapshot
+
+To list the current snapshots:
+
+# cat /proc/fs/NOVA/<device>/snapshots
+
+To mount a snapshot, mount NOVA and specify the snapshot index, for example:
+
+# mount -t NOVA -o snapshot=<index> /dev/pmem0 /mnt/ramdisk
+
+Users should not write to the file system after mounting a snapshot.
+
+Source File Structure
+=====================
+
+  * nova_def.h/nova.h
+    Defines NOVA macros and key inline functions.
+    
+  * balloc.{h,c}
+    NOVA's block allocator implementation.
+    
+  * bbuild.c
+    Implements recovery routines to restore the in-use inode list, the NVMM
+    allocator information, and the snapshot table.
+
+  * checksum.c
+    Contains checksum-related functions to compute and verify checksums on NOVA
+    data structures and file pages, and also performs recovery actions when
+    corruptions are detected.
+
+  * dax.c
+    Implements DAX read/write functions to access file data. NOVA uses
+    copy-on-write to modify file pages by default, unless inplace data update is
+    enabled at mount-time. There are also functions to update and verify the
+    file data integrity information.
+
+  * dir.c
+    Contains functions to create, update, and remove NOVA dentries.
+
+  * file.c
+    Implements file-related operations such as open, fallocate, llseek, fsync,
+    and flush.
+
+  * gc.c
+    NOVA's garbage collection functions. 
+
+  * inode.{h,c}
+    Creates, reads, and frees NOVA inode tables and inodes.
+
+  * ioctl.c
+    Implements some ioctl commands to call NOVA's internal functions.
+
+  * journal.{h,c}
+    For operations that affect multiple inodes, NOVA uses lightweight,
+    fixed-length journals, one per core.  This file contains functions to
+    create and manage the lite journals.
+
+  * log.{h,c}
+    Functions to manipulate NOVA inode logs, including log page allocation, log
+    entry creation, commit, modification, and deletion.
+
+  * mprotect.{h,c}
+    Implements inline functions to enable/disable writing to different NOVA
+    data structures.
+    
+  * namei.c
+    Functions to create/remove files, directories, and links. It also looks for
+    the NOVA inode number for a given path name.
+
+  * parity.c
+    Functions to compute file page parity bits. Each file page is striped into
+    equally sized segments (or strips), and one parity strip is calculated using
+    the RAID-5 method. A function to restore a broken data strip is also
+    implemented in this file.
+
+  * perf.{h,c}
+    Function performance measurements. Defines function IDs and call
+    prototypes, and measures the performance of primitive functions,
+    including memory copy functions for DRAM and NVMM, checksum functions,
+    and XOR parity functions.
+
+  * rebuild.c
+    When mounting NOVA after a crash, rebuilds NOVA inodes from its logs. There
+    are also functions to re-calculate checksums and parity bits for file pages
+    that were mmapped during the crash.
+
+  * snapshot.{h,c}
+    Code and data structures for taking snapshots.
+    
+  * stats.h
+    Defines data structures and macros used to gather NOVA usage
+    statistics.
+
+  * stats.c
+    Implements routines to gather and print NOVA usage statistics.
+
+  * super.{h,c}
+    Super block structures, the NOVA FS layout, and entry points for mounting
+    and unmounting NOVA, including initializing or recovering the NOVA super
+    block and other global file system information.
+
+  * symlink.c
+    Implements functions to create and read symbolic links in the filesystem.
+
+  * sysfs.c
+    Implements sysfs entries to take user inputs for taking snapshots, printing
+    NOVA statistics, and measuring function performance.
+
+
+FS Layout
+=========
+
+A NOVA file system resides in a single PMEM device.  NOVA divides the device
+into 4KB blocks.
+
+ block
++-----------------------------------------------------+
+|  0  | primary super block (struct nova_super_block) |
++-----------------------------------------------------+
+|  1  | Reserved inodes                               |
++-----------------------------------------------------+
+|  2  | reserved                                      |
++-----------------------------------------------------+
+|  3  | Journal pointers                              |
++-----------------------------------------------------+
+| 4-5 | Inode pointer tables                          |
++-----------------------------------------------------+
+|  6  | reserved                                      |
++-----------------------------------------------------+
+|  7  | reserved                                      |
++-----------------------------------------------------+
+| ... | data pages                                    |
++-----------------------------------------------------+
+| n-2 | replica reserved Inodes                       |
++-----------------------------------------------------+
+| n-1 | replica super block                           |
++-----------------------------------------------------+
+
+
+
+Superblock and Associated Structures
+====================================
+
+The beginning of the PMEM device holds the super block and its associated
+tables.  These include reserved inodes, a table of pointers to the journals
+NOVA uses for complex operations, and pointers to inode tables.  NOVA
+maintains replicas of the super block and reserved inodes in the last two
+blocks of the PMEM area.
+
+
+Block Allocator/Free Lists
+==========================
+
+NOVA uses per-CPU allocators to manage free PMEM blocks.  On initialization,
+NOVA divides the range of blocks in the PMEM device among the CPUs, and each
+range is managed solely by its CPU.  We call these ranges "allocation regions".
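
The division into allocation regions might be sketched in C as follows. This is an illustration only: the text does not say how NOVA splits the blocks, so the even split (with the last CPU absorbing the remainder) is an assumption, and `struct region` and `region_for_cpu()` are hypothetical names, not NOVA identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one CPU's allocation region. */
struct region {
	uint64_t start;	/* first block owned by this CPU */
	uint64_t end;	/* last block owned by this CPU (inclusive) */
};

/*
 * Split num_blocks blocks among num_cpus CPUs.  Assumption: an even split,
 * with the last CPU absorbing any remainder.
 */
static inline struct region region_for_cpu(uint64_t num_blocks,
					   unsigned int num_cpus,
					   unsigned int cpu)
{
	uint64_t per_cpu;
	struct region r;

	assert(num_cpus > 0 && cpu < num_cpus);
	per_cpu = num_blocks / num_cpus;
	r.start = (uint64_t)cpu * per_cpu;
	r.end = (cpu == num_cpus - 1) ? num_blocks - 1
				      : r.start + per_cpu - 1;
	return r;
}
```

Because each region has a single owner in this model, a CPU can allocate from its own range without coordinating with the others.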
+
+Some of the blocks in an allocation region have fixed roles.  Here's the
+layout:
+
++-------------------------------+
+| data checksum blocks          |
++-------------------------------+
+| data parity blocks            |
++-------------------------------+
+|                               |
+| Allocatable blocks            |
+|                               |
++-------------------------------+
+| replica data parity blocks    |
++-------------------------------+
+| replica data checksum blocks  |
++-------------------------------+
+
+The first and last allocation regions also contain the super block, inode
+tables, etc. and their replicas, respectively.
+
+Each allocator maintains a red-black tree of unallocated ranges (struct
+nova_range_node).
+
+Allocation Functions
+--------------------
+
+NOVA allocates PMEM blocks using two mechanisms:
+
+1.  Static allocation as defined in super.h
+
+2.  Allocation for log and data pages via nova_new_log_blocks() and
+nova_new_data_blocks().
+
+Both of these functions allow the caller to control whether the allocator
+prefers higher addresses for allocation or lower addresses.  We use this to
+encourage metadata structures and their replicas to be far from one another.
+
+PMEM Address Translation
+------------------------
+
+In Nova's persistent data structures, memory locations are given as offsets
+from the beginning of the PMEM region.  nova_get_block() translates offsets to
+PMEM addresses.  nova_get_addr_off() performs the reverse translation.
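
A minimal C sketch of the two translations, assuming the PMEM region is mapped at a single base address. `struct pmem_region`, `offset_to_addr()` and `addr_to_offset()` are hypothetical stand-ins for illustration, not the signatures of nova_get_block() or nova_get_addr_off(). Treating offset 0 as "no block" is likewise an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the mapped PMEM region. */
struct pmem_region {
	uint8_t *virt_addr;	/* address at which the region is mapped */
	size_t size;		/* region size in bytes */
};

/* Offset -> address.  Assumption: offset 0 means "no block". */
static inline void *offset_to_addr(const struct pmem_region *r, uint64_t off)
{
	return off ? r->virt_addr + off : NULL;
}

/* Address -> offset from the beginning of the region. */
static inline uint64_t addr_to_offset(const struct pmem_region *r,
				      const void *addr)
{
	return (uint64_t)((const uint8_t *)addr - r->virt_addr);
}
```

Storing offsets instead of pointers keeps the persistent structures valid even if the region is mapped at a different virtual address on the next mount.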
+
+
+Inodes
+======
+
+Nova maintains per-CPU inode tables, and inode numbers are striped across the
+tables (i.e., inos 0, n, 2n,... on cpu 0; inos 1, n + 1, 2n + 1, ... on cpu 1).
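
The striping rule above amounts to a modulo and a division. A small C sketch, with hypothetical helper names (`ino_to_cpu()`, `ino_to_slot()`) that are not taken from the NOVA source:

```c
#include <stdint.h>

/* CPU whose inode table holds inode number ino, given ncpus tables. */
static inline unsigned int ino_to_cpu(uint64_t ino, unsigned int ncpus)
{
	return (unsigned int)(ino % ncpus);
}

/* Position of ino within that CPU's table. */
static inline uint64_t ino_to_slot(uint64_t ino, unsigned int ncpus)
{
	return ino / ncpus;
}
```

With four CPUs, inode 5 lands in slot 1 of CPU 1's table and inode 8 in slot 2 of CPU 0's table.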
+
+The inodes themselves live in a set of linked lists (one per CPU) of 2MB
+blocks.  The last 8 bytes of each block point to the next block.  Pointers to
+the heads of these lists live in PMEM block INODE_TABLE0_START and are
+replicated in PMEM block INODE_TABLE1_START.  Additional space for inodes is
+allocated on demand.
+
+To allocate inodes, NOVA maintains a per-CPU "inuse_list" in DRAM, which holds
+a red-black tree of ranges of unallocated inode numbers.
+
+Logs
+====
+
+Nova maintains a log for each inode that records updates to the inode's
+metadata and holds pointers to the file data.  Nova makes updates to file data
+and metadata atomic by atomically appending log entries to the log.
+
+Each inode contains pointers to the head and tail of the inode's log.  When the
+log grows past the end of the last page, NOVA allocates additional space.  For
+short logs (less than 1MB), it doubles the length.  For longer logs, it adds a
+fixed amount of additional space (1MB).
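
The growth policy can be sketched in C as below, assuming 4KB log pages. `log_pages_to_add()` is a hypothetical helper, and starting an empty log with a single page is an assumption the text does not state.

```c
#include <stdint.h>

#define LOG_PAGE_SIZE	4096u
#define LOG_GROW_LIMIT	(1024u * 1024u)	/* the 1MB threshold from the text */

/* Number of pages to add when a log of cur_pages pages fills up. */
static inline uint64_t log_pages_to_add(uint64_t cur_pages)
{
	uint64_t limit_pages = LOG_GROW_LIMIT / LOG_PAGE_SIZE;	/* 256 pages */

	if (cur_pages == 0)
		return 1;		/* assumption: new logs start at one page */
	if (cur_pages < limit_pages)
		return cur_pages;	/* short log: double the length */
	return limit_pages;		/* long log: add a fixed 1MB */
}
```

Doubling keeps the number of allocations low while a log is small, and the fixed 1MB step bounds how much space a single extension of a long log can take.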
+
+Log space is reclaimed during garbage collection.
+
+Log Entries
+-----------
+
+There are eight kinds of log entry, documented in log.h.  The log entries have
+several fields in common:
+
+   1.  'epoch_id' gives the epoch during which the log entry was created.
+   Creating a snapshot increments the epoch_id for the file system.
+
+   2.  'trans_id' is a filesystem-wide, monotonically increasing number
+   assigned to each log entry.  It provides an ordering over all FS operations.
+
+   3.  'invalid' is true if the effects of this entry are dead and the log
+   entry can be garbage collected.
+
+   4.  'csum' is a CRC32 checksum for the entry.
+
+Log structure
+-------------
+
+The logs comprise a linked list of PMEM blocks.  The tail of each block
+contains some metadata about the block and pointers to the next block and
+the block's replica (struct nova_inode_page_tail).
+
++----------------+
+| log entry      |
++----------------+
+| log entry      |
++----------------+
+| ...            |
++----------------+
+| tail           |
+|  metadata      |
+|  -> next block |
++----------------+
+
+
+Journals
+========
+
+NOVA uses a lightweight journaling mechanism to provide atomicity for
+operations that modify more than one inode.  The journals provide logging
+for two operations:
+
+1.  Single word updates (JOURNAL_ENTRY)
+2.  Copying inodes (JOURNAL_INODE)
+
+The journals are undo logs: NOVA creates the journal entries for an operation,
+and if the operation does not complete due to a system failure, the recovery
+process rolls back the changes using the journal entries.
+
+To commit, NOVA drops the operation's journal entries.
+
+Nova maintains one journal per CPU.  The head and tail pointers for each
+journal live in a reserved page near the beginning of the file system.  
+
+During recovery, Nova scans the journals and undoes the operations described by
+each entry.
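
The undo-journal idea can be modeled in a few lines of C for the single-word case. This is a DRAM-only sketch with hypothetical names (`struct undo_journal`, `journaled_store()`, `journal_commit()`, `journal_recover()`), not NOVA's journal code. It omits the persist barriers that a real PMEM implementation needs between logging an old value and overwriting it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JOURNAL_SLOTS 8

/* One undo record: where a word lives and what it held before the update. */
struct undo_entry {
	uint64_t *addr;
	uint64_t old_value;
};

struct undo_journal {
	struct undo_entry entries[JOURNAL_SLOTS];
	size_t count;		/* 0 means no operation is in flight */
};

/* Log the old value, then perform the single-word update in place. */
static inline void journaled_store(struct undo_journal *j, uint64_t *addr,
				   uint64_t new_value)
{
	assert(j->count < JOURNAL_SLOTS);
	j->entries[j->count].addr = addr;
	j->entries[j->count].old_value = *addr;
	j->count++;
	*addr = new_value;
}

/* Commit: drop the journal entries; the updates stay in place. */
static inline void journal_commit(struct undo_journal *j)
{
	j->count = 0;
}

/* Recovery: roll back uncommitted updates, newest first. */
static inline void journal_recover(struct undo_journal *j)
{
	while (j->count > 0) {
		struct undo_entry *e = &j->entries[--j->count];

		*e->addr = e->old_value;
	}
}
```

In this model, running recovery after a commit is a no-op, while running it before a commit restores every journaled word to its previous value.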
+
+
+File and Directory Access
+=========================
+
+To access file data via read(), Nova maintains a radix tree in DRAM for each
+inode (nova_inode_info_header.tree) that maps file offsets to write log
+entries.  For directories, the same tree maps a hash of filenames to their
+corresponding dentry.
+
+In both cases, NOVA populates the tree when the file or directory is opened
+by scanning its log.
+
+MMap and DAX
+============
+
+NOVA leverages the kernel's DAX mechanisms for mmap and file data access.  Nova
+maintains a red-black tree in DRAM (nova_inode_info_header.vma_tree) to track
+which portions of a file have been mapped.
+
+Garbage Collection
+==================
+
+NOVA recovers log space with a two-phase garbage collection system.  When a log
+reaches the end of its allocated pages, NOVA allocates more space.  Then, the
+fast GC algorithm scans the log to remove pages that have no valid entries.
+Next, it estimates how many pages the log's valid entries would fill.  If this
+is less than half the number of pages in the log, the second (thorough) GC
+phase copies the valid entries to new pages.
+
+For example (V=valid; I=invalid):
+
++---+          +---+          +---+
+| I |          | I |          | V |
++---+          +---+ Thorough +---+
+| V |          | V |    GC    | V |
++---+          +---+  =====>  +---+
+| I |          | I |          | V |
++---+          +---+          +---+
+| V |          | V |          | V |
++---+          +---+          +---+
+  |              |
+  V              V
++---+          +---+
+| I |          | V |
++---+          +---+
+| I | fast GC  | I |
++---+  ====>   +---+
+| I |          | I |
++---+          +---+
+| I |          | I |
++---+          +---+
+  |
+  V
++---+
+| V |
++---+
+| I |
++---+
+| I |
++---+
+| V |
++---+
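
The two GC phases can be modeled in C as below. The log is reduced to an array of per-page valid-entry counts, `ENTRIES_PER_PAGE` is set to 4 only to match the drawing, and `log_gc()` and `enum gc_action` are hypothetical names. The "less than half" test is taken literally from the prose above.

```c
#include <stddef.h>

#define ENTRIES_PER_PAGE 4	/* as drawn in the figure; real pages hold more */

enum gc_action { GC_NONE, GC_THOROUGH };

/*
 * valid[i] is the number of valid entries in log page i.  The fast phase
 * compacts the array in place, dropping pages with no valid entries, and
 * stores the surviving page count in *pages_left.  The return value says
 * whether the thorough phase should then copy the survivors to new pages.
 */
static inline enum gc_action log_gc(unsigned int *valid, size_t npages,
				    size_t *pages_left)
{
	size_t kept = 0, total_valid = 0, needed;

	for (size_t i = 0; i < npages; i++) {
		if (valid[i] == 0)
			continue;	/* fast GC: drop the dead page */
		total_valid += valid[i];
		valid[kept++] = valid[i];
	}
	*pages_left = kept;

	/* Pages the valid entries would fill if packed densely (rounded up). */
	needed = (total_valid + ENTRIES_PER_PAGE - 1) / ENTRIES_PER_PAGE;
	return (2 * needed < kept) ? GC_THOROUGH : GC_NONE;
}
```

The fast phase only unlinks pages, so it is cheap. The thorough phase copies entries, so this model runs it only when the copy would at least halve the log.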
+
+
+Replication and Checksums
+=========================
+
+NOVA protects data and metadata from corruption due to media errors and
+"scribbles" -- software errors in the kernel that may overwrite NOVA data.
+
+Replication
+-----------
+
+NOVA replicates all PMEM metadata structures (there are a few exceptions; they
+are WIP).  For each structure, there is a primary and an "alternate" (denoted
+as "alter" in the code).  To ensure that NOVA can recover a consistent copy of
+the data in case of a failure, NOVA first updates the primary, and issues a
+persist barrier to ensure that data is written to NVMM.  Then it does the same
+for the alternate.
+
+Detection
+---------
+
+NOVA uses two techniques to detect data corruption.  For media errors, NOVA
+should always use memcpy_from_pmem() to read data from PMEM, usually by
+copying the PMEM data structure into DRAM.
+
+To detect software-caused corruption, NOVA uses CRC32 checksums.  All the PMEM
+data structures in NOVA include a csum field for this purpose.  NOVA also
+computes a CRC32 checksum for each 512-byte slice of each data page.
+
+The checksums are stored in dedicated pages in each CPU's allocation region.
+
+                                                          replica
+                                                 parity   parity
+                                                  page     page
+            +---+---+---+---+---+---+---+---+    +---+    +---+       
+data page 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |    | 0 |    | 0 |  	
+            +---+---+---+---+---+---+---+---+    +---+    +---+  	
+data page 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |    | 1 |    | 1 |  	
+            +---+---+---+---+---+---+---+---+    +---+    +---+  	
+data page 2 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |    | 0 |    | 0 |  	
+            +---+---+---+---+---+---+---+---+    +---+    +---+  	
+data page 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |    | 0 |    | 0 |  	
+            +---+---+---+---+---+---+---+---+    +---+    +---+  	
+    ...                    ...                    ...      ...   
+
+Recovery
+--------
+
+Nova uses replication to support recovery of metadata structures and
+RAID4-style parity to recover corrupted data.
+
+If Nova detects corruption of a metadata structure, it restores the structure
+using the replica.
+
+If it detects a corrupt slice of a data page, it uses RAID4-style recovery to
+restore it.  The CRC32 checksums for the page slices are replicated.
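
Parity-based recovery can be illustrated in C with plain XOR. The sketch assumes a 4KB page split into eight 512-byte strips, matching the checksum slice size and the eight columns in the figure above. The function names are hypothetical and are not those in parity.c.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES	4096
#define STRIP_BYTES	512
#define NUM_STRIPS	(PAGE_BYTES / STRIP_BYTES)	/* 8 strips per page */

/* XOR the strips of a page into out, skipping strip `skip` (-1: skip none). */
static inline void xor_strips(const uint8_t *page, int skip, uint8_t *out)
{
	memset(out, 0, STRIP_BYTES);
	for (int s = 0; s < NUM_STRIPS; s++) {
		if (s == skip)
			continue;
		for (size_t i = 0; i < STRIP_BYTES; i++)
			out[i] ^= page[(size_t)s * STRIP_BYTES + i];
	}
}

/* Parity strip for a page: the XOR of its eight data strips. */
static inline void compute_parity(const uint8_t *page, uint8_t *parity)
{
	xor_strips(page, -1, parity);
}

/* Rebuild strip `bad` from the parity strip and the seven good strips. */
static inline void recover_strip(uint8_t *page, int bad, const uint8_t *parity)
{
	uint8_t rebuilt[STRIP_BYTES];

	xor_strips(page, bad, rebuilt);
	for (size_t i = 0; i < STRIP_BYTES; i++)
		rebuilt[i] ^= parity[i];
	memcpy(page + (size_t)bad * STRIP_BYTES, rebuilt, STRIP_BYTES);
}
```

Parity alone cannot tell which strip is bad. In the scheme described above, the per-slice checksums identify the corrupt strip and parity then reconstructs it.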
+
+Cautious allocation
+-------------------
+
+To maximize its resilience to software scribbles, NOVA allocates metadata
+structures and their replicas far from one another.  It tries to allocate the
+primary copy at a low address and the replica at a high address within the PMEM
+region.
+
+Write Protection
+----------------
+
+Finally, NOVA can prevent unintended writes to PMEM by mapping the entire
+PMEM device as read-only and then disabling _all_ write protection by clearing
+the WP bit in the CR0 control register when NOVA needs to perform a write.  The
+wprotect mount-time option controls this behavior.
+
+To map the PMEM device as read-only, we have added a readonly module command
+line option to nd_pmem.  There is probably a better approach to achieving this
+goal. 
+
+Unsafe modes
+============
+
+NOVA supports modes that disable some of the protections it provides to improve
+performance.
+
+File data
+---------
+
+NOVA can disable parity and/or checksums on file data (options 'data_parity=0'
+and 'data_csum=0').  Without parity, NOVA can detect but not recover from
+data corruption.  Without checksums, NOVA will still detect and recover from
+media errors, but not scribbles.
+
+NOVA also supports in-place file updates (option: inplace_data_updates=1).
+This breaks atomicity for writes, but improves performance, especially for
+sub-page writes, since these require a full page COW in the default mode.
+
+Metadata
+--------
+
+Nova can disable metadata checksums and replication (option 'metadata_csum=0').
+
+
+Snapshots
+=========
+
+Nova supports snapshots to facilitate backups.
+
+Taking a snapshot
+-----------------
+
+Each NOVA file system has a current epoch_id in the super block and each log
+entry has the epoch_id attached to it at creation.  When the user creates a
+snapshot, NOVA increments the epoch_id for the file system and the old epoch_id
+identifies the moment the snapshot was taken.
+
+Nova records the epoch_id and a timestamp in a new log entry (struct
+snapshot_info_log_entry) and appends it to the log of the reserved snapshot
+inode (NOVA_SNAPSHOT_INODE) in the superblock.
+
+Nova also maintains a radix tree (nova_sb_info.snapshot_info_tree) of struct
+snapshot_info in DRAM indexed by epoch_id.
+
+Nova also marks all mmap'd pages as read-only and uses COW to preserve file
+contents after the snapshot.
+
+Tracking Live Data
+------------------
+
+Supporting snapshots requires Nova to preserve file contents from previous
+snapshots while also being able to recover the space a snapshot occupied after
+its deletion.
+
+Preserving file contents requires a small change to how Nova implements write
+operations.  To perform a write, Nova appends a write log entry to the file's
+log.  The log entry includes pointers to newly-allocated and populated NVMM
+pages that hold the written data.  If the write overwrites existing data, Nova
+locates the previous write log entry for that portion of the file, and performs
+an "epoch check" that compares the old log entry's epoch_id to the file
+system's current epoch_id.  If the comparison matches, the old write log entry
+and the file data blocks it points to no longer belong to any snapshot, and
+Nova reclaims the data blocks.
+
+If the epoch_ids do not match, then the data in the old log entry belongs to
+an earlier snapshot and NOVA leaves the log entry in place.
+
+Determining when to reclaim data belonging to deleted snapshots requires
+additional bookkeeping.  For each snapshot, Nova maintains a "snapshot log"
+that records the inodes and blocks that belong to that snapshot, but are not
+part of the current file system image.
+
+Nova populates the snapshot log during the epoch check: If the epoch_ids for
+the new and old log entries do not match, it appends a log entry (either struct
+snapshot_inode_entry or struct snapshot_file_write_entry) to the snapshot log
+that the old log entry belongs to.  The log entry contains a pointer to the old
+log entry, and the filesystem's current epoch_id as the delete_epoch_id.
+
+To delete a snapshot, Nova removes the snapshot from the list of live snapshots
+and appends its log to the following snapshot's log.  Then, a background thread
+traverses the combined log and reclaims dead inode/data based on the delete
+epoch_id: If the delete epoch_id for an entry in the log is less than or equal
+to the snapshot's epoch_id, it means the log entry and/or the associated data
+blocks are now dead.
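
The epoch check and the deletion rule reduce to two comparisons, sketched here in C. `struct snap_log_entry` is a simplified, hypothetical stand-in for the snapshot log entries named above, and the function names are illustrative. Which snapshot's epoch_id the deletion rule compares against is left to the caller, as the text leaves it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot-log record: an old log entry and when it died. */
struct snap_log_entry {
	uint64_t old_entry_off;		/* PMEM offset of the old log entry */
	uint64_t delete_epoch_id;	/* current epoch_id at overwrite time */
};

/* Epoch check on overwrite: true if the old data belongs to no snapshot. */
static inline bool epoch_check_reclaim(uint64_t old_epoch_id,
				       uint64_t cur_epoch_id)
{
	return old_epoch_id == cur_epoch_id;
}

/* Deletion rule: true if the entry is dead relative to snapshot_epoch_id. */
static inline bool snap_entry_dead(const struct snap_log_entry *e,
				   uint64_t snapshot_epoch_id)
{
	return e->delete_epoch_id <= snapshot_epoch_id;
}
```

An overwrite in the same epoch that created the old data can free the old blocks immediately. An overwrite in a later epoch instead records a snapshot-log entry, which the background thread later tests with the deletion rule.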
+
+Snapshots and DAX
+-----------------
+
+Taking consistent snapshots while applications are modifying files using
+DAX-style mmap requires NOVA to reckon with the order in which stores to NVMM
+become persistent (i.e., reach physical NVMM so they will survive a system
+failure).  These applications rely on the processor's "memory persistence
+model" [http://dl.acm.org/citation.cfm?id=2665671.2665712] to make guarantees
+about when and in what order stores become persistent.  These guarantees allow
+the applications to restore their data to a consistent state during recovery
+from a system failure.
+
+From the application's perspective, reading a snapshot is equivalent to
+recovering from a system failure.  In both cases, the contents of the
+memory-mapped file reflect its state at a moment when application operations
+might be in-flight and when the application had no chance to shut down cleanly.
+
+A naive approach to checkpointing mmap()'d files in NOVA would simply mark each
+of the read/write mapped pages as read-only and then do copy-on-write when a
+store occurs to preserve the old pages as part of the snapshot.
+
+However, this approach can leave the snapshot in an inconsistent state:
+Setting the page to read-only captures its contents for the
+snapshot, and the kernel requires NOVA to set the pages as read-only
+one at a time.  So, if the order in which NOVA marks pages as read-only
+is incompatible with the ordering that the application requires, the snapshot will
+contain an inconsistent version of the file.
+
+To resolve this problem, when NOVA starts marking pages as read-only, it blocks
+page faults to the read-only mmap()'d pages until it has marked all the pages
+read-only and finished taking the snapshot.
+
+More detail is available in the technical report referenced at the top of this
+document.
+
+We have implemented this functionality in NOVA by adding the 'original_write'
+flag to struct vm_area_struct that tracks whether the vm_area_struct is created
+with write permission, but has been marked read-only in the course of taking a
+snapshot.  We have also added a 'dax_cow' operation to struct
+vm_operations_struct that the page fault handler runs when applications write
+to a page with original_write = 1.  NOVA's dax_cow operation
+(nova_restore_page_write()) performs the COW, maps the page to a new physical
+page and allows writing.
+
+Saving Snapshot State
+---------------------
+
+During a clean shutdown, Nova stores the snapshot information to PMEM.
+
+Nova reserves an inode for storing snapshot information.  The log for the inode
+contains an entry for each snapshot (struct snapshot_info_log_entry).  On
+shutdown, Nova allocates one page (struct snapshot_nvmm_page) to store an array
+of struct snapshot_nvmm_list.
+
+Each of these lists (one per CPU) contains head and tail pointers to a linked
+list of blocks (just like an inode log).  The lists contain a struct
+snapshot_file_write_entry or struct snapshot_inode_entry for each operation
+that modified file data or an inode.
+
+Superblock
++--------------------+
+|   ...              |
++--------------------+
+| Reserved Inodes    |
++---+----------------+
+|   |     ...        |
++---+----------------+
+| 7 | Snapshot Inode |
+|   | head           |
++---+----------------+
+        /
+       /
+      / 
++---------+---------+---------+
+|  Snap   |  Snap   |  Snap   |
+| epoch=1 | epoch=4 | epoch=11|
+|         |         |         |
+|nvmm_page|nvmm_page|nvmm_page|
++---------+---------+---------+
+     |
+     |
++----------+   +--------+--------+
+|  cpu 0   |   | snap   | snap   |
+|   head   |-->| inode  | write  |
+|          |   | entry  | entry  |
+|          |   +--------+--------+
++----------+   +--------+--------+
+|  cpu 1   |   | snap   | snap   |
+|   head   |-->| write  | write  |
+|          |   | entry  | entry  |
+|          |   +--------+--------+
++----------+
+|    ...   |
++----------+   +--------+
+|  cpu 128 |   | snap   |
+|   head   |-->| inode  |
+|          |   | entry  |
+|          |   +--------+
++----------+
+
+
+Umount and Recovery
+===================
+
+Clean umount/mount
+------------------
+
+On a clean unmount, Nova saves the contents of many of its DRAM data structures
+to PMEM to accelerate the next mount:
+
+1. Nova stores the allocator state for each of the per-cpu allocators to the
+   log of a reserved inode (NOVA_BLOCK_NODE_INO).
+    
+2. Nova stores the per-CPU lists of available inodes (the inuse_list) to the
+   NOVA_BLOCK_INODELIST1_INO reserved inode.
+
+3. Nova stores the snapshot state to PMEM as described above.
+
+After a clean unmount, the following mount restores these data and then
+invalidates them.
+
+Recovery after failures
+------------------------
+
+In case of an unclean unmount (e.g., system crash), NOVA must rebuild these
+DRAM structures by scanning the inode logs.  NOVA log scanning is fast because
+per-CPU inode tables and per-inode logs allow for parallel recovery.
+
+The number of live log entries in an inode log is roughly the number of extents
+in the file.  As a result, Nova only needs to scan a small fraction of the NVMM
+during recovery.
+
+The Nova failure recovery consists of two steps:
+
+First, NOVA checks its lightweight journals and rolls back any uncommitted
+transactions to restore the file system to a consistent state.
+
+Second, Nova starts a recovery thread on each CPU and scans the inode tables in
+parallel, performing log scanning for every valid inode in the inode table.
+NOVA uses different recovery mechanisms for directory inodes and file inodes:
+For a directory inode, Nova scans the log's linked list to enumerate the pages
+it occupies, but it does not inspect the log's contents.  For a file inode,
+Nova reads the write entries in the log to enumerate the data pages.
+
+During the recovery scan Nova builds a bitmap of occupied pages, and rebuilds
+the allocator based on the result. After this process completes, the file
+system is ready to accept new requests.
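
Rebuilding allocator state from the occupancy bitmap can be sketched in C as below. Here the result is a flat array of free ranges rather than the red-black tree NOVA uses, and all names (`struct free_range`, `build_free_ranges()`, and the bitmap helpers) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct free_range {
	size_t first;	/* first free block in the range */
	size_t last;	/* last free block in the range (inclusive) */
};

static inline int block_in_use(const uint8_t *bitmap, size_t blk)
{
	return (bitmap[blk / 8] >> (blk % 8)) & 1;
}

/* Called by the recovery scan for every block a log or file occupies. */
static inline void mark_block_in_use(uint8_t *bitmap, size_t blk)
{
	bitmap[blk / 8] |= (uint8_t)(1u << (blk % 8));
}

/*
 * Turn the occupancy bitmap into a list of free ranges, writing at most
 * max_ranges entries to out.  Returns the number of ranges written.
 */
static inline size_t build_free_ranges(const uint8_t *bitmap, size_t nblocks,
				       struct free_range *out,
				       size_t max_ranges)
{
	size_t n = 0, blk = 0;

	while (blk < nblocks && n < max_ranges) {
		if (block_in_use(bitmap, blk)) {
			blk++;
			continue;
		}
		out[n].first = blk;
		while (blk < nblocks && !block_in_use(bitmap, blk))
			blk++;
		out[n++].last = blk - 1;
	}
	return n;
}
```

With blocks 0, 1 and 5 of a 16-block device marked in use, the function reports two free ranges, 2-4 and 6-15.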
+
+During the same scan, it rebuilds the snapshot information and the list of
+available inodes.
+
diff --git a/MAINTAINERS b/MAINTAINERS
index 767e9d202adf..cfcee556acc6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9108,6 +9108,14 @@ F:	drivers/power/supply/bq27xxx_battery_i2c.c
 F:	drivers/power/supply/isp1704_charger.c
 F:	drivers/power/supply/rx51_battery.c
 
+NOVA FILE SYSTEM
+M:	Andiry Xu <jix024@cs.ucsd.edu>
+M:	Steven Swanson <swanson@cs.ucsd.edu>
+L:	linux-fsdevel@vger.kernel.org
+L:	linux-nvdimm@lists.01.org
+F:	Documentation/filesystems/nova.txt
+F:	fs/nova/
+
 NTB DRIVER CORE
 M:	Jon Mason <jdmason@kudzu.us>
 M:	Dave Jiang <dave.jiang@intel.com>
diff --git a/README.md b/README.md
new file mode 100644
index 000000000000..4f778e99a79e
--- /dev/null
+++ b/README.md
@@ -0,0 +1,173 @@
+# NOVA: NOn-Volatile memory Accelerated log-structured file system
+
+NOVA's goal is to provide a high-performance, full-featured, production-ready
+file system tailored for byte-addressable non-volatile memories (e.g., NVDIMMs
+and Intel's soon-to-be-released 3DXpoint DIMMs).  It combines design elements
+from many other file systems to provide a combination of high-performance,
+strong consistency guarantees, and comprehensive data protection.  NOVA supports
+DAX-style mmap, and making DAX perform well is a first-order priority in NOVA's
+design.  NOVA was developed by the [Non-Volatile Systems Laboratory][NVSL] in
+the [Computer Science and Engineering Department][CSE] at the [University of
+California, San Diego][UCSD].
+
+
+NOVA is primarily a log-structured file system, but rather than maintain a
+single global log for the entire file system, it maintains separate logs for
+each file (inode).  NOVA breaks the logs into 4KB pages, which need not be
+contiguous in memory.  The logs only contain metadata.
+
+File data pages reside outside the log, and log entries for write operations
+point to data pages they modify.  File modification uses copy-on-write (COW) to
+provide atomic file updates.
+
+For file operations that involve multiple inodes, NOVA uses small, fixed-sized
+redo logs to atomically append log entries to the logs of the inodes involved.
+
+This structure keeps logs small and makes garbage collection very fast.  It also
+enables enormous parallelism during recovery from an unclean unmount, since
+threads can scan logs in parallel.
+
+NOVA replicates and checksums all metadata structures and protects file data
+with RAID-4-style parity.  It supports checkpoints to facilitate backups.
+
+A more thorough discussion of NOVA's design is available in these two papers:
+
+**NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories** 
+[PDF](http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf)<br>
+*Jian Xu and Steven Swanson*<br>
+Published in [FAST 2016][FAST2016]
+
+**Hardening the NOVA File System**
+[PDF](http://cseweb.ucsd.edu/~swanson/papers/TechReport2017HardenedNOVA.pdf)<br>
+*Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff, Steven Swanson*<br>
+UCSD-CSE Techreport CS2017-1018
+
+Read on for further details about NOVA's overall design and its current status.
+
+### Compatibility with Other File Systems
+
+NOVA aims to be compatible with other Linux file systems.  To help verify that it achieves this, we run several test suites against NOVA each night.
+
+* The latest version of XFSTests. ([Current failures](https://github.com/NVSL/linux-nova/issues?q=is%3Aopen+is%3Aissue+label%3AXFSTests))
+* The [Linux Test Project](https://linux-test-project.github.io/) file system tests.
+* The [fstest POSIX test suite][POSIXtest].
+
+Currently, nearly all of these tests pass for the `master` branch, and we have
+run complex programs on NOVA.  There are, of course, many bugs left to fix.
+
+NOVA uses the standard PMEM kernel interfaces for accessing and managing
+persistent memory.
+
+### Atomicity
+
+By default, NOVA makes all metadata and file data operations atomic.
+
+Strong atomicity guarantees make it easier to build reliable applications on
+NOVA, and NOVA can provide these guarantees without sacrificing much performance
+because NVDIMMs support very fast random access.
+
+NOVA also supports "unsafe data" and "unsafe metadata" modes that
+improve performance in some cases and allow for non-atomic updates of file
+data and metadata, respectively.
+
+### Data Protection
+
+NOVA aims to protect data against both misdirected writes in the kernel (which
+can easily "scribble" over the contents of an NVDIMM) as well as media errors.
+
+NOVA protects all of its metadata data structures with a combination of
+replication and checksums.  It protects file data using RAID-4-style parity.
+
+NOVA detects data corruption by verifying checksums on each access and by
+catching and handling machine check exceptions (MCEs) that arise when the
+system's memory controller detects an uncorrectable media error.
+
+We use a fault injection tool that allows testing of these recovery mechanisms.
+
+To facilitate backups, NOVA can take snapshots of the current filesystem state
+that can be mounted read-only while the current file system is mounted
+read-write.
+
+The tech report listed above describes the design of NOVA's data protection system in detail.
+
+### DAX Support
+
+Supporting DAX efficiently is a core feature of NOVA, and one of the challenges
+in designing NOVA is reconciling DAX support, which aims to avoid file system
+intervention when file data changes, with other features that require such
+intervention.
+
+NOVA's philosophy with respect to DAX is that when a program uses DAX mmap to
+modify a file, the program must take full responsibility for that data and
+NOVA must ensure that the memory will behave as expected.  At other times, the
+file system provides protection.  This approach has several implications:
+
+1. Implementing `msync()` in user space works fine.
+
+2. While a file is mmap'd, it is not protected by NOVA's RAID-style parity
+mechanism, because protecting it would be too expensive.  When the file is
+unmapped and/or during file system recovery, protection is restored.
+
+3. The snapshot mechanism must be careful about the order in which it adds
+pages to the file's snapshot image.
+
+### Performance
+
+The research paper and technical report referenced above compare NOVA's
+performance to other file systems.  In almost all cases, NOVA outperforms other
+DAX-enabled file systems.  A notable exception is sub-page updates which incur
+COW overheads for the entire page.
+
+The technical report also illustrates the trade-offs between our protection
+mechanisms and performance.
+
+## Gaps, Missing Features, and Development Status
+
+Although NOVA is a fully-functional file system, there is still much work left
+to be done.  In particular, (at least) the following items are currently missing:
+
+1.  There is no mkfs or fsck utility (`mount` takes `-o init` to create a NOVA file system)
+2.  NOVA doesn't scrub data to prevent corruption from accumulating in infrequently accessed data.
+3.  NOVA doesn't read bad block information on mount and attempt recovery of the affected data.
+4.  NOVA only works on x86-64 kernels.
+5.  NOVA does not currently support extended attributes or ACLs.
+6.  NOVA does not currently prevent writes to mounted snapshots.
+7.  Using `write()` to modify pages that are mmap'd is not supported.
+8.  NOVA doesn't provide quota support.
+9.  Moving NOVA file systems between machines with different numbers of CPUs does not work.
+10. Remounting a NOVA file system with different mount options may fail.
+
+None of these are fundamental limitations of NOVA's design.  Additional bugs
+and issues are here [here][https://github.com/NVSL/linux-nova/issues].
+
+NOVA is complete and robust enough to run a range of complex applications, but
+it is not yet ready for production use.  Our current focus is on adding a few
+missing features listed above and finding/fixing bugs.
+
+## Building and Using NOVA
+
+This repo contains a version of the Linux kernel with NOVA included.  You should be
+able to build and install it just as you would the mainline Linux source.
+
+### Building NOVA
+
+To build NOVA, build the kernel with PMEM (`CONFIG_BLK_DEV_PMEM`), DAX (`CONFIG_FS_DAX`) and NOVA (`CONFIG_NOVA_FS`) support.  Install as usual.
+
+## Hacking and Contributing
+
+The NOVA source code is almost completely contained in the `fs/nova` directory.
+The exceptions are some small changes in the kernel's memory management system
+to support checkpointing.
+
+`Documentation/filesystems/nova.txt` describes the internals of NOVA in more detail.
+
+If you find bugs, please [report them](https://github.com/NVSL/linux-nova/issues).
+
+If you have other questions or suggestions you can contact the NOVA developers at [cse-nova-hackers@eng.ucsd.edu](mailto:cse-nova-hackers@eng.ucsd.edu).
+
+
+[NVSL]: http://nvsl.ucsd.edu/ "http://nvsl.ucsd.edu"
+[POSIXtest]: http://www.tuxera.com/community/posix-test-suite/ 
+[FAST2016]: https://www.usenix.org/conference/fast16/technical-sessions
+[CSE]: http://cs.ucsd.edu
+[UCSD]: http://www.ucsd.edu
\ No newline at end of file

* [RFC 02/16] NOVA: Superblock and fs layout
  2017-08-03  7:48 ` Steven Swanson
@ 2017-08-03  7:48   ` Steven Swanson
  -1 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson

FS Layout
======================

A Nova file system resides in a single PMEM device. Nova divides the device into
4KB blocks that are arranged like so:

 block
+-----------------------------------------------------+
|  0  | primary super block (struct nova_super_block) |
+-----------------------------------------------------+
|  1  | Reserved inodes                               |
+-----------------------------------------------------+
|  2  | reserved                                      |
+-----------------------------------------------------+
|  3  | Journal pointers                              |
+-----------------------------------------------------+
| 4-5 | Inode pointer tables                          |
+-----------------------------------------------------+
|  6  | reserved                                      |
+-----------------------------------------------------+
|  7  | reserved                                      |
+-----------------------------------------------------+
| ... | data pages                                    |
+-----------------------------------------------------+
| n-2 | replica reserved Inodes                       |
+-----------------------------------------------------+
| n-1 | replica super block                           |
+-----------------------------------------------------+


Superblock and Associated Structures
====================================

The beginning of the PMEM device holds the super block and its associated
tables.  These include reserved inodes, a table of pointers to the journals
Nova uses for complex operations, and pointers to inode tables.  Nova
maintains replicas of the super block and reserved inodes in the last two
blocks of the PMEM area.
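
As a rough illustration of the layout above, the following user-space sketch
maps each region in the diagram to a byte offset for a device of a given size.
It is not taken from the patch: every name prefixed with sketch_ is
hypothetical, and treating block 8 as the first data page is an assumption
read off the diagram rather than something the code in this series confirms.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BLOCK_SIZE  4096ULL	/* 4KB blocks, as in the diagram */
#define SKETCH_HEAD_BLOCKS 8ULL		/* assumption: blocks 0-7 form the head area */
#define SKETCH_TAIL_BLOCKS 2ULL		/* blocks n-2 and n-1 hold the replicas */

/* Byte offsets of each region, relative to the start of the PMEM device. */
struct sketch_layout {
	uint64_t nblocks;		/* n: total number of 4KB blocks */
	uint64_t super_off;		/* block 0 */
	uint64_t reserved_inodes_off;	/* block 1 */
	uint64_t journal_ptrs_off;	/* block 3 */
	uint64_t inode_tables_off;	/* blocks 4-5 */
	uint64_t data_off;		/* first data page (assumed block 8) */
	uint64_t data_blocks;		/* number of data pages */
	uint64_t replica_inodes_off;	/* block n-2 */
	uint64_t replica_super_off;	/* block n-1 */
};

static uint64_t sketch_block_to_off(uint64_t blocknr)
{
	return blocknr * SKETCH_BLOCK_SIZE;
}

/* Fill in the layout; returns -1 if the device has no room for data pages. */
static int sketch_layout_init(struct sketch_layout *l, uint64_t dev_bytes)
{
	uint64_t n = dev_bytes / SKETCH_BLOCK_SIZE;

	if (n < SKETCH_HEAD_BLOCKS + SKETCH_TAIL_BLOCKS + 1)
		return -1;

	l->nblocks = n;
	l->super_off = sketch_block_to_off(0);
	l->reserved_inodes_off = sketch_block_to_off(1);
	l->journal_ptrs_off = sketch_block_to_off(3);
	l->inode_tables_off = sketch_block_to_off(4);
	l->data_off = sketch_block_to_off(SKETCH_HEAD_BLOCKS);
	l->data_blocks = n - SKETCH_HEAD_BLOCKS - SKETCH_TAIL_BLOCKS;
	l->replica_inodes_off = sketch_block_to_off(n - 2);
	l->replica_super_off = sketch_block_to_off(n - 1);
	return 0;
}
```

Calling sketch_layout_init(&l, dev_bytes) with the size of the PMEM region
fills in the offsets.  Placing the replicas relative to the end of the device
means they can be located from the device size alone, which matches the
"last two blocks" description above.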

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 fs/nova/nova.h     | 1137 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/nova_def.h |  154 +++++++
 fs/nova/super.c    | 1222 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/super.h    |  216 +++++++++
 4 files changed, 2729 insertions(+)
 create mode 100644 fs/nova/nova.h
 create mode 100644 fs/nova/nova_def.h
 create mode 100644 fs/nova/super.c
 create mode 100644 fs/nova/super.h

diff --git a/fs/nova/nova.h b/fs/nova/nova.h
new file mode 100644
index 000000000000..b0e9e19b53b7
--- /dev/null
+++ b/fs/nova/nova.h
@@ -0,0 +1,1137 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Definitions for the NOVA filesystem.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+#ifndef __NOVA_H
+#define __NOVA_H
+
+#include <linux/fs.h>
+#include <linux/dax.h>
+#include <linux/init.h>
+#include <linux/time.h>
+#include <linux/rtc.h>
+#include <linux/mm.h>
+#include <linux/delay.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/sched.h>
+#include <linux/mutex.h>
+#include <linux/pagemap.h>
+#include <linux/backing-dev.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/rcupdate.h>
+#include <linux/types.h>
+#include <linux/rbtree.h>
+#include <linux/radix-tree.h>
+#include <linux/version.h>
+#include <linux/kthread.h>
+#include <linux/buffer_head.h>
+#include <linux/uio.h>
+#include <linux/pmem.h>
+#include <linux/iomap.h>
+#include <linux/crc32c.h>
+#include <asm/tlbflush.h>
+#include <linux/version.h>
+#include <linux/pfn_t.h>
+#include <linux/pagevec.h>
+
+#include "nova_def.h"
+#include "stats.h"
+#include "snapshot.h"
+
+#define PAGE_SHIFT_2M 21
+#define PAGE_SHIFT_1G 30
+
+
+/*
+ * Debug code
+ */
+#ifdef pr_fmt
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#endif
+
+/* #define nova_dbg(s, args...)		pr_debug(s, ## args) */
+#define nova_dbg(s, args ...)		pr_info(s, ## args)
+#define nova_dbg1(s, args ...)
+#define nova_err(sb, s, args ...)	nova_error_mng(sb, s, ## args)
+#define nova_warn(s, args ...)		pr_warn(s, ## args)
+#define nova_info(s, args ...)		pr_info(s, ## args)
+
+extern unsigned int nova_dbgmask;
+#define NOVA_DBGMASK_MMAPHUGE	       (0x00000001)
+#define NOVA_DBGMASK_MMAP4K	       (0x00000002)
+#define NOVA_DBGMASK_MMAPVERBOSE       (0x00000004)
+#define NOVA_DBGMASK_MMAPVVERBOSE      (0x00000008)
+#define NOVA_DBGMASK_VERBOSE	       (0x00000010)
+#define NOVA_DBGMASK_TRANSACTION       (0x00000020)
+
+#define nova_dbg_mmap4k(s, args ...)		 \
+	((nova_dbgmask & NOVA_DBGMASK_MMAP4K) ? nova_dbg(s, ## args) : 0)
+#define nova_dbg_mmapv(s, args ...)		 \
+	((nova_dbgmask & NOVA_DBGMASK_MMAPVERBOSE) ? nova_dbg(s, ## args) : 0)
+#define nova_dbg_mmapvv(s, args ...)		 \
+	((nova_dbgmask & NOVA_DBGMASK_MMAPVVERBOSE) ? nova_dbg(s, ## args) : 0)
+
+#define nova_dbg_verbose(s, args ...)		 \
+	((nova_dbgmask & NOVA_DBGMASK_VERBOSE) ? nova_dbg(s, ##args) : 0)
+#define nova_dbgv(s, args ...)	nova_dbg_verbose(s, ##args)
+#define nova_dbg_trans(s, args ...)		 \
+	((nova_dbgmask & NOVA_DBGMASK_TRANSACTION) ? nova_dbg(s, ##args) : 0)
+
+#define NOVA_ASSERT(x) do {\
+			       if (!(x))\
+				       nova_warn("assertion failed %s:%d: %s\n", \
+			       __FILE__, __LINE__, #x);\
+		       } while (0)
+
+#define nova_set_bit		       __test_and_set_bit_le
+#define nova_clear_bit		       __test_and_clear_bit_le
+#define nova_find_next_zero_bit	       find_next_zero_bit_le
+
+#define clear_opt(o, opt)	(o &= ~NOVA_MOUNT_ ## opt)
+#define set_opt(o, opt)		(o |= NOVA_MOUNT_ ## opt)
+#define test_opt(sb, opt)	(NOVA_SB(sb)->s_mount_opt & NOVA_MOUNT_ ## opt)
+
+#define NOVA_LARGE_INODE_TABLE_SIZE    (0x200000)
+/* NOVA size threshold for using 2M blocks for inode table */
+#define NOVA_LARGE_INODE_TABLE_THREASHOLD    (0x20000000)
+/*
+ * nova inode flags
+ *
+ * NOVA_EOFBLOCKS_FL	There are blocks allocated beyond eof
+ */
+#define NOVA_EOFBLOCKS_FL      0x20000000
+/* Flags that should be inherited by new inodes from their parent. */
+#define NOVA_FL_INHERITED (FS_SECRM_FL | FS_UNRM_FL | FS_COMPR_FL | \
+			    FS_SYNC_FL | FS_NODUMP_FL | FS_NOATIME_FL |	\
+			    FS_COMPRBLK_FL | FS_NOCOMP_FL | \
+			    FS_JOURNAL_DATA_FL | FS_NOTAIL_FL | FS_DIRSYNC_FL)
+/* Flags that are appropriate for regular files (all but dir-specific ones). */
+#define NOVA_REG_FLMASK (~(FS_DIRSYNC_FL | FS_TOPDIR_FL))
+/* Flags that are appropriate for non-directories/regular files. */
+#define NOVA_OTHER_FLMASK (FS_NODUMP_FL | FS_NOATIME_FL)
+#define NOVA_FL_USER_VISIBLE (FS_FL_USER_VISIBLE | NOVA_EOFBLOCKS_FL)
+
+/* IOCTLs */
+#define	NOVA_PRINT_TIMING		0xBCD00010
+#define	NOVA_CLEAR_STATS		0xBCD00011
+#define	NOVA_PRINT_LOG			0xBCD00013
+#define	NOVA_PRINT_LOG_BLOCKNODE	0xBCD00014
+#define	NOVA_PRINT_LOG_PAGES		0xBCD00015
+#define	NOVA_PRINT_FREE_LISTS		0xBCD00018
+
+
+#define	READDIR_END			(ULONG_MAX)
+#define	INVALID_CPU			(-1)
+#define	ANY_CPU				(65536)
+#define	FREE_BATCH			(16)
+#define	DEAD_ZONE_BLOCKS		(256)
+
+extern int measure_timing;
+extern int metadata_csum;
+extern int unsafe_metadata;
+extern int inplace_data_updates;
+extern int wprotect;
+extern int data_csum;
+extern int data_parity;
+extern int dram_struct_csum;
+
+extern unsigned int blk_type_to_shift[NOVA_BLOCK_TYPE_MAX];
+extern unsigned int blk_type_to_size[NOVA_BLOCK_TYPE_MAX];
+
+
+
+#define	MMAP_WRITE_BIT	0x20UL	// mmaped for write
+#define	IS_MAP_WRITE(p)	((p) & (MMAP_WRITE_BIT))
+#define	MMAP_ADDR(p)	((p) & (PAGE_MASK))
+
+
+/* Mask out flags that are inappropriate for the given type of inode. */
+static inline __le32 nova_mask_flags(umode_t mode, __le32 flags)
+{
+	flags &= cpu_to_le32(NOVA_FL_INHERITED);
+	if (S_ISDIR(mode))
+		return flags;
+	else if (S_ISREG(mode))
+		return flags & cpu_to_le32(NOVA_REG_FLMASK);
+	else
+		return flags & cpu_to_le32(NOVA_OTHER_FLMASK);
+}
+
+/* Update the crc32c value by appending a 64b data word. */
+#define nova_crc32c_qword(qword, crc) do { \
+	asm volatile ("crc32q %1, %0" \
+		: "=r" (crc) \
+		: "r" (qword), "0" (crc)); \
+	} while (0)
+
+static inline u32 nova_crc32c(u32 crc, const u8 *data, size_t len)
+{
+	u8 *ptr = (u8 *) data;
+	u64 acc = crc; /* accumulator, crc32c value in lower 32b */
+	u32 csum;
+
+	/* x86 instruction crc32 is part of SSE-4.2 */
+	if (static_cpu_has(X86_FEATURE_XMM4_2)) {
+		/* This inline assembly implementation should be equivalent
+		 * to the kernel's crc32c_intel_le_hw() function used by
+		 * crc32c(), but this performs better on test machines.
+		 */
+		while (len >= 8) {
+			asm volatile(/* 64b quad words */
+				"crc32q (%1), %0"
+				: "=r" (acc)
+				: "r"  (ptr), "0" (acc)
+			);
+			ptr += 8;
+			len -= 8;
+		}
+
+		while (len > 0) {
+			asm volatile(/* trailing bytes */
+				"crc32b (%1), %0"
+				: "=r" (acc)
+				: "r"  (ptr), "0" (acc)
+			);
+			ptr++;
+			len--;
+		}
+
+		csum = (u32) acc;
+	} else {
+		/* The kernel's crc32c() function should also detect and use the
+		 * crc32 instruction of SSE-4.2. But calling in to this function
+		 * is about 3x to 5x slower than the inline assembly version on
+		 * some test machines.
+		 */
+		csum = crc32c(crc, data, len);
+	}
+
+	return csum;
+}
+
+/* uses CPU instructions to atomically write up to 8 bytes */
+static inline void nova_memcpy_atomic(void *dst, const void *src, u8 size)
+{
+	switch (size) {
+	case 1: {
+		volatile u8 *daddr = dst;
+		const u8 *saddr = src;
+		*daddr = *saddr;
+		break;
+	}
+	case 2: {
+		volatile __le16 *daddr = dst;
+		const u16 *saddr = src;
+		*daddr = cpu_to_le16(*saddr);
+		break;
+	}
+	case 4: {
+		volatile __le32 *daddr = dst;
+		const u32 *saddr = src;
+		*daddr = cpu_to_le32(*saddr);
+		break;
+	}
+	case 8: {
+		volatile __le64 *daddr = dst;
+		const u64 *saddr = src;
+		*daddr = cpu_to_le64(*saddr);
+		break;
+	}
+	default:
+		nova_dbg("error: memcpy_atomic called with %d bytes\n", size);
+		//BUG();
+	}
+}
+
+static inline int memcpy_to_pmem_nocache(void *dst, const void *src,
+	unsigned int size)
+{
+	int ret;
+
+	ret = __copy_from_user_inatomic_nocache(dst, src, size);
+
+	return ret;
+}
+
+
+/* assumes the length to be 4-byte aligned */
+static inline void memset_nt(void *dest, uint32_t dword, size_t length)
+{
+	uint64_t dummy1, dummy2;
+	uint64_t qword = ((uint64_t)dword << 32) | dword;
+
+	asm volatile ("movl %%edx,%%ecx\n"
+		"andl $63,%%edx\n"
+		"shrl $6,%%ecx\n"
+		"jz 9f\n"
+		"1:	 movnti %%rax,(%%rdi)\n"
+		"2:	 movnti %%rax,1*8(%%rdi)\n"
+		"3:	 movnti %%rax,2*8(%%rdi)\n"
+		"4:	 movnti %%rax,3*8(%%rdi)\n"
+		"5:	 movnti %%rax,4*8(%%rdi)\n"
+		"6:	 movnti %%rax,5*8(%%rdi)\n"
+		"7:	 movnti %%rax,6*8(%%rdi)\n"
+		"8:	 movnti %%rax,7*8(%%rdi)\n"
+		"leaq 64(%%rdi),%%rdi\n"
+		"decl %%ecx\n"
+		"jnz 1b\n"
+		"9:	movl %%edx,%%ecx\n"
+		"andl $7,%%edx\n"
+		"shrl $3,%%ecx\n"
+		"jz 11f\n"
+		"10:	 movnti %%rax,(%%rdi)\n"
+		"leaq 8(%%rdi),%%rdi\n"
+		"decl %%ecx\n"
+		"jnz 10b\n"
+		"11:	 movl %%edx,%%ecx\n"
+		"shrl $2,%%ecx\n"
+		"jz 12f\n"
+		"movnti %%eax,(%%rdi)\n"
+		"12:\n"
+		: "=D"(dummy1), "=d" (dummy2)
+		: "D" (dest), "a" (qword), "d" (length)
+		: "memory", "rcx");
+}
+
+
+#include "super.h" // Remove when we factor out these and other functions.
+
+/* Translate an offset from the beginning of the Nova instance to a PMEM address.
+ *
+ * If this is part of a read-modify-write of the block,
+ * call nova_memunlock_block() before calling!
+ */
+static inline void *nova_get_block(struct super_block *sb, u64 block)
+{
+	struct nova_super_block *ps = nova_get_super(sb);
+
+	return block ? ((void *)ps + block) : NULL;
+}
+
+static inline int nova_get_reference(struct super_block *sb, u64 block,
+	void *dram, void **nvmm, size_t size)
+{
+	int rc;
+
+	*nvmm = nova_get_block(sb, block);
+	rc = memcpy_mcsafe(dram, *nvmm, size);
+	return rc;
+}
+
+
+static inline u64
+nova_get_addr_off(struct nova_sb_info *sbi, void *addr)
+{
+	NOVA_ASSERT((addr >= sbi->virt_addr) &&
+			(addr < (sbi->virt_addr + sbi->initsize)));
+	return (u64)(addr - sbi->virt_addr);
+}
+
+static inline u64
+nova_get_block_off(struct super_block *sb, unsigned long blocknr,
+		    unsigned short btype)
+{
+	return (u64)blocknr << PAGE_SHIFT;
+}
+
+
+static inline u64 nova_get_epoch_id(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	return sbi->s_epoch_id;
+}
+
+static inline void nova_print_curr_epoch_id(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u64 ret;
+
+	ret = sbi->s_epoch_id;
+	nova_dbg("Current epoch id: %llu\n", ret);
+}
+
+#include "inode.h"
+static inline int nova_get_head_tail(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih)
+{
+	struct nova_inode fake_pi;
+	int rc;
+
+	rc = memcpy_mcsafe(&fake_pi, pi, sizeof(struct nova_inode));
+	if (rc)
+		return rc;
+
+	sih->i_blk_type = fake_pi.i_blk_type;
+	sih->log_head = fake_pi.log_head;
+	sih->log_tail = fake_pi.log_tail;
+	sih->alter_log_head = fake_pi.alter_log_head;
+	sih->alter_log_tail = fake_pi.alter_log_tail;
+
+	return rc;
+}
+
+struct nova_range_node_lowhigh {
+	__le64 range_low;
+	__le64 range_high;
+};
+
+#define	RANGENODE_PER_PAGE	254
+
+/* A node in the RB tree representing a range of pages */
+struct nova_range_node {
+	struct rb_node node;
+	struct vm_area_struct *vma;
+	unsigned long mmap_entry;
+	unsigned long range_low;
+	unsigned long range_high;
+	u32	csum;		/* Protect vma, range low/high */
+};
+
+struct vma_item {
+	/* Reuse header of nova_range_node struct */
+	struct rb_node node;
+	struct vm_area_struct *vma;
+	unsigned long mmap_entry;
+};
+
+static inline u32 nova_calculate_range_node_csum(struct nova_range_node *node)
+{
+	u32 crc;
+
+	crc = nova_crc32c(~0, (__u8 *)&node->vma,
+			(unsigned long)&node->csum - (unsigned long)&node->vma);
+
+	return crc;
+}
+
+static inline int nova_update_range_node_checksum(struct nova_range_node *node)
+{
+	if (dram_struct_csum)
+		node->csum = nova_calculate_range_node_csum(node);
+
+	return 0;
+}
+
+static inline bool nova_range_node_checksum_ok(struct nova_range_node *node)
+{
+	bool ret;
+
+	if (dram_struct_csum == 0)
+		return true;
+
+	ret = node->csum == nova_calculate_range_node_csum(node);
+	if (!ret) {
+		nova_dbg("%s: checksum failure, vma %p, range low %lu, range high %lu, csum 0x%x\n",
+			 __func__, node->vma, node->range_low, node->range_high,
+			 node->csum);
+	}
+
+	return ret;
+}
+
+
+enum bm_type {
+	BM_4K = 0,
+	BM_2M,
+	BM_1G,
+};
+
+struct single_scan_bm {
+	unsigned long bitmap_size;
+	unsigned long *bitmap;
+};
+
+struct scan_bitmap {
+	struct single_scan_bm scan_bm_4K;
+	struct single_scan_bm scan_bm_2M;
+	struct single_scan_bm scan_bm_1G;
+};
+
+
+
+struct inode_map {
+	struct mutex		inode_table_mutex;
+	struct rb_root		inode_inuse_tree;
+	unsigned long		num_range_node_inode;
+	struct nova_range_node *first_inode_range;
+	int			allocated;
+	int			freed;
+};
+
+
+
+
+
+
+
+/* Old entry is freeable if it is appended after the latest snapshot */
+static inline int old_entry_freeable(struct super_block *sb, u64 epoch_id)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (epoch_id == sbi->s_epoch_id)
+		return 1;
+
+	return 0;
+}
+
+static inline int pass_mount_snapshot(struct super_block *sb, u64 epoch_id)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (epoch_id > sbi->mount_snapshot_epoch_id)
+		return 1;
+
+	return 0;
+}
+
+
+// BKDR String Hash Function
+static inline unsigned long BKDRHash(const char *str, int length)
+{
+	unsigned int seed = 131; // 31 131 1313 13131 131313 etc..
+	unsigned long hash = 0;
+	int i;
+
+	for (i = 0; i < length; i++)
+		hash = hash * seed + (*str++);
+
+	return hash;
+}
+
+
+#include "mprotect.h"
+
+#include "log.h"
+
+static inline struct nova_file_write_entry *
+nova_get_write_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr)
+{
+	struct nova_file_write_entry *entry;
+
+	entry = radix_tree_lookup(&sih->tree, blocknr);
+
+	return entry;
+}
+
+
+/*
+ * Find data at a file offset (pgoff) in the data pointed to by a write log
+ * entry.
+ */
+static inline unsigned long get_nvmm(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry, unsigned long pgoff)
+{
+	/* entry is already verified before this call and resides in dram
+	 * or we can do memcpy_mcsafe here but have to avoid double copy and
+	 * verification of the entry.
+	 */
+	if (entry->pgoff > pgoff || (unsigned long) entry->pgoff +
+			(unsigned long) entry->num_pages <= pgoff) {
+		struct nova_sb_info *sbi = NOVA_SB(sb);
+		u64 curr;
+
+		curr = nova_get_addr_off(sbi, entry);
+		nova_dbg("Entry ERROR: inode %lu, curr 0x%llx, pgoff %lu, entry pgoff %llu, num %u\n",
+			sih->ino,
+			curr, pgoff, entry->pgoff, entry->num_pages);
+		nova_print_nova_log_pages(sb, sih);
+		nova_print_nova_log(sb, sih);
+		NOVA_ASSERT(0);
+	}
+
+	return (unsigned long) (entry->block >> PAGE_SHIFT) + pgoff
+		- entry->pgoff;
+}
+
+bool nova_verify_entry_csum(struct super_block *sb, void *entry, void *entryc);
+
+static inline u64 nova_find_nvmm_block(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	unsigned long blocknr)
+{
+	unsigned long nvmm;
+	struct nova_file_write_entry *entryc, entry_copy;
+
+	if (!entry) {
+		entry = nova_get_write_entry(sb, sih, blocknr);
+		if (!entry)
+			return 0;
+	}
+
+	/* Don't check entry here as someone else may be modifying it
+	 * when called from reset_vma_csum_parity
+	 */
+	entryc = &entry_copy;
+	if (memcpy_mcsafe(entryc, entry,
+			sizeof(struct nova_file_write_entry)) < 0)
+		return 0;
+
+	nvmm = get_nvmm(sb, sih, entryc, blocknr);
+	return nvmm << PAGE_SHIFT;
+}
+
+
+
+static inline unsigned long
+nova_get_numblocks(unsigned short btype)
+{
+	unsigned long num_blocks;
+
+	if (btype == NOVA_BLOCK_TYPE_4K) {
+		num_blocks = 1;
+	} else if (btype == NOVA_BLOCK_TYPE_2M) {
+		num_blocks = 512;
+	} else {
+		//btype == NOVA_BLOCK_TYPE_1G
+		num_blocks = 0x40000;
+	}
+	return num_blocks;
+}
+
+static inline unsigned long
+nova_get_blocknr(struct super_block *sb, u64 block, unsigned short btype)
+{
+	return block >> PAGE_SHIFT;
+}
+
+static inline unsigned long nova_get_pfn(struct super_block *sb, u64 block)
+{
+	return (NOVA_SB(sb)->phys_addr + block) >> PAGE_SHIFT;
+}
+
+static inline u64 next_log_page(struct super_block *sb, u64 curr)
+{
+	struct nova_inode_log_page *curr_page;
+	u64 next = 0;
+	int rc;
+
+	curr = BLOCK_OFF(curr);
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb, curr);
+	rc = memcpy_mcsafe(&next, &curr_page->page_tail.next_page,
+				sizeof(u64));
+	if (rc)
+		return rc;
+
+	return next;
+}
+
+static inline u64 alter_log_page(struct super_block *sb, u64 curr)
+{
+	struct nova_inode_log_page *curr_page;
+	u64 next = 0;
+	int rc;
+
+	if (metadata_csum == 0)
+		return 0;
+
+	curr = BLOCK_OFF(curr);
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb, curr);
+	rc = memcpy_mcsafe(&next, &curr_page->page_tail.alter_page,
+				sizeof(u64));
+	if (rc)
+		return rc;
+
+	return next;
+}
+
+#if 0
+static inline u64 next_log_page(struct super_block *sb, u64 curr_p)
+{
+	void *curr_addr = nova_get_block(sb, curr_p);
+	unsigned long page_tail = BLOCK_OFF((unsigned long)curr_addr)
+					+ LOG_BLOCK_TAIL;
+	return ((struct nova_inode_page_tail *)page_tail)->next_page;
+}
+
+static inline u64 alter_log_page(struct super_block *sb, u64 curr_p)
+{
+	void *curr_addr = nova_get_block(sb, curr_p);
+	unsigned long page_tail = BLOCK_OFF((unsigned long)curr_addr)
+					+ LOG_BLOCK_TAIL;
+	if (metadata_csum == 0)
+		return 0;
+
+	return ((struct nova_inode_page_tail *)page_tail)->alter_page;
+}
+#endif
+
+static inline u64 alter_log_entry(struct super_block *sb, u64 curr_p)
+{
+	u64 alter_page;
+	void *curr_addr = nova_get_block(sb, curr_p);
+	unsigned long page_tail = BLOCK_OFF((unsigned long)curr_addr)
+					+ LOG_BLOCK_TAIL;
+	if (metadata_csum == 0)
+		return 0;
+
+	alter_page = ((struct nova_inode_page_tail *)page_tail)->alter_page;
+	return alter_page + ENTRY_LOC(curr_p);
+}
+
+static inline void nova_set_next_page_flag(struct super_block *sb, u64 curr_p)
+{
+	void *p;
+
+	if (ENTRY_LOC(curr_p) >= LOG_BLOCK_TAIL)
+		return;
+
+	p = nova_get_block(sb, curr_p);
+	nova_set_entry_type(p, NEXT_PAGE);
+	nova_flush_buffer(p, CACHELINE_SIZE, 1);
+}
+
+static inline void nova_set_next_page_address(struct super_block *sb,
+	struct nova_inode_log_page *curr_page, u64 next_page, int fence)
+{
+	curr_page->page_tail.next_page = next_page;
+	nova_flush_buffer(&curr_page->page_tail,
+				sizeof(struct nova_inode_page_tail), 0);
+	if (fence)
+		PERSISTENT_BARRIER();
+}
+
+static inline void nova_set_page_num_entries(struct super_block *sb,
+	struct nova_inode_log_page *curr_page, int num, int flush)
+{
+	curr_page->page_tail.num_entries = num;
+	if (flush)
+		nova_flush_buffer(&curr_page->page_tail,
+				sizeof(struct nova_inode_page_tail), 0);
+}
+
+static inline void nova_set_page_invalid_entries(struct super_block *sb,
+	struct nova_inode_log_page *curr_page, int num, int flush)
+{
+	curr_page->page_tail.invalid_entries = num;
+	if (flush)
+		nova_flush_buffer(&curr_page->page_tail,
+				sizeof(struct nova_inode_page_tail), 0);
+}
+
+static inline void nova_inc_page_num_entries(struct super_block *sb,
+	u64 curr)
+{
+	struct nova_inode_log_page *curr_page;
+
+	curr = BLOCK_OFF(curr);
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb, curr);
+
+	curr_page->page_tail.num_entries++;
+	nova_flush_buffer(&curr_page->page_tail,
+				sizeof(struct nova_inode_page_tail), 0);
+}
+
+u64 nova_print_log_entry(struct super_block *sb, u64 curr);
+
+static inline void nova_inc_page_invalid_entries(struct super_block *sb,
+	u64 curr)
+{
+	struct nova_inode_log_page *curr_page;
+	u64 old_curr = curr;
+
+	curr = BLOCK_OFF(curr);
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb, curr);
+
+	curr_page->page_tail.invalid_entries++;
+	if (curr_page->page_tail.invalid_entries >
+			curr_page->page_tail.num_entries) {
+		nova_dbg("Page 0x%llx has %u entries, %u invalid\n",
+				curr,
+				curr_page->page_tail.num_entries,
+				curr_page->page_tail.invalid_entries);
+		nova_print_log_entry(sb, old_curr);
+	}
+
+	nova_flush_buffer(&curr_page->page_tail,
+				sizeof(struct nova_inode_page_tail), 0);
+}
+
+static inline void nova_set_alter_page_address(struct super_block *sb,
+	u64 curr, u64 alter_curr)
+{
+	struct nova_inode_log_page *curr_page;
+	struct nova_inode_log_page *alter_page;
+
+	if (metadata_csum == 0)
+		return;
+
+	curr_page = nova_get_block(sb, BLOCK_OFF(curr));
+	alter_page = nova_get_block(sb, BLOCK_OFF(alter_curr));
+
+	curr_page->page_tail.alter_page = alter_curr;
+	nova_flush_buffer(&curr_page->page_tail,
+				sizeof(struct nova_inode_page_tail), 0);
+
+	alter_page->page_tail.alter_page = curr;
+	nova_flush_buffer(&alter_page->page_tail,
+				sizeof(struct nova_inode_page_tail), 0);
+}
+
+#define	CACHE_ALIGN(p)	((p) & ~(CACHELINE_SIZE - 1))
+
+static inline bool is_last_entry(u64 curr_p, size_t size)
+{
+	unsigned int entry_end;
+
+	entry_end = ENTRY_LOC(curr_p) + size;
+
+	return entry_end > LOG_BLOCK_TAIL;
+}
+
+static inline bool goto_next_page(struct super_block *sb, u64 curr_p)
+{
+	void *addr;
+	u8 type;
+	int rc;
+
+	/* Each kind of entry takes at least 32 bytes */
+	if (ENTRY_LOC(curr_p) + 32 > LOG_BLOCK_TAIL)
+		return true;
+
+	addr = nova_get_block(sb, curr_p);
+	rc = memcpy_mcsafe(&type, addr, sizeof(u8));
+
+	if (rc < 0)
+		return true;
+
+	if (type == NEXT_PAGE)
+		return true;
+
+	return false;
+}
+
+static inline int is_dir_init_entry(struct super_block *sb,
+	struct nova_dentry *entry)
+{
+	if (entry->name_len == 1 && strncmp(entry->name, ".", 1) == 0)
+		return 1;
+	if (entry->name_len == 2 && strncmp(entry->name, "..", 2) == 0)
+		return 1;
+
+	return 0;
+}
+
+#include "balloc.h" // remove once we move the following functions away
+
+/* Checksum methods */
+static inline void *nova_get_data_csum_addr(struct super_block *sb, u64 strp_nr,
+	int replica)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	unsigned long blocknr;
+	void *data_csum_addr;
+	u64 blockoff;
+	int index;
+	int BLOCK_SHIFT = PAGE_SHIFT - NOVA_STRIPE_SHIFT;
+
+	if (!data_csum) {
+		nova_dbg("%s: Data checksum is disabled!\n", __func__);
+		return NULL;
+	}
+
+	blocknr = strp_nr >> BLOCK_SHIFT;
+	index = blocknr / sbi->per_list_blocks;
+
+	if (index >= sbi->cpus) {
+		nova_dbg("%s: Invalid blocknr %lu\n", __func__, blocknr);
+		return NULL;
+	}
+
+	strp_nr -= (index * sbi->per_list_blocks) << BLOCK_SHIFT;
+	free_list = nova_get_free_list(sb, index);
+	if (replica == 0)
+		blockoff = free_list->csum_start << PAGE_SHIFT;
+	else
+		blockoff = free_list->replica_csum_start << PAGE_SHIFT;
+
+	/* Range test */
+	if (((NOVA_DATA_CSUM_LEN * strp_nr) >> PAGE_SHIFT) >=
+			free_list->num_csum_blocks) {
+		nova_dbg("%s: Invalid strp number %llu, free list %d\n",
+				__func__, strp_nr, free_list->index);
+		return NULL;
+	}
+
+	data_csum_addr = (u8 *) nova_get_block(sb, blockoff)
+				+ NOVA_DATA_CSUM_LEN * strp_nr;
+
+	return data_csum_addr;
+}
+
+static inline void *nova_get_parity_addr(struct super_block *sb,
+	unsigned long blocknr)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	void *data_csum_addr;
+	u64 blockoff;
+	int index;
+	int BLOCK_SHIFT = PAGE_SHIFT - NOVA_STRIPE_SHIFT;
+
+	if (data_parity == 0) {
+		nova_dbg("%s: Data parity is disabled!\n", __func__);
+		return NULL;
+	}
+
+	index = blocknr / sbi->per_list_blocks;
+
+	if (index >= sbi->cpus) {
+		nova_dbg("%s: Invalid blocknr %lu\n", __func__, blocknr);
+		return NULL;
+	}
+
+	free_list = nova_get_free_list(sb, index);
+	blockoff = free_list->parity_start << PAGE_SHIFT;
+
+	/* Range test */
+	if (((blocknr - free_list->block_start) >> BLOCK_SHIFT) >=
+			free_list->num_parity_blocks) {
+		nova_dbg("%s: Invalid blocknr %lu, free list %d\n",
+				__func__, blocknr, free_list->index);
+		return NULL;
+	}
+
+	data_csum_addr = (u8 *) nova_get_block(sb, blockoff) +
+				((blocknr - free_list->block_start)
+				 << NOVA_STRIPE_SHIFT);
+
+	return data_csum_addr;
+}
+
+/* Function Prototypes */
+
+
+
+/* bbuild.c */
+inline void set_bm(unsigned long bit, struct scan_bitmap *bm,
+	enum bm_type type);
+void nova_save_blocknode_mappings_to_log(struct super_block *sb);
+void nova_save_inode_list_to_log(struct super_block *sb);
+void nova_init_header(struct super_block *sb,
+	struct nova_inode_info_header *sih, u16 i_mode);
+int nova_recovery(struct super_block *sb);
+
+/* checksum.c */
+void nova_update_entry_csum(void *entry);
+int nova_update_block_csum(struct super_block *sb,
+	struct nova_inode_info_header *sih, u8 *block, unsigned long blocknr,
+	size_t offset, size_t bytes, int zero);
+int nova_update_alter_entry(struct super_block *sb, void *entry);
+int nova_check_inode_integrity(struct super_block *sb, u64 ino, u64 pi_addr,
+	u64 alter_pi_addr, struct nova_inode *pic, int check_replica);
+int nova_update_pgoff_csum(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	unsigned long pgoff, int zero);
+bool nova_verify_data_csum(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr,
+	size_t offset, size_t bytes);
+int nova_update_truncated_block_csum(struct super_block *sb,
+	struct inode *inode, loff_t newsize);
+
+/*
+ * Inode and file operations
+ */
+
+/* dax.c */
+int nova_cleanup_incomplete_write(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr,
+	int allocated, u64 begin_tail, u64 end_tail);
+void nova_init_file_write_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	u64 epoch_id, u64 pgoff, int num_pages, u64 blocknr, u32 time,
+	u64 size);
+int nova_reassign_file_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 begin_tail);
+unsigned long nova_check_existing_entry(struct super_block *sb,
+	struct inode *inode, unsigned long num_blocks, unsigned long start_blk,
+	struct nova_file_write_entry **ret_entry,
+	struct nova_file_write_entry *ret_entryc, int check_next, u64 epoch_id,
+	int *inplace, int locked);
+int nova_dax_get_blocks(struct inode *inode, sector_t iblock,
+	unsigned long max_blocks, u32 *bno, bool *new, bool *boundary,
+	int create, bool taking_lock);
+int nova_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
+	unsigned int flags, struct iomap *iomap, bool taking_lock);
+int nova_iomap_end(struct inode *inode, loff_t offset, loff_t length,
+	ssize_t written, unsigned int flags, struct iomap *iomap);
+int nova_insert_write_vma(struct vm_area_struct *vma);
+
+int nova_check_overlap_vmas(struct super_block *sb,
+			    struct nova_inode_info_header *sih,
+			    unsigned long pgoff, unsigned long num_pages);
+int nova_handle_head_tail_blocks(struct super_block *sb,
+				 struct inode *inode, loff_t pos,
+				 size_t count, void *kmem);
+int nova_protect_file_data(struct super_block *sb, struct inode *inode,
+	loff_t pos, size_t count, const char __user *buf, unsigned long blocknr,
+	bool inplace);
+ssize_t nova_inplace_file_write(struct file *filp, const char __user *buf,
+				size_t len, loff_t *ppos);
+
+extern const struct vm_operations_struct nova_dax_vm_ops;
+
+
+/* dir.c */
+extern const struct file_operations nova_dir_operations;
+int nova_insert_dir_radix_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, const char *name,
+	int namelen, struct nova_dentry *direntry);
+int nova_remove_dir_radix_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, const char *name, int namelen,
+	int replay, struct nova_dentry **create_dentry);
+int nova_append_dentry(struct super_block *sb, struct nova_inode *pi,
+	struct inode *dir, struct dentry *dentry, u64 ino,
+	unsigned short de_len, struct nova_inode_update *update,
+	int link_change, u64 epoch_id);
+int nova_append_dir_init_entries(struct super_block *sb,
+	struct nova_inode *pi, u64 self_ino, u64 parent_ino, u64 epoch_id);
+int nova_add_dentry(struct dentry *dentry, u64 ino, int inc_link,
+	struct nova_inode_update *update, u64 epoch_id);
+int nova_remove_dentry(struct dentry *dentry, int dec_link,
+	struct nova_inode_update *update, u64 epoch_id);
+int nova_invalidate_dentries(struct super_block *sb,
+	struct nova_inode_update *update);
+void nova_print_dir_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long ino);
+void nova_delete_dir_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih);
+struct nova_dentry *nova_find_dentry(struct super_block *sb,
+	struct nova_inode *pi, struct inode *inode, const char *name,
+	unsigned long name_len);
+
+/* file.c */
+extern const struct inode_operations nova_file_inode_operations;
+extern const struct file_operations nova_dax_file_operations;
+extern const struct file_operations nova_wrap_file_operations;
+
+
+/* gc.c */
+int nova_inode_log_fast_gc(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	u64 curr_tail, u64 new_block, u64 alter_new_block, int num_pages,
+	int force_thorough);
+
+/* ioctl.c */
+extern long nova_ioctl(struct file *filp, unsigned int cmd, unsigned long arg);
+#ifdef CONFIG_COMPAT
+extern long nova_compat_ioctl(struct file *file, unsigned int cmd,
+	unsigned long arg);
+#endif
+
+
+
+/* mprotect.c */
+extern int nova_dax_mem_protect(struct super_block *sb,
+				 void *vaddr, unsigned long size, int rw);
+int nova_get_vma_overlap_range(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct vm_area_struct *vma,
+	unsigned long entry_pgoff, unsigned long entry_pages,
+	unsigned long *start_pgoff, unsigned long *num_pages);
+int nova_mmap_to_new_blocks(struct vm_area_struct *vma,
+	unsigned long address);
+bool nova_find_pgoff_in_vma(struct inode *inode, unsigned long pgoff);
+int nova_set_vmas_readonly(struct super_block *sb);
+
+/* namei.c */
+extern const struct inode_operations nova_dir_inode_operations;
+extern const struct inode_operations nova_special_inode_operations;
+extern struct dentry *nova_get_parent(struct dentry *child);
+
+/* parity.c */
+int nova_update_pgoff_parity(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	unsigned long pgoff, int zero);
+int nova_update_block_csum_parity(struct super_block *sb,
+	struct nova_inode_info_header *sih, u8 *block, unsigned long blocknr,
+	size_t offset, size_t bytes);
+int nova_restore_data(struct super_block *sb, unsigned long blocknr,
+	unsigned int badstrip_id, void *badstrip, int nvmmerr, u32 csum0,
+	u32 csum1, u32 *csum_good);
+int nova_update_truncated_block_parity(struct super_block *sb,
+	struct inode *inode, loff_t newsize);
+
+/* rebuild.c */
+int nova_reset_csum_parity_range(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	unsigned long start_pgoff, unsigned long end_pgoff, int zero,
+	int check_entry);
+int nova_reset_mapping_csum_parity(struct super_block *sb,
+	struct inode *inode, struct address_space *mapping,
+	unsigned long start_pgoff, unsigned long end_pgoff);
+int nova_reset_vma_csum_parity(struct super_block *sb,
+	struct vma_item *item);
+int nova_rebuild_dir_inode_tree(struct super_block *sb,
+	struct nova_inode *pi, u64 pi_addr,
+	struct nova_inode_info_header *sih);
+int nova_rebuild_inode(struct super_block *sb, struct nova_inode_info *si,
+	u64 ino, u64 pi_addr, int rebuild_dir);
+int nova_restore_snapshot_table(struct super_block *sb, int just_init);
+
+/* snapshot.c */
+int nova_encounter_mount_snapshot(struct super_block *sb, void *addr,
+	u8 type);
+int nova_save_snapshots(struct super_block *sb);
+int nova_destroy_snapshot_infos(struct super_block *sb);
+int nova_restore_snapshot_entry(struct super_block *sb,
+	struct nova_snapshot_info_entry *entry, u64 curr_p, int just_init);
+int nova_mount_snapshot(struct super_block *sb);
+int nova_append_data_to_snapshot(struct super_block *sb,
+	struct nova_file_write_entry *entry, u64 nvmm, u64 num_pages,
+	u64 delete_epoch_id);
+int nova_append_inode_to_snapshot(struct super_block *sb,
+	struct nova_inode *pi);
+int nova_print_snapshots(struct super_block *sb, struct seq_file *seq);
+int nova_print_snapshot_lists(struct super_block *sb, struct seq_file *seq);
+int nova_delete_dead_inode(struct super_block *sb, u64 ino);
+int nova_create_snapshot(struct super_block *sb);
+int nova_delete_snapshot(struct super_block *sb, u64 epoch_id);
+int nova_snapshot_init(struct super_block *sb);
+
+
+/* symlink.c */
+int nova_block_symlink(struct super_block *sb, struct nova_inode *pi,
+	struct inode *inode, const char *symname, int len, u64 epoch_id);
+extern const struct inode_operations nova_symlink_inode_operations;
+
+/* sysfs.c */
+extern const char *proc_dirname;
+extern struct proc_dir_entry *nova_proc_root;
+void nova_sysfs_init(struct super_block *sb);
+void nova_sysfs_exit(struct super_block *sb);
+
+/* nova_stats.c */
+void nova_get_timing_stats(void);
+void nova_get_IO_stats(void);
+void nova_print_timing_stats(struct super_block *sb);
+void nova_clear_stats(struct super_block *sb);
+void nova_print_inode(struct nova_inode *pi);
+void nova_print_inode_log(struct super_block *sb, struct inode *inode);
+void nova_print_inode_log_pages(struct super_block *sb, struct inode *inode);
+int nova_check_inode_logs(struct super_block *sb, struct nova_inode *pi);
+void nova_print_free_lists(struct super_block *sb);
+
+/* perf.c */
+int nova_test_perf(struct super_block *sb, unsigned int func_id,
+	unsigned int poolmb, size_t size, unsigned int disks);
+
+#endif /* __NOVA_H */
diff --git a/fs/nova/nova_def.h b/fs/nova/nova_def.h
new file mode 100644
index 000000000000..61ade439e138
--- /dev/null
+++ b/fs/nova/nova_def.h
@@ -0,0 +1,154 @@
+/*
+ * FILE NAME fs/nova/nova_def.h
+ *
+ * BRIEF DESCRIPTION
+ *
+ * Definitions for the NOVA filesystem.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+#ifndef _LINUX_NOVA_DEF_H
+#define _LINUX_NOVA_DEF_H
+
+#include <linux/types.h>
+#include <linux/magic.h>
+
+#define	NOVA_SUPER_MAGIC	0x4E4F5641	/* NOVA */
+
+/*
+ * The NOVA filesystem constants/structures
+ */
+
+/*
+ * Mount flags
+ */
+#define NOVA_MOUNT_PROTECT      0x000001    /* wprotect CR0.WP */
+#define NOVA_MOUNT_XATTR_USER   0x000002    /* Extended user attributes */
+#define NOVA_MOUNT_POSIX_ACL    0x000004    /* POSIX Access Control Lists */
+#define NOVA_MOUNT_DAX          0x000008    /* Direct Access */
+#define NOVA_MOUNT_ERRORS_CONT  0x000010    /* Continue on errors */
+#define NOVA_MOUNT_ERRORS_RO    0x000020    /* Remount fs ro on errors */
+#define NOVA_MOUNT_ERRORS_PANIC 0x000040    /* Panic on errors */
+#define NOVA_MOUNT_HUGEMMAP     0x000080    /* Huge mappings with mmap */
+#define NOVA_MOUNT_HUGEIOREMAP  0x000100    /* Huge mappings with ioremap */
+#define NOVA_MOUNT_FORMAT       0x000200    /* was FS formatted on mount? */
+
+/*
+ * Maximal count of links to a file
+ */
+#define NOVA_LINK_MAX          32000
+
+#define NOVA_DEF_BLOCK_SIZE_4K 4096
+
+#define NOVA_INODE_BITS   7
+#define NOVA_INODE_SIZE   128    /* must be power of two */
+
+#define NOVA_NAME_LEN 255
+
+#define MAX_CPUS 64
+
+/* NOVA supported data blocks */
+#define NOVA_BLOCK_TYPE_4K     0
+#define NOVA_BLOCK_TYPE_2M     1
+#define NOVA_BLOCK_TYPE_1G     2
+#define NOVA_BLOCK_TYPE_MAX    3
+
+#define META_BLK_SHIFT 9
+
+/*
+ * Play with this knob to change the default block type.
+ * By changing the NOVA_DEFAULT_BLOCK_TYPE to 2M or 1G,
+ * we should get pretty good coverage in testing.
+ */
+#define NOVA_DEFAULT_BLOCK_TYPE NOVA_BLOCK_TYPE_4K
+
+
+/* ======================= Write ordering ========================= */
+
+#define CACHELINE_SIZE  (64)
+#define CACHELINE_MASK  (~(CACHELINE_SIZE - 1))
+#define CACHELINE_ALIGN(addr) (((addr)+CACHELINE_SIZE-1) & CACHELINE_MASK)
+
+
+static inline bool arch_has_clwb(void)
+{
+	return static_cpu_has(X86_FEATURE_CLWB);
+}
+
+extern int support_clwb;
+
+#define _mm_clflush(addr)\
+	asm volatile("clflush %0" : "+m" (*(volatile char *)(addr)))
+#define _mm_clflushopt(addr)\
+	asm volatile(".byte 0x66; clflush %0" : "+m" \
+		     (*(volatile char *)(addr)))
+#define _mm_clwb(addr)\
+	asm volatile(".byte 0x66; xsaveopt %0" : "+m" \
+		     (*(volatile char *)(addr)))
+
+/* Provides ordering from all previous clflush too */
+static inline void PERSISTENT_MARK(void)
+{
+	/* TODO: Fix me. */
+}
+
+static inline void PERSISTENT_BARRIER(void)
+{
+	asm volatile ("sfence\n" : : );
+}
+
+static inline void nova_flush_buffer(void *buf, uint32_t len, bool fence)
+{
+	uint32_t i;
+
+	len = len + ((unsigned long)(buf) & (CACHELINE_SIZE - 1));
+	if (support_clwb) {
+		for (i = 0; i < len; i += CACHELINE_SIZE)
+			_mm_clwb(buf + i);
+	} else {
+		for (i = 0; i < len; i += CACHELINE_SIZE)
+			_mm_clflush(buf + i);
+	}
+	/* Do a fence only if asked. We often don't need to do a fence
+	 * immediately after clflush because even if we get context switched
+	 * between clflush and subsequent fence, the context switch operation
+	 * provides implicit fence.
+	 */
+	if (fence)
+		PERSISTENT_BARRIER();
+}
+
+/* =============== Integrity and Recovery Parameters =============== */
+#define	NOVA_META_CSUM_LEN	(4)
+#define	NOVA_DATA_CSUM_LEN	(4)
+
+/* This is to set the initial value of checksum state register.
+ * For CRC32C this should not matter and can be set to any value.
+ */
+#define	NOVA_INIT_CSUM		(1)
+
+#define	ADDR_ALIGN(p, bytes)	((void *)(((unsigned long)(p)) & ~((bytes) - 1)))
+
+/* Data stripe size in bytes and shift.
+ * In NOVA this size determines the size of a checksummed stripe, and it
+ * equals the amount of data per block (page) that NOVA can afford to lose.
+ * Its value should be no less than the poison radius size of media errors.
+ *
+ * Support NOVA_STRIPE_SHIFT <= PAGE_SHIFT (NOVA file block size shift).
+ */
+#define POISON_RADIUS		(512)
+#define POISON_MASK		(~(POISON_RADIUS - 1))
+#define NOVA_STRIPE_SHIFT	(9) /* size must be >= POISON_RADIUS */
+#define NOVA_STRIPE_SIZE	(1 << NOVA_STRIPE_SHIFT)
+
+#endif /* _LINUX_NOVA_DEF_H */
diff --git a/fs/nova/super.c b/fs/nova/super.c
new file mode 100644
index 000000000000..6be94edf116c
--- /dev/null
+++ b/fs/nova/super.c
@@ -0,0 +1,1222 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Super block operations.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/parser.h>
+#include <linux/vfs.h>
+#include <linux/uaccess.h>
+#include <linux/io.h>
+#include <linux/seq_file.h>
+#include <linux/mount.h>
+#include <linux/mm.h>
+#include <linux/ctype.h>
+#include <linux/bitops.h>
+#include <linux/magic.h>
+#include <linux/exportfs.h>
+#include <linux/random.h>
+#include <linux/cred.h>
+#include <linux/list.h>
+#include <linux/dax.h>
+#include "nova.h"
+#include "journal.h"
+#include "super.h"
+#include "inode.h"
+
+int measure_timing;
+int metadata_csum;
+int wprotect;
+int data_csum;
+int data_parity;
+int dram_struct_csum;
+int support_clwb;
+int inplace_data_updates;
+
+module_param(measure_timing, int, 0444);
+MODULE_PARM_DESC(measure_timing, "Timing measurement");
+
+module_param(metadata_csum, int, 0444);
+MODULE_PARM_DESC(metadata_csum, "Protect metadata structures with replication and checksums");
+
+module_param(wprotect, int, 0444);
+MODULE_PARM_DESC(wprotect, "Write-protect pmem region and use CR0.WP to allow updates");
+
+module_param(data_csum, int, 0444);
+MODULE_PARM_DESC(data_csum, "Detect corruption of data pages using checksum");
+
+module_param(data_parity, int, 0444);
+MODULE_PARM_DESC(data_parity, "Protect file data using RAID-5 style parity.");
+
+module_param(inplace_data_updates, int, 0444);
+MODULE_PARM_DESC(inplace_data_updates, "Perform data updates in-place (i.e., not atomically)");
+
+module_param(dram_struct_csum, int, 0444);
+MODULE_PARM_DESC(dram_struct_csum, "Protect key DRAM data structures with checksums");
+
+module_param(nova_dbgmask, uint, 0444);
+MODULE_PARM_DESC(nova_dbgmask, "Control debugging output");
+
+static struct super_operations nova_sops;
+static const struct export_operations nova_export_ops;
+static struct kmem_cache *nova_inode_cachep;
+static struct kmem_cache *nova_range_node_cachep;
+static struct kmem_cache *nova_snapshot_info_cachep;
+
+/* FIXME: should the following variable be one per NOVA instance? */
+unsigned int nova_dbgmask;
+
+void nova_error_mng(struct super_block *sb, const char *fmt, ...)
+{
+	va_list args;
+
+	printk(KERN_CRIT "nova error: ");
+	va_start(args, fmt);
+	vprintk(fmt, args);
+	va_end(args);
+
+	if (test_opt(sb, ERRORS_PANIC))
+		panic("nova: panic from previous error\n");
+	if (test_opt(sb, ERRORS_RO)) {
+		printk(KERN_CRIT "nova err: remounting filesystem read-only\n");
+		sb->s_flags |= MS_RDONLY;
+	}
+}
+
+static void nova_set_blocksize(struct super_block *sb, unsigned long size)
+{
+	int bits;
+
+	/*
+	 * We've already validated the user input and the value here must be
+	 * between NOVA_MAX_BLOCK_SIZE and NOVA_MIN_BLOCK_SIZE
+	 * and it must be a power of 2.
+	 */
+	bits = fls(size) - 1;
+	sb->s_blocksize_bits = bits;
+	sb->s_blocksize = (1 << bits);
+}
+
+static int nova_get_nvmm_info(struct super_block *sb,
+	struct nova_sb_info *sbi)
+{
+	void *virt_addr = NULL;
+	pfn_t __pfn_t;
+	long size;
+	struct dax_device *dax_dev;
+	int ret;
+
+	ret = bdev_dax_supported(sb, PAGE_SIZE);
+	nova_dbg_verbose("%s: dax_supported = %d; bdev->super=0x%p\n",
+			 __func__, ret, sb->s_bdev->bd_super);
+	if (ret) {
+		nova_err(sb, "device does not support DAX\n");
+		return ret;
+	}
+
+	sbi->s_bdev = sb->s_bdev;
+
+	dax_dev = fs_dax_get_by_host(sb->s_bdev->bd_disk->disk_name);
+	if (!dax_dev) {
+		nova_err(sb, "Couldn't retrieve DAX device.\n");
+		return -EINVAL;
+	}
+	sbi->s_dax_dev = dax_dev;
+
+	size = dax_direct_access(sbi->s_dax_dev, 0, LONG_MAX/PAGE_SIZE,
+				 &virt_addr, &__pfn_t) * PAGE_SIZE;
+	if (size <= 0) {
+		nova_err(sb, "direct_access failed\n");
+		return -EINVAL;
+	}
+
+	sbi->virt_addr = virt_addr;
+
+	if (!sbi->virt_addr) {
+		nova_err(sb, "ioremap of the nova image failed(1)\n");
+		return -EINVAL;
+	}
+
+	sbi->phys_addr = pfn_t_to_pfn(__pfn_t) << PAGE_SHIFT;
+	sbi->initsize = size;
+	sbi->replica_reserved_inodes_addr = virt_addr + size -
+			(sbi->tail_reserved_blocks << PAGE_SHIFT);
+	sbi->replica_sb_addr = virt_addr + size - PAGE_SIZE;
+
+	nova_dbg("%s: dev %s, phys_addr 0x%llx, virt_addr %p, size %ld\n",
+		__func__, sbi->s_bdev->bd_disk->disk_name,
+		sbi->phys_addr, sbi->virt_addr, sbi->initsize);
+
+	return 0;
+}
+
+static loff_t nova_max_size(int bits)
+{
+	loff_t res;
+
+	res = (1ULL << 63) - 1;
+
+	if (res > MAX_LFS_FILESIZE)
+		res = MAX_LFS_FILESIZE;
+
+	nova_dbg_verbose("max file size %llu bytes\n", res);
+	return res;
+}
+
+enum {
+	Opt_bpi, Opt_init, Opt_snapshot, Opt_mode, Opt_uid,
+	Opt_gid, Opt_blocksize, Opt_wprotect,
+	Opt_err_cont, Opt_err_panic, Opt_err_ro,
+	Opt_dbgmask, Opt_err
+};
+
+static const match_table_t tokens = {
+	{ Opt_bpi,	     "bpi=%u"		  },
+	{ Opt_init,	     "init"		  },
+	{ Opt_snapshot,	     "snapshot=%u"	  },
+	{ Opt_mode,	     "mode=%o"		  },
+	{ Opt_uid,	     "uid=%u"		  },
+	{ Opt_gid,	     "gid=%u"		  },
+	{ Opt_wprotect,	     "wprotect"		  },
+	{ Opt_err_cont,	     "errors=continue"	  },
+	{ Opt_err_panic,     "errors=panic"	  },
+	{ Opt_err_ro,	     "errors=remount-ro"  },
+	{ Opt_dbgmask,	     "dbgmask=%u"	  },
+	{ Opt_err,	     NULL		  },
+};
+
+static int nova_parse_options(char *options, struct nova_sb_info *sbi,
+			       bool remount)
+{
+	char *p;
+	substring_t args[MAX_OPT_ARGS];
+	int option;
+	kuid_t uid;
+
+	if (!options)
+		return 0;
+
+	while ((p = strsep(&options, ",")) != NULL) {
+		int token;
+
+		if (!*p)
+			continue;
+
+		token = match_token(p, tokens, args);
+		switch (token) {
+		case Opt_bpi:
+			if (match_int(&args[0], &option))
+				goto bad_val;
+			if (remount && sbi->bpi)
+				goto bad_opt;
+			sbi->bpi = option;
+			break;
+		case Opt_uid:
+			if (match_int(&args[0], &option))
+				goto bad_val;
+			uid = make_kuid(current_user_ns(), option);
+			if (remount && !uid_eq(sbi->uid, uid))
+				goto bad_opt;
+			sbi->uid = uid;
+			break;
+		case Opt_gid:
+			if (match_int(&args[0], &option))
+				goto bad_val;
+			sbi->gid = make_kgid(current_user_ns(), option);
+			break;
+		case Opt_mode:
+			if (match_octal(&args[0], &option))
+				goto bad_val;
+			sbi->mode = option & 01777U;
+			break;
+		case Opt_init:
+			if (remount)
+				goto bad_opt;
+			set_opt(sbi->s_mount_opt, FORMAT);
+			break;
+		case Opt_snapshot:
+			if (match_int(&args[0], &option))
+				goto bad_val;
+			sbi->mount_snapshot = 1;
+			sbi->mount_snapshot_epoch_id = option;
+			break;
+		case Opt_err_panic:
+			clear_opt(sbi->s_mount_opt, ERRORS_CONT);
+			clear_opt(sbi->s_mount_opt, ERRORS_RO);
+			set_opt(sbi->s_mount_opt, ERRORS_PANIC);
+			break;
+		case Opt_err_ro:
+			clear_opt(sbi->s_mount_opt, ERRORS_CONT);
+			clear_opt(sbi->s_mount_opt, ERRORS_PANIC);
+			set_opt(sbi->s_mount_opt, ERRORS_RO);
+			break;
+		case Opt_err_cont:
+			clear_opt(sbi->s_mount_opt, ERRORS_RO);
+			clear_opt(sbi->s_mount_opt, ERRORS_PANIC);
+			set_opt(sbi->s_mount_opt, ERRORS_CONT);
+			break;
+		case Opt_wprotect:
+			if (remount)
+				goto bad_opt;
+			set_opt(sbi->s_mount_opt, PROTECT);
+			nova_info("NOVA: Enabling new Write Protection (CR0.WP)\n");
+			break;
+		case Opt_dbgmask:
+			if (match_int(&args[0], &option))
+				goto bad_val;
+			nova_dbgmask = option;
+			break;
+		default: {
+			goto bad_opt;
+		}
+		}
+	}
+
+	return 0;
+
+bad_val:
+	nova_info("Bad value '%s' for mount option '%s'\n", args[0].from,
+	       p);
+	return -EINVAL;
+bad_opt:
+	nova_info("Bad mount option: \"%s\"\n", p);
+	return -EINVAL;
+}
+
+
+/* Make sure we have enough space */
+static bool nova_check_size(struct super_block *sb, unsigned long size)
+{
+	unsigned long minimum_size;
+
+	/* Space required for super block and root directory. */
+	minimum_size = (HEAD_RESERVED_BLOCKS + TAIL_RESERVED_BLOCKS + 1)
+			  << sb->s_blocksize_bits;
+
+	if (size < minimum_size)
+		return false;
+
+	return true;
+}
+
+static inline int nova_check_super_checksum(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u32 crc = 0;
+
+	// Check CRC but skip s_sum, which is the 4 bytes at the beginning
+	crc = nova_crc32c(~0, (__u8 *)sbi->nova_sb + sizeof(__le32),
+			sizeof(struct nova_super_block) - sizeof(__le32));
+
+	if (sbi->nova_sb->s_sum == cpu_to_le32(crc))
+		return 0;
+	else
+		return 1;
+}
+
+inline void nova_sync_super(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_super_block *super = nova_get_super(sb);
+	struct nova_super_block *super_redund;
+
+	nova_memunlock_super(sb);
+
+	super_redund = nova_get_redund_super(sb);
+
+	memcpy_to_pmem_nocache((void *)super, (void *)sbi->nova_sb,
+		sizeof(struct nova_super_block));
+	PERSISTENT_BARRIER();
+
+	memcpy_to_pmem_nocache((void *)super_redund, (void *)sbi->nova_sb,
+		sizeof(struct nova_super_block));
+	PERSISTENT_BARRIER();
+
+	nova_memlock_super(sb);
+}
+
+/* Update checksum for the DRAM copy */
+inline void nova_update_super_crc(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u32 crc = 0;
+
+	sbi->nova_sb->s_wtime = cpu_to_le32(get_seconds());
+	sbi->nova_sb->s_sum = 0;
+	crc = nova_crc32c(~0, (__u8 *)sbi->nova_sb + sizeof(__le32),
+			sizeof(struct nova_super_block) - sizeof(__le32));
+	sbi->nova_sb->s_sum = cpu_to_le32(crc);
+}
+
+
+static inline void nova_update_mount_time(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u64 mnt_write_time;
+
+	mnt_write_time = (get_seconds() & 0xFFFFFFFF);
+	mnt_write_time = mnt_write_time | (mnt_write_time << 32);
+
+	sbi->nova_sb->s_mtime = cpu_to_le64(mnt_write_time);
+	nova_update_super_crc(sb);
+
+	nova_sync_super(sb);
+}
+
+static struct nova_inode *nova_init(struct super_block *sb,
+				      unsigned long size)
+{
+	unsigned long blocksize;
+	struct nova_inode *root_i, *pi;
+	struct nova_super_block *super;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_update update;
+	u64 epoch_id;
+	timing_t init_time;
+
+	NOVA_START_TIMING(new_init_t, init_time);
+	nova_info("creating an empty nova of size %lu\n", size);
+	sbi->num_blocks = ((unsigned long)(size) >> PAGE_SHIFT);
+
+	nova_dbgv("nova: Default block size set to 4K\n");
+	sbi->blocksize = blocksize = NOVA_DEF_BLOCK_SIZE_4K;
+	nova_set_blocksize(sb, sbi->blocksize);
+
+	if (!nova_check_size(sb, size)) {
+		nova_warn("Specified NOVA size too small 0x%lx.\n", size);
+		return ERR_PTR(-EINVAL);
+	}
+
+	nova_dbgv("max file name len %d\n", (unsigned int)NOVA_NAME_LEN);
+
+	super = nova_get_super(sb);
+
+	nova_memunlock_reserved(sb, super);
+	/* clear out super-block and inode table */
+	memset_nt(super, 0, sbi->head_reserved_blocks * sbi->blocksize);
+
+	pi = nova_get_inode_by_ino(sb, NOVA_BLOCKNODE_INO);
+	pi->nova_ino = cpu_to_le64(NOVA_BLOCKNODE_INO);
+	nova_flush_buffer(pi, CACHELINE_SIZE, 1);
+
+	pi = nova_get_inode_by_ino(sb, NOVA_SNAPSHOT_INO);
+	pi->nova_ino = cpu_to_le64(NOVA_SNAPSHOT_INO);
+	nova_flush_buffer(pi, CACHELINE_SIZE, 1);
+
+	memset(&update, 0, sizeof(struct nova_inode_update));
+	nova_update_inode(sb, &sbi->snapshot_si->vfs_inode, pi, &update, 1);
+
+	nova_memlock_reserved(sb, super);
+
+	nova_init_blockmap(sb, 0);
+
+	if (nova_lite_journal_hard_init(sb) < 0) {
+		nova_err(sb, "Lite journal hard initialization failed\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	if (nova_init_inode_inuse_list(sb) < 0)
+		return ERR_PTR(-EINVAL);
+
+	if (nova_init_inode_table(sb) < 0)
+		return ERR_PTR(-EINVAL);
+
+
+	sbi->nova_sb->s_size = cpu_to_le64(size);
+	sbi->nova_sb->s_blocksize = cpu_to_le32(blocksize);
+	sbi->nova_sb->s_magic = cpu_to_le32(NOVA_SUPER_MAGIC);
+	sbi->nova_sb->s_epoch_id = 0;
+	sbi->nova_sb->s_metadata_csum = metadata_csum;
+	sbi->nova_sb->s_data_csum = data_csum;
+	sbi->nova_sb->s_data_parity = data_parity;
+	nova_update_super_crc(sb);
+
+	nova_sync_super(sb);
+
+	root_i = nova_get_inode_by_ino(sb, NOVA_ROOT_INO);
+	nova_dbgv("%s: Allocate root inode @ 0x%p\n", __func__, root_i);
+
+	nova_memunlock_inode(sb, root_i);
+	root_i->i_mode = cpu_to_le16(sbi->mode | S_IFDIR);
+	root_i->i_uid = cpu_to_le32(from_kuid(&init_user_ns, sbi->uid));
+	root_i->i_gid = cpu_to_le32(from_kgid(&init_user_ns, sbi->gid));
+	root_i->i_links_count = cpu_to_le16(2);
+	root_i->i_blk_type = NOVA_BLOCK_TYPE_4K;
+	root_i->i_flags = 0;
+	root_i->i_size = cpu_to_le64(sb->s_blocksize);
+	root_i->i_atime = root_i->i_mtime = root_i->i_ctime =
+		cpu_to_le32(get_seconds());
+	root_i->nova_ino = cpu_to_le64(NOVA_ROOT_INO);
+	root_i->valid = 1;
+	/* nova_sync_inode(root_i); */
+	nova_flush_buffer(root_i, sizeof(*root_i), false);
+	nova_memlock_inode(sb, root_i);
+
+	epoch_id = nova_get_epoch_id(sb);
+	nova_append_dir_init_entries(sb, root_i, NOVA_ROOT_INO,
+					NOVA_ROOT_INO, epoch_id);
+
+	PERSISTENT_MARK();
+	PERSISTENT_BARRIER();
+	NOVA_END_TIMING(new_init_t, init_time);
+	nova_info("NOVA initialization finished\n");
+	return root_i;
+}
+
+static inline void set_default_opts(struct nova_sb_info *sbi)
+{
+	set_opt(sbi->s_mount_opt, HUGEIOREMAP);
+	set_opt(sbi->s_mount_opt, ERRORS_CONT);
+	sbi->head_reserved_blocks = HEAD_RESERVED_BLOCKS;
+	sbi->tail_reserved_blocks = TAIL_RESERVED_BLOCKS;
+	sbi->cpus = num_online_cpus();
+	sbi->map_id = 0;
+}
+
+static void nova_root_check(struct super_block *sb, struct nova_inode *root_pi)
+{
+	if (!S_ISDIR(le16_to_cpu(root_pi->i_mode)))
+		nova_warn("root is not a directory!\n");
+}
+
+/* Check super block magic and checksum */
+static int nova_check_super(struct super_block *sb,
+	struct nova_super_block *ps)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	int rc;
+
+	rc = memcpy_mcsafe(sbi->nova_sb, ps,
+				sizeof(struct nova_super_block));
+
+	if (rc < 0)
+		return rc;
+
+	if (le32_to_cpu(sbi->nova_sb->s_magic) != NOVA_SUPER_MAGIC)
+		return -EIO;
+
+	if (nova_check_super_checksum(sb))
+		return -EIO;
+
+	return 0;
+}
+
+/* Check whether protection was disabled previously and is enabled now */
+/* FIXME */
+static int nova_check_module_params(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (sbi->nova_sb->s_metadata_csum != metadata_csum) {
+		nova_dbg("%s metadata checksum\n",
+			sbi->nova_sb->s_metadata_csum ? "Enable" : "Disable");
+		metadata_csum = sbi->nova_sb->s_metadata_csum;
+	}
+
+	if (sbi->nova_sb->s_data_csum != data_csum) {
+		nova_dbg("%s data checksum\n",
+			sbi->nova_sb->s_data_csum ? "Enable" : "Disable");
+		data_csum = sbi->nova_sb->s_data_csum;
+	}
+
+	if (sbi->nova_sb->s_data_parity != data_parity) {
+		nova_dbg("%s data parity\n",
+			sbi->nova_sb->s_data_parity ? "Enable" : "Disable");
+		data_parity = sbi->nova_sb->s_data_parity;
+	}
+
+	return 0;
+}
+
+static int nova_check_integrity(struct super_block *sb)
+{
+	struct nova_super_block *super = nova_get_super(sb);
+	struct nova_super_block *super_redund;
+	int rc;
+
+	super_redund = nova_get_redund_super(sb);
+
+	/* Do sanity checks on the superblock */
+	rc = nova_check_super(sb, super);
+	if (rc < 0) {
+		rc = nova_check_super(sb, super_redund);
+		if (rc < 0) {
+			nova_err(sb, "Can't find a valid nova partition\n");
+			return rc;
+		} else {
+			nova_warn("Error in super block: try to repair it with the other copy\n");
+		}
+	}
+
+	nova_sync_super(sb);
+
+	nova_check_module_params(sb);
+	return 0;
+}
+
+static int nova_fill_super(struct super_block *sb, void *data, int silent)
+{
+	struct nova_inode *root_pi;
+	struct nova_sb_info *sbi = NULL;
+	struct inode *root_i = NULL;
+	struct inode_map *inode_map;
+	unsigned long blocksize;
+	size_t strp_size = NOVA_STRIPE_SIZE;
+	u32 random = 0;
+	int retval = -EINVAL;
+	int i;
+	timing_t mount_time;
+
+	NOVA_START_TIMING(mount_t, mount_time);
+
+	BUILD_BUG_ON(sizeof(struct nova_super_block) > NOVA_SB_SIZE);
+	BUILD_BUG_ON(sizeof(struct nova_inode) > NOVA_INODE_SIZE);
+	BUILD_BUG_ON(sizeof(struct nova_inode_log_page) != PAGE_SIZE);
+
+	BUILD_BUG_ON(sizeof(struct journal_ptr_pair) > CACHELINE_SIZE);
+	BUILD_BUG_ON(PAGE_SIZE/sizeof(struct journal_ptr_pair) < MAX_CPUS);
+	BUILD_BUG_ON(PAGE_SIZE/sizeof(struct nova_lite_journal_entry) <
+		     NOVA_MAX_JOURNAL_LENGTH);
+
+	BUILD_BUG_ON(sizeof(struct nova_inode_page_tail) +
+		     LOG_BLOCK_TAIL != PAGE_SIZE);
+
+	sbi = kzalloc(sizeof(struct nova_sb_info), GFP_KERNEL);
+	if (!sbi)
+		return -ENOMEM;
+	sbi->nova_sb = kzalloc(sizeof(struct nova_super_block), GFP_KERNEL);
+	if (!sbi->nova_sb) {
+		kfree(sbi);
+		return -ENOMEM;
+	}
+
+	sb->s_fs_info = sbi;
+	sbi->sb = sb;
+
+	set_default_opts(sbi);
+
+	/* Currently the log page supports 64 journal pointer pairs */
+	if (sbi->cpus > MAX_CPUS) {
+		nova_err(sb, "NOVA needs more log pointer pages to support more than "
+			  __stringify(MAX_CPUS) " cpus.\n");
+		goto out;
+	}
+
+	retval = nova_get_nvmm_info(sb, sbi);
+	if (retval) {
+		nova_err(sb, "%s: Failed to get nvmm info.\n",
+			 __func__);
+		goto out;
+	}
+
+
+	nova_dbg("measure timing %d, metadata checksum %d, inplace update %d, wprotect %d, data checksum %d, data parity %d, DRAM checksum %d\n",
+		measure_timing, metadata_csum,
+		inplace_data_updates, wprotect, data_csum,
+		data_parity, dram_struct_csum);
+
+	get_random_bytes(&random, sizeof(u32));
+	atomic_set(&sbi->next_generation, random);
+
+	/* Init with default values */
+	sbi->mode = (0755);
+	sbi->uid = current_fsuid();
+	sbi->gid = current_fsgid();
+	set_opt(sbi->s_mount_opt, DAX);
+	set_opt(sbi->s_mount_opt, HUGEIOREMAP);
+
+	mutex_init(&sbi->vma_mutex);
+	INIT_LIST_HEAD(&sbi->mmap_sih_list);
+
+	sbi->inode_maps = kcalloc(sbi->cpus, sizeof(struct inode_map),
+					GFP_KERNEL);
+	if (!sbi->inode_maps) {
+		retval = -ENOMEM;
+		nova_dbg("%s: Allocating inode maps failed.\n",
+			 __func__);
+		goto out;
+	}
+
+	for (i = 0; i < sbi->cpus; i++) {
+		inode_map = &sbi->inode_maps[i];
+		mutex_init(&inode_map->inode_table_mutex);
+		inode_map->inode_inuse_tree = RB_ROOT;
+	}
+
+	mutex_init(&sbi->s_lock);
+
+	sbi->zeroed_page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!sbi->zeroed_page) {
+		retval = -ENOMEM;
+		nova_dbg("%s: sbi->zeroed_page failed.\n",
+			 __func__);
+		goto out;
+	}
+
+	for (i = 0; i < 8; i++)
+		sbi->zero_csum[i] = nova_crc32c(NOVA_INIT_CSUM,
+				sbi->zeroed_page, strp_size);
+	sbi->zero_parity = kzalloc(strp_size, GFP_KERNEL);
+
+	if (!sbi->zero_parity) {
+		retval = -ENOMEM;
+		nova_err(sb, "%s: sbi->zero_parity failed.\n",
+			 __func__);
+		goto out;
+	}
+
+	sbi->snapshot_si = kmem_cache_alloc(nova_inode_cachep, GFP_NOFS);
+	nova_snapshot_init(sb);
+
+	retval = nova_parse_options(data, sbi, 0);
+	if (retval) {
+		nova_err(sb, "%s: Failed to parse nova command line options.\n",
+			 __func__);
+		goto out;
+	}
+
+	if (nova_alloc_block_free_lists(sb)) {
+		retval = -ENOMEM;
+		nova_err(sb, "%s: Failed to allocate block free lists.\n",
+			 __func__);
+		goto out;
+	}
+
+	nova_sysfs_init(sb);
+
+	/* Init a new nova instance */
+	if (sbi->s_mount_opt & NOVA_MOUNT_FORMAT) {
+		root_pi = nova_init(sb, sbi->initsize);
+		if (IS_ERR(root_pi)) {
+			retval = PTR_ERR(root_pi);
+			nova_err(sb, "%s: root_pi error.\n",
+				 __func__);
+			goto out;
+		}
+		goto setup_sb;
+	}
+
+	nova_dbg_verbose("checking physical address 0x%016llx for nova image\n",
+		  (u64)sbi->phys_addr);
+
+	retval = nova_check_integrity(sb);
+	if (retval < 0) {
+		nova_dbg("Memory contains invalid nova %x:%x\n",
+			le32_to_cpu(sbi->nova_sb->s_magic), NOVA_SUPER_MAGIC);
+		goto out;
+	}
+
+	if (nova_lite_journal_soft_init(sb)) {
+		retval = -EINVAL;
+		nova_err(sb, "Lite journal initialization failed\n");
+		goto out;
+	}
+
+	if (sbi->mount_snapshot) {
+		retval = nova_mount_snapshot(sb);
+		if (retval) {
+			nova_err(sb, "Mount snapshot failed\n");
+			goto out;
+		}
+	}
+
+	blocksize = le32_to_cpu(sbi->nova_sb->s_blocksize);
+	nova_set_blocksize(sb, blocksize);
+
+	nova_dbg_verbose("blocksize %lu\n", blocksize);
+
+	/* Read the root inode */
+	root_pi = nova_get_inode_by_ino(sb, NOVA_ROOT_INO);
+
+	/* Check that the root inode is in a sane state */
+	nova_root_check(sb, root_pi);
+
+	/* Set it all up.. */
+setup_sb:
+	sb->s_magic = le32_to_cpu(sbi->nova_sb->s_magic);
+	sb->s_op = &nova_sops;
+	sb->s_maxbytes = nova_max_size(sb->s_blocksize_bits);
+	sb->s_time_gran = 1000000000; // 1 second.
+	sb->s_export_op = &nova_export_ops;
+	sb->s_xattr = NULL;
+	sb->s_flags |= MS_NOSEC;
+
+	/* If the FS was not formatted on this mount, scan the meta-data after
+	 * truncate list has been processed
+	 */
+	if ((sbi->s_mount_opt & NOVA_MOUNT_FORMAT) == 0)
+		nova_recovery(sb);
+
+	root_i = nova_iget(sb, NOVA_ROOT_INO);
+	if (IS_ERR(root_i)) {
+		retval = PTR_ERR(root_i);
+		nova_err(sb, "%s: failed to get root inode\n",
+			 __func__);
+
+		goto out;
+	}
+
+	sb->s_root = d_make_root(root_i);
+	if (!sb->s_root) {
+		nova_err(sb, "get nova root inode failed\n");
+		retval = -ENOMEM;
+		goto out;
+	}
+
+	if (!(sb->s_flags & MS_RDONLY))
+		nova_update_mount_time(sb);
+
+	nova_print_curr_epoch_id(sb);
+
+	retval = 0;
+	NOVA_END_TIMING(mount_t, mount_time);
+	return retval;
+out:
+	kfree(sbi->zeroed_page);
+	sbi->zeroed_page = NULL;
+
+	kfree(sbi->zero_parity);
+	sbi->zero_parity = NULL;
+
+	kfree(sbi->free_lists);
+	sbi->free_lists = NULL;
+
+	kfree(sbi->journal_locks);
+	sbi->journal_locks = NULL;
+
+	kfree(sbi->inode_maps);
+	sbi->inode_maps = NULL;
+
+	nova_sysfs_exit(sb);
+
+	kfree(sbi->nova_sb);
+	kfree(sbi);
+	return retval;
+}
+
+int nova_statfs(struct dentry *d, struct kstatfs *buf)
+{
+	struct super_block *sb = d->d_sb;
+	struct nova_sb_info *sbi = (struct nova_sb_info *)sb->s_fs_info;
+
+	buf->f_type = NOVA_SUPER_MAGIC;
+	buf->f_bsize = sb->s_blocksize;
+
+	buf->f_blocks = sbi->num_blocks;
+	buf->f_bfree = buf->f_bavail = nova_count_free_blocks(sb);
+	buf->f_files = LONG_MAX;
+	buf->f_ffree = LONG_MAX - sbi->s_inodes_used_count;
+	buf->f_namelen = NOVA_NAME_LEN;
+	nova_dbg_verbose("nova_stats: total 4k free blocks 0x%llx\n",
+		buf->f_bfree);
+	return 0;
+}
+
+static int nova_show_options(struct seq_file *seq, struct dentry *root)
+{
+	struct nova_sb_info *sbi = NOVA_SB(root->d_sb);
+
+	//seq_printf(seq, ",physaddr=0x%016llx", (u64)sbi->phys_addr);
+	//if (sbi->initsize)
+	//     seq_printf(seq, ",init=%luk", sbi->initsize >> 10);
+	//if (sbi->blocksize)
+	//	 seq_printf(seq, ",bs=%lu", sbi->blocksize);
+	//if (sbi->bpi)
+	//	seq_printf(seq, ",bpi=%lu", sbi->bpi);
+	if (sbi->mode != (0777 | S_ISVTX))
+		seq_printf(seq, ",mode=%03o", sbi->mode);
+	if (uid_valid(sbi->uid))
+		seq_printf(seq, ",uid=%u", from_kuid(&init_user_ns, sbi->uid));
+	if (gid_valid(sbi->gid))
+		seq_printf(seq, ",gid=%u", from_kgid(&init_user_ns, sbi->gid));
+	if (test_opt(root->d_sb, ERRORS_RO))
+		seq_puts(seq, ",errors=remount-ro");
+	if (test_opt(root->d_sb, ERRORS_PANIC))
+		seq_puts(seq, ",errors=panic");
+	/* memory protection disabled by default */
+	if (test_opt(root->d_sb, PROTECT))
+		seq_puts(seq, ",wprotect");
+	//if (test_opt(root->d_sb, DAX))
+	//	seq_puts(seq, ",dax");
+
+	return 0;
+}
+
+int nova_remount(struct super_block *sb, int *mntflags, char *data)
+{
+	unsigned long old_sb_flags;
+	unsigned long old_mount_opt;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	int ret = -EINVAL;
+
+	/* Store the old options */
+	mutex_lock(&sbi->s_lock);
+	old_sb_flags = sb->s_flags;
+	old_mount_opt = sbi->s_mount_opt;
+
+	if (nova_parse_options(data, sbi, 1))
+		goto restore_opt;
+
+	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
+		      ((sbi->s_mount_opt & NOVA_MOUNT_POSIX_ACL) ?
+		       MS_POSIXACL : 0);
+
+	if ((*mntflags & MS_RDONLY) != (sb->s_flags & MS_RDONLY))
+		nova_update_mount_time(sb);
+
+	mutex_unlock(&sbi->s_lock);
+	ret = 0;
+	return ret;
+
+restore_opt:
+	sb->s_flags = old_sb_flags;
+	sbi->s_mount_opt = old_mount_opt;
+	mutex_unlock(&sbi->s_lock);
+	return ret;
+}
+
+static void nova_put_super(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct inode_map *inode_map;
+	int i;
+
+	nova_print_curr_epoch_id(sb);
+
+	/* It's unmount time, so unmap the nova memory */
+//	nova_print_free_lists(sb);
+	if (sbi->virt_addr) {
+		nova_save_snapshots(sb);
+		kmem_cache_free(nova_inode_cachep, sbi->snapshot_si);
+		nova_save_inode_list_to_log(sb);
+		/* Save everything before blocknode mapping! */
+		nova_save_blocknode_mappings_to_log(sb);
+		sbi->virt_addr = NULL;
+	}
+
+	nova_delete_free_lists(sb);
+
+	kfree(sbi->zeroed_page);
+	kfree(sbi->zero_parity);
+	nova_dbgmask = 0;
+	kfree(sbi->free_lists);
+	kfree(sbi->journal_locks);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		inode_map = &sbi->inode_maps[i];
+		nova_dbgv("CPU %d: inode allocated %d, freed %d\n",
+			i, inode_map->allocated, inode_map->freed);
+	}
+
+	kfree(sbi->inode_maps);
+
+	nova_sysfs_exit(sb);
+
+	kfree(sbi->nova_sb);
+	kfree(sbi);
+	sb->s_fs_info = NULL;
+}
+
+inline void nova_free_range_node(struct nova_range_node *node)
+{
+	kmem_cache_free(nova_range_node_cachep, node);
+}
+
+
+inline void nova_free_inode_node(struct super_block *sb,
+	struct nova_range_node *node)
+{
+	nova_free_range_node(node);
+}
+
+inline void nova_free_vma_item(struct super_block *sb,
+	struct vma_item *item)
+{
+	nova_free_range_node((struct nova_range_node *)item);
+}
+
+inline struct snapshot_info *nova_alloc_snapshot_info(struct super_block *sb)
+{
+	struct snapshot_info *p;
+
+	p = (struct snapshot_info *)
+		kmem_cache_alloc(nova_snapshot_info_cachep, GFP_NOFS);
+	return p;
+}
+
+inline void nova_free_snapshot_info(struct snapshot_info *info)
+{
+	kmem_cache_free(nova_snapshot_info_cachep, info);
+}
+
+inline struct nova_range_node *nova_alloc_range_node(struct super_block *sb)
+{
+	struct nova_range_node *p;
+
+	p = (struct nova_range_node *)
+		kmem_cache_zalloc(nova_range_node_cachep, GFP_NOFS);
+	return p;
+}
+
+
+inline struct nova_range_node *nova_alloc_inode_node(struct super_block *sb)
+{
+	return nova_alloc_range_node(sb);
+}
+
+inline struct vma_item *nova_alloc_vma_item(struct super_block *sb)
+{
+	return (struct vma_item *)nova_alloc_range_node(sb);
+}
+
+
+static struct inode *nova_alloc_inode(struct super_block *sb)
+{
+	struct nova_inode_info *vi;
+
+	vi = kmem_cache_alloc(nova_inode_cachep, GFP_NOFS);
+	if (!vi)
+		return NULL;
+
+	vi->vfs_inode.i_version = 1;
+
+	return &vi->vfs_inode;
+}
+
+static void nova_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	struct nova_inode_info *vi = NOVA_I(inode);
+
+	nova_dbg_verbose("%s: ino %lu\n", __func__, inode->i_ino);
+	kmem_cache_free(nova_inode_cachep, vi);
+}
+
+static void nova_destroy_inode(struct inode *inode)
+{
+	nova_dbgv("%s: %lu\n", __func__, inode->i_ino);
+	call_rcu(&inode->i_rcu, nova_i_callback);
+}
+
+static void init_once(void *foo)
+{
+	struct nova_inode_info *vi = foo;
+
+	inode_init_once(&vi->vfs_inode);
+}
+
+
+static int __init init_rangenode_cache(void)
+{
+	nova_range_node_cachep = kmem_cache_create("nova_range_node_cache",
+					sizeof(struct nova_range_node),
+					0, (SLAB_RECLAIM_ACCOUNT |
+					SLAB_MEM_SPREAD), NULL);
+	if (nova_range_node_cachep == NULL)
+		return -ENOMEM;
+	return 0;
+}
+
+static int __init init_snapshot_info_cache(void)
+{
+	nova_snapshot_info_cachep = kmem_cache_create(
+					"nova_snapshot_info_cache",
+					sizeof(struct snapshot_info),
+					0, (SLAB_RECLAIM_ACCOUNT |
+					SLAB_MEM_SPREAD), NULL);
+	if (nova_snapshot_info_cachep == NULL)
+		return -ENOMEM;
+	return 0;
+}
+
+static int __init init_inodecache(void)
+{
+	nova_inode_cachep = kmem_cache_create("nova_inode_cache",
+					       sizeof(struct nova_inode_info),
+					       0, (SLAB_RECLAIM_ACCOUNT |
+						   SLAB_MEM_SPREAD), init_once);
+	if (nova_inode_cachep == NULL)
+		return -ENOMEM;
+	return 0;
+}
+
+static void destroy_inodecache(void)
+{
+	/*
+	 * Make sure all delayed rcu free inodes are flushed before
+	 * we destroy cache.
+	 */
+	rcu_barrier();
+	kmem_cache_destroy(nova_inode_cachep);
+}
+
+static void destroy_rangenode_cache(void)
+{
+	kmem_cache_destroy(nova_range_node_cachep);
+}
+
+static void destroy_snapshot_info_cache(void)
+{
+	kmem_cache_destroy(nova_snapshot_info_cachep);
+}
+
+/*
+ * the super block writes are all done "on the fly", so the
+ * super block is never in a "dirty" state, so there's no need
+ * for write_super.
+ */
+static struct super_operations nova_sops = {
+	.alloc_inode	= nova_alloc_inode,
+	.destroy_inode	= nova_destroy_inode,
+	.write_inode	= nova_write_inode,
+	.dirty_inode	= nova_dirty_inode,
+	.evict_inode	= nova_evict_inode,
+	.put_super	= nova_put_super,
+	.statfs		= nova_statfs,
+	.remount_fs	= nova_remount,
+	.show_options	= nova_show_options,
+};
+
+static struct dentry *nova_mount(struct file_system_type *fs_type,
+				  int flags, const char *dev_name, void *data)
+{
+	return mount_bdev(fs_type, flags, dev_name, data, nova_fill_super);
+}
+
+static struct file_system_type nova_fs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "NOVA",
+	.mount		= nova_mount,
+	.kill_sb	= kill_block_super,
+};
+
+static struct inode *nova_nfs_get_inode(struct super_block *sb,
+					 u64 ino, u32 generation)
+{
+	struct inode *inode;
+
+	if (ino < NOVA_ROOT_INO)
+		return ERR_PTR(-ESTALE);
+
+	if (ino > LONG_MAX)
+		return ERR_PTR(-ESTALE);
+
+	inode = nova_iget(sb, ino);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	if (generation && inode->i_generation != generation) {
+		/* we didn't find the right inode.. */
+		iput(inode);
+		return ERR_PTR(-ESTALE);
+	}
+
+	return inode;
+}
+
+static struct dentry *nova_fh_to_dentry(struct super_block *sb,
+					 struct fid *fid, int fh_len,
+					 int fh_type)
+{
+	return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+				    nova_nfs_get_inode);
+}
+
+static struct dentry *nova_fh_to_parent(struct super_block *sb,
+					 struct fid *fid, int fh_len,
+					 int fh_type)
+{
+	return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+				    nova_nfs_get_inode);
+}
+
+static const struct export_operations nova_export_ops = {
+	.fh_to_dentry	= nova_fh_to_dentry,
+	.fh_to_parent	= nova_fh_to_parent,
+	.get_parent	= nova_get_parent,
+};
+
+static int __init init_nova_fs(void)
+{
+	int rc = 0;
+	timing_t init_time;
+
+	NOVA_START_TIMING(init_t, init_time);
+	nova_dbg("%s: %d cpus online\n", __func__, num_online_cpus());
+	if (arch_has_clwb())
+		support_clwb = 1;
+
+	nova_info("Arch new instructions support: CLWB %s\n",
+			support_clwb ? "YES" : "NO");
+
+	nova_proc_root = proc_mkdir(proc_dirname, NULL);
+
+	nova_dbg("Data structure size: inode %lu, log_page %lu, file_write_entry %lu, dir_entry(max) %d, setattr_entry %lu, link_change_entry %lu\n",
+		sizeof(struct nova_inode),
+		sizeof(struct nova_inode_log_page),
+		sizeof(struct nova_file_write_entry),
+		NOVA_DIR_LOG_REC_LEN(NOVA_NAME_LEN),
+		sizeof(struct nova_setattr_logentry),
+		sizeof(struct nova_link_change_entry));
+
+	rc = init_rangenode_cache();
+	if (rc)
+		return rc;
+
+	rc = init_inodecache();
+	if (rc)
+		goto out1;
+
+	rc = init_snapshot_info_cache();
+	if (rc)
+		goto out2;
+
+	rc = register_filesystem(&nova_fs_type);
+	if (rc)
+		goto out3;
+
+	NOVA_END_TIMING(init_t, init_time);
+	return 0;
+
+out3:
+	destroy_snapshot_info_cache();
+out2:
+	destroy_inodecache();
+out1:
+	destroy_rangenode_cache();
+	return rc;
+}
+
+static void __exit exit_nova_fs(void)
+{
+	unregister_filesystem(&nova_fs_type);
+	remove_proc_entry(proc_dirname, NULL);
+	destroy_snapshot_info_cache();
+	destroy_inodecache();
+	destroy_rangenode_cache();
+}
+
+MODULE_AUTHOR("Andiry Xu <jix024@cs.ucsd.edu>");
+MODULE_DESCRIPTION("NOVA: A Persistent Memory File System");
+MODULE_LICENSE("GPL");
+
+module_init(init_nova_fs);
+module_exit(exit_nova_fs);
diff --git a/fs/nova/super.h b/fs/nova/super.h
new file mode 100644
index 000000000000..8c0ffbf79e9b
--- /dev/null
+++ b/fs/nova/super.h
@@ -0,0 +1,216 @@
+#ifndef __SUPER_H
+#define __SUPER_H
+/*
+ * Structure of the NOVA super block in PMEM
+ *
+ * The fields are partitioned into static and dynamic fields. The static fields
+ * never change after file system creation. This was primarily done because
+ * nova_get_block() returns NULL if the block offset is 0 (helps in catching
+ * bugs). So if we modify any field using journaling (for consistency), we
+ * will have to modify s_sum which is at offset 0. So journaling code fails.
+ * This (static+dynamic fields) is a temporary solution and can be avoided
+ * once the file system becomes stable and nova_get_block() returns correct
+ * pointers even for offset 0.
+ */
+struct nova_super_block {
+	/* static fields. they never change after file system creation.
+	 * checksum only validates up to s_start_dynamic field below
+	 */
+	__le32		s_sum;			/* checksum of this sb */
+	__le32		s_magic;		/* magic signature */
+	__le32		s_padding32;
+	__le32		s_blocksize;		/* blocksize in bytes */
+	__le64		s_size;			/* total size of fs in bytes */
+	char		s_volume_name[16];	/* volume name */
+
+	/* all the dynamic fields should go here */
+	__le64		s_epoch_id;		/* Epoch ID */
+
+	/* s_mtime and s_wtime should be together and their order should not be
+	 * changed. we use an 8 byte write to update both of them atomically
+	 */
+	__le32		s_mtime;		/* mount time */
+	__le32		s_wtime;		/* write time */
+
+	/* Metadata and data protections */
+	u8		s_padding8;
+	u8		s_metadata_csum;
+	u8		s_data_csum;
+	u8		s_data_parity;
+} __attribute((__packed__));
+
+#define NOVA_SB_SIZE 512       /* must be power of two */
+
+/* ======================= Reserved blocks ========================= */
+
+/*
+ * The first block contains the super block.
+ * The second block contains the reserved inodes.
+ * The third block is reserved.
+ * The fourth block contains pointers to journal pages.
+ * The fifth and sixth blocks contain pointers to inode tables.
+ * The seventh and eighth blocks are currently unused.
+ *
+ * If data protection is enabled, more blocks are reserved for checksums and
+ * parity; their number is derived from the total storage size.
+ */
+#define	HEAD_RESERVED_BLOCKS	8
+
+#define SUPER_BLOCK_START       0 // Superblock
+#define	RESERVE_INODE_START	1 // Reserved inodes
+#define	JOURNAL_START		3 // journal pointer table
+#define	INODE_TABLE0_START	4 // inode table
+#define	INODE_TABLE1_START	5 // replica inode table
+
+/* For replica super block and replica reserved inodes */
+#define	TAIL_RESERVED_BLOCKS	2
+
+/* ======================= Reserved inodes ========================= */
+
+/* We have space for 31 reserved inodes */
+#define NOVA_ROOT_INO		(1)
+#define NOVA_INODETABLE_INO	(2)	/* Fake inode associated with inode
+					 * storage.  We need this because our
+					 * allocator requires an inode to be
+					 * associated with each allocation.
+					 * The data actually lives in linked
+					 * lists in INODE_TABLE0_START. */
+#define NOVA_BLOCKNODE_INO	(3)     /* Storage for allocator state */
+#define NOVA_LITEJOURNAL_INO	(4)     /* Storage for lightweight journals */
+#define NOVA_INODELIST1_INO	(5)     /* Storage for Inode free list */
+#define NOVA_SNAPSHOT_INO	(6)	/* Storage for snapshot state */
+#define NOVA_TEST_PERF_INO	(7)
+
+
+/* Normal inode starts at 32 */
+#define NOVA_NORMAL_INODE_START      (32)
+
+
+
+/*
+ * NOVA super-block data in DRAM
+ */
+struct nova_sb_info {
+	struct super_block *sb;			/* VFS super block */
+	struct nova_super_block *nova_sb;	/* DRAM copy of SB */
+	struct block_device *s_bdev;
+	struct dax_device *s_dax_dev;
+
+	/*
+	 * base physical and virtual address of NOVA (which is also
+	 * the pointer to the super block)
+	 */
+	phys_addr_t	phys_addr;
+	void		*virt_addr;
+	void		*replica_reserved_inodes_addr;
+	void		*replica_sb_addr;
+
+	unsigned long	num_blocks;
+
+	/* TODO: Remove this, since it's unused */
+	/*
+	 * Backing store option:
+	 * 1 = no load, 2 = no store,
+	 * else do both
+	 */
+	unsigned int	nova_backing_option;
+
+	/* Mount options */
+	unsigned long	bpi;
+	unsigned long	blocksize;
+	unsigned long	initsize;
+	unsigned long	s_mount_opt;
+	kuid_t		uid;    /* Mount uid for root directory */
+	kgid_t		gid;    /* Mount gid for root directory */
+	umode_t		mode;   /* Mount mode for root directory */
+	atomic_t	next_generation;
+	/* inode tracking */
+	unsigned long	s_inodes_used_count;
+	unsigned long	head_reserved_blocks;
+	unsigned long	tail_reserved_blocks;
+
+	struct mutex	s_lock;	/* protects the SB's buffer-head */
+
+	int cpus;
+	struct proc_dir_entry *s_proc;
+
+	/* Snapshot related */
+	struct nova_inode_info	*snapshot_si;
+	struct radix_tree_root	snapshot_info_tree;
+	int num_snapshots;
+	/* Current epoch. volatile guarantees visibility */
+	volatile u64 s_epoch_id;
+	volatile int snapshot_taking;
+
+	int mount_snapshot;
+	u64 mount_snapshot_epoch_id;
+
+	struct task_struct *snapshot_cleaner_thread;
+	wait_queue_head_t snapshot_cleaner_wait;
+	wait_queue_head_t snapshot_mmap_wait;
+	void *curr_clean_snapshot_info;
+
+	/* DAX-mmap snapshot structures */
+	struct mutex vma_mutex;
+	struct list_head mmap_sih_list;
+
+	/* Zeroed page used to initialize cache pages */
+	void *zeroed_page;
+
+	/* Checksum and parity for zero block */
+	u32 zero_csum[8];
+	void *zero_parity;
+
+	/* Per-CPU journal lock */
+	spinlock_t *journal_locks;
+
+	/* Per-CPU inode map */
+	struct inode_map	*inode_maps;
+
+	/* Decide new inode map id */
+	unsigned long map_id;
+
+	/* Per-CPU free block list */
+	struct free_list *free_lists;
+	unsigned long per_list_blocks;
+};
+
+static inline struct nova_sb_info *NOVA_SB(struct super_block *sb)
+{
+	return sb->s_fs_info;
+}
+
+
+
+static inline struct nova_super_block
+*nova_get_redund_super(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	return (struct nova_super_block *)(sbi->replica_sb_addr);
+}
+
+
+/* If this is part of a read-modify-write of the super block, call
+ * nova_memunlock_super() before calling this function!
+ */
+static inline struct nova_super_block *nova_get_super(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	return (struct nova_super_block *)sbi->virt_addr;
+}
+
+extern struct super_block *nova_read_super(struct super_block *sb, void *data,
+	int silent);
+extern int nova_statfs(struct dentry *d, struct kstatfs *buf);
+extern int nova_remount(struct super_block *sb, int *flags, char *data);
+void *nova_ioremap(struct super_block *sb, phys_addr_t phys_addr,
+	ssize_t size);
+extern struct nova_range_node *nova_alloc_range_node(struct super_block *sb);
+extern void nova_free_range_node(struct nova_range_node *node);
+extern void nova_update_super_crc(struct super_block *sb);
+extern void nova_sync_super(struct super_block *sb);
+
+struct snapshot_info *nova_alloc_snapshot_info(struct super_block *sb);
+#endif

* [RFC 02/16] NOVA: Superblock and fs layout
@ 2017-08-03  7:48   ` Steven Swanson
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson, dan.j.williams

FS Layout
======================

A Nova file system resides in a single PMEM device. Nova divides the device into
4KB blocks that are arranged like so:

 block
+-----------------------------------------------------+
|  0  | primary super block (struct nova_super_block) |
+-----------------------------------------------------+
|  1  | reserved inodes                               |
+-----------------------------------------------------+
|  2  | reserved                                      |
+-----------------------------------------------------+
|  3  | journal pointers                              |
+-----------------------------------------------------+
| 4-5 | inode pointer tables                          |
+-----------------------------------------------------+
|  6  | reserved                                      |
+-----------------------------------------------------+
|  7  | reserved                                      |
+-----------------------------------------------------+
| ... | data pages                                    |
+-----------------------------------------------------+
| n-2 | replica reserved inodes                       |
+-----------------------------------------------------+
| n-1 | replica super block                           |
+-----------------------------------------------------+

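As a rough illustration of the layout above, the block indices can be turned
into byte offsets as sketched below.  The macro values are taken from
fs/nova/super.h in this series; the sketch_* helpers and SKETCH_BLOCK_SHIFT are
hypothetical names introduced only for this example, and the sketch ignores the
extra checksum and parity blocks that are reserved when data protection is
enabled.

```c
#include <assert.h>
#include <stdint.h>

/* Block indices copied from fs/nova/super.h in this patch. */
#define SUPER_BLOCK_START	0
#define RESERVE_INODE_START	1
#define JOURNAL_START		3
#define INODE_TABLE0_START	4
#define INODE_TABLE1_START	5
#define HEAD_RESERVED_BLOCKS	8
#define TAIL_RESERVED_BLOCKS	2

/* Assumption: 4KB blocks, as stated in the layout description. */
#define SKETCH_BLOCK_SHIFT	12

/* Hypothetical helper: byte offset of a block from the start of the device. */
static inline uint64_t sketch_block_to_off(uint64_t blocknr)
{
	return blocknr << SKETCH_BLOCK_SHIFT;
}

/* Hypothetical helper: block holding the replica reserved inodes (n-2). */
static inline uint64_t sketch_replica_inode_block(uint64_t nblocks)
{
	return nblocks - TAIL_RESERVED_BLOCKS;
}

/* Hypothetical helper: block holding the replica super block (n-1). */
static inline uint64_t sketch_replica_sb_block(uint64_t nblocks)
{
	return nblocks - 1;
}

/*
 * Hypothetical helper: blocks left between the head and tail reserved
 * areas.  Returns 0 if the device is too small to hold both areas.
 */
static inline uint64_t sketch_data_blocks(uint64_t nblocks)
{
	if (nblocks < HEAD_RESERVED_BLOCKS + TAIL_RESERVED_BLOCKS)
		return 0;
	return nblocks - HEAD_RESERVED_BLOCKS - TAIL_RESERVED_BLOCKS;
}
```

For a 1024-block (4MB) device this would place the replica reserved inodes in
block 1022 and the replica super block in block 1023, leaving 1014 blocks
between the head and tail reserved areas.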

Superblock and Associated Structures
====================================

The beginning of the PMEM device holds the super block and its associated
tables.  These include reserved inodes, a table of pointers to the journals
Nova uses for complex operations, and pointers to inode tables.  Nova
maintains replicas of the super block and reserved inodes in the last two
blocks of the PMEM area.
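
The on-media super block is small and packed, which can be sanity-checked in
user space.  The following is a minimal sketch, assuming the field order of
struct nova_super_block in fs/nova/super.h; struct sketch_super_block,
SKETCH_SB_SIZE and sketch_static_len() are hypothetical stand-ins, and plain
fixed-width integers replace the kernel's __le32/__le64 types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SB_SIZE	512	/* mirrors NOVA_SB_SIZE */

/* User-space stand-in for struct nova_super_block (same field order). */
struct sketch_super_block {
	/* static fields */
	uint32_t s_sum;
	uint32_t s_magic;
	uint32_t s_padding32;
	uint32_t s_blocksize;
	uint64_t s_size;
	char     s_volume_name[16];

	/* dynamic fields */
	uint64_t s_epoch_id;
	uint32_t s_mtime;
	uint32_t s_wtime;

	uint8_t  s_padding8;
	uint8_t  s_metadata_csum;
	uint8_t  s_data_csum;
	uint8_t  s_data_parity;
} __attribute__((__packed__));

/* The packed structure must fit in the space set aside for it. */
_Static_assert(sizeof(struct sketch_super_block) <= SKETCH_SB_SIZE,
	       "super block larger than its reserved size");

/* s_mtime and s_wtime must be adjacent and start on an 8-byte boundary. */
_Static_assert(offsetof(struct sketch_super_block, s_mtime) % 8 == 0,
	       "s_mtime is not 8-byte aligned");
_Static_assert(offsetof(struct sketch_super_block, s_wtime) ==
	       offsetof(struct sketch_super_block, s_mtime) + 4,
	       "s_wtime does not directly follow s_mtime");

/* Hypothetical helper: number of leading bytes holding the static fields. */
static inline size_t sketch_static_len(void)
{
	return offsetof(struct sketch_super_block, s_epoch_id);
}
```

Under these assumptions the static fields occupy the first 40 bytes of the
structure, and s_mtime sits at an 8-byte-aligned offset, which is what would
allow s_mtime and s_wtime to be updated together with a single 8-byte store.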

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 fs/nova/nova.h     | 1137 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/nova_def.h |  154 +++++++
 fs/nova/super.c    | 1222 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/super.h    |  216 +++++++++
 4 files changed, 2729 insertions(+)
 create mode 100644 fs/nova/nova.h
 create mode 100644 fs/nova/nova_def.h
 create mode 100644 fs/nova/super.c
 create mode 100644 fs/nova/super.h

diff --git a/fs/nova/nova.h b/fs/nova/nova.h
new file mode 100644
index 000000000000..b0e9e19b53b7
--- /dev/null
+++ b/fs/nova/nova.h
@@ -0,0 +1,1137 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Definitions for the NOVA filesystem.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+#ifndef __NOVA_H
+#define __NOVA_H
+
+#include <linux/fs.h>
+#include <linux/dax.h>
+#include <linux/init.h>
+#include <linux/time.h>
+#include <linux/rtc.h>
+#include <linux/mm.h>
+#include <linux/delay.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/sched.h>
+#include <linux/mutex.h>
+#include <linux/pagemap.h>
+#include <linux/backing-dev.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/rcupdate.h>
+#include <linux/types.h>
+#include <linux/rbtree.h>
+#include <linux/radix-tree.h>
+#include <linux/version.h>
+#include <linux/kthread.h>
+#include <linux/buffer_head.h>
+#include <linux/uio.h>
+#include <linux/pmem.h>
+#include <linux/iomap.h>
+#include <linux/crc32c.h>
+#include <asm/tlbflush.h>
+#include <linux/version.h>
+#include <linux/pfn_t.h>
+#include <linux/pagevec.h>
+
+#include "nova_def.h"
+#include "stats.h"
+#include "snapshot.h"
+
+#define PAGE_SHIFT_2M 21
+#define PAGE_SHIFT_1G 30
+
+
+/*
+ * Debug code
+ */
+#ifdef pr_fmt
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#endif
+
+/* #define nova_dbg(s, args...)		pr_debug(s, ## args) */
+#define nova_dbg(s, args ...)		pr_info(s, ## args)
+#define nova_dbg1(s, args ...)
+#define nova_err(sb, s, args ...)	nova_error_mng(sb, s, ## args)
+#define nova_warn(s, args ...)		pr_warn(s, ## args)
+#define nova_info(s, args ...)		pr_info(s, ## args)
+
+extern unsigned int nova_dbgmask;
+#define NOVA_DBGMASK_MMAPHUGE	       (0x00000001)
+#define NOVA_DBGMASK_MMAP4K	       (0x00000002)
+#define NOVA_DBGMASK_MMAPVERBOSE       (0x00000004)
+#define NOVA_DBGMASK_MMAPVVERBOSE      (0x00000008)
+#define NOVA_DBGMASK_VERBOSE	       (0x00000010)
+#define NOVA_DBGMASK_TRANSACTION       (0x00000020)
+
+#define nova_dbg_mmap4k(s, args ...)		 \
+	((nova_dbgmask & NOVA_DBGMASK_MMAP4K) ? nova_dbg(s, ##args) : 0)
+#define nova_dbg_mmapv(s, args ...)		 \
+	((nova_dbgmask & NOVA_DBGMASK_MMAPVERBOSE) ? nova_dbg(s, ##args) : 0)
+#define nova_dbg_mmapvv(s, args ...)		 \
+	((nova_dbgmask & NOVA_DBGMASK_MMAPVVERBOSE) ? nova_dbg(s, ##args) : 0)
+
+#define nova_dbg_verbose(s, args ...)		 \
+	((nova_dbgmask & NOVA_DBGMASK_VERBOSE) ? nova_dbg(s, ##args) : 0)
+#define nova_dbgv(s, args ...)	nova_dbg_verbose(s, ##args)
+#define nova_dbg_trans(s, args ...)		 \
+	((nova_dbgmask & NOVA_DBGMASK_TRANSACTION) ? nova_dbg(s, ##args) : 0)
+
+#define NOVA_ASSERT(x) do {\
+			       if (!(x))\
+				       nova_warn("assertion failed %s:%d: %s\n", \
+			       __FILE__, __LINE__, #x);\
+		       } while (0)
+
+#define nova_set_bit		       __test_and_set_bit_le
+#define nova_clear_bit		       __test_and_clear_bit_le
+#define nova_find_next_zero_bit	       find_next_zero_bit_le
+
+#define clear_opt(o, opt)	(o &= ~NOVA_MOUNT_ ## opt)
+#define set_opt(o, opt)		(o |= NOVA_MOUNT_ ## opt)
+#define test_opt(sb, opt)	(NOVA_SB(sb)->s_mount_opt & NOVA_MOUNT_ ## opt)
+
+#define NOVA_LARGE_INODE_TABLE_SIZE    (0x200000)
+/* NOVA size threshold for using 2M blocks for inode table */
+#define NOVA_LARGE_INODE_TABLE_THREASHOLD    (0x20000000)
+/*
+ * nova inode flags
+ *
+ * NOVA_EOFBLOCKS_FL	There are blocks allocated beyond eof
+ */
+#define NOVA_EOFBLOCKS_FL      0x20000000
+/* Flags that should be inherited by new inodes from their parent. */
+#define NOVA_FL_INHERITED (FS_SECRM_FL | FS_UNRM_FL | FS_COMPR_FL | \
+			    FS_SYNC_FL | FS_NODUMP_FL | FS_NOATIME_FL |	\
+			    FS_COMPRBLK_FL | FS_NOCOMP_FL | \
+			    FS_JOURNAL_DATA_FL | FS_NOTAIL_FL | FS_DIRSYNC_FL)
+/* Flags that are appropriate for regular files (all but dir-specific ones). */
+#define NOVA_REG_FLMASK (~(FS_DIRSYNC_FL | FS_TOPDIR_FL))
+/* Flags that are appropriate for non-directories/regular files. */
+#define NOVA_OTHER_FLMASK (FS_NODUMP_FL | FS_NOATIME_FL)
+#define NOVA_FL_USER_VISIBLE (FS_FL_USER_VISIBLE | NOVA_EOFBLOCKS_FL)
+
+/* IOCTLs */
+#define	NOVA_PRINT_TIMING		0xBCD00010
+#define	NOVA_CLEAR_STATS		0xBCD00011
+#define	NOVA_PRINT_LOG			0xBCD00013
+#define	NOVA_PRINT_LOG_BLOCKNODE	0xBCD00014
+#define	NOVA_PRINT_LOG_PAGES		0xBCD00015
+#define	NOVA_PRINT_FREE_LISTS		0xBCD00018
+
+
+#define	READDIR_END			(ULONG_MAX)
+#define	INVALID_CPU			(-1)
+#define	ANY_CPU				(65536)
+#define	FREE_BATCH			(16)
+#define	DEAD_ZONE_BLOCKS		(256)
+
+extern int measure_timing;
+extern int metadata_csum;
+extern int unsafe_metadata;
+extern int inplace_data_updates;
+extern int wprotect;
+extern int data_csum;
+extern int data_parity;
+extern int dram_struct_csum;
+
+extern unsigned int blk_type_to_shift[NOVA_BLOCK_TYPE_MAX];
+extern unsigned int blk_type_to_size[NOVA_BLOCK_TYPE_MAX];
+
+
+
+#define	MMAP_WRITE_BIT	0x20UL	// mmaped for write
+#define	IS_MAP_WRITE(p)	((p) & (MMAP_WRITE_BIT))
+#define	MMAP_ADDR(p)	((p) & (PAGE_MASK))
+
+
+/* Mask out flags that are inappropriate for the given type of inode. */
+static inline __le32 nova_mask_flags(umode_t mode, __le32 flags)
+{
+	flags &= cpu_to_le32(NOVA_FL_INHERITED);
+	if (S_ISDIR(mode))
+		return flags;
+	else if (S_ISREG(mode))
+		return flags & cpu_to_le32(NOVA_REG_FLMASK);
+	else
+		return flags & cpu_to_le32(NOVA_OTHER_FLMASK);
+}
+
+/* Update the crc32c value by appending a 64b data word. */
+#define nova_crc32c_qword(qword, crc) do { \
+	asm volatile ("crc32q %1, %0" \
+		: "=r" (crc) \
+		: "r" (qword), "0" (crc)); \
+	} while (0)
+
+static inline u32 nova_crc32c(u32 crc, const u8 *data, size_t len)
+{
+	u8 *ptr = (u8 *) data;
+	u64 acc = crc; /* accumulator, crc32c value in lower 32b */
+	u32 csum;
+
+	/* x86 instruction crc32 is part of SSE-4.2 */
+	if (static_cpu_has(X86_FEATURE_XMM4_2)) {
+		/* This inline assembly implementation should be equivalent
+		 * to the kernel's crc32c_intel_le_hw() function used by
+		 * crc32c(), but this performs better on test machines.
+		 */
+		while (len > 8) {
+			asm volatile(/* 64b quad words */
+				"crc32q (%1), %0"
+				: "=r" (acc)
+				: "r"  (ptr), "0" (acc)
+			);
+			ptr += 8;
+			len -= 8;
+		}
+
+		while (len > 0) {
+			asm volatile(/* trailing bytes */
+				"crc32b (%1), %0"
+				: "=r" (acc)
+				: "r"  (ptr), "0" (acc)
+			);
+			ptr++;
+			len--;
+		}
+
+		csum = (u32) acc;
+	} else {
+		/* The kernel's crc32c() function should also detect and use the
+		 * crc32 instruction of SSE-4.2. But calling in to this function
+		 * is about 3x to 5x slower than the inline assembly version on
+		 * some test machines.
+		 */
+		csum = crc32c(crc, data, len);
+	}
+
+	return csum;
+}
+
+/* uses CPU instructions to atomically write up to 8 bytes */
+static inline void nova_memcpy_atomic(void *dst, const void *src, u8 size)
+{
+	switch (size) {
+	case 1: {
+		volatile u8 *daddr = dst;
+		const u8 *saddr = src;
+		*daddr = *saddr;
+		break;
+	}
+	case 2: {
+		volatile __le16 *daddr = dst;
+		const u16 *saddr = src;
+		*daddr = cpu_to_le16(*saddr);
+		break;
+	}
+	case 4: {
+		volatile __le32 *daddr = dst;
+		const u32 *saddr = src;
+		*daddr = cpu_to_le32(*saddr);
+		break;
+	}
+	case 8: {
+		volatile __le64 *daddr = dst;
+		const u64 *saddr = src;
+		*daddr = cpu_to_le64(*saddr);
+		break;
+	}
+	default:
+		nova_dbg("error: memcpy_atomic called with %d bytes\n", size);
+		//BUG();
+	}
+}
+
+static inline int memcpy_to_pmem_nocache(void *dst, const void *src,
+	unsigned int size)
+{
+	int ret;
+
+	ret = __copy_from_user_inatomic_nocache(dst, src, size);
+
+	return ret;
+}
+
+
+/* assumes the length to be 4-byte aligned */
+static inline void memset_nt(void *dest, uint32_t dword, size_t length)
+{
+	uint64_t dummy1, dummy2;
+	uint64_t qword = ((uint64_t)dword << 32) | dword;
+
+	asm volatile ("movl %%edx,%%ecx\n"
+		"andl $63,%%edx\n"
+		"shrl $6,%%ecx\n"
+		"jz 9f\n"
+		"1:	 movnti %%rax,(%%rdi)\n"
+		"2:	 movnti %%rax,1*8(%%rdi)\n"
+		"3:	 movnti %%rax,2*8(%%rdi)\n"
+		"4:	 movnti %%rax,3*8(%%rdi)\n"
+		"5:	 movnti %%rax,4*8(%%rdi)\n"
+		"6:	 movnti %%rax,5*8(%%rdi)\n"
+		"7:	 movnti %%rax,6*8(%%rdi)\n"
+		"8:	 movnti %%rax,7*8(%%rdi)\n"
+		"leaq 64(%%rdi),%%rdi\n"
+		"decl %%ecx\n"
+		"jnz 1b\n"
+		"9:	movl %%edx,%%ecx\n"
+		"andl $7,%%edx\n"
+		"shrl $3,%%ecx\n"
+		"jz 11f\n"
+		"10:	 movnti %%rax,(%%rdi)\n"
+		"leaq 8(%%rdi),%%rdi\n"
+		"decl %%ecx\n"
+		"jnz 10b\n"
+		"11:	 movl %%edx,%%ecx\n"
+		"shrl $2,%%ecx\n"
+		"jz 12f\n"
+		"movnti %%eax,(%%rdi)\n"
+		"12:\n"
+		: "=D"(dummy1), "=d" (dummy2)
+		: "D" (dest), "a" (qword), "d" (length)
+		: "memory", "rcx");
+}
+
+
+#include "super.h" // Remove when we factor out these and other functions.
+
+/* Translate an offset from the start of the Nova instance to a PMEM address.
+ *
+ * If this is part of a read-modify-write of the block, call
+ * nova_memunlock_block() before calling this function!
+ */
+static inline void *nova_get_block(struct super_block *sb, u64 block)
+{
+	struct nova_super_block *ps = nova_get_super(sb);
+
+	return block ? ((void *)ps + block) : NULL;
+}
+
+static inline int nova_get_reference(struct super_block *sb, u64 block,
+	void *dram, void **nvmm, size_t size)
+{
+	int rc;
+
+	*nvmm = nova_get_block(sb, block);
+	rc = memcpy_mcsafe(dram, *nvmm, size);
+	return rc;
+}
+
+
+static inline u64
+nova_get_addr_off(struct nova_sb_info *sbi, void *addr)
+{
+	NOVA_ASSERT((addr >= sbi->virt_addr) &&
+			(addr < (sbi->virt_addr + sbi->initsize)));
+	return (u64)(addr - sbi->virt_addr);
+}
+
+static inline u64
+nova_get_block_off(struct super_block *sb, unsigned long blocknr,
+		    unsigned short btype)
+{
+	return (u64)blocknr << PAGE_SHIFT;
+}
+
+
+static inline u64 nova_get_epoch_id(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	return sbi->s_epoch_id;
+}
+
+static inline void nova_print_curr_epoch_id(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u64 ret;
+
+	ret = sbi->s_epoch_id;
+	nova_dbg("Current epoch id: %llu\n", ret);
+}
+
+#include "inode.h"
+static inline int nova_get_head_tail(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih)
+{
+	struct nova_inode fake_pi;
+	int rc;
+
+	rc = memcpy_mcsafe(&fake_pi, pi, sizeof(struct nova_inode));
+	if (rc)
+		return rc;
+
+	sih->i_blk_type = fake_pi.i_blk_type;
+	sih->log_head = fake_pi.log_head;
+	sih->log_tail = fake_pi.log_tail;
+	sih->alter_log_head = fake_pi.alter_log_head;
+	sih->alter_log_tail = fake_pi.alter_log_tail;
+
+	return rc;
+}
+
+struct nova_range_node_lowhigh {
+	__le64 range_low;
+	__le64 range_high;
+};
+
+#define	RANGENODE_PER_PAGE	254
+
+/* A node in the RB tree representing a range of pages */
+struct nova_range_node {
+	struct rb_node node;
+	struct vm_area_struct *vma;
+	unsigned long mmap_entry;
+	unsigned long range_low;
+	unsigned long range_high;
+	u32	csum;		/* Protect vma, range low/high */
+};
+
+struct vma_item {
+	/* Reuse header of nova_range_node struct */
+	struct rb_node node;
+	struct vm_area_struct *vma;
+	unsigned long mmap_entry;
+};
+
+static inline u32 nova_calculate_range_node_csum(struct nova_range_node *node)
+{
+	u32 crc;
+
+	crc = nova_crc32c(~0, (__u8 *)&node->vma,
+			(unsigned long)&node->csum - (unsigned long)&node->vma);
+
+	return crc;
+}
+
+static inline int nova_update_range_node_checksum(struct nova_range_node *node)
+{
+	if (dram_struct_csum)
+		node->csum = nova_calculate_range_node_csum(node);
+
+	return 0;
+}
+
+static inline bool nova_range_node_checksum_ok(struct nova_range_node *node)
+{
+	bool ret;
+
+	if (dram_struct_csum == 0)
+		return true;
+
+	ret = node->csum == nova_calculate_range_node_csum(node);
+	if (!ret) {
+		nova_dbg("%s: checksum failure, vma %p, range low %lu, range high %lu, csum 0x%x\n",
+			 __func__, node->vma, node->range_low, node->range_high,
+			 node->csum);
+	}
+
+	return ret;
+}
+
+
+enum bm_type {
+	BM_4K = 0,
+	BM_2M,
+	BM_1G,
+};
+
+struct single_scan_bm {
+	unsigned long bitmap_size;
+	unsigned long *bitmap;
+};
+
+struct scan_bitmap {
+	struct single_scan_bm scan_bm_4K;
+	struct single_scan_bm scan_bm_2M;
+	struct single_scan_bm scan_bm_1G;
+};
+
+
+
+struct inode_map {
+	struct mutex		inode_table_mutex;
+	struct rb_root		inode_inuse_tree;
+	unsigned long		num_range_node_inode;
+	struct nova_range_node *first_inode_range;
+	int			allocated;
+	int			freed;
+};
+
+
+
+
+
+
+
+/* Old entry is freeable if it is appended after the latest snapshot */
+static inline int old_entry_freeable(struct super_block *sb, u64 epoch_id)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (epoch_id == sbi->s_epoch_id)
+		return 1;
+
+	return 0;
+}
+
+static inline int pass_mount_snapshot(struct super_block *sb, u64 epoch_id)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (epoch_id > sbi->mount_snapshot_epoch_id)
+		return 1;
+
+	return 0;
+}
+
+
+/* BKDR String Hash Function */
+static inline unsigned long BKDRHash(const char *str, int length)
+{
+	unsigned int seed = 131; /* 31 131 1313 13131 131313 etc.. */
+	unsigned long hash = 0;
+	int i;
+
+	for (i = 0; i < length; i++)
+		hash = hash * seed + (*str++);
+
+	return hash;
+}
+
+
+#include "mprotect.h"
+
+#include "log.h"
+
+static inline struct nova_file_write_entry *
+nova_get_write_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr)
+{
+	struct nova_file_write_entry *entry;
+
+	entry = radix_tree_lookup(&sih->tree, blocknr);
+
+	return entry;
+}
+
+
+/*
+ * Find data at a file offset (pgoff) in the data pointed to by a write log
+ * entry.
+ */
+static inline unsigned long get_nvmm(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry, unsigned long pgoff)
+{
+	/* entry is already verified before this call and resides in DRAM.
+	 * Alternatively we could do memcpy_mcsafe here, but we would have to
+	 * avoid a double copy and re-verification of the entry.
+	 */
+	if (entry->pgoff > pgoff || (unsigned long) entry->pgoff +
+			(unsigned long) entry->num_pages <= pgoff) {
+		struct nova_sb_info *sbi = NOVA_SB(sb);
+		u64 curr;
+
+		curr = nova_get_addr_off(sbi, entry);
+		nova_dbg("Entry ERROR: inode %lu, curr 0x%llx, pgoff %lu, entry pgoff %llu, num %u\n",
+			sih->ino,
+			curr, pgoff, entry->pgoff, entry->num_pages);
+		nova_print_nova_log_pages(sb, sih);
+		nova_print_nova_log(sb, sih);
+		NOVA_ASSERT(0);
+	}
+
+	return (unsigned long) (entry->block >> PAGE_SHIFT) + pgoff
+		- entry->pgoff;
+}
+
+bool nova_verify_entry_csum(struct super_block *sb, void *entry, void *entryc);
+
+static inline u64 nova_find_nvmm_block(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	unsigned long blocknr)
+{
+	unsigned long nvmm;
+	struct nova_file_write_entry *entryc, entry_copy;
+
+	if (!entry) {
+		entry = nova_get_write_entry(sb, sih, blocknr);
+		if (!entry)
+			return 0;
+	}
+
+	/* Don't check entry here as someone else may be modifying it
+	 * when called from reset_vma_csum_parity
+	 */
+	entryc = &entry_copy;
+	if (memcpy_mcsafe(entryc, entry,
+			sizeof(struct nova_file_write_entry)) < 0)
+		return 0;
+
+	nvmm = get_nvmm(sb, sih, entryc, blocknr);
+	return nvmm << PAGE_SHIFT;
+}
+
+
+
+static inline unsigned long
+nova_get_numblocks(unsigned short btype)
+{
+	unsigned long num_blocks;
+
+	if (btype == NOVA_BLOCK_TYPE_4K) {
+		num_blocks = 1;
+	} else if (btype == NOVA_BLOCK_TYPE_2M) {
+		num_blocks = 512;
+	} else {
+		/* btype == NOVA_BLOCK_TYPE_1G */
+		num_blocks = 0x40000;
+	}
+	return num_blocks;
+}
+
+static inline unsigned long
+nova_get_blocknr(struct super_block *sb, u64 block, unsigned short btype)
+{
+	return block >> PAGE_SHIFT;
+}
+
+static inline unsigned long nova_get_pfn(struct super_block *sb, u64 block)
+{
+	return (NOVA_SB(sb)->phys_addr + block) >> PAGE_SHIFT;
+}
+
+static inline u64 next_log_page(struct super_block *sb, u64 curr)
+{
+	struct nova_inode_log_page *curr_page;
+	u64 next = 0;
+	int rc;
+
+	curr = BLOCK_OFF(curr);
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb, curr);
+	rc = memcpy_mcsafe(&next, &curr_page->page_tail.next_page,
+				sizeof(u64));
+	if (rc)
+		return 0; /* an errno must not be returned as a u64 address */
+
+	return next;
+}
+
+static inline u64 alter_log_page(struct super_block *sb, u64 curr)
+{
+	struct nova_inode_log_page *curr_page;
+	u64 next = 0;
+	int rc;
+
+	if (metadata_csum == 0)
+		return 0;
+
+	curr = BLOCK_OFF(curr);
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb, curr);
+	rc = memcpy_mcsafe(&next, &curr_page->page_tail.alter_page,
+				sizeof(u64));
+	if (rc)
+		return 0; /* an errno must not be returned as a u64 address */
+
+	return next;
+}
+
+#if 0
+static inline u64 next_log_page(struct super_block *sb, u64 curr_p)
+{
+	void *curr_addr = nova_get_block(sb, curr_p);
+	unsigned long page_tail = BLOCK_OFF((unsigned long)curr_addr)
+					+ LOG_BLOCK_TAIL;
+	return ((struct nova_inode_page_tail *)page_tail)->next_page;
+}
+
+static inline u64 alter_log_page(struct super_block *sb, u64 curr_p)
+{
+	void *curr_addr = nova_get_block(sb, curr_p);
+	unsigned long page_tail = BLOCK_OFF((unsigned long)curr_addr)
+					+ LOG_BLOCK_TAIL;
+	if (metadata_csum == 0)
+		return 0;
+
+	return ((struct nova_inode_page_tail *)page_tail)->alter_page;
+}
+#endif
+
+static inline u64 alter_log_entry(struct super_block *sb, u64 curr_p)
+{
+	u64 alter_page;
+	void *curr_addr = nova_get_block(sb, curr_p);
+	unsigned long page_tail = BLOCK_OFF((unsigned long)curr_addr)
+					+ LOG_BLOCK_TAIL;
+	if (metadata_csum == 0)
+		return 0;
+
+	alter_page = ((struct nova_inode_page_tail *)page_tail)->alter_page;
+	return alter_page + ENTRY_LOC(curr_p);
+}
+
+static inline void nova_set_next_page_flag(struct super_block *sb, u64 curr_p)
+{
+	void *p;
+
+	if (ENTRY_LOC(curr_p) >= LOG_BLOCK_TAIL)
+		return;
+
+	p = nova_get_block(sb, curr_p);
+	nova_set_entry_type(p, NEXT_PAGE);
+	nova_flush_buffer(p, CACHELINE_SIZE, 1);
+}
+
+static inline void nova_set_next_page_address(struct super_block *sb,
+	struct nova_inode_log_page *curr_page, u64 next_page, int fence)
+{
+	curr_page->page_tail.next_page = next_page;
+	nova_flush_buffer(&curr_page->page_tail,
+				sizeof(struct nova_inode_page_tail), 0);
+	if (fence)
+		PERSISTENT_BARRIER();
+}
+
+static inline void nova_set_page_num_entries(struct super_block *sb,
+	struct nova_inode_log_page *curr_page, int num, int flush)
+{
+	curr_page->page_tail.num_entries = num;
+	if (flush)
+		nova_flush_buffer(&curr_page->page_tail,
+				sizeof(struct nova_inode_page_tail), 0);
+}
+
+static inline void nova_set_page_invalid_entries(struct super_block *sb,
+	struct nova_inode_log_page *curr_page, int num, int flush)
+{
+	curr_page->page_tail.invalid_entries = num;
+	if (flush)
+		nova_flush_buffer(&curr_page->page_tail,
+				sizeof(struct nova_inode_page_tail), 0);
+}
+
+static inline void nova_inc_page_num_entries(struct super_block *sb,
+	u64 curr)
+{
+	struct nova_inode_log_page *curr_page;
+
+	curr = BLOCK_OFF(curr);
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb, curr);
+
+	curr_page->page_tail.num_entries++;
+	nova_flush_buffer(&curr_page->page_tail,
+				sizeof(struct nova_inode_page_tail), 0);
+}
+
+u64 nova_print_log_entry(struct super_block *sb, u64 curr);
+
+static inline void nova_inc_page_invalid_entries(struct super_block *sb,
+	u64 curr)
+{
+	struct nova_inode_log_page *curr_page;
+	u64 old_curr = curr;
+
+	curr = BLOCK_OFF(curr);
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb, curr);
+
+	curr_page->page_tail.invalid_entries++;
+	if (curr_page->page_tail.invalid_entries >
+			curr_page->page_tail.num_entries) {
+		nova_dbg("Page 0x%llx has %u entries, %u invalid\n",
+				curr,
+				curr_page->page_tail.num_entries,
+				curr_page->page_tail.invalid_entries);
+		nova_print_log_entry(sb, old_curr);
+	}
+
+	nova_flush_buffer(&curr_page->page_tail,
+				sizeof(struct nova_inode_page_tail), 0);
+}
+
+static inline void nova_set_alter_page_address(struct super_block *sb,
+	u64 curr, u64 alter_curr)
+{
+	struct nova_inode_log_page *curr_page;
+	struct nova_inode_log_page *alter_page;
+
+	if (metadata_csum == 0)
+		return;
+
+	curr_page = nova_get_block(sb, BLOCK_OFF(curr));
+	alter_page = nova_get_block(sb, BLOCK_OFF(alter_curr));
+
+	curr_page->page_tail.alter_page = alter_curr;
+	nova_flush_buffer(&curr_page->page_tail,
+				sizeof(struct nova_inode_page_tail), 0);
+
+	alter_page->page_tail.alter_page = curr;
+	nova_flush_buffer(&alter_page->page_tail,
+				sizeof(struct nova_inode_page_tail), 0);
+}
+
+#define	CACHE_ALIGN(p)	((p) & ~(CACHELINE_SIZE - 1))
+
+static inline bool is_last_entry(u64 curr_p, size_t size)
+{
+	unsigned int entry_end;
+
+	entry_end = ENTRY_LOC(curr_p) + size;
+
+	return entry_end > LOG_BLOCK_TAIL;
+}
+
+static inline bool goto_next_page(struct super_block *sb, u64 curr_p)
+{
+	void *addr;
+	u8 type;
+	int rc;
+
+	/* Each kind of entry takes at least 32 bytes */
+	if (ENTRY_LOC(curr_p) + 32 > LOG_BLOCK_TAIL)
+		return true;
+
+	addr = nova_get_block(sb, curr_p);
+	rc = memcpy_mcsafe(&type, addr, sizeof(u8));
+
+	if (rc < 0)
+		return true;
+
+	if (type == NEXT_PAGE)
+		return true;
+
+	return false;
+}
+
+static inline int is_dir_init_entry(struct super_block *sb,
+	struct nova_dentry *entry)
+{
+	if (entry->name_len == 1 && strncmp(entry->name, ".", 1) == 0)
+		return 1;
+	if (entry->name_len == 2 && strncmp(entry->name, "..", 2) == 0)
+		return 1;
+
+	return 0;
+}
+
+#include "balloc.h" /* remove once we move the following functions away */
+
+/* Checksum methods */
+static inline void *nova_get_data_csum_addr(struct super_block *sb, u64 strp_nr,
+	int replica)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	unsigned long blocknr;
+	void *data_csum_addr;
+	u64 blockoff;
+	int index;
+	int BLOCK_SHIFT = PAGE_SHIFT - NOVA_STRIPE_SHIFT;
+
+	if (!data_csum) {
+		nova_dbg("%s: Data checksum is disabled!\n", __func__);
+		return NULL;
+	}
+
+	blocknr = strp_nr >> BLOCK_SHIFT;
+	index = blocknr / sbi->per_list_blocks;
+
+	if (index >= sbi->cpus) {
+		nova_dbg("%s: Invalid blocknr %lu\n", __func__, blocknr);
+		return NULL;
+	}
+
+	strp_nr -= (index * sbi->per_list_blocks) << BLOCK_SHIFT;
+	free_list = nova_get_free_list(sb, index);
+	if (replica == 0)
+		blockoff = free_list->csum_start << PAGE_SHIFT;
+	else
+		blockoff = free_list->replica_csum_start << PAGE_SHIFT;
+
+	/* Range test */
+	if (((NOVA_DATA_CSUM_LEN * strp_nr) >> PAGE_SHIFT) >=
+			free_list->num_csum_blocks) {
+		nova_dbg("%s: Invalid strp number %llu, free list %d\n",
+				__func__, strp_nr, free_list->index);
+		return NULL;
+	}
+
+	data_csum_addr = (u8 *) nova_get_block(sb, blockoff)
+				+ NOVA_DATA_CSUM_LEN * strp_nr;
+
+	return data_csum_addr;
+}
+
+static inline void *nova_get_parity_addr(struct super_block *sb,
+	unsigned long blocknr)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	void *data_csum_addr;
+	u64 blockoff;
+	int index;
+	int BLOCK_SHIFT = PAGE_SHIFT - NOVA_STRIPE_SHIFT;
+
+	if (data_parity == 0) {
+		nova_dbg("%s: Data parity is disabled!\n", __func__);
+		return NULL;
+	}
+
+	index = blocknr / sbi->per_list_blocks;
+
+	if (index >= sbi->cpus) {
+		nova_dbg("%s: Invalid blocknr %lu\n", __func__, blocknr);
+		return NULL;
+	}
+
+	free_list = nova_get_free_list(sb, index);
+	blockoff = free_list->parity_start << PAGE_SHIFT;
+
+	/* Range test */
+	if (((blocknr - free_list->block_start) >> BLOCK_SHIFT) >=
+			free_list->num_parity_blocks) {
+		nova_dbg("%s: Invalid blocknr %lu, free list %d\n",
+				__func__, blocknr, free_list->index);
+		return NULL;
+	}
+
+	data_csum_addr = (u8 *) nova_get_block(sb, blockoff) +
+				((blocknr - free_list->block_start)
+				 << NOVA_STRIPE_SHIFT);
+
+	return data_csum_addr;
+}
+
+/* Function Prototypes */
+
+
+
+/* bbuild.c */
+inline void set_bm(unsigned long bit, struct scan_bitmap *bm,
+	enum bm_type type);
+void nova_save_blocknode_mappings_to_log(struct super_block *sb);
+void nova_save_inode_list_to_log(struct super_block *sb);
+void nova_init_header(struct super_block *sb,
+	struct nova_inode_info_header *sih, u16 i_mode);
+int nova_recovery(struct super_block *sb);
+
+/* checksum.c */
+void nova_update_entry_csum(void *entry);
+int nova_update_block_csum(struct super_block *sb,
+	struct nova_inode_info_header *sih, u8 *block, unsigned long blocknr,
+	size_t offset, size_t bytes, int zero);
+int nova_update_alter_entry(struct super_block *sb, void *entry);
+int nova_check_inode_integrity(struct super_block *sb, u64 ino, u64 pi_addr,
+	u64 alter_pi_addr, struct nova_inode *pic, int check_replica);
+int nova_update_pgoff_csum(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	unsigned long pgoff, int zero);
+bool nova_verify_data_csum(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr,
+	size_t offset, size_t bytes);
+int nova_update_truncated_block_csum(struct super_block *sb,
+	struct inode *inode, loff_t newsize);
+
+/*
+ * Inode and file operations
+ */
+
+/* dax.c */
+int nova_cleanup_incomplete_write(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr,
+	int allocated, u64 begin_tail, u64 end_tail);
+void nova_init_file_write_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	u64 epoch_id, u64 pgoff, int num_pages, u64 blocknr, u32 time,
+	u64 size);
+int nova_reassign_file_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 begin_tail);
+unsigned long nova_check_existing_entry(struct super_block *sb,
+	struct inode *inode, unsigned long num_blocks, unsigned long start_blk,
+	struct nova_file_write_entry **ret_entry,
+	struct nova_file_write_entry *ret_entryc, int check_next, u64 epoch_id,
+	int *inplace, int locked);
+int nova_dax_get_blocks(struct inode *inode, sector_t iblock,
+	unsigned long max_blocks, u32 *bno, bool *new, bool *boundary,
+	int create, bool taking_lock);
+int nova_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
+	unsigned int flags, struct iomap *iomap, bool taking_lock);
+int nova_iomap_end(struct inode *inode, loff_t offset, loff_t length,
+	ssize_t written, unsigned int flags, struct iomap *iomap);
+int nova_insert_write_vma(struct vm_area_struct *vma);
+
+int nova_check_overlap_vmas(struct super_block *sb,
+			    struct nova_inode_info_header *sih,
+			    unsigned long pgoff, unsigned long num_pages);
+int nova_handle_head_tail_blocks(struct super_block *sb,
+				 struct inode *inode, loff_t pos,
+				 size_t count, void *kmem);
+int nova_protect_file_data(struct super_block *sb, struct inode *inode,
+	loff_t pos, size_t count, const char __user *buf, unsigned long blocknr,
+	bool inplace);
+ssize_t nova_inplace_file_write(struct file *filp, const char __user *buf,
+				size_t len, loff_t *ppos);
+
+extern const struct vm_operations_struct nova_dax_vm_ops;
+
+
+/* dir.c */
+extern const struct file_operations nova_dir_operations;
+int nova_insert_dir_radix_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, const char *name,
+	int namelen, struct nova_dentry *direntry);
+int nova_remove_dir_radix_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, const char *name, int namelen,
+	int replay, struct nova_dentry **create_dentry);
+int nova_append_dentry(struct super_block *sb, struct nova_inode *pi,
+	struct inode *dir, struct dentry *dentry, u64 ino,
+	unsigned short de_len, struct nova_inode_update *update,
+	int link_change, u64 epoch_id);
+int nova_append_dir_init_entries(struct super_block *sb,
+	struct nova_inode *pi, u64 self_ino, u64 parent_ino, u64 epoch_id);
+int nova_add_dentry(struct dentry *dentry, u64 ino, int inc_link,
+	struct nova_inode_update *update, u64 epoch_id);
+int nova_remove_dentry(struct dentry *dentry, int dec_link,
+	struct nova_inode_update *update, u64 epoch_id);
+int nova_invalidate_dentries(struct super_block *sb,
+	struct nova_inode_update *update);
+void nova_print_dir_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long ino);
+void nova_delete_dir_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih);
+struct nova_dentry *nova_find_dentry(struct super_block *sb,
+	struct nova_inode *pi, struct inode *inode, const char *name,
+	unsigned long name_len);
+
+/* file.c */
+extern const struct inode_operations nova_file_inode_operations;
+extern const struct file_operations nova_dax_file_operations;
+extern const struct file_operations nova_wrap_file_operations;
+
+
+/* gc.c */
+int nova_inode_log_fast_gc(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	u64 curr_tail, u64 new_block, u64 alter_new_block, int num_pages,
+	int force_thorough);
+
+/* ioctl.c */
+extern long nova_ioctl(struct file *filp, unsigned int cmd, unsigned long arg);
+#ifdef CONFIG_COMPAT
+extern long nova_compat_ioctl(struct file *file, unsigned int cmd,
+	unsigned long arg);
+#endif
+
+
+
+/* mprotect.c */
+extern int nova_dax_mem_protect(struct super_block *sb,
+				 void *vaddr, unsigned long size, int rw);
+int nova_get_vma_overlap_range(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct vm_area_struct *vma,
+	unsigned long entry_pgoff, unsigned long entry_pages,
+	unsigned long *start_pgoff, unsigned long *num_pages);
+int nova_mmap_to_new_blocks(struct vm_area_struct *vma,
+	unsigned long address);
+bool nova_find_pgoff_in_vma(struct inode *inode, unsigned long pgoff);
+int nova_set_vmas_readonly(struct super_block *sb);
+
+/* namei.c */
+extern const struct inode_operations nova_dir_inode_operations;
+extern const struct inode_operations nova_special_inode_operations;
+extern struct dentry *nova_get_parent(struct dentry *child);
+
+/* parity.c */
+int nova_update_pgoff_parity(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	unsigned long pgoff, int zero);
+int nova_update_block_csum_parity(struct super_block *sb,
+	struct nova_inode_info_header *sih, u8 *block, unsigned long blocknr,
+	size_t offset, size_t bytes);
+int nova_restore_data(struct super_block *sb, unsigned long blocknr,
+	unsigned int badstrip_id, void *badstrip, int nvmmerr, u32 csum0,
+	u32 csum1, u32 *csum_good);
+int nova_update_truncated_block_parity(struct super_block *sb,
+	struct inode *inode, loff_t newsize);
+
+/* rebuild.c */
+int nova_reset_csum_parity_range(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	unsigned long start_pgoff, unsigned long end_pgoff, int zero,
+	int check_entry);
+int nova_reset_mapping_csum_parity(struct super_block *sb,
+	struct inode *inode, struct address_space *mapping,
+	unsigned long start_pgoff, unsigned long end_pgoff);
+int nova_reset_vma_csum_parity(struct super_block *sb,
+	struct vma_item *item);
+int nova_rebuild_dir_inode_tree(struct super_block *sb,
+	struct nova_inode *pi, u64 pi_addr,
+	struct nova_inode_info_header *sih);
+int nova_rebuild_inode(struct super_block *sb, struct nova_inode_info *si,
+	u64 ino, u64 pi_addr, int rebuild_dir);
+int nova_restore_snapshot_table(struct super_block *sb, int just_init);
+
+/* snapshot.c */
+int nova_encounter_mount_snapshot(struct super_block *sb, void *addr,
+	u8 type);
+int nova_save_snapshots(struct super_block *sb);
+int nova_destroy_snapshot_infos(struct super_block *sb);
+int nova_restore_snapshot_entry(struct super_block *sb,
+	struct nova_snapshot_info_entry *entry, u64 curr_p, int just_init);
+int nova_mount_snapshot(struct super_block *sb);
+int nova_append_data_to_snapshot(struct super_block *sb,
+	struct nova_file_write_entry *entry, u64 nvmm, u64 num_pages,
+	u64 delete_epoch_id);
+int nova_append_inode_to_snapshot(struct super_block *sb,
+	struct nova_inode *pi);
+int nova_print_snapshots(struct super_block *sb, struct seq_file *seq);
+int nova_print_snapshot_lists(struct super_block *sb, struct seq_file *seq);
+int nova_delete_dead_inode(struct super_block *sb, u64 ino);
+int nova_create_snapshot(struct super_block *sb);
+int nova_delete_snapshot(struct super_block *sb, u64 epoch_id);
+int nova_snapshot_init(struct super_block *sb);
+
+
+/* symlink.c */
+int nova_block_symlink(struct super_block *sb, struct nova_inode *pi,
+	struct inode *inode, const char *symname, int len, u64 epoch_id);
+extern const struct inode_operations nova_symlink_inode_operations;
+
+/* sysfs.c */
+extern const char *proc_dirname;
+extern struct proc_dir_entry *nova_proc_root;
+void nova_sysfs_init(struct super_block *sb);
+void nova_sysfs_exit(struct super_block *sb);
+
+/* nova_stats.c */
+void nova_get_timing_stats(void);
+void nova_get_IO_stats(void);
+void nova_print_timing_stats(struct super_block *sb);
+void nova_clear_stats(struct super_block *sb);
+void nova_print_inode(struct nova_inode *pi);
+void nova_print_inode_log(struct super_block *sb, struct inode *inode);
+void nova_print_inode_log_pages(struct super_block *sb, struct inode *inode);
+int nova_check_inode_logs(struct super_block *sb, struct nova_inode *pi);
+void nova_print_free_lists(struct super_block *sb);
+
+/* perf.c */
+int nova_test_perf(struct super_block *sb, unsigned int func_id,
+	unsigned int poolmb, size_t size, unsigned int disks);
+
+#endif /* __NOVA_H */
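
Note on get_nvmm() in the header above: a write entry covers the file pages
[pgoff, pgoff + num_pages) and stores the byte offset of its first data page,
so the block backing a given file page is a shift plus an offset. The
following is a minimal user-space sketch of that arithmetic, not part of the
patch. The struct, the sketch_* names, and the fixed 4KB page shift are
assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4KB pages */

/* Hypothetical, simplified stand-in for struct nova_file_write_entry. */
struct sketch_write_entry {
	uint64_t block;     /* byte offset of the first data page */
	uint64_t pgoff;     /* first file page covered by the entry */
	uint32_t num_pages; /* number of contiguous pages covered */
};

/* Returns 1 if pgoff lies in [e->pgoff, e->pgoff + e->num_pages). */
static int sketch_entry_covers(const struct sketch_write_entry *e,
			       uint64_t pgoff)
{
	return pgoff >= e->pgoff && pgoff < e->pgoff + e->num_pages;
}

/* Block number backing file page pgoff; the entry must cover pgoff. */
static uint64_t sketch_get_nvmm(const struct sketch_write_entry *e,
				uint64_t pgoff)
{
	assert(sketch_entry_covers(e, pgoff));
	return (e->block >> SKETCH_PAGE_SHIFT) + pgoff - e->pgoff;
}
```

For an entry with block = 0x10000, pgoff = 8 and num_pages = 4, file page 10
maps to block number 16 + (10 - 8) = 18, and file page 12 is outside the entry.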
diff --git a/fs/nova/nova_def.h b/fs/nova/nova_def.h
new file mode 100644
index 000000000000..61ade439e138
--- /dev/null
+++ b/fs/nova/nova_def.h
@@ -0,0 +1,154 @@
+/*
+ * FILE NAME fs/nova/nova_def.h
+ *
+ * BRIEF DESCRIPTION
+ *
+ * Definitions for the NOVA filesystem.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+#ifndef _LINUX_NOVA_DEF_H
+#define _LINUX_NOVA_DEF_H
+
+#include <linux/types.h>
+#include <linux/magic.h>
+
+#define	NOVA_SUPER_MAGIC	0x4E4F5641	/* NOVA */
+
+/*
+ * The NOVA filesystem constants/structures
+ */
+
+/*
+ * Mount flags
+ */
+#define NOVA_MOUNT_PROTECT      0x000001    /* wprotect CR0.WP */
+#define NOVA_MOUNT_XATTR_USER   0x000002    /* Extended user attributes */
+#define NOVA_MOUNT_POSIX_ACL    0x000004    /* POSIX Access Control Lists */
+#define NOVA_MOUNT_DAX          0x000008    /* Direct Access */
+#define NOVA_MOUNT_ERRORS_CONT  0x000010    /* Continue on errors */
+#define NOVA_MOUNT_ERRORS_RO    0x000020    /* Remount fs ro on errors */
+#define NOVA_MOUNT_ERRORS_PANIC 0x000040    /* Panic on errors */
+#define NOVA_MOUNT_HUGEMMAP     0x000080    /* Huge mappings with mmap */
+#define NOVA_MOUNT_HUGEIOREMAP  0x000100    /* Huge mappings with ioremap */
+#define NOVA_MOUNT_FORMAT       0x000200    /* was FS formatted on mount? */
+
+/*
+ * Maximal count of links to a file
+ */
+#define NOVA_LINK_MAX          32000
+
+#define NOVA_DEF_BLOCK_SIZE_4K 4096
+
+#define NOVA_INODE_BITS   7
+#define NOVA_INODE_SIZE   128    /* must be power of two */
+
+#define NOVA_NAME_LEN 255
+
+#define MAX_CPUS 64
+
+/* NOVA supported data blocks */
+#define NOVA_BLOCK_TYPE_4K     0
+#define NOVA_BLOCK_TYPE_2M     1
+#define NOVA_BLOCK_TYPE_1G     2
+#define NOVA_BLOCK_TYPE_MAX    3
+
+#define META_BLK_SHIFT 9
+
+/*
+ * Play with this knob to change the default block type.
+ * By changing the NOVA_DEFAULT_BLOCK_TYPE to 2M or 1G,
+ * we should get pretty good coverage in testing.
+ */
+#define NOVA_DEFAULT_BLOCK_TYPE NOVA_BLOCK_TYPE_4K
+
+
+/* ======================= Write ordering ========================= */
+
+#define CACHELINE_SIZE  (64)
+#define CACHELINE_MASK  (~(CACHELINE_SIZE - 1))
+#define CACHELINE_ALIGN(addr) (((addr)+CACHELINE_SIZE-1) & CACHELINE_MASK)
+
+
+static inline bool arch_has_clwb(void)
+{
+	return static_cpu_has(X86_FEATURE_CLWB);
+}
+
+extern int support_clwb;
+
+#define _mm_clflush(addr)\
+	asm volatile("clflush %0" : "+m" (*(volatile char *)(addr)))
+#define _mm_clflushopt(addr)\
+	asm volatile(".byte 0x66; clflush %0" : "+m" \
+		     (*(volatile char *)(addr)))
+#define _mm_clwb(addr)\
+	asm volatile(".byte 0x66; xsaveopt %0" : "+m" \
+		     (*(volatile char *)(addr)))
+
+/* Provides ordering from all previous clflush too */
+static inline void PERSISTENT_MARK(void)
+{
+	/* TODO: Fix me. */
+}
+
+static inline void PERSISTENT_BARRIER(void)
+{
+	asm volatile ("sfence\n" : : );
+}
+
+static inline void nova_flush_buffer(void *buf, uint32_t len, bool fence)
+{
+	uint32_t i;
+
+	len = len + ((unsigned long)(buf) & (CACHELINE_SIZE - 1));
+	if (support_clwb) {
+		for (i = 0; i < len; i += CACHELINE_SIZE)
+			_mm_clwb(buf + i);
+	} else {
+		for (i = 0; i < len; i += CACHELINE_SIZE)
+			_mm_clflush(buf + i);
+	}
+	/* Do a fence only if asked. We often don't need to do a fence
+	 * immediately after clflush because even if we get context switched
+	 * between clflush and subsequent fence, the context switch operation
+	 * provides implicit fence.
+	 */
+	if (fence)
+		PERSISTENT_BARRIER();
+}
+
+/* =============== Integrity and Recovery Parameters =============== */
+#define	NOVA_META_CSUM_LEN	(4)
+#define	NOVA_DATA_CSUM_LEN	(4)
+
+/* This is to set the initial value of checksum state register.
+ * For CRC32C this should not matter and can be set to any value.
+ */
+#define	NOVA_INIT_CSUM		(1)
+
+#define	ADDR_ALIGN(p, bytes)	((void *) (((unsigned long) p) & ~(bytes - 1)))
+
+/* Data stripe size in bytes and shift.
+ * In NOVA this size determines the size of a checksummed stripe, and it
+ * equals the amount of data per block (page) that NOVA can afford to lose.
+ * Its value should be no less than the poison radius size of media errors.
+ *
+ * Support NOVA_STRIPE_SHIFT <= PAGE_SHIFT (NOVA file block size shift).
+ */
+#define POISON_RADIUS		(512)
+#define POISON_MASK		(~(POISON_RADIUS - 1))
+#define NOVA_STRIPE_SHIFT	(9) /* size must be >= POISON_RADIUS */
+#define NOVA_STRIPE_SIZE	(1 << NOVA_STRIPE_SHIFT)
+
+#endif /* _LINUX_NOVA_DEF_H */
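
Note on nova_flush_buffer() in the header above: the loop starts at the
(possibly unaligned) buffer address and strides by one cache line, after
growing len by the buffer's offset within its first line. That is enough to
touch every cache line the buffer overlaps. The following is a minimal
user-space sketch that only counts the lines the loop would visit; it issues
no flush instructions and is not part of the patch. The sketch_* names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_CACHELINE_SIZE 64u

/*
 * Count the cache lines visited by a nova_flush_buffer()-style loop for a
 * buffer starting at address buf and spanning len bytes.
 */
static uint32_t sketch_lines_flushed(uintptr_t buf, uint32_t len)
{
	uint32_t i, lines = 0;

	/* The masking below relies on a power-of-two line size. */
	assert((SKETCH_CACHELINE_SIZE & (SKETCH_CACHELINE_SIZE - 1)) == 0);

	/* Extend len by the offset of buf inside its first cache line. */
	len += (uint32_t)(buf & (SKETCH_CACHELINE_SIZE - 1));
	for (i = 0; i < len; i += SKETCH_CACHELINE_SIZE)
		lines++;

	return lines;
}
```

A 2-byte buffer that starts at the last byte of a line straddles two lines,
and the sketch counts two; an aligned 64-byte buffer counts one.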
diff --git a/fs/nova/super.c b/fs/nova/super.c
new file mode 100644
index 000000000000..6be94edf116c
--- /dev/null
+++ b/fs/nova/super.c
@@ -0,0 +1,1222 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Super block operations.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/parser.h>
+#include <linux/vfs.h>
+#include <linux/uaccess.h>
+#include <linux/io.h>
+#include <linux/seq_file.h>
+#include <linux/mount.h>
+#include <linux/mm.h>
+#include <linux/ctype.h>
+#include <linux/bitops.h>
+#include <linux/magic.h>
+#include <linux/exportfs.h>
+#include <linux/random.h>
+#include <linux/cred.h>
+#include <linux/list.h>
+#include <linux/dax.h>
+#include "nova.h"
+#include "journal.h"
+#include "super.h"
+#include "inode.h"
+
+int measure_timing;
+int metadata_csum;
+int wprotect;
+int data_csum;
+int data_parity;
+int dram_struct_csum;
+int support_clwb;
+int inplace_data_updates;
+
+module_param(measure_timing, int, 0444);
+MODULE_PARM_DESC(measure_timing, "Timing measurement");
+
+module_param(metadata_csum, int, 0444);
+MODULE_PARM_DESC(metadata_csum, "Protect metadata structures with replication and checksums");
+
+module_param(wprotect, int, 0444);
+MODULE_PARM_DESC(wprotect, "Write-protect pmem region and use CR0.WP to allow updates");
+
+module_param(data_csum, int, 0444);
+MODULE_PARM_DESC(data_csum, "Detect corruption of data pages using checksum");
+
+module_param(data_parity, int, 0444);
+MODULE_PARM_DESC(data_parity, "Protect file data using RAID-5 style parity.");
+
+module_param(inplace_data_updates, int, 0444);
+MODULE_PARM_DESC(inplace_data_updates, "Perform data updates in-place (i.e., not atomically)");
+
+module_param(dram_struct_csum, int, 0444);
+MODULE_PARM_DESC(dram_struct_csum, "Protect key DRAM data structures with checksums");
+
+module_param(nova_dbgmask, int, 0444);
+MODULE_PARM_DESC(nova_dbgmask, "Control debugging output");
+
+static struct super_operations nova_sops;
+static const struct export_operations nova_export_ops;
+static struct kmem_cache *nova_inode_cachep;
+static struct kmem_cache *nova_range_node_cachep;
+static struct kmem_cache *nova_snapshot_info_cachep;
+
+/* FIXME: should the following variable be one per NOVA instance? */
+unsigned int nova_dbgmask;
+
+void nova_error_mng(struct super_block *sb, const char *fmt, ...)
+{
+	va_list args;
+
+	printk(KERN_CRIT "nova error: ");
+	va_start(args, fmt);
+	vprintk(fmt, args);
+	va_end(args);
+
+	if (test_opt(sb, ERRORS_PANIC))
+		panic("nova: panic from previous error\n");
+	if (test_opt(sb, ERRORS_RO)) {
+		printk(KERN_CRIT "nova err: remounting filesystem read-only\n");
+		sb->s_flags |= MS_RDONLY;
+	}
+}
+
+static void nova_set_blocksize(struct super_block *sb, unsigned long size)
+{
+	int bits;
+
+	/*
+	 * We've already validated the user input and the value here must be
+	 * between NOVA_MIN_BLOCK_SIZE and NOVA_MAX_BLOCK_SIZE
+	 * and it must be a power of 2.
+	 */
+	bits = fls(size) - 1;
+	sb->s_blocksize_bits = bits;
+	sb->s_blocksize = (1 << bits);
+}
+
+static int nova_get_nvmm_info(struct super_block *sb,
+	struct nova_sb_info *sbi)
+{
+	void *virt_addr = NULL;
+	pfn_t __pfn_t;
+	long size;
+	struct dax_device *dax_dev;
+	int ret;
+
+	ret = bdev_dax_supported(sb, PAGE_SIZE);
+	nova_dbg_verbose("%s: dax_supported = %d; bdev->super=0x%p\n",
+			 __func__, ret, sb->s_bdev->bd_super);
+	if (ret) {
+		nova_err(sb, "device does not support DAX\n");
+		return ret;
+	}
+
+	sbi->s_bdev = sb->s_bdev;
+
+	dax_dev = fs_dax_get_by_host(sb->s_bdev->bd_disk->disk_name);
+	if (!dax_dev) {
+		nova_err(sb, "Couldn't retrieve DAX device.\n");
+		return -EINVAL;
+	}
+	sbi->s_dax_dev = dax_dev;
+
+	size = dax_direct_access(sbi->s_dax_dev, 0, LONG_MAX/PAGE_SIZE,
+				 &virt_addr, &__pfn_t) * PAGE_SIZE;
+	if (size <= 0) {
+		nova_err(sb, "direct_access failed\n");
+		return -EINVAL;
+	}
+
+	sbi->virt_addr = virt_addr;
+
+	if (!sbi->virt_addr) {
+		nova_err(sb, "direct_access returned a NULL mapping\n");
+		return -EINVAL;
+	}
+
+	sbi->phys_addr = pfn_t_to_pfn(__pfn_t) << PAGE_SHIFT;
+	sbi->initsize = size;
+	sbi->replica_reserved_inodes_addr = virt_addr + size -
+			(sbi->tail_reserved_blocks << PAGE_SHIFT);
+	sbi->replica_sb_addr = virt_addr + size - PAGE_SIZE;
+
+	nova_dbg("%s: dev %s, phys_addr 0x%llx, virt_addr %p, size %ld\n",
+		__func__, sbi->s_bdev->bd_disk->disk_name,
+		sbi->phys_addr, sbi->virt_addr, sbi->initsize);
+
+	return 0;
+}
+
+static loff_t nova_max_size(int bits)
+{
+	loff_t res;
+
+	res = (1ULL << 63) - 1;
+
+	if (res > MAX_LFS_FILESIZE)
+		res = MAX_LFS_FILESIZE;
+
+	nova_dbg_verbose("max file size %llu bytes\n", res);
+	return res;
+}
+
+enum {
+	Opt_bpi, Opt_init, Opt_snapshot, Opt_mode, Opt_uid,
+	Opt_gid, Opt_blocksize, Opt_wprotect,
+	Opt_err_cont, Opt_err_panic, Opt_err_ro,
+	Opt_dbgmask, Opt_err
+};
+
+static const match_table_t tokens = {
+	{ Opt_bpi,	     "bpi=%u"		  },
+	{ Opt_init,	     "init"		  },
+	{ Opt_snapshot,	     "snapshot=%u"	  },
+	{ Opt_mode,	     "mode=%o"		  },
+	{ Opt_uid,	     "uid=%u"		  },
+	{ Opt_gid,	     "gid=%u"		  },
+	{ Opt_wprotect,	     "wprotect"		  },
+	{ Opt_err_cont,	     "errors=continue"	  },
+	{ Opt_err_panic,     "errors=panic"	  },
+	{ Opt_err_ro,	     "errors=remount-ro"  },
+	{ Opt_dbgmask,	     "dbgmask=%u"	  },
+	{ Opt_err,	     NULL		  },
+};
+
+static int nova_parse_options(char *options, struct nova_sb_info *sbi,
+			       bool remount)
+{
+	char *p;
+	substring_t args[MAX_OPT_ARGS];
+	int option;
+	kuid_t uid;
+
+	if (!options)
+		return 0;
+
+	while ((p = strsep(&options, ",")) != NULL) {
+		int token;
+
+		if (!*p)
+			continue;
+
+		token = match_token(p, tokens, args);
+		switch (token) {
+		case Opt_bpi:
+			if (match_int(&args[0], &option))
+				goto bad_val;
+			if (remount && sbi->bpi)
+				goto bad_opt;
+			sbi->bpi = option;
+			break;
+		case Opt_uid:
+			if (match_int(&args[0], &option))
+				goto bad_val;
+			uid = make_kuid(current_user_ns(), option);
+			if (remount && !uid_eq(sbi->uid, uid))
+				goto bad_opt;
+			sbi->uid = uid;
+			break;
+		case Opt_gid:
+			if (match_int(&args[0], &option))
+				goto bad_val;
+			sbi->gid = make_kgid(current_user_ns(), option);
+			break;
+		case Opt_mode:
+			if (match_octal(&args[0], &option))
+				goto bad_val;
+			sbi->mode = option & 01777U;
+			break;
+		case Opt_init:
+			if (remount)
+				goto bad_opt;
+			set_opt(sbi->s_mount_opt, FORMAT);
+			break;
+		case Opt_snapshot:
+			if (match_int(&args[0], &option))
+				goto bad_val;
+			sbi->mount_snapshot = 1;
+			sbi->mount_snapshot_epoch_id = option;
+			break;
+		case Opt_err_panic:
+			clear_opt(sbi->s_mount_opt, ERRORS_CONT);
+			clear_opt(sbi->s_mount_opt, ERRORS_RO);
+			set_opt(sbi->s_mount_opt, ERRORS_PANIC);
+			break;
+		case Opt_err_ro:
+			clear_opt(sbi->s_mount_opt, ERRORS_CONT);
+			clear_opt(sbi->s_mount_opt, ERRORS_PANIC);
+			set_opt(sbi->s_mount_opt, ERRORS_RO);
+			break;
+		case Opt_err_cont:
+			clear_opt(sbi->s_mount_opt, ERRORS_RO);
+			clear_opt(sbi->s_mount_opt, ERRORS_PANIC);
+			set_opt(sbi->s_mount_opt, ERRORS_CONT);
+			break;
+		case Opt_wprotect:
+			if (remount)
+				goto bad_opt;
+			set_opt(sbi->s_mount_opt, PROTECT);
+			nova_info("NOVA: Enabling new Write Protection (CR0.WP)\n");
+			break;
+		case Opt_dbgmask:
+			if (match_int(&args[0], &option))
+				goto bad_val;
+			nova_dbgmask = option;
+			break;
+		default: {
+			goto bad_opt;
+		}
+		}
+	}
+
+	return 0;
+
+bad_val:
+	nova_info("Bad value '%s' for mount option '%s'\n", args[0].from,
+	       p);
+	return -EINVAL;
+bad_opt:
+	nova_info("Bad mount option: \"%s\"\n", p);
+	return -EINVAL;
+}
+
+
+/* Make sure we have enough space */
+static bool nova_check_size(struct super_block *sb, unsigned long size)
+{
+	unsigned long minimum_size;
+
+	/* Space required for the reserved blocks and the root directory. */
+	minimum_size = (HEAD_RESERVED_BLOCKS + TAIL_RESERVED_BLOCKS + 1)
+			  << sb->s_blocksize_bits;
+
+	if (size < minimum_size)
+		return false;
+
+	return true;
+}
+
+static inline int nova_check_super_checksum(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u32 crc = 0;
+
+	/* Check CRC but skip s_sum, which is the 4 bytes at the beginning */
+	crc = nova_crc32c(~0, (__u8 *)sbi->nova_sb + sizeof(__le32),
+			sizeof(struct nova_super_block) - sizeof(__le32));
+
+	if (sbi->nova_sb->s_sum == cpu_to_le32(crc))
+		return 0;
+	else
+		return 1;
+}
+
+inline void nova_sync_super(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_super_block *super = nova_get_super(sb);
+	struct nova_super_block *super_redund;
+
+	nova_memunlock_super(sb);
+
+	super_redund = nova_get_redund_super(sb);
+
+	memcpy_to_pmem_nocache((void *)super, (void *)sbi->nova_sb,
+		sizeof(struct nova_super_block));
+	PERSISTENT_BARRIER();
+
+	memcpy_to_pmem_nocache((void *)super_redund, (void *)sbi->nova_sb,
+		sizeof(struct nova_super_block));
+	PERSISTENT_BARRIER();
+
+	nova_memlock_super(sb);
+}
+
+/* Update checksum for the DRAM copy */
+inline void nova_update_super_crc(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u32 crc = 0;
+
+	sbi->nova_sb->s_wtime = cpu_to_le32(get_seconds());
+	sbi->nova_sb->s_sum = 0;
+	crc = nova_crc32c(~0, (__u8 *)sbi->nova_sb + sizeof(__le32),
+			sizeof(struct nova_super_block) - sizeof(__le32));
+	sbi->nova_sb->s_sum = cpu_to_le32(crc);
+}
+
+
+static inline void nova_update_mount_time(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u32 mnt_time;
+
+	/* s_mtime is a __le32 field, so only the low 32 bits can be stored. */
+	mnt_time = get_seconds() & 0xFFFFFFFF;
+
+	sbi->nova_sb->s_mtime = cpu_to_le32(mnt_time);
+	nova_update_super_crc(sb);
+
+	nova_sync_super(sb);
+}
+
+static struct nova_inode *nova_init(struct super_block *sb,
+				      unsigned long size)
+{
+	unsigned long blocksize;
+	struct nova_inode *root_i, *pi;
+	struct nova_super_block *super;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_update update;
+	u64 epoch_id;
+	timing_t init_time;
+
+	NOVA_START_TIMING(new_init_t, init_time);
+	nova_info("creating an empty nova of size %lu\n", size);
+	sbi->num_blocks = ((unsigned long)(size) >> PAGE_SHIFT);
+
+	nova_dbgv("nova: Default block size set to 4K\n");
+	sbi->blocksize = blocksize = NOVA_DEF_BLOCK_SIZE_4K;
+	nova_set_blocksize(sb, sbi->blocksize);
+
+	if (!nova_check_size(sb, size)) {
+		nova_warn("Specified NOVA size too small 0x%lx.\n", size);
+		return ERR_PTR(-EINVAL);
+	}
+
+	nova_dbgv("max file name len %d\n", (unsigned int)NOVA_NAME_LEN);
+
+	super = nova_get_super(sb);
+
+	nova_memunlock_reserved(sb, super);
+	/* clear out super-block and inode table */
+	memset_nt(super, 0, sbi->head_reserved_blocks * sbi->blocksize);
+
+	pi = nova_get_inode_by_ino(sb, NOVA_BLOCKNODE_INO);
+	pi->nova_ino = cpu_to_le64(NOVA_BLOCKNODE_INO);
+	nova_flush_buffer(pi, CACHELINE_SIZE, 1);
+
+	pi = nova_get_inode_by_ino(sb, NOVA_SNAPSHOT_INO);
+	pi->nova_ino = cpu_to_le64(NOVA_SNAPSHOT_INO);
+	nova_flush_buffer(pi, CACHELINE_SIZE, 1);
+
+	memset(&update, 0, sizeof(struct nova_inode_update));
+	nova_update_inode(sb, &sbi->snapshot_si->vfs_inode, pi, &update, 1);
+
+	nova_memlock_reserved(sb, super);
+
+	nova_init_blockmap(sb, 0);
+
+	if (nova_lite_journal_hard_init(sb) < 0) {
+		nova_err(sb, "Lite journal hard initialization failed\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	if (nova_init_inode_inuse_list(sb) < 0)
+		return ERR_PTR(-EINVAL);
+
+	if (nova_init_inode_table(sb) < 0)
+		return ERR_PTR(-EINVAL);
+
+
+	sbi->nova_sb->s_size = cpu_to_le64(size);
+	sbi->nova_sb->s_blocksize = cpu_to_le32(blocksize);
+	sbi->nova_sb->s_magic = cpu_to_le32(NOVA_SUPER_MAGIC);
+	sbi->nova_sb->s_epoch_id = 0;
+	sbi->nova_sb->s_metadata_csum = metadata_csum;
+	sbi->nova_sb->s_data_csum = data_csum;
+	sbi->nova_sb->s_data_parity = data_parity;
+	nova_update_super_crc(sb);
+
+	nova_sync_super(sb);
+
+	root_i = nova_get_inode_by_ino(sb, NOVA_ROOT_INO);
+	nova_dbgv("%s: Allocate root inode @ 0x%p\n", __func__, root_i);
+
+	nova_memunlock_inode(sb, root_i);
+	root_i->i_mode = cpu_to_le16(sbi->mode | S_IFDIR);
+	root_i->i_uid = cpu_to_le32(from_kuid(&init_user_ns, sbi->uid));
+	root_i->i_gid = cpu_to_le32(from_kgid(&init_user_ns, sbi->gid));
+	root_i->i_links_count = cpu_to_le16(2);
+	root_i->i_blk_type = NOVA_BLOCK_TYPE_4K;
+	root_i->i_flags = 0;
+	root_i->i_size = cpu_to_le64(sb->s_blocksize);
+	root_i->i_atime = root_i->i_mtime = root_i->i_ctime =
+		cpu_to_le32(get_seconds());
+	root_i->nova_ino = cpu_to_le64(NOVA_ROOT_INO);
+	root_i->valid = 1;
+	/* nova_sync_inode(root_i); */
+	nova_flush_buffer(root_i, sizeof(*root_i), false);
+	nova_memlock_inode(sb, root_i);
+
+	epoch_id = nova_get_epoch_id(sb);
+	nova_append_dir_init_entries(sb, root_i, NOVA_ROOT_INO,
+					NOVA_ROOT_INO, epoch_id);
+
+	PERSISTENT_MARK();
+	PERSISTENT_BARRIER();
+	NOVA_END_TIMING(new_init_t, init_time);
+	nova_info("NOVA initialization finished\n");
+	return root_i;
+}
+
+static inline void set_default_opts(struct nova_sb_info *sbi)
+{
+	set_opt(sbi->s_mount_opt, HUGEIOREMAP);
+	set_opt(sbi->s_mount_opt, ERRORS_CONT);
+	sbi->head_reserved_blocks = HEAD_RESERVED_BLOCKS;
+	sbi->tail_reserved_blocks = TAIL_RESERVED_BLOCKS;
+	sbi->cpus = num_online_cpus();
+	sbi->map_id = 0;
+}
+
+static void nova_root_check(struct super_block *sb, struct nova_inode *root_pi)
+{
+	if (!S_ISDIR(le16_to_cpu(root_pi->i_mode)))
+		nova_warn("root is not a directory!\n");
+}
+
+/* Check super block magic and checksum */
+static int nova_check_super(struct super_block *sb,
+	struct nova_super_block *ps)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	int rc;
+
+	rc = memcpy_mcsafe(sbi->nova_sb, ps,
+				sizeof(struct nova_super_block));
+
+	if (rc < 0)
+		return rc;
+
+	if (le32_to_cpu(sbi->nova_sb->s_magic) != NOVA_SUPER_MAGIC)
+		return -EIO;
+
+	if (nova_check_super_checksum(sb))
+		return -EIO;
+
+	return 0;
+}
+
+/* Check whether protection was disabled previously and is enabled now */
+/* FIXME */
+static int nova_check_module_params(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (sbi->nova_sb->s_metadata_csum != metadata_csum) {
+		nova_dbg("%s metadata checksum\n",
+			sbi->nova_sb->s_metadata_csum ? "Enable" : "Disable");
+		metadata_csum = sbi->nova_sb->s_metadata_csum;
+	}
+
+	if (sbi->nova_sb->s_data_csum != data_csum) {
+		nova_dbg("%s data checksum\n",
+			sbi->nova_sb->s_data_csum ? "Enable" : "Disable");
+		data_csum = sbi->nova_sb->s_data_csum;
+	}
+
+	if (sbi->nova_sb->s_data_parity != data_parity) {
+		nova_dbg("%s data parity\n",
+			sbi->nova_sb->s_data_parity ? "Enable" : "Disable");
+		data_parity = sbi->nova_sb->s_data_parity;
+	}
+
+	return 0;
+}
+
+static int nova_check_integrity(struct super_block *sb)
+{
+	struct nova_super_block *super = nova_get_super(sb);
+	struct nova_super_block *super_redund;
+	int rc;
+
+	super_redund = nova_get_redund_super(sb);
+
+	/* Do sanity checks on the superblock */
+	rc = nova_check_super(sb, super);
+	if (rc < 0) {
+		rc = nova_check_super(sb, super_redund);
+		if (rc < 0) {
+			nova_err(sb, "Can't find a valid nova partition\n");
+			return rc;
+		} else {
+			nova_warn("Error in super block: try to repair it with the other copy\n");
+		}
+	}
+
+	nova_sync_super(sb);
+
+	nova_check_module_params(sb);
+	return 0;
+}
+
+static int nova_fill_super(struct super_block *sb, void *data, int silent)
+{
+	struct nova_inode *root_pi;
+	struct nova_sb_info *sbi = NULL;
+	struct inode *root_i = NULL;
+	struct inode_map *inode_map;
+	unsigned long blocksize;
+	size_t strp_size = NOVA_STRIPE_SIZE;
+	u32 random = 0;
+	int retval = -EINVAL;
+	int i;
+	timing_t mount_time;
+
+	NOVA_START_TIMING(mount_t, mount_time);
+
+	BUILD_BUG_ON(sizeof(struct nova_super_block) > NOVA_SB_SIZE);
+	BUILD_BUG_ON(sizeof(struct nova_inode) > NOVA_INODE_SIZE);
+	BUILD_BUG_ON(sizeof(struct nova_inode_log_page) != PAGE_SIZE);
+
+	BUILD_BUG_ON(sizeof(struct journal_ptr_pair) > CACHELINE_SIZE);
+	BUILD_BUG_ON(PAGE_SIZE/sizeof(struct journal_ptr_pair) < MAX_CPUS);
+	BUILD_BUG_ON(PAGE_SIZE/sizeof(struct nova_lite_journal_entry) <
+		     NOVA_MAX_JOURNAL_LENGTH);
+
+	BUILD_BUG_ON(sizeof(struct nova_inode_page_tail) +
+		     LOG_BLOCK_TAIL != PAGE_SIZE);
+
+	sbi = kzalloc(sizeof(struct nova_sb_info), GFP_KERNEL);
+	if (!sbi)
+		return -ENOMEM;
+	sbi->nova_sb = kzalloc(sizeof(struct nova_super_block), GFP_KERNEL);
+	if (!sbi->nova_sb) {
+		kfree(sbi);
+		return -ENOMEM;
+	}
+
+	sb->s_fs_info = sbi;
+	sbi->sb = sb;
+
+	set_default_opts(sbi);
+
+	/* Currently the log page supports 64 journal pointer pairs */
+	if (sbi->cpus > MAX_CPUS) {
+		nova_err(sb, "NOVA needs more log pointer pages to support more than "
+			  __stringify(MAX_CPUS) " cpus.\n");
+		goto out;
+	}
+
+	retval = nova_get_nvmm_info(sb, sbi);
+	if (retval) {
+		nova_err(sb, "%s: Failed to get nvmm info.\n",
+			 __func__);
+		goto out;
+	}
+
+
+	nova_dbg("measure timing %d, metadata checksum %d, inplace update %d, wprotect %d, data checksum %d, data parity %d, DRAM checksum %d\n",
+		measure_timing, metadata_csum,
+		inplace_data_updates, wprotect, data_csum,
+		data_parity, dram_struct_csum);
+
+	get_random_bytes(&random, sizeof(u32));
+	atomic_set(&sbi->next_generation, random);
+
+	/* Init with default values */
+	sbi->mode = (0755);
+	sbi->uid = current_fsuid();
+	sbi->gid = current_fsgid();
+	set_opt(sbi->s_mount_opt, DAX);
+	set_opt(sbi->s_mount_opt, HUGEIOREMAP);
+
+	mutex_init(&sbi->vma_mutex);
+	INIT_LIST_HEAD(&sbi->mmap_sih_list);
+
+	sbi->inode_maps = kcalloc(sbi->cpus, sizeof(struct inode_map),
+					GFP_KERNEL);
+	if (!sbi->inode_maps) {
+		retval = -ENOMEM;
+		nova_dbg("%s: Allocating inode maps failed.\n",
+			 __func__);
+		goto out;
+	}
+
+	for (i = 0; i < sbi->cpus; i++) {
+		inode_map = &sbi->inode_maps[i];
+		mutex_init(&inode_map->inode_table_mutex);
+		inode_map->inode_inuse_tree = RB_ROOT;
+	}
+
+	mutex_init(&sbi->s_lock);
+
+	sbi->zeroed_page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!sbi->zeroed_page) {
+		retval = -ENOMEM;
+		nova_dbg("%s: sbi->zeroed_page failed.\n",
+			 __func__);
+		goto out;
+	}
+
+	for (i = 0; i < 8; i++)
+		sbi->zero_csum[i] = nova_crc32c(NOVA_INIT_CSUM,
+				sbi->zeroed_page, strp_size);
+	sbi->zero_parity = kzalloc(strp_size, GFP_KERNEL);
+
+	if (!sbi->zero_parity) {
+		retval = -ENOMEM;
+		nova_err(sb, "%s: sbi->zero_parity failed.\n",
+			 __func__);
+		goto out;
+	}
+
+	sbi->snapshot_si = kmem_cache_alloc(nova_inode_cachep, GFP_NOFS);
+	nova_snapshot_init(sb);
+
+	retval = nova_parse_options(data, sbi, 0);
+	if (retval) {
+		nova_err(sb, "%s: Failed to parse nova command line options.\n",
+			 __func__);
+		goto out;
+	}
+
+	if (nova_alloc_block_free_lists(sb)) {
+		retval = -ENOMEM;
+		nova_err(sb, "%s: Failed to allocate block free lists.\n",
+			 __func__);
+		goto out;
+	}
+
+	nova_sysfs_init(sb);
+
+	/* Init a new nova instance */
+	if (sbi->s_mount_opt & NOVA_MOUNT_FORMAT) {
+		root_pi = nova_init(sb, sbi->initsize);
+		if (IS_ERR(root_pi)) {
+			retval = PTR_ERR(root_pi);
+			nova_err(sb, "%s: root_pi error.\n",
+				 __func__);
+
+			goto out;
+		}
+		goto setup_sb;
+	}
+
+	nova_dbg_verbose("checking physical address 0x%016llx for nova image\n",
+		  (u64)sbi->phys_addr);
+	retval = nova_check_integrity(sb);
+	if (retval < 0) {
+		nova_dbg("Memory contains invalid nova %x:%x\n",
+			le32_to_cpu(sbi->nova_sb->s_magic), NOVA_SUPER_MAGIC);
+		goto out;
+	}
+
+	if (nova_lite_journal_soft_init(sb)) {
+		retval = -EINVAL;
+		nova_err(sb, "Lite journal initialization failed\n");
+		goto out;
+	}
+
+	if (sbi->mount_snapshot) {
+		retval = nova_mount_snapshot(sb);
+		if (retval) {
+			nova_err(sb, "Mount snapshot failed\n");
+			goto out;
+		}
+	}
+
+	blocksize = le32_to_cpu(sbi->nova_sb->s_blocksize);
+	nova_set_blocksize(sb, blocksize);
+
+	nova_dbg_verbose("blocksize %lu\n", blocksize);
+
+	/* Read the root inode */
+	root_pi = nova_get_inode_by_ino(sb, NOVA_ROOT_INO);
+
+	/* Check that the root inode is in a sane state */
+	nova_root_check(sb, root_pi);
+
+	/* Set it all up.. */
+setup_sb:
+	sb->s_magic = le32_to_cpu(sbi->nova_sb->s_magic);
+	sb->s_op = &nova_sops;
+	sb->s_maxbytes = nova_max_size(sb->s_blocksize_bits);
+	sb->s_time_gran = 1000000000; // 1 second.
+	sb->s_export_op = &nova_export_ops;
+	sb->s_xattr = NULL;
+	sb->s_flags |= MS_NOSEC;
+
+	/* If the FS was not formatted on this mount, scan the meta-data after
+	 * truncate list has been processed
+	 */
+	if ((sbi->s_mount_opt & NOVA_MOUNT_FORMAT) == 0)
+		nova_recovery(sb);
+
+	root_i = nova_iget(sb, NOVA_ROOT_INO);
+	if (IS_ERR(root_i)) {
+		retval = PTR_ERR(root_i);
+		nova_err(sb, "%s: failed to get root inode\n",
+			 __func__);
+
+		goto out;
+	}
+
+	sb->s_root = d_make_root(root_i);
+	if (!sb->s_root) {
+		nova_err(sb, "get nova root inode failed\n");
+		retval = -ENOMEM;
+		goto out;
+	}
+
+	if (!(sb->s_flags & MS_RDONLY))
+		nova_update_mount_time(sb);
+
+	nova_print_curr_epoch_id(sb);
+
+	retval = 0;
+	NOVA_END_TIMING(mount_t, mount_time);
+	return retval;
+out:
+	kfree(sbi->zeroed_page);
+	sbi->zeroed_page = NULL;
+
+	kfree(sbi->zero_parity);
+	sbi->zero_parity = NULL;
+
+	kfree(sbi->free_lists);
+	sbi->free_lists = NULL;
+
+	kfree(sbi->journal_locks);
+	sbi->journal_locks = NULL;
+
+	kfree(sbi->inode_maps);
+	sbi->inode_maps = NULL;
+
+	nova_sysfs_exit(sb);
+
+	kfree(sbi->nova_sb);
+	kfree(sbi);
+	return retval;
+}
+
+int nova_statfs(struct dentry *d, struct kstatfs *buf)
+{
+	struct super_block *sb = d->d_sb;
+	struct nova_sb_info *sbi = (struct nova_sb_info *)sb->s_fs_info;
+
+	buf->f_type = NOVA_SUPER_MAGIC;
+	buf->f_bsize = sb->s_blocksize;
+
+	buf->f_blocks = sbi->num_blocks;
+	buf->f_bfree = buf->f_bavail = nova_count_free_blocks(sb);
+	buf->f_files = LONG_MAX;
+	buf->f_ffree = LONG_MAX - sbi->s_inodes_used_count;
+	buf->f_namelen = NOVA_NAME_LEN;
+	nova_dbg_verbose("nova_stats: total 4k free blocks 0x%llx\n",
+		buf->f_bfree);
+	return 0;
+}
+
+static int nova_show_options(struct seq_file *seq, struct dentry *root)
+{
+	struct nova_sb_info *sbi = NOVA_SB(root->d_sb);
+
+	//seq_printf(seq, ",physaddr=0x%016llx", (u64)sbi->phys_addr);
+	//if (sbi->initsize)
+	//     seq_printf(seq, ",init=%luk", sbi->initsize >> 10);
+	//if (sbi->blocksize)
+	//	 seq_printf(seq, ",bs=%lu", sbi->blocksize);
+	//if (sbi->bpi)
+	//	seq_printf(seq, ",bpi=%lu", sbi->bpi);
+	if (sbi->mode != (0777 | S_ISVTX))
+		seq_printf(seq, ",mode=%03o", sbi->mode);
+	if (uid_valid(sbi->uid))
+		seq_printf(seq, ",uid=%u", from_kuid(&init_user_ns, sbi->uid));
+	if (gid_valid(sbi->gid))
+		seq_printf(seq, ",gid=%u", from_kgid(&init_user_ns, sbi->gid));
+	if (test_opt(root->d_sb, ERRORS_RO))
+		seq_puts(seq, ",errors=remount-ro");
+	if (test_opt(root->d_sb, ERRORS_PANIC))
+		seq_puts(seq, ",errors=panic");
+	/* memory protection disabled by default */
+	if (test_opt(root->d_sb, PROTECT))
+		seq_puts(seq, ",wprotect");
+	//if (test_opt(root->d_sb, DAX))
+	//	seq_puts(seq, ",dax");
+
+	return 0;
+}
+
+int nova_remount(struct super_block *sb, int *mntflags, char *data)
+{
+	unsigned long old_sb_flags;
+	unsigned long old_mount_opt;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	int ret = -EINVAL;
+
+	/* Store the old options */
+	mutex_lock(&sbi->s_lock);
+	old_sb_flags = sb->s_flags;
+	old_mount_opt = sbi->s_mount_opt;
+
+	if (nova_parse_options(data, sbi, 1))
+		goto restore_opt;
+
+	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
+		      ((sbi->s_mount_opt & NOVA_MOUNT_POSIX_ACL) ?
+		       MS_POSIXACL : 0);
+
+	if ((*mntflags & MS_RDONLY) != (sb->s_flags & MS_RDONLY))
+		nova_update_mount_time(sb);
+
+	mutex_unlock(&sbi->s_lock);
+	ret = 0;
+	return ret;
+
+restore_opt:
+	sb->s_flags = old_sb_flags;
+	sbi->s_mount_opt = old_mount_opt;
+	mutex_unlock(&sbi->s_lock);
+	return ret;
+}
+
+static void nova_put_super(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct inode_map *inode_map;
+	int i;
+
+	nova_print_curr_epoch_id(sb);
+
+	/* It's unmount time, so unmap the nova memory */
+//	nova_print_free_lists(sb);
+	if (sbi->virt_addr) {
+		nova_save_snapshots(sb);
+		kmem_cache_free(nova_inode_cachep, sbi->snapshot_si);
+		nova_save_inode_list_to_log(sb);
+		/* Save everything before blocknode mapping! */
+		nova_save_blocknode_mappings_to_log(sb);
+		sbi->virt_addr = NULL;
+	}
+
+	nova_delete_free_lists(sb);
+
+	kfree(sbi->zeroed_page);
+	kfree(sbi->zero_parity);
+	nova_dbgmask = 0;
+	kfree(sbi->free_lists);
+	kfree(sbi->journal_locks);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		inode_map = &sbi->inode_maps[i];
+		nova_dbgv("CPU %d: inode allocated %d, freed %d\n",
+			i, inode_map->allocated, inode_map->freed);
+	}
+
+	kfree(sbi->inode_maps);
+
+	nova_sysfs_exit(sb);
+
+	kfree(sbi->nova_sb);
+	kfree(sbi);
+	sb->s_fs_info = NULL;
+}
+
+inline void nova_free_range_node(struct nova_range_node *node)
+{
+	kmem_cache_free(nova_range_node_cachep, node);
+}
+
+
+inline void nova_free_inode_node(struct super_block *sb,
+	struct nova_range_node *node)
+{
+	nova_free_range_node(node);
+}
+
+inline void nova_free_vma_item(struct super_block *sb,
+	struct vma_item *item)
+{
+	nova_free_range_node((struct nova_range_node *)item);
+}
+
+inline struct snapshot_info *nova_alloc_snapshot_info(struct super_block *sb)
+{
+	struct snapshot_info *p;
+
+	p = (struct snapshot_info *)
+		kmem_cache_alloc(nova_snapshot_info_cachep, GFP_NOFS);
+	return p;
+}
+
+inline void nova_free_snapshot_info(struct snapshot_info *info)
+{
+	kmem_cache_free(nova_snapshot_info_cachep, info);
+}
+
+inline struct nova_range_node *nova_alloc_range_node(struct super_block *sb)
+{
+	struct nova_range_node *p;
+
+	p = (struct nova_range_node *)
+		kmem_cache_zalloc(nova_range_node_cachep, GFP_NOFS);
+	return p;
+}
+
+
+inline struct nova_range_node *nova_alloc_inode_node(struct super_block *sb)
+{
+	return nova_alloc_range_node(sb);
+}
+
+inline struct vma_item *nova_alloc_vma_item(struct super_block *sb)
+{
+	return (struct vma_item *)nova_alloc_range_node(sb);
+}
+
+
+static struct inode *nova_alloc_inode(struct super_block *sb)
+{
+	struct nova_inode_info *vi;
+
+	vi = kmem_cache_alloc(nova_inode_cachep, GFP_NOFS);
+	if (!vi)
+		return NULL;
+
+	vi->vfs_inode.i_version = 1;
+
+	return &vi->vfs_inode;
+}
+
+static void nova_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	struct nova_inode_info *vi = NOVA_I(inode);
+
+	nova_dbg_verbose("%s: ino %lu\n", __func__, inode->i_ino);
+	kmem_cache_free(nova_inode_cachep, vi);
+}
+
+static void nova_destroy_inode(struct inode *inode)
+{
+	nova_dbgv("%s: %lu\n", __func__, inode->i_ino);
+	call_rcu(&inode->i_rcu, nova_i_callback);
+}
+
+static void init_once(void *foo)
+{
+	struct nova_inode_info *vi = foo;
+
+	inode_init_once(&vi->vfs_inode);
+}
+
+
+static int __init init_rangenode_cache(void)
+{
+	nova_range_node_cachep = kmem_cache_create("nova_range_node_cache",
+					sizeof(struct nova_range_node),
+					0, (SLAB_RECLAIM_ACCOUNT |
+					SLAB_MEM_SPREAD), NULL);
+	if (nova_range_node_cachep == NULL)
+		return -ENOMEM;
+	return 0;
+}
+
+static int __init init_snapshot_info_cache(void)
+{
+	nova_snapshot_info_cachep = kmem_cache_create(
+					"nova_snapshot_info_cache",
+					sizeof(struct snapshot_info),
+					0, (SLAB_RECLAIM_ACCOUNT |
+					SLAB_MEM_SPREAD), NULL);
+	if (nova_snapshot_info_cachep == NULL)
+		return -ENOMEM;
+	return 0;
+}
+
+static int __init init_inodecache(void)
+{
+	nova_inode_cachep = kmem_cache_create("nova_inode_cache",
+					       sizeof(struct nova_inode_info),
+					       0, (SLAB_RECLAIM_ACCOUNT |
+						   SLAB_MEM_SPREAD), init_once);
+	if (nova_inode_cachep == NULL)
+		return -ENOMEM;
+	return 0;
+}
+
+static void destroy_inodecache(void)
+{
+	/*
+	 * Make sure all delayed rcu free inodes are flushed before
+	 * we destroy cache.
+	 */
+	rcu_barrier();
+	kmem_cache_destroy(nova_inode_cachep);
+}
+
+static void destroy_rangenode_cache(void)
+{
+	kmem_cache_destroy(nova_range_node_cachep);
+}
+
+static void destroy_snapshot_info_cache(void)
+{
+	kmem_cache_destroy(nova_snapshot_info_cachep);
+}
+
+/*
+ * the super block writes are all done "on the fly", so the
+ * super block is never in a "dirty" state, so there's no need
+ * for write_super.
+ */
+static struct super_operations nova_sops = {
+	.alloc_inode	= nova_alloc_inode,
+	.destroy_inode	= nova_destroy_inode,
+	.write_inode	= nova_write_inode,
+	.dirty_inode	= nova_dirty_inode,
+	.evict_inode	= nova_evict_inode,
+	.put_super	= nova_put_super,
+	.statfs		= nova_statfs,
+	.remount_fs	= nova_remount,
+	.show_options	= nova_show_options,
+};
+
+static struct dentry *nova_mount(struct file_system_type *fs_type,
+				  int flags, const char *dev_name, void *data)
+{
+	return mount_bdev(fs_type, flags, dev_name, data, nova_fill_super);
+}
+
+static struct file_system_type nova_fs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "NOVA",
+	.mount		= nova_mount,
+	.kill_sb	= kill_block_super,
+};
+
+static struct inode *nova_nfs_get_inode(struct super_block *sb,
+					 u64 ino, u32 generation)
+{
+	struct inode *inode;
+
+	if (ino < NOVA_ROOT_INO)
+		return ERR_PTR(-ESTALE);
+
+	if (ino > LONG_MAX)
+		return ERR_PTR(-ESTALE);
+
+	inode = nova_iget(sb, ino);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	if (generation && inode->i_generation != generation) {
+		/* we didn't find the right inode.. */
+		iput(inode);
+		return ERR_PTR(-ESTALE);
+	}
+
+	return inode;
+}
+
+static struct dentry *nova_fh_to_dentry(struct super_block *sb,
+					 struct fid *fid, int fh_len,
+					 int fh_type)
+{
+	return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+				    nova_nfs_get_inode);
+}
+
+static struct dentry *nova_fh_to_parent(struct super_block *sb,
+					 struct fid *fid, int fh_len,
+					 int fh_type)
+{
+	return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+				    nova_nfs_get_inode);
+}
+
+static const struct export_operations nova_export_ops = {
+	.fh_to_dentry	= nova_fh_to_dentry,
+	.fh_to_parent	= nova_fh_to_parent,
+	.get_parent	= nova_get_parent,
+};
+
+static int __init init_nova_fs(void)
+{
+	int rc = 0;
+	timing_t init_time;
+
+	NOVA_START_TIMING(init_t, init_time);
+	nova_dbg("%s: %d cpus online\n", __func__, num_online_cpus());
+	if (arch_has_clwb())
+		support_clwb = 1;
+
+	nova_info("Arch new instructions support: CLWB %s\n",
+			support_clwb ? "YES" : "NO");
+
+	nova_proc_root = proc_mkdir(proc_dirname, NULL);
+
+	nova_dbg("Data structure size: inode %lu, log_page %lu, file_write_entry %lu, dir_entry(max) %d, setattr_entry %lu, link_change_entry %lu\n",
+		sizeof(struct nova_inode),
+		sizeof(struct nova_inode_log_page),
+		sizeof(struct nova_file_write_entry),
+		NOVA_DIR_LOG_REC_LEN(NOVA_NAME_LEN),
+		sizeof(struct nova_setattr_logentry),
+		sizeof(struct nova_link_change_entry));
+
+	rc = init_rangenode_cache();
+	if (rc)
+		return rc;
+
+	rc = init_inodecache();
+	if (rc)
+		goto out1;
+
+	rc = init_snapshot_info_cache();
+	if (rc)
+		goto out2;
+
+	rc = register_filesystem(&nova_fs_type);
+	if (rc)
+		goto out3;
+
+	NOVA_END_TIMING(init_t, init_time);
+	return 0;
+
+out3:
+	destroy_snapshot_info_cache();
+out2:
+	destroy_inodecache();
+out1:
+	destroy_rangenode_cache();
+	return rc;
+}
+
+static void __exit exit_nova_fs(void)
+{
+	unregister_filesystem(&nova_fs_type);
+	remove_proc_entry(proc_dirname, NULL);
+	destroy_snapshot_info_cache();
+	destroy_inodecache();
+	destroy_rangenode_cache();
+}
+
+MODULE_AUTHOR("Andiry Xu <jix024@cs.ucsd.edu>");
+MODULE_DESCRIPTION("NOVA: A Persistent Memory File System");
+MODULE_LICENSE("GPL");
+
+module_init(init_nova_fs)
+module_exit(exit_nova_fs)
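
A note on the superblock checks in super.c above: nova_check_super_checksum()
computes a CRC over the DRAM copy of the superblock while skipping the leading
4-byte s_sum field, and nova_check_integrity() falls back to the replica when
the primary copy fails validation, then writes the surviving copy back to both
locations. The userspace sketch below illustrates that pattern only. Everything
prefixed toy_ or TOY_ is hypothetical: struct toy_sb is a simplified stand-in
for struct nova_super_block, TOY_MAGIC is not NOVA_SUPER_MAGIC, and toy_crc32c()
is a plain bitwise CRC32C that is merely assumed to play the role of
nova_crc32c(), whose implementation is not part of this file.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified superblock: the checksum field comes first. */
struct toy_sb {
	uint32_t sum;	/* checksum of every byte after this field */
	uint32_t magic;
	uint64_t size;
};

#define TOY_MAGIC 0x4E4F5641u	/* illustrative value only */

/* Bitwise CRC32C (Castagnoli, reflected polynomial), no final XOR. */
static uint32_t toy_crc32c(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1u));
	}
	return crc;
}

/* Checksum the structure, skipping the leading sum field itself. */
static uint32_t toy_sb_sum(const struct toy_sb *sb)
{
	return toy_crc32c(~0u, (const uint8_t *)sb + sizeof(sb->sum),
			  sizeof(*sb) - sizeof(sb->sum));
}

/* A copy is valid when both the magic and the stored checksum match. */
static int toy_sb_valid(const struct toy_sb *sb)
{
	return sb->magic == TOY_MAGIC && sb->sum == toy_sb_sum(sb);
}

/*
 * Prefer the primary copy, fall back to the replica, and overwrite the
 * other copy with whichever one validated.
 * Returns 0 (primary used), 1 (replica used) or -1 (no valid copy).
 */
static int toy_pick_and_repair(struct toy_sb *primary, struct toy_sb *replica)
{
	if (toy_sb_valid(primary)) {
		*replica = *primary;
		return 0;
	}
	if (toy_sb_valid(replica)) {
		*primary = *replica;
		return 1;
	}
	return -1;
}
```

A caller would fill in a struct toy_sb, store toy_sb_sum() into its sum field,
and keep two copies. Placing the checksum at offset 0 is what makes the
"skip the first 4 bytes" arithmetic work without knowing the rest of the layout.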
diff --git a/fs/nova/super.h b/fs/nova/super.h
new file mode 100644
index 000000000000..8c0ffbf79e9b
--- /dev/null
+++ b/fs/nova/super.h
@@ -0,0 +1,216 @@
+#ifndef __SUPER_H
+#define __SUPER_H
+/*
+ * Structure of the NOVA super block in PMEM
+ *
+ * The fields are partitioned into static and dynamic fields. The static fields
+ * never change after file system creation. This was primarily done because
+ * nova_get_block() returns NULL if the block offset is 0 (helps in catching
+ * bugs). So if we modify any field using journaling (for consistency), we
+ * will have to modify s_sum which is at offset 0. So journaling code fails.
+ * This (static+dynamic fields) is a temporary solution and can be avoided
+ * once the file system becomes stable and nova_get_block() returns correct
+ * pointers even for offset 0.
+ */
+struct nova_super_block {
+	/* static fields. they never change after file system creation.
+	 * s_sum covers every byte of this structure that follows it
+	 */
+	__le32		s_sum;			/* checksum of this sb */
+	__le32		s_magic;		/* magic signature */
+	__le32		s_padding32;
+	__le32		s_blocksize;		/* blocksize in bytes */
+	__le64		s_size;			/* total size of fs in bytes */
+	char		s_volume_name[16];	/* volume name */
+
+	/* all the dynamic fields should go here */
+	__le64		s_epoch_id;		/* Epoch ID */
+
+	/* s_mtime and s_wtime should be together and their order should not be
+	 * changed. we use an 8 byte write to update both of them atomically
+	 */
+	__le32		s_mtime;		/* mount time */
+	__le32		s_wtime;		/* write time */
+
+	/* Metadata and data protections */
+	u8		s_padding8;
+	u8		s_metadata_csum;
+	u8		s_data_csum;
+	u8		s_data_parity;
+} __attribute((__packed__));
+
+#define NOVA_SB_SIZE 512       /* must be power of two */
+
+/* ======================= Reserved blocks ========================= */
+
+/*
+ * The first block contains super blocks;
+ * The second block contains reserved inodes;
+ * The third block is reserved.
+ * The fourth block contains pointers to journal pages.
+ * The fifth/sixth blocks contain pointers to inode tables.
+ * The seventh/eighth blocks are unused for now.
+ *
+ * If data protection is enabled, more blocks are reserved for checksums and
+ * parities and the number is derived according to the whole storage size.
+ */
+#define	HEAD_RESERVED_BLOCKS	8
+
+#define SUPER_BLOCK_START       0 // Superblock
+#define	RESERVE_INODE_START	1 // Reserved inodes
+#define	JOURNAL_START		3 // journal pointer table
+#define	INODE_TABLE0_START	4 // inode table
+#define	INODE_TABLE1_START	5 // replica inode table
+
+/* For replica super block and replica reserved inodes */
+#define	TAIL_RESERVED_BLOCKS	2
+
+/* ======================= Reserved inodes ========================= */
+
+/* We have space for 31 reserved inodes */
+#define NOVA_ROOT_INO		(1)
+#define NOVA_INODETABLE_INO	(2)	/* Fake inode associated with inode
+					 * storage.  We need this because our
+					 * allocator requires an inode to be
+					 * associated with each allocation.
+					 * The data actually lives in linked
+					 * lists in INODE_TABLE0_START. */
+#define NOVA_BLOCKNODE_INO	(3)     /* Storage for allocator state */
+#define NOVA_LITEJOURNAL_INO	(4)     /* Storage for lightweight journals */
+#define NOVA_INODELIST1_INO	(5)     /* Storage for Inode free list */
+#define NOVA_SNAPSHOT_INO	(6)	/* Storage for snapshot state */
+#define NOVA_TEST_PERF_INO	(7)
+
+
+/* Normal inode starts at 32 */
+#define NOVA_NORMAL_INODE_START      (32)
+
+
+
+/*
+ * NOVA super-block data in DRAM
+ */
+struct nova_sb_info {
+	struct super_block *sb;			/* VFS super block */
+	struct nova_super_block *nova_sb;	/* DRAM copy of SB */
+	struct block_device *s_bdev;
+	struct dax_device *s_dax_dev;
+
+	/*
+	 * base physical and virtual address of NOVA (which is also
+	 * the pointer to the super block)
+	 */
+	phys_addr_t	phys_addr;
+	void		*virt_addr;
+	void		*replica_reserved_inodes_addr;
+	void		*replica_sb_addr;
+
+	unsigned long	num_blocks;
+
+	/* TODO: Remove this, since it's unused */
+	/*
+	 * Backing store option:
+	 * 1 = no load, 2 = no store,
+	 * else do both
+	 */
+	unsigned int	nova_backing_option;
+
+	/* Mount options */
+	unsigned long	bpi;
+	unsigned long	blocksize;
+	unsigned long	initsize;
+	unsigned long	s_mount_opt;
+	kuid_t		uid;    /* Mount uid for root directory */
+	kgid_t		gid;    /* Mount gid for root directory */
+	umode_t		mode;   /* Mount mode for root directory */
+	atomic_t	next_generation;
+	/* inode tracking */
+	unsigned long	s_inodes_used_count;
+	unsigned long	head_reserved_blocks;
+	unsigned long	tail_reserved_blocks;
+
+	struct mutex	s_lock;	/* protects the SB's buffer-head */
+
+	int cpus;
+	struct proc_dir_entry *s_proc;
+
+	/* Snapshot related */
+	struct nova_inode_info	*snapshot_si;
+	struct radix_tree_root	snapshot_info_tree;
+	int num_snapshots;
+	/* Current epoch. volatile guarantees visibility */
+	volatile u64 s_epoch_id;
+	volatile int snapshot_taking;
+
+	int mount_snapshot;
+	u64 mount_snapshot_epoch_id;
+
+	struct task_struct *snapshot_cleaner_thread;
+	wait_queue_head_t snapshot_cleaner_wait;
+	wait_queue_head_t snapshot_mmap_wait;
+	void *curr_clean_snapshot_info;
+
+	/* DAX-mmap snapshot structures */
+	struct mutex vma_mutex;
+	struct list_head mmap_sih_list;
+
+	/* Zeroed page used to initialize cache pages */
+	void *zeroed_page;
+
+	/* Checksum and parity for zero block */
+	u32 zero_csum[8];
+	void *zero_parity;
+
+	/* Per-CPU journal lock */
+	spinlock_t *journal_locks;
+
+	/* Per-CPU inode map */
+	struct inode_map	*inode_maps;
+
+	/* Decide new inode map id */
+	unsigned long map_id;
+
+	/* Per-CPU free block list */
+	struct free_list *free_lists;
+	unsigned long per_list_blocks;
+};
+
+static inline struct nova_sb_info *NOVA_SB(struct super_block *sb)
+{
+	return sb->s_fs_info;
+}
+
+
+
+static inline struct nova_super_block
+*nova_get_redund_super(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	return (struct nova_super_block *)(sbi->replica_sb_addr);
+}
+
+
+/* If this is part of a read-modify-write of the super block,
+ * nova_memunlock_super() before calling!
+ */
+static inline struct nova_super_block *nova_get_super(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	return (struct nova_super_block *)sbi->virt_addr;
+}
+
+extern struct super_block *nova_read_super(struct super_block *sb, void *data,
+	int silent);
+extern int nova_statfs(struct dentry *d, struct kstatfs *buf);
+extern int nova_remount(struct super_block *sb, int *flags, char *data);
+void *nova_ioremap(struct super_block *sb, phys_addr_t phys_addr,
+	ssize_t size);
+extern struct nova_range_node *nova_alloc_range_node(struct super_block *sb);
+extern void nova_free_range_node(struct nova_range_node *node);
+extern void nova_update_super_crc(struct super_block *sb);
+extern void nova_sync_super(struct super_block *sb);
+
+struct snapshot_info *nova_alloc_snapshot_info(struct super_block *sb);
+#endif


* [RFC 03/16] NOVA: PMEM allocation system
  2017-08-03  7:48 ` Steven Swanson
@ 2017-08-03  7:48   ` Steven Swanson
  -1 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson

NOVA uses per-CPU allocators to manage free PMEM blocks.  On initialization,
NOVA divides the range of blocks in the PMEM device among the CPUs, and each
CPU's blocks are managed solely by that CPU's allocator.  We call these ranges
allocation regions.
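
As a rough illustration of the division described above, the following
user-space sketch splits a block range evenly among CPUs.  The names
(struct region, region_for_cpu(), owner_of_block()) are hypothetical and not
part of this patch; the arithmetic only mirrors the per-CPU split, and it
ignores the reserved head/tail blocks and the checksum/parity carve-outs.

```c
#include <assert.h>

/* Hypothetical stand-in for the block range owned by one CPU. */
struct region {
	unsigned long start;	/* first block, inclusive */
	unsigned long end;	/* last block, inclusive */
};

/* Give each CPU an equal, contiguous share of the block space. */
static struct region region_for_cpu(unsigned long num_blocks, int cpus,
				    int cpu)
{
	unsigned long per_cpu = num_blocks / cpus;
	struct region r;

	assert(cpus > 0 && cpu >= 0 && cpu < cpus);
	r.start = per_cpu * cpu;
	r.end = r.start + per_cpu - 1;
	return r;
}

/* Map a block number back to the CPU whose region contains it. */
static int owner_of_block(unsigned long num_blocks, int cpus,
			  unsigned long blocknr)
{
	return (int)(blocknr / (num_blocks / cpus));
}
```

In this sketch, any blocks left over by the integer division belong to no
region, and the caller is assumed to pass at least as many blocks as CPUs.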

Some of the blocks in an allocation region have fixed roles.  Here's the
layout:

+-------------------------------+
| data checksum blocks          |
+-------------------------------+
| data parity blocks            |
+-------------------------------+
|                               |
| Allocatable blocks            |
|                               |
+-------------------------------+
| replica data parity blocks    |
+-------------------------------+
| replica data checksum blocks  |
+-------------------------------+

The first and last allocation regions also contain the super block, inode
tables, etc. and their replicas, respectively.
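
The layout in the diagram can be sketched as follows.  This is an
illustration only: struct region_layout and carve_region() are hypothetical
names, and the checksum and parity block counts are taken as inputs rather
than derived from the device size.  The sketch follows the diagram and
shrinks the allocatable range at both ends.

```c
#include <assert.h>

/* Hypothetical layout of one allocation region, in block numbers. */
struct region_layout {
	unsigned long csum_start;		/* data checksum blocks */
	unsigned long parity_start;		/* data parity blocks */
	unsigned long alloc_start;		/* first allocatable block */
	unsigned long alloc_end;		/* last allocatable block */
	unsigned long replica_parity_start;	/* replica parity blocks */
	unsigned long replica_csum_start;	/* replica checksum blocks */
};

/* Carve the fixed-role blocks out of the region [start, end]. */
static struct region_layout carve_region(unsigned long start,
					 unsigned long end,
					 unsigned long csum_blocks,
					 unsigned long parity_blocks)
{
	struct region_layout l;

	assert(end >= start);
	/* Something must be left over for allocation. */
	assert(2 * (csum_blocks + parity_blocks) < end - start + 1);

	/* Primary copies sit at the low end of the region. */
	l.csum_start = start;
	l.parity_start = l.csum_start + csum_blocks;
	l.alloc_start = l.parity_start + parity_blocks;

	/* Replicas sit at the high end, in mirrored order. */
	l.replica_csum_start = end + 1 - csum_blocks;
	l.replica_parity_start = l.replica_csum_start - parity_blocks;
	l.alloc_end = l.replica_parity_start - 1;
	return l;
}
```

Placing the replicas at the opposite end of the region keeps each copy as far
as possible from its primary within the same region.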

Each allocator maintains a red-black tree of unallocated ranges (struct
nova_range_node).
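
To make the range bookkeeping concrete, here is a small user-space sketch of
returning blocks to a set of free ranges and coalescing them with their
neighbors.  It is not the patch's code: it swaps the red-black tree for a
sorted array to stay short, and struct range, struct free_ranges, and
free_range() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_RANGES 64	/* arbitrary capacity for this sketch */

/* Hypothetical free range [low, high], both ends inclusive. */
struct range {
	unsigned long low;
	unsigned long high;
};

/* Sorted, non-overlapping free ranges; stands in for the rb-tree. */
struct free_ranges {
	struct range r[MAX_RANGES];
	size_t n;
};

/* Return blocks [low, high] to the free set.  0 on success, -1 on error. */
static int free_range(struct free_ranges *f, unsigned long low,
		      unsigned long high)
{
	struct range *prev, *next;
	size_t i = 0;

	assert(low <= high);

	/* i = index of the first free range that starts above low. */
	while (i < f->n && f->r[i].low <= low)
		i++;
	prev = i > 0 ? &f->r[i - 1] : NULL;
	next = i < f->n ? &f->r[i] : NULL;

	/* Freeing blocks that are already free is an error. */
	if ((prev && prev->high >= low) || (next && next->low <= high))
		return -1;

	if (prev && next && low == prev->high + 1 && high + 1 == next->low) {
		/* Fills the hole: merge prev and next into one range. */
		prev->high = next->high;
		memmove(&f->r[i], &f->r[i + 1],
			(f->n - i - 1) * sizeof(f->r[0]));
		f->n--;
	} else if (prev && low == prev->high + 1) {
		prev->high = high;		/* extend prev upward */
	} else if (next && high + 1 == next->low) {
		next->low = low;		/* extend next downward */
	} else {
		/* Not adjacent to anything: insert a new range at i. */
		if (f->n == MAX_RANGES)
			return -1;
		memmove(&f->r[i + 1], &f->r[i],
			(f->n - i) * sizeof(f->r[0]));
		f->r[i].low = low;
		f->r[i].high = high;
		f->n++;
	}
	return 0;
}
```

Coalescing on free keeps the number of ranges small, so a long run of
contiguous free blocks is always described by a single node.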

Allocation Functions
--------------------

NOVA allocates PMEM blocks using two mechanisms:

1.  Static allocation as defined in super.h

2.  Allocation for log and data pages via nova_new_log_blocks() and
nova_new_data_blocks().

Both of these functions allow the caller to control whether the allocator
prefers higher or lower addresses for allocation.  We use this to encourage
metadata structures and their replicas to be far from one another.
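
The effect of the direction argument can be sketched on a single free range.
The names below (enum alloc_direction, struct blk_range, carve()) are
hypothetical and only mirror the idea; the sketch handles just the case where
the range is larger than the request.

```c
#include <assert.h>

/* Hypothetical counterpart of the head/tail allocation preference. */
enum alloc_direction { FROM_HEAD, FROM_TAIL };

/* Hypothetical free range [low, high], both ends inclusive. */
struct blk_range {
	unsigned long low;
	unsigned long high;
};

/*
 * Carve num blocks off one end of the range and return the first block
 * allocated.  The range must hold more than num blocks.
 */
static unsigned long carve(struct blk_range *r, unsigned long num,
			   enum alloc_direction dir)
{
	unsigned long first;

	assert(num > 0 && num <= r->high - r->low);

	if (dir == FROM_HEAD) {
		first = r->low;
		r->low += num;
	} else {
		first = r->high + 1 - num;
		r->high -= num;
	}
	return first;
}
```

Allocating a structure from the head and its replica from the tail of the
same range leaves the rest of the free space between the two copies.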

PMEM Address Translation
------------------------

In NOVA's persistent data structures, memory locations are given as offsets
from the beginning of the PMEM region.  nova_get_block() translates offsets to
PMEM addresses.  nova_get_addr_off() performs the reverse translation.
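
A minimal sketch of the two translations, assuming the PMEM region is mapped
contiguously at a known base address.  struct pmem_map, off_to_addr(), and
addr_to_off() are hypothetical names, and treating offset 0 as "no block" is
an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping of a PMEM region into the address space. */
struct pmem_map {
	char *base;	/* virtual address of the start of the region */
	size_t size;	/* region size in bytes */
};

/* Offset -> address.  Offset 0 and out-of-range offsets yield NULL. */
static void *off_to_addr(const struct pmem_map *m, uint64_t off)
{
	if (off == 0 || off >= m->size)
		return NULL;
	return m->base + off;
}

/* Address -> offset.  The address must lie inside the region. */
static uint64_t addr_to_off(const struct pmem_map *m, const void *addr)
{
	const char *p = addr;

	assert(p >= m->base && p < m->base + m->size);
	return (uint64_t)(p - m->base);
}
```

Storing offsets rather than pointers keeps the persistent structures valid
even if the region is mapped at a different virtual address on the next mount.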

Cautious Allocation
-------------------

The allocator allows the caller to provide some control over where the blocks
come from.  NOVA uses this to allocate replicas of metadata far from one
another.

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 fs/nova/balloc.c |  827 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/balloc.h |  118 ++++++++
 2 files changed, 945 insertions(+)
 create mode 100644 fs/nova/balloc.c
 create mode 100644 fs/nova/balloc.h

diff --git a/fs/nova/balloc.c b/fs/nova/balloc.c
new file mode 100644
index 000000000000..434507b017bd
--- /dev/null
+++ b/fs/nova/balloc.c
@@ -0,0 +1,827 @@
+/*
+ * NOVA persistent memory management
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#include <linux/fs.h>
+#include <linux/bitops.h>
+#include "nova.h"
+#include "inode.h"
+
+int nova_alloc_block_free_lists(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	int i;
+
+	sbi->free_lists = kcalloc(sbi->cpus, sizeof(struct free_list),
+				  GFP_KERNEL);
+
+	if (!sbi->free_lists)
+		return -ENOMEM;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+		free_list->block_free_tree = RB_ROOT;
+		spin_lock_init(&free_list->s_lock);
+		free_list->index = i;
+	}
+
+	return 0;
+}
+
+void nova_delete_free_lists(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	/* Each tree is freed in save_blocknode_mappings */
+	kfree(sbi->free_lists);
+	sbi->free_lists = NULL;
+}
+
+static int nova_data_csum_init_free_list(struct super_block *sb,
+	struct free_list *free_list)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	unsigned long data_csum_blocks;
+
+	/* Allocate pages to hold data checksums.  We store one checksum for
+	 * each stripe for each page.  We replicate the checksums at the
+	 * beginning and end of the per-CPU region holding the data they cover.
+	 */
+	data_csum_blocks = ((sbi->initsize >> NOVA_STRIPE_SHIFT)
+				* NOVA_DATA_CSUM_LEN) >> PAGE_SHIFT;
+	free_list->csum_start = free_list->block_start;
+	free_list->block_start += data_csum_blocks / sbi->cpus;
+	if (data_csum_blocks % sbi->cpus)
+		free_list->block_start++;
+
+	free_list->num_csum_blocks =
+		free_list->block_start - free_list->csum_start;
+
+	free_list->replica_csum_start = free_list->block_end + 1 -
+						free_list->num_csum_blocks;
+	free_list->block_end -= free_list->num_csum_blocks;
+
+	return 0;
+}
+
+
+static int nova_data_parity_init_free_list(struct super_block *sb,
+	struct free_list *free_list)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	unsigned long blocksize, total_blocks, parity_blocks;
+
+	/* Allocate blocks to store data block parity stripes.
+	 * Always reserve in case user turns it off at init mount but later
+	 * turns it on.
+	 */
+	blocksize = sb->s_blocksize;
+	total_blocks = sbi->initsize / blocksize;
+	parity_blocks = total_blocks / (blocksize / NOVA_STRIPE_SIZE + 1);
+	if (total_blocks % (blocksize / NOVA_STRIPE_SIZE + 1))
+		parity_blocks++;
+
+	free_list->parity_start = free_list->block_start;
+	free_list->block_start += parity_blocks / sbi->cpus;
+	if (parity_blocks % sbi->cpus)
+		free_list->block_start++;
+
+	free_list->num_parity_blocks =
+		free_list->block_start - free_list->parity_start;
+
+	free_list->replica_parity_start = free_list->block_end + 1 -
+		free_list->num_parity_blocks;
+	free_list->block_end -= free_list->num_parity_blocks;
+	return 0;
+}
+
+
+// Initialize a free list.  Each CPU gets an equal share of the block space to
+// manage.
+static void nova_init_free_list(struct super_block *sb,
+	struct free_list *free_list, int index)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	unsigned long per_list_blocks;
+
+	per_list_blocks = sbi->num_blocks / sbi->cpus;
+
+	free_list->block_start = per_list_blocks * index;
+	free_list->block_end = free_list->block_start +
+					per_list_blocks - 1;
+	if (index == 0)
+		free_list->block_start += sbi->head_reserved_blocks;
+	if (index == sbi->cpus - 1)
+		free_list->block_end -= sbi->tail_reserved_blocks;
+
+	nova_data_csum_init_free_list(sb, free_list);
+	nova_data_parity_init_free_list(sb, free_list);
+}
+
+inline struct nova_range_node *nova_alloc_blocknode(struct super_block *sb)
+{
+	return nova_alloc_range_node(sb);
+}
+
+inline void nova_free_blocknode(struct super_block *sb,
+	struct nova_range_node *node)
+{
+	nova_free_range_node(node);
+}
+
+
+void nova_init_blockmap(struct super_block *sb, int recovery)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct rb_root *tree;
+	struct nova_range_node *blknode;
+	struct free_list *free_list;
+	int i;
+	int ret;
+
+	/* Divide the block range among per-CPU free lists */
+	sbi->per_list_blocks = sbi->num_blocks / sbi->cpus;
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+		tree = &(free_list->block_free_tree);
+		nova_init_free_list(sb, free_list, i);
+
+		/* For recovery, update these fields later */
+		if (recovery == 0) {
+			free_list->num_free_blocks = free_list->block_end -
+						free_list->block_start + 1;
+
+			blknode = nova_alloc_blocknode(sb);
+			if (blknode == NULL)
+				BUG();
+			blknode->range_low = free_list->block_start;
+			blknode->range_high = free_list->block_end;
+			nova_update_range_node_checksum(blknode);
+			ret = nova_insert_blocktree(sbi, tree, blknode);
+			if (ret) {
+				nova_err(sb, "%s failed\n", __func__);
+				nova_free_blocknode(sb, blknode);
+				return;
+			}
+			free_list->first_node = blknode;
+			free_list->last_node = blknode;
+			free_list->num_blocknode = 1;
+		}
+
+		nova_dbgv("%s: free list %d: block start %lu, end %lu, %lu free blocks\n",
+			  __func__, i,
+			  free_list->block_start,
+			  free_list->block_end,
+			  free_list->num_free_blocks);
+	}
+}
+
+static inline int nova_rbtree_compare_rangenode(struct nova_range_node *curr,
+	unsigned long range_low)
+{
+	if (range_low < curr->range_low)
+		return -1;
+	if (range_low > curr->range_high)
+		return 1;
+
+	return 0;
+}
+
+int nova_find_range_node(struct nova_sb_info *sbi,
+	struct rb_root *tree, unsigned long range_low,
+	struct nova_range_node **ret_node)
+{
+	struct nova_range_node *curr = NULL;
+	struct rb_node *temp;
+	int compVal;
+	int ret = 0;
+
+	temp = tree->rb_node;
+
+	while (temp) {
+		curr = container_of(temp, struct nova_range_node, node);
+		compVal = nova_rbtree_compare_rangenode(curr, range_low);
+
+		if (compVal == -1) {
+			temp = temp->rb_left;
+		} else if (compVal == 1) {
+			temp = temp->rb_right;
+		} else {
+			ret = 1;
+			break;
+		}
+	}
+
+	if (curr && !nova_range_node_checksum_ok(curr)) {
+		nova_dbg("%s: curr failed\n", __func__);
+		return 0;
+	}
+
+	*ret_node = curr;
+	return ret;
+}
+
+
+int nova_insert_range_node(struct rb_root *tree,
+	struct nova_range_node *new_node)
+{
+	struct nova_range_node *curr;
+	struct rb_node **temp, *parent;
+	int compVal;
+
+	temp = &(tree->rb_node);
+	parent = NULL;
+
+	while (*temp) {
+		curr = container_of(*temp, struct nova_range_node, node);
+		compVal = nova_rbtree_compare_rangenode(curr,
+					new_node->range_low);
+		parent = *temp;
+
+		if (compVal == -1) {
+			temp = &((*temp)->rb_left);
+		} else if (compVal == 1) {
+			temp = &((*temp)->rb_right);
+		} else {
+			nova_dbg("%s: entry %lu - %lu already exists: %lu - %lu\n",
+				 __func__, new_node->range_low,
+				new_node->range_high, curr->range_low,
+				curr->range_high);
+			return -EINVAL;
+		}
+	}
+
+	rb_link_node(&new_node->node, parent, temp);
+	rb_insert_color(&new_node->node, tree);
+
+	return 0;
+}
+
+inline int nova_insert_blocktree(struct nova_sb_info *sbi,
+	struct rb_root *tree, struct nova_range_node *new_node)
+{
+	int ret;
+
+	ret = nova_insert_range_node(tree, new_node);
+	if (ret)
+		nova_dbg("ERROR: %s failed %d\n", __func__, ret);
+
+	return ret;
+}
+
+
+/* Used for both block free tree and inode inuse tree */
+int nova_find_free_slot(struct nova_sb_info *sbi,
+	struct rb_root *tree, unsigned long range_low,
+	unsigned long range_high, struct nova_range_node **prev,
+	struct nova_range_node **next)
+{
+	struct nova_range_node *ret_node = NULL;
+	struct rb_node *tmp;
+	int check_prev = 0, check_next = 0;
+	int ret;
+
+	ret = nova_find_range_node(sbi, tree, range_low, &ret_node);
+	if (ret) {
+		nova_dbg("%s ERROR: %lu - %lu already in free list\n",
+			__func__, range_low, range_high);
+		return -EINVAL;
+	}
+
+	if (!ret_node) {
+		*prev = *next = NULL;
+	} else if (ret_node->range_high < range_low) {
+		*prev = ret_node;
+		tmp = rb_next(&ret_node->node);
+		if (tmp) {
+			*next = container_of(tmp, struct nova_range_node, node);
+			check_next = 1;
+		} else {
+			*next = NULL;
+		}
+	} else if (ret_node->range_low > range_high) {
+		*next = ret_node;
+		tmp = rb_prev(&ret_node->node);
+		if (tmp) {
+			*prev = container_of(tmp, struct nova_range_node, node);
+			check_prev = 1;
+		} else {
+			*prev = NULL;
+		}
+	} else {
+		nova_dbg("%s ERROR: %lu - %lu overlaps with existing node %lu - %lu\n",
+			 __func__, range_low, range_high, ret_node->range_low,
+			ret_node->range_high);
+		return -EINVAL;
+	}
+
+	if (check_prev && !nova_range_node_checksum_ok(*prev)) {
+		nova_dbg("%s: prev failed\n", __func__);
+		return -EIO;
+	}
+
+	if (check_next && !nova_range_node_checksum_ok(*next)) {
+		nova_dbg("%s: next failed\n", __func__);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+static int nova_free_blocks(struct super_block *sb, unsigned long blocknr,
+	int num, unsigned short btype, int log_page)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct rb_root *tree;
+	unsigned long block_low;
+	unsigned long block_high;
+	unsigned long num_blocks = 0;
+	struct nova_range_node *prev = NULL;
+	struct nova_range_node *next = NULL;
+	struct nova_range_node *curr_node;
+	struct free_list *free_list;
+	int cpuid;
+	int new_node_used = 0;
+	int ret;
+	timing_t free_time;
+
+	if (num <= 0) {
+		nova_dbg("%s ERROR: free %d\n", __func__, num);
+		return -EINVAL;
+	}
+
+	NOVA_START_TIMING(free_blocks_t, free_time);
+	cpuid = blocknr / sbi->per_list_blocks;
+
+	/* Pre-allocate blocknode */
+	curr_node = nova_alloc_blocknode(sb);
+	if (curr_node == NULL) {
+		/* returning without freeing the block */
+		NOVA_END_TIMING(free_blocks_t, free_time);
+		return -ENOMEM;
+	}
+
+	free_list = nova_get_free_list(sb, cpuid);
+	spin_lock(&free_list->s_lock);
+
+	tree = &(free_list->block_free_tree);
+
+	num_blocks = nova_get_numblocks(btype) * num;
+	block_low = blocknr;
+	block_high = blocknr + num_blocks - 1;
+
+	nova_dbgv("Free: %lu - %lu\n", block_low, block_high);
+
+	if (blocknr < free_list->block_start ||
+			blocknr + num_blocks > free_list->block_end + 1) {
+		nova_err(sb, "free blocks %lu to %lu, free list %d, start %lu, end %lu\n",
+				blocknr, block_high,
+				free_list->index,
+				free_list->block_start,
+				free_list->block_end);
+		ret = -EIO;
+		goto out;
+	}
+
+	ret = nova_find_free_slot(sbi, tree, block_low,
+					block_high, &prev, &next);
+
+	if (ret) {
+		nova_dbg("%s: find free slot fail: %d\n", __func__, ret);
+		goto out;
+	}
+
+	if (prev && next && (block_low == prev->range_high + 1) &&
+			(block_high + 1 == next->range_low)) {
+		/* fits the hole */
+		rb_erase(&next->node, tree);
+		free_list->num_blocknode--;
+		prev->range_high = next->range_high;
+		nova_update_range_node_checksum(prev);
+		if (free_list->last_node == next)
+			free_list->last_node = prev;
+		nova_free_blocknode(sb, next);
+		goto block_found;
+	}
+	if (prev && (block_low == prev->range_high + 1)) {
+		/* Aligns left */
+		prev->range_high += num_blocks;
+		nova_update_range_node_checksum(prev);
+		goto block_found;
+	}
+	if (next && (block_high + 1 == next->range_low)) {
+		/* Aligns right */
+		next->range_low -= num_blocks;
+		nova_update_range_node_checksum(next);
+		goto block_found;
+	}
+
+	/* Aligns somewhere in the middle */
+	curr_node->range_low = block_low;
+	curr_node->range_high = block_high;
+	nova_update_range_node_checksum(curr_node);
+	new_node_used = 1;
+	ret = nova_insert_blocktree(sbi, tree, curr_node);
+	if (ret) {
+		new_node_used = 0;
+		goto out;
+	}
+	if (!prev)
+		free_list->first_node = curr_node;
+	if (!next)
+		free_list->last_node = curr_node;
+
+	free_list->num_blocknode++;
+
+block_found:
+	free_list->num_free_blocks += num_blocks;
+
+	if (log_page) {
+		free_list->free_log_count++;
+		free_list->freed_log_pages += num_blocks;
+	} else {
+		free_list->free_data_count++;
+		free_list->freed_data_pages += num_blocks;
+	}
+
+out:
+	spin_unlock(&free_list->s_lock);
+	if (new_node_used == 0)
+		nova_free_blocknode(sb, curr_node);
+
+	NOVA_END_TIMING(free_blocks_t, free_time);
+	return ret;
+}
+
+int nova_free_data_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr, int num)
+{
+	int ret;
+	timing_t free_time;
+
+	nova_dbgv("Inode %lu: free %d data block from %lu to %lu\n",
+			sih->ino, num, blocknr, blocknr + num - 1);
+	if (blocknr == 0) {
+		nova_dbg("%s: ERROR: %lu, %d\n", __func__, blocknr, num);
+		return -EINVAL;
+	}
+	NOVA_START_TIMING(free_data_t, free_time);
+	ret = nova_free_blocks(sb, blocknr, num, sih->i_blk_type, 0);
+	if (ret) {
+		nova_err(sb, "Inode %lu: free %d data block from %lu to %lu failed!\n",
+			 sih->ino, num, blocknr, blocknr + num - 1);
+		nova_print_nova_log(sb, sih);
+	}
+	NOVA_END_TIMING(free_data_t, free_time);
+
+	return ret;
+}
+
+int nova_free_log_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr, int num)
+{
+	int ret;
+	timing_t free_time;
+
+	nova_dbgv("Inode %lu: free %d log block from %lu to %lu\n",
+			sih->ino, num, blocknr, blocknr + num - 1);
+	if (blocknr == 0) {
+		nova_dbg("%s: ERROR: %lu, %d\n", __func__, blocknr, num);
+		return -EINVAL;
+	}
+	NOVA_START_TIMING(free_log_t, free_time);
+	ret = nova_free_blocks(sb, blocknr, num, sih->i_blk_type, 1);
+	if (ret) {
+		nova_err(sb, "Inode %lu: free %d log block from %lu to %lu failed!\n",
+			 sih->ino, num, blocknr, blocknr + num - 1);
+		nova_print_nova_log(sb, sih);
+	}
+	NOVA_END_TIMING(free_log_t, free_time);
+
+	return ret;
+}
+
+static int not_enough_blocks(struct free_list *free_list,
+	unsigned long num_blocks, enum alloc_type atype)
+{
+	struct nova_range_node *first = free_list->first_node;
+	struct nova_range_node *last = free_list->last_node;
+
+	if (free_list->num_free_blocks < num_blocks || !first || !last) {
+		nova_dbgv("%s: num_free_blocks=%lu; num_blocks=%lu; first=0x%p; last=0x%p",
+			  __func__, free_list->num_free_blocks, num_blocks,
+			  first, last);
+		return 1;
+	}
+
+	if (atype == LOG &&
+	    last->range_high - first->range_low < DEAD_ZONE_BLOCKS) {
+		nova_dbgv("%s: allocation would cause deadzone violation. high=0x%lx, low=0x%lx, DEADZONE=%d",
+			  __func__, last->range_high, first->range_low,
+			  DEAD_ZONE_BLOCKS);
+		return 1;
+	}
+
+	return 0;
+}
+
+/* Return how many blocks allocated */
+static long nova_alloc_blocks_in_free_list(struct super_block *sb,
+	struct free_list *free_list, unsigned short btype,
+	enum alloc_type atype, unsigned long num_blocks,
+	unsigned long *new_blocknr, enum nova_alloc_direction from_tail)
+{
+	struct rb_root *tree;
+	struct nova_range_node *curr, *next = NULL, *prev = NULL;
+	struct rb_node *temp, *next_node, *prev_node;
+	unsigned long curr_blocks;
+	bool found = 0;
+	unsigned long step = 0;
+
+	if (!free_list->first_node || free_list->num_free_blocks == 0) {
+		nova_dbgv("%s: Can't alloc. free_list->first_node=0x%p free_list->num_free_blocks = %lu",
+			  __func__, free_list->first_node,
+			  free_list->num_free_blocks);
+		return -ENOSPC;
+	}
+
+	if (atype == LOG && not_enough_blocks(free_list, num_blocks, atype)) {
+		nova_dbgv("%s: Can't alloc.  not_enough_blocks() == true",
+			  __func__);
+		return -ENOSPC;
+	}
+
+	tree = &(free_list->block_free_tree);
+	if (from_tail == ALLOC_FROM_HEAD)
+		temp = &(free_list->first_node->node);
+	else
+		temp = &(free_list->last_node->node);
+
+	while (temp) {
+		step++;
+		curr = container_of(temp, struct nova_range_node, node);
+
+		if (!nova_range_node_checksum_ok(curr)) {
+			nova_err(sb, "%s curr failed\n", __func__);
+			goto next;
+		}
+
+		curr_blocks = curr->range_high - curr->range_low + 1;
+
+		if (num_blocks >= curr_blocks) {
+			/* Superpage allocation must succeed */
+			if (btype > 0 && num_blocks > curr_blocks)
+				goto next;
+
+			/* Otherwise, allocate the whole blocknode */
+			if (curr == free_list->first_node) {
+				next_node = rb_next(temp);
+				if (next_node)
+					next = container_of(next_node,
+						struct nova_range_node, node);
+				free_list->first_node = next;
+			}
+
+			if (curr == free_list->last_node) {
+				prev_node = rb_prev(temp);
+				if (prev_node)
+					prev = container_of(prev_node,
+						struct nova_range_node, node);
+				free_list->last_node = prev;
+			}
+
+			rb_erase(&curr->node, tree);
+			free_list->num_blocknode--;
+			num_blocks = curr_blocks;
+			*new_blocknr = curr->range_low;
+			nova_free_blocknode(sb, curr);
+			found = 1;
+			break;
+		}
+
+		/* Allocate partial blocknode */
+		if (from_tail == ALLOC_FROM_HEAD) {
+			*new_blocknr = curr->range_low;
+			curr->range_low += num_blocks;
+		} else {
+			*new_blocknr = curr->range_high + 1 - num_blocks;
+			curr->range_high -= num_blocks;
+		}
+
+		nova_update_range_node_checksum(curr);
+		found = 1;
+		break;
+next:
+		if (from_tail == ALLOC_FROM_HEAD)
+			temp = rb_next(temp);
+		else
+			temp = rb_prev(temp);
+	}
+
+	if (free_list->num_free_blocks < num_blocks) {
+		nova_dbg("%s: free list %d has %lu free blocks, but allocated %lu blocks?\n",
+				__func__, free_list->index,
+				free_list->num_free_blocks, num_blocks);
+		return -ENOSPC;
+	}
+
+	if (found == 1)
+		free_list->num_free_blocks -= num_blocks;
+	else {
+		nova_dbgv("%s: Can't alloc.  found = %d", __func__, found);
+		return -ENOSPC;
+	}
+
+	NOVA_STATS_ADD(alloc_steps, step);
+
+	return num_blocks;
+}
+
+/* Find the free list with the most free blocks */
+static int nova_get_candidate_free_list(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	int cpuid = 0;
+	unsigned long num_free_blocks = 0;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+		if (free_list->num_free_blocks > num_free_blocks) {
+			cpuid = i;
+			num_free_blocks = free_list->num_free_blocks;
+		}
+	}
+
+	return cpuid;
+}
+
+static int nova_new_blocks(struct super_block *sb, unsigned long *blocknr,
+	unsigned int num, unsigned short btype, int zero,
+	enum alloc_type atype, int cpuid, enum nova_alloc_direction from_tail)
+{
+	struct free_list *free_list;
+	void *bp;
+	unsigned long num_blocks = 0;
+	unsigned long new_blocknr = 0;
+	long ret_blocks = 0;
+	int retried = 0;
+	timing_t alloc_time;
+
+	num_blocks = num * nova_get_numblocks(btype);
+	if (num_blocks == 0) {
+		nova_dbg_verbose("%s: num_blocks == 0", __func__);
+		return -EINVAL;
+	}
+
+	NOVA_START_TIMING(new_blocks_t, alloc_time);
+	if (cpuid == ANY_CPU)
+		cpuid = smp_processor_id();
+
+retry:
+	free_list = nova_get_free_list(sb, cpuid);
+	spin_lock(&free_list->s_lock);
+
+	if (not_enough_blocks(free_list, num_blocks, atype)) {
+		nova_dbgv("%s: cpu %d, free_blocks %lu, required %lu, blocknode %lu\n",
+			  __func__, cpuid, free_list->num_free_blocks,
+			  num_blocks, free_list->num_blocknode);
+
+		if (retried >= 2)
+			/* Allocate anyway */
+			goto alloc;
+
+		spin_unlock(&free_list->s_lock);
+		cpuid = nova_get_candidate_free_list(sb);
+		retried++;
+		goto retry;
+	}
+alloc:
+	ret_blocks = nova_alloc_blocks_in_free_list(sb, free_list, btype, atype,
+					num_blocks, &new_blocknr, from_tail);
+
+	if (ret_blocks > 0) {
+		if (atype == LOG) {
+			free_list->alloc_log_count++;
+			free_list->alloc_log_pages += ret_blocks;
+		} else if (atype == DATA) {
+			free_list->alloc_data_count++;
+			free_list->alloc_data_pages += ret_blocks;
+		}
+	}
+
+	spin_unlock(&free_list->s_lock);
+	NOVA_END_TIMING(new_blocks_t, alloc_time);
+
+	if (ret_blocks <= 0 || new_blocknr == 0) {
+		nova_dbg_verbose("%s: not able to allocate %d blocks.  ret_blocks=%ld; new_blocknr=%lu",
+				 __func__, num, ret_blocks, new_blocknr);
+		return -ENOSPC;
+	}
+
+	if (zero) {
+		bp = nova_get_block(sb, nova_get_block_off(sb,
+						new_blocknr, btype));
+		nova_memunlock_range(sb, bp, PAGE_SIZE * ret_blocks);
+		memset_nt(bp, 0, PAGE_SIZE * ret_blocks);
+		nova_memlock_range(sb, bp, PAGE_SIZE * ret_blocks);
+	}
+	*blocknr = new_blocknr;
+
+	nova_dbg_verbose("Alloc %lu NVMM blocks 0x%lx\n", ret_blocks, *blocknr);
+	return ret_blocks / nova_get_numblocks(btype);
+}
+
+// Allocate data blocks.  The offset for the allocated block comes back in
+// blocknr.  Return the number of blocks allocated.
+inline int nova_new_data_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long *blocknr,
+	unsigned long start_blk, unsigned int num,
+	enum nova_alloc_init zero, int cpu,
+	enum nova_alloc_direction from_tail)
+{
+	int allocated;
+	timing_t alloc_time;
+
+	NOVA_START_TIMING(new_data_blocks_t, alloc_time);
+	allocated = nova_new_blocks(sb, blocknr, num,
+			    sih->i_blk_type, zero, DATA, cpu, from_tail);
+	NOVA_END_TIMING(new_data_blocks_t, alloc_time);
+	if (allocated < 0) {
+		nova_dbgv("FAILED: Inode %lu, start blk %lu, alloc %u data blocks: error %d\n",
+			  sih->ino, start_blk, num,
+			  allocated);
+	} else {
+		nova_dbgv("Inode %lu, start blk %lu, alloc %d data blocks from %lu to %lu\n",
+			  sih->ino, start_blk, allocated, *blocknr,
+			  *blocknr + allocated - 1);
+	}
+	return allocated;
+}
+
+
+// Allocate log blocks.  The offset for the allocated block comes back in
+// blocknr.  Return the number of blocks allocated.
+inline int nova_new_log_blocks(struct super_block *sb,
+			struct nova_inode_info_header *sih,
+			unsigned long *blocknr, unsigned int num,
+			enum nova_alloc_init zero, int cpu,
+			enum nova_alloc_direction from_tail)
+{
+	int allocated;
+	timing_t alloc_time;
+
+	NOVA_START_TIMING(new_log_blocks_t, alloc_time);
+	allocated = nova_new_blocks(sb, blocknr, num,
+			    sih->i_blk_type, zero, LOG, cpu, from_tail);
+	NOVA_END_TIMING(new_log_blocks_t, alloc_time);
+	if (allocated < 0) {
+		nova_dbgv("%s: ino %lu, failed to alloc %d log blocks",
+			  __func__, sih->ino, num);
+	} else {
+		nova_dbgv("%s: ino %lu, alloc %d of %d log blocks %lu to %lu\n",
+			  __func__, sih->ino, allocated, num, *blocknr,
+			  *blocknr + allocated - 1);
+	}
+	return allocated;
+}
+
+unsigned long nova_count_free_blocks(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	unsigned long num_free_blocks = 0;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+		num_free_blocks += free_list->num_free_blocks;
+	}
+
+	return num_free_blocks;
+}
+
+
diff --git a/fs/nova/balloc.h b/fs/nova/balloc.h
new file mode 100644
index 000000000000..ce7166a5bf37
--- /dev/null
+++ b/fs/nova/balloc.h
@@ -0,0 +1,118 @@
+#ifndef __BALLOC_H
+#define __BALLOC_H
+
+#include "inode.h"
+
+/* DRAM structure to hold a list of free PMEM blocks */
+struct free_list {
+	spinlock_t s_lock;
+	struct rb_root	block_free_tree;
+	struct nova_range_node *first_node; // lowest address free range
+	struct nova_range_node *last_node; // highest address free range
+
+	int		index; // Which CPU do I belong to?
+
+	/* Where are the data checksum blocks */
+	unsigned long	csum_start;
+	unsigned long	replica_csum_start;
+	unsigned long	num_csum_blocks;
+
+	/* Where are the data parity blocks */
+	unsigned long	parity_start;
+	unsigned long	replica_parity_start;
+	unsigned long	num_parity_blocks;
+
+	/* Start and end of allocatable range, inclusive. Excludes csum and
+	 * parity blocks.
+	 */
+	unsigned long	block_start;
+	unsigned long	block_end;
+
+	unsigned long	num_free_blocks;
+
+	/* How many nodes in the rb tree? */
+	unsigned long	num_blocknode;
+
+	u32		csum;		/* Protect integrity */
+
+	/* Statistics */
+	unsigned long	alloc_log_count;
+	unsigned long	alloc_data_count;
+	unsigned long	free_log_count;
+	unsigned long	free_data_count;
+	unsigned long	alloc_log_pages;
+	unsigned long	alloc_data_pages;
+	unsigned long	freed_log_pages;
+	unsigned long	freed_data_pages;
+
+	u64		padding[8];	/* Cache line break */
+};
+
+static inline
+struct free_list *nova_get_free_list(struct super_block *sb, int cpu)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	return &sbi->free_lists[cpu];
+}
+
+enum nova_alloc_direction {ALLOC_FROM_HEAD = 0,
+			   ALLOC_FROM_TAIL = 1};
+
+enum nova_alloc_init {ALLOC_NO_INIT = 0,
+		      ALLOC_INIT_ZERO = 1};
+
+enum alloc_type {
+	LOG = 1,
+	DATA,
+};
+
+
+
+
+int nova_alloc_block_free_lists(struct super_block *sb);
+void nova_delete_free_lists(struct super_block *sb);
+inline struct nova_range_node *nova_alloc_blocknode(struct super_block *sb);
+inline struct nova_range_node *nova_alloc_inode_node(struct super_block *sb);
+inline struct vma_item *nova_alloc_vma_item(struct super_block *sb);
+inline void nova_free_range_node(struct nova_range_node *node);
+inline void nova_free_snapshot_info(struct snapshot_info *info);
+inline void nova_free_blocknode(struct super_block *sb,
+	struct nova_range_node *bnode);
+inline void nova_free_inode_node(struct super_block *sb,
+	struct nova_range_node *bnode);
+inline void nova_free_vma_item(struct super_block *sb,
+	struct vma_item *item);
+extern void nova_init_blockmap(struct super_block *sb, int recovery);
+extern int nova_free_data_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr, int num);
+extern int nova_free_log_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr, int num);
+extern inline int nova_new_data_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long *blocknr,
+	unsigned long start_blk, unsigned int num,
+	enum nova_alloc_init zero, int cpu,
+	enum nova_alloc_direction from_tail);
+extern int nova_new_log_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	unsigned long *blocknr, unsigned int num,
+	enum nova_alloc_init zero, int cpu,
+	enum nova_alloc_direction from_tail);
+extern unsigned long nova_count_free_blocks(struct super_block *sb);
+inline int nova_search_inodetree(struct nova_sb_info *sbi,
+	unsigned long ino, struct nova_range_node **ret_node);
+inline int nova_insert_blocktree(struct nova_sb_info *sbi,
+	struct rb_root *tree, struct nova_range_node *new_node);
+inline int nova_insert_inodetree(struct nova_sb_info *sbi,
+	struct nova_range_node *new_node, int cpu);
+int nova_find_free_slot(struct nova_sb_info *sbi,
+	struct rb_root *tree, unsigned long range_low,
+	unsigned long range_high, struct nova_range_node **prev,
+	struct nova_range_node **next);
+
+extern int nova_insert_range_node(struct rb_root *tree,
+				  struct nova_range_node *new_node);
+extern int nova_find_range_node(struct nova_sb_info *sbi,
+				struct rb_root *tree, unsigned long range_low,
+				struct nova_range_node **ret_node);
+#endif



* [RFC 03/16] NOVA: PMEM allocation system
@ 2017-08-03  7:48   ` Steven Swanson
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson, dan.j.williams

NOVA uses per-CPU allocators to manage free PMEM blocks.  On initialization,
NOVA divides the range of blocks in the PMEM device among the CPUs, and each
CPU's blocks are managed solely by that CPU's allocator.  We call these ranges
allocation regions.

Some of the blocks in an allocation region have fixed roles.  Here's the
layout:

+-------------------------------+
| data checksum blocks          |
+-------------------------------+
| data parity blocks            |
+-------------------------------+
|                               |
| Allocatable blocks            |
|                               |
+-------------------------------+
| replica data parity blocks    |
+-------------------------------+
| replica data checksum blocks  |
+-------------------------------+

The first and last allocation regions also contain the super block, inode
tables, etc., and their replicas, respectively.
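
As a rough illustration of the split described above, the following sketch
computes the block bounds of one allocation region.  It mirrors the arithmetic
in nova_init_free_list() but leaves out the checksum and parity reservations;
struct region and region_for_cpu() are hypothetical names used only for this
example.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for the bounds kept in struct free_list. */
struct region {
	unsigned long block_start;	/* first block of the region, inclusive */
	unsigned long block_end;	/* last block of the region, inclusive */
};

/*
 * Give CPU `index` an equal share of `num_blocks`, then trim the blocks
 * reserved at the head of the first region and at the tail of the last one.
 * Checksum and parity blocks are not modeled here.
 */
static struct region region_for_cpu(unsigned long num_blocks, int cpus,
				    int index, unsigned long head_reserved,
				    unsigned long tail_reserved)
{
	unsigned long per_list_blocks;
	struct region r;

	assert(cpus > 0 && index >= 0 && index < cpus);
	per_list_blocks = num_blocks / cpus;

	r.block_start = per_list_blocks * index;
	r.block_end = r.block_start + per_list_blocks - 1;
	if (index == 0)
		r.block_start += head_reserved;
	if (index == cpus - 1)
		r.block_end -= tail_reserved;

	return r;
}
```

Note that with integer division any remainder blocks (num_blocks % cpus) are
not assigned to a region in this sketch.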

Each allocator maintains a red-black tree of unallocated ranges (struct
nova_range_node).

Allocation Functions
--------------------

NOVA allocates PMEM blocks using two mechanisms:

1.  Static allocation as defined in super.h

2.  Allocation for log and data pages via nova_new_log_blocks() and
nova_new_data_blocks().

Both of these functions allow the caller to control whether the allocator
prefers higher or lower addresses for allocation.  We use this to encourage
metadata structures and their replicas to be far from one another.
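
To make the direction control concrete, here is a minimal sketch of carving
blocks from either end of a single free range.  It follows the partial-range
case in nova_alloc_blocks_in_free_list(); struct free_range and carve() are
hypothetical names, and the real allocator additionally walks the tree,
verifies checksums, and handles whole-range allocation.

```c
#include <assert.h>

enum alloc_direction { FROM_HEAD = 0, FROM_TAIL = 1 };

/* Hypothetical free range; both bounds are inclusive block numbers. */
struct free_range {
	unsigned long low;
	unsigned long high;
};

/*
 * Take `num` blocks from one end of the range and return the first block
 * number of the allocation.  The request must be strictly smaller than the
 * range so that the range stays non-empty.
 */
static unsigned long carve(struct free_range *r, unsigned long num,
			   enum alloc_direction dir)
{
	unsigned long first;

	assert(num > 0 && num < r->high - r->low + 1);

	if (dir == FROM_HEAD) {
		first = r->low;
		r->low += num;
	} else {
		first = r->high + 1 - num;
		r->high -= num;
	}

	return first;
}
```

A structure allocated with FROM_HEAD and its replica allocated with FROM_TAIL
end up at opposite ends of the free space.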

PMEM Address Translation
------------------------

In Nova's persistent data structures, memory locations are given as offsets
from the beginning of the PMEM region.  nova_get_block() translates offsets to
PMEM addresses.  nova_get_addr_off() performs the reverse translation.
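
The two translations amount to pointer arithmetic against the address where
the PMEM region is mapped.  The sketch below shows the idea only; the helper
names and the `base` parameter are assumptions made for this example and are
not the signatures of nova_get_block() or nova_get_addr_off().

```c
#include <assert.h>
#include <stdint.h>

/* Offset from the start of the PMEM region -> address in the mapping. */
static void *offset_to_addr(void *base, uint64_t offset)
{
	return (char *)base + offset;
}

/* Address in the mapping -> offset from the start of the PMEM region. */
static uint64_t addr_to_offset(void *base, void *addr)
{
	assert((char *)addr >= (char *)base);
	return (uint64_t)((char *)addr - (char *)base);
}
```

Storing offsets rather than addresses keeps the persistent structures valid
even if the region is mapped at a different address after a remount.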

Cautious Allocation
-------------------

The allocator allows the caller to provide some control over where the blocks
come from.  Nova uses this to allocate replicas of metadata far from one
another.

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 fs/nova/balloc.c |  827 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/balloc.h |  118 ++++++++
 2 files changed, 945 insertions(+)
 create mode 100644 fs/nova/balloc.c
 create mode 100644 fs/nova/balloc.h

diff --git a/fs/nova/balloc.c b/fs/nova/balloc.c
new file mode 100644
index 000000000000..434507b017bd
--- /dev/null
+++ b/fs/nova/balloc.c
@@ -0,0 +1,827 @@
+/*
+ * NOVA persistent memory management
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#include <linux/fs.h>
+#include <linux/bitops.h>
+#include "nova.h"
+#include "inode.h"
+
+int nova_alloc_block_free_lists(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	int i;
+
+	sbi->free_lists = kcalloc(sbi->cpus, sizeof(struct free_list),
+				  GFP_KERNEL);
+
+	if (!sbi->free_lists)
+		return -ENOMEM;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+		free_list->block_free_tree = RB_ROOT;
+		spin_lock_init(&free_list->s_lock);
+		free_list->index = i;
+	}
+
+	return 0;
+}
+
+void nova_delete_free_lists(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	/* Each tree is freed in save_blocknode_mappings */
+	kfree(sbi->free_lists);
+	sbi->free_lists = NULL;
+}
+
+static int nova_data_csum_init_free_list(struct super_block *sb,
+	struct free_list *free_list)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	unsigned long data_csum_blocks;
+
+	/* Allocate pages to hold data checksums.  We store one checksum for
+	 * each stripe for each page.  We replicate the checksums at the
+	 * beginning and end of per-cpu region that holds the data they cover.
+	 */
+	data_csum_blocks = ((sbi->initsize >> NOVA_STRIPE_SHIFT)
+				* NOVA_DATA_CSUM_LEN) >> PAGE_SHIFT;
+	free_list->csum_start = free_list->block_start;
+	free_list->block_start += data_csum_blocks / sbi->cpus;
+	if (data_csum_blocks % sbi->cpus)
+		free_list->block_start++;
+
+	free_list->num_csum_blocks =
+		free_list->block_start - free_list->csum_start;
+
+	free_list->replica_csum_start = free_list->block_end + 1 -
+						free_list->num_csum_blocks;
+	free_list->block_end -= free_list->num_csum_blocks;
+
+	return 0;
+}
+
+
+static int nova_data_parity_init_free_list(struct super_block *sb,
+	struct free_list *free_list)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	unsigned long blocksize, total_blocks, parity_blocks;
+
+	/* Allocate blocks to store data block parity stripes.
+	 * Always reserve in case user turns it off at init mount but later
+	 * turns it on.
+	 */
+	blocksize = sb->s_blocksize;
+	total_blocks = sbi->initsize / blocksize;
+	parity_blocks = total_blocks / (blocksize / NOVA_STRIPE_SIZE + 1);
+	if (total_blocks % (blocksize / NOVA_STRIPE_SIZE + 1))
+		parity_blocks++;
+
+	free_list->parity_start = free_list->block_start;
+	free_list->block_start += parity_blocks / sbi->cpus;
+	if (parity_blocks % sbi->cpus)
+		free_list->block_start++;
+
+	free_list->num_parity_blocks =
+		free_list->block_start - free_list->parity_start;
+
+	free_list->replica_parity_start = free_list->block_end + 1 -
+		free_list->num_parity_blocks;
+
+	return 0;
+}
+
+
+// Initialize a free list.  Each CPU gets an equal share of the block space to
+// manage.
+static void nova_init_free_list(struct super_block *sb,
+	struct free_list *free_list, int index)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	unsigned long per_list_blocks;
+
+	per_list_blocks = sbi->num_blocks / sbi->cpus;
+
+	free_list->block_start = per_list_blocks * index;
+	free_list->block_end = free_list->block_start +
+					per_list_blocks - 1;
+	if (index == 0)
+		free_list->block_start += sbi->head_reserved_blocks;
+	if (index == sbi->cpus - 1)
+		free_list->block_end -= sbi->tail_reserved_blocks;
+
+	nova_data_csum_init_free_list(sb, free_list);
+	nova_data_parity_init_free_list(sb, free_list);
+}
+
+inline struct nova_range_node *nova_alloc_blocknode(struct super_block *sb)
+{
+	return nova_alloc_range_node(sb);
+}
+
+inline void nova_free_blocknode(struct super_block *sb,
+	struct nova_range_node *node)
+{
+	nova_free_range_node(node);
+}
+
+
+void nova_init_blockmap(struct super_block *sb, int recovery)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct rb_root *tree;
+	struct nova_range_node *blknode;
+	struct free_list *free_list;
+	int i;
+	int ret;
+
+	/* Divide the block range among per-CPU free lists */
+	sbi->per_list_blocks = sbi->num_blocks / sbi->cpus;
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+		tree = &(free_list->block_free_tree);
+		nova_init_free_list(sb, free_list, i);
+
+		/* For recovery, update these fields later */
+		if (recovery == 0) {
+			free_list->num_free_blocks = free_list->block_end -
+						free_list->block_start + 1;
+
+			blknode = nova_alloc_blocknode(sb);
+			if (blknode == NULL)
+				BUG();
+			blknode->range_low = free_list->block_start;
+			blknode->range_high = free_list->block_end;
+			nova_update_range_node_checksum(blknode);
+			ret = nova_insert_blocktree(sbi, tree, blknode);
+			if (ret) {
+				nova_err(sb, "%s failed\n", __func__);
+				nova_free_blocknode(sb, blknode);
+				return;
+			}
+			free_list->first_node = blknode;
+			free_list->last_node = blknode;
+			free_list->num_blocknode = 1;
+		}
+
+		nova_dbgv("%s: free list %d: block start %lu, end %lu, %lu free blocks\n",
+			  __func__, i,
+			  free_list->block_start,
+			  free_list->block_end,
+			  free_list->num_free_blocks);
+	}
+}
+
+static inline int nova_rbtree_compare_rangenode(struct nova_range_node *curr,
+	unsigned long range_low)
+{
+	if (range_low < curr->range_low)
+		return -1;
+	if (range_low > curr->range_high)
+		return 1;
+
+	return 0;
+}
+
+int nova_find_range_node(struct nova_sb_info *sbi,
+	struct rb_root *tree, unsigned long range_low,
+	struct nova_range_node **ret_node)
+{
+	struct nova_range_node *curr = NULL;
+	struct rb_node *temp;
+	int compVal;
+	int ret = 0;
+
+	temp = tree->rb_node;
+
+	while (temp) {
+		curr = container_of(temp, struct nova_range_node, node);
+		compVal = nova_rbtree_compare_rangenode(curr, range_low);
+
+		if (compVal == -1) {
+			temp = temp->rb_left;
+		} else if (compVal == 1) {
+			temp = temp->rb_right;
+		} else {
+			ret = 1;
+			break;
+		}
+	}
+
+	if (curr && !nova_range_node_checksum_ok(curr)) {
+		nova_dbg("%s: curr failed\n", __func__);
+		return 0;
+	}
+
+	*ret_node = curr;
+	return ret;
+}
+
+
+int nova_insert_range_node(struct rb_root *tree,
+	struct nova_range_node *new_node)
+{
+	struct nova_range_node *curr;
+	struct rb_node **temp, *parent;
+	int compVal;
+
+	temp = &(tree->rb_node);
+	parent = NULL;
+
+	while (*temp) {
+		curr = container_of(*temp, struct nova_range_node, node);
+		compVal = nova_rbtree_compare_rangenode(curr,
+					new_node->range_low);
+		parent = *temp;
+
+		if (compVal == -1) {
+			temp = &((*temp)->rb_left);
+		} else if (compVal == 1) {
+			temp = &((*temp)->rb_right);
+		} else {
+			nova_dbg("%s: entry %lu - %lu already exists: %lu - %lu\n",
+				 __func__, new_node->range_low,
+				new_node->range_high, curr->range_low,
+				curr->range_high);
+			return -EINVAL;
+		}
+	}
+
+	rb_link_node(&new_node->node, parent, temp);
+	rb_insert_color(&new_node->node, tree);
+
+	return 0;
+}
+
+inline int nova_insert_blocktree(struct nova_sb_info *sbi,
+	struct rb_root *tree, struct nova_range_node *new_node)
+{
+	int ret;
+
+	ret = nova_insert_range_node(tree, new_node);
+	if (ret)
+		nova_dbg("ERROR: %s failed %d\n", __func__, ret);
+
+	return ret;
+}
+
+
+/* Used for both block free tree and inode inuse tree */
+int nova_find_free_slot(struct nova_sb_info *sbi,
+	struct rb_root *tree, unsigned long range_low,
+	unsigned long range_high, struct nova_range_node **prev,
+	struct nova_range_node **next)
+{
+	struct nova_range_node *ret_node = NULL;
+	struct rb_node *tmp;
+	int check_prev = 0, check_next = 0;
+	int ret;
+
+	ret = nova_find_range_node(sbi, tree, range_low, &ret_node);
+	if (ret) {
+		nova_dbg("%s ERROR: %lu - %lu already in free list\n",
+			__func__, range_low, range_high);
+		return -EINVAL;
+	}
+
+	if (!ret_node) {
+		*prev = *next = NULL;
+	} else if (ret_node->range_high < range_low) {
+		*prev = ret_node;
+		tmp = rb_next(&ret_node->node);
+		if (tmp) {
+			*next = container_of(tmp, struct nova_range_node, node);
+			check_next = 1;
+		} else {
+			*next = NULL;
+		}
+	} else if (ret_node->range_low > range_high) {
+		*next = ret_node;
+		tmp = rb_prev(&ret_node->node);
+		if (tmp) {
+			*prev = container_of(tmp, struct nova_range_node, node);
+			check_prev = 1;
+		} else {
+			*prev = NULL;
+		}
+	} else {
+		nova_dbg("%s ERROR: %lu - %lu overlaps with existing node %lu - %lu\n",
+			 __func__, range_low, range_high, ret_node->range_low,
+			ret_node->range_high);
+		return -EINVAL;
+	}
+
+	if (check_prev && !nova_range_node_checksum_ok(*prev)) {
+		nova_dbg("%s: prev failed\n", __func__);
+		return -EIO;
+	}
+
+	if (check_next && !nova_range_node_checksum_ok(*next)) {
+		nova_dbg("%s: next failed\n", __func__);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+static int nova_free_blocks(struct super_block *sb, unsigned long blocknr,
+	int num, unsigned short btype, int log_page)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct rb_root *tree;
+	unsigned long block_low;
+	unsigned long block_high;
+	unsigned long num_blocks = 0;
+	struct nova_range_node *prev = NULL;
+	struct nova_range_node *next = NULL;
+	struct nova_range_node *curr_node;
+	struct free_list *free_list;
+	int cpuid;
+	int new_node_used = 0;
+	int ret;
+	timing_t free_time;
+
+	if (num <= 0) {
+		nova_dbg("%s ERROR: free %d\n", __func__, num);
+		return -EINVAL;
+	}
+
+	NOVA_START_TIMING(free_blocks_t, free_time);
+	cpuid = blocknr / sbi->per_list_blocks;
+
+	/* Pre-allocate blocknode */
+	curr_node = nova_alloc_blocknode(sb);
+	if (curr_node == NULL) {
+		/* returning without freeing the block*/
+		NOVA_END_TIMING(free_blocks_t, free_time);
+		return -ENOMEM;
+	}
+
+	free_list = nova_get_free_list(sb, cpuid);
+	spin_lock(&free_list->s_lock);
+
+	tree = &(free_list->block_free_tree);
+
+	num_blocks = nova_get_numblocks(btype) * num;
+	block_low = blocknr;
+	block_high = blocknr + num_blocks - 1;
+
+	nova_dbgv("Free: %lu - %lu\n", block_low, block_high);
+
+	if (blocknr < free_list->block_start ||
+			blocknr + num > free_list->block_end + 1) {
+		nova_err(sb, "free blocks %lu to %lu, free list %d, start %lu, end %lu\n",
+				blocknr, blocknr + num - 1,
+				free_list->index,
+				free_list->block_start,
+				free_list->block_end);
+		ret = -EIO;
+		goto out;
+	}
+
+	ret = nova_find_free_slot(sbi, tree, block_low,
+					block_high, &prev, &next);
+
+	if (ret) {
+		nova_dbg("%s: find free slot fail: %d\n", __func__, ret);
+		goto out;
+	}
+
+	if (prev && next && (block_low == prev->range_high + 1) &&
+			(block_high + 1 == next->range_low)) {
+		/* fits the hole */
+		rb_erase(&next->node, tree);
+		free_list->num_blocknode--;
+		prev->range_high = next->range_high;
+		nova_update_range_node_checksum(prev);
+		if (free_list->last_node == next)
+			free_list->last_node = prev;
+		nova_free_blocknode(sb, next);
+		goto block_found;
+	}
+	if (prev && (block_low == prev->range_high + 1)) {
+		/* Aligns left */
+		prev->range_high += num_blocks;
+		nova_update_range_node_checksum(prev);
+		goto block_found;
+	}
+	if (next && (block_high + 1 == next->range_low)) {
+		/* Aligns right */
+		next->range_low -= num_blocks;
+		nova_update_range_node_checksum(next);
+		goto block_found;
+	}
+
+	/* Aligns somewhere in the middle */
+	curr_node->range_low = block_low;
+	curr_node->range_high = block_high;
+	nova_update_range_node_checksum(curr_node);
+	new_node_used = 1;
+	ret = nova_insert_blocktree(sbi, tree, curr_node);
+	if (ret) {
+		new_node_used = 0;
+		goto out;
+	}
+	if (!prev)
+		free_list->first_node = curr_node;
+	if (!next)
+		free_list->last_node = curr_node;
+
+	free_list->num_blocknode++;
+
+block_found:
+	free_list->num_free_blocks += num_blocks;
+
+	if (log_page) {
+		free_list->free_log_count++;
+		free_list->freed_log_pages += num_blocks;
+	} else {
+		free_list->free_data_count++;
+		free_list->freed_data_pages += num_blocks;
+	}
+
+out:
+	spin_unlock(&free_list->s_lock);
+	if (new_node_used == 0)
+		nova_free_blocknode(sb, curr_node);
+
+	NOVA_END_TIMING(free_blocks_t, free_time);
+	return ret;
+}
+
+int nova_free_data_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr, int num)
+{
+	int ret;
+	timing_t free_time;
+
+	nova_dbgv("Inode %lu: free %d data block from %lu to %lu\n",
+			sih->ino, num, blocknr, blocknr + num - 1);
+	if (blocknr == 0) {
+		nova_dbg("%s: ERROR: %lu, %d\n", __func__, blocknr, num);
+		return -EINVAL;
+	}
+	NOVA_START_TIMING(free_data_t, free_time);
+	ret = nova_free_blocks(sb, blocknr, num, sih->i_blk_type, 0);
+	if (ret) {
+		nova_err(sb, "Inode %lu: free %d data block from %lu to %lu failed!\n",
+			 sih->ino, num, blocknr, blocknr + num - 1);
+		nova_print_nova_log(sb, sih);
+	}
+	NOVA_END_TIMING(free_data_t, free_time);
+
+	return ret;
+}
+
+int nova_free_log_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr, int num)
+{
+	int ret;
+	timing_t free_time;
+
+	nova_dbgv("Inode %lu: free %d log block from %lu to %lu\n",
+			sih->ino, num, blocknr, blocknr + num - 1);
+	if (blocknr == 0) {
+		nova_dbg("%s: ERROR: %lu, %d\n", __func__, blocknr, num);
+		return -EINVAL;
+	}
+	NOVA_START_TIMING(free_log_t, free_time);
+	ret = nova_free_blocks(sb, blocknr, num, sih->i_blk_type, 1);
+	if (ret) {
+		nova_err(sb, "Inode %lu: free %d log block from %lu to %lu failed!\n",
+			 sih->ino, num, blocknr, blocknr + num - 1);
+		nova_print_nova_log(sb, sih);
+	}
+	NOVA_END_TIMING(free_log_t, free_time);
+
+	return ret;
+}
+
+static int not_enough_blocks(struct free_list *free_list,
+	unsigned long num_blocks, enum alloc_type atype)
+{
+	struct nova_range_node *first = free_list->first_node;
+	struct nova_range_node *last = free_list->last_node;
+
+	if (free_list->num_free_blocks < num_blocks || !first || !last) {
+		nova_dbgv("%s: num_free_blocks=%ld; num_blocks=%ld; first=0x%p; last=0x%p",
+			  __func__, free_list->num_free_blocks, num_blocks,
+			  first, last);
+		return 1;
+	}
+
+	if (atype == LOG &&
+	    last->range_high - first->range_low < DEAD_ZONE_BLOCKS) {
+		nova_dbgv("%s: allocation would cause deadzone violation. high=0x%lx, low=0x%lx, DEADZONE=%d",
+			  __func__, last->range_high, first->range_low,
+			  DEAD_ZONE_BLOCKS);
+		return 1;
+	}
+
+	return 0;
+}
+
+/* Return how many blocks allocated */
+static long nova_alloc_blocks_in_free_list(struct super_block *sb,
+	struct free_list *free_list, unsigned short btype,
+	enum alloc_type atype, unsigned long num_blocks,
+	unsigned long *new_blocknr, enum nova_alloc_direction from_tail)
+{
+	struct rb_root *tree;
+	struct nova_range_node *curr, *next = NULL, *prev = NULL;
+	struct rb_node *temp, *next_node, *prev_node;
+	unsigned long curr_blocks;
+	bool found = 0;
+	unsigned long step = 0;
+
+	if (!free_list->first_node || free_list->num_free_blocks == 0) {
+		nova_dbgv("%s: Can't alloc. free_list->first_node=0x%p free_list->num_free_blocks = %lu",
+			  __func__, free_list->first_node,
+			  free_list->num_free_blocks);
+		return -ENOSPC;
+	}
+
+	if (atype == LOG && not_enough_blocks(free_list, num_blocks, atype)) {
+		nova_dbgv("%s: Can't alloc.  not_enough_blocks() == true",
+			  __func__);
+		return -ENOSPC;
+	}
+
+	tree = &(free_list->block_free_tree);
+	if (from_tail == ALLOC_FROM_HEAD)
+		temp = &(free_list->first_node->node);
+	else
+		temp = &(free_list->last_node->node);
+
+	while (temp) {
+		step++;
+		curr = container_of(temp, struct nova_range_node, node);
+
+		if (!nova_range_node_checksum_ok(curr)) {
+			nova_err(sb, "%s curr failed\n", __func__);
+			goto next;
+		}
+
+		curr_blocks = curr->range_high - curr->range_low + 1;
+
+		if (num_blocks >= curr_blocks) {
+			/* Superpage allocation must succeed */
+			if (btype > 0 && num_blocks > curr_blocks)
+				goto next;
+
+			/* Otherwise, allocate the whole blocknode */
+			if (curr == free_list->first_node) {
+				next_node = rb_next(temp);
+				if (next_node)
+					next = container_of(next_node,
+						struct nova_range_node, node);
+				free_list->first_node = next;
+			}
+
+			if (curr == free_list->last_node) {
+				prev_node = rb_prev(temp);
+				if (prev_node)
+					prev = container_of(prev_node,
+						struct nova_range_node, node);
+				free_list->last_node = prev;
+			}
+
+			rb_erase(&curr->node, tree);
+			free_list->num_blocknode--;
+			num_blocks = curr_blocks;
+			*new_blocknr = curr->range_low;
+			nova_free_blocknode(sb, curr);
+			found = 1;
+			break;
+		}
+
+		/* Allocate partial blocknode */
+		if (from_tail == ALLOC_FROM_HEAD) {
+			*new_blocknr = curr->range_low;
+			curr->range_low += num_blocks;
+		} else {
+			*new_blocknr = curr->range_high + 1 - num_blocks;
+			curr->range_high -= num_blocks;
+		}
+
+		nova_update_range_node_checksum(curr);
+		found = 1;
+		break;
+next:
+		if (from_tail == ALLOC_FROM_HEAD)
+			temp = rb_next(temp);
+		else
+			temp = rb_prev(temp);
+	}
+
+	if (free_list->num_free_blocks < num_blocks) {
+		nova_dbg("%s: free list %d has %lu free blocks, but allocated %lu blocks?\n",
+				__func__, free_list->index,
+				free_list->num_free_blocks, num_blocks);
+		return -ENOSPC;
+	}
+
+	if (found == 1)
+		free_list->num_free_blocks -= num_blocks;
+	else {
+		nova_dbgv("%s: Can't alloc.  found = %d", __func__, found);
+		return -ENOSPC;
+	}
+
+	NOVA_STATS_ADD(alloc_steps, step);
+
+	return num_blocks;
+}
+
+/* Find out the free list with most free blocks */
+static int nova_get_candidate_free_list(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	int cpuid = 0;
+	int num_free_blocks = 0;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+		if (free_list->num_free_blocks > num_free_blocks) {
+			cpuid = i;
+			num_free_blocks = free_list->num_free_blocks;
+		}
+	}
+
+	return cpuid;
+}
+
+static int nova_new_blocks(struct super_block *sb, unsigned long *blocknr,
+	unsigned int num, unsigned short btype, int zero,
+	enum alloc_type atype, int cpuid, enum nova_alloc_direction from_tail)
+{
+	struct free_list *free_list;
+	void *bp;
+	unsigned long num_blocks = 0;
+	unsigned long new_blocknr = 0;
+	long ret_blocks = 0;
+	int retried = 0;
+	timing_t alloc_time;
+
+	num_blocks = num * nova_get_numblocks(btype);
+	if (num_blocks == 0) {
+		nova_dbg_verbose("%s: num_blocks == 0", __func__);
+		return -EINVAL;
+	}
+
+	NOVA_START_TIMING(new_blocks_t, alloc_time);
+	if (cpuid == ANY_CPU)
+		cpuid = smp_processor_id();
+
+retry:
+	free_list = nova_get_free_list(sb, cpuid);
+	spin_lock(&free_list->s_lock);
+
+	if (not_enough_blocks(free_list, num_blocks, atype)) {
+		nova_dbgv("%s: cpu %d, free_blocks %lu, required %lu, blocknode %lu\n",
+			  __func__, cpuid, free_list->num_free_blocks,
+			  num_blocks, free_list->num_blocknode);
+
+		if (retried >= 2)
+			/* Allocate anyway */
+			goto alloc;
+
+		spin_unlock(&free_list->s_lock);
+		cpuid = nova_get_candidate_free_list(sb);
+		retried++;
+		goto retry;
+	}
+alloc:
+	ret_blocks = nova_alloc_blocks_in_free_list(sb, free_list, btype, atype,
+					num_blocks, &new_blocknr, from_tail);
+
+	if (ret_blocks > 0) {
+		if (atype == LOG) {
+			free_list->alloc_log_count++;
+			free_list->alloc_log_pages += ret_blocks;
+		} else if (atype == DATA) {
+			free_list->alloc_data_count++;
+			free_list->alloc_data_pages += ret_blocks;
+		}
+	}
+
+	spin_unlock(&free_list->s_lock);
+	NOVA_END_TIMING(new_blocks_t, alloc_time);
+
+	if (ret_blocks <= 0 || new_blocknr == 0) {
+		nova_dbg_verbose("%s: not able to allocate %d blocks.  ret_blocks=%ld; new_blocknr=%lu",
+				 __func__, num, ret_blocks, new_blocknr);
+		return -ENOSPC;
+	}
+
+	if (zero) {
+		bp = nova_get_block(sb, nova_get_block_off(sb,
+						new_blocknr, btype));
+		nova_memunlock_range(sb, bp, PAGE_SIZE * ret_blocks);
+		memset_nt(bp, 0, PAGE_SIZE * ret_blocks);
+		nova_memlock_range(sb, bp, PAGE_SIZE * ret_blocks);
+	}
+	*blocknr = new_blocknr;
+
+	nova_dbg_verbose("Alloc %lu NVMM blocks 0x%lx\n", ret_blocks, *blocknr);
+	return ret_blocks / nova_get_numblocks(btype);
+}
+
+// Allocate data blocks.  The offset for the allocated block comes back in
+// blocknr.  Return the number of blocks allocated.
+inline int nova_new_data_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long *blocknr,
+	unsigned long start_blk, unsigned int num,
+	enum nova_alloc_init zero, int cpu,
+	enum nova_alloc_direction from_tail)
+{
+	int allocated;
+	timing_t alloc_time;
+
+	NOVA_START_TIMING(new_data_blocks_t, alloc_time);
+	allocated = nova_new_blocks(sb, blocknr, num,
+			    sih->i_blk_type, zero, DATA, cpu, from_tail);
+	NOVA_END_TIMING(new_data_blocks_t, alloc_time);
+	if (allocated < 0) {
+		nova_dbgv("FAILED: Inode %lu, start blk %lu, alloc %d data blocks from %lu to %lu\n",
+			  sih->ino, start_blk, allocated, *blocknr,
+			  *blocknr + allocated - 1);
+	} else {
+		nova_dbgv("Inode %lu, start blk %lu, alloc %d data blocks from %lu to %lu\n",
+			  sih->ino, start_blk, allocated, *blocknr,
+			  *blocknr + allocated - 1);
+	}
+	return allocated;
+}
+
+
+// Allocate log blocks.	 The offset for the allocated block comes back in
+// blocknr.  Return the number of blocks allocated.
+inline int nova_new_log_blocks(struct super_block *sb,
+			struct nova_inode_info_header *sih,
+			unsigned long *blocknr, unsigned int num,
+			enum nova_alloc_init zero, int cpu,
+			enum nova_alloc_direction from_tail)
+{
+	int allocated;
+	timing_t alloc_time;
+
+	NOVA_START_TIMING(new_log_blocks_t, alloc_time);
+	allocated = nova_new_blocks(sb, blocknr, num,
+			    sih->i_blk_type, zero, LOG, cpu, from_tail);
+	NOVA_END_TIMING(new_log_blocks_t, alloc_time);
+	if (allocated < 0) {
+		nova_dbgv("%s: ino %lu, failed to alloc %d log blocks",
+			  __func__, sih->ino, num);
+	} else {
+		nova_dbgv("%s: ino %lu, alloc %d of %d log blocks %lu to %lu\n",
+			  __func__, sih->ino, allocated, num, *blocknr,
+			  *blocknr + allocated - 1);
+	}
+	return allocated;
+}
+
+unsigned long nova_count_free_blocks(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	unsigned long num_free_blocks = 0;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+		num_free_blocks += free_list->num_free_blocks;
+	}
+
+	return num_free_blocks;
+}
+
+
diff --git a/fs/nova/balloc.h b/fs/nova/balloc.h
new file mode 100644
index 000000000000..ce7166a5bf37
--- /dev/null
+++ b/fs/nova/balloc.h
@@ -0,0 +1,118 @@
+#ifndef __BALLOC_H
+#define __BALLOC_H
+
+#include "inode.h"
+
+/* DRAM structure to hold a list of free PMEM blocks */
+struct free_list {
+	spinlock_t s_lock;
+	struct rb_root	block_free_tree;
+	struct nova_range_node *first_node; // lowest address free range
+	struct nova_range_node *last_node; // highest address free range
+
+	int		index; // Which CPU do I belong to?
+
+	/* Where are the data checksum blocks */
+	unsigned long	csum_start;
+	unsigned long	replica_csum_start;
+	unsigned long	num_csum_blocks;
+
+	/* Where are the data parity blocks */
+	unsigned long	parity_start;
+	unsigned long	replica_parity_start;
+	unsigned long	num_parity_blocks;
+
+	/* Start and end of allocatable range, inclusive. Excludes csum and
+	 * parity blocks.
+	 */
+	unsigned long	block_start;
+	unsigned long	block_end;
+
+	unsigned long	num_free_blocks;
+
+	/* How many nodes in the rb tree? */
+	unsigned long	num_blocknode;
+
+	u32		csum;		/* Protect integrity */
+
+	/* Statistics */
+	unsigned long	alloc_log_count;
+	unsigned long	alloc_data_count;
+	unsigned long	free_log_count;
+	unsigned long	free_data_count;
+	unsigned long	alloc_log_pages;
+	unsigned long	alloc_data_pages;
+	unsigned long	freed_log_pages;
+	unsigned long	freed_data_pages;
+
+	u64		padding[8];	/* Cache line break */
+};
+
+static inline
+struct free_list *nova_get_free_list(struct super_block *sb, int cpu)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	return &sbi->free_lists[cpu];
+}
+
+enum nova_alloc_direction {ALLOC_FROM_HEAD = 0,
+			   ALLOC_FROM_TAIL = 1};
+
+enum nova_alloc_init {ALLOC_NO_INIT = 0,
+		      ALLOC_INIT_ZERO = 1};
+
+enum alloc_type {
+	LOG = 1,
+	DATA,
+};
+
+
+
+
+int nova_alloc_block_free_lists(struct super_block *sb);
+void nova_delete_free_lists(struct super_block *sb);
+inline struct nova_range_node *nova_alloc_blocknode(struct super_block *sb);
+inline struct nova_range_node *nova_alloc_inode_node(struct super_block *sb);
+inline struct vma_item *nova_alloc_vma_item(struct super_block *sb);
+inline void nova_free_range_node(struct nova_range_node *node);
+inline void nova_free_snapshot_info(struct snapshot_info *info);
+inline void nova_free_blocknode(struct super_block *sb,
+	struct nova_range_node *bnode);
+inline void nova_free_inode_node(struct super_block *sb,
+	struct nova_range_node *bnode);
+inline void nova_free_vma_item(struct super_block *sb,
+	struct vma_item *item);
+extern void nova_init_blockmap(struct super_block *sb, int recovery);
+extern int nova_free_data_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr, int num);
+extern int nova_free_log_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr, int num);
+extern inline int nova_new_data_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long *blocknr,
+	unsigned long start_blk, unsigned int num,
+	enum nova_alloc_init zero, int cpu,
+	enum nova_alloc_direction from_tail);
+extern int nova_new_log_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	unsigned long *blocknr, unsigned int num,
+	enum nova_alloc_init zero, int cpu,
+	enum nova_alloc_direction from_tail);
+extern unsigned long nova_count_free_blocks(struct super_block *sb);
+inline int nova_search_inodetree(struct nova_sb_info *sbi,
+	unsigned long ino, struct nova_range_node **ret_node);
+inline int nova_insert_blocktree(struct nova_sb_info *sbi,
+	struct rb_root *tree, struct nova_range_node *new_node);
+inline int nova_insert_inodetree(struct nova_sb_info *sbi,
+	struct nova_range_node *new_node, int cpu);
+int nova_find_free_slot(struct nova_sb_info *sbi,
+	struct rb_root *tree, unsigned long range_low,
+	unsigned long range_high, struct nova_range_node **prev,
+	struct nova_range_node **next);
+
+extern int nova_insert_range_node(struct rb_root *tree,
+				  struct nova_range_node *new_node);
+extern int nova_find_range_node(struct nova_sb_info *sbi,
+				struct rb_root *tree, unsigned long range_low,
+				struct nova_range_node **ret_node);
+#endif


* [RFC 04/16] NOVA: Inode operations and structures
  2017-08-03  7:48 ` Steven Swanson
@ 2017-08-03  7:48   ` Steven Swanson
  -1 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson

Nova maintains per-CPU inode tables, and inode numbers are striped across the
tables (i.e., inos 0, n, 2n,... on cpu 0; inos 1, n + 1, 2n + 1, ... on cpu 1).
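
The striping can be expressed as a modulo and a division.  The following is a
small sketch under that reading of the description; struct ino_slot and
ino_to_slot() are hypothetical names for this example.

```c
#include <assert.h>

/* Hypothetical result type: which table an inode lives in, and where. */
struct ino_slot {
	unsigned long cpu;	/* index of the per-CPU inode table */
	unsigned long index;	/* position of the inode within that table */
};

/* Map an inode number onto the per-CPU tables: inos 0, n, 2n, ... on CPU 0. */
static struct ino_slot ino_to_slot(unsigned long ino, unsigned long cpus)
{
	struct ino_slot slot;

	assert(cpus > 0);
	slot.cpu = ino % cpus;
	slot.index = ino / cpus;

	return slot;
}
```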

The inodes themselves live in a set of linked lists (one per CPU) of 2MB
blocks.  The last 8 bytes of each block point to the next block.  Pointers to
the heads of these lists live in PMEM block INODE_TABLE0_START and are
replicated in PMEM block INODE_TABLE1_START.  Additional space for inodes is
allocated on demand.
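
Reading the link to the next block might look like the sketch below.  It
assumes the trailing 8 bytes hold a 64-bit value in host byte order (for
example a PMEM offset); next_block_link() is a hypothetical helper, and the
block size is passed in as a parameter so the example is not tied to 2MB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Return the 64-bit link stored in the last 8 bytes of an inode table block.
 * memcpy() avoids any alignment assumptions about the block buffer.
 */
static uint64_t next_block_link(const unsigned char *block, size_t block_size)
{
	uint64_t next;

	assert(block_size >= sizeof(next));
	memcpy(&next, block + block_size - sizeof(next), sizeof(next));

	return next;
}
```

In this sketch a link value of 0 would naturally mark the end of a list, but
that convention is an assumption and is not stated above.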

To allocate inodes, Nova maintains a per-CPU inuse_list in DRAM, which holds an
RB tree of ranges of unallocated inode numbers.

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 fs/nova/inode.c | 1467 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/inode.h |  389 +++++++++++++++
 2 files changed, 1856 insertions(+)
 create mode 100644 fs/nova/inode.c
 create mode 100644 fs/nova/inode.h

diff --git a/fs/nova/inode.c b/fs/nova/inode.c
new file mode 100644
index 000000000000..db001b7b5d4f
--- /dev/null
+++ b/fs/nova/inode.c
@@ -0,0 +1,1467 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Inode methods (allocate/free/read/write).
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/fs.h>
+#include <linux/aio.h>
+#include <linux/highuid.h>
+#include <linux/module.h>
+#include <linux/mpage.h>
+#include <linux/backing-dev.h>
+#include <linux/types.h>
+#include <linux/ratelimit.h>
+#include "nova.h"
+#include "inode.h"
+
+unsigned int blk_type_to_shift[NOVA_BLOCK_TYPE_MAX] = {12, 21, 30};
+uint32_t blk_type_to_size[NOVA_BLOCK_TYPE_MAX] = {0x1000, 0x200000, 0x40000000};
+
+int nova_init_inode_inuse_list(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_range_node *range_node;
+	struct inode_map *inode_map;
+	unsigned long range_high;
+	int i;
+	int ret;
+
+	sbi->s_inodes_used_count = NOVA_NORMAL_INODE_START;
+
+	range_high = NOVA_NORMAL_INODE_START / sbi->cpus;
+	if (NOVA_NORMAL_INODE_START % sbi->cpus)
+		range_high++;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		inode_map = &sbi->inode_maps[i];
+		range_node = nova_alloc_inode_node(sb);
+		if (range_node == NULL)
+			/* FIXME: free previously allocated memory */
+			return -ENOMEM;
+
+		range_node->range_low = 0;
+		range_node->range_high = range_high;
+		nova_update_range_node_checksum(range_node);
+		ret = nova_insert_inodetree(sbi, range_node, i);
+		if (ret) {
+			nova_err(sb, "%s failed\n", __func__);
+			nova_free_inode_node(sb, range_node);
+			return ret;
+		}
+		inode_map->num_range_node_inode = 1;
+		inode_map->first_inode_range = range_node;
+	}
+
+	return 0;
+}
+
+static int nova_alloc_inode_table(struct super_block *sb,
+	struct nova_inode_info_header *sih, int version)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct inode_table *inode_table;
+	unsigned long blocknr;
+	u64 block;
+	int allocated;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		inode_table = nova_get_inode_table(sb, version, i);
+		if (!inode_table)
+			return -EINVAL;
+
+		/* Allocate the replica inode table from the tail */
+		allocated = nova_new_log_blocks(sb, sih, &blocknr, 1,
+				ALLOC_INIT_ZERO, i,
+				version ? ALLOC_FROM_TAIL : ALLOC_FROM_HEAD);
+
+		nova_dbgv("%s: allocate log @ 0x%lx\n", __func__,
+							blocknr);
+		if (allocated != 1 || blocknr == 0)
+			return -ENOSPC;
+
+		block = nova_get_block_off(sb, blocknr, NOVA_BLOCK_TYPE_2M);
+		nova_memunlock_range(sb, inode_table, CACHELINE_SIZE);
+		inode_table->log_head = block;
+		nova_memlock_range(sb, inode_table, CACHELINE_SIZE);
+		nova_flush_buffer(inode_table, CACHELINE_SIZE, 0);
+	}
+
+	return 0;
+}
+
+int nova_init_inode_table(struct super_block *sb)
+{
+	struct nova_inode *pi = nova_get_inode_by_ino(sb, NOVA_INODETABLE_INO);
+	struct nova_inode_info_header sih;
+	int num_tables;
+	int ret = 0;
+	int i;
+
+	nova_memunlock_inode(sb, pi);
+	pi->i_mode = 0;
+	pi->i_uid = 0;
+	pi->i_gid = 0;
+	pi->i_links_count = cpu_to_le16(1);
+	pi->i_flags = 0;
+	pi->nova_ino = NOVA_INODETABLE_INO;
+
+	pi->i_blk_type = NOVA_BLOCK_TYPE_2M;
+	nova_memlock_inode(sb, pi);
+
+	sih.ino = NOVA_INODETABLE_INO;
+	sih.i_blk_type = NOVA_BLOCK_TYPE_2M;
+
+	num_tables = 1;
+	if (metadata_csum)
+		num_tables = 2;
+
+	for (i = 0; i < num_tables; i++) {
+		ret = nova_alloc_inode_table(sb, &sih, i);
+		if (ret)
+			return ret;
+	}
+
+	PERSISTENT_BARRIER();
+	return ret;
+}
+
+inline int nova_insert_inodetree(struct nova_sb_info *sbi,
+	struct nova_range_node *new_node, int cpu)
+{
+	struct rb_root *tree;
+	int ret;
+
+	tree = &sbi->inode_maps[cpu].inode_inuse_tree;
+	ret = nova_insert_range_node(tree, new_node);
+	if (ret)
+		nova_dbg("ERROR: %s failed %d\n", __func__, ret);
+
+	return ret;
+}
+
+inline int nova_search_inodetree(struct nova_sb_info *sbi,
+	unsigned long ino, struct nova_range_node **ret_node)
+{
+	struct rb_root *tree;
+	unsigned long internal_ino;
+	int cpu;
+
+	cpu = ino % sbi->cpus;
+	tree = &sbi->inode_maps[cpu].inode_inuse_tree;
+	internal_ino = ino / sbi->cpus;
+	return nova_find_range_node(sbi, tree, internal_ino, ret_node);
+}
+
+/* Get the address in PMEM of an inode by inode number.  Allocate an
+ * additional block to hold more inodes if necessary.
+ */
+int nova_get_inode_address(struct super_block *sb, u64 ino, int version,
+	u64 *pi_addr, int extendable, int extend_alternate)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info_header sih;
+	struct inode_table *inode_table;
+	unsigned int data_bits;
+	unsigned int num_inodes_bits;
+	u64 curr;
+	unsigned int superpage_count;
+	u64 alternate_pi_addr = 0;
+	u64 internal_ino;
+	int cpuid;
+	int extended = 0;
+	unsigned int index;
+	unsigned int i = 0;
+	unsigned long blocknr;
+	unsigned long curr_addr;
+	int allocated;
+
+	if (ino < NOVA_NORMAL_INODE_START) {
+		*pi_addr = nova_get_reserved_inode_addr(sb, ino);
+		return 0;
+	}
+
+	sih.ino = NOVA_INODETABLE_INO;
+	sih.i_blk_type = NOVA_BLOCK_TYPE_2M;
+	data_bits = blk_type_to_shift[sih.i_blk_type];
+	num_inodes_bits = data_bits - NOVA_INODE_BITS;
+
+	cpuid = ino % sbi->cpus;
+	internal_ino = ino / sbi->cpus;
+
+	inode_table = nova_get_inode_table(sb, version, cpuid);
+	superpage_count = internal_ino >> num_inodes_bits;
+	index = internal_ino & ((1 << num_inodes_bits) - 1);
+
+	curr = inode_table->log_head;
+	if (curr == 0)
+		return -EINVAL;
+
+	for (i = 0; i < superpage_count; i++) {
+		if (curr == 0)
+			return -EINVAL;
+
+		curr_addr = (unsigned long)nova_get_block(sb, curr);
+		/* Next page pointer in the last 8 bytes of the superpage */
+		curr_addr += nova_inode_blk_size(&sih) - 8;
+		curr = *(u64 *)(curr_addr);
+
+		if (curr == 0) {
+			if (extendable == 0)
+				return -EINVAL;
+
+			extended = 1;
+
+			allocated = nova_new_log_blocks(sb, &sih, &blocknr,
+				1, ALLOC_INIT_ZERO, cpuid,
+				version ? ALLOC_FROM_TAIL : ALLOC_FROM_HEAD);
+
+			if (allocated != 1)
+				return allocated;
+
+			curr = nova_get_block_off(sb, blocknr,
+						NOVA_BLOCK_TYPE_2M);
+			nova_memunlock_range(sb, (void *)curr_addr,
+						CACHELINE_SIZE);
+			*(u64 *)(curr_addr) = curr;
+			nova_memlock_range(sb, (void *)curr_addr,
+						CACHELINE_SIZE);
+			nova_flush_buffer((void *)curr_addr,
+						NOVA_INODE_SIZE, 1);
+		}
+	}
+
+	/* Extend alternate inode table */
+	if (extended && extend_alternate && metadata_csum)
+		nova_get_inode_address(sb, ino, version + 1,
+					&alternate_pi_addr, extendable, 0);
+
+	*pi_addr = curr + index * NOVA_INODE_SIZE;
+
+	return 0;
+}
+
+int nova_get_alter_inode_address(struct super_block *sb, u64 ino,
+	u64 *alter_pi_addr)
+{
+	int ret;
+
+	if (metadata_csum == 0) {
+		nova_err(sb, "Access alter inode when replica inode disabled\n");
+		return 0;
+	}
+
+	if (ino < NOVA_NORMAL_INODE_START) {
+		*alter_pi_addr = nova_get_alter_reserved_inode_addr(sb, ino);
+	} else {
+		ret = nova_get_inode_address(sb, ino, 1, alter_pi_addr, 0, 0);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+int nova_delete_file_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long start_blocknr,
+	unsigned long last_blocknr, bool delete_nvmm, bool delete_dead,
+	u64 epoch_id)
+{
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	struct nova_file_write_entry *old_entry = NULL;
+	unsigned long pgoff = start_blocknr;
+	unsigned long old_pgoff = 0;
+	unsigned int num_free = 0;
+	int freed = 0;
+	void *ret;
+	timing_t delete_time;
+
+	NOVA_START_TIMING(delete_file_tree_t, delete_time);
+
+	entryc = (metadata_csum == 0) ? entry : &entry_copy;
+
+	/* Handle EOF blocks */
+	do {
+		entry = radix_tree_lookup(&sih->tree, pgoff);
+		if (entry) {
+			ret = radix_tree_delete(&sih->tree, pgoff);
+			BUG_ON(!ret || ret != entry);
+			if (entry != old_entry) {
+				if (old_entry && delete_nvmm) {
+					nova_free_old_entry(sb, sih,
+							old_entry, old_pgoff,
+							num_free, delete_dead,
+							epoch_id);
+					freed += num_free;
+				}
+
+				old_entry = entry;
+				old_pgoff = pgoff;
+				num_free = 1;
+			} else {
+				num_free++;
+			}
+			pgoff++;
+		} else {
+			/* We found a hole. Jump to the next entry. */
+			entry = nova_find_next_entry(sb, sih, pgoff);
+			if (!entry)
+				break;
+
+			if (metadata_csum == 0)
+				entryc = entry;
+			else if (!nova_verify_entry_csum(sb, entry, entryc))
+				break;
+
+			pgoff++;
+			pgoff = pgoff > entryc->pgoff ? pgoff : entryc->pgoff;
+		}
+	} while (1);
+
+	if (old_entry && delete_nvmm) {
+		nova_free_old_entry(sb, sih, old_entry, old_pgoff,
+					num_free, delete_dead, epoch_id);
+		freed += num_free;
+	}
+
+	nova_dbgv("Inode %lu: delete file tree from pgoff %lu to %lu, %d blocks freed\n",
+			sih->ino, start_blocknr, last_blocknr, freed);
+
+	NOVA_END_TIMING(delete_file_tree_t, delete_time);
+	return freed;
+}
+
+static int nova_free_dram_resource(struct super_block *sb,
+	struct nova_inode_info_header *sih)
+{
+	unsigned long last_blocknr;
+	int freed = 0;
+
+	if (!(S_ISREG(sih->i_mode)) && !(S_ISDIR(sih->i_mode)))
+		return 0;
+
+	if (S_ISREG(sih->i_mode)) {
+		last_blocknr = nova_get_last_blocknr(sb, sih);
+		freed = nova_delete_file_tree(sb, sih, 0,
+					last_blocknr, false, false, 0);
+	} else {
+		nova_delete_dir_tree(sb, sih);
+		freed = 1;
+	}
+
+	return freed;
+}
+
+static inline void check_eof_blocks(struct super_block *sb,
+	struct nova_inode *pi, struct inode *inode,
+	struct nova_inode_info_header *sih)
+{
+	if ((pi->i_flags & cpu_to_le32(NOVA_EOFBLOCKS_FL)) &&
+		(inode->i_size + sb->s_blocksize) > (sih->i_blocks
+			<< sb->s_blocksize_bits)) {
+		nova_memunlock_inode(sb, pi);
+		pi->i_flags &= cpu_to_le32(~NOVA_EOFBLOCKS_FL);
+		nova_update_inode_checksum(pi);
+		nova_update_alter_inode(sb, inode, pi);
+		nova_memlock_inode(sb, pi);
+	}
+}
+
+/*
+ * Free data blocks from the inode in the byte range start to end
+ */
+static void nova_truncate_file_blocks(struct inode *inode, loff_t start,
+				    loff_t end, u64 epoch_id)
+{
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pi = nova_get_inode(sb, inode);
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	unsigned int data_bits = blk_type_to_shift[sih->i_blk_type];
+	unsigned long first_blocknr, last_blocknr;
+	int freed = 0;
+
+	inode->i_mtime = inode->i_ctime = current_time(inode);
+
+	nova_dbg_verbose("truncate: pi %p iblocks %lx %llx %llx %llx\n", pi,
+			 sih->i_blocks, start, end, pi->i_size);
+
+	first_blocknr = (start + (1UL << data_bits) - 1) >> data_bits;
+
+	if (end == 0)
+		return;
+	last_blocknr = (end - 1) >> data_bits;
+
+	if (first_blocknr > last_blocknr)
+		return;
+
+	freed = nova_delete_file_tree(sb, sih, first_blocknr,
+				last_blocknr, true, false, epoch_id);
+
+	inode->i_blocks -= (freed * (1 << (data_bits -
+				sb->s_blocksize_bits)));
+
+	sih->i_blocks = inode->i_blocks;
+	/* Check whether the EOFBLOCKS flag is still valid after the resize */
+	check_eof_blocks(sb, pi, inode, sih);
+
+}
+
+/* Search the radix tree to find a hole or data
+ * in the specified range.
+ * Input:
+ * first_blocknr: first block in the specified range
+ * last_blocknr: last block in the specified range
+ * @data_found: indicates whether data blocks were found
+ * @hole_found: indicates whether a hole was found
+ * hole: whether we are looking for a hole or data
+ */
+static int nova_lookup_hole_in_range(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	unsigned long first_blocknr, unsigned long last_blocknr,
+	int *data_found, int *hole_found, int hole)
+{
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	unsigned long blocks = 0;
+	unsigned long pgoff, old_pgoff;
+
+	entryc = (metadata_csum == 0) ? entry : &entry_copy;
+
+	pgoff = first_blocknr;
+	while (pgoff <= last_blocknr) {
+		old_pgoff = pgoff;
+		entry = radix_tree_lookup(&sih->tree, pgoff);
+		if (entry) {
+			*data_found = 1;
+			if (!hole)
+				goto done;
+			pgoff++;
+		} else {
+			*hole_found = 1;
+			entry = nova_find_next_entry(sb, sih, pgoff);
+			pgoff++;
+			if (entry) {
+				if (metadata_csum == 0)
+					entryc = entry;
+				else if (!nova_verify_entry_csum(sb, entry,
+								entryc))
+					goto done;
+
+				pgoff = pgoff > entryc->pgoff ?
+					pgoff : entryc->pgoff;
+				if (pgoff > last_blocknr)
+					pgoff = last_blocknr + 1;
+			}
+		}
+
+		if (!*hole_found || !hole)
+			blocks += pgoff - old_pgoff;
+	}
+done:
+	return blocks;
+}
+
+/* copy persistent state to struct inode */
+static int nova_read_inode(struct super_block *sb, struct inode *inode,
+	u64 pi_addr)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode *pi, fake_pi;
+	struct nova_inode_info_header *sih = &si->header;
+	int ret = -EIO;
+	unsigned long ino;
+
+	ret = nova_get_reference(sb, pi_addr, &fake_pi,
+			(void **)&pi, sizeof(struct nova_inode));
+	if (ret) {
+		nova_dbg("%s: read pi @ 0x%llx failed\n",
+				__func__, pi_addr);
+		goto bad_inode;
+	}
+
+	inode->i_mode = sih->i_mode;
+	i_uid_write(inode, le32_to_cpu(pi->i_uid));
+	i_gid_write(inode, le32_to_cpu(pi->i_gid));
+//	set_nlink(inode, le16_to_cpu(pi->i_links_count));
+	inode->i_generation = le32_to_cpu(pi->i_generation);
+	nova_set_inode_flags(inode, pi, le32_to_cpu(pi->i_flags));
+	ino = inode->i_ino;
+
+	/* check if the inode is active. */
+	if (inode->i_mode == 0 || pi->deleted == 1) {
+		/* this inode is deleted */
+		ret = -ESTALE;
+		goto bad_inode;
+	}
+
+	inode->i_blocks = sih->i_blocks;
+	inode->i_mapping->a_ops = &nova_aops_dax;
+
+	switch (inode->i_mode & S_IFMT) {
+	case S_IFREG:
+		inode->i_op = &nova_file_inode_operations;
+		if (inplace_data_updates && wprotect == 0)
+			inode->i_fop = &nova_dax_file_operations;
+		else
+			inode->i_fop = &nova_wrap_file_operations;
+		break;
+	case S_IFDIR:
+		inode->i_op = &nova_dir_inode_operations;
+		inode->i_fop = &nova_dir_operations;
+		break;
+	case S_IFLNK:
+		inode->i_op = &nova_symlink_inode_operations;
+		break;
+	default:
+		inode->i_op = &nova_special_inode_operations;
+		init_special_inode(inode, inode->i_mode,
+				   le32_to_cpu(pi->dev.rdev));
+		break;
+	}
+
+	/* Update size and time after rebuilding the tree */
+	inode->i_size = le64_to_cpu(sih->i_size);
+	inode->i_atime.tv_sec = (__s32)le32_to_cpu(pi->i_atime);
+	inode->i_ctime.tv_sec = (__s32)le32_to_cpu(pi->i_ctime);
+	inode->i_mtime.tv_sec = (__s32)le32_to_cpu(pi->i_mtime);
+	inode->i_atime.tv_nsec = inode->i_mtime.tv_nsec =
+					 inode->i_ctime.tv_nsec = 0;
+	set_nlink(inode, le16_to_cpu(pi->i_links_count));
+	return 0;
+
+bad_inode:
+	make_bad_inode(inode);
+	return ret;
+}
+
+static void nova_get_inode_flags(struct inode *inode, struct nova_inode *pi)
+{
+	unsigned int flags = inode->i_flags;
+	unsigned int nova_flags = le32_to_cpu(pi->i_flags);
+
+	nova_flags &= ~(FS_SYNC_FL | FS_APPEND_FL | FS_IMMUTABLE_FL |
+			 FS_NOATIME_FL | FS_DIRSYNC_FL);
+	if (flags & S_SYNC)
+		nova_flags |= FS_SYNC_FL;
+	if (flags & S_APPEND)
+		nova_flags |= FS_APPEND_FL;
+	if (flags & S_IMMUTABLE)
+		nova_flags |= FS_IMMUTABLE_FL;
+	if (flags & S_NOATIME)
+		nova_flags |= FS_NOATIME_FL;
+	if (flags & S_DIRSYNC)
+		nova_flags |= FS_DIRSYNC_FL;
+
+	pi->i_flags = cpu_to_le32(nova_flags);
+}
+
+static void nova_init_inode(struct inode *inode, struct nova_inode *pi)
+{
+	pi->i_mode = cpu_to_le16(inode->i_mode);
+	pi->i_uid = cpu_to_le32(i_uid_read(inode));
+	pi->i_gid = cpu_to_le32(i_gid_read(inode));
+	pi->i_links_count = cpu_to_le16(inode->i_nlink);
+	pi->i_size = cpu_to_le64(inode->i_size);
+	pi->i_atime = cpu_to_le32(inode->i_atime.tv_sec);
+	pi->i_ctime = cpu_to_le32(inode->i_ctime.tv_sec);
+	pi->i_mtime = cpu_to_le32(inode->i_mtime.tv_sec);
+	pi->i_generation = cpu_to_le32(inode->i_generation);
+	pi->log_head = 0;
+	pi->log_tail = 0;
+	pi->alter_log_head = 0;
+	pi->alter_log_tail = 0;
+	pi->deleted = 0;
+	pi->delete_epoch_id = 0;
+	nova_get_inode_flags(inode, pi);
+
+	if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode))
+		pi->dev.rdev = cpu_to_le32(inode->i_rdev);
+}
+
+static int nova_alloc_unused_inode(struct super_block *sb, int cpuid,
+	unsigned long *ino)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct inode_map *inode_map;
+	struct nova_range_node *i, *next_i;
+	struct rb_node *temp, *next;
+	unsigned long next_range_low;
+	unsigned long new_ino;
+	unsigned long MAX_INODE = 1UL << 31;
+
+	inode_map = &sbi->inode_maps[cpuid];
+	i = inode_map->first_inode_range;
+	NOVA_ASSERT(i);
+	if (!nova_range_node_checksum_ok(i)) {
+		nova_dbg("%s: first node failed\n", __func__);
+		return -EIO;
+	}
+
+	temp = &i->node;
+	next = rb_next(temp);
+
+	if (!next) {
+		next_i = NULL;
+		next_range_low = MAX_INODE;
+	} else {
+		next_i = container_of(next, struct nova_range_node, node);
+		if (!nova_range_node_checksum_ok(next_i)) {
+			nova_dbg("%s: second node failed\n", __func__);
+			return -EIO;
+		}
+		next_range_low = next_i->range_low;
+	}
+
+	new_ino = i->range_high + 1;
+
+	if (next_i && new_ino == (next_range_low - 1)) {
+		/* Fill the gap completely */
+		i->range_high = next_i->range_high;
+		nova_update_range_node_checksum(i);
+		rb_erase(&next_i->node, &inode_map->inode_inuse_tree);
+		nova_free_inode_node(sb, next_i);
+		inode_map->num_range_node_inode--;
+	} else if (new_ino < (next_range_low - 1)) {
+		/* Aligns to left */
+		i->range_high = new_ino;
+		nova_update_range_node_checksum(i);
+	} else {
+		nova_dbg("%s: ERROR: new ino %lu, next low %lu\n", __func__,
+			new_ino, next_range_low);
+		return -ENOSPC;
+	}
+
+	*ino = new_ino * sbi->cpus + cpuid;
+	sbi->s_inodes_used_count++;
+	inode_map->allocated++;
+
+	nova_dbg_verbose("Alloc ino %lu\n", *ino);
+	return 0;
+}
+
+static int nova_free_inuse_inode(struct super_block *sb, unsigned long ino)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct inode_map *inode_map;
+	struct nova_range_node *i = NULL;
+	struct nova_range_node *curr_node;
+	int found = 0;
+	int cpuid = ino % sbi->cpus;
+	unsigned long internal_ino = ino / sbi->cpus;
+	int ret = 0;
+
+	nova_dbg_verbose("Free inuse ino: %lu\n", ino);
+	inode_map = &sbi->inode_maps[cpuid];
+
+	mutex_lock(&inode_map->inode_table_mutex);
+	found = nova_search_inodetree(sbi, ino, &i);
+	if (!found) {
+		nova_dbg("%s ERROR: ino %lu not found\n", __func__, ino);
+		mutex_unlock(&inode_map->inode_table_mutex);
+		return -EINVAL;
+	}
+
+	if ((internal_ino == i->range_low) && (internal_ino == i->range_high)) {
+		/* fits entire node */
+		rb_erase(&i->node, &inode_map->inode_inuse_tree);
+		nova_free_inode_node(sb, i);
+		inode_map->num_range_node_inode--;
+		goto block_found;
+	}
+	if ((internal_ino == i->range_low) && (internal_ino < i->range_high)) {
+		/* Aligns left */
+		i->range_low = internal_ino + 1;
+		nova_update_range_node_checksum(i);
+		goto block_found;
+	}
+	if ((internal_ino > i->range_low) && (internal_ino == i->range_high)) {
+		/* Aligns right */
+		i->range_high = internal_ino - 1;
+		nova_update_range_node_checksum(i);
+		goto block_found;
+	}
+	if ((internal_ino > i->range_low) && (internal_ino < i->range_high)) {
+		/* Aligns somewhere in the middle */
+		curr_node = nova_alloc_inode_node(sb);
+		NOVA_ASSERT(curr_node);
+		if (curr_node == NULL) {
+			/* returning without freeing the block */
+			goto block_found;
+		}
+		curr_node->range_low = internal_ino + 1;
+		curr_node->range_high = i->range_high;
+		nova_update_range_node_checksum(curr_node);
+
+		i->range_high = internal_ino - 1;
+		nova_update_range_node_checksum(i);
+
+		ret = nova_insert_inodetree(sbi, curr_node, cpuid);
+		if (ret) {
+			nova_free_inode_node(sb, curr_node);
+			goto err;
+		}
+		inode_map->num_range_node_inode++;
+		goto block_found;
+	}
+
+err:
+	nova_error_mng(sb, "Unable to free inode %lu\n", ino);
+	nova_error_mng(sb, "Found inuse block %lu - %lu\n",
+				 i->range_low, i->range_high);
+	mutex_unlock(&inode_map->inode_table_mutex);
+	return ret;
+
+block_found:
+	sbi->s_inodes_used_count--;
+	inode_map->freed++;
+	mutex_unlock(&inode_map->inode_table_mutex);
+	return ret;
+}
+
+static int nova_free_inode(struct super_block *sb, struct nova_inode *pi,
+	struct nova_inode_info_header *sih)
+{
+	int err = 0;
+	timing_t free_time;
+
+	NOVA_START_TIMING(free_inode_t, free_time);
+
+	nova_free_inode_log(sb, pi, sih);
+
+	sih->log_pages = 0;
+	sih->i_mode = 0;
+	sih->pi_addr = 0;
+	sih->alter_pi_addr = 0;
+	sih->i_size = 0;
+	sih->i_blocks = 0;
+
+	err = nova_free_inuse_inode(sb, pi->nova_ino);
+
+	NOVA_END_TIMING(free_inode_t, free_time);
+	return err;
+}
+
+struct inode *nova_iget(struct super_block *sb, unsigned long ino)
+{
+	struct nova_inode_info *si;
+	struct inode *inode;
+	u64 pi_addr;
+	int err;
+
+	inode = iget_locked(sb, ino);
+	if (unlikely(!inode))
+		return ERR_PTR(-ENOMEM);
+	if (!(inode->i_state & I_NEW))
+		return inode;
+
+	si = NOVA_I(inode);
+
+	nova_dbgv("%s: inode %lu\n", __func__, ino);
+
+	err = nova_get_inode_address(sb, ino, 0, &pi_addr, 0, 0);
+	if (err) {
+		nova_dbg("%s: get inode %lu address failed %d\n",
+			 __func__, ino, err);
+		goto fail;
+	}
+
+	if (pi_addr == 0) {
+		nova_dbg("%s: failed to get pi_addr for inode %lu\n",
+			 __func__, ino);
+		err = -EACCES;
+		goto fail;
+	}
+
+	err = nova_rebuild_inode(sb, si, ino, pi_addr, 1);
+	if (err) {
+		nova_dbg("%s: failed to rebuild inode %lu\n", __func__, ino);
+		goto fail;
+	}
+
+	err = nova_read_inode(sb, inode, pi_addr);
+	if (unlikely(err)) {
+		nova_dbg("%s: failed to read inode %lu\n", __func__, ino);
+		goto fail;
+
+	}
+
+	inode->i_ino = ino;
+
+	unlock_new_inode(inode);
+	return inode;
+fail:
+	iget_failed(inode);
+	return ERR_PTR(err);
+}
+
+unsigned long nova_get_last_blocknr(struct super_block *sb,
+	struct nova_inode_info_header *sih)
+{
+	struct nova_inode *pi, fake_pi;
+	unsigned long last_blocknr;
+	unsigned int btype;
+	unsigned int data_bits;
+	int ret;
+
+	ret = nova_get_reference(sb, sih->pi_addr, &fake_pi,
+			(void **)&pi, sizeof(struct nova_inode));
+	if (ret) {
+		nova_dbg("%s: read pi @ 0x%lx failed\n",
+				__func__, sih->pi_addr);
+		btype = 0;
+	} else {
+		btype = sih->i_blk_type;
+	}
+
+	data_bits = blk_type_to_shift[btype];
+
+	if (sih->i_size == 0)
+		last_blocknr = 0;
+	else
+		last_blocknr = (sih->i_size - 1) >> data_bits;
+
+	return last_blocknr;
+}
+
+static int nova_free_inode_resource(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih)
+{
+	unsigned long last_blocknr;
+	int ret = 0;
+	int freed = 0;
+	struct nova_inode *alter_pi;
+
+	nova_memunlock_inode(sb, pi);
+	pi->deleted = 1;
+
+	if (pi->valid) {
+		nova_dbg("%s: inode %lu still valid\n",
+				__func__, sih->ino);
+		pi->valid = 0;
+	}
+	nova_update_inode_checksum(pi);
+	if (metadata_csum && sih->alter_pi_addr) {
+		alter_pi = (struct nova_inode *)nova_get_block(sb,
+						sih->alter_pi_addr);
+		memcpy_to_pmem_nocache(alter_pi, pi, sizeof(struct nova_inode));
+	}
+	nova_memlock_inode(sb, pi);
+
+	/* We need the log to free the blocks from the b-tree */
+	switch (sih->i_mode & S_IFMT) {
+	case S_IFREG:
+		last_blocknr = nova_get_last_blocknr(sb, sih);
+		nova_dbgv("%s: file ino %lu\n", __func__, sih->ino);
+		freed = nova_delete_file_tree(sb, sih, 0,
+					last_blocknr, true, true, 0);
+		break;
+	case S_IFDIR:
+		nova_dbgv("%s: dir ino %lu\n", __func__, sih->ino);
+		nova_delete_dir_tree(sb, sih);
+		break;
+	case S_IFLNK:
+		/* Log will be freed later */
+		nova_dbgv("%s: symlink ino %lu\n",
+				__func__, sih->ino);
+		freed = nova_delete_file_tree(sb, sih, 0, 0,
+						true, true, 0);
+		break;
+	default:
+		nova_dbgv("%s: special ino %lu\n",
+				__func__, sih->ino);
+		break;
+	}
+
+	nova_dbg_verbose("%s: Freed %d\n", __func__, freed);
+	/* Then we can free the inode */
+	ret = nova_free_inode(sb, pi, sih);
+	if (ret)
+		nova_err(sb, "%s: free inode %lu failed\n",
+				__func__, sih->ino);
+
+	return ret;
+}
+
+void nova_evict_inode(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pi = nova_get_inode(sb, inode);
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	timing_t evict_time;
+	int destroy = 0;
+	int ret;
+
+	NOVA_START_TIMING(evict_inode_t, evict_time);
+	if (!sih) {
+		nova_err(sb, "%s: ino %lu sih is NULL!\n",
+				__func__, inode->i_ino);
+		NOVA_ASSERT(0);
+		goto out;
+	}
+
+	// pi can be NULL if the file has already been deleted, but a handle
+	// remains.
+	if (pi && pi->nova_ino != inode->i_ino) {
+		nova_err(sb, "%s: inode %lu ino does not match: %llu\n",
+				__func__, inode->i_ino, pi->nova_ino);
+		nova_dbg("inode size %llu, pi addr 0x%lx, pi head 0x%llx, tail 0x%llx, mode %u\n",
+				inode->i_size, sih->pi_addr, sih->log_head,
+				sih->log_tail, pi->i_mode);
+		nova_dbg("sih: ino %lu, inode size %lu, mode %u, inode mode %u\n",
+				sih->ino, sih->i_size,
+				sih->i_mode, inode->i_mode);
+		nova_print_inode_log(sb, inode);
+	}
+
+	/* Check if this inode exists in at least one snapshot. */
+	if (pi && pi->valid == 0) {
+		ret = nova_append_inode_to_snapshot(sb, pi);
+		if (ret == 0)
+			goto out;
+	}
+
+	nova_dbg_verbose("%s: %lu\n", __func__, inode->i_ino);
+	if (!inode->i_nlink && !is_bad_inode(inode)) {
+		if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+			goto out;
+
+		if (pi) {
+			ret = nova_free_inode_resource(sb, pi, sih);
+			if (ret)
+				goto out;
+		}
+
+		destroy = 1;
+		pi = NULL; /* we no longer own the nova_inode */
+
+		inode->i_mtime = inode->i_ctime = current_time(inode);
+		inode->i_size = 0;
+	}
+out:
+	if (destroy == 0) {
+		nova_dbgv("%s: destroying %lu\n", __func__, inode->i_ino);
+		nova_free_dram_resource(sb, sih);
+	}
+	/* TODO: Since we don't use page-cache, do we really need the following
+	 * call?
+	 */
+	truncate_inode_pages(&inode->i_data, 0);
+
+	clear_inode(inode);
+	NOVA_END_TIMING(evict_inode_t, evict_time);
+}
+
+/* First rebuild the inode tree, then free the blocks */
+int nova_delete_dead_inode(struct super_block *sb, u64 ino)
+{
+	struct nova_inode_info si;
+	struct nova_inode_info_header *sih;
+	struct nova_inode *pi;
+	u64 pi_addr = 0;
+	int err;
+
+	if (ino < NOVA_NORMAL_INODE_START) {
+		nova_dbg("%s: invalid inode %llu\n", __func__, ino);
+		return -EINVAL;
+	}
+
+	err = nova_get_inode_address(sb, ino, 0, &pi_addr, 0, 0);
+	if (err) {
+		nova_dbg("%s: get inode %llu address failed %d\n",
+					__func__, ino, err);
+		return -EINVAL;
+	}
+
+	if (pi_addr == 0)
+		return -EACCES;
+
+	memset(&si, 0, sizeof(struct nova_inode_info));
+	err = nova_rebuild_inode(sb, &si, ino, pi_addr, 0);
+	if (err)
+		return err;
+
+	pi = (struct nova_inode *)nova_get_block(sb, pi_addr);
+	sih = &si.header;
+
+	nova_dbgv("Delete dead inode %lu, log head 0x%llx, tail 0x%llx\n",
+			sih->ino, sih->log_head, sih->log_tail);
+
+	return nova_free_inode_resource(sb, pi, sih);
+}
+
+/* Returns 0 on failure */
+u64 nova_new_nova_inode(struct super_block *sb, u64 *pi_addr)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct inode_map *inode_map;
+	unsigned long free_ino = 0;
+	int map_id;
+	u64 ino = 0;
+	int ret;
+	timing_t new_inode_time;
+
+	NOVA_START_TIMING(new_nova_inode_t, new_inode_time);
+	map_id = sbi->map_id;
+	sbi->map_id = (sbi->map_id + 1) % sbi->cpus;
+
+	inode_map = &sbi->inode_maps[map_id];
+
+	mutex_lock(&inode_map->inode_table_mutex);
+	ret = nova_alloc_unused_inode(sb, map_id, &free_ino);
+	if (ret) {
+		nova_dbg("%s: alloc inode number failed %d\n", __func__, ret);
+		mutex_unlock(&inode_map->inode_table_mutex);
+		return 0;
+	}
+
+	ret = nova_get_inode_address(sb, free_ino, 0, pi_addr, 1, 1);
+	if (ret) {
+		nova_dbg("%s: get inode address failed %d\n", __func__, ret);
+		mutex_unlock(&inode_map->inode_table_mutex);
+		return 0;
+	}
+
+	mutex_unlock(&inode_map->inode_table_mutex);
+
+	ino = free_ino;
+
+	NOVA_END_TIMING(new_nova_inode_t, new_inode_time);
+	return ino;
+}
+
+struct inode *nova_new_vfs_inode(enum nova_new_inode_type type,
+	struct inode *dir, u64 pi_addr, u64 ino, umode_t mode,
+	size_t size, dev_t rdev, const struct qstr *qstr, u64 epoch_id)
+{
+	struct super_block *sb;
+	struct nova_sb_info *sbi;
+	struct inode *inode;
+	struct nova_inode *diri = NULL;
+	struct nova_inode_info *si;
+	struct nova_inode_info_header *sih = NULL;
+	struct nova_inode *pi;
+	struct nova_inode *alter_pi;
+	int errval;
+	u64 alter_pi_addr = 0;
+	timing_t new_inode_time;
+
+	NOVA_START_TIMING(new_vfs_inode_t, new_inode_time);
+	sb = dir->i_sb;
+	sbi = (struct nova_sb_info *)sb->s_fs_info;
+	inode = new_inode(sb);
+	if (!inode) {
+		errval = -ENOMEM;
+		goto fail2;
+	}
+
+	inode_init_owner(inode, dir, mode);
+	inode->i_blocks = inode->i_size = 0;
+	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
+
+	inode->i_generation = atomic_add_return(1, &sbi->next_generation);
+	inode->i_size = size;
+
+	diri = nova_get_inode(sb, dir);
+	if (!diri) {
+		errval = -EACCES;
+		goto fail1;
+	}
+
+	if (metadata_csum) {
+		/* Get alternate inode address */
+		errval = nova_get_alter_inode_address(sb, ino, &alter_pi_addr);
+		if (errval)
+			goto fail1;
+	}
+
+	pi = (struct nova_inode *)nova_get_block(sb, pi_addr);
+	nova_dbg_verbose("%s: allocating inode %llu @ 0x%llx\n",
+					__func__, ino, pi_addr);
+
+	/* chosen inode is in ino */
+	inode->i_ino = ino;
+
+	switch (type) {
+	case TYPE_CREATE:
+		inode->i_op = &nova_file_inode_operations;
+		inode->i_mapping->a_ops = &nova_aops_dax;
+		if (inplace_data_updates && wprotect == 0)
+			inode->i_fop = &nova_dax_file_operations;
+		else
+			inode->i_fop = &nova_wrap_file_operations;
+		break;
+	case TYPE_MKNOD:
+		init_special_inode(inode, mode, rdev);
+		inode->i_op = &nova_special_inode_operations;
+		break;
+	case TYPE_SYMLINK:
+		inode->i_op = &nova_symlink_inode_operations;
+		inode->i_mapping->a_ops = &nova_aops_dax;
+		break;
+	case TYPE_MKDIR:
+		inode->i_op = &nova_dir_inode_operations;
+		inode->i_fop = &nova_dir_operations;
+		inode->i_mapping->a_ops = &nova_aops_dax;
+		set_nlink(inode, 2);
+		break;
+	default:
+		nova_dbg("Unknown new inode type %d\n", type);
+		break;
+	}
+
+	/*
+	 * Pi is part of the dir log so no transaction is needed,
+	 * but we need to flush to NVMM.
+	 */
+	nova_memunlock_inode(sb, pi);
+	pi->i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+	pi->i_flags = nova_mask_flags(mode, diri->i_flags);
+	pi->nova_ino = ino;
+	pi->i_create_time = current_time(inode).tv_sec;
+	pi->create_epoch_id = epoch_id;
+	nova_init_inode(inode, pi);
+
+	if (metadata_csum) {
+		alter_pi = (struct nova_inode *)nova_get_block(sb,
+								alter_pi_addr);
+		memcpy_to_pmem_nocache(alter_pi, pi, sizeof(struct nova_inode));
+	}
+
+	nova_memlock_inode(sb, pi);
+
+	si = NOVA_I(inode);
+	sih = &si->header;
+	nova_init_header(sb, sih, inode->i_mode);
+	sih->pi_addr = pi_addr;
+	sih->alter_pi_addr = alter_pi_addr;
+	sih->ino = ino;
+	sih->i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+
+	nova_set_inode_flags(inode, pi, le32_to_cpu(pi->i_flags));
+
+	if (insert_inode_locked(inode) < 0) {
+		nova_err(sb, "nova_new_inode failed ino %lx\n", inode->i_ino);
+		errval = -EINVAL;
+		goto fail1;
+	}
+
+	nova_flush_buffer(pi, NOVA_INODE_SIZE, 0);
+	NOVA_END_TIMING(new_vfs_inode_t, new_inode_time);
+	return inode;
+fail1:
+	make_bad_inode(inode);
+	iput(inode);
+fail2:
+	NOVA_END_TIMING(new_vfs_inode_t, new_inode_time);
+	return ERR_PTR(errval);
+}
+
+int nova_write_inode(struct inode *inode, struct writeback_control *wbc)
+{
+	/* write_inode should never be called because we always keep our inodes
+	 * clean. So let us know if write_inode ever gets called.
+	 */
+//	BUG();
+	return 0;
+}
+
+/*
+ * dirty_inode() is called from mark_inode_dirty_sync().
+ * Usually dirty_inode() should not be called, because NOVA always keeps its
+ * inodes clean. The only exception is touch_atime(), which calls dirty_inode()
+ * to update the i_atime field.
+ */
+void nova_dirty_inode(struct inode *inode, int flags)
+{
+	struct super_block *sb = inode->i_sb;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *pi, inode_copy;
+
+	if (sbi->mount_snapshot)
+		return;
+
+	pi = nova_get_block(sb, sih->pi_addr);
+
+	/* check the inode before updating to make sure all fields are good */
+	if (nova_check_inode_integrity(sb, sih->ino, sih->pi_addr,
+					sih->alter_pi_addr, &inode_copy, 0) < 0)
+		return;
+
+	/* only i_atime should have changed if at all.
+	 * we can do in-place atomic update
+	 */
+	nova_memunlock_inode(sb, pi);
+	pi->i_atime = cpu_to_le32(inode->i_atime.tv_sec);
+	nova_update_inode_checksum(pi);
+	nova_update_alter_inode(sb, inode, pi);
+	nova_memlock_inode(sb, pi);
+	/* Relax atime persistency */
+	nova_flush_buffer(&pi->i_atime, sizeof(pi->i_atime), 0);
+}
+
+static void nova_setsize(struct inode *inode, loff_t oldsize, loff_t newsize,
+	u64 epoch_id)
+{
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	timing_t setsize_time;
+
+	/* We only support truncating regular files */
+	if (!(S_ISREG(inode->i_mode))) {
+		nova_err(sb, "%s: wrong file mode %x\n", __func__, inode->i_mode);
+		return;
+	}
+
+	NOVA_START_TIMING(setsize_t, setsize_time);
+
+	inode_dio_wait(inode);
+
+	nova_dbgv("%s: inode %lu, old size %llu, new size %llu\n",
+		__func__, inode->i_ino, oldsize, newsize);
+
+	if (newsize != oldsize) {
+		nova_clear_last_page_tail(sb, inode, newsize);
+		i_size_write(inode, newsize);
+		sih->i_size = newsize;
+	}
+
+	/* FIXME: we should make sure that there is nobody reading the inode
+	 * before truncating it. Also we need to munmap the truncated range
+	 * from application address space, if mmapped.
+	 */
+	/* synchronize_rcu(); */
+
+	/* FIXME: Do we need to clear truncated DAX pages? */
+//	dax_truncate_page(inode, newsize, nova_dax_get_block);
+
+	truncate_pagecache(inode, newsize);
+	nova_truncate_file_blocks(inode, newsize, oldsize, epoch_id);
+	NOVA_END_TIMING(setsize_t, setsize_time);
+}
+
+int nova_getattr(const struct path *path, struct kstat *stat,
+		 u32 request_mask, unsigned int flags)
+{
+	struct inode *inode;
+
+	inode = path->dentry->d_inode;
+	generic_fillattr(inode, stat);
+	/* stat->blocks should be the number of 512B blocks */
+	stat->blocks = (inode->i_blocks << inode->i_sb->s_blocksize_bits) >> 9;
+	return 0;
+}
+
+int nova_notify_change(struct dentry *dentry, struct iattr *attr)
+{
+	struct inode *inode = dentry->d_inode;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pi = nova_get_inode(sb, inode);
+	int ret;
+	unsigned int ia_valid = attr->ia_valid, attr_mask;
+	loff_t oldsize = inode->i_size;
+	u64 epoch_id;
+	timing_t setattr_time;
+
+	NOVA_START_TIMING(setattr_t, setattr_time);
+	if (!pi) {
+		ret = -EACCES;
+		goto out;
+	}
+
+	ret = setattr_prepare(dentry, attr);
+	if (ret)
+		goto out;
+
+	/* Update inode with attr except for size */
+	setattr_copy(inode, attr);
+
+	epoch_id = nova_get_epoch_id(sb);
+
+	attr_mask = ATTR_MODE | ATTR_UID | ATTR_GID | ATTR_SIZE | ATTR_ATIME
+			| ATTR_MTIME | ATTR_CTIME;
+
+	ia_valid = ia_valid & attr_mask;
+
+	if (ia_valid == 0)
+		goto out;
+
+	ret = nova_handle_setattr_operation(sb, inode, pi, ia_valid,
+					attr, epoch_id);
+	if (ret)
+		goto out;
+
+	/* Only after log entry is committed, we can truncate size */
+	if ((ia_valid & ATTR_SIZE) && (attr->ia_size != oldsize ||
+			pi->i_flags & cpu_to_le32(NOVA_EOFBLOCKS_FL))) {
+//		nova_set_blocksize_hint(sb, inode, pi, attr->ia_size);
+
+		/* now we can freely truncate the inode */
+		nova_setsize(inode, oldsize, attr->ia_size, epoch_id);
+	}
+
+	sih->trans_id++;
+out:
+	NOVA_END_TIMING(setattr_t, setattr_time);
+	return ret;
+}
+
+void nova_set_inode_flags(struct inode *inode, struct nova_inode *pi,
+	unsigned int flags)
+{
+	inode->i_flags &=
+		~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC);
+	if (flags & FS_SYNC_FL)
+		inode->i_flags |= S_SYNC;
+	if (flags & FS_APPEND_FL)
+		inode->i_flags |= S_APPEND;
+	if (flags & FS_IMMUTABLE_FL)
+		inode->i_flags |= S_IMMUTABLE;
+	if (flags & FS_NOATIME_FL)
+		inode->i_flags |= S_NOATIME;
+	if (flags & FS_DIRSYNC_FL)
+		inode->i_flags |= S_DIRSYNC;
+	if (!pi->i_xattr)
+		inode_has_no_xattr(inode);
+	inode->i_flags |= S_DAX;
+}
+
+static int nova_legacy_get_blocks(struct inode *inode, sector_t iblock,
+	struct buffer_head *bh, int create)
+{
+	unsigned long max_blocks = bh->b_size >> inode->i_blkbits;
+	bool new = false, boundary = false;
+	u32 bno;
+	int ret;
+
+	ret = nova_dax_get_blocks(inode, iblock, max_blocks, &bno, &new,
+				&boundary, create, false);
+	if (ret <= 0)
+		return ret;
+
+	map_bh(bh, inode->i_sb, bno);
+	bh->b_size = ret << inode->i_blkbits;
+	return 0;
+}
+
+static ssize_t nova_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
+{
+	struct file *filp = iocb->ki_filp;
+	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = mapping->host;
+	ssize_t ret;
+	timing_t dio_time;
+
+	if (WARN_ON_ONCE(IS_DAX(inode)))
+		return -EIO;
+
+	NOVA_START_TIMING(direct_IO_t, dio_time);
+
+	ret = blockdev_direct_IO(iocb, inode, iter, nova_legacy_get_blocks);
+
+	NOVA_END_TIMING(direct_IO_t, dio_time);
+	return ret;
+}
+
+/*
+ * find the file offset for SEEK_DATA/SEEK_HOLE
+ */
+unsigned long nova_find_region(struct inode *inode, loff_t *offset, int hole)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	unsigned int data_bits = blk_type_to_shift[sih->i_blk_type];
+	unsigned long first_blocknr, last_blocknr;
+	unsigned long blocks = 0, offset_in_block;
+	int data_found = 0, hole_found = 0;
+
+	if (*offset >= inode->i_size)
+		return -ENXIO;
+
+	if (!inode->i_blocks || !sih->i_size) {
+		if (hole)
+			return inode->i_size;
+		else
+			return -ENXIO;
+	}
+
+	offset_in_block = *offset & ((1UL << data_bits) - 1);
+
+	first_blocknr = *offset >> data_bits;
+	last_blocknr = inode->i_size >> data_bits;
+
+	nova_dbg_verbose("find_region offset %llx, first_blocknr %lx, last_blocknr %lx hole %d\n",
+		  *offset, first_blocknr, last_blocknr, hole);
+
+	blocks = nova_lookup_hole_in_range(inode->i_sb, sih,
+		first_blocknr, last_blocknr, &data_found, &hole_found, hole);
+
+	/* Searching data but only hole found till the end */
+	if (!hole && !data_found && hole_found)
+		return -ENXIO;
+
+	if (data_found && !hole_found) {
+		/* Searching data but we are already into them */
+		if (hole)
+			/* Searching hole but only data found, go to the end */
+			*offset = inode->i_size;
+		return 0;
+	}
+
+	/* Searching for a hole, hole found, and the search started inside a hole */
+	if (hole && hole_found && !blocks) {
+		/* we found data after it */
+		if (!data_found)
+			/* last hole */
+			*offset = inode->i_size;
+		return 0;
+	}
+
+	if (offset_in_block) {
+		blocks--;
+		*offset += (blocks << data_bits) +
+			   ((1 << data_bits) - offset_in_block);
+	} else {
+		*offset += blocks << data_bits;
+	}
+
+	return 0;
+}
+
+static int nova_writepages(struct address_space *mapping,
+	struct writeback_control *wbc)
+{
+	int ret;
+	timing_t wp_time;
+
+	NOVA_START_TIMING(write_pages_t, wp_time);
+	ret = dax_writeback_mapping_range(mapping,
+			mapping->host->i_sb->s_bdev, wbc);
+	NOVA_END_TIMING(write_pages_t, wp_time);
+	return ret;
+}
+
+const struct address_space_operations nova_aops_dax = {
+	.writepages		= nova_writepages,
+	.direct_IO		= nova_direct_IO,
+	/*.dax_mem_protect	= nova_dax_mem_protect,*/
+};
diff --git a/fs/nova/inode.h b/fs/nova/inode.h
new file mode 100644
index 000000000000..5ad69335799c
--- /dev/null
+++ b/fs/nova/inode.h
@@ -0,0 +1,389 @@
+#ifndef __INODE_H
+#define __INODE_H
+
+struct nova_inode_info_header;
+struct nova_inode;
+
+#include "super.h"
+#include "log.h"
+
+enum nova_new_inode_type {
+	TYPE_CREATE = 0,
+	TYPE_MKNOD,
+	TYPE_SYMLINK,
+	TYPE_MKDIR
+};
+
+
+/*
+ * Structure of an inode in PMEM
+ * Keep the inode size to within 120 bytes: We use the last eight bytes
+ * as inode table tail pointer.
+ */
+struct nova_inode {
+
+	/* first 40 bytes */
+	u8	i_rsvd;		 /* reserved. used to be checksum */
+	u8	valid;		 /* Is this inode valid? */
+	u8	deleted;	 /* Is this inode deleted? */
+	u8	i_blk_type;	 /* data block size this inode uses */
+	__le32	i_flags;	 /* Inode flags */
+	__le64	i_size;		 /* Size of data in bytes */
+	__le32	i_ctime;	 /* Inode change time */
+	__le32	i_mtime;	 /* Data modification time */
+	__le32	i_atime;	 /* Access time */
+	__le16	i_mode;		 /* File mode */
+	__le16	i_links_count;	 /* Links count */
+
+	__le64	i_xattr;	 /* Extended attribute block */
+
+	/* second 40 bytes */
+	__le32	i_uid;		 /* Owner Uid */
+	__le32	i_gid;		 /* Group Id */
+	__le32	i_generation;	 /* File version (for NFS) */
+	__le32	i_create_time;	 /* Create time */
+	__le64	nova_ino;	 /* nova inode number */
+
+	__le64	log_head;	 /* Log head pointer */
+	__le64	log_tail;	 /* Log tail pointer */
+
+	/* last 40 bytes */
+	__le64	alter_log_head;	 /* Alternate log head pointer */
+	__le64	alter_log_tail;	 /* Alternate log tail pointer */
+
+	__le64	create_epoch_id; /* Transaction ID when create */
+	__le64	delete_epoch_id; /* Transaction ID when deleted */
+
+	struct {
+		__le32 rdev;	 /* major/minor # */
+	} dev;			 /* device inode */
+
+	__le32	csum;            /* CRC32 checksum */
+
+	/* Leave 8 bytes for inode table tail pointer */
+} __attribute((__packed__));
+
+/*
+ * Inode table.  It's a linked list of pages.
+ */
+struct inode_table {
+	__le64 log_head;
+};
+
+/*
+ * NOVA-specific inode state kept in DRAM
+ */
+struct nova_inode_info_header {
+	/* For files, tree holds a map from file offsets to
+	 * write log entries.
+	 *
+	 * For directories, tree holds a map from a hash of the file name to
+	 * dentry log entry.
+	 */
+	struct radix_tree_root tree;
+	struct rb_root vma_tree;	/* Write vmas */
+	struct list_head list;		/* SB list of mmap sih */
+	int num_vmas;
+	unsigned short i_mode;		/* Dir or file? */
+	unsigned long log_pages;	/* Num of log pages */
+	unsigned long i_size;
+	unsigned long i_blocks;
+	unsigned long ino;
+	unsigned long pi_addr;
+	unsigned long alter_pi_addr;
+	unsigned long valid_entries;	/* For thorough GC */
+	unsigned long num_entries;	/* For thorough GC */
+	u64 last_setattr;		/* Last setattr entry */
+	u64 last_link_change;		/* Last link change entry */
+	u64 last_dentry;		/* Last updated dentry */
+	u64 trans_id;			/* Transaction ID */
+	u64 log_head;			/* Log head pointer */
+	u64 log_tail;			/* Log tail pointer */
+	u64 alter_log_head;		/* Alternate log head pointer */
+	u64 alter_log_tail;		/* Alternate log tail pointer */
+	u8  i_blk_type;
+};
+
+/* For rebuild purposes, temporarily store pi information */
+struct nova_inode_rebuild {
+	u64	i_size;
+	u32	i_flags;	/* Inode flags */
+	u32	i_ctime;	/* Inode change time */
+	u32	i_mtime;	/* Data modification time */
+	u32	i_atime;	/* Access time */
+	u32	i_uid;		/* Owner Uid */
+	u32	i_gid;		/* Group Id */
+	u32	i_generation;	/* File version (for NFS) */
+	u16	i_links_count;	/* Links count */
+	u16	i_mode;		/* File mode */
+	u64	trans_id;
+};
+
+/*
+ * DRAM state for inodes
+ */
+struct nova_inode_info {
+	struct nova_inode_info_header header;
+	struct inode vfs_inode;
+};
+
+
+static inline struct nova_inode_info *NOVA_I(struct inode *inode)
+{
+	return container_of(inode, struct nova_inode_info, vfs_inode);
+}
+
+static inline struct nova_inode *nova_get_alter_inode(struct super_block *sb,
+	struct inode *inode)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode fake_pi;
+	void *addr;
+	int rc;
+
+	if (metadata_csum == 0)
+		return NULL;
+
+	addr = nova_get_block(sb, sih->alter_pi_addr);
+	rc = memcpy_mcsafe(&fake_pi, addr, sizeof(struct nova_inode));
+	if (rc)
+		return NULL;
+
+	return (struct nova_inode *)addr;
+}
+
+static inline int nova_update_alter_inode(struct super_block *sb,
+	struct inode *inode, struct nova_inode *pi)
+{
+	struct nova_inode *alter_pi;
+
+	if (metadata_csum == 0)
+		return 0;
+
+	alter_pi = nova_get_alter_inode(sb, inode);
+	if (!alter_pi)
+		return -EINVAL;
+
+	memcpy_to_pmem_nocache(alter_pi, pi, sizeof(struct nova_inode));
+	return 0;
+}
+
+
+static inline int nova_update_inode_checksum(struct nova_inode *pi)
+{
+	u32 crc = 0;
+
+	if (metadata_csum == 0)
+		return 0;
+
+	crc = nova_crc32c(~0, (__u8 *)pi,
+			(sizeof(struct nova_inode) - sizeof(__le32)));
+
+	pi->csum = cpu_to_le32(crc);
+	nova_flush_buffer(pi, sizeof(struct nova_inode), 1);
+	return 0;
+}
+
+static inline int nova_check_inode_checksum(struct nova_inode *pi)
+{
+	u32 crc = 0;
+
+	if (metadata_csum == 0)
+		return 0;
+
+	crc = nova_crc32c(~0, (__u8 *)pi,
+			(sizeof(struct nova_inode) - sizeof(__le32)));
+
+	if (pi->csum == cpu_to_le32(crc))
+		return 0;
+	else
+		return 1;
+}
+
+
+
+static inline void nova_update_tail(struct nova_inode *pi, u64 new_tail)
+{
+	timing_t update_time;
+
+	NOVA_START_TIMING(update_tail_t, update_time);
+
+	PERSISTENT_BARRIER();
+	pi->log_tail = new_tail;
+	nova_flush_buffer(&pi->log_tail, CACHELINE_SIZE, 1);
+
+	NOVA_END_TIMING(update_tail_t, update_time);
+}
+
+static inline void nova_update_alter_tail(struct nova_inode *pi, u64 new_tail)
+{
+	timing_t update_time;
+
+	if (metadata_csum == 0)
+		return;
+
+	NOVA_START_TIMING(update_tail_t, update_time);
+
+	PERSISTENT_BARRIER();
+	pi->alter_log_tail = new_tail;
+	nova_flush_buffer(&pi->alter_log_tail, CACHELINE_SIZE, 1);
+
+	NOVA_END_TIMING(update_tail_t, update_time);
+}
+
+
+
+/* Update inode tails and checksums */
+static inline void nova_update_inode(struct super_block *sb,
+	struct inode *inode, struct nova_inode *pi,
+	struct nova_inode_update *update, int update_alter)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+
+	sih->log_tail = update->tail;
+	sih->alter_log_tail = update->alter_tail;
+	nova_update_tail(pi, update->tail);
+	if (metadata_csum)
+		nova_update_alter_tail(pi, update->alter_tail);
+
+	nova_update_inode_checksum(pi);
+	if (inode && update_alter)
+		nova_update_alter_inode(sb, inode, pi);
+}
+
+
+static inline
+struct inode_table *nova_get_inode_table(struct super_block *sb,
+	int version, int cpu)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	int table_start;
+
+	if (cpu >= sbi->cpus)
+		return NULL;
+
+	if ((version & 0x1) == 0)
+		table_start = INODE_TABLE0_START;
+	else
+		table_start = INODE_TABLE1_START;
+
+	return (struct inode_table *)((char *)nova_get_block(sb,
+		NOVA_DEF_BLOCK_SIZE_4K * table_start) +
+		cpu * CACHELINE_SIZE);
+}
+
+static inline unsigned int
+nova_inode_blk_shift(struct nova_inode_info_header *sih)
+{
+	return blk_type_to_shift[sih->i_blk_type];
+}
+
+static inline uint32_t nova_inode_blk_size(struct nova_inode_info_header *sih)
+{
+	return blk_type_to_size[sih->i_blk_type];
+}
+
+static inline u64 nova_get_reserved_inode_addr(struct super_block *sb,
+	u64 inode_number)
+{
+	return (NOVA_DEF_BLOCK_SIZE_4K * RESERVE_INODE_START) +
+			inode_number * NOVA_INODE_SIZE;
+}
+
+static inline u64 nova_get_alter_reserved_inode_addr(struct super_block *sb,
+	u64 inode_number)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	return nova_get_addr_off(sbi, sbi->replica_reserved_inodes_addr) +
+			inode_number * NOVA_INODE_SIZE;
+}
+
+static inline struct nova_inode *nova_get_reserved_inode(struct super_block *sb,
+	u64 inode_number)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u64 addr;
+
+	addr = nova_get_reserved_inode_addr(sb, inode_number);
+
+	return (struct nova_inode *)(sbi->virt_addr + addr);
+}
+
+static inline struct nova_inode *
+nova_get_alter_reserved_inode(struct super_block *sb,
+	u64 inode_number)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u64 addr;
+
+	addr = nova_get_alter_reserved_inode_addr(sb, inode_number);
+
+	return (struct nova_inode *)(sbi->virt_addr + addr);
+}
+
+/* If this is part of a read-modify-write of the inode metadata,
+ * nova_memunlock_inode() before calling!
+ */
+static inline struct nova_inode *nova_get_inode_by_ino(struct super_block *sb,
+						  u64 ino)
+{
+	if (ino == 0 || ino >= NOVA_NORMAL_INODE_START)
+		return NULL;
+
+	return nova_get_reserved_inode(sb, ino);
+}
+
+static inline struct nova_inode *nova_get_inode(struct super_block *sb,
+	struct inode *inode)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode fake_pi;
+	void *addr;
+	int rc;
+
+	addr = nova_get_block(sb, sih->pi_addr);
+	rc = memcpy_mcsafe(&fake_pi, addr, sizeof(struct nova_inode));
+	if (rc)
+		return NULL;
+
+	return (struct nova_inode *)addr;
+}
+
+
+
+extern const struct address_space_operations nova_aops_dax;
+int nova_init_inode_inuse_list(struct super_block *sb);
+extern int nova_init_inode_table(struct super_block *sb);
+int nova_get_alter_inode_address(struct super_block *sb, u64 ino,
+	u64 *alter_pi_addr);
+unsigned long nova_get_last_blocknr(struct super_block *sb,
+	struct nova_inode_info_header *sih);
+int nova_get_inode_address(struct super_block *sb, u64 ino, int version,
+	u64 *pi_addr, int extendable, int extend_alternate);
+int nova_set_blocksize_hint(struct super_block *sb, struct inode *inode,
+	struct nova_inode *pi, loff_t new_size);
+extern struct inode *nova_iget(struct super_block *sb, unsigned long ino);
+extern void nova_evict_inode(struct inode *inode);
+extern int nova_write_inode(struct inode *inode, struct writeback_control *wbc);
+extern void nova_dirty_inode(struct inode *inode, int flags);
+extern int nova_notify_change(struct dentry *dentry, struct iattr *attr);
+extern int nova_getattr(const struct path *path, struct kstat *stat,
+			u32 request_mask, unsigned int flags);
+extern void nova_set_inode_flags(struct inode *inode, struct nova_inode *pi,
+	unsigned int flags);
+extern unsigned long nova_find_region(struct inode *inode, loff_t *offset,
+		int hole);
+int nova_delete_file_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long start_blocknr,
+	unsigned long last_blocknr, bool delete_nvmm,
+	bool delete_dead, u64 epoch_id);
+u64 nova_new_nova_inode(struct super_block *sb, u64 *pi_addr);
+extern struct inode *nova_new_vfs_inode(enum nova_new_inode_type,
+	struct inode *dir, u64 pi_addr, u64 ino, umode_t mode,
+	size_t size, dev_t rdev, const struct qstr *qstr, u64 epoch_id);
+
+#endif

* [RFC 04/16] NOVA: Inode operations and structures
@ 2017-08-03  7:48   ` Steven Swanson
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson, dan.j.williams

NOVA maintains per-CPU inode tables, and inode numbers are striped across the
tables (i.e., with n CPUs, inos 0, n, 2n, ... on cpu 0; inos 1, n + 1,
2n + 1, ... on cpu 1).
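
As a rough illustration of the striping rule (a user-space sketch, not code
from this patch; struct ino_slot, stripe_ino() and unstripe_ino() are names
made up for the example), the table and the position within it follow from a
modulo and a division, mirroring the "ino % sbi->cpus" and "ino / sbi->cpus"
steps in nova_get_inode_address():

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: where an inode number lands under striping. */
struct ino_slot {
	unsigned int cpu;	/* which per-CPU inode table */
	uint64_t internal_ino;	/* position within that table */
};

/* Split a global inode number into (table, position). */
static inline struct ino_slot stripe_ino(uint64_t ino, unsigned int cpus)
{
	struct ino_slot s;

	assert(cpus != 0);
	s.cpu = (unsigned int)(ino % cpus);
	s.internal_ino = ino / cpus;
	return s;
}

/* Inverse mapping: rebuild the global inode number. */
static inline uint64_t unstripe_ino(struct ino_slot s, unsigned int cpus)
{
	return s.internal_ino * cpus + s.cpu;
}
```

With four CPUs, for example, ino 5 maps to table 1, position 1, and
unstripe_ino() turns that pair back into 5.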

The inodes themselves live in a set of linked lists (one per CPU) of 2MB
blocks.  The last 8 bytes of each block point to the next block.  Pointers to
the heads of these lists live in PMEM block INODE_TABLE0_START and are
replicated in PMEM block INODE_TABLE1_START.  Additional space for inodes is
allocated on demand.
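
The block layout can be sketched in user space as follows.  This is only an
illustration under stated assumptions: inode slots are taken to be 128 bytes
(suggested by the "120 bytes plus an 8-byte tail pointer" comment in inode.h,
but the actual NOVA_INODE_SIZE is defined outside this patch), and
struct table_pos, locate() and next_ptr_off() are hypothetical names.

```c
#include <stdint.h>

#define BLOCK_SHIFT	21	/* 2MB inode-table blocks */
#define INODE_SHIFT	7	/* assumed: 128-byte inode slots */
#define SLOTS_SHIFT	(BLOCK_SHIFT - INODE_SHIFT)

/* Hypothetical result type for the sketch. */
struct table_pos {
	uint64_t hops;		/* blocks to walk from the list head */
	uint64_t byte_off;	/* offset of the inode inside that block */
};

/* Map a per-CPU position to (block in the list, offset in the block). */
static inline struct table_pos locate(uint64_t internal_ino)
{
	struct table_pos p;

	p.hops = internal_ino >> SLOTS_SHIFT;
	p.byte_off = (internal_ino & ((1ULL << SLOTS_SHIFT) - 1))
			<< INODE_SHIFT;
	return p;
}

/* Offset of the next-block pointer: the last 8 bytes of the block. */
static inline uint64_t next_ptr_off(void)
{
	return (1ULL << BLOCK_SHIFT) - 8;
}
```

Under these assumptions the last slot of a block starts at offset 2097024, so
its first 120 bytes end exactly where the next-block pointer begins
(offset 2097144), which is why the inode itself must fit in 120 bytes.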

To allocate inodes, NOVA maintains a per-CPU inuse_list in DRAM, which holds
an RB tree of ranges of allocated (in-use) inode numbers.
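
A minimal sketch of range-based allocation, assuming (as the name inuse_list
and nova_init_inode_inuse_list() suggest) that each range records per-CPU
slots already in use and that the first range starts at slot 0.  A flat
sorted array stands in for the RB tree, and struct range, struct inuse_list
and alloc_slot() are illustrative names, not the allocator in this series:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RANGES 8	/* arbitrary bound for this sketch */

/* One run of consecutive in-use slots, both ends inclusive. */
struct range {
	uint64_t low;
	uint64_t high;
};

/*
 * Sorted ranges with at least one free slot between neighbours.
 * A flat array stands in for the per-CPU RB tree.
 */
struct inuse_list {
	struct range r[MAX_RANGES];
	size_t n;
};

/*
 * Hand out the slot just past the first in-use range, then merge with
 * the following range if the gap between them has closed.
 * Returns 0 and stores the slot in *slot, or -1 if the list is empty.
 */
static inline int alloc_slot(struct inuse_list *l, uint64_t *slot)
{
	size_t i;

	assert(l->n <= MAX_RANGES);
	if (l->n == 0)
		return -1;

	*slot = l->r[0].high + 1;
	l->r[0].high = *slot;

	if (l->n > 1 && l->r[1].low == *slot + 1) {
		l->r[0].high = l->r[1].high;
		for (i = 1; i + 1 < l->n; i++)
			l->r[i] = l->r[i + 1];
		l->n--;
	}
	return 0;
}
```

Keeping ranges rather than individual numbers means a table with few holes
needs only a handful of nodes, and allocation touches just the first one.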

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 fs/nova/inode.c | 1467 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/inode.h |  389 +++++++++++++++
 2 files changed, 1856 insertions(+)
 create mode 100644 fs/nova/inode.c
 create mode 100644 fs/nova/inode.h

diff --git a/fs/nova/inode.c b/fs/nova/inode.c
new file mode 100644
index 000000000000..db001b7b5d4f
--- /dev/null
+++ b/fs/nova/inode.c
@@ -0,0 +1,1467 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Inode methods (allocate/free/read/write).
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/fs.h>
+#include <linux/aio.h>
+#include <linux/highuid.h>
+#include <linux/module.h>
+#include <linux/mpage.h>
+#include <linux/backing-dev.h>
+#include <linux/types.h>
+#include <linux/ratelimit.h>
+#include "nova.h"
+#include "inode.h"
+
+unsigned int blk_type_to_shift[NOVA_BLOCK_TYPE_MAX] = {12, 21, 30};
+uint32_t blk_type_to_size[NOVA_BLOCK_TYPE_MAX] = {0x1000, 0x200000, 0x40000000};
+
+int nova_init_inode_inuse_list(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_range_node *range_node;
+	struct inode_map *inode_map;
+	unsigned long range_high;
+	int i;
+	int ret;
+
+	sbi->s_inodes_used_count = NOVA_NORMAL_INODE_START;
+
+	range_high = NOVA_NORMAL_INODE_START / sbi->cpus;
+	if (NOVA_NORMAL_INODE_START % sbi->cpus)
+		range_high++;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		inode_map = &sbi->inode_maps[i];
+		range_node = nova_alloc_inode_node(sb);
+		if (range_node == NULL)
+			/* FIXME: free allocated memories */
+			return -ENOMEM;
+
+		range_node->range_low = 0;
+		range_node->range_high = range_high;
+		nova_update_range_node_checksum(range_node);
+		ret = nova_insert_inodetree(sbi, range_node, i);
+		if (ret) {
+			nova_err(sb, "%s failed\n", __func__);
+			nova_free_inode_node(sb, range_node);
+			return ret;
+		}
+		inode_map->num_range_node_inode = 1;
+		inode_map->first_inode_range = range_node;
+	}
+
+	return 0;
+}
+
+static int nova_alloc_inode_table(struct super_block *sb,
+	struct nova_inode_info_header *sih, int version)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct inode_table *inode_table;
+	unsigned long blocknr;
+	u64 block;
+	int allocated;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		inode_table = nova_get_inode_table(sb, version, i);
+		if (!inode_table)
+			return -EINVAL;
+
+		/* Allocate the replica inode table from the tail */
+		allocated = nova_new_log_blocks(sb, sih, &blocknr, 1,
+				ALLOC_INIT_ZERO, i,
+				version ? ALLOC_FROM_TAIL : ALLOC_FROM_HEAD);
+
+		nova_dbgv("%s: allocate log @ 0x%lx\n", __func__,
+							blocknr);
+		if (allocated != 1 || blocknr == 0)
+			return -ENOSPC;
+
+		block = nova_get_block_off(sb, blocknr, NOVA_BLOCK_TYPE_2M);
+		nova_memunlock_range(sb, inode_table, CACHELINE_SIZE);
+		inode_table->log_head = block;
+		nova_memlock_range(sb, inode_table, CACHELINE_SIZE);
+		nova_flush_buffer(inode_table, CACHELINE_SIZE, 0);
+	}
+
+	return 0;
+}
+
+int nova_init_inode_table(struct super_block *sb)
+{
+	struct nova_inode *pi = nova_get_inode_by_ino(sb, NOVA_INODETABLE_INO);
+	struct nova_inode_info_header sih;
+	int num_tables;
+	int ret = 0;
+	int i;
+
+	nova_memunlock_inode(sb, pi);
+	pi->i_mode = 0;
+	pi->i_uid = 0;
+	pi->i_gid = 0;
+	pi->i_links_count = cpu_to_le16(1);
+	pi->i_flags = 0;
+	pi->nova_ino = NOVA_INODETABLE_INO;
+
+	pi->i_blk_type = NOVA_BLOCK_TYPE_2M;
+	nova_memlock_inode(sb, pi);
+
+	sih.ino = NOVA_INODETABLE_INO;
+	sih.i_blk_type = NOVA_BLOCK_TYPE_2M;
+
+	num_tables = 1;
+	if (metadata_csum)
+		num_tables = 2;
+
+	for (i = 0; i < num_tables; i++) {
+		ret = nova_alloc_inode_table(sb, &sih, i);
+		if (ret)
+			return ret;
+	}
+
+	PERSISTENT_BARRIER();
+	return ret;
+}
+
+inline int nova_insert_inodetree(struct nova_sb_info *sbi,
+	struct nova_range_node *new_node, int cpu)
+{
+	struct rb_root *tree;
+	int ret;
+
+	tree = &sbi->inode_maps[cpu].inode_inuse_tree;
+	ret = nova_insert_range_node(tree, new_node);
+	if (ret)
+		nova_dbg("ERROR: %s failed %d\n", __func__, ret);
+
+	return ret;
+}
+
+inline int nova_search_inodetree(struct nova_sb_info *sbi,
+	unsigned long ino, struct nova_range_node **ret_node)
+{
+	struct rb_root *tree;
+	unsigned long internal_ino;
+	int cpu;
+
+	cpu = ino % sbi->cpus;
+	tree = &sbi->inode_maps[cpu].inode_inuse_tree;
+	internal_ino = ino / sbi->cpus;
+	return nova_find_range_node(sbi, tree, internal_ino, ret_node);
+}
+
+/* Get the address in PMEM of an inode by inode number.  Allocate additional
+ * block to store additional inodes if necessary.
+ */
+int nova_get_inode_address(struct super_block *sb, u64 ino, int version,
+	u64 *pi_addr, int extendable, int extend_alternate)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info_header sih;
+	struct inode_table *inode_table;
+	unsigned int data_bits;
+	unsigned int num_inodes_bits;
+	u64 curr;
+	unsigned int superpage_count;
+	u64 alternate_pi_addr = 0;
+	u64 internal_ino;
+	int cpuid;
+	int extended = 0;
+	unsigned int index;
+	unsigned int i = 0;
+	unsigned long blocknr;
+	unsigned long curr_addr;
+	int allocated;
+
+	if (ino < NOVA_NORMAL_INODE_START) {
+		*pi_addr = nova_get_reserved_inode_addr(sb, ino);
+		return 0;
+	}
+
+	sih.ino = NOVA_INODETABLE_INO;
+	sih.i_blk_type = NOVA_BLOCK_TYPE_2M;
+	data_bits = blk_type_to_shift[sih.i_blk_type];
+	num_inodes_bits = data_bits - NOVA_INODE_BITS;
+
+	cpuid = ino % sbi->cpus;
+	internal_ino = ino / sbi->cpus;
+
+	inode_table = nova_get_inode_table(sb, version, cpuid);
+	superpage_count = internal_ino >> num_inodes_bits;
+	index = internal_ino & ((1 << num_inodes_bits) - 1);
+
+	curr = inode_table->log_head;
+	if (curr == 0)
+		return -EINVAL;
+
+	for (i = 0; i < superpage_count; i++) {
+		if (curr == 0)
+			return -EINVAL;
+
+		curr_addr = (unsigned long)nova_get_block(sb, curr);
+		/* Next page pointer in the last 8 bytes of the superpage */
+		curr_addr += nova_inode_blk_size(&sih) - 8;
+		curr = *(u64 *)(curr_addr);
+
+		if (curr == 0) {
+			if (extendable == 0)
+				return -EINVAL;
+
+			extended = 1;
+
+			allocated = nova_new_log_blocks(sb, &sih, &blocknr,
+				1, ALLOC_INIT_ZERO, cpuid,
+				version ? ALLOC_FROM_TAIL : ALLOC_FROM_HEAD);
+
+			if (allocated != 1)
+				return allocated < 0 ? allocated : -ENOSPC;
+
+			curr = nova_get_block_off(sb, blocknr,
+						NOVA_BLOCK_TYPE_2M);
+			nova_memunlock_range(sb, (void *)curr_addr,
+						CACHELINE_SIZE);
+			*(u64 *)(curr_addr) = curr;
+			nova_memlock_range(sb, (void *)curr_addr,
+						CACHELINE_SIZE);
+			nova_flush_buffer((void *)curr_addr,
+						NOVA_INODE_SIZE, 1);
+		}
+	}
+
+	/* Extend alternate inode table */
+	if (extended && extend_alternate && metadata_csum)
+		nova_get_inode_address(sb, ino, version + 1,
+					&alternate_pi_addr, extendable, 0);
+
+	*pi_addr = curr + index * NOVA_INODE_SIZE;
+
+	return 0;
+}
+
+int nova_get_alter_inode_address(struct super_block *sb, u64 ino,
+	u64 *alter_pi_addr)
+{
+	int ret;
+
+	if (metadata_csum == 0) {
+		nova_err(sb, "Access alter inode when replica inode disabled\n");
+		return 0;
+	}
+
+	if (ino < NOVA_NORMAL_INODE_START) {
+		*alter_pi_addr = nova_get_alter_reserved_inode_addr(sb, ino);
+	} else {
+		ret = nova_get_inode_address(sb, ino, 1, alter_pi_addr, 0, 0);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+int nova_delete_file_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long start_blocknr,
+	unsigned long last_blocknr, bool delete_nvmm, bool delete_dead,
+	u64 epoch_id)
+{
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	struct nova_file_write_entry *old_entry = NULL;
+	unsigned long pgoff = start_blocknr;
+	unsigned long old_pgoff = 0;
+	unsigned int num_free = 0;
+	int freed = 0;
+	void *ret;
+	timing_t delete_time;
+
+	NOVA_START_TIMING(delete_file_tree_t, delete_time);
+
+	entryc = &entry_copy;	/* repointed at entry below if csums are off */
+
+	/* Handle EOF blocks */
+	do {
+		entry = radix_tree_lookup(&sih->tree, pgoff);
+		if (entry) {
+			ret = radix_tree_delete(&sih->tree, pgoff);
+			BUG_ON(!ret || ret != entry);
+			if (entry != old_entry) {
+				if (old_entry && delete_nvmm) {
+					nova_free_old_entry(sb, sih,
+							old_entry, old_pgoff,
+							num_free, delete_dead,
+							epoch_id);
+					freed += num_free;
+				}
+
+				old_entry = entry;
+				old_pgoff = pgoff;
+				num_free = 1;
+			} else {
+				num_free++;
+			}
+			pgoff++;
+		} else {
+			/* We are finding a hole. Jump to the next entry. */
+			entry = nova_find_next_entry(sb, sih, pgoff);
+			if (!entry)
+				break;
+
+			if (metadata_csum == 0)
+				entryc = entry;
+			else if (!nova_verify_entry_csum(sb, entry, entryc))
+				break;
+
+			pgoff++;
+			pgoff = pgoff > entryc->pgoff ? pgoff : entryc->pgoff;
+		}
+	} while (1);
+
+	if (old_entry && delete_nvmm) {
+		nova_free_old_entry(sb, sih, old_entry, old_pgoff,
+					num_free, delete_dead, epoch_id);
+		freed += num_free;
+	}
+
+	nova_dbgv("Inode %lu: delete file tree from pgoff %lu to %lu, %d blocks freed\n",
+			sih->ino, start_blocknr, last_blocknr, freed);
+
+	NOVA_END_TIMING(delete_file_tree_t, delete_time);
+	return freed;
+}
+
+static int nova_free_dram_resource(struct super_block *sb,
+	struct nova_inode_info_header *sih)
+{
+	unsigned long last_blocknr;
+	int freed = 0;
+
+	if (!(S_ISREG(sih->i_mode)) && !(S_ISDIR(sih->i_mode)))
+		return 0;
+
+	if (S_ISREG(sih->i_mode)) {
+		last_blocknr = nova_get_last_blocknr(sb, sih);
+		freed = nova_delete_file_tree(sb, sih, 0,
+					last_blocknr, false, false, 0);
+	} else {
+		nova_delete_dir_tree(sb, sih);
+		freed = 1;
+	}
+
+	return freed;
+}
+
+static inline void check_eof_blocks(struct super_block *sb,
+	struct nova_inode *pi, struct inode *inode,
+	struct nova_inode_info_header *sih)
+{
+	if ((pi->i_flags & cpu_to_le32(NOVA_EOFBLOCKS_FL)) &&
+		(inode->i_size + sb->s_blocksize) > (sih->i_blocks
+			<< sb->s_blocksize_bits)) {
+		nova_memunlock_inode(sb, pi);
+		pi->i_flags &= cpu_to_le32(~NOVA_EOFBLOCKS_FL);
+		nova_update_inode_checksum(pi);
+		nova_update_alter_inode(sb, inode, pi);
+		nova_memlock_inode(sb, pi);
+	}
+}
+
+/*
+ * Free data blocks from inode in the range start <=> end
+ */
+static void nova_truncate_file_blocks(struct inode *inode, loff_t start,
+				    loff_t end, u64 epoch_id)
+{
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pi = nova_get_inode(sb, inode);
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	unsigned int data_bits = blk_type_to_shift[sih->i_blk_type];
+	unsigned long first_blocknr, last_blocknr;
+	int freed = 0;
+
+	inode->i_mtime = inode->i_ctime = current_time(inode);
+
+	nova_dbg_verbose("truncate: pi %p iblocks %lx %llx %llx %llx\n", pi,
+			 sih->i_blocks, start, end, pi->i_size);
+
+	first_blocknr = (start + (1UL << data_bits) - 1) >> data_bits;
+
+	if (end == 0)
+		return;
+	last_blocknr = (end - 1) >> data_bits;
+
+	if (first_blocknr > last_blocknr)
+		return;
+
+	freed = nova_delete_file_tree(sb, sih, first_blocknr,
+				last_blocknr, true, false, epoch_id);
+
+	inode->i_blocks -= (freed * (1 << (data_bits -
+				sb->s_blocksize_bits)));
+
+	sih->i_blocks = inode->i_blocks;
+	/* Check whether the EOFBLOCKS flag is still valid after the resize */
+	check_eof_blocks(sb, pi, inode, sih);
+
+}
+
+/* Search the radix tree for a hole or data in the specified range.
+ *
+ * Input:
+ * first_blocknr: first block in the specified range
+ * last_blocknr: last block in the specified range
+ * data_found: set to 1 if data blocks were found
+ * hole_found: set to 1 if a hole was found
+ * hole: nonzero when searching for a hole, zero when searching for data
+ */
+static int nova_lookup_hole_in_range(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	unsigned long first_blocknr, unsigned long last_blocknr,
+	int *data_found, int *hole_found, int hole)
+{
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	unsigned long blocks = 0;
+	unsigned long pgoff, old_pgoff;
+
+	entryc = &entry_copy;
+
+	pgoff = first_blocknr;
+	while (pgoff <= last_blocknr) {
+		old_pgoff = pgoff;
+		entry = radix_tree_lookup(&sih->tree, pgoff);
+		if (entry) {
+			*data_found = 1;
+			if (!hole)
+				goto done;
+			pgoff++;
+		} else {
+			*hole_found = 1;
+			entry = nova_find_next_entry(sb, sih, pgoff);
+			pgoff++;
+			if (entry) {
+				if (metadata_csum == 0)
+					entryc = entry;
+				else if (!nova_verify_entry_csum(sb, entry,
+								entryc))
+					goto done;
+
+				pgoff = pgoff > entryc->pgoff ?
+					pgoff : entryc->pgoff;
+				if (pgoff > last_blocknr)
+					pgoff = last_blocknr + 1;
+			}
+		}
+
+		if (!*hole_found || !hole)
+			blocks += pgoff - old_pgoff;
+	}
+done:
+	return blocks;
+}
+
+/* copy persistent state to struct inode */
+static int nova_read_inode(struct super_block *sb, struct inode *inode,
+	u64 pi_addr)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode *pi, fake_pi;
+	struct nova_inode_info_header *sih = &si->header;
+	int ret = -EIO;
+	unsigned long ino;
+
+	ret = nova_get_reference(sb, pi_addr, &fake_pi,
+			(void **)&pi, sizeof(struct nova_inode));
+	if (ret) {
+		nova_dbg("%s: read pi @ 0x%llx failed\n",
+				__func__, pi_addr);
+		goto bad_inode;
+	}
+
+	inode->i_mode = sih->i_mode;
+	i_uid_write(inode, le32_to_cpu(pi->i_uid));
+	i_gid_write(inode, le32_to_cpu(pi->i_gid));
+//	set_nlink(inode, le16_to_cpu(pi->i_links_count));
+	inode->i_generation = le32_to_cpu(pi->i_generation);
+	nova_set_inode_flags(inode, pi, le32_to_cpu(pi->i_flags));
+	ino = inode->i_ino;
+
+	/* check if the inode is active. */
+	if (inode->i_mode == 0 || pi->deleted == 1) {
+		/* this inode is deleted */
+		ret = -ESTALE;
+		goto bad_inode;
+	}
+
+	inode->i_blocks = sih->i_blocks;
+	inode->i_mapping->a_ops = &nova_aops_dax;
+
+	switch (inode->i_mode & S_IFMT) {
+	case S_IFREG:
+		inode->i_op = &nova_file_inode_operations;
+		if (inplace_data_updates && wprotect == 0)
+			inode->i_fop = &nova_dax_file_operations;
+		else
+			inode->i_fop = &nova_wrap_file_operations;
+		break;
+	case S_IFDIR:
+		inode->i_op = &nova_dir_inode_operations;
+		inode->i_fop = &nova_dir_operations;
+		break;
+	case S_IFLNK:
+		inode->i_op = &nova_symlink_inode_operations;
+		break;
+	default:
+		inode->i_op = &nova_special_inode_operations;
+		init_special_inode(inode, inode->i_mode,
+				   le32_to_cpu(pi->dev.rdev));
+		break;
+	}
+
+	/* Update size and time after rebuilding the tree */
+	inode->i_size = sih->i_size;
+	inode->i_atime.tv_sec = (__s32)le32_to_cpu(pi->i_atime);
+	inode->i_ctime.tv_sec = (__s32)le32_to_cpu(pi->i_ctime);
+	inode->i_mtime.tv_sec = (__s32)le32_to_cpu(pi->i_mtime);
+	inode->i_atime.tv_nsec = inode->i_mtime.tv_nsec =
+					 inode->i_ctime.tv_nsec = 0;
+	set_nlink(inode, le16_to_cpu(pi->i_links_count));
+	return 0;
+
+bad_inode:
+	make_bad_inode(inode);
+	return ret;
+}
+
+static void nova_get_inode_flags(struct inode *inode, struct nova_inode *pi)
+{
+	unsigned int flags = inode->i_flags;
+	unsigned int nova_flags = le32_to_cpu(pi->i_flags);
+
+	nova_flags &= ~(FS_SYNC_FL | FS_APPEND_FL | FS_IMMUTABLE_FL |
+			 FS_NOATIME_FL | FS_DIRSYNC_FL);
+	if (flags & S_SYNC)
+		nova_flags |= FS_SYNC_FL;
+	if (flags & S_APPEND)
+		nova_flags |= FS_APPEND_FL;
+	if (flags & S_IMMUTABLE)
+		nova_flags |= FS_IMMUTABLE_FL;
+	if (flags & S_NOATIME)
+		nova_flags |= FS_NOATIME_FL;
+	if (flags & S_DIRSYNC)
+		nova_flags |= FS_DIRSYNC_FL;
+
+	pi->i_flags = cpu_to_le32(nova_flags);
+}
+
+static void nova_init_inode(struct inode *inode, struct nova_inode *pi)
+{
+	pi->i_mode = cpu_to_le16(inode->i_mode);
+	pi->i_uid = cpu_to_le32(i_uid_read(inode));
+	pi->i_gid = cpu_to_le32(i_gid_read(inode));
+	pi->i_links_count = cpu_to_le16(inode->i_nlink);
+	pi->i_size = cpu_to_le64(inode->i_size);
+	pi->i_atime = cpu_to_le32(inode->i_atime.tv_sec);
+	pi->i_ctime = cpu_to_le32(inode->i_ctime.tv_sec);
+	pi->i_mtime = cpu_to_le32(inode->i_mtime.tv_sec);
+	pi->i_generation = cpu_to_le32(inode->i_generation);
+	pi->log_head = 0;
+	pi->log_tail = 0;
+	pi->alter_log_head = 0;
+	pi->alter_log_tail = 0;
+	pi->deleted = 0;
+	pi->delete_epoch_id = 0;
+	nova_get_inode_flags(inode, pi);
+
+	if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode))
+		pi->dev.rdev = cpu_to_le32(inode->i_rdev);
+}
+
+static int nova_alloc_unused_inode(struct super_block *sb, int cpuid,
+	unsigned long *ino)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct inode_map *inode_map;
+	struct nova_range_node *i, *next_i;
+	struct rb_node *temp, *next;
+	unsigned long next_range_low;
+	unsigned long new_ino;
+	unsigned long MAX_INODE = 1UL << 31;
+
+	inode_map = &sbi->inode_maps[cpuid];
+	i = inode_map->first_inode_range;
+	NOVA_ASSERT(i);
+	if (!nova_range_node_checksum_ok(i)) {
+		nova_dbg("%s: first node failed\n", __func__);
+		return -EIO;
+	}
+
+	temp = &i->node;
+	next = rb_next(temp);
+
+	if (!next) {
+		next_i = NULL;
+		next_range_low = MAX_INODE;
+	} else {
+		next_i = container_of(next, struct nova_range_node, node);
+		if (!nova_range_node_checksum_ok(next_i)) {
+			nova_dbg("%s: second node failed\n", __func__);
+			return -EIO;
+		}
+		next_range_low = next_i->range_low;
+	}
+
+	new_ino = i->range_high + 1;
+
+	if (next_i && new_ino == (next_range_low - 1)) {
+		/* Fill the gap completely */
+		i->range_high = next_i->range_high;
+		nova_update_range_node_checksum(i);
+		rb_erase(&next_i->node, &inode_map->inode_inuse_tree);
+		nova_free_inode_node(sb, next_i);
+		inode_map->num_range_node_inode--;
+	} else if (new_ino < (next_range_low - 1)) {
+		/* Aligns to left */
+		i->range_high = new_ino;
+		nova_update_range_node_checksum(i);
+	} else {
+		nova_dbg("%s: ERROR: new ino %lu, next low %lu\n", __func__,
+			new_ino, next_range_low);
+		return -ENOSPC;
+	}
+
+	*ino = new_ino * sbi->cpus + cpuid;
+	sbi->s_inodes_used_count++;
+	inode_map->allocated++;
+
+	nova_dbg_verbose("Alloc ino %lu\n", *ino);
+	return 0;
+}
+
+static int nova_free_inuse_inode(struct super_block *sb, unsigned long ino)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct inode_map *inode_map;
+	struct nova_range_node *i = NULL;
+	struct nova_range_node *curr_node;
+	int found = 0;
+	int cpuid = ino % sbi->cpus;
+	unsigned long internal_ino = ino / sbi->cpus;
+	int ret = 0;
+
+	nova_dbg_verbose("Free inuse ino: %lu\n", ino);
+	inode_map = &sbi->inode_maps[cpuid];
+
+	mutex_lock(&inode_map->inode_table_mutex);
+	found = nova_search_inodetree(sbi, ino, &i);
+	if (!found) {
+		nova_dbg("%s ERROR: ino %lu not found\n", __func__, ino);
+		mutex_unlock(&inode_map->inode_table_mutex);
+		return -EINVAL;
+	}
+
+	if ((internal_ino == i->range_low) && (internal_ino == i->range_high)) {
+		/* fits entire node */
+		rb_erase(&i->node, &inode_map->inode_inuse_tree);
+		nova_free_inode_node(sb, i);
+		inode_map->num_range_node_inode--;
+		goto block_found;
+	}
+	if ((internal_ino == i->range_low) && (internal_ino < i->range_high)) {
+		/* Aligns left */
+		i->range_low = internal_ino + 1;
+		nova_update_range_node_checksum(i);
+		goto block_found;
+	}
+	if ((internal_ino > i->range_low) && (internal_ino == i->range_high)) {
+		/* Aligns right */
+		i->range_high = internal_ino - 1;
+		nova_update_range_node_checksum(i);
+		goto block_found;
+	}
+	if ((internal_ino > i->range_low) && (internal_ino < i->range_high)) {
+		/* Aligns somewhere in the middle */
+		curr_node = nova_alloc_inode_node(sb);
+		NOVA_ASSERT(curr_node);
+		if (curr_node == NULL) {
+			ret = -ENOMEM;
+			goto err;
+		}
+		curr_node->range_low = internal_ino + 1;
+		curr_node->range_high = i->range_high;
+		nova_update_range_node_checksum(curr_node);
+
+		i->range_high = internal_ino - 1;
+		nova_update_range_node_checksum(i);
+
+		ret = nova_insert_inodetree(sbi, curr_node, cpuid);
+		if (ret) {
+			nova_free_inode_node(sb, curr_node);
+			goto err;
+		}
+		inode_map->num_range_node_inode++;
+		goto block_found;
+	}
+
+err:
+	nova_error_mng(sb, "Unable to free inode %lu\n", ino);
+	nova_error_mng(sb, "Found inuse block %lu - %lu\n",
+				 i->range_low, i->range_high);
+	mutex_unlock(&inode_map->inode_table_mutex);
+	return ret;
+
+block_found:
+	sbi->s_inodes_used_count--;
+	inode_map->freed++;
+	mutex_unlock(&inode_map->inode_table_mutex);
+	return ret;
+}
+
+static int nova_free_inode(struct super_block *sb, struct nova_inode *pi,
+	struct nova_inode_info_header *sih)
+{
+	int err = 0;
+	timing_t free_time;
+
+	NOVA_START_TIMING(free_inode_t, free_time);
+
+	nova_free_inode_log(sb, pi, sih);
+
+	sih->log_pages = 0;
+	sih->i_mode = 0;
+	sih->pi_addr = 0;
+	sih->alter_pi_addr = 0;
+	sih->i_size = 0;
+	sih->i_blocks = 0;
+
+	err = nova_free_inuse_inode(sb, pi->nova_ino);
+
+	NOVA_END_TIMING(free_inode_t, free_time);
+	return err;
+}
+
+struct inode *nova_iget(struct super_block *sb, unsigned long ino)
+{
+	struct nova_inode_info *si;
+	struct inode *inode;
+	u64 pi_addr;
+	int err;
+
+	inode = iget_locked(sb, ino);
+	if (unlikely(!inode))
+		return ERR_PTR(-ENOMEM);
+	if (!(inode->i_state & I_NEW))
+		return inode;
+
+	si = NOVA_I(inode);
+
+	nova_dbgv("%s: inode %lu\n", __func__, ino);
+
+	err = nova_get_inode_address(sb, ino, 0, &pi_addr, 0, 0);
+	if (err) {
+		nova_dbg("%s: get inode %lu address failed %d\n",
+			 __func__, ino, err);
+		goto fail;
+	}
+
+	if (pi_addr == 0) {
+		nova_dbg("%s: failed to get pi_addr for inode %lu\n",
+			 __func__, ino);
+		err = -EACCES;
+		goto fail;
+	}
+
+	err = nova_rebuild_inode(sb, si, ino, pi_addr, 1);
+	if (err) {
+		nova_dbg("%s: failed to rebuild inode %lu\n", __func__, ino);
+		goto fail;
+	}
+
+	err = nova_read_inode(sb, inode, pi_addr);
+	if (unlikely(err)) {
+		nova_dbg("%s: failed to read inode %lu\n", __func__, ino);
+		goto fail;
+
+	}
+
+	inode->i_ino = ino;
+
+	unlock_new_inode(inode);
+	return inode;
+fail:
+	iget_failed(inode);
+	return ERR_PTR(err);
+}
+
+unsigned long nova_get_last_blocknr(struct super_block *sb,
+	struct nova_inode_info_header *sih)
+{
+	struct nova_inode *pi, fake_pi;
+	unsigned long last_blocknr;
+	unsigned int btype;
+	unsigned int data_bits;
+	int ret;
+
+	ret = nova_get_reference(sb, sih->pi_addr, &fake_pi,
+			(void **)&pi, sizeof(struct nova_inode));
+	if (ret) {
+		nova_dbg("%s: read pi @ 0x%lx failed\n",
+				__func__, sih->pi_addr);
+		btype = 0;
+	} else {
+		btype = sih->i_blk_type;
+	}
+
+	data_bits = blk_type_to_shift[btype];
+
+	if (sih->i_size == 0)
+		last_blocknr = 0;
+	else
+		last_blocknr = (sih->i_size - 1) >> data_bits;
+
+	return last_blocknr;
+}
+
+static int nova_free_inode_resource(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih)
+{
+	unsigned long last_blocknr;
+	int ret = 0;
+	int freed = 0;
+	struct nova_inode *alter_pi;
+
+	nova_memunlock_inode(sb, pi);
+	pi->deleted = 1;
+
+	if (pi->valid) {
+		nova_dbg("%s: inode %lu still valid\n",
+				__func__, sih->ino);
+		pi->valid = 0;
+	}
+	nova_update_inode_checksum(pi);
+	if (metadata_csum && sih->alter_pi_addr) {
+		alter_pi = (struct nova_inode *)nova_get_block(sb,
+						sih->alter_pi_addr);
+		memcpy_to_pmem_nocache(alter_pi, pi, sizeof(struct nova_inode));
+	}
+	nova_memlock_inode(sb, pi);
+
+	/* We need the log to free the blocks from the radix tree */
+	switch (sih->i_mode & S_IFMT) {
+	case S_IFREG:
+		last_blocknr = nova_get_last_blocknr(sb, sih);
+		nova_dbgv("%s: file ino %lu\n", __func__, sih->ino);
+		freed = nova_delete_file_tree(sb, sih, 0,
+					last_blocknr, true, true, 0);
+		break;
+	case S_IFDIR:
+		nova_dbgv("%s: dir ino %lu\n", __func__, sih->ino);
+		nova_delete_dir_tree(sb, sih);
+		break;
+	case S_IFLNK:
+		/* Log will be freed later */
+		nova_dbgv("%s: symlink ino %lu\n",
+				__func__, sih->ino);
+		freed = nova_delete_file_tree(sb, sih, 0, 0,
+						true, true, 0);
+		break;
+	default:
+		nova_dbgv("%s: special ino %lu\n",
+				__func__, sih->ino);
+		break;
+	}
+
+	nova_dbg_verbose("%s: Freed %d\n", __func__, freed);
+	/* Then we can free the inode */
+	ret = nova_free_inode(sb, pi, sih);
+	if (ret)
+		nova_err(sb, "%s: free inode %lu failed\n",
+				__func__, sih->ino);
+
+	return ret;
+}
+
+void nova_evict_inode(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pi = nova_get_inode(sb, inode);
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	timing_t evict_time;
+	int destroy = 0;
+	int ret;
+
+	NOVA_START_TIMING(evict_inode_t, evict_time);
+	if (!sih) {
+		nova_err(sb, "%s: ino %lu sih is NULL!\n",
+				__func__, inode->i_ino);
+		NOVA_ASSERT(0);
+		goto out;
+	}
+
+	// pi can be NULL if the file has already been deleted, but a handle
+	// remains.
+	if (pi && pi->nova_ino != inode->i_ino) {
+		nova_err(sb, "%s: inode %lu ino does not match: %llu\n",
+				__func__, inode->i_ino, pi->nova_ino);
+		nova_dbg("inode size %llu, pi addr 0x%lx, pi head 0x%llx, tail 0x%llx, mode %u\n",
+				inode->i_size, sih->pi_addr, sih->log_head,
+				sih->log_tail, pi->i_mode);
+		nova_dbg("sih: ino %lu, inode size %lu, mode %u, inode mode %u\n",
+				sih->ino, sih->i_size,
+				sih->i_mode, inode->i_mode);
+		nova_print_inode_log(sb, inode);
+	}
+
+	/* Check if this inode exists in at least one snapshot. */
+	if (pi && pi->valid == 0) {
+		ret = nova_append_inode_to_snapshot(sb, pi);
+		if (ret == 0)
+			goto out;
+	}
+
+	nova_dbg_verbose("%s: %lu\n", __func__, inode->i_ino);
+	if (!inode->i_nlink && !is_bad_inode(inode)) {
+		if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+			goto out;
+
+		if (pi) {
+			ret = nova_free_inode_resource(sb, pi, sih);
+			if (ret)
+				goto out;
+		}
+
+		destroy = 1;
+		pi = NULL; /* we no longer own the nova_inode */
+
+		inode->i_mtime = inode->i_ctime = current_time(inode);
+		inode->i_size = 0;
+	}
+out:
+	if (destroy == 0) {
+		nova_dbgv("%s: destroying %lu\n", __func__, inode->i_ino);
+		nova_free_dram_resource(sb, sih);
+	}
+	/* TODO: Since we don't use page-cache, do we really need the following
+	 * call?
+	 */
+	truncate_inode_pages(&inode->i_data, 0);
+
+	clear_inode(inode);
+	NOVA_END_TIMING(evict_inode_t, evict_time);
+}
+
+/* First rebuild the inode tree, then free the blocks */
+int nova_delete_dead_inode(struct super_block *sb, u64 ino)
+{
+	struct nova_inode_info si;
+	struct nova_inode_info_header *sih;
+	struct nova_inode *pi;
+	u64 pi_addr = 0;
+	int err;
+
+	if (ino < NOVA_NORMAL_INODE_START) {
+		nova_dbg("%s: invalid inode %llu\n", __func__, ino);
+		return -EINVAL;
+	}
+
+	err = nova_get_inode_address(sb, ino, 0, &pi_addr, 0, 0);
+	if (err) {
+		nova_dbg("%s: get inode %llu address failed %d\n",
+					__func__, ino, err);
+		return -EINVAL;
+	}
+
+	if (pi_addr == 0)
+		return -EACCES;
+
+	memset(&si, 0, sizeof(struct nova_inode_info));
+	err = nova_rebuild_inode(sb, &si, ino, pi_addr, 0);
+	if (err)
+		return err;
+
+	pi = (struct nova_inode *)nova_get_block(sb, pi_addr);
+	sih = &si.header;
+
+	nova_dbgv("Delete dead inode %lu, log head 0x%llx, tail 0x%llx\n",
+			sih->ino, sih->log_head, sih->log_tail);
+
+	return nova_free_inode_resource(sb, pi, sih);
+}
+
+/* Returns 0 on failure */
+u64 nova_new_nova_inode(struct super_block *sb, u64 *pi_addr)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct inode_map *inode_map;
+	unsigned long free_ino = 0;
+	int map_id;
+	u64 ino = 0;
+	int ret;
+	timing_t new_inode_time;
+
+	NOVA_START_TIMING(new_nova_inode_t, new_inode_time);
+	map_id = sbi->map_id;
+	sbi->map_id = (sbi->map_id + 1) % sbi->cpus;
+
+	inode_map = &sbi->inode_maps[map_id];
+
+	mutex_lock(&inode_map->inode_table_mutex);
+	ret = nova_alloc_unused_inode(sb, map_id, &free_ino);
+	if (ret) {
+		nova_dbg("%s: alloc inode number failed %d\n", __func__, ret);
+		mutex_unlock(&inode_map->inode_table_mutex);
+		return 0;
+	}
+
+	ret = nova_get_inode_address(sb, free_ino, 0, pi_addr, 1, 1);
+	if (ret) {
+		nova_dbg("%s: get inode address failed %d\n", __func__, ret);
+		mutex_unlock(&inode_map->inode_table_mutex);
+		return 0;
+	}
+
+	mutex_unlock(&inode_map->inode_table_mutex);
+
+	ino = free_ino;
+
+	NOVA_END_TIMING(new_nova_inode_t, new_inode_time);
+	return ino;
+}
+
+struct inode *nova_new_vfs_inode(enum nova_new_inode_type type,
+	struct inode *dir, u64 pi_addr, u64 ino, umode_t mode,
+	size_t size, dev_t rdev, const struct qstr *qstr, u64 epoch_id)
+{
+	struct super_block *sb;
+	struct nova_sb_info *sbi;
+	struct inode *inode;
+	struct nova_inode *diri = NULL;
+	struct nova_inode_info *si;
+	struct nova_inode_info_header *sih = NULL;
+	struct nova_inode *pi;
+	struct nova_inode *alter_pi;
+	int errval;
+	u64 alter_pi_addr = 0;
+	timing_t new_inode_time;
+
+	NOVA_START_TIMING(new_vfs_inode_t, new_inode_time);
+	sb = dir->i_sb;
+	sbi = (struct nova_sb_info *)sb->s_fs_info;
+	inode = new_inode(sb);
+	if (!inode) {
+		errval = -ENOMEM;
+		goto fail2;
+	}
+
+	inode_init_owner(inode, dir, mode);
+	inode->i_blocks = inode->i_size = 0;
+	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
+
+	inode->i_generation = atomic_add_return(1, &sbi->next_generation);
+	inode->i_size = size;
+
+	diri = nova_get_inode(sb, dir);
+	if (!diri) {
+		errval = -EACCES;
+		goto fail1;
+	}
+
+	if (metadata_csum) {
+		/* Get alternate inode address */
+		errval = nova_get_alter_inode_address(sb, ino, &alter_pi_addr);
+		if (errval)
+			goto fail1;
+	}
+
+	pi = (struct nova_inode *)nova_get_block(sb, pi_addr);
+	nova_dbg_verbose("%s: allocating inode %llu @ 0x%llx\n",
+					__func__, ino, pi_addr);
+
+	/* chosen inode is in ino */
+	inode->i_ino = ino;
+
+	switch (type) {
+	case TYPE_CREATE:
+		inode->i_op = &nova_file_inode_operations;
+		inode->i_mapping->a_ops = &nova_aops_dax;
+		if (inplace_data_updates && wprotect == 0)
+			inode->i_fop = &nova_dax_file_operations;
+		else
+			inode->i_fop = &nova_wrap_file_operations;
+		break;
+	case TYPE_MKNOD:
+		init_special_inode(inode, mode, rdev);
+		inode->i_op = &nova_special_inode_operations;
+		break;
+	case TYPE_SYMLINK:
+		inode->i_op = &nova_symlink_inode_operations;
+		inode->i_mapping->a_ops = &nova_aops_dax;
+		break;
+	case TYPE_MKDIR:
+		inode->i_op = &nova_dir_inode_operations;
+		inode->i_fop = &nova_dir_operations;
+		inode->i_mapping->a_ops = &nova_aops_dax;
+		set_nlink(inode, 2);
+		break;
+	default:
+		nova_dbg("Unknown new inode type %d\n", type);
+		break;
+	}
+
+	/*
+	 * Pi is part of the dir log so no transaction is needed,
+	 * but we need to flush to NVMM.
+	 */
+	nova_memunlock_inode(sb, pi);
+	pi->i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+	pi->i_flags = nova_mask_flags(mode, diri->i_flags);
+	pi->nova_ino = ino;
+	pi->i_create_time = current_time(inode).tv_sec;
+	pi->create_epoch_id = epoch_id;
+	nova_init_inode(inode, pi);
+
+	if (metadata_csum) {
+		alter_pi = (struct nova_inode *)nova_get_block(sb,
+								alter_pi_addr);
+		memcpy_to_pmem_nocache(alter_pi, pi, sizeof(struct nova_inode));
+	}
+
+	nova_memlock_inode(sb, pi);
+
+	si = NOVA_I(inode);
+	sih = &si->header;
+	nova_init_header(sb, sih, inode->i_mode);
+	sih->pi_addr = pi_addr;
+	sih->alter_pi_addr = alter_pi_addr;
+	sih->ino = ino;
+	sih->i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+
+	nova_set_inode_flags(inode, pi, le32_to_cpu(pi->i_flags));
+
+	if (insert_inode_locked(inode) < 0) {
+		nova_err(sb, "nova_new_inode failed ino %lx\n", inode->i_ino);
+		errval = -EINVAL;
+		goto fail1;
+	}
+
+	nova_flush_buffer(pi, NOVA_INODE_SIZE, 0);
+	NOVA_END_TIMING(new_vfs_inode_t, new_inode_time);
+	return inode;
+fail1:
+	make_bad_inode(inode);
+	iput(inode);
+fail2:
+	NOVA_END_TIMING(new_vfs_inode_t, new_inode_time);
+	return ERR_PTR(errval);
+}
+
+int nova_write_inode(struct inode *inode, struct writeback_control *wbc)
+{
+	/* write_inode should never be called because we always keep our inodes
+	 * clean. So let us know if write_inode ever gets called.
+	 */
+//	BUG();
+	return 0;
+}
+
+/*
+ * dirty_inode() is called from mark_inode_dirty_sync().
+ * Normally it should not be called, because NOVA always keeps its inodes
+ * clean. The only exception is touch_atime(), which calls dirty_inode() to
+ * update the i_atime field.
+ */
+void nova_dirty_inode(struct inode *inode, int flags)
+{
+	struct super_block *sb = inode->i_sb;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *pi, inode_copy;
+
+	if (sbi->mount_snapshot)
+		return;
+
+	pi = nova_get_block(sb, sih->pi_addr);
+
+	/* check the inode before updating to make sure all fields are good */
+	if (nova_check_inode_integrity(sb, sih->ino, sih->pi_addr,
+					sih->alter_pi_addr, &inode_copy, 0) < 0)
+		return;
+
+	/* Only i_atime should have changed, if anything, so we can do an
+	 * in-place atomic update.
+	 */
+	nova_memunlock_inode(sb, pi);
+	pi->i_atime = cpu_to_le32(inode->i_atime.tv_sec);
+	nova_update_inode_checksum(pi);
+	nova_update_alter_inode(sb, inode, pi);
+	nova_memlock_inode(sb, pi);
+	/* Relax atime persistency */
+	nova_flush_buffer(&pi->i_atime, sizeof(pi->i_atime), 0);
+}
+
+static void nova_setsize(struct inode *inode, loff_t oldsize, loff_t newsize,
+	u64 epoch_id)
+{
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	timing_t setsize_time;
+
+	/* We only support truncating regular files */
+	if (!(S_ISREG(inode->i_mode))) {
+		nova_err(sb, "%s: wrong file mode %x\n", __func__, inode->i_mode);
+		return;
+	}
+
+	NOVA_START_TIMING(setsize_t, setsize_time);
+
+	inode_dio_wait(inode);
+
+	nova_dbgv("%s: inode %lu, old size %llu, new size %llu\n",
+		__func__, inode->i_ino, oldsize, newsize);
+
+	if (newsize != oldsize) {
+		nova_clear_last_page_tail(sb, inode, newsize);
+		i_size_write(inode, newsize);
+		sih->i_size = newsize;
+	}
+
+	/* FIXME: we should make sure that there is nobody reading the inode
+	 * before truncating it. Also we need to munmap the truncated range
+	 * from application address space, if mmapped.
+	 */
+	/* synchronize_rcu(); */
+
+	/* FIXME: Do we need to clear truncated DAX pages? */
+//	dax_truncate_page(inode, newsize, nova_dax_get_block);
+
+	truncate_pagecache(inode, newsize);
+	nova_truncate_file_blocks(inode, newsize, oldsize, epoch_id);
+	NOVA_END_TIMING(setsize_t, setsize_time);
+}
+
+int nova_getattr(const struct path *path, struct kstat *stat,
+		 u32 request_mask, unsigned int flags)
+{
+	struct inode *inode;
+
+	inode = path->dentry->d_inode;
+	generic_fillattr(inode, stat);
+	/* stat->blocks should be the number of 512B blocks */
+	stat->blocks = (inode->i_blocks << inode->i_sb->s_blocksize_bits) >> 9;
+	return 0;
+}
+
+int nova_notify_change(struct dentry *dentry, struct iattr *attr)
+{
+	struct inode *inode = dentry->d_inode;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pi = nova_get_inode(sb, inode);
+	int ret;
+	unsigned int ia_valid = attr->ia_valid, attr_mask;
+	loff_t oldsize = inode->i_size;
+	u64 epoch_id;
+	timing_t setattr_time;
+
+	NOVA_START_TIMING(setattr_t, setattr_time);
+	if (!pi) {
+		ret = -EACCES;
+		goto out;
+	}
+
+	ret = setattr_prepare(dentry, attr);
+	if (ret)
+		goto out;
+
+	/* Update inode with attr except for size */
+	setattr_copy(inode, attr);
+
+	epoch_id = nova_get_epoch_id(sb);
+
+	attr_mask = ATTR_MODE | ATTR_UID | ATTR_GID | ATTR_SIZE | ATTR_ATIME
+			| ATTR_MTIME | ATTR_CTIME;
+
+	ia_valid = ia_valid & attr_mask;
+
+	if (ia_valid == 0)
+		goto out;
+
+	ret = nova_handle_setattr_operation(sb, inode, pi, ia_valid,
+					attr, epoch_id);
+	if (ret)
+		goto out;
+
+	/* Only after the log entry is committed can we truncate the file */
+	if ((ia_valid & ATTR_SIZE) && (attr->ia_size != oldsize ||
+			pi->i_flags & cpu_to_le32(NOVA_EOFBLOCKS_FL))) {
+//		nova_set_blocksize_hint(sb, inode, pi, attr->ia_size);
+
+		/* now we can freely truncate the inode */
+		nova_setsize(inode, oldsize, attr->ia_size, epoch_id);
+	}
+
+	sih->trans_id++;
+out:
+	NOVA_END_TIMING(setattr_t, setattr_time);
+	return ret;
+}
+
+void nova_set_inode_flags(struct inode *inode, struct nova_inode *pi,
+	unsigned int flags)
+{
+	inode->i_flags &=
+		~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC);
+	if (flags & FS_SYNC_FL)
+		inode->i_flags |= S_SYNC;
+	if (flags & FS_APPEND_FL)
+		inode->i_flags |= S_APPEND;
+	if (flags & FS_IMMUTABLE_FL)
+		inode->i_flags |= S_IMMUTABLE;
+	if (flags & FS_NOATIME_FL)
+		inode->i_flags |= S_NOATIME;
+	if (flags & FS_DIRSYNC_FL)
+		inode->i_flags |= S_DIRSYNC;
+	if (!pi->i_xattr)
+		inode_has_no_xattr(inode);
+	inode->i_flags |= S_DAX;
+}
+
+static int nova_legacy_get_blocks(struct inode *inode, sector_t iblock,
+	struct buffer_head *bh, int create)
+{
+	unsigned long max_blocks = bh->b_size >> inode->i_blkbits;
+	bool new = false, boundary = false;
+	u32 bno;
+	int ret;
+
+	ret = nova_dax_get_blocks(inode, iblock, max_blocks, &bno, &new,
+				&boundary, create, false);
+	if (ret <= 0)
+		return ret;
+
+	map_bh(bh, inode->i_sb, bno);
+	bh->b_size = ret << inode->i_blkbits;
+	return 0;
+}
+
+static ssize_t nova_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
+{
+	struct file *filp = iocb->ki_filp;
+	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = mapping->host;
+	ssize_t ret;
+	timing_t dio_time;
+
+	if (WARN_ON_ONCE(IS_DAX(inode)))
+		return -EIO;
+
+	NOVA_START_TIMING(direct_IO_t, dio_time);
+
+	ret = blockdev_direct_IO(iocb, inode, iter, nova_legacy_get_blocks);
+
+	NOVA_END_TIMING(direct_IO_t, dio_time);
+	return ret;
+}
+
+/*
+ * find the file offset for SEEK_DATA/SEEK_HOLE
+ */
+unsigned long nova_find_region(struct inode *inode, loff_t *offset, int hole)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	unsigned int data_bits = blk_type_to_shift[sih->i_blk_type];
+	unsigned long first_blocknr, last_blocknr;
+	unsigned long blocks = 0, offset_in_block;
+	int data_found = 0, hole_found = 0;
+
+	if (*offset >= inode->i_size)
+		return -ENXIO;
+
+	if (!inode->i_blocks || !sih->i_size) {
+		if (hole)
+			return inode->i_size;
+		else
+			return -ENXIO;
+	}
+
+	offset_in_block = *offset & ((1UL << data_bits) - 1);
+
+	first_blocknr = *offset >> data_bits;
+	last_blocknr = inode->i_size >> data_bits;
+
+	nova_dbg_verbose("find_region offset %llx, first_blocknr %lx, last_blocknr %lx hole %d\n",
+		  *offset, first_blocknr, last_blocknr, hole);
+
+	blocks = nova_lookup_hole_in_range(inode->i_sb, sih,
+		first_blocknr, last_blocknr, &data_found, &hole_found, hole);
+
+	/* Searching data but only hole found till the end */
+	if (!hole && !data_found && hole_found)
+		return -ENXIO;
+
+	if (data_found && !hole_found) {
+		/* Searching data but we are already into them */
+		if (hole)
+			/* Searching hole but only data found, go to the end */
+			*offset = inode->i_size;
+		return 0;
+	}
+
+	/* Searching for a hole, hole found, and starting inside a hole */
+	if (hole && hole_found && !blocks) {
+		/* we found data after it */
+		if (!data_found)
+			/* last hole */
+			*offset = inode->i_size;
+		return 0;
+	}
+
+	if (offset_in_block) {
+		blocks--;
+		*offset += (blocks << data_bits) +
+			   ((1 << data_bits) - offset_in_block);
+	} else {
+		*offset += blocks << data_bits;
+	}
+
+	return 0;
+}
+
+static int nova_writepages(struct address_space *mapping,
+	struct writeback_control *wbc)
+{
+	int ret;
+	timing_t wp_time;
+
+	NOVA_START_TIMING(write_pages_t, wp_time);
+	ret = dax_writeback_mapping_range(mapping,
+			mapping->host->i_sb->s_bdev, wbc);
+	NOVA_END_TIMING(write_pages_t, wp_time);
+	return ret;
+}
+
+const struct address_space_operations nova_aops_dax = {
+	.writepages		= nova_writepages,
+	.direct_IO		= nova_direct_IO,
+	/*.dax_mem_protect	= nova_dax_mem_protect,*/
+};
diff --git a/fs/nova/inode.h b/fs/nova/inode.h
new file mode 100644
index 000000000000..5ad69335799c
--- /dev/null
+++ b/fs/nova/inode.h
@@ -0,0 +1,389 @@
+#ifndef __INODE_H
+#define __INODE_H
+
+struct nova_inode_info_header;
+struct nova_inode;
+
+#include "super.h"
+#include "log.h"
+
+enum nova_new_inode_type {
+	TYPE_CREATE = 0,
+	TYPE_MKNOD,
+	TYPE_SYMLINK,
+	TYPE_MKDIR
+};
+
+
+/*
+ * Structure of an inode in PMEM
+ * Keep the inode size to within 120 bytes: We use the last eight bytes
+ * as inode table tail pointer.
+ */
+struct nova_inode {
+
+	/* first 40 bytes */
+	u8	i_rsvd;		 /* reserved. used to be checksum */
+	u8	valid;		 /* Is this inode valid? */
+	u8	deleted;	 /* Is this inode deleted? */
+	u8	i_blk_type;	 /* data block size this inode uses */
+	__le32	i_flags;	 /* Inode flags */
+	__le64	i_size;		 /* Size of data in bytes */
+	__le32	i_ctime;	 /* Inode change time */
+	__le32	i_mtime;	 /* Data modification time */
+	__le32	i_atime;	 /* Access time */
+	__le16	i_mode;		 /* File mode */
+	__le16	i_links_count;	 /* Links count */
+
+	__le64	i_xattr;	 /* Extended attribute block */
+
+	/* second 40 bytes */
+	__le32	i_uid;		 /* Owner Uid */
+	__le32	i_gid;		 /* Group Id */
+	__le32	i_generation;	 /* File version (for NFS) */
+	__le32	i_create_time;	 /* Create time */
+	__le64	nova_ino;	 /* nova inode number */
+
+	__le64	log_head;	 /* Log head pointer */
+	__le64	log_tail;	 /* Log tail pointer */
+
+	/* last 40 bytes */
+	__le64	alter_log_head;	 /* Alternate log head pointer */
+	__le64	alter_log_tail;	 /* Alternate log tail pointer */
+
+	__le64	create_epoch_id; /* Epoch ID at creation */
+	__le64	delete_epoch_id; /* Epoch ID at deletion */
+
+	struct {
+		__le32 rdev;	 /* major/minor # */
+	} dev;			 /* device inode */
+
+	__le32	csum;            /* CRC32 checksum */
+
+	/* Leave 8 bytes for inode table tail pointer */
+} __attribute((__packed__));
+
+/*
+ * Inode table.  It's a linked list of pages.
+ */
+struct inode_table {
+	__le64 log_head;
+};
+
+/*
+ * NOVA-specific inode state kept in DRAM
+ */
+struct nova_inode_info_header {
+	/* For files, tree holds a map from file offsets to
+	 * write log entries.
+	 *
+	 * For directories, tree holds a map from a hash of the file name to
+	 * dentry log entry.
+	 */
+	struct radix_tree_root tree;
+	struct rb_root vma_tree;	/* Write vmas */
+	struct list_head list;		/* SB list of mmap sih */
+	int num_vmas;
+	unsigned short i_mode;		/* Dir or file? */
+	unsigned long log_pages;	/* Num of log pages */
+	unsigned long i_size;
+	unsigned long i_blocks;
+	unsigned long ino;
+	unsigned long pi_addr;
+	unsigned long alter_pi_addr;
+	unsigned long valid_entries;	/* For thorough GC */
+	unsigned long num_entries;	/* For thorough GC */
+	u64 last_setattr;		/* Last setattr entry */
+	u64 last_link_change;		/* Last link change entry */
+	u64 last_dentry;		/* Last updated dentry */
+	u64 trans_id;			/* Transaction ID */
+	u64 log_head;			/* Log head pointer */
+	u64 log_tail;			/* Log tail pointer */
+	u64 alter_log_head;		/* Alternate log head pointer */
+	u64 alter_log_tail;		/* Alternate log tail pointer */
+	u8  i_blk_type;
+};
+
+/* For rebuild purposes, temporarily store pi information */
+struct nova_inode_rebuild {
+	u64	i_size;
+	u32	i_flags;	/* Inode flags */
+	u32	i_ctime;	/* Inode change time */
+	u32	i_mtime;	/* Data modification time */
+	u32	i_atime;	/* Access time */
+	u32	i_uid;		/* Owner Uid */
+	u32	i_gid;		/* Group Id */
+	u32	i_generation;	/* File version (for NFS) */
+	u16	i_links_count;	/* Links count */
+	u16	i_mode;		/* File mode */
+	u64	trans_id;
+};
+
+/*
+ * DRAM state for inodes
+ */
+struct nova_inode_info {
+	struct nova_inode_info_header header;
+	struct inode vfs_inode;
+};
+
+
+static inline struct nova_inode_info *NOVA_I(struct inode *inode)
+{
+	return container_of(inode, struct nova_inode_info, vfs_inode);
+}
+
+static inline struct nova_inode *nova_get_alter_inode(struct super_block *sb,
+	struct inode *inode)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode fake_pi;
+	void *addr;
+	int rc;
+
+	if (metadata_csum == 0)
+		return NULL;
+
+	addr = nova_get_block(sb, sih->alter_pi_addr);
+	rc = memcpy_mcsafe(&fake_pi, addr, sizeof(struct nova_inode));
+	if (rc)
+		return NULL;
+
+	return (struct nova_inode *)addr;
+}
+
+static inline int nova_update_alter_inode(struct super_block *sb,
+	struct inode *inode, struct nova_inode *pi)
+{
+	struct nova_inode *alter_pi;
+
+	if (metadata_csum == 0)
+		return 0;
+
+	alter_pi = nova_get_alter_inode(sb, inode);
+	if (!alter_pi)
+		return -EINVAL;
+
+	memcpy_to_pmem_nocache(alter_pi, pi, sizeof(struct nova_inode));
+	return 0;
+}
+
+
+static inline int nova_update_inode_checksum(struct nova_inode *pi)
+{
+	u32 crc = 0;
+
+	if (metadata_csum == 0)
+		return 0;
+
+	crc = nova_crc32c(~0, (__u8 *)pi,
+			(sizeof(struct nova_inode) - sizeof(__le32)));
+
+	pi->csum = cpu_to_le32(crc);
+	nova_flush_buffer(pi, sizeof(struct nova_inode), 1);
+	return 0;
+}
+
+static inline int nova_check_inode_checksum(struct nova_inode *pi)
+{
+	u32 crc = 0;
+
+	if (metadata_csum == 0)
+		return 0;
+
+	crc = nova_crc32c(~0, (__u8 *)pi,
+			(sizeof(struct nova_inode) - sizeof(__le32)));
+
+	if (pi->csum == cpu_to_le32(crc))
+		return 0;
+	else
+		return 1;
+}
+
+
+
+static inline void nova_update_tail(struct nova_inode *pi, u64 new_tail)
+{
+	timing_t update_time;
+
+	NOVA_START_TIMING(update_tail_t, update_time);
+
+	PERSISTENT_BARRIER();
+	pi->log_tail = new_tail;
+	nova_flush_buffer(&pi->log_tail, CACHELINE_SIZE, 1);
+
+	NOVA_END_TIMING(update_tail_t, update_time);
+}
+
+static inline void nova_update_alter_tail(struct nova_inode *pi, u64 new_tail)
+{
+	timing_t update_time;
+
+	if (metadata_csum == 0)
+		return;
+
+	NOVA_START_TIMING(update_tail_t, update_time);
+
+	PERSISTENT_BARRIER();
+	pi->alter_log_tail = new_tail;
+	nova_flush_buffer(&pi->alter_log_tail, CACHELINE_SIZE, 1);
+
+	NOVA_END_TIMING(update_tail_t, update_time);
+}
+
+
+
+/* Update inode tails and checksums */
+static inline void nova_update_inode(struct super_block *sb,
+	struct inode *inode, struct nova_inode *pi,
+	struct nova_inode_update *update, int update_alter)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+
+	sih->log_tail = update->tail;
+	sih->alter_log_tail = update->alter_tail;
+	nova_update_tail(pi, update->tail);
+	if (metadata_csum)
+		nova_update_alter_tail(pi, update->alter_tail);
+
+	nova_update_inode_checksum(pi);
+	if (inode && update_alter)
+		nova_update_alter_inode(sb, inode, pi);
+}
+
+
+static inline
+struct inode_table *nova_get_inode_table(struct super_block *sb,
+	int version, int cpu)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	int table_start;
+
+	if (cpu >= sbi->cpus)
+		return NULL;
+
+	if ((version & 0x1) == 0)
+		table_start = INODE_TABLE0_START;
+	else
+		table_start = INODE_TABLE1_START;
+
+	return (struct inode_table *)((char *)nova_get_block(sb,
+		NOVA_DEF_BLOCK_SIZE_4K * table_start) +
+		cpu * CACHELINE_SIZE);
+}
+
+static inline unsigned int
+nova_inode_blk_shift(struct nova_inode_info_header *sih)
+{
+	return blk_type_to_shift[sih->i_blk_type];
+}
+
+static inline uint32_t nova_inode_blk_size(struct nova_inode_info_header *sih)
+{
+	return blk_type_to_size[sih->i_blk_type];
+}
+
+static inline u64 nova_get_reserved_inode_addr(struct super_block *sb,
+	u64 inode_number)
+{
+	return (NOVA_DEF_BLOCK_SIZE_4K * RESERVE_INODE_START) +
+			inode_number * NOVA_INODE_SIZE;
+}
+
+static inline u64 nova_get_alter_reserved_inode_addr(struct super_block *sb,
+	u64 inode_number)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	return nova_get_addr_off(sbi, sbi->replica_reserved_inodes_addr) +
+			inode_number * NOVA_INODE_SIZE;
+}
+
+static inline struct nova_inode *nova_get_reserved_inode(struct super_block *sb,
+	u64 inode_number)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u64 addr;
+
+	addr = nova_get_reserved_inode_addr(sb, inode_number);
+
+	return (struct nova_inode *)(sbi->virt_addr + addr);
+}
+
+static inline struct nova_inode *
+nova_get_alter_reserved_inode(struct super_block *sb,
+	u64 inode_number)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u64 addr;
+
+	addr = nova_get_alter_reserved_inode_addr(sb, inode_number);
+
+	return (struct nova_inode *)(sbi->virt_addr + addr);
+}
+
+/* If this is part of a read-modify-write of the inode metadata,
+ * nova_memunlock_inode() before calling!
+ */
+static inline struct nova_inode *nova_get_inode_by_ino(struct super_block *sb,
+						  u64 ino)
+{
+	if (ino == 0 || ino >= NOVA_NORMAL_INODE_START)
+		return NULL;
+
+	return nova_get_reserved_inode(sb, ino);
+}
+
+static inline struct nova_inode *nova_get_inode(struct super_block *sb,
+	struct inode *inode)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode fake_pi;
+	void *addr;
+	int rc;
+
+	addr = nova_get_block(sb, sih->pi_addr);
+	rc = memcpy_mcsafe(&fake_pi, addr, sizeof(struct nova_inode));
+	if (rc)
+		return NULL;
+
+	return (struct nova_inode *)addr;
+}
+
+
+
+extern const struct address_space_operations nova_aops_dax;
+int nova_init_inode_inuse_list(struct super_block *sb);
+extern int nova_init_inode_table(struct super_block *sb);
+int nova_get_alter_inode_address(struct super_block *sb, u64 ino,
+	u64 *alter_pi_addr);
+unsigned long nova_get_last_blocknr(struct super_block *sb,
+	struct nova_inode_info_header *sih);
+int nova_get_inode_address(struct super_block *sb, u64 ino, int version,
+	u64 *pi_addr, int extendable, int extend_alternate);
+int nova_set_blocksize_hint(struct super_block *sb, struct inode *inode,
+	struct nova_inode *pi, loff_t new_size);
+extern struct inode *nova_iget(struct super_block *sb, unsigned long ino);
+extern void nova_evict_inode(struct inode *inode);
+extern int nova_write_inode(struct inode *inode, struct writeback_control *wbc);
+extern void nova_dirty_inode(struct inode *inode, int flags);
+extern int nova_notify_change(struct dentry *dentry, struct iattr *attr);
+extern int nova_getattr(const struct path *path, struct kstat *stat,
+			u32 request_mask, unsigned int flags);
+extern void nova_set_inode_flags(struct inode *inode, struct nova_inode *pi,
+	unsigned int flags);
+extern unsigned long nova_find_region(struct inode *inode, loff_t *offset,
+		int hole);
+int nova_delete_file_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long start_blocknr,
+	unsigned long last_blocknr, bool delete_nvmm,
+	bool delete_dead, u64 trasn_id);
+u64 nova_new_nova_inode(struct super_block *sb, u64 *pi_addr);
+extern struct inode *nova_new_vfs_inode(enum nova_new_inode_type,
+	struct inode *dir, u64 pi_addr, u64 ino, umode_t mode,
+	size_t size, dev_t rdev, const struct qstr *qstr, u64 epoch_id);
+
+#endif


* [RFC 05/16] NOVA: Log data structures and operations
From: Steven Swanson @ 2017-08-03  7:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson

NOVA maintains a log for each inode that records updates to the inode's
metadata and holds pointers to the file data.  NOVA makes updates to file data
and metadata atomic by atomically appending log entries to the log.
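
As a rough userspace illustration of that append-then-publish idea (not the
NOVA code itself): the entry is first written into unused log space, and a
single 8-byte store of the new tail makes it visible.  The names toy_log and
toy_log_append are made up for this sketch.  It assumes one writer per log
(NOVA appends while holding the inode lock) and leaves out the cache-line
flush and persistence barrier that nova_update_tail() issues on real PMEM.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOG_BYTES 4096	/* hypothetical log capacity for this sketch */

struct toy_log {
	uint8_t buf[LOG_BYTES];
	_Atomic uint64_t tail;	/* offset of the first unused byte */
};

/* Append one entry; returns 0 on success, -1 if the log is full. */
static int toy_log_append(struct toy_log *log, const void *entry, size_t len)
{
	uint64_t tail;

	assert(log && entry);
	tail = atomic_load_explicit(&log->tail, memory_order_relaxed);

	if (len > LOG_BYTES - tail)
		return -1;

	/* Step 1: fill in the entry beyond the tail; readers cannot see it. */
	memcpy(log->buf + tail, entry, len);

	/* Step 2: publish the entry with one store of the new tail. */
	atomic_store_explicit(&log->tail, tail + len, memory_order_release);
	return 0;
}
```

If the writer stops between step 1 and step 2, the tail still points at the
old end of the log, so the half-written entry is never observed.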

Each inode contains pointers to the head and tail of the inode's log.  When the
log grows past the end of the last page, NOVA allocates additional space.  For
short logs (less than 1MB), it doubles the length.  For longer logs, it adds a
fixed amount of additional space (1MB).
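
The growth rule above can be sketched as follows.  This is only an
illustration: log_pages_to_add() is a hypothetical helper, not a function from
this series, and the handling of an empty log (start with one page) is an
assumption.  The 4KB page size is the log page size given in the cover letter.

```c
#include <assert.h>

#define LOG_PAGE_SIZE		4096UL
#define GROW_LIMIT_PAGES	(1024UL * 1024UL / LOG_PAGE_SIZE)

static_assert(GROW_LIMIT_PAGES == 256, "1MB holds 256 pages of 4KB");

/* Number of pages to add to a log that currently has cur_pages pages. */
static unsigned long log_pages_to_add(unsigned long cur_pages)
{
	if (cur_pages == 0)
		return 1;		/* assumption: empty log gets one page */
	if (cur_pages < GROW_LIMIT_PAGES)
		return cur_pages;	/* short log: double the length */
	return GROW_LIMIT_PAGES;	/* long log: add a fixed 1MB */
}
```

Doubling keeps the number of extensions small while a log is short, and the
1MB cap bounds how much space a single extension can claim for a long log.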

Log space is reclaimed during garbage collection.

Log Entries
-----------

There are eight kinds of log entry, documented in log.h.  The log entries have
several fields in common:

   1.  'epoch_id' gives the epoch during which the log entry was created.
   Creating a snapshot increments the epoch_id for the file system.

   2.  'trans_id' is a filesystem-wide, monotonically increasing number
   assigned to each log entry.  It provides an ordering over all FS operations.

   3.  'invalid' is true if the effects of this entry are dead and the log
   entry can be garbage collected.

   4.  'csum' is a CRC32 checksum for the entry.
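
To make the role of the common fields concrete, here is a small userspace
sketch.  The struct layout, the toy_* names and the bitwise CRC32C routine are
all assumptions made for illustration; the real entry layouts are in log.h and
the real checksum helper is nova_update_entry_csum().  The sketch checksums
every byte of the entry except the trailing csum field, the same convention
nova_update_inode_checksum() uses for inodes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry holding only the four common fields. */
struct toy_log_entry {
	uint64_t epoch_id;	/* epoch in which the entry was created */
	uint64_t trans_id;	/* filesystem-wide ordering number */
	uint8_t  invalid;	/* nonzero once the entry is dead */
	uint8_t  padding[3];
	uint32_t csum;		/* last field, excluded from the checksum */
};

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t toy_crc32c(uint32_t crc, const uint8_t *data, size_t len)
{
	while (len--) {
		crc ^= *data++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1u));
	}
	return crc;
}

static uint32_t toy_entry_csum(const struct toy_log_entry *e)
{
	return toy_crc32c(~0u, (const uint8_t *)e,
			  offsetof(struct toy_log_entry, csum));
}

static void toy_entry_update_csum(struct toy_log_entry *e)
{
	e->csum = toy_entry_csum(e);
}

/* Returns 1 if the stored checksum matches the entry contents. */
static int toy_entry_csum_ok(const struct toy_log_entry *e)
{
	return e->csum == toy_entry_csum(e);
}
```

Because 'invalid' is covered by the checksum, any code that marks an entry
dead has to recompute 'csum' afterwards, which is why the invalidation paths
in log.c end with a call to nova_update_entry_csum().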

Log structure
-------------

The logs comprise a linked list of PMEM blocks.  The tail of each block
contains some metadata about the block and pointers to the next block and
the block's replica (struct nova_inode_page_tail).

+----------------+
| log entry      |
+----------------+
| log entry      |
+----------------+
| ...            |
+----------------+
| tail           |
|  metadata      |
|  -> next block |
+----------------+
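
The layout in the diagram can be modelled in userspace roughly as below.  The
toy_* types are hypothetical stand-ins for the real structures (the actual
tail is struct nova_inode_page_tail in log.h), the per-page counters are
assumed from the nova_inc_page_num_entries() and
nova_inc_page_invalid_entries() helpers, and plain pointers are used here
where the file system would store PMEM offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_PAGE_SIZE 4096

struct toy_log_page;

/* Hypothetical per-page tail: counters plus replica and next links. */
struct toy_page_tail {
	uint32_t num_entries;
	uint32_t invalid_entries;
	struct toy_log_page *alter_page;	/* replica of this page */
	struct toy_log_page *next_page;		/* NULL at the end of the log */
};

struct toy_log_page {
	uint8_t entries[LOG_PAGE_SIZE - sizeof(struct toy_page_tail)];
	struct toy_page_tail tail;
};

static_assert(sizeof(struct toy_log_page) == LOG_PAGE_SIZE,
	      "entries plus tail must fill exactly one log page");

/* Walk the log; return live entries and optionally the page count. */
static size_t toy_log_live_entries(const struct toy_log_page *head,
				   size_t *pages)
{
	size_t live = 0, n = 0;

	for (const struct toy_log_page *p = head; p; p = p->tail.next_page) {
		live += p->tail.num_entries - p->tail.invalid_entries;
		n++;
	}
	if (pages)
		*pages = n;
	return live;
}
```

A page whose invalid_entries equals num_entries holds no live entries, which
is the kind of page a garbage collector could unlink from the list.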

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 fs/nova/log.c | 1411 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/log.h |  333 +++++++++++++
 2 files changed, 1744 insertions(+)
 create mode 100644 fs/nova/log.c
 create mode 100644 fs/nova/log.h

diff --git a/fs/nova/log.c b/fs/nova/log.c
new file mode 100644
index 000000000000..2c3c9aa18043
--- /dev/null
+++ b/fs/nova/log.c
@@ -0,0 +1,1411 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Log methods
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include "nova.h"
+#include "journal.h"
+#include "inode.h"
+#include "log.h"
+
+static int nova_execute_invalidate_reassign_logentry(struct super_block *sb,
+	void *entry, enum nova_entry_type type, int reassign,
+	unsigned int num_free)
+{
+	struct nova_file_write_entry *fw_entry;
+	int invalid = 0;
+
+	switch (type) {
+	case FILE_WRITE:
+		fw_entry = (struct nova_file_write_entry *)entry;
+		if (reassign)
+			fw_entry->reassigned = 1;
+		if (num_free)
+			fw_entry->invalid_pages += num_free;
+		if (fw_entry->invalid_pages == fw_entry->num_pages)
+			invalid = 1;
+		break;
+	case DIR_LOG:
+		if (reassign) {
+			((struct nova_dentry *)entry)->reassigned = 1;
+		} else {
+			((struct nova_dentry *)entry)->invalid = 1;
+			invalid = 1;
+		}
+		break;
+	case SET_ATTR:
+		((struct nova_setattr_logentry *)entry)->invalid = 1;
+		invalid = 1;
+		break;
+	case LINK_CHANGE:
+		((struct nova_link_change_entry *)entry)->invalid = 1;
+		invalid = 1;
+		break;
+	case MMAP_WRITE:
+		((struct nova_mmap_entry *)entry)->invalid = 1;
+		invalid = 1;
+		break;
+	case SNAPSHOT_INFO:
+		((struct nova_snapshot_info_entry *)entry)->deleted = 1;
+		invalid = 1;
+		break;
+	default:
+		break;
+	}
+
+	if (invalid) {
+		u64 addr = nova_get_addr_off(NOVA_SB(sb), entry);
+
+		nova_inc_page_invalid_entries(sb, addr);
+	}
+
+	nova_update_entry_csum(entry);
+	return 0;
+}
+
+static int nova_invalidate_reassign_logentry(struct super_block *sb,
+	void *entry, enum nova_entry_type type, int reassign,
+	unsigned int num_free)
+{
+	nova_memunlock_range(sb, entry, CACHELINE_SIZE);
+
+	nova_execute_invalidate_reassign_logentry(sb, entry, type,
+						reassign, num_free);
+	nova_update_alter_entry(sb, entry);
+	nova_memlock_range(sb, entry, CACHELINE_SIZE);
+
+	return 0;
+}
+
+int nova_invalidate_logentry(struct super_block *sb, void *entry,
+	enum nova_entry_type type, unsigned int num_free)
+{
+	return nova_invalidate_reassign_logentry(sb, entry, type, 0, num_free);
+}
+
+int nova_reassign_logentry(struct super_block *sb, void *entry,
+	enum nova_entry_type type)
+{
+	return nova_invalidate_reassign_logentry(sb, entry, type, 1, 0);
+}
+
+static inline int nova_invalidate_write_entry(struct super_block *sb,
+	struct nova_file_write_entry *entry, int reassign,
+	unsigned int num_free)
+{
+	struct nova_file_write_entry *entryc, entry_copy;
+
+	if (!entry)
+		return 0;
+
+	if (metadata_csum == 0)
+		entryc = entry;
+	else {
+		entryc = &entry_copy;
+		if (!nova_verify_entry_csum(sb, entry, entryc))
+			return -EIO;
+	}
+
+	if (num_free == 0 && entryc->reassigned == 1)
+		return 0;
+
+	return nova_invalidate_reassign_logentry(sb, entry, FILE_WRITE,
+							reassign, num_free);
+}
+
+unsigned int nova_free_old_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry,
+	unsigned long pgoff, unsigned int num_free,
+	bool delete_dead, u64 epoch_id)
+{
+	struct nova_file_write_entry *entryc, entry_copy;
+	unsigned long old_nvmm;
+	int ret;
+	timing_t free_time;
+
+	if (!entry)
+		return 0;
+
+	NOVA_START_TIMING(free_old_t, free_time);
+
+	if (metadata_csum == 0)
+		entryc = entry;
+	else {
+		entryc = &entry_copy;
+		if (!nova_verify_entry_csum(sb, entry, entryc))
+			return -EIO;
+	}
+
+	old_nvmm = get_nvmm(sb, sih, entryc, pgoff);
+
+	if (!delete_dead) {
+		ret = nova_append_data_to_snapshot(sb, entryc, old_nvmm,
+				num_free, epoch_id);
+		if (ret == 0) {
+			nova_invalidate_write_entry(sb, entry, 1, 0);
+			goto out;
+		}
+
+		nova_invalidate_write_entry(sb, entry, 1, num_free);
+	}
+
+	nova_dbgv("%s: pgoff %lu, free %u blocks\n",
+				__func__, pgoff, num_free);
+	nova_free_data_blocks(sb, sih, old_nvmm, num_free);
+
+out:
+	sih->i_blocks -= num_free;
+
+	NOVA_END_TIMING(free_old_t, free_time);
+	return num_free;
+}
+
+struct nova_file_write_entry *nova_find_next_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, pgoff_t pgoff)
+{
+	struct nova_file_write_entry *entry = NULL;
+	struct nova_file_write_entry *entries[1];
+	int nr_entries;
+
+	nr_entries = radix_tree_gang_lookup(&sih->tree,
+					(void **)entries, pgoff, 1);
+	if (nr_entries == 1)
+		entry = entries[0];
+
+	return entry;
+}
+
+/*
+ * Zero the unused tail of the last page.  Used by resize requests
+ * so that stale data is not exposed if the file grows again.
+ */
+void nova_clear_last_page_tail(struct super_block *sb,
+	struct inode *inode, loff_t newsize)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	unsigned long offset = newsize & (sb->s_blocksize - 1);
+	unsigned long pgoff, length;
+	u64 nvmm;
+	char *nvmm_addr;
+
+	if (offset == 0 || newsize > inode->i_size)
+		return;
+
+	length = sb->s_blocksize - offset;
+	pgoff = newsize >> sb->s_blocksize_bits;
+
+	nvmm = nova_find_nvmm_block(sb, sih, NULL, pgoff);
+	if (nvmm == 0)
+		return;
+
+	nvmm_addr = (char *)nova_get_block(sb, nvmm);
+	nova_memunlock_range(sb, nvmm_addr + offset, length);
+	memcpy_to_pmem_nocache(nvmm_addr + offset, sbi->zeroed_page, length);
+	nova_memlock_range(sb, nvmm_addr + offset, length);
+
+	if (data_csum > 0)
+		nova_update_truncated_block_csum(sb, inode, newsize);
+	if (data_parity > 0)
+		nova_update_truncated_block_parity(sb, inode, newsize);
+}
+
+static void nova_update_setattr_entry(struct inode *inode,
+	struct nova_setattr_logentry *entry,
+	struct nova_log_entry_info *entry_info)
+{
+	struct iattr *attr = entry_info->attr;
+	unsigned int ia_valid = attr->ia_valid, attr_mask;
+
+	/* These flags are in the lowest byte */
+	attr_mask = ATTR_MODE | ATTR_UID | ATTR_GID | ATTR_SIZE |
+			ATTR_ATIME | ATTR_MTIME | ATTR_CTIME;
+
+	entry->entry_type	= SET_ATTR;
+	entry->attr	= ia_valid & attr_mask;
+	entry->mode	= cpu_to_le16(inode->i_mode);
+	entry->uid	= cpu_to_le32(i_uid_read(inode));
+	entry->gid	= cpu_to_le32(i_gid_read(inode));
+	entry->atime	= cpu_to_le32(inode->i_atime.tv_sec);
+	entry->ctime	= cpu_to_le32(inode->i_ctime.tv_sec);
+	entry->mtime	= cpu_to_le32(inode->i_mtime.tv_sec);
+	entry->epoch_id = cpu_to_le64(entry_info->epoch_id);
+	entry->trans_id	= cpu_to_le64(entry_info->trans_id);
+	entry->invalid	= 0;
+
+	if (ia_valid & ATTR_SIZE)
+		entry->size = cpu_to_le64(attr->ia_size);
+	else
+		entry->size = cpu_to_le64(inode->i_size);
+
+	nova_update_entry_csum(entry);
+}
+
+static void nova_update_link_change_entry(struct inode *inode,
+	struct nova_link_change_entry *entry,
+	struct nova_log_entry_info *entry_info)
+{
+	entry->entry_type	= LINK_CHANGE;
+	entry->epoch_id		= cpu_to_le64(entry_info->epoch_id);
+	entry->trans_id		= cpu_to_le64(entry_info->trans_id);
+	entry->invalid		= 0;
+	entry->links		= cpu_to_le16(inode->i_nlink);
+	entry->ctime		= cpu_to_le32(inode->i_ctime.tv_sec);
+	entry->flags		= cpu_to_le32(inode->i_flags);
+	entry->generation	= cpu_to_le32(inode->i_generation);
+
+	nova_update_entry_csum(entry);
+}
+
+static int nova_update_write_entry(struct super_block *sb,
+	struct nova_file_write_entry *entry,
+	struct nova_log_entry_info *entry_info)
+{
+	entry->epoch_id = cpu_to_le64(entry_info->epoch_id);
+	entry->trans_id = cpu_to_le64(entry_info->trans_id);
+	entry->mtime = cpu_to_le32(entry_info->time);
+	entry->size = cpu_to_le64(entry_info->file_size);
+	entry->updating = 0;
+	nova_update_entry_csum(entry);
+	return 0;
+}
+
+static int nova_update_old_dentry(struct super_block *sb,
+	struct inode *dir, struct nova_dentry *dentry,
+	struct nova_log_entry_info *entry_info)
+{
+	unsigned short links_count;
+	int link_change = entry_info->link_change;
+	u64 addr;
+
+	dentry->epoch_id = entry_info->epoch_id;
+	dentry->trans_id = entry_info->trans_id;
+	/* Remove_dentry */
+	dentry->ino = cpu_to_le64(0);
+	dentry->invalid = 1;
+	dentry->mtime = cpu_to_le32(dir->i_mtime.tv_sec);
+
+	links_count = dir->i_nlink;
+	if (links_count == 0 && link_change == -1)
+		links_count = 0;
+	else
+		links_count += link_change;
+	dentry->links_count = cpu_to_le16(links_count);
+
+	addr = nova_get_addr_off(NOVA_SB(sb), dentry);
+	nova_inc_page_invalid_entries(sb, addr);
+
+	/* Update checksum */
+	nova_update_entry_csum(dentry);
+
+	return 0;
+}
+
+static int nova_update_new_dentry(struct super_block *sb,
+	struct inode *dir, struct nova_dentry *entry,
+	struct nova_log_entry_info *entry_info)
+{
+	struct dentry *dentry = entry_info->data;
+	unsigned short links_count;
+	int link_change = entry_info->link_change;
+
+	entry->entry_type = DIR_LOG;
+	entry->epoch_id = entry_info->epoch_id;
+	entry->trans_id = entry_info->trans_id;
+	entry->ino = entry_info->ino;
+	entry->name_len = dentry->d_name.len;
+	memcpy_to_pmem_nocache(entry->name, dentry->d_name.name,
+				dentry->d_name.len);
+	entry->name[dentry->d_name.len] = '\0';
+	entry->mtime = cpu_to_le32(dir->i_mtime.tv_sec);
+	//entry->size = cpu_to_le64(dir->i_size);
+
+	links_count = dir->i_nlink;
+	if (links_count == 0 && link_change == -1)
+		links_count = 0;
+	else
+		links_count += link_change;
+	entry->links_count = cpu_to_le16(links_count);
+
+	/* Update actual de_len */
+	entry->de_len = cpu_to_le16(entry_info->file_size);
+
+	/* Update checksum */
+	nova_update_entry_csum(entry);
+
+	return 0;
+}
+
+static int nova_update_log_entry(struct super_block *sb, struct inode *inode,
+	void *entry, struct nova_log_entry_info *entry_info)
+{
+	enum nova_entry_type type = entry_info->type;
+
+	switch (type) {
+	case FILE_WRITE:
+		if (entry_info->inplace)
+			nova_update_write_entry(sb, entry, entry_info);
+		else
+			memcpy_to_pmem_nocache(entry, entry_info->data,
+				sizeof(struct nova_file_write_entry));
+		break;
+	case DIR_LOG:
+		if (entry_info->inplace)
+			nova_update_old_dentry(sb, inode, entry, entry_info);
+		else
+			nova_update_new_dentry(sb, inode, entry, entry_info);
+		break;
+	case SET_ATTR:
+		nova_update_setattr_entry(inode, entry, entry_info);
+		break;
+	case LINK_CHANGE:
+		nova_update_link_change_entry(inode, entry, entry_info);
+		break;
+	case MMAP_WRITE:
+		memcpy_to_pmem_nocache(entry, entry_info->data,
+				sizeof(struct nova_mmap_entry));
+		break;
+	case SNAPSHOT_INFO:
+		memcpy_to_pmem_nocache(entry, entry_info->data,
+				sizeof(struct nova_snapshot_info_entry));
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+static int nova_append_log_entry(struct super_block *sb,
+	struct nova_inode *pi, struct inode *inode,
+	struct nova_inode_info_header *sih,
+	struct nova_log_entry_info *entry_info)
+{
+	void *entry, *alter_entry;
+	enum nova_entry_type type = entry_info->type;
+	struct nova_inode_update *update = entry_info->update;
+	u64 tail, alter_tail;
+	u64 curr_p, alter_curr_p;
+	size_t size;
+	int extended = 0;
+
+	if (type == DIR_LOG)
+		size = entry_info->file_size;
+	else
+		size = nova_get_log_entry_size(sb, type);
+
+	tail = update->tail;
+	alter_tail = update->alter_tail;
+
+	curr_p = nova_get_append_head(sb, pi, sih, tail, size,
+						MAIN_LOG, 0, &extended);
+	if (curr_p == 0)
+		return -ENOSPC;
+
+	nova_dbg_verbose("%s: inode %lu attr change entry @ 0x%llx\n",
+				__func__, sih->ino, curr_p);
+
+	entry = nova_get_block(sb, curr_p);
+	/* inode is already updated with attr */
+	nova_memunlock_range(sb, entry, size);
+	memset(entry, 0, size);
+	nova_update_log_entry(sb, inode, entry, entry_info);
+	nova_inc_page_num_entries(sb, curr_p);
+	nova_memlock_range(sb, entry, size);
+	update->curr_entry = curr_p;
+	update->tail = curr_p + size;
+
+	if (metadata_csum) {
+		alter_curr_p = nova_get_append_head(sb, pi, sih, alter_tail,
+						size, ALTER_LOG, 0, &extended);
+		if (alter_curr_p == 0)
+			return -ENOSPC;
+
+		alter_entry = nova_get_block(sb, alter_curr_p);
+		nova_memunlock_range(sb, alter_entry, size);
+		memset(alter_entry, 0, size);
+		nova_update_log_entry(sb, inode, alter_entry, entry_info);
+		nova_memlock_range(sb, alter_entry, size);
+
+		update->alter_entry = alter_curr_p;
+		update->alter_tail = alter_curr_p + size;
+	}
+
+	entry_info->curr_p = curr_p;
+	return 0;
+}
+
+int nova_inplace_update_log_entry(struct super_block *sb,
+	struct inode *inode, void *entry,
+	struct nova_log_entry_info *entry_info)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	enum nova_entry_type type = entry_info->type;
+	u64 journal_tail;
+	size_t size;
+	int cpu;
+	timing_t update_time;
+
+	NOVA_START_TIMING(update_entry_t, update_time);
+	size = nova_get_log_entry_size(sb, type);
+
+	if (metadata_csum) {
+		nova_memunlock_range(sb, entry, size);
+		nova_update_log_entry(sb, inode, entry, entry_info);
+		// Also update the alter inode log entry.
+		nova_update_alter_entry(sb, entry);
+		nova_memlock_range(sb, entry, size);
+		goto out;
+	}
+
+	cpu = smp_processor_id();
+	spin_lock(&sbi->journal_locks[cpu]);
+	nova_memunlock_journal(sb);
+	journal_tail = nova_create_logentry_transaction(sb, entry, type, cpu);
+	nova_update_log_entry(sb, inode, entry, entry_info);
+
+	PERSISTENT_BARRIER();
+
+	nova_commit_lite_transaction(sb, journal_tail, cpu);
+	nova_memlock_journal(sb);
+	spin_unlock(&sbi->journal_locks[cpu]);
+out:
+	NOVA_END_TIMING(update_entry_t, update_time);
+	return 0;
+}
+
+/* Returns 0 on success; the new tail is stored in update->tail */
+static int nova_append_setattr_entry(struct super_block *sb,
+	struct nova_inode *pi, struct inode *inode, struct iattr *attr,
+	struct nova_inode_update *update, u64 *last_setattr, u64 epoch_id)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode inode_copy;
+	struct nova_log_entry_info entry_info;
+	timing_t append_time;
+	int ret;
+
+	NOVA_START_TIMING(append_setattr_t, append_time);
+	entry_info.type = SET_ATTR;
+	entry_info.attr = attr;
+	entry_info.update = update;
+	entry_info.epoch_id = epoch_id;
+	entry_info.trans_id = sih->trans_id;
+
+	if (nova_check_inode_integrity(sb, sih->ino, sih->pi_addr,
+			sih->alter_pi_addr, &inode_copy, 0) < 0) {
+		ret = -EIO;
+		goto out;
+	}
+
+	ret = nova_append_log_entry(sb, pi, inode, sih, &entry_info);
+	if (ret) {
+		nova_err(sb, "%s failed\n", __func__);
+		goto out;
+	}
+
+	*last_setattr = sih->last_setattr;
+	sih->last_setattr = entry_info.curr_p;
+
+out:
+	NOVA_END_TIMING(append_setattr_t, append_time);
+	return ret;
+}
+
+/* Invalidate old setattr entry */
+static int nova_invalidate_setattr_entry(struct super_block *sb,
+	u64 last_setattr)
+{
+	struct nova_setattr_logentry *old_entry;
+	struct nova_setattr_logentry *old_entryc, old_entry_copy;
+	void *addr;
+	int ret;
+
+	addr = (void *)nova_get_block(sb, last_setattr);
+	old_entry = (struct nova_setattr_logentry *)addr;
+
+	if (metadata_csum == 0)
+		old_entryc = old_entry;
+	else {
+		old_entryc = &old_entry_copy;
+		if (!nova_verify_entry_csum(sb, old_entry, old_entryc))
+			return -EIO;
+	}
+
+	/* Do not invalidate setsize entries */
+	if (!old_entry_freeable(sb, old_entryc->epoch_id) ||
+			(old_entryc->attr & ATTR_SIZE))
+		return 0;
+
+	ret = nova_invalidate_logentry(sb, old_entry, SET_ATTR, 0);
+
+	return ret;
+}
+
+#if 0
+static void setattr_copy_to_nova_inode(struct super_block *sb,
+	struct inode *inode, struct nova_inode *pi, u64 epoch_id)
+{
+	pi->i_mode  = cpu_to_le16(inode->i_mode);
+	pi->i_uid	= cpu_to_le32(i_uid_read(inode));
+	pi->i_gid	= cpu_to_le32(i_gid_read(inode));
+	pi->i_atime	= cpu_to_le32(inode->i_atime.tv_sec);
+	pi->i_ctime	= cpu_to_le32(inode->i_ctime.tv_sec);
+	pi->i_mtime	= cpu_to_le32(inode->i_mtime.tv_sec);
+	pi->create_epoch_id = epoch_id;
+
+	nova_update_alter_inode(sb, inode, pi);
+}
+#endif
+
+static int nova_can_inplace_update_setattr(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 epoch_id)
+{
+	u64 last_log = 0;
+	struct nova_setattr_logentry *entry = NULL;
+
+	last_log = sih->last_setattr;
+	if (last_log) {
+		entry = (struct nova_setattr_logentry *)nova_get_block(sb,
+								last_log);
+		/* Do not overwrite setsize entry */
+		if (entry->attr & ATTR_SIZE)
+			return 0;
+		if (entry->epoch_id == epoch_id)
+			return 1;
+	}
+
+	return 0;
+}
+
+static int nova_inplace_update_setattr_entry(struct super_block *sb,
+	struct inode *inode, struct nova_inode_info_header *sih,
+	struct iattr *attr, u64 epoch_id)
+{
+	struct nova_setattr_logentry *entry = NULL;
+	struct nova_log_entry_info entry_info;
+	u64 last_log = 0;
+
+	nova_dbgv("%s : Modifying last log entry for inode %lu\n",
+				__func__, inode->i_ino);
+	last_log = sih->last_setattr;
+	entry = (struct nova_setattr_logentry *)nova_get_block(sb,
+							last_log);
+
+	entry_info.type = SET_ATTR;
+	entry_info.attr = attr;
+	entry_info.epoch_id = epoch_id;
+	entry_info.trans_id = sih->trans_id;
+
+	return nova_inplace_update_log_entry(sb, inode, entry,
+					&entry_info);
+}
+
+int nova_handle_setattr_operation(struct super_block *sb, struct inode *inode,
+	struct nova_inode *pi, unsigned int ia_valid, struct iattr *attr,
+	u64 epoch_id)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode_update update;
+	u64 last_setattr = 0;
+	int ret;
+
+	if (ia_valid & ATTR_MODE)
+		sih->i_mode = inode->i_mode;
+
+	/*
+	 * Let's try to do inplace update.
+	 * If there are currently no snapshots holding this inode,
+	 * we can update the inode in place. If a snapshot creation
+	 * is in progress, we will use the create_snapshot_epoch_id
+	 * as the latest snapshot id.
+	 */
+	if (!(ia_valid & ATTR_SIZE) &&
+			nova_can_inplace_update_setattr(sb, sih, epoch_id)) {
+		nova_inplace_update_setattr_entry(sb, inode, sih,
+						attr, epoch_id);
+	} else {
+		/* We are holding inode lock so OK to append the log */
+		nova_dbgv("%s : Appending last log entry for inode ino = %lu\n",
+				__func__, inode->i_ino);
+		update.tail = update.alter_tail = 0;
+		ret = nova_append_setattr_entry(sb, pi, inode, attr, &update,
+						&last_setattr, epoch_id);
+		if (ret) {
+			nova_dbg("%s: append setattr entry failure\n",
+								__func__);
+			return ret;
+		}
+
+		nova_memunlock_inode(sb, pi);
+		nova_update_inode(sb, inode, pi, &update, 1);
+		nova_memlock_inode(sb, pi);
+	}
+
+	/* Invalidate old setattr entry */
+	if (last_setattr)
+		nova_invalidate_setattr_entry(sb, last_setattr);
+
+	return 0;
+}
+
+/* Invalidate old link change entry */
+int nova_invalidate_link_change_entry(struct super_block *sb,
+	u64 old_link_change)
+{
+	struct nova_link_change_entry *old_entry;
+	struct nova_link_change_entry *old_entryc, old_entry_copy;
+	void *addr;
+	int ret;
+
+	if (old_link_change == 0)
+		return 0;
+
+	addr = (void *)nova_get_block(sb, old_link_change);
+	old_entry = (struct nova_link_change_entry *)addr;
+
+	if (metadata_csum == 0)
+		old_entryc = old_entry;
+	else {
+		old_entryc = &old_entry_copy;
+		if (!nova_verify_entry_csum(sb, old_entry, old_entryc))
+			return -EIO;
+	}
+
+	if (!old_entry_freeable(sb, old_entryc->epoch_id))
+		return 0;
+
+	ret = nova_invalidate_logentry(sb, old_entry, LINK_CHANGE, 0);
+
+	return ret;
+}
+
+static int nova_can_inplace_update_lcentry(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 epoch_id)
+{
+	u64 last_log = 0;
+	struct nova_link_change_entry *entry = NULL;
+
+	last_log = sih->last_link_change;
+	if (last_log) {
+		entry = (struct nova_link_change_entry *)nova_get_block(sb,
+								last_log);
+		if (entry->epoch_id == epoch_id)
+			return 1;
+	}
+
+	return 0;
+}
+
+static int nova_inplace_update_lcentry(struct super_block *sb,
+	struct inode *inode, struct nova_inode_info_header *sih,
+	u64 epoch_id)
+{
+	struct nova_link_change_entry *entry = NULL;
+	struct nova_log_entry_info entry_info;
+	u64 last_log = 0;
+
+	last_log = sih->last_link_change;
+	entry = (struct nova_link_change_entry *)nova_get_block(sb,
+							last_log);
+
+	entry_info.type = LINK_CHANGE;
+	entry_info.epoch_id = epoch_id;
+	entry_info.trans_id = sih->trans_id;
+
+	return nova_inplace_update_log_entry(sb, inode, entry,
+					&entry_info);
+}
+
+/* Returns 0 on success; the new tail is stored in update->tail */
+int nova_append_link_change_entry(struct super_block *sb,
+	struct nova_inode *pi, struct inode *inode,
+	struct nova_inode_update *update, u64 *old_linkc, u64 epoch_id)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode inode_copy;
+	struct nova_log_entry_info entry_info;
+	int ret = 0;
+	timing_t append_time;
+
+	NOVA_START_TIMING(append_link_change_t, append_time);
+
+	if (nova_check_inode_integrity(sb, sih->ino, sih->pi_addr,
+			sih->alter_pi_addr, &inode_copy, 0) < 0) {
+		ret = -EIO;
+		goto out;
+	}
+
+	if (nova_can_inplace_update_lcentry(sb, sih, epoch_id)) {
+		nova_inplace_update_lcentry(sb, inode, sih, epoch_id);
+		update->tail = sih->log_tail;
+		update->alter_tail = sih->alter_log_tail;
+
+		*old_linkc = 0;
+		sih->trans_id++;
+		goto out;
+	}
+
+	entry_info.type = LINK_CHANGE;
+	entry_info.update = update;
+	entry_info.epoch_id = epoch_id;
+	entry_info.trans_id = sih->trans_id;
+
+	ret = nova_append_log_entry(sb, pi, inode, sih, &entry_info);
+	if (ret) {
+		nova_err(sb, "%s failed\n", __func__);
+		goto out;
+	}
+
+	*old_linkc = sih->last_link_change;
+	sih->last_link_change = entry_info.curr_p;
+	sih->trans_id++;
+out:
+	NOVA_END_TIMING(append_link_change_t, append_time);
+	return ret;
+}
+
+int nova_assign_write_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry,
+	struct nova_file_write_entry *entryc,
+	bool free)
+{
+	struct nova_file_write_entry *old_entry;
+	struct nova_file_write_entry *start_old_entry = NULL;
+	void **pentry;
+	unsigned long start_pgoff = entryc->pgoff;
+	unsigned long start_old_pgoff = 0;
+	unsigned int num = entryc->num_pages;
+	unsigned int num_free = 0;
+	unsigned long curr_pgoff;
+	int i;
+	int ret = 0;
+	timing_t assign_time;
+
+	NOVA_START_TIMING(assign_t, assign_time);
+	for (i = 0; i < num; i++) {
+		curr_pgoff = start_pgoff + i;
+
+		pentry = radix_tree_lookup_slot(&sih->tree, curr_pgoff);
+		if (pentry) {
+			old_entry = radix_tree_deref_slot(pentry);
+			if (old_entry != start_old_entry) {
+				if (start_old_entry && free)
+					nova_free_old_entry(sb, sih,
+							start_old_entry,
+							start_old_pgoff,
+							num_free, false,
+							entryc->epoch_id);
+				nova_invalidate_write_entry(sb,
+						start_old_entry, 1, 0);
+
+				start_old_entry = old_entry;
+				start_old_pgoff = curr_pgoff;
+				num_free = 1;
+			} else {
+				num_free++;
+			}
+
+			radix_tree_replace_slot(&sih->tree, pentry, entry);
+		} else {
+			ret = radix_tree_insert(&sih->tree, curr_pgoff, entry);
+			if (ret) {
+				nova_dbg("%s: ERROR %d\n", __func__, ret);
+				goto out;
+			}
+		}
+	}
+
+	if (start_old_entry && free)
+		nova_free_old_entry(sb, sih, start_old_entry,
+					start_old_pgoff, num_free, false,
+					entryc->epoch_id);
+
+	nova_invalidate_write_entry(sb, start_old_entry, 1, 0);
+
+out:
+	NOVA_END_TIMING(assign_t, assign_time);
+
+	return ret;
+}
+
+int nova_inplace_update_write_entry(struct super_block *sb,
+	struct inode *inode, struct nova_file_write_entry *entry,
+	struct nova_log_entry_info *entry_info)
+{
+	return nova_inplace_update_log_entry(sb, inode, entry,
+					entry_info);
+}
+
+int nova_set_write_entry_updating(struct super_block *sb,
+	struct nova_file_write_entry *entry, int set)
+{
+	nova_memunlock_range(sb, entry, sizeof(*entry));
+	entry->updating = set ? 1 : 0;
+	nova_update_entry_csum(entry);
+	nova_update_alter_entry(sb, entry);
+	nova_memlock_range(sb, entry, sizeof(*entry));
+
+	return 0;
+}
+
+/*
+ * Append a nova_file_write_entry to the current nova_inode_log_page.
+ * blocknr and start_blk are pgoff.
+ * We cannot update pi->log_tail here because a transaction may contain
+ * multiple entries.
+ */
+int nova_append_file_write_entry(struct super_block *sb, struct nova_inode *pi,
+	struct inode *inode, struct nova_file_write_entry *data,
+	struct nova_inode_update *update)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_log_entry_info entry_info;
+	timing_t append_time;
+	int ret;
+
+	NOVA_START_TIMING(append_file_entry_t, append_time);
+
+	nova_update_entry_csum(data);
+
+	entry_info.type = FILE_WRITE;
+	entry_info.update = update;
+	entry_info.data = data;
+	entry_info.epoch_id = data->epoch_id;
+	entry_info.trans_id = data->trans_id;
+	entry_info.inplace = 0;
+
+	ret = nova_append_log_entry(sb, pi, inode, sih, &entry_info);
+	if (ret)
+		nova_err(sb, "%s failed\n", __func__);
+
+	NOVA_END_TIMING(append_file_entry_t, append_time);
+	return ret;
+}
+
+int nova_append_mmap_entry(struct super_block *sb, struct nova_inode *pi,
+	struct inode *inode, struct nova_mmap_entry *data,
+	struct nova_inode_update *update, struct vma_item *item)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode inode_copy;
+	struct nova_log_entry_info entry_info;
+	timing_t append_time;
+	int ret;
+
+	NOVA_START_TIMING(append_mmap_entry_t, append_time);
+
+	nova_update_entry_csum(data);
+
+	entry_info.type = MMAP_WRITE;
+	entry_info.update = update;
+	entry_info.data = data;
+	entry_info.epoch_id = data->epoch_id;
+
+	if (nova_check_inode_integrity(sb, sih->ino, sih->pi_addr,
+			sih->alter_pi_addr, &inode_copy, 0) < 0) {
+		ret = -EIO;
+		goto out;
+	}
+
+	ret = nova_append_log_entry(sb, pi, inode, sih, &entry_info);
+	if (ret)
+		nova_err(sb, "%s failed\n", __func__);
+
+	item->mmap_entry = entry_info.curr_p;
+out:
+	NOVA_END_TIMING(append_mmap_entry_t, append_time);
+	return ret;
+}
+
+int nova_append_snapshot_info_entry(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info *si,
+	struct snapshot_info *info, struct nova_snapshot_info_entry *data,
+	struct nova_inode_update *update)
+{
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode inode_copy;
+	struct nova_log_entry_info entry_info;
+	timing_t append_time;
+	int ret;
+
+	NOVA_START_TIMING(append_snapshot_info_t, append_time);
+
+	nova_update_entry_csum(data);
+
+	entry_info.type = SNAPSHOT_INFO;
+	entry_info.update = update;
+	entry_info.data = data;
+	entry_info.epoch_id = data->epoch_id;
+	entry_info.inplace = 0;
+
+	if (nova_check_inode_integrity(sb, sih->ino, sih->pi_addr,
+			sih->alter_pi_addr, &inode_copy, 0) < 0) {
+		ret = -EIO;
+		goto out;
+	}
+
+	ret = nova_append_log_entry(sb, pi, NULL, sih, &entry_info);
+	if (ret)
+		nova_err(sb, "%s failed\n", __func__);
+
+	info->snapshot_entry = entry_info.curr_p;
+out:
+	NOVA_END_TIMING(append_snapshot_info_t, append_time);
+	return ret;
+}
+
+int nova_append_dentry(struct super_block *sb, struct nova_inode *pi,
+	struct inode *dir, struct dentry *dentry, u64 ino,
+	unsigned short de_len, struct nova_inode_update *update,
+	int link_change, u64 epoch_id)
+{
+	struct nova_inode_info *si = NOVA_I(dir);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode inode_copy;
+	struct nova_log_entry_info entry_info;
+	timing_t append_time;
+	int ret;
+
+	NOVA_START_TIMING(append_dir_entry_t, append_time);
+
+	entry_info.type = DIR_LOG;
+	entry_info.update = update;
+	entry_info.data = dentry;
+	entry_info.ino = ino;
+	entry_info.link_change = link_change;
+	entry_info.file_size = de_len;
+	entry_info.epoch_id = epoch_id;
+	entry_info.trans_id = sih->trans_id;
+	entry_info.inplace = 0;
+
+	/* nova_inode tail pointer will be updated and we make sure all other
+	 * inode fields are good before checksumming the whole structure
+	 */
+	if (nova_check_inode_integrity(sb, sih->ino, sih->pi_addr,
+			sih->alter_pi_addr, &inode_copy, 0) < 0) {
+		ret = -EIO;
+		goto out;
+	}
+
+	ret = nova_append_log_entry(sb, pi, dir, sih, &entry_info);
+	if (ret)
+		nova_err(sb, "%s failed\n", __func__);
+
+	dir->i_blocks = sih->i_blocks;
+out:
+	NOVA_END_TIMING(append_dir_entry_t, append_time);
+	return ret;
+}
+
+int nova_update_alter_pages(struct super_block *sb, struct nova_inode *pi,
+	u64 curr, u64 alter_curr)
+{
+	if (curr == 0 || alter_curr == 0 || metadata_csum == 0)
+		return 0;
+
+	while (curr && alter_curr) {
+		nova_set_alter_page_address(sb, curr, alter_curr);
+		curr = next_log_page(sb, curr);
+		alter_curr = next_log_page(sb, alter_curr);
+	}
+
+	if (curr || alter_curr)
+		nova_dbg("%s: curr 0x%llx, alter_curr 0x%llx\n",
+					__func__, curr, alter_curr);
+
+	return 0;
+}
+
+static int nova_coalesce_log_pages(struct super_block *sb,
+	unsigned long prev_blocknr, unsigned long first_blocknr,
+	unsigned long num_pages)
+{
+	unsigned long next_blocknr;
+	u64 curr_block, next_page;
+	struct nova_inode_log_page *curr_page;
+	int i;
+
+	if (prev_blocknr) {
+		/* Link prev block and newly allocated head block */
+		curr_block = nova_get_block_off(sb, prev_blocknr,
+						NOVA_BLOCK_TYPE_4K);
+		curr_page = (struct nova_inode_log_page *)
+				nova_get_block(sb, curr_block);
+		next_page = nova_get_block_off(sb, first_blocknr,
+				NOVA_BLOCK_TYPE_4K);
+		nova_memunlock_block(sb, curr_page);
+		nova_set_next_page_address(sb, curr_page, next_page, 0);
+		nova_memlock_block(sb, curr_page);
+	}
+
+	next_blocknr = first_blocknr + 1;
+	curr_block = nova_get_block_off(sb, first_blocknr,
+						NOVA_BLOCK_TYPE_4K);
+	curr_page = (struct nova_inode_log_page *)
+				nova_get_block(sb, curr_block);
+	for (i = 0; i < num_pages - 1; i++) {
+		next_page = nova_get_block_off(sb, next_blocknr,
+				NOVA_BLOCK_TYPE_4K);
+		nova_memunlock_block(sb, curr_page);
+		nova_set_page_num_entries(sb, curr_page, 0, 0);
+		nova_set_page_invalid_entries(sb, curr_page, 0, 0);
+		nova_set_next_page_address(sb, curr_page, next_page, 0);
+		nova_memlock_block(sb, curr_page);
+		curr_page++;
+		next_blocknr++;
+	}
+
+	/* Last page */
+	nova_memunlock_block(sb, curr_page);
+	nova_set_page_num_entries(sb, curr_page, 0, 0);
+	nova_set_page_invalid_entries(sb, curr_page, 0, 0);
+	nova_set_next_page_address(sb, curr_page, 0, 1);
+	nova_memlock_block(sb, curr_page);
+	return 0;
+}
+
+/* Log block resides in NVMM */
+int nova_allocate_inode_log_pages(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long num_pages,
+	u64 *new_block, int cpuid, enum nova_alloc_direction from_tail)
+{
+	unsigned long new_inode_blocknr;
+	unsigned long first_blocknr;
+	unsigned long prev_blocknr;
+	int allocated;
+	int ret_pages = 0;
+
+	allocated = nova_new_log_blocks(sb, sih, &new_inode_blocknr,
+			num_pages, ALLOC_NO_INIT, cpuid, from_tail);
+
+	if (allocated <= 0) {
+		nova_err(sb, "ERROR: no inode log page available: %lu %d\n",
+			num_pages, allocated);
+		return allocated;
+	}
+	ret_pages += allocated;
+	num_pages -= allocated;
+	nova_dbg_verbose("Pi %lu: Alloc %d log blocks @ 0x%lx\n",
+			sih->ino, allocated, new_inode_blocknr);
+
+	/* Coalesce the pages */
+	nova_coalesce_log_pages(sb, 0, new_inode_blocknr, allocated);
+	first_blocknr = new_inode_blocknr;
+	prev_blocknr = new_inode_blocknr + allocated - 1;
+
+	/* Allocate remaining pages */
+	while (num_pages) {
+		allocated = nova_new_log_blocks(sb, sih,
+					&new_inode_blocknr, num_pages,
+					ALLOC_NO_INIT, cpuid, from_tail);
+
+		nova_dbg_verbose("Alloc %d log blocks @ 0x%lx\n",
+					allocated, new_inode_blocknr);
+		if (allocated <= 0) {
+			nova_dbg("%s: no inode log page available: %lu %d\n",
+				__func__, num_pages, allocated);
+			/* Return whatever we have */
+			break;
+		}
+		ret_pages += allocated;
+		num_pages -= allocated;
+		nova_coalesce_log_pages(sb, prev_blocknr, new_inode_blocknr,
+						allocated);
+		prev_blocknr = new_inode_blocknr + allocated - 1;
+	}
+
+	*new_block = nova_get_block_off(sb, first_blocknr,
+						NOVA_BLOCK_TYPE_4K);
+
+	return ret_pages;
+}
+
+static int nova_initialize_inode_log(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	int log_id)
+{
+	u64 new_block;
+	int allocated;
+
+	allocated = nova_allocate_inode_log_pages(sb, sih,
+					1, &new_block, ANY_CPU,
+					log_id == MAIN_LOG ? 0 : 1);
+	if (allocated != 1) {
+		nova_err(sb, "%s ERROR: no inode log page available\n",
+					__func__);
+		return -ENOSPC;
+	}
+
+	nova_memunlock_inode(sb, pi);
+	if (log_id == MAIN_LOG) {
+		pi->log_tail = new_block;
+		nova_flush_buffer(&pi->log_tail, CACHELINE_SIZE, 0);
+		pi->log_head = new_block;
+		sih->log_head = sih->log_tail = new_block;
+		sih->log_pages = 1;
+		nova_flush_buffer(&pi->log_head, CACHELINE_SIZE, 1);
+	} else {
+		pi->alter_log_tail = new_block;
+		nova_flush_buffer(&pi->alter_log_tail, CACHELINE_SIZE, 0);
+		pi->alter_log_head = new_block;
+		sih->alter_log_head = sih->alter_log_tail = new_block;
+		sih->log_pages++;
+		nova_flush_buffer(&pi->alter_log_head, CACHELINE_SIZE, 1);
+	}
+	nova_update_inode_checksum(pi);
+	nova_memlock_inode(sb, pi);
+
+	return 0;
+}
+
+/*
+ * Extend the log.  If the log is less than EXTEND_THRESHOLD pages, double its
+ * allocated size.  Otherwise, increase by EXTEND_THRESHOLD. Then, do GC.
+ */
+static u64 nova_extend_inode_log(struct super_block *sb, struct nova_inode *pi,
+	struct nova_inode_info_header *sih, u64 curr_p)
+{
+	u64 new_block, alter_new_block = 0;
+	int allocated;
+	unsigned long num_pages;
+	int ret;
+
+	nova_dbgv("%s: inode %lu, curr 0x%llx\n", __func__, sih->ino, curr_p);
+
+	if (curr_p == 0) {
+		ret = nova_initialize_inode_log(sb, pi, sih, MAIN_LOG);
+		if (ret)
+			return 0;
+
+		if (metadata_csum) {
+			ret = nova_initialize_inode_log(sb, pi, sih, ALTER_LOG);
+			if (ret)
+				return 0;
+
+			nova_memunlock_inode(sb, pi);
+			nova_update_alter_pages(sb, pi, sih->log_head,
+							sih->alter_log_head);
+			nova_memlock_inode(sb, pi);
+		}
+
+		return sih->log_head;
+	}
+
+	num_pages = sih->log_pages >= EXTEND_THRESHOLD ?
+				EXTEND_THRESHOLD : sih->log_pages;
+//	nova_dbg("Before append log pages:\n");
+//	nova_print_inode_log_page(sb, inode);
+	allocated = nova_allocate_inode_log_pages(sb, sih,
+					num_pages, &new_block, ANY_CPU, 0);
+	nova_dbg_verbose("Link block %llu to block %llu\n",
+					curr_p >> PAGE_SHIFT,
+					new_block >> PAGE_SHIFT);
+	if (allocated <= 0) {
+		nova_err(sb, "%s ERROR: no inode log page available\n",
+					__func__);
+		nova_dbg("curr_p 0x%llx, %lu pages\n", curr_p,
+					sih->log_pages);
+		return 0;
+	}
+
+	if (metadata_csum) {
+		allocated = nova_allocate_inode_log_pages(sb, sih,
+				num_pages, &alter_new_block, ANY_CPU, 1);
+		if (allocated <= 0) {
+			nova_err(sb, "%s ERROR: no inode log page available\n",
+					__func__);
+			nova_dbg("curr_p 0x%llx, %lu pages\n", curr_p,
+					sih->log_pages);
+			return 0;
+		}
+
+		nova_memunlock_inode(sb, pi);
+		nova_update_alter_pages(sb, pi, new_block, alter_new_block);
+		nova_memlock_inode(sb, pi);
+	}
+
+
+	nova_inode_log_fast_gc(sb, pi, sih, curr_p,
+			       new_block, alter_new_block, allocated, 0);
+
+//	nova_dbg("After append log pages:\n");
+//	nova_print_inode_log_page(sb, inode);
+	/* Atomic switch to new log */
+//	nova_switch_to_new_log(sb, pi, new_block, num_pages);
+
+	return new_block;
+}
+
+/* For thorough GC, simply append one more page */
+static u64 nova_append_one_log_page(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 curr_p)
+{
+	struct nova_inode_log_page *curr_page;
+	u64 new_block;
+	u64 curr_block;
+	int allocated;
+
+	allocated = nova_allocate_inode_log_pages(sb, sih, 1, &new_block,
+							ANY_CPU, 0);
+	if (allocated != 1) {
+		nova_err(sb, "%s: ERROR: no inode log page available\n",
+				__func__);
+		return 0;
+	}
+
+	if (curr_p == 0) {
+		curr_p = new_block;
+	} else {
+		/* Link prev block and newly allocated head block */
+		curr_block = BLOCK_OFF(curr_p);
+		curr_page = (struct nova_inode_log_page *)
+				nova_get_block(sb, curr_block);
+		nova_memunlock_block(sb, curr_page);
+		nova_set_next_page_address(sb, curr_page, new_block, 1);
+		nova_memlock_block(sb, curr_page);
+	}
+
+	return curr_p;
+}
+
+u64 nova_get_append_head(struct super_block *sb, struct nova_inode *pi,
+	struct nova_inode_info_header *sih, u64 tail, size_t size, int log_id,
+	int thorough_gc, int *extended)
+{
+	u64 curr_p;
+
+	if (tail)
+		curr_p = tail;
+	else if (log_id == MAIN_LOG)
+		curr_p = sih->log_tail;
+	else
+		curr_p = sih->alter_log_tail;
+
+	if (curr_p == 0 || (is_last_entry(curr_p, size) &&
+				next_log_page(sb, curr_p) == 0)) {
+		if (is_last_entry(curr_p, size)) {
+			nova_memunlock_block(sb, nova_get_block(sb, curr_p));
+			nova_set_next_page_flag(sb, curr_p);
+			nova_memlock_block(sb, nova_get_block(sb, curr_p));
+		}
+
+		/* Alternate log should not go here */
+		if (log_id != MAIN_LOG)
+			return 0;
+
+		if (thorough_gc == 0) {
+			curr_p = nova_extend_inode_log(sb, pi, sih, curr_p);
+		} else {
+			curr_p = nova_append_one_log_page(sb, sih, curr_p);
+			/* For thorough GC */
+			*extended = 1;
+		}
+
+		if (curr_p == 0)
+			return 0;
+	}
+
+	if (is_last_entry(curr_p, size)) {
+		nova_memunlock_block(sb, nova_get_block(sb, curr_p));
+		nova_set_next_page_flag(sb, curr_p);
+		nova_memlock_block(sb, nova_get_block(sb, curr_p));
+		curr_p = next_log_page(sb, curr_p);
+	}
+
+	return curr_p;
+}
+
+int nova_free_contiguous_log_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 head)
+{
+	unsigned long blocknr, start_blocknr = 0;
+	u64 curr_block = head;
+	u8 btype = sih->i_blk_type;
+	int num_free = 0;
+	int freed = 0;
+
+	while (curr_block > 0) {
+		if (ENTRY_LOC(curr_block)) {
+			nova_dbg("%s: ERROR: invalid block %llu\n",
+					__func__, curr_block);
+			break;
+		}
+
+		blocknr = nova_get_blocknr(sb, le64_to_cpu(curr_block),
+				    btype);
+		nova_dbg_verbose("%s: free page %llu\n", __func__, curr_block);
+		curr_block = next_log_page(sb, curr_block);
+
+		if (start_blocknr == 0) {
+			start_blocknr = blocknr;
+			num_free = 1;
+		} else {
+			if (blocknr == start_blocknr + num_free) {
+				num_free++;
+			} else {
+				/* A new start */
+				nova_free_log_blocks(sb, sih, start_blocknr,
+							num_free);
+				freed += num_free;
+				start_blocknr = blocknr;
+				num_free = 1;
+			}
+		}
+	}
+	if (start_blocknr) {
+		nova_free_log_blocks(sb, sih, start_blocknr, num_free);
+		freed += num_free;
+	}
+
+	return freed;
+}
+
+int nova_free_inode_log(struct super_block *sb, struct nova_inode *pi,
+	struct nova_inode_info_header *sih)
+{
+	int freed = 0;
+	timing_t free_time;
+
+	if (sih->log_head == 0 || sih->log_tail == 0)
+		return 0;
+
+	NOVA_START_TIMING(free_inode_log_t, free_time);
+
+	/* The inode is invalid now, no need to fence */
+	if (pi) {
+		nova_memunlock_inode(sb, pi);
+		pi->log_head = pi->log_tail = 0;
+		pi->alter_log_head = pi->alter_log_tail = 0;
+		nova_flush_buffer(&pi->log_head, CACHELINE_SIZE, 0);
+		nova_memlock_inode(sb, pi);
+	}
+
+	freed = nova_free_contiguous_log_blocks(sb, sih, sih->log_head);
+	if (metadata_csum)
+		freed += nova_free_contiguous_log_blocks(sb, sih,
+					sih->alter_log_head);
+
+	NOVA_END_TIMING(free_inode_log_t, free_time);
+	return 0;
+}
diff --git a/fs/nova/log.h b/fs/nova/log.h
new file mode 100644
index 000000000000..ee6c8927dd33
--- /dev/null
+++ b/fs/nova/log.h
@@ -0,0 +1,333 @@
+#ifndef __LOG_H
+#define __LOG_H
+
+#include "balloc.h"
+#include "inode.h"
+
+/* ======================= Log entry ========================= */
+/* Inode entry in the log */
+
+#define	MAIN_LOG	0
+#define	ALTER_LOG	1
+
+#define	PAGE_OFFSET_MASK	4095
+#define	BLOCK_OFF(p)	((p) & ~PAGE_OFFSET_MASK)
+
+#define	ENTRY_LOC(p)	((p) & PAGE_OFFSET_MASK)
+
+#define	LOG_BLOCK_TAIL	4064
+#define	PAGE_TAIL(p)	(BLOCK_OFF(p) + LOG_BLOCK_TAIL)
+
+/*
+ * Log page state and pointers to next page and the replica page
+ */
+struct nova_inode_page_tail {
+	__le32	invalid_entries;
+	__le32	num_entries;
+	__le64	epoch_id;	/* For snapshot list page */
+	__le64	alter_page;	/* Corresponding page in the other log */
+	__le64	next_page;
+} __attribute((__packed__));
+
+/* Fit in PAGE_SIZE */
+struct	nova_inode_log_page {
+	char padding[LOG_BLOCK_TAIL];
+	struct nova_inode_page_tail page_tail;
+} __attribute((__packed__));
+
+#define	EXTEND_THRESHOLD	256
+
+enum nova_entry_type {
+	FILE_WRITE = 1,
+	DIR_LOG,
+	SET_ATTR,
+	LINK_CHANGE,
+	MMAP_WRITE,
+	SNAPSHOT_INFO,
+	NEXT_PAGE,
+};
+
+static inline u8 nova_get_entry_type(void *p)
+{
+	u8 type;
+	int rc;
+
+	rc = memcpy_mcsafe(&type, p, sizeof(u8));
+	if (rc)
+		return rc;
+
+	return type;
+}
+
+static inline void nova_set_entry_type(void *p, enum nova_entry_type type)
+{
+	*(u8 *)p = type;
+}
+
+/*
+ * Write log entry.  Records a write to a contiguous range of PMEM pages.
+ *
+ * Documentation/filesystems/nova.txt contains descriptions of some fields.
+ */
+struct nova_file_write_entry {
+	u8	entry_type;
+	u8	reassigned;	/* Data is not latest */
+	u8	updating;	/* Data is being written */
+	u8	padding;
+	__le32	num_pages;
+	__le64	block;          /* offset of first block in this write */
+	__le64	pgoff;          /* file offset at the beginning of this write */
+	__le32	invalid_pages;	/* For GC */
+	/* For both ctime and mtime */
+	__le32	mtime;
+	__le64	size;           /* Write size for non-aligned writes */
+	__le64	epoch_id;
+	__le64	trans_id;
+	__le32	csumpadding;
+	__le32	csum;
+} __attribute((__packed__));
+
+#define WENTRY(entry)	((struct nova_file_write_entry *) entry)
+
+/*
+ * Log entry for adding a file/directory to a directory.
+ *
+ * Update NOVA_DENTRY_HEADER_LEN (and thus NOVA_DIR_LOG_REC_LEN) if you modify this struct!
+ */
+struct nova_dentry {
+	u8	entry_type;
+	u8	name_len;		/* length of the dentry name */
+	u8	reassigned;		/* Currently deleted */
+	u8	invalid;		/* Invalid now? */
+	__le16	de_len;			/* length of this dentry */
+	__le16	links_count;
+	__le32	mtime;			/* For both mtime and ctime */
+	__le32	csum;			/* entry checksum */
+	__le64	ino;			/* inode no pointed to by this entry */
+	__le64	padding;
+	__le64	epoch_id;
+	__le64	trans_id;
+	char	name[NOVA_NAME_LEN + 1];	/* File name */
+} __attribute((__packed__));
+
+#define DENTRY(entry)	((struct nova_dentry *) entry)
+
+#define NOVA_DIR_PAD			8	/* Align to 8 bytes boundary */
+#define NOVA_DIR_ROUND			(NOVA_DIR_PAD - 1)
+#define NOVA_DENTRY_HEADER_LEN		48
+#define NOVA_DIR_LOG_REC_LEN(name_len) \
+	(((name_len + 1) + NOVA_DENTRY_HEADER_LEN \
+	 + NOVA_DIR_ROUND) & ~NOVA_DIR_ROUND)
+
+#define NOVA_MAX_ENTRY_LEN		NOVA_DIR_LOG_REC_LEN(NOVA_NAME_LEN)
+
+/*
+ * Log entry for updating file attributes.
+ */
+struct nova_setattr_logentry {
+	u8	entry_type;
+	u8	attr;       /* bitmap of which attributes to update */
+	__le16	mode;
+	__le32	uid;
+	__le32	gid;
+	__le32	atime;
+	__le32	mtime;
+	__le32	ctime;
+	__le64	size;        /* File size after truncation */
+	__le64	epoch_id;
+	__le64	trans_id;
+	u8	invalid;
+	u8	paddings[3];
+	__le32	csum;
+} __attribute((__packed__));
+
+#define SENTRY(entry)	((struct nova_setattr_logentry *) entry)
+
+/* Link change log entry.
+ *
+ * TODO: Do we need this to be 32 bytes?
+ */
+struct nova_link_change_entry {
+	u8	entry_type;
+	u8	invalid;
+	__le16	links;
+	__le32	ctime;
+	__le32	flags;
+	__le32	generation;    /* for NFS handles */
+	__le64	epoch_id;
+	__le64	trans_id;
+	__le32	csumpadding;
+	__le32	csum;
+} __attribute((__packed__));
+
+#define LCENTRY(entry)	((struct nova_link_change_entry *) entry)
+
+/*
+ * MMap entry.  Records the fact that a region of the file is mmapped, so
+ * parity and checksums are inoperative.
+ */
+struct nova_mmap_entry {
+	u8	entry_type;
+	u8	invalid;
+	u8	paddings[6];
+	__le64	epoch_id;
+	__le64	pgoff;
+	__le64	num_pages;
+	__le32	csumpadding;
+	__le32	csum;
+} __attribute((__packed__));
+
+#define MMENTRY(entry)	((struct nova_mmap_entry *) entry)
+
+/*
+ * Log entry for the creation of a snapshot.  Only occurs in the log of the
+ * dedicated snapshot inode.
+ */
+struct nova_snapshot_info_entry {
+	u8	type;
+	u8	deleted;
+	u8	paddings[6];
+	__le64	epoch_id;
+	__le64	timestamp;
+	__le64	nvmm_page_addr;
+	__le32	csumpadding;
+	__le32	csum;
+} __attribute((__packed__));
+
+#define SNENTRY(entry)	((struct nova_snapshot_info_entry *) entry)
+
+
+/*
+ * Transient DRAM structure that describes changes needed to append a log entry
+ * to an inode
+ */
+struct nova_inode_update {
+	u64 head;
+	u64 alter_head;
+	u64 tail;
+	u64 alter_tail;
+	u64 curr_entry;
+	u64 alter_entry;
+	struct nova_dentry *create_dentry;
+	struct nova_dentry *delete_dentry;
+};
+
+
+/*
+ * Transient DRAM structure to parameterize the creation of a log entry.
+ */
+struct nova_log_entry_info {
+	enum nova_entry_type type;
+	struct iattr *attr;
+	struct nova_inode_update *update;
+	void *data;	/* struct dentry */
+	u64 epoch_id;
+	u64 trans_id;
+	u64 curr_p;	/* output */
+	u64 file_size;	/* de_len for dentry */
+	u64 ino;
+	u32 time;
+	int link_change;
+	int inplace;	/* For file write entry */
+};
+
+
+
+static inline size_t nova_get_log_entry_size(struct super_block *sb,
+	enum nova_entry_type type)
+{
+	size_t size = 0;
+
+	switch (type) {
+	case FILE_WRITE:
+		size = sizeof(struct nova_file_write_entry);
+		break;
+	case DIR_LOG:
+		size = NOVA_DENTRY_HEADER_LEN;
+		break;
+	case SET_ATTR:
+		size = sizeof(struct nova_setattr_logentry);
+		break;
+	case LINK_CHANGE:
+		size = sizeof(struct nova_link_change_entry);
+		break;
+	case MMAP_WRITE:
+		size = sizeof(struct nova_mmap_entry);
+		break;
+	case SNAPSHOT_INFO:
+		size = sizeof(struct nova_snapshot_info_entry);
+		break;
+	default:
+		break;
+	}
+
+	return size;
+}
+
+
+int nova_invalidate_logentry(struct super_block *sb, void *entry,
+	enum nova_entry_type type, unsigned int num_free);
+int nova_reassign_logentry(struct super_block *sb, void *entry,
+	enum nova_entry_type type);
+int nova_inplace_update_log_entry(struct super_block *sb,
+	struct inode *inode, void *entry,
+	struct nova_log_entry_info *entry_info);
+void nova_clear_last_page_tail(struct super_block *sb,
+	struct inode *inode, loff_t newsize);
+unsigned int nova_free_old_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry,
+	unsigned long pgoff, unsigned int num_free,
+	bool delete_dead, u64 epoch_id);
+int nova_free_inode_log(struct super_block *sb, struct nova_inode *pi,
+	struct nova_inode_info_header *sih);
+int nova_update_alter_pages(struct super_block *sb, struct nova_inode *pi,
+	u64 curr, u64 alter_curr);
+struct nova_file_write_entry *nova_find_next_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, pgoff_t pgoff);
+int nova_allocate_inode_log_pages(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long num_pages,
+	u64 *new_block, int cpuid, enum nova_alloc_direction from_tail);
+int nova_free_contiguous_log_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 head);
+u64 nova_get_append_head(struct super_block *sb, struct nova_inode *pi,
+	struct nova_inode_info_header *sih, u64 tail, size_t size, int log_id,
+	int thorough_gc, int *extended);
+int nova_handle_setattr_operation(struct super_block *sb, struct inode *inode,
+	struct nova_inode *pi, unsigned int ia_valid, struct iattr *attr,
+	u64 epoch_id);
+int nova_invalidate_link_change_entry(struct super_block *sb,
+	u64 old_link_change);
+int nova_append_link_change_entry(struct super_block *sb,
+	struct nova_inode *pi, struct inode *inode,
+	struct nova_inode_update *update, u64 *old_linkc, u64 epoch_id);
+int nova_set_write_entry_updating(struct super_block *sb,
+	struct nova_file_write_entry *entry, int set);
+int nova_inplace_update_write_entry(struct super_block *sb,
+	struct inode *inode, struct nova_file_write_entry *entry,
+	struct nova_log_entry_info *entry_info);
+int nova_append_mmap_entry(struct super_block *sb, struct nova_inode *pi,
+	struct inode *inode, struct nova_mmap_entry *data,
+	struct nova_inode_update *update, struct vma_item *item);
+int nova_append_file_write_entry(struct super_block *sb, struct nova_inode *pi,
+	struct inode *inode, struct nova_file_write_entry *data,
+	struct nova_inode_update *update);
+int nova_append_snapshot_info_entry(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info *si,
+	struct snapshot_info *info, struct nova_snapshot_info_entry *data,
+	struct nova_inode_update *update);
+int nova_assign_write_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry,
+	struct nova_file_write_entry *entryc, bool free);
+
+
+void nova_print_curr_log_page(struct super_block *sb, u64 curr);
+void nova_print_nova_log(struct super_block *sb,
+	struct nova_inode_info_header *sih);
+int nova_get_nova_log_pages(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_inode *pi);
+void nova_print_nova_log_pages(struct super_block *sb,
+	struct nova_inode_info_header *sih);
+
+#endif


* [RFC 05/16] NOVA: Log data structures and operations
@ 2017-08-03  7:48   ` Steven Swanson
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson, dan.j.williams

NOVA maintains a log for each inode that records updates to the inode's
metadata and holds pointers to the file data.  NOVA makes updates to file data
and metadata atomic by atomically appending log entries to the log.

Each inode contains pointers to the head and tail of the inode's log.  When the
log grows past the end of the last page, NOVA allocates additional space.  For
short logs (less than 1MB), it doubles the length.  For longer logs, it adds a
fixed amount of additional space (1MB).
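
The growth policy can be sketched in plain C.  This is an illustrative
userspace sketch rather than the kernel code: pages_to_add() is a hypothetical
helper name and the 4KB log page size is assumed from the cover letter, while
EXTEND_THRESHOLD matches the value defined in log.h.

```c
#include <assert.h>

#define LOG_PAGE_SIZE		4096UL	/* assumed 4KB log pages */
#define EXTEND_THRESHOLD	256UL	/* pages, as defined in log.h */

/* 256 pages of 4KB each give the 1MB limit mentioned above. */
static_assert(EXTEND_THRESHOLD * LOG_PAGE_SIZE == 1UL << 20,
	      "EXTEND_THRESHOLD should correspond to 1MB");

/*
 * Hypothetical helper: how many pages to add to a log that currently
 * holds log_pages pages.  Short logs double; long logs grow by a fixed
 * EXTEND_THRESHOLD pages.
 */
static unsigned long pages_to_add(unsigned long log_pages)
{
	return log_pages >= EXTEND_THRESHOLD ? EXTEND_THRESHOLD : log_pages;
}
```

Under these assumptions a 128-page log grows to 256 pages, while a 1000-page
log grows to 1256 pages.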

Log space is reclaimed during garbage collection.

Log Entries
-----------

There are seven kinds of log entry (enum nova_entry_type), documented in log.h.
The log entries have several fields in common:

   1.  'epoch_id' gives the epoch during which the log entry was created.
   Creating a snapshot increments the epoch_id for the file system.

   2.  'trans_id' is a filesystem-wide, monotonically increasing number
   assigned to each log entry.  It provides an ordering over all FS operations.

   3.  'invalid' is true if the effects of this entry are dead and the log
   entry can be garbage collected.

   4.  'csum' is a CRC32 checksum for the entry.
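
As a rough illustration of how invalidation feeds garbage collection: a file
write entry covers several pages, so the patch tracks it with an invalid_pages
counter and treats the entry as dead once every page it covers has been
invalidated.  The sketch below mirrors that rule in userspace C.  The names
struct write_entry_sketch and invalidate_pages() are hypothetical and not part
of the patch, and the bounds check is added only for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for struct nova_file_write_entry. */
struct write_entry_sketch {
	uint32_t num_pages;	/* pages covered by this write */
	uint32_t invalid_pages;	/* pages since overwritten or freed */
};

/*
 * Mark num_free more pages of the entry invalid.  Returns 1 once the
 * whole entry is dead and may be garbage collected, 0 otherwise.
 */
static int invalidate_pages(struct write_entry_sketch *e, uint32_t num_free)
{
	/* Sanity check added for this sketch only. */
	assert(e->invalid_pages + num_free <= e->num_pages);

	e->invalid_pages += num_free;
	return e->invalid_pages == e->num_pages;
}
```

The other entry types in the patch carry a single 'invalid' (or 'deleted')
byte instead of a counter, since they do not cover a range of pages.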

Log structure
-------------

The logs comprise a linked list of PMEM blocks.  The tail of each block
contains some metadata about the block and pointers to the next block and to
the block's replica (struct nova_inode_page_tail).

+----------------+
| log entry      |
+----------------+
| log entry      |
+----------------+
| ...            |
+----------------+
| tail           |
|  metadata      |
|  -> next block |
+----------------+
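
The page layout can be checked with a small userspace sketch.  The structures
and offsets below mirror struct nova_inode_page_tail and the address macros in
log.h, with uintN_t standing in for the kernel's __le types (so this assumes a
little-endian host).  The *_sketch names and the lowercase helpers are
hypothetical, and the packed attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>

#define LOG_PAGE_SIZE	4096	/* assumed 4KB log pages */
#define LOG_BLOCK_TAIL	4064	/* offset of the tail within a page (log.h) */

/* Mirrors struct nova_inode_page_tail. */
struct page_tail_sketch {
	uint32_t invalid_entries;
	uint32_t num_entries;
	uint64_t epoch_id;	/* for snapshot list pages */
	uint64_t alter_page;	/* corresponding page in the other log */
	uint64_t next_page;
} __attribute__((__packed__));

/* Mirrors struct nova_inode_log_page: entry space followed by the tail. */
struct log_page_sketch {
	char entries[LOG_BLOCK_TAIL];
	struct page_tail_sketch tail;
} __attribute__((__packed__));

static_assert(sizeof(struct page_tail_sketch) == 32, "tail is 32 bytes");
static_assert(sizeof(struct log_page_sketch) == LOG_PAGE_SIZE, "page is 4KB");

/* Start of the page containing log address p (cf. BLOCK_OFF). */
static uint64_t block_off(uint64_t p)
{
	return p & ~UINT64_C(4095);
}

/* Offset of p within its page (cf. ENTRY_LOC). */
static uint64_t entry_loc(uint64_t p)
{
	return p & UINT64_C(4095);
}

/* Address of the tail of the page containing p (cf. PAGE_TAIL). */
static uint64_t page_tail(uint64_t p)
{
	return block_off(p) + LOG_BLOCK_TAIL;
}
```

A pointer to a whole log page has an ENTRY_LOC of zero, which is why
nova_free_contiguous_log_blocks() rejects a page address with a nonzero
ENTRY_LOC.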

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 fs/nova/log.c | 1411 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/log.h |  333 +++++++++++++
 2 files changed, 1744 insertions(+)
 create mode 100644 fs/nova/log.c
 create mode 100644 fs/nova/log.h

diff --git a/fs/nova/log.c b/fs/nova/log.c
new file mode 100644
index 000000000000..2c3c9aa18043
--- /dev/null
+++ b/fs/nova/log.c
@@ -0,0 +1,1411 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Log methods
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include "nova.h"
+#include "journal.h"
+#include "inode.h"
+#include "log.h"
+
+static int nova_execute_invalidate_reassign_logentry(struct super_block *sb,
+	void *entry, enum nova_entry_type type, int reassign,
+	unsigned int num_free)
+{
+	struct nova_file_write_entry *fw_entry;
+	int invalid = 0;
+
+	switch (type) {
+	case FILE_WRITE:
+		fw_entry = (struct nova_file_write_entry *)entry;
+		if (reassign)
+			fw_entry->reassigned = 1;
+		if (num_free)
+			fw_entry->invalid_pages += num_free;
+		if (fw_entry->invalid_pages == fw_entry->num_pages)
+			invalid = 1;
+		break;
+	case DIR_LOG:
+		if (reassign) {
+			((struct nova_dentry *)entry)->reassigned = 1;
+		} else {
+			((struct nova_dentry *)entry)->invalid = 1;
+			invalid = 1;
+		}
+		break;
+	case SET_ATTR:
+		((struct nova_setattr_logentry *)entry)->invalid = 1;
+		invalid = 1;
+		break;
+	case LINK_CHANGE:
+		((struct nova_link_change_entry *)entry)->invalid = 1;
+		invalid = 1;
+		break;
+	case MMAP_WRITE:
+		((struct nova_mmap_entry *)entry)->invalid = 1;
+		invalid = 1;
+		break;
+	case SNAPSHOT_INFO:
+		((struct nova_snapshot_info_entry *)entry)->deleted = 1;
+		invalid = 1;
+		break;
+	default:
+		break;
+	}
+
+	if (invalid) {
+		u64 addr = nova_get_addr_off(NOVA_SB(sb), entry);
+
+		nova_inc_page_invalid_entries(sb, addr);
+	}
+
+	nova_update_entry_csum(entry);
+	return 0;
+}
+
+static int nova_invalidate_reassign_logentry(struct super_block *sb,
+	void *entry, enum nova_entry_type type, int reassign,
+	unsigned int num_free)
+{
+	nova_memunlock_range(sb, entry, CACHELINE_SIZE);
+
+	nova_execute_invalidate_reassign_logentry(sb, entry, type,
+						reassign, num_free);
+	nova_update_alter_entry(sb, entry);
+	nova_memlock_range(sb, entry, CACHELINE_SIZE);
+
+	return 0;
+}
+
+int nova_invalidate_logentry(struct super_block *sb, void *entry,
+	enum nova_entry_type type, unsigned int num_free)
+{
+	return nova_invalidate_reassign_logentry(sb, entry, type, 0, num_free);
+}
+
+int nova_reassign_logentry(struct super_block *sb, void *entry,
+	enum nova_entry_type type)
+{
+	return nova_invalidate_reassign_logentry(sb, entry, type, 1, 0);
+}
+
+static inline int nova_invalidate_write_entry(struct super_block *sb,
+	struct nova_file_write_entry *entry, int reassign,
+	unsigned int num_free)
+{
+	struct nova_file_write_entry *entryc, entry_copy;
+
+	if (!entry)
+		return 0;
+
+	if (metadata_csum == 0)
+		entryc = entry;
+	else {
+		entryc = &entry_copy;
+		if (!nova_verify_entry_csum(sb, entry, entryc))
+			return -EIO;
+	}
+
+	if (num_free == 0 && entryc->reassigned == 1)
+		return 0;
+
+	return nova_invalidate_reassign_logentry(sb, entry, FILE_WRITE,
+							reassign, num_free);
+}
+
+unsigned int nova_free_old_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry,
+	unsigned long pgoff, unsigned int num_free,
+	bool delete_dead, u64 epoch_id)
+{
+	struct nova_file_write_entry *entryc, entry_copy;
+	unsigned long old_nvmm;
+	int ret;
+	timing_t free_time;
+
+	if (!entry)
+		return 0;
+
+	NOVA_START_TIMING(free_old_t, free_time);
+
+	if (metadata_csum == 0)
+		entryc = entry;
+	else {
+		entryc = &entry_copy;
+		if (!nova_verify_entry_csum(sb, entry, entryc))
+			return -EIO;
+	}
+
+	old_nvmm = get_nvmm(sb, sih, entryc, pgoff);
+
+	if (!delete_dead) {
+		ret = nova_append_data_to_snapshot(sb, entryc, old_nvmm,
+				num_free, epoch_id);
+		if (ret == 0) {
+			nova_invalidate_write_entry(sb, entry, 1, 0);
+			goto out;
+		}
+
+		nova_invalidate_write_entry(sb, entry, 1, num_free);
+	}
+
+	nova_dbgv("%s: pgoff %lu, free %u blocks\n",
+				__func__, pgoff, num_free);
+	nova_free_data_blocks(sb, sih, old_nvmm, num_free);
+
+out:
+	sih->i_blocks -= num_free;
+
+	NOVA_END_TIMING(free_old_t, free_time);
+	return num_free;
+}
+
+struct nova_file_write_entry *nova_find_next_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, pgoff_t pgoff)
+{
+	struct nova_file_write_entry *entry = NULL;
+	struct nova_file_write_entry *entries[1];
+	int nr_entries;
+
+	nr_entries = radix_tree_gang_lookup(&sih->tree,
+					(void **)entries, pgoff, 1);
+	if (nr_entries == 1)
+		entry = entries[0];
+
+	return entry;
+}
+
+/*
+ * Zero the tail of the last page.  Used on resize requests so that
+ * stale data is not exposed if the file grows again.
+ */
+void nova_clear_last_page_tail(struct super_block *sb,
+	struct inode *inode, loff_t newsize)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	unsigned long offset = newsize & (sb->s_blocksize - 1);
+	unsigned long pgoff, length;
+	u64 nvmm;
+	char *nvmm_addr;
+
+	if (offset == 0 || newsize > inode->i_size)
+		return;
+
+	length = sb->s_blocksize - offset;
+	pgoff = newsize >> sb->s_blocksize_bits;
+
+	nvmm = nova_find_nvmm_block(sb, sih, NULL, pgoff);
+	if (nvmm == 0)
+		return;
+
+	nvmm_addr = (char *)nova_get_block(sb, nvmm);
+	nova_memunlock_range(sb, nvmm_addr + offset, length);
+	memcpy_to_pmem_nocache(nvmm_addr + offset, sbi->zeroed_page, length);
+	nova_memlock_range(sb, nvmm_addr + offset, length);
+
+	if (data_csum > 0)
+		nova_update_truncated_block_csum(sb, inode, newsize);
+	if (data_parity > 0)
+		nova_update_truncated_block_parity(sb, inode, newsize);
+}
+
+static void nova_update_setattr_entry(struct inode *inode,
+	struct nova_setattr_logentry *entry,
+	struct nova_log_entry_info *entry_info)
+{
+	struct iattr *attr = entry_info->attr;
+	unsigned int ia_valid = attr->ia_valid, attr_mask;
+
+	/* These flags are in the lowest byte */
+	attr_mask = ATTR_MODE | ATTR_UID | ATTR_GID | ATTR_SIZE |
+			ATTR_ATIME | ATTR_MTIME | ATTR_CTIME;
+
+	entry->entry_type	= SET_ATTR;
+	entry->attr	= ia_valid & attr_mask;
+	entry->mode	= cpu_to_le16(inode->i_mode);
+	entry->uid	= cpu_to_le32(i_uid_read(inode));
+	entry->gid	= cpu_to_le32(i_gid_read(inode));
+	entry->atime	= cpu_to_le32(inode->i_atime.tv_sec);
+	entry->ctime	= cpu_to_le32(inode->i_ctime.tv_sec);
+	entry->mtime	= cpu_to_le32(inode->i_mtime.tv_sec);
+	entry->epoch_id = cpu_to_le64(entry_info->epoch_id);
+	entry->trans_id	= cpu_to_le64(entry_info->trans_id);
+	entry->invalid	= 0;
+
+	if (ia_valid & ATTR_SIZE)
+		entry->size = cpu_to_le64(attr->ia_size);
+	else
+		entry->size = cpu_to_le64(inode->i_size);
+
+	nova_update_entry_csum(entry);
+}
+
+static void nova_update_link_change_entry(struct inode *inode,
+	struct nova_link_change_entry *entry,
+	struct nova_log_entry_info *entry_info)
+{
+	entry->entry_type	= LINK_CHANGE;
+	entry->epoch_id		= cpu_to_le64(entry_info->epoch_id);
+	entry->trans_id		= cpu_to_le64(entry_info->trans_id);
+	entry->invalid		= 0;
+	entry->links		= cpu_to_le16(inode->i_nlink);
+	entry->ctime		= cpu_to_le32(inode->i_ctime.tv_sec);
+	entry->flags		= cpu_to_le32(inode->i_flags);
+	entry->generation	= cpu_to_le32(inode->i_generation);
+
+	nova_update_entry_csum(entry);
+}
+
+static int nova_update_write_entry(struct super_block *sb,
+	struct nova_file_write_entry *entry,
+	struct nova_log_entry_info *entry_info)
+{
+	entry->epoch_id = cpu_to_le64(entry_info->epoch_id);
+	entry->trans_id = cpu_to_le64(entry_info->trans_id);
+	entry->mtime = cpu_to_le32(entry_info->time);
+	entry->size = cpu_to_le64(entry_info->file_size);
+	entry->updating = 0;
+	nova_update_entry_csum(entry);
+	return 0;
+}
+
+static int nova_update_old_dentry(struct super_block *sb,
+	struct inode *dir, struct nova_dentry *dentry,
+	struct nova_log_entry_info *entry_info)
+{
+	unsigned short links_count;
+	int link_change = entry_info->link_change;
+	u64 addr;
+
+	dentry->epoch_id = cpu_to_le64(entry_info->epoch_id);
+	dentry->trans_id = cpu_to_le64(entry_info->trans_id);
+	/* Remove_dentry */
+	dentry->ino = cpu_to_le64(0);
+	dentry->invalid = 1;
+	dentry->mtime = cpu_to_le32(dir->i_mtime.tv_sec);
+
+	links_count = dir->i_nlink;
+	if (links_count == 0 && link_change == -1)
+		links_count = 0;
+	else
+		links_count += link_change;
+	dentry->links_count = cpu_to_le16(links_count);
+
+	addr = nova_get_addr_off(NOVA_SB(sb), dentry);
+	nova_inc_page_invalid_entries(sb, addr);
+
+	/* Update checksum */
+	nova_update_entry_csum(dentry);
+
+	return 0;
+}
+
+static int nova_update_new_dentry(struct super_block *sb,
+	struct inode *dir, struct nova_dentry *entry,
+	struct nova_log_entry_info *entry_info)
+{
+	struct dentry *dentry = entry_info->data;
+	unsigned short links_count;
+	int link_change = entry_info->link_change;
+
+	entry->entry_type = DIR_LOG;
+	entry->epoch_id = cpu_to_le64(entry_info->epoch_id);
+	entry->trans_id = cpu_to_le64(entry_info->trans_id);
+	entry->ino = cpu_to_le64(entry_info->ino);
+	entry->name_len = dentry->d_name.len;
+	memcpy_to_pmem_nocache(entry->name, dentry->d_name.name,
+				dentry->d_name.len);
+	entry->name[dentry->d_name.len] = '\0';
+	entry->mtime = cpu_to_le32(dir->i_mtime.tv_sec);
+	//entry->size = cpu_to_le64(dir->i_size);
+
+	links_count = dir->i_nlink;
+	if (links_count == 0 && link_change == -1)
+		links_count = 0;
+	else
+		links_count += link_change;
+	entry->links_count = cpu_to_le16(links_count);
+
+	/* Update actual de_len */
+	entry->de_len = cpu_to_le16(entry_info->file_size);
+
+	/* Update checksum */
+	nova_update_entry_csum(entry);
+
+	return 0;
+}
+
+static int nova_update_log_entry(struct super_block *sb, struct inode *inode,
+	void *entry, struct nova_log_entry_info *entry_info)
+{
+	enum nova_entry_type type = entry_info->type;
+
+	switch (type) {
+	case FILE_WRITE:
+		if (entry_info->inplace)
+			nova_update_write_entry(sb, entry, entry_info);
+		else
+			memcpy_to_pmem_nocache(entry, entry_info->data,
+				sizeof(struct nova_file_write_entry));
+		break;
+	case DIR_LOG:
+		if (entry_info->inplace)
+			nova_update_old_dentry(sb, inode, entry, entry_info);
+		else
+			nova_update_new_dentry(sb, inode, entry, entry_info);
+		break;
+	case SET_ATTR:
+		nova_update_setattr_entry(inode, entry, entry_info);
+		break;
+	case LINK_CHANGE:
+		nova_update_link_change_entry(inode, entry, entry_info);
+		break;
+	case MMAP_WRITE:
+		memcpy_to_pmem_nocache(entry, entry_info->data,
+				sizeof(struct nova_mmap_entry));
+		break;
+	case SNAPSHOT_INFO:
+		memcpy_to_pmem_nocache(entry, entry_info->data,
+				sizeof(struct nova_snapshot_info_entry));
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+static int nova_append_log_entry(struct super_block *sb,
+	struct nova_inode *pi, struct inode *inode,
+	struct nova_inode_info_header *sih,
+	struct nova_log_entry_info *entry_info)
+{
+	void *entry, *alter_entry;
+	enum nova_entry_type type = entry_info->type;
+	struct nova_inode_update *update = entry_info->update;
+	u64 tail, alter_tail;
+	u64 curr_p, alter_curr_p;
+	size_t size;
+	int extended = 0;
+
+	if (type == DIR_LOG)
+		size = entry_info->file_size;
+	else
+		size = nova_get_log_entry_size(sb, type);
+
+	tail = update->tail;
+	alter_tail = update->alter_tail;
+
+	curr_p = nova_get_append_head(sb, pi, sih, tail, size,
+						MAIN_LOG, 0, &extended);
+	if (curr_p == 0)
+		return -ENOSPC;
+
+	nova_dbg_verbose("%s: inode %lu append log entry @ 0x%llx\n",
+				__func__, sih->ino, curr_p);
+
+	entry = nova_get_block(sb, curr_p);
+	/* inode is already updated with attr */
+	nova_memunlock_range(sb, entry, size);
+	memset(entry, 0, size);
+	nova_update_log_entry(sb, inode, entry, entry_info);
+	nova_inc_page_num_entries(sb, curr_p);
+	nova_memlock_range(sb, entry, size);
+	update->curr_entry = curr_p;
+	update->tail = curr_p + size;
+
+	if (metadata_csum) {
+		alter_curr_p = nova_get_append_head(sb, pi, sih, alter_tail,
+						size, ALTER_LOG, 0, &extended);
+		if (alter_curr_p == 0)
+			return -ENOSPC;
+
+		alter_entry = nova_get_block(sb, alter_curr_p);
+		nova_memunlock_range(sb, alter_entry, size);
+		memset(alter_entry, 0, size);
+		nova_update_log_entry(sb, inode, alter_entry, entry_info);
+		nova_memlock_range(sb, alter_entry, size);
+
+		update->alter_entry = alter_curr_p;
+		update->alter_tail = alter_curr_p + size;
+	}
+
+	entry_info->curr_p = curr_p;
+	return 0;
+}
+
+int nova_inplace_update_log_entry(struct super_block *sb,
+	struct inode *inode, void *entry,
+	struct nova_log_entry_info *entry_info)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	enum nova_entry_type type = entry_info->type;
+	u64 journal_tail;
+	size_t size;
+	int cpu;
+	timing_t update_time;
+
+	NOVA_START_TIMING(update_entry_t, update_time);
+	size = nova_get_log_entry_size(sb, type);
+
+	if (metadata_csum) {
+		nova_memunlock_range(sb, entry, size);
+		nova_update_log_entry(sb, inode, entry, entry_info);
+		/* Also update the alter inode log entry. */
+		nova_update_alter_entry(sb, entry);
+		nova_memlock_range(sb, entry, size);
+		goto out;
+	}
+
+	cpu = smp_processor_id();
+	spin_lock(&sbi->journal_locks[cpu]);
+	nova_memunlock_journal(sb);
+	journal_tail = nova_create_logentry_transaction(sb, entry, type, cpu);
+	nova_update_log_entry(sb, inode, entry, entry_info);
+
+	PERSISTENT_BARRIER();
+
+	nova_commit_lite_transaction(sb, journal_tail, cpu);
+	nova_memlock_journal(sb);
+	spin_unlock(&sbi->journal_locks[cpu]);
+out:
+	NOVA_END_TIMING(update_entry_t, update_time);
+	return 0;
+}
+
+/* Returns 0 on success; the new tail is stored in @update */
+static int nova_append_setattr_entry(struct super_block *sb,
+	struct nova_inode *pi, struct inode *inode, struct iattr *attr,
+	struct nova_inode_update *update, u64 *last_setattr, u64 epoch_id)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode inode_copy;
+	struct nova_log_entry_info entry_info;
+	timing_t append_time;
+	int ret;
+
+	NOVA_START_TIMING(append_setattr_t, append_time);
+	entry_info.type = SET_ATTR;
+	entry_info.attr = attr;
+	entry_info.update = update;
+	entry_info.epoch_id = epoch_id;
+	entry_info.trans_id = sih->trans_id;
+
+	if (nova_check_inode_integrity(sb, sih->ino, sih->pi_addr,
+			sih->alter_pi_addr, &inode_copy, 0) < 0) {
+		ret = -EIO;
+		goto out;
+	}
+
+	ret = nova_append_log_entry(sb, pi, inode, sih, &entry_info);
+	if (ret) {
+		nova_err(sb, "%s failed\n", __func__);
+		goto out;
+	}
+
+	*last_setattr = sih->last_setattr;
+	sih->last_setattr = entry_info.curr_p;
+
+out:
+	NOVA_END_TIMING(append_setattr_t, append_time);
+	return ret;
+}
+
+/* Invalidate old setattr entry */
+static int nova_invalidate_setattr_entry(struct super_block *sb,
+	u64 last_setattr)
+{
+	struct nova_setattr_logentry *old_entry;
+	struct nova_setattr_logentry *old_entryc, old_entry_copy;
+	void *addr;
+	int ret;
+
+	addr = (void *)nova_get_block(sb, last_setattr);
+	old_entry = (struct nova_setattr_logentry *)addr;
+
+	if (metadata_csum == 0)
+		old_entryc = old_entry;
+	else {
+		old_entryc = &old_entry_copy;
+		if (!nova_verify_entry_csum(sb, old_entry, old_entryc))
+			return -EIO;
+	}
+
+	/* Do not invalidate setsize entries */
+	if (!old_entry_freeable(sb, old_entryc->epoch_id) ||
+			(old_entryc->attr & ATTR_SIZE))
+		return 0;
+
+	ret = nova_invalidate_logentry(sb, old_entry, SET_ATTR, 0);
+
+	return ret;
+}
+
+#if 0
+static void setattr_copy_to_nova_inode(struct super_block *sb,
+	struct inode *inode, struct nova_inode *pi, u64 epoch_id)
+{
+	pi->i_mode  = cpu_to_le16(inode->i_mode);
+	pi->i_uid	= cpu_to_le32(i_uid_read(inode));
+	pi->i_gid	= cpu_to_le32(i_gid_read(inode));
+	pi->i_atime	= cpu_to_le32(inode->i_atime.tv_sec);
+	pi->i_ctime	= cpu_to_le32(inode->i_ctime.tv_sec);
+	pi->i_mtime	= cpu_to_le32(inode->i_mtime.tv_sec);
+	pi->create_epoch_id = epoch_id;
+
+	nova_update_alter_inode(sb, inode, pi);
+}
+#endif
+
+static int nova_can_inplace_update_setattr(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 epoch_id)
+{
+	u64 last_log = 0;
+	struct nova_setattr_logentry *entry = NULL;
+
+	last_log = sih->last_setattr;
+	if (last_log) {
+		entry = (struct nova_setattr_logentry *)nova_get_block(sb,
+								last_log);
+		/* Do not overwrite setsize entry */
+		if (entry->attr & ATTR_SIZE)
+			return 0;
+		if (entry->epoch_id == epoch_id)
+			return 1;
+	}
+
+	return 0;
+}
+
+static int nova_inplace_update_setattr_entry(struct super_block *sb,
+	struct inode *inode, struct nova_inode_info_header *sih,
+	struct iattr *attr, u64 epoch_id)
+{
+	struct nova_setattr_logentry *entry = NULL;
+	struct nova_log_entry_info entry_info;
+	u64 last_log = 0;
+
+	nova_dbgv("%s : Modifying last log entry for inode %lu\n",
+				__func__, inode->i_ino);
+	last_log = sih->last_setattr;
+	entry = (struct nova_setattr_logentry *)nova_get_block(sb,
+							last_log);
+
+	entry_info.type = SET_ATTR;
+	entry_info.attr = attr;
+	entry_info.epoch_id = epoch_id;
+	entry_info.trans_id = sih->trans_id;
+
+	return nova_inplace_update_log_entry(sb, inode, entry,
+					&entry_info);
+}
+
+int nova_handle_setattr_operation(struct super_block *sb, struct inode *inode,
+	struct nova_inode *pi, unsigned int ia_valid, struct iattr *attr,
+	u64 epoch_id)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode_update update;
+	u64 last_setattr = 0;
+	int ret;
+
+	if (ia_valid & ATTR_MODE)
+		sih->i_mode = inode->i_mode;
+
+	/*
+	 * Let's try to do inplace update.
+	 * If there are currently no snapshots holding this inode,
+	 * we can update the inode in place. If a snapshot creation
+	 * is in progress, we will use the create_snapshot_epoch_id
+	 * as the latest snapshot id.
+	 */
+	if (!(ia_valid & ATTR_SIZE) &&
+			nova_can_inplace_update_setattr(sb, sih, epoch_id)) {
+		nova_inplace_update_setattr_entry(sb, inode, sih,
+						attr, epoch_id);
+	} else {
+		/* We are holding inode lock so OK to append the log */
+		nova_dbgv("%s : Appending last log entry for inode ino = %lu\n",
+				__func__, inode->i_ino);
+		update.tail = update.alter_tail = 0;
+		ret = nova_append_setattr_entry(sb, pi, inode, attr, &update,
+						&last_setattr, epoch_id);
+		if (ret) {
+			nova_dbg("%s: append setattr entry failure\n",
+								__func__);
+			return ret;
+		}
+
+		nova_memunlock_inode(sb, pi);
+		nova_update_inode(sb, inode, pi, &update, 1);
+		nova_memlock_inode(sb, pi);
+	}
+
+	/* Invalidate old setattr entry */
+	if (last_setattr)
+		nova_invalidate_setattr_entry(sb, last_setattr);
+
+	return 0;
+}
+
+/* Invalidate old link change entry */
+int nova_invalidate_link_change_entry(struct super_block *sb,
+	u64 old_link_change)
+{
+	struct nova_link_change_entry *old_entry;
+	struct nova_link_change_entry *old_entryc, old_entry_copy;
+	void *addr;
+	int ret;
+
+	if (old_link_change == 0)
+		return 0;
+
+	addr = (void *)nova_get_block(sb, old_link_change);
+	old_entry = (struct nova_link_change_entry *)addr;
+
+	if (metadata_csum == 0)
+		old_entryc = old_entry;
+	else {
+		old_entryc = &old_entry_copy;
+		if (!nova_verify_entry_csum(sb, old_entry, old_entryc))
+			return -EIO;
+	}
+
+	if (!old_entry_freeable(sb, old_entryc->epoch_id))
+		return 0;
+
+	ret = nova_invalidate_logentry(sb, old_entry, LINK_CHANGE, 0);
+
+	return ret;
+}
+
+static int nova_can_inplace_update_lcentry(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 epoch_id)
+{
+	u64 last_log = 0;
+	struct nova_link_change_entry *entry = NULL;
+
+	last_log = sih->last_link_change;
+	if (last_log) {
+		entry = (struct nova_link_change_entry *)nova_get_block(sb,
+								last_log);
+		if (entry->epoch_id == epoch_id)
+			return 1;
+	}
+
+	return 0;
+}
+
+static int nova_inplace_update_lcentry(struct super_block *sb,
+	struct inode *inode, struct nova_inode_info_header *sih,
+	u64 epoch_id)
+{
+	struct nova_link_change_entry *entry = NULL;
+	struct nova_log_entry_info entry_info;
+	u64 last_log = 0;
+
+	last_log = sih->last_link_change;
+	entry = (struct nova_link_change_entry *)nova_get_block(sb,
+							last_log);
+
+	entry_info.type = LINK_CHANGE;
+	entry_info.epoch_id = epoch_id;
+	entry_info.trans_id = sih->trans_id;
+
+	return nova_inplace_update_log_entry(sb, inode, entry,
+					&entry_info);
+}
+
+/* Returns 0 on success; the new tail is stored in @update */
+int nova_append_link_change_entry(struct super_block *sb,
+	struct nova_inode *pi, struct inode *inode,
+	struct nova_inode_update *update, u64 *old_linkc, u64 epoch_id)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode inode_copy;
+	struct nova_log_entry_info entry_info;
+	int ret = 0;
+	timing_t append_time;
+
+	NOVA_START_TIMING(append_link_change_t, append_time);
+
+	if (nova_check_inode_integrity(sb, sih->ino, sih->pi_addr,
+			sih->alter_pi_addr, &inode_copy, 0) < 0) {
+		ret = -EIO;
+		goto out;
+	}
+
+	if (nova_can_inplace_update_lcentry(sb, sih, epoch_id)) {
+		nova_inplace_update_lcentry(sb, inode, sih, epoch_id);
+		update->tail = sih->log_tail;
+		update->alter_tail = sih->alter_log_tail;
+
+		*old_linkc = 0;
+		sih->trans_id++;
+		goto out;
+	}
+
+	entry_info.type = LINK_CHANGE;
+	entry_info.update = update;
+	entry_info.epoch_id = epoch_id;
+	entry_info.trans_id = sih->trans_id;
+
+	ret = nova_append_log_entry(sb, pi, inode, sih, &entry_info);
+	if (ret) {
+		nova_err(sb, "%s failed\n", __func__);
+		goto out;
+	}
+
+	*old_linkc = sih->last_link_change;
+	sih->last_link_change = entry_info.curr_p;
+	sih->trans_id++;
+out:
+	NOVA_END_TIMING(append_link_change_t, append_time);
+	return ret;
+}
+
+int nova_assign_write_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry,
+	struct nova_file_write_entry *entryc,
+	bool free)
+{
+	struct nova_file_write_entry *old_entry;
+	struct nova_file_write_entry *start_old_entry = NULL;
+	void **pentry;
+	unsigned long start_pgoff = entryc->pgoff;
+	unsigned long start_old_pgoff = 0;
+	unsigned int num = entryc->num_pages;
+	unsigned int num_free = 0;
+	unsigned long curr_pgoff;
+	int i;
+	int ret = 0;
+	timing_t assign_time;
+
+	NOVA_START_TIMING(assign_t, assign_time);
+	for (i = 0; i < num; i++) {
+		curr_pgoff = start_pgoff + i;
+
+		pentry = radix_tree_lookup_slot(&sih->tree, curr_pgoff);
+		if (pentry) {
+			old_entry = radix_tree_deref_slot(pentry);
+			if (old_entry != start_old_entry) {
+				if (start_old_entry && free)
+					nova_free_old_entry(sb, sih,
+							start_old_entry,
+							start_old_pgoff,
+							num_free, false,
+							entryc->epoch_id);
+				nova_invalidate_write_entry(sb,
+						start_old_entry, 1, 0);
+
+				start_old_entry = old_entry;
+				start_old_pgoff = curr_pgoff;
+				num_free = 1;
+			} else {
+				num_free++;
+			}
+
+			radix_tree_replace_slot(&sih->tree, pentry, entry);
+		} else {
+			ret = radix_tree_insert(&sih->tree, curr_pgoff, entry);
+			if (ret) {
+				nova_dbg("%s: ERROR %d\n", __func__, ret);
+				goto out;
+			}
+		}
+	}
+
+	if (start_old_entry && free)
+		nova_free_old_entry(sb, sih, start_old_entry,
+					start_old_pgoff, num_free, false,
+					entryc->epoch_id);
+
+	nova_invalidate_write_entry(sb, start_old_entry, 1, 0);
+
+out:
+	NOVA_END_TIMING(assign_t, assign_time);
+
+	return ret;
+}
+
+int nova_inplace_update_write_entry(struct super_block *sb,
+	struct inode *inode, struct nova_file_write_entry *entry,
+	struct nova_log_entry_info *entry_info)
+{
+	return nova_inplace_update_log_entry(sb, inode, entry,
+					entry_info);
+}
+
+int nova_set_write_entry_updating(struct super_block *sb,
+	struct nova_file_write_entry *entry, int set)
+{
+	nova_memunlock_range(sb, entry, sizeof(*entry));
+	entry->updating = set ? 1 : 0;
+	nova_update_entry_csum(entry);
+	nova_update_alter_entry(sb, entry);
+	nova_memlock_range(sb, entry, sizeof(*entry));
+
+	return 0;
+}
+
+/*
+ * Append a nova_file_write_entry to the current nova_inode_log_page.
+ * blocknr and start_blk are pgoff.
+ * We cannot update pi->log_tail here because a transaction may contain
+ * multiple entries.
+ */
+int nova_append_file_write_entry(struct super_block *sb, struct nova_inode *pi,
+	struct inode *inode, struct nova_file_write_entry *data,
+	struct nova_inode_update *update)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_log_entry_info entry_info;
+	timing_t append_time;
+	int ret;
+
+	NOVA_START_TIMING(append_file_entry_t, append_time);
+
+	nova_update_entry_csum(data);
+
+	entry_info.type = FILE_WRITE;
+	entry_info.update = update;
+	entry_info.data = data;
+	entry_info.epoch_id = data->epoch_id;
+	entry_info.trans_id = data->trans_id;
+	entry_info.inplace = 0;
+
+	ret = nova_append_log_entry(sb, pi, inode, sih, &entry_info);
+	if (ret)
+		nova_err(sb, "%s failed\n", __func__);
+
+	NOVA_END_TIMING(append_file_entry_t, append_time);
+	return ret;
+}
+
+int nova_append_mmap_entry(struct super_block *sb, struct nova_inode *pi,
+	struct inode *inode, struct nova_mmap_entry *data,
+	struct nova_inode_update *update, struct vma_item *item)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode inode_copy;
+	struct nova_log_entry_info entry_info;
+	timing_t append_time;
+	int ret;
+
+	NOVA_START_TIMING(append_mmap_entry_t, append_time);
+
+	nova_update_entry_csum(data);
+
+	entry_info.type = MMAP_WRITE;
+	entry_info.update = update;
+	entry_info.data = data;
+	entry_info.epoch_id = data->epoch_id;
+
+	if (nova_check_inode_integrity(sb, sih->ino, sih->pi_addr,
+			sih->alter_pi_addr, &inode_copy, 0) < 0) {
+		ret = -EIO;
+		goto out;
+	}
+
+	ret = nova_append_log_entry(sb, pi, inode, sih, &entry_info);
+	if (ret)
+		nova_err(sb, "%s failed\n", __func__);
+	else
+		item->mmap_entry = entry_info.curr_p;
+out:
+	NOVA_END_TIMING(append_mmap_entry_t, append_time);
+	return ret;
+}
+
+int nova_append_snapshot_info_entry(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info *si,
+	struct snapshot_info *info, struct nova_snapshot_info_entry *data,
+	struct nova_inode_update *update)
+{
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode inode_copy;
+	struct nova_log_entry_info entry_info;
+	timing_t append_time;
+	int ret;
+
+	NOVA_START_TIMING(append_snapshot_info_t, append_time);
+
+	nova_update_entry_csum(data);
+
+	entry_info.type = SNAPSHOT_INFO;
+	entry_info.update = update;
+	entry_info.data = data;
+	entry_info.epoch_id = data->epoch_id;
+	entry_info.inplace = 0;
+
+	if (nova_check_inode_integrity(sb, sih->ino, sih->pi_addr,
+			sih->alter_pi_addr, &inode_copy, 0) < 0) {
+		ret = -EIO;
+		goto out;
+	}
+
+	ret = nova_append_log_entry(sb, pi, NULL, sih, &entry_info);
+	if (ret)
+		nova_err(sb, "%s failed\n", __func__);
+	else
+		info->snapshot_entry = entry_info.curr_p;
+out:
+	NOVA_END_TIMING(append_snapshot_info_t, append_time);
+	return ret;
+}
+
+int nova_append_dentry(struct super_block *sb, struct nova_inode *pi,
+	struct inode *dir, struct dentry *dentry, u64 ino,
+	unsigned short de_len, struct nova_inode_update *update,
+	int link_change, u64 epoch_id)
+{
+	struct nova_inode_info *si = NOVA_I(dir);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode inode_copy;
+	struct nova_log_entry_info entry_info;
+	timing_t append_time;
+	int ret;
+
+	NOVA_START_TIMING(append_dir_entry_t, append_time);
+
+	entry_info.type = DIR_LOG;
+	entry_info.update = update;
+	entry_info.data = dentry;
+	entry_info.ino = ino;
+	entry_info.link_change = link_change;
+	entry_info.file_size = de_len;
+	entry_info.epoch_id = epoch_id;
+	entry_info.trans_id = sih->trans_id;
+	entry_info.inplace = 0;
+
+	/* nova_inode tail pointer will be updated and we make sure all other
+	 * inode fields are good before checksumming the whole structure
+	 */
+	if (nova_check_inode_integrity(sb, sih->ino, sih->pi_addr,
+			sih->alter_pi_addr, &inode_copy, 0) < 0) {
+		ret = -EIO;
+		goto out;
+	}
+
+	ret = nova_append_log_entry(sb, pi, dir, sih, &entry_info);
+	if (ret)
+		nova_err(sb, "%s failed\n", __func__);
+
+	dir->i_blocks = sih->i_blocks;
+out:
+	NOVA_END_TIMING(append_dir_entry_t, append_time);
+	return ret;
+}
+
+int nova_update_alter_pages(struct super_block *sb, struct nova_inode *pi,
+	u64 curr, u64 alter_curr)
+{
+	if (curr == 0 || alter_curr == 0 || metadata_csum == 0)
+		return 0;
+
+	while (curr && alter_curr) {
+		nova_set_alter_page_address(sb, curr, alter_curr);
+		curr = next_log_page(sb, curr);
+		alter_curr = next_log_page(sb, alter_curr);
+	}
+
+	if (curr || alter_curr)
+		nova_dbg("%s: curr 0x%llx, alter_curr 0x%llx\n",
+					__func__, curr, alter_curr);
+
+	return 0;
+}
+
+static int nova_coalesce_log_pages(struct super_block *sb,
+	unsigned long prev_blocknr, unsigned long first_blocknr,
+	unsigned long num_pages)
+{
+	unsigned long next_blocknr;
+	u64 curr_block, next_page;
+	struct nova_inode_log_page *curr_page;
+	int i;
+
+	if (prev_blocknr) {
+		/* Link prev block and newly allocated head block */
+		curr_block = nova_get_block_off(sb, prev_blocknr,
+						NOVA_BLOCK_TYPE_4K);
+		curr_page = (struct nova_inode_log_page *)
+				nova_get_block(sb, curr_block);
+		next_page = nova_get_block_off(sb, first_blocknr,
+				NOVA_BLOCK_TYPE_4K);
+		nova_memunlock_block(sb, curr_page);
+		nova_set_next_page_address(sb, curr_page, next_page, 0);
+		nova_memlock_block(sb, curr_page);
+	}
+
+	next_blocknr = first_blocknr + 1;
+	curr_block = nova_get_block_off(sb, first_blocknr,
+						NOVA_BLOCK_TYPE_4K);
+	curr_page = (struct nova_inode_log_page *)
+				nova_get_block(sb, curr_block);
+	for (i = 0; i < num_pages - 1; i++) {
+		next_page = nova_get_block_off(sb, next_blocknr,
+				NOVA_BLOCK_TYPE_4K);
+		nova_memunlock_block(sb, curr_page);
+		nova_set_page_num_entries(sb, curr_page, 0, 0);
+		nova_set_page_invalid_entries(sb, curr_page, 0, 0);
+		nova_set_next_page_address(sb, curr_page, next_page, 0);
+		nova_memlock_block(sb, curr_page);
+		curr_page++;
+		next_blocknr++;
+	}
+
+	/* Last page */
+	nova_memunlock_block(sb, curr_page);
+	nova_set_page_num_entries(sb, curr_page, 0, 0);
+	nova_set_page_invalid_entries(sb, curr_page, 0, 0);
+	nova_set_next_page_address(sb, curr_page, 0, 1);
+	nova_memlock_block(sb, curr_page);
+	return 0;
+}
+
+/* Log block resides in NVMM */
+int nova_allocate_inode_log_pages(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long num_pages,
+	u64 *new_block, int cpuid, enum nova_alloc_direction from_tail)
+{
+	unsigned long new_inode_blocknr;
+	unsigned long first_blocknr;
+	unsigned long prev_blocknr;
+	int allocated;
+	int ret_pages = 0;
+
+	allocated = nova_new_log_blocks(sb, sih, &new_inode_blocknr,
+			num_pages, ALLOC_NO_INIT, cpuid, from_tail);
+
+	if (allocated <= 0) {
+		nova_err(sb, "ERROR: no inode log page available: %lu %d\n",
+			num_pages, allocated);
+		return allocated;
+	}
+	ret_pages += allocated;
+	num_pages -= allocated;
+	nova_dbg_verbose("Pi %lu: Alloc %d log blocks @ 0x%lx\n",
+			sih->ino, allocated, new_inode_blocknr);
+
+	/* Coalesce the pages */
+	nova_coalesce_log_pages(sb, 0, new_inode_blocknr, allocated);
+	first_blocknr = new_inode_blocknr;
+	prev_blocknr = new_inode_blocknr + allocated - 1;
+
+	/* Allocate remaining pages */
+	while (num_pages) {
+		allocated = nova_new_log_blocks(sb, sih,
+					&new_inode_blocknr, num_pages,
+					ALLOC_NO_INIT, cpuid, from_tail);
+
+		nova_dbg_verbose("Alloc %d log blocks @ 0x%lx\n",
+					allocated, new_inode_blocknr);
+		if (allocated <= 0) {
+			nova_dbg("%s: no inode log page available: %lu %d\n",
+				__func__, num_pages, allocated);
+			/* Return whatever we have */
+			break;
+		}
+		ret_pages += allocated;
+		num_pages -= allocated;
+		nova_coalesce_log_pages(sb, prev_blocknr, new_inode_blocknr,
+						allocated);
+		prev_blocknr = new_inode_blocknr + allocated - 1;
+	}
+
+	*new_block = nova_get_block_off(sb, first_blocknr,
+						NOVA_BLOCK_TYPE_4K);
+
+	return ret_pages;
+}
+
+static int nova_initialize_inode_log(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	int log_id)
+{
+	u64 new_block;
+	int allocated;
+
+	allocated = nova_allocate_inode_log_pages(sb, sih,
+					1, &new_block, ANY_CPU,
+					log_id == MAIN_LOG ? 0 : 1);
+	if (allocated != 1) {
+		nova_err(sb, "%s ERROR: no inode log page available\n",
+					__func__);
+		return -ENOSPC;
+	}
+
+	nova_memunlock_inode(sb, pi);
+	if (log_id == MAIN_LOG) {
+		pi->log_tail = new_block;
+		nova_flush_buffer(&pi->log_tail, CACHELINE_SIZE, 0);
+		pi->log_head = new_block;
+		sih->log_head = sih->log_tail = new_block;
+		sih->log_pages = 1;
+		nova_flush_buffer(&pi->log_head, CACHELINE_SIZE, 1);
+	} else {
+		pi->alter_log_tail = new_block;
+		nova_flush_buffer(&pi->alter_log_tail, CACHELINE_SIZE, 0);
+		pi->alter_log_head = new_block;
+		sih->alter_log_head = sih->alter_log_tail = new_block;
+		sih->log_pages++;
+		nova_flush_buffer(&pi->alter_log_head, CACHELINE_SIZE, 1);
+	}
+	nova_update_inode_checksum(pi);
+	nova_memlock_inode(sb, pi);
+
+	return 0;
+}
+
+/*
+ * Extend the log.  If the log is less than EXTEND_THRESHOLD pages, double its
+ * allocated size.  Otherwise, increase by EXTEND_THRESHOLD. Then, do GC.
+ */
+static u64 nova_extend_inode_log(struct super_block *sb, struct nova_inode *pi,
+	struct nova_inode_info_header *sih, u64 curr_p)
+{
+	u64 new_block, alter_new_block = 0;
+	int allocated;
+	unsigned long num_pages;
+	int ret;
+
+	nova_dbgv("%s: inode %lu, curr 0x%llx\n", __func__, sih->ino, curr_p);
+
+	if (curr_p == 0) {
+		ret = nova_initialize_inode_log(sb, pi, sih, MAIN_LOG);
+		if (ret)
+			return 0;
+
+		if (metadata_csum) {
+			ret = nova_initialize_inode_log(sb, pi, sih, ALTER_LOG);
+			if (ret)
+				return 0;
+
+			nova_memunlock_inode(sb, pi);
+			nova_update_alter_pages(sb, pi, sih->log_head,
+							sih->alter_log_head);
+			nova_memlock_inode(sb, pi);
+		}
+
+		return sih->log_head;
+	}
+
+	num_pages = sih->log_pages >= EXTEND_THRESHOLD ?
+				EXTEND_THRESHOLD : sih->log_pages;
+//	nova_dbg("Before append log pages:\n");
+//	nova_print_inode_log_page(sb, inode);
+	allocated = nova_allocate_inode_log_pages(sb, sih,
+					num_pages, &new_block, ANY_CPU, 0);
+	nova_dbg_verbose("Link block %llu to block %llu\n",
+					curr_p >> PAGE_SHIFT,
+					new_block >> PAGE_SHIFT);
+	if (allocated <= 0) {
+		nova_err(sb, "%s ERROR: no inode log page available\n",
+					__func__);
+		nova_dbg("curr_p 0x%llx, %lu pages\n", curr_p,
+					sih->log_pages);
+		return 0;
+	}
+
+	if (metadata_csum) {
+		allocated = nova_allocate_inode_log_pages(sb, sih,
+				num_pages, &alter_new_block, ANY_CPU, 1);
+		if (allocated <= 0) {
+			nova_err(sb, "%s ERROR: no inode log page available\n",
+					__func__);
+			nova_dbg("curr_p 0x%llx, %lu pages\n", curr_p,
+					sih->log_pages);
+			return 0;
+		}
+
+		nova_memunlock_inode(sb, pi);
+		nova_update_alter_pages(sb, pi, new_block, alter_new_block);
+		nova_memlock_inode(sb, pi);
+	}
+
+
+	nova_inode_log_fast_gc(sb, pi, sih, curr_p,
+			       new_block, alter_new_block, allocated, 0);
+
+//	nova_dbg("After append log pages:\n");
+//	nova_print_inode_log_page(sb, inode);
+	/* Atomic switch to new log */
+//	nova_switch_to_new_log(sb, pi, new_block, num_pages);
+
+	return new_block;
+}
+
+/* For thorough GC, simply append one more page */
+static u64 nova_append_one_log_page(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 curr_p)
+{
+	struct nova_inode_log_page *curr_page;
+	u64 new_block;
+	u64 curr_block;
+	int allocated;
+
+	allocated = nova_allocate_inode_log_pages(sb, sih, 1, &new_block,
+							ANY_CPU, 0);
+	if (allocated != 1) {
+		nova_err(sb, "%s: ERROR: no inode log page available\n",
+				__func__);
+		return 0;
+	}
+
+	if (curr_p == 0) {
+		curr_p = new_block;
+	} else {
+		/* Link prev block and newly allocated head block */
+		curr_block = BLOCK_OFF(curr_p);
+		curr_page = (struct nova_inode_log_page *)
+				nova_get_block(sb, curr_block);
+		nova_memunlock_block(sb, curr_page);
+		nova_set_next_page_address(sb, curr_page, new_block, 1);
+		nova_memlock_block(sb, curr_page);
+	}
+
+	return curr_p;
+}
+
+u64 nova_get_append_head(struct super_block *sb, struct nova_inode *pi,
+	struct nova_inode_info_header *sih, u64 tail, size_t size, int log_id,
+	int thorough_gc, int *extended)
+{
+	u64 curr_p;
+
+	if (tail)
+		curr_p = tail;
+	else if (log_id == MAIN_LOG)
+		curr_p = sih->log_tail;
+	else
+		curr_p = sih->alter_log_tail;
+
+	if (curr_p == 0 || (is_last_entry(curr_p, size) &&
+				next_log_page(sb, curr_p) == 0)) {
+		if (is_last_entry(curr_p, size)) {
+			nova_memunlock_block(sb, nova_get_block(sb, curr_p));
+			nova_set_next_page_flag(sb, curr_p);
+			nova_memlock_block(sb, nova_get_block(sb, curr_p));
+		}
+
+		/* Alternate log should not go here */
+		if (log_id != MAIN_LOG)
+			return 0;
+
+		if (thorough_gc == 0) {
+			curr_p = nova_extend_inode_log(sb, pi, sih, curr_p);
+		} else {
+			curr_p = nova_append_one_log_page(sb, sih, curr_p);
+			/* For thorough GC */
+			*extended = 1;
+		}
+
+		if (curr_p == 0)
+			return 0;
+	}
+
+	if (is_last_entry(curr_p, size)) {
+		nova_memunlock_block(sb, nova_get_block(sb, curr_p));
+		nova_set_next_page_flag(sb, curr_p);
+		nova_memlock_block(sb, nova_get_block(sb, curr_p));
+		curr_p = next_log_page(sb, curr_p);
+	}
+
+	return curr_p;
+}
+
+int nova_free_contiguous_log_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 head)
+{
+	unsigned long blocknr, start_blocknr = 0;
+	u64 curr_block = head;
+	u8 btype = sih->i_blk_type;
+	int num_free = 0;
+	int freed = 0;
+
+	while (curr_block > 0) {
+		if (ENTRY_LOC(curr_block)) {
+			nova_dbg("%s: ERROR: invalid block %llu\n",
+					__func__, curr_block);
+			break;
+		}
+
+		blocknr = nova_get_blocknr(sb, le64_to_cpu(curr_block),
+				    btype);
+		nova_dbg_verbose("%s: free page %llu\n", __func__, curr_block);
+		curr_block = next_log_page(sb, curr_block);
+
+		if (start_blocknr == 0) {
+			start_blocknr = blocknr;
+			num_free = 1;
+		} else {
+			if (blocknr == start_blocknr + num_free) {
+				num_free++;
+			} else {
+				/* A new start */
+				nova_free_log_blocks(sb, sih, start_blocknr,
+							num_free);
+				freed += num_free;
+				start_blocknr = blocknr;
+				num_free = 1;
+			}
+		}
+	}
+	if (start_blocknr) {
+		nova_free_log_blocks(sb, sih, start_blocknr, num_free);
+		freed += num_free;
+	}
+
+	return freed;
+}
+
+int nova_free_inode_log(struct super_block *sb, struct nova_inode *pi,
+	struct nova_inode_info_header *sih)
+{
+	int freed = 0;
+	timing_t free_time;
+
+	if (sih->log_head == 0 || sih->log_tail == 0)
+		return 0;
+
+	NOVA_START_TIMING(free_inode_log_t, free_time);
+
+	/* The inode is invalid now, no need to fence */
+	if (pi) {
+		nova_memunlock_inode(sb, pi);
+		pi->log_head = pi->log_tail = 0;
+		pi->alter_log_head = pi->alter_log_tail = 0;
+		nova_flush_buffer(&pi->log_head, CACHELINE_SIZE, 0);
+		nova_memlock_inode(sb, pi);
+	}
+
+	freed = nova_free_contiguous_log_blocks(sb, sih, sih->log_head);
+	if (metadata_csum)
+		freed += nova_free_contiguous_log_blocks(sb, sih,
+					sih->alter_log_head);
+
+	NOVA_END_TIMING(free_inode_log_t, free_time);
+	return 0;
+}
diff --git a/fs/nova/log.h b/fs/nova/log.h
new file mode 100644
index 000000000000..ee6c8927dd33
--- /dev/null
+++ b/fs/nova/log.h
@@ -0,0 +1,333 @@
+#ifndef __LOG_H
+#define __LOG_H
+
+#include "balloc.h"
+#include "inode.h"
+
+/* ======================= Log entry ========================= */
+/* Inode entry in the log */
+
+#define	MAIN_LOG	0
+#define	ALTER_LOG	1
+
+#define	PAGE_OFFSET_MASK	4095
+#define	BLOCK_OFF(p)	((p) & ~PAGE_OFFSET_MASK)
+
+#define	ENTRY_LOC(p)	((p) & PAGE_OFFSET_MASK)
+
+#define	LOG_BLOCK_TAIL	4064
+#define	PAGE_TAIL(p)	(BLOCK_OFF(p) + LOG_BLOCK_TAIL)
+
+/*
+ * Log page state and pointers to next page and the replica page
+ */
+struct nova_inode_page_tail {
+	__le32	invalid_entries;
+	__le32	num_entries;
+	__le64	epoch_id;	/* For snapshot list page */
+	__le64	alter_page;	/* Corresponding page in the other log */
+	__le64	next_page;
+} __attribute((__packed__));
+
+/* Fit in PAGE_SIZE */
+struct	nova_inode_log_page {
+	char padding[LOG_BLOCK_TAIL];
+	struct nova_inode_page_tail page_tail;
+} __attribute((__packed__));
+
+#define	EXTEND_THRESHOLD	256
+
+enum nova_entry_type {
+	FILE_WRITE = 1,
+	DIR_LOG,
+	SET_ATTR,
+	LINK_CHANGE,
+	MMAP_WRITE,
+	SNAPSHOT_INFO,
+	NEXT_PAGE,
+};
+
+static inline u8 nova_get_entry_type(void *p)
+{
+	u8 type;
+	int rc;
+
+	rc = memcpy_mcsafe(&type, p, sizeof(u8));
+	if (rc)
+		return rc;
+
+	return type;
+}
+
+static inline void nova_set_entry_type(void *p, enum nova_entry_type type)
+{
+	*(u8 *)p = type;
+}
+
+/*
+ * Write log entry.  Records a write to a contiguous range of PMEM pages.
+ *
+ * Documentation/filesystems/nova.txt contains descriptions of some fields.
+ */
+struct nova_file_write_entry {
+	u8	entry_type;
+	u8	reassigned;	/* Data is not latest */
+	u8	updating;	/* Data is being written */
+	u8	padding;
+	__le32	num_pages;
+	__le64	block;          /* offset of first block in this write */
+	__le64	pgoff;          /* file offset at the beginning of this write */
+	__le32	invalid_pages;	/* For GC */
+	/* For both ctime and mtime */
+	__le32	mtime;
+	__le64	size;           /* Write size for non-aligned writes */
+	__le64	epoch_id;
+	__le64	trans_id;
+	__le32	csumpadding;
+	__le32	csum;
+} __attribute((__packed__));
+
+#define WENTRY(entry)	((struct nova_file_write_entry *) entry)
+
+/*
+ * Log entry for adding a file/directory to a directory.
+ *
+ * Update NOVA_DIR_LOG_REC_LEN if you modify this struct!
+ */
+struct nova_dentry {
+	u8	entry_type;
+	u8	name_len;		/* length of the dentry name */
+	u8	reassigned;		/* Currently deleted */
+	u8	invalid;		/* Invalid now? */
+	__le16	de_len;			/* length of this dentry */
+	__le16	links_count;
+	__le32	mtime;			/* For both mtime and ctime */
+	__le32	csum;			/* entry checksum */
+	__le64	ino;			/* inode no pointed to by this entry */
+	__le64	padding;
+	__le64	epoch_id;
+	__le64	trans_id;
+	char	name[NOVA_NAME_LEN + 1];	/* File name */
+} __attribute((__packed__));
+
+#define DENTRY(entry)	((struct nova_dentry *) entry)
+
+#define NOVA_DIR_PAD			8	/* Align to 8 bytes boundary */
+#define NOVA_DIR_ROUND			(NOVA_DIR_PAD - 1)
+#define NOVA_DENTRY_HEADER_LEN		48
+#define NOVA_DIR_LOG_REC_LEN(name_len) \
+	(((name_len + 1) + NOVA_DENTRY_HEADER_LEN \
+	 + NOVA_DIR_ROUND) & ~NOVA_DIR_ROUND)
+
+#define NOVA_MAX_ENTRY_LEN		NOVA_DIR_LOG_REC_LEN(NOVA_NAME_LEN)
+
+/*
+ * Log entry for updating file attributes.
+ */
+struct nova_setattr_logentry {
+	u8	entry_type;
+	u8	attr;       /* bitmap of which attributes to update */
+	__le16	mode;
+	__le32	uid;
+	__le32	gid;
+	__le32	atime;
+	__le32	mtime;
+	__le32	ctime;
+	__le64	size;        /* File size after truncation */
+	__le64	epoch_id;
+	__le64	trans_id;
+	u8	invalid;
+	u8	paddings[3];
+	__le32	csum;
+} __attribute((__packed__));
+
+#define SENTRY(entry)	((struct nova_setattr_logentry *) entry)
+
+/* Link change log entry.
+ *
+ * TODO: Do we need this to be 32 bytes?
+ */
+struct nova_link_change_entry {
+	u8	entry_type;
+	u8	invalid;
+	__le16	links;
+	__le32	ctime;
+	__le32	flags;
+	__le32	generation;    /* for NFS handles */
+	__le64	epoch_id;
+	__le64	trans_id;
+	__le32	csumpadding;
+	__le32	csum;
+} __attribute((__packed__));
+
+#define LCENTRY(entry)	((struct nova_link_change_entry *) entry)
+
+/*
+ * MMap entry.  Records the fact that a region of the file is mmapped, so
+ * parity and checksums are inoperative.
+ */
+struct nova_mmap_entry {
+	u8	entry_type;
+	u8	invalid;
+	u8	paddings[6];
+	__le64	epoch_id;
+	__le64	pgoff;
+	__le64	num_pages;
+	__le32	csumpadding;
+	__le32	csum;
+} __attribute((__packed__));
+
+#define MMENTRY(entry)	((struct nova_mmap_entry *) entry)
+
+/*
+ * Log entry for the creation of a snapshot.  Only occurs in the log of the
+ * dedicated snapshot inode.
+ */
+struct nova_snapshot_info_entry {
+	u8	type;
+	u8	deleted;
+	u8	paddings[6];
+	__le64	epoch_id;
+	__le64	timestamp;
+	__le64	nvmm_page_addr;
+	__le32	csumpadding;
+	__le32	csum;
+} __attribute((__packed__));
+
+#define SNENTRY(entry)	((struct nova_snapshot_info_entry *) entry)
+
+
+/*
+ * Transient DRAM structure that describes changes needed to append a log entry
+ * to an inode
+ */
+struct nova_inode_update {
+	u64 head;
+	u64 alter_head;
+	u64 tail;
+	u64 alter_tail;
+	u64 curr_entry;
+	u64 alter_entry;
+	struct nova_dentry *create_dentry;
+	struct nova_dentry *delete_dentry;
+};
+
+
+/*
+ * Transient DRAM structure to parameterize the creation of a log entry.
+ */
+struct nova_log_entry_info {
+	enum nova_entry_type type;
+	struct iattr *attr;
+	struct nova_inode_update *update;
+	void *data;	/* struct dentry */
+	u64 epoch_id;
+	u64 trans_id;
+	u64 curr_p;	/* output */
+	u64 file_size;	/* de_len for dentry */
+	u64 ino;
+	u32 time;
+	int link_change;
+	int inplace;	/* For file write entry */
+};
+
+
+
+static inline size_t nova_get_log_entry_size(struct super_block *sb,
+	enum nova_entry_type type)
+{
+	size_t size = 0;
+
+	switch (type) {
+	case FILE_WRITE:
+		size = sizeof(struct nova_file_write_entry);
+		break;
+	case DIR_LOG:
+		size = NOVA_DENTRY_HEADER_LEN;
+		break;
+	case SET_ATTR:
+		size = sizeof(struct nova_setattr_logentry);
+		break;
+	case LINK_CHANGE:
+		size = sizeof(struct nova_link_change_entry);
+		break;
+	case MMAP_WRITE:
+		size = sizeof(struct nova_mmap_entry);
+		break;
+	case SNAPSHOT_INFO:
+		size = sizeof(struct nova_snapshot_info_entry);
+		break;
+	default:
+		break;
+	}
+
+	return size;
+}
+
+
+int nova_invalidate_logentry(struct super_block *sb, void *entry,
+	enum nova_entry_type type, unsigned int num_free);
+int nova_reassign_logentry(struct super_block *sb, void *entry,
+	enum nova_entry_type type);
+int nova_inplace_update_log_entry(struct super_block *sb,
+	struct inode *inode, void *entry,
+	struct nova_log_entry_info *entry_info);
+void nova_clear_last_page_tail(struct super_block *sb,
+	struct inode *inode, loff_t newsize);
+unsigned int nova_free_old_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry,
+	unsigned long pgoff, unsigned int num_free,
+	bool delete_dead, u64 epoch_id);
+int nova_free_inode_log(struct super_block *sb, struct nova_inode *pi,
+	struct nova_inode_info_header *sih);
+int nova_update_alter_pages(struct super_block *sb, struct nova_inode *pi,
+	u64 curr, u64 alter_curr);
+struct nova_file_write_entry *nova_find_next_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, pgoff_t pgoff);
+int nova_allocate_inode_log_pages(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long num_pages,
+	u64 *new_block, int cpuid, enum nova_alloc_direction from_tail);
+int nova_free_contiguous_log_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 head);
+u64 nova_get_append_head(struct super_block *sb, struct nova_inode *pi,
+	struct nova_inode_info_header *sih, u64 tail, size_t size, int log_id,
+	int thorough_gc, int *extended);
+int nova_handle_setattr_operation(struct super_block *sb, struct inode *inode,
+	struct nova_inode *pi, unsigned int ia_valid, struct iattr *attr,
+	u64 epoch_id);
+int nova_invalidate_link_change_entry(struct super_block *sb,
+	u64 old_link_change);
+int nova_append_link_change_entry(struct super_block *sb,
+	struct nova_inode *pi, struct inode *inode,
+	struct nova_inode_update *update, u64 *old_linkc, u64 epoch_id);
+int nova_set_write_entry_updating(struct super_block *sb,
+	struct nova_file_write_entry *entry, int set);
+int nova_inplace_update_write_entry(struct super_block *sb,
+	struct inode *inode, struct nova_file_write_entry *entry,
+	struct nova_log_entry_info *entry_info);
+int nova_append_mmap_entry(struct super_block *sb, struct nova_inode *pi,
+	struct inode *inode, struct nova_mmap_entry *data,
+	struct nova_inode_update *update, struct vma_item *item);
+int nova_append_file_write_entry(struct super_block *sb, struct nova_inode *pi,
+	struct inode *inode, struct nova_file_write_entry *data,
+	struct nova_inode_update *update);
+int nova_append_snapshot_info_entry(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info *si,
+	struct snapshot_info *info, struct nova_snapshot_info_entry *data,
+	struct nova_inode_update *update);
+int nova_assign_write_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry,
+	struct nova_file_write_entry *entryc, bool free);
+
+
+void nova_print_curr_log_page(struct super_block *sb, u64 curr);
+void nova_print_nova_log(struct super_block *sb,
+	struct nova_inode_info_header *sih);
+int nova_get_nova_log_pages(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_inode *pi);
+void nova_print_nova_log_pages(struct super_block *sb,
+	struct nova_inode_info_header *sih);
+
+#endif
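
The log-pointer arithmetic and the growth policy in the patch above can be sketched in user space as follows.  This is only an illustration: the constants are copied from log.h (with a ULL suffix added for the sketch), while entry_fits(), pages_to_add() and log_layout_demo() are hypothetical names that do not exist in NOVA.  In particular, entry_fits() is an assumption about how an "is this the last entry" test might work, not the patch's is_last_entry().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants copied from log.h; the ULL suffix is added for this sketch. */
#define PAGE_OFFSET_MASK	4095ULL
#define BLOCK_OFF(p)		((p) & ~PAGE_OFFSET_MASK)
#define ENTRY_LOC(p)		((p) & PAGE_OFFSET_MASK)
#define LOG_BLOCK_TAIL		4064
#define PAGE_TAIL(p)		(BLOCK_OFF(p) + LOG_BLOCK_TAIL)
#define EXTEND_THRESHOLD	256

/* Hypothetical helper: true if `size` bytes at p end before the page tail. */
static int entry_fits(uint64_t p, size_t size)
{
	return ENTRY_LOC(p) + size <= LOG_BLOCK_TAIL;
}

/*
 * Hypothetical helper mirroring the num_pages choice in
 * nova_extend_inode_log(): double the log while it is small, then grow
 * by a fixed EXTEND_THRESHOLD pages.
 */
static unsigned long pages_to_add(unsigned long log_pages)
{
	return log_pages >= EXTEND_THRESHOLD ? EXTEND_THRESHOLD : log_pages;
}

/* Usage demonstration; call it from main() or a test harness. */
static void log_layout_demo(void)
{
	uint64_t p = 0x5123ULL;	/* arbitrary example log pointer */

	assert(BLOCK_OFF(p) == 0x5000ULL);	/* start of the 4KB log page */
	assert(ENTRY_LOC(p) == 0x123ULL);	/* byte offset inside the page */
	assert(PAGE_TAIL(p) == 0x5FE0ULL);	/* where the page tail begins */
	assert(pages_to_add(100) == 100);	/* below threshold: doubles */
	assert(pages_to_add(1000) == 256);	/* above threshold: +256 pages */
}
```

The last 32 bytes of each 4KB page hold struct nova_inode_page_tail, which is why LOG_BLOCK_TAIL is 4064 rather than 4096.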


* [RFC 06/16] NOVA: Lite-weight journaling for complex ops
  2017-08-03  7:48 ` Steven Swanson
@ 2017-08-03  7:48   ` Steven Swanson
  -1 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson

Nova uses a lightweight journaling mechanism to provide atomicity for
operations that modify more than one inode.  The journals provide logging
for two kinds of operations:

1.  Single word updates (JOURNAL_ENTRY)
2.  Copying inodes (JOURNAL_INODE)

The journals are undo logs: Nova creates the journal entries for an operation,
and if the operation does not complete due to a system failure, the recovery
process rolls back the changes using the journal entries.

To commit, Nova drops the log.
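
The undo-log discipline described above can be sketched in user space as follows.  This is a simplified model, not the NOVA code: the struct and function names, the fixed slot count, and the use of array indices in place of PMEM offsets are all assumptions made for illustration, and cache flushes and fences are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JOURNAL_SLOTS 8	/* hypothetical capacity of the circular journal */

struct undo_entry {
	size_t index;		/* which word will be modified */
	uint64_t old_value;	/* its value before the modification */
};

struct undo_journal {
	struct undo_entry slots[JOURNAL_SLOTS];
	size_t head;	/* first live entry */
	size_t tail;	/* one past the last live entry */
};

/* Save the current value of mem[index] before the caller overwrites it. */
static void journal_record(struct undo_journal *j, const uint64_t *mem,
			   size_t index)
{
	/* The sketch simply refuses to overflow the circular buffer. */
	assert((j->tail + 1) % JOURNAL_SLOTS != j->head);

	j->slots[j->tail].index = index;
	j->slots[j->tail].old_value = mem[index];
	j->tail = (j->tail + 1) % JOURNAL_SLOTS;
}

/* Commit: drop the log by moving head up to tail. */
static void journal_commit(struct undo_journal *j)
{
	j->head = j->tail;
}

/* Recovery: restore every saved word, then empty the journal. */
static void journal_recover(struct undo_journal *j, uint64_t *mem)
{
	size_t cur = j->head;

	while (cur != j->tail) {
		mem[j->slots[cur].index] = j->slots[cur].old_value;
		cur = (cur + 1) % JOURNAL_SLOTS;
	}
	j->tail = j->head;
}
```

As in the patch, all entries for an operation are recorded before any word is modified, so walking the journal from head to tail is sufficient to restore the old state.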

Nova maintains one journal per CPU.  The head and tail pointers for each
journal live in a reserved page near the beginning of the file system.
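
A rough sketch of the addressing this implies is shown below.  It assumes a 64-byte cache line, a 4KB page, and the 32-byte packed journal entry from journal.h; the SKETCH_* constants and both function names are hypothetical, and the limit of 64 CPUs per reserved page is an inference from those sizes rather than something the patch states.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE	4096ULL	/* assumed page size */
#define SKETCH_CACHELINE	64ULL	/* assumed CACHELINE_SIZE */
#define SKETCH_ENTRY_SIZE	32ULL	/* sizeof(struct nova_lite_journal_entry) */

/*
 * Offset of the head/tail pointer pair for `cpu`, given the offset of the
 * reserved page.  Each CPU gets its own cache line.
 */
static uint64_t journal_ptr_offset(uint64_t reserved_page_off, unsigned int cpu)
{
	/* Assumption: all pairs fit in the single reserved page. */
	assert(cpu < SKETCH_PAGE_SIZE / SKETCH_CACHELINE);
	return reserved_page_off + cpu * SKETCH_CACHELINE;
}

/*
 * Advance to the next entry of a one-page circular journal, wrapping to the
 * start of the page when the next entry would reach the page boundary.
 * This follows the logic of next_lite_journal() in the patch.
 */
static uint64_t journal_next_entry(uint64_t curr)
{
	if ((curr & (SKETCH_PAGE_SIZE - 1)) + SKETCH_ENTRY_SIZE >=
	    SKETCH_PAGE_SIZE)
		return curr & ~(SKETCH_PAGE_SIZE - 1);

	return curr + SKETCH_ENTRY_SIZE;
}
```

Note that, with this wrap condition, the last 32-byte slot of the page is used as a position from which to wrap, so the pointer cycles through 127 distinct offsets per page.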

During recovery, Nova scans the journals and undoes the operations described by
each entry.
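
Before undoing anything, the patch verifies the checksum of every entry in a journal and aborts if any entry fails.  A minimal user-space sketch of that validate-first step follows.  The entry layout mirrors struct nova_lite_journal_entry, but the names are hypothetical, byte order conversion is ignored, and the bitwise CRC32C here is a stand-in that may not match nova_crc32c() bit for bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field layout as the 32-byte journal entry; host byte order only. */
struct sketch_entry {
	uint64_t type;
	uint64_t data1;
	uint64_t data2;
	uint32_t padding;
	uint32_t csum;
};

/* Bitwise CRC32C (Castagnoli, reflected polynomial), no final inversion. */
static uint32_t sketch_crc32c(uint32_t crc, const unsigned char *buf,
			      size_t len)
{
	int k;

	while (len--) {
		crc ^= *buf++;
		for (k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Checksum over the whole entry except the trailing csum field. */
static uint32_t entry_csum(const struct sketch_entry *e)
{
	return sketch_crc32c(~0u, (const unsigned char *)e,
			     sizeof(*e) - sizeof(uint32_t));
}

static void entry_seal(struct sketch_entry *e)
{
	e->csum = entry_csum(e);
}

/* Returns 0 if every entry verifies, -1 as soon as one does not. */
static int journal_all_valid(const struct sketch_entry *entries, size_t n)
{
	size_t i;

	assert(entries != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (entries[i].csum != entry_csum(&entries[i]))
			return -1;
	return 0;
}
```

Checking the whole journal first means a corrupted entry is detected before any rollback has been applied, rather than halfway through.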

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 fs/nova/journal.c |  474 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/journal.h |   61 +++++++
 2 files changed, 535 insertions(+)
 create mode 100644 fs/nova/journal.c
 create mode 100644 fs/nova/journal.h

diff --git a/fs/nova/journal.c b/fs/nova/journal.c
new file mode 100644
index 000000000000..b05c7212929f
--- /dev/null
+++ b/fs/nova/journal.c
@@ -0,0 +1,474 @@
+/*
+ * NOVA journaling facility.
+ *
+ * This file contains journaling code to guarantee the atomicity of directory
+ * operations that span multiple inodes (unlink, rename, etc).
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/init.h>
+#include <linux/vfs.h>
+#include <linux/uaccess.h>
+#include <linux/mm.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>
+#include "nova.h"
+#include "journal.h"
+
+/**************************** Lite journal ******************************/
+
+static inline void
+nova_print_lite_transaction(struct nova_lite_journal_entry *entry)
+{
+	nova_dbg("Entry %p: Type %llu, data1 0x%llx, data2 0x%llx, checksum %u\n",
+			entry, entry->type,
+			entry->data1, entry->data2, entry->csum);
+}
+
+static inline int nova_update_journal_entry_csum(struct super_block *sb,
+	struct nova_lite_journal_entry *entry)
+{
+	u32 crc = 0;
+
+	crc = nova_crc32c(~0, (__u8 *)entry,
+			(sizeof(struct nova_lite_journal_entry)
+			 - sizeof(__le32)));
+
+	entry->csum = cpu_to_le32(crc);
+	nova_flush_buffer(entry, sizeof(struct nova_lite_journal_entry), 0);
+	return 0;
+}
+
+static inline int nova_check_entry_integrity(struct super_block *sb,
+	struct nova_lite_journal_entry *entry)
+{
+	u32 crc = 0;
+
+	crc = nova_crc32c(~0, (__u8 *)entry,
+			(sizeof(struct nova_lite_journal_entry)
+			 - sizeof(__le32)));
+
+	if (entry->csum == cpu_to_le32(crc))
+		return 0;
+	else
+		return 1;
+}
+
+// Get the next journal entry.  Journal entries live in a 1-page circular
+// buffer, so advancing past the end of the page wraps to its start.
+//
+// TODO: Add check to ensure that the journal doesn't grow too large.
+static inline u64 next_lite_journal(u64 curr_p)
+{
+	size_t size = sizeof(struct nova_lite_journal_entry);
+
+	if ((curr_p & (PAGE_SIZE - 1)) + size >= PAGE_SIZE)
+		return (curr_p & PAGE_MASK);
+
+	return curr_p + size;
+}
+
+// Walk the journal for one CPU, and verify the checksum on each entry.
+static int nova_check_journal_entries(struct super_block *sb,
+	struct journal_ptr_pair *pair)
+{
+	struct nova_lite_journal_entry *entry;
+	u64 temp;
+	int ret;
+
+	temp = pair->journal_head;
+	while (temp != pair->journal_tail) {
+		entry = (struct nova_lite_journal_entry *)nova_get_block(sb,
+									temp);
+		ret = nova_check_entry_integrity(sb, entry);
+		if (ret) {
+			nova_dbg("Entry %p checksum failure\n", entry);
+			nova_print_lite_transaction(entry);
+			return ret;
+		}
+		temp = next_lite_journal(temp);
+	}
+
+	return 0;
+}
+
+/**************************** Journal Recovery ******************************/
+
+static void nova_undo_journal_inode(struct super_block *sb,
+	struct nova_lite_journal_entry *entry)
+{
+	struct nova_inode *pi, *alter_pi;
+	u64 pi_addr, alter_pi_addr;
+
+	if (metadata_csum == 0)
+		return;
+
+	pi_addr = le64_to_cpu(entry->data1);
+	alter_pi_addr = le64_to_cpu(entry->data2);
+
+	pi = (struct nova_inode *)nova_get_block(sb, pi_addr);
+	alter_pi = (struct nova_inode *)nova_get_block(sb, alter_pi_addr);
+
+	memcpy_to_pmem_nocache(pi, alter_pi, sizeof(struct nova_inode));
+}
+
+static void nova_undo_journal_entry(struct super_block *sb,
+	struct nova_lite_journal_entry *entry)
+{
+	u64 addr, value;
+
+	addr = le64_to_cpu(entry->data1);
+	value = le64_to_cpu(entry->data2);
+
+	*(u64 *)nova_get_block(sb, addr) = (u64)value;
+	nova_flush_buffer((void *)nova_get_block(sb, addr), CACHELINE_SIZE, 0);
+}
+
+static void nova_undo_lite_journal_entry(struct super_block *sb,
+	struct nova_lite_journal_entry *entry)
+{
+	u64 type;
+
+	type = le64_to_cpu(entry->type);
+
+	switch (type) {
+	case JOURNAL_INODE:
+		nova_undo_journal_inode(sb, entry);
+		break;
+	case JOURNAL_ENTRY:
+		nova_undo_journal_entry(sb, entry);
+		break;
+	default:
+		nova_dbg("%s: unknown data type %llu\n", __func__, type);
+		break;
+	}
+}
+
+/* Roll back all journal entries */
+static int nova_recover_lite_journal(struct super_block *sb,
+	struct journal_ptr_pair *pair)
+{
+	struct nova_lite_journal_entry *entry;
+	u64 temp;
+
+	nova_memunlock_journal(sb);
+	temp = pair->journal_head;
+	while (temp != pair->journal_tail) {
+		entry = (struct nova_lite_journal_entry *)nova_get_block(sb,
+									temp);
+		nova_undo_lite_journal_entry(sb, entry);
+		temp = next_lite_journal(temp);
+	}
+
+	pair->journal_tail = pair->journal_head;
+	nova_memlock_journal(sb);
+	nova_flush_buffer(&pair->journal_head, CACHELINE_SIZE, 1);
+
+	return 0;
+}
+
+/**************************** Create/commit ******************************/
+
+static u64 nova_append_replica_inode_journal(struct super_block *sb,
+	u64 curr_p, struct inode *inode)
+{
+	struct nova_lite_journal_entry *entry;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+
+	entry = (struct nova_lite_journal_entry *)nova_get_block(sb,
+							curr_p);
+	entry->type = cpu_to_le64(JOURNAL_INODE);
+	entry->padding = 0;
+	entry->data1 = cpu_to_le64(sih->pi_addr);
+	entry->data2 = cpu_to_le64(sih->alter_pi_addr);
+	nova_update_journal_entry_csum(sb, entry);
+
+	curr_p = next_lite_journal(curr_p);
+	return curr_p;
+}
+
+/* Create and append an undo entry for a small update to PMEM. */
+static u64 nova_append_entry_journal(struct super_block *sb,
+	u64 curr_p, void *field)
+{
+	struct nova_lite_journal_entry *entry;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u64 *aligned_field;
+	u64 addr;
+
+	entry = (struct nova_lite_journal_entry *)nova_get_block(sb,
+							curr_p);
+	entry->type = cpu_to_le64(JOURNAL_ENTRY);
+	entry->padding = 0;
+	/* Align to 8 bytes */
+	aligned_field = (u64 *)((unsigned long)field & ~7UL);
+	/* Store the offset from the start of Nova instead of the pointer */
+	addr = (u64)nova_get_addr_off(sbi, aligned_field);
+	entry->data1 = cpu_to_le64(addr);
+	entry->data2 = cpu_to_le64(*aligned_field);
+	nova_update_journal_entry_csum(sb, entry);
+
+	curr_p = next_lite_journal(curr_p);
+	return curr_p;
+}
+
+static u64 nova_journal_inode_tail(struct super_block *sb,
+	u64 curr_p, struct nova_inode *pi)
+{
+	curr_p = nova_append_entry_journal(sb, curr_p, &pi->log_tail);
+	if (metadata_csum)
+		curr_p = nova_append_entry_journal(sb, curr_p,
+						&pi->alter_log_tail);
+	return curr_p;
+}
+
+/* Create and append undo log entries for creating a new file or directory. */
+static u64 nova_append_inode_journal(struct super_block *sb,
+	u64 curr_p, struct inode *inode, int new_inode,
+	int invalidate, int is_dir)
+{
+	struct nova_inode *pi = nova_get_inode(sb, inode);
+
+	if (metadata_csum)
+		return nova_append_replica_inode_journal(sb, curr_p, inode);
+
+	if (!pi) {
+		nova_err(sb, "%s: get inode failed\n", __func__);
+		return curr_p;
+	}
+
+	if (is_dir)
+		return nova_journal_inode_tail(sb, curr_p, pi);
+
+	if (new_inode) {
+		curr_p = nova_append_entry_journal(sb, curr_p,
+						&pi->valid);
+	} else {
+		curr_p = nova_journal_inode_tail(sb, curr_p, pi);
+		if (invalidate) {
+			curr_p = nova_append_entry_journal(sb, curr_p,
+						&pi->valid);
+			curr_p = nova_append_entry_journal(sb, curr_p,
+						&pi->delete_epoch_id);
+		}
+	}
+
+	return curr_p;
+}
+
+static u64 nova_append_dentry_journal(struct super_block *sb,
+	u64 curr_p, struct nova_dentry *dentry)
+{
+	curr_p = nova_append_entry_journal(sb, curr_p, &dentry->ino);
+	curr_p = nova_append_entry_journal(sb, curr_p, &dentry->csum);
+	return curr_p;
+}
+
+/* Journaled transactions for inode creation */
+u64 nova_create_inode_transaction(struct super_block *sb,
+	struct inode *inode, struct inode *dir, int cpu,
+	int new_inode, int invalidate)
+{
+	struct journal_ptr_pair *pair;
+	u64 temp;
+
+	pair = nova_get_journal_pointers(sb, cpu);
+	if (pair->journal_head == 0 ||
+			pair->journal_head != pair->journal_tail)
+		BUG();
+
+	temp = pair->journal_head;
+
+	temp = nova_append_inode_journal(sb, temp, inode,
+					new_inode, invalidate, 0);
+
+	temp = nova_append_inode_journal(sb, temp, dir,
+					new_inode, invalidate, 1);
+
+	pair->journal_tail = temp;
+	nova_flush_buffer(&pair->journal_head, CACHELINE_SIZE, 1);
+
+	nova_dbgv("%s: head 0x%llx, tail 0x%llx\n",
+			__func__, pair->journal_head, pair->journal_tail);
+	return temp;
+}
+
+/* Journaled transactions for rename operations */
+u64 nova_create_rename_transaction(struct super_block *sb,
+	struct inode *old_inode, struct inode *old_dir, struct inode *new_inode,
+	struct inode *new_dir, struct nova_dentry *father_entry,
+	int invalidate_new_inode, int cpu)
+{
+	struct journal_ptr_pair *pair;
+	u64 temp;
+
+	pair = nova_get_journal_pointers(sb, cpu);
+	if (pair->journal_head == 0 ||
+			pair->journal_head != pair->journal_tail)
+		BUG();
+
+	temp = pair->journal_head;
+
+	/* Journal tails for old inode */
+	temp = nova_append_inode_journal(sb, temp, old_inode, 0, 0, 0);
+
+	/* Journal tails for old dir */
+	temp = nova_append_inode_journal(sb, temp, old_dir, 0, 0, 1);
+
+	if (new_inode) {
+		/* New inode may be unlinked */
+		temp = nova_append_inode_journal(sb, temp, new_inode, 0,
+					invalidate_new_inode, 0);
+	}
+
+	if (new_dir)
+		temp = nova_append_inode_journal(sb, temp, new_dir, 0, 0, 1);
+
+	if (father_entry)
+		temp = nova_append_dentry_journal(sb, temp, father_entry);
+
+	pair->journal_tail = temp;
+	nova_flush_buffer(&pair->journal_head, CACHELINE_SIZE, 1);
+
+	nova_dbgv("%s: head 0x%llx, tail 0x%llx\n",
+			__func__, pair->journal_head, pair->journal_tail);
+	return temp;
+}
+
+/* For log entry inplace update */
+u64 nova_create_logentry_transaction(struct super_block *sb,
+	void *entry, enum nova_entry_type type, int cpu)
+{
+	struct journal_ptr_pair *pair;
+	size_t size = 0;
+	int i, count;
+	u64 temp;
+
+	pair = nova_get_journal_pointers(sb, cpu);
+	if (pair->journal_head == 0 ||
+			pair->journal_head != pair->journal_tail)
+		BUG();
+
+	size = nova_get_log_entry_size(sb, type);
+
+	temp = pair->journal_head;
+
+	count = size / 8;
+	for (i = 0; i < count; i++) {
+		temp = nova_append_entry_journal(sb, temp,
+						(char *)entry + i * 8);
+	}
+
+	pair->journal_tail = temp;
+	nova_flush_buffer(&pair->journal_head, CACHELINE_SIZE, 1);
+
+	nova_dbgv("%s: head 0x%llx, tail 0x%llx\n",
+			__func__, pair->journal_head, pair->journal_tail);
+	return temp;
+}
+
+/* Commit the transactions by dropping the journal entries */
+void nova_commit_lite_transaction(struct super_block *sb, u64 tail, int cpu)
+{
+	struct journal_ptr_pair *pair;
+
+	pair = nova_get_journal_pointers(sb, cpu);
+	if (pair->journal_tail != tail)
+		BUG();
+
+	pair->journal_head = tail;
+	nova_flush_buffer(&pair->journal_head, CACHELINE_SIZE, 1);
+}
+
+/**************************** Initialization ******************************/
+
+// Initialize DRAM journal state, validate, and recover
+int nova_lite_journal_soft_init(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct journal_ptr_pair *pair;
+	int i;
+	int ret = 0;
+
+	sbi->journal_locks = kcalloc(sbi->cpus, sizeof(spinlock_t),
+				     GFP_KERNEL);
+	if (!sbi->journal_locks)
+		return -ENOMEM;
+
+	for (i = 0; i < sbi->cpus; i++)
+		spin_lock_init(&sbi->journal_locks[i]);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		pair = nova_get_journal_pointers(sb, i);
+		if (pair->journal_head == pair->journal_tail)
+			continue;
+
+		/* Ensure all entries are genuine */
+		ret = nova_check_journal_entries(sb, pair);
+		if (ret) {
+			nova_err(sb, "Journal %d checksum failure\n", i);
+			ret = -EINVAL;
+			break;
+		}
+
+		ret = nova_recover_lite_journal(sb, pair);
+	}
+
+	return ret;
+}
+
+/* Initialize persistent journal state */
+int nova_lite_journal_hard_init(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info_header sih;
+	struct journal_ptr_pair *pair;
+	unsigned long blocknr = 0;
+	int allocated;
+	int i;
+	u64 block;
+
+	sih.ino = NOVA_LITEJOURNAL_INO;
+	sih.i_blk_type = NOVA_BLOCK_TYPE_4K;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		pair = nova_get_journal_pointers(sb, i);
+
+		allocated = nova_new_log_blocks(sb, &sih, &blocknr, 1,
+			ALLOC_INIT_ZERO, ANY_CPU, ALLOC_FROM_HEAD);
+		nova_dbg_verbose("%s: allocate log @ 0x%lx\n", __func__,
+							blocknr);
+		if (allocated != 1 || blocknr == 0)
+			return -ENOSPC;
+
+		block = nova_get_block_off(sb, blocknr, NOVA_BLOCK_TYPE_4K);
+		nova_memunlock_range(sb, pair, CACHELINE_SIZE);
+		pair->journal_head = pair->journal_tail = block;
+		nova_flush_buffer(pair, CACHELINE_SIZE, 0);
+		nova_memlock_range(sb, pair, CACHELINE_SIZE);
+	}
+
+	PERSISTENT_BARRIER();
+	return nova_lite_journal_soft_init(sb);
+}
+
diff --git a/fs/nova/journal.h b/fs/nova/journal.h
new file mode 100644
index 000000000000..621138bb6eac
--- /dev/null
+++ b/fs/nova/journal.h
@@ -0,0 +1,61 @@
+#ifndef __JOURNAL_H
+#define __JOURNAL_H
+
+#include <linux/types.h>
+#include <linux/fs.h>
+#include "nova.h"
+#include "super.h"
+
+
+/* ======================= Lite journal ========================= */
+
+#define NOVA_MAX_JOURNAL_LENGTH 128
+
+#define	JOURNAL_INODE	1
+#define	JOURNAL_ENTRY	2
+
+/* Lightweight journal entry */
+struct nova_lite_journal_entry {
+	__le64 type;       // JOURNAL_INODE or JOURNAL_ENTRY
+	__le64 data1;
+	__le64 data2;
+	__le32 padding;
+	__le32 csum;
+} __attribute((__packed__));
+
+/* Head and tail pointers into a circular queue of journal entries.  There's
+ * one of these per CPU.
+ */
+struct journal_ptr_pair {
+	__le64 journal_head;
+	__le64 journal_tail;
+};
+
+static inline
+struct journal_ptr_pair *nova_get_journal_pointers(struct super_block *sb,
+	int cpu)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (cpu >= sbi->cpus)
+		BUG();
+
+	return (struct journal_ptr_pair *)((char *)nova_get_block(sb,
+		NOVA_DEF_BLOCK_SIZE_4K * JOURNAL_START) + cpu * CACHELINE_SIZE);
+}
+
+
+u64 nova_create_inode_transaction(struct super_block *sb,
+	struct inode *inode, struct inode *dir, int cpu,
+	int new_inode, int invalidate);
+u64 nova_create_rename_transaction(struct super_block *sb,
+	struct inode *old_inode, struct inode *old_dir, struct inode *new_inode,
+	struct inode *new_dir, struct nova_dentry *father_entry,
+	int invalidate_new_inode, int cpu);
+u64 nova_create_logentry_transaction(struct super_block *sb,
+	void *entry, enum nova_entry_type type, int cpu);
+void nova_commit_lite_transaction(struct super_block *sb, u64 tail, int cpu);
+int nova_lite_journal_soft_init(struct super_block *sb);
+int nova_lite_journal_hard_init(struct super_block *sb);
+
+#endif



* [RFC 06/16] NOVA: Lite-weight journaling for complex ops
@ 2017-08-03  7:48   ` Steven Swanson
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson, dan.j.williams

Nova uses a lightweight journaling mechanism to provide atomicity for
operations that modify more than one inode.  The journals provide logging
for two kinds of operations:

1.  Single word updates (JOURNAL_ENTRY)
2.  Copying inodes (JOURNAL_INODE)

The journals are undo logs: Nova creates the journal entries for an operation,
and if the operation does not complete due to a system failure, the recovery
process rolls back the changes using the journal entries.

To commit, Nova drops the log.

Nova maintains one journal per CPU.  The head and tail pointers for each
journal live in a reserved page near the beginning of the file system.

During recovery, Nova scans the journals and undoes the operations described by
each entry.
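
As a rough illustration of the undo-log protocol described above, here is a
minimal user-space sketch.  It is not the code from this patch: the names
(undo_rec, undo_journal, journal_log_word, SKETCH_SLOTS) are hypothetical,
records hold raw pointers rather than offsets into PMEM, and checksums and
cache-line flushes are omitted.  It only shows the ordering: log the old
value, modify in place, then either commit by dropping the records or
recover by restoring them.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SLOTS 8	/* hypothetical journal capacity */

/* One undo record: where a 64-bit word lives and what it held before. */
struct undo_rec {
	uint64_t *addr;
	uint64_t old;
};

/* Circular journal; head == tail means no transaction is in flight. */
struct undo_journal {
	struct undo_rec slot[SKETCH_SLOTS];
	unsigned int head;
	unsigned int tail;
};

/* Record the current value of *addr before the caller overwrites it. */
static void journal_log_word(struct undo_journal *j, uint64_t *addr)
{
	unsigned int next = (j->tail + 1) % SKETCH_SLOTS;

	assert(next != j->head);	/* journal full */
	j->slot[j->tail].addr = addr;
	j->slot[j->tail].old = *addr;
	j->tail = next;
}

/* Commit: the new values stand, so simply drop the undo records. */
static void journal_commit(struct undo_journal *j)
{
	j->head = j->tail;
}

/* Recover: restore every logged word, then empty the journal. */
static void journal_recover(struct undo_journal *j)
{
	unsigned int i;

	for (i = j->head; i != j->tail; i = (i + 1) % SKETCH_SLOTS)
		*j->slot[i].addr = j->slot[i].old;
	j->tail = j->head;
}
```

A caller would log each word it is about to change, perform the updates, and
call journal_commit() once they are complete; after a simulated crash,
journal_recover() puts the logged words back.  Recovering an empty journal is
a no-op, which is why a committed transaction is never rolled back.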

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 fs/nova/journal.c |  474 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/journal.h |   61 +++++++
 2 files changed, 535 insertions(+)
 create mode 100644 fs/nova/journal.c
 create mode 100644 fs/nova/journal.h

diff --git a/fs/nova/journal.c b/fs/nova/journal.c
new file mode 100644
index 000000000000..b05c7212929f
--- /dev/null
+++ b/fs/nova/journal.c
@@ -0,0 +1,474 @@
+/*
+ * NOVA journaling facility.
+ *
+ * This file contains journaling code to guarantee the atomicity of directory
+ * operations that span multiple inodes (unlink, rename, etc).
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/init.h>
+#include <linux/vfs.h>
+#include <linux/uaccess.h>
+#include <linux/mm.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>
+#include "nova.h"
+#include "journal.h"
+
+/**************************** Lite journal ******************************/
+
+static inline void
+nova_print_lite_transaction(struct nova_lite_journal_entry *entry)
+{
+	nova_dbg("Entry %p: Type %llu, data1 0x%llx, data2 0x%llx, checksum %u\n",
+			entry, entry->type,
+			entry->data1, entry->data2, entry->csum);
+}
+
+static inline int nova_update_journal_entry_csum(struct super_block *sb,
+	struct nova_lite_journal_entry *entry)
+{
+	u32 crc = 0;
+
+	crc = nova_crc32c(~0, (__u8 *)entry,
+			(sizeof(struct nova_lite_journal_entry)
+			 - sizeof(__le32)));
+
+	entry->csum = cpu_to_le32(crc);
+	nova_flush_buffer(entry, sizeof(struct nova_lite_journal_entry), 0);
+	return 0;
+}
+
+static inline int nova_check_entry_integrity(struct super_block *sb,
+	struct nova_lite_journal_entry *entry)
+{
+	u32 crc = 0;
+
+	crc = nova_crc32c(~0, (__u8 *)entry,
+			(sizeof(struct nova_lite_journal_entry)
+			 - sizeof(__le32)));
+
+	if (entry->csum == cpu_to_le32(crc))
+		return 0;
+	else
+		return 1;
+}
+
+// Get the next journal entry.  Journal entries are stored in a 1-page
+// circular buffer.
+//
+// TODO: Add check to ensure that the journal doesn't grow too large.
+static inline u64 next_lite_journal(u64 curr_p)
+{
+	size_t size = sizeof(struct nova_lite_journal_entry);
+
+	if ((curr_p & (PAGE_SIZE - 1)) + size >= PAGE_SIZE)
+		return (curr_p & PAGE_MASK);
+
+	return curr_p + size;
+}
+
+// Walk the journal for one CPU, and verify the checksum on each entry.
+static int nova_check_journal_entries(struct super_block *sb,
+	struct journal_ptr_pair *pair)
+{
+	struct nova_lite_journal_entry *entry;
+	u64 temp;
+	int ret;
+
+	temp = pair->journal_head;
+	while (temp != pair->journal_tail) {
+		entry = (struct nova_lite_journal_entry *)nova_get_block(sb,
+									temp);
+		ret = nova_check_entry_integrity(sb, entry);
+		if (ret) {
+			nova_dbg("Entry %p checksum failure\n", entry);
+			nova_print_lite_transaction(entry);
+			return ret;
+		}
+		temp = next_lite_journal(temp);
+	}
+
+	return 0;
+}
+
+/**************************** Journal Recovery ******************************/
+
+static void nova_undo_journal_inode(struct super_block *sb,
+	struct nova_lite_journal_entry *entry)
+{
+	struct nova_inode *pi, *alter_pi;
+	u64 pi_addr, alter_pi_addr;
+
+	if (metadata_csum == 0)
+		return;
+
+	pi_addr = le64_to_cpu(entry->data1);
+	alter_pi_addr = le64_to_cpu(entry->data2);
+
+	pi = (struct nova_inode *)nova_get_block(sb, pi_addr);
+	alter_pi = (struct nova_inode *)nova_get_block(sb, alter_pi_addr);
+
+	memcpy_to_pmem_nocache(pi, alter_pi, sizeof(struct nova_inode));
+}
+
+static void nova_undo_journal_entry(struct super_block *sb,
+	struct nova_lite_journal_entry *entry)
+{
+	u64 addr, value;
+
+	addr = le64_to_cpu(entry->data1);
+	value = le64_to_cpu(entry->data2);
+
+	*(u64 *)nova_get_block(sb, addr) = (u64)value;
+	nova_flush_buffer((void *)nova_get_block(sb, addr), CACHELINE_SIZE, 0);
+}
+
+static void nova_undo_lite_journal_entry(struct super_block *sb,
+	struct nova_lite_journal_entry *entry)
+{
+	u64 type;
+
+	type = le64_to_cpu(entry->type);
+
+	switch (type) {
+	case JOURNAL_INODE:
+		nova_undo_journal_inode(sb, entry);
+		break;
+	case JOURNAL_ENTRY:
+		nova_undo_journal_entry(sb, entry);
+		break;
+	default:
+		nova_dbg("%s: unknown data type %llu\n", __func__, type);
+		break;
+	}
+}
+
+/* Roll back all journal entries */
+static int nova_recover_lite_journal(struct super_block *sb,
+	struct journal_ptr_pair *pair)
+{
+	struct nova_lite_journal_entry *entry;
+	u64 temp;
+
+	nova_memunlock_journal(sb);
+	temp = pair->journal_head;
+	while (temp != pair->journal_tail) {
+		entry = (struct nova_lite_journal_entry *)nova_get_block(sb,
+									temp);
+		nova_undo_lite_journal_entry(sb, entry);
+		temp = next_lite_journal(temp);
+	}
+
+	pair->journal_tail = pair->journal_head;
+	nova_memlock_journal(sb);
+	nova_flush_buffer(&pair->journal_head, CACHELINE_SIZE, 1);
+
+	return 0;
+}
+
+/**************************** Create/commit ******************************/
+
+static u64 nova_append_replica_inode_journal(struct super_block *sb,
+	u64 curr_p, struct inode *inode)
+{
+	struct nova_lite_journal_entry *entry;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+
+	entry = (struct nova_lite_journal_entry *)nova_get_block(sb,
+							curr_p);
+	entry->type = cpu_to_le64(JOURNAL_INODE);
+	entry->padding = 0;
+	entry->data1 = cpu_to_le64(sih->pi_addr);
+	entry->data2 = cpu_to_le64(sih->alter_pi_addr);
+	nova_update_journal_entry_csum(sb, entry);
+
+	curr_p = next_lite_journal(curr_p);
+	return curr_p;
+}
+
+/* Create and append an undo entry for a small update to PMEM. */
+static u64 nova_append_entry_journal(struct super_block *sb,
+	u64 curr_p, void *field)
+{
+	struct nova_lite_journal_entry *entry;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u64 *aligned_field;
+	u64 addr;
+
+	entry = (struct nova_lite_journal_entry *)nova_get_block(sb,
+							curr_p);
+	entry->type = cpu_to_le64(JOURNAL_ENTRY);
+	entry->padding = 0;
+	/* Align to 8 bytes */
+	aligned_field = (u64 *)((unsigned long)field & ~7UL);
+	/* Store the offset from the start of Nova instead of the pointer */
+	addr = (u64)nova_get_addr_off(sbi, aligned_field);
+	entry->data1 = cpu_to_le64(addr);
+	entry->data2 = cpu_to_le64(*aligned_field);
+	nova_update_journal_entry_csum(sb, entry);
+
+	curr_p = next_lite_journal(curr_p);
+	return curr_p;
+}
+
+static u64 nova_journal_inode_tail(struct super_block *sb,
+	u64 curr_p, struct nova_inode *pi)
+{
+	curr_p = nova_append_entry_journal(sb, curr_p, &pi->log_tail);
+	if (metadata_csum)
+		curr_p = nova_append_entry_journal(sb, curr_p,
+						&pi->alter_log_tail);
+	return curr_p;
+}
+
+/* Create and append undo log entries for creating a new file or directory. */
+static u64 nova_append_inode_journal(struct super_block *sb,
+	u64 curr_p, struct inode *inode, int new_inode,
+	int invalidate, int is_dir)
+{
+	struct nova_inode *pi = nova_get_inode(sb, inode);
+
+	if (metadata_csum)
+		return nova_append_replica_inode_journal(sb, curr_p, inode);
+
+	if (!pi) {
+		nova_err(sb, "%s: get inode failed\n", __func__);
+		return curr_p;
+	}
+
+	if (is_dir)
+		return nova_journal_inode_tail(sb, curr_p, pi);
+
+	if (new_inode) {
+		curr_p = nova_append_entry_journal(sb, curr_p,
+						&pi->valid);
+	} else {
+		curr_p = nova_journal_inode_tail(sb, curr_p, pi);
+		if (invalidate) {
+			curr_p = nova_append_entry_journal(sb, curr_p,
+						&pi->valid);
+			curr_p = nova_append_entry_journal(sb, curr_p,
+						&pi->delete_epoch_id);
+		}
+	}
+
+	return curr_p;
+}
+
+static u64 nova_append_dentry_journal(struct super_block *sb,
+	u64 curr_p, struct nova_dentry *dentry)
+{
+	curr_p = nova_append_entry_journal(sb, curr_p, &dentry->ino);
+	curr_p = nova_append_entry_journal(sb, curr_p, &dentry->csum);
+	return curr_p;
+}
+
+/* Journaled transactions for inode creation */
+u64 nova_create_inode_transaction(struct super_block *sb,
+	struct inode *inode, struct inode *dir, int cpu,
+	int new_inode, int invalidate)
+{
+	struct journal_ptr_pair *pair;
+	u64 temp;
+
+	pair = nova_get_journal_pointers(sb, cpu);
+	if (pair->journal_head == 0 ||
+			pair->journal_head != pair->journal_tail)
+		BUG();
+
+	temp = pair->journal_head;
+
+	temp = nova_append_inode_journal(sb, temp, inode,
+					new_inode, invalidate, 0);
+
+	temp = nova_append_inode_journal(sb, temp, dir,
+					new_inode, invalidate, 1);
+
+	pair->journal_tail = temp;
+	nova_flush_buffer(&pair->journal_head, CACHELINE_SIZE, 1);
+
+	nova_dbgv("%s: head 0x%llx, tail 0x%llx\n",
+			__func__, pair->journal_head, pair->journal_tail);
+	return temp;
+}
+
+/* Journaled transactions for rename operations */
+u64 nova_create_rename_transaction(struct super_block *sb,
+	struct inode *old_inode, struct inode *old_dir, struct inode *new_inode,
+	struct inode *new_dir, struct nova_dentry *father_entry,
+	int invalidate_new_inode, int cpu)
+{
+	struct journal_ptr_pair *pair;
+	u64 temp;
+
+	pair = nova_get_journal_pointers(sb, cpu);
+	if (pair->journal_head == 0 ||
+			pair->journal_head != pair->journal_tail)
+		BUG();
+
+	temp = pair->journal_head;
+
+	/* Journal tails for old inode */
+	temp = nova_append_inode_journal(sb, temp, old_inode, 0, 0, 0);
+
+	/* Journal tails for old dir */
+	temp = nova_append_inode_journal(sb, temp, old_dir, 0, 0, 1);
+
+	if (new_inode) {
+		/* New inode may be unlinked */
+		temp = nova_append_inode_journal(sb, temp, new_inode, 0,
+					invalidate_new_inode, 0);
+	}
+
+	if (new_dir)
+		temp = nova_append_inode_journal(sb, temp, new_dir, 0, 0, 1);
+
+	if (father_entry)
+		temp = nova_append_dentry_journal(sb, temp, father_entry);
+
+	pair->journal_tail = temp;
+	nova_flush_buffer(&pair->journal_head, CACHELINE_SIZE, 1);
+
+	nova_dbgv("%s: head 0x%llx, tail 0x%llx\n",
+			__func__, pair->journal_head, pair->journal_tail);
+	return temp;
+}
+
+/* For log entry inplace update */
+u64 nova_create_logentry_transaction(struct super_block *sb,
+	void *entry, enum nova_entry_type type, int cpu)
+{
+	struct journal_ptr_pair *pair;
+	size_t size = 0;
+	int i, count;
+	u64 temp;
+
+	pair = nova_get_journal_pointers(sb, cpu);
+	if (pair->journal_head == 0 ||
+			pair->journal_head != pair->journal_tail)
+		BUG();
+
+	size = nova_get_log_entry_size(sb, type);
+
+	temp = pair->journal_head;
+
+	count = size / 8;
+	for (i = 0; i < count; i++) {
+		temp = nova_append_entry_journal(sb, temp,
+						(char *)entry + i * 8);
+	}
+
+	pair->journal_tail = temp;
+	nova_flush_buffer(&pair->journal_head, CACHELINE_SIZE, 1);
+
+	nova_dbgv("%s: head 0x%llx, tail 0x%llx\n",
+			__func__, pair->journal_head, pair->journal_tail);
+	return temp;
+}
+
+/* Commit the transactions by dropping the journal entries */
+void nova_commit_lite_transaction(struct super_block *sb, u64 tail, int cpu)
+{
+	struct journal_ptr_pair *pair;
+
+	pair = nova_get_journal_pointers(sb, cpu);
+	if (pair->journal_tail != tail)
+		BUG();
+
+	pair->journal_head = tail;
+	nova_flush_buffer(&pair->journal_head, CACHELINE_SIZE, 1);
+}
+
+/**************************** Initialization ******************************/
+
+// Initialize DRAM journal state, validate, and recover
+int nova_lite_journal_soft_init(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct journal_ptr_pair *pair;
+	int i;
+	int ret = 0;
+
+	sbi->journal_locks = kcalloc(sbi->cpus, sizeof(spinlock_t),
+				     GFP_KERNEL);
+	if (!sbi->journal_locks)
+		return -ENOMEM;
+
+	for (i = 0; i < sbi->cpus; i++)
+		spin_lock_init(&sbi->journal_locks[i]);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		pair = nova_get_journal_pointers(sb, i);
+		if (pair->journal_head == pair->journal_tail)
+			continue;
+
+		/* Ensure all entries are genuine */
+		ret = nova_check_journal_entries(sb, pair);
+		if (ret) {
+			nova_err(sb, "Journal %d checksum failure\n", i);
+			ret = -EINVAL;
+			break;
+		}
+
+		ret = nova_recover_lite_journal(sb, pair);
+	}
+
+	return ret;
+}
+
+/* Initialize persistent journal state */
+int nova_lite_journal_hard_init(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info_header sih;
+	struct journal_ptr_pair *pair;
+	unsigned long blocknr = 0;
+	int allocated;
+	int i;
+	u64 block;
+
+	sih.ino = NOVA_LITEJOURNAL_INO;
+	sih.i_blk_type = NOVA_BLOCK_TYPE_4K;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		pair = nova_get_journal_pointers(sb, i);
+
+		allocated = nova_new_log_blocks(sb, &sih, &blocknr, 1,
+			ALLOC_INIT_ZERO, ANY_CPU, ALLOC_FROM_HEAD);
+		nova_dbg_verbose("%s: allocate log @ 0x%lx\n", __func__,
+							blocknr);
+		if (allocated != 1 || blocknr == 0)
+			return -ENOSPC;
+
+		block = nova_get_block_off(sb, blocknr, NOVA_BLOCK_TYPE_4K);
+		nova_memunlock_range(sb, pair, CACHELINE_SIZE);
+		pair->journal_head = pair->journal_tail = block;
+		nova_flush_buffer(pair, CACHELINE_SIZE, 0);
+		nova_memlock_range(sb, pair, CACHELINE_SIZE);
+	}
+
+	PERSISTENT_BARRIER();
+	return nova_lite_journal_soft_init(sb);
+}
+
diff --git a/fs/nova/journal.h b/fs/nova/journal.h
new file mode 100644
index 000000000000..621138bb6eac
--- /dev/null
+++ b/fs/nova/journal.h
@@ -0,0 +1,61 @@
+#ifndef __JOURNAL_H
+#define __JOURNAL_H
+
+#include <linux/types.h>
+#include <linux/fs.h>
+#include "nova.h"
+#include "super.h"
+
+
+/* ======================= Lite journal ========================= */
+
+#define NOVA_MAX_JOURNAL_LENGTH 128
+
+#define	JOURNAL_INODE	1
+#define	JOURNAL_ENTRY	2
+
+/* Lightweight journal entry */
+struct nova_lite_journal_entry {
+	__le64 type;       // JOURNAL_INODE or JOURNAL_ENTRY
+	__le64 data1;
+	__le64 data2;
+	__le32 padding;
+	__le32 csum;
+} __attribute((__packed__));
+
+/* Head and tail pointers into a circular queue of journal entries.  There's
+ * one of these per CPU.
+ */
+struct journal_ptr_pair {
+	__le64 journal_head;
+	__le64 journal_tail;
+};
+
+static inline
+struct journal_ptr_pair *nova_get_journal_pointers(struct super_block *sb,
+	int cpu)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (cpu >= sbi->cpus)
+		BUG();
+
+	return (struct journal_ptr_pair *)((char *)nova_get_block(sb,
+		NOVA_DEF_BLOCK_SIZE_4K * JOURNAL_START) + cpu * CACHELINE_SIZE);
+}
+
+
+u64 nova_create_inode_transaction(struct super_block *sb,
+	struct inode *inode, struct inode *dir, int cpu,
+	int new_inode, int invalidate);
+u64 nova_create_rename_transaction(struct super_block *sb,
+	struct inode *old_inode, struct inode *old_dir, struct inode *new_inode,
+	struct inode *new_dir, struct nova_dentry *father_entry,
+	int invalidate_new_inode, int cpu);
+u64 nova_create_logentry_transaction(struct super_block *sb,
+	void *entry, enum nova_entry_type type, int cpu);
+void nova_commit_lite_transaction(struct super_block *sb, u64 tail, int cpu);
+int nova_lite_journal_soft_init(struct super_block *sb);
+int nova_lite_journal_hard_init(struct super_block *sb);
+
+#endif

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 07/16] NOVA: File and directory operations
  2017-08-03  7:48 ` Steven Swanson
@ 2017-08-03  7:48   ` Steven Swanson
  -1 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson

To access file data via read(), Nova maintains a radix tree in DRAM for each
inode (nova_inode_info_header.tree) that maps file offsets to write log
entries.  For directories, the same tree maps a hash of filenames to their
corresponding dentry.

In both cases, Nova populates the tree when the file or directory is opened
by scanning its log.
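
To make the directory case concrete, here is a small user-space sketch of an
index keyed by a hash of the filename.  It is an illustration under stated
assumptions, not the patch's implementation: a fixed-size open-addressing
table stands in for the kernel radix tree, the hash is the classic BKDR
string hash with an assumed seed of 131, and every sketch_* name is
hypothetical.  Unlike the insert path in this patch, which currently ignores
hash collisions, the sketch probes past a colliding slot and compares the
stored name on lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BUCKETS 64	/* hypothetical table size */

struct sketch_dentry {
	const char *name;
	size_t name_len;
	unsigned long ino;
};

/* Stand-in for the per-inode DRAM index; zero-initialize before use. */
struct sketch_dir {
	unsigned long key[SKETCH_BUCKETS];
	const struct sketch_dentry *val[SKETCH_BUCKETS];
};

/* Classic BKDR string hash; the seed 131 is an assumption. */
static unsigned long sketch_bkdr_hash(const char *s, size_t len)
{
	unsigned long hash = 0;
	size_t i;

	for (i = 0; i < len; i++)
		hash = hash * 131 + (unsigned char)s[i];
	return hash;
}

/* Index a dentry by the hash of its name; 0 on success, -1 if full. */
static int sketch_dir_insert(struct sketch_dir *d,
	const struct sketch_dentry *de)
{
	unsigned long h;
	size_t i, b;

	assert(d && de && de->name_len > 0);
	h = sketch_bkdr_hash(de->name, de->name_len);
	for (i = 0; i < SKETCH_BUCKETS; i++) {
		b = (h + i) % SKETCH_BUCKETS;
		if (!d->val[b]) {
			d->key[b] = h;
			d->val[b] = de;
			return 0;
		}
	}
	return -1;
}

/* Look a name up by hash, then confirm the stored name really matches. */
static const struct sketch_dentry *sketch_dir_lookup(
	const struct sketch_dir *d, const char *name, size_t len)
{
	unsigned long h = sketch_bkdr_hash(name, len);
	size_t i, b;

	for (i = 0; i < SKETCH_BUCKETS; i++) {
		b = (h + i) % SKETCH_BUCKETS;
		if (!d->val[b])
			return NULL;
		if (d->key[b] == h && d->val[b]->name_len == len &&
		    memcmp(d->val[b]->name, name, len) == 0)
			return d->val[b];
	}
	return NULL;
}
```

Rebuilding such an index when a directory is opened amounts to walking the
directory's log and calling the insert routine for each live entry.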

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 fs/nova/dir.c     |  760 +++++++++++++++++++++++++++++++++++++++++++
 fs/nova/file.c    |  943 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/namei.c   |  919 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/symlink.c |  153 +++++++++
 4 files changed, 2775 insertions(+)
 create mode 100644 fs/nova/dir.c
 create mode 100644 fs/nova/file.c
 create mode 100644 fs/nova/namei.c
 create mode 100644 fs/nova/symlink.c

diff --git a/fs/nova/dir.c b/fs/nova/dir.c
new file mode 100644
index 000000000000..47e89088a69b
--- /dev/null
+++ b/fs/nova/dir.c
@@ -0,0 +1,760 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * File operations for directories.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include "nova.h"
+#include "inode.h"
+
+#define DT2IF(dt) (((dt) << 12) & S_IFMT)
+#define IF2DT(sif) (((sif) & S_IFMT) >> 12)
+
+struct nova_dentry *nova_find_dentry(struct super_block *sb,
+	struct nova_inode *pi, struct inode *inode, const char *name,
+	unsigned long name_len)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_dentry *direntry;
+	unsigned long hash;
+
+	hash = BKDRHash(name, name_len);
+	direntry = radix_tree_lookup(&sih->tree, hash);
+
+	return direntry;
+}
+
+int nova_insert_dir_radix_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, const char *name,
+	int namelen, struct nova_dentry *direntry)
+{
+	unsigned long hash;
+	int ret;
+
+	hash = BKDRHash(name, namelen);
+	nova_dbgv("%s: insert %s hash %lu\n", __func__, name, hash);
+
+	/* FIXME: hash collision ignored here */
+	ret = radix_tree_insert(&sih->tree, hash, direntry);
+	if (ret)
+		nova_dbg("%s ERROR %d: %s\n", __func__, ret, name);
+
+	return ret;
+}
+
+static int nova_check_dentry_match(struct super_block *sb,
+	struct nova_dentry *dentry, const char *name, int namelen)
+{
+	if (dentry->name_len != namelen)
+		return -EINVAL;
+
+	return strncmp(dentry->name, name, namelen);
+}
+
+int nova_remove_dir_radix_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, const char *name, int namelen,
+	int replay, struct nova_dentry **create_dentry)
+{
+	struct nova_dentry *entry;
+	struct nova_dentry *entryc, entry_copy;
+	unsigned long hash;
+
+	hash = BKDRHash(name, namelen);
+	entry = radix_tree_delete(&sih->tree, hash);
+
+	if (replay == 0) {
+		if (!entry) {
+			nova_dbg("%s ERROR: %s, length %d, hash %lu\n",
+					__func__, name, namelen, hash);
+			return -EINVAL;
+		}
+
+		if (metadata_csum == 0)
+			entryc = entry;
+		else {
+			entryc = &entry_copy;
+			if (!nova_verify_entry_csum(sb, entry, entryc))
+				return -EINVAL;
+		}
+
+		if (entryc->ino == 0 || entryc->invalid ||
+		    nova_check_dentry_match(sb, entryc, name, namelen)) {
+			nova_dbg("%s dentry not match: %s, length %d, hash %lu\n",
+				 __func__, name, namelen, hash);
+			/* for debug information, still allow access to nvmm */
+			nova_dbg("dentry: type %d, inode %llu, name %s, namelen %u, rec len %u\n",
+				 entry->entry_type, le64_to_cpu(entry->ino),
+				 entry->name, entry->name_len,
+				 le16_to_cpu(entry->de_len));
+			return -EINVAL;
+		}
+
+		if (create_dentry)
+			*create_dentry = entry;
+	}
+
+	return 0;
+}
+
+void nova_delete_dir_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih)
+{
+	struct nova_dentry *direntry;
+	struct nova_dentry *direntryc, entry_copy;
+	unsigned long pos = 0;
+	struct nova_dentry *entries[FREE_BATCH];
+	timing_t delete_time;
+	int nr_entries;
+	int i;
+	void *ret;
+
+	NOVA_START_TIMING(delete_dir_tree_t, delete_time);
+
+	direntryc = &entry_copy;
+	do {
+		nr_entries = radix_tree_gang_lookup(&sih->tree,
+					(void **)entries, pos, FREE_BATCH);
+		for (i = 0; i < nr_entries; i++) {
+			direntry = entries[i];
+			BUG_ON(!direntry);
+
+			if (metadata_csum == 0)
+				direntryc = direntry;
+			else if (!nova_verify_entry_csum(sb, direntry,
+								direntryc))
+				return;
+
+			pos = BKDRHash(direntryc->name, direntryc->name_len);
+			ret = radix_tree_delete(&sih->tree, pos);
+			if (!ret || ret != direntry) {
+				nova_err(sb, "dentry: type %d, inode %llu, name %s, namelen %u, rec len %u\n",
+					direntry->entry_type,
+					le64_to_cpu(direntry->ino),
+					direntry->name, direntry->name_len,
+					le16_to_cpu(direntry->de_len));
+				if (!ret)
+					nova_dbg("ret is NULL\n");
+			}
+		}
+		pos++;
+	} while (nr_entries == FREE_BATCH);
+
+	NOVA_END_TIMING(delete_dir_tree_t, delete_time);
+}
+
+/* ========================= Entry operations ============================= */
+
+static unsigned int nova_init_dentry(struct super_block *sb,
+	struct nova_dentry *de_entry, u64 self_ino, u64 parent_ino,
+	u64 epoch_id)
+{
+	void *start = de_entry;
+	struct nova_inode_log_page *curr_page = start;
+	unsigned int length;
+	unsigned short de_len;
+
+	de_len = NOVA_DIR_LOG_REC_LEN(1);
+	memset(de_entry, 0, de_len);
+	de_entry->entry_type = DIR_LOG;
+	de_entry->epoch_id = epoch_id;
+	de_entry->trans_id = 0;
+	de_entry->ino = cpu_to_le64(self_ino);
+	de_entry->name_len = 1;
+	de_entry->de_len = cpu_to_le16(de_len);
+	de_entry->mtime = timespec_trunc(current_kernel_time(),
+					 sb->s_time_gran).tv_sec;
+
+	de_entry->links_count = 1;
+	strncpy(de_entry->name, ".\0", 2);
+	nova_update_entry_csum(de_entry);
+
+	length = de_len;
+
+	de_entry = (struct nova_dentry *)((char *)de_entry + length);
+	de_len = NOVA_DIR_LOG_REC_LEN(2);
+	memset(de_entry, 0, de_len);
+	de_entry->entry_type = DIR_LOG;
+	de_entry->epoch_id = epoch_id;
+	de_entry->trans_id = 0;
+	de_entry->ino = cpu_to_le64(parent_ino);
+	de_entry->name_len = 2;
+	de_entry->de_len = cpu_to_le16(de_len);
+	de_entry->mtime = timespec_trunc(current_kernel_time(),
+					 sb->s_time_gran).tv_sec;
+
+	de_entry->links_count = 2;
+	strncpy(de_entry->name, "..\0", 3);
+	nova_update_entry_csum(de_entry);
+	length += de_len;
+
+	nova_set_page_num_entries(sb, curr_page, 2, 1);
+
+	nova_flush_buffer(start, length, 0);
+	return length;
+}
+
+/* Append . and .. entries
+ *
+ * TODO: why is epoch_id a parameter when we pass in the sb?
+ */
+int nova_append_dir_init_entries(struct super_block *sb,
+	struct nova_inode *pi, u64 self_ino, u64 parent_ino, u64 epoch_id)
+{
+	struct nova_inode_info_header sih;
+	struct nova_inode *alter_pi;
+	u64 alter_pi_addr = 0;
+	int allocated;
+	int ret;
+	u64 new_block;
+	unsigned int length;
+	struct nova_dentry *de_entry;
+
+	sih.ino = self_ino;
+	sih.i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+
+	allocated = nova_allocate_inode_log_pages(sb, &sih, 1, &new_block,
+							ANY_CPU, 0);
+	if (allocated != 1) {
+		nova_err(sb, "ERROR: no inode log page available\n");
+		return -ENOMEM;
+	}
+
+	nova_memunlock_inode(sb, pi);
+
+	pi->log_tail = pi->log_head = new_block;
+
+	de_entry = (struct nova_dentry *)nova_get_block(sb, new_block);
+
+	length = nova_init_dentry(sb, de_entry, self_ino, parent_ino, epoch_id);
+
+	nova_update_tail(pi, new_block + length);
+
+	nova_memlock_inode(sb, pi);
+
+	if (metadata_csum == 0)
+		return 0;
+
+	allocated = nova_allocate_inode_log_pages(sb, &sih, 1, &new_block,
+							ANY_CPU, 1);
+	if (allocated != 1) {
+		nova_err(sb, "ERROR: no inode log page available\n");
+		return -ENOMEM;
+	}
+	nova_memunlock_inode(sb, pi);
+	pi->alter_log_tail = pi->alter_log_head = new_block;
+
+	de_entry = (struct nova_dentry *)nova_get_block(sb, new_block);
+
+	length = nova_init_dentry(sb, de_entry, self_ino, parent_ino, epoch_id);
+
+	nova_update_alter_tail(pi, new_block + length);
+	nova_update_alter_pages(sb, pi, pi->log_head,
+						pi->alter_log_head);
+	nova_update_inode_checksum(pi);
+	nova_flush_buffer(pi, sizeof(struct nova_inode), 0);
+	nova_memlock_inode(sb, pi);
+
+	/* Get alternate inode address */
+	ret = nova_get_alter_inode_address(sb, self_ino, &alter_pi_addr);
+	if (ret)
+		return ret;
+
+	alter_pi = (struct nova_inode *)nova_get_block(sb, alter_pi_addr);
+	if (!alter_pi)
+		return -EINVAL;
+
+	nova_memunlock_inode(sb, alter_pi);
+	memcpy_to_pmem_nocache(alter_pi, pi, sizeof(struct nova_inode));
+	nova_memlock_inode(sb, alter_pi);
+
+	return 0;
+}
+
+/* adds a directory entry pointing to the inode. assumes the inode has
+ * already been logged for consistency
+ */
+int nova_add_dentry(struct dentry *dentry, u64 ino, int inc_link,
+	struct nova_inode_update *update, u64 epoch_id)
+{
+	struct inode *dir = dentry->d_parent->d_inode;
+	struct super_block *sb = dir->i_sb;
+	struct nova_inode_info *si = NOVA_I(dir);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *pidir;
+	const char *name = dentry->d_name.name;
+	int namelen = dentry->d_name.len;
+	struct nova_dentry *direntry;
+	unsigned short loglen;
+	int ret;
+	u64 curr_entry;
+	timing_t add_dentry_time;
+
+	nova_dbg_verbose("%s: dir %lu new inode %llu\n",
+				__func__, dir->i_ino, ino);
+	nova_dbg_verbose("%s: %s %d\n", __func__, name, namelen);
+	NOVA_START_TIMING(add_dentry_t, add_dentry_time);
+	if (namelen == 0)
+		return -EINVAL;
+
+	pidir = nova_get_inode(sb, dir);
+
+	/*
+	 * XXX shouldn't update any times until successful
+	 * completion of syscall, but too many callers depend
+	 * on this.
+	 */
+	dir->i_mtime = dir->i_ctime = current_time(dir);
+
+	loglen = NOVA_DIR_LOG_REC_LEN(namelen);
+	ret = nova_append_dentry(sb, pidir, dir, dentry,
+				ino, loglen, update,
+				inc_link, epoch_id);
+
+	if (ret) {
+		nova_dbg("%s: append dir entry failure\n", __func__);
+		return ret;
+	}
+
+	curr_entry = update->curr_entry;
+	direntry = (struct nova_dentry *)nova_get_block(sb, curr_entry);
+	sih->last_dentry = curr_entry;
+	ret = nova_insert_dir_radix_tree(sb, sih, name, namelen, direntry);
+
+	sih->trans_id++;
+	NOVA_END_TIMING(add_dentry_t, add_dentry_time);
+	return ret;
+}
+
+static int nova_can_inplace_update_dentry(struct super_block *sb,
+	struct nova_dentry *dentry, u64 epoch_id)
+{
+	struct nova_dentry *dentryc, entry_copy;
+
+	if (metadata_csum == 0)
+		dentryc = dentry;
+	else {
+		dentryc = &entry_copy;
+		if (!nova_verify_entry_csum(sb, dentry, dentryc))
+			return 0;
+	}
+
+	if (dentry && dentryc->epoch_id == epoch_id)
+		return 1;
+
+	return 0;
+}
+
+static int nova_inplace_update_dentry(struct super_block *sb,
+	struct inode *dir, struct nova_dentry *dentry, int link_change,
+	u64 epoch_id)
+{
+	struct nova_inode_info *si = NOVA_I(dir);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_log_entry_info entry_info;
+
+	entry_info.type = DIR_LOG;
+	entry_info.link_change = link_change;
+	entry_info.epoch_id = epoch_id;
+	entry_info.trans_id = sih->trans_id;
+	entry_info.inplace = 1;
+
+	return nova_inplace_update_log_entry(sb, dir, dentry,
+					&entry_info);
+}
+
+/* removes a directory entry pointing to the inode. assumes the inode has
+ * already been logged for consistency
+ */
+int nova_remove_dentry(struct dentry *dentry, int dec_link,
+	struct nova_inode_update *update, u64 epoch_id)
+{
+	struct inode *dir = dentry->d_parent->d_inode;
+	struct super_block *sb = dir->i_sb;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info *si = NOVA_I(dir);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *pidir;
+	struct qstr *entry = &dentry->d_name;
+	struct nova_dentry *old_dentry = NULL;
+	unsigned short loglen;
+	int ret;
+	u64 curr_entry;
+	timing_t remove_dentry_time;
+
+	NOVA_START_TIMING(remove_dentry_t, remove_dentry_time);
+
+	update->create_dentry = NULL;
+	update->delete_dentry = NULL;
+
+	if (!dentry->d_name.len) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = nova_remove_dir_radix_tree(sb, sih, entry->name, entry->len, 0,
+					&old_dentry);
+
+	if (ret)
+		goto out;
+
+	pidir = nova_get_inode(sb, dir);
+
+	dir->i_mtime = dir->i_ctime = current_time(dir);
+
+	if (nova_can_inplace_update_dentry(sb, old_dentry, epoch_id)) {
+		nova_inplace_update_dentry(sb, dir, old_dentry,
+						dec_link, epoch_id);
+		curr_entry = nova_get_addr_off(sbi, old_dentry);
+
+		sih->last_dentry = curr_entry;
+		/* Leave create/delete_dentry to NULL
+		 * Do not change tail/alter_tail if used as input
+		 */
+		if (update->tail == 0) {
+			update->tail = sih->log_tail;
+			update->alter_tail = sih->alter_log_tail;
+		}
+		sih->trans_id++;
+		goto out;
+	}
+
+	loglen = NOVA_DIR_LOG_REC_LEN(entry->len);
+	ret = nova_append_dentry(sb, pidir, dir, dentry,
+				0, loglen, update,
+				dec_link, epoch_id);
+
+	if (ret) {
+		nova_dbg("%s: append dir entry failure\n", __func__);
+		goto out;
+	}
+
+	update->create_dentry = old_dentry;
+	curr_entry = update->curr_entry;
+	update->delete_dentry = (struct nova_dentry *)nova_get_block(sb,
+						curr_entry);
+	sih->last_dentry = curr_entry;
+	sih->trans_id++;
+out:
+	NOVA_END_TIMING(remove_dentry_t, remove_dentry_time);
+	return ret;
+}
+
+/* Create dentry and delete dentry must be invalidated together */
+int nova_invalidate_dentries(struct super_block *sb,
+	struct nova_inode_update *update)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_dentry *create_dentry;
+	struct nova_dentry *create_dentryc, entry_copy;
+	struct nova_dentry *delete_dentry;
+	u64 create_curr, delete_curr;
+	int ret;
+
+	create_dentry = update->create_dentry;
+	delete_dentry = update->delete_dentry;
+
+	if (!create_dentry)
+		return 0;
+
+	nova_reassign_logentry(sb, create_dentry, DIR_LOG);
+
+	if (metadata_csum == 0)
+		create_dentryc = create_dentry;
+	else {
+		create_dentryc = &entry_copy;
+		if (!nova_verify_entry_csum(sb, create_dentry, create_dentryc))
+			return 0;
+	}
+
+	if (!old_entry_freeable(sb, create_dentryc->epoch_id))
+		return 0;
+
+	create_curr = nova_get_addr_off(sbi, create_dentry);
+	delete_curr = nova_get_addr_off(sbi, delete_dentry);
+
+	nova_invalidate_logentry(sb, create_dentry, DIR_LOG, 0);
+
+	ret = nova_invalidate_logentry(sb, delete_dentry, DIR_LOG, 0);
+
+	return ret;
+}
+
+static int nova_readdir_slow(struct file *file, struct dir_context *ctx)
+{
+	struct inode *inode = file_inode(file);
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pidir;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *child_pi;
+	struct nova_dentry *entry;
+	struct nova_dentry *entryc, entry_copy;
+	struct nova_dentry *entries[FREE_BATCH];
+	int nr_entries;
+	u64 pi_addr;
+	unsigned long pos = 0;
+	ino_t ino;
+	int i;
+	int ret;
+	timing_t readdir_time;
+
+	NOVA_START_TIMING(readdir_t, readdir_time);
+	pidir = nova_get_inode(sb, inode);
+	nova_dbgv("%s: ino %llu, size %llu, pos %llu\n",
+			__func__, (u64)inode->i_ino,
+			pidir->i_size, ctx->pos);
+
+	if (!sih) {
+		nova_dbg("%s: inode %lu sih does not exist!\n",
+				__func__, inode->i_ino);
+		ctx->pos = READDIR_END;
+		return 0;
+	}
+
+	pos = ctx->pos;
+	if (pos == READDIR_END)
+		goto out;
+
+	entryc = (metadata_csum == 0) ? entry : &entry_copy;
+
+	do {
+		nr_entries = radix_tree_gang_lookup(&sih->tree,
+					(void **)entries, pos, FREE_BATCH);
+		for (i = 0; i < nr_entries; i++) {
+			entry = entries[i];
+
+			if (metadata_csum == 0)
+				entryc = entry;
+			else if (!nova_verify_entry_csum(sb, entry, entryc))
+				return -EIO;
+
+			pos = BKDRHash(entryc->name, entryc->name_len);
+			ino = __le64_to_cpu(entryc->ino);
+			if (ino == 0)
+				continue;
+
+			ret = nova_get_inode_address(sb, ino, 0, &pi_addr,
+						     0, 0);
+
+			if (ret) {
+				nova_dbg("%s: get child inode %lu address failed %d\n",
+					 __func__, ino, ret);
+				ctx->pos = READDIR_END;
+				return ret;
+			}
+
+			child_pi = nova_get_block(sb, pi_addr);
+			nova_dbgv("ctx: ino %llu, name %s, name_len %u, de_len %u, csum 0x%x\n",
+				(u64)ino, entry->name, entry->name_len,
+				entry->de_len, entry->csum);
+			if (!dir_emit(ctx, entryc->name, entryc->name_len,
+				ino, IF2DT(le16_to_cpu(child_pi->i_mode)))) {
+				nova_dbgv("Here: pos %llu\n", ctx->pos);
+				return 0;
+			}
+			ctx->pos = pos + 1;
+		}
+		pos++;
+	} while (nr_entries == FREE_BATCH);
+
+out:
+	NOVA_END_TIMING(readdir_t, readdir_time);
+	return 0;
+}
+
+static u64 nova_find_next_dentry_addr(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 pos)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_file_write_entry *entry = NULL;
+	struct nova_file_write_entry *entries[1];
+	int nr_entries;
+	u64 addr = 0;
+
+	nr_entries = radix_tree_gang_lookup(&sih->tree,
+					(void **)entries, pos, 1);
+	if (nr_entries == 1) {
+		entry = entries[0];
+		addr = nova_get_addr_off(sbi, entry);
+	}
+
+	return addr;
+}
+
+static int nova_readdir_fast(struct file *file, struct dir_context *ctx)
+{
+	struct inode *inode = file_inode(file);
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pidir;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *child_pi;
+	struct nova_inode *prev_child_pi = NULL;
+	struct nova_dentry *entry = NULL;
+	struct nova_dentry *entryc, entry_copy;
+	struct nova_dentry *prev_entry = NULL;
+	struct nova_dentry *prev_entryc, prev_entry_copy;
+	unsigned short de_len;
+	u64 pi_addr;
+	unsigned long pos = 0;
+	ino_t ino;
+	void *addr;
+	u64 curr_p;
+	u8 type;
+	int ret;
+	timing_t readdir_time;
+
+	NOVA_START_TIMING(readdir_t, readdir_time);
+	pidir = nova_get_inode(sb, inode);
+	nova_dbgv("%s: ino %llu, size %llu, pos 0x%llx\n",
+			__func__, (u64)inode->i_ino,
+			pidir->i_size, ctx->pos);
+
+	if (sih->log_head == 0) {
+		nova_err(sb, "Dir %lu log is NULL!\n", inode->i_ino);
+		BUG();
+		return -EINVAL;
+	}
+
+	pos = ctx->pos;
+
+	if (pos == 0)
+		curr_p = sih->log_head;
+	else if (pos == READDIR_END)
+		goto out;
+	else {
+		curr_p = nova_find_next_dentry_addr(sb, sih, pos);
+		if (curr_p == 0)
+			goto out;
+	}
+
+	entryc = (metadata_csum == 0) ? entry : &entry_copy;
+	prev_entryc = (metadata_csum == 0) ? prev_entry : &prev_entry_copy;
+
+	while (curr_p != sih->log_tail) {
+		if (goto_next_page(sb, curr_p))
+			curr_p = next_log_page(sb, curr_p);
+
+
+		if (curr_p == 0) {
+			nova_err(sb, "Dir %lu log is NULL!\n", inode->i_ino);
+			BUG();
+			return -EINVAL;
+		}
+
+		addr = (void *)nova_get_block(sb, curr_p);
+		type = nova_get_entry_type(addr);
+		switch (type) {
+		case SET_ATTR:
+			curr_p += sizeof(struct nova_setattr_logentry);
+			continue;
+		case LINK_CHANGE:
+			curr_p += sizeof(struct nova_link_change_entry);
+			continue;
+		case DIR_LOG:
+			break;
+		default:
+			nova_dbg("%s: unknown type %d, 0x%llx\n",
+				 __func__, type, curr_p);
+			BUG();
+			return -EINVAL;
+		}
+
+		entry = (struct nova_dentry *)nova_get_block(sb, curr_p);
+		nova_dbgv("curr_p: 0x%llx, type %d, ino %llu, name %s, namelen %u, rec len %u\n",
+			  curr_p, entry->entry_type, le64_to_cpu(entry->ino),
+			  entry->name, entry->name_len,
+			  le16_to_cpu(entry->de_len));
+
+		if (metadata_csum == 0)
+			entryc = entry;
+		else if (!nova_verify_entry_csum(sb, entry, entryc))
+			return -EIO;
+
+		de_len = le16_to_cpu(entryc->de_len);
+		if (entryc->ino > 0 && entryc->invalid == 0
+					&& entryc->reassigned == 0) {
+			ino = __le64_to_cpu(entryc->ino);
+			pos = BKDRHash(entryc->name, entryc->name_len);
+
+			ret = nova_get_inode_address(sb, ino, 0,
+						     &pi_addr, 0, 0);
+			if (ret) {
+				nova_dbg("%s: get child inode %lu address failed %d\n",
+					 __func__, ino, ret);
+				ctx->pos = READDIR_END;
+				return ret;
+			}
+
+			child_pi = nova_get_block(sb, pi_addr);
+			nova_dbgv("ctx: ino %llu, name %s, name_len %u, de_len %u\n",
+				(u64)ino, entry->name, entry->name_len,
+				entry->de_len);
+			if (prev_entry && !dir_emit(ctx, prev_entryc->name,
+				prev_entryc->name_len, le64_to_cpu(prev_entryc->ino),
+				IF2DT(le16_to_cpu(prev_child_pi->i_mode)))) {
+				nova_dbgv("Here: pos %llu\n", ctx->pos);
+				return 0;
+			}
+			prev_entry = entry;
+
+			if (metadata_csum == 0)
+				prev_entryc = prev_entry;
+			else
+				memcpy(prev_entryc, entryc,
+						sizeof(struct nova_dentry));
+
+			prev_child_pi = child_pi;
+		}
+		ctx->pos = pos;
+		curr_p += de_len;
+	}
+
+	if (prev_entry && !dir_emit(ctx, prev_entryc->name,
+			prev_entryc->name_len, ino,
+			IF2DT(le16_to_cpu(prev_child_pi->i_mode))))
+		return 0;
+
+	ctx->pos = READDIR_END;
+out:
+	NOVA_END_TIMING(readdir_t, readdir_time);
+	nova_dbgv("%s return\n", __func__);
+	return 0;
+}
+
+static int nova_readdir(struct file *file, struct dir_context *ctx)
+{
+	struct inode *inode = file_inode(file);
+	struct super_block *sb = inode->i_sb;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (sbi->mount_snapshot == 0)
+		return nova_readdir_fast(file, ctx);
+	else
+		return nova_readdir_slow(file, ctx);
+}
+
+const struct file_operations nova_dir_operations = {
+	.llseek		= generic_file_llseek,
+	.read		= generic_read_dir,
+	.iterate	= nova_readdir,
+	.fsync		= noop_fsync,
+	.unlocked_ioctl = nova_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= nova_compat_ioctl,
+#endif
+};
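
A note on the readdir position used above: nova_readdir_slow() stores
BKDRHash(name) + 1 in ctx->pos and resumes with radix_tree_gang_lookup()
from that key, so the position is a name hash rather than a byte offset.
The sketch below illustrates the idea in userspace C. BKDRHash() is not
defined in this file; the multiplier 131 is the conventional BKDR seed and
is an assumption here, and bkdr_hash() and next_key() are hypothetical
names standing in for the hash and for the radix tree lookup.

```c
#include <assert.h>
#include <stddef.h>

/* BKDR string hash; the seed 131 is assumed, not taken from this patch. */
static unsigned long bkdr_hash(const char *name, size_t len)
{
	unsigned long hash = 0;
	size_t i;

	for (i = 0; i < len; i++)
		hash = hash * 131 + (unsigned char)name[i];

	return hash;
}

/*
 * Hypothetical stand-in for a gang lookup of one entry: return the index
 * of the first key >= pos in a sorted array, or n if there is none.
 */
static size_t next_key(const unsigned long *keys, size_t n, unsigned long pos)
{
	size_t i;

	assert(keys != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (keys[i] >= pos)
			break;

	return i;
}
```

After emitting the entry with hash h, saving h + 1 as the position makes
the next lookup skip that entry and land on the next larger key.
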
diff --git a/fs/nova/file.c b/fs/nova/file.c
new file mode 100644
index 000000000000..51b2114796df
--- /dev/null
+++ b/fs/nova/file.c
@@ -0,0 +1,943 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * File operations for files.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/slab.h>
+#include <linux/uio.h>
+#include <linux/uaccess.h>
+#include <linux/falloc.h>
+#include <asm/mman.h>
+#include "nova.h"
+#include "inode.h"
+
+
+static inline int nova_can_set_blocksize_hint(struct inode *inode,
+	struct nova_inode *pi, loff_t new_size)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+
+	/* Currently, we don't deallocate data blocks till the file is deleted.
+	 * So no changing blocksize hints once allocation is done.
+	 */
+	if (sih->i_size > 0)
+		return 0;
+	return 1;
+}
+
+int nova_set_blocksize_hint(struct super_block *sb, struct inode *inode,
+	struct nova_inode *pi, loff_t new_size)
+{
+	unsigned short block_type;
+
+	if (!nova_can_set_blocksize_hint(inode, pi, new_size))
+		return 0;
+
+	if (new_size >= 0x40000000) {   /* 1G */
+		block_type = NOVA_BLOCK_TYPE_1G;
+		goto hint_set;
+	}
+
+	if (new_size >= 0x200000) {     /* 2M */
+		block_type = NOVA_BLOCK_TYPE_2M;
+		goto hint_set;
+	}
+
+	/* defaulting to 4K */
+	block_type = NOVA_BLOCK_TYPE_4K;
+
+hint_set:
+	nova_dbg_verbose(
+		"Hint: new_size 0x%llx, i_size 0x%llx\n",
+		new_size, pi->i_size);
+	nova_dbg_verbose("Setting the hint to 0x%x\n", block_type);
+	nova_memunlock_inode(sb, pi);
+	pi->i_blk_type = block_type;
+	nova_memlock_inode(sb, pi);
+	return 0;
+}
+
+static loff_t nova_llseek(struct file *file, loff_t offset, int origin)
+{
+	struct inode *inode = file->f_path.dentry->d_inode;
+	int retval;
+
+	if (origin != SEEK_DATA && origin != SEEK_HOLE)
+		return generic_file_llseek(file, offset, origin);
+
+	inode_lock(inode);
+	switch (origin) {
+	case SEEK_DATA:
+		retval = nova_find_region(inode, &offset, 0);
+		if (retval) {
+			inode_unlock(inode);
+			return retval;
+		}
+		break;
+	case SEEK_HOLE:
+		retval = nova_find_region(inode, &offset, 1);
+		if (retval) {
+			inode_unlock(inode);
+			return retval;
+		}
+		break;
+	}
+
+	if ((offset < 0 && !(file->f_mode & FMODE_UNSIGNED_OFFSET)) ||
+	    offset > inode->i_sb->s_maxbytes) {
+		inode_unlock(inode);
+		return -ENXIO;
+	}
+
+	if (offset != file->f_pos) {
+		file->f_pos = offset;
+		file->f_version = 0;
+	}
+
+	inode_unlock(inode);
+	return offset;
+}
+
+/* This function is called by both msync() and fsync().
+ * TODO: Check if we can avoid calling nova_flush_buffer() for fsync. We use
+ * movnti to write data to files, so we may want to avoid doing unnecessary
+ * nova_flush_buffer() on fsync()
+ */
+static int nova_fsync(struct file *file, loff_t start, loff_t end, int datasync)
+{
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = file->f_path.dentry->d_inode;
+	struct super_block *sb = inode->i_sb;
+	unsigned long start_pgoff, end_pgoff;
+	int ret = 0;
+	timing_t fsync_time;
+
+	NOVA_START_TIMING(fsync_t, fsync_time);
+
+	if (datasync)
+		NOVA_STATS_ADD(fdatasync, 1);
+
+	/* No need to flush if the file is not mmapped */
+	if (!mapping_mapped(mapping))
+		goto persist;
+
+	start_pgoff = start >> PAGE_SHIFT;
+	end_pgoff = (end + 1) >> PAGE_SHIFT;
+	nova_dbgv("%s: msync pgoff range %lu to %lu\n",
+			__func__, start_pgoff, end_pgoff);
+
+	/*
+	 * Set csum and parity.
+	 * We do not protect data integrity during mmap, but we have to
+	 * update csum here since msync clears dirty bit.
+	 */
+	nova_reset_mapping_csum_parity(sb, inode, mapping,
+					start_pgoff, end_pgoff);
+
+	ret = generic_file_fsync(file, start, end, datasync);
+
+persist:
+	PERSISTENT_BARRIER();
+	NOVA_END_TIMING(fsync_t, fsync_time);
+
+	return ret;
+}
+
+/* This callback is called when a file is closed */
+static int nova_flush(struct file *file, fl_owner_t id)
+{
+	PERSISTENT_BARRIER();
+	return 0;
+}
+
+static int nova_open(struct inode *inode, struct file *filp)
+{
+	return generic_file_open(inode, filp);
+}
+
+static long nova_fallocate(struct file *file, int mode, loff_t offset,
+	loff_t len)
+{
+	struct inode *inode = file->f_path.dentry->d_inode;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *pi;
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	struct nova_file_write_entry entry_data;
+	struct nova_inode_update update;
+	unsigned long start_blk, num_blocks, ent_blks = 0;
+	unsigned long total_blocks = 0;
+	unsigned long blocknr = 0;
+	unsigned long blockoff;
+	unsigned int data_bits;
+	loff_t new_size;
+	long ret = 0;
+	int inplace = 0;
+	int blocksize_mask;
+	int allocated = 0;
+	bool update_log = false;
+	timing_t fallocate_time;
+	u64 begin_tail = 0;
+	u64 epoch_id;
+	u32 time;
+
+	/*
+	 * Fallocate does not make much sense for CoW,
+	 * but we still support it for DAX-mmap purposes.
+	 */
+
+	/* FALLOC_FL_KEEP_SIZE is the only mode flag we support */
+	if (mode & ~FALLOC_FL_KEEP_SIZE)
+		return -EOPNOTSUPP;
+
+	if (S_ISDIR(inode->i_mode))
+		return -ENODEV;
+
+	new_size = len + offset;
+	if (!(mode & FALLOC_FL_KEEP_SIZE) && new_size > inode->i_size) {
+		ret = inode_newsize_ok(inode, new_size);
+		if (ret)
+			return ret;
+	} else {
+		new_size = inode->i_size;
+	}
+
+	nova_dbgv("%s: inode %lu, offset %lld, count %lld, mode 0x%x\n",
+			__func__, inode->i_ino,	offset, len, mode);
+
+	NOVA_START_TIMING(fallocate_t, fallocate_time);
+	inode_lock(inode);
+
+	pi = nova_get_inode(sb, inode);
+	if (!pi) {
+		ret = -EACCES;
+		goto out;
+	}
+
+	inode->i_mtime = inode->i_ctime = current_time(inode);
+	time = current_time(inode).tv_sec;
+
+	blocksize_mask = sb->s_blocksize - 1;
+	start_blk = offset >> sb->s_blocksize_bits;
+	blockoff = offset & blocksize_mask;
+	num_blocks = (blockoff + len + blocksize_mask) >> sb->s_blocksize_bits;
+
+	epoch_id = nova_get_epoch_id(sb);
+	update.tail = sih->log_tail;
+	update.alter_tail = sih->alter_log_tail;
+	while (num_blocks > 0) {
+		ent_blks = nova_check_existing_entry(sb, inode, num_blocks,
+						start_blk, &entry, &entry_copy,
+						1, epoch_id, &inplace, 1);
+
+		entryc = (metadata_csum == 0) ? entry : &entry_copy;
+
+		if (entry && inplace) {
+			if (entryc->size < new_size) {
+				/* Update existing entry */
+				nova_memunlock_range(sb, entry, CACHELINE_SIZE);
+				entry->size = new_size;
+				nova_update_entry_csum(entry);
+				nova_update_alter_entry(sb, entry);
+				nova_memlock_range(sb, entry, CACHELINE_SIZE);
+			}
+			allocated = ent_blks;
+			goto next;
+		}
+
+		/* Allocate zeroed blocks to fill hole */
+		allocated = nova_new_data_blocks(sb, sih, &blocknr, start_blk,
+				 ent_blks, ALLOC_INIT_ZERO, ANY_CPU,
+				 ALLOC_FROM_HEAD);
+		nova_dbgv("%s: alloc %d blocks @ %lu\n", __func__,
+						allocated, blocknr);
+
+		if (allocated <= 0) {
+			nova_dbg("%s alloc %lu blocks failed!, %d\n",
+						__func__, ent_blks, allocated);
+			ret = allocated;
+			goto out;
+		}
+
+		/* Handle hole fill write */
+		nova_init_file_write_entry(sb, sih, &entry_data, epoch_id,
+					start_blk, allocated, blocknr,
+					time, new_size);
+
+		ret = nova_append_file_write_entry(sb, pi, inode,
+					&entry_data, &update);
+		if (ret) {
+			nova_dbg("%s: append inode entry failed\n", __func__);
+			ret = -ENOSPC;
+			goto out;
+		}
+
+		entry = nova_get_block(sb, update.curr_entry);
+		nova_reset_csum_parity_range(sb, sih, entry, start_blk,
+					start_blk + allocated, 1, 0);
+
+		update_log = true;
+		if (begin_tail == 0)
+			begin_tail = update.curr_entry;
+
+		total_blocks += allocated;
+next:
+		num_blocks -= allocated;
+		start_blk += allocated;
+	}
+
+	data_bits = blk_type_to_shift[sih->i_blk_type];
+	sih->i_blocks += (total_blocks << (data_bits - sb->s_blocksize_bits));
+
+	inode->i_blocks = sih->i_blocks;
+
+	if (update_log) {
+		sih->log_tail = update.tail;
+		sih->alter_log_tail = update.alter_tail;
+
+		nova_memunlock_inode(sb, pi);
+		nova_update_tail(pi, update.tail);
+		if (metadata_csum)
+			nova_update_alter_tail(pi, update.alter_tail);
+		nova_memlock_inode(sb, pi);
+
+		/* Update file tree */
+		ret = nova_reassign_file_tree(sb, sih, begin_tail);
+		if (ret)
+			goto out;
+
+	}
+
+	nova_dbgv("blocks: %lu, %lu\n", inode->i_blocks, sih->i_blocks);
+
+	if (ret || (mode & FALLOC_FL_KEEP_SIZE)) {
+		nova_memunlock_inode(sb, pi);
+		pi->i_flags |= cpu_to_le32(NOVA_EOFBLOCKS_FL);
+		nova_memlock_inode(sb, pi);
+	}
+
+	if (!(mode & FALLOC_FL_KEEP_SIZE) && new_size > inode->i_size) {
+		inode->i_size = new_size;
+		sih->i_size = new_size;
+	}
+
+	nova_memunlock_inode(sb, pi);
+	nova_update_inode_checksum(pi);
+	nova_update_alter_inode(sb, inode, pi);
+	nova_memlock_inode(sb, pi);
+
+	sih->trans_id++;
+out:
+	if (ret < 0)
+		nova_cleanup_incomplete_write(sb, sih, blocknr, allocated,
+						begin_tail, update.tail);
+
+	inode_unlock(inode);
+	NOVA_END_TIMING(fallocate_t, fallocate_time);
+	return ret;
+}
+
+static int nova_iomap_begin_nolock(struct inode *inode, loff_t offset,
+	loff_t length, unsigned int flags, struct iomap *iomap)
+{
+	return nova_iomap_begin(inode, offset, length, flags, iomap, false);
+}
+
+static struct iomap_ops nova_iomap_ops_nolock = {
+	.iomap_begin	= nova_iomap_begin_nolock,
+	.iomap_end	= nova_iomap_end,
+};
+
+static ssize_t nova_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct inode *inode = iocb->ki_filp->f_mapping->host;
+	ssize_t ret;
+	timing_t read_iter_time;
+
+	if (!iov_iter_count(to))
+		return 0;
+
+	NOVA_START_TIMING(read_iter_t, read_iter_time);
+	inode_lock_shared(inode);
+	ret = dax_iomap_rw(iocb, to, &nova_iomap_ops_nolock);
+	inode_unlock_shared(inode);
+
+	file_accessed(iocb->ki_filp);
+	NOVA_END_TIMING(read_iter_t, read_iter_time);
+	return ret;
+}
+
+static int nova_update_iter_csum_parity(struct super_block *sb,
+	struct inode *inode, loff_t offset, size_t count)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	unsigned long start_pgoff, end_pgoff;
+	loff_t end;
+
+	if (data_csum == 0 && data_parity == 0)
+		return 0;
+
+	end = offset + count;
+
+	start_pgoff = offset >> sb->s_blocksize_bits;
+	end_pgoff = end >> sb->s_blocksize_bits;
+	if (end & (nova_inode_blk_size(sih) - 1))
+		end_pgoff++;
+
+	nova_reset_csum_parity_range(sb, sih, NULL, start_pgoff,
+			end_pgoff, 0, 0);
+
+	return 0;
+}
+
+static ssize_t nova_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file->f_mapping->host;
+	struct super_block *sb = inode->i_sb;
+	loff_t offset;
+	size_t count;
+	ssize_t ret;
+	timing_t write_iter_time;
+
+	NOVA_START_TIMING(write_iter_t, write_iter_time);
+	inode_lock(inode);
+	ret = generic_write_checks(iocb, from);
+	if (ret <= 0)
+		goto out_unlock;
+
+	ret = file_remove_privs(file);
+	if (ret)
+		goto out_unlock;
+
+	ret = file_update_time(file);
+	if (ret)
+		goto out_unlock;
+
+	count = iov_iter_count(from);
+	offset = iocb->ki_pos;
+
+	ret = dax_iomap_rw(iocb, from, &nova_iomap_ops_nolock);
+	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
+		i_size_write(inode, iocb->ki_pos);
+		mark_inode_dirty(inode);
+	}
+
+	nova_update_iter_csum_parity(sb, inode, offset, count);
+
+out_unlock:
+	inode_unlock(inode);
+	if (ret > 0)
+		ret = generic_write_sync(iocb, ret);
+	NOVA_END_TIMING(write_iter_t, write_iter_time);
+	return ret;
+}
+
+static ssize_t
+do_dax_mapping_read(struct file *filp, char __user *buf,
+	size_t len, loff_t *ppos)
+{
+	struct inode *inode = filp->f_mapping->host;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	pgoff_t index, end_index;
+	unsigned long offset;
+	loff_t isize, pos;
+	size_t copied = 0, error = 0;
+	timing_t memcpy_time;
+
+	pos = *ppos;
+	index = pos >> PAGE_SHIFT;
+	offset = pos & ~PAGE_MASK;
+
+	if (!access_ok(VERIFY_WRITE, buf, len)) {
+		error = -EFAULT;
+		goto out;
+	}
+
+	isize = i_size_read(inode);
+	if (!isize)
+		goto out;
+
+	nova_dbgv("%s: inode %lu, offset %lld, count %lu, size %lld\n",
+		__func__, inode->i_ino,	pos, len, isize);
+
+	if (len > isize - pos)
+		len = isize - pos;
+
+	if (len <= 0)
+		goto out;
+
+	entryc = (metadata_csum == 0) ? entry : &entry_copy;
+
+	end_index = (isize - 1) >> PAGE_SHIFT;
+	do {
+		unsigned long nr, left;
+		unsigned long nvmm;
+		void *dax_mem = NULL;
+		int zero = 0;
+
+		/* nr is the maximum number of bytes to copy from this page */
+		if (index >= end_index) {
+			if (index > end_index)
+				goto out;
+			nr = ((isize - 1) & ~PAGE_MASK) + 1;
+			if (nr <= offset)
+				goto out;
+		}
+
+		entry = nova_get_write_entry(sb, sih, index);
+		if (unlikely(entry == NULL)) {
+			nova_dbgv("Required extent not found: pgoff %lu, inode size %lld\n",
+				index, isize);
+			nr = PAGE_SIZE;
+			zero = 1;
+			goto memcpy;
+		}
+
+		if (metadata_csum == 0)
+			entryc = entry;
+		else if (!nova_verify_entry_csum(sb, entry, entryc))
+			return -EIO;
+
+		/* Find contiguous blocks */
+		if (index < entryc->pgoff ||
+			index - entryc->pgoff >= entryc->num_pages) {
+			nova_err(sb, "%s ERROR: %lu, entry pgoff %llu, num %u, blocknr %llu\n",
+				__func__, index, entry->pgoff,
+				entry->num_pages, entry->block >> PAGE_SHIFT);
+			return -EINVAL;
+		}
+		if (entryc->reassigned == 0) {
+			nr = (entryc->num_pages - (index - entryc->pgoff))
+				* PAGE_SIZE;
+		} else {
+			nr = PAGE_SIZE;
+		}
+
+		nvmm = get_nvmm(sb, sih, entryc, index);
+		dax_mem = nova_get_block(sb, (nvmm << PAGE_SHIFT));
+
+memcpy:
+		nr = nr - offset;
+		if (nr > len - copied)
+			nr = len - copied;
+
+		if ((!zero) && (data_csum > 0)) {
+			if (nova_find_pgoff_in_vma(inode, index))
+				goto skip_verify;
+
+			if (!nova_verify_data_csum(sb, sih, nvmm, offset, nr)) {
+				nova_err(sb, "%s: nova data checksum and recovery fail! inode %lu, offset %lu, entry pgoff %lu, %u pages, pgoff %lu\n",
+					 __func__, inode->i_ino, offset,
+					 entry->pgoff, entry->num_pages, index);
+				error = -EIO;
+				goto out;
+			}
+		}
+skip_verify:
+		NOVA_START_TIMING(memcpy_r_nvmm_t, memcpy_time);
+
+		if (!zero)
+			left = __copy_to_user(buf + copied,
+						dax_mem + offset, nr);
+		else
+			left = __clear_user(buf + copied, nr);
+
+		NOVA_END_TIMING(memcpy_r_nvmm_t, memcpy_time);
+
+		if (left) {
+			nova_dbg("%s ERROR!: bytes %lu, left %lu\n",
+				__func__, nr, left);
+			error = -EFAULT;
+			goto out;
+		}
+
+		copied += (nr - left);
+		offset += (nr - left);
+		index += offset >> PAGE_SHIFT;
+		offset &= ~PAGE_MASK;
+	} while (copied < len);
+
+out:
+	*ppos = pos + copied;
+	if (filp)
+		file_accessed(filp);
+
+	NOVA_STATS_ADD(read_bytes, copied);
+
+	nova_dbgv("%s returned %zu\n", __func__, copied);
+	return copied ? copied : error;
+}
+
+/*
+ * Wrappers. We take the inode lock in shared mode to avoid racing with a
+ * concurrent truncate operation. Writes need no extra protection because
+ * they hold the inode lock exclusively.
+ */
+static ssize_t nova_dax_file_read(struct file *filp, char __user *buf,
+			    size_t len, loff_t *ppos)
+{
+	struct inode *inode = filp->f_mapping->host;
+	ssize_t res;
+	timing_t dax_read_time;
+
+	NOVA_START_TIMING(dax_read_t, dax_read_time);
+	inode_lock_shared(inode);
+	res = do_dax_mapping_read(filp, buf, len, ppos);
+	inode_unlock_shared(inode);
+	NOVA_END_TIMING(dax_read_t, dax_read_time);
+	return res;
+}
+
+static ssize_t nova_cow_file_write(struct file *filp,
+	const char __user *buf,	size_t len, loff_t *ppos)
+{
+	struct address_space *mapping = filp->f_mapping;
+	struct inode	*inode = mapping->host;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pi, inode_copy;
+	struct nova_file_write_entry entry_data;
+	struct nova_inode_update update;
+	ssize_t	    written = 0;
+	loff_t pos;
+	size_t count, offset, copied;
+	unsigned long start_blk, num_blocks;
+	unsigned long total_blocks;
+	unsigned long blocknr = 0;
+	unsigned int data_bits;
+	int allocated = 0;
+	void *kmem;
+	u64 file_size;
+	size_t bytes;
+	long status = 0;
+	timing_t cow_write_time, memcpy_time;
+	unsigned long step = 0;
+	ssize_t ret;
+	u64 begin_tail = 0;
+	int try_inplace = 0;
+	u64 epoch_id;
+	u32 time;
+
+
+	if (len == 0)
+		return 0;
+
+	NOVA_START_TIMING(cow_write_t, cow_write_time);
+
+	sb_start_write(inode->i_sb);
+	inode_lock(inode);
+
+	if (!access_ok(VERIFY_READ, buf, len)) {
+		ret = -EFAULT;
+		goto out;
+	}
+	pos = *ppos;
+
+	if (filp->f_flags & O_APPEND)
+		pos = i_size_read(inode);
+
+	count = len;
+
+	pi = nova_get_block(sb, sih->pi_addr);
+
+	/* nova_inode tail pointer will be updated and we make sure all other
+	 * inode fields are good before checksumming the whole structure
+	 */
+	if (nova_check_inode_integrity(sb, sih->ino, sih->pi_addr,
+			sih->alter_pi_addr, &inode_copy, 0) < 0) {
+		ret = -EIO;
+		goto out;
+	}
+
+	offset = pos & (sb->s_blocksize - 1);
+	num_blocks = ((count + offset - 1) >> sb->s_blocksize_bits) + 1;
+	total_blocks = num_blocks;
+	start_blk = pos >> sb->s_blocksize_bits;
+
+	if (nova_check_overlap_vmas(sb, sih, start_blk, num_blocks)) {
+		nova_dbgv("COW write overlaps with vma: inode %lu, pgoff %lu, %lu blocks\n",
+				inode->i_ino, start_blk, num_blocks);
+		NOVA_STATS_ADD(cow_overlap_mmap, 1);
+		try_inplace = 1;
+		ret = -EACCES;
+		goto out;
+	}
+
+	/* offset in the actual block size block */
+
+	ret = file_remove_privs(filp);
+	if (ret)
+		goto out;
+
+	inode->i_ctime = inode->i_mtime = current_time(inode);
+	time = current_time(inode).tv_sec;
+
+	nova_dbgv("%s: inode %lu, offset %lld, count %lu\n",
+			__func__, inode->i_ino,	pos, count);
+
+	epoch_id = nova_get_epoch_id(sb);
+	update.tail = sih->log_tail;
+	update.alter_tail = sih->alter_log_tail;
+	while (num_blocks > 0) {
+		offset = pos & (nova_inode_blk_size(sih) - 1);
+		start_blk = pos >> sb->s_blocksize_bits;
+
+		/* don't zero-out the allocated blocks */
+		allocated = nova_new_data_blocks(sb, sih, &blocknr, start_blk,
+				 num_blocks, ALLOC_NO_INIT, ANY_CPU,
+				 ALLOC_FROM_HEAD);
+
+		nova_dbg_verbose("%s: alloc %d blocks @ %lu\n", __func__,
+						allocated, blocknr);
+
+		if (allocated <= 0) {
+			nova_dbg("%s alloc blocks failed %d\n", __func__,
+								allocated);
+			ret = allocated;
+			goto out;
+		}
+
+		step++;
+		bytes = sb->s_blocksize * allocated - offset;
+		if (bytes > count)
+			bytes = count;
+
+		kmem = nova_get_block(inode->i_sb,
+			     nova_get_block_off(sb, blocknr, sih->i_blk_type));
+
+		if (offset || ((offset + bytes) & (PAGE_SIZE - 1)) != 0)  {
+			ret = nova_handle_head_tail_blocks(sb, inode, pos,
+							   bytes, kmem);
+			if (ret)
+				goto out;
+		}
+		/* Now copy from user buf */
+		//		nova_dbg("Write: %p\n", kmem);
+		NOVA_START_TIMING(memcpy_w_nvmm_t, memcpy_time);
+		nova_memunlock_range(sb, kmem + offset, bytes);
+		copied = bytes - memcpy_to_pmem_nocache(kmem + offset,
+						buf, bytes);
+		nova_memlock_range(sb, kmem + offset, bytes);
+		NOVA_END_TIMING(memcpy_w_nvmm_t, memcpy_time);
+
+		if (data_csum > 0 || data_parity > 0) {
+			ret = nova_protect_file_data(sb, inode, pos, bytes,
+							buf, blocknr, false);
+			if (ret)
+				goto out;
+		}
+
+		if (pos + copied > inode->i_size)
+			file_size = cpu_to_le64(pos + copied);
+		else
+			file_size = cpu_to_le64(inode->i_size);
+
+		nova_init_file_write_entry(sb, sih, &entry_data, epoch_id,
+					start_blk, allocated, blocknr, time,
+					file_size);
+
+		ret = nova_append_file_write_entry(sb, pi, inode,
+					&entry_data, &update);
+		if (ret) {
+			nova_dbg("%s: append inode entry failed\n", __func__);
+			ret = -ENOSPC;
+			goto out;
+		}
+
+		nova_dbgv("Write: %p, %lu\n", kmem, copied);
+		if (copied > 0) {
+			status = copied;
+			written += copied;
+			pos += copied;
+			buf += copied;
+			count -= copied;
+			num_blocks -= allocated;
+		}
+		if (unlikely(copied != bytes)) {
+			nova_dbg("%s ERROR!: %p, bytes %lu, copied %lu\n",
+				__func__, kmem, bytes, copied);
+			if (status >= 0)
+				status = -EFAULT;
+		}
+		if (status < 0)
+			break;
+
+		if (begin_tail == 0)
+			begin_tail = update.curr_entry;
+	}
+
+	data_bits = blk_type_to_shift[sih->i_blk_type];
+	sih->i_blocks += (total_blocks << (data_bits - sb->s_blocksize_bits));
+
+	nova_memunlock_inode(sb, pi);
+	nova_update_inode(sb, inode, pi, &update, 1);
+	nova_memlock_inode(sb, pi);
+
+	/* Free the overlap blocks after the write is committed */
+	ret = nova_reassign_file_tree(sb, sih, begin_tail);
+	if (ret)
+		goto out;
+
+	inode->i_blocks = sih->i_blocks;
+
+	ret = written;
+	NOVA_STATS_ADD(cow_write_breaks, step);
+	nova_dbgv("blocks: %lu, %lu\n", inode->i_blocks, sih->i_blocks);
+
+	*ppos = pos;
+	if (pos > inode->i_size) {
+		i_size_write(inode, pos);
+		sih->i_size = pos;
+	}
+
+	sih->trans_id++;
+out:
+	if (ret < 0)
+		nova_cleanup_incomplete_write(sb, sih, blocknr, allocated,
+						begin_tail, update.tail);
+
+	inode_unlock(inode);
+	sb_end_write(inode->i_sb);
+	NOVA_END_TIMING(cow_write_t, cow_write_time);
+	NOVA_STATS_ADD(cow_write_bytes, written);
+
+	if (try_inplace)
+		return nova_inplace_file_write(filp, buf, len, ppos);
+
+	return ret;
+}
+
+static ssize_t nova_dax_file_write(struct file *filp, const char __user *buf,
+	size_t len, loff_t *ppos)
+{
+	if (inplace_data_updates)
+		return nova_inplace_file_write(filp, buf, len, ppos);
+	else
+		return nova_cow_file_write(filp, buf, len, ppos);
+}
+
+static int nova_dax_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct inode *inode = file->f_mapping->host;
+
+	file_accessed(file);
+
+	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+
+	vma->vm_ops = &nova_dax_vm_ops;
+
+	nova_insert_write_vma(vma);
+
+	nova_dbg_mmap4k("[%s:%d] inode %lu, MMAP 4KPAGE vm_start(0x%lx), vm_end(0x%lx), vm pgoff %lu, %lu blocks, vm_flags(0x%lx), vm_page_prot(0x%lx)\n",
+			__func__, __LINE__,
+			inode->i_ino, vma->vm_start, vma->vm_end,
+			vma->vm_pgoff,
+			(vma->vm_end - vma->vm_start) >> PAGE_SHIFT,
+			vma->vm_flags,
+			pgprot_val(vma->vm_page_prot));
+
+	return 0;
+}
+
+const struct file_operations nova_dax_file_operations = {
+	.llseek			= nova_llseek,
+	.read			= nova_dax_file_read,
+	.write			= nova_dax_file_write,
+	.read_iter		= nova_dax_read_iter,
+	.write_iter		= nova_dax_write_iter,
+	.mmap			= nova_dax_file_mmap,
+	.open			= nova_open,
+	.fsync			= nova_fsync,
+	.flush			= nova_flush,
+	.unlocked_ioctl		= nova_ioctl,
+	.fallocate		= nova_fallocate,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl		= nova_compat_ioctl,
+#endif
+};
+
+
+static ssize_t nova_wrap_rw_iter(struct kiocb *iocb, struct iov_iter *iter)
+{
+	struct file *filp = iocb->ki_filp;
+	ssize_t ret = -EIO;
+	ssize_t written = 0;
+	unsigned long seg;
+	unsigned long nr_segs = iter->nr_segs;
+	const struct iovec *iv = iter->iov;
+
+	nova_dbgv("%s %s: %lu segs\n", __func__,
+			iov_iter_rw(iter) == READ ? "read" : "write",
+			nr_segs);
+	iv = iter->iov;
+	for (seg = 0; seg < nr_segs; seg++) {
+		if (iov_iter_rw(iter) == READ) {
+			ret = nova_dax_file_read(filp, iv->iov_base,
+					iv->iov_len, &iocb->ki_pos);
+		} else if (iov_iter_rw(iter) == WRITE) {
+			ret = nova_dax_file_write(filp, iv->iov_base,
+					iv->iov_len, &iocb->ki_pos);
+		}
+		if (ret < 0)
+			goto err;
+
+		if (iter->count > iv->iov_len)
+			iter->count -= iv->iov_len;
+		else
+			iter->count = 0;
+
+		written += ret;
+		iter->nr_segs--;
+		iv++;
+	}
+	ret = written;
+err:
+	return ret;
+}
+
+
+/* Wrap read/write_iter for DP, CoW and WP */
+const struct file_operations nova_wrap_file_operations = {
+	.llseek			= nova_llseek,
+	.read			= nova_dax_file_read,
+	.write			= nova_dax_file_write,
+	.read_iter		= nova_wrap_rw_iter,
+	.write_iter		= nova_wrap_rw_iter,
+	.mmap			= nova_dax_file_mmap,
+	.open			= nova_open,
+	.fsync			= nova_fsync,
+	.flush			= nova_flush,
+	.unlocked_ioctl		= nova_ioctl,
+	.fallocate		= nova_fallocate,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl		= nova_compat_ioctl,
+#endif
+};
+
+const struct inode_operations nova_file_inode_operations = {
+	.setattr	= nova_notify_change,
+	.getattr	= nova_getattr,
+	.get_acl	= NULL,
+};
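
The block arithmetic in nova_fallocate() and the size thresholds in
nova_set_blocksize_hint() can be checked in isolation. The following is a
minimal userspace sketch, not the code from this patch: block_span(),
pick_block_type() and the enum names are hypothetical, and block_bits
stands in for sb->s_blocksize_bits.

```c
#include <assert.h>
#include <stdint.h>

enum blk_hint { HINT_4K, HINT_2M, HINT_1G };	/* hypothetical names */

/* Same thresholds as nova_set_blocksize_hint(): 1G, then 2M, else 4K. */
static enum blk_hint pick_block_type(uint64_t new_size)
{
	if (new_size >= 0x40000000ULL)
		return HINT_1G;
	if (new_size >= 0x200000ULL)
		return HINT_2M;
	return HINT_4K;
}

/*
 * First block and number of blocks touched by [offset, offset + len),
 * following the start_blk/blockoff/num_blocks computation in
 * nova_fallocate().
 */
static void block_span(uint64_t offset, uint64_t len, unsigned int block_bits,
		       uint64_t *start_blk, uint64_t *num_blocks)
{
	uint64_t mask, blockoff;

	assert(block_bits < 64);
	mask = (1ULL << block_bits) - 1;
	blockoff = offset & mask;

	*start_blk = offset >> block_bits;
	*num_blocks = (blockoff + len + mask) >> block_bits;
}
```

With 4KB blocks, a 2-byte range starting at offset 4095 straddles a block
boundary and yields two blocks. For len > 0 this agrees with the
((count + offset - 1) >> bits) + 1 form used in nova_cow_file_write().
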
diff --git a/fs/nova/namei.c b/fs/nova/namei.c
new file mode 100644
index 000000000000..59776338008d
--- /dev/null
+++ b/fs/nova/namei.c
@@ -0,0 +1,919 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Inode operations for directories.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include "nova.h"
+#include "journal.h"
+#include "inode.h"
+
+static ino_t nova_inode_by_name(struct inode *dir, struct qstr *entry,
+				 struct nova_dentry **res_entry)
+{
+	struct super_block *sb = dir->i_sb;
+	struct nova_dentry *direntry;
+	struct nova_dentry *direntryc, entry_copy;
+
+	direntry = nova_find_dentry(sb, NULL, dir,
+					entry->name, entry->len);
+	if (direntry == NULL)
+		return 0;
+
+	if (metadata_csum == 0)
+		direntryc = direntry;
+	else {
+		direntryc = &entry_copy;
+		if (!nova_verify_entry_csum(sb, direntry, direntryc))
+			return 0;
+	}
+
+	*res_entry = direntry;
+	return direntryc->ino;
+}
+
+static struct dentry *nova_lookup(struct inode *dir, struct dentry *dentry,
+				   unsigned int flags)
+{
+	struct inode *inode = NULL;
+	struct nova_dentry *de;
+	ino_t ino;
+	timing_t lookup_time;
+
+	NOVA_START_TIMING(lookup_t, lookup_time);
+	if (dentry->d_name.len > NOVA_NAME_LEN) {
+		nova_dbg("%s: namelen %u exceeds limit\n",
+			__func__, dentry->d_name.len);
+		return ERR_PTR(-ENAMETOOLONG);
+	}
+
+	nova_dbg_verbose("%s: %s\n", __func__, dentry->d_name.name);
+	ino = nova_inode_by_name(dir, &dentry->d_name, &de);
+	nova_dbg_verbose("%s: ino %lu\n", __func__, ino);
+	if (ino) {
+		inode = nova_iget(dir->i_sb, ino);
+		if (inode == ERR_PTR(-ESTALE) || inode == ERR_PTR(-ENOMEM)
+				|| inode == ERR_PTR(-EACCES)) {
+			nova_err(dir->i_sb,
+				  "%s: get inode failed: %lu\n",
+				  __func__, (unsigned long)ino);
+			return ERR_PTR(-EIO);
+		}
+	}
+
+	NOVA_END_TIMING(lookup_t, lookup_time);
+	return d_splice_alias(inode, dentry);
+}
+
+static void nova_lite_transaction_for_new_inode(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode *pidir, struct inode *inode,
+	struct inode *dir, struct nova_inode_update *update)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	int cpu;
+	u64 journal_tail;
+	timing_t trans_time;
+
+	NOVA_START_TIMING(create_trans_t, trans_time);
+
+	cpu = smp_processor_id();
+	spin_lock(&sbi->journal_locks[cpu]);
+	nova_memunlock_journal(sb);
+
+	// If you change what's required to create a new inode, you need to
+	// update this function so the changes will be rolled back on failure.
+	journal_tail = nova_create_inode_transaction(sb, inode, dir, cpu, 1, 0);
+
+	nova_update_inode(sb, dir, pidir, update, 0);
+
+	pi->valid = 1;
+	nova_update_inode_checksum(pi);
+	PERSISTENT_BARRIER();
+
+	nova_commit_lite_transaction(sb, journal_tail, cpu);
+	nova_memlock_journal(sb);
+	spin_unlock(&sbi->journal_locks[cpu]);
+
+	if (metadata_csum) {
+		nova_memunlock_inode(sb, pi);
+		nova_update_alter_inode(sb, inode, pi);
+		nova_update_alter_inode(sb, dir, pidir);
+		nova_memlock_inode(sb, pi);
+	}
+	NOVA_END_TIMING(create_trans_t, trans_time);
+}
+
+/* Returns new tail after append */
+/*
+ * By the time this is called, we already have created
+ * the directory cache entry for the new file, but it
+ * is so far negative - it has no inode.
+ *
+ * If the create succeeds, we fill in the inode information
+ * with d_instantiate().
+ */
+static int nova_create(struct inode *dir, struct dentry *dentry, umode_t mode,
+			bool excl)
+{
+	struct inode *inode = NULL;
+	int err = -ENOMEM;
+	struct super_block *sb = dir->i_sb;
+	struct nova_inode *pidir, *pi;
+	struct nova_inode_update update;
+	u64 pi_addr = 0;
+	u64 ino, epoch_id;
+	timing_t create_time;
+
+	NOVA_START_TIMING(create_t, create_time);
+
+	pidir = nova_get_inode(sb, dir);
+	if (!pidir)
+		goto out_err;
+
+	epoch_id = nova_get_epoch_id(sb);
+	ino = nova_new_nova_inode(sb, &pi_addr);
+	if (ino == 0)
+		goto out_err;
+
+	update.tail = 0;
+	update.alter_tail = 0;
+	err = nova_add_dentry(dentry, ino, 0, &update, epoch_id);
+	if (err)
+		goto out_err;
+
+	nova_dbgv("%s: %s\n", __func__, dentry->d_name.name);
+	nova_dbgv("%s: inode %llu, dir %lu\n", __func__, ino, dir->i_ino);
+	inode = nova_new_vfs_inode(TYPE_CREATE, dir, pi_addr, ino, mode,
+					0, 0, &dentry->d_name, epoch_id);
+	err = PTR_ERR_OR_ZERO(inode);
+	if (err)
+		goto out_err;
+	d_instantiate(dentry, inode);
+	unlock_new_inode(inode);
+
+	pi = nova_get_block(sb, pi_addr);
+	nova_lite_transaction_for_new_inode(sb, pi, pidir, inode, dir,
+						&update);
+	NOVA_END_TIMING(create_t, create_time);
+	return err;
+out_err:
+	nova_err(sb, "%s return %d\n", __func__, err);
+	NOVA_END_TIMING(create_t, create_time);
+	return err;
+}
+
+static int nova_mknod(struct inode *dir, struct dentry *dentry, umode_t mode,
+		       dev_t rdev)
+{
+	struct inode *inode = NULL;
+	int err = -ENOMEM;
+	struct super_block *sb = dir->i_sb;
+	u64 pi_addr = 0;
+	struct nova_inode *pidir, *pi;
+	struct nova_inode_update update;
+	u64 ino;
+	u64 epoch_id;
+	timing_t mknod_time;
+
+	NOVA_START_TIMING(mknod_t, mknod_time);
+
+	pidir = nova_get_inode(sb, dir);
+	if (!pidir)
+		goto out_err;
+
+	epoch_id = nova_get_epoch_id(sb);
+	ino = nova_new_nova_inode(sb, &pi_addr);
+	if (ino == 0)
+		goto out_err;
+
+	nova_dbgv("%s: %s\n", __func__, dentry->d_name.name);
+	nova_dbgv("%s: inode %llu, dir %lu\n", __func__, ino, dir->i_ino);
+
+	update.tail = 0;
+	update.alter_tail = 0;
+	err = nova_add_dentry(dentry, ino, 0, &update, epoch_id);
+	if (err)
+		goto out_err;
+
+	inode = nova_new_vfs_inode(TYPE_MKNOD, dir, pi_addr, ino, mode,
+					0, rdev, &dentry->d_name, epoch_id);
+	err = PTR_ERR_OR_ZERO(inode);
+	if (err)
+		goto out_err;
+	d_instantiate(dentry, inode);
+	unlock_new_inode(inode);
+
+	pi = nova_get_block(sb, pi_addr);
+	nova_lite_transaction_for_new_inode(sb, pi, pidir, inode, dir,
+						&update);
+	NOVA_END_TIMING(mknod_t, mknod_time);
+	return err;
+out_err:
+	nova_err(sb, "%s return %d\n", __func__, err);
+	NOVA_END_TIMING(mknod_t, mknod_time);
+	return err;
+}
+
+static int nova_symlink(struct inode *dir, struct dentry *dentry,
+			 const char *symname)
+{
+	struct super_block *sb = dir->i_sb;
+	int err = -ENAMETOOLONG;
+	unsigned int len = strlen(symname);
+	struct inode *inode;
+	struct nova_inode_info *si;
+	struct nova_inode_info_header *sih;
+	u64 pi_addr = 0;
+	struct nova_inode *pidir, *pi;
+	struct nova_inode_update update;
+	u64 ino;
+	u64 epoch_id;
+	timing_t symlink_time;
+
+	NOVA_START_TIMING(symlink_t, symlink_time);
+	if (len + 1 > sb->s_blocksize)
+		goto out;
+
+	pidir = nova_get_inode(sb, dir);
+	if (!pidir)
+		goto out_fail;
+
+	epoch_id = nova_get_epoch_id(sb);
+	ino = nova_new_nova_inode(sb, &pi_addr);
+	if (ino == 0)
+		goto out_fail;
+
+	nova_dbgv("%s: name %s, symname %s\n", __func__,
+				dentry->d_name.name, symname);
+	nova_dbgv("%s: inode %llu, dir %lu\n", __func__, ino, dir->i_ino);
+
+	update.tail = 0;
+	update.alter_tail = 0;
+	err = nova_add_dentry(dentry, ino, 0, &update, epoch_id);
+	if (err)
+		goto out_fail;
+
+	inode = nova_new_vfs_inode(TYPE_SYMLINK, dir, pi_addr, ino,
+					S_IFLNK|0777, len, 0,
+					&dentry->d_name, epoch_id);
+	if (IS_ERR(inode)) {
+		err = PTR_ERR(inode);
+		goto out_fail;
+	}
+
+	pi = nova_get_inode(sb, inode);
+
+	si = NOVA_I(inode);
+	sih = &si->header;
+
+	err = nova_block_symlink(sb, pi, inode, symname, len, epoch_id);
+	if (err)
+		goto out_fail;
+
+	d_instantiate(dentry, inode);
+	unlock_new_inode(inode);
+
+	nova_lite_transaction_for_new_inode(sb, pi, pidir, inode, dir,
+					&update);
+out:
+	NOVA_END_TIMING(symlink_t, symlink_time);
+	return err;
+
+out_fail:
+	nova_err(sb, "%s return %d\n", __func__, err);
+	goto out;
+}
+
+static void nova_lite_transaction_for_time_and_link(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode *pidir, struct inode *inode,
+	struct inode *dir, struct nova_inode_update *update,
+	struct nova_inode_update *update_dir, int invalidate, u64 epoch_id)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u64 journal_tail;
+	int cpu;
+	timing_t trans_time;
+
+	NOVA_START_TIMING(link_trans_t, trans_time);
+
+	cpu = smp_processor_id();
+	spin_lock(&sbi->journal_locks[cpu]);
+	nova_memunlock_journal(sb);
+
+	// If you change what's required to create a new inode, you need to
+	// update this function so the changes will be rolled back on failure.
+	journal_tail = nova_create_inode_transaction(sb, inode, dir, cpu,
+						0, invalidate);
+
+	if (invalidate) {
+		pi->valid = 0;
+		pi->delete_epoch_id = epoch_id;
+	}
+	nova_update_inode(sb, inode, pi, update, 0);
+
+	nova_update_inode(sb, dir, pidir, update_dir, 0);
+
+	PERSISTENT_BARRIER();
+
+	nova_commit_lite_transaction(sb, journal_tail, cpu);
+	nova_memlock_journal(sb);
+	spin_unlock(&sbi->journal_locks[cpu]);
+
+	if (metadata_csum) {
+		nova_memunlock_inode(sb, pi);
+		nova_update_alter_inode(sb, inode, pi);
+		nova_update_alter_inode(sb, dir, pidir);
+		nova_memlock_inode(sb, pi);
+	}
+
+	NOVA_END_TIMING(link_trans_t, trans_time);
+}
+
+static int nova_link(struct dentry *dest_dentry, struct inode *dir,
+		      struct dentry *dentry)
+{
+	struct super_block *sb = dir->i_sb;
+	struct inode *inode = dest_dentry->d_inode;
+	struct nova_inode *pi = nova_get_inode(sb, inode);
+	struct nova_inode *pidir;
+	struct nova_inode_update update_dir;
+	struct nova_inode_update update;
+	u64 old_linkc = 0;
+	u64 epoch_id;
+	int err = -ENOMEM;
+	timing_t link_time;
+
+	NOVA_START_TIMING(link_t, link_time);
+	if (inode->i_nlink >= NOVA_LINK_MAX) {
+		err = -EMLINK;
+		goto out;
+	}
+
+	pidir = nova_get_inode(sb, dir);
+	if (!pidir) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	ihold(inode);
+	epoch_id = nova_get_epoch_id(sb);
+
+	nova_dbgv("%s: name %s, dest %s\n", __func__,
+			dentry->d_name.name, dest_dentry->d_name.name);
+	nova_dbgv("%s: inode %lu, dir %lu\n", __func__,
+			inode->i_ino, dir->i_ino);
+
+	update_dir.tail = 0;
+	update_dir.alter_tail = 0;
+	err = nova_add_dentry(dentry, inode->i_ino, 0, &update_dir, epoch_id);
+	if (err) {
+		iput(inode);
+		goto out;
+	}
+
+	inode->i_ctime = current_time(inode);
+	inc_nlink(inode);
+
+	update.tail = 0;
+	update.alter_tail = 0;
+	err = nova_append_link_change_entry(sb, pi, inode, &update,
+						&old_linkc, epoch_id);
+	if (err) {
+		iput(inode);
+		goto out;
+	}
+
+	d_instantiate(dentry, inode);
+	nova_lite_transaction_for_time_and_link(sb, pi, pidir, inode, dir,
+					&update, &update_dir, 0, epoch_id);
+
+	nova_invalidate_link_change_entry(sb, old_linkc);
+
+out:
+	NOVA_END_TIMING(link_t, link_time);
+	return err;
+}
+
+static int nova_unlink(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = dentry->d_inode;
+	struct super_block *sb = dir->i_sb;
+	int retval = -ENOMEM;
+	struct nova_inode *pi = nova_get_inode(sb, inode);
+	struct nova_inode *pidir;
+	struct nova_inode_update update_dir;
+	struct nova_inode_update update;
+	u64 old_linkc = 0;
+	u64 epoch_id;
+	int invalidate = 0;
+	timing_t unlink_time;
+
+	NOVA_START_TIMING(unlink_t, unlink_time);
+
+	pidir = nova_get_inode(sb, dir);
+	if (!pidir)
+		goto out;
+
+	epoch_id = nova_get_epoch_id(sb);
+	nova_dbgv("%s: %s\n", __func__, dentry->d_name.name);
+	nova_dbgv("%s: inode %lu, dir %lu\n", __func__,
+				inode->i_ino, dir->i_ino);
+
+	update_dir.tail = 0;
+	update_dir.alter_tail = 0;
+	retval = nova_remove_dentry(dentry, 0, &update_dir, epoch_id);
+	if (retval)
+		goto out;
+
+	inode->i_ctime = dir->i_ctime;
+
+	if (inode->i_nlink == 1)
+		invalidate = 1;
+
+	if (inode->i_nlink)
+		drop_nlink(inode);
+
+	update.tail = 0;
+	update.alter_tail = 0;
+	retval = nova_append_link_change_entry(sb, pi, inode, &update,
+						&old_linkc, epoch_id);
+	if (retval)
+		goto out;
+
+	nova_lite_transaction_for_time_and_link(sb, pi, pidir, inode, dir,
+				&update, &update_dir, invalidate, epoch_id);
+
+	nova_invalidate_link_change_entry(sb, old_linkc);
+	nova_invalidate_dentries(sb, &update_dir);
+
+	NOVA_END_TIMING(unlink_t, unlink_time);
+	return 0;
+out:
+	nova_err(sb, "%s return %d\n", __func__, retval);
+	NOVA_END_TIMING(unlink_t, unlink_time);
+	return retval;
+}
+
+static int nova_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+	struct super_block *sb = dir->i_sb;
+	struct inode *inode;
+	struct nova_inode *pidir, *pi;
+	struct nova_inode_info *si, *sidir;
+	struct nova_inode_info_header *sih = NULL;
+	struct nova_inode_update update;
+	u64 pi_addr = 0;
+	u64 ino;
+	u64 epoch_id;
+	int err = -EMLINK;
+	timing_t mkdir_time;
+
+	NOVA_START_TIMING(mkdir_t, mkdir_time);
+	if (dir->i_nlink >= NOVA_LINK_MAX)
+		goto out;
+
+	ino = nova_new_nova_inode(sb, &pi_addr);
+	if (ino == 0)
+		goto out_err;
+
+	epoch_id = nova_get_epoch_id(sb);
+	nova_dbgv("%s: name %s\n", __func__, dentry->d_name.name);
+	nova_dbgv("%s: inode %llu, dir %lu, link %d\n", __func__,
+				ino, dir->i_ino, dir->i_nlink);
+
+	update.tail = 0;
+	update.alter_tail = 0;
+	err = nova_add_dentry(dentry, ino, 1, &update, epoch_id);
+	if (err) {
+		nova_dbg("failed to add dir entry\n");
+		goto out_err;
+	}
+
+	inode = nova_new_vfs_inode(TYPE_MKDIR, dir, pi_addr, ino,
+					S_IFDIR | mode, sb->s_blocksize,
+					0, &dentry->d_name, epoch_id);
+	if (IS_ERR(inode)) {
+		err = PTR_ERR(inode);
+		goto out_err;
+	}
+
+	pi = nova_get_inode(sb, inode);
+	nova_append_dir_init_entries(sb, pi, inode->i_ino, dir->i_ino,
+					epoch_id);
+
+	/* Build the dir tree */
+	si = NOVA_I(inode);
+	sih = &si->header;
+	nova_rebuild_dir_inode_tree(sb, pi, pi_addr, sih);
+
+	pidir = nova_get_inode(sb, dir);
+	sidir = NOVA_I(dir);
+	sih = &sidir->header;
+	dir->i_blocks = sih->i_blocks;
+	inc_nlink(dir);
+	d_instantiate(dentry, inode);
+	unlock_new_inode(inode);
+
+	nova_lite_transaction_for_new_inode(sb, pi, pidir, inode, dir,
+					&update);
+out:
+	NOVA_END_TIMING(mkdir_t, mkdir_time);
+	return err;
+
+out_err:
+//	clear_nlink(inode);
+	nova_err(sb, "%s return %d\n", __func__, err);
+	goto out;
+}
+
+/*
+ * routine to check that the specified directory is empty (for rmdir)
+ */
+static int nova_empty_dir(struct inode *inode)
+{
+	struct super_block *sb;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_dentry *entry;
+	struct nova_dentry *entryc, entry_copy;
+	unsigned long pos = 0;
+	struct nova_dentry *entries[4];
+	int nr_entries;
+	int i;
+
+	sb = inode->i_sb;
+	nr_entries = radix_tree_gang_lookup(&sih->tree,
+					(void **)entries, pos, 4);
+	if (nr_entries > 2)
+		return 0;
+
+	entryc = &entry_copy;
+
+	for (i = 0; i < nr_entries; i++) {
+		entry = entries[i];
+
+		if (metadata_csum == 0)
+			entryc = entry;
+		else if (!nova_verify_entry_csum(sb, entry, entryc))
+			return 0;
+
+		if (!is_dir_init_entry(sb, entryc))
+			return 0;
+	}
+
+	return 1;
+}
+
+static int nova_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = dentry->d_inode;
+	struct nova_dentry *de;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pi = nova_get_inode(sb, inode), *pidir;
+	struct nova_inode_update update_dir;
+	struct nova_inode_update update;
+	u64 old_linkc = 0;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	int err = -ENOTEMPTY;
+	u64 epoch_id;
+	timing_t rmdir_time;
+
+	NOVA_START_TIMING(rmdir_t, rmdir_time);
+	if (!inode)
+		return -ENOENT;
+
+	nova_dbgv("%s: name %s\n", __func__, dentry->d_name.name);
+	pidir = nova_get_inode(sb, dir);
+	if (!pidir)
+		return -EINVAL;
+
+	if (nova_inode_by_name(dir, &dentry->d_name, &de) == 0)
+		return -ENOENT;
+
+	if (!nova_empty_dir(inode))
+		return err;
+
+	nova_dbgv("%s: inode %lu, dir %lu, link %d\n", __func__,
+				inode->i_ino, dir->i_ino, dir->i_nlink);
+
+	if (inode->i_nlink != 2)
+		nova_dbg("empty directory %lu has nlink!=2 (%d), dir %lu\n",
+				inode->i_ino, inode->i_nlink, dir->i_ino);
+
+	epoch_id = nova_get_epoch_id(sb);
+
+	update_dir.tail = 0;
+	update_dir.alter_tail = 0;
+	err = nova_remove_dentry(dentry, -1, &update_dir, epoch_id);
+	if (err)
+		goto end_rmdir;
+
+	/*inode->i_version++; */
+	clear_nlink(inode);
+	inode->i_ctime = dir->i_ctime;
+
+	if (dir->i_nlink)
+		drop_nlink(dir);
+
+	nova_delete_dir_tree(sb, sih);
+
+	update.tail = 0;
+	update.alter_tail = 0;
+	err = nova_append_link_change_entry(sb, pi, inode, &update,
+						&old_linkc, epoch_id);
+	if (err)
+		goto end_rmdir;
+
+	nova_lite_transaction_for_time_and_link(sb, pi, pidir, inode, dir,
+					&update, &update_dir, 1, epoch_id);
+
+	nova_invalidate_link_change_entry(sb, old_linkc);
+	nova_invalidate_dentries(sb, &update_dir);
+
+	NOVA_END_TIMING(rmdir_t, rmdir_time);
+	return err;
+
+end_rmdir:
+	nova_err(sb, "%s return %d\n", __func__, err);
+	NOVA_END_TIMING(rmdir_t, rmdir_time);
+	return err;
+}
+
+static int nova_rename(struct inode *old_dir,
+			struct dentry *old_dentry,
+			struct inode *new_dir, struct dentry *new_dentry,
+			unsigned int flags)
+{
+	struct inode *old_inode = old_dentry->d_inode;
+	struct inode *new_inode = new_dentry->d_inode;
+	struct super_block *sb = old_inode->i_sb;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode *old_pi = NULL, *new_pi = NULL;
+	struct nova_inode *new_pidir = NULL, *old_pidir = NULL;
+	struct nova_dentry *father_entry = NULL;
+	struct nova_dentry *father_entryc, entry_copy;
+	char *head_addr = NULL;
+	int invalidate_new_inode = 0;
+	struct nova_inode_update update_dir_new;
+	struct nova_inode_update update_dir_old;
+	struct nova_inode_update update_new;
+	struct nova_inode_update update_old;
+	u64 old_linkc1 = 0, old_linkc2 = 0;
+	int err = -ENOENT;
+	int inc_link = 0, dec_link = 0;
+	int cpu;
+	int change_parent = 0;
+	u64 journal_tail;
+	u64 epoch_id;
+	timing_t rename_time;
+
+	nova_dbgv("%s: rename %s to %s,\n", __func__,
+			old_dentry->d_name.name, new_dentry->d_name.name);
+	nova_dbgv("%s: %s inode %lu, old dir %lu, new dir %lu, new inode %lu\n",
+			__func__, S_ISDIR(old_inode->i_mode) ? "dir" : "normal",
+			old_inode->i_ino, old_dir->i_ino, new_dir->i_ino,
+			new_inode ? new_inode->i_ino : 0);
+
+	if (flags & ~RENAME_NOREPLACE)
+		return -EINVAL;
+
+	NOVA_START_TIMING(rename_t, rename_time);
+
+	if (new_inode) {
+		err = -ENOTEMPTY;
+		if (S_ISDIR(old_inode->i_mode) && !nova_empty_dir(new_inode))
+			goto out;
+	} else {
+		if (S_ISDIR(old_inode->i_mode)) {
+			err = -EMLINK;
+			if (new_dir->i_nlink >= NOVA_LINK_MAX)
+				goto out;
+		}
+	}
+
+	if (S_ISDIR(old_inode->i_mode)) {
+		dec_link = -1;
+		if (!new_inode)
+			inc_link = 1;
+		/*
+		 * Tricky for in-place update:
+		 * The new dentry always comes after the renamed dentry, so we
+		 * have to make sure the new dentry has the correct link count
+		 * to work around the rebuild nlink issue.
+		 */
+		if (old_dir == new_dir) {
+			inc_link--;
+			if (inc_link == 0)
+				dec_link = 0;
+		}
+	}
+
+	epoch_id = nova_get_epoch_id(sb);
+	new_pidir = nova_get_inode(sb, new_dir);
+	old_pidir = nova_get_inode(sb, old_dir);
+
+	old_pi = nova_get_inode(sb, old_inode);
+	old_inode->i_ctime = current_time(old_inode);
+	update_old.tail = 0;
+	update_old.alter_tail = 0;
+	err = nova_append_link_change_entry(sb, old_pi, old_inode,
+					&update_old, &old_linkc1, epoch_id);
+	if (err)
+		goto out;
+
+	if (S_ISDIR(old_inode->i_mode) && old_dir != new_dir) {
+		/* The parent directory changed. Update the ".." entry */
+		/* For simplicity, we use in-place update and journal it */
+		change_parent = 1;
+		head_addr = (char *)nova_get_block(sb, old_pi->log_head);
+		father_entry = (struct nova_dentry *)(head_addr +
+					NOVA_DIR_LOG_REC_LEN(1));
+
+		if (metadata_csum == 0)
+			father_entryc = father_entry;
+		else {
+			father_entryc = &entry_copy;
+			if (!nova_verify_entry_csum(sb, father_entry,
+							father_entryc)) {
+				err = -EIO;
+				goto out;
+			}
+		}
+
+		if (le64_to_cpu(father_entryc->ino) != old_dir->i_ino)
+			nova_err(sb, "%s: dir %lu parent should be %lu, but actually %lu\n",
+				__func__,
+				old_inode->i_ino, old_dir->i_ino,
+				le64_to_cpu(father_entry->ino));
+	}
+
+	update_dir_new.tail = 0;
+	update_dir_new.alter_tail = 0;
+	if (new_inode) {
+		/* First remove the old entry in the new directory */
+		err = nova_remove_dentry(new_dentry, 0, &update_dir_new,
+					epoch_id);
+		if (err)
+			goto out;
+	}
+
+	/* link into the new directory. */
+	err = nova_add_dentry(new_dentry, old_inode->i_ino,
+				inc_link, &update_dir_new, epoch_id);
+	if (err)
+		goto out;
+
+	if (inc_link > 0)
+		inc_nlink(new_dir);
+
+	update_dir_old.tail = 0;
+	update_dir_old.alter_tail = 0;
+	if (old_dir == new_dir) {
+		update_dir_old.tail = update_dir_new.tail;
+		update_dir_old.alter_tail = update_dir_new.alter_tail;
+	}
+
+	err = nova_remove_dentry(old_dentry, dec_link, &update_dir_old,
+					epoch_id);
+	if (err)
+		goto out;
+
+	if (dec_link < 0)
+		drop_nlink(old_dir);
+
+	if (new_inode) {
+		new_pi = nova_get_inode(sb, new_inode);
+		new_inode->i_ctime = current_time(new_inode);
+
+		if (S_ISDIR(old_inode->i_mode)) {
+			if (new_inode->i_nlink)
+				drop_nlink(new_inode);
+		}
+		if (new_inode->i_nlink)
+			drop_nlink(new_inode);
+
+		update_new.tail = 0;
+		update_new.alter_tail = 0;
+		err = nova_append_link_change_entry(sb, new_pi, new_inode,
+						&update_new, &old_linkc2,
+						epoch_id);
+		if (err)
+			goto out;
+	}
+
+	cpu = smp_processor_id();
+	spin_lock(&sbi->journal_locks[cpu]);
+	nova_memunlock_journal(sb);
+	if (new_inode && new_inode->i_nlink == 0)
+		invalidate_new_inode = 1;
+	journal_tail = nova_create_rename_transaction(sb, old_inode, old_dir,
+				new_inode,
+				old_dir != new_dir ? new_dir : NULL,
+				father_entry,
+				invalidate_new_inode,
+				cpu);
+
+	nova_update_inode(sb, old_inode, old_pi, &update_old, 0);
+	nova_update_inode(sb, old_dir, old_pidir, &update_dir_old, 0);
+
+	if (old_pidir != new_pidir)
+		nova_update_inode(sb, new_dir, new_pidir, &update_dir_new, 0);
+
+	if (change_parent && father_entry) {
+		father_entry->ino = cpu_to_le64(new_dir->i_ino);
+		nova_update_entry_csum(father_entry);
+		nova_update_alter_entry(sb, father_entry);
+	}
+
+	if (new_inode) {
+		if (invalidate_new_inode) {
+			new_pi->valid = 0;
+			new_pi->delete_epoch_id = epoch_id;
+		}
+		nova_update_inode(sb, new_inode, new_pi, &update_new, 0);
+	}
+
+	PERSISTENT_BARRIER();
+
+	nova_commit_lite_transaction(sb, journal_tail, cpu);
+	nova_memlock_journal(sb);
+	spin_unlock(&sbi->journal_locks[cpu]);
+
+	nova_memunlock_inode(sb, old_pi);
+	nova_update_alter_inode(sb, old_inode, old_pi);
+	nova_update_alter_inode(sb, old_dir, old_pidir);
+	if (old_dir != new_dir)
+		nova_update_alter_inode(sb, new_dir, new_pidir);
+	if (new_inode)
+		nova_update_alter_inode(sb, new_inode, new_pi);
+	nova_memlock_inode(sb, old_pi);
+
+	nova_invalidate_link_change_entry(sb, old_linkc1);
+	nova_invalidate_link_change_entry(sb, old_linkc2);
+	if (new_inode)
+		nova_invalidate_dentries(sb, &update_dir_new);
+	nova_invalidate_dentries(sb, &update_dir_old);
+
+	NOVA_END_TIMING(rename_t, rename_time);
+	return 0;
+out:
+	nova_err(sb, "%s return %d\n", __func__, err);
+	NOVA_END_TIMING(rename_t, rename_time);
+	return err;
+}
+
+struct dentry *nova_get_parent(struct dentry *child)
+{
+	struct inode *inode;
+	struct qstr dotdot = QSTR_INIT("..", 2);
+	struct nova_dentry *de = NULL;
+	ino_t ino;
+
+	nova_inode_by_name(child->d_inode, &dotdot, &de);
+	if (!de)
+		return ERR_PTR(-ENOENT);
+
+	/* FIXME: can de->ino be avoided by using the return value of
+	 * nova_inode_by_name()?
+	 */
+	ino = le64_to_cpu(de->ino);
+
+	if (ino)
+		inode = nova_iget(child->d_inode->i_sb, ino);
+	else
+		return ERR_PTR(-ENOENT);
+
+	return d_obtain_alias(inode);
+}
+
+const struct inode_operations nova_dir_inode_operations = {
+	.create		= nova_create,
+	.lookup		= nova_lookup,
+	.link		= nova_link,
+	.unlink		= nova_unlink,
+	.symlink	= nova_symlink,
+	.mkdir		= nova_mkdir,
+	.rmdir		= nova_rmdir,
+	.mknod		= nova_mknod,
+	.rename		= nova_rename,
+	.setattr	= nova_notify_change,
+	.get_acl	= NULL,
+};
+
+const struct inode_operations nova_special_inode_operations = {
+	.setattr	= nova_notify_change,
+	.get_acl	= NULL,
+};
diff --git a/fs/nova/symlink.c b/fs/nova/symlink.c
new file mode 100644
index 000000000000..b0e5e898a41b
--- /dev/null
+++ b/fs/nova/symlink.c
@@ -0,0 +1,153 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Symlink operations
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/version.h>
+#include "nova.h"
+#include "inode.h"
+
+int nova_block_symlink(struct super_block *sb, struct nova_inode *pi,
+	struct inode *inode, const char *symname, int len, u64 epoch_id)
+{
+	struct nova_file_write_entry entry_data;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode_update update;
+	unsigned long name_blocknr = 0;
+	int allocated;
+	u64 block;
+	char *blockp;
+	u32 time;
+	int ret;
+
+	update.tail = sih->log_tail;
+	update.alter_tail = sih->alter_log_tail;
+
+	allocated = nova_new_data_blocks(sb, sih, &name_blocknr, 0, 1,
+				 ALLOC_INIT_ZERO, ANY_CPU, ALLOC_FROM_TAIL);
+	if (allocated != 1 || name_blocknr == 0) {
+		ret = allocated < 0 ? allocated : -ENOSPC;
+		return ret;
+	}
+
+	/* First copy name to name block */
+	block = nova_get_block_off(sb, name_blocknr, NOVA_BLOCK_TYPE_4K);
+	blockp = (char *)nova_get_block(sb, block);
+
+	nova_memunlock_block(sb, blockp);
+	memcpy_to_pmem_nocache(blockp, symname, len);
+	blockp[len] = '\0';
+	nova_memlock_block(sb, blockp);
+
+	/* Apply a write entry to the log page */
+	time = current_time(inode).tv_sec;
+	nova_init_file_write_entry(sb, sih, &entry_data, epoch_id, 0, 1,
+					name_blocknr, time, len + 1);
+
+	ret = nova_append_file_write_entry(sb, pi, inode, &entry_data, &update);
+	if (ret) {
+		nova_dbg("%s: append file write entry failed %d\n",
+					__func__, ret);
+		nova_free_data_blocks(sb, sih, name_blocknr, 1);
+		return ret;
+	}
+
+	nova_memunlock_inode(sb, pi);
+	nova_update_inode(sb, inode, pi, &update, 1);
+	nova_memlock_inode(sb, pi);
+	sih->trans_id++;
+
+	return 0;
+}
+
+/* FIXME: Temporary workaround */
+static int nova_readlink_copy(char __user *buffer, int buflen, const char *link)
+{
+	int len = PTR_ERR(link);
+
+	if (IS_ERR(link))
+		goto out;
+
+	len = strlen(link);
+	if (len > (unsigned int) buflen)
+		len = buflen;
+	if (copy_to_user(buffer, link, len))
+		len = -EFAULT;
+out:
+	return len;
+}
+
+static int nova_readlink(struct dentry *dentry, char __user *buffer, int buflen)
+{
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	struct inode *inode = dentry->d_inode;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	char *blockp;
+
+	entry = (struct nova_file_write_entry *)nova_get_block(sb,
+							sih->log_head);
+
+	if (metadata_csum == 0)
+		entryc = entry;
+	else {
+		entryc = &entry_copy;
+		if (!nova_verify_entry_csum(sb, entry, entryc))
+			return -EIO;
+	}
+
+	blockp = (char *)nova_get_block(sb, BLOCK_OFF(entryc->block));
+
+	return nova_readlink_copy(buffer, buflen, blockp);
+}
+
+static const char *nova_get_link(struct dentry *dentry, struct inode *inode,
+	struct delayed_call *done)
+{
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	char *blockp;
+
+	entry = (struct nova_file_write_entry *)nova_get_block(sb,
+							sih->log_head);
+	if (metadata_csum == 0)
+		entryc = entry;
+	else {
+		entryc = &entry_copy;
+		if (!nova_verify_entry_csum(sb, entry, entryc))
+			return NULL;
+	}
+
+	blockp = (char *)nova_get_block(sb, BLOCK_OFF(entryc->block));
+
+	return blockp;
+}
+
+const struct inode_operations nova_symlink_inode_operations = {
+	.readlink	= nova_readlink,
+	.get_link	= nova_get_link,
+	.setattr	= nova_notify_change,
+};


* [RFC 07/16] NOVA: File and directory operations
@ 2017-08-03  7:48   ` Steven Swanson
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson, dan.j.williams

To access file data via read(), NOVA maintains a radix tree in DRAM for each
inode (nova_inode_info_header.tree) that maps file offsets to write log
entries.  For directories, the same tree maps a hash of each filename to its
corresponding dentry.

In both cases, NOVA populates the tree when the file or directory is opened
by scanning its log.
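
The directory half of that mapping can be sketched in userspace as follows.
This is only an illustration, not code from the patch: the demo_* names are
hypothetical, a small fixed-size table stands in for the kernel radix tree,
and the hash is the classic BKDR string hash with seed 131, which is assumed
to resemble the BKDRHash() helper defined elsewhere in the series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TABLE_SLOTS 64	/* hypothetical table size for the sketch */

/* Hypothetical, simplified stand-in for struct nova_dentry. */
struct demo_dentry {
	unsigned long ino;
	const char *name;
};

/* Stand-in for the per-inode tree; a zeroed struct is an empty tree. */
struct demo_tree {
	unsigned long keys[TABLE_SLOTS];
	struct demo_dentry *vals[TABLE_SLOTS];
};

/* Classic BKDR string hash, seed 131 (assumption, see lead-in). */
static unsigned long demo_bkdr_hash(const char *name, size_t len)
{
	unsigned long hash = 0;
	size_t i;

	for (i = 0; i < len; i++)
		hash = hash * 131 + (unsigned char)name[i];
	return hash;
}

/*
 * Insert a dentry keyed by the hash of its name. Like radix_tree_insert(),
 * this refuses a key that is already present, so two names with the same
 * hash cannot coexist (the collision case the patch marks as FIXME).
 */
static int demo_insert(struct demo_tree *t, const char *name,
		       struct demo_dentry *de)
{
	unsigned long hash;
	size_t i, slot;

	assert(t && name && de);
	hash = demo_bkdr_hash(name, strlen(name));
	for (i = 0; i < TABLE_SLOTS; i++) {
		slot = (hash + i) % TABLE_SLOTS;
		if (t->vals[slot] == NULL) {
			t->keys[slot] = hash;
			t->vals[slot] = de;
			return 0;
		}
		if (t->keys[slot] == hash)
			return -1;	/* key already present */
	}
	return -1;			/* table full */
}

/* Look a name up by hash, then confirm the stored name really matches. */
static struct demo_dentry *demo_lookup(const struct demo_tree *t,
				       const char *name)
{
	unsigned long hash = demo_bkdr_hash(name, strlen(name));
	size_t i, slot;

	for (i = 0; i < TABLE_SLOTS; i++) {
		slot = (hash + i) % TABLE_SLOTS;
		if (t->vals[slot] == NULL)
			return NULL;	/* empty slot ends the probe */
		if (t->keys[slot] == hash)
			return strcmp(t->vals[slot]->name, name) == 0 ?
				t->vals[slot] : NULL;
	}
	return NULL;
}
```

A caller would zero-initialize a struct demo_tree, call demo_insert() once per
live dentry found while scanning the directory log, and then serve lookups
with demo_lookup().  The name comparison after the hash match plays the role
that nova_check_dentry_match() plays on the removal path.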

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 fs/nova/dir.c     |  760 +++++++++++++++++++++++++++++++++++++++++++
 fs/nova/file.c    |  943 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/namei.c   |  919 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/symlink.c |  153 +++++++++
 4 files changed, 2775 insertions(+)
 create mode 100644 fs/nova/dir.c
 create mode 100644 fs/nova/file.c
 create mode 100644 fs/nova/namei.c
 create mode 100644 fs/nova/symlink.c

diff --git a/fs/nova/dir.c b/fs/nova/dir.c
new file mode 100644
index 000000000000..47e89088a69b
--- /dev/null
+++ b/fs/nova/dir.c
@@ -0,0 +1,760 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * File operations for directories.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include "nova.h"
+#include "inode.h"
+
+#define DT2IF(dt) (((dt) << 12) & S_IFMT)
+#define IF2DT(sif) (((sif) & S_IFMT) >> 12)
+
+struct nova_dentry *nova_find_dentry(struct super_block *sb,
+	struct nova_inode *pi, struct inode *inode, const char *name,
+	unsigned long name_len)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_dentry *direntry;
+	unsigned long hash;
+
+	hash = BKDRHash(name, name_len);
+	direntry = radix_tree_lookup(&sih->tree, hash);
+
+	return direntry;
+}
+
+int nova_insert_dir_radix_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, const char *name,
+	int namelen, struct nova_dentry *direntry)
+{
+	unsigned long hash;
+	int ret;
+
+	hash = BKDRHash(name, namelen);
+	nova_dbgv("%s: insert %s hash %lu\n", __func__, name, hash);
+
+	/* FIXME: hash collision ignored here */
+	ret = radix_tree_insert(&sih->tree, hash, direntry);
+	if (ret)
+		nova_dbg("%s ERROR %d: %s\n", __func__, ret, name);
+
+	return ret;
+}
+
+static int nova_check_dentry_match(struct super_block *sb,
+	struct nova_dentry *dentry, const char *name, int namelen)
+{
+	if (dentry->name_len != namelen)
+		return -EINVAL;
+
+	return strncmp(dentry->name, name, namelen);
+}
+
+int nova_remove_dir_radix_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, const char *name, int namelen,
+	int replay, struct nova_dentry **create_dentry)
+{
+	struct nova_dentry *entry;
+	struct nova_dentry *entryc, entry_copy;
+	unsigned long hash;
+
+	hash = BKDRHash(name, namelen);
+	entry = radix_tree_delete(&sih->tree, hash);
+
+	if (replay == 0) {
+		if (!entry) {
+			nova_dbg("%s ERROR: %s, length %d, hash %lu\n",
+					__func__, name, namelen, hash);
+			return -EINVAL;
+		}
+
+		if (metadata_csum == 0)
+			entryc = entry;
+		else {
+			entryc = &entry_copy;
+			if (!nova_verify_entry_csum(sb, entry, entryc))
+				return -EINVAL;
+		}
+
+		if (entryc->ino == 0 || entryc->invalid ||
+		    nova_check_dentry_match(sb, entryc, name, namelen)) {
+			nova_dbg("%s dentry not match: %s, length %d, hash %lu\n",
+				 __func__, name, namelen, hash);
+			/* for debug information, still allow access to nvmm */
+			nova_dbg("dentry: type %d, inode %llu, name %s, namelen %u, rec len %u\n",
+				 entry->entry_type, le64_to_cpu(entry->ino),
+				 entry->name, entry->name_len,
+				 le16_to_cpu(entry->de_len));
+			return -EINVAL;
+		}
+
+		if (create_dentry)
+			*create_dentry = entry;
+	}
+
+	return 0;
+}
+
+void nova_delete_dir_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih)
+{
+	struct nova_dentry *direntry;
+	struct nova_dentry *direntryc, entry_copy;
+	unsigned long pos = 0;
+	struct nova_dentry *entries[FREE_BATCH];
+	timing_t delete_time;
+	int nr_entries;
+	int i;
+	void *ret;
+
+	NOVA_START_TIMING(delete_dir_tree_t, delete_time);
+
+	direntryc = &entry_copy;	/* repointed per entry if csum is off */
+	do {
+		nr_entries = radix_tree_gang_lookup(&sih->tree,
+					(void **)entries, pos, FREE_BATCH);
+		for (i = 0; i < nr_entries; i++) {
+			direntry = entries[i];
+			BUG_ON(!direntry);
+
+			if (metadata_csum == 0)
+				direntryc = direntry;
+			else if (!nova_verify_entry_csum(sb, direntry,
+								direntryc))
+				return;
+
+			pos = BKDRHash(direntryc->name, direntryc->name_len);
+			ret = radix_tree_delete(&sih->tree, pos);
+			if (!ret || ret != direntry) {
+				nova_err(sb, "dentry: type %d, inode %llu, name %s, namelen %u, rec len %u\n",
+					direntry->entry_type,
+					le64_to_cpu(direntry->ino),
+					direntry->name, direntry->name_len,
+					le16_to_cpu(direntry->de_len));
+				if (!ret)
+					nova_dbg("ret is NULL\n");
+			}
+		}
+		pos++;
+	} while (nr_entries == FREE_BATCH);
+
+	NOVA_END_TIMING(delete_dir_tree_t, delete_time);
+}
+
+/* ========================= Entry operations ============================= */
+
+static unsigned int nova_init_dentry(struct super_block *sb,
+	struct nova_dentry *de_entry, u64 self_ino, u64 parent_ino,
+	u64 epoch_id)
+{
+	void *start = de_entry;
+	struct nova_inode_log_page *curr_page = start;
+	unsigned int length;
+	unsigned short de_len;
+
+	de_len = NOVA_DIR_LOG_REC_LEN(1);
+	memset(de_entry, 0, de_len);
+	de_entry->entry_type = DIR_LOG;
+	de_entry->epoch_id = epoch_id;
+	de_entry->trans_id = 0;
+	de_entry->ino = cpu_to_le64(self_ino);
+	de_entry->name_len = 1;
+	de_entry->de_len = cpu_to_le16(de_len);
+	de_entry->mtime = timespec_trunc(current_kernel_time(),
+					 sb->s_time_gran).tv_sec;
+
+	de_entry->links_count = 1;
+	strncpy(de_entry->name, ".\0", 2);
+	nova_update_entry_csum(de_entry);
+
+	length = de_len;
+
+	de_entry = (struct nova_dentry *)((char *)de_entry + length);
+	de_len = NOVA_DIR_LOG_REC_LEN(2);
+	memset(de_entry, 0, de_len);
+	de_entry->entry_type = DIR_LOG;
+	de_entry->epoch_id = epoch_id;
+	de_entry->trans_id = 0;
+	de_entry->ino = cpu_to_le64(parent_ino);
+	de_entry->name_len = 2;
+	de_entry->de_len = cpu_to_le16(de_len);
+	de_entry->mtime = timespec_trunc(current_kernel_time(),
+					 sb->s_time_gran).tv_sec;
+
+	de_entry->links_count = 2;
+	strncpy(de_entry->name, "..\0", 3);
+	nova_update_entry_csum(de_entry);
+	length += de_len;
+
+	nova_set_page_num_entries(sb, curr_page, 2, 1);
+
+	nova_flush_buffer(start, length, 0);
+	return length;
+}
+
+/* Append . and .. entries
+ *
+ * TODO: why is epoch_id a parameter when we pass in the sb?
+ */
+int nova_append_dir_init_entries(struct super_block *sb,
+	struct nova_inode *pi, u64 self_ino, u64 parent_ino, u64 epoch_id)
+{
+	struct nova_inode_info_header sih;
+	struct nova_inode *alter_pi;
+	u64 alter_pi_addr = 0;
+	int allocated;
+	int ret;
+	u64 new_block;
+	unsigned int length;
+	struct nova_dentry *de_entry;
+
+	sih.ino = self_ino;
+	sih.i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+
+	allocated = nova_allocate_inode_log_pages(sb, &sih, 1, &new_block,
+							ANY_CPU, 0);
+	if (allocated != 1) {
+		nova_err(sb, "ERROR: no inode log page available\n");
+		return -ENOMEM;
+	}
+
+	nova_memunlock_inode(sb, pi);
+
+	pi->log_tail = pi->log_head = new_block;
+
+	de_entry = (struct nova_dentry *)nova_get_block(sb, new_block);
+
+	length = nova_init_dentry(sb, de_entry, self_ino, parent_ino, epoch_id);
+
+	nova_update_tail(pi, new_block + length);
+
+	nova_memlock_inode(sb, pi);
+
+	if (metadata_csum == 0)
+		return 0;
+
+	allocated = nova_allocate_inode_log_pages(sb, &sih, 1, &new_block,
+							ANY_CPU, 1);
+	if (allocated != 1) {
+		nova_err(sb, "ERROR: no inode log page available\n");
+		return -ENOMEM;
+	}
+	nova_memunlock_inode(sb, pi);
+	pi->alter_log_tail = pi->alter_log_head = new_block;
+
+	de_entry = (struct nova_dentry *)nova_get_block(sb, new_block);
+
+	length = nova_init_dentry(sb, de_entry, self_ino, parent_ino, epoch_id);
+
+	nova_update_alter_tail(pi, new_block + length);
+	nova_update_alter_pages(sb, pi, pi->log_head,
+						pi->alter_log_head);
+	nova_update_inode_checksum(pi);
+	nova_flush_buffer(pi, sizeof(struct nova_inode), 0);
+	nova_memlock_inode(sb, pi);
+
+	/* Get alternate inode address */
+	ret = nova_get_alter_inode_address(sb, self_ino, &alter_pi_addr);
+	if (ret)
+		return ret;
+
+	alter_pi = (struct nova_inode *)nova_get_block(sb, alter_pi_addr);
+	if (!alter_pi)
+		return -EINVAL;
+
+	nova_memunlock_inode(sb, alter_pi);
+	memcpy_to_pmem_nocache(alter_pi, pi, sizeof(struct nova_inode));
+	nova_memlock_inode(sb, alter_pi);
+
+	return 0;
+}
+
+/* Adds a directory entry pointing to the inode. Assumes the inode has
+ * already been logged for consistency
+ */
+int nova_add_dentry(struct dentry *dentry, u64 ino, int inc_link,
+	struct nova_inode_update *update, u64 epoch_id)
+{
+	struct inode *dir = dentry->d_parent->d_inode;
+	struct super_block *sb = dir->i_sb;
+	struct nova_inode_info *si = NOVA_I(dir);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *pidir;
+	const char *name = dentry->d_name.name;
+	int namelen = dentry->d_name.len;
+	struct nova_dentry *direntry;
+	unsigned short loglen;
+	int ret;
+	u64 curr_entry;
+	timing_t add_dentry_time;
+
+	nova_dbg_verbose("%s: dir %lu new inode %llu\n",
+				__func__, dir->i_ino, ino);
+	nova_dbg_verbose("%s: %s %d\n", __func__, name, namelen);
+	NOVA_START_TIMING(add_dentry_t, add_dentry_time);
+	if (namelen == 0)
+		return -EINVAL;
+
+	pidir = nova_get_inode(sb, dir);
+
+	/*
+	 * XXX shouldn't update any times until successful
+	 * completion of syscall, but too many callers depend
+	 * on this.
+	 */
+	dir->i_mtime = dir->i_ctime = current_time(dir);
+
+	loglen = NOVA_DIR_LOG_REC_LEN(namelen);
+	ret = nova_append_dentry(sb, pidir, dir, dentry,
+				ino, loglen, update,
+				inc_link, epoch_id);
+
+	if (ret) {
+		nova_dbg("%s: append dir entry failure\n", __func__);
+		return ret;
+	}
+
+	curr_entry = update->curr_entry;
+	direntry = (struct nova_dentry *)nova_get_block(sb, curr_entry);
+	sih->last_dentry = curr_entry;
+	ret = nova_insert_dir_radix_tree(sb, sih, name, namelen, direntry);
+
+	sih->trans_id++;
+	NOVA_END_TIMING(add_dentry_t, add_dentry_time);
+	return ret;
+}
+
+static int nova_can_inplace_update_dentry(struct super_block *sb,
+	struct nova_dentry *dentry, u64 epoch_id)
+{
+	struct nova_dentry *dentryc, entry_copy;
+
+	if (metadata_csum == 0)
+		dentryc = dentry;
+	else {
+		dentryc = &entry_copy;
+		if (!nova_verify_entry_csum(sb, dentry, dentryc))
+			return 0;
+	}
+
+	if (dentry && dentryc->epoch_id == epoch_id)
+		return 1;
+
+	return 0;
+}
+
+static int nova_inplace_update_dentry(struct super_block *sb,
+	struct inode *dir, struct nova_dentry *dentry, int link_change,
+	u64 epoch_id)
+{
+	struct nova_inode_info *si = NOVA_I(dir);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_log_entry_info entry_info;
+
+	entry_info.type = DIR_LOG;
+	entry_info.link_change = link_change;
+	entry_info.epoch_id = epoch_id;
+	entry_info.trans_id = sih->trans_id;
+	entry_info.inplace = 1;
+
+	return nova_inplace_update_log_entry(sb, dir, dentry,
+					&entry_info);
+}
+
+/* Removes a directory entry pointing to the inode. Assumes the inode has
+ * already been logged for consistency
+ */
+int nova_remove_dentry(struct dentry *dentry, int dec_link,
+	struct nova_inode_update *update, u64 epoch_id)
+{
+	struct inode *dir = dentry->d_parent->d_inode;
+	struct super_block *sb = dir->i_sb;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info *si = NOVA_I(dir);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *pidir;
+	struct qstr *entry = &dentry->d_name;
+	struct nova_dentry *old_dentry = NULL;
+	unsigned short loglen;
+	int ret;
+	u64 curr_entry;
+	timing_t remove_dentry_time;
+
+	NOVA_START_TIMING(remove_dentry_t, remove_dentry_time);
+
+	update->create_dentry = NULL;
+	update->delete_dentry = NULL;
+
+	if (!dentry->d_name.len) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = nova_remove_dir_radix_tree(sb, sih, entry->name, entry->len, 0,
+					&old_dentry);
+
+	if (ret)
+		goto out;
+
+	pidir = nova_get_inode(sb, dir);
+
+	dir->i_mtime = dir->i_ctime = current_time(dir);
+
+	if (nova_can_inplace_update_dentry(sb, old_dentry, epoch_id)) {
+		nova_inplace_update_dentry(sb, dir, old_dentry,
+						dec_link, epoch_id);
+		curr_entry = nova_get_addr_off(sbi, old_dentry);
+
+		sih->last_dentry = curr_entry;
+		/* Leave create/delete_dentry as NULL.
+		 * Do not change tail/alter_tail if used as input
+		 */
+		if (update->tail == 0) {
+			update->tail = sih->log_tail;
+			update->alter_tail = sih->alter_log_tail;
+		}
+		sih->trans_id++;
+		goto out;
+	}
+
+	loglen = NOVA_DIR_LOG_REC_LEN(entry->len);
+	ret = nova_append_dentry(sb, pidir, dir, dentry,
+				0, loglen, update,
+				dec_link, epoch_id);
+
+	if (ret) {
+		nova_dbg("%s: append dir entry failure\n", __func__);
+		goto out;
+	}
+
+	update->create_dentry = old_dentry;
+	curr_entry = update->curr_entry;
+	update->delete_dentry = (struct nova_dentry *)nova_get_block(sb,
+						curr_entry);
+	sih->last_dentry = curr_entry;
+	sih->trans_id++;
+out:
+	NOVA_END_TIMING(remove_dentry_t, remove_dentry_time);
+	return ret;
+}
+
+/* Create dentry and delete dentry must be invalidated together */
+int nova_invalidate_dentries(struct super_block *sb,
+	struct nova_inode_update *update)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_dentry *create_dentry;
+	struct nova_dentry *create_dentryc, entry_copy;
+	struct nova_dentry *delete_dentry;
+	u64 create_curr, delete_curr;
+	int ret;
+
+	create_dentry = update->create_dentry;
+	delete_dentry = update->delete_dentry;
+
+	if (!create_dentry)
+		return 0;
+
+	nova_reassign_logentry(sb, create_dentry, DIR_LOG);
+
+	if (metadata_csum == 0)
+		create_dentryc = create_dentry;
+	else {
+		create_dentryc = &entry_copy;
+		if (!nova_verify_entry_csum(sb, create_dentry, create_dentryc))
+			return 0;
+	}
+
+	if (!old_entry_freeable(sb, create_dentryc->epoch_id))
+		return 0;
+
+	create_curr = nova_get_addr_off(sbi, create_dentry);
+	delete_curr = nova_get_addr_off(sbi, delete_dentry);
+
+	nova_invalidate_logentry(sb, create_dentry, DIR_LOG, 0);
+
+	ret = nova_invalidate_logentry(sb, delete_dentry, DIR_LOG, 0);
+
+	return ret;
+}
+
+static int nova_readdir_slow(struct file *file, struct dir_context *ctx)
+{
+	struct inode *inode = file_inode(file);
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pidir;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *child_pi;
+	struct nova_dentry *entry;
+	struct nova_dentry *entryc, entry_copy;
+	struct nova_dentry *entries[FREE_BATCH];
+	int nr_entries;
+	u64 pi_addr;
+	unsigned long pos = 0;
+	ino_t ino;
+	int i;
+	int ret;
+	timing_t readdir_time;
+
+	NOVA_START_TIMING(readdir_t, readdir_time);
+	pidir = nova_get_inode(sb, inode);
+	nova_dbgv("%s: ino %llu, size %llu, pos %llu\n",
+			__func__, (u64)inode->i_ino,
+			pidir->i_size, ctx->pos);
+
+	if (!sih) {
+		nova_dbg("%s: inode %lu sih does not exist!\n",
+				__func__, inode->i_ino);
+		ctx->pos = READDIR_END;
+		return 0;
+	}
+
+	pos = ctx->pos;
+	if (pos == READDIR_END)
+		goto out;
+
+	entryc = &entry_copy;	/* repointed per entry if csum is off */
+
+	do {
+		nr_entries = radix_tree_gang_lookup(&sih->tree,
+					(void **)entries, pos, FREE_BATCH);
+		for (i = 0; i < nr_entries; i++) {
+			entry = entries[i];
+
+			if (metadata_csum == 0)
+				entryc = entry;
+			else if (!nova_verify_entry_csum(sb, entry, entryc))
+				return -EIO;
+
+			pos = BKDRHash(entryc->name, entryc->name_len);
+			ino = __le64_to_cpu(entryc->ino);
+			if (ino == 0)
+				continue;
+
+			ret = nova_get_inode_address(sb, ino, 0, &pi_addr,
+						     0, 0);
+
+			if (ret) {
+				nova_dbg("%s: get child inode %lu address failed %d\n",
+					 __func__, ino, ret);
+				ctx->pos = READDIR_END;
+				return ret;
+			}
+
+			child_pi = nova_get_block(sb, pi_addr);
+			nova_dbgv("ctx: ino %llu, name %s, name_len %u, de_len %u, csum 0x%x\n",
+				(u64)ino, entry->name, entry->name_len,
+				entry->de_len, entry->csum);
+			if (!dir_emit(ctx, entryc->name, entryc->name_len,
+				ino, IF2DT(le16_to_cpu(child_pi->i_mode)))) {
+				nova_dbgv("Here: pos %llu\n", ctx->pos);
+				return 0;
+			}
+			ctx->pos = pos + 1;
+		}
+		pos++;
+	} while (nr_entries == FREE_BATCH);
+
+out:
+	NOVA_END_TIMING(readdir_t, readdir_time);
+	return 0;
+}
+
+static u64 nova_find_next_dentry_addr(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 pos)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_dentry *entry = NULL;
+	struct nova_dentry *entries[1];
+	int nr_entries;
+	u64 addr = 0;
+
+	nr_entries = radix_tree_gang_lookup(&sih->tree,
+					(void **)entries, pos, 1);
+	if (nr_entries == 1) {
+		entry = entries[0];
+		addr = nova_get_addr_off(sbi, entry);
+	}
+
+	return addr;
+}
+
+static int nova_readdir_fast(struct file *file, struct dir_context *ctx)
+{
+	struct inode *inode = file_inode(file);
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pidir;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *child_pi;
+	struct nova_inode *prev_child_pi = NULL;
+	struct nova_dentry *entry = NULL;
+	struct nova_dentry *entryc, entry_copy;
+	struct nova_dentry *prev_entry = NULL;
+	struct nova_dentry *prev_entryc, prev_entry_copy;
+	unsigned short de_len;
+	u64 pi_addr;
+	unsigned long pos = 0;
+	ino_t ino;
+	void *addr;
+	u64 curr_p;
+	u8 type;
+	int ret;
+	timing_t readdir_time;
+
+	NOVA_START_TIMING(readdir_t, readdir_time);
+	pidir = nova_get_inode(sb, inode);
+	nova_dbgv("%s: ino %llu, size %llu, pos 0x%llx\n",
+			__func__, (u64)inode->i_ino,
+			pidir->i_size, ctx->pos);
+
+	if (sih->log_head == 0) {
+		nova_err(sb, "Dir %lu log is NULL!\n", inode->i_ino);
+		BUG();
+		return -EINVAL;
+	}
+
+	pos = ctx->pos;
+
+	if (pos == 0)
+		curr_p = sih->log_head;
+	else if (pos == READDIR_END)
+		goto out;
+	else {
+		curr_p = nova_find_next_dentry_addr(sb, sih, pos);
+		if (curr_p == 0)
+			goto out;
+	}
+
+	entryc = (metadata_csum == 0) ? entry : &entry_copy;
+	prev_entryc = (metadata_csum == 0) ? prev_entry : &prev_entry_copy;
+
+	while (curr_p != sih->log_tail) {
+		if (goto_next_page(sb, curr_p))
+			curr_p = next_log_page(sb, curr_p);
+
+
+		if (curr_p == 0) {
+			nova_err(sb, "Dir %lu log is NULL!\n", inode->i_ino);
+			BUG();
+			return -EINVAL;
+		}
+
+		addr = (void *)nova_get_block(sb, curr_p);
+		type = nova_get_entry_type(addr);
+		switch (type) {
+		case SET_ATTR:
+			curr_p += sizeof(struct nova_setattr_logentry);
+			continue;
+		case LINK_CHANGE:
+			curr_p += sizeof(struct nova_link_change_entry);
+			continue;
+		case DIR_LOG:
+			break;
+		default:
+			nova_dbg("%s: unknown type %d, 0x%llx\n",
+				 __func__, type, curr_p);
+			BUG();
+			return -EINVAL;
+		}
+
+		entry = (struct nova_dentry *)nova_get_block(sb, curr_p);
+		nova_dbgv("curr_p: 0x%llx, type %d, ino %llu, name %s, namelen %u, rec len %u\n",
+			  curr_p, entry->entry_type, le64_to_cpu(entry->ino),
+			  entry->name, entry->name_len,
+			  le16_to_cpu(entry->de_len));
+
+		if (metadata_csum == 0)
+			entryc = entry;
+		else if (!nova_verify_entry_csum(sb, entry, entryc))
+			return -EIO;
+
+		de_len = le16_to_cpu(entryc->de_len);
+		if (entryc->ino > 0 && entryc->invalid == 0
+					&& entryc->reassigned == 0) {
+			ino = __le64_to_cpu(entryc->ino);
+			pos = BKDRHash(entryc->name, entryc->name_len);
+
+			ret = nova_get_inode_address(sb, ino, 0,
+						     &pi_addr, 0, 0);
+			if (ret) {
+				nova_dbg("%s: get child inode %lu address failed %d\n",
+					 __func__, ino, ret);
+				ctx->pos = READDIR_END;
+				return ret;
+			}
+
+			child_pi = nova_get_block(sb, pi_addr);
+			nova_dbgv("ctx: ino %llu, name %s, name_len %u, de_len %u\n",
+				(u64)ino, entry->name, entry->name_len,
+				entry->de_len);
+			if (prev_entry && !dir_emit(ctx, prev_entryc->name,
+				prev_entryc->name_len, le64_to_cpu(prev_entryc->ino),
+				IF2DT(le16_to_cpu(prev_child_pi->i_mode)))) {
+				nova_dbgv("Here: pos %llu\n", ctx->pos);
+				return 0;
+			}
+			prev_entry = entry;
+
+			if (metadata_csum == 0)
+				prev_entryc = prev_entry;
+			else
+				memcpy(prev_entryc, entryc,
+						sizeof(struct nova_dentry));
+
+			prev_child_pi = child_pi;
+		}
+		ctx->pos = pos;
+		curr_p += de_len;
+	}
+
+	if (prev_entry && !dir_emit(ctx, prev_entryc->name,
+			prev_entryc->name_len, ino,
+			IF2DT(le16_to_cpu(prev_child_pi->i_mode))))
+		return 0;
+
+	ctx->pos = READDIR_END;
+out:
+	NOVA_END_TIMING(readdir_t, readdir_time);
+	nova_dbgv("%s return\n", __func__);
+	return 0;
+}
+
+static int nova_readdir(struct file *file, struct dir_context *ctx)
+{
+	struct inode *inode = file_inode(file);
+	struct super_block *sb = inode->i_sb;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (sbi->mount_snapshot == 0)
+		return nova_readdir_fast(file, ctx);
+	else
+		return nova_readdir_slow(file, ctx);
+}
+
+const struct file_operations nova_dir_operations = {
+	.llseek		= generic_file_llseek,
+	.read		= generic_read_dir,
+	.iterate	= nova_readdir,
+	.fsync		= noop_fsync,
+	.unlocked_ioctl = nova_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= nova_compat_ioctl,
+#endif
+};
diff --git a/fs/nova/file.c b/fs/nova/file.c
new file mode 100644
index 000000000000..51b2114796df
--- /dev/null
+++ b/fs/nova/file.c
@@ -0,0 +1,943 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * File operations for files.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/slab.h>
+#include <linux/uio.h>
+#include <linux/uaccess.h>
+#include <linux/falloc.h>
+#include <asm/mman.h>
+#include "nova.h"
+#include "inode.h"
+
+
+static inline int nova_can_set_blocksize_hint(struct inode *inode,
+	struct nova_inode *pi, loff_t new_size)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+
+	/* Currently, we don't deallocate data blocks till the file is deleted.
+	 * So no changing blocksize hints once allocation is done.
+	 */
+	if (sih->i_size > 0)
+		return 0;
+	return 1;
+}
+
+int nova_set_blocksize_hint(struct super_block *sb, struct inode *inode,
+	struct nova_inode *pi, loff_t new_size)
+{
+	unsigned short block_type;
+
+	if (!nova_can_set_blocksize_hint(inode, pi, new_size))
+		return 0;
+
+	if (new_size >= 0x40000000) {   /* 1G */
+		block_type = NOVA_BLOCK_TYPE_1G;
+		goto hint_set;
+	}
+
+	if (new_size >= 0x200000) {     /* 2M */
+		block_type = NOVA_BLOCK_TYPE_2M;
+		goto hint_set;
+	}
+
+	/* defaulting to 4K */
+	block_type = NOVA_BLOCK_TYPE_4K;
+
+hint_set:
+	nova_dbg_verbose(
+		"Hint: new_size 0x%llx, i_size 0x%llx\n",
+		new_size, pi->i_size);
+	nova_dbg_verbose("Setting the hint to 0x%x\n", block_type);
+	nova_memunlock_inode(sb, pi);
+	pi->i_blk_type = block_type;
+	nova_memlock_inode(sb, pi);
+	return 0;
+}
+
+static loff_t nova_llseek(struct file *file, loff_t offset, int origin)
+{
+	struct inode *inode = file->f_path.dentry->d_inode;
+	int retval;
+
+	if (origin != SEEK_DATA && origin != SEEK_HOLE)
+		return generic_file_llseek(file, offset, origin);
+
+	inode_lock(inode);
+	switch (origin) {
+	case SEEK_DATA:
+		retval = nova_find_region(inode, &offset, 0);
+		if (retval) {
+			inode_unlock(inode);
+			return retval;
+		}
+		break;
+	case SEEK_HOLE:
+		retval = nova_find_region(inode, &offset, 1);
+		if (retval) {
+			inode_unlock(inode);
+			return retval;
+		}
+		break;
+	}
+
+	if ((offset < 0 && !(file->f_mode & FMODE_UNSIGNED_OFFSET)) ||
+	    offset > inode->i_sb->s_maxbytes) {
+		inode_unlock(inode);
+		return -ENXIO;
+	}
+
+	if (offset != file->f_pos) {
+		file->f_pos = offset;
+		file->f_version = 0;
+	}
+
+	inode_unlock(inode);
+	return offset;
+}
+
+/* This function is called by both msync() and fsync().
+ * TODO: Check if we can avoid calling nova_flush_buffer() for fsync. We use
+ * movnti to write data to files, so we may want to avoid doing unnecessary
+ * nova_flush_buffer() on fsync()
+ */
+static int nova_fsync(struct file *file, loff_t start, loff_t end, int datasync)
+{
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = file->f_path.dentry->d_inode;
+	struct super_block *sb = inode->i_sb;
+	unsigned long start_pgoff, end_pgoff;
+	int ret = 0;
+	timing_t fsync_time;
+
+	NOVA_START_TIMING(fsync_t, fsync_time);
+
+	if (datasync)
+		NOVA_STATS_ADD(fdatasync, 1);
+
+	/* No need to flush if the file is not mmapped */
+	if (!mapping_mapped(mapping))
+		goto persist;
+
+	start_pgoff = start >> PAGE_SHIFT;
+	end_pgoff = (end + 1) >> PAGE_SHIFT;
+	nova_dbgv("%s: msync pgoff range %lu to %lu\n",
+			__func__, start_pgoff, end_pgoff);
+
+	/*
+	 * Set csum and parity.
+	 * We do not protect data integrity during mmap, but we have to
+	 * update csum here since msync clears dirty bit.
+	 */
+	nova_reset_mapping_csum_parity(sb, inode, mapping,
+					start_pgoff, end_pgoff);
+
+	ret = generic_file_fsync(file, start, end, datasync);
+
+persist:
+	PERSISTENT_BARRIER();
+	NOVA_END_TIMING(fsync_t, fsync_time);
+
+	return ret;
+}
+
+/* This callback is called when a file is closed */
+static int nova_flush(struct file *file, fl_owner_t id)
+{
+	PERSISTENT_BARRIER();
+	return 0;
+}
+
+static int nova_open(struct inode *inode, struct file *filp)
+{
+	return generic_file_open(inode, filp);
+}
+
+static long nova_fallocate(struct file *file, int mode, loff_t offset,
+	loff_t len)
+{
+	struct inode *inode = file->f_path.dentry->d_inode;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *pi;
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	struct nova_file_write_entry entry_data;
+	struct nova_inode_update update;
+	unsigned long start_blk, num_blocks, ent_blks = 0;
+	unsigned long total_blocks = 0;
+	unsigned long blocknr = 0;
+	unsigned long blockoff;
+	unsigned int data_bits;
+	loff_t new_size;
+	long ret = 0;
+	int inplace = 0;
+	int blocksize_mask;
+	int allocated = 0;
+	bool update_log = false;
+	timing_t fallocate_time;
+	u64 begin_tail = 0;
+	u64 epoch_id;
+	u32 time;
+
+	/*
+	 * Fallocate does not make much sense for CoW,
+	 * but we still support it for DAX-mmap purpose.
+	 */
+
+	/* We only support the FALLOC_FL_KEEP_SIZE mode */
+	if (mode & ~FALLOC_FL_KEEP_SIZE)
+		return -EOPNOTSUPP;
+
+	if (S_ISDIR(inode->i_mode))
+		return -ENODEV;
+
+	new_size = len + offset;
+	if (!(mode & FALLOC_FL_KEEP_SIZE) && new_size > inode->i_size) {
+		ret = inode_newsize_ok(inode, new_size);
+		if (ret)
+			return ret;
+	} else {
+		new_size = inode->i_size;
+	}
+
+	nova_dbgv("%s: inode %lu, offset %lld, count %lld, mode 0x%x\n",
+			__func__, inode->i_ino,	offset, len, mode);
+
+	NOVA_START_TIMING(fallocate_t, fallocate_time);
+	inode_lock(inode);
+
+	pi = nova_get_inode(sb, inode);
+	if (!pi) {
+		ret = -EACCES;
+		goto out;
+	}
+
+	inode->i_mtime = inode->i_ctime = current_time(inode);
+	time = current_time(inode).tv_sec;
+
+	blocksize_mask = sb->s_blocksize - 1;
+	start_blk = offset >> sb->s_blocksize_bits;
+	blockoff = offset & blocksize_mask;
+	num_blocks = (blockoff + len + blocksize_mask) >> sb->s_blocksize_bits;
+
+	epoch_id = nova_get_epoch_id(sb);
+	update.tail = sih->log_tail;
+	update.alter_tail = sih->alter_log_tail;
+	while (num_blocks > 0) {
+		ent_blks = nova_check_existing_entry(sb, inode, num_blocks,
+						start_blk, &entry, &entry_copy,
+						1, epoch_id, &inplace, 1);
+
+		entryc = (metadata_csum == 0) ? entry : &entry_copy;
+
+		if (entry && inplace) {
+			if (entryc->size < new_size) {
+				/* Update existing entry */
+				nova_memunlock_range(sb, entry, CACHELINE_SIZE);
+				entry->size = new_size;
+				nova_update_entry_csum(entry);
+				nova_update_alter_entry(sb, entry);
+				nova_memlock_range(sb, entry, CACHELINE_SIZE);
+			}
+			allocated = ent_blks;
+			goto next;
+		}
+
+		/* Allocate zeroed blocks to fill hole */
+		allocated = nova_new_data_blocks(sb, sih, &blocknr, start_blk,
+				 ent_blks, ALLOC_INIT_ZERO, ANY_CPU,
+				 ALLOC_FROM_HEAD);
+		nova_dbgv("%s: alloc %d blocks @ %lu\n", __func__,
+						allocated, blocknr);
+
+		if (allocated <= 0) {
+			nova_dbg("%s alloc %lu blocks failed!, %d\n",
+						__func__, ent_blks, allocated);
+			ret = allocated;
+			goto out;
+		}
+
+		/* Handle hole fill write */
+		nova_init_file_write_entry(sb, sih, &entry_data, epoch_id,
+					start_blk, allocated, blocknr,
+					time, new_size);
+
+		ret = nova_append_file_write_entry(sb, pi, inode,
+					&entry_data, &update);
+		if (ret) {
+			nova_dbg("%s: append inode entry failed\n", __func__);
+			ret = -ENOSPC;
+			goto out;
+		}
+
+		entry = nova_get_block(sb, update.curr_entry);
+		nova_reset_csum_parity_range(sb, sih, entry, start_blk,
+					start_blk + allocated, 1, 0);
+
+		update_log = true;
+		if (begin_tail == 0)
+			begin_tail = update.curr_entry;
+
+		total_blocks += allocated;
+next:
+		num_blocks -= allocated;
+		start_blk += allocated;
+	}
+
+	data_bits = blk_type_to_shift[sih->i_blk_type];
+	sih->i_blocks += (total_blocks << (data_bits - sb->s_blocksize_bits));
+
+	inode->i_blocks = sih->i_blocks;
+
+	if (update_log) {
+		sih->log_tail = update.tail;
+		sih->alter_log_tail = update.alter_tail;
+
+		nova_memunlock_inode(sb, pi);
+		nova_update_tail(pi, update.tail);
+		if (metadata_csum)
+			nova_update_alter_tail(pi, update.alter_tail);
+		nova_memlock_inode(sb, pi);
+
+		/* Update file tree */
+		ret = nova_reassign_file_tree(sb, sih, begin_tail);
+		if (ret)
+			goto out;
+
+	}
+
+	nova_dbgv("blocks: %lu, %lu\n", inode->i_blocks, sih->i_blocks);
+
+	if (ret || (mode & FALLOC_FL_KEEP_SIZE)) {
+		nova_memunlock_inode(sb, pi);
+		pi->i_flags |= cpu_to_le32(NOVA_EOFBLOCKS_FL);
+		nova_memlock_inode(sb, pi);
+	}
+
+	if (!(mode & FALLOC_FL_KEEP_SIZE) && new_size > inode->i_size) {
+		inode->i_size = new_size;
+		sih->i_size = new_size;
+	}
+
+	nova_memunlock_inode(sb, pi);
+	nova_update_inode_checksum(pi);
+	nova_update_alter_inode(sb, inode, pi);
+	nova_memlock_inode(sb, pi);
+
+	sih->trans_id++;
+out:
+	if (ret < 0)
+		nova_cleanup_incomplete_write(sb, sih, blocknr, allocated,
+						begin_tail, update.tail);
+
+	inode_unlock(inode);
+	NOVA_END_TIMING(fallocate_t, fallocate_time);
+	return ret;
+}
+
+static int nova_iomap_begin_nolock(struct inode *inode, loff_t offset,
+	loff_t length, unsigned int flags, struct iomap *iomap)
+{
+	return nova_iomap_begin(inode, offset, length, flags, iomap, false);
+}
+
+static struct iomap_ops nova_iomap_ops_nolock = {
+	.iomap_begin	= nova_iomap_begin_nolock,
+	.iomap_end	= nova_iomap_end,
+};
+
+static ssize_t nova_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct inode *inode = iocb->ki_filp->f_mapping->host;
+	ssize_t ret;
+	timing_t read_iter_time;
+
+	if (!iov_iter_count(to))
+		return 0;
+
+	NOVA_START_TIMING(read_iter_t, read_iter_time);
+	inode_lock_shared(inode);
+	ret = dax_iomap_rw(iocb, to, &nova_iomap_ops_nolock);
+	inode_unlock_shared(inode);
+
+	file_accessed(iocb->ki_filp);
+	NOVA_END_TIMING(read_iter_t, read_iter_time);
+	return ret;
+}
+
+static int nova_update_iter_csum_parity(struct super_block *sb,
+	struct inode *inode, loff_t offset, size_t count)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	unsigned long start_pgoff, end_pgoff;
+	loff_t end;
+
+	if (data_csum == 0 && data_parity == 0)
+		return 0;
+
+	end = offset + count;
+
+	start_pgoff = offset >> sb->s_blocksize_bits;
+	end_pgoff = end >> sb->s_blocksize_bits;
+	if (end & (nova_inode_blk_size(sih) - 1))
+		end_pgoff++;
+
+	nova_reset_csum_parity_range(sb, sih, NULL, start_pgoff,
+			end_pgoff, 0, 0);
+
+	return 0;
+}
+
+static ssize_t nova_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file->f_mapping->host;
+	struct super_block *sb = inode->i_sb;
+	loff_t offset;
+	size_t count;
+	ssize_t ret;
+	timing_t write_iter_time;
+
+	NOVA_START_TIMING(write_iter_t, write_iter_time);
+	inode_lock(inode);
+	ret = generic_write_checks(iocb, from);
+	if (ret <= 0)
+		goto out_unlock;
+
+	ret = file_remove_privs(file);
+	if (ret)
+		goto out_unlock;
+
+	ret = file_update_time(file);
+	if (ret)
+		goto out_unlock;
+
+	count = iov_iter_count(from);
+	offset = iocb->ki_pos;
+
+	ret = dax_iomap_rw(iocb, from, &nova_iomap_ops_nolock);
+	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
+		i_size_write(inode, iocb->ki_pos);
+		mark_inode_dirty(inode);
+	}
+
+	nova_update_iter_csum_parity(sb, inode, offset, count);
+
+out_unlock:
+	inode_unlock(inode);
+	if (ret > 0)
+		ret = generic_write_sync(iocb, ret);
+	NOVA_END_TIMING(write_iter_t, write_iter_time);
+	return ret;
+}
+
+static ssize_t
+do_dax_mapping_read(struct file *filp, char __user *buf,
+	size_t len, loff_t *ppos)
+{
+	struct inode *inode = filp->f_mapping->host;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	pgoff_t index, end_index;
+	unsigned long offset;
+	loff_t isize, pos;
+	size_t copied = 0, error = 0;
+	timing_t memcpy_time;
+
+	pos = *ppos;
+	index = pos >> PAGE_SHIFT;
+	offset = pos & ~PAGE_MASK;
+
+	if (!access_ok(VERIFY_WRITE, buf, len)) {
+		error = -EFAULT;
+		goto out;
+	}
+
+	isize = i_size_read(inode);
+	if (!isize)
+		goto out;
+
+	nova_dbgv("%s: inode %lu, offset %lld, count %lu, size %lld\n",
+		__func__, inode->i_ino,	pos, len, isize);
+
+	if (len > isize - pos)
+		len = isize - pos;
+
+	if (len <= 0)
+		goto out;
+
+	entryc = &entry_copy;	/* repointed per entry if csum is off */
+
+	end_index = (isize - 1) >> PAGE_SHIFT;
+	do {
+		unsigned long nr, left;
+		unsigned long nvmm;
+		void *dax_mem = NULL;
+		int zero = 0;
+
+		/* nr is the maximum number of bytes to copy from this page */
+		if (index >= end_index) {
+			if (index > end_index)
+				goto out;
+			nr = ((isize - 1) & ~PAGE_MASK) + 1;
+			if (nr <= offset)
+				goto out;
+		}
+
+		entry = nova_get_write_entry(sb, sih, index);
+		if (unlikely(entry == NULL)) {
+			nova_dbgv("Required extent not found: pgoff %lu, inode size %lld\n",
+				index, isize);
+			nr = PAGE_SIZE;
+			zero = 1;
+			goto memcpy;
+		}
+
+		if (metadata_csum == 0)
+			entryc = entry;
+		else if (!nova_verify_entry_csum(sb, entry, entryc))
+			return -EIO;
+
+		/* Find contiguous blocks */
+		if (index < entryc->pgoff ||
+			index - entryc->pgoff >= entryc->num_pages) {
+			nova_err(sb, "%s ERROR: %lu, entry pgoff %llu, num %u, blocknr %llu\n",
+				__func__, index, entry->pgoff,
+				entry->num_pages, entry->block >> PAGE_SHIFT);
+			return -EINVAL;
+		}
+		if (entryc->reassigned == 0) {
+			nr = (entryc->num_pages - (index - entryc->pgoff))
+				* PAGE_SIZE;
+		} else {
+			nr = PAGE_SIZE;
+		}
+
+		nvmm = get_nvmm(sb, sih, entryc, index);
+		dax_mem = nova_get_block(sb, (nvmm << PAGE_SHIFT));
+
+memcpy:
+		nr = nr - offset;
+		if (nr > len - copied)
+			nr = len - copied;
+
+		if ((!zero) && (data_csum > 0)) {
+			if (nova_find_pgoff_in_vma(inode, index))
+				goto skip_verify;
+
+			if (!nova_verify_data_csum(sb, sih, nvmm, offset, nr)) {
+				nova_err(sb, "%s: nova data checksum and recovery fail! inode %lu, offset %lu, entry pgoff %lu, %u pages, pgoff %lu\n",
+					 __func__, inode->i_ino, offset,
+					 entry->pgoff, entry->num_pages, index);
+				error = -EIO;
+				goto out;
+			}
+		}
+skip_verify:
+		NOVA_START_TIMING(memcpy_r_nvmm_t, memcpy_time);
+
+		if (!zero)
+			left = __copy_to_user(buf + copied,
+						dax_mem + offset, nr);
+		else
+			left = __clear_user(buf + copied, nr);
+
+		NOVA_END_TIMING(memcpy_r_nvmm_t, memcpy_time);
+
+		if (left) {
+			nova_dbg("%s ERROR!: bytes %lu, left %lu\n",
+				__func__, nr, left);
+			error = -EFAULT;
+			goto out;
+		}
+
+		copied += (nr - left);
+		offset += (nr - left);
+		index += offset >> PAGE_SHIFT;
+		offset &= ~PAGE_MASK;
+	} while (copied < len);
+
+out:
+	*ppos = pos + copied;
+	if (filp)
+		file_accessed(filp);
+
+	NOVA_STATS_ADD(read_bytes, copied);
+
+	nova_dbgv("%s returned %zu\n", __func__, copied);
+	return copied ? copied : error;
+}
+
+/*
+ * Wrappers. We take the inode lock in shared mode to avoid racing with a
+ * concurrent truncate operation. Writes are not a problem because the
+ * write path holds the inode lock exclusively.
+ */
+static ssize_t nova_dax_file_read(struct file *filp, char __user *buf,
+			    size_t len, loff_t *ppos)
+{
+	struct inode *inode = filp->f_mapping->host;
+	ssize_t res;
+	timing_t dax_read_time;
+
+	NOVA_START_TIMING(dax_read_t, dax_read_time);
+	inode_lock_shared(inode);
+	res = do_dax_mapping_read(filp, buf, len, ppos);
+	inode_unlock_shared(inode);
+	NOVA_END_TIMING(dax_read_t, dax_read_time);
+	return res;
+}
+
+static ssize_t nova_cow_file_write(struct file *filp,
+	const char __user *buf,	size_t len, loff_t *ppos)
+{
+	struct address_space *mapping = filp->f_mapping;
+	struct inode	*inode = mapping->host;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pi, inode_copy;
+	struct nova_file_write_entry entry_data;
+	struct nova_inode_update update;
+	ssize_t	    written = 0;
+	loff_t pos;
+	size_t count, offset, copied;
+	unsigned long start_blk, num_blocks;
+	unsigned long total_blocks;
+	unsigned long blocknr = 0;
+	unsigned int data_bits;
+	int allocated = 0;
+	void *kmem;
+	u64 file_size;
+	size_t bytes;
+	long status = 0;
+	timing_t cow_write_time, memcpy_time;
+	unsigned long step = 0;
+	ssize_t ret;
+	u64 begin_tail = 0;
+	int try_inplace = 0;
+	u64 epoch_id;
+	u32 time;
+
+
+	if (len == 0)
+		return 0;
+
+	NOVA_START_TIMING(cow_write_t, cow_write_time);
+
+	sb_start_write(inode->i_sb);
+	inode_lock(inode);
+
+	if (!access_ok(VERIFY_READ, buf, len)) {
+		ret = -EFAULT;
+		goto out;
+	}
+	pos = *ppos;
+
+	if (filp->f_flags & O_APPEND)
+		pos = i_size_read(inode);
+
+	count = len;
+
+	pi = nova_get_block(sb, sih->pi_addr);
+
+	/* nova_inode tail pointer will be updated and we make sure all other
+	 * inode fields are good before checksumming the whole structure
+	 */
+	if (nova_check_inode_integrity(sb, sih->ino, sih->pi_addr,
+			sih->alter_pi_addr, &inode_copy, 0) < 0) {
+		ret = -EIO;
+		goto out;
+	}
+
+	offset = pos & (sb->s_blocksize - 1);
+	num_blocks = ((count + offset - 1) >> sb->s_blocksize_bits) + 1;
+	total_blocks = num_blocks;
+	start_blk = pos >> sb->s_blocksize_bits;
+
+	if (nova_check_overlap_vmas(sb, sih, start_blk, num_blocks)) {
+		nova_dbgv("COW write overlaps with vma: inode %lu, pgoff %lu, %lu blocks\n",
+				inode->i_ino, start_blk, num_blocks);
+		NOVA_STATS_ADD(cow_overlap_mmap, 1);
+		try_inplace = 1;
+		ret = -EACCES;
+		goto out;
+	}
+
+	/* offset in the actual block size block */
+
+	ret = file_remove_privs(filp);
+	if (ret)
+		goto out;
+
+	inode->i_ctime = inode->i_mtime = current_time(inode);
+	time = current_time(inode).tv_sec;
+
+	nova_dbgv("%s: inode %lu, offset %lld, count %lu\n",
+			__func__, inode->i_ino,	pos, count);
+
+	epoch_id = nova_get_epoch_id(sb);
+	update.tail = sih->log_tail;
+	update.alter_tail = sih->alter_log_tail;
+	while (num_blocks > 0) {
+		offset = pos & (nova_inode_blk_size(sih) - 1);
+		start_blk = pos >> sb->s_blocksize_bits;
+
+		/* don't zero-out the allocated blocks */
+		allocated = nova_new_data_blocks(sb, sih, &blocknr, start_blk,
+				 num_blocks, ALLOC_NO_INIT, ANY_CPU,
+				 ALLOC_FROM_HEAD);
+
+		nova_dbg_verbose("%s: alloc %d blocks @ %lu\n", __func__,
+						allocated, blocknr);
+
+		if (allocated <= 0) {
+			nova_dbg("%s alloc blocks failed %d\n", __func__,
+								allocated);
+			ret = allocated;
+			goto out;
+		}
+
+		step++;
+		bytes = sb->s_blocksize * allocated - offset;
+		if (bytes > count)
+			bytes = count;
+
+		kmem = nova_get_block(inode->i_sb,
+			     nova_get_block_off(sb, blocknr, sih->i_blk_type));
+
+		if (offset || ((offset + bytes) & (PAGE_SIZE - 1)) != 0)  {
+			ret = nova_handle_head_tail_blocks(sb, inode, pos,
+							   bytes, kmem);
+			if (ret)
+				goto out;
+		}
+		/* Now copy from user buf */
+		//		nova_dbg("Write: %p\n", kmem);
+		NOVA_START_TIMING(memcpy_w_nvmm_t, memcpy_time);
+		nova_memunlock_range(sb, kmem + offset, bytes);
+		copied = bytes - memcpy_to_pmem_nocache(kmem + offset,
+						buf, bytes);
+		nova_memlock_range(sb, kmem + offset, bytes);
+		NOVA_END_TIMING(memcpy_w_nvmm_t, memcpy_time);
+
+		if (data_csum > 0 || data_parity > 0) {
+			ret = nova_protect_file_data(sb, inode, pos, bytes,
+							buf, blocknr, false);
+			if (ret)
+				goto out;
+		}
+
+		if (pos + copied > inode->i_size)
+			file_size = cpu_to_le64(pos + copied);
+		else
+			file_size = cpu_to_le64(inode->i_size);
+
+		nova_init_file_write_entry(sb, sih, &entry_data, epoch_id,
+					start_blk, allocated, blocknr, time,
+					file_size);
+
+		ret = nova_append_file_write_entry(sb, pi, inode,
+					&entry_data, &update);
+		if (ret) {
+			nova_dbg("%s: append inode entry failed\n", __func__);
+			ret = -ENOSPC;
+			goto out;
+		}
+
+		nova_dbgv("Write: %p, %lu\n", kmem, copied);
+		if (copied > 0) {
+			status = copied;
+			written += copied;
+			pos += copied;
+			buf += copied;
+			count -= copied;
+			num_blocks -= allocated;
+		}
+		if (unlikely(copied != bytes)) {
+			nova_dbg("%s ERROR!: %p, bytes %lu, copied %lu\n",
+				__func__, kmem, bytes, copied);
+			if (status >= 0)
+				status = -EFAULT;
+		}
+		if (status < 0)
+			break;
+
+		if (begin_tail == 0)
+			begin_tail = update.curr_entry;
+	}
+
+	data_bits = blk_type_to_shift[sih->i_blk_type];
+	sih->i_blocks += (total_blocks << (data_bits - sb->s_blocksize_bits));
+
+	nova_memunlock_inode(sb, pi);
+	nova_update_inode(sb, inode, pi, &update, 1);
+	nova_memlock_inode(sb, pi);
+
+	/* Free the overlap blocks after the write is committed */
+	ret = nova_reassign_file_tree(sb, sih, begin_tail);
+	if (ret)
+		goto out;
+
+	inode->i_blocks = sih->i_blocks;
+
+	ret = written;
+	NOVA_STATS_ADD(cow_write_breaks, step);
+	nova_dbgv("blocks: %lu, %lu\n", inode->i_blocks, sih->i_blocks);
+
+	*ppos = pos;
+	if (pos > inode->i_size) {
+		i_size_write(inode, pos);
+		sih->i_size = pos;
+	}
+
+	sih->trans_id++;
+out:
+	if (ret < 0)
+		nova_cleanup_incomplete_write(sb, sih, blocknr, allocated,
+						begin_tail, update.tail);
+
+	inode_unlock(inode);
+	sb_end_write(inode->i_sb);
+	NOVA_END_TIMING(cow_write_t, cow_write_time);
+	NOVA_STATS_ADD(cow_write_bytes, written);
+
+	if (try_inplace)
+		return nova_inplace_file_write(filp, buf, len, ppos);
+
+	return ret;
+}
+
+static ssize_t nova_dax_file_write(struct file *filp, const char __user *buf,
+	size_t len, loff_t *ppos)
+{
+	if (inplace_data_updates)
+		return nova_inplace_file_write(filp, buf, len, ppos);
+	else
+		return nova_cow_file_write(filp, buf, len, ppos);
+}
+
+static int nova_dax_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct inode *inode = file->f_mapping->host;
+
+	file_accessed(file);
+
+	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+
+	vma->vm_ops = &nova_dax_vm_ops;
+
+	nova_insert_write_vma(vma);
+
+	nova_dbg_mmap4k("[%s:%d] inode %lu, MMAP 4KPAGE vm_start(0x%lx), vm_end(0x%lx), vm pgoff %lu, %lu blocks, vm_flags(0x%lx), vm_page_prot(0x%lx)\n",
+			__func__, __LINE__,
+			inode->i_ino, vma->vm_start, vma->vm_end,
+			vma->vm_pgoff,
+			(vma->vm_end - vma->vm_start) >> PAGE_SHIFT,
+			vma->vm_flags,
+			pgprot_val(vma->vm_page_prot));
+
+	return 0;
+}
+
+const struct file_operations nova_dax_file_operations = {
+	.llseek			= nova_llseek,
+	.read			= nova_dax_file_read,
+	.write			= nova_dax_file_write,
+	.read_iter		= nova_dax_read_iter,
+	.write_iter		= nova_dax_write_iter,
+	.mmap			= nova_dax_file_mmap,
+	.open			= nova_open,
+	.fsync			= nova_fsync,
+	.flush			= nova_flush,
+	.unlocked_ioctl		= nova_ioctl,
+	.fallocate		= nova_fallocate,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl		= nova_compat_ioctl,
+#endif
+};
+
+
+static ssize_t nova_wrap_rw_iter(struct kiocb *iocb, struct iov_iter *iter)
+{
+	struct file *filp = iocb->ki_filp;
+	ssize_t ret = -EIO;
+	ssize_t written = 0;
+	unsigned long seg;
+	unsigned long nr_segs = iter->nr_segs;
+	const struct iovec *iv = iter->iov;
+
+	nova_dbgv("%s %s: %lu segs\n", __func__,
+			iov_iter_rw(iter) == READ ? "read" : "write",
+			nr_segs);
+	iv = iter->iov;
+	for (seg = 0; seg < nr_segs; seg++) {
+		if (iov_iter_rw(iter) == READ) {
+			ret = nova_dax_file_read(filp, iv->iov_base,
+					iv->iov_len, &iocb->ki_pos);
+		} else if (iov_iter_rw(iter) == WRITE) {
+			ret = nova_dax_file_write(filp, iv->iov_base,
+					iv->iov_len, &iocb->ki_pos);
+		}
+		if (ret < 0)
+			goto err;
+
+		if (iter->count > iv->iov_len)
+			iter->count -= iv->iov_len;
+		else
+			iter->count = 0;
+
+		written += ret;
+		iter->nr_segs--;
+		iv++;
+	}
+	ret = written;
+err:
+	return ret;
+}
+
+
+/* Wrap read/write_iter for DP, CoW and WP */
+const struct file_operations nova_wrap_file_operations = {
+	.llseek			= nova_llseek,
+	.read			= nova_dax_file_read,
+	.write			= nova_dax_file_write,
+	.read_iter		= nova_wrap_rw_iter,
+	.write_iter		= nova_wrap_rw_iter,
+	.mmap			= nova_dax_file_mmap,
+	.open			= nova_open,
+	.fsync			= nova_fsync,
+	.flush			= nova_flush,
+	.unlocked_ioctl		= nova_ioctl,
+	.fallocate		= nova_fallocate,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl		= nova_compat_ioctl,
+#endif
+};
+
+const struct inode_operations nova_file_inode_operations = {
+	.setattr	= nova_notify_change,
+	.getattr	= nova_getattr,
+	.get_acl	= NULL,
+};
diff --git a/fs/nova/namei.c b/fs/nova/namei.c
new file mode 100644
index 000000000000..59776338008d
--- /dev/null
+++ b/fs/nova/namei.c
@@ -0,0 +1,919 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Inode operations for directories.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include "nova.h"
+#include "journal.h"
+#include "inode.h"
+
+static ino_t nova_inode_by_name(struct inode *dir, struct qstr *entry,
+				 struct nova_dentry **res_entry)
+{
+	struct super_block *sb = dir->i_sb;
+	struct nova_dentry *direntry;
+	struct nova_dentry *direntryc, entry_copy;
+
+	direntry = nova_find_dentry(sb, NULL, dir,
+					entry->name, entry->len);
+	if (direntry == NULL)
+		return 0;
+
+	if (metadata_csum == 0)
+		direntryc = direntry;
+	else {
+		direntryc = &entry_copy;
+		if (!nova_verify_entry_csum(sb, direntry, direntryc))
+			return 0;
+	}
+
+	*res_entry = direntry;
+	return direntryc->ino;
+}
+
+static struct dentry *nova_lookup(struct inode *dir, struct dentry *dentry,
+				   unsigned int flags)
+{
+	struct inode *inode = NULL;
+	struct nova_dentry *de;
+	ino_t ino;
+	timing_t lookup_time;
+
+	NOVA_START_TIMING(lookup_t, lookup_time);
+	if (dentry->d_name.len > NOVA_NAME_LEN) {
+		nova_dbg("%s: namelen %u exceeds limit\n",
+			__func__, dentry->d_name.len);
+		return ERR_PTR(-ENAMETOOLONG);
+	}
+
+	nova_dbg_verbose("%s: %s\n", __func__, dentry->d_name.name);
+	ino = nova_inode_by_name(dir, &dentry->d_name, &de);
+	nova_dbg_verbose("%s: ino %lu\n", __func__, ino);
+	if (ino) {
+		inode = nova_iget(dir->i_sb, ino);
+		if (inode == ERR_PTR(-ESTALE) || inode == ERR_PTR(-ENOMEM)
+				|| inode == ERR_PTR(-EACCES)) {
+			nova_err(dir->i_sb,
+				  "%s: get inode failed: %lu\n",
+				  __func__, (unsigned long)ino);
+			return ERR_PTR(-EIO);
+		}
+	}
+
+	NOVA_END_TIMING(lookup_t, lookup_time);
+	return d_splice_alias(inode, dentry);
+}
+
+static void nova_lite_transaction_for_new_inode(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode *pidir, struct inode *inode,
+	struct inode *dir, struct nova_inode_update *update)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	int cpu;
+	u64 journal_tail;
+	timing_t trans_time;
+
+	NOVA_START_TIMING(create_trans_t, trans_time);
+
+	cpu = smp_processor_id();
+	spin_lock(&sbi->journal_locks[cpu]);
+	nova_memunlock_journal(sb);
+
+	// If you change what's required to create a new inode, you need to
+	// update this function so the changes will be rolled back on failure.
+	journal_tail = nova_create_inode_transaction(sb, inode, dir, cpu, 1, 0);
+
+	nova_update_inode(sb, dir, pidir, update, 0);
+
+	pi->valid = 1;
+	nova_update_inode_checksum(pi);
+	PERSISTENT_BARRIER();
+
+	nova_commit_lite_transaction(sb, journal_tail, cpu);
+	nova_memlock_journal(sb);
+	spin_unlock(&sbi->journal_locks[cpu]);
+
+	if (metadata_csum) {
+		nova_memunlock_inode(sb, pi);
+		nova_update_alter_inode(sb, inode, pi);
+		nova_update_alter_inode(sb, dir, pidir);
+		nova_memlock_inode(sb, pi);
+	}
+	NOVA_END_TIMING(create_trans_t, trans_time);
+}
+
+/*
+ * Create a regular file in directory @dir.
+ *
+ * By the time this is called, we have already created the directory cache
+ * entry for the new file, but it is so far negative: it has no inode.
+ *
+ * If the create succeeds, we fill in the inode information
+ * with d_instantiate().
+ */
+static int nova_create(struct inode *dir, struct dentry *dentry, umode_t mode,
+			bool excl)
+{
+	struct inode *inode = NULL;
+	int err = PTR_ERR(inode);
+	struct super_block *sb = dir->i_sb;
+	struct nova_inode *pidir, *pi;
+	struct nova_inode_update update;
+	u64 pi_addr = 0;
+	u64 ino, epoch_id;
+	timing_t create_time;
+
+	NOVA_START_TIMING(create_t, create_time);
+
+	pidir = nova_get_inode(sb, dir);
+	if (!pidir)
+		goto out_err;
+
+	epoch_id = nova_get_epoch_id(sb);
+	ino = nova_new_nova_inode(sb, &pi_addr);
+	if (ino == 0)
+		goto out_err;
+
+	update.tail = 0;
+	update.alter_tail = 0;
+	err = nova_add_dentry(dentry, ino, 0, &update, epoch_id);
+	if (err)
+		goto out_err;
+
+	nova_dbgv("%s: %s\n", __func__, dentry->d_name.name);
+	nova_dbgv("%s: inode %llu, dir %lu\n", __func__, ino, dir->i_ino);
+	inode = nova_new_vfs_inode(TYPE_CREATE, dir, pi_addr, ino, mode,
+					0, 0, &dentry->d_name, epoch_id);
+	if (IS_ERR(inode))
+		goto out_err;
+
+	d_instantiate(dentry, inode);
+	unlock_new_inode(inode);
+
+	pi = nova_get_block(sb, pi_addr);
+	nova_lite_transaction_for_new_inode(sb, pi, pidir, inode, dir,
+						&update);
+	NOVA_END_TIMING(create_t, create_time);
+	return err;
+out_err:
+	nova_err(sb, "%s return %d\n", __func__, err);
+	NOVA_END_TIMING(create_t, create_time);
+	return err;
+}
+
+static int nova_mknod(struct inode *dir, struct dentry *dentry, umode_t mode,
+		       dev_t rdev)
+{
+	struct inode *inode = NULL;
+	int err = PTR_ERR(inode);
+	struct super_block *sb = dir->i_sb;
+	u64 pi_addr = 0;
+	struct nova_inode *pidir, *pi;
+	struct nova_inode_update update;
+	u64 ino;
+	u64 epoch_id;
+	timing_t mknod_time;
+
+	NOVA_START_TIMING(mknod_t, mknod_time);
+
+	pidir = nova_get_inode(sb, dir);
+	if (!pidir)
+		goto out_err;
+
+	epoch_id = nova_get_epoch_id(sb);
+	ino = nova_new_nova_inode(sb, &pi_addr);
+	if (ino == 0)
+		goto out_err;
+
+	nova_dbgv("%s: %s\n", __func__, dentry->d_name.name);
+	nova_dbgv("%s: inode %llu, dir %lu\n", __func__, ino, dir->i_ino);
+
+	update.tail = 0;
+	update.alter_tail = 0;
+	err = nova_add_dentry(dentry, ino, 0, &update, epoch_id);
+	if (err)
+		goto out_err;
+
+	inode = nova_new_vfs_inode(TYPE_MKNOD, dir, pi_addr, ino, mode,
+					0, rdev, &dentry->d_name, epoch_id);
+	if (IS_ERR(inode))
+		goto out_err;
+
+	d_instantiate(dentry, inode);
+	unlock_new_inode(inode);
+
+	pi = nova_get_block(sb, pi_addr);
+	nova_lite_transaction_for_new_inode(sb, pi, pidir, inode, dir,
+						&update);
+	NOVA_END_TIMING(mknod_t, mknod_time);
+	return err;
+out_err:
+	nova_err(sb, "%s return %d\n", __func__, err);
+	NOVA_END_TIMING(mknod_t, mknod_time);
+	return err;
+}
+
+static int nova_symlink(struct inode *dir, struct dentry *dentry,
+			 const char *symname)
+{
+	struct super_block *sb = dir->i_sb;
+	int err = -ENAMETOOLONG;
+	unsigned int len = strlen(symname);
+	struct inode *inode;
+	struct nova_inode_info *si;
+	struct nova_inode_info_header *sih;
+	u64 pi_addr = 0;
+	struct nova_inode *pidir, *pi;
+	struct nova_inode_update update;
+	u64 ino;
+	u64 epoch_id;
+	timing_t symlink_time;
+
+	NOVA_START_TIMING(symlink_t, symlink_time);
+	if (len + 1 > sb->s_blocksize)
+		goto out;
+
+	pidir = nova_get_inode(sb, dir);
+	if (!pidir)
+		goto out_fail;
+
+	epoch_id = nova_get_epoch_id(sb);
+	ino = nova_new_nova_inode(sb, &pi_addr);
+	if (ino == 0)
+		goto out_fail;
+
+	nova_dbgv("%s: name %s, symname %s\n", __func__,
+				dentry->d_name.name, symname);
+	nova_dbgv("%s: inode %llu, dir %lu\n", __func__, ino, dir->i_ino);
+
+	update.tail = 0;
+	update.alter_tail = 0;
+	err = nova_add_dentry(dentry, ino, 0, &update, epoch_id);
+	if (err)
+		goto out_fail;
+
+	inode = nova_new_vfs_inode(TYPE_SYMLINK, dir, pi_addr, ino,
+					S_IFLNK|0777, len, 0,
+					&dentry->d_name, epoch_id);
+	if (IS_ERR(inode)) {
+		err = PTR_ERR(inode);
+		goto out_fail;
+	}
+
+	pi = nova_get_inode(sb, inode);
+
+	si = NOVA_I(inode);
+	sih = &si->header;
+
+	err = nova_block_symlink(sb, pi, inode, symname, len, epoch_id);
+	if (err)
+		goto out_fail;
+
+	d_instantiate(dentry, inode);
+	unlock_new_inode(inode);
+
+	nova_lite_transaction_for_new_inode(sb, pi, pidir, inode, dir,
+					&update);
+out:
+	NOVA_END_TIMING(symlink_t, symlink_time);
+	return err;
+
+out_fail:
+	nova_err(sb, "%s return %d\n", __func__, err);
+	goto out;
+}
+
+static void nova_lite_transaction_for_time_and_link(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode *pidir, struct inode *inode,
+	struct inode *dir, struct nova_inode_update *update,
+	struct nova_inode_update *update_dir, int invalidate, u64 epoch_id)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u64 journal_tail;
+	int cpu;
+	timing_t trans_time;
+
+	NOVA_START_TIMING(link_trans_t, trans_time);
+
+	cpu = smp_processor_id();
+	spin_lock(&sbi->journal_locks[cpu]);
+	nova_memunlock_journal(sb);
+
+	// If you change what's required to create a new inode, you need to
+	// update this function so the changes will be rolled back on failure.
+	journal_tail = nova_create_inode_transaction(sb, inode, dir, cpu,
+						0, invalidate);
+
+	if (invalidate) {
+		pi->valid = 0;
+		pi->delete_epoch_id = epoch_id;
+	}
+	nova_update_inode(sb, inode, pi, update, 0);
+
+	nova_update_inode(sb, dir, pidir, update_dir, 0);
+
+	PERSISTENT_BARRIER();
+
+	nova_commit_lite_transaction(sb, journal_tail, cpu);
+	nova_memlock_journal(sb);
+	spin_unlock(&sbi->journal_locks[cpu]);
+
+	if (metadata_csum) {
+		nova_memunlock_inode(sb, pi);
+		nova_update_alter_inode(sb, inode, pi);
+		nova_update_alter_inode(sb, dir, pidir);
+		nova_memlock_inode(sb, pi);
+	}
+
+	NOVA_END_TIMING(link_trans_t, trans_time);
+}
+
+static int nova_link(struct dentry *dest_dentry, struct inode *dir,
+		      struct dentry *dentry)
+{
+	struct super_block *sb = dir->i_sb;
+	struct inode *inode = dest_dentry->d_inode;
+	struct nova_inode *pi = nova_get_inode(sb, inode);
+	struct nova_inode *pidir;
+	struct nova_inode_update update_dir;
+	struct nova_inode_update update;
+	u64 old_linkc = 0;
+	u64 epoch_id;
+	int err = -ENOMEM;
+	timing_t link_time;
+
+	NOVA_START_TIMING(link_t, link_time);
+	if (inode->i_nlink >= NOVA_LINK_MAX) {
+		err = -EMLINK;
+		goto out;
+	}
+
+	pidir = nova_get_inode(sb, dir);
+	if (!pidir) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	ihold(inode);
+	epoch_id = nova_get_epoch_id(sb);
+
+	nova_dbgv("%s: name %s, dest %s\n", __func__,
+			dentry->d_name.name, dest_dentry->d_name.name);
+	nova_dbgv("%s: inode %lu, dir %lu\n", __func__,
+			inode->i_ino, dir->i_ino);
+
+	update_dir.tail = 0;
+	update_dir.alter_tail = 0;
+	err = nova_add_dentry(dentry, inode->i_ino, 0, &update_dir, epoch_id);
+	if (err) {
+		iput(inode);
+		goto out;
+	}
+
+	inode->i_ctime = current_time(inode);
+	inc_nlink(inode);
+
+	update.tail = 0;
+	update.alter_tail = 0;
+	err = nova_append_link_change_entry(sb, pi, inode, &update,
+						&old_linkc, epoch_id);
+	if (err) {
+		iput(inode);
+		goto out;
+	}
+
+	d_instantiate(dentry, inode);
+	nova_lite_transaction_for_time_and_link(sb, pi, pidir, inode, dir,
+					&update, &update_dir, 0, epoch_id);
+
+	nova_invalidate_link_change_entry(sb, old_linkc);
+
+out:
+	NOVA_END_TIMING(link_t, link_time);
+	return err;
+}
+
+static int nova_unlink(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = dentry->d_inode;
+	struct super_block *sb = dir->i_sb;
+	int retval = -ENOMEM;
+	struct nova_inode *pi = nova_get_inode(sb, inode);
+	struct nova_inode *pidir;
+	struct nova_inode_update update_dir;
+	struct nova_inode_update update;
+	u64 old_linkc = 0;
+	u64 epoch_id;
+	int invalidate = 0;
+	timing_t unlink_time;
+
+	NOVA_START_TIMING(unlink_t, unlink_time);
+
+	pidir = nova_get_inode(sb, dir);
+	if (!pidir)
+		goto out;
+
+	epoch_id = nova_get_epoch_id(sb);
+	nova_dbgv("%s: %s\n", __func__, dentry->d_name.name);
+	nova_dbgv("%s: inode %lu, dir %lu\n", __func__,
+				inode->i_ino, dir->i_ino);
+
+	update_dir.tail = 0;
+	update_dir.alter_tail = 0;
+	retval = nova_remove_dentry(dentry, 0, &update_dir, epoch_id);
+	if (retval)
+		goto out;
+
+	inode->i_ctime = dir->i_ctime;
+
+	if (inode->i_nlink == 1)
+		invalidate = 1;
+
+	if (inode->i_nlink)
+		drop_nlink(inode);
+
+	update.tail = 0;
+	update.alter_tail = 0;
+	retval = nova_append_link_change_entry(sb, pi, inode, &update,
+						&old_linkc, epoch_id);
+	if (retval)
+		goto out;
+
+	nova_lite_transaction_for_time_and_link(sb, pi, pidir, inode, dir,
+				&update, &update_dir, invalidate, epoch_id);
+
+	nova_invalidate_link_change_entry(sb, old_linkc);
+	nova_invalidate_dentries(sb, &update_dir);
+
+	NOVA_END_TIMING(unlink_t, unlink_time);
+	return 0;
+out:
+	nova_err(sb, "%s return %d\n", __func__, retval);
+	NOVA_END_TIMING(unlink_t, unlink_time);
+	return retval;
+}
+
+static int nova_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+	struct super_block *sb = dir->i_sb;
+	struct inode *inode;
+	struct nova_inode *pidir, *pi;
+	struct nova_inode_info *si, *sidir;
+	struct nova_inode_info_header *sih = NULL;
+	struct nova_inode_update update;
+	u64 pi_addr = 0;
+	u64 ino;
+	u64 epoch_id;
+	int err = -EMLINK;
+	timing_t mkdir_time;
+
+	NOVA_START_TIMING(mkdir_t, mkdir_time);
+	if (dir->i_nlink >= NOVA_LINK_MAX)
+		goto out;
+
+	ino = nova_new_nova_inode(sb, &pi_addr);
+	if (ino == 0)
+		goto out_err;
+
+	epoch_id = nova_get_epoch_id(sb);
+	nova_dbgv("%s: name %s\n", __func__, dentry->d_name.name);
+	nova_dbgv("%s: inode %llu, dir %lu, link %d\n", __func__,
+				ino, dir->i_ino, dir->i_nlink);
+
+	update.tail = 0;
+	update.alter_tail = 0;
+	err = nova_add_dentry(dentry, ino, 1, &update, epoch_id);
+	if (err) {
+		nova_dbg("failed to add dir entry\n");
+		goto out_err;
+	}
+
+	inode = nova_new_vfs_inode(TYPE_MKDIR, dir, pi_addr, ino,
+					S_IFDIR | mode, sb->s_blocksize,
+					0, &dentry->d_name, epoch_id);
+	if (IS_ERR(inode)) {
+		err = PTR_ERR(inode);
+		goto out_err;
+	}
+
+	pi = nova_get_inode(sb, inode);
+	nova_append_dir_init_entries(sb, pi, inode->i_ino, dir->i_ino,
+					epoch_id);
+
+	/* Build the dir tree */
+	si = NOVA_I(inode);
+	sih = &si->header;
+	nova_rebuild_dir_inode_tree(sb, pi, pi_addr, sih);
+
+	pidir = nova_get_inode(sb, dir);
+	sidir = NOVA_I(dir);
+	sih = &sidir->header;
+	dir->i_blocks = sih->i_blocks;
+	inc_nlink(dir);
+	d_instantiate(dentry, inode);
+	unlock_new_inode(inode);
+
+	nova_lite_transaction_for_new_inode(sb, pi, pidir, inode, dir,
+					&update);
+out:
+	NOVA_END_TIMING(mkdir_t, mkdir_time);
+	return err;
+
+out_err:
+//	clear_nlink(inode);
+	nova_err(sb, "%s return %d\n", __func__, err);
+	goto out;
+}
+
+/*
+ * routine to check that the specified directory is empty (for rmdir)
+ */
+static int nova_empty_dir(struct inode *inode)
+{
+	struct super_block *sb;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_dentry *entry;
+	struct nova_dentry *entryc, entry_copy;
+	unsigned long pos = 0;
+	struct nova_dentry *entries[4];
+	int nr_entries;
+	int i;
+
+	sb = inode->i_sb;
+	nr_entries = radix_tree_gang_lookup(&sih->tree,
+					(void **)entries, pos, 4);
+	if (nr_entries > 2)
+		return 0;
+
+	entryc = (metadata_csum == 0) ? NULL : &entry_copy;
+
+	for (i = 0; i < nr_entries; i++) {
+		entry = entries[i];
+
+		if (metadata_csum == 0)
+			entryc = entry;
+		else if (!nova_verify_entry_csum(sb, entry, entryc))
+			return 0;
+
+		if (!is_dir_init_entry(sb, entryc))
+			return 0;
+	}
+
+	return 1;
+}
+
+static int nova_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = dentry->d_inode;
+	struct nova_dentry *de;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pi = nova_get_inode(sb, inode), *pidir;
+	struct nova_inode_update update_dir;
+	struct nova_inode_update update;
+	u64 old_linkc = 0;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	int err = -ENOTEMPTY;
+	u64 epoch_id;
+	timing_t rmdir_time;
+
+	NOVA_START_TIMING(rmdir_t, rmdir_time);
+	if (!inode)
+		return -ENOENT;
+
+	nova_dbgv("%s: name %s\n", __func__, dentry->d_name.name);
+	pidir = nova_get_inode(sb, dir);
+	if (!pidir)
+		return -EINVAL;
+
+	if (nova_inode_by_name(dir, &dentry->d_name, &de) == 0)
+		return -ENOENT;
+
+	if (!nova_empty_dir(inode))
+		return err;
+
+	nova_dbgv("%s: inode %lu, dir %lu, link %d\n", __func__,
+				inode->i_ino, dir->i_ino, dir->i_nlink);
+
+	if (inode->i_nlink != 2)
+		nova_dbg("empty directory %lu has nlink!=2 (%d), dir %lu\n",
+				inode->i_ino, inode->i_nlink, dir->i_ino);
+
+	epoch_id = nova_get_epoch_id(sb);
+
+	update_dir.tail = 0;
+	update_dir.alter_tail = 0;
+	err = nova_remove_dentry(dentry, -1, &update_dir, epoch_id);
+	if (err)
+		goto end_rmdir;
+
+	/*inode->i_version++; */
+	clear_nlink(inode);
+	inode->i_ctime = dir->i_ctime;
+
+	if (dir->i_nlink)
+		drop_nlink(dir);
+
+	nova_delete_dir_tree(sb, sih);
+
+	update.tail = 0;
+	update.alter_tail = 0;
+	err = nova_append_link_change_entry(sb, pi, inode, &update,
+						&old_linkc, epoch_id);
+	if (err)
+		goto end_rmdir;
+
+	nova_lite_transaction_for_time_and_link(sb, pi, pidir, inode, dir,
+					&update, &update_dir, 1, epoch_id);
+
+	nova_invalidate_link_change_entry(sb, old_linkc);
+	nova_invalidate_dentries(sb, &update_dir);
+
+	NOVA_END_TIMING(rmdir_t, rmdir_time);
+	return err;
+
+end_rmdir:
+	nova_err(sb, "%s return %d\n", __func__, err);
+	NOVA_END_TIMING(rmdir_t, rmdir_time);
+	return err;
+}
+
+static int nova_rename(struct inode *old_dir,
+			struct dentry *old_dentry,
+			struct inode *new_dir, struct dentry *new_dentry,
+			unsigned int flags)
+{
+	struct inode *old_inode = old_dentry->d_inode;
+	struct inode *new_inode = new_dentry->d_inode;
+	struct super_block *sb = old_inode->i_sb;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode *old_pi = NULL, *new_pi = NULL;
+	struct nova_inode *new_pidir = NULL, *old_pidir = NULL;
+	struct nova_dentry *father_entry = NULL;
+	struct nova_dentry *father_entryc, entry_copy;
+	char *head_addr = NULL;
+	int invalidate_new_inode = 0;
+	struct nova_inode_update update_dir_new;
+	struct nova_inode_update update_dir_old;
+	struct nova_inode_update update_new;
+	struct nova_inode_update update_old;
+	u64 old_linkc1 = 0, old_linkc2 = 0;
+	int err = -ENOENT;
+	int inc_link = 0, dec_link = 0;
+	int cpu;
+	int change_parent = 0;
+	u64 journal_tail;
+	u64 epoch_id;
+	timing_t rename_time;
+
+	nova_dbgv("%s: rename %s to %s,\n", __func__,
+			old_dentry->d_name.name, new_dentry->d_name.name);
+	nova_dbgv("%s: %s inode %lu, old dir %lu, new dir %lu, new inode %lu\n",
+			__func__, S_ISDIR(old_inode->i_mode) ? "dir" : "normal",
+			old_inode->i_ino, old_dir->i_ino, new_dir->i_ino,
+			new_inode ? new_inode->i_ino : 0);
+
+	if (flags & ~RENAME_NOREPLACE)
+		return -EINVAL;
+
+	NOVA_START_TIMING(rename_t, rename_time);
+
+	if (new_inode) {
+		err = -ENOTEMPTY;
+		if (S_ISDIR(old_inode->i_mode) && !nova_empty_dir(new_inode))
+			goto out;
+	} else {
+		if (S_ISDIR(old_inode->i_mode)) {
+			err = -EMLINK;
+			if (new_dir->i_nlink >= NOVA_LINK_MAX)
+				goto out;
+		}
+	}
+
+	if (S_ISDIR(old_inode->i_mode)) {
+		dec_link = -1;
+		if (!new_inode)
+			inc_link = 1;
+		/*
+		 * Tricky for in-place update:
+		 * New dentry is always after renamed dentry, so we have to
+		 * make sure new dentry has the correct links count
+		 * to work around the rebuild nlink issue.
+		 */
+		if (old_dir == new_dir) {
+			inc_link--;
+			if (inc_link == 0)
+				dec_link = 0;
+		}
+	}
+
+	epoch_id = nova_get_epoch_id(sb);
+	new_pidir = nova_get_inode(sb, new_dir);
+	old_pidir = nova_get_inode(sb, old_dir);
+
+	old_pi = nova_get_inode(sb, old_inode);
+	old_inode->i_ctime = current_time(old_inode);
+	update_old.tail = 0;
+	update_old.alter_tail = 0;
+	err = nova_append_link_change_entry(sb, old_pi, old_inode,
+					&update_old, &old_linkc1, epoch_id);
+	if (err)
+		goto out;
+
+	if (S_ISDIR(old_inode->i_mode) && old_dir != new_dir) {
+		/* The parent directory has changed. Update the ".." entry */
+		/* For simplicity, we use in-place update and journal it */
+		change_parent = 1;
+		head_addr = (char *)nova_get_block(sb, old_pi->log_head);
+		father_entry = (struct nova_dentry *)(head_addr +
+					NOVA_DIR_LOG_REC_LEN(1));
+
+		if (metadata_csum == 0)
+			father_entryc = father_entry;
+		else {
+			father_entryc = &entry_copy;
+			if (!nova_verify_entry_csum(sb, father_entry,
+							father_entryc)) {
+				err = -EIO;
+				goto out;
+			}
+		}
+
+		if (le64_to_cpu(father_entryc->ino) != old_dir->i_ino)
+			nova_err(sb, "%s: dir %lu parent should be %lu, but actually %lu\n",
+				__func__,
+				old_inode->i_ino, old_dir->i_ino,
+				le64_to_cpu(father_entry->ino));
+	}
+
+	update_dir_new.tail = 0;
+	update_dir_new.alter_tail = 0;
+	if (new_inode) {
+		/* First remove the old entry in the new directory */
+		err = nova_remove_dentry(new_dentry, 0, &update_dir_new,
+					epoch_id);
+		if (err)
+			goto out;
+	}
+
+	/* link into the new directory. */
+	err = nova_add_dentry(new_dentry, old_inode->i_ino,
+				inc_link, &update_dir_new, epoch_id);
+	if (err)
+		goto out;
+
+	if (inc_link > 0)
+		inc_nlink(new_dir);
+
+	update_dir_old.tail = 0;
+	update_dir_old.alter_tail = 0;
+	if (old_dir == new_dir) {
+		update_dir_old.tail = update_dir_new.tail;
+		update_dir_old.alter_tail = update_dir_new.alter_tail;
+	}
+
+	err = nova_remove_dentry(old_dentry, dec_link, &update_dir_old,
+					epoch_id);
+	if (err)
+		goto out;
+
+	if (dec_link < 0)
+		drop_nlink(old_dir);
+
+	if (new_inode) {
+		new_pi = nova_get_inode(sb, new_inode);
+		new_inode->i_ctime = current_time(new_inode);
+
+		if (S_ISDIR(old_inode->i_mode)) {
+			if (new_inode->i_nlink)
+				drop_nlink(new_inode);
+		}
+		if (new_inode->i_nlink)
+			drop_nlink(new_inode);
+
+		update_new.tail = 0;
+		update_new.alter_tail = 0;
+		err = nova_append_link_change_entry(sb, new_pi, new_inode,
+						&update_new, &old_linkc2,
+						epoch_id);
+		if (err)
+			goto out;
+	}
+
+	cpu = smp_processor_id();
+	spin_lock(&sbi->journal_locks[cpu]);
+	nova_memunlock_journal(sb);
+	if (new_inode && new_inode->i_nlink == 0)
+		invalidate_new_inode = 1;
+	journal_tail = nova_create_rename_transaction(sb, old_inode, old_dir,
+				new_inode,
+				old_dir != new_dir ? new_dir : NULL,
+				father_entry,
+				invalidate_new_inode,
+				cpu);
+
+	nova_update_inode(sb, old_inode, old_pi, &update_old, 0);
+	nova_update_inode(sb, old_dir, old_pidir, &update_dir_old, 0);
+
+	if (old_pidir != new_pidir)
+		nova_update_inode(sb, new_dir, new_pidir, &update_dir_new, 0);
+
+	if (change_parent && father_entry) {
+		father_entry->ino = cpu_to_le64(new_dir->i_ino);
+		nova_update_entry_csum(father_entry);
+		nova_update_alter_entry(sb, father_entry);
+	}
+
+	if (new_inode) {
+		if (invalidate_new_inode) {
+			new_pi->valid = 0;
+			new_pi->delete_epoch_id = epoch_id;
+		}
+		nova_update_inode(sb, new_inode, new_pi, &update_new, 0);
+	}
+
+	PERSISTENT_BARRIER();
+
+	nova_commit_lite_transaction(sb, journal_tail, cpu);
+	nova_memlock_journal(sb);
+	spin_unlock(&sbi->journal_locks[cpu]);
+
+	nova_memunlock_inode(sb, old_pi);
+	nova_update_alter_inode(sb, old_inode, old_pi);
+	nova_update_alter_inode(sb, old_dir, old_pidir);
+	if (old_dir != new_dir)
+		nova_update_alter_inode(sb, new_dir, new_pidir);
+	if (new_inode)
+		nova_update_alter_inode(sb, new_inode, new_pi);
+	nova_memlock_inode(sb, old_pi);
+
+	nova_invalidate_link_change_entry(sb, old_linkc1);
+	nova_invalidate_link_change_entry(sb, old_linkc2);
+	if (new_inode)
+		nova_invalidate_dentries(sb, &update_dir_new);
+	nova_invalidate_dentries(sb, &update_dir_old);
+
+	NOVA_END_TIMING(rename_t, rename_time);
+	return 0;
+out:
+	nova_err(sb, "%s return %d\n", __func__, err);
+	NOVA_END_TIMING(rename_t, rename_time);
+	return err;
+}
+
+struct dentry *nova_get_parent(struct dentry *child)
+{
+	struct inode *inode;
+	struct qstr dotdot = QSTR_INIT("..", 2);
+	struct nova_dentry *de = NULL;
+	ino_t ino;
+
+	nova_inode_by_name(child->d_inode, &dotdot, &de);
+	if (!de)
+		return ERR_PTR(-ENOENT);
+
+	/* FIXME: can de->ino be avoided by using the return value of
+	 * nova_inode_by_name()?
+	 */
+	ino = le64_to_cpu(de->ino);
+
+	if (ino)
+		inode = nova_iget(child->d_inode->i_sb, ino);
+	else
+		return ERR_PTR(-ENOENT);
+
+	return d_obtain_alias(inode);
+}
+
+const struct inode_operations nova_dir_inode_operations = {
+	.create		= nova_create,
+	.lookup		= nova_lookup,
+	.link		= nova_link,
+	.unlink		= nova_unlink,
+	.symlink	= nova_symlink,
+	.mkdir		= nova_mkdir,
+	.rmdir		= nova_rmdir,
+	.mknod		= nova_mknod,
+	.rename		= nova_rename,
+	.setattr	= nova_notify_change,
+	.get_acl	= NULL,
+};
+
+const struct inode_operations nova_special_inode_operations = {
+	.setattr	= nova_notify_change,
+	.get_acl	= NULL,
+};
diff --git a/fs/nova/symlink.c b/fs/nova/symlink.c
new file mode 100644
index 000000000000..b0e5e898a41b
--- /dev/null
+++ b/fs/nova/symlink.c
@@ -0,0 +1,153 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Symlink operations
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/version.h>
+#include "nova.h"
+#include "inode.h"
+
+int nova_block_symlink(struct super_block *sb, struct nova_inode *pi,
+	struct inode *inode, const char *symname, int len, u64 epoch_id)
+{
+	struct nova_file_write_entry entry_data;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode_update update;
+	unsigned long name_blocknr = 0;
+	int allocated;
+	u64 block;
+	char *blockp;
+	u32 time;
+	int ret;
+
+	update.tail = sih->log_tail;
+	update.alter_tail = sih->alter_log_tail;
+
+	allocated = nova_new_data_blocks(sb, sih, &name_blocknr, 0, 1,
+				 ALLOC_INIT_ZERO, ANY_CPU, ALLOC_FROM_TAIL);
+	if (allocated != 1 || name_blocknr == 0) {
+		ret = allocated < 0 ? allocated : -ENOSPC;
+		return ret;
+	}
+
+	/* First copy name to name block */
+	block = nova_get_block_off(sb, name_blocknr, NOVA_BLOCK_TYPE_4K);
+	blockp = (char *)nova_get_block(sb, block);
+
+	nova_memunlock_block(sb, blockp);
+	memcpy_to_pmem_nocache(blockp, symname, len);
+	blockp[len] = '\0';
+	nova_memlock_block(sb, blockp);
+
+	/* Apply a write entry to the log page */
+	time = current_time(inode).tv_sec;
+	nova_init_file_write_entry(sb, sih, &entry_data, epoch_id, 0, 1,
+					name_blocknr, time, len + 1);
+
+	ret = nova_append_file_write_entry(sb, pi, inode, &entry_data, &update);
+	if (ret) {
+		nova_dbg("%s: append file write entry failed %d\n",
+					__func__, ret);
+		nova_free_data_blocks(sb, sih, name_blocknr, 1);
+		return ret;
+	}
+
+	nova_memunlock_inode(sb, pi);
+	nova_update_inode(sb, inode, pi, &update, 1);
+	nova_memlock_inode(sb, pi);
+	sih->trans_id++;
+
+	return 0;
+}
+
+/* FIXME: Temporary workaround */
+static int nova_readlink_copy(char __user *buffer, int buflen, const char *link)
+{
+	int len = PTR_ERR(link);
+
+	if (IS_ERR(link))
+		goto out;
+
+	len = strlen(link);
+	if (len > (unsigned int) buflen)
+		len = buflen;
+	if (copy_to_user(buffer, link, len))
+		len = -EFAULT;
+out:
+	return len;
+}
+
+static int nova_readlink(struct dentry *dentry, char __user *buffer, int buflen)
+{
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	struct inode *inode = dentry->d_inode;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	char *blockp;
+
+	entry = (struct nova_file_write_entry *)nova_get_block(sb,
+							sih->log_head);
+
+	if (metadata_csum == 0)
+		entryc = entry;
+	else {
+		entryc = &entry_copy;
+		if (!nova_verify_entry_csum(sb, entry, entryc))
+			return -EIO;
+	}
+
+	blockp = (char *)nova_get_block(sb, BLOCK_OFF(entryc->block));
+
+	return nova_readlink_copy(buffer, buflen, blockp);
+}
+
+static const char *nova_get_link(struct dentry *dentry, struct inode *inode,
+	struct delayed_call *done)
+{
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	char *blockp;
+
+	entry = (struct nova_file_write_entry *)nova_get_block(sb,
+							sih->log_head);
+	if (metadata_csum == 0)
+		entryc = entry;
+	else {
+		entryc = &entry_copy;
+		if (!nova_verify_entry_csum(sb, entry, entryc))
+			return ERR_PTR(-EIO);
+	}
+
+	blockp = (char *)nova_get_block(sb, BLOCK_OFF(entryc->block));
+
+	return blockp;
+}
+
+const struct inode_operations nova_symlink_inode_operations = {
+	.readlink	= nova_readlink,
+	.get_link	= nova_get_link,
+	.setattr	= nova_notify_change,
+};


* [RFC 08/16] NOVA: Garbage collection
  2017-08-03  7:48 ` Steven Swanson
@ 2017-08-03  7:49   ` Steven Swanson
  -1 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:49 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson

NOVA recovers log space with a two-phase garbage collection system.  When a log
reaches the end of its allocated pages, NOVA allocates more space.  Then, the
fast GC algorithm scans the log and removes pages that have no valid entries.
Next, it estimates how many pages the log's valid entries would fill.  If this
is less than half the number of pages in the log, the second (thorough) GC phase
copies the valid entries to new pages.

For example (V=valid; I=invalid):

+---+          +---+            +---+
| I |          | I |            | V |
+---+          +---+  Thorough  +---+
| V |          | V |     GC     | V |
+---+          +---+   =====>   +---+
| I |          | I |            | V |
+---+          +---+            +---+
| V |          | V |            | V |
+---+          +---+            +---+
  |              |
  V              V
+---+          +---+
| I |          | V |
+---+          +---+
| I | fast GC  | I |
+---+  ====>   +---+
| I |          | I |
+---+          +---+
| I |          | V |
+---+          +---+
  |
  V
+---+
| V |
+---+
| I |
+---+
| I |
+---+
| V |
+---+
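
The decision between the two phases can be sketched in userspace as follows.
This is only an illustrative model, not code from the patch: struct
page_counts, fast_gc_scan(), estimate_blocks() and want_thorough_gc() are
hypothetical names, and the per-page counters are assumed to mirror the
num_entries/invalid_entries fields kept in each log page tail.  The caller is
assumed to pass every log page except the tail page, which is never recycled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the counters stored in a log page tail. */
struct page_counts {
	unsigned int num_entries;
	unsigned int invalid_entries;
};

/*
 * Fast GC pass: a page whose entries are all invalid is reclaimed and is
 * not counted.  For every surviving page, accumulate the total and valid
 * entry counts.  Returns the number of surviving pages.
 */
static unsigned long fast_gc_scan(const struct page_counts *pages, size_t n,
				  unsigned long *valid, unsigned long *total)
{
	unsigned long kept = 0;
	size_t i;

	*valid = 0;
	*total = 0;
	for (i = 0; i < n; i++) {
		assert(pages[i].invalid_entries <= pages[i].num_entries);
		if (pages[i].invalid_entries == pages[i].num_entries)
			continue;	/* no valid entries: page is freed */
		kept++;
		*total += pages[i].num_entries;
		*valid += pages[i].num_entries - pages[i].invalid_entries;
	}
	return kept;
}

/* Estimated number of pages the valid entries would fill, rounded up. */
static unsigned long estimate_blocks(unsigned long valid, unsigned long total,
				     unsigned long kept)
{
	if (total == 0)
		return 0;
	return (valid * kept + total - 1) / total;
}

/* Thorough GC runs when the estimate is non-zero and under half the pages. */
static bool want_thorough_gc(unsigned long blocks, unsigned long kept)
{
	return blocks && blocks * 2 < kept;
}
```

For eight non-tail pages of 10 entries each, where four pages are fully
invalid and the others hold 2, 1, 3 and 1 valid entries, the scan keeps 4
pages with 7 of 40 entries valid.  The estimate rounds up to 1 page, which is
less than half of 4, so thorough GC would run.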

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 fs/nova/gc.c |  739 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 739 insertions(+)
 create mode 100644 fs/nova/gc.c

diff --git a/fs/nova/gc.c b/fs/nova/gc.c
new file mode 100644
index 000000000000..cfb39ceabe56
--- /dev/null
+++ b/fs/nova/gc.c
@@ -0,0 +1,739 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Garbage collection methods
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include "nova.h"
+#include "inode.h"
+
+static bool curr_log_entry_invalid(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	u64 curr_p, size_t *length)
+{
+	struct nova_file_write_entry *entry;
+	struct nova_dentry *dentry;
+	struct nova_setattr_logentry *setattr_entry;
+	struct nova_link_change_entry *linkc_entry;
+	struct nova_mmap_entry *mmap_entry;
+	struct nova_snapshot_info_entry *sn_entry;
+	char entry_copy[NOVA_MAX_ENTRY_LEN];
+	void *addr, *entryc;
+	u8 type;
+	bool ret = true;
+
+	addr = (void *)nova_get_block(sb, curr_p);
+
+	/* FIXME: this check might hurt performance for workloads that
+	 * frequently invoke gc
+	 */
+	if (metadata_csum == 0)
+		entryc = addr;
+	else {
+		entryc = entry_copy;
+		if (!nova_verify_entry_csum(sb, addr, entryc))
+			return true;
+	}
+
+	type = nova_get_entry_type(entryc);
+	switch (type) {
+	case SET_ATTR:
+		setattr_entry = (struct nova_setattr_logentry *) entryc;
+		if (setattr_entry->invalid == 0)
+			ret = false;
+		*length = sizeof(struct nova_setattr_logentry);
+		break;
+	case LINK_CHANGE:
+		linkc_entry = (struct nova_link_change_entry *) entryc;
+		if (linkc_entry->invalid == 0)
+			ret = false;
+		*length = sizeof(struct nova_link_change_entry);
+		break;
+	case FILE_WRITE:
+		entry = (struct nova_file_write_entry *) entryc;
+		if (entry->num_pages != entry->invalid_pages)
+			ret = false;
+		*length = sizeof(struct nova_file_write_entry);
+		break;
+	case DIR_LOG:
+		dentry = (struct nova_dentry *) entryc;
+		if (dentry->invalid == 0)
+			ret = false;
+		if (sih->last_dentry == curr_p)
+			ret = false;
+		*length = le16_to_cpu(dentry->de_len);
+		break;
+	case MMAP_WRITE:
+		mmap_entry = (struct nova_mmap_entry *) entryc;
+		if (mmap_entry->invalid == 0)
+			ret = false;
+		*length = sizeof(struct nova_mmap_entry);
+		break;
+	case SNAPSHOT_INFO:
+		sn_entry = (struct nova_snapshot_info_entry *) entryc;
+		if (sn_entry->deleted == 0)
+			ret = false;
+		*length = sizeof(struct nova_snapshot_info_entry);
+		break;
+	case NEXT_PAGE:
+		/* No more entries in this page */
+		*length = PAGE_SIZE - ENTRY_LOC(curr_p);
+		break;
+	default:
+		nova_dbg("%s: unknown type %d, 0x%llx\n",
+					__func__, type, curr_p);
+		NOVA_ASSERT(0);
+		*length = PAGE_SIZE - ENTRY_LOC(curr_p);
+		break;
+	}
+
+	return ret;
+}
+
+static bool curr_page_invalid(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	u64 page_head)
+{
+	struct nova_inode_log_page *curr_page;
+	struct nova_inode_page_tail page_tail;
+	unsigned int num_entries;
+	unsigned int invalid_entries;
+	bool ret;
+	timing_t check_time;
+	int rc;
+
+	NOVA_START_TIMING(check_invalid_t, check_time);
+
+	curr_page = (struct nova_inode_log_page *)
+					nova_get_block(sb, page_head);
+	rc = memcpy_mcsafe(&page_tail, &curr_page->page_tail,
+					sizeof(struct nova_inode_page_tail));
+	if (rc) {
+		/* FIXME: Recover using the replica log */
+		nova_err(sb, "check page failed\n");
+		return false;
+	}
+
+	num_entries = le32_to_cpu(page_tail.num_entries);
+	invalid_entries = le32_to_cpu(page_tail.invalid_entries);
+
+	ret = (invalid_entries == num_entries);
+	if (!ret) {
+		sih->num_entries += num_entries;
+		sih->valid_entries += num_entries - invalid_entries;
+	}
+
+	NOVA_END_TIMING(check_invalid_t, check_time);
+	return ret;
+}
+
+static void free_curr_page(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_inode_log_page *curr_page,
+	struct nova_inode_log_page *last_page, u64 curr_head)
+{
+	u8 btype = sih->i_blk_type;
+
+	nova_memunlock_block(sb, last_page);
+	nova_set_next_page_address(sb, last_page,
+			curr_page->page_tail.next_page, 1);
+	nova_memlock_block(sb, last_page);
+	nova_free_log_blocks(sb, sih,
+			nova_get_blocknr(sb, curr_head, btype), 1);
+}
+
+static int nova_gc_assign_file_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *old_entry,
+	struct nova_file_write_entry *new_entry)
+{
+	struct nova_file_write_entry *temp;
+	void **pentry;
+	unsigned long start_pgoff = old_entry->pgoff;
+	unsigned int num = old_entry->num_pages;
+	unsigned long curr_pgoff;
+	int i;
+	int ret = 0;
+
+	for (i = 0; i < num; i++) {
+		curr_pgoff = start_pgoff + i;
+
+		pentry = radix_tree_lookup_slot(&sih->tree, curr_pgoff);
+		if (pentry) {
+			temp = radix_tree_deref_slot(pentry);
+			if (temp == old_entry)
+				radix_tree_replace_slot(&sih->tree, pentry,
+							new_entry);
+		}
+	}
+
+	return ret;
+}
+
+static int nova_gc_assign_dentry(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_dentry *old_dentry,
+	struct nova_dentry *new_dentry)
+{
+	struct nova_dentry *temp;
+	void **pentry;
+	unsigned long hash;
+	int ret = 0;
+
+	hash = BKDRHash(old_dentry->name, old_dentry->name_len);
+	nova_dbgv("%s: assign %s hash %lu\n", __func__,
+			old_dentry->name, hash);
+
+	/* FIXME: hash collision ignored here */
+	pentry = radix_tree_lookup_slot(&sih->tree, hash);
+	if (pentry) {
+		temp = radix_tree_deref_slot(pentry);
+		if (temp == old_dentry)
+			radix_tree_replace_slot(&sih->tree, pentry, new_dentry);
+	}
+
+	return ret;
+}
+
+static int nova_gc_assign_mmap_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 curr_p, u64 new_curr)
+{
+	struct vma_item *item;
+	struct rb_node *temp;
+	int ret = 0;
+
+	if (sih->num_vmas == 0)
+		return ret;
+
+	temp = rb_first(&sih->vma_tree);
+	while (temp) {
+		item = container_of(temp, struct vma_item, node);
+		temp = rb_next(temp);
+		if (item->mmap_entry == curr_p) {
+			item->mmap_entry = new_curr;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+static int nova_gc_assign_snapshot_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_snapshot_info_entry *old_entry, u64 curr_p, u64 new_curr)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *info;
+	int ret = 0;
+
+	info = radix_tree_lookup(&sbi->snapshot_info_tree,
+				old_entry->epoch_id);
+
+	if (info && info->snapshot_entry == curr_p)
+		info->snapshot_entry = new_curr;
+
+	return ret;
+}
+
+static int nova_gc_assign_new_entry(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	u64 curr_p, u64 new_curr)
+{
+	struct nova_file_write_entry *old_entry, *new_entry;
+	struct nova_dentry *old_dentry, *new_dentry;
+	void *addr, *new_addr;
+	u8 type;
+	int ret = 0;
+
+	addr = (void *)nova_get_block(sb, curr_p);
+	type = nova_get_entry_type(addr);
+	switch (type) {
+	case SET_ATTR:
+		sih->last_setattr = new_curr;
+		break;
+	case LINK_CHANGE:
+		sih->last_link_change = new_curr;
+		break;
+	case MMAP_WRITE:
+		ret = nova_gc_assign_mmap_entry(sb, sih, curr_p, new_curr);
+		break;
+	case SNAPSHOT_INFO:
+		ret = nova_gc_assign_snapshot_entry(sb, sih, addr,
+						curr_p, new_curr);
+		break;
+	case FILE_WRITE:
+		new_addr = (void *)nova_get_block(sb, new_curr);
+		old_entry = (struct nova_file_write_entry *)addr;
+		new_entry = (struct nova_file_write_entry *)new_addr;
+		ret = nova_gc_assign_file_entry(sb, sih, old_entry, new_entry);
+		break;
+	case DIR_LOG:
+		new_addr = (void *)nova_get_block(sb, new_curr);
+		old_dentry = (struct nova_dentry *)addr;
+		new_dentry = (struct nova_dentry *)new_addr;
+		if (sih->last_dentry == curr_p)
+			sih->last_dentry = new_curr;
+		ret = nova_gc_assign_dentry(sb, sih, old_dentry, new_dentry);
+		break;
+	default:
+		nova_dbg("%s: unknown type %d, 0x%llx\n",
+					__func__, type, curr_p);
+		NOVA_ASSERT(0);
+		break;
+	}
+
+	return ret;
+}
+
+/* Copy live log entries to the new log and atomically replace the old log */
+static unsigned long nova_inode_log_thorough_gc(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	unsigned long blocks, unsigned long checked_pages)
+{
+	struct nova_inode_log_page *curr_page = NULL;
+	size_t length;
+	struct nova_inode *alter_pi;
+	u64 ino = pi->nova_ino;
+	u64 curr_p, new_curr;
+	u64 old_curr_p;
+	u64 tail_block;
+	u64 old_head;
+	u64 new_head = 0;
+	u64 next;
+	int allocated;
+	int extended = 0;
+	int ret;
+	timing_t gc_time;
+
+	NOVA_START_TIMING(thorough_gc_t, gc_time);
+
+	curr_p = sih->log_head;
+	old_curr_p = curr_p;
+	old_head = sih->log_head;
+	nova_dbg_verbose("Log head 0x%llx, tail 0x%llx\n",
+				curr_p, sih->log_tail);
+	if (curr_p == 0 && sih->log_tail == 0)
+		goto out;
+
+	if (curr_p >> PAGE_SHIFT == sih->log_tail >> PAGE_SHIFT)
+		goto out;
+
+	allocated = nova_allocate_inode_log_pages(sb, sih, blocks,
+					&new_head, ANY_CPU, 0);
+	if (allocated != blocks) {
+		nova_err(sb, "%s: ERROR: no inode log page available\n",
+					__func__);
+		goto out;
+	}
+
+	new_curr = new_head;
+	while (curr_p != sih->log_tail) {
+		old_curr_p = curr_p;
+		if (goto_next_page(sb, curr_p))
+			curr_p = next_log_page(sb, curr_p);
+
+		if (curr_p >> PAGE_SHIFT == sih->log_tail >> PAGE_SHIFT) {
+			/* Don't recycle tail page */
+			break;
+		}
+
+		if (curr_p == 0) {
+			nova_err(sb, "File inode %llu log is NULL!\n", ino);
+			BUG();
+		}
+
+		length = 0;
+		ret = curr_log_entry_invalid(sb, pi, sih, curr_p, &length);
+		if (!ret) {
+			extended = 0;
+			new_curr = nova_get_append_head(sb, pi, sih,
+						new_curr, length, MAIN_LOG,
+						1, &extended);
+			if (extended)
+				blocks++;
+			/* Copy entry to the new log */
+			nova_memunlock_block(sb, nova_get_block(sb, new_curr));
+			memcpy_to_pmem_nocache(nova_get_block(sb, new_curr),
+				nova_get_block(sb, curr_p), length);
+			nova_inc_page_num_entries(sb, new_curr);
+			nova_memlock_block(sb, nova_get_block(sb, new_curr));
+			nova_gc_assign_new_entry(sb, pi, sih, curr_p, new_curr);
+			new_curr += length;
+		}
+
+		curr_p += length;
+	}
+
+	/* Step 1: Link new log to the tail block */
+	tail_block = BLOCK_OFF(sih->log_tail);
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb,
+							BLOCK_OFF(new_curr));
+	next = next_log_page(sb, new_curr);
+	if (next > 0)
+		nova_free_contiguous_log_blocks(sb, sih, next);
+
+	nova_memunlock_block(sb, curr_page);
+	nova_set_next_page_flag(sb, new_curr);
+	nova_set_next_page_address(sb, curr_page, tail_block, 0);
+	nova_memlock_block(sb, curr_page);
+
+	/* Step 2: Atomically switch to the new log */
+	nova_memunlock_inode(sb, pi);
+	pi->log_head = new_head;
+	nova_update_inode_checksum(pi);
+	if (metadata_csum && sih->alter_pi_addr) {
+		alter_pi = (struct nova_inode *)nova_get_block(sb,
+						sih->alter_pi_addr);
+		memcpy_to_pmem_nocache(alter_pi, pi, sizeof(struct nova_inode));
+	}
+	nova_memlock_inode(sb, pi);
+	nova_flush_buffer(pi, sizeof(struct nova_inode), 1);
+	sih->log_head = new_head;
+
+	/* Step 3: Unlink the old log */
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb,
+							BLOCK_OFF(old_curr_p));
+	next = next_log_page(sb, old_curr_p);
+	if (next != tail_block) {
+		nova_err(sb, "Old log error: old curr_p 0x%llx, next 0x%llx, "
+			"curr_p 0x%llx, tail block 0x%llx\n", old_curr_p,
+			next, curr_p, tail_block);
+		BUG();
+	}
+	nova_memunlock_block(sb, curr_page);
+	nova_set_next_page_address(sb, curr_page, 0, 1);
+	nova_memlock_block(sb, curr_page);
+
+	/* Step 4: Free the old log */
+	nova_free_contiguous_log_blocks(sb, sih, old_head);
+
+	sih->log_pages = sih->log_pages + blocks - checked_pages;
+	NOVA_STATS_ADD(thorough_gc_pages, checked_pages - blocks);
+	NOVA_STATS_ADD(thorough_checked_pages, checked_pages);
+out:
+	NOVA_END_TIMING(thorough_gc_t, gc_time);
+	return blocks;
+}
+
+/* Copy original log to alternate log */
+static unsigned long nova_inode_alter_log_thorough_gc(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	unsigned long blocks, unsigned long checked_pages)
+{
+	struct nova_inode_log_page *alter_curr_page = NULL;
+	struct nova_inode *alter_pi;
+	u64 ino = pi->nova_ino;
+	u64 curr_p, new_curr;
+	u64 alter_curr_p;
+	u64 old_alter_curr_p;
+	u64 alter_tail_block;
+	u64 alter_old_head;
+	u64 new_head = 0;
+	u64 alter_next;
+	int allocated;
+	timing_t gc_time;
+
+	NOVA_START_TIMING(thorough_gc_t, gc_time);
+
+	curr_p = sih->log_head;
+	alter_old_head = sih->alter_log_head;
+	nova_dbg_verbose("Log head 0x%llx, tail 0x%llx\n",
+				curr_p, sih->log_tail);
+	if (curr_p == 0 && sih->log_tail == 0)
+		goto out;
+
+	if (curr_p >> PAGE_SHIFT == sih->log_tail >> PAGE_SHIFT)
+		goto out;
+
+	if (alter_old_head >> PAGE_SHIFT == sih->alter_log_tail >> PAGE_SHIFT)
+		goto out;
+
+	allocated = nova_allocate_inode_log_pages(sb, sih, blocks,
+					&new_head, ANY_CPU, 1);
+	if (allocated != blocks) {
+		nova_err(sb, "%s: ERROR: no inode log page available\n",
+					__func__);
+		goto out;
+	}
+
+	new_curr = new_head;
+	while (1) {
+		nova_memunlock_block(sb, nova_get_block(sb, new_curr));
+		memcpy_to_pmem_nocache(nova_get_block(sb, new_curr),
+				nova_get_block(sb, curr_p), LOG_BLOCK_TAIL);
+
+		nova_set_alter_page_address(sb, curr_p, new_curr);
+		nova_memlock_block(sb, nova_get_block(sb, new_curr));
+
+		curr_p = next_log_page(sb, curr_p);
+
+		if (curr_p >> PAGE_SHIFT == sih->log_tail >> PAGE_SHIFT) {
+			/* Don't recycle tail page */
+			break;
+		}
+
+		new_curr = next_log_page(sb, new_curr);
+
+		if (curr_p == 0) {
+			nova_err(sb, "File inode %llu log is NULL!\n", ino);
+			BUG();
+		}
+	}
+
+	/* Step 1: Link new log to the tail block */
+	alter_tail_block = BLOCK_OFF(sih->alter_log_tail);
+	alter_curr_page = (struct nova_inode_log_page *)nova_get_block(sb,
+							BLOCK_OFF(new_curr));
+	alter_next = next_log_page(sb, new_curr);
+	if (alter_next > 0)
+		nova_free_contiguous_log_blocks(sb, sih, alter_next);
+	nova_memunlock_block(sb, alter_curr_page);
+	nova_set_next_page_address(sb, alter_curr_page, alter_tail_block, 0);
+	nova_memlock_block(sb, alter_curr_page);
+
+	/* Step 2: Find the old log block before the tail block */
+	alter_curr_p = sih->alter_log_head;
+	while (1) {
+		old_alter_curr_p = alter_curr_p;
+		alter_curr_p = next_log_page(sb, alter_curr_p);
+
+		if (alter_curr_p >> PAGE_SHIFT ==
+				sih->alter_log_tail >> PAGE_SHIFT)
+			break;
+
+		if (alter_curr_p == 0) {
+			nova_err(sb, "File inode %llu log is NULL!\n", ino);
+			BUG();
+		}
+	}
+
+	/* Step 3: Atomically switch to the new log */
+	nova_memunlock_inode(sb, pi);
+	pi->alter_log_head = new_head;
+	nova_update_inode_checksum(pi);
+	if (metadata_csum && sih->alter_pi_addr) {
+		alter_pi = (struct nova_inode *)nova_get_block(sb,
+						sih->alter_pi_addr);
+		memcpy_to_pmem_nocache(alter_pi, pi, sizeof(struct nova_inode));
+	}
+	nova_memlock_inode(sb, pi);
+	nova_flush_buffer(pi, sizeof(struct nova_inode), 1);
+	sih->alter_log_head = new_head;
+
+	/* Step 4: Unlink the old log */
+	alter_curr_page = (struct nova_inode_log_page *)nova_get_block(sb,
+						BLOCK_OFF(old_alter_curr_p));
+	alter_next = next_log_page(sb, old_alter_curr_p);
+	if (alter_next != alter_tail_block) {
+		nova_err(sb, "Old log error: old curr_p 0x%llx, next 0x%llx, "
+			"curr_p 0x%llx, tail block 0x%llx\n", old_alter_curr_p,
+			alter_next, alter_curr_p, alter_tail_block);
+		BUG();
+	}
+	nova_memunlock_block(sb, alter_curr_page);
+	nova_set_next_page_address(sb, alter_curr_page, 0, 1);
+	nova_memlock_block(sb, alter_curr_page);
+
+	/* Step 5: Free the old log */
+	nova_free_contiguous_log_blocks(sb, sih, alter_old_head);
+
+	sih->log_pages = sih->log_pages + blocks - checked_pages;
+	NOVA_STATS_ADD(thorough_gc_pages, checked_pages - blocks);
+	NOVA_STATS_ADD(thorough_checked_pages, checked_pages);
+out:
+	NOVA_END_TIMING(thorough_gc_t, gc_time);
+	return blocks;
+}
+
+/*
+ * Scan pages in the log and remove those with no valid log entries.
+ */
+int nova_inode_log_fast_gc(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	u64 curr_tail, u64 new_block, u64 alter_new_block,
+	int num_pages, int force_thorough)
+{
+	struct nova_inode *alter_pi;
+	u64 curr, next, possible_head = 0;
+	u64 alter_curr, alter_next = 0, alter_possible_head = 0;
+	int found_head = 0;
+	struct nova_inode_log_page *last_page = NULL;
+	struct nova_inode_log_page *curr_page = NULL;
+	struct nova_inode_log_page *alter_last_page = NULL;
+	struct nova_inode_log_page *alter_curr_page = NULL;
+	int first_need_free = 0;
+	int num_logs;
+	u8 btype = sih->i_blk_type;
+	unsigned long blocks;
+	unsigned long checked_pages = 0;
+	int freed_pages = 0;
+	timing_t gc_time;
+
+	NOVA_START_TIMING(fast_gc_t, gc_time);
+	curr = sih->log_head;
+	alter_curr = sih->alter_log_head;
+	sih->valid_entries = 0;
+	sih->num_entries = 0;
+
+	num_logs = 1;
+	if (metadata_csum)
+		num_logs = 2;
+
+	nova_dbgv("%s: log head 0x%llx, tail 0x%llx\n",
+				__func__, curr, curr_tail);
+	while (1) {
+		if (curr >> PAGE_SHIFT == sih->log_tail >> PAGE_SHIFT) {
+			/* Don't recycle tail page */
+			if (found_head == 0) {
+				possible_head = cpu_to_le64(curr);
+				alter_possible_head = cpu_to_le64(alter_curr);
+			}
+			break;
+		}
+
+		curr_page = (struct nova_inode_log_page *)
+					nova_get_block(sb, curr);
+		next = next_log_page(sb, curr);
+		if (next < 0)
+			break;
+
+		if (metadata_csum) {
+			alter_curr_page = (struct nova_inode_log_page *)
+						nova_get_block(sb, alter_curr);
+			alter_next = next_log_page(sb, alter_curr);
+			if (alter_next < 0)
+				break;
+		}
+		nova_dbg_verbose("curr 0x%llx, next 0x%llx\n", curr, next);
+		if (curr_page_invalid(sb, pi, sih, curr)) {
+			nova_dbg_verbose("curr page %p invalid\n", curr_page);
+			if (curr == sih->log_head) {
+				/* Free first page later */
+				first_need_free = 1;
+				last_page = curr_page;
+				alter_last_page = alter_curr_page;
+			} else {
+				nova_dbg_verbose("Free log block 0x%llx\n",
+						curr >> PAGE_SHIFT);
+				free_curr_page(sb, sih, curr_page, last_page,
+						curr);
+				if (metadata_csum)
+					free_curr_page(sb, sih, alter_curr_page,
+						alter_last_page, alter_curr);
+			}
+			NOVA_STATS_ADD(fast_gc_pages, 1);
+			freed_pages++;
+		} else {
+			if (found_head == 0) {
+				possible_head = cpu_to_le64(curr);
+				alter_possible_head = cpu_to_le64(alter_curr);
+				found_head = 1;
+			}
+			last_page = curr_page;
+			alter_last_page = alter_curr_page;
+		}
+
+		curr = next;
+		alter_curr = alter_next;
+		checked_pages++;
+		if (curr == 0 || (metadata_csum && alter_curr == 0))
+			break;
+	}
+
+	NOVA_STATS_ADD(fast_checked_pages, checked_pages);
+	nova_dbgv("checked pages %lu, freed %d\n", checked_pages, freed_pages);
+	checked_pages -= freed_pages;
+
+	/* TODO: I think this belongs in nova_extend_inode_log. */
+	if (num_pages > 0) {
+		curr = BLOCK_OFF(curr_tail);
+		curr_page = (struct nova_inode_log_page *)
+						  nova_get_block(sb, curr);
+
+		nova_memunlock_block(sb, curr_page);
+		nova_set_next_page_address(sb, curr_page, new_block, 1);
+		nova_memlock_block(sb, curr_page);
+
+		if (metadata_csum) {
+			alter_curr = BLOCK_OFF(sih->alter_log_tail);
+
+			while (next_log_page(sb, alter_curr) > 0)
+				alter_curr = next_log_page(sb, alter_curr);
+
+			alter_curr_page = (struct nova_inode_log_page *)
+				nova_get_block(sb, alter_curr);
+			nova_memunlock_block(sb, alter_curr_page);
+			nova_set_next_page_address(sb, alter_curr_page,
+						   alter_new_block, 1);
+			nova_memlock_block(sb, alter_curr_page);
+		}
+	}
+
+	curr = sih->log_head;
+	alter_curr = sih->alter_log_head;
+
+	nova_memunlock_inode(sb, pi);
+	pi->log_head = possible_head;
+	pi->alter_log_head = alter_possible_head;
+	nova_update_inode_checksum(pi);
+	if (metadata_csum && sih->alter_pi_addr) {
+		alter_pi = (struct nova_inode *)nova_get_block(sb,
+						sih->alter_pi_addr);
+		memcpy_to_pmem_nocache(alter_pi, pi, sizeof(struct nova_inode));
+	}
+	nova_memlock_inode(sb, pi);
+	sih->log_head = possible_head;
+	sih->alter_log_head = alter_possible_head;
+	nova_dbgv("%s: %d new head 0x%llx\n", __func__,
+					found_head, possible_head);
+	sih->log_pages += (num_pages - freed_pages) * num_logs;
+	/* Don't update log tail pointer here */
+	nova_flush_buffer(&pi->log_head, CACHELINE_SIZE, 1);
+
+	if (first_need_free) {
+		nova_dbg_verbose("Free log head block 0x%llx\n",
+					curr >> PAGE_SHIFT);
+		nova_free_log_blocks(sb, sih,
+				nova_get_blocknr(sb, curr, btype), 1);
+		if (metadata_csum)
+			nova_free_log_blocks(sb, sih,
+				nova_get_blocknr(sb, alter_curr, btype), 1);
+	}
+
+	NOVA_END_TIMING(fast_gc_t, gc_time);
+
+	if (sih->num_entries == 0)
+		return 0;
+
+	/* Estimate how many pages' worth of valid entries the log contains.
+	 *
+	 * If it is less than half the number of pages that remain in the log,
+	 * compress them with thorough gc.
+	 */
+	blocks = (sih->valid_entries * checked_pages) / sih->num_entries;
+	if ((sih->valid_entries * checked_pages) % sih->num_entries)
+		blocks++;
+
+	if (force_thorough || (blocks && blocks * 2 < checked_pages)) {
+		nova_dbgv("Thorough GC for inode %lu: checked pages %lu, valid pages %lu\n",
+				sih->ino,
+				checked_pages, blocks);
+		blocks = nova_inode_log_thorough_gc(sb, pi, sih,
+							blocks, checked_pages);
+		if (metadata_csum)
+			nova_inode_alter_log_thorough_gc(sb, pi, sih,
+							blocks, checked_pages);
+	}
+
+	return 0;
+}

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 08/16] NOVA: Garbage collection
@ 2017-08-03  7:49   ` Steven Swanson
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:49 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson, dan.j.williams

NOVA recovers log space with a two-phase garbage collection system.  When a log
reaches the end of its allocated pages, NOVA allocates more space.  The fast GC
phase then scans the log and removes pages that have no valid entries.  Next,
it estimates how many pages the log's valid entries would fill.  If this is
less than half the number of pages in the log, the second phase (thorough GC)
copies the valid entries to new pages.

For example (V=valid; I=invalid):

+---+          +---+            +---+
| I |          | I |            | V |
+---+          +---+  Thorough  +---+
| V |          | V |     GC     | V |
+---+          +---+   =====>   +---+
| I |          | I |            | V |
+---+          +---+            +---+
| V |          | V |            | V |
+---+          +---+            +---+
  |              |
  V              V
+---+          +---+
| I |          | V |
+---+          +---+
| I | fast GC  | I |
+---+  ====>   +---+
| I |          | I |
+---+          +---+
| I |          | V |
+---+          +---+
  |
  V
+---+
| V |
+---+
| I |
+---+
| I |
+---+
| V |
+---+
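
The two phases can be sketched in user-space C roughly as follows.  This is
only an illustration of the description above, not the code in fs/nova/gc.c:
struct log_page, struct gc_stats, and the three helpers are hypothetical
names, pages are assumed to be heap-allocated, and persistence, checksums, and
the replica log are ignored.  The threshold mirrors the "less than half" rule.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a log page: just the two counters kept in
 * the page tail, plus a link to the next page in the log.
 */
struct log_page {
	unsigned int num_entries;
	unsigned int invalid_entries;
	struct log_page *next;
};

/* Totals gathered over the pages that survive the fast phase. */
struct gc_stats {
	unsigned long num_entries;
	unsigned long valid_entries;
	unsigned long live_pages;
};

/* Fast phase: unlink every page whose entries are all invalid.  The
 * tail page is never examined or recycled.  Returns the new head.
 */
static struct log_page *fast_gc(struct log_page *head, struct log_page *tail,
				struct gc_stats *st)
{
	struct log_page **link = &head;

	while (*link && *link != tail) {
		struct log_page *p = *link;

		if (p->invalid_entries == p->num_entries) {
			*link = p->next;	/* unlink the dead page */
			free(p);
			continue;
		}
		st->num_entries += p->num_entries;
		st->valid_entries += p->num_entries - p->invalid_entries;
		st->live_pages++;
		link = &p->next;
	}
	return head;
}

/* Estimate, rounding up, how many pages the valid entries would fill. */
static unsigned long estimate_valid_pages(const struct gc_stats *st)
{
	unsigned long scaled;

	if (st->num_entries == 0)
		return 0;
	scaled = st->valid_entries * st->live_pages;
	return (scaled + st->num_entries - 1) / st->num_entries;
}

/* The thorough phase pays off when the valid entries would fill fewer
 * than half of the pages that remain after the fast phase.
 */
static int needs_thorough_gc(const struct gc_stats *st)
{
	unsigned long blocks = estimate_valid_pages(st);

	return blocks && blocks * 2 < st->live_pages;
}
```

In this sketch a caller would run fast_gc() first and consult
needs_thorough_gc() afterwards; the patch additionally accepts a
force_thorough flag that bypasses the estimate.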

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 fs/nova/gc.c |  739 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 739 insertions(+)
 create mode 100644 fs/nova/gc.c

diff --git a/fs/nova/gc.c b/fs/nova/gc.c
new file mode 100644
index 000000000000..cfb39ceabe56
--- /dev/null
+++ b/fs/nova/gc.c
@@ -0,0 +1,739 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Garbage collection methods
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include "nova.h"
+#include "inode.h"
+
+static bool curr_log_entry_invalid(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	u64 curr_p, size_t *length)
+{
+	struct nova_file_write_entry *entry;
+	struct nova_dentry *dentry;
+	struct nova_setattr_logentry *setattr_entry;
+	struct nova_link_change_entry *linkc_entry;
+	struct nova_mmap_entry *mmap_entry;
+	struct nova_snapshot_info_entry *sn_entry;
+	char entry_copy[NOVA_MAX_ENTRY_LEN];
+	void *addr, *entryc;
+	u8 type;
+	bool ret = true;
+
+	addr = (void *)nova_get_block(sb, curr_p);
+
+	/* FIXME: this check might hurt performance for workloads that
+	 * frequently invoke gc
+	 */
+	if (metadata_csum == 0)
+		entryc = addr;
+	else {
+		entryc = entry_copy;
+		if (!nova_verify_entry_csum(sb, addr, entryc))
+			return true;
+	}
+
+	type = nova_get_entry_type(entryc);
+	switch (type) {
+	case SET_ATTR:
+		setattr_entry = (struct nova_setattr_logentry *) entryc;
+		if (setattr_entry->invalid == 0)
+			ret = false;
+		*length = sizeof(struct nova_setattr_logentry);
+		break;
+	case LINK_CHANGE:
+		linkc_entry = (struct nova_link_change_entry *) entryc;
+		if (linkc_entry->invalid == 0)
+			ret = false;
+		*length = sizeof(struct nova_link_change_entry);
+		break;
+	case FILE_WRITE:
+		entry = (struct nova_file_write_entry *) entryc;
+		if (entry->num_pages != entry->invalid_pages)
+			ret = false;
+		*length = sizeof(struct nova_file_write_entry);
+		break;
+	case DIR_LOG:
+		dentry = (struct nova_dentry *) entryc;
+		if (dentry->invalid == 0)
+			ret = false;
+		if (sih->last_dentry == curr_p)
+			ret = false;
+		*length = le16_to_cpu(dentry->de_len);
+		break;
+	case MMAP_WRITE:
+		mmap_entry = (struct nova_mmap_entry *) entryc;
+		if (mmap_entry->invalid == 0)
+			ret = false;
+		*length = sizeof(struct nova_mmap_entry);
+		break;
+	case SNAPSHOT_INFO:
+		sn_entry = (struct nova_snapshot_info_entry *) entryc;
+		if (sn_entry->deleted == 0)
+			ret = false;
+		*length = sizeof(struct nova_snapshot_info_entry);
+		break;
+	case NEXT_PAGE:
+		/* No more entries in this page */
+		*length = PAGE_SIZE - ENTRY_LOC(curr_p);
+		break;
+	default:
+		nova_dbg("%s: unknown type %d, 0x%llx\n",
+					__func__, type, curr_p);
+		NOVA_ASSERT(0);
+		*length = PAGE_SIZE - ENTRY_LOC(curr_p);
+		break;
+	}
+
+	return ret;
+}
+
+static bool curr_page_invalid(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	u64 page_head)
+{
+	struct nova_inode_log_page *curr_page;
+	struct nova_inode_page_tail page_tail;
+	unsigned int num_entries;
+	unsigned int invalid_entries;
+	bool ret;
+	timing_t check_time;
+	int rc;
+
+	NOVA_START_TIMING(check_invalid_t, check_time);
+
+	curr_page = (struct nova_inode_log_page *)
+					nova_get_block(sb, page_head);
+	rc = memcpy_mcsafe(&page_tail, &curr_page->page_tail,
+					sizeof(struct nova_inode_page_tail));
+	if (rc) {
+		/* FIXME: Recover using the replica log */
+		nova_err(sb, "check page failed\n");
+		return false;
+	}
+
+	num_entries = le32_to_cpu(page_tail.num_entries);
+	invalid_entries = le32_to_cpu(page_tail.invalid_entries);
+
+	ret = (invalid_entries == num_entries);
+	if (!ret) {
+		sih->num_entries += num_entries;
+		sih->valid_entries += num_entries - invalid_entries;
+	}
+
+	NOVA_END_TIMING(check_invalid_t, check_time);
+	return ret;
+}
+
+static void free_curr_page(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_inode_log_page *curr_page,
+	struct nova_inode_log_page *last_page, u64 curr_head)
+{
+	u8 btype = sih->i_blk_type;
+
+	nova_memunlock_block(sb, last_page);
+	nova_set_next_page_address(sb, last_page,
+			curr_page->page_tail.next_page, 1);
+	nova_memlock_block(sb, last_page);
+	nova_free_log_blocks(sb, sih,
+			nova_get_blocknr(sb, curr_head, btype), 1);
+}
+
+static int nova_gc_assign_file_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *old_entry,
+	struct nova_file_write_entry *new_entry)
+{
+	struct nova_file_write_entry *temp;
+	void **pentry;
+	unsigned long start_pgoff = old_entry->pgoff;
+	unsigned int num = old_entry->num_pages;
+	unsigned long curr_pgoff;
+	int i;
+	int ret = 0;
+
+	for (i = 0; i < num; i++) {
+		curr_pgoff = start_pgoff + i;
+
+		pentry = radix_tree_lookup_slot(&sih->tree, curr_pgoff);
+		if (pentry) {
+			temp = radix_tree_deref_slot(pentry);
+			if (temp == old_entry)
+				radix_tree_replace_slot(&sih->tree, pentry,
+							new_entry);
+		}
+	}
+
+	return ret;
+}
+
+static int nova_gc_assign_dentry(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_dentry *old_dentry,
+	struct nova_dentry *new_dentry)
+{
+	struct nova_dentry *temp;
+	void **pentry;
+	unsigned long hash;
+	int ret = 0;
+
+	hash = BKDRHash(old_dentry->name, old_dentry->name_len);
+	nova_dbgv("%s: assign %s hash %lu\n", __func__,
+			old_dentry->name, hash);
+
+	/* FIXME: hash collision ignored here */
+	pentry = radix_tree_lookup_slot(&sih->tree, hash);
+	if (pentry) {
+		temp = radix_tree_deref_slot(pentry);
+		if (temp == old_dentry)
+			radix_tree_replace_slot(&sih->tree, pentry, new_dentry);
+	}
+
+	return ret;
+}
+
+static int nova_gc_assign_mmap_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 curr_p, u64 new_curr)
+{
+	struct vma_item *item;
+	struct rb_node *temp;
+	int ret = 0;
+
+	if (sih->num_vmas == 0)
+		return ret;
+
+	temp = rb_first(&sih->vma_tree);
+	while (temp) {
+		item = container_of(temp, struct vma_item, node);
+		temp = rb_next(temp);
+		if (item->mmap_entry == curr_p) {
+			item->mmap_entry = new_curr;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+static int nova_gc_assign_snapshot_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_snapshot_info_entry *old_entry, u64 curr_p, u64 new_curr)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *info;
+	int ret = 0;
+
+	info = radix_tree_lookup(&sbi->snapshot_info_tree,
+				old_entry->epoch_id);
+
+	if (info && info->snapshot_entry == curr_p)
+		info->snapshot_entry = new_curr;
+
+	return ret;
+}
+
+static int nova_gc_assign_new_entry(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	u64 curr_p, u64 new_curr)
+{
+	struct nova_file_write_entry *old_entry, *new_entry;
+	struct nova_dentry *old_dentry, *new_dentry;
+	void *addr, *new_addr;
+	u8 type;
+	int ret = 0;
+
+	addr = (void *)nova_get_block(sb, curr_p);
+	type = nova_get_entry_type(addr);
+	switch (type) {
+	case SET_ATTR:
+		sih->last_setattr = new_curr;
+		break;
+	case LINK_CHANGE:
+		sih->last_link_change = new_curr;
+		break;
+	case MMAP_WRITE:
+		ret = nova_gc_assign_mmap_entry(sb, sih, curr_p, new_curr);
+		break;
+	case SNAPSHOT_INFO:
+		ret = nova_gc_assign_snapshot_entry(sb, sih, addr,
+						curr_p, new_curr);
+		break;
+	case FILE_WRITE:
+		new_addr = (void *)nova_get_block(sb, new_curr);
+		old_entry = (struct nova_file_write_entry *)addr;
+		new_entry = (struct nova_file_write_entry *)new_addr;
+		ret = nova_gc_assign_file_entry(sb, sih, old_entry, new_entry);
+		break;
+	case DIR_LOG:
+		new_addr = (void *)nova_get_block(sb, new_curr);
+		old_dentry = (struct nova_dentry *)addr;
+		new_dentry = (struct nova_dentry *)new_addr;
+		if (sih->last_dentry == curr_p)
+			sih->last_dentry = new_curr;
+		ret = nova_gc_assign_dentry(sb, sih, old_dentry, new_dentry);
+		break;
+	default:
+		nova_dbg("%s: unknown type %d, 0x%llx\n",
+					__func__, type, curr_p);
+		NOVA_ASSERT(0);
+		break;
+	}
+
+	return ret;
+}
+
+/* Copy live log entries to the new log and atomically replace the old log */
+static unsigned long nova_inode_log_thorough_gc(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	unsigned long blocks, unsigned long checked_pages)
+{
+	struct nova_inode_log_page *curr_page = NULL;
+	size_t length;
+	struct nova_inode *alter_pi;
+	u64 ino = pi->nova_ino;
+	u64 curr_p, new_curr;
+	u64 old_curr_p;
+	u64 tail_block;
+	u64 old_head;
+	u64 new_head = 0;
+	u64 next;
+	int allocated;
+	int extended = 0;
+	int ret;
+	timing_t gc_time;
+
+	NOVA_START_TIMING(thorough_gc_t, gc_time);
+
+	curr_p = sih->log_head;
+	old_curr_p = curr_p;
+	old_head = sih->log_head;
+	nova_dbg_verbose("Log head 0x%llx, tail 0x%llx\n",
+				curr_p, sih->log_tail);
+	if (curr_p == 0 && sih->log_tail == 0)
+		goto out;
+
+	if (curr_p >> PAGE_SHIFT == sih->log_tail >> PAGE_SHIFT)
+		goto out;
+
+	allocated = nova_allocate_inode_log_pages(sb, sih, blocks,
+					&new_head, ANY_CPU, 0);
+	if (allocated != blocks) {
+		nova_err(sb, "%s: ERROR: no inode log page available\n",
+					__func__);
+		goto out;
+	}
+
+	new_curr = new_head;
+	while (curr_p != sih->log_tail) {
+		old_curr_p = curr_p;
+		if (goto_next_page(sb, curr_p))
+			curr_p = next_log_page(sb, curr_p);
+
+		if (curr_p >> PAGE_SHIFT == sih->log_tail >> PAGE_SHIFT) {
+			/* Don't recycle tail page */
+			break;
+		}
+
+		if (curr_p == 0) {
+			nova_err(sb, "File inode %llu log is NULL!\n", ino);
+			BUG();
+		}
+
+		length = 0;
+		ret = curr_log_entry_invalid(sb, pi, sih, curr_p, &length);
+		if (!ret) {
+			extended = 0;
+			new_curr = nova_get_append_head(sb, pi, sih,
+						new_curr, length, MAIN_LOG,
+						1, &extended);
+			if (extended)
+				blocks++;
+			/* Copy entry to the new log */
+			nova_memunlock_block(sb, nova_get_block(sb, new_curr));
+			memcpy_to_pmem_nocache(nova_get_block(sb, new_curr),
+				nova_get_block(sb, curr_p), length);
+			nova_inc_page_num_entries(sb, new_curr);
+			nova_memlock_block(sb, nova_get_block(sb, new_curr));
+			nova_gc_assign_new_entry(sb, pi, sih, curr_p, new_curr);
+			new_curr += length;
+		}
+
+		curr_p += length;
+	}
+
+	/* Step 1: Link new log to the tail block */
+	tail_block = BLOCK_OFF(sih->log_tail);
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb,
+							BLOCK_OFF(new_curr));
+	next = next_log_page(sb, new_curr);
+	if (next > 0)
+		nova_free_contiguous_log_blocks(sb, sih, next);
+
+	nova_memunlock_block(sb, curr_page);
+	nova_set_next_page_flag(sb, new_curr);
+	nova_set_next_page_address(sb, curr_page, tail_block, 0);
+	nova_memlock_block(sb, curr_page);
+
+	/* Step 2: Atomically switch to the new log */
+	nova_memunlock_inode(sb, pi);
+	pi->log_head = new_head;
+	nova_update_inode_checksum(pi);
+	if (metadata_csum && sih->alter_pi_addr) {
+		alter_pi = (struct nova_inode *)nova_get_block(sb,
+						sih->alter_pi_addr);
+		memcpy_to_pmem_nocache(alter_pi, pi, sizeof(struct nova_inode));
+	}
+	nova_memlock_inode(sb, pi);
+	nova_flush_buffer(pi, sizeof(struct nova_inode), 1);
+	sih->log_head = new_head;
+
+	/* Step 3: Unlink the old log */
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb,
+							BLOCK_OFF(old_curr_p));
+	next = next_log_page(sb, old_curr_p);
+	if (next != tail_block) {
+		nova_err(sb, "Old log error: old curr_p 0x%llx, next 0x%llx, "
+			"curr_p 0x%llx, tail block 0x%llx\n", old_curr_p,
+			next, curr_p, tail_block);
+		BUG();
+	}
+	nova_memunlock_block(sb, curr_page);
+	nova_set_next_page_address(sb, curr_page, 0, 1);
+	nova_memlock_block(sb, curr_page);
+
+	/* Step 4: Free the old log */
+	nova_free_contiguous_log_blocks(sb, sih, old_head);
+
+	sih->log_pages = sih->log_pages + blocks - checked_pages;
+	NOVA_STATS_ADD(thorough_gc_pages, checked_pages - blocks);
+	NOVA_STATS_ADD(thorough_checked_pages, checked_pages);
+out:
+	NOVA_END_TIMING(thorough_gc_t, gc_time);
+	return blocks;
+}
+
+/* Copy original log to alternate log */
+static unsigned long nova_inode_alter_log_thorough_gc(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	unsigned long blocks, unsigned long checked_pages)
+{
+	struct nova_inode_log_page *alter_curr_page = NULL;
+	struct nova_inode *alter_pi;
+	u64 ino = pi->nova_ino;
+	u64 curr_p, new_curr;
+	u64 alter_curr_p;
+	u64 old_alter_curr_p;
+	u64 alter_tail_block;
+	u64 alter_old_head;
+	u64 new_head = 0;
+	u64 alter_next;
+	int allocated;
+	timing_t gc_time;
+
+	NOVA_START_TIMING(thorough_gc_t, gc_time);
+
+	curr_p = sih->log_head;
+	alter_old_head = sih->alter_log_head;
+	nova_dbg_verbose("Log head 0x%llx, tail 0x%llx\n",
+				curr_p, sih->log_tail);
+	if (curr_p == 0 && sih->log_tail == 0)
+		goto out;
+
+	if (curr_p >> PAGE_SHIFT == sih->log_tail >> PAGE_SHIFT)
+		goto out;
+
+	if (alter_old_head >> PAGE_SHIFT == sih->alter_log_tail >> PAGE_SHIFT)
+		goto out;
+
+	allocated = nova_allocate_inode_log_pages(sb, sih, blocks,
+					&new_head, ANY_CPU, 1);
+	if (allocated != blocks) {
+		nova_err(sb, "%s: ERROR: no inode log page available\n",
+					__func__);
+		goto out;
+	}
+
+	new_curr = new_head;
+	while (1) {
+		nova_memunlock_block(sb, nova_get_block(sb, new_curr));
+		memcpy_to_pmem_nocache(nova_get_block(sb, new_curr),
+				nova_get_block(sb, curr_p), LOG_BLOCK_TAIL);
+
+		nova_set_alter_page_address(sb, curr_p, new_curr);
+		nova_memlock_block(sb, nova_get_block(sb, new_curr));
+
+		curr_p = next_log_page(sb, curr_p);
+
+		if (curr_p >> PAGE_SHIFT == sih->log_tail >> PAGE_SHIFT) {
+			/* Don't recycle tail page */
+			break;
+		}
+
+		new_curr = next_log_page(sb, new_curr);
+
+		if (curr_p == 0) {
+			nova_err(sb, "File inode %llu log is NULL!\n", ino);
+			BUG();
+		}
+	}
+
+	/* Step 1: Link new log to the tail block */
+	alter_tail_block = BLOCK_OFF(sih->alter_log_tail);
+	alter_curr_page = (struct nova_inode_log_page *)nova_get_block(sb,
+							BLOCK_OFF(new_curr));
+	alter_next = next_log_page(sb, new_curr);
+	if (alter_next > 0)
+		nova_free_contiguous_log_blocks(sb, sih, alter_next);
+	nova_memunlock_block(sb, alter_curr_page);
+	nova_set_next_page_address(sb, alter_curr_page, alter_tail_block, 0);
+	nova_memlock_block(sb, alter_curr_page);
+
+	/* Step 2: Find the old log block before the tail block */
+	alter_curr_p = sih->alter_log_head;
+	while (1) {
+		old_alter_curr_p = alter_curr_p;
+		alter_curr_p = next_log_page(sb, alter_curr_p);
+
+		if (alter_curr_p >> PAGE_SHIFT ==
+				sih->alter_log_tail >> PAGE_SHIFT)
+			break;
+
+		if (alter_curr_p == 0) {
+			nova_err(sb, "File inode %llu log is NULL!\n", ino);
+			BUG();
+		}
+	}
+
+	/* Step 3: Atomically switch to the new log */
+	nova_memunlock_inode(sb, pi);
+	pi->alter_log_head = new_head;
+	nova_update_inode_checksum(pi);
+	if (metadata_csum && sih->alter_pi_addr) {
+		alter_pi = (struct nova_inode *)nova_get_block(sb,
+						sih->alter_pi_addr);
+		memcpy_to_pmem_nocache(alter_pi, pi, sizeof(struct nova_inode));
+	}
+	nova_memlock_inode(sb, pi);
+	nova_flush_buffer(pi, sizeof(struct nova_inode), 1);
+	sih->alter_log_head = new_head;
+
+	/* Step 4: Unlink the old log */
+	alter_curr_page = (struct nova_inode_log_page *)nova_get_block(sb,
+						BLOCK_OFF(old_alter_curr_p));
+	alter_next = next_log_page(sb, old_alter_curr_p);
+	if (alter_next != alter_tail_block) {
+		nova_err(sb, "Old log error: old curr_p 0x%llx, next 0x%llx, "
+			"curr_p 0x%llx, tail block 0x%llx\n", old_alter_curr_p,
+			alter_next, alter_curr_p, alter_tail_block);
+		BUG();
+	}
+	nova_memunlock_block(sb, alter_curr_page);
+	nova_set_next_page_address(sb, alter_curr_page, 0, 1);
+	nova_memlock_block(sb, alter_curr_page);
+
+	/* Step 5: Free the old log */
+	nova_free_contiguous_log_blocks(sb, sih, alter_old_head);
+
+	sih->log_pages = sih->log_pages + blocks - checked_pages;
+	NOVA_STATS_ADD(thorough_gc_pages, checked_pages - blocks);
+	NOVA_STATS_ADD(thorough_checked_pages, checked_pages);
+out:
+	NOVA_END_TIMING(thorough_gc_t, gc_time);
+	return blocks;
+}
+
+/*
+ * Scan pages in the log and remove those with no valid log entries.
+ */
+int nova_inode_log_fast_gc(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	u64 curr_tail, u64 new_block, u64 alter_new_block,
+	int num_pages, int force_thorough)
+{
+	struct nova_inode *alter_pi;
+	u64 curr, next, possible_head = 0;
+	u64 alter_curr, alter_next = 0, alter_possible_head = 0;
+	int found_head = 0;
+	struct nova_inode_log_page *last_page = NULL;
+	struct nova_inode_log_page *curr_page = NULL;
+	struct nova_inode_log_page *alter_last_page = NULL;
+	struct nova_inode_log_page *alter_curr_page = NULL;
+	int first_need_free = 0;
+	int num_logs;
+	u8 btype = sih->i_blk_type;
+	unsigned long blocks;
+	unsigned long checked_pages = 0;
+	int freed_pages = 0;
+	timing_t gc_time;
+
+	NOVA_START_TIMING(fast_gc_t, gc_time);
+	curr = sih->log_head;
+	alter_curr = sih->alter_log_head;
+	sih->valid_entries = 0;
+	sih->num_entries = 0;
+
+	num_logs = 1;
+	if (metadata_csum)
+		num_logs = 2;
+
+	nova_dbgv("%s: log head 0x%llx, tail 0x%llx\n",
+				__func__, curr, curr_tail);
+	while (1) {
+		if (curr >> PAGE_SHIFT == sih->log_tail >> PAGE_SHIFT) {
+			/* Don't recycle tail page */
+			if (found_head == 0) {
+				possible_head = cpu_to_le64(curr);
+				alter_possible_head = cpu_to_le64(alter_curr);
+			}
+			break;
+		}
+
+		curr_page = (struct nova_inode_log_page *)
+					nova_get_block(sb, curr);
+		next = next_log_page(sb, curr);
+		if (next < 0)
+			break;
+
+		if (metadata_csum) {
+			alter_curr_page = (struct nova_inode_log_page *)
+						nova_get_block(sb, alter_curr);
+			alter_next = next_log_page(sb, alter_curr);
+			if (alter_next < 0)
+				break;
+		}
+		nova_dbg_verbose("curr 0x%llx, next 0x%llx\n", curr, next);
+		if (curr_page_invalid(sb, pi, sih, curr)) {
+			nova_dbg_verbose("curr page %p invalid\n", curr_page);
+			if (curr == sih->log_head) {
+				/* Free first page later */
+				first_need_free = 1;
+				last_page = curr_page;
+				alter_last_page = alter_curr_page;
+			} else {
+				nova_dbg_verbose("Free log block 0x%llx\n",
+						curr >> PAGE_SHIFT);
+				free_curr_page(sb, sih, curr_page, last_page,
+						curr);
+				if (metadata_csum)
+					free_curr_page(sb, sih, alter_curr_page,
+						alter_last_page, alter_curr);
+			}
+			NOVA_STATS_ADD(fast_gc_pages, 1);
+			freed_pages++;
+		} else {
+			if (found_head == 0) {
+				possible_head = cpu_to_le64(curr);
+				alter_possible_head = cpu_to_le64(alter_curr);
+				found_head = 1;
+			}
+			last_page = curr_page;
+			alter_last_page = alter_curr_page;
+		}
+
+		curr = next;
+		alter_curr = alter_next;
+		checked_pages++;
+		if (curr == 0 || (metadata_csum && alter_curr == 0))
+			break;
+	}
+
+	NOVA_STATS_ADD(fast_checked_pages, checked_pages);
+	nova_dbgv("checked pages %lu, freed %d\n", checked_pages, freed_pages);
+	checked_pages -= freed_pages;
+
+	// TODO:  I think this belongs in nova_extend_inode_log.
+	if (num_pages > 0) {
+		curr = BLOCK_OFF(curr_tail);
+		curr_page = (struct nova_inode_log_page *)
+						  nova_get_block(sb, curr);
+
+		nova_memunlock_block(sb, curr_page);
+		nova_set_next_page_address(sb, curr_page, new_block, 1);
+		nova_memlock_block(sb, curr_page);
+
+		if (metadata_csum) {
+			alter_curr = BLOCK_OFF(sih->alter_log_tail);
+
+			while (next_log_page(sb, alter_curr) > 0)
+				alter_curr = next_log_page(sb, alter_curr);
+
+			alter_curr_page = (struct nova_inode_log_page *)
+				nova_get_block(sb, alter_curr);
+			nova_memunlock_block(sb, alter_curr_page);
+			nova_set_next_page_address(sb, alter_curr_page,
+						   alter_new_block, 1);
+			nova_memlock_block(sb, alter_curr_page);
+		}
+	}
+
+	curr = sih->log_head;
+	alter_curr = sih->alter_log_head;
+
+	nova_memunlock_inode(sb, pi);
+	pi->log_head = possible_head;
+	pi->alter_log_head = alter_possible_head;
+	nova_update_inode_checksum(pi);
+	if (metadata_csum && sih->alter_pi_addr) {
+		alter_pi = (struct nova_inode *)nova_get_block(sb,
+						sih->alter_pi_addr);
+		memcpy_to_pmem_nocache(alter_pi, pi, sizeof(struct nova_inode));
+	}
+	nova_memlock_inode(sb, pi);
+	sih->log_head = possible_head;
+	sih->alter_log_head = alter_possible_head;
+	nova_dbgv("%s: %d new head 0x%llx\n", __func__,
+					found_head, possible_head);
+	sih->log_pages += (num_pages - freed_pages) * num_logs;
+	/* Don't update log tail pointer here */
+	nova_flush_buffer(&pi->log_head, CACHELINE_SIZE, 1);
+
+	if (first_need_free) {
+		nova_dbg_verbose("Free log head block 0x%llx\n",
+					curr >> PAGE_SHIFT);
+		nova_free_log_blocks(sb, sih,
+				nova_get_blocknr(sb, curr, btype), 1);
+		if (metadata_csum)
+			nova_free_log_blocks(sb, sih,
+				nova_get_blocknr(sb, alter_curr, btype), 1);
+	}
+
+	NOVA_END_TIMING(fast_gc_t, gc_time);
+
+	if (sih->num_entries == 0)
+		return 0;
+
+	/* Estimate how many pages worth of valid entries the log contains.
+	 *
+	 * If it is less than half the number of pages that remain in the log,
+	 * compress them with thorough gc.
+	 */
+	blocks = (sih->valid_entries * checked_pages) / sih->num_entries;
+	if ((sih->valid_entries * checked_pages) % sih->num_entries)
+		blocks++;
+
+	if (force_thorough || (blocks && blocks * 2 < checked_pages)) {
+		nova_dbgv("Thorough GC for inode %lu: checked pages %lu, valid pages %lu\n",
+				sih->ino,
+				checked_pages, blocks);
+		blocks = nova_inode_log_thorough_gc(sb, pi, sih,
+							blocks, checked_pages);
+		if (metadata_csum)
+			nova_inode_alter_log_thorough_gc(sb, pi, sih,
+							blocks, checked_pages);
+	}
+
+	return 0;
+}


* [RFC 09/16] NOVA: DAX code
  2017-08-03  7:48 ` Steven Swanson
@ 2017-08-03  7:49   ` Steven Swanson
  -1 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:49 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson

NOVA leverages the kernel's DAX mechanisms for mmap and file data access.  NOVA
maintains a red-black tree in DRAM (nova_inode_info_header.vma_tree) to track
which portions of a file have been mapped.
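
As a rough illustration of what such tracking enables, the sketch below
answers "is this file page currently mapped?", the kind of question the patch
asks through nova_find_pgoff_in_vma().  Everything in it is an assumption made
for the example: struct mapped_range and pgoff_is_mapped() are hypothetical
names, and a plain array scanned linearly stands in for the red-black tree.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one mapping: a run of consecutive file pages. */
struct mapped_range {
	unsigned long pgoff;	/* first file page covered by the mapping */
	unsigned long npages;	/* number of pages in the mapping */
};

/* Return true if file page `pgoff` lies inside any recorded mapping. */
static bool pgoff_is_mapped(const struct mapped_range *ranges, size_t n,
			    unsigned long pgoff)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (pgoff >= ranges[i].pgoff &&
		    pgoff - ranges[i].pgoff < ranges[i].npages)
			return true;
	}
	return false;
}
```

A write path could consult such a check before deciding whether a page can be
protected or relocated safely, since a mapped page may be modified directly by
user space.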

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 fs/nova/dax.c | 1346 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1346 insertions(+)
 create mode 100644 fs/nova/dax.c

diff --git a/fs/nova/dax.c b/fs/nova/dax.c
new file mode 100644
index 000000000000..871b10f1889c
--- /dev/null
+++ b/fs/nova/dax.c
@@ -0,0 +1,1346 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * DAX file operations.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/module.h>
+#include <linux/buffer_head.h>
+#include <linux/cpufeature.h>
+#include <asm/pgtable.h>
+#include <linux/version.h>
+#include "nova.h"
+#include "inode.h"
+
+
+
+static inline int nova_copy_partial_block(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry, unsigned long index,
+	size_t offset, size_t length, void *kmem)
+{
+	void *ptr;
+	int rc = 0;
+	unsigned long nvmm;
+
+	nvmm = get_nvmm(sb, sih, entry, index);
+	ptr = nova_get_block(sb, (nvmm << PAGE_SHIFT));
+
+	if (ptr != NULL) {
+		if (support_clwb)
+			rc = memcpy_mcsafe(kmem + offset, ptr + offset,
+						length);
+		else
+			memcpy_to_pmem_nocache(kmem + offset, ptr + offset,
+						length);
+	}
+
+	/* TODO: If rc < 0, go to MCE data recovery. */
+	return rc;
+}
+
+static inline int nova_handle_partial_block(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry, unsigned long index,
+	size_t offset, size_t length, void *kmem)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_file_write_entry *entryc, entry_copy;
+
+	nova_memunlock_block(sb, kmem);
+	if (entry == NULL) {
+		/* Fill zero */
+		if (support_clwb)
+			memset(kmem + offset, 0, length);
+		else
+			memcpy_to_pmem_nocache(kmem + offset,
+					sbi->zeroed_page, length);
+	} else {
+		/* Copy from original block */
+		if (metadata_csum == 0)
+			entryc = entry;
+		else {
+			entryc = &entry_copy;
+			if (!nova_verify_entry_csum(sb, entry, entryc))
+				return -EIO;
+		}
+
+		nova_copy_partial_block(sb, sih, entryc, index,
+					offset, length, kmem);
+
+	}
+	nova_memlock_block(sb, kmem);
+	if (support_clwb)
+		nova_flush_buffer(kmem + offset, length, 0);
+	return 0;
+}
+
+/*
+ * Fill the new start/end block from original blocks.
+ * Do nothing if fully covered; copy if original blocks present;
+ * Fill zero otherwise.
+ */
+int nova_handle_head_tail_blocks(struct super_block *sb,
+	struct inode *inode, loff_t pos, size_t count, void *kmem)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	size_t offset, eblk_offset;
+	unsigned long start_blk, end_blk, num_blocks;
+	struct nova_file_write_entry *entry;
+	timing_t partial_time;
+	int ret = 0;
+
+	NOVA_START_TIMING(partial_block_t, partial_time);
+	offset = pos & (sb->s_blocksize - 1);
+	num_blocks = ((count + offset - 1) >> sb->s_blocksize_bits) + 1;
+	/* offset in the actual block size block */
+	offset = pos & (nova_inode_blk_size(sih) - 1);
+	start_blk = pos >> sb->s_blocksize_bits;
+	end_blk = start_blk + num_blocks - 1;
+
+	nova_dbg_verbose("%s: %lu blocks\n", __func__, num_blocks);
+	/* We avoid zeroing the alloc'd range, which is going to be overwritten
+	 * by this system call anyway
+	 */
+	nova_dbg_verbose("%s: start offset %lu start blk %lu %p\n", __func__,
+				offset, start_blk, kmem);
+	if (offset != 0) {
+		entry = nova_get_write_entry(sb, sih, start_blk);
+		ret = nova_handle_partial_block(sb, sih, entry,
+						start_blk, 0, offset, kmem);
+		if (ret < 0)
+			return ret;
+	}
+
+	kmem = (void *)((char *)kmem +
+			((num_blocks - 1) << sb->s_blocksize_bits));
+	eblk_offset = (pos + count) & (nova_inode_blk_size(sih) - 1);
+	nova_dbg_verbose("%s: end offset %lu, end blk %lu %p\n", __func__,
+				eblk_offset, end_blk, kmem);
+	if (eblk_offset != 0) {
+		entry = nova_get_write_entry(sb, sih, end_blk);
+
+		ret = nova_handle_partial_block(sb, sih, entry, end_blk,
+						eblk_offset,
+						sb->s_blocksize - eblk_offset,
+						kmem);
+		if (ret < 0)
+			return ret;
+	}
+	NOVA_END_TIMING(partial_block_t, partial_time);
+
+	return ret;
+}
+
+int nova_reassign_file_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 begin_tail)
+{
+	void *addr;
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	u64 curr_p = begin_tail;
+	size_t entry_size = sizeof(struct nova_file_write_entry);
+
+	entryc = (metadata_csum == 0) ? entry : &entry_copy;
+
+	while (curr_p && curr_p != sih->log_tail) {
+		if (is_last_entry(curr_p, entry_size))
+			curr_p = next_log_page(sb, curr_p);
+
+		if (curr_p == 0) {
+			nova_err(sb, "%s: File inode %lu log is NULL!\n",
+				__func__, sih->ino);
+			return -EINVAL;
+		}
+
+		addr = (void *) nova_get_block(sb, curr_p);
+		entry = (struct nova_file_write_entry *) addr;
+
+		if (metadata_csum == 0)
+			entryc = entry;
+		else if (!nova_verify_entry_csum(sb, entry, entryc))
+			return -EIO;
+
+		if (nova_get_entry_type(entryc) != FILE_WRITE) {
+			nova_dbg("%s: entry type is not write? %d\n",
+				__func__, nova_get_entry_type(entry));
+			curr_p += entry_size;
+			continue;
+		}
+
+		nova_assign_write_entry(sb, sih, entry, entryc, true);
+		curr_p += entry_size;
+	}
+
+	return 0;
+}
+
+int nova_cleanup_incomplete_write(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr,
+	int allocated, u64 begin_tail, u64 end_tail)
+{
+	void *addr;
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	u64 curr_p = begin_tail;
+	size_t entry_size = sizeof(struct nova_file_write_entry);
+
+	if (blocknr > 0 && allocated > 0)
+		nova_free_data_blocks(sb, sih, blocknr, allocated);
+
+	if (begin_tail == 0 || end_tail == 0)
+		return 0;
+
+	entryc = (metadata_csum == 0) ? entry : &entry_copy;
+
+	while (curr_p != end_tail) {
+		if (is_last_entry(curr_p, entry_size))
+			curr_p = next_log_page(sb, curr_p);
+
+		if (curr_p == 0) {
+			nova_err(sb, "%s: File inode %lu log is NULL!\n",
+				__func__, sih->ino);
+			return -EINVAL;
+		}
+
+		addr = (void *) nova_get_block(sb, curr_p);
+		entry = (struct nova_file_write_entry *) addr;
+
+		if (metadata_csum == 0)
+			entryc = entry;
+		else {
+			/* skip entry check here as the entry checksum may not
+			 * be updated when this is called
+			 */
+			if (memcpy_mcsafe(entryc, entry,
+					sizeof(struct nova_file_write_entry)))
+				return -EIO;
+		}
+
+		if (nova_get_entry_type(entryc) != FILE_WRITE) {
+			nova_dbg("%s: entry type is not write? %d\n",
+				__func__, nova_get_entry_type(entry));
+			curr_p += entry_size;
+			continue;
+		}
+
+		blocknr = entryc->block >> PAGE_SHIFT;
+		nova_free_data_blocks(sb, sih, blocknr, entryc->num_pages);
+		curr_p += entry_size;
+	}
+
+	return 0;
+}
+
+void nova_init_file_write_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	u64 epoch_id, u64 pgoff, int num_pages, u64 blocknr, u32 time,
+	u64 file_size)
+{
+	memset(entry, 0, sizeof(struct nova_file_write_entry));
+	entry->entry_type = FILE_WRITE;
+	entry->reassigned = 0;
+	entry->updating = 0;
+	entry->epoch_id = epoch_id;
+	entry->trans_id = sih->trans_id;
+	entry->pgoff = cpu_to_le64(pgoff);
+	entry->num_pages = cpu_to_le32(num_pages);
+	entry->invalid_pages = 0;
+	entry->block = cpu_to_le64(nova_get_block_off(sb, blocknr,
+							sih->i_blk_type));
+	entry->mtime = cpu_to_le32(time);
+
+	entry->size = file_size;
+}
+
+int nova_protect_file_data(struct super_block *sb, struct inode *inode,
+	loff_t pos, size_t count, const char __user *buf, unsigned long blocknr,
+	bool inplace)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	size_t offset, eblk_offset, bytes, left;
+	unsigned long start_blk, end_blk, num_blocks, nvmm, nvmmoff;
+	unsigned long blocksize = sb->s_blocksize;
+	unsigned int blocksize_bits = sb->s_blocksize_bits;
+	u8 *blockbuf, *blockptr;
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	bool mapped, nvmm_ok;
+	int ret = 0;
+	timing_t protect_file_data_time, memcpy_time;
+
+	NOVA_START_TIMING(protect_file_data_t, protect_file_data_time);
+
+	offset = pos & (blocksize - 1);
+	num_blocks = ((offset + count - 1) >> blocksize_bits) + 1;
+	start_blk = pos >> blocksize_bits;
+	end_blk = start_blk + num_blocks - 1;
+
+	NOVA_START_TIMING(protect_memcpy_t, memcpy_time);
+	blockbuf = kmalloc(blocksize, GFP_KERNEL);
+	if (blockbuf == NULL) {
+		nova_err(sb, "%s: block buffer allocation error\n", __func__);
+		return -ENOMEM;
+	}
+
+	bytes = blocksize - offset;
+	if (bytes > count)
+		bytes = count;
+
+	left = copy_from_user(blockbuf + offset, buf, bytes);
+	NOVA_END_TIMING(protect_memcpy_t, memcpy_time);
+	if (unlikely(left != 0)) {
+		nova_err(sb, "%s: not all data is copied from user! expect to copy %zu bytes, actually copied %zu bytes\n",
+			 __func__, bytes, bytes - left);
+		ret = -EFAULT;
+		goto out;
+	}
+
+	entryc = (metadata_csum == 0) ? entry : &entry_copy;
+
+	if (offset != 0) {
+		NOVA_STATS_ADD(protect_head, 1);
+		entry = nova_get_write_entry(sb, sih, start_blk);
+		if (entry != NULL) {
+			if (metadata_csum == 0)
+				entryc = entry;
+			else if (!nova_verify_entry_csum(sb, entry, entryc)) {
+				/* free blockbuf and end timing via the out path */
+				ret = -EIO;
+				goto out;
+			}
+
+			/* make sure data in the partial block head is good */
+			nvmm = get_nvmm(sb, sih, entryc, start_blk);
+			nvmmoff = nova_get_block_off(sb, nvmm, sih->i_blk_type);
+			blockptr = (u8 *) nova_get_block(sb, nvmmoff);
+
+			mapped = nova_find_pgoff_in_vma(inode, start_blk);
+			if (data_csum > 0 && !mapped && !inplace) {
+				nvmm_ok = nova_verify_data_csum(sb, sih, nvmm,
+								0, offset);
+				if (!nvmm_ok) {
+					ret = -EIO;
+					goto out;
+				}
+			}
+
+			ret = memcpy_mcsafe(blockbuf, blockptr, offset);
+			if (ret < 0)
+				goto out;
+		} else {
+			memset(blockbuf, 0, offset);
+		}
+
+		/* copying existing checksums from nvmm can be even slower than
+		 * re-computing checksums of a whole block.
+		if (data_csum > 0)
+			nova_copy_partial_block_csum(sb, sih, entry, start_blk,
+							offset, blocknr, false);
+		*/
+	}
+
+	if (num_blocks == 1)
+		goto eblk;
+
+	do {
+		if (inplace)
+			nova_update_block_csum_parity(sb, sih, blockbuf,
+							blocknr, offset, bytes);
+		else
+			nova_update_block_csum_parity(sb, sih, blockbuf,
+							blocknr, 0, blocksize);
+
+		blocknr++;
+		pos += bytes;
+		buf += bytes;
+		count -= bytes;
+		offset = pos & (blocksize - 1);
+
+		bytes = count < blocksize ? count : blocksize;
+		left = copy_from_user(blockbuf, buf, bytes);
+		if (unlikely(left != 0)) {
+			nova_err(sb, "%s: not all data is copied from user!  expect to copy %zu bytes, actually copied %zu bytes\n",
+				 __func__, bytes, bytes - left);
+			ret = -EFAULT;
+			goto out;
+		}
+	} while (count > blocksize);
+
+eblk:
+	eblk_offset = (pos + count) & (blocksize - 1);
+
+	if (eblk_offset != 0) {
+		NOVA_STATS_ADD(protect_tail, 1);
+		entry = nova_get_write_entry(sb, sih, end_blk);
+		if (entry != NULL) {
+			if (metadata_csum == 0)
+				entryc = entry;
+			else if (!nova_verify_entry_csum(sb, entry, entryc)) {
+				/* free blockbuf and end timing via the out path */
+				ret = -EIO;
+				goto out;
+			}
+
+			/* make sure data in the partial block tail is good */
+			nvmm = get_nvmm(sb, sih, entryc, end_blk);
+			nvmmoff = nova_get_block_off(sb, nvmm, sih->i_blk_type);
+			blockptr = (u8 *) nova_get_block(sb, nvmmoff);
+
+			mapped = nova_find_pgoff_in_vma(inode, end_blk);
+			if (data_csum > 0 && !mapped && !inplace) {
+				nvmm_ok = nova_verify_data_csum(sb, sih, nvmm,
+					eblk_offset, blocksize - eblk_offset);
+				if (!nvmm_ok) {
+					ret = -EIO;
+					goto out;
+				}
+			}
+
+			ret = memcpy_mcsafe(blockbuf + eblk_offset,
+						blockptr + eblk_offset,
+						blocksize - eblk_offset);
+			if (ret < 0)
+				goto out;
+		} else {
+			memset(blockbuf + eblk_offset, 0,
+				blocksize - eblk_offset);
+		}
+
+		/* copying existing checksums from nvmm can be even slower than
+		 * re-computing checksums of a whole block.
+		if (data_csum > 0)
+			nova_copy_partial_block_csum(sb, sih, entry, end_blk,
+						eblk_offset, blocknr, true);
+		*/
+	}
+
+	if (inplace)
+		nova_update_block_csum_parity(sb, sih, blockbuf, blocknr,
+							offset, bytes);
+	else
+		nova_update_block_csum_parity(sb, sih, blockbuf, blocknr,
+							0, blocksize);
+
+out:
+	if (blockbuf != NULL)
+		kfree(blockbuf);
+
+	NOVA_END_TIMING(protect_file_data_t, protect_file_data_time);
+
+	return ret;
+}
+
+static bool nova_get_verify_entry(struct super_block *sb,
+	struct nova_file_write_entry *entry,
+	struct nova_file_write_entry *entryc,
+	int locked)
+{
+	int ret = 0;
+
+	if (metadata_csum == 0)
+		return true;
+
+	if (locked == 0) {
+		/* Someone else may be updating the entry. Skip check */
+		ret = memcpy_mcsafe(entryc, entry,
+				sizeof(struct nova_file_write_entry));
+		if (ret < 0)
+			return false;
+
+		return true;
+	}
+
+	return nova_verify_entry_csum(sb, entry, entryc);
+}
+
+/*
+ * Check if there is an existing entry for target page offset.
+ * Used for inplace write, direct IO, DAX-mmap and fallocate.
+ */
+unsigned long nova_check_existing_entry(struct super_block *sb,
+	struct inode *inode, unsigned long num_blocks, unsigned long start_blk,
+	struct nova_file_write_entry **ret_entry,
+	struct nova_file_write_entry *ret_entryc, int check_next, u64 epoch_id,
+	int *inplace, int locked)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc;
+	unsigned long next_pgoff;
+	unsigned long ent_blks = 0;
+	timing_t check_time;
+
+	NOVA_START_TIMING(check_entry_t, check_time);
+
+	*ret_entry = NULL;
+	*inplace = 0;
+	entry = nova_get_write_entry(sb, sih, start_blk);
+
+	entryc = (metadata_csum == 0) ? entry : ret_entryc;
+
+	if (entry) {
+		if (metadata_csum == 0)
+			entryc = entry;
+		else if (!nova_get_verify_entry(sb, entry, entryc, locked))
+			goto out;
+
+		*ret_entry = entry;
+
+		/* We can do inplace write. Find contiguous blocks */
+		if (entryc->reassigned == 0)
+			ent_blks = entryc->num_pages -
+					(start_blk - entryc->pgoff);
+		else
+			ent_blks = 1;
+
+		if (ent_blks > num_blocks)
+			ent_blks = num_blocks;
+
+		if (entryc->epoch_id == epoch_id)
+			*inplace = 1;
+
+	} else if (check_next) {
+		/* Possible Hole */
+		entry = nova_find_next_entry(sb, sih, start_blk);
+		if (entry) {
+			if (metadata_csum == 0)
+				entryc = entry;
+			else if (!nova_get_verify_entry(sb, entry, entryc,
+							locked))
+				goto out;
+
+			next_pgoff = entryc->pgoff;
+			if (next_pgoff <= start_blk) {
+				nova_err(sb, "iblock %lu, entry pgoff %lu, num pages %lu\n",
+				       start_blk, next_pgoff, entry->num_pages);
+				nova_print_inode_log(sb, inode);
+				BUG();
+				ent_blks = num_blocks;
+				goto out;
+			}
+			ent_blks = next_pgoff - start_blk;
+			if (ent_blks > num_blocks)
+				ent_blks = num_blocks;
+		} else {
+			/* File grow */
+			ent_blks = num_blocks;
+		}
+	}
+
+	if (entry && ent_blks == 0) {
+		nova_dbg("%s: %d\n", __func__, check_next);
+		dump_stack();
+	}
+
+out:
+	NOVA_END_TIMING(check_entry_t, check_time);
+	return ent_blks;
+}
+
+ssize_t nova_inplace_file_write(struct file *filp,
+	const char __user *buf,	size_t len, loff_t *ppos)
+{
+	struct address_space *mapping = filp->f_mapping;
+	struct inode	*inode = mapping->host;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pi, inode_copy;
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	struct nova_file_write_entry entry_data;
+	struct nova_inode_update update;
+	ssize_t	    written = 0;
+	loff_t pos;
+	size_t count, offset, copied;
+	unsigned long start_blk, num_blocks, ent_blks = 0;
+	unsigned long total_blocks;
+	unsigned long blocknr = 0;
+	unsigned int data_bits;
+	int allocated = 0;
+	int inplace = 0;
+	bool hole_fill = false;
+	bool update_log = false;
+	void *kmem;
+	u64 blk_off;
+	size_t bytes;
+	long status = 0;
+	timing_t inplace_write_time, memcpy_time;
+	unsigned long step = 0;
+	u64 begin_tail = 0;
+	u64 epoch_id;
+	u64 file_size;
+	u32 time;
+	ssize_t ret;
+
+
+	if (len == 0)
+		return 0;
+
+
+	NOVA_START_TIMING(inplace_write_t, inplace_write_time);
+
+	sb_start_write(inode->i_sb);
+	inode_lock(inode);
+
+	if (!access_ok(VERIFY_READ, buf, len)) {
+		ret = -EFAULT;
+		goto out;
+	}
+	pos = *ppos;
+
+	if (filp->f_flags & O_APPEND)
+		pos = i_size_read(inode);
+
+	count = len;
+
+	pi = nova_get_block(sb, sih->pi_addr);
+
+	/* nova_inode tail pointer will be updated and we make sure all other
+	 * inode fields are good before checksumming the whole structure
+	 */
+	if (nova_check_inode_integrity(sb, sih->ino, sih->pi_addr,
+			sih->alter_pi_addr, &inode_copy, 0) < 0) {
+		ret = -EIO;
+		goto out;
+	}
+
+	offset = pos & (sb->s_blocksize - 1);
+	num_blocks = ((count + offset - 1) >> sb->s_blocksize_bits) + 1;
+	total_blocks = num_blocks;
+
+	/* offset in the actual block size block */
+
+	ret = file_remove_privs(filp);
+	if (ret)
+		goto out;
+
+	inode->i_ctime = inode->i_mtime = current_time(inode);
+	time = current_time(inode).tv_sec;
+
+	epoch_id = nova_get_epoch_id(sb);
+
+	nova_dbgv("%s: epoch_id %llu, inode %lu, offset %lld, count %lu\n",
+			__func__, epoch_id, inode->i_ino, pos, count);
+	update.tail = sih->log_tail;
+	update.alter_tail = sih->alter_log_tail;
+	while (num_blocks > 0) {
+		hole_fill = false;
+		offset = pos & (nova_inode_blk_size(sih) - 1);
+		start_blk = pos >> sb->s_blocksize_bits;
+
+		ent_blks = nova_check_existing_entry(sb, inode, num_blocks,
+						start_blk, &entry, &entry_copy,
+						1, epoch_id, &inplace, 1);
+
+		entryc = (metadata_csum == 0) ? entry : &entry_copy;
+
+		if (entry && inplace) {
+			/* We can do inplace write. Find contiguous blocks */
+			blocknr = get_nvmm(sb, sih, entryc, start_blk);
+			blk_off = blocknr << PAGE_SHIFT;
+			allocated = ent_blks;
+			if (data_csum || data_parity)
+				nova_set_write_entry_updating(sb, entry, 1);
+		} else {
+			/* Allocate blocks to fill hole */
+			allocated = nova_new_data_blocks(sb, sih, &blocknr,
+					 start_blk, ent_blks, ALLOC_NO_INIT,
+					 ANY_CPU, ALLOC_FROM_HEAD);
+
+			nova_dbg_verbose("%s: alloc %d blocks @ %lu\n",
+						__func__, allocated, blocknr);
+
+			if (allocated <= 0) {
+				nova_dbg("%s alloc blocks failed!, %d\n",
+							__func__, allocated);
+				ret = allocated;
+				goto out;
+			}
+
+			hole_fill = true;
+			blk_off = nova_get_block_off(sb, blocknr,
+							sih->i_blk_type);
+		}
+
+		step++;
+		bytes = sb->s_blocksize * allocated - offset;
+		if (bytes > count)
+			bytes = count;
+
+		kmem = nova_get_block(inode->i_sb, blk_off);
+
+		if (hole_fill &&
+		    (offset || ((offset + bytes) & (PAGE_SIZE - 1)) != 0)) {
+			ret =  nova_handle_head_tail_blocks(sb, inode,
+							    pos, bytes, kmem);
+			if (ret)
+				goto out;
+
+		}
+
+		/* Now copy from user buf */
+//		nova_dbg("Write: %p\n", kmem);
+		NOVA_START_TIMING(memcpy_w_nvmm_t, memcpy_time);
+		nova_memunlock_range(sb, kmem + offset, bytes);
+		copied = bytes - memcpy_to_pmem_nocache(kmem + offset,
+						buf, bytes);
+		nova_memlock_range(sb, kmem + offset, bytes);
+		NOVA_END_TIMING(memcpy_w_nvmm_t, memcpy_time);
+
+		if (data_csum > 0 || data_parity > 0) {
+			ret = nova_protect_file_data(sb, inode, pos, bytes,
+						buf, blocknr, !hole_fill);
+			if (ret)
+				goto out;
+		}
+
+		if (pos + copied > inode->i_size)
+			file_size = cpu_to_le64(pos + copied);
+		else
+			file_size = cpu_to_le64(inode->i_size);
+
+		/* Handle hole fill write */
+		if (hole_fill) {
+			nova_init_file_write_entry(sb, sih, &entry_data,
+						epoch_id, start_blk, allocated,
+						blocknr, time, file_size);
+
+			ret = nova_append_file_write_entry(sb, pi, inode,
+						&entry_data, &update);
+			if (ret) {
+				nova_dbg("%s: append inode entry failed\n",
+								__func__);
+				ret = -ENOSPC;
+				goto out;
+			}
+		} else {
+			/* Update existing entry */
+			struct nova_log_entry_info entry_info;
+
+			entry_info.type = FILE_WRITE;
+			entry_info.epoch_id = epoch_id;
+			entry_info.trans_id = sih->trans_id;
+			entry_info.time = time;
+			entry_info.file_size = file_size;
+			entry_info.inplace = 1;
+
+			nova_inplace_update_write_entry(sb, inode, entry,
+							&entry_info);
+		}
+
+		nova_dbgv("Write: %p, %lu\n", kmem, copied);
+		if (copied > 0) {
+			status = copied;
+			written += copied;
+			pos += copied;
+			buf += copied;
+			count -= copied;
+			num_blocks -= allocated;
+		}
+		if (unlikely(copied != bytes)) {
+			nova_dbg("%s ERROR!: %p, bytes %lu, copied %lu\n",
+				__func__, kmem, bytes, copied);
+			if (status >= 0)
+				status = -EFAULT;
+		}
+		if (status < 0)
+			break;
+
+		if (hole_fill) {
+			update_log = true;
+			if (begin_tail == 0)
+				begin_tail = update.curr_entry;
+		}
+	}
+
+	data_bits = blk_type_to_shift[sih->i_blk_type];
+	sih->i_blocks += (total_blocks << (data_bits - sb->s_blocksize_bits));
+
+	inode->i_blocks = sih->i_blocks;
+
+	if (update_log) {
+		nova_memunlock_inode(sb, pi);
+		nova_update_inode(sb, inode, pi, &update, 1);
+		nova_memlock_inode(sb, pi);
+		NOVA_STATS_ADD(inplace_new_blocks, 1);
+
+		/* Update file tree */
+		ret = nova_reassign_file_tree(sb, sih, begin_tail);
+		if (ret)
+			goto out;
+	}
+
+	ret = written;
+	NOVA_STATS_ADD(inplace_write_breaks, step);
+	nova_dbgv("blocks: %lu, %lu\n", inode->i_blocks, sih->i_blocks);
+
+	*ppos = pos;
+	if (pos > inode->i_size) {
+		i_size_write(inode, pos);
+		sih->i_size = pos;
+	}
+
+	sih->trans_id++;
+out:
+	if (ret < 0)
+		nova_cleanup_incomplete_write(sb, sih, blocknr, allocated,
+						begin_tail, update.tail);
+
+	inode_unlock(inode);
+	sb_end_write(inode->i_sb);
+	NOVA_END_TIMING(inplace_write_t, inplace_write_time);
+	NOVA_STATS_ADD(inplace_write_bytes, written);
+	return ret;
+}
+
+/* Check if existing entry overlap with vma regions */
+int nova_check_overlap_vmas(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	unsigned long pgoff, unsigned long num_pages)
+{
+	unsigned long start_pgoff = 0;
+	unsigned long num = 0;
+	unsigned long i;
+	struct vma_item *item;
+	struct rb_node *temp;
+	int ret = 0;
+
+	if (sih->num_vmas == 0)
+		return 0;
+
+	temp = rb_first(&sih->vma_tree);
+	while (temp) {
+		item = container_of(temp, struct vma_item, node);
+		temp = rb_next(temp);
+		ret = nova_get_vma_overlap_range(sb, sih, item->vma, pgoff,
+					num_pages, &start_pgoff, &num);
+		if (ret) {
+			for (i = 0; i < num; i++) {
+				if (nova_get_write_entry(sb, sih,
+							start_pgoff + i))
+					return 1;
+			}
+		}
+	}
+
+	return 0;
+}
+
+
+/*
+ * return > 0, # of blocks mapped or allocated.
+ * return = 0, if plain lookup failed.
+ * return < 0, error case.
+ */
+int nova_dax_get_blocks(struct inode *inode, sector_t iblock,
+	unsigned long max_blocks, u32 *bno, bool *new, bool *boundary,
+	int create, bool taking_lock)
+{
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pi;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_file_write_entry *entry = NULL;
+	struct nova_file_write_entry *entryc, entry_copy;
+	struct nova_file_write_entry entry_data;
+	struct nova_inode_update update;
+	u32 time;
+	unsigned int data_bits;
+	unsigned long nvmm = 0;
+	unsigned long blocknr = 0;
+	u64 epoch_id;
+	int num_blocks = 0;
+	int inplace = 0;
+	int allocated = 0;
+	int locked = 0;
+	int check_next = 1;
+	int ret = 0;
+	timing_t get_block_time;
+
+
+	if (max_blocks == 0)
+		return 0;
+
+	NOVA_START_TIMING(dax_get_block_t, get_block_time);
+
+	nova_dbgv("%s: pgoff %lu, num %lu, create %d\n",
+				__func__, iblock, max_blocks, create);
+
+	epoch_id = nova_get_epoch_id(sb);
+
+	if (taking_lock)
+		check_next = 0;
+
+again:
+	num_blocks = nova_check_existing_entry(sb, inode, max_blocks,
+					iblock, &entry, &entry_copy, check_next,
+					epoch_id, &inplace, locked);
+
+	entryc = (metadata_csum == 0) ? entry : &entry_copy;
+
+	if (entry) {
+		if (create == 0 || inplace) {
+			nvmm = get_nvmm(sb, sih, entryc, iblock);
+			nova_dbgv("%s: found pgoff %lu, block %lu\n",
+					__func__, iblock, nvmm);
+			goto out;
+		}
+	}
+
+	if (create == 0) {
+		num_blocks = 0;
+		goto out1;
+	}
+
+	if (taking_lock && locked == 0) {
+		inode_lock(inode);
+		locked = 1;
+		/* Check again in case someone has done it for us */
+		check_next = 1;
+		goto again;
+	}
+
+	pi = nova_get_inode(sb, inode);
+	inode->i_ctime = inode->i_mtime = current_time(inode);
+	time = current_time(inode).tv_sec;
+	update.tail = sih->log_tail;
+	update.alter_tail = sih->alter_log_tail;
+
+	/* Return initialized blocks to the user */
+	allocated = nova_new_data_blocks(sb, sih, &blocknr, iblock,
+				 num_blocks, ALLOC_INIT_ZERO, ANY_CPU,
+				 ALLOC_FROM_HEAD);
+	if (allocated <= 0) {
+		nova_dbgv("%s alloc blocks failed %d\n", __func__,
+							allocated);
+		ret = allocated;
+		goto out;
+	}
+
+	num_blocks = allocated;
+	/* Do not extend file size */
+	nova_init_file_write_entry(sb, sih, &entry_data,
+					epoch_id, iblock, num_blocks,
+					blocknr, time, inode->i_size);
+
+	ret = nova_append_file_write_entry(sb, pi, inode,
+				&entry_data, &update);
+	if (ret) {
+		nova_dbgv("%s: append inode entry failed\n", __func__);
+		ret = -ENOSPC;
+		goto out;
+	}
+
+	nvmm = blocknr;
+	data_bits = blk_type_to_shift[sih->i_blk_type];
+	sih->i_blocks += (num_blocks << (data_bits - sb->s_blocksize_bits));
+
+	nova_memunlock_inode(sb, pi);
+	nova_update_inode(sb, inode, pi, &update, 1);
+	nova_memlock_inode(sb, pi);
+
+	ret = nova_reassign_file_tree(sb, sih, update.curr_entry);
+	if (ret) {
+		nova_dbgv("%s: nova_reassign_file_tree failed: %d\n",
+			  __func__,  ret);
+		goto out;
+	}
+	inode->i_blocks = sih->i_blocks;
+	sih->trans_id++;
+	NOVA_STATS_ADD(dax_new_blocks, 1);
+
+//	set_buffer_new(bh);
+out:
+	if (ret < 0) {
+		nova_cleanup_incomplete_write(sb, sih, blocknr, allocated,
+						0, update.tail);
+		num_blocks = ret;
+		goto out1;
+	}
+
+	*bno = nvmm;
+//	if (num_blocks > 1)
+//		bh->b_size = sb->s_blocksize * num_blocks;
+
+out1:
+	if (taking_lock && locked)
+		inode_unlock(inode);
+
+	NOVA_END_TIMING(dax_get_block_t, get_block_time);
+	return num_blocks;
+}
+
+int nova_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
+	unsigned int flags, struct iomap *iomap, bool taking_lock)
+{
+	struct nova_sb_info *sbi = NOVA_SB(inode->i_sb);
+	unsigned int blkbits = inode->i_blkbits;
+	unsigned long first_block = offset >> blkbits;
+	unsigned long max_blocks = (length + (1 << blkbits) - 1) >> blkbits;
+	bool new = false, boundary = false;
+	u32 bno;
+	int ret;
+
+	ret = nova_dax_get_blocks(inode, first_block, max_blocks, &bno, &new,
+				  &boundary, flags & IOMAP_WRITE, taking_lock);
+	if (ret < 0) {
+		nova_dbgv("%s: nova_dax_get_blocks failed %d", __func__, ret);
+		return ret;
+	}
+
+	iomap->flags = 0;
+	iomap->bdev = inode->i_sb->s_bdev;
+	iomap->dax_dev = sbi->s_dax_dev;
+	iomap->offset = (u64)first_block << blkbits;
+
+	if (ret == 0) {
+		iomap->type = IOMAP_HOLE;
+		iomap->blkno = IOMAP_NULL_BLOCK;
+		iomap->length = 1 << blkbits;
+	} else {
+		iomap->type = IOMAP_MAPPED;
+		iomap->blkno = (sector_t)bno << (blkbits - 9);
+		iomap->length = (u64)ret << blkbits;
+		iomap->flags |= IOMAP_F_MERGED;
+	}
+
+	if (new)
+		iomap->flags |= IOMAP_F_NEW;
+	return 0;
+}
+
+int nova_iomap_end(struct inode *inode, loff_t offset, loff_t length,
+	ssize_t written, unsigned int flags, struct iomap *iomap)
+{
+	if (iomap->type == IOMAP_MAPPED &&
+			written < length &&
+			(flags & IOMAP_WRITE))
+		truncate_pagecache(inode, inode->i_size);
+	return 0;
+}
+
+
+static int nova_iomap_begin_lock(struct inode *inode, loff_t offset,
+	loff_t length, unsigned int flags, struct iomap *iomap)
+{
+	return nova_iomap_begin(inode, offset, length, flags, iomap, true);
+}
+
+static struct iomap_ops nova_iomap_ops_lock = {
+	.iomap_begin	= nova_iomap_begin_lock,
+	.iomap_end	= nova_iomap_end,
+};
+
+
+static int nova_dax_huge_fault(struct vm_fault *vmf,
+			      enum page_entry_size pe_size)
+{
+	int ret = 0;
+	timing_t fault_time;
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	struct inode *inode = mapping->host;
+
+	NOVA_START_TIMING(pmd_fault_t, fault_time);
+
+	nova_dbgv("%s: inode %lu, pgoff %lu\n",
+		  __func__, inode->i_ino, vmf->pgoff);
+
+	ret = dax_iomap_fault(vmf, pe_size, &nova_iomap_ops_lock);
+
+	NOVA_END_TIMING(pmd_fault_t, fault_time);
+	return ret;
+}
+
+static int nova_dax_fault(struct vm_fault *vmf)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	struct inode *inode = mapping->host;
+
+	nova_dbgv("%s: inode %lu, pgoff %lu\n",
+		  __func__, inode->i_ino, vmf->pgoff);
+
+	return nova_dax_huge_fault(vmf, PE_SIZE_PTE);
+}
+
+static int nova_dax_pfn_mkwrite(struct vm_fault *vmf)
+{
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	loff_t size;
+	int ret = 0;
+	timing_t fault_time;
+
+	NOVA_START_TIMING(pfn_mkwrite_t, fault_time);
+
+	inode_lock(inode);
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size)
+		ret = VM_FAULT_SIGBUS;
+	else
+		ret = dax_pfn_mkwrite(vmf);
+	inode_unlock(inode);
+
+	NOVA_END_TIMING(pfn_mkwrite_t, fault_time);
+	return ret;
+}
+
+static inline int nova_rbtree_compare_vma(struct vma_item *curr,
+	struct vm_area_struct *vma)
+{
+	if (vma < curr->vma)
+		return -1;
+	if (vma > curr->vma)
+		return 1;
+
+	return 0;
+}
+
+static int nova_append_write_mmap_to_log(struct super_block *sb,
+	struct inode *inode, struct vma_item *item)
+{
+	struct vm_area_struct *vma = item->vma;
+	struct nova_inode *pi;
+	struct nova_mmap_entry data;
+	struct nova_inode_update update;
+	unsigned long num_pages;
+	u64 epoch_id;
+	int ret;
+
+	/* Only for csum and parity update */
+	if (data_csum == 0 && data_parity == 0)
+		return 0;
+
+	pi = nova_get_inode(sb, inode);
+	epoch_id = nova_get_epoch_id(sb);
+	update.tail = update.alter_tail = 0;
+
+	memset(&data, 0, sizeof(struct nova_mmap_entry));
+	data.entry_type = MMAP_WRITE;
+	data.epoch_id = epoch_id;
+	data.pgoff = cpu_to_le64(vma->vm_pgoff);
+	num_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+	data.num_pages = cpu_to_le64(num_pages);
+	data.invalid = 0;
+
+	nova_dbgv("%s : Appending mmap log entry for inode %lu, pgoff %llu, %llu pages\n",
+			__func__, inode->i_ino,
+			data.pgoff, data.num_pages);
+
+	ret = nova_append_mmap_entry(sb, pi, inode, &data, &update, item);
+	if (ret) {
+		nova_dbg("%s: append write mmap entry failure\n", __func__);
+		goto out;
+	}
+
+	nova_memunlock_inode(sb, pi);
+	nova_update_inode(sb, inode, pi, &update, 1);
+	nova_memlock_inode(sb, pi);
+out:
+	return ret;
+}
+
+int nova_insert_write_vma(struct vm_area_struct *vma)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct inode *inode = mapping->host;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct super_block *sb = inode->i_sb;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	unsigned long flags = VM_SHARED | VM_WRITE;
+	struct vma_item *item, *curr;
+	struct rb_node **temp, *parent;
+	int compVal;
+	int insert = 0;
+	int ret;
+	timing_t insert_vma_time;
+
+
+	if ((vma->vm_flags & flags) != flags)
+		return 0;
+
+	NOVA_START_TIMING(insert_vma_t, insert_vma_time);
+
+	item = nova_alloc_vma_item(sb);
+	if (!item) {
+		NOVA_END_TIMING(insert_vma_t, insert_vma_time);
+		return -ENOMEM;
+	}
+
+	item->vma = vma;
+
+	nova_dbgv("Inode %lu insert vma %p, start 0x%lx, end 0x%lx, pgoff %lu\n",
+			inode->i_ino, vma, vma->vm_start, vma->vm_end,
+			vma->vm_pgoff);
+
+	inode_lock(inode);
+
+	/* Append to log */
+	ret = nova_append_write_mmap_to_log(sb, inode, item);
+	if (ret) {
+		/* item was never linked into vma_tree; do not leak it */
+		nova_free_vma_item(sb, item);
+		goto out;
+	}
+
+	temp = &(sih->vma_tree.rb_node);
+	parent = NULL;
+
+	while (*temp) {
+		curr = container_of(*temp, struct vma_item, node);
+		compVal = nova_rbtree_compare_vma(curr, vma);
+		parent = *temp;
+
+		if (compVal == -1) {
+			temp = &((*temp)->rb_left);
+		} else if (compVal == 1) {
+			temp = &((*temp)->rb_right);
+		} else {
+			nova_dbg("%s: vma %p already exists\n",
+				__func__, vma);
+			nova_free_vma_item(sb, item);
+			goto out;
+		}
+	}
+
+	rb_link_node(&item->node, parent, temp);
+	rb_insert_color(&item->node, &sih->vma_tree);
+
+	sih->num_vmas++;
+	if (sih->num_vmas == 1)
+		insert = 1;
+
+	sih->trans_id++;
+out:
+	inode_unlock(inode);
+
+	if (insert) {
+		mutex_lock(&sbi->vma_mutex);
+		list_add_tail(&sih->list, &sbi->mmap_sih_list);
+		mutex_unlock(&sbi->vma_mutex);
+	}
+
+	NOVA_END_TIMING(insert_vma_t, insert_vma_time);
+	return ret;
+}
+
+static int nova_remove_write_vma(struct vm_area_struct *vma)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct inode *inode = mapping->host;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct super_block *sb = inode->i_sb;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct vma_item *curr = NULL;
+	struct rb_node *temp;
+	int compVal;
+	int found = 0;
+	int remove = 0;
+	timing_t remove_vma_time;
+
+
+	NOVA_START_TIMING(remove_vma_t, remove_vma_time);
+	inode_lock(inode);
+
+	temp = sih->vma_tree.rb_node;
+	while (temp) {
+		curr = container_of(temp, struct vma_item, node);
+		compVal = nova_rbtree_compare_vma(curr, vma);
+
+		if (compVal == -1) {
+			temp = temp->rb_left;
+		} else if (compVal == 1) {
+			temp = temp->rb_right;
+		} else {
+			nova_reset_vma_csum_parity(sb, curr);
+			rb_erase(&curr->node, &sih->vma_tree);
+			found = 1;
+			break;
+		}
+	}
+
+	if (found) {
+		sih->num_vmas--;
+		if (sih->num_vmas == 0)
+			remove = 1;
+	}
+
+	inode_unlock(inode);
+
+	if (found) {
+		nova_dbgv("Inode %lu remove vma %p, start 0x%lx, end 0x%lx, pgoff %lu\n",
+			  inode->i_ino,	curr->vma, curr->vma->vm_start,
+			  curr->vma->vm_end, curr->vma->vm_pgoff);
+		nova_free_vma_item(sb, curr);
+	}
+
+	if (remove) {
+		mutex_lock(&sbi->vma_mutex);
+		list_del(&sih->list);
+		mutex_unlock(&sbi->vma_mutex);
+	}
+
+	NOVA_END_TIMING(remove_vma_t, remove_vma_time);
+	return 0;
+}
+
+static int nova_restore_page_write(struct vm_area_struct *vma,
+	unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+
+
+	down_write(&mm->mmap_sem);
+
+	nova_dbgv("Restore vma %p write, start 0x%lx, end 0x%lx, address 0x%lx\n",
+		  vma, vma->vm_start, vma->vm_end, address);
+
+	/* Restore single page write */
+	nova_mmap_to_new_blocks(vma, address);
+
+	up_write(&mm->mmap_sem);
+
+	return 0;
+}
+
+static void nova_vma_open(struct vm_area_struct *vma)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct inode *inode = mapping->host;
+
+	nova_dbg_mmap4k("[%s:%d] inode %lu, MMAP 4KPAGE vm_start(0x%lx), vm_end(0x%lx), vm pgoff %lu, %lu blocks, vm_flags(0x%lx), vm_page_prot(0x%lx)\n",
+			__func__, __LINE__,
+			inode->i_ino, vma->vm_start, vma->vm_end,
+			vma->vm_pgoff,
+			(vma->vm_end - vma->vm_start) >> PAGE_SHIFT,
+			vma->vm_flags,
+			pgprot_val(vma->vm_page_prot));
+
+	nova_insert_write_vma(vma);
+}
+
+static void nova_vma_close(struct vm_area_struct *vma)
+{
+	nova_dbgv("[%s:%d] MMAP 4KPAGE vm_start(0x%lx), vm_end(0x%lx), vm_flags(0x%lx), vm_page_prot(0x%lx)\n",
+		  __func__, __LINE__, vma->vm_start, vma->vm_end,
+		  vma->vm_flags, pgprot_val(vma->vm_page_prot));
+
+	vma->original_write = 0;
+	nova_remove_write_vma(vma);
+}
+
+const struct vm_operations_struct nova_dax_vm_ops = {
+	.fault	= nova_dax_fault,
+	.huge_fault = nova_dax_huge_fault,
+	.page_mkwrite = nova_dax_fault,
+	.pfn_mkwrite = nova_dax_pfn_mkwrite,
+	.open = nova_vma_open,
+	.close = nova_vma_close,
+	.dax_cow = nova_restore_page_write,
+};
+


* [RFC 09/16] NOVA: DAX code
@ 2017-08-03  7:49   ` Steven Swanson
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:49 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson, dan.j.williams

NOVA leverages the kernel's DAX mechanisms for mmap and file data access.  NOVA
maintains a red-black tree in DRAM (nova_inode_info_header.vma_tree) to track
which portions of a file have been mapped.
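
As a rough illustration of the range bookkeeping this implies, the userspace C
sketch below checks whether a run of file pages intersects any tracked mapping.
It is not taken from the patch: struct mapped_range, range_overlap() and
any_mapping_overlaps() are hypothetical names, and a flat array stands in for
the red-black tree purely to keep the example short.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one tracked mapping: a run of file pages. */
struct mapped_range {
	unsigned long pgoff;      /* first file page covered by the mapping */
	unsigned long num_pages;  /* number of pages in the mapping */
};

/*
 * Intersect the query [pgoff, pgoff + num_pages) with a single mapping.
 * On overlap, return true and report the shared run through start/num.
 */
static bool range_overlap(const struct mapped_range *m,
	unsigned long pgoff, unsigned long num_pages,
	unsigned long *start, unsigned long *num)
{
	unsigned long m_end = m->pgoff + m->num_pages;
	unsigned long q_end = pgoff + num_pages;
	unsigned long lo = pgoff > m->pgoff ? pgoff : m->pgoff;
	unsigned long hi = q_end < m_end ? q_end : m_end;

	if (lo >= hi)
		return false;

	*start = lo;
	*num = hi - lo;
	return true;
}

/* Walk every tracked mapping and stop at the first one that overlaps. */
static bool any_mapping_overlaps(const struct mapped_range *maps, size_t n,
	unsigned long pgoff, unsigned long num_pages)
{
	unsigned long start, num;
	size_t i;

	for (i = 0; i < n; i++) {
		if (range_overlap(&maps[i], pgoff, num_pages, &start, &num))
			return true;
	}

	return false;
}
```

A write path could call any_mapping_overlaps() before choosing between an
in-place update and copy-on-write.  The patch's nova_check_overlap_vmas()
appears to go one step further and also checks that a write entry exists for
each overlapping page.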

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 fs/nova/dax.c | 1346 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1346 insertions(+)
 create mode 100644 fs/nova/dax.c

diff --git a/fs/nova/dax.c b/fs/nova/dax.c
new file mode 100644
index 000000000000..871b10f1889c
--- /dev/null
+++ b/fs/nova/dax.c
@@ -0,0 +1,1346 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * DAX file operations.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/module.h>
+#include <linux/buffer_head.h>
+#include <linux/cpufeature.h>
+#include <asm/pgtable.h>
+#include <linux/version.h>
+#include "nova.h"
+#include "inode.h"
+
+
+
+static inline int nova_copy_partial_block(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry, unsigned long index,
+	size_t offset, size_t length, void *kmem)
+{
+	void *ptr;
+	int rc = 0;
+	unsigned long nvmm;
+
+	nvmm = get_nvmm(sb, sih, entry, index);
+	ptr = nova_get_block(sb, (nvmm << PAGE_SHIFT));
+
+	if (ptr != NULL) {
+		if (support_clwb)
+			rc = memcpy_mcsafe(kmem + offset, ptr + offset,
+						length);
+		else
+			memcpy_to_pmem_nocache(kmem + offset, ptr + offset,
+						length);
+	}
+
+	/* TODO: If rc < 0, go to MCE data recovery. */
+	return rc;
+}
+
+static inline int nova_handle_partial_block(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry, unsigned long index,
+	size_t offset, size_t length, void *kmem)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_file_write_entry *entryc, entry_copy;
+
+	nova_memunlock_block(sb, kmem);
+	if (entry == NULL) {
+		/* Fill zero */
+		if (support_clwb)
+			memset(kmem + offset, 0, length);
+		else
+			memcpy_to_pmem_nocache(kmem + offset,
+					sbi->zeroed_page, length);
+	} else {
+		/* Copy from original block */
+		if (metadata_csum == 0)
+			entryc = entry;
+		else {
+			entryc = &entry_copy;
+			if (!nova_verify_entry_csum(sb, entry, entryc)) {
+				/* re-protect the block before bailing out */
+				nova_memlock_block(sb, kmem);
+				return -EIO;
+			}
+		}
+
+		nova_copy_partial_block(sb, sih, entryc, index,
+					offset, length, kmem);
+
+	}
+	nova_memlock_block(sb, kmem);
+	if (support_clwb)
+		nova_flush_buffer(kmem + offset, length, 0);
+	return 0;
+}
+
+/*
+ * Fill the new start/end block from original blocks.
+ * Do nothing if fully covered; copy if original blocks present;
+ * Fill zero otherwise.
+ */
+int nova_handle_head_tail_blocks(struct super_block *sb,
+	struct inode *inode, loff_t pos, size_t count, void *kmem)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	size_t offset, eblk_offset;
+	unsigned long start_blk, end_blk, num_blocks;
+	struct nova_file_write_entry *entry;
+	timing_t partial_time;
+	int ret = 0;
+
+	NOVA_START_TIMING(partial_block_t, partial_time);
+	offset = pos & (sb->s_blocksize - 1);
+	num_blocks = ((count + offset - 1) >> sb->s_blocksize_bits) + 1;
+	/* offset in the actual block size block */
+	offset = pos & (nova_inode_blk_size(sih) - 1);
+	start_blk = pos >> sb->s_blocksize_bits;
+	end_blk = start_blk + num_blocks - 1;
+
+	nova_dbg_verbose("%s: %lu blocks\n", __func__, num_blocks);
+	/* We avoid zeroing the alloc'd range, which is going to be overwritten
+	 * by this system call anyway
+	 */
+	nova_dbg_verbose("%s: start offset %lu start blk %lu %p\n", __func__,
+				offset, start_blk, kmem);
+	if (offset != 0) {
+		entry = nova_get_write_entry(sb, sih, start_blk);
+		ret = nova_handle_partial_block(sb, sih, entry,
+						start_blk, 0, offset, kmem);
+		if (ret < 0)
+			return ret;
+	}
+
+	kmem = (void *)((char *)kmem +
+			((num_blocks - 1) << sb->s_blocksize_bits));
+	eblk_offset = (pos + count) & (nova_inode_blk_size(sih) - 1);
+	nova_dbg_verbose("%s: end offset %lu, end blk %lu %p\n", __func__,
+				eblk_offset, end_blk, kmem);
+	if (eblk_offset != 0) {
+		entry = nova_get_write_entry(sb, sih, end_blk);
+
+		ret = nova_handle_partial_block(sb, sih, entry, end_blk,
+						eblk_offset,
+						sb->s_blocksize - eblk_offset,
+						kmem);
+		if (ret < 0)
+			return ret;
+	}
+	NOVA_END_TIMING(partial_block_t, partial_time);
+
+	return ret;
+}
+
+int nova_reassign_file_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 begin_tail)
+{
+	void *addr;
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	u64 curr_p = begin_tail;
+	size_t entry_size = sizeof(struct nova_file_write_entry);
+
+	/* entry is not assigned yet; it is re-pointed per entry in the loop */
+	entryc = &entry_copy;
+
+	while (curr_p && curr_p != sih->log_tail) {
+		if (is_last_entry(curr_p, entry_size))
+			curr_p = next_log_page(sb, curr_p);
+
+		if (curr_p == 0) {
+			nova_err(sb, "%s: File inode %lu log is NULL!\n",
+				__func__, sih->ino);
+			return -EINVAL;
+		}
+
+		addr = (void *) nova_get_block(sb, curr_p);
+		entry = (struct nova_file_write_entry *) addr;
+
+		if (metadata_csum == 0)
+			entryc = entry;
+		else if (!nova_verify_entry_csum(sb, entry, entryc))
+			return -EIO;
+
+		if (nova_get_entry_type(entryc) != FILE_WRITE) {
+			nova_dbg("%s: entry type is not write? %d\n",
+				__func__, nova_get_entry_type(entry));
+			curr_p += entry_size;
+			continue;
+		}
+
+		nova_assign_write_entry(sb, sih, entry, entryc, true);
+		curr_p += entry_size;
+	}
+
+	return 0;
+}
+
+int nova_cleanup_incomplete_write(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr,
+	int allocated, u64 begin_tail, u64 end_tail)
+{
+	void *addr;
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	u64 curr_p = begin_tail;
+	size_t entry_size = sizeof(struct nova_file_write_entry);
+
+	if (blocknr > 0 && allocated > 0)
+		nova_free_data_blocks(sb, sih, blocknr, allocated);
+
+	if (begin_tail == 0 || end_tail == 0)
+		return 0;
+
+	/* entry is not assigned yet; it is re-pointed per entry in the loop */
+	entryc = &entry_copy;
+
+	while (curr_p != end_tail) {
+		if (is_last_entry(curr_p, entry_size))
+			curr_p = next_log_page(sb, curr_p);
+
+		if (curr_p == 0) {
+			nova_err(sb, "%s: File inode %lu log is NULL!\n",
+				__func__, sih->ino);
+			return -EINVAL;
+		}
+
+		addr = (void *) nova_get_block(sb, curr_p);
+		entry = (struct nova_file_write_entry *) addr;
+
+		if (metadata_csum == 0)
+			entryc = entry;
+		else {
+			/* skip entry check here as the entry checksum may not
+			 * be updated when this is called
+			 */
+			if (memcpy_mcsafe(entryc, entry,
+					sizeof(struct nova_file_write_entry)))
+				return -EIO;
+		}
+
+		if (nova_get_entry_type(entryc) != FILE_WRITE) {
+			nova_dbg("%s: entry type is not write? %d\n",
+				__func__, nova_get_entry_type(entry));
+			curr_p += entry_size;
+			continue;
+		}
+
+		blocknr = entryc->block >> PAGE_SHIFT;
+		nova_free_data_blocks(sb, sih, blocknr, entryc->num_pages);
+		curr_p += entry_size;
+	}
+
+	return 0;
+}
+
+void nova_init_file_write_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	u64 epoch_id, u64 pgoff, int num_pages, u64 blocknr, u32 time,
+	u64 file_size)
+{
+	memset(entry, 0, sizeof(struct nova_file_write_entry));
+	entry->entry_type = FILE_WRITE;
+	entry->reassigned = 0;
+	entry->updating = 0;
+	entry->epoch_id = epoch_id;
+	entry->trans_id = sih->trans_id;
+	entry->pgoff = cpu_to_le64(pgoff);
+	entry->num_pages = cpu_to_le32(num_pages);
+	entry->invalid_pages = 0;
+	entry->block = cpu_to_le64(nova_get_block_off(sb, blocknr,
+							sih->i_blk_type));
+	entry->mtime = cpu_to_le32(time);
+
+	entry->size = file_size;
+}
+
+int nova_protect_file_data(struct super_block *sb, struct inode *inode,
+	loff_t pos, size_t count, const char __user *buf, unsigned long blocknr,
+	bool inplace)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	size_t offset, eblk_offset, bytes, left;
+	unsigned long start_blk, end_blk, num_blocks, nvmm, nvmmoff;
+	unsigned long blocksize = sb->s_blocksize;
+	unsigned int blocksize_bits = sb->s_blocksize_bits;
+	u8 *blockbuf, *blockptr;
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	bool mapped, nvmm_ok;
+	int ret = 0;
+	timing_t protect_file_data_time, memcpy_time;
+
+	NOVA_START_TIMING(protect_file_data_t, protect_file_data_time);
+
+	offset = pos & (blocksize - 1);
+	num_blocks = ((offset + count - 1) >> blocksize_bits) + 1;
+	start_blk = pos >> blocksize_bits;
+	end_blk = start_blk + num_blocks - 1;
+
+	NOVA_START_TIMING(protect_memcpy_t, memcpy_time);
+	blockbuf = kmalloc(blocksize, GFP_KERNEL);
+	if (blockbuf == NULL) {
+		nova_err(sb, "%s: block buffer allocation error\n", __func__);
+		return -ENOMEM;
+	}
+
+	bytes = blocksize - offset;
+	if (bytes > count)
+		bytes = count;
+
+	left = copy_from_user(blockbuf + offset, buf, bytes);
+	NOVA_END_TIMING(protect_memcpy_t, memcpy_time);
+	if (unlikely(left != 0)) {
+		nova_err(sb, "%s: not all data is copied from user! expect to copy %zu bytes, actually copied %zu bytes\n",
+			 __func__, bytes, bytes - left);
+		ret = -EFAULT;
+		goto out;
+	}
+
+	/* entry is not assigned yet; entryc is re-pointed before each use */
+	entryc = &entry_copy;
+
+	if (offset != 0) {
+		NOVA_STATS_ADD(protect_head, 1);
+		entry = nova_get_write_entry(sb, sih, start_blk);
+		if (entry != NULL) {
+			if (metadata_csum == 0) {
+				entryc = entry;
+			} else if (!nova_verify_entry_csum(sb, entry,
+							   entryc)) {
+				/* go through out: so blockbuf is freed */
+				ret = -EIO;
+				goto out;
+			}
+
+			/* make sure data in the partial block head is good */
+			nvmm = get_nvmm(sb, sih, entryc, start_blk);
+			nvmmoff = nova_get_block_off(sb, nvmm, sih->i_blk_type);
+			blockptr = (u8 *) nova_get_block(sb, nvmmoff);
+
+			mapped = nova_find_pgoff_in_vma(inode, start_blk);
+			if (data_csum > 0 && !mapped && !inplace) {
+				nvmm_ok = nova_verify_data_csum(sb, sih, nvmm,
+								0, offset);
+				if (!nvmm_ok) {
+					ret = -EIO;
+					goto out;
+				}
+			}
+
+			ret = memcpy_mcsafe(blockbuf, blockptr, offset);
+			if (ret < 0)
+				goto out;
+		} else {
+			memset(blockbuf, 0, offset);
+		}
+
+		/* copying existing checksums from nvmm can be even slower than
+		 * re-computing checksums of a whole block.
+		if (data_csum > 0)
+			nova_copy_partial_block_csum(sb, sih, entry, start_blk,
+							offset, blocknr, false);
+		*/
+	}
+
+	if (num_blocks == 1)
+		goto eblk;
+
+	do {
+		if (inplace)
+			nova_update_block_csum_parity(sb, sih, blockbuf,
+							blocknr, offset, bytes);
+		else
+			nova_update_block_csum_parity(sb, sih, blockbuf,
+							blocknr, 0, blocksize);
+
+		blocknr++;
+		pos += bytes;
+		buf += bytes;
+		count -= bytes;
+		offset = pos & (blocksize - 1);
+
+		bytes = count < blocksize ? count : blocksize;
+		left = copy_from_user(blockbuf, buf, bytes);
+		if (unlikely(left != 0)) {
+			nova_err(sb, "%s: not all data is copied from user!  expect to copy %zu bytes, actually copied %zu bytes\n",
+				 __func__, bytes, bytes - left);
+			ret = -EFAULT;
+			goto out;
+		}
+	} while (count > blocksize);
+
+eblk:
+	eblk_offset = (pos + count) & (blocksize - 1);
+
+	if (eblk_offset != 0) {
+		NOVA_STATS_ADD(protect_tail, 1);
+		entry = nova_get_write_entry(sb, sih, end_blk);
+		if (entry != NULL) {
+			if (metadata_csum == 0) {
+				entryc = entry;
+			} else if (!nova_verify_entry_csum(sb, entry,
+							   entryc)) {
+				/* go through out: so blockbuf is freed */
+				ret = -EIO;
+				goto out;
+			}
+
+			/* make sure data in the partial block tail is good */
+			nvmm = get_nvmm(sb, sih, entryc, end_blk);
+			nvmmoff = nova_get_block_off(sb, nvmm, sih->i_blk_type);
+			blockptr = (u8 *) nova_get_block(sb, nvmmoff);
+
+			mapped = nova_find_pgoff_in_vma(inode, end_blk);
+			if (data_csum > 0 && !mapped && !inplace) {
+				nvmm_ok = nova_verify_data_csum(sb, sih, nvmm,
+					eblk_offset, blocksize - eblk_offset);
+				if (!nvmm_ok) {
+					ret = -EIO;
+					goto out;
+				}
+			}
+
+			ret = memcpy_mcsafe(blockbuf + eblk_offset,
+						blockptr + eblk_offset,
+						blocksize - eblk_offset);
+			if (ret < 0)
+				goto out;
+		} else {
+			memset(blockbuf + eblk_offset, 0,
+				blocksize - eblk_offset);
+		}
+
+		/* copying existing checksums from nvmm can be even slower than
+		 * re-computing checksums of a whole block.
+		if (data_csum > 0)
+			nova_copy_partial_block_csum(sb, sih, entry, end_blk,
+						eblk_offset, blocknr, true);
+		*/
+	}
+
+	if (inplace)
+		nova_update_block_csum_parity(sb, sih, blockbuf, blocknr,
+							offset, bytes);
+	else
+		nova_update_block_csum_parity(sb, sih, blockbuf, blocknr,
+							0, blocksize);
+
+out:
+	kfree(blockbuf);
+
+	NOVA_END_TIMING(protect_file_data_t, protect_file_data_time);
+
+	return ret;
+}
+
+static bool nova_get_verify_entry(struct super_block *sb,
+	struct nova_file_write_entry *entry,
+	struct nova_file_write_entry *entryc,
+	int locked)
+{
+	int ret = 0;
+
+	if (metadata_csum == 0)
+		return true;
+
+	if (locked == 0) {
+		/* Someone else may be updating the entry. Skip check */
+		ret = memcpy_mcsafe(entryc, entry,
+				sizeof(struct nova_file_write_entry));
+		if (ret < 0)
+			return false;
+
+		return true;
+	}
+
+	return nova_verify_entry_csum(sb, entry, entryc);
+}
+
+/*
+ * Check if there is an existing entry for target page offset.
+ * Used for inplace write, direct IO, DAX-mmap and fallocate.
+ */
+unsigned long nova_check_existing_entry(struct super_block *sb,
+	struct inode *inode, unsigned long num_blocks, unsigned long start_blk,
+	struct nova_file_write_entry **ret_entry,
+	struct nova_file_write_entry *ret_entryc, int check_next, u64 epoch_id,
+	int *inplace, int locked)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc;
+	unsigned long next_pgoff;
+	unsigned long ent_blks = 0;
+	timing_t check_time;
+
+	NOVA_START_TIMING(check_entry_t, check_time);
+
+	*ret_entry = NULL;
+	*inplace = 0;
+	entry = nova_get_write_entry(sb, sih, start_blk);
+
+	entryc = (metadata_csum == 0) ? entry : ret_entryc;
+
+	if (entry) {
+		if (metadata_csum == 0)
+			entryc = entry;
+		else if (!nova_get_verify_entry(sb, entry, entryc, locked))
+			goto out;
+
+		*ret_entry = entry;
+
+		/* We can do inplace write. Find contiguous blocks */
+		if (entryc->reassigned == 0)
+			ent_blks = entryc->num_pages -
+					(start_blk - entryc->pgoff);
+		else
+			ent_blks = 1;
+
+		if (ent_blks > num_blocks)
+			ent_blks = num_blocks;
+
+		if (entryc->epoch_id == epoch_id)
+			*inplace = 1;
+
+	} else if (check_next) {
+		/* Possible Hole */
+		entry = nova_find_next_entry(sb, sih, start_blk);
+		if (entry) {
+			if (metadata_csum == 0)
+				entryc = entry;
+			else if (!nova_get_verify_entry(sb, entry, entryc,
+							locked))
+				goto out;
+
+			next_pgoff = entryc->pgoff;
+			if (next_pgoff <= start_blk) {
+				nova_err(sb, "iblock %lu, entry pgoff %lu, num pages %lu\n",
+				       start_blk, next_pgoff, entry->num_pages);
+				nova_print_inode_log(sb, inode);
+				BUG();
+				ent_blks = num_blocks;
+				goto out;
+			}
+			ent_blks = next_pgoff - start_blk;
+			if (ent_blks > num_blocks)
+				ent_blks = num_blocks;
+		} else {
+			/* File grow */
+			ent_blks = num_blocks;
+		}
+	}
+
+	if (entry && ent_blks == 0) {
+		nova_dbg("%s: %d\n", __func__, check_next);
+		dump_stack();
+	}
+
+out:
+	NOVA_END_TIMING(check_entry_t, check_time);
+	return ent_blks;
+}
+
+ssize_t nova_inplace_file_write(struct file *filp,
+	const char __user *buf,	size_t len, loff_t *ppos)
+{
+	struct address_space *mapping = filp->f_mapping;
+	struct inode	*inode = mapping->host;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pi, inode_copy;
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	struct nova_file_write_entry entry_data;
+	struct nova_inode_update update;
+	ssize_t	    written = 0;
+	loff_t pos;
+	size_t count, offset, copied;
+	unsigned long start_blk, num_blocks, ent_blks = 0;
+	unsigned long total_blocks;
+	unsigned long blocknr = 0;
+	unsigned int data_bits;
+	int allocated = 0;
+	int inplace = 0;
+	bool hole_fill = false;
+	bool update_log = false;
+	void *kmem;
+	u64 blk_off;
+	size_t bytes;
+	long status = 0;
+	timing_t inplace_write_time, memcpy_time;
+	unsigned long step = 0;
+	u64 begin_tail = 0;
+	u64 epoch_id;
+	u64 file_size;
+	u32 time;
+	ssize_t ret;
+
+
+	if (len == 0)
+		return 0;
+
+
+	NOVA_START_TIMING(inplace_write_t, inplace_write_time);
+
+	sb_start_write(inode->i_sb);
+	inode_lock(inode);
+
+	if (!access_ok(VERIFY_READ, buf, len)) {
+		ret = -EFAULT;
+		goto out;
+	}
+	pos = *ppos;
+
+	if (filp->f_flags & O_APPEND)
+		pos = i_size_read(inode);
+
+	count = len;
+
+	pi = nova_get_block(sb, sih->pi_addr);
+
+	/* nova_inode tail pointer will be updated and we make sure all other
+	 * inode fields are good before checksumming the whole structure
+	 */
+	if (nova_check_inode_integrity(sb, sih->ino, sih->pi_addr,
+			sih->alter_pi_addr, &inode_copy, 0) < 0) {
+		ret = -EIO;
+		goto out;
+	}
+
+	offset = pos & (sb->s_blocksize - 1);
+	num_blocks = ((count + offset - 1) >> sb->s_blocksize_bits) + 1;
+	total_blocks = num_blocks;
+
+	ret = file_remove_privs(filp);
+	if (ret)
+		goto out;
+
+	inode->i_ctime = inode->i_mtime = current_time(inode);
+	time = current_time(inode).tv_sec;
+
+	epoch_id = nova_get_epoch_id(sb);
+
+	nova_dbgv("%s: epoch_id %llu, inode %lu, offset %lld, count %lu\n",
+			__func__, epoch_id, inode->i_ino, pos, count);
+	update.tail = sih->log_tail;
+	update.alter_tail = sih->alter_log_tail;
+	while (num_blocks > 0) {
+		hole_fill = false;
+		offset = pos & (nova_inode_blk_size(sih) - 1);
+		start_blk = pos >> sb->s_blocksize_bits;
+
+		ent_blks = nova_check_existing_entry(sb, inode, num_blocks,
+						start_blk, &entry, &entry_copy,
+						1, epoch_id, &inplace, 1);
+
+		entryc = (metadata_csum == 0) ? entry : &entry_copy;
+
+		if (entry && inplace) {
+			/* We can do inplace write. Find contiguous blocks */
+			blocknr = get_nvmm(sb, sih, entryc, start_blk);
+			blk_off = blocknr << PAGE_SHIFT;
+			allocated = ent_blks;
+			if (data_csum || data_parity)
+				nova_set_write_entry_updating(sb, entry, 1);
+		} else {
+			/* Allocate blocks to fill hole */
+			allocated = nova_new_data_blocks(sb, sih, &blocknr,
+					 start_blk, ent_blks, ALLOC_NO_INIT,
+					 ANY_CPU, ALLOC_FROM_HEAD);
+
+			nova_dbg_verbose("%s: alloc %d blocks @ %lu\n",
+						__func__, allocated, blocknr);
+
+			if (allocated <= 0) {
+				nova_dbg("%s alloc blocks failed!, %d\n",
+							__func__, allocated);
+				ret = allocated;
+				goto out;
+			}
+
+			hole_fill = true;
+			blk_off = nova_get_block_off(sb, blocknr,
+							sih->i_blk_type);
+		}
+
+		step++;
+		bytes = sb->s_blocksize * allocated - offset;
+		if (bytes > count)
+			bytes = count;
+
+		kmem = nova_get_block(inode->i_sb, blk_off);
+
+		if (hole_fill &&
+		    (offset || ((offset + bytes) & (PAGE_SIZE - 1)) != 0)) {
+			ret =  nova_handle_head_tail_blocks(sb, inode,
+							    pos, bytes, kmem);
+			if (ret)
+				goto out;
+
+		}
+
+		/* Now copy from user buf */
+//		nova_dbg("Write: %p\n", kmem);
+		NOVA_START_TIMING(memcpy_w_nvmm_t, memcpy_time);
+		nova_memunlock_range(sb, kmem + offset, bytes);
+		copied = bytes - memcpy_to_pmem_nocache(kmem + offset,
+						buf, bytes);
+		nova_memlock_range(sb, kmem + offset, bytes);
+		NOVA_END_TIMING(memcpy_w_nvmm_t, memcpy_time);
+
+		if (data_csum > 0 || data_parity > 0) {
+			ret = nova_protect_file_data(sb, inode, pos, bytes,
+						buf, blocknr, !hole_fill);
+			if (ret)
+				goto out;
+		}
+
+		if (pos + copied > inode->i_size)
+			file_size = cpu_to_le64(pos + copied);
+		else
+			file_size = cpu_to_le64(inode->i_size);
+
+		/* Handle hole fill write */
+		if (hole_fill) {
+			nova_init_file_write_entry(sb, sih, &entry_data,
+						epoch_id, start_blk, allocated,
+						blocknr, time, file_size);
+
+			ret = nova_append_file_write_entry(sb, pi, inode,
+						&entry_data, &update);
+			if (ret) {
+				nova_dbg("%s: append inode entry failed\n",
+								__func__);
+				ret = -ENOSPC;
+				goto out;
+			}
+		} else {
+			/* Update existing entry */
+			struct nova_log_entry_info entry_info;
+
+			entry_info.type = FILE_WRITE;
+			entry_info.epoch_id = epoch_id;
+			entry_info.trans_id = sih->trans_id;
+			entry_info.time = time;
+			entry_info.file_size = file_size;
+			entry_info.inplace = 1;
+
+			nova_inplace_update_write_entry(sb, inode, entry,
+							&entry_info);
+		}
+
+		nova_dbgv("Write: %p, %lu\n", kmem, copied);
+		if (copied > 0) {
+			status = copied;
+			written += copied;
+			pos += copied;
+			buf += copied;
+			count -= copied;
+			num_blocks -= allocated;
+		}
+		if (unlikely(copied != bytes)) {
+			nova_dbg("%s ERROR!: %p, bytes %lu, copied %lu\n",
+				__func__, kmem, bytes, copied);
+			if (status >= 0)
+				status = -EFAULT;
+		}
+		if (status < 0)
+			break;
+
+		if (hole_fill) {
+			update_log = true;
+			if (begin_tail == 0)
+				begin_tail = update.curr_entry;
+		}
+	}
+
+	data_bits = blk_type_to_shift[sih->i_blk_type];
+	sih->i_blocks += (total_blocks << (data_bits - sb->s_blocksize_bits));
+
+	inode->i_blocks = sih->i_blocks;
+
+	if (update_log) {
+		nova_memunlock_inode(sb, pi);
+		nova_update_inode(sb, inode, pi, &update, 1);
+		nova_memlock_inode(sb, pi);
+		NOVA_STATS_ADD(inplace_new_blocks, 1);
+
+		/* Update file tree */
+		ret = nova_reassign_file_tree(sb, sih, begin_tail);
+		if (ret)
+			goto out;
+	}
+
+	ret = written;
+	NOVA_STATS_ADD(inplace_write_breaks, step);
+	nova_dbgv("blocks: %lu, %lu\n", inode->i_blocks, sih->i_blocks);
+
+	*ppos = pos;
+	if (pos > inode->i_size) {
+		i_size_write(inode, pos);
+		sih->i_size = pos;
+	}
+
+	sih->trans_id++;
+out:
+	if (ret < 0)
+		nova_cleanup_incomplete_write(sb, sih, blocknr, allocated,
+						begin_tail, update.tail);
+
+	inode_unlock(inode);
+	sb_end_write(inode->i_sb);
+	NOVA_END_TIMING(inplace_write_t, inplace_write_time);
+	NOVA_STATS_ADD(inplace_write_bytes, written);
+	return ret;
+}
+
+/* Check if existing entry overlap with vma regions */
+int nova_check_overlap_vmas(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	unsigned long pgoff, unsigned long num_pages)
+{
+	unsigned long start_pgoff = 0;
+	unsigned long num = 0;
+	unsigned long i;
+	struct vma_item *item;
+	struct rb_node *temp;
+	int ret = 0;
+
+	if (sih->num_vmas == 0)
+		return 0;
+
+	temp = rb_first(&sih->vma_tree);
+	while (temp) {
+		item = container_of(temp, struct vma_item, node);
+		temp = rb_next(temp);
+		ret = nova_get_vma_overlap_range(sb, sih, item->vma, pgoff,
+					num_pages, &start_pgoff, &num);
+		if (ret) {
+			for (i = 0; i < num; i++) {
+				if (nova_get_write_entry(sb, sih,
+							start_pgoff + i))
+					return 1;
+			}
+		}
+	}
+
+	return 0;
+}
+
+
+/*
+ * return > 0, # of blocks mapped or allocated.
+ * return = 0, if plain lookup failed.
+ * return < 0, error case.
+ */
+int nova_dax_get_blocks(struct inode *inode, sector_t iblock,
+	unsigned long max_blocks, u32 *bno, bool *new, bool *boundary,
+	int create, bool taking_lock)
+{
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pi;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_file_write_entry *entry = NULL;
+	struct nova_file_write_entry *entryc, entry_copy;
+	struct nova_file_write_entry entry_data;
+	struct nova_inode_update update;
+	u32 time;
+	unsigned int data_bits;
+	unsigned long nvmm = 0;
+	unsigned long blocknr = 0;
+	u64 epoch_id;
+	int num_blocks = 0;
+	int inplace = 0;
+	int allocated = 0;
+	int locked = 0;
+	int check_next = 1;
+	int ret = 0;
+	timing_t get_block_time;
+
+
+	if (max_blocks == 0)
+		return 0;
+
+	NOVA_START_TIMING(dax_get_block_t, get_block_time);
+
+	nova_dbgv("%s: pgoff %lu, num %lu, create %d\n",
+				__func__, iblock, max_blocks, create);
+
+	epoch_id = nova_get_epoch_id(sb);
+
+	if (taking_lock)
+		check_next = 0;
+
+again:
+	num_blocks = nova_check_existing_entry(sb, inode, max_blocks,
+					iblock, &entry, &entry_copy, check_next,
+					epoch_id, &inplace, locked);
+
+	entryc = (metadata_csum == 0) ? entry : &entry_copy;
+
+	if (entry) {
+		if (create == 0 || inplace) {
+			nvmm = get_nvmm(sb, sih, entryc, iblock);
+			nova_dbgv("%s: found pgoff %lu, block %lu\n",
+					__func__, iblock, nvmm);
+			goto out;
+		}
+	}
+
+	if (create == 0) {
+		num_blocks = 0;
+		goto out1;
+	}
+
+	if (taking_lock && locked == 0) {
+		inode_lock(inode);
+		locked = 1;
+		/* Check again in case someone has done it for us */
+		check_next = 1;
+		goto again;
+	}
+
+	pi = nova_get_inode(sb, inode);
+	inode->i_ctime = inode->i_mtime = current_time(inode);
+	time = current_time(inode).tv_sec;
+	update.tail = sih->log_tail;
+	update.alter_tail = sih->alter_log_tail;
+
+	/* Return initialized blocks to the user */
+	allocated = nova_new_data_blocks(sb, sih, &blocknr, iblock,
+				 num_blocks, ALLOC_INIT_ZERO, ANY_CPU,
+				 ALLOC_FROM_HEAD);
+	if (allocated <= 0) {
+		nova_dbgv("%s alloc blocks failed %d\n", __func__,
+							allocated);
+		ret = allocated;
+		goto out;
+	}
+
+	num_blocks = allocated;
+	/* Do not extend file size */
+	nova_init_file_write_entry(sb, sih, &entry_data,
+					epoch_id, iblock, num_blocks,
+					blocknr, time, inode->i_size);
+
+	ret = nova_append_file_write_entry(sb, pi, inode,
+				&entry_data, &update);
+	if (ret) {
+		nova_dbgv("%s: append inode entry failed\n", __func__);
+		ret = -ENOSPC;
+		goto out;
+	}
+
+	nvmm = blocknr;
+	data_bits = blk_type_to_shift[sih->i_blk_type];
+	sih->i_blocks += (num_blocks << (data_bits - sb->s_blocksize_bits));
+
+	nova_memunlock_inode(sb, pi);
+	nova_update_inode(sb, inode, pi, &update, 1);
+	nova_memlock_inode(sb, pi);
+
+	ret = nova_reassign_file_tree(sb, sih, update.curr_entry);
+	if (ret) {
+		nova_dbgv("%s: nova_reassign_file_tree failed: %d\n",
+			  __func__,  ret);
+		goto out;
+	}
+	inode->i_blocks = sih->i_blocks;
+	sih->trans_id++;
+	NOVA_STATS_ADD(dax_new_blocks, 1);
+
+//	set_buffer_new(bh);
+out:
+	if (ret < 0) {
+		nova_cleanup_incomplete_write(sb, sih, blocknr, allocated,
+						0, update.tail);
+		num_blocks = ret;
+		goto out1;
+	}
+
+	*bno = nvmm;
+//	if (num_blocks > 1)
+//		bh->b_size = sb->s_blocksize * num_blocks;
+
+out1:
+	if (taking_lock && locked)
+		inode_unlock(inode);
+
+	NOVA_END_TIMING(dax_get_block_t, get_block_time);
+	return num_blocks;
+}
+
+int nova_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
+	unsigned int flags, struct iomap *iomap, bool taking_lock)
+{
+	struct nova_sb_info *sbi = NOVA_SB(inode->i_sb);
+	unsigned int blkbits = inode->i_blkbits;
+	unsigned long first_block = offset >> blkbits;
+	unsigned long max_blocks = (length + (1 << blkbits) - 1) >> blkbits;
+	bool new = false, boundary = false;
+	u32 bno;
+	int ret;
+
+	ret = nova_dax_get_blocks(inode, first_block, max_blocks, &bno, &new,
+				  &boundary, flags & IOMAP_WRITE, taking_lock);
+	if (ret < 0) {
+		nova_dbgv("%s: nova_dax_get_blocks failed %d\n", __func__, ret);
+		return ret;
+	}
+
+	iomap->flags = 0;
+	iomap->bdev = inode->i_sb->s_bdev;
+	iomap->dax_dev = sbi->s_dax_dev;
+	iomap->offset = (u64)first_block << blkbits;
+
+	if (ret == 0) {
+		iomap->type = IOMAP_HOLE;
+		iomap->blkno = IOMAP_NULL_BLOCK;
+		iomap->length = 1 << blkbits;
+	} else {
+		iomap->type = IOMAP_MAPPED;
+		iomap->blkno = (sector_t)bno << (blkbits - 9);
+		iomap->length = (u64)ret << blkbits;
+		iomap->flags |= IOMAP_F_MERGED;
+	}
+
+	if (new)
+		iomap->flags |= IOMAP_F_NEW;
+	return 0;
+}
+
+int nova_iomap_end(struct inode *inode, loff_t offset, loff_t length,
+	ssize_t written, unsigned int flags, struct iomap *iomap)
+{
+	if (iomap->type == IOMAP_MAPPED &&
+			written < length &&
+			(flags & IOMAP_WRITE))
+		truncate_pagecache(inode, inode->i_size);
+	return 0;
+}
+
+
+static int nova_iomap_begin_lock(struct inode *inode, loff_t offset,
+	loff_t length, unsigned int flags, struct iomap *iomap)
+{
+	return nova_iomap_begin(inode, offset, length, flags, iomap, true);
+}
+
+static struct iomap_ops nova_iomap_ops_lock = {
+	.iomap_begin	= nova_iomap_begin_lock,
+	.iomap_end	= nova_iomap_end,
+};
+
+
+static int nova_dax_huge_fault(struct vm_fault *vmf,
+			      enum page_entry_size pe_size)
+{
+	int ret = 0;
+	timing_t fault_time;
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	struct inode *inode = mapping->host;
+
+	NOVA_START_TIMING(pmd_fault_t, fault_time);
+
+	nova_dbgv("%s: inode %lu, pgoff %lu\n",
+		  __func__, inode->i_ino, vmf->pgoff);
+
+	ret = dax_iomap_fault(vmf, pe_size, &nova_iomap_ops_lock);
+
+	NOVA_END_TIMING(pmd_fault_t, fault_time);
+	return ret;
+}
+
+static int nova_dax_fault(struct vm_fault *vmf)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	struct inode *inode = mapping->host;
+
+	nova_dbgv("%s: inode %lu, pgoff %lu\n",
+		  __func__, inode->i_ino, vmf->pgoff);
+
+	return nova_dax_huge_fault(vmf, PE_SIZE_PTE);
+}
+
+static int nova_dax_pfn_mkwrite(struct vm_fault *vmf)
+{
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	loff_t size;
+	int ret = 0;
+	timing_t fault_time;
+
+	NOVA_START_TIMING(pfn_mkwrite_t, fault_time);
+
+	inode_lock(inode);
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size)
+		ret = VM_FAULT_SIGBUS;
+	else
+		ret = dax_pfn_mkwrite(vmf);
+	inode_unlock(inode);
+
+	NOVA_END_TIMING(pfn_mkwrite_t, fault_time);
+	return ret;
+}
+
+static inline int nova_rbtree_compare_vma(struct vma_item *curr,
+	struct vm_area_struct *vma)
+{
+	if (vma < curr->vma)
+		return -1;
+	if (vma > curr->vma)
+		return 1;
+
+	return 0;
+}
+
+static int nova_append_write_mmap_to_log(struct super_block *sb,
+	struct inode *inode, struct vma_item *item)
+{
+	struct vm_area_struct *vma = item->vma;
+	struct nova_inode *pi;
+	struct nova_mmap_entry data;
+	struct nova_inode_update update;
+	unsigned long num_pages;
+	u64 epoch_id;
+	int ret;
+
+	/* Only for csum and parity update */
+	if (data_csum == 0 && data_parity == 0)
+		return 0;
+
+	pi = nova_get_inode(sb, inode);
+	epoch_id = nova_get_epoch_id(sb);
+	update.tail = update.alter_tail = 0;
+
+	memset(&data, 0, sizeof(struct nova_mmap_entry));
+	data.entry_type = MMAP_WRITE;
+	data.epoch_id = epoch_id;
+	data.pgoff = cpu_to_le64(vma->vm_pgoff);
+	num_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+	data.num_pages = cpu_to_le64(num_pages);
+	data.invalid = 0;
+
+	nova_dbgv("%s : Appending mmap log entry for inode %lu, pgoff %llu, %llu pages\n",
+			__func__, inode->i_ino,
+			data.pgoff, data.num_pages);
+
+	ret = nova_append_mmap_entry(sb, pi, inode, &data, &update, item);
+	if (ret) {
+		nova_dbg("%s: append write mmap entry failure\n", __func__);
+		goto out;
+	}
+
+	nova_memunlock_inode(sb, pi);
+	nova_update_inode(sb, inode, pi, &update, 1);
+	nova_memlock_inode(sb, pi);
+out:
+	return ret;
+}
+
+int nova_insert_write_vma(struct vm_area_struct *vma)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct inode *inode = mapping->host;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct super_block *sb = inode->i_sb;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	unsigned long flags = VM_SHARED | VM_WRITE;
+	struct vma_item *item, *curr;
+	struct rb_node **temp, *parent;
+	int compVal;
+	int insert = 0;
+	int ret;
+	timing_t insert_vma_time;
+
+
+	if ((vma->vm_flags & flags) != flags)
+		return 0;
+
+	NOVA_START_TIMING(insert_vma_t, insert_vma_time);
+
+	item = nova_alloc_vma_item(sb);
+	if (!item) {
+		NOVA_END_TIMING(insert_vma_t, insert_vma_time);
+		return -ENOMEM;
+	}
+
+	item->vma = vma;
+
+	nova_dbgv("Inode %lu insert vma %p, start 0x%lx, end 0x%lx, pgoff %lu\n",
+			inode->i_ino, vma, vma->vm_start, vma->vm_end,
+			vma->vm_pgoff);
+
+	inode_lock(inode);
+
+	/* Append to log */
+	ret = nova_append_write_mmap_to_log(sb, inode, item);
+	if (ret)
+		goto out;
+
+	temp = &(sih->vma_tree.rb_node);
+	parent = NULL;
+
+	while (*temp) {
+		curr = container_of(*temp, struct vma_item, node);
+		compVal = nova_rbtree_compare_vma(curr, vma);
+		parent = *temp;
+
+		if (compVal == -1) {
+			temp = &((*temp)->rb_left);
+		} else if (compVal == 1) {
+			temp = &((*temp)->rb_right);
+		} else {
+			nova_dbg("%s: vma %p already exists\n",
+				__func__, vma);
+			nova_free_vma_item(sb, item);
+			goto out;
+		}
+	}
+
+	rb_link_node(&item->node, parent, temp);
+	rb_insert_color(&item->node, &sih->vma_tree);
+
+	sih->num_vmas++;
+	if (sih->num_vmas == 1)
+		insert = 1;
+
+	sih->trans_id++;
+out:
+	inode_unlock(inode);
+
+	if (insert) {
+		mutex_lock(&sbi->vma_mutex);
+		list_add_tail(&sih->list, &sbi->mmap_sih_list);
+		mutex_unlock(&sbi->vma_mutex);
+	}
+
+	NOVA_END_TIMING(insert_vma_t, insert_vma_time);
+	return ret;
+}
+
+static int nova_remove_write_vma(struct vm_area_struct *vma)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct inode *inode = mapping->host;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct super_block *sb = inode->i_sb;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct vma_item *curr = NULL;
+	struct rb_node *temp;
+	int compVal;
+	int found = 0;
+	int remove = 0;
+	timing_t remove_vma_time;
+
+
+	NOVA_START_TIMING(remove_vma_t, remove_vma_time);
+	inode_lock(inode);
+
+	temp = sih->vma_tree.rb_node;
+	while (temp) {
+		curr = container_of(temp, struct vma_item, node);
+		compVal = nova_rbtree_compare_vma(curr, vma);
+
+		if (compVal == -1) {
+			temp = temp->rb_left;
+		} else if (compVal == 1) {
+			temp = temp->rb_right;
+		} else {
+			nova_reset_vma_csum_parity(sb, curr);
+			rb_erase(&curr->node, &sih->vma_tree);
+			found = 1;
+			break;
+		}
+	}
+
+	if (found) {
+		sih->num_vmas--;
+		if (sih->num_vmas == 0)
+			remove = 1;
+	}
+
+	inode_unlock(inode);
+
+	if (found) {
+		nova_dbgv("Inode %lu remove vma %p, start 0x%lx, end 0x%lx, pgoff %lu\n",
+			  inode->i_ino,	curr->vma, curr->vma->vm_start,
+			  curr->vma->vm_end, curr->vma->vm_pgoff);
+		nova_free_vma_item(sb, curr);
+	}
+
+	if (remove) {
+		mutex_lock(&sbi->vma_mutex);
+		list_del(&sih->list);
+		mutex_unlock(&sbi->vma_mutex);
+	}
+
+	NOVA_END_TIMING(remove_vma_t, remove_vma_time);
+	return 0;
+}
+
+static int nova_restore_page_write(struct vm_area_struct *vma,
+	unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+
+
+	down_write(&mm->mmap_sem);
+
+	nova_dbgv("Restore vma %p write, start 0x%lx, end 0x%lx, address 0x%lx\n",
+		  vma, vma->vm_start, vma->vm_end, address);
+
+	/* Restore single page write */
+	nova_mmap_to_new_blocks(vma, address);
+
+	up_write(&mm->mmap_sem);
+
+	return 0;
+}
+
+static void nova_vma_open(struct vm_area_struct *vma)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct inode *inode = mapping->host;
+
+	nova_dbg_mmap4k("[%s:%d] inode %lu, MMAP 4KPAGE vm_start(0x%lx), vm_end(0x%lx), vm pgoff %lu, %lu blocks, vm_flags(0x%lx), vm_page_prot(0x%lx)\n",
+			__func__, __LINE__,
+			inode->i_ino, vma->vm_start, vma->vm_end,
+			vma->vm_pgoff,
+			(vma->vm_end - vma->vm_start) >> PAGE_SHIFT,
+			vma->vm_flags,
+			pgprot_val(vma->vm_page_prot));
+
+	nova_insert_write_vma(vma);
+}
+
+static void nova_vma_close(struct vm_area_struct *vma)
+{
+	nova_dbgv("[%s:%d] MMAP 4KPAGE vm_start(0x%lx), vm_end(0x%lx), vm_flags(0x%lx), vm_page_prot(0x%lx)\n",
+		  __func__, __LINE__, vma->vm_start, vma->vm_end,
+		  vma->vm_flags, pgprot_val(vma->vm_page_prot));
+
+	vma->original_write = 0;
+	nova_remove_write_vma(vma);
+}
+
+const struct vm_operations_struct nova_dax_vm_ops = {
+	.fault	= nova_dax_fault,
+	.huge_fault = nova_dax_huge_fault,
+	.page_mkwrite = nova_dax_fault,
+	.pfn_mkwrite = nova_dax_pfn_mkwrite,
+	.open = nova_vma_open,
+	.close = nova_vma_close,
+	.dax_cow = nova_restore_page_write,
+};
+

* [RFC 10/16] NOVA: File data protection
  2017-08-03  7:48 ` Steven Swanson
@ 2017-08-03  7:49   ` Steven Swanson
  -1 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:49 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson

Nova protects data and metadata from corruption due to media errors and
scribbles -- software errors in the kernel that may overwrite Nova data.

Replication
-----------

Nova replicates all PMEM metadata structures (a few exceptions remain; they
are WIP).  For each structure, there is a primary and an alternate (denoted as
alter in the code).  To ensure that Nova can recover a consistent copy of the
data in case of a failure, Nova first updates the primary and issues a persist
barrier to ensure that the data is written to NVMM.  Then it does the same for
the alternate.
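
The update order above can be sketched in user-space C.  This is only an
illustration under stated assumptions: struct meta and replicated_update() are
hypothetical names, and persist_barrier() is a stand-in that issues a memory
fence where real PMEM code would flush cache lines and fence.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical metadata record; not a real Nova structure. */
struct meta {
	unsigned long long ino;
	unsigned int links;
};

/*
 * Stand-in for a persist barrier (assumption).  On real PMEM this would
 * flush the cache lines covering [addr, addr + len) and then fence.
 */
static void persist_barrier(const void *addr, size_t len)
{
	(void)addr;
	(void)len;
	atomic_thread_fence(memory_order_seq_cst);
}

/* Update the primary first, make it durable, then repeat for the alternate. */
static void replicated_update(struct meta *primary, struct meta *alter,
			      const struct meta *new_val)
{
	assert(primary != alter);

	memcpy(primary, new_val, sizeof(*primary));
	persist_barrier(primary, sizeof(*primary));

	memcpy(alter, new_val, sizeof(*alter));
	persist_barrier(alter, sizeof(*alter));
}
```

With this ordering, a crash between the two steps leaves a complete new
primary and a complete old alternate, never two half-written copies.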

Detection
---------

Nova uses two techniques to detect data corruption.  For media errors, Nova
should always use memcpy_from_pmem() to read data from PMEM, usually by
copying the PMEM data structure into DRAM.

To detect software-caused corruption, Nova uses CRC32 checksums.  All the PMEM
data structures in Nova include a csum field for this purpose.  Nova also
computes a CRC32 checksum for each 512-byte slice of each data page.

The checksums are stored in dedicated pages in each CPU's allocation region.
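
A minimal sketch of per-slice checksumming, assuming a 4KB page split into
eight 512-byte slices.  The helper names are hypothetical, and the bitwise
CRC-32C below (standard initial value and final inversion) is a stand-in: it
is not necessarily bit-identical to what nova_crc32c() produces with
NOVA_INIT_CSUM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES      4096u
#define SLICE_BYTES     512u
#define SLICES_PER_PAGE (PAGE_BYTES / SLICE_BYTES)

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(const uint8_t *buf, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Compute one checksum per 512-byte slice of a page. */
static void checksum_page(const uint8_t *page, uint32_t csums[SLICES_PER_PAGE])
{
	assert(page != NULL);
	for (size_t i = 0; i < SLICES_PER_PAGE; i++)
		csums[i] = crc32c(page + i * SLICE_BYTES, SLICE_BYTES);
}

/* Return the index of the first slice that fails verification, or -1. */
static int find_bad_slice(const uint8_t *page,
			  const uint32_t csums[SLICES_PER_PAGE])
{
	for (size_t i = 0; i < SLICES_PER_PAGE; i++) {
		if (crc32c(page + i * SLICE_BYTES, SLICE_BYTES) != csums[i])
			return (int)i;
	}
	return -1;
}
```

Checksumming per slice rather than per page means a scribble is localized to
512 bytes, which is also the unit that parity-based recovery rebuilds.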

                                                          replica
                                                 parity   parity
                                                 page     page
            +---+---+---+---+---+---+---+---+    +---+    +---+
data page 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |    | 0 |    | 0 |
            +---+---+---+---+---+---+---+---+    +---+    +---+
data page 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |    | 1 |    | 1 |
            +---+---+---+---+---+---+---+---+    +---+    +---+
data page 2 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |    | 0 |    | 0 |
            +---+---+---+---+---+---+---+---+    +---+    +---+
data page 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |    | 0 |    | 0 |
            +---+---+---+---+---+---+---+---+    +---+    +---+
    ...                    ...                    ...      ...

Recovery
--------

Nova uses replication to support recovery of metadata structures and
RAID4-style parity to recover corrupted data.

If Nova detects corruption of a metadata structure, it restores the structure
using the replica.
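
The repair decision for a replicated structure can be summarized as a small
truth table.  The sketch below mirrors the branches of
nova_verify_entry_csum() in this patch; the enum and function names are
hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum repair_action {
	REPAIR_NONE,       /* both copies pass and are identical */
	REPAIR_PRIMARY,    /* primary fails its checksum: overwrite it from the replica */
	REPAIR_REPLICA,    /* replica fails or is stale: overwrite it from the primary */
	REPAIR_IMPOSSIBLE, /* both copies fail: report an error */
};

/*
 * primary_ok / replica_ok: the stored checksum matches the recomputed one.
 * identical: the two copies compare equal byte for byte.
 */
static enum repair_action choose_repair(bool primary_ok, bool replica_ok,
					bool identical)
{
	if (!primary_ok && !replica_ok)
		return REPAIR_IMPOSSIBLE;
	if (!primary_ok)
		return REPAIR_PRIMARY;
	if (!replica_ok)
		return REPAIR_REPLICA;
	/* Both pass their checksums: the primary is trusted if they differ. */
	return identical ? REPAIR_NONE : REPAIR_REPLICA;
}
```

The last case covers a crash between the primary and alternate updates, where
both copies are internally consistent but the alternate is stale.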

If it detects a corrupt slice of a data page, it uses RAID4-style recovery to
restore it.  The CRC32 checksums for the page slices are replicated.
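
As a rough illustration of RAID4-style recovery within one page, assume the
parity slice is the XOR of the page's eight 512-byte slices, as the diagram
above suggests.  The function names are hypothetical and the code operates on
plain DRAM buffers, not PMEM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLICE_BYTES     512u
#define SLICES_PER_PAGE 8u

/* parity = slice0 ^ slice1 ^ ... ^ slice7, byte by byte. */
static void compute_parity(const uint8_t *page, uint8_t *parity)
{
	memset(parity, 0, SLICE_BYTES);
	for (size_t s = 0; s < SLICES_PER_PAGE; s++)
		for (size_t b = 0; b < SLICE_BYTES; b++)
			parity[b] ^= page[s * SLICE_BYTES + b];
}

/* Rebuild slice 'bad' as parity XOR all the other (good) slices. */
static void recover_slice(uint8_t *page, const uint8_t *parity, size_t bad)
{
	uint8_t *dst = page + bad * SLICE_BYTES;

	assert(bad < SLICES_PER_PAGE);
	memcpy(dst, parity, SLICE_BYTES);
	for (size_t s = 0; s < SLICES_PER_PAGE; s++) {
		if (s == bad)
			continue;
		for (size_t b = 0; b < SLICE_BYTES; b++)
			dst[b] ^= page[s * SLICE_BYTES + b];
	}
}
```

This scheme can rebuild only one bad slice per page.  The per-slice checksum
identifies which slice to rebuild and can confirm the result afterwards.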

Cautious allocation
-------------------

To maximize its resilience to software scribbles, Nova allocates metadata
structures and their replicas far from one another.  It tries to allocate the
primary copy at a low address and the replica at a high address within the PMEM
region.
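
One way to picture this policy is a two-ended allocator over a single region:
primaries grow up from the bottom, replicas grow down from the top.  This is a
simplified sketch with hypothetical names, not Nova's actual allocator (which
manages per-CPU free lists).

```c
#include <assert.h>
#include <stdint.h>

/* Free space is the byte range [low, high) of a PMEM region. */
struct pmem_region {
	uint64_t low;	/* next free offset for primaries, grows upward */
	uint64_t high;	/* end of free space for replicas, grows downward */
};

/*
 * Allocate 'size' bytes for a primary at the low end and 'size' bytes for
 * its replica at the high end.  Returns 0 on success, -1 if space runs out.
 */
static int alloc_replicated(struct pmem_region *r, uint64_t size,
			    uint64_t *primary, uint64_t *replica)
{
	assert(r->low <= r->high);

	if (size == 0 || size > (r->high - r->low) / 2)
		return -1;

	*primary = r->low;
	r->low += size;

	r->high -= size;
	*replica = r->high;
	return 0;
}
```

Keeping the two copies far apart makes it less likely that a single stray
write, such as a buffer overrun, damages both.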

Write Protection
----------------

Finally, Nova can prevent unintended writes to PMEM by mapping the entire
PMEM device as read-only and then disabling _all_ write protection by clearing
the WP bit in the CR0 control register when Nova needs to perform a write.  The
wprotect mount-time option controls this behavior.
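
The unlock/write/relock window can be illustrated in user space with POSIX
mprotect().  This is an analogy only: the kernel code in this patch toggles
CR0.WP rather than changing page permissions, and the function names below are
hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Map an anonymous region that is read-only by default. */
static void *guarded_region_create(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS,
		       -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/*
 * Open a short write window, copy the data, and close the window again.
 * Returns 0 on success and -1 if a permission change fails.
 */
static int guarded_write(void *region, size_t region_len, size_t off,
			 const void *src, size_t n)
{
	assert(off <= region_len && n <= region_len - off);

	if (mprotect(region, region_len, PROT_READ | PROT_WRITE) != 0)
		return -1;
	memcpy((char *)region + off, src, n);
	return mprotect(region, region_len, PROT_READ);
}
```

Unlike mprotect(), clearing CR0.WP lifts write protection for everything the
current CPU touches, which is why nova_writeable() disables local interrupts
for the duration of the window.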

To map the PMEM device as read-only, we have added a readonly module command
line option to nd_pmem.  There is probably a better approach to achieving this
goal.

The changes to nd_pmem are included in a later patch in this series.

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 fs/nova/checksum.c |  912 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/mprotect.c |  604 ++++++++++++++++++++++++++++++++++
 fs/nova/mprotect.h |  190 +++++++++++
 fs/nova/parity.c   |  411 +++++++++++++++++++++++
 4 files changed, 2117 insertions(+)
 create mode 100644 fs/nova/checksum.c
 create mode 100644 fs/nova/mprotect.c
 create mode 100644 fs/nova/mprotect.h
 create mode 100644 fs/nova/parity.c

diff --git a/fs/nova/checksum.c b/fs/nova/checksum.c
new file mode 100644
index 000000000000..092164a80d40
--- /dev/null
+++ b/fs/nova/checksum.c
@@ -0,0 +1,912 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Checksum related methods.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include "nova.h"
+#include "inode.h"
+
+static int nova_get_entry_copy(struct super_block *sb, void *entry,
+	u32 *entry_csum, size_t *entry_size, void *entry_copy)
+{
+	u8 type;
+	struct nova_dentry *dentry;
+	int ret = 0;
+
+	ret = memcpy_mcsafe(&type, entry, sizeof(u8));
+	if (ret < 0)
+		return ret;
+
+	switch (type) {
+	case DIR_LOG:
+		dentry = DENTRY(entry_copy);
+		ret = memcpy_mcsafe(dentry, entry, NOVA_DENTRY_HEADER_LEN);
+		if (ret < 0 || dentry->de_len > NOVA_MAX_ENTRY_LEN)
+			break;
+		*entry_size = dentry->de_len;
+		ret = memcpy_mcsafe((u8 *) dentry + NOVA_DENTRY_HEADER_LEN,
+					(u8 *) entry + NOVA_DENTRY_HEADER_LEN,
+					*entry_size - NOVA_DENTRY_HEADER_LEN);
+		if (ret < 0)
+			break;
+		*entry_csum = dentry->csum;
+		break;
+	case FILE_WRITE:
+		*entry_size = sizeof(struct nova_file_write_entry);
+		ret = memcpy_mcsafe(entry_copy, entry, *entry_size);
+		if (ret < 0)
+			break;
+		*entry_csum = WENTRY(entry_copy)->csum;
+		break;
+	case SET_ATTR:
+		*entry_size = sizeof(struct nova_setattr_logentry);
+		ret = memcpy_mcsafe(entry_copy, entry, *entry_size);
+		if (ret < 0)
+			break;
+		*entry_csum = SENTRY(entry_copy)->csum;
+		break;
+	case LINK_CHANGE:
+		*entry_size = sizeof(struct nova_link_change_entry);
+		ret = memcpy_mcsafe(entry_copy, entry, *entry_size);
+		if (ret < 0)
+			break;
+		*entry_csum = LCENTRY(entry_copy)->csum;
+		break;
+	case MMAP_WRITE:
+		*entry_size = sizeof(struct nova_mmap_entry);
+		ret = memcpy_mcsafe(entry_copy, entry, *entry_size);
+		if (ret < 0)
+			break;
+		*entry_csum = MMENTRY(entry_copy)->csum;
+		break;
+	case SNAPSHOT_INFO:
+		*entry_size = sizeof(struct nova_snapshot_info_entry);
+		ret = memcpy_mcsafe(entry_copy, entry, *entry_size);
+		if (ret < 0)
+			break;
+		*entry_csum = SNENTRY(entry_copy)->csum;
+		break;
+	default:
+		*entry_csum = 0;
+		*entry_size = 0;
+		nova_dbg("%s: unknown or unsupported entry type (%d) for checksum, 0x%llx\n",
+			 __func__, type, (u64)entry);
+		ret = -EINVAL;
+		dump_stack();
+		break;
+	}
+
+	return ret;
+}
+
+/* Calculate the entry checksum. */
+static u32 nova_calc_entry_csum(void *entry)
+{
+	u8 type;
+	u32 csum = 0;
+	ssize_t entry_len, check_len;
+	void *csum_addr, *remain;
+	timing_t calc_time;
+
+	NOVA_START_TIMING(calc_entry_csum_t, calc_time);
+
+	/* Entry is checksummed excluding its csum field. */
+	type = nova_get_entry_type(entry);
+	switch (type) {
+	/* nova_dentry has variable length due to its name. */
+	case DIR_LOG:
+		entry_len =  DENTRY(entry)->de_len;
+		csum_addr = &DENTRY(entry)->csum;
+		break;
+	case FILE_WRITE:
+		entry_len = sizeof(struct nova_file_write_entry);
+		csum_addr = &WENTRY(entry)->csum;
+		break;
+	case SET_ATTR:
+		entry_len = sizeof(struct nova_setattr_logentry);
+		csum_addr = &SENTRY(entry)->csum;
+		break;
+	case LINK_CHANGE:
+		entry_len = sizeof(struct nova_link_change_entry);
+		csum_addr = &LCENTRY(entry)->csum;
+		break;
+	case MMAP_WRITE:
+		entry_len = sizeof(struct nova_mmap_entry);
+		csum_addr = &MMENTRY(entry)->csum;
+		break;
+	case SNAPSHOT_INFO:
+		entry_len = sizeof(struct nova_snapshot_info_entry);
+		csum_addr = &SNENTRY(entry)->csum;
+		break;
+	default:
+		entry_len = 0;
+		csum_addr = NULL;
+		nova_dbg("%s: unknown or unsupported entry type (%d) for checksum, 0x%llx\n",
+			 __func__, type, (u64) entry);
+		break;
+	}
+
+	if (entry_len > 0) {
+		check_len = ((u8 *) csum_addr) - ((u8 *) entry);
+		csum = nova_crc32c(NOVA_INIT_CSUM, entry, check_len);
+		check_len = entry_len - (check_len + NOVA_META_CSUM_LEN);
+		if (check_len > 0) {
+			remain = ((u8 *) csum_addr) + NOVA_META_CSUM_LEN;
+			csum = nova_crc32c(csum, remain, check_len);
+		}
+
+		if (check_len < 0) {
+			nova_dbg("%s: checksum run-length error %ld < 0\n",
+				__func__, check_len);
+		}
+	}
+
+	NOVA_END_TIMING(calc_entry_csum_t, calc_time);
+	return csum;
+}
+
+/* Update the log entry checksum. */
+void nova_update_entry_csum(void *entry)
+{
+	u8  type;
+	u32 csum;
+	size_t entry_len = CACHELINE_SIZE;
+
+	if (metadata_csum == 0)
+		goto flush;
+
+	type = nova_get_entry_type(entry);
+	csum = nova_calc_entry_csum(entry);
+
+	switch (type) {
+	case DIR_LOG:
+		DENTRY(entry)->csum = cpu_to_le32(csum);
+		entry_len = DENTRY(entry)->de_len;
+		break;
+	case FILE_WRITE:
+		WENTRY(entry)->csum = cpu_to_le32(csum);
+		entry_len = sizeof(struct nova_file_write_entry);
+		break;
+	case SET_ATTR:
+		SENTRY(entry)->csum = cpu_to_le32(csum);
+		entry_len = sizeof(struct nova_setattr_logentry);
+		break;
+	case LINK_CHANGE:
+		LCENTRY(entry)->csum = cpu_to_le32(csum);
+		entry_len = sizeof(struct nova_link_change_entry);
+		break;
+	case MMAP_WRITE:
+		MMENTRY(entry)->csum = cpu_to_le32(csum);
+		entry_len = sizeof(struct nova_mmap_entry);
+		break;
+	case SNAPSHOT_INFO:
+		SNENTRY(entry)->csum = cpu_to_le32(csum);
+		entry_len = sizeof(struct nova_snapshot_info_entry);
+		break;
+	default:
+		entry_len = 0;
+		nova_dbg("%s: unknown or unsupported entry type (%d), 0x%llx\n",
+			__func__, type, (u64) entry);
+		break;
+	}
+
+flush:
+	if (entry_len > 0)
+		nova_flush_buffer(entry, entry_len, 0);
+
+}
+
+int nova_update_alter_entry(struct super_block *sb, void *entry)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	void *alter_entry;
+	u64 curr, alter_curr;
+	u32 entry_csum;
+	size_t size;
+	char entry_copy[NOVA_MAX_ENTRY_LEN];
+	int ret;
+
+	if (metadata_csum == 0)
+		return 0;
+
+	curr = nova_get_addr_off(sbi, entry);
+	alter_curr = alter_log_entry(sb, curr);
+	if (alter_curr == 0) {
+		nova_err(sb, "%s: log page tail error detected\n", __func__);
+		return -EIO;
+	}
+	alter_entry = (void *)nova_get_block(sb, alter_curr);
+
+	ret = nova_get_entry_copy(sb, entry, &entry_csum, &size, entry_copy);
+	if (ret)
+		return ret;
+
+	ret = memcpy_to_pmem_nocache(alter_entry, entry_copy, size);
+	return ret;
+}
+
+/* media error: repair the poison radius that the entry belongs to */
+static int nova_repair_entry_pr(struct super_block *sb, void *entry)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	int ret;
+	u64 entry_off, alter_off;
+	void *entry_pr, *alter_pr;
+
+	entry_off = nova_get_addr_off(sbi, entry);
+	alter_off = alter_log_entry(sb, entry_off);
+	if (alter_off == 0) {
+		nova_err(sb, "%s: log page tail error detected\n", __func__);
+		goto fail;
+	}
+
+	entry_pr = (void *) nova_get_block(sb, entry_off & POISON_MASK);
+	alter_pr = (void *) nova_get_block(sb, alter_off & POISON_MASK);
+
+	if (entry_pr == NULL || alter_pr == NULL)
+		BUG();
+
+	nova_memunlock_range(sb, entry_pr, POISON_RADIUS);
+	ret = memcpy_mcsafe(entry_pr, alter_pr, POISON_RADIUS);
+	nova_memlock_range(sb, entry_pr, POISON_RADIUS);
+	nova_flush_buffer(entry_pr, POISON_RADIUS, 0);
+
+	/* alter_entry shows media error during memcpy */
+	if (ret < 0)
+		goto fail;
+
+	nova_dbg("%s: entry media error repaired\n", __func__);
+	return 0;
+
+fail:
+	nova_err(sb, "%s: unrecoverable media error detected\n", __func__);
+	return -1;
+}
+
+static int nova_repair_entry(struct super_block *sb, void *bad, void *good,
+	size_t entry_size)
+{
+	int ret;
+
+	nova_memunlock_range(sb, bad, entry_size);
+	ret = memcpy_to_pmem_nocache(bad, good, entry_size);
+	nova_memlock_range(sb, bad, entry_size);
+
+	if (ret == 0)
+		nova_dbg("%s: entry error repaired\n", __func__);
+
+	return ret;
+}
+
+/* Verify the log entry checksum and get a copy in DRAM. */
+bool nova_verify_entry_csum(struct super_block *sb, void *entry, void *entryc)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	int ret = 0;
+	u64 entry_off, alter_off;
+	void *alter;
+	size_t entry_size, alter_size;
+	u32 entry_csum, alter_csum;
+	u32 entry_csum_calc, alter_csum_calc;
+	char entry_copy[NOVA_MAX_ENTRY_LEN];
+	char alter_copy[NOVA_MAX_ENTRY_LEN];
+	timing_t verify_time;
+
+	if (metadata_csum == 0)
+		return true;
+
+	NOVA_START_TIMING(verify_entry_csum_t, verify_time);
+
+	ret = nova_get_entry_copy(sb, entry, &entry_csum, &entry_size,
+				  entry_copy);
+	if (ret < 0) { /* media error */
+		ret = nova_repair_entry_pr(sb, entry);
+		if (ret < 0)
+			goto fail;
+		/* try again */
+		ret = nova_get_entry_copy(sb, entry, &entry_csum, &entry_size,
+						entry_copy);
+		if (ret < 0)
+			goto fail;
+	}
+
+	entry_off = nova_get_addr_off(sbi, entry);
+	alter_off = alter_log_entry(sb, entry_off);
+	if (alter_off == 0) {
+		nova_err(sb, "%s: log page tail error detected\n", __func__);
+		goto fail;
+	}
+
+	alter = (void *) nova_get_block(sb, alter_off);
+	ret = nova_get_entry_copy(sb, alter, &alter_csum, &alter_size,
+					alter_copy);
+	if (ret < 0) { /* media error */
+		ret = nova_repair_entry_pr(sb, alter);
+		if (ret < 0)
+			goto fail;
+		/* try again */
+		ret = nova_get_entry_copy(sb, alter, &alter_csum, &alter_size,
+						alter_copy);
+		if (ret < 0)
+			goto fail;
+	}
+
+	/* no media errors, now verify the checksums */
+	entry_csum = le32_to_cpu(entry_csum);
+	alter_csum = le32_to_cpu(alter_csum);
+	entry_csum_calc = nova_calc_entry_csum(entry_copy);
+	alter_csum_calc = nova_calc_entry_csum(alter_copy);
+
+	if (entry_csum != entry_csum_calc && alter_csum != alter_csum_calc) {
+		nova_err(sb, "%s: both entry and its replica fail checksum verification\n",
+			 __func__);
+		goto fail;
+	} else if (entry_csum != entry_csum_calc) {
+		nova_dbg("%s: entry %p checksum error, trying to repair using the replica\n",
+			 __func__, entry);
+		ret = nova_repair_entry(sb, entry, alter_copy, alter_size);
+		if (ret != 0)
+			goto fail;
+
+		memcpy(entryc, alter_copy, alter_size);
+	} else if (alter_csum != alter_csum_calc) {
+		nova_dbg("%s: entry replica %p checksum error, trying to repair using the primary\n",
+			 __func__, alter);
+		ret = nova_repair_entry(sb, alter, entry_copy, entry_size);
+		if (ret != 0)
+			goto fail;
+
+		memcpy(entryc, entry_copy, entry_size);
+	} else {
+		/* now both entries pass checksum verification and the primary
+		 * is trusted if their buffers don't match
+		 */
+		if (memcmp(entry_copy, alter_copy, entry_size)) {
+			nova_dbg("%s: entry replica %p error, trying to repair using the primary\n",
+				 __func__, alter);
+			ret = nova_repair_entry(sb, alter, entry_copy,
+						entry_size);
+			if (ret != 0)
+				goto fail;
+		}
+
+		memcpy(entryc, entry_copy, entry_size);
+	}
+
+	NOVA_END_TIMING(verify_entry_csum_t, verify_time);
+	return true;
+
+fail:
+	nova_err(sb, "%s: unable to repair entry errors\n", __func__);
+
+	NOVA_END_TIMING(verify_entry_csum_t, verify_time);
+	return false;
+}
+
+/* media error: repair the poison radius that the inode belongs to */
+static int nova_repair_inode_pr(struct super_block *sb,
+	struct nova_inode *bad_pi, struct nova_inode *good_pi)
+{
+	int ret;
+	void *bad_pr, *good_pr;
+
+	bad_pr = (void *)((u64) bad_pi & POISON_MASK);
+	good_pr = (void *)((u64) good_pi & POISON_MASK);
+
+	if (bad_pr == NULL || good_pr == NULL)
+		BUG();
+
+	nova_memunlock_range(sb, bad_pr, POISON_RADIUS);
+	ret = memcpy_mcsafe(bad_pr, good_pr, POISON_RADIUS);
+	nova_memlock_range(sb, bad_pr, POISON_RADIUS);
+	nova_flush_buffer(bad_pr, POISON_RADIUS, 0);
+
+	/* good_pi shows media error during memcpy */
+	if (ret < 0)
+		goto fail;
+
+	nova_dbg("%s: inode media error repaired\n", __func__);
+	return 0;
+
+fail:
+	nova_err(sb, "%s: unrecoverable media error detected\n", __func__);
+	return -1;
+}
+
+static int nova_repair_inode(struct super_block *sb, struct nova_inode *bad_pi,
+	struct nova_inode *good_copy)
+{
+	int ret;
+
+	nova_memunlock_inode(sb, bad_pi);
+	ret = memcpy_to_pmem_nocache(bad_pi, good_copy,
+					sizeof(struct nova_inode));
+	nova_memlock_inode(sb, bad_pi);
+
+	if (ret == 0)
+		nova_dbg("%s: inode %llu error repaired\n", __func__,
+					good_copy->nova_ino);
+
+	return ret;
+}
+
+/*
+ * Check nova_inode and get a copy in DRAM.
+ * If we are going to update (write) the inode, we don't need to check the
+ * alter inode if the major inode checks ok. If we are going to read or rebuild
+ * the inode, also check the alter even if the major inode checks ok.
+ */
+int nova_check_inode_integrity(struct super_block *sb, u64 ino, u64 pi_addr,
+	u64 alter_pi_addr, struct nova_inode *pic, int check_replica)
+{
+	struct nova_inode *pi, *alter_pi, alter_copy, *alter_pic;
+	int inode_bad, alter_bad;
+	int ret;
+
+	pi = (struct nova_inode *)nova_get_block(sb, pi_addr);
+
+	ret = memcpy_mcsafe(pic, pi, sizeof(struct nova_inode));
+
+	if (metadata_csum == 0)
+		return ret;
+
+	alter_pi = (struct nova_inode *)nova_get_block(sb, alter_pi_addr);
+
+	if (ret < 0) { /* media error */
+		ret = nova_repair_inode_pr(sb, pi, alter_pi);
+		if (ret < 0)
+			goto fail;
+		/* try again */
+		ret = memcpy_mcsafe(pic, pi, sizeof(struct nova_inode));
+		if (ret < 0)
+			goto fail;
+	}
+
+	inode_bad = nova_check_inode_checksum(pic);
+
+	if (!inode_bad && !check_replica)
+		return 0;
+
+	alter_pic = &alter_copy;
+	ret = memcpy_mcsafe(alter_pic, alter_pi, sizeof(struct nova_inode));
+	if (ret < 0) { /* media error */
+		if (inode_bad)
+			goto fail;
+		ret = nova_repair_inode_pr(sb, alter_pi, pi);
+		if (ret < 0)
+			goto fail;
+		/* try again */
+		ret = memcpy_mcsafe(alter_pic, alter_pi,
+					sizeof(struct nova_inode));
+		if (ret < 0)
+			goto fail;
+	}
+
+	alter_bad = nova_check_inode_checksum(alter_pic);
+
+	if (inode_bad && alter_bad) {
+		nova_err(sb, "%s: both inode and its replica fail checksum verification\n",
+			 __func__);
+		goto fail;
+	} else if (inode_bad) {
+		nova_dbg("%s: inode %llu checksum error, trying to repair using the replica\n",
+			 __func__, ino);
+		ret = nova_repair_inode(sb, pi, alter_pic);
+		if (ret != 0)
+			goto fail;
+
+		memcpy(pic, alter_pic, sizeof(struct nova_inode));
+	} else if (alter_bad) {
+		nova_dbg("%s: inode replica %llu checksum error, trying to repair using the primary\n",
+			 __func__, ino);
+		ret = nova_repair_inode(sb, alter_pi, pic);
+		if (ret != 0)
+			goto fail;
+	} else if (memcmp(pic, alter_pic, sizeof(struct nova_inode))) {
+		nova_dbg("%s: inode replica %llu is stale, trying to repair using the primary\n",
+			 __func__, ino);
+		ret = nova_repair_inode(sb, alter_pi, pic);
+		if (ret != 0)
+			goto fail;
+	}
+
+	return 0;
+
+fail:
+	nova_err(sb, "%s: unable to repair inode errors\n", __func__);
+
+	return -EIO;
+}
+
+static int nova_update_stripe_csum(struct super_block *sb, unsigned long strps,
+	unsigned long strp_nr, u8 *strp_ptr, int zero)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	size_t strp_size = NOVA_STRIPE_SIZE;
+	unsigned long strp;
+	u32 csum;
+	u32 crc[8];
+	void *csum_addr, *csum_addr1;
+	void *src_addr;
+
+	while (strps >= 8) {
+		if (zero) {
+			src_addr = sbi->zero_csum;
+			goto copy;
+		}
+
+		crc[0] = cpu_to_le32(nova_crc32c(NOVA_INIT_CSUM,
+				strp_ptr, strp_size));
+		crc[1] = cpu_to_le32(nova_crc32c(NOVA_INIT_CSUM,
+				strp_ptr + strp_size, strp_size));
+		crc[2] = cpu_to_le32(nova_crc32c(NOVA_INIT_CSUM,
+				strp_ptr + strp_size * 2, strp_size));
+		crc[3] = cpu_to_le32(nova_crc32c(NOVA_INIT_CSUM,
+				strp_ptr + strp_size * 3, strp_size));
+		crc[4] = cpu_to_le32(nova_crc32c(NOVA_INIT_CSUM,
+				strp_ptr + strp_size * 4, strp_size));
+		crc[5] = cpu_to_le32(nova_crc32c(NOVA_INIT_CSUM,
+				strp_ptr + strp_size * 5, strp_size));
+		crc[6] = cpu_to_le32(nova_crc32c(NOVA_INIT_CSUM,
+				strp_ptr + strp_size * 6, strp_size));
+		crc[7] = cpu_to_le32(nova_crc32c(NOVA_INIT_CSUM,
+				strp_ptr + strp_size * 7, strp_size));
+
+		src_addr = crc;
+copy:
+		csum_addr = nova_get_data_csum_addr(sb, strp_nr, 0);
+		csum_addr1 = nova_get_data_csum_addr(sb, strp_nr, 1);
+
+		nova_memunlock_range(sb, csum_addr, NOVA_DATA_CSUM_LEN * 8);
+		if (support_clwb) {
+			memcpy(csum_addr, src_addr, NOVA_DATA_CSUM_LEN * 8);
+			memcpy(csum_addr1, src_addr, NOVA_DATA_CSUM_LEN * 8);
+		} else {
+			memcpy_to_pmem_nocache(csum_addr, src_addr,
+						NOVA_DATA_CSUM_LEN * 8);
+			memcpy_to_pmem_nocache(csum_addr1, src_addr,
+						NOVA_DATA_CSUM_LEN * 8);
+		}
+		nova_memlock_range(sb, csum_addr, NOVA_DATA_CSUM_LEN * 8);
+		if (support_clwb) {
+			nova_flush_buffer(csum_addr,
+					  NOVA_DATA_CSUM_LEN * 8, 0);
+			nova_flush_buffer(csum_addr1,
+					  NOVA_DATA_CSUM_LEN * 8, 0);
+		}
+
+		strp_nr += 8;
+		strps -= 8;
+		if (!zero)
+			strp_ptr += strp_size * 8;
+	}
+
+	for (strp = 0; strp < strps; strp++) {
+		if (zero)
+			csum = sbi->zero_csum[0];
+		else
+			csum = nova_crc32c(NOVA_INIT_CSUM, strp_ptr, strp_size);
+
+		csum = cpu_to_le32(csum);
+		csum_addr = nova_get_data_csum_addr(sb, strp_nr, 0);
+		csum_addr1 = nova_get_data_csum_addr(sb, strp_nr, 1);
+
+		nova_memunlock_range(sb, csum_addr, NOVA_DATA_CSUM_LEN);
+		memcpy_to_pmem_nocache(csum_addr, &csum, NOVA_DATA_CSUM_LEN);
+		memcpy_to_pmem_nocache(csum_addr1, &csum, NOVA_DATA_CSUM_LEN);
+		nova_memlock_range(sb, csum_addr, NOVA_DATA_CSUM_LEN);
+
+		strp_nr += 1;
+		if (!zero)
+			strp_ptr += strp_size;
+	}
+
+	return 0;
+}
+
+/* Checksums a sequence of contiguous file write data stripes within one block
+ * and writes the checksum values to nvmm.
+ *
+ * The block buffer to compute checksums should reside in dram (more trusted),
+ * not in nvmm (less trusted).
+ *
+ * Checksum is calculated over a whole stripe.
+ *
+ * block:   block buffer with user data and possibly partial head-tail block
+ *          - should be in kernel memory (dram) to avoid page faults
+ * blocknr: destination nvmm block number where the block is written to
+ *          - used to derive checksum value addresses
+ * offset:  byte offset of user data in the block buffer
+ * bytes:   number of user data bytes in the block buffer
+ * zero:    if the user data is all zero
+ */
+int nova_update_block_csum(struct super_block *sb,
+	struct nova_inode_info_header *sih, u8 *block, unsigned long blocknr,
+	size_t offset, size_t bytes, int zero)
+{
+	u8 *strp_ptr;
+	size_t blockoff;
+	unsigned int strp_shift = NOVA_STRIPE_SHIFT;
+	unsigned int strp_index, strp_offset;
+	unsigned long strps, strp_nr;
+	timing_t block_csum_time;
+
+	NOVA_START_TIMING(block_csum_t, block_csum_time);
+	blockoff = nova_get_block_off(sb, blocknr, sih->i_blk_type);
+
+	/* strp_index: stripe index within the block buffer
+	 * strp_offset: stripe offset within the block buffer
+	 *
+	 * strps: number of stripes touched by user data (need new checksums)
+	 * strp_nr: global stripe number converted from blocknr and offset
+	 * strp_ptr: pointer to stripes in the block buffer
+	 */
+	strp_index = offset >> strp_shift;
+	strp_offset = offset - (strp_index << strp_shift);
+
+	strps = ((strp_offset + bytes - 1) >> strp_shift) + 1;
+	strp_nr = (blockoff + offset) >> strp_shift;
+	strp_ptr = block + (strp_index << strp_shift);
+
+	nova_update_stripe_csum(sb, strps, strp_nr, strp_ptr, zero);
+
+	NOVA_END_TIMING(block_csum_t, block_csum_time);
+
+	return 0;
+}
+
+int nova_update_pgoff_csum(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	unsigned long pgoff, int zero)
+{
+	void *dax_mem = NULL;
+	u64 blockoff;
+	size_t strp_size = NOVA_STRIPE_SIZE;
+	unsigned int strp_shift = NOVA_STRIPE_SHIFT;
+	unsigned long strp_nr;
+	int count;
+
+	count = blk_type_to_size[sih->i_blk_type] / strp_size;
+
+	blockoff = nova_find_nvmm_block(sb, sih, entry, pgoff);
+
+	/* Truncated? */
+	if (blockoff == 0)
+		return 0;
+
+	dax_mem = nova_get_block(sb, blockoff);
+
+	strp_nr = blockoff >> strp_shift;
+
+	nova_update_stripe_csum(sb, count, strp_nr, dax_mem, zero);
+
+	return 0;
+}
+
+/* Verify checksums of requested data bytes starting from offset of blocknr.
+ *
+ * Only a whole stripe can be checksum verified.
+ *
+ * blocknr: container blocknr for the first stripe to be verified
+ * offset:  byte offset within the block associated with blocknr
+ * bytes:   number of contiguous bytes to be verified starting from offset
+ *
+ * return: true or false
+ */
+bool nova_verify_data_csum(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr,
+	size_t offset, size_t bytes)
+{
+	void *blockptr, *strp_ptr;
+	size_t blockoff, blocksize = nova_inode_blk_size(sih);
+	size_t strp_size = NOVA_STRIPE_SIZE;
+	unsigned int strp_shift = NOVA_STRIPE_SHIFT;
+	unsigned int strp_index;
+	unsigned long strp, strps, strp_nr;
+	void *strip = NULL;
+	u32 csum_calc = 0, csum_nvmm0, csum_nvmm1;
+	u32 *csum_addr0, *csum_addr1;
+	int error;
+	bool match;
+	timing_t verify_time;
+
+	NOVA_START_TIMING(verify_data_csum_t, verify_time);
+
+	/* Only a whole stripe can be checksum verified.
+	 * strps: # of stripes to be checked since offset.
+	 */
+	strps = ((offset + bytes - 1) >> strp_shift)
+		- (offset >> strp_shift) + 1;
+
+	blockoff = nova_get_block_off(sb, blocknr, sih->i_blk_type);
+	blockptr = nova_get_block(sb, blockoff);
+
+	/* strp_nr: global stripe number converted from blocknr and offset
+	 * strp_ptr: virtual address of the 1st stripe
+	 * strp_index: stripe index within a block
+	 */
+	strp_nr = (blockoff + offset) >> strp_shift;
+	strp_index = offset >> strp_shift;
+	strp_ptr = blockptr + (strp_index << strp_shift);
+
+	strip = kmalloc(strp_size, GFP_KERNEL);
+	if (strip == NULL)
+		return false;
+
+	match = true;
+	for (strp = 0; strp < strps; strp++) {
+		csum_addr0 = nova_get_data_csum_addr(sb, strp_nr, 0);
+		csum_nvmm0 = le32_to_cpu(*csum_addr0);
+
+		csum_addr1 = nova_get_data_csum_addr(sb, strp_nr, 1);
+		csum_nvmm1 = le32_to_cpu(*csum_addr1);
+
+		error = memcpy_mcsafe(strip, strp_ptr, strp_size);
+		if (error < 0) {
+			nova_dbg("%s: media error in data strip detected!\n",
+				__func__);
+			match = false;
+		} else {
+			csum_calc = nova_crc32c(NOVA_INIT_CSUM, strip,
+						strp_size);
+			match = (csum_calc == csum_nvmm0) ||
+				(csum_calc == csum_nvmm1);
+		}
+
+		if (!match) {
+			/* Getting here, data is considered corrupted.
+			 *
+			 * if: csum_nvmm0 == csum_nvmm1
+			 *     both csums good, run data recovery
+			 * if: csum_nvmm0 != csum_nvmm1
+			 *     at least one csum is corrupted, also need to run
+			 *     data recovery to see if one csum is still good
+			 */
+			nova_dbg("%s: nova data corruption detected! inode %lu, strp %lu of %lu, block offset %lu, stripe nr %lu, csum calc 0x%08x, csum nvmm 0x%08x, csum nvmm replica 0x%08x\n",
+				__func__, sih->ino, strp, strps, blockoff,
+				strp_nr, csum_calc, csum_nvmm0, csum_nvmm1);
+
+			if (data_parity == 0) {
+				nova_dbg("%s: no data redundancy available, can not repair data corruption!\n",
+					 __func__);
+				break;
+			}
+
+			nova_dbg("%s: nova data recovery begins\n", __func__);
+
+			error = nova_restore_data(sb, blocknr, strp_index,
+					strip, error, csum_nvmm0, csum_nvmm1,
+					&csum_calc);
+			if (error) {
+				nova_dbg("%s: nova data recovery fails!\n",
+						__func__);
+				dump_stack();
+				break;
+			}
+
+			/* Getting here, data corruption is repaired and the
+			 * good checksum is stored in csum_calc.
+			 */
+			nova_dbg("%s: nova data recovery success!\n", __func__);
+			match = true;
+		}
+
+		/* Getting here, match must be true, otherwise already breaking
+		 * out the for loop. Data is known good, either it's good in
+		 * nvmm, or good after recovery.
+		 */
+		if (csum_nvmm0 != csum_nvmm1) {
+			/* Getting here, data is known good but one checksum is
+			 * considered corrupted.
+			 */
+			nova_dbg("%s: nova checksum corruption detected! inode %lu, strp %lu of %lu, block offset %lu, stripe nr %lu, csum calc 0x%08x, csum nvmm 0x%08x, csum nvmm replica 0x%08x\n",
+				__func__, sih->ino, strp, strps, blockoff,
+				strp_nr, csum_calc, csum_nvmm0, csum_nvmm1);
+
+			nova_memunlock_range(sb, csum_addr0,
+							NOVA_DATA_CSUM_LEN);
+			if (csum_nvmm0 != csum_calc) {
+				csum_nvmm0 = cpu_to_le32(csum_calc);
+				memcpy_to_pmem_nocache(csum_addr0, &csum_nvmm0,
+							NOVA_DATA_CSUM_LEN);
+			}
+
+			if (csum_nvmm1 != csum_calc) {
+				csum_nvmm1 = cpu_to_le32(csum_calc);
+				memcpy_to_pmem_nocache(csum_addr1, &csum_nvmm1,
+							NOVA_DATA_CSUM_LEN);
+			}
+			nova_memlock_range(sb, csum_addr0, NOVA_DATA_CSUM_LEN);
+
+			nova_dbg("%s: nova checksum corruption repaired!\n",
+								__func__);
+		}
+
+		/* Getting here, the data stripe and both checksum copies are
+		 * known good. Continue to the next stripe.
+		 */
+		strp_nr    += 1;
+		strp_index += 1;
+		strp_ptr   += strp_size;
+		if (strp_index == (blocksize >> strp_shift)) {
+			blocknr += 1;
+			blockoff += blocksize;
+			strp_index = 0;
+		}
+
+	}
+
+	if (strip != NULL)
+		kfree(strip);
+
+	NOVA_END_TIMING(verify_data_csum_t, verify_time);
+
+	return match;
+}
+
+int nova_update_truncated_block_csum(struct super_block *sb,
+	struct inode *inode, loff_t newsize) {
+
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	unsigned long offset = newsize & (sb->s_blocksize - 1);
+	unsigned long pgoff, length;
+	u64 nvmm;
+	char *nvmm_addr, *strp_addr, *tail_strp = NULL;
+	unsigned int strp_size = NOVA_STRIPE_SIZE;
+	unsigned int strp_shift = NOVA_STRIPE_SHIFT;
+	unsigned int strp_index, strp_offset;
+	unsigned long strps, strp_nr;
+
+	length = sb->s_blocksize - offset;
+	pgoff = newsize >> sb->s_blocksize_bits;
+
+	nvmm = nova_find_nvmm_block(sb, sih, NULL, pgoff);
+	if (nvmm == 0)
+		return -EFAULT;
+
+	nvmm_addr = (char *)nova_get_block(sb, nvmm);
+
+	strp_index = offset >> strp_shift;
+	strp_offset = offset - (strp_index << strp_shift);
+
+	strps = ((strp_offset + length - 1) >> strp_shift) + 1;
+	strp_nr = (nvmm + offset) >> strp_shift;
+	strp_addr = nvmm_addr + (strp_index << strp_shift);
+
+	if (strp_offset > 0) {
+		/* Copy to DRAM to catch MCE. */
+		tail_strp = kzalloc(strp_size, GFP_KERNEL);
+		if (tail_strp == NULL)
+			return -ENOMEM;
+
+		if (memcpy_mcsafe(tail_strp, strp_addr, strp_offset) < 0) {
+			kfree(tail_strp);
+			return -EIO;
+		}
+		nova_update_stripe_csum(sb, 1, strp_nr, tail_strp, 0);
+
+		strps--;
+		strp_nr++;
+	}
+
+	if (strps > 0)
+		nova_update_stripe_csum(sb, strps, strp_nr, NULL, 1);
+
+	kfree(tail_strp);
+
+	return 0;
+}
+
diff --git a/fs/nova/mprotect.c b/fs/nova/mprotect.c
new file mode 100644
index 000000000000..4b58786f401e
--- /dev/null
+++ b/fs/nova/mprotect.c
@@ -0,0 +1,604 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Memory protection for the filesystem pages.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/io.h>
+#include "nova.h"
+#include "inode.h"
+
+static inline void wprotect_disable(void)
+{
+	unsigned long cr0_val;
+
+	cr0_val = read_cr0();
+	cr0_val &= (~X86_CR0_WP);
+	write_cr0(cr0_val);
+}
+
+static inline void wprotect_enable(void)
+{
+	unsigned long cr0_val;
+
+	cr0_val = read_cr0();
+	cr0_val |= X86_CR0_WP;
+	write_cr0(cr0_val);
+}
+
+/* FIXME: Assumes that we are always called in the right order.
+ * nova_writeable(vaddr, size, 1);
+ * nova_writeable(vaddr, size, 0);
+ */
+int nova_writeable(void *vaddr, unsigned long size, int rw)
+{
+	static unsigned long flags;
+	timing_t wprotect_time;
+
+	NOVA_START_TIMING(wprotect_t, wprotect_time);
+	if (rw) {
+		local_irq_save(flags);
+		wprotect_disable();
+	} else {
+		wprotect_enable();
+		local_irq_restore(flags);
+	}
+	NOVA_END_TIMING(wprotect_t, wprotect_time);
+	return 0;
+}
+
+int nova_dax_mem_protect(struct super_block *sb, void *vaddr,
+			  unsigned long size, int rw)
+{
+	if (!nova_is_wprotected(sb))
+		return 0;
+	return nova_writeable(vaddr, size, rw);
+}
+
+int nova_get_vma_overlap_range(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct vm_area_struct *vma,
+	unsigned long entry_pgoff, unsigned long entry_pages,
+	unsigned long *start_pgoff, unsigned long *num_pages)
+{
+	unsigned long vma_pgoff;
+	unsigned long vma_pages;
+	unsigned long end_pgoff;
+
+	vma_pgoff = vma->vm_pgoff;
+	vma_pages = (vma->vm_end - vma->vm_start) >> sb->s_blocksize_bits;
+
+	if (vma_pgoff + vma_pages <= entry_pgoff ||
+				entry_pgoff + entry_pages <= vma_pgoff)
+		return 0;
+
+	*start_pgoff = vma_pgoff > entry_pgoff ? vma_pgoff : entry_pgoff;
+	end_pgoff = (vma_pgoff + vma_pages) > (entry_pgoff + entry_pages) ?
+			entry_pgoff + entry_pages : vma_pgoff + vma_pages;
+	*num_pages = end_pgoff - *start_pgoff;
+	return 1;
+}
+
+static int nova_update_dax_mapping(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct vm_area_struct *vma,
+	struct nova_file_write_entry *entry, unsigned long start_pgoff,
+	unsigned long num_pages)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	void **pentry;
+	unsigned long curr_pgoff;
+	unsigned long blocknr, start_blocknr;
+	unsigned long value, new_value;
+	int i;
+	int ret = 0;
+	timing_t update_time;
+
+	NOVA_START_TIMING(update_mapping_t, update_time);
+
+	start_blocknr = nova_get_blocknr(sb, entry->block, sih->i_blk_type);
+	spin_lock_irq(&mapping->tree_lock);
+	for (i = 0; i < num_pages; i++) {
+		curr_pgoff = start_pgoff + i;
+		blocknr = start_blocknr + i;
+
+		pentry = radix_tree_lookup_slot(&mapping->page_tree,
+						curr_pgoff);
+		if (pentry) {
+			value = (unsigned long)radix_tree_deref_slot(pentry);
+			/* 9 = sector shift (3) + RADIX_DAX_SHIFT (6) */
+			new_value = (blocknr << 9) | (value & 0xff);
+			nova_dbgv("%s: pgoff %lu, entry 0x%lx, new 0x%lx\n",
+						__func__, curr_pgoff,
+						value, new_value);
+			radix_tree_replace_slot(&mapping->page_tree, pentry,
+						(void *)new_value);
+			radix_tree_tag_set(&mapping->page_tree, curr_pgoff,
+						PAGECACHE_TAG_DIRTY);
+		}
+	}
+
+	spin_unlock_irq(&mapping->tree_lock);
+
+	NOVA_END_TIMING(update_mapping_t, update_time);
+	return ret;
+}
+
+static int nova_update_entry_pfn(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct vm_area_struct *vma,
+	struct nova_file_write_entry *entry, unsigned long start_pgoff,
+	unsigned long num_pages)
+{
+	unsigned long newflags;
+	unsigned long addr;
+	unsigned long size;
+	unsigned long pfn;
+	pgprot_t new_prot;
+	int ret;
+	timing_t update_time;
+
+	NOVA_START_TIMING(update_pfn_t, update_time);
+
+	addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	pfn = nova_get_pfn(sb, entry->block) + start_pgoff - entry->pgoff;
+	size = num_pages << PAGE_SHIFT;
+
+	nova_dbgv("%s: addr 0x%lx, size 0x%lx\n", __func__,
+			addr, size);
+
+	newflags = vma->vm_flags | VM_WRITE;
+	new_prot = vm_get_page_prot(newflags);
+
+	ret = remap_pfn_range(vma, addr, pfn, size, new_prot);
+
+	NOVA_END_TIMING(update_pfn_t, update_time);
+	return ret;
+}
+
+static int nova_dax_mmap_update_mapping(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct vm_area_struct *vma,
+	struct nova_file_write_entry *entry_data)
+{
+	unsigned long start_pgoff, num_pages = 0;
+	int ret;
+
+	ret = nova_get_vma_overlap_range(sb, sih, vma, entry_data->pgoff,
+						entry_data->num_pages,
+						&start_pgoff, &num_pages);
+	if (ret == 0)
+		return ret;
+
+
+	NOVA_STATS_ADD(mapping_updated_pages, num_pages);
+
+	ret = nova_update_dax_mapping(sb, sih, vma, entry_data,
+						start_pgoff, num_pages);
+	if (ret) {
+		nova_err(sb, "update DAX mapping return %d\n", ret);
+		return ret;
+	}
+
+	ret = nova_update_entry_pfn(sb, sih, vma, entry_data,
+						start_pgoff, num_pages);
+	if (ret)
+		nova_err(sb, "update_pfn return %d\n", ret);
+
+
+	return ret;
+}
+
+static int nova_dax_cow_mmap_handler(struct super_block *sb,
+	struct vm_area_struct *vma, struct nova_inode_info_header *sih,
+	u64 begin_tail)
+{
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	u64 curr_p = begin_tail;
+	size_t entry_size = sizeof(struct nova_file_write_entry);
+	int ret = 0;
+	timing_t update_time;
+
+	NOVA_START_TIMING(mmap_handler_t, update_time);
+	entryc = (metadata_csum == 0) ? entry : &entry_copy;
+	while (curr_p && curr_p != sih->log_tail) {
+		if (is_last_entry(curr_p, entry_size))
+			curr_p = next_log_page(sb, curr_p);
+
+		if (curr_p == 0) {
+			nova_err(sb, "%s: File inode %lu log is NULL!\n",
+				__func__, sih->ino);
+			ret = -EINVAL;
+			break;
+		}
+
+		entry = (struct nova_file_write_entry *)
+					nova_get_block(sb, curr_p);
+
+		if (metadata_csum == 0)
+			entryc = entry;
+		else if (!nova_verify_entry_csum(sb, entry, entryc)) {
+			ret = -EIO;
+			curr_p += entry_size;
+			continue;
+		}
+
+		if (nova_get_entry_type(entryc) != FILE_WRITE) {
+			/* for debug information, still use nvmm entry */
+			nova_dbg("%s: entry type is not write? %d\n",
+				__func__, nova_get_entry_type(entry));
+			curr_p += entry_size;
+			continue;
+		}
+
+		ret = nova_dax_mmap_update_mapping(sb, sih, vma, entryc);
+		if (ret)
+			break;
+
+		curr_p += entry_size;
+	}
+
+	NOVA_END_TIMING(mmap_handler_t, update_time);
+	return ret;
+}
+
+static int nova_get_dax_cow_range(struct super_block *sb,
+	struct vm_area_struct *vma, unsigned long address,
+	unsigned long *start_blk, int *num_blocks)
+{
+	int base = 1;
+	unsigned long vma_blocks;
+	unsigned long pgoff;
+	unsigned long start_pgoff;
+
+	vma_blocks = (vma->vm_end - vma->vm_start) >> sb->s_blocksize_bits;
+
+	/* Read ahead, avoid sequential page faults */
+	if (vma_blocks >= 4096)
+		base = 4096;
+
+	pgoff = (address - vma->vm_start) >> sb->s_blocksize_bits;
+	start_pgoff = pgoff & ~(base - 1);
+	*start_blk = vma->vm_pgoff + start_pgoff;
+	*num_blocks = (base > vma_blocks - start_pgoff) ?
+			vma_blocks - start_pgoff : base;
+	nova_dbgv("%s: start block %lu, %d blocks\n",
+			__func__, *start_blk, *num_blocks);
+	return 0;
+}
+
+int nova_mmap_to_new_blocks(struct vm_area_struct *vma,
+	unsigned long address)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct inode *inode = mapping->host;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct super_block *sb = inode->i_sb;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode *pi;
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	struct nova_file_write_entry entry_data;
+	struct nova_inode_update update;
+	unsigned long start_blk, end_blk;
+	unsigned long entry_pgoff;
+	unsigned long from_blocknr = 0;
+	unsigned long blocknr = 0;
+	unsigned long avail_blocks;
+	unsigned long copy_blocks;
+	int num_blocks = 0;
+	u64 from_blockoff, to_blockoff;
+	size_t copied;
+	int allocated = 0;
+	void *from_kmem;
+	void *to_kmem;
+	size_t bytes;
+	timing_t memcpy_time;
+	u64 begin_tail = 0;
+	u64 epoch_id;
+	u64 entry_size;
+	u32 time;
+	timing_t mmap_cow_time;
+	int ret = 0;
+
+	NOVA_START_TIMING(mmap_cow_t, mmap_cow_time);
+
+	nova_get_dax_cow_range(sb, vma, address, &start_blk, &num_blocks);
+
+	end_blk = start_blk + num_blocks;
+	if (start_blk >= end_blk) {
+		NOVA_END_TIMING(mmap_cow_t, mmap_cow_time);
+		return 0;
+	}
+
+	if (sbi->snapshot_taking) {
+		/* Block CoW mmap until snapshot taking completes */
+		NOVA_STATS_ADD(dax_cow_during_snapshot, 1);
+		wait_event_interruptible(sbi->snapshot_mmap_wait,
+					sbi->snapshot_taking == 0);
+	}
+
+	inode_lock(inode);
+
+	pi = nova_get_inode(sb, inode);
+
+	nova_dbgv("%s: inode %lu, start pgoff %lu, end pgoff %lu\n",
+			__func__, inode->i_ino, start_blk, end_blk);
+
+	time = current_time(inode).tv_sec;
+
+	epoch_id = nova_get_epoch_id(sb);
+	update.tail = pi->log_tail;
+	update.alter_tail = pi->alter_log_tail;
+
+	entryc = (metadata_csum == 0) ? entry : &entry_copy;
+
+	while (start_blk < end_blk) {
+		entry = nova_get_write_entry(sb, sih, start_blk);
+		if (!entry) {
+			nova_dbgv("%s: Found hole: pgoff %lu\n",
+					__func__, start_blk);
+
+			/* Jump the hole */
+			entry = nova_find_next_entry(sb, sih, start_blk);
+			if (!entry)
+				break;
+
+			if (metadata_csum == 0)
+				entryc = entry;
+			else if (!nova_verify_entry_csum(sb, entry, entryc))
+				break;
+
+			start_blk = entryc->pgoff;
+			if (start_blk >= end_blk)
+				break;
+		} else {
+			if (metadata_csum == 0)
+				entryc = entry;
+			else if (!nova_verify_entry_csum(sb, entry, entryc))
+				break;
+		}
+
+		if (entryc->epoch_id == epoch_id) {
+			/* Someone has done it for us. */
+			break;
+		}
+
+		from_blocknr = get_nvmm(sb, sih, entryc, start_blk);
+		from_blockoff = nova_get_block_off(sb, from_blocknr,
+						pi->i_blk_type);
+		from_kmem = nova_get_block(sb, from_blockoff);
+
+		if (entryc->reassigned == 0)
+			avail_blocks = entryc->num_pages -
+					(start_blk - entryc->pgoff);
+		else
+			avail_blocks = 1;
+
+		if (avail_blocks > end_blk - start_blk)
+			avail_blocks = end_blk - start_blk;
+
+		allocated = nova_new_data_blocks(sb, sih, &blocknr, start_blk,
+					 avail_blocks, ALLOC_NO_INIT, ANY_CPU,
+					 ALLOC_FROM_HEAD);
+
+		nova_dbgv("%s: alloc %d blocks @ %lu\n", __func__,
+						allocated, blocknr);
+
+		if (allocated <= 0) {
+			nova_dbg("%s alloc blocks failed!, %d\n",
+						__func__, allocated);
+			ret = allocated;
+			goto out;
+		}
+
+		to_blockoff = nova_get_block_off(sb, blocknr,
+						pi->i_blk_type);
+		to_kmem = nova_get_block(sb, to_blockoff);
+		entry_pgoff = start_blk;
+
+		copy_blocks = allocated;
+
+		bytes = sb->s_blocksize * copy_blocks;
+
+		/* Now copy from the old blocks to the new blocks */
+		NOVA_START_TIMING(memcpy_w_wb_t, memcpy_time);
+		nova_memunlock_range(sb, to_kmem, bytes);
+		copied = bytes - memcpy_to_pmem_nocache(to_kmem, from_kmem,
+							bytes);
+		nova_memlock_range(sb, to_kmem, bytes);
+		NOVA_END_TIMING(memcpy_w_wb_t, memcpy_time);
+
+		if (copied == bytes) {
+			start_blk += copy_blocks;
+		} else {
+			nova_dbg("%s ERROR!: bytes %lu, copied %lu\n",
+				__func__, bytes, copied);
+			ret = -EFAULT;
+			goto out;
+		}
+
+		entry_size = cpu_to_le64(inode->i_size);
+
+		nova_init_file_write_entry(sb, sih, &entry_data,
+					epoch_id, entry_pgoff, copy_blocks,
+					blocknr, time, entry_size);
+
+		ret = nova_append_file_write_entry(sb, pi, inode,
+					&entry_data, &update);
+		if (ret) {
+			nova_dbg("%s: append inode entry failed\n",
+					__func__);
+			ret = -ENOSPC;
+			goto out;
+		}
+
+		if (begin_tail == 0)
+			begin_tail = update.curr_entry;
+	}
+
+	if (begin_tail == 0)
+		goto out;
+
+	nova_memunlock_inode(sb, pi);
+	nova_update_inode(sb, inode, pi, &update, 1);
+	nova_memlock_inode(sb, pi);
+
+	/* Update file tree */
+	ret = nova_reassign_file_tree(sb, sih, begin_tail);
+	if (ret)
+		goto out;
+
+
+	/* Update pfn and prot */
+	ret = nova_dax_cow_mmap_handler(sb, vma, sih, begin_tail);
+	if (ret)
+		goto out;
+
+
+	sih->trans_id++;
+
+out:
+	if (ret < 0)
+		nova_cleanup_incomplete_write(sb, sih, blocknr, allocated,
+						begin_tail, update.tail);
+
+	inode_unlock(inode);
+	NOVA_END_TIMING(mmap_cow_t, mmap_cow_time);
+	return ret;
+}
+
+static int nova_set_vma_read(struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long oldflags = vma->vm_flags;
+	unsigned long newflags;
+	pgprot_t new_page_prot;
+
+	down_write(&mm->mmap_sem);
+
+	newflags = oldflags & (~VM_WRITE);
+	if (oldflags == newflags)
+		goto out;
+
+	nova_dbgv("Set vma %p read, start 0x%lx, end 0x%lx\n",
+				vma, vma->vm_start,
+				vma->vm_end);
+
+	new_page_prot = vm_get_page_prot(newflags);
+	change_protection(vma, vma->vm_start, vma->vm_end,
+				new_page_prot, 0, 0);
+	vma->original_write = 1;
+
+out:
+	up_write(&mm->mmap_sem);
+
+	return 0;
+}
+
+static inline bool pgoff_in_vma(struct vm_area_struct *vma,
+	unsigned long pgoff)
+{
+	unsigned long num_pages;
+
+	num_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+
+	if (pgoff >= vma->vm_pgoff && pgoff < vma->vm_pgoff + num_pages)
+		return true;
+
+	return false;
+}
+
+bool nova_find_pgoff_in_vma(struct inode *inode, unsigned long pgoff)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct vma_item *item;
+	struct rb_node *temp;
+	bool ret = false;
+
+	if (sih->num_vmas == 0)
+		return ret;
+
+	temp = rb_first(&sih->vma_tree);
+	while (temp) {
+		item = container_of(temp, struct vma_item, node);
+		temp = rb_next(temp);
+		if (pgoff_in_vma(item->vma, pgoff)) {
+			ret = true;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+static int nova_set_sih_vmas_readonly(struct nova_inode_info_header *sih)
+{
+	struct vma_item *item;
+	struct rb_node *temp;
+	timing_t set_read_time;
+
+	NOVA_START_TIMING(set_vma_read_t, set_read_time);
+
+	temp = rb_first(&sih->vma_tree);
+	while (temp) {
+		item = container_of(temp, struct vma_item, node);
+		temp = rb_next(temp);
+		nova_set_vma_read(item->vma);
+	}
+
+	NOVA_END_TIMING(set_vma_read_t, set_read_time);
+	return 0;
+}
+
+int nova_set_vmas_readonly(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info_header *sih;
+
+	nova_dbgv("%s\n", __func__);
+	mutex_lock(&sbi->vma_mutex);
+	list_for_each_entry(sih, &sbi->mmap_sih_list, list)
+		nova_set_sih_vmas_readonly(sih);
+	mutex_unlock(&sbi->vma_mutex);
+
+	return 0;
+}
+
+#if 0
+int nova_destroy_vma_tree(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct vma_item *item;
+	struct rb_node *temp;
+
+	nova_dbgv("%s\n", __func__);
+	mutex_lock(&sbi->vma_mutex);
+	temp = rb_first(&sbi->vma_tree);
+	while (temp) {
+		item = container_of(temp, struct vma_item, node);
+		temp = rb_next(temp);
+		rb_erase(&item->node, &sbi->vma_tree);
+		kfree(item);
+	}
+	mutex_unlock(&sbi->vma_mutex);
+
+	return 0;
+}
+#endif
diff --git a/fs/nova/mprotect.h b/fs/nova/mprotect.h
new file mode 100644
index 000000000000..e28243caae52
--- /dev/null
+++ b/fs/nova/mprotect.h
@@ -0,0 +1,190 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Memory protection definitions for the NOVA filesystem.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#ifndef __WPROTECT_H
+#define __WPROTECT_H
+
+#include <linux/fs.h>
+#include "nova_def.h"
+#include "super.h"
+
+extern void nova_error_mng(struct super_block *sb, const char *fmt, ...);
+
+static inline int nova_range_check(struct super_block *sb, void *p,
+					 unsigned long len)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (p < sbi->virt_addr ||
+			p + len > sbi->virt_addr + sbi->initsize) {
+		nova_err(sb, "access pmem out of range: pmem range %p - %p, access range %p - %p\n",
+				sbi->virt_addr,
+				sbi->virt_addr + sbi->initsize,
+				p, p + len);
+		dump_stack();
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+extern int nova_writeable(void *vaddr, unsigned long size, int rw);
+
+static inline int nova_is_protected(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = (struct nova_sb_info *)sb->s_fs_info;
+
+	if (wprotect)
+		return wprotect;
+
+	return sbi->s_mount_opt & NOVA_MOUNT_PROTECT;
+}
+
+static inline int nova_is_wprotected(struct super_block *sb)
+{
+	return nova_is_protected(sb);
+}
+
+static inline void
+__nova_memunlock_range(void *p, unsigned long len)
+{
+	/*
+	 * NOTE: Ideally we would lock the whole kernel to be memory safe
+	 * and avoid writing to the protected memory. That is obviously
+	 * not possible, so we only serialize the operations at the fs
+	 * level. We can't disable the interrupts because we could have
+	 * a deadlock in this path.
+	 */
+	nova_writeable(p, len, 1);
+}
+
+static inline void
+__nova_memlock_range(void *p, unsigned long len)
+{
+	nova_writeable(p, len, 0);
+}
+
+static inline void nova_memunlock_range(struct super_block *sb, void *p,
+					 unsigned long len)
+{
+	if (nova_range_check(sb, p, len))
+		return;
+
+	if (nova_is_protected(sb))
+		__nova_memunlock_range(p, len);
+}
+
+static inline void nova_memlock_range(struct super_block *sb, void *p,
+				       unsigned long len)
+{
+	if (nova_is_protected(sb))
+		__nova_memlock_range(p, len);
+}
+
+static inline void nova_memunlock_super(struct super_block *sb)
+{
+	struct nova_super_block *ps = nova_get_super(sb);
+
+	if (nova_is_protected(sb))
+		__nova_memunlock_range(ps, NOVA_SB_SIZE);
+}
+
+static inline void nova_memlock_super(struct super_block *sb)
+{
+	struct nova_super_block *ps = nova_get_super(sb);
+
+	if (nova_is_protected(sb))
+		__nova_memlock_range(ps, NOVA_SB_SIZE);
+}
+
+static inline void nova_memunlock_reserved(struct super_block *sb,
+					 struct nova_super_block *ps)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (nova_is_protected(sb))
+		__nova_memunlock_range(ps,
+			sbi->head_reserved_blocks * NOVA_DEF_BLOCK_SIZE_4K);
+}
+
+static inline void nova_memlock_reserved(struct super_block *sb,
+				       struct nova_super_block *ps)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (nova_is_protected(sb))
+		__nova_memlock_range(ps,
+			sbi->head_reserved_blocks * NOVA_DEF_BLOCK_SIZE_4K);
+}
+
+static inline void nova_memunlock_journal(struct super_block *sb)
+{
+	void *addr = nova_get_block(sb, NOVA_DEF_BLOCK_SIZE_4K * JOURNAL_START);
+
+	if (nova_range_check(sb, addr, NOVA_DEF_BLOCK_SIZE_4K))
+		return;
+
+	if (nova_is_protected(sb))
+		__nova_memunlock_range(addr, NOVA_DEF_BLOCK_SIZE_4K);
+}
+
+static inline void nova_memlock_journal(struct super_block *sb)
+{
+	void *addr = nova_get_block(sb, NOVA_DEF_BLOCK_SIZE_4K * JOURNAL_START);
+
+	if (nova_is_protected(sb))
+		__nova_memlock_range(addr, NOVA_DEF_BLOCK_SIZE_4K);
+}
+
+static inline void nova_memunlock_inode(struct super_block *sb,
+					 struct nova_inode *pi)
+{
+	if (nova_range_check(sb, pi, NOVA_INODE_SIZE))
+		return;
+
+	if (nova_is_protected(sb))
+		__nova_memunlock_range(pi, NOVA_INODE_SIZE);
+}
+
+static inline void nova_memlock_inode(struct super_block *sb,
+				       struct nova_inode *pi)
+{
+	/* nova_sync_inode(pi); */
+	if (nova_is_protected(sb))
+		__nova_memlock_range(pi, NOVA_INODE_SIZE);
+}
+
+static inline void nova_memunlock_block(struct super_block *sb, void *bp)
+{
+	if (nova_range_check(sb, bp, sb->s_blocksize))
+		return;
+
+	if (nova_is_protected(sb))
+		__nova_memunlock_range(bp, sb->s_blocksize);
+}
+
+static inline void nova_memlock_block(struct super_block *sb, void *bp)
+{
+	if (nova_is_protected(sb))
+		__nova_memlock_range(bp, sb->s_blocksize);
+}
+
+
+#endif
diff --git a/fs/nova/parity.c b/fs/nova/parity.c
new file mode 100644
index 000000000000..1f2f8b4d6c0e
--- /dev/null
+++ b/fs/nova/parity.c
@@ -0,0 +1,411 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Parity related methods.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include "nova.h"
+
+static int nova_calculate_block_parity(struct super_block *sb, u8 *parity,
+	u8 *block)
+{
+	unsigned int strp, num_strps, i, j;
+	size_t strp_size = NOVA_STRIPE_SIZE;
+	unsigned int strp_shift = NOVA_STRIPE_SHIFT;
+	u64 xor;
+
+	num_strps = sb->s_blocksize >> strp_shift;
+	if (static_cpu_has(X86_FEATURE_XMM2)) { // sse2 128b
+		for (i = 0; i < strp_size; i += 16) {
+			asm volatile("movdqa %0, %%xmm0" : : "m" (block[i]));
+			for (strp = 1; strp < num_strps; strp++) {
+				j = (strp << strp_shift) + i;
+				asm volatile(
+					"movdqa     %0, %%xmm1\n"
+					"pxor   %%xmm1, %%xmm0\n"
+					: : "m" (block[j])
+				);
+			}
+			asm volatile("movntdq %%xmm0, %0" : "=m" (parity[i]));
+		}
+	} else { // common 64b
+		for (i = 0; i < strp_size; i += 8) {
+			xor = *((u64 *) &block[i]);
+			for (strp = 1; strp < num_strps; strp++) {
+				j = (strp << strp_shift) + i;
+				xor ^= *((u64 *) &block[j]);
+			}
+			*((u64 *) &parity[i]) = xor;
+		}
+	}
+
+	return 0;
+}
+
+/* Compute parity for a whole data block and write the parity stripe to nvmm
+ *
+ * The block buffer to compute checksums should reside in dram (more trusted),
+ * not in nvmm (less trusted).
+ *
+ * block:   block buffer with user data and possibly partial head-tail block
+ *          - should be in kernel memory (dram) to avoid page faults
+ * blocknr: destination nvmm block number where the block is written to
+ *          - used to derive the parity stripe address
+ *
+ * If the modified content is less than a stripe size (small writes), it's
+ * possible to re-compute the parity only using the difference of the modified
+ * stripe, without re-computing for the whole block.
+ *
+ * static int nova_update_block_parity(struct super_block *sb,
+ *	struct nova_inode_info_header *sih, void *block, unsigned long blocknr,
+ *	size_t offset, size_t bytes, int zero)
+ *
+ */
+static int nova_update_block_parity(struct super_block *sb, u8 *block,
+	unsigned long blocknr, int zero)
+{
+	size_t strp_size = NOVA_STRIPE_SIZE;
+	void *parity, *nvmmptr;
+	int ret = 0;
+	timing_t block_parity_time;
+
+	NOVA_START_TIMING(block_parity_t, block_parity_time);
+
+	parity = kmalloc(strp_size, GFP_KERNEL);
+	if (parity == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (block == NULL) {
+		nova_dbg("%s: block pointer error\n", __func__);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (unlikely(zero))
+		memset(parity, 0, strp_size);
+	else
+		nova_calculate_block_parity(sb, parity, block);
+
+	nvmmptr = nova_get_parity_addr(sb, blocknr);
+
+	nova_memunlock_range(sb, nvmmptr, strp_size);
+	memcpy_to_pmem_nocache(nvmmptr, parity, strp_size);
+	nova_memlock_range(sb, nvmmptr, strp_size);
+
+	// TODO: The parity stripe is better checksummed for higher reliability.
+out:
+	if (parity != NULL)
+		kfree(parity);
+
+	NOVA_END_TIMING(block_parity_t, block_parity_time);
+
+	return ret;
+}
+
+int nova_update_pgoff_parity(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	unsigned long pgoff, int zero)
+{
+	unsigned long blocknr;
+	void *dax_mem = NULL;
+	u64 blockoff;
+
+	blockoff = nova_find_nvmm_block(sb, sih, entry, pgoff);
+	/* Truncated? */
+	if (blockoff == 0)
+		return 0;
+
+	dax_mem = nova_get_block(sb, blockoff);
+
+	blocknr = nova_get_blocknr(sb, blockoff, sih->i_blk_type);
+	nova_update_block_parity(sb, dax_mem, blocknr, zero);
+
+	return 0;
+}
+
+/* Update block checksums and/or parity.
+ *
+ * Since this computation is on the critical path, unroll by 8 to gain
+ * performance if possible. The unrolling applies only when a block holds
+ * exactly 8 stripes and the whole block is written.
+ */
+#define CSUM0 NOVA_INIT_CSUM
+int nova_update_block_csum_parity(struct super_block *sb,
+	struct nova_inode_info_header *sih, u8 *block, unsigned long blocknr,
+	size_t offset, size_t bytes)
+{
+	unsigned int i, strp_offset, num_strps;
+	size_t csum_size = NOVA_DATA_CSUM_LEN;
+	size_t strp_size = NOVA_STRIPE_SIZE;
+	unsigned int strp_shift = NOVA_STRIPE_SHIFT;
+	unsigned long strp_nr, blockoff, blocksize = sb->s_blocksize;
+	void *nvmmptr, *nvmmptr1;
+	u32 crc[8];
+	u64 qwd[8], *parity = NULL;
+	u64 acc[8] = {CSUM0, CSUM0, CSUM0, CSUM0, CSUM0, CSUM0, CSUM0, CSUM0};
+	bool unroll_csum = false, unroll_parity = false;
+	int ret = 0;
+	timing_t block_csum_parity_time;
+
+	NOVA_STATS_ADD(block_csum_parity, 1);
+
+	blockoff = nova_get_block_off(sb, blocknr, sih->i_blk_type);
+	strp_nr = blockoff >> strp_shift;
+
+	strp_offset = offset & (strp_size - 1);
+	num_strps = ((strp_offset + bytes - 1) >> strp_shift) + 1;
+
+	unroll_parity = (blocksize / strp_size == 8) && (num_strps == 8);
+	unroll_csum = unroll_parity && static_cpu_has(X86_FEATURE_XMM4_2);
+
+	/* unrolled-by-8 implementation */
+	if (unroll_csum || unroll_parity) {
+		NOVA_START_TIMING(block_csum_parity_t, block_csum_parity_time);
+		if (data_parity > 0) {
+			parity = kmalloc(strp_size, GFP_KERNEL);
+			if (parity == NULL) {
+				nova_err(sb, "%s: buffer allocation error\n",
+								__func__);
+				ret = -ENOMEM;
+				NOVA_END_TIMING(block_csum_parity_t,
+						block_csum_parity_time);
+				goto out;
+			}
+		}
+		for (i = 0; i < strp_size / 8; i++) {
+			qwd[0] = *((u64 *) (block));
+			qwd[1] = *((u64 *) (block + 1 * strp_size));
+			qwd[2] = *((u64 *) (block + 2 * strp_size));
+			qwd[3] = *((u64 *) (block + 3 * strp_size));
+			qwd[4] = *((u64 *) (block + 4 * strp_size));
+			qwd[5] = *((u64 *) (block + 5 * strp_size));
+			qwd[6] = *((u64 *) (block + 6 * strp_size));
+			qwd[7] = *((u64 *) (block + 7 * strp_size));
+
+			if (data_csum > 0 && unroll_csum) {
+				nova_crc32c_qword(qwd[0], acc[0]);
+				nova_crc32c_qword(qwd[1], acc[1]);
+				nova_crc32c_qword(qwd[2], acc[2]);
+				nova_crc32c_qword(qwd[3], acc[3]);
+				nova_crc32c_qword(qwd[4], acc[4]);
+				nova_crc32c_qword(qwd[5], acc[5]);
+				nova_crc32c_qword(qwd[6], acc[6]);
+				nova_crc32c_qword(qwd[7], acc[7]);
+			}
+
+			if (data_parity > 0) {
+				parity[i] = qwd[0] ^ qwd[1] ^ qwd[2] ^ qwd[3] ^
+					    qwd[4] ^ qwd[5] ^ qwd[6] ^ qwd[7];
+			}
+
+			block += 8;
+		}
+		if (data_csum > 0 && unroll_csum) {
+			crc[0] = cpu_to_le32((u32) acc[0]);
+			crc[1] = cpu_to_le32((u32) acc[1]);
+			crc[2] = cpu_to_le32((u32) acc[2]);
+			crc[3] = cpu_to_le32((u32) acc[3]);
+			crc[4] = cpu_to_le32((u32) acc[4]);
+			crc[5] = cpu_to_le32((u32) acc[5]);
+			crc[6] = cpu_to_le32((u32) acc[6]);
+			crc[7] = cpu_to_le32((u32) acc[7]);
+
+			nvmmptr = nova_get_data_csum_addr(sb, strp_nr, 0);
+			nvmmptr1 = nova_get_data_csum_addr(sb, strp_nr, 1);
+			nova_memunlock_range(sb, nvmmptr, csum_size * 8);
+			memcpy_to_pmem_nocache(nvmmptr, crc, csum_size * 8);
+			memcpy_to_pmem_nocache(nvmmptr1, crc, csum_size * 8);
+			nova_memlock_range(sb, nvmmptr, csum_size * 8);
+		}
+
+		if (data_parity > 0) {
+			nvmmptr = nova_get_parity_addr(sb, blocknr);
+			nova_memunlock_range(sb, nvmmptr, strp_size);
+			memcpy_to_pmem_nocache(nvmmptr, parity, strp_size);
+			nova_memlock_range(sb, nvmmptr, strp_size);
+		}
+
+		if (parity != NULL)
+			kfree(parity);
+		NOVA_END_TIMING(block_csum_parity_t, block_csum_parity_time);
+	}
+
+	if (data_csum > 0 && !unroll_csum)
+		nova_update_block_csum(sb, sih, block, blocknr,
+					offset, bytes, 0);
+	if (data_parity > 0 && !unroll_parity)
+		nova_update_block_parity(sb, block, blocknr, 0);
+
+out:
+	return ret;
+}
+
+/* Restore a stripe of data.
+ *
+ * When this function is called, the two corresponding checksum copies are also
+ * given. After recovery the restored data stripe is checksum-verified using the
+ * given checksums. If any one matches, data recovery is considered successful
+ * and the restored stripe is written to nvmm to repair the corrupted data.
+ *
+ * If recovery succeeded, the known good checksum is returned by csum_good, and
+ * the caller will also check if any checksum restoration is necessary.
+ */
+int nova_restore_data(struct super_block *sb, unsigned long blocknr,
+	unsigned int badstrip_id, void *badstrip, int nvmmerr, u32 csum0,
+	u32 csum1, u32 *csum_good)
+{
+	unsigned int i, num_strps;
+	size_t strp_size = NOVA_STRIPE_SIZE;
+	unsigned int strp_shift = NOVA_STRIPE_SHIFT;
+	size_t blockoff, offset;
+	u8 *blockptr, *stripptr, *block, *parity, *strip;
+	u32 csum_calc;
+	bool success = false;
+	timing_t restore_time;
+	int ret = 0;
+
+	NOVA_START_TIMING(restore_data_t, restore_time);
+	blockoff = nova_get_block_off(sb, blocknr, NOVA_BLOCK_TYPE_4K);
+	blockptr = nova_get_block(sb, blockoff);
+	stripptr = blockptr + (badstrip_id << strp_shift);
+
+	block = kmalloc(sb->s_blocksize, GFP_KERNEL);
+	strip = kmalloc(strp_size, GFP_KERNEL);
+	if (block == NULL || strip == NULL) {
+		nova_err(sb, "%s: buffer allocation error\n", __func__);
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	parity = nova_get_parity_addr(sb, blocknr);
+	if (parity == NULL) {
+		nova_err(sb, "%s: parity address error\n", __func__);
+		ret = -EIO;
+		goto out;
+	}
+
+	num_strps = sb->s_blocksize >> strp_shift;
+	for (i = 0; i < num_strps; i++) {
+		offset = i << strp_shift;
+		if (i == badstrip_id)
+			/* parity strip has media errors */
+			ret = memcpy_mcsafe(block + offset,
+						parity, strp_size);
+		else
+			/* another data strip has media errors */
+			ret = memcpy_mcsafe(block + offset,
+						blockptr + offset, strp_size);
+		if (ret < 0) {
+			/* media error happens during recovery */
+			nova_err(sb, "%s: unrecoverable media error detected\n",
+					__func__);
+			goto out;
+		}
+	}
+
+	nova_calculate_block_parity(sb, strip, block);
+	for (i = 0; i < strp_size; i++) {
+		/* i indicates the amount of good bytes in badstrip.
+		 * if corruption is contained within one strip, the i = 0 pass
+		 * can restore the strip; otherwise we need to test every i to
+		 * check if there is an unaligned but recoverable corruption,
+		 * i.e. a scribble corrupting two adjacent strips but the
+		 * scribble size is no larger than the strip size.
+		 */
+		memcpy(strip, badstrip, i);
+
+		csum_calc = nova_crc32c(NOVA_INIT_CSUM, strip, strp_size);
+		if (csum_calc == csum0 || csum_calc == csum1) {
+			success = true;
+			break;
+		}
+
+		/* media error, no good bytes in badstrip */
+		if (nvmmerr)
+			break;
+
+		/* corruption of the last strip must be contained within the
+		 * strip; if the corruption goes beyond the block boundary,
+		 * that's not the concern of this recovery call.
+		 */
+		if (badstrip_id == num_strps - 1)
+			break;
+	}
+
+	if (success) {
+		/* recovery success, repair the bad nvmm data */
+		nova_memunlock_range(sb, stripptr, strp_size);
+		memcpy_to_pmem_nocache(stripptr, strip, strp_size);
+		nova_memlock_range(sb, stripptr, strp_size);
+
+		/* return the good checksum */
+		*csum_good = csum_calc;
+	} else {
+		/* unrecoverable data corruption */
+		ret = -EIO;
+	}
+
+out:
+	if (block != NULL)
+		kfree(block);
+	if (strip != NULL)
+		kfree(strip);
+
+	NOVA_END_TIMING(restore_data_t, restore_time);
+	return ret;
+}
+
+int nova_update_truncated_block_parity(struct super_block *sb,
+	struct inode *inode, loff_t newsize)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	unsigned long pgoff, blocknr;
+	unsigned long blocksize = sb->s_blocksize;
+	u64 nvmm;
+	char *nvmm_addr, *block;
+	u8 btype = sih->i_blk_type;
+	int ret = 0;
+
+	pgoff = newsize >> sb->s_blocksize_bits;
+
+	nvmm = nova_find_nvmm_block(sb, sih, NULL, pgoff);
+	if (nvmm == 0)
+		return -EFAULT;
+
+	nvmm_addr = (char *)nova_get_block(sb, nvmm);
+
+	blocknr = nova_get_blocknr(sb, nvmm, btype);
+
+	/* Copy to DRAM to catch MCE. */
+	block = kmalloc(blocksize, GFP_KERNEL);
+	if (block == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (memcpy_mcsafe(block, nvmm_addr, blocksize) < 0) {
+		ret = -EIO;
+		goto out;
+	}
+
+	nova_update_block_parity(sb, block, blocknr, 0);
+out:
+	if (block != NULL)
+		kfree(block);
+	return ret;
+}
+



* [RFC 10/16] NOVA: File data protection
@ 2017-08-03  7:49   ` Steven Swanson
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:49 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson, dan.j.williams

Nova protects data and metadata from corruption due to media errors and
scribbles (software errors in the kernel that may overwrite Nova data).

Replication
-----------

Nova replicates all PMEM metadata structures (with a few exceptions, which are
WIP).  For each structure, there is a primary and an alternate (denoted as
alter in the code).  To ensure that Nova can recover a consistent copy of the
data in case of a failure, Nova first updates the primary and issues a persist
barrier to ensure that the data is written to NVMM.  Then it does the same for
the alternate.
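
A minimal user-space sketch of that update order, under stated assumptions:
struct meta_rec, persist_barrier() and replicated_update() are hypothetical
names used only for illustration, and the barrier is a counting stub standing
in for a real cache-line flush plus fence.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical metadata record; not a real NOVA structure. */
struct meta_rec {
	uint64_t size;
	uint64_t mtime;
};

/* Counts barriers so the ordering can be observed. Assumption: a real
 * implementation would flush the written cache lines and fence here.
 */
static int barriers_issued;

static void persist_barrier(const void *addr, size_t len)
{
	(void)addr;
	(void)len;
	barriers_issued++;
}

/* Update the primary and make it durable, then do the same for the
 * alternate. The two copies are never written at the same time.
 */
static void replicated_update(struct meta_rec *primary, struct meta_rec *alter,
			      const struct meta_rec *new_val)
{
	assert(primary != alter);

	memcpy(primary, new_val, sizeof(*primary));
	persist_barrier(primary, sizeof(*primary));

	memcpy(alter, new_val, sizeof(*alter));
	persist_barrier(alter, sizeof(*alter));
}
```

With this ordering, a crash between the two barriers would leave the primary
holding the new value and the alternate holding the old one, rather than two
half-written copies.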

Detection
---------

Nova uses two techniques to detect data corruption.  For media errors, Nova
should always use memcpy_from_pmem() to read data from PMEM, usually by
copying the PMEM data structure into DRAM.

To detect software-caused corruption, Nova uses CRC32 checksums.  All the PMEM
data structures in Nova include a csum field for this purpose.  Nova also
computes CRC32 checksums for each 512-byte slice of each data page.

The checksums are stored in dedicated pages in each CPU's allocation region.
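
The per-slice checksumming can be sketched in user space as follows.  The code
in this patch calls nova_crc32c() with NOVA_INIT_CSUM, whose value is not shown
here, so the sketch assumes the standard CRC32C parameters (initial value and
final XOR of 0xFFFFFFFF).  The helper names page_stripe_csums() and
find_bad_stripe() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES       4096u
#define STRIPE_BYTES     512u
#define STRIPES_PER_PAGE (PAGE_BYTES / STRIPE_BYTES)

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(const uint8_t *buf, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;
	size_t i;
	int bit;

	assert(buf != NULL || len == 0);
	for (i = 0; i < len; i++) {
		crc ^= buf[i];
		for (bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Compute one checksum per 512-byte stripe of a 4KB page. */
static void page_stripe_csums(const uint8_t *page,
			      uint32_t csums[STRIPES_PER_PAGE])
{
	size_t s;

	for (s = 0; s < STRIPES_PER_PAGE; s++)
		csums[s] = crc32c(page + s * STRIPE_BYTES, STRIPE_BYTES);
}

/* Return the index of the first stripe whose checksum does not match the
 * stored value, or -1 if every stripe verifies.
 */
static int find_bad_stripe(const uint8_t *page,
			   const uint32_t csums[STRIPES_PER_PAGE])
{
	size_t s;

	for (s = 0; s < STRIPES_PER_PAGE; s++)
		if (crc32c(page + s * STRIPE_BYTES, STRIPE_BYTES) != csums[s])
			return (int)s;
	return -1;
}
```

Checksumming at stripe granularity means a mismatch identifies which 512-byte
slice is bad, which is the unit the parity-based recovery operates on.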

                                                          replica
                                                 parity   parity
                                                 page     page
            +---+---+---+---+---+---+---+---+    +---+    +---+
data page 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |    | 0 |    | 0 |
            +---+---+---+---+---+---+---+---+    +---+    +---+
data page 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |    | 1 |    | 1 |
            +---+---+---+---+---+---+---+---+    +---+    +---+
data page 2 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |    | 0 |    | 0 |
            +---+---+---+---+---+---+---+---+    +---+    +---+
data page 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |    | 0 |    | 0 |
            +---+---+---+---+---+---+---+---+    +---+    +---+
    ...                    ...                    ...      ...

Recovery
--------

Nova uses replication to support recovery of metadata structures and
RAID4-style parity to recover corrupted data.

If Nova detects corruption of a metadata structure, it restores the structure
using the replica.
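
The decision logic can be sketched as below.  It loosely follows the shape of
nova_verify_entry_csum() in this patch, but struct rec, toy_csum() and
verify_and_repair() are hypothetical, and the checksum is a simple
multiplicative hash standing in for CRC32C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical replicated record with a trailing checksum field. */
struct rec {
	uint32_t payload[3];
	uint32_t csum;
};

enum repair_result {
	REPAIR_NONE,	/* both copies good and identical */
	REPAIR_PRIMARY,	/* primary was bad, restored from the alternate */
	REPAIR_ALTER,	/* alternate was bad or stale, restored from primary */
	REPAIR_FAILED	/* both copies fail verification */
};

/* Stand-in checksum over everything except the csum field. */
static uint32_t toy_csum(const struct rec *r)
{
	uint32_t sum = 17;
	size_t i;

	for (i = 0; i < 3; i++)
		sum = sum * 31u + r->payload[i];
	return sum;
}

static enum repair_result verify_and_repair(struct rec *primary,
					    struct rec *alter)
{
	int primary_bad, alter_bad;

	assert(primary != NULL && alter != NULL);
	primary_bad = toy_csum(primary) != primary->csum;
	alter_bad = toy_csum(alter) != alter->csum;

	if (primary_bad && alter_bad)
		return REPAIR_FAILED;
	if (primary_bad) {
		*primary = *alter;
		return REPAIR_PRIMARY;
	}
	/* If both verify but differ, the primary is trusted. */
	if (alter_bad || memcmp(primary, alter, sizeof(*primary)) != 0) {
		*alter = *primary;
		return REPAIR_ALTER;
	}
	return REPAIR_NONE;
}
```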

If it detects a corrupt slice of a data page, it uses RAID4-style recovery to
restore it.  The CRC32 checksums for the page slices are replicated.
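
A minimal sketch of the parity arithmetic, assuming a 4KB page split into eight
512-byte stripes with one XOR parity stripe per page.  compute_parity() and
rebuild_stripe() are hypothetical names, not functions from this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STRIPE_BYTES     512
#define STRIPES_PER_PAGE 8

/* Parity stripe = XOR of all data stripes in the page. */
static void compute_parity(const uint8_t *page, uint8_t *parity)
{
	size_t s, i;

	memset(parity, 0, STRIPE_BYTES);
	for (s = 0; s < STRIPES_PER_PAGE; s++)
		for (i = 0; i < STRIPE_BYTES; i++)
			parity[i] ^= page[s * STRIPE_BYTES + i];
}

/* Rebuild stripe 'bad' in place: start from the parity stripe and XOR in
 * every other (good) data stripe.
 */
static void rebuild_stripe(uint8_t *page, const uint8_t *parity, size_t bad)
{
	uint8_t *dst;
	size_t s, i;

	assert(bad < STRIPES_PER_PAGE);
	dst = page + bad * STRIPE_BYTES;
	memcpy(dst, parity, STRIPE_BYTES);
	for (s = 0; s < STRIPES_PER_PAGE; s++) {
		if (s == bad)
			continue;
		for (i = 0; i < STRIPE_BYTES; i++)
			dst[i] ^= page[s * STRIPE_BYTES + i];
	}
}
```

Parity alone can rebuild only one stripe per page and cannot tell whether the
result is correct, which is why the rebuilt stripe would still be compared
against its stored checksum before being written back.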

Cautious allocation
-------------------

To maximize its resilience to software scribbles, Nova allocates metadata
structures and their replicas far from one another.  It tries to allocate the
primary copy at a low address and the replica at a high address within the PMEM
region.
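
As an illustration only, the low/high placement can be modeled with a two-ended
bump allocator.  The allocator itself is not part of this patch, so struct
region and alloc_pair() are hypothetical, and a real allocator would also
handle freeing and per-CPU regions.

```c
#include <assert.h>
#include <stdint.h>

/* Free space is the half-open byte range [low_next, high_next). */
struct region {
	uint64_t low_next;
	uint64_t high_next;
};

/* Place the primary at the lowest free address and the replica at the
 * highest, so one contiguous scribble is unlikely to hit both copies.
 * Returns 0 on success, -1 if the region cannot hold both copies.
 */
static int alloc_pair(struct region *r, uint64_t size,
		      uint64_t *primary, uint64_t *replica)
{
	assert(size > 0 && r->low_next <= r->high_next);

	if (size > (r->high_next - r->low_next) / 2)
		return -1;

	*primary = r->low_next;
	r->low_next += size;

	r->high_next -= size;
	*replica = r->high_next;
	return 0;
}
```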

Write Protection
----------------

Finally, Nova can prevent unintended writes to PMEM by mapping the entire PMEM
device as read-only and then disabling _all_ write protection by clearing the
WP bit in the CR0 control register when Nova needs to perform a write.  The
wprotect mount-time option controls this behavior.
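
CR0 can only be changed from kernel mode, so the following is a user-space
model of the unlock/write/lock bracket rather than the real mechanism.  A
boolean stands in for the hardware write-protect state, and struct pmem_model,
model_store() and protected_update() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model: a flag stands in for the hardware write-protect state. */
struct pmem_model {
	bool write_enabled;
	uint8_t bytes[64];
};

/* A store succeeds only while the write window is open; the -1 return
 * models the fault a stray write would take on a read-only mapping.
 */
static int model_store(struct pmem_model *m, size_t off,
		       const void *src, size_t len)
{
	assert(off <= sizeof(m->bytes) && len <= sizeof(m->bytes) - off);
	if (!m->write_enabled)
		return -1;
	memcpy(m->bytes + off, src, len);
	return 0;
}

/* Open the window, perform the intended write, close the window again. */
static int protected_update(struct pmem_model *m, size_t off,
			    const void *src, size_t len)
{
	int ret;

	m->write_enabled = true;	/* models clearing CR0.WP */
	ret = model_store(m, off, src, len);
	m->write_enabled = false;	/* models setting CR0.WP again */
	return ret;
}
```

The shorter the window stays open, the less chance a stray write from
elsewhere in the kernel has of landing in PMEM while protection is off.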

To map the PMEM device as read-only, we have added a readonly module command
line option to nd_pmem.  There is probably a better approach to achieving this
goal.

The changes to nd_pmem are included in a later patch in this series.

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 fs/nova/checksum.c |  912 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/mprotect.c |  604 ++++++++++++++++++++++++++++++++++
 fs/nova/mprotect.h |  190 +++++++++++
 fs/nova/parity.c   |  411 +++++++++++++++++++++++
 4 files changed, 2117 insertions(+)
 create mode 100644 fs/nova/checksum.c
 create mode 100644 fs/nova/mprotect.c
 create mode 100644 fs/nova/mprotect.h
 create mode 100644 fs/nova/parity.c

diff --git a/fs/nova/checksum.c b/fs/nova/checksum.c
new file mode 100644
index 000000000000..092164a80d40
--- /dev/null
+++ b/fs/nova/checksum.c
@@ -0,0 +1,912 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Checksum related methods.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include "nova.h"
+#include "inode.h"
+
+static int nova_get_entry_copy(struct super_block *sb, void *entry,
+	u32 *entry_csum, size_t *entry_size, void *entry_copy)
+{
+	u8 type;
+	struct nova_dentry *dentry;
+	int ret = 0;
+
+	ret = memcpy_mcsafe(&type, entry, sizeof(u8));
+	if (ret < 0)
+		return ret;
+
+	switch (type) {
+	case DIR_LOG:
+		dentry = DENTRY(entry_copy);
+		ret = memcpy_mcsafe(dentry, entry, NOVA_DENTRY_HEADER_LEN);
+		if (ret < 0 || dentry->de_len > NOVA_MAX_ENTRY_LEN)
+			break;
+		*entry_size = dentry->de_len;
+		ret = memcpy_mcsafe((u8 *) dentry + NOVA_DENTRY_HEADER_LEN,
+					(u8 *) entry + NOVA_DENTRY_HEADER_LEN,
+					*entry_size - NOVA_DENTRY_HEADER_LEN);
+		if (ret < 0)
+			break;
+		*entry_csum = dentry->csum;
+		break;
+	case FILE_WRITE:
+		*entry_size = sizeof(struct nova_file_write_entry);
+		ret = memcpy_mcsafe(entry_copy, entry, *entry_size);
+		if (ret < 0)
+			break;
+		*entry_csum = WENTRY(entry_copy)->csum;
+		break;
+	case SET_ATTR:
+		*entry_size = sizeof(struct nova_setattr_logentry);
+		ret = memcpy_mcsafe(entry_copy, entry, *entry_size);
+		if (ret < 0)
+			break;
+		*entry_csum = SENTRY(entry_copy)->csum;
+		break;
+	case LINK_CHANGE:
+		*entry_size = sizeof(struct nova_link_change_entry);
+		ret = memcpy_mcsafe(entry_copy, entry, *entry_size);
+		if (ret < 0)
+			break;
+		*entry_csum = LCENTRY(entry_copy)->csum;
+		break;
+	case MMAP_WRITE:
+		*entry_size = sizeof(struct nova_mmap_entry);
+		ret = memcpy_mcsafe(entry_copy, entry, *entry_size);
+		if (ret < 0)
+			break;
+		*entry_csum = MMENTRY(entry_copy)->csum;
+		break;
+	case SNAPSHOT_INFO:
+		*entry_size = sizeof(struct nova_snapshot_info_entry);
+		ret = memcpy_mcsafe(entry_copy, entry, *entry_size);
+		if (ret < 0)
+			break;
+		*entry_csum = SNENTRY(entry_copy)->csum;
+		break;
+	default:
+		*entry_csum = 0;
+		*entry_size = 0;
+		nova_dbg("%s: unknown or unsupported entry type (%d) for checksum, 0x%llx\n",
+			 __func__, type, (u64)entry);
+		ret = -EINVAL;
+		dump_stack();
+		break;
+	}
+
+	return ret;
+}
+
+/* Calculate the entry checksum. */
+static u32 nova_calc_entry_csum(void *entry)
+{
+	u8 type;
+	u32 csum = 0;
+	size_t entry_len, check_len;
+	void *csum_addr, *remain;
+	timing_t calc_time;
+
+	NOVA_START_TIMING(calc_entry_csum_t, calc_time);
+
+	/* Entry is checksummed excluding its csum field. */
+	type = nova_get_entry_type(entry);
+	switch (type) {
+	/* nova_dentry has variable length due to its name. */
+	case DIR_LOG:
+		entry_len =  DENTRY(entry)->de_len;
+		csum_addr = &DENTRY(entry)->csum;
+		break;
+	case FILE_WRITE:
+		entry_len = sizeof(struct nova_file_write_entry);
+		csum_addr = &WENTRY(entry)->csum;
+		break;
+	case SET_ATTR:
+		entry_len = sizeof(struct nova_setattr_logentry);
+		csum_addr = &SENTRY(entry)->csum;
+		break;
+	case LINK_CHANGE:
+		entry_len = sizeof(struct nova_link_change_entry);
+		csum_addr = &LCENTRY(entry)->csum;
+		break;
+	case MMAP_WRITE:
+		entry_len = sizeof(struct nova_mmap_entry);
+		csum_addr = &MMENTRY(entry)->csum;
+		break;
+	case SNAPSHOT_INFO:
+		entry_len = sizeof(struct nova_snapshot_info_entry);
+		csum_addr = &SNENTRY(entry)->csum;
+		break;
+	default:
+		entry_len = 0;
+		csum_addr = NULL;
+		nova_dbg("%s: unknown or unsupported entry type (%d) for checksum, 0x%llx\n",
+			 __func__, type, (u64) entry);
+		break;
+	}
+
+	if (entry_len > 0) {
+		check_len = ((u8 *) csum_addr) - ((u8 *) entry);
+		csum = nova_crc32c(NOVA_INIT_CSUM, entry, check_len);
+		check_len = entry_len - (check_len + NOVA_META_CSUM_LEN);
+		if (check_len > 0) {
+			remain = ((u8 *) csum_addr) + NOVA_META_CSUM_LEN;
+			csum = nova_crc32c(csum, remain, check_len);
+		}
+
+		if (check_len < 0) {
+			nova_dbg("%s: checksum run-length error %ld < 0",
+				__func__, check_len);
+		}
+	}
+
+	NOVA_END_TIMING(calc_entry_csum_t, calc_time);
+	return csum;
+}
+
+/* Update the log entry checksum. */
+void nova_update_entry_csum(void *entry)
+{
+	u8  type;
+	u32 csum;
+	size_t entry_len = CACHELINE_SIZE;
+
+	if (metadata_csum == 0)
+		goto flush;
+
+	type = nova_get_entry_type(entry);
+	csum = nova_calc_entry_csum(entry);
+
+	switch (type) {
+	case DIR_LOG:
+		DENTRY(entry)->csum = cpu_to_le32(csum);
+		entry_len = DENTRY(entry)->de_len;
+		break;
+	case FILE_WRITE:
+		WENTRY(entry)->csum = cpu_to_le32(csum);
+		entry_len = sizeof(struct nova_file_write_entry);
+		break;
+	case SET_ATTR:
+		SENTRY(entry)->csum = cpu_to_le32(csum);
+		entry_len = sizeof(struct nova_setattr_logentry);
+		break;
+	case LINK_CHANGE:
+		LCENTRY(entry)->csum = cpu_to_le32(csum);
+		entry_len = sizeof(struct nova_link_change_entry);
+		break;
+	case MMAP_WRITE:
+		MMENTRY(entry)->csum = cpu_to_le32(csum);
+		entry_len = sizeof(struct nova_mmap_entry);
+		break;
+	case SNAPSHOT_INFO:
+		SNENTRY(entry)->csum = cpu_to_le32(csum);
+		entry_len = sizeof(struct nova_snapshot_info_entry);
+		break;
+	default:
+		entry_len = 0;
+		nova_dbg("%s: unknown or unsupported entry type (%d), 0x%llx\n",
+			__func__, type, (u64) entry);
+		break;
+	}
+
+flush:
+	if (entry_len > 0)
+		nova_flush_buffer(entry, entry_len, 0);
+
+}
+
+int nova_update_alter_entry(struct super_block *sb, void *entry)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	void *alter_entry;
+	u64 curr, alter_curr;
+	u32 entry_csum;
+	size_t size;
+	char entry_copy[NOVA_MAX_ENTRY_LEN];
+	int ret;
+
+	if (metadata_csum == 0)
+		return 0;
+
+	curr = nova_get_addr_off(sbi, entry);
+	alter_curr = alter_log_entry(sb, curr);
+	if (alter_curr == 0) {
+		nova_err(sb, "%s: log page tail error detected\n", __func__);
+		return -EIO;
+	}
+	alter_entry = (void *)nova_get_block(sb, alter_curr);
+
+	ret = nova_get_entry_copy(sb, entry, &entry_csum, &size, entry_copy);
+	if (ret)
+		return ret;
+
+	ret = memcpy_to_pmem_nocache(alter_entry, entry_copy, size);
+	return ret;
+}
+
+/* media error: repair the poison radius that the entry belongs to */
+static int nova_repair_entry_pr(struct super_block *sb, void *entry)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	int ret;
+	u64 entry_off, alter_off;
+	void *entry_pr, *alter_pr;
+
+	entry_off = nova_get_addr_off(sbi, entry);
+	alter_off = alter_log_entry(sb, entry_off);
+	if (alter_off == 0) {
+		nova_err(sb, "%s: log page tail error detected\n", __func__);
+		goto fail;
+	}
+
+	entry_pr = (void *) nova_get_block(sb, entry_off & POISON_MASK);
+	alter_pr = (void *) nova_get_block(sb, alter_off & POISON_MASK);
+
+	if (entry_pr == NULL || alter_pr == NULL)
+		BUG();
+
+	nova_memunlock_range(sb, entry_pr, POISON_RADIUS);
+	ret = memcpy_mcsafe(entry_pr, alter_pr, POISON_RADIUS);
+	nova_memlock_range(sb, entry_pr, POISON_RADIUS);
+	nova_flush_buffer(entry_pr, POISON_RADIUS, 0);
+
+	/* alter_entry shows media error during memcpy */
+	if (ret < 0)
+		goto fail;
+
+	nova_dbg("%s: entry media error repaired\n", __func__);
+	return 0;
+
+fail:
+	nova_err(sb, "%s: unrecoverable media error detected\n", __func__);
+	return -1;
+}
+
+static int nova_repair_entry(struct super_block *sb, void *bad, void *good,
+	size_t entry_size)
+{
+	int ret;
+
+	nova_memunlock_range(sb, bad, entry_size);
+	ret = memcpy_to_pmem_nocache(bad, good, entry_size);
+	nova_memlock_range(sb, bad, entry_size);
+
+	if (ret == 0)
+		nova_dbg("%s: entry error repaired\n", __func__);
+
+	return ret;
+}
+
+/* Verify the log entry checksum and get a copy in DRAM. */
+bool nova_verify_entry_csum(struct super_block *sb, void *entry, void *entryc)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	int ret = 0;
+	u64 entry_off, alter_off;
+	void *alter;
+	size_t entry_size, alter_size;
+	u32 entry_csum, alter_csum;
+	u32 entry_csum_calc, alter_csum_calc;
+	char entry_copy[NOVA_MAX_ENTRY_LEN];
+	char alter_copy[NOVA_MAX_ENTRY_LEN];
+	timing_t verify_time;
+
+	if (metadata_csum == 0)
+		return true;
+
+	NOVA_START_TIMING(verify_entry_csum_t, verify_time);
+
+	ret = nova_get_entry_copy(sb, entry, &entry_csum, &entry_size,
+				  entry_copy);
+	if (ret < 0) { /* media error */
+		ret = nova_repair_entry_pr(sb, entry);
+		if (ret < 0)
+			goto fail;
+		/* try again */
+		ret = nova_get_entry_copy(sb, entry, &entry_csum, &entry_size,
+						entry_copy);
+		if (ret < 0)
+			goto fail;
+	}
+
+	entry_off = nova_get_addr_off(sbi, entry);
+	alter_off = alter_log_entry(sb, entry_off);
+	if (alter_off == 0) {
+		nova_err(sb, "%s: log page tail error detected\n", __func__);
+		goto fail;
+	}
+
+	alter = (void *) nova_get_block(sb, alter_off);
+	ret = nova_get_entry_copy(sb, alter, &alter_csum, &alter_size,
+					alter_copy);
+	if (ret < 0) { /* media error */
+		ret = nova_repair_entry_pr(sb, alter);
+		if (ret < 0)
+			goto fail;
+		/* try again */
+		ret = nova_get_entry_copy(sb, alter, &alter_csum, &alter_size,
+						alter_copy);
+		if (ret < 0)
+			goto fail;
+	}
+
+	/* no media errors, now verify the checksums */
+	entry_csum = le32_to_cpu(entry_csum);
+	alter_csum = le32_to_cpu(alter_csum);
+	entry_csum_calc = nova_calc_entry_csum(entry_copy);
+	alter_csum_calc = nova_calc_entry_csum(alter_copy);
+
+	if (entry_csum != entry_csum_calc && alter_csum != alter_csum_calc) {
+		nova_err(sb, "%s: both entry and its replica fail checksum verification\n",
+			 __func__);
+		goto fail;
+	} else if (entry_csum != entry_csum_calc) {
+		nova_dbg("%s: entry %p checksum error, trying to repair using the replica\n",
+			 __func__, entry);
+		ret = nova_repair_entry(sb, entry, alter_copy, alter_size);
+		if (ret != 0)
+			goto fail;
+
+		memcpy(entryc, alter_copy, alter_size);
+	} else if (alter_csum != alter_csum_calc) {
+		nova_dbg("%s: entry replica %p checksum error, trying to repair using the primary\n",
+			 __func__, alter);
+		ret = nova_repair_entry(sb, alter, entry_copy, entry_size);
+		if (ret != 0)
+			goto fail;
+
+		memcpy(entryc, entry_copy, entry_size);
+	} else {
+		/* now both entries pass checksum verification and the primary
+		 * is trusted if their buffers don't match
+		 */
+		if (memcmp(entry_copy, alter_copy, entry_size)) {
+			nova_dbg("%s: entry replica %p error, trying to repair using the primary\n",
+				 __func__, alter);
+			ret = nova_repair_entry(sb, alter, entry_copy,
+						entry_size);
+			if (ret != 0)
+				goto fail;
+		}
+
+		memcpy(entryc, entry_copy, entry_size);
+	}
+
+	NOVA_END_TIMING(verify_entry_csum_t, verify_time);
+	return true;
+
+fail:
+	nova_err(sb, "%s: unable to repair entry errors\n", __func__);
+
+	NOVA_END_TIMING(verify_entry_csum_t, verify_time);
+	return false;
+}
+
+/* media error: repair the poison radius that the inode belongs to */
+static int nova_repair_inode_pr(struct super_block *sb,
+	struct nova_inode *bad_pi, struct nova_inode *good_pi)
+{
+	int ret;
+	void *bad_pr, *good_pr;
+
+	bad_pr = (void *)((u64) bad_pi & POISON_MASK);
+	good_pr = (void *)((u64) good_pi & POISON_MASK);
+
+	if (bad_pr == NULL || good_pr == NULL)
+		BUG();
+
+	nova_memunlock_range(sb, bad_pr, POISON_RADIUS);
+	ret = memcpy_mcsafe(bad_pr, good_pr, POISON_RADIUS);
+	nova_memlock_range(sb, bad_pr, POISON_RADIUS);
+	nova_flush_buffer(bad_pr, POISON_RADIUS, 0);
+
+	/* good_pi shows media error during memcpy */
+	if (ret < 0)
+		goto fail;
+
+	nova_dbg("%s: inode media error repaired\n", __func__);
+	return 0;
+
+fail:
+	nova_err(sb, "%s: unrecoverable media error detected\n", __func__);
+	return -1;
+}
+
+static int nova_repair_inode(struct super_block *sb, struct nova_inode *bad_pi,
+	struct nova_inode *good_copy)
+{
+	int ret;
+
+	nova_memunlock_inode(sb, bad_pi);
+	ret = memcpy_to_pmem_nocache(bad_pi, good_copy,
+					sizeof(struct nova_inode));
+	nova_memlock_inode(sb, bad_pi);
+
+	if (ret == 0)
+		nova_dbg("%s: inode %llu error repaired\n", __func__,
+					good_copy->nova_ino);
+
+	return ret;
+}
+
+/*
+ * Check nova_inode and get a copy in DRAM.
+ * If we are going to update (write) the inode, we don't need to check the
+ * alter inode if the major inode checks ok. If we are going to read or rebuild
+ * the inode, also check the alter even if the major inode checks ok.
+ */
+int nova_check_inode_integrity(struct super_block *sb, u64 ino, u64 pi_addr,
+	u64 alter_pi_addr, struct nova_inode *pic, int check_replica)
+{
+	struct nova_inode *pi, *alter_pi, alter_copy, *alter_pic;
+	int inode_bad, alter_bad;
+	int ret;
+
+	pi = (struct nova_inode *)nova_get_block(sb, pi_addr);
+
+	ret = memcpy_mcsafe(pic, pi, sizeof(struct nova_inode));
+
+	if (metadata_csum == 0)
+		return ret;
+
+	alter_pi = (struct nova_inode *)nova_get_block(sb, alter_pi_addr);
+
+	if (ret < 0) { /* media error */
+		ret = nova_repair_inode_pr(sb, pi, alter_pi);
+		if (ret < 0)
+			goto fail;
+		/* try again */
+		ret = memcpy_mcsafe(pic, pi, sizeof(struct nova_inode));
+		if (ret < 0)
+			goto fail;
+	}
+
+	inode_bad = nova_check_inode_checksum(pic);
+
+	if (!inode_bad && !check_replica)
+		return 0;
+
+	alter_pic = &alter_copy;
+	ret = memcpy_mcsafe(alter_pic, alter_pi, sizeof(struct nova_inode));
+	if (ret < 0) { /* media error */
+		if (inode_bad)
+			goto fail;
+		ret = nova_repair_inode_pr(sb, alter_pi, pi);
+		if (ret < 0)
+			goto fail;
+		/* try again */
+		ret = memcpy_mcsafe(alter_pic, alter_pi,
+					sizeof(struct nova_inode));
+		if (ret < 0)
+			goto fail;
+	}
+
+	alter_bad = nova_check_inode_checksum(alter_pic);
+
+	if (inode_bad && alter_bad) {
+		nova_err(sb, "%s: both inode and its replica fail checksum verification\n",
+			 __func__);
+		goto fail;
+	} else if (inode_bad) {
+		nova_dbg("%s: inode %llu checksum error, trying to repair using the replica\n",
+			 __func__, ino);
+		ret = nova_repair_inode(sb, pi, alter_pic);
+		if (ret != 0)
+			goto fail;
+
+		memcpy(pic, alter_pic, sizeof(struct nova_inode));
+	} else if (alter_bad) {
+		nova_dbg("%s: inode replica %llu checksum error, trying to repair using the primary\n",
+			 __func__, ino);
+		ret = nova_repair_inode(sb, alter_pi, pic);
+		if (ret != 0)
+			goto fail;
+	} else if (memcmp(pic, alter_pic, sizeof(struct nova_inode))) {
+		nova_dbg("%s: inode replica %llu is stale, trying to repair using the primary\n",
+			 __func__, ino);
+		ret = nova_repair_inode(sb, alter_pi, pic);
+		if (ret != 0)
+			goto fail;
+	}
+
+	return 0;
+
+fail:
+	nova_err(sb, "%s: unable to repair inode errors\n", __func__);
+
+	return -EIO;
+}
+
+static int nova_update_stripe_csum(struct super_block *sb, unsigned long strps,
+	unsigned long strp_nr, u8 *strp_ptr, int zero)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	size_t strp_size = NOVA_STRIPE_SIZE;
+	unsigned long strp;
+	u32 csum;
+	u32 crc[8];
+	void *csum_addr, *csum_addr1;
+	void *src_addr;
+
+	while (strps >= 8) {
+		if (zero) {
+			src_addr = sbi->zero_csum;
+			goto copy;
+		}
+
+		crc[0] = cpu_to_le32(nova_crc32c(NOVA_INIT_CSUM,
+				strp_ptr, strp_size));
+		crc[1] = cpu_to_le32(nova_crc32c(NOVA_INIT_CSUM,
+				strp_ptr + strp_size, strp_size));
+		crc[2] = cpu_to_le32(nova_crc32c(NOVA_INIT_CSUM,
+				strp_ptr + strp_size * 2, strp_size));
+		crc[3] = cpu_to_le32(nova_crc32c(NOVA_INIT_CSUM,
+				strp_ptr + strp_size * 3, strp_size));
+		crc[4] = cpu_to_le32(nova_crc32c(NOVA_INIT_CSUM,
+				strp_ptr + strp_size * 4, strp_size));
+		crc[5] = cpu_to_le32(nova_crc32c(NOVA_INIT_CSUM,
+				strp_ptr + strp_size * 5, strp_size));
+		crc[6] = cpu_to_le32(nova_crc32c(NOVA_INIT_CSUM,
+				strp_ptr + strp_size * 6, strp_size));
+		crc[7] = cpu_to_le32(nova_crc32c(NOVA_INIT_CSUM,
+				strp_ptr + strp_size * 7, strp_size));
+
+		src_addr = crc;
+copy:
+		csum_addr = nova_get_data_csum_addr(sb, strp_nr, 0);
+		csum_addr1 = nova_get_data_csum_addr(sb, strp_nr, 1);
+
+		nova_memunlock_range(sb, csum_addr, NOVA_DATA_CSUM_LEN * 8);
+		if (support_clwb) {
+			memcpy(csum_addr, src_addr, NOVA_DATA_CSUM_LEN * 8);
+			memcpy(csum_addr1, src_addr, NOVA_DATA_CSUM_LEN * 8);
+		} else {
+			memcpy_to_pmem_nocache(csum_addr, src_addr,
+						NOVA_DATA_CSUM_LEN * 8);
+			memcpy_to_pmem_nocache(csum_addr1, src_addr,
+						NOVA_DATA_CSUM_LEN * 8);
+		}
+		nova_memlock_range(sb, csum_addr, NOVA_DATA_CSUM_LEN * 8);
+		if (support_clwb) {
+			nova_flush_buffer(csum_addr,
+					  NOVA_DATA_CSUM_LEN * 8, 0);
+			nova_flush_buffer(csum_addr1,
+					  NOVA_DATA_CSUM_LEN * 8, 0);
+		}
+
+		strp_nr += 8;
+		strps -= 8;
+		if (!zero)
+			strp_ptr += strp_size * 8;
+	}
+
+	for (strp = 0; strp < strps; strp++) {
+		if (zero)
+			csum = sbi->zero_csum[0];
+		else
+			csum = nova_crc32c(NOVA_INIT_CSUM, strp_ptr, strp_size);
+
+		csum = cpu_to_le32(csum);
+		csum_addr = nova_get_data_csum_addr(sb, strp_nr, 0);
+		csum_addr1 = nova_get_data_csum_addr(sb, strp_nr, 1);
+
+		nova_memunlock_range(sb, csum_addr, NOVA_DATA_CSUM_LEN);
+		memcpy_to_pmem_nocache(csum_addr, &csum, NOVA_DATA_CSUM_LEN);
+		memcpy_to_pmem_nocache(csum_addr1, &csum, NOVA_DATA_CSUM_LEN);
+		nova_memlock_range(sb, csum_addr, NOVA_DATA_CSUM_LEN);
+
+		strp_nr += 1;
+		if (!zero)
+			strp_ptr += strp_size;
+	}
+
+	return 0;
+}
+
+/* Checksums a sequence of contiguous file write data stripes within one block
+ * and writes the checksum values to nvmm.
+ *
+ * The block buffer to compute checksums should reside in dram (more trusted),
+ * not in nvmm (less trusted).
+ *
+ * Checksum is calculated over a whole stripe.
+ *
+ * block:   block buffer with user data and possibly partial head-tail block
+ *          - should be in kernel memory (dram) to avoid page faults
+ * blocknr: destination nvmm block number where the block is written to
+ *          - used to derive checksum value addresses
+ * offset:  byte offset of user data in the block buffer
+ * bytes:   number of user data bytes in the block buffer
+ * zero:    if the user data is all zero
+ */
+int nova_update_block_csum(struct super_block *sb,
+	struct nova_inode_info_header *sih, u8 *block, unsigned long blocknr,
+	size_t offset, size_t bytes, int zero)
+{
+	u8 *strp_ptr;
+	size_t blockoff;
+	unsigned int strp_shift = NOVA_STRIPE_SHIFT;
+	unsigned int strp_index, strp_offset;
+	unsigned long strps, strp_nr;
+	timing_t block_csum_time;
+
+	NOVA_START_TIMING(block_csum_t, block_csum_time);
+	blockoff = nova_get_block_off(sb, blocknr, sih->i_blk_type);
+
+	/* strp_index: stripe index within the block buffer
+	 * strp_offset: stripe offset within the block buffer
+	 *
+	 * strps: number of stripes touched by user data (need new checksums)
+	 * strp_nr: global stripe number converted from blocknr and offset
+	 * strp_ptr: pointer to stripes in the block buffer
+	 */
+	strp_index = offset >> strp_shift;
+	strp_offset = offset - (strp_index << strp_shift);
+
+	strps = ((strp_offset + bytes - 1) >> strp_shift) + 1;
+	strp_nr = (blockoff + offset) >> strp_shift;
+	strp_ptr = block + (strp_index << strp_shift);
+
+	nova_update_stripe_csum(sb, strps, strp_nr, strp_ptr, zero);
+
+	NOVA_END_TIMING(block_csum_t, block_csum_time);
+
+	return 0;
+}
+
+int nova_update_pgoff_csum(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	unsigned long pgoff, int zero)
+{
+	void *dax_mem = NULL;
+	u64 blockoff;
+	size_t strp_size = NOVA_STRIPE_SIZE;
+	unsigned int strp_shift = NOVA_STRIPE_SHIFT;
+	unsigned long strp_nr;
+	int count;
+
+	count = blk_type_to_size[sih->i_blk_type] / strp_size;
+
+	blockoff = nova_find_nvmm_block(sb, sih, entry, pgoff);
+
+	/* Truncated? */
+	if (blockoff == 0)
+		return 0;
+
+	dax_mem = nova_get_block(sb, blockoff);
+
+	strp_nr = blockoff >> strp_shift;
+
+	nova_update_stripe_csum(sb, count, strp_nr, dax_mem, zero);
+
+	return 0;
+}
+
+/* Verify checksums of requested data bytes starting from offset of blocknr.
+ *
+ * Only a whole stripe can be checksum verified.
+ *
+ * blocknr: container blocknr for the first stripe to be verified
+ * offset:  byte offset within the block associated with blocknr
+ * bytes:   number of contiguous bytes to be verified starting from offset
+ *
+ * return: true or false
+ */
+bool nova_verify_data_csum(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr,
+	size_t offset, size_t bytes)
+{
+	void *blockptr, *strp_ptr;
+	size_t blockoff, blocksize = nova_inode_blk_size(sih);
+	size_t strp_size = NOVA_STRIPE_SIZE;
+	unsigned int strp_shift = NOVA_STRIPE_SHIFT;
+	unsigned int strp_index;
+	unsigned long strp, strps, strp_nr;
+	void *strip = NULL;
+	u32 csum_calc, csum_nvmm0, csum_nvmm1;
+	u32 *csum_addr0, *csum_addr1;
+	int error;
+	bool match;
+	timing_t verify_time;
+
+	NOVA_START_TIMING(verify_data_csum_t, verify_time);
+
+	/* Only a whole stripe can be checksum verified.
+	 * strps: # of stripes to be checked since offset.
+	 */
+	strps = ((offset + bytes - 1) >> strp_shift)
+		- (offset >> strp_shift) + 1;
+
+	blockoff = nova_get_block_off(sb, blocknr, sih->i_blk_type);
+	blockptr = nova_get_block(sb, blockoff);
+
+	/* strp_nr: global stripe number converted from blocknr and offset
+	 * strp_ptr: virtual address of the 1st stripe
+	 * strp_index: stripe index within a block
+	 */
+	strp_nr = (blockoff + offset) >> strp_shift;
+	strp_index = offset >> strp_shift;
+	strp_ptr = blockptr + (strp_index << strp_shift);
+
+	strip = kmalloc(strp_size, GFP_KERNEL);
+	if (strip == NULL)
+		return false;
+
+	match = true;
+	for (strp = 0; strp < strps; strp++) {
+		csum_addr0 = nova_get_data_csum_addr(sb, strp_nr, 0);
+		csum_nvmm0 = le32_to_cpu(*csum_addr0);
+
+		csum_addr1 = nova_get_data_csum_addr(sb, strp_nr, 1);
+		csum_nvmm1 = le32_to_cpu(*csum_addr1);
+
+		error = memcpy_mcsafe(strip, strp_ptr, strp_size);
+		if (error < 0) {
+			nova_dbg("%s: media error in data strip detected!\n",
+				__func__);
+			match = false;
+		} else {
+			csum_calc = nova_crc32c(NOVA_INIT_CSUM, strip,
+						strp_size);
+			match = (csum_calc == csum_nvmm0) ||
+				(csum_calc == csum_nvmm1);
+		}
+
+		if (!match) {
+			/* Getting here, data is considered corrupted.
+			 *
+			 * if: csum_nvmm0 == csum_nvmm1
+			 *     both csums good, run data recovery
+			 * if: csum_nvmm0 != csum_nvmm1
+			 *     at least one csum is corrupted, also need to run
+			 *     data recovery to see if one csum is still good
+			 */
+			nova_dbg("%s: nova data corruption detected! inode %lu, strp %lu of %lu, block offset %lu, stripe nr %lu, csum calc 0x%08x, csum nvmm 0x%08x, csum nvmm replica 0x%08x\n",
+				__func__, sih->ino, strp, strps, blockoff,
+				strp_nr, csum_calc, csum_nvmm0, csum_nvmm1);
+
+			if (data_parity == 0) {
+				nova_dbg("%s: no data redundancy available, can not repair data corruption!\n",
+					 __func__);
+				break;
+			}
+
+			nova_dbg("%s: nova data recovery begins\n", __func__);
+
+			error = nova_restore_data(sb, blocknr, strp_index,
+					strip, error, csum_nvmm0, csum_nvmm1,
+					&csum_calc);
+			if (error) {
+				nova_dbg("%s: nova data recovery fails!\n",
+						__func__);
+				dump_stack();
+				break;
+			}
+
+			/* Getting here, data corruption is repaired and the
+			 * good checksum is stored in csum_calc.
+			 */
+			nova_dbg("%s: nova data recovery success!\n", __func__);
+			match = true;
+		}
+
+		/* Getting here, match must be true, otherwise already breaking
+		 * out the for loop. Data is known good, either it's good in
+		 * nvmm, or good after recovery.
+		 */
+		if (csum_nvmm0 != csum_nvmm1) {
+			/* Getting here, data is known good but one checksum is
+			 * considered corrupted.
+			 */
+			nova_dbg("%s: nova checksum corruption detected! inode %lu, strp %lu of %lu, block offset %lu, stripe nr %lu, csum calc 0x%08x, csum nvmm 0x%08x, csum nvmm replica 0x%08x\n",
+				__func__, sih->ino, strp, strps, blockoff,
+				strp_nr, csum_calc, csum_nvmm0, csum_nvmm1);
+
+			nova_memunlock_range(sb, csum_addr0,
+							NOVA_DATA_CSUM_LEN);
+			if (csum_nvmm0 != csum_calc) {
+				csum_nvmm0 = cpu_to_le32(csum_calc);
+				memcpy_to_pmem_nocache(csum_addr0, &csum_nvmm0,
+							NOVA_DATA_CSUM_LEN);
+			}
+
+			if (csum_nvmm1 != csum_calc) {
+				csum_nvmm1 = cpu_to_le32(csum_calc);
+				memcpy_to_pmem_nocache(csum_addr1, &csum_nvmm1,
+							NOVA_DATA_CSUM_LEN);
+			}
+			nova_memlock_range(sb, csum_addr0, NOVA_DATA_CSUM_LEN);
+
+			nova_dbg("%s: nova checksum corruption repaired!\n",
+								__func__);
+		}
+
+		/* Getting here, the data stripe and both checksum copies are
+		 * known good. Continue to the next stripe.
+		 */
+		strp_nr    += 1;
+		strp_index += 1;
+		strp_ptr   += strp_size;
+		if (strp_index == (blocksize >> strp_shift)) {
+			blocknr += 1;
+			blockoff += blocksize;
+			strp_index = 0;
+		}
+
+	}
+
+	if (strip != NULL)
+		kfree(strip);
+
+	NOVA_END_TIMING(verify_data_csum_t, verify_time);
+
+	return match;
+}
+
+int nova_update_truncated_block_csum(struct super_block *sb,
+	struct inode *inode, loff_t newsize) {
+
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	unsigned long offset = newsize & (sb->s_blocksize - 1);
+	unsigned long pgoff, length;
+	u64 nvmm;
+	char *nvmm_addr, *strp_addr, *tail_strp = NULL;
+	unsigned int strp_size = NOVA_STRIPE_SIZE;
+	unsigned int strp_shift = NOVA_STRIPE_SHIFT;
+	unsigned int strp_index, strp_offset;
+	unsigned long strps, strp_nr;
+
+	length = sb->s_blocksize - offset;
+	pgoff = newsize >> sb->s_blocksize_bits;
+
+	nvmm = nova_find_nvmm_block(sb, sih, NULL, pgoff);
+	if (nvmm == 0)
+		return -EFAULT;
+
+	nvmm_addr = (char *)nova_get_block(sb, nvmm);
+
+	strp_index = offset >> strp_shift;
+	strp_offset = offset - (strp_index << strp_shift);
+
+	strps = ((strp_offset + length - 1) >> strp_shift) + 1;
+	strp_nr = (nvmm + offset) >> strp_shift;
+	strp_addr = nvmm_addr + (strp_index << strp_shift);
+
+	if (strp_offset > 0) {
+		/* Copy to DRAM to catch MCE. */
+		tail_strp = kzalloc(strp_size, GFP_KERNEL);
+		if (tail_strp == NULL)
+			return -ENOMEM;
+
+		if (memcpy_mcsafe(tail_strp, strp_addr, strp_offset) < 0)
+			return -EIO;
+
+		nova_update_stripe_csum(sb, 1, strp_nr, tail_strp, 0);
+
+		strps--;
+		strp_nr++;
+	}
+
+	if (strps > 0)
+		nova_update_stripe_csum(sb, strps, strp_nr, NULL, 1);
+
+	if (tail_strp != NULL)
+		kfree(tail_strp);
+
+	return 0;
+}
+
diff --git a/fs/nova/mprotect.c b/fs/nova/mprotect.c
new file mode 100644
index 000000000000..4b58786f401e
--- /dev/null
+++ b/fs/nova/mprotect.c
@@ -0,0 +1,604 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Memory protection for the filesystem pages.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/io.h>
+#include "nova.h"
+#include "inode.h"
+
+static inline void wprotect_disable(void)
+{
+	unsigned long cr0_val;
+
+	cr0_val = read_cr0();
+	cr0_val &= (~X86_CR0_WP);
+	write_cr0(cr0_val);
+}
+
+static inline void wprotect_enable(void)
+{
+	unsigned long cr0_val;
+
+	cr0_val = read_cr0();
+	cr0_val |= X86_CR0_WP;
+	write_cr0(cr0_val);
+}
+
+/* FIXME: Assumes strictly paired calls; the saved IRQ flags live in a single
+ * static variable shared by all CPUs, so concurrent callers can clobber it.
+ * nova_writeable(vaddr, size, 1);  then  nova_writeable(vaddr, size, 0);
+ */
+int nova_writeable(void *vaddr, unsigned long size, int rw)
+{
+	static unsigned long flags;
+	timing_t wprotect_time;
+
+	NOVA_START_TIMING(wprotect_t, wprotect_time);
+	if (rw) {
+		local_irq_save(flags);
+		wprotect_disable();
+	} else {
+		wprotect_enable();
+		local_irq_restore(flags);
+	}
+	NOVA_END_TIMING(wprotect_t, wprotect_time);
+	return 0;
+}
+
+int nova_dax_mem_protect(struct super_block *sb, void *vaddr,
+			  unsigned long size, int rw)
+{
+	if (!nova_is_wprotected(sb))
+		return 0;
+	return nova_writeable(vaddr, size, rw);
+}
+
+int nova_get_vma_overlap_range(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct vm_area_struct *vma,
+	unsigned long entry_pgoff, unsigned long entry_pages,
+	unsigned long *start_pgoff, unsigned long *num_pages)
+{
+	unsigned long vma_pgoff;
+	unsigned long vma_pages;
+	unsigned long end_pgoff;
+
+	vma_pgoff = vma->vm_pgoff;
+	vma_pages = (vma->vm_end - vma->vm_start) >> sb->s_blocksize_bits;
+
+	if (vma_pgoff + vma_pages <= entry_pgoff ||
+				entry_pgoff + entry_pages <= vma_pgoff)
+		return 0;
+
+	*start_pgoff = vma_pgoff > entry_pgoff ? vma_pgoff : entry_pgoff;
+	end_pgoff = (vma_pgoff + vma_pages) > (entry_pgoff + entry_pages) ?
+			entry_pgoff + entry_pages : vma_pgoff + vma_pages;
+	*num_pages = end_pgoff - *start_pgoff;
+	return 1;
+}
+
+static int nova_update_dax_mapping(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct vm_area_struct *vma,
+	struct nova_file_write_entry *entry, unsigned long start_pgoff,
+	unsigned long num_pages)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	void **pentry;
+	unsigned long curr_pgoff;
+	unsigned long blocknr, start_blocknr;
+	unsigned long value, new_value;
+	int i;
+	int ret = 0;
+	timing_t update_time;
+
+	NOVA_START_TIMING(update_mapping_t, update_time);
+
+	start_blocknr = nova_get_blocknr(sb, entry->block, sih->i_blk_type);
+	spin_lock_irq(&mapping->tree_lock);
+	for (i = 0; i < num_pages; i++) {
+		curr_pgoff = start_pgoff + i;
+		blocknr = start_blocknr + i;
+
+		pentry = radix_tree_lookup_slot(&mapping->page_tree,
+						curr_pgoff);
+		if (pentry) {
+			value = (unsigned long)radix_tree_deref_slot(pentry);
+			/* 9 = sector shift (3) + RADIX_DAX_SHIFT (6) */
+			new_value = (blocknr << 9) | (value & 0xff);
+			nova_dbgv("%s: pgoff %lu, entry 0x%lx, new 0x%lx\n",
+						__func__, curr_pgoff,
+						value, new_value);
+			radix_tree_replace_slot(&mapping->page_tree, pentry,
+						(void *)new_value);
+			radix_tree_tag_set(&mapping->page_tree, curr_pgoff,
+						PAGECACHE_TAG_DIRTY);
+		}
+	}
+
+	spin_unlock_irq(&mapping->tree_lock);
+
+	NOVA_END_TIMING(update_mapping_t, update_time);
+	return ret;
+}
+
+static int nova_update_entry_pfn(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct vm_area_struct *vma,
+	struct nova_file_write_entry *entry, unsigned long start_pgoff,
+	unsigned long num_pages)
+{
+	unsigned long newflags;
+	unsigned long addr;
+	unsigned long size;
+	unsigned long pfn;
+	pgprot_t new_prot;
+	int ret;
+	timing_t update_time;
+
+	NOVA_START_TIMING(update_pfn_t, update_time);
+
+	addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	pfn = nova_get_pfn(sb, entry->block) + start_pgoff - entry->pgoff;
+	size = num_pages << PAGE_SHIFT;
+
+	nova_dbgv("%s: addr 0x%lx, size 0x%lx\n", __func__,
+			addr, size);
+
+	newflags = vma->vm_flags | VM_WRITE;
+	new_prot = vm_get_page_prot(newflags);
+
+	ret = remap_pfn_range(vma, addr, pfn, size, new_prot);
+
+	NOVA_END_TIMING(update_pfn_t, update_time);
+	return ret;
+}
+
+static int nova_dax_mmap_update_mapping(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct vm_area_struct *vma,
+	struct nova_file_write_entry *entry_data)
+{
+	unsigned long start_pgoff, num_pages = 0;
+	int ret;
+
+	ret = nova_get_vma_overlap_range(sb, sih, vma, entry_data->pgoff,
+						entry_data->num_pages,
+						&start_pgoff, &num_pages);
+	if (ret == 0)
+		return ret;
+
+
+	NOVA_STATS_ADD(mapping_updated_pages, num_pages);
+
+	ret = nova_update_dax_mapping(sb, sih, vma, entry_data,
+						start_pgoff, num_pages);
+	if (ret) {
+		nova_err(sb, "update DAX mapping return %d\n", ret);
+		return ret;
+	}
+
+	ret = nova_update_entry_pfn(sb, sih, vma, entry_data,
+						start_pgoff, num_pages);
+	if (ret)
+		nova_err(sb, "update_pfn return %d\n", ret);
+
+
+	return ret;
+}
+
+static int nova_dax_cow_mmap_handler(struct super_block *sb,
+	struct vm_area_struct *vma, struct nova_inode_info_header *sih,
+	u64 begin_tail)
+{
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	u64 curr_p = begin_tail;
+	size_t entry_size = sizeof(struct nova_file_write_entry);
+	int ret = 0;
+	timing_t update_time;
+
+	NOVA_START_TIMING(mmap_handler_t, update_time);
+	entryc = (metadata_csum == 0) ? NULL : &entry_copy;
+	while (curr_p && curr_p != sih->log_tail) {
+		if (is_last_entry(curr_p, entry_size))
+			curr_p = next_log_page(sb, curr_p);
+
+		if (curr_p == 0) {
+			nova_err(sb, "%s: File inode %lu log is NULL!\n",
+				__func__, sih->ino);
+			ret = -EINVAL;
+			break;
+		}
+
+		entry = (struct nova_file_write_entry *)
+					nova_get_block(sb, curr_p);
+
+		if (metadata_csum == 0)
+			entryc = entry;
+		else if (!nova_verify_entry_csum(sb, entry, entryc)) {
+			ret = -EIO;
+			curr_p += entry_size;
+			continue;
+		}
+
+		if (nova_get_entry_type(entryc) != FILE_WRITE) {
+			/* for debug information, still use nvmm entry */
+			nova_dbg("%s: entry type is not write? %d\n",
+				__func__, nova_get_entry_type(entry));
+			curr_p += entry_size;
+			continue;
+		}
+
+		ret = nova_dax_mmap_update_mapping(sb, sih, vma, entryc);
+		if (ret)
+			break;
+
+		curr_p += entry_size;
+	}
+
+	NOVA_END_TIMING(mmap_handler_t, update_time);
+	return ret;
+}
+
+static int nova_get_dax_cow_range(struct super_block *sb,
+	struct vm_area_struct *vma, unsigned long address,
+	unsigned long *start_blk, int *num_blocks)
+{
+	int base = 1;
+	unsigned long vma_blocks;
+	unsigned long pgoff;
+	unsigned long start_pgoff;
+
+	vma_blocks = (vma->vm_end - vma->vm_start) >> sb->s_blocksize_bits;
+
+	/* Read ahead, avoid sequential page faults */
+	if (vma_blocks >= 4096)
+		base = 4096;
+
+	pgoff = (address - vma->vm_start) >> sb->s_blocksize_bits;
+	start_pgoff = pgoff & ~(base - 1);
+	*start_blk = vma->vm_pgoff + start_pgoff;
+	*num_blocks = (base > vma_blocks - start_pgoff) ?
+			vma_blocks - start_pgoff : base;
+	nova_dbgv("%s: start block %lu, %d blocks\n",
+			__func__, *start_blk, *num_blocks);
+	return 0;
+}
+
+int nova_mmap_to_new_blocks(struct vm_area_struct *vma,
+	unsigned long address)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct inode *inode = mapping->host;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct super_block *sb = inode->i_sb;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode *pi;
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	struct nova_file_write_entry entry_data;
+	struct nova_inode_update update;
+	unsigned long start_blk, end_blk;
+	unsigned long entry_pgoff;
+	unsigned long from_blocknr = 0;
+	unsigned long blocknr = 0;
+	unsigned long avail_blocks;
+	unsigned long copy_blocks;
+	int num_blocks = 0;
+	u64 from_blockoff, to_blockoff;
+	size_t copied;
+	int allocated = 0;
+	void *from_kmem;
+	void *to_kmem;
+	size_t bytes;
+	timing_t memcpy_time;
+	u64 begin_tail = 0;
+	u64 epoch_id;
+	u64 entry_size;
+	u32 time;
+	timing_t mmap_cow_time;
+	int ret = 0;
+
+	NOVA_START_TIMING(mmap_cow_t, mmap_cow_time);
+
+	nova_get_dax_cow_range(sb, vma, address, &start_blk, &num_blocks);
+
+	end_blk = start_blk + num_blocks;
+	if (start_blk >= end_blk) {
+		NOVA_END_TIMING(mmap_cow_t, mmap_cow_time);
+		return 0;
+	}
+
+	if (sbi->snapshot_taking) {
+		/* Block CoW mmap until snapshot taking completes */
+		NOVA_STATS_ADD(dax_cow_during_snapshot, 1);
+		wait_event_interruptible(sbi->snapshot_mmap_wait,
+					sbi->snapshot_taking == 0);
+	}
+
+	inode_lock(inode);
+
+	pi = nova_get_inode(sb, inode);
+
+	nova_dbgv("%s: inode %lu, start pgoff %lu, end pgoff %lu\n",
+			__func__, inode->i_ino, start_blk, end_blk);
+
+	time = current_time(inode).tv_sec;
+
+	epoch_id = nova_get_epoch_id(sb);
+	update.tail = pi->log_tail;
+	update.alter_tail = pi->alter_log_tail;
+
+	entryc = (metadata_csum == 0) ? NULL : &entry_copy;
+
+	while (start_blk < end_blk) {
+		entry = nova_get_write_entry(sb, sih, start_blk);
+		if (!entry) {
+			nova_dbgv("%s: Found hole: pgoff %lu\n",
+					__func__, start_blk);
+
+			/* Jump the hole */
+			entry = nova_find_next_entry(sb, sih, start_blk);
+			if (!entry)
+				break;
+
+			if (metadata_csum == 0)
+				entryc = entry;
+			else if (!nova_verify_entry_csum(sb, entry, entryc))
+				break;
+
+			start_blk = entryc->pgoff;
+			if (start_blk >= end_blk)
+				break;
+		} else {
+			if (metadata_csum == 0)
+				entryc = entry;
+			else if (!nova_verify_entry_csum(sb, entry, entryc))
+				break;
+		}
+
+		if (entryc->epoch_id == epoch_id) {
+			/* Someone has done it for us. */
+			break;
+		}
+
+		from_blocknr = get_nvmm(sb, sih, entryc, start_blk);
+		from_blockoff = nova_get_block_off(sb, from_blocknr,
+						pi->i_blk_type);
+		from_kmem = nova_get_block(sb, from_blockoff);
+
+		if (entryc->reassigned == 0)
+			avail_blocks = entryc->num_pages -
+					(start_blk - entryc->pgoff);
+		else
+			avail_blocks = 1;
+
+		if (avail_blocks > end_blk - start_blk)
+			avail_blocks = end_blk - start_blk;
+
+		allocated = nova_new_data_blocks(sb, sih, &blocknr, start_blk,
+					 avail_blocks, ALLOC_NO_INIT, ANY_CPU,
+					 ALLOC_FROM_HEAD);
+
+		nova_dbgv("%s: alloc %d blocks @ %lu\n", __func__,
+						allocated, blocknr);
+
+		if (allocated <= 0) {
+			nova_dbg("%s alloc blocks failed!, %d\n",
+						__func__, allocated);
+			ret = allocated;
+			goto out;
+		}
+
+		to_blockoff = nova_get_block_off(sb, blocknr,
+						pi->i_blk_type);
+		to_kmem = nova_get_block(sb, to_blockoff);
+		entry_pgoff = start_blk;
+
+		copy_blocks = allocated;
+
+		bytes = sb->s_blocksize * copy_blocks;
+
+		/* Copy the old blocks into the newly allocated blocks */
+		NOVA_START_TIMING(memcpy_w_wb_t, memcpy_time);
+		nova_memunlock_range(sb, to_kmem, bytes);
+		copied = bytes - memcpy_to_pmem_nocache(to_kmem, from_kmem,
+							bytes);
+		nova_memlock_range(sb, to_kmem, bytes);
+		NOVA_END_TIMING(memcpy_w_wb_t, memcpy_time);
+
+		if (copied == bytes) {
+			start_blk += copy_blocks;
+		} else {
+			nova_dbg("%s ERROR!: bytes %zu, copied %zu\n",
+				__func__, bytes, copied);
+			ret = -EFAULT;
+			goto out;
+		}
+
+		entry_size = cpu_to_le64(inode->i_size);
+
+		nova_init_file_write_entry(sb, sih, &entry_data,
+					epoch_id, entry_pgoff, copy_blocks,
+					blocknr, time, entry_size);
+
+		ret = nova_append_file_write_entry(sb, pi, inode,
+					&entry_data, &update);
+		if (ret) {
+			nova_dbg("%s: append inode entry failed\n",
+					__func__);
+			ret = -ENOSPC;
+			goto out;
+		}
+
+		if (begin_tail == 0)
+			begin_tail = update.curr_entry;
+	}
+
+	if (begin_tail == 0)
+		goto out;
+
+	nova_memunlock_inode(sb, pi);
+	nova_update_inode(sb, inode, pi, &update, 1);
+	nova_memlock_inode(sb, pi);
+
+	/* Update file tree */
+	ret = nova_reassign_file_tree(sb, sih, begin_tail);
+	if (ret)
+		goto out;
+
+
+	/* Update pfn and prot */
+	ret = nova_dax_cow_mmap_handler(sb, vma, sih, begin_tail);
+	if (ret)
+		goto out;
+
+
+	sih->trans_id++;
+
+out:
+	if (ret < 0)
+		nova_cleanup_incomplete_write(sb, sih, blocknr, allocated,
+						begin_tail, update.tail);
+
+	inode_unlock(inode);
+	NOVA_END_TIMING(mmap_cow_t, mmap_cow_time);
+	return ret;
+}
+
+static int nova_set_vma_read(struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long oldflags = vma->vm_flags;
+	unsigned long newflags;
+	pgprot_t new_page_prot;
+
+	down_write(&mm->mmap_sem);
+
+	newflags = oldflags & (~VM_WRITE);
+	if (oldflags == newflags)
+		goto out;
+
+	nova_dbgv("Set vma %p read, start 0x%lx, end 0x%lx\n",
+				vma, vma->vm_start,
+				vma->vm_end);
+
+	new_page_prot = vm_get_page_prot(newflags);
+	change_protection(vma, vma->vm_start, vma->vm_end,
+				new_page_prot, 0, 0);
+	vma->original_write = 1;
+
+out:
+	up_write(&mm->mmap_sem);
+
+	return 0;
+}
+
+static inline bool pgoff_in_vma(struct vm_area_struct *vma,
+	unsigned long pgoff)
+{
+	unsigned long num_pages;
+
+	num_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+
+	if (pgoff >= vma->vm_pgoff && pgoff < vma->vm_pgoff + num_pages)
+		return true;
+
+	return false;
+}
+
+bool nova_find_pgoff_in_vma(struct inode *inode, unsigned long pgoff)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct vma_item *item;
+	struct rb_node *temp;
+	bool ret = false;
+
+	if (sih->num_vmas == 0)
+		return ret;
+
+	temp = rb_first(&sih->vma_tree);
+	while (temp) {
+		item = container_of(temp, struct vma_item, node);
+		temp = rb_next(temp);
+		if (pgoff_in_vma(item->vma, pgoff)) {
+			ret = true;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+static int nova_set_sih_vmas_readonly(struct nova_inode_info_header *sih)
+{
+	struct vma_item *item;
+	struct rb_node *temp;
+	timing_t set_read_time;
+
+	NOVA_START_TIMING(set_vma_read_t, set_read_time);
+
+	temp = rb_first(&sih->vma_tree);
+	while (temp) {
+		item = container_of(temp, struct vma_item, node);
+		temp = rb_next(temp);
+		nova_set_vma_read(item->vma);
+	}
+
+	NOVA_END_TIMING(set_vma_read_t, set_read_time);
+	return 0;
+}
+
+int nova_set_vmas_readonly(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info_header *sih;
+
+	nova_dbgv("%s\n", __func__);
+	mutex_lock(&sbi->vma_mutex);
+	list_for_each_entry(sih, &sbi->mmap_sih_list, list)
+		nova_set_sih_vmas_readonly(sih);
+	mutex_unlock(&sbi->vma_mutex);
+
+	return 0;
+}
+
+#if 0
+int nova_destroy_vma_tree(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct vma_item *item;
+	struct rb_node *temp;
+
+	nova_dbgv("%s\n", __func__);
+	mutex_lock(&sbi->vma_mutex);
+	temp = rb_first(&sbi->vma_tree);
+	while (temp) {
+		item = container_of(temp, struct vma_item, node);
+		temp = rb_next(temp);
+		rb_erase(&item->node, &sbi->vma_tree);
+		kfree(item);
+	}
+	mutex_unlock(&sbi->vma_mutex);
+
+	return 0;
+}
+#endif
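Two small range computations carry most of the logic in mprotect.c: clipping a
write entry to the pages a VMA actually maps (nova_get_vma_overlap_range()),
and choosing how many blocks to copy on a CoW fault
(nova_get_dax_cow_range()). The following is a user-space sketch of both, with
the kernel structures replaced by plain integers. The sketch_* names are
hypothetical; the 4096-block read-ahead threshold is the one used in the patch.

```c
/* Intersect the VMA page range with a write entry's page range.
 * Returns 1 and fills the outputs if they overlap, 0 otherwise.
 */
int sketch_overlap(unsigned long vma_pgoff, unsigned long vma_pages,
		   unsigned long entry_pgoff, unsigned long entry_pages,
		   unsigned long *start_pgoff, unsigned long *num_pages)
{
	unsigned long vma_end = vma_pgoff + vma_pages;
	unsigned long entry_end = entry_pgoff + entry_pages;
	unsigned long end;

	if (vma_end <= entry_pgoff || entry_end <= vma_pgoff)
		return 0;

	*start_pgoff = vma_pgoff > entry_pgoff ? vma_pgoff : entry_pgoff;
	end = vma_end < entry_end ? vma_end : entry_end;
	*num_pages = end - *start_pgoff;
	return 1;
}

/* Pick the block range to copy for a fault at block `fault` (relative to
 * the start of the VMA). Large mappings copy an aligned 4096-block window,
 * clipped to the end of the VMA; small mappings copy a single block.
 */
void sketch_cow_range(unsigned long vma_pgoff, unsigned long vma_blocks,
		      unsigned long fault, unsigned long *start_blk,
		      unsigned long *num_blocks)
{
	unsigned long base = vma_blocks >= 4096 ? 4096 : 1;
	unsigned long start = fault & ~(base - 1);
	unsigned long left = vma_blocks - start;

	*start_blk = vma_pgoff + start;
	*num_blocks = base > left ? left : base;
}
```

For example, a VMA covering pages [10, 30) and an entry covering [25, 35)
overlap in 5 pages starting at page 25. A fault at block 4100 of a 5000-block
mapping whose vm_pgoff is 100 selects the window that starts at file block
4196 and spans the 904 blocks left in the mapping.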
diff --git a/fs/nova/mprotect.h b/fs/nova/mprotect.h
new file mode 100644
index 000000000000..e28243caae52
--- /dev/null
+++ b/fs/nova/mprotect.h
@@ -0,0 +1,190 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Memory protection definitions for the NOVA filesystem.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#ifndef __WPROTECT_H
+#define __WPROTECT_H
+
+#include <linux/fs.h>
+#include "nova_def.h"
+#include "super.h"
+
+extern void nova_error_mng(struct super_block *sb, const char *fmt, ...);
+
+static inline int nova_range_check(struct super_block *sb, void *p,
+					 unsigned long len)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (p < sbi->virt_addr ||
+			p + len > sbi->virt_addr + sbi->initsize) {
+		nova_err(sb, "access pmem out of range: pmem range %p - %p, access range %p - %p\n",
+				sbi->virt_addr,
+				sbi->virt_addr + sbi->initsize,
+				p, p + len);
+		dump_stack();
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+extern int nova_writeable(void *vaddr, unsigned long size, int rw);
+
+static inline int nova_is_protected(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = (struct nova_sb_info *)sb->s_fs_info;
+
+	if (wprotect)
+		return wprotect;
+
+	return sbi->s_mount_opt & NOVA_MOUNT_PROTECT;
+}
+
+static inline int nova_is_wprotected(struct super_block *sb)
+{
+	return nova_is_protected(sb);
+}
+
+static inline void
+__nova_memunlock_range(void *p, unsigned long len)
+{
+	/*
+	 * NOTE: Ideally we would lock the whole kernel to be memory safe
+	 * and prevent any stray write to the protected memory. That is
+	 * not possible, so we only serialize the operations at the fs
+	 * level. Note that nova_writeable() does disable local interrupts
+	 * (local_irq_save()) for as long as write protection is off.
+	 */
+	nova_writeable(p, len, 1);
+}
+
+static inline void
+__nova_memlock_range(void *p, unsigned long len)
+{
+	nova_writeable(p, len, 0);
+}
+
+static inline void nova_memunlock_range(struct super_block *sb, void *p,
+					 unsigned long len)
+{
+	if (nova_range_check(sb, p, len))
+		return;
+
+	if (nova_is_protected(sb))
+		__nova_memunlock_range(p, len);
+}
+
+static inline void nova_memlock_range(struct super_block *sb, void *p,
+				       unsigned long len)
+{
+	if (nova_is_protected(sb))
+		__nova_memlock_range(p, len);
+}
+
+static inline void nova_memunlock_super(struct super_block *sb)
+{
+	struct nova_super_block *ps = nova_get_super(sb);
+
+	if (nova_is_protected(sb))
+		__nova_memunlock_range(ps, NOVA_SB_SIZE);
+}
+
+static inline void nova_memlock_super(struct super_block *sb)
+{
+	struct nova_super_block *ps = nova_get_super(sb);
+
+	if (nova_is_protected(sb))
+		__nova_memlock_range(ps, NOVA_SB_SIZE);
+}
+
+static inline void nova_memunlock_reserved(struct super_block *sb,
+					 struct nova_super_block *ps)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (nova_is_protected(sb))
+		__nova_memunlock_range(ps,
+			sbi->head_reserved_blocks * NOVA_DEF_BLOCK_SIZE_4K);
+}
+
+static inline void nova_memlock_reserved(struct super_block *sb,
+				       struct nova_super_block *ps)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (nova_is_protected(sb))
+		__nova_memlock_range(ps,
+			sbi->head_reserved_blocks * NOVA_DEF_BLOCK_SIZE_4K);
+}
+
+static inline void nova_memunlock_journal(struct super_block *sb)
+{
+	void *addr = nova_get_block(sb, NOVA_DEF_BLOCK_SIZE_4K * JOURNAL_START);
+
+	if (nova_range_check(sb, addr, NOVA_DEF_BLOCK_SIZE_4K))
+		return;
+
+	if (nova_is_protected(sb))
+		__nova_memunlock_range(addr, NOVA_DEF_BLOCK_SIZE_4K);
+}
+
+static inline void nova_memlock_journal(struct super_block *sb)
+{
+	void *addr = nova_get_block(sb, NOVA_DEF_BLOCK_SIZE_4K * JOURNAL_START);
+
+	if (nova_is_protected(sb))
+		__nova_memlock_range(addr, NOVA_DEF_BLOCK_SIZE_4K);
+}
+
+static inline void nova_memunlock_inode(struct super_block *sb,
+					 struct nova_inode *pi)
+{
+	if (nova_range_check(sb, pi, NOVA_INODE_SIZE))
+		return;
+
+	if (nova_is_protected(sb))
+		__nova_memunlock_range(pi, NOVA_INODE_SIZE);
+}
+
+static inline void nova_memlock_inode(struct super_block *sb,
+				       struct nova_inode *pi)
+{
+	/* nova_sync_inode(pi); */
+	if (nova_is_protected(sb))
+		__nova_memlock_range(pi, NOVA_INODE_SIZE);
+}
+
+static inline void nova_memunlock_block(struct super_block *sb, void *bp)
+{
+	if (nova_range_check(sb, bp, sb->s_blocksize))
+		return;
+
+	if (nova_is_protected(sb))
+		__nova_memunlock_range(bp, sb->s_blocksize);
+}
+
+static inline void nova_memlock_block(struct super_block *sb, void *bp)
+{
+	if (nova_is_protected(sb))
+		__nova_memlock_range(bp, sb->s_blocksize);
+}
+
+
+#endif
diff --git a/fs/nova/parity.c b/fs/nova/parity.c
new file mode 100644
index 000000000000..1f2f8b4d6c0e
--- /dev/null
+++ b/fs/nova/parity.c
@@ -0,0 +1,411 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Parity related methods.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include "nova.h"
+
+static int nova_calculate_block_parity(struct super_block *sb, u8 *parity,
+	u8 *block)
+{
+	unsigned int strp, num_strps, i, j;
+	size_t strp_size = NOVA_STRIPE_SIZE;
+	unsigned int strp_shift = NOVA_STRIPE_SHIFT;
+	u64 xor;
+
+	num_strps = sb->s_blocksize >> strp_shift;
+	if (static_cpu_has(X86_FEATURE_XMM2)) { // sse2 128b
+		for (i = 0; i < strp_size; i += 16) {
+			asm volatile("movdqa %0, %%xmm0" : : "m" (block[i]));
+			for (strp = 1; strp < num_strps; strp++) {
+				j = (strp << strp_shift) + i;
+				asm volatile(
+					"movdqa     %0, %%xmm1\n"
+					"pxor   %%xmm1, %%xmm0\n"
+					: : "m" (block[j])
+				);
+			}
+			asm volatile("movntdq %%xmm0, %0" : "=m" (parity[i]));
+		}
+	} else { // common 64b
+		for (i = 0; i < strp_size; i += 8) {
+			xor = *((u64 *) &block[i]);
+			for (strp = 1; strp < num_strps; strp++) {
+				j = (strp << strp_shift) + i;
+				xor ^= *((u64 *) &block[j]);
+			}
+			*((u64 *) &parity[i]) = xor;
+		}
+	}
+
+	return 0;
+}
+
+/* Compute parity for a whole data block and write the parity stripe to nvmm.
+ *
+ * The block buffer used to compute the parity should reside in dram (more
+ * trusted), not in nvmm (less trusted).
+ *
+ * block:   block buffer with user data and possibly partial head-tail block
+ *          - should be in kernel memory (dram) to avoid page faults
+ * blocknr: destination nvmm block number where the block is written to
+ *          - used to derive the parity stripe address
+ * zero:    if nonzero, write an all-zero parity stripe instead
+ *
+ * If the modified content is less than a stripe size (small writes), it's
+ * possible to re-compute the parity only using the difference of the modified
+ * stripe, without re-computing for the whole block. A possible signature for
+ * such a variant:
+ * static int nova_update_block_parity(struct super_block *sb,
+ *	struct nova_inode_info_header *sih, void *block, unsigned long blocknr,
+ *	size_t offset, size_t bytes, int zero)
+ */
+static int nova_update_block_parity(struct super_block *sb, u8 *block,
+	unsigned long blocknr, int zero)
+{
+	size_t strp_size = NOVA_STRIPE_SIZE;
+	void *parity, *nvmmptr;
+	int ret = 0;
+	timing_t block_parity_time;
+
+	NOVA_START_TIMING(block_parity_t, block_parity_time);
+
+	parity = kmalloc(strp_size, GFP_KERNEL);
+	if (parity == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (block == NULL) {
+		nova_dbg("%s: block pointer error\n", __func__);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (unlikely(zero))
+		memset(parity, 0, strp_size);
+	else
+		nova_calculate_block_parity(sb, parity, block);
+
+	nvmmptr = nova_get_parity_addr(sb, blocknr);
+
+	nova_memunlock_range(sb, nvmmptr, strp_size);
+	memcpy_to_pmem_nocache(nvmmptr, parity, strp_size);
+	nova_memlock_range(sb, nvmmptr, strp_size);
+
+	// TODO: The parity stripe is better checksummed for higher reliability.
+out:
+	if (parity != NULL)
+		kfree(parity);
+
+	NOVA_END_TIMING(block_parity_t, block_parity_time);
+
+	return ret;
+}
+
+int nova_update_pgoff_parity(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	unsigned long pgoff, int zero)
+{
+	unsigned long blocknr;
+	void *dax_mem = NULL;
+	u64 blockoff;
+
+	blockoff = nova_find_nvmm_block(sb, sih, entry, pgoff);
+	/* Truncated? */
+	if (blockoff == 0)
+		return 0;
+
+	dax_mem = nova_get_block(sb, blockoff);
+
+	blocknr = nova_get_blocknr(sb, blockoff, sih->i_blk_type);
+	nova_update_block_parity(sb, dax_mem, blocknr, zero);
+
+	return 0;
+}
+
+/* Update block checksums and/or parity.
+ *
+ * Since this part of computing is along the critical path, unroll by 8 to gain
+ * performance if possible. This unrolling applies to stripe width of 8 and
+ * whole block writes.
+ */
+#define CSUM0 NOVA_INIT_CSUM
+int nova_update_block_csum_parity(struct super_block *sb,
+	struct nova_inode_info_header *sih, u8 *block, unsigned long blocknr,
+	size_t offset, size_t bytes)
+{
+	unsigned int i, strp_offset, num_strps;
+	size_t csum_size = NOVA_DATA_CSUM_LEN;
+	size_t strp_size = NOVA_STRIPE_SIZE;
+	unsigned int strp_shift = NOVA_STRIPE_SHIFT;
+	unsigned long strp_nr, blockoff, blocksize = sb->s_blocksize;
+	void *nvmmptr, *nvmmptr1;
+	u32 crc[8];
+	u64 qwd[8], *parity = NULL;
+	u64 acc[8] = {CSUM0, CSUM0, CSUM0, CSUM0, CSUM0, CSUM0, CSUM0, CSUM0};
+	bool unroll_csum = false, unroll_parity = false;
+	int ret = 0;
+	timing_t block_csum_parity_time;
+
+	NOVA_STATS_ADD(block_csum_parity, 1);
+
+	blockoff = nova_get_block_off(sb, blocknr, sih->i_blk_type);
+	strp_nr = blockoff >> strp_shift;
+
+	strp_offset = offset & (strp_size - 1);
+	num_strps = ((strp_offset + bytes - 1) >> strp_shift) + 1;
+
+	unroll_parity = (blocksize / strp_size == 8) && (num_strps == 8);
+	unroll_csum = unroll_parity && static_cpu_has(X86_FEATURE_XMM4_2);
+
+	/* unrolled-by-8 implementation */
+	if (unroll_csum || unroll_parity) {
+		NOVA_START_TIMING(block_csum_parity_t, block_csum_parity_time);
+		if (data_parity > 0) {
+			parity = kmalloc(strp_size, GFP_KERNEL);
+			if (parity == NULL) {
+				nova_err(sb, "%s: buffer allocation error\n",
+								__func__);
+				ret = -ENOMEM;
+				NOVA_END_TIMING(block_csum_parity_t,
+						block_csum_parity_time);
+				goto out;
+			}
+		}
+		for (i = 0; i < strp_size / 8; i++) {
+			/* Index by i rather than advancing block, which the
+			 * non-unrolled fallback below still needs intact. */
+			qwd[0] = *((u64 *) (block + 8 * i));
+			qwd[1] = *((u64 *) (block + 8 * i + 1 * strp_size));
+			qwd[2] = *((u64 *) (block + 8 * i + 2 * strp_size));
+			qwd[3] = *((u64 *) (block + 8 * i + 3 * strp_size));
+			qwd[4] = *((u64 *) (block + 8 * i + 4 * strp_size));
+			qwd[5] = *((u64 *) (block + 8 * i + 5 * strp_size));
+			qwd[6] = *((u64 *) (block + 8 * i + 6 * strp_size));
+			qwd[7] = *((u64 *) (block + 8 * i + 7 * strp_size));
+
+			if (data_csum > 0 && unroll_csum) {
+				nova_crc32c_qword(qwd[0], acc[0]);
+				nova_crc32c_qword(qwd[1], acc[1]);
+				nova_crc32c_qword(qwd[2], acc[2]);
+				nova_crc32c_qword(qwd[3], acc[3]);
+				nova_crc32c_qword(qwd[4], acc[4]);
+				nova_crc32c_qword(qwd[5], acc[5]);
+				nova_crc32c_qword(qwd[6], acc[6]);
+				nova_crc32c_qword(qwd[7], acc[7]);
+			}
+
+			if (data_parity > 0) {
+				parity[i] = qwd[0] ^ qwd[1] ^ qwd[2] ^ qwd[3] ^
+					    qwd[4] ^ qwd[5] ^ qwd[6] ^ qwd[7];
+			}
+		}
+		if (data_csum > 0 && unroll_csum) {
+			crc[0] = cpu_to_le32((u32) acc[0]);
+			crc[1] = cpu_to_le32((u32) acc[1]);
+			crc[2] = cpu_to_le32((u32) acc[2]);
+			crc[3] = cpu_to_le32((u32) acc[3]);
+			crc[4] = cpu_to_le32((u32) acc[4]);
+			crc[5] = cpu_to_le32((u32) acc[5]);
+			crc[6] = cpu_to_le32((u32) acc[6]);
+			crc[7] = cpu_to_le32((u32) acc[7]);
+
+			nvmmptr = nova_get_data_csum_addr(sb, strp_nr, 0);
+			nvmmptr1 = nova_get_data_csum_addr(sb, strp_nr, 1);
+			nova_memunlock_range(sb, nvmmptr, csum_size * 8);
+			memcpy_to_pmem_nocache(nvmmptr, crc, csum_size * 8);
+			memcpy_to_pmem_nocache(nvmmptr1, crc, csum_size * 8);
+			nova_memlock_range(sb, nvmmptr, csum_size * 8);
+		}
+
+		if (data_parity > 0) {
+			nvmmptr = nova_get_parity_addr(sb, blocknr);
+			nova_memunlock_range(sb, nvmmptr, strp_size);
+			memcpy_to_pmem_nocache(nvmmptr, parity, strp_size);
+			nova_memlock_range(sb, nvmmptr, strp_size);
+		}
+
+		if (parity != NULL)
+			kfree(parity);
+		NOVA_END_TIMING(block_csum_parity_t, block_csum_parity_time);
+	}
+
+	if (data_csum > 0 && !unroll_csum)
+		nova_update_block_csum(sb, sih, block, blocknr,
+					offset, bytes, 0);
+	if (data_parity > 0 && !unroll_parity)
+		nova_update_block_parity(sb, block, blocknr, 0);
+
+out:
+	return ret;
+}
+
+/* Restore a stripe of data.
+ *
+ * When this function is called, the two corresponding checksum copies are also
+ * given. After recovery the restored data stripe is checksum-verified using the
+ * given checksums. If any one matches, data recovery is considered successful
+ * and the restored stripe is written to nvmm to repair the corrupted data.
+ *
+ * If recovery succeeded, the known good checksum is returned by csum_good, and
+ * the caller will also check if any checksum restoration is necessary.
+ */
+int nova_restore_data(struct super_block *sb, unsigned long blocknr,
+	unsigned int badstrip_id, void *badstrip, int nvmmerr, u32 csum0,
+	u32 csum1, u32 *csum_good)
+{
+	unsigned int i, num_strps;
+	size_t strp_size = NOVA_STRIPE_SIZE;
+	unsigned int strp_shift = NOVA_STRIPE_SHIFT;
+	size_t blockoff, offset;
+	u8 *blockptr, *stripptr, *block, *parity, *strip;
+	u32 csum_calc;
+	bool success = false;
+	timing_t restore_time;
+	int ret = 0;
+
+	NOVA_START_TIMING(restore_data_t, restore_time);
+	blockoff = nova_get_block_off(sb, blocknr, NOVA_BLOCK_TYPE_4K);
+	blockptr = nova_get_block(sb, blockoff);
+	stripptr = blockptr + (badstrip_id << strp_shift);
+
+	block = kmalloc(sb->s_blocksize, GFP_KERNEL);
+	strip = kmalloc(strp_size, GFP_KERNEL);
+	if (block == NULL || strip == NULL) {
+		nova_err(sb, "%s: buffer allocation error\n", __func__);
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	parity = nova_get_parity_addr(sb, blocknr);
+	if (parity == NULL) {
+		nova_err(sb, "%s: parity address error\n", __func__);
+		ret = -EIO;
+		goto out;
+	}
+
+	num_strps = sb->s_blocksize >> strp_shift;
+	for (i = 0; i < num_strps; i++) {
+		offset = i << strp_shift;
+		if (i == badstrip_id)
+			/* parity strip has media errors */
+			ret = memcpy_mcsafe(block + offset,
+						parity, strp_size);
+		else
+			/* another data strip has media errors */
+			ret = memcpy_mcsafe(block + offset,
+						blockptr + offset, strp_size);
+		if (ret < 0) {
+			/* media error happens during recovery */
+			nova_err(sb, "%s: unrecoverable media error detected\n",
+					__func__);
+			goto out;
+		}
+	}
+
+	nova_calculate_block_parity(sb, strip, block);
+	for (i = 0; i < strp_size; i++) {
+		/* i is the number of good leading bytes in badstrip.
+		 * If the corruption is contained within one strip, the i = 0
+		 * pass can restore the strip; otherwise we need to test every
+		 * i to check for an unaligned but recoverable corruption,
+		 * i.e. a scribble that corrupts two adjacent strips but is
+		 * no larger than the strip size.
+		 */
+		memcpy(strip, badstrip, i);
+
+		csum_calc = nova_crc32c(NOVA_INIT_CSUM, strip, strp_size);
+		if (csum_calc == csum0 || csum_calc == csum1) {
+			success = true;
+			break;
+		}
+
+		/* media error, no good bytes in badstrip */
+		if (nvmmerr)
+			break;
+
+		/* corruption that hits the last strip must be contained within
+		 * the strip; if the corruption goes beyond the block boundary,
+		 * that's not the concern of this recovery call.
+		 */
+		if (badstrip_id == num_strps - 1)
+			break;
+	}
+
+	if (success) {
+		/* recovery success, repair the bad nvmm data */
+		nova_memunlock_range(sb, stripptr, strp_size);
+		memcpy_to_pmem_nocache(stripptr, strip, strp_size);
+		nova_memlock_range(sb, stripptr, strp_size);
+
+		/* return the good checksum */
+		*csum_good = csum_calc;
+	} else {
+		/* unrecoverable data corruption */
+		ret = -EIO;
+	}
+
+out:
+	if (block != NULL)
+		kfree(block);
+	if (strip != NULL)
+		kfree(strip);
+
+	NOVA_END_TIMING(restore_data_t, restore_time);
+	return ret;
+}
+
+int nova_update_truncated_block_parity(struct super_block *sb,
+	struct inode *inode, loff_t newsize)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	unsigned long pgoff, blocknr;
+	unsigned long blocksize = sb->s_blocksize;
+	u64 nvmm;
+	char *nvmm_addr, *block;
+	u8 btype = sih->i_blk_type;
+	int ret = 0;
+
+	pgoff = newsize >> sb->s_blocksize_bits;
+
+	nvmm = nova_find_nvmm_block(sb, sih, NULL, pgoff);
+	if (nvmm == 0)
+		return -EFAULT;
+
+	nvmm_addr = (char *)nova_get_block(sb, nvmm);
+
+	blocknr = nova_get_blocknr(sb, nvmm, btype);
+
+	/* Copy to DRAM to catch MCE. */
+	block = kmalloc(blocksize, GFP_KERNEL);
+	if (block == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (memcpy_mcsafe(block, nvmm_addr, blocksize) < 0) {
+		ret = -EIO;
+		goto out;
+	}
+
+	nova_update_block_parity(sb, block, blocknr, 0);
+out:
+	if (block != NULL)
+		kfree(block);
+	return ret;
+}
+


* [RFC 11/16] NOVA: Snapshot support
@ 2017-08-03  7:49   ` Steven Swanson
From: Steven Swanson @ 2017-08-03  7:49 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson

Nova supports snapshots to facilitate backups.

Taking a snapshot
-----------------

Each Nova file system has a current epoch_id in the super block, and each log
entry has the epoch_id attached to it at creation.  When the user creates a
snapshot, Nova increments the epoch_id for the file system, and the old
epoch_id identifies the moment the snapshot was taken.
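
As a rough illustration of the epoch counter, here is a minimal user-space
sketch.  The sketch_* names are hypothetical and are not part of the patch;
they only model the rule that the pre-increment epoch_id names the snapshot.

```c
#include <stdint.h>

/* Hypothetical stand-in for the epoch counter kept in the super block. */
struct sketch_sb {
	uint64_t epoch_id;
};

/* Taking a snapshot bumps the epoch; the old value identifies the snapshot. */
static uint64_t sketch_take_snapshot(struct sketch_sb *sb)
{
	return sb->epoch_id++;
}

/* New log entries are stamped with the epoch that is current at creation. */
static uint64_t sketch_stamp_entry(const struct sketch_sb *sb)
{
	return sb->epoch_id;
}
```

Entries stamped before the call carry the snapshot's ID; entries stamped
afterwards carry a larger one, which is what the epoch check relies on.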

Nova records the epoch_id and a timestamp in a new log entry (struct
nova_snapshot_info_entry) and appends it to the log of the reserved snapshot
inode (NOVA_SNAPSHOT_INO) in the superblock.

Nova also maintains a radix tree (nova_sb_info.snapshot_info_tree) of struct
snapshot_info in DRAM indexed by epoch_id.

Nova also marks all mmap'd pages as read-only and uses COW to preserve file
contents after the snapshot.

Tracking Live Data
------------------

Supporting snapshots requires Nova to preserve file contents from previous
snapshots while also being able to recover the space a snapshot occupied after
its deletion.

Preserving file contents requires a small change to how Nova implements write
operations.  To perform a write, Nova appends a write log entry to the file's
log.  The log entry includes pointers to newly-allocated and populated NVMM
pages that hold the written data.  If the write overwrites existing data, Nova
locates the previous write log entry for that portion of the file, and performs
an "epoch check" that compares the old log entry's epoch_id to the file
system's current epoch_id.  If the comparison matches, the old write log entry
and the file data blocks it points to no longer belong to any snapshot, and
Nova reclaims the data blocks.

If the epoch_ids do not match, then the data in the old log entry belongs to
an earlier snapshot and Nova leaves the log entry in place.
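
The simplified epoch check described above might look like the following
sketch.  The type and function names are hypothetical; the check in the patch
(nova_old_entry_deleteable()) also consults the table of live snapshots.

```c
#include <stdint.h>

/* Hypothetical, reduced write log entry: only the field the check needs. */
struct sketch_write_entry {
	uint64_t epoch_id;	/* epoch in which the entry was created */
};

enum sketch_action {
	SKETCH_RECLAIM,		/* no snapshot can see the old data */
	SKETCH_KEEP,		/* old data belongs to an earlier snapshot */
};

/* Compare the overwritten entry's epoch with the current epoch. */
static enum sketch_action
sketch_epoch_check(const struct sketch_write_entry *old,
		   uint64_t current_epoch_id)
{
	if (old->epoch_id == current_epoch_id)
		return SKETCH_RECLAIM;
	return SKETCH_KEEP;
}
```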

Determining when to reclaim data belonging to deleted snapshots requires
additional bookkeeping.  For each snapshot, Nova maintains a "snapshot log"
that records the inodes and blocks that belong to that snapshot, but are not
part of the current file system image.

Nova populates the snapshot log during the epoch check: if the epoch_ids for
the new and old log entries do not match, it appends a log entry (either struct
snapshot_inode_entry or struct snapshot_file_write_entry) to the snapshot log
that the old log entry belongs to.  The log entry contains a pointer to the old
log entry, and the file system's current epoch_id as the delete_epoch_id.
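
To make the ownership rule concrete, the sketch below models the decision in
user space.  It follows the shape of nova_old_entry_deleteable() in this patch,
but replaces the DRAM radix tree with a sorted array of live snapshot IDs; the
sketch_* names and the array are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Find the first live snapshot whose ID is >= epoch_id (array is sorted). */
static bool sketch_find_snapshot(const uint64_t *snaps, size_t n,
				 uint64_t epoch_id, uint64_t *found)
{
	for (size_t i = 0; i < n; i++) {
		if (snaps[i] >= epoch_id) {
			*found = snaps[i];
			return true;
		}
	}
	return false;
}

/*
 * Returns true if the old entry can be reclaimed immediately.  Otherwise
 * stores in *owner the snapshot whose log should record the entry.
 */
static bool sketch_old_entry_deleteable(const uint64_t *snaps, size_t n,
					uint64_t create_epoch_id,
					uint64_t delete_epoch_id,
					uint64_t *owner)
{
	uint64_t snap;

	if (create_epoch_id == delete_epoch_id)
		return true;	/* created and deleted in the same epoch */
	if (!sketch_find_snapshot(snaps, n, create_epoch_id, &snap))
		return true;	/* no snapshot was taken after creation */
	if (snap >= delete_epoch_id)
		return true;	/* no snapshot between create and delete */

	*owner = snap;
	return false;
}
```

With live snapshots {1, 4, 11}, an entry created in epoch 2 and overwritten in
epoch 5 is owned by snapshot 4, while one overwritten in epoch 3 can be freed
right away.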

To delete a snapshot, Nova removes the snapshot from the list of live snapshots
and appends its log to the following snapshot's log.  Then, a background thread
traverses the combined log and reclaims dead inode/data based on the delete
epoch_id: If the delete epoch_id for an entry in the log is less than or equal
to the snapshot's epoch_id, it means the log entry and/or the associated data
blocks are now dead.
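
The background pass can be pictured with the following sketch, which mirrors
the comparison in nova_background_clean_write_entry() but walks a plain array
instead of the linked list pages.  The struct and function names are
hypothetical, and "freeing" is reduced to counting pages.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced snapshot log entry. */
struct sketch_snap_entry {
	uint64_t delete_epoch_id;	/* epoch in which the data was replaced */
	uint64_t num_pages;		/* data pages the entry pins */
	bool deleted;			/* already reclaimed */
};

/*
 * Reclaim every entry whose delete_epoch_id is <= epoch_id and return the
 * number of pages released.  Entries are marked so a second pass is a no-op.
 */
static uint64_t sketch_clean_list(struct sketch_snap_entry *entries, size_t n,
				  uint64_t epoch_id)
{
	uint64_t freed = 0;

	for (size_t i = 0; i < n; i++) {
		if (!entries[i].deleted &&
		    entries[i].delete_epoch_id <= epoch_id) {
			freed += entries[i].num_pages;
			entries[i].deleted = true;
		}
	}
	return freed;
}
```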



Snapshots and DAX
-----------------

Taking consistent snapshots while applications are modifying files using
DAX-style mmap requires NOVA to reckon with the order in which stores to NVMM
become persistent (i.e., reach physical NVMM so they will survive a system
failure).  These applications rely on the processor's "memory persistence
model" [http://dl.acm.org/citation.cfm?id=2665671.2665712] to make guarantees
about when and in what order stores become persistent.  These guarantees allow
the application to restore its data to a consistent state during recovery
from a system failure.

From the application's perspective, reading a snapshot is equivalent to
recovering from a system failure.  In both cases, the contents of the
memory-mapped file reflect its state at a moment when application operations
might be in-flight and when the application had no chance to shut down cleanly.

A naive approach to checkpointing mmap()'d files in NOVA would simply mark each
of the read/write mapped pages as read-only and then do copy-on-write when a
store occurs to preserve the old pages as part of the snapshot.

However, this approach can leave the snapshot in an inconsistent state:
Setting the page to read-only captures its contents for the
snapshot, and the kernel requires NOVA to set the pages as read-only
one at a time.  So, if the order in which NOVA marks pages as read-only
is incompatible with the ordering that the application requires, the snapshot
will contain an inconsistent version of the file.

To resolve this problem, when NOVA starts marking pages as read-only, it blocks
page faults to the read-only mmap()'d pages until it has marked all the pages
read-only and finished taking the snapshot.
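
The difference between the naive approach and the blocking approach can be
shown with a small single-threaded model.  This is only a sketch: all names are
hypothetical, a blocked page fault is modeled as a store that returns false and
must be retried, and the "application" writes data to page 0 before setting a
commit flag in page 1.

```c
#include <stdbool.h>

#define SKETCH_NPAGES 2

struct sketch_file {
	int live[SKETCH_NPAGES];	/* current page contents */
	int snap[SKETCH_NPAGES];	/* contents captured for the snapshot */
	bool read_only[SKETCH_NPAGES];
	bool snapshot_taking;		/* true while pages are being marked */
	bool block_faults;		/* false models the naive approach */
};

/* Marking a page read-only is the moment its contents are captured. */
static void sketch_mark_read_only(struct sketch_file *f, int page)
{
	f->snap[page] = f->live[page];
	f->read_only[page] = true;
}

/*
 * Application store.  Returns false if the write fault is blocked until the
 * snapshot finishes; the caller cannot make progress and retries later.
 */
static bool sketch_store(struct sketch_file *f, int page, int value)
{
	if (f->read_only[page]) {
		if (f->snapshot_taking && f->block_faults)
			return false;
		/* COW: the snapshot keeps snap[page], writes go to a copy. */
		f->read_only[page] = false;
	}
	f->live[page] = value;
	return true;
}
```

In the naive run, the store to page 0 triggers COW and the store to page 1
lands before page 1 is marked, so the snapshot holds the commit flag without
the data.  With faults blocked, the first store stalls and the snapshot
captures both pages in their old state.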

More detail is available in the technical report referenced at the top of this
document.

We have implemented this functionality in NOVA by adding the 'original_write'
flag to struct vm_area_struct that tracks whether the vm_area_struct is created
with write permission, but has been marked read-only in the course of taking a
snapshot.  We have also added a 'dax_cow' operation to struct
vm_operations_struct that the page fault handler runs when applications write
to a page with original_write = 1.  NOVA's dax_cow operation
(nova_restore_page_write()) performs the COW, maps the page to a new physical
page and allows writing.
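
The dispatch added to the page fault path can be sketched in user space as
follows.  The sketch_* types are hypothetical stand-ins for vm_area_struct and
vm_operations_struct, and the handler here only counts calls where the real
dax_cow operation would copy and remap the page.

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_vma;

struct sketch_vm_ops {
	int (*dax_cow)(struct sketch_vma *vma, unsigned long address);
};

struct sketch_vma {
	int original_write;	/* mapped writable, made read-only by snapshot */
	int cow_calls;		/* bookkeeping for this sketch only */
	const struct sketch_vm_ops *vm_ops;
};

/* Stand-in for the file system's dax_cow operation. */
static int sketch_dax_cow(struct sketch_vma *vma, unsigned long address)
{
	(void)address;	/* a real handler would COW the page at this address */
	vma->cow_calls++;
	return 0;
}

static const struct sketch_vm_ops sketch_ops = { .dax_cow = sketch_dax_cow };

/* Returns true if the write fault was handed to dax_cow. */
static bool sketch_write_fault(struct sketch_vma *vma, unsigned long address,
			       bool is_write)
{
	if (!is_write)
		return false;
	if (!vma->original_write || !vma->vm_ops || !vma->vm_ops->dax_cow)
		return false;

	vma->vm_ops->dax_cow(vma, address);
	return true;
}
```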


Saving Snapshot State
---------------------

During a clean shutdown, Nova stores the snapshot information to PMEM.

Nova reserves an inode for storing snapshot information.  The log for the inode
contains an entry for each snapshot (struct nova_snapshot_info_entry).  On
shutdown, Nova allocates one page (struct snapshot_nvmm_page) to store an array
of struct snapshot_nvmm_list.

Each of these lists (one per CPU) contains head and tail pointers to a linked
list of blocks (just like an inode log).  The lists contain a struct
snapshot_file_write_entry or struct snapshot_inode_entry for each operation
that modified file data or an inode.
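
One way to picture the per-snapshot page of list heads is the sketch below.
The head, tail and num_pages fields appear in the code in this patch; the
padding field, the 128-entry limit and all sketch_* names are assumptions,
chosen here so that the array fills exactly one 4KB page.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_CPUS	128

/* One list descriptor per CPU; 32 bytes each under the assumed layout. */
struct sketch_nvmm_list {
	uint64_t padding;	/* assumed, rounds the record to 32 bytes */
	uint64_t num_pages;	/* pages in this CPU's list */
	uint64_t head;		/* first block of the list */
	uint64_t tail;		/* next free position in the last block */
};

/* The page allocated at shutdown: an array of per-CPU list descriptors. */
struct sketch_nvmm_page {
	struct sketch_nvmm_list lists[SKETCH_MAX_CPUS];
};

/* Look up the descriptor for a CPU, or NULL if the CPU has no slot. */
static struct sketch_nvmm_list *
sketch_list_for_cpu(struct sketch_nvmm_page *page, unsigned int cpu)
{
	if (cpu >= SKETCH_MAX_CPUS)
		return NULL;
	return &page->lists[cpu];
}
```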

Superblock
+--------------------+
|   ...              |
+--------------------+
| Reserved Inodes    |
+---+----------------+
|   |     ..         |
+---+----------------+
| 7 | Snapshot Inode |
|   | head           |
+---+----------------+
        /
       /
      /
+---------+---------+---------+
|  Snap   |  Snap   |  Snap   |
| epoch=1 | epoch=4 | epoch=11|
|         |         |         |
|nvmm_page|nvmm_page|nvmm_page|
+---------+---------+---------+
     |
     |
+----------+   +--------+--------+
|  cpu 0   |   | snap   | snap   |
|   head   |-->| inode  | write  |
|          |   | entry  | entry  |
|          |   +--------+--------+
+----------+   +--------+--------+
|  cpu 1   |   | snap   | snap   |
|   head   |-->| write  | write  |
|          |   | entry  | entry  |
|          |   +--------+--------+
+----------+
|    ...   |
+----------+   +--------+
|  cpu 128 |   | snap   |
|   head   |-->| inode  |
|          |   | entry  |
|          |   +--------+
+----------+

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 arch/x86/mm/fault.c      |   11 
 fs/nova/snapshot.c       | 1407 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/snapshot.h       |   98 +++
 include/linux/mm.h       |    2 
 include/linux/mm_types.h |    3 
 mm/mprotect.c            |   13 
 6 files changed, 1533 insertions(+), 1 deletion(-)
 create mode 100644 fs/nova/snapshot.c
 create mode 100644 fs/nova/snapshot.h

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 8ad91a01cbc8..34430601c7c0 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1431,6 +1431,17 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	 * we can handle it..
 	 */
 good_area:
+
+	if (error_code & PF_WRITE) {
+		/* write, present and write, not present: */
+		if (vma->original_write && vma->vm_ops &&
+					vma->vm_ops->dax_cow) {
+			up_read(&mm->mmap_sem);
+			vma->vm_ops->dax_cow(vma, address);
+			down_read(&mm->mmap_sem);
+		}
+	}
+
 	if (unlikely(access_error(error_code, vma))) {
 		bad_area_access_error(regs, error_code, address, vma);
 		return;
diff --git a/fs/nova/snapshot.c b/fs/nova/snapshot.c
new file mode 100644
index 000000000000..088b56c0d38c
--- /dev/null
+++ b/fs/nova/snapshot.c
@@ -0,0 +1,1407 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Snapshot support
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include "nova.h"
+#include "inode.h"
+#include "super.h"
+
+static inline u64 next_list_page(u64 curr_p)
+{
+	void *curr_addr = (void *)curr_p;
+	unsigned long page_tail = ((unsigned long)curr_addr & ~PAGE_OFFSET_MASK)
+					+ LOG_BLOCK_TAIL;
+	return ((struct nova_inode_page_tail *)page_tail)->next_page;
+}
+
+static inline bool goto_next_list_page(struct super_block *sb, u64 curr_p)
+{
+	void *addr;
+	u8 type;
+
+	/* Each kind of entry takes at least 32 bytes */
+	if (ENTRY_LOC(curr_p) + 32 > LOG_BLOCK_TAIL)
+		return true;
+
+	addr = (void *)curr_p;
+	type = nova_get_entry_type(addr);
+	if (type == NEXT_PAGE)
+		return true;
+
+	return false;
+}
+
+static int nova_find_target_snapshot_info(struct super_block *sb,
+	u64 epoch_id, struct snapshot_info **ret_info)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *infos[1];
+	int nr_infos;
+	int ret = 0;
+
+	nr_infos = radix_tree_gang_lookup(&sbi->snapshot_info_tree,
+					(void **)infos, epoch_id, 1);
+	if (nr_infos == 1) {
+		*ret_info = infos[0];
+		ret = 1;
+	}
+
+	return ret;
+}
+
+static struct snapshot_info *
+nova_find_next_snapshot_info(struct super_block *sb, struct snapshot_info *info)
+{
+	struct snapshot_info *ret_info = NULL;
+	int ret;
+
+	ret = nova_find_target_snapshot_info(sb, info->epoch_id + 1, &ret_info);
+
+	if (ret == 1 && ret_info->epoch_id <= info->epoch_id) {
+		nova_err(sb, "info epoch id %llu, next epoch id %llu\n",
+				info->epoch_id, ret_info->epoch_id);
+		ret_info = NULL;
+	}
+
+	return ret_info;
+}
+
+static int nova_insert_snapshot_info(struct super_block *sb,
+	struct snapshot_info *info)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	int ret;
+
+	ret = radix_tree_insert(&sbi->snapshot_info_tree, info->epoch_id, info);
+	if (ret)
+		nova_dbg("%s ERROR %d\n", __func__, ret);
+
+	return ret;
+}
+
+/* Reuse the inode log page structure */
+static inline void nova_set_link_page_epoch_id(struct super_block *sb,
+	struct nova_inode_log_page *curr_page, u64 epoch_id)
+{
+	curr_page->page_tail.epoch_id = epoch_id;
+}
+
+/* Reuse the inode log page structure */
+static inline void nova_set_next_link_page_address(struct super_block *sb,
+	struct nova_inode_log_page *curr_page, u64 next_page)
+{
+	curr_page->page_tail.next_page = next_page;
+}
+
+static int nova_delete_snapshot_list_entries(struct super_block *sb,
+	struct snapshot_list *list)
+{
+	struct snapshot_file_write_entry *w_entry = NULL;
+	struct snapshot_inode_entry *i_entry = NULL;
+	struct nova_inode_info_header sih;
+	void *addr;
+	u64 curr_p;
+	u8 type;
+
+	sih.ino = NOVA_SNAPSHOT_INO;
+	sih.i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+	sih.log_head = sih.log_tail = 0;
+
+	curr_p = list->head;
+	nova_dbg_verbose("Snapshot list head 0x%llx, tail 0x%lx\n",
+				curr_p, list->tail);
+	if (curr_p == 0 && list->tail == 0)
+		return 0;
+
+	while (curr_p != list->tail) {
+		if (goto_next_list_page(sb, curr_p)) {
+			curr_p = next_list_page(curr_p);
+			if (curr_p == list->tail)
+				break;
+		}
+
+		if (curr_p == 0) {
+			nova_err(sb, "Snapshot list is NULL!\n");
+			BUG();
+		}
+
+		addr = (void *)curr_p;
+		type = nova_get_entry_type(addr);
+
+		switch (type) {
+		case SS_INODE:
+			i_entry = (struct snapshot_inode_entry *)addr;
+			if (i_entry->deleted == 0)
+				nova_delete_dead_inode(sb, i_entry->nova_ino);
+			curr_p += sizeof(struct snapshot_inode_entry);
+			continue;
+		case SS_FILE_WRITE:
+			w_entry = (struct snapshot_file_write_entry *)addr;
+			if (w_entry->deleted == 0)
+				nova_free_data_blocks(sb, &sih, w_entry->nvmm,
+							w_entry->num_pages);
+			curr_p += sizeof(struct snapshot_file_write_entry);
+			continue;
+		default:
+			nova_err(sb, "unknown type %d, 0x%llx, tail 0x%llx\n",
+					type, curr_p, list->tail);
+			NOVA_ASSERT(0);
+			curr_p += sizeof(struct snapshot_file_write_entry);
+			continue;
+		}
+	}
+
+	return 0;
+}
+
+static inline int nova_background_clean_inode_entry(struct super_block *sb,
+	struct snapshot_inode_entry *i_entry, u64 epoch_id)
+{
+	if (i_entry->deleted == 0 && i_entry->delete_epoch_id <= epoch_id) {
+		nova_delete_dead_inode(sb, i_entry->nova_ino);
+		i_entry->deleted = 1;
+	}
+
+	return 0;
+}
+
+static inline int nova_background_clean_write_entry(struct super_block *sb,
+	struct snapshot_file_write_entry *w_entry,
+	struct nova_inode_info_header *sih, u64 epoch_id)
+{
+	if (w_entry->deleted == 0 && w_entry->delete_epoch_id <= epoch_id) {
+		nova_free_data_blocks(sb, sih, w_entry->nvmm,
+					w_entry->num_pages);
+		w_entry->deleted = 1;
+	}
+
+	return 0;
+}
+
+static int nova_background_clean_snapshot_list(struct super_block *sb,
+	struct snapshot_list *list, u64 epoch_id)
+{
+	struct nova_inode_log_page *curr_page;
+	struct nova_inode_info_header sih;
+	void *addr;
+	u64 curr_p;
+	u8 type;
+
+	sih.ino = NOVA_SNAPSHOT_INO;
+	sih.i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+	sih.log_head = sih.log_tail = 0;
+
+	curr_p = list->head;
+	nova_dbg_verbose("Snapshot list head 0x%llx, tail 0x%lx\n",
+				curr_p, list->tail);
+	if (curr_p == 0 && list->tail == 0)
+		return 0;
+
+	curr_page = (struct nova_inode_log_page *)curr_p;
+	while (curr_page->page_tail.epoch_id < epoch_id &&
+					curr_p != list->tail) {
+		if (goto_next_list_page(sb, curr_p)) {
+			curr_p = next_list_page(curr_p);
+			if (curr_p == list->tail)
+				break;
+			curr_page = (struct nova_inode_log_page *)curr_p;
+			if (curr_page->page_tail.epoch_id == epoch_id)
+				break;
+		}
+
+		if (curr_p == 0) {
+			nova_err(sb, "Snapshot list is NULL!\n");
+			BUG();
+		}
+
+		addr = (void *)curr_p;
+		type = nova_get_entry_type(addr);
+
+		switch (type) {
+		case SS_INODE:
+			nova_background_clean_inode_entry(sb, addr, epoch_id);
+			curr_p += sizeof(struct snapshot_inode_entry);
+			continue;
+		case SS_FILE_WRITE:
+			nova_background_clean_write_entry(sb, addr, &sih,
+								epoch_id);
+			curr_p += sizeof(struct snapshot_file_write_entry);
+			continue;
+		default:
+			nova_err(sb, "unknown type %d, 0x%llx, tail 0x%llx\n",
+					type, curr_p, list->tail);
+			NOVA_ASSERT(0);
+			curr_p += sizeof(struct snapshot_file_write_entry);
+			continue;
+		}
+	}
+
+	return 0;
+}
+
+static int nova_delete_snapshot_list_pages(struct super_block *sb,
+	struct snapshot_list *list)
+{
+	struct nova_inode_log_page *curr_page;
+	u64 curr_block = list->head;
+	int freed = 0;
+
+	while (curr_block) {
+		if (ENTRY_LOC(curr_block)) {
+			nova_dbg("%s: ERROR: invalid block %llu\n",
+					__func__, curr_block);
+			break;
+		}
+		curr_page = (struct nova_inode_log_page *)curr_block;
+		curr_block = curr_page->page_tail.next_page;
+		kfree(curr_page);
+		freed++;
+	}
+
+	return freed;
+}
+
+static int nova_delete_snapshot_list(struct super_block *sb,
+	struct snapshot_list *list, int delete_entries)
+{
+	if (delete_entries)
+		nova_delete_snapshot_list_entries(sb, list);
+	nova_delete_snapshot_list_pages(sb, list);
+	return 0;
+}
+
+static int nova_delete_snapshot_info(struct super_block *sb,
+	struct snapshot_info *info, int delete_entries)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_list *list;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		list = &info->lists[i];
+		mutex_lock(&list->list_mutex);
+		nova_delete_snapshot_list(sb, list, delete_entries);
+		mutex_unlock(&list->list_mutex);
+	}
+
+	kfree(info->lists);
+	return 0;
+}
+
+static int nova_initialize_snapshot_info_pages(struct super_block *sb,
+	struct snapshot_info *info, u64 epoch_id)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_list *list;
+	unsigned long new_page = 0;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		list = &info->lists[i];
+		new_page = (unsigned long)kmalloc(PAGE_SIZE,
+							GFP_KERNEL);
+		/* Aligned to PAGE_SIZE */
+		if (!new_page || ENTRY_LOC(new_page)) {
+			nova_dbg("%s: failed\n", __func__);
+			kfree((void *)new_page);
+			return -ENOMEM;
+		}
+
+		nova_set_link_page_epoch_id(sb, (void *)new_page, epoch_id);
+		nova_set_next_link_page_address(sb, (void *)new_page, 0);
+		list->tail = list->head = new_page;
+		list->num_pages = 1;
+	}
+
+	return 0;
+}
+
+static int nova_initialize_snapshot_info(struct super_block *sb,
+	struct snapshot_info **ret_info, int init_pages, u64 epoch_id)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *info;
+	struct snapshot_list *list;
+	int i;
+	int ret = 0;
+	timing_t init_snapshot_time;
+
+	NOVA_START_TIMING(init_snapshot_info_t, init_snapshot_time);
+
+	info = nova_alloc_snapshot_info(sb);
+	if (!info) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	info->lists = kzalloc(sbi->cpus * sizeof(struct snapshot_list),
+							GFP_KERNEL);
+
+	if (!info->lists) {
+		nova_free_snapshot_info(info);
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	for (i = 0; i < sbi->cpus; i++) {
+		list = &info->lists[i];
+		mutex_init(&list->list_mutex);
+	}
+
+	if (init_pages) {
+		ret = nova_initialize_snapshot_info_pages(sb, info, epoch_id);
+		if (ret)
+			goto fail;
+	}
+
+	*ret_info = info;
+out:
+	NOVA_END_TIMING(init_snapshot_info_t, init_snapshot_time);
+	return ret;
+
+fail:
+	for (i = 0; i < sbi->cpus; i++) {
+		list = &info->lists[i];
+		if (list->head)
+			kfree((void *)list->head);
+	}
+
+	kfree(info->lists);
+	nova_free_snapshot_info(info);
+
+	*ret_info = NULL;
+	goto out;
+}
+
+static void nova_write_snapshot_list_entry(struct super_block *sb,
+	struct snapshot_list *list, u64 curr_p, void *entry, size_t size)
+{
+	if (is_last_entry(curr_p, size)) {
+		nova_err(sb, "%s: write to page end? curr 0x%llx, size %lu\n",
+				__func__, curr_p, size);
+		return;
+	}
+
+	memcpy((void *)curr_p, entry, size);
+	list->tail = curr_p + size;
+}
+
+static int nova_append_snapshot_list_entry(struct super_block *sb,
+	struct snapshot_info *info, void *entry, size_t size)
+{
+	struct snapshot_list *list;
+	struct nova_inode_log_page *curr_page;
+	u64 curr_block;
+	int cpuid;
+	u64 curr_p;
+	u64 new_page = 0;
+
+	cpuid = smp_processor_id();
+	list = &info->lists[cpuid];
+
+retry:
+	mutex_lock(&list->list_mutex);
+	curr_p = list->tail;
+
+	if (new_page) {
+		/* Link prev block and newly allocated page */
+		curr_block = BLOCK_OFF(curr_p);
+		curr_page = (struct nova_inode_log_page *)curr_block;
+		nova_set_next_link_page_address(sb, curr_page, new_page);
+		list->num_pages++;
+	}
+
+	if ((is_last_entry(curr_p, size) && next_list_page(curr_p) == 0)) {
+		nova_set_entry_type((void *)curr_p, NEXT_PAGE);
+		if (new_page == 0) {
+			mutex_unlock(&list->list_mutex);
+			new_page = (unsigned long)kmalloc(PAGE_SIZE,
+						GFP_KERNEL);
+			if (!new_page || ENTRY_LOC(new_page)) {
+				kfree((void *)new_page);
+				nova_err(sb, "%s: allocation failed\n",
+						__func__);
+				return -ENOMEM;
+			}
+			nova_set_link_page_epoch_id(sb, (void *)new_page,
+						info->epoch_id);
+			nova_set_next_link_page_address(sb,
+						(void *)new_page, 0);
+			goto retry;
+		}
+	}
+
+	if (is_last_entry(curr_p, size)) {
+		nova_set_entry_type((void *)curr_p, NEXT_PAGE);
+		curr_p = next_list_page(curr_p);
+	}
+
+	nova_write_snapshot_list_entry(sb, list, curr_p, entry, size);
+	mutex_unlock(&list->list_mutex);
+
+	return 0;
+}
+
+/*
+ * An entry is deleteable if
+ * 1) It is created after the last snapshot, or
+ * 2) It is created and deleted during the same snapshot period.
+ */
+static int nova_old_entry_deleteable(struct super_block *sb,
+	u64 create_epoch_id, u64 delete_epoch_id,
+	struct snapshot_info **ret_info)
+{
+	struct snapshot_info *info = NULL;
+	int ret;
+
+	if (create_epoch_id == delete_epoch_id) {
+		/* Create and delete in the same epoch */
+		return 1;
+	}
+
+	ret = nova_find_target_snapshot_info(sb, create_epoch_id, &info);
+	if (ret == 0) {
+		/* Old entry does not belong to any snapshot */
+		return 1;
+	}
+
+	if (info->epoch_id >= delete_epoch_id) {
+		/* Create and delete in different epoch but same snapshot */
+		return 1;
+	}
+
+	*ret_info = info;
+	return 0;
+}
+
+static int nova_append_snapshot_file_write_entry(struct super_block *sb,
+	struct snapshot_info *info, u64 nvmm, u64 num_pages,
+	u64 delete_epoch_id)
+{
+	struct snapshot_file_write_entry entry;
+	int ret;
+	timing_t append_time;
+
+	if (!info) {
+		nova_dbg("%s: Snapshot info not found\n", __func__);
+		return -EINVAL;
+	}
+
+	NOVA_START_TIMING(append_snapshot_file_t, append_time);
+	nova_dbgv("Append file write entry: block %llu, %llu pages, delete epoch ID %llu to Snapshot epoch ID %llu\n",
+			nvmm, num_pages, delete_epoch_id,
+			info->epoch_id);
+
+	memset(&entry, 0, sizeof(struct snapshot_file_write_entry));
+	entry.type = SS_FILE_WRITE;
+	entry.deleted = 0;
+	entry.nvmm = nvmm;
+	entry.num_pages = num_pages;
+	entry.delete_epoch_id = delete_epoch_id;
+
+	ret = nova_append_snapshot_list_entry(sb, info, &entry,
+			sizeof(struct snapshot_file_write_entry));
+
+	NOVA_END_TIMING(append_snapshot_file_t, append_time);
+	return ret;
+}
+
+/* entry given to this function is a copy in dram */
+int nova_append_data_to_snapshot(struct super_block *sb,
+	struct nova_file_write_entry *entry, u64 nvmm, u64 num_pages,
+	u64 delete_epoch_id)
+{
+	struct snapshot_info *info = NULL;
+	int ret;
+
+	ret = nova_old_entry_deleteable(sb, entry->epoch_id,
+					delete_epoch_id, &info);
+	if (ret == 0)
+		nova_append_snapshot_file_write_entry(sb, info, nvmm,
+					num_pages, delete_epoch_id);
+
+	return ret;
+}
+
+static int nova_append_snapshot_inode_entry(struct super_block *sb,
+	struct nova_inode *pi, struct snapshot_info *info)
+{
+	struct snapshot_inode_entry entry;
+	int ret;
+	timing_t append_time;
+
+	if (!info) {
+		nova_dbg("%s: Snapshot info not found\n", __func__);
+		return -EINVAL;
+	}
+
+	NOVA_START_TIMING(append_snapshot_inode_t, append_time);
+	nova_dbgv("Append inode entry: inode %llu, delete epoch ID %llu to Snapshot epoch ID %llu\n",
+			pi->nova_ino, pi->delete_epoch_id,
+			info->epoch_id);
+
+	memset(&entry, 0, sizeof(struct snapshot_inode_entry));
+	entry.type = SS_INODE;
+	entry.deleted = 0;
+	entry.nova_ino = pi->nova_ino;
+	entry.delete_epoch_id = pi->delete_epoch_id;
+
+	ret = nova_append_snapshot_list_entry(sb, info, &entry,
+			sizeof(struct snapshot_inode_entry));
+
+	NOVA_END_TIMING(append_snapshot_inode_t, append_time);
+	return ret;
+}
+
+int nova_append_inode_to_snapshot(struct super_block *sb,
+	struct nova_inode *pi)
+{
+	struct snapshot_info *info = NULL;
+	int ret;
+
+	ret = nova_old_entry_deleteable(sb, pi->create_epoch_id,
+					pi->delete_epoch_id, &info);
+	if (ret == 0)
+		nova_append_snapshot_inode_entry(sb, pi, info);
+
+	return ret;
+}
+
+int nova_encounter_mount_snapshot(struct super_block *sb, void *addr,
+	u8 type)
+{
+	struct nova_dentry *dentry;
+	struct nova_setattr_logentry *attr_entry;
+	struct nova_link_change_entry *linkc_entry;
+	struct nova_file_write_entry *fw_entry;
+	struct nova_mmap_entry *mmap_entry;
+	int ret = 0;
+
+	switch (type) {
+	case SET_ATTR:
+		attr_entry = (struct nova_setattr_logentry *)addr;
+		if (pass_mount_snapshot(sb, attr_entry->epoch_id))
+			ret = 1;
+		break;
+	case LINK_CHANGE:
+		linkc_entry = (struct nova_link_change_entry *)addr;
+		if (pass_mount_snapshot(sb, linkc_entry->epoch_id))
+			ret = 1;
+		break;
+	case DIR_LOG:
+		dentry = (struct nova_dentry *)addr;
+		if (pass_mount_snapshot(sb, dentry->epoch_id))
+			ret = 1;
+		break;
+	case FILE_WRITE:
+		fw_entry = (struct nova_file_write_entry *)addr;
+		if (pass_mount_snapshot(sb, fw_entry->epoch_id))
+			ret = 1;
+		break;
+	case MMAP_WRITE:
+		mmap_entry = (struct nova_mmap_entry *)addr;
+		if (pass_mount_snapshot(sb, mmap_entry->epoch_id))
+			ret = 1;
+		break;
+	default:
+		break;
+	}
+
+	return ret;
+}
+
+static int nova_copy_snapshot_list_to_dram(struct super_block *sb,
+	struct snapshot_list *list, struct snapshot_nvmm_list *nvmm_list)
+{
+	struct nova_inode_log_page *dram_page;
+	void *curr_nvmm_addr;
+	u64 curr_nvmm_block;
+	u64 prev_dram_addr;
+	u64 curr_dram_addr;
+	unsigned long i;
+	int ret;
+
+	curr_dram_addr = list->head;
+	prev_dram_addr = list->head;
+	curr_nvmm_block = nvmm_list->head;
+	curr_nvmm_addr = nova_get_block(sb, curr_nvmm_block);
+
+	for (i = 0; i < nvmm_list->num_pages; i++) {
+		/* Leave next_page field alone */
+		ret = memcpy_mcsafe((void *)curr_dram_addr, curr_nvmm_addr,
+						LOG_BLOCK_TAIL);
+
+		if (ret < 0) {
+			nova_dbg("%s: Copy nvmm page %lu failed\n",
+					__func__, i);
+			continue;
+		}
+
+		dram_page = (struct nova_inode_log_page *)curr_dram_addr;
+		prev_dram_addr = curr_dram_addr;
+		curr_nvmm_block = next_log_page(sb, curr_nvmm_block);
+		if (curr_nvmm_block < 0)
+			break;
+		curr_nvmm_addr = nova_get_block(sb, curr_nvmm_block);
+		curr_dram_addr = dram_page->page_tail.next_page;
+	}
+
+	list->num_pages = nvmm_list->num_pages;
+	list->tail = prev_dram_addr + ENTRY_LOC(nvmm_list->tail);
+
+	return 0;
+}
+
+static int nova_allocate_snapshot_list_pages(struct super_block *sb,
+	struct snapshot_list *list, struct snapshot_nvmm_list *nvmm_list,
+	u64 epoch_id)
+{
+	unsigned long prev_page = 0;
+	unsigned long new_page = 0;
+	unsigned long i;
+
+	for (i = 0; i < nvmm_list->num_pages; i++) {
+		new_page = (unsigned long)kmalloc(PAGE_SIZE,
+							GFP_KERNEL);
+
+		if (!new_page) {
+			nova_dbg("%s ERROR: fail to allocate list pages\n",
+					__func__);
+			goto fail;
+		}
+
+		nova_set_link_page_epoch_id(sb, (void *)new_page, epoch_id);
+		nova_set_next_link_page_address(sb, (void *)new_page, 0);
+
+		if (i == 0)
+			list->head = new_page;
+
+		if (prev_page)
+			nova_set_next_link_page_address(sb, (void *)prev_page,
+							new_page);
+		prev_page = new_page;
+	}
+
+	return 0;
+
+fail:
+	nova_delete_snapshot_list_pages(sb, list);
+	return -ENOMEM;
+}
+
+static int nova_restore_snapshot_info_lists(struct super_block *sb,
+	struct snapshot_info *info, struct nova_snapshot_info_entry *entry,
+	u64 epoch_id)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_nvmm_page *nvmm_page;
+	struct snapshot_list *list;
+	struct snapshot_nvmm_list *nvmm_list;
+	int i;
+	int ret;
+
+	nvmm_page = (struct snapshot_nvmm_page *)nova_get_block(sb,
+						entry->nvmm_page_addr);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		list = &info->lists[i];
+		nvmm_list = &nvmm_page->lists[i];
+		if (!list || !nvmm_list) {
+			nova_dbg("%s: list NULL? list %p, nvmm list %p\n",
+					__func__, list, nvmm_list);
+			continue;
+		}
+
+		ret = nova_allocate_snapshot_list_pages(sb, list,
+						nvmm_list, info->epoch_id);
+		if (ret) {
+			nova_dbg("%s failure\n", __func__);
+			return ret;
+		}
+		nova_copy_snapshot_list_to_dram(sb, list, nvmm_list);
+	}
+
+	return 0;
+}
+
+static int nova_restore_snapshot_info(struct super_block *sb,
+	struct nova_snapshot_info_entry *entry, u64 epoch_id,
+	u64 timestamp, u64 curr_p, int just_init)
+{
+	struct snapshot_info *info = NULL;
+	int ret = 0;
+
+	nova_dbg("Restore snapshot epoch ID %llu\n", epoch_id);
+
+	/* Allocate list pages on demand later */
+	ret = nova_initialize_snapshot_info(sb, &info, just_init, epoch_id);
+	if (ret) {
+		nova_dbg("%s: initialize snapshot info failed %d\n",
+				__func__, ret);
+		goto fail;
+	}
+
+	info->epoch_id = epoch_id;
+	info->timestamp = timestamp;
+	info->snapshot_entry = curr_p;
+
+	if (just_init == 0) {
+		ret = nova_restore_snapshot_info_lists(sb, info,
+							entry, epoch_id);
+		if (ret)
+			goto fail;
+	}
+
+	ret = nova_insert_snapshot_info(sb, info);
+	return ret;
+
+fail:
+	nova_delete_snapshot_info(sb, info, 0);
+	return ret;
+}
+
+int nova_mount_snapshot(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u64 epoch_id;
+
+	epoch_id = sbi->mount_snapshot_epoch_id;
+	nova_dbg("Mount snapshot %llu\n", epoch_id);
+	return 0;
+}
+
+static int nova_free_nvmm_page(struct super_block *sb,
+	u64 nvmm_page_addr)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_nvmm_page *nvmm_page;
+	struct snapshot_nvmm_list *nvmm_list;
+	struct nova_inode_info_header sih;
+	unsigned long nvmm_blocknr;
+	int i;
+
+	if (nvmm_page_addr == 0)
+		return 0;
+
+	sih.ino = NOVA_SNAPSHOT_INO;
+	sih.i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+
+	nvmm_page = (struct snapshot_nvmm_page *)nova_get_block(sb,
+						nvmm_page_addr);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		nvmm_list = &nvmm_page->lists[i];
+		sih.log_head = nvmm_list->head;
+		sih.log_tail = nvmm_list->tail;
+		sih.alter_log_head = sih.alter_log_tail = 0;
+		nova_free_inode_log(sb, NULL, &sih);
+	}
+
+	nvmm_blocknr = nova_get_blocknr(sb, nvmm_page_addr, 0);
+	nova_free_log_blocks(sb, &sih, nvmm_blocknr, 1);
+	return 0;
+}
+
+static int nova_set_nvmm_page_addr(struct super_block *sb,
+	struct nova_snapshot_info_entry *entry, u64 nvmm_page_addr)
+{
+	nova_memunlock_range(sb, entry, CACHELINE_SIZE);
+	entry->nvmm_page_addr = nvmm_page_addr;
+	nova_update_entry_csum(entry);
+	nova_update_alter_entry(sb, entry);
+	nova_memlock_range(sb, entry, CACHELINE_SIZE);
+
+	return 0;
+}
+
+static int nova_clear_nvmm_page(struct super_block *sb,
+	struct nova_snapshot_info_entry *entry, int just_init)
+{
+	if (just_init)
+		/* No need to free because we do not set the bitmap. */
+		goto out;
+
+	nova_free_nvmm_page(sb, entry->nvmm_page_addr);
+
+out:
+	nova_set_nvmm_page_addr(sb, entry, 0);
+	return 0;
+}
+
+int nova_restore_snapshot_entry(struct super_block *sb,
+	struct nova_snapshot_info_entry *entry, u64 curr_p, int just_init)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u64 epoch_id, timestamp;
+	int ret = 0;
+
+	if (entry->deleted == 1)
+		goto out;
+
+	epoch_id = entry->epoch_id;
+	timestamp = entry->timestamp;
+
+	ret = nova_restore_snapshot_info(sb, entry, epoch_id,
+					timestamp, curr_p, just_init);
+	if (ret) {
+		nova_dbg("%s: Restore snapshot epoch ID %llu failed\n",
+				__func__, epoch_id);
+		goto out;
+	}
+
+	if (epoch_id > sbi->s_epoch_id)
+		sbi->s_epoch_id = epoch_id;
+
+out:
+	nova_clear_nvmm_page(sb, entry, just_init);
+
+	return ret;
+}
+
+static int nova_append_snapshot_info_log(struct super_block *sb,
+	struct snapshot_info *info, u64 epoch_id, u64 timestamp)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info *si = sbi->snapshot_si;
+	struct nova_inode *pi = nova_get_reserved_inode(sb, NOVA_SNAPSHOT_INO);
+	struct nova_inode_update update;
+	struct nova_snapshot_info_entry entry_info;
+	int ret;
+
+	entry_info.type = SNAPSHOT_INFO;
+	entry_info.deleted = 0;
+	entry_info.nvmm_page_addr = 0;
+	entry_info.epoch_id = epoch_id;
+	entry_info.timestamp = timestamp;
+
+	update.tail = update.alter_tail = 0;
+	ret = nova_append_snapshot_info_entry(sb, pi, si, info,
+					&entry_info, &update);
+	if (ret) {
+		nova_dbg("%s: append snapshot info entry failure\n", __func__);
+		return ret;
+	}
+
+	nova_memunlock_inode(sb, pi);
+	nova_update_inode(sb, &si->vfs_inode, pi, &update, 1);
+	nova_memlock_inode(sb, pi);
+
+	return 0;
+}
+
+int nova_create_snapshot(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *info = NULL;
+	u64 timestamp = 0;
+	u64 epoch_id;
+	int ret;
+	timing_t create_snapshot_time;
+
+	NOVA_START_TIMING(create_snapshot_t, create_snapshot_time);
+
+	mutex_lock(&sbi->s_lock);
+	sbi->snapshot_taking = 1;
+
+	/* Increase the epoch id, but use the old value as snapshot id */
+	epoch_id = sbi->s_epoch_id++;
+
+	/*
+	 * Mark the create_snapshot_epoch_id before starting the snapshot
+	 * creation. We will check this during in-place updates for metadata
+	 * and data, to prevent overwriting logs that might belong to a
+	 * snapshot that is being created.
+	 */
+	nova_info("%s: epoch id %llu\n", __func__, epoch_id);
+
+
+	timestamp = timespec_trunc(current_kernel_time(),
+				   sb->s_time_gran).tv_sec;
+
+	ret = nova_initialize_snapshot_info(sb, &info, 1, epoch_id);
+	if (ret) {
+		nova_dbg("%s: initialize snapshot info failed %d\n",
+				__func__, ret);
+		NOVA_END_TIMING(create_snapshot_t, create_snapshot_time);
+		goto out;
+	}
+
+	info->epoch_id = epoch_id;
+	info->timestamp = timestamp;
+
+	ret = nova_append_snapshot_info_log(sb, info, epoch_id, timestamp);
+	if (ret) {
+		nova_free_snapshot_info(info);
+		NOVA_END_TIMING(create_snapshot_t, create_snapshot_time);
+		goto out;
+	}
+
+	sbi->num_snapshots++;
+
+	ret = nova_insert_snapshot_info(sb, info);
+
+	nova_set_vmas_readonly(sb);
+
+	sbi->nova_sb->s_wtime = cpu_to_le32(get_seconds());
+	sbi->nova_sb->s_epoch_id = cpu_to_le64(epoch_id);
+	nova_update_super_crc(sb);
+
+	nova_sync_super(sb);
+
+out:
+	sbi->snapshot_taking = 0;
+	mutex_unlock(&sbi->s_lock);
+	wake_up_interruptible(&sbi->snapshot_mmap_wait);
+
+	NOVA_END_TIMING(create_snapshot_t, create_snapshot_time);
+	return ret;
+}
+
+static void wakeup_snapshot_cleaner(struct nova_sb_info *sbi)
+{
+	if (!waitqueue_active(&sbi->snapshot_cleaner_wait))
+		return;
+
+	nova_dbg("Wakeup snapshot cleaner thread\n");
+	wake_up_interruptible(&sbi->snapshot_cleaner_wait);
+}
+
+static int nova_link_to_next_snapshot(struct super_block *sb,
+	struct snapshot_info *prev_info, struct snapshot_info *next_info)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_list *prev_list, *next_list;
+	struct nova_inode_log_page *curr_page;
+	u64 curr_block, curr_p;
+	int i;
+
+	nova_dbg("Link deleted snapshot %llu to next snapshot %llu\n",
+			prev_info->epoch_id, next_info->epoch_id);
+
+	if (prev_info->epoch_id >= next_info->epoch_id)
+		nova_dbg("Error: prev epoch ID %llu higher than next epoch ID %llu\n",
+			prev_info->epoch_id, next_info->epoch_id);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		prev_list = &prev_info->lists[i];
+		next_list = &next_info->lists[i];
+
+		mutex_lock(&prev_list->list_mutex);
+		mutex_lock(&next_list->list_mutex);
+
+		/* Set NEXT_PAGE flag for prev lists */
+		curr_p = prev_list->tail;
+		if (!goto_next_list_page(sb, curr_p))
+			nova_set_entry_type((void *)curr_p, NEXT_PAGE);
+
+		/* Link the prev lists to the head of next lists */
+		curr_block = BLOCK_OFF(prev_list->tail);
+		curr_page = (struct nova_inode_log_page *)curr_block;
+		nova_set_next_link_page_address(sb, curr_page, next_list->head);
+
+		next_list->head = prev_list->head;
+		next_list->num_pages += prev_list->num_pages;
+
+		mutex_unlock(&next_list->list_mutex);
+		mutex_unlock(&prev_list->list_mutex);
+	}
+
+	sbi->curr_clean_snapshot_info = next_info;
+	wakeup_snapshot_cleaner(sbi);
+
+	return 0;
+}
+
+static int nova_invalidate_snapshot_entry(struct super_block *sb,
+	struct snapshot_info *info)
+{
+	struct nova_snapshot_info_entry *entry;
+	int ret;
+
+	entry = nova_get_block(sb, info->snapshot_entry);
+	ret = nova_invalidate_logentry(sb, entry, SNAPSHOT_INFO, 0);
+	return ret;
+}
+
+int nova_delete_snapshot(struct super_block *sb, u64 epoch_id)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *info = NULL;
+	struct snapshot_info *next = NULL;
+	int delete = 0;
+	int ret;
+	timing_t delete_snapshot_time;
+
+	NOVA_START_TIMING(delete_snapshot_t, delete_snapshot_time);
+	mutex_lock(&sbi->s_lock);
+	nova_info("Delete snapshot epoch ID %llu\n", epoch_id);
+
+	ret = nova_find_target_snapshot_info(sb, epoch_id, &info);
+	if (ret != 1 || info->epoch_id != epoch_id) {
+		nova_dbg("%s: Snapshot info not found\n", __func__);
+		goto out;
+	}
+
+	next = nova_find_next_snapshot_info(sb, info);
+
+	if (next) {
+		nova_link_to_next_snapshot(sb, info, next);
+	} else {
+		/* Delete the last snapshot. Find the previous one. */
+		delete = 1;
+	}
+
+	radix_tree_delete(&sbi->snapshot_info_tree, epoch_id);
+
+	nova_invalidate_snapshot_entry(sb, info);
+
+out:
+	sbi->num_snapshots--;
+	mutex_unlock(&sbi->s_lock);
+
+	if (delete)
+		nova_delete_snapshot_info(sb, info, 1);
+
+	nova_free_snapshot_info(info);
+
+	NOVA_END_TIMING(delete_snapshot_t, delete_snapshot_time);
+	return 0;
+}
+
+static int nova_copy_snapshot_list_to_nvmm(struct super_block *sb,
+	struct snapshot_list *list, struct snapshot_nvmm_list *nvmm_list,
+	u64 new_block)
+{
+	struct nova_inode_log_page *dram_page;
+	void *curr_nvmm_addr;
+	u64 curr_nvmm_block;
+	u64 prev_nvmm_block;
+	u64 curr_dram_addr;
+	unsigned long i;
+	size_t size = sizeof(struct snapshot_nvmm_list);
+
+	curr_dram_addr = list->head;
+	prev_nvmm_block = new_block;
+	curr_nvmm_block = new_block;
+	curr_nvmm_addr = nova_get_block(sb, curr_nvmm_block);
+
+	for (i = 0; i < list->num_pages; i++) {
+		/* Leave next_page field alone */
+		nova_memunlock_block(sb, curr_nvmm_addr);
+		memcpy_to_pmem_nocache(curr_nvmm_addr, (void *)curr_dram_addr,
+						LOG_BLOCK_TAIL);
+		nova_memlock_block(sb, curr_nvmm_addr);
+
+		dram_page = (struct nova_inode_log_page *)curr_dram_addr;
+		prev_nvmm_block = curr_nvmm_block;
+		curr_nvmm_block = next_log_page(sb, curr_nvmm_block);
+		if (curr_nvmm_block < 0)
+			break;
+		curr_nvmm_addr = nova_get_block(sb, curr_nvmm_block);
+		curr_dram_addr = dram_page->page_tail.next_page;
+	}
+
+	nova_memunlock_range(sb, nvmm_list, size);
+	nvmm_list->num_pages = list->num_pages;
+	nvmm_list->tail = prev_nvmm_block + ENTRY_LOC(list->tail);
+	nvmm_list->head = new_block;
+	nova_memlock_range(sb, nvmm_list, size);
+
+	nova_flush_buffer(nvmm_list, sizeof(struct snapshot_nvmm_list), 1);
+
+	return 0;
+}
+
+static int nova_save_snapshot_info(struct super_block *sb,
+	struct snapshot_info *info)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_snapshot_info_entry *entry;
+	struct nova_inode_info_header sih;
+	struct snapshot_list *list;
+	struct snapshot_nvmm_page *nvmm_page;
+	struct snapshot_nvmm_list *nvmm_list;
+	unsigned long num_pages;
+	int i;
+	u64 nvmm_page_addr;
+	u64 new_block;
+	int allocated;
+
+	sih.ino = NOVA_SNAPSHOT_INO;
+	sih.i_blk_type = 0;
+
+	/* Support up to 128 CPUs */
+	allocated = nova_allocate_inode_log_pages(sb, &sih, 1,
+						&nvmm_page_addr, ANY_CPU, 0);
+	if (allocated != 1) {
+		nova_dbg("Error allocating NVMM info page\n");
+		return -ENOSPC;
+	}
+
+	nvmm_page = (struct snapshot_nvmm_page *)nova_get_block(sb,
+							nvmm_page_addr);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		list = &info->lists[i];
+		num_pages = list->num_pages;
+		allocated = nova_allocate_inode_log_pages(sb, &sih,
+					num_pages, &new_block, i, 0);
+		if (allocated != num_pages) {
+			nova_dbg("Error saving snapshot list: %d\n", allocated);
+			return -ENOSPC;
+		}
+		nvmm_list = &nvmm_page->lists[i];
+		nova_copy_snapshot_list_to_nvmm(sb, list, nvmm_list, new_block);
+	}
+
+	entry = nova_get_block(sb, info->snapshot_entry);
+	nova_set_nvmm_page_addr(sb, entry, nvmm_page_addr);
+
+	return 0;
+}
+
+static int nova_print_snapshot_info(struct snapshot_info *info,
+	struct seq_file *seq)
+{
+	struct tm tm;
+	u64 epoch_id;
+	u64 timestamp;
+	unsigned long local_time;
+
+	epoch_id = info->epoch_id;
+	timestamp = info->timestamp;
+
+	local_time = timestamp - sys_tz.tz_minuteswest * 60;
+	time_to_tm(local_time, 0, &tm);
+	seq_printf(seq, "%8llu\t%4lu-%02d-%02d\t%02d:%02d:%02d\n",
+					info->epoch_id,
+					tm.tm_year + 1900, tm.tm_mon + 1,
+					tm.tm_mday,
+					tm.tm_hour, tm.tm_min, tm.tm_sec);
+	return 0;
+}
+
+int nova_print_snapshots(struct super_block *sb, struct seq_file *seq)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *info;
+	struct snapshot_info *infos[FREE_BATCH];
+	int nr_infos;
+	u64 epoch_id = 0;
+	int count = 0;
+	int i;
+
+	seq_puts(seq, "========== NOVA snapshot table ==========\n");
+	seq_puts(seq, "Epoch ID\t      Date\t    Time\n");
+
+	/* Print in epoch ID increasing order */
+	do {
+		nr_infos = radix_tree_gang_lookup(&sbi->snapshot_info_tree,
+					(void **)infos, epoch_id, FREE_BATCH);
+		for (i = 0; i < nr_infos; i++) {
+			info = infos[i];
+			BUG_ON(!info);
+			epoch_id = info->epoch_id;
+			nova_print_snapshot_info(info, seq);
+			count++;
+		}
+		epoch_id++;
+	} while (nr_infos == FREE_BATCH);
+
+	seq_printf(seq, "=========== Total %d snapshots ===========\n", count);
+	return 0;
+}
+
+int nova_print_snapshot_lists(struct super_block *sb, struct seq_file *seq)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *info;
+	struct snapshot_list *list;
+	struct snapshot_info *infos[FREE_BATCH];
+	int nr_infos;
+	u64 epoch_id = 0;
+	int count = 0;
+	int sum;
+	int i, j;
+
+	seq_puts(seq, "========== NOVA snapshot statistics ==========\n");
+
+	/* Print in epoch ID increasing order */
+	do {
+		nr_infos = radix_tree_gang_lookup(&sbi->snapshot_info_tree,
+					(void **)infos, epoch_id, FREE_BATCH);
+		for (i = 0; i < nr_infos; i++) {
+			info = infos[i];
+			BUG_ON(!info);
+			epoch_id = info->epoch_id;
+			sum = 0;
+			for (j = 0; j < sbi->cpus; j++) {
+				list = &info->lists[j];
+				sum += list->num_pages;
+			}
+			seq_printf(seq, "Snapshot epoch ID %llu, %d list pages\n",
+					epoch_id, sum);
+			count++;
+		}
+		epoch_id++;
+	} while (nr_infos == FREE_BATCH);
+
+	seq_printf(seq, "============= Total %d snapshots =============\n",
+			count);
+	return 0;
+}
+
+static int nova_traverse_and_delete_snapshot_infos(struct super_block *sb,
+	int save)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *info;
+	struct snapshot_info *infos[FREE_BATCH];
+	int nr_infos;
+	u64 epoch_id = 0;
+	int i;
+
+	do {
+		nr_infos = radix_tree_gang_lookup(&sbi->snapshot_info_tree,
+					(void **)infos, epoch_id, FREE_BATCH);
+		for (i = 0; i < nr_infos; i++) {
+			info = infos[i];
+			BUG_ON(!info);
+			epoch_id = info->epoch_id;
+			if (save)
+				nova_save_snapshot_info(sb, info);
+			nova_delete_snapshot_info(sb, info, 0);
+			radix_tree_delete(&sbi->snapshot_info_tree, epoch_id);
+			nova_free_snapshot_info(info);
+		}
+		epoch_id++;
+	} while (nr_infos == FREE_BATCH);
+
+	return 0;
+}
+
+int nova_save_snapshots(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (sbi->snapshot_cleaner_thread)
+		kthread_stop(sbi->snapshot_cleaner_thread);
+
+	if (sbi->mount_snapshot)
+		return 0;
+
+	return nova_traverse_and_delete_snapshot_infos(sb, 1);
+}
+
+int nova_destroy_snapshot_infos(struct super_block *sb)
+{
+	return nova_traverse_and_delete_snapshot_infos(sb, 0);
+}
+
+static void snapshot_cleaner_try_sleeping(struct nova_sb_info *sbi)
+{
+	DEFINE_WAIT(wait);
+
+	prepare_to_wait(&sbi->snapshot_cleaner_wait, &wait, TASK_INTERRUPTIBLE);
+	schedule();
+	finish_wait(&sbi->snapshot_cleaner_wait, &wait);
+}
+
+static int nova_clean_snapshot(struct nova_sb_info *sbi)
+{
+	struct super_block *sb = sbi->sb;
+	struct snapshot_info *info;
+	struct snapshot_list *list;
+	int i;
+
+	if (!sbi->curr_clean_snapshot_info)
+		return 0;
+
+	info = sbi->curr_clean_snapshot_info;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		list = &info->lists[i];
+
+		mutex_lock(&list->list_mutex);
+		nova_background_clean_snapshot_list(sb, list,
+							info->epoch_id);
+		mutex_unlock(&list->list_mutex);
+	}
+
+	sbi->curr_clean_snapshot_info = NULL;
+	return 0;
+}
+
+static int nova_snapshot_cleaner(void *arg)
+{
+	struct nova_sb_info *sbi = arg;
+
+	nova_dbg("Running snapshot cleaner thread\n");
+	for (;;) {
+		snapshot_cleaner_try_sleeping(sbi);
+
+		if (kthread_should_stop())
+			break;
+
+		nova_clean_snapshot(sbi);
+	}
+
+	if (sbi->curr_clean_snapshot_info)
+		nova_clean_snapshot(sbi);
+
+	return 0;
+}
+
+static int nova_snapshot_cleaner_init(struct nova_sb_info *sbi)
+{
+	int ret = 0;
+
+	init_waitqueue_head(&sbi->snapshot_cleaner_wait);
+
+	sbi->snapshot_cleaner_thread = kthread_run(nova_snapshot_cleaner,
+		sbi, "nova_snapshot_cleaner");
+	if (IS_ERR(sbi->snapshot_cleaner_thread)) {
+		nova_info("Failed to start NOVA snapshot cleaner thread\n");
+		ret = -1;
+	}
+	nova_info("Start NOVA snapshot cleaner thread.\n");
+	return ret;
+}
+
+int nova_snapshot_init(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info_header *sih;
+	u64 ino = NOVA_SNAPSHOT_INO;
+	int ret;
+
+	sih = &sbi->snapshot_si->header;
+	nova_init_header(sb, sih, 0);
+	sih->pi_addr = nova_get_reserved_inode_addr(sb, ino);
+	sih->alter_pi_addr = nova_get_alter_reserved_inode_addr(sb, ino);
+	sih->ino = ino;
+	sih->i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+
+	INIT_RADIX_TREE(&sbi->snapshot_info_tree, GFP_ATOMIC);
+	init_waitqueue_head(&sbi->snapshot_mmap_wait);
+	ret = nova_snapshot_cleaner_init(sbi);
+
+	return ret;
+}
+
diff --git a/fs/nova/snapshot.h b/fs/nova/snapshot.h
new file mode 100644
index 000000000000..948dfd557de4
--- /dev/null
+++ b/fs/nova/snapshot.h
@@ -0,0 +1,98 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Snapshot header
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+
+/*
+ * DRAM log of updates to a snapshot.
+ */
+struct snapshot_list {
+	struct mutex list_mutex;
+	unsigned long num_pages;
+	unsigned long head;
+	unsigned long tail;
+};
+
+
+/*
+ * DRAM info about a snapshot.
+ */
+struct snapshot_info {
+	u64	epoch_id;
+	u64	timestamp;
+	unsigned long snapshot_entry; /* PMEM pointer to the struct
+				       * snapshot_info_entry for this
+				       * snapshot
+				       */
+
+	struct snapshot_list *lists;	/* Per-CPU snapshot list */
+};
+
+
+enum nova_snapshot_entry_type {
+	SS_INODE = 1,
+	SS_FILE_WRITE,
+};
+
+/*
+ * Snapshot log entry for recording an inode operation in a snapshot log.
+ *
+ * Todo: add checksum
+ */
+struct snapshot_inode_entry {
+	u8	type;
+	u8	deleted;
+	u8	padding[6];
+	u64	padding64;
+	u64	nova_ino;          // inode number that was deleted.
+	u64	delete_epoch_id;   // Deleted when?
+} __attribute((__packed__));
+
+/*
+ * Snapshot log entry for recording a write operation in a snapshot log
+ *
+ * Todo: add checksum.
+ */
+struct snapshot_file_write_entry {
+	u8	type;
+	u8	deleted;
+	u8	padding[6];
+	u64	nvmm;
+	u64	num_pages;
+	u64	delete_epoch_id;
+} __attribute((__packed__));
+
+/*
+ * PMEM structure pointing to a log comprised of snapshot_inode_entry and
+ * snapshot_file_write_entry objects.
+ *
+ * TODO: add checksum
+ */
+struct snapshot_nvmm_list {
+	__le64 padding;
+	__le64 num_pages;
+	__le64 head;
+	__le64 tail;
+} __attribute((__packed__));
+
+/* Support up to 128 CPUs */
+struct snapshot_nvmm_page {
+	struct snapshot_nvmm_list lists[128];
+};
+
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6f543a47fc92..349e319b10f6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -415,6 +415,8 @@ struct vm_operations_struct {
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,
 					  unsigned long addr);
+	/* For NOVA DAX-mmap protection */
+	int (*dax_cow)(struct vm_area_struct * area, unsigned long address);
 };
 
 struct mmu_gather;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 45cdb27791a3..0b7667fe3dfb 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -342,6 +342,9 @@ struct vm_area_struct {
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+
+	/* Flag for NOVA DAX cow */
+	int original_write;
 };
 
 struct core_thread {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8edd0d576254..aa27a5517a75 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -275,6 +275,7 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 
 	return pages;
 }
+EXPORT_SYMBOL(change_protection);
 
 int
 mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
@@ -288,7 +289,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	int error;
 	int dirty_accountable = 0;
 
-	if (newflags == oldflags) {
+	if (newflags == oldflags && vma->original_write == 0) {
 		*pprev = vma;
 		return 0;
 	}
@@ -352,6 +353,16 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	change_protection(vma, start, end, vma->vm_page_prot,
 			  dirty_accountable, 0);
 
+	/* Update NOVA vma list */
+	if (vma->vm_ops && vma->vm_ops->dax_cow) {
+		if (!(oldflags & VM_WRITE) && (newflags & VM_WRITE)) {
+			vma->vm_ops->open(vma);
+		} else if (!(newflags & VM_WRITE)) {
+			if (vma->original_write || (oldflags & VM_WRITE))
+				vma->vm_ops->close(vma);
+		}
+	}
+
 	/*
 	 * Private VM_LOCKED VMA becoming writable: trigger COW to avoid major
 	 * fault on access.

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 11/16] NOVA: Snapshot support
@ 2017-08-03  7:49   ` Steven Swanson
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:49 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson, dan.j.williams

Nova supports snapshots to facilitate backups.

Taking a snapshot
-----------------

Each Nova file system has a current epoch_id in the super block, and each log
entry has the epoch_id attached to it at creation.  When the user creates a
snapshot, Nova increments the epoch_id for the file system, and the old epoch_id
identifies the moment the snapshot was taken.
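
As a rough illustration of the epoch bookkeeping described above, the
following user-space sketch models only the counter.  The names toy_fs and
toy_create_snapshot are hypothetical and are not part of the patch; the real
code additionally takes locks, appends a log entry and syncs the super block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the file system's epoch state. */
struct toy_fs {
	uint64_t epoch_id;	/* current epoch, stamped on new log entries */
	uint64_t num_snapshots;
};

/*
 * Take a snapshot: the old epoch value becomes the snapshot's ID and
 * the file system moves on to the next epoch.
 */
static uint64_t toy_create_snapshot(struct toy_fs *fs)
{
	uint64_t snapshot_id;

	assert(fs != NULL);
	snapshot_id = fs->epoch_id++;	/* post-increment: keep the old value */
	fs->num_snapshots++;
	return snapshot_id;
}
```

Every log entry created after the call carries the new epoch_id, so it can be
told apart from entries that existed when the snapshot was taken.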

Nova records the epoch_id and a timestamp in a new log entry (struct
nova_snapshot_info_entry) and appends it to the log of the reserved snapshot
inode (NOVA_SNAPSHOT_INO) in the superblock.

Nova also maintains a radix tree (nova_sb_info.snapshot_info_tree) of struct
snapshot_info in DRAM indexed by epoch_id.

Nova also marks all mmap'd pages as read-only and uses COW to preserve file
contents after the snapshot.

Tracking Live Data
------------------

Supporting snapshots requires Nova to preserve file contents from previous
snapshots while also being able to recover the space a snapshot occupied after
its deletion.

Preserving file contents requires a small change to how Nova implements write
operations.  To perform a write, Nova appends a write log entry to the file's
log.  The log entry includes pointers to newly-allocated and populated NVMM
pages that hold the written data.  If the write overwrites existing data, Nova
locates the previous write log entry for that portion of the file, and performs
an "epoch check" that compares the old log entry's epoch_id to the file
system's current epoch_id.  If the two epoch_ids match, the old write log entry
and the file data blocks it points to do not belong to any snapshot, and
Nova reclaims the data blocks.

If the epoch_ids do not match, then the data in the old log entry belongs to
an earlier snapshot and Nova leaves the log entry in place.
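
The epoch check can be sketched as a single comparison.  This is a minimal
user-space model; struct toy_write_entry and toy_can_reclaim are hypothetical
names chosen for illustration and do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, minimal view of a write log entry. */
struct toy_write_entry {
	uint64_t epoch_id;	/* epoch in which the entry was created */
};

/*
 * Epoch check: an overwritten entry may be reclaimed immediately only if
 * it was created in the current epoch, so that no snapshot references it.
 */
static bool toy_can_reclaim(const struct toy_write_entry *old,
			    uint64_t current_epoch_id)
{
	assert(old != NULL);
	/* Entries are never stamped with a future epoch. */
	assert(old->epoch_id <= current_epoch_id);
	return old->epoch_id == current_epoch_id;
}
```

When the function returns false, the old entry and its data blocks stay in
place because an earlier snapshot may still need them.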

Determining when to reclaim data belonging to deleted snapshots requires
additional bookkeeping.  For each snapshot, Nova maintains a "snapshot log"
that records the inodes and blocks that belong to that snapshot, but are not
part of the current file system image.

Nova populates the snapshot log during the epoch check: If the epoch_ids for
the new and old log entries do not match, it appends a log entry (either struct
snapshot_inode_entry or struct snapshot_file_write_entry) to the snapshot log
that the old log entry belongs to.  The log entry contains a pointer to the old
log entry, and the filesystem's current epoch_id as the delete_epoch_id.

To delete a snapshot, Nova removes the snapshot from the list of live snapshots
and appends its log to the following snapshot's log.  Then, a background thread
traverses the combined log and reclaims dead inode/data based on the delete
epoch_id: If the delete epoch_id for an entry in the log is less than or equal
to the snapshot's epoch_id, it means the log entry and/or the associated data
blocks are now dead.
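
The reclamation rule used by the background thread can be sketched as
follows.  This is a simplified user-space model over a flat array; the real
snapshot log is a linked list of pages, and struct toy_snap_entry and
toy_clean_log are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down snapshot log entry. */
struct toy_snap_entry {
	uint8_t  deleted;		/* already reclaimed? */
	uint64_t num_pages;		/* data pages the entry covers */
	uint64_t delete_epoch_id;	/* epoch in which the data was replaced */
};

/*
 * Walk a combined snapshot log and reclaim every live entry whose
 * delete_epoch_id is less than or equal to the surviving snapshot's
 * epoch_id.  Returns the number of pages reclaimed.
 */
static uint64_t toy_clean_log(struct toy_snap_entry *log, size_t n,
			      uint64_t snapshot_epoch_id)
{
	uint64_t freed = 0;
	size_t i;

	assert(log != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (log[i].deleted == 0 &&
		    log[i].delete_epoch_id <= snapshot_epoch_id) {
			freed += log[i].num_pages;
			log[i].deleted = 1;	/* never reclaim twice */
		}
	}
	return freed;
}
```

Marking an entry as deleted makes the pass idempotent: running the cleaner
again over the same log reclaims nothing further.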



Snapshots and DAX
-----------------

Taking consistent snapshots while applications are modifying files using
DAX-style mmap requires NOVA to reckon with the order in which stores to NVMM
become persistent (i.e., reach physical NVMM so they will survive a system
failure).  These applications rely on the processor's "memory persistence
model" [http://dl.acm.org/citation.cfm?id=2665671.2665712] to make guarantees
about when and in what order stores become persistent.  These guarantees allow
the applications to restore their data to a consistent state during recovery
from a system failure.

From the application's perspective, reading a snapshot is equivalent to
recovering from a system failure.  In both cases, the contents of the
memory-mapped file reflect its state at a moment when application operations
might be in-flight and when the application had no chance to shut down cleanly.

A naive approach to checkpointing mmap()'d files in NOVA would simply mark each
of the read/write mapped pages as read-only and then do copy-on-write when a
store occurs to preserve the old pages as part of the snapshot.

However, this approach can leave the snapshot in an inconsistent state:
Setting the page to read-only captures its contents for the
snapshot, and the kernel requires NOVA to set the pages as read-only
one at a time.  So, if the order in which NOVA marks pages as read-only
is incompatible with the ordering that the application requires, the snapshot
will contain an inconsistent version of the file.

To resolve this problem, when NOVA starts marking pages as read-only, it blocks
page faults to the read-only mmap()'d pages until it has marked all the pages
read-only and finished taking the snapshot.

More detail is available in the technical report referenced at the top of this
document.

We have implemented this functionality in NOVA by adding the 'original_write'
flag to struct vm_area_struct, which tracks whether the vm_area_struct was
created with write permission but has been marked read-only while taking a
snapshot.  We have also added a 'dax_cow' operation to struct
vm_operations_struct that the page fault handler runs when applications write
to a page with original_write = 1.  NOVA's dax_cow operation
(nova_restore_page_write()) performs the COW, maps the page to a new physical
page and allows writing.
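
The write-fault handling described in this section can be summarized as a
small decision function.  This is only a sketch under stated assumptions: the
types and names below (toy_vma, toy_fault_action, toy_write_fault) are
hypothetical, and the real implementation blocks on a wait queue rather than
returning a value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcomes of a write fault on a DAX-mapped page. */
enum toy_fault_action {
	TOY_FAULT_NORMAL,	/* mapping was not touched by a snapshot */
	TOY_FAULT_WAIT,		/* snapshot in progress: block the fault */
	TOY_FAULT_COW,		/* snapshot done: copy the page, then write */
};

/* Hypothetical stand-in for the flag added to struct vm_area_struct. */
struct toy_vma {
	bool original_write;	/* writable mapping made read-only by a snapshot */
};

static enum toy_fault_action toy_write_fault(const struct toy_vma *vma,
					     bool snapshot_taking)
{
	assert(vma != NULL);
	if (!vma->original_write)
		return TOY_FAULT_NORMAL;
	if (snapshot_taking)
		return TOY_FAULT_WAIT;
	return TOY_FAULT_COW;
}
```

Blocking faults while the snapshot is in progress is what keeps a store to an
already-protected page from becoming visible before the remaining pages have
been captured.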


Saving Snapshot State
---------------------

During a clean shutdown, Nova stores the snapshot information to PMEM.

Nova reserves an inode for storing snapshot information.  The log for the inode
contains an entry for each snapshot (struct nova_snapshot_info_entry).  On
shutdown, Nova allocates one page (struct snapshot_nvmm_page) to store an array
of struct snapshot_nvmm_list.

Each of these lists (one per CPU) contains head and tail pointers to a linked
list of blocks (just like an inode log).  The lists contain a struct
snapshot_file_write_entry or struct snapshot_inode_entry for each operation
that modified file data or an inode.

Superblock
+--------------------+
|   ...              |
+--------------------+
| Reserved Inodes    |
+---+----------------+
|   |     ..         |
+---+----------------+
| 7 | Snapshot Inode |
|   | head           |
+---+----------------+
        /
       /
      /
+---------+---------+---------+
|  Snap   |  Snap   |  Snap   |
| epoch=1 | epoch=4 | epoch=11|
|         |         |         |
|nvmm_page|nvmm_page|nvmm_page|
+---------+---------+---------+
     |
     |
+----------+   +--------+--------+
|  cpu 0   |   | snap   | snap   |
|   head   |-->| inode  | write  |
|          |   | entry  | entry  |
|          |   +--------+--------+
+----------+   +--------+--------+
|  cpu 1   |   | snap   | snap   |
|   head   |-->| write  | write  |
|          |   | entry  | entry  |
|          |   +--------+--------+
+----------+
|    ...   |
+----------+   +--------+
|  cpu 127 |   | snap   |
|   head   |-->| inode  |
|          |   | entry  |
|          |   +--------+
+----------+
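
The 128-CPU limit follows from the size of the per-CPU list descriptor: four
64-bit fields take 32 bytes, and 128 of them fill one 4KB page.  The sketch
below mirrors the layout of struct snapshot_nvmm_list and struct
snapshot_nvmm_page from snapshot.h using user-space types; the toy_ names and
the helper toy_list_for_cpu are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_CPUS 128	/* matches the lists[128] array in snapshot.h */

/* Per-CPU list descriptor: 4 x 8 bytes = 32 bytes. */
struct toy_nvmm_list {
	uint64_t padding;
	uint64_t num_pages;
	uint64_t head;
	uint64_t tail;
} __attribute__((__packed__));

/* One page holding a descriptor for every supported CPU. */
struct toy_nvmm_page {
	struct toy_nvmm_list lists[TOY_MAX_CPUS];
};

/* Return the descriptor for a given CPU, refusing out-of-range indices. */
static struct toy_nvmm_list *toy_list_for_cpu(struct toy_nvmm_page *page,
					      unsigned int cpu)
{
	assert(page != NULL);
	assert(cpu < TOY_MAX_CPUS);
	return &page->lists[cpu];
}
```

Valid CPU indices therefore run from 0 to 127; a system with more CPUs would
need a larger structure or more than one page.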

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 arch/x86/mm/fault.c      |   11 
 fs/nova/snapshot.c       | 1407 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/snapshot.h       |   98 +++
 include/linux/mm.h       |    2 
 include/linux/mm_types.h |    3 
 mm/mprotect.c            |   13 
 6 files changed, 1533 insertions(+), 1 deletion(-)
 create mode 100644 fs/nova/snapshot.c
 create mode 100644 fs/nova/snapshot.h

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 8ad91a01cbc8..34430601c7c0 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1431,6 +1431,17 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	 * we can handle it..
 	 */
 good_area:
+
+	if (error_code & PF_WRITE) {
+		/* write, present and write, not present: */
+		if (vma->original_write && vma->vm_ops &&
+					vma->vm_ops->dax_cow) {
+			up_read(&mm->mmap_sem);
+			vma->vm_ops->dax_cow(vma, address);
+			down_read(&mm->mmap_sem);
+		}
+	}
+
 	if (unlikely(access_error(error_code, vma))) {
 		bad_area_access_error(regs, error_code, address, vma);
 		return;
diff --git a/fs/nova/snapshot.c b/fs/nova/snapshot.c
new file mode 100644
index 000000000000..088b56c0d38c
--- /dev/null
+++ b/fs/nova/snapshot.c
@@ -0,0 +1,1407 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Snapshot support
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include "nova.h"
+#include "inode.h"
+#include "super.h"
+
+static inline u64 next_list_page(u64 curr_p)
+{
+	void *curr_addr = (void *)curr_p;
+	unsigned long page_tail = ((unsigned long)curr_addr & ~PAGE_OFFSET_MASK)
+					+ LOG_BLOCK_TAIL;
+	return ((struct nova_inode_page_tail *)page_tail)->next_page;
+}
+
+static inline bool goto_next_list_page(struct super_block *sb, u64 curr_p)
+{
+	void *addr;
+	u8 type;
+
+	/* Each kind of entry takes at least 32 bytes */
+	if (ENTRY_LOC(curr_p) + 32 > LOG_BLOCK_TAIL)
+		return true;
+
+	addr = (void *)curr_p;
+	type = nova_get_entry_type(addr);
+	if (type == NEXT_PAGE)
+		return true;
+
+	return false;
+}
+
+static int nova_find_target_snapshot_info(struct super_block *sb,
+	u64 epoch_id, struct snapshot_info **ret_info)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *infos[1];
+	int nr_infos;
+	int ret = 0;
+
+	nr_infos = radix_tree_gang_lookup(&sbi->snapshot_info_tree,
+					(void **)infos, epoch_id, 1);
+	if (nr_infos == 1) {
+		*ret_info = infos[0];
+		ret = 1;
+	}
+
+	return ret;
+}
+
+static struct snapshot_info *
+nova_find_next_snapshot_info(struct super_block *sb, struct snapshot_info *info)
+{
+	struct snapshot_info *ret_info = NULL;
+	int ret;
+
+	ret = nova_find_target_snapshot_info(sb, info->epoch_id + 1, &ret_info);
+
+	if (ret == 1 && ret_info->epoch_id <= info->epoch_id) {
+		nova_err(sb, "info epoch id %llu, next epoch id %llu\n",
+				info->epoch_id, ret_info->epoch_id);
+		ret_info = NULL;
+	}
+
+	return ret_info;
+}
+
+static int nova_insert_snapshot_info(struct super_block *sb,
+	struct snapshot_info *info)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	int ret;
+
+	ret = radix_tree_insert(&sbi->snapshot_info_tree, info->epoch_id, info);
+	if (ret)
+		nova_dbg("%s ERROR %d\n", __func__, ret);
+
+	return ret;
+}
+
+/* Reuse the inode log page structure */
+static inline void nova_set_link_page_epoch_id(struct super_block *sb,
+	struct nova_inode_log_page *curr_page, u64 epoch_id)
+{
+	curr_page->page_tail.epoch_id = epoch_id;
+}
+
+/* Reuse the inode log page structure */
+static inline void nova_set_next_link_page_address(struct super_block *sb,
+	struct nova_inode_log_page *curr_page, u64 next_page)
+{
+	curr_page->page_tail.next_page = next_page;
+}
+
+static int nova_delete_snapshot_list_entries(struct super_block *sb,
+	struct snapshot_list *list)
+{
+	struct snapshot_file_write_entry *w_entry = NULL;
+	struct snapshot_inode_entry *i_entry = NULL;
+	struct nova_inode_info_header sih;
+	void *addr;
+	u64 curr_p;
+	u8 type;
+
+	sih.ino = NOVA_SNAPSHOT_INO;
+	sih.i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+	sih.log_head = sih.log_tail = 0;
+
+	curr_p = list->head;
+	nova_dbg_verbose("Snapshot list head 0x%llx, tail 0x%lx\n",
+				curr_p, list->tail);
+	if (curr_p == 0 && list->tail == 0)
+		return 0;
+
+	while (curr_p != list->tail) {
+		if (goto_next_list_page(sb, curr_p)) {
+			curr_p = next_list_page(curr_p);
+			if (curr_p == list->tail)
+				break;
+		}
+
+		if (curr_p == 0) {
+			nova_err(sb, "Snapshot list is NULL!\n");
+			BUG();
+		}
+
+		addr = (void *)curr_p;
+		type = nova_get_entry_type(addr);
+
+		switch (type) {
+		case SS_INODE:
+			i_entry = (struct snapshot_inode_entry *)addr;
+			if (i_entry->deleted == 0)
+				nova_delete_dead_inode(sb, i_entry->nova_ino);
+			curr_p += sizeof(struct snapshot_inode_entry);
+			continue;
+		case SS_FILE_WRITE:
+			w_entry = (struct snapshot_file_write_entry *)addr;
+			if (w_entry->deleted == 0)
+				nova_free_data_blocks(sb, &sih, w_entry->nvmm,
+							w_entry->num_pages);
+			curr_p += sizeof(struct snapshot_file_write_entry);
+			continue;
+		default:
+			nova_err(sb, "unknown type %d, 0x%llx, tail 0x%lx\n",
+					type, curr_p, list->tail);
+			NOVA_ASSERT(0);
+			curr_p += sizeof(struct snapshot_file_write_entry);
+			continue;
+		}
+	}
+
+	return 0;
+}
+
+static inline int nova_background_clean_inode_entry(struct super_block *sb,
+	struct snapshot_inode_entry *i_entry, u64 epoch_id)
+{
+	if (i_entry->deleted == 0 && i_entry->delete_epoch_id <= epoch_id) {
+		nova_delete_dead_inode(sb, i_entry->nova_ino);
+		i_entry->deleted = 1;
+	}
+
+	return 0;
+}
+
+static inline int nova_background_clean_write_entry(struct super_block *sb,
+	struct snapshot_file_write_entry *w_entry,
+	struct nova_inode_info_header *sih, u64 epoch_id)
+{
+	if (w_entry->deleted == 0 && w_entry->delete_epoch_id <= epoch_id) {
+		nova_free_data_blocks(sb, sih, w_entry->nvmm,
+					w_entry->num_pages);
+		w_entry->deleted = 1;
+	}
+
+	return 0;
+}
+
+static int nova_background_clean_snapshot_list(struct super_block *sb,
+	struct snapshot_list *list, u64 epoch_id)
+{
+	struct nova_inode_log_page *curr_page;
+	struct nova_inode_info_header sih;
+	void *addr;
+	u64 curr_p;
+	u8 type;
+
+	sih.ino = NOVA_SNAPSHOT_INO;
+	sih.i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+	sih.log_head = sih.log_tail = 0;
+
+	curr_p = list->head;
+	nova_dbg_verbose("Snapshot list head 0x%llx, tail 0x%lx\n",
+				curr_p, list->tail);
+	if (curr_p == 0 && list->tail == 0)
+		return 0;
+
+	curr_page = (struct nova_inode_log_page *)curr_p;
+	while (curr_page->page_tail.epoch_id < epoch_id &&
+					curr_p != list->tail) {
+		if (goto_next_list_page(sb, curr_p)) {
+			curr_p = next_list_page(curr_p);
+			if (curr_p == list->tail)
+				break;
+			curr_page = (struct nova_inode_log_page *)curr_p;
+			if (curr_page->page_tail.epoch_id == epoch_id)
+				break;
+		}
+
+		if (curr_p == 0) {
+			nova_err(sb, "Snapshot list is NULL!\n");
+			BUG();
+		}
+
+		addr = (void *)curr_p;
+		type = nova_get_entry_type(addr);
+
+		switch (type) {
+		case SS_INODE:
+			nova_background_clean_inode_entry(sb, addr, epoch_id);
+			curr_p += sizeof(struct snapshot_inode_entry);
+			continue;
+		case SS_FILE_WRITE:
+			nova_background_clean_write_entry(sb, addr, &sih,
+								epoch_id);
+			curr_p += sizeof(struct snapshot_file_write_entry);
+			continue;
+		default:
+			nova_err(sb, "unknown type %d, 0x%llx, tail 0x%lx\n",
+					type, curr_p, list->tail);
+			NOVA_ASSERT(0);
+			curr_p += sizeof(struct snapshot_file_write_entry);
+			continue;
+		}
+	}
+
+	return 0;
+}
+
+static int nova_delete_snapshot_list_pages(struct super_block *sb,
+	struct snapshot_list *list)
+{
+	struct nova_inode_log_page *curr_page;
+	u64 curr_block = list->head;
+	int freed = 0;
+
+	while (curr_block) {
+		if (ENTRY_LOC(curr_block)) {
+			nova_dbg("%s: ERROR: invalid block %llu\n",
+					__func__, curr_block);
+			break;
+		}
+		curr_page = (struct nova_inode_log_page *)curr_block;
+		curr_block = curr_page->page_tail.next_page;
+		kfree(curr_page);
+		freed++;
+	}
+
+	return freed;
+}
+
+static int nova_delete_snapshot_list(struct super_block *sb,
+	struct snapshot_list *list, int delete_entries)
+{
+	if (delete_entries)
+		nova_delete_snapshot_list_entries(sb, list);
+	nova_delete_snapshot_list_pages(sb, list);
+	return 0;
+}
+
+static int nova_delete_snapshot_info(struct super_block *sb,
+	struct snapshot_info *info, int delete_entries)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_list *list;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		list = &info->lists[i];
+		mutex_lock(&list->list_mutex);
+		nova_delete_snapshot_list(sb, list, delete_entries);
+		mutex_unlock(&list->list_mutex);
+	}
+
+	kfree(info->lists);
+	return 0;
+}
+
+static int nova_initialize_snapshot_info_pages(struct super_block *sb,
+	struct snapshot_info *info, u64 epoch_id)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_list *list;
+	unsigned long new_page = 0;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		list = &info->lists[i];
+		new_page = (unsigned long)kmalloc(PAGE_SIZE,
+							GFP_KERNEL);
+		/* Aligned to PAGE_SIZE */
+		if (!new_page || ENTRY_LOC(new_page)) {
+			nova_dbg("%s: failed\n", __func__);
+			kfree((void *)new_page);
+			return -ENOMEM;
+		}
+
+		nova_set_link_page_epoch_id(sb, (void *)new_page, epoch_id);
+		nova_set_next_link_page_address(sb, (void *)new_page, 0);
+		list->tail = list->head = new_page;
+		list->num_pages = 1;
+	}
+
+	return 0;
+}
+
+static int nova_initialize_snapshot_info(struct super_block *sb,
+	struct snapshot_info **ret_info, int init_pages, u64 epoch_id)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *info;
+	struct snapshot_list *list;
+	int i;
+	int ret = 0;
+	timing_t init_snapshot_time;
+
+	NOVA_START_TIMING(init_snapshot_info_t, init_snapshot_time);
+
+	info = nova_alloc_snapshot_info(sb);
+	if (!info) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	info->lists = kzalloc(sbi->cpus * sizeof(struct snapshot_list),
+							GFP_KERNEL);
+
+	if (!info->lists) {
+		nova_free_snapshot_info(info);
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	for (i = 0; i < sbi->cpus; i++) {
+		list = &info->lists[i];
+		mutex_init(&list->list_mutex);
+	}
+
+	if (init_pages) {
+		ret = nova_initialize_snapshot_info_pages(sb, info, epoch_id);
+		if (ret)
+			goto fail;
+	}
+
+	*ret_info = info;
+out:
+	NOVA_END_TIMING(init_snapshot_info_t, init_snapshot_time);
+	return ret;
+
+fail:
+	for (i = 0; i < sbi->cpus; i++) {
+		list = &info->lists[i];
+		if (list->head)
+			kfree((void *)list->head);
+	}
+
+	kfree(info->lists);
+	nova_free_snapshot_info(info);
+
+	*ret_info = NULL;
+	goto out;
+}
+
+static void nova_write_snapshot_list_entry(struct super_block *sb,
+	struct snapshot_list *list, u64 curr_p, void *entry, size_t size)
+{
+	if (is_last_entry(curr_p, size)) {
+		nova_err(sb, "%s: write to page end? curr 0x%llx, size %zu\n",
+				__func__, curr_p, size);
+		return;
+	}
+
+	memcpy((void *)curr_p, entry, size);
+	list->tail = curr_p + size;
+}
+
+static int nova_append_snapshot_list_entry(struct super_block *sb,
+	struct snapshot_info *info, void *entry, size_t size)
+{
+	struct snapshot_list *list;
+	struct nova_inode_log_page *curr_page;
+	u64 curr_block;
+	int cpuid;
+	u64 curr_p;
+	u64 new_page = 0;
+
+	cpuid = smp_processor_id();
+	list = &info->lists[cpuid];
+
+retry:
+	mutex_lock(&list->list_mutex);
+	curr_p = list->tail;
+
+	if (new_page) {
+		/* Link prev block and newly allocated page */
+		curr_block = BLOCK_OFF(curr_p);
+		curr_page = (struct nova_inode_log_page *)curr_block;
+		nova_set_next_link_page_address(sb, curr_page, new_page);
+		list->num_pages++;
+	}
+
+	if (is_last_entry(curr_p, size) && next_list_page(curr_p) == 0) {
+		nova_set_entry_type((void *)curr_p, NEXT_PAGE);
+		if (new_page == 0) {
+			mutex_unlock(&list->list_mutex);
+			new_page = (unsigned long)kmalloc(PAGE_SIZE,
+						GFP_KERNEL);
+			if (!new_page || ENTRY_LOC(new_page)) {
+				kfree((void *)new_page);
+				nova_err(sb, "%s: allocation failed\n",
+						__func__);
+				return -ENOMEM;
+			}
+			nova_set_link_page_epoch_id(sb, (void *)new_page,
+						info->epoch_id);
+			nova_set_next_link_page_address(sb,
+						(void *)new_page, 0);
+			goto retry;
+		}
+	}
+
+	if (is_last_entry(curr_p, size)) {
+		nova_set_entry_type((void *)curr_p, NEXT_PAGE);
+		curr_p = next_list_page(curr_p);
+	}
+
+	nova_write_snapshot_list_entry(sb, list, curr_p, entry, size);
+	mutex_unlock(&list->list_mutex);
+
+	return 0;
+}
+
+/*
+ * An entry is deleteable if
+ * 1) It is created after the last snapshot, or
+ * 2) It is created and deleted during the same snapshot period.
+ */
+static int nova_old_entry_deleteable(struct super_block *sb,
+	u64 create_epoch_id, u64 delete_epoch_id,
+	struct snapshot_info **ret_info)
+{
+	struct snapshot_info *info = NULL;
+	int ret;
+
+	if (create_epoch_id == delete_epoch_id) {
+		/* Create and delete in the same epoch */
+		return 1;
+	}
+
+	ret = nova_find_target_snapshot_info(sb, create_epoch_id, &info);
+	if (ret == 0) {
+		/* Old entry does not belong to any snapshot */
+		return 1;
+	}
+
+	if (info->epoch_id >= delete_epoch_id) {
+		/* Create and delete in different epoch but same snapshot */
+		return 1;
+	}
+
+	*ret_info = info;
+	return 0;
+}
+
+static int nova_append_snapshot_file_write_entry(struct super_block *sb,
+	struct snapshot_info *info, u64 nvmm, u64 num_pages,
+	u64 delete_epoch_id)
+{
+	struct snapshot_file_write_entry entry;
+	int ret;
+	timing_t append_time;
+
+	if (!info) {
+		nova_dbg("%s: Snapshot info not found\n", __func__);
+		return -EINVAL;
+	}
+
+	NOVA_START_TIMING(append_snapshot_file_t, append_time);
+	nova_dbgv("Append file write entry: block %llu, %llu pages, delete epoch ID %llu to Snapshot epoch ID %llu\n",
+			nvmm, num_pages, delete_epoch_id,
+			info->epoch_id);
+
+	memset(&entry, 0, sizeof(struct snapshot_file_write_entry));
+	entry.type = SS_FILE_WRITE;
+	entry.deleted = 0;
+	entry.nvmm = nvmm;
+	entry.num_pages = num_pages;
+	entry.delete_epoch_id = delete_epoch_id;
+
+	ret = nova_append_snapshot_list_entry(sb, info, &entry,
+			sizeof(struct snapshot_file_write_entry));
+
+	NOVA_END_TIMING(append_snapshot_file_t, append_time);
+	return ret;
+}
+
+/* The entry passed to this function is a copy in DRAM */
+int nova_append_data_to_snapshot(struct super_block *sb,
+	struct nova_file_write_entry *entry, u64 nvmm, u64 num_pages,
+	u64 delete_epoch_id)
+{
+	struct snapshot_info *info = NULL;
+	int ret;
+
+	ret = nova_old_entry_deleteable(sb, entry->epoch_id,
+					delete_epoch_id, &info);
+	if (ret == 0)
+		nova_append_snapshot_file_write_entry(sb, info, nvmm,
+					num_pages, delete_epoch_id);
+
+	return ret;
+}
+
+static int nova_append_snapshot_inode_entry(struct super_block *sb,
+	struct nova_inode *pi, struct snapshot_info *info)
+{
+	struct snapshot_inode_entry entry;
+	int ret;
+	timing_t append_time;
+
+	if (!info) {
+		nova_dbg("%s: Snapshot info not found\n", __func__);
+		return -EINVAL;
+	}
+
+	NOVA_START_TIMING(append_snapshot_inode_t, append_time);
+	nova_dbgv("Append inode entry: inode %llu, delete epoch ID %llu to Snapshot epoch ID %llu\n",
+			pi->nova_ino, pi->delete_epoch_id,
+			info->epoch_id);
+
+	memset(&entry, 0, sizeof(struct snapshot_inode_entry));
+	entry.type = SS_INODE;
+	entry.deleted = 0;
+	entry.nova_ino = pi->nova_ino;
+	entry.delete_epoch_id = pi->delete_epoch_id;
+
+	ret = nova_append_snapshot_list_entry(sb, info, &entry,
+			sizeof(struct snapshot_inode_entry));
+
+	NOVA_END_TIMING(append_snapshot_inode_t, append_time);
+	return ret;
+}
+
+int nova_append_inode_to_snapshot(struct super_block *sb,
+	struct nova_inode *pi)
+{
+	struct snapshot_info *info = NULL;
+	int ret;
+
+	ret = nova_old_entry_deleteable(sb, pi->create_epoch_id,
+					pi->delete_epoch_id, &info);
+	if (ret == 0)
+		nova_append_snapshot_inode_entry(sb, pi, info);
+
+	return ret;
+}
+
+int nova_encounter_mount_snapshot(struct super_block *sb, void *addr,
+	u8 type)
+{
+	struct nova_dentry *dentry;
+	struct nova_setattr_logentry *attr_entry;
+	struct nova_link_change_entry *linkc_entry;
+	struct nova_file_write_entry *fw_entry;
+	struct nova_mmap_entry *mmap_entry;
+	int ret = 0;
+
+	switch (type) {
+	case SET_ATTR:
+		attr_entry = (struct nova_setattr_logentry *)addr;
+		if (pass_mount_snapshot(sb, attr_entry->epoch_id))
+			ret = 1;
+		break;
+	case LINK_CHANGE:
+		linkc_entry = (struct nova_link_change_entry *)addr;
+		if (pass_mount_snapshot(sb, linkc_entry->epoch_id))
+			ret = 1;
+		break;
+	case DIR_LOG:
+		dentry = (struct nova_dentry *)addr;
+		if (pass_mount_snapshot(sb, dentry->epoch_id))
+			ret = 1;
+		break;
+	case FILE_WRITE:
+		fw_entry = (struct nova_file_write_entry *)addr;
+		if (pass_mount_snapshot(sb, fw_entry->epoch_id))
+			ret = 1;
+		break;
+	case MMAP_WRITE:
+		mmap_entry = (struct nova_mmap_entry *)addr;
+		if (pass_mount_snapshot(sb, mmap_entry->epoch_id))
+			ret = 1;
+		break;
+	default:
+		break;
+	}
+
+	return ret;
+}
+
+static int nova_copy_snapshot_list_to_dram(struct super_block *sb,
+	struct snapshot_list *list, struct snapshot_nvmm_list *nvmm_list)
+{
+	struct nova_inode_log_page *dram_page;
+	void *curr_nvmm_addr;
+	u64 curr_nvmm_block;
+	u64 prev_dram_addr;
+	u64 curr_dram_addr;
+	unsigned long i;
+	int ret;
+
+	curr_dram_addr = list->head;
+	prev_dram_addr = list->head;
+	curr_nvmm_block = nvmm_list->head;
+	curr_nvmm_addr = nova_get_block(sb, curr_nvmm_block);
+
+	for (i = 0; i < nvmm_list->num_pages; i++) {
+		/* Leave next_page field alone */
+		ret = memcpy_mcsafe((void *)curr_dram_addr, curr_nvmm_addr,
+						LOG_BLOCK_TAIL);
+
+		if (ret < 0) {
+			nova_dbg("%s: Copy nvmm page %lu failed\n",
+					__func__, i);
+			continue;
+		}
+
+		dram_page = (struct nova_inode_log_page *)curr_dram_addr;
+		prev_dram_addr = curr_dram_addr;
+		curr_nvmm_block = next_log_page(sb, curr_nvmm_block);
+		if ((s64)curr_nvmm_block < 0)
+			break;
+		curr_nvmm_addr = nova_get_block(sb, curr_nvmm_block);
+		curr_dram_addr = dram_page->page_tail.next_page;
+	}
+
+	list->num_pages = nvmm_list->num_pages;
+	list->tail = prev_dram_addr + ENTRY_LOC(nvmm_list->tail);
+
+	return 0;
+}
+
+static int nova_allocate_snapshot_list_pages(struct super_block *sb,
+	struct snapshot_list *list, struct snapshot_nvmm_list *nvmm_list,
+	u64 epoch_id)
+{
+	unsigned long prev_page = 0;
+	unsigned long new_page = 0;
+	unsigned long i;
+
+	for (i = 0; i < nvmm_list->num_pages; i++) {
+		new_page = (unsigned long)kmalloc(PAGE_SIZE,
+							GFP_KERNEL);
+
+		if (!new_page) {
+			nova_dbg("%s ERROR: fail to allocate list pages\n",
+					__func__);
+			goto fail;
+		}
+
+		nova_set_link_page_epoch_id(sb, (void *)new_page, epoch_id);
+		nova_set_next_link_page_address(sb, (void *)new_page, 0);
+
+		if (i == 0)
+			list->head = new_page;
+
+		if (prev_page)
+			nova_set_next_link_page_address(sb, (void *)prev_page,
+							new_page);
+		prev_page = new_page;
+	}
+
+	return 0;
+fail:
+	nova_delete_snapshot_list_pages(sb, list);
+	list->head = 0;
+	return -ENOMEM;
+}
+
+static int nova_restore_snapshot_info_lists(struct super_block *sb,
+	struct snapshot_info *info, struct nova_snapshot_info_entry *entry,
+	u64 epoch_id)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_nvmm_page *nvmm_page;
+	struct snapshot_list *list;
+	struct snapshot_nvmm_list *nvmm_list;
+	int i;
+	int ret;
+
+	nvmm_page = (struct snapshot_nvmm_page *)nova_get_block(sb,
+						entry->nvmm_page_addr);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		list = &info->lists[i];
+		nvmm_list = &nvmm_page->lists[i];
+		if (!list || !nvmm_list) {
+			nova_dbg("%s: list NULL? list %p, nvmm list %p\n",
+					__func__, list, nvmm_list);
+			continue;
+		}
+
+		ret = nova_allocate_snapshot_list_pages(sb, list,
+						nvmm_list, info->epoch_id);
+		if (ret) {
+			nova_dbg("%s failure\n", __func__);
+			return ret;
+		}
+		nova_copy_snapshot_list_to_dram(sb, list, nvmm_list);
+	}
+
+	return 0;
+}
+
+static int nova_restore_snapshot_info(struct super_block *sb,
+	struct nova_snapshot_info_entry *entry, u64 epoch_id,
+	u64 timestamp, u64 curr_p, int just_init)
+{
+	struct snapshot_info *info = NULL;
+	int ret = 0;
+
+	nova_dbg("Restore snapshot epoch ID %llu\n", epoch_id);
+
+	/* Allocate list pages on demand later */
+	ret = nova_initialize_snapshot_info(sb, &info, just_init, epoch_id);
+	if (ret) {
+		nova_dbg("%s: initialize snapshot info failed %d\n",
+				__func__, ret);
+		return ret;
+	}
+
+	info->epoch_id = epoch_id;
+	info->timestamp = timestamp;
+	info->snapshot_entry = curr_p;
+
+	if (just_init == 0) {
+		ret = nova_restore_snapshot_info_lists(sb, info,
+							entry, epoch_id);
+		if (ret)
+			goto fail;
+	}
+
+	return nova_insert_snapshot_info(sb, info);
+
+fail:
+	nova_delete_snapshot_info(sb, info, 0);
+	nova_free_snapshot_info(info);
+	return ret;
+}
+
+int nova_mount_snapshot(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u64 epoch_id;
+
+	epoch_id = sbi->mount_snapshot_epoch_id;
+	nova_dbg("Mount snapshot %llu\n", epoch_id);
+	return 0;
+}
+
+static int nova_free_nvmm_page(struct super_block *sb,
+	u64 nvmm_page_addr)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_nvmm_page *nvmm_page;
+	struct snapshot_nvmm_list *nvmm_list;
+	struct nova_inode_info_header sih;
+	unsigned long nvmm_blocknr;
+	int i;
+
+	if (nvmm_page_addr == 0)
+		return 0;
+
+	sih.ino = NOVA_SNAPSHOT_INO;
+	sih.i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+
+	nvmm_page = (struct snapshot_nvmm_page *)nova_get_block(sb,
+						nvmm_page_addr);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		nvmm_list = &nvmm_page->lists[i];
+		sih.log_head = nvmm_list->head;
+		sih.log_tail = nvmm_list->tail;
+		sih.alter_log_head = sih.alter_log_tail = 0;
+		nova_free_inode_log(sb, NULL, &sih);
+	}
+
+	nvmm_blocknr = nova_get_blocknr(sb, nvmm_page_addr, 0);
+	nova_free_log_blocks(sb, &sih, nvmm_blocknr, 1);
+	return 0;
+}
+
+static int nova_set_nvmm_page_addr(struct super_block *sb,
+	struct nova_snapshot_info_entry *entry, u64 nvmm_page_addr)
+{
+	nova_memunlock_range(sb, entry, CACHELINE_SIZE);
+	entry->nvmm_page_addr = nvmm_page_addr;
+	nova_update_entry_csum(entry);
+	nova_update_alter_entry(sb, entry);
+	nova_memlock_range(sb, entry, CACHELINE_SIZE);
+
+	return 0;
+}
+
+static int nova_clear_nvmm_page(struct super_block *sb,
+	struct nova_snapshot_info_entry *entry, int just_init)
+{
+	if (just_init)
+		/* No need to free because we do not set the bitmap. */
+		goto out;
+
+	nova_free_nvmm_page(sb, entry->nvmm_page_addr);
+
+out:
+	nova_set_nvmm_page_addr(sb, entry, 0);
+	return 0;
+}
+
+int nova_restore_snapshot_entry(struct super_block *sb,
+	struct nova_snapshot_info_entry *entry, u64 curr_p, int just_init)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u64 epoch_id, timestamp;
+	int ret = 0;
+
+	if (entry->deleted == 1)
+		goto out;
+
+	epoch_id = entry->epoch_id;
+	timestamp = entry->timestamp;
+
+	ret = nova_restore_snapshot_info(sb, entry, epoch_id,
+					timestamp, curr_p, just_init);
+	if (ret) {
+		nova_dbg("%s: Restore snapshot epoch ID %llu failed\n",
+				__func__, epoch_id);
+		goto out;
+	}
+
+	if (epoch_id > sbi->s_epoch_id)
+		sbi->s_epoch_id = epoch_id;
+
+out:
+	nova_clear_nvmm_page(sb, entry, just_init);
+
+	return ret;
+}
+
+static int nova_append_snapshot_info_log(struct super_block *sb,
+	struct snapshot_info *info, u64 epoch_id, u64 timestamp)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info *si = sbi->snapshot_si;
+	struct nova_inode *pi = nova_get_reserved_inode(sb, NOVA_SNAPSHOT_INO);
+	struct nova_inode_update update;
+	struct nova_snapshot_info_entry entry_info;
+	int ret;
+
+	entry_info.type = SNAPSHOT_INFO;
+	entry_info.deleted = 0;
+	entry_info.nvmm_page_addr = 0;
+	entry_info.epoch_id = epoch_id;
+	entry_info.timestamp = timestamp;
+
+	update.tail = update.alter_tail = 0;
+	ret = nova_append_snapshot_info_entry(sb, pi, si, info,
+					&entry_info, &update);
+	if (ret) {
+		nova_dbg("%s: append snapshot info entry failure\n", __func__);
+		return ret;
+	}
+
+	nova_memunlock_inode(sb, pi);
+	nova_update_inode(sb, &si->vfs_inode, pi, &update, 1);
+	nova_memlock_inode(sb, pi);
+
+	return 0;
+}
+
+int nova_create_snapshot(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *info = NULL;
+	u64 timestamp = 0;
+	u64 epoch_id;
+	int ret;
+	timing_t create_snapshot_time;
+
+	NOVA_START_TIMING(create_snapshot_t, create_snapshot_time);
+
+	mutex_lock(&sbi->s_lock);
+	sbi->snapshot_taking = 1;
+
+	/* Increase the epoch id, but use the old value as snapshot id */
+	epoch_id = sbi->s_epoch_id++;
+
+	/*
+	 * Mark the create_snapshot_epoch_id before starting the snapshot
+	 * creation. We will check this during in-place updates for metadata
+	 * and data, to prevent overwriting logs that might belong to a
+	 * snapshot that is being created.
+	 */
+	nova_info("%s: epoch id %llu\n", __func__, epoch_id);
+
+
+	timestamp = timespec_trunc(current_kernel_time(),
+				   sb->s_time_gran).tv_sec;
+
+	ret = nova_initialize_snapshot_info(sb, &info, 1, epoch_id);
+	if (ret) {
+		nova_dbg("%s: initialize snapshot info failed %d\n",
+				__func__, ret);
+		goto out;
+	}
+
+	info->epoch_id = epoch_id;
+	info->timestamp = timestamp;
+
+	ret = nova_append_snapshot_info_log(sb, info, epoch_id, timestamp);
+	if (ret) {
+		/* Release the per-CPU lists and their pages, not just info. */
+		nova_delete_snapshot_info(sb, info, 0);
+		nova_free_snapshot_info(info);
+		goto out;
+	}
+
+	sbi->num_snapshots++;
+
+	ret = nova_insert_snapshot_info(sb, info);
+
+	nova_set_vmas_readonly(sb);
+
+	sbi->nova_sb->s_wtime = cpu_to_le32(get_seconds());
+	sbi->nova_sb->s_epoch_id = cpu_to_le64(epoch_id);
+	nova_update_super_crc(sb);
+
+	nova_sync_super(sb);
+
+out:
+	sbi->snapshot_taking = 0;
+	mutex_unlock(&sbi->s_lock);
+	wake_up_interruptible(&sbi->snapshot_mmap_wait);
+
+	NOVA_END_TIMING(create_snapshot_t, create_snapshot_time);
+	return ret;
+}
+
+static void wakeup_snapshot_cleaner(struct nova_sb_info *sbi)
+{
+	if (!waitqueue_active(&sbi->snapshot_cleaner_wait))
+		return;
+
+	nova_dbg("Wakeup snapshot cleaner thread\n");
+	wake_up_interruptible(&sbi->snapshot_cleaner_wait);
+}
+
+static int nova_link_to_next_snapshot(struct super_block *sb,
+	struct snapshot_info *prev_info, struct snapshot_info *next_info)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_list *prev_list, *next_list;
+	struct nova_inode_log_page *curr_page;
+	u64 curr_block, curr_p;
+	int i;
+
+	nova_dbg("Link deleted snapshot %llu to next snapshot %llu\n",
+			prev_info->epoch_id, next_info->epoch_id);
+
+	if (prev_info->epoch_id >= next_info->epoch_id)
+		nova_dbg("Error: prev epoch ID %llu higher than next epoch ID %llu\n",
+			prev_info->epoch_id, next_info->epoch_id);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		prev_list = &prev_info->lists[i];
+		next_list = &next_info->lists[i];
+
+		mutex_lock(&prev_list->list_mutex);
+		mutex_lock(&next_list->list_mutex);
+
+		/* Set NEXT_PAGE flag for prev lists */
+		curr_p = prev_list->tail;
+		if (!goto_next_list_page(sb, curr_p))
+			nova_set_entry_type((void *)curr_p, NEXT_PAGE);
+
+		/* Link the prev lists to the head of next lists */
+		curr_block = BLOCK_OFF(prev_list->tail);
+		curr_page = (struct nova_inode_log_page *)curr_block;
+		nova_set_next_link_page_address(sb, curr_page, next_list->head);
+
+		next_list->head = prev_list->head;
+		next_list->num_pages += prev_list->num_pages;
+
+		mutex_unlock(&next_list->list_mutex);
+		mutex_unlock(&prev_list->list_mutex);
+	}
+
+	sbi->curr_clean_snapshot_info = next_info;
+	wakeup_snapshot_cleaner(sbi);
+
+	return 0;
+}
+
+static int nova_invalidate_snapshot_entry(struct super_block *sb,
+	struct snapshot_info *info)
+{
+	struct nova_snapshot_info_entry *entry;
+	int ret;
+
+	entry = nova_get_block(sb, info->snapshot_entry);
+	ret = nova_invalidate_logentry(sb, entry, SNAPSHOT_INFO, 0);
+	return ret;
+}
+
+int nova_delete_snapshot(struct super_block *sb, u64 epoch_id)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *info = NULL;
+	struct snapshot_info *next = NULL;
+	int delete = 0;
+	int ret;
+	timing_t delete_snapshot_time;
+
+	NOVA_START_TIMING(delete_snapshot_t, delete_snapshot_time);
+	mutex_lock(&sbi->s_lock);
+	nova_info("Delete snapshot epoch ID %llu\n", epoch_id);
+
+	ret = nova_find_target_snapshot_info(sb, epoch_id, &info);
+	if (ret != 1 || info->epoch_id != epoch_id) {
+		nova_dbg("%s: Snapshot info not found\n", __func__);
+		mutex_unlock(&sbi->s_lock);
+		ret = -ENOENT;
+		goto out;
+	}
+
+	next = nova_find_next_snapshot_info(sb, info);
+	if (next) {
+		nova_link_to_next_snapshot(sb, info, next);
+	} else {
+		/* No newer snapshot: free the list entries as well. */
+		delete = 1;
+	}
+
+	radix_tree_delete(&sbi->snapshot_info_tree, epoch_id);
+	nova_invalidate_snapshot_entry(sb, info);
+
+	sbi->num_snapshots--;
+	mutex_unlock(&sbi->s_lock);
+
+	if (delete)
+		nova_delete_snapshot_info(sb, info, 1);
+
+	nova_free_snapshot_info(info);
+	ret = 0;
+out:
+	NOVA_END_TIMING(delete_snapshot_t, delete_snapshot_time);
+	return ret;
+}
+
+static int nova_copy_snapshot_list_to_nvmm(struct super_block *sb,
+	struct snapshot_list *list, struct snapshot_nvmm_list *nvmm_list,
+	u64 new_block)
+{
+	struct nova_inode_log_page *dram_page;
+	void *curr_nvmm_addr;
+	u64 curr_nvmm_block;
+	u64 prev_nvmm_block;
+	u64 curr_dram_addr;
+	unsigned long i;
+	size_t size = sizeof(struct snapshot_nvmm_list);
+
+	curr_dram_addr = list->head;
+	prev_nvmm_block = new_block;
+	curr_nvmm_block = new_block;
+	curr_nvmm_addr = nova_get_block(sb, curr_nvmm_block);
+
+	for (i = 0; i < list->num_pages; i++) {
+		/* Leave next_page field alone */
+		nova_memunlock_block(sb, curr_nvmm_addr);
+		memcpy_to_pmem_nocache(curr_nvmm_addr, (void *)curr_dram_addr,
+						LOG_BLOCK_TAIL);
+		nova_memlock_block(sb, curr_nvmm_addr);
+
+		dram_page = (struct nova_inode_log_page *)curr_dram_addr;
+		prev_nvmm_block = curr_nvmm_block;
+		curr_nvmm_block = next_log_page(sb, curr_nvmm_block);
+		if ((s64)curr_nvmm_block < 0)
+			break;
+		curr_nvmm_addr = nova_get_block(sb, curr_nvmm_block);
+		curr_dram_addr = dram_page->page_tail.next_page;
+	}
+
+	nova_memunlock_range(sb, nvmm_list, size);
+	nvmm_list->num_pages = list->num_pages;
+	nvmm_list->tail = prev_nvmm_block + ENTRY_LOC(list->tail);
+	nvmm_list->head = new_block;
+	nova_memlock_range(sb, nvmm_list, size);
+
+	nova_flush_buffer(nvmm_list, sizeof(struct snapshot_nvmm_list), 1);
+
+	return 0;
+}
+
+static int nova_save_snapshot_info(struct super_block *sb,
+	struct snapshot_info *info)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_snapshot_info_entry *entry;
+	struct nova_inode_info_header sih;
+	struct snapshot_list *list;
+	struct snapshot_nvmm_page *nvmm_page;
+	struct snapshot_nvmm_list *nvmm_list;
+	unsigned long num_pages;
+	int i;
+	u64 nvmm_page_addr;
+	u64 new_block;
+	int allocated;
+
+	sih.ino = NOVA_SNAPSHOT_INO;
+	sih.i_blk_type = 0;
+
+	/* Support up to 128 CPUs */
+	allocated = nova_allocate_inode_log_pages(sb, &sih, 1,
+						&nvmm_page_addr, ANY_CPU, 0);
+	if (allocated != 1) {
+		nova_dbg("Error allocating NVMM info page\n");
+		return -ENOSPC;
+	}
+
+	nvmm_page = (struct snapshot_nvmm_page *)nova_get_block(sb,
+							nvmm_page_addr);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		list = &info->lists[i];
+		num_pages = list->num_pages;
+		allocated = nova_allocate_inode_log_pages(sb, &sih,
+					num_pages, &new_block, i, 0);
+		if (allocated != num_pages) {
+			nova_dbg("Error saving snapshot list: %d\n", allocated);
+			return -ENOSPC;
+		}
+		nvmm_list = &nvmm_page->lists[i];
+		nova_copy_snapshot_list_to_nvmm(sb, list, nvmm_list, new_block);
+	}
+
+	entry = nova_get_block(sb, info->snapshot_entry);
+	nova_set_nvmm_page_addr(sb, entry, nvmm_page_addr);
+
+	return 0;
+}
+
+static int nova_print_snapshot_info(struct snapshot_info *info,
+	struct seq_file *seq)
+{
+	struct tm tm;
+	u64 epoch_id;
+	u64 timestamp;
+	unsigned long local_time;
+
+	epoch_id = info->epoch_id;
+	timestamp = info->timestamp;
+
+	local_time = timestamp - sys_tz.tz_minuteswest * 60;
+	time_to_tm(local_time, 0, &tm);
+	seq_printf(seq, "%8llu\t%4lu-%02d-%02d\t%02d:%02d:%02d\n",
+					info->epoch_id,
+					tm.tm_year + 1900, tm.tm_mon + 1,
+					tm.tm_mday,
+					tm.tm_hour, tm.tm_min, tm.tm_sec);
+	return 0;
+}
+
+int nova_print_snapshots(struct super_block *sb, struct seq_file *seq)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *info;
+	struct snapshot_info *infos[FREE_BATCH];
+	int nr_infos;
+	u64 epoch_id = 0;
+	int count = 0;
+	int i;
+
+	seq_puts(seq, "========== NOVA snapshot table ==========\n");
+	seq_puts(seq, "Epoch ID\t      Date\t    Time\n");
+
+	/* Print in epoch ID increasing order */
+	do {
+		nr_infos = radix_tree_gang_lookup(&sbi->snapshot_info_tree,
+					(void **)infos, epoch_id, FREE_BATCH);
+		for (i = 0; i < nr_infos; i++) {
+			info = infos[i];
+			BUG_ON(!info);
+			epoch_id = info->epoch_id;
+			nova_print_snapshot_info(info, seq);
+			count++;
+		}
+		epoch_id++;
+	} while (nr_infos == FREE_BATCH);
+
+	seq_printf(seq, "=========== Total %d snapshots ===========\n", count);
+	return 0;
+}
+
+int nova_print_snapshot_lists(struct super_block *sb, struct seq_file *seq)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *info;
+	struct snapshot_list *list;
+	struct snapshot_info *infos[FREE_BATCH];
+	int nr_infos;
+	u64 epoch_id = 0;
+	int count = 0;
+	int sum;
+	int i, j;
+
+	seq_puts(seq, "========== NOVA snapshot statistics ==========\n");
+
+	/* Print in epoch ID increasing order */
+	do {
+		nr_infos = radix_tree_gang_lookup(&sbi->snapshot_info_tree,
+					(void **)infos, epoch_id, FREE_BATCH);
+		for (i = 0; i < nr_infos; i++) {
+			info = infos[i];
+			BUG_ON(!info);
+			epoch_id = info->epoch_id;
+			sum = 0;
+			for (j = 0; j < sbi->cpus; j++) {
+				list = &info->lists[j];
+				sum += list->num_pages;
+			}
+			seq_printf(seq, "Snapshot epoch ID %llu, %d list pages\n",
+					epoch_id, sum);
+			count++;
+		}
+		epoch_id++;
+	} while (nr_infos == FREE_BATCH);
+
+	seq_printf(seq, "============= Total %d snapshots =============\n",
+			count);
+	return 0;
+}
+
+static int nova_traverse_and_delete_snapshot_infos(struct super_block *sb,
+	int save)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *info;
+	struct snapshot_info *infos[FREE_BATCH];
+	int nr_infos;
+	u64 epoch_id = 0;
+	int i;
+
+	do {
+		nr_infos = radix_tree_gang_lookup(&sbi->snapshot_info_tree,
+					(void **)infos, epoch_id, FREE_BATCH);
+		for (i = 0; i < nr_infos; i++) {
+			info = infos[i];
+			BUG_ON(!info);
+			epoch_id = info->epoch_id;
+			if (save)
+				nova_save_snapshot_info(sb, info);
+			nova_delete_snapshot_info(sb, info, 0);
+			radix_tree_delete(&sbi->snapshot_info_tree, epoch_id);
+			nova_free_snapshot_info(info);
+		}
+		epoch_id++;
+	} while (nr_infos == FREE_BATCH);
+
+	return 0;
+}
+
+int nova_save_snapshots(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (sbi->snapshot_cleaner_thread)
+		kthread_stop(sbi->snapshot_cleaner_thread);
+
+	if (sbi->mount_snapshot)
+		return 0;
+
+	return nova_traverse_and_delete_snapshot_infos(sb, 1);
+}
+
+int nova_destroy_snapshot_infos(struct super_block *sb)
+{
+	return nova_traverse_and_delete_snapshot_infos(sb, 0);
+}
+
+static void snapshot_cleaner_try_sleeping(struct nova_sb_info *sbi)
+{
+	DEFINE_WAIT(wait);
+
+	prepare_to_wait(&sbi->snapshot_cleaner_wait, &wait, TASK_INTERRUPTIBLE);
+	schedule();
+	finish_wait(&sbi->snapshot_cleaner_wait, &wait);
+}
+
+static int nova_clean_snapshot(struct nova_sb_info *sbi)
+{
+	struct super_block *sb = sbi->sb;
+	struct snapshot_info *info;
+	struct snapshot_list *list;
+	int i;
+
+	if (!sbi->curr_clean_snapshot_info)
+		return 0;
+
+	info = sbi->curr_clean_snapshot_info;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		list = &info->lists[i];
+
+		mutex_lock(&list->list_mutex);
+		nova_background_clean_snapshot_list(sb, list,
+							info->epoch_id);
+		mutex_unlock(&list->list_mutex);
+	}
+
+	sbi->curr_clean_snapshot_info = NULL;
+	return 0;
+}
+
+static int nova_snapshot_cleaner(void *arg)
+{
+	struct nova_sb_info *sbi = arg;
+
+	nova_dbg("Running snapshot cleaner thread\n");
+	for (;;) {
+		snapshot_cleaner_try_sleeping(sbi);
+
+		if (kthread_should_stop())
+			break;
+
+		nova_clean_snapshot(sbi);
+	}
+
+	if (sbi->curr_clean_snapshot_info)
+		nova_clean_snapshot(sbi);
+
+	return 0;
+}
+
+static int nova_snapshot_cleaner_init(struct nova_sb_info *sbi)
+{
+	int ret = 0;
+
+	init_waitqueue_head(&sbi->snapshot_cleaner_wait);
+
+	sbi->snapshot_cleaner_thread = kthread_run(nova_snapshot_cleaner,
+		sbi, "nova_snapshot_cleaner");
+	if (IS_ERR(sbi->snapshot_cleaner_thread)) {
+		nova_info("Failed to start NOVA snapshot cleaner thread\n");
+		return PTR_ERR(sbi->snapshot_cleaner_thread);
+	}
+	nova_info("Start NOVA snapshot cleaner thread.\n");
+	return ret;
+}
+
+int nova_snapshot_init(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info_header *sih;
+	u64 ino = NOVA_SNAPSHOT_INO;
+	int ret;
+
+	sih = &sbi->snapshot_si->header;
+	nova_init_header(sb, sih, 0);
+	sih->pi_addr = nova_get_reserved_inode_addr(sb, ino);
+	sih->alter_pi_addr = nova_get_alter_reserved_inode_addr(sb, ino);
+	sih->ino = ino;
+	sih->i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+
+	INIT_RADIX_TREE(&sbi->snapshot_info_tree, GFP_ATOMIC);
+	init_waitqueue_head(&sbi->snapshot_mmap_wait);
+	ret = nova_snapshot_cleaner_init(sbi);
+
+	return ret;
+}
+
diff --git a/fs/nova/snapshot.h b/fs/nova/snapshot.h
new file mode 100644
index 000000000000..948dfd557de4
--- /dev/null
+++ b/fs/nova/snapshot.h
@@ -0,0 +1,98 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Snapshot header
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+
+/*
+ * DRAM log of updates to a snapshot.
+ */
+struct snapshot_list {
+	struct mutex list_mutex;
+	unsigned long num_pages;
+	unsigned long head;
+	unsigned long tail;
+};
+
+
+/*
+ * DRAM info about a snapshot.
+ */
+struct snapshot_info {
+	u64	epoch_id;
+	u64	timestamp;
+	unsigned long snapshot_entry; /* PMEM pointer to the struct
+				       * nova_snapshot_info_entry for this
+				       * snapshot
+				       */
+
+	struct snapshot_list *lists;	/* Per-CPU snapshot list */
+};
+
+
+enum nova_snapshot_entry_type {
+	SS_INODE = 1,
+	SS_FILE_WRITE,
+};
+
+/*
+ * Snapshot log entry for recording an inode operation in a snapshot log.
+ *
+ * Todo: add checksum
+ */
+struct snapshot_inode_entry {
+	u8	type;
+	u8	deleted;
+	u8	padding[6];
+	u64	padding64;
+	u64	nova_ino;          // inode number that was deleted.
+	u64	delete_epoch_id;   // Deleted when?
+} __attribute((__packed__));
+
+/*
+ * Snapshot log entry for recording a write operation in a snapshot log
+ *
+ * Todo: add checksum.
+ */
+struct snapshot_file_write_entry {
+	u8	type;
+	u8	deleted;
+	u8	padding[6];
+	u64	nvmm;
+	u64	num_pages;
+	u64	delete_epoch_id;
+} __attribute((__packed__));
+
+/*
+ * PMEM structure pointing to a log comprised of snapshot_inode_entry and
+ * snapshot_file_write_entry objects.
+ *
+ * TODO: add checksum
+ */
+struct snapshot_nvmm_list {
+	__le64 padding;
+	__le64 num_pages;
+	__le64 head;
+	__le64 tail;
+} __attribute((__packed__));
+
+/* Support up to 128 CPUs */
+struct snapshot_nvmm_page {
+	struct snapshot_nvmm_list lists[128];
+};
+
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6f543a47fc92..349e319b10f6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -415,6 +415,8 @@ struct vm_operations_struct {
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,
 					  unsigned long addr);
+	/* For NOVA DAX-mmap protection */
+	int (*dax_cow)(struct vm_area_struct * area, unsigned long address);
 };
 
 struct mmu_gather;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 45cdb27791a3..0b7667fe3dfb 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -342,6 +342,9 @@ struct vm_area_struct {
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+
+	/* Flag for NOVA DAX cow */
+	int original_write;
 };
 
 struct core_thread {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8edd0d576254..aa27a5517a75 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -275,6 +275,7 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 
 	return pages;
 }
+EXPORT_SYMBOL(change_protection);
 
 int
 mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
@@ -288,7 +289,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	int error;
 	int dirty_accountable = 0;
 
-	if (newflags == oldflags) {
+	if (newflags == oldflags && vma->original_write == 0) {
 		*pprev = vma;
 		return 0;
 	}
@@ -352,6 +353,16 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	change_protection(vma, start, end, vma->vm_page_prot,
 			  dirty_accountable, 0);
 
+	/* Update NOVA vma list */
+	if (vma->vm_ops && vma->vm_ops->dax_cow) {
+		if (!(oldflags & VM_WRITE) && (newflags & VM_WRITE)) {
+			vma->vm_ops->open(vma);
+		} else if (!(newflags & VM_WRITE)) {
+			if (vma->original_write || (oldflags & VM_WRITE))
+				vma->vm_ops->close(vma);
+		}
+	}
+
 	/*
 	 * Private VM_LOCKED VMA becoming writable: trigger COW to avoid major
 	 * fault on access.


* [RFC 11/16] NOVA: Snapshot support
@ 2017-08-03  7:49   ` Steven Swanson
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:49 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson, dan.j.williams

Nova supports snapshots to facilitate backups.

Taking a snapshot
-----------------

Each Nova file system has a current epoch_id in the super block, and each log
entry has the epoch_id attached to it at creation.  When the user creates a
snapshot, Nova increments the epoch_id for the file system, and the old epoch_id
identifies the moment the snapshot was taken.

Nova records the epoch_id and a timestamp in a new log entry (struct
nova_snapshot_info_entry) and appends it to the log of the reserved snapshot
inode (NOVA_SNAPSHOT_INO) in the superblock.

Nova also maintains a radix tree (nova_sb_info.snapshot_info_tree) of struct
snapshot_info in DRAM indexed by epoch_id.

Nova also marks all mmap'd pages as read-only and uses COW to preserve file
contents after the snapshot.
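
The epoch mechanism above can be sketched in a few lines of user-space C.  This
is only an illustration, not code from the patch: struct fs_state, struct
log_entry, and the three helpers are hypothetical names, and the visibility
rule in entry_in_snapshot() (created at or before the snapshot's epoch) is an
assumption drawn from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the epoch counter kept in the super block. */
struct fs_state {
	uint64_t epoch_id;	/* current epoch of the file system */
};

/* Hypothetical stand-in for a log entry; only the epoch tag matters here. */
struct log_entry {
	uint64_t epoch_id;	/* epoch in which the entry was created */
};

/* Tag a new log entry with the epoch that is current at creation time. */
static struct log_entry make_entry(const struct fs_state *fs)
{
	struct log_entry e;

	assert(fs);
	e.epoch_id = fs->epoch_id;
	return e;
}

/*
 * Take a snapshot: the old epoch_id names the snapshot, and the file
 * system moves on to the next epoch.
 */
static uint64_t take_snapshot(struct fs_state *fs)
{
	uint64_t snapshot_epoch;

	assert(fs);
	snapshot_epoch = fs->epoch_id;
	fs->epoch_id++;
	return snapshot_epoch;
}

/* Assumed rule: an entry is part of a snapshot if created at or before it. */
static int entry_in_snapshot(const struct log_entry *e, uint64_t snap_epoch)
{
	assert(e);
	return e->epoch_id <= snap_epoch;
}
```

An entry created before take_snapshot() carries the old epoch_id and so falls
inside the snapshot, while an entry created afterwards carries the incremented
epoch_id and does not.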

Tracking Live Data
------------------

Supporting snapshots requires Nova to preserve file contents from previous
snapshots while also being able to recover the space a snapshot occupied after
its deletion.

Preserving file contents requires a small change to how Nova implements write
operations.  To perform a write, Nova appends a write log entry to the file's
log.  The log entry includes pointers to newly-allocated and populated NVMM
pages that hold the written data.  If the write overwrites existing data, Nova
locates the previous write log entry for that portion of the file, and performs
an "epoch check" that compares the old log entry's epoch_id to the file
system's current epoch_id.  If the comparison matches, the old write log entry
and the file data blocks it points to no longer belong to any snapshot, and
Nova reclaims the data blocks.

If the epoch_ids do not match, then the data in the old log entry belongs to
an earlier snapshot and Nova leaves the log entry in place.

Determining when to reclaim data belonging to deleted snapshots requires
additional bookkeeping.  For each snapshot, Nova maintains a "snapshot log"
that records the inodes and blocks that belong to that snapshot, but are not
part of the current file system image.

Nova populates the snapshot log during the epoch check: If the epoch_ids for
the new and old log entries do not match, it appends a log entry (either struct
snapshot_inode_entry or struct snapshot_file_write_entry) to the snapshot log
that the old log entry belongs to.  The log entry contains a pointer to the old
log entry, and the filesystem's current epoch_id as the delete_epoch_id.
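
As a rough user-space sketch of the epoch check, the code below follows the
shape of nova_old_entry_deleteable() in this patch.  It makes one simplifying
assumption: a sorted array of live snapshot epoch_ids stands in for the radix
tree, and find_target_snapshot() is a hypothetical helper that plays the role
of the radix tree lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Find the first live snapshot whose epoch is >= epoch_id.
 * `snaps` is a sorted array standing in for the radix tree.
 * Returns the index, or -1 if there is no such snapshot.
 */
static int find_target_snapshot(const uint64_t *snaps, size_t n,
				uint64_t epoch_id)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (snaps[i] >= epoch_id)
			return (int)i;
	return -1;
}

/*
 * Epoch check.  Returns 1 if the old entry can be reclaimed right away,
 * or 0 if it must be kept; in that case *owner is set to the index of
 * the snapshot whose snapshot log should record it.
 */
static int old_entry_deleteable(const uint64_t *snaps, size_t n,
				uint64_t create_epoch, uint64_t delete_epoch,
				int *owner)
{
	int idx;

	assert(owner);

	if (create_epoch == delete_epoch)
		return 1;	/* created and deleted in the same epoch */

	idx = find_target_snapshot(snaps, n, create_epoch);
	if (idx < 0)
		return 1;	/* no snapshot was taken after creation */

	if (snaps[idx] >= delete_epoch)
		return 1;	/* no snapshot between creation and deletion */

	*owner = idx;
	return 0;
}
```

With live snapshots at epochs 1, 4, and 11 and a current epoch of 12, an entry
created in epoch 3 and overwritten in epoch 12 is kept and recorded in the log
of the epoch-4 snapshot, while an entry created and overwritten in epoch 12 is
reclaimed immediately.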

To delete a snapshot, Nova removes the snapshot from the list of live snapshots
and appends its log to the following snapshot's log.  Then, a background thread
traverses the combined log and reclaims dead inode/data based on the delete
epoch_id: If the delete epoch_id for an entry in the log is less than or equal
to the snapshot's epoch_id, it means the log entry and/or the associated data
blocks are now dead.
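
The reclamation rule can be illustrated with a small user-space sketch.  It is
not the patch's implementation: struct snap_entry is a simplified record that
borrows field names from struct snapshot_file_write_entry, the log is a flat
array rather than a chain of pages, and counting pages stands in for actually
freeing blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified snapshot log record; field names follow the patch. */
struct snap_entry {
	uint8_t  deleted;		/* set once the blocks are reclaimed */
	uint64_t num_pages;		/* data pages the record covers */
	uint64_t delete_epoch_id;	/* epoch in which the data died */
};

/*
 * Walk a combined log and reclaim every record whose delete epoch is
 * at or before epoch_id.  Returns the number of pages reclaimed.
 */
static uint64_t clean_snapshot_log(struct snap_entry *log, size_t n,
				   uint64_t epoch_id)
{
	uint64_t freed = 0;
	size_t i;

	assert(log || n == 0);

	for (i = 0; i < n; i++) {
		if (log[i].deleted || log[i].delete_epoch_id > epoch_id)
			continue;
		freed += log[i].num_pages;	/* stand-in for freeing blocks */
		log[i].deleted = 1;
	}
	return freed;
}
```

Marking each record as deleted makes the pass idempotent, so running the
cleaner a second time over the same log reclaims nothing.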



Snapshots and DAX
-----------------

Taking consistent snapshots while applications are modifying files using
DAX-style mmap requires NOVA to reckon with the order in which stores to NVMM
become persistent (i.e., reach physical NVMM so they will survive a system
failure).  These applications rely on the processor's "memory persistence
model" [http://dl.acm.org/citation.cfm?id=2665671.2665712] to make guarantees
about when and in what order stores become persistent.  These guarantees allow
applications to restore their data to a consistent state during recovery
from a system failure.

From the application's perspective, reading a snapshot is equivalent to
recovering from a system failure.  In both cases, the contents of the
memory-mapped file reflect its state at a moment when application operations
might be in-flight and when the application had no chance to shut down cleanly.

A naive approach to checkpointing mmap()'d files in NOVA would simply mark each
of the read/write mapped pages as read-only and then do copy-on-write when a
store occurs to preserve the old pages as part of the snapshot.

However, this approach can leave the snapshot in an inconsistent state:
Setting a page to read-only captures its contents for the snapshot, and the
kernel requires NOVA to set the pages as read-only one at a time.  So, if the
order in which NOVA marks pages as read-only is incompatible with the ordering
that the application requires, the snapshot will contain an inconsistent
version of the file.

To resolve this problem, when NOVA starts marking pages as read-only, it blocks
page faults to the read-only mmap()'d pages until it has marked all the pages
read-only and finished taking the snapshot.

More detail is available in the technical report referenced at the top of this
document.

We have implemented this functionality in NOVA by adding the 'original_write'
flag to struct vm_area_struct.  The flag tracks whether the vm_area_struct was
created with write permission but has since been marked read-only in the course
of taking a snapshot.  We have also added a 'dax_cow' operation to struct
vm_operations_struct that the page fault handler runs when applications write
to a page with original_write = 1.  NOVA's dax_cow operation
(nova_restore_page_write()) performs the COW, maps the page to a new physical
page and allows writing.
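
The interaction between original_write and dax_cow can be modeled in user
space as follows.  This is a toy model under stated assumptions, not kernel
code: struct toy_vma treats a whole mapping as a single page, and
snapshot_protect(), handle_write_fault(), and toy_restore_page_write() are
hypothetical names.  Whether the real hook clears original_write is not shown
in this excerpt; the toy hook does so only to keep the model simple.

```c
#include <assert.h>
#include <stddef.h>

struct toy_vma;
typedef int (*dax_cow_fn)(struct toy_vma *vma, unsigned long address);

/* Toy model of the fields the patch relies on. */
struct toy_vma {
	int writable;		/* models write permission being in effect */
	int original_write;	/* mapping was writable before the snapshot */
	dax_cow_fn dax_cow;	/* file system's COW hook, may be NULL */
};

/* Snapshot path: write-protect a mapping but remember it was writable. */
static void snapshot_protect(struct toy_vma *vma)
{
	assert(vma);
	if (vma->writable) {
		vma->original_write = 1;
		vma->writable = 0;
	}
}

/* Write-fault path: returns 0 if the store may proceed, -1 otherwise. */
static int handle_write_fault(struct toy_vma *vma, unsigned long address)
{
	assert(vma);
	if (vma->original_write && vma->dax_cow)
		vma->dax_cow(vma, address);
	return vma->writable ? 0 : -1;
}

/* Example hook: pretend to copy the page, then restore write access. */
static int toy_restore_page_write(struct toy_vma *vma, unsigned long address)
{
	(void)address;
	vma->writable = 1;
	vma->original_write = 0;
	return 0;
}
```

A mapping that was never writable keeps original_write at 0, so a write fault
on it skips the hook and fails as it would without snapshots.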


Saving Snapshot State
---------------------

During a clean shutdown, Nova stores the snapshot information to PMEM.

Nova reserves an inode for storing snapshot information.  The log for the inode
contains an entry for each snapshot (struct nova_snapshot_info_entry).  On
shutdown, Nova allocates one page (struct snapshot_nvmm_page) to store an array
of struct snapshot_nvmm_list.

Each of these lists (one per CPU) contains head and tail pointers to a linked
list of blocks (just like an inode log).  The lists contain a struct
snapshot_file_write_entry or struct snapshot_inode_entry for each operation
that modified file data or an inode.

Superblock
+--------------------+
|   ...              |
+--------------------+
| Reserved Inodes    |
+---+----------------+
|   |     ..         |
+---+----------------+
| 7 | Snapshot Inode |
|   | head           |
+---+----------------+
        /
       /
      /
+---------+---------+---------+
|  Snap   |  Snap   |  Snap   |
| epoch=1 | epoch=4 | epoch=11|
|         |         |         |
|nvmm_page|nvmm_page|nvmm_page|
+---------+---------+---------+
     |
     |
+----------+   +--------+--------+
|  cpu 0   |   | snap   | snap   |
|   head   |-->| inode  | write  |
|          |   | entry  | entry  |
|          |   +--------+--------+
+----------+   +--------+--------+
|  cpu 1   |   | snap   | snap   |
|   head   |-->| write  | write  |
|          |   | entry  | entry  |
|          |   +--------+--------+
+----------+
|    ...   |
+----------+   +--------+
|  cpu 127 |   | snap   |
|   head   |-->| inode  |
|          |   | entry  |
|          |   +--------+
+----------+
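
The layout above works out to exactly one 4KB page.  The sketch below restates
the two PMEM structures from snapshot.h with user-space integer types
(uint64_t in place of __le64) and checks the arithmetic at compile time.
SNAPSHOT_MAX_CPUS and nvmm_list_offset() are hypothetical names added for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SNAPSHOT_MAX_CPUS 128	/* array size in struct snapshot_nvmm_page */

struct snapshot_nvmm_list {
	uint64_t padding;
	uint64_t num_pages;
	uint64_t head;
	uint64_t tail;
} __attribute__((__packed__));

struct snapshot_nvmm_page {
	struct snapshot_nvmm_list lists[SNAPSHOT_MAX_CPUS];
};

/* Four 64-bit fields per list, 128 lists per page. */
_Static_assert(sizeof(struct snapshot_nvmm_list) == 32,
	       "each per-CPU list descriptor is 32 bytes");
_Static_assert(sizeof(struct snapshot_nvmm_page) == 4096,
	       "128 descriptors fill one 4KB page");

/* Byte offset of a CPU's list descriptor within the page. */
static size_t nvmm_list_offset(unsigned int cpu)
{
	assert(cpu < SNAPSHOT_MAX_CPUS);
	return (size_t)cpu * sizeof(struct snapshot_nvmm_list);
}
```

Valid CPU indices therefore run from 0 to 127, and the last descriptor starts
at byte offset 4064.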

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 arch/x86/mm/fault.c      |   11 
 fs/nova/snapshot.c       | 1407 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/snapshot.h       |   98 +++
 include/linux/mm.h       |    2 
 include/linux/mm_types.h |    3 
 mm/mprotect.c            |   13 
 6 files changed, 1533 insertions(+), 1 deletion(-)
 create mode 100644 fs/nova/snapshot.c
 create mode 100644 fs/nova/snapshot.h

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 8ad91a01cbc8..34430601c7c0 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1431,6 +1431,17 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	 * we can handle it..
 	 */
 good_area:
+
+	if (error_code & PF_WRITE) {
+		/* write, present and write, not present: */
+		if (vma->original_write && vma->vm_ops &&
+					vma->vm_ops->dax_cow) {
+			up_read(&mm->mmap_sem);
+			vma->vm_ops->dax_cow(vma, address);
+			down_read(&mm->mmap_sem);
+		}
+	}
+
 	if (unlikely(access_error(error_code, vma))) {
 		bad_area_access_error(regs, error_code, address, vma);
 		return;
diff --git a/fs/nova/snapshot.c b/fs/nova/snapshot.c
new file mode 100644
index 000000000000..088b56c0d38c
--- /dev/null
+++ b/fs/nova/snapshot.c
@@ -0,0 +1,1407 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Snapshot support
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include "nova.h"
+#include "inode.h"
+#include "super.h"
+
+static inline u64 next_list_page(u64 curr_p)
+{
+	void *curr_addr = (void *)curr_p;
+	unsigned long page_tail = ((unsigned long)curr_addr & ~PAGE_OFFSET_MASK)
+					+ LOG_BLOCK_TAIL;
+	return ((struct nova_inode_page_tail *)page_tail)->next_page;
+}
+
+static inline bool goto_next_list_page(struct super_block *sb, u64 curr_p)
+{
+	void *addr;
+	u8 type;
+
+	/* Each kind of entry takes at least 32 bytes */
+	if (ENTRY_LOC(curr_p) + 32 > LOG_BLOCK_TAIL)
+		return true;
+
+	addr = (void *)curr_p;
+	type = nova_get_entry_type(addr);
+	if (type == NEXT_PAGE)
+		return true;
+
+	return false;
+}
+
+static int nova_find_target_snapshot_info(struct super_block *sb,
+	u64 epoch_id, struct snapshot_info **ret_info)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *infos[1];
+	int nr_infos;
+	int ret = 0;
+
+	nr_infos = radix_tree_gang_lookup(&sbi->snapshot_info_tree,
+					(void **)infos, epoch_id, 1);
+	if (nr_infos == 1) {
+		*ret_info = infos[0];
+		ret = 1;
+	}
+
+	return ret;
+}
+
+static struct snapshot_info *
+nova_find_next_snapshot_info(struct super_block *sb, struct snapshot_info *info)
+{
+	struct snapshot_info *ret_info = NULL;
+	int ret;
+
+	ret = nova_find_target_snapshot_info(sb, info->epoch_id + 1, &ret_info);
+
+	if (ret == 1 && ret_info->epoch_id <= info->epoch_id) {
+		nova_err(sb, "info epoch id %llu, next epoch id %llu\n",
+				info->epoch_id, ret_info->epoch_id);
+		ret_info = NULL;
+	}
+
+	return ret_info;
+}
+
+static int nova_insert_snapshot_info(struct super_block *sb,
+	struct snapshot_info *info)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	int ret;
+
+	ret = radix_tree_insert(&sbi->snapshot_info_tree, info->epoch_id, info);
+	if (ret)
+		nova_dbg("%s ERROR %d\n", __func__, ret);
+
+	return ret;
+}
+
+/* Reuse the inode log page structure */
+static inline void nova_set_link_page_epoch_id(struct super_block *sb,
+	struct nova_inode_log_page *curr_page, u64 epoch_id)
+{
+	curr_page->page_tail.epoch_id = epoch_id;
+}
+
+/* Reuse the inode log page structure */
+static inline void nova_set_next_link_page_address(struct super_block *sb,
+	struct nova_inode_log_page *curr_page, u64 next_page)
+{
+	curr_page->page_tail.next_page = next_page;
+}
+
+static int nova_delete_snapshot_list_entries(struct super_block *sb,
+	struct snapshot_list *list)
+{
+	struct snapshot_file_write_entry *w_entry = NULL;
+	struct snapshot_inode_entry *i_entry = NULL;
+	struct nova_inode_info_header sih;
+	void *addr;
+	u64 curr_p;
+	u8 type;
+
+	sih.ino = NOVA_SNAPSHOT_INO;
+	sih.i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+	sih.log_head = sih.log_tail = 0;
+
+	curr_p = list->head;
+	nova_dbg_verbose("Snapshot list head 0x%llx, tail 0x%lx\n",
+				curr_p, list->tail);
+	if (curr_p == 0 && list->tail == 0)
+		return 0;
+
+	while (curr_p != list->tail) {
+		if (goto_next_list_page(sb, curr_p)) {
+			curr_p = next_list_page(curr_p);
+			if (curr_p == list->tail)
+				break;
+		}
+
+		if (curr_p == 0) {
+			nova_err(sb, "Snapshot list is NULL!\n");
+			BUG();
+		}
+
+		addr = (void *)curr_p;
+		type = nova_get_entry_type(addr);
+
+		switch (type) {
+		case SS_INODE:
+			i_entry = (struct snapshot_inode_entry *)addr;
+			if (i_entry->deleted == 0)
+				nova_delete_dead_inode(sb, i_entry->nova_ino);
+			curr_p += sizeof(struct snapshot_inode_entry);
+			continue;
+		case SS_FILE_WRITE:
+			w_entry = (struct snapshot_file_write_entry *)addr;
+			if (w_entry->deleted == 0)
+				nova_free_data_blocks(sb, &sih, w_entry->nvmm,
+							w_entry->num_pages);
+			curr_p += sizeof(struct snapshot_file_write_entry);
+			continue;
+		default:
+			nova_err(sb, "unknown type %d, 0x%llx, tail 0x%llx\n",
+					type, curr_p, list->tail);
+			NOVA_ASSERT(0);
+			curr_p += sizeof(struct snapshot_file_write_entry);
+			continue;
+		}
+	}
+
+	return 0;
+}
+
+static inline int nova_background_clean_inode_entry(struct super_block *sb,
+	struct snapshot_inode_entry *i_entry, u64 epoch_id)
+{
+	if (i_entry->deleted == 0 && i_entry->delete_epoch_id <= epoch_id) {
+		nova_delete_dead_inode(sb, i_entry->nova_ino);
+		i_entry->deleted = 1;
+	}
+
+	return 0;
+}
+
+static inline int nova_background_clean_write_entry(struct super_block *sb,
+	struct snapshot_file_write_entry *w_entry,
+	struct nova_inode_info_header *sih, u64 epoch_id)
+{
+	if (w_entry->deleted == 0 && w_entry->delete_epoch_id <= epoch_id) {
+		nova_free_data_blocks(sb, sih, w_entry->nvmm,
+					w_entry->num_pages);
+		w_entry->deleted = 1;
+	}
+
+	return 0;
+}
+
+static int nova_background_clean_snapshot_list(struct super_block *sb,
+	struct snapshot_list *list, u64 epoch_id)
+{
+	struct nova_inode_log_page *curr_page;
+	struct nova_inode_info_header sih;
+	void *addr;
+	u64 curr_p;
+	u8 type;
+
+	sih.ino = NOVA_SNAPSHOT_INO;
+	sih.i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+	sih.log_head = sih.log_tail = 0;
+
+	curr_p = list->head;
+	nova_dbg_verbose("Snapshot list head 0x%llx, tail 0x%lx\n",
+				curr_p, list->tail);
+	if (curr_p == 0 && list->tail == 0)
+		return 0;
+
+	curr_page = (struct nova_inode_log_page *)curr_p;
+	while (curr_page->page_tail.epoch_id < epoch_id &&
+					curr_p != list->tail) {
+		if (goto_next_list_page(sb, curr_p)) {
+			curr_p = next_list_page(curr_p);
+			if (curr_p == list->tail)
+				break;
+			curr_page = (struct nova_inode_log_page *)curr_p;
+			if (curr_page->page_tail.epoch_id == epoch_id)
+				break;
+		}
+
+		if (curr_p == 0) {
+			nova_err(sb, "Snapshot list is NULL!\n");
+			BUG();
+		}
+
+		addr = (void *)curr_p;
+		type = nova_get_entry_type(addr);
+
+		switch (type) {
+		case SS_INODE:
+			nova_background_clean_inode_entry(sb, addr, epoch_id);
+			curr_p += sizeof(struct snapshot_inode_entry);
+			continue;
+		case SS_FILE_WRITE:
+			nova_background_clean_write_entry(sb, addr, &sih,
+								epoch_id);
+			curr_p += sizeof(struct snapshot_file_write_entry);
+			continue;
+		default:
+			nova_err(sb, "unknown type %d, 0x%llx, tail 0x%llx\n",
+					type, curr_p, list->tail);
+			NOVA_ASSERT(0);
+			curr_p += sizeof(struct snapshot_file_write_entry);
+			continue;
+		}
+	}
+
+	return 0;
+}
+
+static int nova_delete_snapshot_list_pages(struct super_block *sb,
+	struct snapshot_list *list)
+{
+	struct nova_inode_log_page *curr_page;
+	u64 curr_block = list->head;
+	int freed = 0;
+
+	while (curr_block) {
+		if (ENTRY_LOC(curr_block)) {
+			nova_dbg("%s: ERROR: invalid block %llu\n",
+					__func__, curr_block);
+			break;
+		}
+		curr_page = (struct nova_inode_log_page *)curr_block;
+		curr_block = curr_page->page_tail.next_page;
+		kfree(curr_page);
+		freed++;
+	}
+
+	return freed;
+}
+
+static int nova_delete_snapshot_list(struct super_block *sb,
+	struct snapshot_list *list, int delete_entries)
+{
+	if (delete_entries)
+		nova_delete_snapshot_list_entries(sb, list);
+	nova_delete_snapshot_list_pages(sb, list);
+	return 0;
+}
+
+static int nova_delete_snapshot_info(struct super_block *sb,
+	struct snapshot_info *info, int delete_entries)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_list *list;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		list = &info->lists[i];
+		mutex_lock(&list->list_mutex);
+		nova_delete_snapshot_list(sb, list, delete_entries);
+		mutex_unlock(&list->list_mutex);
+	}
+
+	kfree(info->lists);
+	return 0;
+}
+
+static int nova_initialize_snapshot_info_pages(struct super_block *sb,
+	struct snapshot_info *info, u64 epoch_id)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_list *list;
+	unsigned long new_page = 0;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		list = &info->lists[i];
+		new_page = (unsigned long)kmalloc(PAGE_SIZE,
+							GFP_KERNEL);
+		/* Aligned to PAGE_SIZE */
+		if (!new_page || ENTRY_LOC(new_page)) {
+			nova_dbg("%s: failed\n", __func__);
+			kfree((void *)new_page);
+			return -ENOMEM;
+		}
+
+		nova_set_link_page_epoch_id(sb, (void *)new_page, epoch_id);
+		nova_set_next_link_page_address(sb, (void *)new_page, 0);
+		list->tail = list->head = new_page;
+		list->num_pages = 1;
+	}
+
+	return 0;
+}
+
+static int nova_initialize_snapshot_info(struct super_block *sb,
+	struct snapshot_info **ret_info, int init_pages, u64 epoch_id)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *info;
+	struct snapshot_list *list;
+	int i;
+	int ret = 0;
+	timing_t init_snapshot_time;
+
+	NOVA_START_TIMING(init_snapshot_info_t, init_snapshot_time);
+
+	info = nova_alloc_snapshot_info(sb);
+	if (!info) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	info->lists = kzalloc(sbi->cpus * sizeof(struct snapshot_list),
+							GFP_KERNEL);
+
+	if (!info->lists) {
+		nova_free_snapshot_info(info);
+		ret = -ENOMEM;
+		goto fail;
+	}
+
+	for (i = 0; i < sbi->cpus; i++) {
+		list = &info->lists[i];
+		mutex_init(&list->list_mutex);
+	}
+
+	if (init_pages) {
+		ret = nova_initialize_snapshot_info_pages(sb, info, epoch_id);
+		if (ret)
+			goto fail;
+	}
+
+	*ret_info = info;
+out:
+	NOVA_END_TIMING(init_snapshot_info_t, init_snapshot_time);
+	return ret;
+
+fail:
+	for (i = 0; i < sbi->cpus; i++) {
+		list = &info->lists[i];
+		if (list->head)
+			kfree((void *)list->head);
+	}
+
+	kfree(info->lists);
+	nova_free_snapshot_info(info);
+
+	*ret_info = NULL;
+	goto out;
+}
+
+static void nova_write_snapshot_list_entry(struct super_block *sb,
+	struct snapshot_list *list, u64 curr_p, void *entry, size_t size)
+{
+	if (is_last_entry(curr_p, size)) {
+		nova_err(sb, "%s: write to page end? curr 0x%llx, size %lu\n",
+				__func__, curr_p, size);
+		return;
+	}
+
+	memcpy((void *)curr_p, entry, size);
+	list->tail = curr_p + size;
+}
+
+static int nova_append_snapshot_list_entry(struct super_block *sb,
+	struct snapshot_info *info, void *entry, size_t size)
+{
+	struct snapshot_list *list;
+	struct nova_inode_log_page *curr_page;
+	u64 curr_block;
+	int cpuid;
+	u64 curr_p;
+	u64 new_page = 0;
+
+	cpuid = smp_processor_id();
+	list = &info->lists[cpuid];
+
+retry:
+	mutex_lock(&list->list_mutex);
+	curr_p = list->tail;
+
+	if (new_page) {
+		/* Link prev block and newly allocated page */
+		curr_block = BLOCK_OFF(curr_p);
+		curr_page = (struct nova_inode_log_page *)curr_block;
+		nova_set_next_link_page_address(sb, curr_page, new_page);
+		list->num_pages++;
+	}
+
+	if ((is_last_entry(curr_p, size) && next_list_page(curr_p) == 0)) {
+		nova_set_entry_type((void *)curr_p, NEXT_PAGE);
+		if (new_page == 0) {
+			mutex_unlock(&list->list_mutex);
+			new_page = (unsigned long)kmalloc(PAGE_SIZE,
+						GFP_KERNEL);
+			if (!new_page || ENTRY_LOC(new_page)) {
+				kfree((void *)new_page);
+				nova_err(sb, "%s: allocation failed\n",
+						__func__);
+				return -ENOMEM;
+			}
+			nova_set_link_page_epoch_id(sb, (void *)new_page,
+						info->epoch_id);
+			nova_set_next_link_page_address(sb,
+						(void *)new_page, 0);
+			goto retry;
+		}
+	}
+
+	if (is_last_entry(curr_p, size)) {
+		nova_set_entry_type((void *)curr_p, NEXT_PAGE);
+		curr_p = next_list_page(curr_p);
+	}
+
+	nova_write_snapshot_list_entry(sb, list, curr_p, entry, size);
+	mutex_unlock(&list->list_mutex);
+
+	return 0;
+}
+
+/*
+ * An entry is deleteable if
+ * 1) It is created after the last snapshot, or
+ * 2) It is created and deleted during the same snapshot period.
+ */
+static int nova_old_entry_deleteable(struct super_block *sb,
+	u64 create_epoch_id, u64 delete_epoch_id,
+	struct snapshot_info **ret_info)
+{
+	struct snapshot_info *info = NULL;
+	int ret;
+
+	if (create_epoch_id == delete_epoch_id) {
+		/* Create and delete in the same epoch */
+		return 1;
+	}
+
+	ret = nova_find_target_snapshot_info(sb, create_epoch_id, &info);
+	if (ret == 0) {
+		/* Old entry does not belong to any snapshot */
+		return 1;
+	}
+
+	if (info->epoch_id >= delete_epoch_id) {
+		/* Create and delete in different epoch but same snapshot */
+		return 1;
+	}
+
+	*ret_info = info;
+	return 0;
+}
+
+static int nova_append_snapshot_file_write_entry(struct super_block *sb,
+	struct snapshot_info *info, u64 nvmm, u64 num_pages,
+	u64 delete_epoch_id)
+{
+	struct snapshot_file_write_entry entry;
+	int ret;
+	timing_t append_time;
+
+	if (!info) {
+		nova_dbg("%s: Snapshot info not found\n", __func__);
+		return -EINVAL;
+	}
+
+	NOVA_START_TIMING(append_snapshot_file_t, append_time);
+	nova_dbgv("Append file write entry: block %llu, %llu pages, delete epoch ID %llu to Snapshot epoch ID %llu\n",
+			nvmm, num_pages, delete_epoch_id,
+			info->epoch_id);
+
+	memset(&entry, 0, sizeof(struct snapshot_file_write_entry));
+	entry.type = SS_FILE_WRITE;
+	entry.deleted = 0;
+	entry.nvmm = nvmm;
+	entry.num_pages = num_pages;
+	entry.delete_epoch_id = delete_epoch_id;
+
+	ret = nova_append_snapshot_list_entry(sb, info, &entry,
+			sizeof(struct snapshot_file_write_entry));
+
+	NOVA_END_TIMING(append_snapshot_file_t, append_time);
+	return ret;
+}
+
+/* entry given to this function is a copy in dram */
+int nova_append_data_to_snapshot(struct super_block *sb,
+	struct nova_file_write_entry *entry, u64 nvmm, u64 num_pages,
+	u64 delete_epoch_id)
+{
+	struct snapshot_info *info = NULL;
+	int ret;
+
+	ret = nova_old_entry_deleteable(sb, entry->epoch_id,
+					delete_epoch_id, &info);
+	if (ret == 0)
+		nova_append_snapshot_file_write_entry(sb, info, nvmm,
+					num_pages, delete_epoch_id);
+
+	return ret;
+}
+
+static int nova_append_snapshot_inode_entry(struct super_block *sb,
+	struct nova_inode *pi, struct snapshot_info *info)
+{
+	struct snapshot_inode_entry entry;
+	int ret;
+	timing_t append_time;
+
+	if (!info) {
+		nova_dbg("%s: Snapshot info not found\n", __func__);
+		return -EINVAL;
+	}
+
+	NOVA_START_TIMING(append_snapshot_inode_t, append_time);
+	nova_dbgv("Append inode entry: inode %llu, delete epoch ID %llu to Snapshot epoch ID %llu\n",
+			pi->nova_ino, pi->delete_epoch_id,
+			info->epoch_id);
+
+	memset(&entry, 0, sizeof(struct snapshot_inode_entry));
+	entry.type = SS_INODE;
+	entry.deleted = 0;
+	entry.nova_ino = pi->nova_ino;
+	entry.delete_epoch_id = pi->delete_epoch_id;
+
+	ret = nova_append_snapshot_list_entry(sb, info, &entry,
+			sizeof(struct snapshot_inode_entry));
+
+	NOVA_END_TIMING(append_snapshot_inode_t, append_time);
+	return ret;
+}
+
+int nova_append_inode_to_snapshot(struct super_block *sb,
+	struct nova_inode *pi)
+{
+	struct snapshot_info *info = NULL;
+	int ret;
+
+	ret = nova_old_entry_deleteable(sb, pi->create_epoch_id,
+					pi->delete_epoch_id, &info);
+	if (ret == 0)
+		nova_append_snapshot_inode_entry(sb, pi, info);
+
+	return ret;
+}
+
+int nova_encounter_mount_snapshot(struct super_block *sb, void *addr,
+	u8 type)
+{
+	struct nova_dentry *dentry;
+	struct nova_setattr_logentry *attr_entry;
+	struct nova_link_change_entry *linkc_entry;
+	struct nova_file_write_entry *fw_entry;
+	struct nova_mmap_entry *mmap_entry;
+	int ret = 0;
+
+	switch (type) {
+	case SET_ATTR:
+		attr_entry = (struct nova_setattr_logentry *)addr;
+		if (pass_mount_snapshot(sb, attr_entry->epoch_id))
+			ret = 1;
+		break;
+	case LINK_CHANGE:
+		linkc_entry = (struct nova_link_change_entry *)addr;
+		if (pass_mount_snapshot(sb, linkc_entry->epoch_id))
+			ret = 1;
+		break;
+	case DIR_LOG:
+		dentry = (struct nova_dentry *)addr;
+		if (pass_mount_snapshot(sb, dentry->epoch_id))
+			ret = 1;
+		break;
+	case FILE_WRITE:
+		fw_entry = (struct nova_file_write_entry *)addr;
+		if (pass_mount_snapshot(sb, fw_entry->epoch_id))
+			ret = 1;
+		break;
+	case MMAP_WRITE:
+		mmap_entry = (struct nova_mmap_entry *)addr;
+		if (pass_mount_snapshot(sb, mmap_entry->epoch_id))
+			ret = 1;
+		break;
+	default:
+		break;
+	}
+
+	return ret;
+}
+
+static int nova_copy_snapshot_list_to_dram(struct super_block *sb,
+	struct snapshot_list *list, struct snapshot_nvmm_list *nvmm_list)
+{
+	struct nova_inode_log_page *dram_page;
+	void *curr_nvmm_addr;
+	u64 curr_nvmm_block;
+	u64 prev_dram_addr;
+	u64 curr_dram_addr;
+	unsigned long i;
+	int ret;
+
+	curr_dram_addr = list->head;
+	prev_dram_addr = list->head;
+	curr_nvmm_block = nvmm_list->head;
+	curr_nvmm_addr = nova_get_block(sb, curr_nvmm_block);
+
+	for (i = 0; i < nvmm_list->num_pages; i++) {
+		/* Leave next_page field alone */
+		ret = memcpy_mcsafe((void *)curr_dram_addr, curr_nvmm_addr,
+						LOG_BLOCK_TAIL);
+
+		if (ret < 0) {
+			nova_dbg("%s: Copy nvmm page %lu failed\n",
+					__func__, i);
+			continue;
+		}
+
+		dram_page = (struct nova_inode_log_page *)curr_dram_addr;
+		prev_dram_addr = curr_dram_addr;
+		curr_nvmm_block = next_log_page(sb, curr_nvmm_block);
+		if (curr_nvmm_block < 0)
+			break;
+		curr_nvmm_addr = nova_get_block(sb, curr_nvmm_block);
+		curr_dram_addr = dram_page->page_tail.next_page;
+	}
+
+	list->num_pages = nvmm_list->num_pages;
+	list->tail = prev_dram_addr + ENTRY_LOC(nvmm_list->tail);
+
+	return 0;
+}
+
+static int nova_allocate_snapshot_list_pages(struct super_block *sb,
+	struct snapshot_list *list, struct snapshot_nvmm_list *nvmm_list,
+	u64 epoch_id)
+{
+	unsigned long prev_page = 0;
+	unsigned long new_page = 0;
+	unsigned long i;
+
+	for (i = 0; i < nvmm_list->num_pages; i++) {
+		new_page = (unsigned long)kmalloc(PAGE_SIZE,
+							GFP_KERNEL);
+
+		if (!new_page) {
+			nova_dbg("%s ERROR: fail to allocate list pages\n",
+					__func__);
+			goto fail;
+		}
+
+		nova_set_link_page_epoch_id(sb, (void *)new_page, epoch_id);
+		nova_set_next_link_page_address(sb, (void *)new_page, 0);
+
+		if (i == 0)
+			list->head = new_page;
+
+		if (prev_page)
+			nova_set_next_link_page_address(sb, (void *)prev_page,
+							new_page);
+		prev_page = new_page;
+	}
+
+	return 0;
+
+fail:
+	nova_delete_snapshot_list_pages(sb, list);
+	return -ENOMEM;
+}
+
+static int nova_restore_snapshot_info_lists(struct super_block *sb,
+	struct snapshot_info *info, struct nova_snapshot_info_entry *entry,
+	u64 epoch_id)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_nvmm_page *nvmm_page;
+	struct snapshot_list *list;
+	struct snapshot_nvmm_list *nvmm_list;
+	int i;
+	int ret;
+
+	nvmm_page = (struct snapshot_nvmm_page *)nova_get_block(sb,
+						entry->nvmm_page_addr);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		list = &info->lists[i];
+		nvmm_list = &nvmm_page->lists[i];
+		if (!list || !nvmm_list) {
+			nova_dbg("%s: list NULL? list %p, nvmm list %p\n",
+					__func__, list, nvmm_list);
+			continue;
+		}
+
+		ret = nova_allocate_snapshot_list_pages(sb, list,
+						nvmm_list, info->epoch_id);
+		if (ret) {
+			nova_dbg("%s failure\n", __func__);
+			return ret;
+		}
+		nova_copy_snapshot_list_to_dram(sb, list, nvmm_list);
+	}
+
+	return 0;
+}
+
+static int nova_restore_snapshot_info(struct super_block *sb,
+	struct nova_snapshot_info_entry *entry, u64 epoch_id,
+	u64 timestamp, u64 curr_p, int just_init)
+{
+	struct snapshot_info *info = NULL;
+	int ret = 0;
+
+	nova_dbg("Restore snapshot epoch ID %llu\n", epoch_id);
+
+	/* Allocate list pages on demand later */
+	ret = nova_initialize_snapshot_info(sb, &info, just_init, epoch_id);
+	if (ret) {
+		nova_dbg("%s: initialize snapshot info failed %d\n",
+				__func__, ret);
+		goto fail;
+	}
+
+	info->epoch_id = epoch_id;
+	info->timestamp = timestamp;
+	info->snapshot_entry = curr_p;
+
+	if (just_init == 0) {
+		ret = nova_restore_snapshot_info_lists(sb, info,
+							entry, epoch_id);
+		if (ret)
+			goto fail;
+	}
+
+	ret = nova_insert_snapshot_info(sb, info);
+	return ret;
+
+fail:
+	nova_delete_snapshot_info(sb, info, 0);
+	return ret;
+}
+
+int nova_mount_snapshot(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u64 epoch_id;
+
+	epoch_id = sbi->mount_snapshot_epoch_id;
+	nova_dbg("Mount snapshot %llu\n", epoch_id);
+	return 0;
+}
+
+static int nova_free_nvmm_page(struct super_block *sb,
+	u64 nvmm_page_addr)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_nvmm_page *nvmm_page;
+	struct snapshot_nvmm_list *nvmm_list;
+	struct nova_inode_info_header sih;
+	unsigned long nvmm_blocknr;
+	int i;
+
+	if (nvmm_page_addr == 0)
+		return 0;
+
+	sih.ino = NOVA_SNAPSHOT_INO;
+	sih.i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+
+	nvmm_page = (struct snapshot_nvmm_page *)nova_get_block(sb,
+						nvmm_page_addr);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		nvmm_list = &nvmm_page->lists[i];
+		sih.log_head = nvmm_list->head;
+		sih.log_tail = nvmm_list->tail;
+		sih.alter_log_head = sih.alter_log_tail = 0;
+		nova_free_inode_log(sb, NULL, &sih);
+	}
+
+	nvmm_blocknr = nova_get_blocknr(sb, nvmm_page_addr, 0);
+	nova_free_log_blocks(sb, &sih, nvmm_blocknr, 1);
+	return 0;
+}
+
+static int nova_set_nvmm_page_addr(struct super_block *sb,
+	struct nova_snapshot_info_entry *entry, u64 nvmm_page_addr)
+{
+	nova_memunlock_range(sb, entry, CACHELINE_SIZE);
+	entry->nvmm_page_addr = nvmm_page_addr;
+	nova_update_entry_csum(entry);
+	nova_update_alter_entry(sb, entry);
+	nova_memlock_range(sb, entry, CACHELINE_SIZE);
+
+	return 0;
+}
+
+static int nova_clear_nvmm_page(struct super_block *sb,
+	struct nova_snapshot_info_entry *entry, int just_init)
+{
+	if (just_init)
+		/* No need to free because we do not set the bitmap. */
+		goto out;
+
+	nova_free_nvmm_page(sb, entry->nvmm_page_addr);
+
+out:
+	nova_set_nvmm_page_addr(sb, entry, 0);
+	return 0;
+}
+
+int nova_restore_snapshot_entry(struct super_block *sb,
+	struct nova_snapshot_info_entry *entry, u64 curr_p, int just_init)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u64 epoch_id, timestamp;
+	int ret = 0;
+
+	if (entry->deleted == 1)
+		goto out;
+
+	epoch_id = entry->epoch_id;
+	timestamp = entry->timestamp;
+
+	ret = nova_restore_snapshot_info(sb, entry, epoch_id,
+					timestamp, curr_p, just_init);
+	if (ret) {
+		nova_dbg("%s: Restore snapshot epoch ID %llu failed\n",
+				__func__, epoch_id);
+		goto out;
+	}
+
+	if (epoch_id > sbi->s_epoch_id)
+		sbi->s_epoch_id = epoch_id;
+
+out:
+	nova_clear_nvmm_page(sb, entry, just_init);
+
+	return ret;
+}
+
+static int nova_append_snapshot_info_log(struct super_block *sb,
+	struct snapshot_info *info, u64 epoch_id, u64 timestamp)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info *si = sbi->snapshot_si;
+	struct nova_inode *pi = nova_get_reserved_inode(sb, NOVA_SNAPSHOT_INO);
+	struct nova_inode_update update;
+	struct nova_snapshot_info_entry entry_info;
+	int ret;
+
+	entry_info.type = SNAPSHOT_INFO;
+	entry_info.deleted = 0;
+	entry_info.nvmm_page_addr = 0;
+	entry_info.epoch_id = epoch_id;
+	entry_info.timestamp = timestamp;
+
+	update.tail = update.alter_tail = 0;
+	ret = nova_append_snapshot_info_entry(sb, pi, si, info,
+					&entry_info, &update);
+	if (ret) {
+		nova_dbg("%s: append snapshot info entry failure\n", __func__);
+		return ret;
+	}
+
+	nova_memunlock_inode(sb, pi);
+	nova_update_inode(sb, &si->vfs_inode, pi, &update, 1);
+	nova_memlock_inode(sb, pi);
+
+	return 0;
+}
+
+int nova_create_snapshot(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *info = NULL;
+	u64 timestamp = 0;
+	u64 epoch_id;
+	int ret;
+	timing_t create_snapshot_time;
+
+	NOVA_START_TIMING(create_snapshot_t, create_snapshot_time);
+
+	mutex_lock(&sbi->s_lock);
+	sbi->snapshot_taking = 1;
+
+	/* Increase the epoch id, but use the old value as snapshot id */
+	epoch_id = sbi->s_epoch_id++;
+
+	/*
+	 * Mark the create_snapshot_epoch_id before starting the snapshot
+	 * creation. We will check this during in-place updates for metadata
+	 * and data, to prevent overwriting logs that might belong to a
+	 * snapshot that is being created.
+	 */
+	nova_info("%s: epoch id %llu\n", __func__, epoch_id);
+
+
+	timestamp = timespec_trunc(current_kernel_time(),
+				   sb->s_time_gran).tv_sec;
+
+	ret = nova_initialize_snapshot_info(sb, &info, 1, epoch_id);
+	if (ret) {
+		nova_dbg("%s: initialize snapshot info failed %d\n",
+				__func__, ret);
+		NOVA_END_TIMING(create_snapshot_t, create_snapshot_time);
+		goto out;
+	}
+
+	info->epoch_id = epoch_id;
+	info->timestamp = timestamp;
+
+	ret = nova_append_snapshot_info_log(sb, info, epoch_id, timestamp);
+	if (ret) {
+		nova_free_snapshot_info(info);
+		NOVA_END_TIMING(create_snapshot_t, create_snapshot_time);
+		goto out;
+	}
+
+	sbi->num_snapshots++;
+
+	ret = nova_insert_snapshot_info(sb, info);
+
+	nova_set_vmas_readonly(sb);
+
+	sbi->nova_sb->s_wtime = cpu_to_le32(get_seconds());
+	sbi->nova_sb->s_epoch_id = cpu_to_le64(epoch_id);
+	nova_update_super_crc(sb);
+
+	nova_sync_super(sb);
+
+out:
+	sbi->snapshot_taking = 0;
+	mutex_unlock(&sbi->s_lock);
+	wake_up_interruptible(&sbi->snapshot_mmap_wait);
+
+	NOVA_END_TIMING(create_snapshot_t, create_snapshot_time);
+	return ret;
+}
+
+static void wakeup_snapshot_cleaner(struct nova_sb_info *sbi)
+{
+	if (!waitqueue_active(&sbi->snapshot_cleaner_wait))
+		return;
+
+	nova_dbg("Wakeup snapshot cleaner thread\n");
+	wake_up_interruptible(&sbi->snapshot_cleaner_wait);
+}
+
+static int nova_link_to_next_snapshot(struct super_block *sb,
+	struct snapshot_info *prev_info, struct snapshot_info *next_info)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_list *prev_list, *next_list;
+	struct nova_inode_log_page *curr_page;
+	u64 curr_block, curr_p;
+	int i;
+
+	nova_dbg("Link deleted snapshot %llu to next snapshot %llu\n",
+			prev_info->epoch_id, next_info->epoch_id);
+
+	if (prev_info->epoch_id >= next_info->epoch_id)
+		nova_dbg("Error: prev epoch ID %llu higher than next epoch ID %llu\n",
+			prev_info->epoch_id, next_info->epoch_id);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		prev_list = &prev_info->lists[i];
+		next_list = &next_info->lists[i];
+
+		mutex_lock(&prev_list->list_mutex);
+		mutex_lock(&next_list->list_mutex);
+
+		/* Set NEXT_PAGE flag for prev lists */
+		curr_p = prev_list->tail;
+		if (!goto_next_list_page(sb, curr_p))
+			nova_set_entry_type((void *)curr_p, NEXT_PAGE);
+
+		/* Link the prev lists to the head of next lists */
+		curr_block = BLOCK_OFF(prev_list->tail);
+		curr_page = (struct nova_inode_log_page *)curr_block;
+		nova_set_next_link_page_address(sb, curr_page, next_list->head);
+
+		next_list->head = prev_list->head;
+		next_list->num_pages += prev_list->num_pages;
+
+		mutex_unlock(&next_list->list_mutex);
+		mutex_unlock(&prev_list->list_mutex);
+	}
+
+	sbi->curr_clean_snapshot_info = next_info;
+	wakeup_snapshot_cleaner(sbi);
+
+	return 0;
+}
+
+static int nova_invalidate_snapshot_entry(struct super_block *sb,
+	struct snapshot_info *info)
+{
+	struct nova_snapshot_info_entry *entry;
+	int ret;
+
+	entry = nova_get_block(sb, info->snapshot_entry);
+	ret = nova_invalidate_logentry(sb, entry, SNAPSHOT_INFO, 0);
+	return ret;
+}
+
+int nova_delete_snapshot(struct super_block *sb, u64 epoch_id)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *info = NULL;
+	struct snapshot_info *next = NULL;
+	int delete = 0;
+	int ret;
+	timing_t delete_snapshot_time;
+
+	NOVA_START_TIMING(delete_snapshot_t, delete_snapshot_time);
+	mutex_lock(&sbi->s_lock);
+	nova_info("Delete snapshot epoch ID %llu\n", epoch_id);
+
+	ret = nova_find_target_snapshot_info(sb, epoch_id, &info);
+	if (ret != 1 || info->epoch_id != epoch_id) {
+		nova_dbg("%s: Snapshot info not found\n", __func__);
+		goto out;
+	}
+
+	next = nova_find_next_snapshot_info(sb, info);
+
+	if (next) {
+		nova_link_to_next_snapshot(sb, info, next);
+	} else {
+		/* Delete the last snapshot. Find the previous one. */
+		delete = 1;
+	}
+
+	radix_tree_delete(&sbi->snapshot_info_tree, epoch_id);
+
+	nova_invalidate_snapshot_entry(sb, info);
+
+out:
+	sbi->num_snapshots--;
+	mutex_unlock(&sbi->s_lock);
+
+	if (delete)
+		nova_delete_snapshot_info(sb, info, 1);
+
+	nova_free_snapshot_info(info);
+
+	NOVA_END_TIMING(delete_snapshot_t, delete_snapshot_time);
+	return 0;
+}
+
+static int nova_copy_snapshot_list_to_nvmm(struct super_block *sb,
+	struct snapshot_list *list, struct snapshot_nvmm_list *nvmm_list,
+	u64 new_block)
+{
+	struct nova_inode_log_page *dram_page;
+	void *curr_nvmm_addr;
+	u64 curr_nvmm_block;
+	u64 prev_nvmm_block;
+	u64 curr_dram_addr;
+	unsigned long i;
+	size_t size = sizeof(struct snapshot_nvmm_list);
+
+	curr_dram_addr = list->head;
+	prev_nvmm_block = new_block;
+	curr_nvmm_block = new_block;
+	curr_nvmm_addr = nova_get_block(sb, curr_nvmm_block);
+
+	for (i = 0; i < list->num_pages; i++) {
+		/* Leave next_page field alone */
+		nova_memunlock_block(sb, curr_nvmm_addr);
+		memcpy_to_pmem_nocache(curr_nvmm_addr, (void *)curr_dram_addr,
+						LOG_BLOCK_TAIL);
+		nova_memlock_block(sb, curr_nvmm_addr);
+
+		dram_page = (struct nova_inode_log_page *)curr_dram_addr;
+		prev_nvmm_block = curr_nvmm_block;
+		curr_nvmm_block = next_log_page(sb, curr_nvmm_block);
+		if (curr_nvmm_block == 0)
+			break;
+		curr_nvmm_addr = nova_get_block(sb, curr_nvmm_block);
+		curr_dram_addr = dram_page->page_tail.next_page;
+	}
+
+	nova_memunlock_range(sb, nvmm_list, size);
+	nvmm_list->num_pages = list->num_pages;
+	nvmm_list->tail = prev_nvmm_block + ENTRY_LOC(list->tail);
+	nvmm_list->head = new_block;
+	nova_memlock_range(sb, nvmm_list, size);
+
+	nova_flush_buffer(nvmm_list, sizeof(struct snapshot_nvmm_list), 1);
+
+	return 0;
+}
+
+static int nova_save_snapshot_info(struct super_block *sb,
+	struct snapshot_info *info)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_snapshot_info_entry *entry;
+	struct nova_inode_info_header sih;
+	struct snapshot_list *list;
+	struct snapshot_nvmm_page *nvmm_page;
+	struct snapshot_nvmm_list *nvmm_list;
+	unsigned long num_pages;
+	int i;
+	u64 nvmm_page_addr;
+	u64 new_block;
+	int allocated;
+
+	sih.ino = NOVA_SNAPSHOT_INO;
+	sih.i_blk_type = 0;
+
+	/* Support up to 128 CPUs */
+	allocated = nova_allocate_inode_log_pages(sb, &sih, 1,
+						&nvmm_page_addr, ANY_CPU, 0);
+	if (allocated != 1) {
+		nova_dbg("Error allocating NVMM info page\n");
+		return -ENOSPC;
+	}
+
+	nvmm_page = (struct snapshot_nvmm_page *)nova_get_block(sb,
+							nvmm_page_addr);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		list = &info->lists[i];
+		num_pages = list->num_pages;
+		allocated = nova_allocate_inode_log_pages(sb, &sih,
+					num_pages, &new_block, i, 0);
+		if (allocated != num_pages) {
+			nova_dbg("Error saving snapshot list: %d\n", allocated);
+			return -ENOSPC;
+		}
+		nvmm_list = &nvmm_page->lists[i];
+		nova_copy_snapshot_list_to_nvmm(sb, list, nvmm_list, new_block);
+	}
+
+	entry = nova_get_block(sb, info->snapshot_entry);
+	nova_set_nvmm_page_addr(sb, entry, nvmm_page_addr);
+
+	return 0;
+}
+
+static int nova_print_snapshot_info(struct snapshot_info *info,
+	struct seq_file *seq)
+{
+	struct tm tm;
+	u64 epoch_id;
+	u64 timestamp;
+	unsigned long local_time;
+
+	epoch_id = info->epoch_id;
+	timestamp = info->timestamp;
+
+	local_time = timestamp - sys_tz.tz_minuteswest * 60;
+	time_to_tm(local_time, 0, &tm);
+	seq_printf(seq, "%8llu\t%4lu-%02d-%02d\t%02d:%02d:%02d\n",
+					info->epoch_id,
+					tm.tm_year + 1900, tm.tm_mon + 1,
+					tm.tm_mday,
+					tm.tm_hour, tm.tm_min, tm.tm_sec);
+	return 0;
+}
+
+int nova_print_snapshots(struct super_block *sb, struct seq_file *seq)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *info;
+	struct snapshot_info *infos[FREE_BATCH];
+	int nr_infos;
+	u64 epoch_id = 0;
+	int count = 0;
+	int i;
+
+	seq_puts(seq, "========== NOVA snapshot table ==========\n");
+	seq_puts(seq, "Epoch ID\t      Date\t    Time\n");
+
+	/* Print in epoch ID increasing order */
+	do {
+		nr_infos = radix_tree_gang_lookup(&sbi->snapshot_info_tree,
+					(void **)infos, epoch_id, FREE_BATCH);
+		for (i = 0; i < nr_infos; i++) {
+			info = infos[i];
+			BUG_ON(!info);
+			epoch_id = info->epoch_id;
+			nova_print_snapshot_info(info, seq);
+			count++;
+		}
+		epoch_id++;
+	} while (nr_infos == FREE_BATCH);
+
+	seq_printf(seq, "=========== Total %d snapshots ===========\n", count);
+	return 0;
+}
+
+int nova_print_snapshot_lists(struct super_block *sb, struct seq_file *seq)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *info;
+	struct snapshot_list *list;
+	struct snapshot_info *infos[FREE_BATCH];
+	int nr_infos;
+	u64 epoch_id = 0;
+	int count = 0;
+	int sum;
+	int i, j;
+
+	seq_puts(seq, "========== NOVA snapshot statistics ==========\n");
+
+	/* Print in epoch ID increasing order */
+	do {
+		nr_infos = radix_tree_gang_lookup(&sbi->snapshot_info_tree,
+					(void **)infos, epoch_id, FREE_BATCH);
+		for (i = 0; i < nr_infos; i++) {
+			info = infos[i];
+			BUG_ON(!info);
+			epoch_id = info->epoch_id;
+			sum = 0;
+			for (j = 0; j < sbi->cpus; j++) {
+				list = &info->lists[j];
+				sum += list->num_pages;
+			}
+			seq_printf(seq, "Snapshot epoch ID %llu, %d list pages\n",
+					epoch_id, sum);
+			count++;
+		}
+		epoch_id++;
+	} while (nr_infos == FREE_BATCH);
+
+	seq_printf(seq, "============= Total %d snapshots =============\n",
+			count);
+	return 0;
+}
+
+static int nova_traverse_and_delete_snapshot_infos(struct super_block *sb,
+	int save)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct snapshot_info *info;
+	struct snapshot_info *infos[FREE_BATCH];
+	int nr_infos;
+	u64 epoch_id = 0;
+	int i;
+
+	do {
+		nr_infos = radix_tree_gang_lookup(&sbi->snapshot_info_tree,
+					(void **)infos, epoch_id, FREE_BATCH);
+		for (i = 0; i < nr_infos; i++) {
+			info = infos[i];
+			BUG_ON(!info);
+			epoch_id = info->epoch_id;
+			if (save)
+				nova_save_snapshot_info(sb, info);
+			nova_delete_snapshot_info(sb, info, 0);
+			radix_tree_delete(&sbi->snapshot_info_tree, epoch_id);
+			nova_free_snapshot_info(info);
+		}
+		epoch_id++;
+	} while (nr_infos == FREE_BATCH);
+
+	return 0;
+}
+
+int nova_save_snapshots(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (sbi->snapshot_cleaner_thread)
+		kthread_stop(sbi->snapshot_cleaner_thread);
+
+	if (sbi->mount_snapshot)
+		return 0;
+
+	return nova_traverse_and_delete_snapshot_infos(sb, 1);
+}
+
+int nova_destroy_snapshot_infos(struct super_block *sb)
+{
+	return nova_traverse_and_delete_snapshot_infos(sb, 0);
+}
+
+static void snapshot_cleaner_try_sleeping(struct nova_sb_info *sbi)
+{
+	DEFINE_WAIT(wait);
+
+	prepare_to_wait(&sbi->snapshot_cleaner_wait, &wait, TASK_INTERRUPTIBLE);
+	schedule();
+	finish_wait(&sbi->snapshot_cleaner_wait, &wait);
+}
+
+static int nova_clean_snapshot(struct nova_sb_info *sbi)
+{
+	struct super_block *sb = sbi->sb;
+	struct snapshot_info *info;
+	struct snapshot_list *list;
+	int i;
+
+	if (!sbi->curr_clean_snapshot_info)
+		return 0;
+
+	info = sbi->curr_clean_snapshot_info;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		list = &info->lists[i];
+
+		mutex_lock(&list->list_mutex);
+		nova_background_clean_snapshot_list(sb, list,
+							info->epoch_id);
+		mutex_unlock(&list->list_mutex);
+	}
+
+	sbi->curr_clean_snapshot_info = NULL;
+	return 0;
+}
+
+static int nova_snapshot_cleaner(void *arg)
+{
+	struct nova_sb_info *sbi = arg;
+
+	nova_dbg("Running snapshot cleaner thread\n");
+	for (;;) {
+		snapshot_cleaner_try_sleeping(sbi);
+
+		if (kthread_should_stop())
+			break;
+
+		nova_clean_snapshot(sbi);
+	}
+
+	if (sbi->curr_clean_snapshot_info)
+		nova_clean_snapshot(sbi);
+
+	return 0;
+}
+
+static int nova_snapshot_cleaner_init(struct nova_sb_info *sbi)
+{
+	int ret = 0;
+
+	init_waitqueue_head(&sbi->snapshot_cleaner_wait);
+
+	sbi->snapshot_cleaner_thread = kthread_run(nova_snapshot_cleaner,
+		sbi, "nova_snapshot_cleaner");
+	if (IS_ERR(sbi->snapshot_cleaner_thread)) {
+		nova_info("Failed to start NOVA snapshot cleaner thread\n");
+		return PTR_ERR(sbi->snapshot_cleaner_thread);
+	}
+	nova_info("Start NOVA snapshot cleaner thread.\n");
+	return ret;
+}
+
+int nova_snapshot_init(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info_header *sih;
+	u64 ino = NOVA_SNAPSHOT_INO;
+	int ret;
+
+	sih = &sbi->snapshot_si->header;
+	nova_init_header(sb, sih, 0);
+	sih->pi_addr = nova_get_reserved_inode_addr(sb, ino);
+	sih->alter_pi_addr = nova_get_alter_reserved_inode_addr(sb, ino);
+	sih->ino = ino;
+	sih->i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+
+	INIT_RADIX_TREE(&sbi->snapshot_info_tree, GFP_ATOMIC);
+	init_waitqueue_head(&sbi->snapshot_mmap_wait);
+	ret = nova_snapshot_cleaner_init(sbi);
+
+	return ret;
+}
+
diff --git a/fs/nova/snapshot.h b/fs/nova/snapshot.h
new file mode 100644
index 000000000000..948dfd557de4
--- /dev/null
+++ b/fs/nova/snapshot.h
@@ -0,0 +1,98 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Snapshot header
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+
+/*
+ * DRAM log of updates to a snapshot.
+ */
+struct snapshot_list {
+	struct mutex list_mutex;
+	unsigned long num_pages;
+	unsigned long head;
+	unsigned long tail;
+};
+
+
+/*
+ * DRAM info about a snapshot.
+ */
+struct snapshot_info {
+	u64	epoch_id;
+	u64	timestamp;
+	unsigned long snapshot_entry; /* PMEM pointer to the struct
+				       * snapshot_info_entry for this
+				       * snapshot
+				       */
+
+	struct snapshot_list *lists;	/* Per-CPU snapshot list */
+};
+
+
+enum nova_snapshot_entry_type {
+	SS_INODE = 1,
+	SS_FILE_WRITE,
+};
+
+/*
+ * Snapshot log entry for recording an inode operation in a snapshot log.
+ *
+ * Todo: add checksum
+ */
+struct snapshot_inode_entry {
+	u8	type;
+	u8	deleted;
+	u8	padding[6];
+	u64	padding64;
+	u64	nova_ino;          // inode number that was deleted.
+	u64	delete_epoch_id;   // Deleted when?
+} __attribute((__packed__));
+
+/*
+ * Snapshot log entry for recording a write operation in a snapshot log
+ *
+ * Todo: add checksum.
+ */
+struct snapshot_file_write_entry {
+	u8	type;
+	u8	deleted;
+	u8	padding[6];
+	u64	nvmm;
+	u64	num_pages;
+	u64	delete_epoch_id;
+} __attribute((__packed__));
+
+/*
+ * PMEM structure pointing to a log comprised of snapshot_inode_entry and
+ * snapshot_file_write_entry objects.
+ *
+ * TODO: add checksum
+ */
+struct snapshot_nvmm_list {
+	__le64 padding;
+	__le64 num_pages;
+	__le64 head;
+	__le64 tail;
+} __attribute((__packed__));
+
+/* Support up to 128 CPUs */
+struct snapshot_nvmm_page {
+	struct snapshot_nvmm_list lists[128];
+};
+
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6f543a47fc92..349e319b10f6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -415,6 +415,8 @@ struct vm_operations_struct {
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,
 					  unsigned long addr);
+	/* For NOVA DAX-mmap protection */
+	int (*dax_cow)(struct vm_area_struct * area, unsigned long address);
 };
 
 struct mmu_gather;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 45cdb27791a3..0b7667fe3dfb 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -342,6 +342,9 @@ struct vm_area_struct {
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+
+	/* Flag for NOVA DAX cow */
+	int original_write;
 };
 
 struct core_thread {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8edd0d576254..aa27a5517a75 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -275,6 +275,7 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 
 	return pages;
 }
+EXPORT_SYMBOL(change_protection);
 
 int
 mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
@@ -288,7 +289,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	int error;
 	int dirty_accountable = 0;
 
-	if (newflags == oldflags) {
+	if (newflags == oldflags && vma->original_write == 0) {
 		*pprev = vma;
 		return 0;
 	}
@@ -352,6 +353,16 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	change_protection(vma, start, end, vma->vm_page_prot,
 			  dirty_accountable, 0);
 
+	/* Update NOVA vma list */
+	if (vma->vm_ops && vma->vm_ops->dax_cow) {
+		if (!(oldflags & VM_WRITE) && (newflags & VM_WRITE)) {
+			vma->vm_ops->open(vma);
+		} else if (!(newflags & VM_WRITE)) {
+			if (vma->original_write || (oldflags & VM_WRITE))
+				vma->vm_ops->close(vma);
+		}
+	}
+
 	/*
 	 * Private VM_LOCKED VMA becoming writable: trigger COW to avoid major
 	 * fault on access.

* [RFC 12/16] NOVA: Recovery code
  2017-08-03  7:48 ` Steven Swanson
@ 2017-08-03  7:49   ` Steven Swanson
  -1 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:49 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson

Clean umount/mount
------------------

On a clean unmount, Nova saves the contents of many of its DRAM data structures
to PMEM to accelerate the next mount:

1. Nova stores the allocator state for each of the per-CPU allocators to the
   log of a reserved inode (NOVA_BLOCKNODE_INO).

2. Nova stores the per-CPU lists of available inodes (the inuse_list) to the
   NOVA_INODELIST1_INO reserved inode.

3. Nova stores the snapshot state to PMEM as described above.

After a clean unmount, the following mount restores these data and then
invalidates them.
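
As a rough illustration of the restore half of this scheme, the sketch below
replays saved (low, high) range entries into per-CPU free lists.  It is a
userspace model, not the kernel code: struct range_entry, struct
free_list_sketch, SKETCH_CPUS and restore_free_lists() are hypothetical names,
and only the blocknr / per_list_blocks ownership rule is taken from
get_cpuid() in bbuild.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CPUS 4	/* hypothetical fixed CPU count for the sketch */

/* Hypothetical stand-in for one saved allocator log entry. */
struct range_entry {
	uint64_t range_low;
	uint64_t range_high;
};

/* Hypothetical, much reduced per-CPU free list: counters only. */
struct free_list_sketch {
	uint64_t num_blocknode;
	uint64_t num_free_blocks;
};

/* A block belongs to the CPU whose slice contains it (as in get_cpuid()). */
static int owner_cpu(uint64_t blocknr, uint64_t per_list_blocks)
{
	return (int)(blocknr / per_list_blocks);
}

/*
 * Replay n saved entries into the per-CPU lists.
 * Returns 0 on success, -1 if an entry is malformed or falls outside
 * the range covered by SKETCH_CPUS slices.
 */
static int restore_free_lists(const struct range_entry *log, size_t n,
	uint64_t per_list_blocks, struct free_list_sketch *lists)
{
	size_t i;

	assert(per_list_blocks > 0);

	for (i = 0; i < n; i++) {
		int cpu;

		if (log[i].range_low > log[i].range_high)
			return -1;

		cpu = owner_cpu(log[i].range_low, per_list_blocks);
		if (cpu >= SKETCH_CPUS)
			return -1;

		lists[cpu].num_blocknode++;
		lists[cpu].num_free_blocks +=
			log[i].range_high - log[i].range_low + 1;
	}

	return 0;
}
```

In the patch itself, nova_init_blockmap_from_inode() frees the saved log with
nova_free_inode_log() once the replay is done, which corresponds to the
"invalidates them" step above.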

Recovery after failures
-----------------------

In case of an unclean unmount (e.g., system crash), Nova must rebuild these
DRAM structures by scanning the inode logs.  Nova log scanning is fast because
per-CPU inode tables and per-inode logs allow for parallel recovery.

The number of live log entries in an inode log is roughly the number of extents
in the file.  As a result, Nova only needs to scan a small fraction of the NVMM
during recovery.

The Nova failure recovery consists of two steps:

First, Nova checks its lightweight journals and rolls back any uncommitted
transactions to restore the file system to a consistent state.

Second, Nova starts a recovery thread on each CPU and scans the inode tables in
parallel, performing log scanning for every valid inode in the inode table.
Nova uses different recovery mechanisms for directory inodes and file inodes:
For a directory inode, Nova scans the log's linked list to enumerate the pages
it occupies, but it does not inspect the log's contents.  For a file inode,
Nova reads the write entries in the log to enumerate the data pages.
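
The difference between the two scan modes can be sketched as follows.  This is
a simplified userspace model under stated assumptions: the log is an in-memory
linked list, the "bitmap" is one byte per block, and all names
(log_page_sketch, write_entry_sketch, scan_inode_log) are hypothetical rather
than taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLOCKS 64	/* hypothetical device size in blocks */
#define ENTRIES_PER_PAGE 4	/* hypothetical entries per log page */

/* Hypothetical write entry: num_pages == 0 marks an unused slot. */
struct write_entry_sketch {
	uint64_t block;		/* first data block written */
	uint64_t num_pages;	/* number of contiguous data blocks */
};

/* Hypothetical log page: knows its own block and the next page. */
struct log_page_sketch {
	uint64_t self_block;
	struct write_entry_sketch entries[ENTRIES_PER_PAGE];
	const struct log_page_sketch *next;
};

static void mark_used(uint8_t *bitmap, uint64_t block)
{
	assert(block < SKETCH_BLOCKS);
	bitmap[block] = 1;
}

/*
 * Walk one inode log and record the blocks it occupies.
 * Directory inode: only the log pages themselves are marked.
 * File inode: the data blocks named by each write entry are marked too.
 * Returns the number of log pages visited.
 */
static unsigned long scan_inode_log(const struct log_page_sketch *head,
	int is_dir, uint8_t *bitmap)
{
	const struct log_page_sketch *page;
	unsigned long log_pages = 0;

	for (page = head; page; page = page->next) {
		int i;

		mark_used(bitmap, page->self_block);
		log_pages++;

		if (is_dir)
			continue;	/* do not inspect the contents */

		for (i = 0; i < ENTRIES_PER_PAGE; i++) {
			const struct write_entry_sketch *e = &page->entries[i];
			uint64_t b;

			for (b = 0; b < e->num_pages; b++)
				mark_used(bitmap, e->block + b);
		}
	}

	return log_pages;
}
```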

During the recovery scan Nova builds a bitmap of occupied pages, and rebuilds
the allocator based on the result. After this process completes, the file
system is ready to accept new requests.
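
One way to picture the "bitmap to allocator" step is to turn every maximal run
of clear bits into one free range.  The sketch below does that in userspace
with one byte per block.  struct free_range_sketch and
bitmap_to_free_ranges() are hypothetical names, and the real code works on
the 4K/2M/1G scan bitmaps rather than a byte array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free extent: blocks low..high inclusive are free. */
struct free_range_sketch {
	uint64_t low;
	uint64_t high;
};

/*
 * Convert an occupancy map (used[i] != 0 means block i is occupied)
 * into maximal free ranges.  At most max_out ranges are written.
 * Returns the number of ranges written.
 */
static size_t bitmap_to_free_ranges(const uint8_t *used, uint64_t nblocks,
	struct free_range_sketch *out, size_t max_out)
{
	size_t n = 0;
	uint64_t i = 0;

	assert(used && out);

	while (i < nblocks && n < max_out) {
		uint64_t low;

		if (used[i]) {
			i++;
			continue;
		}

		/* Start of a free run: extend it as far as possible. */
		low = i;
		while (i < nblocks && !used[i])
			i++;

		out[n].low = low;
		out[n].high = i - 1;
		n++;
	}

	return n;
}
```

Each resulting range would then become one node in the owning CPU's free
tree.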

During the same scan, it rebuilds the snapshot information and the list of
available inodes.
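
The inode list rebuild relies on range coalescing: each recovered inode
number is mapped to a CPU (ino % cpus) and a per-CPU index (ino / cpus), and
the index is merged into that CPU's set of ranges.  The sketch below models
the merge cases of nova_failure_insert_inodetree() with a sorted array
instead of an rb-tree.  struct inuse_sketch, inuse_insert() and MAX_RANGES
are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_RANGES 16	/* hypothetical capacity of the sketch */

struct ino_range {
	unsigned long low;
	unsigned long high;
};

/* Sorted, non-overlapping ranges for one CPU. */
struct inuse_sketch {
	struct ino_range r[MAX_RANGES];
	size_t n;
};

static int ino_to_cpu(unsigned long ino, int cpus)
{
	return (int)(ino % cpus);
}

static unsigned long ino_to_internal(unsigned long ino, int cpus)
{
	return ino / cpus;
}

/*
 * Insert [low, high], merging with adjacent neighbours.
 * Returns 0 on success, -1 on overlap or when the array is full.
 */
static int inuse_insert(struct inuse_sketch *s, unsigned long low,
	unsigned long high)
{
	size_t pos = 0;
	int joins_prev, joins_next;

	assert(low <= high);

	/* Skip ranges entirely below the new one; reject overlaps. */
	while (pos < s->n && s->r[pos].low <= high) {
		if (s->r[pos].high >= low)
			return -1;
		pos++;
	}

	joins_prev = pos > 0 && s->r[pos - 1].high + 1 == low;
	joins_next = pos < s->n && high + 1 == s->r[pos].low;

	if (joins_prev && joins_next) {
		/* Fits the hole: fold next into prev and drop next. */
		s->r[pos - 1].high = s->r[pos].high;
		memmove(&s->r[pos], &s->r[pos + 1],
			(s->n - pos - 1) * sizeof(s->r[0]));
		s->n--;
	} else if (joins_prev) {
		s->r[pos - 1].high = high;	/* aligns left */
	} else if (joins_next) {
		s->r[pos].low = low;		/* aligns right */
	} else {
		if (s->n == MAX_RANGES)
			return -1;
		/* Somewhere in the middle: open a slot for a new range. */
		memmove(&s->r[pos + 1], &s->r[pos],
			(s->n - pos) * sizeof(s->r[0]));
		s->r[pos].low = low;
		s->r[pos].high = high;
		s->n++;
	}

	return 0;
}
```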

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 fs/nova/bbuild.c  | 1602 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/rebuild.c |  847 ++++++++++++++++++++++++++++
 2 files changed, 2449 insertions(+)
 create mode 100644 fs/nova/bbuild.c
 create mode 100644 fs/nova/rebuild.c

diff --git a/fs/nova/bbuild.c b/fs/nova/bbuild.c
new file mode 100644
index 000000000000..bdfcc3e3d70f
--- /dev/null
+++ b/fs/nova/bbuild.c
@@ -0,0 +1,1602 @@
+/*
+ * NOVA Recovery routines.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#include <linux/fs.h>
+#include <linux/bitops.h>
+#include <linux/slab.h>
+#include <linux/random.h>
+#include <linux/delay.h>
+#include "nova.h"
+#include "journal.h"
+#include "super.h"
+#include "inode.h"
+#include "log.h"
+
+void nova_init_header(struct super_block *sb,
+	struct nova_inode_info_header *sih, u16 i_mode)
+{
+	sih->log_pages = 0;
+	sih->i_size = 0;
+	sih->i_blocks = 0;
+	sih->pi_addr = 0;
+	sih->alter_pi_addr = 0;
+	INIT_RADIX_TREE(&sih->tree, GFP_ATOMIC);
+	sih->vma_tree = RB_ROOT;
+	sih->num_vmas = 0;
+	INIT_LIST_HEAD(&sih->list);
+	sih->i_mode = i_mode;
+	sih->valid_entries = 0;
+	sih->num_entries = 0;
+	sih->last_setattr = 0;
+	sih->last_link_change = 0;
+	sih->last_dentry = 0;
+	sih->trans_id = 0;
+	sih->log_head = 0;
+	sih->log_tail = 0;
+	sih->alter_log_head = 0;
+	sih->alter_log_tail = 0;
+	sih->i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+}
+
+static inline void set_scan_bm(unsigned long bit,
+	struct single_scan_bm *scan_bm)
+{
+	set_bit(bit, scan_bm->bitmap);
+}
+
+inline void set_bm(unsigned long bit, struct scan_bitmap *bm,
+	enum bm_type type)
+{
+	switch (type) {
+	case BM_4K:
+		set_scan_bm(bit, &bm->scan_bm_4K);
+		break;
+	case BM_2M:
+		set_scan_bm(bit, &bm->scan_bm_2M);
+		break;
+	case BM_1G:
+		set_scan_bm(bit, &bm->scan_bm_1G);
+		break;
+	default:
+		break;
+	}
+}
+
+static inline int get_cpuid(struct nova_sb_info *sbi, unsigned long blocknr)
+{
+	return blocknr / sbi->per_list_blocks;
+}
+
+static int nova_failure_insert_inodetree(struct super_block *sb,
+	unsigned long ino_low, unsigned long ino_high)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct inode_map *inode_map;
+	struct nova_range_node *prev = NULL, *next = NULL;
+	struct nova_range_node *new_node;
+	unsigned long internal_low, internal_high;
+	int cpu;
+	struct rb_root *tree;
+	int ret;
+
+	if (ino_low > ino_high) {
+		nova_err(sb, "%s: ino low %lu, ino high %lu\n",
+				__func__, ino_low, ino_high);
+		BUG();
+	}
+
+	cpu = ino_low % sbi->cpus;
+	if (ino_high % sbi->cpus != cpu) {
+		nova_err(sb, "%s: ino low %lu, ino high %lu\n",
+				__func__, ino_low, ino_high);
+		BUG();
+	}
+
+	internal_low = ino_low / sbi->cpus;
+	internal_high = ino_high / sbi->cpus;
+	inode_map = &sbi->inode_maps[cpu];
+	tree = &inode_map->inode_inuse_tree;
+	mutex_lock(&inode_map->inode_table_mutex);
+
+	ret = nova_find_free_slot(sbi, tree, internal_low, internal_high,
+					&prev, &next);
+	if (ret) {
+		nova_dbg("%s: ino %lu - %lu already exists!: %d\n",
+					__func__, ino_low, ino_high, ret);
+		mutex_unlock(&inode_map->inode_table_mutex);
+		return ret;
+	}
+
+	if (prev && next && (internal_low == prev->range_high + 1) &&
+			(internal_high + 1 == next->range_low)) {
+		/* fits the hole */
+		rb_erase(&next->node, tree);
+		inode_map->num_range_node_inode--;
+		prev->range_high = next->range_high;
+		nova_update_range_node_checksum(prev);
+		nova_free_inode_node(sb, next);
+		goto finish;
+	}
+	if (prev && (internal_low == prev->range_high + 1)) {
+		/* Aligns left */
+		prev->range_high += internal_high - internal_low + 1;
+		nova_update_range_node_checksum(prev);
+		goto finish;
+	}
+	if (next && (internal_high + 1 == next->range_low)) {
+		/* Aligns right */
+		next->range_low -= internal_high - internal_low + 1;
+		nova_update_range_node_checksum(next);
+		goto finish;
+	}
+
+	/* Aligns somewhere in the middle */
+	new_node = nova_alloc_inode_node(sb);
+	NOVA_ASSERT(new_node);
+	new_node->range_low = internal_low;
+	new_node->range_high = internal_high;
+	nova_update_range_node_checksum(new_node);
+	ret = nova_insert_inodetree(sbi, new_node, cpu);
+	if (ret) {
+		nova_err(sb, "%s failed\n", __func__);
+		nova_free_inode_node(sb, new_node);
+		goto finish;
+	}
+	inode_map->num_range_node_inode++;
+
+finish:
+	mutex_unlock(&inode_map->inode_table_mutex);
+	return ret;
+}
+
+static void nova_destroy_range_node_tree(struct super_block *sb,
+	struct rb_root *tree)
+{
+	struct nova_range_node *curr;
+	struct rb_node *temp;
+
+	temp = rb_first(tree);
+	while (temp) {
+		curr = container_of(temp, struct nova_range_node, node);
+		temp = rb_next(temp);
+		rb_erase(&curr->node, tree);
+		nova_free_range_node(curr);
+	}
+}
+
+static void nova_destroy_blocknode_tree(struct super_block *sb, int cpu)
+{
+	struct free_list *free_list;
+
+	free_list = nova_get_free_list(sb, cpu);
+	nova_destroy_range_node_tree(sb, &free_list->block_free_tree);
+}
+
+static void nova_destroy_blocknode_trees(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++)
+		nova_destroy_blocknode_tree(sb, i);
+
+}
+
+static int nova_init_blockmap_from_inode(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode *pi = nova_get_inode_by_ino(sb, NOVA_BLOCKNODE_INO);
+	struct nova_inode_info_header sih;
+	struct free_list *free_list;
+	struct nova_range_node_lowhigh *entry;
+	struct nova_range_node *blknode;
+	size_t size = sizeof(struct nova_range_node_lowhigh);
+	u64 curr_p;
+	u64 cpuid;
+	int ret = 0;
+
+	/* FIXME: Backup inode for BLOCKNODE */
+	ret = nova_get_head_tail(sb, pi, &sih);
+	if (ret)
+		goto out;
+
+	curr_p = sih.log_head;
+	if (curr_p == 0) {
+		nova_dbg("%s: pi head is 0!\n", __func__);
+		return -EINVAL;
+	}
+
+	while (curr_p != sih.log_tail) {
+		if (is_last_entry(curr_p, size))
+			curr_p = next_log_page(sb, curr_p);
+
+		if (curr_p == 0) {
+			nova_dbg("%s: curr_p is NULL!\n", __func__);
+			NOVA_ASSERT(0);
+			ret = -EINVAL;
+			break;
+		}
+
+		entry = (struct nova_range_node_lowhigh *)nova_get_block(sb,
+							curr_p);
+		blknode = nova_alloc_blocknode(sb);
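+		if (blknode == NULL) {
+			/* Do not dereference a failed allocation */
+			NOVA_ASSERT(0);
+			nova_destroy_blocknode_trees(sb);
+			ret = -ENOMEM;
+			goto out;
+		}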
+		blknode->range_low = le64_to_cpu(entry->range_low);
+		blknode->range_high = le64_to_cpu(entry->range_high);
+		nova_update_range_node_checksum(blknode);
+		cpuid = get_cpuid(sbi, blknode->range_low);
+
+		/* FIXME: Assumes NR_CPUS has not changed since the last mount */
+		free_list = nova_get_free_list(sb, cpuid);
+		ret = nova_insert_blocktree(sbi,
+				&free_list->block_free_tree, blknode);
+		if (ret) {
+			nova_err(sb, "%s failed\n", __func__);
+			nova_free_blocknode(sb, blknode);
+			NOVA_ASSERT(0);
+			nova_destroy_blocknode_trees(sb);
+			goto out;
+		}
+		free_list->num_blocknode++;
+		if (free_list->num_blocknode == 1)
+			free_list->first_node = blknode;
+		free_list->last_node = blknode;
+		free_list->num_free_blocks +=
+			blknode->range_high - blknode->range_low + 1;
+		curr_p += sizeof(struct nova_range_node_lowhigh);
+	}
+out:
+	nova_free_inode_log(sb, pi, &sih);
+	return ret;
+}
+
+static void nova_destroy_inode_trees(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct inode_map *inode_map;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		inode_map = &sbi->inode_maps[i];
+		nova_destroy_range_node_tree(sb,
+					&inode_map->inode_inuse_tree);
+	}
+}
+
+#define CPUID_MASK 0xff00000000000000
+
+static int nova_init_inode_list_from_inode(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode *pi = nova_get_inode_by_ino(sb, NOVA_INODELIST1_INO);
+	struct nova_inode_info_header sih;
+	struct nova_range_node_lowhigh *entry;
+	struct nova_range_node *range_node;
+	struct inode_map *inode_map;
+	size_t size = sizeof(struct nova_range_node_lowhigh);
+	unsigned long num_inode_node = 0;
+	u64 curr_p;
+	unsigned long cpuid;
+	int ret;
+
+	/* FIXME: Backup inode for INODELIST */
+	ret = nova_get_head_tail(sb, pi, &sih);
+	if (ret)
+		goto out;
+
+	sbi->s_inodes_used_count = 0;
+	curr_p = sih.log_head;
+	if (curr_p == 0) {
+		nova_dbg("%s: pi head is 0!\n", __func__);
+		return -EINVAL;
+	}
+
+	while (curr_p != sih.log_tail) {
+		if (is_last_entry(curr_p, size))
+			curr_p = next_log_page(sb, curr_p);
+
+		if (curr_p == 0) {
+			nova_dbg("%s: curr_p is NULL!\n", __func__);
+			NOVA_ASSERT(0);
+			nova_destroy_inode_trees(sb);
+			ret = -EINVAL;
+			goto out;
+		}
+
+		entry = (struct nova_range_node_lowhigh *)nova_get_block(sb,
+							curr_p);
+		range_node = nova_alloc_inode_node(sb);
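+		if (range_node == NULL) {
+			NOVA_ASSERT(0);
+			nova_destroy_inode_trees(sb);
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		/* The top byte of the on-media range_low encodes the CPU id */
+		cpuid = (le64_to_cpu(entry->range_low) & CPUID_MASK) >> 56;
+		if (cpuid >= sbi->cpus) {
+			nova_err(sb, "Invalid cpuid %lu\n", cpuid);
+			nova_free_inode_node(sb, range_node);
+			NOVA_ASSERT(0);
+			nova_destroy_inode_trees(sb);
+			ret = -EINVAL;
+			goto out;
+		}
+
+		range_node->range_low =
+			le64_to_cpu(entry->range_low) & ~CPUID_MASK;
+		range_node->range_high = le64_to_cpu(entry->range_high);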
+		nova_update_range_node_checksum(range_node);
+		ret = nova_insert_inodetree(sbi, range_node, cpuid);
+		if (ret) {
+			nova_err(sb, "%s failed, %lu\n", __func__, cpuid);
+			nova_free_inode_node(sb, range_node);
+			NOVA_ASSERT(0);
+			nova_destroy_inode_trees(sb);
+			goto out;
+		}
+
+		sbi->s_inodes_used_count +=
+			range_node->range_high - range_node->range_low + 1;
+		num_inode_node++;
+
+		inode_map = &sbi->inode_maps[cpuid];
+		inode_map->num_range_node_inode++;
+		if (!inode_map->first_inode_range)
+			inode_map->first_inode_range = range_node;
+
+		curr_p += sizeof(struct nova_range_node_lowhigh);
+	}
+
+	nova_dbg("%s: %lu inode nodes\n", __func__, num_inode_node);
+out:
+	nova_free_inode_log(sb, pi, &sih);
+	return ret;
+}
+
+static u64 nova_append_range_node_entry(struct super_block *sb,
+	struct nova_range_node *curr, u64 tail, unsigned long cpuid)
+{
+	u64 curr_p;
+	size_t size = sizeof(struct nova_range_node_lowhigh);
+	struct nova_range_node_lowhigh *entry;
+
+	curr_p = tail;
+
+	if (!nova_range_node_checksum_ok(curr)) {
+		nova_dbg("%s: range node checksum failure\n", __func__);
+		goto out;
+	}
+
+	if (curr_p == 0 || (is_last_entry(curr_p, size) &&
+				next_log_page(sb, curr_p) == 0)) {
+		nova_dbg("%s: inode log reaches end?\n", __func__);
+		goto out;
+	}
+
+	if (is_last_entry(curr_p, size))
+		curr_p = next_log_page(sb, curr_p);
+
+	entry = (struct nova_range_node_lowhigh *)nova_get_block(sb, curr_p);
+	nova_memunlock_range(sb, entry, size);
+	entry->range_low = cpu_to_le64(curr->range_low);
+	if (cpuid)
+		entry->range_low |= cpu_to_le64(cpuid << 56);
+	entry->range_high = cpu_to_le64(curr->range_high);
+	nova_memlock_range(sb, entry, size);
+	nova_dbgv("append entry block low 0x%lx, high 0x%lx\n",
+			curr->range_low, curr->range_high);
+
+	nova_flush_buffer(entry, sizeof(struct nova_range_node_lowhigh), 0);
+out:
+	return curr_p;
+}
+
+static u64 nova_save_range_nodes_to_log(struct super_block *sb,
+	struct rb_root *tree, u64 temp_tail, unsigned long cpuid)
+{
+	struct nova_range_node *curr;
+	struct rb_node *temp;
+	size_t size = sizeof(struct nova_range_node_lowhigh);
+	u64 curr_entry = 0;
+
+	/* Save in increasing order */
+	temp = rb_first(tree);
+	while (temp) {
+		curr = container_of(temp, struct nova_range_node, node);
+		curr_entry = nova_append_range_node_entry(sb, curr,
+						temp_tail, cpuid);
+		temp_tail = curr_entry + size;
+		temp = rb_next(temp);
+		rb_erase(&curr->node, tree);
+		nova_free_range_node(curr);
+	}
+
+	return temp_tail;
+}
+
+static u64 nova_save_free_list_blocknodes(struct super_block *sb, int cpu,
+	u64 temp_tail)
+{
+	struct free_list *free_list;
+
+	free_list = nova_get_free_list(sb, cpu);
+	temp_tail = nova_save_range_nodes_to_log(sb,
+				&free_list->block_free_tree, temp_tail, 0);
+	return temp_tail;
+}
+
+void nova_save_inode_list_to_log(struct super_block *sb)
+{
+	struct nova_inode *pi = nova_get_inode_by_ino(sb, NOVA_INODELIST1_INO);
+	struct nova_inode_info_header sih;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	unsigned long num_blocks;
+	unsigned long num_nodes = 0;
+	struct inode_map *inode_map;
+	unsigned long i;
+	u64 temp_tail;
+	u64 new_block;
+	int allocated;
+
+	sih.ino = NOVA_INODELIST1_INO;
+	sih.i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+	sih.i_blocks = 0;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		inode_map = &sbi->inode_maps[i];
+		num_nodes += inode_map->num_range_node_inode;
+	}
+
+	num_blocks = num_nodes / RANGENODE_PER_PAGE;
+	if (num_nodes % RANGENODE_PER_PAGE)
+		num_blocks++;
+
+	allocated = nova_allocate_inode_log_pages(sb, &sih, num_blocks,
+						&new_block, ANY_CPU, 0);
+	if (allocated != num_blocks) {
+		nova_dbg("Error saving inode list: %d\n", allocated);
+		return;
+	}
+
+	temp_tail = new_block;
+	for (i = 0; i < sbi->cpus; i++) {
+		inode_map = &sbi->inode_maps[i];
+		temp_tail = nova_save_range_nodes_to_log(sb,
+				&inode_map->inode_inuse_tree, temp_tail, i);
+	}
+
+	nova_memunlock_inode(sb, pi);
+	pi->alter_log_head = pi->alter_log_tail = 0;
+	pi->log_head = new_block;
+	nova_update_tail(pi, temp_tail);
+	nova_flush_buffer(&pi->log_head, CACHELINE_SIZE, 0);
+	nova_memlock_inode(sb, pi);
+
+	nova_dbg("%s: %lu inode nodes, pi head 0x%llx, tail 0x%llx\n",
+		__func__, num_nodes, pi->log_head, pi->log_tail);
+}
+
+void nova_save_blocknode_mappings_to_log(struct super_block *sb)
+{
+	struct nova_inode *pi = nova_get_inode_by_ino(sb, NOVA_BLOCKNODE_INO);
+	struct nova_inode_info_header sih;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	unsigned long num_blocknode = 0;
+	unsigned long num_pages;
+	int allocated;
+	u64 new_block = 0;
+	u64 temp_tail;
+	int i;
+
+	sih.ino = NOVA_BLOCKNODE_INO;
+	sih.i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+
+	/* Allocate log pages before saving the blocknode mappings */
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+		num_blocknode += free_list->num_blocknode;
+		nova_dbgv("%s: free list %d: %lu nodes\n", __func__,
+				i, free_list->num_blocknode);
+	}
+
+	num_pages = num_blocknode / RANGENODE_PER_PAGE;
+	if (num_blocknode % RANGENODE_PER_PAGE)
+		num_pages++;
+
+	allocated = nova_allocate_inode_log_pages(sb, &sih, num_pages,
+						&new_block, ANY_CPU, 0);
+	if (allocated != num_pages) {
+		nova_dbg("Error saving blocknode mappings: %d\n", allocated);
+		return;
+	}
+
+	temp_tail = new_block;
+	for (i = 0; i < sbi->cpus; i++)
+		temp_tail = nova_save_free_list_blocknodes(sb, i, temp_tail);
+
+	/* Finally update log head and tail */
+	nova_memunlock_inode(sb, pi);
+	pi->alter_log_head = pi->alter_log_tail = 0;
+	pi->log_head = new_block;
+	nova_update_tail(pi, temp_tail);
+	nova_flush_buffer(&pi->log_head, CACHELINE_SIZE, 0);
+	nova_memlock_inode(sb, pi);
+
+	nova_dbg("%s: %lu blocknodes, %lu log pages, pi head 0x%llx, tail 0x%llx\n",
+		  __func__, num_blocknode, num_pages,
+		  pi->log_head, pi->log_tail);
+}
+
+static int nova_insert_blocknode_map(struct super_block *sb,
+	int cpuid, unsigned long low, unsigned long high)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	struct rb_root *tree;
+	struct nova_range_node *blknode = NULL;
+	unsigned long num_blocks = 0;
+	int ret;
+
+	num_blocks = high - low + 1;
+	nova_dbgv("%s: cpu %d, low %lu, high %lu, num %lu\n",
+		__func__, cpuid, low, high, num_blocks);
+	free_list = nova_get_free_list(sb, cpuid);
+	tree = &(free_list->block_free_tree);
+
+	blknode = nova_alloc_blocknode(sb);
+	if (blknode == NULL)
+		return -ENOMEM;
+	blknode->range_low = low;
+	blknode->range_high = high;
+	nova_update_range_node_checksum(blknode);
+	ret = nova_insert_blocktree(sbi, tree, blknode);
+	if (ret) {
+		nova_err(sb, "%s failed\n", __func__);
+		nova_free_blocknode(sb, blknode);
+		goto out;
+	}
+	if (!free_list->first_node)
+		free_list->first_node = blknode;
+	free_list->last_node = blknode;
+	free_list->num_blocknode++;
+	free_list->num_free_blocks += num_blocks;
+out:
+	return ret;
+}
+
+static int __nova_build_blocknode_map(struct super_block *sb,
+	unsigned long *bitmap, unsigned long bsize, unsigned long scale)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	unsigned long next = 0;
+	unsigned long low = 0;
+	unsigned long start, end;
+	int cpuid = 0;
+
+	free_list = nova_get_free_list(sb, cpuid);
+	start = free_list->block_start;
+	end = free_list->block_end + 1;
+	while (1) {
+		next = find_next_zero_bit(bitmap, end, start);
+		if (next == bsize)
+			break;
+		if (next == end) {
+			if (cpuid == sbi->cpus - 1)
+				break;
+
+			cpuid++;
+			free_list = nova_get_free_list(sb, cpuid);
+			start = free_list->block_start;
+			end = free_list->block_end + 1;
+			continue;
+		}
+
+		low = next;
+		next = find_next_bit(bitmap, end, next);
+		if (nova_insert_blocknode_map(sb, cpuid,
+				low << scale, (next << scale) - 1)) {
+			nova_dbg("Error: could not insert %lu - %lu\n",
+				low << scale, ((next << scale) - 1));
+		}
+		start = next;
+		if (next == bsize)
+			break;
+		if (next == end) {
+			if (cpuid == sbi->cpus - 1)
+				break;
+
+			cpuid++;
+			free_list = nova_get_free_list(sb, cpuid);
+			start = free_list->block_start;
+			end = free_list->block_end + 1;
+		}
+	}
+	return 0;
+}
+
+static void nova_update_4K_map(struct super_block *sb,
+	struct scan_bitmap *bm,	unsigned long *bitmap,
+	unsigned long bsize, unsigned long scale)
+{
+	unsigned long next = 0;
+	unsigned long low = 0;
+	unsigned long i;
+
+	while (1) {
+		next = find_next_bit(bitmap, bsize, next);
+		if (next == bsize)
+			break;
+		low = next;
+		next = find_next_zero_bit(bitmap, bsize, next);
+		for (i = (low << scale); i < (next << scale); i++)
+			set_bm(i, bm, BM_4K);
+		if (next == bsize)
+			break;
+	}
+}
+
+struct scan_bitmap *global_bm[64];
+
+static int nova_build_blocknode_map(struct super_block *sb,
+	unsigned long initsize)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct scan_bitmap *bm;
+	struct scan_bitmap *final_bm;
+	unsigned long *src, *dst;
+	int i, j;
+	int num;
+	int ret;
+
+	final_bm = kzalloc(sizeof(struct scan_bitmap), GFP_KERNEL);
+	if (!final_bm)
+		return -ENOMEM;
+
+	final_bm->scan_bm_4K.bitmap_size =
+				(initsize >> (PAGE_SHIFT + 0x3));
+
+	/* Alloc memory to hold the block alloc bitmap */
+	final_bm->scan_bm_4K.bitmap = kzalloc(final_bm->scan_bm_4K.bitmap_size,
+							GFP_KERNEL);
+
+	if (!final_bm->scan_bm_4K.bitmap) {
+		kfree(final_bm);
+		return -ENOMEM;
+	}
+
+	/*
+	 * We are using free lists. Set 2M and 1G blocks in 4K map,
+	 * and use 4K map to rebuild block map.
+	 */
+	for (i = 0; i < sbi->cpus; i++) {
+		bm = global_bm[i];
+		nova_update_4K_map(sb, bm, bm->scan_bm_2M.bitmap,
+			bm->scan_bm_2M.bitmap_size * 8, PAGE_SHIFT_2M - 12);
+		nova_update_4K_map(sb, bm, bm->scan_bm_1G.bitmap,
+			bm->scan_bm_1G.bitmap_size * 8, PAGE_SHIFT_1G - 12);
+	}
+
+	/* Merge per-CPU bms to the final single bm */
+	num = final_bm->scan_bm_4K.bitmap_size / sizeof(unsigned long);
+	if (final_bm->scan_bm_4K.bitmap_size % sizeof(unsigned long))
+		num++;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		bm = global_bm[i];
+		src = (unsigned long *)bm->scan_bm_4K.bitmap;
+		dst = (unsigned long *)final_bm->scan_bm_4K.bitmap;
+
+		for (j = 0; j < num; j++)
+			dst[j] |= src[j];
+	}
+
+	ret = __nova_build_blocknode_map(sb, final_bm->scan_bm_4K.bitmap,
+			final_bm->scan_bm_4K.bitmap_size * 8, PAGE_SHIFT - 12);
+
+	kfree(final_bm->scan_bm_4K.bitmap);
+	kfree(final_bm);
+
+	return ret;
+}
+
+static void free_bm(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct scan_bitmap *bm;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		bm = global_bm[i];
+		if (bm) {
+			kfree(bm->scan_bm_4K.bitmap);
+			kfree(bm->scan_bm_2M.bitmap);
+			kfree(bm->scan_bm_1G.bitmap);
+			kfree(bm);
+		}
+	}
+}
+
+static int alloc_bm(struct super_block *sb, unsigned long initsize)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct scan_bitmap *bm;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		bm = kzalloc(sizeof(struct scan_bitmap), GFP_KERNEL);
+		if (!bm)
+			return -ENOMEM;
+
+		global_bm[i] = bm;
+
+		bm->scan_bm_4K.bitmap_size =
+				(initsize >> (PAGE_SHIFT + 0x3));
+		bm->scan_bm_2M.bitmap_size =
+				(initsize >> (PAGE_SHIFT_2M + 0x3));
+		bm->scan_bm_1G.bitmap_size =
+				(initsize >> (PAGE_SHIFT_1G + 0x3));
+
+		/* Alloc memory to hold the block alloc bitmap */
+		bm->scan_bm_4K.bitmap = kzalloc(bm->scan_bm_4K.bitmap_size,
+							GFP_KERNEL);
+		bm->scan_bm_2M.bitmap = kzalloc(bm->scan_bm_2M.bitmap_size,
+							GFP_KERNEL);
+		bm->scan_bm_1G.bitmap = kzalloc(bm->scan_bm_1G.bitmap_size,
+							GFP_KERNEL);
+
+		if (!bm->scan_bm_4K.bitmap || !bm->scan_bm_2M.bitmap ||
+				!bm->scan_bm_1G.bitmap)
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+/************************** NOVA recovery ****************************/
+
+#define MAX_PGOFF	262144
+
+struct task_ring {
+	u64 addr0[512];
+	u64 addr1[512];		/* Second inode address */
+	int num;
+	int inodes_used_count;
+	u64 *entry_array;
+	u64 *nvmm_array;
+};
+
+static struct task_ring *task_rings;
+static struct task_struct **threads;
+wait_queue_head_t finish_wq;
+int *finished;
+
+static int nova_traverse_inode_log(struct super_block *sb,
+	struct nova_inode *pi, struct scan_bitmap *bm, u64 head)
+{
+	u64 curr_p;
+	u64 next;
+
+	curr_p = head;
+
+	if (curr_p == 0)
+		return 0;
+
+	BUG_ON(curr_p & (PAGE_SIZE - 1));
+	set_bm(curr_p >> PAGE_SHIFT, bm, BM_4K);
+
+	next = next_log_page(sb, curr_p);
+	while (next > 0) {
+		curr_p = next;
+		BUG_ON(curr_p & (PAGE_SIZE - 1));
+		set_bm(curr_p >> PAGE_SHIFT, bm, BM_4K);
+		next = next_log_page(sb, curr_p);
+	}
+
+	return 0;
+}
+
+static void nova_traverse_dir_inode_log(struct super_block *sb,
+	struct nova_inode *pi, struct scan_bitmap *bm)
+{
+	nova_traverse_inode_log(sb, pi, bm, pi->log_head);
+	if (metadata_csum)
+		nova_traverse_inode_log(sb, pi, bm, pi->alter_log_head);
+}
+
+static unsigned int nova_check_old_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 entry_addr,
+	unsigned long pgoff, unsigned int num_free,
+	u64 epoch_id, struct task_ring *ring, unsigned long base,
+	struct scan_bitmap *bm)
+{
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	unsigned long old_nvmm, nvmm;
+	unsigned long index;
+	int i;
+	int ret;
+
+	entry = (struct nova_file_write_entry *)entry_addr;
+
+	if (!entry)
+		return 0;
+
+	if (metadata_csum == 0)
+		entryc = entry;
+	else {
+		entryc = &entry_copy;
+		if (!nova_verify_entry_csum(sb, entry, entryc))
+			return 0;
+	}
+
+	old_nvmm = get_nvmm(sb, sih, entryc, pgoff);
+
+	ret = nova_append_data_to_snapshot(sb, entryc, old_nvmm,
+				num_free, epoch_id);
+
+	if (ret != 0)
+		return ret;
+
+	index = pgoff - base;
+	for (i = 0; i < num_free; i++) {
+		nvmm = ring->nvmm_array[index];
+		if (nvmm)
+			set_bm(nvmm, bm, BM_4K);
+		index++;
+	}
+
+	return ret;
+}
+
+static int nova_set_ring_array(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	struct nova_file_write_entry *entryc, struct task_ring *ring,
+	unsigned long base, struct scan_bitmap *bm)
+{
+	unsigned long start, end;
+	unsigned long pgoff, old_pgoff = 0;
+	unsigned long index;
+	unsigned int num_free = 0;
+	u64 old_entry = 0;
+	u64 epoch_id = entryc->epoch_id;
+
+	start = entryc->pgoff;
+	if (start < base)
+		start = base;
+
+	end = entryc->pgoff + entryc->num_pages;
+	if (end > base + MAX_PGOFF)
+		end = base + MAX_PGOFF;
+
+	for (pgoff = start; pgoff < end; pgoff++) {
+		index = pgoff - base;
+		if (ring->nvmm_array[index]) {
+			if (ring->entry_array[index] != old_entry) {
+				if (old_entry)
+					nova_check_old_entry(sb, sih, old_entry,
+							old_pgoff, num_free,
+							epoch_id, ring, base,
+							bm);
+
+				old_entry = ring->entry_array[index];
+				old_pgoff = pgoff;
+				num_free = 1;
+			} else {
+				num_free++;
+			}
+		}
+	}
+
+	if (old_entry)
+		nova_check_old_entry(sb, sih, old_entry, old_pgoff,
+					num_free, epoch_id, ring, base, bm);
+
+	for (pgoff = start; pgoff < end; pgoff++) {
+		index = pgoff - base;
+		ring->entry_array[index] = (u64)entry;
+		ring->nvmm_array[index] = (u64)(entryc->block >> PAGE_SHIFT)
+						+ pgoff - entryc->pgoff;
+	}
+
+	return 0;
+}
+
+static int nova_set_file_bm(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct task_ring *ring,
+	struct scan_bitmap *bm, unsigned long base, unsigned long last_blocknr)
+{
+	unsigned long nvmm, pgoff;
+
+	if (last_blocknr >= base + MAX_PGOFF)
+		last_blocknr = MAX_PGOFF - 1;
+	else
+		last_blocknr -= base;
+
+	for (pgoff = 0; pgoff <= last_blocknr; pgoff++) {
+		nvmm = ring->nvmm_array[pgoff];
+		if (nvmm) {
+			set_bm(nvmm, bm, BM_4K);
+			ring->nvmm_array[pgoff] = 0;
+			ring->entry_array[pgoff] = 0;
+		}
+	}
+
+	return 0;
+}
+
+/* The entry given to this function is a copy in DRAM */
+static void nova_ring_setattr_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_setattr_logentry *entry, struct task_ring *ring,
+	unsigned long base, unsigned int data_bits, struct scan_bitmap *bm)
+{
+	unsigned long first_blocknr, last_blocknr;
+	unsigned long pgoff, old_pgoff = 0;
+	unsigned long index;
+	unsigned int num_free = 0;
+	u64 old_entry = 0;
+	loff_t start, end;
+	u64 epoch_id = entry->epoch_id;
+
+	if (sih->i_size <= entry->size)
+		goto out;
+
+	start = entry->size;
+	end = sih->i_size;
+
+	first_blocknr = (start + (1UL << data_bits) - 1) >> data_bits;
+
+	if (end > 0)
+		last_blocknr = (end - 1) >> data_bits;
+	else
+		last_blocknr = 0;
+
+	if (first_blocknr > last_blocknr)
+		goto out;
+
+	if (first_blocknr < base)
+		first_blocknr = base;
+
+	if (last_blocknr > base + MAX_PGOFF - 1)
+		last_blocknr = base + MAX_PGOFF - 1;
+
+	for (pgoff = first_blocknr; pgoff <= last_blocknr; pgoff++) {
+		index = pgoff - base;
+		if (ring->nvmm_array[index]) {
+			if (ring->entry_array[index] != old_entry) {
+				if (old_entry)
+					nova_check_old_entry(sb, sih, old_entry,
+							old_pgoff, num_free,
+							epoch_id, ring, base,
+							bm);
+
+				old_entry = ring->entry_array[index];
+				old_pgoff = pgoff;
+				num_free = 1;
+			} else {
+				num_free++;
+			}
+		}
+	}
+
+	if (old_entry)
+		nova_check_old_entry(sb, sih, old_entry, old_pgoff,
+					num_free, epoch_id, ring, base, bm);
+
+	for (pgoff = first_blocknr; pgoff <= last_blocknr; pgoff++) {
+		index = pgoff - base;
+		ring->nvmm_array[index] = 0;
+		ring->entry_array[index] = 0;
+	}
+
+out:
+	sih->i_size = entry->size;
+}
+
+static void nova_traverse_file_write_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	struct nova_file_write_entry *entryc, struct task_ring *ring,
+	unsigned long base, struct scan_bitmap *bm)
+{
+	sih->i_size = entryc->size;
+
+	if (entryc->num_pages != entryc->invalid_pages) {
+		if (entryc->pgoff < base + MAX_PGOFF &&
+				entryc->pgoff + entryc->num_pages > base)
+			nova_set_ring_array(sb, sih, entry, entryc,
+						ring, base, bm);
+	}
+}
+
+static int nova_traverse_file_inode_log(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	struct task_ring *ring, struct scan_bitmap *bm)
+{
+	char entry_copy[NOVA_MAX_ENTRY_LEN];
+	unsigned long base = 0;
+	unsigned long last_blocknr;
+	u64 ino = pi->nova_ino;
+	void *entry, *entryc;
+	unsigned int btype;
+	unsigned int data_bits;
+	u64 curr_p;
+	u64 next;
+	u8 type;
+
+	btype = pi->i_blk_type;
+	data_bits = blk_type_to_shift[btype];
+
+	if (metadata_csum)
+		nova_traverse_inode_log(sb, pi, bm, pi->alter_log_head);
+
+	entryc = (metadata_csum == 0) ? NULL : entry_copy;
+
+again:
+	sih->i_size = 0;
+	curr_p = pi->log_head;
+	nova_dbg_verbose("Log head 0x%llx, tail 0x%llx\n",
+				curr_p, pi->log_tail);
+	if (curr_p == 0 && pi->log_tail == 0)
+		return 0;
+
+	if (base == 0) {
+		BUG_ON(curr_p & (PAGE_SIZE - 1));
+		set_bm(curr_p >> PAGE_SHIFT, bm, BM_4K);
+	}
+
+	while (curr_p != pi->log_tail) {
+		if (goto_next_page(sb, curr_p)) {
+			curr_p = next_log_page(sb, curr_p);
+			if (base == 0) {
+				BUG_ON(curr_p & (PAGE_SIZE - 1));
+				set_bm(curr_p >> PAGE_SHIFT, bm, BM_4K);
+			}
+		}
+
+		if (curr_p == 0) {
+			nova_err(sb, "File inode %llu log is NULL!\n", ino);
+			BUG();
+		}
+
+		entry = (void *)nova_get_block(sb, curr_p);
+
+		if (metadata_csum == 0)
+			entryc = entry;
+		else if (!nova_verify_entry_csum(sb, entry, entryc))
+			return 0;
+
+		type = nova_get_entry_type(entryc);
+		switch (type) {
+		case SET_ATTR:
+			nova_ring_setattr_entry(sb, sih, SENTRY(entryc),
+						ring, base, data_bits,
+						bm);
+			curr_p += sizeof(struct nova_setattr_logentry);
+			break;
+		case LINK_CHANGE:
+			curr_p += sizeof(struct nova_link_change_entry);
+			break;
+		case FILE_WRITE:
+			nova_traverse_file_write_entry(sb, sih, WENTRY(entry),
+						WENTRY(entryc), ring, base, bm);
+			curr_p += sizeof(struct nova_file_write_entry);
+			break;
+		case MMAP_WRITE:
+			curr_p += sizeof(struct nova_mmap_entry);
+			break;
+		default:
+			nova_dbg("%s: unknown type %d, 0x%llx\n",
+						__func__, type, curr_p);
+			NOVA_ASSERT(0);
+			BUG();
+		}
+
+	}
+
+	if (base == 0) {
+		/* Keep traversing until log ends */
+		curr_p &= PAGE_MASK;
+		next = next_log_page(sb, curr_p);
+		while (next > 0) {
+			curr_p = next;
+			BUG_ON(curr_p & (PAGE_SIZE - 1));
+			set_bm(curr_p >> PAGE_SHIFT, bm, BM_4K);
+			next = next_log_page(sb, curr_p);
+		}
+	}
+
+	if (sih->i_size == 0)
+		return 0;
+
+	last_blocknr = (sih->i_size - 1) >> data_bits;
+	nova_set_file_bm(sb, sih, ring, bm, base, last_blocknr);
+	if (last_blocknr >= base + MAX_PGOFF) {
+		base += MAX_PGOFF;
+		goto again;
+	}
+
+	return 0;
+}
+
+/* pi is a DRAM copy of the inode, not the NVMM original */
+static int nova_recover_inode_pages(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct task_ring *ring,
+	struct nova_inode *pi, struct scan_bitmap *bm)
+{
+	unsigned long nova_ino;
+
+	if (pi->deleted == 1)
+		return 0;
+
+	nova_ino = pi->nova_ino;
+	ring->inodes_used_count++;
+
+	sih->i_mode = __le16_to_cpu(pi->i_mode);
+	sih->ino = nova_ino;
+
+	nova_dbgv("%s: inode %lu, head 0x%llx, tail 0x%llx\n",
+			__func__, nova_ino, pi->log_head, pi->log_tail);
+
+	switch (__le16_to_cpu(pi->i_mode) & S_IFMT) {
+	case S_IFDIR:
+		nova_traverse_dir_inode_log(sb, pi, bm);
+		break;
+	case S_IFLNK:
+		/* Treat symlink files as normal files */
+		/* Fall through */
+	case S_IFREG:
+		/* Fall through */
+	default:
+		/* In case of special inode, walk the log */
+		nova_traverse_file_inode_log(sb, pi, sih, ring, bm);
+		break;
+	}
+
+	return 0;
+}
+
+static void free_resources(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct task_ring *ring;
+	int i;
+
+	if (task_rings) {
+		for (i = 0; i < sbi->cpus; i++) {
+			ring = &task_rings[i];
+			vfree(ring->entry_array);
+			vfree(ring->nvmm_array);
+			ring->entry_array = NULL;
+			ring->nvmm_array = NULL;
+		}
+	}
+
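+	kfree(task_rings);
+	kfree(threads);
+	kfree(finished);
+	/* These are file-scope pointers; clear them to avoid a double free */
+	task_rings = NULL;
+	threads = NULL;
+	finished = NULL;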
+}
+
+static int failure_thread_func(void *data);
+
+static int allocate_resources(struct super_block *sb, int cpus)
+{
+	struct task_ring *ring;
+	int i;
+
+	task_rings = kcalloc(cpus, sizeof(struct task_ring), GFP_KERNEL);
+	if (!task_rings)
+		goto fail;
+
+	for (i = 0; i < cpus; i++) {
+		ring = &task_rings[i];
+
+		ring->nvmm_array = vzalloc(sizeof(u64) * MAX_PGOFF);
+		if (!ring->nvmm_array)
+			goto fail;
+
+		ring->entry_array = vmalloc(sizeof(u64) * MAX_PGOFF);
+		if (!ring->entry_array)
+			goto fail;
+	}
+
+	threads = kcalloc(cpus, sizeof(struct task_struct *), GFP_KERNEL);
+	if (!threads)
+		goto fail;
+
+	finished = kcalloc(cpus, sizeof(int), GFP_KERNEL);
+	if (!finished)
+		goto fail;
+
+	init_waitqueue_head(&finish_wq);
+
+	for (i = 0; i < cpus; i++) {
+		threads[i] = kthread_create(failure_thread_func,
+						sb, "recovery thread");
+		kthread_bind(threads[i], i);
+	}
+
+	return 0;
+
+fail:
+	free_resources(sb);
+	return -ENOMEM;
+}
+
+static void wait_to_finish(int cpus)
+{
+	int i;
+
+	for (i = 0; i < cpus; i++) {
+		while (finished[i] == 0) {
+			wait_event_interruptible_timeout(finish_wq,
+					finished[i], msecs_to_jiffies(1));
+		}
+	}
+}
+
+/*********************** Failure recovery *************************/
+
+static inline int nova_failure_update_inodetree(struct super_block *sb,
+	struct nova_inode *pi, unsigned long *ino_low, unsigned long *ino_high)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (*ino_low == 0) {
+		*ino_low = *ino_high = pi->nova_ino;
+	} else {
+		if (pi->nova_ino == *ino_high + sbi->cpus) {
+			*ino_high = pi->nova_ino;
+		} else {
+			/* A new start */
+			nova_failure_insert_inodetree(sb, *ino_low, *ino_high);
+			*ino_low = *ino_high = pi->nova_ino;
+		}
+	}
+
+	return 0;
+}
+
+static int failure_thread_func(void *data)
+{
+	struct super_block *sb = data;
+	struct nova_inode_info_header sih;
+	struct task_ring *ring;
+	struct nova_inode *pi, fake_pi;
+	unsigned long num_inodes_per_page;
+	unsigned long ino_low, ino_high;
+	unsigned long last_blocknr;
+	unsigned int data_bits;
+	u64 curr, curr1;
+	int cpuid = smp_processor_id();
+	unsigned long i;
+	unsigned long max_size = 0;
+	u64 pi_addr = 0;
+	int ret = 0;
+	int count;
+
+	pi = nova_get_inode_by_ino(sb, NOVA_INODETABLE_INO);
+	data_bits = blk_type_to_shift[pi->i_blk_type];
+	num_inodes_per_page = 1 << (data_bits - NOVA_INODE_BITS);
+
+	ring = &task_rings[cpuid];
+	nova_init_header(sb, &sih, 0);
+
+	for (count = 0; count < ring->num; count++) {
+		curr = ring->addr0[count];
+		curr1 = ring->addr1[count];
+		ino_low = ino_high = 0;
+
+		/*
+		 * Note: The inode log page is allocated at 2MB
+		 * granularity, but is not aligned on a 2MB boundary.
+		 */
+		for (i = 0; i < 512; i++)
+			set_bm((curr >> PAGE_SHIFT) + i,
+					global_bm[cpuid], BM_4K);
+
+		if (metadata_csum) {
+			for (i = 0; i < 512; i++)
+				set_bm((curr1 >> PAGE_SHIFT) + i,
+					global_bm[cpuid], BM_4K);
+		}
+
+		for (i = 0; i < num_inodes_per_page; i++) {
+			pi_addr = curr + i * NOVA_INODE_SIZE;
+			ret = nova_get_reference(sb, pi_addr, &fake_pi,
+				(void **)&pi, sizeof(struct nova_inode));
+			if (ret) {
+				nova_dbg("Recover pi @ 0x%llx failed\n",
+						pi_addr);
+				continue;
+			}
+			/* FIXME: Check inode checksum */
+			if (fake_pi.i_mode && fake_pi.deleted == 0) {
+				if (fake_pi.valid == 0) {
+					ret = nova_append_inode_to_snapshot(sb,
+									pi);
+					if (ret != 0) {
+						/* Deletable */
+						pi->deleted = 1;
+						fake_pi.deleted = 1;
+						continue;
+					}
+				}
+
+				nova_recover_inode_pages(sb, &sih, ring,
+						&fake_pi, global_bm[cpuid]);
+				nova_failure_update_inodetree(sb, pi,
+						&ino_low, &ino_high);
+				if (sih.i_size > max_size)
+					max_size = sih.i_size;
+			}
+		}
+
+		if (ino_low && ino_high)
+			nova_failure_insert_inodetree(sb, ino_low, ino_high);
+	}
+
+	/* Free radix tree */
+	if (max_size) {
+		last_blocknr = (max_size - 1) >> PAGE_SHIFT;
+		nova_delete_file_tree(sb, &sih, 0, last_blocknr,
+						false, false, 0);
+	}
+
+	finished[cpuid] = 1;
+	wake_up_interruptible(&finish_wq);
+	do_exit(ret);
+	return ret;
+}
+
+static int nova_failure_recovery_crawl(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info_header sih;
+	struct inode_table *inode_table;
+	struct task_ring *ring;
+	struct nova_inode *pi, fake_pi;
+	unsigned long curr_addr;
+	u64 root_addr;
+	u64 curr;
+	int num_tables;
+	int version;
+	int ret = 0;
+	int count;
+	int cpuid;
+
+	root_addr = nova_get_reserved_inode_addr(sb, NOVA_ROOT_INO);
+
+	num_tables = 1;
+	if (metadata_csum)
+		num_tables = 2;
+
+	for (cpuid = 0; cpuid < sbi->cpus; cpuid++) {
+		ring = &task_rings[cpuid];
+		for (version = 0; version < num_tables; version++) {
+			inode_table = nova_get_inode_table(sb, version,
+								cpuid);
+			if (!inode_table)
+				return -EINVAL;
+
+			count = 0;
+			curr = inode_table->log_head;
+			while (curr) {
+				if (count >= 512) {
+					nova_err(sb, "%s: ring size too small\n",
+						 __func__);
+					return -EINVAL;
+				}
+
+				if (version == 0)
+					ring->addr0[count] = curr;
+				else
+					ring->addr1[count] = curr;
+
+				count++;
+
+				curr_addr = (unsigned long)nova_get_block(sb,
+								curr);
+				/* The next-page pointer is the last 8 bytes */
+				curr_addr += 2097152 - 8;
+				curr = *(u64 *)(curr_addr);
+			}
+
+			if (count > ring->num)
+				ring->num = count;
+		}
+	}
+
+	for (cpuid = 0; cpuid < sbi->cpus; cpuid++)
+		wake_up_process(threads[cpuid]);
+
+	nova_init_header(sb, &sih, 0);
+	/* Recover the root inode */
+	ret = nova_get_reference(sb, root_addr, &fake_pi,
+			(void **)&pi, sizeof(struct nova_inode));
+	if (ret) {
+		nova_dbg("Recover root pi failed\n");
+		return ret;
+	}
+
+	nova_recover_inode_pages(sb, &sih, &task_rings[0],
+					&fake_pi, global_bm[1]);
+
+	return ret;
+}
+
+int nova_failure_recovery(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct task_ring *ring;
+	struct nova_inode *pi;
+	struct journal_ptr_pair *pair;
+	int ret;
+	int i;
+
+	sbi->s_inodes_used_count = 0;
+
+	/* Initialize inuse inode list */
+	if (nova_init_inode_inuse_list(sb) < 0)
+		return -EINVAL;
+
+	/* Handle special inodes */
+	pi = nova_get_inode_by_ino(sb, NOVA_BLOCKNODE_INO);
+	pi->log_head = pi->log_tail = 0;
+	nova_flush_buffer(&pi->log_head, CACHELINE_SIZE, 0);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		pair = nova_get_journal_pointers(sb, i);
+
+		set_bm(pair->journal_head >> PAGE_SHIFT, global_bm[i], BM_4K);
+	}
+
+	i = NOVA_SNAPSHOT_INO % sbi->cpus;
+	pi = nova_get_inode_by_ino(sb, NOVA_SNAPSHOT_INO);
+	/* Set snapshot info log pages */
+	nova_traverse_dir_inode_log(sb, pi, global_bm[i]);
+
+	PERSISTENT_BARRIER();
+
+	ret = allocate_resources(sb, sbi->cpus);
+	if (ret)
+		return ret;
+
+	ret = nova_failure_recovery_crawl(sb);
+
+	wait_to_finish(sbi->cpus);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		ring = &task_rings[i];
+		sbi->s_inodes_used_count += ring->inodes_used_count;
+	}
+
+	free_resources(sb);
+
+	nova_dbg("Failure recovery total recovered %lu\n",
+				sbi->s_inodes_used_count);
+	return ret;
+}
+
+/*********************** Recovery entrance *************************/
+
+/* Return TRUE if we can do a normal unmount recovery */
+static bool nova_try_normal_recovery(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode *pi =  nova_get_inode_by_ino(sb, NOVA_BLOCKNODE_INO);
+	int ret;
+
+	if (pi->log_head == 0 || pi->log_tail == 0)
+		return false;
+
+	ret = nova_init_blockmap_from_inode(sb);
+	if (ret) {
+		nova_err(sb, "init blockmap failed, fall back to failure recovery\n");
+		return false;
+	}
+
+	ret = nova_init_inode_list_from_inode(sb);
+	if (ret) {
+		nova_err(sb, "init inode list failed, fall back to failure recovery\n");
+		nova_destroy_blocknode_trees(sb);
+		return false;
+	}
+
+	if (sbi->mount_snapshot == 0) {
+		ret = nova_restore_snapshot_table(sb, 0);
+		if (ret) {
+			nova_err(sb, "Restore snapshot table failed, fall back to failure recovery\n");
+			nova_destroy_snapshot_infos(sb);
+			return false;
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Recovery routine has three tasks:
+ * 1. Restore snapshot table;
+ * 2. Restore inuse inode list;
+ * 3. Restore the NVMM allocator.
+ */
+int nova_recovery(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_super_block *super = sbi->nova_sb;
+	unsigned long initsize = le64_to_cpu(super->s_size);
+	bool value = false;
+	int ret = 0;
+	timing_t start, end;
+
+	nova_dbgv("%s\n", __func__);
+
+	/* Always check recovery time */
+	if (measure_timing == 0)
+		getrawmonotonic(&start);
+
+	NOVA_START_TIMING(recovery_t, start);
+	sbi->num_blocks = ((unsigned long)(initsize) >> PAGE_SHIFT);
+
+	/* initialize free list info */
+	nova_init_blockmap(sb, 1);
+
+	value = nova_try_normal_recovery(sb);
+	if (value) {
+		nova_dbg("NOVA: Normal shutdown\n");
+	} else {
+		nova_dbg("NOVA: Failure recovery\n");
+		ret = alloc_bm(sb, initsize);
+		if (ret)
+			goto out;
+
+		if (sbi->mount_snapshot == 0) {
+			/* Initialize the snapshot infos */
+			ret = nova_restore_snapshot_table(sb, 1);
+			if (ret) {
+				nova_dbg("Initialize snapshot infos failed\n");
+				nova_destroy_snapshot_infos(sb);
+				goto out;
+			}
+		}
+
+		sbi->s_inodes_used_count = 0;
+		ret = nova_failure_recovery(sb);
+		if (ret)
+			goto out;
+
+		ret = nova_build_blocknode_map(sb, initsize);
+	}
+
+out:
+	NOVA_END_TIMING(recovery_t, start);
+	if (measure_timing == 0) {
+		getrawmonotonic(&end);
+		Timingstats[recovery_t] +=
+			(end.tv_sec - start.tv_sec) * 1000000000 +
+			(end.tv_nsec - start.tv_nsec);
+	}
+
+	if (!value)
+		free_bm(sb);
+
+	sbi->s_epoch_id = le64_to_cpu(super->s_epoch_id);
+	return ret;
+}
diff --git a/fs/nova/rebuild.c b/fs/nova/rebuild.c
new file mode 100644
index 000000000000..893f180d507e
--- /dev/null
+++ b/fs/nova/rebuild.c
@@ -0,0 +1,847 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Inode rebuild methods.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include "nova.h"
+#include "inode.h"
+
+/* entry given to this function is a copy in dram */
+static void nova_apply_setattr_entry(struct super_block *sb,
+	struct nova_inode_rebuild *reb,	struct nova_inode_info_header *sih,
+	struct nova_setattr_logentry *entry)
+{
+	unsigned int data_bits = blk_type_to_shift[sih->i_blk_type];
+	unsigned long first_blocknr, last_blocknr;
+	loff_t start, end;
+	int freed = 0;
+
+	if (entry->entry_type != SET_ATTR)
+		BUG();
+
+	reb->i_mode	= entry->mode;
+	reb->i_uid	= entry->uid;
+	reb->i_gid	= entry->gid;
+	reb->i_atime	= entry->atime;
+
+	if (S_ISREG(reb->i_mode)) {
+		start = entry->size;
+		end = reb->i_size;
+
+		first_blocknr = (start + (1UL << data_bits) - 1) >> data_bits;
+
+		if (end > 0)
+			last_blocknr = (end - 1) >> data_bits;
+		else
+			last_blocknr = 0;
+
+		freed = nova_delete_file_tree(sb, sih, first_blocknr,
+					last_blocknr, false, false, 0);
+	}
+}
+
+/* entry given to this function is a copy in dram */
+static void nova_apply_link_change_entry(struct super_block *sb,
+	struct nova_inode_rebuild *reb,	struct nova_link_change_entry *entry)
+{
+	if (entry->entry_type != LINK_CHANGE)
+		BUG();
+
+	reb->i_links_count	= entry->links;
+	reb->i_ctime		= entry->ctime;
+	reb->i_flags		= entry->flags;
+	reb->i_generation	= entry->generation;
+
+	/* Do not flush now */
+}
+
+static void nova_update_inode_with_rebuild(struct super_block *sb,
+	struct nova_inode_rebuild *reb, struct nova_inode *pi)
+{
+	pi->i_size = cpu_to_le64(reb->i_size);
+	pi->i_flags = cpu_to_le32(reb->i_flags);
+	pi->i_uid = cpu_to_le32(reb->i_uid);
+	pi->i_gid = cpu_to_le32(reb->i_gid);
+	pi->i_atime = cpu_to_le32(reb->i_atime);
+	pi->i_ctime = cpu_to_le32(reb->i_ctime);
+	pi->i_mtime = cpu_to_le32(reb->i_mtime);
+	pi->i_generation = cpu_to_le32(reb->i_generation);
+	pi->i_links_count = cpu_to_le16(reb->i_links_count);
+	pi->i_mode = cpu_to_le16(reb->i_mode);
+}
+
+static int nova_init_inode_rebuild(struct super_block *sb,
+	struct nova_inode_rebuild *reb, struct nova_inode *pi)
+{
+	struct nova_inode fake_pi;
+	int rc;
+
+	rc = memcpy_mcsafe(&fake_pi, pi, sizeof(struct nova_inode));
+	if (rc)
+		return rc;
+
+	reb->i_size = le64_to_cpu(fake_pi.i_size);
+	reb->i_flags = le32_to_cpu(fake_pi.i_flags);
+	reb->i_uid = le32_to_cpu(fake_pi.i_uid);
+	reb->i_gid = le32_to_cpu(fake_pi.i_gid);
+	reb->i_atime = le32_to_cpu(fake_pi.i_atime);
+	reb->i_ctime = le32_to_cpu(fake_pi.i_ctime);
+	reb->i_mtime = le32_to_cpu(fake_pi.i_mtime);
+	reb->i_generation = le32_to_cpu(fake_pi.i_generation);
+	reb->i_links_count = le16_to_cpu(fake_pi.i_links_count);
+	reb->i_mode = le16_to_cpu(fake_pi.i_mode);
+	reb->trans_id = 0;
+
+	return rc;
+}
+
+static inline void nova_rebuild_file_time_and_size(struct super_block *sb,
+	struct nova_inode_rebuild *reb, u32 mtime, u32 ctime, u64 size)
+{
+	reb->i_mtime = cpu_to_le32(mtime);
+	reb->i_ctime = cpu_to_le32(ctime);
+	reb->i_size = cpu_to_le64(size);
+}
+
+static int nova_rebuild_inode_start(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	struct nova_inode_rebuild *reb, u64 pi_addr)
+{
+	int ret;
+
+	ret = nova_get_head_tail(sb, pi, sih);
+	if (ret)
+		return ret;
+
+	ret = nova_init_inode_rebuild(sb, reb, pi);
+	if (ret)
+		return ret;
+
+	sih->pi_addr = pi_addr;
+
+	nova_dbg_verbose("Log head 0x%llx, tail 0x%llx\n",
+				sih->log_head, sih->log_tail);
+	sih->log_pages = 1;
+
+	return ret;
+}
+
+static int nova_rebuild_inode_finish(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	struct nova_inode_rebuild *reb, u64 curr_p)
+{
+	struct nova_inode *alter_pi;
+	u64 next;
+
+	sih->i_size = le64_to_cpu(reb->i_size);
+	sih->i_mode = le64_to_cpu(reb->i_mode);
+	sih->trans_id = reb->trans_id + 1;
+
+	nova_memunlock_inode(sb, pi);
+	nova_update_inode_with_rebuild(sb, reb, pi);
+	nova_update_inode_checksum(pi);
+	if (metadata_csum) {
+		alter_pi = (struct nova_inode *)nova_get_block(sb,
+							sih->alter_pi_addr);
+		memcpy_to_pmem_nocache(alter_pi, pi, sizeof(struct nova_inode));
+	}
+	nova_memlock_inode(sb, pi);
+
+	/* Keep traversing until log ends */
+	curr_p &= PAGE_MASK;
+	while ((next = next_log_page(sb, curr_p)) > 0) {
+		sih->log_pages++;
+		curr_p = next;
+	}
+
+	if (metadata_csum)
+		sih->log_pages *= 2;
+
+	return 0;
+}
+
+static int nova_reset_csum_parity_page(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	unsigned long pgoff, int zero)
+{
+	nova_dbgv("%s: update page off %lu\n", __func__, pgoff);
+
+	if (data_csum)
+		nova_update_pgoff_csum(sb, sih, entry, pgoff, zero);
+
+	if (data_parity)
+		nova_update_pgoff_parity(sb, sih, entry, pgoff, zero);
+
+	return 0;
+}
+
+int nova_reset_csum_parity_range(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	unsigned long start_pgoff, unsigned long end_pgoff, int zero,
+	int check_entry)
+{
+	struct nova_file_write_entry *curr;
+	unsigned long pgoff;
+
+	if (data_csum == 0 && data_parity == 0)
+		return 0;
+
+	for (pgoff = start_pgoff; pgoff < end_pgoff; pgoff++) {
+		if (entry && check_entry && zero == 0) {
+			curr = nova_get_write_entry(sb, sih, pgoff);
+			if (curr != entry)
+				continue;
+		}
+
+		/* FIXME: For mmap, check dirty? */
+		nova_reset_csum_parity_page(sb, sih, entry, pgoff, zero);
+	}
+
+	return 0;
+}
+
+/* Reset data csum for updating entries */
+static int nova_reset_data_csum_parity(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	struct nova_file_write_entry *entryc)
+{
+	unsigned long end_pgoff;
+
+	if (data_csum == 0 && data_parity == 0)
+		goto out;
+
+	if (entryc->invalid_pages == entryc->num_pages)
+		/* Dead entry */
+		goto out;
+
+	end_pgoff = entryc->pgoff + entryc->num_pages;
+	nova_reset_csum_parity_range(sb, sih, entry, entryc->pgoff,
+			end_pgoff, 0, 1);
+
+out:
+	nova_set_write_entry_updating(sb, entry, 0);
+
+	return 0;
+}
+
+/* Reset data csum for mmap entries */
+static int nova_reset_mmap_csum_parity(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_mmap_entry *entry,
+	struct nova_mmap_entry *entryc)
+{
+	unsigned long end_pgoff;
+	int ret = 0;
+
+	if (data_csum == 0 && data_parity == 0)
+		return 0;
+
+	if (entryc->invalid == 1)
+		/* Dead entry */
+		return 0;
+
+	end_pgoff = entryc->pgoff + entryc->num_pages;
+	nova_reset_csum_parity_range(sb, sih, NULL, entryc->pgoff,
+			end_pgoff, 0, 0);
+
+	ret = nova_invalidate_logentry(sb, entry, MMAP_WRITE, 0);
+
+	return ret;
+}
+
+int nova_reset_mapping_csum_parity(struct super_block *sb,
+	struct inode *inode, struct address_space *mapping,
+	unsigned long start_pgoff, unsigned long end_pgoff)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	pgoff_t indices[PAGEVEC_SIZE];
+	struct pagevec pvec;
+	bool done = false;
+	int count = 0;
+	unsigned long start = 0;
+	timing_t reset_time;
+	int i;
+
+	if (data_csum == 0 && data_parity == 0)
+		return 0;
+
+	NOVA_START_TIMING(reset_mapping_t, reset_time);
+	nova_dbgv("%s: pgoff %lu to %lu\n",
+			__func__, start_pgoff, end_pgoff);
+
+	while (!done) {
+		pvec.nr = find_get_entries_tag(mapping, start_pgoff,
+				PAGECACHE_TAG_DIRTY, PAGEVEC_SIZE,
+				pvec.pages, indices);
+
+		if (pvec.nr == 0)
+			break;
+
+		if (count == 0)
+			start = indices[0];
+
+		for (i = 0; i < pvec.nr; i++) {
+			if (indices[i] >= end_pgoff) {
+				done = true;
+				break;
+			}
+
+			NOVA_STATS_ADD(dirty_pages, 1);
+			nova_reset_csum_parity_page(sb, sih, NULL,
+						indices[i], 0);
+		}
+
+		count += pvec.nr;
+		if (pvec.nr < PAGEVEC_SIZE)
+			break;
+
+		start_pgoff = indices[pvec.nr - 1] + 1;
+	}
+
+	if (count)
+		nova_dbgv("%s: inode %lu, reset %d pages, start pgoff %lu\n",
+				__func__, sih->ino, count, start);
+
+	NOVA_END_TIMING(reset_mapping_t, reset_time);
+	return 0;
+}
+
+int nova_reset_vma_csum_parity(struct super_block *sb,
+	struct vma_item *item)
+{
+	struct vm_area_struct *vma = item->vma;
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct inode *inode = mapping->host;
+	struct nova_mmap_entry *entry;
+	unsigned long num_pages;
+	unsigned long start_index, end_index;
+	timing_t reset_time;
+	int ret = 0;
+
+	if (data_csum == 0 && data_parity == 0)
+		return 0;
+
+	NOVA_START_TIMING(reset_vma_t, reset_time);
+	num_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+	start_index = vma->vm_pgoff;
+	end_index = vma->vm_pgoff + num_pages;
+
+	nova_dbgv("%s: inode %lu, pgoff %lu - %lu\n",
+			__func__, inode->i_ino, start_index, end_index);
+
+	ret = nova_reset_mapping_csum_parity(sb, inode, mapping,
+					start_index, end_index);
+
+	if (item->mmap_entry) {
+		entry = nova_get_block(sb, item->mmap_entry);
+		ret = nova_invalidate_logentry(sb, entry, MMAP_WRITE, 0);
+	}
+
+	NOVA_END_TIMING(reset_vma_t, reset_time);
+	return ret;
+}
+
+static void nova_rebuild_handle_write_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_inode_rebuild *reb,
+	struct nova_file_write_entry *entry,
+	struct nova_file_write_entry *entryc)
+{
+	if (entryc->num_pages != entryc->invalid_pages) {
+		/*
+		 * The overlapped blocks are already freed.
+		 * Don't double free them, just re-assign the pointers.
+		 */
+		nova_assign_write_entry(sb, sih, entry, entryc, false);
+	}
+
+	if (entryc->trans_id >= sih->trans_id) {
+		nova_rebuild_file_time_and_size(sb, reb,
+					entryc->mtime, entryc->mtime,
+					entryc->size);
+		reb->trans_id = entryc->trans_id;
+	}
+
+	if (entryc->updating)
+		nova_reset_data_csum_parity(sb, sih, entry, entryc);
+
+	/* Update sih->i_size for setattr apply operations */
+	sih->i_size = le64_to_cpu(reb->i_size);
+}
+
+static int nova_rebuild_file_inode_tree(struct super_block *sb,
+	struct nova_inode *pi, u64 pi_addr,
+	struct nova_inode_info_header *sih)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_file_write_entry *entry = NULL;
+	struct nova_setattr_logentry *attr_entry = NULL;
+	struct nova_link_change_entry *link_change_entry = NULL;
+	struct nova_mmap_entry *mmap_entry = NULL;
+	char entry_copy[NOVA_MAX_ENTRY_LEN];
+	struct nova_inode_rebuild rebuild, *reb;
+	unsigned int data_bits = blk_type_to_shift[sih->i_blk_type];
+	u64 ino = pi->nova_ino;
+	timing_t rebuild_time;
+	void *addr, *entryc;
+	u64 curr_p;
+	u8 type;
+	int ret;
+
+	NOVA_START_TIMING(rebuild_file_t, rebuild_time);
+	nova_dbg_verbose("Rebuild file inode %llu tree\n", ino);
+
+	reb = &rebuild;
+	ret = nova_rebuild_inode_start(sb, pi, sih, reb, pi_addr);
+	if (ret)
+		goto out;
+
+	curr_p = sih->log_head;
+	if (curr_p == 0 && sih->log_tail == 0)
+		goto out;
+
+	entryc = (metadata_csum == 0) ? NULL : entry_copy;
+
+//	nova_print_nova_log(sb, sih);
+
+	while (curr_p != sih->log_tail) {
+		if (goto_next_page(sb, curr_p)) {
+			sih->log_pages++;
+			curr_p = next_log_page(sb, curr_p);
+		}
+
+		if (curr_p == 0) {
+			nova_err(sb, "File inode %llu log is NULL!\n", ino);
+			BUG();
+		}
+
+		addr = (void *)nova_get_block(sb, curr_p);
+
+		if (metadata_csum == 0)
+			entryc = addr;
+		else if (!nova_verify_entry_csum(sb, addr, entryc))
+			return 0;
+
+		type = nova_get_entry_type(entryc);
+
+		if (sbi->mount_snapshot) {
+			if (nova_encounter_mount_snapshot(sb, addr, type))
+				break;
+		}
+
+		switch (type) {
+		case SET_ATTR:
+			attr_entry = (struct nova_setattr_logentry *)entryc;
+			nova_apply_setattr_entry(sb, reb, sih, attr_entry);
+			sih->last_setattr = curr_p;
+			if (attr_entry->trans_id >= reb->trans_id) {
+				nova_rebuild_file_time_and_size(sb, reb,
+							attr_entry->mtime,
+							attr_entry->ctime,
+							attr_entry->size);
+				reb->trans_id = attr_entry->trans_id;
+			}
+
+			/* Update sih->i_size for setattr operation */
+			sih->i_size = le64_to_cpu(reb->i_size);
+			curr_p += sizeof(struct nova_setattr_logentry);
+			break;
+		case LINK_CHANGE:
+			link_change_entry =
+				(struct nova_link_change_entry *)entryc;
+			nova_apply_link_change_entry(sb, reb,
+						link_change_entry);
+			sih->last_link_change = curr_p;
+			curr_p += sizeof(struct nova_link_change_entry);
+			break;
+		case FILE_WRITE:
+			entry = (struct nova_file_write_entry *)addr;
+			nova_rebuild_handle_write_entry(sb, sih, reb,
+					entry, WENTRY(entryc));
+			curr_p += sizeof(struct nova_file_write_entry);
+			break;
+		case MMAP_WRITE:
+			mmap_entry = (struct nova_mmap_entry *)addr;
+			nova_reset_mmap_csum_parity(sb, sih,
+					mmap_entry, MMENTRY(entryc));
+			curr_p += sizeof(struct nova_mmap_entry);
+			break;
+		default:
+			nova_err(sb, "unknown type %d, 0x%llx\n", type, curr_p);
+			NOVA_ASSERT(0);
+			curr_p += sizeof(struct nova_file_write_entry);
+			break;
+		}
+
+	}
+
+	ret = nova_rebuild_inode_finish(sb, pi, sih, reb, curr_p);
+	sih->i_blocks = sih->log_pages + (sih->i_size >> data_bits);
+
+out:
+//	nova_print_inode_log_page(sb, inode);
+	NOVA_END_TIMING(rebuild_file_t, rebuild_time);
+	return ret;
+}
+
+/******************* Directory rebuild *********************/
+
+static inline void nova_rebuild_dir_time_and_size(struct super_block *sb,
+	struct nova_inode_rebuild *reb, struct nova_dentry *entry,
+	struct nova_dentry *entryc)
+{
+	if (!entry || !reb)
+		return;
+
+	reb->i_ctime = entryc->mtime;
+	reb->i_mtime = entryc->mtime;
+	reb->i_links_count = entryc->links_count;
+	//reb->i_size = entryc->size;
+}
+
+static void nova_reassign_last_dentry(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 curr_p)
+{
+	struct nova_dentry *dentry, *old_dentry;
+
+	if (sih->last_dentry == 0) {
+		sih->last_dentry = curr_p;
+	} else {
+		old_dentry = (struct nova_dentry *)nova_get_block(sb,
+							sih->last_dentry);
+		dentry = (struct nova_dentry *)nova_get_block(sb, curr_p);
+		if (dentry->trans_id >= old_dentry->trans_id)
+			sih->last_dentry = curr_p;
+	}
+}
+
+static inline int nova_replay_add_dentry(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_dentry *entry,
+	struct nova_dentry *entryc)
+{
+	if (!entryc->name_len)
+		return -EINVAL;
+
+	nova_dbg_verbose("%s: add %s\n", __func__, entry->name);
+	return nova_insert_dir_radix_tree(sb, sih,
+			entryc->name, entryc->name_len, entry);
+}
+
+/* entry given to this function is a copy in dram */
+static inline int nova_replay_remove_dentry(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_dentry *entry)
+{
+	nova_dbg_verbose("%s: remove %s\n", __func__, entry->name);
+	nova_remove_dir_radix_tree(sb, sih, entry->name,
+					entry->name_len, 1, NULL);
+	return 0;
+}
+
+static int nova_rebuild_handle_dentry(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_inode_rebuild *reb,
+	struct nova_dentry *entry, struct nova_dentry *entryc, u64 curr_p)
+{
+	int ret = 0;
+
+	nova_dbgv("curr_p: 0x%llx, type %d, ino %llu, name %s, namelen %u, csum 0x%x, rec len %u\n",
+			curr_p,
+			entry->entry_type, le64_to_cpu(entry->ino),
+			entry->name, entry->name_len, entry->csum,
+			le16_to_cpu(entry->de_len));
+
+	nova_reassign_last_dentry(sb, sih, curr_p);
+
+	if (entryc->invalid == 0) {
+		if (entryc->ino > 0)
+			ret = nova_replay_add_dentry(sb, sih, entry, entryc);
+		else
+			ret = nova_replay_remove_dentry(sb, sih, entryc);
+	}
+
+	if (ret) {
+		nova_err(sb, "%s ERROR %d\n", __func__, ret);
+		return ret;
+	}
+
+	if (entryc->trans_id >= reb->trans_id) {
+		nova_rebuild_dir_time_and_size(sb, reb, entry, entryc);
+		reb->trans_id = entryc->trans_id;
+	}
+
+	return ret;
+}
+
+int nova_rebuild_dir_inode_tree(struct super_block *sb,
+	struct nova_inode *pi, u64 pi_addr,
+	struct nova_inode_info_header *sih)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_dentry *entry = NULL;
+	struct nova_setattr_logentry *attr_entry = NULL;
+	struct nova_link_change_entry *lc_entry = NULL;
+	char entry_copy[NOVA_MAX_ENTRY_LEN];
+	struct nova_inode_rebuild rebuild, *reb;
+	u64 ino = pi->nova_ino;
+	unsigned short de_len;
+	timing_t rebuild_time;
+	void *addr, *entryc;
+	u64 curr_p;
+	u8 type;
+	int ret;
+
+	NOVA_START_TIMING(rebuild_dir_t, rebuild_time);
+	nova_dbgv("Rebuild dir %llu tree\n", ino);
+
+	reb = &rebuild;
+	ret = nova_rebuild_inode_start(sb, pi, sih, reb, pi_addr);
+	if (ret)
+		goto out;
+
+	curr_p = sih->log_head;
+	if (curr_p == 0) {
+		nova_err(sb, "Dir %llu log is NULL!\n", ino);
+		BUG();
+		goto out;
+	}
+
+	entryc = (metadata_csum == 0) ? NULL : entry_copy;
+
+	while (curr_p != sih->log_tail) {
+		if (goto_next_page(sb, curr_p)) {
+			sih->log_pages++;
+			curr_p = next_log_page(sb, curr_p);
+		}
+
+		if (curr_p == 0) {
+			nova_err(sb, "Dir %llu log is NULL!\n", ino);
+			BUG();
+		}
+
+		addr = (void *)nova_get_block(sb, curr_p);
+
+		if (metadata_csum == 0)
+			entryc = addr;
+		else if (!nova_verify_entry_csum(sb, addr, entryc))
+			return 0;
+
+		type = nova_get_entry_type(entryc);
+
+		if (sbi->mount_snapshot) {
+			if (nova_encounter_mount_snapshot(sb, addr, type))
+				break;
+		}
+
+		switch (type) {
+		case SET_ATTR:
+			attr_entry = (struct nova_setattr_logentry *)entryc;
+			nova_apply_setattr_entry(sb, reb, sih, attr_entry);
+			sih->last_setattr = curr_p;
+			curr_p += sizeof(struct nova_setattr_logentry);
+			break;
+		case LINK_CHANGE:
+			lc_entry = (struct nova_link_change_entry *)entryc;
+			if (lc_entry->trans_id >= reb->trans_id) {
+				nova_apply_link_change_entry(sb, reb, lc_entry);
+				reb->trans_id = lc_entry->trans_id;
+			}
+			sih->last_link_change = curr_p;
+			curr_p += sizeof(struct nova_link_change_entry);
+			break;
+		case DIR_LOG:
+			entry = (struct nova_dentry *)addr;
+			ret = nova_rebuild_handle_dentry(sb, sih, reb,
+					entry, DENTRY(entryc), curr_p);
+			if (ret)
+				goto out;
+			de_len = le16_to_cpu(DENTRY(entryc)->de_len);
+			curr_p += de_len;
+			break;
+		default:
+			nova_dbg("%s: unknown type %d, 0x%llx\n",
+					__func__, type, curr_p);
+			NOVA_ASSERT(0);
+			break;
+		}
+	}
+
+	ret = nova_rebuild_inode_finish(sb, pi, sih, reb, curr_p);
+	sih->i_blocks = sih->log_pages;
+
+out:
+//	nova_print_dir_tree(sb, sih, ino);
+	NOVA_END_TIMING(rebuild_dir_t, rebuild_time);
+	return ret;
+}
+
+/* initialize nova inode header and other DRAM data structures */
+int nova_rebuild_inode(struct super_block *sb, struct nova_inode_info *si,
+	u64 ino, u64 pi_addr, int rebuild_dir)
+{
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *pi;
+	struct nova_inode inode_copy;
+	u64 alter_pi_addr = 0;
+	int ret;
+
+	if (metadata_csum) {
+		/* Get alternate inode address */
+		ret = nova_get_alter_inode_address(sb, ino, &alter_pi_addr);
+		if (ret)  {
+			nova_dbg("%s: failed alt ino addr for inode %llu\n",
+				 __func__, ino);
+			return ret;
+		}
+	}
+
+	ret = nova_check_inode_integrity(sb, ino, pi_addr, alter_pi_addr,
+					 &inode_copy, 1);
+
+	if (ret)
+		return ret;
+
+	pi = (struct nova_inode *)nova_get_block(sb, pi_addr);
+	// We need this to be valid in case we need to evict the inode.
+	sih->pi_addr = pi_addr;
+
+	if (pi->deleted == 1) {
+		nova_dbg("%s: inode %llu has been deleted.\n", __func__, ino);
+		return -EINVAL;
+	}
+
+	nova_dbgv("%s: inode %llu, addr 0x%llx, valid %d, head 0x%llx, tail 0x%llx\n",
+			__func__, ino, pi_addr, pi->valid,
+			pi->log_head, pi->log_tail);
+
+	nova_init_header(sb, sih, __le16_to_cpu(pi->i_mode));
+	sih->ino = ino;
+	sih->alter_pi_addr = alter_pi_addr;
+
+	switch (__le16_to_cpu(pi->i_mode) & S_IFMT) {
+	case S_IFLNK:
+		/* Treat symlink files as normal files */
+		/* Fall through */
+	case S_IFREG:
+		nova_rebuild_file_inode_tree(sb, pi, pi_addr, sih);
+		break;
+	case S_IFDIR:
+		if (rebuild_dir)
+			nova_rebuild_dir_inode_tree(sb, pi, pi_addr, sih);
+		break;
+	default:
+		/* In case of special inode, walk the log */
+		if (pi->log_head)
+			nova_rebuild_file_inode_tree(sb, pi, pi_addr, sih);
+		sih->pi_addr = pi_addr;
+		break;
+	}
+
+	return 0;
+}
+
+
+/******************* Snapshot log rebuild *********************/
+
+/* For power failure recovery, just initialize the infos */
+int nova_restore_snapshot_table(struct super_block *sb, int just_init)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_snapshot_info_entry *entry = NULL;
+	struct nova_inode *pi;
+	struct nova_inode_info_header *sih;
+	struct nova_inode_rebuild rebuild, *reb;
+	unsigned int data_bits;
+	char entry_copy[NOVA_MAX_ENTRY_LEN];
+	size_t size = sizeof(struct nova_snapshot_info_entry);
+	u64 ino = NOVA_SNAPSHOT_INO;
+	timing_t rebuild_time;
+	int count = 0;
+	void *addr, *entryc;
+	u64 curr_p;
+	u8 type;
+	int ret;
+
+	NOVA_START_TIMING(rebuild_snapshot_t, rebuild_time);
+	nova_dbg_verbose("Rebuild snapshot table\n");
+
+	entryc = (metadata_csum == 0) ? NULL : entry_copy;
+
+	pi = nova_get_reserved_inode(sb, ino);
+	sih = &sbi->snapshot_si->header;
+	data_bits = blk_type_to_shift[sih->i_blk_type];
+	reb = &rebuild;
+	ret = nova_rebuild_inode_start(sb, pi, sih, reb, sih->pi_addr);
+	if (ret)
+		goto out;
+
+	curr_p = sih->log_head;
+	if (curr_p == 0 && sih->log_tail == 0)
+		goto out;
+
+//	nova_print_nova_log(sb, sih);
+
+	while (curr_p != sih->log_tail) {
+		if (goto_next_page(sb, curr_p)) {
+			sih->log_pages++;
+			curr_p = next_log_page(sb, curr_p);
+		}
+
+		if (curr_p == 0) {
+			nova_err(sb, "File inode %llu log is NULL!\n", ino);
+			BUG();
+		}
+
+		addr = (void *)nova_get_block(sb, curr_p);
+
+		if (metadata_csum == 0)
+			entryc = addr;
+		else if (!nova_verify_entry_csum(sb, addr, entryc))
+			return 0;
+
+		type = nova_get_entry_type(entryc);
+
+		switch (type) {
+		case SNAPSHOT_INFO:
+			entry = (struct nova_snapshot_info_entry *)addr;
+			ret = nova_restore_snapshot_entry(sb, entry,
+						curr_p, just_init);
+			if (ret) {
+				nova_err(sb, "Restore entry %llu failed\n",
+					entry->epoch_id);
+				goto out;
+			}
+			if (SNENTRY(entryc)->deleted == 0)
+				count++;
+			curr_p += size;
+			break;
+		default:
+			nova_err(sb, "unknown type %d, 0x%llx\n", type, curr_p);
+			NOVA_ASSERT(0);
+			curr_p += size;
+			break;
+		}
+
+	}
+
+	ret = nova_rebuild_inode_finish(sb, pi, sih, reb, curr_p);
+	sih->i_blocks = sih->log_pages + (sih->i_size >> data_bits);
+
+out:
+//	nova_print_inode_log_page(sb, inode);
+	NOVA_END_TIMING(rebuild_snapshot_t, rebuild_time);
+
+	nova_dbg("Recovered %d snapshots, latest epoch ID %llu\n",
+			count, sbi->s_epoch_id);
+
+	return ret;
+}



* [RFC 12/16] NOVA: Recovery code
@ 2017-08-03  7:49   ` Steven Swanson
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:49 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson, dan.j.williams

Clean umount/mount
------------------

On a clean unmount, Nova saves the contents of many of its DRAM data structures
to PMEM to accelerate the next mount:

1. Nova stores the allocator state for each of the per-CPU allocators to the
   log of a reserved inode (NOVA_BLOCKNODE_INO).

2. Nova stores the per-CPU lists of available inodes (the inuse_list) to the
   NOVA_BLOCK_INODELIST1_INO reserved inode.

3. Nova stores the snapshot state to PMEM as described above.

After a clean unmount, the following mount restores these data and then
invalidates them.
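
As a rough illustration only (this is not the code in this patch), the
restore-then-invalidate rule can be sketched in user-space C.  The names
saved_state and choose_mount_path are hypothetical; the sketch assumes that a
zero log head or tail on the reserved inode means "nothing was saved", which
forces the slower scan path:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a reserved inode that holds saved state. */
struct saved_state {
	uint64_t log_head;	/* 0 means "nothing saved" */
	uint64_t log_tail;
};

enum mount_path { MOUNT_FAST, MOUNT_SCAN };

/* Pick the mount path and invalidate the saved state once it is consumed. */
static enum mount_path choose_mount_path(struct saved_state *s)
{
	if (s->log_head == 0 || s->log_tail == 0)
		return MOUNT_SCAN;

	/* ... a real implementation would restore DRAM structures here ... */

	/* Invalidate: a crash before the next clean unmount forces a scan. */
	s->log_head = 0;
	s->log_tail = 0;
	return MOUNT_FAST;
}
```

Invalidating right after the restore means the saved copy can never be
mistaken for current state once the file system starts changing again.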

Recovery after failures
------------------------

In case of an unclean unmount (e.g., system crash), Nova must rebuild these
DRAM structures by scanning the inode logs.  Nova log scanning is fast because
per-CPU inode tables and per-inode logs allow for parallel recovery.

The number of live log entries in an inode log is roughly the number of extents
in the file.  As a result, Nova only needs to scan a small fraction of the NVMM
during recovery.

The Nova failure recovery consists of two steps:

First, Nova checks its lightweight journals and rolls back any uncommitted
transactions to restore the file system to a consistent state.

Second, Nova starts a recovery thread on each CPU and scans the inode tables in
parallel, performing log scanning for every valid inode in the inode table.
Nova uses different recovery mechanisms for directory inodes and file inodes:
For a directory inode, Nova scans the log's linked list to enumerate the pages
it occupies, but it does not inspect the log's contents.  For a file inode,
Nova reads the write entries in the log to enumerate the data pages.
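
The difference between the two scans can be sketched with a toy in-memory log.
This is a simplified model under stated assumptions, not the recovery code
itself: log_page, write_entry, ENTRIES_PER_PAGE and both scan functions are
hypothetical names, and the "used" map holds one byte per block for clarity:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRIES_PER_PAGE 4	/* hypothetical; real log pages are 4KB */

struct write_entry {
	uint64_t first_block;	/* first data block this entry maps */
	uint32_t num_pages;	/* 0 marks an unused slot in this sketch */
};

struct log_page {
	struct write_entry entries[ENTRIES_PER_PAGE];
	uint64_t self_block;	/* block number of this log page */
	struct log_page *next;	/* NULL terminates the log */
};

/* Directory-style scan: mark log pages without reading any entry. */
static size_t scan_dir_log(const struct log_page *head, uint8_t *used)
{
	size_t pages = 0;

	for (; head; head = head->next) {
		used[head->self_block] = 1;
		pages++;
	}
	return pages;
}

/* File-style scan: also mark the data blocks the write entries map. */
static size_t scan_file_log(const struct log_page *head, uint8_t *used)
{
	size_t data_blocks = 0;

	for (; head; head = head->next) {
		used[head->self_block] = 1;
		for (int i = 0; i < ENTRIES_PER_PAGE; i++) {
			const struct write_entry *e = &head->entries[i];

			for (uint32_t j = 0; j < e->num_pages; j++) {
				used[e->first_block + j] = 1;
				data_blocks++;
			}
		}
	}
	return data_blocks;
}
```

The directory scan touches only the page links, which is why it is cheaper
than the file scan.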

During the recovery scan Nova builds a bitmap of occupied pages, and rebuilds
the allocator based on the result. After this process completes, the file
system is ready to accept new requests.
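
One way to picture the last step is a pass over the occupancy bitmap that
emits the gaps as free ranges.  The sketch below is an assumption-laden
illustration rather than the allocator rebuild in this patch: free_range and
build_free_ranges are hypothetical names, and the bitmap uses one byte per
block instead of one bit:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct free_range {
	size_t low;	/* first free block */
	size_t high;	/* last free block, inclusive */
};

/*
 * Convert a "used" map (one byte per block, for clarity) into a list of
 * free ranges.  Returns the number of ranges written to out, never more
 * than max.
 */
static size_t build_free_ranges(const uint8_t *used, size_t nblocks,
	struct free_range *out, size_t max)
{
	size_t n = 0;
	size_t i = 0;

	while (i < nblocks && n < max) {
		if (used[i]) {
			i++;
			continue;
		}
		/* Start of a gap: extend it until the next used block. */
		out[n].low = i;
		while (i < nblocks && !used[i])
			i++;
		out[n].high = i - 1;
		n++;
	}
	return n;
}
```

For example, a map of { 1, 0, 0, 1, 1, 0, 1, 0 } yields the free ranges
[1, 2], [5, 5] and [7, 7].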

During the same scan, it rebuilds the snapshot information and the list of
available inodes.

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 fs/nova/bbuild.c  | 1602 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/rebuild.c |  847 ++++++++++++++++++++++++++++
 2 files changed, 2449 insertions(+)
 create mode 100644 fs/nova/bbuild.c
 create mode 100644 fs/nova/rebuild.c

diff --git a/fs/nova/bbuild.c b/fs/nova/bbuild.c
new file mode 100644
index 000000000000..bdfcc3e3d70f
--- /dev/null
+++ b/fs/nova/bbuild.c
@@ -0,0 +1,1602 @@
+/*
+ * NOVA Recovery routines.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#include <linux/fs.h>
+#include <linux/bitops.h>
+#include <linux/slab.h>
+#include <linux/random.h>
+#include <linux/delay.h>
+#include "nova.h"
+#include "journal.h"
+#include "super.h"
+#include "inode.h"
+#include "log.h"
+
+void nova_init_header(struct super_block *sb,
+	struct nova_inode_info_header *sih, u16 i_mode)
+{
+	sih->log_pages = 0;
+	sih->i_size = 0;
+	sih->i_blocks = 0;
+	sih->pi_addr = 0;
+	sih->alter_pi_addr = 0;
+	INIT_RADIX_TREE(&sih->tree, GFP_ATOMIC);
+	sih->vma_tree = RB_ROOT;
+	sih->num_vmas = 0;
+	INIT_LIST_HEAD(&sih->list);
+	sih->i_mode = i_mode;
+	sih->valid_entries = 0;
+	sih->num_entries = 0;
+	sih->last_setattr = 0;
+	sih->last_link_change = 0;
+	sih->last_dentry = 0;
+	sih->trans_id = 0;
+	sih->log_head = 0;
+	sih->log_tail = 0;
+	sih->alter_log_head = 0;
+	sih->alter_log_tail = 0;
+	sih->i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+}
+
+static inline void set_scan_bm(unsigned long bit,
+	struct single_scan_bm *scan_bm)
+{
+	set_bit(bit, scan_bm->bitmap);
+}
+
+inline void set_bm(unsigned long bit, struct scan_bitmap *bm,
+	enum bm_type type)
+{
+	switch (type) {
+	case BM_4K:
+		set_scan_bm(bit, &bm->scan_bm_4K);
+		break;
+	case BM_2M:
+		set_scan_bm(bit, &bm->scan_bm_2M);
+		break;
+	case BM_1G:
+		set_scan_bm(bit, &bm->scan_bm_1G);
+		break;
+	default:
+		break;
+	}
+}
+
+static inline int get_cpuid(struct nova_sb_info *sbi, unsigned long blocknr)
+{
+	return blocknr / sbi->per_list_blocks;
+}
+
+static int nova_failure_insert_inodetree(struct super_block *sb,
+	unsigned long ino_low, unsigned long ino_high)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct inode_map *inode_map;
+	struct nova_range_node *prev = NULL, *next = NULL;
+	struct nova_range_node *new_node;
+	unsigned long internal_low, internal_high;
+	int cpu;
+	struct rb_root *tree;
+	int ret;
+
+	if (ino_low > ino_high) {
+		nova_err(sb, "%s: ino low %lu, ino high %lu\n",
+				__func__, ino_low, ino_high);
+		BUG();
+	}
+
+	cpu = ino_low % sbi->cpus;
+	if (ino_high % sbi->cpus != cpu) {
+		nova_err(sb, "%s: ino low %lu, ino high %lu\n",
+				__func__, ino_low, ino_high);
+		BUG();
+	}
+
+	internal_low = ino_low / sbi->cpus;
+	internal_high = ino_high / sbi->cpus;
+	inode_map = &sbi->inode_maps[cpu];
+	tree = &inode_map->inode_inuse_tree;
+	mutex_lock(&inode_map->inode_table_mutex);
+
+	ret = nova_find_free_slot(sbi, tree, internal_low, internal_high,
+					&prev, &next);
+	if (ret) {
+		nova_dbg("%s: ino %lu - %lu already exists!: %d\n",
+					__func__, ino_low, ino_high, ret);
+		mutex_unlock(&inode_map->inode_table_mutex);
+		return ret;
+	}
+
+	if (prev && next && (internal_low == prev->range_high + 1) &&
+			(internal_high + 1 == next->range_low)) {
+		/* fits the hole */
+		rb_erase(&next->node, tree);
+		inode_map->num_range_node_inode--;
+		prev->range_high = next->range_high;
+		nova_update_range_node_checksum(prev);
+		nova_free_inode_node(sb, next);
+		goto finish;
+	}
+	if (prev && (internal_low == prev->range_high + 1)) {
+		/* Aligns left */
+		prev->range_high += internal_high - internal_low + 1;
+		nova_update_range_node_checksum(prev);
+		goto finish;
+	}
+	if (next && (internal_high + 1 == next->range_low)) {
+		/* Aligns right */
+		next->range_low -= internal_high - internal_low + 1;
+		nova_update_range_node_checksum(next);
+		goto finish;
+	}
+
+	/* Aligns somewhere in the middle */
+	new_node = nova_alloc_inode_node(sb);
+	NOVA_ASSERT(new_node);
+	new_node->range_low = internal_low;
+	new_node->range_high = internal_high;
+	nova_update_range_node_checksum(new_node);
+	ret = nova_insert_inodetree(sbi, new_node, cpu);
+	if (ret) {
+		nova_err(sb, "%s failed\n", __func__);
+		nova_free_inode_node(sb, new_node);
+		goto finish;
+	}
+	inode_map->num_range_node_inode++;
+
+finish:
+	mutex_unlock(&inode_map->inode_table_mutex);
+	return ret;
+}
+
+static void nova_destroy_range_node_tree(struct super_block *sb,
+	struct rb_root *tree)
+{
+	struct nova_range_node *curr;
+	struct rb_node *temp;
+
+	temp = rb_first(tree);
+	while (temp) {
+		curr = container_of(temp, struct nova_range_node, node);
+		temp = rb_next(temp);
+		rb_erase(&curr->node, tree);
+		nova_free_range_node(curr);
+	}
+}
+
+static void nova_destroy_blocknode_tree(struct super_block *sb, int cpu)
+{
+	struct free_list *free_list;
+
+	free_list = nova_get_free_list(sb, cpu);
+	nova_destroy_range_node_tree(sb, &free_list->block_free_tree);
+}
+
+static void nova_destroy_blocknode_trees(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++)
+		nova_destroy_blocknode_tree(sb, i);
+
+}
+
+static int nova_init_blockmap_from_inode(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode *pi = nova_get_inode_by_ino(sb, NOVA_BLOCKNODE_INO);
+	struct nova_inode_info_header sih;
+	struct free_list *free_list;
+	struct nova_range_node_lowhigh *entry;
+	struct nova_range_node *blknode;
+	size_t size = sizeof(struct nova_range_node_lowhigh);
+	u64 curr_p;
+	u64 cpuid;
+	int ret = 0;
+
+	/* FIXME: Backup inode for BLOCKNODE */
+	ret = nova_get_head_tail(sb, pi, &sih);
+	if (ret)
+		goto out;
+
+	curr_p = sih.log_head;
+	if (curr_p == 0) {
+		nova_dbg("%s: pi head is 0!\n", __func__);
+		return -EINVAL;
+	}
+
+	while (curr_p != sih.log_tail) {
+		if (is_last_entry(curr_p, size))
+			curr_p = next_log_page(sb, curr_p);
+
+		if (curr_p == 0) {
+			nova_dbg("%s: curr_p is NULL!\n", __func__);
+			NOVA_ASSERT(0);
+			ret = -EINVAL;
+			break;
+		}
+
+		entry = (struct nova_range_node_lowhigh *)nova_get_block(sb,
+							curr_p);
+		blknode = nova_alloc_blocknode(sb);
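+		if (blknode == NULL) {
+			/* Bail out rather than dereference a NULL node below */
+			NOVA_ASSERT(0);
+			ret = -ENOMEM;
+			nova_destroy_blocknode_trees(sb);
+			goto out;
+		}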
+		blknode->range_low = le64_to_cpu(entry->range_low);
+		blknode->range_high = le64_to_cpu(entry->range_high);
+		nova_update_range_node_checksum(blknode);
+		cpuid = get_cpuid(sbi, blknode->range_low);
+
+		/* FIXME: Assumes NR_CPUS does not change between mounts */
+		free_list = nova_get_free_list(sb, cpuid);
+		ret = nova_insert_blocktree(sbi,
+				&free_list->block_free_tree, blknode);
+		if (ret) {
+			nova_err(sb, "%s failed\n", __func__);
+			nova_free_blocknode(sb, blknode);
+			NOVA_ASSERT(0);
+			nova_destroy_blocknode_trees(sb);
+			goto out;
+		}
+		free_list->num_blocknode++;
+		if (free_list->num_blocknode == 1)
+			free_list->first_node = blknode;
+		free_list->last_node = blknode;
+		free_list->num_free_blocks +=
+			blknode->range_high - blknode->range_low + 1;
+		curr_p += sizeof(struct nova_range_node_lowhigh);
+	}
+out:
+	nova_free_inode_log(sb, pi, &sih);
+	return ret;
+}
+
+static void nova_destroy_inode_trees(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct inode_map *inode_map;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		inode_map = &sbi->inode_maps[i];
+		nova_destroy_range_node_tree(sb,
+					&inode_map->inode_inuse_tree);
+	}
+}
+
+#define CPUID_MASK 0xff00000000000000
+
+static int nova_init_inode_list_from_inode(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode *pi = nova_get_inode_by_ino(sb, NOVA_INODELIST1_INO);
+	struct nova_inode_info_header sih;
+	struct nova_range_node_lowhigh *entry;
+	struct nova_range_node *range_node;
+	struct inode_map *inode_map;
+	size_t size = sizeof(struct nova_range_node_lowhigh);
+	unsigned long num_inode_node = 0;
+	u64 curr_p;
+	unsigned long cpuid;
+	int ret;
+
+	/* FIXME: Backup inode for INODELIST */
+	ret = nova_get_head_tail(sb, pi, &sih);
+	if (ret)
+		goto out;
+
+	sbi->s_inodes_used_count = 0;
+	curr_p = sih.log_head;
+	if (curr_p == 0) {
+		nova_dbg("%s: pi head is 0!\n", __func__);
+		return -EINVAL;
+	}
+
+	while (curr_p != sih.log_tail) {
+		if (is_last_entry(curr_p, size))
+			curr_p = next_log_page(sb, curr_p);
+
+		if (curr_p == 0) {
+			nova_dbg("%s: curr_p is NULL!\n", __func__);
+			NOVA_ASSERT(0);
+			ret = -EINVAL;
+			break;
+		}
+
+		entry = (struct nova_range_node_lowhigh *)nova_get_block(sb,
+							curr_p);
+		range_node = nova_alloc_inode_node(sb);
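+		if (range_node == NULL) {
+			/* Bail out rather than dereference a NULL node below */
+			NOVA_ASSERT(0);
+			ret = -ENOMEM;
+			nova_destroy_inode_trees(sb);
+			goto out;
+		}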
+
+		/* Log entries are stored little-endian (see the append path) */
+		cpuid = (le64_to_cpu(entry->range_low) & CPUID_MASK) >> 56;
+		if (cpuid >= sbi->cpus) {
+			nova_err(sb, "Invalid cpuid %lu\n", cpuid);
+			nova_free_inode_node(sb, range_node);
+			NOVA_ASSERT(0);
+			nova_destroy_inode_trees(sb);
+			ret = -EINVAL;
+			goto out;
+		}
+
+		range_node->range_low =
+			le64_to_cpu(entry->range_low) & ~CPUID_MASK;
+		range_node->range_high = le64_to_cpu(entry->range_high);
+		nova_update_range_node_checksum(range_node);
+		ret = nova_insert_inodetree(sbi, range_node, cpuid);
+		if (ret) {
+			nova_err(sb, "%s failed, %lu\n", __func__, cpuid);
+			nova_free_inode_node(sb, range_node);
+			NOVA_ASSERT(0);
+			nova_destroy_inode_trees(sb);
+			goto out;
+		}
+
+		sbi->s_inodes_used_count +=
+			range_node->range_high - range_node->range_low + 1;
+		num_inode_node++;
+
+		inode_map = &sbi->inode_maps[cpuid];
+		inode_map->num_range_node_inode++;
+		if (!inode_map->first_inode_range)
+			inode_map->first_inode_range = range_node;
+
+		curr_p += sizeof(struct nova_range_node_lowhigh);
+	}
+
+	nova_dbg("%s: %lu inode nodes\n", __func__, num_inode_node);
+out:
+	nova_free_inode_log(sb, pi, &sih);
+	return ret;
+}
+
+static u64 nova_append_range_node_entry(struct super_block *sb,
+	struct nova_range_node *curr, u64 tail, unsigned long cpuid)
+{
+	u64 curr_p;
+	size_t size = sizeof(struct nova_range_node_lowhigh);
+	struct nova_range_node_lowhigh *entry;
+
+	curr_p = tail;
+
+	if (!nova_range_node_checksum_ok(curr)) {
+		nova_dbg("%s: range node checksum failure\n", __func__);
+		goto out;
+	}
+
+	if (curr_p == 0 || (is_last_entry(curr_p, size) &&
+				next_log_page(sb, curr_p) == 0)) {
+		nova_dbg("%s: inode log reaches end?\n", __func__);
+		goto out;
+	}
+
+	if (is_last_entry(curr_p, size))
+		curr_p = next_log_page(sb, curr_p);
+
+	entry = (struct nova_range_node_lowhigh *)nova_get_block(sb, curr_p);
+	nova_memunlock_range(sb, entry, size);
+	entry->range_low = cpu_to_le64(curr->range_low);
+	if (cpuid)
+		entry->range_low |= cpu_to_le64((u64)cpuid << 56);
+	entry->range_high = cpu_to_le64(curr->range_high);
+	nova_memlock_range(sb, entry, size);
+	nova_dbgv("append entry block low 0x%lx, high 0x%lx\n",
+			curr->range_low, curr->range_high);
+
+	nova_flush_buffer(entry, sizeof(struct nova_range_node_lowhigh), 0);
+out:
+	return curr_p;
+}
+
+static u64 nova_save_range_nodes_to_log(struct super_block *sb,
+	struct rb_root *tree, u64 temp_tail, unsigned long cpuid)
+{
+	struct nova_range_node *curr;
+	struct rb_node *temp;
+	size_t size = sizeof(struct nova_range_node_lowhigh);
+	u64 curr_entry = 0;
+
+	/* Save in increasing order */
+	temp = rb_first(tree);
+	while (temp) {
+		curr = container_of(temp, struct nova_range_node, node);
+		curr_entry = nova_append_range_node_entry(sb, curr,
+						temp_tail, cpuid);
+		temp_tail = curr_entry + size;
+		temp = rb_next(temp);
+		rb_erase(&curr->node, tree);
+		nova_free_range_node(curr);
+	}
+
+	return temp_tail;
+}
+
+static u64 nova_save_free_list_blocknodes(struct super_block *sb, int cpu,
+	u64 temp_tail)
+{
+	struct free_list *free_list;
+
+	free_list = nova_get_free_list(sb, cpu);
+	temp_tail = nova_save_range_nodes_to_log(sb,
+				&free_list->block_free_tree, temp_tail, 0);
+	return temp_tail;
+}
+
+void nova_save_inode_list_to_log(struct super_block *sb)
+{
+	struct nova_inode *pi = nova_get_inode_by_ino(sb, NOVA_INODELIST1_INO);
+	struct nova_inode_info_header sih;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	unsigned long num_blocks;
+	unsigned long num_nodes = 0;
+	struct inode_map *inode_map;
+	unsigned long i;
+	u64 temp_tail;
+	u64 new_block;
+	int allocated;
+
+	sih.ino = NOVA_INODELIST1_INO;
+	sih.i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+	sih.i_blocks = 0;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		inode_map = &sbi->inode_maps[i];
+		num_nodes += inode_map->num_range_node_inode;
+	}
+
+	num_blocks = num_nodes / RANGENODE_PER_PAGE;
+	if (num_nodes % RANGENODE_PER_PAGE)
+		num_blocks++;
+
+	allocated = nova_allocate_inode_log_pages(sb, &sih, num_blocks,
+						&new_block, ANY_CPU, 0);
+	if (allocated != num_blocks) {
+		nova_dbg("Error saving inode list: %d\n", allocated);
+		return;
+	}
+
+	temp_tail = new_block;
+	for (i = 0; i < sbi->cpus; i++) {
+		inode_map = &sbi->inode_maps[i];
+		temp_tail = nova_save_range_nodes_to_log(sb,
+				&inode_map->inode_inuse_tree, temp_tail, i);
+	}
+
+	nova_memunlock_inode(sb, pi);
+	pi->alter_log_head = pi->alter_log_tail = 0;
+	pi->log_head = new_block;
+	nova_update_tail(pi, temp_tail);
+	nova_flush_buffer(&pi->log_head, CACHELINE_SIZE, 0);
+	nova_memlock_inode(sb, pi);
+
+	nova_dbg("%s: %lu inode nodes, pi head 0x%llx, tail 0x%llx\n",
+		__func__, num_nodes, pi->log_head, pi->log_tail);
+}
+
+void nova_save_blocknode_mappings_to_log(struct super_block *sb)
+{
+	struct nova_inode *pi = nova_get_inode_by_ino(sb, NOVA_BLOCKNODE_INO);
+	struct nova_inode_info_header sih;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	unsigned long num_blocknode = 0;
+	unsigned long num_pages;
+	int allocated;
+	u64 new_block = 0;
+	u64 temp_tail;
+	int i;
+
+	sih.ino = NOVA_BLOCKNODE_INO;
+	sih.i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+
+	/* Allocate log pages before saving the blocknode mappings */
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+		num_blocknode += free_list->num_blocknode;
+		nova_dbgv("%s: free list %d: %lu nodes\n", __func__,
+				i, free_list->num_blocknode);
+	}
+
+	num_pages = num_blocknode / RANGENODE_PER_PAGE;
+	if (num_blocknode % RANGENODE_PER_PAGE)
+		num_pages++;
+
+	allocated = nova_allocate_inode_log_pages(sb, &sih, num_pages,
+						&new_block, ANY_CPU, 0);
+	if (allocated != num_pages) {
+		nova_dbg("Error saving blocknode mappings: %d\n", allocated);
+		return;
+	}
+
+	temp_tail = new_block;
+	for (i = 0; i < sbi->cpus; i++)
+		temp_tail = nova_save_free_list_blocknodes(sb, i, temp_tail);
+
+	/* Finally update log head and tail */
+	nova_memunlock_inode(sb, pi);
+	pi->alter_log_head = pi->alter_log_tail = 0;
+	pi->log_head = new_block;
+	nova_update_tail(pi, temp_tail);
+	nova_flush_buffer(&pi->log_head, CACHELINE_SIZE, 0);
+	nova_memlock_inode(sb, pi);
+
+	nova_dbg("%s: %lu blocknodes, %lu log pages, pi head 0x%llx, tail 0x%llx\n",
+		  __func__, num_blocknode, num_pages,
+		  pi->log_head, pi->log_tail);
+}
+
+static int nova_insert_blocknode_map(struct super_block *sb,
+	int cpuid, unsigned long low, unsigned long high)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	struct rb_root *tree;
+	struct nova_range_node *blknode = NULL;
+	unsigned long num_blocks = 0;
+	int ret;
+
+	num_blocks = high - low + 1;
+	nova_dbgv("%s: cpu %d, low %lu, high %lu, num %lu\n",
+		__func__, cpuid, low, high, num_blocks);
+	free_list = nova_get_free_list(sb, cpuid);
+	tree = &(free_list->block_free_tree);
+
+	blknode = nova_alloc_blocknode(sb);
+	if (blknode == NULL)
+		return -ENOMEM;
+	blknode->range_low = low;
+	blknode->range_high = high;
+	nova_update_range_node_checksum(blknode);
+	ret = nova_insert_blocktree(sbi, tree, blknode);
+	if (ret) {
+		nova_err(sb, "%s failed\n", __func__);
+		nova_free_blocknode(sb, blknode);
+		goto out;
+	}
+	if (!free_list->first_node)
+		free_list->first_node = blknode;
+	free_list->last_node = blknode;
+	free_list->num_blocknode++;
+	free_list->num_free_blocks += num_blocks;
+out:
+	return ret;
+}
+
+static int __nova_build_blocknode_map(struct super_block *sb,
+	unsigned long *bitmap, unsigned long bsize, unsigned long scale)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	unsigned long next = 0;
+	unsigned long low = 0;
+	unsigned long start, end;
+	int cpuid = 0;
+
+	free_list = nova_get_free_list(sb, cpuid);
+	start = free_list->block_start;
+	end = free_list->block_end + 1;
+	while (1) {
+		next = find_next_zero_bit(bitmap, end, start);
+		if (next == bsize)
+			break;
+		if (next == end) {
+			if (cpuid == sbi->cpus - 1)
+				break;
+
+			cpuid++;
+			free_list = nova_get_free_list(sb, cpuid);
+			start = free_list->block_start;
+			end = free_list->block_end + 1;
+			continue;
+		}
+
+		low = next;
+		next = find_next_bit(bitmap, end, next);
+		if (nova_insert_blocknode_map(sb, cpuid,
+				low << scale, (next << scale) - 1)) {
+			nova_dbg("Error: could not insert %lu - %lu\n",
+				low << scale, ((next << scale) - 1));
+		}
+		start = next;
+		if (next == bsize)
+			break;
+		if (next == end) {
+			if (cpuid == sbi->cpus - 1)
+				break;
+
+			cpuid++;
+			free_list = nova_get_free_list(sb, cpuid);
+			start = free_list->block_start;
+			end = free_list->block_end + 1;
+		}
+	}
+	return 0;
+}
+
+static void nova_update_4K_map(struct super_block *sb,
+	struct scan_bitmap *bm,	unsigned long *bitmap,
+	unsigned long bsize, unsigned long scale)
+{
+	unsigned long next = 0;
+	unsigned long low = 0;
+	int i;
+
+	while (1) {
+		next = find_next_bit(bitmap, bsize, next);
+		if (next == bsize)
+			break;
+		low = next;
+		next = find_next_zero_bit(bitmap, bsize, next);
+		for (i = (low << scale); i < (next << scale); i++)
+			set_bm(i, bm, BM_4K);
+		if (next == bsize)
+			break;
+	}
+}
+
+struct scan_bitmap *global_bm[64];
+
+static int nova_build_blocknode_map(struct super_block *sb,
+	unsigned long initsize)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct scan_bitmap *bm;
+	struct scan_bitmap *final_bm;
+	unsigned long *src, *dst;
+	int i, j;
+	int num;
+	int ret;
+
+	final_bm = kzalloc(sizeof(struct scan_bitmap), GFP_KERNEL);
+	if (!final_bm)
+		return -ENOMEM;
+
+	final_bm->scan_bm_4K.bitmap_size =
+				(initsize >> (PAGE_SHIFT + 0x3));
+
+	/* Alloc memory to hold the block alloc bitmap */
+	final_bm->scan_bm_4K.bitmap = kzalloc(final_bm->scan_bm_4K.bitmap_size,
+							GFP_KERNEL);
+
+	if (!final_bm->scan_bm_4K.bitmap) {
+		kfree(final_bm);
+		return -ENOMEM;
+	}
+
+	/*
+	 * We are using free lists. Set 2M and 1G blocks in 4K map,
+	 * and use 4K map to rebuild block map.
+	 */
+	for (i = 0; i < sbi->cpus; i++) {
+		bm = global_bm[i];
+		nova_update_4K_map(sb, bm, bm->scan_bm_2M.bitmap,
+			bm->scan_bm_2M.bitmap_size * 8, PAGE_SHIFT_2M - 12);
+		nova_update_4K_map(sb, bm, bm->scan_bm_1G.bitmap,
+			bm->scan_bm_1G.bitmap_size * 8, PAGE_SHIFT_1G - 12);
+	}
+
+	/* Merge per-CPU bms to the final single bm */
+	num = final_bm->scan_bm_4K.bitmap_size / sizeof(unsigned long);
+	if (final_bm->scan_bm_4K.bitmap_size % sizeof(unsigned long))
+		num++;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		bm = global_bm[i];
+		src = (unsigned long *)bm->scan_bm_4K.bitmap;
+		dst = (unsigned long *)final_bm->scan_bm_4K.bitmap;
+
+		for (j = 0; j < num; j++)
+			dst[j] |= src[j];
+	}
+
+	ret = __nova_build_blocknode_map(sb, final_bm->scan_bm_4K.bitmap,
+			final_bm->scan_bm_4K.bitmap_size * 8, PAGE_SHIFT - 12);
+
+	kfree(final_bm->scan_bm_4K.bitmap);
+	kfree(final_bm);
+
+	return ret;
+}
+
+static void free_bm(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct scan_bitmap *bm;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		bm = global_bm[i];
+		if (bm) {
+			kfree(bm->scan_bm_4K.bitmap);
+			kfree(bm->scan_bm_2M.bitmap);
+			kfree(bm->scan_bm_1G.bitmap);
+			kfree(bm);
+		}
+	}
+}
+
+static int alloc_bm(struct super_block *sb, unsigned long initsize)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct scan_bitmap *bm;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		bm = kzalloc(sizeof(struct scan_bitmap), GFP_KERNEL);
+		if (!bm)
+			return -ENOMEM;
+
+		global_bm[i] = bm;
+
+		bm->scan_bm_4K.bitmap_size =
+				(initsize >> (PAGE_SHIFT + 0x3));
+		bm->scan_bm_2M.bitmap_size =
+				(initsize >> (PAGE_SHIFT_2M + 0x3));
+		bm->scan_bm_1G.bitmap_size =
+				(initsize >> (PAGE_SHIFT_1G + 0x3));
+
+		/* Alloc memory to hold the block alloc bitmap */
+		bm->scan_bm_4K.bitmap = kzalloc(bm->scan_bm_4K.bitmap_size,
+							GFP_KERNEL);
+		bm->scan_bm_2M.bitmap = kzalloc(bm->scan_bm_2M.bitmap_size,
+							GFP_KERNEL);
+		bm->scan_bm_1G.bitmap = kzalloc(bm->scan_bm_1G.bitmap_size,
+							GFP_KERNEL);
+
+		if (!bm->scan_bm_4K.bitmap || !bm->scan_bm_2M.bitmap ||
+				!bm->scan_bm_1G.bitmap)
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+/************************** NOVA recovery ****************************/
+
+#define MAX_PGOFF	262144
+
+struct task_ring {
+	u64 addr0[512];
+	u64 addr1[512];		/* Second inode address */
+	int num;
+	int inodes_used_count;
+	u64 *entry_array;
+	u64 *nvmm_array;
+};
+
+static struct task_ring *task_rings;
+static struct task_struct **threads;
+wait_queue_head_t finish_wq;
+int *finished;
+
+static int nova_traverse_inode_log(struct super_block *sb,
+	struct nova_inode *pi, struct scan_bitmap *bm, u64 head)
+{
+	u64 curr_p;
+	u64 next;
+
+	curr_p = head;
+
+	if (curr_p == 0)
+		return 0;
+
+	BUG_ON(curr_p & (PAGE_SIZE - 1));
+	set_bm(curr_p >> PAGE_SHIFT, bm, BM_4K);
+
+	next = next_log_page(sb, curr_p);
+	while (next > 0) {
+		curr_p = next;
+		BUG_ON(curr_p & (PAGE_SIZE - 1));
+		set_bm(curr_p >> PAGE_SHIFT, bm, BM_4K);
+		next = next_log_page(sb, curr_p);
+	}
+
+	return 0;
+}
+
+static void nova_traverse_dir_inode_log(struct super_block *sb,
+	struct nova_inode *pi, struct scan_bitmap *bm)
+{
+	nova_traverse_inode_log(sb, pi, bm, pi->log_head);
+	if (metadata_csum)
+		nova_traverse_inode_log(sb, pi, bm, pi->alter_log_head);
+}
+
+static unsigned int nova_check_old_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 entry_addr,
+	unsigned long pgoff, unsigned int num_free,
+	u64 epoch_id, struct task_ring *ring, unsigned long base,
+	struct scan_bitmap *bm)
+{
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *entryc, entry_copy;
+	unsigned long old_nvmm, nvmm;
+	unsigned long index;
+	int i;
+	int ret;
+
+	entry = (struct nova_file_write_entry *)entry_addr;
+
+	if (!entry)
+		return 0;
+
+	if (metadata_csum == 0)
+		entryc = entry;
+	else {
+		entryc = &entry_copy;
+		if (!nova_verify_entry_csum(sb, entry, entryc))
+			return 0;
+	}
+
+	old_nvmm = get_nvmm(sb, sih, entryc, pgoff);
+
+	ret = nova_append_data_to_snapshot(sb, entryc, old_nvmm,
+				num_free, epoch_id);
+
+	if (ret != 0)
+		return ret;
+
+	index = pgoff - base;
+	for (i = 0; i < num_free; i++) {
+		nvmm = ring->nvmm_array[index];
+		if (nvmm)
+			set_bm(nvmm, bm, BM_4K);
+		index++;
+	}
+
+	return ret;
+}
+
+static int nova_set_ring_array(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	struct nova_file_write_entry *entryc, struct task_ring *ring,
+	unsigned long base, struct scan_bitmap *bm)
+{
+	unsigned long start, end;
+	unsigned long pgoff, old_pgoff = 0;
+	unsigned long index;
+	unsigned int num_free = 0;
+	u64 old_entry = 0;
+	u64 epoch_id = entryc->epoch_id;
+
+	start = entryc->pgoff;
+	if (start < base)
+		start = base;
+
+	end = entryc->pgoff + entryc->num_pages;
+	if (end > base + MAX_PGOFF)
+		end = base + MAX_PGOFF;
+
+	for (pgoff = start; pgoff < end; pgoff++) {
+		index = pgoff - base;
+		if (ring->nvmm_array[index]) {
+			if (ring->entry_array[index] != old_entry) {
+				if (old_entry)
+					nova_check_old_entry(sb, sih, old_entry,
+							old_pgoff, num_free,
+							epoch_id, ring, base,
+							bm);
+
+				old_entry = ring->entry_array[index];
+				old_pgoff = pgoff;
+				num_free = 1;
+			} else {
+				num_free++;
+			}
+		}
+	}
+
+	if (old_entry)
+		nova_check_old_entry(sb, sih, old_entry, old_pgoff,
+					num_free, epoch_id, ring, base, bm);
+
+	for (pgoff = start; pgoff < end; pgoff++) {
+		index = pgoff - base;
+		ring->entry_array[index] = (u64)entry;
+		ring->nvmm_array[index] = (u64)(entryc->block >> PAGE_SHIFT)
+						+ pgoff - entryc->pgoff;
+	}
+
+	return 0;
+}
+
+static int nova_set_file_bm(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct task_ring *ring,
+	struct scan_bitmap *bm, unsigned long base, unsigned long last_blocknr)
+{
+	unsigned long nvmm, pgoff;
+
+	if (last_blocknr >= base + MAX_PGOFF)
+		last_blocknr = MAX_PGOFF - 1;
+	else
+		last_blocknr -= base;
+
+	for (pgoff = 0; pgoff <= last_blocknr; pgoff++) {
+		nvmm = ring->nvmm_array[pgoff];
+		if (nvmm) {
+			set_bm(nvmm, bm, BM_4K);
+			ring->nvmm_array[pgoff] = 0;
+			ring->entry_array[pgoff] = 0;
+		}
+	}
+
+	return 0;
+}
+
+/* The entry passed to this function is a copy in DRAM */
+static void nova_ring_setattr_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_setattr_logentry *entry, struct task_ring *ring,
+	unsigned long base, unsigned int data_bits, struct scan_bitmap *bm)
+{
+	unsigned long first_blocknr, last_blocknr;
+	unsigned long pgoff, old_pgoff = 0;
+	unsigned long index;
+	unsigned int num_free = 0;
+	u64 old_entry = 0;
+	loff_t start, end;
+	u64 epoch_id = entry->epoch_id;
+
+	if (sih->i_size <= entry->size)
+		goto out;
+
+	start = entry->size;
+	end = sih->i_size;
+
+	first_blocknr = (start + (1UL << data_bits) - 1) >> data_bits;
+
+	if (end > 0)
+		last_blocknr = (end - 1) >> data_bits;
+	else
+		last_blocknr = 0;
+
+	if (first_blocknr > last_blocknr)
+		goto out;
+
+	if (first_blocknr < base)
+		first_blocknr = base;
+
+	if (last_blocknr > base + MAX_PGOFF - 1)
+		last_blocknr = base + MAX_PGOFF - 1;
+
+	for (pgoff = first_blocknr; pgoff <= last_blocknr; pgoff++) {
+		index = pgoff - base;
+		if (ring->nvmm_array[index]) {
+			if (ring->entry_array[index] != old_entry) {
+				if (old_entry)
+					nova_check_old_entry(sb, sih, old_entry,
+							old_pgoff, num_free,
+							epoch_id, ring, base,
+							bm);
+
+				old_entry = ring->entry_array[index];
+				old_pgoff = pgoff;
+				num_free = 1;
+			} else {
+				num_free++;
+			}
+		}
+	}
+
+	if (old_entry)
+		nova_check_old_entry(sb, sih, old_entry, old_pgoff,
+					num_free, epoch_id, ring, base, bm);
+
+	for (pgoff = first_blocknr; pgoff <= last_blocknr; pgoff++) {
+		index = pgoff - base;
+		ring->nvmm_array[index] = 0;
+		ring->entry_array[index] = 0;
+	}
+
+out:
+	sih->i_size = entry->size;
+}
+
+static void nova_traverse_file_write_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	struct nova_file_write_entry *entryc, struct task_ring *ring,
+	unsigned long base, struct scan_bitmap *bm)
+{
+	sih->i_size = entryc->size;
+
+	if (entryc->num_pages != entryc->invalid_pages) {
+		if (entryc->pgoff < base + MAX_PGOFF &&
+				entryc->pgoff + entryc->num_pages > base)
+			nova_set_ring_array(sb, sih, entry, entryc,
+						ring, base, bm);
+	}
+}
+
+static int nova_traverse_file_inode_log(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	struct task_ring *ring, struct scan_bitmap *bm)
+{
+	char entry_copy[NOVA_MAX_ENTRY_LEN];
+	unsigned long base = 0;
+	unsigned long last_blocknr;
+	u64 ino = pi->nova_ino;
+	void *entry, *entryc;
+	unsigned int btype;
+	unsigned int data_bits;
+	u64 curr_p;
+	u64 next;
+	u8 type;
+
+	btype = pi->i_blk_type;
+	data_bits = blk_type_to_shift[btype];
+
+	if (metadata_csum)
+		nova_traverse_inode_log(sb, pi, bm, pi->alter_log_head);
+
+	entryc = (metadata_csum == 0) ? NULL : entry_copy;
+
+again:
+	sih->i_size = 0;
+	curr_p = pi->log_head;
+	nova_dbg_verbose("Log head 0x%llx, tail 0x%llx\n",
+				curr_p, pi->log_tail);
+	if (curr_p == 0 && pi->log_tail == 0)
+		return 0;
+
+	if (base == 0) {
+		BUG_ON(curr_p & (PAGE_SIZE - 1));
+		set_bm(curr_p >> PAGE_SHIFT, bm, BM_4K);
+	}
+
+	while (curr_p != pi->log_tail) {
+		if (goto_next_page(sb, curr_p)) {
+			curr_p = next_log_page(sb, curr_p);
+			if (base == 0) {
+				BUG_ON(curr_p & (PAGE_SIZE - 1));
+				set_bm(curr_p >> PAGE_SHIFT, bm, BM_4K);
+			}
+		}
+
+		if (curr_p == 0) {
+			nova_err(sb, "File inode %llu log is NULL!\n", ino);
+			BUG();
+		}
+
+		entry = (void *)nova_get_block(sb, curr_p);
+
+		if (metadata_csum == 0)
+			entryc = entry;
+		else if (!nova_verify_entry_csum(sb, entry, entryc))
+			return 0;
+
+		type = nova_get_entry_type(entryc);
+		switch (type) {
+		case SET_ATTR:
+			nova_ring_setattr_entry(sb, sih, SENTRY(entryc),
+						ring, base, data_bits,
+						bm);
+			curr_p += sizeof(struct nova_setattr_logentry);
+			break;
+		case LINK_CHANGE:
+			curr_p += sizeof(struct nova_link_change_entry);
+			break;
+		case FILE_WRITE:
+			nova_traverse_file_write_entry(sb, sih, WENTRY(entry),
+						WENTRY(entryc), ring, base, bm);
+			curr_p += sizeof(struct nova_file_write_entry);
+			break;
+		case MMAP_WRITE:
+			curr_p += sizeof(struct nova_mmap_entry);
+			break;
+		default:
+			nova_dbg("%s: unknown type %d, 0x%llx\n",
+						__func__, type, curr_p);
+			NOVA_ASSERT(0);
+			BUG();
+		}
+
+	}
+
+	if (base == 0) {
+		/* Keep traversing until log ends */
+		curr_p &= PAGE_MASK;
+		next = next_log_page(sb, curr_p);
+		while (next > 0) {
+			curr_p = next;
+			BUG_ON(curr_p & (PAGE_SIZE - 1));
+			set_bm(curr_p >> PAGE_SHIFT, bm, BM_4K);
+			next = next_log_page(sb, curr_p);
+		}
+	}
+
+	if (sih->i_size == 0)
+		return 0;
+
+	last_blocknr = (sih->i_size - 1) >> data_bits;
+	nova_set_file_bm(sb, sih, ring, bm, base, last_blocknr);
+	if (last_blocknr >= base + MAX_PGOFF) {
+		base += MAX_PGOFF;
+		goto again;
+	}
+
+	return 0;
+}
+
+/* pi is a DRAM copy (fake version) of the inode */
+static int nova_recover_inode_pages(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct task_ring *ring,
+	struct nova_inode *pi, struct scan_bitmap *bm)
+{
+	unsigned long nova_ino;
+
+	if (pi->deleted == 1)
+		return 0;
+
+	nova_ino = pi->nova_ino;
+	ring->inodes_used_count++;
+
+	sih->i_mode = __le16_to_cpu(pi->i_mode);
+	sih->ino = nova_ino;
+
+	nova_dbgv("%s: inode %lu, head 0x%llx, tail 0x%llx\n",
+			__func__, nova_ino, pi->log_head, pi->log_tail);
+
+	switch (__le16_to_cpu(pi->i_mode) & S_IFMT) {
+	case S_IFDIR:
+		nova_traverse_dir_inode_log(sb, pi, bm);
+		break;
+	case S_IFLNK:
+		/* Treat symlink files as normal files */
+		/* Fall through */
+	case S_IFREG:
+		/* Fall through */
+	default:
+		/* In case of special inode, walk the log */
+		nova_traverse_file_inode_log(sb, pi, sih, ring, bm);
+		break;
+	}
+
+	return 0;
+}
+
+static void free_resources(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct task_ring *ring;
+	int i;
+
+	if (task_rings) {
+		for (i = 0; i < sbi->cpus; i++) {
+			ring = &task_rings[i];
+			vfree(ring->entry_array);
+			vfree(ring->nvmm_array);
+			ring->entry_array = NULL;
+			ring->nvmm_array = NULL;
+		}
+	}
+
+	kfree(task_rings);
+	kfree(threads);
+	kfree(finished);
+}
+
+static int failure_thread_func(void *data);
+
+static int allocate_resources(struct super_block *sb, int cpus)
+{
+	struct task_ring *ring;
+	int i;
+
+	task_rings = kcalloc(cpus, sizeof(struct task_ring), GFP_KERNEL);
+	if (!task_rings)
+		goto fail;
+
+	for (i = 0; i < cpus; i++) {
+		ring = &task_rings[i];
+
+		ring->nvmm_array = vzalloc(sizeof(u64) * MAX_PGOFF);
+		if (!ring->nvmm_array)
+			goto fail;
+
+		ring->entry_array = vzalloc(sizeof(u64) * MAX_PGOFF);
+		if (!ring->entry_array)
+			goto fail;
+	}
+
+	threads = kcalloc(cpus, sizeof(struct task_struct *), GFP_KERNEL);
+	if (!threads)
+		goto fail;
+
+	finished = kcalloc(cpus, sizeof(int), GFP_KERNEL);
+	if (!finished)
+		goto fail;
+
+	init_waitqueue_head(&finish_wq);
+
+	for (i = 0; i < cpus; i++) {
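+		threads[i] = kthread_create(failure_thread_func,
+						sb, "recovery thread");
+		if (IS_ERR(threads[i])) {
+			/* Stop the threads created so far; none has run yet */
+			while (--i >= 0)
+				kthread_stop(threads[i]);
+			goto fail;
+		}
+		kthread_bind(threads[i], i);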
+	}
+
+	return 0;
+
+fail:
+	free_resources(sb);
+	return -ENOMEM;
+}
+
+static void wait_to_finish(int cpus)
+{
+	int i;
+
+	for (i = 0; i < cpus; i++) {
+		while (finished[i] == 0) {
+			wait_event_interruptible_timeout(finish_wq,
+							finished[i],
+							msecs_to_jiffies(1));
+		}
+	}
+}
+
+/*********************** Failure recovery *************************/
+
+static inline int nova_failure_update_inodetree(struct super_block *sb,
+	struct nova_inode *pi, unsigned long *ino_low, unsigned long *ino_high)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (*ino_low == 0) {
+		*ino_low = *ino_high = pi->nova_ino;
+	} else {
+		if (pi->nova_ino == *ino_high + sbi->cpus) {
+			*ino_high = pi->nova_ino;
+		} else {
+			/* A new start */
+			nova_failure_insert_inodetree(sb, *ino_low, *ino_high);
+			*ino_low = *ino_high = pi->nova_ino;
+		}
+	}
+
+	return 0;
+}
+
+static int failure_thread_func(void *data)
+{
+	struct super_block *sb = data;
+	struct nova_inode_info_header sih;
+	struct task_ring *ring;
+	struct nova_inode *pi, fake_pi;
+	unsigned long num_inodes_per_page;
+	unsigned long ino_low, ino_high;
+	unsigned long last_blocknr;
+	unsigned int data_bits;
+	u64 curr, curr1;
+	int cpuid = smp_processor_id();
+	unsigned long i;
+	unsigned long max_size = 0;
+	u64 pi_addr = 0;
+	int ret = 0;
+	int count;
+
+	pi = nova_get_inode_by_ino(sb, NOVA_INODETABLE_INO);
+	data_bits = blk_type_to_shift[pi->i_blk_type];
+	num_inodes_per_page = 1 << (data_bits - NOVA_INODE_BITS);
+
+	ring = &task_rings[cpuid];
+	nova_init_header(sb, &sih, 0);
+
+	for (count = 0; count < ring->num; count++) {
+		curr = ring->addr0[count];
+		curr1 = ring->addr1[count];
+		ino_low = ino_high = 0;
+
+		/*
+		 * Note: The inode log page is allocated in 2MB
+		 * granularity, but not aligned on 2MB boundary.
+		 */
+		for (i = 0; i < 512; i++)
+			set_bm((curr >> PAGE_SHIFT) + i,
+					global_bm[cpuid], BM_4K);
+
+		if (metadata_csum) {
+			for (i = 0; i < 512; i++)
+				set_bm((curr1 >> PAGE_SHIFT) + i,
+					global_bm[cpuid], BM_4K);
+		}
+
+		for (i = 0; i < num_inodes_per_page; i++) {
+			pi_addr = curr + i * NOVA_INODE_SIZE;
+			ret = nova_get_reference(sb, pi_addr, &fake_pi,
+				(void **)&pi, sizeof(struct nova_inode));
+			if (ret) {
+				nova_dbg("Recover pi @ 0x%llx failed\n",
+						pi_addr);
+				continue;
+			}
+			/* FIXME: Check inode checksum */
+			if (fake_pi.i_mode && fake_pi.deleted == 0) {
+				if (fake_pi.valid == 0) {
+					ret = nova_append_inode_to_snapshot(sb,
+									pi);
+					if (ret != 0) {
+						/* Deletable */
+						pi->deleted = 1;
+						fake_pi.deleted = 1;
+						continue;
+					}
+				}
+
+				nova_recover_inode_pages(sb, &sih, ring,
+						&fake_pi, global_bm[cpuid]);
+				nova_failure_update_inodetree(sb, pi,
+						&ino_low, &ino_high);
+				if (sih.i_size > max_size)
+					max_size = sih.i_size;
+			}
+		}
+
+		if (ino_low && ino_high)
+			nova_failure_insert_inodetree(sb, ino_low, ino_high);
+	}
+
+	/* Free radix tree */
+	if (max_size) {
+		last_blocknr = (max_size - 1) >> PAGE_SHIFT;
+		nova_delete_file_tree(sb, &sih, 0, last_blocknr,
+						false, false, 0);
+	}
+
+	finished[cpuid] = 1;
+	wake_up_interruptible(&finish_wq);
+	do_exit(ret);
+	return ret;
+}
+
+static int nova_failure_recovery_crawl(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info_header sih;
+	struct inode_table *inode_table;
+	struct task_ring *ring;
+	struct nova_inode *pi, fake_pi;
+	unsigned long curr_addr;
+	u64 root_addr;
+	u64 curr;
+	int num_tables;
+	int version;
+	int ret = 0;
+	int count;
+	int cpuid;
+
+	root_addr = nova_get_reserved_inode_addr(sb, NOVA_ROOT_INO);
+
+	num_tables = 1;
+	if (metadata_csum)
+		num_tables = 2;
+
+	for (cpuid = 0; cpuid < sbi->cpus; cpuid++) {
+		ring = &task_rings[cpuid];
+		for (version = 0; version < num_tables; version++) {
+			inode_table = nova_get_inode_table(sb, version,
+								cpuid);
+			if (!inode_table)
+				return -EINVAL;
+
+			count = 0;
+			curr = inode_table->log_head;
+			while (curr) {
+				if (ring->num >= 512) {
+					nova_err(sb, "%s: ring size too small\n",
+						 __func__);
+					return -EINVAL;
+				}
+
+				if (version == 0)
+					ring->addr0[count] = curr;
+				else
+					ring->addr1[count] = curr;
+
+				count++;
+
+				curr_addr = (unsigned long)nova_get_block(sb,
+								curr);
+				/* Next page resides at the last 8 bytes */
+				curr_addr += 2097152 - 8;
+				curr = *(u64 *)(curr_addr);
+			}
+
+			if (count > ring->num)
+				ring->num = count;
+		}
+	}
+
+	for (cpuid = 0; cpuid < sbi->cpus; cpuid++)
+		wake_up_process(threads[cpuid]);
+
+	nova_init_header(sb, &sih, 0);
+	/* Recover the root inode */
+	ret = nova_get_reference(sb, root_addr, &fake_pi,
+			(void **)&pi, sizeof(struct nova_inode));
+	if (ret) {
+		nova_dbg("Recover root pi failed\n");
+		return ret;
+	}
+
+	nova_recover_inode_pages(sb, &sih, &task_rings[0],
+					&fake_pi, global_bm[1]);
+
+	return ret;
+}
+
+int nova_failure_recovery(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct task_ring *ring;
+	struct nova_inode *pi;
+	struct journal_ptr_pair *pair;
+	int ret;
+	int i;
+
+	sbi->s_inodes_used_count = 0;
+
+	/* Initialize inuse inode list */
+	if (nova_init_inode_inuse_list(sb) < 0)
+		return -EINVAL;
+
+	/* Handle special inodes */
+	pi = nova_get_inode_by_ino(sb, NOVA_BLOCKNODE_INO);
+	pi->log_head = pi->log_tail = 0;
+	nova_flush_buffer(&pi->log_head, CACHELINE_SIZE, 0);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		pair = nova_get_journal_pointers(sb, i);
+
+		set_bm(pair->journal_head >> PAGE_SHIFT, global_bm[i], BM_4K);
+	}
+
+	i = NOVA_SNAPSHOT_INO % sbi->cpus;
+	pi = nova_get_inode_by_ino(sb, NOVA_SNAPSHOT_INO);
+	/* Set snapshot info log pages */
+	nova_traverse_dir_inode_log(sb, pi, global_bm[i]);
+
+	PERSISTENT_BARRIER();
+
+	ret = allocate_resources(sb, sbi->cpus);
+	if (ret)
+		return ret;
+
+	ret = nova_failure_recovery_crawl(sb);
+
+	wait_to_finish(sbi->cpus);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		ring = &task_rings[i];
+		sbi->s_inodes_used_count += ring->inodes_used_count;
+	}
+
+	free_resources(sb);
+
+	nova_dbg("Failure recovery total recovered %lu\n",
+				sbi->s_inodes_used_count);
+	return ret;
+}
+
+/*********************** Recovery entrance *************************/
+
+/* Return TRUE if we can do a normal unmount recovery */
+static bool nova_try_normal_recovery(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode *pi =  nova_get_inode_by_ino(sb, NOVA_BLOCKNODE_INO);
+	int ret;
+
+	if (pi->log_head == 0 || pi->log_tail == 0)
+		return false;
+
+	ret = nova_init_blockmap_from_inode(sb);
+	if (ret) {
+		nova_err(sb, "init blockmap failed, fall back to failure recovery\n");
+		return false;
+	}
+
+	ret = nova_init_inode_list_from_inode(sb);
+	if (ret) {
+		nova_err(sb, "init inode list failed, fall back to failure recovery\n");
+		nova_destroy_blocknode_trees(sb);
+		return false;
+	}
+
+	if (sbi->mount_snapshot == 0) {
+		ret = nova_restore_snapshot_table(sb, 0);
+		if (ret) {
+			nova_err(sb, "Restore snapshot table failed, fall back to failure recovery\n");
+			nova_destroy_snapshot_infos(sb);
+			return false;
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Recovery routine has three tasks:
+ * 1. Restore snapshot table;
+ * 2. Restore inuse inode list;
+ * 3. Restore the NVMM allocator.
+ */
+int nova_recovery(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_super_block *super = sbi->nova_sb;
+	unsigned long initsize = le64_to_cpu(super->s_size);
+	bool value = false;
+	int ret = 0;
+	timing_t start, end;
+
+	nova_dbgv("%s\n", __func__);
+
+	/* Always check recovery time */
+	if (measure_timing == 0)
+		getrawmonotonic(&start);
+
+	NOVA_START_TIMING(recovery_t, start);
+	sbi->num_blocks = ((unsigned long)(initsize) >> PAGE_SHIFT);
+
+	/* initialize free list info */
+	nova_init_blockmap(sb, 1);
+
+	value = nova_try_normal_recovery(sb);
+	if (value) {
+		nova_dbg("NOVA: Normal shutdown\n");
+	} else {
+		nova_dbg("NOVA: Failure recovery\n");
+		ret = alloc_bm(sb, initsize);
+		if (ret)
+			goto out;
+
+		if (sbi->mount_snapshot == 0) {
+			/* Initialize the snapshot infos */
+			ret = nova_restore_snapshot_table(sb, 1);
+			if (ret) {
+				nova_dbg("Initialize snapshot infos failed\n");
+				nova_destroy_snapshot_infos(sb);
+				goto out;
+			}
+		}
+
+		sbi->s_inodes_used_count = 0;
+		ret = nova_failure_recovery(sb);
+		if (ret)
+			goto out;
+
+		ret = nova_build_blocknode_map(sb, initsize);
+	}
+
+out:
+	NOVA_END_TIMING(recovery_t, start);
+	if (measure_timing == 0) {
+		getrawmonotonic(&end);
+		Timingstats[recovery_t] +=
+			(end.tv_sec - start.tv_sec) * 1000000000 +
+			(end.tv_nsec - start.tv_nsec);
+	}
+
+	if (!value)
+		free_bm(sb);
+
+	sbi->s_epoch_id = le64_to_cpu(super->s_epoch_id);
+	return ret;
+}
diff --git a/fs/nova/rebuild.c b/fs/nova/rebuild.c
new file mode 100644
index 000000000000..893f180d507e
--- /dev/null
+++ b/fs/nova/rebuild.c
@@ -0,0 +1,847 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Inode rebuild methods.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include "nova.h"
+#include "inode.h"
+
+/* entry given to this function is a copy in dram */
+static void nova_apply_setattr_entry(struct super_block *sb,
+	struct nova_inode_rebuild *reb,	struct nova_inode_info_header *sih,
+	struct nova_setattr_logentry *entry)
+{
+	unsigned int data_bits = blk_type_to_shift[sih->i_blk_type];
+	unsigned long first_blocknr, last_blocknr;
+	loff_t start, end;
+	int freed = 0;
+
+	if (entry->entry_type != SET_ATTR)
+		BUG();
+
+	reb->i_mode	= entry->mode;
+	reb->i_uid	= entry->uid;
+	reb->i_gid	= entry->gid;
+	reb->i_atime	= entry->atime;
+
+	if (S_ISREG(reb->i_mode)) {
+		start = entry->size;
+		end = reb->i_size;
+
+		first_blocknr = (start + (1UL << data_bits) - 1) >> data_bits;
+
+		if (end > 0)
+			last_blocknr = (end - 1) >> data_bits;
+		else
+			last_blocknr = 0;
+
+		freed = nova_delete_file_tree(sb, sih, first_blocknr,
+					last_blocknr, false, false, 0);
+	}
+}
+
+/* entry given to this function is a copy in dram */
+static void nova_apply_link_change_entry(struct super_block *sb,
+	struct nova_inode_rebuild *reb,	struct nova_link_change_entry *entry)
+{
+	if (entry->entry_type != LINK_CHANGE)
+		BUG();
+
+	reb->i_links_count	= entry->links;
+	reb->i_ctime		= entry->ctime;
+	reb->i_flags		= entry->flags;
+	reb->i_generation	= entry->generation;
+
+	/* Do not flush now */
+}
+
+static void nova_update_inode_with_rebuild(struct super_block *sb,
+	struct nova_inode_rebuild *reb, struct nova_inode *pi)
+{
+	pi->i_size = cpu_to_le64(reb->i_size);
+	pi->i_flags = cpu_to_le32(reb->i_flags);
+	pi->i_uid = cpu_to_le32(reb->i_uid);
+	pi->i_gid = cpu_to_le32(reb->i_gid);
+	pi->i_atime = cpu_to_le32(reb->i_atime);
+	pi->i_ctime = cpu_to_le32(reb->i_ctime);
+	pi->i_mtime = cpu_to_le32(reb->i_mtime);
+	pi->i_generation = cpu_to_le32(reb->i_generation);
+	pi->i_links_count = cpu_to_le16(reb->i_links_count);
+	pi->i_mode = cpu_to_le16(reb->i_mode);
+}
+
+static int nova_init_inode_rebuild(struct super_block *sb,
+	struct nova_inode_rebuild *reb, struct nova_inode *pi)
+{
+	struct nova_inode fake_pi;
+	int rc;
+
+	rc = memcpy_mcsafe(&fake_pi, pi, sizeof(struct nova_inode));
+	if (rc)
+		return rc;
+
+	reb->i_size = le64_to_cpu(fake_pi.i_size);
+	reb->i_flags = le32_to_cpu(fake_pi.i_flags);
+	reb->i_uid = le32_to_cpu(fake_pi.i_uid);
+	reb->i_gid = le32_to_cpu(fake_pi.i_gid);
+	reb->i_atime = le32_to_cpu(fake_pi.i_atime);
+	reb->i_ctime = le32_to_cpu(fake_pi.i_ctime);
+	reb->i_mtime = le32_to_cpu(fake_pi.i_mtime);
+	reb->i_generation = le32_to_cpu(fake_pi.i_generation);
+	reb->i_links_count = le16_to_cpu(fake_pi.i_links_count);
+	reb->i_mode = le16_to_cpu(fake_pi.i_mode);
+	reb->trans_id = 0;
+
+	return rc;
+}
+
+static inline void nova_rebuild_file_time_and_size(struct super_block *sb,
+	struct nova_inode_rebuild *reb, u32 mtime, u32 ctime, u64 size)
+{
+	reb->i_mtime = cpu_to_le32(mtime);
+	reb->i_ctime = cpu_to_le32(ctime);
+	reb->i_size = cpu_to_le64(size);
+}
+
+static int nova_rebuild_inode_start(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	struct nova_inode_rebuild *reb, u64 pi_addr)
+{
+	int ret;
+
+	ret = nova_get_head_tail(sb, pi, sih);
+	if (ret)
+		return ret;
+
+	ret = nova_init_inode_rebuild(sb, reb, pi);
+	if (ret)
+		return ret;
+
+	sih->pi_addr = pi_addr;
+
+	nova_dbg_verbose("Log head 0x%llx, tail 0x%llx\n",
+				sih->log_head, sih->log_tail);
+	sih->log_pages = 1;
+
+	return ret;
+}
+
+static int nova_rebuild_inode_finish(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	struct nova_inode_rebuild *reb, u64 curr_p)
+{
+	struct nova_inode *alter_pi;
+	u64 next;
+
+	sih->i_size = le64_to_cpu(reb->i_size);
+	sih->i_mode = le64_to_cpu(reb->i_mode);
+	sih->trans_id = reb->trans_id + 1;
+
+	nova_memunlock_inode(sb, pi);
+	nova_update_inode_with_rebuild(sb, reb, pi);
+	nova_update_inode_checksum(pi);
+	if (metadata_csum) {
+		alter_pi = (struct nova_inode *)nova_get_block(sb,
+							sih->alter_pi_addr);
+		memcpy_to_pmem_nocache(alter_pi, pi, sizeof(struct nova_inode));
+	}
+	nova_memlock_inode(sb, pi);
+
+	/* Keep traversing until log ends */
+	curr_p &= PAGE_MASK;
+	while ((next = next_log_page(sb, curr_p)) > 0) {
+		sih->log_pages++;
+		curr_p = next;
+	}
+
+	if (metadata_csum)
+		sih->log_pages *= 2;
+
+	return 0;
+}
+
+static int nova_reset_csum_parity_page(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	unsigned long pgoff, int zero)
+{
+	nova_dbgv("%s: update page off %lu\n", __func__, pgoff);
+
+	if (data_csum)
+		nova_update_pgoff_csum(sb, sih, entry, pgoff, zero);
+
+	if (data_parity)
+		nova_update_pgoff_parity(sb, sih, entry, pgoff, zero);
+
+	return 0;
+}
+
+int nova_reset_csum_parity_range(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	unsigned long start_pgoff, unsigned long end_pgoff, int zero,
+	int check_entry)
+{
+	struct nova_file_write_entry *curr;
+	unsigned long pgoff;
+
+	if (data_csum == 0 && data_parity == 0)
+		return 0;
+
+	for (pgoff = start_pgoff; pgoff < end_pgoff; pgoff++) {
+		if (entry && check_entry && zero == 0) {
+			curr = nova_get_write_entry(sb, sih, pgoff);
+			if (curr != entry)
+				continue;
+		}
+
+		/* FIXME: For mmap, check dirty? */
+		nova_reset_csum_parity_page(sb, sih, entry, pgoff, zero);
+	}
+
+	return 0;
+}
+
+/* Reset data csum for updating entries */
+static int nova_reset_data_csum_parity(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	struct nova_file_write_entry *entryc)
+{
+	unsigned long end_pgoff;
+
+	if (data_csum == 0 && data_parity == 0)
+		goto out;
+
+	if (entryc->invalid_pages == entryc->num_pages)
+		/* Dead entry */
+		goto out;
+
+	end_pgoff = entryc->pgoff + entryc->num_pages;
+	nova_reset_csum_parity_range(sb, sih, entry, entryc->pgoff,
+			end_pgoff, 0, 1);
+
+out:
+	nova_set_write_entry_updating(sb, entry, 0);
+
+	return 0;
+}
+
+/* Reset data csum for mmap entries */
+static int nova_reset_mmap_csum_parity(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_mmap_entry *entry,
+	struct nova_mmap_entry *entryc)
+{
+	unsigned long end_pgoff;
+	int ret = 0;
+
+	if (data_csum == 0 && data_parity == 0)
+		return 0;
+
+	if (entryc->invalid == 1)
+		/* Dead entry */
+		return 0;
+
+	end_pgoff = entryc->pgoff + entryc->num_pages;
+	nova_reset_csum_parity_range(sb, sih, NULL, entryc->pgoff,
+			end_pgoff, 0, 0);
+
+	ret = nova_invalidate_logentry(sb, entry, MMAP_WRITE, 0);
+
+	return ret;
+}
+
+int nova_reset_mapping_csum_parity(struct super_block *sb,
+	struct inode *inode, struct address_space *mapping,
+	unsigned long start_pgoff, unsigned long end_pgoff)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	pgoff_t indices[PAGEVEC_SIZE];
+	struct pagevec pvec;
+	bool done = false;
+	int count = 0;
+	unsigned long start = 0;
+	timing_t reset_time;
+	int i;
+
+	if (data_csum == 0 && data_parity == 0)
+		return 0;
+
+	NOVA_START_TIMING(reset_mapping_t, reset_time);
+	nova_dbgv("%s: pgoff %lu to %lu\n",
+			__func__, start_pgoff, end_pgoff);
+
+	while (!done) {
+		pvec.nr = find_get_entries_tag(mapping, start_pgoff,
+				PAGECACHE_TAG_DIRTY, PAGEVEC_SIZE,
+				pvec.pages, indices);
+
+		if (pvec.nr == 0)
+			break;
+
+		if (count == 0)
+			start = indices[0];
+
+		for (i = 0; i < pvec.nr; i++) {
+			if (indices[i] >= end_pgoff) {
+				done = true;
+				break;
+			}
+
+			NOVA_STATS_ADD(dirty_pages, 1);
+			nova_reset_csum_parity_page(sb, sih, NULL,
+						indices[i], 0);
+		}
+
+		count += pvec.nr;
+		if (pvec.nr < PAGEVEC_SIZE)
+			break;
+
+		start_pgoff = indices[pvec.nr - 1] + 1;
+	}
+
+	if (count)
+		nova_dbgv("%s: inode %lu, reset %d pages, start pgoff %lu\n",
+				__func__, sih->ino, count, start);
+
+	NOVA_END_TIMING(reset_mapping_t, reset_time);
+	return 0;
+}
+
+int nova_reset_vma_csum_parity(struct super_block *sb,
+	struct vma_item *item)
+{
+	struct vm_area_struct *vma = item->vma;
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct inode *inode = mapping->host;
+	struct nova_mmap_entry *entry;
+	unsigned long num_pages;
+	unsigned long start_index, end_index;
+	timing_t reset_time;
+	int ret = 0;
+
+	if (data_csum == 0 && data_parity == 0)
+		return 0;
+
+	NOVA_START_TIMING(reset_vma_t, reset_time);
+	num_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+	start_index = vma->vm_pgoff;
+	end_index = vma->vm_pgoff + num_pages;
+
+	nova_dbgv("%s: inode %lu, pgoff %lu - %lu\n",
+			__func__, inode->i_ino, start_index, end_index);
+
+	ret = nova_reset_mapping_csum_parity(sb, inode, mapping,
+					start_index, end_index);
+
+	if (item->mmap_entry) {
+		entry = nova_get_block(sb, item->mmap_entry);
+		ret = nova_invalidate_logentry(sb, entry, MMAP_WRITE, 0);
+	}
+
+	NOVA_END_TIMING(reset_vma_t, reset_time);
+	return ret;
+}
+
+static void nova_rebuild_handle_write_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_inode_rebuild *reb,
+	struct nova_file_write_entry *entry,
+	struct nova_file_write_entry *entryc)
+{
+	if (entryc->num_pages != entryc->invalid_pages) {
+		/*
+		 * The overlapped blocks are already freed.
+		 * Don't double free them, just re-assign the pointers.
+		 */
+		nova_assign_write_entry(sb, sih, entry, entryc, false);
+	}
+
+	if (entryc->trans_id >= sih->trans_id) {
+		nova_rebuild_file_time_and_size(sb, reb,
+					entryc->mtime, entryc->mtime,
+					entryc->size);
+		reb->trans_id = entryc->trans_id;
+	}
+
+	if (entryc->updating)
+		nova_reset_data_csum_parity(sb, sih, entry, entryc);
+
+	/* Update sih->i_size for setattr apply operations */
+	sih->i_size = le64_to_cpu(reb->i_size);
+}
+
+static int nova_rebuild_file_inode_tree(struct super_block *sb,
+	struct nova_inode *pi, u64 pi_addr,
+	struct nova_inode_info_header *sih)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_file_write_entry *entry = NULL;
+	struct nova_setattr_logentry *attr_entry = NULL;
+	struct nova_link_change_entry *link_change_entry = NULL;
+	struct nova_mmap_entry *mmap_entry = NULL;
+	char entry_copy[NOVA_MAX_ENTRY_LEN];
+	struct nova_inode_rebuild rebuild, *reb;
+	unsigned int data_bits = blk_type_to_shift[sih->i_blk_type];
+	u64 ino = pi->nova_ino;
+	timing_t rebuild_time;
+	void *addr, *entryc;
+	u64 curr_p;
+	u8 type;
+	int ret;
+
+	NOVA_START_TIMING(rebuild_file_t, rebuild_time);
+	nova_dbg_verbose("Rebuild file inode %llu tree\n", ino);
+
+	reb = &rebuild;
+	ret = nova_rebuild_inode_start(sb, pi, sih, reb, pi_addr);
+	if (ret)
+		goto out;
+
+	curr_p = sih->log_head;
+	if (curr_p == 0 && sih->log_tail == 0)
+		goto out;
+
+	entryc = (metadata_csum == 0) ? NULL : entry_copy;
+
+//	nova_print_nova_log(sb, sih);
+
+	while (curr_p != sih->log_tail) {
+		if (goto_next_page(sb, curr_p)) {
+			sih->log_pages++;
+			curr_p = next_log_page(sb, curr_p);
+		}
+
+		if (curr_p == 0) {
+			nova_err(sb, "File inode %llu log is NULL!\n", ino);
+			BUG();
+		}
+
+		addr = (void *)nova_get_block(sb, curr_p);
+
+		if (metadata_csum == 0)
+			entryc = addr;
+		else if (!nova_verify_entry_csum(sb, addr, entryc))
+			return 0;
+
+		type = nova_get_entry_type(entryc);
+
+		if (sbi->mount_snapshot) {
+			if (nova_encounter_mount_snapshot(sb, addr, type))
+				break;
+		}
+
+		switch (type) {
+		case SET_ATTR:
+			attr_entry = (struct nova_setattr_logentry *)entryc;
+			nova_apply_setattr_entry(sb, reb, sih, attr_entry);
+			sih->last_setattr = curr_p;
+			if (attr_entry->trans_id >= reb->trans_id) {
+				nova_rebuild_file_time_and_size(sb, reb,
+							attr_entry->mtime,
+							attr_entry->ctime,
+							attr_entry->size);
+				reb->trans_id = attr_entry->trans_id;
+			}
+
+			/* Update sih->i_size for setattr operation */
+			sih->i_size = le64_to_cpu(reb->i_size);
+			curr_p += sizeof(struct nova_setattr_logentry);
+			break;
+		case LINK_CHANGE:
+			link_change_entry =
+				(struct nova_link_change_entry *)entryc;
+			nova_apply_link_change_entry(sb, reb,
+						link_change_entry);
+			sih->last_link_change = curr_p;
+			curr_p += sizeof(struct nova_link_change_entry);
+			break;
+		case FILE_WRITE:
+			entry = (struct nova_file_write_entry *)addr;
+			nova_rebuild_handle_write_entry(sb, sih, reb,
+					entry, WENTRY(entryc));
+			curr_p += sizeof(struct nova_file_write_entry);
+			break;
+		case MMAP_WRITE:
+			mmap_entry = (struct nova_mmap_entry *)addr;
+			nova_reset_mmap_csum_parity(sb, sih,
+					mmap_entry, MMENTRY(entryc));
+			curr_p += sizeof(struct nova_mmap_entry);
+			break;
+		default:
+			nova_err(sb, "unknown type %d, 0x%llx\n", type, curr_p);
+			NOVA_ASSERT(0);
+			curr_p += sizeof(struct nova_file_write_entry);
+			break;
+		}
+
+	}
+
+	ret = nova_rebuild_inode_finish(sb, pi, sih, reb, curr_p);
+	sih->i_blocks = sih->log_pages + (sih->i_size >> data_bits);
+
+out:
+//	nova_print_inode_log_page(sb, inode);
+	NOVA_END_TIMING(rebuild_file_t, rebuild_time);
+	return ret;
+}
+
+/******************* Directory rebuild *********************/
+
+static inline void nova_rebuild_dir_time_and_size(struct super_block *sb,
+	struct nova_inode_rebuild *reb, struct nova_dentry *entry,
+	struct nova_dentry *entryc)
+{
+	if (!entry || !reb)
+		return;
+
+	reb->i_ctime = entryc->mtime;
+	reb->i_mtime = entryc->mtime;
+	reb->i_links_count = entryc->links_count;
+	//reb->i_size = entryc->size;
+}
+
+static void nova_reassign_last_dentry(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 curr_p)
+{
+	struct nova_dentry *dentry, *old_dentry;
+
+	if (sih->last_dentry == 0) {
+		sih->last_dentry = curr_p;
+	} else {
+		old_dentry = (struct nova_dentry *)nova_get_block(sb,
+							sih->last_dentry);
+		dentry = (struct nova_dentry *)nova_get_block(sb, curr_p);
+		if (dentry->trans_id >= old_dentry->trans_id)
+			sih->last_dentry = curr_p;
+	}
+}
+
+static inline int nova_replay_add_dentry(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_dentry *entry,
+	struct nova_dentry *entryc)
+{
+	if (!entryc->name_len)
+		return -EINVAL;
+
+	nova_dbg_verbose("%s: add %s\n", __func__, entry->name);
+	return nova_insert_dir_radix_tree(sb, sih,
+			entryc->name, entryc->name_len, entry);
+}
+
+/* entry given to this function is a copy in dram */
+static inline int nova_replay_remove_dentry(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_dentry *entry)
+{
+	nova_dbg_verbose("%s: remove %s\n", __func__, entry->name);
+	nova_remove_dir_radix_tree(sb, sih, entry->name,
+					entry->name_len, 1, NULL);
+	return 0;
+}
+
+static int nova_rebuild_handle_dentry(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_inode_rebuild *reb,
+	struct nova_dentry *entry, struct nova_dentry *entryc, u64 curr_p)
+{
+	int ret = 0;
+
+	nova_dbgv("curr_p: 0x%llx, type %d, ino %llu, name %s, namelen %u, csum 0x%x, rec len %u\n",
+			curr_p,
+			entry->entry_type, le64_to_cpu(entry->ino),
+			entry->name, entry->name_len, entry->csum,
+			le16_to_cpu(entry->de_len));
+
+	nova_reassign_last_dentry(sb, sih, curr_p);
+
+	if (entryc->invalid == 0) {
+		if (entryc->ino > 0)
+			ret = nova_replay_add_dentry(sb, sih, entry, entryc);
+		else
+			ret = nova_replay_remove_dentry(sb, sih, entryc);
+	}
+
+	if (ret) {
+		nova_err(sb, "%s ERROR %d\n", __func__, ret);
+		return ret;
+	}
+
+	if (entryc->trans_id >= reb->trans_id) {
+		nova_rebuild_dir_time_and_size(sb, reb, entry, entryc);
+		reb->trans_id = entryc->trans_id;
+	}
+
+	return ret;
+}
+
+int nova_rebuild_dir_inode_tree(struct super_block *sb,
+	struct nova_inode *pi, u64 pi_addr,
+	struct nova_inode_info_header *sih)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_dentry *entry = NULL;
+	struct nova_setattr_logentry *attr_entry = NULL;
+	struct nova_link_change_entry *lc_entry = NULL;
+	char entry_copy[NOVA_MAX_ENTRY_LEN];
+	struct nova_inode_rebuild rebuild, *reb;
+	u64 ino = pi->nova_ino;
+	unsigned short de_len;
+	timing_t rebuild_time;
+	void *addr, *entryc;
+	u64 curr_p;
+	u8 type;
+	int ret;
+
+	NOVA_START_TIMING(rebuild_dir_t, rebuild_time);
+	nova_dbgv("Rebuild dir %llu tree\n", ino);
+
+	reb = &rebuild;
+	ret = nova_rebuild_inode_start(sb, pi, sih, reb, pi_addr);
+	if (ret)
+		goto out;
+
+	curr_p = sih->log_head;
+	if (curr_p == 0) {
+		nova_err(sb, "Dir %llu log is NULL!\n", ino);
+		BUG();
+		goto out;
+	}
+
+	entryc = (metadata_csum == 0) ? NULL : entry_copy;
+
+	while (curr_p != sih->log_tail) {
+		if (goto_next_page(sb, curr_p)) {
+			sih->log_pages++;
+			curr_p = next_log_page(sb, curr_p);
+		}
+
+		if (curr_p == 0) {
+			nova_err(sb, "Dir %llu log is NULL!\n", ino);
+			BUG();
+		}
+
+		addr = (void *)nova_get_block(sb, curr_p);
+
+		if (metadata_csum == 0)
+			entryc = addr;
+		else if (!nova_verify_entry_csum(sb, addr, entryc))
+			return 0;
+
+		type = nova_get_entry_type(entryc);
+
+		if (sbi->mount_snapshot) {
+			if (nova_encounter_mount_snapshot(sb, addr, type))
+				break;
+		}
+
+		switch (type) {
+		case SET_ATTR:
+			attr_entry = (struct nova_setattr_logentry *)entryc;
+			nova_apply_setattr_entry(sb, reb, sih, attr_entry);
+			sih->last_setattr = curr_p;
+			curr_p += sizeof(struct nova_setattr_logentry);
+			break;
+		case LINK_CHANGE:
+			lc_entry = (struct nova_link_change_entry *)entryc;
+			if (lc_entry->trans_id >= reb->trans_id) {
+				nova_apply_link_change_entry(sb, reb, lc_entry);
+				reb->trans_id = lc_entry->trans_id;
+			}
+			sih->last_link_change = curr_p;
+			curr_p += sizeof(struct nova_link_change_entry);
+			break;
+		case DIR_LOG:
+			entry = (struct nova_dentry *)addr;
+			ret = nova_rebuild_handle_dentry(sb, sih, reb,
+					entry, DENTRY(entryc), curr_p);
+			if (ret)
+				goto out;
+			de_len = le16_to_cpu(DENTRY(entryc)->de_len);
+			curr_p += de_len;
+			break;
+		default:
+			nova_dbg("%s: unknown type %d, 0x%llx\n",
+					__func__, type, curr_p);
+			NOVA_ASSERT(0);
+			break;
+		}
+	}
+
+	ret = nova_rebuild_inode_finish(sb, pi, sih, reb, curr_p);
+	sih->i_blocks = sih->log_pages;
+
+out:
+//	nova_print_dir_tree(sb, sih, ino);
+	NOVA_END_TIMING(rebuild_dir_t, rebuild_time);
+	return ret;
+}
+
+/* initialize nova inode header and other DRAM data structures */
+int nova_rebuild_inode(struct super_block *sb, struct nova_inode_info *si,
+	u64 ino, u64 pi_addr, int rebuild_dir)
+{
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *pi;
+	struct nova_inode inode_copy;
+	u64 alter_pi_addr = 0;
+	int ret;
+
+	if (metadata_csum) {
+		/* Get alternate inode address */
+		ret = nova_get_alter_inode_address(sb, ino, &alter_pi_addr);
+		if (ret)  {
+			nova_dbg("%s: failed alt ino addr for inode %llu\n",
+				 __func__, ino);
+			return ret;
+		}
+	}
+
+	ret = nova_check_inode_integrity(sb, ino, pi_addr, alter_pi_addr,
+					 &inode_copy, 1);
+
+	if (ret)
+		return ret;
+
+	pi = (struct nova_inode *)nova_get_block(sb, pi_addr);
+	// We need this to be valid in case we need to evict the inode.
+	sih->pi_addr = pi_addr;
+
+	if (pi->deleted == 1) {
+		nova_dbg("%s: inode %llu has been deleted.\n", __func__, ino);
+		return -EINVAL;
+	}
+
+	nova_dbgv("%s: inode %llu, addr 0x%llx, valid %d, head 0x%llx, tail 0x%llx\n",
+			__func__, ino, pi_addr, pi->valid,
+			pi->log_head, pi->log_tail);
+
+	nova_init_header(sb, sih, __le16_to_cpu(pi->i_mode));
+	sih->ino = ino;
+	sih->alter_pi_addr = alter_pi_addr;
+
+	switch (__le16_to_cpu(pi->i_mode) & S_IFMT) {
+	case S_IFLNK:
+		/* Treat symlink files as normal files */
+		/* Fall through */
+	case S_IFREG:
+		nova_rebuild_file_inode_tree(sb, pi, pi_addr, sih);
+		break;
+	case S_IFDIR:
+		if (rebuild_dir)
+			nova_rebuild_dir_inode_tree(sb, pi, pi_addr, sih);
+		break;
+	default:
+		/* In case of special inode, walk the log */
+		if (pi->log_head)
+			nova_rebuild_file_inode_tree(sb, pi, pi_addr, sih);
+		sih->pi_addr = pi_addr;
+		break;
+	}
+
+	return 0;
+}
+
+
+/******************* Snapshot log rebuild *********************/
+
+/* For power failure recovery, just initialize the infos */
+int nova_restore_snapshot_table(struct super_block *sb, int just_init)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_snapshot_info_entry *entry = NULL;
+	struct nova_inode *pi;
+	struct nova_inode_info_header *sih;
+	struct nova_inode_rebuild rebuild, *reb;
+	unsigned int data_bits;
+	char entry_copy[NOVA_MAX_ENTRY_LEN];
+	size_t size = sizeof(struct nova_snapshot_info_entry);
+	u64 ino = NOVA_SNAPSHOT_INO;
+	timing_t rebuild_time;
+	int count = 0;
+	void *addr, *entryc;
+	u64 curr_p;
+	u8 type;
+	int ret;
+
+	NOVA_START_TIMING(rebuild_snapshot_t, rebuild_time);
+	nova_dbg_verbose("Rebuild snapshot table\n");
+
+	entryc = (metadata_csum == 0) ? NULL : entry_copy;
+
+	pi = nova_get_reserved_inode(sb, ino);
+	sih = &sbi->snapshot_si->header;
+	data_bits = blk_type_to_shift[sih->i_blk_type];
+	reb = &rebuild;
+	ret = nova_rebuild_inode_start(sb, pi, sih, reb, sih->pi_addr);
+	if (ret)
+		goto out;
+
+	curr_p = sih->log_head;
+	if (curr_p == 0 && sih->log_tail == 0)
+		goto out;
+
+//	nova_print_nova_log(sb, sih);
+
+	while (curr_p != sih->log_tail) {
+		if (goto_next_page(sb, curr_p)) {
+			sih->log_pages++;
+			curr_p = next_log_page(sb, curr_p);
+		}
+
+		if (curr_p == 0) {
+			nova_err(sb, "File inode %llu log is NULL!\n", ino);
+			BUG();
+		}
+
+		addr = (void *)nova_get_block(sb, curr_p);
+
+		if (metadata_csum == 0)
+			entryc = addr;
+		else if (!nova_verify_entry_csum(sb, addr, entryc))
+			return 0;
+
+		type = nova_get_entry_type(entryc);
+
+		switch (type) {
+		case SNAPSHOT_INFO:
+			entry = (struct nova_snapshot_info_entry *)addr;
+			ret = nova_restore_snapshot_entry(sb, entry,
+						curr_p, just_init);
+			if (ret) {
+				nova_err(sb, "Restore entry %llu failed\n",
+					entry->epoch_id);
+				goto out;
+			}
+			if (SNENTRY(entryc)->deleted == 0)
+				count++;
+			curr_p += size;
+			break;
+		default:
+			nova_err(sb, "unknown type %d, 0x%llx\n", type, curr_p);
+			NOVA_ASSERT(0);
+			curr_p += size;
+			break;
+		}
+
+	}
+
+	ret = nova_rebuild_inode_finish(sb, pi, sih, reb, curr_p);
+	sih->i_blocks = sih->log_pages + (sih->i_size >> data_bits);
+
+out:
+//	nova_print_inode_log_page(sb, inode);
+	NOVA_END_TIMING(rebuild_snapshot_t, rebuild_time);
+
+	nova_dbg("Recovered %d snapshots, latest epoch ID %llu\n",
+			count, sbi->s_epoch_id);
+
+	return ret;
+}


* [RFC 13/16] NOVA: Sysfs and ioctl
  2017-08-03  7:48 ` Steven Swanson
@ 2017-08-03  7:49   ` Steven Swanson
  -1 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:49 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson

NOVA provides the standard ioctls for getting and setting file attributes
(inode flags and generation number), a few NOVA-specific debugging ioctls,
and a /proc-based interface for taking snapshots.

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
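The FS_IOC_GETFLAGS and FS_IOC_SETFLAGS handlers added here can be exercised
from user space through the generic inode-flag ioctls. The following is a
minimal sketch and not part of the patch: the helper names (get_inode_flags,
with_flag, set_noatime) are hypothetical, and it assumes the target file lives
on a file system that implements these ioctls, such as a mounted NOVA
instance. The flag word is an int, matching the `int __user *` casts in
nova_ioctl().

```c
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: read the inode flags of `path` with FS_IOC_GETFLAGS.
 * Returns 0 and stores the flags in *flags on success, -1 on any error.
 */
static int get_inode_flags(const char *path, int *flags)
{
	int fd = open(path, O_RDONLY);
	int ret;

	if (fd < 0)
		return -1;

	ret = ioctl(fd, FS_IOC_GETFLAGS, flags);
	close(fd);
	return ret < 0 ? -1 : 0;
}

/* Hypothetical helper: return `flags` with `bit` set or cleared. */
static int with_flag(int flags, int bit, int enable)
{
	return enable ? (flags | bit) : (flags & ~bit);
}

/*
 * Hypothetical helper: read-modify-write of FS_NOATIME_FL.
 * Reading the current flags first keeps all other bits unchanged.
 */
static int set_noatime(const char *path, int enable)
{
	int fd = open(path, O_RDONLY);
	int flags = 0;
	int ret = -1;

	if (fd < 0)
		return -1;

	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
		flags = with_flag(flags, FS_NOATIME_FL, enable);
		if (ioctl(fd, FS_IOC_SETFLAGS, &flags) == 0)
			ret = 0;
	}

	close(fd);
	return ret;
}
```

A caller might use `set_noatime("/mnt/nova/somefile", 1)`, where /mnt/nova is
an assumed mount point. Per the handler in this patch, toggling FS_APPEND_FL or
FS_IMMUTABLE_FL additionally requires CAP_LINUX_IMMUTABLE, and only the bits in
FS_FL_USER_MODIFIABLE are taken from the caller's value.
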
 fs/nova/ioctl.c |  185 +++++++++++++++++++
 fs/nova/sysfs.c |  543 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 728 insertions(+)
 create mode 100644 fs/nova/ioctl.c
 create mode 100644 fs/nova/sysfs.c

diff --git a/fs/nova/ioctl.c b/fs/nova/ioctl.c
new file mode 100644
index 000000000000..163e24ccd2ca
--- /dev/null
+++ b/fs/nova/ioctl.c
@@ -0,0 +1,185 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Ioctl operations.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2010-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/capability.h>
+#include <linux/time.h>
+#include <linux/sched.h>
+#include <linux/compat.h>
+#include <linux/mount.h>
+#include "nova.h"
+#include "inode.h"
+
+long nova_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+{
+	struct address_space *mapping = filp->f_mapping;
+	struct inode    *inode = mapping->host;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *pi;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode_update update;
+	unsigned int flags;
+	int ret;
+
+	pi = nova_get_inode(sb, inode);
+	if (!pi)
+		return -EACCES;
+
+	switch (cmd) {
+	case FS_IOC_GETFLAGS:
+		flags = le32_to_cpu(pi->i_flags) & NOVA_FL_USER_VISIBLE;
+		return put_user(flags, (int __user *)arg);
+	case FS_IOC_SETFLAGS: {
+		unsigned int oldflags;
+		u64 old_linkc = 0;
+		u64 epoch_id;
+
+		ret = mnt_want_write_file(filp);
+		if (ret)
+			return ret;
+
+		if (!inode_owner_or_capable(inode)) {
+			ret = -EPERM;
+			goto flags_out;
+		}
+
+		if (get_user(flags, (int __user *)arg)) {
+			ret = -EFAULT;
+			goto flags_out;
+		}
+
+		inode_lock(inode);
+		oldflags = le32_to_cpu(pi->i_flags);
+
+		if ((flags ^ oldflags) &
+		    (FS_APPEND_FL | FS_IMMUTABLE_FL)) {
+			if (!capable(CAP_LINUX_IMMUTABLE)) {
+				/* flags_out_unlock drops the inode lock */
+				ret = -EPERM;
+				goto flags_out_unlock;
+			}
+		}
+
+		if (!S_ISDIR(inode->i_mode))
+			flags &= ~FS_DIRSYNC_FL;
+
+		epoch_id = nova_get_epoch_id(sb);
+		flags = flags & FS_FL_USER_MODIFIABLE;
+		flags |= oldflags & ~FS_FL_USER_MODIFIABLE;
+		inode->i_ctime = current_time(inode);
+		nova_set_inode_flags(inode, pi, flags);
+
+		update.tail = 0;
+		update.alter_tail = 0;
+		ret = nova_append_link_change_entry(sb, pi, inode,
+					&update, &old_linkc, epoch_id);
+		if (!ret) {
+			nova_memunlock_inode(sb, pi);
+			nova_update_inode(sb, inode, pi, &update, 1);
+			nova_memlock_inode(sb, pi);
+			nova_invalidate_link_change_entry(sb, old_linkc);
+		}
+		sih->trans_id++;
+flags_out_unlock:
+		inode_unlock(inode);
+flags_out:
+		mnt_drop_write_file(filp);
+		return ret;
+	}
+	case FS_IOC_GETVERSION:
+		return put_user(inode->i_generation, (int __user *)arg);
+	case FS_IOC_SETVERSION: {
+		u64 old_linkc = 0;
+		u64 epoch_id;
+		__u32 generation;
+
+		if (!inode_owner_or_capable(inode))
+			return -EPERM;
+		ret = mnt_want_write_file(filp);
+		if (ret)
+			return ret;
+		if (get_user(generation, (int __user *)arg)) {
+			ret = -EFAULT;
+			goto setversion_out;
+		}
+
+		epoch_id = nova_get_epoch_id(sb);
+		inode_lock(inode);
+		inode->i_ctime = current_time(inode);
+		inode->i_generation = generation;
+
+		update.tail = 0;
+		update.alter_tail = 0;
+		ret = nova_append_link_change_entry(sb, pi, inode,
+					&update, &old_linkc, epoch_id);
+		if (!ret) {
+			nova_memunlock_inode(sb, pi);
+			nova_update_inode(sb, inode, pi, &update, 1);
+			nova_memlock_inode(sb, pi);
+			nova_invalidate_link_change_entry(sb, old_linkc);
+		}
+		sih->trans_id++;
+		inode_unlock(inode);
+setversion_out:
+		mnt_drop_write_file(filp);
+		return ret;
+	}
+	case NOVA_PRINT_TIMING: {
+		nova_print_timing_stats(sb);
+		return 0;
+	}
+	case NOVA_CLEAR_STATS: {
+		nova_clear_stats(sb);
+		return 0;
+	}
+	case NOVA_PRINT_LOG: {
+		nova_print_inode_log(sb, inode);
+		return 0;
+	}
+	case NOVA_PRINT_LOG_PAGES: {
+		nova_print_inode_log_pages(sb, inode);
+		return 0;
+	}
+	case NOVA_PRINT_FREE_LISTS: {
+		nova_print_free_lists(sb);
+		return 0;
+	}
+	default:
+		return -ENOTTY;
+	}
+}
+
+#ifdef CONFIG_COMPAT
+long nova_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	switch (cmd) {
+	case FS_IOC32_GETFLAGS:
+		cmd = FS_IOC_GETFLAGS;
+		break;
+	case FS_IOC32_SETFLAGS:
+		cmd = FS_IOC_SETFLAGS;
+		break;
+	case FS_IOC32_GETVERSION:
+		cmd = FS_IOC_GETVERSION;
+		break;
+	case FS_IOC32_SETVERSION:
+		cmd = FS_IOC_SETVERSION;
+		break;
+	default:
+		return -ENOIOCTLCMD;
+	}
+	return nova_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
+}
+#endif
diff --git a/fs/nova/sysfs.c b/fs/nova/sysfs.c
new file mode 100644
index 000000000000..38749bb8b14b
--- /dev/null
+++ b/fs/nova/sysfs.c
@@ -0,0 +1,543 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Proc fs operations
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include "nova.h"
+#include "inode.h"
+
+const char *proc_dirname = "fs/NOVA";
+struct proc_dir_entry *nova_proc_root;
+
+/* ====================== Statistics ======================== */
+static int nova_seq_timing_show(struct seq_file *seq, void *v)
+{
+	int i;
+
+	nova_get_timing_stats();
+
+	seq_puts(seq, "=========== NOVA kernel timing stats ===========\n");
+	for (i = 0; i < TIMING_NUM; i++) {
+		/* Title */
+		if (Timingstring[i][0] == '=') {
+			seq_printf(seq, "\n%s\n\n", Timingstring[i]);
+			continue;
+		}
+
+		if (measure_timing || Timingstats[i]) {
+			seq_printf(seq, "%s: count %llu, timing %llu, average %llu\n",
+				Timingstring[i],
+				Countstats[i],
+				Timingstats[i],
+				Countstats[i] ?
+				Timingstats[i] / Countstats[i] : 0);
+		} else {
+			seq_printf(seq, "%s: count %llu\n",
+				Timingstring[i],
+				Countstats[i]);
+		}
+	}
+
+	seq_puts(seq, "\n");
+	return 0;
+}
+
+static int nova_seq_timing_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, nova_seq_timing_show, PDE_DATA(inode));
+}
+
+ssize_t nova_seq_clear_stats(struct file *filp, const char __user *buf,
+	size_t len, loff_t *ppos)
+{
+	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = mapping->host;
+	struct super_block *sb = PDE_DATA(inode);
+
+	nova_clear_stats(sb);
+	return len;
+}
+
+static const struct file_operations nova_seq_timing_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nova_seq_timing_open,
+	.read		= seq_read,
+	.write		= nova_seq_clear_stats,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int nova_seq_IO_show(struct seq_file *seq, void *v)
+{
+	struct super_block *sb = seq->private;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	unsigned long alloc_log_count = 0;
+	unsigned long alloc_log_pages = 0;
+	unsigned long alloc_data_count = 0;
+	unsigned long alloc_data_pages = 0;
+	unsigned long free_log_count = 0;
+	unsigned long freed_log_pages = 0;
+	unsigned long free_data_count = 0;
+	unsigned long freed_data_pages = 0;
+	int i;
+
+	nova_get_timing_stats();
+	nova_get_IO_stats();
+
+	seq_puts(seq, "============ NOVA allocation stats ============\n\n");
+
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+
+		alloc_log_count += free_list->alloc_log_count;
+		alloc_log_pages += free_list->alloc_log_pages;
+		alloc_data_count += free_list->alloc_data_count;
+		alloc_data_pages += free_list->alloc_data_pages;
+		free_log_count += free_list->free_log_count;
+		freed_log_pages += free_list->freed_log_pages;
+		free_data_count += free_list->free_data_count;
+		freed_data_pages += free_list->freed_data_pages;
+	}
+
+	seq_printf(seq, "alloc log count %lu, allocated log pages %lu\n"
+		"alloc data count %lu, allocated data pages %lu\n"
+		"free log count %lu, freed log pages %lu\n"
+		"free data count %lu, freed data pages %lu\n",
+		alloc_log_count, alloc_log_pages,
+		alloc_data_count, alloc_data_pages,
+		free_log_count, freed_log_pages,
+		free_data_count, freed_data_pages);
+
+	seq_printf(seq, "Fast GC %llu, checked pages %llu, free pages %llu, average %llu\n",
+		Countstats[fast_gc_t], IOstats[fast_checked_pages],
+		IOstats[fast_gc_pages], Countstats[fast_gc_t] ?
+			IOstats[fast_gc_pages] / Countstats[fast_gc_t] : 0);
+	seq_printf(seq, "Thorough GC %llu, checked pages %llu, free pages %llu, average %llu\n",
+		Countstats[thorough_gc_t],
+		IOstats[thorough_checked_pages], IOstats[thorough_gc_pages],
+		Countstats[thorough_gc_t] ?
+			IOstats[thorough_gc_pages] / Countstats[thorough_gc_t]
+			: 0);
+
+	seq_puts(seq, "\n");
+
+	seq_puts(seq, "================ NOVA I/O stats ================\n\n");
+	seq_printf(seq, "Read %llu, bytes %llu, average %llu\n",
+		Countstats[dax_read_t], IOstats[read_bytes],
+		Countstats[dax_read_t] ?
+			IOstats[read_bytes] / Countstats[dax_read_t] : 0);
+	seq_printf(seq, "COW write %llu, bytes %llu, average %llu, write breaks %llu, average %llu\n",
+		Countstats[cow_write_t], IOstats[cow_write_bytes],
+		Countstats[cow_write_t] ?
+			IOstats[cow_write_bytes] / Countstats[cow_write_t] : 0,
+		IOstats[cow_write_breaks], Countstats[cow_write_t] ?
+			IOstats[cow_write_breaks] / Countstats[cow_write_t]
+			: 0);
+	seq_printf(seq, "Inplace write %llu, bytes %llu, average %llu, write breaks %llu, average %llu\n",
+		Countstats[inplace_write_t], IOstats[inplace_write_bytes],
+		Countstats[inplace_write_t] ?
+			IOstats[inplace_write_bytes] /
+			Countstats[inplace_write_t] : 0,
+		IOstats[inplace_write_breaks], Countstats[inplace_write_t] ?
+			IOstats[inplace_write_breaks] /
+			Countstats[inplace_write_t] : 0);
+	seq_printf(seq, "Inplace write %llu, allocate new blocks %llu\n",
+			Countstats[inplace_write_t],
+			IOstats[inplace_new_blocks]);
+	seq_printf(seq, "DAX get blocks %llu, allocate new blocks %llu\n",
+			Countstats[dax_get_block_t], IOstats[dax_new_blocks]);
+	seq_printf(seq, "Dirty pages %llu\n", IOstats[dirty_pages]);
+	seq_printf(seq, "Protect head %llu, tail %llu\n",
+			IOstats[protect_head], IOstats[protect_tail]);
+	seq_printf(seq, "Block csum parity %llu\n", IOstats[block_csum_parity]);
+	seq_printf(seq, "Page fault %llu, dax cow fault %llu, dax cow fault during snapshot creation %llu\n"
+			"CoW write overlap mmap range %llu, mapping/pfn updated pages %llu\n",
+			Countstats[mmap_fault_t], Countstats[mmap_cow_t],
+			IOstats[dax_cow_during_snapshot],
+			IOstats[cow_overlap_mmap],
+			IOstats[mapping_updated_pages]);
+	seq_printf(seq, "fsync %llu, fdatasync %llu\n",
+			Countstats[fsync_t], IOstats[fdatasync]);
+
+	seq_puts(seq, "\n");
+
+	nova_print_snapshot_lists(sb, seq);
+	seq_puts(seq, "\n");
+
+	return 0;
+}
+
+static int nova_seq_IO_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, nova_seq_IO_show, PDE_DATA(inode));
+}
+
+static const struct file_operations nova_seq_IO_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nova_seq_IO_open,
+	.read		= seq_read,
+	.write		= nova_seq_clear_stats,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int nova_seq_show_allocator(struct seq_file *seq, void *v)
+{
+	struct super_block *sb = seq->private;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	int i;
+	unsigned long log_pages = 0;
+	unsigned long data_pages = 0;
+
+	seq_puts(seq, "======== NOVA per-CPU allocator stats ========\n");
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+		seq_printf(seq, "Free list %d: block start %lu, block end %lu, num_blocks %lu, num_free_blocks %lu, blocknode %lu\n",
+			i, free_list->block_start, free_list->block_end,
+			free_list->block_end - free_list->block_start + 1,
+			free_list->num_free_blocks, free_list->num_blocknode);
+
+		if (free_list->first_node) {
+			seq_printf(seq, "First node %lu - %lu\n",
+					free_list->first_node->range_low,
+					free_list->first_node->range_high);
+		}
+
+		if (free_list->last_node) {
+			seq_printf(seq, "Last node %lu - %lu\n",
+					free_list->last_node->range_low,
+					free_list->last_node->range_high);
+		}
+
+		seq_printf(seq, "Free list %d: csum start %lu, replica csum start %lu, csum blocks %lu, parity start %lu, parity blocks %lu\n",
+			i, free_list->csum_start, free_list->replica_csum_start,
+			free_list->num_csum_blocks,
+			free_list->parity_start, free_list->num_parity_blocks);
+
+		seq_printf(seq, "Free list %d: alloc log count %lu, allocated log pages %lu, alloc data count %lu, allocated data pages %lu, free log count %lu, freed log pages %lu, free data count %lu, freed data pages %lu\n",
+			   i,
+			   free_list->alloc_log_count,
+			   free_list->alloc_log_pages,
+			   free_list->alloc_data_count,
+			   free_list->alloc_data_pages,
+			   free_list->free_log_count,
+			   free_list->freed_log_pages,
+			   free_list->free_data_count,
+			   free_list->freed_data_pages);
+
+		log_pages += free_list->alloc_log_pages;
+		log_pages -= free_list->freed_log_pages;
+
+		data_pages += free_list->alloc_data_pages;
+		data_pages -= free_list->freed_data_pages;
+	}
+
+	seq_printf(seq, "\nCurrently used pmem pages: log %lu, data %lu\n",
+			log_pages, data_pages);
+
+	return 0;
+}
+
+static int nova_seq_allocator_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, nova_seq_show_allocator,
+				PDE_DATA(inode));
+}
+
+static const struct file_operations nova_seq_allocator_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nova_seq_allocator_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+/* ====================== Snapshot ======================== */
+static int nova_seq_create_snapshot_show(struct seq_file *seq, void *v)
+{
+	seq_puts(seq, "Write to create a snapshot\n");
+	return 0;
+}
+
+static int nova_seq_create_snapshot_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, nova_seq_create_snapshot_show,
+				PDE_DATA(inode));
+}
+
+ssize_t nova_seq_create_snapshot(struct file *filp, const char __user *buf,
+	size_t len, loff_t *ppos)
+{
+	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = mapping->host;
+	struct super_block *sb = PDE_DATA(inode);
+
+	nova_create_snapshot(sb);
+	return len;
+}
+
+static const struct file_operations nova_seq_create_snapshot_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nova_seq_create_snapshot_open,
+	.read		= seq_read,
+	.write		= nova_seq_create_snapshot,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int nova_seq_delete_snapshot_show(struct seq_file *seq, void *v)
+{
+	seq_puts(seq, "Echo index to delete a snapshot\n");
+	return 0;
+}
+
+static int nova_seq_delete_snapshot_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, nova_seq_delete_snapshot_show,
+				PDE_DATA(inode));
+}
+
+ssize_t nova_seq_delete_snapshot(struct file *filp, const char __user *buf,
+	size_t len, loff_t *ppos)
+{
+	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = mapping->host;
+	struct super_block *sb = PDE_DATA(inode);
+	u64 epoch_id;
+	int ret;
+
+	ret = kstrtoull_from_user(buf, len, 10, &epoch_id);
+	if (ret < 0)
+		nova_warn("Couldn't parse snapshot id\n");
+	else
+		nova_delete_snapshot(sb, epoch_id);
+
+	return len;
+}
+
+static const struct file_operations nova_seq_delete_snapshot_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nova_seq_delete_snapshot_open,
+	.read		= seq_read,
+	.write		= nova_seq_delete_snapshot,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int nova_seq_show_snapshots(struct seq_file *seq, void *v)
+{
+	struct super_block *sb = seq->private;
+
+	nova_print_snapshots(sb, seq);
+	return 0;
+}
+
+static int nova_seq_show_snapshots_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, nova_seq_show_snapshots,
+				PDE_DATA(inode));
+}
+
+static const struct file_operations nova_seq_show_snapshots_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nova_seq_show_snapshots_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+/* ====================== Performance ======================== */
+static int nova_seq_test_perf_show(struct seq_file *seq, void *v)
+{
+	seq_printf(seq, "Echo function:poolmb:size:disks to test function performance working on size of data.\n"
+			"    example: echo 1:128:4096:8 > /proc/fs/NOVA/pmem0/test_perf\n"
+			"The disks value only matters for raid functions.\n"
+			"Set function to 0 to test all functions.\n");
+	return 0;
+}
+
+static int nova_seq_test_perf_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, nova_seq_test_perf_show, PDE_DATA(inode));
+}
+
+ssize_t nova_seq_test_perf(struct file *filp, const char __user *buf,
+	size_t len, loff_t *ppos)
+{
+	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = mapping->host;
+	struct super_block *sb = PDE_DATA(inode);
+	size_t size;
+	unsigned int func_id, poolmb, disks;
+
+	if (sscanf(buf, "%u:%u:%zu:%u", &func_id, &poolmb, &size, &disks) == 4)
+		nova_test_perf(sb, func_id, poolmb, size, disks);
+	else
+		nova_warn("Couldn't parse test_perf request: %s", buf);
+
+	return len;
+}
+
+static const struct file_operations nova_seq_test_perf_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nova_seq_test_perf_open,
+	.read		= seq_read,
+	.write		= nova_seq_test_perf,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+
+/* ====================== GC ======================== */
+
+
+static int nova_seq_gc_show(struct seq_file *seq, void *v)
+{
+	seq_printf(seq, "Echo inode number to trigger garbage collection\n"
+		   "    example: echo 34 > /proc/fs/NOVA/pmem0/gc\n");
+	return 0;
+}
+
+static int nova_seq_gc_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, nova_seq_gc_show, PDE_DATA(inode));
+}
+
+ssize_t nova_seq_gc(struct file *filp, const char __user *buf,
+	size_t len, loff_t *ppos)
+{
+	u64 target_inode_number;
+	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = mapping->host;
+	struct super_block *sb = PDE_DATA(inode);
+	struct inode *target_inode;
+	struct nova_inode *target_pi;
+	struct nova_inode_info *target_sih;
+
+	int ret;
+	char *_buf;
+	int retval = len;
+
+	_buf = kmalloc(len + 1, GFP_KERNEL);
+	if (_buf == NULL)  {
+		retval = -ENOMEM;
+		nova_dbg("%s: kmalloc failed\n", __func__);
+		goto out;
+	}
+
+	if (copy_from_user(_buf, buf, len)) {
+		retval = -EFAULT;
+		goto out;
+	}
+	_buf[len] = 0;
+	ret = kstrtoull(_buf, 0, &target_inode_number);
+	if (ret) {
+		nova_info("%s: Could not parse ino '%s'\n", __func__, _buf);
+		retval = ret;
+		goto out;
+	}
+	nova_info("%s: target_inode_number=%llu.", __func__,
+		  target_inode_number);
+
+	target_inode = nova_iget(sb, target_inode_number);
+	if (target_inode == NULL) {
+		nova_info("%s: inode %llu does not exist.", __func__,
+			  target_inode_number);
+		retval = -ENOENT;
+		goto out;
+	}
+
+	target_pi = nova_get_inode(sb, target_inode);
+	if (target_pi == NULL) {
+		nova_info("%s: couldn't get nova inode %llu.", __func__,
+			  target_inode_number);
+		retval = -ENOENT;
+		goto out;
+	}
+
+	target_sih = NOVA_I(target_inode);
+
+	nova_info("%s: got inode %llu @ 0x%p; pi=0x%p\n", __func__,
+		  target_inode_number, target_inode, target_pi);
+
+	nova_inode_log_fast_gc(sb, target_pi, &target_sih->header,
+			       0, 0, 0, 0, 1);
+	iput(target_inode);
+
+out:
+	kfree(_buf);
+	return retval;
+}
+
+static const struct file_operations nova_seq_gc_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nova_seq_gc_open,
+	.read		= seq_read,
+	.write		= nova_seq_gc,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+/* ====================== Setup/teardown ======================== */
+void nova_sysfs_init(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (nova_proc_root)
+		sbi->s_proc = proc_mkdir(sbi->s_bdev->bd_disk->disk_name,
+					 nova_proc_root);
+
+	if (sbi->s_proc) {
+		proc_create_data("timing_stats", 0444, sbi->s_proc,
+				 &nova_seq_timing_fops, sb);
+		proc_create_data("IO_stats", 0444, sbi->s_proc,
+				 &nova_seq_IO_fops, sb);
+		proc_create_data("allocator", 0444, sbi->s_proc,
+				 &nova_seq_allocator_fops, sb);
+		proc_create_data("create_snapshot", 0444, sbi->s_proc,
+				 &nova_seq_create_snapshot_fops, sb);
+		proc_create_data("delete_snapshot", 0444, sbi->s_proc,
+				 &nova_seq_delete_snapshot_fops, sb);
+		proc_create_data("snapshots", 0444, sbi->s_proc,
+				 &nova_seq_show_snapshots_fops, sb);
+		proc_create_data("test_perf", 0444, sbi->s_proc,
+				 &nova_seq_test_perf_fops, sb);
+		proc_create_data("gc", 0444, sbi->s_proc,
+				 &nova_seq_gc_fops, sb);
+	}
+}
+
+void nova_sysfs_exit(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (sbi->s_proc) {
+		remove_proc_entry("timing_stats", sbi->s_proc);
+		remove_proc_entry("IO_stats", sbi->s_proc);
+		remove_proc_entry("allocator", sbi->s_proc);
+		remove_proc_entry("create_snapshot", sbi->s_proc);
+		remove_proc_entry("delete_snapshot", sbi->s_proc);
+		remove_proc_entry("snapshots", sbi->s_proc);
+		remove_proc_entry("test_perf", sbi->s_proc);
+		remove_proc_entry("gc", sbi->s_proc);
+		remove_proc_entry(sbi->s_bdev->bd_disk->disk_name,
+					nova_proc_root);
+	}
+}

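For readers who want to drive the test_perf entry from user space, here
is a minimal sketch of the request format described in the help text of
nova_seq_test_perf_show() ("function:poolmb:size:disks").  It reuses the
same scanf format string as the kernel handler in this patch.  The
struct and the demo_* helpers are hypothetical names introduced only for
this illustration; they are not part of NOVA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the four fields of a test_perf request. */
struct demo_perf_request {
	unsigned int func_id;	/* 0 means "test all functions" */
	unsigned int poolmb;
	size_t size;
	unsigned int disks;	/* only matters for raid functions */
};

/*
 * Parse "function:poolmb:size:disks" with the same format string the
 * patch uses.  Returns 0 on success, -1 if fewer than four fields match.
 */
static int demo_parse_perf_request(const char *text,
				   struct demo_perf_request *req)
{
	int n = sscanf(text, "%u:%u:%zu:%u", &req->func_id, &req->poolmb,
		       &req->size, &req->disks);

	return n == 4 ? 0 : -1;
}

/*
 * Build the request string.  Returns the number of characters written
 * (excluding the terminating NUL), or -1 if the buffer is too small.
 */
static int demo_format_perf_request(char *buf, size_t buflen,
				    const struct demo_perf_request *req)
{
	int n = snprintf(buf, buflen, "%u:%u:%zu:%u", req->func_id,
			 req->poolmb, req->size, req->disks);

	if (n < 0 || (size_t)n >= buflen)
		return -1;
	return n;
}
```

A caller would write the formatted string to the test_perf file of the
mounted device, for example /proc/fs/NOVA/pmem0/test_perf as in the help
text above.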

* [RFC 13/16] NOVA: Sysfs and ioctl
@ 2017-08-03  7:49   ` Steven Swanson
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:49 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson, dan.j.williams

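NOVA provides the standard ioctls for getting and setting file flags and
the inode generation number, along with a few NOVA-specific debugging
ioctls.  It also provides a /proc-based interface, under
/proc/fs/NOVA/<device>/, for reading statistics, creating, deleting and
listing snapshots, running performance tests, and triggering garbage
collection on an inode.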

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 fs/nova/ioctl.c |  185 +++++++++++++++++++
 fs/nova/sysfs.c |  543 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 728 insertions(+)
 create mode 100644 fs/nova/ioctl.c
 create mode 100644 fs/nova/sysfs.c
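
As an illustration of the flag handling in the FS_IOC_SETFLAGS path
below, here is a small user-space sketch of the masking logic.  It is
not code from the patch: the DEMO_* constants are illustrative copies of
values from <linux/fs.h>, and the demo_* functions are hypothetical
helpers that only mirror the arithmetic, without any locking or logging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative copies of flag values from <linux/fs.h>. */
#define DEMO_IMMUTABLE_FL	0x00000010u
#define DEMO_APPEND_FL		0x00000020u
#define DEMO_NOATIME_FL		0x00000080u
#define DEMO_INDEX_FL		0x00001000u	/* not user-modifiable */
#define DEMO_DIRSYNC_FL		0x00010000u
#define DEMO_FL_USER_MODIFIABLE	0x000380FFu

/*
 * True when the request toggles the append-only or immutable bit.  The
 * patch rejects such a change unless the caller has CAP_LINUX_IMMUTABLE.
 */
static bool demo_needs_cap_immutable(uint32_t requested, uint32_t old)
{
	return ((requested ^ old) &
		(DEMO_APPEND_FL | DEMO_IMMUTABLE_FL)) != 0;
}

/*
 * Compute the flags that end up stored: DIRSYNC is dropped for
 * non-directories, only user-modifiable bits are taken from the
 * request, and all other bits are preserved from the old value.
 */
static uint32_t demo_merge_flags(uint32_t requested, uint32_t old,
				 bool is_dir)
{
	if (!is_dir)
		requested &= ~DEMO_DIRSYNC_FL;

	return (requested & DEMO_FL_USER_MODIFIABLE) |
	       (old & ~DEMO_FL_USER_MODIFIABLE);
}
```

For example, asking for NOATIME and DIRSYNC on a regular file whose old
flags contain only the INDEX bit yields NOATIME plus the preserved INDEX
bit; the DIRSYNC request is silently discarded.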

diff --git a/fs/nova/ioctl.c b/fs/nova/ioctl.c
new file mode 100644
index 000000000000..163e24ccd2ca
--- /dev/null
+++ b/fs/nova/ioctl.c
@@ -0,0 +1,185 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Ioctl operations.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2010-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/capability.h>
+#include <linux/time.h>
+#include <linux/sched.h>
+#include <linux/compat.h>
+#include <linux/mount.h>
+#include "nova.h"
+#include "inode.h"
+
+long nova_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+{
+	struct address_space *mapping = filp->f_mapping;
+	struct inode    *inode = mapping->host;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *pi;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode_update update;
+	unsigned int flags;
+	int ret;
+
+	pi = nova_get_inode(sb, inode);
+	if (!pi)
+		return -EACCES;
+
+	switch (cmd) {
+	case FS_IOC_GETFLAGS:
+		flags = le32_to_cpu(pi->i_flags) & NOVA_FL_USER_VISIBLE;
+		return put_user(flags, (int __user *)arg);
+	case FS_IOC_SETFLAGS: {
+		unsigned int oldflags;
+		u64 old_linkc = 0;
+		u64 epoch_id;
+
+		ret = mnt_want_write_file(filp);
+		if (ret)
+			return ret;
+
+		if (!inode_owner_or_capable(inode)) {
+			ret = -EPERM;
+			goto flags_out;
+		}
+
+		if (get_user(flags, (int __user *)arg)) {
+			ret = -EFAULT;
+			goto flags_out;
+		}
+
+		inode_lock(inode);
+		oldflags = le32_to_cpu(pi->i_flags);
+
+		if ((flags ^ oldflags) &
+		    (FS_APPEND_FL | FS_IMMUTABLE_FL)) {
+			if (!capable(CAP_LINUX_IMMUTABLE)) {
+				/* flags_out_unlock releases the inode lock */
+				ret = -EPERM;
+				goto flags_out_unlock;
+			}
+		}
+
+		if (!S_ISDIR(inode->i_mode))
+			flags &= ~FS_DIRSYNC_FL;
+
+		epoch_id = nova_get_epoch_id(sb);
+		flags = flags & FS_FL_USER_MODIFIABLE;
+		flags |= oldflags & ~FS_FL_USER_MODIFIABLE;
+		inode->i_ctime = current_time(inode);
+		nova_set_inode_flags(inode, pi, flags);
+
+		update.tail = 0;
+		update.alter_tail = 0;
+		ret = nova_append_link_change_entry(sb, pi, inode,
+					&update, &old_linkc, epoch_id);
+		if (!ret) {
+			nova_memunlock_inode(sb, pi);
+			nova_update_inode(sb, inode, pi, &update, 1);
+			nova_memlock_inode(sb, pi);
+			nova_invalidate_link_change_entry(sb, old_linkc);
+		}
+		sih->trans_id++;
+flags_out_unlock:
+		inode_unlock(inode);
+flags_out:
+		mnt_drop_write_file(filp);
+		return ret;
+	}
+	case FS_IOC_GETVERSION:
+		return put_user(inode->i_generation, (int __user *)arg);
+	case FS_IOC_SETVERSION: {
+		u64 old_linkc = 0;
+		u64 epoch_id;
+		__u32 generation;
+
+		if (!inode_owner_or_capable(inode))
+			return -EPERM;
+		ret = mnt_want_write_file(filp);
+		if (ret)
+			return ret;
+		if (get_user(generation, (int __user *)arg)) {
+			ret = -EFAULT;
+			goto setversion_out;
+		}
+
+		epoch_id = nova_get_epoch_id(sb);
+		inode_lock(inode);
+		inode->i_ctime = current_time(inode);
+		inode->i_generation = generation;
+
+		update.tail = 0;
+		update.alter_tail = 0;
+		ret = nova_append_link_change_entry(sb, pi, inode,
+					&update, &old_linkc, epoch_id);
+		if (!ret) {
+			nova_memunlock_inode(sb, pi);
+			nova_update_inode(sb, inode, pi, &update, 1);
+			nova_memlock_inode(sb, pi);
+			nova_invalidate_link_change_entry(sb, old_linkc);
+		}
+		sih->trans_id++;
+		inode_unlock(inode);
+setversion_out:
+		mnt_drop_write_file(filp);
+		return ret;
+	}
+	case NOVA_PRINT_TIMING: {
+		nova_print_timing_stats(sb);
+		return 0;
+	}
+	case NOVA_CLEAR_STATS: {
+		nova_clear_stats(sb);
+		return 0;
+	}
+	case NOVA_PRINT_LOG: {
+		nova_print_inode_log(sb, inode);
+		return 0;
+	}
+	case NOVA_PRINT_LOG_PAGES: {
+		nova_print_inode_log_pages(sb, inode);
+		return 0;
+	}
+	case NOVA_PRINT_FREE_LISTS: {
+		nova_print_free_lists(sb);
+		return 0;
+	}
+	default:
+		return -ENOTTY;
+	}
+}
+
+#ifdef CONFIG_COMPAT
+long nova_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	switch (cmd) {
+	case FS_IOC32_GETFLAGS:
+		cmd = FS_IOC_GETFLAGS;
+		break;
+	case FS_IOC32_SETFLAGS:
+		cmd = FS_IOC_SETFLAGS;
+		break;
+	case FS_IOC32_GETVERSION:
+		cmd = FS_IOC_GETVERSION;
+		break;
+	case FS_IOC32_SETVERSION:
+		cmd = FS_IOC_SETVERSION;
+		break;
+	default:
+		return -ENOIOCTLCMD;
+	}
+	return nova_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
+}
+#endif
diff --git a/fs/nova/sysfs.c b/fs/nova/sysfs.c
new file mode 100644
index 000000000000..38749bb8b14b
--- /dev/null
+++ b/fs/nova/sysfs.c
@@ -0,0 +1,543 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Proc fs operations
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include "nova.h"
+#include "inode.h"
+
+const char *proc_dirname = "fs/NOVA";
+struct proc_dir_entry *nova_proc_root;
+
+/* ====================== Statistics ======================== */
+static int nova_seq_timing_show(struct seq_file *seq, void *v)
+{
+	int i;
+
+	nova_get_timing_stats();
+
+	seq_puts(seq, "=========== NOVA kernel timing stats ===========\n");
+	for (i = 0; i < TIMING_NUM; i++) {
+		/* Title */
+		if (Timingstring[i][0] == '=') {
+			seq_printf(seq, "\n%s\n\n", Timingstring[i]);
+			continue;
+		}
+
+		if (measure_timing || Timingstats[i]) {
+			seq_printf(seq, "%s: count %llu, timing %llu, average %llu\n",
+				Timingstring[i],
+				Countstats[i],
+				Timingstats[i],
+				Countstats[i] ?
+				Timingstats[i] / Countstats[i] : 0);
+		} else {
+			seq_printf(seq, "%s: count %llu\n",
+				Timingstring[i],
+				Countstats[i]);
+		}
+	}
+
+	seq_puts(seq, "\n");
+	return 0;
+}
+
+static int nova_seq_timing_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, nova_seq_timing_show, PDE_DATA(inode));
+}
+
+ssize_t nova_seq_clear_stats(struct file *filp, const char __user *buf,
+	size_t len, loff_t *ppos)
+{
+	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = mapping->host;
+	struct super_block *sb = PDE_DATA(inode);
+
+	nova_clear_stats(sb);
+	return len;
+}
+
+static const struct file_operations nova_seq_timing_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nova_seq_timing_open,
+	.read		= seq_read,
+	.write		= nova_seq_clear_stats,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int nova_seq_IO_show(struct seq_file *seq, void *v)
+{
+	struct super_block *sb = seq->private;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	unsigned long alloc_log_count = 0;
+	unsigned long alloc_log_pages = 0;
+	unsigned long alloc_data_count = 0;
+	unsigned long alloc_data_pages = 0;
+	unsigned long free_log_count = 0;
+	unsigned long freed_log_pages = 0;
+	unsigned long free_data_count = 0;
+	unsigned long freed_data_pages = 0;
+	int i;
+
+	nova_get_timing_stats();
+	nova_get_IO_stats();
+
+	seq_puts(seq, "============ NOVA allocation stats ============\n\n");
+
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+
+		alloc_log_count += free_list->alloc_log_count;
+		alloc_log_pages += free_list->alloc_log_pages;
+		alloc_data_count += free_list->alloc_data_count;
+		alloc_data_pages += free_list->alloc_data_pages;
+		free_log_count += free_list->free_log_count;
+		freed_log_pages += free_list->freed_log_pages;
+		free_data_count += free_list->free_data_count;
+		freed_data_pages += free_list->freed_data_pages;
+	}
+
+	seq_printf(seq, "alloc log count %lu, allocated log pages %lu\n"
+		"alloc data count %lu, allocated data pages %lu\n"
+		"free log count %lu, freed log pages %lu\n"
+		"free data count %lu, freed data pages %lu\n",
+		alloc_log_count, alloc_log_pages,
+		alloc_data_count, alloc_data_pages,
+		free_log_count, freed_log_pages,
+		free_data_count, freed_data_pages);
+
+	seq_printf(seq, "Fast GC %llu, checked pages %llu, free pages %llu, average %llu\n",
+		Countstats[fast_gc_t], IOstats[fast_checked_pages],
+		IOstats[fast_gc_pages], Countstats[fast_gc_t] ?
+			IOstats[fast_gc_pages] / Countstats[fast_gc_t] : 0);
+	seq_printf(seq, "Thorough GC %llu, checked pages %llu, free pages %llu, average %llu\n",
+		Countstats[thorough_gc_t],
+		IOstats[thorough_checked_pages], IOstats[thorough_gc_pages],
+		Countstats[thorough_gc_t] ?
+			IOstats[thorough_gc_pages] / Countstats[thorough_gc_t]
+			: 0);
+
+	seq_puts(seq, "\n");
+
+	seq_puts(seq, "================ NOVA I/O stats ================\n\n");
+	seq_printf(seq, "Read %llu, bytes %llu, average %llu\n",
+		Countstats[dax_read_t], IOstats[read_bytes],
+		Countstats[dax_read_t] ?
+			IOstats[read_bytes] / Countstats[dax_read_t] : 0);
+	seq_printf(seq, "COW write %llu, bytes %llu, average %llu, write breaks %llu, average %llu\n",
+		Countstats[cow_write_t], IOstats[cow_write_bytes],
+		Countstats[cow_write_t] ?
+			IOstats[cow_write_bytes] / Countstats[cow_write_t] : 0,
+		IOstats[cow_write_breaks], Countstats[cow_write_t] ?
+			IOstats[cow_write_breaks] / Countstats[cow_write_t]
+			: 0);
+	seq_printf(seq, "Inplace write %llu, bytes %llu, average %llu, write breaks %llu, average %llu\n",
+		Countstats[inplace_write_t], IOstats[inplace_write_bytes],
+		Countstats[inplace_write_t] ?
+			IOstats[inplace_write_bytes] /
+			Countstats[inplace_write_t] : 0,
+		IOstats[inplace_write_breaks], Countstats[inplace_write_t] ?
+			IOstats[inplace_write_breaks] /
+			Countstats[inplace_write_t] : 0);
+	seq_printf(seq, "Inplace write %llu, allocate new blocks %llu\n",
+			Countstats[inplace_write_t],
+			IOstats[inplace_new_blocks]);
+	seq_printf(seq, "DAX get blocks %llu, allocate new blocks %llu\n",
+			Countstats[dax_get_block_t], IOstats[dax_new_blocks]);
+	seq_printf(seq, "Dirty pages %llu\n", IOstats[dirty_pages]);
+	seq_printf(seq, "Protect head %llu, tail %llu\n",
+			IOstats[protect_head], IOstats[protect_tail]);
+	seq_printf(seq, "Block csum parity %llu\n", IOstats[block_csum_parity]);
+	seq_printf(seq, "Page fault %llu, dax cow fault %llu, dax cow fault during snapshot creation %llu\n"
+			"CoW write overlap mmap range %llu, mapping/pfn updated pages %llu\n",
+			Countstats[mmap_fault_t], Countstats[mmap_cow_t],
+			IOstats[dax_cow_during_snapshot],
+			IOstats[cow_overlap_mmap],
+			IOstats[mapping_updated_pages]);
+	seq_printf(seq, "fsync %llu, fdatasync %llu\n",
+			Countstats[fsync_t], IOstats[fdatasync]);
+
+	seq_puts(seq, "\n");
+
+	nova_print_snapshot_lists(sb, seq);
+	seq_puts(seq, "\n");
+
+	return 0;
+}
+
+static int nova_seq_IO_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, nova_seq_IO_show, PDE_DATA(inode));
+}
+
+static const struct file_operations nova_seq_IO_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nova_seq_IO_open,
+	.read		= seq_read,
+	.write		= nova_seq_clear_stats,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int nova_seq_show_allocator(struct seq_file *seq, void *v)
+{
+	struct super_block *sb = seq->private;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	int i;
+	unsigned long log_pages = 0;
+	unsigned long data_pages = 0;
+
+	seq_puts(seq, "======== NOVA per-CPU allocator stats ========\n");
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+		seq_printf(seq, "Free list %d: block start %lu, block end %lu, num_blocks %lu, num_free_blocks %lu, blocknode %lu\n",
+			i, free_list->block_start, free_list->block_end,
+			free_list->block_end - free_list->block_start + 1,
+			free_list->num_free_blocks, free_list->num_blocknode);
+
+		if (free_list->first_node) {
+			seq_printf(seq, "First node %lu - %lu\n",
+					free_list->first_node->range_low,
+					free_list->first_node->range_high);
+		}
+
+		if (free_list->last_node) {
+			seq_printf(seq, "Last node %lu - %lu\n",
+					free_list->last_node->range_low,
+					free_list->last_node->range_high);
+		}
+
+		seq_printf(seq, "Free list %d: csum start %lu, replica csum start %lu, csum blocks %lu, parity start %lu, parity blocks %lu\n",
+			i, free_list->csum_start, free_list->replica_csum_start,
+			free_list->num_csum_blocks,
+			free_list->parity_start, free_list->num_parity_blocks);
+
+		seq_printf(seq, "Free list %d: alloc log count %lu, allocated log pages %lu, alloc data count %lu, allocated data pages %lu, free log count %lu, freed log pages %lu, free data count %lu, freed data pages %lu\n",
+			   i,
+			   free_list->alloc_log_count,
+			   free_list->alloc_log_pages,
+			   free_list->alloc_data_count,
+			   free_list->alloc_data_pages,
+			   free_list->free_log_count,
+			   free_list->freed_log_pages,
+			   free_list->free_data_count,
+			   free_list->freed_data_pages);
+
+		log_pages += free_list->alloc_log_pages;
+		log_pages -= free_list->freed_log_pages;
+
+		data_pages += free_list->alloc_data_pages;
+		data_pages -= free_list->freed_data_pages;
+	}
+
+	seq_printf(seq, "\nCurrently used pmem pages: log %lu, data %lu\n",
+			log_pages, data_pages);
+
+	return 0;
+}
+
+static int nova_seq_allocator_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, nova_seq_show_allocator,
+				PDE_DATA(inode));
+}
+
+static const struct file_operations nova_seq_allocator_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nova_seq_allocator_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+/* ====================== Snapshot ======================== */
+static int nova_seq_create_snapshot_show(struct seq_file *seq, void *v)
+{
+	seq_puts(seq, "Write to create a snapshot\n");
+	return 0;
+}
+
+static int nova_seq_create_snapshot_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, nova_seq_create_snapshot_show,
+				PDE_DATA(inode));
+}
+
+ssize_t nova_seq_create_snapshot(struct file *filp, const char __user *buf,
+	size_t len, loff_t *ppos)
+{
+	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = mapping->host;
+	struct super_block *sb = PDE_DATA(inode);
+
+	nova_create_snapshot(sb);
+	return len;
+}
+
+static const struct file_operations nova_seq_create_snapshot_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nova_seq_create_snapshot_open,
+	.read		= seq_read,
+	.write		= nova_seq_create_snapshot,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int nova_seq_delete_snapshot_show(struct seq_file *seq, void *v)
+{
+	seq_puts(seq, "Echo index to delete a snapshot\n");
+	return 0;
+}
+
+static int nova_seq_delete_snapshot_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, nova_seq_delete_snapshot_show,
+				PDE_DATA(inode));
+}
+
+ssize_t nova_seq_delete_snapshot(struct file *filp, const char __user *buf,
+	size_t len, loff_t *ppos)
+{
+	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = mapping->host;
+	struct super_block *sb = PDE_DATA(inode);
+	u64 epoch_id;
+	int ret;
+
+	ret = kstrtoull_from_user(buf, len, 10, &epoch_id);
+	if (ret < 0)
+		nova_warn("Couldn't parse snapshot id\n");
+	else
+		nova_delete_snapshot(sb, epoch_id);
+
+	return len;
+}
+
+static const struct file_operations nova_seq_delete_snapshot_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nova_seq_delete_snapshot_open,
+	.read		= seq_read,
+	.write		= nova_seq_delete_snapshot,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int nova_seq_show_snapshots(struct seq_file *seq, void *v)
+{
+	struct super_block *sb = seq->private;
+
+	nova_print_snapshots(sb, seq);
+	return 0;
+}
+
+static int nova_seq_show_snapshots_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, nova_seq_show_snapshots,
+				PDE_DATA(inode));
+}
+
+static const struct file_operations nova_seq_show_snapshots_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nova_seq_show_snapshots_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+/* ====================== Performance ======================== */
+static int nova_seq_test_perf_show(struct seq_file *seq, void *v)
+{
+	seq_printf(seq, "Echo function:poolmb:size:disks to test function performance working on size of data.\n"
+			"    example: echo 1:128:4096:8 > /proc/fs/NOVA/pmem0/test_perf\n"
+			"The disks value only matters for raid functions.\n"
+			"Set function to 0 to test all functions.\n");
+	return 0;
+}
+
+static int nova_seq_test_perf_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, nova_seq_test_perf_show, PDE_DATA(inode));
+}
+
+ssize_t nova_seq_test_perf(struct file *filp, const char __user *buf,
+	size_t len, loff_t *ppos)
+{
+	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = mapping->host;
+	struct super_block *sb = PDE_DATA(inode);
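+	size_t size;
+	unsigned int func_id, poolmb, disks;
+	char *kbuf;
+
+	/* buf is a user pointer: take a NUL-terminated kernel copy first */
+	kbuf = memdup_user_nul(buf, len);
+	if (IS_ERR(kbuf))
+		return PTR_ERR(kbuf);
+
+	if (sscanf(kbuf, "%u:%u:%zu:%u", &func_id, &poolmb, &size, &disks) == 4)
+		nova_test_perf(sb, func_id, poolmb, size, disks);
+	else
+		nova_warn("Couldn't parse test_perf request: %s\n", kbuf);
+
+	kfree(kbuf);
+	return len;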
+}
+
+static const struct file_operations nova_seq_test_perf_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nova_seq_test_perf_open,
+	.read		= seq_read,
+	.write		= nova_seq_test_perf,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+
+/* ====================== GC ======================== */
+
+
+static int nova_seq_gc_show(struct seq_file *seq, void *v)
+{
+	seq_printf(seq, "Echo inode number to trigger garbage collection\n"
+		   "    example: echo 34 > /proc/fs/NOVA/pmem0/gc\n");
+	return 0;
+}
+
+static int nova_seq_gc_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, nova_seq_gc_show, PDE_DATA(inode));
+}
+
+ssize_t nova_seq_gc(struct file *filp, const char __user *buf,
+	size_t len, loff_t *ppos)
+{
+	u64 target_inode_number;
+	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = mapping->host;
+	struct super_block *sb = PDE_DATA(inode);
+	struct inode *target_inode;
+	struct nova_inode *target_pi;
+	struct nova_inode_info *target_sih;
+
+	int ret;
+	char *_buf;
+	ssize_t retval = len;
+
+	_buf = kmalloc(len + 1, GFP_KERNEL); /* +1 for the NUL terminator */
+	if (_buf == NULL)  {
+		retval = -ENOMEM;
+		goto out;
+	}
+
+	if (copy_from_user(_buf, buf, len)) {
+		retval = -EFAULT;
+		goto out;
+	}
+
+	_buf[len] = 0;
+	ret = kstrtoull(_buf, 0, &target_inode_number);
+	if (ret) {
+		nova_info("%s: Could not parse ino '%s'\n", __func__, _buf);
+		retval = ret;
+		goto out;
+	}
+	nova_info("%s: target_inode_number=%llu\n", __func__,
+		  target_inode_number);
+
+	target_inode = nova_iget(sb, target_inode_number);
+	if (IS_ERR_OR_NULL(target_inode)) {
+		nova_info("%s: inode %llu does not exist.\n", __func__,
+			  target_inode_number);
+		retval = -ENOENT;
+		goto out;
+	}
+
+	target_pi = nova_get_inode(sb, target_inode);
+	if (target_pi == NULL) {
+		nova_info("%s: couldn't get nova inode %llu.\n", __func__,
+			  target_inode_number);
+		retval = -ENOENT;
+		goto out_iput;
+	}
+
+	target_sih = NOVA_I(target_inode);
+	nova_info("%s: got inode %llu @ 0x%p; pi=0x%p\n", __func__,
+		  target_inode_number, target_inode, target_pi);
+
+	nova_inode_log_fast_gc(sb, target_pi, &target_sih->header,
+			       0, 0, 0, 0, 1);
+out_iput:
+	iput(target_inode);
+
+out:
+	kfree(_buf);
+	return retval;
+}
+
+static const struct file_operations nova_seq_gc_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nova_seq_gc_open,
+	.read		= seq_read,
+	.write		= nova_seq_gc,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+/* ====================== Setup/teardown ======================== */
+void nova_sysfs_init(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (nova_proc_root)
+		sbi->s_proc = proc_mkdir(sbi->s_bdev->bd_disk->disk_name,
+					 nova_proc_root);
+
+	if (sbi->s_proc) {
+		proc_create_data("timing_stats", 0444, sbi->s_proc,
+				 &nova_seq_timing_fops, sb);
+		proc_create_data("IO_stats", 0644, sbi->s_proc,
+				 &nova_seq_IO_fops, sb);
+		proc_create_data("allocator", 0444, sbi->s_proc,
+				 &nova_seq_allocator_fops, sb);
+		proc_create_data("create_snapshot", 0644, sbi->s_proc,
+				 &nova_seq_create_snapshot_fops, sb);
+		proc_create_data("delete_snapshot", 0644, sbi->s_proc,
+				 &nova_seq_delete_snapshot_fops, sb);
+		proc_create_data("snapshots", 0444, sbi->s_proc,
+				 &nova_seq_show_snapshots_fops, sb);
+		proc_create_data("test_perf", 0644, sbi->s_proc,
+				 &nova_seq_test_perf_fops, sb);
+		proc_create_data("gc", 0644, sbi->s_proc,
+				 &nova_seq_gc_fops, sb);
+	}
+}
+
+void nova_sysfs_exit(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (sbi->s_proc) {
+		remove_proc_entry("timing_stats", sbi->s_proc);
+		remove_proc_entry("IO_stats", sbi->s_proc);
+		remove_proc_entry("allocator", sbi->s_proc);
+		remove_proc_entry("create_snapshot", sbi->s_proc);
+		remove_proc_entry("delete_snapshot", sbi->s_proc);
+		remove_proc_entry("snapshots", sbi->s_proc);
+		remove_proc_entry("test_perf", sbi->s_proc);
+		remove_proc_entry("gc", sbi->s_proc);
+		remove_proc_entry(sbi->s_bdev->bd_disk->disk_name,
+					nova_proc_root);
+	}
+}
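
For illustration: the test_perf handler above expects four colon-separated
fields (function:poolmb:size:disks) and rejects anything else. The following
is a minimal userspace sketch of that parsing step, not the kernel code
itself. The names `struct perf_request` and `parse_perf_request()` are
hypothetical and exist only for this example.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical container for one parsed test_perf request. */
struct perf_request {
	unsigned int func_id;	/* 0 selects every function */
	unsigned int poolmb;
	size_t size;
	unsigned int disks;	/* only matters for the raid functions */
};

/* Returns 0 on success, -1 unless all four fields were parsed. */
static int parse_perf_request(const char *s, struct perf_request *req)
{
	int n = sscanf(s, "%u:%u:%zu:%u", &req->func_id, &req->poolmb,
		       &req->size, &req->disks);

	return n == 4 ? 0 : -1;
}
```

With this sketch, "1:128:4096:8" yields func_id 1, poolmb 128, size 4096 and
disks 8, while a truncated request such as "1:128" is rejected.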

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 14/16] NOVA: Read-only pmem devices
  2017-08-03  7:48 ` Steven Swanson
@ 2017-08-03  7:49   ` Steven Swanson
  -1 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:49 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson

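Add a "readonly" module parameter to nd_pmem to support read-only pmem
devices.  When it is nonzero, the plain devm_memremap() path in
pmem_attach_disk() uses the new devm_memremap_ro() helper instead, which
maps the range through ioremap_cache_ro() with _PAGE_RW cleared.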

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
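For illustration: the read-only mapping comes down to clearing the
write-enable bit in the page protection value before the mapping is
installed. Below is a minimal userspace sketch of that one step. The constant
`SKETCH_PAGE_RW` and the helper `make_readonly()` are hypothetical stand-ins;
the real bit is _PAGE_RW from the x86 page table headers, and the real code
operates on a pgprot_t.

```c
#include <stdint.h>

/* Stand-in for _PAGE_RW; bit 1 is the x86 write-enable bit. */
#define SKETCH_PAGE_RW (UINT64_C(1) << 1)

/* Return prot with the write-enable bit cleared when readonly is set. */
static uint64_t make_readonly(uint64_t prot, int readonly)
{
	if (readonly)
		prot &= ~SKETCH_PAGE_RW;
	return prot;
}
```

All other protection bits pass through unchanged, so the cache mode chosen
for the mapping is unaffected.
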
 arch/x86/include/asm/io.h |    1 +
 arch/x86/mm/ioremap.c     |   25 ++++++++++++++++++-------
 drivers/nvdimm/pmem.c     |   14 ++++++++++++--
 include/linux/io.h        |    2 ++
 kernel/memremap.c         |   24 ++++++++++++++++++++++++
 mm/memory.c               |    2 +-
 mm/mmap.c                 |    1 +
 7 files changed, 59 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 7afb0e2f07f4..7aae48f2e4f1 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -173,6 +173,7 @@ extern void __iomem *ioremap_uc(resource_size_t offset, unsigned long size);
 #define ioremap_uc ioremap_uc
 
 extern void __iomem *ioremap_cache(resource_size_t offset, unsigned long size);
+extern void __iomem *ioremap_cache_ro(resource_size_t phys_addr, unsigned long size);
 extern void __iomem *ioremap_prot(resource_size_t offset, unsigned long size, unsigned long prot_val);
 
 /**
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index bbc558b88a88..bcd473801817 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -81,7 +81,8 @@ static int __ioremap_check_ram(unsigned long start_pfn, unsigned long nr_pages,
  * caller shouldn't need to know that small detail.
  */
 static void __iomem *__ioremap_caller(resource_size_t phys_addr,
-		unsigned long size, enum page_cache_mode pcm, void *caller)
+		unsigned long size, enum page_cache_mode pcm, void *caller,
+		int readonly)
 {
 	unsigned long offset, vaddr;
 	resource_size_t pfn, last_pfn, last_addr;
@@ -172,6 +173,9 @@ static void __iomem *__ioremap_caller(resource_size_t phys_addr,
 		break;
 	}
 
+	if (readonly)
+		prot = __pgprot(pgprot_val(prot) & ~_PAGE_RW);
+
 	/*
 	 * Ok, go for it..
 	 */
@@ -239,7 +243,7 @@ void __iomem *ioremap_nocache(resource_size_t phys_addr, unsigned long size)
 	enum page_cache_mode pcm = _PAGE_CACHE_MODE_UC_MINUS;
 
 	return __ioremap_caller(phys_addr, size, pcm,
-				__builtin_return_address(0));
+				__builtin_return_address(0), 0);
 }
 EXPORT_SYMBOL(ioremap_nocache);
 
@@ -272,7 +276,7 @@ void __iomem *ioremap_uc(resource_size_t phys_addr, unsigned long size)
 	enum page_cache_mode pcm = _PAGE_CACHE_MODE_UC;
 
 	return __ioremap_caller(phys_addr, size, pcm,
-				__builtin_return_address(0));
+				__builtin_return_address(0), 0);
 }
 EXPORT_SYMBOL_GPL(ioremap_uc);
 
@@ -289,7 +293,7 @@ EXPORT_SYMBOL_GPL(ioremap_uc);
 void __iomem *ioremap_wc(resource_size_t phys_addr, unsigned long size)
 {
 	return __ioremap_caller(phys_addr, size, _PAGE_CACHE_MODE_WC,
-					__builtin_return_address(0));
+					__builtin_return_address(0), 0);
 }
 EXPORT_SYMBOL(ioremap_wc);
 
@@ -306,23 +310,30 @@ EXPORT_SYMBOL(ioremap_wc);
 void __iomem *ioremap_wt(resource_size_t phys_addr, unsigned long size)
 {
 	return __ioremap_caller(phys_addr, size, _PAGE_CACHE_MODE_WT,
-					__builtin_return_address(0));
+					__builtin_return_address(0), 0);
 }
 EXPORT_SYMBOL(ioremap_wt);
 
 void __iomem *ioremap_cache(resource_size_t phys_addr, unsigned long size)
 {
 	return __ioremap_caller(phys_addr, size, _PAGE_CACHE_MODE_WB,
-				__builtin_return_address(0));
+				__builtin_return_address(0), 0);
 }
 EXPORT_SYMBOL(ioremap_cache);
 
+void __iomem *ioremap_cache_ro(resource_size_t phys_addr, unsigned long size)
+{
+	return __ioremap_caller(phys_addr, size, _PAGE_CACHE_MODE_WB,
+				__builtin_return_address(0), 1);
+}
+EXPORT_SYMBOL(ioremap_cache_ro);
+
 void __iomem *ioremap_prot(resource_size_t phys_addr, unsigned long size,
 				unsigned long prot_val)
 {
 	return __ioremap_caller(phys_addr, size,
 				pgprot2cachemode(__pgprot(prot_val)),
-				__builtin_return_address(0));
+				__builtin_return_address(0), 0);
 }
 EXPORT_SYMBOL(ioremap_prot);
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index c544d466ea51..a6b29c731c53 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -35,6 +35,11 @@
 #include "pfn.h"
 #include "nd.h"
 
+static int readonly;
+
+module_param(readonly, int, S_IRUGO);
+MODULE_PARM_DESC(readonly, "Map pmem devices read-only");
+
 static struct device *to_dev(struct pmem_device *pmem)
 {
 	/*
@@ -324,9 +329,14 @@ static int pmem_attach_disk(struct device *dev,
 		addr = devm_memremap_pages(dev, &nsio->res,
 				&q->q_usage_counter, NULL);
 		pmem->pfn_flags |= PFN_MAP;
-	} else
-		addr = devm_memremap(dev, pmem->phys_addr,
+	} else {
+		if (readonly == 0)
+			addr = devm_memremap(dev, pmem->phys_addr,
 				pmem->size, ARCH_MEMREMAP_PMEM);
+		else
+			addr = devm_memremap_ro(dev, pmem->phys_addr,
+				pmem->size, ARCH_MEMREMAP_PMEM);
+	}
 
 	/*
 	 * At release time the queue must be frozen before
diff --git a/include/linux/io.h b/include/linux/io.h
index 2195d9ea4aaa..00641aef9ab3 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -86,6 +86,8 @@ void devm_ioremap_release(struct device *dev, void *res);
 
 void *devm_memremap(struct device *dev, resource_size_t offset,
 		size_t size, unsigned long flags);
+void *devm_memremap_ro(struct device *dev, resource_size_t offset,
+		size_t size, unsigned long flags);
 void devm_memunmap(struct device *dev, void *addr);
 
 void *__devm_memremap_pages(struct device *dev, struct resource *res);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 23a6483c3666..68371a9a40e5 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -162,6 +162,30 @@ void *devm_memremap(struct device *dev, resource_size_t offset,
 }
 EXPORT_SYMBOL(devm_memremap);
 
+void *devm_memremap_ro(struct device *dev, resource_size_t offset,
+		size_t size, unsigned long flags)
+{
+	void **ptr, *addr;
+
+	pr_debug("%s\n", __func__);
+	ptr = devres_alloc_node(devm_memremap_release, sizeof(*ptr), GFP_KERNEL,
+			dev_to_node(dev));
+	if (!ptr)
+		return ERR_PTR(-ENOMEM);
+
+	addr = ioremap_cache_ro(offset, size);
+	if (addr) {
+		*ptr = addr;
+		devres_add(dev, ptr);
+	} else {
+		devres_free(ptr);
+		return ERR_PTR(-ENXIO);
+	}
+
+	return addr;
+}
+EXPORT_SYMBOL(devm_memremap_ro);
+
 void devm_memunmap(struct device *dev, void *addr)
 {
 	WARN_ON(devres_release(dev, devm_memremap_release,
diff --git a/mm/memory.c b/mm/memory.c
index bb11c474857e..625623a90f08 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1793,7 +1793,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 		return -ENOMEM;
 	arch_enter_lazy_mmu_mode();
 	do {
-		BUG_ON(!pte_none(*pte));
+//		BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
 		pfn++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
diff --git a/mm/mmap.c b/mm/mmap.c
index a5e3dcd75e79..5423e3340e59 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -126,6 +126,7 @@ void vma_set_page_prot(struct vm_area_struct *vma)
 	/* remove_protection_ptes reads vma->vm_page_prot without mmap_sem */
 	WRITE_ONCE(vma->vm_page_prot, vm_page_prot);
 }
+EXPORT_SYMBOL(vma_set_page_prot);
 
 /*
  * Requires inode->i_mapping->i_mmap_rwsem

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 15/16] NOVA: Performance measurement
  2017-08-03  7:48 ` Steven Swanson
@ 2017-08-03  7:49   ` Steven Swanson
  -1 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:49 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
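For illustration: the generic (non-SSE2) path of nova_block_parity_call() in
perf.c XORs the stripes of a block together 64 bits at a time. The following
is a minimal userspace sketch of that idea, not the kernel code. The name
`xor_parity()` is hypothetical, the stripes are assumed to be laid out back
to back, and strp_size is assumed to be a multiple of 8.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/*
 * XOR num_strps stripes of strp_size bytes each, stored contiguously in
 * block, into parity. memcpy() is used for the 64-bit loads and stores so
 * the sketch does not depend on the buffers being 8-byte aligned.
 */
static void xor_parity(const unsigned char *block, unsigned char *parity,
		       size_t strp_size, int num_strps)
{
	size_t i;
	int strp;

	for (i = 0; i < strp_size; i += 8) {
		uint64_t acc, word;

		memcpy(&acc, block + i, sizeof(acc));
		for (strp = 1; strp < num_strps; strp++) {
			memcpy(&word, block + (size_t)strp * strp_size + i,
			       sizeof(word));
			acc ^= word;
		}
		memcpy(parity + i, &acc, sizeof(acc));
	}
}
```

Because XOR is its own inverse, XORing the parity with all but one stripe
reproduces the missing stripe.
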
 fs/nova/perf.c  |  594 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/perf.h  |   96 ++++++++
 fs/nova/stats.c |  685 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/stats.h |  218 ++++++++++++++++++
 4 files changed, 1593 insertions(+)
 create mode 100644 fs/nova/perf.c
 create mode 100644 fs/nova/perf.h
 create mode 100644 fs/nova/stats.c
 create mode 100644 fs/nova/stats.h

diff --git a/fs/nova/perf.c b/fs/nova/perf.c
new file mode 100644
index 000000000000..35a4c6a490c3
--- /dev/null
+++ b/fs/nova/perf.c
@@ -0,0 +1,594 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Performance test routines
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include "perf.h"
+
+/* normal memcpy functions */
+static int memcpy_read_call(char *dst, char *src, size_t off, size_t size)
+{
+	/* pin dst address to cache most writes, if size fits */
+	memcpy(dst, src + off, size);
+	return 0;
+}
+
+static int memcpy_write_call(char *dst, char *src, size_t off, size_t size)
+{
+	/* pin src address to cache most reads, if size fits */
+	memcpy(dst + off, src, size);
+	return 0;
+}
+
+static int memcpy_bidir_call(char *dst, char *src, size_t off, size_t size)
+{
+	/* minimize caching by advancing both src and dst */
+	memcpy(dst + off, src + off, size);
+	return 0;
+}
+
+static const memcpy_call_t memcpy_calls[] = {
+	/* order should match enum memcpy_call_id */
+	{ "memcpy (mostly read)",  memcpy_read_call },
+	{ "memcpy (mostly write)", memcpy_write_call },
+	{ "memcpy (read write)",   memcpy_bidir_call }
+};
+
+/* copy from pmem functions */
+static int from_pmem_call(char *dst, char *src, size_t off, size_t size)
+{
+	/* pin dst address to cache most writes, if size fits */
+	/* src address should point to pmem */
+	memcpy_mcsafe(dst, src + off, size);
+	return 0;
+}
+
+static const memcpy_call_t from_pmem_calls[] = {
+	/* order should match enum from_pmem_call_id */
+	{ "memcpy_mcsafe", from_pmem_call }
+};
+
+/* copy to pmem functions */
+static int to_pmem_nocache_call(char *dst, char *src, size_t off, size_t size)
+{
+	/* pin src address to cache most reads, if size fits */
+	/* dst address should point to pmem */
+	memcpy_to_pmem_nocache(dst + off, src, size);
+	return 0;
+}
+
+static int to_flush_call(char *dst, char *src, size_t off, size_t size)
+{
+	/* flush only: no copy is performed, so src is unused */
+	/* dst address should point to pmem */
+	nova_flush_buffer(dst + off, size, 0);
+	return 0;
+}
+
+static int to_pmem_flush_call(char *dst, char *src, size_t off, size_t size)
+{
+	/* pin src address to cache most reads, if size fits */
+	/* dst address should point to pmem */
+	memcpy(dst + off, src, size);
+	nova_flush_buffer(dst + off, size, 0);
+	return 0;
+}
+
+static const memcpy_call_t to_pmem_calls[] = {
+	/* order should match enum to_pmem_call_id */
+	{ "memcpy_to_pmem_nocache", to_pmem_nocache_call },
+	{ "flush buffer",	    to_flush_call },
+	{ "memcpy + flush buffer",  to_pmem_flush_call }
+};
+
+/* checksum functions */
+static u64 zlib_adler32_call(u64 init, char *data, size_t size)
+{
+	u64 csum;
+
+	/* include/linux/zutil.h */
+	csum = zlib_adler32(init, data, size);
+	return csum;
+}
+
+static u64 nd_fletcher64_call(u64 init, char *data, size_t size)
+{
+	u64 csum;
+
+	/* drivers/nvdimm/core.c */
+	csum = nd_fletcher64(data, size, 1);
+	return csum;
+}
+
+static u64 libcrc32c_call(u64 init, char *data, size_t size)
+{
+	u32 crc = (u32) init;
+
+	crc = crc32c(crc, data, size);
+	return (u64) crc;
+}
+
+static u64 nova_crc32c_call(u64 init, char *data, size_t size)
+{
+	u32 crc = (u32) init;
+
+	crc = nova_crc32c(crc, data, size);
+	return (u64) crc;
+}
+
+static u64 plain_xor64_call(u64 init, char *data, size_t size)
+{
+	u64 csum = init;
+	u64 *word = (u64 *) data;
+
+	while (size >= 8) {
+		csum ^= *word;
+		word += 1;
+		size -= 8;
+	}
+
+	/* for perf testing ignore trailing bytes, if any */
+
+	return csum;
+}
+
+static const checksum_call_t checksum_calls[] = {
+	/* order should match enum checksum_call_id */
+	{ "zlib_adler32",  zlib_adler32_call },
+	{ "nd_fletcher64", nd_fletcher64_call },
+	{ "libcrc32c",     libcrc32c_call },
+	{ "nova_crc32c",   nova_crc32c_call },
+	{ "plain_xor64",   plain_xor64_call }
+};
+
+/* raid5 functions */
+static u64 nova_block_parity_call(char **data, char *parity,
+	size_t size, int disks)
+{
+	int i, j, strp, num_strps = disks;
+	size_t strp_size = size;
+	char *block = *data;
+	u64 xor;
+
+	/* FIXME: using same code as in parity.c; need a way to reuse that */
+
+	if (static_cpu_has(X86_FEATURE_XMM2)) { // sse2 128b
+		for (i = 0; i < strp_size; i += 16) {
+			asm volatile("movdqa %0, %%xmm0" : : "m" (block[i]));
+			for (strp = 1; strp < num_strps; strp++) {
+				j = strp * strp_size + i;
+				asm volatile(
+					"movdqa     %0, %%xmm1\n"
+					"pxor   %%xmm1, %%xmm0\n"
+					: : "m" (block[j])
+				);
+			}
+			asm volatile("movntdq %%xmm0, %0" : "=m" (parity[i]));
+		}
+	} else { // common 64b
+		for (i = 0; i < strp_size; i += 8) {
+			xor = *((u64 *) &block[i]);
+			for (strp = 1; strp < num_strps; strp++) {
+				j = strp * strp_size + i;
+				xor ^= *((u64 *) &block[j]);
+			}
+			*((u64 *) &parity[i]) = xor;
+		}
+	}
+
+	return *((u64 *) parity);
+}
+
+static u64 nova_block_csum_parity_call(char **data, char *parity,
+	size_t size, int disks)
+{
+	int i;
+	size_t strp_size = size;
+	char *block = *data;
+	u32 volatile crc[8]; // avoid results being optimized out
+	u64 qwd[8];
+	u64 acc[8] = {0, 0, 0, 0, 0, 0, 0, 0};
+
+	/* FIXME: using same code as in parity.c; need a way to reuse that */
+
+	for (i = 0; i < strp_size / 8; i++) {
+		qwd[0] = *((u64 *) (block));
+		qwd[1] = *((u64 *) (block + 1 * strp_size));
+		qwd[2] = *((u64 *) (block + 2 * strp_size));
+		qwd[3] = *((u64 *) (block + 3 * strp_size));
+		qwd[4] = *((u64 *) (block + 4 * strp_size));
+		qwd[5] = *((u64 *) (block + 5 * strp_size));
+		qwd[6] = *((u64 *) (block + 6 * strp_size));
+		qwd[7] = *((u64 *) (block + 7 * strp_size));
+
+		// if (data_csum > 0 && unroll_csum) {
+			nova_crc32c_qword(qwd[0], acc[0]);
+			nova_crc32c_qword(qwd[1], acc[1]);
+			nova_crc32c_qword(qwd[2], acc[2]);
+			nova_crc32c_qword(qwd[3], acc[3]);
+			nova_crc32c_qword(qwd[4], acc[4]);
+			nova_crc32c_qword(qwd[5], acc[5]);
+			nova_crc32c_qword(qwd[6], acc[6]);
+			nova_crc32c_qword(qwd[7], acc[7]);
+		// }
+
+		// if (data_parity > 0) {
+			((u64 *) parity)[i] = qwd[0] ^ qwd[1] ^ qwd[2] ^
+				qwd[3] ^ qwd[4] ^ qwd[5] ^ qwd[6] ^ qwd[7];
+		// }
+
+		block += 8;
+	}
+	// if (data_csum > 0 && unroll_csum) {
+		crc[0] = cpu_to_le32((u32) acc[0]);
+		crc[1] = cpu_to_le32((u32) acc[1]);
+		crc[2] = cpu_to_le32((u32) acc[2]);
+		crc[3] = cpu_to_le32((u32) acc[3]);
+		crc[4] = cpu_to_le32((u32) acc[4]);
+		crc[5] = cpu_to_le32((u32) acc[5]);
+		crc[6] = cpu_to_le32((u32) acc[6]);
+		crc[7] = cpu_to_le32((u32) acc[7]);
+	// }
+
+	return *((u64 *) parity);
+}
+
+#if 0 // some test machines do not have this function (need CONFIG_MD_RAID456)
+static u64 xor_blocks_call(char **data, char *parity,
+	size_t size, int disks)
+{
+	int xor_cnt, disk_id;
+
+	memcpy(parity, data[0], size); /* init parity with the first disk */
+	disks--;
+	disk_id = 1;
+	while (disks > 0) {
+		/* each xor_blocks call can do at most MAX_XOR_BLOCKS (4) */
+		xor_cnt = min(disks, MAX_XOR_BLOCKS);
+		/* crypto/xor.c, used in lib/raid6 and fs/btrfs */
+		xor_blocks(xor_cnt, size, parity, (void **)(data + disk_id));
+
+		disks -= xor_cnt;
+		disk_id += xor_cnt;
+	}
+
+	return *((u64 *) parity);
+}
+#endif
+
+static const raid5_call_t raid5_calls[] = {
+	/* order should match enum raid5_call_id */
+	{ "nova_block_parity", nova_block_parity_call },
+	{ "nova_block_csum_parity", nova_block_csum_parity_call },
+//	{ "xor_blocks", xor_blocks_call },
+};
+
+/* memory pools for perf testing */
+static void *nova_alloc_vmem_pool(size_t poolsize)
+{
+	void *pool = vmalloc(poolsize);
+
+	if (pool == NULL)
+		return NULL;
+
+	/* init pool to verify some checksum results */
+	// memset(pool, 0xAC, poolsize);
+
+	/* to have a clean start, flush the data cache for the given virtual
+	 * address range in the vmap area
+	 */
+	flush_kernel_vmap_range(pool, poolsize);
+
+	return pool;
+}
+
+static void nova_free_vmem_pool(void *pool)
+{
+	if (pool != NULL)
+		vfree(pool);
+}
+
+static void *nova_alloc_pmem_pool(struct super_block *sb,
+	struct nova_inode_info_header *sih, int cpu, size_t poolsize,
+	unsigned long *blocknr, int *allocated)
+{
+	int num;
+	void *pool;
+	size_t blocksize, blockoff;
+	u8 blocktype = NOVA_BLOCK_TYPE_4K;
+
+	blocksize = blk_type_to_size[blocktype];
+	num = poolsize / blocksize;
+	if (poolsize % blocksize)
+		num++;
+
+	sih->ino = NOVA_TEST_PERF_INO;
+	sih->i_blk_type = blocktype;
+	sih->log_head = 0;
+	sih->log_tail = 0;
+
+	*allocated = nova_new_data_blocks(sb, sih, blocknr, 0, num,
+					  ALLOC_NO_INIT, cpu, ALLOC_FROM_HEAD);
+	if (*allocated < num) {
+		nova_dbg("%s: allocated pmem blocks %d < requested blocks %d\n",
+						__func__, *allocated, num);
+		if (*allocated > 0)
+			nova_free_data_blocks(sb, sih, *blocknr, *allocated);
+
+		return NULL;
+	}
+
+	blockoff = nova_get_block_off(sb, *blocknr, blocktype);
+	pool = nova_get_block(sb, blockoff);
+
+	return pool;
+}
+
+static void nova_free_pmem_pool(struct super_block *sb,
+	struct nova_inode_info_header *sih, char **pmem,
+	unsigned long blocknr, int num)
+{
+	if (num > 0)
+		nova_free_data_blocks(sb, sih, blocknr, num);
+	*pmem = NULL;
+}
+
+static int nova_test_func_perf(struct super_block *sb, unsigned int func_id,
+	size_t poolsize, size_t size, unsigned int disks)
+{
+	u64 csum = 12345, xor = 0;
+
+	u64 volatile result; // avoid results being optimized out
+	const char *fname = NULL;
+	char *src = NULL, *dst = NULL, *pmem = NULL;
+	char **data = NULL, *parity;
+	size_t off = 0;
+	int cpu, i, j, reps, err = 0, allocated = 0;
+	unsigned int call_id = 0, call_gid = 0;
+	unsigned long blocknr = 0, nsec, lat, thru;
+	struct nova_inode_info_header perf_sih;
+	const memcpy_call_t *fmemcpy = NULL;
+	const checksum_call_t *fchecksum = NULL;
+	const raid5_call_t *fraid5 = NULL;
+	timing_t perf_time;
+
+	cpu = get_cpu(); /* get cpu id and disable preemption */
+	reps = poolsize / size; /* raid calls will adjust this number */
+	call_id = func_id - 1; /* individual function id starting from 1 */
+
+	/* normal memcpy */
+	if (call_id < NUM_MEMCPY_CALLS) {
+		src = nova_alloc_vmem_pool(poolsize);
+		dst = nova_alloc_vmem_pool(poolsize);
+		if (src == NULL || dst == NULL) {
+			err = -ENOMEM;
+			goto out;
+		}
+
+		fmemcpy = &memcpy_calls[call_id];
+		fname = fmemcpy->name;
+		call_gid = memcpy_gid;
+
+		goto test;
+	}
+	call_id -= NUM_MEMCPY_CALLS;
+
+	/* memcpy from pmem */
+	if (call_id < NUM_FROM_PMEM_CALLS) {
+		pmem = nova_alloc_pmem_pool(sb, &perf_sih, cpu, poolsize,
+							&blocknr, &allocated);
+		dst = nova_alloc_vmem_pool(poolsize);
+		if (pmem == NULL || dst == NULL) {
+			err = -ENOMEM;
+			goto out;
+		}
+
+		fmemcpy = &from_pmem_calls[call_id];
+		fname = fmemcpy->name;
+		call_gid = from_pmem_gid;
+
+		goto test;
+	}
+	call_id -= NUM_FROM_PMEM_CALLS;
+
+	/* memcpy to pmem */
+	if (call_id < NUM_TO_PMEM_CALLS) {
+		src = nova_alloc_vmem_pool(poolsize);
+		pmem = nova_alloc_pmem_pool(sb, &perf_sih, cpu, poolsize,
+							&blocknr, &allocated);
+		if (src == NULL || pmem == NULL) {
+			err = -ENOMEM;
+			goto out;
+		}
+
+		fmemcpy = &to_pmem_calls[call_id];
+		fname = fmemcpy->name;
+		call_gid = to_pmem_gid;
+
+		goto test;
+	}
+	call_id -= NUM_TO_PMEM_CALLS;
+
+	/* checksum */
+	if (call_id < NUM_CHECKSUM_CALLS) {
+		src = nova_alloc_vmem_pool(poolsize);
+
+		fchecksum = &checksum_calls[call_id];
+		fname = fchecksum->name;
+		call_gid = checksum_gid;
+
+		goto test;
+	}
+	call_id -= NUM_CHECKSUM_CALLS;
+
+	/* raid5 */
+	if (call_id < NUM_RAID5_CALLS) {
+		src = nova_alloc_vmem_pool(poolsize);
+		data = kcalloc(disks, sizeof(char *), GFP_NOFS);
+		if (src == NULL || data == NULL) {
+			err = -ENOMEM;
+			goto out;
+		}
+
+		reps = poolsize / ((disks + 1) * size); /* +1 for parity */
+
+		fraid5 = &raid5_calls[call_id];
+		fname = fraid5->name;
+		call_gid = raid5_gid;
+
+		if (call_id == nova_block_csum_parity_id && disks != 8) {
+			nova_dbg("%s only for 8 disks, skip testing\n", fname);
+			goto out;
+		}
+
+		goto test;
+	}
+	call_id -= NUM_RAID5_CALLS;
+
+	/* continue with the next call group */
+
+test:
+	if (fmemcpy == NULL && fchecksum == NULL && fraid5 == NULL) {
+		nova_dbg("%s: function struct error\n", __func__);
+		err = -EFAULT;
+		goto out;
+	}
+
+	reset_perf_timer();
+	NOVA_START_TIMING(perf_t, perf_time);
+
+	switch (call_gid) {
+	case memcpy_gid:
+		for (i = 0; i < reps; i++, off += size)
+			err = fmemcpy->call(dst, src, off, size);
+		break;
+	case from_pmem_gid:
+		for (i = 0; i < reps; i++, off += size)
+			err = fmemcpy->call(dst, pmem, off, size);
+		break;
+	case to_pmem_gid:
+		nova_memunlock_range(sb, pmem, poolsize);
+		for (i = 0; i < reps; i++, off += size)
+			err = fmemcpy->call(pmem, src, off, size);
+		nova_memlock_range(sb, pmem, poolsize);
+		break;
+	case checksum_gid:
+		for (i = 0; i < reps; i++, off += size)
+			/* checksum calls are memory-read intensive */
+			csum = fchecksum->call(csum, src + off, size);
+		result = csum;
+		break;
+	case raid5_gid:
+		for (i = 0; i < reps; i++, off += (disks + 1) * size) {
+			for (j = 0; j < disks; j++)
+				data[j] = &src[off + j * size];
+			parity = src + off + disks * size;
+			xor = fraid5->call(data, parity, size, disks);
+		}
+		result = xor;
+		break;
+	default:
+		nova_dbg("%s: invalid function group %u\n", __func__, call_gid);
+		break;
+	}
+
+	NOVA_END_TIMING(perf_t, perf_time);
+	nsec = read_perf_timer();
+
+	// nova_info("checksum value: 0x%016llx\n", csum);
+
+	lat  = (err || reps == 0) ? 0 : nsec / reps;
+	if (call_gid == raid5_gid)
+		thru = (err) ? 0 : mb_per_sec(reps * disks * size, nsec);
+	else
+		thru = (err) ? 0 : mb_per_sec(reps * size, nsec);
+
+	if (cpu != smp_processor_id()) /* scheduling shouldn't happen */
+		nova_dbg("cpu was %d, now %d\n", cpu, smp_processor_id());
+
+	nova_info("%4u %25s %4u %8lu %8lu\n", func_id, fname, cpu, lat, thru);
+
+out:
+	nova_free_vmem_pool(src);
+	nova_free_vmem_pool(dst);
+	nova_free_pmem_pool(sb, &perf_sih, &pmem, blocknr, allocated);
+
+	if (data != NULL)
+		kfree(data);
+
+	put_cpu(); /* enable preemption */
+
+	if (err)
+		nova_dbg("%s: performance test aborted\n", __func__);
+	return err;
+}
+
+int nova_test_perf(struct super_block *sb, unsigned int func_id,
+	unsigned int poolmb, size_t size, unsigned int disks)
+{
+	int id, ret = 0;
+	size_t poolsize = (size_t)poolmb * 1024 * 1024;
+
+	if (!measure_timing) {
+		nova_dbg("%s: measure_timing not set!\n", __func__);
+		ret = -EFAULT;
+		goto out;
+	}
+	if (func_id > NUM_PERF_CALLS) {
+		nova_dbg("%s: invalid function id %u!\n", __func__, func_id);
+		ret = -EFAULT;
+		goto out;
+	}
+	if (poolmb < 1 || 1024 < poolmb) { /* limit pool size to 1GB */
+		nova_dbg("%s: invalid pool size %u MB!\n", __func__, poolmb);
+		ret = -EFAULT;
+		goto out;
+	}
+	if (size < 64 || poolsize < size || (size % 64)) {
+		nova_dbg("%s: invalid data size %zu!\n", __func__, size);
+		ret = -EFAULT;
+		goto out;
+	}
+	if (disks < 1 || 32 < disks) { /* limit number of disks */
+		nova_dbg("%s: invalid disk count %u!\n", __func__, disks);
+		ret = -EFAULT;
+		goto out;
+	}
+
+	nova_info("test function performance\n");
+	nova_info("pool size %u MB, work size %zu, disks %u\n",
+					poolmb, size, disks);
+
+	nova_info("%4s %25s %4s %8s %8s\n", "id", "name", "cpu", "ns", "MB/s");
+	nova_info("-------------------------------------------------------\n");
+	if (func_id == 0) {
+		/* individual function id starting from 1 */
+		for (id = 1; id <= NUM_PERF_CALLS; id++) {
+			ret = nova_test_func_perf(sb, id, poolsize,
+							size, disks);
+			if (ret < 0)
+				goto out;
+		}
+	} else {
+		ret = nova_test_func_perf(sb, func_id, poolsize, size, disks);
+	}
+	nova_info("-------------------------------------------------------\n");
+
+out:
+	return ret;
+}
diff --git a/fs/nova/perf.h b/fs/nova/perf.h
new file mode 100644
index 000000000000..94bee4674f2e
--- /dev/null
+++ b/fs/nova/perf.h
@@ -0,0 +1,96 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Performance test
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/zutil.h>
+#include <linux/libnvdimm.h>
+#include <linux/raid/xor.h>
+#include "nova.h"
+
+#define	reset_perf_timer()	__this_cpu_write(Timingstats_percpu[perf_t], 0)
+#define	read_perf_timer()	__this_cpu_read(Timingstats_percpu[perf_t])
+
+#define	mb_per_sec(size, nsec)	((nsec) == 0 ? 0 : \
+				((size) * (1000000000 / 1024 / 1024) / (nsec)))
+
+enum memcpy_call_id {
+	memcpy_read_id = 0,
+	memcpy_write_id,
+	memcpy_bidir_id,
+	NUM_MEMCPY_CALLS
+};
+
+enum from_pmem_call_id {
+	memcpy_mcsafe_id = 0,
+	NUM_FROM_PMEM_CALLS
+};
+
+enum to_pmem_call_id {
+	memcpy_to_pmem_nocache_id = 0,
+	flush_buffer_id,
+	memcpy_to_pmem_flush_id,
+	NUM_TO_PMEM_CALLS
+};
+
+enum checksum_call_id {
+	zlib_adler32_id = 0,
+	nd_fletcher64_id,
+	libcrc32c_id,
+	nova_crc32c_id,
+	plain_xor64_id,
+	NUM_CHECKSUM_CALLS
+};
+
+enum raid5_call_id {
+	nova_block_parity_id = 0,
+	nova_block_csum_parity_id,
+//	xor_blocks_id,
+	NUM_RAID5_CALLS
+};
+
+#define	NUM_PERF_CALLS	\
+	 (NUM_MEMCPY_CALLS + NUM_FROM_PMEM_CALLS + NUM_TO_PMEM_CALLS + \
+	  NUM_CHECKSUM_CALLS + NUM_RAID5_CALLS)
+
+enum call_group_id {
+	memcpy_gid = 0,
+	from_pmem_gid,
+	to_pmem_gid,
+	checksum_gid,
+	raid5_gid
+};
+
+typedef struct {
+	const char *name;                              /* name of this call */
+//	int (*valid)(void);            /* might need for availability check */
+	int (*call)(char *, char *, size_t, size_t); /* dst, src, off, size */
+} memcpy_call_t;
+
+typedef struct {
+	const char *name;                              /* name of this call */
+//	int (*valid)(void);            /* might need for availability check */
+	u64 (*call)(u64, char *, size_t);               /* init, data, size */
+} checksum_call_t;
+
+typedef struct {
+	const char *name;                              /* name of this call */
+//	int (*valid)(void);            /* might need for availability check */
+	u64 (*call)(char **, char *,                        /* data, parity */
+			size_t, int);          /* per-disk-size, data disks */
+} raid5_call_t;
diff --git a/fs/nova/stats.c b/fs/nova/stats.c
new file mode 100644
index 000000000000..cacf76f0d16d
--- /dev/null
+++ b/fs/nova/stats.c
@@ -0,0 +1,685 @@
+/*
+ * NOVA File System statistics
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#include "nova.h"
+
+const char *Timingstring[TIMING_NUM] = {
+	/* Init */
+	"================ Initialization ================",
+	"init",
+	"mount",
+	"ioremap",
+	"new_init",
+	"recovery",
+
+	/* Namei operations */
+	"============= Directory operations =============",
+	"create",
+	"lookup",
+	"link",
+	"unlink",
+	"symlink",
+	"mkdir",
+	"rmdir",
+	"mknod",
+	"rename",
+	"readdir",
+	"add_dentry",
+	"remove_dentry",
+	"setattr",
+	"setsize",
+
+	/* I/O operations */
+	"================ I/O operations ================",
+	"dax_read",
+	"cow_write",
+	"inplace_write",
+	"copy_to_nvmm",
+	"dax_get_block",
+	"read_iter",
+	"write_iter",
+
+	/* Memory operations */
+	"============== Memory operations ===============",
+	"memcpy_read_nvmm",
+	"memcpy_write_nvmm",
+	"memcpy_write_back_to_nvmm",
+	"handle_partial_block",
+
+	/* Memory management */
+	"============== Memory management ===============",
+	"alloc_blocks",
+	"new_data_blocks",
+	"new_log_blocks",
+	"free_blocks",
+	"free_data_blocks",
+	"free_log_blocks",
+
+	/* Transaction */
+	"================= Transaction ==================",
+	"transaction_new_inode",
+	"transaction_link_change",
+	"update_tail",
+
+	/* Logging */
+	"============= Logging operations ===============",
+	"append_dir_entry",
+	"append_file_entry",
+	"append_mmap_entry",
+	"append_link_change",
+	"append_setattr",
+	"append_snapshot_info",
+	"inplace_update_entry",
+
+	/* Tree */
+	"=============== Tree operations ================",
+	"checking_entry",
+	"assign_blocks",
+
+	/* GC */
+	"============= Garbage collection ===============",
+	"log_fast_gc",
+	"log_thorough_gc",
+	"check_invalid_log",
+
+	/* Integrity */
+	"============ Integrity operations ==============",
+	"block_csum",
+	"block_parity",
+	"block_csum_parity",
+	"protect_memcpy",
+	"protect_file_data",
+	"verify_entry_csum",
+	"verify_data_csum",
+	"calc_entry_csum",
+	"restore_file_data",
+	"reset_mapping",
+	"reset_vma",
+
+	/* Others */
+	"================ Miscellaneous =================",
+	"find_cache_page",
+	"fsync",
+	"write_pages",
+	"fallocate",
+	"direct_IO",
+	"free_old_entry",
+	"delete_file_tree",
+	"delete_dir_tree",
+	"new_vfs_inode",
+	"new_nova_inode",
+	"free_inode",
+	"free_inode_log",
+	"evict_inode",
+	"test_perf",
+	"wprotect",
+
+	/* Mmap */
+	"=============== MMap operations ================",
+	"mmap_page_fault",
+	"mmap_pmd_fault",
+	"mmap_pfn_mkwrite",
+	"insert_vma",
+	"remove_vma",
+	"set_vma_readonly",
+	"mmap_cow",
+	"update_mapping",
+	"update_pfn",
+	"mmap_handler",
+
+	/* Rebuild */
+	"=================== Rebuild ====================",
+	"rebuild_dir",
+	"rebuild_file",
+	"rebuild_snapshot_table",
+
+	/* Snapshot */
+	"=================== Snapshot ===================",
+	"create_snapshot",
+	"init_snapshot_info",
+	"delete_snapshot",
+	"append_snapshot_filedata",
+	"append_snapshot_inode",
+};
+
+u64 Timingstats[TIMING_NUM];
+DEFINE_PER_CPU(u64[TIMING_NUM], Timingstats_percpu);
+u64 Countstats[TIMING_NUM];
+DEFINE_PER_CPU(u64[TIMING_NUM], Countstats_percpu);
+u64 IOstats[STATS_NUM];
+DEFINE_PER_CPU(u64[STATS_NUM], IOstats_percpu);
+
+static void nova_print_alloc_stats(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	unsigned long alloc_log_count = 0;
+	unsigned long alloc_log_pages = 0;
+	unsigned long alloc_data_count = 0;
+	unsigned long alloc_data_pages = 0;
+	unsigned long free_log_count = 0;
+	unsigned long freed_log_pages = 0;
+	unsigned long free_data_count = 0;
+	unsigned long freed_data_pages = 0;
+	int i;
+
+	nova_info("=========== NOVA allocation stats ===========\n");
+	nova_info("Alloc %llu, alloc steps %llu, average %llu\n",
+		Countstats[new_data_blocks_t], IOstats[alloc_steps],
+		Countstats[new_data_blocks_t] ?
+			IOstats[alloc_steps] / Countstats[new_data_blocks_t]
+			: 0);
+	nova_info("Free %llu\n", Countstats[free_data_t]);
+	nova_info("Fast GC %llu, checked pages %llu, free pages %llu, average %llu\n",
+		Countstats[fast_gc_t], IOstats[fast_checked_pages],
+		IOstats[fast_gc_pages], Countstats[fast_gc_t] ?
+			IOstats[fast_gc_pages] / Countstats[fast_gc_t] : 0);
+	nova_info("Thorough GC %llu, checked pages %llu, free pages %llu, average %llu\n",
+		Countstats[thorough_gc_t],
+		IOstats[thorough_checked_pages], IOstats[thorough_gc_pages],
+		Countstats[thorough_gc_t] ?
+			IOstats[thorough_gc_pages] / Countstats[thorough_gc_t]
+			: 0);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+
+		alloc_log_count += free_list->alloc_log_count;
+		alloc_log_pages += free_list->alloc_log_pages;
+		alloc_data_count += free_list->alloc_data_count;
+		alloc_data_pages += free_list->alloc_data_pages;
+		free_log_count += free_list->free_log_count;
+		freed_log_pages += free_list->freed_log_pages;
+		free_data_count += free_list->free_data_count;
+		freed_data_pages += free_list->freed_data_pages;
+	}
+
+	nova_info("alloc log count %lu, allocated log pages %lu, alloc data count %lu, allocated data pages %lu, free log count %lu, freed log pages %lu, free data count %lu, freed data pages %lu\n",
+		alloc_log_count, alloc_log_pages,
+		alloc_data_count, alloc_data_pages,
+		free_log_count, freed_log_pages,
+		free_data_count, freed_data_pages);
+}
+
+static void nova_print_IO_stats(struct super_block *sb)
+{
+	nova_info("=========== NOVA I/O stats ===========\n");
+	nova_info("Read %llu, bytes %llu, average %llu\n",
+		Countstats[dax_read_t], IOstats[read_bytes],
+		Countstats[dax_read_t] ?
+			IOstats[read_bytes] / Countstats[dax_read_t] : 0);
+	nova_info("COW write %llu, bytes %llu, average %llu, write breaks %llu, average %llu\n",
+		Countstats[cow_write_t], IOstats[cow_write_bytes],
+		Countstats[cow_write_t] ?
+			IOstats[cow_write_bytes] / Countstats[cow_write_t] : 0,
+		IOstats[cow_write_breaks], Countstats[cow_write_t] ?
+			IOstats[cow_write_breaks] / Countstats[cow_write_t]
+			: 0);
+	nova_info("Inplace write %llu, bytes %llu, average %llu, write breaks %llu, average %llu\n",
+		Countstats[inplace_write_t], IOstats[inplace_write_bytes],
+		Countstats[inplace_write_t] ?
+			IOstats[inplace_write_bytes] /
+			Countstats[inplace_write_t] : 0,
+		IOstats[inplace_write_breaks], Countstats[inplace_write_t] ?
+			IOstats[inplace_write_breaks] /
+			Countstats[inplace_write_t] : 0);
+}
+
+void nova_get_timing_stats(void)
+{
+	int i;
+	int cpu;
+
+	for (i = 0; i < TIMING_NUM; i++) {
+		Timingstats[i] = 0;
+		Countstats[i] = 0;
+		for_each_possible_cpu(cpu) {
+			Timingstats[i] += per_cpu(Timingstats_percpu[i], cpu);
+			Countstats[i] += per_cpu(Countstats_percpu[i], cpu);
+		}
+	}
+}
+
+void nova_get_IO_stats(void)
+{
+	int i;
+	int cpu;
+
+	for (i = 0; i < STATS_NUM; i++) {
+		IOstats[i] = 0;
+		for_each_possible_cpu(cpu)
+			IOstats[i] += per_cpu(IOstats_percpu[i], cpu);
+	}
+}
+
+void nova_print_timing_stats(struct super_block *sb)
+{
+	int i;
+
+	nova_get_timing_stats();
+	nova_get_IO_stats();
+
+	nova_info("=========== NOVA kernel timing stats ============\n");
+	for (i = 0; i < TIMING_NUM; i++) {
+		/* Title */
+		if (Timingstring[i][0] == '=') {
+			nova_info("\n%s\n\n", Timingstring[i]);
+			continue;
+		}
+
+		if (measure_timing || Timingstats[i]) {
+			nova_info("%s: count %llu, timing %llu, average %llu\n",
+				Timingstring[i],
+				Countstats[i],
+				Timingstats[i],
+				Countstats[i] ?
+				Timingstats[i] / Countstats[i] : 0);
+		} else {
+			nova_info("%s: count %llu\n",
+				Timingstring[i],
+				Countstats[i]);
+		}
+	}
+
+	nova_info("\n");
+	nova_print_alloc_stats(sb);
+	nova_print_IO_stats(sb);
+}
+
+static void nova_clear_timing_stats(void)
+{
+	int i;
+	int cpu;
+
+	for (i = 0; i < TIMING_NUM; i++) {
+		Countstats[i] = 0;
+		Timingstats[i] = 0;
+		for_each_possible_cpu(cpu) {
+			per_cpu(Timingstats_percpu[i], cpu) = 0;
+			per_cpu(Countstats_percpu[i], cpu) = 0;
+		}
+	}
+}
+
+static void nova_clear_IO_stats(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	int i;
+	int cpu;
+
+	for (i = 0; i < STATS_NUM; i++) {
+		IOstats[i] = 0;
+		for_each_possible_cpu(cpu)
+			per_cpu(IOstats_percpu[i], cpu) = 0;
+	}
+
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+
+		free_list->alloc_log_count = 0;
+		free_list->alloc_log_pages = 0;
+		free_list->alloc_data_count = 0;
+		free_list->alloc_data_pages = 0;
+		free_list->free_log_count = 0;
+		free_list->freed_log_pages = 0;
+		free_list->free_data_count = 0;
+		free_list->freed_data_pages = 0;
+	}
+}
+
+void nova_clear_stats(struct super_block *sb)
+{
+	nova_clear_timing_stats();
+	nova_clear_IO_stats(sb);
+}
+
+void nova_print_inode(struct nova_inode *pi)
+{
+	nova_dbg("%s: NOVA inode %llu\n", __func__, pi->nova_ino);
+	nova_dbg("valid %u, deleted %u, blk type %u, flags %u\n",
+		pi->valid, pi->deleted, pi->i_blk_type, pi->i_flags);
+	nova_dbg("size %llu, ctime %u, mtime %u, atime %u\n",
+		pi->i_size, pi->i_ctime, pi->i_mtime, pi->i_atime);
+	nova_dbg("mode %u, links %u, xattr 0x%llx, csum %u\n",
+		pi->i_mode, pi->i_links_count, pi->i_xattr, pi->csum);
+	nova_dbg("uid %u, gid %u, gen %u, create time %u\n",
+		pi->i_uid, pi->i_gid, pi->i_generation, pi->i_create_time);
+	nova_dbg("head 0x%llx, tail 0x%llx, alter head 0x%llx, tail 0x%llx\n",
+		pi->log_head, pi->log_tail, pi->alter_log_head,
+		pi->alter_log_tail);
+	nova_dbg("create epoch id %llu, delete epoch id %llu\n",
+		pi->create_epoch_id, pi->delete_epoch_id);
+}
+
+static inline void nova_print_file_write_entry(struct super_block *sb,
+	u64 curr, struct nova_file_write_entry *entry)
+{
+	nova_dbg("file write entry @ 0x%llx: epoch %llu, trans %llu, pgoff %llu, pages %u, blocknr %llu, reassigned %u, updating %u, invalid count %u, size %llu, mtime %u\n",
+			curr, entry->epoch_id, entry->trans_id,
+			entry->pgoff, entry->num_pages,
+			entry->block >> PAGE_SHIFT,
+			entry->reassigned, entry->updating,
+			entry->invalid_pages, entry->size, entry->mtime);
+}
+
+static inline void nova_print_set_attr_entry(struct super_block *sb,
+	u64 curr, struct nova_setattr_logentry *entry)
+{
+	nova_dbg("set attr entry @ 0x%llx: epoch %llu, trans %llu, invalid %u, mode %u, size %llu, atime %u, mtime %u, ctime %u\n",
+			curr, entry->epoch_id, entry->trans_id,
+			entry->invalid, entry->mode,
+			entry->size, entry->atime, entry->mtime, entry->ctime);
+}
+
+static inline void nova_print_link_change_entry(struct super_block *sb,
+	u64 curr, struct nova_link_change_entry *entry)
+{
+	nova_dbg("link change entry @ 0x%llx: epoch %llu, trans %llu, invalid %u, links %u, flags %u, ctime %u\n",
+			curr, entry->epoch_id, entry->trans_id,
+			entry->invalid, entry->links,
+			entry->flags, entry->ctime);
+}
+
+static inline void nova_print_mmap_entry(struct super_block *sb,
+	u64 curr, struct nova_mmap_entry *entry)
+{
+	nova_dbg("mmap write entry @ 0x%llx: epoch %llu, invalid %u, pgoff %llu, pages %llu\n",
+			curr, entry->epoch_id, entry->invalid,
+			entry->pgoff, entry->num_pages);
+}
+
+static inline void nova_print_snapshot_info_entry(struct super_block *sb,
+	u64 curr, struct nova_snapshot_info_entry *entry)
+{
+	nova_dbg("snapshot info entry @ 0x%llx: epoch %llu, deleted %u, timestamp %llu\n",
+			curr, entry->epoch_id, entry->deleted,
+			entry->timestamp);
+}
+
+static inline size_t nova_print_dentry(struct super_block *sb,
+	u64 curr, struct nova_dentry *entry)
+{
+	nova_dbg("dir logentry @ 0x%llx: epoch %llu, trans %llu, reassigned %u, invalid %u, inode %llu, links %u, namelen %u, rec len %u, name %s, mtime %u\n",
+			curr, entry->epoch_id, entry->trans_id,
+			entry->reassigned, entry->invalid,
+			le64_to_cpu(entry->ino),
+			entry->links_count, entry->name_len,
+			le16_to_cpu(entry->de_len), entry->name,
+			entry->mtime);
+
+	return le16_to_cpu(entry->de_len);
+}
+
+u64 nova_print_log_entry(struct super_block *sb, u64 curr)
+{
+	void *addr;
+	size_t size;
+	u8 type;
+
+	addr = (void *)nova_get_block(sb, curr);
+	type = nova_get_entry_type(addr);
+	switch (type) {
+	case SET_ATTR:
+		nova_print_set_attr_entry(sb, curr, addr);
+		curr += sizeof(struct nova_setattr_logentry);
+		break;
+	case LINK_CHANGE:
+		nova_print_link_change_entry(sb, curr, addr);
+		curr += sizeof(struct nova_link_change_entry);
+		break;
+	case MMAP_WRITE:
+		nova_print_mmap_entry(sb, curr, addr);
+		curr += sizeof(struct nova_mmap_entry);
+		break;
+	case SNAPSHOT_INFO:
+		nova_print_snapshot_info_entry(sb, curr, addr);
+		curr += sizeof(struct nova_snapshot_info_entry);
+		break;
+	case FILE_WRITE:
+		nova_print_file_write_entry(sb, curr, addr);
+		curr += sizeof(struct nova_file_write_entry);
+		break;
+	case DIR_LOG:
+		size = nova_print_dentry(sb, curr, addr);
+		curr += size;
+		if (size == 0) {
+			nova_dbg("%s: dentry with size 0 @ 0x%llx\n",
+					__func__, curr);
+			curr += sizeof(struct nova_file_write_entry);
+			NOVA_ASSERT(0);
+		}
+		break;
+	case NEXT_PAGE:
+		nova_dbg("%s: next page sign @ 0x%llx\n", __func__, curr);
+		curr = PAGE_TAIL(curr);
+		break;
+	default:
+		nova_dbg("%s: unknown type %d, 0x%llx\n", __func__, type, curr);
+		curr += sizeof(struct nova_file_write_entry);
+		NOVA_ASSERT(0);
+		break;
+	}
+
+	return curr;
+}
+
+void nova_print_curr_log_page(struct super_block *sb, u64 curr)
+{
+	struct nova_inode_page_tail *tail;
+	u64 start, end;
+
+	start = BLOCK_OFF(curr);
+	end = PAGE_TAIL(curr);
+
+	while (start < end)
+		start = nova_print_log_entry(sb, start);
+
+	tail = nova_get_block(sb, end);
+	nova_dbg("Page tail. curr 0x%llx, next page 0x%llx, %u entries, %u invalid\n",
+			start, tail->next_page,
+			tail->num_entries, tail->invalid_entries);
+}
+
+void nova_print_nova_log(struct super_block *sb,
+	struct nova_inode_info_header *sih)
+{
+	u64 curr;
+
+	if (sih->log_tail == 0 || sih->log_head == 0)
+		return;
+
+	curr = sih->log_head;
+	nova_dbg("Pi %lu: log head 0x%llx, tail 0x%llx\n",
+			sih->ino, curr, sih->log_tail);
+	while (curr != sih->log_tail) {
+		if ((curr & (PAGE_SIZE - 1)) == LOG_BLOCK_TAIL) {
+			struct nova_inode_page_tail *tail =
+					nova_get_block(sb, curr);
+			nova_dbg("Log tail, curr 0x%llx, next page 0x%llx, %u entries, %u invalid\n",
+					curr, tail->next_page,
+					tail->num_entries,
+					tail->invalid_entries);
+			curr = tail->next_page;
+		} else {
+			curr = nova_print_log_entry(sb, curr);
+		}
+	}
+}
+
+void nova_print_inode_log(struct super_block *sb, struct inode *inode)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+
+	nova_print_nova_log(sb, sih);
+}
+
+int nova_get_nova_log_pages(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_inode *pi)
+{
+	struct nova_inode_log_page *curr_page;
+	u64 curr, next;
+	int count = 1;
+
+	if (pi->log_head == 0 || pi->log_tail == 0) {
+		nova_dbg("Pi %lu has no log\n", sih->ino);
+		return 0;
+	}
+
+	curr = pi->log_head;
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb, curr);
+	while ((next = curr_page->page_tail.next_page) != 0) {
+		curr = next;
+		curr_page = (struct nova_inode_log_page *)
+			nova_get_block(sb, curr);
+		count++;
+	}
+
+	return count;
+}
+
+void nova_print_nova_log_pages(struct super_block *sb,
+	struct nova_inode_info_header *sih)
+{
+	struct nova_inode_log_page *curr_page;
+	u64 curr, next;
+	int count = 1;
+	int used = count;
+
+	if (sih->log_head == 0 || sih->log_tail == 0) {
+		nova_dbg("Pi %lu has no log\n", sih->ino);
+		return;
+	}
+
+	curr = sih->log_head;
+	nova_dbg("Pi %lu: log head @ 0x%llx, tail @ 0x%llx\n",
+			sih->ino, curr, sih->log_tail);
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb, curr);
+	while ((next = curr_page->page_tail.next_page) != 0) {
+		nova_dbg("Current page 0x%llx, next page 0x%llx, %u entries, %u invalid\n",
+			curr >> PAGE_SHIFT, next >> PAGE_SHIFT,
+			curr_page->page_tail.num_entries,
+			curr_page->page_tail.invalid_entries);
+		if (sih->log_tail >> PAGE_SHIFT == curr >> PAGE_SHIFT)
+			used = count;
+		curr = next;
+		curr_page = (struct nova_inode_log_page *)
+			nova_get_block(sb, curr);
+		count++;
+	}
+	if (sih->log_tail >> PAGE_SHIFT == curr >> PAGE_SHIFT)
+		used = count;
+	nova_dbg("Pi %lu: log used %d pages, has %d pages, si reports %lu pages\n",
+		sih->ino, used, count,
+		sih->log_pages);
+}
+
+void nova_print_inode_log_pages(struct super_block *sb, struct inode *inode)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+
+	nova_print_nova_log_pages(sb, sih);
+}
+
+int nova_check_inode_logs(struct super_block *sb, struct nova_inode *pi)
+{
+	int count1 = 0;
+	int count2 = 0;
+	int tail1_at = 0;
+	int tail2_at = 0;
+	u64 curr, alter_curr;
+
+	curr = pi->log_head;
+	alter_curr = pi->alter_log_head;
+
+	while (curr && alter_curr) {
+		if (alter_log_page(sb, curr) != alter_curr ||
+				alter_log_page(sb, alter_curr) != curr)
+			nova_dbg("Inode %llu page %d: curr 0x%llx, alter 0x%llx, alter_curr 0x%llx, alter 0x%llx\n",
+					pi->nova_ino, count1,
+					curr, alter_log_page(sb, curr),
+					alter_curr,
+					alter_log_page(sb, alter_curr));
+
+		count1++;
+		count2++;
+		if ((curr >> PAGE_SHIFT) == (pi->log_tail >> PAGE_SHIFT))
+			tail1_at = count1;
+		if ((alter_curr >> PAGE_SHIFT) ==
+				(pi->alter_log_tail >> PAGE_SHIFT))
+			tail2_at = count2;
+		curr = next_log_page(sb, curr);
+		alter_curr = next_log_page(sb, alter_curr);
+	}
+
+	while (curr) {
+		count1++;
+		if ((curr >> PAGE_SHIFT) == (pi->log_tail >> PAGE_SHIFT))
+			tail1_at = count1;
+		curr = next_log_page(sb, curr);
+	}
+
+	while (alter_curr) {
+		count2++;
+		if ((alter_curr >> PAGE_SHIFT) ==
+				(pi->alter_log_tail >> PAGE_SHIFT))
+			tail2_at = count2;
+		alter_curr = next_log_page(sb, alter_curr);
+	}
+
+	nova_dbg("Log1 %d pages, tail @ page %d\n", count1, tail1_at);
+	nova_dbg("Log2 %d pages, tail @ page %d\n", count2, tail2_at);
+
+	return 0;
+}
+
+void nova_print_free_lists(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	int i;
+
+	nova_dbg("======== NOVA per-CPU free list allocation stats ========\n");
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+		nova_dbg("Free list %d: block start %lu, block end %lu, num_blocks %lu, num_free_blocks %lu, blocknode %lu\n",
+			i, free_list->block_start, free_list->block_end,
+			free_list->block_end - free_list->block_start + 1,
+			free_list->num_free_blocks, free_list->num_blocknode);
+
+		nova_dbg("Free list %d: csum start %lu, replica csum start %lu, csum blocks %lu, parity start %lu, parity blocks %lu\n",
+			i, free_list->csum_start, free_list->replica_csum_start,
+			free_list->num_csum_blocks,
+			free_list->parity_start, free_list->num_parity_blocks);
+
+		nova_dbg("Free list %d: alloc log count %lu, allocated log pages %lu, alloc data count %lu, allocated data pages %lu, free log count %lu, freed log pages %lu, free data count %lu, freed data pages %lu\n",
+			 i,
+			 free_list->alloc_log_count,
+			 free_list->alloc_log_pages,
+			 free_list->alloc_data_count,
+			 free_list->alloc_data_pages,
+			 free_list->free_log_count,
+			 free_list->freed_log_pages,
+			 free_list->free_data_count,
+			 free_list->freed_data_pages);
+	}
+}
diff --git a/fs/nova/stats.h b/fs/nova/stats.h
new file mode 100644
index 000000000000..766ba0a77872
--- /dev/null
+++ b/fs/nova/stats.h
@@ -0,0 +1,218 @@
+/*
+ * NOVA File System statistics
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+
+/* ======================= Timing ========================= */
+enum timing_category {
+	/* Init */
+	init_title_t,
+	init_t,
+	mount_t,
+	ioremap_t,
+	new_init_t,
+	recovery_t,
+
+	/* Namei operations */
+	namei_title_t,
+	create_t,
+	lookup_t,
+	link_t,
+	unlink_t,
+	symlink_t,
+	mkdir_t,
+	rmdir_t,
+	mknod_t,
+	rename_t,
+	readdir_t,
+	add_dentry_t,
+	remove_dentry_t,
+	setattr_t,
+	setsize_t,
+
+	/* I/O operations */
+	io_title_t,
+	dax_read_t,
+	cow_write_t,
+	inplace_write_t,
+	copy_to_nvmm_t,
+	dax_get_block_t,
+	read_iter_t,
+	write_iter_t,
+
+	/* Memory operations */
+	memory_title_t,
+	memcpy_r_nvmm_t,
+	memcpy_w_nvmm_t,
+	memcpy_w_wb_t,
+	partial_block_t,
+
+	/* Memory management */
+	mm_title_t,
+	new_blocks_t,
+	new_data_blocks_t,
+	new_log_blocks_t,
+	free_blocks_t,
+	free_data_t,
+	free_log_t,
+
+	/* Transaction */
+	trans_title_t,
+	create_trans_t,
+	link_trans_t,
+	update_tail_t,
+
+	/* Logging */
+	logging_title_t,
+	append_dir_entry_t,
+	append_file_entry_t,
+	append_mmap_entry_t,
+	append_link_change_t,
+	append_setattr_t,
+	append_snapshot_info_t,
+	update_entry_t,
+
+	/* Tree */
+	tree_title_t,
+	check_entry_t,
+	assign_t,
+
+	/* GC */
+	gc_title_t,
+	fast_gc_t,
+	thorough_gc_t,
+	check_invalid_t,
+
+	/* Integrity */
+	integrity_title_t,
+	block_csum_t,
+	block_parity_t,
+	block_csum_parity_t,
+	protect_memcpy_t,
+	protect_file_data_t,
+	verify_entry_csum_t,
+	verify_data_csum_t,
+	calc_entry_csum_t,
+	restore_data_t,
+	reset_mapping_t,
+	reset_vma_t,
+
+	/* Others */
+	others_title_t,
+	find_cache_t,
+	fsync_t,
+	write_pages_t,
+	fallocate_t,
+	direct_IO_t,
+	free_old_t,
+	delete_file_tree_t,
+	delete_dir_tree_t,
+	new_vfs_inode_t,
+	new_nova_inode_t,
+	free_inode_t,
+	free_inode_log_t,
+	evict_inode_t,
+	perf_t,
+	wprotect_t,
+
+	/* Mmap */
+	mmap_title_t,
+	mmap_fault_t,
+	pmd_fault_t,
+	pfn_mkwrite_t,
+	insert_vma_t,
+	remove_vma_t,
+	set_vma_read_t,
+	mmap_cow_t,
+	update_mapping_t,
+	update_pfn_t,
+	mmap_handler_t,
+
+	/* Rebuild */
+	rebuild_title_t,
+	rebuild_dir_t,
+	rebuild_file_t,
+	rebuild_snapshot_t,
+
+	/* Snapshot */
+	snapshot_title_t,
+	create_snapshot_t,
+	init_snapshot_info_t,
+	delete_snapshot_t,
+	append_snapshot_file_t,
+	append_snapshot_inode_t,
+
+	/* Sentinel */
+	TIMING_NUM,
+};
+
+enum stats_category {
+	alloc_steps,
+	cow_write_breaks,
+	inplace_write_breaks,
+	read_bytes,
+	cow_write_bytes,
+	inplace_write_bytes,
+	fast_checked_pages,
+	thorough_checked_pages,
+	fast_gc_pages,
+	thorough_gc_pages,
+	dirty_pages,
+	protect_head,
+	protect_tail,
+	block_csum_parity,
+	dax_cow_during_snapshot,
+	mapping_updated_pages,
+	cow_overlap_mmap,
+	dax_new_blocks,
+	inplace_new_blocks,
+	fdatasync,
+
+	/* Sentinel */
+	STATS_NUM,
+};
+
+extern const char *Timingstring[TIMING_NUM];
+extern u64 Timingstats[TIMING_NUM];
+DECLARE_PER_CPU(u64[TIMING_NUM], Timingstats_percpu);
+extern u64 Countstats[TIMING_NUM];
+DECLARE_PER_CPU(u64[TIMING_NUM], Countstats_percpu);
+extern u64 IOstats[STATS_NUM];
+DECLARE_PER_CPU(u64[STATS_NUM], IOstats_percpu);
+
+typedef struct timespec timing_t;
+
+#define NOVA_START_TIMING(name, start) \
+	{if (measure_timing) getrawmonotonic(&start); }
+
+#define NOVA_END_TIMING(name, start) \
+	{if (measure_timing) { \
+		timing_t end; \
+		getrawmonotonic(&end); \
+		__this_cpu_add(Timingstats_percpu[name], \
+			(end.tv_sec - start.tv_sec) * 1000000000 + \
+			(end.tv_nsec - start.tv_nsec)); \
+	} \
+	__this_cpu_add(Countstats_percpu[name], 1); \
+	}
+
+#define NOVA_STATS_ADD(name, value) \
+	{__this_cpu_add(IOstats_percpu[name], value); }
+
+

* [RFC 15/16] NOVA: Performance measurement
@ 2017-08-03  7:49   ` Steven Swanson
From: Steven Swanson @ 2017-08-03  7:49 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson, dan.j.williams

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
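Note on function ids: nova_test_func_perf() in this patch takes a flat,
1-based func_id and walks the call groups in order (memcpy, from-pmem,
to-pmem, checksum, raid5), subtracting each group's size until the id falls
inside one. A minimal standalone sketch of that lookup follows. The sketch_
names are hypothetical and the group sizes are copied from the enums in
perf.h as posted. Id 0 is rejected here, whereas nova_test_perf() treats it
as "run every function".

```c
#include <stddef.h>

/* Group sizes mirror NUM_MEMCPY_CALLS ... NUM_RAID5_CALLS in perf.h. */
static const unsigned int sketch_group_sizes[] = { 3, 1, 3, 5, 2 };

#define SKETCH_NUM_GROUPS \
	(sizeof(sketch_group_sizes) / sizeof(sketch_group_sizes[0]))

/*
 * Resolve a 1-based flat function id into (group, index within group).
 * Returns 0 on success, -1 if the id is 0 or lies past the last group.
 */
static int sketch_resolve_func_id(unsigned int func_id,
				  unsigned int *group, unsigned int *index)
{
	unsigned int call_id;
	size_t gid;

	if (func_id == 0)
		return -1;
	call_id = func_id - 1;

	for (gid = 0; gid < SKETCH_NUM_GROUPS; gid++) {
		if (call_id < sketch_group_sizes[gid]) {
			*group = (unsigned int)gid;
			*index = call_id;
			return 0;
		}
		/* Not in this group: rebase the id for the next one. */
		call_id -= sketch_group_sizes[gid];
	}
	return -1;
}
```

With the sizes above, ids 1-3 select the memcpy group, 4 selects
memcpy_mcsafe, 5-7 the to-pmem group, 8-12 the checksums and 13-14 the two
raid5 calls.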
 fs/nova/perf.c  |  594 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/perf.h  |   96 ++++++++
 fs/nova/stats.c |  685 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/stats.h |  218 ++++++++++++++++++
 4 files changed, 1593 insertions(+)
 create mode 100644 fs/nova/perf.c
 create mode 100644 fs/nova/perf.h
 create mode 100644 fs/nova/stats.c
 create mode 100644 fs/nova/stats.h
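
Note on the parity test: the non-SSE branch of nova_block_parity_call() in
this patch XORs the same 8-byte word position across stripes that are laid
out back to back, and stores the result in the parity buffer. A portable
sketch of that loop follows, assuming strp_size is a multiple of 8 and there
is at least one stripe. sketch_block_parity is a hypothetical name, and
memcpy is used for the word loads and stores to sidestep alignment and
aliasing concerns, where the kernel code uses direct u64 casts.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * XOR num_strps stripes of strp_size bytes each, stored contiguously in
 * block, into parity (strp_size bytes). Assumes strp_size % 8 == 0 and
 * num_strps >= 1.
 */
static void sketch_block_parity(const unsigned char *block,
				unsigned char *parity,
				size_t strp_size, unsigned int num_strps)
{
	size_t i;
	unsigned int strp;

	for (i = 0; i < strp_size; i += 8) {
		uint64_t acc, word;

		/* Start from the word in stripe 0. */
		memcpy(&acc, block + i, sizeof(acc));
		for (strp = 1; strp < num_strps; strp++) {
			memcpy(&word, block + strp * strp_size + i,
			       sizeof(word));
			acc ^= word;
		}
		memcpy(parity + i, &acc, sizeof(acc));
	}
}
```

Because XOR is its own inverse, XORing the parity with all but one stripe
reproduces the missing stripe, which is the property the raid5-style
protection relies on.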

diff --git a/fs/nova/perf.c b/fs/nova/perf.c
new file mode 100644
index 000000000000..35a4c6a490c3
--- /dev/null
+++ b/fs/nova/perf.c
@@ -0,0 +1,594 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Performance test routines
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include "perf.h"
+
+/* normal memcpy functions */
+static int memcpy_read_call(char *dst, char *src, size_t off, size_t size)
+{
+	/* pin dst address to cache most writes, if size fits */
+	memcpy(dst, src + off, size);
+	return 0;
+}
+
+static int memcpy_write_call(char *dst, char *src, size_t off, size_t size)
+{
+	/* pin src address to cache most reads, if size fits */
+	memcpy(dst + off, src, size);
+	return 0;
+}
+
+static int memcpy_bidir_call(char *dst, char *src, size_t off, size_t size)
+{
+	/* minimize caching by advancing both src and dst */
+	memcpy(dst + off, src + off, size);
+	return 0;
+}
+
+static const memcpy_call_t memcpy_calls[] = {
+	/* order should match enum memcpy_call_id */
+	{ "memcpy (mostly read)",  memcpy_read_call },
+	{ "memcpy (mostly write)", memcpy_write_call },
+	{ "memcpy (read write)",   memcpy_bidir_call }
+};
+
+/* copy from pmem functions */
+static int from_pmem_call(char *dst, char *src, size_t off, size_t size)
+{
+	/* pin dst address to cache most writes, if size fits */
+	/* src address should point to pmem */
+	memcpy_mcsafe(dst, src + off, size);
+	return 0;
+}
+
+static const memcpy_call_t from_pmem_calls[] = {
+	/* order should match enum from_pmem_call_id */
+	{ "memcpy_mcsafe", from_pmem_call }
+};
+
+/* copy to pmem functions */
+static int to_pmem_nocache_call(char *dst, char *src, size_t off, size_t size)
+{
+	/* pin src address to cache most reads, if size fits */
+	/* dst address should point to pmem */
+	memcpy_to_pmem_nocache(dst + off, src, size);
+	return 0;
+}
+
+static int to_flush_call(char *dst, char *src, size_t off, size_t size)
+{
+	/* src is unused: only the flush of the dst range is measured */
+	/* dst address should point to pmem */
+	nova_flush_buffer(dst + off, size, 0);
+	return 0;
+}
+
+static int to_pmem_flush_call(char *dst, char *src, size_t off, size_t size)
+{
+	/* pin src address to cache most reads, if size fits */
+	/* dst address should point to pmem */
+	memcpy(dst + off, src, size);
+	nova_flush_buffer(dst + off, size, 0);
+	return 0;
+}
+
+static const memcpy_call_t to_pmem_calls[] = {
+	/* order should match enum to_pmem_call_id */
+	{ "memcpy_to_pmem_nocache", to_pmem_nocache_call },
+	{ "flush buffer",	    to_flush_call },
+	{ "memcpy + flush buffer",  to_pmem_flush_call }
+};
+
+/* checksum functions */
+static u64 zlib_adler32_call(u64 init, char *data, size_t size)
+{
+	u64 csum;
+
+	/* include/linux/zutil.h */
+	csum = zlib_adler32(init, data, size);
+	return csum;
+}
+
+static u64 nd_fletcher64_call(u64 init, char *data, size_t size)
+{
+	u64 csum;
+
+	/* drivers/nvdimm/core.c */
+	csum = nd_fletcher64(data, size, 1);
+	return csum;
+}
+
+static u64 libcrc32c_call(u64 init, char *data, size_t size)
+{
+	u32 crc = (u32) init;
+
+	crc = crc32c(crc, data, size);
+	return (u64) crc;
+}
+
+static u64 nova_crc32c_call(u64 init, char *data, size_t size)
+{
+	u32 crc = (u32) init;
+
+	crc = nova_crc32c(crc, data, size);
+	return (u64) crc;
+}
+
+static u64 plain_xor64_call(u64 init, char *data, size_t size)
+{
+	u64 csum = init;
+	u64 *word = (u64 *) data;
+
+	while (size >= 8) {
+		csum ^= *word;
+		word += 1;
+		size -= 8;
+	}
+
+	/* for perf testing ignore trailing bytes, if any */
+
+	return csum;
+}
+
+static const checksum_call_t checksum_calls[] = {
+	/* order should match enum checksum_call_id */
+	{ "zlib_adler32",  zlib_adler32_call },
+	{ "nd_fletcher64", nd_fletcher64_call },
+	{ "libcrc32c",     libcrc32c_call },
+	{ "nova_crc32c",   nova_crc32c_call },
+	{ "plain_xor64",   plain_xor64_call }
+};
+
+/* raid5 functions */
+static u64 nova_block_parity_call(char **data, char *parity,
+	size_t size, int disks)
+{
+	int i, j, strp, num_strps = disks;
+	size_t strp_size = size;
+	char *block = *data;
+	u64 xor;
+
+	/* FIXME: using same code as in parity.c; need a way to reuse that */
+
+	if (static_cpu_has(X86_FEATURE_XMM2)) { // sse2 128b
+		for (i = 0; i < strp_size; i += 16) {
+			asm volatile("movdqa %0, %%xmm0" : : "m" (block[i]));
+			for (strp = 1; strp < num_strps; strp++) {
+				j = strp * strp_size + i;
+				asm volatile(
+					"movdqa     %0, %%xmm1\n"
+					"pxor   %%xmm1, %%xmm0\n"
+					: : "m" (block[j])
+				);
+			}
+			asm volatile("movntdq %%xmm0, %0" : "=m" (parity[i]));
+		}
+	} else { // common 64b
+		for (i = 0; i < strp_size; i += 8) {
+			xor = *((u64 *) &block[i]);
+			for (strp = 1; strp < num_strps; strp++) {
+				j = strp * strp_size + i;
+				xor ^= *((u64 *) &block[j]);
+			}
+			*((u64 *) &parity[i]) = xor;
+		}
+	}
+
+	return *((u64 *) parity);
+}
+
+static u64 nova_block_csum_parity_call(char **data, char *parity,
+	size_t size, int disks)
+{
+	int i;
+	size_t strp_size = size;
+	char *block = *data;
+	u32 volatile crc[8]; // avoid results being optimized out
+	u64 qwd[8];
+	u64 acc[8] = {0, 0, 0, 0, 0, 0, 0, 0};
+
+	/* FIXME: using same code as in parity.c; need a way to reuse that */
+
+	for (i = 0; i < strp_size / 8; i++) {
+		qwd[0] = *((u64 *) (block));
+		qwd[1] = *((u64 *) (block + 1 * strp_size));
+		qwd[2] = *((u64 *) (block + 2 * strp_size));
+		qwd[3] = *((u64 *) (block + 3 * strp_size));
+		qwd[4] = *((u64 *) (block + 4 * strp_size));
+		qwd[5] = *((u64 *) (block + 5 * strp_size));
+		qwd[6] = *((u64 *) (block + 6 * strp_size));
+		qwd[7] = *((u64 *) (block + 7 * strp_size));
+
+		// if (data_csum > 0 && unroll_csum) {
+			nova_crc32c_qword(qwd[0], acc[0]);
+			nova_crc32c_qword(qwd[1], acc[1]);
+			nova_crc32c_qword(qwd[2], acc[2]);
+			nova_crc32c_qword(qwd[3], acc[3]);
+			nova_crc32c_qword(qwd[4], acc[4]);
+			nova_crc32c_qword(qwd[5], acc[5]);
+			nova_crc32c_qword(qwd[6], acc[6]);
+			nova_crc32c_qword(qwd[7], acc[7]);
+		// }
+
+		// if (data_parity > 0) {
+			((u64 *) parity)[i] = qwd[0] ^ qwd[1] ^ qwd[2] ^
+				qwd[3] ^ qwd[4] ^ qwd[5] ^ qwd[6] ^ qwd[7];
+		// }
+
+		block += 8;
+	}
+	// if (data_csum > 0 && unroll_csum) {
+		crc[0] = cpu_to_le32((u32) acc[0]);
+		crc[1] = cpu_to_le32((u32) acc[1]);
+		crc[2] = cpu_to_le32((u32) acc[2]);
+		crc[3] = cpu_to_le32((u32) acc[3]);
+		crc[4] = cpu_to_le32((u32) acc[4]);
+		crc[5] = cpu_to_le32((u32) acc[5]);
+		crc[6] = cpu_to_le32((u32) acc[6]);
+		crc[7] = cpu_to_le32((u32) acc[7]);
+	// }
+
+	return *((u64 *) parity);
+}
+
+#if 0 // some test machines do not have this function (need CONFIG_MD_RAID456)
+static u64 xor_blocks_call(char **data, char *parity,
+	size_t size, int disks)
+{
+	int xor_cnt, disk_id;
+
+	memcpy(parity, data[0], size); /* init parity with the first disk */
+	disks--;
+	disk_id = 1;
+	while (disks > 0) {
+		/* each xor_blocks call can do at most MAX_XOR_BLOCKS (4) */
+		xor_cnt = min(disks, MAX_XOR_BLOCKS);
+		/* crypto/xor.c, used in lib/raid6 and fs/btrfs */
+		xor_blocks(xor_cnt, size, parity, (void **)(data + disk_id));
+
+		disks -= xor_cnt;
+		disk_id += xor_cnt;
+	}
+
+	return *((u64 *) parity);
+}
+#endif
+
+static const raid5_call_t raid5_calls[] = {
+	/* order should match enum raid5_call_id */
+	{ "nova_block_parity", nova_block_parity_call },
+	{ "nova_block_csum_parity", nova_block_csum_parity_call },
+//	{ "xor_blocks", xor_blocks_call },
+};
+
+/* memory pools for perf testing */
+static void *nova_alloc_vmem_pool(size_t poolsize)
+{
+	void *pool = vmalloc(poolsize);
+
+	if (pool == NULL)
+		return NULL;
+
+	/* init pool to verify some checksum results */
+	// memset(pool, 0xAC, poolsize);
+
+	/* to have a clean start, flush the data cache for the given virtual
+	 * address range in the vmap area
+	 */
+	flush_kernel_vmap_range(pool, poolsize);
+
+	return pool;
+}
+
+static void nova_free_vmem_pool(void *pool)
+{
+	if (pool != NULL)
+		vfree(pool);
+}
+
+static void *nova_alloc_pmem_pool(struct super_block *sb,
+	struct nova_inode_info_header *sih, int cpu, size_t poolsize,
+	unsigned long *blocknr, int *allocated)
+{
+	int num;
+	void *pool;
+	size_t blocksize, blockoff;
+	u8 blocktype = NOVA_BLOCK_TYPE_4K;
+
+	blocksize = blk_type_to_size[blocktype];
+	num = poolsize / blocksize;
+	if (poolsize % blocksize)
+		num++;
+
+	sih->ino = NOVA_TEST_PERF_INO;
+	sih->i_blk_type = blocktype;
+	sih->log_head = 0;
+	sih->log_tail = 0;
+
+	*allocated = nova_new_data_blocks(sb, sih, blocknr, 0, num,
+					  ALLOC_NO_INIT, cpu, ALLOC_FROM_HEAD);
+	if (*allocated < num) {
+		nova_dbg("%s: allocated pmem blocks %d < requested blocks %d\n",
+						__func__, *allocated, num);
+		if (*allocated > 0)
+			nova_free_data_blocks(sb, sih, *blocknr, *allocated);
+
+		return NULL;
+	}
+
+	blockoff = nova_get_block_off(sb, *blocknr, blocktype);
+	pool = nova_get_block(sb, blockoff);
+
+	return pool;
+}
+
+static void nova_free_pmem_pool(struct super_block *sb,
+	struct nova_inode_info_header *sih, char **pmem,
+	unsigned long blocknr, int num)
+{
+	if (num > 0)
+		nova_free_data_blocks(sb, sih, blocknr, num);
+	*pmem = NULL;
+}
+
+static int nova_test_func_perf(struct super_block *sb, unsigned int func_id,
+	size_t poolsize, size_t size, unsigned int disks)
+{
+	u64 csum = 12345, xor = 0;
+
+	u64 volatile result; // avoid results being optimized out
+	const char *fname = NULL;
+	char *src = NULL, *dst = NULL, *pmem = NULL;
+	char **data = NULL, *parity;
+	size_t off = 0;
+	int cpu, i, j, reps, err = 0, allocated = 0;
+	unsigned int call_id = 0, call_gid = 0;
+	unsigned long blocknr = 0, nsec, lat, thru;
+	struct nova_inode_info_header perf_sih;
+	const memcpy_call_t *fmemcpy = NULL;
+	const checksum_call_t *fchecksum = NULL;
+	const raid5_call_t *fraid5 = NULL;
+	timing_t perf_time;
+
+	cpu = get_cpu(); /* get cpu id and disable preemption */
+	reps = poolsize / size; /* raid calls will adjust this number */
+	call_id = func_id - 1; /* individual function id starting from 1 */
+
+	/* normal memcpy */
+	if (call_id < NUM_MEMCPY_CALLS) {
+		src = nova_alloc_vmem_pool(poolsize);
+		dst = nova_alloc_vmem_pool(poolsize);
+		if (src == NULL || dst == NULL) {
+			err = -ENOMEM;
+			goto out;
+		}
+
+		fmemcpy = &memcpy_calls[call_id];
+		fname = fmemcpy->name;
+		call_gid = memcpy_gid;
+
+		goto test;
+	}
+	call_id -= NUM_MEMCPY_CALLS;
+
+	/* memcpy from pmem */
+	if (call_id < NUM_FROM_PMEM_CALLS) {
+		pmem = nova_alloc_pmem_pool(sb, &perf_sih, cpu, poolsize,
+							&blocknr, &allocated);
+		dst = nova_alloc_vmem_pool(poolsize);
+		if (pmem == NULL || dst == NULL) {
+			err = -ENOMEM;
+			goto out;
+		}
+
+		fmemcpy = &from_pmem_calls[call_id];
+		fname = fmemcpy->name;
+		call_gid = from_pmem_gid;
+
+		goto test;
+	}
+	call_id -= NUM_FROM_PMEM_CALLS;
+
+	/* memcpy to pmem */
+	if (call_id < NUM_TO_PMEM_CALLS) {
+		src = nova_alloc_vmem_pool(poolsize);
+		pmem = nova_alloc_pmem_pool(sb, &perf_sih, cpu, poolsize,
+							&blocknr, &allocated);
+		if (src == NULL || pmem == NULL) {
+			err = -ENOMEM;
+			goto out;
+		}
+
+		fmemcpy = &to_pmem_calls[call_id];
+		fname = fmemcpy->name;
+		call_gid = to_pmem_gid;
+
+		goto test;
+	}
+	call_id -= NUM_TO_PMEM_CALLS;
+
+	/* checksum */
+	if (call_id < NUM_CHECKSUM_CALLS) {
+		src = nova_alloc_vmem_pool(poolsize);
+
+		fchecksum = &checksum_calls[call_id];
+		fname = fchecksum->name;
+		call_gid = checksum_gid;
+
+		goto test;
+	}
+	call_id -= NUM_CHECKSUM_CALLS;
+
+	/* raid5 */
+	if (call_id < NUM_RAID5_CALLS) {
+		src = nova_alloc_vmem_pool(poolsize);
+		data = kcalloc(disks, sizeof(char *), GFP_NOFS);
+		if (data == NULL) {
+			err = -ENOMEM;
+			goto out;
+		}
+
+		reps = poolsize / ((disks + 1) * size); /* +1 for parity */
+
+		fraid5 = &raid5_calls[call_id];
+		fname = fraid5->name;
+		call_gid = raid5_gid;
+
+		if (call_id == nova_block_csum_parity_id && disks != 8) {
+			nova_dbg("%s only for 8 disks, skip testing\n", fname);
+			goto out;
+		}
+
+		goto test;
+	}
+	call_id -= NUM_RAID5_CALLS;
+
+	/* continue with the next call group */
+
+test:
+	if (fmemcpy == NULL && fchecksum == NULL && fraid5 == NULL) {
+		nova_dbg("%s: function struct error\n", __func__);
+		err = -EFAULT;
+		goto out;
+	}
+
+	reset_perf_timer();
+	NOVA_START_TIMING(perf_t, perf_time);
+
+	switch (call_gid) {
+	case memcpy_gid:
+		for (i = 0; i < reps; i++, off += size)
+			err = fmemcpy->call(dst, src, off, size);
+		break;
+	case from_pmem_gid:
+		for (i = 0; i < reps; i++, off += size)
+			err = fmemcpy->call(dst, pmem, off, size);
+		break;
+	case to_pmem_gid:
+		nova_memunlock_range(sb, pmem, poolsize);
+		for (i = 0; i < reps; i++, off += size)
+			err = fmemcpy->call(pmem, src, off, size);
+		nova_memlock_range(sb, pmem, poolsize);
+		break;
+	case checksum_gid:
+		for (i = 0; i < reps; i++, off += size)
+			/* checksum calls are memory-read intensive */
+			csum = fchecksum->call(csum, src + off, size);
+		result = csum;
+		break;
+	case raid5_gid:
+		for (i = 0; i < reps; i++, off += (disks + 1) * size) {
+			for (j = 0; j < disks; j++)
+				data[j] = &src[off + j * size];
+			parity = src + off + disks * size;
+			xor = fraid5->call(data, parity, size, disks);
+		}
+		result = xor;
+		break;
+	default:
+		nova_dbg("%s: invalid function group %u\n", __func__, call_gid);
+		break;
+	}
+
+	NOVA_END_TIMING(perf_t, perf_time);
+	nsec = read_perf_timer();
+
+	// nova_info("checksum value: 0x%016llx\n", csum);
+
+	lat  = (err || reps == 0) ? 0 : nsec / reps;
+	if (call_gid == raid5_gid)
+		thru = (err) ? 0 : mb_per_sec(reps * disks * size, nsec);
+	else
+		thru = (err) ? 0 : mb_per_sec(reps * size, nsec);
+
+	if (cpu != smp_processor_id()) /* scheduling shouldn't happen */
+		nova_dbg("cpu was %d, now %d\n", cpu, smp_processor_id());
+
+	nova_info("%4u %25s %4u %8lu %8lu\n", func_id, fname, cpu, lat, thru);
+
+out:
+	nova_free_vmem_pool(src);
+	nova_free_vmem_pool(dst);
+	nova_free_pmem_pool(sb, &perf_sih, &pmem, blocknr, allocated);
+
+	if (data != NULL)
+		kfree(data);
+
+	put_cpu(); /* enable preemption */
+
+	if (err)
+		nova_dbg("%s: performance test aborted\n", __func__);
+	return err;
+}
+
+int nova_test_perf(struct super_block *sb, unsigned int func_id,
+	unsigned int poolmb, size_t size, unsigned int disks)
+{
+	int id, ret = 0;
+	size_t poolsize = (size_t) poolmb * 1024 * 1024;
+
+	if (!measure_timing) {
+		nova_dbg("%s: measure_timing not set!\n", __func__);
+		ret = -EFAULT;
+		goto out;
+	}
+	if (func_id > NUM_PERF_CALLS) {
+		nova_dbg("%s: invalid function id %u!\n", __func__, func_id);
+		ret = -EFAULT;
+		goto out;
+	}
+	if (poolmb < 1 || 1024 < poolmb) { /* limit pool size to 1GB */
+		nova_dbg("%s: invalid pool size %u MB!\n", __func__, poolmb);
+		ret = -EFAULT;
+		goto out;
+	}
+	if (size < 64 || poolsize < size || (size % 64)) {
+		nova_dbg("%s: invalid data size %zu!\n", __func__, size);
+		ret = -EFAULT;
+		goto out;
+	}
+	if (disks < 1 || 32 < disks) { /* limit number of disks */
+		nova_dbg("%s: invalid disk count %u!\n", __func__, disks);
+		ret = -EFAULT;
+		goto out;
+	}
+
+	nova_info("test function performance\n");
+	nova_info("pool size %u MB, work size %zu, disks %u\n",
+					poolmb, size, disks);
+
+	nova_info("%4s %25s %4s %8s %8s\n", "id", "name", "cpu", "ns", "MB/s");
+	nova_info("-------------------------------------------------------\n");
+	if (func_id == 0) {
+		/* individual function id starting from 1 */
+		for (id = 1; id <= NUM_PERF_CALLS; id++) {
+			ret = nova_test_func_perf(sb, id, poolsize,
+							size, disks);
+			if (ret < 0)
+				goto out;
+		}
+	} else {
+		ret = nova_test_func_perf(sb, func_id, poolsize, size, disks);
+	}
+	nova_info("-------------------------------------------------------\n");
+
+out:
+	return ret;
+}
diff --git a/fs/nova/perf.h b/fs/nova/perf.h
new file mode 100644
index 000000000000..94bee4674f2e
--- /dev/null
+++ b/fs/nova/perf.h
@@ -0,0 +1,96 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Performance test
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/zutil.h>
+#include <linux/libnvdimm.h>
+#include <linux/raid/xor.h>
+#include "nova.h"
+
+#define	reset_perf_timer()	__this_cpu_write(Timingstats_percpu[perf_t], 0)
+#define	read_perf_timer()	__this_cpu_read(Timingstats_percpu[perf_t])
+
+#define	mb_per_sec(size, nsec)	((nsec) == 0 ? 0 : \
+				((size) * (1000000000 / 1024 / 1024) / (nsec)))
+
+enum memcpy_call_id {
+	memcpy_read_id = 0,
+	memcpy_write_id,
+	memcpy_bidir_id,
+	NUM_MEMCPY_CALLS
+};
+
+enum from_pmem_call_id {
+	memcpy_mcsafe_id = 0,
+	NUM_FROM_PMEM_CALLS
+};
+
+enum to_pmem_call_id {
+	memcpy_to_pmem_nocache_id = 0,
+	flush_buffer_id,
+	memcpy_to_pmem_flush_id,
+	NUM_TO_PMEM_CALLS
+};
+
+enum checksum_call_id {
+	zlib_adler32_id = 0,
+	nd_fletcher64_id,
+	libcrc32c_id,
+	nova_crc32c_id,
+	plain_xor64_id,
+	NUM_CHECKSUM_CALLS
+};
+
+enum raid5_call_id {
+	nova_block_parity_id = 0,
+	nova_block_csum_parity_id,
+//	xor_blocks_id,
+	NUM_RAID5_CALLS
+};
+
+#define	NUM_PERF_CALLS	\
+	 (NUM_MEMCPY_CALLS + NUM_FROM_PMEM_CALLS + NUM_TO_PMEM_CALLS + \
+	  NUM_CHECKSUM_CALLS + NUM_RAID5_CALLS)
+
+enum call_group_id {
+	memcpy_gid = 0,
+	from_pmem_gid,
+	to_pmem_gid,
+	checksum_gid,
+	raid5_gid
+};
+
+typedef struct {
+	const char *name;                              /* name of this call */
+//	int (*valid)(void);            /* might need for availability check */
+	int (*call)(char *, char *, size_t, size_t); /* dst, src, off, size */
+} memcpy_call_t;
+
+typedef struct {
+	const char *name;                              /* name of this call */
+//	int (*valid)(void);            /* might need for availability check */
+	u64 (*call)(u64, char *, size_t);               /* init, data, size */
+} checksum_call_t;
+
+typedef struct {
+	const char *name;                              /* name of this call */
+//	int (*valid)(void);            /* might need for availability check */
+	u64 (*call)(char **, char *,                        /* data, parity */
+			size_t, int);          /* per-disk-size, data disks */
+} raid5_call_t;
diff --git a/fs/nova/stats.c b/fs/nova/stats.c
new file mode 100644
index 000000000000..cacf76f0d16d
--- /dev/null
+++ b/fs/nova/stats.c
@@ -0,0 +1,685 @@
+/*
+ * NOVA File System statistics
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#include "nova.h"
+
+const char *Timingstring[TIMING_NUM] = {
+	/* Init */
+	"================ Initialization ================",
+	"init",
+	"mount",
+	"ioremap",
+	"new_init",
+	"recovery",
+
+	/* Namei operations */
+	"============= Directory operations =============",
+	"create",
+	"lookup",
+	"link",
+	"unlink",
+	"symlink",
+	"mkdir",
+	"rmdir",
+	"mknod",
+	"rename",
+	"readdir",
+	"add_dentry",
+	"remove_dentry",
+	"setattr",
+	"setsize",
+
+	/* I/O operations */
+	"================ I/O operations ================",
+	"dax_read",
+	"cow_write",
+	"inplace_write",
+	"copy_to_nvmm",
+	"dax_get_block",
+	"read_iter",
+	"write_iter",
+
+	/* Memory operations */
+	"============== Memory operations ===============",
+	"memcpy_read_nvmm",
+	"memcpy_write_nvmm",
+	"memcpy_write_back_to_nvmm",
+	"handle_partial_block",
+
+	/* Memory management */
+	"============== Memory management ===============",
+	"alloc_blocks",
+	"new_data_blocks",
+	"new_log_blocks",
+	"free_blocks",
+	"free_data_blocks",
+	"free_log_blocks",
+
+	/* Transaction */
+	"================= Transaction ==================",
+	"transaction_new_inode",
+	"transaction_link_change",
+	"update_tail",
+
+	/* Logging */
+	"============= Logging operations ===============",
+	"append_dir_entry",
+	"append_file_entry",
+	"append_mmap_entry",
+	"append_link_change",
+	"append_setattr",
+	"append_snapshot_info",
+	"inplace_update_entry",
+
+	/* Tree */
+	"=============== Tree operations ================",
+	"checking_entry",
+	"assign_blocks",
+
+	/* GC */
+	"============= Garbage collection ===============",
+	"log_fast_gc",
+	"log_thorough_gc",
+	"check_invalid_log",
+
+	/* Integrity */
+	"============ Integrity operations ==============",
+	"block_csum",
+	"block_parity",
+	"block_csum_parity",
+	"protect_memcpy",
+	"protect_file_data",
+	"verify_entry_csum",
+	"verify_data_csum",
+	"calc_entry_csum",
+	"restore_file_data",
+	"reset_mapping",
+	"reset_vma",
+
+	/* Others */
+	"================ Miscellaneous =================",
+	"find_cache_page",
+	"fsync",
+	"write_pages",
+	"fallocate",
+	"direct_IO",
+	"free_old_entry",
+	"delete_file_tree",
+	"delete_dir_tree",
+	"new_vfs_inode",
+	"new_nova_inode",
+	"free_inode",
+	"free_inode_log",
+	"evict_inode",
+	"test_perf",
+	"wprotect",
+
+	/* Mmap */
+	"=============== MMap operations ================",
+	"mmap_page_fault",
+	"mmap_pmd_fault",
+	"mmap_pfn_mkwrite",
+	"insert_vma",
+	"remove_vma",
+	"set_vma_readonly",
+	"mmap_cow",
+	"update_mapping",
+	"update_pfn",
+	"mmap_handler",
+
+	/* Rebuild */
+	"=================== Rebuild ====================",
+	"rebuild_dir",
+	"rebuild_file",
+	"rebuild_snapshot_table",
+
+	/* Snapshot */
+	"=================== Snapshot ===================",
+	"create_snapshot",
+	"init_snapshot_info",
+	"delete_snapshot",
+	"append_snapshot_filedata",
+	"append_snapshot_inode",
+};
+
+u64 Timingstats[TIMING_NUM];
+DEFINE_PER_CPU(u64[TIMING_NUM], Timingstats_percpu);
+u64 Countstats[TIMING_NUM];
+DEFINE_PER_CPU(u64[TIMING_NUM], Countstats_percpu);
+u64 IOstats[STATS_NUM];
+DEFINE_PER_CPU(u64[STATS_NUM], IOstats_percpu);
+
+static void nova_print_alloc_stats(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	unsigned long alloc_log_count = 0;
+	unsigned long alloc_log_pages = 0;
+	unsigned long alloc_data_count = 0;
+	unsigned long alloc_data_pages = 0;
+	unsigned long free_log_count = 0;
+	unsigned long freed_log_pages = 0;
+	unsigned long free_data_count = 0;
+	unsigned long freed_data_pages = 0;
+	int i;
+
+	nova_info("=========== NOVA allocation stats ===========\n");
+	nova_info("Alloc %llu, alloc steps %llu, average %llu\n",
+		Countstats[new_data_blocks_t], IOstats[alloc_steps],
+		Countstats[new_data_blocks_t] ?
+			IOstats[alloc_steps] / Countstats[new_data_blocks_t]
+			: 0);
+	nova_info("Free %llu\n", Countstats[free_data_t]);
+	nova_info("Fast GC %llu, checked pages %llu, free pages %llu, average %llu\n",
+		Countstats[fast_gc_t], IOstats[fast_checked_pages],
+		IOstats[fast_gc_pages], Countstats[fast_gc_t] ?
+			IOstats[fast_gc_pages] / Countstats[fast_gc_t] : 0);
+	nova_info("Thorough GC %llu, checked pages %llu, free pages %llu, average %llu\n",
+		Countstats[thorough_gc_t],
+		IOstats[thorough_checked_pages], IOstats[thorough_gc_pages],
+		Countstats[thorough_gc_t] ?
+			IOstats[thorough_gc_pages] / Countstats[thorough_gc_t]
+			: 0);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+
+		alloc_log_count += free_list->alloc_log_count;
+		alloc_log_pages += free_list->alloc_log_pages;
+		alloc_data_count += free_list->alloc_data_count;
+		alloc_data_pages += free_list->alloc_data_pages;
+		free_log_count += free_list->free_log_count;
+		freed_log_pages += free_list->freed_log_pages;
+		free_data_count += free_list->free_data_count;
+		freed_data_pages += free_list->freed_data_pages;
+	}
+
+	nova_info("alloc log count %lu, allocated log pages %lu, alloc data count %lu, allocated data pages %lu, free log count %lu, freed log pages %lu, free data count %lu, freed data pages %lu\n",
+		alloc_log_count, alloc_log_pages,
+		alloc_data_count, alloc_data_pages,
+		free_log_count, freed_log_pages,
+		free_data_count, freed_data_pages);
+}
+
+static void nova_print_IO_stats(struct super_block *sb)
+{
+	nova_info("=========== NOVA I/O stats ===========\n");
+	nova_info("Read %llu, bytes %llu, average %llu\n",
+		Countstats[dax_read_t], IOstats[read_bytes],
+		Countstats[dax_read_t] ?
+			IOstats[read_bytes] / Countstats[dax_read_t] : 0);
+	nova_info("COW write %llu, bytes %llu, average %llu, write breaks %llu, average %llu\n",
+		Countstats[cow_write_t], IOstats[cow_write_bytes],
+		Countstats[cow_write_t] ?
+			IOstats[cow_write_bytes] / Countstats[cow_write_t] : 0,
+		IOstats[cow_write_breaks], Countstats[cow_write_t] ?
+			IOstats[cow_write_breaks] / Countstats[cow_write_t]
+			: 0);
+	nova_info("Inplace write %llu, bytes %llu, average %llu, write breaks %llu, average %llu\n",
+		Countstats[inplace_write_t], IOstats[inplace_write_bytes],
+		Countstats[inplace_write_t] ?
+			IOstats[inplace_write_bytes] /
+			Countstats[inplace_write_t] : 0,
+		IOstats[inplace_write_breaks], Countstats[inplace_write_t] ?
+			IOstats[inplace_write_breaks] /
+			Countstats[inplace_write_t] : 0);
+}
+
+void nova_get_timing_stats(void)
+{
+	int i;
+	int cpu;
+
+	for (i = 0; i < TIMING_NUM; i++) {
+		Timingstats[i] = 0;
+		Countstats[i] = 0;
+		for_each_possible_cpu(cpu) {
+			Timingstats[i] += per_cpu(Timingstats_percpu[i], cpu);
+			Countstats[i] += per_cpu(Countstats_percpu[i], cpu);
+		}
+	}
+}
+
+void nova_get_IO_stats(void)
+{
+	int i;
+	int cpu;
+
+	for (i = 0; i < STATS_NUM; i++) {
+		IOstats[i] = 0;
+		for_each_possible_cpu(cpu)
+			IOstats[i] += per_cpu(IOstats_percpu[i], cpu);
+	}
+}
+
+void nova_print_timing_stats(struct super_block *sb)
+{
+	int i;
+
+	nova_get_timing_stats();
+	nova_get_IO_stats();
+
+	nova_info("=========== NOVA kernel timing stats ============\n");
+	for (i = 0; i < TIMING_NUM; i++) {
+		/* Title */
+		if (Timingstring[i][0] == '=') {
+			nova_info("\n%s\n\n", Timingstring[i]);
+			continue;
+		}
+
+		if (measure_timing || Timingstats[i]) {
+			nova_info("%s: count %llu, timing %llu, average %llu\n",
+				Timingstring[i],
+				Countstats[i],
+				Timingstats[i],
+				Countstats[i] ?
+				Timingstats[i] / Countstats[i] : 0);
+		} else {
+			nova_info("%s: count %llu\n",
+				Timingstring[i],
+				Countstats[i]);
+		}
+	}
+
+	nova_info("\n");
+	nova_print_alloc_stats(sb);
+	nova_print_IO_stats(sb);
+}
+
+static void nova_clear_timing_stats(void)
+{
+	int i;
+	int cpu;
+
+	for (i = 0; i < TIMING_NUM; i++) {
+		Countstats[i] = 0;
+		Timingstats[i] = 0;
+		for_each_possible_cpu(cpu) {
+			per_cpu(Timingstats_percpu[i], cpu) = 0;
+			per_cpu(Countstats_percpu[i], cpu) = 0;
+		}
+	}
+}
+
+static void nova_clear_IO_stats(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	int i;
+	int cpu;
+
+	for (i = 0; i < STATS_NUM; i++) {
+		IOstats[i] = 0;
+		for_each_possible_cpu(cpu)
+			per_cpu(IOstats_percpu[i], cpu) = 0;
+	}
+
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+
+		free_list->alloc_log_count = 0;
+		free_list->alloc_log_pages = 0;
+		free_list->alloc_data_count = 0;
+		free_list->alloc_data_pages = 0;
+		free_list->free_log_count = 0;
+		free_list->freed_log_pages = 0;
+		free_list->free_data_count = 0;
+		free_list->freed_data_pages = 0;
+	}
+}
+
+void nova_clear_stats(struct super_block *sb)
+{
+	nova_clear_timing_stats();
+	nova_clear_IO_stats(sb);
+}
+
+void nova_print_inode(struct nova_inode *pi)
+{
+	nova_dbg("%s: NOVA inode %llu\n", __func__, pi->nova_ino);
+	nova_dbg("valid %u, deleted %u, blk type %u, flags %u\n",
+		pi->valid, pi->deleted, pi->i_blk_type, pi->i_flags);
+	nova_dbg("size %llu, ctime %u, mtime %u, atime %u\n",
+		pi->i_size, pi->i_ctime, pi->i_mtime, pi->i_atime);
+	nova_dbg("mode %u, links %u, xattr 0x%llx, csum %u\n",
+		pi->i_mode, pi->i_links_count, pi->i_xattr, pi->csum);
+	nova_dbg("uid %u, gid %u, gen %u, create time %u\n",
+		pi->i_uid, pi->i_gid, pi->i_generation, pi->i_create_time);
+	nova_dbg("head 0x%llx, tail 0x%llx, alter head 0x%llx, tail 0x%llx\n",
+		pi->log_head, pi->log_tail, pi->alter_log_head,
+		pi->alter_log_tail);
+	nova_dbg("create epoch id %llu, delete epoch id %llu\n",
+		pi->create_epoch_id, pi->delete_epoch_id);
+}
+
+static inline void nova_print_file_write_entry(struct super_block *sb,
+	u64 curr, struct nova_file_write_entry *entry)
+{
+	nova_dbg("file write entry @ 0x%llx: epoch %llu, trans %llu, pgoff %llu, pages %u, blocknr %llu, reassigned %u, updating %u, invalid count %u, size %llu, mtime %u\n",
+			curr, entry->epoch_id, entry->trans_id,
+			entry->pgoff, entry->num_pages,
+			entry->block >> PAGE_SHIFT,
+			entry->reassigned, entry->updating,
+			entry->invalid_pages, entry->size, entry->mtime);
+}
+
+static inline void nova_print_set_attr_entry(struct super_block *sb,
+	u64 curr, struct nova_setattr_logentry *entry)
+{
+	nova_dbg("set attr entry @ 0x%llx: epoch %llu, trans %llu, invalid %u, mode %u, size %llu, atime %u, mtime %u, ctime %u\n",
+			curr, entry->epoch_id, entry->trans_id,
+			entry->invalid, entry->mode,
+			entry->size, entry->atime, entry->mtime, entry->ctime);
+}
+
+static inline void nova_print_link_change_entry(struct super_block *sb,
+	u64 curr, struct nova_link_change_entry *entry)
+{
+	nova_dbg("link change entry @ 0x%llx: epoch %llu, trans %llu, invalid %u, links %u, flags %u, ctime %u\n",
+			curr, entry->epoch_id, entry->trans_id,
+			entry->invalid, entry->links,
+			entry->flags, entry->ctime);
+}
+
+static inline void nova_print_mmap_entry(struct super_block *sb,
+	u64 curr, struct nova_mmap_entry *entry)
+{
+	nova_dbg("mmap write entry @ 0x%llx: epoch %llu, invalid %u, pgoff %llu, pages %llu\n",
+			curr, entry->epoch_id, entry->invalid,
+			entry->pgoff, entry->num_pages);
+}
+
+static inline void nova_print_snapshot_info_entry(struct super_block *sb,
+	u64 curr, struct nova_snapshot_info_entry *entry)
+{
+	nova_dbg("snapshot info entry @ 0x%llx: epoch %llu, deleted %u, timestamp %llu\n",
+			curr, entry->epoch_id, entry->deleted,
+			entry->timestamp);
+}
+
+static inline size_t nova_print_dentry(struct super_block *sb,
+	u64 curr, struct nova_dentry *entry)
+{
+	nova_dbg("dir logentry @ 0x%llx: epoch %llu, trans %llu, reassigned %u, invalid %u, inode %llu, links %u, namelen %u, rec len %u, name %s, mtime %u\n",
+			curr, entry->epoch_id, entry->trans_id,
+			entry->reassigned, entry->invalid,
+			le64_to_cpu(entry->ino),
+			entry->links_count, entry->name_len,
+			le16_to_cpu(entry->de_len), entry->name,
+			entry->mtime);
+
+	return le16_to_cpu(entry->de_len);
+}
+
+u64 nova_print_log_entry(struct super_block *sb, u64 curr)
+{
+	void *addr;
+	size_t size;
+	u8 type;
+
+	addr = (void *)nova_get_block(sb, curr);
+	type = nova_get_entry_type(addr);
+	switch (type) {
+	case SET_ATTR:
+		nova_print_set_attr_entry(sb, curr, addr);
+		curr += sizeof(struct nova_setattr_logentry);
+		break;
+	case LINK_CHANGE:
+		nova_print_link_change_entry(sb, curr, addr);
+		curr += sizeof(struct nova_link_change_entry);
+		break;
+	case MMAP_WRITE:
+		nova_print_mmap_entry(sb, curr, addr);
+		curr += sizeof(struct nova_mmap_entry);
+		break;
+	case SNAPSHOT_INFO:
+		nova_print_snapshot_info_entry(sb, curr, addr);
+		curr += sizeof(struct nova_snapshot_info_entry);
+		break;
+	case FILE_WRITE:
+		nova_print_file_write_entry(sb, curr, addr);
+		curr += sizeof(struct nova_file_write_entry);
+		break;
+	case DIR_LOG:
+		size = nova_print_dentry(sb, curr, addr);
+		curr += size;
+		if (size == 0) {
+			nova_dbg("%s: dentry with size 0 @ 0x%llx\n",
+					__func__, curr);
+			curr += sizeof(struct nova_file_write_entry);
+			NOVA_ASSERT(0);
+		}
+		break;
+	case NEXT_PAGE:
+		nova_dbg("%s: next page sign @ 0x%llx\n", __func__, curr);
+		curr = PAGE_TAIL(curr);
+		break;
+	default:
+		nova_dbg("%s: unknown type %d, 0x%llx\n", __func__, type, curr);
+		curr += sizeof(struct nova_file_write_entry);
+		NOVA_ASSERT(0);
+		break;
+	}
+
+	return curr;
+}
+
+void nova_print_curr_log_page(struct super_block *sb, u64 curr)
+{
+	struct nova_inode_page_tail *tail;
+	u64 start, end;
+
+	start = BLOCK_OFF(curr);
+	end = PAGE_TAIL(curr);
+
+	while (start < end)
+		start = nova_print_log_entry(sb, start);
+
+	tail = nova_get_block(sb, end);
+	nova_dbg("Page tail. curr 0x%llx, next page 0x%llx, %u entries, %u invalid\n",
+			start, tail->next_page,
+			tail->num_entries, tail->invalid_entries);
+}
+
+void nova_print_nova_log(struct super_block *sb,
+	struct nova_inode_info_header *sih)
+{
+	u64 curr;
+
+	if (sih->log_tail == 0 || sih->log_head == 0)
+		return;
+
+	curr = sih->log_head;
+	nova_dbg("Pi %lu: log head 0x%llx, tail 0x%llx\n",
+			sih->ino, curr, sih->log_tail);
+	while (curr != sih->log_tail) {
+		if ((curr & (PAGE_SIZE - 1)) == LOG_BLOCK_TAIL) {
+			struct nova_inode_page_tail *tail =
+					nova_get_block(sb, curr);
+			nova_dbg("Log tail, curr 0x%llx, next page 0x%llx, %u entries, %u invalid\n",
+					curr, tail->next_page,
+					tail->num_entries,
+					tail->invalid_entries);
+			curr = tail->next_page;
+		} else {
+			curr = nova_print_log_entry(sb, curr);
+		}
+	}
+}
+
+void nova_print_inode_log(struct super_block *sb, struct inode *inode)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+
+	nova_print_nova_log(sb, sih);
+}
+
+int nova_get_nova_log_pages(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_inode *pi)
+{
+	struct nova_inode_log_page *curr_page;
+	u64 curr, next;
+	int count = 1;
+
+	if (pi->log_head == 0 || pi->log_tail == 0) {
+		nova_dbg("Pi %lu has no log\n", sih->ino);
+		return 0;
+	}
+
+	curr = pi->log_head;
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb, curr);
+	while ((next = curr_page->page_tail.next_page) != 0) {
+		curr = next;
+		curr_page = (struct nova_inode_log_page *)
+			nova_get_block(sb, curr);
+		count++;
+	}
+
+	return count;
+}
+
+void nova_print_nova_log_pages(struct super_block *sb,
+	struct nova_inode_info_header *sih)
+{
+	struct nova_inode_log_page *curr_page;
+	u64 curr, next;
+	int count = 1;
+	int used = count;
+
+	if (sih->log_head == 0 || sih->log_tail == 0) {
+		nova_dbg("Pi %lu has no log\n", sih->ino);
+		return;
+	}
+
+	curr = sih->log_head;
+	nova_dbg("Pi %lu: log head @ 0x%llx, tail @ 0x%llx\n",
+			sih->ino, curr, sih->log_tail);
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb, curr);
+	while ((next = curr_page->page_tail.next_page) != 0) {
+		nova_dbg("Current page 0x%llx, next page 0x%llx, %u entries, %u invalid\n",
+			curr >> PAGE_SHIFT, next >> PAGE_SHIFT,
+			curr_page->page_tail.num_entries,
+			curr_page->page_tail.invalid_entries);
+		if (sih->log_tail >> PAGE_SHIFT == curr >> PAGE_SHIFT)
+			used = count;
+		curr = next;
+		curr_page = (struct nova_inode_log_page *)
+			nova_get_block(sb, curr);
+		count++;
+	}
+	if (sih->log_tail >> PAGE_SHIFT == curr >> PAGE_SHIFT)
+		used = count;
+	nova_dbg("Pi %lu: log used %d pages, has %d pages, si reports %lu pages\n",
+		sih->ino, used, count,
+		sih->log_pages);
+}
+
+void nova_print_inode_log_pages(struct super_block *sb, struct inode *inode)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+
+	nova_print_nova_log_pages(sb, sih);
+}
+
+int nova_check_inode_logs(struct super_block *sb, struct nova_inode *pi)
+{
+	int count1 = 0;
+	int count2 = 0;
+	int tail1_at = 0;
+	int tail2_at = 0;
+	u64 curr, alter_curr;
+
+	curr = pi->log_head;
+	alter_curr = pi->alter_log_head;
+
+	while (curr && alter_curr) {
+		if (alter_log_page(sb, curr) != alter_curr ||
+				alter_log_page(sb, alter_curr) != curr)
+			nova_dbg("Inode %llu page %d: curr 0x%llx, alter 0x%llx, alter_curr 0x%llx, alter 0x%llx\n",
+					pi->nova_ino, count1,
+					curr, alter_log_page(sb, curr),
+					alter_curr,
+					alter_log_page(sb, alter_curr));
+
+		count1++;
+		count2++;
+		if ((curr >> PAGE_SHIFT) == (pi->log_tail >> PAGE_SHIFT))
+			tail1_at = count1;
+		if ((alter_curr >> PAGE_SHIFT) ==
+				(pi->alter_log_tail >> PAGE_SHIFT))
+			tail2_at = count2;
+		curr = next_log_page(sb, curr);
+		alter_curr = next_log_page(sb, alter_curr);
+	}
+
+	while (curr) {
+		count1++;
+		if ((curr >> PAGE_SHIFT) == (pi->log_tail >> PAGE_SHIFT))
+			tail1_at = count1;
+		curr = next_log_page(sb, curr);
+	}
+
+	while (alter_curr) {
+		count2++;
+		if ((alter_curr >> PAGE_SHIFT) ==
+				(pi->alter_log_tail >> PAGE_SHIFT))
+			tail2_at = count2;
+		alter_curr = next_log_page(sb, alter_curr);
+	}
+
+	nova_dbg("Log1 %d pages, tail @ page %d\n", count1, tail1_at);
+	nova_dbg("Log2 %d pages, tail @ page %d\n", count2, tail2_at);
+
+	return 0;
+}
+
+void nova_print_free_lists(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	int i;
+
+	nova_dbg("======== NOVA per-CPU free list allocation stats ========\n");
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+		nova_dbg("Free list %d: block start %lu, block end %lu, num_blocks %lu, num_free_blocks %lu, blocknode %lu\n",
+			i, free_list->block_start, free_list->block_end,
+			free_list->block_end - free_list->block_start + 1,
+			free_list->num_free_blocks, free_list->num_blocknode);
+
+		nova_dbg("Free list %d: csum start %lu, replica csum start %lu, csum blocks %lu, parity start %lu, parity blocks %lu\n",
+			i, free_list->csum_start, free_list->replica_csum_start,
+			free_list->num_csum_blocks,
+			free_list->parity_start, free_list->num_parity_blocks);
+
+		nova_dbg("Free list %d: alloc log count %lu, allocated log pages %lu, alloc data count %lu, allocated data pages %lu, free log count %lu, freed log pages %lu, free data count %lu, freed data pages %lu\n",
+			 i,
+			 free_list->alloc_log_count,
+			 free_list->alloc_log_pages,
+			 free_list->alloc_data_count,
+			 free_list->alloc_data_pages,
+			 free_list->free_log_count,
+			 free_list->freed_log_pages,
+			 free_list->free_data_count,
+			 free_list->freed_data_pages);
+	}
+}
diff --git a/fs/nova/stats.h b/fs/nova/stats.h
new file mode 100644
index 000000000000..766ba0a77872
--- /dev/null
+++ b/fs/nova/stats.h
@@ -0,0 +1,218 @@
+/*
+ * NOVA File System statistics
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+
+/* ======================= Timing ========================= */
+enum timing_category {
+	/* Init */
+	init_title_t,
+	init_t,
+	mount_t,
+	ioremap_t,
+	new_init_t,
+	recovery_t,
+
+	/* Namei operations */
+	namei_title_t,
+	create_t,
+	lookup_t,
+	link_t,
+	unlink_t,
+	symlink_t,
+	mkdir_t,
+	rmdir_t,
+	mknod_t,
+	rename_t,
+	readdir_t,
+	add_dentry_t,
+	remove_dentry_t,
+	setattr_t,
+	setsize_t,
+
+	/* I/O operations */
+	io_title_t,
+	dax_read_t,
+	cow_write_t,
+	inplace_write_t,
+	copy_to_nvmm_t,
+	dax_get_block_t,
+	read_iter_t,
+	write_iter_t,
+
+	/* Memory operations */
+	memory_title_t,
+	memcpy_r_nvmm_t,
+	memcpy_w_nvmm_t,
+	memcpy_w_wb_t,
+	partial_block_t,
+
+	/* Memory management */
+	mm_title_t,
+	new_blocks_t,
+	new_data_blocks_t,
+	new_log_blocks_t,
+	free_blocks_t,
+	free_data_t,
+	free_log_t,
+
+	/* Transaction */
+	trans_title_t,
+	create_trans_t,
+	link_trans_t,
+	update_tail_t,
+
+	/* Logging */
+	logging_title_t,
+	append_dir_entry_t,
+	append_file_entry_t,
+	append_mmap_entry_t,
+	append_link_change_t,
+	append_setattr_t,
+	append_snapshot_info_t,
+	update_entry_t,
+
+	/* Tree */
+	tree_title_t,
+	check_entry_t,
+	assign_t,
+
+	/* GC */
+	gc_title_t,
+	fast_gc_t,
+	thorough_gc_t,
+	check_invalid_t,
+
+	/* Integrity */
+	integrity_title_t,
+	block_csum_t,
+	block_parity_t,
+	block_csum_parity_t,
+	protect_memcpy_t,
+	protect_file_data_t,
+	verify_entry_csum_t,
+	verify_data_csum_t,
+	calc_entry_csum_t,
+	restore_data_t,
+	reset_mapping_t,
+	reset_vma_t,
+
+	/* Others */
+	others_title_t,
+	find_cache_t,
+	fsync_t,
+	write_pages_t,
+	fallocate_t,
+	direct_IO_t,
+	free_old_t,
+	delete_file_tree_t,
+	delete_dir_tree_t,
+	new_vfs_inode_t,
+	new_nova_inode_t,
+	free_inode_t,
+	free_inode_log_t,
+	evict_inode_t,
+	perf_t,
+	wprotect_t,
+
+	/* Mmap */
+	mmap_title_t,
+	mmap_fault_t,
+	pmd_fault_t,
+	pfn_mkwrite_t,
+	insert_vma_t,
+	remove_vma_t,
+	set_vma_read_t,
+	mmap_cow_t,
+	update_mapping_t,
+	update_pfn_t,
+	mmap_handler_t,
+
+	/* Rebuild */
+	rebuild_title_t,
+	rebuild_dir_t,
+	rebuild_file_t,
+	rebuild_snapshot_t,
+
+	/* Snapshot */
+	snapshot_title_t,
+	create_snapshot_t,
+	init_snapshot_info_t,
+	delete_snapshot_t,
+	append_snapshot_file_t,
+	append_snapshot_inode_t,
+
+	/* Sentinel */
+	TIMING_NUM,
+};
+
+enum stats_category {
+	alloc_steps,
+	cow_write_breaks,
+	inplace_write_breaks,
+	read_bytes,
+	cow_write_bytes,
+	inplace_write_bytes,
+	fast_checked_pages,
+	thorough_checked_pages,
+	fast_gc_pages,
+	thorough_gc_pages,
+	dirty_pages,
+	protect_head,
+	protect_tail,
+	block_csum_parity,
+	dax_cow_during_snapshot,
+	mapping_updated_pages,
+	cow_overlap_mmap,
+	dax_new_blocks,
+	inplace_new_blocks,
+	fdatasync,
+
+	/* Sentinel */
+	STATS_NUM,
+};
+
+extern const char *Timingstring[TIMING_NUM];
+extern u64 Timingstats[TIMING_NUM];
+DECLARE_PER_CPU(u64[TIMING_NUM], Timingstats_percpu);
+extern u64 Countstats[TIMING_NUM];
+DECLARE_PER_CPU(u64[TIMING_NUM], Countstats_percpu);
+extern u64 IOstats[STATS_NUM];
+DECLARE_PER_CPU(u64[STATS_NUM], IOstats_percpu);
+
+typedef struct timespec timing_t;
+
+#define NOVA_START_TIMING(name, start) \
+	do { if (measure_timing) getrawmonotonic(&start); } while (0)
+
+#define NOVA_END_TIMING(name, start) \
+	do { if (measure_timing) { \
+		timing_t end; \
+		getrawmonotonic(&end); \
+		__this_cpu_add(Timingstats_percpu[name], \
+			(end.tv_sec - start.tv_sec) * 1000000000 + \
+			(end.tv_nsec - start.tv_nsec)); \
+	} \
+	__this_cpu_add(Countstats_percpu[name], 1); \
+	} while (0)
+
+#define NOVA_STATS_ADD(name, value) \
+	do { __this_cpu_add(IOstats_percpu[name], value); } while (0)
+
+
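
For readers who want to experiment with the accounting scheme in stats.h
outside the kernel, here is a minimal userspace sketch in C.  It is not part
of the posted patch.  It assumes a fixed array of DEMO_CPUS slots in place of
real per-CPU variables, uses clock_gettime(CLOCK_MONOTONIC) as a stand-in for
getrawmonotonic(), and every demo_/DEMO_ name is hypothetical.  The macro
bodies are wrapped in do { } while (0) so each expands to a single statement,
and the average is guarded against a zero count in the same way as the print
routines in stats.c.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for per-CPU storage: one slot per "CPU",
 * indexed explicitly because userspace has no __this_cpu_add(). */
#define DEMO_CPUS 4

enum demo_timing { demo_read_t, demo_write_t, DEMO_TIMING_NUM };

static uint64_t timing_percpu[DEMO_CPUS][DEMO_TIMING_NUM];
static uint64_t count_percpu[DEMO_CPUS][DEMO_TIMING_NUM];
static uint64_t timing_total[DEMO_TIMING_NUM];
static uint64_t count_total[DEMO_TIMING_NUM];
static int measure_timing = 1;

/* Record a start timestamp only when timing is enabled. */
#define DEMO_START_TIMING(start) \
	do { \
		if (measure_timing) \
			clock_gettime(CLOCK_MONOTONIC, &(start)); \
	} while (0)

/* Add the elapsed nanoseconds when timing is enabled; always bump
 * the call count, as NOVA_END_TIMING does. */
#define DEMO_END_TIMING(cpu, name, start) \
	do { \
		if (measure_timing) { \
			struct timespec end_; \
			clock_gettime(CLOCK_MONOTONIC, &end_); \
			int64_t ns_ = (int64_t)(end_.tv_sec - (start).tv_sec) * \
					1000000000LL + \
					(end_.tv_nsec - (start).tv_nsec); \
			timing_percpu[(cpu)][(name)] += (uint64_t)ns_; \
		} \
		count_percpu[(cpu)][(name)]++; \
	} while (0)

/* Fold every slot into the global totals, resetting the totals first
 * (the same shape as nova_get_timing_stats()). */
static void demo_get_timing_stats(void)
{
	for (int i = 0; i < DEMO_TIMING_NUM; i++) {
		timing_total[i] = 0;
		count_total[i] = 0;
		for (int cpu = 0; cpu < DEMO_CPUS; cpu++) {
			timing_total[i] += timing_percpu[cpu][i];
			count_total[i] += count_percpu[cpu][i];
		}
	}
}

/* Average nanoseconds per call; 0 when the operation never ran. */
static uint64_t demo_average(int name)
{
	return count_total[name] ?
		timing_total[name] / count_total[name] : 0;
}

/* Time one (empty) operation and charge it to the given slot. */
static void demo_timed_op(int cpu, int name)
{
	struct timespec start = {0};

	assert(cpu >= 0 && cpu < DEMO_CPUS);
	DEMO_START_TIMING(start);
	/* ... the work being measured would go here ... */
	DEMO_END_TIMING(cpu, name, start);
}
```

Calling demo_timed_op() a few times with different slot numbers and then
demo_get_timing_stats() leaves the per-operation call count in count_total[].
demo_average() returns 0 for an operation that was never timed.  With
measure_timing set to 0, calls are still counted but no time is accumulated.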

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 16/16] NOVA: Build infrastructure
  2017-08-03  7:48 ` Steven Swanson
@ 2017-08-03  7:50   ` Steven Swanson
  -1 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-03  7:50 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson

Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
---
 fs/Kconfig       |    2 ++
 fs/Makefile      |    1 +
 fs/nova/Kconfig  |   15 +++++++++++++++
 fs/nova/Makefile |    9 +++++++++
 4 files changed, 27 insertions(+)
 create mode 100644 fs/nova/Kconfig
 create mode 100644 fs/nova/Makefile

diff --git a/fs/Kconfig b/fs/Kconfig
index b0e42b6a96b9..571714353a5f 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -58,6 +58,8 @@ config FS_DAX_PMD
 	depends on ZONE_DEVICE
 	depends on TRANSPARENT_HUGEPAGE
 
+source "fs/nova/Kconfig"
+
 endif # BLOCK
 
 # Posix ACL utility routines
diff --git a/fs/Makefile b/fs/Makefile
index 7bbaca9c67b1..53f6465e0f4c 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -113,6 +113,7 @@ obj-$(CONFIG_OMFS_FS)		+= omfs/
 obj-$(CONFIG_JFS_FS)		+= jfs/
 obj-$(CONFIG_XFS_FS)		+= xfs/
 obj-$(CONFIG_9P_FS)		+= 9p/
+obj-$(CONFIG_NOVA_FS)		+= nova/
 obj-$(CONFIG_AFS_FS)		+= afs/
 obj-$(CONFIG_NILFS2_FS)		+= nilfs2/
 obj-$(CONFIG_BEFS_FS)		+= befs/
diff --git a/fs/nova/Kconfig b/fs/nova/Kconfig
new file mode 100644
index 000000000000..c1c692edef92
--- /dev/null
+++ b/fs/nova/Kconfig
@@ -0,0 +1,15 @@
+config NOVA_FS
+	tristate "NOVA: log-structured file system for non-volatile memories"
+	depends on FS_DAX
+	select CRC32
+	select LIBCRC32C
+	help
+	  If your system has a block of fast (comparable in access speed to
+	  system memory) and non-volatile byte-addressable memory and you wish
+	  to mount a light-weight filesystem with strong consistency support
+	  over it, say Y here.
+
+	  To compile this as a module, choose M here: the module will be
+	  called nova.
+
+	  If unsure, say N.
diff --git a/fs/nova/Makefile b/fs/nova/Makefile
new file mode 100644
index 000000000000..c45e418652ca
--- /dev/null
+++ b/fs/nova/Makefile
@@ -0,0 +1,9 @@
+#
+# Makefile for the linux NOVA filesystem routines.
+#
+
+obj-$(CONFIG_NOVA_FS) += nova.o
+
+nova-y := balloc.o bbuild.o checksum.o dax.o dir.o file.o gc.o inode.o ioctl.o \
+	journal.o log.o mprotect.o namei.o parity.o rebuild.o snapshot.o stats.o \
+	super.o symlink.o sysfs.o perf.o

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [RFC 01/16] NOVA: Documentation
  2017-08-03  7:48   ` Steven Swanson
@ 2017-08-03 22:38     ` Randy Dunlap
  -1 siblings, 0 replies; 43+ messages in thread
From: Randy Dunlap @ 2017-08-03 22:38 UTC (permalink / raw)
  To: Steven Swanson, linux-fsdevel, linux-kernel, linux-nvdimm; +Cc: Steven Swanson

On 08/03/2017 12:48 AM, Steven Swanson wrote:
> A brief overview is in README.md.
> 

See below.

> Implementation and usage details are in Documentation/filesystems/nova.txt.
> 

Reviewed in a separate email.

> These two papers provide a detailed, high-level description of NOVA's design goals and approach:
> 
>    NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories (http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf)
> 
>    Hardening the NOVA File System (http://cseweb.ucsd.edu/~swanson/papers/TechReport2017HardenedNOVA.pdf)
> 
> Signed-off-by: Steven Swanson <swanson@cs.ucsd.edu>
> ---
>  Documentation/filesystems/00-INDEX |    2 
>  Documentation/filesystems/nova.txt |  771 ++++++++++++++++++++++++++++++++++++
>  MAINTAINERS                        |    8 
>  README.md                          |  173 ++++++++
>  4 files changed, 954 insertions(+)
>  create mode 100644 Documentation/filesystems/nova.txt
>  create mode 100644 README.md
> 

This file should not be in the top-level directory.
It would be OK in Documentation/filesystems, probably with a different
filename.

> diff --git a/README.md b/README.md
> new file mode 100644
> index 000000000000..4f778e99a79e
> --- /dev/null
> +++ b/README.md
> @@ -0,0 +1,173 @@
> +# NOVA: NOn-Volatile memory Accelerated log-structured file system
> +
> +NOVA's goal is to provide a high-performance, full-featured, production-ready
> +file system tailored for byte-addressable non-volatile memories (e.g., NVDIMMs
> +and Intel's soon-to-be-released 3DXpoint DIMMs).  It combines design elements
> +from many other file systems to provide a combination of high-performance,

                                                            high performance,

> +strong consistency guarantees, and comprehensive data protection.  NOVA support

                                                                           supports

> +DAX-style mmap and making DAX performs well is a first-order priority in NOVA's

                                 perform

> +design.  NOVA was developed by the [Non-Volatile Systems Laboratory][NVSL] in
> +the [Computer Science and Engineering Department][CSE] at the [University of
> +California, San Diego][UCSD].
> +
> +
> +NOVA is primarily a log-structured file system, but rather than maintain a
> +single global log for the entire file system, it maintains separate logs for
> +each file (inode).  NOVA breaks the logs into 4KB pages, they need not be

                                                     pages;

> +contiguous in memory.  The logs only contain metadata.
> +
> +File data pages reside outside the log, and log entries for write operations
> +point to data pages they modify.  File modification uses copy-on-write (COW) to
> +provide atomic file updates.
> +
> +For file operations that involve multiple inodes, NOVA use small, fixed-sized

                                                          uses

> +redo logs to atomically append log entries to the logs of the inodes involned.

                                                                        involved.

> +
> +This structure keeps logs small and make garbage collection very fast.  It also

                                       makes

> +enables enormous parallelism during recovery from an unclean unmount, since
> +threads can scan logs in parallel.
> +
> +NOVA replicates and checksums all metadata structures and protects file data
> +with RAID-4-style parity.  It supports checkpoints to facilitate backups.
> +
> +A more thorough discussion of NOVA's design is avaialable in these two papers:
> +
> +**NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories** 
> +[PDF](http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf)<br>
> +*Jian Xu and Steven Swanson*<br>
> +Published in [FAST 2016][FAST2016]
> +
> +**Hardening the NOVA File System**
> +[PDF](http://cseweb.ucsd.edu/~swanson/papers/TechReport2017HardenedNOVA.pdf) <br>
> +UCSD-CSE Techreport CS2017-1018
> +*Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff, Steven Swanson*<br>
> +
> +Read on for further details about NOVA's overall design and its current status 
> +
> +### Compatibilty with Other File Systems
> +
> +NOVA aims to be compatible with other Linux file systems.  To help verify that it achieves this we run several test suites against NOVA each night.

Compatible in what ways?

> +
> +* The latest version of XFSTests. ([Current failures](https://github.com/NVSL/linux-nova/issues?q=is%3Aopen+is%3Aissue+label%3AXFSTests))
> +* The (Linux testing project)(https://linux-test-project.github.io/) file system tests.
> +* The (fstest POSIX test suite)[POSIXtest].
> +
> +Currently, nearly all of these tests pass for the `master` branch, and we have
> +run complex programs on NOVA.  There are, of course, many bugs left to fix.
> +
> +NOVA uses the standard PMEM kernel interfaces for accessing and managing
> +persistent memory.
> +
> +### Atomicity
> +
> +By default, NOVA makes all metadata and file data operations atomic.
> +
> +Strong atomicity guarantees make it easier to build reliable applications on
> +NOVA, and NOVA can provide these guarantees with sacrificing much performance

                                               without

> +because NVDIMMs support very fast random access.
> +
> +NOVA also supports "unsafe data" and "unsafe metadata" modes that
> +improve performance in some cases and allows for non-atomic updates of file

                                         allow

> +data and metadata, respectively.
> +
> +### Data Protection
> +
> +NOVA aims to protect data against both misdirected writes in the kernel (which
> +can easily "scribble" over the contents of an NVDIMM) as well as media errors.
> +
> +NOVA protects all of its metadata data structures with a combination of
> +replication and checksums.  It protects file data using RAID-5 style parity.

Above here it says RAID-4-style parity...???

> +
> +NOVA can detects data corruption by verifying checksums on each access and by

            detect

> +catching and handling machine check exceptions (MCEs) that arise when the
> +system's memory controller detects at uncorrectable media error.
> +
> +We use a fault injection tool that allows testing of these recovery mechanisms.
> +
> +To facilitate backups, NOVA can take snapshots of the current filesystem state
> +that can be mounted read-only while the current file system is mounted
> +read-write.
> +
> +The tech report list above describes the design of NOVA's data protection system in detail.
> +
> +### DAX Support
> +
> +Supporting DAX efficiently is a core feature of NOVA and one of the challenges
> +in designing NOVA is reconciling DAX support which aims to avoid file system
> +intervention when file data changes, and other features that require such
> +intervention.
> +
> +NOVA's philosophy with respect to DAX is that when a program uses DAX mmap to
> +to modify a file, the program must take full responsibility for that data and
> +NOVA must ensure that the memory will behave as expected.  At other times, the
> +file system provides protection.  This approach has several implications:
> +
> +1. Implementing `msync()` in user space works fine.
> +
> +2. While a file is mmap'd, it is not protected by NOVA's RAID-style parity
> +mechanism, because protecting it would be too expensive.  When the file is
> +unmapped and/or during file system recovery, protection is restored.
> +
> +3. The snapshot mechanism must be careful about the order in which in adds

                                                                      it

> +pages to the file's snapshot image.
> +
> +### Performance
> +
> +The research paper and technical report referenced above compare NOVA's
> +performance to other file systems.  In almost all cases, NOVA outperforms other
> +DAX-enabled file systems.  A notable exception is sub-page updates which incur
> +COW overheads for the entire page.
> +
> +The technical report also illustrates the trade-offs between our protection
> +mechanisms and performance.
> +
> +## Gaps, Missing Features, and Development Status
> +
> +Although NOVA is a fully-functional file system, there is still much work left
> +to be done.  In particular, (at least) the following items are currently missing:
> +
> +1.  There is no mkfs or fsk utility (`mount` takes `-o init` to create a NOVA file system)

                           fsck

> +2.  NOVA doesn't scrub data to prevent corruption from accumulating in infrequently accessed data.
> +3.  NOVA doesn't read bad block information on mount and attempt recovery of the effected data.
> +4.  NOVA only works on x86-64 kernels.
> +5.  NOVA does not currently support extended attributes or ACL.
> +6.  NOVA does not currently prevent writes to mounted snapshots.
> +7.  Using `write()` to modify pages that are mmap'd is not supported.
> +8.  NOVA deoesn't provide quota support.
> +9.  Moving NOVA file systems between machines with different numbers of CPUs does not work.
> +10. Remounting a NOVA file system with different mount options may fail.
> +
> +None of these are fundamental limitations of NOVA's design.  Additional bugs
> +and issues are here [here][https://github.com/NVSL/linux-nova/issues].
> +
> +NOVA is complete and robust enough to run a range of complex applications, but
> +it is not yet ready for production use.  Our current focus is on adding a few
> +missing features list above and finding/fixing bugs.

                    from the list above

> +
> +## Building and Using NOVA
> +
> +This repo contains a version of the Linux with NOVA included.  You should be

what repo?                      of Linux

> +able to build and install it just as you would the mainline Linux source.
> +
> +### Building NOVA
> +
> +To build NOVA, build the kernel with PMEM (`CONFIG_BLK_DEV_PMEM`), DAX (`CONFIG_FS_DAX`) and NOVA (`CONFIG_NOVA_FS`) support.  Install as usual.
> +
> +## Hacking and Contributing
> +
> +The NOVA source code is almost completely contains in the `fs/nova` directory.

                                             contained

> +The execptions are some small changes in the kernel's memory management system

       exceptions

> +to support checkpointing.
> +
> +`Documentation/filesystems/nova.txt` describes the internals of Nova in more detail.
> +
> +If you find bugs, please [report them](https://github.com/NVSL/linux-nova/issues).
> +
> +If you have other questions or suggestions you can contact the NOVA developers at [cse-nova-hackers@eng.ucsd.edu](mailto:cse-nova-hackers@eng.ucsd.edu).
> +
> +
> +[NVSL]: http://nvsl.ucsd.edu/ "http://nvsl.ucsd.edu"
> +[POSIXtest]: http://www.tuxera.com/community/posix-test-suite/ 
> +[FAST2016]: https://www.usenix.org/conference/fast16/technical-sessions
> +[CSE]: http://cs.ucsd.edu
> +[UCSD]: http://www.ucsd.edu
> \ No newline at end of file
> 


-- 
~Randy


* Re: [RFC 01/16] NOVA: Documentation
  2017-08-03  7:48   ` Steven Swanson
@ 2017-08-04 15:09     ` Bart Van Assche
  -1 siblings, 0 replies; 43+ messages in thread
From: Bart Van Assche @ 2017-08-04 15:09 UTC (permalink / raw)
  To: linux-kernel, linux-nvdimm, swanson, linux-fsdevel; +Cc: steven.swanson

On Thu, 2017-08-03 at 00:48 -0700, Steven Swanson wrote:
> +### DAX Support
> +
> +Supporting DAX efficiently is a core feature of NOVA and one of the challenges
> +in designing NOVA is reconciling DAX support which aims to avoid file system
> +intervention when file data changes, and other features that require such
> +intervention.
> +
> +NOVA's philosophy with respect to DAX is that when a program uses DAX mmap to
> +to modify a file, the program must take full responsibility for that data and
> +NOVA must ensure that the memory will behave as expected.  At other times, the
> +file system provides protection.  This approach has several implications:
> +
> +1. Implementing `msync()` in user space works fine.
> +
> +2. While a file is mmap'd, it is not protected by NOVA's RAID-style parity
> +mechanism, because protecting it would be too expensive.  When the file is
> +unmapped and/or during file system recovery, protection is restored.
> +
> +3. The snapshot mechanism must be careful about the order in which in adds
> +pages to the file's snapshot image.

Hello Steven,

Thank you for sharing this very interesting work. After reading the
NOVA paper and patch 01/16 I have a question for you. Does the above mean that
COW is disabled for writable mmap-ed files? If so, what is the reason behind
this? Is there a fundamental issue that prevents implementing COW for
writable mmap-ed files? Or did you perhaps try to implement this and find that
the performance was not sufficient? Please note that I'm neither a filesystem
nor a persistent memory expert.

Thanks,

Bart.


* Re: [RFC 01/16] NOVA: Documentation
  2017-08-04 15:09     ` Bart Van Assche
@ 2017-08-06  3:28       ` Steven Swanson
  -1 siblings, 0 replies; 43+ messages in thread
From: Steven Swanson @ 2017-08-06  3:28 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-fsdevel, linux-kernel, linux-nvdimm

There is nothing impossible about COW for mapped files, but it is not a good match for the expected usage model for DAX.

The idea is that programs can mmap files and then build interesting data structures in them, just like they do in DRAM.  This means lots of small updates, and that would be very slow if each of them required a COW.

The reason we use COW for normal accesses is to provide atomicity.  Programs will probably want some form of atomicity for their mmaped data structures, but part of the DAX bargain is that programmers are responsible for this rather than the FS.
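
To make the usage model concrete, here is a minimal sketch of the kind of small in-place update described above.  It is only an illustration, not NOVA code: the record layout, the helper names `map_record()` and `bump()`, and the one-page region size are all assumptions made for the example.  It uses plain POSIX `mmap()` and `msync()`, so it runs on any file system; on a DAX mapping the same pattern applies, with the application choosing when its stores must be durable.  Note that the update stores 8 bytes, whereas a COW of the containing 4KB page would copy 4096 bytes.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE 4096 /* one 4KB page; size chosen for this sketch */

/* Hypothetical application record that lives directly in the mapped file. */
struct counter_rec {
	uint64_t value;
};

/*
 * Create (if needed) and map `path` read-write, shared.
 * Returns a pointer to the record at offset 0, or NULL on failure.
 * On success the open descriptor is stored in *fd_out.
 */
struct counter_rec *map_record(const char *path, int *fd_out)
{
	int fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return NULL;

	/* Make sure the file covers the region we are about to map. */
	if (ftruncate(fd, REGION_SIZE) != 0) {
		close(fd);
		return NULL;
	}

	void *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return NULL;
	}

	*fd_out = fd;
	return p;
}

/*
 * Small in-place update: one 8-byte store, then an explicit flush.
 * The application, not the file system, decides when the update must
 * be durable and how to make it atomic with respect to crashes.
 * Returns 0 on success, -1 on failure (as msync() does).
 */
int bump(struct counter_rec *rec)
{
	rec->value += 1;
	return msync(rec, REGION_SIZE, MS_SYNC);
}
```

A caller would map the file once with `map_record()`, call `bump()` as often as needed, and `munmap()` the region when done.  Anything stronger than this (for example, keeping a multi-word structure consistent across a crash) would need an application-level scheme such as logging, which is the responsibility the DAX bargain places on the program.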

-steve

--
Composed on (and maybe dictated to) my phone.

> On Aug 4, 2017, at 08:09, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> 
>> On Thu, 2017-08-03 at 00:48 -0700, Steven Swanson wrote:
>> +### DAX Support
>> +
>> +Supporting DAX efficiently is a core feature of NOVA and one of the challenges
>> +in designing NOVA is reconciling DAX support which aims to avoid file system
>> +intervention when file data changes, and other features that require such
>> +intervention.
>> +
>> +NOVA's philosophy with respect to DAX is that when a program uses DAX mmap to
>> +to modify a file, the program must take full responsibility for that data and
>> +NOVA must ensure that the memory will behave as expected.  At other times, the
>> +file system provides protection.  This approach has several implications:
>> +
>> +1. Implementing `msync()` in user space works fine.
>> +
>> +2. While a file is mmap'd, it is not protected by NOVA's RAID-style parity
>> +mechanism, because protecting it would be too expensive.  When the file is
>> +unmapped and/or during file system recovery, protection is restored.
>> +
>> +3. The snapshot mechanism must be careful about the order in which in adds
>> +pages to the file's snapshot image.
> 
> Hello Steven,
> 
> Thank you for having shared this very interesting work. After having read the
> NOVA paper and patch 01/16 I have a question for you. Does the above mean that
> COW is disabled for writable mmap-ed files? If so, what is the reason behind
> this? Is there a fundamental issue that does not allow to implement COW for
> writable mmap-ed files? Or have you perhaps tried to implement this and was the
> performance not sufficient? Please note that I'm neither a filesystem nor a
> persistent memory expert.
> 
> Thanks,
> 
> Bart.


* Re: [RFC 00/16] NOVA: a new file system for persistent memory
  2017-08-03  7:48 ` Steven Swanson
@ 2017-10-09 15:32   ` Miklos Szeredi
  -1 siblings, 0 replies; 43+ messages in thread
From: Miklos Szeredi @ 2017-10-09 15:32 UTC (permalink / raw)
  To: Steven Swanson
  Cc: Steven Swanson, linux-nvdimm, LKML, linux-fsdevel, Steven Whitehouse

On Thu, Aug 3, 2017 at 9:48 AM, Steven Swanson <swanson@eng.ucsd.edu> wrote:
> This is an RFC patch series that impements NOVA (NOn-Volatile memory
> Accelerated file system), a new file system built for PMEM.

Hi,

Thanks for posting.

I read the paper and the design looks nice.  Then I looked at the
patches, but could not find a place to start, nor something I could
actually try out.  So let me suggest some ways to make this more
reviewer/tester friendly:

1) try starting with something very simple yet working and supporting
the final layout
   - no optimizations (one big lock, no per-cpu data, rcu, numa, etc support)
   - no support for optional features (checksumming, NFS export, etc)
   - mandatory features may be missing (e.g. just readdir and getattr support)
   - try and get it down to <5k lines, preferably 2-3k

2) pointer to sources and instructions for trying it out without
special hardware

3) build on this minimal working version by
   - adding mandatory features
   - then adding optimizations

4) each patch should leave the tree in a compiling and working state
but should be small and easily reviewed

5) leave optional features and unimportant optimizations for a later
submission; try to make the patchset as small as you meaningfully can
(i.e. it should be fully working and demonstrate the capabilities and
performance, but nothing more).

Thanks,
Miklos

