Linux-Fsdevel Archive on lore.kernel.org
* [RFC v2 00/83] NOVA: a new file system for persistent memory
@ 2018-03-10 18:17 Andiry Xu
  2018-03-10 18:17 ` [RFC v2 01/83] Introduction and documentation of NOVA filesystem Andiry Xu
                   ` (83 more replies)
  0 siblings, 84 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:17 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

This is the second version of the RFC patch series that implements
NOVA (NOn-Volatile memory Accelerated file system), a new file system built for PMEM.

NOVA's goal is to provide a high performance, production-ready
file system tailored for byte-addressable non-volatile memories (e.g., NVDIMMs
and Intel's soon-to-be-released 3DXpoint DIMMs).
     
NOVA was developed at the Non-Volatile Systems Laboratory in the Computer
Science and Engineering Department at the University of California, San Diego.
Its primary authors are Andiry Xu <jix024@cs.ucsd.edu>, Lu Zhang
<luzh@eng.ucsd.edu>, and Steven Swanson <swanson@eng.ucsd.edu>.
     
NOVA is stable enough to run complex applications, but there is substantial
work left to do.  This RFC is intended to gather feedback to guide its
development toward eventual inclusion upstream.
     
The patches are based on Linux 4.16-rc4.


Changes from v1:

* Remove snapshot, metadata replication and data parity for future submission.
  This significantly reduces complexity and LOC: 22129 -> 13834.

* Break down the code in a more reviewer-friendly way:
  The patchset starts with a simple skeleton and adds more features gradually.
  Each patch leaves the tree in a compilable and working state,
  and is small and self-contained, which makes it easier to review.

* Fix bugs so that NOVA passes xfstests: https://github.com/NVSL/xfstests


Overview
========

NOVA is primarily a log-structured file system, but rather than maintain a
single global log for the entire file system, it maintains separate logs for
each inode.  NOVA breaks the logs into 4KB pages, which need not be
contiguous in memory.  The logs only contain metadata.
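
As a rough userspace illustration of this layout (not the code in this
series), a per-inode log can be modelled as a linked list of 4KB pages that
grows by one page whenever the tail page fills up.  The names log_page,
inode_log and log_append(), the 64-byte entry size, and the 64-byte page tail
are assumptions made for the sketch; persistence (cache flushes, atomic tail
updates) is deliberately left out.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define LOG_PAGE_SIZE    4096u
#define LOG_ENTRY_SIZE   64u   /* assumption: fixed-size entries */
#define LOG_TAIL_SIZE    64u   /* assumption: space reserved for the page tail */
#define ENTRIES_PER_PAGE ((LOG_PAGE_SIZE - LOG_TAIL_SIZE) / LOG_ENTRY_SIZE)

/* One log page: a run of entries plus a link to the next page. */
struct log_page {
	uint8_t entries[ENTRIES_PER_PAGE][LOG_ENTRY_SIZE];
	struct log_page *next;   /* pages need not be contiguous */
};

/* Hypothetical per-inode log handle. */
struct inode_log {
	struct log_page *head;
	struct log_page *tail;
	uint32_t tail_used;      /* entries used in the tail page */
	uint32_t num_pages;
};

/* Append one entry, linking in a fresh page when the tail page is full. */
int log_append(struct inode_log *log, const void *entry, size_t len)
{
	if (len > LOG_ENTRY_SIZE)
		return -1;

	if (!log->tail || log->tail_used == ENTRIES_PER_PAGE) {
		struct log_page *page = calloc(1, sizeof(*page));

		if (!page)
			return -1;
		if (log->tail)
			log->tail->next = page;
		else
			log->head = page;
		log->tail = page;
		log->tail_used = 0;
		log->num_pages++;
	}

	memcpy(log->tail->entries[log->tail_used], entry, len);
	log->tail_used++;
	return 0;
}
```

Because each inode owns its own list, appends to different inodes never touch
the same pages in this model.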
	
File data pages reside outside the log, and log entries for write operations
point to the data pages they modify.  File modifications can be done either
in place or with copy-on-write (COW), which provides atomic file updates.
	
For file operations that involve multiple inodes, NOVA uses small, fixed-size
redo logs to atomically append log entries to the logs of the inodes involved.
	
This structure keeps logs small and makes garbage collection very fast.  It also
enables enormous parallelism during recovery from an unclean unmount, since
threads can scan logs in parallel.
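
The log sizing rules described in nova.txt (patch 01) can be sketched as two
small helpers: logs shorter than 1MB double when they fill up, longer logs
grow by a fixed 1MB, and thorough GC runs when the valid entries would fit in
less than half of the log's pages.  This is a sketch under those stated rules
only; the function names and the handling of an empty log are assumptions,
not the code in fs/nova/.

```c
#include <stdbool.h>
#include <stdint.h>

#define LOG_PAGE_SIZE      4096u
#define LOG_GROW_CAP_BYTES (1024u * 1024u)
#define LOG_GROW_CAP_PAGES (LOG_GROW_CAP_BYTES / LOG_PAGE_SIZE)

/* Number of pages to add when a log of cur_pages pages is full. */
uint64_t log_pages_to_add(uint64_t cur_pages)
{
	if (cur_pages == 0)
		return 1;                 /* assumption: start with one page */
	if (cur_pages * LOG_PAGE_SIZE < LOG_GROW_CAP_BYTES)
		return cur_pages;         /* short log: double the length */
	return LOG_GROW_CAP_PAGES;        /* long log: add a fixed 1MB */
}

/*
 * Decide whether the second (thorough) GC phase should copy valid entries
 * to new pages: true if they would fill less than half of the log's pages.
 */
bool wants_thorough_gc(uint64_t valid_entries, uint64_t entries_per_page,
		       uint64_t log_pages)
{
	uint64_t needed;

	if (entries_per_page == 0)
		return false;
	needed = (valid_entries + entries_per_page - 1) / entries_per_page;
	return needed * 2 < log_pages;
}
```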
	
Documentation/filesystems/nova.txt contains some lower-level implementation and
usage information.  A more thorough discussion of NOVA's goals and design is
available in two papers:
	
NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories
http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf
Jian Xu and Steven Swanson
Published in FAST 2016

NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System
http://cseweb.ucsd.edu/~swanson/papers/SOSP2017-NOVAFortis.pdf
Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah,
Amit Borase, Tamires Brito Da Silva, Andy Rudoff, Steven Swanson
Published in SOSP 2017

This version contains features from the FAST paper. We leave NOVA-Fortis
features for future submission.


Build and Run
=============

To build NOVA, build the kernel with PMEM (`CONFIG_BLK_DEV_PMEM`),
DAX (`CONFIG_FS_DAX`) and NOVA (`CONFIG_NOVA_FS`) support.  Install as usual.

NOVA runs on a pmem non-volatile memory region created by the memmap kernel option.
For instance, adding 'memmap=16G!8G' to the kernel boot parameters will reserve
16GB memory starting from address 8GB, and the kernel will create a pmem0 
block device under the /dev directory.
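
To make the arithmetic of the size!start form explicit, here is a small
userspace sketch that parses such a string and reports the reserved physical
range.  The helper names are made up for this example, and only the K, M and
G suffixes are handled; the kernel's own parser is not reproduced here.

```c
#include <stdint.h>
#include <stdlib.h>

struct pmem_range {
	uint64_t start;   /* first reserved physical address */
	uint64_t size;    /* length of the reserved region in bytes */
};

/* Parse a number with an optional K/M/G suffix; *end is set past it. */
static uint64_t parse_size(const char *s, const char **end)
{
	char *p;
	uint64_t v = strtoull(s, &p, 0);

	switch (*p) {
	case 'G': case 'g':
		v <<= 10; /* fall through */
	case 'M': case 'm':
		v <<= 10; /* fall through */
	case 'K': case 'k':
		v <<= 10;
		p++;
		break;
	default:
		break;
	}
	*end = p;
	return v;
}

/* Parse "size!start", e.g. "16G!8G".  Returns 0 on success, -1 on error. */
int parse_memmap_pmem(const char *arg, struct pmem_range *out)
{
	const char *p, *q;
	uint64_t size, start;

	size = parse_size(arg, &p);
	if (p == arg || *p != '!')
		return -1;
	start = parse_size(p + 1, &q);
	if (q == p + 1 || *q != '\0')
		return -1;

	out->size = size;
	out->start = start;
	return 0;
}
```

For "16G!8G" this yields a 16GB region that starts at the 8GB boundary and
ends at 24GB, so the machine needs usable RAM covering that whole range.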

After the OS has booted, initialize a NOVA instance with the following commands:

# modprobe nova
# mount -t NOVA -o init /dev/pmem0 /mnt/nova

The above commands create a NOVA instance on /dev/pmem0 and mount it on
/mnt/nova. Currently NOVA does not have mkfs or fsck support.


Performance
===========

Compared to other DAX file systems such as ext4-DAX and xfs-DAX,
NOVA provides fine-grained, byte-granularity metadata operations,
and it performs better in metadata-intensive and write-intensive applications.
NOVA also excels at the append-fsync access pattern (i.e., write-ahead logging),
which is very common in DBMSs and key-value stores.

The following tests were performed on an Intel i7-3770K with 16GB DRAM
and 8GB PMEM emulated with DRAM. The kernel is 64-bit 4.16-rc4 on Ubuntu 16.04.
Performance may vary on different platforms.


Filebench throughput (ops/s):
		xfs-DAX	ext4-DAX	NOVA
Fileserver	86971	177826		334166
Varmail		148032	288033		999794
Webserver	370245	370144		374130
Webproxy	315084	737544		927216

Webserver is read-intensive and all the file systems have similar performance.


SQLite test:
SQLite has four journaling modes:
Delete: delete the undo log file after transaction commit
Truncate: truncate the undo log file to zero after transaction commit
Persist: write a flag at the beginning of the log file after transaction commit
WAL: write-ahead logging

SQLite insert (transactions/s):
		xfs-DAX	ext4-DAX	NOVA
Delete		18525	23615		45289
Truncate	21930	26391		52046	
Persist		58053	56106		50554
WAL		38622	62703		85395

NOVA performs poorly in Persist mode because it does copy-on-write for writes,
and therefore writes a full 4KB page even for sub-page writes.
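
The cost of sub-page copy-on-write can be estimated with a few lines of
arithmetic.  The sketch below counts how many data bytes a COW write touches
when every affected 4KB page is rewritten in full; cow_bytes_written() is a
hypothetical helper for illustration, and it ignores the log entry that
accompanies each write.

```c
#include <stdint.h>

#define DATA_PAGE_SIZE 4096u

/*
 * Data bytes written when a write of len bytes at offset is served with
 * copy-on-write at 4KB page granularity.
 */
uint64_t cow_bytes_written(uint64_t offset, uint64_t len)
{
	uint64_t first, last;

	if (len == 0)
		return 0;
	first = offset / DATA_PAGE_SIZE;
	last = (offset + len - 1) / DATA_PAGE_SIZE;
	return (last - first + 1) * DATA_PAGE_SIZE;
}
```

Under this model a 16-byte overwrite at the start of a file costs 4096 bytes
of data writes, a 256x amplification, and the same 16 bytes straddling a page
boundary cost 8192 bytes.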


Redis: fsync the WAL file after every set.
Redis set throughput (trans/s):
xfs-DAX	ext4-DAX	NOVA
49771	88308		102560


RocksDB fillunique test (ops/s):
		xfs-DAX	ext4-DAX	NOVA
WAL sync	33563	62066		295655
WAL nosync	254533	288106		393713

Both ext4-DAX and xfs-DAX suffer from high fsync overhead.

More test results are available in the two NOVA papers.

NOVA uses per-inode logging, per-CPU inode table and journal to avoid lock contention.
We use the FxMark test suite (https://github.com/sslab-gatech/fxmark)
to test the filesystem scalability. The results are at
http://cseweb.ucsd.edu/~jix024/sc.pdf
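
As an illustration of the per-CPU inode tables, nova.txt (patch 01) describes
inode numbers as striped across the tables: inos 0, n, 2n, ... on CPU 0 and
inos 1, n + 1, 2n + 1, ... on CPU 1.  A minimal sketch of that mapping
follows; the struct and function names are assumptions for this example and
do not correspond to the helpers in fs/nova/inode.c.

```c
#include <stdint.h>

struct inode_slot {
	uint32_t cpu;     /* which per-CPU inode table holds the inode */
	uint64_t index;   /* position of the inode within that table */
};

/* Map an inode number to its per-CPU table and the index inside it. */
struct inode_slot inode_slot_for(uint64_t ino, uint32_t ncpus)
{
	struct inode_slot slot;

	if (ncpus == 0)
		ncpus = 1;   /* assumption: treat an invalid count as one CPU */
	slot.cpu = (uint32_t)(ino % ncpus);
	slot.index = ino / ncpus;
	return slot;
}
```

With four CPUs, inode 5 lands in CPU 1's table at index 1 and inode 8 lands
in CPU 0's table at index 2, so consecutive inode numbers spread across
different tables.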


Thanks,
Andiry

---

Andiry Xu (83):
  Introduction and documentation of NOVA filesystem.
  Add nova_def.h.
  Add super.h.
  NOVA inode definition.
  Add NOVA filesystem definitions and useful helper routines.
  Add inode get/read methods.
  Initialize inode_info and rebuild inode information in nova_iget().
  NOVA superblock operations.
  Add Kconfig and Makefile
  Add superblock integrity check.
  Add timing and I/O statistics for performance analysis and profiling.
  Add timing for mount and init.
  Add remount_fs and show_options methods.
  Add range node kmem cache.
  Add free list data structure.
  Initialize block map and free lists in nova_init().
  Add statfs support.
  Add freelist statistics printing.
  Add pmem block free routines.
  Pmem block allocation routines.
  Add log structure.
  Inode log pages allocation and reclaimation.
  Save allocator to pmem in put_super.
  Initialize and allocate inode table.
  Support get normal inode address and inode table extentsion.
  Add inode_map to track inuse inodes.
  Save the inode inuse list to pmem upon umount
  Add NOVA address space operations
  Add write_inode and dirty_inode routines.
  New NOVA inode allocation.
  Add new vfs inode allocation.
  Add log entry definitions.
  Inode log and entry printing for debug purpose.
  Journal: NOVA light weight journal definitions.
  Journal: Lite journal helper routines.
  Journal: Lite journal recovery.
  Journal: Lite journal create and commit.
  Journal: NOVA lite journal initialization.
  Log operation: dentry append.
  Log operation: file write entry append.
  Log operation: setattr entry append
  Log operation: link change append.
  Log operation: in-place update log entry
  Log operation: invalidate log entries
  Log operation: file inode log lookup and assign
  Dir: Add Directory radix tree insert/remove methods.
  Dir: Add initial dentries when initializing a directory inode log.
  Dir: Readdir operation.
  Dir: Append create/remove dentry.
  Inode: Add nova_evict_inode.
  Rebuild: directory inode.
  Rebuild: file inode.
  Namei: lookup.
  Namei: create and mknod.
  Namei: mkdir
  Namei: link and unlink.
  Namei: rmdir
  Namei: rename
  Namei: setattr
  Add special inode operations.
  Super: Add nova_export_ops.
  File: getattr and file inode operations
  File operation: llseek.
  File operation: open, fsync, flush.
  File operation: read.
  Super: Add file write item cache.
  Dax: commit list of file write items to log.
  File operation: copy-on-write write.
  Super: Add module param inplace_data_updates.
  File operation: Inplace write.
  Symlink support.
  File operation: fallocate.
  Dax: Add iomap operations.
  File operation: Mmap.
  File operation: read/write iter.
  Ioctl support.
  GC: Fast garbage collection.
  GC: Thorough garbage collection.
  Normal recovery.
  Failure recovery: bitmap operations.
  Failure recovery: Inode pages recovery routines.
  Failure recovery: Per-CPU recovery.
  Sysfs support.

 Documentation/filesystems/00-INDEX |    2 +
 Documentation/filesystems/nova.txt |  498 +++++++++++++
 MAINTAINERS                        |    8 +
 fs/Kconfig                         |    2 +
 fs/Makefile                        |    1 +
 fs/nova/Kconfig                    |   15 +
 fs/nova/Makefile                   |    8 +
 fs/nova/balloc.c                   |  730 ++++++++++++++++++
 fs/nova/balloc.h                   |   96 +++
 fs/nova/bbuild.c                   | 1437 ++++++++++++++++++++++++++++++++++++
 fs/nova/bbuild.h                   |   28 +
 fs/nova/dax.c                      |  970 ++++++++++++++++++++++++
 fs/nova/dir.c                      |  520 +++++++++++++
 fs/nova/file.c                     |  728 ++++++++++++++++++
 fs/nova/gc.c                       |  459 ++++++++++++
 fs/nova/inode.c                    | 1310 ++++++++++++++++++++++++++++++++
 fs/nova/inode.h                    |  277 +++++++
 fs/nova/ioctl.c                    |  184 +++++
 fs/nova/journal.c                  |  412 +++++++++++
 fs/nova/journal.h                  |   56 ++
 fs/nova/log.c                      | 1111 ++++++++++++++++++++++++++++
 fs/nova/log.h                      |  417 +++++++++++
 fs/nova/namei.c                    |  848 +++++++++++++++++++++
 fs/nova/nova.h                     |  566 ++++++++++++++
 fs/nova/nova_def.h                 |  128 ++++
 fs/nova/rebuild.c                  |  499 +++++++++++++
 fs/nova/stats.c                    |  600 +++++++++++++++
 fs/nova/stats.h                    |  178 +++++
 fs/nova/super.c                    | 1063 ++++++++++++++++++++++++++
 fs/nova/super.h                    |  171 +++++
 fs/nova/symlink.c                  |  133 ++++
 fs/nova/sysfs.c                    |  379 ++++++++++
 32 files changed, 13834 insertions(+)
 create mode 100644 Documentation/filesystems/nova.txt
 create mode 100644 fs/nova/Kconfig
 create mode 100644 fs/nova/Makefile
 create mode 100644 fs/nova/balloc.c
 create mode 100644 fs/nova/balloc.h
 create mode 100644 fs/nova/bbuild.c
 create mode 100644 fs/nova/bbuild.h
 create mode 100644 fs/nova/dax.c
 create mode 100644 fs/nova/dir.c
 create mode 100644 fs/nova/file.c
 create mode 100644 fs/nova/gc.c
 create mode 100644 fs/nova/inode.c
 create mode 100644 fs/nova/inode.h
 create mode 100644 fs/nova/ioctl.c
 create mode 100644 fs/nova/journal.c
 create mode 100644 fs/nova/journal.h
 create mode 100644 fs/nova/log.c
 create mode 100644 fs/nova/log.h
 create mode 100644 fs/nova/namei.c
 create mode 100644 fs/nova/nova.h
 create mode 100644 fs/nova/nova_def.h
 create mode 100644 fs/nova/rebuild.c
 create mode 100644 fs/nova/stats.c
 create mode 100644 fs/nova/stats.h
 create mode 100644 fs/nova/super.c
 create mode 100644 fs/nova/super.h
 create mode 100644 fs/nova/symlink.c
 create mode 100644 fs/nova/sysfs.c

-- 
2.7.4


* [RFC v2 01/83] Introduction and documentation of NOVA filesystem.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
@ 2018-03-10 18:17 ` Andiry Xu
  2018-03-19 20:43   ` Randy Dunlap
  2018-04-22  8:05   ` Pavel Machek
  2018-03-10 18:17 ` [RFC v2 02/83] Add nova_def.h Andiry Xu
                   ` (82 subsequent siblings)
  83 siblings, 2 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:17 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA is a log-structured file system tailored for byte-addressable non-volatile memories.
It was designed and developed at the Non-Volatile Systems Laboratory in the Computer
Science and Engineering Department at the University of California, San Diego.
Its primary authors are Andiry Xu <jix024@eng.ucsd.edu>, Lu Zhang
<luzh@eng.ucsd.edu>, and Steven Swanson <swanson@eng.ucsd.edu>.

These two papers provide a detailed, high-level description of NOVA's design goals and approach:

   NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories
   In The 14th USENIX Conference on File and Storage Technologies (FAST '16)
   (http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf)

   NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System
   In The 26th ACM Symposium on Operating Systems Principles (SOSP '17)
   (http://cseweb.ucsd.edu/~swanson/papers/SOSP2017-NOVAFortis.pdf)

This patchset contains features from the FAST paper. We leave NOVA-Fortis features,
such as snapshots, metadata and data replication, and RAID parity, for
future submission.

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 Documentation/filesystems/00-INDEX |   2 +
 Documentation/filesystems/nova.txt | 498 +++++++++++++++++++++++++++++++++++++
 MAINTAINERS                        |   8 +
 3 files changed, 508 insertions(+)
 create mode 100644 Documentation/filesystems/nova.txt

diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index b7bd6c9..dc5c722 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -95,6 +95,8 @@ nfs/
 	- nfs-related documentation.
 nilfs2.txt
 	- info and mount options for the NILFS2 filesystem.
+nova.txt
+	- info on the NOVA filesystem.
 ntfs.txt
 	- info and mount options for the NTFS filesystem (Windows NT).
 ocfs2.txt
diff --git a/Documentation/filesystems/nova.txt b/Documentation/filesystems/nova.txt
new file mode 100644
index 0000000..4728f50
--- /dev/null
+++ b/Documentation/filesystems/nova.txt
@@ -0,0 +1,498 @@
+The NOVA Filesystem
+===================
+
+NOn-Volatile memory Accelerated file system (NOVA) is a DAX file system
+designed to provide a high performance and production-ready file system
+tailored for byte-addressable non-volatile memories (e.g., NVDIMMs
+and Intel's soon-to-be-released 3DXPoint DIMMs).
+NOVA combines design elements from many other file systems
+and adapts conventional log-structured file system techniques to
+exploit the fast random access that NVMs provide. In particular, NOVA maintains
+separate logs for each inode to improve concurrency, and stores file data
+outside the log to minimize log size and reduce garbage collection costs. NOVA's
+logs provide metadata and data atomicity and focus on simplicity and
+reliability, keeping complex metadata structures in DRAM to accelerate lookup
+operations.
+
+NOVA was developed by the Non-Volatile Systems Laboratory (NVSL) in
+the Computer Science and Engineering Department at the University of
+California, San Diego.
+
+A more thorough discussion of NOVA's design is available in these two papers:
+
+NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories
+Jian Xu and Steven Swanson
+In The 14th USENIX Conference on File and Storage Technologies (FAST '16)
+
+NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System
+Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase,
+Tamires Brito Da Silva, Andy Rudoff and Steven Swanson
+In The 26th ACM Symposium on Operating Systems Principles (SOSP '17)
+
+This version of NOVA contains features from the FAST paper.
+NOVA-Fortis features, such as snapshot, metadata and data protection and replication
+are left for future submission.
+
+The main NOVA features include:
+
+  * POSIX semantics
+  * Direct access (DAX) to byte-addressable NVMM without page caching
+  * Per-CPU NVMM pool to maximize concurrency
+  * Strong consistency guarantees with 8-byte atomic stores
+
+
+Filesystem Design
+=================
+
+NOVA divides NVMM into several regions. NOVA's 512B superblock contains global
+file system information and the recovery inode. The recovery inode represents a
+special file that stores recovery information (e.g., the list of unallocated
+NVMM pages). NOVA divides its inode tables into per-CPU stripes. It also
+provides per-CPU journals for complex file operations that involve multiple
+inodes. The rest of the available NVMM stores logs and file data.
+
+NOVA is log-structured and stores a separate log for each inode to maximize
+concurrency and provide atomicity for operations that affect a single file. The
+logs only store metadata and comprise a linked list of 4 KB pages. Log entries
+are small – between 32 and 64 bytes. Logs are generally non-contiguous, and log
+pages may reside anywhere in NVMM.
+
+NOVA keeps copies of most file metadata in DRAM during normal
+operations, eliminating the need to access metadata in NVMM during reads.
+
+NOVA supports both copy-on-write and in-place file data updates and appends
+metadata about the write to the log. For operations that affect multiple inodes,
+NOVA uses lightweight, fixed-length journals, one per core.
+
+NOVA divides the allocatable NVMM into multiple regions, one region per CPU
+core. A per-core allocator manages each of the regions, minimizing contention
+during memory allocation.
+
+After a system crash, NOVA must scan all the logs to rebuild the memory
+allocator state. Since there are many logs, NOVA aggressively parallelizes the
+scan.
+
+
+Building and using NOVA
+=======================
+
+To build NOVA, build the kernel with PMEM (`CONFIG_BLK_DEV_PMEM`),
+DAX (`CONFIG_FS_DAX`) and NOVA (`CONFIG_NOVA_FS`) support.  Install as usual.
+
+NOVA runs on a pmem non-volatile memory region.  You can create one of these
+regions with the `memmap` kernel command line option.  For instance, adding
+`memmap=16G!8G` to the kernel boot parameters will reserve 16GB memory starting
+from address 8GB, and the kernel will create a `pmem0` block device under the
+`/dev` directory.
+
+After the OS has booted, you can initialize a NOVA instance with the following commands:
+
+
+# modprobe nova
+# mount -t NOVA -o init /dev/pmem0 /mnt/nova
+
+
+The above commands create a NOVA instance on `/dev/pmem0` and mount it on
+`/mnt/nova`.
+
+NOVA supports several module command line options:
+
+ * measure_timing: Measure the timing of file system operations for profiling (default: 0)
+
+ * inplace_data_updates:  Update data in place rather than with COW (default: 0)
+
+To recover an existing NOVA instance, mount NOVA without the init option, for example:
+
+# mount -t NOVA /dev/pmem0 /mnt/nova
+
+
+Sysfs support
+-------------
+
+NOVA provides sysfs support to let users get and set information about
+a running NOVA instance.
+After mount, NOVA creates four entries under the proc directory /proc/fs/NOVA/pmem#/:
+
+timing_stats	IO_stats	allocator	gc
+
+Show NOVA file operation timing statistics:
+# cat /proc/fs/NOVA/pmem#/timing_stats
+
+Clear timing statistics:
+# echo 1 > /proc/fs/NOVA/pmem#/timing_stats
+
+Show NOVA I/O statistics:
+# cat /proc/fs/NOVA/pmem#/IO_stats
+
+Clear I/O statistics:
+# echo 1 > /proc/fs/NOVA/pmem#/IO_stats
+
+Show NOVA allocator information:
+# cat /proc/fs/NOVA/pmem#/allocator
+
+Manual garbage collection:
+# echo #inode_number > /proc/fs/NOVA/pmem#/gc
+
+
+Source File Structure
+=====================
+
+  * nova_def.h/nova.h
+    Defines NOVA macros and key inline functions.
+
+  * balloc.{h,c}
+    NOVA's pmem allocator implementation.
+
+  * bbuild.c
+    Implements recovery routines to restore the in-use inode list and the NVMM
+    allocator information.
+
+  * dax.c
+    Implements DAX read/write and mmap functions to access file data. NOVA uses
+    copy-on-write to modify file pages by default, unless inplace data update is
+    enabled at mount-time.
+
+  * dir.c
+    Contains functions to create, update, and remove NOVA dentries.
+
+  * file.c
+    Implements file-related operations such as open, fallocate, llseek, fsync,
+    and flush.
+
+  * gc.c
+    NOVA's garbage collection functions.
+
+  * inode.{h,c}
+    Creates, reads, and frees NOVA inode tables and inodes.
+
+  * ioctl.c
+    Implements some ioctl commands to call NOVA's internal functions.
+
+  * journal.{h,c}
+    For operations that affect multiple inodes NOVA uses lightweight,
+    fixed-length journals – one per core. This file contains functions to
+    create and manage the lite journals.
+
+  * log.{h,c}
+    Functions to manipulate NOVA inode logs, including log page allocation, log
+    entry creation, commit, modification, and deletion.
+
+  * namei.c
+    Functions to create/remove files, directories, and links. It also looks for
+    the NOVA inode number for a given path name.
+
+  * rebuild.c
+    When mounting NOVA, rebuild NOVA inodes from its logs.
+
+  * stats.{h,c}
+    Provide routines to gather and print NOVA usage statistics.
+
+  * super.{h,c}
+    Super block structures and NOVA FS layout and entry points for NOVA
+    mounting and unmounting, initializing or recovering the NOVA super block
+    and other global file system information.
+
+  * symlink.c
+    Implements functions to create and read symbolic links in the filesystem.
+
+  * sysfs.c
+    Implements sysfs entries to take user inputs for printing NOVA statistics.
+
+
+Filesystem Layout
+=================
+
+A NOVA file system resides in a single PMEM device.
+NOVA divides the device into 4KB blocks.
+
+ block
++---------------------------------------------------------+
+|    0    | primary super block (struct nova_super_block) |
++---------------------------------------------------------+
+|    1    | Reserved inodes                               |
++---------------------------------------------------------+
+|  2 - 15 | reserved                                      |
++---------------------------------------------------------+
+| 16 - 31 | Inode table pointers                          |
++---------------------------------------------------------+
+| 32 - 47 | Journal pointers                              |
++---------------------------------------------------------+
+| 48 - 63 | reserved                                      |
++---------------------------------------------------------+
+|   ...   | log and data pages                            |
++---------------------------------------------------------+
+|   n-2   | replica reserved Inodes                       |
++---------------------------------------------------------+
+|   n-1   | replica super block                           |
++---------------------------------------------------------+
+
+
+
+Superblock and Associated Structures
+====================================
+
+The beginning of the PMEM device holds the super block and its associated
+tables.  These include reserved inodes, a table of pointers to the journals
+NOVA uses for complex operations, and pointers to inode tables.  NOVA
+maintains replicas of the super block and reserved inodes in the last two
+blocks of the PMEM area.
+
+
+Block Allocator/Free Lists
+==========================
+
+NOVA uses per-CPU allocators to manage free PMEM blocks.  On initialization,
+NOVA divides the range of blocks in the PMEM device among the CPUs, and each
+CPU manages its own blocks exclusively.  We call these ranges "allocation regions".
+Each allocator maintains a red-black tree of unallocated ranges (struct
+nova_range_node).
+
+Allocation Functions
+--------------------
+
+NOVA allocates PMEM blocks using two mechanisms:
+
+1.  Static allocation as defined in super.h
+
+2.  Allocation for log and data pages via nova_new_log_blocks() and
+nova_new_data_blocks().
+
+
+PMEM Address Translation
+------------------------
+
+In NOVA's persistent data structures, memory locations are given as offsets
+from the beginning of the PMEM region.  nova_get_block() translates offsets to
+PMEM addresses.  nova_get_addr_off() performs the reverse translation.
+
+
+Inodes
+======
+
+NOVA maintains per-CPU inode tables, and inode numbers are striped across the
+tables (i.e., inos 0, n, 2n,... on cpu 0; inos 1, n + 1, 2n + 1, ... on cpu 1).
+
+The inodes themselves live in a set of linked lists (one per CPU) of 2MB
+blocks.  The last 8 bytes of each block point to the next block.  Pointers to
+the heads of these lists live in PMEM block INODE_TABLE_START.
+Additional space for inodes is allocated on demand.
+
+To allocate inodes, NOVA maintains a per-CPU "inuse_list" in DRAM: an RB tree
+that holds ranges of allocated inode numbers.
+
+
+Logs
+====
+
+NOVA maintains a log for each inode that records updates to the inode's
+metadata and holds pointers to the file data.  NOVA makes updates to file data
+and metadata atomic by atomically appending log entries to the log.
+
+Each inode contains pointers to head and tail of the inode's log.  When the log
+grows past the end of the last page, NOVA allocates additional space.  For
+short logs (less than 1MB), it doubles the length.  For longer logs, it adds a
+fixed amount of additional space (1MB).
+
+Log space is reclaimed during garbage collection.
+
+Log Entries
+-----------
+
+There are four kinds of log entry, documented in log.h.  The log entries have
+several fields in common:
+
+   1.  'epoch_id' gives the epoch during which the log entry was created.
+   Creating a snapshot increments the epoch_id for the file system.
+   Currently disabled (always zero).
+
+   2.  'trans_id' is a per-inode, monotonically increasing number assigned to
+   each log entry.  It provides an ordering over FS operations on a single inode.
+
+   3.  'invalid' is true if the effects of this entry are dead and the log
+   entry can be garbage collected.
+
+   4.  'csum' is a CRC32 checksum for the entry. Currently it is disabled.
+
+Log structure
+-------------
+
+The logs comprise a linked list of PMEM blocks.  The tail of each block
+contains some metadata about the block and pointers to the next block and
+block's replica (struct nova_inode_page_tail).
+
++----------------+
+| log entry      |
++----------------+
+| log entry      |
++----------------+
+| ...            |
++----------------+
+| tail           |
+|  metadata      |
+|  -> next block |
++----------------+
+
+
+Journals
+========
+
+NOVA uses a lightweight journaling mechanism to provide atomicity for
+operations that modify more than one inode.  The journals provide logging
+for two operations:
+
+1.  Single word updates (JOURNAL_ENTRY)
+2.  Copying inodes (JOURNAL_INODE)
+
+The journals are undo logs: NOVA creates the journal entries for an operation,
+and if the operation does not complete due to a system failure, the recovery
+process rolls back the changes using the journal entries.
+
+To commit, NOVA drops the log.
+
+NOVA maintains one journal per CPU.  The head and tail pointers for each
+journal live in a reserved page near the beginning of the file system.
+
+During recovery, NOVA scans the journals and undoes the operations described by
+each entry.
+
+
+File and Directory Access
+=========================
+
+To access file data via read(), NOVA maintains a radix tree in DRAM for each
+inode (nova_inode_info_header.tree) that maps file offsets to write log
+entries.  For directories, the same tree maps a hash of filenames to their
+corresponding dentry.
+
+In both cases, NOVA populates the tree when the file or directory is opened
+by scanning its log.
+
+
+MMap and DAX
+============
+
+NOVA leverages the kernel's DAX mechanisms for mmap and file data access.
+NOVA supports DAX-style mmap, i.e. mapping NVM pages directly to the
+application's address space.
+
+
+Garbage Collection
+==================
+
+NOVA recovers log space with a two-phase garbage collection system.  When a log
+reaches the end of its allocated pages, NOVA allocates more space.  Then, the
+fast GC algorithm scans the log to remove pages that have no valid entries.
+Then, it estimates how many pages the log's valid entries would fill.  If this
+is less than half the number of pages in the log, the second GC phase copies
+the valid entries to new pages.
+
+For example (V=valid; I=invalid):
+
++---+         +---+	        +---+
+| I |	       | I |  	      	| V |
++---+	       +---+  Thorough	+---+
+| V |	       | V |  	 GC   	| V |
++---+	       +---+   =====> 	+---+
+| I |	       | I |  	      	| V |
++---+	       +---+	        +---+
+| V |	       | V |  	        | V |
++---+	       +---+            +---+
+  |	         |
+  V	         V
++---+	       +---+
+| I |	       | V |
++---+	       +---+
+| I | fast GC  | I |
++---+  ====>   +---+
+| I |	       | I |
++---+	       +---+
+| I |	       | V |
++---+	       +---+
+  |
+  V
++---+
+| V |
++---+
+| I |
++---+
+| I |
++---+
+| V |
++---+
+
+
+Umount and Recovery
+===================
+
+Clean umount/mount
+------------------
+
+On a clean unmount, NOVA saves the contents of many of its DRAM data structures
+to PMEM to accelerate the next mount:
+
+1. NOVA stores the allocator state for each of the per-cpu allocators to the
+   log of a reserved inode (NOVA_BLOCK_NODE_INO).
+
+2. NOVA stores the per-CPU lists of alive inodes (the inuse_list) to the
+   NOVA_BLOCK_INODELIST_INO reserved inode.
+
+After a clean unmount, the following mount restores these data and then
+invalidates them.
+
+Recovery after failures
+-----------------------
+
+In case of an unclean unmount (e.g., system crash), NOVA must rebuild these
+DRAM structures by scanning the inode logs.  NOVA log scanning is fast because
+per-CPU inode tables and per-inode logs allow for parallel recovery.
+
+The number of live log entries in an inode log is roughly the number of extents
+in the file.  As a result, NOVA only needs to scan a small fraction of the NVMM
+during recovery.
+
+The NOVA failure recovery consists of two steps:
+
+First, NOVA checks its lightweight journals and rolls back any uncommitted
+transactions to restore the file system to a consistent state.
+
+Second, NOVA starts a recovery thread on each CPU and scans the inode tables in
+parallel, performing log scanning for every valid inode in the inode table.
+NOVA uses different recovery mechanisms for directory inodes and file inodes:
+For a directory inode, NOVA scans the log's linked list to enumerate the pages
+it occupies, but it does not inspect the log's contents.  For a file inode,
+NOVA reads the write entries in the log to enumerate the data pages.
+
+During the recovery scan NOVA builds a bitmap of occupied pages, and rebuilds
+the allocator based on the result. After this process completes, the file
+system is ready to accept new requests.
+
+During the same scan, it rebuilds the list of available inodes.
+
+
+Gaps, Missing Features, and Development Status
+==============================================
+
+Although NOVA is a fully-functional file system, there is still much work left
+to be done.  In particular, (at least) the following items are currently missing:
+
+1.  Snapshot, metadata and data replication and protection are left for future submission.
+2.  There is no mkfs or fsck utility (`mount` takes `-o init` to create a NOVA file system).
+3.  NOVA only works on x86-64 kernels.
+4.  NOVA does not currently support extended attributes or ACLs.
+5.  NOVA doesn't provide quota support.
+6.  Moving NOVA file systems between machines with different numbers of CPUs does not work.
+
+None of these are fundamental limitations of NOVA's design.
+
+NOVA is complete and robust enough to run a range of complex applications, but
+it is not yet ready for production use.  Our current focus is on adding a few
+missing features from the list above and finding/fixing bugs.
+
+
+Hacking and Contributing
+========================
+
+If you find bugs, please report them at https://github.com/NVSL/linux-nova/issues.
+
+If you have other questions or suggestions you can contact the NOVA developers
+at cse-nova-hackers@eng.ucsd.edu.
diff --git a/MAINTAINERS b/MAINTAINERS
index 4623caf..89ac59b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9825,6 +9825,14 @@ F:	drivers/power/supply/bq27xxx_battery_i2c.c
 F:	drivers/power/supply/isp1704_charger.c
 F:	drivers/power/supply/rx51_battery.c
 
+NOVA FILESYSTEM
+M:	Andiry Xu <jix024@cs.ucsd.edu>
+M:	Steven Swanson <swanson@cs.ucsd.edu>
+L:	linux-fsdevel@vger.kernel.org
+L:	linux-nvdimm@lists.01.org
+F:	Documentation/filesystems/nova.txt
+F:	fs/nova/
+
 NTB AMD DRIVER
 M:	Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
 L:	linux-ntb@googlegroups.com
-- 
2.7.4

* [RFC v2 02/83] Add nova_def.h.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
  2018-03-10 18:17 ` [RFC v2 01/83] Introduction and documentation of NOVA filesystem Andiry Xu
@ 2018-03-10 18:17 ` Andiry Xu
  2018-03-10 18:17 ` [RFC v2 03/83] Add super.h Andiry Xu
                   ` (81 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:17 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

This file defines NOVA filesystem macros and routines that persist updates
using the x86 cache-line write-back instructions CLWB or CLFLUSH.
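
As a rough illustration of the flush loop in this header, the sketch below
counts how many cache lines a buffer spans, using the same length adjustment
as nova_flush_buffer(). It is a user-space model, not the kernel code:
cachelines_touched() is a hypothetical helper, and the CLWB/CLFLUSH
instruction is replaced by a counter.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_SIZE 64

/*
 * Hypothetical helper (not part of the patch): return the number of
 * cache lines a flush loop shaped like nova_flush_buffer() would visit
 * for a buffer starting at address addr and spanning len bytes.
 */
static size_t cachelines_touched(uintptr_t addr, size_t len)
{
	size_t lines = 0;
	size_t i;

	/* Account for the offset of addr within its first cache line. */
	len += addr & (CACHELINE_SIZE - 1);

	for (i = 0; i < len; i += CACHELINE_SIZE)
		lines++;	/* the real code issues clwb/clflush on addr + i */

	return lines;
}
```

For instance, an 8-byte update that starts at byte offset 60 straddles a
cache-line boundary, so the loop visits two lines rather than one.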

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/nova_def.h | 128 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 128 insertions(+)
 create mode 100644 fs/nova/nova_def.h

diff --git a/fs/nova/nova_def.h b/fs/nova/nova_def.h
new file mode 100644
index 0000000..1cbed6f
--- /dev/null
+++ b/fs/nova/nova_def.h
@@ -0,0 +1,128 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Definitions for the NOVA filesystem.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+#ifndef _LINUX_NOVA_DEF_H
+#define _LINUX_NOVA_DEF_H
+
+#include <linux/types.h>
+#include <linux/magic.h>
+
+#define	NOVA_SUPER_MAGIC	0x4E4F5641	/* NOVA */
+
+/*
+ * The NOVA filesystem constants/structures
+ */
+
+/*
+ * Mount flags
+ */
+#define NOVA_MOUNT_XATTR_USER   0x000002    /* Extended user attributes */
+#define NOVA_MOUNT_POSIX_ACL    0x000004    /* POSIX Access Control Lists */
+#define NOVA_MOUNT_DAX          0x000008    /* Direct Access */
+#define NOVA_MOUNT_ERRORS_CONT  0x000010    /* Continue on errors */
+#define NOVA_MOUNT_ERRORS_RO    0x000020    /* Remount fs ro on errors */
+#define NOVA_MOUNT_ERRORS_PANIC 0x000040    /* Panic on errors */
+#define NOVA_MOUNT_HUGEMMAP     0x000080    /* Huge mappings with mmap */
+#define NOVA_MOUNT_HUGEIOREMAP  0x000100    /* Huge mappings with ioremap */
+#define NOVA_MOUNT_FORMAT       0x000200    /* was FS formatted on mount? */
+
+/*
+ * Maximal count of links to a file
+ */
+#define NOVA_LINK_MAX          32000
+
+#define NOVA_DEF_BLOCK_SIZE_4K 4096
+
+#define NOVA_INODE_BITS   7
+#define NOVA_INODE_SIZE   128    /* must be power of two */
+
+#define NOVA_NAME_LEN 255
+
+#define MAX_CPUS 1024
+
+/* NOVA supported data blocks */
+#define NOVA_BLOCK_TYPE_4K     0
+#define NOVA_BLOCK_TYPE_2M     1
+#define NOVA_BLOCK_TYPE_1G     2
+#define NOVA_BLOCK_TYPE_MAX    3
+
+#define META_BLK_SHIFT 9
+
+/*
+ * Play with this knob to change the default block type.
+ * By changing the NOVA_DEFAULT_BLOCK_TYPE to 2M or 1G,
+ * we should get pretty good coverage in testing.
+ */
+#define NOVA_DEFAULT_BLOCK_TYPE NOVA_BLOCK_TYPE_4K
+
+
+/* ======================= Write ordering ========================= */
+
+#define CACHELINE_SIZE  (64)
+#define CACHELINE_MASK  (~(CACHELINE_SIZE - 1))
+#define CACHELINE_ALIGN(addr) (((addr)+CACHELINE_SIZE-1) & CACHELINE_MASK)
+
+
+static inline bool arch_has_clwb(void)
+{
+	return static_cpu_has(X86_FEATURE_CLWB);
+}
+
+extern int support_clwb;
+
+#define _mm_clflush(addr)\
+	asm volatile("clflush %0" : "+m" (*(volatile char *)(addr)))
+#define _mm_clflushopt(addr)\
+	asm volatile(".byte 0x66; clflush %0" : "+m" \
+		     (*(volatile char *)(addr)))
+#define _mm_clwb(addr)\
+	asm volatile(".byte 0x66; xsaveopt %0" : "+m" \
+		     (*(volatile char *)(addr)))
+
+/* Provides ordering from all previous clflush too */
+static inline void PERSISTENT_MARK(void)
+{
+	/* TODO: Fix me. */
+}
+
+static inline void PERSISTENT_BARRIER(void)
+{
+	asm volatile ("sfence\n" : : );
+}
+
+static inline void nova_flush_buffer(void *buf, uint32_t len, bool fence)
+{
+	uint32_t i;
+
+	len = len + ((unsigned long)(buf) & (CACHELINE_SIZE - 1));
+	if (support_clwb) {
+		for (i = 0; i < len; i += CACHELINE_SIZE)
+			_mm_clwb(buf + i);
+	} else {
+		for (i = 0; i < len; i += CACHELINE_SIZE)
+			_mm_clflush(buf + i);
+	}
+	/* Do a fence only if asked. We often don't need to do a fence
+	 * immediately after clflush because even if we get context switched
+	 * between clflush and subsequent fence, the context switch operation
+	 * provides implicit fence.
+	 */
+	if (fence)
+		PERSISTENT_BARRIER();
+}
+
+#endif /* _LINUX_NOVA_DEF_H */
-- 
2.7.4

* [RFC v2 03/83] Add super.h.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
  2018-03-10 18:17 ` [RFC v2 01/83] Introduction and documentation of NOVA filesystem Andiry Xu
  2018-03-10 18:17 ` [RFC v2 02/83] Add nova_def.h Andiry Xu
@ 2018-03-10 18:17 ` Andiry Xu
  2018-03-15  4:54   ` Darrick J. Wong
  2018-03-10 18:17 ` [RFC v2 04/83] NOVA inode definition Andiry Xu
                   ` (80 subsequent siblings)
  83 siblings, 1 reply; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:17 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

This header file defines NOVA persistent and volatile superblock
data structures.

It also defines NOVA block layout:

Page 0: Superblock
Page 1: Reserved inodes
Page 2 - 15: Reserved
Page 16 - 31: Inode table pointers
Page 32 - 47: Journal address pointers
Page 48 - 63: Reserved
Page n-2: Replica of reserved inodes
Page n-1: Replica of superblock

Other pages are for normal inodes, logs and data.
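
The layout above can be sketched as simple page arithmetic. The constants
are copied from this patch; the layout_*() helper names are hypothetical,
and the sketch assumes that n is the total number of 4K pages in the device
and that the two replica pages are the last two pages.

```c
#include <stdint.h>

/* Values taken from the patch. */
#define NOVA_DEF_BLOCK_SIZE_4K	4096
#define HEAD_RESERVED_BLOCKS	64
#define TAIL_RESERVED_BLOCKS	2
#define INODE_TABLE_START	16
#define JOURNAL_START		32

/* Hypothetical helper: byte offset of a page from the start of the fs. */
static uint64_t layout_page_offset(uint64_t page)
{
	return page * NOVA_DEF_BLOCK_SIZE_4K;
}

/* Hypothetical helper: page holding the replica of the reserved inodes. */
static uint64_t layout_replica_inode_page(uint64_t num_pages)
{
	return num_pages - 2;
}

/* Hypothetical helper: page holding the replica of the superblock. */
static uint64_t layout_replica_sb_page(uint64_t num_pages)
{
	return num_pages - 1;
}

/* Hypothetical helper: pages left for normal inodes, logs and data. */
static uint64_t layout_usable_pages(uint64_t num_pages)
{
	if (num_pages < HEAD_RESERVED_BLOCKS + TAIL_RESERVED_BLOCKS)
		return 0;
	return num_pages - HEAD_RESERVED_BLOCKS - TAIL_RESERVED_BLOCKS;
}
```

Under these assumptions, a 4 MB device (1024 pages) keeps 958 pages for
normal inodes, logs and data.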

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/super.h | 149 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 149 insertions(+)
 create mode 100644 fs/nova/super.h

diff --git a/fs/nova/super.h b/fs/nova/super.h
new file mode 100644
index 0000000..cb53908
--- /dev/null
+++ b/fs/nova/super.h
@@ -0,0 +1,149 @@
+#ifndef __SUPER_H
+#define __SUPER_H
+/*
+ * Structure of the NOVA super block in PMEM
+ *
+ * The fields are partitioned into static and dynamic fields. The static fields
+ * never change after file system creation. This was primarily done because
+ * nova_get_block() returns NULL if the block offset is 0 (helps in catching
+ * bugs). So if we modify any field using journaling (for consistency), we
+ * will have to modify s_sum which is at offset 0. So journaling code fails.
+ * This (static+dynamic fields) is a temporary solution and can be avoided
+ * once the file system becomes stable and nova_get_block() returns correct
+ * pointers even for offset 0.
+ */
+struct nova_super_block {
+	/* static fields. they never change after file system creation.
+	 * checksum only validates up to s_start_dynamic field below
+	 */
+	__le32		s_sum;			/* checksum of this sb */
+	__le32		s_magic;		/* magic signature */
+	__le32		s_padding32;
+	__le32		s_blocksize;		/* blocksize in bytes */
+	__le64		s_size;			/* total size of fs in bytes */
+	char		s_volume_name[16];	/* volume name */
+
+	/* all the dynamic fields should go here */
+	__le64		s_epoch_id;		/* Epoch ID */
+
+	/* s_mtime and s_wtime should be together and their order should not be
+	 * changed. we use an 8 byte write to update both of them atomically
+	 */
+	__le32		s_mtime;		/* mount time */
+	__le32		s_wtime;		/* write time */
+} __attribute((__packed__));
+
+#define NOVA_SB_SIZE 512       /* must be power of two */
+
+/* ======================= Reserved blocks ========================= */
+
+/*
+ * Page 0 contains super blocks;
+ * Page 1 contains reserved inodes;
+ * Page 2 - 15 are reserved.
+ * Page 16 - 31 contain pointers to inode tables.
+ * Page 32 - 47 contain pointers to journal pages.
+ */
+#define	HEAD_RESERVED_BLOCKS	64
+#define	NUM_JOURNAL_PAGES	16
+
+#define	SUPER_BLOCK_START       0 // Superblock
+#define	RESERVE_INODE_START	1 // Reserved inodes
+#define	INODE_TABLE_START	16 // inode table pointers
+#define	JOURNAL_START		32 // journal pointer table
+
+/* For replica super block and replica reserved inodes */
+#define	TAIL_RESERVED_BLOCKS	2
+
+/* ======================= Reserved inodes ========================= */
+
+/* We have space for 31 reserved inodes */
+#define NOVA_ROOT_INO		(1)
+#define NOVA_INODETABLE_INO	(2)	/* Fake inode associated with inode
+					 * storage.  We need this because our
+					 * allocator requires inode to be
+					 * associated with each allocation.
+					 * The data actually lives in linked
+					 * lists in INODE_TABLE_START. */
+#define NOVA_BLOCKNODE_INO	(3)     /* Storage for allocator state */
+#define NOVA_LITEJOURNAL_INO	(4)     /* Storage for lightweight journals */
+#define NOVA_INODELIST_INO	(5)     /* Storage for Inode free list */
+
+
+/* Normal inode starts at 32 */
+#define NOVA_NORMAL_INODE_START      (32)
+
+
+
+/*
+ * NOVA super-block data in DRAM
+ */
+struct nova_sb_info {
+	struct super_block *sb;			/* VFS super block */
+	struct nova_super_block *nova_sb;	/* DRAM copy of SB */
+	struct block_device *s_bdev;
+	struct dax_device *s_dax_dev;
+
+	/*
+	 * base physical and virtual address of NOVA (which is also
+	 * the pointer to the super block)
+	 */
+	phys_addr_t	phys_addr;
+	void		*virt_addr;
+	void		*replica_reserved_inodes_addr;
+	void		*replica_sb_addr;
+
+	unsigned long	num_blocks;
+
+	/* Mount options */
+	unsigned long	bpi;
+	unsigned long	blocksize;
+	unsigned long	initsize;
+	unsigned long	s_mount_opt;
+	kuid_t		uid;    /* Mount uid for root directory */
+	kgid_t		gid;    /* Mount gid for root directory */
+	umode_t		mode;   /* Mount mode for root directory */
+	atomic_t	next_generation;
+	/* inode tracking */
+	unsigned long	s_inodes_used_count;
+	unsigned long	head_reserved_blocks;
+	unsigned long	tail_reserved_blocks;
+
+	struct mutex	s_lock;	/* protects the SB's buffer-head */
+
+	int cpus;
+
+	/* Current epoch. volatile guarantees visibility */
+	volatile u64 s_epoch_id;
+
+	/* Zeroed page used to initialize cache pages */
+	void *zeroed_page;
+};
+
+static inline struct nova_sb_info *NOVA_SB(struct super_block *sb)
+{
+	return sb->s_fs_info;
+}
+
+static inline struct nova_super_block
+*nova_get_redund_super(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	return (struct nova_super_block *)(sbi->replica_sb_addr);
+}
+
+
+/* If this is part of a read-modify-write of the super block,
+ * nova_memunlock_super() before calling!
+ */
+static inline struct nova_super_block *nova_get_super(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	return (struct nova_super_block *)sbi->virt_addr;
+}
+
+extern void nova_error_mng(struct super_block *sb, const char *fmt, ...);
+
+#endif
-- 
2.7.4

* [RFC v2 04/83] NOVA inode definition.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (2 preceding siblings ...)
  2018-03-10 18:17 ` [RFC v2 03/83] Add super.h Andiry Xu
@ 2018-03-10 18:17 ` Andiry Xu
  2018-03-15  5:06   ` Darrick J. Wong
  2018-03-10 18:17 ` [RFC v2 05/83] Add NOVA filesystem definitions and useful helper routines Andiry Xu
                   ` (79 subsequent siblings)
  83 siblings, 1 reply; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:17 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

inode.h defines the non-volatile and volatile NOVA inode data structures.

The non-volatile NOVA inode (nova_inode) is aligned to 128 bytes and contains
file/directory metadata. The most important fields are log_head and
log_tail: log_head points to the start of the log, and log_tail points to
the end of the latest committed log entry. NOVA updates an inode by
appending an entry past the log tail and then atomically updating the
log_tail pointer.

The volatile NOVA inode (nova_inode_info) contains the information needed
to limit accesses to the non-volatile NOVA inode at runtime.
It has a radix tree that maps file offsets or file name hashes to the
corresponding log entries.
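
The append-then-commit protocol can be modelled in user space as below.
This is only a sketch under simplifying assumptions: the toy_* names are
hypothetical, the log is a flat in-memory array rather than a linked list of
log pages in PMEM, and the cache-line flush and fence that the real code
needs before the commit are reduced to a comment.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOG_CAPACITY 256

/* Hypothetical, simplified stand-in for the persistent inode and its log. */
struct toy_inode {
	uint64_t log_head;		/* start of the log */
	_Atomic uint64_t log_tail;	/* end of the last committed entry */
	uint8_t log[LOG_CAPACITY];
};

static void toy_inode_init(struct toy_inode *pi)
{
	pi->log_head = 0;
	atomic_init(&pi->log_tail, 0);
	memset(pi->log, 0, sizeof(pi->log));
}

/*
 * Write an entry past the committed tail without publishing it.
 * Returns 0 and stores the prospective tail in *new_tail, or -1 if
 * the log is full.
 */
static int toy_log_append(struct toy_inode *pi, const void *entry,
			  size_t len, uint64_t *new_tail)
{
	uint64_t tail = atomic_load(&pi->log_tail);

	if (len > LOG_CAPACITY - tail)
		return -1;

	memcpy(pi->log + tail, entry, len);
	/* A real implementation would flush the entry and fence here. */
	*new_tail = tail + len;
	return 0;
}

/* Publish the entry with a single 64-bit store to log_tail. */
static void toy_log_commit(struct toy_inode *pi, uint64_t new_tail)
{
	atomic_store(&pi->log_tail, new_tail);
}

/* Number of committed log bytes visible to a reader. */
static uint64_t toy_committed_bytes(struct toy_inode *pi)
{
	return atomic_load(&pi->log_tail) - pi->log_head;
}
```

Until toy_log_commit() runs, a reader that trusts only log_tail never sees
the new entry, which is why a crash between the append and the commit leaves
the inode in its previous state.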

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/inode.h | 187 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 187 insertions(+)
 create mode 100644 fs/nova/inode.h

diff --git a/fs/nova/inode.h b/fs/nova/inode.h
new file mode 100644
index 0000000..f9187e3
--- /dev/null
+++ b/fs/nova/inode.h
@@ -0,0 +1,187 @@
+#ifndef __INODE_H
+#define __INODE_H
+
+struct nova_inode_info_header;
+struct nova_inode;
+
+#include "super.h"
+
+enum nova_new_inode_type {
+	TYPE_CREATE = 0,
+	TYPE_MKNOD,
+	TYPE_SYMLINK,
+	TYPE_MKDIR
+};
+
+
+/*
+ * Structure of an inode in PMEM
+ * Keep the inode size to within 120 bytes: We use the last eight bytes
+ * as inode table tail pointer.
+ */
+struct nova_inode {
+
+	/* first 40 bytes */
+	u8	i_rsvd;		 /* reserved. used to be checksum */
+	u8	valid;		 /* Is this inode valid? */
+	u8	deleted;	 /* Is this inode deleted? */
+	u8	i_blk_type;	 /* data block size this inode uses */
+	__le32	i_flags;	 /* Inode flags */
+	__le64	i_size;		 /* Size of data in bytes */
+	__le32	i_ctime;	 /* Inode change time */
+	__le32	i_mtime;	 /* Data modification time */
+	__le32	i_atime;	 /* Access time */
+	__le16	i_mode;		 /* File mode */
+	__le16	i_links_count;	 /* Links count */
+
+	__le64	i_xattr;	 /* Extended attribute block */
+
+	/* second 40 bytes */
+	__le32	i_uid;		 /* Owner Uid */
+	__le32	i_gid;		 /* Group Id */
+	__le32	i_generation;	 /* File version (for NFS) */
+	__le32	i_create_time;	 /* Create time */
+	__le64	nova_ino;	 /* nova inode number */
+
+	__le64	log_head;	 /* Log head pointer */
+	__le64	log_tail;	 /* Log tail pointer */
+
+	/* last 24 bytes */
+	__le64	create_epoch_id; /* Transaction ID when create */
+	__le64	delete_epoch_id; /* Transaction ID when deleted */
+
+	struct {
+		__le32 rdev;	 /* major/minor # */
+	} dev;			 /* device inode */
+
+	__le32	csum;            /* CRC32 checksum */
+
+	/* Leave 8 bytes for inode table tail pointer */
+} __attribute((__packed__));
+
+/*
+ * NOVA-specific inode state kept in DRAM
+ */
+struct nova_inode_info_header {
+	/* For files, tree holds a map from file offsets to
+	 * write log entries.
+	 *
+	 * For directories, tree holds a map from a hash of the file name to
+	 * dentry log entry.
+	 */
+	struct radix_tree_root tree;
+	struct rw_semaphore i_sem;	/* Protect log and tree */
+	unsigned short i_mode;		/* Dir or file? */
+	unsigned int i_flags;
+	unsigned long log_pages;	/* Num of log pages */
+	unsigned long i_size;
+	unsigned long i_blocks;
+	unsigned long ino;
+	unsigned long pi_addr;
+	unsigned long valid_entries;	/* For thorough GC */
+	unsigned long num_entries;	/* For thorough GC */
+	u64 last_setattr;		/* Last setattr entry */
+	u64 last_link_change;		/* Last link change entry */
+	u64 last_dentry;		/* Last updated dentry */
+	u64 trans_id;			/* Transaction ID */
+	u64 log_head;			/* Log head pointer */
+	u64 log_tail;			/* Log tail pointer */
+	u8  i_blk_type;
+};
+
+/*
+ * DRAM state for inodes
+ */
+struct nova_inode_info {
+	struct nova_inode_info_header header;
+	struct inode vfs_inode;
+};
+
+
+static inline struct nova_inode_info *NOVA_I(struct inode *inode)
+{
+	return container_of(inode, struct nova_inode_info, vfs_inode);
+}
+
+static inline void sih_lock(struct nova_inode_info_header *header)
+{
+	down_write(&header->i_sem);
+}
+
+static inline void sih_unlock(struct nova_inode_info_header *header)
+{
+	up_write(&header->i_sem);
+}
+
+static inline void sih_lock_shared(struct nova_inode_info_header *header)
+{
+	down_read(&header->i_sem);
+}
+
+static inline void sih_unlock_shared(struct nova_inode_info_header *header)
+{
+	up_read(&header->i_sem);
+}
+
+static inline unsigned int
+nova_inode_blk_shift(struct nova_inode_info_header *sih)
+{
+	return blk_type_to_shift[sih->i_blk_type];
+}
+
+static inline uint32_t nova_inode_blk_size(struct nova_inode_info_header *sih)
+{
+	return blk_type_to_size[sih->i_blk_type];
+}
+
+static inline u64 nova_get_reserved_inode_addr(struct super_block *sb,
+	u64 inode_number)
+{
+	return (NOVA_DEF_BLOCK_SIZE_4K * RESERVE_INODE_START) +
+			inode_number * NOVA_INODE_SIZE;
+}
+
+static inline struct nova_inode *nova_get_reserved_inode(struct super_block *sb,
+	u64 inode_number)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u64 addr;
+
+	addr = nova_get_reserved_inode_addr(sb, inode_number);
+
+	return (struct nova_inode *)(sbi->virt_addr + addr);
+}
+
+static inline struct nova_inode *nova_get_inode_by_ino(struct super_block *sb,
+						  u64 ino)
+{
+	if (ino == 0 || ino >= NOVA_NORMAL_INODE_START)
+		return NULL;
+
+	return nova_get_reserved_inode(sb, ino);
+}
+
+static inline struct nova_inode *nova_get_inode(struct super_block *sb,
+	struct inode *inode)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode fake_pi;
+	void *addr;
+	int rc;
+
+	addr = nova_get_block(sb, sih->pi_addr);
+	rc = memcpy_mcsafe(&fake_pi, addr, sizeof(struct nova_inode));
+	if (rc)
+		return NULL;
+
+	return (struct nova_inode *)addr;
+}
+
+static inline int nova_persist_inode(struct nova_inode *pi)
+{
+	nova_flush_buffer(pi, sizeof(struct nova_inode), 1);
+	return 0;
+}
+
+#endif
-- 
2.7.4

* [RFC v2 05/83] Add NOVA filesystem definitions and useful helper routines.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (3 preceding siblings ...)
  2018-03-10 18:17 ` [RFC v2 04/83] NOVA inode definition Andiry Xu
@ 2018-03-10 18:17 ` Andiry Xu
  2018-03-11 12:00   ` Nikolay Borisov
  2018-03-10 18:17 ` [RFC v2 06/83] Add inode get/read methods Andiry Xu
                   ` (78 subsequent siblings)
  83 siblings, 1 reply; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:17 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA stores offsets rather than absolute addresses in PMEM.
nova_get_block() and nova_get_addr_off() translate between
these two kinds of addresses.
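
A minimal user-space sketch of the two translations follows. The toy_region
structure and toy_* function names are hypothetical stand-ins for
nova_sb_info and the helpers in this patch; as in nova_get_block(), offset 0
is treated as "no block" and maps to NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the mapped PMEM region of one file system. */
struct toy_region {
	uint8_t *virt_addr;	/* base address of the mapping */
	size_t size;		/* size of the mapping in bytes */
};

/* Offset -> pointer. Offset 0 is reserved to mean "no block". */
static void *toy_get_block(const struct toy_region *r, uint64_t off)
{
	return off ? r->virt_addr + off : NULL;
}

/* Pointer -> offset. The pointer must lie inside the region. */
static uint64_t toy_get_addr_off(const struct toy_region *r, const void *addr)
{
	const uint8_t *p = addr;

	assert(p >= r->virt_addr && p < r->virt_addr + r->size);
	return (uint64_t)(p - r->virt_addr);
}
```

Storing offsets keeps the on-media pointers valid even if the region is
mapped at a different virtual address on the next mount.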

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/nova.h | 299 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 299 insertions(+)
 create mode 100644 fs/nova/nova.h

diff --git a/fs/nova/nova.h b/fs/nova/nova.h
new file mode 100644
index 0000000..5eb696c
--- /dev/null
+++ b/fs/nova/nova.h
@@ -0,0 +1,299 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Definitions for the NOVA filesystem.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+#ifndef __NOVA_H
+#define __NOVA_H
+
+#include <linux/fs.h>
+#include <linux/dax.h>
+#include <linux/init.h>
+#include <linux/time.h>
+#include <linux/rtc.h>
+#include <linux/mm.h>
+#include <linux/delay.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/sched.h>
+#include <linux/mutex.h>
+#include <linux/pagemap.h>
+#include <linux/backing-dev.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/rcupdate.h>
+#include <linux/types.h>
+#include <linux/rbtree.h>
+#include <linux/radix-tree.h>
+#include <linux/version.h>
+#include <linux/kthread.h>
+#include <linux/buffer_head.h>
+#include <linux/uio.h>
+#include <linux/iomap.h>
+#include <linux/crc32c.h>
+#include <asm/tlbflush.h>
+#include <linux/version.h>
+#include <linux/pfn_t.h>
+#include <linux/pagevec.h>
+
+#include "nova_def.h"
+
+#define PAGE_SHIFT_2M 21
+#define PAGE_SHIFT_1G 30
+
+
+/*
+ * Debug code
+ */
+#ifdef pr_fmt
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#endif
+
+/* #define nova_dbg(s, args...)		pr_debug(s, ## args) */
+#define nova_dbg(s, args ...)		pr_info(s, ## args)
+#define nova_err(sb, s, args ...)	nova_error_mng(sb, s, ## args)
+#define nova_warn(s, args ...)		pr_warn(s, ## args)
+#define nova_info(s, args ...)		pr_info(s, ## args)
+
+extern unsigned int nova_dbgmask;
+#define NOVA_DBGMASK_MMAPHUGE	       (0x00000001)
+#define NOVA_DBGMASK_MMAP4K	       (0x00000002)
+#define NOVA_DBGMASK_MMAPVERBOSE       (0x00000004)
+#define NOVA_DBGMASK_MMAPVVERBOSE      (0x00000008)
+#define NOVA_DBGMASK_VERBOSE	       (0x00000010)
+#define NOVA_DBGMASK_TRANSACTION       (0x00000020)
+
+#define nova_dbg_mmap4k(s, args ...)		 \
+	((nova_dbgmask & NOVA_DBGMASK_MMAP4K) ? nova_dbg(s, ##args) : 0)
+#define nova_dbg_mmapv(s, args ...)		 \
+	((nova_dbgmask & NOVA_DBGMASK_MMAPVERBOSE) ? nova_dbg(s, ##args) : 0)
+#define nova_dbg_mmapvv(s, args ...)		 \
+	((nova_dbgmask & NOVA_DBGMASK_MMAPVVERBOSE) ? nova_dbg(s, ##args) : 0)
+
+#define nova_dbg_verbose(s, args ...)		 \
+	((nova_dbgmask & NOVA_DBGMASK_VERBOSE) ? nova_dbg(s, ##args) : 0)
+#define nova_dbgv(s, args ...)	nova_dbg_verbose(s, ##args)
+#define nova_dbg_trans(s, args ...)		 \
+	((nova_dbgmask & NOVA_DBGMASK_TRANSACTION) ? nova_dbg(s, ##args) : 0)
+
+#define NOVA_ASSERT(x) do {\
+			       if (!(x))\
+				       nova_warn("assertion failed %s:%d: %s\n", \
+			       __FILE__, __LINE__, #x);\
+		       } while (0)
+
+#define nova_set_bit		       __test_and_set_bit_le
+#define nova_clear_bit		       __test_and_clear_bit_le
+#define nova_find_next_zero_bit	       find_next_zero_bit_le
+
+#define clear_opt(o, opt)	(o &= ~NOVA_MOUNT_ ## opt)
+#define set_opt(o, opt)		(o |= NOVA_MOUNT_ ## opt)
+#define test_opt(sb, opt)	(NOVA_SB(sb)->s_mount_opt & NOVA_MOUNT_ ## opt)
+
+#define NOVA_LARGE_INODE_TABLE_SIZE    (0x200000)
+/* NOVA size threshold for using 2M blocks for inode table */
+#define NOVA_LARGE_INODE_TABLE_THREASHOLD    (0x20000000)
+/*
+ * nova inode flags
+ *
+ * NOVA_EOFBLOCKS_FL	There are blocks allocated beyond eof
+ */
+#define NOVA_EOFBLOCKS_FL      0x20000000
+/* Flags that should be inherited by new inodes from their parent. */
+#define NOVA_FL_INHERITED (FS_SECRM_FL | FS_UNRM_FL | FS_COMPR_FL | \
+			    FS_SYNC_FL | FS_NODUMP_FL | FS_NOATIME_FL |	\
+			    FS_COMPRBLK_FL | FS_NOCOMP_FL | \
+			    FS_JOURNAL_DATA_FL | FS_NOTAIL_FL | FS_DIRSYNC_FL)
+/* Flags that are appropriate for regular files (all but dir-specific ones). */
+#define NOVA_REG_FLMASK (~(FS_DIRSYNC_FL | FS_TOPDIR_FL))
+/* Flags that are appropriate for non-directories/regular files. */
+#define NOVA_OTHER_FLMASK (FS_NODUMP_FL | FS_NOATIME_FL)
+#define NOVA_FL_USER_VISIBLE (FS_FL_USER_VISIBLE | NOVA_EOFBLOCKS_FL)
+
+/* IOCTLs */
+#define	NOVA_PRINT_TIMING		0xBCD00010
+#define	NOVA_CLEAR_STATS		0xBCD00011
+#define	NOVA_PRINT_LOG			0xBCD00013
+#define	NOVA_PRINT_LOG_BLOCKNODE	0xBCD00014
+#define	NOVA_PRINT_LOG_PAGES		0xBCD00015
+#define	NOVA_PRINT_FREE_LISTS		0xBCD00018
+
+
+#define	READDIR_END			(ULONG_MAX)
+#define	ANY_CPU				(65536)
+#define	FREE_BATCH			(16)
+
+extern unsigned int blk_type_to_shift[NOVA_BLOCK_TYPE_MAX];
+extern unsigned int blk_type_to_size[NOVA_BLOCK_TYPE_MAX];
+
+
+/* Mask out flags that are inappropriate for the given type of inode. */
+static inline __le32 nova_mask_flags(umode_t mode, __le32 flags)
+{
+	flags &= cpu_to_le32(NOVA_FL_INHERITED);
+	if (S_ISDIR(mode))
+		return flags;
+	else if (S_ISREG(mode))
+		return flags & cpu_to_le32(NOVA_REG_FLMASK);
+	else
+		return flags & cpu_to_le32(NOVA_OTHER_FLMASK);
+}
+
+static inline u32 nova_crc32c(u32 crc, const u8 *data, size_t len)
+{
+	u8 *ptr = (u8 *) data;
+	u64 acc = crc; /* accumulator, crc32c value in lower 32b */
+	u32 csum;
+
+	/* x86 instruction crc32 is part of SSE-4.2 */
+	if (static_cpu_has(X86_FEATURE_XMM4_2)) {
+		/* This inline assembly implementation should be equivalent
+		 * to the kernel's crc32c_intel_le_hw() function used by
+		 * crc32c(), but this performs better on test machines.
+		 */
+		while (len > 8) {
+			asm volatile(/* 64b quad words */
+				"crc32q (%1), %0"
+				: "=r" (acc)
+				: "r"  (ptr), "0" (acc)
+			);
+			ptr += 8;
+			len -= 8;
+		}
+
+		while (len > 0) {
+			asm volatile(/* trailing bytes */
+				"crc32b (%1), %0"
+				: "=r" (acc)
+				: "r"  (ptr), "0" (acc)
+			);
+			ptr++;
+			len--;
+		}
+
+		csum = (u32) acc;
+	} else {
+		/* The kernel's crc32c() function should also detect and use the
+		 * crc32 instruction of SSE-4.2. But calling in to this function
+		 * is about 3x to 5x slower than the inline assembly version on
+		 * some test machines.
+		 */
+		csum = crc32c(crc, data, len);
+	}
+
+	return csum;
+}
+
+static inline int memcpy_to_pmem_nocache(void *dst, const void *src,
+	unsigned int size)
+{
+	int ret;
+
+	ret = __copy_from_user_inatomic_nocache(dst, src, size);
+
+	return ret;
+}
+
+
+/* assumes the length to be 4-byte aligned */
+static inline void memset_nt(void *dest, uint32_t dword, size_t length)
+{
+	uint64_t dummy1, dummy2;
+	uint64_t qword = ((uint64_t)dword << 32) | dword;
+
+	asm volatile ("movl %%edx,%%ecx\n"
+		"andl $63,%%edx\n"
+		"shrl $6,%%ecx\n"
+		"jz 9f\n"
+		"1:	 movnti %%rax,(%%rdi)\n"
+		"2:	 movnti %%rax,1*8(%%rdi)\n"
+		"3:	 movnti %%rax,2*8(%%rdi)\n"
+		"4:	 movnti %%rax,3*8(%%rdi)\n"
+		"5:	 movnti %%rax,4*8(%%rdi)\n"
+		"6:	 movnti %%rax,5*8(%%rdi)\n"
+		"7:	 movnti %%rax,6*8(%%rdi)\n"
+		"8:	 movnti %%rax,7*8(%%rdi)\n"
+		"leaq 64(%%rdi),%%rdi\n"
+		"decl %%ecx\n"
+		"jnz 1b\n"
+		"9:	movl %%edx,%%ecx\n"
+		"andl $7,%%edx\n"
+		"shrl $3,%%ecx\n"
+		"jz 11f\n"
+		"10:	 movnti %%rax,(%%rdi)\n"
+		"leaq 8(%%rdi),%%rdi\n"
+		"decl %%ecx\n"
+		"jnz 10b\n"
+		"11:	 movl %%edx,%%ecx\n"
+		"shrl $2,%%ecx\n"
+		"jz 12f\n"
+		"movnti %%eax,(%%rdi)\n"
+		"12:\n"
+		: "=D"(dummy1), "=d" (dummy2)
+		: "D" (dest), "a" (qword), "d" (length)
+		: "memory", "rcx");
+}
+
+
+#include "super.h" // Remove when we factor out these and other functions.
+
+/* Translate an offset from the beginning of the NOVA instance to a PMEM address.
+ *
+ * If this is part of a read-modify-write of the block,
+ * nova_memunlock_block() before calling!
+ */
+static inline void *nova_get_block(struct super_block *sb, u64 block)
+{
+	struct nova_super_block *ps = nova_get_super(sb);
+
+	return block ? ((void *)ps + block) : NULL;
+}
+
+static inline int nova_get_reference(struct super_block *sb, u64 block,
+	void *dram, void **nvmm, size_t size)
+{
+	int rc;
+
+	*nvmm = nova_get_block(sb, block);
+	rc = memcpy_mcsafe(dram, *nvmm, size);
+	return rc;
+}
+
+
+static inline u64
+nova_get_addr_off(struct nova_sb_info *sbi, void *addr)
+{
+	NOVA_ASSERT((addr >= sbi->virt_addr) &&
+			(addr < (sbi->virt_addr + sbi->initsize)));
+	return (u64)(addr - sbi->virt_addr);
+}
+
+static inline u64
+nova_get_block_off(struct super_block *sb, unsigned long blocknr,
+		    unsigned short btype)
+{
+	return (u64)blocknr << PAGE_SHIFT;
+}
+
+
+static inline u64 nova_get_epoch_id(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	return sbi->s_epoch_id;
+}
+
+#include "inode.h"
+#endif /* __NOVA_H */
-- 
2.7.4

* [RFC v2 06/83] Add inode get/read methods.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (4 preceding siblings ...)
  2018-03-10 18:17 ` [RFC v2 05/83] Add NOVA filesystem definitions and useful helper routines Andiry Xu
@ 2018-03-10 18:17 ` Andiry Xu
  2018-04-23  6:12   ` Darrick J. Wong
  2018-03-10 18:17 ` [RFC v2 07/83] Initialize inode_info and rebuild inode information in nova_iget() Andiry Xu
                   ` (77 subsequent siblings)
  83 siblings, 1 reply; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:17 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

These routines are incomplete and currently only support reserved inodes,
whose addresses are fixed. This is necessary for fill_super to work.
File/dir operations are left NULL.

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/inode.c | 176 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/inode.h |   3 +
 2 files changed, 179 insertions(+)
 create mode 100644 fs/nova/inode.c

diff --git a/fs/nova/inode.c b/fs/nova/inode.c
new file mode 100644
index 0000000..bfdc5dc
--- /dev/null
+++ b/fs/nova/inode.c
@@ -0,0 +1,176 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Inode methods (allocate/free/read/write).
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/fs.h>
+#include <linux/aio.h>
+#include <linux/highuid.h>
+#include <linux/module.h>
+#include <linux/mpage.h>
+#include <linux/backing-dev.h>
+#include <linux/types.h>
+#include <linux/ratelimit.h>
+#include "nova.h"
+#include "inode.h"
+
+unsigned int blk_type_to_shift[NOVA_BLOCK_TYPE_MAX] = {12, 21, 30};
+uint32_t blk_type_to_size[NOVA_BLOCK_TYPE_MAX] = {0x1000, 0x200000, 0x40000000};
+
+void nova_set_inode_flags(struct inode *inode, struct nova_inode *pi,
+	unsigned int flags)
+{
+	inode->i_flags &=
+		~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC);
+	if (flags & FS_SYNC_FL)
+		inode->i_flags |= S_SYNC;
+	if (flags & FS_APPEND_FL)
+		inode->i_flags |= S_APPEND;
+	if (flags & FS_IMMUTABLE_FL)
+		inode->i_flags |= S_IMMUTABLE;
+	if (flags & FS_NOATIME_FL)
+		inode->i_flags |= S_NOATIME;
+	if (flags & FS_DIRSYNC_FL)
+		inode->i_flags |= S_DIRSYNC;
+	if (!pi->i_xattr)
+		inode_has_no_xattr(inode);
+	inode->i_flags |= S_DAX;
+}
+
+/* copy persistent state to struct inode */
+static int nova_read_inode(struct super_block *sb, struct inode *inode,
+	u64 pi_addr)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode *pi, fake_pi;
+	struct nova_inode_info_header *sih = &si->header;
+	int ret = -EIO;
+	unsigned long ino;
+
+	ret = nova_get_reference(sb, pi_addr, &fake_pi,
+			(void **)&pi, sizeof(struct nova_inode));
+	if (ret) {
+		nova_dbg("%s: read pi @ 0x%llx failed\n",
+				__func__, pi_addr);
+		goto bad_inode;
+	}
+
+	inode->i_mode = sih->i_mode;
+	i_uid_write(inode, le32_to_cpu(pi->i_uid));
+	i_gid_write(inode, le32_to_cpu(pi->i_gid));
+//	set_nlink(inode, le16_to_cpu(pi->i_links_count));
+	inode->i_generation = le32_to_cpu(pi->i_generation);
+	nova_set_inode_flags(inode, pi, le32_to_cpu(pi->i_flags));
+	ino = inode->i_ino;
+
+	/* check if the inode is active. */
+	if (inode->i_mode == 0 || pi->deleted == 1) {
+		/* this inode is deleted */
+		ret = -ESTALE;
+		goto bad_inode;
+	}
+
+	inode->i_blocks = sih->i_blocks;
+
+	switch (inode->i_mode & S_IFMT) {
+	case S_IFREG:
+		break;
+	case S_IFDIR:
+		break;
+	case S_IFLNK:
+		break;
+	default:
+		init_special_inode(inode, inode->i_mode,
+				   le32_to_cpu(pi->dev.rdev));
+		break;
+	}
+
+	/* Update size and time after rebuild the tree */
+	inode->i_size = le64_to_cpu(sih->i_size);
+	inode->i_atime.tv_sec = (__s32)le32_to_cpu(pi->i_atime);
+	inode->i_ctime.tv_sec = (__s32)le32_to_cpu(pi->i_ctime);
+	inode->i_mtime.tv_sec = (__s32)le32_to_cpu(pi->i_mtime);
+	inode->i_atime.tv_nsec = inode->i_mtime.tv_nsec =
+					 inode->i_ctime.tv_nsec = 0;
+	set_nlink(inode, le16_to_cpu(pi->i_links_count));
+	return 0;
+
+bad_inode:
+	make_bad_inode(inode);
+	return ret;
+}
+
+/* Get the address in PMEM of an inode by inode number.  Allocate additional
+ * block to store additional inodes if necessary.
+ */
+int nova_get_inode_address(struct super_block *sb, u64 ino,
+	u64 *pi_addr, int extendable)
+{
+	if (ino < NOVA_NORMAL_INODE_START) {
+		*pi_addr = nova_get_reserved_inode_addr(sb, ino);
+		return 0;
+	}
+
+	*pi_addr = 0;
+	return 0;
+}
+
+struct inode *nova_iget(struct super_block *sb, unsigned long ino)
+{
+	struct nova_inode_info *si;
+	struct inode *inode;
+	u64 pi_addr;
+	int err;
+
+	inode = iget_locked(sb, ino);
+	if (unlikely(!inode))
+		return ERR_PTR(-ENOMEM);
+	if (!(inode->i_state & I_NEW))
+		return inode;
+
+	si = NOVA_I(inode);
+
+	nova_dbgv("%s: inode %lu\n", __func__, ino);
+
+	err = nova_get_inode_address(sb, ino, &pi_addr, 0);
+	if (err) {
+		nova_dbg("%s: get inode %lu address failed %d\n",
+			 __func__, ino, err);
+		goto fail;
+	}
+
+	if (pi_addr == 0) {
+		nova_dbg("%s: failed to get pi_addr for inode %lu\n",
+			 __func__, ino);
+		err = -EACCES;
+		goto fail;
+	}
+
+	err = nova_read_inode(sb, inode, pi_addr);
+	if (unlikely(err)) {
+		nova_dbg("%s: failed to read inode %lu\n", __func__, ino);
+		goto fail;
+
+	}
+
+	inode->i_ino = ino;
+
+	unlock_new_inode(inode);
+	return inode;
+fail:
+	iget_failed(inode);
+	return ERR_PTR(err);
+}
+
diff --git a/fs/nova/inode.h b/fs/nova/inode.h
index f9187e3..dbd5256 100644
--- a/fs/nova/inode.h
+++ b/fs/nova/inode.h
@@ -184,4 +184,7 @@ static inline int nova_persist_inode(struct nova_inode *pi)
 	return 0;
 }
 
+int nova_get_inode_address(struct super_block *sb, u64 ino,
+	u64 *pi_addr, int extendable);
+struct inode *nova_iget(struct super_block *sb, unsigned long ino);
 #endif
-- 
2.7.4


* [RFC v2 07/83] Initialize inode_info and rebuild inode information in nova_iget().
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (5 preceding siblings ...)
  2018-03-10 18:17 ` [RFC v2 06/83] Add inode get/read methods Andiry Xu
@ 2018-03-10 18:17 ` Andiry Xu
  2018-03-10 18:17 ` [RFC v2 08/83] NOVA superblock operations Andiry Xu
                   ` (76 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:17 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Add an incomplete nova_rebuild_inode() implementation. For now it only
initializes the in-DRAM inode header and rejects deleted inodes.
The full version will walk the inode log and rebuild the radix tree and
metadata; that is left for later patches.
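
The general idea of rebuilding DRAM state by walking a log can be sketched
as follows. This is purely illustrative and assumes a toy log format: the
entry types, struct log_entry and struct header are hypothetical, and
NOVA's real log entries and traversal are introduced by later patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical log entry kinds; NOVA's real entry formats come later. */
enum entry_type { ENTRY_SET_SIZE, ENTRY_LINK_CHANGE };

struct log_entry {
	enum entry_type type;
	uint64_t value;
};

/* Stand-in for the in-DRAM header that a rebuild repopulates. */
struct header {
	uint64_t i_size;
	uint64_t links;
	uint64_t num_entries;
};

/* Replay the log from head to tail; later entries override earlier ones. */
static void rebuild_header(struct header *h, const struct log_entry *log,
			   size_t n)
{
	size_t i;

	assert(h != NULL);

	/* Start from a clean header, as nova_init_header() does. */
	h->i_size = 0;
	h->links = 0;
	h->num_entries = 0;

	for (i = 0; i < n; i++) {
		switch (log[i].type) {
		case ENTRY_SET_SIZE:
			h->i_size = log[i].value;
			break;
		case ENTRY_LINK_CHANGE:
			h->links = log[i].value;
			break;
		}
		h->num_entries++;
	}
}
```

Resetting the header before the walk is why the patch calls
nova_init_header() first: the replay must not depend on stale DRAM state.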

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/bbuild.c  | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/bbuild.h  |  7 +++++++
 fs/nova/inode.c   |  6 ++++++
 fs/nova/nova.h    | 10 ++++++++++
 fs/nova/rebuild.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 124 insertions(+)
 create mode 100644 fs/nova/bbuild.c
 create mode 100644 fs/nova/bbuild.h
 create mode 100644 fs/nova/rebuild.c

diff --git a/fs/nova/bbuild.c b/fs/nova/bbuild.c
new file mode 100644
index 0000000..8bc0545
--- /dev/null
+++ b/fs/nova/bbuild.c
@@ -0,0 +1,53 @@
+/*
+ * NOVA Recovery routines.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/fs.h>
+#include <linux/bitops.h>
+#include <linux/slab.h>
+#include <linux/random.h>
+#include <linux/delay.h>
+#include "nova.h"
+#include "super.h"
+#include "inode.h"
+
+void nova_init_header(struct super_block *sb,
+	struct nova_inode_info_header *sih, u16 i_mode)
+{
+	sih->log_pages = 0;
+	sih->i_size = 0;
+	sih->ino = 0;
+	sih->i_blocks = 0;
+	sih->pi_addr = 0;
+	INIT_RADIX_TREE(&sih->tree, GFP_ATOMIC);
+	sih->i_mode = i_mode;
+	sih->i_flags = 0;
+	sih->valid_entries = 0;
+	sih->num_entries = 0;
+	sih->last_setattr = 0;
+	sih->last_link_change = 0;
+	sih->last_dentry = 0;
+	sih->trans_id = 0;
+	sih->log_head = 0;
+	sih->log_tail = 0;
+	sih->i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+	init_rwsem(&sih->i_sem);
+}
+
diff --git a/fs/nova/bbuild.h b/fs/nova/bbuild.h
new file mode 100644
index 0000000..162a832
--- /dev/null
+++ b/fs/nova/bbuild.h
@@ -0,0 +1,7 @@
+#ifndef __BBUILD_H
+#define __BBUILD_H
+
+void nova_init_header(struct super_block *sb,
+	struct nova_inode_info_header *sih, u16 i_mode);
+
+#endif
diff --git a/fs/nova/inode.c b/fs/nova/inode.c
index bfdc5dc..f7d6410 100644
--- a/fs/nova/inode.c
+++ b/fs/nova/inode.c
@@ -158,6 +158,12 @@ struct inode *nova_iget(struct super_block *sb, unsigned long ino)
 		goto fail;
 	}
 
+	err = nova_rebuild_inode(sb, si, ino, pi_addr, 1);
+	if (err) {
+		nova_dbg("%s: failed to rebuild inode %lu\n", __func__, ino);
+		goto fail;
+	}
+
 	err = nova_read_inode(sb, inode, pi_addr);
 	if (unlikely(err)) {
 		nova_dbg("%s: failed to read inode %lu\n", __func__, ino);
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index 5eb696c..ded9fe8 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -296,4 +296,14 @@ static inline u64 nova_get_epoch_id(struct super_block *sb)
 }
 
 #include "inode.h"
+#include "bbuild.h"
+
+/* ====================================================== */
+/* ==============  Function prototypes  ================= */
+/* ====================================================== */
+
+/* rebuild.c */
+int nova_rebuild_inode(struct super_block *sb, struct nova_inode_info *si,
+	u64 ino, u64 pi_addr, int rebuild_dir);
+
 #endif /* __NOVA_H */
diff --git a/fs/nova/rebuild.c b/fs/nova/rebuild.c
new file mode 100644
index 0000000..0595851
--- /dev/null
+++ b/fs/nova/rebuild.c
@@ -0,0 +1,48 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Inode rebuild methods.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include "nova.h"
+#include "inode.h"
+
+/* initialize nova inode header and other DRAM data structures */
+int nova_rebuild_inode(struct super_block *sb, struct nova_inode_info *si,
+	u64 ino, u64 pi_addr, int rebuild_dir)
+{
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *pi;
+
+	pi = (struct nova_inode *)nova_get_block(sb, pi_addr);
+	// We need this valid in case we need to evict the inode.
+
+	nova_init_header(sb, sih, __le16_to_cpu(pi->i_mode));
+	sih->pi_addr = pi_addr;
+
+	if (pi->deleted == 1) {
+		nova_dbgv("%s: inode %llu has been deleted.\n", __func__, ino);
+		return -ESTALE;
+	}
+
+	nova_dbgv("%s: inode %llu, addr 0x%llx, valid %d, head 0x%llx, tail 0x%llx\n",
+			__func__, ino, pi_addr, pi->valid,
+			pi->log_head, pi->log_tail);
+
+	sih->ino = ino;
+
+	/* Traverse the log */
+	return 0;
+}
+
-- 
2.7.4


* [RFC v2 08/83] NOVA superblock operations.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (6 preceding siblings ...)
  2018-03-10 18:17 ` [RFC v2 07/83] Initialize inode_info and rebuild inode information in nova_iget() Andiry Xu
@ 2018-03-10 18:17 ` Andiry Xu
  2018-03-10 18:17 ` [RFC v2 09/83] Add Kconfig and Makefile Andiry Xu
                   ` (75 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:17 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

This is the entry point for NOVA filesystem mount and umount.
NOVA works on DAX devices. During initialization it gets the
device information, such as physical/virtual addresses and device size.
It does not access the DAX device during runtime.

During initialization NOVA also initializes the root inode.
The root inode is a reserved inode and resides at a fixed location.

The way to mount and initialize a NOVA instance is:

mount -t NOVA -o init /dev/pmem0 /mnt/NOVA

This creates a NOVA instance on /dev/pmem0 and mounts it on /mnt/NOVA.
At this point in the series it cannot do anything except mount and umount.
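
The address arithmetic that nova_get_nvmm_info() and nova_check_size()
perform on the device size can be sketched in user space. This is a
minimal sketch under assumptions: it works with offsets from the start of
the mapping rather than kernel virtual addresses, assumes 4K pages, and
the reserved block counts passed in are hypothetical, since
HEAD_RESERVED_BLOCKS and TAIL_RESERVED_BLOCKS are defined elsewhere.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)

/* Offsets, from the start of the mapping, of the replicas kept at the tail. */
struct tail_layout {
	uint64_t replica_reserved_inodes_off;
	uint64_t replica_sb_off;
};

/* Same arithmetic as nova_get_nvmm_info(), expressed as offsets. */
static struct tail_layout compute_tail_layout(uint64_t size,
					      uint64_t tail_reserved_blocks)
{
	struct tail_layout l;

	assert(size >= (tail_reserved_blocks << SKETCH_PAGE_SHIFT));
	assert(size >= SKETCH_PAGE_SIZE);

	l.replica_reserved_inodes_off =
		size - (tail_reserved_blocks << SKETCH_PAGE_SHIFT);
	l.replica_sb_off = size - SKETCH_PAGE_SIZE;
	return l;
}

/* Minimum-size test in the style of nova_check_size(). */
static int size_is_sufficient(uint64_t size, uint64_t head_blocks,
			      uint64_t tail_blocks, unsigned int blocksize_bits)
{
	uint64_t minimum = (head_blocks + tail_blocks + 1) << blocksize_bits;

	return size >= minimum;
}
```

With these formulas the redundant superblock always occupies the last page
of the device, and the replica of the reserved inodes starts
tail_reserved_blocks pages before the end.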

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/super.c | 630 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 630 insertions(+)
 create mode 100644 fs/nova/super.c

diff --git a/fs/nova/super.c b/fs/nova/super.c
new file mode 100644
index 0000000..552fe5d
--- /dev/null
+++ b/fs/nova/super.c
@@ -0,0 +1,630 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Super block operations.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/parser.h>
+#include <linux/vfs.h>
+#include <linux/uaccess.h>
+#include <linux/io.h>
+#include <linux/seq_file.h>
+#include <linux/mount.h>
+#include <linux/mm.h>
+#include <linux/ctype.h>
+#include <linux/bitops.h>
+#include <linux/magic.h>
+#include <linux/exportfs.h>
+#include <linux/random.h>
+#include <linux/cred.h>
+#include <linux/list.h>
+#include <linux/dax.h>
+#include "nova.h"
+#include "super.h"
+
+int support_clwb;
+
+module_param(nova_dbgmask, int, 0444);
+MODULE_PARM_DESC(nova_dbgmask, "Control debugging output");
+
+static struct super_operations nova_sops;
+
+static struct kmem_cache *nova_inode_cachep;
+
+
+/* FIXME: should the following variable be one per NOVA instance? */
+unsigned int nova_dbgmask;
+
+void nova_error_mng(struct super_block *sb, const char *fmt, ...)
+{
+	va_list args;
+
+	printk(KERN_CRIT "nova error: ");
+	va_start(args, fmt);
+	vprintk(fmt, args);
+	va_end(args);
+
+	if (test_opt(sb, ERRORS_PANIC))
+		panic("nova: panic from previous error\n");
+	if (test_opt(sb, ERRORS_RO)) {
+		printk(KERN_CRIT "nova err: remounting filesystem read-only");
+		sb->s_flags |= MS_RDONLY;
+	}
+}
+
+static void nova_set_blocksize(struct super_block *sb, unsigned long size)
+{
+	int bits;
+
+	/*
+	 * We've already validated the user input and the value here must be
+	 * between NOVA_MAX_BLOCK_SIZE and NOVA_MIN_BLOCK_SIZE
+	 * and it must be a power of 2.
+	 */
+	bits = fls(size) - 1;
+	sb->s_blocksize_bits = bits;
+	sb->s_blocksize = (1 << bits);
+}
+
+static int nova_get_nvmm_info(struct super_block *sb,
+	struct nova_sb_info *sbi)
+{
+	void *virt_addr = NULL;
+	pfn_t __pfn_t;
+	long size;
+	struct dax_device *dax_dev;
+	int ret;
+
+	ret = bdev_dax_supported(sb, PAGE_SIZE);
+	nova_dbg_verbose("%s: dax_supported = %d; bdev->super=0x%p",
+			 __func__, ret, sb->s_bdev->bd_super);
+	if (ret) {
+		nova_err(sb, "device does not support DAX\n");
+		return ret;
+	}
+
+	sbi->s_bdev = sb->s_bdev;
+
+	dax_dev = fs_dax_get_by_host(sb->s_bdev->bd_disk->disk_name);
+	if (!dax_dev) {
+		nova_err(sb, "Couldn't retrieve DAX device.\n");
+		return -EINVAL;
+	}
+	sbi->s_dax_dev = dax_dev;
+
+	size = dax_direct_access(sbi->s_dax_dev, 0, LONG_MAX/PAGE_SIZE,
+				 &virt_addr, &__pfn_t) * PAGE_SIZE;
+	if (size <= 0) {
+		nova_err(sb, "direct_access failed\n");
+		return -EINVAL;
+	}
+
+	sbi->virt_addr = virt_addr;
+
+	if (!sbi->virt_addr) {
+		nova_err(sb, "ioremap of the nova image failed(1)\n");
+		return -EINVAL;
+	}
+
+	sbi->phys_addr = pfn_t_to_pfn(__pfn_t) << PAGE_SHIFT;
+	sbi->initsize = size;
+	sbi->replica_reserved_inodes_addr = virt_addr + size -
+			(sbi->tail_reserved_blocks << PAGE_SHIFT);
+	sbi->replica_sb_addr = virt_addr + size - PAGE_SIZE;
+
+	nova_dbg("%s: dev %s, phys_addr 0x%llx, virt_addr %p, size %ld\n",
+		__func__, sbi->s_bdev->bd_disk->disk_name,
+		sbi->phys_addr, sbi->virt_addr, sbi->initsize);
+
+	return 0;
+}
+
+static loff_t nova_max_size(int bits)
+{
+	loff_t res;
+
+	res = (1ULL << 63) - 1;
+
+	if (res > MAX_LFS_FILESIZE)
+		res = MAX_LFS_FILESIZE;
+
+	nova_dbg_verbose("max file size %llu bytes\n", res);
+	return res;
+}
+
+enum {
+	Opt_bpi, Opt_init, Opt_mode, Opt_uid,
+	Opt_gid, Opt_dax,
+	Opt_err_cont, Opt_err_panic, Opt_err_ro,
+	Opt_dbgmask, Opt_err
+};
+
+static const match_table_t tokens = {
+	{ Opt_bpi,	     "bpi=%u"		  },
+	{ Opt_init,	     "init"		  },
+	{ Opt_mode,	     "mode=%o"		  },
+	{ Opt_uid,	     "uid=%u"		  },
+	{ Opt_gid,	     "gid=%u"		  },
+	{ Opt_dax,	     "dax"		  },
+	{ Opt_err_cont,	     "errors=continue"	  },
+	{ Opt_err_panic,     "errors=panic"	  },
+	{ Opt_err_ro,	     "errors=remount-ro"  },
+	{ Opt_dbgmask,	     "dbgmask=%u"	  },
+	{ Opt_err,	     NULL		  },
+};
+
+static int nova_parse_options(char *options, struct nova_sb_info *sbi,
+			       bool remount)
+{
+	char *p;
+	substring_t args[MAX_OPT_ARGS];
+	int option;
+	kuid_t uid;
+
+	if (!options)
+		return 0;
+
+	while ((p = strsep(&options, ",")) != NULL) {
+		int token;
+
+		if (!*p)
+			continue;
+
+		token = match_token(p, tokens, args);
+		switch (token) {
+		case Opt_bpi:
+			if (match_int(&args[0], &option))
+				goto bad_val;
+			if (remount && sbi->bpi)
+				goto bad_opt;
+			sbi->bpi = option;
+			break;
+		case Opt_uid:
+			if (match_int(&args[0], &option))
+				goto bad_val;
+			uid = make_kuid(current_user_ns(), option);
+			if (remount && !uid_eq(sbi->uid, uid))
+				goto bad_opt;
+			sbi->uid = uid;
+			break;
+		case Opt_gid:
+			if (match_int(&args[0], &option))
+				goto bad_val;
+			sbi->gid = make_kgid(current_user_ns(), option);
+			break;
+		case Opt_mode:
+			if (match_octal(&args[0], &option))
+				goto bad_val;
+			sbi->mode = option & 01777U;
+			break;
+		case Opt_init:
+			if (remount)
+				goto bad_opt;
+			set_opt(sbi->s_mount_opt, FORMAT);
+			break;
+		case Opt_err_panic:
+			clear_opt(sbi->s_mount_opt, ERRORS_CONT);
+			clear_opt(sbi->s_mount_opt, ERRORS_RO);
+			set_opt(sbi->s_mount_opt, ERRORS_PANIC);
+			break;
+		case Opt_err_ro:
+			clear_opt(sbi->s_mount_opt, ERRORS_CONT);
+			clear_opt(sbi->s_mount_opt, ERRORS_PANIC);
+			set_opt(sbi->s_mount_opt, ERRORS_RO);
+			break;
+		case Opt_err_cont:
+			clear_opt(sbi->s_mount_opt, ERRORS_RO);
+			clear_opt(sbi->s_mount_opt, ERRORS_PANIC);
+			set_opt(sbi->s_mount_opt, ERRORS_CONT);
+			break;
+		case Opt_dax:
+			set_opt(sbi->s_mount_opt, DAX);
+			break;
+		case Opt_dbgmask:
+			if (match_int(&args[0], &option))
+				goto bad_val;
+			nova_dbgmask = option;
+			break;
+		default: {
+			goto bad_opt;
+		}
+		}
+	}
+
+	return 0;
+
+bad_val:
+	nova_info("Bad value '%s' for mount option '%s'\n", args[0].from,
+	       p);
+	return -EINVAL;
+bad_opt:
+	nova_info("Bad mount option: \"%s\"\n", p);
+	return -EINVAL;
+}
+
+
+/* Make sure we have enough space */
+static bool nova_check_size(struct super_block *sb, unsigned long size)
+{
+	unsigned long minimum_size;
+
+	/* space required for super block and root directory.*/
+	minimum_size = (HEAD_RESERVED_BLOCKS + TAIL_RESERVED_BLOCKS + 1)
+			  << sb->s_blocksize_bits;
+
+	if (size < minimum_size)
+		return false;
+
+	return true;
+}
+
+static inline void nova_sync_super(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_super_block *super = nova_get_super(sb);
+	struct nova_super_block *super_redund;
+
+	super_redund = nova_get_redund_super(sb);
+
+	memcpy_to_pmem_nocache((void *)super, (void *)sbi->nova_sb,
+		sizeof(struct nova_super_block));
+	PERSISTENT_BARRIER();
+
+	memcpy_to_pmem_nocache((void *)super_redund, (void *)sbi->nova_sb,
+		sizeof(struct nova_super_block));
+	PERSISTENT_BARRIER();
+}
+
+static struct nova_inode *nova_init(struct super_block *sb,
+				      unsigned long size)
+{
+	unsigned long blocksize;
+	struct nova_inode *root_i, *pi;
+	struct nova_super_block *super;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	nova_info("creating an empty nova of size %lu\n", size);
+	sbi->num_blocks = ((unsigned long)(size) >> PAGE_SHIFT);
+
+	nova_dbgv("nova: Default block size set to 4K\n");
+	sbi->blocksize = blocksize = NOVA_DEF_BLOCK_SIZE_4K;
+	nova_set_blocksize(sb, sbi->blocksize);
+
+	if (!nova_check_size(sb, size)) {
+		nova_warn("Specified NOVA size too small 0x%lx.\n", size);
+		return ERR_PTR(-EINVAL);
+	}
+
+	nova_dbgv("max file name len %d\n", (unsigned int)NOVA_NAME_LEN);
+
+	super = nova_get_super(sb);
+
+	/* clear out super-block and inode table */
+	memset_nt(super, 0, sbi->head_reserved_blocks * sbi->blocksize);
+
+	pi = nova_get_inode_by_ino(sb, NOVA_BLOCKNODE_INO);
+	pi->nova_ino = NOVA_BLOCKNODE_INO;
+	nova_flush_buffer(pi, CACHELINE_SIZE, 1);
+
+	sbi->nova_sb->s_size = cpu_to_le64(size);
+	sbi->nova_sb->s_blocksize = cpu_to_le32(blocksize);
+	sbi->nova_sb->s_magic = cpu_to_le32(NOVA_SUPER_MAGIC);
+	sbi->nova_sb->s_epoch_id = 0;
+
+	nova_sync_super(sb);
+
+	root_i = nova_get_inode_by_ino(sb, NOVA_ROOT_INO);
+	nova_dbgv("%s: Allocate root inode @ 0x%p\n", __func__, root_i);
+
+	root_i->i_mode = cpu_to_le16(sbi->mode | S_IFDIR);
+	root_i->i_uid = cpu_to_le32(from_kuid(&init_user_ns, sbi->uid));
+	root_i->i_gid = cpu_to_le32(from_kgid(&init_user_ns, sbi->gid));
+	root_i->i_links_count = cpu_to_le16(2);
+	root_i->i_blk_type = NOVA_BLOCK_TYPE_4K;
+	root_i->i_flags = 0;
+	root_i->i_size = cpu_to_le64(sb->s_blocksize);
+	root_i->i_atime = root_i->i_mtime = root_i->i_ctime =
+		cpu_to_le32(get_seconds());
+	root_i->nova_ino = cpu_to_le64(NOVA_ROOT_INO);
+	root_i->valid = 1;
+
+	nova_flush_buffer(root_i, sizeof(*root_i), false);
+
+	PERSISTENT_MARK();
+	PERSISTENT_BARRIER();
+	nova_info("NOVA initialization finish\n");
+	return root_i;
+}
+
+static inline void set_default_opts(struct nova_sb_info *sbi)
+{
+	set_opt(sbi->s_mount_opt, HUGEIOREMAP);
+	set_opt(sbi->s_mount_opt, ERRORS_CONT);
+	sbi->head_reserved_blocks = HEAD_RESERVED_BLOCKS;
+	sbi->tail_reserved_blocks = TAIL_RESERVED_BLOCKS;
+	sbi->cpus = num_online_cpus();
+}
+
+static void nova_root_check(struct super_block *sb, struct nova_inode *root_pi)
+{
+	if (!S_ISDIR(le16_to_cpu(root_pi->i_mode)))
+		nova_warn("root is not a directory!\n");
+}
+
+static int nova_fill_super(struct super_block *sb, void *data, int silent)
+{
+	struct nova_sb_info *sbi = NULL;
+	struct nova_inode *root_pi;
+	struct inode *root_i = NULL;
+	unsigned long blocksize;
+	u32 random = 0;
+	int retval = -EINVAL;
+
+	BUILD_BUG_ON(sizeof(struct nova_super_block) > NOVA_SB_SIZE);
+
+	sbi = kzalloc(sizeof(struct nova_sb_info), GFP_KERNEL);
+	if (!sbi)
+		return -ENOMEM;
+	sbi->nova_sb = kzalloc(sizeof(struct nova_super_block), GFP_KERNEL);
+	if (!sbi->nova_sb) {
+		kfree(sbi);
+		return -ENOMEM;
+	}
+
+	sb->s_fs_info = sbi;
+	sbi->sb = sb;
+
+	set_default_opts(sbi);
+
+	/* Currently the log page supports 64 journal pointer pairs */
+	if (sbi->cpus > MAX_CPUS) {
+		nova_err(sb, "NOVA needs more log pointer pages to support more than "
+			  __stringify(MAX_CPUS) " cpus.\n");
+		goto out;
+	}
+
+	retval = nova_get_nvmm_info(sb, sbi);
+	if (retval) {
+		nova_err(sb, "%s: Failed to get nvmm info.",
+			 __func__);
+		goto out;
+	}
+
+	get_random_bytes(&random, sizeof(u32));
+	atomic_set(&sbi->next_generation, random);
+
+	/* Init with default values */
+	sbi->mode = (0755);
+	sbi->uid = current_fsuid();
+	sbi->gid = current_fsgid();
+	set_opt(sbi->s_mount_opt, HUGEIOREMAP);
+
+	mutex_init(&sbi->s_lock);
+
+	sbi->zeroed_page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!sbi->zeroed_page) {
+		retval = -ENOMEM;
+		nova_dbg("%s: sbi->zeroed_page failed.",
+			 __func__);
+		goto out;
+	}
+
+	retval = nova_parse_options(data, sbi, 0);
+	if (retval) {
+		nova_err(sb, "%s: Failed to parse nova command line options.",
+			 __func__);
+		goto out;
+	}
+
+	/* Init a new nova instance */
+	if (sbi->s_mount_opt & NOVA_MOUNT_FORMAT) {
+		root_pi = nova_init(sb, sbi->initsize);
+		if (IS_ERR(root_pi)) {
+			nova_err(sb, "%s: root_pi error.",
+				 __func__);
+			retval = PTR_ERR(root_pi);
+			goto out;
+		}
+		goto setup_sb;
+	}
+
+	blocksize = le32_to_cpu(sbi->nova_sb->s_blocksize);
+	nova_set_blocksize(sb, blocksize);
+
+	nova_dbg_verbose("blocksize %lu\n", blocksize);
+
+	/* Read the root inode */
+	root_pi = nova_get_inode_by_ino(sb, NOVA_ROOT_INO);
+
+	/* Check that the root inode is in a sane state */
+	nova_root_check(sb, root_pi);
+
+	/* Set it all up.. */
+setup_sb:
+	sb->s_magic = le32_to_cpu(sbi->nova_sb->s_magic);
+	sb->s_op = &nova_sops;
+	sb->s_maxbytes = nova_max_size(sb->s_blocksize_bits);
+	sb->s_time_gran = 1000000000; // 1 second.
+	sb->s_xattr = NULL;
+	sb->s_flags |= MS_NOSEC;
+
+	root_i = nova_iget(sb, NOVA_ROOT_INO);
+	if (IS_ERR(root_i)) {
+		retval = PTR_ERR(root_i);
+		nova_err(sb, "%s: failed to get root inode",
+			 __func__);
+
+		goto out;
+	}
+
+	sb->s_root = d_make_root(root_i);
+	if (!sb->s_root) {
+		nova_err(sb, "get nova root inode failed\n");
+		retval = -ENOMEM;
+		goto out;
+	}
+
+	retval = 0;
+	return retval;
+
+out:
+	kfree(sbi->zeroed_page);
+	sbi->zeroed_page = NULL;
+
+	kfree(sbi->nova_sb);
+	kfree(sbi);
+	nova_dbg("%s failed: return %d\n", __func__, retval);
+	return retval;
+}
+
+static void nova_put_super(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (sbi->virt_addr) {
+		sbi->virt_addr = NULL;
+	}
+
+	kfree(sbi->zeroed_page);
+	nova_dbgmask = 0;
+
+	kfree(sbi->nova_sb);
+	kfree(sbi);
+	sb->s_fs_info = NULL;
+}
+
+static struct inode *nova_alloc_inode(struct super_block *sb)
+{
+	struct nova_inode_info *vi;
+
+	vi = kmem_cache_alloc(nova_inode_cachep, GFP_NOFS);
+	if (!vi)
+		return NULL;
+
+	return &vi->vfs_inode;
+}
+
+static void nova_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	struct nova_inode_info *vi = NOVA_I(inode);
+
+	nova_dbg_verbose("%s: ino %lu\n", __func__, inode->i_ino);
+	kmem_cache_free(nova_inode_cachep, vi);
+}
+
+static void nova_destroy_inode(struct inode *inode)
+{
+	nova_dbgv("%s: %lu\n", __func__, inode->i_ino);
+	call_rcu(&inode->i_rcu, nova_i_callback);
+}
+
+static void init_once(void *foo)
+{
+	struct nova_inode_info *vi = foo;
+
+	inode_init_once(&vi->vfs_inode);
+}
+
+static int __init init_inodecache(void)
+{
+	nova_inode_cachep = kmem_cache_create("nova_inode_cache",
+					       sizeof(struct nova_inode_info),
+					       0, (SLAB_RECLAIM_ACCOUNT |
+						   SLAB_MEM_SPREAD), init_once);
+	if (nova_inode_cachep == NULL)
+		return -ENOMEM;
+	return 0;
+}
+
+static void destroy_inodecache(void)
+{
+	/*
+	 * Make sure all delayed rcu free inodes are flushed before
+	 * we destroy cache.
+	 */
+	rcu_barrier();
+	kmem_cache_destroy(nova_inode_cachep);
+}
+
+
+/*
+ * the super block writes are all done "on the fly", so the
+ * super block is never in a "dirty" state, so there's no need
+ * for write_super.
+ */
+static struct super_operations nova_sops = {
+	.alloc_inode	= nova_alloc_inode,
+	.destroy_inode	= nova_destroy_inode,
+	.put_super	= nova_put_super,
+};
+
+static struct dentry *nova_mount(struct file_system_type *fs_type,
+				  int flags, const char *dev_name, void *data)
+{
+	return mount_bdev(fs_type, flags, dev_name, data, nova_fill_super);
+}
+
+static struct file_system_type nova_fs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "NOVA",
+	.mount		= nova_mount,
+	.kill_sb	= kill_block_super,
+};
+
+static int __init init_nova_fs(void)
+{
+	int rc = 0;
+
+	nova_dbg("%s: %d cpus online\n", __func__, num_online_cpus());
+	if (arch_has_clwb())
+		support_clwb = 1;
+
+	nova_info("Arch new instructions support: CLWB %s\n",
+			support_clwb ? "YES" : "NO");
+
+	rc = init_inodecache();
+	if (rc)
+		return rc;
+
+	rc = register_filesystem(&nova_fs_type);
+	if (rc)
+		goto out1;
+
+	return rc;
+
+out1:
+	destroy_inodecache();
+	return rc;
+}
+
+static void __exit exit_nova_fs(void)
+{
+	unregister_filesystem(&nova_fs_type);
+	destroy_inodecache();
+}
+
+MODULE_AUTHOR("Andiry Xu <jix024@cs.ucsd.edu>");
+MODULE_DESCRIPTION("NOVA: NOn-Volatile memory Accelerated File System");
+MODULE_LICENSE("GPL");
+
+module_init(init_nova_fs)
+module_exit(exit_nova_fs)
-- 
2.7.4


* [RFC v2 09/83] Add Kconfig and Makefile
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (7 preceding siblings ...)
  2018-03-10 18:17 ` [RFC v2 08/83] NOVA superblock operations Andiry Xu
@ 2018-03-10 18:17 ` Andiry Xu
  2018-03-11 12:15   ` Nikolay Borisov
  2018-03-10 18:17 ` [RFC v2 10/83] Add superblock integrity check Andiry Xu
                   ` (74 subsequent siblings)
  83 siblings, 1 reply; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:17 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/Kconfig       |  2 ++
 fs/Makefile      |  1 +
 fs/nova/Kconfig  | 15 +++++++++++++++
 fs/nova/Makefile |  7 +++++++
 4 files changed, 25 insertions(+)
 create mode 100644 fs/nova/Kconfig
 create mode 100644 fs/nova/Makefile

diff --git a/fs/Kconfig b/fs/Kconfig
index bc821a8..5e9ff3e 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -58,6 +58,8 @@ config FS_DAX_PMD
 	depends on ZONE_DEVICE
 	depends on TRANSPARENT_HUGEPAGE
 
+source "fs/nova/Kconfig"
+
 # Selected by DAX drivers that do not expect filesystem DAX to support
 # get_user_pages() of DAX mappings. I.e. "limited" indicates no support
 # for fork() of processes with MAP_SHARED mappings or support for
diff --git a/fs/Makefile b/fs/Makefile
index add789e..65ea619 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -113,6 +113,7 @@ obj-$(CONFIG_OMFS_FS)		+= omfs/
 obj-$(CONFIG_JFS_FS)		+= jfs/
 obj-$(CONFIG_XFS_FS)		+= xfs/
 obj-$(CONFIG_9P_FS)		+= 9p/
+obj-$(CONFIG_NOVA_FS)		+= nova/
 obj-$(CONFIG_AFS_FS)		+= afs/
 obj-$(CONFIG_NILFS2_FS)		+= nilfs2/
 obj-$(CONFIG_BEFS_FS)		+= befs/
diff --git a/fs/nova/Kconfig b/fs/nova/Kconfig
new file mode 100644
index 0000000..c1c692e
--- /dev/null
+++ b/fs/nova/Kconfig
@@ -0,0 +1,15 @@
+config NOVA_FS
+	tristate "NOVA: log-structured file system for non-volatile memories"
+	depends on FS_DAX
+	select CRC32
+	select LIBCRC32C
+	help
+	  If your system has a block of fast (comparable in access speed to
+	  system memory) and non-volatile byte-addressable memory and you wish
+	  to mount a light-weight filesystem with strong consistency support
+	  over it, say Y here.
+
+	  To compile this as a module, choose M here: the module will be
+	  called nova.
+
+	  If unsure, say N.
diff --git a/fs/nova/Makefile b/fs/nova/Makefile
new file mode 100644
index 0000000..eb19646
--- /dev/null
+++ b/fs/nova/Makefile
@@ -0,0 +1,7 @@
+#
+# Makefile for the linux NOVA filesystem routines.
+#
+
+obj-$(CONFIG_NOVA_FS) += nova.o
+
+nova-y := bbuild.o inode.o rebuild.o super.o
-- 
2.7.4


* [RFC v2 10/83] Add superblock integrity check.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (8 preceding siblings ...)
  2018-03-10 18:17 ` [RFC v2 09/83] Add Kconfig and Makefile Andiry Xu
@ 2018-03-10 18:17 ` Andiry Xu
  2018-03-10 18:17 ` [RFC v2 11/83] Add timing and I/O statistics for performance analysis and profiling Andiry Xu
                   ` (73 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:17 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Repair broken primary superblock with redundant superblock.
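
The check-and-repair idea can be sketched in user space as follows. This
is a sketch under assumptions, not the code in this patch: struct
toy_super is a hypothetical stand-in for struct nova_super_block (only the
"checksum occupies the first 4 bytes" property is taken from the patch),
sketch_crc32c() is a textbook bitwise CRC-32C that is not guaranteed to be
bit-compatible with nova_crc32c(), and the fallback policy in
super_repair() is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t sketch_crc32c(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t crc = 0xFFFFFFFFu;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= p[i];
		for (bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical superblock: the checksum comes first so it is easy to skip. */
struct toy_super {
	uint32_t s_sum;
	uint32_t s_magic;
	uint64_t s_size;
};

/* Checksum everything after the s_sum field. */
static uint32_t super_crc(const struct toy_super *sb)
{
	return sketch_crc32c((const uint8_t *)sb + sizeof(sb->s_sum),
			     sizeof(*sb) - sizeof(sb->s_sum));
}

static void super_update_crc(struct toy_super *sb)
{
	sb->s_sum = super_crc(sb);
}

/* Returns 0 when the stored checksum matches, 1 otherwise. */
static int super_check(const struct toy_super *sb)
{
	return sb->s_sum == super_crc(sb) ? 0 : 1;
}

/*
 * Returns 0 if the primary copy is usable, possibly after being
 * overwritten from the replica, and -1 if both copies are corrupt.
 */
static int super_repair(struct toy_super *primary,
			const struct toy_super *replica)
{
	assert(primary != NULL && replica != NULL);

	if (super_check(primary) == 0)
		return 0;
	if (super_check(replica) != 0)
		return -1;

	memcpy(primary, replica, sizeof(*primary));
	return 0;
}
```

Keeping the checksum at a fixed offset at the start of the structure means
the same byte range is covered when the checksum is computed and when it
is verified.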

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/super.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 102 insertions(+)

diff --git a/fs/nova/super.c b/fs/nova/super.c
index 552fe5d..e0e38ab 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -276,6 +276,21 @@ static bool nova_check_size(struct super_block *sb, unsigned long size)
 	return true;
 }
 
+static inline int nova_check_super_checksum(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u32 crc = 0;
+
+	// Check CRC but skip s_sum, which is the 4 bytes at the beginning
+	crc = nova_crc32c(~0, (__u8 *)sbi->nova_sb + sizeof(__le32),
+			sizeof(struct nova_super_block) - sizeof(__le32));
+
+	if (sbi->nova_sb->s_sum == cpu_to_le32(crc))
+		return 0;
+	else
+		return 1;
+}
+
 static inline void nova_sync_super(struct super_block *sb)
 {
 	struct nova_sb_info *sbi = NOVA_SB(sb);
@@ -293,6 +308,34 @@ static inline void nova_sync_super(struct super_block *sb)
 	PERSISTENT_BARRIER();
 }
 
+/* Update checksum for the DRAM copy */
+static inline void nova_update_super_crc(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u32 crc = 0;
+
+	sbi->nova_sb->s_wtime = cpu_to_le32(get_seconds());
+	sbi->nova_sb->s_sum = 0;
+	crc = nova_crc32c(~0, (__u8 *)sbi->nova_sb + sizeof(__le32),
+			sizeof(struct nova_super_block) - sizeof(__le32));
+	sbi->nova_sb->s_sum = cpu_to_le32(crc);
+}
+
+
+static inline void nova_update_mount_time(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u64 mnt_write_time;
+
+	mnt_write_time = (get_seconds() & 0xFFFFFFFF);
+	mnt_write_time = mnt_write_time | (mnt_write_time << 32);
+
+	sbi->nova_sb->s_mtime = cpu_to_le64(mnt_write_time);
+	nova_update_super_crc(sb);
+
+	nova_sync_super(sb);
+}
+
 static struct nova_inode *nova_init(struct super_block *sb,
 				      unsigned long size)
 {
@@ -328,6 +371,7 @@ static struct nova_inode *nova_init(struct super_block *sb,
 	sbi->nova_sb->s_blocksize = cpu_to_le32(blocksize);
 	sbi->nova_sb->s_magic = cpu_to_le32(NOVA_SUPER_MAGIC);
 	sbi->nova_sb->s_epoch_id = 0;
+	nova_update_super_crc(sb);
 
 	nova_sync_super(sb);
 
@@ -369,6 +413,54 @@ static void nova_root_check(struct super_block *sb, struct nova_inode *root_pi)
 		nova_warn("root is not a directory!\n");
 }
 
+/* Check super block magic and checksum */
+static int nova_check_super(struct super_block *sb,
+	struct nova_super_block *ps)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	int rc;
+
+	rc = memcpy_mcsafe(sbi->nova_sb, ps,
+				sizeof(struct nova_super_block));
+
+	if (rc < 0)
+		return rc;
+
+	if (le32_to_cpu(sbi->nova_sb->s_magic) != NOVA_SUPER_MAGIC)
+		return -EIO;
+
+	if (nova_check_super_checksum(sb))
+		return -EIO;
+
+	return 0;
+}
+
+static int nova_check_integrity(struct super_block *sb)
+{
+	struct nova_super_block *super = nova_get_super(sb);
+	struct nova_super_block *super_redund;
+	int rc;
+
+	super_redund = nova_get_redund_super(sb);
+
+	/* Do sanity checks on the superblock */
+	rc = nova_check_super(sb, super);
+	if (rc < 0) {
+		rc = nova_check_super(sb, super_redund);
+		if (rc < 0) {
+			nova_err(sb, "Can't find a valid nova partition\n");
+			return rc;
+		} else {
+			nova_warn("Error in super block: try to repair it with the other copy\n");
+			memcpy_to_pmem_nocache((void *)super, (void *)super_redund,
+					sizeof(struct nova_super_block));
+			PERSISTENT_BARRIER();
+		}
+	}
+
+	return 0;
+}
+
 static int nova_fill_super(struct super_block *sb, void *data, int silent)
 {
 	struct nova_sb_info *sbi = NULL;
@@ -446,6 +538,13 @@ static int nova_fill_super(struct super_block *sb, void *data, int silent)
 		goto setup_sb;
 	}
 
+	if (nova_check_integrity(sb) < 0) {
+		retval = -EINVAL;
+		nova_dbg("Memory contains invalid nova %x:%x\n",
+			le32_to_cpu(sbi->nova_sb->s_magic), NOVA_SUPER_MAGIC);
+		goto out;
+	}
+
 	blocksize = le32_to_cpu(sbi->nova_sb->s_blocksize);
 	nova_set_blocksize(sb, blocksize);
 
@@ -482,6 +581,9 @@ static int nova_fill_super(struct super_block *sb, void *data, int silent)
 		goto out;
 	}
 
+	if (!(sb->s_flags & MS_RDONLY))
+		nova_update_mount_time(sb);
+
 	retval = 0;
 	return retval;
 
-- 
2.7.4


* [RFC v2 11/83] Add timing and I/O statistics for performance analysis and profiling.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (9 preceding siblings ...)
  2018-03-10 18:17 ` [RFC v2 10/83] Add superblock integrity check Andiry Xu
@ 2018-03-10 18:17 ` Andiry Xu
  2018-03-10 18:17 ` [RFC v2 12/83] Add timing for mount and init Andiry Xu
                   ` (72 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:17 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/Makefile |   2 +-
 fs/nova/nova.h   |  12 +++
 fs/nova/stats.c  | 263 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/stats.h  | 178 +++++++++++++++++++++++++++++++++++++
 fs/nova/super.c  |   6 ++
 5 files changed, 460 insertions(+), 1 deletion(-)
 create mode 100644 fs/nova/stats.c
 create mode 100644 fs/nova/stats.h

diff --git a/fs/nova/Makefile b/fs/nova/Makefile
index eb19646..886356a 100644
--- a/fs/nova/Makefile
+++ b/fs/nova/Makefile
@@ -4,4 +4,4 @@
 
 obj-$(CONFIG_NOVA_FS) += nova.o
 
-nova-y := bbuild.o inode.o rebuild.o super.o
+nova-y := bbuild.o inode.o rebuild.o stats.o super.o
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index ded9fe8..ba7ffca 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -48,6 +48,7 @@
 #include <linux/pagevec.h>
 
 #include "nova_def.h"
+#include "stats.h"
 
 #define PAGE_SHIFT_2M 21
 #define PAGE_SHIFT_1G 30
@@ -135,6 +136,10 @@ extern unsigned int nova_dbgmask;
 #define	ANY_CPU				(65536)
 #define	FREE_BATCH			(16)
 
+
+extern int measure_timing;
+
+
 extern unsigned int blk_type_to_shift[NOVA_BLOCK_TYPE_MAX];
 extern unsigned int blk_type_to_size[NOVA_BLOCK_TYPE_MAX];
 
@@ -306,4 +311,11 @@ static inline u64 nova_get_epoch_id(struct super_block *sb)
 int nova_rebuild_inode(struct super_block *sb, struct nova_inode_info *si,
 	u64 ino, u64 pi_addr, int rebuild_dir);
 
+/* stats.c */
+void nova_get_timing_stats(void);
+void nova_get_IO_stats(void);
+void nova_print_timing_stats(struct super_block *sb);
+void nova_clear_stats(struct super_block *sb);
+void nova_print_inode(struct nova_inode *pi);
+
 #endif /* __NOVA_H */
diff --git a/fs/nova/stats.c b/fs/nova/stats.c
new file mode 100644
index 0000000..4b7c317
--- /dev/null
+++ b/fs/nova/stats.c
@@ -0,0 +1,263 @@
+/*
+ * NOVA File System statistics
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include "nova.h"
+
+const char *Timingstring[TIMING_NUM] = {
+	/* Init */
+	"================ Initialization ================",
+	"init",
+	"mount",
+	"ioremap",
+	"new_init",
+	"recovery",
+
+	/* Namei operations */
+	"============= Directory operations =============",
+	"create",
+	"lookup",
+	"link",
+	"unlink",
+	"symlink",
+	"mkdir",
+	"rmdir",
+	"mknod",
+	"rename",
+	"readdir",
+	"add_dentry",
+	"remove_dentry",
+	"setattr",
+	"setsize",
+
+	/* I/O operations */
+	"================ I/O operations ================",
+	"dax_read",
+	"cow_write",
+	"inplace_write",
+	"copy_to_nvmm",
+	"dax_get_block",
+	"read_iter",
+	"write_iter",
+
+	/* Memory operations */
+	"============== Memory operations ===============",
+	"memcpy_read_nvmm",
+	"memcpy_write_nvmm",
+	"memcpy_write_back_to_nvmm",
+	"handle_partial_block",
+
+	/* Memory management */
+	"============== Memory management ===============",
+	"alloc_blocks",
+	"new_data_blocks",
+	"new_log_blocks",
+	"free_blocks",
+	"free_data_blocks",
+	"free_log_blocks",
+
+	/* Transaction */
+	"================= Transaction ==================",
+	"transaction_new_inode",
+	"transaction_link_change",
+	"update_tail",
+
+	/* Logging */
+	"============= Logging operations ===============",
+	"append_dir_entry",
+	"append_file_entry",
+	"append_link_change",
+	"append_setattr",
+	"inplace_update_entry",
+
+	/* Tree */
+	"=============== Tree operations ================",
+	"checking_entry",
+	"assign_blocks",
+
+	/* GC */
+	"============= Garbage collection ===============",
+	"log_fast_gc",
+	"log_thorough_gc",
+	"check_invalid_log",
+
+	/* Others */
+	"================ Miscellaneous =================",
+	"find_cache_page",
+	"fsync",
+	"write_pages",
+	"fallocate",
+	"direct_IO",
+	"free_old_entry",
+	"delete_file_tree",
+	"delete_dir_tree",
+	"new_vfs_inode",
+	"new_nova_inode",
+	"free_inode",
+	"free_inode_log",
+	"evict_inode",
+
+	/* Mmap */
+	"=============== MMap operations ================",
+	"mmap_page_fault",
+	"mmap_pmd_fault",
+	"mmap_pfn_mkwrite",
+
+	/* Rebuild */
+	"=================== Rebuild ====================",
+	"rebuild_dir",
+	"rebuild_file",
+};
+
+u64 Timingstats[TIMING_NUM];
+DEFINE_PER_CPU(u64[TIMING_NUM], Timingstats_percpu);
+u64 Countstats[TIMING_NUM];
+DEFINE_PER_CPU(u64[TIMING_NUM], Countstats_percpu);
+u64 IOstats[STATS_NUM];
+DEFINE_PER_CPU(u64[STATS_NUM], IOstats_percpu);
+
+static void nova_print_IO_stats(struct super_block *sb)
+{
+	nova_info("=========== NOVA I/O stats ===========\n");
+	nova_info("Read %llu, bytes %llu, average %llu\n",
+		Countstats[dax_read_t], IOstats[read_bytes],
+		Countstats[dax_read_t] ?
+			IOstats[read_bytes] / Countstats[dax_read_t] : 0);
+	nova_info("COW write %llu, bytes %llu, average %llu, write breaks %llu, average %llu\n",
+		Countstats[cow_write_t], IOstats[cow_write_bytes],
+		Countstats[cow_write_t] ?
+			IOstats[cow_write_bytes] / Countstats[cow_write_t] : 0,
+		IOstats[cow_write_breaks], Countstats[cow_write_t] ?
+			IOstats[cow_write_breaks] / Countstats[cow_write_t]
+			: 0);
+	nova_info("Inplace write %llu, bytes %llu, average %llu, write breaks %llu, average %llu\n",
+		Countstats[inplace_write_t], IOstats[inplace_write_bytes],
+		Countstats[inplace_write_t] ?
+			IOstats[inplace_write_bytes] /
+			Countstats[inplace_write_t] : 0,
+		IOstats[inplace_write_breaks], Countstats[inplace_write_t] ?
+			IOstats[inplace_write_breaks] /
+			Countstats[inplace_write_t] : 0);
+}
+
+void nova_get_timing_stats(void)
+{
+	int i;
+	int cpu;
+
+	for (i = 0; i < TIMING_NUM; i++) {
+		Timingstats[i] = 0;
+		Countstats[i] = 0;
+		for_each_possible_cpu(cpu) {
+			Timingstats[i] += per_cpu(Timingstats_percpu[i], cpu);
+			Countstats[i] += per_cpu(Countstats_percpu[i], cpu);
+		}
+	}
+}
+
+void nova_get_IO_stats(void)
+{
+	int i;
+	int cpu;
+
+	for (i = 0; i < STATS_NUM; i++) {
+		IOstats[i] = 0;
+		for_each_possible_cpu(cpu)
+			IOstats[i] += per_cpu(IOstats_percpu[i], cpu);
+	}
+}
+
+void nova_print_timing_stats(struct super_block *sb)
+{
+	int i;
+
+	nova_get_timing_stats();
+	nova_get_IO_stats();
+
+	nova_info("=========== NOVA kernel timing stats ============\n");
+	for (i = 0; i < TIMING_NUM; i++) {
+		/* Title */
+		if (Timingstring[i][0] == '=') {
+			nova_info("\n%s\n\n", Timingstring[i]);
+			continue;
+		}
+
+		if (measure_timing || Timingstats[i]) {
+			nova_info("%s: count %llu, timing %llu, average %llu\n",
+				Timingstring[i],
+				Countstats[i],
+				Timingstats[i],
+				Countstats[i] ?
+				Timingstats[i] / Countstats[i] : 0);
+		} else {
+			nova_info("%s: count %llu\n",
+				Timingstring[i],
+				Countstats[i]);
+		}
+	}
+
+	nova_info("\n");
+	nova_print_IO_stats(sb);
+}
+
+static void nova_clear_timing_stats(void)
+{
+	int i;
+	int cpu;
+
+	for (i = 0; i < TIMING_NUM; i++) {
+		Countstats[i] = 0;
+		Timingstats[i] = 0;
+		for_each_possible_cpu(cpu) {
+			per_cpu(Timingstats_percpu[i], cpu) = 0;
+			per_cpu(Countstats_percpu[i], cpu) = 0;
+		}
+	}
+}
+
+static void nova_clear_IO_stats(struct super_block *sb)
+{
+	int i;
+	int cpu;
+
+	for (i = 0; i < STATS_NUM; i++) {
+		IOstats[i] = 0;
+		for_each_possible_cpu(cpu)
+			per_cpu(IOstats_percpu[i], cpu) = 0;
+	}
+}
+
+void nova_clear_stats(struct super_block *sb)
+{
+	nova_clear_timing_stats();
+	nova_clear_IO_stats(sb);
+}
+
+void nova_print_inode(struct nova_inode *pi)
+{
+	nova_dbg("%s: NOVA inode %llu\n", __func__, pi->nova_ino);
+	nova_dbg("valid %u, deleted %u, blk type %u, flags %u\n",
+		pi->valid, pi->deleted, pi->i_blk_type, pi->i_flags);
+	nova_dbg("size %llu, ctime %u, mtime %u, atime %u\n",
+		pi->i_size, pi->i_ctime, pi->i_mtime, pi->i_atime);
+	nova_dbg("mode %u, links %u, xattr 0x%llx\n",
+		pi->i_mode, pi->i_links_count, pi->i_xattr);
+	nova_dbg("uid %u, gid %u, gen %u, create time %u\n",
+		pi->i_uid, pi->i_gid, pi->i_generation, pi->i_create_time);
+	nova_dbg("head 0x%llx, tail 0x%llx\n",
+		pi->log_head, pi->log_tail);
+	nova_dbg("create epoch id %llu, delete epoch id %llu\n",
+		pi->create_epoch_id, pi->delete_epoch_id);
+}
diff --git a/fs/nova/stats.h b/fs/nova/stats.h
new file mode 100644
index 0000000..8dbd02d
--- /dev/null
+++ b/fs/nova/stats.h
@@ -0,0 +1,178 @@
+/*
+ * NOVA File System statistics
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef __STATS_H
+#define __STATS_H
+
+
+/* ======================= Timing ========================= */
+enum timing_category {
+	/* Init */
+	init_title_t,
+	init_t,
+	mount_t,
+	ioremap_t,
+	new_init_t,
+	recovery_t,
+
+	/* Namei operations */
+	namei_title_t,
+	create_t,
+	lookup_t,
+	link_t,
+	unlink_t,
+	symlink_t,
+	mkdir_t,
+	rmdir_t,
+	mknod_t,
+	rename_t,
+	readdir_t,
+	add_dentry_t,
+	remove_dentry_t,
+	setattr_t,
+	setsize_t,
+
+	/* I/O operations */
+	io_title_t,
+	dax_read_t,
+	cow_write_t,
+	inplace_write_t,
+	copy_to_nvmm_t,
+	dax_get_block_t,
+	read_iter_t,
+	write_iter_t,
+
+	/* Memory operations */
+	memory_title_t,
+	memcpy_r_nvmm_t,
+	memcpy_w_nvmm_t,
+	memcpy_w_wb_t,
+	partial_block_t,
+
+	/* Memory management */
+	mm_title_t,
+	new_blocks_t,
+	new_data_blocks_t,
+	new_log_blocks_t,
+	free_blocks_t,
+	free_data_t,
+	free_log_t,
+
+	/* Transaction */
+	trans_title_t,
+	create_trans_t,
+	link_trans_t,
+	update_tail_t,
+
+	/* Logging */
+	logging_title_t,
+	append_dir_entry_t,
+	append_file_entry_t,
+	append_link_change_t,
+	append_setattr_t,
+	update_entry_t,
+
+	/* Tree */
+	tree_title_t,
+	check_entry_t,
+	assign_t,
+
+	/* GC */
+	gc_title_t,
+	fast_gc_t,
+	thorough_gc_t,
+	check_invalid_t,
+
+	/* Others */
+	others_title_t,
+	find_cache_t,
+	fsync_t,
+	write_pages_t,
+	fallocate_t,
+	direct_IO_t,
+	free_old_t,
+	delete_file_tree_t,
+	delete_dir_tree_t,
+	new_vfs_inode_t,
+	new_nova_inode_t,
+	free_inode_t,
+	free_inode_log_t,
+	evict_inode_t,
+
+	/* Mmap */
+	mmap_title_t,
+	mmap_fault_t,
+	pmd_fault_t,
+	pfn_mkwrite_t,
+
+	/* Rebuild */
+	rebuild_title_t,
+	rebuild_dir_t,
+	rebuild_file_t,
+
+	/* Sentinel */
+	TIMING_NUM,
+};
+
+enum stats_category {
+	alloc_steps,
+	cow_write_breaks,
+	inplace_write_breaks,
+	read_bytes,
+	cow_write_bytes,
+	inplace_write_bytes,
+	fast_checked_pages,
+	thorough_checked_pages,
+	fast_gc_pages,
+	thorough_gc_pages,
+	dax_new_blocks,
+	inplace_new_blocks,
+	fdatasync,
+
+	/* Sentinel */
+	STATS_NUM,
+};
+
+extern const char *Timingstring[TIMING_NUM];
+extern u64 Timingstats[TIMING_NUM];
+DECLARE_PER_CPU(u64[TIMING_NUM], Timingstats_percpu);
+extern u64 Countstats[TIMING_NUM];
+DECLARE_PER_CPU(u64[TIMING_NUM], Countstats_percpu);
+extern u64 IOstats[STATS_NUM];
+DECLARE_PER_CPU(u64[STATS_NUM], IOstats_percpu);
+
+typedef struct timespec timing_t;
+
+#define NOVA_START_TIMING(name, start) \
+	do { if (measure_timing) getrawmonotonic(&start); } while (0)
+
+#define NOVA_END_TIMING(name, start) \
+	do { if (measure_timing) { \
+		timing_t end; \
+		getrawmonotonic(&end); \
+		__this_cpu_add(Timingstats_percpu[name], \
+			(end.tv_sec - start.tv_sec) * 1000000000 + \
+			(end.tv_nsec - start.tv_nsec)); \
+	} \
+	__this_cpu_add(Countstats_percpu[name], 1); \
+	} while (0)
+
+#define NOVA_STATS_ADD(name, value) \
+	do { __this_cpu_add(IOstats_percpu[name], value); } while (0)
+
+
+
+#endif
diff --git a/fs/nova/super.c b/fs/nova/super.c
index e0e38ab..9295d23 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -40,8 +40,12 @@
 #include "nova.h"
 #include "super.h"
 
+int measure_timing;
 int support_clwb;
 
+module_param(measure_timing, int, 0444);
+MODULE_PARM_DESC(measure_timing, "Timing measurement");
+
 module_param(nova_dbgmask, int, 0444);
 MODULE_PARM_DESC(nova_dbgmask, "Control debugging output");
 
@@ -500,6 +504,8 @@ static int nova_fill_super(struct super_block *sb, void *data, int silent)
 		goto out;
 	}
 
+	nova_dbg("measure timing %d\n", measure_timing);
+
 	get_random_bytes(&random, sizeof(u32));
 	atomic_set(&sbi->next_generation, random);
 
-- 
2.7.4


* [RFC v2 12/83] Add timing for mount and init.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (10 preceding siblings ...)
  2018-03-10 18:17 ` [RFC v2 11/83] Add timing and I/O statistics for performance analysis and profiling Andiry Xu
@ 2018-03-10 18:17 ` Andiry Xu
  2018-03-10 18:17 ` [RFC v2 13/83] Add remount_fs and show_options methods Andiry Xu
                   ` (71 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:17 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/super.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/fs/nova/super.c b/fs/nova/super.c
index 9295d23..3efb560 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -347,6 +347,9 @@ static struct nova_inode *nova_init(struct super_block *sb,
 	struct nova_inode *root_i, *pi;
 	struct nova_super_block *super;
 	struct nova_sb_info *sbi = NOVA_SB(sb);
+	timing_t init_time;
+
+	NOVA_START_TIMING(new_init_t, init_time);
 
 	nova_info("creating an empty nova of size %lu\n", size);
 	sbi->num_blocks = ((unsigned long)(size) >> PAGE_SHIFT);
@@ -357,6 +360,7 @@ static struct nova_inode *nova_init(struct super_block *sb,
 
 	if (!nova_check_size(sb, size)) {
 		nova_warn("Specified NOVA size too small 0x%lx.\n", size);
+		NOVA_END_TIMING(new_init_t, init_time);
 		return ERR_PTR(-EINVAL);
 	}
 
@@ -399,6 +403,7 @@ static struct nova_inode *nova_init(struct super_block *sb,
 	PERSISTENT_MARK();
 	PERSISTENT_BARRIER();
 	nova_info("NOVA initialization finish\n");
+	NOVA_END_TIMING(new_init_t, init_time);
 	return root_i;
 }
 
@@ -473,15 +478,22 @@ static int nova_fill_super(struct super_block *sb, void *data, int silent)
 	unsigned long blocksize;
 	u32 random = 0;
 	int retval = -EINVAL;
+	timing_t mount_time;
+
+	NOVA_START_TIMING(mount_t, mount_time);
 
 	BUILD_BUG_ON(sizeof(struct nova_super_block) > NOVA_SB_SIZE);
 
 	sbi = kzalloc(sizeof(struct nova_sb_info), GFP_KERNEL);
-	if (!sbi)
+	if (!sbi) {
+		NOVA_END_TIMING(mount_t, mount_time);
 		return -ENOMEM;
+	}
+
 	sbi->nova_sb = kzalloc(sizeof(struct nova_super_block), GFP_KERNEL);
 	if (!sbi->nova_sb) {
 		kfree(sbi);
+		NOVA_END_TIMING(mount_t, mount_time);
 		return -ENOMEM;
 	}
 
@@ -591,6 +603,7 @@ static int nova_fill_super(struct super_block *sb, void *data, int silent)
 		nova_update_mount_time(sb);
 
 	retval = 0;
+	NOVA_END_TIMING(mount_t, mount_time);
 	return retval;
 
 out:
@@ -600,6 +613,7 @@ static int nova_fill_super(struct super_block *sb, void *data, int silent)
 	kfree(sbi->nova_sb);
 	kfree(sbi);
 	nova_dbg("%s failed: return %d\n", __func__, retval);
+	NOVA_END_TIMING(mount_t, mount_time);
 	return retval;
 }
 
@@ -701,6 +715,9 @@ static struct file_system_type nova_fs_type = {
 static int __init init_nova_fs(void)
 {
 	int rc = 0;
+	timing_t init_time;
+
+	NOVA_START_TIMING(init_t, init_time);
 
 	nova_dbg("%s: %d cpus online\n", __func__, num_online_cpus());
 	if (arch_has_clwb())
@@ -711,17 +728,19 @@ static int __init init_nova_fs(void)
 
 	rc = init_inodecache();
 	if (rc)
-		return rc;
+		goto out;
 
 	rc = register_filesystem(&nova_fs_type);
 	if (rc)
 		goto out1;
 
+out:
+	NOVA_END_TIMING(init_t, init_time);
 	return rc;
 
 out1:
 	destroy_inodecache();
-	return rc;
+	goto out;
 }
 
 static void __exit exit_nova_fs(void)
-- 
2.7.4


* [RFC v2 13/83] Add remount_fs and show_options methods.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (11 preceding siblings ...)
  2018-03-10 18:17 ` [RFC v2 12/83] Add timing for mount and init Andiry Xu
@ 2018-03-10 18:17 ` Andiry Xu
  2018-03-10 18:17 ` [RFC v2 14/83] Add range node kmem cache Andiry Xu
                   ` (70 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:17 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/super.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/fs/nova/super.c b/fs/nova/super.c
index 3efb560..f41cc04 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -617,6 +617,59 @@ static int nova_fill_super(struct super_block *sb, void *data, int silent)
 	return retval;
 }
 
+static int nova_show_options(struct seq_file *seq, struct dentry *root)
+{
+	struct nova_sb_info *sbi = NOVA_SB(root->d_sb);
+
+	if (sbi->mode != (0777 | S_ISVTX))
+		seq_printf(seq, ",mode=%03o", sbi->mode);
+	if (uid_valid(sbi->uid))
+		seq_printf(seq, ",uid=%u", from_kuid(&init_user_ns, sbi->uid));
+	if (gid_valid(sbi->gid))
+		seq_printf(seq, ",gid=%u", from_kgid(&init_user_ns, sbi->gid));
+	if (test_opt(root->d_sb, ERRORS_RO))
+		seq_puts(seq, ",errors=remount-ro");
+	if (test_opt(root->d_sb, ERRORS_PANIC))
+		seq_puts(seq, ",errors=panic");
+	if (test_opt(root->d_sb, DAX))
+		seq_puts(seq, ",dax");
+
+	return 0;
+}
+
+static int nova_remount(struct super_block *sb, int *mntflags, char *data)
+{
+	unsigned long old_sb_flags;
+	unsigned long old_mount_opt;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	int ret = -EINVAL;
+
+	/* Store the old options */
+	mutex_lock(&sbi->s_lock);
+	old_sb_flags = sb->s_flags;
+	old_mount_opt = sbi->s_mount_opt;
+
+	if (nova_parse_options(data, sbi, 1))
+		goto restore_opt;
+
+	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
+		      ((sbi->s_mount_opt & NOVA_MOUNT_POSIX_ACL) ?
+		       MS_POSIXACL : 0);
+
+	if ((*mntflags & MS_RDONLY) != (sb->s_flags & MS_RDONLY))
+		nova_update_mount_time(sb);
+
+	mutex_unlock(&sbi->s_lock);
+	ret = 0;
+	return ret;
+
+restore_opt:
+	sb->s_flags = old_sb_flags;
+	sbi->s_mount_opt = old_mount_opt;
+	mutex_unlock(&sbi->s_lock);
+	return ret;
+}
+
 static void nova_put_super(struct super_block *sb)
 {
 	struct nova_sb_info *sbi = NOVA_SB(sb);
@@ -697,6 +750,8 @@ static struct super_operations nova_sops = {
 	.alloc_inode	= nova_alloc_inode,
 	.destroy_inode	= nova_destroy_inode,
 	.put_super	= nova_put_super,
+	.remount_fs	= nova_remount,
+	.show_options	= nova_show_options,
 };
 
 static struct dentry *nova_mount(struct file_system_type *fs_type,
-- 
2.7.4


* [RFC v2 14/83] Add range node kmem cache.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (12 preceding siblings ...)
  2018-03-10 18:17 ` [RFC v2 13/83] Add remount_fs and show_options methods Andiry Xu
@ 2018-03-10 18:17 ` Andiry Xu
  2018-03-11 11:55   ` Nikolay Borisov
  2018-03-10 18:17 ` [RFC v2 15/83] Add free list data structure Andiry Xu
                   ` (69 subsequent siblings)
  83 siblings, 1 reply; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:17 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

A range node describes an inclusive range [range_low, range_high] and is
kept in a red-black tree. NOVA uses range nodes in the NVM allocator and to
track the inodes in use.
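
To make the range lookup concrete, here is a small userspace sketch, under the assumption that ranges never overlap. It is not the NOVA implementation: struct demo_range and the demo_range_* functions are hypothetical, and a plain unbalanced binary search tree stands in for the kernel red-black tree so that the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a range node: inclusive [low, high]. */
struct demo_range {
	struct demo_range *left, *right;
	unsigned long low, high;
};

/* Insert a range; return 0, or -1 on overlap, bad input or OOM. */
static int demo_range_insert(struct demo_range **root,
			     unsigned long low, unsigned long high)
{
	struct demo_range **link = root;
	struct demo_range *n;

	if (low > high)
		return -1;

	/* Walk down until an empty slot is found. */
	while (*link) {
		if (high < (*link)->low)
			link = &(*link)->left;
		else if (low > (*link)->high)
			link = &(*link)->right;
		else
			return -1;	/* overlaps an existing range */
	}

	n = calloc(1, sizeof(*n));
	if (!n)
		return -1;
	n->low = low;
	n->high = high;
	*link = n;
	return 0;
}

/* Return the node whose range contains key, or NULL. */
static struct demo_range *demo_range_find(struct demo_range *root,
					  unsigned long key)
{
	while (root) {
		if (key < root->low)
			root = root->left;
		else if (key > root->high)
			root = root->right;
		else
			return root;
	}
	return NULL;
}

/* Free every node in the tree. */
static void demo_range_free(struct demo_range *root)
{
	if (!root)
		return;
	demo_range_free(root->left);
	demo_range_free(root->right);
	free(root);
}
```

Because ranges do not overlap, comparing a key against low and high is enough to pick a single branch at every step, which is what makes an ordered tree a reasonable fit here.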

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/nova.h  |  8 ++++++++
 fs/nova/super.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
 fs/nova/super.h |  2 ++
 3 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index ba7ffca..e0e85fb 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -301,6 +301,14 @@ static inline u64 nova_get_epoch_id(struct super_block *sb)
 }
 
 #include "inode.h"
+
+/* A node in the RB tree representing a range of pages */
+struct nova_range_node {
+	struct rb_node node;
+	unsigned long range_low;
+	unsigned long range_high;
+};
+
 #include "bbuild.h"
 
 /* ====================================================== */
diff --git a/fs/nova/super.c b/fs/nova/super.c
index f41cc04..aec1cd3 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -52,6 +52,7 @@ MODULE_PARM_DESC(nova_dbgmask, "Control debugging output");
 static struct super_operations nova_sops;
 
 static struct kmem_cache *nova_inode_cachep;
+static struct kmem_cache *nova_range_node_cachep;
 
 
 /* FIXME: should the following variable be one per NOVA instance? */
@@ -686,6 +687,20 @@ static void nova_put_super(struct super_block *sb)
 	sb->s_fs_info = NULL;
 }
 
+inline void nova_free_range_node(struct nova_range_node *node)
+{
+	kmem_cache_free(nova_range_node_cachep, node);
+}
+
+inline struct nova_range_node *nova_alloc_range_node(struct super_block *sb)
+{
+	struct nova_range_node *p;
+
+	p = (struct nova_range_node *)
+		kmem_cache_zalloc(nova_range_node_cachep, GFP_NOFS);
+	return p;
+}
+
 static struct inode *nova_alloc_inode(struct super_block *sb)
 {
 	struct nova_inode_info *vi;
@@ -719,6 +734,17 @@ static void init_once(void *foo)
 	inode_init_once(&vi->vfs_inode);
 }
 
+static int __init init_rangenode_cache(void)
+{
+	nova_range_node_cachep = kmem_cache_create("nova_range_node_cache",
+					sizeof(struct nova_range_node),
+					0, (SLAB_RECLAIM_ACCOUNT |
+					SLAB_MEM_SPREAD), NULL);
+	if (nova_range_node_cachep == NULL)
+		return -ENOMEM;
+	return 0;
+}
+
 static int __init init_inodecache(void)
 {
 	nova_inode_cachep = kmem_cache_create("nova_inode_cache",
@@ -740,6 +766,11 @@ static void destroy_inodecache(void)
 	kmem_cache_destroy(nova_inode_cachep);
 }
 
+static void destroy_rangenode_cache(void)
+{
+	kmem_cache_destroy(nova_range_node_cachep);
+}
+
 
 /*
  * the super block writes are all done "on the fly", so the
@@ -781,20 +812,27 @@ static int __init init_nova_fs(void)
 	nova_info("Arch new instructions support: CLWB %s\n",
 			support_clwb ? "YES" : "NO");
 
-	rc = init_inodecache();
+	rc = init_rangenode_cache();
 	if (rc)
 		goto out;
 
-	rc = register_filesystem(&nova_fs_type);
+	rc = init_inodecache();
 	if (rc)
 		goto out1;
 
+	rc = register_filesystem(&nova_fs_type);
+	if (rc)
+		goto out2;
+
 out:
 	NOVA_END_TIMING(init_t, init_time);
 	return rc;
 
-out1:
+out2:
 	destroy_inodecache();
+
+out1:
+	destroy_rangenode_cache();
 	goto out;
 }
 
@@ -802,6 +840,7 @@ static void __exit exit_nova_fs(void)
 {
 	unregister_filesystem(&nova_fs_type);
 	destroy_inodecache();
+	destroy_rangenode_cache();
 }
 
 MODULE_AUTHOR("Andiry Xu <jix024@cs.ucsd.edu>");
diff --git a/fs/nova/super.h b/fs/nova/super.h
index cb53908..b478080 100644
--- a/fs/nova/super.h
+++ b/fs/nova/super.h
@@ -145,5 +145,7 @@ static inline struct nova_super_block *nova_get_super(struct super_block *sb)
 }
 
 extern void nova_error_mng(struct super_block *sb, const char *fmt, ...);
+extern struct nova_range_node *nova_alloc_range_node(struct super_block *sb);
+extern void nova_free_range_node(struct nova_range_node *node);
 
 #endif
-- 
2.7.4


* [RFC v2 15/83] Add free list data structure.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (13 preceding siblings ...)
  2018-03-10 18:17 ` [RFC v2 14/83] Add range node kmem cache Andiry Xu
@ 2018-03-10 18:17 ` Andiry Xu
  2018-03-10 18:17 ` [RFC v2 16/83] Initialize block map and free lists in nova_init() Andiry Xu
                   ` (68 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:17 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

The free list is the data structure that NOVA uses to manage free pmem blocks.
Each CPU has its own free list to avoid contention.
A free list keeps its free pmem blocks, represented as range nodes, in a red-black tree.
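
The per-CPU idea can be sketched in userspace as follows. This is only an illustration under stated assumptions, not the NOVA allocator: struct demo_free_list and the demo_* functions are hypothetical, the even split of blocks across CPUs is an assumption made for the example, a simple bump pointer replaces the red-black tree of free ranges, and locking is omitted.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified per-CPU free list. */
struct demo_free_list {
	int index;			/* which CPU owns this list */
	unsigned long block_start;	/* first block, inclusive */
	unsigned long block_end;	/* last block, inclusive */
	unsigned long next_free;	/* next unallocated block */
	unsigned long num_free_blocks;
};

/*
 * Create one list per CPU and split total_blocks evenly between them;
 * the last list also takes the remainder. Returns NULL on failure.
 */
static struct demo_free_list *demo_alloc_free_lists(int ncpus,
						    unsigned long total_blocks)
{
	struct demo_free_list *lists;
	unsigned long per_cpu;
	int i;

	if (ncpus <= 0 || total_blocks < (unsigned long)ncpus)
		return NULL;

	lists = calloc((size_t)ncpus, sizeof(*lists));
	if (!lists)
		return NULL;

	per_cpu = total_blocks / (unsigned long)ncpus;
	for (i = 0; i < ncpus; i++) {
		struct demo_free_list *fl = &lists[i];

		fl->index = i;
		fl->block_start = (unsigned long)i * per_cpu;
		fl->block_end = (i == ncpus - 1) ?
			total_blocks - 1 :
			(unsigned long)(i + 1) * per_cpu - 1;
		fl->next_free = fl->block_start;
		fl->num_free_blocks = fl->block_end - fl->block_start + 1;
	}
	return lists;
}

/*
 * Take num contiguous blocks from one CPU's list.
 * Returns the first block number, or -1 if the list cannot satisfy it.
 */
static long demo_new_blocks(struct demo_free_list *fl, unsigned long num)
{
	unsigned long first;

	if (num == 0 || num > fl->num_free_blocks)
		return -1;

	first = fl->next_free;
	fl->next_free += num;
	fl->num_free_blocks -= num;
	return (long)first;
}
```

Since each caller only touches the list indexed by its own CPU, allocations on different CPUs work on disjoint block ranges in this sketch and do not compete for the same state.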

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/Makefile |  2 +-
 fs/nova/balloc.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/balloc.h | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/nova.h   |  1 +
 fs/nova/super.c  | 11 ++++++++++
 fs/nova/super.h  |  4 ++++
 6 files changed, 141 insertions(+), 1 deletion(-)
 create mode 100644 fs/nova/balloc.c
 create mode 100644 fs/nova/balloc.h

diff --git a/fs/nova/Makefile b/fs/nova/Makefile
index 886356a..e2f7b07 100644
--- a/fs/nova/Makefile
+++ b/fs/nova/Makefile
@@ -4,4 +4,4 @@
 
 obj-$(CONFIG_NOVA_FS) += nova.o
 
-nova-y := bbuild.o inode.o rebuild.o stats.o super.o
+nova-y := balloc.o bbuild.o inode.o rebuild.o stats.o super.o
diff --git a/fs/nova/balloc.c b/fs/nova/balloc.c
new file mode 100644
index 0000000..450c942
--- /dev/null
+++ b/fs/nova/balloc.c
@@ -0,0 +1,58 @@
+/*
+ * NOVA persistent memory management
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/fs.h>
+#include <linux/bitops.h>
+#include "nova.h"
+#include "inode.h"
+
+int nova_alloc_block_free_lists(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	int i;
+
+	sbi->free_lists = kcalloc(sbi->cpus, sizeof(struct free_list),
+				  GFP_KERNEL);
+
+	if (!sbi->free_lists)
+		return -ENOMEM;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+		free_list->block_free_tree = RB_ROOT;
+		spin_lock_init(&free_list->s_lock);
+		free_list->index = i;
+	}
+
+	return 0;
+}
+
+void nova_delete_free_lists(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	/* Each tree is freed in save_blocknode_mappings */
+	kfree(sbi->free_lists);
+	sbi->free_lists = NULL;
+}
+
+
diff --git a/fs/nova/balloc.h b/fs/nova/balloc.h
new file mode 100644
index 0000000..e7c7a1d
--- /dev/null
+++ b/fs/nova/balloc.h
@@ -0,0 +1,66 @@
+#ifndef __BALLOC_H
+#define __BALLOC_H
+
+#include "inode.h"
+
+/* DRAM structure to hold a list of free PMEM blocks */
+struct free_list {
+	spinlock_t s_lock;
+	struct rb_root	block_free_tree;
+	struct nova_range_node *first_node; // lowest address free range
+	struct nova_range_node *last_node; // highest address free range
+
+	int		index; // Which CPU do I belong to?
+
+	/*
+	 * Start and end of allocatable range, inclusive.
+	 */
+	unsigned long	block_start;
+	unsigned long	block_end;
+
+	unsigned long	num_free_blocks;
+
+	/* How many nodes in the rb tree? */
+	unsigned long	num_blocknode;
+
+	u32		csum;		/* Protect integrity */
+
+	/* Statistics */
+	unsigned long	alloc_log_count;
+	unsigned long	alloc_data_count;
+	unsigned long	free_log_count;
+	unsigned long	free_data_count;
+	unsigned long	alloc_log_pages;
+	unsigned long	alloc_data_pages;
+	unsigned long	freed_log_pages;
+	unsigned long	freed_data_pages;
+
+	u64		padding[8];	/* Cache line break */
+};
+
+static inline
+struct free_list *nova_get_free_list(struct super_block *sb, int cpu)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	return &sbi->free_lists[cpu];
+}
+
+enum nova_alloc_direction {ALLOC_FROM_HEAD = 0,
+			   ALLOC_FROM_TAIL = 1};
+
+enum nova_alloc_init {ALLOC_NO_INIT = 0,
+		      ALLOC_INIT_ZERO = 1};
+
+enum alloc_type {
+	LOG = 1,
+	DATA,
+};
+
+
+
+
+int nova_alloc_block_free_lists(struct super_block *sb);
+void nova_delete_free_lists(struct super_block *sb);
+
+#endif
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index e0e85fb..c4abdd8 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -310,6 +310,7 @@ struct nova_range_node {
 };
 
 #include "bbuild.h"
+#include "balloc.h"
 
 /* ====================================================== */
 /* ==============  Function prototypes  ================= */
diff --git a/fs/nova/super.c b/fs/nova/super.c
index aec1cd3..43b24a7 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -545,6 +545,13 @@ static int nova_fill_super(struct super_block *sb, void *data, int silent)
 		goto out;
 	}
 
+	if (nova_alloc_block_free_lists(sb)) {
+		retval = -ENOMEM;
+		nova_err(sb, "%s: Failed to allocate block free lists.",
+			 __func__);
+		goto out;
+	}
+
 	/* Init a new nova instance */
 	if (sbi->s_mount_opt & NOVA_MOUNT_FORMAT) {
 		root_pi = nova_init(sb, sbi->initsize);
@@ -611,6 +618,8 @@ static int nova_fill_super(struct super_block *sb, void *data, int silent)
 	kfree(sbi->zeroed_page);
 	sbi->zeroed_page = NULL;
 
+	nova_delete_free_lists(sb);
+
 	kfree(sbi->nova_sb);
 	kfree(sbi);
 	nova_dbg("%s failed: return %d\n", __func__, retval);
@@ -679,6 +688,8 @@ static void nova_put_super(struct super_block *sb)
 		sbi->virt_addr = NULL;
 	}
 
+	nova_delete_free_lists(sb);
+
 	kfree(sbi->zeroed_page);
 	nova_dbgmask = 0;
 
diff --git a/fs/nova/super.h b/fs/nova/super.h
index b478080..dcafbd8 100644
--- a/fs/nova/super.h
+++ b/fs/nova/super.h
@@ -118,6 +118,10 @@ struct nova_sb_info {
 
 	/* ZEROED page for cache page initialized */
 	void *zeroed_page;
+
+	/* Per-CPU free block list */
+	struct free_list *free_lists;
+	unsigned long per_list_blocks;
 };
 
 static inline struct nova_sb_info *NOVA_SB(struct super_block *sb)
-- 
2.7.4


* [RFC v2 16/83] Initialize block map and free lists in nova_init().
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (14 preceding siblings ...)
  2018-03-10 18:17 ` [RFC v2 15/83] Add free list data structure Andiry Xu
@ 2018-03-10 18:17 ` Andiry Xu
  2018-03-11 12:12   ` Nikolay Borisov
  2018-03-10 18:17 ` [RFC v2 17/83] Add statfs support Andiry Xu
                   ` (67 subsequent siblings)
  83 siblings, 1 reply; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:17 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA divides the pmem block range equally among the per-CPU free lists
and formats each list's red-black tree by inserting its initial free range.
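
For readers who want to check the range arithmetic outside the kernel, here
is a minimal userspace sketch of the split performed by nova_init_free_list()
in this patch. The names blk_range, split_range() and range_blocks() are
hypothetical and exist only for this illustration; the reserved-block counts
are passed in as plain parameters instead of being read from nova_sb_info.

```c
#include <assert.h>

/* Hypothetical stand-in for the range fields of struct free_list. */
struct blk_range {
	unsigned long start;	/* first allocatable block, inclusive */
	unsigned long end;	/* last allocatable block, inclusive */
};

/*
 * Compute the block range owned by free list `index`, using the same
 * arithmetic as nova_init_free_list(): every list gets
 * num_blocks / cpus blocks, the first list gives up the head-reserved
 * blocks and the last list gives up the tail-reserved blocks.
 */
static struct blk_range split_range(unsigned long num_blocks, int cpus,
				    int index, unsigned long head_reserved,
				    unsigned long tail_reserved)
{
	struct blk_range r;
	unsigned long per_list_blocks;

	assert(cpus > 0 && index >= 0 && index < cpus);
	per_list_blocks = num_blocks / cpus;

	r.start = per_list_blocks * index;
	r.end = r.start + per_list_blocks - 1;
	if (index == 0)
		r.start += head_reserved;
	if (index == cpus - 1)
		r.end -= tail_reserved;
	return r;
}

/* Number of blocks in an inclusive range. */
static unsigned long range_blocks(struct blk_range r)
{
	return r.end - r.start + 1;
}
```

With 1000 blocks, 4 CPUs, 10 head-reserved and 5 tail-reserved blocks, list 0
covers blocks 10..249 and list 3 covers blocks 750..994. In this sketch, any
remainder of num_blocks / cpus is not assigned to a list.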

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/balloc.c | 161 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/balloc.h |  13 ++++-
 fs/nova/super.c  |   2 +
 3 files changed, 175 insertions(+), 1 deletion(-)

diff --git a/fs/nova/balloc.c b/fs/nova/balloc.c
index 450c942..cb627db 100644
--- a/fs/nova/balloc.c
+++ b/fs/nova/balloc.c
@@ -55,4 +55,165 @@ void nova_delete_free_lists(struct super_block *sb)
 	sbi->free_lists = NULL;
 }
 
+// Initialize a free list.  Each CPU gets an equal share of the block space to
+// manage.
+static void nova_init_free_list(struct super_block *sb,
+	struct free_list *free_list, int index)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	unsigned long per_list_blocks;
+
+	per_list_blocks = sbi->num_blocks / sbi->cpus;
+
+	free_list->block_start = per_list_blocks * index;
+	free_list->block_end = free_list->block_start +
+					per_list_blocks - 1;
+	if (index == 0)
+		free_list->block_start += sbi->head_reserved_blocks;
+	if (index == sbi->cpus - 1)
+		free_list->block_end -= sbi->tail_reserved_blocks;
+}
+
+inline struct nova_range_node *nova_alloc_blocknode(struct super_block *sb)
+{
+	return nova_alloc_range_node(sb);
+}
+
+inline void nova_free_blocknode(struct super_block *sb,
+	struct nova_range_node *node)
+{
+	nova_free_range_node(node);
+}
+
+void nova_init_blockmap(struct super_block *sb, int recovery)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct rb_root *tree;
+	struct nova_range_node *blknode;
+	struct free_list *free_list;
+	int i;
+	int ret;
+
+	/* Divide the block range among per-CPU free lists */
+	sbi->per_list_blocks = sbi->num_blocks / sbi->cpus;
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+		tree = &(free_list->block_free_tree);
+		nova_init_free_list(sb, free_list, i);
+
+		/* For recovery, update these fields later */
+		if (recovery == 0) {
+			free_list->num_free_blocks = free_list->block_end -
+						free_list->block_start + 1;
+
+			blknode = nova_alloc_blocknode(sb);
+			if (blknode == NULL)
+				return;
+			blknode->range_low = free_list->block_start;
+			blknode->range_high = free_list->block_end;
+			ret = nova_insert_blocktree(sbi, tree, blknode);
+			if (ret) {
+				nova_err(sb, "%s failed\n", __func__);
+				nova_free_blocknode(sb, blknode);
+				return;
+			}
+			free_list->first_node = blknode;
+			free_list->last_node = blknode;
+			free_list->num_blocknode = 1;
+		}
+
+		nova_dbgv("%s: free list %d: block start %lu, end %lu, %lu free blocks\n",
+			  __func__, i,
+			  free_list->block_start,
+			  free_list->block_end,
+			  free_list->num_free_blocks);
+	}
+}
+
+static inline int nova_rbtree_compare_rangenode(struct nova_range_node *curr,
+	unsigned long range_low)
+{
+	if (range_low < curr->range_low)
+		return -1;
+	if (range_low > curr->range_high)
+		return 1;
 
+	return 0;
+}
+
+int nova_find_range_node(struct nova_sb_info *sbi,
+	struct rb_root *tree, unsigned long range_low,
+	struct nova_range_node **ret_node)
+{
+	struct nova_range_node *curr = NULL;
+	struct rb_node *temp;
+	int compVal;
+	int ret = 0;
+
+	temp = tree->rb_node;
+
+	while (temp) {
+		curr = container_of(temp, struct nova_range_node, node);
+		compVal = nova_rbtree_compare_rangenode(curr, range_low);
+
+		if (compVal == -1) {
+			temp = temp->rb_left;
+		} else if (compVal == 1) {
+			temp = temp->rb_right;
+		} else {
+			ret = 1;
+			break;
+		}
+	}
+
+	*ret_node = curr;
+	return ret;
+}
+
+
+int nova_insert_range_node(struct rb_root *tree,
+	struct nova_range_node *new_node)
+{
+	struct nova_range_node *curr;
+	struct rb_node **temp, *parent;
+	int compVal;
+
+	temp = &(tree->rb_node);
+	parent = NULL;
+
+	while (*temp) {
+		curr = container_of(*temp, struct nova_range_node, node);
+		compVal = nova_rbtree_compare_rangenode(curr,
+					new_node->range_low);
+		parent = *temp;
+
+		if (compVal == -1) {
+			temp = &((*temp)->rb_left);
+		} else if (compVal == 1) {
+			temp = &((*temp)->rb_right);
+		} else {
+			nova_dbg("%s: entry %lu - %lu already exists: %lu - %lu\n",
+				 __func__, new_node->range_low,
+				new_node->range_high, curr->range_low,
+				curr->range_high);
+			return -EINVAL;
+		}
+	}
+
+	rb_link_node(&new_node->node, parent, temp);
+	rb_insert_color(&new_node->node, tree);
+
+	return 0;
+}
+
+inline int nova_insert_blocktree(struct nova_sb_info *sbi,
+	struct rb_root *tree, struct nova_range_node *new_node)
+{
+	int ret;
+
+	ret = nova_insert_range_node(tree, new_node);
+	if (ret)
+		nova_dbg("ERROR: %s failed %d\n", __func__, ret);
+
+	return ret;
+}
diff --git a/fs/nova/balloc.h b/fs/nova/balloc.h
index e7c7a1d..57a93e4 100644
--- a/fs/nova/balloc.h
+++ b/fs/nova/balloc.h
@@ -62,5 +62,16 @@ enum alloc_type {
 
 int nova_alloc_block_free_lists(struct super_block *sb);
 void nova_delete_free_lists(struct super_block *sb);
-
+inline struct nova_range_node *nova_alloc_blocknode(struct super_block *sb);
+inline void nova_free_blocknode(struct super_block *sb,
+	struct nova_range_node *bnode);
+extern void nova_init_blockmap(struct super_block *sb, int recovery);
+inline int nova_insert_blocktree(struct nova_sb_info *sbi,
+	struct rb_root *tree, struct nova_range_node *new_node);
+
+extern int nova_insert_range_node(struct rb_root *tree,
+				  struct nova_range_node *new_node);
+extern int nova_find_range_node(struct nova_sb_info *sbi,
+				struct rb_root *tree, unsigned long range_low,
+				struct nova_range_node **ret_node);
 #endif
diff --git a/fs/nova/super.c b/fs/nova/super.c
index 43b24a7..9762f26 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -376,6 +376,8 @@ static struct nova_inode *nova_init(struct super_block *sb,
 	pi->nova_ino = NOVA_BLOCKNODE_INO;
 	nova_flush_buffer(pi, CACHELINE_SIZE, 1);
 
+	nova_init_blockmap(sb, 0);
+
 	sbi->nova_sb->s_size = cpu_to_le64(size);
 	sbi->nova_sb->s_blocksize = cpu_to_le32(blocksize);
 	sbi->nova_sb->s_magic = cpu_to_le32(NOVA_SUPER_MAGIC);
-- 
2.7.4


* [RFC v2 17/83] Add statfs support.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (15 preceding siblings ...)
  2018-03-10 18:17 ` [RFC v2 16/83] Initialize block map and free lists in nova_init() Andiry Xu
@ 2018-03-10 18:17 ` Andiry Xu
  2018-03-10 18:17 ` [RFC v2 18/83] Add freelist statistics printing Andiry Xu
                   ` (66 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:17 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/balloc.c | 18 ++++++++++++++++++
 fs/nova/balloc.h |  1 +
 fs/nova/super.c  | 19 +++++++++++++++++++
 3 files changed, 38 insertions(+)

diff --git a/fs/nova/balloc.c b/fs/nova/balloc.c
index cb627db..0742fe0 100644
--- a/fs/nova/balloc.c
+++ b/fs/nova/balloc.c
@@ -217,3 +217,21 @@ inline int nova_insert_blocktree(struct nova_sb_info *sbi,
 
 	return ret;
 }
+
+/* We do not take locks so it's inaccurate */
+unsigned long nova_count_free_blocks(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	unsigned long num_free_blocks = 0;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+		num_free_blocks += free_list->num_free_blocks;
+	}
+
+	return num_free_blocks;
+}
+
+
diff --git a/fs/nova/balloc.h b/fs/nova/balloc.h
index 57a93e4..537532e 100644
--- a/fs/nova/balloc.h
+++ b/fs/nova/balloc.h
@@ -66,6 +66,7 @@ inline struct nova_range_node *nova_alloc_blocknode(struct super_block *sb);
 inline void nova_free_blocknode(struct super_block *sb,
 	struct nova_range_node *bnode);
 extern void nova_init_blockmap(struct super_block *sb, int recovery);
+extern unsigned long nova_count_free_blocks(struct super_block *sb);
 inline int nova_insert_blocktree(struct nova_sb_info *sbi,
 	struct rb_root *tree, struct nova_range_node *new_node);
 
diff --git a/fs/nova/super.c b/fs/nova/super.c
index 9762f26..3500d19 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -629,6 +629,24 @@ static int nova_fill_super(struct super_block *sb, void *data, int silent)
 	return retval;
 }
 
+static int nova_statfs(struct dentry *d, struct kstatfs *buf)
+{
+	struct super_block *sb = d->d_sb;
+	struct nova_sb_info *sbi = (struct nova_sb_info *)sb->s_fs_info;
+
+	buf->f_type = NOVA_SUPER_MAGIC;
+	buf->f_bsize = sb->s_blocksize;
+
+	buf->f_blocks = sbi->num_blocks;
+	buf->f_bfree = buf->f_bavail = nova_count_free_blocks(sb);
+	buf->f_files = LONG_MAX;
+	buf->f_ffree = LONG_MAX - sbi->s_inodes_used_count;
+	buf->f_namelen = NOVA_NAME_LEN;
+	nova_dbg_verbose("nova_stats: total 4k free blocks 0x%llx\n",
+		buf->f_bfree);
+	return 0;
+}
+
 static int nova_show_options(struct seq_file *seq, struct dentry *root)
 {
 	struct nova_sb_info *sbi = NOVA_SB(root->d_sb);
@@ -794,6 +812,7 @@ static struct super_operations nova_sops = {
 	.alloc_inode	= nova_alloc_inode,
 	.destroy_inode	= nova_destroy_inode,
 	.put_super	= nova_put_super,
+	.statfs		= nova_statfs,
 	.remount_fs	= nova_remount,
 	.show_options	= nova_show_options,
 };
-- 
2.7.4


* [RFC v2 18/83] Add freelist statistics printing.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (16 preceding siblings ...)
  2018-03-10 18:17 ` [RFC v2 17/83] Add statfs support Andiry Xu
@ 2018-03-10 18:17 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 19/83] Add pmem block free routines Andiry Xu
                   ` (65 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:17 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/nova.h  |   1 +
 fs/nova/stats.c | 103 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 104 insertions(+)

diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index c4abdd8..404e133 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -326,5 +326,6 @@ void nova_get_IO_stats(void);
 void nova_print_timing_stats(struct super_block *sb);
 void nova_clear_stats(struct super_block *sb);
 void nova_print_inode(struct nova_inode *pi);
+void nova_print_free_lists(struct super_block *sb);
 
 #endif /* __NOVA_H */
diff --git a/fs/nova/stats.c b/fs/nova/stats.c
index 4b7c317..9ddd267 100644
--- a/fs/nova/stats.c
+++ b/fs/nova/stats.c
@@ -128,6 +128,61 @@ DEFINE_PER_CPU(u64[TIMING_NUM], Countstats_percpu);
 u64 IOstats[STATS_NUM];
 DEFINE_PER_CPU(u64[STATS_NUM], IOstats_percpu);
 
+static void nova_print_alloc_stats(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	unsigned long alloc_log_count = 0;
+	unsigned long alloc_log_pages = 0;
+	unsigned long alloc_data_count = 0;
+	unsigned long alloc_data_pages = 0;
+	unsigned long free_log_count = 0;
+	unsigned long freed_log_pages = 0;
+	unsigned long free_data_count = 0;
+	unsigned long freed_data_pages = 0;
+	int i;
+
+	nova_info("=========== NOVA allocation stats ===========\n");
+	nova_info("Alloc %llu, alloc steps %llu, average %llu\n",
+		Countstats[new_data_blocks_t], IOstats[alloc_steps],
+		Countstats[new_data_blocks_t] ?
+			IOstats[alloc_steps] / Countstats[new_data_blocks_t]
+			: 0);
+	nova_info("Free %llu\n", Countstats[free_data_t]);
+	nova_info("Fast GC %llu, check pages %llu, free pages %llu, average %llu\n",
+		Countstats[fast_gc_t], IOstats[fast_checked_pages],
+		IOstats[fast_gc_pages], Countstats[fast_gc_t] ?
+			IOstats[fast_gc_pages] / Countstats[fast_gc_t] : 0);
+	nova_info("Thorough GC %llu, checked pages %llu, free pages %llu, average %llu\n",
+		Countstats[thorough_gc_t],
+		IOstats[thorough_checked_pages], IOstats[thorough_gc_pages],
+		Countstats[thorough_gc_t] ?
+			IOstats[thorough_gc_pages] / Countstats[thorough_gc_t]
+			: 0);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+
+		alloc_log_count += free_list->alloc_log_count;
+		alloc_log_pages += free_list->alloc_log_pages;
+		alloc_data_count += free_list->alloc_data_count;
+		alloc_data_pages += free_list->alloc_data_pages;
+		free_log_count += free_list->free_log_count;
+		freed_log_pages += free_list->freed_log_pages;
+		free_data_count += free_list->free_data_count;
+		freed_data_pages += free_list->freed_data_pages;
+	}
+
+	nova_info("alloc log count %lu, allocated log pages %lu, "
+		"alloc data count %lu, allocated data pages %lu, "
+		"free log count %lu, freed log pages %lu, "
+		"free data count %lu, freed data pages %lu\n",
+		alloc_log_count, alloc_log_pages,
+		alloc_data_count, alloc_data_pages,
+		free_log_count, freed_log_pages,
+		free_data_count, freed_data_pages);
+}
+
 static void nova_print_IO_stats(struct super_block *sb)
 {
 	nova_info("=========== NOVA I/O stats ===========\n");
@@ -209,6 +264,7 @@ void nova_print_timing_stats(struct super_block *sb)
 	}
 
 	nova_info("\n");
+	nova_print_alloc_stats(sb);
 	nova_print_IO_stats(sb);
 }
 
@@ -229,6 +285,8 @@ static void nova_clear_timing_stats(void)
 
 static void nova_clear_IO_stats(struct super_block *sb)
 {
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
 	int i;
 	int cpu;
 
@@ -237,6 +295,19 @@ static void nova_clear_IO_stats(struct super_block *sb)
 		for_each_possible_cpu(cpu)
 			per_cpu(IOstats_percpu[i], cpu) = 0;
 	}
+
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+
+		free_list->alloc_log_count = 0;
+		free_list->alloc_log_pages = 0;
+		free_list->alloc_data_count = 0;
+		free_list->alloc_data_pages = 0;
+		free_list->free_log_count = 0;
+		free_list->freed_log_pages = 0;
+		free_list->free_data_count = 0;
+		free_list->freed_data_pages = 0;
+	}
 }
 
 void nova_clear_stats(struct super_block *sb)
@@ -261,3 +332,35 @@ void nova_print_inode(struct nova_inode *pi)
 	nova_dbg("create epoch id %llu, delete epoch id %llu\n",
 		pi->create_epoch_id, pi->delete_epoch_id);
 }
+
+void nova_print_free_lists(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	int i;
+
+	nova_dbg("======== NOVA per-CPU free list allocation stats ========\n");
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+		nova_dbg("Free list %d: block start %lu, block end %lu, "
+			"num_blocks %lu, num_free_blocks %lu, blocknode %lu\n",
+			i, free_list->block_start, free_list->block_end,
+			free_list->block_end - free_list->block_start + 1,
+			free_list->num_free_blocks, free_list->num_blocknode);
+
+		nova_dbg("Free list %d: alloc log count %lu, "
+			"allocated log pages %lu, alloc data count %lu, "
+			"allocated data pages %lu, free log count %lu, "
+			"freed log pages %lu, free data count %lu, "
+			"freed data pages %lu\n",
+			i,
+			free_list->alloc_log_count,
+			free_list->alloc_log_pages,
+			free_list->alloc_data_count,
+			free_list->alloc_data_pages,
+			free_list->free_log_count,
+			free_list->freed_log_pages,
+			free_list->free_data_count,
+			free_list->freed_data_pages);
+	}
+}
-- 
2.7.4


* [RFC v2 19/83] Add pmem block free routines.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (17 preceding siblings ...)
  2018-03-10 18:17 ` [RFC v2 18/83] Add freelist statistics printing Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 20/83] Pmem block allocation routines Andiry Xu
                   ` (64 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA allocates/frees log pages and data pages in the same way.
To free blocks, NOVA first finds the corresponding free list from
the block number, and then inserts the freed range into that list's
red-black tree. NOVA always merges adjacent free ranges if possible.
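
The four merge cases handled by nova_free_blocks() can be illustrated with a
small userspace sketch. To keep it self-contained it uses a sorted array
instead of the kernel red-black tree, so it is not the code in this patch;
free_ranges, blk_range, MAX_RANGES and free_range() are hypothetical names
chosen for the example.

```c
#include <assert.h>
#include <string.h>

#define MAX_RANGES 16

struct blk_range {
	unsigned long low, high;	/* inclusive block numbers */
};

/* Free ranges kept sorted by `low` and never overlapping. */
struct free_ranges {
	struct blk_range r[MAX_RANGES];
	int n;
};

/* Return 0 on success, -1 on overlap or when the array is full. */
static int free_range(struct free_ranges *fl, unsigned long low,
		      unsigned long high)
{
	int next, merge_prev, merge_next;

	assert(low <= high);

	/* `next` is the first range above the freed one; prev is next - 1 */
	for (next = 0; next < fl->n; next++)
		if (fl->r[next].low > high)
			break;

	/* The freed range must not overlap an existing free range */
	if (next > 0 && fl->r[next - 1].high >= low)
		return -1;

	merge_prev = next > 0 && fl->r[next - 1].high + 1 == low;
	merge_next = next < fl->n && high + 1 == fl->r[next].low;

	if (merge_prev && merge_next) {
		/* Fits the hole: extend prev over next and drop next */
		fl->r[next - 1].high = fl->r[next].high;
		memmove(&fl->r[next], &fl->r[next + 1],
			(fl->n - next - 1) * sizeof(fl->r[0]));
		fl->n--;
	} else if (merge_prev) {
		/* Aligns left: grow prev upwards */
		fl->r[next - 1].high = high;
	} else if (merge_next) {
		/* Aligns right: grow next downwards */
		fl->r[next].low = low;
	} else {
		/* No neighbour is adjacent: insert a new range */
		if (fl->n == MAX_RANGES)
			return -1;
		memmove(&fl->r[next + 1], &fl->r[next],
			(fl->n - next) * sizeof(fl->r[0]));
		fl->r[next].low = low;
		fl->r[next].high = high;
		fl->n++;
	}
	return 0;
}
```

Only the last case needs a new node, which is why the patch pre-allocates a
blocknode before taking the lock and frees it again if it was not used.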

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/balloc.c | 223 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/balloc.h |   8 ++
 fs/nova/nova.h   |  23 ++++++
 3 files changed, 254 insertions(+)

diff --git a/fs/nova/balloc.c b/fs/nova/balloc.c
index 0742fe0..9108721 100644
--- a/fs/nova/balloc.c
+++ b/fs/nova/balloc.c
@@ -218,6 +218,229 @@ inline int nova_insert_blocktree(struct nova_sb_info *sbi,
 	return ret;
 }
 
+/* Used for both block free tree and inode inuse tree */
+int nova_find_free_slot(struct nova_sb_info *sbi,
+	struct rb_root *tree, unsigned long range_low,
+	unsigned long range_high, struct nova_range_node **prev,
+	struct nova_range_node **next)
+{
+	struct nova_range_node *ret_node = NULL;
+	struct rb_node *tmp;
+	int check_prev = 0, check_next = 0;
+	int ret;
+
+	ret = nova_find_range_node(sbi, tree, range_low, &ret_node);
+	if (ret) {
+		nova_dbg("%s ERROR: %lu - %lu already in free list\n",
+			__func__, range_low, range_high);
+		return -EINVAL;
+	}
+
+	if (!ret_node) {
+		*prev = *next = NULL;
+	} else if (ret_node->range_high < range_low) {
+		*prev = ret_node;
+		tmp = rb_next(&ret_node->node);
+		if (tmp) {
+			*next = container_of(tmp, struct nova_range_node, node);
+			check_next = 1;
+		} else {
+			*next = NULL;
+		}
+	} else if (ret_node->range_low > range_high) {
+		*next = ret_node;
+		tmp = rb_prev(&ret_node->node);
+		if (tmp) {
+			*prev = container_of(tmp, struct nova_range_node, node);
+			check_prev = 1;
+		} else {
+			*prev = NULL;
+		}
+	} else {
+		nova_dbg("%s ERROR: %lu - %lu overlaps with existing node %lu - %lu\n",
+			 __func__, range_low, range_high, ret_node->range_low,
+			ret_node->range_high);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/*
+ * blocknr: start block number
+ * num: number of freed pages
+ * btype: is large page?
+ * log_page: is log page?
+ */
+static int nova_free_blocks(struct super_block *sb, unsigned long blocknr,
+	int num, unsigned short btype, int log_page)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct rb_root *tree;
+	unsigned long block_low;
+	unsigned long block_high;
+	unsigned long num_blocks = 0;
+	struct nova_range_node *prev = NULL;
+	struct nova_range_node *next = NULL;
+	struct nova_range_node *curr_node;
+	struct free_list *free_list;
+	int cpuid;
+	int new_node_used = 0;
+	int ret;
+	timing_t free_time;
+
+	if (num <= 0) {
+		nova_dbg("%s ERROR: free %d\n", __func__, num);
+		return -EINVAL;
+	}
+
+	NOVA_START_TIMING(free_blocks_t, free_time);
+	cpuid = blocknr / sbi->per_list_blocks;
+
+	/* Pre-allocate blocknode */
+	curr_node = nova_alloc_blocknode(sb);
+	if (curr_node == NULL) {
+		/* returning without freeing the block*/
+		NOVA_END_TIMING(free_blocks_t, free_time);
+		return -ENOMEM;
+	}
+
+	free_list = nova_get_free_list(sb, cpuid);
+	spin_lock(&free_list->s_lock);
+
+	tree = &(free_list->block_free_tree);
+
+	num_blocks = nova_get_numblocks(btype) * num;
+	block_low = blocknr;
+	block_high = blocknr + num_blocks - 1;
+
+	nova_dbgv("Free: %lu - %lu\n", block_low, block_high);
+
+	if (block_low < free_list->block_start ||
+			block_high > free_list->block_end) {
+		nova_err(sb, "free blocks %lu to %lu, free list %d, start %lu, end %lu\n",
+				block_low, block_high,
+				free_list->index,
+				free_list->block_start,
+				free_list->block_end);
+		ret = -EIO;
+		goto out;
+	}
+
+	ret = nova_find_free_slot(sbi, tree, block_low,
+					block_high, &prev, &next);
+
+	if (ret) {
+		nova_dbg("%s: find free slot fail: %d\n", __func__, ret);
+		goto out;
+	}
+
+	if (prev && next && (block_low == prev->range_high + 1) &&
+			(block_high + 1 == next->range_low)) {
+		/* fits the hole */
+		rb_erase(&next->node, tree);
+		free_list->num_blocknode--;
+		prev->range_high = next->range_high;
+		if (free_list->last_node == next)
+			free_list->last_node = prev;
+		nova_free_blocknode(sb, next);
+		goto block_found;
+	}
+	if (prev && (block_low == prev->range_high + 1)) {
+		/* Aligns left */
+		prev->range_high += num_blocks;
+		goto block_found;
+	}
+	if (next && (block_high + 1 == next->range_low)) {
+		/* Aligns right */
+		next->range_low -= num_blocks;
+		goto block_found;
+	}
+
+	/* Aligns somewhere in the middle */
+	curr_node->range_low = block_low;
+	curr_node->range_high = block_high;
+	new_node_used = 1;
+	ret = nova_insert_blocktree(sbi, tree, curr_node);
+	if (ret) {
+		new_node_used = 0;
+		goto out;
+	}
+	if (!prev)
+		free_list->first_node = curr_node;
+	if (!next)
+		free_list->last_node = curr_node;
+
+	free_list->num_blocknode++;
+
+block_found:
+	free_list->num_free_blocks += num_blocks;
+
+	if (log_page) {
+		free_list->free_log_count++;
+		free_list->freed_log_pages += num_blocks;
+	} else {
+		free_list->free_data_count++;
+		free_list->freed_data_pages += num_blocks;
+	}
+
+out:
+	spin_unlock(&free_list->s_lock);
+	if (new_node_used == 0)
+		nova_free_blocknode(sb, curr_node);
+
+	NOVA_END_TIMING(free_blocks_t, free_time);
+	return ret;
+}
+
+int nova_free_data_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr, int num)
+{
+	int ret;
+	timing_t free_time;
+
+	nova_dbgv("Inode %lu: free %d data block from %lu to %lu\n",
+			sih->ino, num, blocknr, blocknr + num - 1);
+	if (blocknr == 0) {
+		nova_dbg("%s: ERROR: %lu, %d\n", __func__, blocknr, num);
+		return -EINVAL;
+	}
+	NOVA_START_TIMING(free_data_t, free_time);
+	ret = nova_free_blocks(sb, blocknr, num, sih->i_blk_type, 0);
+	if (ret) {
+		nova_err(sb, "Inode %lu: free %d data block from %lu to %lu failed!\n",
+			 sih->ino, num, blocknr, blocknr + num - 1);
+		dump_stack();
+	}
+	NOVA_END_TIMING(free_data_t, free_time);
+
+	return ret;
+}
+
+int nova_free_log_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr, int num)
+{
+	int ret;
+	timing_t free_time;
+
+	nova_dbgv("Inode %lu: free %d log block from %lu to %lu\n",
+			sih->ino, num, blocknr, blocknr + num - 1);
+	if (blocknr == 0) {
+		nova_dbg("%s: ERROR: %lu, %d\n", __func__, blocknr, num);
+		return -EINVAL;
+	}
+	NOVA_START_TIMING(free_log_t, free_time);
+	ret = nova_free_blocks(sb, blocknr, num, sih->i_blk_type, 1);
+	if (ret) {
+		nova_err(sb, "Inode %lu: free %d log block from %lu to %lu failed!\n",
+			 sih->ino, num, blocknr, blocknr + num - 1);
+		dump_stack();
+	}
+	NOVA_END_TIMING(free_log_t, free_time);
+
+	return ret;
+}
+
 /* We do not take locks so it's inaccurate */
 unsigned long nova_count_free_blocks(struct super_block *sb)
 {
diff --git a/fs/nova/balloc.h b/fs/nova/balloc.h
index 537532e..249eb72 100644
--- a/fs/nova/balloc.h
+++ b/fs/nova/balloc.h
@@ -69,6 +69,14 @@ extern void nova_init_blockmap(struct super_block *sb, int recovery);
 extern unsigned long nova_count_free_blocks(struct super_block *sb);
 inline int nova_insert_blocktree(struct nova_sb_info *sbi,
 	struct rb_root *tree, struct nova_range_node *new_node);
+extern int nova_free_data_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr, int num);
+extern int nova_free_log_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr, int num);
+int nova_find_free_slot(struct nova_sb_info *sbi,
+	struct rb_root *tree, unsigned long range_low,
+	unsigned long range_high, struct nova_range_node **prev,
+	struct nova_range_node **next);
 
 extern int nova_insert_range_node(struct rb_root *tree,
 				  struct nova_range_node *new_node);
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index 404e133..0992f50 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -312,6 +312,29 @@ struct nova_range_node {
 #include "bbuild.h"
 #include "balloc.h"
 
+static inline unsigned long
+nova_get_numblocks(unsigned short btype)
+{
+	unsigned long num_blocks;
+
+	if (btype == NOVA_BLOCK_TYPE_4K) {
+		num_blocks = 1;
+	} else if (btype == NOVA_BLOCK_TYPE_2M) {
+		num_blocks = 512;
+	} else {
+		//btype == NOVA_BLOCK_TYPE_1G
+		num_blocks = 0x40000;
+	}
+	return num_blocks;
+}
+
+static inline unsigned long
+nova_get_blocknr(struct super_block *sb, u64 block, unsigned short btype)
+{
+	return block >> PAGE_SHIFT;
+}
+
+
 /* ====================================================== */
 /* ==============  Function prototypes  ================= */
 /* ====================================================== */
-- 
2.7.4


* [RFC v2 20/83] Pmem block allocation routines.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (18 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 19/83] Add pmem block free routines Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 21/83] Add log structure Andiry Xu
                   ` (63 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Upon an allocation request, NOVA first tries the free list of the current CPU.
If that list does not have enough blocks, NOVA falls back to the
free list with the most free blocks.
The caller can specify the allocation direction: from the low address or from
the high address.
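
As a rough illustration of the policy described above, the userspace sketch
below shows the list selection and the head/tail carving of a free range.
It is an assumption-laden simplification, not the code in this patch:
pick_free_list(), carve(), blk_range and the alloc_direction enum are
hypothetical names, and the per-list free counts are passed in as a plain
array instead of being read from struct free_list.

```c
#include <assert.h>

enum alloc_direction { FROM_HEAD = 0, FROM_TAIL = 1 };

struct blk_range {
	unsigned long low, high;	/* inclusive block numbers */
};

/*
 * Choose a free list: the current CPU's list if it holds enough free
 * blocks, otherwise the list with the most free blocks.
 */
static int pick_free_list(const unsigned long *num_free, int cpus,
			  int cur_cpu, unsigned long num_blocks)
{
	int i, best = 0;

	assert(cpus > 0 && cur_cpu >= 0 && cur_cpu < cpus);
	if (num_free[cur_cpu] >= num_blocks)
		return cur_cpu;

	for (i = 1; i < cpus; i++)
		if (num_free[i] > num_free[best])
			best = i;
	return best;
}

/*
 * Carve num_blocks out of a free range that is strictly larger than the
 * request, from the low or the high end, and return the first allocated
 * block number. The remaining range shrinks accordingly.
 */
static unsigned long carve(struct blk_range *r, unsigned long num_blocks,
			   enum alloc_direction dir)
{
	unsigned long blocknr;

	assert(num_blocks > 0 && num_blocks < r->high - r->low + 1);
	if (dir == FROM_HEAD) {
		blocknr = r->low;
		r->low += num_blocks;
	} else {
		blocknr = r->high + 1 - num_blocks;
		r->high -= num_blocks;
	}
	return blocknr;
}
```

The carving step mirrors the "Allocate partial blocknode" branch of
nova_alloc_blocks_in_free_list(); the case where the request consumes a
whole node is left out of the sketch.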

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/balloc.c | 270 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/balloc.h |  10 +++
 2 files changed, 280 insertions(+)

diff --git a/fs/nova/balloc.c b/fs/nova/balloc.c
index 9108721..8e99215 100644
--- a/fs/nova/balloc.c
+++ b/fs/nova/balloc.c
@@ -441,6 +441,276 @@ int nova_free_log_blocks(struct super_block *sb,
 	return ret;
 }
 
+static int not_enough_blocks(struct free_list *free_list,
+	unsigned long num_blocks, enum alloc_type atype)
+{
+	struct nova_range_node *first = free_list->first_node;
+	struct nova_range_node *last = free_list->last_node;
+
+	if (free_list->num_free_blocks < num_blocks || !first || !last) {
+		nova_dbgv("%s: num_free_blocks=%lu; num_blocks=%lu; first=0x%p; last=0x%p\n",
+			  __func__, free_list->num_free_blocks, num_blocks,
+			  first, last);
+		return 1;
+	}
+
+	return 0;
+}
+
+/* Return how many blocks allocated */
+static long nova_alloc_blocks_in_free_list(struct super_block *sb,
+	struct free_list *free_list, unsigned short btype,
+	enum alloc_type atype, unsigned long num_blocks,
+	unsigned long *new_blocknr, enum nova_alloc_direction from_tail)
+{
+	struct rb_root *tree;
+	struct nova_range_node *curr, *next = NULL, *prev = NULL;
+	struct rb_node *temp, *next_node, *prev_node;
+	unsigned long curr_blocks;
+	bool found = 0;
+	unsigned long step = 0;
+
+	if (!free_list->first_node || free_list->num_free_blocks == 0) {
+		nova_dbgv("%s: Can't alloc. free_list->first_node=0x%p free_list->num_free_blocks = %lu",
+			  __func__, free_list->first_node,
+			  free_list->num_free_blocks);
+		return -ENOSPC;
+	}
+
+	if (atype == LOG && not_enough_blocks(free_list, num_blocks, atype)) {
+		nova_dbgv("%s: Can't alloc.  not_enough_blocks() == true",
+			  __func__);
+		return -ENOSPC;
+	}
+
+	tree = &(free_list->block_free_tree);
+	if (from_tail == ALLOC_FROM_HEAD)
+		temp = &(free_list->first_node->node);
+	else
+		temp = &(free_list->last_node->node);
+
+	while (temp) {
+		step++;
+		curr = container_of(temp, struct nova_range_node, node);
+
+		curr_blocks = curr->range_high - curr->range_low + 1;
+
+		if (num_blocks >= curr_blocks) {
+			/* Superpage allocation must succeed */
+			if (btype > 0 && num_blocks > curr_blocks)
+				goto next;
+
+			/* Otherwise, allocate the whole blocknode */
+			if (curr == free_list->first_node) {
+				next_node = rb_next(temp);
+				if (next_node)
+					next = container_of(next_node,
+						struct nova_range_node, node);
+				free_list->first_node = next;
+			}
+
+			if (curr == free_list->last_node) {
+				prev_node = rb_prev(temp);
+				if (prev_node)
+					prev = container_of(prev_node,
+						struct nova_range_node, node);
+				free_list->last_node = prev;
+			}
+
+			rb_erase(&curr->node, tree);
+			free_list->num_blocknode--;
+			num_blocks = curr_blocks;
+			*new_blocknr = curr->range_low;
+			nova_free_blocknode(sb, curr);
+			found = 1;
+			break;
+		}
+
+		/* Allocate partial blocknode */
+		if (from_tail == ALLOC_FROM_HEAD) {
+			*new_blocknr = curr->range_low;
+			curr->range_low += num_blocks;
+		} else {
+			*new_blocknr = curr->range_high + 1 - num_blocks;
+			curr->range_high -= num_blocks;
+		}
+
+		found = 1;
+		break;
+next:
+		if (from_tail == ALLOC_FROM_HEAD)
+			temp = rb_next(temp);
+		else
+			temp = rb_prev(temp);
+	}
+
+	if (free_list->num_free_blocks < num_blocks) {
+		nova_dbg("%s: free list %d has %lu free blocks, but allocated %lu blocks?\n",
+				__func__, free_list->index,
+				free_list->num_free_blocks, num_blocks);
+		return -ENOSPC;
+	}
+
+	if (found == 1)
+		free_list->num_free_blocks -= num_blocks;
+	else {
+		nova_dbgv("%s: Can't alloc.  found = %d", __func__, found);
+		return -ENOSPC;
+	}
+
+	NOVA_STATS_ADD(alloc_steps, step);
+
+	return num_blocks;
+}
+
+/* Find out the free list with most free blocks */
+static int nova_get_candidate_free_list(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	int cpuid = 0;
+	int num_free_blocks = 0;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+		if (free_list->num_free_blocks > num_free_blocks) {
+			cpuid = i;
+			num_free_blocks = free_list->num_free_blocks;
+		}
+	}
+
+	return cpuid;
+}
+
+static int nova_new_blocks(struct super_block *sb, unsigned long *blocknr,
+	unsigned int num, unsigned short btype, int zero,
+	enum alloc_type atype, int cpuid, enum nova_alloc_direction from_tail)
+{
+	struct free_list *free_list;
+	void *bp;
+	unsigned long num_blocks = 0;
+	unsigned long new_blocknr = 0;
+	long ret_blocks = 0;
+	int retried = 0;
+	timing_t alloc_time;
+
+	num_blocks = num * nova_get_numblocks(btype);
+	if (num_blocks == 0) {
+		nova_dbg_verbose("%s: num_blocks == 0", __func__);
+		return -EINVAL;
+	}
+
+	NOVA_START_TIMING(new_blocks_t, alloc_time);
+	if (cpuid == ANY_CPU)
+		cpuid = smp_processor_id();
+
+retry:
+	free_list = nova_get_free_list(sb, cpuid);
+	spin_lock(&free_list->s_lock);
+
+	if (not_enough_blocks(free_list, num_blocks, atype)) {
+		nova_dbgv("%s: cpu %d, free_blocks %lu, required %lu, blocknode %lu\n",
+			  __func__, cpuid, free_list->num_free_blocks,
+			  num_blocks, free_list->num_blocknode);
+
+		if (retried >= 2)
+			/* Allocate anyway */
+			goto alloc;
+
+		spin_unlock(&free_list->s_lock);
+		cpuid = nova_get_candidate_free_list(sb);
+		retried++;
+		goto retry;
+	}
+alloc:
+	ret_blocks = nova_alloc_blocks_in_free_list(sb, free_list, btype, atype,
+					num_blocks, &new_blocknr, from_tail);
+
+	if (ret_blocks > 0) {
+		if (atype == LOG) {
+			free_list->alloc_log_count++;
+			free_list->alloc_log_pages += ret_blocks;
+		} else if (atype == DATA) {
+			free_list->alloc_data_count++;
+			free_list->alloc_data_pages += ret_blocks;
+		}
+	}
+
+	spin_unlock(&free_list->s_lock);
+	NOVA_END_TIMING(new_blocks_t, alloc_time);
+
+	if (ret_blocks <= 0 || new_blocknr == 0) {
+		nova_dbg_verbose("%s: not able to allocate %d blocks.  ret_blocks=%ld; new_blocknr=%lu",
+				 __func__, num, ret_blocks, new_blocknr);
+		return -ENOSPC;
+	}
+
+	if (zero) {
+		bp = nova_get_block(sb, nova_get_block_off(sb,
+						new_blocknr, btype));
+		memset_nt(bp, 0, PAGE_SIZE * ret_blocks);
+	}
+	*blocknr = new_blocknr;
+
+	nova_dbg_verbose("Alloc %lu NVMM blocks 0x%lx\n", ret_blocks, *blocknr);
+	return ret_blocks / nova_get_numblocks(btype);
+}
+
+// Allocate data blocks.  The offset for the allocated block comes back in
+// blocknr.  Return the number of blocks allocated.
+inline int nova_new_data_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long *blocknr,
+	unsigned long start_blk, unsigned int num,
+	enum nova_alloc_init zero, int cpu,
+	enum nova_alloc_direction from_tail)
+{
+	int allocated;
+	timing_t alloc_time;
+
+	NOVA_START_TIMING(new_data_blocks_t, alloc_time);
+	allocated = nova_new_blocks(sb, blocknr, num,
+			    sih->i_blk_type, zero, DATA, cpu, from_tail);
+	NOVA_END_TIMING(new_data_blocks_t, alloc_time);
+	if (allocated < 0) {
+		nova_dbgv("FAILED: Inode %lu, start blk %lu, alloc %d data blocks from %lu to %lu\n",
+			  sih->ino, start_blk, allocated, *blocknr,
+			  *blocknr + allocated - 1);
+	} else {
+		nova_dbgv("Inode %lu, start blk %lu, alloc %d data blocks from %lu to %lu\n",
+			  sih->ino, start_blk, allocated, *blocknr,
+			  *blocknr + allocated - 1);
+	}
+	return allocated;
+}
+
+
+// Allocate log blocks. The offset for the allocated block comes back in
+// blocknr.  Return the number of blocks allocated.
+inline int nova_new_log_blocks(struct super_block *sb,
+			struct nova_inode_info_header *sih,
+			unsigned long *blocknr, unsigned int num,
+			enum nova_alloc_init zero, int cpu,
+			enum nova_alloc_direction from_tail)
+{
+	int allocated;
+	timing_t alloc_time;
+
+	NOVA_START_TIMING(new_log_blocks_t, alloc_time);
+	allocated = nova_new_blocks(sb, blocknr, num,
+			    sih->i_blk_type, zero, LOG, cpu, from_tail);
+	NOVA_END_TIMING(new_log_blocks_t, alloc_time);
+	if (allocated < 0) {
+		nova_dbgv("%s: ino %lu, failed to alloc %d log blocks",
+			  __func__, sih->ino, num);
+	} else {
+		nova_dbgv("%s: ino %lu, alloc %d of %d log blocks %lu to %lu\n",
+			  __func__, sih->ino, allocated, num, *blocknr,
+			  *blocknr + allocated - 1);
+	}
+	return allocated;
+}
+
 /* We do not take locks so it's inaccurate */
 unsigned long nova_count_free_blocks(struct super_block *sb)
 {
diff --git a/fs/nova/balloc.h b/fs/nova/balloc.h
index 249eb72..463fbac 100644
--- a/fs/nova/balloc.h
+++ b/fs/nova/balloc.h
@@ -73,6 +73,16 @@ extern int nova_free_data_blocks(struct super_block *sb,
 	struct nova_inode_info_header *sih, unsigned long blocknr, int num);
 extern int nova_free_log_blocks(struct super_block *sb,
 	struct nova_inode_info_header *sih, unsigned long blocknr, int num);
+extern inline int nova_new_data_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long *blocknr,
+	unsigned long start_blk, unsigned int num,
+	enum nova_alloc_init zero, int cpu,
+	enum nova_alloc_direction from_tail);
+extern int nova_new_log_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	unsigned long *blocknr, unsigned int num,
+	enum nova_alloc_init zero, int cpu,
+	enum nova_alloc_direction from_tail);
 int nova_find_free_slot(struct nova_sb_info *sbi,
 	struct rb_root *tree, unsigned long range_low,
 	unsigned long range_high, struct nova_range_node **prev,
-- 
2.7.4

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [RFC v2 21/83] Add log structure.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (19 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 20/83] Pmem block allocation routines Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 22/83] Inode log pages allocation and reclaimation Andiry Xu
                   ` (62 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

The NOVA log is a singly linked list of 4KB pmem pages.
Each log page consists of two parts: 4064 bytes for log entries
and a 32-byte page tail structure. The page tail contains metadata
about the log page and the address of the next log page in the
linked list.

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
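Note: the page layout described above can be checked with a small userspace
sketch. It mirrors the field order of nova_inode_page_tail but uses
fixed-width host types instead of __le32/__le64, and all toy_* names are
hypothetical. The static assertions confirm that 4064 bytes of entries plus
a 32-byte tail fill exactly one 4KB page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE		4096u
#define TOY_PAGE_OFFSET_MASK	((uint64_t)TOY_PAGE_SIZE - 1)
#define TOY_LOG_BLOCK_TAIL	4064u

/* Same field order as the page tail in the patch, host byte order. */
struct toy_page_tail {
	uint32_t invalid_entries;
	uint32_t num_entries;
	uint64_t epoch_id;
	uint64_t padding;
	uint64_t next_page;	/* offset of the next log page, 0 if none */
};

struct toy_log_page {
	char entries[TOY_LOG_BLOCK_TAIL];
	struct toy_page_tail tail;
};

static_assert(sizeof(struct toy_page_tail) == 32, "tail is 32 bytes");
static_assert(sizeof(struct toy_log_page) == TOY_PAGE_SIZE,
	      "entries plus tail fill one 4KB page");
static_assert(offsetof(struct toy_log_page, tail) == TOY_LOG_BLOCK_TAIL,
	      "tail starts right after the entry area");

/* Offset of the page that contains log position p. */
static uint64_t toy_block_off(uint64_t p)
{
	return p & ~TOY_PAGE_OFFSET_MASK;
}

/* Offset of log position p within its page. */
static uint64_t toy_entry_loc(uint64_t p)
{
	return p & TOY_PAGE_OFFSET_MASK;
}

/* True if an entry of the given size at p would run into the page tail. */
static int toy_is_last_entry(uint64_t p, size_t size)
{
	return toy_entry_loc(p) + size > TOY_LOG_BLOCK_TAIL;
}
```

An entry that ends exactly at byte 4064 still fits in the page. Only an entry
that would overlap the tail forces a move to the next page.
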
 fs/nova/log.h  | 187 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/nova.h |   1 +
 2 files changed, 188 insertions(+)
 create mode 100644 fs/nova/log.h

diff --git a/fs/nova/log.h b/fs/nova/log.h
new file mode 100644
index 0000000..61586a3
--- /dev/null
+++ b/fs/nova/log.h
@@ -0,0 +1,187 @@
+#ifndef __LOG_H
+#define __LOG_H
+
+#include "balloc.h"
+#include "inode.h"
+
+/* ======================= Log entry ========================= */
+/* Inode entry in the log */
+
+#define	MAIN_LOG	0
+#define	ALTER_LOG	1
+
+#define	PAGE_OFFSET_MASK	4095
+#define	BLOCK_OFF(p)	((p) & ~PAGE_OFFSET_MASK)
+
+#define	ENTRY_LOC(p)	((p) & PAGE_OFFSET_MASK)
+
+#define	LOG_BLOCK_TAIL	4064
+#define	PAGE_TAIL(p)	(BLOCK_OFF(p) + LOG_BLOCK_TAIL)
+
+/*
+ * Log page state and pointers to next page and the replica page
+ */
+struct nova_inode_page_tail {
+	__le32	invalid_entries;
+	__le32	num_entries;
+	__le64	epoch_id;	/* For snapshot list page */
+	__le64	padding;
+	__le64	next_page;
+} __attribute((__packed__));
+
+/* Fit in PAGE_SIZE */
+struct	nova_inode_log_page {
+	char padding[LOG_BLOCK_TAIL];
+	struct nova_inode_page_tail page_tail;
+} __attribute((__packed__));
+
+
+enum nova_entry_type {
+	FILE_WRITE = 1,
+	DIR_LOG,
+	SET_ATTR,
+	LINK_CHANGE,
+	NEXT_PAGE,
+};
+
+static inline u8 nova_get_entry_type(void *p)
+{
+	u8 type;
+	int rc;
+
+	rc = memcpy_mcsafe(&type, p, sizeof(u8));
+	if (rc)
+		return rc;
+
+	return type;
+}
+
+static inline void nova_set_entry_type(void *p, enum nova_entry_type type)
+{
+	*(u8 *)p = type;
+}
+
+static inline u64 next_log_page(struct super_block *sb, u64 curr)
+{
+	struct nova_inode_log_page *curr_page;
+	u64 next = 0;
+	int rc;
+
+	curr = BLOCK_OFF(curr);
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb, curr);
+	rc = memcpy_mcsafe(&next, &curr_page->page_tail.next_page,
+				sizeof(u64));
+	if (rc)
+		return rc;
+
+	return next;
+}
+
+static inline void nova_set_next_page_flag(struct super_block *sb, u64 curr_p)
+{
+	void *p;
+
+	if (ENTRY_LOC(curr_p) >= LOG_BLOCK_TAIL)
+		return;
+
+	p = nova_get_block(sb, curr_p);
+	nova_set_entry_type(p, NEXT_PAGE);
+	nova_flush_buffer(p, CACHELINE_SIZE, 1);
+}
+
+static inline void nova_set_next_page_address(struct super_block *sb,
+	struct nova_inode_log_page *curr_page, u64 next_page, int fence)
+{
+	curr_page->page_tail.next_page = next_page;
+	nova_flush_buffer(&curr_page->page_tail,
+				sizeof(struct nova_inode_page_tail), 0);
+	if (fence)
+		PERSISTENT_BARRIER();
+}
+
+static inline void nova_set_page_num_entries(struct super_block *sb,
+	struct nova_inode_log_page *curr_page, int num, int flush)
+{
+	curr_page->page_tail.num_entries = num;
+	if (flush)
+		nova_flush_buffer(&curr_page->page_tail,
+				sizeof(struct nova_inode_page_tail), 0);
+}
+
+static inline void nova_set_page_invalid_entries(struct super_block *sb,
+	struct nova_inode_log_page *curr_page, int num, int flush)
+{
+	curr_page->page_tail.invalid_entries = num;
+	if (flush)
+		nova_flush_buffer(&curr_page->page_tail,
+				sizeof(struct nova_inode_page_tail), 0);
+}
+
+static inline void nova_inc_page_num_entries(struct super_block *sb,
+	u64 curr)
+{
+	struct nova_inode_log_page *curr_page;
+
+	curr = BLOCK_OFF(curr);
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb, curr);
+
+	curr_page->page_tail.num_entries++;
+	nova_flush_buffer(&curr_page->page_tail,
+				sizeof(struct nova_inode_page_tail), 0);
+}
+
+static inline void nova_inc_page_invalid_entries(struct super_block *sb,
+	u64 curr)
+{
+	struct nova_inode_log_page *curr_page;
+
+	curr = BLOCK_OFF(curr);
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb, curr);
+
+	curr_page->page_tail.invalid_entries++;
+	if (curr_page->page_tail.invalid_entries >
+			curr_page->page_tail.num_entries) {
+		nova_dbg("Page 0x%llx has %u entries, %u invalid\n",
+				curr,
+				curr_page->page_tail.num_entries,
+				curr_page->page_tail.invalid_entries);
+	}
+
+	nova_flush_buffer(&curr_page->page_tail,
+				sizeof(struct nova_inode_page_tail), 0);
+}
+
+static inline bool is_last_entry(u64 curr_p, size_t size)
+{
+	unsigned int entry_end;
+
+	entry_end = ENTRY_LOC(curr_p) + size;
+
+	return entry_end > LOG_BLOCK_TAIL;
+}
+
+static inline bool goto_next_page(struct super_block *sb, u64 curr_p)
+{
+	void *addr;
+	u8 type;
+	int rc;
+
+	/* Each kind of entry takes at least 32 bytes */
+	if (ENTRY_LOC(curr_p) + 32 > LOG_BLOCK_TAIL)
+		return true;
+
+	addr = nova_get_block(sb, curr_p);
+	rc = memcpy_mcsafe(&type, addr, sizeof(u8));
+
+	if (rc < 0)
+		return true;
+
+	if (type == NEXT_PAGE)
+		return true;
+
+	return false;
+}
+
+
+
+#endif
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index 0992f50..f5b4ec8 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -301,6 +301,7 @@ static inline u64 nova_get_epoch_id(struct super_block *sb)
 }
 
 #include "inode.h"
+#include "log.h"
 
 /* A node in the RB tree representing a range of pages */
 struct nova_range_node {
-- 
2.7.4

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [RFC v2 22/83] Inode log pages allocation and reclaimation.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (20 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 21/83] Add log structure Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 23/83] Save allocator to pmem in put_super Andiry Xu
                   ` (61 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA allocates one log page for each new inode. When the log is full,
NOVA allocates new log pages and extends the log, either doubling the log
size or growing it by a fixed length, depending on the current log size.

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
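Note: the growth policy can be illustrated with a short userspace sketch. It
assumes the threshold of 256 pages that this patch defines as
EXTEND_THRESHOLD, and it assumes the page count is updated after every
extension. The names toy_pages_to_add and toy_log_size_after are
hypothetical.

```c
#include <assert.h>

#define TOY_EXTEND_THRESHOLD	256UL

/*
 * Number of pages to add to a log that currently has log_pages pages:
 * double the log while it is below the threshold, then grow linearly.
 */
static unsigned long toy_pages_to_add(unsigned long log_pages)
{
	assert(log_pages > 0);
	return log_pages >= TOY_EXTEND_THRESHOLD ?
			TOY_EXTEND_THRESHOLD : log_pages;
}

/* Log size in pages after the given number of extensions of a 1-page log. */
static unsigned long toy_log_size_after(unsigned int extensions)
{
	unsigned long pages = 1;

	while (extensions--)
		pages += toy_pages_to_add(pages);
	return pages;
}
```

Under these assumptions a log grows 1, 2, 4, ... up to 256 pages and then by
256 pages per extension, which bounds the space over-allocated to large logs.
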
 fs/nova/Makefile |   2 +-
 fs/nova/log.c    | 327 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/log.h    |  11 ++
 3 files changed, 339 insertions(+), 1 deletion(-)
 create mode 100644 fs/nova/log.c

diff --git a/fs/nova/Makefile b/fs/nova/Makefile
index e2f7b07..b3638a4 100644
--- a/fs/nova/Makefile
+++ b/fs/nova/Makefile
@@ -4,4 +4,4 @@
 
 obj-$(CONFIG_NOVA_FS) += nova.o
 
-nova-y := balloc.o bbuild.o inode.o rebuild.o stats.o super.o
+nova-y := balloc.o bbuild.o inode.o log.o rebuild.o stats.o super.o
diff --git a/fs/nova/log.c b/fs/nova/log.c
new file mode 100644
index 0000000..bdd133e
--- /dev/null
+++ b/fs/nova/log.c
@@ -0,0 +1,327 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Log methods
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include "nova.h"
+#include "inode.h"
+#include "log.h"
+
+/* Coalesce log pages to a singly linked list */
+static int nova_coalesce_log_pages(struct super_block *sb,
+	unsigned long prev_blocknr, unsigned long first_blocknr,
+	unsigned long num_pages)
+{
+	unsigned long next_blocknr;
+	u64 curr_block, next_page;
+	struct nova_inode_log_page *curr_page;
+	int i;
+
+	if (prev_blocknr) {
+		/* Link prev block and newly allocated head block */
+		curr_block = nova_get_block_off(sb, prev_blocknr,
+						NOVA_BLOCK_TYPE_4K);
+		curr_page = (struct nova_inode_log_page *)
+				nova_get_block(sb, curr_block);
+		next_page = nova_get_block_off(sb, first_blocknr,
+				NOVA_BLOCK_TYPE_4K);
+		nova_set_next_page_address(sb, curr_page, next_page, 0);
+	}
+
+	next_blocknr = first_blocknr + 1;
+	curr_block = nova_get_block_off(sb, first_blocknr,
+						NOVA_BLOCK_TYPE_4K);
+	curr_page = (struct nova_inode_log_page *)
+				nova_get_block(sb, curr_block);
+	for (i = 0; i < num_pages - 1; i++) {
+		next_page = nova_get_block_off(sb, next_blocknr,
+				NOVA_BLOCK_TYPE_4K);
+		nova_set_page_num_entries(sb, curr_page, 0, 0);
+		nova_set_page_invalid_entries(sb, curr_page, 0, 0);
+		nova_set_next_page_address(sb, curr_page, next_page, 0);
+		curr_page++;
+		next_blocknr++;
+	}
+
+	/* Last page */
+	nova_set_page_num_entries(sb, curr_page, 0, 0);
+	nova_set_page_invalid_entries(sb, curr_page, 0, 0);
+	nova_set_next_page_address(sb, curr_page, 0, 1);
+	return 0;
+}
+
+/* Log block resides in NVMM */
+int nova_allocate_inode_log_pages(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long num_pages,
+	u64 *new_block, int cpuid, enum nova_alloc_direction from_tail)
+{
+	unsigned long new_inode_blocknr;
+	unsigned long first_blocknr;
+	unsigned long prev_blocknr;
+	int allocated;
+	int ret_pages = 0;
+
+	allocated = nova_new_log_blocks(sb, sih, &new_inode_blocknr,
+			num_pages, ALLOC_NO_INIT, cpuid, from_tail);
+
+	if (allocated <= 0) {
+		nova_err(sb, "ERROR: no inode log page available: %lu %d\n",
+			num_pages, allocated);
+		return allocated;
+	}
+	ret_pages += allocated;
+	num_pages -= allocated;
+	nova_dbg_verbose("Pi %lu: Alloc %d log blocks @ 0x%lx\n",
+			sih->ino, allocated, new_inode_blocknr);
+
+	/* Coalesce the pages */
+	nova_coalesce_log_pages(sb, 0, new_inode_blocknr, allocated);
+	first_blocknr = new_inode_blocknr;
+	prev_blocknr = new_inode_blocknr + allocated - 1;
+
+	/* Allocate remaining pages */
+	while (num_pages) {
+		allocated = nova_new_log_blocks(sb, sih,
+					&new_inode_blocknr, num_pages,
+					ALLOC_NO_INIT, cpuid, from_tail);
+
+		nova_dbg_verbose("Alloc %d log blocks @ 0x%lx\n",
+					allocated, new_inode_blocknr);
+		if (allocated <= 0) {
+			nova_dbg("%s: no inode log page available: %lu %d\n",
+				__func__, num_pages, allocated);
+			/* Return whatever we have */
+			break;
+		}
+		ret_pages += allocated;
+		num_pages -= allocated;
+		nova_coalesce_log_pages(sb, prev_blocknr, new_inode_blocknr,
+						allocated);
+		prev_blocknr = new_inode_blocknr + allocated - 1;
+	}
+
+	*new_block = nova_get_block_off(sb, first_blocknr,
+						NOVA_BLOCK_TYPE_4K);
+
+	return ret_pages;
+}
+
+static int nova_initialize_inode_log(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	int log_id)
+{
+	u64 new_block;
+	int allocated;
+
+	allocated = nova_allocate_inode_log_pages(sb, sih,
+					1, &new_block, ANY_CPU,
+					log_id == MAIN_LOG ? 0 : 1);
+	if (allocated != 1) {
+		nova_err(sb, "%s ERROR: no inode log page available\n",
+					__func__);
+		return -ENOSPC;
+	}
+
+	pi->log_tail = new_block;
+	nova_flush_buffer(&pi->log_tail, CACHELINE_SIZE, 0);
+	pi->log_head = new_block;
+	sih->log_head = sih->log_tail = new_block;
+	sih->log_pages = 1;
+	nova_flush_buffer(&pi->log_head, CACHELINE_SIZE, 1);
+
+	return 0;
+}
+
+/*
+ * Extend the log.  If the log is less than EXTEND_THRESHOLD pages, double its
+ * allocated size.  Otherwise, increase by EXTEND_THRESHOLD. Then, do GC.
+ */
+static u64 nova_extend_inode_log(struct super_block *sb, struct nova_inode *pi,
+	struct nova_inode_info_header *sih, u64 curr_p)
+{
+	u64 new_block = 0;
+	int allocated;
+	unsigned long num_pages;
+	int ret;
+
+	nova_dbgv("%s: inode %lu, curr 0x%llx\n", __func__, sih->ino, curr_p);
+
+	if (curr_p == 0) {
+		ret = nova_initialize_inode_log(sb, pi, sih, MAIN_LOG);
+		if (ret)
+			return 0;
+
+		return sih->log_head;
+	}
+
+	num_pages = sih->log_pages >= EXTEND_THRESHOLD ?
+				EXTEND_THRESHOLD : sih->log_pages;
+
+	allocated = nova_allocate_inode_log_pages(sb, sih,
+					num_pages, &new_block, ANY_CPU, 0);
+	nova_dbg_verbose("Link block %llu to block %llu\n",
+					curr_p >> PAGE_SHIFT,
+					new_block >> PAGE_SHIFT);
+	if (allocated <= 0) {
+		nova_err(sb, "%s ERROR: no inode log page available\n",
+					__func__);
+		nova_dbg("curr_p 0x%llx, %lu pages\n", curr_p,
+					sih->log_pages);
+		return 0;
+	}
+
+	/* Perform GC */
+	return new_block;
+}
+
+/* For thorough GC, simply append one more page */
+static u64 nova_append_one_log_page(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 curr_p)
+{
+	struct nova_inode_log_page *curr_page;
+	u64 new_block;
+	u64 curr_block;
+	int allocated;
+
+	allocated = nova_allocate_inode_log_pages(sb, sih, 1, &new_block,
+							ANY_CPU, 0);
+	if (allocated != 1) {
+		nova_err(sb, "%s: ERROR: no inode log page available\n",
+				__func__);
+		return 0;
+	}
+
+	if (curr_p == 0) {
+		curr_p = new_block;
+	} else {
+		/* Link prev block and newly allocated head block */
+		curr_block = BLOCK_OFF(curr_p);
+		curr_page = (struct nova_inode_log_page *)
+				nova_get_block(sb, curr_block);
+		nova_set_next_page_address(sb, curr_page, new_block, 1);
+	}
+
+	return curr_p;
+}
+
+/* Get the append location. Extend the log if needed. */
+u64 nova_get_append_head(struct super_block *sb, struct nova_inode *pi,
+	struct nova_inode_info_header *sih, u64 tail, size_t size, int log_id,
+	int thorough_gc, int *extended)
+{
+	u64 curr_p;
+
+	if (tail)
+		curr_p = tail;
+	else
+		curr_p = sih->log_tail;
+
+	if (curr_p == 0 || (is_last_entry(curr_p, size) &&
+				next_log_page(sb, curr_p) == 0)) {
+		if (is_last_entry(curr_p, size)) {
+			nova_set_next_page_flag(sb, curr_p);
+		}
+
+		/* Alternate log should not go here */
+		if (log_id != MAIN_LOG)
+			return 0;
+
+		if (thorough_gc == 0) {
+			curr_p = nova_extend_inode_log(sb, pi, sih, curr_p);
+		} else {
+			curr_p = nova_append_one_log_page(sb, sih, curr_p);
+			/* For thorough GC */
+			*extended = 1;
+		}
+
+		if (curr_p == 0)
+			return 0;
+	}
+
+	if (is_last_entry(curr_p, size)) {
+		nova_set_next_page_flag(sb, curr_p);
+		curr_p = next_log_page(sb, curr_p);
+	}
+
+	return curr_p;
+}
+
+int nova_free_contiguous_log_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 head)
+{
+	unsigned long blocknr, start_blocknr = 0;
+	u64 curr_block = head;
+	u8 btype = sih->i_blk_type;
+	int num_free = 0;
+	int freed = 0;
+
+	while (curr_block > 0) {
+		if (ENTRY_LOC(curr_block)) {
+			nova_dbg("%s: ERROR: invalid block %llu\n",
+					__func__, curr_block);
+			break;
+		}
+
+		blocknr = nova_get_blocknr(sb, le64_to_cpu(curr_block),
+				    btype);
+		nova_dbg_verbose("%s: free page %llu\n", __func__, curr_block);
+		curr_block = next_log_page(sb, curr_block);
+
+		if (start_blocknr == 0) {
+			start_blocknr = blocknr;
+			num_free = 1;
+		} else {
+			if (blocknr == start_blocknr + num_free) {
+				num_free++;
+			} else {
+				/* A new start */
+				nova_free_log_blocks(sb, sih, start_blocknr,
+							num_free);
+				freed += num_free;
+				start_blocknr = blocknr;
+				num_free = 1;
+			}
+		}
+	}
+	if (start_blocknr) {
+		nova_free_log_blocks(sb, sih, start_blocknr, num_free);
+		freed += num_free;
+	}
+
+	return freed;
+}
+
+int nova_free_inode_log(struct super_block *sb, struct nova_inode *pi,
+	struct nova_inode_info_header *sih)
+{
+	int freed = 0;
+	timing_t free_time;
+
+	if (sih->log_head == 0 || sih->log_tail == 0)
+		return 0;
+
+	NOVA_START_TIMING(free_inode_log_t, free_time);
+
+	/* The inode is invalid now, no need to fence */
+	if (pi) {
+		pi->log_head = pi->log_tail = 0;
+		nova_flush_buffer(&pi->log_head, CACHELINE_SIZE, 0);
+	}
+
+	freed = nova_free_contiguous_log_blocks(sb, sih, sih->log_head);
+
+	NOVA_END_TIMING(free_inode_log_t, free_time);
+	return 0;
+}
diff --git a/fs/nova/log.h b/fs/nova/log.h
index 61586a3..2bc131f 100644
--- a/fs/nova/log.h
+++ b/fs/nova/log.h
@@ -35,6 +35,7 @@ struct	nova_inode_log_page {
 	struct nova_inode_page_tail page_tail;
 } __attribute((__packed__));
 
+#define	EXTEND_THRESHOLD	256
 
 enum nova_entry_type {
 	FILE_WRITE = 1,
@@ -183,5 +184,15 @@ static inline bool goto_next_page(struct super_block *sb, u64 curr_p)
 }
 
 
+int nova_allocate_inode_log_pages(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long num_pages,
+	u64 *new_block, int cpuid, enum nova_alloc_direction from_tail);
+u64 nova_get_append_head(struct super_block *sb, struct nova_inode *pi,
+	struct nova_inode_info_header *sih, u64 tail, size_t size, int log_id,
+	int thorough_gc, int *extended);
+int nova_free_contiguous_log_blocks(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 head);
+int nova_free_inode_log(struct super_block *sb, struct nova_inode *pi,
+	struct nova_inode_info_header *sih);
 
 #endif
-- 
2.7.4

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [RFC v2 23/83] Save allocator to pmem in put_super.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (21 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 22/83] Inode log pages allocation and reclaimation Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 24/83] Initialize and allocate inode table Andiry Xu
                   ` (60 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

We allocate log pages and append the free range nodes to the log of the
reserved blocknode inode. The allocator state can then be recovered by
reading this log during normal recovery.

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
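Note: the sizing and entry encoding used when saving the allocator can be
sketched in userspace C as below. Each saved range takes 16 bytes, so one log
page holds 4064 / 16 = 254 entries, which matches RANGENODE_PER_PAGE. The
sketch keeps values in host byte order, whereas the patch stores them with
cpu_to_le64(). All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_RANGENODE_PER_PAGE	254UL	/* 4064 bytes / 16 bytes per entry */

struct toy_range_entry {
	uint64_t range_low;	/* top 8 bits may carry a CPU id */
	uint64_t range_high;
};

static_assert(sizeof(struct toy_range_entry) == 16, "entry is 16 bytes");

/* Log pages needed to hold num_nodes entries, rounded up. */
static unsigned long toy_pages_needed(unsigned long num_nodes)
{
	return (num_nodes + TOY_RANGENODE_PER_PAGE - 1) /
			TOY_RANGENODE_PER_PAGE;
}

/* Pack a range and an optional CPU id into one entry. */
static struct toy_range_entry toy_encode(uint64_t low, uint64_t high,
					 uint64_t cpuid)
{
	struct toy_range_entry e;

	assert(cpuid < 256 && (low >> 56) == 0);
	e.range_low = low | (cpuid << 56);
	e.range_high = high;
	return e;
}

/* Recover the low block number without the CPU id bits. */
static uint64_t toy_entry_low(struct toy_range_entry e)
{
	return e.range_low & ((1ULL << 56) - 1);
}

/* Recover the CPU id stored in the top 8 bits. */
static uint64_t toy_entry_cpu(struct toy_range_entry e)
{
	return e.range_low >> 56;
}
```

In this patch the free-list ranges are saved with a cpuid of 0, so the top
bits stay clear for them. The sketch only shows how the tag would round-trip.
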
 fs/nova/bbuild.c | 114 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/bbuild.h |   1 +
 fs/nova/inode.h  |  13 +++++++
 fs/nova/nova.h   |   7 ++++
 fs/nova/super.c  |   2 +
 5 files changed, 137 insertions(+)

diff --git a/fs/nova/bbuild.c b/fs/nova/bbuild.c
index 8bc0545..12a2f11 100644
--- a/fs/nova/bbuild.c
+++ b/fs/nova/bbuild.c
@@ -51,3 +51,117 @@ void nova_init_header(struct super_block *sb,
 	init_rwsem(&sih->i_sem);
 }
 
+static u64 nova_append_range_node_entry(struct super_block *sb,
+	struct nova_range_node *curr, u64 tail, unsigned long cpuid)
+{
+	u64 curr_p;
+	size_t size = sizeof(struct nova_range_node_lowhigh);
+	struct nova_range_node_lowhigh *entry;
+
+	curr_p = tail;
+
+	if (curr_p == 0 || (is_last_entry(curr_p, size) &&
+				next_log_page(sb, curr_p) == 0)) {
+		nova_dbg("%s: inode log reaches end?\n", __func__);
+		goto out;
+	}
+
+	if (is_last_entry(curr_p, size))
+		curr_p = next_log_page(sb, curr_p);
+
+	entry = (struct nova_range_node_lowhigh *)nova_get_block(sb, curr_p);
+	entry->range_low = cpu_to_le64(curr->range_low);
+	if (cpuid)
+		entry->range_low |= cpu_to_le64(cpuid << 56);
+	entry->range_high = cpu_to_le64(curr->range_high);
+	nova_dbgv("append entry block low 0x%lx, high 0x%lx\n",
+			curr->range_low, curr->range_high);
+
+	nova_flush_buffer(entry, sizeof(struct nova_range_node_lowhigh), 0);
+out:
+	return curr_p;
+}
+
+static u64 nova_save_range_nodes_to_log(struct super_block *sb,
+	struct rb_root *tree, u64 temp_tail, unsigned long cpuid)
+{
+	struct nova_range_node *curr;
+	struct rb_node *temp;
+	size_t size = sizeof(struct nova_range_node_lowhigh);
+	u64 curr_entry = 0;
+
+	/* Save in increasing order */
+	temp = rb_first(tree);
+	while (temp) {
+		curr = container_of(temp, struct nova_range_node, node);
+		curr_entry = nova_append_range_node_entry(sb, curr,
+						temp_tail, cpuid);
+		temp_tail = curr_entry + size;
+		temp = rb_next(temp);
+		rb_erase(&curr->node, tree);
+		nova_free_range_node(curr);
+	}
+
+	return temp_tail;
+}
+
+static u64 nova_save_free_list_blocknodes(struct super_block *sb, int cpu,
+	u64 temp_tail)
+{
+	struct free_list *free_list;
+
+	free_list = nova_get_free_list(sb, cpu);
+	temp_tail = nova_save_range_nodes_to_log(sb,
+				&free_list->block_free_tree, temp_tail, 0);
+	return temp_tail;
+}
+
+void nova_save_blocknode_mappings_to_log(struct super_block *sb)
+{
+	struct nova_inode *pi = nova_get_inode_by_ino(sb, NOVA_BLOCKNODE_INO);
+	struct nova_inode_info_header sih;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	unsigned long num_blocknode = 0;
+	unsigned long num_pages;
+	int allocated;
+	u64 new_block = 0;
+	u64 temp_tail;
+	int i;
+
+	sih.ino = NOVA_BLOCKNODE_INO;
+	sih.i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+
+	/* Allocate log pages before saving blocknode mappings */
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+		num_blocknode += free_list->num_blocknode;
+		nova_dbgv("%s: free list %d: %lu nodes\n", __func__,
+				i, free_list->num_blocknode);
+	}
+
+	num_pages = num_blocknode / RANGENODE_PER_PAGE;
+	if (num_blocknode % RANGENODE_PER_PAGE)
+		num_pages++;
+
+	allocated = nova_allocate_inode_log_pages(sb, &sih, num_pages,
+						&new_block, ANY_CPU, 0);
+	if (allocated != num_pages) {
+		nova_dbg("Error saving blocknode mappings: %d\n", allocated);
+		return;
+	}
+
+	temp_tail = new_block;
+	for (i = 0; i < sbi->cpus; i++)
+		temp_tail = nova_save_free_list_blocknodes(sb, i, temp_tail);
+
+	/* Finally update log head and tail */
+	pi->log_head = new_block;
+	nova_update_tail(pi, temp_tail);
+	nova_flush_buffer(&pi->log_head, CACHELINE_SIZE, 0);
+
+	nova_dbg("%s: %lu blocknodes, %lu log pages, pi head 0x%llx, tail 0x%llx\n",
+		  __func__, num_blocknode, num_pages,
+		  pi->log_head, pi->log_tail);
+}
+
diff --git a/fs/nova/bbuild.h b/fs/nova/bbuild.h
index 162a832..59cc379 100644
--- a/fs/nova/bbuild.h
+++ b/fs/nova/bbuild.h
@@ -3,5 +3,6 @@
 
 void nova_init_header(struct super_block *sb,
 	struct nova_inode_info_header *sih, u16 i_mode);
+void nova_save_blocknode_mappings_to_log(struct super_block *sb);
 
 #endif
diff --git a/fs/nova/inode.h b/fs/nova/inode.h
index dbd5256..0594ef3 100644
--- a/fs/nova/inode.h
+++ b/fs/nova/inode.h
@@ -123,6 +123,19 @@ static inline void sih_unlock_shared(struct nova_inode_info_header *header)
 	up_read(&header->i_sem);
 }
 
+static inline void nova_update_tail(struct nova_inode *pi, u64 new_tail)
+{
+	timing_t update_time;
+
+	NOVA_START_TIMING(update_tail_t, update_time);
+
+	PERSISTENT_BARRIER();
+	pi->log_tail = new_tail;
+	nova_flush_buffer(&pi->log_tail, CACHELINE_SIZE, 1);
+
+	NOVA_END_TIMING(update_tail_t, update_time);
+}
+
 static inline unsigned int
 nova_inode_blk_shift(struct nova_inode_info_header *sih)
 {
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index f5b4ec8..aa88d9f 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -303,6 +303,13 @@ static inline u64 nova_get_epoch_id(struct super_block *sb)
 #include "inode.h"
 #include "log.h"
 
+struct nova_range_node_lowhigh {
+	__le64 range_low;
+	__le64 range_high;
+};
+
+#define	RANGENODE_PER_PAGE	254
+
 /* A node in the RB tree representing a range of pages */
 struct nova_range_node {
 	struct rb_node node;
diff --git a/fs/nova/super.c b/fs/nova/super.c
index 3500d19..7ee3f66 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -705,6 +705,8 @@ static void nova_put_super(struct super_block *sb)
 	struct nova_sb_info *sbi = NOVA_SB(sb);
 
 	if (sbi->virt_addr) {
+		/* Save everything before blocknode mapping! */
+		nova_save_blocknode_mappings_to_log(sb);
 		sbi->virt_addr = NULL;
 	}
 
-- 
2.7.4

* [RFC v2 24/83] Initialize and allocate inode table.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (22 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 23/83] Save allocator to pmem in put_super Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 25/83] Support get normal inode address and inode table extension Andiry Xu
                   ` (59 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

The inode table is a singly linked list of 2MB pages.
Each CPU has its own inode table, with an initial size of 2MB.
The head address of each per-CPU inode table is stored in the
INODE_TABLE_START block of the pmem range, one cache line per CPU.
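
As a rough illustration of the layout described above, the sketch below
computes the byte offset of the slot that holds one CPU's inode table head
pointer. It mirrors the arithmetic in nova_get_inode_table(), but every
SKETCH_* constant and the function name are assumptions made for this
example; the real values come from the NOVA headers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only, not taken from the NOVA headers. */
#define SKETCH_BLOCK_SIZE_4K     4096u
#define SKETCH_CACHELINE_SIZE    64u
#define SKETCH_INODE_TABLE_START 2u	/* hypothetical block index */

/*
 * Byte offset, from the start of the pmem range, of the slot holding the
 * log_head pointer of the given CPU's inode table.
 * Returns -1 if the CPU number is out of range.
 */
static int64_t sketch_inode_table_slot(int cpu, int cpus)
{
	assert(cpus > 0);
	if (cpu < 0 || cpu >= cpus)
		return -1;

	/* One cache line per CPU, starting at the reserved block. */
	return (int64_t)SKETCH_INODE_TABLE_START * SKETCH_BLOCK_SIZE_4K +
	       (int64_t)cpu * SKETCH_CACHELINE_SIZE;
}
```

Giving each CPU its own cache line presumably keeps updates to one head
pointer from sharing a line with another CPU's pointer.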

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/inode.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/inode.h | 26 ++++++++++++++++++++++++++
 fs/nova/super.c |  3 +++
 3 files changed, 84 insertions(+)

diff --git a/fs/nova/inode.c b/fs/nova/inode.c
index f7d6410..42816ff 100644
--- a/fs/nova/inode.c
+++ b/fs/nova/inode.c
@@ -29,6 +29,61 @@
 unsigned int blk_type_to_shift[NOVA_BLOCK_TYPE_MAX] = {12, 21, 30};
 uint32_t blk_type_to_size[NOVA_BLOCK_TYPE_MAX] = {0x1000, 0x200000, 0x40000000};
 
+static int nova_alloc_inode_table(struct super_block *sb,
+	struct nova_inode_info_header *sih)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct inode_table *inode_table;
+	unsigned long blocknr;
+	u64 block;
+	int allocated;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		inode_table = nova_get_inode_table(sb, i);
+		if (!inode_table)
+			return -EINVAL;
+
+		allocated = nova_new_log_blocks(sb, sih, &blocknr, 1,
+				ALLOC_INIT_ZERO, i, ALLOC_FROM_HEAD);
+
+		nova_dbgv("%s: allocate log @ 0x%lx\n", __func__,
+							blocknr);
+		if (allocated != 1 || blocknr == 0)
+			return -ENOSPC;
+
+		block = nova_get_block_off(sb, blocknr, NOVA_BLOCK_TYPE_2M);
+		inode_table->log_head = block;
+		nova_flush_buffer(inode_table, CACHELINE_SIZE, 0);
+	}
+
+	return 0;
+}
+
+int nova_init_inode_table(struct super_block *sb)
+{
+	struct nova_inode *pi = nova_get_inode_by_ino(sb, NOVA_INODETABLE_INO);
+	struct nova_inode_info_header sih;
+	int ret = 0;
+
+	pi->i_mode = 0;
+	pi->i_uid = 0;
+	pi->i_gid = 0;
+	pi->i_links_count = cpu_to_le16(1);
+	pi->i_flags = 0;
+	pi->nova_ino = NOVA_INODETABLE_INO;
+
+	pi->i_blk_type = NOVA_BLOCK_TYPE_2M;
+
+	sih.ino = NOVA_INODETABLE_INO;
+	sih.i_blk_type = NOVA_BLOCK_TYPE_2M;
+
+	ret = nova_alloc_inode_table(sb, &sih);
+
+	PERSISTENT_BARRIER();
+	return ret;
+}
+
 void nova_set_inode_flags(struct inode *inode, struct nova_inode *pi,
 	unsigned int flags)
 {
diff --git a/fs/nova/inode.h b/fs/nova/inode.h
index 0594ef3..a88f0a2 100644
--- a/fs/nova/inode.h
+++ b/fs/nova/inode.h
@@ -60,6 +60,13 @@ struct nova_inode {
 } __attribute((__packed__));
 
 /*
+ * Inode table.  It's a linked list of pages.
+ */
+struct inode_table {
+	__le64 log_head;
+};
+
+/*
  * NOVA-specific inode state kept in DRAM
  */
 struct nova_inode_info_header {
@@ -136,6 +143,22 @@ static inline void nova_update_tail(struct nova_inode *pi, u64 new_tail)
 	NOVA_END_TIMING(update_tail_t, update_time);
 }
 
+static inline
+struct inode_table *nova_get_inode_table(struct super_block *sb, int cpu)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	int table_start;
+
+	if (cpu >= sbi->cpus)
+		return NULL;
+
+	table_start = INODE_TABLE_START;
+
+	return (struct inode_table *)((char *)nova_get_block(sb,
+		NOVA_DEF_BLOCK_SIZE_4K * table_start) +
+		cpu * CACHELINE_SIZE);
+}
+
 static inline unsigned int
 nova_inode_blk_shift(struct nova_inode_info_header *sih)
 {
@@ -197,7 +220,10 @@ static inline int nova_persist_inode(struct nova_inode *pi)
 	return 0;
 }
 
+
+int nova_init_inode_table(struct super_block *sb);
 int nova_get_inode_address(struct super_block *sb, u64 ino,
 	u64 *pi_addr, int extendable);
 struct inode *nova_iget(struct super_block *sb, unsigned long ino);
+
 #endif
diff --git a/fs/nova/super.c b/fs/nova/super.c
index 7ee3f66..32fe29b 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -378,6 +378,9 @@ static struct nova_inode *nova_init(struct super_block *sb,
 
 	nova_init_blockmap(sb, 0);
 
+	if (nova_init_inode_table(sb) < 0)
+		return ERR_PTR(-EINVAL);
+
 	sbi->nova_sb->s_size = cpu_to_le64(size);
 	sbi->nova_sb->s_blocksize = cpu_to_le32(blocksize);
 	sbi->nova_sb->s_magic = cpu_to_le32(NOVA_SUPER_MAGIC);
-- 
2.7.4

* [RFC v2 25/83] Support get normal inode address and inode table extension.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (23 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 24/83] Initialize and allocate inode table Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 26/83] Add inode_map to track inuse inodes Andiry Xu
                   ` (58 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Inodes are assigned to per-CPU inode tables in a round-robin way:

If there are four cores, then

CPU 0's inode table contains inode 0, inode 4, inode 8, ...
CPU 1's inode table contains inode 1, inode 5, inode 9, ...
CPU 2's inode table contains inode 2, inode 6, inode 10, ...
CPU 3's inode table contains inode 3, inode 7, inode 11, ...

So given an inode number, the inode table and the inode's position
within it can be easily calculated.

If a 2MB inode table page fills up, NOVA allocates a new 2MB log page
and links it to the tail of the previous inode table page.
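
The calculation can be sketched as follows. This is a standalone
illustration, not the patch code: the struct, the function name and the
SKETCH_* constants are hypothetical, and the 128-byte inode size
(shift of 7) is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions: 2MB pages (shift 21) and 128-byte inodes (shift 7). */
#define SKETCH_PAGE_SHIFT  21u
#define SKETCH_INODE_BITS  7u

struct sketch_inode_pos {
	unsigned int cpu;	/* which per-CPU inode table */
	uint64_t superpage;	/* number of 2MB pages to walk from the head */
	uint64_t index;		/* inode slot within that 2MB page */
};

static struct sketch_inode_pos sketch_locate(uint64_t ino, unsigned int cpus)
{
	/* log2 of the number of inode slots in one 2MB page */
	const unsigned int bits = SKETCH_PAGE_SHIFT - SKETCH_INODE_BITS;
	struct sketch_inode_pos pos;
	uint64_t internal;

	assert(cpus > 0);

	/* Round-robin: the remainder picks the table, the quotient the slot. */
	pos.cpu = (unsigned int)(ino % cpus);
	internal = ino / cpus;

	pos.superpage = internal >> bits;
	pos.index = internal & ((UINT64_C(1) << bits) - 1);
	return pos;
}
```

With four CPUs, inode 9 lands in CPU 1's table at slot 2 of the first
page, which matches the listing above (inode 1, inode 5, inode 9, ...).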

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/inode.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 65 insertions(+), 2 deletions(-)

diff --git a/fs/nova/inode.c b/fs/nova/inode.c
index 42816ff..4e2842d 100644
--- a/fs/nova/inode.c
+++ b/fs/nova/inode.c
@@ -167,18 +167,81 @@ static int nova_read_inode(struct super_block *sb, struct inode *inode,
 	return ret;
 }
 
-/* Get the address in PMEM of an inode by inode number.  Allocate additional
+/*
+ * Get the address in PMEM of an inode by inode number.  Allocate additional
  * block to store additional inodes if necessary.
  */
 int nova_get_inode_address(struct super_block *sb, u64 ino,
 	u64 *pi_addr, int extendable)
 {
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info_header sih;
+	struct inode_table *inode_table;
+	unsigned int data_bits;
+	unsigned int num_inodes_bits;
+	u64 curr;
+	unsigned int superpage_count;
+	u64 internal_ino;
+	int cpuid;
+	int extended = 0;
+	unsigned int index;
+	unsigned int i = 0;
+	unsigned long blocknr;
+	unsigned long curr_addr;
+	int allocated;
+
 	if (ino < NOVA_NORMAL_INODE_START) {
 		*pi_addr = nova_get_reserved_inode_addr(sb, ino);
 		return 0;
 	}
 
-	*pi_addr = 0;
+	sih.ino = NOVA_INODETABLE_INO;
+	sih.i_blk_type = NOVA_BLOCK_TYPE_2M;
+	data_bits = blk_type_to_shift[sih.i_blk_type];
+	num_inodes_bits = data_bits - NOVA_INODE_BITS;
+
+	cpuid = ino % sbi->cpus;
+	internal_ino = ino / sbi->cpus;
+
+	inode_table = nova_get_inode_table(sb, cpuid);
+	superpage_count = internal_ino >> num_inodes_bits;
+	index = internal_ino & ((1 << num_inodes_bits) - 1);
+
+	curr = inode_table->log_head;
+	if (curr == 0)
+		return -EINVAL;
+
+	for (i = 0; i < superpage_count; i++) {
+		if (curr == 0)
+			return -EINVAL;
+
+		curr_addr = (unsigned long)nova_get_block(sb, curr);
+		/* Next page pointer in the last 8 bytes of the superpage */
+		curr_addr += nova_inode_blk_size(&sih) - 8;
+		curr = *(u64 *)(curr_addr);
+
+		if (curr == 0) {
+			if (extendable == 0)
+				return -EINVAL;
+
+			extended = 1;
+
+			allocated = nova_new_log_blocks(sb, &sih, &blocknr,
+				1, ALLOC_INIT_ZERO, cpuid, ALLOC_FROM_HEAD);
+
+			if (allocated != 1)
+				return allocated;
+
+			curr = nova_get_block_off(sb, blocknr,
+						NOVA_BLOCK_TYPE_2M);
+			*(u64 *)(curr_addr) = curr;
+			nova_flush_buffer((void *)curr_addr,
+						NOVA_INODE_SIZE, 1);
+		}
+	}
+
+	*pi_addr = curr + index * NOVA_INODE_SIZE;
+
 	return 0;
 }
 
-- 
2.7.4

* [RFC v2 26/83] Add inode_map to track inuse inodes.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (24 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 25/83] Support get normal inode address and inode table extension Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 27/83] Save the inode inuse list to pmem upon umount Andiry Xu
                   ` (57 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA uses a per-CPU inode map to track in-use inodes.
It works in the same way as the allocator; the only difference is that the
inode map tracks in-use inodes, while the free list contains free ranges.
NOVA always tries to allocate the first available inode number.
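
A minimal sketch of the "first available number" policy is shown below.
It is a simplification made for illustration: it keeps the in-use ranges
in a small sorted array instead of an rb-tree, takes no lock, and all
sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_RANGES 8
#define SKETCH_MAX_INO    (1UL << 31)

struct sketch_range {
	unsigned long low, high;	/* inclusive bounds of in-use numbers */
};

struct sketch_map {
	struct sketch_range r[SKETCH_MAX_RANGES];	/* sorted, disjoint */
	size_t n;					/* must be >= 1 */
};

/*
 * Allocate the number just past the first in-use range.
 * Returns 0 and stores the number in *out, or -1 if none is available.
 */
static int sketch_alloc(struct sketch_map *m, unsigned long *out)
{
	unsigned long next_low, new_ino;
	size_t k;

	assert(m->n >= 1);
	next_low = (m->n > 1) ? m->r[1].low : SKETCH_MAX_INO;
	new_ino = m->r[0].high + 1;

	if (m->n > 1 && new_ino + 1 == next_low) {
		/* The new number closes the gap: merge the two ranges. */
		m->r[0].high = m->r[1].high;
		for (k = 1; k + 1 < m->n; k++)
			m->r[k] = m->r[k + 1];
		m->n--;
	} else if (new_ino + 1 < next_low) {
		/* A gap remains: just grow the first range by one. */
		m->r[0].high = new_ino;
	} else {
		return -1;
	}

	*out = new_ino;
	return 0;
}
```

Because the first range always starts at 0, growing it by one hands out
the lowest free number, and merging keeps the number of range nodes small.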

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/inode.c | 190 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/inode.h |   3 +
 fs/nova/nova.h  |  10 +++
 fs/nova/super.c |  44 +++++++++++++
 fs/nova/super.h |   9 +++
 5 files changed, 256 insertions(+)

diff --git a/fs/nova/inode.c b/fs/nova/inode.c
index 4e2842d..7c10d0e 100644
--- a/fs/nova/inode.c
+++ b/fs/nova/inode.c
@@ -29,6 +29,43 @@
 unsigned int blk_type_to_shift[NOVA_BLOCK_TYPE_MAX] = {12, 21, 30};
 uint32_t blk_type_to_size[NOVA_BLOCK_TYPE_MAX] = {0x1000, 0x200000, 0x40000000};
 
+int nova_init_inode_inuse_list(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_range_node *range_node;
+	struct inode_map *inode_map;
+	unsigned long range_high;
+	int i;
+	int ret;
+
+	sbi->s_inodes_used_count = NOVA_NORMAL_INODE_START;
+
+	range_high = NOVA_NORMAL_INODE_START / sbi->cpus;
+	if (NOVA_NORMAL_INODE_START % sbi->cpus)
+		range_high++;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		inode_map = &sbi->inode_maps[i];
+		range_node = nova_alloc_inode_node(sb);
+		if (range_node == NULL)
+			/* FIXME: free allocated memories */
+			return -ENOMEM;
+
+		range_node->range_low = 0;
+		range_node->range_high = range_high;
+		ret = nova_insert_inodetree(sbi, range_node, i);
+		if (ret) {
+			nova_err(sb, "%s failed\n", __func__);
+			nova_free_inode_node(sb, range_node);
+			return ret;
+		}
+		inode_map->num_range_node_inode = 1;
+		inode_map->first_inode_range = range_node;
+	}
+
+	return 0;
+}
+
 static int nova_alloc_inode_table(struct super_block *sb,
 	struct nova_inode_info_header *sih)
 {
@@ -298,3 +335,156 @@ struct inode *nova_iget(struct super_block *sb, unsigned long ino)
 	return ERR_PTR(err);
 }
 
+inline int nova_insert_inodetree(struct nova_sb_info *sbi,
+	struct nova_range_node *new_node, int cpu)
+{
+	struct rb_root *tree;
+	int ret;
+
+	tree = &sbi->inode_maps[cpu].inode_inuse_tree;
+	ret = nova_insert_range_node(tree, new_node);
+	if (ret)
+		nova_dbg("ERROR: %s failed %d\n", __func__, ret);
+
+	return ret;
+}
+
+static inline int nova_search_inodetree(struct nova_sb_info *sbi,
+	unsigned long ino, struct nova_range_node **ret_node)
+{
+	struct rb_root *tree;
+	unsigned long internal_ino;
+	int cpu;
+
+	cpu = ino % sbi->cpus;
+	tree = &sbi->inode_maps[cpu].inode_inuse_tree;
+	internal_ino = ino / sbi->cpus;
+	return nova_find_range_node(sbi, tree, internal_ino, ret_node);
+}
+
+int nova_alloc_unused_inode(struct super_block *sb, int cpuid,
+	unsigned long *ino)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct inode_map *inode_map;
+	struct nova_range_node *i, *next_i;
+	struct rb_node *temp, *next;
+	unsigned long next_range_low;
+	unsigned long new_ino;
+	unsigned long MAX_INODE = 1UL << 31;
+
+	inode_map = &sbi->inode_maps[cpuid];
+	i = inode_map->first_inode_range;
+	NOVA_ASSERT(i);
+
+	temp = &i->node;
+	next = rb_next(temp);
+
+	if (!next) {
+		next_i = NULL;
+		next_range_low = MAX_INODE;
+	} else {
+		next_i = container_of(next, struct nova_range_node, node);
+		next_range_low = next_i->range_low;
+	}
+
+	new_ino = i->range_high + 1;
+
+	if (next_i && new_ino == (next_range_low - 1)) {
+		/* Fill the gap completely */
+		i->range_high = next_i->range_high;
+		rb_erase(&next_i->node, &inode_map->inode_inuse_tree);
+		nova_free_inode_node(sb, next_i);
+		inode_map->num_range_node_inode--;
+	} else if (new_ino < (next_range_low - 1)) {
+		/* Aligns to left */
+		i->range_high = new_ino;
+	} else {
+		nova_dbg("%s: ERROR: new ino %lu, next low %lu\n", __func__,
+			new_ino, next_range_low);
+		return -ENOSPC;
+	}
+
+	*ino = new_ino * sbi->cpus + cpuid;
+	sbi->s_inodes_used_count++;
+	inode_map->allocated++;
+
+	nova_dbg_verbose("Alloc ino %lu\n", *ino);
+	return 0;
+}
+
+int nova_free_inuse_inode(struct super_block *sb, unsigned long ino)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct inode_map *inode_map;
+	struct nova_range_node *i = NULL;
+	struct nova_range_node *curr_node;
+	int found = 0;
+	int cpuid = ino % sbi->cpus;
+	unsigned long internal_ino = ino / sbi->cpus;
+	int ret = 0;
+
+	nova_dbg_verbose("Free inuse ino: %lu\n", ino);
+	inode_map = &sbi->inode_maps[cpuid];
+
+	mutex_lock(&inode_map->inode_table_mutex);
+	found = nova_search_inodetree(sbi, ino, &i);
+	if (!found) {
+		nova_dbg("%s ERROR: ino %lu not found\n", __func__, ino);
+		mutex_unlock(&inode_map->inode_table_mutex);
+		return -EINVAL;
+	}
+
+	if ((internal_ino == i->range_low) && (internal_ino == i->range_high)) {
+		/* fits entire node */
+		rb_erase(&i->node, &inode_map->inode_inuse_tree);
+		nova_free_inode_node(sb, i);
+		inode_map->num_range_node_inode--;
+		goto block_found;
+	}
+	if ((internal_ino == i->range_low) && (internal_ino < i->range_high)) {
+		/* Aligns left */
+		i->range_low = internal_ino + 1;
+		goto block_found;
+	}
+	if ((internal_ino > i->range_low) && (internal_ino == i->range_high)) {
+		/* Aligns right */
+		i->range_high = internal_ino - 1;
+		goto block_found;
+	}
+	if ((internal_ino > i->range_low) && (internal_ino < i->range_high)) {
+		/* Aligns somewhere in the middle */
+		curr_node = nova_alloc_inode_node(sb);
+		NOVA_ASSERT(curr_node);
+		if (curr_node == NULL) {
+			/* returning without freeing the block */
+			goto block_found;
+		}
+		curr_node->range_low = internal_ino + 1;
+		curr_node->range_high = i->range_high;
+
+		i->range_high = internal_ino - 1;
+
+		ret = nova_insert_inodetree(sbi, curr_node, cpuid);
+		if (ret) {
+			nova_free_inode_node(sb, curr_node);
+			goto err;
+		}
+		inode_map->num_range_node_inode++;
+		goto block_found;
+	}
+
+err:
+	nova_error_mng(sb, "Unable to free inode %lu\n", ino);
+	nova_error_mng(sb, "Found inuse block %lu - %lu\n",
+				 i->range_low, i->range_high);
+	mutex_unlock(&inode_map->inode_table_mutex);
+	return ret;
+
+block_found:
+	sbi->s_inodes_used_count--;
+	inode_map->freed++;
+	mutex_unlock(&inode_map->inode_table_mutex);
+	return ret;
+}
+
diff --git a/fs/nova/inode.h b/fs/nova/inode.h
index a88f0a2..497343d 100644
--- a/fs/nova/inode.h
+++ b/fs/nova/inode.h
@@ -221,9 +221,12 @@ static inline int nova_persist_inode(struct nova_inode *pi)
 }
 
 
+int nova_init_inode_inuse_list(struct super_block *sb);
 int nova_init_inode_table(struct super_block *sb);
 int nova_get_inode_address(struct super_block *sb, u64 ino,
 	u64 *pi_addr, int extendable);
 struct inode *nova_iget(struct super_block *sb, unsigned long ino);
+inline int nova_insert_inodetree(struct nova_sb_info *sbi,
+	struct nova_range_node *new_node, int cpu);
 
 #endif
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index aa88d9f..bf4b6ac 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -318,6 +318,16 @@ struct nova_range_node {
 };
 
 #include "bbuild.h"
+
+struct inode_map {
+	struct mutex		inode_table_mutex;
+	struct rb_root		inode_inuse_tree;
+	unsigned long		num_range_node_inode;
+	struct nova_range_node *first_inode_range;
+	int			allocated;
+	int			freed;
+};
+
 #include "balloc.h"
 
 static inline unsigned long
diff --git a/fs/nova/super.c b/fs/nova/super.c
index 32fe29b..9b60873 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -378,6 +378,9 @@ static struct nova_inode *nova_init(struct super_block *sb,
 
 	nova_init_blockmap(sb, 0);
 
+	if (nova_init_inode_inuse_list(sb) < 0)
+		return ERR_PTR(-EINVAL);
+
 	if (nova_init_inode_table(sb) < 0)
 		return ERR_PTR(-EINVAL);
 
@@ -420,6 +423,7 @@ static inline void set_default_opts(struct nova_sb_info *sbi)
 	sbi->head_reserved_blocks = HEAD_RESERVED_BLOCKS;
 	sbi->tail_reserved_blocks = TAIL_RESERVED_BLOCKS;
 	sbi->cpus = num_online_cpus();
+	sbi->map_id = 0;
 }
 
 static void nova_root_check(struct super_block *sb, struct nova_inode *root_pi)
@@ -481,9 +485,11 @@ static int nova_fill_super(struct super_block *sb, void *data, int silent)
 	struct nova_sb_info *sbi = NULL;
 	struct nova_inode *root_pi;
 	struct inode *root_i = NULL;
+	struct inode_map *inode_map;
 	unsigned long blocksize;
 	u32 random = 0;
 	int retval = -EINVAL;
+	int i;
 	timing_t mount_time;
 
 	NOVA_START_TIMING(mount_t, mount_time);
@@ -533,6 +539,21 @@ static int nova_fill_super(struct super_block *sb, void *data, int silent)
 	sbi->gid = current_fsgid();
 	set_opt(sbi->s_mount_opt, HUGEIOREMAP);
 
+	sbi->inode_maps = kcalloc(sbi->cpus, sizeof(struct inode_map),
+					GFP_KERNEL);
+	if (!sbi->inode_maps) {
+		retval = -ENOMEM;
+		nova_dbg("%s: Allocating inode maps failed.",
+			 __func__);
+		goto out;
+	}
+
+	for (i = 0; i < sbi->cpus; i++) {
+		inode_map = &sbi->inode_maps[i];
+		mutex_init(&inode_map->inode_table_mutex);
+		inode_map->inode_inuse_tree = RB_ROOT;
+	}
+
 	mutex_init(&sbi->s_lock);
 
 	sbi->zeroed_page = kzalloc(PAGE_SIZE, GFP_KERNEL);
@@ -625,6 +646,9 @@ static int nova_fill_super(struct super_block *sb, void *data, int silent)
 
 	nova_delete_free_lists(sb);
 
+	kfree(sbi->inode_maps);
+	sbi->inode_maps = NULL;
+
 	kfree(sbi->nova_sb);
 	kfree(sbi);
 	nova_dbg("%s failed: return %d\n", __func__, retval);
@@ -706,6 +730,8 @@ static int nova_remount(struct super_block *sb, int *mntflags, char *data)
 static void nova_put_super(struct super_block *sb)
 {
 	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct inode_map *inode_map;
+	int i;
 
 	if (sbi->virt_addr) {
 		/* Save everything before blocknode mapping! */
@@ -718,6 +744,13 @@ static void nova_put_super(struct super_block *sb)
 	kfree(sbi->zeroed_page);
 	nova_dbgmask = 0;
 
+	for (i = 0; i < sbi->cpus; i++) {
+		inode_map = &sbi->inode_maps[i];
+		nova_dbgv("CPU %d: inode allocated %d, freed %d\n",
+			i, inode_map->allocated, inode_map->freed);
+	}
+
+	kfree(sbi->inode_maps);
 	kfree(sbi->nova_sb);
 	kfree(sbi);
 	sb->s_fs_info = NULL;
@@ -728,6 +761,12 @@ inline void nova_free_range_node(struct nova_range_node *node)
 	kmem_cache_free(nova_range_node_cachep, node);
 }
 
+inline void nova_free_inode_node(struct super_block *sb,
+	struct nova_range_node *node)
+{
+	nova_free_range_node(node);
+}
+
 inline struct nova_range_node *nova_alloc_range_node(struct super_block *sb)
 {
 	struct nova_range_node *p;
@@ -737,6 +776,11 @@ inline struct nova_range_node *nova_alloc_range_node(struct super_block *sb)
 	return p;
 }
 
+inline struct nova_range_node *nova_alloc_inode_node(struct super_block *sb)
+{
+	return nova_alloc_range_node(sb);
+}
+
 static struct inode *nova_alloc_inode(struct super_block *sb)
 {
 	struct nova_inode_info *vi;
diff --git a/fs/nova/super.h b/fs/nova/super.h
index dcafbd8..9772d2f 100644
--- a/fs/nova/super.h
+++ b/fs/nova/super.h
@@ -119,6 +119,12 @@ struct nova_sb_info {
 	/* ZEROED page for cache page initialized */
 	void *zeroed_page;
 
+	/* Per-CPU inode map */
+	struct inode_map	*inode_maps;
+
+	/* Decide new inode map id */
+	unsigned long map_id;
+
 	/* Per-CPU free block list */
 	struct free_list *free_lists;
 	unsigned long per_list_blocks;
@@ -150,6 +156,9 @@ static inline struct nova_super_block *nova_get_super(struct super_block *sb)
 
 extern void nova_error_mng(struct super_block *sb, const char *fmt, ...);
 extern struct nova_range_node *nova_alloc_range_node(struct super_block *sb);
+extern inline struct nova_range_node *nova_alloc_inode_node(struct super_block *sb);
 extern void nova_free_range_node(struct nova_range_node *node);
+extern inline void nova_free_inode_node(struct super_block *sb,
+	struct nova_range_node *node);
 
 #endif
-- 
2.7.4

* [RFC v2 27/83] Save the inode inuse list to pmem upon umount
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (25 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 26/83] Add inode_map to track inuse inodes Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 28/83] Add NOVA address space operations Andiry Xu
                   ` (56 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/bbuild.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/bbuild.h |  1 +
 fs/nova/super.c  |  1 +
 3 files changed, 50 insertions(+)

diff --git a/fs/nova/bbuild.c b/fs/nova/bbuild.c
index 12a2f11..66053cb 100644
--- a/fs/nova/bbuild.c
+++ b/fs/nova/bbuild.c
@@ -116,6 +116,54 @@ static u64 nova_save_free_list_blocknodes(struct super_block *sb, int cpu,
 	return temp_tail;
 }
 
+void nova_save_inode_list_to_log(struct super_block *sb)
+{
+	struct nova_inode *pi = nova_get_inode_by_ino(sb, NOVA_INODELIST_INO);
+	struct nova_inode_info_header sih;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	unsigned long num_blocks;
+	unsigned long num_nodes = 0;
+	struct inode_map *inode_map;
+	unsigned long i;
+	u64 temp_tail;
+	u64 new_block;
+	int allocated;
+
+	sih.ino = NOVA_INODELIST_INO;
+	sih.i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+	sih.i_blocks = 0;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		inode_map = &sbi->inode_maps[i];
+		num_nodes += inode_map->num_range_node_inode;
+	}
+
+	num_blocks = num_nodes / RANGENODE_PER_PAGE;
+	if (num_nodes % RANGENODE_PER_PAGE)
+		num_blocks++;
+
+	allocated = nova_allocate_inode_log_pages(sb, &sih, num_blocks,
+						&new_block, ANY_CPU, 0);
+	if (allocated != num_blocks) {
+		nova_dbg("Error saving inode list: %d\n", allocated);
+		return;
+	}
+
+	temp_tail = new_block;
+	for (i = 0; i < sbi->cpus; i++) {
+		inode_map = &sbi->inode_maps[i];
+		temp_tail = nova_save_range_nodes_to_log(sb,
+				&inode_map->inode_inuse_tree, temp_tail, i);
+	}
+
+	pi->log_head = new_block;
+	nova_update_tail(pi, temp_tail);
+	nova_flush_buffer(&pi->log_head, CACHELINE_SIZE, 0);
+
+	nova_dbg("%s: %lu inode nodes, pi head 0x%llx, tail 0x%llx\n",
+		__func__, num_nodes, pi->log_head, pi->log_tail);
+}
+
 void nova_save_blocknode_mappings_to_log(struct super_block *sb)
 {
 	struct nova_inode *pi = nova_get_inode_by_ino(sb, NOVA_BLOCKNODE_INO);
diff --git a/fs/nova/bbuild.h b/fs/nova/bbuild.h
index 59cc379..5d2b5f0 100644
--- a/fs/nova/bbuild.h
+++ b/fs/nova/bbuild.h
@@ -3,6 +3,7 @@
 
 void nova_init_header(struct super_block *sb,
 	struct nova_inode_info_header *sih, u16 i_mode);
+void nova_save_inode_list_to_log(struct super_block *sb);
 void nova_save_blocknode_mappings_to_log(struct super_block *sb);
 
 #endif
diff --git a/fs/nova/super.c b/fs/nova/super.c
index 9b60873..69e4afc 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -734,6 +734,7 @@ static void nova_put_super(struct super_block *sb)
 	int i;
 
 	if (sbi->virt_addr) {
+		nova_save_inode_list_to_log(sb);
 		/* Save everything before blocknode mapping! */
 		nova_save_blocknode_mappings_to_log(sb);
 		sbi->virt_addr = NULL;
-- 
2.7.4

* [RFC v2 28/83] Add NOVA address space operations
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (26 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 27/83] Save the inode inuse list to pmem upon umount Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 29/83] Add write_inode and dirty_inode routines Andiry Xu
                   ` (55 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

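Add the DAX address space operations: writepages is forwarded to
dax_writeback_mapping_range(), and direct_IO returns -EIO because
DAX does not support direct IO.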

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/inode.c | 24 ++++++++++++++++++++++++
 fs/nova/inode.h |  1 +
 2 files changed, 25 insertions(+)

diff --git a/fs/nova/inode.c b/fs/nova/inode.c
index 7c10d0e..a30b6aa 100644
--- a/fs/nova/inode.c
+++ b/fs/nova/inode.c
@@ -175,6 +175,7 @@ static int nova_read_inode(struct super_block *sb, struct inode *inode,
 	}
 
 	inode->i_blocks = sih->i_blocks;
+	inode->i_mapping->a_ops = &nova_aops_dax;
 
 	switch (inode->i_mode & S_IFMT) {
 	case S_IFREG:
@@ -488,3 +489,26 @@ int nova_free_inuse_inode(struct super_block *sb, unsigned long ino)
 	return ret;
 }
 
+static ssize_t nova_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
+{
+	/* DAX does not support direct IO */
+	return -EIO;
+}
+
+static int nova_writepages(struct address_space *mapping,
+	struct writeback_control *wbc)
+{
+	int ret;
+	timing_t wp_time;
+
+	NOVA_START_TIMING(write_pages_t, wp_time);
+	ret = dax_writeback_mapping_range(mapping,
+			mapping->host->i_sb->s_bdev, wbc);
+	NOVA_END_TIMING(write_pages_t, wp_time);
+	return ret;
+}
+
+const struct address_space_operations nova_aops_dax = {
+	.writepages		= nova_writepages,
+	.direct_IO		= nova_direct_IO,
+};
diff --git a/fs/nova/inode.h b/fs/nova/inode.h
index 497343d..e00b3b9 100644
--- a/fs/nova/inode.h
+++ b/fs/nova/inode.h
@@ -221,6 +221,7 @@ static inline int nova_persist_inode(struct nova_inode *pi)
 }
 
 
+extern const struct address_space_operations nova_aops_dax;
 int nova_init_inode_inuse_list(struct super_block *sb);
 int nova_init_inode_table(struct super_block *sb);
 int nova_get_inode_address(struct super_block *sb, u64 ino,
-- 
2.7.4

* [RFC v2 29/83] Add write_inode and dirty_inode routines.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (27 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 28/83] Add NOVA address space operations Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 30/83] New NOVA inode allocation Andiry Xu
                   ` (54 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/inode.c | 33 +++++++++++++++++++++++++++++++++
 fs/nova/inode.h |  2 ++
 fs/nova/super.c |  2 ++
 3 files changed, 37 insertions(+)

diff --git a/fs/nova/inode.c b/fs/nova/inode.c
index a30b6aa..29d172a 100644
--- a/fs/nova/inode.c
+++ b/fs/nova/inode.c
@@ -489,6 +489,39 @@ int nova_free_inuse_inode(struct super_block *sb, unsigned long ino)
 	return ret;
 }
 
+int nova_write_inode(struct inode *inode, struct writeback_control *wbc)
+{
+	/* write_inode should never be called because we always keep our inodes
+	 * clean. So let us know if write_inode ever gets called.
+	 */
+//	BUG();
+	return 0;
+}
+
+/*
+ * dirty_inode() is called from mark_inode_dirty_sync().
+ * Usually dirty_inode() should not be called because NOVA always keeps its
+ * inodes clean. The only exception is touch_atime(), which calls
+ * dirty_inode() to update the i_atime field.
+ */
+void nova_dirty_inode(struct inode *inode, int flags)
+{
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *pi;
+
+	pi = nova_get_block(sb, sih->pi_addr);
+
+	/* only i_atime should have changed if at all.
+	 * we can do in-place atomic update
+	 */
+	pi->i_atime = cpu_to_le32(inode->i_atime.tv_sec);
+	nova_persist_inode(pi);
+	/* Relax atime persistency */
+	nova_flush_buffer(&pi->i_atime, sizeof(pi->i_atime), 0);
+}
+
 static ssize_t nova_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
 	/* DAX does not support direct IO */
diff --git a/fs/nova/inode.h b/fs/nova/inode.h
index e00b3b9..f9f5c14 100644
--- a/fs/nova/inode.h
+++ b/fs/nova/inode.h
@@ -229,5 +229,7 @@ int nova_get_inode_address(struct super_block *sb, u64 ino,
 struct inode *nova_iget(struct super_block *sb, unsigned long ino);
 inline int nova_insert_inodetree(struct nova_sb_info *sbi,
 	struct nova_range_node *new_node, int cpu);
+extern int nova_write_inode(struct inode *inode, struct writeback_control *wbc);
+extern void nova_dirty_inode(struct inode *inode, int flags);
 
 #endif
diff --git a/fs/nova/super.c b/fs/nova/super.c
index 69e4afc..c0427fd 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -861,6 +861,8 @@ static void destroy_rangenode_cache(void)
 static struct super_operations nova_sops = {
 	.alloc_inode	= nova_alloc_inode,
 	.destroy_inode	= nova_destroy_inode,
+	.write_inode	= nova_write_inode,
+	.dirty_inode	= nova_dirty_inode,
 	.put_super	= nova_put_super,
 	.statfs		= nova_statfs,
 	.remount_fs	= nova_remount,
-- 
2.7.4

* [RFC v2 30/83] New NOVA inode allocation.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (28 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 29/83] Add write_inode and dirty_inode routines Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 31/83] Add new vfs " Andiry Xu
                   ` (53 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Allocate the new inode in a round-robin way.
Extend the inode table if needed.
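
The round-robin choice of inode map can be sketched as below. This is an
illustration only: the struct and function names are hypothetical, and
the sketch leaves out all locking around the shared cursor.

```c
#include <assert.h>

struct sketch_sb {
	unsigned long map_id;	/* next inode map to allocate from */
	unsigned int cpus;	/* number of per-CPU inode maps */
};

/* Return the map to allocate from, then advance the cursor. */
static unsigned long sketch_next_map(struct sketch_sb *sbi)
{
	unsigned long id;

	assert(sbi->cpus > 0);
	id = sbi->map_id;
	sbi->map_id = (sbi->map_id + 1) % sbi->cpus;
	return id;
}
```

Successive calls cycle through maps 0, 1, ..., cpus - 1 and wrap around,
so new inodes are spread evenly across the per-CPU inode tables.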

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/inode.c | 40 ++++++++++++++++++++++++++++++++++++++++
 fs/nova/inode.h |  1 +
 2 files changed, 41 insertions(+)

diff --git a/fs/nova/inode.c b/fs/nova/inode.c
index 29d172a..e4b8960 100644
--- a/fs/nova/inode.c
+++ b/fs/nova/inode.c
@@ -489,6 +489,46 @@ int nova_free_inuse_inode(struct super_block *sb, unsigned long ino)
 	return ret;
 }
 
+/* Returns 0 on failure */
+u64 nova_new_nova_inode(struct super_block *sb, u64 *pi_addr)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct inode_map *inode_map;
+	unsigned long free_ino = 0;
+	int map_id;
+	u64 ino = 0;
+	int ret;
+	timing_t new_inode_time;
+
+	NOVA_START_TIMING(new_nova_inode_t, new_inode_time);
+	map_id = sbi->map_id;
+	sbi->map_id = (sbi->map_id + 1) % sbi->cpus;
+
+	inode_map = &sbi->inode_maps[map_id];
+
+	mutex_lock(&inode_map->inode_table_mutex);
+	ret = nova_alloc_unused_inode(sb, map_id, &free_ino);
+	if (ret) {
+		nova_dbg("%s: alloc inode number failed %d\n", __func__, ret);
+		mutex_unlock(&inode_map->inode_table_mutex);
+		return 0;
+	}
+
+	ret = nova_get_inode_address(sb, free_ino, pi_addr, 1);
+	if (ret) {
+		nova_dbg("%s: get inode address failed %d\n", __func__, ret);
+		mutex_unlock(&inode_map->inode_table_mutex);
+		return 0;
+	}
+
+	mutex_unlock(&inode_map->inode_table_mutex);
+
+	ino = free_ino;
+
+	NOVA_END_TIMING(new_nova_inode_t, new_inode_time);
+	return ino;
+}
+
 int nova_write_inode(struct inode *inode, struct writeback_control *wbc)
 {
 	/* write_inode should never be called because we always keep our inodes
diff --git a/fs/nova/inode.h b/fs/nova/inode.h
index f9f5c14..fc1876c 100644
--- a/fs/nova/inode.h
+++ b/fs/nova/inode.h
@@ -229,6 +229,7 @@ int nova_get_inode_address(struct super_block *sb, u64 ino,
 struct inode *nova_iget(struct super_block *sb, unsigned long ino);
 inline int nova_insert_inodetree(struct nova_sb_info *sbi,
 	struct nova_range_node *new_node, int cpu);
+u64 nova_new_nova_inode(struct super_block *sb, u64 *pi_addr);
 extern int nova_write_inode(struct inode *inode, struct writeback_control *wbc);
 extern void nova_dirty_inode(struct inode *inode, int flags);
 
-- 
2.7.4


* [RFC v2 31/83] Add new vfs inode allocation.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (29 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 30/83] New NOVA inode allocation Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 32/83] Add log entry definitions Andiry Xu
                   ` (52 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

This routine allocates and initializes a new vfs inode, and sets up
the attributes of the corresponding NOVA inode and inode_info.
Inode operations are not hooked up yet.
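
One of the attributes copied to the NOVA inode is the flag word. The
sketch below shows the general pattern that nova_get_inode_flags()
follows in this patch: clear the persistent bits that shadow VFS flags,
then set each one again only if the in-memory inode has it. All names
and bit values here (DEMO_*) are placeholders chosen for the example;
they are not the kernel's S_* or FS_*_FL values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit values, for illustration only. */
#define DEMO_S_SYNC		0x01u
#define DEMO_S_APPEND		0x02u
#define DEMO_S_IMMUTABLE	0x04u
#define DEMO_S_NOATIME		0x08u
#define DEMO_S_DIRSYNC		0x10u

#define DEMO_FL_SYNC		0x0100u
#define DEMO_FL_APPEND		0x0200u
#define DEMO_FL_IMMUTABLE	0x0400u
#define DEMO_FL_NOATIME		0x0800u
#define DEMO_FL_DIRSYNC		0x1000u
#define DEMO_FL_OTHER		0x8000u	/* an unrelated persistent flag */

/* Pairs of (in-memory flag, persistent flag) that shadow each other. */
static const struct {
	uint32_t vfs;
	uint32_t disk;
} demo_flag_map[] = {
	{ DEMO_S_SYNC,		DEMO_FL_SYNC },
	{ DEMO_S_APPEND,	DEMO_FL_APPEND },
	{ DEMO_S_IMMUTABLE,	DEMO_FL_IMMUTABLE },
	{ DEMO_S_NOATIME,	DEMO_FL_NOATIME },
	{ DEMO_S_DIRSYNC,	DEMO_FL_DIRSYNC },
};

/* Return the persistent flag word updated from the in-memory flags. */
static uint32_t demo_store_flags(uint32_t vfs_flags, uint32_t disk_flags)
{
	size_t n = sizeof(demo_flag_map) / sizeof(demo_flag_map[0]);
	size_t i;

	for (i = 0; i < n; i++) {
		disk_flags &= ~demo_flag_map[i].disk;	/* drop stale bit */
		if (vfs_flags & demo_flag_map[i].vfs)
			disk_flags |= demo_flag_map[i].disk;
	}
	return disk_flags;
}
```

Bits outside the mapped set (DEMO_FL_OTHER above) pass through
unchanged, which is why the clear step masks only the mapped bits.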

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/inode.c | 144 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/nova/inode.h |   3 ++
 2 files changed, 146 insertions(+), 1 deletion(-)

diff --git a/fs/nova/inode.c b/fs/nova/inode.c
index e4b8960..15517cc 100644
--- a/fs/nova/inode.c
+++ b/fs/nova/inode.c
@@ -363,7 +363,49 @@ static inline int nova_search_inodetree(struct nova_sb_info *sbi,
 	return nova_find_range_node(sbi, tree, internal_ino, ret_node);
 }
 
-int nova_alloc_unused_inode(struct super_block *sb, int cpuid,
+static void nova_get_inode_flags(struct inode *inode, struct nova_inode *pi)
+{
+	unsigned int flags = inode->i_flags;
+	unsigned int nova_flags = le32_to_cpu(pi->i_flags);
+
+	nova_flags &= ~(FS_SYNC_FL | FS_APPEND_FL | FS_IMMUTABLE_FL |
+			 FS_NOATIME_FL | FS_DIRSYNC_FL);
+	if (flags & S_SYNC)
+		nova_flags |= FS_SYNC_FL;
+	if (flags & S_APPEND)
+		nova_flags |= FS_APPEND_FL;
+	if (flags & S_IMMUTABLE)
+		nova_flags |= FS_IMMUTABLE_FL;
+	if (flags & S_NOATIME)
+		nova_flags |= FS_NOATIME_FL;
+	if (flags & S_DIRSYNC)
+		nova_flags |= FS_DIRSYNC_FL;
+
+	pi->i_flags = cpu_to_le32(nova_flags);
+}
+
+static void nova_init_inode(struct inode *inode, struct nova_inode *pi)
+{
+	pi->i_mode = cpu_to_le16(inode->i_mode);
+	pi->i_uid = cpu_to_le32(i_uid_read(inode));
+	pi->i_gid = cpu_to_le32(i_gid_read(inode));
+	pi->i_links_count = cpu_to_le16(inode->i_nlink);
+	pi->i_size = cpu_to_le64(inode->i_size);
+	pi->i_atime = cpu_to_le32(inode->i_atime.tv_sec);
+	pi->i_ctime = cpu_to_le32(inode->i_ctime.tv_sec);
+	pi->i_mtime = cpu_to_le32(inode->i_mtime.tv_sec);
+	pi->i_generation = cpu_to_le32(inode->i_generation);
+	pi->log_head = 0;
+	pi->log_tail = 0;
+	pi->deleted = 0;
+	pi->delete_epoch_id = 0;
+	nova_get_inode_flags(inode, pi);
+
+	if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode))
+		pi->dev.rdev = cpu_to_le32(inode->i_rdev);
+}
+
+static int nova_alloc_unused_inode(struct super_block *sb, int cpuid,
 	unsigned long *ino)
 {
 	struct nova_sb_info *sbi = NOVA_SB(sb);
@@ -529,6 +571,106 @@ u64 nova_new_nova_inode(struct super_block *sb, u64 *pi_addr)
 	return ino;
 }
 
+struct inode *nova_new_vfs_inode(enum nova_new_inode_type type,
+	struct inode *dir, u64 pi_addr, u64 ino, umode_t mode,
+	size_t size, dev_t rdev, const struct qstr *qstr, u64 epoch_id)
+{
+	struct super_block *sb;
+	struct nova_sb_info *sbi;
+	struct inode *inode;
+	struct nova_inode *diri = NULL;
+	struct nova_inode_info *si;
+	struct nova_inode_info_header *sih = NULL;
+	struct nova_inode *pi;
+	int errval;
+	timing_t new_inode_time;
+
+	NOVA_START_TIMING(new_vfs_inode_t, new_inode_time);
+	sb = dir->i_sb;
+	sbi = (struct nova_sb_info *)sb->s_fs_info;
+	inode = new_inode(sb);
+	if (!inode) {
+		errval = -ENOMEM;
+		goto fail2;
+	}
+
+	inode_init_owner(inode, dir, mode);
+	inode->i_blocks = inode->i_size = 0;
+	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
+
+	inode->i_generation = atomic_add_return(1, &sbi->next_generation);
+	inode->i_size = size;
+
+	diri = nova_get_inode(sb, dir);
+	if (!diri) {
+		errval = -EACCES;
+		goto fail1;
+	}
+
+	pi = (struct nova_inode *)nova_get_block(sb, pi_addr);
+	nova_dbg_verbose("%s: allocating inode %llu @ 0x%llx\n",
+					__func__, ino, pi_addr);
+
+	/* chosen inode is in ino */
+	inode->i_ino = ino;
+
+	switch (type) {
+	case TYPE_CREATE:
+		inode->i_mapping->a_ops = &nova_aops_dax;
+		break;
+	case TYPE_MKNOD:
+		init_special_inode(inode, mode, rdev);
+		break;
+	case TYPE_SYMLINK:
+		inode->i_mapping->a_ops = &nova_aops_dax;
+		break;
+	case TYPE_MKDIR:
+		inode->i_mapping->a_ops = &nova_aops_dax;
+		set_nlink(inode, 2);
+		break;
+	default:
+		nova_dbg("Unknown new inode type %d\n", type);
+		break;
+	}
+
+	/*
+	 * Pi is part of the dir log so no transaction is needed,
+	 * but we need to flush to NVMM.
+	 */
+	pi->i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+	pi->i_flags = nova_mask_flags(mode, diri->i_flags);
+	pi->nova_ino = ino;
+	pi->i_create_time = current_time(inode).tv_sec;
+	pi->create_epoch_id = epoch_id;
+	nova_init_inode(inode, pi);
+
+	si = NOVA_I(inode);
+	sih = &si->header;
+	nova_init_header(sb, sih, inode->i_mode);
+	sih->pi_addr = pi_addr;
+	sih->ino = ino;
+	sih->i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+
+	nova_set_inode_flags(inode, pi, le32_to_cpu(pi->i_flags));
+	sih->i_flags = le32_to_cpu(pi->i_flags);
+
+	if (insert_inode_locked(inode) < 0) {
+		nova_err(sb, "nova_new_inode failed ino %lx\n", inode->i_ino);
+		errval = -EINVAL;
+		goto fail1;
+	}
+
+	nova_flush_buffer(pi, NOVA_INODE_SIZE, 0);
+	NOVA_END_TIMING(new_vfs_inode_t, new_inode_time);
+	return inode;
+fail1:
+	make_bad_inode(inode);
+	iput(inode);
+fail2:
+	NOVA_END_TIMING(new_vfs_inode_t, new_inode_time);
+	return ERR_PTR(errval);
+}
+
 int nova_write_inode(struct inode *inode, struct writeback_control *wbc)
 {
 	/* write_inode should never be called because we always keep our inodes
diff --git a/fs/nova/inode.h b/fs/nova/inode.h
index fc1876c..943f77f 100644
--- a/fs/nova/inode.h
+++ b/fs/nova/inode.h
@@ -230,6 +230,9 @@ struct inode *nova_iget(struct super_block *sb, unsigned long ino);
 inline int nova_insert_inodetree(struct nova_sb_info *sbi,
 	struct nova_range_node *new_node, int cpu);
 u64 nova_new_nova_inode(struct super_block *sb, u64 *pi_addr);
+struct inode *nova_new_vfs_inode(enum nova_new_inode_type type,
+	struct inode *dir, u64 pi_addr, u64 ino, umode_t mode,
+	size_t size, dev_t rdev, const struct qstr *qstr, u64 epoch_id);
 extern int nova_write_inode(struct inode *inode, struct writeback_control *wbc);
 extern void nova_dirty_inode(struct inode *inode, int flags);
 
-- 
2.7.4


* [RFC v2 32/83] Add log entry definitions.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (30 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 31/83] Add new vfs " Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 33/83] Inode log and entry printing for debug purpose Andiry Xu
                   ` (51 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA appends a log entry to the inode log on each metadata change.

NOVA has four kinds of log entries:

* File write entry: describes a write to a contiguous range of pmem pages.
* Dentry: describes a file/directory being added to or removed from a
  directory.
* Setattr entry: updates inode attributes.
* Link change entry: describes link changes to an inode, e.g. link/unlink.

All of them are aligned to 8 bytes.
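
Dentries are the only variable-length entries, so the 8-byte alignment
has to be computed per name. The userspace sketch below reproduces the
arithmetic of NOVA_DIR_LOG_REC_LEN from this patch (48-byte fixed
header, name plus NUL terminator, rounded up to a multiple of 8). The
DEMO_* names and demo_dir_log_rec_len() are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_DIR_PAD		8	/* align to an 8-byte boundary */
#define DEMO_DIR_ROUND		(DEMO_DIR_PAD - 1)
#define DEMO_DENTRY_HEADER_LEN	48	/* fixed fields before name[] */

/* Length of a dentry log record holding a name of name_len bytes. */
static size_t demo_dir_log_rec_len(size_t name_len)
{
	return ((name_len + 1) + DEMO_DENTRY_HEADER_LEN + DEMO_DIR_ROUND)
		& ~(size_t)DEMO_DIR_ROUND;
}
```

For example, a 1-byte name and a 7-byte name both need 56 bytes, while
an 8-byte name spills into the next 8-byte unit and needs 64.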

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/log.h | 180 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 180 insertions(+)

diff --git a/fs/nova/log.h b/fs/nova/log.h
index 2bc131f..6b4a085 100644
--- a/fs/nova/log.h
+++ b/fs/nova/log.h
@@ -62,6 +62,175 @@ static inline void nova_set_entry_type(void *p, enum nova_entry_type type)
 	*(u8 *)p = type;
 }
 
+/*
+ * Write log entry.  Records a write to a contiguous range of PMEM pages.
+ *
+ * Documentation/filesystems/nova.txt contains descriptions of some fields.
+ */
+struct nova_file_write_entry {
+	u8	entry_type;
+	u8	reassigned;	/* Data is not latest */
+	u8	padding[2];
+	__le32	num_pages;
+	__le64	block;          /* offset of first block in this write */
+	__le64	pgoff;          /* file offset at the beginning of this write */
+	__le32	invalid_pages;	/* For GC */
+	/* For both ctime and mtime */
+	__le32	mtime;
+	__le64	size;           /* Write size for non-aligned writes */
+	__le64	epoch_id;
+	__le64	trans_id;
+	__le32	csumpadding;
+	__le32	csum;
+} __attribute((__packed__));
+
+#define WENTRY(entry)	((struct nova_file_write_entry *) entry)
+
+/* List of file write entries */
+struct nova_file_write_item {
+	struct nova_file_write_entry	entry;
+	struct list_head		list;
+};
+
+/*
+ * Log entry for adding a file/directory to a directory.
+ *
+ * Update NOVA_DENTRY_HEADER_LEN if you modify this struct!
+ */
+struct nova_dentry {
+	u8	entry_type;
+	u8	name_len;		/* length of the dentry name */
+	u8	reassigned;		/* Currently deleted */
+	u8	invalid;		/* Invalid now? */
+	__le16	de_len;			/* length of this dentry */
+	__le16	links_count;
+	__le32	mtime;			/* For both mtime and ctime */
+	__le32	csum;			/* entry checksum */
+	__le64	ino;			/* inode no pointed to by this entry */
+	__le64	padding;
+	__le64	epoch_id;
+	__le64	trans_id;
+	char	name[NOVA_NAME_LEN + 1];	/* File name */
+} __attribute((__packed__));
+
+#define DENTRY(entry)	((struct nova_dentry *) entry)
+
+#define NOVA_DIR_PAD			8	/* Align to 8 bytes boundary */
+#define NOVA_DIR_ROUND			(NOVA_DIR_PAD - 1)
+#define NOVA_DENTRY_HEADER_LEN		48
+#define NOVA_DIR_LOG_REC_LEN(name_len) \
+	(((name_len + 1) + NOVA_DENTRY_HEADER_LEN \
+	 + NOVA_DIR_ROUND) & ~NOVA_DIR_ROUND)
+
+#define NOVA_MAX_ENTRY_LEN		NOVA_DIR_LOG_REC_LEN(NOVA_NAME_LEN)
+
+/*
+ * Log entry for updating file attributes.
+ */
+struct nova_setattr_logentry {
+	u8	entry_type;
+	u8	attr;       /* bitmap of which attributes to update */
+	__le16	mode;
+	__le32	uid;
+	__le32	gid;
+	__le32	atime;
+	__le32	mtime;
+	__le32	ctime;
+	__le64	size;        /* File size after truncation */
+	__le64	epoch_id;
+	__le64	trans_id;
+	u8	invalid;
+	u8	paddings[3];
+	__le32	csum;
+} __attribute((__packed__));
+
+#define SENTRY(entry)	((struct nova_setattr_logentry *) entry)
+
+/* Link change log entry.
+ *
+ * TODO: Do we need this to be 32 bytes?
+ */
+struct nova_link_change_entry {
+	u8	entry_type;
+	u8	invalid;
+	__le16	links;
+	__le32	ctime;
+	__le32	flags;
+	__le32	generation;    /* for NFS handles */
+	__le64	epoch_id;
+	__le64	trans_id;
+	__le32	csumpadding;
+	__le32	csum;
+} __attribute((__packed__));
+
+#define LCENTRY(entry)	((struct nova_link_change_entry *) entry)
+
+
+/*
+ * Transient DRAM structure that describes changes needed to append a log entry
+ * to an inode
+ */
+struct nova_inode_update {
+	u64 head;
+	u64 tail;
+	u64 curr_entry;
+	struct nova_dentry *create_dentry;
+	struct nova_dentry *delete_dentry;
+};
+
+
+/*
+ * Transient DRAM structure to parameterize the creation of a log entry.
+ */
+struct nova_log_entry_info {
+	enum nova_entry_type type;
+	struct iattr *attr;
+	struct nova_inode_update *update;
+	void *data;	/* struct dentry */
+	u64 epoch_id;
+	u64 trans_id;
+	u64 curr_p;	/* output */
+	u64 file_size;	/* de_len for dentry */
+	u64 ino;
+	u32 time;
+	int link_change;
+	int inplace;	/* For file write entry */
+};
+
+
+
+static inline size_t nova_get_log_entry_size(struct super_block *sb,
+	enum nova_entry_type type)
+{
+	size_t size = 0;
+
+	switch (type) {
+	case FILE_WRITE:
+		size = sizeof(struct nova_file_write_entry);
+		break;
+	case DIR_LOG:
+		size = NOVA_DENTRY_HEADER_LEN;
+		break;
+	case SET_ATTR:
+		size = sizeof(struct nova_setattr_logentry);
+		break;
+	case LINK_CHANGE:
+		size = sizeof(struct nova_link_change_entry);
+		break;
+	default:
+		break;
+	}
+
+	return size;
+}
+
+static inline void nova_persist_entry(void *entry)
+{
+	size_t entry_len = CACHELINE_SIZE;
+
+	nova_flush_buffer(entry, entry_len, 0);
+}
+
 static inline u64 next_log_page(struct super_block *sb, u64 curr)
 {
 	struct nova_inode_log_page *curr_page;
@@ -183,6 +352,17 @@ static inline bool goto_next_page(struct super_block *sb, u64 curr_p)
 	return false;
 }
 
+static inline int is_dir_init_entry(struct super_block *sb,
+	struct nova_dentry *entry)
+{
+	if (entry->name_len == 1 && strncmp(entry->name, ".", 1) == 0)
+		return 1;
+	if (entry->name_len == 2 && strncmp(entry->name, "..", 2) == 0)
+		return 1;
+
+	return 0;
+}
+
 
 int nova_allocate_inode_log_pages(struct super_block *sb,
 	struct nova_inode_info_header *sih, unsigned long num_pages,
-- 
2.7.4


* [RFC v2 33/83] Inode log and entry printing for debug purpose.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (31 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 32/83] Add log entry definitions Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 34/83] Journal: NOVA light weight journal definitions Andiry Xu
                   ` (50 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/nova.h  |   3 +
 fs/nova/stats.c | 234 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 237 insertions(+)

diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index bf4b6ac..03c4991 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -367,6 +367,9 @@ void nova_get_IO_stats(void);
 void nova_print_timing_stats(struct super_block *sb);
 void nova_clear_stats(struct super_block *sb);
 void nova_print_inode(struct nova_inode *pi);
+void nova_print_inode_log(struct super_block *sb, struct inode *inode);
+void nova_print_inode_log_pages(struct super_block *sb, struct inode *inode);
+int nova_check_inode_logs(struct super_block *sb, struct nova_inode *pi);
 void nova_print_free_lists(struct super_block *sb);
 
 #endif /* __NOVA_H */
diff --git a/fs/nova/stats.c b/fs/nova/stats.c
index 9ddd267..990e964 100644
--- a/fs/nova/stats.c
+++ b/fs/nova/stats.c
@@ -333,6 +333,240 @@ void nova_print_inode(struct nova_inode *pi)
 		pi->create_epoch_id, pi->delete_epoch_id);
 }
 
+static inline void nova_print_file_write_entry(struct super_block *sb,
+	u64 curr, struct nova_file_write_entry *entry)
+{
+	nova_dbg("file write entry @ 0x%llx: epoch %llu, trans %llu, "
+			"pgoff %llu, pages %u, blocknr %llu, reassigned %u, "
+			"invalid count %u, size %llu, mtime %u\n",
+			curr, entry->epoch_id, entry->trans_id,
+			entry->pgoff, entry->num_pages,
+			entry->block >> PAGE_SHIFT,
+			entry->reassigned,
+			entry->invalid_pages, entry->size, entry->mtime);
+}
+
+static inline void nova_print_set_attr_entry(struct super_block *sb,
+	u64 curr, struct nova_setattr_logentry *entry)
+{
+	nova_dbg("set attr entry @ 0x%llx: epoch %llu, trans %llu, invalid %u, "
+			"mode %u, size %llu, atime %u, mtime %u, ctime %u\n",
+			curr, entry->epoch_id, entry->trans_id,
+			entry->invalid, entry->mode,
+			entry->size, entry->atime, entry->mtime, entry->ctime);
+}
+
+static inline void nova_print_link_change_entry(struct super_block *sb,
+	u64 curr, struct nova_link_change_entry *entry)
+{
+	nova_dbg("link change entry @ 0x%llx: epoch %llu, trans %llu, "
+			"invalid %u, links %u, flags %u, ctime %u\n",
+			curr, entry->epoch_id, entry->trans_id,
+			entry->invalid, entry->links,
+			entry->flags, entry->ctime);
+}
+
+static inline size_t nova_print_dentry(struct super_block *sb,
+	u64 curr, struct nova_dentry *entry)
+{
+	nova_dbg("dir logentry @ 0x%llx: epoch %llu, trans %llu, "
+			"reassigned %u, invalid %u, inode %llu, links %u, "
+			"namelen %u, rec len %u, name %s, mtime %u\n",
+			curr, entry->epoch_id, entry->trans_id,
+			entry->reassigned, entry->invalid,
+			le64_to_cpu(entry->ino),
+			entry->links_count, entry->name_len,
+			le16_to_cpu(entry->de_len), entry->name,
+			entry->mtime);
+
+	return le16_to_cpu(entry->de_len);
+}
+
+u64 nova_print_log_entry(struct super_block *sb, u64 curr)
+{
+	void *addr;
+	size_t size;
+	u8 type;
+
+	addr = (void *)nova_get_block(sb, curr);
+	type = nova_get_entry_type(addr);
+	switch (type) {
+	case SET_ATTR:
+		nova_print_set_attr_entry(sb, curr, addr);
+		curr += sizeof(struct nova_setattr_logentry);
+		break;
+	case LINK_CHANGE:
+		nova_print_link_change_entry(sb, curr, addr);
+		curr += sizeof(struct nova_link_change_entry);
+		break;
+	case FILE_WRITE:
+		nova_print_file_write_entry(sb, curr, addr);
+		curr += sizeof(struct nova_file_write_entry);
+		break;
+	case DIR_LOG:
+		size = nova_print_dentry(sb, curr, addr);
+		curr += size;
+		if (size == 0) {
+			nova_dbg("%s: dentry with size 0 @ 0x%llx\n",
+					__func__, curr);
+			curr += sizeof(struct nova_file_write_entry);
+			NOVA_ASSERT(0);
+		}
+		break;
+	case NEXT_PAGE:
+		nova_dbg("%s: next page sign @ 0x%llx\n", __func__, curr);
+		curr = PAGE_TAIL(curr);
+		break;
+	default:
+		nova_dbg("%s: unknown type %d, 0x%llx\n", __func__, type, curr);
+		curr += sizeof(struct nova_file_write_entry);
+		NOVA_ASSERT(0);
+		break;
+	}
+
+	return curr;
+}
+
+void nova_print_curr_log_page(struct super_block *sb, u64 curr)
+{
+	struct nova_inode_page_tail *tail;
+	u64 start, end;
+
+	start = BLOCK_OFF(curr);
+	end = PAGE_TAIL(curr);
+
+	while (start < end)
+		start = nova_print_log_entry(sb, start);
+
+	tail = nova_get_block(sb, end);
+	nova_dbg("Page tail. curr 0x%llx, next page 0x%llx, %u entries, %u invalid\n",
+			start, tail->next_page,
+			tail->num_entries, tail->invalid_entries);
+}
+
+void nova_print_nova_log(struct super_block *sb,
+	struct nova_inode_info_header *sih)
+{
+	u64 curr;
+
+	if (sih->log_tail == 0 || sih->log_head == 0)
+		return;
+
+	curr = sih->log_head;
+	nova_dbg("Pi %lu: log head 0x%llx, tail 0x%llx\n",
+			sih->ino, curr, sih->log_tail);
+	while (curr != sih->log_tail) {
+		if ((curr & (PAGE_SIZE - 1)) == LOG_BLOCK_TAIL) {
+			struct nova_inode_page_tail *tail =
+					nova_get_block(sb, curr);
+			nova_dbg("Log tail, curr 0x%llx, next page 0x%llx, "
+					"%u entries, %u invalid\n",
+					curr, tail->next_page,
+					tail->num_entries,
+					tail->invalid_entries);
+			curr = tail->next_page;
+		} else {
+			curr = nova_print_log_entry(sb, curr);
+		}
+	}
+}
+
+void nova_print_inode_log(struct super_block *sb, struct inode *inode)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+
+	nova_print_nova_log(sb, sih);
+}
+
+int nova_get_nova_log_pages(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_inode *pi)
+{
+	struct nova_inode_log_page *curr_page;
+	u64 curr, next;
+	int count = 1;
+
+	if (pi->log_head == 0 || pi->log_tail == 0) {
+		nova_dbg("Pi %lu has no log\n", sih->ino);
+		return 0;
+	}
+
+	curr = pi->log_head;
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb, curr);
+	while ((next = curr_page->page_tail.next_page) != 0) {
+		curr = next;
+		curr_page = (struct nova_inode_log_page *)
+			nova_get_block(sb, curr);
+		count++;
+	}
+
+	return count;
+}
+
+void nova_print_nova_log_pages(struct super_block *sb,
+	struct nova_inode_info_header *sih)
+{
+	struct nova_inode_log_page *curr_page;
+	u64 curr, next;
+	int count = 1;
+	int used = count;
+
+	if (sih->log_head == 0 || sih->log_tail == 0) {
+		nova_dbg("Pi %lu has no log\n", sih->ino);
+		return;
+	}
+
+	curr = sih->log_head;
+	nova_dbg("Pi %lu: log head @ 0x%llx, tail @ 0x%llx\n",
+			sih->ino, curr, sih->log_tail);
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb, curr);
+	while ((next = curr_page->page_tail.next_page) != 0) {
+		nova_dbg("Current page 0x%llx, next page 0x%llx, %u entries, %u invalid\n",
+			curr >> PAGE_SHIFT, next >> PAGE_SHIFT,
+			curr_page->page_tail.num_entries,
+			curr_page->page_tail.invalid_entries);
+		if (sih->log_tail >> PAGE_SHIFT == curr >> PAGE_SHIFT)
+			used = count;
+		curr = next;
+		curr_page = (struct nova_inode_log_page *)
+			nova_get_block(sb, curr);
+		count++;
+	}
+	if (sih->log_tail >> PAGE_SHIFT == curr >> PAGE_SHIFT)
+		used = count;
+	nova_dbg("Pi %lu: log used %d pages, has %d pages, si reports %lu pages\n",
+		sih->ino, used, count,
+		sih->log_pages);
+}
+
+void nova_print_inode_log_pages(struct super_block *sb, struct inode *inode)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+
+	nova_print_nova_log_pages(sb, sih);
+}
+
+int nova_check_inode_logs(struct super_block *sb, struct nova_inode *pi)
+{
+	int count = 0;
+	int tail_at = 0;
+	u64 curr;
+
+	curr = pi->log_head;
+
+	while (curr) {
+		count++;
+		if ((curr >> PAGE_SHIFT) == (pi->log_tail >> PAGE_SHIFT))
+			tail_at = count;
+		curr = next_log_page(sb, curr);
+	}
+
+	nova_dbg("Log %d pages, tail @ page %d\n", count, tail_at);
+
+	return 0;
+}
+
 void nova_print_free_lists(struct super_block *sb)
 {
 	struct nova_sb_info *sbi = NOVA_SB(sb);
-- 
2.7.4


* [RFC v2 34/83] Journal: NOVA light weight journal definitions.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (32 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 33/83] Inode log and entry printing for debug purpose Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 35/83] Journal: Lite journal helper routines Andiry Xu
                   ` (49 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA uses per-CPU lite journals to provide fast atomicity guarantees
for multi-log appends and multi-word in-place updates.

NOVA uses undo journaling. Each journal is a circular buffer occupying
one 4KB pmem page. Two pointers, journal_head and journal_tail, reside
in the reserved journal block and point into the journal page.
If the two pointers are not equal, there are uncommitted transactions,
and NOVA recovers by rolling back the journal entries between them.
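
To make the head/tail scheme concrete, here is a small userspace model
of one journal. It is a sketch under simplifying assumptions: pmem is a
plain array, offsets are relative to the journal page, there is no
flushing or checksumming, and every name (demo_journal, demo_log,
demo_recover, ...) is hypothetical. Only the 32-byte entry size, the
4KB wrap-around and the head-to-tail walk follow the series.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE	4096u
#define DEMO_ENTRY_SIZE	32u	/* 8 + 8 + 8 + 4 + 4 bytes per entry */

struct demo_entry {
	uint64_t addr;	/* index into the simulated pmem array */
	uint64_t old;	/* value to restore on rollback */
};

struct demo_journal {
	uint64_t head;	/* byte offset of the oldest uncommitted entry */
	uint64_t tail;	/* byte offset where the next entry goes */
	struct demo_entry slots[DEMO_PAGE_SIZE / DEMO_ENTRY_SIZE];
};

/* Advance one entry, wrapping to the start of the page at the end. */
static uint64_t demo_next(uint64_t curr)
{
	if ((curr & (DEMO_PAGE_SIZE - 1)) + DEMO_ENTRY_SIZE >= DEMO_PAGE_SIZE)
		return curr & ~(uint64_t)(DEMO_PAGE_SIZE - 1);
	return curr + DEMO_ENTRY_SIZE;
}

/* Record the current value at addr before the caller overwrites it. */
static void demo_log(struct demo_journal *j, const uint64_t *pmem,
		     uint64_t addr)
{
	struct demo_entry *e = &j->slots[j->tail / DEMO_ENTRY_SIZE];

	e->addr = addr;
	e->old = pmem[addr];
	j->tail = demo_next(j->tail);
}

/* Roll back every entry between head and tail; return how many. */
static int demo_recover(struct demo_journal *j, uint64_t *pmem)
{
	uint64_t cur;
	int undone = 0;

	for (cur = j->head; cur != j->tail; cur = demo_next(cur)) {
		const struct demo_entry *e = &j->slots[cur / DEMO_ENTRY_SIZE];

		pmem[e->addr] = e->old;
		undone++;
	}
	j->tail = j->head;	/* journal is empty again */
	return undone;
}
```

A commit in this model would simply set head equal to tail without
restoring anything; recovery only has work to do when a crash leaves the
two pointers apart.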

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/bbuild.c  |  1 +
 fs/nova/journal.h | 43 +++++++++++++++++++++++++++++++++++++++++++
 fs/nova/log.c     |  1 +
 fs/nova/super.c   |  1 +
 4 files changed, 46 insertions(+)
 create mode 100644 fs/nova/journal.h

diff --git a/fs/nova/bbuild.c b/fs/nova/bbuild.c
index 66053cb..af1b352 100644
--- a/fs/nova/bbuild.c
+++ b/fs/nova/bbuild.c
@@ -25,6 +25,7 @@
 #include <linux/random.h>
 #include <linux/delay.h>
 #include "nova.h"
+#include "journal.h"
 #include "super.h"
 #include "inode.h"
 
diff --git a/fs/nova/journal.h b/fs/nova/journal.h
new file mode 100644
index 0000000..d1d0ffb
--- /dev/null
+++ b/fs/nova/journal.h
@@ -0,0 +1,43 @@
+#ifndef __JOURNAL_H
+#define __JOURNAL_H
+
+#include <linux/types.h>
+#include <linux/fs.h>
+#include "nova.h"
+#include "super.h"
+
+
+/* ======================= Lite journal ========================= */
+
+#define NOVA_MAX_JOURNAL_LENGTH 128
+
+#define	JOURNAL_INODE	1
+#define	JOURNAL_ENTRY	2
+
+/* Lightweight journal entry */
+struct nova_lite_journal_entry {
+	__le64 type;       // JOURNAL_INODE or JOURNAL_ENTRY
+	__le64 data1;
+	__le64 data2;
+	__le32 padding;
+	__le32 csum;
+} __attribute((__packed__));
+
+/* Head and tail pointers into a circular queue of journal entries.  There's
+ * one of these per CPU.
+ */
+struct journal_ptr_pair {
+	__le64 journal_head;
+	__le64 journal_tail;
+};
+
+static inline
+struct journal_ptr_pair *nova_get_journal_pointers(struct super_block *sb,
+	int cpu)
+{
+	return (struct journal_ptr_pair *)((char *)nova_get_block(sb,
+		NOVA_DEF_BLOCK_SIZE_4K * JOURNAL_START) + cpu * CACHELINE_SIZE);
+}
+
+
+#endif
diff --git a/fs/nova/log.c b/fs/nova/log.c
index bdd133e..f01b7c8 100644
--- a/fs/nova/log.c
+++ b/fs/nova/log.c
@@ -16,6 +16,7 @@
  */
 
 #include "nova.h"
+#include "journal.h"
 #include "inode.h"
 #include "log.h"
 
diff --git a/fs/nova/super.c b/fs/nova/super.c
index c0427fd..d73c202 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -38,6 +38,7 @@
 #include <linux/list.h>
 #include <linux/dax.h>
 #include "nova.h"
+#include "journal.h"
 #include "super.h"
 
 int measure_timing;
-- 
2.7.4


* [RFC v2 35/83] Journal: Lite journal helper routines.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (33 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 34/83] Journal: NOVA light weight journal definitions Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 36/83] Journal: Lite journal recovery Andiry Xu
                   ` (48 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/Makefile  |   2 +-
 fs/nova/journal.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 109 insertions(+), 1 deletion(-)
 create mode 100644 fs/nova/journal.c

diff --git a/fs/nova/Makefile b/fs/nova/Makefile
index b3638a4..4aeadea 100644
--- a/fs/nova/Makefile
+++ b/fs/nova/Makefile
@@ -4,4 +4,4 @@
 
 obj-$(CONFIG_NOVA_FS) += nova.o
 
-nova-y := balloc.o bbuild.o inode.o log.o rebuild.o stats.o super.o
+nova-y := balloc.o bbuild.o inode.o journal.o log.o rebuild.o stats.o super.o
diff --git a/fs/nova/journal.c b/fs/nova/journal.c
new file mode 100644
index 0000000..75d590f
--- /dev/null
+++ b/fs/nova/journal.c
@@ -0,0 +1,108 @@
+/*
+ * NOVA journaling facility.
+ *
+ * This file contains journaling code to guarantee the atomicity of directory
+ * operations that span multiple inodes (unlink, rename, etc).
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/init.h>
+#include <linux/vfs.h>
+#include <linux/uaccess.h>
+#include <linux/mm.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>
+#include "nova.h"
+#include "journal.h"
+
+/**************************** Lite journal ******************************/
+
+static inline void
+nova_print_lite_transaction(struct nova_lite_journal_entry *entry)
+{
+	nova_dbg("Entry %p: Type %llu, data1 0x%llx, data2 0x%llx, checksum %u\n",
+			entry, entry->type,
+			entry->data1, entry->data2, entry->csum);
+}
+
+static inline int nova_update_journal_entry_csum(struct super_block *sb,
+	struct nova_lite_journal_entry *entry)
+{
+	u32 crc = 0;
+
+	crc = nova_crc32c(~0, (__u8 *)entry,
+			(sizeof(struct nova_lite_journal_entry)
+			 - sizeof(__le32)));
+
+	entry->csum = cpu_to_le32(crc);
+	nova_flush_buffer(entry, sizeof(struct nova_lite_journal_entry), 0);
+	return 0;
+}
+
+static inline int nova_check_entry_integrity(struct super_block *sb,
+	struct nova_lite_journal_entry *entry)
+{
+	u32 crc = 0;
+
+	crc = nova_crc32c(~0, (__u8 *)entry,
+			(sizeof(struct nova_lite_journal_entry)
+			 - sizeof(__le32)));
+
+	if (entry->csum == cpu_to_le32(crc))
+		return 0;
+	else
+		return 1;
+}
+
+// Get the next journal entry.  Journal entries live in a 1-page circular
+// buffer, so the pointer wraps to the start of the page at the end.
+//
+// TODO: Add check to ensure that the journal doesn't grow too large.
+static inline u64 next_lite_journal(u64 curr_p)
+{
+	size_t size = sizeof(struct nova_lite_journal_entry);
+
+	if ((curr_p & (PAGE_SIZE - 1)) + size >= PAGE_SIZE)
+		return (curr_p & PAGE_MASK);
+
+	return curr_p + size;
+}
+
+// Walk the journal for one CPU, and verify the checksum on each entry.
+static int nova_check_journal_entries(struct super_block *sb,
+	struct journal_ptr_pair *pair)
+{
+	struct nova_lite_journal_entry *entry;
+	u64 temp;
+	int ret;
+
+	temp = pair->journal_head;
+	while (temp != pair->journal_tail) {
+		entry = (struct nova_lite_journal_entry *)nova_get_block(sb,
+									temp);
+		ret = nova_check_entry_integrity(sb, entry);
+		if (ret) {
+			nova_dbg("Entry %p checksum failure\n", entry);
+			nova_print_lite_transaction(entry);
+			return ret;
+		}
+		temp = next_lite_journal(temp);
+	}
+
+	return 0;
+}
-- 
2.7.4


* [RFC v2 36/83] Journal: Lite journal recovery.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (34 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 35/83] Journal: Lite journal helper routines Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 37/83] Journal: Lite journal create and commit Andiry Xu
                   ` (47 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/journal.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/fs/nova/journal.c b/fs/nova/journal.c
index 75d590f..f31de97 100644
--- a/fs/nova/journal.c
+++ b/fs/nova/journal.c
@@ -106,3 +106,58 @@ static int nova_check_journal_entries(struct super_block *sb,
 
 	return 0;
 }
+
+/**************************** Journal Recovery ******************************/
+
+static void nova_undo_journal_entry(struct super_block *sb,
+	struct nova_lite_journal_entry *entry)
+{
+	u64 addr, value;
+
+	addr = le64_to_cpu(entry->data1);
+	value = le64_to_cpu(entry->data2);
+
+	*(u64 *)nova_get_block(sb, addr) = (u64)value;
+	nova_flush_buffer((void *)nova_get_block(sb, addr), CACHELINE_SIZE, 0);
+}
+
+static void nova_undo_lite_journal_entry(struct super_block *sb,
+	struct nova_lite_journal_entry *entry)
+{
+	u64 type;
+
+	type = le64_to_cpu(entry->type);
+
+	switch (type) {
+	case JOURNAL_INODE:
+		/* Currently unused */
+		break;
+	case JOURNAL_ENTRY:
+		nova_undo_journal_entry(sb, entry);
+		break;
+	default:
+		nova_dbg("%s: unknown data type %llu\n", __func__, type);
+		break;
+	}
+}
+
+/* Roll back all journal entries */
+static int nova_recover_lite_journal(struct super_block *sb,
+	struct journal_ptr_pair *pair)
+{
+	struct nova_lite_journal_entry *entry;
+	u64 temp;
+
+	temp = pair->journal_head;
+	while (temp != pair->journal_tail) {
+		entry = (struct nova_lite_journal_entry *)nova_get_block(sb,
+									temp);
+		nova_undo_lite_journal_entry(sb, entry);
+		temp = next_lite_journal(temp);
+	}
+
+	pair->journal_tail = pair->journal_head;
+	nova_flush_buffer(&pair->journal_head, CACHELINE_SIZE, 1);
+
+	return 0;
+}
-- 
2.7.4


* [RFC v2 37/83] Journal: Lite journal create and commit.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (35 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 36/83] Journal: Lite journal recovery Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 38/83] Journal: NOVA lite journal initialization Andiry Xu
                   ` (46 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA uses the lite journal to perform lightweight transactions.
Instead of journaling metadata/data changes directly,
NOVA first appends updates to each inode's log, and then
journals the log tail pointers to make sure all the logs
are updated atomically. For inode creation and deletion,
NOVA journals the inode's valid field.
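
For readers who want the idea in isolation, here is a minimal user-space
sketch of an undo journal with head/tail pointers. It is not the NOVA
code: the names (struct lite_journal, journal_record, JOURNAL_SLOTS) are
hypothetical, it stores raw pointers where NOVA stores offsets into PMEM,
and it omits checksums and cache flushes. It assumes each field is
recorded at most once per transaction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JOURNAL_SLOTS 8	/* hypothetical capacity, not NOVA's */

struct undo_entry {
	uint64_t *addr;	/* field that is about to change */
	uint64_t old;	/* value to restore on rollback */
};

struct lite_journal {
	struct undo_entry slot[JOURNAL_SLOTS];
	size_t head;	/* first uncommitted entry */
	size_t tail;	/* one past the last uncommitted entry */
};

/* Save the current value of *field before the caller overwrites it. */
static void journal_record(struct lite_journal *j, uint64_t *field)
{
	size_t next = (j->tail + 1) % JOURNAL_SLOTS;

	assert(next != j->head);	/* journal must not be full */
	j->slot[j->tail].addr = field;
	j->slot[j->tail].old = *field;
	j->tail = next;
}

/* Commit: the updates stand, so the undo entries are dropped. */
static void journal_commit(struct lite_journal *j)
{
	j->head = j->tail;
}

/* Recovery: head != tail means a transaction never committed. */
static void journal_recover(struct lite_journal *j)
{
	size_t p;

	for (p = j->head; p != j->tail; p = (p + 1) % JOURNAL_SLOTS)
		*j->slot[p].addr = j->slot[p].old;
	j->tail = j->head;
}
```

A caller records the log tails of every inode involved, updates them,
and then commits. If a crash happens before the commit, recovery puts
all of the tails back, so either every log advances or none does.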

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/journal.c | 179 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/journal.h |  11 ++++
 2 files changed, 190 insertions(+)

diff --git a/fs/nova/journal.c b/fs/nova/journal.c
index f31de97..0e203fa 100644
--- a/fs/nova/journal.c
+++ b/fs/nova/journal.c
@@ -161,3 +161,182 @@ static int nova_recover_lite_journal(struct super_block *sb,
 
 	return 0;
 }
+
+/**************************** Create/commit ******************************/
+
+/* Create and append an undo entry for a small update to PMEM. */
+static u64 nova_append_entry_journal(struct super_block *sb,
+	u64 curr_p, void *field)
+{
+	struct nova_lite_journal_entry *entry;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u64 *aligned_field;
+	u64 addr;
+
+	entry = (struct nova_lite_journal_entry *)nova_get_block(sb,
+							curr_p);
+	entry->type = cpu_to_le64(JOURNAL_ENTRY);
+	entry->padding = 0;
+	/* Align to 8 bytes */
+	aligned_field = (u64 *)((unsigned long)field & ~7UL);
+	/* Store the offset from the start of Nova instead of the pointer */
+	addr = (u64)nova_get_addr_off(sbi, aligned_field);
+	entry->data1 = cpu_to_le64(addr);
+	entry->data2 = cpu_to_le64(*aligned_field);
+	nova_update_journal_entry_csum(sb, entry);
+
+	curr_p = next_lite_journal(curr_p);
+	return curr_p;
+}
+
+static u64 nova_journal_inode_tail(struct super_block *sb,
+	u64 curr_p, struct nova_inode *pi)
+{
+	curr_p = nova_append_entry_journal(sb, curr_p, &pi->log_tail);
+
+	return curr_p;
+}
+
+/* Create and append undo log entries for creating a new file or directory. */
+static u64 nova_append_inode_journal(struct super_block *sb,
+	u64 curr_p, struct inode *inode, int new_inode,
+	int invalidate, int is_dir)
+{
+	struct nova_inode *pi = nova_get_inode(sb, inode);
+
+	if (!pi) {
+		nova_err(sb, "%s: get inode failed\n", __func__);
+		return curr_p;
+	}
+
+	if (is_dir)
+		return nova_journal_inode_tail(sb, curr_p, pi);
+
+	if (new_inode) {
+		curr_p = nova_append_entry_journal(sb, curr_p,
+						&pi->valid);
+	} else {
+		curr_p = nova_journal_inode_tail(sb, curr_p, pi);
+		if (invalidate) {
+			curr_p = nova_append_entry_journal(sb, curr_p,
+						&pi->valid);
+			curr_p = nova_append_entry_journal(sb, curr_p,
+						&pi->delete_epoch_id);
+		}
+	}
+
+	return curr_p;
+}
+
+static u64 nova_append_dentry_journal(struct super_block *sb,
+	u64 curr_p, struct nova_dentry *dentry)
+{
+	curr_p = nova_append_entry_journal(sb, curr_p, &dentry->ino);
+	curr_p = nova_append_entry_journal(sb, curr_p, &dentry->csum);
+	return curr_p;
+}
+
+/* Journaled transactions for inode creation */
+u64 nova_create_inode_transaction(struct super_block *sb,
+	struct inode *inode, struct inode *dir, int cpu,
+	int new_inode, int invalidate)
+{
+	struct journal_ptr_pair *pair;
+	u64 temp;
+
+	pair = nova_get_journal_pointers(sb, cpu);
+
+	temp = pair->journal_head;
+
+	temp = nova_append_inode_journal(sb, temp, inode,
+					new_inode, invalidate, 0);
+
+	temp = nova_append_inode_journal(sb, temp, dir,
+					new_inode, invalidate, 1);
+
+	pair->journal_tail = temp;
+	nova_flush_buffer(&pair->journal_head, CACHELINE_SIZE, 1);
+
+	nova_dbgv("%s: head 0x%llx, tail 0x%llx\n",
+			__func__, pair->journal_head, pair->journal_tail);
+	return temp;
+}
+
+/* Journaled transactions for rename operations */
+u64 nova_create_rename_transaction(struct super_block *sb,
+	struct inode *old_inode, struct inode *old_dir, struct inode *new_inode,
+	struct inode *new_dir, struct nova_dentry *father_entry,
+	int invalidate_new_inode, int cpu)
+{
+	struct journal_ptr_pair *pair;
+	u64 temp;
+
+	pair = nova_get_journal_pointers(sb, cpu);
+
+	temp = pair->journal_head;
+
+	/* Journal tails for old inode */
+	temp = nova_append_inode_journal(sb, temp, old_inode, 0, 0, 0);
+
+	/* Journal tails for old dir */
+	temp = nova_append_inode_journal(sb, temp, old_dir, 0, 0, 1);
+
+	if (new_inode) {
+		/* New inode may be unlinked */
+		temp = nova_append_inode_journal(sb, temp, new_inode, 0,
+					invalidate_new_inode, 0);
+	}
+
+	if (new_dir)
+		temp = nova_append_inode_journal(sb, temp, new_dir, 0, 0, 1);
+
+	if (father_entry)
+		temp = nova_append_dentry_journal(sb, temp, father_entry);
+
+	pair->journal_tail = temp;
+	nova_flush_buffer(&pair->journal_head, CACHELINE_SIZE, 1);
+
+	nova_dbgv("%s: head 0x%llx, tail 0x%llx\n",
+			__func__, pair->journal_head, pair->journal_tail);
+	return temp;
+}
+
+/* Journal a log entry's current contents ahead of an in-place update */
+u64 nova_create_logentry_transaction(struct super_block *sb,
+	void *entry, enum nova_entry_type type, int cpu)
+{
+	struct journal_ptr_pair *pair;
+	size_t size = 0;
+	int i, count;
+	u64 temp;
+
+	pair = nova_get_journal_pointers(sb, cpu);
+
+	size = nova_get_log_entry_size(sb, type);
+
+	temp = pair->journal_head;
+
+	count = size / 8;
+	for (i = 0; i < count; i++) {
+		temp = nova_append_entry_journal(sb, temp,
+						(char *)entry + i * 8);
+	}
+
+	pair->journal_tail = temp;
+	nova_flush_buffer(&pair->journal_head, CACHELINE_SIZE, 1);
+
+	nova_dbgv("%s: head 0x%llx, tail 0x%llx\n",
+			__func__, pair->journal_head, pair->journal_tail);
+	return temp;
+}
+
+/* Commit the transactions by dropping the journal entries */
+void nova_commit_lite_transaction(struct super_block *sb, u64 tail, int cpu)
+{
+	struct journal_ptr_pair *pair;
+
+	pair = nova_get_journal_pointers(sb, cpu);
+
+	pair->journal_head = tail;
+	nova_flush_buffer(&pair->journal_head, CACHELINE_SIZE, 1);
+}
diff --git a/fs/nova/journal.h b/fs/nova/journal.h
index d1d0ffb..2259880 100644
--- a/fs/nova/journal.h
+++ b/fs/nova/journal.h
@@ -40,4 +40,15 @@ struct journal_ptr_pair *nova_get_journal_pointers(struct super_block *sb,
 }
 
 
+u64 nova_create_inode_transaction(struct super_block *sb,
+	struct inode *inode, struct inode *dir, int cpu,
+	int new_inode, int invalidate);
+u64 nova_create_rename_transaction(struct super_block *sb,
+	struct inode *old_inode, struct inode *old_dir, struct inode *new_inode,
+	struct inode *new_dir, struct nova_dentry *father_entry,
+	int invalidate_new_inode, int cpu);
+u64 nova_create_logentry_transaction(struct super_block *sb,
+	void *entry, enum nova_entry_type type, int cpu);
+void nova_commit_lite_transaction(struct super_block *sb, u64 tail, int cpu);
+
 #endif
-- 
2.7.4


* [RFC v2 38/83] Journal: NOVA lite journal initialization.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (36 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 37/83] Journal: Lite journal create and commit Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 39/83] Log operation: dentry append Andiry Xu
                   ` (45 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA uses per-CPU spinlocks to protect the journals.
Lite journal initialization consists of two parts:
for a new NOVA instance, hard_init allocates the journal pages;
on every mount, soft_init initializes the locks and performs journal recovery.
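
The split between the two init paths can be sketched in user space as
follows. This is an illustration under assumptions, not the NOVA code:
struct journal_state, NCPUS and the C11 atomic_flag spinlock are
stand-ins for the PMEM pointer pairs, sbi->cpus and the kernel
spinlock_t, and the "recovery" step only resets the tail instead of
verifying checksums and undoing entries.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define NCPUS 4	/* hypothetical CPU count */

struct ptr_pair {
	uint64_t head, tail;	/* stands in for the persistent pair */
};

struct journal_state {
	struct ptr_pair pairs[NCPUS];	/* "persistent" state */
	atomic_flag *locks;		/* volatile state, rebuilt per mount */
	int recovered;			/* journals rolled back at mount */
};

/* Every mount: rebuild the locks, then roll back unfinished journals. */
static int soft_init(struct journal_state *s)
{
	int i;

	s->locks = calloc(NCPUS, sizeof(*s->locks));
	if (!s->locks)
		return -1;
	s->recovered = 0;
	for (i = 0; i < NCPUS; i++)
		atomic_flag_clear(&s->locks[i]);

	for (i = 0; i < NCPUS; i++) {
		if (s->pairs[i].head == s->pairs[i].tail)
			continue;	/* nothing in flight on this CPU */
		/* Real code checks checksums and undoes each entry here. */
		s->pairs[i].tail = s->pairs[i].head;
		s->recovered++;
	}
	return 0;
}

/* New file system only: give each CPU an empty journal page. */
static int hard_init(struct journal_state *s)
{
	int i;

	for (i = 0; i < NCPUS; i++)
		s->pairs[i].head = s->pairs[i].tail = (uint64_t)(i + 1) * 4096;
	return soft_init(s);
}

static void journal_lock(struct journal_state *s, int cpu)
{
	assert(cpu >= 0 && cpu < NCPUS);
	while (atomic_flag_test_and_set_explicit(&s->locks[cpu],
						 memory_order_acquire))
		;	/* spin until the holder releases the flag */
}

static void journal_unlock(struct journal_state *s, int cpu)
{
	assert(cpu >= 0 && cpu < NCPUS);
	atomic_flag_clear_explicit(&s->locks[cpu], memory_order_release);
}
```

Keeping the locks out of the persistent structure means nothing
lock-related has to be repaired after a crash; only the pointer pairs
survive and drive recovery.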

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/journal.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/journal.h |  2 ++
 fs/nova/super.c   | 15 ++++++++++++
 fs/nova/super.h   |  3 +++
 4 files changed, 90 insertions(+)

diff --git a/fs/nova/journal.c b/fs/nova/journal.c
index 0e203fa..d2578e2 100644
--- a/fs/nova/journal.c
+++ b/fs/nova/journal.c
@@ -340,3 +340,73 @@ void nova_commit_lite_transaction(struct super_block *sb, u64 tail, int cpu)
 	pair->journal_head = tail;
 	nova_flush_buffer(&pair->journal_head, CACHELINE_SIZE, 1);
 }
+
+/**************************** Initialization ******************************/
+
+// Initialize DRAM journal state, then validate and recover each journal
+int nova_lite_journal_soft_init(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct journal_ptr_pair *pair;
+	int i;
+	int ret = 0;
+
+	sbi->journal_locks = kcalloc(sbi->cpus, sizeof(spinlock_t),
+				     GFP_KERNEL);
+	if (!sbi->journal_locks)
+		return -ENOMEM;
+
+	for (i = 0; i < sbi->cpus; i++)
+		spin_lock_init(&sbi->journal_locks[i]);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		pair = nova_get_journal_pointers(sb, i);
+		if (pair->journal_head == pair->journal_tail)
+			continue;
+
+		/* Ensure all entries are genuine */
+		ret = nova_check_journal_entries(sb, pair);
+		if (ret) {
+			nova_err(sb, "Journal %d checksum failure\n", i);
+			ret = -EINVAL;
+			break;
+		}
+
+		ret = nova_recover_lite_journal(sb, pair);
+	}
+
+	return ret;
+}
+
+/* Initialize persistent journal state */
+int nova_lite_journal_hard_init(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info_header sih;
+	struct journal_ptr_pair *pair;
+	unsigned long blocknr = 0;
+	int allocated;
+	int i;
+	u64 block;
+
+	sih.ino = NOVA_LITEJOURNAL_INO;
+	sih.i_blk_type = NOVA_BLOCK_TYPE_4K;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		pair = nova_get_journal_pointers(sb, i);
+
+		allocated = nova_new_log_blocks(sb, &sih, &blocknr, 1,
+			ALLOC_INIT_ZERO, ANY_CPU, ALLOC_FROM_HEAD);
+		nova_dbg_verbose("%s: allocate log @ 0x%lx\n", __func__,
+							blocknr);
+		if (allocated != 1 || blocknr == 0)
+			return -ENOSPC;
+
+		block = nova_get_block_off(sb, blocknr, NOVA_BLOCK_TYPE_4K);
+		pair->journal_head = pair->journal_tail = block;
+		nova_flush_buffer(pair, CACHELINE_SIZE, 0);
+	}
+
+	PERSISTENT_BARRIER();
+	return nova_lite_journal_soft_init(sb);
+}
diff --git a/fs/nova/journal.h b/fs/nova/journal.h
index 2259880..6e3a528 100644
--- a/fs/nova/journal.h
+++ b/fs/nova/journal.h
@@ -50,5 +50,7 @@ u64 nova_create_rename_transaction(struct super_block *sb,
 u64 nova_create_logentry_transaction(struct super_block *sb,
 	void *entry, enum nova_entry_type type, int cpu);
 void nova_commit_lite_transaction(struct super_block *sb, u64 tail, int cpu);
+int nova_lite_journal_soft_init(struct super_block *sb);
+int nova_lite_journal_hard_init(struct super_block *sb);
 
 #endif
diff --git a/fs/nova/super.c b/fs/nova/super.c
index d73c202..216d396 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -379,6 +379,11 @@ static struct nova_inode *nova_init(struct super_block *sb,
 
 	nova_init_blockmap(sb, 0);
 
+	if (nova_lite_journal_hard_init(sb) < 0) {
+		nova_err(sb, "Lite journal hard initialization failed\n");
+		return ERR_PTR(-EINVAL);
+	}
+
 	if (nova_init_inode_inuse_list(sb) < 0)
 		return ERR_PTR(-EINVAL);
 
@@ -598,6 +603,12 @@ static int nova_fill_super(struct super_block *sb, void *data, int silent)
 		goto out;
 	}
 
+	if (nova_lite_journal_soft_init(sb)) {
+		retval = -EINVAL;
+		nova_err(sb, "Lite journal initialization failed\n");
+		goto out;
+	}
+
 	blocksize = le32_to_cpu(sbi->nova_sb->s_blocksize);
 	nova_set_blocksize(sb, blocksize);
 
@@ -647,6 +658,9 @@ static int nova_fill_super(struct super_block *sb, void *data, int silent)
 
 	nova_delete_free_lists(sb);
 
+	kfree(sbi->journal_locks);
+	sbi->journal_locks = NULL;
+
 	kfree(sbi->inode_maps);
 	sbi->inode_maps = NULL;
 
@@ -745,6 +759,7 @@ static void nova_put_super(struct super_block *sb)
 
 	kfree(sbi->zeroed_page);
 	nova_dbgmask = 0;
+	kfree(sbi->journal_locks);
 
 	for (i = 0; i < sbi->cpus; i++) {
 		inode_map = &sbi->inode_maps[i];
diff --git a/fs/nova/super.h b/fs/nova/super.h
index 9772d2f..56a840e 100644
--- a/fs/nova/super.h
+++ b/fs/nova/super.h
@@ -119,6 +119,9 @@ struct nova_sb_info {
 	/* ZEROED page for cache page initialized */
 	void *zeroed_page;
 
+	/* Per-CPU journal lock */
+	spinlock_t *journal_locks;
+
 	/* Per-CPU inode map */
 	struct inode_map	*inode_maps;
 
-- 
2.7.4


* [RFC v2 39/83] Log operation: dentry append.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (37 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 38/83] Journal: NOVA lite journal initialization Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 40/83] Log operation: file write entry append Andiry Xu
                   ` (44 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA performs atomic log appending by first appending the entry
to the tail of the log, and then atomically updating the log tail pointer.
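
As a rough user-space illustration of the two steps (stage the entry
past the tail, then publish it with one 64-bit store), consider the
sketch below. The names struct inode_log, log_stage and log_publish are
hypothetical, the log is a flat buffer where NOVA uses linked log
pages, and the persistence flushes NOVA needs on PMEM are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOG_BYTES 256	/* hypothetical log size */

struct inode_log {
	unsigned char buf[LOG_BYTES];
	_Atomic uint64_t tail;	/* commit point: bytes below it are valid */
};

/*
 * Step 1: copy the entry into the space past the committed tail.
 * Readers stop at log->tail, so they cannot see it yet.
 * Returns the tail value to publish, or 0 if the entry does not fit.
 */
static uint64_t log_stage(struct inode_log *log, uint64_t at,
			  const void *entry, size_t size)
{
	assert(size > 0);
	if (at + size > LOG_BYTES)
		return 0;
	memcpy(log->buf + at, entry, size);
	return at + size;
}

/* Step 2: a single aligned 64-bit store makes the entry visible. */
static void log_publish(struct inode_log *log, uint64_t new_tail)
{
	atomic_store_explicit(&log->tail, new_tail, memory_order_release);
}
```

Several entries can be staged back to back before one log_publish()
call, which is how a multi-entry operation stays all-or-nothing.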

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/log.c | 162 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/log.h |   4 ++
 2 files changed, 166 insertions(+)

diff --git a/fs/nova/log.c b/fs/nova/log.c
index f01b7c8..13f9597 100644
--- a/fs/nova/log.c
+++ b/fs/nova/log.c
@@ -20,6 +20,168 @@
 #include "inode.h"
 #include "log.h"
 
+static int nova_update_old_dentry(struct super_block *sb,
+	struct inode *dir, struct nova_dentry *dentry,
+	struct nova_log_entry_info *entry_info)
+{
+	unsigned short links_count;
+	int link_change = entry_info->link_change;
+	u64 addr;
+
+	dentry->epoch_id = entry_info->epoch_id;
+	dentry->trans_id = entry_info->trans_id;
+	/* Remove_dentry */
+	dentry->ino = cpu_to_le64(0);
+	dentry->invalid = 1;
+	dentry->mtime = cpu_to_le32(dir->i_mtime.tv_sec);
+
+	links_count = dir->i_nlink;
+	if (links_count == 0 && link_change == -1)
+		links_count = 0;
+	else
+		links_count += link_change;
+	dentry->links_count = cpu_to_le16(links_count);
+
+	addr = nova_get_addr_off(NOVA_SB(sb), dentry);
+	nova_inc_page_invalid_entries(sb, addr);
+
+	nova_persist_entry(dentry);
+
+	return 0;
+}
+
+static int nova_update_new_dentry(struct super_block *sb,
+	struct inode *dir, struct nova_dentry *entry,
+	struct nova_log_entry_info *entry_info)
+{
+	struct dentry *dentry = entry_info->data;
+	unsigned short links_count;
+	int link_change = entry_info->link_change;
+
+	entry->entry_type = DIR_LOG;
+	entry->epoch_id = entry_info->epoch_id;
+	entry->trans_id = entry_info->trans_id;
+	entry->ino = entry_info->ino;
+	entry->name_len = dentry->d_name.len;
+	memcpy_to_pmem_nocache(entry->name, dentry->d_name.name,
+				dentry->d_name.len);
+	entry->name[dentry->d_name.len] = '\0';
+	entry->mtime = cpu_to_le32(dir->i_mtime.tv_sec);
+	//entry->size = cpu_to_le64(dir->i_size);
+
+	links_count = dir->i_nlink;
+	if (links_count == 0 && link_change == -1)
+		links_count = 0;
+	else
+		links_count += link_change;
+	entry->links_count = cpu_to_le16(links_count);
+
+	/* Update actual de_len */
+	entry->de_len = cpu_to_le16(entry_info->file_size);
+
+	nova_persist_entry(entry);
+
+	return 0;
+}
+
+static int nova_update_log_entry(struct super_block *sb, struct inode *inode,
+	void *entry, struct nova_log_entry_info *entry_info)
+{
+	enum nova_entry_type type = entry_info->type;
+
+	switch (type) {
+	case FILE_WRITE:
+		break;
+	case DIR_LOG:
+		if (entry_info->inplace)
+			nova_update_old_dentry(sb, inode, entry, entry_info);
+		else
+			nova_update_new_dentry(sb, inode, entry, entry_info);
+		break;
+	case SET_ATTR:
+		break;
+	case LINK_CHANGE:
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+static int nova_append_log_entry(struct super_block *sb,
+	struct nova_inode *pi, struct inode *inode,
+	struct nova_inode_info_header *sih,
+	struct nova_log_entry_info *entry_info)
+{
+	void *entry;
+	enum nova_entry_type type = entry_info->type;
+	struct nova_inode_update *update = entry_info->update;
+	u64 tail;
+	u64 curr_p;
+	size_t size;
+	int extended = 0;
+
+	if (type == DIR_LOG)
+		size = entry_info->file_size;
+	else
+		size = nova_get_log_entry_size(sb, type);
+
+	tail = update->tail;
+
+	curr_p = nova_get_append_head(sb, pi, sih, tail, size,
+						MAIN_LOG, 0, &extended);
+	if (curr_p == 0)
+		return -ENOSPC;
+
+	nova_dbg_verbose("%s: inode %lu log entry @ 0x%llx\n",
+				__func__, sih->ino, curr_p);
+
+	entry = nova_get_block(sb, curr_p);
+	/* inode is already updated with attr */
+	memset(entry, 0, size);
+	nova_update_log_entry(sb, inode, entry, entry_info);
+	nova_inc_page_num_entries(sb, curr_p);
+	update->curr_entry = curr_p;
+	update->tail = curr_p + size;
+
+	entry_info->curr_p = curr_p;
+	return 0;
+}
+
+int nova_append_dentry(struct super_block *sb, struct nova_inode *pi,
+	struct inode *dir, struct dentry *dentry, u64 ino,
+	unsigned short de_len, struct nova_inode_update *update,
+	int link_change, u64 epoch_id)
+{
+	struct nova_inode_info *si = NOVA_I(dir);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_log_entry_info entry_info;
+	timing_t append_time;
+	int ret;
+
+	NOVA_START_TIMING(append_dir_entry_t, append_time);
+
+	entry_info.type = DIR_LOG;
+	entry_info.update = update;
+	entry_info.data = dentry;
+	entry_info.ino = ino;
+	entry_info.link_change = link_change;
+	entry_info.file_size = de_len;
+	entry_info.epoch_id = epoch_id;
+	entry_info.trans_id = sih->trans_id;
+	entry_info.inplace = 0;
+
+	ret = nova_append_log_entry(sb, pi, dir, sih, &entry_info);
+	if (ret)
+		nova_err(sb, "%s failed\n", __func__);
+
+	dir->i_blocks = sih->i_blocks;
+
+	NOVA_END_TIMING(append_dir_entry_t, append_time);
+	return ret;
+}
+
 /* Coalesce log pages to a singly linked list */
 static int nova_coalesce_log_pages(struct super_block *sb,
 	unsigned long prev_blocknr, unsigned long first_blocknr,
diff --git a/fs/nova/log.h b/fs/nova/log.h
index 6b4a085..305e69b 100644
--- a/fs/nova/log.h
+++ b/fs/nova/log.h
@@ -364,6 +364,10 @@ static inline int is_dir_init_entry(struct super_block *sb,
 }
 
 
+int nova_append_dentry(struct super_block *sb, struct nova_inode *pi,
+	struct inode *dir, struct dentry *dentry, u64 ino,
+	unsigned short de_len, struct nova_inode_update *update,
+	int link_change, u64 epoch_id);
 int nova_allocate_inode_log_pages(struct super_block *sb,
 	struct nova_inode_info_header *sih, unsigned long num_pages,
 	u64 *new_block, int cpuid, enum nova_alloc_direction from_tail);
-- 
2.7.4


* [RFC v2 40/83] Log operation: file write entry append.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (38 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 39/83] Log operation: dentry append Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 41/83] Log operation: setattr " Andiry Xu
                   ` (43 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA performs writes by appending file write entries to the log.
A file write entry is the metadata of a write operation, and
contains pointers to the data blocks. A single write operation
may append multiple file write entries to the log, if the
allocator cannot provide enough contiguous data blocks.
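
To make the "multiple entries per write" case concrete, here is a small
sketch that splits one write across non-contiguous free extents. It is
an assumption-laden model, not the NOVA allocator: struct extent,
struct write_entry and build_write_entries are hypothetical names, and
the entry carries only a file page offset, a starting block and a page
count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct extent {
	uint64_t start;	/* first free block */
	uint64_t len;	/* number of contiguous free blocks */
};

struct write_entry {
	uint64_t pgoff;		/* file offset, in pages */
	uint64_t block;		/* first data block */
	uint64_t num_pages;	/* contiguous pages covered */
};

/*
 * Cover num_pages starting at pgoff using the free extents in order.
 * Each extent used yields one write entry.  Returns the number of
 * entries produced, or 0 if the space or the output array runs out.
 */
static size_t build_write_entries(uint64_t pgoff, uint64_t num_pages,
				  const struct extent *free_ext, size_t nfree,
				  struct write_entry *out, size_t max_out)
{
	size_t n = 0, i;

	assert(free_ext && out);
	for (i = 0; i < nfree && num_pages > 0 && n < max_out; i++) {
		uint64_t take = free_ext[i].len < num_pages ?
				free_ext[i].len : num_pages;

		if (take == 0)
			continue;
		out[n].pgoff = pgoff;
		out[n].block = free_ext[i].start;
		out[n].num_pages = take;
		n++;
		pgoff += take;
		num_pages -= take;
	}
	return num_pages == 0 ? n : 0;
}
```

With free extents {100, 2} and {200, 5}, a 4-page write at page offset
10 produces two entries: pages 10-11 at block 100 and pages 12-13 at
block 200.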

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/log.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/log.h |  3 +++
 2 files changed, 54 insertions(+)

diff --git a/fs/nova/log.c b/fs/nova/log.c
index 13f9597..437db26 100644
--- a/fs/nova/log.c
+++ b/fs/nova/log.c
@@ -20,6 +20,18 @@
 #include "inode.h"
 #include "log.h"
 
+static int nova_update_write_entry(struct super_block *sb,
+	struct nova_file_write_entry *entry,
+	struct nova_log_entry_info *entry_info)
+{
+	entry->epoch_id = cpu_to_le64(entry_info->epoch_id);
+	entry->trans_id = cpu_to_le64(entry_info->trans_id);
+	entry->mtime = cpu_to_le32(entry_info->time);
+	entry->size = cpu_to_le64(entry_info->file_size);
+	nova_persist_entry(entry);
+	return 0;
+}
+
 static int nova_update_old_dentry(struct super_block *sb,
 	struct inode *dir, struct nova_dentry *dentry,
 	struct nova_log_entry_info *entry_info)
@@ -91,6 +103,11 @@ static int nova_update_log_entry(struct super_block *sb, struct inode *inode,
 
 	switch (type) {
 	case FILE_WRITE:
+		if (entry_info->inplace)
+			nova_update_write_entry(sb, entry, entry_info);
+		else
+			memcpy_to_pmem_nocache(entry, entry_info->data,
+				sizeof(struct nova_file_write_entry));
 		break;
 	case DIR_LOG:
 		if (entry_info->inplace)
@@ -149,6 +166,40 @@ static int nova_append_log_entry(struct super_block *sb,
 	return 0;
 }
 
+/*
+ * Append a nova_file_write_entry to the current nova_inode_log_page.
+ * blocknr and start_blk are pgoff.
+ * We cannot update pi->log_tail here because a transaction may contain
+ * multiple entries.
+ */
+int nova_append_file_write_entry(struct super_block *sb, struct nova_inode *pi,
+	struct inode *inode, struct nova_file_write_item *item,
+	struct nova_inode_update *update)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_file_write_entry *data = &item->entry;
+	struct nova_log_entry_info entry_info;
+	timing_t append_time;
+	int ret;
+
+	NOVA_START_TIMING(append_file_entry_t, append_time);
+
+	entry_info.type = FILE_WRITE;
+	entry_info.update = update;
+	entry_info.data = data;
+	entry_info.epoch_id = data->epoch_id;
+	entry_info.trans_id = data->trans_id;
+	entry_info.inplace = 0;
+
+	ret = nova_append_log_entry(sb, pi, inode, sih, &entry_info);
+	if (ret)
+		nova_err(sb, "%s failed\n", __func__);
+
+	NOVA_END_TIMING(append_file_entry_t, append_time);
+	return ret;
+}
+
 int nova_append_dentry(struct super_block *sb, struct nova_inode *pi,
 	struct inode *dir, struct dentry *dentry, u64 ino,
 	unsigned short de_len, struct nova_inode_update *update,
diff --git a/fs/nova/log.h b/fs/nova/log.h
index 305e69b..db7a72e 100644
--- a/fs/nova/log.h
+++ b/fs/nova/log.h
@@ -364,6 +364,9 @@ static inline int is_dir_init_entry(struct super_block *sb,
 }
 
 
+int nova_append_file_write_entry(struct super_block *sb, struct nova_inode *pi,
+	struct inode *inode, struct nova_file_write_item *item,
+	struct nova_inode_update *update);
 int nova_append_dentry(struct super_block *sb, struct nova_inode *pi,
 	struct inode *dir, struct dentry *dentry, u64 ino,
 	unsigned short de_len, struct nova_inode_update *update,
-- 
2.7.4


* [RFC v2 41/83] Log operation: setattr entry append
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (39 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 40/83] Log operation: file write entry append Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 42/83] Log operation: link change append Andiry Xu
                   ` (42 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA appends a setattr entry to the log upon inode modification operations:
set size, chmod, etc.
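
A simplified sketch of what such an entry records: which attributes
changed (kept to the flags that fit in one byte) and the resulting
size. The X_ATTR_* constants are local stand-ins meant to mirror the
kernel's ATTR_* bits, and struct setattr_request, struct setattr_record
and make_setattr_record are hypothetical names for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the kernel's ATTR_* flag bits. */
enum {
	X_ATTR_MODE      = 1 << 0,
	X_ATTR_UID       = 1 << 1,
	X_ATTR_GID       = 1 << 2,
	X_ATTR_SIZE      = 1 << 3,
	X_ATTR_ATIME     = 1 << 4,
	X_ATTR_MTIME     = 1 << 5,
	X_ATTR_CTIME     = 1 << 6,
	X_ATTR_ATIME_SET = 1 << 7,	/* a flag the record does not keep */
};

struct setattr_request {
	unsigned int valid;	/* which attributes the caller is setting */
	uint64_t size;		/* new size, meaningful with X_ATTR_SIZE */
};

struct setattr_record {
	uint8_t attr;	/* masked copy of the request flags */
	uint64_t size;	/* size after the operation */
};

static struct setattr_record
make_setattr_record(const struct setattr_request *req, uint64_t current_size)
{
	const unsigned int mask = X_ATTR_MODE | X_ATTR_UID | X_ATTR_GID |
				  X_ATTR_SIZE | X_ATTR_ATIME | X_ATTR_MTIME |
				  X_ATTR_CTIME;
	struct setattr_record rec;

	assert(req);
	rec.attr = (uint8_t)(req->valid & mask);
	/* Log the requested size on truncate, else the unchanged size. */
	rec.size = (req->valid & X_ATTR_SIZE) ? req->size : current_size;
	return rec;
}
```

Recording the size in every entry, even when it did not change, lets a
log replay take the latest entry at face value.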

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/log.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)

diff --git a/fs/nova/log.c b/fs/nova/log.c
index 437db26..f85b63e 100644
--- a/fs/nova/log.c
+++ b/fs/nova/log.c
@@ -20,6 +20,37 @@
 #include "inode.h"
 #include "log.h"
 
+static void nova_update_setattr_entry(struct inode *inode,
+	struct nova_setattr_logentry *entry,
+	struct nova_log_entry_info *entry_info)
+{
+	struct iattr *attr = entry_info->attr;
+	unsigned int ia_valid = attr->ia_valid, attr_mask;
+
+	/* These flags are in the lowest byte */
+	attr_mask = ATTR_MODE | ATTR_UID | ATTR_GID | ATTR_SIZE |
+			ATTR_ATIME | ATTR_MTIME | ATTR_CTIME;
+
+	entry->entry_type	= SET_ATTR;
+	entry->attr	= ia_valid & attr_mask;
+	entry->mode	= cpu_to_le16(inode->i_mode);
+	entry->uid	= cpu_to_le32(i_uid_read(inode));
+	entry->gid	= cpu_to_le32(i_gid_read(inode));
+	entry->atime	= cpu_to_le32(inode->i_atime.tv_sec);
+	entry->ctime	= cpu_to_le32(inode->i_ctime.tv_sec);
+	entry->mtime	= cpu_to_le32(inode->i_mtime.tv_sec);
+	entry->epoch_id = cpu_to_le64(entry_info->epoch_id);
+	entry->trans_id	= cpu_to_le64(entry_info->trans_id);
+	entry->invalid	= 0;
+
+	if (ia_valid & ATTR_SIZE)
+		entry->size = cpu_to_le64(attr->ia_size);
+	else
+		entry->size = cpu_to_le64(inode->i_size);
+
+	nova_persist_entry(entry);
+}
+
 static int nova_update_write_entry(struct super_block *sb,
 	struct nova_file_write_entry *entry,
 	struct nova_log_entry_info *entry_info)
@@ -116,6 +147,7 @@ static int nova_update_log_entry(struct super_block *sb, struct inode *inode,
 			nova_update_new_dentry(sb, inode, entry, entry_info);
 		break;
 	case SET_ATTR:
+		nova_update_setattr_entry(inode, entry, entry_info);
 		break;
 	case LINK_CHANGE:
 		break;
@@ -166,6 +198,38 @@ static int nova_append_log_entry(struct super_block *sb,
 	return 0;
 }
 
+/* Returns 0 on success; the new tail is left in update->tail */
+static int nova_append_setattr_entry(struct super_block *sb,
+	struct nova_inode *pi, struct inode *inode, struct iattr *attr,
+	struct nova_inode_update *update, u64 *last_setattr, u64 epoch_id)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_log_entry_info entry_info;
+	timing_t append_time;
+	int ret;
+
+	NOVA_START_TIMING(append_setattr_t, append_time);
+	entry_info.type = SET_ATTR;
+	entry_info.attr = attr;
+	entry_info.update = update;
+	entry_info.epoch_id = epoch_id;
+	entry_info.trans_id = sih->trans_id;
+
+	ret = nova_append_log_entry(sb, pi, inode, sih, &entry_info);
+	if (ret) {
+		nova_err(sb, "%s failed\n", __func__);
+		goto out;
+	}
+
+	*last_setattr = sih->last_setattr;
+	sih->last_setattr = entry_info.curr_p;
+
+out:
+	NOVA_END_TIMING(append_setattr_t, append_time);
+	return ret;
+}
+
 /*
  * Append a nova_file_write_entry to the current nova_inode_log_page.
  * blocknr and start_blk are pgoff.
-- 
2.7.4


* [RFC v2 42/83] Log operation: link change append.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (40 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 41/83] Log operation: setattr " Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 43/83] Log operation: in-place update log entry Andiry Xu
                   ` (41 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA appends link change entries to atomically update link count and ctime.
This occurs in link, unlink and rmdir.

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/log.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/log.h |  3 +++
 2 files changed, 55 insertions(+)

diff --git a/fs/nova/log.c b/fs/nova/log.c
index f85b63e..4638ccf 100644
--- a/fs/nova/log.c
+++ b/fs/nova/log.c
@@ -51,6 +51,25 @@ static void nova_update_setattr_entry(struct inode *inode,
 	nova_persist_entry(entry);
 }
 
+static void nova_update_link_change_entry(struct inode *inode,
+	struct nova_link_change_entry *entry,
+	struct nova_log_entry_info *entry_info)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+
+	entry->entry_type	= LINK_CHANGE;
+	entry->epoch_id		= cpu_to_le64(entry_info->epoch_id);
+	entry->trans_id		= cpu_to_le64(entry_info->trans_id);
+	entry->invalid		= 0;
+	entry->links		= cpu_to_le16(inode->i_nlink);
+	entry->ctime		= cpu_to_le32(inode->i_ctime.tv_sec);
+	entry->flags		= cpu_to_le32(sih->i_flags);
+	entry->generation	= cpu_to_le32(inode->i_generation);
+
+	nova_persist_entry(entry);
+}
+
 static int nova_update_write_entry(struct super_block *sb,
 	struct nova_file_write_entry *entry,
 	struct nova_log_entry_info *entry_info)
@@ -150,6 +169,7 @@ static int nova_update_log_entry(struct super_block *sb, struct inode *inode,
 		nova_update_setattr_entry(inode, entry, entry_info);
 		break;
 	case LINK_CHANGE:
+		nova_update_link_change_entry(inode, entry, entry_info);
 		break;
 	default:
 		break;
@@ -230,6 +250,38 @@ static int nova_append_setattr_entry(struct super_block *sb,
 	return ret;
 }
 
+/* Returns new tail after append */
+int nova_append_link_change_entry(struct super_block *sb,
+	struct nova_inode *pi, struct inode *inode,
+	struct nova_inode_update *update, u64 *old_linkc, u64 epoch_id)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_log_entry_info entry_info;
+	int ret = 0;
+	timing_t append_time;
+
+	NOVA_START_TIMING(append_link_change_t, append_time);
+
+	entry_info.type = LINK_CHANGE;
+	entry_info.update = update;
+	entry_info.epoch_id = epoch_id;
+	entry_info.trans_id = sih->trans_id;
+
+	ret = nova_append_log_entry(sb, pi, inode, sih, &entry_info);
+	if (ret) {
+		nova_err(sb, "%s failed\n", __func__);
+		goto out;
+	}
+
+	*old_linkc = sih->last_link_change;
+	sih->last_link_change = entry_info.curr_p;
+	sih->trans_id++;
+out:
+	NOVA_END_TIMING(append_link_change_t, append_time);
+	return ret;
+}
+
 /*
  * Append a nova_file_write_entry to the current nova_inode_log_page.
  * blocknr and start_blk are pgoff.
diff --git a/fs/nova/log.h b/fs/nova/log.h
index db7a72e..f36f4a3 100644
--- a/fs/nova/log.h
+++ b/fs/nova/log.h
@@ -364,6 +364,9 @@ static inline int is_dir_init_entry(struct super_block *sb,
 }
 
 
+int nova_append_link_change_entry(struct super_block *sb,
+	struct nova_inode *pi, struct inode *inode,
+	struct nova_inode_update *update, u64 *old_linkc, u64 epoch_id);
 int nova_append_file_write_entry(struct super_block *sb, struct nova_inode *pi,
 	struct inode *inode, struct nova_file_write_item *item,
 	struct nova_inode_update *update);
-- 
2.7.4


* [RFC v2 43/83] Log operation: in-place update log entry
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (41 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 42/83] Log operation: link change append Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 44/83] Log operation: invalidate log entries Andiry Xu
                   ` (40 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

To update a log entry in place, NOVA starts a lite transaction that journals
the log entry, performs the update, and then commits the transaction.
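
The journal, update, commit order can be illustrated with a userspace model.
This is a sketch under stated assumptions: it treats the journal as a
single-slot undo record holding the entry's prior contents, which the
description implies but does not spell out, and every `toy_` name is
hypothetical. The real code also takes a per-CPU journal lock and issues a
persistence barrier before the commit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical log entry; the real ones live in persistent memory. */
struct toy_entry {
	uint64_t epoch_id;
	uint16_t links;
	uint32_t ctime;
};

/* Hypothetical single-slot undo journal. */
struct toy_journal {
	struct toy_entry saved;   /* copy of the entry before the update */
	struct toy_entry *target; /* entry being updated, NULL when idle */
};

/* Begin: record the old contents so an interrupted update can be undone. */
static void toy_begin(struct toy_journal *j, struct toy_entry *e)
{
	assert(j->target == NULL);
	memcpy(&j->saved, e, sizeof(*e));
	j->target = e;
}

/* Commit: the update is complete, so the saved copy is no longer needed. */
static void toy_commit(struct toy_journal *j)
{
	j->target = NULL;
}

/* Recovery: roll back an update whose transaction never committed. */
static void toy_recover(struct toy_journal *j)
{
	if (j->target) {
		memcpy(j->target, &j->saved, sizeof(j->saved));
		j->target = NULL;
	}
}

/* In-place update wrapped in begin/commit, in the order the patch uses. */
static void toy_inplace_update(struct toy_journal *j, struct toy_entry *e,
	uint16_t links, uint32_t ctime)
{
	toy_begin(j, e);
	e->links = links;
	e->ctime = ctime;
	/* A real implementation would flush and fence here. */
	toy_commit(j);
}
```

If a crash happens between `toy_begin()` and `toy_commit()`, `toy_recover()`
restores the saved copy, so a reader sees either the old entry or the fully
updated one.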

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/inode.h |  12 ++++
 fs/nova/log.c   | 183 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/log.h   |   9 +++
 3 files changed, 204 insertions(+)

diff --git a/fs/nova/inode.h b/fs/nova/inode.h
index 943f77f..6970872 100644
--- a/fs/nova/inode.h
+++ b/fs/nova/inode.h
@@ -5,6 +5,7 @@ struct nova_inode_info_header;
 struct nova_inode;
 
 #include "super.h"
+#include "log.h"
 
 enum nova_new_inode_type {
 	TYPE_CREATE = 0,
@@ -143,6 +144,17 @@ static inline void nova_update_tail(struct nova_inode *pi, u64 new_tail)
 	NOVA_END_TIMING(update_tail_t, update_time);
 }
 
+static inline void nova_update_inode(struct super_block *sb,
+	struct inode *inode, struct nova_inode *pi,
+	struct nova_inode_update *update)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+
+	sih->log_tail = update->tail;
+	nova_update_tail(pi, update->tail);
+}
+
 static inline
 struct inode_table *nova_get_inode_table(struct super_block *sb, int cpu)
 {
diff --git a/fs/nova/log.c b/fs/nova/log.c
index 4638ccf..c8b7d2e 100644
--- a/fs/nova/log.c
+++ b/fs/nova/log.c
@@ -218,6 +218,35 @@ static int nova_append_log_entry(struct super_block *sb,
 	return 0;
 }
 
+/* Perform lite transaction to atomically in-place update log entry */
+static int nova_inplace_update_log_entry(struct super_block *sb,
+	struct inode *inode, void *entry,
+	struct nova_log_entry_info *entry_info)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	enum nova_entry_type type = entry_info->type;
+	u64 journal_tail;
+	size_t size;
+	int cpu;
+	timing_t update_time;
+
+	NOVA_START_TIMING(update_entry_t, update_time);
+	size = nova_get_log_entry_size(sb, type);
+
+	cpu = smp_processor_id();
+	spin_lock(&sbi->journal_locks[cpu]);
+	journal_tail = nova_create_logentry_transaction(sb, entry, type, cpu);
+	nova_update_log_entry(sb, inode, entry, entry_info);
+
+	PERSISTENT_BARRIER();
+
+	nova_commit_lite_transaction(sb, journal_tail, cpu);
+	spin_unlock(&sbi->journal_locks[cpu]);
+
+	NOVA_END_TIMING(update_entry_t, update_time);
+	return 0;
+}
+
 /* Returns new tail after append */
 static int nova_append_setattr_entry(struct super_block *sb,
 	struct nova_inode *pi, struct inode *inode, struct iattr *attr,
@@ -250,6 +279,125 @@ static int nova_append_setattr_entry(struct super_block *sb,
 	return ret;
 }
 
+static int nova_can_inplace_update_setattr(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 epoch_id)
+{
+	u64 last_log = 0;
+	struct nova_setattr_logentry *entry = NULL;
+
+	last_log = sih->last_setattr;
+	if (last_log) {
+		entry = (struct nova_setattr_logentry *)nova_get_block(sb,
+								last_log);
+		/* Do not overwrite setsize entry */
+		if (entry->attr & ATTR_SIZE)
+			return 0;
+		if (entry->epoch_id == epoch_id)
+			return 1;
+	}
+
+	return 0;
+}
+
+static int nova_inplace_update_setattr_entry(struct super_block *sb,
+	struct inode *inode, struct nova_inode_info_header *sih,
+	struct iattr *attr, u64 epoch_id)
+{
+	struct nova_setattr_logentry *entry = NULL;
+	struct nova_log_entry_info entry_info;
+	u64 last_log = 0;
+
+	nova_dbgv("%s : Modifying last log entry for inode %lu\n",
+				__func__, inode->i_ino);
+	last_log = sih->last_setattr;
+	entry = (struct nova_setattr_logentry *)nova_get_block(sb,
+							last_log);
+
+	entry_info.type = SET_ATTR;
+	entry_info.attr = attr;
+	entry_info.epoch_id = epoch_id;
+	entry_info.trans_id = sih->trans_id;
+
+	return nova_inplace_update_log_entry(sb, inode, entry,
+					&entry_info);
+}
+
+int nova_handle_setattr_operation(struct super_block *sb, struct inode *inode,
+	struct nova_inode *pi, unsigned int ia_valid, struct iattr *attr,
+	u64 epoch_id)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode_update update;
+	u64 last_setattr = 0;
+	int ret;
+
+	if (ia_valid & ATTR_MODE)
+		sih->i_mode = inode->i_mode;
+
+	/*
+	 * Let's try to do inplace update.
+	 */
+	if (!(ia_valid & ATTR_SIZE) &&
+			nova_can_inplace_update_setattr(sb, sih, epoch_id)) {
+		nova_inplace_update_setattr_entry(sb, inode, sih,
+						attr, epoch_id);
+	} else {
+		/* We are holding inode lock so OK to append the log */
+		nova_dbgv("%s : Appending last log entry for inode ino = %lu\n",
+				__func__, inode->i_ino);
+		update.tail = 0;
+		ret = nova_append_setattr_entry(sb, pi, inode, attr, &update,
+						&last_setattr, epoch_id);
+		if (ret) {
+			nova_dbg("%s: append setattr entry failure\n",
+								__func__);
+			return ret;
+		}
+
+		nova_update_inode(sb, inode, pi, &update);
+	}
+
+	return 0;
+}
+
+static int nova_can_inplace_update_lcentry(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 epoch_id)
+{
+	u64 last_log = 0;
+	struct nova_link_change_entry *entry = NULL;
+
+	last_log = sih->last_link_change;
+	if (last_log) {
+		entry = (struct nova_link_change_entry *)nova_get_block(sb,
+								last_log);
+		if (entry->epoch_id == epoch_id)
+			return 1;
+	}
+
+	return 0;
+}
+
+static int nova_inplace_update_lcentry(struct super_block *sb,
+	struct inode *inode, struct nova_inode_info_header *sih,
+	u64 epoch_id)
+{
+	struct nova_link_change_entry *entry = NULL;
+	struct nova_log_entry_info entry_info;
+	u64 last_log = 0;
+
+	last_log = sih->last_link_change;
+	entry = (struct nova_link_change_entry *)nova_get_block(sb,
+							last_log);
+
+	entry_info.type = LINK_CHANGE;
+	entry_info.epoch_id = epoch_id;
+	entry_info.trans_id = sih->trans_id;
+
+	return nova_inplace_update_log_entry(sb, inode, entry,
+					&entry_info);
+}
+
 /* Returns new tail after append */
 int nova_append_link_change_entry(struct super_block *sb,
 	struct nova_inode *pi, struct inode *inode,
@@ -263,6 +411,15 @@ int nova_append_link_change_entry(struct super_block *sb,
 
 	NOVA_START_TIMING(append_link_change_t, append_time);
 
+	if (nova_can_inplace_update_lcentry(sb, sih, epoch_id)) {
+		nova_inplace_update_lcentry(sb, inode, sih, epoch_id);
+		update->tail = sih->log_tail;
+
+		*old_linkc = 0;
+		sih->trans_id++;
+		goto out;
+	}
+
 	entry_info.type = LINK_CHANGE;
 	entry_info.update = update;
 	entry_info.epoch_id = epoch_id;
@@ -282,6 +439,14 @@ int nova_append_link_change_entry(struct super_block *sb,
 	return ret;
 }
 
+int nova_inplace_update_write_entry(struct super_block *sb,
+	struct inode *inode, struct nova_file_write_entry *entry,
+	struct nova_log_entry_info *entry_info)
+{
+	return nova_inplace_update_log_entry(sb, inode, entry,
+					entry_info);
+}
+
 /*
  * Append a nova_file_write_entry to the current nova_inode_log_page.
  * blocknr and start_blk are pgoff.
@@ -316,6 +481,24 @@ int nova_append_file_write_entry(struct super_block *sb, struct nova_inode *pi,
 	return ret;
 }
 
+int nova_inplace_update_dentry(struct super_block *sb,
+	struct inode *dir, struct nova_dentry *dentry, int link_change,
+	u64 epoch_id)
+{
+	struct nova_inode_info *si = NOVA_I(dir);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_log_entry_info entry_info;
+
+	entry_info.type = DIR_LOG;
+	entry_info.link_change = link_change;
+	entry_info.epoch_id = epoch_id;
+	entry_info.trans_id = sih->trans_id;
+	entry_info.inplace = 1;
+
+	return nova_inplace_update_log_entry(sb, dir, dentry,
+					&entry_info);
+}
+
 int nova_append_dentry(struct super_block *sb, struct nova_inode *pi,
 	struct inode *dir, struct dentry *dentry, u64 ino,
 	unsigned short de_len, struct nova_inode_update *update,
diff --git a/fs/nova/log.h b/fs/nova/log.h
index f36f4a3..74891b3 100644
--- a/fs/nova/log.h
+++ b/fs/nova/log.h
@@ -364,12 +364,21 @@ static inline int is_dir_init_entry(struct super_block *sb,
 }
 
 
+int nova_handle_setattr_operation(struct super_block *sb, struct inode *inode,
+	struct nova_inode *pi, unsigned int ia_valid, struct iattr *attr,
+	u64 epoch_id);
 int nova_append_link_change_entry(struct super_block *sb,
 	struct nova_inode *pi, struct inode *inode,
 	struct nova_inode_update *update, u64 *old_linkc, u64 epoch_id);
+int nova_inplace_update_write_entry(struct super_block *sb,
+	struct inode *inode, struct nova_file_write_entry *entry,
+	struct nova_log_entry_info *entry_info);
 int nova_append_file_write_entry(struct super_block *sb, struct nova_inode *pi,
 	struct inode *inode, struct nova_file_write_item *item,
 	struct nova_inode_update *update);
+int nova_inplace_update_dentry(struct super_block *sb,
+	struct inode *dir, struct nova_dentry *dentry, int link_change,
+	u64 epoch_id);
 int nova_append_dentry(struct super_block *sb, struct nova_inode *pi,
 	struct inode *dir, struct dentry *dentry, u64 ino,
 	unsigned short de_len, struct nova_inode_update *update,
-- 
2.7.4


* [RFC v2 44/83] Log operation: invalidate log entries
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (42 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 43/83] Log operation: in-place update log entry Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 45/83] Log operation: file inode log lookup and assign Andiry Xu
                   ` (39 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

After new log entries are appended to the log, old log entries
can be marked invalid to facilitate garbage collection.
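
A rough userspace model of this bookkeeping follows. It is a sketch only: the
`toy_` types are hypothetical, the epoch test mirrors `old_entry_freeable()`
from this patch, and the rule that a page becomes a collection candidate once
all of its entries are invalid is an assumption about how the per-page counter
is consumed. Unlike the patch, the sketch also refuses to count an entry twice.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_ENTRIES_PER_PAGE 4

/* Hypothetical log entry carrying only the fields used here. */
struct toy_entry {
	uint64_t epoch_id;
	uint8_t  invalid;
};

/* Hypothetical log page with a counter of invalid entries. */
struct toy_log_page {
	struct toy_entry entries[TOY_ENTRIES_PER_PAGE];
	unsigned int num_entries;
	unsigned int invalid_entries;
};

/* Same test as old_entry_freeable(): the entry must be from this epoch. */
static int toy_entry_freeable(uint64_t entry_epoch, uint64_t current_epoch)
{
	return entry_epoch == current_epoch;
}

/* Mark one entry invalid. Returns 1 if it was marked, 0 if it was skipped. */
static int toy_invalidate(struct toy_log_page *page, unsigned int idx,
	uint64_t current_epoch)
{
	struct toy_entry *e;

	assert(idx < page->num_entries);
	e = &page->entries[idx];
	if (e->invalid || !toy_entry_freeable(e->epoch_id, current_epoch))
		return 0;

	e->invalid = 1;
	page->invalid_entries++;
	return 1;
}

/* Assumed policy: a page can be reclaimed once every entry is invalid. */
static int toy_page_reclaimable(const struct toy_log_page *page)
{
	return page->num_entries > 0 &&
		page->invalid_entries == page->num_entries;
}
```

Entries written in an earlier epoch are left alone, which matches the patch's
behaviour of returning early when `old_entry_freeable()` fails.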

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/log.c  | 160 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/log.h  |   4 ++
 fs/nova/nova.h |  12 +++++
 3 files changed, 176 insertions(+)

diff --git a/fs/nova/log.c b/fs/nova/log.c
index c8b7d2e..d150f2e 100644
--- a/fs/nova/log.c
+++ b/fs/nova/log.c
@@ -20,6 +20,88 @@
 #include "inode.h"
 #include "log.h"
 
+static int nova_execute_invalidate_reassign_logentry(struct super_block *sb,
+	void *entry, enum nova_entry_type type, int reassign,
+	unsigned int num_free)
+{
+	struct nova_file_write_entry *fw_entry;
+	int invalid = 0;
+
+	switch (type) {
+	case FILE_WRITE:
+		fw_entry = (struct nova_file_write_entry *)entry;
+		if (reassign)
+			fw_entry->reassigned = 1;
+		if (num_free)
+			fw_entry->invalid_pages += num_free;
+		if (fw_entry->invalid_pages == fw_entry->num_pages)
+			invalid = 1;
+		break;
+	case DIR_LOG:
+		if (reassign) {
+			((struct nova_dentry *)entry)->reassigned = 1;
+		} else {
+			((struct nova_dentry *)entry)->invalid = 1;
+			invalid = 1;
+		}
+		break;
+	case SET_ATTR:
+		((struct nova_setattr_logentry *)entry)->invalid = 1;
+		invalid = 1;
+		break;
+	case LINK_CHANGE:
+		((struct nova_link_change_entry *)entry)->invalid = 1;
+		invalid = 1;
+		break;
+	default:
+		break;
+	}
+
+	if (invalid) {
+		u64 addr = nova_get_addr_off(NOVA_SB(sb), entry);
+
+		nova_inc_page_invalid_entries(sb, addr);
+	}
+
+	nova_persist_entry(entry);
+	return 0;
+}
+
+static int nova_invalidate_reassign_logentry(struct super_block *sb,
+	void *entry, enum nova_entry_type type, int reassign,
+	unsigned int num_free)
+{
+	nova_execute_invalidate_reassign_logentry(sb, entry, type,
+						reassign, num_free);
+	return 0;
+}
+
+static int nova_invalidate_logentry(struct super_block *sb, void *entry,
+	enum nova_entry_type type, unsigned int num_free)
+{
+	return nova_invalidate_reassign_logentry(sb, entry, type, 0, num_free);
+}
+
+static int nova_reassign_logentry(struct super_block *sb, void *entry,
+	enum nova_entry_type type)
+{
+	return nova_invalidate_reassign_logentry(sb, entry, type, 1, 0);
+}
+
+static inline int nova_invalidate_write_entry(struct super_block *sb,
+	struct nova_file_write_entry *entry, int reassign,
+	unsigned int num_free)
+{
+	if (!entry)
+		return 0;
+
+	if (num_free == 0 && entry->reassigned == 1)
+		return 0;
+
+	return nova_invalidate_reassign_logentry(sb, entry, FILE_WRITE,
+							reassign, num_free);
+}
+
 static void nova_update_setattr_entry(struct inode *inode,
 	struct nova_setattr_logentry *entry,
 	struct nova_log_entry_info *entry_info)
@@ -279,6 +361,27 @@ static int nova_append_setattr_entry(struct super_block *sb,
 	return ret;
 }
 
+/* Invalidate old setattr entry */
+static int nova_invalidate_setattr_entry(struct super_block *sb,
+	u64 last_setattr)
+{
+	struct nova_setattr_logentry *old_entry;
+	void *addr;
+	int ret;
+
+	addr = (void *)nova_get_block(sb, last_setattr);
+	old_entry = (struct nova_setattr_logentry *)addr;
+
+	/* Do not invalidate setsize entries */
+	if (!old_entry_freeable(sb, old_entry->epoch_id) ||
+			(old_entry->attr & ATTR_SIZE))
+		return 0;
+
+	ret = nova_invalidate_logentry(sb, old_entry, SET_ATTR, 0);
+
+	return ret;
+}
+
 static int nova_can_inplace_update_setattr(struct super_block *sb,
 	struct nova_inode_info_header *sih, u64 epoch_id)
 {
@@ -358,9 +461,35 @@ int nova_handle_setattr_operation(struct super_block *sb, struct inode *inode,
 		nova_update_inode(sb, inode, pi, &update);
 	}
 
+	/* Invalidate old setattr entry */
+	if (last_setattr)
+		nova_invalidate_setattr_entry(sb, last_setattr);
+
 	return 0;
 }
 
+/* Invalidate old link change entry */
+int nova_invalidate_link_change_entry(struct super_block *sb,
+	u64 old_link_change)
+{
+	struct nova_link_change_entry *old_entry;
+	void *addr;
+	int ret;
+
+	if (old_link_change == 0)
+		return 0;
+
+	addr = (void *)nova_get_block(sb, old_link_change);
+	old_entry = (struct nova_link_change_entry *)addr;
+
+	if (!old_entry_freeable(sb, old_entry->epoch_id))
+		return 0;
+
+	ret = nova_invalidate_logentry(sb, old_entry, LINK_CHANGE, 0);
+
+	return ret;
+}
+
 static int nova_can_inplace_update_lcentry(struct super_block *sb,
 	struct nova_inode_info_header *sih, u64 epoch_id)
 {
@@ -481,6 +610,37 @@ int nova_append_file_write_entry(struct super_block *sb, struct nova_inode *pi,
 	return ret;
 }
 
+/* Create dentry and delete dentry must be invalidated together */
+int nova_invalidate_dentries(struct super_block *sb,
+	struct nova_inode_update *update)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_dentry *create_dentry;
+	struct nova_dentry *delete_dentry;
+	u64 create_curr, delete_curr;
+	int ret;
+
+	create_dentry = update->create_dentry;
+	delete_dentry = update->delete_dentry;
+
+	if (!create_dentry)
+		return 0;
+
+	nova_reassign_logentry(sb, create_dentry, DIR_LOG);
+
+	if (!old_entry_freeable(sb, create_dentry->epoch_id))
+		return 0;
+
+	create_curr = nova_get_addr_off(sbi, create_dentry);
+	delete_curr = nova_get_addr_off(sbi, delete_dentry);
+
+	nova_invalidate_logentry(sb, create_dentry, DIR_LOG, 0);
+
+	ret = nova_invalidate_logentry(sb, delete_dentry, DIR_LOG, 0);
+
+	return ret;
+}
+
 int nova_inplace_update_dentry(struct super_block *sb,
 	struct inode *dir, struct nova_dentry *dentry, int link_change,
 	u64 epoch_id)
diff --git a/fs/nova/log.h b/fs/nova/log.h
index 74891b3..2548083 100644
--- a/fs/nova/log.h
+++ b/fs/nova/log.h
@@ -367,6 +367,8 @@ static inline int is_dir_init_entry(struct super_block *sb,
 int nova_handle_setattr_operation(struct super_block *sb, struct inode *inode,
 	struct nova_inode *pi, unsigned int ia_valid, struct iattr *attr,
 	u64 epoch_id);
+int nova_invalidate_link_change_entry(struct super_block *sb,
+	u64 old_link_change);
 int nova_append_link_change_entry(struct super_block *sb,
 	struct nova_inode *pi, struct inode *inode,
 	struct nova_inode_update *update, u64 *old_linkc, u64 epoch_id);
@@ -376,6 +378,8 @@ int nova_inplace_update_write_entry(struct super_block *sb,
 int nova_append_file_write_entry(struct super_block *sb, struct nova_inode *pi,
 	struct inode *inode, struct nova_file_write_item *item,
 	struct nova_inode_update *update);
+int nova_invalidate_dentries(struct super_block *sb,
+	struct nova_inode_update *update);
 int nova_inplace_update_dentry(struct super_block *sb,
 	struct inode *dir, struct nova_dentry *dentry, int link_change,
 	u64 epoch_id);
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index 03c4991..6cf3c33 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -328,6 +328,18 @@ struct inode_map {
 	int			freed;
 };
 
+
+/* Old entry is freeable if it is appended after the latest snapshot */
+static inline int old_entry_freeable(struct super_block *sb, u64 epoch_id)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (epoch_id == sbi->s_epoch_id)
+		return 1;
+
+	return 0;
+}
+
 #include "balloc.h"
 
 static inline unsigned long
-- 
2.7.4


* [RFC v2 45/83] Log operation: file inode log lookup and assign
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (43 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 44/83] Log operation: invalidate log entries Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 46/83] Dir: Add Directory radix tree insert/remove methods Andiry Xu
                   ` (38 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

After NOVA appends a file write entry to commit new writes, it updates the
file offset radix tree, finds the old entries (if the write is an overwrite),
and reclaims the stale data blocks.
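
The assign step can be sketched in userspace as follows. This is a simplified
model rather than the patch's code: a flat array of hypothetical `toy_` slots
stands in for the radix tree, and stale pages are counted one at a time,
whereas `nova_assign_write_entry()` batches contiguous pages that belong to
the same old entry before freeing them.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_FILE_PAGES 16

/* Hypothetical write entry covering num_pages pages starting at pgoff. */
struct toy_write_entry {
	unsigned long pgoff;
	unsigned int num_pages;
	unsigned int invalid_pages; /* pages superseded by newer writes */
};

/* Hypothetical flat index standing in for the per-file radix tree. */
struct toy_index {
	struct toy_write_entry *slot[TOY_FILE_PAGES];
};

/*
 * Point every page covered by entry at it. Pages that previously
 * belonged to another entry are counted as stale on that entry.
 * Returns the number of overwritten pages.
 */
static unsigned int toy_assign_write_entry(struct toy_index *idx,
	struct toy_write_entry *entry)
{
	unsigned int freed = 0;
	unsigned int i;

	assert(entry->pgoff + entry->num_pages <= TOY_FILE_PAGES);
	for (i = 0; i < entry->num_pages; i++) {
		unsigned long pgoff = entry->pgoff + i;
		struct toy_write_entry *old = idx->slot[pgoff];

		if (old && old != entry) {
			old->invalid_pages++;
			freed++;
		}
		idx->slot[pgoff] = entry;
	}

	return freed;
}
```

Once an old entry's `invalid_pages` reaches its `num_pages`, the whole entry
is dead, which is the condition the invalidation patch uses for write entries.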

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/log.c  | 108 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/log.h  |   5 +++
 fs/nova/nova.h |  64 ++++++++++++++++++++++++++++++++++
 3 files changed, 177 insertions(+)

diff --git a/fs/nova/log.c b/fs/nova/log.c
index d150f2e..451be27 100644
--- a/fs/nova/log.c
+++ b/fs/nova/log.c
@@ -102,6 +102,50 @@ static inline int nova_invalidate_write_entry(struct super_block *sb,
 							reassign, num_free);
 }
 
+unsigned int nova_free_old_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry,
+	unsigned long pgoff, unsigned int num_free,
+	bool delete_dead, u64 epoch_id)
+{
+	unsigned long old_nvmm;
+	timing_t free_time;
+
+	if (!entry)
+		return 0;
+
+	NOVA_START_TIMING(free_old_t, free_time);
+
+	old_nvmm = get_nvmm(sb, sih, entry, pgoff);
+
+	if (!delete_dead)
+		nova_invalidate_write_entry(sb, entry, 1, num_free);
+
+	nova_dbgv("%s: pgoff %lu, free %u blocks\n",
+				__func__, pgoff, num_free);
+	nova_free_data_blocks(sb, sih, old_nvmm, num_free);
+
+	sih->i_blocks -= num_free;
+
+	NOVA_END_TIMING(free_old_t, free_time);
+	return num_free;
+}
+
+struct nova_file_write_entry *nova_find_next_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, pgoff_t pgoff)
+{
+	struct nova_file_write_entry *entry = NULL;
+	struct nova_file_write_entry *entries[1];
+	int nr_entries;
+
+	nr_entries = radix_tree_gang_lookup(&sih->tree,
+					(void **)entries, pgoff, 1);
+	if (nr_entries == 1)
+		entry = entries[0];
+
+	return entry;
+}
+
 static void nova_update_setattr_entry(struct inode *inode,
 	struct nova_setattr_logentry *entry,
 	struct nova_log_entry_info *entry_info)
@@ -568,6 +612,70 @@ int nova_append_link_change_entry(struct super_block *sb,
 	return ret;
 }
 
+int nova_assign_write_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry,
+	bool free)
+{
+	struct nova_file_write_entry *old_entry;
+	struct nova_file_write_entry *start_old_entry = NULL;
+	void **pentry;
+	unsigned long start_pgoff = entry->pgoff;
+	unsigned long start_old_pgoff = 0;
+	unsigned int num = entry->num_pages;
+	unsigned int num_free = 0;
+	unsigned long curr_pgoff;
+	int i;
+	int ret = 0;
+	timing_t assign_time;
+
+	NOVA_START_TIMING(assign_t, assign_time);
+	for (i = 0; i < num; i++) {
+		curr_pgoff = start_pgoff + i;
+
+		pentry = radix_tree_lookup_slot(&sih->tree, curr_pgoff);
+		if (pentry) {
+			old_entry = radix_tree_deref_slot(pentry);
+			if (old_entry != start_old_entry) {
+				if (start_old_entry && free)
+					nova_free_old_entry(sb, sih,
+							start_old_entry,
+							start_old_pgoff,
+							num_free, false,
+							entry->epoch_id);
+				nova_invalidate_write_entry(sb,
+						start_old_entry, 1, 0);
+
+				start_old_entry = old_entry;
+				start_old_pgoff = curr_pgoff;
+				num_free = 1;
+			} else {
+				num_free++;
+			}
+
+			radix_tree_replace_slot(&sih->tree, pentry, entry);
+		} else {
+			ret = radix_tree_insert(&sih->tree, curr_pgoff, entry);
+			if (ret) {
+				nova_dbg("%s: ERROR %d\n", __func__, ret);
+				goto out;
+			}
+		}
+	}
+
+	if (start_old_entry && free)
+		nova_free_old_entry(sb, sih, start_old_entry,
+					start_old_pgoff, num_free, false,
+					entry->epoch_id);
+
+	nova_invalidate_write_entry(sb, start_old_entry, 1, 0);
+
+out:
+	NOVA_END_TIMING(assign_t, assign_time);
+
+	return ret;
+}
+
 int nova_inplace_update_write_entry(struct super_block *sb,
 	struct inode *inode, struct nova_file_write_entry *entry,
 	struct nova_log_entry_info *entry_info)
diff --git a/fs/nova/log.h b/fs/nova/log.h
index 2548083..f5149f7 100644
--- a/fs/nova/log.h
+++ b/fs/nova/log.h
@@ -398,4 +398,9 @@ int nova_free_contiguous_log_blocks(struct super_block *sb,
 int nova_free_inode_log(struct super_block *sb, struct nova_inode *pi,
 	struct nova_inode_info_header *sih);
 
+void nova_print_nova_log(struct super_block *sb,
+	struct nova_inode_info_header *sih);
+void nova_print_nova_log_pages(struct super_block *sb,
+	struct nova_inode_info_header *sih);
+
 #endif
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index 6cf3c33..8f085cf 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -342,6 +342,70 @@ static inline int old_entry_freeable(struct super_block *sb, u64 epoch_id)
 
 #include "balloc.h"
 
+static inline struct nova_file_write_entry *
+nova_get_write_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long blocknr)
+{
+	struct nova_file_write_entry *entry;
+
+	entry = radix_tree_lookup(&sih->tree, blocknr);
+
+	return entry;
+}
+
+
+/*
+ * Find data at a file offset (pgoff) in the data pointed to by a write log
+ * entry.
+ */
+static inline unsigned long get_nvmm(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry, unsigned long pgoff)
+{
+	/* entry is already verified before this call and resides in dram
+	 * or we can do memcpy_mcsafe here but have to avoid double copy and
+	 * verification of the entry.
+	 */
+	if (entry->pgoff > pgoff || (unsigned long) entry->pgoff +
+			(unsigned long) entry->num_pages <= pgoff) {
+		struct nova_sb_info *sbi = NOVA_SB(sb);
+		u64 curr;
+
+		curr = nova_get_addr_off(sbi, entry);
+		nova_dbg("Entry ERROR: inode %lu, curr 0x%llx, pgoff %lu, entry pgoff %llu, num %u\n",
+			sih->ino,
+			curr, pgoff, entry->pgoff, entry->num_pages);
+		nova_print_nova_log_pages(sb, sih);
+		nova_print_nova_log(sb, sih);
+		NOVA_ASSERT(0);
+	}
+
+	return (unsigned long) (entry->block >> PAGE_SHIFT) + pgoff
+		- entry->pgoff;
+}
+
+static inline u64 nova_find_nvmm_block(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	unsigned long blocknr)
+{
+	unsigned long nvmm;
+	struct nova_file_write_entry *entryc, entry_copy;
+
+	if (!entry) {
+		entry = nova_get_write_entry(sb, sih, blocknr);
+		if (!entry)
+			return 0;
+	}
+
+	entryc = &entry_copy;
+	if (memcpy_mcsafe(entryc, entry,
+			sizeof(struct nova_file_write_entry)) < 0)
+		return 0;
+
+	nvmm = get_nvmm(sb, sih, entryc, blocknr);
+	return nvmm << PAGE_SHIFT;
+}
+
 static inline unsigned long
 nova_get_numblocks(unsigned short btype)
 {
-- 
2.7.4


* [RFC v2 46/83] Dir: Add Directory radix tree insert/remove methods.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (44 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 45/83] Log operation: file inode log lookup and assign Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 47/83] Dir: Add initial dentries when initializing a directory inode log Andiry Xu
                   ` (37 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA uses a per-directory radix tree to quickly locate a dentry in the
directory inode log. The key is the hash of the filename; the value is the
dentry.

Currently hash collisions are ignored, and the radix tree may occupy a large
amount of memory for huge directories. We are considering replacing it in the
future.
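
The lookup scheme and the collision caveat can be illustrated with a small
userspace sketch. The hash uses the same recurrence as `BKDRHash()` in this
patch, except that each byte is cast to `unsigned char`, which is a deviation
made for portability. The fixed bucket array and all `toy_` names are
hypothetical stand-ins for the radix tree. Comparing the stored name on lookup
is one possible way to detect a collision; it is not what the patch does in
`nova_find_dentry()`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BUCKETS 64

/* Hypothetical dentry holding only what the lookup needs. */
struct toy_dentry {
	const char *name;
	size_t name_len;
	unsigned long ino;
};

/* Hypothetical directory index: one dentry pointer per bucket. */
struct toy_dir {
	struct toy_dentry *bucket[TOY_BUCKETS];
};

/* BKDR recurrence with seed 131: hash = hash * 131 + byte. */
static unsigned long toy_bkdr_hash(const char *str, size_t length)
{
	unsigned long hash = 0;
	size_t i;

	for (i = 0; i < length; i++)
		hash = hash * 131 + (unsigned char)str[i];

	return hash;
}

/* Insert a dentry. Returns -1 if the bucket is already occupied. */
static int toy_insert(struct toy_dir *dir, struct toy_dentry *de)
{
	unsigned long slot;

	assert(de->name != NULL);
	slot = toy_bkdr_hash(de->name, de->name_len) % TOY_BUCKETS;
	if (dir->bucket[slot])
		return -1;

	dir->bucket[slot] = de;
	return 0;
}

/* Look up by name and verify the stored name to rule out a collision. */
static struct toy_dentry *toy_find(const struct toy_dir *dir,
	const char *name, size_t name_len)
{
	unsigned long slot = toy_bkdr_hash(name, name_len) % TOY_BUCKETS;
	struct toy_dentry *de = dir->bucket[slot];

	if (!de || de->name_len != name_len ||
	    memcmp(de->name, name, name_len) != 0)
		return NULL;

	return de;
}
```

Because the toy index reduces the hash modulo 64, collisions are far more
likely here than with the full hash value used as the radix tree key.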

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/Makefile |   2 +-
 fs/nova/dir.c    | 141 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/nova.h   |  26 ++++++++++
 3 files changed, 168 insertions(+), 1 deletion(-)
 create mode 100644 fs/nova/dir.c

diff --git a/fs/nova/Makefile b/fs/nova/Makefile
index 4aeadea..3a3243c 100644
--- a/fs/nova/Makefile
+++ b/fs/nova/Makefile
@@ -4,4 +4,4 @@
 
 obj-$(CONFIG_NOVA_FS) += nova.o
 
-nova-y := balloc.o bbuild.o inode.o journal.o log.o rebuild.o stats.o super.o
+nova-y := balloc.o bbuild.o dir.o inode.o journal.o log.o rebuild.o stats.o super.o
diff --git a/fs/nova/dir.c b/fs/nova/dir.c
new file mode 100644
index 0000000..5bea57a
--- /dev/null
+++ b/fs/nova/dir.c
@@ -0,0 +1,141 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * File operations for directories.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include "nova.h"
+#include "inode.h"
+
+#define DT2IF(dt) (((dt) << 12) & S_IFMT)
+#define IF2DT(sif) (((sif) & S_IFMT) >> 12)
+
+struct nova_dentry *nova_find_dentry(struct super_block *sb,
+	struct nova_inode *pi, struct inode *inode, const char *name,
+	unsigned long name_len)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_dentry *direntry;
+	unsigned long hash;
+
+	hash = BKDRHash(name, name_len);
+	direntry = radix_tree_lookup(&sih->tree, hash);
+
+	return direntry;
+}
+
+int nova_insert_dir_radix_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, const char *name,
+	int namelen, struct nova_dentry *direntry)
+{
+	unsigned long hash;
+	int ret;
+
+	hash = BKDRHash(name, namelen);
+	nova_dbgv("%s: insert %s hash %lu\n", __func__, name, hash);
+
+	/* FIXME: hash collision ignored here */
+	ret = radix_tree_insert(&sih->tree, hash, direntry);
+	if (ret)
+		nova_dbg("%s ERROR %d: %s\n", __func__, ret, name);
+
+	return ret;
+}
+
+static int nova_check_dentry_match(struct super_block *sb,
+	struct nova_dentry *dentry, const char *name, int namelen)
+{
+	if (dentry->name_len != namelen)
+		return -EINVAL;
+
+	return strncmp(dentry->name, name, namelen);
+}
+
+int nova_remove_dir_radix_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, const char *name, int namelen,
+	int replay, struct nova_dentry **create_dentry)
+{
+	struct nova_dentry *entry;
+	unsigned long hash;
+
+	hash = BKDRHash(name, namelen);
+	entry = radix_tree_delete(&sih->tree, hash);
+
+	if (replay == 0) {
+		if (!entry) {
+			nova_dbg("%s ERROR: %s, length %d, hash %lu\n",
+					__func__, name, namelen, hash);
+			return -EINVAL;
+		}
+
+		if (entry->ino == 0 || entry->invalid ||
+		    nova_check_dentry_match(sb, entry, name, namelen)) {
+			nova_dbg("%s dentry not match: %s, length %d, hash %lu\n",
+				 __func__, name, namelen, hash);
+			/* for debug information, still allow access to nvmm */
+			nova_dbg("dentry: type %d, inode %llu, name %s, namelen %u, rec len %u\n",
+				 entry->entry_type, le64_to_cpu(entry->ino),
+				 entry->name, entry->name_len,
+				 le16_to_cpu(entry->de_len));
+			return -EINVAL;
+		}
+
+		if (create_dentry)
+			*create_dentry = entry;
+	}
+
+	return 0;
+}
+
+void nova_delete_dir_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih)
+{
+	struct nova_dentry *direntry;
+	unsigned long pos = 0;
+	struct nova_dentry *entries[FREE_BATCH];
+	timing_t delete_time;
+	int nr_entries;
+	int i;
+	void *ret;
+
+	NOVA_START_TIMING(delete_dir_tree_t, delete_time);
+
+	nova_dbgv("%s: delete dir %lu\n", __func__, sih->ino);
+	do {
+		nr_entries = radix_tree_gang_lookup(&sih->tree,
+					(void **)entries, pos, FREE_BATCH);
+		for (i = 0; i < nr_entries; i++) {
+			direntry = entries[i];
+
+			pos = BKDRHash(direntry->name, direntry->name_len);
+			ret = radix_tree_delete(&sih->tree, pos);
+			if (!ret || ret != direntry) {
+				nova_err(sb, "dentry: type %d, inode %llu, "
+					"name %s, namelen %u, rec len %u\n",
+					direntry->entry_type,
+					le64_to_cpu(direntry->ino),
+					direntry->name, direntry->name_len,
+					le16_to_cpu(direntry->de_len));
+				if (!ret)
+					nova_dbg("ret is NULL\n");
+			}
+		}
+		pos++;
+	} while (nr_entries == FREE_BATCH);
+
+	NOVA_END_TIMING(delete_dir_tree_t, delete_time);
+}
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index 8f085cf..3890479 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -340,6 +340,19 @@ static inline int old_entry_freeable(struct super_block *sb, u64 epoch_id)
 	return 0;
 }
 
+// BKDR String Hash Function
+static inline unsigned long BKDRHash(const char *str, int length)
+{
+	unsigned int seed = 131; // 31 131 1313 13131 131313 etc..
+	unsigned long hash = 0;
+	int i;
+
+	for (i = 0; i < length; i++)
+		hash = hash * seed + (*str++);
+
+	return hash;
+}
+
 #include "balloc.h"
 
 static inline struct nova_file_write_entry *
@@ -433,6 +446,19 @@ nova_get_blocknr(struct super_block *sb, u64 block, unsigned short btype)
 /* ==============  Function prototypes  ================= */
 /* ====================================================== */
 
+/* dir.c */
+int nova_insert_dir_radix_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, const char *name,
+	int namelen, struct nova_dentry *direntry);
+int nova_remove_dir_radix_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, const char *name, int namelen,
+	int replay, struct nova_dentry **create_dentry);
+void nova_delete_dir_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih);
+struct nova_dentry *nova_find_dentry(struct super_block *sb,
+	struct nova_inode *pi, struct inode *inode, const char *name,
+	unsigned long name_len);
+
 /* rebuild.c */
 int nova_rebuild_inode(struct super_block *sb, struct nova_inode_info *si,
 	u64 ino, u64 pi_addr, int rebuild_dir);
-- 
2.7.4

* [RFC v2 47/83] Dir: Add initial dentries when initializing a directory inode log.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (45 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 46/83] Dir: Add Directory radix tree insert/remove methods Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 48/83] Dir: Readdir operation Andiry Xu
                   ` (36 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

For the root directory and for each directory newly created via mkdir(),
append the "." and ".." dentries to the directory's inode log.
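
The layout described above can be sketched in user space as follows.
This is only an illustration under stated assumptions: struct
demo_dentry, demo_rec_len() and its 8-byte rounding rule are
hypothetical stand-ins, not the real struct nova_dentry or
NOVA_DIR_LOG_REC_LEN(). It shows the two records being written back to
back, with the returned length giving the new log tail offset.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified record; the real struct nova_dentry differs. */
struct demo_dentry {
	uint8_t  entry_type;
	uint8_t  name_len;
	uint16_t de_len;	/* total record length in bytes */
	uint32_t pad;
	uint64_t ino;
	char     name[];	/* NUL-terminated name follows the header */
};

#define DEMO_DIR_LOG 1

/* Assumed rule: header + name + NUL, rounded up to a multiple of 8. */
uint16_t demo_rec_len(size_t name_len)
{
	size_t raw = sizeof(struct demo_dentry) + name_len + 1;

	return (uint16_t)((raw + 7) & ~(size_t)7);
}

/* Write one record at 'at' and return the number of bytes it occupies. */
size_t demo_put_entry(void *at, const char *name, uint64_t ino)
{
	struct demo_dentry *de = at;
	size_t name_len = strlen(name);
	uint16_t de_len = demo_rec_len(name_len);

	memset(de, 0, de_len);
	de->entry_type = DEMO_DIR_LOG;
	de->name_len = (uint8_t)name_len;
	de->de_len = de_len;
	de->ino = ino;
	memcpy(de->name, name, name_len);
	return de_len;
}

/* Write "." then ".." at the start of the page; return the bytes used. */
size_t demo_init_dir_entries(void *page, uint64_t self_ino,
	uint64_t parent_ino)
{
	size_t used = demo_put_entry(page, ".", self_ino);

	used += demo_put_entry((char *)page + used, "..", parent_ino);
	return used;
}
```

For the root directory both inode numbers are the same, which matches
the nova_init() call that passes NOVA_ROOT_INO twice.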

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/dir.c   | 82 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/nova.h  |  2 ++
 fs/nova/super.c |  5 ++++
 3 files changed, 89 insertions(+)

diff --git a/fs/nova/dir.c b/fs/nova/dir.c
index 5bea57a..377d2da 100644
--- a/fs/nova/dir.c
+++ b/fs/nova/dir.c
@@ -139,3 +139,85 @@ void nova_delete_dir_tree(struct super_block *sb,
 
 	NOVA_END_TIMING(delete_dir_tree_t, delete_time);
 }
+
+/* ========================= Entry operations ============================= */
+
+static unsigned int nova_init_dentry(struct super_block *sb,
+	struct nova_dentry *de_entry, u64 self_ino, u64 parent_ino,
+	u64 epoch_id)
+{
+	void *start = de_entry;
+	struct nova_inode_log_page *curr_page = start;
+	unsigned int length;
+	unsigned short de_len;
+
+	de_len = NOVA_DIR_LOG_REC_LEN(1);
+	memset(de_entry, 0, de_len);
+	de_entry->entry_type = DIR_LOG;
+	de_entry->epoch_id = epoch_id;
+	de_entry->trans_id = 0;
+	de_entry->ino = cpu_to_le64(self_ino);
+	de_entry->name_len = 1;
+	de_entry->de_len = cpu_to_le16(de_len);
+	de_entry->mtime = timespec_trunc(current_kernel_time(),
+					 sb->s_time_gran).tv_sec;
+
+	de_entry->links_count = 1;
+	strncpy(de_entry->name, ".\0", 2);
+	nova_persist_entry(de_entry);
+
+	length = de_len;
+
+	de_entry = (struct nova_dentry *)((char *)de_entry + length);
+	de_len = NOVA_DIR_LOG_REC_LEN(2);
+	memset(de_entry, 0, de_len);
+	de_entry->entry_type = DIR_LOG;
+	de_entry->epoch_id = epoch_id;
+	de_entry->trans_id = 0;
+	de_entry->ino = cpu_to_le64(parent_ino);
+	de_entry->name_len = 2;
+	de_entry->de_len = cpu_to_le16(de_len);
+	de_entry->mtime = timespec_trunc(current_kernel_time(),
+					 sb->s_time_gran).tv_sec;
+
+	de_entry->links_count = 2;
+	strncpy(de_entry->name, "..\0", 3);
+	nova_persist_entry(de_entry);
+	length += de_len;
+
+	nova_set_page_num_entries(sb, curr_page, 2, 1);
+
+	nova_flush_buffer(start, length, 0);
+	return length;
+}
+
+/* Append . and .. entries */
+int nova_append_dir_init_entries(struct super_block *sb,
+	struct nova_inode *pi, u64 self_ino, u64 parent_ino, u64 epoch_id)
+{
+	struct nova_inode_info_header sih;
+	int allocated;
+	u64 new_block;
+	unsigned int length;
+	struct nova_dentry *de_entry;
+
+	sih.ino = self_ino;
+	sih.i_blk_type = NOVA_DEFAULT_BLOCK_TYPE;
+
+	allocated = nova_allocate_inode_log_pages(sb, &sih, 1, &new_block,
+							ANY_CPU, 0);
+	if (allocated != 1) {
+		nova_err(sb, "ERROR: no inode log page available\n");
+		return -ENOMEM;
+	}
+
+	pi->log_tail = pi->log_head = new_block;
+
+	de_entry = (struct nova_dentry *)nova_get_block(sb, new_block);
+
+	length = nova_init_dentry(sb, de_entry, self_ino, parent_ino, epoch_id);
+
+	nova_update_tail(pi, new_block + length);
+
+	return 0;
+}
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index 3890479..a94f44d 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -458,6 +458,8 @@ void nova_delete_dir_tree(struct super_block *sb,
 struct nova_dentry *nova_find_dentry(struct super_block *sb,
 	struct nova_inode *pi, struct inode *inode, const char *name,
 	unsigned long name_len);
+int nova_append_dir_init_entries(struct super_block *sb,
+	struct nova_inode *pi, u64 self_ino, u64 parent_ino, u64 epoch_id);
 
 /* rebuild.c */
 int nova_rebuild_inode(struct super_block *sb, struct nova_inode_info *si,
diff --git a/fs/nova/super.c b/fs/nova/super.c
index 216d396..1e67062 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -349,6 +349,7 @@ static struct nova_inode *nova_init(struct super_block *sb,
 	struct nova_inode *root_i, *pi;
 	struct nova_super_block *super;
 	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u64 epoch_id;
 	timing_t init_time;
 
 	NOVA_START_TIMING(new_init_t, init_time);
@@ -415,6 +416,10 @@ static struct nova_inode *nova_init(struct super_block *sb,
 
 	nova_flush_buffer(root_i, sizeof(*root_i), false);
 
+	epoch_id = nova_get_epoch_id(sb);
+	nova_append_dir_init_entries(sb, root_i, NOVA_ROOT_INO,
+					NOVA_ROOT_INO, epoch_id);
+
 	PERSISTENT_MARK();
 	PERSISTENT_BARRIER();
 	nova_info("NOVA initialization finish\n");
-- 
2.7.4

* [RFC v2 48/83] Dir: Readdir operation.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (46 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 47/83] Dir: Add initial dentries when initializing a directory inode log Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 49/83] Dir: Append create/remove dentry Andiry Xu
                   ` (35 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA reads a directory by traversing its log and reporting the valid
dentries. A valid dentry has an inode number greater than zero, meaning
it is a create dentry, and is not marked invalid or reassigned.
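
A minimal user-space sketch of that log walk, assuming a flat buffer
instead of linked log pages and a hypothetical struct demo_dentry (the
real record has more fields, and the real walk also skips SET_ATTR and
LINK_CHANGE entries):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified log record; not the real struct nova_dentry. */
struct demo_dentry {
	uint8_t  invalid;	/* set once a create dentry is invalidated */
	uint8_t  name_len;
	uint16_t de_len;	/* record length, used to reach the next one */
	uint32_t pad;
	uint64_t ino;		/* 0 marks a remove dentry */
	char     name[];
};

/* Append one record at offset 'tail'; return the new tail offset. */
size_t demo_append(void *buf, size_t tail, const char *name, uint64_t ino)
{
	struct demo_dentry *de = (struct demo_dentry *)((char *)buf + tail);
	size_t name_len = strlen(name);
	size_t de_len = (sizeof(*de) + name_len + 1 + 7) & ~(size_t)7;

	memset(de, 0, de_len);
	de->name_len = (uint8_t)name_len;
	de->de_len = (uint16_t)de_len;
	de->ino = ino;
	memcpy(de->name, name, name_len);
	return tail + de_len;
}

struct demo_listing {
	unsigned int count;	/* number of dentries reported */
	uint64_t last_ino;	/* inode number of the last one reported */
};

/* Walk [0, tail) and report every valid create dentry. */
void demo_readdir(const void *buf, size_t tail, struct demo_listing *out)
{
	size_t pos = 0;

	out->count = 0;
	out->last_ino = 0;
	while (pos < tail) {
		const struct demo_dentry *de =
			(const struct demo_dentry *)((const char *)buf + pos);

		if (de->de_len == 0)	/* corrupt record: stop the walk */
			break;
		if (de->ino > 0 && !de->invalid) {
			out->count++;
			out->last_ino = de->ino;
		}
		pos += de->de_len;
	}
}
```

Remove dentries and invalidated create dentries stay in the log but are
skipped, so the listing reflects only the names that still exist.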

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/dir.c   | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/inode.c |   2 +
 fs/nova/nova.h  |   1 +
 3 files changed, 156 insertions(+)

diff --git a/fs/nova/dir.c b/fs/nova/dir.c
index 377d2da..35a66f9 100644
--- a/fs/nova/dir.c
+++ b/fs/nova/dir.c
@@ -221,3 +221,156 @@ int nova_append_dir_init_entries(struct super_block *sb,
 
 	return 0;
 }
+
+static u64 nova_find_next_dentry_addr(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 pos)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_file_write_entry *entry = NULL;
+	struct nova_file_write_entry *entries[1];
+	int nr_entries;
+	u64 addr = 0;
+
+	nr_entries = radix_tree_gang_lookup(&sih->tree,
+					(void **)entries, pos, 1);
+	if (nr_entries == 1) {
+		entry = entries[0];
+		addr = nova_get_addr_off(sbi, entry);
+	}
+
+	return addr;
+}
+
+static int nova_readdir(struct file *file, struct dir_context *ctx)
+{
+	struct inode *inode = file_inode(file);
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pidir;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *child_pi;
+	struct nova_inode *prev_child_pi = NULL;
+	struct nova_dentry *entry = NULL;
+	struct nova_dentry *prev_entry = NULL;
+	unsigned short de_len;
+	u64 pi_addr;
+	unsigned long pos = 0;
+	ino_t ino;
+	void *addr;
+	u64 curr_p;
+	u8 type;
+	int ret = 0;
+	timing_t readdir_time;
+
+	NOVA_START_TIMING(readdir_t, readdir_time);
+	pidir = nova_get_inode(sb, inode);
+	nova_dbgv("%s: ino %llu, size %llu, pos 0x%llx\n",
+			__func__, (u64)inode->i_ino,
+			pidir->i_size, ctx->pos);
+
+	if (sih->log_head == 0) {
+		nova_err(sb, "Dir %lu log is NULL!\n", inode->i_ino);
+		ret = -ENOSPC;
+		goto out;
+	}
+
+	pos = ctx->pos;
+
+	if (pos == 0)
+		curr_p = sih->log_head;
+	else if (pos == READDIR_END)
+		goto out;
+	else {
+		curr_p = nova_find_next_dentry_addr(sb, sih, pos);
+		if (curr_p == 0)
+			goto out;
+	}
+
+	while (curr_p != sih->log_tail) {
+		if (goto_next_page(sb, curr_p))
+			curr_p = next_log_page(sb, curr_p);
+
+
+		if (curr_p == 0) {
+			nova_err(sb, "Dir %lu log is NULL!\n", inode->i_ino);
+			ret = -EINVAL;
+			goto out;
+		}
+
+		addr = (void *)nova_get_block(sb, curr_p);
+		type = nova_get_entry_type(addr);
+		switch (type) {
+		case SET_ATTR:
+			curr_p += sizeof(struct nova_setattr_logentry);
+			continue;
+		case LINK_CHANGE:
+			curr_p += sizeof(struct nova_link_change_entry);
+			continue;
+		case DIR_LOG:
+			break;
+		default:
+			nova_err(sb, "%s: unknown type %d, 0x%llx\n",
+				 __func__, type, curr_p);
+			ret = -EINVAL;
+			goto out;
+		}
+
+		entry = (struct nova_dentry *)nova_get_block(sb, curr_p);
+		nova_dbgv("curr_p: 0x%llx, type %d, ino %llu, name %s, namelen %u, rec len %u\n",
+			  curr_p, entry->entry_type, le64_to_cpu(entry->ino),
+			  entry->name, entry->name_len,
+			  le16_to_cpu(entry->de_len));
+
+		de_len = le16_to_cpu(entry->de_len);
+		if (entry->ino > 0 && entry->invalid == 0
+					&& entry->reassigned == 0) {
+			ino = __le64_to_cpu(entry->ino);
+			pos = BKDRHash(entry->name, entry->name_len);
+
+			ret = nova_get_inode_address(sb, ino,
+						     &pi_addr, 0);
+			if (ret) {
+				nova_dbg("%s: get child inode %lu address failed %d\n",
+					 __func__, ino, ret);
+				ctx->pos = READDIR_END;
+				goto out;
+			}
+
+			child_pi = nova_get_block(sb, pi_addr);
+			nova_dbgv("ctx: ino %llu, name %s, name_len %u, de_len %u\n",
+				(u64)ino, entry->name, entry->name_len,
+				entry->de_len);
+			if (prev_entry && !dir_emit(ctx, prev_entry->name,
+				prev_entry->name_len, ino,
+				IF2DT(le16_to_cpu(prev_child_pi->i_mode)))) {
+				nova_dbgv("Here: pos %llu\n", ctx->pos);
+				ret = 0;
+				goto out;
+			}
+			prev_entry = entry;
+
+			prev_child_pi = child_pi;
+		}
+		ctx->pos = pos;
+		curr_p += de_len;
+	}
+
+	if (prev_entry && !dir_emit(ctx, prev_entry->name,
+			prev_entry->name_len, ino,
+			IF2DT(le16_to_cpu(prev_child_pi->i_mode))))
+		return 0;
+
+	ctx->pos = READDIR_END;
+	ret = 0;
+out:
+	NOVA_END_TIMING(readdir_t, readdir_time);
+	nova_dbgv("%s return\n", __func__);
+	return ret;
+}
+
+const struct file_operations nova_dir_operations = {
+	.llseek		= generic_file_llseek,
+	.read		= generic_read_dir,
+	.iterate	= nova_readdir,
+	.fsync		= noop_fsync,
+};
diff --git a/fs/nova/inode.c b/fs/nova/inode.c
index 15517cc..41417e3 100644
--- a/fs/nova/inode.c
+++ b/fs/nova/inode.c
@@ -181,6 +181,7 @@ static int nova_read_inode(struct super_block *sb, struct inode *inode,
 	case S_IFREG:
 		break;
 	case S_IFDIR:
+		inode->i_fop = &nova_dir_operations;
 		break;
 	case S_IFLNK:
 		break;
@@ -625,6 +626,7 @@ struct inode *nova_new_vfs_inode(enum nova_new_inode_type type,
 		inode->i_mapping->a_ops = &nova_aops_dax;
 		break;
 	case TYPE_MKDIR:
+		inode->i_fop = &nova_dir_operations;
 		inode->i_mapping->a_ops = &nova_aops_dax;
 		set_nlink(inode, 2);
 		break;
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index a94f44d..ed269fe 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -447,6 +447,7 @@ nova_get_blocknr(struct super_block *sb, u64 block, unsigned short btype)
 /* ====================================================== */
 
 /* dir.c */
+extern const struct file_operations nova_dir_operations;
 int nova_insert_dir_radix_tree(struct super_block *sb,
 	struct nova_inode_info_header *sih, const char *name,
 	int namelen, struct nova_dentry *direntry);
-- 
2.7.4

* [RFC v2 49/83] Dir: Append create/remove dentry.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (47 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 48/83] Dir: Readdir operation Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 50/83] Inode: Add nova_evict_inode Andiry Xu
                   ` (34 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA adds or removes a directory or file by appending a dentry to the
parent directory's log. A dentry contains the file name and the inode
number. A positive inode number indicates a create (valid) dentry, and
a dentry with inode number zero is a remove dentry. If the create
dentry was written in the current epoch, NOVA can instead update it
in place to invalidate it, without appending a remove dentry.
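
The choice between the two removal paths can be sketched like this. It
is a simplified user-space model, not the patch's code: struct
demo_dentry, struct demo_log and the fixed-size array are hypothetical,
and the sketch leaves the old create dentry untouched on the append
path because the code that handles it is not part of this patch.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_LOG_CAP 16

/* Hypothetical, simplified dentry record. */
struct demo_dentry {
	uint64_t ino;		/* > 0: create dentry, 0: remove dentry */
	uint64_t epoch_id;	/* epoch in which the record was written */
	int      invalid;	/* set by an in-place invalidation */
};

struct demo_log {
	struct demo_dentry entries[DEMO_LOG_CAP];
	size_t tail;		/* index of the next free slot */
};

/* Append a record; return its index, or -1 if the log is full. */
int demo_append(struct demo_log *dlog, uint64_t ino, uint64_t epoch_id)
{
	if (dlog->tail >= DEMO_LOG_CAP)
		return -1;
	dlog->entries[dlog->tail].ino = ino;
	dlog->entries[dlog->tail].epoch_id = epoch_id;
	dlog->entries[dlog->tail].invalid = 0;
	return (int)dlog->tail++;
}

/*
 * Remove the name created by entries[create_idx].
 * Returns 1 if the create dentry was invalidated in place,
 * 0 if a remove dentry was appended, -1 on error.
 */
int demo_remove(struct demo_log *dlog, size_t create_idx, uint64_t epoch_id)
{
	struct demo_dentry *old;

	if (create_idx >= dlog->tail)
		return -1;
	old = &dlog->entries[create_idx];
	if (old->ino == 0 || old->invalid)
		return -1;	/* not a live create dentry */

	if (old->epoch_id == epoch_id) {
		old->invalid = 1;	/* same epoch: update in place */
		return 1;
	}
	/* older epoch: append a remove dentry (inode number zero) */
	return demo_append(dlog, 0, epoch_id) < 0 ? -1 : 0;
}
```

In the in-place case the log tail does not move, which is why
nova_remove_dentry() only fills in update->tail when the caller left it
at zero.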

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/dir.c  | 140 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/nova.h |   4 ++
 2 files changed, 144 insertions(+)

diff --git a/fs/nova/dir.c b/fs/nova/dir.c
index 35a66f9..47ee9ad 100644
--- a/fs/nova/dir.c
+++ b/fs/nova/dir.c
@@ -222,6 +222,146 @@ int nova_append_dir_init_entries(struct super_block *sb,
 	return 0;
 }
 
+/* adds a directory entry pointing to the inode. assumes the inode has
+ * already been logged for consistency
+ */
+int nova_add_dentry(struct dentry *dentry, u64 ino, int inc_link,
+	struct nova_inode_update *update, u64 epoch_id)
+{
+	struct inode *dir = dentry->d_parent->d_inode;
+	struct super_block *sb = dir->i_sb;
+	struct nova_inode_info *si = NOVA_I(dir);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *pidir;
+	const char *name = dentry->d_name.name;
+	int namelen = dentry->d_name.len;
+	struct nova_dentry *direntry;
+	unsigned short loglen;
+	int ret;
+	u64 curr_entry;
+	timing_t add_dentry_time;
+
+	nova_dbg_verbose("%s: dir %lu new inode %llu\n",
+				__func__, dir->i_ino, ino);
+	nova_dbg_verbose("%s: %s %d\n", __func__, name, namelen);
+	NOVA_START_TIMING(add_dentry_t, add_dentry_time);
+	if (namelen == 0)
+		return -EINVAL;
+
+	pidir = nova_get_inode(sb, dir);
+
+	/*
+	 * XXX shouldn't update any times until successful
+	 * completion of syscall, but too many callers depend
+	 * on this.
+	 */
+	dir->i_mtime = dir->i_ctime = current_time(dir);
+
+	loglen = NOVA_DIR_LOG_REC_LEN(namelen);
+	ret = nova_append_dentry(sb, pidir, dir, dentry,
+				ino, loglen, update,
+				inc_link, epoch_id);
+
+	if (ret) {
+		nova_dbg("%s: append dir entry failure\n", __func__);
+		return ret;
+	}
+
+	curr_entry = update->curr_entry;
+	direntry = (struct nova_dentry *)nova_get_block(sb, curr_entry);
+	sih->last_dentry = curr_entry;
+	ret = nova_insert_dir_radix_tree(sb, sih, name, namelen, direntry);
+
+	sih->trans_id++;
+	NOVA_END_TIMING(add_dentry_t, add_dentry_time);
+	return ret;
+}
+
+static int nova_can_inplace_update_dentry(struct super_block *sb,
+	struct nova_dentry *dentry, u64 epoch_id)
+{
+	if (dentry && dentry->epoch_id == epoch_id)
+		return 1;
+
+	return 0;
+}
+
+/* removes a directory entry pointing to the inode. assumes the inode has
+ * already been logged for consistency
+ */
+int nova_remove_dentry(struct dentry *dentry, int dec_link,
+	struct nova_inode_update *update, u64 epoch_id)
+{
+	struct inode *dir = dentry->d_parent->d_inode;
+	struct super_block *sb = dir->i_sb;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info *si = NOVA_I(dir);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *pidir;
+	struct qstr *entry = &dentry->d_name;
+	struct nova_dentry *old_dentry = NULL;
+	unsigned short loglen;
+	int ret;
+	u64 curr_entry;
+	timing_t remove_dentry_time;
+
+	NOVA_START_TIMING(remove_dentry_t, remove_dentry_time);
+
+	update->create_dentry = NULL;
+	update->delete_dentry = NULL;
+
+	if (!dentry->d_name.len) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = nova_remove_dir_radix_tree(sb, sih, entry->name, entry->len, 0,
+					&old_dentry);
+
+	if (ret)
+		goto out;
+
+	pidir = nova_get_inode(sb, dir);
+
+	dir->i_mtime = dir->i_ctime = current_time(dir);
+
+	if (nova_can_inplace_update_dentry(sb, old_dentry, epoch_id)) {
+		nova_inplace_update_dentry(sb, dir, old_dentry,
+						dec_link, epoch_id);
+		curr_entry = nova_get_addr_off(sbi, old_dentry);
+
+		sih->last_dentry = curr_entry;
+		/* Leave create/delete_dentry to NULL
+		 * Do not change tail if used as input
+		 */
+		if (update->tail == 0) {
+			update->tail = sih->log_tail;
+		}
+		sih->trans_id++;
+		goto out;
+	}
+
+	loglen = NOVA_DIR_LOG_REC_LEN(entry->len);
+	ret = nova_append_dentry(sb, pidir, dir, dentry,
+				0, loglen, update,
+				dec_link, epoch_id);
+
+	if (ret) {
+		nova_dbg("%s: append dir entry failure\n", __func__);
+		goto out;
+	}
+
+	update->create_dentry = old_dentry;
+	curr_entry = update->curr_entry;
+	update->delete_dentry = (struct nova_dentry *)nova_get_block(sb,
+						curr_entry);
+	sih->last_dentry = curr_entry;
+	sih->trans_id++;
+out:
+	NOVA_END_TIMING(remove_dentry_t, remove_dentry_time);
+	return ret;
+}
+
 static u64 nova_find_next_dentry_addr(struct super_block *sb,
 	struct nova_inode_info_header *sih, u64 pos)
 {
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index ed269fe..3a51dae 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -461,6 +461,10 @@ struct nova_dentry *nova_find_dentry(struct super_block *sb,
 	unsigned long name_len);
 int nova_append_dir_init_entries(struct super_block *sb,
 	struct nova_inode *pi, u64 self_ino, u64 parent_ino, u64 epoch_id);
+int nova_add_dentry(struct dentry *dentry, u64 ino, int inc_link,
+	struct nova_inode_update *update, u64 epoch_id);
+int nova_remove_dentry(struct dentry *dentry, int dec_link,
+	struct nova_inode_update *update, u64 epoch_id);
 
 /* rebuild.c */
 int nova_rebuild_inode(struct super_block *sb, struct nova_inode_info *si,
-- 
2.7.4

* [RFC v2 50/83] Inode: Add nova_evict_inode.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (48 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 49/83] Dir: Append create/remove dentry Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 51/83] Rebuild: directory inode Andiry Xu
                   ` (33 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

If the inode still has links, release only its DRAM resources (radix
tree, etc.). Otherwise, reclaim its data pages and log pages.
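
The decision can be summarized with a small sketch. The struct and enum
below are hypothetical; they only mirror the checks nova_evict_inode()
makes on i_nlink, is_bad_inode(), IS_APPEND() and IS_IMMUTABLE(), and
leave out the error path where reclaiming the inode fails.

```c
#include <stdbool.h>

/* Hypothetical summary of the inode state that eviction looks at. */
struct demo_inode {
	unsigned int nlink;	/* remaining hard links */
	bool bad;		/* inode is marked bad */
	bool append_only;
	bool immutable;
};

enum demo_evict_action {
	DEMO_FREE_DRAM_ONLY,	/* drop the in-DRAM radix tree only */
	DEMO_RECLAIM_ALL	/* also free data pages, log pages, inode */
};

enum demo_evict_action demo_evict_decide(const struct demo_inode *in)
{
	/* Still linked, or bad: keep the persistent state. */
	if (in->nlink != 0 || in->bad)
		return DEMO_FREE_DRAM_ONLY;

	/* Append-only and immutable inodes are not reclaimed either. */
	if (in->append_only || in->immutable)
		return DEMO_FREE_DRAM_ONLY;

	return DEMO_RECLAIM_ALL;
}
```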

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/inode.c | 257 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/nova/inode.h |   5 ++
 fs/nova/log.h   |   7 ++
 fs/nova/super.c |   1 +
 4 files changed, 269 insertions(+), 1 deletion(-)

diff --git a/fs/nova/inode.c b/fs/nova/inode.c
index 41417e3..17addd3 100644
--- a/fs/nova/inode.c
+++ b/fs/nova/inode.c
@@ -457,7 +457,7 @@ static int nova_alloc_unused_inode(struct super_block *sb, int cpuid,
 	return 0;
 }
 
-int nova_free_inuse_inode(struct super_block *sb, unsigned long ino)
+static int nova_free_inuse_inode(struct super_block *sb, unsigned long ino)
 {
 	struct nova_sb_info *sbi = NOVA_SB(sb);
 	struct inode_map *inode_map;
@@ -532,6 +532,261 @@ int nova_free_inuse_inode(struct super_block *sb, unsigned long ino)
 	return ret;
 }
 
+static int nova_free_inode(struct super_block *sb, struct nova_inode *pi,
+	struct nova_inode_info_header *sih)
+{
+	int err = 0;
+	timing_t free_time;
+
+	NOVA_START_TIMING(free_inode_t, free_time);
+
+	nova_free_inode_log(sb, pi, sih);
+
+	sih->log_pages = 0;
+	sih->i_mode = 0;
+	sih->pi_addr = 0;
+	sih->i_size = 0;
+	sih->i_blocks = 0;
+
+	err = nova_free_inuse_inode(sb, pi->nova_ino);
+
+	NOVA_END_TIMING(free_inode_t, free_time);
+	return err;
+}
+
+/*
+ * We do not really rely on this last blocknr
+ * because blocks can be allocated beyond file end
+ */
+static unsigned long nova_get_last_blocknr(struct super_block *sb,
+	struct nova_inode_info_header *sih)
+{
+	struct nova_inode *pi, fake_pi;
+	unsigned long last_blocknr;
+	unsigned int btype;
+	unsigned int data_bits;
+	int ret;
+
+	ret = nova_get_reference(sb, sih->pi_addr, &fake_pi,
+			(void **)&pi, sizeof(struct nova_inode));
+	if (ret) {
+		nova_dbg("%s: read pi @ 0x%lx failed\n",
+				__func__, sih->pi_addr);
+		btype = 0;
+	} else {
+		btype = sih->i_blk_type;
+	}
+
+	data_bits = blk_type_to_shift[btype];
+
+	if (sih->i_size == 0)
+		last_blocknr = 0;
+	else
+		last_blocknr = (sih->i_size - 1) >> data_bits;
+
+	return last_blocknr;
+}
+
+int nova_delete_file_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long start_blocknr,
+	unsigned long last_blocknr, bool delete_nvmm, bool delete_dead,
+	u64 epoch_id)
+{
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_entry *old_entry = NULL;
+	unsigned long pgoff = start_blocknr;
+	unsigned long old_pgoff = 0;
+	unsigned int num_free = 0;
+	int freed = 0;
+	void *ret;
+	timing_t delete_time;
+
+	NOVA_START_TIMING(delete_file_tree_t, delete_time);
+
+	/* Handle EOF blocks */
+	do {
+		entry = radix_tree_lookup(&sih->tree, pgoff);
+		if (entry) {
+			ret = radix_tree_delete(&sih->tree, pgoff);
+			WARN_ON(!ret || ret != entry);
+			if (entry != old_entry) {
+				if (old_entry && delete_nvmm) {
+					nova_free_old_entry(sb, sih,
+							old_entry, old_pgoff,
+							num_free, delete_dead,
+							epoch_id);
+					freed += num_free;
+				}
+
+				old_entry = entry;
+				old_pgoff = pgoff;
+				num_free = 1;
+			} else {
+				num_free++;
+			}
+			pgoff++;
+		} else {
+			/* We are finding a hole. Jump to the next entry. */
+			entry = nova_find_next_entry(sb, sih, pgoff);
+			if (!entry)
+				break;
+
+			pgoff++;
+			pgoff = pgoff > entry->pgoff ? pgoff : entry->pgoff;
+		}
+	} while (1);
+
+	if (old_entry && delete_nvmm) {
+		nova_free_old_entry(sb, sih, old_entry, old_pgoff,
+					num_free, delete_dead, epoch_id);
+		freed += num_free;
+	}
+
+	nova_dbgv("Inode %lu: delete file tree from pgoff %lu to %lu, %d blocks freed\n",
+			sih->ino, start_blocknr, last_blocknr, freed);
+
+	NOVA_END_TIMING(delete_file_tree_t, delete_time);
+	return freed;
+}
+
+static int nova_free_dram_resource(struct super_block *sb,
+	struct nova_inode_info_header *sih)
+{
+	unsigned long last_blocknr;
+	int freed = 0;
+
+	if (sih->ino == 0)
+		return 0;
+
+	if (!(S_ISREG(sih->i_mode)) && !(S_ISDIR(sih->i_mode)))
+		return 0;
+
+	if (S_ISREG(sih->i_mode)) {
+		last_blocknr = nova_get_last_blocknr(sb, sih);
+		freed = nova_delete_file_tree(sb, sih, 0,
+					last_blocknr, false, false, 0);
+	} else {
+		nova_delete_dir_tree(sb, sih);
+		freed = 1;
+	}
+
+	return freed;
+}
+
+static int nova_free_inode_resource(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih)
+{
+	unsigned long last_blocknr;
+	int ret = 0;
+	int freed = 0;
+
+	pi->deleted = 1;
+
+	if (pi->valid) {
+		nova_dbg("%s: inode %lu still valid\n",
+				__func__, sih->ino);
+		pi->valid = 0;
+	}
+	nova_persist_inode(pi);
+
+	/* We need the log to free the blocks from the b-tree */
+	switch (__le16_to_cpu(pi->i_mode) & S_IFMT) {
+	case S_IFREG:
+		last_blocknr = nova_get_last_blocknr(sb, sih);
+		nova_dbgv("%s: file ino %lu\n", __func__, sih->ino);
+		freed = nova_delete_file_tree(sb, sih, 0,
+					last_blocknr, true, true, 0);
+		break;
+	case S_IFDIR:
+		nova_dbgv("%s: dir ino %lu\n", __func__, sih->ino);
+		nova_delete_dir_tree(sb, sih);
+		break;
+	case S_IFLNK:
+		/* Log will be freed later */
+		nova_dbgv("%s: symlink ino %lu\n",
+				__func__, sih->ino);
+		freed = nova_delete_file_tree(sb, sih, 0, 0,
+						true, true, 0);
+		break;
+	default:
+		nova_dbgv("%s: special ino %lu\n",
+				__func__, sih->ino);
+		break;
+	}
+
+	nova_dbg_verbose("%s: Freed %d\n", __func__, freed);
+	/* Then we can free the inode */
+	ret = nova_free_inode(sb, pi, sih);
+	if (ret)
+		nova_err(sb, "%s: free inode %lu failed\n",
+				__func__, sih->ino);
+
+	return ret;
+}
+
+void nova_evict_inode(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pi = nova_get_inode(sb, inode);
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	timing_t evict_time;
+	int destroy = 0;
+	int ret;
+
+	NOVA_START_TIMING(evict_inode_t, evict_time);
+	if (!sih) {
+		nova_err(sb, "%s: ino %lu sih is NULL!\n",
+				__func__, inode->i_ino);
+		NOVA_ASSERT(0);
+		goto out;
+	}
+
+	// pi can be NULL if the file has already been deleted, but a handle
+	// remains.
+	if (pi && pi->nova_ino != inode->i_ino) {
+		nova_err(sb, "%s: inode %lu ino does not match: %llu\n",
+				__func__, inode->i_ino, pi->nova_ino);
+		nova_dbg("inode size %llu, pi addr 0x%lx, pi head 0x%llx, tail 0x%llx, mode %u\n",
+				inode->i_size, sih->pi_addr, sih->log_head,
+				sih->log_tail, pi->i_mode);
+		nova_dbg("sih: ino %lu, inode size %lu, mode %u, inode mode %u\n",
+				sih->ino, sih->i_size,
+				sih->i_mode, inode->i_mode);
+		nova_print_inode_log(sb, inode);
+	}
+
+	nova_dbg_verbose("%s: %lu\n", __func__, inode->i_ino);
+	if (!inode->i_nlink && !is_bad_inode(inode)) {
+		if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+			goto out;
+
+		if (pi) {
+			ret = nova_free_inode_resource(sb, pi, sih);
+			if (ret)
+				goto out;
+		}
+
+		destroy = 1;
+		pi = NULL; /* we no longer own the nova_inode */
+
+		inode->i_mtime = inode->i_ctime = current_time(inode);
+		inode->i_size = 0;
+	}
+out:
+	if (destroy == 0) {
+		nova_dbgv("%s: destroying %lu\n", __func__, inode->i_ino);
+		nova_free_dram_resource(sb, sih);
+	}
+	/* TODO: Since we don't use page-cache, do we really need the following
+	 * call?
+	 */
+	truncate_inode_pages(&inode->i_data, 0);
+
+	clear_inode(inode);
+	NOVA_END_TIMING(evict_inode_t, evict_time);
+}
+
 /* Returns 0 on failure */
 u64 nova_new_nova_inode(struct super_block *sb, u64 *pi_addr)
 {
diff --git a/fs/nova/inode.h b/fs/nova/inode.h
index 6970872..62c8bdc 100644
--- a/fs/nova/inode.h
+++ b/fs/nova/inode.h
@@ -245,6 +245,11 @@ u64 nova_new_nova_inode(struct super_block *sb, u64 *pi_addr);
 struct inode *nova_new_vfs_inode(enum nova_new_inode_type type,
 	struct inode *dir, u64 pi_addr, u64 ino, umode_t mode,
 	size_t size, dev_t rdev, const struct qstr *qstr, u64 epoch_id);
+int nova_delete_file_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, unsigned long start_blocknr,
+	unsigned long last_blocknr, bool delete_nvmm, bool delete_dead,
+	u64 epoch_id);
+extern void nova_evict_inode(struct inode *inode);
 extern int nova_write_inode(struct inode *inode, struct writeback_control *wbc);
 extern void nova_dirty_inode(struct inode *inode, int flags);
 
diff --git a/fs/nova/log.h b/fs/nova/log.h
index f5149f7..87ce5f9 100644
--- a/fs/nova/log.h
+++ b/fs/nova/log.h
@@ -364,6 +364,13 @@ static inline int is_dir_init_entry(struct super_block *sb,
 }
 
 
+unsigned int nova_free_old_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry,
+	unsigned long pgoff, unsigned int num_free,
+	bool delete_dead, u64 epoch_id);
+struct nova_file_write_entry *nova_find_next_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, pgoff_t pgoff);
 int nova_handle_setattr_operation(struct super_block *sb, struct inode *inode,
 	struct nova_inode *pi, unsigned int ia_valid, struct iattr *attr,
 	u64 epoch_id);
diff --git a/fs/nova/super.c b/fs/nova/super.c
index 1e67062..daf3270 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -884,6 +884,7 @@ static struct super_operations nova_sops = {
 	.destroy_inode	= nova_destroy_inode,
 	.write_inode	= nova_write_inode,
 	.dirty_inode	= nova_dirty_inode,
+	.evict_inode	= nova_evict_inode,
 	.put_super	= nova_put_super,
 	.statfs		= nova_statfs,
 	.remount_fs	= nova_remount,
-- 
2.7.4

* [RFC v2 51/83] Rebuild: directory inode.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (49 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 50/83] Inode: Add nova_evict_inode Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 52/83] Rebuild: file inode Andiry Xu
                   ` (32 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

When the VFS issues a read inode command, or when an inode is newly
allocated, walk the inode log to rebuild the inode information and the
radix tree.
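
The replay idea can be sketched as below, assuming an in-memory array
in place of the persistent log. The demo_* types are hypothetical and
carry only a few of the fields in struct nova_inode_rebuild; dentry
replay and radix tree insertion are left out.

```c
#include <stddef.h>
#include <stdint.h>

enum demo_entry_type { DEMO_SET_ATTR = 1, DEMO_LINK_CHANGE = 2 };

/* Hypothetical log entry holding the fields this sketch replays. */
struct demo_entry {
	enum demo_entry_type type;
	uint16_t mode;		/* used by DEMO_SET_ATTR */
	uint32_t uid;		/* used by DEMO_SET_ATTR */
	uint16_t links;		/* used by DEMO_LINK_CHANGE */
	uint32_t ctime;		/* used by DEMO_LINK_CHANGE */
};

/* Temporary DRAM copy of the inode state being rebuilt. */
struct demo_rebuild {
	uint16_t i_mode;
	uint32_t i_uid;
	uint16_t i_links_count;
	uint32_t i_ctime;
};

/*
 * Replay n entries in log order; later entries override earlier ones.
 * Returns 0 on success, -1 if an unknown entry type is found.
 */
int demo_replay(const struct demo_entry *entries, size_t n,
	struct demo_rebuild *reb)
{
	size_t i;

	for (i = 0; i < n; i++) {
		switch (entries[i].type) {
		case DEMO_SET_ATTR:
			reb->i_mode = entries[i].mode;
			reb->i_uid = entries[i].uid;
			break;
		case DEMO_LINK_CHANGE:
			reb->i_links_count = entries[i].links;
			reb->i_ctime = entries[i].ctime;
			break;
		default:
			return -1;
		}
	}
	return 0;
}
```

Because the log is append-only, replaying it from head to tail leaves
the rebuild structure holding the most recent value of each field.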

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/inode.h   |  15 +++
 fs/nova/nova.h    |  21 ++++
 fs/nova/rebuild.c | 329 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 364 insertions(+), 1 deletion(-)

diff --git a/fs/nova/inode.h b/fs/nova/inode.h
index 62c8bdc..42690e6 100644
--- a/fs/nova/inode.h
+++ b/fs/nova/inode.h
@@ -97,6 +97,21 @@ struct nova_inode_info_header {
 	u8  i_blk_type;
 };
 
+/* For rebuild purposes, temporarily store pi information */
+struct nova_inode_rebuild {
+	u64	i_size;
+	u32	i_flags;	/* Inode flags */
+	u32	i_ctime;	/* Inode modification time */
+	u32	i_mtime;	/* Inode b-tree Modification time */
+	u32	i_atime;	/* Access time */
+	u32	i_uid;		/* Owner Uid */
+	u32	i_gid;		/* Group Id */
+	u32	i_generation;	/* File version (for NFS) */
+	u16	i_links_count;	/* Links count */
+	u16	i_mode;		/* File mode */
+	u64	trans_id;
+};
+
 /*
  * DRAM state for inodes
  */
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index 3a51dae..983c6b2 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -301,6 +301,24 @@ static inline u64 nova_get_epoch_id(struct super_block *sb)
 }
 
 #include "inode.h"
+
+static inline int nova_get_head_tail(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih)
+{
+	struct nova_inode fake_pi;
+	int rc;
+
+	rc = memcpy_mcsafe(&fake_pi, pi, sizeof(struct nova_inode));
+	if (rc)
+		return rc;
+
+	sih->i_blk_type = fake_pi.i_blk_type;
+	sih->log_head = fake_pi.log_head;
+	sih->log_tail = fake_pi.log_tail;
+
+	return rc;
+}
+
 #include "log.h"
 
 struct nova_range_node_lowhigh {
@@ -467,6 +485,9 @@ int nova_remove_dentry(struct dentry *dentry, int dec_link,
 	struct nova_inode_update *update, u64 epoch_id);
 
 /* rebuild.c */
+int nova_rebuild_dir_inode_tree(struct super_block *sb,
+	struct nova_inode *pi, u64 pi_addr,
+	struct nova_inode_info_header *sih);
 int nova_rebuild_inode(struct super_block *sb, struct nova_inode_info *si,
 	u64 ino, u64 pi_addr, int rebuild_dir);
 
diff --git a/fs/nova/rebuild.c b/fs/nova/rebuild.c
index 0595851..9a1327d 100644
--- a/fs/nova/rebuild.c
+++ b/fs/nova/rebuild.c
@@ -18,6 +18,319 @@
 #include "nova.h"
 #include "inode.h"
 
+/* entry given to this function is a copy in dram */
+static void nova_apply_setattr_entry(struct super_block *sb,
+	struct nova_inode_rebuild *reb,	struct nova_inode_info_header *sih,
+	struct nova_setattr_logentry *entry)
+{
+	unsigned int data_bits = blk_type_to_shift[sih->i_blk_type];
+	unsigned long first_blocknr, last_blocknr;
+	loff_t start, end;
+	int freed = 0;
+
+	reb->i_mode	= entry->mode;
+	reb->i_uid	= entry->uid;
+	reb->i_gid	= entry->gid;
+	reb->i_atime	= entry->atime;
+
+	if (S_ISREG(reb->i_mode)) {
+		start = entry->size;
+		end = reb->i_size;
+
+		first_blocknr = (start + (1UL << data_bits) - 1) >> data_bits;
+
+		if (end > 0)
+			last_blocknr = (end - 1) >> data_bits;
+		else
+			last_blocknr = 0;
+
+		freed = nova_delete_file_tree(sb, sih, first_blocknr,
+					last_blocknr, false, false, 0);
+	}
+}
+
+/* entry given to this function is a copy in dram */
+static void nova_apply_link_change_entry(struct super_block *sb,
+	struct nova_inode_rebuild *reb,	struct nova_link_change_entry *entry)
+{
+	reb->i_links_count	= entry->links;
+	reb->i_ctime		= entry->ctime;
+	reb->i_flags		= entry->flags;
+	reb->i_generation	= entry->generation;
+
+	/* Do not flush now */
+}
+
+static void nova_update_inode_with_rebuild(struct super_block *sb,
+	struct nova_inode_rebuild *reb, struct nova_inode *pi)
+{
+	pi->i_size = cpu_to_le64(reb->i_size);
+	pi->i_flags = cpu_to_le32(reb->i_flags);
+	pi->i_uid = cpu_to_le32(reb->i_uid);
+	pi->i_gid = cpu_to_le32(reb->i_gid);
+	pi->i_atime = cpu_to_le32(reb->i_atime);
+	pi->i_ctime = cpu_to_le32(reb->i_ctime);
+	pi->i_mtime = cpu_to_le32(reb->i_mtime);
+	pi->i_generation = cpu_to_le32(reb->i_generation);
+	pi->i_links_count = cpu_to_le16(reb->i_links_count);
+	pi->i_mode = cpu_to_le16(reb->i_mode);
+}
+
+static int nova_init_inode_rebuild(struct super_block *sb,
+	struct nova_inode_rebuild *reb, struct nova_inode *pi)
+{
+	struct nova_inode fake_pi;
+	int rc;
+
+	rc = memcpy_mcsafe(&fake_pi, pi, sizeof(struct nova_inode));
+	if (rc)
+		return rc;
+
+	reb->i_size = le64_to_cpu(fake_pi.i_size);
+	reb->i_flags = le32_to_cpu(fake_pi.i_flags);
+	reb->i_uid = le32_to_cpu(fake_pi.i_uid);
+	reb->i_gid = le32_to_cpu(fake_pi.i_gid);
+	reb->i_atime = le32_to_cpu(fake_pi.i_atime);
+	reb->i_ctime = le32_to_cpu(fake_pi.i_ctime);
+	reb->i_mtime = le32_to_cpu(fake_pi.i_mtime);
+	reb->i_generation = le32_to_cpu(fake_pi.i_generation);
+	reb->i_links_count = le16_to_cpu(fake_pi.i_links_count);
+	reb->i_mode = le16_to_cpu(fake_pi.i_mode);
+	reb->trans_id = 0;
+
+	return rc;
+}
+
+static inline void nova_rebuild_file_time_and_size(struct super_block *sb,
+	struct nova_inode_rebuild *reb, u32 mtime, u32 ctime, u64 size)
+{
+	reb->i_mtime = cpu_to_le32(mtime);
+	reb->i_ctime = cpu_to_le32(ctime);
+	reb->i_size = cpu_to_le64(size);
+}
+
+static int nova_rebuild_inode_start(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	struct nova_inode_rebuild *reb, u64 pi_addr)
+{
+	int ret;
+
+	ret = nova_get_head_tail(sb, pi, sih);
+	if (ret)
+		return ret;
+
+	ret = nova_init_inode_rebuild(sb, reb, pi);
+	if (ret)
+		return ret;
+
+	sih->pi_addr = pi_addr;
+
+	nova_dbg_verbose("Log head 0x%llx, tail 0x%llx\n",
+				sih->log_head, sih->log_tail);
+	sih->log_pages = 1;
+
+	return ret;
+}
+
+static int nova_rebuild_inode_finish(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	struct nova_inode_rebuild *reb, u64 curr_p)
+{
+	u64 next;
+
+	sih->i_size = le64_to_cpu(reb->i_size);
+	sih->i_mode = le16_to_cpu(reb->i_mode);
+	sih->i_flags = le32_to_cpu(reb->i_flags);
+	sih->trans_id = reb->trans_id + 1;
+
+	nova_update_inode_with_rebuild(sb, reb, pi);
+	nova_persist_inode(pi);
+
+	/* Keep traversing until log ends */
+	curr_p &= PAGE_MASK;
+	while ((next = next_log_page(sb, curr_p)) > 0) {
+		sih->log_pages++;
+		curr_p = next;
+	}
+
+	return 0;
+}
+
+/******************* Directory rebuild *********************/
+
+static inline void nova_rebuild_dir_time_and_size(struct super_block *sb,
+	struct nova_inode_rebuild *reb, struct nova_dentry *entry)
+{
+	if (!entry || !reb)
+		return;
+
+	reb->i_ctime = entry->mtime;
+	reb->i_mtime = entry->mtime;
+	reb->i_links_count = entry->links_count;
+	//reb->i_size = entry->size;
+}
+
+static void nova_reassign_last_dentry(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 curr_p)
+{
+	struct nova_dentry *dentry, *old_dentry;
+
+	if (sih->last_dentry == 0) {
+		sih->last_dentry = curr_p;
+	} else {
+		old_dentry = (struct nova_dentry *)nova_get_block(sb,
+							sih->last_dentry);
+		dentry = (struct nova_dentry *)nova_get_block(sb, curr_p);
+		if (dentry->trans_id >= old_dentry->trans_id)
+			sih->last_dentry = curr_p;
+	}
+}
+
+static inline int nova_replay_add_dentry(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_dentry *entry)
+{
+	if (!entry->name_len)
+		return -EINVAL;
+
+	nova_dbg_verbose("%s: add %s\n", __func__, entry->name);
+	return nova_insert_dir_radix_tree(sb, sih,
+			entry->name, entry->name_len, entry);
+}
+
+/* entry given to this function is a copy in dram */
+static inline int nova_replay_remove_dentry(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_dentry *entry)
+{
+	nova_dbg_verbose("%s: remove %s\n", __func__, entry->name);
+	nova_remove_dir_radix_tree(sb, sih, entry->name,
+					entry->name_len, 1, NULL);
+	return 0;
+}
+
+static int nova_rebuild_handle_dentry(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_inode_rebuild *reb,
+	struct nova_dentry *entry, u64 curr_p)
+{
+	int ret = 0;
+
+	nova_dbgv("curr_p: 0x%llx, type %d, ino %llu, name %s, namelen %u, rec len %u\n",
+			curr_p,
+			entry->entry_type, le64_to_cpu(entry->ino),
+			entry->name, entry->name_len,
+			le16_to_cpu(entry->de_len));
+
+	nova_reassign_last_dentry(sb, sih, curr_p);
+
+	if (entry->invalid == 0) {
+		if (entry->ino > 0)
+			ret = nova_replay_add_dentry(sb, sih, entry);
+		else
+			ret = nova_replay_remove_dentry(sb, sih, entry);
+	}
+
+	if (ret) {
+		nova_err(sb, "%s ERROR %d\n", __func__, ret);
+		return ret;
+	}
+
+	if (entry->trans_id >= reb->trans_id) {
+		nova_rebuild_dir_time_and_size(sb, reb, entry);
+		reb->trans_id = entry->trans_id;
+	}
+
+	return ret;
+}
+
+int nova_rebuild_dir_inode_tree(struct super_block *sb,
+	struct nova_inode *pi, u64 pi_addr,
+	struct nova_inode_info_header *sih)
+{
+	struct nova_dentry *entry = NULL;
+	struct nova_setattr_logentry *attr_entry = NULL;
+	struct nova_link_change_entry *lc_entry = NULL;
+	struct nova_inode_rebuild rebuild, *reb;
+	u64 ino = pi->nova_ino;
+	unsigned short de_len;
+	timing_t rebuild_time;
+	void *addr, *entryc = NULL;
+	u64 curr_p;
+	u8 type;
+	int ret;
+
+	NOVA_START_TIMING(rebuild_dir_t, rebuild_time);
+	nova_dbgv("Rebuild dir %llu tree\n", ino);
+
+	reb = &rebuild;
+	ret = nova_rebuild_inode_start(sb, pi, sih, reb, pi_addr);
+	if (ret)
+		goto out;
+
+	curr_p = sih->log_head;
+	if (curr_p == 0) {
+		nova_err(sb, "Dir %llu log is NULL!\n", ino);
+		ret = -ENOSPC;
+		goto out;
+	}
+
+	while (curr_p != sih->log_tail) {
+		if (goto_next_page(sb, curr_p)) {
+			sih->log_pages++;
+			curr_p = next_log_page(sb, curr_p);
+		}
+
+		if (curr_p == 0) {
+			nova_err(sb, "Dir %llu log is NULL!\n", ino);
+			ret = -EIO;
+			goto out;
+		}
+
+		addr = (void *)nova_get_block(sb, curr_p);
+
+		entryc = addr;
+
+		type = nova_get_entry_type(entryc);
+
+		switch (type) {
+		case SET_ATTR:
+			attr_entry = (struct nova_setattr_logentry *)entryc;
+			nova_apply_setattr_entry(sb, reb, sih, attr_entry);
+			sih->last_setattr = curr_p;
+			curr_p += sizeof(struct nova_setattr_logentry);
+			break;
+		case LINK_CHANGE:
+			lc_entry = (struct nova_link_change_entry *)entryc;
+			if (lc_entry->trans_id >= reb->trans_id) {
+				nova_apply_link_change_entry(sb, reb, lc_entry);
+				reb->trans_id = lc_entry->trans_id;
+			}
+			sih->last_link_change = curr_p;
+			curr_p += sizeof(struct nova_link_change_entry);
+			break;
+		case DIR_LOG:
+			entry = (struct nova_dentry *)addr;
+			ret = nova_rebuild_handle_dentry(sb, sih, reb,
+					entry, curr_p);
+			if (ret)
+				goto out;
+			de_len = le16_to_cpu(DENTRY(entryc)->de_len);
+			curr_p += de_len;
+			break;
+		default:
+			nova_dbg("%s: unknown type %d, 0x%llx\n",
+					__func__, type, curr_p);
+			NOVA_ASSERT(0);
+			break;
+		}
+	}
+
+	ret = nova_rebuild_inode_finish(sb, pi, sih, reb, curr_p);
+	sih->i_blocks = sih->log_pages;
+
+out:
+	NOVA_END_TIMING(rebuild_dir_t, rebuild_time);
+	return ret;
+}
+
 /* initialize nova inode header and other DRAM data structures */
 int nova_rebuild_inode(struct super_block *sb, struct nova_inode_info *si,
 	u64 ino, u64 pi_addr, int rebuild_dir)
@@ -42,7 +355,21 @@ int nova_rebuild_inode(struct super_block *sb, struct nova_inode_info *si,
 
 	sih->ino = ino;
 
-	/* Traverse the log */
+	switch (__le16_to_cpu(pi->i_mode) & S_IFMT) {
+	case S_IFLNK:
+		/* Treat symlink files as normal files */
+		/* Fall through */
+	case S_IFREG:
+		break;
+	case S_IFDIR:
+		if (rebuild_dir)
+			nova_rebuild_dir_inode_tree(sb, pi, pi_addr, sih);
+		break;
+	default:
+		sih->pi_addr = pi_addr;
+		break;
+	}
+
 	return 0;
 }
 
-- 
2.7.4


* [RFC v2 52/83] Rebuild: file inode.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (50 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 51/83] Rebuild: directory inode Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 53/83] Namei: lookup Andiry Xu
                   ` (31 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Rebuild file inode metadata and radix tree on read_inode.
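
For readers new to the scheme, here is a small user-space sketch (not NOVA
code) of what replaying a file log amounts to. Write entries map page
offsets to data blocks, with later entries overriding earlier ones, and a
truncating setattr drops whole pages past the new size using the same
round-up / round-down arithmetic as nova_apply_setattr_entry(). The toy_*
names, the flat array standing in for the radix tree, and the 4 KiB block
size are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT	12	/* assumed 4 KiB blocks */
#define TOY_MAX_PAGES	64

struct toy_file {
	uint64_t block[TOY_MAX_PAGES];	/* page offset -> data block, 0 = hole */
	uint64_t size;
};

/* Replay a write entry: map num_pages pages starting at pgoff. */
static void toy_replay_write(struct toy_file *f, uint64_t pgoff,
			     uint32_t num_pages, uint64_t first_block,
			     uint64_t new_size)
{
	assert(pgoff + num_pages <= TOY_MAX_PAGES);
	for (uint32_t i = 0; i < num_pages; i++)
		f->block[pgoff + i] = first_block + i;
	f->size = new_size;
}

/*
 * Replay a setattr that sets the size to new_size. Pages that lie entirely
 * past new_size are dropped; returns the number of pages dropped.
 */
static unsigned int toy_replay_truncate(struct toy_file *f, uint64_t new_size)
{
	/* First page fully beyond the new size (round up). */
	uint64_t first = (new_size + (1ULL << TOY_PAGE_SHIFT) - 1)
				>> TOY_PAGE_SHIFT;
	/* Last page covered by the old size (round down). */
	uint64_t last = f->size > 0 ? (f->size - 1) >> TOY_PAGE_SHIFT : 0;
	unsigned int freed = 0;

	for (uint64_t p = first; p <= last && p < TOY_MAX_PAGES; p++) {
		if (f->block[p]) {
			f->block[p] = 0;
			freed++;
		}
	}
	f->size = new_size;
	return freed;
}
```

With a 16 KiB file truncated to 4097 bytes, pages 0 and 1 stay mapped
(page 1 still holds one valid byte) and pages 2 and 3 are dropped.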

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/log.h     |   4 ++
 fs/nova/rebuild.c | 124 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 128 insertions(+)

diff --git a/fs/nova/log.h b/fs/nova/log.h
index 87ce5f9..bdb85eb 100644
--- a/fs/nova/log.h
+++ b/fs/nova/log.h
@@ -385,6 +385,10 @@ int nova_inplace_update_write_entry(struct super_block *sb,
 int nova_append_file_write_entry(struct super_block *sb, struct nova_inode *pi,
 	struct inode *inode, struct nova_file_write_item *item,
 	struct nova_inode_update *update);
+int nova_assign_write_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry,
+	bool free);
 int nova_invalidate_dentries(struct super_block *sb,
 	struct nova_inode_update *update);
 int nova_inplace_update_dentry(struct super_block *sb,
diff --git a/fs/nova/rebuild.c b/fs/nova/rebuild.c
index 9a1327d..07cf6e3 100644
--- a/fs/nova/rebuild.c
+++ b/fs/nova/rebuild.c
@@ -156,6 +156,126 @@ static int nova_rebuild_inode_finish(struct super_block *sb,
 	return 0;
 }
 
+static void nova_rebuild_handle_write_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_inode_rebuild *reb,
+	struct nova_file_write_entry *entry)
+{
+	if (entry->num_pages != entry->invalid_pages) {
+		/*
+		 * The overlapped blocks are already freed.
+		 * Don't double free them, just re-assign the pointers.
+		 */
+		nova_assign_write_entry(sb, sih, entry, false);
+	}
+
+	if (entry->trans_id >= sih->trans_id) {
+		nova_rebuild_file_time_and_size(sb, reb,
+					entry->mtime, entry->mtime,
+					entry->size);
+		reb->trans_id = entry->trans_id;
+	}
+
+	/* Update sih->i_size for setattr apply operations */
+	sih->i_size = le64_to_cpu(reb->i_size);
+}
+
+static int nova_rebuild_file_inode_tree(struct super_block *sb,
+	struct nova_inode *pi, u64 pi_addr,
+	struct nova_inode_info_header *sih)
+{
+	struct nova_file_write_entry *entry = NULL;
+	struct nova_setattr_logentry *attr_entry = NULL;
+	struct nova_link_change_entry *link_change_entry = NULL;
+	struct nova_inode_rebuild rebuild, *reb;
+	unsigned int data_bits = blk_type_to_shift[sih->i_blk_type];
+	u64 ino = pi->nova_ino;
+	timing_t rebuild_time;
+	void *addr, *entryc = NULL;
+	u64 curr_p;
+	u8 type;
+	int ret;
+
+	NOVA_START_TIMING(rebuild_file_t, rebuild_time);
+	nova_dbg_verbose("Rebuild file inode %llu tree\n", ino);
+
+	reb = &rebuild;
+	ret = nova_rebuild_inode_start(sb, pi, sih, reb, pi_addr);
+	if (ret)
+		goto out;
+
+	curr_p = sih->log_head;
+	if (curr_p == 0 && sih->log_tail == 0)
+		goto out;
+
+//	nova_print_nova_log(sb, sih);
+
+	while (curr_p != sih->log_tail) {
+		if (goto_next_page(sb, curr_p)) {
+			sih->log_pages++;
+			curr_p = next_log_page(sb, curr_p);
+		}
+
+		if (curr_p == 0) {
+			nova_err(sb, "File inode %llu log is NULL!\n", ino);
+			ret = -EIO;
+			goto out;
+		}
+
+		addr = (void *)nova_get_block(sb, curr_p);
+
+		entryc = addr;
+
+		type = nova_get_entry_type(entryc);
+
+		switch (type) {
+		case SET_ATTR:
+			attr_entry = (struct nova_setattr_logentry *)entryc;
+			nova_apply_setattr_entry(sb, reb, sih, attr_entry);
+			sih->last_setattr = curr_p;
+			if (attr_entry->trans_id >= reb->trans_id) {
+				nova_rebuild_file_time_and_size(sb, reb,
+							attr_entry->mtime,
+							attr_entry->ctime,
+							attr_entry->size);
+				reb->trans_id = attr_entry->trans_id;
+			}
+
+			/* Update sih->i_size for setattr operation */
+			sih->i_size = le64_to_cpu(reb->i_size);
+			curr_p += sizeof(struct nova_setattr_logentry);
+			break;
+		case LINK_CHANGE:
+			link_change_entry =
+				(struct nova_link_change_entry *)entryc;
+			nova_apply_link_change_entry(sb, reb,
+						link_change_entry);
+			sih->last_link_change = curr_p;
+			curr_p += sizeof(struct nova_link_change_entry);
+			break;
+		case FILE_WRITE:
+			entry = (struct nova_file_write_entry *)addr;
+			nova_rebuild_handle_write_entry(sb, sih, reb,
+						entryc);
+			curr_p += sizeof(struct nova_file_write_entry);
+			break;
+		default:
+			nova_err(sb, "unknown type %d, 0x%llx\n", type, curr_p);
+			NOVA_ASSERT(0);
+			curr_p += sizeof(struct nova_file_write_entry);
+			break;
+		}
+
+	}
+
+	ret = nova_rebuild_inode_finish(sb, pi, sih, reb, curr_p);
+	sih->i_blocks = sih->log_pages + (sih->i_size >> data_bits);
+
+out:
+//	nova_print_inode_log_page(sb, inode);
+	NOVA_END_TIMING(rebuild_file_t, rebuild_time);
+	return ret;
+}
+
 /******************* Directory rebuild *********************/
 
 static inline void nova_rebuild_dir_time_and_size(struct super_block *sb,
@@ -360,12 +480,16 @@ int nova_rebuild_inode(struct super_block *sb, struct nova_inode_info *si,
 		/* Treat symlink files as normal files */
 		/* Fall through */
 	case S_IFREG:
+		nova_rebuild_file_inode_tree(sb, pi, pi_addr, sih);
 		break;
 	case S_IFDIR:
 		if (rebuild_dir)
 			nova_rebuild_dir_inode_tree(sb, pi, pi_addr, sih);
 		break;
 	default:
+		/* In case of special inode, walk the log */
+		if (pi->log_head)
+			nova_rebuild_file_inode_tree(sb, pi, pi_addr, sih);
 		sih->pi_addr = pi_addr;
 		break;
 	}
-- 
2.7.4


* [RFC v2 53/83] Namei: lookup.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (51 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 52/83] Rebuild: file inode Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 54/83] Namei: create and mknod Andiry Xu
                   ` (30 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA looks up the inode number by searching the radix tree with
the filename hash value and locating the corresponding dentry in the log.
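
The lookup path can be pictured with the following user-space sketch (not
NOVA code): hash the name, find the index slot carrying that hash, then
confirm the match against the name stored in the dentry itself. The
multiplicative hash, the linear slot array standing in for the radix tree,
and all toy_* names are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_SLOTS 8

struct toy_dentry {
	uint64_t ino;
	uint8_t name_len;
	char name[32];
};

struct toy_slot {
	unsigned long hash;
	const struct toy_dentry *de;	/* points at the entry "in the log" */
};

struct toy_index {
	struct toy_slot slot[TOY_SLOTS];
	int used;
};

/* Simple multiplicative string hash; the real hash function may differ. */
static unsigned long toy_hash(const char *name, size_t len)
{
	unsigned long hash = 0;

	for (size_t i = 0; i < len; i++)
		hash = hash * 131 + (unsigned char)name[i];
	return hash;
}

static int toy_index_insert(struct toy_index *idx, const struct toy_dentry *de)
{
	assert(de->name_len <= sizeof(de->name));
	if (idx->used == TOY_SLOTS)
		return -1;
	idx->slot[idx->used].hash = toy_hash(de->name, de->name_len);
	idx->slot[idx->used].de = de;
	idx->used++;
	return 0;
}

/* Returns the inode number for name, or 0 if there is no such entry. */
static uint64_t toy_inode_by_name(const struct toy_index *idx,
				  const char *name, size_t len)
{
	unsigned long hash = toy_hash(name, len);

	for (int i = 0; i < idx->used; i++) {
		const struct toy_dentry *de = idx->slot[i].de;

		if (idx->slot[i].hash != hash)
			continue;
		/* A hash match is only a hint; compare the stored name too. */
		if (de->name_len == len && memcmp(de->name, name, len) == 0)
			return de->ino;
	}
	return 0;
}
```

Returning 0 for "not found" mirrors nova_inode_by_name() in the patch
below, where a zero inode number makes nova_lookup() splice a negative
dentry.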

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/Makefile |  3 +-
 fs/nova/inode.c  |  2 ++
 fs/nova/namei.c  | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/nova.h   |  4 +++
 4 files changed, 105 insertions(+), 1 deletion(-)
 create mode 100644 fs/nova/namei.c

diff --git a/fs/nova/Makefile b/fs/nova/Makefile
index 3a3243c..eb97e46 100644
--- a/fs/nova/Makefile
+++ b/fs/nova/Makefile
@@ -4,4 +4,5 @@
 
 obj-$(CONFIG_NOVA_FS) += nova.o
 
-nova-y := balloc.o bbuild.o dir.o inode.o journal.o log.o rebuild.o stats.o super.o
+nova-y := balloc.o bbuild.o dir.o inode.o journal.o log.o namei.o\
+	  rebuild.o stats.o super.o
diff --git a/fs/nova/inode.c b/fs/nova/inode.c
index 17addd3..2d3f7a3 100644
--- a/fs/nova/inode.c
+++ b/fs/nova/inode.c
@@ -181,6 +181,7 @@ static int nova_read_inode(struct super_block *sb, struct inode *inode,
 	case S_IFREG:
 		break;
 	case S_IFDIR:
+		inode->i_op = &nova_dir_inode_operations;
 		inode->i_fop = &nova_dir_operations;
 		break;
 	case S_IFLNK:
@@ -881,6 +882,7 @@ struct inode *nova_new_vfs_inode(enum nova_new_inode_type type,
 		inode->i_mapping->a_ops = &nova_aops_dax;
 		break;
 	case TYPE_MKDIR:
+		inode->i_op = &nova_dir_inode_operations;
 		inode->i_fop = &nova_dir_operations;
 		inode->i_mapping->a_ops = &nova_aops_dax;
 		set_nlink(inode, 2);
diff --git a/fs/nova/namei.c b/fs/nova/namei.c
new file mode 100644
index 0000000..8076f5b
--- /dev/null
+++ b/fs/nova/namei.c
@@ -0,0 +1,97 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Inode operations for directories.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include "nova.h"
+#include "journal.h"
+#include "inode.h"
+
+static ino_t nova_inode_by_name(struct inode *dir, struct qstr *entry,
+				 struct nova_dentry **res_entry)
+{
+	struct super_block *sb = dir->i_sb;
+	struct nova_dentry *direntry;
+
+	direntry = nova_find_dentry(sb, NULL, dir,
+					entry->name, entry->len);
+	if (direntry == NULL)
+		return 0;
+
+	*res_entry = direntry;
+	return direntry->ino;
+}
+
+static struct dentry *nova_lookup(struct inode *dir, struct dentry *dentry,
+				   unsigned int flags)
+{
+	struct inode *inode = NULL;
+	struct nova_dentry *de;
+	ino_t ino;
+	timing_t lookup_time;
+
+	NOVA_START_TIMING(lookup_t, lookup_time);
+	if (dentry->d_name.len > NOVA_NAME_LEN) {
+		nova_dbg("%s: namelen %u exceeds limit\n",
+			__func__, dentry->d_name.len);
+		return ERR_PTR(-ENAMETOOLONG);
+	}
+
+	nova_dbg_verbose("%s: %s\n", __func__, dentry->d_name.name);
+	ino = nova_inode_by_name(dir, &dentry->d_name, &de);
+	nova_dbg_verbose("%s: ino %lu\n", __func__, ino);
+	if (ino) {
+		inode = nova_iget(dir->i_sb, ino);
+		if (inode == ERR_PTR(-ESTALE) || inode == ERR_PTR(-ENOMEM)
+				|| inode == ERR_PTR(-EACCES)) {
+			nova_err(dir->i_sb,
+				  "%s: get inode failed: %lu\n",
+				  __func__, (unsigned long)ino);
+			return ERR_PTR(-EIO);
+		}
+	}
+
+	NOVA_END_TIMING(lookup_t, lookup_time);
+	return d_splice_alias(inode, dentry);
+}
+
+struct dentry *nova_get_parent(struct dentry *child)
+{
+	struct inode *inode;
+	struct qstr dotdot = QSTR_INIT("..", 2);
+	struct nova_dentry *de = NULL;
+	ino_t ino;
+
+	nova_inode_by_name(child->d_inode, &dotdot, &de);
+	if (!de)
+		return ERR_PTR(-ENOENT);
+
+	/* FIXME: can de->ino be avoided by using the return value of
+	 * nova_inode_by_name()?
+	 */
+	ino = le64_to_cpu(de->ino);
+
+	if (ino)
+		inode = nova_iget(child->d_inode->i_sb, ino);
+	else
+		return ERR_PTR(-ENOENT);
+
+	return d_obtain_alias(inode);
+}
+
+const struct inode_operations nova_dir_inode_operations = {
+	.lookup		= nova_lookup,
+};
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index 983c6b2..03ea0bd 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -484,6 +484,10 @@ int nova_add_dentry(struct dentry *dentry, u64 ino, int inc_link,
 int nova_remove_dentry(struct dentry *dentry, int dec_link,
 	struct nova_inode_update *update, u64 epoch_id);
 
+/* namei.c */
+extern const struct inode_operations nova_dir_inode_operations;
+extern struct dentry *nova_get_parent(struct dentry *child);
+
 /* rebuild.c */
 int nova_rebuild_dir_inode_tree(struct super_block *sb,
 	struct nova_inode *pi, u64 pi_addr,
-- 
2.7.4


* [RFC v2 54/83] Namei: create and mknod.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (52 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 53/83] Namei: lookup Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 55/83] Namei: mkdir Andiry Xu
                   ` (29 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA allocates and initializes a new inode, and appends a dentry
to the directory's log. Then NOVA creates a transaction to
commit both changes atomically: update the directory log tail
pointer and validate the new inode.
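
A minimal user-space sketch of the atomic commit follows (not NOVA code).
It assumes an undo-style journal, as the rollback comment in the patch
suggests: the old values of both words are recorded first, both are then
updated in place, and the commit simply discards the undo records. If a
crash happens before the commit, recovery restores the old values. The
toy_* names and the crash flag are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_JOURNAL_SLOTS 4

struct toy_undo {
	uint64_t *addr;		/* word that is about to be modified */
	uint64_t old;		/* its value before the transaction */
};

struct toy_journal {
	struct toy_undo undo[TOY_JOURNAL_SLOTS];
	int n;			/* number of live undo records */
};

/* Record the current value of *addr so it can be rolled back. */
static void toy_journal_log(struct toy_journal *j, uint64_t *addr)
{
	assert(j->n < TOY_JOURNAL_SLOTS);
	j->undo[j->n].addr = addr;
	j->undo[j->n].old = *addr;
	j->n++;
}

/* Commit: dropping the undo records makes the new values permanent. */
static void toy_journal_commit(struct toy_journal *j)
{
	j->n = 0;
}

/* Recovery: records still present belong to an uncommitted transaction. */
static void toy_journal_recover(struct toy_journal *j)
{
	while (j->n > 0) {
		j->n--;
		*j->undo[j->n].addr = j->undo[j->n].old;
	}
}

/* Update the directory log tail and the inode valid flag as one unit. */
static void toy_create(struct toy_journal *j, uint64_t *dir_tail,
		       uint64_t *valid, uint64_t new_tail,
		       int crash_before_commit)
{
	toy_journal_log(j, dir_tail);
	toy_journal_log(j, valid);

	*dir_tail = new_tail;
	*valid = 1;

	if (crash_before_commit)
		return;		/* simulate losing power here */
	toy_journal_commit(j);
}
```

Either both the new tail and the valid flag survive, or neither does, so a
half-created file is never visible after recovery.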

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/namei.c | 141 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 141 insertions(+)

diff --git a/fs/nova/namei.c b/fs/nova/namei.c
index 8076f5b..a07cc4f 100644
--- a/fs/nova/namei.c
+++ b/fs/nova/namei.c
@@ -68,6 +68,145 @@ static struct dentry *nova_lookup(struct inode *dir, struct dentry *dentry,
 	return d_splice_alias(inode, dentry);
 }
 
+static void nova_lite_transaction_for_new_inode(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode *pidir, struct inode *inode,
+	struct inode *dir, struct nova_inode_update *update)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	int cpu;
+	u64 journal_tail;
+	timing_t trans_time;
+
+	NOVA_START_TIMING(create_trans_t, trans_time);
+
+	cpu = smp_processor_id();
+	spin_lock(&sbi->journal_locks[cpu]);
+
+	// If you change what's required to create a new inode, you need to
+	// update this function so the changes will be rolled back on failure.
+	journal_tail = nova_create_inode_transaction(sb, inode, dir, cpu, 1, 0);
+
+	nova_update_inode(sb, dir, pidir, update);
+
+	pi->valid = 1;
+	nova_persist_inode(pi);
+	PERSISTENT_BARRIER();
+
+	nova_commit_lite_transaction(sb, journal_tail, cpu);
+	spin_unlock(&sbi->journal_locks[cpu]);
+
+	NOVA_END_TIMING(create_trans_t, trans_time);
+}
+
+/* Returns new tail after append */
+/*
+ * By the time this is called, we already have created
+ * the directory cache entry for the new file, but it
+ * is so far negative - it has no inode.
+ *
+ * If the create succeeds, we fill in the inode information
+ * with d_instantiate().
+ */
+static int nova_create(struct inode *dir, struct dentry *dentry, umode_t mode,
+			bool excl)
+{
+	struct inode *inode = NULL;
+	int err = PTR_ERR(inode);
+	struct super_block *sb = dir->i_sb;
+	struct nova_inode *pidir, *pi;
+	struct nova_inode_update update;
+	u64 pi_addr = 0;
+	u64 ino, epoch_id;
+	timing_t create_time;
+
+	NOVA_START_TIMING(create_t, create_time);
+
+	pidir = nova_get_inode(sb, dir);
+	if (!pidir)
+		goto out_err;
+
+	epoch_id = nova_get_epoch_id(sb);
+	ino = nova_new_nova_inode(sb, &pi_addr);
+	if (ino == 0)
+		goto out_err;
+
+	update.tail = 0;
+	err = nova_add_dentry(dentry, ino, 0, &update, epoch_id);
+	if (err)
+		goto out_err;
+
+	nova_dbgv("%s: %s\n", __func__, dentry->d_name.name);
+	nova_dbgv("%s: inode %llu, dir %lu\n", __func__, ino, dir->i_ino);
+	inode = nova_new_vfs_inode(TYPE_CREATE, dir, pi_addr, ino, mode,
+					0, 0, &dentry->d_name, epoch_id);
+	if (IS_ERR(inode))
+		goto out_err;
+
+	d_instantiate(dentry, inode);
+	unlock_new_inode(inode);
+
+	pi = nova_get_block(sb, pi_addr);
+	nova_lite_transaction_for_new_inode(sb, pi, pidir, inode, dir,
+						&update);
+	NOVA_END_TIMING(create_t, create_time);
+	return err;
+out_err:
+	nova_err(sb, "%s return %d\n", __func__, err);
+	NOVA_END_TIMING(create_t, create_time);
+	return err;
+}
+
+static int nova_mknod(struct inode *dir, struct dentry *dentry, umode_t mode,
+		       dev_t rdev)
+{
+	struct inode *inode = NULL;
+	int err = PTR_ERR(inode);
+	struct super_block *sb = dir->i_sb;
+	u64 pi_addr = 0;
+	struct nova_inode *pidir, *pi;
+	struct nova_inode_update update;
+	u64 ino;
+	u64 epoch_id;
+	timing_t mknod_time;
+
+	NOVA_START_TIMING(mknod_t, mknod_time);
+
+	pidir = nova_get_inode(sb, dir);
+	if (!pidir)
+		goto out_err;
+
+	epoch_id = nova_get_epoch_id(sb);
+	ino = nova_new_nova_inode(sb, &pi_addr);
+	if (ino == 0)
+		goto out_err;
+
+	nova_dbgv("%s: %s\n", __func__, dentry->d_name.name);
+	nova_dbgv("%s: inode %llu, dir %lu\n", __func__, ino, dir->i_ino);
+
+	update.tail = 0;
+	err = nova_add_dentry(dentry, ino, 0, &update, epoch_id);
+	if (err)
+		goto out_err;
+
+	inode = nova_new_vfs_inode(TYPE_MKNOD, dir, pi_addr, ino, mode,
+					0, rdev, &dentry->d_name, epoch_id);
+	if (IS_ERR(inode))
+		goto out_err;
+
+	d_instantiate(dentry, inode);
+	unlock_new_inode(inode);
+
+	pi = nova_get_block(sb, pi_addr);
+	nova_lite_transaction_for_new_inode(sb, pi, pidir, inode, dir,
+						&update);
+	NOVA_END_TIMING(mknod_t, mknod_time);
+	return err;
+out_err:
+	nova_err(sb, "%s return %d\n", __func__, err);
+	NOVA_END_TIMING(mknod_t, mknod_time);
+	return err;
+}
+
 struct dentry *nova_get_parent(struct dentry *child)
 {
 	struct inode *inode;
@@ -93,5 +232,7 @@ struct dentry *nova_get_parent(struct dentry *child)
 }
 
 const struct inode_operations nova_dir_inode_operations = {
+	.create		= nova_create,
 	.lookup		= nova_lookup,
+	.mknod		= nova_mknod,
 };
-- 
2.7.4


* [RFC v2 55/83] Namei: mkdir
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (53 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 54/83] Namei: create and mknod Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 56/83] Namei: link and unlink Andiry Xu
                   ` (28 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA mkdir is similar to create. The difference is that NOVA
allocates a log page for the newly created directory and appends
the initial dentries to it.
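
To make the extra step concrete, the user-space sketch below (not NOVA
code) seeds a fresh directory log with its first two entries. It assumes
the initial dentries are "." (the directory itself) and ".." (its parent),
which matches nova_get_parent() looking up ".." but is not spelled out in
this patch. The toy_* names and the fixed-size log are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_LOG_ENTRIES 8

struct toy_dir_entry {
	uint64_t ino;
	char name[8];
};

struct toy_dir_log {
	struct toy_dir_entry entry[TOY_LOG_ENTRIES];
	int tail;		/* index of the next free entry */
};

/* Append one entry at the tail; returns -1 when the log page is full. */
static int toy_append(struct toy_dir_log *log, const char *name, uint64_t ino)
{
	struct toy_dir_entry *e;

	assert(log != NULL);
	if (log->tail == TOY_LOG_ENTRIES)
		return -1;

	e = &log->entry[log->tail];
	e->ino = ino;
	strncpy(e->name, name, sizeof(e->name) - 1);
	e->name[sizeof(e->name) - 1] = '\0';
	log->tail++;
	return 0;
}

/* Start a new directory log with entries for itself and its parent. */
static int toy_init_dir_log(struct toy_dir_log *log, uint64_t self_ino,
			    uint64_t parent_ino)
{
	log->tail = 0;
	if (toy_append(log, ".", self_ino))
		return -1;
	return toy_append(log, "..", parent_ino);
}
```

Because the new log already contains entries, the directory tree can be
built right away by replaying it, which is what nova_mkdir() does via
nova_rebuild_dir_inode_tree().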

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/namei.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)

diff --git a/fs/nova/namei.c b/fs/nova/namei.c
index a07cc4f..a95b2fe 100644
--- a/fs/nova/namei.c
+++ b/fs/nova/namei.c
@@ -207,6 +207,79 @@ static int nova_mknod(struct inode *dir, struct dentry *dentry, umode_t mode,
 	return err;
 }
 
+static int nova_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+	struct super_block *sb = dir->i_sb;
+	struct inode *inode;
+	struct nova_inode *pidir, *pi;
+	struct nova_inode_info *si, *sidir;
+	struct nova_inode_info_header *sih = NULL;
+	struct nova_inode_update update;
+	u64 pi_addr = 0;
+	u64 ino;
+	u64 epoch_id;
+	int err = -EMLINK;
+	timing_t mkdir_time;
+
+	NOVA_START_TIMING(mkdir_t, mkdir_time);
+	if (dir->i_nlink >= NOVA_LINK_MAX)
+		goto out;
+
+	ino = nova_new_nova_inode(sb, &pi_addr);
+	if (ino == 0)
+		goto out_err;
+
+	epoch_id = nova_get_epoch_id(sb);
+	nova_dbgv("%s: name %s\n", __func__, dentry->d_name.name);
+	nova_dbgv("%s: inode %llu, dir %lu, link %d\n", __func__,
+				ino, dir->i_ino, dir->i_nlink);
+
+	update.tail = 0;
+	err = nova_add_dentry(dentry, ino, 1, &update, epoch_id);
+	if (err) {
+		nova_dbg("failed to add dir entry\n");
+		goto out_err;
+	}
+
+	inode = nova_new_vfs_inode(TYPE_MKDIR, dir, pi_addr, ino,
+					S_IFDIR | mode, sb->s_blocksize,
+					0, &dentry->d_name, epoch_id);
+	if (IS_ERR(inode)) {
+		err = PTR_ERR(inode);
+		goto out_err;
+	}
+
+	pi = nova_get_inode(sb, inode);
+	err = nova_append_dir_init_entries(sb, pi, inode->i_ino, dir->i_ino,
+					epoch_id);
+	if (err < 0)
+		goto out_err;
+
+	/* Build the dir tree */
+	si = NOVA_I(inode);
+	sih = &si->header;
+	nova_rebuild_dir_inode_tree(sb, pi, pi_addr, sih);
+
+	pidir = nova_get_inode(sb, dir);
+	sidir = NOVA_I(dir);
+	sih = &sidir->header;
+	dir->i_blocks = sih->i_blocks;
+	inc_nlink(dir);
+	d_instantiate(dentry, inode);
+	unlock_new_inode(inode);
+
+	nova_lite_transaction_for_new_inode(sb, pi, pidir, inode, dir,
+					&update);
+out:
+	NOVA_END_TIMING(mkdir_t, mkdir_time);
+	return err;
+
+out_err:
+//	clear_nlink(inode);
+	nova_err(sb, "%s return %d\n", __func__, err);
+	goto out;
+}
+
 struct dentry *nova_get_parent(struct dentry *child)
 {
 	struct inode *inode;
@@ -234,5 +307,6 @@ struct dentry *nova_get_parent(struct dentry *child)
 const struct inode_operations nova_dir_inode_operations = {
 	.create		= nova_create,
 	.lookup		= nova_lookup,
+	.mkdir		= nova_mkdir,
 	.mknod		= nova_mknod,
 };
-- 
2.7.4


* [RFC v2 56/83] Namei: link and unlink.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (54 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 55/83] Namei: mkdir Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 57/83] Namei: rmdir Andiry Xu
                   ` (27 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

For link change operations, NOVA appends a link change entry
to the affected inode's log, and uses a lite transaction to
atomically commit the changes to multiple logs.

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/namei.c | 159 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 159 insertions(+)

diff --git a/fs/nova/namei.c b/fs/nova/namei.c
index a95b2fe..360d716 100644
--- a/fs/nova/namei.c
+++ b/fs/nova/namei.c
@@ -207,6 +207,163 @@ static int nova_mknod(struct inode *dir, struct dentry *dentry, umode_t mode,
 	return err;
 }
 
+static void nova_lite_transaction_for_time_and_link(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode *pidir, struct inode *inode,
+	struct inode *dir, struct nova_inode_update *update,
+	struct nova_inode_update *update_dir, int invalidate, u64 epoch_id)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	u64 journal_tail;
+	int cpu;
+	timing_t trans_time;
+
+	NOVA_START_TIMING(link_trans_t, trans_time);
+
+	cpu = smp_processor_id();
+	spin_lock(&sbi->journal_locks[cpu]);
+
+	// If you change what's required to create a new inode, you need to
+	// update this function so the changes will be rolled back on failure.
+	journal_tail = nova_create_inode_transaction(sb, inode, dir, cpu,
+						0, invalidate);
+
+	if (invalidate) {
+		pi->valid = 0;
+		pi->delete_epoch_id = epoch_id;
+	}
+	nova_update_inode(sb, inode, pi, update);
+
+	nova_update_inode(sb, dir, pidir, update_dir);
+
+	PERSISTENT_BARRIER();
+
+	nova_commit_lite_transaction(sb, journal_tail, cpu);
+	spin_unlock(&sbi->journal_locks[cpu]);
+
+	NOVA_END_TIMING(link_trans_t, trans_time);
+}
+
+static int nova_link(struct dentry *dest_dentry, struct inode *dir,
+		      struct dentry *dentry)
+{
+	struct super_block *sb = dir->i_sb;
+	struct inode *inode = dest_dentry->d_inode;
+	struct nova_inode *pi = nova_get_inode(sb, inode);
+	struct nova_inode *pidir;
+	struct nova_inode_update update_dir;
+	struct nova_inode_update update;
+	u64 old_linkc = 0;
+	u64 epoch_id;
+	int err = -ENOMEM;
+	timing_t link_time;
+
+	NOVA_START_TIMING(link_t, link_time);
+	if (inode->i_nlink >= NOVA_LINK_MAX) {
+		err = -EMLINK;
+		goto out;
+	}
+
+	pidir = nova_get_inode(sb, dir);
+	if (!pidir) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	ihold(inode);
+	epoch_id = nova_get_epoch_id(sb);
+
+	nova_dbgv("%s: name %s, dest %s\n", __func__,
+			dentry->d_name.name, dest_dentry->d_name.name);
+	nova_dbgv("%s: inode %lu, dir %lu\n", __func__,
+			inode->i_ino, dir->i_ino);
+
+	update_dir.tail = 0;
+	err = nova_add_dentry(dentry, inode->i_ino, 0, &update_dir, epoch_id);
+	if (err) {
+		iput(inode);
+		goto out;
+	}
+
+	inode->i_ctime = current_time(inode);
+	inc_nlink(inode);
+
+	update.tail = 0;
+	err = nova_append_link_change_entry(sb, pi, inode, &update,
+						&old_linkc, epoch_id);
+	if (err) {
+		iput(inode);
+		goto out;
+	}
+
+	d_instantiate(dentry, inode);
+	nova_lite_transaction_for_time_and_link(sb, pi, pidir, inode, dir,
+					&update, &update_dir, 0, epoch_id);
+
+	nova_invalidate_link_change_entry(sb, old_linkc);
+
+out:
+	NOVA_END_TIMING(link_t, link_time);
+	return err;
+}
+
+static int nova_unlink(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = dentry->d_inode;
+	struct super_block *sb = dir->i_sb;
+	int retval = -ENOMEM;
+	struct nova_inode *pi = nova_get_inode(sb, inode);
+	struct nova_inode *pidir;
+	struct nova_inode_update update_dir;
+	struct nova_inode_update update;
+	u64 old_linkc = 0;
+	u64 epoch_id;
+	int invalidate = 0;
+	timing_t unlink_time;
+
+	NOVA_START_TIMING(unlink_t, unlink_time);
+
+	pidir = nova_get_inode(sb, dir);
+	if (!pidir)
+		goto out;
+
+	epoch_id = nova_get_epoch_id(sb);
+	nova_dbgv("%s: %s\n", __func__, dentry->d_name.name);
+	nova_dbgv("%s: inode %lu, dir %lu\n", __func__,
+				inode->i_ino, dir->i_ino);
+
+	update_dir.tail = 0;
+	retval = nova_remove_dentry(dentry, 0, &update_dir, epoch_id);
+	if (retval)
+		goto out;
+
+	inode->i_ctime = dir->i_ctime;
+
+	if (inode->i_nlink == 1)
+		invalidate = 1;
+
+	if (inode->i_nlink)
+		drop_nlink(inode);
+
+	update.tail = 0;
+	retval = nova_append_link_change_entry(sb, pi, inode, &update,
+						&old_linkc, epoch_id);
+	if (retval)
+		goto out;
+
+	nova_lite_transaction_for_time_and_link(sb, pi, pidir, inode, dir,
+				&update, &update_dir, invalidate, epoch_id);
+
+	nova_invalidate_link_change_entry(sb, old_linkc);
+	nova_invalidate_dentries(sb, &update_dir);
+
+	NOVA_END_TIMING(unlink_t, unlink_time);
+	return 0;
+out:
+	nova_err(sb, "%s return %d\n", __func__, retval);
+	NOVA_END_TIMING(unlink_t, unlink_time);
+	return retval;
+}
+
 static int nova_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 {
 	struct super_block *sb = dir->i_sb;
@@ -307,6 +464,8 @@ struct dentry *nova_get_parent(struct dentry *child)
 const struct inode_operations nova_dir_inode_operations = {
 	.create		= nova_create,
 	.lookup		= nova_lookup,
+	.link		= nova_link,
+	.unlink		= nova_unlink,
 	.mkdir		= nova_mkdir,
 	.mknod		= nova_mknod,
 };
-- 
2.7.4
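
The lite transaction described in the commit message above can be pictured
with a toy userspace model. This is a minimal sketch, assuming an undo
journal that saves the old log tails before they are overwritten; every
toy_* name is hypothetical, and none of this is NOVA's actual journal code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not NOVA's real structures. */
struct toy_inode {
	uint64_t log_tail;	/* last committed position in this inode's log */
};

struct toy_journal_entry {
	uint64_t *addr;		/* word that is about to be overwritten */
	uint64_t old_value;	/* value to restore on rollback */
};

struct toy_journal {
	struct toy_journal_entry entries[4];
	size_t count;		/* number of pending (uncommitted) undo records */
};

/* Save the current value of *addr before it is modified. */
static inline void toy_journal_record(struct toy_journal *j, uint64_t *addr)
{
	j->entries[j->count].addr = addr;
	j->entries[j->count].old_value = *addr;
	j->count++;
}

/* Commit: dropping the undo records makes the new values final. */
static inline void toy_journal_commit(struct toy_journal *j)
{
	j->count = 0;
}

/* Recovery: undo, newest first, every update that never committed. */
static inline void toy_journal_recover(struct toy_journal *j)
{
	while (j->count > 0) {
		j->count--;
		*j->entries[j->count].addr = j->entries[j->count].old_value;
	}
}

/* Publish new log tails for an inode and its parent directory together. */
static inline void toy_link_transaction(struct toy_journal *j,
		struct toy_inode *inode, uint64_t inode_tail,
		struct toy_inode *dir, uint64_t dir_tail,
		int crash_before_commit)
{
	toy_journal_record(j, &inode->log_tail);
	toy_journal_record(j, &dir->log_tail);

	inode->log_tail = inode_tail;
	dir->log_tail = dir_tail;

	if (crash_before_commit)
		return;		/* simulated crash: undo records stay pending */

	toy_journal_commit(j);
}
```

In this model, running toy_journal_recover() after a committed transaction
changes nothing, while running it after a simulated crash restores both
tails, so the two logs never disagree about whether the link change happened.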


* [RFC v2 57/83] Namei: rmdir
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (55 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 56/83] Namei: link and unlink Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 58/83] Namei: rename Andiry Xu
                   ` (26 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Similar to unlink.

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/namei.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 105 insertions(+)

diff --git a/fs/nova/namei.c b/fs/nova/namei.c
index 360d716..4bf6396 100644
--- a/fs/nova/namei.c
+++ b/fs/nova/namei.c
@@ -437,6 +437,110 @@ static int nova_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 	goto out;
 }
 
+/*
+ * routine to check that the specified directory is empty (for rmdir)
+ */
+static int nova_empty_dir(struct inode *inode)
+{
+	struct super_block *sb;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_dentry *entry;
+	unsigned long pos = 0;
+	struct nova_dentry *entries[4];
+	int nr_entries;
+	int i;
+
+	sb = inode->i_sb;
+	nr_entries = radix_tree_gang_lookup(&sih->tree,
+					(void **)entries, pos, 4);
+	if (nr_entries > 2)
+		return 0;
+
+	for (i = 0; i < nr_entries; i++) {
+		entry = entries[i];
+
+		if (!is_dir_init_entry(sb, entry))
+			return 0;
+	}
+
+	return 1;
+}
+
+static int nova_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = dentry->d_inode;
+	struct nova_dentry *de;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pi = nova_get_inode(sb, inode), *pidir;
+	struct nova_inode_update update_dir;
+	struct nova_inode_update update;
+	u64 old_linkc = 0;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	int err = -ENOTEMPTY;
+	u64 epoch_id;
+	timing_t rmdir_time;
+
+	NOVA_START_TIMING(rmdir_t, rmdir_time);
+	if (!inode)
+		return -ENOENT;
+
+	nova_dbgv("%s: name %s\n", __func__, dentry->d_name.name);
+	pidir = nova_get_inode(sb, dir);
+	if (!pidir)
+		return -EINVAL;
+
+	if (nova_inode_by_name(dir, &dentry->d_name, &de) == 0)
+		return -ENOENT;
+
+	if (!nova_empty_dir(inode))
+		return err;
+
+	nova_dbgv("%s: inode %lu, dir %lu, link %d\n", __func__,
+				inode->i_ino, dir->i_ino, dir->i_nlink);
+
+	if (inode->i_nlink != 2)
+		nova_dbg("empty directory %lu has nlink!=2 (%d), dir %lu\n",
+				inode->i_ino, inode->i_nlink, dir->i_ino);
+
+	epoch_id = nova_get_epoch_id(sb);
+
+	update_dir.tail = 0;
+	err = nova_remove_dentry(dentry, -1, &update_dir, epoch_id);
+	if (err)
+		goto end_rmdir;
+
+	/*inode->i_version++; */
+	clear_nlink(inode);
+	inode->i_ctime = dir->i_ctime;
+
+	if (dir->i_nlink)
+		drop_nlink(dir);
+
+	nova_delete_dir_tree(sb, sih);
+
+	update.tail = 0;
+	err = nova_append_link_change_entry(sb, pi, inode, &update,
+						&old_linkc, epoch_id);
+	if (err)
+		goto end_rmdir;
+
+	nova_lite_transaction_for_time_and_link(sb, pi, pidir, inode, dir,
+					&update, &update_dir, 1, epoch_id);
+
+	nova_invalidate_link_change_entry(sb, old_linkc);
+	nova_invalidate_dentries(sb, &update_dir);
+
+	NOVA_END_TIMING(rmdir_t, rmdir_time);
+	return err;
+
+end_rmdir:
+	nova_err(sb, "%s return %d\n", __func__, err);
+	NOVA_END_TIMING(rmdir_t, rmdir_time);
+	return err;
+}
+
 struct dentry *nova_get_parent(struct dentry *child)
 {
 	struct inode *inode;
@@ -467,5 +571,6 @@ const struct inode_operations nova_dir_inode_operations = {
 	.link		= nova_link,
 	.unlink		= nova_unlink,
 	.mkdir		= nova_mkdir,
+	.rmdir		= nova_rmdir,
 	.mknod		= nova_mknod,
 };
-- 
2.7.4


* [RFC v2 58/83] Namei: rename
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (56 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 57/83] Namei: rmdir Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 59/83] Namei: setattr Andiry Xu
                   ` (25 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Rename is the most complex namei operation. The target dir may be
different from the source dir, and the target inode may exist.
Rename involves up to four inodes, and NOVA uses a rename transaction
to atomically update all the affected inodes.

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/namei.c | 195 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 195 insertions(+)

diff --git a/fs/nova/namei.c b/fs/nova/namei.c
index 4bf6396..bb50c0a 100644
--- a/fs/nova/namei.c
+++ b/fs/nova/namei.c
@@ -541,6 +541,200 @@ static int nova_rmdir(struct inode *dir, struct dentry *dentry)
 	return err;
 }
 
+static int nova_rename(struct inode *old_dir,
+			struct dentry *old_dentry,
+			struct inode *new_dir, struct dentry *new_dentry,
+			unsigned int flags)
+{
+	struct inode *old_inode = old_dentry->d_inode;
+	struct inode *new_inode = new_dentry->d_inode;
+	struct super_block *sb = old_inode->i_sb;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode *old_pi = NULL, *new_pi = NULL;
+	struct nova_inode *new_pidir = NULL, *old_pidir = NULL;
+	struct nova_dentry *father_entry = NULL;
+	char *head_addr = NULL;
+	int invalidate_new_inode = 0;
+	struct nova_inode_update update_dir_new;
+	struct nova_inode_update update_dir_old;
+	struct nova_inode_update update_new;
+	struct nova_inode_update update_old;
+	u64 old_linkc1 = 0, old_linkc2 = 0;
+	int err = -ENOENT;
+	int inc_link = 0, dec_link = 0;
+	int cpu;
+	int change_parent = 0;
+	u64 journal_tail;
+	u64 epoch_id;
+	timing_t rename_time;
+
+	nova_dbgv("%s: rename %s to %s,\n", __func__,
+			old_dentry->d_name.name, new_dentry->d_name.name);
+	nova_dbgv("%s: %s inode %lu, old dir %lu, new dir %lu, new inode %lu\n",
+			__func__, S_ISDIR(old_inode->i_mode) ? "dir" : "normal",
+			old_inode->i_ino, old_dir->i_ino, new_dir->i_ino,
+			new_inode ? new_inode->i_ino : 0);
+
+	if (flags & ~RENAME_NOREPLACE)
+		return -EINVAL;
+
+	NOVA_START_TIMING(rename_t, rename_time);
+
+	if (new_inode) {
+		err = -ENOTEMPTY;
+		if (S_ISDIR(old_inode->i_mode) && !nova_empty_dir(new_inode))
+			goto out;
+	} else {
+		if (S_ISDIR(old_inode->i_mode)) {
+			err = -EMLINK;
+			if (new_dir->i_nlink >= NOVA_LINK_MAX)
+				goto out;
+		}
+	}
+
+	if (S_ISDIR(old_inode->i_mode)) {
+		dec_link = -1;
+		if (!new_inode)
+			inc_link = 1;
+		/*
+		 * Tricky for in-place update:
+		 * New dentry is always after renamed dentry, so we have to
+		 * make sure new dentry has the correct links count
+		 * to work around the rebuild nlink issue.
+		 */
+		if (old_dir == new_dir) {
+			inc_link--;
+			if (inc_link == 0)
+				dec_link = 0;
+		}
+	}
+
+	epoch_id = nova_get_epoch_id(sb);
+	new_pidir = nova_get_inode(sb, new_dir);
+	old_pidir = nova_get_inode(sb, old_dir);
+
+	old_pi = nova_get_inode(sb, old_inode);
+	old_inode->i_ctime = current_time(old_inode);
+	update_old.tail = 0;
+	err = nova_append_link_change_entry(sb, old_pi, old_inode,
+					&update_old, &old_linkc1, epoch_id);
+	if (err)
+		goto out;
+
+	if (S_ISDIR(old_inode->i_mode) && old_dir != new_dir) {
+		/* My father is changed. Update .. entry */
+		/* For simplicity, we use in-place update and journal it */
+		change_parent = 1;
+		head_addr = (char *)nova_get_block(sb, old_pi->log_head);
+		father_entry = (struct nova_dentry *)(head_addr +
+					NOVA_DIR_LOG_REC_LEN(1));
+
+		if (le64_to_cpu(father_entry->ino) != old_dir->i_ino)
+			nova_err(sb, "%s: dir %lu parent should be %lu, but actually %lu\n",
+				__func__,
+				old_inode->i_ino, old_dir->i_ino,
+				le64_to_cpu(father_entry->ino));
+	}
+
+	update_dir_new.tail = 0;
+	if (new_inode) {
+		/* First remove the old entry in the new directory */
+		err = nova_remove_dentry(new_dentry, 0, &update_dir_new,
+					epoch_id);
+		if (err)
+			goto out;
+	}
+
+	/* link into the new directory. */
+	err = nova_add_dentry(new_dentry, old_inode->i_ino,
+				inc_link, &update_dir_new, epoch_id);
+	if (err)
+		goto out;
+
+	if (inc_link > 0)
+		inc_nlink(new_dir);
+
+	update_dir_old.tail = 0;
+	if (old_dir == new_dir) {
+		update_dir_old.tail = update_dir_new.tail;
+	}
+
+	err = nova_remove_dentry(old_dentry, dec_link, &update_dir_old,
+					epoch_id);
+	if (err)
+		goto out;
+
+	if (dec_link < 0)
+		drop_nlink(old_dir);
+
+	if (new_inode) {
+		new_pi = nova_get_inode(sb, new_inode);
+		new_inode->i_ctime = current_time(new_inode);
+
+		if (S_ISDIR(old_inode->i_mode)) {
+			if (new_inode->i_nlink)
+				drop_nlink(new_inode);
+		}
+		if (new_inode->i_nlink)
+			drop_nlink(new_inode);
+
+		update_new.tail = 0;
+		err = nova_append_link_change_entry(sb, new_pi, new_inode,
+						&update_new, &old_linkc2,
+						epoch_id);
+		if (err)
+			goto out;
+	}
+
+	cpu = smp_processor_id();
+	spin_lock(&sbi->journal_locks[cpu]);
+	if (new_inode && new_inode->i_nlink == 0)
+		invalidate_new_inode = 1;
+	journal_tail = nova_create_rename_transaction(sb, old_inode, old_dir,
+				new_inode,
+				old_dir != new_dir ? new_dir : NULL,
+				father_entry,
+				invalidate_new_inode,
+				cpu);
+
+	nova_update_inode(sb, old_inode, old_pi, &update_old);
+	nova_update_inode(sb, old_dir, old_pidir, &update_dir_old);
+
+	if (old_pidir != new_pidir)
+		nova_update_inode(sb, new_dir, new_pidir, &update_dir_new);
+
+	if (change_parent && father_entry) {
+		father_entry->ino = cpu_to_le64(new_dir->i_ino);
+		nova_persist_entry(father_entry);
+	}
+
+	if (new_inode) {
+		if (invalidate_new_inode) {
+			new_pi->valid = 0;
+			new_pi->delete_epoch_id = epoch_id;
+		}
+		nova_update_inode(sb, new_inode, new_pi, &update_new);
+	}
+
+	PERSISTENT_BARRIER();
+
+	nova_commit_lite_transaction(sb, journal_tail, cpu);
+	spin_unlock(&sbi->journal_locks[cpu]);
+
+	nova_invalidate_link_change_entry(sb, old_linkc1);
+	nova_invalidate_link_change_entry(sb, old_linkc2);
+	if (new_inode)
+		nova_invalidate_dentries(sb, &update_dir_new);
+	nova_invalidate_dentries(sb, &update_dir_old);
+
+	NOVA_END_TIMING(rename_t, rename_time);
+	return 0;
+out:
+	nova_err(sb, "%s return %d\n", __func__, err);
+	NOVA_END_TIMING(rename_t, rename_time);
+	return err;
+}
+
 struct dentry *nova_get_parent(struct dentry *child)
 {
 	struct inode *inode;
@@ -573,4 +767,5 @@ const struct inode_operations nova_dir_inode_operations = {
 	.mkdir		= nova_mkdir,
 	.rmdir		= nova_rmdir,
 	.mknod		= nova_mknod,
+	.rename		= nova_rename,
 };
-- 
2.7.4
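
The link-count bookkeeping at the top of nova_rename() above is easy to
misread, so here is a small standalone sketch of it. It mirrors the
inc_link/dec_link computation from the patch; the struct and function names
are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the two values the patch computes. */
struct link_deltas {
	int inc_link;	/* passed along when the new dentry is added */
	int dec_link;	/* passed along when the old dentry is removed */
};

/*
 * Mirror of the inc_link/dec_link logic in nova_rename():
 * only renaming a directory changes parent link counts.
 */
static inline struct link_deltas rename_link_deltas(bool src_is_dir,
		bool target_exists, bool same_dir)
{
	struct link_deltas d = { 0, 0 };

	if (!src_is_dir)
		return d;

	d.dec_link = -1;
	if (!target_exists)
		d.inc_link = 1;

	/* In-place rename: fold both adjustments into the new dentry. */
	if (same_dir) {
		d.inc_link--;
		if (d.inc_link == 0)
			d.dec_link = 0;
	}
	return d;
}
```

For example, moving a directory to a new name in a different parent yields
(+1, -1), while renaming it within the same parent with no existing target
yields (0, 0), since the parent's subdirectory count does not change.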


* [RFC v2 59/83] Namei: setattr
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (57 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 58/83] Namei: rename Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 60/83] Add special inode operations Andiry Xu
                   ` (24 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Add notify_change for setattr operations. Truncate the file blocks
if the file is shrunk.

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/inode.c | 180 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/inode.h |   1 +
 fs/nova/namei.c |   2 +
 3 files changed, 183 insertions(+)

diff --git a/fs/nova/inode.c b/fs/nova/inode.c
index 2d3f7a3..2092a55 100644
--- a/fs/nova/inode.c
+++ b/fs/nova/inode.c
@@ -141,6 +141,58 @@ void nova_set_inode_flags(struct inode *inode, struct nova_inode *pi,
 	inode->i_flags |= S_DAX;
 }
 
+static inline void check_eof_blocks(struct super_block *sb,
+	struct nova_inode *pi, struct inode *inode,
+	struct nova_inode_info_header *sih)
+{
+	if ((pi->i_flags & cpu_to_le32(NOVA_EOFBLOCKS_FL)) &&
+		(inode->i_size + sb->s_blocksize) > (sih->i_blocks
+			<< sb->s_blocksize_bits)) {
+		pi->i_flags &= cpu_to_le32(~NOVA_EOFBLOCKS_FL);
+		nova_persist_inode(pi);
+	}
+}
+
+/*
+ * Free data blocks from inode in the range start <=> end
+ */
+static void nova_truncate_file_blocks(struct inode *inode, loff_t start,
+				    loff_t end, u64 epoch_id)
+{
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pi = nova_get_inode(sb, inode);
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	unsigned int data_bits = blk_type_to_shift[sih->i_blk_type];
+	unsigned long first_blocknr, last_blocknr;
+	int freed = 0;
+
+	inode->i_mtime = inode->i_ctime = current_time(inode);
+
+	nova_dbg_verbose("truncate: pi %p iblocks %lx %llx %llx %llx\n", pi,
+			 sih->i_blocks, start, end, pi->i_size);
+
+	first_blocknr = (start + (1UL << data_bits) - 1) >> data_bits;
+
+	if (end == 0)
+		return;
+	last_blocknr = (end - 1) >> data_bits;
+
+	if (first_blocknr > last_blocknr)
+		return;
+
+	freed = nova_delete_file_tree(sb, sih, first_blocknr,
+				last_blocknr, true, false, epoch_id);
+
+	inode->i_blocks -= (freed * (1 << (data_bits -
+				sb->s_blocksize_bits)));
+
+	sih->i_blocks = inode->i_blocks;
+	/* Check that the EOFBLOCKS flag is still valid after setting size */
+	check_eof_blocks(sb, pi, inode, sih);
+
+}
+
 /* copy persistent state to struct inode */
 static int nova_read_inode(struct super_block *sb, struct inode *inode,
 	u64 pi_addr)
@@ -963,6 +1015,134 @@ void nova_dirty_inode(struct inode *inode, int flags)
 	nova_flush_buffer(&pi->i_atime, sizeof(pi->i_atime), 0);
 }
 
+/*
+ * Zero the tail of the last page. Used on resize requests
+ * so that stale data is not exposed if the file grows again.
+ */
+static void nova_clear_last_page_tail(struct super_block *sb,
+	struct inode *inode, loff_t newsize)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	unsigned long offset = newsize & (sb->s_blocksize - 1);
+	unsigned long pgoff, length;
+	u64 nvmm;
+	char *nvmm_addr;
+
+	if (offset == 0 || newsize > inode->i_size)
+		return;
+
+	length = sb->s_blocksize - offset;
+	pgoff = newsize >> sb->s_blocksize_bits;
+
+	nvmm = nova_find_nvmm_block(sb, sih, NULL, pgoff);
+	if (nvmm == 0)
+		return;
+
+	nvmm_addr = (char *)nova_get_block(sb, nvmm);
+	memcpy_to_pmem_nocache(nvmm_addr + offset, sbi->zeroed_page, length);
+}
+
+static void nova_setsize(struct inode *inode, loff_t oldsize, loff_t newsize,
+	u64 epoch_id)
+{
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	timing_t setsize_time;
+
+	/* We only support truncating regular files */
+	if (!(S_ISREG(inode->i_mode))) {
+		nova_err(sb, "%s: wrong file mode %x\n", __func__, inode->i_mode);
+		return;
+	}
+
+	NOVA_START_TIMING(setsize_t, setsize_time);
+
+	inode_dio_wait(inode);
+
+	nova_dbgv("%s: inode %lu, old size %llu, new size %llu\n",
+		__func__, inode->i_ino, oldsize, newsize);
+
+	sih_lock(sih);
+	if (newsize != oldsize) {
+		nova_clear_last_page_tail(sb, inode, newsize);
+		i_size_write(inode, newsize);
+		sih->i_size = newsize;
+	}
+
+	/* FIXME: we should make sure that there is nobody reading the inode
+	 * before truncating it. Also we need to munmap the truncated range
+	 * from application address space, if mmapped.
+	 */
+	/* synchronize_rcu(); */
+
+	/* FIXME: Do we need to clear truncated DAX pages? */
+//	dax_truncate_page(inode, newsize, nova_dax_get_block);
+
+	truncate_pagecache(inode, newsize);
+	nova_truncate_file_blocks(inode, newsize, oldsize, epoch_id);
+	sih_unlock(sih);
+	NOVA_END_TIMING(setsize_t, setsize_time);
+}
+
+int nova_notify_change(struct dentry *dentry, struct iattr *attr)
+{
+	struct inode *inode = dentry->d_inode;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pi = nova_get_inode(sb, inode);
+	int ret;
+	unsigned int ia_valid = attr->ia_valid, attr_mask;
+	loff_t oldsize = inode->i_size;
+	u64 epoch_id;
+	timing_t setattr_time;
+
+	NOVA_START_TIMING(setattr_t, setattr_time);
+	if (!pi) {
+		ret = -EACCES;
+		goto out;
+	}
+
+	ret = setattr_prepare(dentry, attr);
+	if (ret)
+		goto out;
+
+	/* Update inode with attr except for size */
+	setattr_copy(inode, attr);
+
+	epoch_id = nova_get_epoch_id(sb);
+
+	attr_mask = ATTR_MODE | ATTR_UID | ATTR_GID | ATTR_SIZE | ATTR_ATIME
+			| ATTR_MTIME | ATTR_CTIME;
+
+	ia_valid = ia_valid & attr_mask;
+
+	if (ia_valid == 0)
+		goto out;
+
+	ret = nova_handle_setattr_operation(sb, inode, pi, ia_valid,
+					attr, epoch_id);
+	if (ret)
+		goto out;
+
+	/* Only after log entry is committed, we can truncate size */
+	if ((ia_valid & ATTR_SIZE) && (attr->ia_size != oldsize ||
+			pi->i_flags & cpu_to_le32(NOVA_EOFBLOCKS_FL))) {
+//		nova_set_blocksize_hint(sb, inode, pi, attr->ia_size);
+
+		/* now we can freely truncate the inode */
+		nova_setsize(inode, oldsize, attr->ia_size, epoch_id);
+	}
+
+	sih->trans_id++;
+out:
+	NOVA_END_TIMING(setattr_t, setattr_time);
+	return ret;
+}
+
 static ssize_t nova_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
 	/* DAX does not support direct IO */
diff --git a/fs/nova/inode.h b/fs/nova/inode.h
index 42690e6..4ddf8c2 100644
--- a/fs/nova/inode.h
+++ b/fs/nova/inode.h
@@ -267,5 +267,6 @@ int nova_delete_file_tree(struct super_block *sb,
 extern void nova_evict_inode(struct inode *inode);
 extern int nova_write_inode(struct inode *inode, struct writeback_control *wbc);
 extern void nova_dirty_inode(struct inode *inode, int flags);
+extern int nova_notify_change(struct dentry *dentry, struct iattr *attr);
 
 #endif
diff --git a/fs/nova/namei.c b/fs/nova/namei.c
index bb50c0a..1966bff 100644
--- a/fs/nova/namei.c
+++ b/fs/nova/namei.c
@@ -768,4 +768,6 @@ const struct inode_operations nova_dir_inode_operations = {
 	.rmdir		= nova_rmdir,
 	.mknod		= nova_mknod,
 	.rename		= nova_rename,
+	.setattr	= nova_notify_change,
+	.get_acl	= NULL,
 };
-- 
2.7.4
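
When a file is shrunk, nova_truncate_file_blocks() above frees only the
blocks that lie entirely inside the truncated byte range. The following
standalone sketch reproduces that block-range arithmetic so it can be
checked by hand; the function name and the bool return convention are
assumptions made for this example, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Compute the inclusive range of whole blocks covered by the byte range
 * [start, end), where a block is (1 << data_bits) bytes.
 * Returns false when there is no whole block to free.
 */
static inline bool truncate_block_range(uint64_t start, uint64_t end,
		unsigned int data_bits, uint64_t *first, uint64_t *last)
{
	uint64_t first_blocknr, last_blocknr;

	if (end == 0)
		return false;

	/* Round start up: a partially kept block must not be freed. */
	first_blocknr = (start + ((uint64_t)1 << data_bits) - 1) >> data_bits;
	/* Block that holds the last byte of the old range. */
	last_blocknr = (end - 1) >> data_bits;

	if (first_blocknr > last_blocknr)
		return false;

	*first = first_blocknr;
	*last = last_blocknr;
	return true;
}
```

In the patch, nova_setsize() passes the new size as start and the old size
as end. With 4 KiB blocks (data_bits = 12), shrinking from 20000 to 5000
bytes frees blocks 2 through 4; block 1 still holds live data and is kept.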


* [RFC v2 60/83] Add special inode operations.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (58 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 59/83] Namei: setattr Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 61/83] Super: Add nova_export_ops Andiry Xu
                   ` (23 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/inode.c | 2 ++
 fs/nova/namei.c | 5 +++++
 fs/nova/nova.h  | 1 +
 3 files changed, 8 insertions(+)

diff --git a/fs/nova/inode.c b/fs/nova/inode.c
index 2092a55..0e9ab4b 100644
--- a/fs/nova/inode.c
+++ b/fs/nova/inode.c
@@ -239,6 +239,7 @@ static int nova_read_inode(struct super_block *sb, struct inode *inode,
 	case S_IFLNK:
 		break;
 	default:
+		inode->i_op = &nova_special_inode_operations;
 		init_special_inode(inode, inode->i_mode,
 				   le32_to_cpu(pi->dev.rdev));
 		break;
@@ -929,6 +930,7 @@ struct inode *nova_new_vfs_inode(enum nova_new_inode_type type,
 		break;
 	case TYPE_MKNOD:
 		init_special_inode(inode, mode, rdev);
+		inode->i_op = &nova_special_inode_operations;
 		break;
 	case TYPE_SYMLINK:
 		inode->i_mapping->a_ops = &nova_aops_dax;
diff --git a/fs/nova/namei.c b/fs/nova/namei.c
index 1966bff..7a81672 100644
--- a/fs/nova/namei.c
+++ b/fs/nova/namei.c
@@ -771,3 +771,8 @@ const struct inode_operations nova_dir_inode_operations = {
 	.setattr	= nova_notify_change,
 	.get_acl	= NULL,
 };
+
+const struct inode_operations nova_special_inode_operations = {
+	.setattr	= nova_notify_change,
+	.get_acl	= NULL,
+};
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index 03ea0bd..85292d3 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -486,6 +486,7 @@ int nova_remove_dentry(struct dentry *dentry, int dec_link,
 
 /* namei.c */
 extern const struct inode_operations nova_dir_inode_operations;
+extern const struct inode_operations nova_special_inode_operations;
 extern struct dentry *nova_get_parent(struct dentry *child);
 
 /* rebuild.c */
-- 
2.7.4


* [RFC v2 61/83] Super: Add nova_export_ops.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (59 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 60/83] Add special inode operations Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 62/83] File: getattr and file inode operations Andiry Xu
                   ` (22 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/super.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/fs/nova/super.c b/fs/nova/super.c
index daf3270..0847e57 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -51,6 +51,7 @@ module_param(nova_dbgmask, int, 0444);
 MODULE_PARM_DESC(nova_dbgmask, "Control debugging output");
 
 static struct super_operations nova_sops;
+static const struct export_operations nova_export_ops;
 
 static struct kmem_cache *nova_inode_cachep;
 static struct kmem_cache *nova_range_node_cachep;
@@ -631,6 +632,7 @@ static int nova_fill_super(struct super_block *sb, void *data, int silent)
 	sb->s_op = &nova_sops;
 	sb->s_maxbytes = nova_max_size(sb->s_blocksize_bits);
 	sb->s_time_gran = 1000000000; // 1 second.
+	sb->s_export_op = &nova_export_ops;
 	sb->s_xattr = NULL;
 	sb->s_flags |= MS_NOSEC;
 
@@ -904,6 +906,52 @@ static struct file_system_type nova_fs_type = {
 	.kill_sb	= kill_block_super,
 };
 
+static struct inode *nova_nfs_get_inode(struct super_block *sb,
+					 u64 ino, u32 generation)
+{
+	struct inode *inode;
+
+	if (ino < NOVA_ROOT_INO)
+		return ERR_PTR(-ESTALE);
+
+	if (ino > LONG_MAX)
+		return ERR_PTR(-ESTALE);
+
+	inode = nova_iget(sb, ino);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	if (generation && inode->i_generation != generation) {
+		/* we didn't find the right inode.. */
+		iput(inode);
+		return ERR_PTR(-ESTALE);
+	}
+
+	return inode;
+}
+
+static struct dentry *nova_fh_to_dentry(struct super_block *sb,
+					 struct fid *fid, int fh_len,
+					 int fh_type)
+{
+	return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+				    nova_nfs_get_inode);
+}
+
+static struct dentry *nova_fh_to_parent(struct super_block *sb,
+					 struct fid *fid, int fh_len,
+					 int fh_type)
+{
+	return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+				    nova_nfs_get_inode);
+}
+
+static const struct export_operations nova_export_ops = {
+	.fh_to_dentry	= nova_fh_to_dentry,
+	.fh_to_parent	= nova_fh_to_parent,
+	.get_parent	= nova_get_parent,
+};
+
 static int __init init_nova_fs(void)
 {
 	int rc = 0;
-- 
2.7.4


* [RFC v2 62/83] File: getattr and file inode operations
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (60 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 61/83] Super: Add nova_export_ops Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 63/83] File operation: llseek Andiry Xu
                   ` (21 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/Makefile |  2 +-
 fs/nova/file.c   | 31 +++++++++++++++++++++++++++++++
 fs/nova/inode.c  | 25 +++++++++++++++++++++++++
 fs/nova/inode.h  |  2 ++
 fs/nova/nova.h   |  3 +++
 5 files changed, 62 insertions(+), 1 deletion(-)
 create mode 100644 fs/nova/file.c

diff --git a/fs/nova/Makefile b/fs/nova/Makefile
index eb97e46..468ed6f 100644
--- a/fs/nova/Makefile
+++ b/fs/nova/Makefile
@@ -4,5 +4,5 @@
 
 obj-$(CONFIG_NOVA_FS) += nova.o
 
-nova-y := balloc.o bbuild.o dir.o inode.o journal.o log.o namei.o\
+nova-y := balloc.o bbuild.o dir.o file.o inode.o journal.o log.o namei.o\
 	  rebuild.o stats.o super.o
diff --git a/fs/nova/file.c b/fs/nova/file.c
new file mode 100644
index 0000000..b46d4bd
--- /dev/null
+++ b/fs/nova/file.c
@@ -0,0 +1,31 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * File operations for files.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/slab.h>
+#include <linux/uio.h>
+#include <linux/uaccess.h>
+#include <linux/falloc.h>
+#include <asm/mman.h>
+#include "nova.h"
+#include "inode.h"
+
+
+const struct inode_operations nova_file_inode_operations = {
+	.setattr	= nova_notify_change,
+	.getattr	= nova_getattr,
+	.get_acl	= NULL,
+};
diff --git a/fs/nova/inode.c b/fs/nova/inode.c
index 0e9ab4b..6fcc5e7 100644
--- a/fs/nova/inode.c
+++ b/fs/nova/inode.c
@@ -231,6 +231,7 @@ static int nova_read_inode(struct super_block *sb, struct inode *inode,
 
 	switch (inode->i_mode & S_IFMT) {
 	case S_IFREG:
+		inode->i_op = &nova_file_inode_operations;
 		break;
 	case S_IFDIR:
 		inode->i_op = &nova_dir_inode_operations;
@@ -926,6 +927,7 @@ struct inode *nova_new_vfs_inode(enum nova_new_inode_type type,
 
 	switch (type) {
 	case TYPE_CREATE:
+		inode->i_op = &nova_file_inode_operations;
 		inode->i_mapping->a_ops = &nova_aops_dax;
 		break;
 	case TYPE_MKNOD:
@@ -1089,6 +1091,29 @@ static void nova_setsize(struct inode *inode, loff_t oldsize, loff_t newsize,
 	NOVA_END_TIMING(setsize_t, setsize_time);
 }
 
+int nova_getattr(const struct path *path, struct kstat *stat,
+		 u32 request_mask, unsigned int query_flags)
+{
+	struct inode *inode = d_inode(path->dentry);
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	unsigned int flags = sih->i_flags;
+
+	if (flags & FS_APPEND_FL)
+		stat->attributes |= STATX_ATTR_APPEND;
+	if (flags & FS_COMPR_FL)
+		stat->attributes |= STATX_ATTR_COMPRESSED;
+	if (flags & FS_IMMUTABLE_FL)
+		stat->attributes |= STATX_ATTR_IMMUTABLE;
+	if (flags & FS_NODUMP_FL)
+		stat->attributes |= STATX_ATTR_NODUMP;
+
+	generic_fillattr(inode, stat);
+	/* stat->blocks should be the number of 512B blocks */
+	stat->blocks = (inode->i_blocks << inode->i_sb->s_blocksize_bits) >> 9;
+	return 0;
+}
+
 int nova_notify_change(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = dentry->d_inode;
diff --git a/fs/nova/inode.h b/fs/nova/inode.h
index 4ddf8c2..48403cf 100644
--- a/fs/nova/inode.h
+++ b/fs/nova/inode.h
@@ -267,6 +267,8 @@ int nova_delete_file_tree(struct super_block *sb,
 extern void nova_evict_inode(struct inode *inode);
 extern int nova_write_inode(struct inode *inode, struct writeback_control *wbc);
 extern void nova_dirty_inode(struct inode *inode, int flags);
+extern int nova_getattr(const struct path *path, struct kstat *stat,
+		 u32 request_mask, unsigned int query_flags);
 extern int nova_notify_change(struct dentry *dentry, struct iattr *attr);
 
 #endif
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index 85292d3..601e082 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -484,6 +484,9 @@ int nova_add_dentry(struct dentry *dentry, u64 ino, int inc_link,
 int nova_remove_dentry(struct dentry *dentry, int dec_link,
 	struct nova_inode_update *update, u64 epoch_id);
 
+/* file.c */
+extern const struct inode_operations nova_file_inode_operations;
+
 /* namei.c */
 extern const struct inode_operations nova_dir_inode_operations;
 extern const struct inode_operations nova_special_inode_operations;
-- 
2.7.4

* [RFC v2 63/83] File operation: llseek.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (61 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 62/83] File: getattr and file inode operations Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 64/83] File operation: open, fsync, flush Andiry Xu
                   ` (20 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Search the file radix tree to find a hole or data.

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/file.c  |  47 +++++++++++++++++++++++
 fs/nova/inode.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/inode.h |   1 +
 fs/nova/nova.h  |   1 +
 4 files changed, 162 insertions(+)
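
For illustration, the search can be modelled in userspace. This is a
minimal sketch, not NOVA code: it assumes a hypothetical per-block
"mapped" array in place of the radix tree, and model_find_region() is
an invented name. It follows the lseek(2) rules for SEEK_DATA and
SEEK_HOLE, including the implicit hole at end of file.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical userspace model: mapped[i] != 0 means block i holds data.
 * Returns the new file offset, or -ENXIO if no matching region exists.
 */
long long model_find_region(const unsigned char *mapped, size_t nblocks,
			    unsigned int block_shift, long long file_size,
			    long long offset, int hole)
{
	size_t blk;

	assert(mapped != NULL);

	if (offset < 0 || offset >= file_size)
		return -ENXIO;

	for (blk = (size_t)(offset >> block_shift); blk < nblocks; blk++) {
		long long start = (long long)blk << block_shift;

		if (start >= file_size)
			break;
		/*
		 * A block matches when we want a hole and it is unmapped,
		 * or we want data and it is mapped.
		 */
		if ((mapped[blk] == 0) == (hole != 0))
			return start > offset ? start : offset;
	}

	/* There is an implicit hole at EOF, but no implicit data. */
	return hole ? file_size : -ENXIO;
}
```

With blocks {data, data, hole, hole, data} and 4 KiB blocks, a
SEEK_HOLE from offset 100 lands on 8192 and a SEEK_DATA from inside the
hole lands on 16384.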

diff --git a/fs/nova/file.c b/fs/nova/file.c
index b46d4bd..ecaf20a 100644
--- a/fs/nova/file.c
+++ b/fs/nova/file.c
@@ -23,6 +23,53 @@
 #include "nova.h"
 #include "inode.h"
 
+static loff_t nova_llseek(struct file *file, loff_t offset, int origin)
+{
+	struct inode *inode = file->f_path.dentry->d_inode;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	int retval;
+
+	if (origin != SEEK_DATA && origin != SEEK_HOLE)
+		return generic_file_llseek(file, offset, origin);
+
+	sih_lock_shared(sih);
+	switch (origin) {
+	case SEEK_DATA:
+		retval = nova_find_region(inode, &offset, 0);
+		if (retval) {
+			sih_unlock_shared(sih);
+			return retval;
+		}
+		break;
+	case SEEK_HOLE:
+		retval = nova_find_region(inode, &offset, 1);
+		if (retval) {
+			sih_unlock_shared(sih);
+			return retval;
+		}
+		break;
+	}
+
+	if ((offset < 0 && !(file->f_mode & FMODE_UNSIGNED_OFFSET)) ||
+	    offset > inode->i_sb->s_maxbytes) {
+		sih_unlock_shared(sih);
+		return -ENXIO;
+	}
+
+	if (offset != file->f_pos) {
+		file->f_pos = offset;
+		file->f_version = 0;
+	}
+
+	sih_unlock_shared(sih);
+	return offset;
+}
+
+
+const struct file_operations nova_dax_file_operations = {
+	.llseek		= nova_llseek,
+};
 
 const struct inode_operations nova_file_inode_operations = {
 	.setattr	= nova_notify_change,
diff --git a/fs/nova/inode.c b/fs/nova/inode.c
index 6fcc5e7..a6d74cb 100644
--- a/fs/nova/inode.c
+++ b/fs/nova/inode.c
@@ -193,6 +193,52 @@ static void nova_truncate_file_blocks(struct inode *inode, loff_t start,
 
 }
 
+/* search the radix tree to find hole or data
+ * in the specified range
+ * Input:
+ * first_blocknr: first block in the specified range
+ * last_blocknr: last block in the specified range
+ * @data_found: indicates whether data blocks were found
+ * @hole_found: indicates whether a hole was found
+ * hole: whether we are looking for a hole or data
+ */
+static int nova_lookup_hole_in_range(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	unsigned long first_blocknr, unsigned long last_blocknr,
+	int *data_found, int *hole_found, int hole)
+{
+	struct nova_file_write_entry *entry;
+	unsigned long blocks = 0;
+	unsigned long pgoff, old_pgoff;
+
+	pgoff = first_blocknr;
+	while (pgoff <= last_blocknr) {
+		old_pgoff = pgoff;
+		entry = radix_tree_lookup(&sih->tree, pgoff);
+		if (entry) {
+			*data_found = 1;
+			if (!hole)
+				goto done;
+			pgoff++;
+		} else {
+			*hole_found = 1;
+			entry = nova_find_next_entry(sb, sih, pgoff);
+			pgoff++;
+			if (entry) {
+				pgoff = pgoff > entry->pgoff ?
+					pgoff : entry->pgoff;
+				if (pgoff > last_blocknr)
+					pgoff = last_blocknr + 1;
+			}
+		}
+
+		if (!*hole_found || !hole)
+			blocks += pgoff - old_pgoff;
+	}
+done:
+	return blocks;
+}
+
 /* copy persistent state to struct inode */
 static int nova_read_inode(struct super_block *sb, struct inode *inode,
 	u64 pi_addr)
@@ -232,6 +278,7 @@ static int nova_read_inode(struct super_block *sb, struct inode *inode,
 	switch (inode->i_mode & S_IFMT) {
 	case S_IFREG:
 		inode->i_op = &nova_file_inode_operations;
+		inode->i_fop = &nova_dax_file_operations;
 		break;
 	case S_IFDIR:
 		inode->i_op = &nova_dir_inode_operations;
@@ -929,6 +976,7 @@ struct inode *nova_new_vfs_inode(enum nova_new_inode_type type,
 	case TYPE_CREATE:
 		inode->i_op = &nova_file_inode_operations;
 		inode->i_mapping->a_ops = &nova_aops_dax;
+		inode->i_fop = &nova_dax_file_operations;
 		break;
 	case TYPE_MKNOD:
 		init_special_inode(inode, mode, rdev);
@@ -1170,6 +1218,71 @@ int nova_notify_change(struct dentry *dentry, struct iattr *attr)
 	return ret;
 }
 
+/*
+ * find the file offset for SEEK_DATA/SEEK_HOLE
+ */
+unsigned long nova_find_region(struct inode *inode, loff_t *offset, int hole)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	unsigned int data_bits = blk_type_to_shift[sih->i_blk_type];
+	unsigned long first_blocknr, last_blocknr;
+	unsigned long blocks = 0, offset_in_block;
+	int data_found = 0, hole_found = 0;
+
+	if (*offset >= inode->i_size)
+		return -ENXIO;
+
+	if (!inode->i_blocks || !sih->i_size) {
+		if (hole)
+			return inode->i_size;
+		else
+			return -ENXIO;
+	}
+
+	offset_in_block = *offset & ((1UL << data_bits) - 1);
+
+	first_blocknr = *offset >> data_bits;
+	last_blocknr = inode->i_size >> data_bits;
+
+	nova_dbgv("find_region offset %llx, first_blocknr %lx, last_blocknr %lx hole %d\n",
+		  *offset, first_blocknr, last_blocknr, hole);
+
+	blocks = nova_lookup_hole_in_range(inode->i_sb, sih,
+		first_blocknr, last_blocknr, &data_found, &hole_found, hole);
+
+	/* Searching data but only hole found till the end */
+	if (!hole && !data_found && hole_found)
+		return -ENXIO;
+
+	if (data_found && !hole_found) {
+		/* Searching for data and we are already inside it */
+		if (hole)
+			/* Searching hole but only data found, go to the end */
+			*offset = inode->i_size;
+		return 0;
+	}
+
+	/* Searching for hole, hole found and starting inside a hole */
+	if (hole && hole_found && !blocks) {
+		/* we found data after it */
+		if (!data_found)
+			/* last hole */
+			*offset = inode->i_size;
+		return 0;
+	}
+
+	if (offset_in_block) {
+		blocks--;
+		*offset += (blocks << data_bits) +
+			   ((1 << data_bits) - offset_in_block);
+	} else {
+		*offset += blocks << data_bits;
+	}
+
+	return 0;
+}
+
 static ssize_t nova_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
 	/* DAX does not support direct IO */
diff --git a/fs/nova/inode.h b/fs/nova/inode.h
index 48403cf..693aa90 100644
--- a/fs/nova/inode.h
+++ b/fs/nova/inode.h
@@ -264,6 +264,7 @@ int nova_delete_file_tree(struct super_block *sb,
 	struct nova_inode_info_header *sih, unsigned long start_blocknr,
 	unsigned long last_blocknr, bool delete_nvmm, bool delete_dead,
 	u64 epoch_id);
+unsigned long nova_find_region(struct inode *inode, loff_t *offset, int hole);
 extern void nova_evict_inode(struct inode *inode);
 extern int nova_write_inode(struct inode *inode, struct writeback_control *wbc);
 extern void nova_dirty_inode(struct inode *inode, int flags);
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index 601e082..b2831f6 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -485,6 +485,7 @@ int nova_remove_dentry(struct dentry *dentry, int dec_link,
 	struct nova_inode_update *update, u64 epoch_id);
 
 /* file.c */
+extern const struct file_operations nova_dax_file_operations;
 extern const struct inode_operations nova_file_inode_operations;
 
 /* namei.c */
-- 
2.7.4

* [RFC v2 64/83] File operation: open, fsync, flush.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (62 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 63/83] File operation: llseek Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 65/83] File operation: read Andiry Xu
                   ` (19 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA persists file metadata and data before returning to user space.
Hence, fsync is a no-op if the file is not mmapped.

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/file.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/fs/nova/file.c b/fs/nova/file.c
index ecaf20a..f60fdf3 100644
--- a/fs/nova/file.c
+++ b/fs/nova/file.c
@@ -66,9 +66,59 @@ static loff_t nova_llseek(struct file *file, loff_t offset, int origin)
 	return offset;
 }
 
+/* This function is called by both msync() and fsync().
+ * TODO: Check if we can avoid calling nova_flush_buffer() for fsync. We use
+ * movnti to write data to files, so we may want to avoid doing unnecessary
+ * nova_flush_buffer() on fsync()
+ */
+static int nova_fsync(struct file *file, loff_t start, loff_t end, int datasync)
+{
+	struct address_space *mapping = file->f_mapping;
+	unsigned long start_pgoff, end_pgoff;
+	int ret = 0;
+	timing_t fsync_time;
+
+	NOVA_START_TIMING(fsync_t, fsync_time);
+
+	if (datasync)
+		NOVA_STATS_ADD(fdatasync, 1);
+
+	/* No need to flush if the file is not mmaped */
+	if (!mapping_mapped(mapping))
+		goto persist;
+
+	start_pgoff = start >> PAGE_SHIFT;
+	end_pgoff = (end + 1) >> PAGE_SHIFT;
+	nova_dbgv("%s: msync pgoff range %lu to %lu\n",
+			__func__, start_pgoff, end_pgoff);
+
+	ret = generic_file_fsync(file, start, end, datasync);
+
+persist:
+	PERSISTENT_BARRIER();
+	NOVA_END_TIMING(fsync_t, fsync_time);
+
+	return ret;
+}
+
+/* This callback is called when a file is closed */
+static int nova_flush(struct file *file, fl_owner_t id)
+{
+	PERSISTENT_BARRIER();
+	return 0;
+}
+
+static int nova_open(struct inode *inode, struct file *filp)
+{
+	return generic_file_open(inode, filp);
+}
+
 
 const struct file_operations nova_dax_file_operations = {
 	.llseek		= nova_llseek,
+	.open		= nova_open,
+	.fsync		= nova_fsync,
+	.flush		= nova_flush,
 };
 
 const struct inode_operations nova_file_inode_operations = {
-- 
2.7.4

* [RFC v2 65/83] File operation: read.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (63 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 64/83] File operation: open, fsync, flush Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 66/83] Super: Add file write item cache Andiry Xu
                   ` (18 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA is a DAX file system and does not use the page cache.
For reads, NOVA looks up the file write entry by searching the radix tree,
and copies data from the pmem pages directly to the user buffer.

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/file.c | 144 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 144 insertions(+)
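
The page-by-page copy loop, including zero-fill for holes, can be
sketched in userspace as follows. This is an assumption-laden model,
not the kernel code: a hypothetical array of page pointers stands in
for the radix tree lookup (NULL means no write entry covers the page),
memcpy()/memset() stand in for the user-copy helpers, and the names
model_read and MODEL_PAGE_SIZE are invented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096UL

/*
 * Hypothetical model of a DAX-style read: pages[i] is NULL for a hole,
 * otherwise it points to MODEL_PAGE_SIZE bytes of file data.
 * Returns the number of bytes placed in buf.
 */
size_t model_read(unsigned char *const *pages, size_t npages, size_t isize,
		  size_t pos, unsigned char *buf, size_t len)
{
	size_t copied = 0;

	assert(buf != NULL);

	/* Clamp the request to the file size. */
	if (pos >= isize)
		return 0;
	if (len > isize - pos)
		len = isize - pos;

	while (copied < len) {
		size_t index = (pos + copied) / MODEL_PAGE_SIZE;
		size_t offset = (pos + copied) % MODEL_PAGE_SIZE;
		size_t nr = MODEL_PAGE_SIZE - offset;

		/* Never copy past the end of the request. */
		if (nr > len - copied)
			nr = len - copied;

		if (index < npages && pages[index])
			memcpy(buf + copied, pages[index] + offset, nr);
		else
			memset(buf + copied, 0, nr);	/* hole reads as zeros */

		copied += nr;
	}

	return copied;
}
```

Unlike the patch, this model copies at most one page per iteration; the
kernel code can copy a whole contiguous extent at once when the entry
has not been reassigned.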

diff --git a/fs/nova/file.c b/fs/nova/file.c
index f60fdf3..842da45 100644
--- a/fs/nova/file.c
+++ b/fs/nova/file.c
@@ -113,9 +113,153 @@ static int nova_open(struct inode *inode, struct file *filp)
 	return generic_file_open(inode, filp);
 }
 
+static ssize_t
+do_dax_mapping_read(struct file *filp, char __user *buf,
+	size_t len, loff_t *ppos)
+{
+	struct inode *inode = filp->f_mapping->host;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_file_write_entry *entry;
+	pgoff_t index, end_index;
+	unsigned long offset;
+	loff_t isize, pos;
+	size_t copied = 0, error = 0;
+	timing_t memcpy_time;
+
+	pos = *ppos;
+	index = pos >> PAGE_SHIFT;
+	offset = pos & ~PAGE_MASK;
+
+	if (!access_ok(VERIFY_WRITE, buf, len)) {
+		error = -EFAULT;
+		goto out;
+	}
+
+	isize = i_size_read(inode);
+	if (!isize)
+		goto out;
+
+	nova_dbgv("%s: inode %lu, offset %lld, count %lu, size %lld\n",
+		__func__, inode->i_ino,	pos, len, isize);
+
+	if (len > isize - pos)
+		len = isize - pos;
+
+	if (len <= 0)
+		goto out;
+
+	end_index = (isize - 1) >> PAGE_SHIFT;
+	do {
+		unsigned long nr, left;
+		unsigned long nvmm;
+		void *dax_mem = NULL;
+		int zero = 0;
+
+		/* nr is the maximum number of bytes to copy from this page */
+		if (index >= end_index) {
+			if (index > end_index)
+				goto out;
+			nr = ((isize - 1) & ~PAGE_MASK) + 1;
+			if (nr <= offset)
+				goto out;
+		}
+
+		entry = nova_get_write_entry(sb, sih, index);
+		if (unlikely(entry == NULL)) {
+			nova_dbgv("Required extent not found: pgoff %lu, inode size %lld\n",
+				index, isize);
+			nr = PAGE_SIZE;
+			zero = 1;
+			goto memcpy;
+		}
+
+		/* Find contiguous blocks */
+		if (index < entry->pgoff ||
+			index - entry->pgoff >= entry->num_pages) {
+			nova_err(sb, "%s ERROR: %lu, entry pgoff %llu, num %u, blocknr %llu\n",
+				__func__, index, entry->pgoff,
+				entry->num_pages, entry->block >> PAGE_SHIFT);
+			return -EINVAL;
+		}
+		if (entry->reassigned == 0) {
+			nr = (entry->num_pages - (index - entry->pgoff))
+				* PAGE_SIZE;
+		} else {
+			nr = PAGE_SIZE;
+		}
+
+		nvmm = get_nvmm(sb, sih, entry, index);
+		dax_mem = nova_get_block(sb, (nvmm << PAGE_SHIFT));
+
+memcpy:
+		nr = nr - offset;
+		if (nr > len - copied)
+			nr = len - copied;
+
+		NOVA_START_TIMING(memcpy_r_nvmm_t, memcpy_time);
+
+		if (!zero)
+			left = __copy_to_user(buf + copied,
+						dax_mem + offset, nr);
+		else
+			left = __clear_user(buf + copied, nr);
+
+		NOVA_END_TIMING(memcpy_r_nvmm_t, memcpy_time);
+
+		if (left) {
+			nova_dbg("%s ERROR!: bytes %lu, left %lu\n",
+				__func__, nr, left);
+			error = -EFAULT;
+			goto out;
+		}
+
+		copied += (nr - left);
+		offset += (nr - left);
+		index += offset >> PAGE_SHIFT;
+		offset &= ~PAGE_MASK;
+	} while (copied < len);
+
+out:
+	*ppos = pos + copied;
+	if (filp)
+		file_accessed(filp);
+
+	NOVA_STATS_ADD(read_bytes, copied);
+
+	nova_dbgv("%s returned %zu\n", __func__, copied);
+	return copied ? copied : error;
+}
+
+/*
+ * Wrapper. We need to take the read lock to avoid racing with a
+ * concurrent truncate operation. No problem for write because we hold
+ * the lock.
+ */
+static ssize_t nova_dax_file_read(struct file *filp, char __user *buf,
+			    size_t len, loff_t *ppos)
+{
+	struct inode *inode = filp->f_mapping->host;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	ssize_t res;
+	timing_t dax_read_time;
+
+	NOVA_START_TIMING(dax_read_t, dax_read_time);
+	inode_lock_shared(inode);
+	sih_lock_shared(sih);
+	res = do_dax_mapping_read(filp, buf, len, ppos);
+	sih_unlock_shared(sih);
+	inode_unlock_shared(inode);
+	NOVA_END_TIMING(dax_read_t, dax_read_time);
+	return res;
+}
+
 
 const struct file_operations nova_dax_file_operations = {
 	.llseek		= nova_llseek,
+	.read		= nova_dax_file_read,
 	.open		= nova_open,
 	.fsync		= nova_fsync,
 	.flush		= nova_flush,
-- 
2.7.4

* [RFC v2 66/83] Super: Add file write item cache.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (64 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 65/83] File operation: read Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 67/83] Dax: commit list of file write items to log Andiry Xu
                   ` (17 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

nova_file_write_item combines a file write entry with a list head.
NOVA uses a linked list of file write items to describe a write operation.

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/super.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
 fs/nova/super.h |  3 +++
 2 files changed, 45 insertions(+), 1 deletion(-)
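
As a rough picture of the data structure, the sketch below pairs a
write entry with a list link and chains several items to describe one
write. It is a userspace approximation under stated assumptions: the
struct layouts are simplified and hypothetical, a plain next pointer
replaces the kernel's struct list_head, and calloc()/free() replace the
kmem cache added by this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified, hypothetical stand-in for the on-log write entry. */
struct model_write_entry {
	uint64_t pgoff;		/* first file page covered */
	uint32_t num_pages;	/* number of contiguous pages */
	uint64_t block;		/* start of the data pages */
};

/* An item is an entry plus the link that chains it into a write. */
struct model_write_item {
	struct model_write_entry entry;
	struct model_write_item *next;
};

struct model_write_list {
	struct model_write_item *head, *tail;
};

/* Allocate an item and append it to the list; returns 0 or -1. */
int model_list_add(struct model_write_list *l, uint64_t pgoff,
		   uint32_t num_pages, uint64_t block)
{
	struct model_write_item *item = calloc(1, sizeof(*item));

	assert(l != NULL);
	if (!item)
		return -1;

	item->entry.pgoff = pgoff;
	item->entry.num_pages = num_pages;
	item->entry.block = block;

	if (l->tail)
		l->tail->next = item;
	else
		l->head = item;
	l->tail = item;
	return 0;
}

/* Total number of pages described by the whole write. */
uint64_t model_list_pages(const struct model_write_list *l)
{
	const struct model_write_item *it;
	uint64_t pages = 0;

	for (it = l->head; it; it = it->next)
		pages += it->entry.num_pages;
	return pages;
}

/* Free every item, mirroring the free-on-success path. */
void model_list_free(struct model_write_list *l)
{
	struct model_write_item *it = l->head, *next;

	for (; it; it = next) {
		next = it->next;
		free(it);
	}
	l->head = l->tail = NULL;
}
```

A write that needed two separate allocations would add two items and
hand the whole list to the commit step in one call.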

diff --git a/fs/nova/super.c b/fs/nova/super.c
index 0847e57..9710be8 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -55,6 +55,7 @@ static const struct export_operations nova_export_ops;
 
 static struct kmem_cache *nova_inode_cachep;
 static struct kmem_cache *nova_range_node_cachep;
+static struct kmem_cache *nova_file_write_item_cachep;
 
 
 /* FIXME: should the following variable be one per NOVA instance? */
@@ -791,6 +792,21 @@ inline void nova_free_inode_node(struct super_block *sb,
 	nova_free_range_node(node);
 }
 
+inline void nova_free_file_write_item(struct nova_file_write_item *item)
+{
+	kmem_cache_free(nova_file_write_item_cachep, item);
+}
+
+inline struct nova_file_write_item *
+nova_alloc_file_write_item(struct super_block *sb)
+{
+	struct nova_file_write_item *p;
+
+	p = (struct nova_file_write_item *)
+		kmem_cache_alloc(nova_file_write_item_cachep, GFP_NOFS);
+	return p;
+}
+
 inline struct nova_range_node *nova_alloc_range_node(struct super_block *sb)
 {
 	struct nova_range_node *p;
@@ -849,6 +865,18 @@ static int __init init_rangenode_cache(void)
 	return 0;
 }
 
+static int __init init_file_write_item_cache(void)
+{
+	nova_file_write_item_cachep = kmem_cache_create(
+					"nova_file_write_item_cache",
+					sizeof(struct nova_file_write_item),
+					0, (SLAB_RECLAIM_ACCOUNT |
+					SLAB_MEM_SPREAD), NULL);
+	if (nova_file_write_item_cachep == NULL)
+		return -ENOMEM;
+	return 0;
+}
+
 static int __init init_inodecache(void)
 {
 	nova_inode_cachep = kmem_cache_create("nova_inode_cache",
@@ -875,6 +903,11 @@ static void destroy_rangenode_cache(void)
 	kmem_cache_destroy(nova_range_node_cachep);
 }
 
+static void destroy_file_write_item_cache(void)
+{
+	kmem_cache_destroy(nova_file_write_item_cachep);
+}
+
 
 /*
  * the super block writes are all done "on the fly", so the
@@ -974,14 +1007,21 @@ static int __init init_nova_fs(void)
 	if (rc)
 		goto out1;
 
-	rc = register_filesystem(&nova_fs_type);
+	rc = init_file_write_item_cache();
 	if (rc)
 		goto out2;
 
+	rc = register_filesystem(&nova_fs_type);
+	if (rc)
+		goto out3;
+
 out:
 	NOVA_END_TIMING(init_t, init_time);
 	return rc;
 
+out3:
+	destroy_file_write_item_cache();
+
 out2:
 	destroy_inodecache();
 
@@ -993,6 +1033,7 @@ static int __init init_nova_fs(void)
 static void __exit exit_nova_fs(void)
 {
 	unregister_filesystem(&nova_fs_type);
+	destroy_file_write_item_cache();
 	destroy_inodecache();
 	destroy_rangenode_cache();
 }
diff --git a/fs/nova/super.h b/fs/nova/super.h
index 56a840e..bcf9548 100644
--- a/fs/nova/super.h
+++ b/fs/nova/super.h
@@ -160,8 +160,11 @@ static inline struct nova_super_block *nova_get_super(struct super_block *sb)
 extern void nova_error_mng(struct super_block *sb, const char *fmt, ...);
 extern struct nova_range_node *nova_alloc_range_node(struct super_block *sb);
 extern inline struct nova_range_node *nova_alloc_inode_node(struct super_block *sb);
+extern struct nova_file_write_item *
+nova_alloc_file_write_item(struct super_block *sb);
 extern void nova_free_range_node(struct nova_range_node *node);
 extern inline void nova_free_inode_node(struct super_block *sb,
 	struct nova_range_node *node);
+void nova_free_file_write_item(struct nova_file_write_item *item);
 
 #endif
-- 
2.7.4

* [RFC v2 67/83] Dax: commit list of file write items to log.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (65 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 66/83] Super: Add file write item cache Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 68/83] File operation: copy-on-write write Andiry Xu
                   ` (16 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Given a list of file write items, NOVA commits them by appending
each file write entry to the log, updating the radix tree to point
to the new entries, and then updating the log tail pointer to
commit all the writes atomically.
If the items are allocated on the heap, free them on success.

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/Makefile |   2 +-
 fs/nova/dax.c    | 112 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/nova.h   |   5 +++
 3 files changed, 118 insertions(+), 1 deletion(-)
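
The three-step ordering described above (append entries, update the
tree, then publish the tail) can be sketched with a small in-memory
model. Everything here is hypothetical: a fixed array stands in for the
log, an index array stands in for the radix tree, and no cache flushes
or fences are modelled, so the sketch shows only the ordering and not
the persistence guarantees.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_LOG_CAP	16
#define MODEL_MAX_PAGES	64

struct model_entry {
	unsigned long pgoff;
	unsigned int num_pages;
};

struct model_file {
	struct model_entry log[MODEL_LOG_CAP];
	size_t tail;			/* committed entries: log[0..tail) */
	int tree[MODEL_MAX_PAGES];	/* page -> log index, -1 = hole */
};

void model_file_init(struct model_file *f)
{
	size_t i;

	f->tail = 0;
	for (i = 0; i < MODEL_MAX_PAGES; i++)
		f->tree[i] = -1;
}

int model_commit(struct model_file *f, const struct model_entry *items,
		 size_t n)
{
	size_t new_tail = f->tail;
	size_t i;
	unsigned long p;

	/* Validate everything first so a failure changes nothing. */
	if (n > MODEL_LOG_CAP - f->tail)
		return -ENOSPC;
	for (i = 0; i < n; i++) {
		if (items[i].pgoff >= MODEL_MAX_PAGES ||
		    items[i].num_pages > MODEL_MAX_PAGES - items[i].pgoff)
			return -EINVAL;
	}

	/* Step 1: append past the tail; readers of the log ignore these. */
	for (i = 0; i < n; i++)
		f->log[new_tail++] = items[i];

	/* Step 2: point the tree at the new entries. */
	for (i = 0; i < n; i++) {
		for (p = 0; p < items[i].num_pages; p++)
			f->tree[items[i].pgoff + p] = (int)(f->tail + i);
	}

	/* Step 3: a single tail update makes all entries visible at once. */
	f->tail = new_tail;
	return 0;
}
```

Because only the final tail store publishes the new entries, a log
reader in this model sees either none or all of a multi-entry write.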
 create mode 100644 fs/nova/dax.c

diff --git a/fs/nova/Makefile b/fs/nova/Makefile
index 468ed6f..7f851f2 100644
--- a/fs/nova/Makefile
+++ b/fs/nova/Makefile
@@ -4,5 +4,5 @@
 
 obj-$(CONFIG_NOVA_FS) += nova.o
 
-nova-y := balloc.o bbuild.o dir.o file.o inode.o journal.o log.o namei.o\
+nova-y := balloc.o bbuild.o dax.o dir.o file.o inode.o journal.o log.o namei.o\
 	  rebuild.o stats.o super.o
diff --git a/fs/nova/dax.c b/fs/nova/dax.c
new file mode 100644
index 0000000..1669dc0
--- /dev/null
+++ b/fs/nova/dax.c
@@ -0,0 +1,112 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * DAX file operations.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/module.h>
+#include <linux/buffer_head.h>
+#include <linux/cpufeature.h>
+#include <asm/pgtable.h>
+#include <linux/version.h>
+#include "nova.h"
+#include "inode.h"
+
+
+static int nova_reassign_file_tree(struct super_block *sb,
+	struct nova_inode_info_header *sih, u64 begin_tail, u64 end_tail)
+{
+	void *addr;
+	struct nova_file_write_entry *entry;
+	u64 curr_p = begin_tail;
+	size_t entry_size = sizeof(struct nova_file_write_entry);
+
+	while (curr_p && curr_p != end_tail) {
+		if (is_last_entry(curr_p, entry_size))
+			curr_p = next_log_page(sb, curr_p);
+
+		if (curr_p == 0) {
+			nova_err(sb, "%s: File inode %lu log is NULL!\n",
+				__func__, sih->ino);
+			return -EINVAL;
+		}
+
+		addr = (void *) nova_get_block(sb, curr_p);
+		entry = (struct nova_file_write_entry *) addr;
+
+		if (nova_get_entry_type(entry) != FILE_WRITE) {
+			nova_dbg("%s: entry type is not write? %d\n",
+				__func__, nova_get_entry_type(entry));
+			curr_p += entry_size;
+			continue;
+		}
+
+		nova_assign_write_entry(sb, sih, entry, true);
+		curr_p += entry_size;
+	}
+
+	return 0;
+}
+
+int nova_commit_writes_to_log(struct super_block *sb, struct nova_inode *pi,
+	struct inode *inode, struct list_head *head, unsigned long new_blocks,
+	int free)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_file_write_item *entry_item, *temp;
+	struct nova_inode_update update;
+	unsigned int data_bits;
+	u64 begin_tail = 0;
+	int ret = 0;
+
+	if (list_empty(head))
+		return 0;
+
+	update.tail = 0;
+
+	list_for_each_entry(entry_item, head, list) {
+		ret = nova_append_file_write_entry(sb, pi, inode,
+					entry_item, &update);
+		if (ret) {
+			nova_dbg("%s: append inode entry failed\n", __func__);
+			return -ENOSPC;
+		}
+
+		if (begin_tail == 0)
+			begin_tail = update.curr_entry;
+	}
+
+	/* Update file tree */
+	ret = nova_reassign_file_tree(sb, sih, begin_tail, update.tail);
+	if (ret < 0) {
+		/* FIXME: Need to rebuild the tree */
+		return ret;
+	}
+
+	data_bits = blk_type_to_shift[sih->i_blk_type];
+	sih->i_blocks += (new_blocks << (data_bits - sb->s_blocksize_bits));
+
+	inode->i_blocks = sih->i_blocks;
+
+	nova_update_inode(sb, inode, pi, &update);
+	NOVA_STATS_ADD(inplace_new_blocks, 1);
+
+	sih->trans_id++;
+
+	if (free) {
+		list_for_each_entry_safe(entry_item, temp, head, list)
+			nova_free_file_write_item(entry_item);
+	}
+
+	return ret;
+}
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index b2831f6..dcda02a 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -464,6 +464,11 @@ nova_get_blocknr(struct super_block *sb, u64 block, unsigned short btype)
 /* ==============  Function prototypes  ================= */
 /* ====================================================== */
 
+/* dax.c */
+int nova_commit_writes_to_log(struct super_block *sb, struct nova_inode *pi,
+	struct inode *inode, struct list_head *head, unsigned long new_blocks,
+	int free);
+
 /* dir.c */
 extern const struct file_operations nova_dir_operations;
 int nova_insert_dir_radix_tree(struct super_block *sb,
-- 
2.7.4

* [RFC v2 68/83] File operation: copy-on-write write.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (66 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 67/83] Dax: commit list of file write items to log Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 69/83] Super: Add module param inplace_data_updates Andiry Xu
                   ` (15 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

If the file is not mmapped, NOVA performs copy-on-write (CoW).
A CoW write consists of the following steps:

1. Allocate contiguous data pages.
2. Copy data from the user buffer to the data pages.
   If the write is not aligned to the page size, also copy data from the
   existing pmem pages.
3. Allocate and initialize a file write item, and add it to a linked list.
4. Repeat steps 1-3 until all of the user data has been copied to pmem pages.
5. Commit the list of file write items to the log and update the radix tree.
6. Update the log tail pointer once all the items are committed.

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/dax.c  | 149 +++++++++++++++++++++++++++++++++++++++++
 fs/nova/file.c | 208 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/nova.h |   8 +++
 3 files changed, 365 insertions(+)
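
Step 2 depends on working out which parts of the first and last blocks
the write does not cover. The sketch below reproduces that arithmetic
in userspace, assuming a single block size; the struct and the name
model_cow_plan are invented for illustration, and the formula for
num_blocks is the one used in nova_handle_head_tail_blocks().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of the blocks touched by one CoW write. */
struct model_cow_span {
	unsigned long start_blk;	/* first block written */
	unsigned long num_blocks;	/* blocks to allocate */
	size_t head_len;	/* bytes [0, head_len) of first block to keep */
	size_t tail_off;	/* offset in last block where kept bytes start */
	size_t tail_len;	/* bytes to keep at end of last block, 0 if none */
};

struct model_cow_span model_cow_plan(unsigned long long pos, size_t count,
				     unsigned int blkbits)
{
	struct model_cow_span s;
	size_t blksz = (size_t)1 << blkbits;
	size_t offset = (size_t)(pos & (blksz - 1));
	size_t eoff = (size_t)((pos + count) & (blksz - 1));

	assert(count > 0);

	s.start_blk = (unsigned long)(pos >> blkbits);
	s.num_blocks = ((count + offset - 1) >> blkbits) + 1;
	/* Unaligned start: the bytes before pos come from the old block. */
	s.head_len = offset;
	/* Unaligned end: the bytes after pos + count come from the old block. */
	s.tail_off = eoff;
	s.tail_len = eoff ? blksz - eoff : 0;
	return s;
}
```

For a 5000-byte write at offset 100 with 4 KiB blocks, this yields two
blocks, 100 preserved bytes at the head and 3092 at the tail; a fully
aligned write preserves nothing.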

diff --git a/fs/nova/dax.c b/fs/nova/dax.c
index 1669dc0..9561d8e 100644
--- a/fs/nova/dax.c
+++ b/fs/nova/dax.c
@@ -22,6 +22,113 @@
 #include "inode.h"
 
 
+static inline int nova_copy_partial_block(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry, unsigned long index,
+	size_t offset, size_t length, void *kmem)
+{
+	void *ptr;
+	int rc = 0;
+	unsigned long nvmm;
+
+	nvmm = get_nvmm(sb, sih, entry, index);
+	ptr = nova_get_block(sb, (nvmm << PAGE_SHIFT));
+
+	if (ptr != NULL) {
+		if (support_clwb)
+			rc = memcpy_mcsafe(kmem + offset, ptr + offset,
+						length);
+		else
+			memcpy_to_pmem_nocache(kmem + offset, ptr + offset,
+						length);
+	}
+
+	/* TODO: If rc < 0, go to MCE data recovery. */
+	return rc;
+}
+
+static inline int nova_handle_partial_block(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry, unsigned long index,
+	size_t offset, size_t length, void *kmem)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (entry == NULL) {
+		/* Fill zero */
+		if (support_clwb)
+			memset(kmem + offset, 0, length);
+		else
+			memcpy_to_pmem_nocache(kmem + offset,
+					sbi->zeroed_page, length);
+	} else {
+		nova_copy_partial_block(sb, sih, entry, index,
+					offset, length, kmem);
+
+	}
+	if (support_clwb)
+		nova_flush_buffer(kmem + offset, length, 0);
+	return 0;
+}
+
+/*
+ * Fill the new start/end block from original blocks.
+ * Do nothing if fully covered; copy if original blocks present;
+ * Fill zero otherwise.
+ */
+int nova_handle_head_tail_blocks(struct super_block *sb,
+	struct inode *inode, loff_t pos, size_t count, void *kmem)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	size_t offset, eblk_offset;
+	unsigned long start_blk, end_blk, num_blocks;
+	struct nova_file_write_entry *entry;
+	timing_t partial_time;
+	int ret = 0;
+
+	NOVA_START_TIMING(partial_block_t, partial_time);
+	offset = pos & (sb->s_blocksize - 1);
+	num_blocks = ((count + offset - 1) >> sb->s_blocksize_bits) + 1;
+	/* offset in the actual block size block */
+	offset = pos & (nova_inode_blk_size(sih) - 1);
+	start_blk = pos >> sb->s_blocksize_bits;
+	end_blk = start_blk + num_blocks - 1;
+
+	nova_dbg_verbose("%s: %lu blocks\n", __func__, num_blocks);
+	/* We avoid zeroing the alloc'd range, which is going to be overwritten
+	 * by this system call anyway
+	 */
+	nova_dbg_verbose("%s: start offset %lu start blk %lu %p\n", __func__,
+				offset, start_blk, kmem);
+	if (offset != 0) {
+		entry = nova_get_write_entry(sb, sih, start_blk);
+		ret = nova_handle_partial_block(sb, sih, entry,
+						start_blk, 0, offset, kmem);
+		if (ret < 0)
+			return ret;
+	}
+
+	kmem = (void *)((char *)kmem +
+			((num_blocks - 1) << sb->s_blocksize_bits));
+	eblk_offset = (pos + count) & (nova_inode_blk_size(sih) - 1);
+	nova_dbg_verbose("%s: end offset %lu, end blk %lu %p\n", __func__,
+				eblk_offset, end_blk, kmem);
+	if (eblk_offset != 0) {
+		entry = nova_get_write_entry(sb, sih, end_blk);
+
+		ret = nova_handle_partial_block(sb, sih, entry, end_blk,
+						eblk_offset,
+						sb->s_blocksize - eblk_offset,
+						kmem);
+		if (ret < 0)
+			return ret;
+	}
+	NOVA_END_TIMING(partial_block_t, partial_time);
+
+	return ret;
+}
+
 static int nova_reassign_file_tree(struct super_block *sb,
 	struct nova_inode_info_header *sih, u64 begin_tail, u64 end_tail)
 {
@@ -110,3 +217,45 @@ int nova_commit_writes_to_log(struct super_block *sb, struct nova_inode *pi,
 
 	return ret;
 }
+
+int nova_cleanup_incomplete_write(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct list_head *head, int free)
+{
+	struct nova_file_write_item *entry_item, *temp;
+	struct nova_file_write_entry *entry;
+	unsigned long blocknr;
+
+	list_for_each_entry_safe(entry_item, temp, head, list) {
+		entry = &entry_item->entry;
+		blocknr = nova_get_blocknr(sb, entry->block, sih->i_blk_type);
+		nova_free_data_blocks(sb, sih, blocknr, entry->num_pages);
+
+		if (free)
+			nova_free_file_write_item(entry_item);
+	}
+
+	return 0;
+}
+
+void nova_init_file_write_item(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_item *item,
+	u64 epoch_id, u64 pgoff, int num_pages, u64 blocknr, u32 time,
+	u64 file_size)
+{
+	struct nova_file_write_entry *entry = &item->entry;
+
+	INIT_LIST_HEAD(&item->list);
+	memset(entry, 0, sizeof(struct nova_file_write_entry));
+	entry->entry_type = FILE_WRITE;
+	entry->reassigned = 0;
+	entry->epoch_id = epoch_id;
+	entry->trans_id = sih->trans_id;
+	entry->pgoff = cpu_to_le64(pgoff);
+	entry->num_pages = cpu_to_le32(num_pages);
+	entry->invalid_pages = 0;
+	entry->block = cpu_to_le64(nova_get_block_off(sb, blocknr,
+							sih->i_blk_type));
+	entry->mtime = cpu_to_le32(time);
+
+	entry->size = file_size;
+}
diff --git a/fs/nova/file.c b/fs/nova/file.c
index 842da45..26f15c7 100644
--- a/fs/nova/file.c
+++ b/fs/nova/file.c
@@ -256,10 +256,218 @@ static ssize_t nova_dax_file_read(struct file *filp, char __user *buf,
 	return res;
 }
 
+/*
+ * Perform a COW write.   Must hold the inode lock before calling.
+ */
+static ssize_t do_nova_cow_file_write(struct file *filp,
+	const char __user *buf,	size_t len, loff_t *ppos)
+{
+	struct address_space *mapping = filp->f_mapping;
+	struct inode	*inode = mapping->host;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pi;
+	struct nova_file_write_item *entry_item;
+	struct list_head item_head;
+	struct nova_inode_update update;
+	ssize_t	    written = 0;
+	loff_t pos;
+	size_t count, offset, copied;
+	unsigned long start_blk, num_blocks;
+	unsigned long total_blocks;
+	unsigned long blocknr = 0;
+	int allocated = 0;
+	void *kmem;
+	u64 file_size;
+	size_t bytes;
+	long status = 0;
+	timing_t cow_write_time, memcpy_time;
+	unsigned long step = 0;
+	ssize_t ret;
+	u64 epoch_id;
+	u32 time;
+
+
+	if (len == 0)
+		return 0;
+
+	sih_lock(sih);
+	NOVA_START_TIMING(cow_write_t, cow_write_time);
+	INIT_LIST_HEAD(&item_head);
+
+	if (!access_ok(VERIFY_READ, buf, len)) {
+		ret = -EFAULT;
+		goto out;
+	}
+	pos = *ppos;
+
+	if (filp->f_flags & O_APPEND)
+		pos = i_size_read(inode);
+
+	count = len;
+
+	pi = nova_get_block(sb, sih->pi_addr);
+
+	offset = pos & (sb->s_blocksize - 1);
+	num_blocks = ((count + offset - 1) >> sb->s_blocksize_bits) + 1;
+	total_blocks = num_blocks;
+	start_blk = pos >> sb->s_blocksize_bits;
+
+	/* offset in the actual block size block */
+
+	ret = file_remove_privs(filp);
+	if (ret)
+		goto out;
+
+	inode->i_ctime = inode->i_mtime = current_time(inode);
+	time = current_time(inode).tv_sec;
+
+	nova_dbgv("%s: inode %lu, offset %lld, count %lu\n",
+			__func__, inode->i_ino,	pos, count);
+
+	epoch_id = nova_get_epoch_id(sb);
+	update.tail = sih->log_tail;
+	while (num_blocks > 0) {
+		offset = pos & (nova_inode_blk_size(sih) - 1);
+		start_blk = pos >> sb->s_blocksize_bits;
+
+		/* don't zero-out the allocated blocks */
+		allocated = nova_new_data_blocks(sb, sih, &blocknr, start_blk,
+				 num_blocks, ALLOC_NO_INIT, ANY_CPU,
+				 ALLOC_FROM_HEAD);
+
+		nova_dbg_verbose("%s: alloc %d blocks @ %lu\n", __func__,
+						allocated, blocknr);
+
+		if (allocated <= 0) {
+			nova_dbg("%s alloc blocks failed %d\n", __func__,
+								allocated);
+			ret = allocated;
+			goto out;
+		}
+
+		step++;
+		bytes = sb->s_blocksize * allocated - offset;
+		if (bytes > count)
+			bytes = count;
+
+		kmem = nova_get_block(inode->i_sb,
+			     nova_get_block_off(sb, blocknr, sih->i_blk_type));
+
+		if (offset || ((offset + bytes) & (PAGE_SIZE - 1)) != 0)  {
+			ret = nova_handle_head_tail_blocks(sb, inode, pos,
+							   bytes, kmem);
+			if (ret)
+				goto out;
+		}
+		/* Now copy from user buf */
+		//		nova_dbg("Write: %p\n", kmem);
+		NOVA_START_TIMING(memcpy_w_nvmm_t, memcpy_time);
+		copied = bytes - memcpy_to_pmem_nocache(kmem + offset,
+						buf, bytes);
+		NOVA_END_TIMING(memcpy_w_nvmm_t, memcpy_time);
+
+		if (pos + copied > inode->i_size)
+			file_size = cpu_to_le64(pos + copied);
+		else
+			file_size = cpu_to_le64(inode->i_size);
+
+		entry_item = nova_alloc_file_write_item(sb);
+		if (!entry_item) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		nova_init_file_write_item(sb, sih, entry_item, epoch_id,
+					start_blk, allocated, blocknr, time,
+					file_size);
+
+		list_add_tail(&entry_item->list, &item_head);
+
+		nova_dbgv("Write: %p, %lu\n", kmem, copied);
+		if (copied > 0) {
+			status = copied;
+			written += copied;
+			pos += copied;
+			buf += copied;
+			count -= copied;
+			num_blocks -= allocated;
+		}
+		if (unlikely(copied != bytes)) {
+			nova_dbg("%s ERROR!: %p, bytes %lu, copied %lu\n",
+				__func__, kmem, bytes, copied);
+			if (status >= 0)
+				status = -EFAULT;
+		}
+		if (status < 0)
+			break;
+	}
+
+	ret = nova_commit_writes_to_log(sb, pi, inode,
+					&item_head, total_blocks, 1);
+	if (ret < 0) {
+		nova_err(sb, "commit to log failed\n");
+		goto out;
+	}
+
+	ret = written;
+	NOVA_STATS_ADD(cow_write_breaks, step);
+	nova_dbgv("blocks: %lu, %lu\n", inode->i_blocks, sih->i_blocks);
+
+	*ppos = pos;
+	if (pos > inode->i_size) {
+		i_size_write(inode, pos);
+		sih->i_size = pos;
+	}
+
+out:
+	if (ret < 0)
+		nova_cleanup_incomplete_write(sb, sih, &item_head, 1);
+
+	NOVA_END_TIMING(cow_write_t, cow_write_time);
+	NOVA_STATS_ADD(cow_write_bytes, written);
+	sih_unlock(sih);
+
+	return ret;
+}
+
+/*
+ * Acquire locks and perform COW write.
+ */
+ssize_t nova_cow_file_write(struct file *filp,
+	const char __user *buf,	size_t len, loff_t *ppos)
+{
+	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = mapping->host;
+	int ret;
+
+	if (len == 0)
+		return 0;
+
+	sb_start_write(inode->i_sb);
+	inode_lock(inode);
+
+	ret = do_nova_cow_file_write(filp, buf, len, ppos);
+
+	inode_unlock(inode);
+	sb_end_write(inode->i_sb);
+
+	return ret;
+}
+
+
+static ssize_t nova_dax_file_write(struct file *filp, const char __user *buf,
+				   size_t len, loff_t *ppos)
+{
+	return nova_cow_file_write(filp, buf, len, ppos);
+}
+
 
 const struct file_operations nova_dax_file_operations = {
 	.llseek		= nova_llseek,
 	.read		= nova_dax_file_read,
+	.write		= nova_dax_file_write,
 	.open		= nova_open,
 	.fsync		= nova_fsync,
 	.flush		= nova_flush,
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index dcda02a..1c2205e 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -465,9 +465,17 @@ nova_get_blocknr(struct super_block *sb, u64 block, unsigned short btype)
 /* ====================================================== */
 
 /* dax.c */
+int nova_handle_head_tail_blocks(struct super_block *sb,
+	struct inode *inode, loff_t pos, size_t count, void *kmem);
 int nova_commit_writes_to_log(struct super_block *sb, struct nova_inode *pi,
 	struct inode *inode, struct list_head *head, unsigned long new_blocks,
 	int free);
+int nova_cleanup_incomplete_write(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct list_head *head, int free);
+void nova_init_file_write_item(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_item *item,
+	u64 epoch_id, u64 pgoff, int num_pages, u64 blocknr, u32 time,
+	u64 file_size);
 
 /* dir.c */
 extern const struct file_operations nova_dir_operations;
-- 
2.7.4


* [RFC v2 69/83] Super: Add module param inplace_data_updates.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (67 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 68/83] File operation: copy-on-write write Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 70/83] File operation: Inplace write Andiry Xu
                   ` (14 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Provide an in-place data update option (module parameter
inplace_data_updates) for users who prefer in-place updates to
copy-on-write.

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/nova.h  | 1 +
 fs/nova/super.c | 7 ++++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index 1c2205e..6c94a9b 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -138,6 +138,7 @@ extern unsigned int nova_dbgmask;
 
 
 extern int measure_timing;
+extern int inplace_data_updates;
 
 
 extern unsigned int blk_type_to_shift[NOVA_BLOCK_TYPE_MAX];
diff --git a/fs/nova/super.c b/fs/nova/super.c
index 9710be8..980b1d7 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -43,10 +43,14 @@
 
 int measure_timing;
 int support_clwb;
+int inplace_data_updates;
 
 module_param(measure_timing, int, 0444);
 MODULE_PARM_DESC(measure_timing, "Timing measurement");
 
+module_param(inplace_data_updates, int, 0444);
+MODULE_PARM_DESC(inplace_data_updates, "Perform data updates in-place (i.e., not atomically)");
+
 module_param(nova_dbgmask, int, 0444);
 MODULE_PARM_DESC(nova_dbgmask, "Control debugging output");
 
@@ -541,7 +545,8 @@ static int nova_fill_super(struct super_block *sb, void *data, int silent)
 		goto out;
 	}
 
-	nova_dbg("measure timing %d\n", measure_timing);
+	nova_dbg("measure timing %d, inplace data update %d\n",
+			measure_timing, inplace_data_updates);
 
 	get_random_bytes(&random, sizeof(u32));
 	atomic_set(&sbi->next_generation, random);
-- 
2.7.4


* [RFC v2 70/83] File operation: Inplace write.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (68 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 69/83] Super: Add module param inplace_data_updates Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 71/83] Symlink support Andiry Xu
                   ` (13 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

If the user specifies in-place updates, or the file is mmapped,
NOVA performs in-place writes.
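
The selection rule above can be sketched in user-space C as follows. This is
only an illustration, not the kernel code: choose_write_path() and
enum write_path are hypothetical names, and the two booleans stand in for the
inplace_data_updates module parameter and for mapping_mapped() on the file.

```c
#include <stdbool.h>

/* Hypothetical names, used only for this illustration. */
enum write_path { WRITE_COW, WRITE_INPLACE };

/*
 * In-place writes are chosen when the module parameter is set or when
 * the file is currently mmapped; otherwise the write is copy-on-write.
 */
enum write_path choose_write_path(bool inplace_data_updates,
				  bool file_is_mapped)
{
	if (inplace_data_updates || file_is_mapped)
		return WRITE_INPLACE;
	return WRITE_COW;
}
```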

The tricky part is that a DAX page fault can occur concurrently with an
in-place write and allocate new blocks. The memcpy of an in-place write may
itself trigger a page fault (xfstests 248). Since the page fault handler may
take the write lock to modify the tree, the write routine cannot hold the
tree lock during the memcpy.
As a result, we perform an in-place write in the following way:

1. Take the tree read lock and check for existing entries or holes.
2. Release the read lock. Allocate new data pages if needed; allocate and
   initialize a file write item, add it to the list, and perform the memcpy.
3. With the list of file write items, take the tree write lock and commit.
   Because of concurrent page faults, a hole found in step 1 may have been
   filled by a page fault handler. In that case, NOVA copies the data from
   the blocks of the file write item to the pages allocated by the page fault
   handler and frees the data blocks allocated in step 2. This guarantees
   that the application sees the write through the mmapped region.

Step 3 actually builds a new list of write items and reuses the CoW commit
routine to commit them.
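
As a rough user-space model of the splitting done in step 3, the sketch below
cuts the page range of one pending write item into runs that are still holes
(the new blocks are committed) and runs that were filled in the meantime (data
is copied into the existing blocks and the new blocks are freed). All names
here (struct run, split_write_item, the mapped[] array) are hypothetical; the
real code looks entries up in the file tree and advances one existing entry at
a time rather than scanning a boolean array.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, not NOVA types. */
enum run_kind { RUN_COMMIT_NEW, RUN_COPY_TO_EXISTING };

struct run {
	unsigned long pgoff;		/* first page of the run */
	unsigned long num_pages;	/* length of the run in pages */
	enum run_kind kind;
};

/*
 * Split [pgoff, pgoff + num_pages) into maximal runs of pages that are
 * either still holes or already mapped at commit time.
 * mapped[i] is true if file page i has a data block when the write lock
 * is taken. Returns the number of runs stored in out (at most max_runs).
 */
size_t split_write_item(const bool *mapped, unsigned long pgoff,
			unsigned long num_pages, struct run *out,
			size_t max_runs)
{
	unsigned long end = pgoff + num_pages;
	size_t n = 0;

	while (pgoff < end && n < max_runs) {
		unsigned long start = pgoff;
		bool is_mapped = mapped[pgoff];

		/* Extend the run while the mapped state does not change. */
		while (pgoff < end && mapped[pgoff] == is_mapped)
			pgoff++;

		out[n].pgoff = start;
		out[n].num_pages = pgoff - start;
		out[n].kind = is_mapped ? RUN_COPY_TO_EXISTING
					: RUN_COMMIT_NEW;
		n++;
	}

	return n;
}
```

For a write item covering pages 1..6 where pages 2 and 3 were filled by a
page fault, this yields three runs: commit page 1, copy into pages 2..3,
commit pages 4..6.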

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/dax.c  | 472 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/file.c |  10 +-
 fs/nova/nova.h |   4 +
 3 files changed, 484 insertions(+), 2 deletions(-)

diff --git a/fs/nova/dax.c b/fs/nova/dax.c
index 9561d8e..8624ce4 100644
--- a/fs/nova/dax.c
+++ b/fs/nova/dax.c
@@ -259,3 +259,475 @@ void nova_init_file_write_item(struct super_block *sb,
 
 	entry->size = file_size;
 }
+
+/*
+ * Check if there is an existing entry or hole for target page offset.
+ * Used for inplace write, DAX-mmap and fallocate.
+ */
+unsigned long nova_check_existing_entry(struct super_block *sb,
+	struct inode *inode, unsigned long num_blocks, unsigned long start_blk,
+	struct nova_file_write_entry **ret_entry,
+	int check_next, u64 epoch_id,
+	int *inplace)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_file_write_entry *entry;
+	unsigned long next_pgoff;
+	unsigned long ent_blks = 0;
+	timing_t check_time;
+
+	NOVA_START_TIMING(check_entry_t, check_time);
+
+	*ret_entry = NULL;
+	*inplace = 0;
+	entry = nova_get_write_entry(sb, sih, start_blk);
+
+	if (entry) {
+		*ret_entry = entry;
+
+		/* We can do inplace write. Find contiguous blocks */
+		if (entry->reassigned == 0)
+			ent_blks = entry->num_pages -
+					(start_blk - entry->pgoff);
+		else
+			ent_blks = 1;
+
+		if (ent_blks > num_blocks)
+			ent_blks = num_blocks;
+
+		if (entry->epoch_id == epoch_id)
+			*inplace = 1;
+
+	} else if (check_next) {
+		/* Possible Hole */
+		entry = nova_find_next_entry(sb, sih, start_blk);
+		if (entry) {
+			next_pgoff = entry->pgoff;
+			if (next_pgoff <= start_blk) {
+				nova_err(sb, "iblock %lu, entry pgoff %lu, num pages %lu\n",
+				       start_blk, next_pgoff, entry->num_pages);
+				nova_print_inode_log(sb, inode);
+				dump_stack();
+				ent_blks = num_blocks;
+				goto out;
+			}
+			ent_blks = next_pgoff - start_blk;
+			if (ent_blks > num_blocks)
+				ent_blks = num_blocks;
+		} else {
+			/* File grow */
+			ent_blks = num_blocks;
+		}
+	}
+
+	if (entry && ent_blks == 0) {
+		nova_dbg("%s: %d\n", __func__, check_next);
+		dump_stack();
+	}
+
+out:
+	NOVA_END_TIMING(check_entry_t, check_time);
+	return ent_blks;
+}
+
+/* Memcpy from newly allocated data blocks to existing data blocks */
+static int nova_inplace_memcpy(struct super_block *sb, struct inode *inode,
+	struct nova_file_write_entry *from, struct nova_file_write_entry *to,
+	unsigned long num_blocks, loff_t pos, size_t len)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_log_entry_info entry_info;
+	unsigned long pgoff;
+	unsigned long from_nvmm, to_nvmm;
+	void *from_addr, *to_addr = NULL;
+	loff_t base, start, end, offset;
+
+	pgoff = le64_to_cpu(from->pgoff);
+	base = start = pgoff << PAGE_SHIFT;
+	end = (pgoff + num_blocks) << PAGE_SHIFT;
+
+	if (start < pos)
+		start = pos;
+
+	if (end > pos + len)
+		end = pos + len;
+
+	len = end - start;
+	offset = start - base;
+
+	from_nvmm = get_nvmm(sb, sih, from, pgoff);
+	from_addr = nova_get_block(sb, (from_nvmm << PAGE_SHIFT));
+	to_nvmm = get_nvmm(sb, sih, to, pgoff);
+	to_addr = nova_get_block(sb, (to_nvmm << PAGE_SHIFT));
+
+	memcpy_to_pmem_nocache(to_addr + offset, from_addr + offset, len);
+
+	/* Update entry */
+	entry_info.type = FILE_WRITE;
+	entry_info.epoch_id = from->epoch_id;
+	entry_info.trans_id = from->trans_id;
+	entry_info.time = from->mtime;
+	entry_info.file_size = from->size;
+	entry_info.inplace = 1;
+
+	nova_inplace_update_write_entry(sb, inode, to, &entry_info);
+	return 0;
+}
+
+/*
+ * Due to concurrent DAX fault, we may have overlapped entries in the list.
+ * We copy the data to the existing data pages and update the entry.
+ * Must be called with sih write lock held.
+ */
+static int nova_commit_inplace_writes_to_log(struct super_block *sb,
+	struct nova_inode *pi, struct inode *inode,
+	struct list_head *head, unsigned long new_blocks,
+	loff_t pos, size_t len)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_file_write_item *entry_item, *temp;
+	struct nova_file_write_item *new_item;
+	struct nova_file_write_entry *curr, *entry;
+	struct list_head new_head;
+	unsigned long start_blk, ent_blks;
+	unsigned long num_blocks;
+	unsigned long blocknr;
+	u64 epoch_id;
+	int inplace;
+	int ret = 0;
+
+	if (list_empty(head))
+		return 0;
+
+	sih_lock(sih);
+	INIT_LIST_HEAD(&new_head);
+
+	list_for_each_entry_safe(entry_item, temp, head, list) {
+		list_del(&entry_item->list);
+		curr = &entry_item->entry;
+		epoch_id = le64_to_cpu(curr->epoch_id);
+again:
+		num_blocks = le32_to_cpu(curr->num_pages);
+		start_blk = le64_to_cpu(curr->pgoff);
+
+		ent_blks = nova_check_existing_entry(sb, inode, num_blocks,
+						start_blk, &entry,
+						1, epoch_id, &inplace);
+
+		if (!entry && ent_blks == num_blocks) {
+			/* Hole */
+			list_add_tail(&entry_item->list, &new_head);
+			continue;
+		}
+
+		blocknr = nova_get_blocknr(sb, curr->block,
+						sih->i_blk_type);
+		/* Overlap with head. Memcpy */
+		if (entry) {
+			new_blocks -= ent_blks;
+			nova_inplace_memcpy(sb, inode, curr, entry, ent_blks,
+						pos, len);
+			if (ent_blks == num_blocks) {
+				/* Full copy */
+				nova_free_data_blocks(sb, sih, blocknr,
+						ent_blks);
+				nova_free_file_write_item(entry_item);
+				continue;
+			} else {
+				/* Partial copy */
+				curr->num_pages -= ent_blks;
+				curr->pgoff += ent_blks;
+				curr->block += ent_blks << PAGE_SHIFT;
+				nova_free_data_blocks(sb, sih, blocknr,
+						ent_blks);
+				goto again;
+			}
+		}
+
+		/* Overlap with middle or tail. */
+		new_item = nova_alloc_file_write_item(sb);
+		if (!new_item) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		nova_init_file_write_item(sb, sih, new_item, epoch_id,
+				start_blk, ent_blks, blocknr,
+				le32_to_cpu(curr->mtime), curr->size);
+
+		list_add_tail(&new_item->list, &new_head);
+
+		curr->num_pages -= ent_blks;
+		curr->pgoff += ent_blks;
+		curr->block += ent_blks << PAGE_SHIFT;
+		goto again;
+	}
+
+	ret = nova_commit_writes_to_log(sb, pi, inode,
+					&new_head, new_blocks, 1);
+	if (ret < 0) {
+		nova_err(sb, "commit to log failed\n");
+		goto out;
+	}
+
+out:
+	if (ret < 0)
+		nova_cleanup_incomplete_write(sb, sih, &new_head, 1);
+
+	sih_unlock(sih);
+	return ret;
+}
+
+/*
+ * Do an inplace write.  This function assumes that the lock on the inode is
+ * already held.
+ *
+ * We do this in three steps:
+ * 1. Check the tree, protected by sih read lock.
+ * 2. Allocate blocks for hole, copy from user buffer.
+ * 3. Take sih write lock and commit the writes.
+ *
+ * This is necessary because DAX fault can occur when we do the copy.
+ * We cannot hold sih lock when performing the data copy,
+ * and DAX fault may allocate data pages during step 2.
+ * In this case we overwrite with our data and free the data pages we allocated.
+ */
+ssize_t do_nova_inplace_file_write(struct file *filp,
+	const char __user *buf,	size_t len, loff_t *ppos)
+{
+	struct address_space *mapping = filp->f_mapping;
+	struct inode	*inode = mapping->host;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pi;
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_item *entry_item;
+	struct list_head item_head;
+	struct nova_inode_update update;
+	ssize_t	    written = 0;
+	loff_t pos, original_pos;
+	size_t count, offset, copied;
+	unsigned long start_blk, num_blocks, ent_blks = 0;
+	unsigned long total_blocks;
+	unsigned long new_blocks = 0;
+	unsigned long blocknr = 0;
+	int allocated = 0;
+	int inplace = 0;
+	bool hole_fill = false;
+	void *kmem;
+	u64 blk_off;
+	size_t bytes;
+	long status = 0;
+	timing_t inplace_write_time, memcpy_time;
+	unsigned long step = 0;
+	u64 epoch_id;
+	u64 file_size;
+	u32 time;
+	ssize_t ret;
+
+	if (len == 0)
+		return 0;
+
+	NOVA_START_TIMING(inplace_write_t, inplace_write_time);
+	INIT_LIST_HEAD(&item_head);
+
+	if (!access_ok(VERIFY_READ, buf, len)) {
+		ret = -EFAULT;
+		goto out;
+	}
+	pos = original_pos = *ppos;
+
+	if (filp->f_flags & O_APPEND)
+		pos = i_size_read(inode);
+
+	count = len;
+
+	pi = nova_get_block(sb, sih->pi_addr);
+
+	offset = pos & (sb->s_blocksize - 1);
+	num_blocks = ((count + offset - 1) >> sb->s_blocksize_bits) + 1;
+	total_blocks = num_blocks;
+
+	/* offset in the actual block size block */
+
+	ret = file_remove_privs(filp);
+	if (ret)
+		goto out;
+
+	inode->i_ctime = inode->i_mtime = current_time(inode);
+	time = current_time(inode).tv_sec;
+
+	epoch_id = nova_get_epoch_id(sb);
+
+	nova_dbgv("%s: epoch_id %llu, inode %lu, offset %lld, count %lu\n",
+			__func__, epoch_id, inode->i_ino, pos, count);
+	update.tail = sih->log_tail;
+	while (num_blocks > 0) {
+		hole_fill = false;
+		offset = pos & (nova_inode_blk_size(sih) - 1);
+		start_blk = pos >> sb->s_blocksize_bits;
+
+		sih_lock_shared(sih);
+		ent_blks = nova_check_existing_entry(sb, inode, num_blocks,
+						start_blk, &entry,
+						1, epoch_id, &inplace);
+		sih_unlock_shared(sih);
+
+		if (entry && inplace) {
+			/* We can do inplace write. Find contiguous blocks */
+			blocknr = get_nvmm(sb, sih, entry, start_blk);
+			blk_off = blocknr << PAGE_SHIFT;
+			allocated = ent_blks;
+		} else {
+			/* Allocate blocks to fill hole */
+			allocated = nova_new_data_blocks(sb, sih, &blocknr,
+					 start_blk, ent_blks, ALLOC_NO_INIT,
+					 ANY_CPU, ALLOC_FROM_HEAD);
+
+			nova_dbg_verbose("%s: alloc %d blocks @ %lu\n",
+						__func__, allocated, blocknr);
+
+			if (allocated <= 0) {
+				nova_dbg("%s alloc blocks failed!, %d\n",
+							__func__, allocated);
+				ret = allocated;
+				goto out;
+			}
+
+			hole_fill = true;
+			new_blocks += allocated;
+			blk_off = nova_get_block_off(sb, blocknr,
+							sih->i_blk_type);
+
+			invalidate_inode_pages2_range(inode->i_mapping,
+					start_blk, start_blk + allocated - 1);
+		}
+
+		step++;
+		bytes = sb->s_blocksize * allocated - offset;
+		if (bytes > count)
+			bytes = count;
+
+		kmem = nova_get_block(inode->i_sb, blk_off);
+
+		if (hole_fill &&
+		    (offset || ((offset + bytes) & (PAGE_SIZE - 1)) != 0)) {
+			ret =  nova_handle_head_tail_blocks(sb, inode,
+							    pos, bytes, kmem);
+			if (ret)
+				goto out;
+
+		}
+
+		/* Now copy from user buf */
+//		nova_dbg("Write: %p\n", kmem);
+		NOVA_START_TIMING(memcpy_w_nvmm_t, memcpy_time);
+		copied = bytes - memcpy_to_pmem_nocache(kmem + offset,
+						buf, bytes);
+		NOVA_END_TIMING(memcpy_w_nvmm_t, memcpy_time);
+
+		if (pos + copied > inode->i_size)
+			file_size = cpu_to_le64(pos + copied);
+		else
+			file_size = cpu_to_le64(inode->i_size);
+
+		/* Handle hole fill write */
+		if (hole_fill) {
+			entry_item = nova_alloc_file_write_item(sb);
+			if (!entry_item) {
+				ret = -ENOMEM;
+				goto out;
+			}
+
+			nova_init_file_write_item(sb, sih, entry_item,
+						epoch_id, start_blk, allocated,
+						blocknr, time, file_size);
+
+			list_add_tail(&entry_item->list, &item_head);
+		} else {
+			/* Update existing entry */
+			struct nova_log_entry_info entry_info;
+
+			entry_info.type = FILE_WRITE;
+			entry_info.epoch_id = epoch_id;
+			entry_info.trans_id = sih->trans_id;
+			entry_info.time = time;
+			entry_info.file_size = file_size;
+			entry_info.inplace = 1;
+
+			nova_inplace_update_write_entry(sb, inode, entry,
+							&entry_info);
+		}
+
+		nova_dbgv("Write: %p, %lu\n", kmem, copied);
+		if (copied > 0) {
+			status = copied;
+			written += copied;
+			pos += copied;
+			buf += copied;
+			count -= copied;
+			num_blocks -= allocated;
+		}
+		if (unlikely(copied != bytes)) {
+			nova_dbg("%s ERROR!: %p, bytes %lu, copied %lu\n",
+				__func__, kmem, bytes, copied);
+			if (status >= 0)
+				status = -EFAULT;
+		}
+		if (status < 0)
+			break;
+	}
+
+	ret = nova_commit_inplace_writes_to_log(sb, pi, inode, &item_head,
+					new_blocks, original_pos, len);
+	if (ret < 0) {
+		nova_err(sb, "commit to log failed\n");
+		goto out;
+	}
+
+	ret = written;
+	NOVA_STATS_ADD(inplace_write_breaks, step);
+	nova_dbgv("blocks: %lu, %lu\n", inode->i_blocks, sih->i_blocks);
+
+	*ppos = pos;
+	if (pos > inode->i_size) {
+		i_size_write(inode, pos);
+		sih->i_size = pos;
+	}
+
+out:
+	if (ret < 0)
+		nova_cleanup_incomplete_write(sb, sih, &item_head, 1);
+
+	NOVA_END_TIMING(inplace_write_t, inplace_write_time);
+	NOVA_STATS_ADD(inplace_write_bytes, written);
+	return ret;
+}
+
+/*
+ * Acquire locks and perform an inplace update.
+ */
+ssize_t nova_inplace_file_write(struct file *filp,
+				const char __user *buf,	size_t len, loff_t *ppos)
+{
+	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = mapping->host;
+	int ret;
+
+	if (len == 0)
+		return 0;
+
+	sb_start_write(inode->i_sb);
+	inode_lock(inode);
+
+	ret = do_nova_inplace_file_write(filp, buf, len, ppos);
+
+	inode_unlock(inode);
+	sb_end_write(inode->i_sb);
+
+	return ret;
+}
diff --git a/fs/nova/file.c b/fs/nova/file.c
index 26f15c7..b94a9a3 100644
--- a/fs/nova/file.c
+++ b/fs/nova/file.c
@@ -448,7 +448,10 @@ ssize_t nova_cow_file_write(struct file *filp,
 	sb_start_write(inode->i_sb);
 	inode_lock(inode);
 
-	ret = do_nova_cow_file_write(filp, buf, len, ppos);
+	if (mapping_mapped(mapping))
+		ret = do_nova_inplace_file_write(filp, buf, len, ppos);
+	else
+		ret = do_nova_cow_file_write(filp, buf, len, ppos);
 
 	inode_unlock(inode);
 	sb_end_write(inode->i_sb);
@@ -460,7 +463,10 @@ ssize_t nova_cow_file_write(struct file *filp,
 static ssize_t nova_dax_file_write(struct file *filp, const char __user *buf,
 				   size_t len, loff_t *ppos)
 {
-	return nova_cow_file_write(filp, buf, len, ppos);
+	if (inplace_data_updates)
+		return nova_inplace_file_write(filp, buf, len, ppos);
+	else
+		return nova_cow_file_write(filp, buf, len, ppos);
 }
 
 
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index 6c94a9b..40c70da 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -477,6 +477,10 @@ void nova_init_file_write_item(struct super_block *sb,
 	struct nova_inode_info_header *sih, struct nova_file_write_item *item,
 	u64 epoch_id, u64 pgoff, int num_pages, u64 blocknr, u32 time,
 	u64 file_size);
+ssize_t nova_inplace_file_write(struct file *filp, const char __user *buf,
+	size_t len, loff_t *ppos);
+ssize_t do_nova_inplace_file_write(struct file *filp, const char __user *buf,
+	size_t len, loff_t *ppos);
 
 /* dir.c */
 extern const struct file_operations nova_dir_operations;
-- 
2.7.4


* [RFC v2 71/83] Symlink support.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (69 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 70/83] File operation: Inplace write Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 72/83] File operation: fallocate Andiry Xu
                   ` (12 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA allocates two blocks for a symlink inode: one for the inode log,
and one data block that stores symname.
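
Because the target is stored in a single data block, it must fit in one block
together with its terminating NUL. A minimal user-space sketch of that check
follows; symlink_target_fits() is a hypothetical helper, and blocksize stands
in for sb->s_blocksize.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 0 if symname plus its terminating NUL fits in one block of
 * blocksize bytes, or -ENAMETOOLONG otherwise.
 */
int symlink_target_fits(const char *symname, size_t blocksize)
{
	size_t len = strlen(symname);

	if (len + 1 > blocksize)
		return -ENAMETOOLONG;
	return 0;
}
```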

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/Makefile  |   2 +-
 fs/nova/inode.c   |   2 +
 fs/nova/namei.c   |  70 ++++++++++++++++++++++++++++
 fs/nova/nova.h    |   5 ++
 fs/nova/symlink.c | 133 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 211 insertions(+), 1 deletion(-)
 create mode 100644 fs/nova/symlink.c

diff --git a/fs/nova/Makefile b/fs/nova/Makefile
index 7f851f2..7bf6403 100644
--- a/fs/nova/Makefile
+++ b/fs/nova/Makefile
@@ -5,4 +5,4 @@
 obj-$(CONFIG_NOVA_FS) += nova.o
 
 nova-y := balloc.o bbuild.o dax.o dir.o file.o inode.o journal.o log.o namei.o\
-	  rebuild.o stats.o super.o
+	  rebuild.o stats.o super.o symlink.o
diff --git a/fs/nova/inode.c b/fs/nova/inode.c
index a6d74cb..21be31a 100644
--- a/fs/nova/inode.c
+++ b/fs/nova/inode.c
@@ -285,6 +285,7 @@ static int nova_read_inode(struct super_block *sb, struct inode *inode,
 		inode->i_fop = &nova_dir_operations;
 		break;
 	case S_IFLNK:
+		inode->i_op = &nova_symlink_inode_operations;
 		break;
 	default:
 		inode->i_op = &nova_special_inode_operations;
@@ -983,6 +984,7 @@ struct inode *nova_new_vfs_inode(enum nova_new_inode_type type,
 		inode->i_op = &nova_special_inode_operations;
 		break;
 	case TYPE_SYMLINK:
+		inode->i_op = &nova_symlink_inode_operations;
 		inode->i_mapping->a_ops = &nova_aops_dax;
 		break;
 	case TYPE_MKDIR:
diff --git a/fs/nova/namei.c b/fs/nova/namei.c
index 7a81672..58f6a72 100644
--- a/fs/nova/namei.c
+++ b/fs/nova/namei.c
@@ -207,6 +207,75 @@ static int nova_mknod(struct inode *dir, struct dentry *dentry, umode_t mode,
 	return err;
 }
 
+static int nova_symlink(struct inode *dir, struct dentry *dentry,
+	const char *symname)
+{
+	struct super_block *sb = dir->i_sb;
+	int err = -ENAMETOOLONG;
+	unsigned int len = strlen(symname);
+	struct inode *inode;
+	struct nova_inode_info *si;
+	struct nova_inode_info_header *sih;
+	u64 pi_addr = 0;
+	struct nova_inode *pidir, *pi;
+	struct nova_inode_update update;
+	u64 ino;
+	u64 epoch_id;
+	timing_t symlink_time;
+
+	NOVA_START_TIMING(symlink_t, symlink_time);
+	if (len + 1 > sb->s_blocksize)
+		goto out;
+
+	pidir = nova_get_inode(sb, dir);
+	if (!pidir)
+		goto out_fail;
+
+	epoch_id = nova_get_epoch_id(sb);
+	ino = nova_new_nova_inode(sb, &pi_addr);
+	if (ino == 0)
+		goto out_fail;
+
+	nova_dbgv("%s: name %s, symname %s\n", __func__,
+				dentry->d_name.name, symname);
+	nova_dbgv("%s: inode %llu, dir %lu\n", __func__, ino, dir->i_ino);
+
+	update.tail = 0;
+	err = nova_add_dentry(dentry, ino, 0, &update, epoch_id);
+	if (err)
+		goto out_fail;
+
+	inode = nova_new_vfs_inode(TYPE_SYMLINK, dir, pi_addr, ino,
+					S_IFLNK|0777, len, 0,
+					&dentry->d_name, epoch_id);
+	if (IS_ERR(inode)) {
+		err = PTR_ERR(inode);
+		goto out_fail;
+	}
+
+	pi = nova_get_inode(sb, inode);
+
+	si = NOVA_I(inode);
+	sih = &si->header;
+
+	err = nova_block_symlink(sb, pi, inode, symname, len, epoch_id);
+	if (err)
+		goto out_fail;
+
+	d_instantiate(dentry, inode);
+	unlock_new_inode(inode);
+
+	nova_lite_transaction_for_new_inode(sb, pi, pidir, inode, dir,
+					&update);
+out:
+	NOVA_END_TIMING(symlink_t, symlink_time);
+	return err;
+
+out_fail:
+	nova_err(sb, "%s return %d\n", __func__, err);
+	goto out;
+}
+
 static void nova_lite_transaction_for_time_and_link(struct super_block *sb,
 	struct nova_inode *pi, struct nova_inode *pidir, struct inode *inode,
 	struct inode *dir, struct nova_inode_update *update,
@@ -764,6 +833,7 @@ const struct inode_operations nova_dir_inode_operations = {
 	.lookup		= nova_lookup,
 	.link		= nova_link,
 	.unlink		= nova_unlink,
+	.symlink	= nova_symlink,
 	.mkdir		= nova_mkdir,
 	.rmdir		= nova_rmdir,
 	.mknod		= nova_mknod,
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index 40c70da..6392bb3 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -518,6 +518,11 @@ int nova_rebuild_dir_inode_tree(struct super_block *sb,
 int nova_rebuild_inode(struct super_block *sb, struct nova_inode_info *si,
 	u64 ino, u64 pi_addr, int rebuild_dir);
 
+/* symlink.c */
+int nova_block_symlink(struct super_block *sb, struct nova_inode *pi,
+	struct inode *inode, const char *symname, int len, u64 epoch_id);
+extern const struct inode_operations nova_symlink_inode_operations;
+
 /* stats.c */
 void nova_get_timing_stats(void);
 void nova_get_IO_stats(void);
diff --git a/fs/nova/symlink.c b/fs/nova/symlink.c
new file mode 100644
index 0000000..dbd57c5
--- /dev/null
+++ b/fs/nova/symlink.c
@@ -0,0 +1,133 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Symlink operations
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/version.h>
+#include "nova.h"
+#include "inode.h"
+
+int nova_block_symlink(struct super_block *sb, struct nova_inode *pi,
+	struct inode *inode, const char *symname, int len, u64 epoch_id)
+{
+	struct nova_file_write_item entry_item;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode_update update;
+	unsigned long name_blocknr = 0;
+	int allocated;
+	u64 block;
+	char *blockp;
+	u32 time;
+	int ret;
+
+	update.tail = sih->log_tail;
+
+	allocated = nova_new_data_blocks(sb, sih, &name_blocknr, 0, 1,
+				 ALLOC_INIT_ZERO, ANY_CPU, ALLOC_FROM_TAIL);
+	if (allocated != 1 || name_blocknr == 0) {
+		ret = allocated;
+		return ret;
+	}
+
+	/* First copy name to name block */
+	block = nova_get_block_off(sb, name_blocknr, NOVA_BLOCK_TYPE_4K);
+	blockp = (char *)nova_get_block(sb, block);
+
+	memcpy_to_pmem_nocache(blockp, symname, len);
+	blockp[len] = '\0';
+
+	/* Apply a write entry to the log page */
+	time = current_time(inode).tv_sec;
+	nova_init_file_write_item(sb, sih, &entry_item, epoch_id, 0, 1,
+					name_blocknr, time, len + 1);
+
+	sih_lock(sih);
+	ret = nova_append_file_write_entry(sb, pi, inode, &entry_item, &update);
+	if (ret) {
+		nova_dbg("%s: append file write entry failed %d\n",
+					__func__, ret);
+		nova_free_data_blocks(sb, sih, name_blocknr, 1);
+		goto out;
+	}
+
+	nova_update_inode(sb, inode, pi, &update);
+	sih->trans_id++;
+out:
+	sih_unlock(sih);
+	return ret;
+}
+
+/* FIXME: Temporary workaround */
+static int nova_readlink_copy(char __user *buffer, int buflen, const char *link)
+{
+	int len = PTR_ERR(link);
+
+	if (IS_ERR(link))
+		goto out;
+
+	len = strlen(link);
+	if (len > (unsigned int) buflen)
+		len = buflen;
+	if (copy_to_user(buffer, link, len))
+		len = -EFAULT;
+out:
+	return len;
+}
+
+static int nova_readlink(struct dentry *dentry, char __user *buffer, int buflen)
+{
+	struct nova_file_write_entry *entry;
+	struct inode *inode = dentry->d_inode;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	char *blockp;
+
+	entry = (struct nova_file_write_entry *)nova_get_block(sb,
+							sih->log_head);
+
+	blockp = (char *)nova_get_block(sb, BLOCK_OFF(entry->block));
+
+	return nova_readlink_copy(buffer, buflen, blockp);
+}
+
+static const char *nova_get_link(struct dentry *dentry, struct inode *inode,
+	struct delayed_call *done)
+{
+	struct nova_file_write_entry *entry;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	char *blockp;
+
+	entry = (struct nova_file_write_entry *)nova_get_block(sb,
+							sih->log_head);
+
+	blockp = (char *)nova_get_block(sb, BLOCK_OFF(entry->block));
+
+	return blockp;
+}
+
+const struct inode_operations nova_symlink_inode_operations = {
+	.readlink	= nova_readlink,
+	.get_link	= nova_get_link,
+	.setattr	= nova_notify_change,
+};
-- 
2.7.4

* [RFC v2 72/83] File operation: fallocate.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (70 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 71/83] Symlink support Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 73/83] Dax: Add iomap operations Andiry Xu
                   ` (11 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Fallocate works similarly to writes, allocating zeroed blocks
for the holes in the requested region.
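
For readers following the block arithmetic, here is a minimal userspace
sketch of how a byte range maps to the block range that fallocate walks.
It mirrors the shift/mask computation of start_blk and num_blocks in
nova_fallocate(); struct blk_range and byte_range_to_blocks() are
hypothetical names used only for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type (not in the patch): a range in file-block units. */
struct blk_range {
	uint64_t start_blk;	/* first block touched by the request */
	uint64_t num_blocks;	/* number of blocks touched by the request */
};

/*
 * Convert the byte range [offset, offset + len) into the block range
 * covering it. blkbits is log2 of the block size (12 for 4KB blocks).
 */
struct blk_range byte_range_to_blocks(uint64_t offset, uint64_t len,
				      unsigned int blkbits)
{
	struct blk_range r;
	uint64_t mask, blockoff;

	assert(blkbits < 64);
	mask = ((uint64_t)1 << blkbits) - 1;
	blockoff = offset & mask;	/* offset within the first block */

	r.start_blk = offset >> blkbits;
	/* Round up so a partially covered last block is counted. */
	r.num_blocks = (blockoff + len + mask) >> blkbits;
	return r;
}
```

With 4KB blocks, a 2-byte request starting at offset 4095 straddles a block
boundary, so it covers two blocks starting at block 0.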

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/file.c | 148 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/nova.h |   5 ++
 2 files changed, 153 insertions(+)

diff --git a/fs/nova/file.c b/fs/nova/file.c
index b94a9a3..a6b5bd3 100644
--- a/fs/nova/file.c
+++ b/fs/nova/file.c
@@ -113,6 +113,153 @@ static int nova_open(struct inode *inode, struct file *filp)
 	return generic_file_open(inode, filp);
 }
 
+static long nova_fallocate(struct file *file, int mode, loff_t offset,
+	loff_t len)
+{
+	struct inode *inode = file->f_path.dentry->d_inode;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *pi;
+	struct nova_file_write_entry *entry;
+	struct nova_file_write_item *entry_item;
+	struct list_head item_head;
+	struct nova_inode_update update;
+	unsigned long start_blk, num_blocks, ent_blks = 0;
+	unsigned long total_blocks = 0;
+	unsigned long blocknr = 0;
+	unsigned long blockoff;
+	loff_t new_size;
+	long ret = 0;
+	int inplace = 0;
+	int blocksize_mask;
+	int allocated = 0;
+	timing_t fallocate_time;
+	u64 epoch_id;
+	u32 time;
+
+	/*
+	 * Fallocate does not make much sense for CoW,
+	 * but we still support it for DAX-mmap purpose.
+	 */
+
+	/* We only support the FALLOC_FL_KEEP_SIZE mode */
+	if (mode & ~FALLOC_FL_KEEP_SIZE)
+		return -EOPNOTSUPP;
+
+	if (S_ISDIR(inode->i_mode))
+		return -ENODEV;
+
+	INIT_LIST_HEAD(&item_head);
+	new_size = len + offset;
+	if (!(mode & FALLOC_FL_KEEP_SIZE) && new_size > inode->i_size) {
+		ret = inode_newsize_ok(inode, new_size);
+		if (ret)
+			return ret;
+	} else {
+		new_size = inode->i_size;
+	}
+
+	nova_dbgv("%s: inode %lu, offset %lld, count %lld, mode 0x%x\n",
+			__func__, inode->i_ino,	offset, len, mode);
+
+	NOVA_START_TIMING(fallocate_t, fallocate_time);
+	inode_lock(inode);
+	sih_lock(sih);
+
+	pi = nova_get_inode(sb, inode);
+	if (!pi) {
+		ret = -EACCES;
+		goto out;
+	}
+
+	inode->i_mtime = inode->i_ctime = current_time(inode);
+	time = current_time(inode).tv_sec;
+
+	blocksize_mask = sb->s_blocksize - 1;
+	start_blk = offset >> sb->s_blocksize_bits;
+	blockoff = offset & blocksize_mask;
+	num_blocks = (blockoff + len + blocksize_mask) >> sb->s_blocksize_bits;
+
+	epoch_id = nova_get_epoch_id(sb);
+	update.tail = sih->log_tail;
+	while (num_blocks > 0) {
+		ent_blks = nova_check_existing_entry(sb, inode, num_blocks,
+						start_blk, &entry,
+						1, epoch_id, &inplace);
+
+		if (entry && inplace) {
+			if (entry->size < new_size) {
+				/* Update existing entry */
+				entry->size = new_size;
+				nova_persist_entry(entry);
+			}
+			allocated = ent_blks;
+			goto next;
+		}
+
+		/* Allocate zeroed blocks to fill hole */
+		allocated = nova_new_data_blocks(sb, sih, &blocknr, start_blk,
+				 ent_blks, ALLOC_INIT_ZERO, ANY_CPU,
+				 ALLOC_FROM_HEAD);
+		nova_dbgv("%s: alloc %d blocks @ %lu\n", __func__,
+						allocated, blocknr);
+
+		if (allocated <= 0) {
+			nova_dbg("%s alloc %lu blocks failed!, %d\n",
+						__func__, ent_blks, allocated);
+			ret = allocated;
+			goto out;
+		}
+
+		entry_item = nova_alloc_file_write_item(sb);
+		if (!entry_item) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		/* Handle hole fill write */
+		nova_init_file_write_item(sb, sih, entry_item, epoch_id,
+					start_blk, allocated, blocknr,
+					time, new_size);
+
+		list_add_tail(&entry_item->list, &item_head);
+
+		total_blocks += allocated;
+next:
+		num_blocks -= allocated;
+		start_blk += allocated;
+	}
+
+	ret = nova_commit_writes_to_log(sb, pi, inode,
+					&item_head, total_blocks, 1);
+	if (ret < 0) {
+		nova_err(sb, "commit to log failed\n");
+		goto out;
+	}
+
+	if (ret || (mode & FALLOC_FL_KEEP_SIZE)) {
+		pi->i_flags |= cpu_to_le32(NOVA_EOFBLOCKS_FL);
+		sih->i_flags |= cpu_to_le32(NOVA_EOFBLOCKS_FL);
+	}
+
+	if (!(mode & FALLOC_FL_KEEP_SIZE) && new_size > inode->i_size) {
+		inode->i_size = new_size;
+		sih->i_size = new_size;
+	}
+
+	nova_persist_inode(pi);
+
+out:
+	if (ret < 0)
+		nova_cleanup_incomplete_write(sb, sih, &item_head, 1);
+
+	sih_unlock(sih);
+	inode_unlock(inode);
+	NOVA_END_TIMING(fallocate_t, fallocate_time);
+	return ret;
+}
+
 static ssize_t
 do_dax_mapping_read(struct file *filp, char __user *buf,
 	size_t len, loff_t *ppos)
@@ -477,6 +624,7 @@ const struct file_operations nova_dax_file_operations = {
 	.open		= nova_open,
 	.fsync		= nova_fsync,
 	.flush		= nova_flush,
+	.fallocate	= nova_fallocate,
 };
 
 const struct inode_operations nova_file_inode_operations = {
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index 6392bb3..ab9e8f3 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -477,6 +477,11 @@ void nova_init_file_write_item(struct super_block *sb,
 	struct nova_inode_info_header *sih, struct nova_file_write_item *item,
 	u64 epoch_id, u64 pgoff, int num_pages, u64 blocknr, u32 time,
 	u64 file_size);
+unsigned long nova_check_existing_entry(struct super_block *sb,
+	struct inode *inode, unsigned long num_blocks, unsigned long start_blk,
+	struct nova_file_write_entry **ret_entry,
+	int check_next, u64 epoch_id,
+	int *inplace);
 ssize_t nova_inplace_file_write(struct file *filp, const char __user *buf,
 	size_t len, loff_t *ppos);
 ssize_t do_nova_inplace_file_write(struct file *filp, const char __user *buf,
-- 
2.7.4

* [RFC v2 73/83] Dax: Add iomap operations.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (71 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 72/83] File operation: fallocate Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 74/83] File operation: Mmap Andiry Xu
                   ` (10 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

The core of the iomap support is nova_dax_get_blocks(). It first takes
the read lock and looks up the block; if the block is missing, it takes
the write lock, checks again, and allocates a new block if needed.

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/dax.c  | 184 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/nova.h |   3 +
 2 files changed, 187 insertions(+)

diff --git a/fs/nova/dax.c b/fs/nova/dax.c
index 8624ce4..e639b23 100644
--- a/fs/nova/dax.c
+++ b/fs/nova/dax.c
@@ -731,3 +731,187 @@ ssize_t nova_inplace_file_write(struct file *filp,
 
 	return ret;
 }
+
+/*
+ * return > 0, # of blocks mapped or allocated.
+ * return = 0, if plain lookup failed.
+ * return < 0, error case.
+ */
+static int nova_dax_get_blocks(struct inode *inode, sector_t iblock,
+	unsigned long max_blocks, u32 *bno, bool *new, bool *boundary,
+	int create)
+{
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pi;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_file_write_entry *entry = NULL;
+	struct nova_file_write_item entry_item;
+	struct list_head item_head;
+	struct nova_inode_update update;
+	u32 time;
+	unsigned long nvmm = 0;
+	unsigned long blocknr = 0;
+	u64 epoch_id;
+	int num_blocks = 0;
+	int inplace = 0;
+	int allocated = 0;
+	int locked = 0;
+	int check_next;
+	int ret = 0;
+	timing_t get_block_time;
+
+
+	if (max_blocks == 0)
+		return 0;
+
+	NOVA_START_TIMING(dax_get_block_t, get_block_time);
+	INIT_LIST_HEAD(&item_head);
+
+	nova_dbgv("%s: pgoff %lu, num %lu, create %d\n",
+				__func__, iblock, max_blocks, create);
+
+	epoch_id = nova_get_epoch_id(sb);
+
+	check_next = 0;
+	sih_lock_shared(sih);
+
+again:
+	num_blocks = nova_check_existing_entry(sb, inode, max_blocks,
+					iblock, &entry, check_next,
+					epoch_id, &inplace);
+
+	if (entry) {
+		if (create == 0 || inplace) {
+			nvmm = get_nvmm(sb, sih, entry, iblock);
+			nova_dbgv("%s: found pgoff %lu, block %lu\n",
+					__func__, iblock, nvmm);
+			goto out;
+		}
+	}
+
+	if (create == 0) {
+		num_blocks = 0;
+		goto out1;
+	}
+
+	if (locked == 0) {
+		sih_unlock_shared(sih);
+		sih_lock(sih);
+		locked = 1;
+		/* Check again in case someone has done it for us */
+		check_next = 1;
+		goto again;
+	}
+
+	pi = nova_get_inode(sb, inode);
+	inode->i_ctime = inode->i_mtime = current_time(inode);
+	time = current_time(inode).tv_sec;
+	update.tail = sih->log_tail;
+
+	/* Return initialized blocks to the user */
+	allocated = nova_new_data_blocks(sb, sih, &blocknr, iblock,
+				 num_blocks, ALLOC_INIT_ZERO, ANY_CPU,
+				 ALLOC_FROM_HEAD);
+	if (allocated <= 0) {
+		nova_dbgv("%s alloc blocks failed %d\n", __func__,
+							allocated);
+		ret = allocated;
+		goto out;
+	}
+
+	num_blocks = allocated;
+	/* FIXME: how to handle file size? */
+	nova_init_file_write_item(sb, sih, &entry_item,
+					epoch_id, iblock, num_blocks,
+					blocknr, time, inode->i_size);
+
+	list_add_tail(&entry_item.list, &item_head);
+
+	nvmm = blocknr;
+
+	ret = nova_commit_writes_to_log(sb, pi, inode,
+					&item_head, num_blocks, 0);
+	if (ret < 0) {
+		nova_err(sb, "commit to log failed\n");
+		goto out;
+	}
+
+	NOVA_STATS_ADD(dax_new_blocks, 1);
+
+	*new = true;
+//	set_buffer_new(bh);
+out:
+	if (ret < 0) {
+		nova_cleanup_incomplete_write(sb, sih, &item_head, 0);
+		num_blocks = ret;
+		goto out1;
+	}
+
+	*bno = nvmm;
+//	if (num_blocks > 1)
+//		bh->b_size = sb->s_blocksize * num_blocks;
+
+out1:
+	if (locked)
+		sih_unlock(sih);
+	else
+		sih_unlock_shared(sih);
+
+	NOVA_END_TIMING(dax_get_block_t, get_block_time);
+	return num_blocks;
+}
+
+static int nova_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
+	unsigned int flags, struct iomap *iomap)
+{
+	struct nova_sb_info *sbi = NOVA_SB(inode->i_sb);
+	unsigned int blkbits = inode->i_blkbits;
+	unsigned long first_block = offset >> blkbits;
+	unsigned long max_blocks = (length + (1 << blkbits) - 1) >> blkbits;
+	bool new = false, boundary = false;
+	u32 bno;
+	int ret;
+
+	ret = nova_dax_get_blocks(inode, first_block, max_blocks, &bno, &new,
+				  &boundary, flags & IOMAP_WRITE);
+	if (ret < 0) {
+		nova_dbgv("%s: nova_dax_get_blocks failed %d", __func__, ret);
+		return ret;
+	}
+
+	iomap->flags = 0;
+	iomap->bdev = inode->i_sb->s_bdev;
+	iomap->dax_dev = sbi->s_dax_dev;
+	iomap->offset = (u64)first_block << blkbits;
+
+	if (ret == 0) {
+		iomap->type = IOMAP_HOLE;
+		iomap->addr = IOMAP_NULL_ADDR;
+		iomap->length = 1 << blkbits;
+	} else {
+		iomap->type = IOMAP_MAPPED;
+		iomap->addr = (u64)bno << blkbits;
+		iomap->length = (u64)ret << blkbits;
+		iomap->flags |= IOMAP_F_MERGED;
+	}
+
+	if (new)
+		iomap->flags |= IOMAP_F_NEW;
+	return 0;
+}
+
+static int nova_iomap_end(struct inode *inode, loff_t offset, loff_t length,
+	ssize_t written, unsigned int flags, struct iomap *iomap)
+{
+	if (iomap->type == IOMAP_MAPPED &&
+			written < length &&
+			(flags & IOMAP_WRITE))
+		truncate_pagecache(inode, inode->i_size);
+	return 0;
+}
+
+const struct iomap_ops nova_iomap_ops = {
+	.iomap_begin	= nova_iomap_begin,
+	.iomap_end	= nova_iomap_end,
+};
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index ab9e8f3..0d62c47 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -487,6 +487,9 @@ ssize_t nova_inplace_file_write(struct file *filp, const char __user *buf,
 ssize_t do_nova_inplace_file_write(struct file *filp, const char __user *buf,
 	size_t len, loff_t *ppos);
 
+extern const struct iomap_ops nova_iomap_ops;
+
+
 /* dir.c */
 extern const struct file_operations nova_dir_operations;
 int nova_insert_dir_radix_tree(struct super_block *sb,
-- 
2.7.4

* [RFC v2 74/83] File operation: Mmap.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (72 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 73/83] Dax: Add iomap operations Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 75/83] File operation: read/write iter Andiry Xu
                   ` (9 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA uses the iomap framework to support mmap.
Huge page mmap is not supported yet.

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/dax.c  | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/file.c | 25 +++++++++++++++++++++++++
 fs/nova/nova.h |  1 +
 3 files changed, 79 insertions(+)

diff --git a/fs/nova/dax.c b/fs/nova/dax.c
index e639b23..fa424b1 100644
--- a/fs/nova/dax.c
+++ b/fs/nova/dax.c
@@ -915,3 +915,56 @@ const struct iomap_ops nova_iomap_ops = {
 	.iomap_begin	= nova_iomap_begin,
 	.iomap_end	= nova_iomap_end,
 };
+
+
+/* TODO: Huge page mmap */
+static int nova_dax_huge_fault(struct vm_fault *vmf,
+	enum page_entry_size pe_size)
+{
+	int ret = 0;
+	timing_t fault_time;
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	struct inode *inode = mapping->host;
+
+	NOVA_START_TIMING(pmd_fault_t, fault_time);
+
+	nova_dbgv("%s: inode %lu, pgoff %lu\n",
+		  __func__, inode->i_ino, vmf->pgoff);
+
+	if (vmf->flags & FAULT_FLAG_WRITE)
+		file_update_time(vmf->vma->vm_file);
+
+	ret = dax_iomap_fault(vmf, pe_size, NULL, NULL, &nova_iomap_ops);
+
+	NOVA_END_TIMING(pmd_fault_t, fault_time);
+	return ret;
+}
+
+static int nova_dax_fault(struct vm_fault *vmf)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	struct inode *inode = mapping->host;
+
+	nova_dbgv("%s: inode %lu, pgoff %lu, flags 0x%x\n",
+		  __func__, inode->i_ino, vmf->pgoff, vmf->flags);
+
+	return nova_dax_huge_fault(vmf, PE_SIZE_PTE);
+}
+
+static int nova_dax_pfn_mkwrite(struct vm_fault *vmf)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	struct inode *inode = mapping->host;
+
+	nova_dbgv("%s: inode %lu, pgoff %lu, flags 0x%x\n",
+		  __func__, inode->i_ino, vmf->pgoff, vmf->flags);
+
+	return nova_dax_huge_fault(vmf, PE_SIZE_PTE);
+}
+
+const struct vm_operations_struct nova_dax_vm_ops = {
+	.fault	= nova_dax_fault,
+	.huge_fault = nova_dax_huge_fault,
+	.page_mkwrite = nova_dax_fault,
+	.pfn_mkwrite = nova_dax_pfn_mkwrite,
+};
diff --git a/fs/nova/file.c b/fs/nova/file.c
index a6b5bd3..0ae0333 100644
--- a/fs/nova/file.c
+++ b/fs/nova/file.c
@@ -617,10 +617,35 @@ static ssize_t nova_dax_file_write(struct file *filp, const char __user *buf,
 }
 
 
+static int nova_dax_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct inode *inode = file->f_mapping->host;
+
+	file_accessed(file);
+
+	vma->vm_flags |= VM_MIXEDMAP;
+
+	vma->vm_ops = &nova_dax_vm_ops;
+
+	nova_dbg_mmap4k("[%s:%d] inode %lu, MMAP 4KPAGE vm_start(0x%lx), "
+			"vm_end(0x%lx), vm pgoff %lu, %lu blocks, "
+			"vm_flags(0x%lx), vm_page_prot(0x%lx)\n",
+			__func__, __LINE__,
+			inode->i_ino, vma->vm_start, vma->vm_end,
+			vma->vm_pgoff,
+			(vma->vm_end - vma->vm_start) >> PAGE_SHIFT,
+			vma->vm_flags,
+			pgprot_val(vma->vm_page_prot));
+
+	return 0;
+}
+
+
 const struct file_operations nova_dax_file_operations = {
 	.llseek		= nova_llseek,
 	.read		= nova_dax_file_read,
 	.write		= nova_dax_file_write,
+	.mmap		= nova_dax_file_mmap,
 	.open		= nova_open,
 	.fsync		= nova_fsync,
 	.flush		= nova_flush,
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index 0d62c47..d209cfc 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -488,6 +488,7 @@ ssize_t do_nova_inplace_file_write(struct file *filp, const char __user *buf,
 	size_t len, loff_t *ppos);
 
 extern const struct iomap_ops nova_iomap_ops;
+extern const struct vm_operations_struct nova_dax_vm_ops;
 
 
 /* dir.c */
-- 
2.7.4

* [RFC v2 75/83] File operation: read/write iter.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (73 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 74/83] File operation: Mmap Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 76/83] Ioctl support Andiry Xu
                   ` (8 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

read_iter and write_iter use the iomap framework to do read/write.
Due to software overheads they are slower than the dax read/write paths.

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/file.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/fs/nova/file.c b/fs/nova/file.c
index 0ae0333..7e90415 100644
--- a/fs/nova/file.c
+++ b/fs/nova/file.c
@@ -260,6 +260,69 @@ static long nova_fallocate(struct file *file, int mode, loff_t offset,
 	return ret;
 }
 
+static ssize_t nova_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct inode *inode = iocb->ki_filp->f_mapping->host;
+	ssize_t ret;
+	timing_t read_iter_time;
+
+	if (!iov_iter_count(to))
+		return 0;
+
+	NOVA_START_TIMING(read_iter_t, read_iter_time);
+
+	inode_lock_shared(inode);
+	ret = dax_iomap_rw(iocb, to, &nova_iomap_ops);
+	inode_unlock_shared(inode);
+
+	file_accessed(iocb->ki_filp);
+	NOVA_END_TIMING(read_iter_t, read_iter_time);
+	return ret;
+}
+
+static ssize_t nova_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file->f_mapping->host;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	loff_t offset;
+	size_t count;
+	ssize_t ret;
+	timing_t write_iter_time;
+
+	NOVA_START_TIMING(write_iter_t, write_iter_time);
+	inode_lock(inode);
+	ret = generic_write_checks(iocb, from);
+	if (ret <= 0)
+		goto out_unlock;
+
+	ret = file_remove_privs(file);
+	if (ret)
+		goto out_unlock;
+
+	ret = file_update_time(file);
+	if (ret)
+		goto out_unlock;
+
+	count = iov_iter_count(from);
+	offset = iocb->ki_pos;
+
+	ret = dax_iomap_rw(iocb, from, &nova_iomap_ops);
+	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
+		i_size_write(inode, iocb->ki_pos);
+		sih->i_size = iocb->ki_pos;
+		mark_inode_dirty(inode);
+	}
+
+out_unlock:
+	inode_unlock(inode);
+	if (ret > 0)
+		ret = generic_write_sync(iocb, ret);
+	NOVA_END_TIMING(write_iter_t, write_iter_time);
+	return ret;
+}
+
 static ssize_t
 do_dax_mapping_read(struct file *filp, char __user *buf,
 	size_t len, loff_t *ppos)
@@ -645,6 +708,8 @@ const struct file_operations nova_dax_file_operations = {
 	.llseek		= nova_llseek,
 	.read		= nova_dax_file_read,
 	.write		= nova_dax_file_write,
+	.read_iter	= nova_dax_read_iter,
+	.write_iter	= nova_dax_write_iter,
 	.mmap		= nova_dax_file_mmap,
 	.open		= nova_open,
 	.fsync		= nova_fsync,
-- 
2.7.4

* [RFC v2 76/83] Ioctl support.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (74 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 75/83] File operation: read/write iter Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 77/83] GC: Fast garbage collection Andiry Xu
                   ` (7 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA appends a link change entry to the inode log to implement
SETFLAGS and SETVERSION.
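
As an aside on the SETFLAGS path, the sketch below shows how the new flag
word is derived before the log entry is appended: only the user-modifiable
bits come from the request, all other bits are carried over from the old
value. merge_flags() is a hypothetical helper for illustration, and the
mask is passed in as a parameter rather than taken from kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Combine the requested flags with the old flags: bits inside
 * modifiable_mask are taken from the request, bits outside it are
 * preserved from oldflags.
 */
uint32_t merge_flags(uint32_t oldflags, uint32_t requested,
		     uint32_t modifiable_mask)
{
	uint32_t merged;

	merged = (requested & modifiable_mask) |
		 (oldflags & ~modifiable_mask);

	/* Postcondition: bits outside the mask are unchanged. */
	assert(((merged ^ oldflags) & ~modifiable_mask) == 0);
	return merged;
}
```

With a made-up mask of 0x0f, old flags 0xa5 and a request of 0xff give
0xaf: the low nibble follows the request, the high nibble is kept.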

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/Makefile |   4 +-
 fs/nova/dir.c    |   4 ++
 fs/nova/file.c   |   4 ++
 fs/nova/inode.h  |   2 +
 fs/nova/ioctl.c  | 184 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/nova.h   |   7 +++
 6 files changed, 203 insertions(+), 2 deletions(-)
 create mode 100644 fs/nova/ioctl.c

diff --git a/fs/nova/Makefile b/fs/nova/Makefile
index 7bf6403..87e56c6 100644
--- a/fs/nova/Makefile
+++ b/fs/nova/Makefile
@@ -4,5 +4,5 @@
 
 obj-$(CONFIG_NOVA_FS) += nova.o
 
-nova-y := balloc.o bbuild.o dax.o dir.o file.o inode.o journal.o log.o namei.o\
-	  rebuild.o stats.o super.o symlink.o
+nova-y := balloc.o bbuild.o dax.o dir.o file.o inode.o ioctl.o journal.o\
+	  log.o namei.o rebuild.o stats.o super.o symlink.o
diff --git a/fs/nova/dir.c b/fs/nova/dir.c
index 47ee9ad..3694d9d 100644
--- a/fs/nova/dir.c
+++ b/fs/nova/dir.c
@@ -513,4 +513,8 @@ const struct file_operations nova_dir_operations = {
 	.read		= generic_read_dir,
 	.iterate	= nova_readdir,
 	.fsync		= noop_fsync,
+	.unlocked_ioctl = nova_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= nova_compat_ioctl,
+#endif
 };
diff --git a/fs/nova/file.c b/fs/nova/file.c
index 7e90415..2b70b9d 100644
--- a/fs/nova/file.c
+++ b/fs/nova/file.c
@@ -714,7 +714,11 @@ const struct file_operations nova_dax_file_operations = {
 	.open		= nova_open,
 	.fsync		= nova_fsync,
 	.flush		= nova_flush,
+	.unlocked_ioctl	= nova_ioctl,
 	.fallocate	= nova_fallocate,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= nova_compat_ioctl,
+#endif
 };
 
 const struct inode_operations nova_file_inode_operations = {
diff --git a/fs/nova/inode.h b/fs/nova/inode.h
index 693aa90..086a7cb 100644
--- a/fs/nova/inode.h
+++ b/fs/nova/inode.h
@@ -264,6 +264,8 @@ int nova_delete_file_tree(struct super_block *sb,
 	struct nova_inode_info_header *sih, unsigned long start_blocknr,
 	unsigned long last_blocknr, bool delete_nvmm, bool delete_dead,
 	u64 epoch_id);
+extern void nova_set_inode_flags(struct inode *inode, struct nova_inode *pi,
+	unsigned int flags);
 unsigned long nova_find_region(struct inode *inode, loff_t *offset, int hole);
 extern void nova_evict_inode(struct inode *inode);
 extern int nova_write_inode(struct inode *inode, struct writeback_control *wbc);
diff --git a/fs/nova/ioctl.c b/fs/nova/ioctl.c
new file mode 100644
index 0000000..2509371
--- /dev/null
+++ b/fs/nova/ioctl.c
@@ -0,0 +1,184 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Ioctl operations.
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2010-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/capability.h>
+#include <linux/time.h>
+#include <linux/sched.h>
+#include <linux/compat.h>
+#include <linux/mount.h>
+#include "nova.h"
+#include "inode.h"
+
+long nova_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+{
+	struct address_space *mapping = filp->f_mapping;
+	struct inode    *inode = mapping->host;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct nova_inode *pi;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode_update update;
+	unsigned int flags;
+	int ret;
+
+	pi = nova_get_inode(sb, inode);
+	if (!pi)
+		return -EACCES;
+
+	switch (cmd) {
+	case FS_IOC_GETFLAGS:
+		flags = (sih->i_flags) & NOVA_FL_USER_VISIBLE;
+		return put_user(flags, (int __user *)arg);
+	case FS_IOC_SETFLAGS: {
+		unsigned int oldflags;
+		u64 old_linkc = 0;
+		u64 epoch_id;
+
+		ret = mnt_want_write_file(filp);
+		if (ret)
+			return ret;
+
+		if (!inode_owner_or_capable(inode)) {
+			ret = -EPERM;
+			goto flags_out;
+		}
+
+		if (get_user(flags, (int __user *)arg)) {
+			ret = -EFAULT;
+			goto flags_out;
+		}
+
+		inode_lock(inode);
+		sih_lock(sih);
+		oldflags = le32_to_cpu(pi->i_flags);
+
+		if ((flags ^ oldflags) &
+		    (FS_APPEND_FL | FS_IMMUTABLE_FL)) {
+			if (!capable(CAP_LINUX_IMMUTABLE)) {
+				/* flags_out_unlock drops both locks */
+				ret = -EPERM;
+				goto flags_out_unlock;
+			}
+		}
+
+		if (!S_ISDIR(inode->i_mode))
+			flags &= ~FS_DIRSYNC_FL;
+
+		epoch_id = nova_get_epoch_id(sb);
+		flags = flags & FS_FL_USER_MODIFIABLE;
+		flags |= oldflags & ~FS_FL_USER_MODIFIABLE;
+		inode->i_ctime = current_time(inode);
+		nova_set_inode_flags(inode, pi, flags);
+		sih->i_flags = flags;
+
+		update.tail = 0;
+		ret = nova_append_link_change_entry(sb, pi, inode,
+					&update, &old_linkc, epoch_id);
+		if (!ret) {
+			nova_update_inode(sb, inode, pi, &update);
+			nova_invalidate_link_change_entry(sb, old_linkc);
+		}
+		sih->trans_id++;
+flags_out_unlock:
+		sih_unlock(sih);
+		inode_unlock(inode);
+flags_out:
+		mnt_drop_write_file(filp);
+		return ret;
+	}
+	case FS_IOC_GETVERSION:
+		return put_user(inode->i_generation, (int __user *)arg);
+	case FS_IOC_SETVERSION: {
+		u64 old_linkc = 0;
+		u64 epoch_id;
+		__u32 generation;
+
+		if (!inode_owner_or_capable(inode))
+			return -EPERM;
+		ret = mnt_want_write_file(filp);
+		if (ret)
+			return ret;
+		if (get_user(generation, (int __user *)arg)) {
+			ret = -EFAULT;
+			goto setversion_out;
+		}
+
+		epoch_id = nova_get_epoch_id(sb);
+		inode_lock(inode);
+		sih_lock(sih);
+		inode->i_ctime = current_time(inode);
+		inode->i_generation = generation;
+
+		update.tail = 0;
+		ret = nova_append_link_change_entry(sb, pi, inode,
+					&update, &old_linkc, epoch_id);
+		if (!ret) {
+			nova_update_inode(sb, inode, pi, &update);
+			nova_invalidate_link_change_entry(sb, old_linkc);
+		}
+		sih->trans_id++;
+		sih_unlock(sih);
+		inode_unlock(inode);
+setversion_out:
+		mnt_drop_write_file(filp);
+		return ret;
+	}
+	case NOVA_PRINT_TIMING: {
+		nova_print_timing_stats(sb);
+		return 0;
+	}
+	case NOVA_CLEAR_STATS: {
+		nova_clear_stats(sb);
+		return 0;
+	}
+	case NOVA_PRINT_LOG: {
+		nova_print_inode_log(sb, inode);
+		return 0;
+	}
+	case NOVA_PRINT_LOG_PAGES: {
+		nova_print_inode_log_pages(sb, inode);
+		return 0;
+	}
+	case NOVA_PRINT_FREE_LISTS: {
+		nova_print_free_lists(sb);
+		return 0;
+	}
+	default:
+		return -ENOTTY;
+	}
+}
+
+#ifdef CONFIG_COMPAT
+long nova_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	switch (cmd) {
+	case FS_IOC32_GETFLAGS:
+		cmd = FS_IOC_GETFLAGS;
+		break;
+	case FS_IOC32_SETFLAGS:
+		cmd = FS_IOC_SETFLAGS;
+		break;
+	case FS_IOC32_GETVERSION:
+		cmd = FS_IOC_GETVERSION;
+		break;
+	case FS_IOC32_SETVERSION:
+		cmd = FS_IOC_SETVERSION;
+		break;
+	default:
+		return -ENOIOCTLCMD;
+	}
+	return nova_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
+}
+#endif
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index d209cfc..ab9153e 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -515,6 +515,13 @@ int nova_remove_dentry(struct dentry *dentry, int dec_link,
 extern const struct file_operations nova_dax_file_operations;
 extern const struct inode_operations nova_file_inode_operations;
 
+/* ioctl.c */
+extern long nova_ioctl(struct file *filp, unsigned int cmd, unsigned long arg);
+#ifdef CONFIG_COMPAT
+extern long nova_compat_ioctl(struct file *file, unsigned int cmd,
+	unsigned long arg);
+#endif
+
 /* namei.c */
 extern const struct inode_operations nova_dir_inode_operations;
 extern const struct inode_operations nova_special_inode_operations;
-- 
2.7.4

* [RFC v2 77/83] GC: Fast garbage collection.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (75 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 76/83] Ioctl support Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:18 ` [RFC v2 78/83] GC: Thorough " Andiry Xu
                   ` (6 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA cleans and compacts an inode log when the log is full.
The log is a linked list of 4KB pmem pages. Fast garbage collection
unlinks dead log pages (pages whose entries are all invalid) from the
linked list and frees them. The tail page is never reclaimed.

Example:
I = Invalid, V = Valid

VIIV -> IIII -> VVII

	||
	||  fast gc
	\/

VIIV -> VVII
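
The unlinking step can be sketched in user space as follows. This is
only an illustration of the idea, not the code in this patch: struct
log_page, page_is_dead() and fast_gc() are hypothetical names, the list
uses ordinary pointers instead of pmem offsets, and persistence (cache
flushes, updating the head stored in the inode) is left out.

```c
#include <stddef.h>

/* Hypothetical user-space model of a log page; not NOVA's pmem layout. */
struct log_page {
	unsigned int num_entries;	/* entries written to this page */
	unsigned int invalid_entries;	/* entries invalidated since then */
	struct log_page *next;		/* next page in the log, NULL at end */
};

/* A page is dead when every entry on it has been invalidated. */
static int page_is_dead(const struct log_page *p)
{
	return p->invalid_entries == p->num_entries;
}

/*
 * Unlink dead pages from the list starting at *head. The tail page is
 * still being appended to, so it is never reclaimed. Returns the number
 * of pages unlinked; a real implementation would hand those pages back
 * to the allocator.
 */
static int fast_gc(struct log_page **head, const struct log_page *tail)
{
	struct log_page **link = head;
	int freed = 0;

	while (*link && *link != tail) {
		if (page_is_dead(*link)) {
			*link = (*link)->next;	/* splice the dead page out */
			freed++;
		} else {
			link = &(*link)->next;
		}
	}
	return freed;
}
```

With the three pages from the example above (VIIV, IIII, VVII, the last
one being the tail), fast_gc() unlinks the middle page and leaves the
first page pointing at the tail. If the head page itself is dead, *head
moves forward, which corresponds to the log head update in the patch.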

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/Makefile |   2 +-
 fs/nova/gc.c     | 186 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/log.c    |   3 +
 fs/nova/nova.h   |   7 +++
 4 files changed, 197 insertions(+), 1 deletion(-)
 create mode 100644 fs/nova/gc.c

diff --git a/fs/nova/Makefile b/fs/nova/Makefile
index 87e56c6..7a5fb6d 100644
--- a/fs/nova/Makefile
+++ b/fs/nova/Makefile
@@ -4,5 +4,5 @@
 
 obj-$(CONFIG_NOVA_FS) += nova.o
 
-nova-y := balloc.o bbuild.o dax.o dir.o file.o inode.o ioctl.o journal.o\
+nova-y := balloc.o bbuild.o dax.o dir.o file.o gc.o inode.o ioctl.o journal.o\
 	  log.o namei.o rebuild.o stats.o super.o symlink.o
diff --git a/fs/nova/gc.c b/fs/nova/gc.c
new file mode 100644
index 0000000..1634c04
--- /dev/null
+++ b/fs/nova/gc.c
@@ -0,0 +1,186 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Garbage collection methods
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include "nova.h"
+#include "inode.h"
+
+
+static bool curr_page_invalid(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	u64 page_head)
+{
+	struct nova_inode_log_page *curr_page;
+	struct nova_inode_page_tail page_tail;
+	unsigned int num_entries;
+	unsigned int invalid_entries;
+	bool ret;
+	timing_t check_time;
+	int rc;
+
+	NOVA_START_TIMING(check_invalid_t, check_time);
+
+	curr_page = (struct nova_inode_log_page *)
+					nova_get_block(sb, page_head);
+	rc = memcpy_mcsafe(&page_tail, &curr_page->page_tail,
+					sizeof(struct nova_inode_page_tail));
+	if (rc) {
+		nova_err(sb, "check page failed\n");
+		return false;
+	}
+
+	num_entries = le32_to_cpu(page_tail.num_entries);
+	invalid_entries = le32_to_cpu(page_tail.invalid_entries);
+
+	ret = (invalid_entries == num_entries);
+	if (!ret) {
+		sih->num_entries += num_entries;
+		sih->valid_entries += num_entries - invalid_entries;
+	}
+
+	NOVA_END_TIMING(check_invalid_t, check_time);
+	return ret;
+}
+
+static void free_curr_page(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_inode_log_page *curr_page,
+	struct nova_inode_log_page *last_page, u64 curr_head)
+{
+	u8 btype = sih->i_blk_type;
+
+	nova_set_next_page_address(sb, last_page,
+			curr_page->page_tail.next_page, 1);
+	nova_free_log_blocks(sb, sih,
+			nova_get_blocknr(sb, curr_head, btype), 1);
+}
+
+
+/*
+ * Scan pages in the log and remove those with no valid log entries.
+ */
+int nova_inode_log_fast_gc(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	u64 curr_tail, u64 new_block,
+	int num_pages, int force_thorough)
+{
+	u64 curr, next, possible_head = 0;
+	int found_head = 0;
+	struct nova_inode_log_page *last_page = NULL;
+	struct nova_inode_log_page *curr_page = NULL;
+	int first_need_free = 0;
+	int num_logs;
+	u8 btype = sih->i_blk_type;
+	unsigned long blocks;
+	unsigned long checked_pages = 0;
+	int freed_pages = 0;
+	timing_t gc_time;
+
+	NOVA_START_TIMING(fast_gc_t, gc_time);
+	curr = sih->log_head;
+	sih->valid_entries = 0;
+	sih->num_entries = 0;
+
+	num_logs = 1;
+
+	nova_dbgv("%s: log head 0x%llx, tail 0x%llx\n",
+				__func__, curr, curr_tail);
+	while (1) {
+		if (curr >> PAGE_SHIFT == sih->log_tail >> PAGE_SHIFT) {
+			/* Don't recycle tail page */
+			if (found_head == 0) {
+				possible_head = cpu_to_le64(curr);
+			}
+			break;
+		}
+
+		curr_page = (struct nova_inode_log_page *)
+					nova_get_block(sb, curr);
+		next = next_log_page(sb, curr);
+		if (next < 0)
+			break;
+
+		nova_dbg_verbose("curr 0x%llx, next 0x%llx\n", curr, next);
+		if (curr_page_invalid(sb, pi, sih, curr)) {
+			nova_dbg_verbose("curr page %p invalid\n", curr_page);
+			if (curr == sih->log_head) {
+				/* Free first page later */
+				first_need_free = 1;
+				last_page = curr_page;
+			} else {
+				nova_dbg_verbose("Free log block 0x%llx\n",
+						curr >> PAGE_SHIFT);
+				free_curr_page(sb, sih, curr_page, last_page,
+						curr);
+			}
+			NOVA_STATS_ADD(fast_gc_pages, 1);
+			freed_pages++;
+		} else {
+			if (found_head == 0) {
+				possible_head = cpu_to_le64(curr);
+				found_head = 1;
+			}
+			last_page = curr_page;
+		}
+
+		curr = next;
+		checked_pages++;
+		if (curr == 0)
+			break;
+	}
+
+	NOVA_STATS_ADD(fast_checked_pages, checked_pages);
+	nova_dbgv("checked pages %lu, freed %d\n", checked_pages, freed_pages);
+	checked_pages -= freed_pages;
+
+	// TODO:  I think this belongs in nova_extend_inode_log.
+	if (num_pages > 0) {
+		curr = BLOCK_OFF(curr_tail);
+		curr_page = (struct nova_inode_log_page *)
+						  nova_get_block(sb, curr);
+
+		nova_set_next_page_address(sb, curr_page, new_block, 1);
+	}
+
+	curr = sih->log_head;
+
+	pi->log_head = possible_head;
+	nova_persist_inode(pi);
+	sih->log_head = possible_head;
+	nova_dbgv("%s: %d new head 0x%llx\n", __func__,
+					found_head, possible_head);
+	sih->log_pages += (num_pages - freed_pages) * num_logs;
+	/* Don't update log tail pointer here */
+	nova_flush_buffer(&pi->log_head, CACHELINE_SIZE, 1);
+
+	if (first_need_free) {
+		nova_dbg_verbose("Free log head block 0x%llx\n",
+					curr >> PAGE_SHIFT);
+		nova_free_log_blocks(sb, sih,
+				nova_get_blocknr(sb, curr, btype), 1);
+	}
+
+	NOVA_END_TIMING(fast_gc_t, gc_time);
+
+	if (sih->num_entries == 0)
+		return 0;
+
+	blocks = (sih->valid_entries * checked_pages) / sih->num_entries;
+	if ((sih->valid_entries * checked_pages) % sih->num_entries)
+		blocks++;
+
+	return 0;
+}
diff --git a/fs/nova/log.c b/fs/nova/log.c
index 451be27..66bf98e 100644
--- a/fs/nova/log.c
+++ b/fs/nova/log.c
@@ -964,6 +964,9 @@ static u64 nova_extend_inode_log(struct super_block *sb, struct nova_inode *pi,
 	}
 
 	/* Perform GC */
+	nova_inode_log_fast_gc(sb, pi, sih, curr_p,
+			       new_block, allocated, 0);
+
 	return new_block;
 }
 
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index ab9153e..32b7b2f 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -515,6 +515,13 @@ int nova_remove_dentry(struct dentry *dentry, int dec_link,
 extern const struct file_operations nova_dax_file_operations;
 extern const struct inode_operations nova_file_inode_operations;
 
+
+/* gc.c */
+int nova_inode_log_fast_gc(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	u64 curr_tail, u64 new_block, int num_pages,
+	int force_thorough);
+
 /* ioctl.c */
 extern long nova_ioctl(struct file *filp, unsigned int cmd, unsigned long arg);
 #ifdef CONFIG_COMPAT
-- 
2.7.4

* [RFC v2 78/83] GC: Thorough garbage collection.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (76 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 77/83] GC: Fast garbage collection Andiry Xu
@ 2018-03-10 18:18 ` Andiry Xu
  2018-03-10 18:19 ` [RFC v2 79/83] Normal recovery Andiry Xu
                   ` (5 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:18 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

After fast gc, if the valid log entries still account for less than
half of the remaining log pages, NOVA starts thorough garbage collection:
it allocates a new log, copies the live log entries to it, and switches
to the new log atomically. The radix tree entries are updated to point
to the new log.

Example:
I = Invalid, V = Valid

VIIV -> IIII -> VVII

     ||
     ||  fast gc
     \/

VIIV -> VVII

     ||
     ||  thorough gc
     \/

    VVVV
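
The trigger condition can be sketched as below. The arithmetic follows
the estimate at the end of nova_inode_log_fast_gc() in this patch, but
the function names estimate_live_pages() and needs_thorough_gc() are
hypothetical, and the force_thorough override is left out.

```c
#include <stdbool.h>

/*
 * Estimate how many pages the live entries would occupy if packed
 * together: scale the remaining page count by the fraction of valid
 * entries, rounding up.
 */
static unsigned long estimate_live_pages(unsigned long valid_entries,
					 unsigned long num_entries,
					 unsigned long remaining_pages)
{
	if (num_entries == 0)
		return 0;
	return (valid_entries * remaining_pages + num_entries - 1) /
		num_entries;
}

/* Compact only when the live data fits in less than half the pages. */
static bool needs_thorough_gc(unsigned long valid_entries,
			      unsigned long num_entries,
			      unsigned long remaining_pages)
{
	unsigned long live = estimate_live_pages(valid_entries, num_entries,
						 remaining_pages);

	return live && live * 2 < remaining_pages;
}
```

For instance, 3 valid entries out of 16 spread over 4 pages give an
estimate of 1 page, so the log would be compacted. 4 valid entries out
of 8 over 2 pages also give 1 page, but 1 * 2 is not less than 2, so
the log would be left alone.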

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/gc.c | 273 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 273 insertions(+)

diff --git a/fs/nova/gc.c b/fs/nova/gc.c
index 1634c04..d74286e 100644
--- a/fs/nova/gc.c
+++ b/fs/nova/gc.c
@@ -18,6 +18,62 @@
 #include "nova.h"
 #include "inode.h"
 
+static bool curr_log_entry_invalid(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	u64 curr_p, size_t *length)
+{
+	struct nova_file_write_entry *entry;
+	struct nova_dentry *dentry;
+	struct nova_setattr_logentry *setattr_entry;
+	struct nova_link_change_entry *linkc_entry;
+	void *entryc;
+	u8 type;
+	bool ret = true;
+
+	entryc = (void *)nova_get_block(sb, curr_p);
+	type = nova_get_entry_type(entryc);
+
+	switch (type) {
+	case SET_ATTR:
+		setattr_entry = (struct nova_setattr_logentry *) entryc;
+		if (setattr_entry->invalid == 0)
+			ret = false;
+		*length = sizeof(struct nova_setattr_logentry);
+		break;
+	case LINK_CHANGE:
+		linkc_entry = (struct nova_link_change_entry *) entryc;
+		if (linkc_entry->invalid == 0)
+			ret = false;
+		*length = sizeof(struct nova_link_change_entry);
+		break;
+	case FILE_WRITE:
+		entry = (struct nova_file_write_entry *) entryc;
+		if (entry->num_pages != entry->invalid_pages)
+			ret = false;
+		*length = sizeof(struct nova_file_write_entry);
+		break;
+	case DIR_LOG:
+		dentry = (struct nova_dentry *) entryc;
+		if (dentry->invalid == 0)
+			ret = false;
+		if (sih->last_dentry == curr_p)
+			ret = false;
+		*length = le16_to_cpu(dentry->de_len);
+		break;
+	case NEXT_PAGE:
+		/* No more entries in this page */
+		*length = PAGE_SIZE - ENTRY_LOC(curr_p);
+		break;
+	default:
+		nova_dbg("%s: unknown type %d, 0x%llx\n",
+					__func__, type, curr_p);
+		NOVA_ASSERT(0);
+		*length = PAGE_SIZE - ENTRY_LOC(curr_p);
+		break;
+	}
+
+	return ret;
+}
 
 static bool curr_page_invalid(struct super_block *sb,
 	struct nova_inode *pi, struct nova_inode_info_header *sih,
@@ -68,6 +124,210 @@ static void free_curr_page(struct super_block *sb,
 			nova_get_blocknr(sb, curr_head, btype), 1);
 }
 
+static int nova_gc_assign_file_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *old_entry,
+	struct nova_file_write_entry *new_entry)
+{
+	struct nova_file_write_entry *temp;
+	void **pentry;
+	unsigned long start_pgoff = old_entry->pgoff;
+	unsigned int num = old_entry->num_pages;
+	unsigned long curr_pgoff;
+	int i;
+	int ret = 0;
+
+	for (i = 0; i < num; i++) {
+		curr_pgoff = start_pgoff + i;
+
+		pentry = radix_tree_lookup_slot(&sih->tree, curr_pgoff);
+		if (pentry) {
+			temp = radix_tree_deref_slot(pentry);
+			if (temp == old_entry)
+				radix_tree_replace_slot(&sih->tree, pentry,
+							new_entry);
+		}
+	}
+
+	return ret;
+}
+
+static int nova_gc_assign_dentry(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_dentry *old_dentry,
+	struct nova_dentry *new_dentry)
+{
+	struct nova_dentry *temp;
+	void **pentry;
+	unsigned long hash;
+	int ret = 0;
+
+	hash = BKDRHash(old_dentry->name, old_dentry->name_len);
+	nova_dbgv("%s: assign %s hash %lu\n", __func__,
+			old_dentry->name, hash);
+
+	/* FIXME: hash collision ignored here */
+	pentry = radix_tree_lookup_slot(&sih->tree, hash);
+	if (pentry) {
+		temp = radix_tree_deref_slot(pentry);
+		if (temp == old_dentry)
+			radix_tree_replace_slot(&sih->tree, pentry, new_dentry);
+	}
+
+	return ret;
+}
+
+static int nova_gc_assign_new_entry(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	u64 curr_p, u64 new_curr)
+{
+	struct nova_file_write_entry *old_entry, *new_entry;
+	struct nova_dentry *old_dentry, *new_dentry;
+	void *addr, *new_addr;
+	u8 type;
+	int ret = 0;
+
+	addr = (void *)nova_get_block(sb, curr_p);
+	type = nova_get_entry_type(addr);
+	switch (type) {
+	case SET_ATTR:
+		sih->last_setattr = new_curr;
+		break;
+	case LINK_CHANGE:
+		sih->last_link_change = new_curr;
+		break;
+	case FILE_WRITE:
+		new_addr = (void *)nova_get_block(sb, new_curr);
+		old_entry = (struct nova_file_write_entry *)addr;
+		new_entry = (struct nova_file_write_entry *)new_addr;
+		ret = nova_gc_assign_file_entry(sb, sih, old_entry, new_entry);
+		break;
+	case DIR_LOG:
+		new_addr = (void *)nova_get_block(sb, new_curr);
+		old_dentry = (struct nova_dentry *)addr;
+		new_dentry = (struct nova_dentry *)new_addr;
+		if (sih->last_dentry == curr_p)
+			sih->last_dentry = new_curr;
+		ret = nova_gc_assign_dentry(sb, sih, old_dentry, new_dentry);
+		break;
+	default:
+		nova_dbg("%s: unknown type %d, 0x%llx\n",
+					__func__, type, curr_p);
+		NOVA_ASSERT(0);
+		break;
+	}
+
+	return ret;
+}
+
+/* Copy live log entries to the new log and atomically replace the old log */
+static unsigned long nova_inode_log_thorough_gc(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	unsigned long blocks, unsigned long checked_pages)
+{
+	struct nova_inode_log_page *curr_page = NULL;
+	size_t length;
+	u64 curr_p, new_curr;
+	u64 old_curr_p;
+	u64 tail_block;
+	u64 old_head;
+	u64 new_head = 0;
+	u64 next;
+	int allocated;
+	int extended = 0;
+	int ret;
+	timing_t gc_time;
+
+	NOVA_START_TIMING(thorough_gc_t, gc_time);
+
+	curr_p = sih->log_head;
+	old_curr_p = curr_p;
+	old_head = sih->log_head;
+	nova_dbg_verbose("Log head 0x%llx, tail 0x%llx\n",
+				curr_p, sih->log_tail);
+	if (curr_p == 0 && sih->log_tail == 0)
+		goto out;
+
+	if (curr_p >> PAGE_SHIFT == sih->log_tail >> PAGE_SHIFT)
+		goto out;
+
+	allocated = nova_allocate_inode_log_pages(sb, sih, blocks,
+					&new_head, ANY_CPU, 0);
+	if (allocated != blocks) {
+		nova_err(sb, "%s: ERROR: no inode log page available\n",
+					__func__);
+		goto out;
+	}
+
+	new_curr = new_head;
+	while (curr_p != sih->log_tail) {
+		old_curr_p = curr_p;
+		if (goto_next_page(sb, curr_p))
+			curr_p = next_log_page(sb, curr_p);
+
+		if (curr_p >> PAGE_SHIFT == sih->log_tail >> PAGE_SHIFT) {
+			/* Don't recycle tail page */
+			break;
+		}
+
+		length = 0;
+		ret = curr_log_entry_invalid(sb, pi, sih, curr_p, &length);
+		if (!ret) {
+			extended = 0;
+			new_curr = nova_get_append_head(sb, pi, sih,
+						new_curr, length, MAIN_LOG,
+						1, &extended);
+			if (extended)
+				blocks++;
+			/* Copy entry to the new log */
+			memcpy_to_pmem_nocache(nova_get_block(sb, new_curr),
+				nova_get_block(sb, curr_p), length);
+			nova_inc_page_num_entries(sb, new_curr);
+			nova_gc_assign_new_entry(sb, pi, sih, curr_p, new_curr);
+			new_curr += length;
+		}
+
+		curr_p += length;
+	}
+
+	/* Step 1: Link new log to the tail block */
+	tail_block = BLOCK_OFF(sih->log_tail);
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb,
+							BLOCK_OFF(new_curr));
+	next = next_log_page(sb, new_curr);
+	if (next > 0)
+		nova_free_contiguous_log_blocks(sb, sih, next);
+
+	nova_set_next_page_flag(sb, new_curr);
+	nova_set_next_page_address(sb, curr_page, tail_block, 0);
+
+	/* Step 2: Atomically switch to the new log */
+	pi->log_head = new_head;
+	nova_persist_inode(pi);
+	nova_flush_buffer(pi, sizeof(struct nova_inode), 1);
+	sih->log_head = new_head;
+
+	/* Step 3: Unlink the old log */
+	curr_page = (struct nova_inode_log_page *)nova_get_block(sb,
+							BLOCK_OFF(old_curr_p));
+	next = next_log_page(sb, old_curr_p);
+	if (next != tail_block)
+		nova_err(sb, "Old log error: old curr_p 0x%llx, next 0x%llx, "
+			"curr_p 0x%llx, tail block 0x%llx\n", old_curr_p,
+			next, curr_p, tail_block);
+
+	nova_set_next_page_address(sb, curr_page, 0, 1);
+
+	/* Step 4: Free the old log */
+	nova_free_contiguous_log_blocks(sb, sih, old_head);
+
+	sih->log_pages = sih->log_pages + blocks - checked_pages;
+	NOVA_STATS_ADD(thorough_gc_pages, checked_pages - blocks);
+	NOVA_STATS_ADD(thorough_checked_pages, checked_pages);
+out:
+	NOVA_END_TIMING(thorough_gc_t, gc_time);
+	return blocks;
+}
+
 
 /*
  * Scan pages in the log and remove those with no valid log entries.
@@ -178,9 +438,22 @@ int nova_inode_log_fast_gc(struct super_block *sb,
 	if (sih->num_entries == 0)
 		return 0;
 
+	/* Estimate how many pages worth of valid entries the log contains.
+	 *
+	 * If it is less than half the number of pages that remain in the log,
+	 * compress them with thorough gc.
+	 */
 	blocks = (sih->valid_entries * checked_pages) / sih->num_entries;
 	if ((sih->valid_entries * checked_pages) % sih->num_entries)
 		blocks++;
 
+	if (force_thorough || (blocks && blocks * 2 < checked_pages)) {
+		nova_dbgv("Thorough GC for inode %lu: checked pages %lu, valid pages %lu\n",
+				sih->ino,
+				checked_pages, blocks);
+		blocks = nova_inode_log_thorough_gc(sb, pi, sih,
+							blocks, checked_pages);
+	}
+
 	return 0;
 }
-- 
2.7.4

* [RFC v2 79/83] Normal recovery.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (77 preceding siblings ...)
  2018-03-10 18:18 ` [RFC v2 78/83] GC: Thorough " Andiry Xu
@ 2018-03-10 18:19 ` Andiry Xu
  2018-03-10 18:19 ` [RFC v2 80/83] Failure recovery: bitmap operations Andiry Xu
                   ` (4 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:19 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Upon umount, NOVA stores the allocator information and the inuse
inode list in reserved inodes. During remount, NOVA reads this
information and rebuilds the allocator and inuse inode list DRAM
data structures.
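
For the inode list, each saved range carries the owning CPU in the top
8 bits of range_low, which is what the CPUID_MASK decode in
nova_init_inode_list_from_inode() relies on. A minimal user-space
sketch of that encoding follows. The helper names pack_range(),
entry_cpuid() and entry_low() are hypothetical, the packing side is
inferred from the decode rather than copied from the save path, and
the little-endian conversion of the on-media fields is ignored.

```c
#include <stdint.h>

#define CPUID_SHIFT	56
#define CPUID_MASK	0xff00000000000000ULL

/* Simplified stand-in for a saved range entry (host byte order). */
struct range_entry {
	uint64_t range_low;
	uint64_t range_high;
};

/* Fold the CPU id into the top byte of range_low (assumed save side). */
static struct range_entry pack_range(uint64_t low, uint64_t high,
				     unsigned int cpuid)
{
	struct range_entry e;

	e.range_low = (low & ~CPUID_MASK) | ((uint64_t)cpuid << CPUID_SHIFT);
	e.range_high = high;
	return e;
}

/* Recover the CPU id, as the recovery loop does. */
static unsigned int entry_cpuid(const struct range_entry *e)
{
	return (unsigned int)((e->range_low & CPUID_MASK) >> CPUID_SHIFT);
}

/* Recover the real lower bound of the range. */
static uint64_t entry_low(const struct range_entry *e)
{
	return e->range_low & ~CPUID_MASK;
}
```

A range [32, 63] owned by CPU 3 would be stored with range_low equal to
0x0300000000000020 and decoded back to cpuid 3 and low 32. The encoding
limits the sketch to 256 CPUs and to range values below 2^56.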

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/bbuild.c | 266 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/bbuild.h |   1 +
 fs/nova/super.c  |   3 +
 3 files changed, 270 insertions(+)

diff --git a/fs/nova/bbuild.c b/fs/nova/bbuild.c
index af1b352..ca51dca 100644
--- a/fs/nova/bbuild.c
+++ b/fs/nova/bbuild.c
@@ -52,6 +52,206 @@ void nova_init_header(struct super_block *sb,
 	init_rwsem(&sih->i_sem);
 }
 
+static inline int get_cpuid(struct nova_sb_info *sbi, unsigned long blocknr)
+{
+	return blocknr / sbi->per_list_blocks;
+}
+
+static void nova_destroy_range_node_tree(struct super_block *sb,
+	struct rb_root *tree)
+{
+	struct nova_range_node *curr;
+	struct rb_node *temp;
+
+	temp = rb_first(tree);
+	while (temp) {
+		curr = container_of(temp, struct nova_range_node, node);
+		temp = rb_next(temp);
+		rb_erase(&curr->node, tree);
+		nova_free_range_node(curr);
+	}
+}
+
+static void nova_destroy_blocknode_tree(struct super_block *sb, int cpu)
+{
+	struct free_list *free_list;
+
+	free_list = nova_get_free_list(sb, cpu);
+	nova_destroy_range_node_tree(sb, &free_list->block_free_tree);
+}
+
+static void nova_destroy_blocknode_trees(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++)
+		nova_destroy_blocknode_tree(sb, i);
+
+}
+
+static int nova_init_blockmap_from_inode(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode *pi = nova_get_inode_by_ino(sb, NOVA_BLOCKNODE_INO);
+	struct nova_inode_info_header sih;
+	struct free_list *free_list;
+	struct nova_range_node_lowhigh *entry;
+	struct nova_range_node *blknode;
+	size_t size = sizeof(struct nova_range_node_lowhigh);
+	u64 curr_p;
+	u64 cpuid;
+	int ret = 0;
+
+	/* FIXME: Backup inode for BLOCKNODE */
+	ret = nova_get_head_tail(sb, pi, &sih);
+	if (ret)
+		goto out;
+
+	curr_p = sih.log_head;
+	if (curr_p == 0) {
+		nova_dbg("%s: pi head is 0!\n", __func__);
+		return -EINVAL;
+	}
+
+	while (curr_p != sih.log_tail) {
+		if (is_last_entry(curr_p, size))
+			curr_p = next_log_page(sb, curr_p);
+
+		if (curr_p == 0) {
+			nova_dbg("%s: curr_p is NULL!\n", __func__);
+			NOVA_ASSERT(0);
+			ret = -EINVAL;
+			break;
+		}
+
+		entry = (struct nova_range_node_lowhigh *)nova_get_block(sb,
+							curr_p);
+		blknode = nova_alloc_blocknode(sb);
+		if (blknode == NULL)
+			NOVA_ASSERT(0);
+		blknode->range_low = le64_to_cpu(entry->range_low);
+		blknode->range_high = le64_to_cpu(entry->range_high);
+		cpuid = get_cpuid(sbi, blknode->range_low);
+
+		/* FIXME: Assume NR_CPUS not change */
+		free_list = nova_get_free_list(sb, cpuid);
+		ret = nova_insert_blocktree(sbi,
+				&free_list->block_free_tree, blknode);
+		if (ret) {
+			nova_err(sb, "%s failed\n", __func__);
+			nova_free_blocknode(sb, blknode);
+			NOVA_ASSERT(0);
+			nova_destroy_blocknode_trees(sb);
+			goto out;
+		}
+		free_list->num_blocknode++;
+		if (free_list->num_blocknode == 1)
+			free_list->first_node = blknode;
+		free_list->last_node = blknode;
+		free_list->num_free_blocks +=
+			blknode->range_high - blknode->range_low + 1;
+		curr_p += sizeof(struct nova_range_node_lowhigh);
+	}
+out:
+	nova_free_inode_log(sb, pi, &sih);
+	return ret;
+}
+
+static void nova_destroy_inode_trees(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct inode_map *inode_map;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		inode_map = &sbi->inode_maps[i];
+		nova_destroy_range_node_tree(sb,
+					&inode_map->inode_inuse_tree);
+	}
+}
+
+#define CPUID_MASK 0xff00000000000000
+
+static int nova_init_inode_list_from_inode(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode *pi = nova_get_inode_by_ino(sb, NOVA_INODELIST_INO);
+	struct nova_inode_info_header sih;
+	struct nova_range_node_lowhigh *entry;
+	struct nova_range_node *range_node;
+	struct inode_map *inode_map;
+	size_t size = sizeof(struct nova_range_node_lowhigh);
+	unsigned long num_inode_node = 0;
+	u64 curr_p;
+	unsigned long cpuid;
+	int ret;
+
+	/* FIXME: Backup inode for INODELIST */
+	ret = nova_get_head_tail(sb, pi, &sih);
+	if (ret)
+		goto out;
+
+	sbi->s_inodes_used_count = 0;
+	curr_p = sih.log_head;
+	if (curr_p == 0) {
+		nova_dbg("%s: pi head is 0!\n", __func__);
+		return -EINVAL;
+	}
+
+	while (curr_p != sih.log_tail) {
+		if (is_last_entry(curr_p, size))
+			curr_p = next_log_page(sb, curr_p);
+
+		if (curr_p == 0) {
+			nova_dbg("%s: curr_p is NULL!\n", __func__);
+			NOVA_ASSERT(0);
+		}
+
+		entry = (struct nova_range_node_lowhigh *)nova_get_block(sb,
+							curr_p);
+		range_node = nova_alloc_inode_node(sb);
+		if (range_node == NULL)
+			NOVA_ASSERT(0);
+
+		cpuid = (entry->range_low & CPUID_MASK) >> 56;
+		if (cpuid >= sbi->cpus) {
+			nova_err(sb, "Invalid cpuid %lu\n", cpuid);
+			nova_free_inode_node(sb, range_node);
+			NOVA_ASSERT(0);
+			nova_destroy_inode_trees(sb);
+			goto out;
+		}
+
+		range_node->range_low = entry->range_low & ~CPUID_MASK;
+		range_node->range_high = entry->range_high;
+		ret = nova_insert_inodetree(sbi, range_node, cpuid);
+		if (ret) {
+			nova_err(sb, "%s failed, %lu\n", __func__, cpuid);
+			nova_free_inode_node(sb, range_node);
+			NOVA_ASSERT(0);
+			nova_destroy_inode_trees(sb);
+			goto out;
+		}
+
+		sbi->s_inodes_used_count +=
+			range_node->range_high - range_node->range_low + 1;
+		num_inode_node++;
+
+		inode_map = &sbi->inode_maps[cpuid];
+		inode_map->num_range_node_inode++;
+		if (!inode_map->first_inode_range)
+			inode_map->first_inode_range = range_node;
+
+		curr_p += sizeof(struct nova_range_node_lowhigh);
+	}
+
+	nova_dbg("%s: %lu inode nodes\n", __func__, num_inode_node);
+out:
+	nova_free_inode_log(sb, pi, &sih);
+	return ret;
+}
+
 static u64 nova_append_range_node_entry(struct super_block *sb,
 	struct nova_range_node *curr, u64 tail, unsigned long cpuid)
 {
@@ -214,3 +414,69 @@ void nova_save_blocknode_mappings_to_log(struct super_block *sb)
 		  pi->log_head, pi->log_tail);
 }
 
+/*********************** Recovery entrance *************************/
+
+/* Return TRUE if we can do a normal unmount recovery */
+static bool nova_try_normal_recovery(struct super_block *sb)
+{
+	struct nova_inode *pi =  nova_get_inode_by_ino(sb, NOVA_BLOCKNODE_INO);
+	int ret;
+
+	if (pi->log_head == 0 || pi->log_tail == 0)
+		return false;
+
+	ret = nova_init_blockmap_from_inode(sb);
+	if (ret) {
+		nova_err(sb, "init blockmap failed, fall back to failure recovery\n");
+		return false;
+	}
+
+	ret = nova_init_inode_list_from_inode(sb);
+	if (ret) {
+		nova_err(sb, "init inode list failed, fall back to failure recovery\n");
+		nova_destroy_blocknode_trees(sb);
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Recovery routine has two tasks:
+ * 1. Restore inuse inode list;
+ * 2. Restore the NVMM allocator.
+ */
+int nova_recovery(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_super_block *super = sbi->nova_sb;
+	unsigned long initsize = le64_to_cpu(super->s_size);
+	bool value = false;
+	int ret = 0;
+	timing_t start, end;
+
+	nova_dbgv("%s\n", __func__);
+
+	/* Always check recovery time */
+	if (measure_timing == 0)
+		getrawmonotonic(&start);
+
+	NOVA_START_TIMING(recovery_t, start);
+	sbi->num_blocks = ((unsigned long)(initsize) >> PAGE_SHIFT);
+
+	/* initialize free list info */
+	nova_init_blockmap(sb, 1);
+
+	value = nova_try_normal_recovery(sb);
+
+	NOVA_END_TIMING(recovery_t, start);
+	if (measure_timing == 0) {
+		getrawmonotonic(&end);
+		Timingstats[recovery_t] +=
+			(end.tv_sec - start.tv_sec) * 1000000000 +
+			(end.tv_nsec - start.tv_nsec);
+	}
+
+	sbi->s_epoch_id = le64_to_cpu(super->s_epoch_id);
+	return ret;
+}
diff --git a/fs/nova/bbuild.h b/fs/nova/bbuild.h
index 5d2b5f0..2c3deb0 100644
--- a/fs/nova/bbuild.h
+++ b/fs/nova/bbuild.h
@@ -5,5 +5,6 @@ void nova_init_header(struct super_block *sb,
 	struct nova_inode_info_header *sih, u16 i_mode);
 void nova_save_inode_list_to_log(struct super_block *sb);
 void nova_save_blocknode_mappings_to_log(struct super_block *sb);
+int nova_recovery(struct super_block *sb);
 
 #endif
diff --git a/fs/nova/super.c b/fs/nova/super.c
index 980b1d7..14b4af6 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -642,6 +642,9 @@ static int nova_fill_super(struct super_block *sb, void *data, int silent)
 	sb->s_xattr = NULL;
 	sb->s_flags |= MS_NOSEC;
 
+	if ((sbi->s_mount_opt & NOVA_MOUNT_FORMAT) == 0)
+		nova_recovery(sb);
+
 	root_i = nova_iget(sb, NOVA_ROOT_INO);
 	if (IS_ERR(root_i)) {
 		retval = PTR_ERR(root_i);
-- 
2.7.4

* [RFC v2 80/83] Failure recovery: bitmap operations.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (78 preceding siblings ...)
  2018-03-10 18:19 ` [RFC v2 79/83] Normal recovery Andiry Xu
@ 2018-03-10 18:19 ` Andiry Xu
  2018-03-10 18:19 ` [RFC v2 81/83] Failure recovery: Inode pages recovery routines Andiry Xu
                   ` (3 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:19 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Upon system failure, NOVA needs to scan all the inode logs
to rebuild the allocator. During the scan, NOVA records allocated
log/data pages in a bitmap, and uses the bitmap to rebuild the allocator
once the scan finishes.
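
The bitmap-to-allocator step amounts to turning every run of clear bits
into one free range. The following user-space sketch shows that
conversion under simplifying assumptions: it uses a plain byte array
instead of the kernel bitmap helpers, a single free list instead of the
per-CPU lists, and hypothetical names (block_range, mark_allocated(),
build_free_ranges()).

```c
#include <limits.h>
#include <stddef.h>

/* One contiguous run of free blocks, inclusive on both ends. */
struct block_range {
	unsigned long low;
	unsigned long high;
};

static int block_is_allocated(const unsigned char *bm, unsigned long bit)
{
	return (bm[bit / CHAR_BIT] >> (bit % CHAR_BIT)) & 1;
}

/* Called during the log scan for every block found in use. */
static void mark_allocated(unsigned char *bm, unsigned long bit)
{
	bm[bit / CHAR_BIT] |= (unsigned char)(1u << (bit % CHAR_BIT));
}

/*
 * Walk the bitmap and emit one range per run of clear (free) bits.
 * Returns the number of ranges written, at most max_out.
 */
static size_t build_free_ranges(const unsigned char *bm, unsigned long nbits,
				struct block_range *out, size_t max_out)
{
	unsigned long bit = 0;
	size_t n = 0;

	while (bit < nbits && n < max_out) {
		/* Skip allocated blocks. */
		while (bit < nbits && block_is_allocated(bm, bit))
			bit++;
		if (bit == nbits)
			break;
		out[n].low = bit;
		/* Extend the range over the following free blocks. */
		while (bit < nbits && !block_is_allocated(bm, bit))
			bit++;
		out[n].high = bit - 1;
		n++;
	}
	return n;
}
```

With 16 blocks of which 0, 1, 5 and 15 are marked allocated, the walk
yields two free ranges, [2, 4] and [6, 14]. The patch additionally
splits ranges at per-CPU free list boundaries and first folds the 2M
and 1G bitmaps into the 4K bitmap, which this sketch omits.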

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/bbuild.c | 252 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/bbuild.h |  18 ++++
 2 files changed, 270 insertions(+)

diff --git a/fs/nova/bbuild.c b/fs/nova/bbuild.c
index ca51dca..35c661a 100644
--- a/fs/nova/bbuild.c
+++ b/fs/nova/bbuild.c
@@ -414,6 +414,258 @@ void nova_save_blocknode_mappings_to_log(struct super_block *sb)
 		  pi->log_head, pi->log_tail);
 }
 
+/************************** Bitmap operations ****************************/
+
+static inline void set_scan_bm(unsigned long bit,
+	struct single_scan_bm *scan_bm)
+{
+	set_bit(bit, scan_bm->bitmap);
+}
+
+inline void set_bm(unsigned long bit, struct scan_bitmap *bm,
+	enum bm_type type)
+{
+	switch (type) {
+	case BM_4K:
+		set_scan_bm(bit, &bm->scan_bm_4K);
+		break;
+	case BM_2M:
+		set_scan_bm(bit, &bm->scan_bm_2M);
+		break;
+	case BM_1G:
+		set_scan_bm(bit, &bm->scan_bm_1G);
+		break;
+	default:
+		break;
+	}
+}
+
+static int nova_insert_blocknode_map(struct super_block *sb,
+	int cpuid, unsigned long low, unsigned long high)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	struct rb_root *tree;
+	struct nova_range_node *blknode = NULL;
+	unsigned long num_blocks = 0;
+	int ret;
+
+	num_blocks = high - low + 1;
+	nova_dbgv("%s: cpu %d, low %lu, high %lu, num %lu\n",
+		__func__, cpuid, low, high, num_blocks);
+	free_list = nova_get_free_list(sb, cpuid);
+	tree = &(free_list->block_free_tree);
+
+	blknode = nova_alloc_blocknode(sb);
+	if (blknode == NULL)
+		return -ENOMEM;
+	blknode->range_low = low;
+	blknode->range_high = high;
+	ret = nova_insert_blocktree(sbi, tree, blknode);
+	if (ret) {
+		nova_err(sb, "%s failed\n", __func__);
+		nova_free_blocknode(sb, blknode);
+		goto out;
+	}
+	if (!free_list->first_node)
+		free_list->first_node = blknode;
+	free_list->last_node = blknode;
+	free_list->num_blocknode++;
+	free_list->num_free_blocks += num_blocks;
+out:
+	return ret;
+}
+
+static int __nova_build_blocknode_map(struct super_block *sb,
+	unsigned long *bitmap, unsigned long bsize, unsigned long scale)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	unsigned long next = 0;
+	unsigned long low = 0;
+	unsigned long start, end;
+	int cpuid = 0;
+
+	free_list = nova_get_free_list(sb, cpuid);
+	start = free_list->block_start;
+	end = free_list->block_end + 1;
+	while (1) {
+		next = find_next_zero_bit(bitmap, end, start);
+		if (next == bsize)
+			break;
+		if (next == end) {
+			if (cpuid == sbi->cpus - 1)
+				break;
+
+			cpuid++;
+			free_list = nova_get_free_list(sb, cpuid);
+			start = free_list->block_start;
+			end = free_list->block_end + 1;
+			continue;
+		}
+
+		low = next;
+		next = find_next_bit(bitmap, end, next);
+		if (nova_insert_blocknode_map(sb, cpuid,
+				low << scale, (next << scale) - 1)) {
+			nova_dbg("Error: could not insert %lu - %lu\n",
+				low << scale, ((next << scale) - 1));
+		}
+		start = next;
+		if (next == bsize)
+			break;
+		if (next == end) {
+			if (cpuid == sbi->cpus - 1)
+				break;
+
+			cpuid++;
+			free_list = nova_get_free_list(sb, cpuid);
+			start = free_list->block_start;
+			end = free_list->block_end + 1;
+		}
+	}
+	return 0;
+}
+
+static void nova_update_4K_map(struct super_block *sb,
+	struct scan_bitmap *bm,	unsigned long *bitmap,
+	unsigned long bsize, unsigned long scale)
+{
+	unsigned long next = 0;
+	unsigned long low = 0;
+	int i;
+
+	while (1) {
+		next = find_next_bit(bitmap, bsize, next);
+		if (next == bsize)
+			break;
+		low = next;
+		next = find_next_zero_bit(bitmap, bsize, next);
+		for (i = (low << scale); i < (next << scale); i++)
+			set_bm(i, bm, BM_4K);
+		if (next == bsize)
+			break;
+	}
+}
+
+struct scan_bitmap *global_bm[MAX_CPUS];
+
+static int nova_build_blocknode_map(struct super_block *sb,
+	unsigned long initsize)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct scan_bitmap *bm;
+	struct scan_bitmap *final_bm;
+	unsigned long *src, *dst;
+	int i, j;
+	int num;
+	int ret;
+
+	final_bm = kzalloc(sizeof(struct scan_bitmap), GFP_KERNEL);
+	if (!final_bm)
+		return -ENOMEM;
+
+	final_bm->scan_bm_4K.bitmap_size =
+				(initsize >> (PAGE_SHIFT + 0x3));
+
+	/* Alloc memory to hold the block alloc bitmap */
+	final_bm->scan_bm_4K.bitmap = kzalloc(final_bm->scan_bm_4K.bitmap_size,
+							GFP_KERNEL);
+
+	if (!final_bm->scan_bm_4K.bitmap) {
+		kfree(final_bm);
+		return -ENOMEM;
+	}
+
+	/*
+	 * We are using free lists. Set 2M and 1G blocks in 4K map,
+	 * and use 4K map to rebuild block map.
+	 */
+	for (i = 0; i < sbi->cpus; i++) {
+		bm = global_bm[i];
+		nova_update_4K_map(sb, bm, bm->scan_bm_2M.bitmap,
+			bm->scan_bm_2M.bitmap_size * 8, PAGE_SHIFT_2M - 12);
+		nova_update_4K_map(sb, bm, bm->scan_bm_1G.bitmap,
+			bm->scan_bm_1G.bitmap_size * 8, PAGE_SHIFT_1G - 12);
+	}
+
+	/* Merge per-CPU bms to the final single bm */
+	num = final_bm->scan_bm_4K.bitmap_size / sizeof(unsigned long);
+	if (final_bm->scan_bm_4K.bitmap_size % sizeof(unsigned long))
+		num++;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		bm = global_bm[i];
+		src = (unsigned long *)bm->scan_bm_4K.bitmap;
+		dst = (unsigned long *)final_bm->scan_bm_4K.bitmap;
+
+		for (j = 0; j < num; j++)
+			dst[j] |= src[j];
+	}
+
+	ret = __nova_build_blocknode_map(sb, final_bm->scan_bm_4K.bitmap,
+			final_bm->scan_bm_4K.bitmap_size * 8, PAGE_SHIFT - 12);
+
+	kfree(final_bm->scan_bm_4K.bitmap);
+	kfree(final_bm);
+
+	return ret;
+}
+
+static void free_bm(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct scan_bitmap *bm;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		bm = global_bm[i];
+		if (bm) {
+			kfree(bm->scan_bm_4K.bitmap);
+			kfree(bm->scan_bm_2M.bitmap);
+			kfree(bm->scan_bm_1G.bitmap);
+			kfree(bm);
+		}
+	}
+}
+
+static int alloc_bm(struct super_block *sb, unsigned long initsize)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct scan_bitmap *bm;
+	int i;
+
+	for (i = 0; i < sbi->cpus; i++) {
+		bm = kzalloc(sizeof(struct scan_bitmap), GFP_KERNEL);
+		if (!bm)
+			return -ENOMEM;
+
+		global_bm[i] = bm;
+
+		bm->scan_bm_4K.bitmap_size =
+				(initsize >> (PAGE_SHIFT + 0x3));
+		bm->scan_bm_2M.bitmap_size =
+				(initsize >> (PAGE_SHIFT_2M + 0x3));
+		bm->scan_bm_1G.bitmap_size =
+				(initsize >> (PAGE_SHIFT_1G + 0x3));
+
+		/* Alloc memory to hold the block alloc bitmap */
+		bm->scan_bm_4K.bitmap = kzalloc(bm->scan_bm_4K.bitmap_size,
+							GFP_KERNEL);
+		bm->scan_bm_2M.bitmap = kzalloc(bm->scan_bm_2M.bitmap_size,
+							GFP_KERNEL);
+		bm->scan_bm_1G.bitmap = kzalloc(bm->scan_bm_1G.bitmap_size,
+							GFP_KERNEL);
+
+		if (!bm->scan_bm_4K.bitmap || !bm->scan_bm_2M.bitmap ||
+				!bm->scan_bm_1G.bitmap)
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+
 /*********************** Recovery entrance *************************/
 
 /* Return TRUE if we can do a normal unmount recovery */
diff --git a/fs/nova/bbuild.h b/fs/nova/bbuild.h
index 2c3deb0..b093e05 100644
--- a/fs/nova/bbuild.h
+++ b/fs/nova/bbuild.h
@@ -1,6 +1,24 @@
 #ifndef __BBUILD_H
 #define __BBUILD_H
 
+enum bm_type {
+	BM_4K = 0,
+	BM_2M,
+	BM_1G,
+};
+
+struct single_scan_bm {
+	unsigned long bitmap_size;
+	unsigned long *bitmap;
+};
+
+struct scan_bitmap {
+	struct single_scan_bm scan_bm_4K;
+	struct single_scan_bm scan_bm_2M;
+	struct single_scan_bm scan_bm_1G;
+};
+
+
 void nova_init_header(struct super_block *sb,
 	struct nova_inode_info_header *sih, u16 i_mode);
 void nova_save_inode_list_to_log(struct super_block *sb);
-- 
2.7.4

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [RFC v2 81/83] Failure recovery: Inode pages recovery routines.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (79 preceding siblings ...)
  2018-03-10 18:19 ` [RFC v2 80/83] Failure recovery: bitmap operations Andiry Xu
@ 2018-03-10 18:19 ` Andiry Xu
  2018-03-10 18:19 ` [RFC v2 82/83] Failure recovery: Per-CPU recovery Andiry Xu
                   ` (2 subsequent siblings)
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:19 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

For each inode, NOVA traverses the inode log and records the allocated
pages in the bitmap. For directory inodes, NOVA only sets the log pages.
For file and symlink inodes, NOVA also needs to set the data pages.
NOVA divides the file into 1GB zones and records the pages that fall into
the current zone, repeating zone by zone until all the pages have been
recorded.
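
The zone-by-zone scan can be illustrated with a small userspace sketch.
It is not the kernel code: struct write_entry, zone_map and
record_file_pages() are hypothetical names, the log is a plain array, and
block number 0 is assumed never to hold file data so it can mean
"unmapped". Only the zone size (262144 4KB pages, the MAX_PGOFF value in
this patch) is taken from the patch itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZONE_PAGES 262144UL	/* 1GB of 4KB pages; same value as MAX_PGOFF */

/* Hypothetical, simplified stand-in for a file write log entry. */
struct write_entry {
	unsigned long pgoff;		/* first file page covered */
	unsigned long num_pages;	/* number of pages covered */
	unsigned long block;		/* first block number, in 4KB units */
};

/* File page -> block number for the current zone; 0 means unmapped. */
static uint64_t zone_map[ZONE_PAGES];

/*
 * Replay the whole log once per 1GB zone. Within a zone, later entries
 * overwrite earlier ones, so only the newest block of each file page is
 * marked in the bitmap. Returns the number of bits newly set.
 */
static unsigned long record_file_pages(const struct write_entry *log,
	size_t n, unsigned char *bitmap, size_t bitmap_bits)
{
	unsigned long base = 0, last = 0, marked = 0;
	unsigned long start, end, pg;
	size_t i;

	do {
		for (i = 0; i < n; i++) {
			if (log[i].num_pages == 0)
				continue;
			start = log[i].pgoff;
			end = log[i].pgoff + log[i].num_pages;
			if (end - 1 > last)
				last = end - 1;
			/* Clamp the entry to the current zone. */
			if (start < base)
				start = base;
			if (end > base + ZONE_PAGES)
				end = base + ZONE_PAGES;
			for (pg = start; pg < end; pg++)
				zone_map[pg - base] =
					log[i].block + (pg - log[i].pgoff);
		}

		/* Flush the zone into the bitmap and reset it for reuse. */
		for (i = 0; i < ZONE_PAGES; i++) {
			uint64_t blk = zone_map[i];

			if (!blk)
				continue;
			assert(blk < bitmap_bits);
			if (!(bitmap[blk / 8] & (1u << (blk % 8))))
				marked++;
			bitmap[blk / 8] |= 1u << (blk % 8);
			zone_map[i] = 0;
		}
		base += ZONE_PAGES;
	} while (last >= base);

	return marked;
}
```

The per-zone array keeps the working memory fixed regardless of file
size, at the cost of walking the log once for every 1GB of file offset.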

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/bbuild.c | 307 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 307 insertions(+)

diff --git a/fs/nova/bbuild.c b/fs/nova/bbuild.c
index 35c661a..75dfcba 100644
--- a/fs/nova/bbuild.c
+++ b/fs/nova/bbuild.c
@@ -665,6 +665,313 @@ static int alloc_bm(struct super_block *sb, unsigned long initsize)
 	return 0;
 }
 
+/************************** NOVA recovery ****************************/
+
+#define MAX_PGOFF	262144
+
+struct task_ring {
+	u64 addr0[512];
+	int num;
+	int inodes_used_count;
+	u64 *entry_array;
+	u64 *nvmm_array;
+};
+
+static int nova_traverse_inode_log(struct super_block *sb,
+	struct nova_inode *pi, struct scan_bitmap *bm, u64 head)
+{
+	u64 curr_p;
+	u64 next;
+
+	curr_p = head;
+
+	if (curr_p == 0)
+		return 0;
+
+	WARN_ON(curr_p & (PAGE_SIZE - 1));
+	set_bm(curr_p >> PAGE_SHIFT, bm, BM_4K);
+
+	next = next_log_page(sb, curr_p);
+	while (next > 0) {
+		curr_p = next;
+		WARN_ON(curr_p & (PAGE_SIZE - 1));
+		set_bm(curr_p >> PAGE_SHIFT, bm, BM_4K);
+		next = next_log_page(sb, curr_p);
+	}
+
+	return 0;
+}
+
+static void nova_traverse_dir_inode_log(struct super_block *sb,
+	struct nova_inode *pi, struct scan_bitmap *bm)
+{
+	nova_traverse_inode_log(sb, pi, bm, pi->log_head);
+}
+
+static int nova_set_ring_array(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	struct task_ring *ring,
+	unsigned long base, struct scan_bitmap *bm)
+{
+	unsigned long start, end;
+	unsigned long pgoff, old_pgoff = 0;
+	unsigned long index;
+	unsigned int num_free = 0;
+	u64 old_entry = 0;
+
+	start = entry->pgoff;
+	if (start < base)
+		start = base;
+
+	end = entry->pgoff + entry->num_pages;
+	if (end > base + MAX_PGOFF)
+		end = base + MAX_PGOFF;
+
+	for (pgoff = start; pgoff < end; pgoff++) {
+		index = pgoff - base;
+		if (ring->nvmm_array[index]) {
+			if (ring->entry_array[index] != old_entry) {
+				old_entry = ring->entry_array[index];
+				old_pgoff = pgoff;
+				num_free = 1;
+			} else {
+				num_free++;
+			}
+		}
+	}
+
+	for (pgoff = start; pgoff < end; pgoff++) {
+		index = pgoff - base;
+		ring->entry_array[index] = (u64)entry;
+		ring->nvmm_array[index] = (u64)(entry->block >> PAGE_SHIFT)
+						+ pgoff - entry->pgoff;
+	}
+
+	return 0;
+}
+
+static int nova_set_file_bm(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct task_ring *ring,
+	struct scan_bitmap *bm, unsigned long base, unsigned long last_blocknr)
+{
+	unsigned long nvmm, pgoff;
+
+	if (last_blocknr >= base + MAX_PGOFF)
+		last_blocknr = MAX_PGOFF - 1;
+	else
+		last_blocknr -= base;
+
+	for (pgoff = 0; pgoff <= last_blocknr; pgoff++) {
+		nvmm = ring->nvmm_array[pgoff];
+		if (nvmm) {
+			set_bm(nvmm, bm, BM_4K);
+			ring->nvmm_array[pgoff] = 0;
+			ring->entry_array[pgoff] = 0;
+		}
+	}
+
+	return 0;
+}
+
+/* entry given to this function is a copy in dram */
+static void nova_ring_setattr_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_setattr_logentry *entry, struct task_ring *ring,
+	unsigned long base, unsigned int data_bits, struct scan_bitmap *bm)
+{
+	unsigned long first_blocknr, last_blocknr;
+	unsigned long pgoff, old_pgoff = 0;
+	unsigned long index;
+	unsigned int num_free = 0;
+	u64 old_entry = 0;
+	loff_t start, end;
+
+	if (sih->i_size <= entry->size)
+		goto out;
+
+	start = entry->size;
+	end = sih->i_size;
+
+	first_blocknr = (start + (1UL << data_bits) - 1) >> data_bits;
+
+	if (end > 0)
+		last_blocknr = (end - 1) >> data_bits;
+	else
+		last_blocknr = 0;
+
+	if (first_blocknr > last_blocknr)
+		goto out;
+
+	if (first_blocknr < base)
+		first_blocknr = base;
+
+	if (last_blocknr > base + MAX_PGOFF - 1)
+		last_blocknr = base + MAX_PGOFF - 1;
+
+	for (pgoff = first_blocknr; pgoff <= last_blocknr; pgoff++) {
+		index = pgoff - base;
+		if (ring->nvmm_array[index]) {
+			if (ring->entry_array[index] != old_entry) {
+				old_entry = ring->entry_array[index];
+				old_pgoff = pgoff;
+				num_free = 1;
+			} else {
+				num_free++;
+			}
+		}
+	}
+
+	for (pgoff = first_blocknr; pgoff <= last_blocknr; pgoff++) {
+		index = pgoff - base;
+		ring->nvmm_array[index] = 0;
+		ring->entry_array[index] = 0;
+	}
+
+out:
+	sih->i_size = entry->size;
+}
+
+static unsigned long nova_traverse_file_write_entry(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_entry *entry,
+	struct task_ring *ring,
+	unsigned long base, struct scan_bitmap *bm)
+{
+	unsigned long max_blocknr = 0;
+	sih->i_size = entry->size;
+
+	if (entry->num_pages != entry->invalid_pages) {
+		max_blocknr = entry->pgoff + entry->num_pages - 1;
+		if (entry->pgoff < base + MAX_PGOFF &&
+				entry->pgoff + entry->num_pages > base)
+			nova_set_ring_array(sb, sih, entry,
+						ring, base, bm);
+	}
+
+	return max_blocknr;
+}
+
+static int nova_traverse_file_inode_log(struct super_block *sb,
+	struct nova_inode *pi, struct nova_inode_info_header *sih,
+	struct task_ring *ring, struct scan_bitmap *bm)
+{
+	unsigned long base = 0;
+	unsigned long last_blocknr = 0, curr_last;
+	void *entry;
+	unsigned int btype;
+	unsigned int data_bits;
+	u64 curr_p;
+	u64 next;
+	u8 type;
+
+	btype = pi->i_blk_type;
+	data_bits = blk_type_to_shift[btype];
+
+again:
+	curr_p = pi->log_head;
+	nova_dbg_verbose("Log head 0x%llx, tail 0x%llx\n",
+				curr_p, pi->log_tail);
+	if (curr_p == 0 && pi->log_tail == 0)
+		return 0;
+
+	if (base == 0) {
+		WARN_ON(curr_p & (PAGE_SIZE - 1));
+		set_bm(curr_p >> PAGE_SHIFT, bm, BM_4K);
+	}
+
+	while (curr_p != pi->log_tail) {
+		if (goto_next_page(sb, curr_p)) {
+			curr_p = next_log_page(sb, curr_p);
+			if (base == 0) {
+				WARN_ON(curr_p & (PAGE_SIZE - 1));
+				set_bm(curr_p >> PAGE_SHIFT, bm, BM_4K);
+			}
+		}
+
+		entry = (void *)nova_get_block(sb, curr_p);
+
+		type = nova_get_entry_type(entry);
+		switch (type) {
+		case SET_ATTR:
+			nova_ring_setattr_entry(sb, sih, SENTRY(entry),
+						ring, base, data_bits,
+						bm);
+			curr_p += sizeof(struct nova_setattr_logentry);
+			break;
+		case LINK_CHANGE:
+			curr_p += sizeof(struct nova_link_change_entry);
+			break;
+		case FILE_WRITE:
+			curr_last = nova_traverse_file_write_entry(sb, sih,
+						WENTRY(entry), ring, base, bm);
+			curr_p += sizeof(struct nova_file_write_entry);
+			if (last_blocknr < curr_last)
+				last_blocknr = curr_last;
+			break;
+		default:
+			nova_dbg("%s: unknown type %d, 0x%llx\n",
+						__func__, type, curr_p);
+			NOVA_ASSERT(0);
+		}
+
+	}
+
+	if (base == 0) {
+		/* Keep traversing until log ends */
+		curr_p &= PAGE_MASK;
+		next = next_log_page(sb, curr_p);
+		while (next > 0) {
+			curr_p = next;
+			WARN_ON(curr_p & (PAGE_SIZE - 1));
+			set_bm(curr_p >> PAGE_SHIFT, bm, BM_4K);
+			next = next_log_page(sb, curr_p);
+		}
+	}
+
+	nova_set_file_bm(sb, sih, ring, bm, base, last_blocknr);
+	if (last_blocknr >= base + MAX_PGOFF) {
+		base += MAX_PGOFF;
+		goto again;
+	}
+
+	return 0;
+}
+
+static int nova_recover_inode_pages(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct task_ring *ring,
+	struct nova_inode *pi, struct scan_bitmap *bm)
+{
+	unsigned long nova_ino;
+
+	if (pi->deleted == 1)
+		return 0;
+
+	nova_ino = pi->nova_ino;
+	ring->inodes_used_count++;
+
+	sih->i_mode = __le16_to_cpu(pi->i_mode);
+	sih->ino = nova_ino;
+
+	nova_dbgv("%s: inode %lu, head 0x%llx, tail 0x%llx\n",
+			__func__, nova_ino, pi->log_head, pi->log_tail);
+
+	switch (__le16_to_cpu(pi->i_mode) & S_IFMT) {
+	case S_IFDIR:
+		nova_traverse_dir_inode_log(sb, pi, bm);
+		break;
+	case S_IFLNK:
+		/* Treat symlink files as normal files */
+		/* Fall through */
+	case S_IFREG:
+		/* Fall through */
+	default:
+		/* In case of special inode, walk the log */
+		nova_traverse_file_inode_log(sb, pi, sih, ring, bm);
+		break;
+	}
+
+	return 0;
+}
+
 
 /*********************** Recovery entrance *************************/
 
-- 
2.7.4

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [RFC v2 82/83] Failure recovery: Per-CPU recovery.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (80 preceding siblings ...)
  2018-03-10 18:19 ` [RFC v2 81/83] Failure recovery: Inode pages recovery routines Andiry Xu
@ 2018-03-10 18:19 ` Andiry Xu
  2018-03-10 18:19 ` [RFC v2 83/83] Sysfs support Andiry Xu
  2018-03-11  2:14 ` [RFC v2 00/83] NOVA: a new file system for persistent memory Theodore Y. Ts'o
  83 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:19 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

NOVA starts a recovery thread on each CPU and scans all the inodes
in parallel. It also rebuilds the inode in-use list during the scan.
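
As a rough illustration of how the in-use list can be rebuilt, the
userspace sketch below coalesces the inode numbers seen by one CPU's scan
into ranges. It assumes, as the code in this patch does, that inodes owned
by one CPU are spaced sbi->cpus apart and that ranges are stored as
ino / cpus. The names struct ino_range and coalesce_inuse() are
hypothetical, and a flat array stands in for the per-CPU red-black tree.

```c
#include <assert.h>
#include <stddef.h>

/* Range of in-use inodes in per-CPU internal numbering (ino / cpus). */
struct ino_range {
	unsigned long low;
	unsigned long high;
};

/*
 * inos: in-use inode numbers found by one CPU, in ascending order, all
 * with the same value of ino % cpus. Adjacent internal numbers are
 * merged into one range. Returns the number of ranges written to out.
 */
static size_t coalesce_inuse(const unsigned long *inos, size_t n,
	unsigned long cpus, struct ino_range *out, size_t max_out)
{
	size_t count = 0;
	size_t i;

	assert(cpus > 0);
	for (i = 0; i < n; i++) {
		unsigned long internal = inos[i] / cpus;

		if (count > 0 && internal == out[count - 1].high + 1) {
			/* Extends the previous range by one inode. */
			out[count - 1].high = internal;
		} else {
			/* Gap found: start a new range. */
			assert(count < max_out);
			out[count].low = internal;
			out[count].high = internal;
			count++;
		}
	}
	return count;
}
```

With 4 CPUs, the inodes 33, 37, 41, 53 and 57 (all owned by CPU 1) map to
internal numbers 8, 9, 10, 13 and 14, which collapse into the two ranges
8-10 and 13-14.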

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/bbuild.c | 396 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 396 insertions(+)

diff --git a/fs/nova/bbuild.c b/fs/nova/bbuild.c
index 75dfcba..3271166 100644
--- a/fs/nova/bbuild.c
+++ b/fs/nova/bbuild.c
@@ -677,6 +677,11 @@ struct task_ring {
 	u64 *nvmm_array;
 };
 
+static struct task_ring *task_rings;
+static struct task_struct **threads;
+wait_queue_head_t finish_wq;
+int *finished;
+
 static int nova_traverse_inode_log(struct super_block *sb,
 	struct nova_inode *pi, struct scan_bitmap *bm, u64 head)
 {
@@ -973,6 +978,378 @@ static int nova_recover_inode_pages(struct super_block *sb,
 }
 
 
+static void free_resources(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct task_ring *ring;
+	int i;
+
+	if (task_rings) {
+		for (i = 0; i < sbi->cpus; i++) {
+			ring = &task_rings[i];
+			vfree(ring->entry_array);
+			vfree(ring->nvmm_array);
+			ring->entry_array = NULL;
+			ring->nvmm_array = NULL;
+		}
+	}
+
+	kfree(task_rings);
+	kfree(threads);
+	kfree(finished);
+}
+
+static int failure_thread_func(void *data);
+
+static int allocate_resources(struct super_block *sb, int cpus)
+{
+	struct task_ring *ring;
+	int i;
+
+	task_rings = kcalloc(cpus, sizeof(struct task_ring), GFP_KERNEL);
+	if (!task_rings)
+		goto fail;
+
+	for (i = 0; i < cpus; i++) {
+		ring = &task_rings[i];
+
+		ring->nvmm_array = vzalloc(sizeof(u64) * MAX_PGOFF);
+		if (!ring->nvmm_array)
+			goto fail;
+
+		ring->entry_array = vmalloc(sizeof(u64) * MAX_PGOFF);
+		if (!ring->entry_array)
+			goto fail;
+	}
+
+	threads = kcalloc(cpus, sizeof(struct task_struct *), GFP_KERNEL);
+	if (!threads)
+		goto fail;
+
+	finished = kcalloc(cpus, sizeof(int), GFP_KERNEL);
+	if (!finished)
+		goto fail;
+
+	init_waitqueue_head(&finish_wq);
+
+	for (i = 0; i < cpus; i++) {
+		threads[i] = kthread_create(failure_thread_func,
+						sb, "recovery thread");
+		kthread_bind(threads[i], i);
+	}
+
+	return 0;
+
+fail:
+	free_resources(sb);
+	return -ENOMEM;
+}
+
+static void wait_to_finish(int cpus)
+{
+	int i;
+
+	for (i = 0; i < cpus; i++) {
+		while (finished[i] == 0) {
+			wait_event_interruptible_timeout(finish_wq, false,
+							msecs_to_jiffies(1));
+		}
+	}
+}
+
+/*********************** Failure recovery *************************/
+
+static int nova_failure_insert_inodetree(struct super_block *sb,
+	unsigned long ino_low, unsigned long ino_high)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct inode_map *inode_map;
+	struct nova_range_node *prev = NULL, *next = NULL;
+	struct nova_range_node *new_node;
+	unsigned long internal_low, internal_high;
+	int cpu;
+	struct rb_root *tree;
+	int ret;
+
+	if (ino_low > ino_high) {
+		nova_err(sb, "%s: ino low %lu, ino high %lu\n",
+				__func__, ino_low, ino_high);
+		return -EINVAL;
+	}
+
+	cpu = ino_low % sbi->cpus;
+	if (ino_high % sbi->cpus != cpu) {
+		nova_err(sb, "%s: ino low %lu, ino high %lu\n",
+				__func__, ino_low, ino_high);
+		return -EINVAL;
+	}
+
+	internal_low = ino_low / sbi->cpus;
+	internal_high = ino_high / sbi->cpus;
+	inode_map = &sbi->inode_maps[cpu];
+	tree = &inode_map->inode_inuse_tree;
+	mutex_lock(&inode_map->inode_table_mutex);
+
+	ret = nova_find_free_slot(sbi, tree, internal_low, internal_high,
+					&prev, &next);
+	if (ret) {
+		nova_dbg("%s: ino %lu - %lu already exists!: %d\n",
+					__func__, ino_low, ino_high, ret);
+		mutex_unlock(&inode_map->inode_table_mutex);
+		return ret;
+	}
+
+	if (prev && next && (internal_low == prev->range_high + 1) &&
+			(internal_high + 1 == next->range_low)) {
+		/* fits the hole */
+		rb_erase(&next->node, tree);
+		inode_map->num_range_node_inode--;
+		prev->range_high = next->range_high;
+		nova_free_inode_node(sb, next);
+		goto finish;
+	}
+	if (prev && (internal_low == prev->range_high + 1)) {
+		/* Aligns left */
+		prev->range_high += internal_high - internal_low + 1;
+		goto finish;
+	}
+	if (next && (internal_high + 1 == next->range_low)) {
+		/* Aligns right */
+		next->range_low -= internal_high - internal_low + 1;
+		goto finish;
+	}
+
+	/* Aligns somewhere in the middle */
+	new_node = nova_alloc_inode_node(sb);
+	NOVA_ASSERT(new_node);
+	new_node->range_low = internal_low;
+	new_node->range_high = internal_high;
+	ret = nova_insert_inodetree(sbi, new_node, cpu);
+	if (ret) {
+		nova_err(sb, "%s failed\n", __func__);
+		nova_free_inode_node(sb, new_node);
+		goto finish;
+	}
+	inode_map->num_range_node_inode++;
+
+finish:
+	mutex_unlock(&inode_map->inode_table_mutex);
+	return ret;
+}
+
+static inline int nova_failure_update_inodetree(struct super_block *sb,
+	struct nova_inode *pi, unsigned long *ino_low, unsigned long *ino_high)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (*ino_low == 0) {
+		*ino_low = *ino_high = pi->nova_ino;
+	} else {
+		if (pi->nova_ino == *ino_high + sbi->cpus) {
+			*ino_high = pi->nova_ino;
+		} else {
+			/* A new start */
+			nova_failure_insert_inodetree(sb, *ino_low, *ino_high);
+			*ino_low = *ino_high = pi->nova_ino;
+		}
+	}
+
+	return 0;
+}
+
+static int failure_thread_func(void *data)
+{
+	struct super_block *sb = data;
+	struct nova_inode_info_header sih;
+	struct task_ring *ring;
+	struct nova_inode *pi, fake_pi;
+	unsigned long num_inodes_per_page;
+	unsigned long ino_low, ino_high;
+	unsigned long last_blocknr;
+	unsigned int data_bits;
+	u64 curr;
+	int cpuid = smp_processor_id();
+	unsigned long i;
+	unsigned long max_size = 0;
+	u64 pi_addr = 0;
+	int ret = 0;
+	int count;
+
+	pi = nova_get_inode_by_ino(sb, NOVA_INODETABLE_INO);
+	data_bits = blk_type_to_shift[pi->i_blk_type];
+	num_inodes_per_page = 1 << (data_bits - NOVA_INODE_BITS);
+
+	ring = &task_rings[cpuid];
+	nova_init_header(sb, &sih, 0);
+
+	for (count = 0; count < ring->num; count++) {
+		curr = ring->addr0[count];
+		ino_low = ino_high = 0;
+
+		/*
+		 * Note: The inode log page is allocated in 2MB
+		 * granularity, but not aligned on 2MB boundary.
+		 */
+		for (i = 0; i < 512; i++)
+			set_bm((curr >> PAGE_SHIFT) + i,
+					global_bm[cpuid], BM_4K);
+
+		for (i = 0; i < num_inodes_per_page; i++) {
+			pi_addr = curr + i * NOVA_INODE_SIZE;
+			ret = nova_get_reference(sb, pi_addr, &fake_pi,
+				(void **)&pi, sizeof(struct nova_inode));
+			if (ret) {
+				nova_dbg("Recover pi @ 0x%llx failed\n",
+						pi_addr);
+				continue;
+			}
+			/* FIXME: Check inode checksum */
+			if (fake_pi.i_mode && fake_pi.deleted == 0) {
+				if (fake_pi.valid == 0) {
+					/* Deleteable */
+					pi->deleted = 1;
+					fake_pi.deleted = 1;
+					continue;
+				}
+
+				nova_recover_inode_pages(sb, &sih, ring,
+						&fake_pi, global_bm[cpuid]);
+				nova_failure_update_inodetree(sb, pi,
+						&ino_low, &ino_high);
+				if (sih.i_size > max_size)
+					max_size = sih.i_size;
+			}
+		}
+
+		if (ino_low && ino_high)
+			nova_failure_insert_inodetree(sb, ino_low, ino_high);
+	}
+
+	/* Free radix tree */
+	if (max_size) {
+		last_blocknr = (max_size - 1) >> PAGE_SHIFT;
+		nova_delete_file_tree(sb, &sih, 0, last_blocknr,
+						false, false, 0);
+	}
+
+	finished[cpuid] = 1;
+	wake_up_interruptible(&finish_wq);
+	do_exit(ret);
+	return ret;
+}
+
+static int nova_failure_recovery_crawl(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct nova_inode_info_header sih;
+	struct inode_table *inode_table;
+	struct task_ring *ring;
+	struct nova_inode *pi, fake_pi;
+	unsigned long curr_addr;
+	u64 root_addr;
+	u64 curr;
+	int ret = 0;
+	int count;
+	int cpuid;
+
+	root_addr = nova_get_reserved_inode_addr(sb, NOVA_ROOT_INO);
+
+	for (cpuid = 0; cpuid < sbi->cpus; cpuid++) {
+		ring = &task_rings[cpuid];
+		inode_table = nova_get_inode_table(sb, cpuid);
+		if (!inode_table)
+			return -EINVAL;
+
+		count = 0;
+		curr = inode_table->log_head;
+		while (curr) {
+			if (ring->num >= 512) {
+				nova_err(sb, "%s: ring size too small\n",
+					 __func__);
+				return -EINVAL;
+			}
+
+			ring->addr0[count] = curr;
+
+			count++;
+
+			curr_addr = (unsigned long)nova_get_block(sb,
+							curr);
+			/* Next page resides at the last 8 bytes */
+			curr_addr += 2097152 - 8;
+			curr = *(u64 *)(curr_addr);
+		}
+
+		if (count > ring->num)
+			ring->num = count;
+	}
+
+	for (cpuid = 0; cpuid < sbi->cpus; cpuid++)
+		wake_up_process(threads[cpuid]);
+
+	nova_init_header(sb, &sih, 0);
+	/* Recover the root inode */
+	ret = nova_get_reference(sb, root_addr, &fake_pi,
+			(void **)&pi, sizeof(struct nova_inode));
+	if (ret) {
+		nova_dbg("Recover root pi failed\n");
+		return ret;
+	}
+
+	nova_recover_inode_pages(sb, &sih, &task_rings[0],
+					&fake_pi, global_bm[1]);
+
+	return ret;
+}
+
+int nova_failure_recovery(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct task_ring *ring;
+	struct nova_inode *pi;
+	struct journal_ptr_pair *pair;
+	int ret;
+	int i;
+
+	sbi->s_inodes_used_count = 0;
+
+	/* Initialize inuse inode list */
+	if (nova_init_inode_inuse_list(sb) < 0)
+		return -EINVAL;
+
+	/* Handle special inodes */
+	pi = nova_get_inode_by_ino(sb, NOVA_BLOCKNODE_INO);
+	pi->log_head = pi->log_tail = 0;
+	nova_flush_buffer(&pi->log_head, CACHELINE_SIZE, 0);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		pair = nova_get_journal_pointers(sb, i);
+
+		set_bm(pair->journal_head >> PAGE_SHIFT, global_bm[i], BM_4K);
+	}
+
+	PERSISTENT_BARRIER();
+
+	ret = allocate_resources(sb, sbi->cpus);
+	if (ret)
+		return ret;
+
+	ret = nova_failure_recovery_crawl(sb);
+
+	wait_to_finish(sbi->cpus);
+
+	for (i = 0; i < sbi->cpus; i++) {
+		ring = &task_rings[i];
+		sbi->s_inodes_used_count += ring->inodes_used_count;
+	}
+
+	free_resources(sb);
+
+	nova_dbg("Failure recovery total recovered %lu\n",
+			sbi->s_inodes_used_count - NOVA_NORMAL_INODE_START);
+	return ret;
+}
+
 /*********************** Recovery entrance *************************/
 
 /* Return TRUE if we can do a normal unmount recovery */
@@ -1027,7 +1404,23 @@ int nova_recovery(struct super_block *sb)
 	nova_init_blockmap(sb, 1);
 
 	value = nova_try_normal_recovery(sb);
+	if (value) {
+		nova_dbg("NOVA: Normal shutdown\n");
+	} else {
+		nova_dbg("NOVA: Failure recovery\n");
+		ret = alloc_bm(sb, initsize);
+		if (ret)
+			goto out;
+
+		sbi->s_inodes_used_count = 0;
+		ret = nova_failure_recovery(sb);
+		if (ret)
+			goto out;
 
+		ret = nova_build_blocknode_map(sb, initsize);
+	}
+
+out:
 	NOVA_END_TIMING(recovery_t, start);
 	if (measure_timing == 0) {
 		getrawmonotonic(&end);
@@ -1036,6 +1429,9 @@ int nova_recovery(struct super_block *sb)
 			(end.tv_nsec - start.tv_nsec);
 	}
 
+	if (!value)
+		free_bm(sb);
+
 	sbi->s_epoch_id = le64_to_cpu(super->s_epoch_id);
 	return ret;
 }
-- 
2.7.4

^ permalink raw reply	[flat|nested] 119+ messages in thread

* [RFC v2 83/83] Sysfs support.
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (81 preceding siblings ...)
  2018-03-10 18:19 ` [RFC v2 82/83] Failure recovery: Per-CPU recovery Andiry Xu
@ 2018-03-10 18:19 ` Andiry Xu
  2018-03-15  0:33   ` Randy Dunlap
  2018-03-22 15:00   ` David Sterba
  2018-03-11  2:14 ` [RFC v2 00/83] NOVA: a new file system for persistent memory Theodore Y. Ts'o
  83 siblings, 2 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-10 18:19 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

From: Andiry Xu <jix024@cs.ucsd.edu>

Sysfs support allows users to read and update information about a running
NOVA instance. After mount, NOVA creates four entries under the proc
directory /proc/fs/NOVA/pmem#/:

timing_stats	IO_stats	allocator	gc

Show NOVA file operation timing statistics:
cat /proc/fs/NOVA/pmem#/timing_stats

Clear timing statistics:
echo 1 > /proc/fs/NOVA/pmem#/timing_stats

Show NOVA I/O statistics:
cat /proc/fs/NOVA/pmem#/IO_stats

Clear I/O statistics:
echo 1 > /proc/fs/NOVA/pmem#/IO_stats

Show NOVA allocator information:
cat /proc/fs/NOVA/pmem#/allocator

Manual garbage collection:
echo #inode_number > /proc/fs/NOVA/pmem#/gc

Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
---
 fs/nova/Makefile |   2 +-
 fs/nova/nova.h   |   6 +
 fs/nova/super.c  |   9 ++
 fs/nova/super.h  |   1 +
 fs/nova/sysfs.c  | 379 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 396 insertions(+), 1 deletion(-)
 create mode 100644 fs/nova/sysfs.c

diff --git a/fs/nova/Makefile b/fs/nova/Makefile
index 7a5fb6d..6e1c29d 100644
--- a/fs/nova/Makefile
+++ b/fs/nova/Makefile
@@ -5,4 +5,4 @@
 obj-$(CONFIG_NOVA_FS) += nova.o
 
 nova-y := balloc.o bbuild.o dax.o dir.o file.o gc.o inode.o ioctl.o journal.o\
-	  log.o namei.o rebuild.o stats.o super.o symlink.o
+	  log.o namei.o rebuild.o stats.o super.o symlink.o sysfs.o
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index 32b7b2f..0814676 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -546,6 +546,12 @@ int nova_block_symlink(struct super_block *sb, struct nova_inode *pi,
 	struct inode *inode, const char *symname, int len, u64 epoch_id);
 extern const struct inode_operations nova_symlink_inode_operations;
 
+/* sysfs.c */
+extern const char *proc_dirname;
+extern struct proc_dir_entry *nova_proc_root;
+void nova_sysfs_init(struct super_block *sb);
+void nova_sysfs_exit(struct super_block *sb);
+
 /* stats.c */
 void nova_get_timing_stats(void);
 void nova_get_IO_stats(void);
diff --git a/fs/nova/super.c b/fs/nova/super.c
index 14b4af6..039c003 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -596,6 +596,8 @@ static int nova_fill_super(struct super_block *sb, void *data, int silent)
 		goto out;
 	}
 
+	nova_sysfs_init(sb);
+
 	/* Init a new nova instance */
 	if (sbi->s_mount_opt & NOVA_MOUNT_FORMAT) {
 		root_pi = nova_init(sb, sbi->initsize);
@@ -680,6 +682,8 @@ static int nova_fill_super(struct super_block *sb, void *data, int silent)
 	kfree(sbi->inode_maps);
 	sbi->inode_maps = NULL;
 
+	nova_sysfs_exit(sb);
+
 	kfree(sbi->nova_sb);
 	kfree(sbi);
 	nova_dbg("%s failed: return %d\n", __func__, retval);
@@ -783,6 +787,8 @@ static void nova_put_super(struct super_block *sb)
 			i, inode_map->allocated, inode_map->freed);
 	}
 
+	nova_sysfs_exit(sb);
+
 	kfree(sbi->inode_maps);
 	kfree(sbi->nova_sb);
 	kfree(sbi);
@@ -1007,6 +1013,8 @@ static int __init init_nova_fs(void)
 	nova_info("Arch new instructions support: CLWB %s\n",
 			support_clwb ? "YES" : "NO");
 
+	nova_proc_root = proc_mkdir(proc_dirname, NULL);
+
 	rc = init_rangenode_cache();
 	if (rc)
 		goto out;
@@ -1041,6 +1049,7 @@ static int __init init_nova_fs(void)
 static void __exit exit_nova_fs(void)
 {
 	unregister_filesystem(&nova_fs_type);
+	remove_proc_entry(proc_dirname, NULL);
 	destroy_file_write_item_cache();
 	destroy_inodecache();
 	destroy_rangenode_cache();
diff --git a/fs/nova/super.h b/fs/nova/super.h
index bcf9548..bcbe862 100644
--- a/fs/nova/super.h
+++ b/fs/nova/super.h
@@ -112,6 +112,7 @@ struct nova_sb_info {
 	struct mutex	s_lock;	/* protects the SB's buffer-head */
 
 	int cpus;
+	struct proc_dir_entry *s_proc;
 
 	/* Current epoch. volatile guarantees visibility */
 	volatile u64 s_epoch_id;
diff --git a/fs/nova/sysfs.c b/fs/nova/sysfs.c
new file mode 100644
index 0000000..0a73ef4
--- /dev/null
+++ b/fs/nova/sysfs.c
@@ -0,0 +1,379 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Proc fs operations
+ *
+ * Copyright 2015-2016 Regents of the University of California,
+ * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
+ * Copyright 2012-2013 Intel Corporation
+ * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
+ * Copyright 2003 Sony Corporation
+ * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
+ * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include "nova.h"
+#include "inode.h"
+
+const char *proc_dirname = "fs/NOVA";
+struct proc_dir_entry *nova_proc_root;
+
+/* ====================== Statistics ======================== */
+static int nova_seq_timing_show(struct seq_file *seq, void *v)
+{
+	int i;
+
+	nova_get_timing_stats();
+
+	seq_puts(seq, "=========== NOVA kernel timing stats ===========\n");
+	for (i = 0; i < TIMING_NUM; i++) {
+		/* Title */
+		if (Timingstring[i][0] == '=') {
+			seq_printf(seq, "\n%s\n\n", Timingstring[i]);
+			continue;
+		}
+
+		if (measure_timing || Timingstats[i]) {
+			seq_printf(seq, "%s: count %llu, timing %llu, average %llu\n",
+				Timingstring[i],
+				Countstats[i],
+				Timingstats[i],
+				Countstats[i] ?
+				Timingstats[i] / Countstats[i] : 0);
+		} else {
+			seq_printf(seq, "%s: count %llu\n",
+				Timingstring[i],
+				Countstats[i]);
+		}
+	}
+
+	seq_puts(seq, "\n");
+	return 0;
+}
+
+static int nova_seq_timing_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, nova_seq_timing_show, PDE_DATA(inode));
+}
+
+ssize_t nova_seq_clear_stats(struct file *filp, const char __user *buf,
+	size_t len, loff_t *ppos)
+{
+	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = mapping->host;
+	struct super_block *sb = PDE_DATA(inode);
+
+	nova_clear_stats(sb);
+	return len;
+}
+
+static const struct file_operations nova_seq_timing_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nova_seq_timing_open,
+	.read		= seq_read,
+	.write		= nova_seq_clear_stats,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int nova_seq_IO_show(struct seq_file *seq, void *v)
+{
+	struct super_block *sb = seq->private;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	unsigned long alloc_log_count = 0;
+	unsigned long alloc_log_pages = 0;
+	unsigned long alloc_data_count = 0;
+	unsigned long alloc_data_pages = 0;
+	unsigned long free_log_count = 0;
+	unsigned long freed_log_pages = 0;
+	unsigned long free_data_count = 0;
+	unsigned long freed_data_pages = 0;
+	int i;
+
+	nova_get_timing_stats();
+	nova_get_IO_stats();
+
+	seq_puts(seq, "============ NOVA allocation stats ============\n\n");
+
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+
+		alloc_log_count += free_list->alloc_log_count;
+		alloc_log_pages += free_list->alloc_log_pages;
+		alloc_data_count += free_list->alloc_data_count;
+		alloc_data_pages += free_list->alloc_data_pages;
+		free_log_count += free_list->free_log_count;
+		freed_log_pages += free_list->freed_log_pages;
+		free_data_count += free_list->free_data_count;
+		freed_data_pages += free_list->freed_data_pages;
+	}
+
+	seq_printf(seq, "alloc log count %lu, allocated log pages %lu\n"
+		"alloc data count %lu, allocated data pages %lu\n"
+		"free log count %lu, freed log pages %lu\n"
+		"free data count %lu, freed data pages %lu\n",
+		alloc_log_count, alloc_log_pages,
+		alloc_data_count, alloc_data_pages,
+		free_log_count, freed_log_pages,
+		free_data_count, freed_data_pages);
+
+	seq_printf(seq, "Fast GC %llu, checked pages %llu, free pages %llu, average %llu\n",
+		Countstats[fast_gc_t], IOstats[fast_checked_pages],
+		IOstats[fast_gc_pages], Countstats[fast_gc_t] ?
+			IOstats[fast_gc_pages] / Countstats[fast_gc_t] : 0);
+	seq_printf(seq, "Thorough GC %llu, checked pages %llu, free pages %llu, average %llu\n",
+		Countstats[thorough_gc_t],
+		IOstats[thorough_checked_pages], IOstats[thorough_gc_pages],
+		Countstats[thorough_gc_t] ?
+			IOstats[thorough_gc_pages] / Countstats[thorough_gc_t]
+			: 0);
+
+	seq_puts(seq, "\n");
+
+	seq_puts(seq, "================ NOVA I/O stats ================\n\n");
+	seq_printf(seq, "Read %llu, bytes %llu, average %llu\n",
+		Countstats[dax_read_t], IOstats[read_bytes],
+		Countstats[dax_read_t] ?
+			IOstats[read_bytes] / Countstats[dax_read_t] : 0);
+	seq_printf(seq, "COW write %llu, bytes %llu, average %llu, write breaks %llu, average %llu\n",
+		Countstats[cow_write_t], IOstats[cow_write_bytes],
+		Countstats[cow_write_t] ?
+			IOstats[cow_write_bytes] / Countstats[cow_write_t] : 0,
+		IOstats[cow_write_breaks], Countstats[cow_write_t] ?
+			IOstats[cow_write_breaks] / Countstats[cow_write_t]
+			: 0);
+	seq_printf(seq, "Inplace write %llu, bytes %llu, average %llu, write breaks %llu, average %llu\n",
+		Countstats[inplace_write_t], IOstats[inplace_write_bytes],
+		Countstats[inplace_write_t] ?
+			IOstats[inplace_write_bytes] /
+			Countstats[inplace_write_t] : 0,
+		IOstats[inplace_write_breaks], Countstats[inplace_write_t] ?
+			IOstats[inplace_write_breaks] /
+			Countstats[inplace_write_t] : 0);
+	seq_printf(seq, "Inplace write %llu, allocate new blocks %llu\n",
+			Countstats[inplace_write_t],
+			IOstats[inplace_new_blocks]);
+	seq_printf(seq, "DAX get blocks %llu, allocate new blocks %llu\n",
+			Countstats[dax_get_block_t], IOstats[dax_new_blocks]);
+	seq_printf(seq, "Page fault %llu\n", Countstats[mmap_fault_t]);
+	seq_printf(seq, "fsync %llu, fdatasync %llu\n",
+			Countstats[fsync_t], IOstats[fdatasync]);
+
+	seq_puts(seq, "\n");
+
+	return 0;
+}
+
+static int nova_seq_IO_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, nova_seq_IO_show, PDE_DATA(inode));
+}
+
+static const struct file_operations nova_seq_IO_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nova_seq_IO_open,
+	.read		= seq_read,
+	.write		= nova_seq_clear_stats,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int nova_seq_show_allocator(struct seq_file *seq, void *v)
+{
+	struct super_block *sb = seq->private;
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+	struct free_list *free_list;
+	int i;
+	unsigned long log_pages = 0;
+	unsigned long data_pages = 0;
+
+	seq_puts(seq, "======== NOVA per-CPU allocator stats ========\n");
+	for (i = 0; i < sbi->cpus; i++) {
+		free_list = nova_get_free_list(sb, i);
+		seq_printf(seq, "Free list %d: block start %lu, block end %lu, num_blocks %lu, num_free_blocks %lu, blocknode %lu\n",
+			i, free_list->block_start, free_list->block_end,
+			free_list->block_end - free_list->block_start + 1,
+			free_list->num_free_blocks, free_list->num_blocknode);
+
+		if (free_list->first_node) {
+			seq_printf(seq, "First node %lu - %lu\n",
+					free_list->first_node->range_low,
+					free_list->first_node->range_high);
+		}
+
+		if (free_list->last_node) {
+			seq_printf(seq, "Last node %lu - %lu\n",
+					free_list->last_node->range_low,
+					free_list->last_node->range_high);
+		}
+
+		seq_printf(seq, "Free list %d: alloc log count %lu, allocated log pages %lu, alloc data count %lu, allocated data pages %lu, free log count %lu, freed log pages %lu, free data count %lu, freed data pages %lu\n",
+			   i,
+			   free_list->alloc_log_count,
+			   free_list->alloc_log_pages,
+			   free_list->alloc_data_count,
+			   free_list->alloc_data_pages,
+			   free_list->free_log_count,
+			   free_list->freed_log_pages,
+			   free_list->free_data_count,
+			   free_list->freed_data_pages);
+
+		log_pages += free_list->alloc_log_pages;
+		log_pages -= free_list->freed_log_pages;
+
+		data_pages += free_list->alloc_data_pages;
+		data_pages -= free_list->freed_data_pages;
+	}
+
+	seq_printf(seq, "\nCurrently used pmem pages: log %lu, data %lu\n",
+			log_pages, data_pages);
+
+	return 0;
+}
+
+static int nova_seq_allocator_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, nova_seq_show_allocator,
+				PDE_DATA(inode));
+}
+
+static const struct file_operations nova_seq_allocator_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nova_seq_allocator_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+
+/* ====================== GC ======================== */
+
+
+static int nova_seq_gc_show(struct seq_file *seq, void *v)
+{
+	seq_printf(seq, "Echo inode number to trigger garbage collection\n"
+		   "    example: echo 34 > /proc/fs/NOVA/pmem0/gc\n");
+	return 0;
+}
+
+static int nova_seq_gc_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, nova_seq_gc_show, PDE_DATA(inode));
+}
+
+ssize_t nova_seq_gc(struct file *filp, const char __user *buf,
+	size_t len, loff_t *ppos)
+{
+	u64 target_inode_number;
+	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = mapping->host;
+	struct super_block *sb = PDE_DATA(inode);
+	struct inode *target_inode;
+	struct nova_inode *target_pi;
+	struct nova_inode_info *target_sih;
+
+	int ret;
+	char *_buf;
+	int retval = len;
+
+	_buf = kmalloc(len + 1, GFP_KERNEL);
+	if (_buf == NULL)  {
+		retval = -ENOMEM;
+		nova_dbg("%s: kmalloc failed\n", __func__);
+		goto out;
+	}
+
+	if (copy_from_user(_buf, buf, len)) {
+		retval = -EFAULT;
+		goto out;
+	}
+
+	_buf[len] = 0;
+	ret = kstrtoull(_buf, 0, &target_inode_number);
+	if (ret) {
+		nova_info("%s: Could not parse ino '%s'\n", __func__, _buf);
+		retval = ret;
+		goto out;
+	}
+	nova_info("%s: target_inode_number=%llu.", __func__,
+		  target_inode_number);
+
+	target_inode = nova_iget(sb, target_inode_number);
+	if (target_inode == NULL) {
+		nova_info("%s: inode %llu does not exist.", __func__,
+			  target_inode_number);
+		retval = -ENOENT;
+		goto out;
+	}
+
+	target_pi = nova_get_inode(sb, target_inode);
+	if (target_pi == NULL) {
+		nova_info("%s: couldn't get nova inode %llu.", __func__,
+			  target_inode_number);
+		retval = -ENOENT;
+		goto out;
+	}
+
+	target_sih = NOVA_I(target_inode);
+
+	nova_info("%s: got inode %llu @ 0x%p; pi=0x%p\n", __func__,
+		  target_inode_number, target_inode, target_pi);
+
+	nova_inode_log_fast_gc(sb, target_pi, &target_sih->header,
+			       0, 0, 0, 1);
+	iput(target_inode);
+
+out:
+	kfree(_buf);
+	return retval;
+}
+
+static const struct file_operations nova_seq_gc_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nova_seq_gc_open,
+	.read		= seq_read,
+	.write		= nova_seq_gc,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+/* ====================== Setup/teardown ======================== */
+void nova_sysfs_init(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (nova_proc_root)
+		sbi->s_proc = proc_mkdir(sbi->s_bdev->bd_disk->disk_name,
+					 nova_proc_root);
+
+	if (sbi->s_proc) {
+		proc_create_data("timing_stats", 0444, sbi->s_proc,
+				 &nova_seq_timing_fops, sb);
+		proc_create_data("IO_stats", 0444, sbi->s_proc,
+				 &nova_seq_IO_fops, sb);
+		proc_create_data("allocator", 0444, sbi->s_proc,
+				 &nova_seq_allocator_fops, sb);
+		proc_create_data("gc", 0444, sbi->s_proc,
+				 &nova_seq_gc_fops, sb);
+	}
+}
+
+void nova_sysfs_exit(struct super_block *sb)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (sbi->s_proc) {
+		remove_proc_entry("timing_stats", sbi->s_proc);
+		remove_proc_entry("IO_stats", sbi->s_proc);
+		remove_proc_entry("allocator", sbi->s_proc);
+		remove_proc_entry("gc", sbi->s_proc);
+		remove_proc_entry(sbi->s_bdev->bd_disk->disk_name,
+					nova_proc_root);
+	}
+}
-- 
2.7.4
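
A note on the gc write handler in the patch above: a handler that copies len
bytes from the caller and then stores a terminator at index len needs a buffer
of len + 1 bytes, and every error path after the allocation has to free it.
The following is a minimal userspace sketch of that pattern, not code from the
patch. The helper name parse_ino() is hypothetical, and malloc()/strtoull()
stand in for kmalloc()/kstrtoull().

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Userspace sketch (hypothetical helper, not part of the patch): parse an
 * unsigned number from a buffer of `len` bytes that is NOT NUL-terminated.
 * Returns 0 on success or a negative errno value.
 */
static int parse_ino(const char *buf, size_t len, unsigned long long *out)
{
	char *copy, *end;
	int ret = 0;

	copy = malloc(len + 1);		/* one extra byte for the terminator */
	if (!copy)
		return -ENOMEM;

	memcpy(copy, buf, len);
	copy[len] = '\0';		/* in bounds only because of len + 1 */

	errno = 0;
	*out = strtoull(copy, &end, 0);
	if (end == copy || errno == ERANGE) {
		ret = -EINVAL;
	} else {
		if (*end == '\n')	/* "echo 34 > file" sends a newline */
			end++;
		if (*end != '\0')
			ret = -EINVAL;
	}

	free(copy);			/* single exit path frees the copy */
	return ret;
}
```

Routing every failure through one exit that frees the copy is the same idea as
the "goto out" convention used elsewhere in the handler.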

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC v2 00/83] NOVA: a new file system for persistent memory
  2018-03-10 18:17 [RFC v2 00/83] NOVA: a new file system for persistent memory Andiry Xu
                   ` (82 preceding siblings ...)
  2018-03-10 18:19 ` [RFC v2 83/83] Sysfs support Andiry Xu
@ 2018-03-11  2:14 ` Theodore Y. Ts'o
  2018-03-11  4:58   ` Andiry Xu
  83 siblings, 1 reply; 119+ messages in thread
From: Theodore Y. Ts'o @ 2018-03-11  2:14 UTC (permalink / raw)
  To: Andiry Xu
  Cc: linux-fsdevel, linux-kernel, linux-nvdimm, dan.j.williams,
	andy.rudoff, coughlan, swanson, david, jack, swhiteho, miklos,
	andiry.xu, Andiry Xu

FYI, your patch set doesn't even compile for me without these fixups.
I'm not sure why you were trying to declare inline functions in a
header file without the function body?

						- Ted
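
For readers following along, here is a minimal single-file sketch of the two
conventions that usually work. A helper that is meant to be inlined is
declared "static inline" with its body in the header. Anything defined in a
.c file gets a plain prototype in the header, without "inline". The names
(struct range, range_len, range_contains) are hypothetical and only
illustrate the layout. A likely reason the original form fails, depending on
the configuration, is that kernel builds can map "inline" to an always-inline
attribute, and a caller that only sees a bodiless declaration has nothing to
inline.

```c
/* --- what would live in a header, e.g. a hypothetical range.h --- */

struct range {
	unsigned long low;
	unsigned long high;	/* inclusive */
};

/* Small helper meant to be inlined: the body lives in the header. */
static inline unsigned long range_len(const struct range *r)
{
	return r->high - r->low + 1;
}

/* Out-of-line function: the header carries a plain prototype only. */
int range_contains(const struct range *r, unsigned long key);

/* --- what would live in exactly one .c file --- */

int range_contains(const struct range *r, unsigned long key)
{
	return key >= r->low && key <= r->high;
}
```

The fixups below take the second route: they drop "inline" from both the
prototypes and the definitions.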

diff --git a/fs/nova/balloc.c b/fs/nova/balloc.c
index 8e992156f28c..9c7b74aa712e 100644
--- a/fs/nova/balloc.c
+++ b/fs/nova/balloc.c
@@ -74,12 +74,12 @@ static void nova_init_free_list(struct super_block *sb,
 		free_list->block_end -= sbi->tail_reserved_blocks;
 }
 
-inline struct nova_range_node *nova_alloc_blocknode(struct super_block *sb)
+struct nova_range_node *nova_alloc_blocknode(struct super_block *sb)
 {
 	return nova_alloc_range_node(sb);
 }
 
-inline void nova_free_blocknode(struct super_block *sb,
+void nova_free_blocknode(struct super_block *sb,
 	struct nova_range_node *node)
 {
 	nova_free_range_node(node);
@@ -206,7 +206,7 @@ int nova_insert_range_node(struct rb_root *tree,
 	return 0;
 }
 
-inline int nova_insert_blocktree(struct nova_sb_info *sbi,
+int nova_insert_blocktree(struct nova_sb_info *sbi,
 	struct rb_root *tree, struct nova_range_node *new_node)
 {
 	int ret;
@@ -659,7 +659,7 @@ static int nova_new_blocks(struct super_block *sb, unsigned long *blocknr,
 
 // Allocate data blocks.  The offset for the allocated block comes back in
 // blocknr.  Return the number of blocks allocated.
-inline int nova_new_data_blocks(struct super_block *sb,
+int nova_new_data_blocks(struct super_block *sb,
 	struct nova_inode_info_header *sih, unsigned long *blocknr,
 	unsigned long start_blk, unsigned int num,
 	enum nova_alloc_init zero, int cpu,
diff --git a/fs/nova/balloc.h b/fs/nova/balloc.h
index 463fbac99eff..aca7e8c18dde 100644
--- a/fs/nova/balloc.h
+++ b/fs/nova/balloc.h
@@ -62,18 +62,18 @@ enum alloc_type {
 
 int nova_alloc_block_free_lists(struct super_block *sb);
 void nova_delete_free_lists(struct super_block *sb);
-inline struct nova_range_node *nova_alloc_blocknode(struct super_block *sb);
-inline void nova_free_blocknode(struct super_block *sb,
+struct nova_range_node *nova_alloc_blocknode(struct super_block *sb);
+void nova_free_blocknode(struct super_block *sb,
 	struct nova_range_node *bnode);
 extern void nova_init_blockmap(struct super_block *sb, int recovery);
 extern unsigned long nova_count_free_blocks(struct super_block *sb);
-inline int nova_insert_blocktree(struct nova_sb_info *sbi,
+int nova_insert_blocktree(struct nova_sb_info *sbi,
 	struct rb_root *tree, struct nova_range_node *new_node);
 extern int nova_free_data_blocks(struct super_block *sb,
 	struct nova_inode_info_header *sih, unsigned long blocknr, int num);
 extern int nova_free_log_blocks(struct super_block *sb,
 	struct nova_inode_info_header *sih, unsigned long blocknr, int num);
-extern inline int nova_new_data_blocks(struct super_block *sb,
+extern int nova_new_data_blocks(struct super_block *sb,
 	struct nova_inode_info_header *sih, unsigned long *blocknr,
 	unsigned long start_blk, unsigned int num,
 	enum nova_alloc_init zero, int cpu,
diff --git a/fs/nova/inode.c b/fs/nova/inode.c
index 21be31a05d26..31ef258978ba 100644
--- a/fs/nova/inode.c
+++ b/fs/nova/inode.c
@@ -440,7 +440,7 @@ struct inode *nova_iget(struct super_block *sb, unsigned long ino)
 	return ERR_PTR(err);
 }
 
-inline int nova_insert_inodetree(struct nova_sb_info *sbi,
+int nova_insert_inodetree(struct nova_sb_info *sbi,
 	struct nova_range_node *new_node, int cpu)
 {
 	struct rb_root *tree;
diff --git a/fs/nova/inode.h b/fs/nova/inode.h
index 086a7cba8ac3..1097e15ff7af 100644
--- a/fs/nova/inode.h
+++ b/fs/nova/inode.h
@@ -254,7 +254,7 @@ int nova_init_inode_table(struct super_block *sb);
 int nova_get_inode_address(struct super_block *sb, u64 ino,
 	u64 *pi_addr, int extendable);
 struct inode *nova_iget(struct super_block *sb, unsigned long ino);
-inline int nova_insert_inodetree(struct nova_sb_info *sbi,
+int nova_insert_inodetree(struct nova_sb_info *sbi,
 	struct nova_range_node *new_node, int cpu);
 u64 nova_new_nova_inode(struct super_block *sb, u64 *pi_addr);
 struct inode *nova_new_vfs_inode(enum nova_new_inode_type type,
diff --git a/fs/nova/super.c b/fs/nova/super.c
index 039c003b698b..9f06ec847c95 100644
--- a/fs/nova/super.c
+++ b/fs/nova/super.c
@@ -795,23 +795,23 @@ static void nova_put_super(struct super_block *sb)
 	sb->s_fs_info = NULL;
 }
 
-inline void nova_free_range_node(struct nova_range_node *node)
+void nova_free_range_node(struct nova_range_node *node)
 {
 	kmem_cache_free(nova_range_node_cachep, node);
 }
 
-inline void nova_free_inode_node(struct super_block *sb,
+void nova_free_inode_node(struct super_block *sb,
 	struct nova_range_node *node)
 {
 	nova_free_range_node(node);
 }
 
-inline void nova_free_file_write_item(struct nova_file_write_item *item)
+void nova_free_file_write_item(struct nova_file_write_item *item)
 {
 	kmem_cache_free(nova_file_write_item_cachep, item);
 }
 
-inline struct nova_file_write_item *
+struct nova_file_write_item *
 nova_alloc_file_write_item(struct super_block *sb)
 {
 	struct nova_file_write_item *p;
diff --git a/fs/nova/super.h b/fs/nova/super.h
index bcbe862ac914..dc98346266e1 100644
--- a/fs/nova/super.h
+++ b/fs/nova/super.h
@@ -160,11 +160,11 @@ static inline struct nova_super_block *nova_get_super(struct super_block *sb)
 
 extern void nova_error_mng(struct super_block *sb, const char *fmt, ...);
 extern struct nova_range_node *nova_alloc_range_node(struct super_block *sb);
-extern inline struct nova_range_node *nova_alloc_inode_node(struct super_block *sb);
+extern struct nova_range_node *nova_alloc_inode_node(struct super_block *sb);
 extern struct nova_file_write_item *
 nova_alloc_file_write_item(struct super_block *sb);
 extern void nova_free_range_node(struct nova_range_node *node);
-extern inline void nova_free_inode_node(struct super_block *sb,
+extern void nova_free_inode_node(struct super_block *sb,
 	struct nova_range_node *node);
 void nova_free_file_write_item(struct nova_file_write_item *item);
 

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC v2 00/83] NOVA: a new file system for persistent memory
  2018-03-11  2:14 ` [RFC v2 00/83] NOVA: a new file system for persistent memory Theodore Y. Ts'o
@ 2018-03-11  4:58   ` Andiry Xu
  0 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-11  4:58 UTC (permalink / raw)
  To: Theodore Y. Ts'o, Andiry Xu, Linux FS Devel, linux-kernel,
	linux-nvdimm, Dan Williams, Rudoff, Andy, coughlan,
	Steven Swanson, Dave Chinner, jack, swhiteho, miklos, Jian Xu,
	Andiry Xu

On Sat, Mar 10, 2018 at 6:14 PM, Theodore Y. Ts'o <tytso@mit.edu> wrote:
> FYI, your patch set doesn't even compile for me without these fixups.
> I'm not sure why you were trying to declare inline functions in a
> header file without the function body?
>

Thanks for catching this. I will fix it in the next version and adopt
stricter flags next time.

Thanks,
Andiry


^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC v2 14/83] Add range node kmem cache.
  2018-03-10 18:17 ` [RFC v2 14/83] Add range node kmem cache Andiry Xu
@ 2018-03-11 11:55   ` Nikolay Borisov
  2018-03-11 21:31     ` Andiry Xu
  0 siblings, 1 reply; 119+ messages in thread
From: Nikolay Borisov @ 2018-03-11 11:55 UTC (permalink / raw)
  To: Andiry Xu, linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu



On 10.03.2018 20:17, Andiry Xu wrote:
> From: Andiry Xu <jix024@cs.ucsd.edu>
> 
> A range node specifies a range [start, end] and is managed by a red-black tree.
> NOVA uses range nodes to manage the NVM allocator and the inodes in use.
> 
> Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
> ---
>  fs/nova/nova.h  |  8 ++++++++
>  fs/nova/super.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
>  fs/nova/super.h |  2 ++
>  3 files changed, 52 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/nova/nova.h b/fs/nova/nova.h
> index ba7ffca..e0e85fb 100644
> --- a/fs/nova/nova.h
> +++ b/fs/nova/nova.h
> @@ -301,6 +301,14 @@ static inline u64 nova_get_epoch_id(struct super_block *sb)
>  }
>  
>  #include "inode.h"
> +
> +/* A node in the RB tree representing a range of pages */
> +struct nova_range_node {
> +	struct rb_node node;
> +	unsigned long range_low;
> +	unsigned long range_high;
> +};
> +
>  #include "bbuild.h"
>  
>  /* ====================================================== */
> diff --git a/fs/nova/super.c b/fs/nova/super.c
> index f41cc04..aec1cd3 100644
> --- a/fs/nova/super.c
> +++ b/fs/nova/super.c
> @@ -52,6 +52,7 @@ MODULE_PARM_DESC(nova_dbgmask, "Control debugging output");
>  static struct super_operations nova_sops;
>  
>  static struct kmem_cache *nova_inode_cachep;
> +static struct kmem_cache *nova_range_node_cachep;
>  
>  
>  /* FIXME: should the following variable be one per NOVA instance? */
> @@ -686,6 +687,20 @@ static void nova_put_super(struct super_block *sb)
>  	sb->s_fs_info = NULL;
>  }
>  
> +inline void nova_free_range_node(struct nova_range_node *node)
> +{
> +	kmem_cache_free(nova_range_node_cachep, node);
> +}
> +
> +inline struct nova_range_node *nova_alloc_range_node(struct super_block *sb)
> +{
> +	struct nova_range_node *p;
> +
> +	p = (struct nova_range_node *)
nit: needless cast
> +		kmem_cache_zalloc(nova_range_node_cachep, GFP_NOFS);
> +	return p;
> +}
> +
>  static struct inode *nova_alloc_inode(struct super_block *sb)
>  {
>  	struct nova_inode_info *vi;
> @@ -719,6 +734,17 @@ static void init_once(void *foo)
>  	inode_init_once(&vi->vfs_inode);
>  }
>  
> +static int __init init_rangenode_cache(void)
> +{
> +	nova_range_node_cachep = kmem_cache_create("nova_range_node_cache",
> +					sizeof(struct nova_range_node),
> +					0, (SLAB_RECLAIM_ACCOUNT |

> +					SLAB_MEM_SPREAD), NULL);
> +	if (nova_range_node_cachep == NULL)
> +		return -ENOMEM;
> +	return 0;
> +}
> +
>  static int __init init_inodecache(void)
>  {
>  	nova_inode_cachep = kmem_cache_create("nova_inode_cache",
> @@ -740,6 +766,11 @@ static void destroy_inodecache(void)
>  	kmem_cache_destroy(nova_inode_cachep);
>  }
>  
> +static void destroy_rangenode_cache(void)
> +{
> +	kmem_cache_destroy(nova_range_node_cachep);
> +}
> +
>  
>  /*
>   * the super block writes are all done "on the fly", so the
> @@ -781,20 +812,27 @@ static int __init init_nova_fs(void)
>  	nova_info("Arch new instructions support: CLWB %s\n",
>  			support_clwb ? "YES" : "NO");
>  
> -	rc = init_inodecache();
> +	rc = init_rangenode_cache();
>  	if (rc)
>  		goto out;
>  
> -	rc = register_filesystem(&nova_fs_type);
> +	rc = init_inodecache();
>  	if (rc)
>  		goto out1;
>  
> +	rc = register_filesystem(&nova_fs_type);
> +	if (rc)
> +		goto out2;
> +
>  out:
>  	NOVA_END_TIMING(init_t, init_time);
>  	return rc;
>  
> -out1:
> +out2:
>  	destroy_inodecache();
> +
> +out1:
> +	destroy_rangenode_cache();
>  	goto out;
>  }
>  
> @@ -802,6 +840,7 @@ static void __exit exit_nova_fs(void)
>  {
>  	unregister_filesystem(&nova_fs_type);
>  	destroy_inodecache();
> +	destroy_rangenode_cache();
>  }
>  
>  MODULE_AUTHOR("Andiry Xu <jix024@cs.ucsd.edu>");
> diff --git a/fs/nova/super.h b/fs/nova/super.h
> index cb53908..b478080 100644
> --- a/fs/nova/super.h
> +++ b/fs/nova/super.h
> @@ -145,5 +145,7 @@ static inline struct nova_super_block *nova_get_super(struct super_block *sb)
>  }
>  
>  extern void nova_error_mng(struct super_block *sb, const char *fmt, ...);
> +extern struct nova_range_node *nova_alloc_range_node(struct super_block *sb);
> +extern void nova_free_range_node(struct nova_range_node *node);
>  
>  #endif
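
To make the [range_low, range_high] idea concrete, here is a small userspace
sketch of how a key can be matched against inclusive ranges. It is an
assumption about how such nodes are typically searched, not the comparison
routine from the series. struct range_node is a simplified stand-in for
struct nova_range_node without the rb_node, and a sorted array with binary
search stands in for the red-black tree.

```c
#include <stddef.h>

/* Simplified stand-in for struct nova_range_node, without the rb_node. */
struct range_node {
	unsigned long range_low;
	unsigned long range_high;
};

/* Three-way compare of a key against an inclusive range [low, high]. */
static int range_compare(const struct range_node *n, unsigned long key)
{
	if (key < n->range_low)
		return -1;
	if (key > n->range_high)
		return 1;
	return 0;
}

/*
 * Binary search over ranges sorted by range_low and non-overlapping.
 * A red-black tree walk makes the same left/right decision at each node.
 */
static const struct range_node *range_find(const struct range_node *v,
					   size_t n, unsigned long key)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int c = range_compare(&v[mid], key);

		if (c == 0)
			return &v[mid];
		if (c < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return NULL;
}
```

Because the ranges do not overlap, a single three-way compare is enough to
decide whether to stop, go left, or go right.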
> 

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC v2 05/83] Add NOVA filesystem definitions and useful helper routines.
  2018-03-10 18:17 ` [RFC v2 05/83] Add NOVA filesystem definitions and useful helper routines Andiry Xu
@ 2018-03-11 12:00   ` Nikolay Borisov
  2018-03-11 19:22     ` Eric Biggers
  0 siblings, 1 reply; 119+ messages in thread
From: Nikolay Borisov @ 2018-03-11 12:00 UTC (permalink / raw)
  To: Andiry Xu, linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu, herbert

[Adding Herbert Xu to CC since he is the maintainer of the crypto
subsystem]

On 10.03.2018 20:17, Andiry Xu wrote:
<snip>

> +static inline u32 nova_crc32c(u32 crc, const u8 *data, size_t len)
> +{
> +	u8 *ptr = (u8 *) data;
> +	u64 acc = crc; /* accumulator, crc32c value in lower 32b */
> +	u32 csum;
> +
> +	/* x86 instruction crc32 is part of SSE-4.2 */
> +	if (static_cpu_has(X86_FEATURE_XMM4_2)) {
> +		/* This inline assembly implementation should be equivalent
> +		 * to the kernel's crc32c_intel_le_hw() function used by
> +		 * crc32c(), but this performs better on test machines.
> +		 */
> +		while (len > 8) {
> +			asm volatile(/* 64b quad words */
> +				"crc32q (%1), %0"
> +				: "=r" (acc)
> +				: "r"  (ptr), "0" (acc)
> +			);
> +			ptr += 8;
> +			len -= 8;
> +		}
> +
> +		while (len > 0) {
> +			asm volatile(/* trailing bytes */
> +				"crc32b (%1), %0"
> +				: "=r" (acc)
> +				: "r"  (ptr), "0" (acc)
> +			);
> +			ptr++;
> +			len--;
> +		}
> +
> +		csum = (u32) acc;
> +	} else {
> +		/* The kernel's crc32c() function should also detect and use the
> +		 * crc32 instruction of SSE-4.2. But calling in to this function
> +		 * is about 3x to 5x slower than the inline assembly version on
> +		 * some test machines.

That is really odd. Did you try to characterize why this is the case? Is
it purely the overhead of dispatching to the correct backend function?
That's a rather big performance hit.

> +		 */
> +		csum = crc32c(crc, data, len);
> +	}
> +
> +	return csum;
> +}
> +

<snip>
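
As a reference point when comparing the two paths, here is a portable
bit-at-a-time sketch of the CRC32C (Castagnoli) update. It is not the kernel's
implementation and not the inline assembly above. The function name crc32c_sw
is hypothetical. Like a crc32c(crc, data, len) style call, it applies no
initial or final inversion, so the caller picks the seed.

```c
#include <stddef.h>
#include <stdint.h>

/* Reflected form of the Castagnoli polynomial 0x1EDC6F41. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/*
 * Raw CRC32C update, one bit at a time: no initial or final inversion,
 * so the caller chooses the seed and any final XOR.
 */
static uint32_t crc32c_sw(uint32_t crc, const uint8_t *data, size_t len)
{
	while (len--) {
		crc ^= *data++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^
			      (CRC32C_POLY_REFLECTED & -(crc & 1u));
	}
	return crc;
}
```

With a seed of 0xFFFFFFFF and a final XOR with 0xFFFFFFFF, the string
"123456789" yields the standard CRC-32C check value 0xE3069283, which makes a
convenient cross-check for any faster variant.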

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC v2 16/83] Initialize block map and free lists in nova_init().
  2018-03-10 18:17 ` [RFC v2 16/83] Initialize block map and free lists in nova_init() Andiry Xu
@ 2018-03-11 12:12   ` Nikolay Borisov
  2018-03-11 21:30     ` Andiry Xu
  0 siblings, 1 reply; 119+ messages in thread
From: Nikolay Borisov @ 2018-03-11 12:12 UTC (permalink / raw)
  To: Andiry Xu, linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu



On 10.03.2018 20:17, Andiry Xu wrote:
> From: Andiry Xu <jix024@cs.ucsd.edu>
> 
> NOVA divides the pmem range equally among per-CPU free lists,
> and formats the red-black trees by inserting the initial free range.
> 
> Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
> ---
>  fs/nova/balloc.c | 161 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/nova/balloc.h |  13 ++++-
>  fs/nova/super.c  |   2 +
>  3 files changed, 175 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/nova/balloc.c b/fs/nova/balloc.c
> index 450c942..cb627db 100644
> --- a/fs/nova/balloc.c
> +++ b/fs/nova/balloc.c
> @@ -55,4 +55,165 @@ void nova_delete_free_lists(struct super_block *sb)
>  	sbi->free_lists = NULL;
>  }
>  
> +// Initialize a free list.  Each CPU gets an equal share of the block space to
> +// manage.
> +static void nova_init_free_list(struct super_block *sb,
> +	struct free_list *free_list, int index)
> +{
> +	struct nova_sb_info *sbi = NOVA_SB(sb);
> +	unsigned long per_list_blocks;
> +
> +	per_list_blocks = sbi->num_blocks / sbi->cpus;

nit: You've already initialised per_list_blocks in nova_init_blockmap,
which calls this function. So just reference it, rather than performing
the division every time.

> +
> +	free_list->block_start = per_list_blocks * index;
> +	free_list->block_end = free_list->block_start +
> +					per_list_blocks - 1;
> +	if (index == 0)
> +		free_list->block_start += sbi->head_reserved_blocks;
> +	if (index == sbi->cpus - 1)
> +		free_list->block_end -= sbi->tail_reserved_blocks;
> +}
> +
> +inline struct nova_range_node *nova_alloc_blocknode(struct super_block *sb)
> +{
> +	return nova_alloc_range_node(sb);
> +}
> +
> +inline void nova_free_blocknode(struct super_block *sb,
> +	struct nova_range_node *node)
> +{
> +	nova_free_range_node(node);
> +}
> +
> +void nova_init_blockmap(struct super_block *sb, int recovery)
> +{
> +	struct nova_sb_info *sbi = NOVA_SB(sb);
> +	struct rb_root *tree;
> +	struct nova_range_node *blknode;
> +	struct free_list *free_list;
> +	int i;
> +	int ret;
> +
> +	/* Divide the block range among per-CPU free lists */
> +	sbi->per_list_blocks = sbi->num_blocks / sbi->cpus;
> +	for (i = 0; i < sbi->cpus; i++) {
> +		free_list = nova_get_free_list(sb, i);
> +		tree = &(free_list->block_free_tree);
> +		nova_init_free_list(sb, free_list, i);
> +
> +		/* For recovery, update these fields later */
> +		if (recovery == 0) {
> +			free_list->num_free_blocks = free_list->block_end -
> +						free_list->block_start + 1;
> +
> +			blknode = nova_alloc_blocknode(sb);
> +			if (blknode == NULL)
> +				return;
> +			blknode->range_low = free_list->block_start;
> +			blknode->range_high = free_list->block_end;
> +			ret = nova_insert_blocktree(sbi, tree, blknode);
> +			if (ret) {
> +				nova_err(sb, "%s failed\n", __func__);
> +				nova_free_blocknode(sb, blknode);
> +				return;
> +			}
> +			free_list->first_node = blknode;
> +			free_list->last_node = blknode;
> +			free_list->num_blocknode = 1;
> +		}
> +
> +		nova_dbgv("%s: free list %d: block start %lu, end %lu, %lu free blocks\n",
> +			  __func__, i,
> +			  free_list->block_start,
> +			  free_list->block_end,
> +			  free_list->num_free_blocks);
> +	}
> +}
> +
> +static inline int nova_rbtree_compare_rangenode(struct nova_range_node *curr,
> +	unsigned long range_low)
> +{
> +	if (range_low < curr->range_low)
> +		return -1;
> +	if (range_low > curr->range_high)
> +		return 1;
>  
> +	return 0;
> +}
> +
> +int nova_find_range_node(struct nova_sb_info *sbi,
> +	struct rb_root *tree, unsigned long range_low,
> +	struct nova_range_node **ret_node)

Instead of having a **ret_node pointer as an argument, just make the
function return a struct nova_range_node * and have callers check for NULL:

struct nova_range_node *node = nova_find_range_node(sbi, tree, range_low);

if (node) {
	/* do stuff with node */
}

> +{
> +	struct nova_range_node *curr = NULL;
> +	struct rb_node *temp;
> +	int compVal;
> +	int ret = 0;
> +
> +	temp = tree->rb_node;
> +
> +	while (temp) {
> +		curr = container_of(temp, struct nova_range_node, node);
> +		compVal = nova_rbtree_compare_rangenode(curr, range_low);
> +
> +		if (compVal == -1) {
> +			temp = temp->rb_left;
> +		} else if (compVal == 1) {
> +			temp = temp->rb_right;
> +		} else {
> +			ret = 1;
> +			break;
> +		}
> +	}
> +
> +	*ret_node = curr;
> +	return ret;
> +}
> +
> +
> +int nova_insert_range_node(struct rb_root *tree,
> +	struct nova_range_node *new_node)
> +{
> +	struct nova_range_node *curr;
> +	struct rb_node **temp, *parent;
> +	int compVal;
> +
> +	temp = &(tree->rb_node);
> +	parent = NULL;
> +
> +	while (*temp) {
> +		curr = container_of(*temp, struct nova_range_node, node);
> +		compVal = nova_rbtree_compare_rangenode(curr,
> +					new_node->range_low);
> +		parent = *temp;
> +
> +		if (compVal == -1) {
> +			temp = &((*temp)->rb_left);
> +		} else if (compVal == 1) {
> +			temp = &((*temp)->rb_right);
> +		} else {
> +			nova_dbg("%s: entry %lu - %lu already exists: %lu - %lu\n",
> +				 __func__, new_node->range_low,
> +				new_node->range_high, curr->range_low,
> +				curr->range_high);
> +			return -EINVAL;
> +		}
> +	}
> +
> +	rb_link_node(&new_node->node, parent, temp);
> +	rb_insert_color(&new_node->node, tree);
> +
> +	return 0;
> +}
> +
> +inline int nova_insert_blocktree(struct nova_sb_info *sbi,
> +	struct rb_root *tree, struct nova_range_node *new_node)
> +{
> +	int ret;
> +
> +	ret = nova_insert_range_node(tree, new_node);
> +	if (ret)
> +		nova_dbg("ERROR: %s failed %d\n", __func__, ret);
> +
> +	return ret;
> +}
> diff --git a/fs/nova/balloc.h b/fs/nova/balloc.h
> index e7c7a1d..57a93e4 100644
> --- a/fs/nova/balloc.h
> +++ b/fs/nova/balloc.h
> @@ -62,5 +62,16 @@ enum alloc_type {
>  
>  int nova_alloc_block_free_lists(struct super_block *sb);
>  void nova_delete_free_lists(struct super_block *sb);
> -
> +inline struct nova_range_node *nova_alloc_blocknode(struct super_block *sb);
> +inline void nova_free_blocknode(struct super_block *sb,
> +	struct nova_range_node *bnode);
> +extern void nova_init_blockmap(struct super_block *sb, int recovery);
> +inline int nova_insert_blocktree(struct nova_sb_info *sbi,
> +	struct rb_root *tree, struct nova_range_node *new_node);
> +
> +extern int nova_insert_range_node(struct rb_root *tree,
> +				  struct nova_range_node *new_node);
> +extern int nova_find_range_node(struct nova_sb_info *sbi,
> +				struct rb_root *tree, unsigned long range_low,
> +				struct nova_range_node **ret_node);
>  #endif
> diff --git a/fs/nova/super.c b/fs/nova/super.c
> index 43b24a7..9762f26 100644
> --- a/fs/nova/super.c
> +++ b/fs/nova/super.c
> @@ -376,6 +376,8 @@ static struct nova_inode *nova_init(struct super_block *sb,
>  	pi->nova_ino = NOVA_BLOCKNODE_INO;
>  	nova_flush_buffer(pi, CACHELINE_SIZE, 1);
>  
> +	nova_init_blockmap(sb, 0);
> +
>  	sbi->nova_sb->s_size = cpu_to_le64(size);
>  	sbi->nova_sb->s_blocksize = cpu_to_le32(blocksize);
>  	sbi->nova_sb->s_magic = cpu_to_le32(NOVA_SUPER_MAGIC);
> 


* Re: [RFC v2 09/83] Add Kconfig and Makefile
  2018-03-10 18:17 ` [RFC v2 09/83] Add Kconfig and Makefile Andiry Xu
@ 2018-03-11 12:15   ` Nikolay Borisov
  2018-03-11 21:32     ` Andiry Xu
  0 siblings, 1 reply; 119+ messages in thread
From: Nikolay Borisov @ 2018-03-11 12:15 UTC (permalink / raw)
  To: Andiry Xu, linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu



On 10.03.2018 20:17, Andiry Xu wrote:
> From: Andiry Xu <jix024@cs.ucsd.edu>
> 
> Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
> ---
>  fs/Kconfig       |  2 ++
>  fs/Makefile      |  1 +
>  fs/nova/Kconfig  | 15 +++++++++++++++
>  fs/nova/Makefile |  7 +++++++
>  4 files changed, 25 insertions(+)
>  create mode 100644 fs/nova/Kconfig
>  create mode 100644 fs/nova/Makefile
> 
> diff --git a/fs/Kconfig b/fs/Kconfig
> index bc821a8..5e9ff3e 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -58,6 +58,8 @@ config FS_DAX_PMD
>  	depends on ZONE_DEVICE
>  	depends on TRANSPARENT_HUGEPAGE
>  
> +source "fs/nova/Kconfig"
> +
>  # Selected by DAX drivers that do not expect filesystem DAX to support
>  # get_user_pages() of DAX mappings. I.e. "limited" indicates no support
>  # for fork() of processes with MAP_SHARED mappings or support for
> diff --git a/fs/Makefile b/fs/Makefile
> index add789e..65ea619 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -113,6 +113,7 @@ obj-$(CONFIG_OMFS_FS)		+= omfs/
>  obj-$(CONFIG_JFS_FS)		+= jfs/
>  obj-$(CONFIG_XFS_FS)		+= xfs/
>  obj-$(CONFIG_9P_FS)		+= 9p/
> +obj-$(CONFIG_NOVA_FS)		+= nova/
>  obj-$(CONFIG_AFS_FS)		+= afs/
>  obj-$(CONFIG_NILFS2_FS)		+= nilfs2/
>  obj-$(CONFIG_BEFS_FS)		+= befs/
> diff --git a/fs/nova/Kconfig b/fs/nova/Kconfig
> new file mode 100644
> index 0000000..c1c692e
> --- /dev/null
> +++ b/fs/nova/Kconfig
> @@ -0,0 +1,15 @@
> +config NOVA_FS
> +	tristate "NOVA: log-structured file system for non-volatile memories"
> +	depends on FS_DAX
> +	select CRC32

What do you need crc32 for? Selecting libcrc32c is enough to do "the
right thing"

> +	select LIBCRC32C
> +	help
> +	  If your system has a block of fast (comparable in access speed to
> +	  system memory) and non-volatile byte-addressable memory and you wish
> +	  to mount a light-weight filesystem with strong consistency support
> +	  over it, say Y here.
> +
> +	  To compile this as a module, choose M here: the module will be
> +	  called nova.
> +
> +	  If unsure, say N.
> diff --git a/fs/nova/Makefile b/fs/nova/Makefile
> new file mode 100644
> index 0000000..eb19646
> --- /dev/null
> +++ b/fs/nova/Makefile
> @@ -0,0 +1,7 @@
> +#
> +# Makefile for the linux NOVA filesystem routines.
> +#
> +
> +obj-$(CONFIG_NOVA_FS) += nova.o
> +
> +nova-y := bbuild.o inode.o rebuild.o super.o
> 


* Re: [RFC v2 05/83] Add NOVA filesystem definitions and useful helper routines.
  2018-03-11 12:00   ` Nikolay Borisov
@ 2018-03-11 19:22     ` Eric Biggers
  2018-03-11 21:45       ` Andiry Xu
  2018-03-19 19:39       ` Andiry Xu
  0 siblings, 2 replies; 119+ messages in thread
From: Eric Biggers @ 2018-03-11 19:22 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Andiry Xu, linux-fsdevel, linux-kernel, linux-nvdimm,
	dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu, herbert

On Sun, Mar 11, 2018 at 02:00:13PM +0200, Nikolay Borisov wrote:
> [Adding Herbert Xu to CC since he is the maintainer of the crypto
> subsystem]
> 
> On 10.03.2018 20:17, Andiry Xu wrote:
> <snip>
> 
> > +static inline u32 nova_crc32c(u32 crc, const u8 *data, size_t len)
> > +{
> > +	u8 *ptr = (u8 *) data;
> > +	u64 acc = crc; /* accumulator, crc32c value in lower 32b */
> > +	u32 csum;
> > +
> > +	/* x86 instruction crc32 is part of SSE-4.2 */
> > +	if (static_cpu_has(X86_FEATURE_XMM4_2)) {
> > +		/* This inline assembly implementation should be equivalent
> > +		 * to the kernel's crc32c_intel_le_hw() function used by
> > +		 * crc32c(), but this performs better on test machines.
> > +		 */
> > +		while (len > 8) {
> > +			asm volatile(/* 64b quad words */
> > +				"crc32q (%1), %0"
> > +				: "=r" (acc)
> > +				: "r"  (ptr), "0" (acc)
> > +			);
> > +			ptr += 8;
> > +			len -= 8;
> > +		}
> > +
> > +		while (len > 0) {
> > +			asm volatile(/* trailing bytes */
> > +				"crc32b (%1), %0"
> > +				: "=r" (acc)
> > +				: "r"  (ptr), "0" (acc)
> > +			);
> > +			ptr++;
> > +			len--;
> > +		}
> > +
> > +		csum = (u32) acc;
> > +	} else {
> > +		/* The kernel's crc32c() function should also detect and use the
> > +		 * crc32 instruction of SSE-4.2. But calling in to this function
> > +		 * is about 3x to 5x slower than the inline assembly version on
> > +		 * some test machines.
> 
> That is really odd. Did you try to characterize why this is the case? Is
> it purely the overhead of dispatching to the correct backend function?
> That's a rather big performance hit.
> 
> > +		 */
> > +		csum = crc32c(crc, data, len);
> > +	}
> > +
> > +	return csum;
> > +}
> > +

Are you sure that CONFIG_CRYPTO_CRC32C_INTEL was enabled during your tests and
that the accelerated version was being called?  Or, perhaps CRC32C_PCL_BREAKEVEN
(defined in arch/x86/crypto/crc32c-intel_glue.c) needs to be adjusted.  Please
don't hack around performance problems like this; if they exist, they need to be
fixed for everyone.

Eric
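
Before comparing the speed of two CRC32C paths it helps to confirm that they
compute the same value. A slow bit-at-a-time reference is one way to do that;
the sketch below is a generic CRC32C update, not code from the kernel or from
NOVA, and sketch_crc32c is a hypothetical name. It assumes the convention
that the caller supplies the seed and that no inversion is applied inside the
function.

```c
#include <stddef.h>
#include <stdint.h>

/* Reflected form of the Castagnoli polynomial used by CRC32C. */
#define SKETCH_CRC32C_POLY 0x82F63B78u

/*
 * Bit-at-a-time CRC32C update. No initial or final inversion is done here;
 * the caller chooses the seed and any post-processing.
 */
static uint32_t sketch_crc32c(uint32_t crc, const uint8_t *data, size_t len)
{
	while (len--) {
		crc ^= *data++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1) ? SKETCH_CRC32C_POLY : 0);
	}
	return crc;
}
```

Seeding with 0xFFFFFFFF and inverting the result for the nine ASCII bytes
"123456789" gives the standard CRC-32C check value 0xE3069283, which makes a
convenient first comparison point for any faster implementation.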


* Re: [RFC v2 16/83] Initialize block map and free lists in nova_init().
  2018-03-11 12:12   ` Nikolay Borisov
@ 2018-03-11 21:30     ` Andiry Xu
  0 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-11 21:30 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Linux FS Devel, linux-kernel, linux-nvdimm, Dan Williams, Rudoff,
	Andy, coughlan, Steven Swanson, Dave Chinner, jack, swhiteho,
	miklos, Jian Xu, Andiry Xu

On Sun, Mar 11, 2018 at 5:12 AM, Nikolay Borisov
<n.borisov.lkml@gmail.com> wrote:
>
>
> On 10.03.2018 20:17, Andiry Xu wrote:
>> From: Andiry Xu <jix024@cs.ucsd.edu>
>>
>> NOVA divides the pmem range equally among per-CPU free lists,
>> and formats the red-black trees by inserting the initial free range.
>>
>> Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
>> ---
>>  fs/nova/balloc.c | 161 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/nova/balloc.h |  13 ++++-
>>  fs/nova/super.c  |   2 +
>>  3 files changed, 175 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/nova/balloc.c b/fs/nova/balloc.c
>> index 450c942..cb627db 100644
>> --- a/fs/nova/balloc.c
>> +++ b/fs/nova/balloc.c
>> @@ -55,4 +55,165 @@ void nova_delete_free_lists(struct super_block *sb)
>>       sbi->free_lists = NULL;
>>  }
>>
>> +// Initialize a free list.  Each CPU gets an equal share of the block space to
>> +// manage.
>> +static void nova_init_free_list(struct super_block *sb,
>> +     struct free_list *free_list, int index)
>> +{
>> +     struct nova_sb_info *sbi = NOVA_SB(sb);
>> +     unsigned long per_list_blocks;
>> +
>> +     per_list_blocks = sbi->num_blocks / sbi->cpus;
>
> nit: You've already initialised per_list_blocks in nova_init_blockmap,
> which calls this function. So just reference sbi->per_list_blocks here,
> rather than performing the division every time.
>

Thanks for catching this.

>> +
>> +     free_list->block_start = per_list_blocks * index;
>> +     free_list->block_end = free_list->block_start +
>> +                                     per_list_blocks - 1;
>> +     if (index == 0)
>> +             free_list->block_start += sbi->head_reserved_blocks;
>> +     if (index == sbi->cpus - 1)
>> +             free_list->block_end -= sbi->tail_reserved_blocks;
>> +}
>> +
>> +inline struct nova_range_node *nova_alloc_blocknode(struct super_block *sb)
>> +{
>> +     return nova_alloc_range_node(sb);
>> +}
>> +
>> +inline void nova_free_blocknode(struct super_block *sb,
>> +     struct nova_range_node *node)
>> +{
>> +     nova_free_range_node(node);
>> +}
>> +
>> +void nova_init_blockmap(struct super_block *sb, int recovery)
>> +{
>> +     struct nova_sb_info *sbi = NOVA_SB(sb);
>> +     struct rb_root *tree;
>> +     struct nova_range_node *blknode;
>> +     struct free_list *free_list;
>> +     int i;
>> +     int ret;
>> +
>> +     /* Divide the block range among per-CPU free lists */
>> +     sbi->per_list_blocks = sbi->num_blocks / sbi->cpus;
>> +     for (i = 0; i < sbi->cpus; i++) {
>> +             free_list = nova_get_free_list(sb, i);
>> +             tree = &(free_list->block_free_tree);
>> +             nova_init_free_list(sb, free_list, i);
>> +
>> +             /* For recovery, update these fields later */
>> +             if (recovery == 0) {
>> +                     free_list->num_free_blocks = free_list->block_end -
>> +                                             free_list->block_start + 1;
>> +
>> +                     blknode = nova_alloc_blocknode(sb);
>> +                     if (blknode == NULL)
>> +                             return;
>> +                     blknode->range_low = free_list->block_start;
>> +                     blknode->range_high = free_list->block_end;
>> +                     ret = nova_insert_blocktree(sbi, tree, blknode);
>> +                     if (ret) {
>> +                             nova_err(sb, "%s failed\n", __func__);
>> +                             nova_free_blocknode(sb, blknode);
>> +                             return;
>> +                     }
>> +                     free_list->first_node = blknode;
>> +                     free_list->last_node = blknode;
>> +                     free_list->num_blocknode = 1;
>> +             }
>> +
>> +             nova_dbgv("%s: free list %d: block start %lu, end %lu, %lu free blocks\n",
>> +                       __func__, i,
>> +                       free_list->block_start,
>> +                       free_list->block_end,
>> +                       free_list->num_free_blocks);
>> +     }
>> +}
>> +
>> +static inline int nova_rbtree_compare_rangenode(struct nova_range_node *curr,
>> +     unsigned long range_low)
>> +{
>> +     if (range_low < curr->range_low)
>> +             return -1;
>> +     if (range_low > curr->range_high)
>> +             return 1;
>>
>> +     return 0;
>> +}
>> +
>> +int nova_find_range_node(struct nova_sb_info *sbi,
>> +     struct rb_root *tree, unsigned long range_low,
>> +     struct nova_range_node **ret_node)
>
> Instead of having a **ret_node pointer as an argument, just make the
> function return a struct nova_range_node * and have callers check for NULL:
>
> struct nova_range_node *node = nova_find_range_node(sbi, tree, range_low);
>
> if (node) {
>	/* do stuff with node */
> }
>

I pass **ret_node as an argument because, when the target node is not
found, nova_find_range_node() still hands back the parent node for use
in nova_find_free_slot(). So it can return 0 together with a non-NULL
ret_node. Having it as a parameter makes this clearer.

Thanks,
Andiry
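
The behaviour described above can be sketched in userspace as follows. This
assumes a plain binary search tree in place of the kernel rbtree, and every
sketch_* name is hypothetical. The point it illustrates is that the return
value and the out-parameter carry different information: on a miss the
function returns 0 but *ret_node still points at the last node visited.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct nova_range_node on a plain BST. */
struct sketch_range_node {
	struct sketch_range_node *left, *right;
	unsigned long range_low;
	unsigned long range_high;
};

/*
 * Returns 1 if some node's [range_low, range_high] contains key, else 0.
 * Either way *ret_node is the last node visited: the match on a hit, or
 * the node the key would hang under on a miss (NULL only for an empty
 * tree).
 */
static int sketch_find_range_node(struct sketch_range_node *root,
	unsigned long key, struct sketch_range_node **ret_node)
{
	struct sketch_range_node *curr = NULL;
	struct sketch_range_node *temp = root;
	int ret = 0;

	while (temp) {
		curr = temp;
		if (key < curr->range_low) {
			temp = curr->left;
		} else if (key > curr->range_high) {
			temp = curr->right;
		} else {
			ret = 1;
			break;
		}
	}

	*ret_node = curr;
	return ret;
}
```

A single pointer return cannot distinguish "found this node" from "not found,
but this is the neighbouring node", which is why a caller that needs the
neighbour on a miss wants both results.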

>> +{
>> +     struct nova_range_node *curr = NULL;
>> +     struct rb_node *temp;
>> +     int compVal;
>> +     int ret = 0;
>> +
>> +     temp = tree->rb_node;
>> +
>> +     while (temp) {
>> +             curr = container_of(temp, struct nova_range_node, node);
>> +             compVal = nova_rbtree_compare_rangenode(curr, range_low);
>> +
>> +             if (compVal == -1) {
>> +                     temp = temp->rb_left;
>> +             } else if (compVal == 1) {
>> +                     temp = temp->rb_right;
>> +             } else {
>> +                     ret = 1;
>> +                     break;
>> +             }
>> +     }
>> +
>> +     *ret_node = curr;
>> +     return ret;
>> +}
>> +
>> +
>> +int nova_insert_range_node(struct rb_root *tree,
>> +     struct nova_range_node *new_node)
>> +{
>> +     struct nova_range_node *curr;
>> +     struct rb_node **temp, *parent;
>> +     int compVal;
>> +
>> +     temp = &(tree->rb_node);
>> +     parent = NULL;
>> +
>> +     while (*temp) {
>> +             curr = container_of(*temp, struct nova_range_node, node);
>> +             compVal = nova_rbtree_compare_rangenode(curr,
>> +                                     new_node->range_low);
>> +             parent = *temp;
>> +
>> +             if (compVal == -1) {
>> +                     temp = &((*temp)->rb_left);
>> +             } else if (compVal == 1) {
>> +                     temp = &((*temp)->rb_right);
>> +             } else {
>> +                     nova_dbg("%s: entry %lu - %lu already exists: %lu - %lu\n",
>> +                              __func__, new_node->range_low,
>> +                             new_node->range_high, curr->range_low,
>> +                             curr->range_high);
>> +                     return -EINVAL;
>> +             }
>> +     }
>> +
>> +     rb_link_node(&new_node->node, parent, temp);
>> +     rb_insert_color(&new_node->node, tree);
>> +
>> +     return 0;
>> +}
>> +
>> +inline int nova_insert_blocktree(struct nova_sb_info *sbi,
>> +     struct rb_root *tree, struct nova_range_node *new_node)
>> +{
>> +     int ret;
>> +
>> +     ret = nova_insert_range_node(tree, new_node);
>> +     if (ret)
>> +             nova_dbg("ERROR: %s failed %d\n", __func__, ret);
>> +
>> +     return ret;
>> +}
>> diff --git a/fs/nova/balloc.h b/fs/nova/balloc.h
>> index e7c7a1d..57a93e4 100644
>> --- a/fs/nova/balloc.h
>> +++ b/fs/nova/balloc.h
>> @@ -62,5 +62,16 @@ enum alloc_type {
>>
>>  int nova_alloc_block_free_lists(struct super_block *sb);
>>  void nova_delete_free_lists(struct super_block *sb);
>> -
>> +inline struct nova_range_node *nova_alloc_blocknode(struct super_block *sb);
>> +inline void nova_free_blocknode(struct super_block *sb,
>> +     struct nova_range_node *bnode);
>> +extern void nova_init_blockmap(struct super_block *sb, int recovery);
>> +inline int nova_insert_blocktree(struct nova_sb_info *sbi,
>> +     struct rb_root *tree, struct nova_range_node *new_node);
>> +
>> +extern int nova_insert_range_node(struct rb_root *tree,
>> +                               struct nova_range_node *new_node);
>> +extern int nova_find_range_node(struct nova_sb_info *sbi,
>> +                             struct rb_root *tree, unsigned long range_low,
>> +                             struct nova_range_node **ret_node);
>>  #endif
>> diff --git a/fs/nova/super.c b/fs/nova/super.c
>> index 43b24a7..9762f26 100644
>> --- a/fs/nova/super.c
>> +++ b/fs/nova/super.c
>> @@ -376,6 +376,8 @@ static struct nova_inode *nova_init(struct super_block *sb,
>>       pi->nova_ino = NOVA_BLOCKNODE_INO;
>>       nova_flush_buffer(pi, CACHELINE_SIZE, 1);
>>
>> +     nova_init_blockmap(sb, 0);
>> +
>>       sbi->nova_sb->s_size = cpu_to_le64(size);
>>       sbi->nova_sb->s_blocksize = cpu_to_le32(blocksize);
>>       sbi->nova_sb->s_magic = cpu_to_le32(NOVA_SUPER_MAGIC);
>>


* Re: [RFC v2 14/83] Add range node kmem cache.
  2018-03-11 11:55   ` Nikolay Borisov
@ 2018-03-11 21:31     ` Andiry Xu
  0 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-11 21:31 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Linux FS Devel, linux-kernel, linux-nvdimm, Dan Williams, Rudoff,
	Andy, coughlan, Steven Swanson, Dave Chinner, jack, swhiteho,
	miklos, Jian Xu, Andiry Xu

On Sun, Mar 11, 2018 at 4:55 AM, Nikolay Borisov
<n.borisov.lkml@gmail.com> wrote:
>
>
> On 10.03.2018 20:17, Andiry Xu wrote:
>> From: Andiry Xu <jix024@cs.ucsd.edu>
>>
>> A range node specifies a range of [start, end] and is managed by a red-black tree.
>> NOVA uses range nodes to manage the NVM allocator and the inodes in use.
>>
>> Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
>> ---
>>  fs/nova/nova.h  |  8 ++++++++
>>  fs/nova/super.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
>>  fs/nova/super.h |  2 ++
>>  3 files changed, 52 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/nova/nova.h b/fs/nova/nova.h
>> index ba7ffca..e0e85fb 100644
>> --- a/fs/nova/nova.h
>> +++ b/fs/nova/nova.h
>> @@ -301,6 +301,14 @@ static inline u64 nova_get_epoch_id(struct super_block *sb)
>>  }
>>
>>  #include "inode.h"
>> +
>> +/* A node in the RB tree representing a range of pages */
>> +struct nova_range_node {
>> +     struct rb_node node;
>> +     unsigned long range_low;
>> +     unsigned long range_high;
>> +};
>> +
>>  #include "bbuild.h"
>>
>>  /* ====================================================== */
>> diff --git a/fs/nova/super.c b/fs/nova/super.c
>> index f41cc04..aec1cd3 100644
>> --- a/fs/nova/super.c
>> +++ b/fs/nova/super.c
>> @@ -52,6 +52,7 @@ MODULE_PARM_DESC(nova_dbgmask, "Control debugging output");
>>  static struct super_operations nova_sops;
>>
>>  static struct kmem_cache *nova_inode_cachep;
>> +static struct kmem_cache *nova_range_node_cachep;
>>
>>
>>  /* FIXME: should the following variable be one per NOVA instance? */
>> @@ -686,6 +687,20 @@ static void nova_put_super(struct super_block *sb)
>>       sb->s_fs_info = NULL;
>>  }
>>
>> +inline void nova_free_range_node(struct nova_range_node *node)
>> +{
>> +     kmem_cache_free(nova_range_node_cachep, node);
>> +}
>> +
>> +inline struct nova_range_node *nova_alloc_range_node(struct super_block *sb)
>> +{
>> +     struct nova_range_node *p;
>> +
>> +     p = (struct nova_range_node *)
> nit: needless cast

Thanks. Will fix.

Andiry

>> +             kmem_cache_zalloc(nova_range_node_cachep, GFP_NOFS);
>> +     return p;
>> +}
>> +
>>  static struct inode *nova_alloc_inode(struct super_block *sb)
>>  {
>>       struct nova_inode_info *vi;
>> @@ -719,6 +734,17 @@ static void init_once(void *foo)
>>       inode_init_once(&vi->vfs_inode);
>>  }
>>
>> +static int __init init_rangenode_cache(void)
>> +{
>> +     nova_range_node_cachep = kmem_cache_create("nova_range_node_cache",
>> +                                     sizeof(struct nova_range_node),
>> +                                     0, (SLAB_RECLAIM_ACCOUNT |
>
>> +                                     SLAB_MEM_SPREAD), NULL);
>> +     if (nova_range_node_cachep == NULL)
>> +             return -ENOMEM;
>> +     return 0;
>> +}
>> +
>>  static int __init init_inodecache(void)
>>  {
>>       nova_inode_cachep = kmem_cache_create("nova_inode_cache",
>> @@ -740,6 +766,11 @@ static void destroy_inodecache(void)
>>       kmem_cache_destroy(nova_inode_cachep);
>>  }
>>
>> +static void destroy_rangenode_cache(void)
>> +{
>> +     kmem_cache_destroy(nova_range_node_cachep);
>> +}
>> +
>>
>>  /*
>>   * the super block writes are all done "on the fly", so the
>> @@ -781,20 +812,27 @@ static int __init init_nova_fs(void)
>>       nova_info("Arch new instructions support: CLWB %s\n",
>>                       support_clwb ? "YES" : "NO");
>>
>> -     rc = init_inodecache();
>> +     rc = init_rangenode_cache();
>>       if (rc)
>>               goto out;
>>
>> -     rc = register_filesystem(&nova_fs_type);
>> +     rc = init_inodecache();
>>       if (rc)
>>               goto out1;
>>
>> +     rc = register_filesystem(&nova_fs_type);
>> +     if (rc)
>> +             goto out2;
>> +
>>  out:
>>       NOVA_END_TIMING(init_t, init_time);
>>       return rc;
>>
>> -out1:
>> +out2:
>>       destroy_inodecache();
>> +
>> +out1:
>> +     destroy_rangenode_cache();
>>       goto out;
>>  }
>>
>> @@ -802,6 +840,7 @@ static void __exit exit_nova_fs(void)
>>  {
>>       unregister_filesystem(&nova_fs_type);
>>       destroy_inodecache();
>> +     destroy_rangenode_cache();
>>  }
>>
>>  MODULE_AUTHOR("Andiry Xu <jix024@cs.ucsd.edu>");
>> diff --git a/fs/nova/super.h b/fs/nova/super.h
>> index cb53908..b478080 100644
>> --- a/fs/nova/super.h
>> +++ b/fs/nova/super.h
>> @@ -145,5 +145,7 @@ static inline struct nova_super_block *nova_get_super(struct super_block *sb)
>>  }
>>
>>  extern void nova_error_mng(struct super_block *sb, const char *fmt, ...);
>> +extern struct nova_range_node *nova_alloc_range_node(struct super_block *sb);
>> +extern void nova_free_range_node(struct nova_range_node *node);
>>
>>  #endif
>>


* Re: [RFC v2 09/83] Add Kconfig and Makefile
  2018-03-11 12:15   ` Nikolay Borisov
@ 2018-03-11 21:32     ` Andiry Xu
  0 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-11 21:32 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Linux FS Devel, linux-kernel, linux-nvdimm, Dan Williams, Rudoff,
	Andy, coughlan, Steven Swanson, Dave Chinner, jack, swhiteho,
	miklos, Jian Xu, Andiry Xu

On Sun, Mar 11, 2018 at 5:15 AM, Nikolay Borisov
<n.borisov.lkml@gmail.com> wrote:
>
>
> On 10.03.2018 20:17, Andiry Xu wrote:
>> From: Andiry Xu <jix024@cs.ucsd.edu>
>>
>> Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
>> ---
>>  fs/Kconfig       |  2 ++
>>  fs/Makefile      |  1 +
>>  fs/nova/Kconfig  | 15 +++++++++++++++
>>  fs/nova/Makefile |  7 +++++++
>>  4 files changed, 25 insertions(+)
>>  create mode 100644 fs/nova/Kconfig
>>  create mode 100644 fs/nova/Makefile
>>
>> diff --git a/fs/Kconfig b/fs/Kconfig
>> index bc821a8..5e9ff3e 100644
>> --- a/fs/Kconfig
>> +++ b/fs/Kconfig
>> @@ -58,6 +58,8 @@ config FS_DAX_PMD
>>       depends on ZONE_DEVICE
>>       depends on TRANSPARENT_HUGEPAGE
>>
>> +source "fs/nova/Kconfig"
>> +
>>  # Selected by DAX drivers that do not expect filesystem DAX to support
>>  # get_user_pages() of DAX mappings. I.e. "limited" indicates no support
>>  # for fork() of processes with MAP_SHARED mappings or support for
>> diff --git a/fs/Makefile b/fs/Makefile
>> index add789e..65ea619 100644
>> --- a/fs/Makefile
>> +++ b/fs/Makefile
>> @@ -113,6 +113,7 @@ obj-$(CONFIG_OMFS_FS)             += omfs/
>>  obj-$(CONFIG_JFS_FS)         += jfs/
>>  obj-$(CONFIG_XFS_FS)         += xfs/
>>  obj-$(CONFIG_9P_FS)          += 9p/
>> +obj-$(CONFIG_NOVA_FS)                += nova/
>>  obj-$(CONFIG_AFS_FS)         += afs/
>>  obj-$(CONFIG_NILFS2_FS)              += nilfs2/
>>  obj-$(CONFIG_BEFS_FS)                += befs/
>> diff --git a/fs/nova/Kconfig b/fs/nova/Kconfig
>> new file mode 100644
>> index 0000000..c1c692e
>> --- /dev/null
>> +++ b/fs/nova/Kconfig
>> @@ -0,0 +1,15 @@
>> +config NOVA_FS
>> +     tristate "NOVA: log-structured file system for non-volatile memories"
>> +     depends on FS_DAX
>> +     select CRC32
>
> What do you need crc32 for? Selecting libcrc32c is enough to do "the
> right thing"
>

I think this is a leftover from the removed NOVA-Fortis code. I will double check.

Thanks,
Andiry

>> +     select LIBCRC32C
>> +     help
>> +       If your system has a block of fast (comparable in access speed to
>> +       system memory) and non-volatile byte-addressable memory and you wish
>> +       to mount a light-weight filesystem with strong consistency support
>> +       over it, say Y here.
>> +
>> +       To compile this as a module, choose M here: the module will be
>> +       called nova.
>> +
>> +       If unsure, say N.
>> diff --git a/fs/nova/Makefile b/fs/nova/Makefile
>> new file mode 100644
>> index 0000000..eb19646
>> --- /dev/null
>> +++ b/fs/nova/Makefile
>> @@ -0,0 +1,7 @@
>> +#
>> +# Makefile for the linux NOVA filesystem routines.
>> +#
>> +
>> +obj-$(CONFIG_NOVA_FS) += nova.o
>> +
>> +nova-y := bbuild.o inode.o rebuild.o super.o
>>


* Re: [RFC v2 05/83] Add NOVA filesystem definitions and useful helper routines.
  2018-03-11 19:22     ` Eric Biggers
@ 2018-03-11 21:45       ` Andiry Xu
  2018-03-19 19:39       ` Andiry Xu
  1 sibling, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-11 21:45 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Nikolay Borisov, Linux FS Devel, linux-kernel, linux-nvdimm,
	Dan Williams, Rudoff, Andy, coughlan, Steven Swanson,
	Dave Chinner, jack, swhiteho, miklos, Jian Xu, Andiry Xu,
	herbert

On Sun, Mar 11, 2018 at 12:22 PM, Eric Biggers <ebiggers3@gmail.com> wrote:
> On Sun, Mar 11, 2018 at 02:00:13PM +0200, Nikolay Borisov wrote:
>> [Adding Herbert Xu to CC since he is the maintainer of the crypto
>> subsystem]
>>
>> On 10.03.2018 20:17, Andiry Xu wrote:
>> <snip>
>>
>> > +static inline u32 nova_crc32c(u32 crc, const u8 *data, size_t len)
>> > +{
>> > +   u8 *ptr = (u8 *) data;
>> > +   u64 acc = crc; /* accumulator, crc32c value in lower 32b */
>> > +   u32 csum;
>> > +
>> > +   /* x86 instruction crc32 is part of SSE-4.2 */
>> > +   if (static_cpu_has(X86_FEATURE_XMM4_2)) {
>> > +           /* This inline assembly implementation should be equivalent
>> > +            * to the kernel's crc32c_intel_le_hw() function used by
>> > +            * crc32c(), but this performs better on test machines.
>> > +            */
>> > +           while (len > 8) {
>> > +                   asm volatile(/* 64b quad words */
>> > +                           "crc32q (%1), %0"
>> > +                           : "=r" (acc)
>> > +                           : "r"  (ptr), "0" (acc)
>> > +                   );
>> > +                   ptr += 8;
>> > +                   len -= 8;
>> > +           }
>> > +
>> > +           while (len > 0) {
>> > +                   asm volatile(/* trailing bytes */
>> > +                           "crc32b (%1), %0"
>> > +                           : "=r" (acc)
>> > +                           : "r"  (ptr), "0" (acc)
>> > +                   );
>> > +                   ptr++;
>> > +                   len--;
>> > +           }
>> > +
>> > +           csum = (u32) acc;
>> > +   } else {
>> > +           /* The kernel's crc32c() function should also detect and use the
>> > +            * crc32 instruction of SSE-4.2. But calling in to this function
>> > +            * is about 3x to 5x slower than the inline assembly version on
>> > +            * some test machines.
>>
>> That is really odd. Did you try to characterize why this is the case? Is
>> it purely the overhead of dispatching to the correct backend function?
>> That's a rather big performance hit.
>>
>> > +            */
>> > +           csum = crc32c(crc, data, len);
>> > +   }
>> > +
>> > +   return csum;
>> > +}
>> > +
>
> Are you sure that CONFIG_CRYPTO_CRC32C_INTEL was enabled during your tests and
> that the accelerated version was being called?  Or, perhaps CRC32C_PCL_BREAKEVEN
> (defined in arch/x86/crypto/crc32c-intel_glue.c) needs to be adjusted.  Please
> don't hack around performance problems like this; if they exist, they need to be
> fixed for everyone.
>

I think we found the issue when implementing the NOVA-Fortis metadata and
data protections, which use crc32c heavily. They have been removed from
this patchset, but I will double-check whether the issue still exists.
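
One way to double-check is to compare both code paths against a plain
software CRC32C on identical buffers before timing them, so that a speed
difference is not hiding a difference in results. The following is only a
user-space sketch: crc32c_sw() is a hypothetical helper, not a kernel
function, and it applies no initial or final inversion, which is assumed
here to match how the crc32 instruction accumulates.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical reference implementation: bitwise CRC32C (Castagnoli),
 * reflected polynomial 0x82F63B78. The caller supplies the seed and
 * performs any final inversion, as with an accumulator-style interface.
 */
static uint32_t crc32c_sw(uint32_t crc, const void *data, size_t len)
{
	const uint8_t *p = data;

	while (len--) {
		crc ^= *p++;
		/* Process one bit at a time; the mask is all-ones when the low bit is set. */
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1u));
	}
	return crc;
}
```

With a seed of 0xFFFFFFFF and a final inversion, this produces the standard
CRC-32C check value 0xE3069283 for the ASCII string "123456789".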

Thanks,
Andiry

> Eric

* Re: [RFC v2 83/83] Sysfs support.
  2018-03-10 18:19 ` [RFC v2 83/83] Sysfs support Andiry Xu
@ 2018-03-15  0:33   ` Randy Dunlap
  2018-03-15  6:07     ` Andiry Xu
  2018-03-22 15:00   ` David Sterba
  1 sibling, 1 reply; 119+ messages in thread
From: Randy Dunlap @ 2018-03-15  0:33 UTC (permalink / raw)
  To: Andiry Xu, linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

On 03/10/2018 10:19 AM, Andiry Xu wrote:
> Sysfs support allows user to get/post information of running NOVA instance.
> After mount, NOVA creates four entries under proc directory
> /proc/fs/nova/pmem#/:
> 
> timing_stats	IO_stats	allocator	gc

Hi,

This is all procfs, not sysfs, so the name is (or can be) confusing.

Please change it.

-- 
~Randy

* Re: [RFC v2 03/83] Add super.h.
  2018-03-10 18:17 ` [RFC v2 03/83] Add super.h Andiry Xu
@ 2018-03-15  4:54   ` Darrick J. Wong
  2018-03-15  6:11     ` Andiry Xu
  0 siblings, 1 reply; 119+ messages in thread
From: Darrick J. Wong @ 2018-03-15  4:54 UTC (permalink / raw)
  To: Andiry Xu
  Cc: linux-fsdevel, linux-kernel, linux-nvdimm, dan.j.williams,
	andy.rudoff, coughlan, swanson, david, jack, swhiteho, miklos,
	andiry.xu, Andiry Xu

On Sat, Mar 10, 2018 at 10:17:44AM -0800, Andiry Xu wrote:
> From: Andiry Xu <jix024@cs.ucsd.edu>
> 
> This header file defines NOVA persistent and volatile superblock
> data structures.
> 
> It also defines NOVA block layout:
> 
> Page 0: Superblock
> Page 1: Reserved inodes
> Page 2 - 15: Reserved
> Page 16 - 31: Inode table pointers
> Page 32 - 47: Journal address pointers
> Page 48 - 63: Reserved
> Pages n-2: Replicate reserved inodes
> Pages n-1: Replicate superblock
> 
> Other pages are for normal inodes, logs and data.
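
The page layout above translates into byte offsets with simple arithmetic.
This is a user-space sketch only: the DEMO_ names are hypothetical, the page
numbers come from the list above, and the 4096-byte page and 128-byte inode
slot are assumptions based on NOVA_DEF_BLOCK_SIZE_4K and the "aligned to
128 bytes" wording elsewhere in the series.

```c
#include <stdint.h>

#define DEMO_BLOCK_SIZE			4096u	/* assumed 4K pages */
#define DEMO_INODE_SIZE			128u	/* assumed inode slot size */
#define DEMO_RESERVE_INODE_START	1u	/* page 1: reserved inodes */
#define DEMO_INODE_TABLE_START		16u	/* pages 16-31 */
#define DEMO_JOURNAL_START		32u	/* pages 32-47 */

/* Byte offset of the start of a page, relative to the start of the device. */
static uint64_t demo_page_to_offset(uint64_t page)
{
	return page * DEMO_BLOCK_SIZE;
}

/* Byte offset of a reserved inode slot inside page 1. */
static uint64_t demo_reserved_inode_offset(uint64_t ino)
{
	return demo_page_to_offset(DEMO_RESERVE_INODE_START) +
		ino * DEMO_INODE_SIZE;
}

/* The two tail pages hold the replicas: n-2 for inodes, n-1 for the superblock. */
static uint64_t demo_replica_inode_page(uint64_t total_pages)
{
	return total_pages - 2;
}

static uint64_t demo_replica_sb_page(uint64_t total_pages)
{
	return total_pages - 1;
}
```

Under these assumptions the root inode (number 1) sits at byte 4224, and
slot 31 ends exactly at the start of page 2, which is consistent with one
page holding 31 reserved inodes plus the unused slot 0.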
> 
> Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
> ---
>  fs/nova/super.h | 149 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 149 insertions(+)
>  create mode 100644 fs/nova/super.h
> 
> diff --git a/fs/nova/super.h b/fs/nova/super.h
> new file mode 100644
> index 0000000..cb53908
> --- /dev/null
> +++ b/fs/nova/super.h
> @@ -0,0 +1,149 @@
> +#ifndef __SUPER_H
> +#define __SUPER_H
> +/*
> + * Structure of the NOVA super block in PMEM
> + *
> + * The fields are partitioned into static and dynamic fields. The static fields
> + * never change after file system creation. This was primarily done because
> + * nova_get_block() returns NULL if the block offset is 0 (helps in catching
> + * bugs). So if we modify any field using journaling (for consistency), we
> + * will have to modify s_sum which is at offset 0. So journaling code fails.
> + * This (static+dynamic fields) is a temporary solution and can be avoided
> + * once the file system becomes stable and nova_get_block() returns correct
> + * pointers even for offset 0.
> + */
> +struct nova_super_block {
> +	/* static fields. they never change after file system creation.
> +	 * checksum only validates up to s_start_dynamic field below
> +	 */
> +	__le32		s_sum;			/* checksum of this sb */
> +	__le32		s_magic;		/* magic signature */
> +	__le32		s_padding32;
> +	__le32		s_blocksize;		/* blocksize in bytes */
> +	__le64		s_size;			/* total size of fs in bytes */
> +	char		s_volume_name[16];	/* volume name */
> +
> +	/* all the dynamic fields should go here */
> +	__le64		s_epoch_id;		/* Epoch ID */
> +
> +	/* s_mtime and s_wtime should be together and their order should not be
> +	 * changed. we use an 8 byte write to update both of them atomically
> +	 */
> +	__le32		s_mtime;		/* mount time */
> +	__le32		s_wtime;		/* write time */

Hmmm, 32-bit timestamps?  2038 isn't that far away...
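
For reference, the limit can be checked in user space. This sketch assumes
the 32-bit field is interpreted as a signed count of seconds since the epoch
(the patch does not say); fits_in_s32() and format_utc() are hypothetical
helper names.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Does a 64-bit seconds-since-epoch value fit in a signed 32-bit field? */
static int fits_in_s32(int64_t secs)
{
	return secs >= INT32_MIN && secs <= INT32_MAX;
}

/* Format a seconds-since-epoch value as a UTC string; returns 0 on success. */
static int format_utc(int64_t secs, char *buf, size_t len)
{
	time_t t = (time_t)secs;
	struct tm *tm = gmtime(&t);

	if (!tm)
		return -1;
	return strftime(buf, len, "%Y-%m-%d %H:%M:%S", tm) ? 0 : -1;
}
```

Formatting INT32_MAX gives 2038-01-19 03:14:07 UTC, and one second later no
longer fits in the field.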

> +} __attribute((__packed__));
> +
> +#define NOVA_SB_SIZE 512       /* must be power of two */
> +
> +/* ======================= Reserved blocks ========================= */
> +
> +/*
> + * Page 0 contains super blocks;
> + * Page 1 contains reserved inodes;
> + * Page 2 - 15 are reserved.
> + * Page 16 - 31 contain pointers to inode tables.
> + * Page 32 - 47 contain pointers to journal pages.
> + */
> +#define	HEAD_RESERVED_BLOCKS	64
> +#define	NUM_JOURNAL_PAGES	16
> +
> +#define	SUPER_BLOCK_START       0 // Superblock
> +#define	RESERVE_INODE_START	1 // Reserved inodes
> +#define	INODE_TABLE_START	16 // inode table pointers
> +#define	JOURNAL_START		32 // journal pointer table
> +
> +/* For replica super block and replica reserved inodes */
> +#define	TAIL_RESERVED_BLOCKS	2
> +
> +/* ======================= Reserved inodes ========================= */
> +
> +/* We have space for 31 reserved inodes */
> +#define NOVA_ROOT_INO		(1)
> +#define NOVA_INODETABLE_INO	(2)	/* Fake inode associated with inode
> +					 * storage.  We need this because our
> +					 * allocator requires inode to be
> +					 * associated with each allocation.
> +					 * The data actually lives in linked
> +					 * lists in INODE_TABLE_START. */
> +#define NOVA_BLOCKNODE_INO	(3)     /* Storage for allocator state */
> +#define NOVA_LITEJOURNAL_INO	(4)     /* Storage for lightweight journals */
> +#define NOVA_INODELIST_INO	(5)     /* Storage for Inode free list */
> +
> +
> +/* Normal inode starts at 32 */
> +#define NOVA_NORMAL_INODE_START      (32)

I've been wondering this whole time, why not make the inode number the
byte offset into the pmem?  Then you don't have to lose the last 8 bytes
of each inode block to point to the next one.

--D

> +
> +
> +
> +/*
> + * NOVA super-block data in DRAM
> + */
> +struct nova_sb_info {
> +	struct super_block *sb;			/* VFS super block */
> +	struct nova_super_block *nova_sb;	/* DRAM copy of SB */
> +	struct block_device *s_bdev;
> +	struct dax_device *s_dax_dev;
> +
> +	/*
> +	 * base physical and virtual address of NOVA (which is also
> +	 * the pointer to the super block)
> +	 */
> +	phys_addr_t	phys_addr;
> +	void		*virt_addr;
> +	void		*replica_reserved_inodes_addr;
> +	void		*replica_sb_addr;
> +
> +	unsigned long	num_blocks;
> +
> +	/* Mount options */
> +	unsigned long	bpi;
> +	unsigned long	blocksize;
> +	unsigned long	initsize;
> +	unsigned long	s_mount_opt;
> +	kuid_t		uid;    /* Mount uid for root directory */
> +	kgid_t		gid;    /* Mount gid for root directory */
> +	umode_t		mode;   /* Mount mode for root directory */
> +	atomic_t	next_generation;
> +	/* inode tracking */
> +	unsigned long	s_inodes_used_count;
> +	unsigned long	head_reserved_blocks;
> +	unsigned long	tail_reserved_blocks;
> +
> +	struct mutex	s_lock;	/* protects the SB's buffer-head */
> +
> +	int cpus;
> +
> +	/* Current epoch. volatile guarantees visibility */
> +	volatile u64 s_epoch_id;
> +
> +	/* ZEROED page for cache page initialized */
> +	void *zeroed_page;
> +};
> +
> +static inline struct nova_sb_info *NOVA_SB(struct super_block *sb)
> +{
> +	return sb->s_fs_info;
> +}
> +
> +static inline struct nova_super_block
> +*nova_get_redund_super(struct super_block *sb)
> +{
> +	struct nova_sb_info *sbi = NOVA_SB(sb);
> +
> +	return (struct nova_super_block *)(sbi->replica_sb_addr);
> +}
> +
> +
> +/* If this is part of a read-modify-write of the super block,
> + * nova_memunlock_super() before calling!
> + */
> +static inline struct nova_super_block *nova_get_super(struct super_block *sb)
> +{
> +	struct nova_sb_info *sbi = NOVA_SB(sb);
> +
> +	return (struct nova_super_block *)sbi->virt_addr;
> +}
> +
> +extern void nova_error_mng(struct super_block *sb, const char *fmt, ...);
> +
> +#endif
> -- 
> 2.7.4
> 

* Re: [RFC v2 04/83] NOVA inode definition.
  2018-03-10 18:17 ` [RFC v2 04/83] NOVA inode definition Andiry Xu
@ 2018-03-15  5:06   ` Darrick J. Wong
  2018-03-15  6:16     ` Andiry Xu
  0 siblings, 1 reply; 119+ messages in thread
From: Darrick J. Wong @ 2018-03-15  5:06 UTC (permalink / raw)
  To: Andiry Xu
  Cc: linux-fsdevel, linux-kernel, linux-nvdimm, dan.j.williams,
	andy.rudoff, coughlan, swanson, david, jack, swhiteho, miklos,
	andiry.xu, Andiry Xu

On Sat, Mar 10, 2018 at 10:17:45AM -0800, Andiry Xu wrote:
> From: Andiry Xu <jix024@cs.ucsd.edu>
> 
> inode.h defines the non-volatile and volatile NOVA inode data structures.
> 
> The non-volatile NOVA inode (nova_inode) is aligned to 128 bytes and contains
> file/directory metadata information. The most important fields
> are log_head and log_tail. log_head points to the start of
> the log, and log_tail points to the end of the latest committed
> log entry. NOVA makes updates to the inode by appending
> to the log tail and then updating the log_tail pointer atomically.
> 
> The volatile NOVA inode (nova_inode_info) contains necessary
> information to limit access to the non-volatile NOVA inode during runtime.
> It has a radix tree to map file offset or filenames to the corresponding
> log entries.
> 
> Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
> ---
>  fs/nova/inode.h | 187 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 187 insertions(+)
>  create mode 100644 fs/nova/inode.h
> 
> diff --git a/fs/nova/inode.h b/fs/nova/inode.h
> new file mode 100644
> index 0000000..f9187e3
> --- /dev/null
> +++ b/fs/nova/inode.h
> @@ -0,0 +1,187 @@
> +#ifndef __INODE_H
> +#define __INODE_H
> +
> +struct nova_inode_info_header;
> +struct nova_inode;
> +
> +#include "super.h"
> +
> +enum nova_new_inode_type {
> +	TYPE_CREATE = 0,
> +	TYPE_MKNOD,
> +	TYPE_SYMLINK,
> +	TYPE_MKDIR
> +};
> +
> +
> +/*
> + * Structure of an inode in PMEM
> + * Keep the inode size to within 120 bytes: We use the last eight bytes
> + * as inode table tail pointer.

I would've expected a
BUILD_BUG_ON(sizeof(struct nova_inode) > NOVA_INODE_SIZE - 8);
or something to enforce this.
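
A compile-time check of this kind can be tried out in user space. The sketch
below mirrors the on-media fields quoted in this patch with fixed-width types
and uses C11 _Static_assert in place of BUILD_BUG_ON. The demo_ names are
hypothetical, and the 128-byte slot size is an assumption taken from the
"aligned to 128 bytes" wording in the commit message.

```c
#include <stdint.h>

#define DEMO_INODE_SIZE 128	/* assumed slot size per inode */

/* User-space mirror of the fields of struct nova_inode as quoted here. */
struct demo_nova_inode {
	uint8_t  i_rsvd;
	uint8_t  valid;
	uint8_t  deleted;
	uint8_t  i_blk_type;
	uint32_t i_flags;
	uint64_t i_size;
	uint32_t i_ctime;
	uint32_t i_mtime;
	uint32_t i_atime;
	uint16_t i_mode;
	uint16_t i_links_count;
	uint64_t i_xattr;
	uint32_t i_uid;
	uint32_t i_gid;
	uint32_t i_generation;
	uint32_t i_create_time;
	uint64_t nova_ino;
	uint64_t log_head;
	uint64_t log_tail;
	uint64_t create_epoch_id;
	uint64_t delete_epoch_id;
	struct {
		uint32_t rdev;
	} dev;
	uint32_t csum;
} __attribute__((__packed__));

/* Fail the build if the struct would overlap the 8-byte tail pointer. */
_Static_assert(sizeof(struct demo_nova_inode) <= DEMO_INODE_SIZE - 8,
	       "inode overlaps the inode table tail pointer");
```

With the fields as listed, the packed struct comes to 104 bytes, so 24 bytes
of each 128-byte slot remain free rather than exactly 8.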

(Or just equate inode number with byte offset?  I looked ahead at the
directory entries and they seem to be 64-bit...)

I guess I'm being lazy and doing an on-disk-format-only review. :)

> + */
> +struct nova_inode {
> +
> +	/* first 40 bytes */
> +	u8	i_rsvd;		 /* reserved. used to be checksum */

Magic number?

> +	u8	valid;		 /* Is this inode valid? */
> +	u8	deleted;	 /* Is this inode deleted? */

Would i_mode == 0 cover these?

> +	u8	i_blk_type;	 /* data block size this inode uses */

I would've thought these would just be bits of i_flags?

Also, if I have a 1G blocksize file and free space fragments to the
point that there's > 1G of free space but none of it contiguous, I guess
I can expect ENOSPC?

> +	__le32	i_flags;	 /* Inode flags */
> +	__le64	i_size;		 /* Size of data in bytes */
> +	__le32	i_ctime;	 /* Inode modification time */
> +	__le32	i_mtime;	 /* Inode b-tree Modification time */
> +	__le32	i_atime;	 /* Access time */

Same y2038 grumble from the previous patch.

> +	__le16	i_mode;		 /* File mode */
> +	__le16	i_links_count;	 /* Links count */
> +
> +	__le64	i_xattr;	 /* Extended attribute block */
> +
> +	/* second 40 bytes */
> +	__le32	i_uid;		 /* Owner Uid */
> +	__le32	i_gid;		 /* Group Id */
> +	__le32	i_generation;	 /* File version (for NFS) */
> +	__le32	i_create_time;	 /* Create time */
> +	__le64	nova_ino;	 /* nova inode number */
> +
> +	__le64	log_head;	 /* Log head pointer */
> +	__le64	log_tail;	 /* Log tail pointer */
> +
> +	/* last 40 bytes */
> +	__le64	create_epoch_id; /* Transaction ID when create */
> +	__le64	delete_epoch_id; /* Transaction ID when deleted */
> +
> +	struct {
> +		__le32 rdev;	 /* major/minor # */
> +	} dev;			 /* device inode */
> +
> +	__le32	csum;            /* CRC32 checksum */
> +	/* Leave 8 bytes for inode table tail pointer */
> +} __attribute((__packed__));
> +
> +/*
> + * NOVA-specific inode state kept in DRAM
> + */
> +struct nova_inode_info_header {
> +	/* For files, tree holds a map from file offsets to
> +	 * write log entries.
> +	 *
> +	 * For directories, tree holds a map from a hash of the file name to
> +	 * dentry log entry.
> +	 */
> +	struct radix_tree_root tree;
> +	struct rw_semaphore i_sem;	/* Protect log and tree */
> +	unsigned short i_mode;		/* Dir or file? */
> +	unsigned int i_flags;
> +	unsigned long log_pages;	/* Num of log pages */
> +	unsigned long i_size;
> +	unsigned long i_blocks;
> +	unsigned long ino;
> +	unsigned long pi_addr;
> +	unsigned long valid_entries;	/* For thorough GC */
> +	unsigned long num_entries;	/* For thorough GC */
> +	u64 last_setattr;		/* Last setattr entry */
> +	u64 last_link_change;		/* Last link change entry */
> +	u64 last_dentry;		/* Last updated dentry */
> +	u64 trans_id;			/* Transaction ID */
> +	u64 log_head;			/* Log head pointer */
> +	u64 log_tail;			/* Log tail pointer */
> +	u8  i_blk_type;
> +};
> +
> +/*
> + * DRAM state for inodes
> + */
> +struct nova_inode_info {
> +	struct nova_inode_info_header header;
> +	struct inode vfs_inode;
> +};
> +
> +
> +static inline struct nova_inode_info *NOVA_I(struct inode *inode)
> +{
> +	return container_of(inode, struct nova_inode_info, vfs_inode);
> +}
> +
> +static inline void sih_lock(struct nova_inode_info_header *header)

"sih"?  What happened to the "nova" prefix?

--D

> +{
> +	down_write(&header->i_sem);
> +}
> +
> +static inline void sih_unlock(struct nova_inode_info_header *header)
> +{
> +	up_write(&header->i_sem);
> +}
> +
> +static inline void sih_lock_shared(struct nova_inode_info_header *header)
> +{
> +	down_read(&header->i_sem);
> +}
> +
> +static inline void sih_unlock_shared(struct nova_inode_info_header *header)
> +{
> +	up_read(&header->i_sem);
> +}
> +
> +static inline unsigned int
> +nova_inode_blk_shift(struct nova_inode_info_header *sih)
> +{
> +	return blk_type_to_shift[sih->i_blk_type];
> +}
> +
> +static inline uint32_t nova_inode_blk_size(struct nova_inode_info_header *sih)
> +{
> +	return blk_type_to_size[sih->i_blk_type];
> +}
> +
> +static inline u64 nova_get_reserved_inode_addr(struct super_block *sb,
> +	u64 inode_number)
> +{
> +	return (NOVA_DEF_BLOCK_SIZE_4K * RESERVE_INODE_START) +
> +			inode_number * NOVA_INODE_SIZE;
> +}
> +
> +static inline struct nova_inode *nova_get_reserved_inode(struct super_block *sb,
> +	u64 inode_number)
> +{
> +	struct nova_sb_info *sbi = NOVA_SB(sb);
> +	u64 addr;
> +
> +	addr = nova_get_reserved_inode_addr(sb, inode_number);
> +
> +	return (struct nova_inode *)(sbi->virt_addr + addr);
> +}
> +
> +static inline struct nova_inode *nova_get_inode_by_ino(struct super_block *sb,
> +						  u64 ino)
> +{
> +	if (ino == 0 || ino >= NOVA_NORMAL_INODE_START)
> +		return NULL;
> +
> +	return nova_get_reserved_inode(sb, ino);
> +}
> +
> +static inline struct nova_inode *nova_get_inode(struct super_block *sb,
> +	struct inode *inode)
> +{
> +	struct nova_inode_info *si = NOVA_I(inode);
> +	struct nova_inode_info_header *sih = &si->header;
> +	struct nova_inode fake_pi;
> +	void *addr;
> +	int rc;
> +
> +	addr = nova_get_block(sb, sih->pi_addr);
> +	rc = memcpy_mcsafe(&fake_pi, addr, sizeof(struct nova_inode));
> +	if (rc)
> +		return NULL;
> +
> +	return (struct nova_inode *)addr;
> +}
> +
> +static inline int nova_persist_inode(struct nova_inode *pi)
> +{
> +	nova_flush_buffer(pi, sizeof(struct nova_inode), 1);
> +	return 0;
> +}
> +
> +#endif
> -- 
> 2.7.4
> 

* Re: [RFC v2 83/83] Sysfs support.
  2018-03-15  0:33   ` Randy Dunlap
@ 2018-03-15  6:07     ` Andiry Xu
  0 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-15  6:07 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Linux FS Devel, linux-kernel, linux-nvdimm, Dan Williams, Rudoff,
	Andy, coughlan, Steven Swanson, Dave Chinner, jack, swhiteho,
	miklos, Jian Xu, Andiry Xu

On Wed, Mar 14, 2018 at 5:33 PM, Randy Dunlap <rdunlap@infradead.org> wrote:
> On 03/10/2018 10:19 AM, Andiry Xu wrote:
>> Sysfs support allows user to get/post information of running NOVA instance.
>> After mount, NOVA creates four entries under proc directory
>> /proc/fs/nova/pmem#/:
>>
>> timing_stats  IO_stats        allocator       gc
>
> Hi,
>
> This is all procfs, not sysfs, so the name is (or can be) confusing.
>
> Please change it.
>

Thanks, will fix.

Andiry

> --
> ~Randy

* Re: [RFC v2 03/83] Add super.h.
  2018-03-15  4:54   ` Darrick J. Wong
@ 2018-03-15  6:11     ` Andiry Xu
  2018-03-15  9:05       ` Arnd Bergmann
  0 siblings, 1 reply; 119+ messages in thread
From: Andiry Xu @ 2018-03-15  6:11 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Linux FS Devel, linux-kernel, linux-nvdimm, Dan Williams, Rudoff,
	Andy, coughlan, Steven Swanson, Dave Chinner, jack, swhiteho,
	miklos, Jian Xu, Andiry Xu

On Wed, Mar 14, 2018 at 9:54 PM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Sat, Mar 10, 2018 at 10:17:44AM -0800, Andiry Xu wrote:
>> From: Andiry Xu <jix024@cs.ucsd.edu>
>>
>> This header file defines NOVA persistent and volatile superblock
>> data structures.
>>
>> It also defines NOVA block layout:
>>
>> Page 0: Superblock
>> Page 1: Reserved inodes
>> Page 2 - 15: Reserved
>> Page 16 - 31: Inode table pointers
>> Page 32 - 47: Journal address pointers
>> Page 48 - 63: Reserved
>> Pages n-2: Replicate reserved inodes
>> Pages n-1: Replicate superblock
>>
>> Other pages are for normal inodes, logs and data.
>>
>> Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
>> ---
>>  fs/nova/super.h | 149 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 149 insertions(+)
>>  create mode 100644 fs/nova/super.h
>>
>> diff --git a/fs/nova/super.h b/fs/nova/super.h
>> new file mode 100644
>> index 0000000..cb53908
>> --- /dev/null
>> +++ b/fs/nova/super.h
>> @@ -0,0 +1,149 @@
>> +#ifndef __SUPER_H
>> +#define __SUPER_H
>> +/*
>> + * Structure of the NOVA super block in PMEM
>> + *
>> + * The fields are partitioned into static and dynamic fields. The static fields
>> + * never change after file system creation. This was primarily done because
>> + * nova_get_block() returns NULL if the block offset is 0 (helps in catching
>> + * bugs). So if we modify any field using journaling (for consistency), we
>> + * will have to modify s_sum which is at offset 0. So journaling code fails.
>> + * This (static+dynamic fields) is a temporary solution and can be avoided
>> + * once the file system becomes stable and nova_get_block() returns correct
>> + * pointers even for offset 0.
>> + */
>> +struct nova_super_block {
>> +     /* static fields. they never change after file system creation.
>> +      * checksum only validates up to s_start_dynamic field below
>> +      */
>> +     __le32          s_sum;                  /* checksum of this sb */
>> +     __le32          s_magic;                /* magic signature */
>> +     __le32          s_padding32;
>> +     __le32          s_blocksize;            /* blocksize in bytes */
>> +     __le64          s_size;                 /* total size of fs in bytes */
>> +     char            s_volume_name[16];      /* volume name */
>> +
>> +     /* all the dynamic fields should go here */
>> +     __le64          s_epoch_id;             /* Epoch ID */
>> +
>> +     /* s_mtime and s_wtime should be together and their order should not be
>> +      * changed. we use an 8 byte write to update both of them atomically
>> +      */
>> +     __le32          s_mtime;                /* mount time */
>> +     __le32          s_wtime;                /* write time */
>
> Hmmm, 32-bit timestamps?  2038 isn't that far away...
>

I will try fixing this in the next version.

>> +} __attribute((__packed__));
>> +
>> +#define NOVA_SB_SIZE 512       /* must be power of two */
>> +
>> +/* ======================= Reserved blocks ========================= */
>> +
>> +/*
>> + * Page 0 contains super blocks;
>> + * Page 1 contains reserved inodes;
>> + * Page 2 - 15 are reserved.
>> + * Page 16 - 31 contain pointers to inode tables.
>> + * Page 32 - 47 contain pointers to journal pages.
>> + */
>> +#define      HEAD_RESERVED_BLOCKS    64
>> +#define      NUM_JOURNAL_PAGES       16
>> +
>> +#define      SUPER_BLOCK_START       0 // Superblock
>> +#define      RESERVE_INODE_START     1 // Reserved inodes
>> +#define      INODE_TABLE_START       16 // inode table pointers
>> +#define      JOURNAL_START           32 // journal pointer table
>> +
>> +/* For replica super block and replica reserved inodes */
>> +#define      TAIL_RESERVED_BLOCKS    2
>> +
>> +/* ======================= Reserved inodes ========================= */
>> +
>> +/* We have space for 31 reserved inodes */
>> +#define NOVA_ROOT_INO                (1)
>> +#define NOVA_INODETABLE_INO  (2)     /* Fake inode associated with inode
>> +                                      * storage.  We need this because our
>> +                                      * allocator requires inode to be
>> +                                      * associated with each allocation.
>> +                                      * The data actually lives in linked
>> +                                      * lists in INODE_TABLE_START. */
>> +#define NOVA_BLOCKNODE_INO   (3)     /* Storage for allocator state */
>> +#define NOVA_LITEJOURNAL_INO (4)     /* Storage for lightweight journals */
>> +#define NOVA_INODELIST_INO   (5)     /* Storage for Inode free list */
>> +
>> +
>> +/* Normal inode starts at 32 */
>> +#define NOVA_NORMAL_INODE_START      (32)
>
> I've been wondering this whole time, why not make the inode number the
> byte offset into the pmem?  Then you don't have to lose the last 8 bytes
> of each inode block to point to the next one.
>

During failure recovery, NOVA scans the inode logs. To find all the
inodes, it follows the inode block list. If the inode number were just
the byte offset, recovery would have no list to follow and could not
locate all the inodes.

One option is to organize the inodes in a B+tree, which makes the code
more complex.
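
The recovery walk described above can be sketched in user space roughly as
follows. This is an illustration under assumptions, not the NOVA code: the
demo_ names are hypothetical, the 4096-byte block size is chosen only to keep
the example small, offset 0 is taken to mean "end of list", and the valid
flag is read from byte 1 of each 128-byte slot as in the quoted struct
nova_inode. Real code would also convert the little-endian pointer with
le64_to_cpu().

```c
#include <stdint.h>
#include <string.h>

#define DEMO_INODE_SIZE		128u	/* assumed slot size */
#define DEMO_BLOCK_SIZE		4096u	/* illustrative block size */
#define DEMO_SLOTS_PER_BLOCK	(DEMO_BLOCK_SIZE / DEMO_INODE_SIZE)

/* Read the next-block offset stored in the last 8 bytes of a block. */
static uint64_t demo_next_block(const uint8_t *pmem, uint64_t block_off)
{
	uint64_t next;

	memcpy(&next, pmem + block_off + DEMO_BLOCK_SIZE - sizeof(next),
	       sizeof(next));
	return next;
}

/* Follow the chain of inode blocks and count the slots marked valid. */
static unsigned int demo_count_valid_inodes(const uint8_t *pmem,
					    uint64_t first_block)
{
	unsigned int count = 0;
	uint64_t off;
	unsigned int i;

	for (off = first_block; off != 0; off = demo_next_block(pmem, off)) {
		for (i = 0; i < DEMO_SLOTS_PER_BLOCK; i++) {
			/* Byte 1 of each slot is the "valid" flag. */
			if (pmem[off + i * DEMO_INODE_SIZE + 1])
				count++;
		}
	}
	return count;
}
```

Because the blocks need not be adjacent, the tail pointer is what lets the
scan reach every inode block starting from a single known offset.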

Thanks,
Andiry

> --D
>
>> +
>> +
>> +
>> +/*
>> + * NOVA super-block data in DRAM
>> + */
>> +struct nova_sb_info {
>> +     struct super_block *sb;                 /* VFS super block */
>> +     struct nova_super_block *nova_sb;       /* DRAM copy of SB */
>> +     struct block_device *s_bdev;
>> +     struct dax_device *s_dax_dev;
>> +
>> +     /*
>> +      * base physical and virtual address of NOVA (which is also
>> +      * the pointer to the super block)
>> +      */
>> +     phys_addr_t     phys_addr;
>> +     void            *virt_addr;
>> +     void            *replica_reserved_inodes_addr;
>> +     void            *replica_sb_addr;
>> +
>> +     unsigned long   num_blocks;
>> +
>> +     /* Mount options */
>> +     unsigned long   bpi;
>> +     unsigned long   blocksize;
>> +     unsigned long   initsize;
>> +     unsigned long   s_mount_opt;
>> +     kuid_t          uid;    /* Mount uid for root directory */
>> +     kgid_t          gid;    /* Mount gid for root directory */
>> +     umode_t         mode;   /* Mount mode for root directory */
>> +     atomic_t        next_generation;
>> +     /* inode tracking */
>> +     unsigned long   s_inodes_used_count;
>> +     unsigned long   head_reserved_blocks;
>> +     unsigned long   tail_reserved_blocks;
>> +
>> +     struct mutex    s_lock; /* protects the SB's buffer-head */
>> +
>> +     int cpus;
>> +
>> +     /* Current epoch. volatile guarantees visibility */
>> +     volatile u64 s_epoch_id;
>> +
>> +     /* ZEROED page for cache page initialized */
>> +     void *zeroed_page;
>> +};
>> +
>> +static inline struct nova_sb_info *NOVA_SB(struct super_block *sb)
>> +{
>> +     return sb->s_fs_info;
>> +}
>> +
>> +static inline struct nova_super_block
>> +*nova_get_redund_super(struct super_block *sb)
>> +{
>> +     struct nova_sb_info *sbi = NOVA_SB(sb);
>> +
>> +     return (struct nova_super_block *)(sbi->replica_sb_addr);
>> +}
>> +
>> +
>> +/* If this is part of a read-modify-write of the super block,
>> + * nova_memunlock_super() before calling!
>> + */
>> +static inline struct nova_super_block *nova_get_super(struct super_block *sb)
>> +{
>> +     struct nova_sb_info *sbi = NOVA_SB(sb);
>> +
>> +     return (struct nova_super_block *)sbi->virt_addr;
>> +}
>> +
>> +extern void nova_error_mng(struct super_block *sb, const char *fmt, ...);
>> +
>> +#endif
>> --
>> 2.7.4
>>

* Re: [RFC v2 04/83] NOVA inode definition.
  2018-03-15  5:06   ` Darrick J. Wong
@ 2018-03-15  6:16     ` Andiry Xu
  0 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-15  6:16 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Linux FS Devel, linux-kernel, linux-nvdimm, Dan Williams, Rudoff,
	Andy, coughlan, Steven Swanson, Dave Chinner, jack, swhiteho,
	miklos, Jian Xu, Andiry Xu

On Wed, Mar 14, 2018 at 10:06 PM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Sat, Mar 10, 2018 at 10:17:45AM -0800, Andiry Xu wrote:
>> From: Andiry Xu <jix024@cs.ucsd.edu>
>>
>> inode.h defines the non-volatile and volatile NOVA inode data structures.
>>
>> The non-volatile NOVA inode (nova_inode) is aligned to 128 bytes and contains
>> file/directory metadata information. The most important fields
>> are log_head and log_tail. log_head points to the start of
>> the log, and log_tail points to the end of the latest committed
>> log entry. NOVA makes updates to the inode by appending
>> to the log tail and then updating the log_tail pointer atomically.
>>
>> The volatile NOVA inode (nova_inode_info) contains necessary
>> information to limit access to the non-volatile NOVA inode during runtime.
>> It has a radix tree to map file offset or filenames to the corresponding
>> log entries.
>>
>> Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
>> ---
>>  fs/nova/inode.h | 187 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 187 insertions(+)
>>  create mode 100644 fs/nova/inode.h
>>
>> diff --git a/fs/nova/inode.h b/fs/nova/inode.h
>> new file mode 100644
>> index 0000000..f9187e3
>> --- /dev/null
>> +++ b/fs/nova/inode.h
>> @@ -0,0 +1,187 @@
>> +#ifndef __INODE_H
>> +#define __INODE_H
>> +
>> +struct nova_inode_info_header;
>> +struct nova_inode;
>> +
>> +#include "super.h"
>> +
>> +enum nova_new_inode_type {
>> +     TYPE_CREATE = 0,
>> +     TYPE_MKNOD,
>> +     TYPE_SYMLINK,
>> +     TYPE_MKDIR
>> +};
>> +
>> +
>> +/*
>> + * Structure of an inode in PMEM
>> + * Keep the inode size to within 120 bytes: We use the last eight bytes
>> + * as inode table tail pointer.
>
> I would've expected a
> BUILD_BUG_ON(sizeof(struct nova_inode) > NOVA_INODE_SIZE - 8);
> or something to enforce this.
>

Thanks, will do.

> (Or just equate inode number with byte offset?  I looked ahead at the
> directory entries and they seem to be 64-bit...)
>
> I guess I'm being lazy and doing an on-disk-format-only review. :)
>
>> + */
>> +struct nova_inode {
>> +
>> +     /* first 40 bytes */
>> +     u8      i_rsvd;          /* reserved. used to be checksum */
>
> Magic number?
>

OK.

>> +     u8      valid;           /* Is this inode valid? */
>> +     u8      deleted;         /* Is this inode deleted? */
>
> Would i_mode == 0 cover these?
>

The deleted flag comes from the NOVA-Fortis code. I will check whether
i_mode can cover it.

>> +     u8      i_blk_type;      /* data block size this inode uses */
>
> I would've thought these would just be bits of i_flags?
>
> Also, if I have a 1G blocksize file and free space fragments to the
> point that there's > 1G of free space but none of it contiguous, I guess
> I can expect ENOSPC?
>

Yes, but 1G blocksize has not been tested.
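
To illustrate the fragmentation case: a request for one contiguous block can
fail even when the total free space is larger than the request. This is a
simplified user-space sketch with hypothetical names (demo_free_range,
demo_alloc_contiguous); it ignores alignment and is not how the NOVA
allocator is implemented.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* One free extent, in bytes. */
struct demo_free_range {
	uint64_t start;
	uint64_t len;
};

/* Sum of all free extents. */
static uint64_t demo_total_free(const struct demo_free_range *fr, size_t n)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += fr[i].len;
	return total;
}

/*
 * First fit: return the start of the first extent that can hold "want"
 * contiguous bytes, or -ENOSPC if no single extent is large enough.
 */
static int64_t demo_alloc_contiguous(const struct demo_free_range *fr,
				     size_t n, uint64_t want)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (fr[i].len >= want)
			return (int64_t)fr[i].start;
	}
	return -ENOSPC;
}
```

With three free extents of 512 MiB each, 1.5 GiB is free in total, yet a
1 GiB contiguous request returns -ENOSPC while a 256 MiB request succeeds.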

>> +     __le32  i_flags;         /* Inode flags */
>> +     __le64  i_size;          /* Size of data in bytes */
>> +     __le32  i_ctime;         /* Inode modification time */
>> +     __le32  i_mtime;         /* Inode b-tree Modification time */
>> +     __le32  i_atime;         /* Access time */
>
> Same y2038 grumble from the previous patch.
>

Will fix.

>> +     __le16  i_mode;          /* File mode */
>> +     __le16  i_links_count;   /* Links count */
>> +
>> +     __le64  i_xattr;         /* Extended attribute block */
>> +
>> +     /* second 40 bytes */
>> +     __le32  i_uid;           /* Owner Uid */
>> +     __le32  i_gid;           /* Group Id */
>> +     __le32  i_generation;    /* File version (for NFS) */
>> +     __le32  i_create_time;   /* Create time */
>> +     __le64  nova_ino;        /* nova inode number */
>> +
>> +     __le64  log_head;        /* Log head pointer */
>> +     __le64  log_tail;        /* Log tail pointer */
>> +
>> +     /* last 40 bytes */
>> +     __le64  create_epoch_id; /* Transaction ID when create */
>> +     __le64  delete_epoch_id; /* Transaction ID when deleted */
>> +
>> +     struct {
>> +             __le32 rdev;     /* major/minor # */
>> +     } dev;                   /* device inode */
>> +
>> +     __le32  csum;            /* CRC32 checksum */
>> +     /* Leave 8 bytes for inode table tail pointer */
>> +} __attribute((__packed__));
>> +
>> +/*
>> + * NOVA-specific inode state kept in DRAM
>> + */
>> +struct nova_inode_info_header {
>> +     /* For files, tree holds a map from file offsets to
>> +      * write log entries.
>> +      *
>> +      * For directories, tree holds a map from a hash of the file name to
>> +      * dentry log entry.
>> +      */
>> +     struct radix_tree_root tree;
>> +     struct rw_semaphore i_sem;      /* Protect log and tree */
>> +     unsigned short i_mode;          /* Dir or file? */
>> +     unsigned int i_flags;
>> +     unsigned long log_pages;        /* Num of log pages */
>> +     unsigned long i_size;
>> +     unsigned long i_blocks;
>> +     unsigned long ino;
>> +     unsigned long pi_addr;
>> +     unsigned long valid_entries;    /* For thorough GC */
>> +     unsigned long num_entries;      /* For thorough GC */
>> +     u64 last_setattr;               /* Last setattr entry */
>> +     u64 last_link_change;           /* Last link change entry */
>> +     u64 last_dentry;                /* Last updated dentry */
>> +     u64 trans_id;                   /* Transaction ID */
>> +     u64 log_head;                   /* Log head pointer */
>> +     u64 log_tail;                   /* Log tail pointer */
>> +     u8  i_blk_type;
>> +};
>> +
>> +/*
>> + * DRAM state for inodes
>> + */
>> +struct nova_inode_info {
>> +     struct nova_inode_info_header header;
>> +     struct inode vfs_inode;
>> +};
>> +
>> +
>> +static inline struct nova_inode_info *NOVA_I(struct inode *inode)
>> +{
>> +     return container_of(inode, struct nova_inode_info, vfs_inode);
>> +}
>> +
>> +static inline void sih_lock(struct nova_inode_info_header *header)
>
> "sih"?  What happened to the "nova" prefix?
>

This structure was created before the name NOVA was decided.

Thanks,
Andiry

> --D
>
>> +{
>> +     down_write(&header->i_sem);
>> +}
>> +
>> +static inline void sih_unlock(struct nova_inode_info_header *header)
>> +{
>> +     up_write(&header->i_sem);
>> +}
>> +
>> +static inline void sih_lock_shared(struct nova_inode_info_header *header)
>> +{
>> +     down_read(&header->i_sem);
>> +}
>> +
>> +static inline void sih_unlock_shared(struct nova_inode_info_header *header)
>> +{
>> +     up_read(&header->i_sem);
>> +}
>> +
>> +static inline unsigned int
>> +nova_inode_blk_shift(struct nova_inode_info_header *sih)
>> +{
>> +     return blk_type_to_shift[sih->i_blk_type];
>> +}
>> +
>> +static inline uint32_t nova_inode_blk_size(struct nova_inode_info_header *sih)
>> +{
>> +     return blk_type_to_size[sih->i_blk_type];
>> +}
>> +
>> +static inline u64 nova_get_reserved_inode_addr(struct super_block *sb,
>> +     u64 inode_number)
>> +{
>> +     return (NOVA_DEF_BLOCK_SIZE_4K * RESERVE_INODE_START) +
>> +                     inode_number * NOVA_INODE_SIZE;
>> +}
>> +
>> +static inline struct nova_inode *nova_get_reserved_inode(struct super_block *sb,
>> +     u64 inode_number)
>> +{
>> +     struct nova_sb_info *sbi = NOVA_SB(sb);
>> +     u64 addr;
>> +
>> +     addr = nova_get_reserved_inode_addr(sb, inode_number);
>> +
>> +     return (struct nova_inode *)(sbi->virt_addr + addr);
>> +}
>> +
>> +static inline struct nova_inode *nova_get_inode_by_ino(struct super_block *sb,
>> +                                               u64 ino)
>> +{
>> +     if (ino == 0 || ino >= NOVA_NORMAL_INODE_START)
>> +             return NULL;
>> +
>> +     return nova_get_reserved_inode(sb, ino);
>> +}
>> +
>> +static inline struct nova_inode *nova_get_inode(struct super_block *sb,
>> +     struct inode *inode)
>> +{
>> +     struct nova_inode_info *si = NOVA_I(inode);
>> +     struct nova_inode_info_header *sih = &si->header;
>> +     struct nova_inode fake_pi;
>> +     void *addr;
>> +     int rc;
>> +
>> +     addr = nova_get_block(sb, sih->pi_addr);
>> +     rc = memcpy_mcsafe(&fake_pi, addr, sizeof(struct nova_inode));
>> +     if (rc)
>> +             return NULL;
>> +
>> +     return (struct nova_inode *)addr;
>> +}
>> +
>> +static inline int nova_persist_inode(struct nova_inode *pi)
>> +{
>> +     nova_flush_buffer(pi, sizeof(struct nova_inode), 1);
>> +     return 0;
>> +}
>> +
>> +#endif
>> --
>> 2.7.4
>>

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC v2 03/83] Add super.h.
  2018-03-15  6:11     ` Andiry Xu
@ 2018-03-15  9:05       ` Arnd Bergmann
  2018-03-15 17:51         ` Andiry Xu
  0 siblings, 1 reply; 119+ messages in thread
From: Arnd Bergmann @ 2018-03-15  9:05 UTC (permalink / raw)
  To: Andiry Xu
  Cc: Darrick J. Wong, Linux FS Devel, Linux Kernel Mailing List,
	linux-nvdimm, Dan Williams, Rudoff, Andy, coughlan,
	Steven Swanson, Dave Chinner, Jan Kara, swhiteho, miklos,
	Jian Xu, Andiry Xu

On Thu, Mar 15, 2018 at 7:11 AM, Andiry Xu <jix024@eng.ucsd.edu> wrote:
> On Wed, Mar 14, 2018 at 9:54 PM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
>> On Sat, Mar 10, 2018 at 10:17:44AM -0800, Andiry Xu wrote:

>>> +     /* s_mtime and s_wtime should be together and their order should not be
>>> +      * changed. we use an 8 byte write to update both of them atomically
>>> +      */
>>> +     __le32          s_mtime;                /* mount time */
>>> +     __le32          s_wtime;                /* write time */
>>
>> Hmmm, 32-bit timestamps?  2038 isn't that far away...
>>
>
> I will try fixing this in the next version.

I would also recommend adding nanosecond-resolution timestamps.
In theory, a signed 64-bit nanosecond field is sufficient for each timestamp
(it's good for several hundred years), but the more common format uses
64-bit seconds and 32-bit nanoseconds in other file systems.

Unfortunately, it looks like you will have to come up with a more
sophisticated update method than the one above. Even if you leave out the
nanoseconds, you can't easily rely on a 16-byte atomic update across
architectures to deal with the two 64-bit timestamps. For the superblock
fields, you might be able to get away with using second resolution, and then
encoding the timestamps as a signed 64-bit 'mkfs time' along with two unsigned
32-bit times added on top, which gives you a range of 136 years in which to
mount a file system after its creation.
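
A minimal sketch of that superblock encoding, to make the arithmetic concrete.
The struct and helper names are hypothetical and not taken from the NOVA
patches; the layout assumes the two 32-bit offsets share one 8-byte word so
that the existing single 8-byte store can still update both together.

```c
#include <stdint.h>

/* Hypothetical layout for illustration only; not the NOVA on-media format. */
struct sb_times {
	int64_t  mkfs_time;	/* seconds since the epoch, written once at mkfs */
	uint64_t mw_offsets;	/* low 32 bits: mount offset, high 32 bits: write offset */
};

/* Pack both offsets (seconds since mkfs_time) into one 8-byte word. */
static inline uint64_t sb_pack_offsets(uint32_t mount_off, uint32_t write_off)
{
	return ((uint64_t)write_off << 32) | mount_off;
}

/* Absolute mount time: mkfs time plus the low 32-bit offset. */
static inline int64_t sb_mount_time(const struct sb_times *t)
{
	return t->mkfs_time + (uint32_t)t->mw_offsets;
}

/* Absolute write time: mkfs time plus the high 32-bit offset. */
static inline int64_t sb_write_time(const struct sb_times *t)
{
	return t->mkfs_time + (uint32_t)(t->mw_offsets >> 32);
}
```

An unsigned 32-bit offset covers 2^32 seconds, which is where the figure of
roughly 136 years after mkfs comes from.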

      Arnd

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC v2 03/83] Add super.h.
  2018-03-15  9:05       ` Arnd Bergmann
@ 2018-03-15 17:51         ` Andiry Xu
  2018-03-15 20:04           ` Andreas Dilger
  2018-03-15 20:38           ` Arnd Bergmann
  0 siblings, 2 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-15 17:51 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Darrick J. Wong, Linux FS Devel, Linux Kernel Mailing List,
	linux-nvdimm, Dan Williams, Rudoff, Andy, coughlan,
	Steven Swanson, Dave Chinner, Jan Kara, swhiteho, miklos,
	Jian Xu, Andiry Xu

On Thu, Mar 15, 2018 at 2:05 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Thu, Mar 15, 2018 at 7:11 AM, Andiry Xu <jix024@eng.ucsd.edu> wrote:
>> On Wed, Mar 14, 2018 at 9:54 PM, Darrick J. Wong
>> <darrick.wong@oracle.com> wrote:
>>> On Sat, Mar 10, 2018 at 10:17:44AM -0800, Andiry Xu wrote:
>
>>>> +     /* s_mtime and s_wtime should be together and their order should not be
>>>> +      * changed. we use an 8 byte write to update both of them atomically
>>>> +      */
>>>> +     __le32          s_mtime;                /* mount time */
>>>> +     __le32          s_wtime;                /* write time */
>>>
>>> Hmmm, 32-bit timestamps?  2038 isn't that far away...
>>>
>>
>> I will try fixing this in the next version.
>
> I would also recommend adding nanosecond-resolution timestamps.
> In theory, a signed 64-bit nanosecond field is sufficient for each timestamp
> (it's good for several hundred years), but the more common format uses
> 64-bit seconds and 32-bit nanoseconds in other file systems.
>
> Unfortunately it looks, you will have to come up with a more sophisticated
> update method above, even if you leave out the nanoseconds, you can't
> easily rely on a 16-byte atomic update across architectures to deal with
> the two 64-bit timestamps. For the superblock fields, you might be able
> to get away with using second resolution, and then encoding the
> timestamps as a signed 64-bit 'mkfs time' along with two unsigned
> 32-bit times added on top, which gives you a range of 136 years mount
> a file system after its creation.
>

I will take a look at other file systems.

Superblock mtime is not a big problem as it is rarely updated. 64-bit
seconds and 32-bit nanoseconds make the inode and log entry bigger,
and updating file->atime can no longer be done with a single 64-bit update.
That may be annoying and may require journaling.

Thanks,
Andiry

>       Arnd

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC v2 03/83] Add super.h.
  2018-03-15 17:51         ` Andiry Xu
@ 2018-03-15 20:04           ` Andreas Dilger
  2018-03-15 20:38           ` Arnd Bergmann
  1 sibling, 0 replies; 119+ messages in thread
From: Andreas Dilger @ 2018-03-15 20:04 UTC (permalink / raw)
  To: Andiry Xu
  Cc: Arnd Bergmann, Darrick J. Wong, Linux FS Devel,
	Linux Kernel Mailing List, linux-nvdimm, Dan Williams, Rudoff,
	Andy, coughlan, Steven Swanson, Dave Chinner, Jan Kara, swhiteho,
	miklos, Jian Xu, Andiry Xu


On Mar 15, 2018, at 11:51 AM, Andiry Xu <jix024@eng.ucsd.edu> wrote:
> 
> On Thu, Mar 15, 2018 at 2:05 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>> On Thu, Mar 15, 2018 at 7:11 AM, Andiry Xu <jix024@eng.ucsd.edu> wrote:
>>> On Wed, Mar 14, 2018 at 9:54 PM, Darrick J. Wong
>>> <darrick.wong@oracle.com> wrote:
>>>> On Sat, Mar 10, 2018 at 10:17:44AM -0800, Andiry Xu wrote:
>> 
>>>>> +     /* s_mtime and s_wtime should be together and their order should not be
>>>>> +      * changed. we use an 8 byte write to update both of them atomically
>>>>> +      */
>>>>> +     __le32          s_mtime;                /* mount time */
>>>>> +     __le32          s_wtime;                /* write time */
>>>> 
>>>> Hmmm, 32-bit timestamps?  2038 isn't that far away...
>>>> 
>>> 
>>> I will try fixing this in the next version.
>> 
>> I would also recommend adding nanosecond-resolution timestamps.
>> In theory, a signed 64-bit nanosecond field is sufficient for each timestamp
>> (it's good for several hundred years), but the more common format uses
>> 64-bit seconds and 32-bit nanoseconds in other file systems.
>> 
>> Unfortunately it looks, you will have to come up with a more sophisticated
>> update method above, even if you leave out the nanoseconds, you can't
>> easily rely on a 16-byte atomic update across architectures to deal with
>> the two 64-bit timestamps. For the superblock fields, you might be able
>> to get away with using second resolution, and then encoding the
>> timestamps as a signed 64-bit 'mkfs time' along with two unsigned
>> 32-bit times added on top, which gives you a range of 136 years mount
>> a file system after its creation.
>> 
> 
> I will take a look at other file systems.
> 
> Superblock mtime is not a big problem as it is updated rarely. 64-bit
> seconds and 32-bit nanoseconds make the inode and log entry bigger,
> and updating file->atime cannot be done with a single 64bit update.
> That may be annoying and needs to use journaling.

If the 64-bit atomicity was really a performance issue, you could do
something like:

	__u32	time_high = seconds >> 32;
	__u64	time_low = seconds << 32 | nanoseconds;

and then you only need to update time_high with a journal operation if it
has changed from the current time_high value (about once every 140 years),
and the time_low can be set atomically.  It needs a few extra cycles each
time (hidden with an unlikely()) vs. just setting both, but that is a win
if it avoids other CPU or IO overhead.
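
The split above could be fleshed out roughly as follows. This is only a sketch:
the struct, the helper names, and the "needs journal" check are assumptions
added for illustration, and the journaling itself is left out.

```c
#include <stdint.h>

/* Hypothetical split timestamp following the two fields described above. */
struct split_time {
	uint32_t time_high;	/* seconds >> 32; changes once per 2^32 seconds */
	uint64_t time_low;	/* (seconds << 32) | nanoseconds, one atomic store */
};

static inline struct split_time split_time_encode(uint64_t seconds,
						  uint32_t nanoseconds)
{
	struct split_time t;

	t.time_high = (uint32_t)(seconds >> 32);
	t.time_low = (seconds << 32) | nanoseconds;
	return t;
}

/* Recombine the upper and lower halves of the seconds value. */
static inline uint64_t split_time_seconds(struct split_time t)
{
	return ((uint64_t)t.time_high << 32) | (t.time_low >> 32);
}

static inline uint32_t split_time_nanoseconds(struct split_time t)
{
	return (uint32_t)t.time_low;
}

/*
 * Nonzero when the new time no longer shares time_high with the stored
 * value, i.e. when the rare slow (journaled) path would be needed.
 */
static inline int split_time_needs_journal(struct split_time old,
					   uint64_t new_seconds)
{
	return old.time_high != (uint32_t)(new_seconds >> 32);
}
```

In the common case only time_low changes, so a single 8-byte store is enough;
the check above is the branch that would sit behind unlikely().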

Cheers, Andreas

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC v2 03/83] Add super.h.
  2018-03-15 17:51         ` Andiry Xu
  2018-03-15 20:04           ` Andreas Dilger
@ 2018-03-15 20:38           ` Arnd Bergmann
  2018-03-16  2:59             ` Theodore Y. Ts'o
  1 sibling, 1 reply; 119+ messages in thread
From: Arnd Bergmann @ 2018-03-15 20:38 UTC (permalink / raw)
  To: Andiry Xu
  Cc: Darrick J. Wong, Linux FS Devel, Linux Kernel Mailing List,
	linux-nvdimm, Dan Williams, Rudoff, Andy, coughlan,
	Steven Swanson, Dave Chinner, Jan Kara, swhiteho, miklos,
	Jian Xu, Andiry Xu

On Thu, Mar 15, 2018 at 6:51 PM, Andiry Xu <jix024@eng.ucsd.edu> wrote:
> On Thu, Mar 15, 2018 at 2:05 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>> On Thu, Mar 15, 2018 at 7:11 AM, Andiry Xu <jix024@eng.ucsd.edu> wrote:
>
> Superblock mtime is not a big problem as it is updated rarely. 64-bit
> seconds and 32-bit nanoseconds make the inode and log entry bigger,
> and updating file->atime cannot be done with a single 64bit update.
> That may be annoying and needs to use journaling.

If this is a big concern, you could use a format similar to what ext4 has:
30 bits of nanoseconds, and 34 bits of seconds, where the upper two
bits count the epoch. That gives you a time range from years 1902 to
2446.
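
For illustration, a 34-bit seconds plus 30-bit nanoseconds timestamp can be
packed into one 64-bit word as sketched here. This is not ext4's actual
on-disk layout (ext4 splits the value across two 32-bit fields); the bias
scheme, macro names, and helpers are assumptions chosen so that the covered
range matches the one quoted above.

```c
#include <stdint.h>

#define TS_NSEC_BITS	30
#define TS_NSEC_MASK	((UINT64_C(1) << TS_NSEC_BITS) - 1)
/* Bias so that the earliest signed 32-bit time maps to stored value 0. */
#define TS_SEC_BIAS	(INT64_C(1) << 31)
#define TS_SEC_MIN	(-TS_SEC_BIAS)				/* December 1901 */
#define TS_SEC_MAX	((INT64_C(1) << 34) - TS_SEC_BIAS - 1)	/* year 2446 */

/* Pack into a single word; returns 0 on success, -1 if out of range. */
static inline int ts_pack(int64_t sec, uint32_t nsec, uint64_t *out)
{
	if (sec < TS_SEC_MIN || sec > TS_SEC_MAX || nsec > 999999999u)
		return -1;
	*out = ((uint64_t)(sec + TS_SEC_BIAS) << TS_NSEC_BITS) | nsec;
	return 0;
}

/* Reverse of ts_pack(): upper 34 bits are seconds, lower 30 nanoseconds. */
static inline void ts_unpack(uint64_t word, int64_t *sec, uint32_t *nsec)
{
	*sec = (int64_t)(word >> TS_NSEC_BITS) - TS_SEC_BIAS;
	*nsec = (uint32_t)(word & TS_NSEC_MASK);
}
```

Because the whole timestamp fits in 8 bytes, it can be written with one
atomic 64-bit store, which is the property the atime concern is about.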

You could also have a resolution of less than a nanosecond. Note
that today, the file time stamps generated by the kernel are in
jiffies resolution, so at best one millisecond. However, most modern
file systems go with the 64+32 bit timestamps because it's not all
that expensive.

      Arnd

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC v2 03/83] Add super.h.
  2018-03-15 20:38           ` Arnd Bergmann
@ 2018-03-16  2:59             ` Theodore Y. Ts'o
  2018-03-16  6:17               ` Andiry Xu
  2018-03-16  9:19               ` Arnd Bergmann
  0 siblings, 2 replies; 119+ messages in thread
From: Theodore Y. Ts'o @ 2018-03-16  2:59 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Andiry Xu, Darrick J. Wong, Linux FS Devel,
	Linux Kernel Mailing List, linux-nvdimm, Dan Williams, Rudoff,
	Andy, coughlan, Steven Swanson, Dave Chinner, Jan Kara, swhiteho,
	miklos, Jian Xu, Andiry Xu

On Thu, Mar 15, 2018 at 09:38:29PM +0100, Arnd Bergmann wrote:
> 
> You could also have a resolution of less than a nanosecond. Note
> that today, the file time stamps generated by the kernel are in
> jiffies resolution, so at best one millisecond. However, most modern
> file systems go with the 64+32 bit timestamps because it's not all
> that expensive.

It actually depends on the architecture and the accuracy/granularity
of the timekeeping hardware available to the system, but it's possible
for the granularity of file time stamps to be up to one nanosecond.
So you can get results like this:

% stat unix_io.o 
  File: unix_io.o
  Size: 55000     	Blocks: 112        IO Block: 4096   regular file
Device: fc01h/64513d	Inode: 19931278    Links: 1
Access: (0644/-rw-r--r--)  Uid: (15806/   tytso)   Gid: (15806/   tytso)
Access: 2018-03-15 18:09:21.679914182 -0400
Modify: 2018-03-15 18:09:21.639914089 -0400
Change: 2018-03-15 18:09:21.639914089 -0400

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC v2 03/83] Add super.h.
  2018-03-16  2:59             ` Theodore Y. Ts'o
@ 2018-03-16  6:17               ` Andiry Xu
  2018-03-16  6:30                 ` Darrick J. Wong
  2018-03-16  9:19               ` Arnd Bergmann
  1 sibling, 1 reply; 119+ messages in thread
From: Andiry Xu @ 2018-03-16  6:17 UTC (permalink / raw)
  To: Theodore Y. Ts'o, Arnd Bergmann, Andiry Xu, Darrick J. Wong,
	Linux FS Devel, Linux Kernel Mailing List, linux-nvdimm,
	Dan Williams, Rudoff, Andy, coughlan, Steven Swanson,
	Dave Chinner, Jan Kara, swhiteho, miklos, Jian Xu, Andiry Xu

On Thu, Mar 15, 2018 at 7:59 PM, Theodore Y. Ts'o <tytso@mit.edu> wrote:
> On Thu, Mar 15, 2018 at 09:38:29PM +0100, Arnd Bergmann wrote:
>>
>> You could also have a resolution of less than a nanosecond. Note
>> that today, the file time stamps generated by the kernel are in
>> jiffies resolution, so at best one millisecond. However, most modern
>> file systems go with the 64+32 bit timestamps because it's not all
>> that expensive.
>
> It actually depends on the architecture and the accuracy/granularity
> of the timekeeping hardware available to the system, but it's possible
> for the granularity of file time stamps to be up to one nanosecond.
> So you can get results like this:
>
> % stat unix_io.o
>   File: unix_io.o
>   Size: 55000           Blocks: 112        IO Block: 4096   regular file
> Device: fc01h/64513d    Inode: 19931278    Links: 1
> Access: (0644/-rw-r--r--)  Uid: (15806/   tytso)   Gid: (15806/   tytso)
> Access: 2018-03-15 18:09:21.679914182 -0400
> Modify: 2018-03-15 18:09:21.639914089 -0400
> Change: 2018-03-15 18:09:21.639914089 -0400
>

Thanks for all the suggestions. I think I will follow ext4's time
format. 2446 should be far enough away.

Thanks,
Andiry

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC v2 03/83] Add super.h.
  2018-03-16  6:17               ` Andiry Xu
@ 2018-03-16  6:30                 ` Darrick J. Wong
  0 siblings, 0 replies; 119+ messages in thread
From: Darrick J. Wong @ 2018-03-16  6:30 UTC (permalink / raw)
  To: Andiry Xu
  Cc: Theodore Y. Ts'o, Arnd Bergmann, Linux FS Devel,
	Linux Kernel Mailing List, linux-nvdimm, Dan Williams, Rudoff,
	Andy, coughlan, Steven Swanson, Dave Chinner, Jan Kara, swhiteho,
	miklos, Jian Xu, Andiry Xu

On Thu, Mar 15, 2018 at 11:17:54PM -0700, Andiry Xu wrote:
> On Thu, Mar 15, 2018 at 7:59 PM, Theodore Y. Ts'o <tytso@mit.edu> wrote:
> > On Thu, Mar 15, 2018 at 09:38:29PM +0100, Arnd Bergmann wrote:
> >>
> >> You could also have a resolution of less than a nanosecond. Note
> >> that today, the file time stamps generated by the kernel are in
> >> jiffies resolution, so at best one millisecond. However, most modern
> >> file systems go with the 64+32 bit timestamps because it's not all
> >> that expensive.
> >
> > It actually depends on the architecture and the accuracy/granularity
> > of the timekeeping hardware available to the system, but it's possible
> > for the granularity of file time stamps to be up to one nanosecond.
> > So you can get results like this:
> >
> > % stat unix_io.o
> >   File: unix_io.o
> >   Size: 55000           Blocks: 112        IO Block: 4096   regular file
> > Device: fc01h/64513d    Inode: 19931278    Links: 1
> > Access: (0644/-rw-r--r--)  Uid: (15806/   tytso)   Gid: (15806/   tytso)
> > Access: 2018-03-15 18:09:21.679914182 -0400
> > Modify: 2018-03-15 18:09:21.639914089 -0400
> > Change: 2018-03-15 18:09:21.639914089 -0400
> >
> 
> Thanks for all the suggestions. I think I will follow ext4's time
> format. 2446 should be far away enough.

If you do, try to avoid the encoding problems that ext4 (still) has:

Not-fixed-by: a4dad1ae24f8 ("ext4: Fix handling of extended tv_sec")

--D

> Thanks,
> Andiry

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC v2 03/83] Add super.h.
  2018-03-16  2:59             ` Theodore Y. Ts'o
  2018-03-16  6:17               ` Andiry Xu
@ 2018-03-16  9:19               ` Arnd Bergmann
  1 sibling, 0 replies; 119+ messages in thread
From: Arnd Bergmann @ 2018-03-16  9:19 UTC (permalink / raw)
  To: Theodore Y. Ts'o, Arnd Bergmann, Andiry Xu, Darrick J. Wong,
	Linux FS Devel, Linux Kernel Mailing List, linux-nvdimm,
	Dan Williams, Rudoff, Andy, coughlan, Steven Swanson,
	Dave Chinner, Jan Kara, swhiteho, miklos, Jian Xu, Andiry Xu

On Fri, Mar 16, 2018 at 3:59 AM, Theodore Y. Ts'o <tytso@mit.edu> wrote:
> On Thu, Mar 15, 2018 at 09:38:29PM +0100, Arnd Bergmann wrote:
>>
>> You could also have a resolution of less than a nanosecond. Note
>> that today, the file time stamps generated by the kernel are in
>> jiffies resolution, so at best one millisecond. However, most modern
>> file systems go with the 64+32 bit timestamps because it's not all
>> that expensive.
>
> It actually depends on the architecture and the accuracy/granularity
> of the timekeeping hardware available to the system, but it's possible
> for the granularity of file time stamps to be up to one nanosecond.
> So you can get results like this:
>
> % stat unix_io.o
>   File: unix_io.o
>   Size: 55000           Blocks: 112        IO Block: 4096   regular file
> Device: fc01h/64513d    Inode: 19931278    Links: 1
> Access: (0644/-rw-r--r--)  Uid: (15806/   tytso)   Gid: (15806/   tytso)
> Access: 2018-03-15 18:09:21.679914182 -0400
> Modify: 2018-03-15 18:09:21.639914089 -0400
> Change: 2018-03-15 18:09:21.639914089 -0400

Note how the nanoseconds only differ in digits 2, 7, 8, and 9 though:

The atime update happened 4 jiffies (at HZ=100) after the mtime,
the low digits are presumably jitter or ntp adjustments.
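
The arithmetic behind that observation can be checked with a few lines of C.
The helper names are made up for this example; the inputs are the fractional
seconds from the stat output above (atime .679914182, mtime .639914089, same
second).

```c
#include <stdint.h>

#define NSEC_PER_SEC	1000000000LL

/* Length of one timer tick in nanoseconds for a given HZ. */
static inline int64_t tick_nsec(int hz)
{
	return NSEC_PER_SEC / hz;
}

/* Number of whole ticks contained in a nanosecond delta. */
static inline int64_t whole_ticks(int64_t delta_nsec, int hz)
{
	return delta_nsec / tick_nsec(hz);
}

/* Nanoseconds left over after removing the whole ticks. */
static inline int64_t tick_remainder_nsec(int64_t delta_nsec, int hz)
{
	return delta_nsec % tick_nsec(hz);
}
```

With delta = 679914182 - 639914089 = 40000093 ns and HZ=100 (10 ms per tick),
this gives 4 whole ticks and a 93 ns remainder, matching the 4 jiffies plus
small low-digit noise described here.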

This is the result of current_time() using the plain tk_xtime
rather than reading the highres clocksource as ktime_get_real_ts64()
does.

This was a performance optimization a long time ago. We could
make the current_time() behavior configurable if we want though,
e.g. at compile time, or as a per-mount option. It's probably more
common these days to have a highres clocksource that can
be read efficiently than it was back when current_fs_time()
was first introduced.

       Arnd

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC v2 05/83] Add NOVA filesystem definitions and useful helper routines.
  2018-03-11 19:22     ` Eric Biggers
  2018-03-11 21:45       ` Andiry Xu
@ 2018-03-19 19:39       ` Andiry Xu
  2018-03-19 20:30         ` Eric Biggers
  1 sibling, 1 reply; 119+ messages in thread
From: Andiry Xu @ 2018-03-19 19:39 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Nikolay Borisov, Linux FS Devel, Linux Kernel Mailing List,
	linux-nvdimm, Dan Williams, Rudoff, Andy, coughlan,
	Steven Swanson, Dave Chinner, Jan Kara, swhiteho, miklos,
	Jian Xu, Andiry Xu, Herbert Xu

On Sun, Mar 11, 2018 at 12:22 PM, Eric Biggers <ebiggers3@gmail.com> wrote:
> On Sun, Mar 11, 2018 at 02:00:13PM +0200, Nikolay Borisov wrote:
>> [Adding Herbert Xu to CC since he is the maintainer of the crypto subsys
>> maintainer]
>>
>> On 10.03.2018 20:17, Andiry Xu wrote:
>> <snip>
>>
>> > +static inline u32 nova_crc32c(u32 crc, const u8 *data, size_t len)
>> > +{
>> > +   u8 *ptr = (u8 *) data;
>> > +   u64 acc = crc; /* accumulator, crc32c value in lower 32b */
>> > +   u32 csum;
>> > +
>> > +   /* x86 instruction crc32 is part of SSE-4.2 */
>> > +   if (static_cpu_has(X86_FEATURE_XMM4_2)) {
>> > +           /* This inline assembly implementation should be equivalent
>> > +            * to the kernel's crc32c_intel_le_hw() function used by
>> > +            * crc32c(), but this performs better on test machines.
>> > +            */
>> > +           while (len > 8) {
>> > +                   asm volatile(/* 64b quad words */
>> > +                           "crc32q (%1), %0"
>> > +                           : "=r" (acc)
>> > +                           : "r"  (ptr), "0" (acc)
>> > +                   );
>> > +                   ptr += 8;
>> > +                   len -= 8;
>> > +           }
>> > +
>> > +           while (len > 0) {
>> > +                   asm volatile(/* trailing bytes */
>> > +                           "crc32b (%1), %0"
>> > +                           : "=r" (acc)
>> > +                           : "r"  (ptr), "0" (acc)
>> > +                   );
>> > +                   ptr++;
>> > +                   len--;
>> > +           }
>> > +
>> > +           csum = (u32) acc;
>> > +   } else {
>> > +           /* The kernel's crc32c() function should also detect and use the
>> > +            * crc32 instruction of SSE-4.2. But calling in to this function
>> > +            * is about 3x to 5x slower than the inline assembly version on
>> > +            * some test machines.
>>
>> That is really odd. Did you try to characterize why this is the case? Is
>> it purely the overhead of dispatching to the correct backend function?
>> That's a rather big performance hit.
>>
>> > +            */
>> > +           csum = crc32c(crc, data, len);
>> > +   }
>> > +
>> > +   return csum;
>> > +}
>> > +
>
> Are you sure that CONFIG_CRYPTO_CRC32C_INTEL was enabled during your tests and
> that the accelerated version was being called?  Or, perhaps CRC32C_PCL_BREAKEVEN
> (defined in arch/x86/crypto/crc32c-intel_glue.c) needs to be adjusted.  Please
> don't hack around performance problems like this; if they exist, they need to be
> fixed for everyone.
>

I ran the crc32c test on a platform with a Xeon X5647 at 2.93GHz and
14G of DDR3 memory at 1066MHz.
You are right that enabling CONFIG_CRYPTO_CRC32C_INTEL improves the
performance significantly. nova_crc32c() is still slightly faster than
crc32c() with the flag enabled.
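
When comparing implementations like these, it helps to first confirm that they
produce identical checksums. A portable bit-at-a-time CRC32C can serve as a
reference for that. The sketch below is not kernel code and crc32c_ref() is a
made-up name; it assumes the raw update convention (no implicit inversion of
the seed or result), which is how the x86 crc32 instruction behaves.

```c
#include <stddef.h>
#include <stdint.h>

/* Castagnoli polynomial in bit-reflected form. */
#define CRC32C_POLY_REFLECTED	0x82F63B78u

/*
 * Reference CRC32C update: folds len bytes of data into crc one bit at a
 * time. Slow, but easy to check an optimized version against.
 */
static inline uint32_t crc32c_ref(uint32_t crc, const uint8_t *data, size_t len)
{
	while (len--) {
		crc ^= *data++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^
			      (CRC32C_POLY_REFLECTED & (0u - (crc & 1u)));
	}
	return crc;
}
```

Seeding with 0xFFFFFFFF and inverting the result over the ASCII string
"123456789" yields the standard CRC-32C check value 0xE3069283, which gives a
quick sanity test before comparing against an optimized routine.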

The results are as follows: data size is in bytes and latency is in ns.
Column 3 is crc32c() with CONFIG_CRYPTO_CRC32C_INTEL enabled and column 4
is crc32c() with it disabled.

data size (bytes)   nova_crc32c()   crc32c() -enabled   crc32c() -disabled
64                  19              21                  56
128                 28              29                  99
256                 46              43                  182
512                 82              149                 354
1024                157             232                 728
2048                305             415                 1440
4096                603             725                 2869

Thanks,
Andiry

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC v2 05/83] Add NOVA filesystem definitions and useful helper routines.
  2018-03-19 19:39       ` Andiry Xu
@ 2018-03-19 20:30         ` Eric Biggers
  2018-03-19 21:59           ` Andiry Xu
  0 siblings, 1 reply; 119+ messages in thread
From: Eric Biggers @ 2018-03-19 20:30 UTC (permalink / raw)
  To: Andiry Xu
  Cc: Nikolay Borisov, Linux FS Devel, Linux Kernel Mailing List,
	linux-nvdimm, Dan Williams, Rudoff, Andy, coughlan,
	Steven Swanson, Dave Chinner, Jan Kara, swhiteho, miklos,
	Jian Xu, Andiry Xu, Herbert Xu

On Mon, Mar 19, 2018 at 12:39:55PM -0700, Andiry Xu wrote:
> On Sun, Mar 11, 2018 at 12:22 PM, Eric Biggers <ebiggers3@gmail.com> wrote:
> > On Sun, Mar 11, 2018 at 02:00:13PM +0200, Nikolay Borisov wrote:
> >> [Adding Herbert Xu to CC since he is the maintainer of the crypto subsys
> >> maintainer]
> >>
> >> On 10.03.2018 20:17, Andiry Xu wrote:
> >> <snip>
> >>
> >> > +static inline u32 nova_crc32c(u32 crc, const u8 *data, size_t len)
> >> > +{
> >> > +   u8 *ptr = (u8 *) data;
> >> > +   u64 acc = crc; /* accumulator, crc32c value in lower 32b */
> >> > +   u32 csum;
> >> > +
> >> > +   /* x86 instruction crc32 is part of SSE-4.2 */
> >> > +   if (static_cpu_has(X86_FEATURE_XMM4_2)) {
> >> > +           /* This inline assembly implementation should be equivalent
> >> > +            * to the kernel's crc32c_intel_le_hw() function used by
> >> > +            * crc32c(), but this performs better on test machines.
> >> > +            */
> >> > +           while (len > 8) {
> >> > +                   asm volatile(/* 64b quad words */
> >> > +                           "crc32q (%1), %0"
> >> > +                           : "=r" (acc)
> >> > +                           : "r"  (ptr), "0" (acc)
> >> > +                   );
> >> > +                   ptr += 8;
> >> > +                   len -= 8;
> >> > +           }
> >> > +
> >> > +           while (len > 0) {
> >> > +                   asm volatile(/* trailing bytes */
> >> > +                           "crc32b (%1), %0"
> >> > +                           : "=r" (acc)
> >> > +                           : "r"  (ptr), "0" (acc)
> >> > +                   );
> >> > +                   ptr++;
> >> > +                   len--;
> >> > +           }
> >> > +
> >> > +           csum = (u32) acc;
> >> > +   } else {
> >> > +           /* The kernel's crc32c() function should also detect and use the
> >> > +            * crc32 instruction of SSE-4.2. But calling in to this function
> >> > +            * is about 3x to 5x slower than the inline assembly version on
> >> > +            * some test machines.
> >>
> >> That is really odd. Did you try to characterize why this is the case? Is
> >> it purely the overhead of dispatching to the correct backend function?
> >> That's a rather big performance hit.
> >>
> >> > +            */
> >> > +           csum = crc32c(crc, data, len);
> >> > +   }
> >> > +
> >> > +   return csum;
> >> > +}
> >> > +
> >
> > Are you sure that CONFIG_CRYPTO_CRC32C_INTEL was enabled during your tests and
> > that the accelerated version was being called?  Or, perhaps CRC32C_PCL_BREAKEVEN
> > (defined in arch/x86/crypto/crc32c-intel_glue.c) needs to be adjusted.  Please
> > don't hack around performance problems like this; if they exist, they need to be
> > fixed for everyone.
> >
> 
> I have performed the crc32c test on a Xeon X5647 at 2.93GHz, 14G DDR3
> memory at 1066MHz platform.
> You are right that enabling CONFIG_CRYPTO_CRC32C_INTEL improves the
> performance significantly. nova_crc32c() is still slightly faster than
> crc32c() with the flag enabled.
> 
> Result numbers are follows: data size in bytes, latency in ns, column
> 3 is crc32c() with  CONFIG_CRYPTO_CRC32C_INTEL enabled and column 4
> disabled.
> 
> data size (bytes)   nova_crc32c()   crc32c() -enabled   crc32c() -disabled
> 64                  19              21                  56
> 128                 28              29                  99
> 256                 46              43                  182
> 512                 82              149                 354
> 1024                157             232                 728
> 2048                305             415                 1440
> 4096                603             725                 2869
> 

Probably CRC32C_PCL_BREAKEVEN needs to be adjusted for that CPU, as I suggested
may be the case; notice that your measured speeds are about the same below 512
bytes (CRC32C_PCL_BREAKEVEN), but the crypto API version is slower at >= 512
bytes.  It would be possible to set the breakeven point in
crc32c_intel_mod_init() depending on the CPU.  Again, if the performance is not
good enough you need to fix it for everyone, not hack around it.

Thanks,

Eric

^ permalink raw reply	[flat|nested] 119+ messages in thread

* Re: [RFC v2 01/83] Introduction and documentation of NOVA filesystem.
  2018-03-10 18:17 ` [RFC v2 01/83] Introduction and documentation of NOVA filesystem Andiry Xu
@ 2018-03-19 20:43   ` Randy Dunlap
  2018-03-19 23:00     ` Andiry Xu
  2018-04-22  8:05   ` Pavel Machek
  1 sibling, 1 reply; 119+ messages in thread
From: Randy Dunlap @ 2018-03-19 20:43 UTC (permalink / raw)
  To: Andiry Xu, linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: dan.j.williams, andy.rudoff, coughlan, swanson, david, jack,
	swhiteho, miklos, andiry.xu, Andiry Xu

On 03/10/2018 10:17 AM, Andiry Xu wrote:
> From: Andiry Xu <jix024@cs.ucsd.edu>
> 
> NOVA is a log-structured file system tailored for byte-addressable non-volatile memories.
> It was designed and developed at the Non-Volatile Systems Laboratory in the Computer
> Science and Engineering Department at the University of California, San Diego.
> Its primary authors are Andiry Xu <jix024@eng.ucsd.edu>, Lu Zhang
> <luzh@eng.ucsd.edu>, and Steven Swanson <swanson@eng.ucsd.edu>.
> 
> These two papers provide a detailed, high-level description of NOVA's design goals and approach:
> 
>    NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories
>    In The 14th USENIX Conference on File and Storage Technologies (FAST '16)
>    (http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf)
> 
>    NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System
>    In The 26th ACM Symposium on Operating Systems Principles (SOSP '17)
>    (http://cseweb.ucsd.edu/~swanson/papers/SOSP2017-NOVAFortis.pdf)
> 
> This patchset contains features from the FAST paper. We leave NOVA-Fortis features,
> such as snapshot, metadata and data replication and RAID parity for
> future submission.
> 
> Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
> ---
>  Documentation/filesystems/00-INDEX |   2 +
>  Documentation/filesystems/nova.txt | 498 +++++++++++++++++++++++++++++++++++++
>  MAINTAINERS                        |   8 +
>  3 files changed, 508 insertions(+)
>  create mode 100644 Documentation/filesystems/nova.txt

> diff --git a/Documentation/filesystems/nova.txt b/Documentation/filesystems/nova.txt
> new file mode 100644
> index 0000000..4728f50
> --- /dev/null
> +++ b/Documentation/filesystems/nova.txt
> @@ -0,0 +1,498 @@
> +The NOVA Filesystem
> +===================
> +
> +NOn-Volatile memory Accelerated file system (NOVA) is a DAX file system
> +designed to provide a high performance and production-ready file system
> +tailored for byte-addressable non-volatile memories (e.g., NVDIMMs
> +and Intel's soon-to-be-released 3DXPoint DIMMs).
> +NOVA combines design elements from many other file systems
> +and adapts conventional log-structured file system techniques to
> +exploit the fast random access that NVMs provide. In particular, NOVA maintains
> +separate logs for each inode to improve concurrency, and stores file data
> +outside the log to minimize log size and reduce garbage collection costs. NOVA's
> +logs provide metadata and data atomicity and focus on simplicity and
> +reliability, keeping complex metadata structures in DRAM to accelerate lookup
> +operations.
> +
> +NOVA was developed by the Non-Volatile Systems Laboratory (NVSL) in
> +the Computer Science and Engineering Department at the University of
> +California, San Diego.
> +
> +A more thorough discussion of NOVA's design is avaialable in these two papers:

                                                  available

> +
> +NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories
> +Jian Xu and Steven Swanson
> +In The 14th USENIX Conference on File and Storage Technologies (FAST '16)
> +
> +NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System
> +Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase,
> +Tamires Brito Da Silva, Andy Rudoff and Steven Swanson
> +In The 26th ACM Symposium on Operating Systems Principles (SOSP '17)
> +
> +This version of NOVA contains features from the FAST paper.
> +NOVA-Fortis features, such as snapshot, metadata and data protection and replication
> +are left for future submission.
> +
> +The main NOVA features include:
> +
> +  * POSIX semantics
> +  * Directly access (DAX) byte-addressable NVMM without page caching
> +  * Per-CPU NVMM pool to maximize concurrency
> +  * Strong consistency guarantees with 8-byte atomic stores
> +
> +
> +Filesystem Design
> +=================
> +
> +NOVA divides NVMM into several regions. NOVA's 512B superblock contains global

                                        (prefer:) 512-byte

> +file system information and the recovery inode. The recovery inode represents a
> +special file that stores recovery information (e.g., the list of unallocated
> +NVMM pages). NOVA divides its inode tables into per-CPU stripes. It also
> +provides per-CPU journals for complex file operations that involve multiple
> +inodes. The rest of the available NVMM stores logs and file data.
> +
> +NOVA is log-structured and stores a separate log for each inode to maximize
> +concurrency and provide atomicity for operations that affect a single file. The
> +logs only store metadata and comprise a linked list of 4 KB pages. Log entries
> +are small – between 32 and 64 bytes. Logs are generally non-contiguous, and log
> +pages may reside anywhere in NVMM.
> +
> +NOVA keeps copies of most file metadata in DRAM during normal
> +operations, eliminating the need to access metadata in NVMM during reads.
> +
> +NOVA supports both copy-on-write and in-place file data updates and appends
> +metadata about the write to the log. For operations that affect multiple inodes

                                                                            inodes,

> +NOVA uses lightweight, fixed-length journals –one per core.

                                                -- one per core.

> +
> +NOVA divides the allocatable NVMM into multiple regions, one region per CPU
> +core. A per-core allocator manages each of the regions, minimizing contention
> +during memory allocation.
> +
> +After a system crash, NOVA must scan all the logs to rebuild the memory
> +allocator state. Since, there are many logs, NOVA aggressively parallelizes the

                    Since there are

> +scan.
> +
> +
> +Building and using NOVA
> +=======================
> +
> +To build NOVA, build the kernel with PMEM (`CONFIG_BLK_DEV_PMEM`),
> +DAX (`CONFIG_FS_DAX`) and NOVA (`CONFIG_NOVA_FS`) support.  Install as usual.
> +
> +NOVA runs on a pmem non-volatile memory region.  You can create one of these
> +regions with the `memmap` kernel command line option.  For instance, adding
> +`memmap=16G!8G` to the kernel boot parameters will reserve 16GB memory starting
> +from address 8GB, and the kernel will create a `pmem0` block device under the
> +`/dev` directory.
> +
> +After the OS has booted, you can initialize a NOVA instance with the following commands:
> +
> +
> +# modprobe nova
> +# mount -t NOVA -o init /dev/pmem0 /mnt/nova

Hmph, unique in upper-case-ness (at least for in-tree fs-es).
Would you consider "nova" instead?

> +
> +
> +The above commands create a NOVA instance on `/dev/pmem0` and mounts it on
> +`/mnt/nova`.
> +
> +NOVA support several module command line options:

        supports

> +
> + * measure_timing: Measure the timing of file system operations for profiling (default: 0)
> +
> + * inplace_data_updates:  Update data in place rather than with COW (default: 0)
> +
> +To recover an existing NOVA instance, mount NOVA without the init option, for example:
> +
> +# mount -t NOVA /dev/pmem0 /mnt/nova
> +
> +
> +Sysfs support
> +-------------
> +
> +NOVA provides sysfs support to enable user to get/set information of 

                                  enable a user
                        or        enable users

And the line above ends with a trailing space.  Please check/remove all of those.

> +a running NOVA instance.
> +After mount, NOVA creates four entries under proc directory /proc/fs/nova/pmem#/:

Above uses lower-case "nova" in /proc/fs/nova/... but the examples below use NOVA.
nova is preferred (IMO).

> +
> +timing_stats	IO_stats	allocator	gc
> +
> +Show NOVA file operation timing statistics:
> +# cat /proc/fs/NOVA/pmem#/timing_stats
> +
> +Clear timing statistics:
> +# echo 1 > /proc/fs/NOVA/pmem#/timing_stats
> +
> +Show NOVA I/O statistics:
> +# cat /proc/fs/NOVA/pmem#/IO_stats
> +
> +Clear I/O statistics:
> +# echo 1 > /proc/fs/NOVA/pmem#/IO_stats
> +
> +Show NOVA allocator information:
> +# cat /proc/fs/NOVA/pmem#/allocator
> +
> +Manual garbage collection:
> +# echo #inode_number > /proc/fs/NOVA/pmem#/gc
> +
> +
> +Source File Structure
> +=====================
> +
> +  * nova_def.h/nova.h
> +   Defines NOVA macros and key inline functions.
> +
> +  * balloc.{h,c}
> +    NOVA's pmem allocator implementation.
> +
> +  * bbuild.c
> +    Implements recovery routines to restore the in-use inode list and the NVMM
> +    allocator information.
> +
> +  * dax.c
> +    Implements DAX read/write and mmap functions to access file data. NOVA uses
> +    copy-on-write to modify file pages by default, unless inplace data update is
> +    enabled at mount-time.
> +
> +  * dir.c
> +    Contains functions to create, update, and remove NOVA dentries.
> +
> +  * file.c
> +    Implements file-related operations such as open, fallocate, llseek, fsync,
> +    and flush.
> +
> +  * gc.c
> +    NOVA's garbage collection functions.
> +
> +  * inode.{h,c}
> +    Creates, reads, and frees NOVA inode tables and inodes.
> +
> +  * ioctl.c
> +    Implements some ioctl commands to call NOVA's internal functions.
> +
> +  * journal.{h,c}
> +    For operations that affect multiple inodes NOVA uses lightweight,
> +    fixed-length journals – one per core. This file contains functions to
> +    create and manage the lite journals.
> +
> +  * log.{h,c}
> +    Functions to manipulate NOVA inode logs, including log page allocation, log
> +    entry creation, commit, modification, and deletion.
> +
> +  * namei.c
> +    Functions to create/remove files, directories, and links. It also looks for
> +    the NOVA inode number for a given path name.
> +
> +  * rebuild.c
> +    When mounting NOVA, rebuild NOVA inodes from its logs.
> +
> +  * stats.{h,c}
> +    Provide routines to gather and print NOVA usage statistics.
> +
> +  * super.{h,c}
> +    Super block structures and NOVA FS layout and entry points for NOVA
> +    mounting and unmounting, initializing or recovering the NOVA super block
> +    and other global file system information.
> +
> +  * symlink.c
> +    Implements functions to create and read symbolic links in the filesystem.
> +
> +  * sysfs.c
> +    Implements sysfs entries to take user inputs for printing NOVA statistics.

s/sysfs/procfs/

> +
> +
> +Filesystem Layout
> +=================
> +
> +A NOVA file systems resides in single PMEM device. *****
> +NOVA divides the device into 4KB blocks.

                                4 KB  {or use 4KB way up above here}

> +
> + block
> ++---------------------------------------------------------+
> +|    0    | primary super block (struct nova_super_block) |
> ++---------------------------------------------------------+
> +|    1    | Reserved inodes                               |
> ++---------------------------------------------------------+
> +|  2 - 15 | reserved                                      |
> ++---------------------------------------------------------+
> +| 16 - 31 | Inode table pointers                          |
> ++---------------------------------------------------------+
> +| 32 - 47 | Journal pointers                              |
> ++---------------------------------------------------------+
> +| 48 - 63 | reserved                                      |
> ++---------------------------------------------------------+
> +|   ...   | log and data pages                            |
> ++---------------------------------------------------------+
> +|   n-2   | replica reserved Inodes                       |
> ++---------------------------------------------------------+
> +|   n-1   | replica super block                           |
> ++---------------------------------------------------------+
> +
> +
> +
> +Superblock and Associated Structures
> +====================================
> +
> +The beginning of the PMEM device hold the super block and its associated

                                    holds

> +tables.  These include reserved inodes, a table of pointers to the journals
> +NOVA uses for complex operations, and pointers to inodes tables.  NOVA
> +maintains replicas of the super block and reserved inodes in the last two
> +blocks of the PMEM area.
> +
> +
> +Block Allocator/Free Lists
> +==========================
> +
> +NOVA uses per-CPU allocators to manage free PMEM blocks.  On initialization,
> +NOVA divides the range of blocks in the PMEM device among the CPUs, and those
> +blocks are managed solely by that CPU.  We call these ranges of "allocation regions".
> +Each allocator maintains a red-black tree of unallocated ranges (struct
> +nova_range_node).
> +
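
To make the region split concrete, here is a minimal sketch of dividing a block
range among CPUs. It is an illustration only: `struct block_range` and
`cpu_region()` are hypothetical names, not code from the patch, and giving the
leftover blocks to the last CPU is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical half-open block range [start, end). */
struct block_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Split the allocatable blocks [first, total) evenly among ncpus CPUs and
 * return the range owned by 'cpu'.  Assumption: the last CPU also takes
 * whatever remainder the integer division leaves over.
 */
static inline struct block_range cpu_region(uint64_t first, uint64_t total,
					    unsigned int ncpus,
					    unsigned int cpu)
{
	struct block_range r;
	uint64_t per_cpu;

	assert(ncpus > 0 && cpu < ncpus && first <= total);

	per_cpu = (total - first) / ncpus;
	r.start = first + per_cpu * cpu;
	r.end = (cpu == ncpus - 1) ? total : r.start + per_cpu;
	return r;
}
```

Each CPU would then seed its own free-range tree with the single range returned
for it, so allocations on different CPUs never touch the same tree.
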
> +Allocation Functions
> +--------------------
> +
> +NOVA allocate PMEM blocks using two mechanisms:

        allocates

> +
> +1.  Static allocation as defined in super.h
> +
> +2.  Allocation for log and data pages via nova_new_log_blocks() and
> +nova_new_data_blocks().
> +
> +
> +PMEM Address Translation
> +------------------------
> +
> +In NOVA's persistent data structures, memory locations are given as offsets
> +from the beginning of the PMEM region.  nova_get_block() translates offsets to
> +PMEM addresses.  nova_get_addr_off() performs the reverse translation.
> +
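
A small sketch of the two translations may help here. The struct and helper
names below are hypothetical stand-ins; the real nova_get_block() and
nova_get_addr_off() take different arguments, so treat this only as an
illustration of the offset <-> address idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a mapped PMEM region. */
struct pmem_region {
	uint8_t *virt_addr;	/* where the region is mapped */
	size_t size;		/* region length in bytes */
};

/* Offset -> address: the direction the text assigns to nova_get_block(). */
static inline void *off_to_addr(const struct pmem_region *r, uint64_t off)
{
	assert(off < r->size);
	return r->virt_addr + off;
}

/* Address -> offset: the direction assigned to nova_get_addr_off(). */
static inline uint64_t addr_to_off(const struct pmem_region *r,
				   const void *addr)
{
	const uint8_t *p = addr;

	assert(p >= r->virt_addr && p < r->virt_addr + r->size);
	return (uint64_t)(p - r->virt_addr);
}
```

Storing offsets rather than pointers keeps the persistent structures valid even
if the region is mapped at a different virtual address on the next mount.
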
> +
> +Inodes
> +======
> +
> +NOVA maintains per-CPU inode tables, and inode numbers are striped across the
> +tables (i.e., inos 0, n, 2n,... on cpu 0; inos 1, n + 1, 2n + 1, ... on cpu 1).
> +
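
The striping above can be sketched as follows. `ino_to_cpu()` and
`ino_to_slot()` are hypothetical names used only for illustration; the patch
may compute this differently.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helpers: with n per-CPU inode tables, inode number 'ino'
 * belongs to table (ino % n) and is entry (ino / n) within that table.
 */
static inline unsigned int ino_to_cpu(uint64_t ino, unsigned int ncpus)
{
	assert(ncpus > 0);
	return (unsigned int)(ino % ncpus);
}

static inline uint64_t ino_to_slot(uint64_t ino, unsigned int ncpus)
{
	assert(ncpus > 0);
	return ino / ncpus;
}
```

With 4 CPUs, inos 0, 4 and 8 land on CPU 0 as entries 0, 1 and 2, and ino 5 is
entry 1 on CPU 1. This also shows why the table layout depends on the CPU count
used at creation time.
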
> +The inodes themselves live in a set of linked lists (one per CPU) of 2MB
> +blocks.  The last 8 bytes of each block points to the next block.  Pointers to
> +heads of these list live in PMEM block INODE_TABLE_START.

                  lists

> +Additional space for inodes is allocated on demand.
> +
> +To allocate inodes, NOVA maintains a per-cpu "inuse_list" in DRAM holds a RB

s/cpu/CPU/g
s/a RB/an RB/

but that isn't quite a sentence. Please fix it.

> +tree that holds ranges of allocated inode numbers.
> +
> +
> +Logs
> +====
> +
> +NOVA maintains a log for each inode that records updates to the inode's
> +metadata and holds pointers to the file data.  NOVA makes updates to file data
> +and metadata atomic by atomically appending log entries to the log.
> +
> +Each inode contains pointers to head and tail of the inode's log.  When the log
> +grows past the end of the last page, nova allocates additional space.  For
> +short logs (less than 1MB) , it doubles the length.  For longer logs, it adds a
> +fixed amount of additional space (1MB).
> +
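
The growth policy reads like the sketch below. The macro and function names are
hypothetical, and starting an empty log with a single page is an assumption the
text does not state.

```c
#include <assert.h>
#include <stdint.h>

#define LOG_PAGE_SIZE    4096u
#define LOG_EXTEND_BYTES (1u << 20)			/* 1MB threshold */
#define LOG_EXTEND_PAGES (LOG_EXTEND_BYTES / LOG_PAGE_SIZE)	/* 256 pages */

/* Number of pages to add when a log of cur_pages pages fills up. */
static inline uint64_t log_pages_to_add(uint64_t cur_pages)
{
	assert(LOG_EXTEND_PAGES == 256);

	if (cur_pages == 0)
		return 1;		/* assumption: empty log gets one page */
	if (cur_pages < LOG_EXTEND_PAGES)
		return cur_pages;	/* short log: double the length */
	return LOG_EXTEND_PAGES;	/* long log: add a fixed 1MB */
}
```

So a 128-page log grows to 256 pages, while a 1000-page log grows to 1256.
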
> +Log space is reclaimed during garbage collection.
> +
> +Log Entries
> +-----------
> +
> +There are four kinds of log entry, documented in log.h.  The log entries have
> +several entries in common:
> +
> +   1.  'epoch_id' gives the epoch during which the log entry was created.
> +   Creating a snapshot increments the epoch_id for the file systems.

                                                          file system.  (?)
or do multiple epochs (snapshots) => multiple fs-es?

> +   Currently disabled (always zero).
> +
> +   2.  'trans_id' is per-inode, monotone increasing, number assigned each
> +   log entry.  It provides an ordering over FS operations on a single inode.
> +
> +   3.  'invalid' is true if the effects of this entry are dead and the log
> +   entry can be garbage collected.
> +
> +   4.  'csum' is a CRC32 checksum for the entry. Currently it is disabled.
> +
> +Log structure
> +-------------
> +
> +The logs comprise a linked list of PMEM blocks.  The tail of each block
> +contains some metadata about the block and pointers to the next block and
> +block's replica (struct nova_inode_page_tail).
> +
> ++----------------+
> +| log entry      |
> ++----------------+
> +| log entry      |
> ++----------------+
> +| ...            |
> ++----------------+
> +| tail           |
> +|  metadata      |
> +|  -> next block |
> ++----------------+
> +
> +
> +Journals
> +========
> +
> +NOVA uses a lightweight journaling mechanisms to provide atomicity for

                                      mechanism

> +operations that modify more than one on inode.  The journals providing logging

end of that "sentence" (above) is confusing or missing something.

> +for two operations:
> +
> +1.  Single word updates (JOURNAL_ENTRY)
> +2.  Copying inodes (JOURNAL_INODE)
> +
> +The journals are undo logs: NOVA creates the journal entries for an operation,
> +and if the operation does not complete due to a system failure, the recovery
> +process rolls back the changes using the journal entries.
> +
> +To commit, NOVA drops the log.
> +
> +NOVA maintains one journal per CPU.  The head and tail pointers for each
> +journal live in a reserved page near the beginning of the file system.
> +
> +During recovery, NOVA scans the journals and undoes the operations described by
> +each entry.
> +
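
For readers unfamiliar with undo logging, a minimal in-memory sketch follows.
All names are hypothetical, the journal is a plain array rather than a circular
buffer in PMEM, and the cache flushes and fences a real implementation needs
are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JOURNAL_SLOTS 8

/* One undo record: where the word lives and what it held before. */
struct undo_entry {
	uint64_t *addr;
	uint64_t old;
};

struct undo_journal {
	struct undo_entry e[JOURNAL_SLOTS];
	size_t head;	/* first uncommitted entry */
	size_t tail;	/* one past the last entry */
};

/* Save the old value in the journal, then perform the update. */
static inline void journal_write(struct undo_journal *j, uint64_t *addr,
				 uint64_t val)
{
	assert(j->tail < JOURNAL_SLOTS);
	j->e[j->tail].addr = addr;
	j->e[j->tail].old = *addr;
	j->tail++;
	*addr = val;
}

/* Commit: drop the pending entries so recovery has nothing to undo. */
static inline void journal_commit(struct undo_journal *j)
{
	j->head = j->tail;
}

/* Recovery: roll back uncommitted updates, newest first. */
static inline void journal_recover(struct undo_journal *j)
{
	while (j->tail > j->head) {
		j->tail--;
		*j->e[j->tail].addr = j->e[j->tail].old;
	}
}
```

If a crash happens between journal_write() and journal_commit(), recovery
restores the old values; after a commit, recovery is a no-op.
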
> +
> +File and Directory Access
> +=========================
> +
> +To access file data via read(), NOVA maintains a radix tree in DRAM for each
> +inode (nova_inode_info_header.tree) that maps file offsets to write log
> +entries.  For directories, the same tree maps a hash of filenames to their
> +corresponding dentry.
> +
> +In both cases, the nova populates the tree when the file or directory is opened

                  the nova fs (?)

> +by scanning its log.
> +
> +
> +MMap and DAX
> +============
> +
> +NOVA leverages the kernel's DAX mechanisms for mmap and file data access.
> +NOVA supports DAX-style mmap, i.e. mapping NVM pages directly to the
> +application's address space.
> +
> +
> +Garbage Collection
> +==================
> +
> +NOVA recovers log space with a two-phase garbage collection system.  When a log
> +reaches the end of its allocated pages, NOVA allocates more space.  Then, the
> +fast GC algorithm scans the log to remove pages that have no valid entries.
> +Then, it estimates how many pages the logs valid entries would fill.  If this
> +is less than half the number of pages in the log, the second GC phase copies
> +the valid entries to new pages.
> +
> +For example (V=valid; I=invalid):
> +
> ++---+         +---+	        +---+
> +| I |	       | I |  	      	| V |
> ++---+	       +---+  Thorough	+---+
> +| V |	       | V |  	 GC   	| V |
> ++---+	       +---+   =====> 	+---+
> +| I |	       | I |  	      	| V |
> ++---+	       +---+	        +---+
> +| V |	       | V |  	        | V |
> ++---+	       +---+            +---+
> +  |	         |
> +  V	         V
> ++---+	       +---+
> +| I |	       | V |
> ++---+	       +---+
> +| I | fast GC  | I |
> ++---+  ====>   +---+
> +| I |	       | I |
> ++---+	       +---+
> +| I |	       | V |
> ++---+	       +---+
> +  |
> +  V
> ++---+
> +| V |
> ++---+
> +| I |
> ++---+
> +| I |
> ++---+
> +| V |
> ++---+
> +
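
The two-phase decision described in the text can be sketched as below. The
names are hypothetical, a fixed number of entries per page is assumed (real
entries vary between 32 and 64 bytes), and "pages in the log" is taken to mean
the pages left after fast GC.

```c
#include <assert.h>
#include <stddef.h>

enum gc_action { GC_NONE, GC_THOROUGH };

/*
 * valid[i] is the number of live entries in log page i.
 * Fast GC keeps only pages that still hold at least one live entry.
 * Thorough GC is chosen when the live entries would fill fewer than half
 * of the pages that remain.
 */
static inline enum gc_action gc_decide(const unsigned int *valid,
				       size_t npages,
				       unsigned int entries_per_page,
				       size_t *pages_after_fast_gc)
{
	size_t kept = 0, live = 0, needed, i;

	assert(entries_per_page > 0);

	for (i = 0; i < npages; i++) {
		if (valid[i] > 0) {
			kept++;
			live += valid[i];
		}
	}

	/* Pages the live entries would occupy if packed densely. */
	needed = (live + entries_per_page - 1) / entries_per_page;

	*pages_after_fast_gc = kept;
	return (2 * needed < kept) ? GC_THOROUGH : GC_NONE;
}
```

For instance, five remaining pages holding one live entry each (four entries
per page) would pack into two pages, which is under half of five, so the
thorough phase would run.
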
> +
> +Umount and Recovery
> +===================
> +
> +Clean umount/mount
> +------------------
> +
> +On a clean unmount, NOVA saves the contents of many of its DRAM data structures
> +to PMEM to accelerate the next mount:
> +
> +1. NOVA stores the allocator state for each of the per-cpu allocators to the
> +   log of a reserved inode (NOVA_BLOCK_NODE_INO).
> +
> +2. NOVA stores the per-CPU lists of alive inodes (the inuse_list) to the
> +   NOVA_BLOCK_INODELIST_INO reserved inode.
> +
> +After a clean unmount, the following mount restores these data and then
> +invalidates them.
> +
> +Recovery after failures
> +-----------------------
> +
> +In case of a unclean dismount (e.g., system crash), NOVA must rebuild these

           of an unclean

> +DRAM structures by scanning the inode logs.  NOVA log scanning is fast because
> +per-CPU inode tables and per-inode logs allow for parallel recovery.
> +
> +The number of live log entries in an inode log is roughly the number of extents
> +in the file.  As a result, NOVA only needs to scan a small fraction of the NVMM
> +during recovery.
> +
> +The NOVA failure recovery consists of two steps:
> +
> +First, NOVA checks its lite weight journals and rolls back any uncommitted

          should be one word: lightweight (or liteweight)

> +transactions to restore the file system to a consistent state.
> +
> +Second, NOVA starts a recovery thread on each CPU and scans the inode tables in
> +parallel, performing log scanning for every valid inode in the inode table.
> +NOVA use different recovery mechanisms for directory inodes and file inodes:

                                                               and file inodes.

> +For a directory inode, NOVA scans the log's linked list to enumerate the pages
> +it occupies, but it does not inspect the log's contents.  For a file inode,
> +NOVA reads the write entries in the log to enumerate the data pages.
> +
> +During the recovery scan NOVA builds a bitmap of occupied pages, and rebuilds
> +the allocator based on the result. After this process completes, the file
> +system is ready to accept new requests.
> +
> +During the same scan, it rebuilds the list of available inodes.
> +
> +
> +Gaps, Missing Features, and Development Status
> +==============================================
> +
> +Although NOVA is a fully-functional file system, there is still much work left
> +to be done.  In particular, (at least) the following items are currently missing:
> +
> +1.  Snapshot, metadata and data replication and protection are left for future submission.
> +2.  There is no mkfs or fsck utility (`mount` takes `-o init` to create a NOVA file system).
> +3.  NOVA only works on x86-64 kernels.
> +4.  NOVA does not currently support extended attributes or ACL.
> +5.  NOVA doesn't provide quota support.
> +6.  Moving NOVA file systems between machines with different numbers of CPUs does not work.

You could artificially limit the number of "known" CPUs so that a NOVA fs could be
moved from a 16-CPU system to an 8-CPU system by telling NOVA to use only 8 CPUs
(as an example).  Just a thought.

> +
> +None of these are fundamental limitations of NOVA's design.
> +
> +NOVA is complete and robust enough to run a range of complex applications, but
> +it is not yet ready for production use.  Our current focus is on adding a few
> +missing features from the list above and finding/fixing bugs.
> +
> +
> +Hacking and Contributing
> +========================
> +
> +If you find bugs, please report them at https://github.com/NVSL/linux-nova/issues.
> +
> +If you have other questions or suggestions you can contact the NOVA developers
> +at cse-nova-hackers@eng.ucsd.edu.


-- 
~Randy


* Re: [RFC v2 05/83] Add NOVA filesystem definitions and useful helper routines.
  2018-03-19 20:30         ` Eric Biggers
@ 2018-03-19 21:59           ` Andiry Xu
  0 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-19 21:59 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Nikolay Borisov, Linux FS Devel, Linux Kernel Mailing List,
	linux-nvdimm, Dan Williams, Rudoff, Andy, coughlan,
	Steven Swanson, Dave Chinner, Jan Kara, swhiteho, miklos,
	Jian Xu, Andiry Xu, Herbert Xu

On Mon, Mar 19, 2018 at 1:30 PM, Eric Biggers <ebiggers3@gmail.com> wrote:
> On Mon, Mar 19, 2018 at 12:39:55PM -0700, Andiry Xu wrote:
>> On Sun, Mar 11, 2018 at 12:22 PM, Eric Biggers <ebiggers3@gmail.com> wrote:
>> > On Sun, Mar 11, 2018 at 02:00:13PM +0200, Nikolay Borisov wrote:
>> >> [Adding Herbert Xu to CC since he is the maintainer of the crypto subsys
>> >> maintainer]
>> >>
>> >> On 10.03.2018 20:17, Andiry Xu wrote:
>> >> <snip>
>> >>
>> >> > +static inline u32 nova_crc32c(u32 crc, const u8 *data, size_t len)
>> >> > +{
>> >> > +   u8 *ptr = (u8 *) data;
>> >> > +   u64 acc = crc; /* accumulator, crc32c value in lower 32b */
>> >> > +   u32 csum;
>> >> > +
>> >> > +   /* x86 instruction crc32 is part of SSE-4.2 */
>> >> > +   if (static_cpu_has(X86_FEATURE_XMM4_2)) {
>> >> > +           /* This inline assembly implementation should be equivalent
>> >> > +            * to the kernel's crc32c_intel_le_hw() function used by
>> >> > +            * crc32c(), but this performs better on test machines.
>> >> > +            */
>> >> > +           while (len > 8) {
>> >> > +                   asm volatile(/* 64b quad words */
>> >> > +                           "crc32q (%1), %0"
>> >> > +                           : "=r" (acc)
>> >> > +                           : "r"  (ptr), "0" (acc)
>> >> > +                   );
>> >> > +                   ptr += 8;
>> >> > +                   len -= 8;
>> >> > +           }
>> >> > +
>> >> > +           while (len > 0) {
>> >> > +                   asm volatile(/* trailing bytes */
>> >> > +                           "crc32b (%1), %0"
>> >> > +                           : "=r" (acc)
>> >> > +                           : "r"  (ptr), "0" (acc)
>> >> > +                   );
>> >> > +                   ptr++;
>> >> > +                   len--;
>> >> > +           }
>> >> > +
>> >> > +           csum = (u32) acc;
>> >> > +   } else {
>> >> > +           /* The kernel's crc32c() function should also detect and use the
>> >> > +            * crc32 instruction of SSE-4.2. But calling in to this function
>> >> > +            * is about 3x to 5x slower than the inline assembly version on
>> >> > +            * some test machines.
>> >>
>> >> That is really odd. Did you try to characterize why this is the case? Is
>> >> it purely the overhead of dispatching to the correct backend function?
>> >> That's a rather big performance hit.
>> >>
>> >> > +            */
>> >> > +           csum = crc32c(crc, data, len);
>> >> > +   }
>> >> > +
>> >> > +   return csum;
>> >> > +}
>> >> > +
>> >
>> > Are you sure that CONFIG_CRYPTO_CRC32C_INTEL was enabled during your tests and
>> > that the accelerated version was being called?  Or, perhaps CRC32C_PCL_BREAKEVEN
>> > (defined in arch/x86/crypto/crc32c-intel_glue.c) needs to be adjusted.  Please
>> > don't hack around performance problems like this; if they exist, they need to be
>> > fixed for everyone.
>> >
>>
>> I have performed the crc32c test on a Xeon X5647 at 2.93GHz, 14G DDR3
>> memory at 1066MHz platform.
>> You are right that enabling CONFIG_CRYPTO_CRC32C_INTEL improves the
>> performance significantly. nova_crc32c() is still slightly faster than
>> crc32c() with the flag enabled.
>>
>> The results are as follows: data size in bytes and latency in ns.  Column 3
>> is crc32c() with CONFIG_CRYPTO_CRC32C_INTEL enabled and column 4 is with it
>> disabled.
>>
>> data size (bytes)   nova_crc32c()   crc32c() -enabled   crc32c() -disabled
>> 64                  19              21                  56
>> 128                 28              29                  99
>> 256                 46              43                  182
>> 512                 82              149                 354
>> 1024                157             232                 728
>> 2048                305             415                 1440
>> 4096                603             725                 2869
>>
>
> Probably CRC32C_PCL_BREAKEVEN needs to be adjusted for that CPU, as I suggested
> may be the case; notice that your measured speeds are about the same before 512
> (CRC32C_PCL_BREAKEVEN) bytes, but the crypto API version is slower at >= 512
> bytes.   It would be possible to set the breakeven point in
> crc32c_intel_mod_init() depending on the CPU.  Again, if the performance is not
> good enough you need to fix it for everyone, not hack around it.
>

We verified that with CRC32C_PCL_BREAKEVEN set to 8192, the performance
difference between nova_crc32c() and the kernel's crc32c() is negligible.
Thanks for the comments, and I will use the kernel's crc32c() in the next
version.
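
For reference, both code paths compute the same CRC32C (Castagnoli) function.
A portable bitwise sketch is below; `crc32c_sw()` is a hypothetical name, and
like the x86 crc32 instruction it is a raw update that leaves any inversion of
the seed and the result to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC32C update using the reflected Castagnoli polynomial
 * 0x82F63B78.  No pre- or post-inversion is applied here.
 */
static inline uint32_t crc32c_sw(uint32_t crc, const uint8_t *data,
				 size_t len)
{
	assert(data != NULL || len == 0);

	while (len--) {
		crc ^= *data++;
		for (int k = 0; k < 8; k++)
			/* Shift right; XOR in the polynomial if a 1 fell out. */
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}
```

With a seed of 0xFFFFFFFF and a final inversion, the string "123456789" gives
the standard CRC-32C check value 0xE3069283. This is far slower than either
hardware path and is only useful for checking results.
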

Thanks,
Andiry


* Re: [RFC v2 01/83] Introduction and documentation of NOVA filesystem.
  2018-03-19 20:43   ` Randy Dunlap
@ 2018-03-19 23:00     ` Andiry Xu
  0 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-19 23:00 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Linux FS Devel, Linux Kernel Mailing List, linux-nvdimm,
	Dan Williams, Rudoff, Andy, coughlan, Steven Swanson,
	Dave Chinner, Jan Kara, swhiteho, miklos, Jian Xu, Andiry Xu

Thanks for all the comments.

On Mon, Mar 19, 2018 at 1:43 PM, Randy Dunlap <rdunlap@infradead.org> wrote:
> On 03/10/2018 10:17 AM, Andiry Xu wrote:
>> From: Andiry Xu <jix024@cs.ucsd.edu>
>>
>> NOVA is a log-structured file system tailored for byte-addressable non-volatile memories.
>> It was designed and developed at the Non-Volatile Systems Laboratory in the Computer
>> Science and Engineering Department at the University of California, San Diego.
>> Its primary authors are Andiry Xu <jix024@eng.ucsd.edu>, Lu Zhang
>> <luzh@eng.ucsd.edu>, and Steven Swanson <swanson@eng.ucsd.edu>.
>>
>> These two papers provide a detailed, high-level description of NOVA's design goals and approach:
>>
>>    NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories
>>    In The 14th USENIX Conference on File and Storage Technologies (FAST '16)
>>    (http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf)
>>
>>    NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System
>>    In The 26th ACM Symposium on Operating Systems Principles (SOSP '17)
>>    (http://cseweb.ucsd.edu/~swanson/papers/SOSP2017-NOVAFortis.pdf)
>>
>> This patchset contains features from the FAST paper. We leave NOVA-Fortis features,
>> such as snapshot, metadata and data replication and RAID parity for
>> future submission.
>>
>> Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
>> ---
>>  Documentation/filesystems/00-INDEX |   2 +
>>  Documentation/filesystems/nova.txt | 498 +++++++++++++++++++++++++++++++++++++
>>  MAINTAINERS                        |   8 +
>>  3 files changed, 508 insertions(+)
>>  create mode 100644 Documentation/filesystems/nova.txt
>
>> diff --git a/Documentation/filesystems/nova.txt b/Documentation/filesystems/nova.txt
>> new file mode 100644
>> index 0000000..4728f50
>> --- /dev/null
>> +++ b/Documentation/filesystems/nova.txt
>> @@ -0,0 +1,498 @@
>> +The NOVA Filesystem
>> +===================
>> +
>> +NOn-Volatile memory Accelerated file system (NOVA) is a DAX file system
>> +designed to provide a high performance and production-ready file system
>> +tailored for byte-addressable non-volatile memories (e.g., NVDIMMs
>> +and Intel's soon-to-be-released 3DXPoint DIMMs).
>> +NOVA combines design elements from many other file systems
>> +and adapts conventional log-structured file system techniques to
>> +exploit the fast random access that NVMs provide. In particular, NOVA maintains
>> +separate logs for each inode to improve concurrency, and stores file data
>> +outside the log to minimize log size and reduce garbage collection costs. NOVA's
>> +logs provide metadata and data atomicity and focus on simplicity and
>> +reliability, keeping complex metadata structures in DRAM to accelerate lookup
>> +operations.
>> +
>> +NOVA was developed by the Non-Volatile Systems Laboratory (NVSL) in
>> +the Computer Science and Engineering Department at the University of
>> +California, San Diego.
>> +
>> +A more thorough discussion of NOVA's design is avaialable in these two papers:
>
>                                                   available
>
>> +
>> +NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories
>> +Jian Xu and Steven Swanson
>> +In The 14th USENIX Conference on File and Storage Technologies (FAST '16)
>> +
>> +NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System
>> +Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase,
>> +Tamires Brito Da Silva, Andy Rudoff and Steven Swanson
>> +In The 26th ACM Symposium on Operating Systems Principles (SOSP '17)
>> +
>> +This version of NOVA contains features from the FAST paper.
>> +NOVA-Fortis features, such as snapshot, metadata and data protection and replication
>> +are left for future submission.
>> +
>> +The main NOVA features include:
>> +
>> +  * POSIX semantics
>> +  * Directly access (DAX) byte-addressable NVMM without page caching
>> +  * Per-CPU NVMM pool to maximize concurrency
>> +  * Strong consistency guarantees with 8-byte atomic stores
>> +
>> +
>> +Filesystem Design
>> +=================
>> +
>> +NOVA divides NVMM into several regions. NOVA's 512B superblock contains global
>
>                                         (prefer:) 512-byte
>
>> +file system information and the recovery inode. The recovery inode represents a
>> +special file that stores recovery information (e.g., the list of unallocated
>> +NVMM pages). NOVA divides its inode tables into per-CPU stripes. It also
>> +provides per-CPU journals for complex file operations that involve multiple
>> +inodes. The rest of the available NVMM stores logs and file data.
>> +
>> +NOVA is log-structured and stores a separate log for each inode to maximize
>> +concurrency and provide atomicity for operations that affect a single file. The
>> +logs only store metadata and comprise a linked list of 4 KB pages. Log entries
>> +are small – between 32 and 64 bytes. Logs are generally non-contiguous, and log
>> +pages may reside anywhere in NVMM.
>> +
>> +NOVA keeps copies of most file metadata in DRAM during normal
>> +operations, eliminating the need to access metadata in NVMM during reads.
>> +
>> +NOVA supports both copy-on-write and in-place file data updates and appends
>> +metadata about the write to the log. For operations that affect multiple inodes
>
>                                                                             inodes,
>
>> +NOVA uses lightweight, fixed-length journals –one per core.
>
>                                                 -- one per core.
>
>> +
>> +NOVA divides the allocatable NVMM into multiple regions, one region per CPU
>> +core. A per-core allocator manages each of the regions, minimizing contention
>> +during memory allocation.
>> +
>> +After a system crash, NOVA must scan all the logs to rebuild the memory
>> +allocator state. Since, there are many logs, NOVA aggressively parallelizes the
>
>                     Since there are
>
>> +scan.
>> +
>> +
>> +Building and using NOVA
>> +=======================
>> +
>> +To build NOVA, build the kernel with PMEM (`CONFIG_BLK_DEV_PMEM`),
>> +DAX (`CONFIG_FS_DAX`) and NOVA (`CONFIG_NOVA_FS`) support.  Install as usual.
>> +
>> +NOVA runs on a pmem non-volatile memory region.  You can create one of these
>> +regions with the `memmap` kernel command line option.  For instance, adding
>> +`memmap=16G!8G` to the kernel boot parameters will reserve 16GB memory starting
>> +from address 8GB, and the kernel will create a `pmem0` block device under the
>> +`/dev` directory.
>> +
>> +After the OS has booted, you can initialize a NOVA instance with the following commands:
>> +
>> +
>> +# modprobe nova
>> +# mount -t NOVA -o init /dev/pmem0 /mnt/nova
>
> Hmph, unique in upper-case-ness (at least for in-tree fs-es).
> Would you consider "nova" instead?
>

I will try that.

>> +
>> +
>> +The above commands create a NOVA instance on `/dev/pmem0` and mounts it on
>> +`/mnt/nova`.
>> +
>> +NOVA support several module command line options:
>
>         supports
>
>> +
>> + * measure_timing: Measure the timing of file system operations for profiling (default: 0)
>> +
>> + * inplace_data_updates:  Update data in place rather than with COW (default: 0)
>> +
>> +To recover an existing NOVA instance, mount NOVA without the init option, for example:
>> +
>> +# mount -t NOVA /dev/pmem0 /mnt/nova
>> +
>> +
>> +Sysfs support
>> +-------------
>> +
>> +NOVA provides sysfs support to enable user to get/set information of
>
>                                   enable a user
>                         or        enable users
>
> And the line above ends with a trailing space.  Please check/remove all of those.
>
>> +a running NOVA instance.
>> +After mount, NOVA creates four entries under proc directory /proc/fs/nova/pmem#/:
>
> Above uses lower-case "nova" in /proc/fs/nova/... but the examples below use NOVA.
> nova is preferred (IMO).
>
>> +
>> +timing_stats IO_stats        allocator       gc
>> +
>> +Show NOVA file operation timing statistics:
>> +# cat /proc/fs/NOVA/pmem#/timing_stats
>> +
>> +Clear timing statistics:
>> +# echo 1 > /proc/fs/NOVA/pmem#/timing_stats
>> +
>> +Show NOVA I/O statistics:
>> +# cat /proc/fs/NOVA/pmem#/IO_stats
>> +
>> +Clear I/O statistics:
>> +# echo 1 > /proc/fs/NOVA/pmem#/IO_stats
>> +
>> +Show NOVA allocator information:
>> +# cat /proc/fs/NOVA/pmem#/allocator
>> +
>> +Manual garbage collection:
>> +# echo #inode_number > /proc/fs/NOVA/pmem#/gc
>> +
>> +
>> +Source File Structure
>> +=====================
>> +
>> +  * nova_def.h/nova.h
>> +   Defines NOVA macros and key inline functions.
>> +
>> +  * balloc.{h,c}
>> +    NOVA's pmem allocator implementation.
>> +
>> +  * bbuild.c
>> +    Implements recovery routines to restore the in-use inode list and the NVMM
>> +    allocator information.
>> +
>> +  * dax.c
>> +    Implements DAX read/write and mmap functions to access file data. NOVA uses
>> +    copy-on-write to modify file pages by default, unless inplace data update is
>> +    enabled at mount-time.
>> +
>> +  * dir.c
>> +    Contains functions to create, update, and remove NOVA dentries.
>> +
>> +  * file.c
>> +    Implements file-related operations such as open, fallocate, llseek, fsync,
>> +    and flush.
>> +
>> +  * gc.c
>> +    NOVA's garbage collection functions.
>> +
>> +  * inode.{h,c}
>> +    Creates, reads, and frees NOVA inode tables and inodes.
>> +
>> +  * ioctl.c
>> +    Implements some ioctl commands to call NOVA's internal functions.
>> +
>> +  * journal.{h,c}
>> +    For operations that affect multiple inodes NOVA uses lightweight,
>> +    fixed-length journals – one per core. This file contains functions to
>> +    create and manage the lite journals.
>> +
>> +  * log.{h,c}
>> +    Functions to manipulate NOVA inode logs, including log page allocation, log
>> +    entry creation, commit, modification, and deletion.
>> +
>> +  * namei.c
>> +    Functions to create/remove files, directories, and links. It also looks for
>> +    the NOVA inode number for a given path name.
>> +
>> +  * rebuild.c
>> +    When mounting NOVA, rebuild NOVA inodes from its logs.
>> +
>> +  * stats.{h,c}
>> +    Provide routines to gather and print NOVA usage statistics.
>> +
>> +  * super.{h,c}
>> +    Super block structures and NOVA FS layout and entry points for NOVA
>> +    mounting and unmounting, initializing or recovering the NOVA super block
>> +    and other global file system information.
>> +
>> +  * symlink.c
>> +    Implements functions to create and read symbolic links in the filesystem.
>> +
>> +  * sysfs.c
>> +    Implements sysfs entries to take user inputs for printing NOVA statistics.
>
> s/sysfs/procfs/
>
>> +
>> +
>> +Filesystem Layout
>> +=================
>> +
>> +A NOVA file systems resides in single PMEM device. *****
>> +NOVA divides the device into 4KB blocks.
>
>                                 4 KB  {or use 4KB way up above here}
>
>> +
>> + block
>> ++---------------------------------------------------------+
>> +|    0    | primary super block (struct nova_super_block) |
>> ++---------------------------------------------------------+
>> +|    1    | Reserved inodes                               |
>> ++---------------------------------------------------------+
>> +|  2 - 15 | reserved                                      |
>> ++---------------------------------------------------------+
>> +| 16 - 31 | Inode table pointers                          |
>> ++---------------------------------------------------------+
>> +| 32 - 47 | Journal pointers                              |
>> ++---------------------------------------------------------+
>> +| 48 - 63 | reserved                                      |
>> ++---------------------------------------------------------+
>> +|   ...   | log and data pages                            |
>> ++---------------------------------------------------------+
>> +|   n-2   | replica reserved Inodes                       |
>> ++---------------------------------------------------------+
>> +|   n-1   | replica super block                           |
>> ++---------------------------------------------------------+
>> +
>> +
>> +
>> +Superblock and Associated Structures
>> +====================================
>> +
>> +The beginning of the PMEM device hold the super block and its associated
>
>                                     holds
>
>> +tables.  These include reserved inodes, a table of pointers to the journals
>> +NOVA uses for complex operations, and pointers to inodes tables.  NOVA
>> +maintains replicas of the super block and reserved inodes in the last two
>> +blocks of the PMEM area.
>> +
>> +
>> +Block Allocator/Free Lists
>> +==========================
>> +
>> +NOVA uses per-CPU allocators to manage free PMEM blocks.  On initialization,
>> +NOVA divides the range of blocks in the PMEM device among the CPUs, and those
>> +blocks are managed solely by that CPU.  We call these ranges of "allocation regions".
>> +Each allocator maintains a red-black tree of unallocated ranges (struct
>> +nova_range_node).
>> +
>> +Allocation Functions
>> +--------------------
>> +
>> +NOVA allocate PMEM blocks using two mechanisms:
>
>         allocates
>
>> +
>> +1.  Static allocation as defined in super.h
>> +
>> +2.  Allocation for log and data pages via nova_new_log_blocks() and
>> +nova_new_data_blocks().
>> +
>> +
>> +PMEM Address Translation
>> +------------------------
>> +
>> +In NOVA's persistent data structures, memory locations are given as offsets
>> +from the beginning of the PMEM region.  nova_get_block() translates offsets to
>> +PMEM addresses.  nova_get_addr_off() performs the reverse translation.
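
The translation is essentially base-plus-offset arithmetic. A minimal
userspace sketch, not the fs/nova code: `sketch_sb`, `virt_addr`, and both
function names are made up for illustration, and treating offset 0 as "no
block" is an assumption of the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-mount info holding the mapped PMEM base. */
struct sketch_sb {
	unsigned char *virt_addr;	/* start of the mapped PMEM region */
	uint64_t size;			/* region size in bytes */
};

/*
 * Offset -> address.  Offset 0 is treated as "no block" (assumption), and
 * offsets past the end of the region are rejected.
 */
void *sketch_get_block(const struct sketch_sb *sb, uint64_t off)
{
	if (off == 0 || off >= sb->size)
		return NULL;
	return sb->virt_addr + off;
}

/* Address -> offset.  The caller guarantees addr lies inside the region. */
uint64_t sketch_get_addr_off(const struct sketch_sb *sb, const void *addr)
{
	return (uint64_t)((const unsigned char *)addr - sb->virt_addr);
}
```

Storing offsets rather than pointers keeps the persistent structures valid
even if the region is mapped at a different virtual address on the next mount.
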
>> +
>> +
>> +Inodes
>> +======
>> +
>> +NOVA maintains per-CPU inode tables, and inode numbers are striped across the
>> +tables (i.e., inos 0, n, 2n,... on cpu 0; inos 1, n + 1, 2n + 1, ... on cpu 1).
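
The striping described above can be sketched as follows. This is a userspace
illustration only; the function names are hypothetical and `ncpus` stands for
the "n" in the text.

```c
#include <stdint.h>

/* Which per-CPU inode table an inode number falls in, given ncpus tables. */
unsigned int sketch_ino_to_cpu(uint64_t ino, unsigned int ncpus)
{
	return (unsigned int)(ino % ncpus);
}

/* Position of the inode within that CPU's table. */
uint64_t sketch_ino_to_index(uint64_t ino, unsigned int ncpus)
{
	return ino / ncpus;
}
```

With 4 CPUs, ino 5 maps to table 1, slot 1, and ino 8 maps to table 0, slot 2.
Note that the mapping depends on the CPU count, so the same inode number lands
in a different table if that count changes.
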
>> +
>> +The inodes themselves live in a set of linked lists (one per CPU) of 2MB
>> +blocks.  The last 8 bytes of each block points to the next block.  Pointers to
>> +heads of these list live in PMEM block INODE_TABLE_START.
>
>                   lists
>
>> +Additional space for inodes is allocated on demand.
>> +
>> +To allocate inodes, NOVA maintains a per-cpu "inuse_list" in DRAM holds a RB
>
> s/cpu/CPU/g
> s/a RB/an RB/
>
> but that isn't quite a sentence. Please fix it.
>
>> +tree that holds ranges of allocated inode numbers.
>> +
>> +
>> +Logs
>> +====
>> +
>> +NOVA maintains a log for each inode that records updates to the inode's
>> +metadata and holds pointers to the file data.  NOVA makes updates to file data
>> +and metadata atomic by atomically appending log entries to the log.
>> +
>> +Each inode contains pointers to head and tail of the inode's log.  When the log
>> +grows past the end of the last page, nova allocates additional space.  For
>> +short logs (less than 1MB) , it doubles the length.  For longer logs, it adds a
>> +fixed amount of additional space (1MB).
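
As a rough sketch of that growth policy, assuming 4KB log pages (so 1MB is
256 pages); the names and the handling of an empty log are assumptions of
this sketch, not the fs/nova code:

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE	4096u
/* 1MB expressed in log pages: 256 */
#define SKETCH_EXTEND_THRESHOLD	((1u << 20) / SKETCH_PAGE_SIZE)

/* Number of pages to add when a log that is cur_pages long fills up. */
uint32_t sketch_log_pages_to_add(uint32_t cur_pages)
{
	if (cur_pages == 0)
		return 1;			/* first page of a new log */
	if (cur_pages < SKETCH_EXTEND_THRESHOLD)
		return cur_pages;		/* short log: double the length */
	return SKETCH_EXTEND_THRESHOLD;		/* long log: fixed 1MB step */
}
```

Doubling keeps allocation calls rare for small files, while the fixed step
bounds the wasted space for large ones.
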
>> +
>> +Log space is reclaimed during garbage collection.
>> +
>> +Log Entries
>> +-----------
>> +
>> +There are four kinds of log entry, documented in log.h.  The log entries have
>> +several entries in common:
>> +
>> +   1.  'epoch_id' gives the epoch during which the log entry was created.
>> +   Creating a snapshot increments the epoch_id for the file systems.
>
>                                                           file system.  (?)
> or do multiple epochs (snapshots) => multiple fs-es?
>
>> +   Currently disabled (always zero).
>> +
>> +   2.  'trans_id' is per-inode, monotone increasing, number assigned each
>> +   log entry.  It provides an ordering over FS operations on a single inode.
>> +
>> +   3.  'invalid' is true if the effects of this entry are dead and the log
>> +   entry can be garbage collected.
>> +
>> +   4.  'csum' is a CRC32 checksum for the entry. Currently it is disabled.
>> +
>> +Log structure
>> +-------------
>> +
>> +The logs comprise a linked list of PMEM blocks.  The tail of each block
>> +contains some metadata about the block and pointers to the next block and
>> +block's replica (struct nova_inode_page_tail).
>> +
>> ++----------------+
>> +| log entry      |
>> ++----------------+
>> +| log entry      |
>> ++----------------+
>> +| ...            |
>> ++----------------+
>> +| tail           |
>> +|  metadata      |
>> +|  -> next block |
>> ++----------------+
>> +
>> +
>> +Journals
>> +========
>> +
>> +NOVA uses a lightweight journaling mechanisms to provide atomicity for
>
>                                       mechanism
>
>> +operations that modify more than one on inode.  The journals providing logging
>
> end of that "sentence" (above) is confusing or missing something.
>
>> +for two operations:
>> +
>> +1.  Single word updates (JOURNAL_ENTRY)
>> +2.  Copying inodes (JOURNAL_INODE)
>> +
>> +The journals are undo logs: NOVA creates the journal entries for an operation,
>> +and if the operation does not complete due to a system failure, the recovery
>> +process rolls back the changes using the journal entries.
>> +
>> +To commit, NOVA drops the log.
>> +
>> +NOVA maintains one journal per CPU.  The head and tail pointers for each
>> +journal live in a reserved page near the beginning of the file system.
>> +
>> +During recovery, NOVA scans the journals and undoes the operations described by
>> +each entry.
>> +
>> +
>> +File and Directory Access
>> +=========================
>> +
>> +To access file data via read(), NOVA maintains a radix tree in DRAM for each
>> +inode (nova_inode_info_header.tree) that maps file offsets to write log
>> +entries.  For directories, the same tree maps a hash of filenames to their
>> +corresponding dentry.
>> +
>> +In both cases, the nova populates the tree when the file or directory is opened
>
>                   the nova fs (?)
>
>> +by scanning its log.
>> +
>> +
>> +MMap and DAX
>> +============
>> +
>> +NOVA leverages the kernel's DAX mechanisms for mmap and file data access.
>> +NOVA supports DAX-style mmap, i.e. mapping NVM pages directly to the
>> +application's address space.
>> +
>> +
>> +Garbage Collection
>> +==================
>> +
>> +NOVA recovers log space with a two-phase garbage collection system.  When a log
>> +reaches the end of its allocated pages, NOVA allocates more space.  Then, the
>> +fast GC algorithm scans the log to remove pages that have no valid entries.
>> +Then, it estimates how many pages the logs valid entries would fill.  If this
>> +is less than half the number of pages in the log, the second GC phase copies
>> +the valid entries to new pages.
>> +
>> +For example (V=valid; I=invalid):
>> +
>> ++---+               +---+            +---+
>> +| I |               | I |            | V |
>> ++---+               +---+  Thorough  +---+
>> +| V |               | V |     GC     | V |
>> ++---+               +---+   =====>   +---+
>> +| I |               | I |            | V |
>> ++---+               +---+            +---+
>> +| V |               | V |            | V |
>> ++---+               +---+            +---+
>> +  |                   |
>> +  V                   V
>> ++---+               +---+
>> +| I |               | V |
>> ++---+               +---+
>> +| I | fast GC       | I |
>> ++---+  ====>        +---+
>> +| I |               | I |
>> ++---+               +---+
>> +| I |               | V |
>> ++---+               +---+
>> +  |
>> +  V
>> ++---+
>> +| V |
>> ++---+
>> +| I |
>> ++---+
>> +| I |
>> ++---+
>> +| V |
>> ++---+
>> +
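
The decision between the two GC phases reduces to one comparison. A minimal
sketch with a hypothetical function name; how the estimate of valid pages is
computed is outside the sketch:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Thorough GC is worthwhile when the valid entries would fill less than
 * half of the pages the log currently occupies.
 */
bool sketch_needs_thorough_gc(uint32_t est_valid_pages, uint32_t total_pages)
{
	/* Widen before multiplying so the comparison cannot overflow. */
	return (uint64_t)est_valid_pages * 2 < total_pages;
}
```

For a 4-page log, an estimate of 1 page triggers the copy; an estimate of 2
or more does not.
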
>> +
>> +Umount and Recovery
>> +===================
>> +
>> +Clean umount/mount
>> +------------------
>> +
>> +On a clean unmount, NOVA saves the contents of many of its DRAM data structures
>> +to PMEM to accelerate the next mount:
>> +
>> +1. NOVA stores the allocator state for each of the per-cpu allocators to the
>> +   log of a reserved inode (NOVA_BLOCK_NODE_INO).
>> +
>> +2. NOVA stores the per-CPU lists of alive inodes (the inuse_list) to the
>> +   NOVA_BLOCK_INODELIST_INO reserved inode.
>> +
>> +After a clean unmount, the following mount restores these data and then
>> +invalidates them.
>> +
>> +Recovery after failures
>> +-----------------------
>> +
>> +In case of a unclean dismount (e.g., system crash), NOVA must rebuild these
>
>            of an unclean
>
>> +DRAM structures by scanning the inode logs.  NOVA log scanning is fast because
>> +per-CPU inode tables and per-inode logs allow for parallel recovery.
>> +
>> +The number of live log entries in an inode log is roughly the number of extents
>> +in the file.  As a result, NOVA only needs to scan a small fraction of the NVMM
>> +during recovery.
>> +
>> +The NOVA failure recovery consists of two steps:
>> +
>> +First, NOVA checks its lite weight journals and rolls back any uncommitted
>
>           should be one word: lightweight (or liteweight)
>
>> +transactions to restore the file system to a consistent state.
>> +
>> +Second, NOVA starts a recovery thread on each CPU and scans the inode tables in
>> +parallel, performing log scanning for every valid inode in the inode table.
>> +NOVA use different recovery mechanisms for directory inodes and file inodes:
>
>                                                                and file inodes.
>
>> +For a directory inode, NOVA scans the log's linked list to enumerate the pages
>> +it occupies, but it does not inspect the log's contents.  For a file inode,
>> +NOVA reads the write entries in the log to enumerate the data pages.
>> +
>> +During the recovery scan NOVA builds a bitmap of occupied pages, and rebuilds
>> +the allocator based on the result. After this process completes, the file
>> +system is ready to accept new requests.
>> +
>> +During the same scan, it rebuilds the list of available inodes.
>> +
>> +
>> +Gaps, Missing Features, and Development Status
>> +==============================================
>> +
>> +Although NOVA is a fully-functional file system, there is still much work left
>> +to be done.  In particular, (at least) the following items are currently missing:
>> +
>> +1.  Snapshot, metadata and data replication and protection are left for future submission.
>> +2.  There is no mkfs or fsck utility (`mount` takes `-o init` to create a NOVA file system).
>> +3.  NOVA only works on x86-64 kernels.
>> +4.  NOVA does not currently support extended attributes or ACL.
>> +5.  NOVA doesn't provide quota support.
>> +6.  Moving NOVA file systems between machines with different numbers of CPUs does not work.
>
> You could artificially limit the number of "known" CPUs so that a NOVA fs could be
> moved from a 16-CPU system to an 8-CPU system by telling NOVA to use only 8 CPUs
> (as an example).  Just a thought.
>

I think storing the number of CPUs in the superblock and checking it
during the mount phase can fix the issue.

Moving from 8-CPU to 16-CPU should be simple, just allocate more inode
tables and journal pages. Moving from 16-CPU to 8-CPU is a little more
difficult, mainly in inode table linking. CPU hotplug is still a
challenge.
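
A rough sketch of what such a mount-time check might look like. Nothing here
exists in the posted patches: the `s_cpus` field, the struct, and the function
name are all hypothetical, and endian conversion of the on-media field is
omitted.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical superblock field; not part of the posted nova_super_block. */
struct sketch_super_block {
	uint32_t s_cpus;	/* CPU count recorded when the fs was created */
};

/*
 * Return 0 if the recorded per-CPU layout matches the running system,
 * -EINVAL otherwise.  A fuller version could grow or relink the per-CPU
 * structures instead of refusing the mount.
 */
int sketch_check_cpus(const struct sketch_super_block *sb, uint32_t online_cpus)
{
	if (sb->s_cpus == 0)
		return -EINVAL;		/* corrupt or uninitialized field */
	if (sb->s_cpus != online_cpus)
		return -EINVAL;		/* layout does not match this machine */
	return 0;
}
```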

I will try to fix it in the next version if I have time.

Thanks,
Andiry

>> +
>> +None of these are fundamental limitations of NOVA's design.
>> +
>> +NOVA is complete and robust enough to run a range of complex applications, but
>> +it is not yet ready for production use.  Our current focus is on adding a few
>> +missing features from the list above and finding/fixing bugs.
>> +
>> +
>> +Hacking and Contributing
>> +========================
>> +
>> +If you find bugs, please report them at https://github.com/NVSL/linux-nova/issues.
>> +
>> +If you have other questions or suggestions you can contact the NOVA developers
>> +at cse-nova-hackers@eng.ucsd.edu.
>
>
> --
> ~Randy

* Re: [RFC v2 83/83] Sysfs support.
  2018-03-10 18:19 ` [RFC v2 83/83] Sysfs support Andiry Xu
  2018-03-15  0:33   ` Randy Dunlap
@ 2018-03-22 15:00   ` David Sterba
  2018-03-23  0:31     ` Andiry Xu
  1 sibling, 1 reply; 119+ messages in thread
From: David Sterba @ 2018-03-22 15:00 UTC (permalink / raw)
  To: Andiry Xu
  Cc: linux-fsdevel, linux-kernel, linux-nvdimm, dan.j.williams,
	andy.rudoff, coughlan, swanson, david, jack, swhiteho, miklos,
	andiry.xu, Andiry Xu

On Sat, Mar 10, 2018 at 10:19:04AM -0800, Andiry Xu wrote:
> From: Andiry Xu <jix024@cs.ucsd.edu>
> 
> Sysfs support allows user to get/post information of running NOVA instance.
> After mount, NOVA creates four entries under proc directory
> /proc/fs/nova/pmem#/:
> 
> timing_stats	IO_stats	allocator	gc
> 
> Show NOVA file operation timing statistics:
> cat /proc/fs/NOVA/pmem#/timing_stats
> 
> Clear timing statistics:
> echo 1 > /proc/fs/NOVA/pmem#/timing_stats
> 
> Show NOVA I/O statistics:
> cat /proc/fs/NOVA/pmem#/IO_stats
> 
> Clear I/O statistics:
> echo 1 > /proc/fs/NOVA/pmem#/IO_stats
> 
> Show NOVA allocator information:
> cat /proc/fs/NOVA/pmem#/allocator
> 
> Manual garbage collection:
> echo #inode_number > /proc/fs/NOVA/pmem#/gc

IIRC no new entries should be added to /proc, /sys is supposed to be
used. I can't find it documented though, so you'd better check with
sysfs people.

* Re: [RFC v2 83/83] Sysfs support.
  2018-03-22 15:00   ` David Sterba
@ 2018-03-23  0:31     ` Andiry Xu
  0 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-03-23  0:31 UTC (permalink / raw)
  To: dsterba, Andiry Xu, Linux FS Devel, Linux Kernel Mailing List,
	linux-nvdimm, Dan Williams, Rudoff, Andy, coughlan,
	Steven Swanson, Dave Chinner, Jan Kara, swhiteho, miklos,
	Jian Xu, Andiry Xu

On Thu, Mar 22, 2018 at 8:00 AM, David Sterba <dsterba@suse.cz> wrote:
> On Sat, Mar 10, 2018 at 10:19:04AM -0800, Andiry Xu wrote:
>> From: Andiry Xu <jix024@cs.ucsd.edu>
>>
>> Sysfs support allows user to get/post information of running NOVA instance.
>> After mount, NOVA creates four entries under proc directory
>> /proc/fs/nova/pmem#/:
>>
>> timing_stats  IO_stats        allocator       gc
>>
>> Show NOVA file operation timing statistics:
>> cat /proc/fs/NOVA/pmem#/timing_stats
>>
>> Clear timing statistics:
>> echo 1 > /proc/fs/NOVA/pmem#/timing_stats
>>
>> Show NOVA I/O statistics:
>> cat /proc/fs/NOVA/pmem#/IO_stats
>>
>> Clear I/O statistics:
>> echo 1 > /proc/fs/NOVA/pmem#/IO_stats
>>
>> Show NOVA allocator information:
>> cat /proc/fs/NOVA/pmem#/allocator
>>
>> Manual garbage collection:
>> echo #inode_number > /proc/fs/NOVA/pmem#/gc
>
> IIRC no new entries should be added to /proc, /sys is supposed to be
> used. I can't find it documented though, so you'd better check with
> sysfs people.

Thanks. I will try to switch to sysfs.

Thanks,
Andiry

* Re: [RFC v2 01/83] Introduction and documentation of NOVA filesystem.
  2018-03-10 18:17 ` [RFC v2 01/83] Introduction and documentation of NOVA filesystem Andiry Xu
  2018-03-19 20:43   ` Randy Dunlap
@ 2018-04-22  8:05   ` Pavel Machek
  1 sibling, 0 replies; 119+ messages in thread
From: Pavel Machek @ 2018-04-22  8:05 UTC (permalink / raw)
  To: Andiry Xu
  Cc: linux-fsdevel, linux-kernel, linux-nvdimm, dan.j.williams,
	andy.rudoff, coughlan, swanson, david, jack, swhiteho, miklos,
	andiry.xu, Andiry Xu


[-- Attachment #1: Type: text/plain, Size: 724 bytes --]

Hi!

> +The above commands create a NOVA instance on `/dev/pmem0` and mounts it on
> +`/mnt/nova`.
> +
> +NOVA support several module command line options:
> +
> + * measure_timing: Measure the timing of file system operations for profiling (default: 0)
> +
> + * inplace_data_updates:  Update data in place rather than with COW (default: 0)
> +
> +To recover an existing NOVA instance, mount NOVA without the init option, for example:
> +
> +# mount -t NOVA /dev/pmem0 /mnt/nova

You may want to limit documentation lines to 80 columns...

									Pavel
									
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

* Re: [RFC v2 06/83] Add inode get/read methods.
  2018-03-10 18:17 ` [RFC v2 06/83] Add inode get/read methods Andiry Xu
@ 2018-04-23  6:12   ` Darrick J. Wong
  2018-04-23 15:55     ` Andiry Xu
  0 siblings, 1 reply; 119+ messages in thread
From: Darrick J. Wong @ 2018-04-23  6:12 UTC (permalink / raw)
  To: Andiry Xu
  Cc: linux-fsdevel, linux-kernel, linux-nvdimm, dan.j.williams,
	andy.rudoff, coughlan, swanson, david, jack, swhiteho, miklos,
	andiry.xu, Andiry Xu

[haaa, I finally found time to read more of these]

On Sat, Mar 10, 2018 at 10:17:47AM -0800, Andiry Xu wrote:
> From: Andiry Xu <jix024@cs.ucsd.edu>
> 
> These routines are incomplete and currently only support reserved inodes,
> whose addresses are fixed. This is necessary for fill_super to work.
> File/dir operations are left NULL.
> 
> Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
> ---
>  fs/nova/inode.c | 176 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/nova/inode.h |   3 +
>  2 files changed, 179 insertions(+)
>  create mode 100644 fs/nova/inode.c
> 
> diff --git a/fs/nova/inode.c b/fs/nova/inode.c
> new file mode 100644
> index 0000000..bfdc5dc
> --- /dev/null
> +++ b/fs/nova/inode.c
> @@ -0,0 +1,176 @@
> +/*
> + * BRIEF DESCRIPTION
> + *
> + * Inode methods (allocate/free/read/write).
> + *
> + * Copyright 2015-2016 Regents of the University of California,
> + * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
> + * Copyright 2012-2013 Intel Corporation
> + * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
> + * Copyright 2003 Sony Corporation
> + * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
> + * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
> + * This file is licensed under the terms of the GNU General Public
> + * License version 2. This program is licensed "as is" without any
> + * warranty of any kind, whether express or implied.
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/aio.h>
> +#include <linux/highuid.h>
> +#include <linux/module.h>
> +#include <linux/mpage.h>
> +#include <linux/backing-dev.h>
> +#include <linux/types.h>
> +#include <linux/ratelimit.h>
> +#include "nova.h"
> +#include "inode.h"
> +
> +unsigned int blk_type_to_shift[NOVA_BLOCK_TYPE_MAX] = {12, 21, 30};
> +uint32_t blk_type_to_size[NOVA_BLOCK_TYPE_MAX] = {0x1000, 0x200000, 0x40000000};
> +
> +void nova_set_inode_flags(struct inode *inode, struct nova_inode *pi,
> +	unsigned int flags)
> +{
> +	inode->i_flags &=
> +		~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC);
> +	if (flags & FS_SYNC_FL)
> +		inode->i_flags |= S_SYNC;
> +	if (flags & FS_APPEND_FL)
> +		inode->i_flags |= S_APPEND;
> +	if (flags & FS_IMMUTABLE_FL)
> +		inode->i_flags |= S_IMMUTABLE;
> +	if (flags & FS_NOATIME_FL)
> +		inode->i_flags |= S_NOATIME;
> +	if (flags & FS_DIRSYNC_FL)
> +		inode->i_flags |= S_DIRSYNC;
> +	if (!pi->i_xattr)
> +		inode_has_no_xattr(inode);
> +	inode->i_flags |= S_DAX;
> +}
> +
> +/* copy persistent state to struct inode */
> +static int nova_read_inode(struct super_block *sb, struct inode *inode,
> +	u64 pi_addr)
> +{
> +	struct nova_inode_info *si = NOVA_I(inode);
> +	struct nova_inode *pi, fake_pi;
> +	struct nova_inode_info_header *sih = &si->header;
> +	int ret = -EIO;
> +	unsigned long ino;
> +
> +	ret = nova_get_reference(sb, pi_addr, &fake_pi,
> +			(void **)&pi, sizeof(struct nova_inode));
> +	if (ret) {
> +		nova_dbg("%s: read pi @ 0x%llx failed\n",
> +				__func__, pi_addr);
> +		goto bad_inode;
> +	}
> +
> +	inode->i_mode = sih->i_mode;

Hm, do you validate the on-pmem metadata as it's read in?  What if
i_mode is garbage?
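
For instance, something along these lines would at least reject a mode whose
file-type bits are not one of the known types. This is a userspace sketch with
a made-up function name; the SK_* constants mirror the usual Linux S_IF*
values and are redefined here only to keep the example self-contained.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_IFMT		0170000u	/* file-type mask */
#define SK_IFSOCK	0140000u
#define SK_IFLNK	0120000u
#define SK_IFREG	0100000u
#define SK_IFBLK	0060000u
#define SK_IFDIR	0040000u
#define SK_IFCHR	0020000u
#define SK_IFIFO	0010000u

/* True if the file-type bits of mode name a known inode type. */
bool sketch_mode_is_valid(uint16_t mode)
{
	switch (mode & SK_IFMT) {
	case SK_IFSOCK:
	case SK_IFLNK:
	case SK_IFREG:
	case SK_IFBLK:
	case SK_IFDIR:
	case SK_IFCHR:
	case SK_IFIFO:
		return true;
	default:
		return false;	/* zero or undefined type bits */
	}
}
```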

> +	i_uid_write(inode, le32_to_cpu(pi->i_uid));
> +	i_gid_write(inode, le32_to_cpu(pi->i_gid));
> +//	set_nlink(inode, le16_to_cpu(pi->i_links_count));

C++ comment?

> +	inode->i_generation = le32_to_cpu(pi->i_generation);
> +	nova_set_inode_flags(inode, pi, le32_to_cpu(pi->i_flags));
> +	ino = inode->i_ino;
> +
> +	/* check if the inode is active. */
> +	if (inode->i_mode == 0 || pi->deleted == 1) {
> +		/* this inode is deleted */
> +		ret = -ESTALE;
> +		goto bad_inode;
> +	}
> +
> +	inode->i_blocks = sih->i_blocks;

Not le64_to_cpu(sih->i_blocks)?  Or is that somewhere else I'm
missing...

> +
> +	switch (inode->i_mode & S_IFMT) {
> +	case S_IFREG:
> +		break;
> +	case S_IFDIR:
> +		break;
> +	case S_IFLNK:
> +		break;
> +	default:
> +		init_special_inode(inode, inode->i_mode,
> +				   le32_to_cpu(pi->dev.rdev));
> +		break;
> +	}
> +
> +	/* Update size and time after rebuild the tree */
> +	inode->i_size = le64_to_cpu(sih->i_size);

FWIW the type of i_size is loff_t, which is a signed type.  Despite
this, the VFS does not support files with negative sizes... which means
that this probably ought to check for that.
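
Roughly what I have in mind, as a userspace sketch: it rejects any stored size
above INT64_MAX, i.e. one that would read back as negative through a 64-bit
loff_t. The function name is made up, and a real implementation would probably
return a corruption-specific error rather than -EINVAL.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Validate a 64-bit size read from media before storing it in a signed
 * 64-bit size field.  Returns 0 and fills *out on success.
 */
int sketch_load_size(uint64_t on_media_size, int64_t *out)
{
	if (on_media_size > (uint64_t)INT64_MAX)
		return -EINVAL;		/* would become a negative file size */
	*out = (int64_t)on_media_size;
	return 0;
}
```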

--D

> +	inode->i_atime.tv_sec = (__s32)le32_to_cpu(pi->i_atime);
> +	inode->i_ctime.tv_sec = (__s32)le32_to_cpu(pi->i_ctime);
> +	inode->i_mtime.tv_sec = (__s32)le32_to_cpu(pi->i_mtime);
> +	inode->i_atime.tv_nsec = inode->i_mtime.tv_nsec =
> +					 inode->i_ctime.tv_nsec = 0;
> +	set_nlink(inode, le16_to_cpu(pi->i_links_count));
> +	return 0;
> +
> +bad_inode:
> +	make_bad_inode(inode);
> +	return ret;
> +}
> +
> +/* Get the address in PMEM of an inode by inode number.  Allocate additional
> + * block to store additional inodes if necessary.
> + */
> +int nova_get_inode_address(struct super_block *sb, u64 ino,
> +	u64 *pi_addr, int extendable)
> +{
> +	if (ino < NOVA_NORMAL_INODE_START) {
> +		*pi_addr = nova_get_reserved_inode_addr(sb, ino);
> +		return 0;
> +	}
> +
> +	*pi_addr = 0;
> +	return 0;
> +}
> +
> +struct inode *nova_iget(struct super_block *sb, unsigned long ino)
> +{
> +	struct nova_inode_info *si;
> +	struct inode *inode;
> +	u64 pi_addr;
> +	int err;
> +
> +	inode = iget_locked(sb, ino);
> +	if (unlikely(!inode))
> +		return ERR_PTR(-ENOMEM);
> +	if (!(inode->i_state & I_NEW))
> +		return inode;
> +
> +	si = NOVA_I(inode);
> +
> +	nova_dbgv("%s: inode %lu\n", __func__, ino);
> +
> +	err = nova_get_inode_address(sb, ino, &pi_addr, 0);
> +	if (err) {
> +		nova_dbg("%s: get inode %lu address failed %d\n",
> +			 __func__, ino, err);
> +		goto fail;
> +	}
> +
> +	if (pi_addr == 0) {
> +		nova_dbg("%s: failed to get pi_addr for inode %lu\n",
> +			 __func__, ino);
> +		err = -EACCES;
> +		goto fail;
> +	}
> +
> +	err = nova_read_inode(sb, inode, pi_addr);
> +	if (unlikely(err)) {
> +		nova_dbg("%s: failed to read inode %lu\n", __func__, ino);
> +		goto fail;
> +
> +	}
> +
> +	inode->i_ino = ino;
> +
> +	unlock_new_inode(inode);
> +	return inode;
> +fail:
> +	iget_failed(inode);
> +	return ERR_PTR(err);
> +}
> +
> diff --git a/fs/nova/inode.h b/fs/nova/inode.h
> index f9187e3..dbd5256 100644
> --- a/fs/nova/inode.h
> +++ b/fs/nova/inode.h
> @@ -184,4 +184,7 @@ static inline int nova_persist_inode(struct nova_inode *pi)
>  	return 0;
>  }
>  
> +int nova_get_inode_address(struct super_block *sb, u64 ino,
> +	u64 *pi_addr, int extendable);
> +struct inode *nova_iget(struct super_block *sb, unsigned long ino);
>  #endif
> -- 
> 2.7.4
> 

* Re: [RFC v2 06/83] Add inode get/read methods.
  2018-04-23  6:12   ` Darrick J. Wong
@ 2018-04-23 15:55     ` Andiry Xu
  0 siblings, 0 replies; 119+ messages in thread
From: Andiry Xu @ 2018-04-23 15:55 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Linux FS Devel, Linux Kernel Mailing List, linux-nvdimm,
	Dan Williams, Rudoff, Andy, Tom Coughlan, Steven Swanson,
	Dave Chinner, Jan Kara, Steven Whitehouse, miklos, Jian Xu,
	Andiry Xu

On Sun, Apr 22, 2018 at 11:12 PM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> [haaa, I finally found time to read more of these]
>
> On Sat, Mar 10, 2018 at 10:17:47AM -0800, Andiry Xu wrote:
>> From: Andiry Xu <jix024@cs.ucsd.edu>
>>
>> These routines are incomplete and currently only support reserved inodes,
>> whose addresses are fixed. This is necessary for fill_super to work.
>> File/dir operations are left NULL.
>>
>> Signed-off-by: Andiry Xu <jix024@cs.ucsd.edu>
>> ---
>>  fs/nova/inode.c | 176 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/nova/inode.h |   3 +
>>  2 files changed, 179 insertions(+)
>>  create mode 100644 fs/nova/inode.c
>>
>> diff --git a/fs/nova/inode.c b/fs/nova/inode.c
>> new file mode 100644
>> index 0000000..bfdc5dc
>> --- /dev/null
>> +++ b/fs/nova/inode.c
>> @@ -0,0 +1,176 @@
>> +/*
>> + * BRIEF DESCRIPTION
>> + *
>> + * Inode methods (allocate/free/read/write).
>> + *
>> + * Copyright 2015-2016 Regents of the University of California,
>> + * UCSD Non-Volatile Systems Lab, Andiry Xu <jix024@cs.ucsd.edu>
>> + * Copyright 2012-2013 Intel Corporation
>> + * Copyright 2009-2011 Marco Stornelli <marco.stornelli@gmail.com>
>> + * Copyright 2003 Sony Corporation
>> + * Copyright 2003 Matsushita Electric Industrial Co., Ltd.
>> + * 2003-2004 (c) MontaVista Software, Inc. , Steve Longerbeam
>> + * This file is licensed under the terms of the GNU General Public
>> + * License version 2. This program is licensed "as is" without any
>> + * warranty of any kind, whether express or implied.
>> + */
>> +
>> +#include <linux/fs.h>
>> +#include <linux/aio.h>
>> +#include <linux/highuid.h>
>> +#include <linux/module.h>
>> +#include <linux/mpage.h>
>> +#include <linux/backing-dev.h>
>> +#include <linux/types.h>
>> +#include <linux/ratelimit.h>
>> +#include "nova.h"
>> +#include "inode.h"
>> +
>> +unsigned int blk_type_to_shift[NOVA_BLOCK_TYPE_MAX] = {12, 21, 30};
>> +uint32_t blk_type_to_size[NOVA_BLOCK_TYPE_MAX] = {0x1000, 0x200000, 0x40000000};
>> +
>> +void nova_set_inode_flags(struct inode *inode, struct nova_inode *pi,
>> +     unsigned int flags)
>> +{
>> +     inode->i_flags &=
>> +             ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC);
>> +     if (flags & FS_SYNC_FL)
>> +             inode->i_flags |= S_SYNC;
>> +     if (flags & FS_APPEND_FL)
>> +             inode->i_flags |= S_APPEND;
>> +     if (flags & FS_IMMUTABLE_FL)
>> +             inode->i_flags |= S_IMMUTABLE;
>> +     if (flags & FS_NOATIME_FL)
>> +             inode->i_flags |= S_NOATIME;
>> +     if (flags & FS_DIRSYNC_FL)
>> +             inode->i_flags |= S_DIRSYNC;
>> +     if (!pi->i_xattr)
>> +             inode_has_no_xattr(inode);
>> +     inode->i_flags |= S_DAX;
>> +}
>> +
>> +/* copy persistent state to struct inode */
>> +static int nova_read_inode(struct super_block *sb, struct inode *inode,
>> +     u64 pi_addr)
>> +{
>> +     struct nova_inode_info *si = NOVA_I(inode);
>> +     struct nova_inode *pi, fake_pi;
>> +     struct nova_inode_info_header *sih = &si->header;
>> +     int ret = -EIO;
>> +     unsigned long ino;
>> +
>> +     ret = nova_get_reference(sb, pi_addr, &fake_pi,
>> +                     (void **)&pi, sizeof(struct nova_inode));
>> +     if (ret) {
>> +             nova_dbg("%s: read pi @ 0x%llx failed\n",
>> +                             __func__, pi_addr);
>> +             goto bad_inode;
>> +     }
>> +
>> +     inode->i_mode = sih->i_mode;
>
> Hm, do you validate the on-pmem metadata as it's read in?  What if
> i_mode is garbage?
>

I have checksums for the inode and all metadata structures in the
NOVA-fortis code. I removed them from this patchset to keep the code
shorter and simpler.

>> +     i_uid_write(inode, le32_to_cpu(pi->i_uid));
>> +     i_gid_write(inode, le32_to_cpu(pi->i_gid));
>> +//   set_nlink(inode, le16_to_cpu(pi->i_links_count));
>
> C++ comment?
>

Will fix.

>> +     inode->i_generation = le32_to_cpu(pi->i_generation);
>> +     nova_set_inode_flags(inode, pi, le32_to_cpu(pi->i_flags));
>> +     ino = inode->i_ino;
>> +
>> +     /* check if the inode is active. */
>> +     if (inode->i_mode == 0 || pi->deleted == 1) {
>> +             /* this inode is deleted */
>> +             ret = -ESTALE;
>> +             goto bad_inode;
>> +     }
>> +
>> +     inode->i_blocks = sih->i_blocks;
>
> Not le64_to_cpu(sih->i_blocks)?  Or is that somewhere else I'm
> missing...
>

sih is the in-DRAM inode structure, so le64_to_cpu() is not needed. When
we read the inode from pmem (nova_inode *pi) into sih, we need
le64_to_cpu(). Endianness handling is not thorough yet, as NOVA only
supports x86-64 for now.
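
To make the split concrete, here is a small userland sketch of
converting once at the pmem-to-DRAM boundary. The demo_* names and the
struct layouts are hypothetical and are not the NOVA definitions. The
on-media fields are modelled as raw little-endian byte arrays so that
the example behaves the same on any host, and the helpers only stand in
for the kernel's le64_to_cpu()/le16_to_cpu().

```c
#include <stdint.h>

/* Hypothetical on-media layout: every field is stored little-endian. */
struct demo_pmem_inode_le {
	uint8_t i_blocks[8];
	uint8_t i_size[8];
	uint8_t i_mode[2];
};

/* Hypothetical in-DRAM header: fields are already in CPU byte order. */
struct demo_inode_header {
	uint64_t i_blocks;
	uint64_t i_size;
	uint16_t i_mode;
};

/* Decode a little-endian 64-bit value, independent of host endianness. */
static uint64_t demo_le64_to_cpu(const uint8_t *b)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | b[i];
	return v;
}

/* Decode a little-endian 16-bit value. */
static uint16_t demo_le16_to_cpu(const uint8_t *b)
{
	return (uint16_t)(b[0] | (b[1] << 8));
}

/*
 * The only place where byte order is converted.  Code that later reads
 * sih->i_blocks or sih->i_size uses the values directly.
 */
static void demo_load_header(struct demo_inode_header *sih,
			     const struct demo_pmem_inode_le *pi)
{
	sih->i_blocks = demo_le64_to_cpu(pi->i_blocks);
	sih->i_size = demo_le64_to_cpu(pi->i_size);
	sih->i_mode = demo_le16_to_cpu(pi->i_mode);
}
```

With this split, an assignment such as inode->i_blocks = sih->i_blocks
needs no conversion, because the value was converted when the header
was filled in.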

>> +
>> +     switch (inode->i_mode & S_IFMT) {
>> +     case S_IFREG:
>> +             break;
>> +     case S_IFDIR:
>> +             break;
>> +     case S_IFLNK:
>> +             break;
>> +     default:
>> +             init_special_inode(inode, inode->i_mode,
>> +                                le32_to_cpu(pi->dev.rdev));
>> +             break;
>> +     }
>> +
>> +     /* Update size and time after rebuild the tree */
>> +     inode->i_size = le64_to_cpu(sih->i_size);
>
> FWIW the type of i_size is loff_t, which is a signed type.  Despite
> this, the VFS does not support files with negative sizes... which means
> that this probably ought to check for that.
>

I will think about that. Thanks.
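
One possible shape for such a check, as a userland sketch only: reject
an on-media size that would be negative once stored in a signed 64-bit
i_size, or that exceeds a per-filesystem limit. demo_loff_t,
demo_load_size() and the max_bytes parameter are hypothetical names.
max_bytes is assumed to play the role of a limit such as
sb->s_maxbytes, and -EIO is just a placeholder error code.

```c
#include <errno.h>
#include <stdint.h>

/* Stand-in for the kernel's loff_t, which is a signed 64-bit type. */
typedef int64_t demo_loff_t;

/*
 * Validate a size read from media before storing it in *out.
 * Returns 0 on success; on failure returns -EIO and leaves *out alone.
 */
static int demo_load_size(uint64_t raw_size, demo_loff_t max_bytes,
			  demo_loff_t *out)
{
	/* Top bit set: the value would be negative as a signed size. */
	if (raw_size > (uint64_t)INT64_MAX)
		return -EIO;

	/* Larger than this file system claims to support. */
	if ((demo_loff_t)raw_size > max_bytes)
		return -EIO;

	*out = (demo_loff_t)raw_size;
	return 0;
}
```

A caller would treat a non-zero return like any other corrupted inode,
for example by taking the bad_inode path.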

Thanks,
Andiry

> --D
>
>> +     inode->i_atime.tv_sec = (__s32)le32_to_cpu(pi->i_atime);
>> +     inode->i_ctime.tv_sec = (__s32)le32_to_cpu(pi->i_ctime);
>> +     inode->i_mtime.tv_sec = (__s32)le32_to_cpu(pi->i_mtime);
>> +     inode->i_atime.tv_nsec = inode->i_mtime.tv_nsec =
>> +                                      inode->i_ctime.tv_nsec = 0;
>> +     set_nlink(inode, le16_to_cpu(pi->i_links_count));
>> +     return 0;
>> +
>> +bad_inode:
>> +     make_bad_inode(inode);
>> +     return ret;
>> +}
>> +
>> +/* Get the address in PMEM of an inode by inode number.  Allocate additional
>> + * block to store additional inodes if necessary.
>> + */
>> +int nova_get_inode_address(struct super_block *sb, u64 ino,
>> +     u64 *pi_addr, int extendable)
>> +{
>> +     if (ino < NOVA_NORMAL_INODE_START) {
>> +             *pi_addr = nova_get_r