* [RFC 0/7] first draft of ZUFS - the Kernel part
@ 2018-03-13 17:14 Boaz Harrosh
  2018-03-13 17:15 ` [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU Boaz Harrosh
                   ` (6 more replies)
  0 siblings, 7 replies; 39+ messages in thread
From: Boaz Harrosh @ 2018-03-13 17:14 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Ric Wheeler, Miklos Szeredi, Steve French, Steven Whitehouse,
	Jeff Moyer, Sage Weil, Jan Kara, Amir Goldstein, Andy Rudoff,
	Anna Schumaker, Amit Golander, Sagi Manole, Shachar Sharon

Hello all

I would like to present the ZUFS filesystem, and the Kernel code part, in this
patchset.

The Kernel code presented here can be found at:
	https://github.com/NetApp/zufs-zuf

And for a very crude User-mode Server:
	https://github.com/NetApp/zufs-zus

ZUFS - stands for Zero-copy User-mode FS
- It is geared towards true zero-copy, end to end, of both data and metadata.
- It is geared towards very *low latency*, very high CPU locality, lock-less parallelism.
- Synchronous operations
- NUMA awareness

Short description:
  ZUFS is a from-scratch implementation of a filesystem-in-user-space which tries to address the above
goals. From the get-go it is aimed at pmem-based FSs, but it can easily support other types of FSs that
can utilize the ~10x latency and parallelism improvements. The novelty of this project is that the
interface, down to the ABI, is designed with a modern multi-core NUMA machine in mind, so as to reach
these goals.

Not only FSs need apply: any kind of user-mode Server can set up a pseudo filesystem and communicate
with applications via virtual files. These can then benefit from zero-copy, low-latency communication
directly to/from application buffers, or the application can mmap Server resources directly. As long
as it looks like a file system to the Kernel.

Current status is that we have a couple of trivial filesystem implementations, and together with the
Kernel module, the UM-Server, and the user-mode FS plugins, they can actually pass a good bunch of the
xfstests quick run. (Still working on stability.)

Just to get some points across: as I said, this project is all about performance and low latency.
Below are POC results I have run:

	In-Kernel pmem-FS		ZUFS			FUSE
Threads	Op/s	Lat [us]		Op/s	Lat [us]	Op/s	Lat [us]
1	388361	2.271589		200799	4.6		71820	13.5
2	635115	2.604376		314321	5.9		148083	13.1
4	1260307	2.626361		565574	6.6		212133	18.3
8	2744963	2.485292		1113138	6.6		209799	37.6
12	2126945	5.020506		1598451	6.8		201689	58.7
18	4350995	3.386433		1648689	7.8		174823	101.8
24	4211180	4.784997		1702285	8		149413	159
36	3057166	9.291997		1783346	13.4		148276	240.7
48	3148972	10.382461		1741873	17.4		145296	327.3

I used an average server machine in our lab with two NUMA nodes and a total of 40 cores (I can't
remember all the details), running fio with 4k random writes. The IO is then just a memcpy_nt() to
DRAM-simulated pmem. fio was run with more and more threads (see the Threads column).
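
For reference, the fio invocation was along these lines (the exact job file is not preserved
here, so the directory, size, runtime, and ioengine below are only placeholders):

	fio --name=zufs-poc --directory=/mnt/zufs --rw=randwrite --bs=4k \
	    --size=1g --ioengine=psync --time_based --runtime=30 \
	    --numjobs=<threads> --group_reporting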

We can see that we are still >2x slower than the in-Kernel FS. But I believe I can shave off another
1 us by optimizing the app-to-server thread switch, perhaps by utilizing the "Binder" scheduler object,
or by devising another way to avoid going through the scheduler (and its locks) when switching VMs.

BE CAREFUL: This is a big code dump, and very much an RFC. Not yet very stable, not yet cleaned up.
I have sliced the FS into 4 very big patches. Please talk to me if I should split it up into many
more patches.

[I am afraid to send such huge emails, so I'm posting web links instead. Does anyone know the
 mailing-list message limit?

 For this first version I'm sending web links to github for the 4 HUGE patches.
 Please clone the above trees if you want to play with this.

 Please tell me if you want to be removed from the CC of these emails.
]

List of patches:
[RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU

   This is a small but very important patch to the mmap code, needed to
   support the scores above. See the before-and-after results inside.
   (And it is unfinished)

[RFC 2/7] fs: Add the ZUF filesystem to the build + License
   Add the makefile and Kconfig + the licensing of the code

[RFC 3/7] zuf: Preliminary Documentation
   A very unfinished start of the documentation for this project: the overall
   concept, the Kernel side, and the user-mode Server side.
   Please help me by asking about whatever more you need in here. Any
   questions, I will try to answer here

[RFC 4/7] zuf: zuf-rootfs && zuf-core
[RFC 5/7] zus: Devices && mounting
[RFC 6/7] zuf: Filesystem operations
[RFC 7/7] zuf: Write/Read && mmap implementation

  After these 4 HUGE patches there is a working live system. Still bugs
  and corner cases, but it can git clone and build a Kernel, which for
  me is not so bad.

Thank you
Boaz


* [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
  2018-03-13 17:14 [RFC 0/7] first draft of ZUFS - the Kernel part Boaz Harrosh
@ 2018-03-13 17:15 ` Boaz Harrosh
  2018-03-13 18:56   ` Matthew Wilcox
  2018-03-13 17:17 ` [RFC 2/7] fs: Add the ZUF filesystem to the build + License Boaz Harrosh
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Boaz Harrosh @ 2018-03-13 17:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Ric Wheeler, Miklos Szeredi, Steve French, Steven Whitehouse,
	Jeff Moyer, Sage Weil, Jan Kara, Amir Goldstein, Andy Rudoff,
	Anna Schumaker, Amit Golander, Sagi Manole, Shachar Sharon


On a call to mmap, an mmap provider (like an FS) can put
this flag on vma->vm_flags.

This tells the Kernel that the vma will be used from a single
core only, and therefore invalidation of PTE(s) does not need
a wide CPU scheduling

The motivation for this flag is the ZUFS project, where we want
to optimally map user-application buffers into a user-mode server,
execute the operation, and efficiently unmap.

In this project we utilize a per-core server thread, so everything
is kept local. If we use the regular zap_ptes() API, all CPUs
are scheduled for the unmap, though in our case we know that we
have only used a single core. The regular zap_ptes() adds a very big
latency to every operation and mostly kills the concurrency of the
overall system, because it imposes a serialization between all cores.
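
For illustration, a provider's ->mmap handler would opt in roughly like
below (the handler and vm_ops names are made up; the real consumer of
this flag comes later in this patchset):

	static int my_local_mmap(struct file *file, struct vm_area_struct *vma)
	{
		/* This vma is only ever touched from one affinity-set
		 * core, so PTE invalidation may skip the system-wide
		 * TLB shootdown.
		 */
		vma->vm_flags |= VM_LOCAL_CPU;
		vma->vm_ops = &my_vm_ops;
		return 0;
	}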

Some preliminary measurements on a 40-core machine:

	unpatched		patched
Threads	Op/s	Lat [us]	Op/s	Lat [us]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1	185391	4.9		200799	4.6
2	197993	9.6		314321	5.9
4	310597	12.1		565574	6.6
8	546702	13.8		1113138	6.6
12	641728	17.2		1598451	6.8
18	744750	22.2		1648689	7.8
24	790805	28.3		1702285	8
36	849763	38.9		1783346	13.4
48	792000	44.6		1741873	17.4

[FIXME]
	We need to actually impose this policy. On the very first
	pte_insert we should sample the CPU_ID used, and on all
	subsequent pte_inserts we need to make sure it is the
	same CPU_ID.

NOTE: This vma is never used during a page_fault. It is
always used in a synchronous way, from a thread with affinity
set to a single core.

Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/proc/task_mmu.c | 3 +++
 include/linux/mm.h | 3 +++
 mm/memory.c        | 2 +-
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 339e4c1..20786ba 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -681,6 +681,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_PKEY_BIT2)]	= "",
 		[ilog2(VM_PKEY_BIT3)]	= "",
 #endif
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+		[ilog2(VM_LOCAL_CPU)]	= "lc",
+#endif
 	};
 	size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ea818ff..02bb8b5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -226,6 +226,9 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_LOCAL_CPU	BIT(37)		/* FIXME: Needs to move from here */
+#else /* ! CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
+#define VM_LOCAL_CPU	0		/* FIXME: Needs to move from here */
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #if defined(CONFIG_X86)
diff --git a/mm/memory.c b/mm/memory.c
index 7930046..7620ced 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1804,7 +1804,7 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 				goto out_unlock;
 			entry = *pte;
 			goto out_mkwrite;
-		} else
+		} else if (!(vma->vm_flags & VM_LOCAL_CPU))
 			goto out_unlock;
 	}
 
-- 
2.5.5


* [RFC 2/7] fs: Add the ZUF filesystem to the build + License
  2018-03-13 17:14 [RFC 0/7] first draft of ZUFS - the Kernel part Boaz Harrosh
  2018-03-13 17:15 ` [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU Boaz Harrosh
@ 2018-03-13 17:17 ` Boaz Harrosh
  2018-03-13 20:16   ` Andreas Dilger
  2018-03-13 17:18 ` [RFC 3/7] zuf: Preliminary Documentation Boaz Harrosh
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Boaz Harrosh @ 2018-03-13 17:17 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Ric Wheeler, Miklos Szeredi, Steve French, Steven Whitehouse,
	Jeff Moyer, Sage Weil, Jan Kara, Amir Goldstein, Andy Rudoff,
	Anna Schumaker, Amit Golander, Sagi Manole, Shachar Sharon


This adds the ZUF filesystem-in-user-mode module
to the fs/ build system.

Also added:
	* fs/zuf/Kconfig
	* fs/zuf/module.c - This file contains the LICENSE of the zuf code base
	* fs/zuf/Makefile - Rather empty Makefile with only module.c above

I add the fs/zuf/Makefile to demonstrate that at every
patchset stage the code still compiles and there are no external
references outside of the code already submitted.

Of course, only at the very last patch do we have a working
ZUF feeder

Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/Kconfig       |  1 +
 fs/Makefile      |  1 +
 fs/zuf/Kconfig   | 23 +++++++++++++++++++
 fs/zuf/Makefile  | 14 ++++++++++++
 fs/zuf/module.c  | 51 +++++++++++++++++++++++++++++++++++++++++
 fs/zuf/zus_api.h | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 159 insertions(+)
 create mode 100644 fs/zuf/Kconfig
 create mode 100644 fs/zuf/Makefile
 create mode 100644 fs/zuf/module.c
 create mode 100644 fs/zuf/zus_api.h

diff --git a/fs/Kconfig b/fs/Kconfig
index 7aee6d6..c04c454 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -247,6 +247,7 @@ source "fs/romfs/Kconfig"
 source "fs/pstore/Kconfig"
 source "fs/sysv/Kconfig"
 source "fs/ufs/Kconfig"
+source "fs/zuf/Kconfig"
 source "fs/exofs/Kconfig"
 
 endif # MISC_FILESYSTEMS
diff --git a/fs/Makefile b/fs/Makefile
index ef772f1..78c13f0 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -129,3 +129,4 @@ obj-y				+= exofs/ # Multiple modules
 obj-$(CONFIG_CEPH_FS)		+= ceph/
 obj-$(CONFIG_PSTORE)		+= pstore/
 obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
+obj-$(CONFIG_ZUF)		+= zuf/
diff --git a/fs/zuf/Kconfig b/fs/zuf/Kconfig
new file mode 100644
index 0000000..a17d463
--- /dev/null
+++ b/fs/zuf/Kconfig
@@ -0,0 +1,23 @@
+menuconfig ZUF
+	tristate "ZUF - Zero-copy User-mode Feeder"
+	depends on BLOCK
+	depends on ZONE_DEVICE
+	select CRC16
+	select MEMCG
+	help
+	   ZUFS Kernel part.
+	   To enable say Y here.
+
+	   To compile this as a module,  choose M here: the module will be
+	   called zuf.ko
+
+if ZUF
+
+config ZUF_DEBUG
+	bool "ZUF: enable debug subsystems use"
+	depends on ZUF
+	default n
+	help
+	  INTERNAL QA USE ONLY!!! DO NOT USE!!!
+
+endif
diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
new file mode 100644
index 0000000..7e4e51f
--- /dev/null
+++ b/fs/zuf/Makefile
@@ -0,0 +1,14 @@
+#
+# ZUF: Zero-copy User-mode Feeder
+#
+# Copyright (c) 2018 NetApp Inc. All rights reserved.
+#
+# ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details.
+#
+# Makefile for the Linux zufs Kernel Feeder.
+#
+
+obj-$(CONFIG_ZUF) += zuf.o
+
+# Main FS
+zuf-y += module.o
diff --git a/fs/zuf/module.c b/fs/zuf/module.c
new file mode 100644
index 0000000..ebf51d7
--- /dev/null
+++ b/fs/zuf/module.c
@@ -0,0 +1,51 @@
+/* ZUFS-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+/*
+ * zuf - Zero-copy User-mode Feeder
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * Alternatively you can redistribute this file under the terms of the
+ * BSD license as stated below:
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ *    contributors may be used to endorse or promote products derived from
+ *    this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <linux/module.h>
+
+#include "zus_api.h"
+
+MODULE_AUTHOR("Boaz Harrosh <boazh@netapp.com>");
+MODULE_AUTHOR("Sagi Manole <sagim@netapp.com>");
+MODULE_DESCRIPTION("Zero-copy User-mode Feeder");
+MODULE_LICENSE("GPL");
+MODULE_VERSION(__stringify(ZUFS_MAJOR_VERSION) "."
+		__stringify(ZUFS_MINOR_VERSION));
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
new file mode 100644
index 0000000..d6ccc85
--- /dev/null
+++ b/fs/zuf/zus_api.h
@@ -0,0 +1,69 @@
+/*
+ * zufs_api.h:
+ *	ZUFS (Zero-copy User-mode File System) is:
+ *		zuf (Zero-copy User-mode Feeder (Kernel)) +
+ *		zus (Zero-copy User-mode Server (daemon))
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>"
+ */
+#ifndef _LINUX_ZUFS_API_H
+#define _LINUX_ZUFS_API_H
+
+#include <linux/types.h>
+#include <linux/uuid.h>
+#include <stddef.h>
+#include <asm/statfs.h>
+
+/*
+ * Version rules:
+ *   This is the zus-to-zuf API version, and not the Filesystem
+ * on-disk structure versions. Those are left to the FS plugins
+ * to supply and check.
+ * Specifically, it covers any of the API structures and constants
+ * found in this file.
+ * If the changes are made in a way backward compatible with old
+ * user-space, MINOR is incremented. Else MAJOR is incremented.
+ *
+ * We believe that the zus Server application comes with the
+ * Distro and should be dependent on the Kernel package.
+ * The more stable ABI is between the zus Server and its FS plugins.
+ * Because of the intimate relationship with the zuf-core behavior,
+ * we would also like the zus Server to be signed by the running
+ * Kernel's make crypto key and checked before load, because of the
+ * security-sensitive nature of an FS provider.
+ */
+#define ZUFS_MINORS_PER_MAJOR	1024
+#define ZUFS_MAJOR_VERSION 1
+#define ZUFS_MINOR_VERSION 0
+
+/* User space compatibility definitions */
+#ifndef __KERNEL__
+
+#include <stdint.h>
+#include <stdbool.h>
+
+#define u8 uint8_t
+#define umode_t uint16_t
+
+#define le16_to_cpu	le16toh
+#define le64_to_cpu	le64toh
+
+#define PAGE_SHIFT     12
+#define PAGE_SIZE      (1 << PAGE_SHIFT)
+
+#ifndef __packed
+#	define __packed __attribute__((packed))
+#endif
+
+#define ALIGN(x, a)		ALIGN_MASK(x, (typeof(x))(a) - 1)
+#define ALIGN_MASK(x, mask)	(((x) + (mask)) & ~(mask))
+
+#endif /*  ndef __KERNEL__ */
+
+#endif /* _LINUX_ZUFS_API_H */
-- 
2.5.5


* [RFC 3/7] zuf: Preliminary Documentation
  2018-03-13 17:14 [RFC 0/7] first draft of ZUFS - the Kernel part Boaz Harrosh
  2018-03-13 17:15 ` [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU Boaz Harrosh
  2018-03-13 17:17 ` [RFC 2/7] fs: Add the ZUF filesystem to the build + License Boaz Harrosh
@ 2018-03-13 17:18 ` Boaz Harrosh
  2018-03-13 20:32   ` Randy Dunlap
  2018-03-13 17:22 ` [RFC 4/7] zuf: zuf-rootfs && zuf-core Boaz Harrosh
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Boaz Harrosh @ 2018-03-13 17:18 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Ric Wheeler, Miklos Szeredi, Steve French, Steven Whitehouse,
	Jeff Moyer, Sage Weil, Jan Kara, Amir Goldstein, Andy Rudoff,
	Anna Schumaker, Amit Golander, Sagi Manole, Shachar Sharon


Adding Documentation/filesystems/zufs.txt

Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 Documentation/filesystems/zufs.txt | 351 +++++++++++++++++++++++++++++++++++++
 1 file changed, 351 insertions(+)
 create mode 100644 Documentation/filesystems/zufs.txt

diff --git a/Documentation/filesystems/zufs.txt b/Documentation/filesystems/zufs.txt
new file mode 100644
index 0000000..779f14b
--- /dev/null
+++ b/Documentation/filesystems/zufs.txt
@@ -0,0 +1,351 @@
+ZUFS - Zero-copy User-mode FileSystem
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Trees:
+	git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream
+	git clone https://github.com/NetApp/zufs-zus -b zus-github
+
+patches, comments, questions, requests to:
+	boazh@netapp.com
+
+Introduction:
+~~~~~~~~~~~~~
+
+ZUFS - stands for Zero-copy User-mode FS
+▪ It is geared towards true zero-copy, end to end, of both data and metadata.
+▪ It is geared towards very *low latency*, very high CPU locality, lock-less
+  parallelism.
+▪ Synchronous operations
+▪ NUMA awareness
+
+ ZUFS is a from-scratch implementation of a filesystem-in-user-space which
+tries to address the above goals. It is aimed at pmem-based FSs, but can
+easily support any other type of FS that can utilize the ~10x latency and
+parallelism improvements.
+
+Glossary and names:
+~~~~~~~~~~~~~~~~~~~
+
+ZUF - Zero-copy User-mode Feeder
+  zuf.ko is the Kernel VFS component. Its job is to interface with the Kernel
+  VFS and dispatch commands to a User-mode application Server.
+  Up-to-date code is found at:
+	git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream
+
+ZUS - Zero-copy User-mode Server
+  zufs utilizes a User-mode server application that takes care of the detailed
+  communication protocol and correctness with the Kernel.
+  In turn it utilizes many zusFS Filesystem plugins to implement the actual
+  on-disk Filesystem.
+  Up-to-date code is found at:
+	git clone https://github.com/NetApp/zufs-zus -b zus-github
+
+zusFS - FS plugins
+  These are .so loadable modules that implement one or more Filesystem-types
+  (-t xyz).
+  The zus server communicates with the plugin via a set of function vectors
+  for the different operations, and establishes communication via defined
+  structures.
+
+Filesystem-type:
+  At startup zus registers with the Kernel one or more Filesystem-type(s).
+  Associated with each type is a 4-letter type-name (-t fstn) plus different
+  info about the fs, like a magic number and so on.
+  One Server can support many FS-types; in turn, each FS-type can mount
+  multiple super-blocks, each supporting multiple devices.
+
+Device-Table (DT) - A zufs FS can support multiple devices
+  ZUF in Kernel may receive, like any mount command, a block-device or none.
+  For the former, if the specified FS-type states so in a special field,
+  the mount will look for a Device Table: a list of devices, in a specific
+  order, sitting at some offset on the block-device. The system will then
+  proceed to open and own all these devices and associate them with the
+  mounting super-block.
+  If the FS-type specifies -1 at DT_offset then there is no device table
+  and a DT of a single device is created. (If no devices are specified at
+  all, then we operate without any block devices; mount options give some
+  indication of the storage information.)
+  The device table has special consideration for pmem devices and will
+  present the whole linear array of devices to zus as one flat mmap space.
+  Additionally, all non-pmem devices are provided an interface with a
+  facility for data movement from pmem to a slower device.
+  Detailed NUMA info is exported to the Server for maximum utilization.
+
+pmem:
+  Multiple pmem devices are presented to the server as a single
+  linear file mmap, something like /dev/dax. But it is strictly
+  available only to the specific super-block that owns it.
+
+dpp_t - Dual port pointer type
+  At some points in the protocol there are objects that return from zus
+  (the Server) to the Kernel via a dpp_t. This is a special kind of pointer:
+  it is actually an offset, 8-byte aligned, with the 3 low bits specifying
+  a pool code: [offset = dpp_t & ~0x7] [pool = dpp_t & 0x7]
+  pool == 0 means the offset is into pmem, whose management is by zuf, and
+  full easy access is provided for zus.
+
+  pool != 0 is a pre-established tmpfs file (up to 6 such files) where
+  zus has an mmap on the file and the Kernel can access that data
+  via an offset into the file.
+  All dpp_t objects' life-time rules are strictly defined.
+  The primary use of dpp_t is the on-pmem inode structure. Both
+  zus and zuf can access and change this structure. On any modification
+  zus is called, so as to be notified of any changes, for persistence.
+  More such objects are: symlinks, xattrs, mmap-data-blocks, etc.
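+
+  For illustration, decoding follows directly from the layout above
+  (assuming zu_dpp_t is a 64-bit scalar; these helper names are
+  illustrative and not part of the zuf headers):
+
+	static inline ulong dpp_offset(zu_dpp_t dpp)
+	{
+		return dpp & ~0x7UL;	/* the 8-byte-aligned offset */
+	}
+
+	static inline uint dpp_pool(zu_dpp_t dpp)
+	{
+		return dpp & 0x7;	/* 0 => pmem, 1..6 => tmpfs file */
+	}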
+
+Relay-wait-object:
+  Communication between the Kernel and the server is done via zus-threads
+  that sleep in the Kernel (inside an IOCTL) and wait for commands. The
+  IOCTL returns with an operation to execute; the return info is then
+  passed back via a new IOCTL call, which waits for the next operation.
+  To wake up the sleeping thread we use a Relay-wait-object. Currently
+  it is two waitqueue_head(s) back to back.
+  In the future we should investigate the use of that special binder object
+  that releases its thread time slice to the other thread without going
+  through the scheduler.
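+
+  Conceptually, something like the sketch below (the real relay.h comes
+  later in this patchset; the field names here are only illustrative):
+
+	struct relay {
+		wait_queue_head_t fss_wq;	/* server ZT sleeps here */
+		bool fss_wakeup;
+		wait_queue_head_t app_wq;	/* application sleeps here */
+		bool app_wakeup;
+	};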
+
+ZT-threads-array:
+  The novelty of zufs is the ZT-threads system. One or more threads are
+  pre-created for each active core in the system.
+  ▪ Each thread has its AFFINITY set to that single core only.
+  ▪ There is a special communication file per ZT (O_TMPFILE + IOCTL_ZUFS_INIT).
+    At initialization the ZT thread communicates through a ZT_INIT ioctl
+    and registers as the handler of that core (Channel).
+  ▪ ZT-vma - a 4M mmap'ed vma, a zero-copy communication area per ZT.
+    A pre-allocated vma into which the application or Kernel buffers
+    for the current operation will be mapped.
+  ▪ IOCTL_ZU_WAIT_OPT – the thread sleeps in the Kernel waiting for an
+    operation via the IOCTL_ZU_WAIT_OPT call, supplying a 4k communication
+    buffer. (See the sketch after this list.)
+  ▪ On operation dispatch the current CPU's ZT is selected and the app pages
+    are mapped into the ZT-vma. The Server thread is released with an
+    operation to execute.
+  ▪ After execution, the ZT returns to the kernel (IOCTL_ZU_WAIT_OPT), the
+    app is released, and the Server waits for a new operation on that CPU.
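+
+  In rough strokes, each per-core server thread runs a loop like the
+  sketch below (the struct layout and helper names are simplified
+  placeholders, not the real zus_api.h protocol):
+
+	static void *zt_thread(void *arg)
+	{
+		int cpu = (long)arg;	/* affinity already set to @cpu */
+		int fd = open("/sys/fs/zuf", O_TMPFILE | O_RDWR, 0666);
+		struct zufs_ioc_wait_operation op = { .channel = cpu, };
+
+		/* register as the handler (Channel) of this core */
+		ioctl(fd, IOCTL_ZUFS_INIT, &op);
+		for (;;) {
+			/* sleep in Kernel; returns with an operation to
+			 * execute, and on re-entry delivers the results
+			 * of the previous one
+			 */
+			ioctl(fd, IOCTL_ZU_WAIT_OPT, &op);
+			execute_operation(&op);
+		}
+	}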
+
+ZUS-mount-thread:
+  The system utilizes a single mount thread. (This thread is not affine to
+  any core.)
+  ▪ It will first register all FS-types supported by this Server (by calling
+    all zusFS plugins to register their supported types). Once done,
+  ▪ As above, the thread sleeps in the Kernel via the IOCTL_ZU_MOUNT call.
+  ▪ When the Kernel receives a mount request (vfs calls the fs_type->mount op),
+    a mount is dispatched back to zus.
+  ▪ NOTE: only on the very first mount is the above ZT-threads-array created;
+    the same array is then used for all super-blocks in the system.
+  ▪ As part of the mount command, in the context of this same mount-thread,
+    a call to IOCTL_ZU_GRAB_PMEM will establish an interface to the pmem
+    associated with this super_block.
+  ▪ On return (like above), a new call to IOCTL_ZU_MOUNT will return the info
+    of the mount before sleeping in the kernel waiting for a new dispatch.
+    All SB info is provided to zuf, including the root inode info. The Kernel
+    then proceeds to complete the mount call.
+  ▪ NOTE that since there is a single mount thread, all FS-registration,
+    super_block, and pmem management are lockless.
+
+Philosophy of operations:
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1. [zuf-root]
+
+On module load (zuf.ko) a special pseudo-FS is mounted on /sys/fs/zuf. This is
+called zuf-root.
+The zuf-root has no visible files. All communication is done via special-files.
+special-files are opened with open(O_TMPFILE) and establish a special role via
+an IOCTL.
+All communications with the server are done via the zuf-root. Each root owns
+many FS-types, and each FS-type owns many super-blocks of this type, all
+sharing the same communication channels.
+All FS-type Servers live in the same zus application address space. If the
+administrator wants to separate between different servers, he/she can mount a
+new zuf-root and point a new server instance at that new mount, registering
+other FS-types on that other instance. The whole communication array will
+then be duplicated as well.
+(Otherwise pointing a new server instance at a busy root will return an error)
+
+2. [zus server start]
+  ▪ On load, all configured zusFS plugins are loaded.
+  ▪ The Server starts by starting a single mount thread.
+  ▪ It then proceeds to register with the Kernel all FS-types it will support.
+    (This is done on the single mount thread, so all FS-registration and
+     mount/umount operations run in a single thread and therefore need no locks)
+  ▪ It then sleeps in the Kernel on a special-file of that zuf-root, waiting
+    for a mount command.
+
+3. [mount -t xyz]
+  [In Kernel]
+  ▪ If xyz was registered above as part of the Server startup, the regular
+    mount command will come to the zuf module with a zuf_mount() call, with
+    the xyz-FS-info. In turn this points to a zuf-root.
+  ▪ The code then proceeds to load a device-table of devices as specified
+    above. It then establishes an md object with a specific pmem_id.
+  ▪ It proceeds to call mount_bdev, always with the same main-device,
+    thus fully supporting automatic bind mounts, even if different
+    devices are given to the mount command.
+  ▪ In zuf_fill_super it will then dispatch (awaken) the mount thread,
+    specifying two parameters: the FS-type to mount, and the
+    pmem_id associated with this super_block.
+
+  [In zus]
+  ▪ A zus_super_block_info is allocated.
+  ▪ zus calls PMEM_GRAB(pmem_id) to establish a direct mapping to its
+    pmem devices. On return we have full access to our PMEM.
+
+  ▪ ZT-threads-array
+    If this is the first mount, the whole ZT-threads-array is created and
+    established. The mount thread will wait until all zt-threads have finished
+    initialization and are ready to rock.
+  ▪ The root zus_inode is loaded and returned to the kernel.
+  ▪ More info about the mount, like block sizes and so on, is returned to the
+    kernel.
+
+  [In Kernel]
+   The zuf_fill_super is finalized, vectors are established, and we have a
+   new super_block ready for operations.
+
+4. An FS operation like create or WRITE/READ and so on arrives from an
+   application via the VFS. Eventually an operation is dispatched to zus:
+   ▪ A special per-operation descriptor is filled with all parameters.
+   ▪ The current CPU's channel is grabbed, and the operation descriptor is
+     put on that channel (ZT), including the get_user_pages or Kernel pages
+     associated with this OPT.
+   ▪ The ZT is awakened; the app thread is put to sleep.
+   ▪ In ZT context, pages are mapped into that ZT-vma. This way we are sure
+     the map is only on a single core, and no other core's TLB is affected.
+     (This here is the whole performance secret)
+   ▪ The ZT thread is returned to user-space.
+   ▪ In ZT context the zus Server calls the appropriate zusFS->operation
+     vector. Output params are filled.
+   ▪ zus calls again with an IOCTL_ZU_WAIT_OPT with the same descriptor,
+     to return the requested info.
+   ▪ In the Kernel (zuf) the app thread is awakened with the results, and
+     the ZT thread goes back to sleep, waiting for a new operation.
+
+   ZT rules:
+       A ZT thread must not return back to the Kernel. One exception is
+   locks: if needed, it might sleep waiting for a lock, in which case we will
+   see the same CPU channel re-entered via another application and/or thread
+   while that CPU channel is taken. What we do is utilize a few channels
+   (ZTs) per core, so threads may grab another channel. But this only
+   postpones the problem: on a busy, contended system all such channels will
+   be consumed. If all channels are taken, the application thread is put on a
+   busy scheduling wait until a channel can be grabbed.
+   Therefore the Server must not sleep on a ZT. If it needs such a sleeping
+   operation, it returns -EAGAIN to zuf. The app is kept sleeping, the
+   operation is put on an asynchronous Q, and the ZT is freed for foreground
+   operations. At some point, when the server completes the delayed
+   operation, it completion-notifies the Kernel with a special async cookie,
+   and the app is awakened.
+   (Here too we utilize pre-allocated async channels and vmas. If all channels
+    are busy, the application is kept sleeping, waiting its free-slot turn.)
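+
+   For example, a zusFS operation vector that might block could follow
+   this pattern (all names here are illustrative only):
+
+	static int myfs_read(struct zus_op_info *op)
+	{
+		if (!data_is_ready(op)) {
+			queue_async_work(op);	/* completion-notify later */
+			return -EAGAIN;		/* free this ZT right away */
+		}
+		return do_read(op);
+	}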
+
+5. On umount the operation is reversed and all resources are torn down.
+6. In case of an application or Server crash, all resources are associated
+   with files; on file_release these resources are caught and freed.
+
+Objects and life-time
+~~~~~~~~~~~~~~~~~~~~~
+
+Each Kernel object type has an associated zus Server object type whose life
+time is governed by the life-time of the Kernel object. Therefore the
+Server's job is easy, because it need not establish any object caches /
+hashes and so on.
+
+Inside zus all objects are allocated by the zusFS plugin. So in turn it can
+allocate a bigger space for its own private data and access it via the
+container_of() coding pattern. So when I say below a zus-object, I mean both
+the zus public part + the zusFS private part of the same object.
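+
+For example, a zusFS might embed the public part like below (the myfs
+names are hypothetical):
+
+	struct myfs_inode_info {
+		struct zus_inode_info zii;	/* public, shared with zus */
+		void *private_extent_map;	/* zusFS private data */
+	};
+
+	static inline struct myfs_inode_info *MYFS_II(struct zus_inode_info *zii)
+	{
+		return container_of(zii, struct myfs_inode_info, zii);
+	}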
+
+All operations return a UM pointer that is opaque to the Kernel code; it is
+just a cookie which is returned back to zus when needed.
+At times, when we want the Kernel to have direct access to a zus object like
+zus_inode, along with the cookie we also return a dpp_t, with a defined
+structure.
+
+Kernel object 			| zus object 		| Kernel access (via dpp_t)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+zuf_fs_type
+	file_system_type	| zus_fs_info		| no
+
+zuf_sb_info
+	super_block		| zus_sb_info		| no
+	
+zuf_inode_info			|			|
+	vfs_inode		| zus_inode_info	| no
+	zus_inode *		| 	zus_inode *	| yes
+	symlink *		|	char-array	| yes
+	xattr**			|	zus_xattr	| yes
+
+When a Kernel object's time comes to die, a final call to zus is
+dispatched so the associated object can also be freed. This means
+that on memory pressure, when object caches are evicted, the zus
+memory resources are freed as well.
+
+
+How to use zufs:
+~~~~~~~~~~~~~~~~
+
+The most up-to-date documentation of how to use the latest code bases
+is the script (set of scripts) at fs/do-zu/zudo in the zus git tree.
+
+We, the developers at Netapp, use this script to mount and test our
+latest code. So any new secret will be found in these scripts. Please
+read them as the ultimate source of how to operate things.
+
+TODO: We are looking for experts in systemd and udev to properly
+integrate these tools into a distro.
+
+We assume you cloned these git trees:
+[]$ mkdir zufs; cd zufs
+[]$ git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream
+[]$ git clone https://github.com/NetApp/zufs-zus -b zus-github
+
+This will create the following trees
+zufs/zus - Source code for Server
+zufs/zuf - Linux Kernel source tree to compile and install on your machine
+
+Also specifically:
+zufs/zus/fs/do-zu/zudo - script Documenting how to run things
+
+[]$ cd zuf
+
+First time:
+[]$ ../zus/fs/do-zu/zudo
+this will create a file:
+	../zus/fs/do-zu/zu.conf
+
+Edit this file for your environment: devices, mount-point, and so on.
+On the first run an example file will be created for you. Fill in the
+blanks. Most params can stay as-is in most cases.
+
+Now let's start running:
+
+[1]$ ../zus/fs/do-zu/zudo mkfs
+This will run the proper mkfs command selected at zu.conf file
+with the proper devices.
+
+[2]$ ../zus/fs/do-zu/zudo zuf-insmod
+This loads the zuf.ko module
+
+[3]$ ../zus/fs/do-zu/zudo zuf-root
+This mounts the zuf-root FS above on /sys/fs/zuf (automatically created above)
+
+[4]$ ../zus/fs/do-zu/zudo zus-up
+This runs the zus daemon in the background
+
+[5]$ ../zus/fs/do-zu/zudo mount
+This mounts the FS created by mkfs above on the specified dir in zu.conf
+
+To run all the 5 commands above at once do:
+[]$ ../zus/fs/do-zu/zudo up
+
+To undo all the above in reverse order do:
+[]$ ../zus/fs/do-zu/zudo down
+
+And the most magic command is:
+[]$ ../zus/fs/do-zu/zudo again
+This will do a "down", then update-mods, then "up".
+(update-mods is a special script to copy the latest compiled binaries)
+
+Now you are ready for some:
+[]$ ../zus/fs/do-zu/zudo xfstest
+xfstests is assumed to be installed in the regular /opt/xfstests dir.
+
+Again, please see inside the scripts what each command does;
+these scripts are the ultimate Documentation. Do not believe
+anything I'm saying here (because it is outdated by now).
-- 
2.5.5


* [RFC 4/7] zuf: zuf-rootfs && zuf-core
  2018-03-13 17:14 [RFC 0/7] first draft of ZUFS - the Kernel part Boaz Harrosh
                   ` (2 preceding siblings ...)
  2018-03-13 17:18 ` [RFC 3/7] zuf: Preliminary Documentation Boaz Harrosh
@ 2018-03-13 17:22 ` Boaz Harrosh
  2018-03-13 17:36   ` Boaz Harrosh
  2018-03-13 17:25 ` [RFC 5/7] zus: Devices && mounting Boaz Harrosh
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Boaz Harrosh @ 2018-03-13 17:22 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Ric Wheeler, Miklos Szeredi, Steve French, Steven Whitehouse,
	Jeff Moyer, Sage Weil, Jan Kara, Amir Goldstein, Andy Rudoff,
	Anna Schumaker, Amit Golander, Sagi Manole, Shachar Sharon


Please see here:
https://github.com/NetApp/zufs-zuf/commit/036046a44b964f9ff977938f379b63392ace050f

Will a 1363-line patch pass through the mailers?

Thanks
Boaz


* [RFC 5/7] zus: Devices && mounting
  2018-03-13 17:14 [RFC 0/7] first draft of ZUFS - the Kernel part Boaz Harrosh
                   ` (3 preceding siblings ...)
  2018-03-13 17:22 ` [RFC 4/7] zuf: zuf-rootfs && zuf-core Boaz Harrosh
@ 2018-03-13 17:25 ` Boaz Harrosh
  2018-03-13 17:38   ` Boaz Harrosh
  2018-03-13 17:28 ` [RFC 6/7] zuf: Filesystem operations Boaz Harrosh
  2018-03-13 17:32 ` [RFC 7/7] zuf: Write/Read && mmap implementation Boaz Harrosh
  6 siblings, 1 reply; 39+ messages in thread
From: Boaz Harrosh @ 2018-03-13 17:25 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Ric Wheeler, Miklos Szeredi, Steve French, Steven Whitehouse,
	Jeff Moyer, Sage Weil, Jan Kara, Amir Goldstein, Andy Rudoff,
	Anna Schumaker, Amit Golander, Sagi Manole, Shachar Sharon


Please see here:
https://github.com/NetApp/zufs-zuf/commit/b8041d6b548e1bd85c05c4cdc84f04c2ed6f6eac

Thanks
Boaz


* [RFC 6/7] zuf: Filesystem operations
  2018-03-13 17:14 [RFC 0/7] first draft of ZUFS - the Kernel part Boaz Harrosh
                   ` (4 preceding siblings ...)
  2018-03-13 17:25 ` [RFC 5/7] zus: Devices && mounting Boaz Harrosh
@ 2018-03-13 17:28 ` Boaz Harrosh
  2018-03-13 17:39   ` Boaz Harrosh
  2018-03-13 17:32 ` [RFC 7/7] zuf: Write/Read && mmap implementation Boaz Harrosh
  6 siblings, 1 reply; 39+ messages in thread
From: Boaz Harrosh @ 2018-03-13 17:28 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Ric Wheeler, Miklos Szeredi, Steve French, Steven Whitehouse,
	Jeff Moyer, Sage Weil, Jan Kara, Amir Goldstein, Andy Rudoff,
	Anna Schumaker, Amit Golander, Sagi Manole, Shachar Sharon


Please see here:
https://github.com/NetApp/zufs-zuf/commit/70650fc2a131da07b162d4f072946b14a76fe4fb

Thanks
Boaz


* [RFC 7/7] zuf: Write/Read && mmap implementation
  2018-03-13 17:14 [RFC 0/7] first draft of ZUFS - the Kernel part Boaz Harrosh
                   ` (5 preceding siblings ...)
  2018-03-13 17:28 ` [RFC 6/7] zuf: Filesystem operations Boaz Harrosh
@ 2018-03-13 17:32 ` Boaz Harrosh
  6 siblings, 0 replies; 39+ messages in thread
From: Boaz Harrosh @ 2018-03-13 17:32 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Ric Wheeler, Miklos Szeredi, Steve French, Steven Whitehouse,
	Jeff Moyer, Sage Weil, Jan Kara, Amir Goldstein, Andy Rudoff,
	Anna Schumaker, Amit Golander, Sagi Manole, Shachar Sharon


The dispatch to the server can operate on buffers of up to
4 megabytes. Any bigger operations are split up and dispatched
at this size. (So, for example, a 10MB write goes out as three
dispatches: 4MB + 4MB + 2MB.)

Also, if a multi-segment aio is used, each segment is dispatched
on its own. (TODO: this can easily be fixed with sg operations)

On write, if any mmapped buffers changed (for example, new holes
were allocated due to this write, or a previously mmapped COW page
was written), a range subset of the written range can be returned
for the Kernel to call unmap_mapping_range on.

mmap is achieved with the GET_BLOCK operation. GET_BLOCK will
return whether we need to unmap the previously mapped PTE, in case
we are writing a previously faulted hole, or in case of a COW.

Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile  |   1 +
 fs/zuf/_extern.h |   7 ++
 fs/zuf/file.c    |   3 +
 fs/zuf/mmap.c    | 335 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/rw.c      | 167 +++++++++++++++++++++++++++
 fs/zuf/zus_api.h |  26 +++++
 6 files changed, 539 insertions(+)
 create mode 100644 fs/zuf/mmap.c
 create mode 100644 fs/zuf/rw.c

diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index 4c125f7..0eb933c 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -17,5 +17,6 @@ zuf-y += md.o t2.o t1.o
 zuf-y += zuf-core.o zuf-root.o
 
 # Main FS
+zuf-y += rw.o mmap.o
 zuf-y += super.o inode.o directory.o file.o namei.o symlink.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index cf2e80f..16e99e9 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -46,6 +46,13 @@ bool zuf_dir_emit(struct super_block *sb, struct dir_context *ctx,
 uint zuf_prepare_symname(struct zufs_ioc_new_inode *ioc_new_inode,
 			const char *symname, ulong len, struct page *pages[2]);
 
+/* mmap.c */
+int zuf_file_mmap(struct file *file, struct vm_area_struct *vma);
+
+/* rw.c */
+ssize_t zuf_rw_read_iter(struct kiocb *kiocb, struct iov_iter *ii);
+ssize_t zuf_rw_write_iter(struct kiocb *kiocb, struct iov_iter *ii);
+
 /* file.c */
 int zuf_isync(struct inode *inode, loff_t start, loff_t end, int datasync);
 
diff --git a/fs/zuf/file.c b/fs/zuf/file.c
index 3b37d9f..3fe59d1 100644
--- a/fs/zuf/file.c
+++ b/fs/zuf/file.c
@@ -386,6 +386,9 @@ static int zuf_file_release(struct inode *inode, struct file *filp)
 
 const struct file_operations zuf_file_operations = {
 	.llseek			= zuf_llseek,
+	.read_iter		= zuf_rw_read_iter,
+	.write_iter		= zuf_rw_write_iter,
+	.mmap			= zuf_file_mmap,
 	.open			= generic_file_open,
 	.fsync			= zuf_fsync,
 	.flush			= zuf_flush,
diff --git a/fs/zuf/mmap.c b/fs/zuf/mmap.c
new file mode 100644
index 0000000..b4c8689
--- /dev/null
+++ b/fs/zuf/mmap.c
@@ -0,0 +1,335 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Read/Write operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#include <linux/pfn_t.h>
+#include "zuf.h"
+
+/* ~~~ Functions for mmap and page faults ~~~ */
+
+/* MAP_PRIVATE, copy data to user private page (cow_page) */
+static int _cow_private_page(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct inode *inode = vma->vm_file->f_mapping->host;
+	struct zuf_sb_info *sbi = SBI(inode->i_sb);
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zufs_ioc_IO IO = {
+		.hdr.operation = ZUS_OP_READ,
+		.hdr.in_len = sizeof(IO),
+		.hdr.out_len = 0,
+		.hdr.offset = 0,
+		.hdr.len = PAGE_SIZE,
+		.zus_ii = zii->zus_ii,
+		/* FIXME: Kernel guys this name is confusing should be pgindex*/
+		.filepos = md_p2o(vmf->pgoff),
+	};
+	int err;
+
+	/* Basically a READ into vmf->cow_page */
+	err = zufs_dispatch(ZUF_ROOT(sbi), &IO.hdr, &vmf->cow_page, 1);
+	if (unlikely(err)) {
+		zuf_err("[%ld] What??? bn=0x%lx address=0x%lx => %d\n",
+			inode->i_ino, vmf->pgoff, vmf->address, err);
+		/* FIXME: Probably return VM_FAULT_SIGBUS */
+	}
+
+	/* HACK: This is a hack; since Kernel v4.7 a VM_FAULT_LOCKED with
+	 * vmf->page==NULL is no longer supported. Looks like for now this way
+	 * works well. We let mm mess around with unlocking and putting its own
+	 * cow_page.
+	 */
+	vmf->page = vmf->cow_page;
+	get_page(vmf->page);
+	lock_page(vmf->page);
+
+	return VM_FAULT_LOCKED;
+}
+
+int _rw_init_zero_page(struct zuf_inode_info *zii)
+{
+	if (zii->zero_page)
+		return 0;
+
+	zii->zero_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	if (unlikely(!zii->zero_page))
+		return -ENOMEM;
+	zii->zero_page->mapping = zii->vfs_inode.i_mapping;
+	return 0;
+}
+
+static int _get_block(struct zuf_sb_info *sbi, struct zuf_inode_info *zii,
+		      int rw, ulong index, struct zufs_ioc_get_block *get_block)
+{
+	get_block->hdr.operation = ZUS_OP_GET_BLOCK;
+
+	get_block->hdr.in_len = sizeof(*get_block); /* FIXME */
+	get_block->hdr.out_start = 0; /* FIXME */
+	get_block->hdr.out_len = sizeof(*get_block); /* FIXME */
+
+	get_block->zus_ii = zii->zus_ii;
+	get_block->index = index;
+	get_block->rw = rw;
+
+	return zufs_dispatch(ZUF_ROOT(sbi), &get_block->hdr, NULL, 0);
+}
+
+static int zuf_write_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+			   bool pfn_mkwrite)
+{
+	struct inode *inode = vma->vm_file->f_mapping->host;
+	struct zuf_sb_info *sbi = SBI(inode->i_sb);
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zus_inode *zi = zii->zi;
+	struct zufs_ioc_get_block get_block = {};
+	int fault = VM_FAULT_SIGBUS;
+	pgoff_t size;
+	ulong pfn;
+	int err;
+
+	zuf_dbg_mmap("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx "
+		    "pgoff=0x%lx vmf_flags=0x%x cow_page=%p page=%p\n",
+		    _zi_ino(zi), vma->vm_start, vma->vm_end,
+		    vmf->address, vmf->pgoff, vmf->flags,
+		    vmf->cow_page, vmf->page);
+
+	if (unlikely(vmf->page && vmf->page != zii->zero_page)) {
+		zuf_err("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx "
+			"pgoff=0x%lx vmf_flags=0x%x page=%p cow_page=%p\n",
+			_zi_ino(zi), vma->vm_start, vma->vm_end,
+			vmf->address, vmf->pgoff, vmf->flags,
+			vmf->page, vmf->cow_page);
+		return VM_FAULT_SIGBUS;
+	}
+
+	sb_start_pagefault(inode->i_sb);
+	zuf_smr_lock_pagefault(zii);
+
+	size = md_o2p_up(i_size_read(inode));
+	if (unlikely(vmf->pgoff >= size)) {
+		ulong pgoff = vma->vm_pgoff +
+					md_o2p((vmf->address - vma->vm_start));
+
+		zuf_err("[%ld] pgoff(0x%lx)(0x%lx) >= size(0x%lx) => SIGBUS\n",
+			 _zi_ino(zi), vmf->pgoff, pgoff, size);
+
+		fault = VM_FAULT_SIGBUS;
+		goto out;
+	}
+
+	if (vmf->cow_page) {
+		zuf_warn("cow is write\n");
+		fault = _cow_private_page(vma, vmf);
+		goto out;
+	}
+
+	zus_inode_cmtime_now(inode, zi);
+	/* NOTE: zus needs to flush the zi */
+
+	err = _get_block(sbi, zii, WRITE, vmf->pgoff, &get_block);
+	if (unlikely(err)) {
+		zuf_err("crap => %d\n", err);
+		goto out;
+	}
+
+	if (get_block.ret_flags & ZUFS_GBF_NEW) {
+		/* newly created block */
+		unmap_mapping_range(inode->i_mapping, vmf->pgoff << PAGE_SHIFT,
+				    PAGE_SIZE, 0);
+	} else if (pfn_mkwrite) {
+		/* If the block did not change just tell mm to flip
+		 * the write bit
+		 */
+		fault = VM_FAULT_WRITE;
+		goto out;
+	}
+
+	pfn = md_pfn(sbi->md, get_block.pmem_bn);
+	err = vm_insert_mixed_mkwrite(vma, vmf->address,
+			      phys_to_pfn_t(PFN_PHYS(pfn), PFN_MAP | PFN_DEV));
+	if (unlikely(err)) {
+		zuf_err("crap => %d\n", err);
+		goto out;
+	}
+
+	zuf_dbg_mmap("[%ld] vm_insert_mixed 0x%lx prot=0x%lx => %d\n",
+		    _zi_ino(zi), pfn, vma->vm_page_prot.pgprot, err);
+
+	zuf_sync_inc(inode);
+
+	fault = VM_FAULT_NOPAGE;
+out:
+	zuf_smr_unlock(zii);
+	sb_end_pagefault(inode->i_sb);
+	return fault;
+}
+
+static int zuf_pfn_mkwrite(struct vm_fault *vmf)
+{
+	return zuf_write_fault(vmf->vma, vmf, true);
+}
+
+static int zuf_read_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct inode *inode = vma->vm_file->f_mapping->host;
+	struct zuf_sb_info *sbi = SBI(inode->i_sb);
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zus_inode *zi = zii->zi;
+	struct zufs_ioc_get_block get_block = {};
+	int fault = VM_FAULT_SIGBUS;
+	pgoff_t size;
+	ulong pfn;
+	int err;
+
+	zuf_dbg_mmap("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx "
+		    "pgoff=0x%lx vmf_flags=0x%x cow_page=%p page=%p\n",
+		    _zi_ino(zi), vma->vm_start, vma->vm_end,
+		    vmf->address, vmf->pgoff, vmf->flags,
+		    vmf->cow_page, vmf->page);
+
+	zuf_smr_lock_pagefault(zii);
+
+	size = md_o2p_up(i_size_read(inode));
+	if (unlikely(vmf->pgoff >= size)) {
+		ulong pgoff = vma->vm_pgoff +
+					md_o2p((vmf->address - vma->vm_start));
+
+		zuf_err("[%ld] pgoff(0x%lx)(0x%lx) >= size(0x%lx) => SIGBUS\n",
+			 _zi_ino(zi), vmf->pgoff, pgoff, size);
+		goto out;
+	}
+
+	if (vmf->cow_page) {
+		zuf_warn("cow is read\n");
+		fault = _cow_private_page(vma, vmf);
+		goto out;
+	}
+
+	file_accessed(vma->vm_file);
+	/* NOTE: zus needs to flush the zi */
+
+	err = _get_block(sbi, zii, READ, vmf->pgoff, &get_block);
+	if (unlikely(err)) {
+		zuf_err("crap => %d\n", err);
+		goto out;
+	}
+
+	if (get_block.pmem_bn == 0) {
+		/* Hole in file */
+		err = _rw_init_zero_page(zii);
+		if (unlikely(err))
+			goto out;
+
+		err = vm_insert_page(vma, vmf->address, zii->zero_page);
+		zuf_dbg_mmap("[%ld] inserted zero\n", _zi_ino(zi));
+
+		/* NOTE: we are fooling mm, we do not need this page
+		 * to be locked and get(ed)
+		 */
+		fault = VM_FAULT_NOPAGE;
+		goto out;
+	}
+
+	/* We have a real page */
+	pfn = md_pfn(sbi->md, get_block.pmem_bn);
+	err = vm_insert_mixed(vma, vmf->address,
+			      phys_to_pfn_t(PFN_PHYS(pfn), PFN_MAP | PFN_DEV));
+	if (unlikely(err)) {
+		zuf_err("[%ld] vm_insert_page/mixed => %d\n", _zi_ino(zi), err);
+		goto out;
+	}
+
+	zuf_dbg_mmap("[%ld] vm_insert_mixed 0x%lx prot=0x%lx => %d\n",
+		    _zi_ino(zi), pfn, vma->vm_page_prot.pgprot, err);
+
+	fault = VM_FAULT_NOPAGE;
+
+out:
+	zuf_smr_unlock(zii);
+	return fault;
+}
+
+static int zuf_fault(struct vm_fault *vmf)
+{
+	bool write_fault = (0 != (vmf->flags & FAULT_FLAG_WRITE));
+
+	if (write_fault)
+		return zuf_write_fault(vmf->vma, vmf, false);
+	else
+		return zuf_read_fault(vmf->vma, vmf);
+}
+
+static int zuf_page_mkwrite(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct inode *inode = vma->vm_file->f_mapping->host;
+
+	/* our zero page doesn't really hold the correct offset to the file in
+	 * page->index so vmf->pgoff is incorrect, lets fix that
+	 */
+	vmf->pgoff = vma->vm_pgoff +
+				((vmf->address - vma->vm_start) >> PAGE_SHIFT);
+
+	zuf_dbg_mmap("[%ld] pgoff=0x%lx\n", inode->i_ino, vmf->pgoff);
+
+	/* call fault handler to get a real page for writing */
+	return zuf_write_fault(vma, vmf, false);
+}
+
+static void zuf_mmap_open(struct vm_area_struct *vma)
+{
+	struct zuf_inode_info *zii = ZUII(file_inode(vma->vm_file));
+
+	atomic_inc(&zii->vma_count);
+}
+
+static void zuf_mmap_close(struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	int vma_count = atomic_dec_return(&ZUII(inode)->vma_count);
+
+	if (unlikely(vma_count < 0))
+		zuf_err("[%ld] WHAT??? vma_count=%d\n",
+			 inode->i_ino, vma_count);
+	else if (unlikely(vma_count == 0))
+		/* TOZU _despatch_mmap_close(inode)
+		 * User-mode would like to know we have no more
+		 * mapping on this inode
+		 */
+		;
+}
+
+static const struct vm_operations_struct zuf_vm_ops = {
+	.fault		= zuf_fault,
+	.page_mkwrite	= zuf_page_mkwrite,
+	.pfn_mkwrite	= zuf_pfn_mkwrite,
+	.open           = zuf_mmap_open,
+	.close		= zuf_mmap_close,
+};
+
+int zuf_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(file);
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	file_accessed(file);
+
+	vma->vm_flags |= VM_MIXEDMAP;
+	vma->vm_ops = &zuf_vm_ops;
+
+	atomic_inc(&zii->vma_count);
+
+	zuf_dbg_vfs("[%ld] start=0x%lx end=0x%lx flags=0x%lx page_prot=0x%lx\n",
+		     file->f_mapping->host->i_ino, vma->vm_start, vma->vm_end,
+		     vma->vm_flags, pgprot_val(vma->vm_page_prot));
+
+	return 0;
+}
diff --git a/fs/zuf/rw.c b/fs/zuf/rw.c
new file mode 100644
index 0000000..ec7cd9b
--- /dev/null
+++ b/fs/zuf/rw.c
@@ -0,0 +1,167 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Read/Write operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+#include <linux/uio.h>
+
+#include "zuf.h"
+
+/* ~~~ Functions for read_iter ~~~ */
+
+static int _IO_dispatch(struct zuf_sb_info *sbi, struct zus_inode_info *zus_ii,
+			int operation, uint pgoffset, struct page **pages,
+			uint nump, u64 filepos, uint len)
+{
+	struct zufs_ioc_IO IO = {
+		.hdr.operation = operation,
+		.hdr.in_len = sizeof(IO),
+		.hdr.out_len = 0,
+		.hdr.offset = pgoffset,
+		.hdr.len = len,
+		.zus_ii = zus_ii,
+		.filepos = filepos,
+	};
+
+	return zufs_dispatch(ZUF_ROOT(sbi), &IO.hdr, pages, nump);
+}
+
+static ssize_t _zufs_IO(struct zuf_sb_info *sbi, struct inode *inode,
+			int operation, struct iov_iter *ii, loff_t pos)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+	int err = -EINVAL;
+	loff_t start_pos = pos;
+
+	while (iov_iter_count(ii)) {
+		struct page *pages[ZUS_API_MAP_MAX_PAGES];
+		ssize_t bytes;
+		size_t pgoffset;
+		uint nump, i;
+
+		bytes = iov_iter_get_pages(ii, pages, ZUS_API_MAP_MAX_SIZE,
+					   ZUS_API_MAP_MAX_PAGES, &pgoffset);
+		if (bytes < 0) {
+			err = bytes;
+			break;
+		}
+
+		nump = DIV_ROUND_UP(bytes + pgoffset, PAGE_SIZE);
+		err = _IO_dispatch(sbi, zii->zus_ii, operation, pgoffset, pages,
+				   nump, pos, bytes);
+
+		for (i = 0; i < nump; ++i)
+			put_page(pages[i]);
+
+		if (unlikely(err))
+			break;
+
+		iov_iter_advance(ii, bytes);
+		pos += bytes;
+	}
+
+	if (unlikely(pos == start_pos))
+		return err;
+	return pos - start_pos;
+}
+
+static ssize_t _read_iter(struct inode *inode, struct kiocb *kiocb,
+			  struct iov_iter *ii)
+{
+	struct super_block *sb = inode->i_sb;
+	ssize_t ret;
+
+	/* EOF protection */
+	if (unlikely(kiocb->ki_pos > i_size_read(inode)))
+		return 0;
+
+	iov_iter_truncate(ii, i_size_read(inode) - kiocb->ki_pos);
+	if (unlikely(!iov_iter_count(ii))) {
+		/* Don't let zero len reads have any effect */
+		zuf_dbg_rw("called with NULL len\n");
+		return 0;
+	}
+
+	ret = _zufs_IO(SBI(sb), inode, ZUS_OP_READ, ii, kiocb->ki_pos);
+	if (unlikely(ret < 0))
+		return ret;
+
+	kiocb->ki_pos += ret;
+	return ret;
+}
+
+ssize_t zuf_rw_read_iter(struct kiocb *kiocb, struct iov_iter *ii)
+{
+	struct inode *inode = file_inode(kiocb->ki_filp);
+	ssize_t ret;
+
+	zuf_dbg_vfs("[%ld] ppos=0x%llx len=0x%zx\n",
+		     inode->i_ino, kiocb->ki_pos, iov_iter_count(ii));
+
+	file_accessed(kiocb->ki_filp);
+	ret = _read_iter(inode, kiocb, ii);
+
+	zuf_dbg_vfs("[%ld] => 0x%lx\n", inode->i_ino, ret);
+	return ret;
+}
+
+/* ~~~ Functions for write_iter ~~~ */
+
+static ssize_t _write_iter(struct inode *inode, struct kiocb *kiocb,
+			  struct iov_iter *ii)
+{
+	ssize_t ret;
+
+	ret = _zufs_IO(SBI(inode->i_sb), inode, ZUS_OP_WRITE, ii,
+		       kiocb->ki_pos);
+	if (unlikely(ret < 0))
+		return ret;
+
+	kiocb->ki_pos += ret;
+	return ret;
+}
+
+static int _remove_privs_locked(struct inode *inode, struct file *file)
+{
+	int ret = file_remove_privs(file);
+
+	return ret;
+}
+
+ssize_t zuf_rw_write_iter(struct kiocb *kiocb, struct iov_iter *ii)
+{
+	struct inode *inode = file_inode(kiocb->ki_filp);
+	struct zuf_inode_info *zii = ZUII(inode);
+	ssize_t ret;
+
+	zuf_dbg_vfs("[%ld] ppos=0x%llx len=0x%zx\n",
+		     inode->i_ino, kiocb->ki_pos, iov_iter_count(ii));
+
+	ret = generic_write_checks(kiocb, ii);
+	if (unlikely(ret < 0))
+		goto out;
+
+	ret = _remove_privs_locked(inode, kiocb->ki_filp);
+	if (unlikely(ret < 0))
+		goto out;
+
+	zus_inode_cmtime_now(inode, zii->zi);
+	ret = _write_iter(inode, kiocb, ii);
+
+	if (kiocb->ki_pos > i_size_read(inode))
+		i_size_write(inode, kiocb->ki_pos);
+
+	inode->i_blocks = le64_to_cpu(zii->zi->i_blocks);
+
+out:
+
+	zuf_dbg_vfs("[%ld] => 0x%lx\n", inode->i_ino, ret);
+	return ret;
+}
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 5870d63..90b34e4 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -529,6 +529,32 @@ static inline bool zufs_zde_emit(struct zufs_readdir_iter *rdi, __u64 ino,
 	return true;
 }
 
+/* ZUS_OP_READ/ZUS_OP_WRITE */
+struct zufs_ioc_IO {
+	struct zufs_ioc_hdr hdr;
+	struct zus_inode_info *zus_ii; /* IN */
+
+	__u64 filepos;
+};
+
+enum {
+	ZUFS_GBF_RESERVED = 1,
+	ZUFS_GBF_NEW = 2,
+};
+
+/* ZUS_OP_GET_BLOCK */
+struct zufs_ioc_get_block {
+	struct zufs_ioc_hdr hdr;
+	 /* IN */
+	struct zus_inode_info *zus_ii;
+	__u64 index; /* page index in file */
+	__u64 rw; /* Some flags + READ or WRITE */
+
+	/* OUT */
+	zu_dpp_t pmem_bn; /* zero return means: map a hole */
+	__u64 ret_flags;  /* One of ZUFS_GBF_XXX */
+};
+
 /* ZUS_OP_GET_SYMLINK */
 struct zufs_ioc_get_link {
 	struct zufs_ioc_hdr hdr;
-- 
2.5.5


* [RFC 4/7] zuf: zuf-rootfs && zuf-core
  2018-03-13 17:22 ` [RFC 4/7] zuf: zuf-rootfs && zuf-core Boaz Harrosh
@ 2018-03-13 17:36   ` Boaz Harrosh
  2018-03-14 12:56     ` Nikolay Borisov
  0 siblings, 1 reply; 39+ messages in thread
From: Boaz Harrosh @ 2018-03-13 17:36 UTC (permalink / raw)
  To: Boaz Harrosh, linux-fsdevel
  Cc: Ric Wheeler, Miklos Szeredi, Steve French, Steven Whitehouse,
	Jeff Moyer, Sage Weil, Jan Kara, Amir Goldstein, Andy Rudoff,
	Anna Schumaker, Amit Golander, Sagi Manole, Shachar Sharon


zuf-core establishes the communication channels with
the zus UM Server.

zuf-root is a pseudo-FS through which the zus communicates:
it registers new file-systems and receives new mount requests.

In this patch we have the bring-up of that special FS, and
the core communication mechanics, which are the novelty
of this code submission.

The zuf-rootfs (-t zuf) is usually mounted by default on
/sys/fs/zuf. If an admin wants to run more server applications
(note that each server application supports many types of FSs),
he/she can mount a second instance of -t zuf and point the new
Server at it.

(Otherwise a second instance attempting to communicate with a
 busy zuf will fail)

TODO: How to trigger a first mount on module_load. Currently the
admin needs to manually "mount -t zuf none /sys/fs/zuf"
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile   |   4 +
 fs/zuf/_extern.h  |  47 +++++
 fs/zuf/_pr.h      |  53 ++++++
 fs/zuf/relay.h    |  86 +++++++++
 fs/zuf/super.c    |  21 +++
 fs/zuf/zuf-core.c | 517 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/zuf-root.c | 330 ++++++++++++++++++++++++++++++++++
 fs/zuf/zuf.h      |  90 ++++++++++
 fs/zuf/zus_api.h  | 108 ++++++++++++
 9 files changed, 1256 insertions(+)
 create mode 100644 fs/zuf/_extern.h
 create mode 100644 fs/zuf/_pr.h
 create mode 100644 fs/zuf/relay.h
 create mode 100644 fs/zuf/super.c
 create mode 100644 fs/zuf/zuf-core.c
 create mode 100644 fs/zuf/zuf-root.c
 create mode 100644 fs/zuf/zuf.h

diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index 7e4e51f..d00940c 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -10,5 +10,9 @@
 
 obj-$(CONFIG_ZUF) += zuf.o
 
+# ZUF core
+zuf-y += zuf-core.o zuf-root.o
+
 # Main FS
+zuf-y += super.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
new file mode 100644
index 0000000..e490043
--- /dev/null
+++ b/fs/zuf/_extern.h
@@ -0,0 +1,47 @@
+/*
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#ifndef __ZUF_EXTERN_H__
+#define __ZUF_EXTERN_H__
+/*
+ * DO NOT INCLUDE this file directly, it is included by zuf.h.
+ * It is here because zuf.h got too big.
+ */
+
+/*
+ * extern functions declarations
+ */
+
+/* super.c */
+struct dentry *zuf_mount(struct file_system_type *fs_type, int flags,
+			 const char *dev_name, void *data);
+
+/* zuf-core.c */
+int zufs_zts_init(struct zuf_root_info *zri); /* Some private types in core */
+void zufs_zts_fini(struct zuf_root_info *zri);
+
+long zufs_ioc(struct file *filp, unsigned int cmd, ulong arg);
+int zufs_dispatch_mount(struct zuf_root_info *zri, struct zus_fs_info *zus_zfi,
+			struct zufs_ioc_mount *zim);
+int zufs_dispatch_umount(struct zuf_root_info *zri,
+			 struct zus_sb_info *zus_sbi);
+
+int zufs_dispatch(struct zuf_root_info *zri, struct zufs_ioc_hdr *hdr,
+		  struct page **pages, uint nump);
+
+int zuf_zt_mmap(struct file *file, struct vm_area_struct *vma);
+
+void zufs_zt_release(struct file *filp);
+void zufs_mounter_release(struct file *filp);
+
+/* zuf-root.c */
+int zuf_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs);
+
+#endif	/*ndef __ZUF_EXTERN_H__*/
diff --git a/fs/zuf/_pr.h b/fs/zuf/_pr.h
new file mode 100644
index 0000000..39c4622
--- /dev/null
+++ b/fs/zuf/_pr.h
@@ -0,0 +1,53 @@
+/*
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#ifndef __ZUF_PR_H__
+#define __ZUF_PR_H__
+
+#ifdef pr_fmt
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#endif
+
+/*
+ * Debug code
+ */
+#define zuf_err(s, args ...)		pr_err("[%s:%d] " s, __func__, \
+							__LINE__, ## args)
+#define zuf_err_cnd(silent, s, args ...) \
+	do {if (!silent) \
+		pr_err("[%s:%d] " s, __func__, __LINE__, ## args); \
+	} while (0)
+#define zuf_warn(s, args ...)		pr_warn("[%s:%d] " s, __func__, \
+							__LINE__, ## args)
+#define zuf_warn_cnd(silent, s, args ...) \
+	do {if (!silent) \
+		pr_warn("[%s:%d] " s, __func__, __LINE__, ## args); \
+	} while (0)
+#define zuf_info(s, args ...)          pr_info("~info~ " s, ## args)
+
+#define zuf_chan_debug(c, s, args...)	pr_debug(c " [%s:%d] " s, __func__, \
+							__LINE__, ## args)
+
+/* ~~~ channel prints ~~~ */
+#define zuf_dbg_err(s, args ...)	zuf_chan_debug("error", s, ##args)
+#define zuf_dbg_vfs(s, args ...)	zuf_chan_debug("vfs  ", s, ##args)
+#define zuf_dbg_rw(s, args ...)		zuf_chan_debug("rw   ", s, ##args)
+#define zuf_dbg_t1(s, args ...)		zuf_chan_debug("t1   ", s, ##args)
+#define zuf_dbg_verbose(s, args ...)	zuf_chan_debug("d-oto", s, ##args)
+#define zuf_dbg_xattr(s, args ...)	zuf_chan_debug("xattr", s, ##args)
+#define zuf_dbg_acl(s, args ...)	zuf_chan_debug("acl  ", s, ##args)
+#define zuf_dbg_t2(s, args ...)		zuf_chan_debug("t2dbg", s, ##args)
+#define zuf_dbg_t2_rw(s, args ...)	zuf_chan_debug("t2grw", s, ##args)
+#define zuf_dbg_core(s, args ...)	zuf_chan_debug("core ", s, ##args)
+#define zuf_dbg_mmap(s, args ...)	zuf_chan_debug("mmap ", s, ##args)
+#define zuf_dbg_zus(s, args ...)	zuf_chan_debug("zusdg", s, ##args)
+
+#endif /* define __ZUF_PR_H__ */
diff --git a/fs/zuf/relay.h b/fs/zuf/relay.h
new file mode 100644
index 0000000..490a193
--- /dev/null
+++ b/fs/zuf/relay.h
@@ -0,0 +1,86 @@
+/*
+ * Relay scheduler-object Header file.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#ifndef __RELAY_H__
+#define __RELAY_H__
+
+/* ~~~~ Relay ~~~~ */
+struct relay {
+	wait_queue_head_t fss_wq;
+	bool fss_wakeup;
+	bool fss_waiting;
+
+	wait_queue_head_t app_wq;
+	bool app_wakeup;
+	bool app_waiting;
+};
+
+static inline void relay_init(struct relay *relay)
+{
+	init_waitqueue_head(&relay->fss_wq);
+	init_waitqueue_head(&relay->app_wq);
+}
+
+static inline void relay_fss_waiting_grab(struct relay *relay)
+{
+	relay->fss_waiting = true;
+}
+
+static inline bool relay_is_app_waiting(struct relay *relay)
+{
+	return relay->app_waiting;
+}
+
+static inline void relay_app_wakeup(struct relay *relay)
+{
+	relay->app_waiting = false;
+
+	relay->app_wakeup = true;
+	wake_up(&relay->app_wq);
+}
+
+static inline int relay_fss_wait(struct relay *relay)
+{
+	int err;
+
+	relay->fss_wakeup = false;
+	err =  wait_event_interruptible(relay->fss_wq, relay->fss_wakeup);
+
+	relay->fss_waiting = false;
+	return err;
+}
+
+static inline bool relay_is_fss_waiting(struct relay *relay)
+{
+	return relay->fss_waiting;
+}
+
+static inline void relay_fss_wakeup(struct relay *relay)
+{
+	relay->fss_wakeup = true;
+	wake_up(&relay->fss_wq);
+}
+
+static inline int relay_fss_wakeup_app_wait(struct relay *relay,
+					    spinlock_t *spinlock)
+{
+	relay->app_waiting = true;
+
+	relay_fss_wakeup(relay);
+
+	relay->app_wakeup = false;
+	if (spinlock)
+		spin_unlock(spinlock);
+
+	return wait_event_interruptible(relay->app_wq, relay->app_wakeup);
+}
+
+#endif /* ifndef __RELAY_H__ */
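
The relay above is the heart of the channel: each side flags that it
is waiting, wakes its peer, and sleeps until flagged back. Schematically
(compare zufs_dispatch() and _zu_wait() in zuf-core.c below):

	server (FSS/ZT) thread             application (dispatch) side
	----------------------             ---------------------------
	relay_fss_waiting_grab(r)
	reply to previous op               zt->next_opt = hdr
	relay_fss_wait(r)      <--wake--   relay_fss_wakeup_app_wait(r, lock)
	copy next op to user               sleeps on app_wq
	relay_fss_waiting_grab(r)
	relay_app_wakeup(r)    --wake-->   returns with hdr->err set

Note the flags are plain booleans with no explicit memory barriers yet;
zufs_dispatch() below compensates with an mb()/msleep() polling loop
when the server thread is not yet waiting.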
diff --git a/fs/zuf/super.c b/fs/zuf/super.c
new file mode 100644
index 0000000..6e176a5
--- /dev/null
+++ b/fs/zuf/super.c
@@ -0,0 +1,21 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Super block operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#include "zuf.h"
+
+struct dentry *zuf_mount(struct file_system_type *fs_type, int flags,
+			 const char *dev_name, void *data)
+{
+	return ERR_PTR(-ENOTSUPP);
+}
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
new file mode 100644
index 0000000..12a23f1
--- /dev/null
+++ b/fs/zuf/zuf-core.c
@@ -0,0 +1,517 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Ioctl operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/mm_types.h>
+#include <linux/delay.h>
+#include <linux/uaccess.h>
+
+#include "zuf.h"
+
+static struct page *g_drain_p = NULL;
+
+struct zufs_thread {
+	struct zuf_special_file hdr;
+	struct relay relay;
+	struct file *file;
+	struct vm_area_struct *vma;
+	int no;
+
+	/* Next operation */
+	struct zufs_ioc_hdr *next_opt;
+	struct page **pages;
+	uint nump;
+} ____cacheline_aligned;
+
+static int _zt_from_f(struct file *filp, int cpu, struct zufs_thread **ztp)
+{
+	struct zuf_root_info *zri = ZRI(filp->f_inode->i_sb);
+
+	if ((cpu < 0) || (zri->_max_zts <= cpu)) {
+		zuf_err("fatal\n");
+		return -ERANGE;
+	}
+
+	*ztp = &zri->_all_zt[cpu];
+	return 0;
+}
+
+int zufs_zts_init(struct zuf_root_info *zri)
+{
+	zri->_max_zts = num_online_cpus();
+
+	zri->_all_zt = kcalloc(zri->_max_zts, sizeof(struct zufs_thread),
+			       GFP_KERNEL);
+	if (unlikely(!zri->_all_zt))
+		return -ENOMEM;
+
+	g_drain_p = alloc_page(GFP_KERNEL);
+	if (!g_drain_p) {
+		zuf_err("!!! failed to alloc g_drain_p\n");
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+void zufs_zts_fini(struct zuf_root_info *zri)
+{
+	if (g_drain_p) {
+		__free_page(g_drain_p);
+		g_drain_p = NULL;
+	}
+	kfree(zri->_all_zt);
+	zri->_all_zt = NULL;
+}
+
+static int _zu_register_fs(struct file *file, void *parg)
+{
+	struct zufs_ioc_register_fs rfs;
+	int err;
+
+	err = copy_from_user(&rfs, parg, sizeof(rfs));
+	if (unlikely(err)) {
+		zuf_err("=>%d\n", err);
+		return err;
+	}
+
+	err = zuf_register_fs(file->f_inode->i_sb, &rfs);
+	if (err)
+		zuf_err("=>%d\n", err);
+	err = put_user(err, (int *)parg);
+	return err;
+}
+
+/* ~~~~ mounting ~~~~*/
+int zufs_dispatch_mount(struct zuf_root_info *zri, struct zus_fs_info *zus_zfi,
+			struct zufs_ioc_mount *zim)
+{
+	zim->zus_zfi = zus_zfi;
+	zim->num_cpu = zri->_max_zts;
+
+	if (unlikely(!zri->mount.file)) {
+		zuf_err("Server not up\n");
+		zim->hdr.err = -EIO;
+		return zim->hdr.err;
+	}
+
+	for (;;) {
+		bool fss_waiting;
+		/* It is OK to wait if user storms mounts */
+		spin_lock(&zri->mount.lock);
+		fss_waiting = relay_is_fss_waiting(&zri->mount.relay);
+		if (fss_waiting)
+			break;
+
+		spin_unlock(&zri->mount.lock);
+		if (unlikely(!zri->mount.file)) {
+			zuf_err("Server died\n");
+			zim->hdr.err = -EIO;
+			/* Lock is no longer held here; must not fall
+			 * through to the unlocking wait below.
+			 */
+			return zim->hdr.err;
+		}
+		zuf_dbg_verbose("waiting\n");
+		msleep(100);
+	}
+
+	zri->mount.zim = zim;
+	relay_fss_wakeup_app_wait(&zri->mount.relay, &zri->mount.lock);
+
+	return zim->hdr.err;
+}
+
+int zufs_dispatch_umount(struct zuf_root_info *zri, struct zus_sb_info *zus_sbi)
+{
+	struct zufs_ioc_mount zim = {
+		.is_umounting = true,
+		.zus_sbi = zus_sbi,
+	};
+
+	return zufs_dispatch_mount(zri, NULL, &zim);
+}
+
+static int _zu_mount(struct file *file, void *parg)
+{
+	struct super_block *sb = file->f_inode->i_sb;
+	struct zuf_root_info *zri = ZRI(sb);
+	bool waiting_for_reply;
+	struct zufs_ioc_mount *zim;
+	int err;
+
+	spin_lock(&zri->mount.lock);
+
+	if (unlikely(!file->private_data)) {
+		/* First time register this file as the mount-thread owner */
+		zri->mount.zsf.type = zlfs_e_mout_thread;
+		zri->mount.file = file;
+		file->private_data = &zri->mount;
+	} else if (unlikely(file->private_data != &zri->mount)) {
+		spin_unlock(&zri->mount.lock);
+		zuf_err("Say what?? %p != %p\n",
+			file->private_data, &zri->mount);
+		return -EIO;
+	}
+
+	relay_fss_waiting_grab(&zri->mount.relay);
+	zim = zri->mount.zim;
+	zri->mount.zim = NULL;
+	waiting_for_reply = zim && relay_is_app_waiting(&zri->mount.relay);
+
+	spin_unlock(&zri->mount.lock);
+
+	if (waiting_for_reply) {
+		zim->hdr.err = copy_from_user(zim, parg, sizeof(*zim));
+		relay_app_wakeup(&zri->mount.relay);
+		if (unlikely(zim->hdr.err)) {
+			zuf_err("=>%d\n", zim->hdr.err);
+			return zim->hdr.err;
+		}
+	}
+
+	/* This gets to sleep until a mount comes */
+	err = relay_fss_wait(&zri->mount.relay);
+	if (unlikely(err || !zri->mount.zim)) {
+		struct zufs_ioc_hdr *hdr = parg;
+
+		/* Released by _zu_break INTER or crash */
+		zuf_warn("_zu_break? %p => %d\n", zri->mount.zim, err);
+		put_user(ZUS_OP_BREAK, &hdr->operation);
+		put_user(EIO, &hdr->err);
+		return err;
+	}
+
+	err = copy_to_user(parg, zri->mount.zim, sizeof(*zri->mount.zim));
+	if (unlikely(err))
+		zuf_err("=>%d\n", err);
+	return err;
+}
+
+void zufs_mounter_release(struct file *file)
+{
+	struct zuf_root_info *zri = ZRI(file->f_inode->i_sb);
+
+	zuf_warn("closed fu=%d au=%d fw=%d aw=%d\n",
+		  zri->mount.relay.fss_wakeup, zri->mount.relay.app_wakeup,
+		  zri->mount.relay.fss_waiting, zri->mount.relay.app_waiting);
+
+	spin_lock(&zri->mount.lock);
+	zri->mount.file = NULL;
+	if (relay_is_app_waiting(&zri->mount.relay)) {
+		zuf_err("server emergency exit while IO\n");
+
+		if (zri->mount.zim)
+			zri->mount.zim->hdr.err = -EIO;
+		spin_unlock(&zri->mount.lock);
+
+		relay_app_wakeup(&zri->mount.relay);
+		msleep(1000); /* crap */
+	} else {
+		if (zri->mount.zim)
+			zri->mount.zim->hdr.err = 0;
+		spin_unlock(&zri->mount.lock);
+	}
+}
+
+static int _map_pages(struct zufs_thread *zt, struct page **pages, uint nump,
+		      bool zap)
+{
+	int p, err;
+	pgprot_t prot;
+
+	if (!(zt->vma && pages && nump))
+		return 0;
+
+	/* Base the protection on the vma's; never use an uninitialized prot */
+	prot = pgprot_modify(zt->vma->vm_page_prot, PAGE_SHARED);
+	for (p = 0; p < nump; ++p) {
+		ulong zt_addr = zt->vma->vm_start + p * PAGE_SIZE;
+		ulong pfn = page_to_pfn(zap ? g_drain_p : pages[p]);
+
+		err = vm_insert_pfn_prot(zt->vma, zt_addr, pfn, prot);
+		if (unlikely(err)) {
+			zuf_err("zuf: remap_pfn_range => %d p=0x%x start=0x%lx\n",
+				 err, p, zt->vma->vm_start);
+			return err;
+		}
+	}
+	return 0;
+}
+
+static void _unmap_pages(struct zufs_thread *zt, struct page **pages, uint nump)
+{
+	if (!(zt->vma && pages && nump))
+		return;
+
+	zt->pages = NULL;
+	zt->nump = 0;
+
+	/* Punch in a drain page for this CPU */
+	_map_pages(zt, pages, nump, true);
+}
+
+static int _zu_init(struct file *file, void *parg)
+{
+	struct zufs_thread *zt;
+	int cpu = smp_processor_id();
+	struct zufs_ioc_init zi_init;
+	int err;
+
+	err = copy_from_user(&zi_init, parg, sizeof(zi_init));
+	if (unlikely(err)) {
+		zuf_err("=>%d\n", err);
+		return err;
+	}
+
+	zuf_warn("[%d] aff=0x%lx\n", cpu, zi_init.affinity);
+
+	zi_init.hdr.err = _zt_from_f(file, cpu, &zt);
+	if (unlikely(zi_init.hdr.err)) {
+		zuf_err("=>%d\n", err);
+		goto out;
+	}
+
+	if (zt->file) {
+		zuf_err("[%d] thread already set\n", cpu);
+		memset(zt, 0, sizeof(*zt));
+	}
+
+	relay_init(&zt->relay);
+	zt->hdr.type = zlfs_e_zt;
+	zt->file = file;
+	zt->no = cpu;
+
+	file->private_data = &zt->hdr;
+out:
+	err = copy_to_user(parg, &zi_init, sizeof(zi_init));
+	if (err)
+		zuf_err("=>%d\n", err);
+	return err;
+}
+
+struct zufs_thread *_zt_from_f_private(struct file *file)
+{
+	struct zuf_special_file *zsf = file->private_data;
+
+	WARN_ON(zsf->type != zlfs_e_zt);
+	return container_of(zsf, struct zufs_thread, hdr);
+}
+
+/* Caller checks that file->private_data != NULL */
+void zufs_zt_release(struct file *file)
+{
+	struct zufs_thread *zt = _zt_from_f_private(file);
+
+	if (unlikely(zt->file != file))
+		zuf_err("What happened zt->file(%p) != file(%p)\n",
+			zt->file, file);
+
+	zuf_warn("[%d] closed fu=%d au=%d fw=%d aw=%d\n",
+		  zt->no, zt->relay.fss_wakeup, zt->relay.app_wakeup,
+		  zt->relay.fss_waiting, zt->relay.app_waiting);
+
+	if (relay_is_app_waiting(&zt->relay)) {
+		zuf_err("server emergency exit while IO\n");
+
+		/* NOTE: Do not call _unmap_pages the vma is gone */
+
+		zt->next_opt->err = -EIO;
+		zt->file = NULL;
+
+		relay_app_wakeup(&zt->relay);
+		msleep(1000); /* crap */
+	}
+
+	memset(zt, 0, sizeof(*zt));
+}
+
+static int _zu_wait(struct file *file, void *parg)
+{
+	struct zufs_thread *zt;
+	int cpu = smp_processor_id();
+	int err;
+
+	err = _zt_from_f(file, cpu, &zt);
+	if (unlikely(err))
+		goto err;
+
+	if (!zt->file || file != zt->file) {
+		zuf_err("fatal\n");
+		err = -E2BIG;
+		goto err;
+	}
+
+	relay_fss_waiting_grab(&zt->relay);
+
+	if (relay_is_app_waiting(&zt->relay)) {
+		_unmap_pages(zt, zt->pages, zt->nump);
+
+		get_user(zt->next_opt->err, (int *)parg);
+		if (zt->next_opt->out_len) {
+			void *rply = (void *)zt->next_opt +
+							zt->next_opt->out_start;
+			void *from = parg + zt->next_opt->out_start;
+
+			err = copy_from_user(rply, from, zt->next_opt->out_len);
+		}
+		zt->next_opt = NULL;
+
+		relay_app_wakeup(&zt->relay);
+	}
+
+	err = relay_fss_wait(&zt->relay);
+
+	if (zt->next_opt && zt->next_opt->operation < ZUS_OP_BREAK) {
+		/* call map here at the zuf thread so we need no locks */
+		_map_pages(zt, zt->pages, zt->nump, false);
+		err = copy_to_user(parg, zt->next_opt, zt->next_opt->in_len);
+	} else {
+		struct zufs_ioc_hdr *hdr = parg;
+
+		/* This Means we were released by _zu_break */
+		zuf_warn("_zu_break? %p => %d\n", zt->next_opt, err);
+		put_user(ZUS_OP_BREAK, &hdr->operation);
+		put_user(err, &hdr->err);
+	}
+
+	return err;
+
+err:
+	put_user(err, (int *)parg);
+	return err;
+}
+
+int zufs_dispatch(struct zuf_root_info *zri, struct zufs_ioc_hdr *hdr,
+		  struct page **pages, uint nump)
+{
+	int cpu = smp_processor_id();
+	struct zufs_thread *zt;
+
+	if ((cpu < 0) || (zri->_max_zts <= cpu))
+		return -ERANGE;
+	zt = &zri->_all_zt[cpu];
+
+	if (unlikely(!zt->file))
+		return -EIO;
+
+	while (!relay_is_fss_waiting(&zt->relay)) {
+		mb();
+		if (unlikely(!zt->file))
+			return -EIO;
+		zuf_dbg_err("[%d] can this be\n", cpu);
+		/* FIXME: Do something much smarter */
+		msleep(10);
+		mb();
+	}
+
+	zt->next_opt = hdr;
+	zt->pages = pages;
+	zt->nump = nump;
+
+	relay_fss_wakeup_app_wait(&zt->relay, NULL);
+
+	return zt->file ? hdr->err : -EIO;
+}
+
+static int _zu_break(struct file *filp, void *parg)
+{
+	struct zuf_root_info *zri = ZRI(filp->f_inode->i_sb);
+	int i;
+
+	zuf_dbg_core("enter\n");
+	mb(); /* TODO how to schedule on all CPUs */
+
+	for (i = 0; i < zri->_max_zts; ++i) {
+		struct zufs_thread *zt = &zri->_all_zt[i];
+
+		if (unlikely(!(zt && zt->file)))
+			continue;
+		relay_fss_wakeup(&zt->relay);
+	}
+
+	if (zri->mount.file)
+		relay_fss_wakeup(&zri->mount.relay);
+
+	zuf_dbg_core("exit\n");
+	return 0;
+}
+
+long zufs_ioc(struct file *file, unsigned int cmd, ulong arg)
+{
+	void __user *parg = (void __user *)arg;
+
+	switch (cmd) {
+	case ZU_IOC_REGISTER_FS:
+		return _zu_register_fs(file, parg);
+	case ZU_IOC_MOUNT:
+		return _zu_mount(file, parg);
+	case ZU_IOC_INIT_THREAD:
+		return _zu_init(file, parg);
+	case ZU_IOC_WAIT_OPT:
+		return _zu_wait(file, parg);
+	case ZU_IOC_BREAK_ALL:
+		return _zu_break(file, parg);
+	default:
+		zuf_err("%d %ld\n", cmd, ZU_IOC_WAIT_OPT);
+		return -ENOTTY;
+	}
+}
+
+static int zuf_file_fault(struct vm_fault *vmf)
+{
+	zuf_err("should not fault\n");
+	return VM_FAULT_SIGBUS;
+}
+
+static void zuf_mmap_open(struct vm_area_struct *vma)
+{
+	struct file *file = vma->vm_file;
+	struct zufs_thread *zt = _zt_from_f_private(file);
+
+	zuf_dbg_vfs("[%ld] start=0x%lx end=0x%lx flags=0x%lx page_prot=0x%lx\n",
+		     file->f_mapping->host->i_ino, vma->vm_start, vma->vm_end,
+		     vma->vm_flags, pgprot_val(vma->vm_page_prot));
+	zt->vma = vma;
+}
+
+static void zuf_mmap_close(struct vm_area_struct *vma)
+{
+	struct file *file = vma->vm_file;
+	struct zufs_thread *zt = _zt_from_f_private(file);
+
+	zuf_dbg_vfs("[%ld] start=0x%lx end=0x%lx flags=0x%lx page_prot=0x%lx\n",
+		     file->f_mapping->host->i_ino, vma->vm_start, vma->vm_end,
+		     vma->vm_flags, pgprot_val(vma->vm_page_prot));
+
+	zt->vma = NULL;
+}
+
+static const struct vm_operations_struct zuf_vm_ops = {
+	.fault		= zuf_file_fault,
+	.open           = zuf_mmap_open,
+	.close		= zuf_mmap_close,
+};
+
+int zuf_zt_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct zufs_thread *zt = _zt_from_f_private(file);
+
+	/* Tell Kernel We will only access on a single core */
+	vma->vm_flags |= VM_LOCAL_CPU;
+	vma->vm_flags |= VM_PFNMAP;
+	vma->vm_ops = &zuf_vm_ops;
+
+	zt->vma = vma;
+
+	zuf_dbg_core("[%ld] start=0x%lx end=0x%lx flags=0x%lx page_prot=0x%lx\n",
+		     file->f_mapping->host->i_ino, vma->vm_start, vma->vm_end,
+		     vma->vm_flags, pgprot_val(vma->vm_page_prot));
+
+	return 0;
+}
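
For clarity, this is the zero-copy window the code above maintains, as
seen from the Server side. zt_map is the mmap of the ZT file; the
helper name is illustrative and not part of this patch:

	/* _map_pages() punched the application's pages into the ZT vma;
	 * the buffer starts hdr->offset bytes into the mapping and spans
	 * hdr->len bytes. After the reply, _unmap_pages() re-points the
	 * PTEs at the drain page so the Server cannot touch stale pages.
	 */
	static void *op_buffer(void *zt_map, struct zufs_ioc_hdr *hdr)
	{
		return (char *)zt_map + hdr->offset;
	}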
diff --git a/fs/zuf/zuf-root.c b/fs/zuf/zuf-root.c
new file mode 100644
index 0000000..8102d3a
--- /dev/null
+++ b/fs/zuf/zuf-root.c
@@ -0,0 +1,330 @@
+/*
+ * ZUF Root filesystem.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details.
+ *
+ * ZUF core is mounted on a small specialized FS that
+ * provides the communication with the mount thread, the zuf multi-channel
+ * communication [ZTs], and the pmem devices.
+ * Subsequently all FS super_blocks are children of this root and point
+ * to it, all using the same zuf communication multi-channel.
+ *
+ * [
+ * TODO:
+ *	Multiple servers can run on Multiple mounted roots. Each registering
+ *	their own FSTYPEs. Admin should make sure that the FSTYPEs do not
+ *	overlap
+ * ]
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/magic.h>
+#include <asm-generic/mman.h>
+
+#include "zuf.h"
+
+/* ~~~~ Register/Unregister FS-types ~~~~ */
+#ifdef CONFIG_LOCKDEP
+
+/*
+ * NOTE: When CONFIG_LOCKDEP is on, register_filesystem complains when
+ * the fstype object comes from a kmalloc, because its lockdep_keys are
+ * not in static (const_obj) memory.
+ *
+ * So in this case we have maximum of 16 fstypes system wide
+ * (Total for all mounted zuf_root(s)). This way we can have them
+ * in const_obj memory below at g_fs_array
+ */
+
+enum { MAX_LOCKDEP_FSs = 16 };
+static uint g_fs_next;
+static struct zuf_fs_type g_fs_array[MAX_LOCKDEP_FSs];
+
+static struct zuf_fs_type *_fs_type_alloc(void)
+{
+	if (MAX_LOCKDEP_FSs <= g_fs_next)
+		return NULL;
+
+	return &g_fs_array[g_fs_next++];
+}
+
+static void _fs_type_free(struct zuf_fs_type *zft)
+{
+	if (zft == &g_fs_array[0])
+		g_fs_next = 0;
+}
+
+#else /* !CONFIG_LOCKDEP*/
+static struct zuf_fs_type *_fs_type_alloc(void)
+{
+	return kcalloc(1, sizeof(struct zuf_fs_type), GFP_KERNEL);
+}
+
+static void _fs_type_free(struct zuf_fs_type *zft)
+{
+	kfree(zft);
+}
+#endif /*CONFIG_LOCKDEP*/
+
+int zuf_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs)
+{
+	struct zuf_fs_type *zft = _fs_type_alloc();
+
+	if (unlikely(!zft))
+		return -ENOMEM;
+
+	/* Original vfs file type */
+	zft->vfs_fst.owner	= THIS_MODULE;
+	zft->vfs_fst.name	= kstrdup(rfs->rfi.fsname, GFP_KERNEL);
+	zft->vfs_fst.mount	= zuf_mount;
+	zft->vfs_fst.kill_sb	= kill_block_super;
+
+	/* ZUS info about this FS */
+	zft->rfi		= rfs->rfi;
+	zft->zus_zfi		= rfs->zus_zfi;
+	INIT_LIST_HEAD(&zft->list);
+	/* Back pointer to our communication channels */
+	zft->zri		= ZRI(sb);
+
+	zuf_add_fs_type(zft->zri, zft);
+	zuf_info("register_filesystem [%s]\n", zft->vfs_fst.name);
+	return register_filesystem(&zft->vfs_fst);
+}
+
+void _unregister_fs(struct zuf_root_info *zri)
+{
+	struct zuf_fs_type *zft, *n;
+
+	list_for_each_entry_safe_reverse(zft, n, &zri->fst_list, list) {
+		unregister_filesystem(&zft->vfs_fst);
+		list_del_init(&zft->list);
+		_fs_type_free(zft);
+	}
+}
+
+int zufr_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct zuf_special_file *zsf = file->private_data;
+
+	switch (zsf->type) {
+	case zlfs_e_zt:
+		return zuf_zt_mmap(file, vma);
+	default:
+		zuf_err("type=%d\n", zsf->type);
+		return -ENOTTY;
+	}
+}
+
+static int zufr_release(struct inode *inode, struct file *file)
+{
+	struct zuf_special_file *zsf = file->private_data;
+
+	if (!zsf)
+		return 0;
+
+	switch (zsf->type) {
+	case zlfs_e_zt:
+		zufs_zt_release(file);
+		return 0;
+	case zlfs_e_mout_thread: {
+		struct zuf_root_info *zri = ZRI(inode->i_sb);
+
+		zufs_mounter_release(file);
+		_unregister_fs(zri);
+		return 0;
+	}
+	case zlfs_e_pmem:
+		/* NOTHING to clean for pmem file yet */
+		/* zufs_pmem_release(file);*/
+		return 0;
+	default:
+		return 0;
+	}
+}
+
+static int zufr_unlink(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = dentry->d_inode;
+
+	drop_nlink(inode);
+	return 0;
+}
+
+static const struct inode_operations zufr_inode_operations;
+static const struct file_operations zufr_file_dir_operations = {
+	.open		= dcache_dir_open,
+	.release	= dcache_dir_close,
+	.llseek		= dcache_dir_lseek,
+	.read		= generic_read_dir,
+	.iterate_shared	= dcache_readdir,
+	.fsync		= noop_fsync,
+	.unlocked_ioctl = zufs_ioc,
+};
+static const struct file_operations zufr_file_reg_operations = {
+	.fsync		= noop_fsync,
+	.unlocked_ioctl = zufs_ioc,
+	.mmap		= zufr_mmap,
+	.release	= zufr_release,
+};
+
+static int zufr_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+	struct zuf_root_info *zri = ZRI(dir->i_sb);
+	struct inode *inode;
+	int err;
+
+	inode = new_inode(dir->i_sb);
+	if (!inode)
+		return -ENOMEM;
+
+	inode->i_ino = ++zri->next_ino; /* non-atomic; only one mount thread */
+	inode->i_blocks = inode->i_size = 0;
+	inode->i_ctime = inode->i_mtime = current_kernel_time();
+	inode->i_atime = inode->i_ctime;
+	inode_init_owner(inode, dir, mode);
+
+	inode->i_op = &zufr_inode_operations;
+	inode->i_fop = &zufr_file_reg_operations;
+
+	err = insert_inode_locked(inode);
+	if (unlikely(err)) {
+		zuf_err("[%ld] insert_inode_locked => %d\n", inode->i_ino, err);
+		goto fail;
+	}
+	d_tmpfile(dentry, inode);
+	unlock_new_inode(inode);
+	return 0;
+
+fail:
+	clear_nlink(inode);
+	make_bad_inode(inode);
+	iput(inode);
+	return err;
+}
+
+static void zufr_put_super(struct super_block *sb)
+{
+	struct zuf_root_info *zri = ZRI(sb);
+
+	zufs_zts_fini(zri);
+	_unregister_fs(zri);
+
+	zuf_info("zuf_root umount\n");
+}
+
+static void zufr_evict_inode(struct inode *inode)
+{
+	clear_inode(inode);
+}
+
+static const struct inode_operations zufr_inode_operations = {
+	.lookup		= simple_lookup,
+
+	.tmpfile	= zufr_tmpfile,
+	.unlink		= zufr_unlink,
+};
+static const struct super_operations zufr_super_operations = {
+	.statfs		= simple_statfs,
+
+	.evict_inode	= zufr_evict_inode,
+	.put_super	= zufr_put_super,
+};
+
+#define ZUFR_SUPER_MAGIC 0x1717
+
+static int zufr_fill_super(struct super_block *sb, void *data, int silent)
+{
+	static struct tree_descr zufr_files[] = {{""}};
+	struct zuf_root_info *zri;
+	struct inode *root_i;
+	int err;
+
+	zri = kzalloc(sizeof(*zri), GFP_KERNEL);
+	if (!zri) {
+		zuf_err_cnd(silent,
+			    "Not enough memory to allocate zuf_root_info\n");
+		return -ENOMEM;
+	}
+
+	err = simple_fill_super(sb, ZUFR_SUPER_MAGIC, zufr_files);
+	if (unlikely(err)) {
+		kfree(zri);
+		return err;
+	}
+
+	sb->s_op = &zufr_super_operations;
+	sb->s_fs_info = zri;
+	zri->sb = sb;
+
+	root_i = sb->s_root->d_inode;
+	root_i->i_fop = &zufr_file_dir_operations;
+	root_i->i_op = &zufr_inode_operations;
+
+	spin_lock_init(&zri->mount.lock);
+	relay_init(&zri->mount.relay);
+	INIT_LIST_HEAD(&zri->fst_list);
+	INIT_LIST_HEAD(&zri->pmem_list);
+
+	err = zufs_zts_init(zri);
+	if (unlikely(err))
+		return err; /* put_super will be called, we have a root */
+
+	return 0;
+}
+
+static struct dentry *zufr_mount(struct file_system_type *fs_type,
+				  int flags, const char *dev_name,
+				  void *data)
+{
+	struct dentry *ret = mount_single(fs_type, flags, data, zufr_fill_super);
+
+	zuf_info("zuf_root mount => %p\n", ret);
+	return ret;
+}
+
+static struct file_system_type zufr_type = {
+	.owner =	THIS_MODULE,
+	.name =		"zuf",
+	.mount =	zufr_mount,
+	.kill_sb	= kill_litter_super,
+};
+
+/* Create a /sys/fs/zuf directory to mount on */
+static struct kset *zufr_kset;
+
+int __init zuf_root_init(void)
+{
+	int err;
+
+	zufr_kset = kset_create_and_add("zuf", NULL, fs_kobj);
+	if (!zufr_kset) {
+		err = -ENOMEM;
+		goto un_inodecache;
+	}
+
+	err = register_filesystem(&zufr_type);
+	if (unlikely(err))
+		goto un_kset;
+
+	return 0;
+
+un_kset:
+	kset_unregister(zufr_kset);
+un_inodecache:
+	return err;
+}
+
+void __exit zuf_root_exit(void)
+{
+	unregister_filesystem(&zufr_type);
+	kset_unregister(zufr_kset);
+}
+
+module_init(zuf_root_init)
+module_exit(zuf_root_exit)
diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
new file mode 100644
index 0000000..15516d0
--- /dev/null
+++ b/fs/zuf/zuf.h
@@ -0,0 +1,90 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Definitions for the ZUF filesystem.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#ifndef __ZUF_H
+#define __ZUF_H
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/types.h>
+#include <linux/xattr.h>
+#include <linux/exportfs.h>
+#include <linux/page_ref.h>
+
+#include "zus_api.h"
+
+#include "relay.h"
+#include "_pr.h"
+
+enum zlfs_e_special_file {
+	zlfs_e_zt = 1,
+	zlfs_e_mout_thread,
+	zlfs_e_pmem,
+};
+
+struct zuf_special_file {
+	enum zlfs_e_special_file type;
+};
+
+/* This is the zuf-root.c mini filesystem */
+struct zuf_root_info {
+	struct __mount_thread_info {
+		struct zuf_special_file zsf;
+		spinlock_t lock;
+		struct relay relay;
+		struct zufs_ioc_mount *zim;
+		struct file *file;
+	} mount;
+
+	ulong next_ino;
+
+	uint _max_zts;
+	struct zufs_thread *_all_zt;
+
+	struct super_block *sb;
+	struct list_head fst_list;
+
+	uint next_pmem_id;
+	struct list_head pmem_list;
+};
+
+static inline struct zuf_root_info *ZRI(struct super_block *sb)
+{
+	struct zuf_root_info *zri = sb->s_fs_info;
+
+	WARN_ON(zri->sb != sb);
+	return zri;
+}
+
+struct zuf_fs_type {
+	struct file_system_type vfs_fst;
+	struct zus_fs_info	*zus_zfi;
+	struct register_fs_info rfi;
+	struct zuf_root_info *zri;
+
+	struct list_head list;
+};
+
+static inline void zuf_add_fs_type(struct zuf_root_info *zri,
+				   struct zuf_fs_type *zft)
+{
+	/* Unlocked for now only one mount-thread with zus */
+	list_add(&zft->list, &zri->fst_list);
+}
+
+/* Keep this include last thing in file */
+#include "_extern.h"
+
+#endif /* __ZUF_H */
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index d6ccc85..19ce326 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -66,4 +66,112 @@
 
 #endif /*  ndef __KERNEL__ */
 
+/**
+ * zufs dual port memory
+ * This is a special type of offset to either memory or persistent-memory
+ * that is designed to be used in the interface mechanism between userspace
+ * and kernel, and can be accessed by both. Note that the user must use the
+ * appropriate accessors to translate it to a pointer.
+ */
+typedef __u64	zu_dpp_t;
+
+/* ~~~~~ ZUFS API ioctl commands ~~~~~ */
+enum {
+	ZUS_API_MAP_MAX_PAGES	= 1024,
+	ZUS_API_MAP_MAX_SIZE	= ZUS_API_MAP_MAX_PAGES * PAGE_SIZE,
+};
+
+struct zufs_ioc_hdr {
+	__u32 err;	/* IN/OUT must be first */
+	__u16 in_start;	/* Not used always 0 */
+	__u16 in_len;	/* How much to be copied *to* user mode */
+	__u16 out_start;/* start of output parameters */
+	__u16 out_len;	/* How much to be copied *from* user mode */
+	__u32 operation;/* One of e_zufs_operation */
+	__u32 offset;	/* Start of user buffer in ZT mmap */
+	__u32 len;	/* Len of user buffer in ZT mmap */
+};
+
+/* Register FS */
+/* A cookie from user-mode given in register_fs_info */
+struct zus_fs_info;
+struct zufs_ioc_register_fs {
+	struct zufs_ioc_hdr hdr;
+	struct zus_fs_info *zus_zfi;
+	struct register_fs_info {
+		/* IN */
+		char fsname[16];	/* Only 4 chars and a NUL please      */
+		__u32 FS_magic;         /* This is the FS's version && magic  */
+		__u32 FS_ver_major;	/* on disk, not the zuf-to-zus version*/
+		__u32 FS_ver_minor;	/* (See also struct zufs_dev_table)   */
+
+		__u8 acl_on;
+		__u8 notused[3];
+		__u64 dt_offset;
+
+		__u32 s_time_gran;
+		__u32 def_mode;
+		__u64 s_maxbytes;
+
+	} rfi;
+};
+#define ZU_IOC_REGISTER_FS	_IOWR('S', 10, struct zufs_ioc_register_fs)
+
+/* A cookie from user-mode returned by mount */
+struct zus_sb_info;
+
+/* zus cookie per inode */
+struct zus_inode_info;
+
+/* mount / umount */
+struct zufs_ioc_mount {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_fs_info *zus_zfi;
+	uint num_cpu;
+	uint pmem_kern_id;
+	__u8 is_umounting;
+
+	/* OUT */
+	struct zus_sb_info *zus_sbi;
+	/* mount is also iget of root */
+	struct zus_inode_info *zus_ii;
+	zu_dpp_t _zi;
+
+	/* More mount info */
+	__u32 s_blocksize_bits;
+};
+#define ZU_IOC_MOUNT	_IOWR('S', 12, struct zufs_ioc_mount)
+
+/* ZT init */
+struct zufs_ioc_init {
+	struct zufs_ioc_hdr hdr;
+	ulong affinity;	/* IN */
+};
+#define ZU_IOC_INIT_THREAD	_IOWR('S', 20, struct zufs_ioc_init)
+
+/* break_all (Server telling kernel to clean) */
+struct zufs_ioc_break_all {
+	struct zufs_ioc_hdr hdr;
+};
+#define ZU_IOC_BREAK_ALL	_IOWR('S', 22, struct zufs_ioc_break_all)
+
+enum { ZUFS_MAX_COMMAND_BUFF = (PAGE_SIZE - sizeof(struct zufs_ioc_hdr)) };
+struct zufs_ioc_wait_operation {
+	struct zufs_ioc_hdr hdr;
+	char opt_buff[ZUFS_MAX_COMMAND_BUFF];
+};
+#define ZU_IOC_WAIT_OPT		_IOWR('S', 21, struct zufs_ioc_wait_operation)
+
+/* ~~~ all the permutations of zufs_ioc_wait_operation ~~~ */
+/* These are the possible operations sent from Kernel to the Server in the
+ * return of the ZU_IOC_WAIT_OPT.
+ */
+enum e_zufs_operation {
+	ZUS_OP_NULL = 0,
+
+	ZUS_OP_BREAK,		/* Kernel telling Server to exit */
+	ZUS_OP_MAX_OPT,
+};
+
 #endif /* _LINUX_ZUFS_API_H */
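
A short note on the hdr in/out convention above, since it is easy to
get backwards: hdr.in_len bytes from the start of the ioc struct are
copied *to* the Server when ZU_IOC_WAIT_OPT returns, and hdr.out_len
bytes starting at hdr.out_start are copied back *from* the Server on
its next ZU_IOC_WAIT_OPT call (see _zu_wait() above). A hypothetical
Server-side handler would therefore reply in-place, along these lines:

	/* Illustrative only; not part of this patch */
	static void op_reply(struct zufs_ioc_hdr *hdr, const void *out,
			     unsigned int len)
	{
		if (len > hdr->out_len)
			len = hdr->out_len;
		memcpy((char *)hdr + hdr->out_start, out, len);
		hdr->err = 0;
	}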
-- 
2.5.5


* [RFC 5/7] zus: Devices && mounting
  2018-03-13 17:25 ` [RFC 5/7] zus: Devices && mounting Boaz Harrosh
@ 2018-03-13 17:38   ` Boaz Harrosh
  0 siblings, 0 replies; 39+ messages in thread
From: Boaz Harrosh @ 2018-03-13 17:38 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Ric Wheeler, Miklos Szeredi, Steve French, Steven Whitehouse,
	Jefff moyer, Sage Weil, Jan Kara, Amir Goldstein, Andy Rudof,
	Anna Schumaker, Amit Golander, Sagi Manole, Shachar Sharon


In this patch we already establish a mounted filesystem.

There are three modes of operation:
* mount without a device (mount -t FOO none /somepath)

* A single device - The FS set register_fs_info->dt_offset==-1.
  No checks are made by the Kernel; the single bdev is registered with
  the Kernel's mount_bdev. It is up to the zusFS to check validity.

* Multi device - The FS set register_fs_info->dt_offset==X.
  This mode is the main subject of this patch.
  A single device is given on the mount command line. At
  register_fs_info->dt_offset of this device we look for a
  zufs_dev_table structure. After all the checks we look there
  at the device list and open all devices. Any one of the devices may
  be given on the command line, but they will always be opened in
  DT (Device Table) order (see the sketch after the mount steps
  below). The Device Table has the notion of two types of bdevs:
  T1 devices - pmem devices capable of direct_access
  T2 devices - non-direct_access devices

  All T1 devices are presented as one linear array, in DT order.
  In t1.c we mmap this space for the server to directly access
  pmem (in the proper persistent way).

  [We do not support just any direct_access device; we only support
   pmem(s) where the whole device can be addressed by a single
   physical/virtual address. This is checked before mount]

   The T2 devices are also grabbed and owned by the super_block.
   A later API will enable the Server to write or transfer buffers
   from T1 to T2 in a very efficient manner. They too are presented
   as a single linear array in DT order.

   Both kinds of devices are NUMA aware and the NUMA info is presented
   to the zusFS for optimal allocation and access.

So these are the steps for mounting a zufs Filesystem:

* All devices (single or DT) are opened and established in an md
  object. This md-object is given a pmem-id.

* mount_bdev is called with the main (first) device; in turn
  fill_super is called.

* fill_super dispatches a mount_operation(register_fs_info) to the
  server with the pmem-id of the md-object above.

* The Server, in its zus mount routine, will first do
  a GRAB_PMEM(pmem-id) ioctl call to establish a special file handle
  through which it has full access to all of its pmem space.
  With that it calls the zusFS to inspect the content
  of the pmem and mount the FS.

* On return from mount the zusFS returns the root inode info.

* fill_super continues to create a root vfs-inode and returns
  successfully.

* We now have a mounted super_block, with corresponding super_block
  objects in the Server.

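A condensed sketch of the DT walk described above (compare _md_init()
and _init_dev_info() in md.c below). open_by_uuid() stands in for the
blkdev_get_by_path()-on-/dev/disk/by-uuid helper this patch adds, and
error handling is elided:

	u64 total = 0;
	int i;

	for (i = 0; i < md->t1_count + md->t2_count; ++i) {
		struct md_dev_id *id = &dev_list->dev_ids[i];
		struct md_dev_info *mdi = &md->devs[i];

		/* Always opened in DT order, by uuid, no matter which
		 * member device was named on the mount command line
		 */
		if (!mdi->bdev)
			mdi->bdev = open_by_uuid(&id->uuid, holder);
		mdi->offset = total;
		mdi->size = md_p2o(__dev_id_blocks(id));
		total += mdi->size;
	}
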
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile   |   5 +-
 fs/zuf/_extern.h  |  12 +
 fs/zuf/inode.c    |  45 ++++
 fs/zuf/md.c       | 695 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/md.h       | 188 +++++++++++++++
 fs/zuf/super.c    | 605 ++++++++++++++++++++++++++++++++++++++++++++++-
 fs/zuf/t1.c       | 114 +++++++++
 fs/zuf/t2.c       | 348 +++++++++++++++++++++++++++
 fs/zuf/t2.h       |  67 ++++++
 fs/zuf/zuf-core.c |  68 +++++-
 fs/zuf/zuf-root.c |   9 +-
 fs/zuf/zuf.h      | 220 +++++++++++++++++
 fs/zuf/zus_api.h  | 177 ++++++++++++++
 13 files changed, 2549 insertions(+), 4 deletions(-)
 create mode 100644 fs/zuf/inode.c
 create mode 100644 fs/zuf/md.c
 create mode 100644 fs/zuf/md.h
 create mode 100644 fs/zuf/t1.c
 create mode 100644 fs/zuf/t2.c
 create mode 100644 fs/zuf/t2.h

diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index d00940c..94ce80b 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -10,9 +10,12 @@
 
 obj-$(CONFIG_ZUF) += zuf.o
 
+# Infrastructure
+zuf-y += md.o t2.o t1.o
+
 # ZUF core
 zuf-y += zuf-core.o zuf-root.o
 
 # Main FS
-zuf-y += super.o
+zuf-y += super.o inode.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index e490043..0543fd8 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -19,7 +19,16 @@
  * extern functions declarations
  */
 
+/* inode.c */
+struct inode *zuf_iget(struct super_block *sb, struct zus_inode_info *zus_ii,
+		       zu_dpp_t _zi, bool *exist);
+void zuf_evict_inode(struct inode *inode);
+int zuf_write_inode(struct inode *inode, struct writeback_control *wbc);
+int zuf_isync(struct inode *inode, loff_t start, loff_t end, int datasync);
+
 /* super.c */
+int zuf_init_inodecache(void);
+void zuf_destroy_inodecache(void);
 struct dentry *zuf_mount(struct file_system_type *fs_type, int flags,
 			 const char *dev_name, void *data);
 
@@ -44,4 +53,7 @@ void zufs_mounter_release(struct file *filp);
 /* zuf-root.c */
 int zuf_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs);
 
+/* t1.c */
+int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma);
+
 #endif	/*ndef __ZUF_EXTERN_H__*/
diff --git a/fs/zuf/inode.c b/fs/zuf/inode.c
new file mode 100644
index 0000000..7aa8c9e
--- /dev/null
+++ b/fs/zuf/inode.c
@@ -0,0 +1,45 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Inode methods (allocate/free/read/write).
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#include "zuf.h"
+
+struct inode *zuf_iget(struct super_block *sb, struct zus_inode_info *zus_ii,
+		       zu_dpp_t _zi, bool *exist)
+{
+	return ERR_PTR(-ENOMEM);
+}
+
+void zuf_evict_inode(struct inode *inode)
+{
+}
+
+int zuf_write_inode(struct inode *inode, struct writeback_control *wbc)
+{
+	/* write_inode should never be called because we always keep our inodes
+	 * clean. So let us know if write_inode ever gets called.
+	 */
+
+	/* d_tmpfile() does a mark_inode_dirty so only complain on regular files
+	 * TODO: How? Everything off for now
+	 * WARN_ON(inode->i_nlink);
+	 */
+
+	return 0;
+}
+
+/* This function is called by msync(), fsync() && sync_fs(). */
+int zuf_isync(struct inode *inode, loff_t start, loff_t end, int datasync)
+{
+	return 0;
+}
diff --git a/fs/zuf/md.c b/fs/zuf/md.c
new file mode 100644
index 0000000..9436f03
--- /dev/null
+++ b/fs/zuf/md.c
@@ -0,0 +1,695 @@
+/*
+ * Multi-Device operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#include <linux/blkdev.h>
+#include <linux/pfn_t.h>
+#include <linux/crc16.h>
+#include <linux/uuid.h>
+#include <linux/gcd.h>
+#include <linux/dax.h>
+
+#include "_pr.h"
+#include "md.h"
+#include "t2.h"
+#include "zus_api.h"
+
+/* length of uuid dev path /dev/disk/by-uuid/<uuid> */
+#define PATH_UUID	64
+
+const fmode_t _g_mode = FMODE_READ | FMODE_WRITE | FMODE_EXCL;
+
+/* allocate and return the /dev/disk/by-uuid path string for a uuid */
+static char *_uuid_path(uuid_le *uuid)
+{
+	char path[PATH_UUID];
+
+	sprintf(path, "/dev/disk/by-uuid/%pUb", uuid);
+	return kstrdup(path, GFP_KERNEL);
+}
+
+static int _bdev_get_by_path(const char *path, struct block_device **bdev,
+			     void *holder)
+{
+	/* The owner of the device is the pointer that will hold it. This
+	 * protects from same device mounting on two super-blocks as well
+	 * as same device being repeated twice.
+	 */
+	*bdev = blkdev_get_by_path(path, _g_mode, holder);
+	if (IS_ERR(*bdev)) {
+		int err = PTR_ERR(*bdev);
+		*bdev = NULL;
+		return err;
+	}
+	return 0;
+}
+
+static void _bdev_put(struct block_device **bdev, struct block_device *s_bdev)
+{
+	if (*bdev) {
+		if (!s_bdev || *bdev != s_bdev)
+			blkdev_put(*bdev, _g_mode);
+		*bdev = NULL;
+	}
+}
+
+static int ___bdev_get_by_uuid(struct block_device **bdev, uuid_le *uuid,
+			       void *holder, bool silent, const char *msg,
+			       const char *f, int l)
+{
+	char *path = NULL;
+	int err;
+
+	path = _uuid_path(uuid);
+	err = _bdev_get_by_path(path, bdev, holder);
+	if (unlikely(err))
+		zuf_err_cnd(silent, "[%s:%d] %s path=%s =>%d\n",
+			     f, l, msg, path, err);
+
+	kfree(path);
+	return err;
+}
+
+#define _bdev_get_by_uuid(bdev, uuid, holder, msg) \
+	___bdev_get_by_uuid(bdev, uuid, holder, silent, msg, __func__, __LINE__)
+
+static bool _main_bdev(struct block_device *bdev)
+{
+	if (bdev->bd_super && bdev->bd_super->s_bdev == bdev)
+		return true;
+	return false;
+}
+
+short md_calc_csum(struct zufs_dev_table *msb)
+{
+	uint n = ZUFS_SB_STATIC_SIZE(msb) - sizeof(msb->s_sum);
+
+	return crc16(~0, (__u8 *)&msb->s_version, n);
+}
+
+/* ~~~~~~~ mdt related functions ~~~~~~~ */
+
+struct zufs_dev_table *md_t2_mdt_read(struct block_device *bdev)
+{
+	int err;
+	struct page *page;
+	/* t2 interface works for all block devices */
+	struct multi_devices *md;
+	struct md_dev_info *mdi;
+
+	md = kzalloc(sizeof(*md), GFP_KERNEL);
+	if (unlikely(!md))
+		return ERR_PTR(-ENOMEM);
+
+	md->t2_count = 1;
+	md->devs[0].bdev = bdev;
+	mdi = &md->devs[0];
+	md->t2a.map = &mdi;
+	md->t2a.bn_gcd = 1; /* Does not matter, only must not be zero */
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page) {
+		zuf_dbg_err("!!! failed to alloc page\n");
+		err = -ENOMEM;
+		goto out;
+	}
+
+	err = t2_readpage(md, 0, page);
+	if (err) {
+		zuf_dbg_err("!!! t2_readpage err=%d\n", err);
+		__free_page(page);
+	}
+out:
+	kfree(md);
+	return err ? ERR_PTR(err) : page_address(page);
+}
+
+static bool _csum_mismatch(struct zufs_dev_table *msb, int silent)
+{
+	ushort crc = md_calc_csum(msb);
+
+	if (msb->s_sum == cpu_to_le16(crc))
+		return false;
+
+	zuf_warn_cnd(silent, "expected(0x%x) != s_sum(0x%x)\n",
+		      cpu_to_le16(crc), msb->s_sum);
+	return true;
+}
+
+static bool _uuid_le_equal(uuid_le *uuid1, uuid_le *uuid2)
+{
+	return (memcmp(uuid1, uuid2, sizeof(uuid_le)) == 0);
+}
+
+bool md_mdt_check(struct zufs_dev_table *msb,
+		  struct zufs_dev_table *main_msb, struct block_device *bdev,
+		  struct mdt_check *mc)
+{
+	struct zufs_dev_table *msb2 = (void *)msb + ZUFS_SB_SIZE;
+	struct md_dev_id *dev_id;
+	ulong bdev_size, super_size;
+
+	BUILD_BUG_ON(ZUFS_SB_STATIC_SIZE(msb) & (SMP_CACHE_BYTES - 1));
+
+	/* Do sanity checks on the superblock */
+	if (le32_to_cpu(msb->s_magic) != mc->magic) {
+		if (le32_to_cpu(msb2->s_magic) != mc->magic) {
+			zuf_warn_cnd(mc->silent,
+				     "Can't find a valid partition\n");
+			return false;
+		}
+
+		zuf_warn_cnd(mc->silent,
+			     "Magic error in super block: using copy\n");
+		/* Try to auto-recover the super block */
+		memcpy_flushcache(msb, msb2, sizeof(*msb));
+	}
+
+	if ((mc->major_ver != msb_major_version(msb)) ||
+	    (mc->minor_ver < msb_minor_version(msb))) {
+		zuf_warn_cnd(mc->silent,
+			     "mkfs-mount versions mismatch! %d.%d != %d.%d\n",
+			     msb_major_version(msb), msb_minor_version(msb),
+			     mc->major_ver, mc->minor_ver);
+		return false;
+	}
+
+	if (_csum_mismatch(msb, mc->silent)) {
+		if (_csum_mismatch(msb2, mc->silent)) {
+			zuf_warn_cnd(mc->silent,
+				     "checksum error in super block\n");
+			return false;
+		}
+
+		zuf_warn_cnd(mc->silent,
+			     "crc16 error in super block: using copy\n");
+		/* Try to auto-recover the super block */
+		memcpy_flushcache(msb, msb2, sizeof(*msb));
+	}
+
+	if (main_msb && !_uuid_le_equal(&main_msb->s_uuid, &msb->s_uuid)) {
+		zuf_warn_cnd(mc->silent,
+			     "uuids do not match main_msb=%pUb msb=%pUb\n",
+			     &main_msb->s_uuid, &msb->s_uuid);
+		return false;
+	}
+
+	/* check t1 device size */
+	bdev_size = i_size_read(bdev->bd_inode);
+	dev_id = &msb->s_dev_list.dev_ids[msb->s_dev_list.id_index];
+	super_size = md_p2o(__dev_id_blocks(dev_id));
+	if (unlikely(!super_size || super_size & ZUFS_ALLOC_MASK)) {
+		zuf_warn_cnd(mc->silent, "super_size(0x%lx) ! 2_M aligned\n",
+			      super_size);
+		return false;
+	}
+
+	if (unlikely(super_size > bdev_size)) {
+		zuf_warn_cnd(mc->silent,
+			     "bdev_size(0x%lx) too small expected 0x%lx\n",
+			     bdev_size, super_size);
+		return false;
+	} else if (unlikely(super_size < bdev_size)) {
+		zuf_dbg_err("Note msb->size=(0x%lx) < bdev_size(0x%lx)\n",
+			      super_size, bdev_size);
+	}
+
+	return true;
+}
+
+
+int md_set_sb(struct multi_devices *md, struct block_device *s_bdev,
+	      void *sb, int silent)
+{
+	struct md_dev_info *mdi = md_dev_info(md, md->dev_index);
+	int i;
+
+	mdi->bdev = s_bdev;
+
+	for (i = 0; i < md->t1_count; ++i) {
+		struct md_dev_info *mdi = md_t1_dev(md, i);
+
+		if (mdi->bdev->bd_super && (mdi->bdev->bd_super != sb)) {
+			zuf_warn_cnd(silent,
+				"!!! %s already mounted on a different FS => -EBUSY\n",
+				_bdev_name(mdi->bdev));
+			return -EBUSY;
+		}
+
+		mdi->bdev->bd_super = sb;
+	}
+
+	return 0;
+}
+
+void md_fini(struct multi_devices *md, struct block_device *s_bdev)
+{
+	int i;
+
+	kfree(md->t2a.map);
+	kfree(md->t1a.map);
+
+	for (i = 0; i < md->t1_count + md->t2_count; ++i) {
+		struct md_dev_info *mdi = md_dev_info(md, i);
+
+		if (mdi->bdev && !_main_bdev(mdi->bdev))
+			mdi->bdev->bd_super = NULL;
+		_bdev_put(&mdi->bdev, s_bdev);
+	}
+
+	kfree(md);
+}
+
+
+/* ~~~~~~~ Pre-mount operations ~~~~~~~ */
+
+static int _get_device(struct block_device **bdev, const char *dev_name,
+		       uuid_le *uuid, void *holder, int silent,
+		       bool *bind_mount)
+{
+	int err;
+
+	if (dev_name)
+		err = _bdev_get_by_path(dev_name, bdev, holder);
+	else
+		err = _bdev_get_by_uuid(bdev, uuid, holder,
+					"failed to get device");
+
+	if (unlikely(err)) {
+		zuf_err_cnd(silent,
+			"failed to get device dev_name=%s uuid=%pUb err=%d\n",
+			dev_name, uuid, err);
+		return err;
+	}
+
+	if (bind_mount && _main_bdev(*bdev))
+		*bind_mount = true;
+
+	return 0;
+}
+
+static int _init_dev_info(struct md_dev_info *mdi, struct md_dev_id *id,
+			  int index, u64 offset,
+			  struct zufs_dev_table *main_msb,
+			  struct mdt_check *mc, bool t1_dev,
+			  int silent)
+{
+	struct zufs_dev_table *msb = NULL;
+	int err = 0;
+
+	if (mdi->bdev == NULL) {
+		err = _get_device(&mdi->bdev, NULL, &id->uuid, mc->holder,
+				  silent, NULL);
+		if (unlikely(err))
+			return err;
+	}
+
+	mdi->offset = offset;
+	mdi->size = md_p2o(__dev_id_blocks(id));
+	mdi->index = index;
+
+	if (t1_dev) {
+		struct page *dev_page;
+		int end_of_dev_nid;
+
+		err = md_t1_info_init(mdi, silent);
+		if (unlikely(err))
+			return err;
+
+		if ((ulong)mdi->t1i.virt_addr & ZUFS_ALLOC_MASK) {
+			zuf_warn_cnd(silent, "!!! unaligned device %s\n",
+				      _bdev_name(mdi->bdev));
+			return -EINVAL;
+		}
+
+		msb = mdi->t1i.virt_addr;
+
+		dev_page = pfn_to_page(mdi->t1i.phys_pfn);
+		mdi->nid = page_to_nid(dev_page);
+		end_of_dev_nid = page_to_nid(dev_page + md_o2p(mdi->size - 1));
+
+		if (mdi->nid != end_of_dev_nid)
+			zuf_warn("pmem crosses NUMA boundaries");
+	} else {
+		msb = md_t2_mdt_read(mdi->bdev);
+		if (IS_ERR(msb)) {
+			zuf_err_cnd(silent,
+				    "failed to read msb from t2 => %ld\n",
+				    PTR_ERR(msb));
+			return PTR_ERR(msb);
+		}
+		mdi->nid = __dev_id_nid(id);
+	}
+
+	if (!md_mdt_check(msb, main_msb, mdi->bdev, mc)) {
+		zuf_err_cnd(silent, "device %s failed integrity check\n",
+			     _bdev_name(mdi->bdev));
+		err = -EINVAL;
+	}
+
+	/* The t2 msb page was only needed for the checks; free on all paths */
+	if (!(t1_dev || IS_ERR_OR_NULL(msb)))
+		free_page((ulong)msb);
+	return err;
+}
+
+static int _map_setup(struct multi_devices *md, ulong blocks, int dev_start,
+		      struct md_dev_larray *larray)
+{
+	ulong map_size, bn_end;
+	int i, dev_index = dev_start;
+
+	map_size = blocks / larray->bn_gcd;
+	larray->map = kcalloc(map_size, sizeof(*larray->map), GFP_KERNEL);
+	if (!larray->map) {
+		zuf_dbg_err("failed to allocate dev map\n");
+		return -ENOMEM;
+	}
+
+	bn_end = md_o2p(md->devs[dev_index].size);
+	for (i = 0; i < map_size; ++i) {
+		if ((i * larray->bn_gcd) >= bn_end)
+			bn_end += md_o2p(md->devs[++dev_index].size);
+		larray->map[i] = &md->devs[dev_index];
+	}
+
+	return 0;
+}
+
+static int _md_init(struct multi_devices *md, struct mdt_check *mc,
+		    struct md_dev_list *dev_list, int silent)
+{
+	struct zufs_dev_table *main_msb = NULL;
+	u64 total_size = 0;
+	int i, err;
+
+	for (i = 0; i < md->t1_count; ++i) {
+		struct md_dev_info *mdi = md_t1_dev(md, i);
+		struct zufs_dev_table *dev_msb;
+
+		err = _init_dev_info(mdi, &dev_list->dev_ids[i], i, total_size,
+				     main_msb, mc, true, silent);
+		if (unlikely(err))
+			return err;
+
+		/* apparently gcd(0,X)=X which is nice */
+		md->t1a.bn_gcd = gcd(md->t1a.bn_gcd, md_o2p(mdi->size));
+		total_size += mdi->size;
+
+		dev_msb = md_t1_addr(md, i);
+		if (!main_msb)
+			main_msb = dev_msb;
+
+		if (test_msb_opt(dev_msb, ZUFS_SHADOW))
+			memcpy(mdi->t1i.virt_addr,
+			       mdi->t1i.virt_addr + mdi->size, mdi->size);
+
+		zuf_dbg_verbose("dev=%d %pUb %s v=%p pfn=%lu off=%lu size=%lu\n",
+				 i, &dev_list->dev_ids[i].uuid,
+				 _bdev_name(mdi->bdev), dev_msb,
+				 mdi->t1i.phys_pfn, mdi->offset, mdi->size);
+	}
+
+	if (unlikely(le64_to_cpu(main_msb->s_t1_blocks) !=
+						md_o2p(total_size))) {
+		zuf_err_cnd(silent,
+			"FS corrupted msb->t1_blocks(0x%llx) != total_size(0x%llx)\n",
+			main_msb->s_t1_blocks, total_size);
+		return -EIO;
+	}
+
+	err = _map_setup(md, le64_to_cpu(main_msb->s_t1_blocks), 0, &md->t1a);
+	if (unlikely(err))
+		return err;
+
+	zuf_dbg_verbose("t1 devices=%d total_size=%llu segment_map=%lu\n",
+			 md->t1_count, total_size,
+			 md_o2p(total_size) / md->t1a.bn_gcd);
+
+	if (md->t2_count == 0)
+		return 0;
+
+	/* Done with t1. Counting t2s */
+	total_size = 0;
+	for (i = 0; i < md->t2_count; ++i) {
+		struct md_dev_info *mdi = md_t2_dev(md, i);
+
+		err = _init_dev_info(mdi, &dev_list->dev_ids[md->t1_count + i],
+				     md->t1_count + i, total_size, main_msb,
+				     mc, false, silent);
+		if (unlikely(err))
+			return err;
+
+		/* apparently gcd(0,X)=X which is nice */
+		md->t2a.bn_gcd = gcd(md->t2a.bn_gcd, md_o2p(mdi->size));
+		total_size += mdi->size;
+
+		zuf_dbg_verbose("dev=%d %s off=%lu size=%lu\n", i,
+				 _bdev_name(mdi->bdev), mdi->offset, mdi->size);
+	}
+
+	if (unlikely(le64_to_cpu(main_msb->s_t2_blocks) != md_o2p(total_size))) {
+		zuf_err_cnd(silent,
+			"FS corrupted msb_t2_blocks(0x%llx) != total_size(0x%llx)\n",
+			main_msb->s_t2_blocks, total_size);
+		return -EIO;
+	}
+
+	err = _map_setup(md, le64_to_cpu(main_msb->s_t2_blocks), md->t1_count,
+			 &md->t2a);
+	if (unlikely(err))
+		return err;
+
+	zuf_dbg_verbose("t2 devices=%d total_size=%llu segment_map=%lu\n",
+			 md->t2_count, total_size,
+			 md_o2p(total_size) / md->t2a.bn_gcd);
+
+	return 0;
+}
+
+static int _load_dev_list(struct md_dev_list *dev_list, struct mdt_check *mc,
+			  struct block_device *bdev, const char *dev_name,
+			  int silent)
+{
+	struct zufs_dev_table *msb;
+	int err = 0;
+
+	msb = md_t2_mdt_read(bdev);
+	if (IS_ERR(msb)) {
+		zuf_err_cnd(silent,
+			    "failed to read super block from %s; err=%ld\n",
+			    dev_name, PTR_ERR(msb));
+		err = PTR_ERR(msb);
+		goto out;
+	}
+
+	if (!md_mdt_check(msb, NULL, bdev, mc)) {
+		zuf_err_cnd(silent, "bad msb in %s\n", dev_name);
+		err = -EINVAL;
+		goto out;
+	}
+
+	*dev_list = msb->s_dev_list;
+
+out:
+	if (!IS_ERR_OR_NULL(msb))
+		free_page((ulong)msb);
+
+	return err;
+}
+
+int md_init(struct multi_devices *md, const char *dev_name,
+	    struct mdt_check *mc, const char **dev_path)
+{
+	struct md_dev_list *dev_list;
+	struct block_device *bdev;
+	short id_index;
+	bool bind_mount = false;
+	int err;
+
+	dev_list = kmalloc(sizeof(*dev_list), GFP_KERNEL);
+	if (unlikely(!dev_list))
+		return -ENOMEM;
+
+	err = _get_device(&bdev, dev_name, NULL, mc->holder, mc->silent,
+			  &bind_mount);
+	if (unlikely(err))
+		goto out2;
+
+	err = _load_dev_list(dev_list, mc, bdev, dev_name, mc->silent);
+	if (unlikely(err)) {
+		_bdev_put(&bdev, NULL);
+		goto out2;
+	}
+
+	id_index = le16_to_cpu(dev_list->id_index);
+	if (bind_mount) {
+		_bdev_put(&bdev, NULL);
+		md->dev_index = id_index;
+		goto out;
+	}
+
+	md->t1_count = le16_to_cpu(dev_list->t1_count);
+	md->t2_count = le16_to_cpu(dev_list->t2_count);
+	md->devs[id_index].bdev = bdev;
+
+	if (id_index != 0) {
+		err = _get_device(&md_t1_dev(md, 0)->bdev, NULL,
+				  &dev_list->dev_ids[0].uuid, mc->holder,
+				  mc->silent, &bind_mount);
+		if (unlikely(err))
+			goto out2;
+
+		if (bind_mount)
+			goto out;
+	}
+
+	if (md->t2_count) {
+		int t2_index = md->t1_count;
+
+		/* t2 is the primary device if it was given on the mount
+		 * command line, or if the first mount specified it as the
+		 * primary device
+		 */
+		if (id_index != md->t1_count) {
+			err = _get_device(&md_t2_dev(md, 0)->bdev, NULL,
+					  &dev_list->dev_ids[t2_index].uuid,
+					  mc->holder, mc->silent, &bind_mount);
+			if (unlikely(err))
+				goto out2;
+		}
+		md->dev_index = t2_index;
+	}
+
+out:
+	if (md->dev_index != id_index)
+		*dev_path = _uuid_path(&dev_list->dev_ids[md->dev_index].uuid);
+	else
+		*dev_path = kstrdup(dev_name, GFP_KERNEL);
+
+	if (!bind_mount) {
+		err = _md_init(md, mc, dev_list, mc->silent);
+		if (unlikely(err))
+			goto out2;
+		_bdev_put(&md_dev_info(md, md->dev_index)->bdev, NULL);
+	} else {
+		md_fini(md, NULL);
+	}
+
+out2:
+	kfree(dev_list);
+
+	return err;
+}
+
+struct multi_devices *md_alloc(size_t size)
+{
+	uint s = max(sizeof(struct multi_devices), size);
+	struct multi_devices *md = kzalloc(s, GFP_KERNEL);
+
+	if (unlikely(!md))
+		return ERR_PTR(-ENOMEM);
+	return md;
+}
+
+int md_numa_info(struct multi_devices *md, struct zufs_ioc_pmem *zi_pmem)
+{
+	zi_pmem->pmem_total_blocks = md_t1_blocks(md);
+#if 0
+	if (max_cpu_id < sys_num_active_cpus) {
+		max_cpu_id = sys_num_active_cpus;
+		return -ETOSMALL;
+	}
+
+	max_cpu_id = sys_num_active_cpus;
+	__u32 max_nodes;
+	__u32 active_pmem_nodes;
+	struct zufs_pmem_info {
+		int sections;
+		struct zufs_pmem_sec {
+			__u32 length;
+			__u16 numa_id;
+			__u16 numa_index;
+		} secs[ZUFS_DEV_MAX];
+	} pmem;
+
+	struct zufs_numa_info {
+		__u32 max_cpu_id; // The below array size
+		struct zufs_cpu_info {
+			__u32 numa_id;
+			__u32 numa_index;
+		} numa_id_map[];
+	} *numa_info;
+	k_nf = kcalloc(max_cpu_id, sizeof(struct zufs_cpu_info), GFP_KERNEL);
+	....
+	copy_to_user(->numa_info, kn_f,
+		     max_cpu_id * sizeof(struct zufs_cpu_info));
+#endif
+	return 0;
+}
+
+static int _check_da_ret(struct md_dev_info *mdi, long avail, bool silent)
+{
+	if (unlikely(avail < (long)mdi->size)) {
+		if (0 < avail) {
+			zuf_warn_cnd(silent,
+				"Unsupported DAX device %s (range mismatch) => 0x%lx < 0x%lx\n",
+				_bdev_name(mdi->bdev), avail, mdi->size);
+			return -ERANGE;
+		}
+		zuf_warn_cnd(silent, "!!! %s direct_access return =>%ld\n",
+			     _bdev_name(mdi->bdev), avail);
+		return avail;
+	}
+	return 0;
+}
+
+int md_t1_info_init(struct md_dev_info *mdi, bool silent)
+{
+	pfn_t a_pfn_t;
+	void *addr;
+	long nrpages, avail;
+	int id;
+
+	mdi->t1i.dax_dev = fs_dax_get_by_host(_bdev_name(mdi->bdev));
+	if (unlikely(!mdi->t1i.dax_dev))
+		return -EOPNOTSUPP;
+
+	id = dax_read_lock();
+
+	nrpages = dax_direct_access(mdi->t1i.dax_dev, 0, md_o2p(mdi->size),
+				    &addr, &a_pfn_t);
+	dax_read_unlock(id);
+	if (unlikely(nrpages <= 0)) {
+		if (!nrpages)
+			nrpages = -ERANGE;
+		avail = nrpages;
+	} else {
+		avail = md_p2o(nrpages);
+	}
+
+	mdi->t1i.virt_addr = addr;
+	mdi->t1i.phys_pfn = pfn_t_to_pfn(a_pfn_t);
+
+	zuf_dbg_verbose("0x%lx 0x%llx\n",
+			 (ulong)addr, a_pfn_t.val);
+
+	return _check_da_ret(mdi, avail, silent);
+}
+
+void md_t1_info_fini(struct md_dev_info *mdi)
+{
+	fs_put_dax(mdi->t1i.dax_dev);
+	mdi->t1i.dax_dev = NULL;
+	mdi->t1i.virt_addr = NULL;
+}
diff --git a/fs/zuf/md.h b/fs/zuf/md.h
new file mode 100644
index 0000000..1ad3db3c
--- /dev/null
+++ b/fs/zuf/md.h
@@ -0,0 +1,188 @@
+/*
+ * Multi-Device operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#ifndef __MD_H__
+#define __MD_H__
+
+#include "zus_api.h"
+
+struct md_t1_info {
+	ulong phys_pfn;
+	void *virt_addr;
+	struct dax_device *dax_dev;
+	struct dev_pagemap *pgmap;
+};
+
+struct md_t2_info {
+	bool err_read_reported;
+	bool err_write_reported;
+};
+
+struct md_dev_info {
+	struct block_device *bdev;
+	ulong size;
+	ulong offset;
+	union {
+		struct md_t1_info	t1i;
+		struct md_t2_info	t2i;
+	};
+	int index;
+	int nid;
+};
+
+struct md_dev_larray {
+	ulong bn_gcd;
+	struct md_dev_info **map;
+};
+
+struct multi_devices {
+	int dev_index;
+	int t1_count;
+	int t2_count;
+	struct md_dev_info devs[MD_DEV_MAX];
+	struct md_dev_larray t1a;
+	struct md_dev_larray t2a;
+};
+
+static inline u64 md_p2o(ulong bn)
+{
+	return (u64)bn << PAGE_SHIFT;
+}
+
+static inline ulong md_o2p(u64 offset)
+{
+	return offset >> PAGE_SHIFT;
+}
+
+static inline ulong md_o2p_up(u64 offset)
+{
+	return md_o2p(offset + PAGE_SIZE - 1);
+}
+
+static inline struct md_dev_info *md_t1_dev(struct multi_devices *md, int i)
+{
+	return &md->devs[i];
+}
+
+static inline struct md_dev_info *md_t2_dev(struct multi_devices *md, int i)
+{
+	return &md->devs[md->t1_count + i];
+}
+
+static inline struct md_dev_info *md_dev_info(struct multi_devices *md, int i)
+{
+	return &md->devs[i];
+}
+
+static inline void *md_t1_addr(struct multi_devices *md, int i)
+{
+	struct md_dev_info *mdi = md_t1_dev(md, i);
+
+	return mdi->t1i.virt_addr;
+}
+
+static inline struct md_dev_info *md_bn_t1_dev(struct multi_devices *md,
+						 ulong bn)
+{
+	return md->t1a.map[bn / md->t1a.bn_gcd];
+}
+
+static inline ulong md_pfn(struct multi_devices *md, ulong block)
+{
+	struct md_dev_info *mdi = md_bn_t1_dev(md, block);
+
+	return mdi->t1i.phys_pfn + (block - md_o2p(mdi->offset));
+}
+
+static inline void *md_addr(struct multi_devices *md, ulong offset)
+{
+	struct md_dev_info *mdi = md_bn_t1_dev(md, md_o2p(offset));
+
+	return offset ? mdi->t1i.virt_addr + (offset - mdi->offset) : NULL;
+}
+
+static inline void *md_baddr(struct multi_devices *md, ulong bn)
+{
+	return md_addr(md, md_p2o(bn));
+}
+
+static inline struct zufs_dev_table *md_zdt(struct multi_devices *md)
+{
+	return md_t1_addr(md, 0);
+}
+
+static inline ulong md_t1_blocks(struct multi_devices *md)
+{
+	return le64_to_cpu(md_zdt(md)->s_t1_blocks);
+}
+
+static inline ulong md_t2_blocks(struct multi_devices *md)
+{
+	return le64_to_cpu(md_zdt(md)->s_t2_blocks);
+}
+
+static inline struct md_dev_info *md_bn_t2_dev(struct multi_devices *md,
+					       ulong bn)
+{
+	return md->t2a.map[bn / md->t2a.bn_gcd];
+}
+
+static inline ulong md_t2_local_bn(struct multi_devices *md, ulong bn)
+{
+	struct md_dev_info *mdi = md_bn_t2_dev(md, bn);
+
+	return bn - md_o2p(mdi->offset);
+}
+
+static inline void *md_addr_verify(struct multi_devices *md, ulong offset)
+{
+	if (unlikely(offset > md_p2o(md_t1_blocks(md)))) {
+		zuf_dbg_err("offset=0x%lx > max=0x%llx\n",
+			    offset, md_p2o(md_t1_blocks(md)));
+		return NULL;
+	}
+
+	return md_addr(md, offset);
+}
+
+static inline const char *_bdev_name(struct block_device *bdev)
+{
+	return dev_name(&bdev->bd_part->__dev);
+}
+
+struct mdt_check {
+	uint major_ver;
+	uint minor_ver;
+	u32  magic;
+
+	void *holder;
+	bool silent;
+};
+
+/* md.c */
+struct zufs_dev_table *md_t2_mdt_read(struct block_device *bdev);
+bool md_mdt_check(struct zufs_dev_table *msb, struct zufs_dev_table *main_msb,
+		  struct block_device *bdev, struct mdt_check *mc);
+struct multi_devices *md_alloc(size_t size);
+int md_init(struct multi_devices *md, const char *dev_name,
+	    struct mdt_check *mc, const char **dev_path);
+void md_fini(struct multi_devices *md, struct block_device *s_bdev);
+int md_set_sb(struct multi_devices *md, struct block_device *s_bdev, void *sb,
+	      int silent);
+
+struct zufs_ioc_pmem;
+int md_numa_info(struct multi_devices *md, struct zufs_ioc_pmem *zi_pmem);
+
+int md_t1_info_init(struct md_dev_info *mdi, bool silent);
+void md_t1_info_fini(struct md_dev_info *mdi);
+
+#endif
diff --git a/fs/zuf/super.c b/fs/zuf/super.c
index 6e176a5..03d1772 100644
--- a/fs/zuf/super.c
+++ b/fs/zuf/super.c
@@ -12,10 +12,613 @@
  *	Sagi Manole <sagim@netapp.com>"
  */
 
+#include <linux/types.h>
+#include <linux/parser.h>
+#include <linux/statfs.h>
+#include <linux/backing-dev.h>
+
 #include "zuf.h"
 
+static struct super_operations zuf_sops;
+static struct kmem_cache *zuf_inode_cachep;
+
+enum {
+	Opt_uid,
+	Opt_gid,
+	Opt_pedantic,
+	Opt_ephemeral,
+	Opt_dax,
+	Opt_err
+};
+
+static const match_table_t tokens = {
+	{ Opt_uid,		"uid=%u"		},
+	{ Opt_gid,		"gid=%u"		},
+	{ Opt_pedantic,		"pedantic"		},
+	{ Opt_pedantic,		"pedantic=%d"		},
+	{ Opt_ephemeral,	"ephemeral"		},
+	{ Opt_dax,		"dax"			},
+	{ Opt_err,		NULL			},
+};
+
+/* Output parameters from _parse_options */
+struct __parse_options {
+	bool clear_t2sync;
+	bool pedantic_17;
+};
+
+static int _parse_options(struct zuf_sb_info *sbi, const char *data,
+			  bool remount, struct __parse_options *po)
+{
+	char *orig_options, *options;
+	char *p;
+	substring_t args[MAX_OPT_ARGS];
+	int option;
+	int err = 0;
+	bool ephemeral = false;
+
+	/* no options given */
+	if (!data)
+		return 0;
+
+	options = orig_options = kstrdup(data, GFP_KERNEL);
+	if (!options)
+		return -ENOMEM;
+
+	while ((p = strsep(&options, ",")) != NULL) {
+		int token;
+
+		if (!*p)
+			continue;
+
+		/* Initialize args struct so we know whether arg was found */
+		args[0].to = args[0].from = NULL;
+		token = match_token(p, tokens, args);
+		switch (token) {
+		case Opt_uid:
+			if (remount)
+				goto bad_opt;
+			if (match_int(&args[0], &option))
+				goto bad_val;
+			sbi->uid = KUIDT_INIT(option);
+			break;
+		case Opt_gid:
+			if (match_int(&args[0], &option))
+				goto bad_val;
+			sbi->gid = KGIDT_INIT(option);
+			break;
+		case Opt_pedantic:
+			set_opt(sbi, PEDANTIC);
+			break;
+		case Opt_ephemeral:
+			set_opt(sbi, EPHEMERAL);
+			ephemeral = true;
+			break;
+		case Opt_dax:
+			set_opt(sbi, DAX);
+			break;
+		default:
+			goto bad_opt;
+		}
+	}
+
+	if (remount && test_opt(sbi, EPHEMERAL) && (ephemeral == false))
+		clear_opt(sbi, EPHEMERAL);
+out:
+	kfree(orig_options);
+	return err;
+
+bad_val:
+	zuf_warn_cnd(test_opt(sbi, SILENT),
+		     "Bad value '%s' for mount option '%s'\n",
+		     args[0].from, p);
+	err = -EINVAL;
+	goto out;
+bad_opt:
+	zuf_warn_cnd(test_opt(sbi, SILENT), "Bad mount option: \"%s\"\n", p);
+	err = -EINVAL;
+	goto out;
+}
+
+static void _print_mount_info(struct zuf_sb_info *sbi, char *mount_options)
+{
+	char buff[1000];
+	int space = sizeof(buff);
+	char *b = buff;
+	uint i;
+	int printed;
+
+	for (i = 0; i < sbi->md->t1_count; ++i) {
+		printed = snprintf(b, space, "%s%s", i ? "," : "",
+			       _bdev_name(md_t1_dev(sbi->md, i)->bdev));
+
+		if (unlikely(printed >= space))
+			goto no_space;
+
+		b += printed;
+		space -= printed;
+	}
+
+	if (sbi->md->t2_count) {
+		printed = snprintf(b, space, " t2=%s",
+				   _bdev_name(md_t2_dev(sbi->md, 0)->bdev));
+		if (unlikely(printed >= space))
+			goto no_space;
+
+		b += printed;
+		space -= printed;
+	}
+
+	if (mount_options) {
+		printed = snprintf(b, space, " -o %s", mount_options);
+		if (unlikely(printed >= space))
+			goto no_space;
+	}
+
+print:
+	zuf_info("mounted t1=%s (0x%lx/0x%lx)\n", buff,
+		  md_t1_blocks(sbi->md), md_t2_blocks(sbi->md));
+	return;
+
+no_space:
+	snprintf(buff + sizeof(buff) - 4, 4, "...");
+	goto print;
+}
+
+static void _sb_mwtime_now(struct super_block *sb, struct zufs_dev_table *zdt)
+{
+	struct timespec now = current_kernel_time();
+
+	timespec_to_mt(&zdt->s_mtime, &now);
+	zdt->s_wtime = zdt->s_mtime;
+	/* TODO: _persist_md(sb, &zdt->s_mtime, 2*sizeof(zdt->s_mtime)); */
+}
+
+static int _setup_bdi(struct super_block *sb, const char *device_name)
+{
+	int err;
+
+	if (sb->s_bdi && (sb->s_bdi != &noop_backing_dev_info)) {
+		/*
+		 * sb->s_bdi points to blkdev's bdi however we want to redirect
+		 * it to our private bdi...
+		 */
+		bdi_put(sb->s_bdi);
+	}
+	sb->s_bdi = &noop_backing_dev_info;
+
+	err = super_setup_bdi_name(sb, "zuf-%s", device_name);
+	if (unlikely(err)) {
+		zuf_err("Failed to super_setup_bdi\n");
+		return err;
+	}
+
+	sb->s_bdi->ra_pages = ZUFS_READAHEAD_PAGES;
+	sb->s_bdi->capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK;
+	return 0;
+}
+
+static void zuf_put_super(struct super_block *sb)
+{
+	struct zuf_sb_info *sbi = SBI(sb);
+
+	if (sbi->zus_sbi) {
+		zufs_dispatch_umount(ZUF_ROOT(sbi), sbi->zus_sbi);
+		sbi->zus_sbi = NULL;
+	}
+
+	/* NOTE!!! this is a HACK! we should not touch the s_umount
+	 * lock but to make lockdep happy we do that since our devices
+	 * are held exclusively. Need to revisit on every kernel version
+	 * change.
+	 */
+	if (sbi->md) {
+		up_write(&sb->s_umount);
+		md_fini(sbi->md, sb->s_bdev);
+		down_write(&sb->s_umount);
+	}
+
+	sb->s_fs_info = NULL;
+	if (!test_opt(sbi, FAILED))
+		zuf_info("unmounted /dev/%s\n", _bdev_name(sb->s_bdev));
+	kfree(sbi);
+}
+
+struct __fill_super_params {
+	struct multi_devices *md;
+	char *mount_options;
+};
+
+static int zuf_fill_super(struct super_block *sb, void *data, int silent)
+{
+	struct zuf_sb_info *sbi;
+	struct __fill_super_params *fsp = data;
+	struct __parse_options po = {};
+	struct zufs_ioc_mount zim = {};
+	struct register_fs_info *rfi;
+	struct inode *root_i;
+	bool exist;
+	int err;
+
+	BUILD_BUG_ON(sizeof(struct zufs_dev_table) > ZUFS_SB_SIZE);
+	BUILD_BUG_ON(sizeof(struct zus_inode) != ZUFS_INODE_SIZE);
+
+	sbi = kzalloc(sizeof(struct zuf_sb_info), GFP_KERNEL);
+	if (!sbi) {
+		zuf_err_cnd(silent, "Not enough memory to allocate sbi\n");
+		return -ENOMEM;
+	}
+	sb->s_fs_info = sbi;
+	sbi->sb = sb;
+
+	/* Initialize embedded objects */
+	spin_lock_init(&sbi->s_mmap_dirty_lock);
+	INIT_LIST_HEAD(&sbi->s_mmap_dirty);
+	if (silent)
+		set_opt(sbi, SILENT);
+
+	sbi->md = fsp->md;
+	err = md_set_sb(sbi->md, sb->s_bdev, sb, silent);
+	if (unlikely(err))
+		goto error;
+
+	err = _parse_options(sbi, fsp->mount_options, 0, &po);
+	if (err)
+		goto error;
+
+	err = _setup_bdi(sb, _bdev_name(sb->s_bdev));
+	if (err) {
+		zuf_err_cnd(silent, "Failed to setup bdi => %d\n", err);
+		goto error;
+	}
+
+	/* Tell ZUS to mount an FS for us */
+	zim.pmem_kern_id = zuf_pmem_id(sbi->md);
+	err = zufs_dispatch_mount(ZUF_ROOT(sbi), zuf_fst(sb)->zus_zfi, &zim);
+	if (unlikely(err))
+		goto error;
+	sbi->zus_sbi = zim.zus_sbi;
+
+	/* Init with default values */
+	sb->s_blocksize_bits = zim.s_blocksize_bits;
+	sb->s_blocksize = 1 << zim.s_blocksize_bits;
+
+	sbi->mode = ZUFS_DEF_SBI_MODE;
+	sbi->uid = current_fsuid();
+	sbi->gid = current_fsgid();
+
+	rfi = &zuf_fst(sb)->rfi;
+
+	sb->s_magic = rfi->FS_magic;
+	sb->s_time_gran = rfi->s_time_gran;
+	sb->s_maxbytes = rfi->s_maxbytes;
+	sb->s_flags |= MS_NOSEC | (rfi->acl_on ? MS_POSIXACL : 0);
+
+	sb->s_op = &zuf_sops;
+
+	root_i = zuf_iget(sb, zim.zus_ii, zim._zi, &exist);
+	if (IS_ERR(root_i)) {
+		err = PTR_ERR(root_i);
+		goto error;
+	}
+	WARN_ON(exist);
+
+	sb->s_root = d_make_root(root_i);
+	if (!sb->s_root) {
+		zuf_err_cnd(silent, "get root inode failed\n");
+		iput(root_i); /* undo zuf_iget */
+		err = -ENOMEM;
+		goto error;
+	}
+
+	if (!zuf_rdonly(sb))
+		_sb_mwtime_now(sb, md_zdt(sbi->md));
+
+	_print_mount_info(sbi, fsp->mount_options);
+	clear_opt(sbi, SILENT);
+	return 0;
+
+error:
+	zuf_warn("NOT mounting => %d\n", err);
+	set_opt(sbi, FAILED);
+	zuf_put_super(sb);
+	return err;
+}
+
+static void _zst_to_kst(const struct statfs64 *zst, struct kstatfs *kst)
+{
+	kst->f_type	= zst->f_type;
+	kst->f_bsize	= zst->f_bsize;
+	kst->f_blocks	= zst->f_blocks;
+	kst->f_bfree	= zst->f_bfree;
+	kst->f_bavail	= zst->f_bavail;
+	kst->f_files	= zst->f_files;
+	kst->f_ffree	= zst->f_ffree;
+	kst->f_fsid	= zst->f_fsid;
+	kst->f_namelen	= zst->f_namelen;
+	kst->f_frsize	= zst->f_frsize;
+	kst->f_flags	= zst->f_flags;
+}
+
+static int zuf_statfs(struct dentry *d, struct kstatfs *buf)
+{
+	struct zuf_sb_info *sbi = SBI(d->d_sb);
+	struct zufs_ioc_statfs ioc_statfs = {
+		.hdr.in_len = offsetof(struct zufs_ioc_statfs, statfs_out),
+		.hdr.out_len = sizeof(ioc_statfs),
+		.hdr.operation = ZUS_OP_STATFS,
+		.zus_sbi = sbi->zus_sbi,
+	};
+	int err;
+
+	err = zufs_dispatch(ZUF_ROOT(sbi), &ioc_statfs.hdr, NULL, 0);
+	if (unlikely(err)) {
+		zuf_err("zufs_dispatch failed op=ZUS_OP_STATFS => %d\n", err);
+		return err;
+	}
+
+	_zst_to_kst(&ioc_statfs.statfs_out, buf);
+	return 0;
+}
+
+static int zuf_show_options(struct seq_file *seq, struct dentry *root)
+{
+	struct zuf_sb_info *sbi = SBI(root->d_sb);
+
+	if (__kuid_val(sbi->uid) && uid_valid(sbi->uid))
+		seq_printf(seq, ",uid=%u", __kuid_val(sbi->uid));
+	if (__kgid_val(sbi->gid) && gid_valid(sbi->gid))
+		seq_printf(seq, ",gid=%u", __kgid_val(sbi->gid));
+	if (test_opt(sbi, EPHEMERAL))
+		seq_puts(seq, ",ephemeral");
+	if (test_opt(sbi, DAX))
+		seq_puts(seq, ",dax");
+
+	return 0;
+}
+
+static int zuf_show_devname(struct seq_file *seq, struct dentry *root)
+{
+	seq_printf(seq, "/dev/%s", _bdev_name(root->d_sb->s_bdev));
+
+	return 0;
+}
+
+static int zuf_remount(struct super_block *sb, int *mntflags, char *data)
+{
+	unsigned long old_mount_opt;
+	struct zuf_sb_info *sbi = SBI(sb);
+	struct __parse_options po; /* Actually not used */
+	int err;
+
+	zuf_info("remount... -o %s\n", data);
+
+	/* Store the old options */
+	old_mount_opt = sbi->s_mount_opt;
+
+	err = _parse_options(sbi, data, 1, &po);
+	if (unlikely(err))
+		goto fail;
+
+	if ((*mntflags & MS_RDONLY) != zuf_rdonly(sb))
+		_sb_mwtime_now(sb, md_zdt(sbi->md));
+
+	return 0;
+
+fail:
+	sbi->s_mount_opt = old_mount_opt;
+	zuf_dbg_err("remount failed, restoring options\n");
+	return err;
+}
+
+static int zuf_update_s_wtime(struct super_block *sb)
+{
+	if (!(sb->s_flags & MS_RDONLY)) {
+		struct timespec now = current_kernel_time();
+
+		timespec_to_mt(&md_zdt(SBI(sb)->md)->s_wtime, &now);
+	}
+	return 0;
+}
+
+static void _sync_add_inode(struct inode *inode)
+{
+	struct zuf_sb_info *sbi = SBI(inode->i_sb);
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	zuf_dbg_mmap("[%ld] write_mapped=%d\n",
+		      inode->i_ino, atomic_read(&zii->write_mapped));
+
+	spin_lock(&sbi->s_mmap_dirty_lock);
+
+	/* Because we are lazy removing the inodes, only in case of an fsync
+	 * or an evict_inode. It is fine if we are called multiple times.
+	 */
+	if (list_empty(&zii->i_mmap_dirty))
+		list_add(&zii->i_mmap_dirty, &sbi->s_mmap_dirty);
+
+	spin_unlock(&sbi->s_mmap_dirty_lock);
+}
+
+static void _sync_remove_inode(struct inode *inode)
+{
+	struct zuf_sb_info *sbi = SBI(inode->i_sb);
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	zuf_dbg_mmap("[%ld] write_mapped=%d\n",
+		      inode->i_ino, atomic_read(&zii->write_mapped));
+
+	spin_lock(&sbi->s_mmap_dirty_lock);
+	list_del_init(&zii->i_mmap_dirty);
+	spin_unlock(&sbi->s_mmap_dirty_lock);
+}
+
+void zuf_sync_inc(struct inode *inode)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	if (1 == atomic_inc_return(&zii->write_mapped))
+		_sync_add_inode(inode);
+}
+
+/* zuf_sync_dec is called on unmap, possibly in batches */
+void zuf_sync_dec(struct inode *inode, ulong write_unmapped)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	if (0 == atomic_sub_return(write_unmapped, &zii->write_mapped))
+		_sync_remove_inode(inode);
+}
+
+/*
+ * We must fsync any mmap-active inodes
+ */
+static int zuf_sync_fs(struct super_block *sb, int wait)
+{
+	struct zuf_sb_info *sbi = SBI(sb);
+	struct zuf_inode_info *zii, *t;
+	enum {to_clean_size = 120};
+	struct zuf_inode_info *zii_to_clean[to_clean_size];
+	uint i, to_clean;
+
+more_inodes:
+	spin_lock(&sbi->s_mmap_dirty_lock);
+	to_clean = 0;
+	list_for_each_entry_safe(zii, t, &sbi->s_mmap_dirty, i_mmap_dirty) {
+		list_del_init(&zii->i_mmap_dirty);
+		zii_to_clean[to_clean++] = zii;
+		if (to_clean >= to_clean_size)
+			break;
+	}
+	spin_unlock(&sbi->s_mmap_dirty_lock);
+
+	if (!to_clean)
+		return 0;
+
+	for (i = 0; i < to_clean; ++i)
+		zuf_isync(&zii_to_clean[i]->vfs_inode, 0, ~0 - 1, 1);
+
+	if (to_clean == to_clean_size)
+		goto more_inodes;
+
+	return 0;
+}
+
+static struct inode *zuf_alloc_inode(struct super_block *sb)
+{
+	struct zuf_inode_info *zii;
+
+	zii = kmem_cache_alloc(zuf_inode_cachep, GFP_NOFS);
+	if (!zii)
+		return NULL;
+
+	zii->vfs_inode.i_version = 1;
+	return &zii->vfs_inode;
+}
+
+static void zuf_destroy_inode(struct inode *inode)
+{
+	kmem_cache_free(zuf_inode_cachep, ZUII(inode));
+}
+
+static void _init_once(void *foo)
+{
+	struct zuf_inode_info *zii = foo;
+
+	inode_init_once(&zii->vfs_inode);
+	INIT_LIST_HEAD(&zii->i_mmap_dirty);
+	zii->zi = NULL;
+	zii->zero_page = NULL;
+	init_rwsem(&zii->in_sync);
+	atomic_set(&zii->vma_count, 0);
+	atomic_set(&zii->write_mapped, 0);
+}
+
+int __init zuf_init_inodecache(void)
+{
+	zuf_inode_cachep = kmem_cache_create("zuf_inode_cache",
+					       sizeof(struct zuf_inode_info),
+					       0,
+					       (SLAB_RECLAIM_ACCOUNT |
+						SLAB_MEM_SPREAD |
+						SLAB_TYPESAFE_BY_RCU),
+					       _init_once);
+	if (zuf_inode_cachep == NULL)
+		return -ENOMEM;
+	return 0;
+}
+
+void zuf_destroy_inodecache(void)
+{
+	kmem_cache_destroy(zuf_inode_cachep);
+}
+
+/*
+ * The super block writes are all done "on the fly", so the
+ * super block is never in a "dirty" state and there is no need
+ * for write_super.
+ */
+static struct super_operations zuf_sops = {
+	.alloc_inode	= zuf_alloc_inode,
+	.destroy_inode	= zuf_destroy_inode,
+	.write_inode	= zuf_write_inode,
+	.evict_inode	= zuf_evict_inode,
+	.put_super	= zuf_put_super,
+	.freeze_fs	= zuf_update_s_wtime,
+	.unfreeze_fs	= zuf_update_s_wtime,
+	.sync_fs	= zuf_sync_fs,
+	.statfs		= zuf_statfs,
+	.remount_fs	= zuf_remount,
+	.show_options	= zuf_show_options,
+	.show_devname	= zuf_show_devname,
+};
+
 struct dentry *zuf_mount(struct file_system_type *fs_type, int flags,
 			 const char *dev_name, void *data)
 {
-	return ERR_PTR(-ENOTSUPP);
+	int silent = flags & MS_SILENT ? 1 : 0;
+	struct __fill_super_params fsp = {
+		.mount_options = data,
+	};
+	struct register_fs_info *rfi = &ZUF_FST(fs_type)->rfi;
+	struct mdt_check mc = {
+		.major_ver	= rfi->FS_ver_major,
+		.minor_ver	= rfi->FS_ver_minor,
+		.magic		= rfi->FS_magic,
+
+		.holder = fs_type,
+		.silent = silent,
+	};
+	struct dentry *ret = NULL;
+	const char *dev_path = NULL;
+	struct zuf_fs_type *fst;
+	int err;
+
+	zuf_dbg_vfs("dev_name=%s, data=%s\n", dev_name, (const char *)data);
+
+	fsp.md = md_alloc(sizeof(struct zuf_pmem));
+	if (IS_ERR(fsp.md)) {
+		err = PTR_ERR(fsp.md);
+		fsp.md = NULL;
+		goto out;
+	}
+
+	err = md_init(fsp.md, dev_name, &mc, &dev_path);
+	if (unlikely(err)) {
+		zuf_err_cnd(silent, "md_init failed! => %d\n", err);
+		goto out;
+	}
+
+	fst = container_of(fs_type, struct zuf_fs_type, vfs_fst);
+	zuf_add_pmem(fst->zri, fsp.md);
+
+	zuf_dbg_vfs("mounting with dev_path=%s\n", dev_path);
+	ret = mount_bdev(fs_type, flags, dev_path, &fsp, zuf_fill_super);
+
+out:
+	if (unlikely(err) && fsp.md)
+		md_fini(fsp.md, NULL);
+	kfree(dev_path);
+	return err ? ERR_PTR(err) : ret;
 }
diff --git a/fs/zuf/t1.c b/fs/zuf/t1.c
new file mode 100644
index 0000000..b0c869c
--- /dev/null
+++ b/fs/zuf/t1.c
@@ -0,0 +1,114 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Just the special mmap of the whole t1 array to the ZUS Server
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/pfn_t.h>
+#include <asm/pgtable.h>
+
+#include "zuf.h"
+
+/* ~~~ Functions for mmap a t1-array and page faults ~~~ */
+struct zuf_pmem *_pmem_from_f_private(struct file *file)
+{
+	struct zuf_special_file *zsf = file->private_data;
+
+	WARN_ON(zsf->type != zlfs_e_pmem);
+	return container_of(zsf, struct zuf_pmem, hdr);
+}
+
+static int t1_file_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct inode *inode = vma->vm_file->f_mapping->host;
+	struct zuf_pmem *z_pmem;
+	pgoff_t size;
+	ulong bn = vmf->pgoff;
+	ulong pfn;
+	int err;
+
+	zuf_dbg_t1("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx "
+		    "pgoff=0x%lx vmf_flags=0x%x cow_page=%p page=%p\n",
+		    inode->i_ino, vma->vm_start, vma->vm_end,
+		    vmf->address, vmf->pgoff, vmf->flags,
+		    vmf->cow_page, vmf->page);
+
+	if (unlikely(vmf->page)) {
+		zuf_err("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx "
+			"pgoff=0x%lx vmf_flags=0x%x page=%p cow_page=%p\n",
+			inode->i_ino, vma->vm_start, vma->vm_end,
+			vmf->address, vmf->pgoff, vmf->flags,
+			vmf->page, vmf->cow_page);
+		return VM_FAULT_SIGBUS;
+	}
+
+	size = md_o2p_up(i_size_read(inode));
+	if (unlikely(vmf->pgoff >= size)) {
+		ulong pgoff = vma->vm_pgoff +
+					md_o2p((vmf->address - vma->vm_start));
+
+		zuf_err("[%ld] pgoff(0x%lx)(0x%lx) >= size(0x%lx) => SIGBUS\n",
+			 inode->i_ino, vmf->pgoff, pgoff, size);
+
+		return VM_FAULT_SIGBUS;
+	}
+
+	if (vmf->cow_page)
+		/* HOWTO: prevent private mmaps */
+		return VM_FAULT_SIGBUS;
+
+	z_pmem = _pmem_from_f_private(vma->vm_file);
+	pfn = md_pfn(&z_pmem->md, bn);
+
+	err = vm_insert_mixed_mkwrite(vma, vmf->address,
+			       phys_to_pfn_t(PFN_PHYS(pfn), PFN_MAP | PFN_DEV));
+	zuf_dbg_t1("[%ld] vm_insert_mixed 0x%lx prot=0x%lx => %d\n",
+		    inode->i_ino, pfn, vma->vm_page_prot.pgprot, err);
+
+	/*
+	 * err == -EBUSY is fine, we've raced against another thread
+	 * that faulted-in the same page
+	 */
+	if (err && (err != -EBUSY)) {
+		zuf_err("[%ld] vm_insert_page/mixed => %d\n",
+			inode->i_ino, err);
+		return VM_FAULT_SIGBUS;
+	}
+
+	return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct t1_vm_ops = {
+	.fault		= t1_file_fault,
+};
+
+int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct zuf_special_file *zsf = file->private_data;
+
+	if (!zsf || zsf->type != zlfs_e_pmem)
+		return -EPERM;
+
+	/* FIXME:  MIXEDMAP for the support of pmem-pages (Why?)
+	 */
+	vma->vm_flags |= VM_MIXEDMAP;
+	vma->vm_ops = &t1_vm_ops;
+
+	zuf_dbg_vfs("[%ld] start=0x%lx end=0x%lx flags=0x%lx page_prot=0x%lx\n",
+		     file->f_mapping->host->i_ino, vma->vm_start, vma->vm_end,
+		     vma->vm_flags, pgprot_val(vma->vm_page_prot));
+
+	return 0;
+}
diff --git a/fs/zuf/t2.c b/fs/zuf/t2.c
new file mode 100644
index 0000000..fa4eadc
--- /dev/null
+++ b/fs/zuf/t2.c
@@ -0,0 +1,348 @@
+/*
+ * Tier-2 operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#include "t2.h"
+
+#include <linux/bitops.h>
+#include <linux/bio.h>
+
+#include "zuf.h"
+
+#define t2_dbg(fmt, args ...) zuf_dbg_t2(fmt, ##args)
+
+const char *_pr_rw(int rw)
+{
+	return (rw & WRITE) ? "WRITE" : "READ";
+}
+#define t2_tis_dbg(tis, fmt, args ...) \
+	zuf_dbg_t2("%s: r=%d f=0x%lx " fmt, _pr_rw(tis->rw_flags),	       \
+		    atomic_read(&tis->refcount), tis->rw_flags, ##args)
+
+#define t2_tis_dbg_rw(tis, fmt, args ...) \
+	zuf_dbg_t2_rw("%s<%p>: r=%d f=0x%lx " fmt, _pr_rw(tis->rw_flags),     \
+		    tis->priv, atomic_read(&tis->refcount), tis->rw_flags,\
+		    ##args)
+
+/* ~~~~~~~~~~~~ Async read/write ~~~~~~~~~~ */
+void t2_io_begin(struct multi_devices *md, int rw, t2_io_done_fn done,
+		 void *priv, uint n_vects, struct t2_io_state *tis)
+{
+	atomic_set(&tis->refcount, 1);
+	tis->md = md;
+	tis->done = done;
+	tis->priv = priv;
+	tis->n_vects = min(n_vects ? n_vects : 1, (uint)BIO_MAX_PAGES);
+	tis->rw_flags = rw;
+	tis->last_t2 = -1;
+	tis->cur_bio = NULL;
+	tis->index = ~0;
+	bio_list_init(&tis->delayed_bios);
+	tis->err = 0;
+	blk_start_plug(&tis->plug);
+	t2_tis_dbg_rw(tis, "done=%pF n_vects=%d\n", done, n_vects);
+}
+
+static void _tis_put(struct t2_io_state *tis)
+{
+	t2_tis_dbg_rw(tis, "done=%pF\n", tis->done);
+
+	if (test_bit(B_TIS_FREE_AFTER_WAIT, &tis->rw_flags))
+		wake_up_atomic_t(&tis->refcount);
+	else if (tis->done)
+		/* last - done may free the tis */
+		tis->done(tis, NULL, true);
+}
+
+static inline void tis_get(struct t2_io_state *tis)
+{
+	atomic_inc(&tis->refcount);
+}
+
+static inline int tis_put(struct t2_io_state *tis)
+{
+	if (atomic_dec_and_test(&tis->refcount)) {
+		_tis_put(tis);
+		return 1;
+	}
+	return 0;
+}
+
+static inline bool _err_set_reported(struct md_dev_info *mdi, bool write)
+{
+	bool *reported = write ? &mdi->t2i.err_write_reported :
+				 &mdi->t2i.err_read_reported;
+
+	if (!(*reported)) {
+		*reported = true;
+		return true;
+	}
+	return false;
+}
+
+static int _status_to_errno(blk_status_t status)
+{
+	return -EIO;
+}
+
+static void _tis_bio_done(struct bio *bio)
+{
+	struct t2_io_state *tis = bio->bi_private;
+	struct md_dev_info *mdi = md_t2_dev(tis->md, 0);
+
+	t2_tis_dbg(tis, "done=%pF err=%d\n", tis->done, bio->bi_status);
+
+	if (unlikely(bio->bi_status)) {
+		zuf_dbg_err("%s: err=%d last-err=%d\n",
+			     _pr_rw(tis->rw_flags), bio->bi_status, tis->err);
+		if (_err_set_reported(mdi, 0 != (tis->rw_flags & WRITE)))
+			zuf_err("%s: err=%d\n",
+				 _pr_rw(tis->rw_flags), bio->bi_status);
+		/* Store the last one */
+		tis->err = _status_to_errno(bio->bi_status);
+	} else if (unlikely(mdi->t2i.err_write_reported ||
+			    mdi->t2i.err_read_reported)) {
+		if (tis->rw_flags & WRITE)
+			mdi->t2i.err_write_reported = false;
+		else
+			mdi->t2i.err_read_reported = false;
+	}
+
+	if (tis->done)
+		tis->done(tis, bio, false);
+
+	bio_put(bio);
+	tis_put(tis);
+}
+
+static bool _tis_delay(struct t2_io_state *tis)
+{
+	return 0 != (tis->rw_flags & TIS_DELAY_SUBMIT);
+}
+
+#define bio_list_for_each_safe(bio, btmp, bl)				\
+	for (bio = (bl)->head,	btmp = bio ? bio->bi_next : NULL;	\
+	     bio; bio = btmp,	btmp = bio ? bio->bi_next : NULL)
+
+static void _tis_submit_bio(struct t2_io_state *tis, bool flush, bool done)
+{
+	if (flush || done) {
+		if (_tis_delay(tis)) {
+			struct bio *btmp, *bio;
+
+			bio_list_for_each_safe(bio, btmp, &tis->delayed_bios) {
+				bio->bi_next = NULL;
+				if (bio->bi_iter.bi_sector == -1) {
+					t2_warn("!!!!!!!!!!!!!\n");
+					bio_put(bio);
+					continue;
+				}
+				t2_tis_dbg(tis, "submit bio[%d] max_v=%d\n",
+					    bio->bi_vcnt, tis->n_vects);
+				submit_bio(bio);
+			}
+			bio_list_init(&tis->delayed_bios);
+		}
+
+		if (!tis->cur_bio)
+			return;
+
+		if (tis->cur_bio->bi_iter.bi_sector != -1) {
+			t2_tis_dbg(tis, "submit bio[%d] max_v=%d\n",
+				    tis->cur_bio->bi_vcnt, tis->n_vects);
+			submit_bio(tis->cur_bio);
+			tis->cur_bio = NULL;
+			tis->index = ~0;
+		} else if (done) {
+			t2_tis_dbg(tis, "put cur_bio=%p\n", tis->cur_bio);
+			bio_put(tis->cur_bio);
+			WARN_ON(tis_put(tis));
+		}
+	} else if (tis->cur_bio && (tis->cur_bio->bi_iter.bi_sector != -1)) {
+		/* Not flushing regular progress */
+		if (_tis_delay(tis)) {
+			t2_tis_dbg(tis, "list_add cur_bio=%p\n", tis->cur_bio);
+			bio_list_add(&tis->delayed_bios, tis->cur_bio);
+		} else {
+			t2_tis_dbg(tis, "submit bio[%d] max_v=%d\n",
+				    tis->cur_bio->bi_vcnt, tis->n_vects);
+			submit_bio(tis->cur_bio);
+		}
+		tis->cur_bio = NULL;
+		tis->index = ~0;
+	}
+}
+
+/* tis->cur_bio MUST be NULL, checked by caller */
+static void _tis_alloc(struct t2_io_state *tis, struct md_dev_info *mdi,
+		       gfp_t gfp)
+{
+	struct bio *bio;
+	int bio_op;
+
+	/* must be checked before the first tis dereference */
+	if (WARN_ON(!tis || !tis->md))
+		return;
+
+	bio = bio_alloc(gfp, tis->n_vects);
+	if (unlikely(!bio)) {
+		if (!_tis_delay(tis))
+			t2_warn("!!! failed to alloc bio");
+		tis->err = -ENOMEM;
+		return;
+	}
+
+	/* FIXME: the bio_set_op_attrs macro has a BUG which does not allow
+	 * this conditional to be passed inline.
+	 */
+	bio_op = (tis->rw_flags & WRITE) ? REQ_OP_WRITE : REQ_OP_READ;
+	bio_set_op_attrs(bio, bio_op, 0);
+
+	if (mdi->bdev)
+		bio_set_dev(bio, mdi->bdev);
+	bio->bi_iter.bi_sector = -1;
+	bio->bi_end_io = _tis_bio_done;
+	bio->bi_private = tis;
+
+	tis->index = mdi ? mdi->index : ~0;
+	tis->last_t2 = -1;
+	tis->cur_bio = bio;
+	tis_get(tis);
+	t2_tis_dbg(tis, "New bio n_vects=%d\n", tis->n_vects);
+}
+
+int t2_io_prealloc(struct t2_io_state *tis, uint n_vects)
+{
+	tis->err = 0; /* reset any -ENOMEM from a previous t2_io_add */
+
+	_tis_submit_bio(tis, true, false);
+	tis->n_vects = min(n_vects ? n_vects : 1, (uint)BIO_MAX_PAGES);
+
+	t2_tis_dbg(tis, "n_vects=%d cur_bio=%p\n", tis->n_vects, tis->cur_bio);
+
+	if (!tis->cur_bio)
+		_tis_alloc(tis, NULL, GFP_NOFS);
+	return tis->err;
+}
+
+int t2_io_add(struct t2_io_state *tis, ulong t2, struct page *page)
+{
+	struct md_dev_info *mdi = md_bn_t2_dev(tis->md, t2);
+	ulong local_t2 = md_t2_local_bn(tis->md, t2);
+	int ret;
+
+	if (((local_t2 != (tis->last_t2 + 1)) && (tis->last_t2 != -1)) ||
+	   (mdi && (0 < tis->index) && (tis->index != mdi->index)))
+		_tis_submit_bio(tis, false, false);
+
+start:
+	if (!tis->cur_bio) {
+		_tis_alloc(tis, mdi, _tis_delay(tis) ? GFP_ATOMIC : GFP_NOFS);
+		if (unlikely(tis->err))
+			return tis->err;
+	} else if (tis->index == ~0) {
+		/* the bio was allocated during t2_io_prealloc */
+		tis->index = mdi->index;
+		bio_set_dev(tis->cur_bio, mdi->bdev);
+	}
+
+	if (tis->last_t2 == -1)
+		tis->cur_bio->bi_iter.bi_sector = local_t2 * T2_SECTORS_PER_PAGE;
+
+	ret = bio_add_page(tis->cur_bio, page, PAGE_SIZE, 0);
+	if (unlikely(ret != PAGE_SIZE)) {
+		t2_tis_dbg(tis, "bio_add_page=>%d bi_vcnt=%d n_vects=%d\n",
+			   ret, tis->cur_bio->bi_vcnt, tis->n_vects);
+		_tis_submit_bio(tis, false, false);
+		goto start; /* device does not support tis->n_vects */
+	}
+
+	if ((tis->cur_bio->bi_vcnt == tis->n_vects) && (tis->n_vects != 1))
+		_tis_submit_bio(tis, false, false);
+
+	t2_tis_dbg(tis, "t2=0x%lx last_t2=0x%lx local_t2=0x%lx page-i=0x%lx\n",
+		   t2, tis->last_t2, local_t2, page->index);
+
+	tis->last_t2 = local_t2;
+	return 0;
+}
+
+int t2_io_end(struct t2_io_state *tis, bool wait)
+{
+	int err = 0;
+
+	if (unlikely(!tis || !tis->md))
+		return 0; /* never initialized, nothing to do */
+
+	t2_tis_dbg_rw(tis, "wait=%d\n", wait);
+
+	_tis_submit_bio(tis, true, true);
+	blk_finish_plug(&tis->plug);
+
+	if (wait)
+		set_bit(B_TIS_FREE_AFTER_WAIT, &tis->rw_flags);
+	tis_put(tis);
+
+	if (wait) {
+		err = wait_on_atomic_t(&tis->refcount, atomic_t_wait,
+					TASK_INTERRUPTIBLE);
+		if (likely(!err))
+			err = tis->err;
+		if (tis->done)
+			tis->done(tis, NULL, true);
+	}
+
+	/* In case of a ctrl-c we return an err but tis->err == 0 */
+	return err;
+}
+
+/* ~~~~~~~ Sync read/write ~~~~~~~ TODO: Remove soon */
+static int _sync_io_page(struct multi_devices *md, int rw, ulong bn,
+			 struct page *page)
+{
+	struct t2_io_state tis;
+	int err;
+
+	t2_io_begin(md, rw, NULL, NULL, 1, &tis);
+
+	t2_tis_dbg((&tis), "bn=0x%lx p-i=0x%lx\n", bn, page->index);
+
+	err = t2_io_add(&tis, bn, page);
+	if (unlikely(err))
+		goto out; /* don't leak the plug (and bio) on error */
+
+	err = submit_bio_wait(tis.cur_bio);
+	if (unlikely(err)) {
+		SetPageError(page);
+		/*
+		 * We failed to write the page out to tier-2.
+		 * Print a dire warning that things will go BAD (tm)
+		 * very quickly.
+		 */
+		zuf_err("io-error bn=0x%lx => %d\n", bn, err);
+	}
+
+out:
+	/* Same as t2_io_end+_tis_bio_done but without the kref stuff */
+	blk_finish_plug(&tis.plug);
+	if (likely(tis.cur_bio))
+		bio_put(tis.cur_bio);
+
+	return err;
+}
+
+int t2_writepage(struct multi_devices *md, ulong bn, struct page *page)
+{
+	return _sync_io_page(md, WRITE, bn, page);
+}
+
+int t2_readpage(struct multi_devices *md, ulong bn, struct page *page)
+{
+	return _sync_io_page(md, READ, bn, page);
+}
diff --git a/fs/zuf/t2.h b/fs/zuf/t2.h
new file mode 100644
index 0000000..75c24f7
--- /dev/null
+++ b/fs/zuf/t2.h
@@ -0,0 +1,67 @@
+/*
+ * Tier-2 Header file.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#ifndef __T2_H__
+#define __T2_H__
+
+#include <linux/blkdev.h>
+#include <linux/fs.h>
+#include <linux/bio.h>
+#include <linux/kref.h>
+#include "_pr.h"
+#include "md.h"
+
+#define T2_SECTORS_PER_PAGE	(PAGE_SIZE / 512)
+
+#define t2_warn(fmt, args ...) zuf_warn(fmt, ##args)
+
+/* t2.c */
+
+/* Sync read/write */
+int t2_writepage(struct multi_devices *md, ulong bn, struct page *page);
+int t2_readpage(struct multi_devices *md, ulong bn, struct page *page);
+
+/* Async read/write */
+struct t2_io_state;
+typedef void (*t2_io_done_fn)(struct t2_io_state *tis, struct bio *bio,
+			      bool last);
+
+struct t2_io_state {
+	atomic_t refcount; /* counts in-flight bios */
+	struct blk_plug plug;
+
+	struct multi_devices	*md;
+	int		index;
+	t2_io_done_fn	done;
+	void		*priv;
+
+	uint		n_vects;
+	ulong		rw_flags;
+	ulong		last_t2;
+	struct bio	*cur_bio;
+	struct bio_list	delayed_bios;
+	int		err;
+};
+
+/* For rw_flags above */
+/* From Kernel: WRITE		(1U << 0) */
+#define TIS_DELAY_SUBMIT	(1U << 2)
+enum {B_TIS_FREE_AFTER_WAIT = 3};
+#define TIS_FREE_AFTER_WAIT	(1U << B_TIS_FREE_AFTER_WAIT)
+#define TIS_USER_DEF_FIRST	(1U << 8)
+
+void t2_io_begin(struct multi_devices *md, int rw, t2_io_done_fn done,
+		 void *priv, uint n_vects, struct t2_io_state *tis);
+int t2_io_prealloc(struct t2_io_state *tis, uint n_vects);
+int t2_io_add(struct t2_io_state *tis, ulong t2, struct page *page);
+int t2_io_end(struct t2_io_state *tis, bool wait);
+
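+/* Example usage (a sketch only; mirrors _sync_io_page() in t2.c):
+ *
+ *	struct t2_io_state tis;
+ *	int err;
+ *
+ *	t2_io_begin(md, WRITE, done_fn, priv, n_vects, &tis);
+ *	err = t2_io_add(&tis, bn, page);	(once per bn/page pair)
+ *	err = t2_io_end(&tis, true);		(submits and waits)
+ */
+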
+#endif /*def __T2_H__*/
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index 12a23f1..963c417 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -14,7 +14,7 @@
 #include <linux/mm.h>
 #include <linux/mm_types.h>
 #include <linux/delay.h>
 #include <linux/uaccess.h>
 
 #include "zuf.h"
 
@@ -220,6 +219,71 @@ void zufs_mounter_release(struct file *file)
 	}
 }
 
+/* ~~~~ PMEM GRAB ~~~~ */
+static int zufr_find_pmem(struct zuf_root_info *zri,
+		   uint pmem_kern_id, struct zuf_pmem **pmem_md)
+{
+	struct zuf_pmem *z_pmem;
+
+	list_for_each_entry(z_pmem, &zri->pmem_list, list) {
+		if (z_pmem->pmem_id == pmem_kern_id) {
+			*pmem_md = z_pmem;
+			return 0;
+		}
+	}
+
+	return -ENODEV;
+}
+
+static int _zu_grab_pmem(struct file *file, void *parg)
+{
+	struct zuf_root_info *zri = ZRI(file->f_inode->i_sb);
+	struct zufs_ioc_pmem __user *arg_pmem = parg;
+	struct zufs_ioc_pmem zi_pmem = {};
+	struct zuf_pmem *pmem_md;
+	int err;
+
+	err = get_user(zi_pmem.pmem_kern_id, &arg_pmem->pmem_kern_id);
+	if (err) {
+		zuf_err("get_user => %d\n", err);
+		return err;
+	}
+
+	err = zufr_find_pmem(zri, zi_pmem.pmem_kern_id, &pmem_md);
+	if (err) {
+		zuf_err("!!! pmem_kern_id=%u not found\n",
+			zi_pmem.pmem_kern_id);
+		goto out;
+	}
+
+	if (pmem_md->file) {
+		zuf_err("[%u] pmem already taken\n", zi_pmem.pmem_kern_id);
+		err = -EIO;
+		goto out;
+	}
+
+	err = md_numa_info(&pmem_md->md, &zi_pmem);
+	if (unlikely(err)) {
+		zuf_err("md_numa_info => %d\n", err);
+		goto out;
+	}
+
+	i_size_write(file->f_inode, md_p2o(md_t1_blocks(&pmem_md->md)));
+	pmem_md->hdr.type = zlfs_e_pmem;
+	pmem_md->file = file;
+	file->private_data = &pmem_md->hdr;
+	zuf_dbg_core("pmem %d GRABBED %s\n",
+		     zi_pmem.pmem_kern_id,
+		     _bdev_name(md_t1_dev(&pmem_md->md, 0)->bdev));
+
+out:
+	zi_pmem.hdr.err = err;
+	if (copy_to_user(parg, &zi_pmem, sizeof(zi_pmem))) {
+		/* copy_to_user returns bytes not copied, not an errno */
+		zuf_err("copy_to_user => EFAULT\n");
+		return -EFAULT;
+	}
+	return 0;
+}
+
 static int _map_pages(struct zufs_thread *zt, struct page **pages, uint nump,
 		      bool zap)
 {
@@ -451,6 +515,8 @@ long zufs_ioc(struct file *file, unsigned int cmd, ulong arg)
 		return _zu_register_fs(file, parg);
 	case ZU_IOC_MOUNT:
 		return _zu_mount(file, parg);
+	case ZU_IOC_GRAB_PMEM:
+		return _zu_grab_pmem(file, parg);
 	case ZU_IOC_INIT_THREAD:
 		return _zu_init(file, parg);
 	case ZU_IOC_WAIT_OPT:
diff --git a/fs/zuf/zuf-root.c b/fs/zuf/zuf-root.c
index 8102d3a..ebb44c9 100644
--- a/fs/zuf/zuf-root.c
+++ b/fs/zuf/zuf-root.c
@@ -117,6 +117,8 @@ int zufr_mmap(struct file *file, struct vm_area_struct *vma)
 	switch (zsf->type) {
 	case zlfs_e_zt:
 		return zuf_zt_mmap(file, vma);
+	case zlfs_e_pmem:
+		return zuf_pmem_mmap(file, vma);
 	default:
 		zuf_err("type=%d\n", zsf->type);
 		return -ENOTTY;
@@ -300,7 +302,10 @@ static struct kset *zufr_kset;
 
 int __init zuf_root_init(void)
 {
-	int err;
+	int err = zuf_init_inodecache();
+
+	if (unlikely(err))
+		return err;
 
 	zufr_kset = kset_create_and_add("zuf", NULL, fs_kobj);
 	if (!zufr_kset) {
@@ -317,6 +322,7 @@ int __init zuf_root_init(void)
 un_kset:
 	kset_unregister(zufr_kset);
 un_inodecache:
+	zuf_destroy_inodecache();
 	return err;
 }
 
@@ -324,6 +330,7 @@ void __exit zuf_root_exit(void)
 {
 	unregister_filesystem(&zufr_type);
 	kset_unregister(zufr_kset);
+	zuf_destroy_inodecache();
 }
 
 module_init(zuf_root_init)
diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
index 15516d0..a5d277f 100644
--- a/fs/zuf/zuf.h
+++ b/fs/zuf/zuf.h
@@ -26,7 +26,9 @@
 #include "zus_api.h"
 
 #include "relay.h"
+#include "t2.h"
 #include "_pr.h"
+#include "md.h"
 
 enum zlfs_e_special_file {
 	zlfs_e_zt = 1,
@@ -38,6 +40,15 @@ struct zuf_special_file {
 	enum zlfs_e_special_file type;
 };
 
+/* Our special md structure */
+struct zuf_pmem {
+	struct multi_devices md; /* must be first */
+	struct list_head list;
+	struct zuf_special_file hdr;
+	uint pmem_id;
+	struct file *file;
+};
+
 /* This is the zuf-root.c mini filesystem */
 struct zuf_root_info {
 	struct __mount_thread_info {
@@ -84,6 +95,215 @@ static inline void zuf_add_fs_type(struct zuf_root_info *zri,
 	list_add(&zft->list, &zri->fst_list);
 }
 
+static inline void zuf_add_pmem(struct zuf_root_info *zri,
+				   struct multi_devices *md)
+{
+	struct zuf_pmem *z_pmem = (void *)md;
+
+	z_pmem->pmem_id = ++zri->next_pmem_id; /* Avoid 0 id */
+
+	/* Unlocked for now only one mount-thread with zus */
+	list_add(&z_pmem->list, &zri->pmem_list);
+}
+
+static inline uint zuf_pmem_id(struct multi_devices *md)
+{
+	struct zuf_pmem *z_pmem = container_of(md, struct zuf_pmem, md);
+
+	return z_pmem->pmem_id;
+}
+
+// void zuf_del_fs_type(struct zuf_root_info *zri, struct zuf_fs_type *zft);
+
+/*
+ * Private Super-block flags
+ */
+enum {
+	ZUF_MOUNT_PEDANTIC	= 0x000001,	/* Check for memory leaks */
+	ZUF_MOUNT_PEDANTIC_SHADOW = 0x000002,	/* */
+	ZUF_MOUNT_SILENT	= 0x000004,	/* verbosity is silent */
+	ZUF_MOUNT_EPHEMERAL	= 0x000008,	/* Don't persist the data */
+	ZUF_MOUNT_FAILED	= 0x000010,	/* mark a failed-mount */
+	ZUF_MOUNT_DAX		= 0x000020,	/* mounted with dax option */
+	ZUF_MOUNT_POSIXACL	= 0x000040,	/* mounted with posix acls */
+};
+
+#define clear_opt(sbi, opt)       (sbi->s_mount_opt &= ~ZUF_MOUNT_ ## opt)
+#define set_opt(sbi, opt)         (sbi->s_mount_opt |= ZUF_MOUNT_ ## opt)
+#define test_opt(sbi, opt)        (sbi->s_mount_opt & ZUF_MOUNT_ ## opt)
+
+#define ZUFS_DEF_SBI_MODE (S_IRUGO | S_IXUGO | S_IWUSR)
+
+/* Flags bits on zii->flags */
+enum {
+	ZII_UNMAP_LOCK	= 1,
+};
+
+/*
+ * ZUF per-inode data in memory
+ */
+struct zuf_inode_info {
+	struct inode		vfs_inode;
+
+	/* Stuff for mmap write */
+	struct rw_semaphore	in_sync;
+	struct list_head	i_mmap_dirty;
+	atomic_t		write_mapped;
+	atomic_t		vma_count;
+	struct page		*zero_page; /* TODO: Remove */
+
+	/* cookies from Server */
+	struct zus_inode	*zi;
+	struct zus_inode_info	*zus_ii;
+};
+
+static inline struct zuf_inode_info *ZUII(struct inode *inode)
+{
+	return container_of(inode, struct zuf_inode_info, vfs_inode);
+}
+
+/*
+ * ZUF super-block data in memory
+ */
+struct zuf_sb_info {
+	struct super_block *sb;
+	struct multi_devices *md;
+
+	/* zus cookie */
+	struct zus_sb_info *zus_sbi;
+
+	/* Mount options */
+	unsigned long	s_mount_opt;
+	kuid_t		uid;    /* Mount uid for root directory */
+	kgid_t		gid;    /* Mount gid for root directory */
+	umode_t		mode;   /* Mount mode for root directory */
+
+	spinlock_t		s_mmap_dirty_lock;
+	struct list_head	s_mmap_dirty;
+};
+
+static inline struct zuf_sb_info *SBI(struct super_block *sb)
+{
+	return sb->s_fs_info;
+}
+
+static inline struct zuf_fs_type *ZUF_FST(struct file_system_type *fs_type)
+{
+	return container_of(fs_type, struct zuf_fs_type, vfs_fst);
+}
+
+static inline struct zuf_fs_type *zuf_fst(struct super_block *sb)
+{
+	return ZUF_FST(sb->s_type);
+}
+
+static inline struct zuf_root_info *ZUF_ROOT(struct zuf_sb_info *sbi)
+{
+	return zuf_fst(sbi->sb)->zri;
+}
+
+static inline bool zuf_rdonly(struct super_block *sb)
+{
+	return sb->s_flags & MS_RDONLY;
+}
+
+static inline struct zus_inode *zus_zi(struct inode *inode)
+{
+	return ZUII(inode)->zi;
+}
+
+/* An accessor because of the frequent use in prints */
+static inline ulong _zi_ino(struct zus_inode *zi)
+{
+	return le64_to_cpu(zi->i_ino);
+}
+
+static inline bool _zi_active(struct zus_inode *zi)
+{
+	return (zi->i_nlink || zi->i_mode);
+}
+
+static inline void mt_to_timespec(struct timespec *t, __le64 *mt)
+{
+	u32 nsec;
+
+	t->tv_sec = div_s64_rem(le64_to_cpu(*mt), NSEC_PER_SEC, &nsec);
+	t->tv_nsec = nsec;
+}
+
+static inline void timespec_to_mt(__le64 *mt, struct timespec *t)
+{
+	*mt = cpu_to_le64(t->tv_sec * NSEC_PER_SEC + t->tv_nsec);
+}
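+
+/* e.g. (sketch): a timespec of {.tv_sec = 1, .tv_nsec = 500} is stored
+ * on-media as cpu_to_le64(1 * NSEC_PER_SEC + 500)
+ */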
+
+static inline void zuf_r_lock(struct zuf_inode_info *zii)
+{
+	inode_lock_shared(&zii->vfs_inode);
+}
+static inline void zuf_r_unlock(struct zuf_inode_info *zii)
+{
+	inode_unlock_shared(&zii->vfs_inode);
+}
+
+static inline void zuf_smr_lock(struct zuf_inode_info *zii)
+{
+	down_read_nested(&zii->in_sync, 1);
+}
+static inline void zuf_smr_lock_pagefault(struct zuf_inode_info *zii)
+{
+	down_read_nested(&zii->in_sync, 2);
+}
+static inline void zuf_smr_unlock(struct zuf_inode_info *zii)
+{
+	up_read(&zii->in_sync);
+}
+
+static inline void zuf_smw_lock(struct zuf_inode_info *zii)
+{
+	down_write(&zii->in_sync);
+}
+static inline void zuf_smw_lock_nested(struct zuf_inode_info *zii)
+{
+	down_write_nested(&zii->in_sync, 1);
+}
+static inline void zuf_smw_unlock(struct zuf_inode_info *zii)
+{
+	up_write(&zii->in_sync);
+}
+
+static inline void zuf_w_lock(struct zuf_inode_info *zii)
+{
+	inode_lock(&zii->vfs_inode);
+	zuf_smw_lock(zii);
+}
+static inline void zuf_w_lock_nested(struct zuf_inode_info *zii)
+{
+	inode_lock_nested(&zii->vfs_inode, 2);
+	zuf_smw_lock_nested(zii);
+}
+static inline void zuf_w_unlock(struct zuf_inode_info *zii)
+{
+	zuf_smw_unlock(zii);
+	inode_unlock(&zii->vfs_inode);
+}
+
+static inline void ZUF_CHECK_I_W_LOCK(struct inode *inode)
+{
+#ifdef CONFIG_ZUF_DEBUG
+	if (WARN_ON(down_write_trylock(&inode->i_rwsem)))
+		up_write(&inode->i_rwsem);
+#endif
+}
+
+/* CAREFUL: Needs an sfence eventually, after this call */
+static inline
+void zus_inode_cmtime_now(struct inode *inode, struct zus_inode *zi)
+{
+	inode->i_mtime = inode->i_ctime = current_time(inode);
+	timespec_to_mt(&zi->i_ctime, &inode->i_ctime);
+	zi->i_mtime = zi->i_ctime;
+}
+
 /* Keep this include last thing in file */
 #include "_extern.h"
 
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 19ce326..d461782 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -66,6 +66,17 @@
 
 #endif /*  ndef __KERNEL__ */
 
+/*
+ * Maximal count of links to a file
+ */
+#define ZUFS_LINK_MAX          32000
+#define ZUFS_MAX_SYMLINK	PAGE_SIZE
+#define ZUFS_NAME_LEN		255
+#define ZUFS_READAHEAD_PAGES	8
+
+/* All device sizes and offsets must be 2M aligned */
+#define ZUFS_ALLOC_MASK		(1024 * 1024 * 2 - 1)
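+/* e.g. (sketch): a device is expected to satisfy
+ *	(dev_size & ZUFS_ALLOC_MASK) == 0 && (dev_offset & ZUFS_ALLOC_MASK) == 0
+ */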
+
 /**
  * zufs dual port memory
  * This is a special type of offset to either memory or persistent-memory,
@@ -75,6 +86,121 @@
  */
 typedef __u64	zu_dpp_t;
 
+/*
+ * Structure of a ZUS inode.
+ * This is all the inode fields
+ */
+
+/* zus_inode size */
+#define ZUFS_INODE_SIZE 128    /* must be power of two */
+#define ZUFS_INODE_BITS   7
+
+struct zus_inode {
+	__le32	i_flags;	/* Inode flags */
+	__le16	i_mode;		/* File mode */
+	__le16	i_nlink;	/* Links count */
+	__le64	i_size;		/* Size of data in bytes */
+/* 16*/	struct __zi_on_disk_desc {
+		__le64	a[2];
+	}	i_on_disk;	/* FS-specific on disc placement */
+/* 32*/	__le64	i_blocks;
+	__le64	i_mtime;	/* Inode/data Modification time */
+	__le64	i_ctime;	/* Inode/data Changed time */
+	__le64	i_atime;	/* Data Access time */
+/* 64 - cache-line boundary */
+	__le64	i_ino;		/* Inode number */
+	__le32	i_uid;		/* Owner Uid */
+	__le32	i_gid;		/* Group Id */
+	__le64	i_xattr;	/* FS-specific Extended attribute block */
+	__le64	i_generation;	/* File version (for NFS) */
+/* 96*/	union {
+		__le32	i_rdev;		/* special-inode major/minor etc ...*/
+		u8	i_symlink[32];	/* if i_size < sizeof(i_symlink) */
+		__le64	i_sym_sno;	/* FS-specific symlink placement */
+		struct  _zu_dir {
+			__le64  parent;
+		}	i_dir;
+	};
+	/* Total ZUFS_INODE_SIZE bytes always */
+};
+
+#define ZUFS_SB_SIZE 2048       /* must be power of two */
+
+/* device table s_flags */
+#define		ZUFS_SHADOW	(1UL << 4)	/* simulate cpu cache */
+
+#define test_msb_opt(msb, opt)	(le64_to_cpu(msb->s_flags) & opt)
+
+#define ZUFS_DEV_NUMA_SHIFT		60
+#define ZUFS_DEV_BLOCKS_MASK		0x0FFFFFFFFFFFFFFF
+
+struct md_dev_id {
+	uuid_le	uuid;
+	__le64	blocks;
+} __packed;
+
+static inline __u64 __dev_id_blocks(struct md_dev_id *dev)
+{
+	return le64_to_cpu(dev->blocks) & ZUFS_DEV_BLOCKS_MASK;
+}
+
+static inline int __dev_id_nid(struct md_dev_id *dev)
+{
+	return (int)(le64_to_cpu(dev->blocks) >> ZUFS_DEV_NUMA_SHIFT);
+}
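+
+/* e.g. (sketch): a format tool would encode both fields in one word:
+ *	dev->blocks = cpu_to_le64(nr_blocks |
+ *				  ((__u64)nid << ZUFS_DEV_NUMA_SHIFT));
+ */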
+
+/* 64 is the nicest number that still fits when the ZDT is 2048 bytes, and
+ * 6 bits can fit in the page struct for address-to-block translation.
+ */
+#define MD_DEV_MAX   64
+
+struct md_dev_list {
+	__le16		   id_index;	/* index of current dev in list */
+	__le16		   t1_count;	/* # of t1 devs */
+	__le16		   t2_count;	/* # of t2 devs (after t1_count) */
+	__le16		   reserved;	/* align to 64 bit */
+	struct md_dev_id dev_ids[MD_DEV_MAX];
+} __attribute__((aligned(64)));
+
+/*
+ * Structure of the on disk zufs device table
+ * NOTE: zufs_dev_table is always of size ZUFS_SB_SIZE. These below are the
+ *   currently defined/used members in this version.
+ *   TODO: remove the s_ from all the fields
+ */
+struct zufs_dev_table {
+	/* static fields. they never change after file system creation.
+	 * checksum only validates up to s_start_dynamic field below
+	 */
+	__le16		s_sum;              /* checksum of this sb */
+	__le16		s_version;          /* zdt-version */
+	__le32		s_magic;            /* magic signature */
+	uuid_le		s_uuid;		    /* 128-bit uuid */
+	__le64		s_flags;
+	__le64		s_t1_blocks;
+	__le64		s_t2_blocks;
+
+	struct md_dev_list s_dev_list;
+
+	char		s_start_dynamic[0];
+
+	/* all the dynamic fields should go here */
+	__le64		s_mtime;		/* mount time */
+	__le64		s_wtime;		/* write time */
+};
+
+static inline int msb_major_version(struct zufs_dev_table *msb)
+{
+	return le16_to_cpu(msb->s_version) / ZUFS_MINORS_PER_MAJOR;
+}
+
+static inline int msb_minor_version(struct zufs_dev_table *msb)
+{
+	return le16_to_cpu(msb->s_version) % ZUFS_MINORS_PER_MAJOR;
+}
+
+#define ZUFS_SB_STATIC_SIZE(ps) ((u64)&ps->s_start_dynamic - (u64)ps)
+
 /* ~~~~~ ZUFS API ioctl commands ~~~~~ */
 enum {
 	ZUS_API_MAP_MAX_PAGES	= 1024,
@@ -143,6 +269,46 @@ struct  zufs_ioc_mount {
 };
 #define ZU_IOC_MOUNT	_IOWR('S', 12, struct zufs_ioc_mount)
 
+/* pmem  */
+struct zufs_ioc_pmem {
+	/* Set by zus */
+	struct zufs_ioc_hdr hdr;
+	__u32 pmem_kern_id;
+
+	/* Returned to zus */
+	__u64 pmem_total_blocks;
+	__u32 max_nodes;
+	__u32 active_pmem_nodes;
+	struct zufs_pmem_info {
+		int sections;
+		struct zufs_pmem_sec {
+			__u32 length;
+			__u16 numa_id;
+			__u16 numa_index;
+		} secs[MD_DEV_MAX];
+	} pmem;
+
+	/* Variable-length array mapping a CPU to the proper active pmem
+	 * to use. ZUS starts with a 4k buffer; if that is too small,
+	 * hdr.err is set to -ETOSMALL and max_cpu_id is set to the
+	 * needed amount, so ZUS can retry with a bigger buffer.
+	 *
+	 * Careful: this is a user-mode pointer; if not needed by the
+	 * server it is set to NULL.
+	 *
+	 * @max_cpu_id is set by the server to say how much space there
+	 * is at numa_info; the Kernel returns the actual number of
+	 * active CPUs.
+	 */
+	struct zufs_numa_info {
+		__u32 max_cpu_id;
+		__u32 pad;
+		struct zufs_cpu_info {
+			__u32 numa_id;
+			__u32 numa_index;
+		} numa_id_map[];
+	} *numa_info;
+};
+/* A GRAB is never ungrabbed; umount or file close cleans it all */
+#define ZU_IOC_GRAB_PMEM	_IOWR('S', 12, struct zufs_ioc_pmem)
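+
+/* Example of the intended ZUS grab sequence (a sketch only; the buffer
+ * size and helpers are illustrative):
+ *
+ *	pmem.numa_info = calloc(1, 4096);
+ *	pmem.numa_info->max_cpu_id = cpus_that_fit_in(4096);
+ *	ioctl(zuf_fd, ZU_IOC_GRAB_PMEM, &pmem);
+ *	if (pmem.hdr.err == -ETOSMALL)
+ *		(re-allocate room for max_cpu_id CPUs and retry)
+ */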
+
 /* ZT init */
 struct zufs_ioc_init {
 	struct zufs_ioc_hdr hdr;
@@ -169,9 +335,20 @@ struct zufs_ioc_wait_operation {
  */
 enum e_zufs_operation {
 	ZUS_OP_NULL = 0,
+	ZUS_OP_STATFS,
 
 	ZUS_OP_BREAK,		/* Kernel telling Server to exit */
 	ZUS_OP_MAX_OPT,
 };
 
+/* ZUS_OP_STATFS */
+struct zufs_ioc_statfs {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_sb_info *zus_sbi;
+
+	/* OUT */
+	struct statfs64 statfs_out;
+};
+
 #endif /* _LINUX_ZUFS_API_H */
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC 6/7] zuf: Filesystem operations
  2018-03-13 17:28 ` [RFC 6/7] zuf: Filesystem operations Boaz Harrosh
@ 2018-03-13 17:39   ` Boaz Harrosh
  0 siblings, 0 replies; 39+ messages in thread
From: Boaz Harrosh @ 2018-03-13 17:39 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Ric Wheeler, Miklos Szeredi, Steve French, Steven Whitehouse,
	Jefff moyer, Sage Weil, Jan Kara, Amir Goldstein, Andy Rudof,
	Anna Schumaker, Amit Golander, Sagi Manole, Shachar Sharon


The principle for all operations is the same.

* The few parameters are given on the dispatch IOCTL
  buffer (up to 4k of parameters)

* Any application buffers or other big buffers
  like readdir are mapped via zuf-core to the Server VM

* The operation is dispatched (see the sketch below). The return
  code and a few out parameters are returned in the dispatch-return
  buffer. Any data is stored/read at the mapped application buffers.

* zus objects like zus_inode, symlinks and so on are returned
  through a dpp_t (Dual port pointer) - a special kind of zuf
  construct that enables having a Kernel pointer and a Server
  pointer to the same memory. If pmem is used this is usually
  a pointer to pmem.

  The Kernel's zuf part may even write through this returned pointer,
  but it will then send a dispatch, for the FS to persist the
  change.
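
Below is a minimal sketch of that flow, for illustration only. The
zufs_ioc_example struct and the ZUS_OP_EXAMPLE opcode are hypothetical;
zufs_dispatch(), the hdr fields and the ZUII()/SBI() helpers are the
real ones the ops in this patch use:

	struct zufs_ioc_example {
		struct zufs_ioc_hdr hdr;	/* op code, in/out lens, err */
		struct zus_inode_info *zus_ii;	/* Server-side cookie */
		__u64 param;			/* up to 4k of parameters */
	};

	static int zuf_example_op(struct inode *inode, __u64 param)
	{
		struct zufs_ioc_example ioc = {
			.hdr.in_len = sizeof(ioc),
			.hdr.out_len = sizeof(ioc),
			.hdr.operation = ZUS_OP_EXAMPLE,
			.zus_ii = ZUII(inode)->zus_ii,
			.param = param,
		};
		int err;

		/* big buffers would go in the pages/nump arguments so
		 * zuf-core can map them to the Server VM
		 */
		err = zufs_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc.hdr,
				    NULL, 0);
		if (unlikely(err))
			zuf_err("zufs_dispatch => %d\n", err);
		return err;
	}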

TODO:
	This patch is probably too big. how best to split it?

Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile    |   2 +-
 fs/zuf/_extern.h   |  48 +++++
 fs/zuf/directory.c | 156 ++++++++++++++
 fs/zuf/file.c      | 403 ++++++++++++++++++++++++++++++++++
 fs/zuf/inode.c     | 617 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/zuf/namei.c     | 421 ++++++++++++++++++++++++++++++++++++
 fs/zuf/symlink.c   |  76 +++++++
 fs/zuf/zus_api.h   | 234 ++++++++++++++++++++
 8 files changed, 1953 insertions(+), 4 deletions(-)
 create mode 100644 fs/zuf/directory.c
 create mode 100644 fs/zuf/file.c
 create mode 100644 fs/zuf/namei.c
 create mode 100644 fs/zuf/symlink.c

diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index 94ce80b..4c125f7 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -17,5 +17,5 @@ zuf-y += md.o t2.o t1.o
 zuf-y += zuf-core.o zuf-root.o
 
 # Main FS
-zuf-y += super.o inode.o
+zuf-y += super.o inode.o directory.o file.o namei.o symlink.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index 0543fd8..cf2e80f 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -19,16 +19,43 @@
  * extern functions declarations
  */
 
+/* directory.c */
+int zuf_add_dentry(struct inode *dir, struct qstr *str,
+		   struct inode *inode, bool rename);
+int zuf_remove_dentry(struct inode *dir, struct qstr *str);
+
 /* inode.c */
+int zuf_evict_dispatch(struct super_block *sb, struct zus_inode_info *zus_ii,
+		       int operation);
 struct inode *zuf_iget(struct super_block *sb, struct zus_inode_info *zus_ii,
 		       zu_dpp_t _zi, bool *exist);
 void zuf_evict_inode(struct inode *inode);
+struct inode *zuf_new_inode(struct inode *dir, umode_t mode,
+			    const struct qstr *qstr, const char *symname,
+			    ulong rdev_or_isize, bool tmpfile);
 int zuf_write_inode(struct inode *inode, struct writeback_control *wbc);
+int zuf_update_time(struct inode *inode, struct timespec *time, int flags);
+int zuf_setattr(struct dentry *dentry, struct iattr *attr);
+int zuf_getattr(const struct path *path, struct kstat *stat,
+		 u32 request_mask, unsigned int flags);
+void zuf_set_inode_flags(struct inode *inode, struct zus_inode *zi);
+bool zuf_dir_emit(struct super_block *sb, struct dir_context *ctx,
+		  ulong ino, const char *name, int length);
+
+/* symlink.c */
+uint zuf_prepare_symname(struct zufs_ioc_new_inode *ioc_new_inode,
+			const char *symname, ulong len, struct page *pages[2]);
+
+/* file.c */
 int zuf_isync(struct inode *inode, loff_t start, loff_t end, int datasync);
 
 /* super.c */
 int zuf_init_inodecache(void);
 void zuf_destroy_inodecache(void);
+
+void zuf_sync_inc(struct inode *inode);
+void zuf_sync_dec(struct inode *inode, ulong write_unmapped);
+
 struct dentry *zuf_mount(struct file_system_type *fs_type, int flags,
 			 const char *dev_name, void *data);
 
@@ -56,4 +83,25 @@ int zuf_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs);
 /* t1.c */
 int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma);
 
+/*
+ * Inodes and files operations
+ */
+
+/* dir.c */
+extern const struct file_operations zuf_dir_operations;
+
+/* file.c */
+extern const struct inode_operations zuf_file_inode_operations;
+extern const struct file_operations zuf_file_operations;
+
+/* inode.c */
+extern const struct address_space_operations zuf_aops;
+
+/* namei.c */
+extern const struct inode_operations zuf_dir_inode_operations;
+extern const struct inode_operations zuf_special_inode_operations;
+
+/* symlink.c */
+extern const struct inode_operations zuf_symlink_inode_operations;
+
 #endif	/*ndef __ZUF_EXTERN_H__*/
diff --git a/fs/zuf/directory.c b/fs/zuf/directory.c
new file mode 100644
index 0000000..f8f68b8
--- /dev/null
+++ b/fs/zuf/directory.c
@@ -0,0 +1,156 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * File operations for directories.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>"
+ */
+
+#include <linux/fs.h>
+#include <linux/vmalloc.h>
+#include "zuf.h"
+
+static int zuf_readdir(struct file *file, struct dir_context *ctx)
+{
+	struct inode *inode = file_inode(file);
+	struct super_block *sb = inode->i_sb;
+	loff_t i_size = i_size_read(inode);
+	struct zufs_ioc_readdir ioc_readdir = {
+		.hdr.in_len = sizeof(ioc_readdir),
+		.hdr.out_len = sizeof(ioc_readdir),
+		.hdr.operation = ZUS_OP_READDIR,
+		.dir_ii = ZUII(inode)->zus_ii,
+	};
+	struct zufs_readdir_iter rdi;
+	struct page *pages[ZUS_API_MAP_MAX_PAGES];
+	struct zufs_dir_entry *zde;
+	void *addr, *__a;
+	uint nump, i;
+	int err;
+
+	if (ctx->pos && i_size <= ctx->pos)
+		return 0;
+	if (!i_size)
+		i_size = PAGE_SIZE; /* Just for the . && .. */
+
+	ioc_readdir.hdr.len = min_t(loff_t, i_size - ctx->pos,
+				    ZUS_API_MAP_MAX_SIZE);
+	nump = md_o2p_up(ioc_readdir.hdr.len);
+	addr = vzalloc(md_p2o(nump));
+	if (unlikely(!addr))
+		return -ENOMEM;
+
+	WARN_ON((ulong)addr & (PAGE_SIZE - 1));
+
+	__a = addr;
+	for (i = 0; i < nump; ++i) {
+		pages[i] = vmalloc_to_page(__a);
+		__a += PAGE_SIZE;
+	}
+
+more:
+	ioc_readdir.pos = ctx->pos;
+
+	err = zufs_dispatch(ZUF_ROOT(SBI(sb)), &ioc_readdir.hdr, pages, nump);
+	if (unlikely(err)) {
+		zuf_err("zufs_dispatch failed => %d\n", err);
+		goto out;
+	}
+
+	zufs_readdir_iter_init(&rdi, &ioc_readdir, addr);
+	while ((zde = zufs_next_zde(&rdi)) != NULL) {
+		zuf_dbg_verbose("%s pos=0x%lx\n",
+				zde->zstr.name, (ulong)zde->pos);
+		ctx->pos = zde->pos;
+		if (!dir_emit(ctx, zde->zstr.name, zde->zstr.len, zde->ino,
+			      zde->type))
+			goto out;
+	}
+	ctx->pos = ioc_readdir.pos;
+	if (ioc_readdir.more) {
+		zuf_dbg_err("more\n");
+		goto more;
+	}
+out:
+	vfree(addr);
+	return err;
+}
+
+/*
+ * FIXME: comment to full git diff
+ */
+
+static int _dentry_dispatch(struct inode *dir, struct inode *inode,
+			    struct qstr *str, int operation)
+{
+	struct zufs_ioc_dentry ioc_dentry = {
+		.hdr.operation = operation,
+		.hdr.in_len = sizeof(ioc_dentry),
+		.hdr.out_len = sizeof(ioc_dentry),
+		.zus_ii = inode ? ZUII(inode)->zus_ii : NULL,
+		.zus_dir_ii = ZUII(dir)->zus_ii,
+		.str.len = str->len,
+	};
+	int err;
+
+	memcpy(&ioc_dentry.str.name, str->name, str->len);
+
+	err = zufs_dispatch(ZUF_ROOT(SBI(dir->i_sb)), &ioc_dentry.hdr, NULL, 0);
+	if (unlikely(err)) {
+		zuf_err("op=%d zufs_dispatch failed => %d\n", operation, err);
+		return err;
+	}
+
+	return 0;
+}
+
+/* return pointer to added de on success, err-code on failure */
+int zuf_add_dentry(struct inode *dir, struct qstr *str, struct inode *inode,
+		   bool rename)
+{
+	struct zuf_inode_info *zii = ZUII(dir);
+	int err;
+
+	if (!str->len || !zii->zi)
+		return -EINVAL;
+
+	zus_inode_cmtime_now(dir, zii->zi);
+	err = _dentry_dispatch(dir, inode, str, ZUS_OP_ADD_DENTRY);
+	if (unlikely(err)) {
+		zuf_err("_dentry_dispatch failed => %d\n", err);
+		return err;
+	}
+	i_size_write(dir, le64_to_cpu(zii->zi->i_size));
+
+	return 0;
+}
+
+int zuf_remove_dentry(struct inode *dir, struct qstr *str)
+{
+	struct zuf_inode_info *zii = ZUII(dir);
+	int err;
+
+	if (!str->len)
+		return -EINVAL;
+
+	zus_inode_cmtime_now(dir, zii->zi);
+	err = _dentry_dispatch(dir, NULL, str, ZUS_OP_REMOVE_DENTRY);
+	if (unlikely(err))
+		return err;
+
+	i_size_write(dir, le64_to_cpu(zii->zi->i_size));
+	return 0;
+}
+
+const struct file_operations zuf_dir_operations = {
+	.llseek		= generic_file_llseek,
+	.read		= generic_read_dir,
+	.iterate_shared	= zuf_readdir,
+	.fsync		= noop_fsync,
+};
diff --git a/fs/zuf/file.c b/fs/zuf/file.c
new file mode 100644
index 0000000..3b37d9f
--- /dev/null
+++ b/fs/zuf/file.c
@@ -0,0 +1,403 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * File operations for files.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>"
+ */
+
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uio.h>
+#include <linux/mm.h>
+#include <linux/uaccess.h>
+#include <linux/falloc.h>
+#include <linux/mman.h>
+#include <linux/fadvise.h>
+#include "zuf.h"
+
+static long zuf_fallocate(struct file *file, int mode, loff_t offset,
+			   loff_t len)
+{
+	struct inode *inode = file->f_path.dentry->d_inode;
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zufs_ioc_range ioc_range = {
+		.hdr.in_len = sizeof(ioc_range),
+		.hdr.operation = ZUS_OP_FALLOCATE,
+		.zus_ii = ZUII(inode)->zus_ii,
+		.offset = offset,
+		.length = len,
+		.opflags = mode,
+	};
+	int err;
+
+	zuf_dbg_vfs("[%ld] mode=0x%x offset=0x%llx len=0x%llx\n",
+		     inode->i_ino, mode, offset, len);
+
+	if (!S_ISREG(inode->i_mode))
+		return -EINVAL;
+
+	zuf_w_lock(zii);
+
+	if (!(mode & FALLOC_FL_KEEP_SIZE) &&
+	     (i_size_read(inode) < offset + len)) {
+		err = inode_newsize_ok(inode, offset + len);
+		if (unlikely(err))
+			goto out;
+	}
+
+	zus_inode_cmtime_now(inode, zii->zi);
+
+	err = zufs_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_range.hdr,
+			    NULL, 0);
+	if (unlikely(err))
+		zuf_err("zufs_dispatch failed => %d\n", err);
+
+	i_size_write(inode, le64_to_cpu(zii->zi->i_size));
+	inode->i_blocks = le64_to_cpu(zii->zi->i_blocks);
+
+out:
+	zuf_w_unlock(zii);
+
+	return err;
+}
+
+static loff_t zuf_llseek(struct file *file, loff_t offset, int whence)
+{
+	struct inode *inode = file->f_path.dentry->d_inode;
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zufs_ioc_seek ioc_seek = {
+		.hdr.in_len = sizeof(ioc_seek),
+		.hdr.out_len = sizeof(ioc_seek),
+		.hdr.operation = ZUS_OP_LLSEEK,
+		.zus_ii = zii->zus_ii,
+		.offset_in = offset,
+		.whence = whence,
+	};
+	int err = 0;
+
+	zuf_dbg_vfs("[%ld] offset=0x%llx whence=%d\n",
+		     inode->i_ino, offset, whence);
+
+	if (whence != SEEK_DATA && whence != SEEK_HOLE)
+		return generic_file_llseek(file, offset, whence);
+
+	zuf_r_lock(zii);
+
+	if ((offset < 0 && !(file->f_mode & FMODE_UNSIGNED_OFFSET)) ||
+	    offset > inode->i_sb->s_maxbytes) {
+		err = -EINVAL;
+		goto out;
+	} else if (inode->i_size <= offset) {
+		err = -ENXIO;
+		goto out;
+	} else if (!inode->i_blocks) {
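+		/* No blocks allocated: the whole file is one big hole */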
+		if (whence == SEEK_HOLE)
+			ioc_seek.offset_out = i_size_read(inode);
+		else
+			err = -ENXIO;
+		goto out;
+	}
+
+	err = zufs_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_seek.hdr, NULL, 0);
+	if (unlikely(err)) {
+		zuf_err("zufs_dispatch failed => %d\n", err);
+		goto out;
+	}
+
+	if (ioc_seek.offset_out != file->f_pos) {
+		file->f_pos = ioc_seek.offset_out;
+		file->f_version = 0;
+	}
+
+out:
+	zuf_r_unlock(zii);
+
+	return err ?: ioc_seek.offset_out;
+}
+
+/* This function is called by both msync() and fsync(). */
+int zuf_isync(struct inode *inode, loff_t start, loff_t end, int datasync)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zufs_ioc_range ioc_range = {
+		.hdr.in_len = sizeof(ioc_range),
+		.hdr.operation = ZUS_OP_SYNC,
+		.zus_ii = zii->zus_ii,
+		.offset = start,
+		.opflags = datasync,
+	};
+	loff_t isize;
+	ulong uend = end + 1;
+	int err = 0;
+
+	zuf_dbg_vfs(
+		"[%ld] start=0x%llx end=0x%llx  datasync=%d write_mapped=%d\n",
+		inode->i_ino, start, end, datasync,
+		atomic_read(&zii->write_mapped));
+
+	/* We want to serialize syncs so they don't fight with each other;
+	 * it is also more efficient that way. But we do not want to lock out
+	 * reads/writes and page-faults, so we use a special sync semaphore.
+	 */
+	zuf_smw_lock(zii);
+
+	isize = i_size_read(inode);
+	if (!isize) {
+		zuf_dbg_mmap("[%ld] file is empty\n", inode->i_ino);
+		goto out;
+	}
+	if (isize < uend)
+		uend = isize;
+	if (uend < start) {
+		zuf_dbg_mmap("[%ld] isize=0x%llx start=0x%llx end=0x%lx\n",
+				 inode->i_ino, isize, start, uend);
+		err = -ENODATA;
+		goto out;
+	}
+
+	if (!atomic_read(&zii->write_mapped))
+		goto out; /* Nothing to do on this inode */
+
+	ioc_range.length = uend - start;
+	unmap_mapping_range(inode->i_mapping, start, ioc_range.length, 0);
+
+	err = zufs_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_range.hdr,
+			    NULL, 0);
+	if (unlikely(err))
+		zuf_err("zufs_dispatch failed => %d\n", err);
+
+	zuf_sync_dec(inode, ioc_range.write_unmapped);
+
+out:
+	zuf_smw_unlock(zii);
+	return err;
+}
+
+static int zuf_fsync(struct file *file, loff_t start, loff_t end, int datasync)
+{
+	return zuf_isync(file_inode(file), start, end, datasync);
+}
+
+/* This callback is called when a file is closed */
+static int zuf_flush(struct file *file, fl_owner_t id)
+{
+	zuf_dbg_vfs("[%ld]\n", file->f_inode->i_ino);
+
+	return 0;
+}
+
+static int zuf_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+		       u64 offset, u64 len)
+{
+	int err = 0;
+	ulong start_index = md_o2p(offset);
+	ulong end_index = md_o2p_up(offset + len);
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	zuf_dbg_vfs(
+		"[%ld] offset=0x%llx len=0x%llx i-start=0x%lx i-end=0x%lx\n",
+		inode->i_ino, offset, len, start_index, end_index);
+
+	if (fiemap_check_flags(fieinfo, FIEMAP_FLAG_SYNC))
+		return -EBADR;
+
+	zuf_r_lock(zii);
+
+	/* TODO: ZUS fiemap (&msi)*/
+
+	zuf_r_unlock(zii);
+	return err;
+}
+
+static void _lock_two_ziis(struct zuf_inode_info *zii1,
+			   struct zuf_inode_info *zii2)
+{
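+	/* Take the pair in address order so that concurrent lockers
+	 * cannot ABBA-deadlock against each other.
+	 */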
+	if (zii1 > zii2)
+		swap(zii1, zii2);
+
+	zuf_w_lock(zii1);
+	if (zii1 != zii2)
+		zuf_w_lock_nested(zii2);
+}
+
+static void _unlock_two_ziis(struct zuf_inode_info *zii1,
+		      struct zuf_inode_info *zii2)
+{
+	if (zii1 > zii2)
+		swap(zii1, zii2);
+
+	if (zii1 != zii2)
+		zuf_w_unlock(zii2);
+	zuf_w_unlock(zii1);
+}
+
+static int _clone_file_range(struct inode *src_inode, loff_t pos_in,
+			     struct inode *dst_inode, loff_t pos_out,
+			     u64 len, int operation)
+{
+	struct zuf_inode_info *src_zii = ZUII(src_inode);
+	struct zuf_inode_info *dst_zii = ZUII(dst_inode);
+	struct zus_inode *dst_zi = dst_zii->zi;
+	struct super_block *sb = src_inode->i_sb;
+	struct zufs_ioc_clone ioc_clone = {
+		.hdr.in_len = sizeof(ioc_clone),
+		.hdr.out_len = sizeof(ioc_clone),
+		.hdr.operation = operation,
+		.src_zus_ii = src_zii->zus_ii,
+		.dst_zus_ii = dst_zii->zus_ii,
+		.pos_in = pos_in,
+		.pos_out = pos_out,
+		.len = len,
+	};
+	int err;
+
+	_lock_two_ziis(src_zii, dst_zii);
+
+	/* NOTE: len==0 means to-end-of-file which is what we want */
+	unmap_mapping_range(src_inode->i_mapping, pos_in,  len, 0);
+	unmap_mapping_range(dst_inode->i_mapping, pos_out, len, 0);
+
+	zus_inode_cmtime_now(dst_inode, dst_zi);
+	err = zufs_dispatch(ZUF_ROOT(SBI(sb)), &ioc_clone.hdr, NULL, 0);
+	if (unlikely(err)) {
+		zuf_err("failed to clone %ld -> %ld ; err=%d\n",
+			 src_inode->i_ino, dst_inode->i_ino, err);
+		goto out;
+	}
+
+	dst_inode->i_blocks = le64_to_cpu(dst_zi->i_blocks);
+	i_size_write(dst_inode, le64_to_cpu(dst_zi->i_size));
+
+out:
+	_unlock_two_ziis(src_zii, dst_zii);
+
+	return err;
+}
+
+static int zuf_clone_file_range(struct file *file_in, loff_t pos_in,
+				struct file *file_out, loff_t pos_out, u64 len)
+{
+	struct inode *src_inode = file_inode(file_in);
+	struct inode *dst_inode = file_inode(file_out);
+	ulong src_size = i_size_read(src_inode);
+	ulong dst_size = i_size_read(dst_inode);
+	struct super_block *sb = src_inode->i_sb;
+	ulong len_up = len;
+	int err;
+
+	zuf_dbg_vfs(
+		"ino-in=%ld ino-out=%ld pos_in=0x%llx pos_out=0x%llx length=0x%llx\n",
+		src_inode->i_ino, dst_inode->i_ino, pos_in, pos_out, len);
+
+	if (src_inode == dst_inode) {
+		if (pos_in == pos_out) {
+			zuf_dbg_err("[%ld] Clone nothing!!\n",
+				src_inode->i_ino);
+			return 0;
+		}
+		if (pos_in < pos_out) {
+			if (pos_in + len > pos_out) {
+				zuf_dbg_err(
+					"[%ld] overlapping pos_in < pos_out?? => EINVAL\n",
+					src_inode->i_ino);
+				return -EINVAL;
+			}
+		} else {
+			if (pos_out + len > pos_in) {
+				zuf_dbg_err("[%ld] overlapping pos_out < pos_in?? => EINVAL\n",
+					src_inode->i_ino);
+				return -EINVAL;
+			}
+		}
+	}
+
+	if ((pos_in & (sb->s_blocksize - 1)) ||
+	    (pos_out & (sb->s_blocksize - 1))) {
+		zuf_err("[%ld] Not aligned len=0x%llx pos_in=0x%llx "
+			"pos_out=0x%llx src-size=0x%llx dst-size=0x%llx\n",
+			 src_inode->i_ino, len, pos_in, pos_out,
+			 i_size_read(src_inode), i_size_read(dst_inode));
+		return -EINVAL;
+	}
+
+	/* The standard says that len==0 means up to the end of SRC */
+	if (!len)
+		len_up = len = src_size - pos_in;
+
+	if (!pos_in && !pos_out && (src_size <= pos_in + len) &&
+	    (dst_size <= src_size)) {
+		len_up = 0;
+	} else if (len & (sb->s_blocksize - 1)) {
+		/* un-aligned len, see if it is beyond EOF */
+		if ((src_size > pos_in  + len) ||
+		    (dst_size > pos_out + len)) {
+			zuf_err("[%ld] Not aligned len=0x%llx pos_in=0x%llx "
+				"pos_out=0x%llx src-size=0x%lx dst-size=0x%lx\n",
+				src_inode->i_ino, len, pos_in, pos_out,
+				src_size, dst_size);
+			return -EINVAL;
+		}
+		len_up = md_p2o(md_o2p_up(len));
+	}
+
+	err = _clone_file_range(src_inode, pos_in, dst_inode, pos_out, len_up,
+				ZUS_OP_CLONE);
+	if (unlikely(err))
+		zuf_err("_clone_file_range failed => %d\n", err);
+
+	return err;
+}
+
+static ssize_t zuf_copy_file_range(struct file *file_in, loff_t pos_in,
+				   struct file *file_out, loff_t pos_out,
+				   size_t len, uint flags)
+{
+	struct inode *src_inode = file_inode(file_in);
+	struct inode *dst_inode = file_inode(file_out);
+	ssize_t ret;
+
+	zuf_dbg_vfs("ino-in=%ld ino-out=%ld pos_in=0x%llx pos_out=0x%llx length=0x%lx\n",
+		    src_inode->i_ino, dst_inode->i_ino, pos_in, pos_out, len);
+
+	ret = _clone_file_range(src_inode, pos_in, dst_inode, pos_out, len,
+				ZUS_OP_COPY);
+
+	return ret ?: len;
+}
+
+/* ZUFS:
+ * make sure we clean up the resources consumed by zufs_init()
+ */
+static int zuf_file_release(struct inode *inode, struct file *filp)
+{
+	if (unlikely(filp->private_data))
+		zuf_err("not yet\n");
+
+	return 0;
+}
+
+const struct file_operations zuf_file_operations = {
+	.llseek			= zuf_llseek,
+	.open			= generic_file_open,
+	.fsync			= zuf_fsync,
+	.flush			= zuf_flush,
+	.release		= zuf_file_release,
+	.fallocate		= zuf_fallocate,
+	.copy_file_range	= zuf_copy_file_range,
+	.clone_file_range	= zuf_clone_file_range,
+};
+
+const struct inode_operations zuf_file_inode_operations = {
+	.setattr	= zuf_setattr,
+	.getattr	= zuf_getattr,
+	.update_time	= zuf_update_time,
+	.fiemap		= zuf_fiemap,
+};
diff --git a/fs/zuf/inode.c b/fs/zuf/inode.c
index 7aa8c9e..1129aae 100644
--- a/fs/zuf/inode.c
+++ b/fs/zuf/inode.c
@@ -12,16 +12,424 @@
  *	Sagi Manole <sagim@netapp.com>"
  */
 
+#include <linux/fs.h>
+#include <linux/aio.h>
+#include <linux/highuid.h>
+#include <linux/module.h>
+#include <linux/mpage.h>
+#include <linux/backing-dev.h>
+#include <linux/types.h>
+#include <linux/ratelimit.h>
+#include <linux/posix_acl_xattr.h>
+#include <linux/security.h>
+#include <linux/delay.h>
 #include "zuf.h"
 
+/* Flags that should be inherited by new inodes from their parent. */
+#define ZUFS_FL_INHERITED (FS_SECRM_FL | FS_UNRM_FL | FS_COMPR_FL |	\
+			FS_SYNC_FL | FS_NODUMP_FL | FS_NOATIME_FL |	\
+			FS_COMPRBLK_FL | FS_NOCOMP_FL |			\
+			FS_JOURNAL_DATA_FL | FS_NOTAIL_FL | FS_DIRSYNC_FL)
+/* Flags that are appropriate for regular files (all but dir-specific ones). */
+#define ZUFS_REG_FLMASK (~(FS_DIRSYNC_FL | FS_TOPDIR_FL))
+/* Flags that are appropriate for non-directories/regular files. */
+#define ZUFS_OTHER_FLMASK (FS_NODUMP_FL | FS_NOATIME_FL)
+
+
+static bool _zi_valid(struct zus_inode *zi)
+{
+	if (!_zi_active(zi))
+		return false;
+
+	switch (le16_to_cpu(zi->i_mode) & S_IFMT) {
+	case S_IFREG:
+	case S_IFDIR:
+	case S_IFLNK:
+	case S_IFBLK:
+	case S_IFCHR:
+	case S_IFIFO:
+	case S_IFSOCK:
+		return true;
+	default:
+		zuf_err("unknown file type ino=%lld mode=%d\n", zi->i_ino,
+			  zi->i_mode);
+		return false;
+	}
+}
+
+static void _set_inode_from_zi(struct inode *inode, struct zus_inode *zi)
+{
+	inode->i_mode = le16_to_cpu(zi->i_mode);
+	inode->i_uid = KUIDT_INIT(le32_to_cpu(zi->i_uid));
+	inode->i_gid = KGIDT_INIT(le32_to_cpu(zi->i_gid));
+	set_nlink(inode, le16_to_cpu(zi->i_nlink));
+	inode->i_size = le64_to_cpu(zi->i_size);
+	mt_to_timespec(&inode->i_atime, &zi->i_atime);
+	mt_to_timespec(&inode->i_ctime, &zi->i_ctime);
+	mt_to_timespec(&inode->i_mtime, &zi->i_mtime);
+	inode->i_generation = le64_to_cpu(zi->i_generation);
+	zuf_set_inode_flags(inode, zi);
+
+	inode->i_blocks = le64_to_cpu(zi->i_blocks);
+	inode->i_mapping->a_ops = &zuf_aops;
+
+	switch (inode->i_mode & S_IFMT) {
+	case S_IFREG:
+		inode->i_op = &zuf_file_inode_operations;
+		inode->i_fop = &zuf_file_operations;
+		break;
+	case S_IFDIR:
+		inode->i_op = &zuf_dir_inode_operations;
+		inode->i_fop = &zuf_dir_operations;
+		break;
+	case S_IFLNK:
+		inode->i_op = &zuf_symlink_inode_operations;
+		break;
+	case S_IFBLK:
+	case S_IFCHR:
+	case S_IFIFO:
+	case S_IFSOCK:
+		inode->i_size = 0;
+		inode->i_op = &zuf_special_inode_operations;
+		init_special_inode(inode, inode->i_mode,
+				   le32_to_cpu(zi->i_rdev));
+		break;
+	default:
+		zuf_err("unknown file type ino=%lld mode=%d\n", zi->i_ino,
+			  zi->i_mode);
+		break;
+	}
+
+	inode->i_ino = le64_to_cpu(zi->i_ino);
+}
+
+static void zuf_get_inode_flags(struct inode *inode, struct zus_inode *zi)
+{
+	unsigned int flags = inode->i_flags;
+	unsigned int zufs_flags = le32_to_cpu(zi->i_flags);
+
+	zufs_flags &= ~(FS_SYNC_FL | FS_APPEND_FL | FS_IMMUTABLE_FL |
+			FS_NOATIME_FL | FS_DIRSYNC_FL);
+	if (flags & S_SYNC)
+		zufs_flags |= FS_SYNC_FL;
+	if (flags & S_APPEND)
+		zufs_flags |= FS_APPEND_FL;
+	if (flags & S_IMMUTABLE)
+		zufs_flags |= FS_IMMUTABLE_FL;
+	if (flags & S_NOATIME)
+		zufs_flags |= FS_NOATIME_FL;
+	if (flags & S_DIRSYNC)
+		zufs_flags |= FS_DIRSYNC_FL;
+
+	zi->i_flags = cpu_to_le32(zufs_flags);
+}
+
+/* Mask out flags that are inappropriate for the given type of inode. */
+static __le32 _mask_flags(umode_t mode, __le32 flags)
+{
+	flags &= cpu_to_le32(ZUFS_FL_INHERITED);
+	if (S_ISDIR(mode))
+		return flags;
+	else if (S_ISREG(mode))
+		return flags & cpu_to_le32(ZUFS_REG_FLMASK);
+	else
+		return flags & cpu_to_le32(ZUFS_OTHER_FLMASK);
+}
+
+static int _set_zi_from_inode(struct inode *dir, struct zus_inode *zi,
+			      struct inode *inode)
+{
+	struct zus_inode *zidir = zus_zi(dir);
+
+	if (unlikely(!zidir))
+		return -EACCES;
+
+	zi->i_flags = _mask_flags(inode->i_mode, zidir->i_flags);
+	zi->i_mode = cpu_to_le16(inode->i_mode);
+	zi->i_uid = cpu_to_le32(__kuid_val(inode->i_uid));
+	zi->i_gid = cpu_to_le32(__kgid_val(inode->i_gid));
+	/* NOTE: zus is boss of i_nlink (but let it know what we think) */
+	zi->i_nlink = cpu_to_le16(inode->i_nlink);
+	zi->i_size = cpu_to_le64(inode->i_size);
+	zi->i_blocks = cpu_to_le64(inode->i_blocks);
+	timespec_to_mt(&zi->i_atime, &inode->i_atime);
+	timespec_to_mt(&zi->i_mtime, &inode->i_mtime);
+	timespec_to_mt(&zi->i_ctime, &inode->i_ctime);
+	zi->i_generation = cpu_to_le64(inode->i_generation);
+	zuf_get_inode_flags(inode, zi);
+
+	if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode))
+		zi->i_rdev = cpu_to_le32(inode->i_rdev);
+
+	return 0;
+}
+
+static bool _times_equal(struct timespec *t, __le64 *mt)
+{
+	__le64 time;
+
+	timespec_to_mt(&time, t);
+	return time == *mt;
+}
+
+/* This function checks if VFS's inode and zus_inode are in sync */
+static void _warn_inode_dirty(struct inode *inode, struct zus_inode *zi)
+{
+#define __MISMATCH_INT(inode, X, Y)	\
+	if (X != Y)			\
+		zuf_warn("[%ld] " #X"=0x%lx " #Y"=0x%lx""\n",	\
+			  inode->i_ino, (ulong)(X), (ulong)(Y))
+#define __MISMATCH_TIME(inode, X, Y)	\
+	if (!_times_equal(X, Y)) {	\
+		struct timespec t;	\
+		mt_to_timespec(&t, (Y));\
+		zuf_warn("[%ld] " #X"=%ld:%ld " #Y"=%ld:%ld""\n",	\
+			  inode->i_ino, (X)->tv_sec, (X)->tv_nsec,	\
+			  t.tv_sec, t.tv_nsec);		\
+	}
+
+	if (!_times_equal(&inode->i_ctime, &zi->i_ctime) ||
+	    !_times_equal(&inode->i_mtime, &zi->i_mtime) ||
+	    !_times_equal(&inode->i_atime, &zi->i_atime) ||
+	    inode->i_size != le64_to_cpu(zi->i_size) ||
+	    inode->i_mode != le16_to_cpu(zi->i_mode) ||
+	    __kuid_val(inode->i_uid) != le32_to_cpu(zi->i_uid) ||
+	    __kgid_val(inode->i_gid) != le32_to_cpu(zi->i_gid) ||
+	    inode->i_nlink != le16_to_cpu(zi->i_nlink) ||
+	    inode->i_ino != _zi_ino(zi) ||
+	    inode->i_blocks != le64_to_cpu(zi->i_blocks)) {
+		__MISMATCH_TIME(inode, &inode->i_ctime, &zi->i_ctime);
+		__MISMATCH_TIME(inode, &inode->i_mtime, &zi->i_mtime);
+		__MISMATCH_TIME(inode, &inode->i_atime, &zi->i_atime);
+		__MISMATCH_INT(inode, inode->i_size, le64_to_cpu(zi->i_size));
+		__MISMATCH_INT(inode, inode->i_mode, le16_to_cpu(zi->i_mode));
+		__MISMATCH_INT(inode, __kuid_val(inode->i_uid),
+			      le32_to_cpu(zi->i_uid));
+		__MISMATCH_INT(inode, __kgid_val(inode->i_gid),
+			      le32_to_cpu(zi->i_gid));
+		__MISMATCH_INT(inode, inode->i_nlink, le16_to_cpu(zi->i_nlink));
+		__MISMATCH_INT(inode, inode->i_ino, _zi_ino(zi));
+		__MISMATCH_INT(inode, inode->i_blocks,
+			      le64_to_cpu(zi->i_blocks));
+	}
+}
+
+static void _zii_connect(struct inode *inode, struct zus_inode *zi,
+			 struct zus_inode_info *zus_ii)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	zii->zi = zi;
+	zii->zus_ii = zus_ii;
+}
+
 struct inode *zuf_iget(struct super_block *sb, struct zus_inode_info *zus_ii,
 		       zu_dpp_t _zi, bool *exist)
 {
-	return ERR_PTR(-ENOMEM);
+	struct zuf_sb_info *sbi = SBI(sb);
+	struct zus_inode *zi = md_addr_verify(sbi->md, _zi);
+	struct inode *inode;
+
+	if (unlikely(!zi)) {
+		/* Don't trust ZUS pointers */
+		zuf_err("Bad zus_inode 0x%llx\n", _zi);
+		return ERR_PTR(-EIO);
+	}
+	if (unlikely(!zus_ii)) {
+		zuf_err("zus_ii NULL\n");
+		return ERR_PTR(-EIO);
+	}
+
+	if (!_zi_valid(zi)) {
+		zuf_err("inactive node ino=%lld links=%d mode=%d\n", zi->i_ino,
+			  zi->i_nlink, zi->i_mode);
+		return ERR_PTR(-ESTALE);
+	}
+
+	zuf_dbg_zus("[%lld] size=0x%llx, blocks=0x%llx ct=0x%llx mt=0x%llx link=0x%x mode=0x%x xattr=0x%llx\n",
+		    zi->i_ino, zi->i_size, zi->i_blocks, zi->i_ctime,
+		    zi->i_mtime, zi->i_nlink, zi->i_mode, zi->i_xattr);
+
+	inode = iget_locked(sb, _zi_ino(zi));
+	if (unlikely(!inode))
+		return ERR_PTR(-ENOMEM);
+
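+	/* Already in the inode cache? Report it so the caller can drop
+	 * the extra zus reference (see the lookup-race handling in
+	 * zuf_lookup()).
+	 */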
+	if (!(inode->i_state & I_NEW)) {
+		*exist = true;
+		return inode;
+	}
+
+	*exist = false;
+	_set_inode_from_zi(inode, zi);
+	_zii_connect(inode, zi, zus_ii);
+
+	unlock_new_inode(inode);
+	return inode;
+}
+
+int zuf_evict_dispatch(struct super_block *sb, struct zus_inode_info *zus_ii,
+		       int operation)
+{
+	struct zufs_ioc_evict_inode ioc_evict_inode = {
+		.hdr.in_len = sizeof(ioc_evict_inode),
+		.hdr.out_len = sizeof(ioc_evict_inode),
+		.hdr.operation = operation,
+		.zus_ii = zus_ii,
+	};
+	int err;
+
+	err = zufs_dispatch(ZUF_ROOT(SBI(sb)), &ioc_evict_inode.hdr, NULL, 0);
+	if (unlikely(err))
+		zuf_err("zufs_dispatch failed op=%d => %d\n",
+			 operation, err);
+	return err;
 }
 
 void zuf_evict_inode(struct inode *inode)
 {
+	struct super_block *sb = inode->i_sb;
+	struct zuf_inode_info *zii = ZUII(inode);
+	int operation;
+	int write_mapped;
+
+	if (!inode->i_nlink) {
+		if (unlikely(!zii->zi)) {
+			zuf_dbg_err("[%ld] inode without zi mode=0x%x size=0x%llx\n",
+				    inode->i_ino, inode->i_mode, inode->i_size);
+			goto out;
+		}
+
+		if (unlikely(is_bad_inode(inode)))
+			zuf_warn("[%ld] inode is bad mode=0x%x zi=%p\n",
+				  inode->i_ino, inode->i_mode, zii->zi);
+		else
+			_warn_inode_dirty(inode, zii->zi);
+
+		operation = ZUS_OP_FREE_INODE;
+	} else {
+		zuf_dbg_verbose("[%ld] inode is going down?\n", inode->i_ino);
+
+		if (unlikely(!inode || !sb || !sb->s_root ||
+			     !sb->s_root->d_inode ||
+			     !sb->s_root->d_inode->i_mapping))
+			goto out;
+
+		operation = ZUS_OP_EVICT_INODE;
+	}
+
+	zuf_evict_dispatch(sb, zii->zus_ii, operation);
+
+out:
+	zii->zus_ii = NULL;
+	zii->zi = NULL;
+
+	if (zii->zero_page) {
+		zii->zero_page->mapping = NULL;
+		__free_pages(zii->zero_page, 0);
+		zii->zero_page = NULL;
+	}
+
+	/* ZUS on evict has synced all mmap dirty pages, YES? */
+	write_mapped = atomic_read(&zii->write_mapped);
+	if (unlikely(write_mapped || !list_empty(&zii->i_mmap_dirty))) {
+		zuf_dbg_mmap("[%ld] !!!! write_mapped=%d list_empty=%d\n",
+			      inode->i_ino, write_mapped,
+			      list_empty(&zii->i_mmap_dirty));
+		zuf_sync_dec(inode, write_mapped);
+	}
+
+	clear_inode(inode);
+}
+
+/* @rdev_or_isize is i_size in the case of a symlink
+ * and rdev in the case of special-files
+ */
+struct inode *zuf_new_inode(struct inode *dir, umode_t mode,
+			    const struct qstr *qstr, const char *symname,
+			    ulong rdev_or_isize, bool tmpfile)
+{
+	struct super_block *sb = dir->i_sb;
+	struct zuf_sb_info *sbi = SBI(sb);
+	struct zufs_ioc_new_inode ioc_new_inode = {
+		.hdr.in_len = sizeof(ioc_new_inode),
+		.hdr.out_len = sizeof(ioc_new_inode),
+		.hdr.operation = ZUS_OP_NEW_INODE,
+		.dir_ii = ZUII(dir)->zus_ii,
+		.flags = tmpfile ? ZI_TMPFILE : 0,
+		.str.len = qstr->len,
+	};
+	struct inode *inode;
+	struct zus_inode *zi;
+	struct page *pages[2];
+	uint nump = 0;
+	int err;
+
+	memcpy(&ioc_new_inode.str.name, qstr->name, qstr->len);
+
+	inode = new_inode(sb);
+	if (!inode)
+		return ERR_PTR(-ENOMEM);
+
+	inode_init_owner(inode, dir, mode);
+	inode->i_blocks = inode->i_size = 0;
+	inode->i_ctime = inode->i_mtime = current_time(dir);
+	inode->i_atime = inode->i_ctime;
+
+	zuf_dbg_verbose("inode=%p name=%s\n", inode, qstr->name);
+
+	zuf_set_inode_flags(inode, &ioc_new_inode.zi);
+
+	err = _set_zi_from_inode(dir, &ioc_new_inode.zi, inode);
+	if (unlikely(err))
+		goto fail;
+
+	if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode) ||
+	    S_ISFIFO(inode->i_mode) || S_ISSOCK(inode->i_mode)) {
+		init_special_inode(inode, mode, rdev_or_isize);
+	} else if (symname) {
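+		/* Short symlink names are embedded in the zi itself; long
+		 * ones travel to the server via the pages[] array
+		 * (see zuf_prepare_symname())
+		 */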
+		inode->i_size = rdev_or_isize;
+		nump = zuf_prepare_symname(&ioc_new_inode, symname,
+					   rdev_or_isize, pages);
+	}
+
+	err = zufs_dispatch(ZUF_ROOT(sbi), &ioc_new_inode.hdr, pages, nump);
+	if (unlikely(err)) {
+		zuf_err("zufs_dispatch failed => %d\n", err);
+		goto fail;
+	}
+	zi = md_addr(sbi->md, ioc_new_inode._zi);
+
+	_zii_connect(inode, zi, ioc_new_inode.zus_ii);
+
+	/* update inode fields from filesystem inode */
+	inode->i_ino = le64_to_cpu(zi->i_ino);
+	inode->i_size = le64_to_cpu(zi->i_size);
+	inode->i_generation = le64_to_cpu(zi->i_generation);
+	inode->i_blocks = le64_to_cpu(zi->i_blocks);
+	set_nlink(inode, le16_to_cpu(zi->i_nlink));
+	i_size_write(dir, le64_to_cpu(zus_zi(dir)->i_size));
+
+	zuf_dbg_zus("[%lld] size=0x%llx, blocks=0x%llx ct=0x%llx mt=0x%llx link=0x%x mode=0x%x xattr=0x%llx\n",
+		    zi->i_ino, zi->i_size, zi->i_blocks, zi->i_ctime,
+		    zi->i_mtime, zi->i_nlink, zi->i_mode, zi->i_xattr);
+
+	zuf_dbg_verbose("allocating inode %ld (zi=%p)\n", _zi_ino(zi), zi);
+
+	err = insert_inode_locked(inode);
+	if (unlikely(err)) {
+		zuf_err("[%ld:%s] generation=%lld insert_inode_locked => %d\n",
+			 inode->i_ino, qstr->name, zi->i_generation, err);
+		goto fail;
+	}
+
+	return inode;
+
+fail:
+	clear_nlink(inode);
+	make_bad_inode(inode);
+	iput(inode);
+	return ERR_PTR(err);
 }
 
 int zuf_write_inode(struct inode *inode, struct writeback_control *wbc)
@@ -38,8 +446,211 @@ int zuf_write_inode(struct inode *inode, struct writeback_control *wbc)
 	return 0;
 }
 
-/* This function is called by msync(), fsync() && sync_fs(). */
-int zuf_isync(struct inode *inode, loff_t start, loff_t end, int datasync)
+/*
+ * Mostly here to support file_accessed(), which is the only caller we
+ * care about for now. But file_update_time() is also used by the fifo
+ * code.
+ */
+int zuf_update_time(struct inode *inode, struct timespec *time, int flags)
+{
+	struct zus_inode *zi = zus_zi(inode);
+	struct zufs_ioc_attr ioc_attr = {
+		.hdr.in_len = sizeof(ioc_attr),
+		.hdr.out_len = sizeof(ioc_attr),
+		.hdr.operation = ZUS_OP_UPDATE_TIME,
+		.zus_ii = ZUII(inode)->zus_ii,
+	};
+	int err;
+
+	if (flags & S_ATIME) {
+		ioc_attr.zuf_attr |= STATX_ATIME;
+		inode->i_atime = *time;
+		timespec_to_mt(&zi->i_atime, &inode->i_atime);
+	}
+
+	/* for Support of file_update_time() */
+	if ((flags & S_CTIME) || (flags & S_MTIME) || (flags & S_VERSION)) {
+		if (flags & S_VERSION) {
+			ioc_attr.zuf_attr |= ZUFS_STATX_VERSION;
+			inode_inc_iversion(inode);
+			zi->i_generation = cpu_to_le64(inode->i_version);
+		}
+		if (flags & S_CTIME) {
+			ioc_attr.zuf_attr |= STATX_CTIME;
+			inode->i_ctime = *time;
+			timespec_to_mt(&zi->i_ctime, &inode->i_ctime);
+		}
+		if (flags & S_MTIME) {
+			ioc_attr.zuf_attr |= STATX_MTIME;
+			inode->i_mtime = *time;
+			timespec_to_mt(&zi->i_mtime, &inode->i_mtime);
+		}
+	}
+
+	if (ioc_attr.zuf_attr == 0)
+		return 0;
+
+	err = zufs_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_attr.hdr, NULL, 0);
+	if (unlikely(err))
+		zuf_err("zufs_dispatch failed => %d\n", err);
+
+	return err;
+}
+
+int zuf_getattr(const struct path *path, struct kstat *stat, u32 request_mask,
+		unsigned int flags)
+{
+	struct dentry *dentry = path->dentry;
+	struct inode *inode = d_inode(dentry);
+
+	if (inode->i_flags & S_APPEND)
+		stat->attributes |= STATX_ATTR_APPEND;
+	if (inode->i_flags & S_IMMUTABLE)
+		stat->attributes |= STATX_ATTR_IMMUTABLE;
+
+	stat->attributes_mask |= (STATX_ATTR_APPEND |
+				  STATX_ATTR_IMMUTABLE);
+	generic_fillattr(inode, stat);
+	/* stat->blocks should be the number of 512B blocks */
+	stat->blocks = inode->i_blocks << (inode->i_sb->s_blocksize_bits - 9);
+
+	return 0;
+}
+
+int zuf_setattr(struct dentry *dentry, struct iattr *attr)
+{
+	struct inode *inode = dentry->d_inode;
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zus_inode *zi = zii->zi;
+	struct zufs_ioc_attr ioc_attr = {
+		.hdr.in_len = sizeof(ioc_attr),
+		.hdr.out_len = sizeof(ioc_attr),
+		.hdr.operation = ZUS_OP_SETATTR,
+		.zus_ii = zii->zus_ii,
+	};
+	int err;
+
+	if (!zi)
+		return -EACCES;
+
+	err = setattr_prepare(dentry, attr);
+	if (unlikely(err))
+		return err;
+
+	if (attr->ia_valid & ATTR_MODE) {
+		zuf_dbg_vfs("[%ld] ATTR_MODE=0x%x\n",
+			     inode->i_ino, attr->ia_mode);
+		ioc_attr.zuf_attr |= STATX_MODE;
+		inode->i_mode = attr->ia_mode;
+		zi->i_mode = cpu_to_le16(inode->i_mode);
+		if (test_opt(SBI(inode->i_sb), POSIXACL)) {
+			err = posix_acl_chmod(inode, inode->i_mode);
+			if (unlikely(err))
+				return err;
+		}
+	}
+
+	if (attr->ia_valid & ATTR_UID) {
+		zuf_dbg_vfs("[%ld] ATTR_UID=0x%x\n",
+			     inode->i_ino, __kuid_val(attr->ia_uid));
+		ioc_attr.zuf_attr |= STATX_UID;
+		inode->i_uid = attr->ia_uid;
+		zi->i_uid = cpu_to_le32(__kuid_val(inode->i_uid));
+	}
+	if (attr->ia_valid & ATTR_GID) {
+		zuf_dbg_vfs("[%ld] ATTR_GID=0x%x\n",
+			     inode->i_ino, __kgid_val(attr->ia_gid));
+		ioc_attr.zuf_attr |= STATX_GID;
+		inode->i_gid = attr->ia_gid;
+		zi->i_gid = cpu_to_le32(__kgid_val(inode->i_gid));
+	}
+
+	if ((attr->ia_valid & ATTR_SIZE)) {
+		zuf_dbg_vfs("[%ld] ATTR_SIZE=0x%llx\n",
+			     inode->i_ino, attr->ia_size);
+		if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
+		      S_ISLNK(inode->i_mode))) {
+			zuf_err("[%ld] wrong file mode=%x\n",
+				inode->i_ino, inode->i_mode);
+			return -EINVAL;
+		}
+		ioc_attr.zuf_attr |= STATX_SIZE;
+
+		ZUF_CHECK_I_W_LOCK(inode);
+		zuf_smw_lock(zii);
+
+		/* Make all mmap() users FAULT for truncated pages */
+		unmap_mapping_range(inode->i_mapping,
+				    attr->ia_size + PAGE_SIZE - 1, 0, 1);
+
+		ioc_attr.truncate_size = attr->ia_size;
+		/* on attr_size we want to update times as well */
+		attr->ia_valid |= ATTR_CTIME | ATTR_MTIME;
+	}
+
+	if (attr->ia_valid & ATTR_ATIME) {
+		ioc_attr.zuf_attr |= STATX_ATIME;
+		inode->i_atime = attr->ia_atime;
+		timespec_to_mt(&zi->i_atime, &inode->i_atime);
+		zuf_dbg_vfs("[%ld] ATTR_ATIME=0x%llx\n",
+			     inode->i_ino, zi->i_atime);
+	}
+	if (attr->ia_valid & ATTR_CTIME) {
+		ioc_attr.zuf_attr |= STATX_CTIME;
+		inode->i_ctime = attr->ia_ctime;
+		timespec_to_mt(&zi->i_ctime, &inode->i_ctime);
+		zuf_dbg_vfs("[%ld] ATTR_CTIME=0x%llx\n",
+			     inode->i_ino, zi->i_ctime);
+	}
+	if (attr->ia_valid & ATTR_MTIME) {
+		ioc_attr.zuf_attr |= STATX_MTIME;
+		inode->i_mtime = attr->ia_mtime;
+		timespec_to_mt(&zi->i_mtime, &inode->i_mtime);
+		zuf_dbg_vfs("[%ld] ATTR_MTIME=0x%llx\n",
+			     inode->i_ino, zi->i_mtime);
+	}
+
+	err = zufs_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_attr.hdr, NULL, 0);
+	if (unlikely(err))
+		zuf_err("zufs_dispatch failed => %d\n", err);
+
+	if ((attr->ia_valid & ATTR_SIZE)) {
+		i_size_write(inode, le64_to_cpu(zi->i_size));
+		inode->i_blocks = le64_to_cpu(zi->i_blocks);
+
+		zuf_smw_unlock(zii);
+	}
+
+	return err;
+}
+
+void zuf_set_inode_flags(struct inode *inode, struct zus_inode *zi)
+{
+	unsigned int flags = le32_to_cpu(zi->i_flags);
+
+	inode->i_flags &=
+		~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC);
+	if (flags & FS_SYNC_FL)
+		inode->i_flags |= S_SYNC;
+	if (flags & FS_APPEND_FL)
+		inode->i_flags |= S_APPEND;
+	if (flags & FS_IMMUTABLE_FL)
+		inode->i_flags |= S_IMMUTABLE;
+	if (flags & FS_NOATIME_FL)
+		inode->i_flags |= S_NOATIME;
+	if (flags & FS_DIRSYNC_FL)
+		inode->i_flags |= S_DIRSYNC;
+	if (!zi->i_xattr)
+		inode_has_no_xattr(inode);
+}
+
+/* direct_IO is never called. We provide an empty stub so that
+ * open(O_DIRECT) will succeed.
+ */
+static ssize_t zuf_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
+	WARN_ON(1);
 	return 0;
 }
+const struct address_space_operations zuf_aops = {
+	.direct_IO		= zuf_direct_IO,
+};
diff --git a/fs/zuf/namei.c b/fs/zuf/namei.c
new file mode 100644
index 0000000..179069b
--- /dev/null
+++ b/fs/zuf/namei.c
@@ -0,0 +1,421 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Inode operations for directories.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>"
+ */
+#include <linux/fs.h>
+#include "zuf.h"
+
+
+static struct inode *d_parent(struct dentry *dentry)
+{
+	return dentry->d_parent->d_inode;
+}
+
+static void _instantiate_unlock(struct dentry *dentry, struct inode *inode)
+{
+	d_instantiate(dentry, inode);
+	unlock_new_inode(inode);
+}
+
+static struct dentry *zuf_lookup(struct inode *dir, struct dentry *dentry,
+				 uint flags)
+{
+	struct super_block *sb = dir->i_sb;
+	struct qstr *str = &dentry->d_name;
+	uint in_len = offsetof(struct zufs_ioc_lookup, _zi);
+	struct zufs_ioc_lookup ioc_lu = {
+		.hdr.in_len = in_len,
+		.hdr.out_start = in_len,
+		.hdr.out_len = sizeof(ioc_lu) - in_len,
+		.hdr.operation = ZUS_OP_LOOKUP,
+		.dir_ii = ZUII(dir)->zus_ii,
+		.str.len = str->len,
+	};
+	struct inode *inode = NULL;
+	bool exist = false;
+	int err;
+
+	zuf_dbg_vfs("[%ld] dentry-name=%s\n", dir->i_ino, dentry->d_name.name);
+
+	if (dentry->d_name.len > ZUFS_NAME_LEN)
+		return ERR_PTR(-ENAMETOOLONG);
+
+	memcpy(&ioc_lu.str.name, str->name, str->len);
+
+	err = zufs_dispatch(ZUF_ROOT(SBI(sb)), &ioc_lu.hdr, NULL, 0);
+	if (unlikely(err)) {
+		zuf_dbg_err("zufs_dispatch failed => %d\n", err);
+		goto out;
+	}
+
+	inode = zuf_iget(dir->i_sb, ioc_lu.zus_ii, ioc_lu._zi, &exist);
+	if (exist) {
+		zuf_dbg_err("race in lookup\n");
+		zuf_evict_dispatch(sb, ioc_lu.zus_ii, ZUS_OP_EVICT_INODE);
+	}
+
+out:
+	return d_splice_alias(inode, dentry);
+}
+
+/*
+ * By the time this is called, we already have created
+ * the directory cache entry for the new file, but it
+ * is so far negative - it has no inode.
+ *
+ * If the create succeeds, we fill in the inode information
+ * with d_instantiate().
+ */
+static int zuf_create(struct inode *dir, struct dentry *dentry, umode_t mode,
+		      bool excl)
+{
+	struct inode *inode;
+
+	zuf_dbg_vfs("[%ld] dentry-name=%s mode=0x%x\n",
+		     dir->i_ino, dentry->d_name.name, mode);
+
+	inode = zuf_new_inode(dir, mode, &dentry->d_name, NULL, 0, false);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	inode->i_op = &zuf_file_inode_operations;
+	inode->i_mapping->a_ops = &zuf_aops;
+	inode->i_fop = &zuf_file_operations;
+
+	_instantiate_unlock(dentry, inode);
+
+	return 0;
+}
+
+static int zuf_mknod(struct inode *dir, struct dentry *dentry, umode_t mode,
+		     dev_t rdev)
+{
+	struct inode *inode;
+
+	zuf_dbg_vfs("[%ld] mode=0x%x rdev=0x%x\n", dir->i_ino, mode, rdev);
+
+	inode = zuf_new_inode(dir, mode, &dentry->d_name, NULL, rdev, false);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	inode->i_op = &zuf_special_inode_operations;
+
+	_instantiate_unlock(dentry, inode);
+
+	return 0;
+}
+
+static int zuf_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+	struct inode *inode;
+
+	inode = zuf_new_inode(dir, mode, &dentry->d_name, NULL, 0, true);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	/* TODO: See about more ephemeral operations on this file, around
+	 * mmap and such.
+	 * Must also handle the tmpfile mode that is later linked in via
+	 * linkat() (probably the !O_EXCL flag)
+	 */
+	inode->i_op = &zuf_file_inode_operations;
+	inode->i_mapping->a_ops = &zuf_aops;
+	inode->i_fop = &zuf_file_operations;
+
+	set_nlink(inode, 1); /* user_mode knows nothing */
+	d_tmpfile(dentry, inode);
+	/* tmpfiles operate with nlink=0. Since this is a tmp file we do not
+	 * care about cl_flushing. If this file is later linked into a dir,
+	 * the add_dentry will flush the zi.
+	 */
+	zus_zi(inode)->i_nlink = cpu_to_le16(inode->i_nlink);
+
+	unlock_new_inode(inode);
+	return 0;
+}
+
+static int zuf_symlink(struct inode *dir, struct dentry *dentry,
+		       const char *symname)
+{
+	struct inode *inode;
+	ulong len;
+
+	zuf_dbg_vfs("[%ld] de->name=%s symname=%s\n",
+			dir->i_ino, dentry->d_name.name, symname);
+
+	len = strlen(symname);
+	if (len + 1 > ZUFS_MAX_SYMLINK)
+		return -ENAMETOOLONG;
+
+	inode = zuf_new_inode(dir, S_IFLNK|S_IRWXUGO, &dentry->d_name,
+			       symname, len, false);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	inode->i_op = &zuf_symlink_inode_operations;
+	inode->i_mapping->a_ops = &zuf_aops;
+
+	_instantiate_unlock(dentry, inode);
+
+	return 0;
+}
+
+static int zuf_link(struct dentry *dest_dentry, struct inode *dir,
+		    struct dentry *dentry)
+{
+	struct inode *inode = dest_dentry->d_inode;
+	int err;
+
+	zuf_dbg_vfs("[%ld] dentry-ino=%ld dentry-name=%s dentry-parent=%ld dest_d-ino=%ld dest_d-name=%s\n",
+		     dir->i_ino, inode->i_ino, dentry->d_name.name,
+		     d_parent(dentry)->i_ino,
+		     dest_dentry->d_inode->i_ino, dest_dentry->d_name.name);
+
+	if (inode->i_nlink >= ZUFS_LINK_MAX)
+		return -EMLINK;
+
+	ihold(inode);
+
+	err = zuf_add_dentry(dir, &dentry->d_name, inode, false);
+	if (unlikely(err)) {
+		iput(inode);
+		return err;
+	}
+
+	inode->i_ctime = current_time(dir);
+
+	set_nlink(inode, le16_to_cpu(zus_zi(inode)->i_nlink));
+
+	d_instantiate(dentry, inode);
+
+	return 0;
+}
+
+static int zuf_unlink(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = dentry->d_inode;
+	int err;
+
+	zuf_dbg_vfs("[%ld] dentry-ino=%ld dentry-name=%s dentry-parent=%ld\n",
+		     dir->i_ino, inode->i_ino, dentry->d_name.name,
+		     d_parent(dentry)->i_ino);
+
+	err = zuf_remove_dentry(dir, &dentry->d_name);
+	if (unlikely(err))
+		return err;
+
+	inode->i_ctime = dir->i_ctime;
+
+	set_nlink(inode, le16_to_cpu(ZUII(inode)->zi->i_nlink));
+
+	return 0;
+}
+
+static int zuf_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+	struct inode *inode;
+
+	zuf_dbg_vfs("[%ld] dentry-name=%s dentry-parent=%ld mode=0x%x\n",
+		     dir->i_ino, dentry->d_name.name, d_parent(dentry)->i_ino,
+		     mode);
+
+	if (dir->i_nlink >= ZUFS_LINK_MAX)
+		return -EMLINK;
+
+	inode = zuf_new_inode(dir, S_IFDIR | mode, &dentry->d_name, NULL, 0,
+			      false);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	inode->i_op = &zuf_dir_inode_operations;
+	inode->i_fop = &zuf_dir_operations;
+	inode->i_mapping->a_ops = &zuf_aops;
+
+	set_nlink(dir, le16_to_cpu(ZUII(inode)->zi->i_nlink));
+
+	_instantiate_unlock(dentry, inode);
+
+	return 0;
+}
+
+static bool _empty_dir(struct inode *dir)
+{
+	if (dir->i_nlink != 2) {
+		zuf_warn("[%ld] directory has nlink(%d) != 2\n",
+			  dir->i_ino, dir->i_nlink);
+		return false;
+	}
+	/* NOTE: The nlink check above is not the only -ENOTEMPTY case; the
+	 * zus-fs must also check the "only-files, no subdirs" case and
+	 * return -ENOTEMPTY itself.
+	 */
+	return true;
+}
+
+static int zuf_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = dentry->d_inode;
+	int err;
+
+	zuf_dbg_vfs("[%ld] dentry-ino=%ld dentry-name=%s dentry-parent=%ld\n",
+		     dir->i_ino, inode->i_ino, dentry->d_name.name,
+		     d_parent(dentry)->i_ino);
+
+	if (!inode)
+		return -ENOENT;
+
+	if (!_empty_dir(inode))
+		return -ENOTEMPTY;
+
+	err = zuf_remove_dentry(dir, &dentry->d_name);
+	if (unlikely(err))
+		return err;
+
+	inode->i_ctime = dir->i_ctime;
+
+	set_nlink(inode, le16_to_cpu(zus_zi(inode)->i_nlink));
+	set_nlink(dir, le16_to_cpu(zus_zi(dir)->i_nlink));
+
+	return 0;
+}
+
+/* Structure of a directory element; */
+struct zuf_dir_element {
+	__le64  ino;
+	char name[254];
+};
+
+static int _rename_exchange(struct inode *old_inode, struct inode *new_inode,
+			    struct inode *old_dir, struct inode *new_dir)
+{
+	/* A subdir holds a ref on parent, see if we need to exchange refs */
+	if ((S_ISDIR(old_inode->i_mode) != S_ISDIR(new_inode->i_mode)) &&
+	    (old_dir != new_dir)) {
+		if (S_ISDIR(old_inode->i_mode)) {
+			if (ZUFS_LINK_MAX <= new_dir->i_nlink)
+				return -EMLINK;
+		} else {
+			if (ZUFS_LINK_MAX <= old_dir->i_nlink)
+				return -EMLINK;
+		}
+	}
+
+	set_nlink(old_dir, le16_to_cpu(zus_zi(old_dir)->i_nlink));
+	set_nlink(new_dir, le16_to_cpu(zus_zi(new_dir)->i_nlink));
+
+	/* Update Directory times */
+	mt_to_timespec(&old_dir->i_mtime, &zus_zi(old_dir)->i_mtime);
+	mt_to_timespec(&old_dir->i_ctime, &zus_zi(old_dir)->i_ctime);
+	if (old_dir != new_dir) {
+		mt_to_timespec(&new_dir->i_mtime, &zus_zi(new_dir)->i_mtime);
+		mt_to_timespec(&new_dir->i_ctime, &zus_zi(new_dir)->i_ctime);
+	}
+	return 0;
+}
+
+static int zuf_rename(struct inode *old_dir, struct dentry *old_dentry,
+		      struct inode *new_dir, struct dentry *new_dentry,
+		      uint flags)
+{
+	struct inode *old_inode = d_inode(old_dentry);
+	struct inode *new_inode = d_inode(new_dentry);
+	struct zuf_sb_info *sbi = SBI(old_inode->i_sb);
+	struct zufs_ioc_rename ioc_rename = {
+		.hdr.in_len = sizeof(ioc_rename),
+		.hdr.out_len = sizeof(ioc_rename),
+		.hdr.operation = ZUS_OP_RENAME,
+		.old_dir_ii = ZUII(old_dir)->zus_ii,
+		.new_dir_ii = ZUII(new_dir)->zus_ii,
+		.old_zus_ii = old_inode ? ZUII(old_inode)->zus_ii : NULL,
+		.new_zus_ii = new_inode ? ZUII(new_inode)->zus_ii : NULL,
+		.old_d_str.len = old_dentry->d_name.len,
+		.new_d_str.len = new_dentry->d_name.len,
+	};
+	struct timespec time = current_time(old_dir);
+	int err;
+
+	zuf_dbg_vfs("old_inode=%ld new_inode=%ld old_name=%s new_name=%s f=0x%x\n",
+		     old_inode ? old_inode->i_ino : 0,
+		     new_inode ? new_inode->i_ino : 0, old_dentry->d_name.name,
+		     new_dentry->d_name.name, flags);
+
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE /*| RENAME_WHITEOUT*/))
+		return -EINVAL;
+
+	if (!(flags & RENAME_EXCHANGE) && S_ISDIR(old_inode->i_mode)) {
+		if (new_inode) {
+			if (!_empty_dir(new_inode))
+				return -ENOTEMPTY;
+		} else if (ZUFS_LINK_MAX <= new_dir->i_nlink) {
+			return -EMLINK;
+		}
+	}
+
+	memcpy(&ioc_rename.old_d_str.name, old_dentry->d_name.name,
+		old_dentry->d_name.len);
+	memcpy(&ioc_rename.new_d_str.name, new_dentry->d_name.name,
+		new_dentry->d_name.len);
+	timespec_to_mt(&ioc_rename.time, &time);
+
+	err = zufs_dispatch(ZUF_ROOT(sbi), &ioc_rename.hdr, NULL, 0);
+	if (unlikely(err)) {
+		zuf_err("zufs_dispatch failed => %d\n", err);
+		return err;
+	}
+
+	if (flags & RENAME_EXCHANGE)
+		return _rename_exchange(old_inode, new_inode, old_dir, new_dir);
+
+	mt_to_timespec(&new_dir->i_mtime, &zus_zi(new_dir)->i_mtime);
+	mt_to_timespec(&new_dir->i_ctime, &zus_zi(new_dir)->i_ctime);
+
+	if (new_inode) {
+		struct zus_inode *new_zi = zus_zi(new_inode);
+
+		set_nlink(new_inode, le16_to_cpu(new_zi->i_nlink));
+		mt_to_timespec(&new_inode->i_ctime, &new_zi->i_ctime);
+	} else {
+		struct zus_inode *old_zi = zus_zi(old_inode);
+
+		mt_to_timespec(&old_inode->i_ctime, &old_zi->i_ctime);
+	}
+
+	if (S_ISDIR(old_inode->i_mode)) {
+		set_nlink(old_dir, le16_to_cpu(zus_zi(old_dir)->i_nlink));
+		if (!new_inode)
+			set_nlink(new_dir,
+				  le16_to_cpu(zus_zi(new_dir)->i_nlink));
+	}
+
+	return 0;
+}
+
+const struct inode_operations zuf_dir_inode_operations = {
+	.create		= zuf_create,
+	.lookup		= zuf_lookup,
+	.link		= zuf_link,
+	.unlink		= zuf_unlink,
+	.symlink	= zuf_symlink,
+	.mkdir		= zuf_mkdir,
+	.rmdir		= zuf_rmdir,
+	.mknod		= zuf_mknod,
+	.tmpfile	= zuf_tmpfile,
+	.rename		= zuf_rename,
+	.setattr	= zuf_setattr,
+	.getattr	= zuf_getattr,
+	.update_time	= zuf_update_time,
+};
+
+const struct inode_operations zuf_special_inode_operations = {
+	.setattr	= zuf_setattr,
+	.getattr	= zuf_getattr,
+	.update_time	= zuf_update_time,
+};
diff --git a/fs/zuf/symlink.c b/fs/zuf/symlink.c
new file mode 100644
index 0000000..8188225
--- /dev/null
+++ b/fs/zuf/symlink.c
@@ -0,0 +1,76 @@
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Symlink operations
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>"
+ */
+
+#include "zuf.h"
+
+/* Can never fail; all checks were already made before calling.
+ * Short symlink names are embedded in the zus_inode itself; longer names
+ * are handed to the server via one or two pages (the name may straddle a
+ * page boundary).
+ * Returns: the number of pages stored in @pages
+ */
+uint zuf_prepare_symname(struct zufs_ioc_new_inode *ioc_new_inode,
+			 const char *symname, ulong len,
+			 struct page *pages[2])
+{
+	uint nump;
+
+	ioc_new_inode->zi.i_size = cpu_to_le64(len);
+	if (len < sizeof(ioc_new_inode->zi.i_symlink)) {
+		memcpy(&ioc_new_inode->zi.i_symlink, symname, len);
+		return 0;
+	}
+
+	pages[0] = virt_to_page(symname);
+	nump = 1;
+
+	ioc_new_inode->hdr.len = len;
+	ioc_new_inode->hdr.offset = (ulong)symname & (PAGE_SIZE - 1);
+
+	if (PAGE_SIZE < ioc_new_inode->hdr.offset + len) {
+		pages[1] = virt_to_page(symname + PAGE_SIZE);
+		++nump;
+	}
+
+	return nump;
+}
+
+static const char *zuf_get_link(struct dentry *dentry, struct inode *inode,
+				struct delayed_call *notused)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zufs_ioc_get_link ioc_get_link = {
+		.hdr.in_len = sizeof(ioc_get_link),
+		.hdr.out_len = sizeof(ioc_get_link),
+		.hdr.operation = ZUS_OP_GET_SYMLINK,
+		.zus_ii = zii->zus_ii,
+	};
+	int err;
+
+	if (inode->i_size < sizeof(zii->zi->i_symlink))
+		return zii->zi->i_symlink;
+
+	err = zufs_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_get_link.hdr,
+			    NULL, 0);
+	if (unlikely(err)) {
+		zuf_err("zufs_dispatch failed => %d\n", err);
+		return ERR_PTR(err);
+	}
+
+	return md_addr(SBI(inode->i_sb)->md, ioc_get_link._link);
+}
+
+const struct inode_operations zuf_symlink_inode_operations = {
+	.get_link	= zuf_get_link,
+	.update_time	= zuf_update_time,
+	.setattr	= zuf_setattr,
+	.getattr	= zuf_getattr,
+};
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index d461782..5870d63 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -20,6 +20,10 @@
 #include <stddef.h>
 #include <asm/statfs.h>
 
+/* TODO: The STATX_ attribute flags are missing an i_version bit; a patch
+ * should be sent upstream to add it
+ */
+#define ZUFS_STATX_VERSION	0x40000000U
+
 /*
  * Version rules:
  *   This is the zus-to-zuf API version. And not the Filesystem
@@ -337,6 +341,28 @@ enum e_zufs_operation {
 	ZUS_OP_NULL = 0,
 	ZUS_OP_STATFS,
 
+	ZUS_OP_NEW_INODE,
+	ZUS_OP_FREE_INODE,
+	ZUS_OP_EVICT_INODE,
+
+	ZUS_OP_LOOKUP,
+	ZUS_OP_ADD_DENTRY,
+	ZUS_OP_REMOVE_DENTRY,
+	ZUS_OP_RENAME,
+	ZUS_OP_READDIR,
+	ZUS_OP_CLONE,
+	ZUS_OP_COPY,
+
+	ZUS_OP_READ,
+	ZUS_OP_WRITE,
+	ZUS_OP_GET_BLOCK,
+	ZUS_OP_GET_SYMLINK,
+	ZUS_OP_SETATTR,
+	ZUS_OP_UPDATE_TIME,
+	ZUS_OP_SYNC,
+	ZUS_OP_FALLOCATE,
+	ZUS_OP_LLSEEK,
+
 	ZUS_OP_BREAK,		/* Kernel telling Server to exit */
 	ZUS_OP_MAX_OPT,
 };
@@ -351,4 +377,212 @@ struct zufs_ioc_statfs {
 	struct statfs64 statfs_out;
 };
 
+/* zufs_ioc_new_inode flags: */
+enum zi_flags {
+	ZI_TMPFILE = 1,		/* for new_inode */
+	ZI_LOOKUP_RACE = 1,	/* for evict */
+};
+
+struct zufs_str {
+	__u8 len;
+	char name[ZUFS_NAME_LEN];
+};
+
+/* ZUS_OP_NEW_INODE */
+struct zufs_ioc_new_inode {
+	struct zufs_ioc_hdr hdr;
+	 /* IN */
+	struct zus_inode zi;
+	struct zus_inode_info *dir_ii; /* If mktmp this is the root */
+	struct zufs_str str;
+	__u64 flags;
+
+	 /* OUT */
+	zu_dpp_t _zi;
+	struct zus_inode_info *zus_ii;
+};
+
+/* ZUS_OP_FREE_INODE, ZUS_OP_EVICT_INODE */
+struct zufs_ioc_evict_inode {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *zus_ii;
+	__u64 flags;
+};
+
+/* ZUS_OP_LOOKUP */
+struct zufs_ioc_lookup {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *dir_ii;
+	struct zufs_str str;
+
+	 /* OUT */
+	zu_dpp_t _zi;
+	struct zus_inode_info *zus_ii;
+};
+
+/* ZUS_OP_ADD_DENTRY, ZUS_OP_REMOVE_DENTRY */
+struct zufs_ioc_dentry {
+	struct zufs_ioc_hdr hdr;
+	struct zus_inode_info *zus_ii; /* IN */
+	struct zus_inode_info *zus_dir_ii; /* IN */
+	struct zufs_str str; /* IN */
+	__u64 ino; /* OUT - only for lookup */
+};
+
+/* ZUS_OP_RENAME */
+struct zufs_ioc_rename {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *old_dir_ii;
+	struct zus_inode_info *new_dir_ii;
+	struct zus_inode_info *old_zus_ii;
+	struct zus_inode_info *new_zus_ii;
+	struct zufs_str old_d_str;
+	struct zufs_str new_d_str;
+	__le64 time;
+};
+
+/* ZUS_OP_READDIR */
+struct zufs_ioc_readdir {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *dir_ii;
+	loff_t pos;
+
+	/* OUT */
+	__u8	more;
+};
+
+struct zufs_dir_entry {
+	__le64 ino;
+	struct {
+		unsigned	type	: 8;
+		ulong		pos	: 56;
+	};
+	struct zufs_str zstr;
+};
+
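+/* Both ends walk the shared readdir buffer with the iterator below:
+ * the zus server appends entries with zufs_zde_emit() and the kernel
+ * consumes them with zufs_next_zde() (see zuf_readdir()).
+ */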
+struct zufs_readdir_iter {
+	void *__zde, *last;
+	struct zufs_ioc_readdir *ioc_readdir;
+};
+
+enum {
+	E_ZDE_HDR_SIZE = offsetof(struct zufs_dir_entry, zstr) +
+			 offsetof(struct zufs_str, name),
+};
+
+static inline void zufs_readdir_iter_init(struct zufs_readdir_iter *rdi,
+					  struct zufs_ioc_readdir *ioc_readdir,
+					  void *app_ptr)
+{
+	rdi->__zde = app_ptr;
+	rdi->last = app_ptr + ioc_readdir->hdr.len;
+	rdi->ioc_readdir = ioc_readdir;
+	ioc_readdir->more = false;
+}
+
+static inline uint zufs_dir_entry_len(__u8 name_len)
+{
+	return ALIGN(E_ZDE_HDR_SIZE + name_len, sizeof(__u64));
+}
+
+static inline
+struct zufs_dir_entry *zufs_next_zde(struct zufs_readdir_iter *rdi)
+{
+	struct zufs_dir_entry *zde = rdi->__zde;
+	uint len;
+
+	if (rdi->last <= rdi->__zde + E_ZDE_HDR_SIZE)
+		return NULL;
+	if (zde->zstr.len == 0)
+		return NULL;
+	len = zufs_dir_entry_len(zde->zstr.len);
+	if (rdi->last <= rdi->__zde + len)
+		return NULL;
+
+	rdi->__zde += len;
+	return zde;
+}
+
+static inline bool zufs_zde_emit(struct zufs_readdir_iter *rdi, __u64 ino,
+				 __u8 type, __u64 pos, const char *name,
+				 __u8 len)
+{
+	struct zufs_dir_entry *zde = rdi->__zde;
+
+	if (rdi->last <= rdi->__zde + zufs_dir_entry_len(len)) {
+		rdi->ioc_readdir->more = true;
+		return false;
+	}
+
+	rdi->ioc_readdir->more = false;
+	zde->ino = ino;
+	zde->type = type;
+	/*ASSERT(0 == (pos & ~((1ULL << 56) - 1)));*/
+	zde->pos = pos;
+	strncpy(zde->zstr.name, name, len);
+	zde->zstr.len = len;
+	zufs_next_zde(rdi);
+
+	return true;
+}
+
+/* ZUS_OP_GET_SYMLINK */
+struct zufs_ioc_get_link {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *zus_ii;
+
+	/* OUT */
+	zu_dpp_t _link;
+};
+
+/* ZUS_OP_SETATTR */
+struct zufs_ioc_attr {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *zus_ii;
+	__u64 truncate_size;
+	__u32 zuf_attr;
+	__u32 pad;
+};
+
+/* ZUS_OP_SYNC, ZUS_OP_FALLOCATE */
+struct zufs_ioc_range {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *zus_ii;
+	__u64 offset, length;
+	__u32 opflags;
+	__u32 pad;
+
+	/* OUT */
+	__u64 write_unmapped;
+};
+
+/* ZUS_OP_CLONE */
+struct zufs_ioc_clone {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *src_zus_ii;
+	struct zus_inode_info *dst_zus_ii;
+	__u64 pos_in, pos_out;
+	__u64 len;
+};
+
+/* ZUS_OP_LLSEEK */
+struct zufs_ioc_seek {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *zus_ii;
+	__u64 offset_in;
+	__u32 whence;
+	__u32 pad;
+
+	/* OUT */
+	__u64 offset_out;
+};
+
 #endif /* _LINUX_ZUFS_API_H */
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
  2018-03-13 17:15 ` [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU Boaz Harrosh
@ 2018-03-13 18:56   ` Matthew Wilcox
  2018-03-14  8:20     ` Miklos Szeredi
  0 siblings, 1 reply; 39+ messages in thread
From: Matthew Wilcox @ 2018-03-13 18:56 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: linux-fsdevel, Ric Wheeler, Miklos Szeredi, Steve French,
	Steven Whitehouse, Jefff moyer, Sage Weil, Jan Kara,
	Amir Goldstein, Andy Rudof, Anna Schumaker, Amit Golander,
	Sagi Manole, Shachar Sharon

On Tue, Mar 13, 2018 at 07:15:46PM +0200, Boaz Harrosh wrote:
> On a call to mmap an mmap provider (like an FS) can put
> this flag on vma->vm_flags.
> 
> This tells the Kernel that the vma will be used from a single
> core only and therefore invalidation of PTE(s) need not a
> wide CPU scheduling
> 
> The motivation of this flag is the ZUFS project where we want
> to optimally map user-application buffers into a user-mode-server
> execute the operation and efficiently unmap.

I've been looking at something similar, and I prefer my approach,
although I'm not nearly as far along with my implementation as you are.

My approach is also to add a vm_flags bit, tentatively called VM_NOTLB.
The page fault handler refuses to insert any TLB entries into the process
address space.  But follow_page_mask() will return the appropriate struct
page for it.  This should be enough for O_DIRECT accesses to work as
you'll get the appropriate scatterlists built.

I suspect Boaz has already done a lot of thinking about this and doesn't
need the explanation, but here's how it looks for anyone following along
at home:

Process A calls read().
Kernel allocates a page cache page for it and calls the filesystem through
  ->readpages (or ->readpage).
Filesystem calls the managing process to get the data for that page.
Managing process draws a pentagram and summons Beelzebub (or runs Perl;
  whichever you find more scary).
Managing process notifies the filesystem that the page is now full of data.
Filesystem marks the page as being Uptodate and unlocks it.
Process was waiting on the page lock, wakes up and copies the data from the
  page cache into userspace.  read() is complete.

What we're concerned about here is what to do after the managing process
tells the kernel that the read is complete.  Clearly allowing the managing
process continued access to the page is Bad as the page may be freed by the
page cache and then reused for something else.  Doing a TLB shootdown is
expensive.  So Boaz's approach is to have the process promise that it won't
have any other thread look at it.  My approach is to never allow the page
to have load/store access from userspace; it can only be passed to other
system calls.
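
For anyone who wants it spelled out in code: here is a rough sketch of
the fault-path side of such a hypothetical VM_NOTLB flag (the helper
names are invented; nothing below is from either patchset):

	static int notlb_fault(struct vm_fault *vmf)
	{
		/* never insert a PTE; direct load/store gets SIGBUS */
		if (vmf->vma->vm_flags & VM_NOTLB)
			return VM_FAULT_SIGBUS;

		return handle_normal_fault(vmf);	/* invented fallback */
	}

follow_page_mask() would still resolve the struct page for such a vma,
so get_user_pages() and therefore O_DIRECT keep working.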

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 2/7] fs: Add the ZUF filesystem to the build + License
  2018-03-13 17:17 ` [RFC 2/7] fs: Add the ZUF filesystem to the build + License Boaz Harrosh
@ 2018-03-13 20:16   ` Andreas Dilger
  2018-03-14 17:21     ` Boaz Harrosh
  0 siblings, 1 reply; 39+ messages in thread
From: Andreas Dilger @ 2018-03-13 20:16 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: linux-fsdevel, Ric Wheeler, Miklos Szeredi, Steve French,
	Steven Whitehouse, Jefff moyer, Sage Weil, Jan Kara,
	Amir Goldstein, Andy Rudof, Anna Schumaker, Amit Golander,
	Sagi Manole, Shachar Sharon

[-- Attachment #1: Type: text/plain, Size: 2830 bytes --]

On Mar 13, 2018, at 11:17 AM, Boaz Harrosh <boazh@netapp.com> wrote:
> 
> 
> This adds the ZUF filesystem-in-user_mode module
> to the fs/ build system.
> 
> Also added:
> 	* fs/zuf/Kconfig
> 	* fs/zuf/module.c - This file contains the LICENCE of zuf code base
> 	* fs/zuf/Makefile - Rather empty Makefile with only module.c above
> 
> I add the fs/zuf/Makefile to demonstrate that at every
> patchset stage code still compiles and there are no external
> references outside of the code already submitted.
> 
> Off course only at the very last patch we have a working
> ZUF feeder
> 
> Signed-off-by: Boaz Harrosh <boazh@netapp.com>
> +/*
> + * Version rules:
> + *   This is the zus-to-zuf API version. And not the Filesystem
> + * on disk structures versions. These are left to the FS-plugging
> + * to supply and check.
> + * Specifically any of the API structures and constants found in this
> + * file.
> + * If the changes are made in a way backward compatible with old
> + * user-space, MINOR is incremented. Else MAJOR is incremented.
> + *
> + * We believe that the zus Server application comes with the
> + * Distro and should be dependent on the Kernel package.
> + * The more stable ABI is between the zus Server and its FS plugins.
> + * Because of the intimate relationships in the zuf-core behavior
> + * We would also like zus Server to be signed by the running Kernel's
> + * make crypto key and checked before load because of the Security
> + * nature of an FS provider.
> + */
> +#define ZUFS_MINORS_PER_MAJOR	1024
> +#define ZUFS_MAJOR_VERSION 1
> +#define ZUFS_MINOR_VERSION 0

I haven't really been following this development, but my recommendation
would be to use feature flags (e.g. at least __u64 compat, __u64 incompat)
for the API and separately for the disk format, rather than using version
numbers.  This makes it clear what "version" relates to a specific feature,
and also allows *removal* of features if they turn out to be a bad idea.
With version numbers you can only ever *add* features, and have to keep
support for every old feature added.

Also, having separate feature flags allows independent development of new
features, and doesn't require that feature X has to be in version N or it
will break for anyone using/testing that feature outside of the main tree.

This has worked for 25 years for ext2/3/4 and 15 years for Lustre.  ZFS
has a slightly more complex feature flags, distinguishing between features
that _could_ be used (i.e. enabled at format time or by the administrator),
and features that _are_ used (with a refcount).  That avoids gratuitous
incompatibility if some feature is enabled, but not actually used (e.g.
ext4 files over 2TB), and also allows removing that incompatibility if the
feature is no longer used (e.g. all > 2TB files are deleted).
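
(For concreteness, the mount-time check usually looks something like the
sketch below; the ZUFS_* names here are invented, not from any of the
trees:

	__u64 incompat = le64_to_cpu(zsb->s_feature_incompat);

	/* refuse to touch an fs using incompat features we don't know */
	if (incompat & ~ZUFS_FEATURE_INCOMPAT_SUPP)
		return -EOPNOTSUPP;

	/* unknown compat features are simply ignored */

where ZUFS_FEATURE_INCOMPAT_SUPP is the mask of incompat features the
running zuf/zus build understands.)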


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 3/7] zuf: Preliminary Documentation
  2018-03-13 17:18 ` [RFC 3/7] zuf: Preliminary Documentation Boaz Harrosh
@ 2018-03-13 20:32   ` Randy Dunlap
  2018-03-14 18:01     ` Boaz Harrosh
  0 siblings, 1 reply; 39+ messages in thread
From: Randy Dunlap @ 2018-03-13 20:32 UTC (permalink / raw)
  To: Boaz Harrosh, linux-fsdevel
  Cc: Ric Wheeler, Miklos Szeredi, Steve French, Steven Whitehouse,
	Jefff moyer, Sage Weil, Jan Kara, Amir Goldstein, Andy Rudof,
	Anna Schumaker, Amit Golander, Sagi Manole, Shachar Sharon

Hi,

Just a few questions.  Very little editing.  :)


On 03/13/2018 10:18 AM, Boaz Harrosh wrote:
> 
> Adding Documentation/filesystems/zufs.txt
> 
> Signed-off-by: Boaz Harrosh <boazh@netapp.com>
> ---
>  Documentation/filesystems/zufs.txt | 351 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 351 insertions(+)
>  create mode 100644 Documentation/filesystems/zufs.txt
> 
> diff --git a/Documentation/filesystems/zufs.txt b/Documentation/filesystems/zufs.txt
> new file mode 100644
> index 0000000..779f14b
> --- /dev/null
> +++ b/Documentation/filesystems/zufs.txt
> @@ -0,0 +1,351 @@
> +ZUFS - Zero-copy User-mode FileSystem
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Trees:
> +	git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream
> +	git clone https://github.com/NetApp/zufs-zus -b zus-github
> +
> +patches, comments, questions, requests to:
> +	boazh@netapp.com
> +
> +Introduction:
> +~~~~~~~~~~~~~
> +
> +ZUFS - stands for Zero-copy User-mode FS
> +▪ It is geared towards true zero copy end to end of both data and meta data.
> +▪ It is geared towards very *low latency*, very high CPU locality, lock-less
> +  parallelism.
> +▪ Synchronous operations
> +▪ Numa awareness
> +
> + ZUFS is a from-scratch implementation of a filesystem-in-user-space, which
> +tries to address the above goals. It is aimed at pmem-based FSs, but can easily
> +support any other type of FS that can utilize the x10 latency and parallelism
> +improvements.
> +
> +Glossary and names:
> +~~~~~~~~~~~~~~~~~~~
> +
> +ZUF - Zero-copy User-mode Feeder
> +  zuf.ko is the Kernel VFS component. Its job is to interface with the Kernel
> +  VFS and dispatch commands to a User-mode application Server.
> +  Uptodate code is found at:
> +	git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream
> +
> +ZUS - Zero-copy User-mode Server
> +  zufs utilizes a User-mode server application that takes care of the detailed
> +  communication protocol and correctness with the Kernel.
> +  In turn it utilizes many zusFS Filesystem plugins to implement the actual
> +  on disc Filesystem.
> +  Uptodate code is found at:
> +	git clone https://github.com/NetApp/zufs-zus -b zus-github
> +
> +zusFS - FS plugins
> +  These are .so loadable modules that implement one or more Filesystem-types
> +  (-t xyz).
> +  The zus server communicates with the plugin via a set of function vectors
> +  for the different operations, and establishes communication via defined
> +  structures.
> +
> +Filesystem-type:
> +  At startup zus registers with the Kernel one or more Filesystem-type(s)
> +  Associated with the type is a 4 letter type-name (-t fstn) different

(?)                           is a unique 4-letter type-name

> +  info about the fs, like a magic number and so on.
> +  One Server can support many FS-types, in turn each FS-type can mount
> +  multiple super-blocks, each supporting multiple devices.
> +
> +Device-Table (DT) - A zufs FS can support multiple devices
> +  ZUF in Kernel may receive, like any mount command a block-device or none.
> +  For the former if the specified FS-types states so in a special field.
> +  The mount will look for a Device table. A list of devices in a specific
> +  order sitting at some offset on the block-device. The system will then
> +  proceed to open and own all these devices and associate them to the mounting
> +  super-block.
> +  If FS-type specifies a -1 at DT_offset then there is no device table
> +  and a DT of a single device is created. (If we have no devices, none
> +  is specified than we operate without any block devices. (Mount options give
> +  some indication of the storage information)

missing one ')'

> +  The device table has special consideration for pmem devices and will
> +  present the all linear array of devices to zus, as one flat mmap space.
> +  Alternatively all none pmem devices are also provided an interface

                   all known (?)

> +  with facility of data movement from pmem to a slower device.
> +  A detailed NUMA info is exported to the Server for maximum utilization.
> +
> +pmem:
> +  Multiple pmem devices are presented to the server as a single
> +  linear file mmap. Something like /dev/dax. But it is strictly
> +  available only to that specific super-block that owns it.
> +
> +dpp_t - Dual port pointer type
> +  At some points in the protocol there are objects that return from zus
> +  (The Server) to the Kernel via a dpp_t. This is a special kind of pointer:
> +  it is actually an offset, 8-byte aligned, with the 3 low bits specifying
> +  a pool code: [offset = dpp_t & ~0x7] [pool = dpp_t & 0x7]
> +  pool == 0 means the offset is in pmem whose management is by zuf, and
> +  full easy access is provided for zus.
> +
> +  pool != 0 Is a pre-established tempfs file (up to 6 such files) where
> +  the zus has an mmap on the file and the Kernel can access that data
> +  via an offset into the file.

so non-zero pool [dpp_t & 0x7] can be a value of 1 - 7, and above says up to
6 such tempfs files.  What is the other pool value used for?
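
For reference, the [offset | pool] encoding described in the text boils
down to something like this (a sketch; the accessor names are
illustrative, only the masks come from the document):

	typedef __u64 dpp_t;

	#define DPP_POOL_MASK	0x7ULL

	/* 8-byte-aligned offset into the selected pool */
	static inline __u64 dpp_offset(dpp_t dpp)
	{
		return dpp & ~DPP_POOL_MASK;
	}

	/* 0 == zuf-managed pmem, non-zero == pre-established tempfs file */
	static inline unsigned int dpp_pool(dpp_t dpp)
	{
		return dpp & DPP_POOL_MASK;
	}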

> +  All dpp_t objects' life-time rules are strictly defined.
> +  The primary use of dpp_t is the on-pmem inode structure. Both
> +  zus and zuf can access and change this structure. On any modification
> +  the zus is called so it is notified of any changes and can handle persistence.
> +  More such objects are: Symlinks, xattrs, mmap-data-blocks etc...
> +
> +Relay-wait-object:
> +  communication between Kernel and server are done via zus-threads that
> +  sleep in Kernel (inside an IOCTL) and wait for commands. Once received
> +  the IOCTL returns operation id executed and the return info is returned via
> +  a new IOCTL call, which then waits for the next operation.

Does that say 2 IOCTLs per command?  One to start it and one to fetch return info?

> +  To wake up the sleeping thread we use a Relay-wait-object. Currently
> +  it is two waitqueue_head(s) back to back.
> +  In future we should investigate the use of that special binder object
> +  that releases its thread time slice to the other thread without going through
> +  the scheduler.
> +
> +ZT-threads-array:
> +  The novelty of the zufs is the ZT-threads system. One thread or more is
> +  pre-created for each active core in the system.
> +  ▪ The thread is AFFINITY set for that single core only.
> +  ▪ Special communication file per ZT (O_TMPFILE + IOCTL_ZUFS_INIT)
> +    At initialization the ZT thread communicates through a ZT_INIT ioctl
> +    and registers as the handler of that core (Channel)
> +  ▪ ZT-vma - Mmap 4M vma zero copy communication area per ZT
> +    A pre-allocated vma is created, into which the application or Kernel
> +    buffers for the current operation will be mapped.
> +  ▪ IOCTL_ZU_WAIT_OPT – thread sleeps in Kernel waiting for an operation
> +    via the IOCTL_ZU_WAIT_OPT call, supplying a 4k communication buffer
> +
> +  ▪ On an operation dispatch current CPU's ZT is selected, app pages mapped
> +    into the ZT-vma. Server thread released with an operation to execute.
> +  ▪ After execution, ZT returns to kernel (IOCTL_ZU_WAIT_OPT), app is released,
> +    Server waits for a new operation on that CPU.
> +
> +ZUS-mount-thread:
> +  The system utilizes a single mount thread. (This thread is not affined to any
> +  core).
> +  ▪ It will first Register all FS-types supported by this Server (By calling
> +    all zusFS plugins to register their supported types). Once done
> +  ▪ As above the thread sleeps in Kernel via the IOCTL_ZU_MOUNT call.
> +  ▪ When the Kernel receives a mount request (vfs calls the fs_type->mount opt)
> +    a mount is dispatched back to zus.
> +  ▪ NOTE: Only on the very first mount is the above ZT-threads-array created;
> +    the same array is then used for all super-blocks in the system
> +  ▪ As part of the mount command in the context of this same mount-thread
> +    a call to IOCTL_ZU_GRAB_PMEM will establish an interface to the pmem
> +    Associated with this super_block
> +  ▪ On return (like above a new call to IOCTL_ZU_MOUNT will return info of the

missing ')' somewhere.

> +    mount before sleeping in kernel waiting for a new dispatch. All SB info
> +    is provided to zuf, including the root inode info. Kernel then proceeds
> +    to complete the mount call.
> +  ▪ NOTE that since there is a single mount thread all FS-registration
> +    super_block and pmem management are lockless.
> +  
> +Philosophy of operations:
> +~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +1. [zuf-root]
> +
> +On module load (zuf.ko) a special pseudo FS is mounted on /sys/fs/zuf. This is
> +called zuf-root.
> +The zuf-root has no visible files. All communication is done via special-files.
> +Special-files are opened with open(O_TMPFILE) and establish a special role via an
> +IOCTL.
> +All communications with the server are done via the zuf-root. Each root owns
> +many FS-types and each FS-type owns many super-blocks of this type, all sharing
> +the same communication channels.
> +All FS-type Servers live in the same zus application address space. If the
> +administrator wants to separate between different servers, he/she
> +can mount a new zuf-root and point a new server instance at that new mount,
> +registering other FS-types on that other instance. The whole communication array
> +will then be duplicated as well.
> +(Otherwise pointing a new server instance at a busy root will return an error)
> +
> +2. [zus server start]
> +  ▪ On load all configured zusFS plugins are loaded.
> +  ▪ The Server starts by starting a single mount thread.
> +  ▪ It then proceeds to register with the Kernel all FS-types it will support.
> +    (This is done on the single mount thread, so all FS-registration and
> +     mount/umount operate in a single thread and therefore need no locks)
> +  ▪ Sleeping in the Kernel on a special-file of that zuf-root, waiting for a mount
> +    command.
> +
> +3. [mount -t xyz]
> +  [In Kernel]
> +  ▪ If xyz was registered above as part of the Server startup, the regular
> +    mount command will come to the zuf module with a zuf_mount() call, with
> +    the xyz-FS-info. In turn this points to a zuf-root.
> +  ▪ Code then proceeds to load a device-table of devices as specified above.
> +    It then establishes an md object with a specific pmem_id.

                              md ??

> +  ▪ It proceeds to call mount_bdev, always with the same main-device,
> +    thus fully supporting automatic bind mounts, even if different
> +    devices are given to the mount command.
> +  ▪ In zuf_fill_super it will then dispatch (awaken) the mount thread
> +    specifying two parameters: one, the FS-type to mount, and the other,
> +    the pmem_id associated with this super_block.
> +
> +  [In zus]
> +  ▪ A zus_super_block_info is allocated.
> +  ▪ zus calls PMEM_GRAB(pmem_id) to establish a direct mapping to its
> +    pmem devices. On return we have full access to our PMEM
> +
> +  ▪ ZT-threads-array
> +    If this is the first mount, the whole ZT-threads-array is created and
> +    established. The mount thread will wait until all zt-threads have finished
> +    initialization and are ready to rock.
> +  ▪ Root-zus_inode is loaded and is returned to kernel
> +  ▪ More info about the mount like block sizes and so on are returned to kernel.
> +
> +  [In Kernel]
> +   The zuf_fill_super is finalized vectors established and we have a new
> +   super_block ready for operations.
> +
> +4. An FS operation like create or WRITE/READ and so on arrives from application
> +   via VFS. Eventually an Operation is dispatched to zus:
> +   ▪ A special per-operation descriptor is filled up with all parameters.
> +   ▪ A current CPU channel is grabbed. the operation descriptor is put on
> +     that channel (ZT). Including get_user_pages or Kernel-pages associated
> +     with this OPT.
> +   ▪ The ZT is awakened, the app thread put to sleep.
> +   ▪ In ZT context pages are mapped to that ZT-vma. This is so we are sure
> +     the map is only on a single core. And no other core's TLB is affected.
> +     (This here is the whole performance secret)
> +   ▪ ZT thread is returned to user-space.
> +   ▪ In ZT context the zus Server calls the appropriate zusFS->operation
> +     vector. Output params filled.
> +   ▪ zus calls again with an IOCTL_ZU_WAIT_OPT with the same descriptor
> +     to return the requested info.
> +   ▪ At Kernel (zuf) the app thread is awakened with the results, and the
> +     ZT thread goes back to sleep waiting for a new operation.
> +     
> +   ZT rules:
> +       A ZT thread must not return back to the Kernel. One exception is locks:
> +   if needed it might sleep waiting for a lock, in which case we will see that
> +   the same CPU channel is reentered via another application and/or thread.
> +   But now that CPU channel is taken.  What we do is we utilize a few channels
> +   (ZTs) per core and the threads may grab another channel. But this only
> +   postpones the problem on a busy contended system, all such channels will be
> +   consumed. If all channels are taken the application thread is put on a busy
> +   scheduling wait until a channel can be grabbed.
> +   Therefore the Server must not sleep on a ZT. If it needs such a sleeping
> +   operation it will return -EAGAIN to zuf. The app is kept sleeping, the
> +   operation is put on an asynchronous Q, and the ZT is freed for foreground
> +   operation. At some point, when the server completes the delayed operation,
> +   it will completion-notify the Kernel with a special async cookie, and the
> +   app will be awakened.
> +   (Here too we utilize pre-allocated async channels and vmas. If all channels
> +    are busy, the application is kept sleeping, waiting for its free slot turn)
> +
> +4. On umount the operation is reversed and all resources are torn down.
> +5. In case of an application or Server crash, all resources are associated
> +   with files; on file_release these resources are caught and freed.
> +
> +Objects and life-time
> +~~~~~~~~~~~~~~~~~~~~~
> +
> +Each Kernel object type has an associated zus Server object type whose life-
> +time is governed by the life-time of the Kernel object. Therefore the Server's
> +job is easy because it need not establish any object caches / hashes and so on.
> +
> +Inside zus all objects are allocated by the zusFS plugin. So in turn it can
> +allocate a bigger space for its own private data and access it via the
> +container_of() coding pattern. So when I say below a zus-object I mean both
> +zus public part + zusFS private part of the same object.
> +
> +All operations return a UM pointer that is OPEC to the Kernel code; they

-ETOOMANYNLA

2LA: UM
4LA: OPEC

> +are just a cookie which is returned back to zus, when needed.
> +At times when we want the Kernel to have direct access to a zus object like
> +zus_inode, along with the cookie we also return a dpp_t, with a defined structure.
> +
> +Kernel object 			| zus object 		| Kernel access (via dpp_t)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +zuf_fs_type
> +	file_system_type	| zus_fs_info		| no
> +
> +zuf_sb_info
> +	super_block		| zus_sb_info		| no
> +	
> +zuf_inode_info			|			|
> +	vfs_inode		| zus_inode_info	| no
> +	zus_inode *		| 	zus_inode *	| yes
> +	synlink *		|	char-array	| yes
> +	xattr**			|	zus_xattr	| yes
> +
> +When it is time for a Kernel object to die, a final call to zus is
> +dispatched so the associated object can also be freed. This means
> +that under memory pressure, when object caches are evicted, the zus
> +memory resources are freed as well.
> +
> +
> +How to use zufs:
> +~~~~~~~~~~~~~~~~
> +
> +The most updated documentation of how to use the latest code bases
> +is the script (set of scripts) at fs/do-zu/zudo on the zus git tree
> +
> +We, the developers at NetApp, use this script to mount and test our
> +latest code, so any new secret will be found in these scripts. Please
> +read them as the ultimate source of how to operate things.
> +
> +TODO: We are looking for experts in systemd and udev to properly
> +integrate these tools into a distro.
> +
> +We assume you cloned these git trees:
> +[]$ mkdir zufs; cd zufs
> +[]$ git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream
> +[]$ git clone https://github.com/NetApp/zufs-zus -b zus-github
> +
> +This will create the following trees
> +zufs/zus - Source code for Server
> +zufs/zuf - Linux Kernel source tree to compile and install on your machine
> +
> +Also specifically:
> +zufs/zus/fs/do-zu/zudo - script Documenting how to run things
> +
> +[]$ cd zuf
> +
> +First time
> +[] ../zus/fs/do-zu/zudo
> +this will create a file:
> +	../zus/fs/do-zu/zu.conf
> +
> +Edit this file for your environment. Devices, mount-point and so on.
> +On first run an example file will be created for you. Fill in the
> +blanks. Most params can stay as is in most cases
> +
> +Now let's start running:
> +
> +[1]$ ../zus/fs/do-zu/zudo mkfs
> +This will run the proper mkfs command selected at zu.conf file
> +with the proper devices.
> +
> +[2]$ ../zus/fs/do-zu/zudo zuf-insmod
> +This loads the zuf.ko module
> +
> +[3]$ ../zus/fs/do-zu/zudo zuf-root
> +This mounts the zuf-root FS above on /sys/fs/zuf (automatically created above)
> +
> +[4]$ ../zus/fs/do-zu/zudo zus-up
> +This runs the zus daemon in the background
> +
> +[5]$ ../zus/fs/do-zu/zudo mount
> +This mounts the FS created by mkfs above on the specified dir in zu.conf
> +
> +To run all the 5 commands above at once do:
> +[]$ ../zus/fs/do-zu/zudo up
> +
> +To undo all the above in reverse order do:
> +[]$ ../zus/fs/do-zu/zudo down
> +
> +And the most magic command is:
> +[]$ ../zus/fs/do-zu/zudo again
> +Will do a "down", then update-mods, then "up"
> +(update-mods is a special script to copy the latest compiled binaries)
> +
> +Now you are ready for some:
> +[]$ ../zus/fs/do-zu/zudo xfstest
> +xfstests is assumed to be installed in the regular /opt/xfstests dir
> +
> +Again, please see inside the scripts what each command does;
> +these scripts are the ultimate Documentation, do not believe
> +anything I'm saying here. (Because it is outdated by now)
> 

thanks,
-- 
~Randy

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
  2018-03-13 18:56   ` Matthew Wilcox
@ 2018-03-14  8:20     ` Miklos Szeredi
  2018-03-14 11:17       ` Matthew Wilcox
  2018-03-14 21:41       ` Boaz Harrosh
  0 siblings, 2 replies; 39+ messages in thread
From: Miklos Szeredi @ 2018-03-14  8:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Boaz Harrosh, linux-fsdevel, Ric Wheeler, Steve French,
	Steven Whitehouse, Jefff moyer, Sage Weil, Jan Kara,
	Amir Goldstein, Andy Rudof, Anna Schumaker, Amit Golander,
	Sagi Manole, Shachar Sharon

On Tue, Mar 13, 2018 at 7:56 PM, Matthew Wilcox <willy@infradead.org> wrote:
> On Tue, Mar 13, 2018 at 07:15:46PM +0200, Boaz Harrosh wrote:
>> On a call to mmap an mmap provider (like an FS) can put
>> this flag on vma->vm_flags.
>>
>> This tells the Kernel that the vma will be used from a single
>> core only and therefore invalidation of PTE(s) does not need
>> wide CPU scheduling
>>
>> The motivation of this flag is the ZUFS project, where we want
>> to optimally map user-application buffers into a user-mode-server,
>> execute the operation and efficiently unmap.
>
> I've been looking at something similar, and I prefer my approach,
> although I'm not nearly as far along with my implementation as you are.
>
> My approach is also to add a vm_flags bit, tentatively called VM_NOTLB.
> The page fault handler refuses to insert any TLB entries into the process
> address space.  But follow_page_mask() will return the appropriate struct
> page for it.  This should be enough for O_DIRECT accesses to work as
> you'll get the appropriate scatterlists built.
>
> I suspect Boaz has already done a lot of thinking about this and doesn't
> need the explanation, but here's how it looks for anyone following along
> at home:
>
> Process A calls read().
> Kernel allocates a page cache page for it and calls the filesystem through
>   ->readpages (or ->readpage).
> Filesystem calls the managing process to get the data for that page.
> Managing process draws a pentagram and summons Beelzebub (or runs Perl;
>   whichever you find more scary).
> Managing process notifies the filesystem that the page is now full of data.
> Filesystem marks the page as being Uptodate and unlocks it.
> Process was waiting on the page lock, wakes up and copies the data from the
>   page cache into userspace.  read() is complete.
>
> What we're concerned about here is what to do after the managing process
> tells the kernel that the read is complete.  Clearly allowing the managing
> process continued access to the page is Bad as the page may be freed by the
> page cache and then reused for something else.  Doing a TLB shootdown is
> expensive.  So Boaz's approach is to have the process promise that it won't
> have any other thread look at it.  My approach is to never allow the page
> to have load/store access from userspace; it can only be passed to other
> system calls.
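
For anyone mapping that walkthrough onto code: the "marks the page as
being Uptodate and unlocks it" step is the standard page-cache
completion, roughly as below (the function name is made up; the two
calls are the real kernel APIs):

	#include <linux/pagemap.h>

	static void zuf_readpage_done(struct page *page, int err)
	{
		if (!err)
			SetPageUptodate(page);
		/* wakes the reader sleeping on the page lock in read() */
		unlock_page(page);
	}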

This all seems to revolve around the fact that userspace fs server
process needs to copy something into userspace client's buffer, right?

Instead of playing with memory mappings, why not just tell the kernel
*what* to copy?

While in theory not as generic, I don't see any real limitations (you
don't actually need the current contents of the buffer in the read
case and vice versa in the write case).

And we already have an interface for this: splice(2).  What am I
missing?  What's the killer argument in favor of the above messing
with tlb caches etc, instead of just letting the kernel do the dirty
work.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
  2018-03-14  8:20     ` Miklos Szeredi
@ 2018-03-14 11:17       ` Matthew Wilcox
  2018-03-14 11:31         ` Miklos Szeredi
  2018-03-14 21:41       ` Boaz Harrosh
  1 sibling, 1 reply; 39+ messages in thread
From: Matthew Wilcox @ 2018-03-14 11:17 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Boaz Harrosh, linux-fsdevel, Ric Wheeler, Steve French,
	Steven Whitehouse, Jefff moyer, Sage Weil, Jan Kara,
	Amir Goldstein, Andy Rudof, Anna Schumaker, Amit Golander,
	Sagi Manole, Shachar Sharon

On Wed, Mar 14, 2018 at 09:20:57AM +0100, Miklos Szeredi wrote:
> On Tue, Mar 13, 2018 at 7:56 PM, Matthew Wilcox <willy@infradead.org> wrote:
> > On Tue, Mar 13, 2018 at 07:15:46PM +0200, Boaz Harrosh wrote:
> >> On a call to mmap an mmap provider (like an FS) can put
> >> this flag on vma->vm_flags.
> >>
> >> This tells the Kernel that the vma will be used from a single
> >> core only and therefore invalidation of PTE(s) does not need
> >> wide CPU scheduling
> >>
> >> The motivation of this flag is the ZUFS project, where we want
> >> to optimally map user-application buffers into a user-mode-server,
> >> execute the operation and efficiently unmap.
> >
> > I've been looking at something similar, and I prefer my approach,
> > although I'm not nearly as far along with my implementation as you are.
> >
> > My approach is also to add a vm_flags bit, tentatively called VM_NOTLB.
> > The page fault handler refuses to insert any TLB entries into the process
> > address space.  But follow_page_mask() will return the appropriate struct
> > page for it.  This should be enough for O_DIRECT accesses to work as
> > you'll get the appropriate scatterlists built.
> >
> > I suspect Boaz has already done a lot of thinking about this and doesn't
> > need the explanation, but here's how it looks for anyone following along
> > at home:
> >
> > Process A calls read().
> > Kernel allocates a page cache page for it and calls the filesystem through
> >   ->readpages (or ->readpage).
> > Filesystem calls the managing process to get the data for that page.
> > Managing process draws a pentagram and summons Beelzebub (or runs Perl;
> >   whichever you find more scary).
> > Managing process notifies the filesystem that the page is now full of data.
> > Filesystem marks the page as being Uptodate and unlocks it.
> > Process was waiting on the page lock, wakes up and copies the data from the
> >   page cache into userspace.  read() is complete.
> >
> > What we're concerned about here is what to do after the managing process
> > tells the kernel that the read is complete.  Clearly allowing the managing
> > process continued access to the page is Bad as the page may be freed by the
> > page cache and then reused for something else.  Doing a TLB shootdown is
> > expensive.  So Boaz's approach is to have the process promise that it won't
> > have any other thread look at it.  My approach is to never allow the page
> > to have load/store access from userspace; it can only be passed to other
> > system calls.
> 
> This all seems to revolve around the fact that userspace fs server
> process needs to copy something into userspace client's buffer, right?
> 
> Instead of playing with memory mappings, why not just tell the kernel
> *what* to copy?
> 
> While in theory not as generic, I don't see any real limitations (you
> don't actually need the current contents of the buffer in the read
> case and vice versa in the write case).
> 
> And we already have an interface for this: splice(2).  What am I
> missing?  What's the killer argument in favor of the above messing
> with tlb caches etc, instead of just letting the kernel do the dirty
> work.

Great question.  You're completely right that the question is how to tell
the kernel what to copy.  The problem is that splice() can only write to
the first page of a pipe.  So you need one pipe per outstanding request,
which can easily turn into thousands of file descriptors.  If we enhanced
splice() so it could write to any page in a pipe, then I think splice()
would be the perfect interface.
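
To make the constraint concrete: the pipe side of splice(2) takes no
offset, so every in-flight request needs its own pipe, roughly like this
(user-space sketch; the function and variable names are illustrative):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/types.h>
	#include <unistd.h>

	static ssize_t serve_one_request(int file_fd, loff_t off, size_t len)
	{
		int p[2];
		ssize_t ret;

		if (pipe(p) < 0)
			return -1;
		/* 'off' addresses the file side; the pipe end cannot be
		 * addressed, data only moves at the head of the pipe.
		 */
		ret = splice(file_fd, &off, p[1], NULL, len, SPLICE_F_MOVE);
		close(p[0]);
		close(p[1]);
		return ret;
	}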

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
  2018-03-14 11:17       ` Matthew Wilcox
@ 2018-03-14 11:31         ` Miklos Szeredi
  2018-03-14 11:45           ` Matthew Wilcox
  0 siblings, 1 reply; 39+ messages in thread
From: Miklos Szeredi @ 2018-03-14 11:31 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Boaz Harrosh, linux-fsdevel, Ric Wheeler, Steve French,
	Steven Whitehouse, Jefff moyer, Sage Weil, Jan Kara,
	Amir Goldstein, Andy Rudof, Anna Schumaker, Amit Golander,
	Sagi Manole, Shachar Sharon

On Wed, Mar 14, 2018 at 12:17 PM, Matthew Wilcox <willy@infradead.org> wrote:
> On Wed, Mar 14, 2018 at 09:20:57AM +0100, Miklos Szeredi wrote:
>> On Tue, Mar 13, 2018 at 7:56 PM, Matthew Wilcox <willy@infradead.org> wrote:
>> > On Tue, Mar 13, 2018 at 07:15:46PM +0200, Boaz Harrosh wrote:
>> >> On a call to mmap an mmap provider (like an FS) can put
>> >> this flag on vma->vm_flags.
>> >>
>> >> This tells the Kernel that the vma will be used from a single
>> >> core only and therefore invalidation of PTE(s) does not need
>> >> wide CPU scheduling
>> >>
>> >> The motivation of this flag is the ZUFS project, where we want
>> >> to optimally map user-application buffers into a user-mode-server,
>> >> execute the operation and efficiently unmap.
>> >
>> > I've been looking at something similar, and I prefer my approach,
>> > although I'm not nearly as far along with my implementation as you are.
>> >
>> > My approach is also to add a vm_flags bit, tentatively called VM_NOTLB.
>> > The page fault handler refuses to insert any TLB entries into the process
>> > address space.  But follow_page_mask() will return the appropriate struct
>> > page for it.  This should be enough for O_DIRECT accesses to work as
>> > you'll get the appropriate scatterlists built.
>> >
>> > I suspect Boaz has already done a lot of thinking about this and doesn't
>> > need the explanation, but here's how it looks for anyone following along
>> > at home:
>> >
>> > Process A calls read().
>> > Kernel allocates a page cache page for it and calls the filesystem through
>> >   ->readpages (or ->readpage).
>> > Filesystem calls the managing process to get the data for that page.
>> > Managing process draws a pentagram and summons Beelzebub (or runs Perl;
>> >   whichever you find more scary).
>> > Managing process notifies the filesystem that the page is now full of data.
>> > Filesystem marks the page as being Uptodate and unlocks it.
>> > Process was waiting on the page lock, wakes up and copies the data from the
>> >   page cache into userspace.  read() is complete.
>> >
>> > What we're concerned about here is what to do after the managing process
>> > tells the kernel that the read is complete.  Clearly allowing the managing
>> > process continued access to the page is Bad as the page may be freed by the
>> > page cache and then reused for something else.  Doing a TLB shootdown is
>> > expensive.  So Boaz's approach is to have the process promise that it won't
>> > have any other thread look at it.  My approach is to never allow the page
>> > to have load/store access from userspace; it can only be passed to other
>> > system calls.
>>
>> This all seems to revolve around the fact that userspace fs server
>> process needs to copy something into userspace client's buffer, right?
>>
>> Instead of playing with memory mappings, why not just tell the kernel
>> *what* to copy?
>>
>> While in theory not as generic, I don't see any real limitations (you
>> don't actually need the current contents of the buffer in the read
>> case and vice versa in the write case).
>>
>> And we already have an interface for this: splice(2).  What am I
>> missing?  What's the killer argument in favor of the above messing
>> with tlb caches etc, instead of just letting the kernel do the dirty
>> work.
>
> Great question.  You're completely right that the question is how to tell
> the kernel what to copy.  The problem is that splice() can only write to
> the first page of a pipe.  So you need one pipe per outstanding request,
> which can easily turn into thousands of file descriptors.  If we enhanced
> splice() so it could write to any page in a pipe, then I think splice()
> would be the perfect interface.

Don't know your usecase, but afaict zufs will have one queue per cpu.
Having one pipe/cpu doesn't sound too bad.

But yeah, there's plenty of room for improvement in the splice
interface.  Just needs a killer app like this :)

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
  2018-03-14 11:31         ` Miklos Szeredi
@ 2018-03-14 11:45           ` Matthew Wilcox
  2018-03-14 14:49             ` Miklos Szeredi
  0 siblings, 1 reply; 39+ messages in thread
From: Matthew Wilcox @ 2018-03-14 11:45 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Boaz Harrosh, linux-fsdevel, Ric Wheeler, Steve French,
	Steven Whitehouse, Jefff moyer, Sage Weil, Jan Kara,
	Amir Goldstein, Andy Rudof, Anna Schumaker, Amit Golander,
	Sagi Manole, Shachar Sharon

On Wed, Mar 14, 2018 at 12:31:22PM +0100, Miklos Szeredi wrote:
> On Wed, Mar 14, 2018 at 12:17 PM, Matthew Wilcox <willy@infradead.org> wrote:
> >> And we already have an interface for this: splice(2).  What am I
> >> missing?  What's the killer argument in favor of the above messing
> >> with tlb caches etc, instead of just letting the kernel do the dirty
> >> work.
> >
> > Great question.  You're completely right that the question is how to tell
> > the kernel what to copy.  The problem is that splice() can only write to
> > the first page of a pipe.  So you need one pipe per outstanding request,
> > which can easily turn into thousands of file descriptors.  If we enhanced
> > splice() so it could write to any page in a pipe, then I think splice()
> > would be the perfect interface.
> 
> Don't know your usecase, but afaict zufs will have one queue per cpu.
> Having one pipe/cpu doesn't sound too bad.

Erm ... there's nothing wrong with having one pipe per CPU.  But pipes
being non-seekable means that ZUFS can only handle synchronous I/Os.
If you want to have a network backend, then you'd only be able to have
one outstanding network request per pipe, which is really going to suck
for bandwidth.

> But yeah, there's plenty of room for improvement in the splice
> interface.  Just needs a killer app like this :)

I'm happy to start investigating making pipes random-access ...

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 4/7] zuf: zuf-rootfs && zuf-core
  2018-03-13 17:36   ` Boaz Harrosh
@ 2018-03-14 12:56     ` Nikolay Borisov
  2018-03-14 18:34       ` Boaz Harrosh
  0 siblings, 1 reply; 39+ messages in thread
From: Nikolay Borisov @ 2018-03-14 12:56 UTC (permalink / raw)
  To: Boaz Harrosh, linux-fsdevel
  Cc: Ric Wheeler, Miklos Szeredi, Steve French, Steven Whitehouse,
	Jefff moyer, Sage Weil, Jan Kara, Amir Goldstein, Andy Rudof,
	Anna Schumaker, Amit Golander, Sagi Manole, Shachar Sharon



On 13.03.2018 19:36, Boaz Harrosh wrote:
> 
> zuf-core established the communication channels with
> the zus UM Server.
> 
> zuf-root is a pseudo FS through which the zus communicates,
> registers new file-systems, and receives new mount requests.
> 
> In this patch we have the bring-up of that special FS, and
> the core communication mechanics, which is the novelty
> of this code submission.
> 
> The zuf-rootfs (-t zuf) is usually by default mounted on
> /sys/fs/zuf. If an admin wants to run more server applications
> (note that each server application supports many types of FSs),
> he/she can mount a second instance of -t zuf and point the new
> Server to it.
> 
> (Otherwise a second instance attempting to communicate with a
>  busy zuf will fail)
> 
> TODO: How to trigger a first mount on module_load. Currently
> admin needs to manually "mount -t zuf none /sys/fs/zuf"
> 
> Signed-off-by: Boaz Harrosh <boazh@netapp.com>
<snip>

> +	while (!relay_is_fss_waiting(&zt->relay)) {
> +		mb();
Not something for this early in the development cycle but something to
keep in mind:

Always document all assumptions re. memory barrier usage + the intended
pairing scenario, otherwise it's very hard to reason about whether this is
correct or not. In fact, barriers without comments are considered broken.

> +		if (unlikely(!zt->file))
> +			return -EIO;
> +		zuf_dbg_err("[%d] can this be\n", cpu);
> +		/* FIXME: Do something much smarter */
> +		msleep(10);
> +		mb();
> +	}
> +
> +	zt->next_opt = hdr;
> +	zt->pages = pages;
> +	zt->nump = nump;
> +
> +	relay_fss_wakeup_app_wait(&zt->relay, NULL);
> +
> +	return zt->file ? hdr->err : -EIO;
> +}

<snip>
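
The convention Nikolay refers to looks roughly like the following
(illustrative only; the named pairing is invented, not a claim about
what this particular mb() actually needs):

	/*
	 * Pairs with the smp_wmb() in the relay wakeup path: make sure
	 * zt->file is re-read after the relay state, so a dying server
	 * is noticed here instead of spinning forever.
	 */
	smp_rmb();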

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
  2018-03-14 11:45           ` Matthew Wilcox
@ 2018-03-14 14:49             ` Miklos Szeredi
  2018-03-14 14:57               ` Matthew Wilcox
  0 siblings, 1 reply; 39+ messages in thread
From: Miklos Szeredi @ 2018-03-14 14:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Boaz Harrosh, linux-fsdevel, Ric Wheeler, Steve French,
	Steven Whitehouse, Jefff moyer, Sage Weil, Jan Kara,
	Amir Goldstein, Andy Rudof, Anna Schumaker, Amit Golander,
	Sagi Manole, Shachar Sharon

On Wed, Mar 14, 2018 at 12:45 PM, Matthew Wilcox <willy@infradead.org> wrote:
> On Wed, Mar 14, 2018 at 12:31:22PM +0100, Miklos Szeredi wrote:
>> On Wed, Mar 14, 2018 at 12:17 PM, Matthew Wilcox <willy@infradead.org> wrote:
>> >> And we already have an interface for this: splice(2).  What am I
>> >> missing?  What's the killer argument in favor of the above messing
>> >> with tlb caches etc, instead of just letting the kernel do the dirty
>> >> work.
>> >
>> > Great question.  You're completely right that the question is how to tell
>> > the kernel what to copy.  The problem is that splice() can only write to
>> > the first page of a pipe.  So you need one pipe per outstanding request,
>> > which can easily turn into thousands of file descriptors.  If we enhanced
>> > splice() so it could write to any page in a pipe, then I think splice()
>> > would be the perfect interface.
>>
>> Don't know your usecase, but afaict zufs will have one queue per cpu.
>> Having one pipe/cpu doesn't sound too bad.
>
> Erm ... there's nothing wrong with having one pipe per CPU.  But pipes
> being non-seekable means that ZUFS can only handle synchronous I/Os.
> If you want to have a network backend, then you'd only be able to have
> one outstanding network request per pipe, which is really going to suck
> for bandwidth.

I guess ZUFS is mostly about fast synchronous access (please correct
me if I'm wrong).  Not sure that model fits network filesystems, where
performance of caching will dominate real life performance.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
  2018-03-14 14:49             ` Miklos Szeredi
@ 2018-03-14 14:57               ` Matthew Wilcox
  2018-03-14 15:39                 ` Miklos Szeredi
  0 siblings, 1 reply; 39+ messages in thread
From: Matthew Wilcox @ 2018-03-14 14:57 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Boaz Harrosh, linux-fsdevel, Ric Wheeler, Steve French,
	Steven Whitehouse, Jefff moyer, Sage Weil, Jan Kara,
	Amir Goldstein, Andy Rudof, Anna Schumaker, Amit Golander,
	Sagi Manole, Shachar Sharon

On Wed, Mar 14, 2018 at 03:49:35PM +0100, Miklos Szeredi wrote:
> On Wed, Mar 14, 2018 at 12:45 PM, Matthew Wilcox <willy@infradead.org> wrote:
> > Erm ... there's nothing wrong with having one pipe per CPU.  But pipes
> > being non-seekable means that ZUFS can only handle synchronous I/Os.
> > If you want to have a network backend, then you'd only be able to have
> > one outstanding network request per pipe, which is really going to suck
> > for bandwidth.
> 
> I guess ZUFS is mostly about fast synchronous access (please correct
> me if I'm wrong).  Not sure that model fits network filesystems, where
> performance of caching will dominate real life performance.

I'm sure that's Boaz's use case ;-)  But if we're introducing
a replacement for FUSE, let's make it better than FUSE, not just
specialised to Boaz's use case.  Also, networks aren't necessarily slow;
some of us live in a world where the other end-point on the "network"
is *usually* the hypervisor, or a different guest on the same piece of
physical hardware.  Not to mention that 400Gbps ethernet is almost upon
us (standard approved four months ago) and PCIe Gen 4 is only 256Gbps
with a x16 link.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
  2018-03-14 14:57               ` Matthew Wilcox
@ 2018-03-14 15:39                 ` Miklos Szeredi
       [not found]                   ` <CAON-v2ygEDCn90C9t-zadjsd5GRgj0ECqntQSDDtO_Zjk=KoVw@mail.gmail.com>
  0 siblings, 1 reply; 39+ messages in thread
From: Miklos Szeredi @ 2018-03-14 15:39 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Boaz Harrosh, linux-fsdevel, Ric Wheeler, Steve French,
	Steven Whitehouse, Jefff moyer, Sage Weil, Jan Kara,
	Amir Goldstein, Andy Rudof, Anna Schumaker, Amit Golander,
	Sagi Manole, Shachar Sharon

On Wed, Mar 14, 2018 at 3:57 PM, Matthew Wilcox <willy@infradead.org> wrote:
> On Wed, Mar 14, 2018 at 03:49:35PM +0100, Miklos Szeredi wrote:
>> On Wed, Mar 14, 2018 at 12:45 PM, Matthew Wilcox <willy@infradead.org> wrote:
>> > Erm ... there's nothing wrong with having one pipe per CPU.  But pipes
>> > being non-seekable means that ZUFS can only handle synchronous I/Os.
>> > If you want to have a network backend, then you'd only be able to have
>> > one outstanding network request per pipe, which is really going to suck
>> > for bandwidth.
>>
>> I guess ZUFS is mostly about fast synchronous access (please correct
>> me if I'm wrong).  Not sure that model fits network filesystems, where
>> performance of caching will dominate real life performance.
>
> I'm sure that's Boaz's use case ;-)  But if we're introducing
> a replacement for FUSE, let's make it better than FUSE, not just
> specialised to Boaz's use case.

Okay, so the FUSE vs. ZUFS question was bound to be raised at some
point.  What's the high level thinking on this?

We can make ZUFS be a better FUSE, but with a new API?  New API means
we'll lose existing user base.   Having a new API but adding a compat
layer may be able to work around that, but it's probably hard to fully
emulate the old API.

What are the limiting factors in the FUSE API that are preventing
fixing these performance problems in FUSE?

Or it's just that FUSE kernel implementation is a horrid piece of
bloat (true) and we need to do something from scratch without worrying
about backward compat issues?

Is ZUFS planning to acquire a caching layer?

Btw, what is happening when a ZUFS file is mmaped?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
       [not found]                   ` <CAON-v2ygEDCn90C9t-zadjsd5GRgj0ECqntQSDDtO_Zjk=KoVw@mail.gmail.com>
@ 2018-03-14 16:48                     ` Matthew Wilcox
  0 siblings, 0 replies; 39+ messages in thread
From: Matthew Wilcox @ 2018-03-14 16:48 UTC (permalink / raw)
  To: Amit Golander
  Cc: Miklos Szeredi, Amir Goldstein, Amit Golander, Andy Rudof,
	Anna Schumaker, Boaz Harrosh, Jan Kara, Jefff moyer, Ric Wheeler,
	Sage Weil, Sagi Manole, Shachar Sharon, Steve French,
	Steven Whitehouse, linux-fsdevel

On Wed, Mar 14, 2018 at 04:27:20PM +0000, Amit Golander wrote:
> ZUFS is not intended or designed to replace FUSE. Just like RDMA is not
> intended to replace TCP.

Your intent is to have two independent approaches to permitting
filesystems-in-userspace to exist.  And I've pointed out that your
approach doesn't work for my use-case (and neither does FUSE).  While we
might accept two implementations of something, at the point that three
implementations are proposed, we usually say "No, come up with a better
solution that works for everybody".

> On Wed, 14 Mar 2018 at 8:39 Miklos Szeredi <mszeredi@redhat.com> wrote:
> 
> > On Wed, Mar 14, 2018 at 3:57 PM, Matthew Wilcox <willy@infradead.org>
> > wrote:
> > > On Wed, Mar 14, 2018 at 03:49:35PM +0100, Miklos Szeredi wrote:
> > >> On Wed, Mar 14, 2018 at 12:45 PM, Matthew Wilcox <willy@infradead.org>
> > wrote:
> > >> > Erm ... there's nothing wrong with having one pipe per CPU.  But pipes
> > >> > being non-seekable means that ZUFS can only handle synchronous I/Os.
> > >> > If you want to have a network backend, then you'd only be able to have
> > >> > one outstanding network request per pipe, which is really going to
> > suck
> > >> > for bandwidth.
> > >>
> > >> I guess ZUFS is mostly about fast synchronous access (please correct
> > >> me if I'm wrong).  Not sure that model fits network filesystems, where
> > >> performance of caching will dominate real life performance.
> > >
> > > I'm sure that's Boaz's use case ;-)  But if we're introducing
> > > a replacement for FUSE, let's make it better than FUSE, not just
> > > specialised to Boaz's use case.
> >
> > Okay, so the FUSE vs. ZUFS question was bound to be raised at some
> > point.  What's the high level thinking on this?
> >
> > We can make ZUFS be a better FUSE, but with a new API?  New API means
> > we'll lose existing user base.   Having a new API but adding a compat
> > layer may be able to work around that, but it's probably hard to fully
> > emulate the old API.
> >
> > What are the limiting factors in the FUSE API that are preventing
> > fixing these performance problems in FUSE?
> >
> > Or it's just that FUSE kernel implementation is a horrid piece of
> > bloat (true) and we need to do something from scratch without worrying
> > about backward compat issues?
> >
> > Is ZUFS planning to acquire a caching layer?
> >
> > Btw, what is happening when a ZUFS file is mmaped?
> >
> > Thanks,
> > Miklos
> >

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 2/7] fs: Add the ZUF filesystem to the build + License
  2018-03-13 20:16   ` Andreas Dilger
@ 2018-03-14 17:21     ` Boaz Harrosh
  2018-03-15  4:21       ` Andreas Dilger
  0 siblings, 1 reply; 39+ messages in thread
From: Boaz Harrosh @ 2018-03-14 17:21 UTC (permalink / raw)
  To: Andreas Dilger, Boaz Harrosh
  Cc: linux-fsdevel, Ric Wheeler, Miklos Szeredi, Steve French,
	Steven Whitehouse, Jefff moyer, Sage Weil, Jan Kara,
	Amir Goldstein, Andy Rudof, Anna Schumaker, Amit Golander,
	Sagi Manole, Shachar Sharon

On 13/03/18 22:16, Andreas Dilger wrote:
<>
>> + */
>> +#define ZUFS_MINORS_PER_MAJOR	1024
>> +#define ZUFS_MAJOR_VERSION 1
>> +#define ZUFS_MINOR_VERSION 0
> 
> I haven't really been following this development, but my recommendation
> would be to use feature flags (e.g. at least __u64 compat, __u64 incompat)
> for the API and separately for the disk format, rather than using version
> numbers.  This makes it clear what "version" relates to a specific feature,
> and also allows *removal* of features if they turn out to be a bad idea.
> With version numbers you can only ever *add* features, and have to keep
> support for every old feature added.
> 
> Also, having separate feature flags allows independent development of new
> features, and doesn't require that feature X has to be in version N or it
> will break for anyone using/testing that feature outside of the main tree.
> 
> This has worked for 25 years for ext2/3/4 and 15 years for Lustre.  ZFS
> has a slightly more complex feature-flag scheme, distinguishing between features
> that _could_ be used (i.e. enabled at format time or by the administrator),
> and features that _are_ used (with a refcount).  That avoids gratuitous
> incompatibility if some feature is enabled, but not actually used (e.g.
> ext4 files over 2TB), and also allows removing that incompatibility if the
> feature is no longer used (e.g. all > 2TB files are deleted).
> 

Yes, thank you. As you can see, at this RFC stage I have not even put any
code to enforce the ABI/API versioning yet, exactly because I don't like
it, as you explained. I will think about your suggestion and see. This is
not on-disk stuff; this is more the communication channel between
ZUF<=>ZUS, though there are a couple of on-disk things.
(The on-disk things are all hidden from here, inside the usermode FS plugin)

The thing is that I want to work out a system with the distros where the
ZUF<=>ZUS ABI can freely change, by forcing the zusd package to be dependent
on the kernel package, and to be signed by the Kernel's make key, meaning
it will only run against the kernel it was compiled against.

And keep a stable ABI, with features and versioning, between the
ZUSD<=>zusFS-plugin(s).
We'll have to see.

Thanks
Boaz
> 
> Cheers, Andreas
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 3/7] zuf: Preliminary Documentation
  2018-03-13 20:32   ` Randy Dunlap
@ 2018-03-14 18:01     ` Boaz Harrosh
  2018-03-14 19:16       ` Randy Dunlap
  0 siblings, 1 reply; 39+ messages in thread
From: Boaz Harrosh @ 2018-03-14 18:01 UTC (permalink / raw)
  To: Randy Dunlap, Boaz Harrosh, linux-fsdevel
  Cc: Ric Wheeler, Miklos Szeredi, Steve French, Steven Whitehouse,
	Jefff moyer, Sage Weil, Jan Kara, Amir Goldstein, Andy Rudof,
	Anna Schumaker, Amit Golander, Sagi Manole, Shachar Sharon

On 13/03/18 22:32, Randy Dunlap wrote:
> Hi,
> 
> Just a few questions.  Very little editing.  :)
> 
> 
> On 03/13/2018 10:18 AM, Boaz Harrosh wrote:
>>
>> Adding Documentation/filesystems/zufs.txt
>>
>> Signed-off-by: Boaz Harrosh <boazh@netapp.com>
>> ---
>>  Documentation/filesystems/zufs.txt | 351 +++++++++++++++++++++++++++++++++++++
>>  1 file changed, 351 insertions(+)
>>  create mode 100644 Documentation/filesystems/zufs.txt
>>
>> diff --git a/Documentation/filesystems/zufs.txt b/Documentation/filesystems/zufs.txt
>> new file mode 100644
>> index 0000000..779f14b
>> --- /dev/null
>> +++ b/Documentation/filesystems/zufs.txt
>> @@ -0,0 +1,351 @@
>> +ZUFS - Zero-copy User-mode FileSystem
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +Trees:
>> +	git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream
>> +	git clone https://github.com/NetApp/zufs-zus -b zus-github
>> +
>> +patches, comments, questions, requests to:
>> +	boazh@netapp.com
>> +
>> +Introduction:
>> +~~~~~~~~~~~~~
>> +
>> +ZUFS - stands for Zero-copy User-mode FS
>> +▪ It is geared towards true zero copy end to end of both data and meta data.
>> +▪ It is geared towards very *low latency*, very high CPU locality, lock-less
>> +  parallelism.
>> +▪ Synchronous operations
>> +▪ Numa awareness
>> +
>> + ZUFS is a from-scratch implementation of a filesystem-in-user-space, which
>> +tries to address the above goals. It is aimed at pmem-based FSs, but can easily
>> +support any other type of FS that can utilize the x10 latency and parallelism
>> +improvements.
>> +
>> +Glossary and names:
>> +~~~~~~~~~~~~~~~~~~~
>> +
>> +ZUF - Zero-copy User-mode Feeder
>> +  zuf.ko is the Kernel VFS component. Its job is to interface with the Kernel
>> +  VFS and dispatch commands to a User-mode application Server.
>> +  Uptodate code is found at:
>> +	git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream
>> +
>> +ZUS - Zero-copy User-mode Server
>> +  zufs utilizes a User-mode server application that takes care of the detailed
>> +  communication protocol and correctness with the Kernel.
>> +  In turn it utilizes many zusFS Filesystem plugins to implement the actual
>> +  on disc Filesystem.
>> +  Uptodate code is found at:
>> +	git clone https://github.com/NetApp/zufs-zus -b zus-github
>> +
>> +zusFS - FS plugins
>> +  These are .so loadable modules that implement one or more Filesystem-types
>> +  (-t xyz).
>> +  The zus server communicates with the plugin via a set of function vectors
>> +  for the different operations, and establishes communication via defined
>> +  structures.
>> +
>> +Filesystem-type:
>> +  At startup zus registers with the Kernel one or more Filesystem-type(s)
>> +  Associated with the type is a 4 letter type-name (-t fstn) different
> 
> (?)                           is a unique 4-letter type-name
> 

FStype-name, I guess. This field in the Kernel is called file_system_type.name
and is just a pointer, so I guess it can be any length. I have space for 15
characters in this API.
But I agree that the sentence above is clear as mud; I will reword it.
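
Something like the following, presumably (a sketch; the struct and field
names are invented, only the 15-character limit is from Boaz's reply):

	struct zufs_ioc_register_fs {
		char	fstype_name[16];	/* 15 chars + NUL */
		__u32	magic;
		/* ... more per-FS-type info ... */
	};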

>> +  info about the fs, like a magic number and so on.
>> +  One Server can support many FS-types, in turn each FS-type can mount
>> +  multiple super-blocks, each supporting multiple devices.
>> +
>> +Device-Table (DT) - A zufs FS can support multiple devices
>> +  ZUF in Kernel may receive, like any mount command a block-device or none.
>> +  For the former if the specified FS-types states so in a special field.
>> +  The mount will look for a Device table. A list of devices in a specific
>> +  order sitting at some offset on the block-device. The system will then
>> +  proceed to open and own all these devices and associate them to the mounting
>> +  super-block.
>> +  If FS-type specifies a -1 at DT_offset then there is no device table
>> +  and a DT of a single device is created. (If we have no devices, none
>> +  is specified than we operate without any block devices. (Mount options give
>> +  some indication of the storage information)
> 
> missing one ')'
> 

yep

>> +  The device table has special consideration for pmem devices and will
>> +  present the all linear array of devices to zus, as one flat mmap space.
>> +  Alternatively all none pmem devices are also provided an interface
> 
>                    all known (?)
> 

No, this is non-pmem, i.e. other devices that are regular bdev(s) and do
not provide the special pmem interface. Will fix.

>> +  with facility of data movement from pmem to a slower device.
>> +  A detailed NUMA info is exported to the Server for maximum utilization.
>> +
>> +pmem:
>> +  Multiple pmem devices are presented to the server as a single
>> +  linear file mmap. Something like /dev/dax. But it is strictly
>> +  available only to that specific super-block that owns it.
>> +
>> +dpp_t - Dual port pointer type
>> +  At some points in the protocol there are objects that return from zus
>> +  (The Server) to the Kernel via a dpp_t. This is a special kind of pointer:
>> +  it is actually an offset, 8-byte aligned, with the 3 low bits specifying
>> +  a pool code: [offset = dpp_t & ~0x7] [pool = dpp_t & 0x7]
>> +  pool == 0 means the offset is in pmem whose management is by zuf, and
>> +  full easy access is provided for zus.
>> +
>> +  pool != 0 Is a pre-established tempfs file (up to 6 such files) where
>> +  the zus has an mmap on the file and the Kernel can access that data
>> +  via an offset into the file.
> 
> so non-zero pool [dpp_t & 0x7] can be a value of 1 - 7, and above says up to
> 6 such tempfs files.  What is the other pool value used for?
> 

Yes my bad. Thanks

>> +  All dpp_t objects' life-time rules are strictly defined.
>> +  The primary use of dpp_t is the on-pmem inode structure. Both
>> +  zus and zuf can access and change this structure. On any modification
>> +  the zus is called so it is notified of any changes and can handle persistence.
>> +  More such objects are: Symlinks, xattrs, mmap-data-blocks etc...
>> +
>> +Relay-wait-object:
>> +  communication between Kernel and server are done via zus-threads that
>> +  sleep in Kernel (inside an IOCTL) and wait for commands. Once received
>> +  the IOCTL returns operation id executed and the return info is returned via
>> +  a new IOCTL call, which then waits for the next operation.
> 
> Does that say 2 IOCTLs per command?  One to start it and one to fetch return info?
> 

Not really, I guess I did not explain it well. It is one IOCTL per command, plus
the very first one. It is a cyclic system: the IOCTL that returns results then
sleeps again, waiting for a new command.

From the perspective of the user-mode code this is a single IORW (both write+read)
type IOCTL that sends some information - command results - and receives some
information - a new command.

How to explain this better?
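
In code form, the cycle is roughly as below (a sketch; only
IOCTL_ZU_WAIT_OPT is a real name from the documentation, the fd, buffer
type and dispatch helper are illustrative):

	struct zufs_op_buf op = { 0 };	/* the 4k communication buffer */

	for (;;) {
		/* Returns the previous command's results to the Kernel,
		 * then sleeps there until a new command is dispatched.
		 */
		if (ioctl(zt_fd, IOCTL_ZU_WAIT_OPT, &op) < 0)
			break;
		op.err = zusfs_execute(&op);	/* run the new command */
	}
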
>> +  To wake up the sleeping thread we use a Relay-wait-object. Currently
>> +  it is two waitqueue_head(s) back to back.
>> +  In future we should investigate the use of that special binder object
>> +  that releases its thread time slice to the other thread without going through
>> +  the scheduler.
>> +
>> +ZT-threads-array:
>> +  The novelty of the zufs is the ZT-threads system. One thread or more is
>> +  pre-created for each active core in the system.
>> +  ▪ The thread is AFFINITY set for that single core only.
>> +  ▪ Special communication file per ZT (O_TMPFILE + IOCTL_ZUFS_INIT)
>> +    At initialization the ZT thread communicates through a ZT_INIT ioctl
>> +    and registers as the handler of that core (Channel)
>> +  ▪ ZT-vma - Mmap 4M vma zero copy communication area per ZT
>> +    A pre-allocated vma is created, into which the application or Kernel
>> +    buffers for the current operation will be mapped.
>> +  ▪ IOCTL_ZU_WAIT_OPT – thread sleeps in Kernel waiting for an operation
>> +    via the IOCTL_ZU_WAIT_OPT call, supplying a 4k communication buffer
>> +
>> +  ▪ On an operation dispatch current CPU's ZT is selected, app pages mapped
>> +    into the ZT-vma. Server thread released with an operation to execute.
>> +  ▪ After execution, ZT returns to kernel (IOCTL_ZU_WAIT_OPT), app is released,
>> +    Server waits for a new operation on that CPU.
>> +
>> +ZUS-mount-thread:
>> +  The system utilizes a single mount thread. (This thread is not affined to any
>> +  core).
>> +  ▪ It will first Register all FS-types supported by this Server (By calling
>> +    all zusFS plugins to register their supported types). Once done
>> +  ▪ As above the thread sleeps in Kernel via the IOCTL_ZU_MOUNT call.
>> +  ▪ When the Kernel receives a mount request (vfs calls the fs_type->mount opt)
>> +    a mount is dispatched back to zus.
>> +  ▪ NOTE: Only on the very first mount is the above ZT-threads-array created;
>> +    the same array is then used for all super-blocks in the system
>> +  ▪ As part of the mount command in the context of this same mount-thread
>> +    a call to IOCTL_ZU_GRAB_PMEM will establish an interface to the pmem
>> +    Associated with this super_block
>> +  ▪ On return (like above a new call to IOCTL_ZU_MOUNT will return info of the
> 
> missing ')' somewhere.
> 
Yes
>> +    mount before sleeping in kernel waiting for a new dispatch. All SB info
>> +    is provided to zuf, including the root inode info. Kernel then proceeds
>> +    to complete the mount call.
>> +  ▪ NOTE that since there is a single mount thread all FS-registration
>> +    super_block and pmem management are lockless.
>> +  
>> +Philosophy of operations:
>> +~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +1. [zuf-root]
>> +
>> +On module load (zuf.ko) a special pseudo FS is mounted on /sys/fs/zuf. This is
>> +called zuf-root.
>> +The zuf-root has no visible files. All communication is done via special-files.
>> +Special-files are opened with open(O_TMPFILE) and establish a special role via an
>> +IOCTL.
>> +All communications with the server are done via the zuf-root. Each root owns
>> +many FS-types and each FS-type owns many super-blocks of this type, all sharing
>> +the same communication channels.
>> +All FS-type Servers live in the same zus application address space. If the
>> +administrator wants to separate between different servers, he/she
>> +can mount a new zuf-root and point a new server instance at that new mount,
>> +registering other FS-types on that other instance. The whole communication array
>> +will then be duplicated as well.
>> +(Otherwise pointing a new server instance at a busy root will return an error)
>> +
>> +2. [zus server start]
>> +  ▪ On load all configured zusFS plugins are loaded.
>> +  ▪ The Server starts by starting a single mount thread.
>> +  ▪ It then proceeds to register with the Kernel all FS-types it will support.
>> +    (This is done on the single mount thread, so all FS-registration and
>> +     mount/umount operate in a single thread and therefore need no locks)
>> +  ▪ Sleeping in the Kernel on a special-file of that zuf-root, waiting for a mount
>> +    command.
>> +
>> +3. [mount -t xyz]
>> +  [In Kernel]
>> +  ▪ If xyz was registered above as part of the Server startup, the regular
>> +    mount command will come to the zuf module with a zuf_mount() call, with
>> +    the xyz-FS-info. In turn this points to a zuf-root.
>> +  ▪ The code then proceeds to load a device-table of devices as specified above.
>> +    It then establishes an md object with a specific pmem_id.
> 
>                               md ??
> 

Thanks yes I should explain better MD stands for multi_devices which is the
name of the structure. I will put some definitions somewhere

>> +  ▪ It proceeds to call mount_bdev, always with the same main-device,
>> +    thus fully supporting automatic bind mounts, even if different
>> +    devices are given to the mount command.
>> +  ▪ In zuf_fill_super it will then dispatch (awaken) the mount thread,
>> +    specifying two parameters: the FS-type to mount, and the
>> +    pmem_id associated with this super_block.
>> +
>> +  [In zus]
>> +  ▪ A zus_super_block_info is allocated.
>> +  ▪ zus calls PMEM_GRAB(pmem_id) to establish a direct mapping to its
>> +    pmem devices. On return we have full access to our PMEM
>> +
>> +  ▪ ZT-threads-array
>> +    If this is the first mount, the whole ZT-threads-array is created and
>> +    established. The mount thread will wait until all zt-threads have finished
>> +    initialization and are ready to rock.
>> +  ▪ The Root-zus_inode is loaded and is returned to the kernel.
>> +  ▪ More info about the mount, like block sizes and so on, is returned to the kernel.
>> +
>> +  [In Kernel]
>> +   The zuf_fill_super is finalized, vectors are established, and we have a new
>> +   super_block ready for operations.
>> +
>> +4. An FS operation like create or WRITE/READ and so on arrives from an
>> +   application via the VFS. Eventually an operation is dispatched to zus:
>> +   ▪ A special per-operation descriptor is filled up with all parameters.
>> +   ▪ The current CPU's channel is grabbed and the operation descriptor is put
>> +     on that channel (ZT), including the get_user_pages or Kernel-pages
>> +     associated with this OPT.
>> +   ▪ The ZT is awakened and the app thread is put to sleep.
>> +   ▪ In ZT context the pages are mapped into that ZT-vma. This way we are sure
>> +     the map exists only on a single core and no other core's TLB is affected.
>> +     (This here is the whole performance secret)
>> +   ▪ ZT thread is returned to user-space.
>> +   ▪ In ZT context the zus Server calls the appropriate zusFS->operation
>> +     vector. Output params filled.
>> +   ▪ zus calls IOCTL_ZU_WAIT_OPT again with the same descriptor
>> +     to return the requested info.
>> +   ▪ In the Kernel (zuf) the app thread is awakened with the results, and the
>> +     ZT thread goes back to sleep, waiting for a new operation.
>> +     
>> +   ZT rules:
>> +       A ZT thread must not return back to the Kernel. The one exception
>> +   is locks: if needed, it might sleep waiting for a lock. In that case we
>> +   will see the same CPU channel reentered via another application and/or
>> +   thread, but now that CPU channel is taken. What we do is utilize a few
>> +   channels (ZTs) per core, so the threads may grab another channel. But this
>> +   only postpones the problem: on a busy, contended system all such channels
>> +   will be consumed. If all channels are taken, the application thread is put
>> +   on a busy scheduling wait until a channel can be grabbed.
>> +   Therefore the Server must not sleep on a ZT. If it needs such a sleeping
>> +   operation it will return -EAGAIN to zuf. The app is kept sleeping, the
>> +   operation is put on an asynchronous Q, and the ZT is freed for foreground
>> +   operations. At some point, when the server completes the delayed operation,
>> +   it will notify the Kernel with a special async cookie, and the app will be
>> +   awakened. (A sketch of this rule follows the list below.)
>> +   (Here too we utilize pre-allocated async channels and vmas. If all channels
>> +    are busy, the application is kept sleeping, waiting for its free slot turn)
>> +
>> +5. On umount the operation is reversed and all resources are torn down.
>> +6. In case of an application or Server crash, all resources are associated
>> +   with files; on file_release these resources are caught and freed.
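>> +
>> +   A sketch of the -EAGAIN rule from the zusFS side (illustrative names,
>> +   not the actual zus API):
>> +
>> +	/* inside a zusFS operation vector, running on a ZT */
>> +	if (!data_is_in_pmem(io)) {
>> +		zus_queue_async(io);	/* zus completes this later, off the
>> +					 * hot path, using the async cookie */
>> +		return -EAGAIN;		/* zuf keeps the app sleeping and
>> +					 * frees this ZT for foreground ops */
>> +	}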
>> +
>> +Objects and life-time
>> +~~~~~~~~~~~~~~~~~~~~~
>> +
>> +Each Kernel object type has an associated zus Server object type whose life
>> +time is governed by the life-time of the Kernel object. Therefore the Server's
>> +job is easy because it need not establish any object caches / hashes and so on.
>> +
>> +Inside zus all objects are allocated by the zusFS plugin. So in turn it can
>> +allocate a bigger space for its own private data and access it via the
>> +container_of() coding pattern. So when I say below a zus-object I mean both
>> +the zus public part + the zusFS private part of the same object.
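>> +
>> +A minimal sketch of that pattern (illustrative names, not the actual zus
>> +headers), assuming a Kernel-style container_of() macro:
>> +
>> +	/* zusFS private object embedding the public zus part */
>> +	struct foofs_inode_info {
>> +		struct zus_inode_info zii;	/* public zus part */
>> +		void *foofs_private;		/* zusFS private part */
>> +	};
>> +
>> +	static inline struct foofs_inode_info *FII(struct zus_inode_info *zii)
>> +	{
>> +		return container_of(zii, struct foofs_inode_info, zii);
>> +	}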
>> +
>> +All operations return UM pointers that are OPEC to the Kernel code; they
> 
> -ETOOMANYNLA
> 
> 2LA: UM
> 4LA: OPEC

Yes thanks I will change

>> +are just cookies which are returned back to zus when needed.
>> +At times, when we want the Kernel to have direct access to a zus object like
>> +zus_inode, along with the cookie we also return a dpp_t, with a defined structure.
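>> +
>> +Schematically - the exact encoding is defined in the zus headers and may
>> +differ - a dpp_t (dual-port-pointer) is just a 64-bit token the Kernel can
>> +translate to an offset in a mapped pmem area:
>> +
>> +	typedef __u64 dpp_t;	/* e.g. offset-in-pmem | mapping-index */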
>> +
>> +Kernel object 			| zus object 		| Kernel access (via dpp_t)
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +zuf_fs_type
>> +	file_system_type	| zus_fs_info		| no
>> +
>> +zuf_sb_info
>> +	super_block		| zus_sb_info		| no
>> +	
>> +zuf_inode_info			|			|
>> +	vfs_inode		| zus_inode_info	| no
>> +	zus_inode *		| 	zus_inode *	| yes
>> +	symlink *		|	char-array	| yes
>> +	xattr**			|	zus_xattr	| yes
>> +
>> +When it is time for a Kernel object to die, a final call to zus is
>> +dispatched so the associated object can also be freed. This means
>> +that on memory pressure, when object caches are evicted, the zus
>> +memory resources are freed as well.
>> +
>> +
>> +How to use zufs:
>> +~~~~~~~~~~~~~~~~
>> +
>> +The most up-to-date documentation of how to use the latest code bases
>> +is the script (set of scripts) at fs/do-zu/zudo in the zus git tree.
>> +
>> +We, the developers at NetApp, use this script to mount and test our
>> +latest code. So any new secret will be found in these scripts. Please
>> +read them as the ultimate source of how to operate things.
>> +
>> +TODO: We are looking for experts in systemd and udev to properly
>> +integrate these tools into a distro.
>> +
>> +We assume you cloned these git trees:
>> +[]$ mkdir zufs; cd zufs
>> +[]$ git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream zuf
>> +[]$ git clone https://github.com/NetApp/zufs-zus -b zus-github zus
>> +
>> +This will create the following trees
>> +zufs/zus - Source code for Server
>> +zufs/zuf - Linux Kernel source tree to compile and install on your machine
>> +
>> +Also specifically:
>> +zufs/zus/fs/do-zu/zudo - script Documenting how to run things
>> +
>> +[]$ cd zuf
>> +
>> +First time
>> +[]$ ../zus/fs/do-zu/zudo
>> +This will create a file:
>> +	../zus/fs/do-zu/zu.conf
>> +
>> +Edit this file for your environment: devices, mount-point, and so on.
>> +On first run an example file will be created for you. Fill in the
>> +blanks. Most params can stay as-is in most cases.
>> +
>> +Now let's start running:
>> +
>> +[1]$ ../zus/fs/do-zu/zudo mkfs
>> +This will run the proper mkfs command selected in the zu.conf file
>> +with the proper devices.
>> +
>> +[2]$ ../zus/fs/do-zu/zudo zuf-insmod
>> +This loads the zuf.ko module
>> +
>> +[3]$ ../zus/fs/do-zu/zudo zuf-root
>> +This mounts the zuf-root FS above on /sys/fs/zuf (automatically created above)
>> +
>> +[4]$ ../zus/fs/do-zu/zudo zus-up
>> +This runs the zus daemon in the background
>> +
>> +[5]$ ../zus/fs/do-zu/zudo mount
>> +This mounts the FS created by mkfs above on the specified dir in zu.conf
>> +
>> +To run all the 5 commands above at once do:
>> +[]$ ../zus/fs/do-zu/zudo up
>> +
>> +To undo all the above in reverse order do:
>> +[]$ ../zus/fs/do-zu/zudo down
>> +
>> +And the most magic command is:
>> +[]$ ../zus/fs/do-zu/zudo again
>> +It will do a "down", then update-mods, then "up"
>> +(update-mods is a special script to copy the latest compiled binaries)
>> +
>> +Now you are ready for some:
>> +[]$ ../zus/fs/do-zu/zudo xfstest
>> +xfstests is assumed to be installed in the regular /opt/xfstests dir
>> +
>> +Again, please see inside the scripts what each command does;
>> +these scripts are the ultimate Documentation. Do not believe
>> +anything I'm saying here. (Because it is outdated by now)
>>
> 
> thanks,
> 

Thank you Randy for all the input.
I always have such a hard time with the Documentation and very low confidence
in it. As I was slaving over this I hoped you'd have a bit of time to look
it over.

Will fix
Boaz


* Re: [RFC 4/7] zuf: zuf-rootfs && zuf-core
  2018-03-14 12:56     ` Nikolay Borisov
@ 2018-03-14 18:34       ` Boaz Harrosh
  0 siblings, 0 replies; 39+ messages in thread
From: Boaz Harrosh @ 2018-03-14 18:34 UTC (permalink / raw)
  To: Nikolay Borisov, linux-fsdevel
  Cc: Ric Wheeler, Miklos Szeredi, Steve French, Steven Whitehouse,
	Jefff moyer, Sage Weil, Jan Kara, Amir Goldstein, Andy Rudof,
	Anna Schumaker, Amit Golander, Sagi Manole, Shachar Sharon

On 14/03/18 14:56, Nikolay Borisov wrote:
> 
> 
> On 13.03.2018 19:36, Boaz Harrosh wrote:
>>
>> zuf-core establishes the communication channels with
>> the zus UM Server.
>>
>> zuf-root is a pseudo FS through which the zus communicates,
>> registers new file-systems, and receives new mount requests.
>>
>> In this patch we have the bring-up of that special FS, and
>> the core communication mechanics, which is the novelty
>> of this code submission.
>>
>> The zuf-rootfs (-t zuf) is usually by default mounted on
>> /sys/fs/zuf. If an admin wants to run more server applications
>> (note that each server application supports many types of FSs)
>> he/she can mount a second instance of -t zuf and point the new
>> Server to it.
>>
>> (Otherwise a second instance attempting to communicate with a
>>  busy zuf will fail)
>>
>> TODO: How to trigger a first mount on module_load. Currently
>> admin needs to manually "mount -t zuf none /sys/fs/zuf"
>>
>> Signed-off-by: Boaz Harrosh <boazh@netapp.com>
> <snip>
> 
>> +	while (!relay_is_fss_waiting(&zt->relay)) {
>> +		mb();
> Not something for this early in the development cycle but something to
> keep in mind:
> 
> Always document all assumptions re. memory barriers usage + intended
> pairing scenario otherwise it's very hard to reason whether this is
> correct or not. In fact barriers without comments are considered broken.
> 

Yes. I totally agree. I love how you commented on the ugliest piece
of code in all of this.

This is BTW totally wrong: the mb is wrong and is a band-aid
over the wrong kind of hurt. Because in theory, coming out of sleep,
I might be scheduled on another core and I should then pick up a new
zt instead of syncing with the now-wrong one.

For a POC it was fine, I get here maybe 0.17% of the time, but I will
totally need to put some thought and love into this problem.
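
Just to show what Nikolay means by a documented pairing, something in
this spirit (hypothetical field names, not the actual Relay code):

	/* zus (user-space) side: publish the results, then the flag */
	WRITE_ONCE(zt->opt_done, true);
	smp_wmb();	/* pairs with the smp_rmb() below: make the opt
			 * results visible before fss_waiting is observed */
	WRITE_ONCE(zt->fss_waiting, true);

	/* zuf (kernel) side: */
	while (!READ_ONCE(zt->fss_waiting))
		cpu_relax();
	smp_rmb();	/* pairs with the smp_wmb() above */
	/* now it is safe to read the opt results */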

I kind of want to move from that Relay object to what Binder is using,
and hope to shave off some more latency. So I'll see if I will fix
this or move to new code.

Thank you for looking. Yes, this is a bad contraption.
Boaz

>> +		if (unlikely(!zt->file))
>> +			return -EIO;
>> +		zuf_dbg_err("[%d] can this be\n", cpu);
>> +		/* FIXME: Do something much smarter */
>> +		msleep(10);
>> +		mb();
>> +	}
>> +
>> +	zt->next_opt = hdr;
>> +	zt->pages = pages;
>> +	zt->nump = nump;
>> +
>> +	relay_fss_wakeup_app_wait(&zt->relay, NULL);
>> +
>> +	return zt->file ? hdr->err : -EIO;
>> +}
> 
> <snip>
> 


* Re: [RFC 3/7] zuf: Preliminary Documentation
  2018-03-14 18:01     ` Boaz Harrosh
@ 2018-03-14 19:16       ` Randy Dunlap
  0 siblings, 0 replies; 39+ messages in thread
From: Randy Dunlap @ 2018-03-14 19:16 UTC (permalink / raw)
  To: Boaz Harrosh, linux-fsdevel
  Cc: Ric Wheeler, Miklos Szeredi, Steve French, Steven Whitehouse,
	Jefff moyer, Sage Weil, Jan Kara, Amir Goldstein, Andy Rudof,
	Anna Schumaker, Amit Golander, Sagi Manole, Shachar Sharon

On 03/14/2018 11:01 AM, Boaz Harrosh wrote:
> On 13/03/18 22:32, Randy Dunlap wrote:
>> Hi,
>>
>> Just a few questions.  Very little editing.  :)

> Thank you Randy for all the input.
> I always have such a hard time with the Documentation and very low confidence
> in it. As I was slaving over this I hoped you'd have a bit of time to look
> it over.

I will make another pass over the next version to fix lots of typos, grammar,
spelling, etc.

-- 
~Randy


* Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
  2018-03-14  8:20     ` Miklos Szeredi
  2018-03-14 11:17       ` Matthew Wilcox
@ 2018-03-14 21:41       ` Boaz Harrosh
  2018-03-15  8:47         ` Miklos Szeredi
  1 sibling, 1 reply; 39+ messages in thread
From: Boaz Harrosh @ 2018-03-14 21:41 UTC (permalink / raw)
  To: Miklos Szeredi, Matthew Wilcox
  Cc: linux-fsdevel, Ric Wheeler, Steve French, Steven Whitehouse,
	Jefff moyer, Sage Weil, Jan Kara, Amir Goldstein, Andy Rudof,
	Anna Schumaker, Amit Golander, Sagi Manole, Shachar Sharon

On 14/03/18 10:20, Miklos Szeredi wrote:
> On Tue, Mar 13, 2018 at 7:56 PM, Matthew Wilcox <willy@infradead.org> wrote:
>> On Tue, Mar 13, 2018 at 07:15:46PM +0200, Boaz Harrosh wrote:
>>> On a call to mmap an mmap provider (like an FS) can put
>>> this flag on vma->vm_flags.
>>>
>>> This tells the Kernel that the vma will be used from a single
>>> core only, and therefore invalidation of PTE(s) does not need
>>> a wide CPU scheduling.
>>>
>>> The motivation of this flag is the ZUFS project, where we want
>>> to optimally map user-application buffers into a user-mode-server,
>>> execute the operation, and efficiently unmap.
>>
>> I've been looking at something similar, and I prefer my approach,
>> although I'm not nearly as far along with my implementation as you are.
>>
>> My approach is also to add a vm_flags bit, tentatively called VM_NOTLB.
>> The page fault handler refuses to insert any TLB entries into the process
>> address space.  But follow_page_mask() will return the appropriate struct
>> page for it.  This should be enough for O_DIRECT accesses to work as
>> you'll get the appropriate scatterlists built.
>>
>> I suspect Boaz has already done a lot of thinking about this and doesn't
>> need the explanation, but here's how it looks for anyone following along
>> at home:
>>
>> Process A calls read().
>> Kernel allocates a page cache page for it and calls the filesystem through
>>   ->readpages (or ->readpage).
>> Filesystem calls the managing process to get the data for that page.
>> Managing process draws a pentagram and summons Beelzebub (or runs Perl;
>>   whichever you find more scary).
>> Managing process notifies the filesystem that the page is now full of data.
>> Filesystem marks the page as being Uptodate and unlocks it.
>> Process was waiting on the page lock, wakes up and copies the data from the
>>   page cache into userspace.  read() is complete.
>>
>> What we're concerned about here is what to do after the managing process
>> tells the kernel that the read is complete.  Clearly allowing the managing
>> process continued access to the page is Bad as the page may be freed by the
>> page cache and then reused for something else.  Doing a TLB shootdown is
>> expensive.  So Boaz's approach is to have the process promise that it won't
>> have any other thread look at it.  My approach is to never allow the page
>> to have load/store access from userspace; it can only be passed to other
>> system calls.
> 

Hi Matthew, Hi Miklos

Thank you for looking at this.
I'm answering both Matthew's and Miklos's whole threads, by trying to explain
something that you might not have completely wrapped your heads around yet.

Matthew first

Please note that in the ZUFS system there are no page-faults at all involved
(God no, this is like +40us minimum and I'm fighting to shave off 13us)

In ZUF-to-ZUS communication
command comes in:
A1 we punch in the pages at the per-core-VMA before they are used,
A2 we then return to user-space, access these pages once.
   (without any page faults)
A3 Then return to kernel and punch in a drain page at that spot

New command comes in:
B1 we punch in the pages at the same per-core-VMA before they are used,
B2 Return to user-space, access these new pages once.
B3 Then return to kernel and punch in a drain page at that spot

Actually I could skip A3/B3 altogether, but in testing, after my patch,
it did not cost at all, so I like the extra easiness. (Because otherwise
there is a dance I need to do when the app or server crashes and files start
to close: I need to scan VMAs and zap them)

The current mm mapping code (at insert_pfn) will fail at B1 above, because
it wants to see a ZERO empty spot before inserting a new pte.
What the mm code wants is that I call
	A3 - zap_vma_ptes(vma)

This is because if the spot was not ZERO it means there was a previous
mapping there, and some other core might have cached that entry in its
TLB. So when I punch in this new value the other core could access
the old page while this core is accessing the new page.
(TLB-invalidate is a single-core command, and is why zap_vma_ptes
 needs to schedule all cores to each call TLB-invalidate)

And this is the whole difference between the two tests above: that I do not
call zap_vma_ptes() with the new (one-liner) code.

Please note that the VM_LOCAL_CPU flag is not set by the application (zus Server)
but by the Kernel driver, telling the Kernel that it has enforced an API such
that we access from a single CORE, so please allow me B1 because I know what
I'm doing. (Also we do put some trust into zus because it has our filesystem
data and because we wrote it ;-))

I understand your approach where you say "The PTE table is just a global
communicator of pages but is not really mapped into any process, i.e. never
faulted into any core's local TLB" (the Kernel access of that memory is done
on a Kernel address through another TLB), and that is why you can get away
without zap_vma_ptes(vma).
So is this not the same thing? Your flag says no TLB has cached this PTE;
my flag says only-this-core has cached this PTE. We both ask:
"So please skip the zap_vma_ptes(vma) stage for me"

I think you might be able to use my flag for your system. It is only a small
part of what you need, next to the whole "get the page from the PTE" part and
so on. But the "please skip zap_vma_ptes(vma)" part is this patch here, no?

BTW I did not at all understand what your project is trying to solve.
Please send me some notes about it; I want to see if they might fit
after all.

> This all seems to revolve around the fact that userspace fs server
> process needs to copy something into userspace client's buffer, right?
> 
> Instead of playing with memory mappings, why not just tell the kernel
> *what* to copy?
> 
> While in theory not as generic, I don't see any real limitations (you
> don't actually need the current contents of the buffer in the read
> case and vica verse in the write case).
> 

This is not so easy, for many reasons. It was actually my first approach,
which I pursued for a while but dropped in favor of the easier-to-implement
and more general approach.

Note that we actually do that in the implementation of mmap. There is
a ZUS_OP_GET_BLOCK which returns a dpp_t of a page to map into the
application's VM. We could just copy it at that point

We have some app buffers arriving with pointers local to one VM (the app)
and then we want to copy them to another app's buffers. How do you do that?
So yes, you need to get_user_pages() so they can be accessed from the kernel,
switch to the second VM, then receive pointers there. These need to be dpp_t
like the games I do, or - in the app context - copy_user_to_page.

But that API was not enough for me, because it is good with pmem.
But what if I actually want it from disk or network? With my API you can do
that easily, still without any copy or caching.
Not in this RFC - but there is a plan (it is my very next todo) for an
ASYNC operation mode as well as the sync operation. The zus tells
ZUF => -ASYNC please, the data you wanted is on slow media and I need
to sleep. The request is put on hold and completed in the background.
An async thread will later call to complete the command. Note that in
that case we will do zap_vma_ptes(vma), and we are back to square one. But in
that case the cost of zap_vma_ptes(vma) is surely accepted.

Also there was a very big locking problem with the OP_GET_BLOCK
approach, because while a copy is made, the FS needs to lock access to
that same page in many kinds of scenarios. Just a few examples:
1- COW write - a concurrent reader should see the old data.
2- unwritten-buffer-write - a concurrent reader should see zeros, which
   means I need to write zeros first, before letting reads in. Grrr,
   this is current DAX code. I know how to do better.
3- tier-down - I want to write a page to slow media and reuse it. Must not
   allow this while it is accessed.

And many more. So in all these cases the API will need to be
OP_GET_BLOCK / OP_PUT_BLOCK, which is two trips. Fffff very slow.

And especially in the network or from-device case, the zus server now needs
all this buffer-cache management and life-time hell, because it needs to
read this data somewhere before it presents the page back to the Kernel,
and there you have a COPY for you.

In my API you can network directly to the APP buffers; they are there,
why not use them. (Did I say ZERO copy ;-) )

Also pseudo-FS application servers, say like MySQL-5: OP_GET_BLOCK
will give them a big memory management problem, where now we can just
write directly to app buffers, again with zero copy.

Please note that it will be very easy with this API to also support
page-cache for FSs that want it, like the network FSs you mentioned.
The FS will set a bit in the fs_register call to say that it would
rather use the page cache. These types of FSs will run on a different
kind of BDI which says "Yes page cache please". All the IO entry
vectors point to the generic_iter API and instead we implement
read/write_pages(). At read/write_pages() we do the exact same OP_READ/WRITE
as today: map the cache pages to the zus VM, dispatch, return, release the
page_lock, and all is happy. Anyone wanting to contribute this is very welcome.

I did have plans in that first approach to have a cache of OP_GET_BLOCKs
on the radix tree, and have the Server recall these blocks when needed.
But this called for a lot of locking on the HOT path, and was much, much
more complicated, bigger code.
Here we have completely lockless code, with zero synchronization between
cores. With the one-liner of this patch even the whole vma mapping is
lockless. And it is so very simple, with a huge gain and no loss. Because ....

You said above: "Instead of playing with memory mappings"

But if you look at the amount of code, even compared to a pipe
or splice, you will see that the "playing with memory mappings"
is so very easy and simple. It might be a new, hard-to-grasp approach,
but it is only harder as a new concept, not as actual code
complexity. All I actually do is:

1. Allocate a vma per core
2. call vm_insert_pfn
 .... Do something
3. vm_insert_pfn(NULL) (before this patch zap_vma_ptes())

It is all very simple really. For me it is the opposite. It is
"Why mess around with dual_port_pointers, caching, and copy
 life time rules, when you can just call vm_insert_pfn"
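
Roughly, in (simplified, not the actual zuf) code the whole dance is:

	/* dispatch, in the ZT's context: punch the app page in */
	err = vm_insert_pfn(zt->vma, zt_addr, page_to_pfn(app_page));
	/* ... the op executes in user-space, on this core only ... */
	/* completion: punch the drain page in over the same spot.
	 * With VM_LOCAL_CPU only a local TLB-invalidate is needed,
	 * not a cross-core zap_vma_ptes() */
	err = vm_insert_pfn(zt->vma, zt_addr, page_to_pfn(drain_page));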

> And we already have an interface for this: splice(2).  What am I
> missing?  What's the killer argument in favor of the above messing
> with tlb caches etc, instead of just letting the kernel do the dirty
> work.
> 

You answered it yourself. We are the Kernel and we are doing the (simple)
work. If you look at all this from afar, the zus-core with its Z-Threads
array is just a fancy pipe really - a zero-copy pipe.

Being a splice API gives us nothing. It will have the same problems
as above. splice basically says: Party A, show me your buffers; Party B,
show me yours; and I can copy between them in the Kernel. Usually one
of them, A or B, is in Kernel buffers or a DMA target. So this case is very
much like OP_GET_BLOCK: you have life-time problems. And if you use the direct
mmapped pipe like you talked to Matthew about, then you are back to this exact
problem, and with the current API you can avoid neither the zap_vma_ptes() nor
actual page-faults after the mmap. So you are looking at 60us minimum;
I have the whole round trip at 4.6us. And I believe I can cut it down to
3.5us by fixing that Relay object.

I have researched this for a while. I do not believe there is a more
robust approach, and certainly this one-liner is not complexity either.

> Thanks,
> Miklos
> 

I hope this sheds some light on the matter.

Thank you
Boaz


* Re: [RFC 2/7] fs: Add the ZUF filesystem to the build + License
  2018-03-14 17:21     ` Boaz Harrosh
@ 2018-03-15  4:21       ` Andreas Dilger
  2018-03-15 13:58         ` Boaz Harrosh
  0 siblings, 1 reply; 39+ messages in thread
From: Andreas Dilger @ 2018-03-15  4:21 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: linux-fsdevel, Ric Wheeler, Miklos Szeredi, Steve French,
	Steven Whitehouse, Jefff moyer, Sage Weil, Jan Kara,
	Amir Goldstein, Andy Rudof, Anna Schumaker, Amit Golander,
	Sagi Manole, Shachar Sharon


On Mar 14, 2018, at 11:21 AM, Boaz Harrosh <boazh@netapp.com> wrote:
> 
> On 13/03/18 22:16, Andreas Dilger wrote:
> <>
>>> + */
>>> +#define ZUFS_MINORS_PER_MAJOR	1024
>>> +#define ZUFS_MAJOR_VERSION 1
>>> +#define ZUFS_MINOR_VERSION 0
>> 
>> I haven't really been following this development, but my recommendation
>> would be to use feature flags (e.g. at least __u64 compat, __u64 incompat)
>> for the API and separately for the disk format, rather than using version
>> numbers.  This makes it clear what "version" relates to a specific feature,
>> and also allows *removal* of features if they turn out to be a bad idea.
>> With version numbers you can only ever *add* features, and have to keep
>> support for every old feature added.
>> 
>> Also, having separate feature flags allows independent development of new
>> features, and doesn't require that feature X has to be in version N or it
>> will break for anyone using/testing that feature outside of the main tree.
>> 
>>> This has worked for 25 years for ext2/3/4 and 15 years for Lustre.  ZFS
>>> has slightly more complex feature flags, distinguishing between features
>> that _could_ be used (i.e. enabled at format time or by the administrator),
>> and features that _are_ used (with a refcount).  That avoids gratuitous
>> incompatibility if some feature is enabled, but not actually used (e.g.
>> ext4 files over 2TB), and also allows removing that incompatibility if the
>> feature is no longer used (e.g. all > 2TB files are deleted).
>> 
> 
> Yes thank you. As you can see at this RFC stage I have not even put any
> code to enforce the ABI/API versioning yet. Exactly because I don't like
> it as you explained. I will think about your suggestion and see. This is
> not on disk stuff. This is more the communication channel between
> ZUF<=>ZUS. Though there are a couple on disk stuff.
> (The on disk things are all hidden from here inside the usermode FS plugin)
> 
> The thing is that I want to work out a system with the distros where the
> ZUF<=>ZUS ABI can freely change, by forcing the zusd package to be dependent
> on the kernel package. And it would be signed by the Kernel's make key, meaning
> it will only run against the kernel it was compiled against.

That is a major pain, and even the distros are doing things like weak module
symbol versions so that external kernel modules do not need to be rebuilt
for every minor kernel update.

> And keep the stable ABI with feature and versioning between the
> ZUSD<=>zusFS-plugin(s)
> We'll have to see
> 
> Thanks
> Boaz

Cheers, Andreas








* Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
  2018-03-14 21:41       ` Boaz Harrosh
@ 2018-03-15  8:47         ` Miklos Szeredi
  2018-03-15 15:27           ` Boaz Harrosh
  0 siblings, 1 reply; 39+ messages in thread
From: Miklos Szeredi @ 2018-03-15  8:47 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, linux-fsdevel, Ric Wheeler, Steve French,
	Steven Whitehouse, Jefff moyer, Sage Weil, Jan Kara,
	Amir Goldstein, Andy Rudof, Anna Schumaker, Amit Golander,
	Sagi Manole, Shachar Sharon

On Wed, Mar 14, 2018 at 10:41 PM, Boaz Harrosh <boazh@netapp.com> wrote:
> On 14/03/18 10:20, Miklos Szeredi wrote:
>> On Tue, Mar 13, 2018 at 7:56 PM, Matthew Wilcox <willy@infradead.org> wrote:
> <snip>
>
> But that API was not enough for me. Because this is good with pmem.
> But what if I actually want it from disk or network.

Yeah, that's the interesting part.  We want direct-io into the client
(the app) memory, with the userspace filesystem acting as a traffic
controller.

With your scheme it's like:

- get_user_pages
- map pages into server address space
- send request to server
- server does direct-io read from network/disk fd into mapped buffer
- server sends reply
- done

This could be changed to
- get_user_pages
- insert pages into pipe
- send request to server
- server "reverse splices" buffers from  pipe to network/disk fd
- server sends reply
- done

The two are basically the same, except we got rid of the unnecessary
userspace mapping.

Okay, the "reverse splice" or "rsplice" operation is yet to be
defined.  It would be like splice, except it passes an empty buffer
from the pipe into an operation that uses it to fill the buffer
(RSPLICE is to SPLICE as READ is to WRITE).

For write operation the normal splice(2) would be used in the same
way, straightforward passing of user buffer directly to underlying
device without memory copy ever being done.
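
To make it concrete, the server side could look something like this
(rsplice(2) is hypothetical; its exact interface is yet to be designed):

	/* server: the request told us which pipe holds the app's empty
	 * pages; fill them with a direct read from the disk/network fd,
	 * never mapping them into the server's address space */
	ssize_t n = rsplice(disk_fd, &file_off, pipe_fd, NULL, len, 0);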

See what I'm getting at?

> 1. Allocate a vma per core
> 2. call vm_insert_pfn
>  .... Do something
> 3. vm_insert_pfn(NULL) (before this patch zap_vma_ptes())
>
> It is all very simple really. For me it is opposite. It is
> "Why mess around with dual_port_pointers, caching, and copy
>  life time rules, when you can just call vm_insert_pfn"

Because you normally gain nothing by going through the server address space.

Mapping to server address space has issues, like allowing access from
server to full page containing the buffer, which might well be a
security issue.

Also with the direct-io from network/disk case the userspace address
will again be translated to a page in the kernel so it's just going
back and forth between representations using the page tables, which
likely even results in a measurable performance loss.

Again, what's the advantage of mapping to server address space?

Thanks,
Miklos


* Re: [RFC 2/7] fs: Add the ZUF filesystem to the build + License
  2018-03-15  4:21       ` Andreas Dilger
@ 2018-03-15 13:58         ` Boaz Harrosh
  0 siblings, 0 replies; 39+ messages in thread
From: Boaz Harrosh @ 2018-03-15 13:58 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: linux-fsdevel, Ric Wheeler, Miklos Szeredi, Steve French,
	Steven Whitehouse, Jefff moyer, Sage Weil, Jan Kara,
	Amir Goldstein, Andy Rudof, Anna Schumaker, Amit Golander,
	Sagi Manole, Shachar Sharon

On 15/03/18 06:21, Andreas Dilger wrote:
> On Mar 14, 2018, at 11:21 AM, Boaz Harrosh <boazh@netapp.com> wrote:
>> <snip>
>
> 
> That is a major pain, and even the distros are doing things like weak module
> symbol versions so that external kernel modules do not need to be rebuilt
> for every minor kernel update.
> 

OK I get it. So yes, let me think about it. We have two structures
and then all these IOCTL definitions.

For the two structures, struct multi_device and struct zus_inode,
we can have the above version as an indicator of compatibility.

And with the IOCTLs, each time any structure changes we can define
a new IOCTL constant and/or operation number. Or put the sizeof in
the header.
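
Something in this spirit, maybe (a hypothetical sketch, not the current
zufs ABI; the names are made up):

	struct zufs_features {
		__u64 compat;	/* unknown bits: OK to ignore */
		__u64 incompat;	/* unknown bits: refuse to communicate */
	};

	static inline bool zufs_can_connect(const struct zufs_features *f,
					    __u64 known_incompat)
	{
		return !(f->incompat & ~known_incompat);
	}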

I'll see what's simple enough to do
Boaz

>> And keep the stable ABI with feature and versioning between the
>> ZUSD<=>zusFS-plugin(s)
>> We'll have to see
>>
>> Thanks
>> Boaz
> 
> Cheers, Andreas
> 
> 
> 
> 
> 


* Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
  2018-03-15  8:47         ` Miklos Szeredi
@ 2018-03-15 15:27           ` Boaz Harrosh
  2018-03-15 15:34             ` Matthew Wilcox
  2018-03-15 16:10             ` Miklos Szeredi
  0 siblings, 2 replies; 39+ messages in thread
From: Boaz Harrosh @ 2018-03-15 15:27 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Matthew Wilcox, linux-fsdevel, Ric Wheeler, Steve French,
	Steven Whitehouse, Jefff moyer, Sage Weil, Jan Kara,
	Amir Goldstein, Andy Rudof, Anna Schumaker, Amit Golander,
	Sagi Manole, Shachar Sharon

On 15/03/18 10:47, Miklos Szeredi wrote:
> On Wed, Mar 14, 2018 at 10:41 PM, Boaz Harrosh <boazh@netapp.com> wrote:
>> On 14/03/18 10:20, Miklos Szeredi wrote:
>> <snip>
>>
>> But that API was not enough for me. Because this is good with pmem.
>> But what if I actually want it from disk or network.
> 
> Yeah, that's the interesting part.  We want direct-io into the client
> (the app) memory, with the userspace filesystem acting as a traffic
> controller.
> 

(Yes, we want that, but it is not the only thing we want. We also want
 pmem, and other Server-available data)

> With your scheme it's like:
> 
> - get_user_pages
> - map pages into server address space
> - send request to server
> - server does direct-io read from network/disk fd into mapped buffer
> - server sends reply
> - done
> 
> This could be changed to
> - get_user_pages
> - insert pages into pipe
> - send request to server
> - server "reverse splices" buffers from  pipe to network/disk fd

This can never properly translate. Even a simple file on disk
is linear for the app (unaligned buffer) but is scattered over
multiple blocks on disk. Yes, perhaps networking can somewhat work
if you pre/post-pend the headers you need.
And you force direct-IO semantics on everything, especially the APP;
with my system you can do zero copy with any kind of application.

And this assumes networking or some device, which means going back
to the Kernel; under ZUFS rules you must return -ASYNC to the zuf
and complete in a background ASYNC thread. This is an order of magnitude
higher latency than what I showed here.

And what about the SYNC copy from Server to APP? With a pipe you
are forcing me to go back to the Kernel to execute the copy, which
means two more crossings. This will double the round trips.

> - server sends reply
> - done
> 
> The two are basically the same, except we got rid of the unnecessary
> userspace mapping.
> 
> Okay, the "reverse splice" or "rsplice" operation is yet to be
> defined.  It would be like splice, except it passes an empty buffer
> from the pipe into an operation that uses it to fill the buffer
> (RSPLICE is to SPLICE as READ is to WRITE).
> 

Exactly: another trip back to the Kernel.

> For write operation the normal splice(2) would be used in the same
> way, straightforward passing of user buffer directly to underlying
> device without memory copy ever being done.
> 

Here too, it can work for direct-access semantics with pointers and
sizes aligned, but it cannot satisfy POSIX.

> See what I'm getting at?
> 

Do you see my points? 
- Zero-copy on all POSIX APIs, including *no* page-cache.
- A single kernel-UM transition.
- Synchronous, extremely low latency.

>> 1. Allocate a vma per core
>> 2. call vm_insert_pfn
>>  .... Do something
>> 3. vm_insert_pfn(NULL) (before this patch zap_vma_ptes())
>>
>> It is all very simple really. For me it is opposite. It is
>> "Why mess around with dual_port_pointers, caching, and copy
>>  life time rules, when you can just call vm_insert_pfn"
> 
> Because you normally gain nothing by going through the server address space.
> 
> Mapping to server address space has issues, like allowing access from
> server to full page containing the buffer, which might well be a
> security issue.
> 

Not really; there is already high trust between the APP and the
filesystem Server owning all of the APP's data. A compromised
Server can do lots and lots of bad things before a bug trashes the
unaligned tails of a buffer.
(And at that, the Server only has access to IO buffers in the short window
 of the IO execution. Once the IO returns this access is disconnected)

> Also with the direct-io from network/disk case the userspace address
> will again be translated to a page in the kernel so it's just going
> back and forth between representations using the page tables, which
> likely even results in a measurable performance loss.
> 

You mean another get_user_pages()? This is not so bad: we have all these
pages already paged in, and the page-table is HOT because we just now set
it. From my tests I did not notice any such slowness.

Again, this is not my typical workload, and this extra get_user_pages()
is not a high priority for me. But if you really care about it,
and you measure real slowness because of that extra get_user_pages(),
what we can do is:
  I already have a zuf-internal object describing the request,
including the app's mapped pages and sizes. We can implement a splice()
operation on the zuf driver, as a target, to supply the array of
pages already gotten at the first get_user_pages() above.

> Again, what's the advantage of mapping to server address space?
> 

See above. In my case 90% of the time the data is already at the
Server application, a memcpy_nt away. If I want to support that mode
of a single trip to user-land, that is the way.

If you are positive, you have 2 trips minimum and the target is
in the kernel. You are already too slow and there is not much of a
difference.

So the advantage is the extra choice, which for me is
90% of the workload.

> Thanks,
> Miklos
> 

Thanks
Boaz


* Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
  2018-03-15 15:27           ` Boaz Harrosh
@ 2018-03-15 15:34             ` Matthew Wilcox
  2018-03-15 15:58               ` Boaz Harrosh
  2018-03-15 16:10             ` Miklos Szeredi
  1 sibling, 1 reply; 39+ messages in thread
From: Matthew Wilcox @ 2018-03-15 15:34 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Miklos Szeredi, linux-fsdevel, Ric Wheeler, Steve French,
	Steven Whitehouse, Jefff moyer, Sage Weil, Jan Kara,
	Amir Goldstein, Andy Rudof, Anna Schumaker, Amit Golander,
	Sagi Manole, Shachar Sharon

On Thu, Mar 15, 2018 at 05:27:09PM +0200, Boaz Harrosh wrote:
> Not really; there is already high trust between the APP and the
> filesystem Server owning all of the APP's data. A compromised
> Server can do lots and lots of bad things before a bug trashes the
> unaligned tails of a buffer.
> (And at that, the Server only has access to IO buffers in the short window
>  of the IO execution. Once the IO returns this access is disconnected)

Without a TLB shootdown, you can't guarantee that.  Here's how it works:

CPU A is notified of a new page, starts accessing the page.
CPU B decides to access the same page
CPU A notifies the kernel
Kernel withdraws the PTE mapping, but doesn't zap it.
CPU B can still access the page until whatever CPU magic happens to discard
the PTE from the TLB.
Kernel decides to recycle the page
Kernel allocates it to some kernel data structure
CPU B writes to it, can probably escalate to kernel privileges.

Now, you're going to argue that the process is trusted and should
be considered to be part of the kernel from a trust point of view.
In that case it needs to be distributed as part of the kernel and not
be an independent user process.


* Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
  2018-03-15 15:34             ` Matthew Wilcox
@ 2018-03-15 15:58               ` Boaz Harrosh
  0 siblings, 0 replies; 39+ messages in thread
From: Boaz Harrosh @ 2018-03-15 15:58 UTC (permalink / raw)
  To: Matthew Wilcox, Boaz Harrosh
  Cc: Miklos Szeredi, linux-fsdevel, Ric Wheeler, Steve French,
	Steven Whitehouse, Jefff moyer, Sage Weil, Jan Kara,
	Amir Goldstein, Andy Rudof, Anna Schumaker, Amit Golander,
	Sagi Manole, Shachar Sharon

On 15/03/18 17:34, Matthew Wilcox wrote:
> On Thu, Mar 15, 2018 at 05:27:09PM +0200, Boaz Harrosh wrote:
>> Not really; there is already high trust between the APP and the
>> filesystem Server owning all of the APP's data. A compromised
>> Server can do lots and lots of bad things before a bug trashes the
>> unaligned tails of a buffer.
>> (And at that, the Server only has access to IO buffers in the short window
>>  of the IO execution. Once the IO returns this access is disconnected)
> 
> Without a TLB shootdown, you can't guarantee that.  Here's how it works:
> 
> CPU A is notified of a new page, starts accessing the page.
> CPU B decides to access the same page
> CPU A notifies the kernel
> Kernel withdraws the PTE mapping, but doesn't zap it.
> CPU B can still access the page until whatever CPU magic happens to discard
> the PTE from the TLB.
> Kernel decides to recycle the page
> Kernel allocates it to some kernel data structure
> CPU B writes to it, can probably escalate to kernel privileges.
> 
> Now, you're going to argue that the process is trusted and should
> be considered to be part of the kernel from a trust point of view.
> In that case it needs to be distributed as part of the kernel and not
> be an independent user process.
> 

You are right in the general case, but this is not the case for ZUF.

The buffers belong to the application, so it is all about the
zus Server having access to the APP buffers. But please look at exactly
what zufs does (repeated from above),

always on the same exact core (zufs rules, ZTs and all):
> A1 we punch in the pages at the per-core-VMA before they are used,
> A2 we then return to user-space and access these pages once,
>    from this core only,
> A3 then return to the kernel and punch in a drain page at that spot

At stages A1 and A3 there is a local TLB invalidate for the single core,
but not a system-wide one. So the window in which the server
has access to app buffers is during the IO, and not past it.
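
In pseudo-C the per-ZT round trip looks roughly like this (every name
below is invented for illustration; this is not the zuf API):

	/* One IO on one Z-Thread, pinned to one core: */
	zt_insert_pages(zt->vma, io->pages, io->nr);  /* A1: punch in PTEs,
						       * local invalidate only */
	zt_dispatch_to_user(zt);                      /* A2: server touches the
						       * buffers, this core only */
	zt_insert_drain_page(zt->vma, io->nr);        /* A3: replace the mapping
						       * with the drain page,
						       * local invalidate again */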

There is a trust that the zus server will not access the per-core private
vma from another core, yes. But the way it is implemented, this
is very hard to do, because the vma is backed by an O_TMPFILE opened
exclusively, held privately on the affinity-set thread's stack, and is
not visible to anyone but this single Z-Thread.

Yes, if zus accesses that vma from another core, what you describe
will happen, but that can only happen with a compromised server. And a
compromised Server can mess up the application's IO buffers through the
front door, so why worry about a back door? Perhaps I'm missing something?

Thanks
Boaz

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
  2018-03-15 15:27           ` Boaz Harrosh
  2018-03-15 15:34             ` Matthew Wilcox
@ 2018-03-15 16:10             ` Miklos Szeredi
  2018-03-15 16:30               ` Boaz Harrosh
  1 sibling, 1 reply; 39+ messages in thread
From: Miklos Szeredi @ 2018-03-15 16:10 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, linux-fsdevel, Ric Wheeler, Steve French,
	Steven Whitehouse, Jefff moyer, Sage Weil, Jan Kara,
	Amir Goldstein, Andy Rudof, Anna Schumaker, Amit Golander,
	Sagi Manole, Shachar Sharon

On Thu, Mar 15, 2018 at 4:27 PM, Boaz Harrosh <boazh@netapp.com> wrote:
> On 15/03/18 10:47, Miklos Szeredi wrote:

>> With your scheme it's like:
>>
>> - get_user_pages
>> - map pages into server address space
>> - send request to server
>> - server does direct-io read from network/disk fd into mapped buffer
>> - server sends reply
>> - done
>>
>> This could be changed to
>> - get_user_pages
>> - insert pages into pipe
>> - send request to server
>> - server "reverse splices" buffers from  pipe to network/disk fd
>
> This can never properly translate. Even a simple file on disk
> is linear for the app (unaligned buffer) but is scattered over
> multiple blocks on disk. Yes, perhaps networking can somewhat work
> if you pre-/post-pend the headers you need.
> And you restrict direct-IO semantics on everything, especially the APP;
> with my system you can do zero copy with any kind of application.

I lost you there, sorry.

How will your scheme deal with alignment issues better than my scheme?

> And this assumes networking or some device, which means going back
> to the Kernel; under ZUFS rules you must return -ASYNC to zuf
> and complete in a background ASYNC thread. This is an order of magnitude
> higher latency than what I showed here.

Indeed.

> And what about the SYNC copy from Server to APP? With a pipe you
> are forcing me to go back to the Kernel to execute the copy, which
> means two more crossings. This will double the round trips.

If you are trying to minimize the roundtrips, why not cache the
mapping in the kernel?  That way you don't necessarily have to go to
userspace at all.  With readahead logic, the server will be able to
preload the mapping before the reads happen, and you basically get the
same speed as an in-kernel fs would.
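
A sketch of that idea (the helper names are hypothetical; only the
notion of a kernel-side cached mapping with server-driven preload comes
from the paragraph above):

	/* Hypothetical kernel-side read through a cached io-map: */
	static ssize_t zuf_read_cached(struct inode *inode, loff_t pos,
				       void *buf, size_t len)
	{
		void *src = zuf_iomap_lookup(inode, pos, len); /* assumed */

		if (src) {
			memcpy(buf, src, len);  /* hit: no user-space trip */
			return len;
		}
		/* miss: dispatch to the server, which can also preload
		 * mappings ahead of the reads (readahead-style) */
		return zuf_dispatch_read(inode, pos, buf, len);
	}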

Also, I don't quite understand how you are planning to generalize
beyond the pmem case.  The interface is ready for that, sure.  But what
about caching?  Will that be done in the server?  Does that make sense?
The kernel already has a page cache for that purpose, and a userspace
cache won't ever be as good as the kernel cache.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
  2018-03-15 16:10             ` Miklos Szeredi
@ 2018-03-15 16:30               ` Boaz Harrosh
  2018-03-15 20:42                 ` Miklos Szeredi
  0 siblings, 1 reply; 39+ messages in thread
From: Boaz Harrosh @ 2018-03-15 16:30 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Matthew Wilcox, linux-fsdevel, Ric Wheeler, Steve French,
	Steven Whitehouse, Jefff moyer, Sage Weil, Jan Kara,
	Amir Goldstein, Andy Rudof, Anna Schumaker, Amit Golander,
	Sagi Manole, Shachar Sharon

On 15/03/18 18:10, Miklos Szeredi wrote:
<>
>> This can never properly translate. Even a simple file on disk
>> is linear for the app (unaligned buffer) but is scattered over
>> multiple blocks on disk. Yes, perhaps networking can somewhat work
>> if you pre-/post-pend the headers you need.
>> And you restrict direct-IO semantics on everything, especially the APP;
>> with my system you can do zero copy with any kind of application.
> 
> I lost you there, sorry.
> 
> How will your scheme deal with alignment issues better than my scheme?
> 

In my pmem case, an easy memcpy. This will not work if you need to go
to a hard disk, I agree. (Which is not a priority for me.)

>> And this assumes networking or some device, which means going back
>> to the Kernel; under ZUFS rules you must return -ASYNC to zuf
>> and complete in a background ASYNC thread. This is an order of magnitude
>> higher latency than what I showed here.
> 
> Indeed.
> 
>> And what about the SYNC copy from Server to APP? With a pipe you
>> are forcing me to go back to the Kernel to execute the copy, which
>> means two more crossings. This will double the round trips.
> 
> If you are trying to minimize the roundtrips, why not cache the
> mapping in the kernel?  That way you don't necessarily have to go to
> userspace at all.  With readahead logic, the server will be able to
> preload the mapping before the reads happen, and you basically get the
> same speed as an in-kernel fs would.
> 

Yes, as I said, that was my first approach. But in the end this is
always a special-workload optimization; in the general case it
actually adds a round trip and a huge complexity that always comes
back to bite you.

> Also, I don't quite understand how you are planning to generalize
> beyond the pmem case.  The interface is ready for that, sure.  But what
> about caching?  Will that be done in the server?  Does that make sense?
> The kernel already has a page cache for that purpose, and a userspace
> cache won't ever be as good as the kernel cache.
> 

I explained about that. We can easily support page-cache in zufs;
here is what I wrote:
> Please note that it will be very easy with this API to also support
> page-cache for FSs that want it, like the network FSs you mentioned.
> The FS will set a bit in the fs_register call to say that it would
> rather use the page cache. These types of FSs will run on a different
> kind of BDI which says "Yes page cache please". All the IO entry
> vectors point to the generic_iter API and instead we implement
> read/write_pages(). At read/write_pages() we do the exact same OP_READ/WRITE
> as today: map the cache pages to the zus VM, dispatch, return, release the
> page lock, and all is happy. Anyone wanting to contribute this is very welcome.

Yes, please, no caching at the zus level; that's insane.
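
As a sketch of what that registration bit could look like (the flag and
struct names here are made up; only the "set a bit at fs_register time"
concept comes from the quoted text above):

	/* Hypothetical registration, for illustration: */
	struct zufs_fs_register_args args = {
		.fs_name = "mynetfs",
		.flags	 = ZUFS_REG_USE_PAGE_CACHE, /* "Yes page cache please" */
	};

	/* zuf would then back mounts of this FS with a page-cache-enabled
	 * BDI, point the IO entry vectors at the generic_iter API, and
	 * dispatch only read/write_pages() cache misses to user-space. */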

> Thanks,
> Miklos
> 

Thanks
Boaz

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
  2018-03-15 16:30               ` Boaz Harrosh
@ 2018-03-15 20:42                 ` Miklos Szeredi
  2018-04-25 12:21                   ` Boaz Harrosh
  0 siblings, 1 reply; 39+ messages in thread
From: Miklos Szeredi @ 2018-03-15 20:42 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, linux-fsdevel, Ric Wheeler, Steve French,
	Steven Whitehouse, Jefff moyer, Sage Weil, Jan Kara,
	Amir Goldstein, Andy Rudof, Anna Schumaker, Amit Golander,
	Sagi Manole, Shachar Sharon

On Thu, Mar 15, 2018 at 5:30 PM, Boaz Harrosh <boazh@netapp.com> wrote:
> On 15/03/18 18:10, Miklos Szeredi wrote:
> <>
>>> This can never properly translate. Even a simple file on disk
>>> is linear for the app (unaligned buffer) but is scattered over
>>> multiple blocks on disk. Yes, perhaps networking can somewhat work
>>> if you pre-/post-pend the headers you need.
>>> And you restrict direct-IO semantics on everything, especially the APP;
>>> with my system you can do zero copy with any kind of application.
>>
>> I lost you there, sorry.
>>
>> How will your scheme deal with alignment issues better than my scheme?
>>
>
> In my pmem case, an easy memcpy. This will not work if you need to go
> to a hard disk, I agree. (Which is not a priority for me.)
>
>>> And this assumes networking or some device, which means going back
>>> to the Kernel; under ZUFS rules you must return -ASYNC to zuf
>>> and complete in a background ASYNC thread. This is an order of magnitude
>>> higher latency than what I showed here.
>>
>> Indeed.
>>
>>> And what about the SYNC copy from Server to APP? With a pipe you
>>> are forcing me to go back to the Kernel to execute the copy, which
>>> means two more crossings. This will double the round trips.
>>
>> If you are trying to minimize the roundtrips, why not cache the
>> mapping in the kernel?  That way you don't necessarily have to go to
>> userspace at all.  With readahead logic, the server will be able to
>> preload the mapping before the reads happen, and you basically get the
>> same speed as an in-kernel fs would.
>>
>
> Yes, as I said, that was my first approach. But in the end this is
> always a special-workload optimization; in the general case it
> actually adds a round trip and a huge complexity that always comes
> back to bite you.

Ideally most of the complexity would be in the page cache.  Not sure
how ready it is to handle pmem pages?

The general case (non-pmem) will always have to be handled
differently; you've just stated that it's much less latency sensitive
and needs async handling.    Basing the design on just trying to make
it use the same mechanism (userspace copy) is flawed in my opinion,
since it's suboptimal for either case.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
  2018-03-15 20:42                 ` Miklos Szeredi
@ 2018-04-25 12:21                   ` Boaz Harrosh
  2018-05-07 10:46                     ` Miklos Szeredi
  0 siblings, 1 reply; 39+ messages in thread
From: Boaz Harrosh @ 2018-04-25 12:21 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Matthew Wilcox, linux-fsdevel, Ric Wheeler, Steve French,
	Steven Whitehouse, Jefff moyer, Sage Weil, Jan Kara,
	Amir Goldstein, Andy Rudof, Anna Schumaker, Amit Golander,
	Sagi Manole, Shachar Sharon


On 03/15/2018 02:42 PM, Miklos Szeredi wrote:
> Ideally most of the complexity would be in the page cache.  Not sure
> how ready it is to handle pmem pages?
>
> The general case (non-pmem) will always have to be handled
> differently; you've just stated that it's much less latency sensitive
> and needs async handling.    Basing the design on just trying to make
> it use the same mechanism (userspace copy) is flawed in my opinion,
> since it's suboptimal for either case.
>
> Thanks,
> Miklos

OK, so I was thinking hard about all this, and I am changing my mind and
agreeing with all that was said.

I want the usFS plugin to have all the different options, with an easy
way to tell the Kernel which mode to use.

Let me summarize all the options:

1. Sync, userspace copy directly to app-buffers (current implementation)

2. Async block device operation (non-pmem)
     zuf owns all devices, pmem and non-pmem, at mount time and provides
     very efficient access to both. In the harddisk / ssd case, as part
     of an IO call the server returns -EWOULDBLOCK and in the background
     issues a scatter_gather call through zuf.
     The memory target for the IO can be pmem, directly to user-buffers
     (DIO), or transient server buffers.
     On completion an up-call is made to ZUF to complete the IO operation
     and release the waiting application.

3. Splice and R-splice
     For the case where the IO target is not a block device but an
     external path like network / rdma / some non-block device.
     Zuf already holds an internal object describing the IO context,
     including the GUP app buffers. This internal object can be made the
     memory target of a splice operation.

4. Get-io_map type operation (currently implemented for mmap)
     The zus-FS returns a set of dpp_t(s) to the kernel and the Kernel
     does the memcopy to app buffers. The Server also specifies whether
     those buffers should be cached on a per-inode radix-tree (xarray);
     if so, at the next access to the same range the Kernel does the copy
     and never dispatches to user-space.
     In this mode the Server can also revoke a cached mapping when needed.

5. Use of the VFS page-cache
     For a very slow backing device the FS requests the regular VFS
     page-cache. In the read/write_pages() vector zuf uses option 1 above
     to read into the page-cache instead of app-buffers directly. Only
     cache misses dispatch back to user-space.

Have I forgotten anything?

This way the zus-FS is in control and can do the "right thing" depending
on the target device and FS characteristics. The interface gives us a
rich set of tools to be used.
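
If it helps, the choice could boil down to a per-FS (or per-inode) mode
selector along these lines; the enum and its names are mine, not the
patchset's:

	/* Hypothetical selector for the five modes above: */
	enum zufs_io_mode {
		ZUFS_IO_SYNC_UCOPY,  /* 1: sync userspace copy to app buffers */
		ZUFS_IO_ASYNC_BDEV,  /* 2: -EWOULDBLOCK + background SG call  */
		ZUFS_IO_SPLICE,      /* 3: splice / R-splice to external path */
		ZUFS_IO_GET_IOMAP,   /* 4: server returns dpp_t(s), kernel
				      *    copies (and may cache the map)    */
		ZUFS_IO_PAGE_CACHE,  /* 5: VFS page cache, only misses
				      *    dispatch to user-space            */
	};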

Hope that answers your concerns
Boaz

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
  2018-04-25 12:21                   ` Boaz Harrosh
@ 2018-05-07 10:46                     ` Miklos Szeredi
  0 siblings, 0 replies; 39+ messages in thread
From: Miklos Szeredi @ 2018-05-07 10:46 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, linux-fsdevel, Ric Wheeler, Steve French,
	Steven Whitehouse, Jefff moyer, Sage Weil, Jan Kara,
	Amir Goldstein, Andy Rudof, Anna Schumaker, Amit Golander,
	Sagi Manole, Shachar Sharon

On Wed, Apr 25, 2018 at 2:21 PM, Boaz Harrosh <boazh@netapp.com> wrote:
>
> On 03/15/2018 02:42 PM, Miklos Szeredi wrote:
>>
>> Ideally most of the complexity would be in the page cache.  Not sure
>> how ready it is to handle pmem pages?
>>
>> The general case (non-pmem) will always have to be handled
>> differently; you've just stated that it's much less latency sensitive
>> and needs async handling.    Basing the design on just trying to make
>> it use the same mechanism (userspace copy) is flawed in my opinion,
>> since it's suboptimal for either case.
>>
>> Thanks,
>> Miklos
>
>
> OK, so I was thinking hard about all this, and I am changing my mind and
> agreeing with all that was said.
>
> I want the usFS plugin to have all the different options, with an easy
> way to tell the Kernel which mode to use.
>
> Let me summarize all the options:
>
> 1. Sync, userspace copy directly to app-buffers (current implementation)
>
> 2. Async block device operation (non-pmem)
>     zuf owns all devices, pmem and non-pmem, at mount time and provides
>     very efficient access to both. In the harddisk / ssd case, as part
>     of an IO call the server returns -EWOULDBLOCK and in the background
>     issues a scatter_gather call through zuf.
>     The memory target for the IO can be pmem, directly to user-buffers
>     (DIO), or transient server buffers.
>     On completion an up-call is made to ZUF to complete the IO operation
>     and release the waiting application.
>
> 3. Splice and R-splice
>     For the case where the IO target is not a block device but an
>     external path like network / rdma / some non-block device.
>     Zuf already holds an internal object describing the IO context,
>     including the GUP app buffers. This internal object can be made the
>     memory target of a splice operation.
>
> 4. Get-io_map type operation (currently implemented for mmap)
>     The zus-FS returns a set of dpp_t(s) to the kernel and the Kernel
>     does the memcopy to app buffers. The Server also specifies whether
>     those buffers should be cached on a per-inode radix-tree (xarray);
>     if so, at the next access to the same range the Kernel does the copy
>     and never dispatches to user-space.
>     In this mode the Server can also revoke a cached mapping when needed.
>
> 5. Use of the VFS page-cache
>     For a very slow backing device the FS requests the regular VFS
>     page-cache. In the read/write_pages() vector zuf uses option 1 above
>     to read into the page-cache instead of app-buffers directly. Only
>     cache misses dispatch back to user-space.
>
> Have I forgotten anything?
>
> This way the zus-FS is in control and can do the "right thing" depending
> on the target device and FS characteristics. The interface gives us a
> rich set of tools to be used.
>
> Hope that answers your concerns

Why keep options 1 and 2?  An io-map (4) type interface should cover
this efficiently, shouldn't it?

I don't think the page-cache is just for slow backing devices, or that
it needs to be a separate interface.  Caches are and will always be the
fastest, no matter how fast your device is.  In Linux the page cache
seems like the most convenient place to put a pmem mapping, for
example.

Of course, caches are also a big PITA when dealing with distributed
filesystems.  Fuse doesn't have a perfect solution for that.  It's one
of the key areas that needs improvement.

Also, I'll add one more use case that crops up often with fuse: "whole
file data mapping".  Basically this means that the file's data in a
(virtual) userspace filesystem is equivalent to a file's data on an
underlying (physical) filesystem.  We could accelerate I/O in that
case tremendously, as well as eliminate double caching.  I've been
undecided about what to do with it; for some time I was resisting, then
saying that I'll accept patches, and at some point I'll probably do a
patch myself.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2018-05-07 10:46 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-13 17:14 [RFC 0/7] first draft of ZUFS - the Kernel part Boaz Harrosh
2018-03-13 17:15 ` [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU Boaz Harrosh
2018-03-13 18:56   ` Matthew Wilcox
2018-03-14  8:20     ` Miklos Szeredi
2018-03-14 11:17       ` Matthew Wilcox
2018-03-14 11:31         ` Miklos Szeredi
2018-03-14 11:45           ` Matthew Wilcox
2018-03-14 14:49             ` Miklos Szeredi
2018-03-14 14:57               ` Matthew Wilcox
2018-03-14 15:39                 ` Miklos Szeredi
     [not found]                   ` <CAON-v2ygEDCn90C9t-zadjsd5GRgj0ECqntQSDDtO_Zjk=KoVw@mail.gmail.com>
2018-03-14 16:48                     ` Matthew Wilcox
2018-03-14 21:41       ` Boaz Harrosh
2018-03-15  8:47         ` Miklos Szeredi
2018-03-15 15:27           ` Boaz Harrosh
2018-03-15 15:34             ` Matthew Wilcox
2018-03-15 15:58               ` Boaz Harrosh
2018-03-15 16:10             ` Miklos Szeredi
2018-03-15 16:30               ` Boaz Harrosh
2018-03-15 20:42                 ` Miklos Szeredi
2018-04-25 12:21                   ` Boaz Harrosh
2018-05-07 10:46                     ` Miklos Szeredi
2018-03-13 17:17 ` [RFC 2/7] fs: Add the ZUF filesystem to the build + License Boaz Harrosh
2018-03-13 20:16   ` Andreas Dilger
2018-03-14 17:21     ` Boaz Harrosh
2018-03-15  4:21       ` Andreas Dilger
2018-03-15 13:58         ` Boaz Harrosh
2018-03-13 17:18 ` [RFC 3/7] zuf: Preliminary Documentation Boaz Harrosh
2018-03-13 20:32   ` Randy Dunlap
2018-03-14 18:01     ` Boaz Harrosh
2018-03-14 19:16       ` Randy Dunlap
2018-03-13 17:22 ` [RFC 4/7] zuf: zuf-rootfs && zuf-core Boaz Harrosh
2018-03-13 17:36   ` Boaz Harrosh
2018-03-14 12:56     ` Nikolay Borisov
2018-03-14 18:34       ` Boaz Harrosh
2018-03-13 17:25 ` [RFC 5/7] zus: Devices && mounting Boaz Harrosh
2018-03-13 17:38   ` Boaz Harrosh
2018-03-13 17:28 ` [RFC 6/7] zuf: Filesystem operations Boaz Harrosh
2018-03-13 17:39   ` Boaz Harrosh
2018-03-13 17:32 ` [RFC 7/7] zuf: Write/Read && mmap implementation Boaz Harrosh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).