linux-fsdevel.vger.kernel.org archive mirror
* [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem
@ 2019-02-19 11:51 Boaz harrosh
  2019-02-19 11:51 ` [RFC PATCH 01/17] fs: Add the ZUF filesystem to the build + License Boaz harrosh
                   ` (17 more replies)
  0 siblings, 18 replies; 31+ messages in thread
From: Boaz harrosh @ 2019-02-19 11:51 UTC
  To: linux-fsdevel, Anna Schumaker, Al Viro
  Cc: Ric Wheeler, Miklos Szeredi, Steven Whitehouse, Jefff moyer,
	Amir Goldstein, Amit Golander, Sagi Manole

From: Boaz Harrosh <boazh@netapp.com>

I would like to present the ZUFS file system, and the Kernel part of its
code, in this patchset.

The Kernel code presented here can be found at:
	https://github.com/NetApp/zufs-zuf

And the User-mode Server + example FSs here:
	https://github.com/NetApp/zufs-zus

ZUFS - stands for Zero-copy User-mode FS
* It is geared towards true zero copy end to end of both data and meta data.
* It is geared towards very *low latency*, very high CPU locality, lock-less
  parallelism.
* Synchronous operations (for low latency)
* NUMA awareness

Short description:
  ZUFS is a from-scratch implementation of a filesystem-in-user-space, which
  tries to address the above goals. From the get-go it is aimed at pmem-based
  FSs, but it can easily support other types of FSs that can utilize x10
  latency and parallelism improvements.
  The novelty of this project is that the interface is designed with a modern
  multi-core NUMA machine in mind, down to the ABI, so as to reach these goals.

Please see the first patch for the License of this project.

Current status: There are a couple of trivial open-source filesystem
implementations and a full blown proprietary implementation from Netapp.

Together with the Kernel module submitted here, the User-mode Server and the
zusFS User-mode plugins pass Netapp QA, including xfstests + internal QA
tests, and were released to customers as Maxdata 1.2.
So it is very stable.

In the git repository above there is also a backport for rhel 7.6,
including rpm packages for the Kernel and Server components.
(Evaluation licenses of Maxdata 1.2 are also available for developers.
 Please contact Amit Golander <Amit.Golander@netapp.com> if you need one)

Just to get some points across: as I said, this project is all about
performance and low latency. Below are some results from runs I have made:

[fuse]
threads	wr_iops	wr_bw	wr_lat
1	33606	134424	26.53226
2	57056	228224	30.38476
3	73142	292571	35.75727
4	88667	354668	40.12783
5	102280	409122	42.13261
6	110122	440488	48.29697
7	116561	466245	53.98572
8	129134	516539	55.6134

[fuse-splice]
threads	wr_iops	wr_bw	wr_lat
1	39670	158682	21.8399
2	51100	204400	34.63294
3	62385	249542	39.28847
4	75220	300882	47.42344
5	84522	338088	52.97299
6	93042	372168	57.40804
7	97706	390825	63.04435
8	98034	392137	73.24263

[xfs-dax]
threads	wr_iops	wr_bw	wr_lat
1	19449	77799	48.03282
2	37704	150819	37.2343
3	55415	221663	30.59375
4	72285	289142	26.08636
5	90348	361392	23.89037
6	103696	414787	22.38045
7	120638	482552	21.38869
8	134157	536630	21.1426

[Maxdata-1.2-zufs] [*1]
threads	wr_iops	wr_bw	wr_lat
1	57506	230026	14.387113
2	98624	394498	16.790232
3	142276	569106	17.344622
4	187984	751936	17.527123
5	190304	761219	19.504314
6	221407	885628	20.862000
7	211579	846316	23.262040
8	246029	984116	24.630604

[*1]
  These good results are with an mm patch applied, which introduces a
  VM_LOCAL_CPU flag that keeps vm_zap_ptes from scheduling on all
  CPUs when creating a per-cpu VMA.
  This patch was not accepted by the Linux Kernel community and is not
  presented in this patchset. (Patch available for review on demand)
  But a few weeks from now I will submit some incremental changes to the
  code which will bring the numbers back to the above, and even better for
  some benchmarks. (without the mm patch)

I used an 8-way KVM-qemu guest with 2 NUMA nodes,
running fio with 4k random writes (O_DIRECT | O_SYNC) to DRAM-simulated
pmem (memmap=! at grub). The Fuse-fs was a null-FS that does a memcpy of
the same 4k. fio was then run with more and more threads (see the threads
column) to test for scalability.
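
As a sanity check on the tables above: with 4k writes, the wr_bw column
matches wr_iops x 4 KiB to within rounding, which suggests that wr_bw is in
KiB/s. That unit is an inference, it is not stated in the results. A minimal
C sketch of that arithmetic, plus a rough scaling-efficiency figure, follows.
The function names are hypothetical and only serve this illustration.

```c
#include <stdint.h>

/*
 * Bandwidth in KiB/s implied by an IOPS figure and a block size in bytes.
 * Assumption: wr_bw in the tables is KiB/s (units are not stated there).
 */
uint64_t bw_kib_per_sec(uint64_t iops, uint64_t block_bytes)
{
	return iops * block_bytes / 1024;
}

/*
 * Fraction of ideal linear scaling reached with nthreads threads,
 * relative to the single-thread IOPS. 1.0 would be perfect scaling.
 */
double scaling_efficiency(uint64_t iops_1, uint64_t iops_n, unsigned nthreads)
{
	return (double)iops_n / ((double)iops_1 * nthreads);
}
```

For example, bw_kib_per_sec(33606, 4096) gives 134424, which is the first
[fuse] row. scaling_efficiency() can be fed the 1-thread and 8-thread
wr_iops of any table to compare how each setup scales.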

We are still more than x2 slower than I would like
(compared to an in-kernel pmem-based FS).
But I believe I can shave off another 1-2 us by further optimizing
the app-to-server thread switch, by developing a new scheduler-object
so as to avoid going through the scheduler altogether (and its locks)
when switching VMs.
(Currently using a couple of wait_queue_head_t with wait_event() calls.
 See relay.h in the patches)
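
To illustrate the app-to-server handoff described above, here is a userspace
analogy only, not the relay.h code: two pthread condition variables stand in
for the two wait_queue_head_t, and pthread_cond_wait() for wait_event(). All
names (struct relay, relay_dispatch, relay_serve_one, relay_demo) are
hypothetical, and the sketch assumes a single dispatcher at a time.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel relay object. */
struct relay {
	pthread_mutex_t lock;
	pthread_cond_t server_wq;	/* server sleeps here waiting for work */
	pthread_cond_t app_wq;		/* app sleeps here waiting for the result */
	bool has_work;
	bool done;
	int arg;
	int result;
};

void relay_init(struct relay *r)
{
	pthread_mutex_init(&r->lock, NULL);
	pthread_cond_init(&r->server_wq, NULL);
	pthread_cond_init(&r->app_wq, NULL);
	r->has_work = false;
	r->done = false;
	r->arg = 0;
	r->result = 0;
}

/* App side: hand one operation to the server, sleep until it completes. */
int relay_dispatch(struct relay *r, int arg)
{
	int result;

	pthread_mutex_lock(&r->lock);
	r->arg = arg;
	r->done = false;
	r->has_work = true;
	pthread_cond_signal(&r->server_wq);	/* wake the server */
	while (!r->done)
		pthread_cond_wait(&r->app_wq, &r->lock);
	result = r->result;
	pthread_mutex_unlock(&r->lock);
	return result;
}

/* Server side: sleep until an operation arrives, run it, wake the app. */
void relay_serve_one(struct relay *r, int (*op)(int))
{
	pthread_mutex_lock(&r->lock);
	while (!r->has_work)
		pthread_cond_wait(&r->server_wq, &r->lock);
	r->has_work = false;
	r->result = op(r->arg);
	r->done = true;
	pthread_cond_signal(&r->app_wq);	/* wake the app */
	pthread_mutex_unlock(&r->lock);
}

/* Example operation executed on the server side. */
static int double_it(int x)
{
	return 2 * x;
}

static void *server_thread(void *p)
{
	relay_serve_one(p, double_it);
	return NULL;
}

/* One round trip: start a server thread, dispatch, collect the result. */
int relay_demo(int arg)
{
	struct relay r;
	pthread_t tid;
	int result;

	relay_init(&r);
	if (pthread_create(&tid, NULL, server_thread, &r) != 0)
		return -1;
	result = relay_dispatch(&r, arg);
	pthread_join(tid, NULL);
	pthread_cond_destroy(&r.app_wq);
	pthread_cond_destroy(&r.server_wq);
	pthread_mutex_destroy(&r.lock);
	return result;
}
```

Each handoff here costs two wake-ups that go through the scheduler, which is
the overhead the proposed scheduler-object is meant to avoid.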

Please review and ask any question, big or trivial. I would love to
iron out this code and submit it upstream.

Thank you for reading
Boaz

~~~~~~~~~~~~~~~~~~
Boaz Harrosh (17):
  fs: Add the ZUF filesystem to the build + License
  zuf: Preliminary Documentation
  zuf: zuf-rootfs
  zuf: zuf-core The ZTs
  zuf: Multy Devices
  zuf: mounting
  zuf: Namei and directory operations
  zuf: readdir operation
  zuf: symlink
  zuf: More file operation
  zuf: Write/Read implementation
  zuf: mmap & sync
  zuf: ioctl implementation
  zuf: xattr implementation
  zuf: ACL support
  zuf: Special IOCTL fadvise (TODO)
  zuf: Support for dynamic-debug of zusFSs

 Documentation/filesystems/zufs.txt |  351 ++++++++
 fs/Kconfig                         |    1 +
 fs/Makefile                        |    1 +
 fs/zuf/Kconfig                     |   23 +
 fs/zuf/Makefile                    |   23 +
 fs/zuf/_extern.h                   |  166 ++++
 fs/zuf/_pr.h                       |   62 ++
 fs/zuf/acl.c                       |  281 +++++++
 fs/zuf/directory.c                 |  167 ++++
 fs/zuf/file.c                      |  527 ++++++++++++
 fs/zuf/inode.c                     |  648 ++++++++++++++
 fs/zuf/ioctl.c                     |  306 +++++++
 fs/zuf/md.c                        |  761 +++++++++++++++++
 fs/zuf/md.h                        |  318 +++++++
 fs/zuf/md_def.h                    |  145 ++++
 fs/zuf/mmap.c                      |  336 ++++++++
 fs/zuf/module.c                    |   28 +
 fs/zuf/namei.c                     |  435 ++++++++++
 fs/zuf/relay.h                     |   88 ++
 fs/zuf/rw.c                        |  705 ++++++++++++++++
 fs/zuf/super.c                     |  771 +++++++++++++++++
 fs/zuf/symlink.c                   |   74 ++
 fs/zuf/t1.c                        |  138 +++
 fs/zuf/t2.c                        |  375 +++++++++
 fs/zuf/t2.h                        |   68 ++
 fs/zuf/xattr.c                     |  310 +++++++
 fs/zuf/zuf-core.c                  | 1257 ++++++++++++++++++++++++++++
 fs/zuf/zuf-root.c                  |  431 ++++++++++
 fs/zuf/zuf.h                       |  414 +++++++++
 fs/zuf/zus_api.h                   |  869 +++++++++++++++++++
 30 files changed, 10079 insertions(+)
 create mode 100644 Documentation/filesystems/zufs.txt
 create mode 100644 fs/zuf/Kconfig
 create mode 100644 fs/zuf/Makefile
 create mode 100644 fs/zuf/_extern.h
 create mode 100644 fs/zuf/_pr.h
 create mode 100644 fs/zuf/acl.c
 create mode 100644 fs/zuf/directory.c
 create mode 100644 fs/zuf/file.c
 create mode 100644 fs/zuf/inode.c
 create mode 100644 fs/zuf/ioctl.c
 create mode 100644 fs/zuf/md.c
 create mode 100644 fs/zuf/md.h
 create mode 100644 fs/zuf/md_def.h
 create mode 100644 fs/zuf/mmap.c
 create mode 100644 fs/zuf/module.c
 create mode 100644 fs/zuf/namei.c
 create mode 100644 fs/zuf/relay.h
 create mode 100644 fs/zuf/rw.c
 create mode 100644 fs/zuf/super.c
 create mode 100644 fs/zuf/symlink.c
 create mode 100644 fs/zuf/t1.c
 create mode 100644 fs/zuf/t2.c
 create mode 100644 fs/zuf/t2.h
 create mode 100644 fs/zuf/xattr.c
 create mode 100644 fs/zuf/zuf-core.c
 create mode 100644 fs/zuf/zuf-root.c
 create mode 100644 fs/zuf/zuf.h
 create mode 100644 fs/zuf/zus_api.h

-- 
2.20.1



* [RFC PATCH 01/17] fs: Add the ZUF filesystem to the build + License
  2019-02-19 11:51 [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Boaz harrosh
@ 2019-02-19 11:51 ` Boaz harrosh
  2019-02-20 11:03   ` Greg KH
  2019-02-26 17:55   ` Schumaker, Anna
  2019-02-19 11:51 ` [RFC PATCH 02/17] zuf: Preliminary Documentation Boaz harrosh
                   ` (16 subsequent siblings)
  17 siblings, 2 replies; 31+ messages in thread
From: Boaz harrosh @ 2019-02-19 11:51 UTC
  To: linux-fsdevel, Anna Schumaker, Al Viro
  Cc: Ric Wheeler, Miklos Szeredi, Steven Whitehouse, Jefff moyer,
	Amir Goldstein, Amit Golander, Sagi Manole

From: Boaz Harrosh <boazh@netapp.com>

This adds the ZUF filesystem-in-user_mode module to the
fs/ build system.

Also added:
	* fs/zuf/Kconfig
	* fs/zuf/module.c - This file contains the LICENCE
			    of zuf code base
	* fs/zuf/Makefile - Rather empty Makefile with only
			    module.c above

I add the fs/zuf/Makefile to demonstrate that at every
patchset stage code still compiles and there are no external
references outside of the code already submitted.

Of course, only at the very last patch do we have a working
ZUF feeder.

[LICENCE]

  zuf.ko is a GPLv2 licensed project.

  However the ZUS user mode Server is a BSD-3-Clause licensed
  project.
  Therefore you will see that:
	zus_api.h
	md_def.h
	md.h
  which are files shared with the ZUS project, are separately licensed as
  BSD-3-Clause. Any contributor to these files should note that his
  code for these files is submitted as BSD-3-Clause.
  This is for the obvious reason that these files define the API between the
  Kernel and the user-mode Server. It is the opinion of this project's
  authors, as it is of Linus, that a pure API header is not governed by ANY
  license. But this is stated here to make it clear.

Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/Kconfig       |  1 +
 fs/Makefile      |  1 +
 fs/zuf/Kconfig   | 28 ++++++++++++++++++++
 fs/zuf/Makefile  | 14 ++++++++++
 fs/zuf/module.c  | 28 ++++++++++++++++++++
 fs/zuf/zus_api.h | 69 ++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 141 insertions(+)
 create mode 100644 fs/zuf/Kconfig
 create mode 100644 fs/zuf/Makefile
 create mode 100644 fs/zuf/module.c
 create mode 100644 fs/zuf/zus_api.h

diff --git a/fs/Kconfig b/fs/Kconfig
index ac474a61be37..5b23bb58e902 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -254,6 +254,7 @@ source "fs/romfs/Kconfig"
 source "fs/pstore/Kconfig"
 source "fs/sysv/Kconfig"
 source "fs/ufs/Kconfig"
+source "fs/zuf/Kconfig"
 source "fs/exofs/Kconfig"
 
 endif # MISC_FILESYSTEMS
diff --git a/fs/Makefile b/fs/Makefile
index 293733f61594..168f178a7c89 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -128,3 +128,4 @@ obj-y				+= exofs/ # Multiple modules
 obj-$(CONFIG_CEPH_FS)		+= ceph/
 obj-$(CONFIG_PSTORE)		+= pstore/
 obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
+obj-$(CONFIG_ZUF)		+= zuf/
diff --git a/fs/zuf/Kconfig b/fs/zuf/Kconfig
new file mode 100644
index 000000000000..19fff3b75b69
--- /dev/null
+++ b/fs/zuf/Kconfig
@@ -0,0 +1,28 @@
+menuconfig ZUF
+	tristate "ZUF - Zero-copy User-mode Feeder"
+	depends on BLOCK
+	depends on ZONE_DEVICE
+	select CRC16
+	select MEMCG
+	help
+	   ZUFS Kernel part.
+	   To enable say Y here.
+
+	   To compile this as a module,  choose M here: the module will be
+	   called zuf.ko
+
+if ZUF
+
+config ZUF_DEBUG
+	bool "ZUF: enable debug subsystems use"
+	depends on ZUF
+	default n
+	help
+	  INTERNAL QA USE ONLY!!! DO NOT USE!!!
+	  Please leave as N here
+
+	  This option adds some extra code that helps
+	  in QA testing of the code. It may slow the
+	  operation and produce bigger code
+
+endif
diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
new file mode 100644
index 000000000000..e75ba8a77974
--- /dev/null
+++ b/fs/zuf/Makefile
@@ -0,0 +1,14 @@
+#
+# ZUF: Zero-copy User-mode Feeder
+#
+# Copyright (c) 2018 NetApp Inc. All rights reserved.
+#
+# ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+#
+# Makefile for the Linux zufs Kernel Feeder.
+#
+
+obj-$(CONFIG_ZUF) += zuf.o
+
+# Main FS
+zuf-y += module.o
diff --git a/fs/zuf/module.c b/fs/zuf/module.c
new file mode 100644
index 000000000000..523633c1bf9d
--- /dev/null
+++ b/fs/zuf/module.c
@@ -0,0 +1,28 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * zuf - Zero-copy User-mode Feeder
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <https://www.gnu.org/licenses/>.
+ */
+#include <linux/module.h>
+
+#include "zus_api.h"
+
+MODULE_AUTHOR("Boaz Harrosh <boazh@netapp.com>");
+MODULE_AUTHOR("Sagi Manole <sagim@netapp.com>");
+MODULE_DESCRIPTION("Zero-copy User-mode Feeder");
+MODULE_LICENSE("GPL");
+MODULE_VERSION(__stringify(ZUFS_MAJOR_VERSION) "."
+		__stringify(ZUFS_MINOR_VERSION));
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
new file mode 100644
index 000000000000..f01db11721f4
--- /dev/null
+++ b/fs/zuf/zus_api.h
@@ -0,0 +1,69 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
+/*
+ * zus_api.h:
+ *	ZUFS (Zero-copy User-mode File System) is:
+ *		zuf (Zero-copy User-mode Feeder (Kernel)) +
+ *		zus (Zero-copy User-mode Server (daemon))
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+#ifndef _LINUX_ZUFS_API_H
+#define _LINUX_ZUFS_API_H
+
+#include <linux/types.h>
+#include <linux/uuid.h>
+#include <stddef.h>
+#include <linux/statfs.h>
+
+/*
+ * Version rules:
+ *   This is the zus-to-zuf API version. And not the Filesystem
+ * on disk structures versions. These are left to the FS-plugin
+ * to supply and check.
+ * Specifically any of the API structures and constants found in this
+ * file.
+ * If the changes are made in a way backward compatible with old
+ * user-space, MINOR is incremented. Else MAJOR is incremented.
+ *
+ * We believe that the zus Server application comes with the
+ * Distro and should be dependent on the Kernel package.
+ * (In rhel they are both in the same package)
+ *
+ * The more stable ABI is between the zus Server and its FS plugins.
+ */
+#define ZUFS_MINORS_PER_MAJOR	1024
+#define ZUFS_MAJOR_VERSION 1
+#define ZUFS_MINOR_VERSION 0
+
+/* User space compatibility definitions */
+#ifndef __KERNEL__
+
+#include <string.h>
+
+#define u8 uint8_t
+#define umode_t uint16_t
+
+#define PAGE_SHIFT     12
+#define PAGE_SIZE      (1 << PAGE_SHIFT)
+
+#ifndef __packed
+#	define __packed __attribute__((packed))
+#endif
+
+#ifndef ALIGN
+#define ALIGN(x, a)		ALIGN_MASK(x, (typeof(x))(a) - 1)
+#define ALIGN_MASK(x, mask)	(((x) + (mask)) & ~(mask))
+#endif
+
+/* RHEL/CentOS7 specifics */
+#ifndef FALLOC_FL_UNSHARE_RANGE
+#define FALLOC_FL_UNSHARE_RANGE         0x40
+#endif
+
+#endif /*  ndef __KERNEL__ */
+
+#endif /* _LINUX_ZUFS_API_H */
-- 
2.20.1
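
The version rules in the zus_api.h comment above say that MINOR is bumped for
changes that stay backward compatible with old user-space, and MAJOR
otherwise. One possible reading of that rule, as a hedged C sketch: the
helper names and the exact compatibility policy are assumptions, and so is
the use of ZUFS_MINORS_PER_MAJOR to fold major.minor into one number. The
patch only defines the three constants.

```c
#include <stdbool.h>
#include <stdint.h>

/* Constants as defined in zus_api.h in this patch. */
#define ZUFS_MINORS_PER_MAJOR	1024
#define ZUFS_MAJOR_VERSION	1
#define ZUFS_MINOR_VERSION	0

/* Hypothetical helper: fold major.minor into one comparable number. */
uint32_t zufs_version_code(uint32_t major, uint32_t minor)
{
	return major * ZUFS_MINORS_PER_MAJOR + minor;
}

/*
 * Hypothetical policy: a Server built against (s_major, s_minor) can talk
 * to a Kernel exposing (k_major, k_minor) when the majors match and the
 * Kernel's minor is at least the Server's, since minor bumps are meant to
 * stay backward compatible with old user-space.
 */
bool zufs_api_compatible(uint32_t k_major, uint32_t k_minor,
			 uint32_t s_major, uint32_t s_minor)
{
	return k_major == s_major && k_minor >= s_minor;
}
```

Under this reading a 1.0 Server keeps working on a 1.3 Kernel, while a 1.3
Server on a 1.1 Kernel, or any mix of majors, would be refused.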



* [RFC PATCH 02/17] zuf: Preliminary Documentation
  2019-02-19 11:51 [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Boaz harrosh
  2019-02-19 11:51 ` [RFC PATCH 01/17] fs: Add the ZUF filesystem to the build + License Boaz harrosh
@ 2019-02-19 11:51 ` Boaz harrosh
  2019-02-20  8:27   ` Miklos Szeredi
  2019-02-19 11:51 ` [RFC PATCH 03/17] zuf: zuf-rootfs Boaz harrosh
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 31+ messages in thread
From: Boaz harrosh @ 2019-02-19 11:51 UTC
  To: linux-fsdevel, Anna Schumaker, Al Viro
  Cc: Ric Wheeler, Miklos Szeredi, Steven Whitehouse, Jefff moyer,
	Amir Goldstein, Amit Golander, Sagi Manole

From: Boaz Harrosh <boazh@netapp.com>

Adding Documentation/filesystems/zufs.txt

[v2]
  Incorporated Randy's few comments.

Randy, please give it a harder review.

CC: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 Documentation/filesystems/zufs.txt | 366 +++++++++++++++++++++++++++++
 1 file changed, 366 insertions(+)
 create mode 100644 Documentation/filesystems/zufs.txt

diff --git a/Documentation/filesystems/zufs.txt b/Documentation/filesystems/zufs.txt
new file mode 100644
index 000000000000..01eb7a7e7257
--- /dev/null
+++ b/Documentation/filesystems/zufs.txt
@@ -0,0 +1,366 @@
+ZUFS - Zero-copy User-mode FileSystem
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Trees:
+	git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream
+	git clone https://github.com/NetApp/zufs-zus -b zus-github
+
+patches, comments, questions, requests to:
+	boazh@netapp.com
+
+Introduction:
+~~~~~~~~~~~~~
+
+ZUFS - stands for Zero-copy User-mode FS
+▪ It is geared towards true zero copy end to end of both data and meta data.
+▪ It is geared towards very *low latency*, very high CPU locality, lock-less
+  parallelism.
+▪ Synchronous operations
+▪ NUMA awareness
+
+ ZUFS is a from-scratch implementation of a filesystem-in-user-space, which
+tries to address the above goals. It is aimed at pmem-based FSs, but can easily
+support any other type of FS that can utilize x10 latency and parallelism
+improvements.
+
+Glossary and names:
+~~~~~~~~~~~~~~~~~~~
+
+ZUF - Zero-copy User-mode Feeder
+  zuf.ko is the Kernel VFS component. Its job is to interface with the Kernel
+  VFS and dispatch commands to a User-mode application Server.
+  Up-to-date code is found at:
+	git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream
+
+ZUS - Zero-copy User-mode Server
+  zufs utilizes a User-mode server application that takes care of the detailed
+  communication protocol and correctness with the Kernel.
+  In turn it utilizes many zusFS Filesystem plugins to implement the actual
+  on-disk Filesystem.
+  Up-to-date code is found at:
+	git clone https://github.com/NetApp/zufs-zus -b zus-github
+
+zusFS - FS plugins
+  These are .so loadable modules that implement one or more Filesystem-types
+  (-t xyz).
+  The zus server communicates with the plugin via a set of function vectors
+  for the different operations. And establishes communication via defined
+  structures.
+
+Filesystem-type:
+  At startup zus registers with the Kernel one or more Filesystem-type(s).
+  Associated with the type is a unique type-name (mount -t foofs) +
+  different info about the fs, like a magic number and so on.
+  One Server can support many FS-types, in turn each FS-type can mount
+  multiple super-blocks, each supporting multiple devices.
+
+Device-Table (DT) - A zufs FS can support multiple devices
+  ZUF in Kernel may receive, like any mount command, a block-device or none.
+  For the former, if the specified FS-type states so in a special field,
+  the mount will look for a Device table: a list of devices in a specific
+  order sitting at some offset on the block-device. The system will then
+  proceed to open and own all these devices and associate them with the
+  mounting super-block.
+  If the FS-type specifies a -1 at DT_offset then there is no device table
+  and a DT of a single device is created. (If no device is specified,
+  then we operate without any block devices. Mount options give
+  some indication of the storage information.)
+  The device table has special consideration for pmem devices and will
+  present the whole linear array of devices to zus as one flat mmap space.
+  Alternatively, all non-pmem devices are also provided an interface
+  with a facility for data movement from pmem to slower devices.
+  Detailed NUMA info is exported to the Server for maximum utilization.
+
+pmem:
+  Multiple pmem devices are presented to the server as a single
+  linear file mmap. Something like /dev/dax. But it is strictly
+  available only to that specific super-block that owns it.
+
+dpp_t - Dual port pointer type
+  At some points in the protocol there are objects that return from zus
+  (The Server) to the Kernel via a dpp_t. This is a special kind of pointer.
+  It is actually an 8-byte-aligned offset, with the 3 low bits specifying
+  a pool code: [offset = dpp_t & ~0x7] [pool = dpp_t & 0x7]
+  pool == 0 means the offset is in pmem, whose management is by zuf and
+  to which full, easy access is provided for zus.
+
+  pool != 0 is a pre-established file (up to 6 such files per sb) where
+  zus has an mmap on the file and the Kernel can access that data
+  via an offset into the file.
+  pool == 7 denotes an offset into the application buffers associated
+  with the current IO.
+  The life-time rules of all dpp_t objects are strictly defined.
+  The primary use of dpp_t is the on-pmem inode structure. Both
+  zus and zuf can access and change this structure. On any modification
+  zus is called so as to be notified of any changes, for persistence.
+  More such objects are: Symlinks, xattrs, mmap-data-blocks etc...
+
+Relay-wait-object:
+  Communication between Kernel and server is done via zus-threads that
+  sleep in Kernel (inside an IOCTL) and wait for commands. Once a command is
+  received the IOCTL returns, the operation is executed, and the return info
+  is passed back via a new IOCTL call, which then waits for the next operation.
+  To wake up the sleeping thread we use a Relay-wait-object. Currently
+  it is two wait_queue_head_t(s) back to back.
+  In the future we should investigate the use of that special binder object
+  that releases its thread time slice to the other thread without going through
+  the scheduler.
+
+ZT-threads-array:
+  The novelty of the zufs is the ZT-threads system. One thread or more is
+  pre-created for each active core in the system.
+  ▪ The thread is AFFINITY set for that single core only.
+  ▪ Special communication file per ZT (O_TMPFILE + IOCTL_ZUFS_INIT)
+    At initialization the ZT thread communicates through a ZT_INIT ioctl
+    and registers as the handler of that core (Channel)
+  ▪ ZT-vma - Mmap 4M vma zero copy communication area per ZT
+    Pre allocated vma is created into which will be mapped the application
+    or Kernel buffers for the current operation.
+  ▪ IOCTL_ZU_WAIT_OPT – thread sleeps in Kernel waiting for an operation
+    via the IOCTL_ZU_WAIT_OPT call.
+
+  ▪ On an operation dispatch, the current CPU's ZT is selected and app pages are
+    mapped into the ZT-vma. Server thread is released with an operation to execute.
+  ▪ After execution, ZT returns to kernel (IOCTL_ZU_WAIT_OPT), app is released,
+    Server waits for a new operation on that CPU.
+  ▪ Each ZT has a cyclic logic. Each call to IOCTL_ZU_WAIT_OPT from Server
+    returns the results of the previous operation, before going to sleep
+    waiting to receive a new operation.
+	zus			zuf-zt				application
+    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     ---> IOCTL_ZU_WAIT_OPT    if (app-waiting)
+     |					wake-up-application	 -> return to app
+     |				FS-WAIT
+     |				|				<- POSIX call
+     |				V		<- fs-wake-up(dispatch)
+     |			<- return with new command
+     |--<- do_new_operation
+
+ZUS-mount-thread:
+  The system utilizes a single mount thread. (This thread has no affinity to
+  any core).
+  ▪ It will first register all FS-types supported by this Server (by calling
+    all zusFS plugins to register their supported types).
+  ▪ Once done, as above, the thread sleeps in Kernel via the IOCTL_ZU_MOUNT call.
+  ▪ When the Kernel receives a mount request (vfs calls the fs_type->mount opt)
+    a mount is dispatched back to zus.
+  ▪ NOTE: Only on the very first mount is the above ZT-threads-array created;
+    the same array is then used for all super-blocks in the system.
+  ▪ As part of the mount command in the context of this same mount-thread
+    a call to IOCTL_ZU_GRAB_PMEM will establish an interface to the pmem
+    Associated with this super_block
+  ▪ On return like above a new call to IOCTL_ZU_MOUNT will return info of the
+    mount before sleeping in kernel waiting for a new dispatch. All SB info
+    is provided to zuf, including the root inode info. Kernel then proceeds
+    to complete the mount call.
+  ▪ NOTE that since there is a single mount thread, all FS-registration,
+    super_block and pmem management are lockless.
+
+Philosophy of operations:
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1. [zuf-root]
+
+On module load (zuf.ko) a special pseudo FS is mounted on /sys/fs/zuf. This is
+called zuf-root.
+The zuf-root has no visible files. All communication is done via special-files.
+Special-files are opened with open(O_TMPFILE) and establish a special role via
+an IOCTL.
+All communications with the server are done via the zuf-root. Each root owns
+many FS-types and each FS-type owns many super-blocks of this type, all sharing
+the same communication channels.
+All FS-type Servers live in the same zus application address space. If at
+times the administrator wants to separate different servers, he/she
+can mount a new zuf-root and point a new server instance at that new mount,
+registering other FS-types on that other instance. The whole communication
+array will then be duplicated as well.
+(Otherwise pointing a new server instance at a busy root will return an error)
+
+2. [zus server start]
+  ▪ On load all configured zusFS plugins are loaded.
+  ▪ The Server starts by starting a single mount thread.
+  ▪ It then proceeds to register with the Kernel all FS-types it will support.
+    (This is done on the single mount thread, so all FS-registration and
+     mount/umount operate in a single thread and therefore need no locks)
+  ▪ It then sleeps in the Kernel on a special-file of that zuf-root, waiting
+    for a mount command.
+
+3. [mount -t xyz]
+  [In Kernel]
+  ▪ If xyz was registered above as part of the Server startup, the regular
+    mount command will come to the zuf module with a zuf_mount() call, with
+    the xyz-FS-info. In turn this points to a zuf-root.
+  ▪ The code then proceeds to load a device-table of devices as specified above.
+    It then establishes a multi_devices object with a specific pmem_id.
+  ▪ It proceeds to call mount_bdev, always with the same main-device,
+    thus fully supporting automatic bind mounts, even if different
+    devices are given to the mount command.
+  ▪ In zuf_fill_super it will then dispatch (awaken) the mount thread
+    specifying two parameters: first the FS-type to mount, and then
+    the pmem_id associated with this super_block.
+
+  [In zus]
+  ▪ A zus_super_block_info is allocated.
+  ▪ zus calls PMEM_GRAB(pmem_id) to establish a direct mapping to its
+    pmem devices. On return we have full access to our PMEM
+
+  ▪ ZT-threads-array
+    If this is the first mount the whole ZT-threads-array is created and
+    established. The mount thread will wait until all zt-threads have finished
+    initialization and are ready to rock.
+  ▪ Root-zus_inode is loaded and is returned to kernel
+  ▪ More info about the mount, like block sizes and so on, is returned to kernel.
+
+  [In Kernel]
+   The zuf_fill_super is finalized, vectors are established, and we have a new
+   super_block ready for operations.
+
+4. An FS operation like create or WRITE/READ and so on arrives from application
+   via VFS. Eventually an Operation is dispatched to zus:
+   ▪ A special per-operation descriptor is filled up with all parameters.
+   ▪ A current CPU channel is grabbed. The operation descriptor is put on
+     that channel (ZT), including get_user_pages or Kernel-pages associated
+     with this OPT.
+   ▪ The ZT is awakened, app thread put to sleep.
+   ▪ In ZT context pages are mapped to that ZT-vma. This is so we are sure
+     the map is only on a single core, and no other core's TLB is affected.
+     (This here is the whole performance secret)
+   ▪ ZT thread is returned to user-space.
+   ▪ In ZT context the zus Server calls the appropriate zusFS->operation
+     vector. Output params filled.
+   ▪ zus calls again with an IOCTL_ZU_WAIT_OPT with the same descriptor
+     to return the requested info.
+   ▪ At Kernel (zuf) the app thread is awakened with the results, and the
+     ZT thread goes back to sleep waiting for a new operation.
+
+   ZT rules:
+       A ZT thread must not return back to Kernel. One exception is locks:
+   if needed it might sleep waiting for a lock. In which case we will see that
+   the same CPU channel is reentered via another application and/or thread.
+   But now that CPU channel is taken. What we do is utilize a few channels
+   (ZTs) per core, and the threads may grab another channel. But this only
+   postpones the problem: on a busy contended system all such channels will be
+   consumed. If all channels are taken the application thread is put on a busy
+   scheduling wait until a channel can be grabbed.
+   Therefore the Server must not sleep on a ZT. If it needs such a sleeping
+   operation it will return -EAGAIN to zuf. The app is kept sleeping, the
+   operation is put on an asynchronous Q and the ZT is freed for foreground
+   operation. When the server completes the delayed operation it will notify
+   the Kernel with a special async cookie, and the app will be awakened.
+   (Here too we utilize pre-allocated async channels and vmas. If all channels
+    are busy, the application is kept sleeping, waiting for a free slot)
+
+5. On umount the operation is reversed and all resources are torn down.
+6. In case of an application or Server crash, all resources are associated
+   with files; on file_release these resources are caught and freed.
+
+Objects and life-time
+~~~~~~~~~~~~~~~~~~~~~
+
+Each Kernel object type has an associated zus Server object type whose life
+time is governed by the life-time of the Kernel object. Therefore the Server's
+job is easy because it need not establish any object caches / hashes and so on.
+
+Inside zus all objects are allocated by the zusFS plugin. So in turn it can
+allocate a bigger space for its own private data and access it via the
+container_of() coding pattern. So when I say below a zus-object I mean both
+zus public part + zusFS private part of the same object.
+
+All operations return a User-mode pointer that is opaque to the Kernel
+code; it is just a cookie which is returned back to zus when needed.
+At times when we want the Kernel to have direct access to a zus object like
+zus_inode, along with the cookie we also return a dpp_t, with a defined
+structure.
+
+Kernel object 			| zus object 		| Kernel access (via dpp_t)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+zuf_fs_type
+	file_system_type	| zus_fs_info		| no
+
+zuf_sb_info
+	super_block		| zus_sb_info		| no
+
+zuf_inode_info			|			|
+	vfs_inode		| zus_inode_info	| no
+	zus_inode *		| 	zus_inode *	| yes
+	symlink *		|	char-array	| yes
+	xattr**			|	zus_xattr	| yes
+
+When it is time for a Kernel object to die, a final call to zus is
+dispatched so the associated object can also be freed. Which means
+that on memory pressure, when object caches are evicted, the zus
+memory resources are freed as well.
+
+
+How to use zufs:
+~~~~~~~~~~~~~~~~
+
+The most up-to-date documentation of how to use the latest code bases
+is the script (set of scripts) at fs/do-zu/zudo in the zus git tree.
+
+We the developers at Netapp use this script to mount and test our
+latest code. So any new Secret will be found in these scripts. Please
+read them as the ultimate source of how to operate things.
+
+TODO: We are looking for experts in systemd and udev to properly
+integrate these tools into a distro.
+
+We assume you cloned these git trees:
+[]$ mkdir zufs; cd zufs
+[]$ git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream
+[]$ git clone https://github.com/NetApp/zufs-zus -b zus-github
+
+This will create the following trees
+zufs/zus - Source code for Server
+zufs/zuf - Linux Kernel source tree to compile and install on your machine
+
+Also specifically:
+zufs/zus/fs/do-zu/zudo - script Documenting how to run things
+
+[]$ cd zuf
+
+First time:
+[]$ ../zus/fs/do-zu/zudo
+This will create a file:
+	../zus/fs/do-zu/zu.conf
+
+Edit this file for your environment: devices, mount-point and so on.
+On first run an example file will be created for you. Fill in the
+blanks. Most params can stay as is in most cases.
+
+Now let's start running:
+
+[1]$ ../zus/fs/do-zu/zudo mkfs
+This will run the proper mkfs command selected in the zu.conf file
+with the proper devices.
+
+[2]$ ../zus/fs/do-zu/zudo zuf-insmod
+This loads the zuf.ko module
+
+[3]$ ../zus/fs/do-zu/zudo zuf-root
+This mounts the zuf-root FS above on /sys/fs/zuf (automatically created above)
+
+[4]$ ../zus/fs/do-zu/zudo zus-up
+This runs the zus daemon in the background
+
+[5]$ ../zus/fs/do-zu/zudo mount
+This mounts the FS created by mkfs above on the dir specified in zu.conf
+
+To run all the 5 commands above at once do:
+[]$ ../zus/fs/do-zu/zudo up
+
+To undo all the above in reverse order do:
+[]$ ../zus/fs/do-zu/zudo down
+
+And the most magic command is:
+[]$ ../zus/fs/do-zu/zudo again
+This will do a "down", then update-mods, then "up"
+(update-mods is a special script to copy the latest compiled binaries)
+
+Now you are ready for some:
+[]$ ../zus/fs/do-zu/zudo xfstest
+xfstests is assumed to be installed in the regular /opt/xfstests dir
+
+Again, please see inside the scripts what each command does.
+These scripts are the ultimate Documentation; do not believe
+anything I'm saying here. (Because it is outdated by now)
-- 
2.20.1



* [RFC PATCH 03/17] zuf: zuf-rootfs
  2019-02-19 11:51 [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Boaz harrosh
  2019-02-19 11:51 ` [RFC PATCH 01/17] fs: Add the ZUF filesystem to the build + License Boaz harrosh
  2019-02-19 11:51 ` [RFC PATCH 02/17] zuf: Preliminary Documentation Boaz harrosh
@ 2019-02-19 11:51 ` Boaz harrosh
  2019-02-19 11:51 ` [RFC PATCH 04/17] zuf: zuf-core The ZTs Boaz harrosh
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 31+ messages in thread
From: Boaz harrosh @ 2019-02-19 11:51 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro
  Cc: Ric Wheeler, Miklos Szeredi, Steven Whitehouse, Jefff moyer,
	Amir Goldstein, Amit Golander, Sagi Manole

From: Boaz Harrosh <boazh@netapp.com>

zuf-root is a pseudo FS through which the zus Server communicates,
registers new file-systems, and receives new mount requests.

In this patch we have the bring up of that special FS.

The principal communication with zuf-rootfs is done by doing
an open(O_TMPFILE) and then invoking some IOCTL_XXX on the file. This
establishes a zuf_special_file type of object attached to the
"file *", and by that defines special behavior for that object.
(The picture will be clearer in future patches.)
Otherwise zuf-rootfs is not an FS at all, and has no viewable
files.
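
For readers who want to see the open(O_TMPFILE) + ioctl sequence from
the Server side, here is a minimal user-mode sketch. It is only an
illustration, not the code of the zus Server. The two structs and
ZU_IOC_REGISTER_FS are copied from the zus_api.h hunk of this patch (the
real header wins on any mismatch). The helper name
zuf_register_fs_sketch() is hypothetical, and the sketch assumes the
ioctl is wired into zufc_ioctl() as done by later patches in the series;
with this patch alone the Kernel answers -ENOTTY.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* O_TMPFILE is a GNU extension in glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/ioctl.h>
#include <linux/types.h>

/* Copied from the zus_api.h hunk of this patch */
struct zufs_ioc_hdr {
	__u32 err;
	__u16 in_len;
	__u16 out_max;
	__u16 out_start;
	__u16 out_len;
	__u32 operation;
	__u32 offset;
	__u32 len;
};

struct zus_fs_info;
struct zufs_ioc_register_fs {
	struct zufs_ioc_hdr hdr;
	struct zus_fs_info *zus_zfi;
	struct register_fs_info {
		char fsname[16];
		__u32 FS_magic;
		__u32 FS_ver_major;
		__u32 FS_ver_minor;
		__u8 notused[3];
		__u64 dt_offset;
		__u32 s_time_gran;
		__u32 def_mode;
		__u64 s_maxbytes;
	} rfi;
};
#define ZU_IOC_REGISTER_FS	_IOWR('Z', 10, struct zufs_ioc_register_fs)

/*
 * Hypothetical helper (not part of zus): open an anonymous file on the
 * zuf-root mount and register one FS type through it.
 * Returns the open fd on success, a negative errno on failure.
 */
int zuf_register_fs_sketch(const char *zuf_root, const char *fsname,
			   __u32 magic)
{
	struct zufs_ioc_register_fs rfs;
	int fd, err;

	assert(zuf_root && fsname);

	/* On the Kernel side this lands in zufr_tmpfile() */
	fd = open(zuf_root, O_TMPFILE | O_RDWR, 0600);
	if (fd < 0)
		return -errno;

	memset(&rfs, 0, sizeof(rfs));
	snprintf(rfs.rfi.fsname, sizeof(rfs.rfi.fsname), "%s", fsname);
	rfs.rfi.FS_magic = magic;

	if (ioctl(fd, ZU_IOC_REGISTER_FS, &rfs) < 0)
		err = -errno;
	else
		err = (int)rfs.hdr.err;	/* Kernel writes its result here */

	if (err) {
		close(fd);
		return err < 0 ? err : -EIO;
	}
	return fd;
}
```

A caller would pass the zuf-root mount point, for example
zuf_register_fs_sketch("/sys/fs/zuf", "foo", 0x1234), where the fsname
and magic are made-up values. Any negative return means the open or the
registration failed.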

The zuf-rootfs (mount -t zuf) is by default mounted on
/sys/fs/zuf. If an admin wants to run more server applications
(note that each server application supports many types of FSs),
he/she can mount a second instance of -t zuf and point the new
Server to it.

(Otherwise a second instance attempting to communicate with a
 busy zuf will fail.)

TODO: How to trigger a first mount on module_load. Currently the
admin needs to manually run "mount -t zuf none /sys/fs/zuf".
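
Until that TODO is resolved, a launcher could issue the same mount
programmatically. The sketch below is only the mount(2) twin of the
manual command above, under the assumption that the caller has
CAP_SYS_ADMIN and that zuf.ko is already loaded. The helper name
zuf_root_mount_sketch() is hypothetical and is not taken from the zus
tree.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: same effect as "mount -t zuf none <target>".
 * Returns 0 on success, a negative errno on failure.
 */
int zuf_root_mount_sketch(const char *target)
{
	assert(target);

	/* "none" is the dummy source, "zuf" is the name in zufr_type */
	if (mount("none", target, "zuf", 0, NULL) < 0)
		return -errno;
	return 0;
}
```

Calling zuf_root_mount_sketch("/sys/fs/zuf") matches the default mount
point. The target is left as a parameter so that an admin can point a
Server at another directory.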

Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile   |   4 +
 fs/zuf/_extern.h  |  38 +++++
 fs/zuf/_pr.h      |  43 ++++++
 fs/zuf/super.c    |  53 +++++++
 fs/zuf/zuf-core.c |  60 ++++++++
 fs/zuf/zuf-root.c | 347 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/zuf.h      | 108 +++++++++++++++
 fs/zuf/zus_api.h  |  36 +++++
 8 files changed, 689 insertions(+)
 create mode 100644 fs/zuf/_extern.h
 create mode 100644 fs/zuf/_pr.h
 create mode 100644 fs/zuf/super.c
 create mode 100644 fs/zuf/zuf-core.c
 create mode 100644 fs/zuf/zuf-root.c
 create mode 100644 fs/zuf/zuf.h

diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index e75ba8a77974..8e62b4c52150 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -10,5 +10,9 @@
 
 obj-$(CONFIG_ZUF) += zuf.o
 
+# ZUF core
+zuf-y += zuf-core.o zuf-root.o
+
 # Main FS
+zuf-y += super.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
new file mode 100644
index 000000000000..3bb9f1d9acf6
--- /dev/null
+++ b/fs/zuf/_extern.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#ifndef __ZUF_EXTERN_H__
+#define __ZUF_EXTERN_H__
+/*
+ * DO NOT INCLUDE this file directly, it is included by zuf.h
+ * It is here because zuf.h got too big
+ */
+
+/*
+ * extern functions declarations
+ */
+
+/* super.c */
+int zuf_init_inodecache(void);
+void zuf_destroy_inodecache(void);
+
+struct dentry *zuf_mount(struct file_system_type *fs_type, int flags,
+			 const char *dev_name, void *data);
+
+/* zuf-core.c */
+long zufc_ioctl(struct file *filp, unsigned int cmd, ulong arg);
+int zufc_release(struct inode *inode, struct file *file);
+int zufc_mmap(struct file *file, struct vm_area_struct *vma);
+
+/* zuf-root.c */
+int zufr_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs);
+
+#endif	/*ndef __ZUF_EXTERN_H__*/
diff --git a/fs/zuf/_pr.h b/fs/zuf/_pr.h
new file mode 100644
index 000000000000..30b8cf912c1f
--- /dev/null
+++ b/fs/zuf/_pr.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#ifndef __ZUF_PR_H__
+#define __ZUF_PR_H__
+
+#ifdef pr_fmt
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#endif
+
+/*
+ * Debug code
+ */
+#define zuf_err(s, args ...)		pr_err("[%s:%d] " s, __func__, \
+							__LINE__, ## args)
+#define zuf_err_cnd(silent, s, args ...) \
+	do {if (!silent) \
+		pr_err("[%s:%d] " s, __func__, __LINE__, ## args); \
+	} while (0)
+#define zuf_warn(s, args ...)		pr_warn("[%s:%d] " s, __func__, \
+							__LINE__, ## args)
+#define zuf_warn_cnd(silent, s, args ...) \
+	do {if (!silent) \
+		pr_warn("[%s:%d] " s, __func__, __LINE__, ## args); \
+	} while (0)
+#define zuf_info(s, args ...)          pr_info("~info~ " s, ## args)
+
+#define zuf_chan_debug(c, s, args...)	pr_debug(c " [%s:%d] " s, __func__, \
+							__LINE__, ## args)
+
+/* ~~~ channel prints ~~~ */
+#define zuf_dbg_err(s, args ...)	zuf_chan_debug("error", s, ##args)
+
+#endif /* define __ZUF_PR_H__ */
diff --git a/fs/zuf/super.c b/fs/zuf/super.c
new file mode 100644
index 000000000000..f7f7798425a9
--- /dev/null
+++ b/fs/zuf/super.c
@@ -0,0 +1,53 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Super block operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#include <linux/types.h>
+#include <linux/parser.h>
+#include <linux/statfs.h>
+#include <linux/backing-dev.h>
+
+#include "zuf.h"
+
+static struct kmem_cache *zuf_inode_cachep;
+
+static void _init_once(void *foo)
+{
+	struct zuf_inode_info *zii = foo;
+
+	inode_init_once(&zii->vfs_inode);
+}
+
+int __init zuf_init_inodecache(void)
+{
+	zuf_inode_cachep = kmem_cache_create("zuf_inode_cache",
+					       sizeof(struct zuf_inode_info),
+					       0,
+					       (SLAB_RECLAIM_ACCOUNT |
+						SLAB_MEM_SPREAD |
+						SLAB_TYPESAFE_BY_RCU),
+					       _init_once);
+	if (zuf_inode_cachep == NULL)
+		return -ENOMEM;
+	return 0;
+}
+
+void zuf_destroy_inodecache(void)
+{
+	kmem_cache_destroy(zuf_inode_cachep);
+}
+
+struct dentry *zuf_mount(struct file_system_type *fs_type, int flags,
+			 const char *dev_name, void *data)
+{
+	return ERR_PTR(-ENOTSUPP);
+}
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
new file mode 100644
index 000000000000..e12cae584f8a
--- /dev/null
+++ b/fs/zuf/zuf-core.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Ioctl operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/mm_types.h>
+#include <linux/delay.h>
+#include <linux/pfn_t.h>
+#include <linux/sched/signal.h>
+
+#include "zuf.h"
+
+long zufc_ioctl(struct file *file, unsigned int cmd, ulong arg)
+{
+	switch (cmd) {
+	default:
+		zuf_err("%d\n", cmd);
+		return -ENOTTY;
+	}
+}
+
+int zufc_release(struct inode *inode, struct file *file)
+{
+	struct zuf_special_file *zsf = file->private_data;
+
+	if (!zsf)
+		return 0;
+
+	switch (zsf->type) {
+	default:
+		return 0;
+	}
+}
+
+int zufc_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct zuf_special_file *zsf = file->private_data;
+
+	if (unlikely(!zsf)) {
+		zuf_err("Which mmap is that !!!!\n");
+		return -ENOTTY;
+	}
+
+	switch (zsf->type) {
+	default:
+		zuf_err("type=%d\n", zsf->type);
+		return -ENOTTY;
+	}
+}
diff --git a/fs/zuf/zuf-root.c b/fs/zuf/zuf-root.c
new file mode 100644
index 000000000000..55a839dbc854
--- /dev/null
+++ b/fs/zuf/zuf-root.c
@@ -0,0 +1,347 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * ZUF Root filesystem.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * ZUF core is mounted on a small specialized FS that
+ * provides the communication with the mount thread, zuf multi-channel
+ * communication [ZTs], and the pmem devices.
+ * Subsequently all FS super_blocks are children of this root, and point
+ * to it. All using the same zuf communication multi-channel.
+ *
+ * [
+ * TODO:
+ *	Multiple servers can run on Multiple mounted roots. Each registering
+ *	their own FSTYPEs. Admin should make sure that the FSTYPEs do not
+ *	overlap
+ * ]
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/magic.h>
+#include <asm-generic/mman.h>
+
+#include "zuf.h"
+
+/* ~~~~ Register/Unregister FS-types ~~~~ */
+#ifdef CONFIG_LOCKDEP
+
+/*
+ * NOTE: When CONFIG_LOCKDEP is on the register_filesystem complains when
+ * the fstype object is from a kmalloc. Because of some lockdep_keys not
+ * being const_obj something.
+ *
+ * So in this case we have maximum of 16 fstypes system wide
+ * (Total for all mounted zuf_root(s)). This way we can have them
+ * in const_obj memory below at g_fs_array
+ */
+
+enum { MAX_LOCKDEP_FSs = 16 };
+static uint g_fs_next;
+static struct zuf_fs_type g_fs_array[MAX_LOCKDEP_FSs];
+
+static struct zuf_fs_type *_fs_type_alloc(void)
+{
+	struct zuf_fs_type *ret;
+
+	if (MAX_LOCKDEP_FSs <= g_fs_next)
+		return NULL;
+
+	ret = &g_fs_array[g_fs_next++];
+	memset(ret, 0, sizeof(*ret));
+	return ret;
+}
+
+static void _fs_type_free(struct zuf_fs_type *zft)
+{
+	if (zft == &g_fs_array[0])
+		g_fs_next = 0;
+}
+
+#else /* !CONFIG_LOCKDEP*/
+static struct zuf_fs_type *_fs_type_alloc(void)
+{
+	return kcalloc(1, sizeof(struct zuf_fs_type), GFP_KERNEL);
+}
+
+static void _fs_type_free(struct zuf_fs_type *zft)
+{
+	kfree(zft);
+}
+#endif /*CONFIG_LOCKDEP*/
+
+int zufr_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs)
+{
+	struct zuf_fs_type *zft = _fs_type_alloc();
+
+	if (unlikely(!zft))
+		return -ENOMEM;
+
+	/* Original vfs file type */
+	zft->vfs_fst.owner	= THIS_MODULE;
+	zft->vfs_fst.name	= kstrdup(rfs->rfi.fsname, GFP_KERNEL);
+	zft->vfs_fst.mount	= zuf_mount;
+	zft->vfs_fst.kill_sb	= kill_block_super;
+
+	/* ZUS info about this FS */
+	zft->rfi		= rfs->rfi;
+	zft->zus_zfi		= rfs->zus_zfi;
+	INIT_LIST_HEAD(&zft->list);
+	/* Back pointer to our communication channels */
+	zft->zri		= ZRI(sb);
+
+	zuf_add_fs_type(zft->zri, zft);
+	zuf_info("register_filesystem [%s]\n", zft->vfs_fst.name);
+	return register_filesystem(&zft->vfs_fst);
+}
+
+static void _unregister_all_fses(struct zuf_root_info *zri)
+{
+	struct zuf_fs_type *zft, *n;
+
+	list_for_each_entry_safe_reverse(zft, n, &zri->fst_list, list) {
+		unregister_filesystem(&zft->vfs_fst);
+		list_del_init(&zft->list);
+		_fs_type_free(zft);
+	}
+}
+
+static int zufr_unlink(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = dentry->d_inode;
+
+	drop_nlink(inode);
+	return 0;
+}
+
+/* Force alignment of 2M for all vma(s)
+ *
+ * This belongs to t1.c and what it does for mmap. But we do not mind
+ * that both our mmaps (grab_pmem or ZTs) will be 2M aligned so keep
+ * it here. And zus mappings just all match perfectly with no need for
+ * holes.
+ * FIXME: This is copy/paste from dax-device. It can be very much simplified
+ * for what we need.
+ */
+static unsigned long zufr_get_unmapped_area(struct file *filp,
+		unsigned long addr, unsigned long len, unsigned long pgoff,
+		unsigned long flags)
+{
+	unsigned long off, off_end, off_align, len_align, addr_align;
+	unsigned long align = PMD_SIZE;
+
+	if (addr)
+		goto out;
+
+	off = pgoff << PAGE_SHIFT;
+	off_end = off + len;
+	off_align = round_up(off, align);
+
+	if ((off_end <= off_align) || ((off_end - off_align) < align))
+		goto out;
+
+	len_align = len + align;
+	if ((off + len_align) < off)
+		goto out;
+
+	addr_align = current->mm->get_unmapped_area(filp, addr, len_align,
+			pgoff, flags);
+	if (!IS_ERR_VALUE(addr_align)) {
+		addr_align += (off - addr_align) & (align - 1);
+		return addr_align;
+	}
+ out:
+	return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
+}
+
+static const struct inode_operations zufr_inode_operations;
+static const struct file_operations zufr_file_dir_operations = {
+	.open		= dcache_dir_open,
+	.release	= dcache_dir_close,
+	.llseek		= dcache_dir_lseek,
+	.read		= generic_read_dir,
+	.iterate_shared	= dcache_readdir,
+	.fsync		= noop_fsync,
+	.unlocked_ioctl = zufc_ioctl,
+};
+static const struct file_operations zufr_file_reg_operations = {
+	.fsync			= noop_fsync,
+	.unlocked_ioctl		= zufc_ioctl,
+	.get_unmapped_area	= zufr_get_unmapped_area,
+	.mmap			= zufc_mmap,
+	.release		= zufc_release,
+};
+
+static int zufr_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+	struct zuf_root_info *zri = ZRI(dir->i_sb);
+	struct inode *inode;
+	int err;
+
+	inode = new_inode(dir->i_sb);
+	if (!inode)
+		return -ENOMEM;
+
+	/* We need to impersonate device-dax (S_DAX + S_IFCHR) in order to get
+	 * the PMD (huge) page faults and allow RDMA memory access via GUP
+	 * (get_user_pages_longterm).
+	 */
+	inode->i_flags = S_DAX;
+	mode = (mode & ~S_IFREG) | S_IFCHR; /* change file type to char */
+
+	inode->i_ino = ++zri->next_ino; /* non-atomic: only one mount thread */
+	inode->i_blocks = inode->i_size = 0;
+	inode->i_ctime = inode->i_mtime = current_time(inode);
+	inode->i_atime = inode->i_ctime;
+	inode_init_owner(inode, dir, mode);
+
+	inode->i_op = &zufr_inode_operations;
+	inode->i_fop = &zufr_file_reg_operations;
+
+	err = insert_inode_locked(inode);
+	if (unlikely(err)) {
+		zuf_err("[%ld] insert_inode_locked => %d\n", inode->i_ino, err);
+		goto fail;
+	}
+	d_tmpfile(dentry, inode);
+	unlock_new_inode(inode);
+	return 0;
+
+fail:
+	clear_nlink(inode);
+	make_bad_inode(inode);
+	iput(inode);
+	return err;
+}
+
+static void zufr_put_super(struct super_block *sb)
+{
+	struct zuf_root_info *zri = ZRI(sb);
+
+	_unregister_all_fses(zri);
+
+	zuf_info("zuf_root umount\n");
+}
+
+static void zufr_evict_inode(struct inode *inode)
+{
+	clear_inode(inode);
+}
+
+static const struct inode_operations zufr_inode_operations = {
+	.lookup		= simple_lookup,
+
+	.tmpfile	= zufr_tmpfile,
+	.unlink		= zufr_unlink,
+};
+static const struct super_operations zufr_super_operations = {
+	.statfs		= simple_statfs,
+
+	.evict_inode	= zufr_evict_inode,
+	.put_super	= zufr_put_super,
+};
+
+#define ZUFR_SUPER_MAGIC 0x1717
+
+static int zufr_fill_super(struct super_block *sb, void *data, int silent)
+{
+	static struct tree_descr zufr_files[] = {
+		{""},
+	};
+	struct zuf_root_info *zri;
+	struct inode *root_i;
+	int err;
+
+	zri = kzalloc(sizeof(*zri), GFP_KERNEL);
+	if (!zri) {
+		zuf_err_cnd(silent,
+			    "Not enough memory to allocate zuf_root_info\n");
+		return -ENOMEM;
+	}
+
+	err = simple_fill_super(sb, ZUFR_SUPER_MAGIC, zufr_files);
+	if (unlikely(err)) {
+		kfree(zri);
+		return err;
+	}
+
+	sb->s_op = &zufr_super_operations;
+	sb->s_fs_info = zri;
+	zri->sb = sb;
+
+	root_i = sb->s_root->d_inode;
+	root_i->i_fop = &zufr_file_dir_operations;
+	root_i->i_op = &zufr_inode_operations;
+
+	mutex_init(&zri->sbl_lock);
+	INIT_LIST_HEAD(&zri->fst_list);
+	INIT_LIST_HEAD(&zri->pmem_list);
+
+	return 0;
+}
+
+static struct dentry *zufr_mount(struct file_system_type *fs_type,
+				  int flags, const char *dev_name,
+				  void *data)
+{
+	struct dentry *ret = mount_single(fs_type, flags, data, zufr_fill_super);
+
+	zuf_info("zuf_root mount [%ld]\n",
+		 IS_ERR_OR_NULL(ret) ? PTR_ERR(ret) : ret->d_inode->i_ino);
+	return ret;
+}
+
+static struct file_system_type zufr_type = {
+	.owner =	THIS_MODULE,
+	.name =		"zuf",
+	.mount =	zufr_mount,
+	.kill_sb	= kill_litter_super,
+};
+
+/* Create a /sys/fs/zuf/ directory to mount on */
+static struct kset *zufr_kset;
+
+int __init zuf_root_init(void)
+{
+	int err = zuf_init_inodecache();
+
+	if (unlikely(err))
+		return err;
+
+	zufr_kset = kset_create_and_add("zuf", NULL, fs_kobj);
+	if (!zufr_kset) {
+		err = -ENOMEM;
+		goto un_inodecache;
+	}
+
+	err = register_filesystem(&zufr_type);
+	if (unlikely(err))
+		goto un_kset;
+
+	return 0;
+
+un_kset:
+	kset_unregister(zufr_kset);
+un_inodecache:
+	zuf_destroy_inodecache();
+	return err;
+}
+
+void __exit zuf_root_exit(void)
+{
+	unregister_filesystem(&zufr_type);
+	kset_unregister(zufr_kset);
+	zuf_destroy_inodecache();
+}
+
+module_init(zuf_root_init)
+module_exit(zuf_root_exit)
diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
new file mode 100644
index 000000000000..f979d8cbe60c
--- /dev/null
+++ b/fs/zuf/zuf.h
@@ -0,0 +1,108 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Definitions for the ZUF filesystem.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#ifndef __ZUF_H
+#define __ZUF_H
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/types.h>
+#include <linux/xattr.h>
+#include <linux/exportfs.h>
+#include <linux/page_ref.h>
+
+#include "zus_api.h"
+
+#include "_pr.h"
+
+enum zlfs_e_special_file {
+	zlfs_e_zt = 1,
+	zlfs_e_mout_thread,
+	zlfs_e_pmem,
+	zlfs_e_dpp_buff,
+};
+
+struct zuf_special_file {
+	enum zlfs_e_special_file type;
+	struct file *file;
+};
+
+/* This is the zuf-root.c mini filesystem */
+struct zuf_root_info {
+	struct __mount_thread_info {
+		struct zuf_special_file zsf;
+		struct zufs_ioc_mount *zim;
+	} mount;
+
+	#define SBL_INC 64
+	struct sb_is_list {
+		uint num;
+		uint max;
+		struct super_block **array;
+	} sbl;
+	struct mutex sbl_lock;
+
+	ulong next_ino;
+
+	struct zuf_threads_pool *_ztp;
+
+	struct super_block *sb;
+	struct list_head fst_list;
+
+	uint next_pmem_id;
+	struct list_head pmem_list;
+};
+
+static inline struct zuf_root_info *ZRI(struct super_block *sb)
+{
+	struct zuf_root_info *zri = sb->s_fs_info;
+
+	WARN_ON(zri->sb != sb);
+	return zri;
+}
+
+struct zuf_fs_type {
+	struct file_system_type vfs_fst;
+	struct zus_fs_info	*zus_zfi;
+	struct register_fs_info rfi;
+	struct zuf_root_info *zri;
+
+	struct list_head list;
+};
+
+static inline void zuf_add_fs_type(struct zuf_root_info *zri,
+				   struct zuf_fs_type *zft)
+{
+	/* Unlocked for now only one mount-thread with zus */
+	list_add(&zft->list, &zri->fst_list);
+}
+
+/*
+ * ZUF per-inode data in memory
+ */
+struct zuf_inode_info {
+	struct inode		vfs_inode;
+};
+
+static inline struct zuf_inode_info *ZUII(struct inode *inode)
+{
+	return container_of(inode, struct zuf_inode_info, vfs_inode);
+}
+
+/* Keep this include last thing in file */
+#include "_extern.h"
+
+#endif /* __ZUF_H */
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index f01db11721f4..34e3e1a9a107 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -66,4 +66,40 @@
 
 #endif /*  ndef __KERNEL__ */
 
+struct zufs_ioc_hdr {
+	__u32 err;	/* IN/OUT must be first */
+	__u16 in_len;	/* How much to be copied *to* zus */
+	__u16 out_max;	/* Max receive buffer at dispatch caller */
+	__u16 out_start;/* Start of output parameters (to caller) */
+	__u16 out_len;	/* How much to be copied *from* zus to caller */
+			/* can be modified by zus */
+	__u32 operation;/* One of e_zufs_operation */
+	__u32 offset;	/* Start of user buffer in ZT mmap */
+	__u32 len;	/* Len of user buffer in ZT mmap */
+};
+
+/* Register FS */
+/* A cookie from user-mode given in register_fs_info */
+struct zus_fs_info;
+struct zufs_ioc_register_fs {
+	struct zufs_ioc_hdr hdr;
+	struct zus_fs_info *zus_zfi;
+	struct register_fs_info {
+		/* IN */
+		char fsname[16];	/* Only 4 chars and a NUL please      */
+		__u32 FS_magic;         /* This is the FS's version && magic  */
+		__u32 FS_ver_major;	/* on disk, not the zuf-to-zus version*/
+		__u32 FS_ver_minor;	/* (See also struct md_dev_table)   */
+
+		__u8 notused[3];
+		__u64 dt_offset;
+
+		__u32 s_time_gran;
+		__u32 def_mode;
+		__u64 s_maxbytes;
+
+	} rfi;
+};
+#define ZU_IOC_REGISTER_FS	_IOWR('Z', 10, struct zufs_ioc_register_fs)
+
 #endif /* _LINUX_ZUFS_API_H */
-- 
2.20.1



* [RFC PATCH 04/17] zuf: zuf-core The ZTs
  2019-02-19 11:51 [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Boaz harrosh
                   ` (2 preceding siblings ...)
  2019-02-19 11:51 ` [RFC PATCH 03/17] zuf: zuf-rootfs Boaz harrosh
@ 2019-02-19 11:51 ` Boaz harrosh
  2019-02-26 18:34   ` Schumaker, Anna
  2019-02-19 11:51 ` [RFC PATCH 05/17] zuf: Multy Devices Boaz harrosh
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 31+ messages in thread
From: Boaz harrosh @ 2019-02-19 11:51 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro
  Cc: Ric Wheeler, Miklos Szeredi, Steven Whitehouse, Jefff moyer,
	Amir Goldstein, Amit Golander, Sagi Manole

From: Boaz Harrosh <boazh@netapp.com>

zuf-core establishes the communication channels with the ZUS
User Mode Server.

In this patch we have the core communication mechanics,
which are the novelty of this project.
(See the previously submitted documentation for more info.)

Users will come later in the patchset.
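
To make the hand-off easier to picture before reading relay.h, here is
a small user-mode model of the relay ping-pong. It is a sketch under
loud assumptions: it uses a pthread mutex and two condition variables in
place of the Kernel wait queues and flags, it models a single
application thread, and all names (relay_model, server_thread, dispatch,
relay_model_demo) are made up for illustration. The "operation" is just
doubling an int.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* User-mode stand-in for struct relay; not the Kernel structure */
struct relay_model {
	pthread_mutex_t lock;
	pthread_cond_t fss_cv;	/* the server (fss side) sleeps here */
	pthread_cond_t app_cv;	/* the application sleeps here */
	bool fss_wakeup;
	bool app_wakeup;
	bool stop;
	int payload;		/* stands in for the dispatched operation */
};

static void *server_thread(void *arg)
{
	struct relay_model *r = arg;

	pthread_mutex_lock(&r->lock);
	for (;;) {
		/* like relay_fss_wait(): sleep until work is handed over */
		while (!r->fss_wakeup && !r->stop)
			pthread_cond_wait(&r->fss_cv, &r->lock);
		if (r->stop)
			break;
		r->fss_wakeup = false;

		r->payload *= 2;	/* "execute" the operation */

		/* like relay_app_wakeup(): release the waiting application */
		r->app_wakeup = true;
		pthread_cond_signal(&r->app_cv);
	}
	pthread_mutex_unlock(&r->lock);
	return NULL;
}

/* like relay_fss_wakeup_app_wait(): hand over, then sleep for the reply */
static int dispatch(struct relay_model *r, int value)
{
	int ret;

	assert(r);
	pthread_mutex_lock(&r->lock);
	r->payload = value;
	r->app_wakeup = false;
	r->fss_wakeup = true;
	pthread_cond_signal(&r->fss_cv);
	while (!r->app_wakeup)
		pthread_cond_wait(&r->app_cv, &r->lock);
	ret = r->payload;
	pthread_mutex_unlock(&r->lock);
	return ret;
}

/* Runs one round trip; returns the server's answer or -1 on failure */
int relay_model_demo(int value)
{
	struct relay_model r = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.fss_cv = PTHREAD_COND_INITIALIZER,
		.app_cv = PTHREAD_COND_INITIALIZER,
	};
	pthread_t tid;
	int ret;

	if (pthread_create(&tid, NULL, server_thread, &r))
		return -1;
	ret = dispatch(&r, value);

	/* tell the server to exit and wait for it */
	pthread_mutex_lock(&r.lock);
	r.stop = true;
	pthread_cond_signal(&r.fss_cv);
	pthread_mutex_unlock(&r.lock);
	pthread_join(tid, NULL);
	return ret;
}
```

The point of the model is the ordering only: the application never
proceeds until the server has flagged app_wakeup, and the server never
runs an operation until fss_wakeup is set. The Kernel relay reaches the
same ordering with wait_event() and an optional spinlock instead of a
mutex.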

Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/_extern.h  |   22 +
 fs/zuf/_pr.h      |    4 +
 fs/zuf/relay.h    |   88 ++++
 fs/zuf/zuf-core.c | 1016 ++++++++++++++++++++++++++++++++++++++++++++-
 fs/zuf/zuf-root.c |    7 +
 fs/zuf/zuf.h      |   46 ++
 fs/zuf/zus_api.h  |  185 +++++++++
 7 files changed, 1367 insertions(+), 1 deletion(-)
 create mode 100644 fs/zuf/relay.h

diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index 3bb9f1d9acf6..52bb6b9deafe 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -28,10 +28,32 @@ struct dentry *zuf_mount(struct file_system_type *fs_type, int flags,
 			 const char *dev_name, void *data);
 
 /* zuf-core.c */
+int zufc_zts_init(struct zuf_root_info *zri); /* Some private types in core */
+void zufc_zts_fini(struct zuf_root_info *zri);
+
 long zufc_ioctl(struct file *filp, unsigned int cmd, ulong arg);
 int zufc_release(struct inode *inode, struct file *file);
 int zufc_mmap(struct file *file, struct vm_area_struct *vma);
 
+int __zufc_dispatch_mount(struct zuf_root_info *zri,
+			  enum e_mount_operation op,
+			  struct zufs_ioc_mount *zim);
+int zufc_dispatch_mount(struct zuf_root_info *zri, struct zus_fs_info *zus_zfi,
+			enum e_mount_operation operation,
+			struct zufs_ioc_mount *zim);
+
+const char *zuf_op_name(enum e_zufs_operation op);
+int __zufc_dispatch(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo);
+static inline
+int zufc_dispatch(struct zuf_root_info *zri, struct zufs_ioc_hdr *hdr,
+		  struct page **pages, uint nump)
+{
+	struct zuf_dispatch_op zdo;
+
+	zuf_dispatch_init(&zdo, hdr, pages, nump);
+	return __zufc_dispatch(zri, &zdo);
+}
+
 /* zuf-root.c */
 int zufr_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs);
 
diff --git a/fs/zuf/_pr.h b/fs/zuf/_pr.h
index 30b8cf912c1f..dc9f85453890 100644
--- a/fs/zuf/_pr.h
+++ b/fs/zuf/_pr.h
@@ -39,5 +39,9 @@
 
 /* ~~~ channel prints ~~~ */
 #define zuf_dbg_err(s, args ...)	zuf_chan_debug("error", s, ##args)
+#define zuf_dbg_vfs(s, args ...)	zuf_chan_debug("vfs  ", s, ##args)
+#define zuf_dbg_core(s, args ...)	zuf_chan_debug("core ", s, ##args)
+#define zuf_dbg_zus(s, args ...)	zuf_chan_debug("zusdg", s, ##args)
+#define zuf_dbg_verbose(s, args ...)	zuf_chan_debug("d-oto", s, ##args)
 
 #endif /* define __ZUF_PR_H__ */
diff --git a/fs/zuf/relay.h b/fs/zuf/relay.h
new file mode 100644
index 000000000000..a17d242b313a
--- /dev/null
+++ b/fs/zuf/relay.h
@@ -0,0 +1,88 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Relay scheduler-object Header file.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#ifndef __RELAY_H__
+#define __RELAY_H__
+
+/* ~~~~ Relay ~~~~ */
+struct relay {
+	wait_queue_head_t fss_wq;
+	bool fss_wakeup;
+	bool fss_waiting;
+
+	wait_queue_head_t app_wq;
+	bool app_wakeup;
+	bool app_waiting;
+
+	cpumask_t cpus_allowed;
+};
+
+static inline void relay_init(struct relay *relay)
+{
+	init_waitqueue_head(&relay->fss_wq);
+	init_waitqueue_head(&relay->app_wq);
+}
+
+static inline bool relay_is_app_waiting(struct relay *relay)
+{
+	return relay->app_waiting;
+}
+
+static inline void relay_app_wakeup(struct relay *relay)
+{
+	relay->app_waiting = false;
+
+	relay->app_wakeup = true;
+	wake_up(&relay->app_wq);
+}
+
+static inline int relay_fss_wait(struct relay *relay)
+{
+	int err;
+
+	relay->fss_waiting = true;
+	relay->fss_wakeup = false;
+	err =  wait_event_interruptible(relay->fss_wq, relay->fss_wakeup);
+
+	return err;
+}
+
+static inline bool relay_is_fss_waiting_grab(struct relay *relay)
+{
+	if (relay->fss_waiting) {
+		relay->fss_waiting = false;
+		return true;
+	}
+	return false;
+}
+
+static inline void relay_fss_wakeup(struct relay *relay)
+{
+	relay->fss_wakeup = true;
+	wake_up(&relay->fss_wq);
+}
+
+static inline void relay_fss_wakeup_app_wait(struct relay *relay,
+					     spinlock_t *spinlock)
+{
+	relay->app_waiting = true;
+
+	relay_fss_wakeup(relay);
+
+	relay->app_wakeup = false;
+	if (spinlock)
+		spin_unlock(spinlock);
+
+	wait_event(relay->app_wq, relay->app_wakeup);
+}
+
+#endif /* ifndef __RELAY_H__ */
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index e12cae584f8a..95582c0a4ba5 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -18,14 +18,820 @@
 #include <linux/delay.h>
 #include <linux/pfn_t.h>
 #include <linux/sched/signal.h>
+#include <linux/uaccess.h>
 
 #include "zuf.h"
 
+struct zufc_thread {
+	struct zuf_special_file hdr;
+	struct relay relay;
+	struct vm_area_struct *vma;
+	int no;
+	int chan;
+
+	/* Kernel side allocated IOCTL buffer */
+	struct vm_area_struct *opt_buff_vma;
+	void *opt_buff;
+	ulong max_zt_command;
+
+	/* Next operation*/
+	struct zuf_dispatch_op *zdo;
+};
+
+enum { INITIAL_ZT_CHANNELS = 3 };
+
+struct zuf_threads_pool {
+	uint _max_zts;
+	uint _max_channels;
+	 /* array of pcp_arrays */
+	struct zufc_thread *_all_zt[ZUFS_MAX_ZT_CHANNELS];
+};
+
+static int _alloc_zts_channel(struct zuf_root_info *zri, int channel)
+{
+	zri->_ztp->_all_zt[channel] = alloc_percpu(struct zufc_thread);
+	if (unlikely(!zri->_ztp->_all_zt[channel])) {
+		zuf_err("!!! alloc_percpu channel=%d failed\n", channel);
+		return -ENOMEM;
+	}
+	return 0;
+}
+
+static inline ulong _zt_pr_no(struct zufc_thread *zt)
+{
+	/* So in hex it will be channel as first nibble and cpu as 3rd and on */
+	return ((ulong)zt->no << 8) | zt->chan;
+}
+
+int zufc_zts_init(struct zuf_root_info *zri)
+{
+	int c;
+
+	zri->_ztp = kcalloc(1, sizeof(struct zuf_threads_pool), GFP_KERNEL);
+	if (unlikely(!zri->_ztp))
+		return -ENOMEM;
+
+	zri->_ztp->_max_zts = num_online_cpus();
+	zri->_ztp->_max_channels = INITIAL_ZT_CHANNELS;
+
+	for (c = 0; c < INITIAL_ZT_CHANNELS; ++c) {
+		int err = _alloc_zts_channel(zri, c);
+
+		if (unlikely(err))
+			return err;
+	}
+
+	return 0;
+}
+
+void zufc_zts_fini(struct zuf_root_info *zri)
+{
+	int c;
+
+	/* Always safe/must call zufc_zts_fini */
+	if (!zri->_ztp)
+		return;
+
+	for (c = 0; c < zri->_ztp->_max_channels; ++c) {
+		if (zri->_ztp->_all_zt[c])
+			free_percpu(zri->_ztp->_all_zt[c]);
+	}
+	kfree(zri->_ztp);
+	zri->_ztp = NULL;
+}
+
+static struct zufc_thread *_zt_from_cpu(struct zuf_root_info *zri,
+					int cpu, uint chan)
+{
+	return per_cpu_ptr(zri->_ztp->_all_zt[chan], cpu);
+}
+
+static int _zt_from_f(struct file *filp, int cpu, uint chan,
+		      struct zufc_thread **ztp)
+{
+	*ztp = _zt_from_cpu(ZRI(filp->f_inode->i_sb), cpu, chan);
+	if (unlikely(!*ztp))
+		return -ERANGE;
+	return 0;
+}
+
+static int _zu_register_fs(struct file *file, void *parg)
+{
+	struct zufs_ioc_register_fs rfs;
+	int err;
+
+	err = copy_from_user(&rfs, parg, sizeof(rfs));
+	if (unlikely(err)) {
+		zuf_err("=>%d\n", err);
+		return -EFAULT;
+	}
+
+	err = zufr_register_fs(file->f_inode->i_sb, &rfs);
+	if (err)
+		zuf_err("=>%d\n", err);
+	err = put_user(err, (int *)parg);
+	return err;
+}
+
+/* ~~~~ mounting ~~~~*/
+int __zufc_dispatch_mount(struct zuf_root_info *zri,
+			  enum e_mount_operation operation,
+			  struct zufs_ioc_mount *zim)
+{
+	zim->hdr.operation = operation;
+
+	for (;;) {
+		bool fss_waiting;
+
+		spin_lock(&zri->mount.lock);
+
+		if (unlikely(!zri->mount.zsf.file)) {
+			spin_unlock(&zri->mount.lock);
+			zuf_err("Server not up\n");
+			zim->hdr.err = -EIO;
+			return zim->hdr.err;
+		}
+
+		fss_waiting = relay_is_fss_waiting_grab(&zri->mount.relay);
+		if (fss_waiting)
+			break;
+		/* in case of break above spin_unlock is done inside
+		 * relay_fss_wakeup_app_wait
+		 */
+
+		spin_unlock(&zri->mount.lock);
+
+		/* It is OK to wait if user storms mounts */
+		zuf_dbg_verbose("waiting\n");
+		msleep(100);
+	}
+
+	zri->mount.zim = zim;
+	relay_fss_wakeup_app_wait(&zri->mount.relay, &zri->mount.lock);
+
+	return zim->hdr.err;
+}
+
+int zufc_dispatch_mount(struct zuf_root_info *zri, struct zus_fs_info *zus_zfi,
+			enum e_mount_operation operation,
+			struct zufs_ioc_mount *zim)
+{
+	zim->hdr.out_len = sizeof(*zim);
+	zim->hdr.in_len = sizeof(*zim);
+	if (operation == ZUFS_M_MOUNT || operation == ZUFS_M_REMOUNT)
+		zim->hdr.in_len += zim->zmi.po.mount_options_len;
+	zim->zmi.zus_zfi = zus_zfi;
+	zim->zmi.num_cpu = zri->_ztp->_max_zts;
+	zim->zmi.num_channels = zri->_ztp->_max_channels;
+
+	return __zufc_dispatch_mount(zri, operation, zim);
+}
+
+static int _zu_mount(struct file *file, void *parg)
+{
+	struct super_block *sb = file->f_inode->i_sb;
+	struct zuf_root_info *zri = ZRI(sb);
+	bool waiting_for_reply;
+	struct zufs_ioc_mount *zim;
+	ulong cp_ret;
+	int err;
+
+	spin_lock(&zri->mount.lock);
+
+	if (unlikely(!file->private_data)) {
+		/* First time register this file as the mount-thread owner */
+		zri->mount.zsf.type = zlfs_e_mout_thread;
+		zri->mount.zsf.file = file;
+		file->private_data = &zri->mount.zsf;
+	} else if (unlikely(file->private_data != &zri->mount)) {
+		spin_unlock(&zri->mount.lock);
+		zuf_err("Say what?? %p != %p\n",
+			file->private_data, &zri->mount);
+		return -EIO;
+	}
+
+	zim = zri->mount.zim;
+	zri->mount.zim = NULL;
+	waiting_for_reply = zim && relay_is_app_waiting(&zri->mount.relay);
+
+	spin_unlock(&zri->mount.lock);
+
+	if (waiting_for_reply) {
+		cp_ret = copy_from_user(zim, parg, zim->hdr.out_len);
+		if (unlikely(cp_ret)) {
+			zuf_err("copy_from_user => %ld\n", cp_ret);
+			zim->hdr.err = -EFAULT;
+		}
+
+		relay_app_wakeup(&zri->mount.relay);
+	}
+
+	/* This gets to sleep until a mount comes */
+	err = relay_fss_wait(&zri->mount.relay);
+	if (unlikely(err || !zri->mount.zim)) {
+		struct zufs_ioc_hdr *hdr = parg;
+
+		/* Released by _zu_break INTER or crash */
+		zuf_dbg_zus("_zu_break? %p => %d\n", zri->mount.zim, err);
+		put_user(ZUFS_OP_BREAK, &hdr->operation);
+		put_user(-EIO, &hdr->err);
+		return err;
+	}
+
+	zim = zri->mount.zim;
+	cp_ret = copy_to_user(parg, zim, zim->hdr.in_len);
+	if (unlikely(cp_ret)) {
+		err = -EFAULT;
+		zuf_err("copy_to_user =>%ld\n", cp_ret);
+	}
+	return err;
+}
+
+static void zufc_mounter_release(struct file *file)
+{
+	struct zuf_root_info *zri = ZRI(file->f_inode->i_sb);
+
+	zuf_dbg_zus("closed fu=%d au=%d fw=%d aw=%d\n",
+		  zri->mount.relay.fss_wakeup, zri->mount.relay.app_wakeup,
+		  zri->mount.relay.fss_waiting, zri->mount.relay.app_waiting);
+
+	spin_lock(&zri->mount.lock);
+	zri->mount.zsf.file = NULL;
+	if (relay_is_app_waiting(&zri->mount.relay)) {
+		zuf_err("server emergency exit while IO\n");
+
+		if (zri->mount.zim)
+			zri->mount.zim->hdr.err = -EIO;
+		spin_unlock(&zri->mount.lock);
+
+		relay_app_wakeup(&zri->mount.relay);
+		msleep(1000); /* crap */
+	} else {
+		if (zri->mount.zim)
+			zri->mount.zim->hdr.err = 0;
+		spin_unlock(&zri->mount.lock);
+	}
+}
+
+/* ~~~~ ZU_IOC_NUMA_MAP ~~~~ */
+static int _zu_numa_map(struct file *file, void *parg)
+{
+	struct zufs_ioc_numa_map *numa_map;
+	int n_nodes = num_online_nodes();
+	int n_cpus = num_online_cpus();
+	uint *nodes_cpu_count;
+	uint max_cpu_per_node = 0;
+	uint alloc_size;
+	int cpu, i, err;
+
+	alloc_size = sizeof(*numa_map) + n_cpus; /* char per cpu */
+
+	if ((n_nodes > 255) || (alloc_size > PAGE_SIZE)) {
+		zuf_warn("!!!unexpected big machine with %d nodes alloc_size=0x%x\n",
+			  n_nodes, alloc_size);
+		return -ENOTSUPP;
+	}
+
+	nodes_cpu_count = kcalloc(n_nodes, sizeof(uint), GFP_KERNEL);
+	if (unlikely(!nodes_cpu_count))
+		return -ENOMEM;
+
+	numa_map = kzalloc(alloc_size, GFP_KERNEL);
+	if (unlikely(!numa_map)) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	numa_map->possible_nodes	= num_possible_nodes();
+	numa_map->possible_cpus		= num_possible_cpus();
+
+	numa_map->online_nodes		= n_nodes;
+	numa_map->online_cpus		= n_cpus;
+
+	for_each_cpu(cpu, cpu_online_mask) {
+		uint ctn  = cpu_to_node(cpu);
+		uint ncc = ++nodes_cpu_count[ctn];
+
+		numa_map->cpu_to_node[cpu] = ctn;
+		max_cpu_per_node = max(max_cpu_per_node, ncc);
+	}
+
+	for (i = 1; i < n_nodes; ++i) {
+		if (nodes_cpu_count[i] != nodes_cpu_count[0]) {
+			zuf_info("@[%d]=%d Unbalanced CPU sockets @[0]=%d\n",
+				  i, nodes_cpu_count[i], nodes_cpu_count[0]);
+			numa_map->nodes_not_symmetrical = true;
+			break;
+		}
+	}
+
+	numa_map->max_cpu_per_node = max_cpu_per_node;
+
+	zuf_dbg_verbose(
+		"possible_nodes=%d possible_cpus=%d online_nodes=%d online_cpus=%d\n",
+		numa_map->possible_nodes, numa_map->possible_cpus,
+		n_nodes, n_cpus);
+
+	err = copy_to_user(parg, numa_map, alloc_size) ? -EFAULT : 0;
+	kfree(numa_map);
+out:
+	kfree(nodes_cpu_count);
+	return err;
+}
+
+static int _map_pages(struct zufc_thread *zt, struct page **pages, uint nump,
+		      bool map_readonly)
+{
+	int p, err;
+
+	if (!(zt->vma && pages && nump))
+		return 0;
+
+	for (p = 0; p < nump; ++p) {
+		ulong zt_addr = zt->vma->vm_start + p * PAGE_SIZE;
+		ulong pfn = page_to_pfn(pages[p]);
+		pfn_t pfnt = phys_to_pfn_t(PFN_PHYS(pfn), PFN_MAP | PFN_DEV);
+		vm_fault_t flt;
+
+		if (map_readonly)
+			flt = vmf_insert_mixed(zt->vma, zt_addr, pfnt);
+		else
+			flt = vmf_insert_mixed_mkwrite(zt->vma, zt_addr, pfnt);
+		err = zuf_flt_to_err(flt);
+		if (unlikely(err)) {
+			zuf_err("zuf: vmf_insert_mixed => %d p=0x%x start=0x%lx\n",
+				 err, p, zt->vma->vm_start);
+			return err;
+		}
+	}
+	return 0;
+}
+
+static void _unmap_pages(struct zufc_thread *zt, struct page **pages, uint nump)
+{
+	if (!(zt->vma && zt->zdo && pages && nump))
+		return;
+
+	zt->zdo->pages = NULL;
+	zt->zdo->nump = 0;
+
+	zap_vma_ptes(zt->vma, zt->vma->vm_start, nump * PAGE_SIZE);
+}
+
+static void _fill_buff(ulong *buff, uint size)
+{
+	ulong *buff_end = buff + size;
+	ulong val = 0;
+
+	for (; buff < buff_end; ++buff, ++val)
+		*buff = val;
+}
+
+static int _zu_init(struct file *file, void *parg)
+{
+	struct zufc_thread *zt;
+	int cpu = smp_processor_id();
+	struct zufs_ioc_init zi_init;
+	int err;
+
+	err = copy_from_user(&zi_init, parg, sizeof(zi_init));
+	if (unlikely(err)) {
+		zuf_err("=>%d\n", err);
+		return -EFAULT;
+	}
+	if (unlikely(zi_init.channel_no >= ZUFS_MAX_ZT_CHANNELS)) {
+		zuf_err("[%d] channel_no=%d\n", cpu, zi_init.channel_no);
+		return -EINVAL;
+	}
+
+	zuf_dbg_zus("[%d] aff=0x%lx channel=%d\n",
+		    cpu, zi_init.affinity, zi_init.channel_no);
+
+	zi_init.hdr.err = _zt_from_f(file, cpu, zi_init.channel_no, &zt);
+	if (unlikely(zi_init.hdr.err)) {
+		zuf_err("=>%d\n", zi_init.hdr.err);
+		goto out;
+	}
+
+	if (unlikely(zt->hdr.file)) {
+		zi_init.hdr.err = -EINVAL;
+		zuf_err("[%d] !!! thread already set\n", cpu);
+		goto out;
+	}
+
+	relay_init(&zt->relay);
+	zt->hdr.type = zlfs_e_zt;
+	zt->hdr.file = file;
+	zt->no = cpu;
+	zt->chan = zi_init.channel_no;
+
+	zt->max_zt_command = zi_init.max_command;
+	zt->opt_buff = vmalloc(zi_init.max_command);
+	if (unlikely(!zt->opt_buff)) {
+		zi_init.hdr.err = -ENOMEM;
+		goto out;
+	}
+	_fill_buff(zt->opt_buff, zi_init.max_command / sizeof(ulong));
+
+	file->private_data = &zt->hdr;
+out:
+	err = copy_to_user(parg, &zi_init, sizeof(zi_init));
+	if (err)
+		zuf_err("=>%d\n", err);
+	return err ? -EFAULT : 0;
+}
+
+struct zufc_thread *_zt_from_f_private(struct file *file)
+{
+	struct zuf_special_file *zsf = file->private_data;
+
+	WARN_ON(zsf->type != zlfs_e_zt);
+	return container_of(zsf, struct zufc_thread, hdr);
+}
+
+/* Caller checks that file->private_data != NULL */
+static void zufc_zt_release(struct file *file)
+{
+	struct zufc_thread *zt = _zt_from_f_private(file);
+
+	if (unlikely(zt->hdr.file != file))
+		zuf_err("What happened zt->file(%p) != file(%p)\n",
+			zt->hdr.file, file);
+
+	zuf_dbg_zus("[%d] closed fu=%d au=%d fw=%d aw=%d\n",
+		  zt->no, zt->relay.fss_wakeup, zt->relay.app_wakeup,
+		  zt->relay.fss_waiting, zt->relay.app_waiting);
+
+	if (relay_is_app_waiting(&zt->relay)) {
+		zuf_err("server emergency exit while IO\n");
+
+		/* NOTE: Do not call _unmap_pages the vma is gone */
+		zt->hdr.file = NULL;
+
+		relay_app_wakeup(&zt->relay);
+		msleep(1000); /* crap */
+	}
+
+	vfree(zt->opt_buff);
+	memset(zt, 0, sizeof(*zt));
+}
+
+static int _copy_outputs(struct zufc_thread *zt, void *arg)
+{
+	struct zufs_ioc_hdr *hdr = zt->zdo->hdr;
+	struct zufs_ioc_hdr *user_hdr = zt->opt_buff;
+
+	if (zt->opt_buff_vma->vm_start != (ulong)arg) {
+		zuf_err("malicious Server\n");
+		return -EINVAL;
+	}
+
+	/* Update on the user out_len and return-code */
+	hdr->err = user_hdr->err;
+	hdr->out_len = user_hdr->out_len;
+
+	if (!hdr->out_len)
+		return 0;
+
+	if ((hdr->err == -EZUFS_RETRY) || (hdr->out_max < hdr->out_len)) {
+		if (WARN_ON(!zt->zdo->oh)) {
+			zuf_err("Trouble op(%s) out_max=%d out_len=%d\n",
+				zuf_op_name(hdr->operation),
+				hdr->out_max, hdr->out_len);
+			return -EFAULT;
+		}
+		zuf_dbg_zus("[%s] %d %d => %d\n",
+			    zuf_op_name(hdr->operation),
+			    hdr->out_max, hdr->out_len, hdr->err);
+		return zt->zdo->oh(zt->zdo, zt->opt_buff, zt->max_zt_command);
+	} else {
+		void *rply = (void *)hdr + hdr->out_start;
+		void *from = zt->opt_buff + hdr->out_start;
+
+		memcpy(rply, from, hdr->out_len);
+		return 0;
+	}
+}
+
+static int _zu_wait(struct file *file, void *parg)
+{
+	struct zufc_thread *zt;
+	int err;
+
+	zt = _zt_from_f_private(file);
+	if (unlikely(!zt)) {
+		zuf_err("Unexpected ZT state\n");
+		err = -ERANGE;
+		goto err;
+	}
+
+	if (!zt->hdr.file || file != zt->hdr.file) {
+		zuf_err("fatal\n");
+		err = -E2BIG;
+		goto err;
+	}
+	if (unlikely((ulong)parg != zt->opt_buff_vma->vm_start)) {
+		zuf_err("fatal 2\n");
+		err = -EINVAL;
+		goto err;
+	}
+
+	if (relay_is_app_waiting(&zt->relay)) {
+		if (unlikely(!zt->zdo)) {
+			zuf_err("User has gone...\n");
+			err = -E2BIG;
+			goto err;
+		} else {
+			/* overflow_handler might decide to execute the
+			 * parg here at zus context and return to server.
+			 * If it also has an error to report to zus it
+			 * will set zdo->hdr->err.
+			 * EZUF_RETRY_DONE is when that happens.
+			 * In this case pages stay mapped in zt->vma
+			 */
+			err = _copy_outputs(zt, parg);
+			if (err == EZUF_RETRY_DONE) {
+				put_user(zt->zdo->hdr->err, (int *)parg);
+				return 0;
+			}
+
+			_unmap_pages(zt, zt->zdo->pages, zt->zdo->nump);
+			zt->zdo = NULL;
+			if (unlikely(err)) /* _copy_outputs returned an err */
+				goto err;
+		}
+		relay_app_wakeup(&zt->relay);
+	}
+
+	err = relay_fss_wait(&zt->relay);
+	if (err)
+		zuf_dbg_err("[%d] relay error: %d\n", zt->no, err);
+
+	if (zt->zdo &&  zt->zdo->hdr &&
+	    zt->zdo->hdr->operation < ZUFS_OP_BREAK) {
+		/* call map here at the zuf thread so we need no locks
+		 * TODO: Currently only ZUFS_OP_WRITE protects user-buffers
+		 * we should have a bit set in zt->zdo->hdr set per operation.
+		 * TODO: Why this does not work?
+		 */
+		_map_pages(zt, zt->zdo->pages, zt->zdo->nump, 0);
+		memcpy(zt->opt_buff, zt->zdo->hdr, zt->zdo->hdr->in_len);
+	} else {
+		struct zufs_ioc_hdr *hdr = zt->opt_buff;
+
+		/* This Means we were released by _zu_break */
+		zuf_dbg_zus("_zu_break? => %d\n", err);
+		hdr->operation = ZUFS_OP_BREAK;
+		hdr->err = err;
+	}
+
+	return err;
+
+err:
+	put_user(err, (int *)parg);
+	return err;
+}
+
+static bool _try_grab_zt_channel(struct zuf_root_info *zri, int cpu,
+				  struct zufc_thread **ztp)
+{
+	struct zufc_thread *zt;
+	int c;
+
+	for (c = 0; ; ++c) {
+		zt = _zt_from_cpu(zri, cpu, c);
+		if (unlikely(!zt || !zt->hdr.file))
+			break;
+
+		if (relay_is_fss_waiting_grab(&zt->relay)) {
+			*ztp = zt;
+			return true;
+		}
+	}
+
+	*ztp = _zt_from_cpu(zri, cpu, 0);
+	return false;
+}
+
+#define _zuf_get_cpu() get_cpu()
+#define _zuf_put_cpu() put_cpu()
+
+#ifdef CONFIG_ZUF_DEBUG
+static
+int _r_zufs_dispatch(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo)
+#else
+int __zufc_dispatch(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo)
+#endif
+{
+	struct task_struct *app = get_current();
+	struct zufs_ioc_hdr *hdr = zdo->hdr;
+	int cpu, cpu2;
+	struct zufc_thread *zt;
+
+	if (unlikely(hdr->out_len && !hdr->out_max)) {
+		/* TODO: Complain here and let caller code do this proper */
+		hdr->out_max = hdr->out_len;
+	}
+
+channel_busy:
+	cpu = _zuf_get_cpu();
+
+	if (!_try_grab_zt_channel(zri, cpu, &zt)) {
+		_zuf_put_cpu();
+
+		/* If the channel could not be grabbed then maybe a break_all
+		 * is in progress on a different CPU. Make sure zt->file on
+		 * this core is updated
+		 */
+		mb();
+		if (unlikely(!zt->hdr.file)) {
+			zuf_err("[%d] !zt->file\n", cpu);
+			return -EIO;
+		}
+		zuf_dbg_err("[%d] can this be\n", cpu);
+		/* FIXME: Do something much smarter */
+		msleep(10);
+		if (signal_pending(get_current())) {
+			zuf_dbg_err("[%d] => EINTR\n", cpu);
+			return -EINTR;
+		}
+		goto channel_busy;
+	}
+
+	/* lock app to this cpu while waiting */
+	cpumask_copy(&zt->relay.cpus_allowed, &app->cpus_allowed);
+	cpumask_copy(&app->cpus_allowed,  cpumask_of(smp_processor_id()));
+
+	zt->zdo = zdo;
+
+	_zuf_put_cpu();
+
+	relay_fss_wakeup_app_wait(&zt->relay, NULL);
+
+	/* restore cpu affinity after wakeup */
+	cpumask_copy(&app->cpus_allowed, &zt->relay.cpus_allowed);
+
+	cpu2 = smp_processor_id();
+	if (cpu2 != cpu)
+		zuf_warn("App switched cpu1=%u cpu2=%u\n", cpu, cpu2);
+
+	return zt->hdr.file ? hdr->err : -EIO;
+}
+
+const char *zuf_op_name(enum e_zufs_operation op)
+{
+#define CASE_ENUM_NAME(e) case e: return #e
+	switch  (op) {
+		CASE_ENUM_NAME(ZUFS_OP_BREAK		);
+	default:
+		return "UNKNOWN";
+	}
+}
+
+#ifdef CONFIG_ZUF_DEBUG
+
+#define MAX_ZT_SEC 5
+int __zufc_dispatch(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo)
+{
+	u64 t1, t2;
+	int err;
+
+	t1 = ktime_get_ns();
+	err = _r_zufs_dispatch(zri, zdo);
+	t2 = ktime_get_ns();
+
+	if ((t2 - t1) > MAX_ZT_SEC * NSEC_PER_SEC)
+		zuf_err("zufc_dispatch(%s, [0x%x-0x%x]) took %llu sec\n",
+			zuf_op_name(zdo->hdr->operation), zdo->hdr->offset,
+			zdo->hdr->len,
+			(t2 - t1) / NSEC_PER_SEC);
+
+	return err;
+}
+#endif /* def CONFIG_ZUF_DEBUG */
+
+/* ~~~ iomap_exec && exec_buffer allocation ~~~ */
+struct zu_exec_buff {
+	struct zuf_special_file hdr;
+	struct vm_area_struct *vma;
+	void *opt_buff;
+	ulong alloc_size;
+};
+
+/* Do some common checks and conversions */
+static inline struct zu_exec_buff *_ebuff_from_file(struct file *file)
+{
+	struct zu_exec_buff *ebuff = file->private_data;
+
+	if (WARN_ON_ONCE(ebuff->hdr.type != zlfs_e_dpp_buff)) {
+		zuf_err("Must call ZU_IOC_ALLOC_BUFFER first\n");
+		return NULL;
+	}
+
+	if (WARN_ON_ONCE(ebuff->hdr.file != file))
+		return NULL;
+
+	return ebuff;
+}
+
+static int _zu_ebuff_alloc(struct file *file, void *arg)
+{
+	struct zufs_ioc_alloc_buffer ioc_alloc;
+	struct zu_exec_buff *ebuff;
+	int err;
+
+	err = copy_from_user(&ioc_alloc, arg, sizeof(ioc_alloc));
+	if (unlikely(err)) {
+		zuf_err("=>%d\n", err);
+		return err;
+	}
+
+	if (ioc_alloc.init_size > ioc_alloc.max_size)
+		return -EINVAL;
+
+	/* TODO: Easily Support growing */
+	/* TODO: Support global pools, also easy */
+	if (ioc_alloc.pool_no || ioc_alloc.init_size != ioc_alloc.max_size)
+		return -ENOTSUPP;
+
+	ebuff = kzalloc(sizeof(*ebuff), GFP_KERNEL);
+	if (unlikely(!ebuff))
+		return -ENOMEM;
+
+	ebuff->hdr.type = zlfs_e_dpp_buff;
+	ebuff->hdr.file = file;
+	i_size_write(file->f_inode, ioc_alloc.max_size);
+	ebuff->alloc_size =  ioc_alloc.init_size;
+	ebuff->opt_buff = vmalloc(ioc_alloc.init_size);
+	if (unlikely(!ebuff->opt_buff)) {
+		kfree(ebuff);
+		return -ENOMEM;
+	}
+	_fill_buff(ebuff->opt_buff, ioc_alloc.init_size / sizeof(ulong));
+
+	file->private_data = &ebuff->hdr;
+	return 0;
+}
+
+static void zufc_ebuff_release(struct file *file)
+{
+	struct zu_exec_buff *ebuff = _ebuff_from_file(file);
+
+	if (unlikely(!ebuff))
+		return;
+
+	vfree(ebuff->opt_buff);
+	ebuff->hdr.type = 0;
+	ebuff->hdr.file = NULL; /* for non-dbg Kernels && use-after-free */
+	kfree(ebuff);
+}
+
+static int _zu_break(struct file *filp, void *parg)
+{
+	struct zuf_root_info *zri = ZRI(filp->f_inode->i_sb);
+	int i, c;
+
+	zuf_dbg_core("enter\n");
+	mb(); /* TODO how to schedule on all CPU's */
+
+	for (i = 0; i < zri->_ztp->_max_zts; ++i) {
+		for (c = 0; c < zri->_ztp->_max_channels; ++c) {
+			struct zufc_thread *zt = _zt_from_cpu(zri, i, c);
+
+			if (unlikely(!(zt && zt->hdr.file)))
+				continue;
+			relay_fss_wakeup(&zt->relay);
+		}
+	}
+
+	if (zri->mount.zsf.file)
+		relay_fss_wakeup(&zri->mount.relay);
+
+	zuf_dbg_core("exit\n");
+	return 0;
+}
+
 long zufc_ioctl(struct file *file, unsigned int cmd, ulong arg)
 {
+	void __user *parg = (void __user *)arg;
+
 	switch (cmd) {
+	case ZU_IOC_REGISTER_FS:
+		return _zu_register_fs(file, parg);
+	case ZU_IOC_MOUNT:
+		return _zu_mount(file, parg);
+	case ZU_IOC_NUMA_MAP:
+		return _zu_numa_map(file, parg);
+	case ZU_IOC_INIT_THREAD:
+		return _zu_init(file, parg);
+	case ZU_IOC_WAIT_OPT:
+		return _zu_wait(file, parg);
+	case ZU_IOC_ALLOC_BUFFER:
+		return _zu_ebuff_alloc(file, parg);
+	case ZU_IOC_BREAK_ALL:
+		return _zu_break(file, parg);
 	default:
-		zuf_err("%d\n", cmd);
+		zuf_err("%d %ld\n", cmd, ZU_IOC_WAIT_OPT);
 		return -ENOTTY;
 	}
 }
@@ -38,11 +844,215 @@ int zufc_release(struct inode *inode, struct file *file)
 		return 0;
 
 	switch (zsf->type) {
+	case zlfs_e_zt:
+		zufc_zt_release(file);
+		return 0;
+	case zlfs_e_mout_thread:
+		zufc_mounter_release(file);
+		return 0;
+	case zlfs_e_pmem:
+		/* NOTHING to clean for pmem file yet */
+		/* zuf_pmem_release(file);*/
+		return 0;
+	case zlfs_e_dpp_buff:
+		zufc_ebuff_release(file);
+		return 0;
 	default:
 		return 0;
 	}
 }
 
+/* ~~~~  mmap area of app buffers into server ~~~~ */
+
+static vm_fault_t zuf_zt_fault(struct vm_fault *vmf)
+{
+	zuf_err("should not fault\n");
+	return VM_FAULT_SIGBUS;
+}
+
+static const struct vm_operations_struct zuf_vm_ops = {
+	.fault		= zuf_zt_fault,
+};
+
+static int _zufc_zt_mmap(struct file *file, struct vm_area_struct *vma,
+			 struct zufc_thread *zt)
+{
+	/* Tell Kernel We will only access on a single core */
+	vma->vm_flags |= VM_MIXEDMAP;
+	vma->vm_ops = &zuf_vm_ops;
+
+	zt->vma = vma;
+
+	zuf_dbg_core(
+		"[0x%lx] start=0x%lx end=0x%lx flags=0x%lx file-start=0x%lx\n",
+		_zt_pr_no(zt), vma->vm_start, vma->vm_end, vma->vm_flags,
+		vma->vm_pgoff);
+
+	return 0;
+}
+
+/* ~~~~  mmap the Kernel allocated IOCTL buffer per ZT ~~~~ */
+static int _opt_buff_mmap(struct vm_area_struct *vma, void *opt_buff,
+			  ulong opt_size)
+{
+	ulong offset;
+
+	if (!opt_buff)
+		return -ENOMEM;
+
+	for (offset = 0; offset < opt_size; offset += PAGE_SIZE) {
+		ulong addr = vma->vm_start + offset;
+		ulong pfn = vmalloc_to_pfn(opt_buff +  offset);
+		pfn_t pfnt = phys_to_pfn_t(PFN_PHYS(pfn), PFN_MAP | PFN_DEV);
+		int err;
+
+		zuf_dbg_verbose("[0x%lx] pfn-0x%lx addr=0x%lx buff=0x%lx\n",
+				offset, pfn, addr, (ulong)opt_buff + offset);
+
+		err = zuf_flt_to_err(vmf_insert_mixed_mkwrite(vma, addr, pfnt));
+		if (unlikely(err)) {
+			zuf_err("zuf: vmf_insert_mixed_mkwrite => %d offset=0x%lx addr=0x%lx\n",
+				 err, offset, addr);
+			return err;
+		}
+	}
+
+	return 0;
+}
+
+static vm_fault_t zuf_obuff_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct zufc_thread *zt = _zt_from_f_private(vma->vm_file);
+	long offset = (vmf->pgoff << PAGE_SHIFT) - ZUS_API_MAP_MAX_SIZE;
+	int err;
+
+	zuf_dbg_core(
+		"[0x%lx] start=0x%lx end=0x%lx file-start=0x%lx offset=0x%lx\n",
+		_zt_pr_no(zt), vma->vm_start, vma->vm_end, vma->vm_pgoff,
+		offset);
+
+	/* if Server overruns its buffer crash it dead */
+	if (unlikely((offset < 0) || (zt->max_zt_command < offset))) {
+		zuf_err("[0x%lx] start=0x%lx end=0x%lx file-start=0x%lx offset=0x%lx\n",
+			_zt_pr_no(zt), vma->vm_start,
+			vma->vm_end, vma->vm_pgoff, offset);
+		return VM_FAULT_SIGBUS;
+	}
+
+	/* We never released a zus-core.c that does not fault the
+	 * first page first. I want to see if this happens
+	 */
+	if (unlikely(offset))
+		zuf_warn("Suspicious server activity\n");
+
+	/* This faults only once at very first access */
+	err = _opt_buff_mmap(vma, zt->opt_buff, zt->max_zt_command);
+	if (unlikely(err))
+		return VM_FAULT_SIGBUS;
+
+	return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct zuf_obuff_ops = {
+	.fault		= zuf_obuff_fault,
+};
+
+static int _zufc_obuff_mmap(struct file *file, struct vm_area_struct *vma,
+			    struct zufc_thread *zt)
+{
+	vma->vm_flags |= VM_MIXEDMAP;
+	vma->vm_ops = &zuf_obuff_ops;
+
+	zt->opt_buff_vma = vma;
+
+	zuf_dbg_core(
+		"[0x%lx] start=0x%lx end=0x%lx flags=0x%lx file-start=0x%lx\n",
+		_zt_pr_no(zt), vma->vm_start, vma->vm_end, vma->vm_flags,
+		vma->vm_pgoff);
+
+	return 0;
+}
+
+/* ~~~ */
+
+static int zufc_zt_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct zufc_thread *zt = _zt_from_f_private(file);
+
+	/* We have two areas of mmap in this special file.
+	 * 0 to ZUS_API_MAP_MAX_SIZE:
+	 *	The first part where app pages are mapped
+	 *	into server per operation.
+	 * ZUS_API_MAP_MAX_SIZE of size zuf_root_info->max_zt_command
+	 *	Is where we map the per ZT ioctl-buffer, later passed
+	 *	to the zus_ioc_wait IOCTL call
+	 */
+	if (vma->vm_pgoff == ZUS_API_MAP_MAX_SIZE / PAGE_SIZE)
+		return _zufc_obuff_mmap(file, vma, zt);
+
+	/* zuf ZT API is very particular about where in its
+	 * special file we communicate
+	 */
+	if (unlikely(vma->vm_pgoff))
+		return -EINVAL;
+
+	return _zufc_zt_mmap(file, vma, zt);
+}
+
+/* ~~~~ Implementation of the ZU_IOC_ALLOC_BUFFER mmap facility ~~~~ */
+
+static vm_fault_t zuf_ebuff_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct zu_exec_buff *ebuff = _ebuff_from_file(vma->vm_file);
+	long offset = (vmf->pgoff << PAGE_SHIFT);
+	int err;
+
+	zuf_dbg_core("start=0x%lx end=0x%lx file-start=0x%lx file-off=0x%lx\n",
+		     vma->vm_start, vma->vm_end, vma->vm_pgoff, offset);
+
+	/* if Server overruns its buffer crash it dead */
+	if (unlikely((offset < 0) || (ebuff->alloc_size < offset))) {
+		zuf_err("start=0x%lx end=0x%lx file-start=0x%lx file-off=0x%lx\n",
+			vma->vm_start, vma->vm_end, vma->vm_pgoff,
+			offset);
+		return VM_FAULT_SIGBUS;
+	}
+
+	/* We never released a zus-core.c that does not fault the
+	 * first page first. I want to see if this happens
+	 */
+	if (unlikely(offset))
+		zuf_warn("Suspicious server activity\n");
+
+	/* This faults only once at very first access */
+	err = _opt_buff_mmap(vma, ebuff->opt_buff, ebuff->alloc_size);
+	if (unlikely(err))
+		return VM_FAULT_SIGBUS;
+
+	return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct zuf_ebuff_ops = {
+	.fault		= zuf_ebuff_fault,
+};
+
+static int zufc_ebuff_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct zu_exec_buff *ebuff = _ebuff_from_file(vma->vm_file);
+
+	vma->vm_flags |= VM_MIXEDMAP;
+	vma->vm_ops = &zuf_ebuff_ops;
+
+	ebuff->vma = vma;
+
+	zuf_dbg_core("start=0x%lx end=0x%lx flags=0x%lx file-start=0x%lx\n",
+		      vma->vm_start, vma->vm_end, vma->vm_flags, vma->vm_pgoff);
+
+	return 0;
+}
+
 int zufc_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct zuf_special_file *zsf = file->private_data;
@@ -53,6 +1063,10 @@ int zufc_mmap(struct file *file, struct vm_area_struct *vma)
 	}
 
 	switch (zsf->type) {
+	case zlfs_e_zt:
+		return zufc_zt_mmap(file, vma);
+	case zlfs_e_dpp_buff:
+		return zufc_ebuff_mmap(file, vma);
 	default:
 		zuf_err("type=%d\n", zsf->type);
 		return -ENOTTY;
diff --git a/fs/zuf/zuf-root.c b/fs/zuf/zuf-root.c
index 55a839dbc854..37b70ca33d3c 100644
--- a/fs/zuf/zuf-root.c
+++ b/fs/zuf/zuf-root.c
@@ -227,6 +227,7 @@ static void zufr_put_super(struct super_block *sb)
 {
 	struct zuf_root_info *zri = ZRI(sb);
 
+	zufc_zts_fini(zri);
 	_unregister_all_fses(zri);
 
 	zuf_info("zuf_root umount\n");
@@ -282,10 +283,16 @@ static int zufr_fill_super(struct super_block *sb, void *data, int silent)
 	root_i->i_fop = &zufr_file_dir_operations;
 	root_i->i_op = &zufr_inode_operations;
 
+	spin_lock_init(&zri->mount.lock);
 	mutex_init(&zri->sbl_lock);
+	relay_init(&zri->mount.relay);
 	INIT_LIST_HEAD(&zri->fst_list);
 	INIT_LIST_HEAD(&zri->pmem_list);
 
+	err = zufc_zts_init(zri);
+	if (unlikely(err))
+		return err; /* put will be called we have a root */
+
 	return 0;
 }
 
diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
index f979d8cbe60c..a33f5908155d 100644
--- a/fs/zuf/zuf.h
+++ b/fs/zuf/zuf.h
@@ -23,9 +23,11 @@
 #include <linux/xattr.h>
 #include <linux/exportfs.h>
 #include <linux/page_ref.h>
+#include <linux/mm.h>
 
 #include "zus_api.h"
 
+#include "relay.h"
 #include "_pr.h"
 
 enum zlfs_e_special_file {
@@ -44,6 +46,8 @@ struct zuf_special_file {
 struct zuf_root_info {
 	struct __mount_thread_info {
 		struct zuf_special_file zsf;
+		spinlock_t lock;
+		struct relay relay;
 		struct zufs_ioc_mount *zim;
 	} mount;
 
@@ -102,6 +106,48 @@ static inline struct zuf_inode_info *ZUII(struct inode *inode)
 	return container_of(inode, struct zuf_inode_info, vfs_inode);
 }
 
+static inline struct zuf_fs_type *ZUF_FST(struct file_system_type *fs_type)
+{
+	return container_of(fs_type, struct zuf_fs_type, vfs_fst);
+}
+
+static inline struct zuf_fs_type *zuf_fst(struct super_block *sb)
+{
+	return ZUF_FST(sb->s_type);
+}
+
+struct zuf_dispatch_op;
+typedef int (*overflow_handler)(struct zuf_dispatch_op *zdo, void *parg,
+				ulong zt_max_bytes);
+struct zuf_dispatch_op {
+	struct zufs_ioc_hdr *hdr;
+	struct page **pages;
+	uint nump;
+	overflow_handler oh;
+	struct super_block *sb;
+	struct inode *inode;
+};
+
+static inline void
+zuf_dispatch_init(struct zuf_dispatch_op *zdo, struct zufs_ioc_hdr *hdr,
+		 struct page **pages, uint nump)
+{
+	memset(zdo, 0, sizeof(*zdo));
+	zdo->hdr = hdr;
+	zdo->pages = pages; zdo->nump = nump;
+}
+
+static inline int zuf_flt_to_err(vm_fault_t flt)
+{
+	if (likely(flt == VM_FAULT_NOPAGE))
+		return 0;
+
+	if (flt == VM_FAULT_OOM)
+		return -ENOMEM;
+
+	return -EACCES;
+}
+
 /* Keep this include last thing in file */
 #include "_extern.h"
 
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 34e3e1a9a107..3319a70b5ccc 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -66,6 +66,47 @@
 
 #endif /*  ndef __KERNEL__ */
 
+/* first available error code after include/linux/errno.h */
+#define EZUFS_RETRY	531
+
+/* The below is private to zuf Kernel only. Is not exposed to VFS nor zus
+ * (defined here to allocate the constant)
+ */
+#define EZUF_RETRY_DONE 540
+
+/**
+ * zufs dual port memory
+ * This is a special type of offset to either memory or persistent-memory,
+ * that is designed to be used in the interface mechanism between userspace
+ * and kernel, and can be accessed by both.
+ * The 3 lowest bits denote a mem-pool:
+ * 0   - pmem pool
+ * 1-6 - established shared pool by a call to zufs_ioc_create_mempool (below)
+ * 7   - offset into app memory
+ */
+typedef __u64 __bitwise zu_dpp_t;
+
+static inline uint zu_dpp_t_pool(zu_dpp_t t)
+{
+	return t & 0x7;
+}
+
+static inline ulong zu_dpp_t_val(zu_dpp_t t)
+{
+	return t & ~0x7;
+}
+
+static inline zu_dpp_t enc_zu_dpp_t(ulong v, uint pool)
+{
+	return v | pool;
+}
+
+/* ~~~~~ ZUFS API ioctl commands ~~~~~ */
+enum {
+	ZUS_API_MAP_MAX_PAGES	= 1024,
+	ZUS_API_MAP_MAX_SIZE	= ZUS_API_MAP_MAX_PAGES * PAGE_SIZE,
+};
+
 struct zufs_ioc_hdr {
 	__u32 err;	/* IN/OUT must be first */
 	__u16 in_len;	/* How much to be copied *to* zus */
@@ -102,4 +143,148 @@ struct zufs_ioc_register_fs {
 };
 #define ZU_IOC_REGISTER_FS	_IOWR('Z', 10, struct zufs_ioc_register_fs)
 
+/* A cookie from user-mode returned by mount */
+struct zus_sb_info;
+
+/* zus cookie per inode */
+struct zus_inode_info;
+
+enum ZUFS_M_FLAGS {
+	ZUFS_M_PEDANTIC		= 0x00000001,
+	ZUFS_M_EPHEMERAL	= 0x00000002,
+	ZUFS_M_SILENT		= 0x00000004,
+};
+
+struct zufs_parse_options {
+	__u32 mount_options_len;
+	__u32 pedantic;
+	__u64 mount_flags;
+	char mount_options[0];
+};
+
+enum e_mount_operation {
+	ZUFS_M_MOUNT	= 1,
+	ZUFS_M_UMOUNT,
+	ZUFS_M_REMOUNT,
+	ZUFS_M_DDBG_RD,
+	ZUFS_M_DDBG_WR,
+};
+
+struct zufs_mount_info {
+	/* IN */
+	struct zus_fs_info *zus_zfi;
+	__u16	num_cpu;
+	__u16	num_channels;
+	__u32	pmem_kern_id;
+	__u64	sb_id;
+
+	/* OUT */
+	struct zus_sb_info *zus_sbi;
+	/* mount is also iget of root */
+	struct zus_inode_info *zus_ii;
+	zu_dpp_t _zi;
+	__u64	old_mount_opt;
+	__u64	remount_flags;
+
+	/* More FS specific info */
+	__u32 s_blocksize_bits;
+	__u8	acl_on;
+	struct zufs_parse_options po;
+};
+
+/* mount / umount */
+struct  zufs_ioc_mount {
+	struct zufs_ioc_hdr hdr;
+	struct zufs_mount_info zmi;
+};
+#define ZU_IOC_MOUNT	_IOWR('Z', 11, struct zufs_ioc_mount)
+
+/* pmem  */
+struct zufs_ioc_numa_map {
+	/* Set by zus */
+	struct zufs_ioc_hdr hdr;
+
+	__u32	possible_nodes;
+	__u32	possible_cpus;
+	__u32	online_nodes;
+	__u32	online_cpus;
+
+	__u32	max_cpu_per_node;
+
+	/* This indicates that NOT all nodes have @max_cpu_per_node cpus */
+	bool	nodes_not_symmetrical;
+
+	/* Variable size must keep last
+	 * size @online_cpus
+	 */
+	__u8	cpu_to_node[];
+};
+#define ZU_IOC_NUMA_MAP	_IOWR('Z', 12, struct zufs_ioc_numa_map)
+
+/* ZT init */
+enum { ZUFS_MAX_ZT_CHANNELS = 64 };
+
+struct zufs_ioc_init {
+	struct zufs_ioc_hdr hdr;
+	ulong affinity;	/* IN */
+	uint channel_no;
+	uint max_command;
+};
+#define ZU_IOC_INIT_THREAD	_IOWR('Z', 14, struct zufs_ioc_init)
+
+/* break_all (Server telling kernel to clean) */
+struct zufs_ioc_break_all {
+	struct zufs_ioc_hdr hdr;
+};
+#define ZU_IOC_BREAK_ALL	_IOWR('Z', 15, struct zufs_ioc_break_all)
+
+/* ~~~  zufs_ioc_wait_operation ~~~ */
+struct zufs_ioc_wait_operation {
+	struct zufs_ioc_hdr hdr;
+	/* maximum size is governed by zufs_ioc_init->max_command */
+	char opt_buff[];
+};
+#define ZU_IOC_WAIT_OPT		_IOWR('Z', 16, struct zufs_ioc_wait_operation)
+
+/* These are the possible operations sent from Kernel to the Server in the
+ * return of the ZU_IOC_WAIT_OPT.
+ */
+enum e_zufs_operation {
+	ZUFS_OP_NULL = 0,
+
+	ZUFS_OP_BREAK,		/* Kernel telling Server to exit */
+	ZUFS_OP_MAX_OPT,
+};
+
+/* Allocate a special_file that will be a dual-port communication buffer with
+ * user mode.
+ * Server will access the buffer via the mmap of this file.
+ * Kernel will access the file via the valloc() pointer
+ *
+ * Some IOCTLs below demand use of this kind of buffer for communication
+ * TODO:
+ * pool_no is if we want to associate this buffer onto the 6 possible
+ * mem-pools per zuf_sbi. So anywhere we have a zu_dpp_t it will mean
+ * access from this pool.
+ * If pool_no is zero then it is private to only this file. In this case
+ * sb_id && zus_sbi are ignored / not needed.
+ */
+struct zufs_ioc_alloc_buffer {
+	struct zufs_ioc_hdr hdr;
+	/* The ID of the super block received in mount */
+	__u64	sb_id;
+	/* We verify the sb_id validity against zus_sbi */
+	struct zus_sb_info *zus_sbi;
+	/* max size of buffer allowed (size of mmap) */
+	__u32 max_size;
+	/* allocate this much on initial call and set into vma */
+	__u32 init_size;
+
+	/* TODO: These below are now set to ZERO. Need implementation */
+	__u16 pool_no;
+	__u16 flags;
+	__u32 reserved;
+};
+#define ZU_IOC_ALLOC_BUFFER	_IOWR('Z', 17, struct zufs_ioc_alloc_buffer)
+
 #endif /* _LINUX_ZUFS_API_H */
-- 
2.20.1

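The zu_dpp_t helpers in the zus_api.h hunk above pack a pool number into
the three low bits of an offset. A minimal user-space sketch of that
packing follows. The demo_ names are hypothetical, a plain uint64_t stands
in for the __bitwise kernel type, and the sketch assumes the offset is
8-byte aligned so that its three low bits are free to carry the pool.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t demo_dpp_t;	/* hypothetical stand-in for zu_dpp_t */

enum { DEMO_POOL_MASK = 0x7 };	/* the three low bits select the pool */

/* Pack an 8-byte aligned offset and a pool number (0..7) together */
static inline demo_dpp_t demo_enc_dpp(uint64_t v, unsigned int pool)
{
	assert((v & DEMO_POOL_MASK) == 0);	/* assumption: aligned offset */
	assert(pool <= DEMO_POOL_MASK);
	return v | pool;
}

/* Extract the pool number from the low bits */
static inline unsigned int demo_dpp_pool(demo_dpp_t t)
{
	return (unsigned int)(t & DEMO_POOL_MASK);
}

/* Extract the offset by clearing the pool bits */
static inline uint64_t demo_dpp_val(demo_dpp_t t)
{
	return t & ~(uint64_t)DEMO_POOL_MASK;
}
```

Under these assumptions, encoding offset 0x1000 with pool 7 gives 0x1007,
from which both fields can be recovered unchanged.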

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 05/17] zuf: Multy Devices
  2019-02-19 11:51 [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Boaz harrosh
                   ` (3 preceding siblings ...)
  2019-02-19 11:51 ` [RFC PATCH 04/17] zuf: zuf-core The ZTs Boaz harrosh
@ 2019-02-19 11:51 ` Boaz harrosh
  2019-02-19 11:51 ` [RFC PATCH 06/17] zuf: mounting Boaz harrosh
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 31+ messages in thread
From: Boaz harrosh @ 2019-02-19 11:51 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro
  Cc: Ric Wheeler, Miklos Szeredi, Steven Whitehouse, Jefff moyer,
	Amir Goldstein, Amit Golander, Sagi Manole

From: Boaz Harrosh <boazh@netapp.com>

ZUFS supports multiple block devices per super_block.
This patch adds the device-handling code. As its output,
a single multi_devices (md.h) object is associated with the
mounting super_block.

There are three modes of operation:
* Mount without a device (mount -t FOO none /somepath)

* A single device - The FS stated register_fs_info->dt_offset==-1
  No checks are made by the Kernel; the single bdev is registered with
  the Kernel's mount_bdev. It is up to the zusFS to check validity.

* Multiple devices - The FS stated register_fs_info->dt_offset==X

  This mode is the main subject of this patch.
  A single device is given on the mount command line. At
  register_fs_info->dt_offset of this device we look for a
  zufs_dev_table structure. After all the checks pass, we read the
  device list found there and open all devices. Any one of the devices
  may be given on the command line, but they are always opened in
  DT (Device Table) order. The Device Table has the notion of two types
  of bdevs:
  T1 devices - pmem devices capable of direct_access
  T2 devices - devices that are not capable of direct_access

  All t1 devices are presented as one linear array. in DT order
  In t1.c we mmap this space for the server to directly access
  pmem. (In the proper persistent way)

  [We do not support any direct_access device, we only support
   pmem(s) where the all device can be addressed by a single
   physical/virtual address. This is checked before mount]

  The T2 devices are also grabbed and owned by the super_block.
  A later API will enable the Server to write or transfer buffers
  from T1 to T2 in a very efficient manner. The T2 devices are also
  presented as a single linear array in DT order.

  Both kinds of devices are NUMA aware, and the NUMA info is presented
  to the zusFS for optimal allocation and access.
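
To illustrate the "one linear array in DT order" idea, here is a minimal
user-space sketch. It is not the code from this patch: struct dev,
struct larray, gcd_ul(), larray_build(), larray_dev() and larray_local_bn()
are hypothetical names made up for the illustration, and sizes are kept in
blocks only. The patch does the equivalent in _map_setup() and in the
md_bn_t1_dev()/md_bn_t2_dev() helpers. The assumption is that the map
granularity is the gcd of all device sizes, so every map segment falls
entirely inside one device.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct md_dev_info; sizes are in blocks. */
struct dev {
	unsigned long offset;	/* first block of this device in the array */
	unsigned long blocks;	/* size of this device in blocks */
};

/* Hypothetical stand-in for struct md_dev_larray. */
struct larray {
	unsigned long bn_gcd;	/* gcd of all device sizes, in blocks */
	struct dev **map;	/* one entry per bn_gcd-sized segment */
};

unsigned long gcd_ul(unsigned long a, unsigned long b)
{
	while (b) {
		unsigned long t = a % b;

		a = b;
		b = t;
	}
	return a;	/* gcd(0, x) == x, so 0 is a valid starting value */
}

/*
 * Lay the devices out back to back in table order and build the
 * segment map. Returns 0 on success, -1 on an empty layout or on
 * allocation failure.
 */
int larray_build(struct larray *la, struct dev *devs, int count)
{
	unsigned long total = 0, seg, nsegs;
	int i, d = 0;

	la->bn_gcd = 0;
	la->map = NULL;
	for (i = 0; i < count; ++i) {
		devs[i].offset = total;
		total += devs[i].blocks;
		la->bn_gcd = gcd_ul(la->bn_gcd, devs[i].blocks);
	}
	if (!la->bn_gcd)
		return -1;

	nsegs = total / la->bn_gcd;
	la->map = calloc(nsegs, sizeof(*la->map));
	if (!la->map)
		return -1;

	for (seg = 0; seg < nsegs; ++seg) {
		/* advance to the device that contains this segment */
		while (seg * la->bn_gcd >= devs[d].offset + devs[d].blocks)
			++d;
		la->map[seg] = &devs[d];
	}
	return 0;
}

/* Constant-time lookup: which device holds global block number bn? */
struct dev *larray_dev(const struct larray *la, unsigned long bn)
{
	assert(la->bn_gcd != 0);
	return la->map[bn / la->bn_gcd];
}

/* Block number relative to the start of its own device. */
unsigned long larray_local_bn(const struct larray *la, unsigned long bn)
{
	return bn - larray_dev(la, bn)->offset;
}
```

With three devices of 512, 1024 and 256 blocks the gcd is 256, so the map
has 7 entries, and global block 1600 resolves to the third device at local
block 64. The trade-off of this layout is memory for speed: the map grows
as the gcd shrinks, but a lookup is a single divide and an array index.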

Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile   |   3 +
 fs/zuf/_extern.h  |   3 +
 fs/zuf/_pr.h      |  11 +
 fs/zuf/md.c       | 764 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/md.h       | 318 +++++++++++++++++++
 fs/zuf/md_def.h   | 145 +++++++++
 fs/zuf/t1.c       | 138 +++++++++
 fs/zuf/t2.c       | 375 +++++++++++++++++++++++
 fs/zuf/t2.h       |  68 +++++
 fs/zuf/zuf-core.c |  87 ++++++
 fs/zuf/zuf.h      |  39 +++
 fs/zuf/zus_api.h  |  18 ++
 12 files changed, 1969 insertions(+)
 create mode 100644 fs/zuf/md.c
 create mode 100644 fs/zuf/md.h
 create mode 100644 fs/zuf/md_def.h
 create mode 100644 fs/zuf/t1.c
 create mode 100644 fs/zuf/t2.c
 create mode 100644 fs/zuf/t2.h

diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index 8e62b4c52150..091bf053a6ed 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -10,6 +10,9 @@
 
 obj-$(CONFIG_ZUF) += zuf.o
 
+# Infrastructure
+zuf-y += md.o t1.o t2.o
+
 # ZUF core
 zuf-y += zuf-core.o zuf-root.o
 
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index 52bb6b9deafe..15d632ea5ed2 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -57,4 +57,7 @@ int zufc_dispatch(struct zuf_root_info *zri, struct zufs_ioc_hdr *hdr,
 /* zuf-root.c */
 int zufr_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs);
 
+/* t1.c */
+int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma);
+
 #endif	/*ndef __ZUF_EXTERN_H__*/
diff --git a/fs/zuf/_pr.h b/fs/zuf/_pr.h
index dc9f85453890..85641b6f1478 100644
--- a/fs/zuf/_pr.h
+++ b/fs/zuf/_pr.h
@@ -40,8 +40,19 @@
 /* ~~~ channel prints ~~~ */
 #define zuf_dbg_err(s, args ...)	zuf_chan_debug("error", s, ##args)
 #define zuf_dbg_vfs(s, args ...)	zuf_chan_debug("vfs  ", s, ##args)
+#define zuf_dbg_t1(s, args ...)		zuf_chan_debug("t1   ", s, ##args)
+#define zuf_dbg_t2(s, args ...)		zuf_chan_debug("t2dbg", s, ##args)
+#define zuf_dbg_t2_rw(s, args ...)	zuf_chan_debug("t2grw", s, ##args)
 #define zuf_dbg_core(s, args ...)	zuf_chan_debug("core ", s, ##args)
 #define zuf_dbg_zus(s, args ...)	zuf_chan_debug("zusdg", s, ##args)
 #define zuf_dbg_verbose(s, args ...)	zuf_chan_debug("d-oto", s, ##args)
 
+#define md_err		zuf_err
+#define md_warn		zuf_warn
+#define md_err_cnd	zuf_err_cnd
+#define md_warn_cnd	zuf_warn_cnd
+#define md_dbg_err	zuf_dbg_err
+#define md_dbg_verbose	zuf_dbg_verbose
+
+
 #endif /* define __ZUF_PR_H__ */
diff --git a/fs/zuf/md.c b/fs/zuf/md.c
new file mode 100644
index 000000000000..059c96062177
--- /dev/null
+++ b/fs/zuf/md.c
@@ -0,0 +1,764 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Multi-Device operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#include <linux/blkdev.h>
+#include <linux/pfn_t.h>
+#include <linux/crc16.h>
+#include <linux/uuid.h>
+
+#include <linux/gcd.h>
+
+#include "_pr.h"
+#include "md.h"
+#include "t2.h"
+
+/* length of uuid dev path /dev/disk/by-uuid/<uuid> */
+#define PATH_UUID	64
+
+const fmode_t _g_mode = FMODE_READ | FMODE_WRITE | FMODE_EXCL;
+
+/* build a kmalloc'ed /dev/disk/by-uuid/<uuid> path; caller must kfree() it */
+static char *_uuid_path(uuid_le *uuid)
+{
+	char path[PATH_UUID];
+
+	sprintf(path, "/dev/disk/by-uuid/%pUb", uuid);
+	return kstrdup(path, GFP_KERNEL);
+}
+
+static int _bdev_get_by_path(const char *path, struct block_device **bdev,
+			     void *holder)
+{
+	/* The owner of the device is the pointer that will hold it. This
+	 * protects from same device mounting on two super-blocks as well
+	 * as same device being repeated twice.
+	 */
+	*bdev = blkdev_get_by_path(path, _g_mode, holder);
+	if (IS_ERR(*bdev)) {
+		int err = PTR_ERR(*bdev);
+		*bdev = NULL;
+		return err;
+	}
+	return 0;
+}
+
+static void _bdev_put(struct block_device **bdev, struct block_device *s_bdev)
+{
+	if (*bdev) {
+		if (!s_bdev || *bdev != s_bdev)
+			blkdev_put(*bdev, _g_mode);
+		*bdev = NULL;
+	}
+}
+
+static int ___bdev_get_by_uuid(struct block_device **bdev, uuid_le *uuid,
+			       void *holder, bool silent, const char *msg,
+			       const char *f, int l)
+{
+	char *path = NULL;
+	int err;
+
+	path = _uuid_path(uuid);
+	err = _bdev_get_by_path(path, bdev, holder);
+	if (unlikely(err))
+		md_err_cnd(silent, "[%s:%d] %s path=%s =>%d\n",
+			     f, l, msg, path, err);
+
+	kfree(path);
+	return err;
+}
+
+#define _bdev_get_by_uuid(bdev, uuid, holder, msg) \
+	___bdev_get_by_uuid(bdev, uuid, holder, silent, msg, __func__, __LINE__)
+
+static bool _main_bdev(struct block_device *bdev)
+{
+	if (bdev->bd_super && bdev->bd_super->s_bdev == bdev)
+		return true;
+	return false;
+}
+
+short md_calc_csum(struct md_dev_table *mdt)
+{
+	uint n = MDT_STATIC_SIZE(mdt) - sizeof(mdt->s_sum);
+
+	return crc16(~0, (__u8 *)&mdt->s_version, n);
+}
+
+/* ~~~~~~~ mdt related functions ~~~~~~~ */
+
+int md_t2_mdt_read(struct multi_devices *md, int index,
+		   struct md_dev_table *mdt)
+{
+	int err = t2_readpage(md, index, virt_to_page(mdt));
+
+	if (err)
+		md_dbg_verbose("!!! t2_readpage err=%d\n", err);
+
+	return err;
+}
+
+int _t2_mdt_read(struct block_device *bdev, struct md_dev_table *mdt)
+{
+	int err;
+	/* t2 interface works for all block devices */
+	struct multi_devices *md;
+	struct md_dev_info *mdi;
+
+	md = kzalloc(sizeof(*md), GFP_KERNEL);
+	if (unlikely(!md))
+		return -ENOMEM;
+
+	md->t2_count = 1;
+	md->devs[0].bdev = bdev;
+	mdi = &md->devs[0];
+	md->t2a.map = &mdi;
+	md->t2a.bn_gcd = 1; /* Value does not matter, but must not be zero */
+
+	err = md_t2_mdt_read(md, 0, mdt);
+
+	kfree(md);
+	return err;
+}
+
+int md_t2_mdt_write(struct multi_devices *md, struct md_dev_table *mdt)
+{
+	int i, err = 0;
+
+	for (i = 0; i < md->t2_count; ++i) {
+		ulong bn = md_o2p(md_t2_dev(md, i)->offset);
+
+		mdt->s_dev_list.id_index = mdt->s_dev_list.t1_count + i;
+		mdt->s_sum = cpu_to_le16(md_calc_csum(mdt));
+
+		err = t2_writepage(md, bn, virt_to_page(mdt));
+		if (err)
+			md_dbg_verbose("!!! t2_writepage err=%d\n", err);
+	}
+
+	return err;
+}
+
+static bool _csum_mismatch(struct md_dev_table *mdt, int silent)
+{
+	ushort crc = md_calc_csum(mdt);
+
+	if (mdt->s_sum == cpu_to_le16(crc))
+		return false;
+
+	md_warn_cnd(silent, "expected(0x%x) != s_sum(0x%x)\n",
+		      cpu_to_le16(crc), mdt->s_sum);
+	return true;
+}
+
+static bool _uuid_le_equal(uuid_le *uuid1, uuid_le *uuid2)
+{
+	return (memcmp(uuid1, uuid2, sizeof(uuid_le)) == 0);
+}
+
+static bool _mdt_compare_uuids(struct md_dev_table *mdt,
+			       struct md_dev_table *main_mdt, int silent)
+{
+	int i, dev_count;
+
+	if (!_uuid_le_equal(&mdt->s_uuid, &main_mdt->s_uuid)) {
+		md_warn_cnd(silent, "mdt uuid (%pUb != %pUb) mismatch\n",
+			      &mdt->s_uuid, &main_mdt->s_uuid);
+		return false;
+	}
+
+	dev_count = mdt->s_dev_list.t1_count + mdt->s_dev_list.t2_count +
+		    mdt->s_dev_list.rmem_count;
+	for (i = 0; i < dev_count; ++i) {
+		struct md_dev_id *dev_id1 = &mdt->s_dev_list.dev_ids[i];
+		struct md_dev_id *dev_id2 = &main_mdt->s_dev_list.dev_ids[i];
+
+		if (!_uuid_le_equal(&dev_id1->uuid, &dev_id2->uuid)) {
+			md_warn_cnd(silent,
+				    "mdt dev %d uuid (%pUb != %pUb) mismatch\n",
+				    i, &dev_id1->uuid, &dev_id2->uuid);
+			return false;
+		}
+
+		if (dev_id1->blocks != dev_id2->blocks) {
+			md_warn_cnd(silent,
+				    "mdt dev %d blocks (0x%llx != 0x%llx) mismatch\n",
+				    i, le64_to_cpu(dev_id1->blocks),
+				    le64_to_cpu(dev_id2->blocks));
+			return false;
+		}
+	}
+
+	return true;
+}
+
+bool md_mdt_check(struct md_dev_table *mdt,
+		  struct md_dev_table *main_mdt, struct block_device *bdev,
+		  struct mdt_check *mc)
+{
+	struct md_dev_table *mdt2 = (void *)mdt + MDT_SIZE;
+	struct md_dev_id *dev_id;
+	ulong bdev_size, super_size;
+
+	BUILD_BUG_ON(MDT_STATIC_SIZE(mdt) & (SMP_CACHE_BYTES - 1));
+
+	/* Do sanity checks on the superblock */
+	if (le32_to_cpu(mdt->s_magic) != mc->magic) {
+		if (le32_to_cpu(mdt2->s_magic) != mc->magic) {
+			md_warn_cnd(mc->silent,
+				     "Can't find a valid partition\n");
+			return false;
+		}
+
+		md_warn_cnd(mc->silent,
+			     "Magic error in super block: using copy\n");
+		/* Try to auto-recover the super block */
+		memcpy_flushcache(mdt, mdt2, sizeof(*mdt));
+	}
+
+	if ((mc->major_ver != mdt_major_version(mdt)) ||
+	    (mc->minor_ver < mdt_minor_version(mdt))) {
+		md_warn_cnd(mc->silent,
+			     "mkfs-mount versions mismatch! %d.%d != %d.%d\n",
+			     mdt_major_version(mdt), mdt_minor_version(mdt),
+			     mc->major_ver, mc->minor_ver);
+		return false;
+	}
+
+	if (_csum_mismatch(mdt, mc->silent)) {
+		if (_csum_mismatch(mdt2, mc->silent)) {
+			md_warn_cnd(mc->silent,
+				    "checksum error in super block\n");
+			return false;
+		}
+
+		md_warn_cnd(mc->silent,
+			    "crc16 error in super block: using copy\n");
+		/* Try to auto-recover the super block */
+		memcpy_flushcache(mdt, mdt2, MDT_SIZE);
+		/* TODO(sagi): copy fixed mdt to shadow */
+	}
+
+	if (main_mdt) {
+		if (mdt->s_dev_list.t1_count != main_mdt->s_dev_list.t1_count) {
+			md_warn_cnd(mc->silent, "mdt t1 count mismatch\n");
+			return false;
+		}
+
+		if (mdt->s_dev_list.t2_count != main_mdt->s_dev_list.t2_count) {
+			md_warn_cnd(mc->silent, "mdt t2 count mismatch\n");
+			return false;
+		}
+
+		if (mdt->s_dev_list.rmem_count !=
+		    main_mdt->s_dev_list.rmem_count) {
+			md_warn_cnd(mc->silent,
+				    "mdt rmem dev count mismatch\n");
+			return false;
+		}
+
+		if (!_mdt_compare_uuids(mdt, main_mdt, mc->silent))
+			return false;
+	}
+
+	/* check alignment */
+	dev_id = &mdt->s_dev_list.dev_ids[mdt->s_dev_list.id_index];
+	super_size = md_p2o(__dev_id_blocks(dev_id));
+	if (unlikely(!super_size || super_size & mc->alloc_mask)) {
+		md_warn_cnd(mc->silent, "super_size(0x%lx) ! 2_M aligned\n",
+			      super_size);
+		return false;
+	}
+
+	if (!bdev)
+		return true;
+
+	/* check t1 device size */
+	bdev_size = i_size_read(bdev->bd_inode);
+	if (unlikely(super_size > bdev_size)) {
+		md_warn_cnd(mc->silent,
+			    "bdev_size(0x%lx) too small expected 0x%lx\n",
+			    bdev_size, super_size);
+		return false;
+	} else if (unlikely(super_size < bdev_size)) {
+		md_dbg_err("Note mdt->size=(0x%lx) < bdev_size(0x%lx)\n",
+			      super_size, bdev_size);
+	}
+
+	return true;
+}
+
+
+int md_set_sb(struct multi_devices *md, struct block_device *s_bdev,
+	      void *sb, int silent)
+{
+	int i;
+
+	for (i = 0; i < md->t1_count; ++i) {
+		struct md_dev_info *mdi;
+
+		if (i == md->dev_index)
+			continue;
+
+		mdi = md_t1_dev(md, i);
+		if (mdi->bdev->bd_super && (mdi->bdev->bd_super != sb)) {
+			md_warn_cnd(silent,
+				"!!! %s already mounted on a different FS => -EBUSY\n",
+				_bdev_name(mdi->bdev));
+			return -EBUSY;
+		}
+
+		mdi->bdev->bd_super = sb;
+	}
+
+	md_dev_info(md, md->dev_index)->bdev = s_bdev;
+	return 0;
+}
+
+void md_fini(struct multi_devices *md, struct block_device *s_bdev)
+{
+	int i;
+
+	kfree(md->t2a.map);
+	kfree(md->t1a.map);
+
+	for (i = 0; i < md->t1_count + md->t2_count; ++i) {
+		struct md_dev_info *mdi = md_dev_info(md, i);
+
+		md_t1_info_fini(mdi);
+		if (mdi->bdev && !_main_bdev(mdi->bdev))
+			mdi->bdev->bd_super = NULL;
+		_bdev_put(&mdi->bdev, s_bdev);
+	}
+
+	kfree(md);
+}
+
+
+/* ~~~~~~~ Pre-mount operations ~~~~~~~ */
+
+static int _get_device(struct block_device **bdev, const char *dev_name,
+		       uuid_le *uuid, void *holder, int silent,
+		       bool *bind_mount)
+{
+	int err;
+
+	if (dev_name)
+		err = _bdev_get_by_path(dev_name, bdev, holder);
+	else
+		err = _bdev_get_by_uuid(bdev, uuid, holder,
+					"failed to get device");
+
+	if (unlikely(err)) {
+		md_err_cnd(silent,
+			"failed to get device dev_name=%s uuid=%pUb err=%d\n",
+			dev_name, uuid, err);
+		return err;
+	}
+
+	if (bind_mount && _main_bdev(*bdev))
+		*bind_mount = true;
+
+	return 0;
+}
+
+static int _init_dev_info(struct md_dev_info *mdi, struct md_dev_id *id,
+			  int index, u64 offset,
+			  struct md_dev_table *main_mdt,
+			  struct mdt_check *mc, bool t1_dev,
+			  int silent)
+{
+	struct md_dev_table *mdt = NULL;
+	bool mdt_alloc = false;
+	int err = 0;
+
+	if (mdi->bdev == NULL) {
+		err = _get_device(&mdi->bdev, NULL, &id->uuid, mc->holder,
+				  silent, NULL);
+		if (unlikely(err))
+			return err;
+	}
+
+	mdi->offset = offset;
+	mdi->size = md_p2o(__dev_id_blocks(id));
+	mdi->index = index;
+
+	if (t1_dev) {
+		struct page *dev_page;
+		int end_of_dev_nid;
+
+		err = md_t1_info_init(mdi, silent);
+		if (unlikely(err))
+			return err;
+
+		if ((ulong)mdi->t1i.virt_addr & mc->alloc_mask) {
+			md_warn_cnd(silent, "!!! unaligned device %s\n",
+				      _bdev_name(mdi->bdev));
+			return -EINVAL;
+		}
+
+		if (!__pfn_to_section(mdi->t1i.phys_pfn)) {
+			md_err_cnd(silent, "Intel does not like pages...\n");
+			return -EINVAL;
+		}
+
+		mdt = mdi->t1i.virt_addr;
+
+		mdi->t1i.pgmap = virt_to_page(mdt)->pgmap;
+		dev_page = pfn_to_page(mdi->t1i.phys_pfn);
+		mdi->nid = page_to_nid(dev_page);
+		end_of_dev_nid = page_to_nid(dev_page + md_o2p(mdi->size - 1));
+
+		if (mdi->nid != end_of_dev_nid)
+			md_warn("pmem crosses NUMA boundaries");
+	} else {
+		mdt = (void *)__get_free_page(GFP_KERNEL);
+		if (unlikely(!mdt)) {
+			md_dbg_err("!!! failed to alloc page\n");
+			return -ENOMEM;
+		}
+
+		mdt_alloc = true;
+		err = _t2_mdt_read(mdi->bdev, mdt);
+		if (unlikely(err)) {
+			md_err_cnd(silent, "failed to read mdt from t2 => %d\n",
+				   err);
+			goto out;
+		}
+		mdi->nid = __dev_id_nid(id);
+	}
+
+	if (!md_mdt_check(mdt, main_mdt, mdi->bdev, mc)) {
+		md_err_cnd(silent, "device %s failed integrity check\n",
+			     _bdev_name(mdi->bdev));
+		err = -EINVAL;
+		goto out;
+	}
+
+	return 0;
+
+out:
+	if (mdt_alloc)
+		free_page((ulong)mdt);
+	return err;
+}
+
+static int _map_setup(struct multi_devices *md, ulong blocks, int dev_start,
+		      struct md_dev_larray *larray)
+{
+	ulong map_size, bn_end;
+	int i, dev_index = dev_start;
+
+	map_size = blocks / larray->bn_gcd;
+	larray->map = kcalloc(map_size, sizeof(*larray->map), GFP_KERNEL);
+	if (!larray->map) {
+		md_dbg_err("failed to allocate dev map\n");
+		return -ENOMEM;
+	}
+
+	bn_end = md_o2p(md->devs[dev_index].size);
+	for (i = 0; i < map_size; ++i) {
+		if ((i * larray->bn_gcd) >= bn_end)
+			bn_end += md_o2p(md->devs[++dev_index].size);
+		larray->map[i] = &md->devs[dev_index];
+	}
+
+	return 0;
+}
+
+static int _md_init(struct multi_devices *md, struct mdt_check *mc,
+		    struct md_dev_list *dev_list, int silent)
+{
+	struct md_dev_table *main_mdt = NULL;
+	u64 total_size = 0;
+	int i, err;
+
+	for (i = 0; i < md->t1_count; ++i) {
+		struct md_dev_info *mdi = md_t1_dev(md, i);
+		struct md_dev_table *dev_mdt;
+
+		err = _init_dev_info(mdi, &dev_list->dev_ids[i], i, total_size,
+				     main_mdt, mc, true, silent);
+		if (unlikely(err))
+			return err;
+
+		/* apparently gcd(0,X)=X which is nice */
+		md->t1a.bn_gcd = gcd(md->t1a.bn_gcd, md_o2p(mdi->size));
+		total_size += mdi->size;
+
+		dev_mdt = md_t1_addr(md, i);
+		if (!main_mdt)
+			main_mdt = dev_mdt;
+
+		if (mdt_test_option(dev_mdt, MDT_F_SHADOW))
+			memcpy(mdi->t1i.virt_addr,
+			       mdi->t1i.virt_addr + mdi->size, mdi->size);
+
+		md_dbg_verbose("dev=%d %pUb %s v=%p pfn=%lu off=%lu size=%lu\n",
+				 i, &dev_list->dev_ids[i].uuid,
+				 _bdev_name(mdi->bdev), dev_mdt,
+				 mdi->t1i.phys_pfn, mdi->offset, mdi->size);
+	}
+
+	md->t1_blocks = le64_to_cpu(main_mdt->s_t1_blocks);
+	if (unlikely(md->t1_blocks != md_o2p(total_size))) {
+		md_err_cnd(silent,
+			"FS corrupted md->t1_blocks(0x%lx) != total_size(0x%llx)\n",
+			md->t1_blocks, total_size);
+		return -EIO;
+	}
+
+	err = _map_setup(md, le64_to_cpu(main_mdt->s_t1_blocks), 0, &md->t1a);
+	if (unlikely(err))
+		return err;
+
+	md_dbg_verbose("t1 devices=%d total_size=0x%llx segment_map=0x%lx\n",
+			 md->t1_count, total_size,
+			 md_o2p(total_size) / md->t1a.bn_gcd);
+
+	if (md->t2_count == 0)
+		return 0;
+
+	/* Done with t1. Counting t2s */
+	total_size = 0;
+	for (i = 0; i < md->t2_count; ++i) {
+		struct md_dev_info *mdi = md_t2_dev(md, i);
+
+		err = _init_dev_info(mdi, &dev_list->dev_ids[md->t1_count + i],
+				     md->t1_count + i, total_size, main_mdt,
+				     mc, false, silent);
+		if (unlikely(err))
+			return err;
+
+		/* apparently gcd(0,X)=X which is nice */
+		md->t2a.bn_gcd = gcd(md->t2a.bn_gcd, md_o2p(mdi->size));
+		total_size += mdi->size;
+
+		md_dbg_verbose("dev=%d %s off=%lu size=%lu\n", i,
+				 _bdev_name(mdi->bdev), mdi->offset, mdi->size);
+	}
+
+	md->t2_blocks = le64_to_cpu(main_mdt->s_t2_blocks);
+	if (unlikely(md->t2_blocks != md_o2p(total_size))) {
+		md_err_cnd(silent,
+			"FS corrupted md->t2_blocks(0x%lx) != total_size(0x%llx)\n",
+			md->t2_blocks, total_size);
+		return -EIO;
+	}
+
+	err = _map_setup(md, le64_to_cpu(main_mdt->s_t2_blocks), md->t1_count,
+			 &md->t2a);
+	if (unlikely(err))
+		return err;
+
+	md_dbg_verbose("t2 devices=%d total_size=%llu segment_map=%lu\n",
+			 md->t2_count, total_size,
+			 md_o2p(total_size) / md->t2a.bn_gcd);
+
+	return 0;
+}
+
+static int _load_dev_list(struct md_dev_list *dev_list, struct mdt_check *mc,
+			  struct block_device *bdev, const char *dev_name,
+			  int silent)
+{
+	struct md_dev_table *mdt;
+	int err;
+
+	mdt = (void *)__get_free_page(GFP_KERNEL);
+	if (unlikely(!mdt)) {
+		md_dbg_err("!!! failed to alloc page\n");
+		return -ENOMEM;
+	}
+
+	err = _t2_mdt_read(bdev, mdt);
+	if (unlikely(err)) {
+		md_err_cnd(silent, "failed to read super block from %s => %d\n",
+			     dev_name, err);
+		goto out;
+	}
+
+	if (!md_mdt_check(mdt, NULL, bdev, mc)) {
+		md_err_cnd(silent, "bad mdt in %s\n", dev_name);
+		err = -EINVAL;
+		goto out;
+	}
+
+	*dev_list = mdt->s_dev_list;
+
+out:
+	free_page((ulong)mdt);
+	return err;
+}
+
+int md_init(struct multi_devices *md, const char *dev_name,
+	    struct mdt_check *mc, const char **dev_path)
+{
+	struct md_dev_list *dev_list;
+	struct block_device *bdev;
+	short id_index;
+	bool bind_mount = false;
+	int err;
+
+	dev_list = kmalloc(sizeof(*dev_list), GFP_KERNEL);
+	if (unlikely(!dev_list))
+		return -ENOMEM;
+
+	err = _get_device(&bdev, dev_name, NULL, mc->holder, mc->silent,
+			  &bind_mount);
+	if (unlikely(err))
+		goto out2;
+
+	err = _load_dev_list(dev_list, mc, bdev, dev_name, mc->silent);
+	if (unlikely(err)) {
+		_bdev_put(&bdev, NULL);
+		goto out2;
+	}
+
+	id_index = le16_to_cpu(dev_list->id_index);
+	if (bind_mount) {
+		_bdev_put(&bdev, NULL);
+		md->dev_index = id_index;
+		goto out;
+	}
+
+	md->t1_count = le16_to_cpu(dev_list->t1_count);
+	md->t2_count = le16_to_cpu(dev_list->t2_count);
+	md->devs[id_index].bdev = bdev;
+
+	if ((id_index != 0)) {
+		err = _get_device(&md_t1_dev(md, 0)->bdev, NULL,
+				  &dev_list->dev_ids[0].uuid, mc->holder,
+				  mc->silent, &bind_mount);
+		if (unlikely(err))
+			goto out2;
+
+		if (bind_mount)
+			goto out;
+	}
+
+	if (md->t2_count) {
+		int t2_index = md->t1_count;
+
+		/* t2 is the primary device if given in mount, or the first
+		 * mount specified it as primary device
+		 */
+		if (id_index != md->t1_count) {
+			err = _get_device(&md_t2_dev(md, 0)->bdev, NULL,
+					  &dev_list->dev_ids[t2_index].uuid,
+					  mc->holder, mc->silent, &bind_mount);
+			if (unlikely(err))
+				goto out2;
+
+			if (bind_mount)
+				md->dev_index = t2_index;
+		}
+
+		if (t2_index <= id_index)
+			md->dev_index = t2_index;
+	}
+
+out:
+	if (md->dev_index != id_index)
+		*dev_path = _uuid_path(&dev_list->dev_ids[md->dev_index].uuid);
+	else
+		*dev_path = kstrdup(dev_name, GFP_KERNEL);
+
+	if (!bind_mount) {
+		err = _md_init(md, mc, dev_list, mc->silent);
+		if (unlikely(err))
+			goto out2;
+		_bdev_put(&md_dev_info(md, md->dev_index)->bdev, NULL);
+	} else {
+		md_fini(md, NULL);
+	}
+
+out2:
+	kfree(dev_list);
+
+	return err;
+}
+
+struct multi_devices *md_alloc(size_t size)
+{
+	uint s = max(sizeof(struct multi_devices), size);
+	struct multi_devices *md = kzalloc(s, GFP_KERNEL);
+
+	if (unlikely(!md))
+		return ERR_PTR(-ENOMEM);
+	return md;
+}
+
+/* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ * PORTING SECTION:
+ * Below are members that are done differently in different Linux versions.
+ * So keep separate from code
+ */
+static int _check_da_ret(struct md_dev_info *mdi, long avail, bool silent)
+{
+	if (unlikely(avail < (long)mdi->size)) {
+		if (0 < avail) {
+			md_warn_cnd(silent,
+				"Unsupported DAX device %s (range mismatch) => 0x%lx < 0x%lx\n",
+				_bdev_name(mdi->bdev), avail, mdi->size);
+			return -ERANGE;
+		}
+		md_warn_cnd(silent, "!!! %s direct_access return => %ld\n",
+			    _bdev_name(mdi->bdev), avail);
+		return avail;
+	}
+	return 0;
+}
+
+#include <linux/dax.h>
+
+int md_t1_info_init(struct md_dev_info *mdi, bool silent)
+{
+	pfn_t a_pfn_t;
+	void *addr;
+	long nrpages, avail, pgoff;
+	int id;
+
+	mdi->t1i.dax_dev = fs_dax_get_by_bdev(mdi->bdev);
+	if (unlikely(!mdi->t1i.dax_dev))
+		return -EOPNOTSUPP;
+
+	id = dax_read_lock();
+
+	bdev_dax_pgoff(mdi->bdev, 0, PAGE_SIZE, &pgoff);
+	nrpages = dax_direct_access(mdi->t1i.dax_dev, pgoff, md_o2p(mdi->size),
+				    &addr, &a_pfn_t);
+	dax_read_unlock(id);
+	if (unlikely(nrpages <= 0)) {
+		if (!nrpages)
+			nrpages = -ERANGE;
+		avail = nrpages;
+	} else {
+		avail = md_p2o(nrpages);
+	}
+
+	mdi->t1i.virt_addr = addr;
+	mdi->t1i.phys_pfn = pfn_t_to_pfn(a_pfn_t);
+
+	md_dbg_verbose("0x%lx 0x%lx pgoff=0x%lx\n",
+			 (ulong)mdi->t1i.virt_addr, mdi->t1i.phys_pfn, pgoff);
+
+	return _check_da_ret(mdi, avail, silent);
+}
+
+void md_t1_info_fini(struct md_dev_info *mdi)
+{
+	fs_put_dax(mdi->t1i.dax_dev);
+	mdi->t1i.dax_dev = NULL;
+	mdi->t1i.virt_addr = NULL;
+}
diff --git a/fs/zuf/md.h b/fs/zuf/md.h
new file mode 100644
index 000000000000..eaf7280af356
--- /dev/null
+++ b/fs/zuf/md.h
@@ -0,0 +1,318 @@
+/* SPDX-License-Identifier: GPL-2.0 or BSD-3-Clause */
+/*
+ * Multi-Device operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#ifndef __MD_H__
+#define __MD_H__
+
+#include <linux/types.h>
+#include <linux/blkdev.h>
+
+#include "md_def.h"
+
+#ifndef __KERNEL__
+struct page;
+struct block_device;
+#endif /* ndef __KERNEL__ */
+
+struct md_t1_info {
+	void *virt_addr;
+#ifdef __KERNEL__
+	ulong phys_pfn;
+	struct dax_device *dax_dev;
+	struct dev_pagemap *pgmap;
+#endif /*def __KERNEL__*/
+};
+
+struct md_t2_info {
+	bool err_read_reported;
+	bool err_write_reported;
+};
+
+struct md_dev_info {
+	struct block_device *bdev;
+	ulong size;
+	ulong offset;
+	union {
+		struct md_t1_info	t1i;
+		struct md_t2_info	t2i;
+	};
+	int index;
+	int nid;
+};
+
+struct md_dev_larray {
+	ulong bn_gcd;
+	struct md_dev_info **map;
+};
+
+#ifndef __KERNEL__
+struct fba {
+	int fd; void *ptr;
+	size_t size;
+	void *orig_ptr;
+};
+#endif /*! __KERNEL__*/
+
+struct zus_sb_info;
+struct multi_devices {
+	int dev_index;
+	int t1_count;
+	int t2_count;
+	struct md_dev_info devs[MD_DEV_MAX];
+	struct md_dev_larray t1a;
+	struct md_dev_larray t2a;
+#ifndef __KERNEL__
+	struct zufs_ioc_pmem pmem_info; /* As received from Kernel */
+
+	void *p_pmem_addr;
+	int fd;
+	uint user_page_size;
+	struct fba pages;
+	struct zus_sb_info *sbi;
+#else
+	ulong t1_blocks;
+	ulong t2_blocks;
+#endif /*! __KERNEL__*/
+};
+
+static inline __u64 md_p2o(ulong bn)
+{
+	return (__u64)bn << PAGE_SHIFT;
+}
+
+static inline ulong md_o2p(__u64 offset)
+{
+	return offset >> PAGE_SHIFT;
+}
+
+static inline ulong md_o2p_up(__u64 offset)
+{
+	return md_o2p(offset + PAGE_SIZE - 1);
+}
+
+static inline struct md_dev_info *md_t1_dev(struct multi_devices *md, int i)
+{
+	return &md->devs[i];
+}
+
+static inline struct md_dev_info *md_t2_dev(struct multi_devices *md, int i)
+{
+	return &md->devs[md->t1_count + i];
+}
+
+static inline struct md_dev_info *md_dev_info(struct multi_devices *md, int i)
+{
+	return &md->devs[i];
+}
+
+static inline void *md_t1_addr(struct multi_devices *md, int i)
+{
+	struct md_dev_info *mdi = md_t1_dev(md, i);
+
+	return mdi->t1i.virt_addr;
+}
+
+static inline ulong md_t1_blocks(struct multi_devices *md)
+{
+#ifdef __KERNEL__
+	return md->t1_blocks;
+#else
+	return md->pmem_info.mdt.s_t1_blocks;
+#endif
+}
+
+static inline ulong md_t2_blocks(struct multi_devices *md)
+{
+#ifdef __KERNEL__
+	return md->t2_blocks;
+#else
+	return md->pmem_info.mdt.s_t2_blocks;
+#endif
+}
+
+static inline struct md_dev_table *md_zdt(struct multi_devices *md)
+{
+	return md_t1_addr(md, 0);
+}
+
+static inline struct md_dev_info *md_bn_t1_dev(struct multi_devices *md,
+						 ulong bn)
+{
+	return md->t1a.map[bn / md->t1a.bn_gcd];
+}
+
+#ifdef __KERNEL__
+static inline ulong md_pfn(struct multi_devices *md, ulong block)
+{
+	struct md_dev_info *mdi;
+	bool add_pfn = false;
+	ulong base_pfn;
+
+	if (unlikely(md_t1_blocks(md) <= block)) {
+		if (WARN_ON(!mdt_test_option(md_zdt(md), MDT_F_SHADOW)))
+			return 0;
+		block -= md_t1_blocks(md);
+		add_pfn = true;
+	}
+
+	mdi = md_bn_t1_dev(md, block);
+	if (add_pfn)
+		base_pfn = mdi->t1i.phys_pfn + md_o2p(mdi->size);
+	else
+		base_pfn = mdi->t1i.phys_pfn;
+	return base_pfn + (block - md_o2p(mdi->offset));
+}
+#endif /* def __KERNEL__ */
+
+static inline void *md_addr(struct multi_devices *md, ulong offset)
+{
+#ifdef __KERNEL__
+	struct md_dev_info *mdi = md_bn_t1_dev(md, md_o2p(offset));
+
+	return offset ? mdi->t1i.virt_addr + (offset - mdi->offset) : NULL;
+#else
+	return offset ? md->p_pmem_addr + offset : NULL;
+#endif
+}
+
+static inline void *md_baddr(struct multi_devices *md, ulong bn)
+{
+	return md_addr(md, md_p2o(bn));
+}
+
+static inline struct md_dev_info *md_bn_t2_dev(struct multi_devices *md,
+					       ulong bn)
+{
+	return md->t2a.map[bn / md->t2a.bn_gcd];
+}
+
+static inline int md_t2_bn_nid(struct multi_devices *md, ulong bn)
+{
+	struct md_dev_info *mdi = md_bn_t2_dev(md, bn);
+
+	return mdi->nid;
+}
+
+static inline ulong md_t2_local_bn(struct multi_devices *md, ulong bn)
+{
+#ifdef __KERNEL__
+	struct md_dev_info *mdi = md_bn_t2_dev(md, bn);
+
+	return bn - md_o2p(mdi->offset);
+#else
+	return bn; /* In zus we just let Kernel worry about it */
+#endif
+}
+
+static inline ulong md_t2_gcd(struct multi_devices *md)
+{
+	return md->t2a.bn_gcd;
+}
+
+static inline void *md_addr_verify(struct multi_devices *md, ulong offset)
+{
+	if (unlikely(offset > md_p2o(md_t1_blocks(md)))) {
+		md_dbg_err("offset=0x%lx > max=0x%llx\n",
+			    offset, md_p2o(md_t1_blocks(md)));
+		return NULL;
+	}
+
+	return md_addr(md, offset);
+}
+
+static inline struct page *md_bn_to_page(struct multi_devices *md, ulong bn)
+{
+#ifdef __KERNEL__
+	return pfn_to_page(md_pfn(md, bn));
+#else
+	return md->pages.ptr + bn * md->user_page_size;
+#endif
+}
+
+static inline ulong md_addr_to_offset(struct multi_devices *md, void *addr)
+{
+#ifdef __KERNEL__
+	/* TODO: Keep the device index in page-flags we need to fix the
+	 * page-ref right? for now with pages untouched we need this loop
+	 */
+	int dev_index;
+
+	for (dev_index = 0; dev_index < md->t1_count; ++dev_index) {
+		struct md_dev_info *mdi = md_t1_dev(md, dev_index);
+
+		if ((mdi->t1i.virt_addr <= addr) &&
+		    (addr < (mdi->t1i.virt_addr + mdi->size)))
+			return mdi->offset + (addr - mdi->t1i.virt_addr);
+	}
+
+	return 0;
+#else /* !__KERNEL__ */
+	return addr - md->p_pmem_addr;
+#endif
+}
+
+static inline ulong md_addr_to_bn(struct multi_devices *md, void *addr)
+{
+	return md_o2p(md_addr_to_offset(md, addr));
+}
+
+static inline ulong md_page_to_bn(struct multi_devices *md, struct page *page)
+{
+#ifdef __KERNEL__
+	return md_addr_to_bn(md, page_address(page));
+#else
+	ulong bytes = (void *)page - md->pages.ptr;
+
+	return bytes / md->user_page_size;
+#endif
+}
+
+#ifdef __KERNEL__
+/* TODO: Change API to take mdi and also support in um */
+static inline const char *_bdev_name(struct block_device *bdev)
+{
+	return dev_name(&bdev->bd_part->__dev);
+}
+#endif /*def __KERNEL__*/
+
+struct mdt_check {
+	ulong alloc_mask;
+	uint major_ver;
+	uint minor_ver;
+	__u32  magic;
+
+	void *holder;
+	bool silent;
+};
+
+/* md.c */
+bool md_mdt_check(struct md_dev_table *mdt, struct md_dev_table *main_mdt,
+		  struct block_device *bdev, struct mdt_check *mc);
+int md_t2_mdt_read(struct multi_devices *md, int dev_index,
+		   struct md_dev_table *mdt);
+int md_t2_mdt_write(struct multi_devices *md, struct md_dev_table *mdt);
+short md_calc_csum(struct md_dev_table *mdt);
+void md_fini(struct multi_devices *md, struct block_device *s_bdev);
+
+#ifdef __KERNEL__
+struct multi_devices *md_alloc(size_t size);
+int md_init(struct multi_devices *md, const char *dev_name,
+	    struct mdt_check *mc, const char **dev_path);
+int md_set_sb(struct multi_devices *md, struct block_device *s_bdev, void *sb,
+	      int silent);
+int md_t1_info_init(struct md_dev_info *mdi, bool silent);
+void md_t1_info_fini(struct md_dev_info *mdi);
+
+#else /* m1us */
+int md_init_from_pmem_info(struct multi_devices *md);
+#endif
+
+#endif
diff --git a/fs/zuf/md_def.h b/fs/zuf/md_def.h
new file mode 100644
index 000000000000..a236567adfd9
--- /dev/null
+++ b/fs/zuf/md_def.h
@@ -0,0 +1,145 @@
+/* SPDX-License-Identifier: GPL-2.0 or BSD-3-Clause */
+/*
+ * Multi-Device operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+#ifndef _LINUX_MD_DEF_H
+#define _LINUX_MD_DEF_H
+
+#include <linux/types.h>
+#include <linux/uuid.h>
+
+#ifndef __KERNEL__
+
+#include <stdint.h>
+#include <endian.h>
+#include <stdbool.h>
+#include <stdlib.h>
+
+#ifndef le16_to_cpu
+
+#define le16_to_cpu(x)	((__u16)le16toh(x))
+#define le32_to_cpu(x)	((__u32)le32toh(x))
+#define le64_to_cpu(x)	((__u64)le64toh(x))
+#define cpu_to_le16(x)	((__le16)htole16(x))
+#define cpu_to_le32(x)	((__le32)htole32(x))
+#define cpu_to_le64(x)	((__le64)htole64(x))
+
+#endif
+
+#ifndef __aligned
+#define	__aligned(x)			__attribute__((aligned(x)))
+#endif
+
+#ifndef __packed
+#	define __packed __attribute__((packed))
+#endif
+
+#endif /*  ndef __KERNEL__ */
+
+#define MDT_SIZE 2048
+
+#define MD_DEV_NUMA_SHIFT		60
+#define MD_DEV_BLOCKS_MASK		0x0FFFFFFFFFFFFFFF
+
+struct md_dev_id {
+	uuid_le	uuid;
+	__le64	blocks;
+} __packed;
+
+static inline __u64 __dev_id_blocks(struct md_dev_id *dev)
+{
+	return le64_to_cpu(dev->blocks) & MD_DEV_BLOCKS_MASK;
+}
+
+static inline void __dev_id_blocks_set(struct md_dev_id *dev, __u64 blocks)
+{
+	dev->blocks &= ~MD_DEV_BLOCKS_MASK;
+	dev->blocks |= blocks;
+}
+
+static inline int __dev_id_nid(struct md_dev_id *dev)
+{
+	return (int)(le64_to_cpu(dev->blocks) >> MD_DEV_NUMA_SHIFT);
+}
+
+static inline void __dev_id_nid_set(struct md_dev_id *dev, int nid)
+{
+	dev->blocks &= MD_DEV_BLOCKS_MASK;
+	dev->blocks |= (__le64)nid << MD_DEV_NUMA_SHIFT;
+}
+
+/* 64 is the nicest number to still fit when the ZDT is 2048 and 6 bits can
+ * fit in page struct for address to block translation.
+ */
+#define MD_DEV_MAX   64
+
+struct md_dev_list {
+	__le16		   id_index;	/* index of current dev in list */
+	__le16		   t1_count;	/* # of t1 devs */
+	__le16		   t2_count;	/* # of t2 devs (after t1_count) */
+	__le16		   rmem_count;	/* align to 64 bit */
+	struct md_dev_id dev_ids[MD_DEV_MAX];
+} __aligned(64);
+
+/*
+ * Structure of the on-disk multi-device table
+ * NOTE: md_dev_table is always of size MDT_SIZE. These below are the
+ *   currently defined/used members in this version.
+ *   TODO: remove the s_ from all the fields
+ */
+struct md_dev_table {
+	/* static fields. they never change after file system creation.
+	 * checksum only validates up to s_start_dynamic field below
+	 */
+	__le16		s_sum;              /* checksum of this sb */
+	__le16		s_version;          /* zdt-version */
+	__le32		s_magic;            /* magic signature */
+	uuid_le		s_uuid;		    /* 128-bit uuid */
+	__le64		s_flags;
+	__le64		s_t1_blocks;
+	__le64		s_t2_blocks;
+
+	struct md_dev_list s_dev_list;
+
+	char		s_start_dynamic[0];
+
+	/* all the dynamic fields should go here */
+	__le64		s_mtime;		/* mount time */
+	__le64		s_wtime;		/* write time */
+};
+
+/* device table s_flags */
+enum enum_mdt_flags {
+	MDT_F_SHADOW		= (1UL << 0),	/* simulate cpu cache */
+	MDT_F_POSIXACL		= (1UL << 1),	/* enable acls */
+
+	MDT_F_USER_START	= 8,	/* first 8 bit reserved for mdt */
+};
+
+static inline bool mdt_test_option(struct md_dev_table *mdt,
+				   enum enum_mdt_flags flag)
+{
+	return (le64_to_cpu(mdt->s_flags) & flag) != 0;
+}
+
+#define MD_MINORS_PER_MAJOR	1024
+
+static inline int mdt_major_version(struct md_dev_table *mdt)
+{
+	return le16_to_cpu(mdt->s_version) / MD_MINORS_PER_MAJOR;
+}
+
+static inline int mdt_minor_version(struct md_dev_table *mdt)
+{
+	return le16_to_cpu(mdt->s_version) % MD_MINORS_PER_MAJOR;
+}
+
+#define MDT_STATIC_SIZE(mdt) ((__u64)&mdt->s_start_dynamic - (__u64)mdt)
+
+#endif /* _LINUX_MD_DEF_H */
diff --git a/fs/zuf/t1.c b/fs/zuf/t1.c
new file mode 100644
index 000000000000..53da4c5840c7
--- /dev/null
+++ b/fs/zuf/t1.c
@@ -0,0 +1,138 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Just the special mmap of the whole t1 array to the ZUS Server
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/pfn_t.h>
+#include <asm/pgtable.h>
+
+#include "_pr.h"
+#include "zuf.h"
+
+/* ~~~ Functions for mmap a t1-array and page faults ~~~ */
+struct zuf_pmem *_pmem_from_f_private(struct file *file)
+{
+	struct zuf_special_file *zsf = file->private_data;
+
+	WARN_ON(zsf->type != zlfs_e_pmem);
+	return container_of(zsf, struct zuf_pmem, hdr);
+}
+
+static vm_fault_t t1_fault(struct vm_fault *vmf, enum page_entry_size pe_size)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct inode *inode = vma->vm_file->f_mapping->host;
+	ulong addr = vmf->address;
+	struct zuf_pmem *z_pmem;
+	pgoff_t size;
+	ulong bn;
+	pfn_t pfnt;
+	ulong pfn = 0;
+	vm_fault_t flt;
+
+	zuf_dbg_t1("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx "
+		    "pgoff=0x%lx vmf_flags=0x%x cow_page=%p page=%p pe_size=%d\n",
+		    inode->i_ino, vma->vm_start, vma->vm_end, addr, vmf->pgoff,
+		    vmf->flags, vmf->cow_page, vmf->page, pe_size);
+
+	if (unlikely(vmf->page)) {
+		zuf_err("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx "
+			"pgoff=0x%lx vmf_flags=0x%x page=%p cow_page=%p\n",
+			inode->i_ino, vma->vm_start, vma->vm_end, addr,
+			vmf->pgoff, vmf->flags, vmf->page, vmf->cow_page);
+		return VM_FAULT_SIGBUS;
+	}
+
+	size = md_o2p_up(i_size_read(inode));
+	if (unlikely(vmf->pgoff >= size)) {
+		ulong pgoff = vma->vm_pgoff + md_o2p(addr - vma->vm_start);
+
+		zuf_err("[%ld] pgoff(0x%lx)(0x%lx) >= size(0x%lx) => SIGBUS\n",
+			 inode->i_ino, vmf->pgoff, pgoff, size);
+
+		return VM_FAULT_SIGBUS;
+	}
+
+	if (vmf->cow_page)
+		/* HOWTO: prevent private mmaps */
+		return VM_FAULT_SIGBUS;
+
+	z_pmem = _pmem_from_f_private(vma->vm_file);
+
+	switch (pe_size) {
+	case PE_SIZE_PTE:
+		zuf_err("[%ld] PTE fault not expected pgoff=0x%lx addr=0x%lx\n",
+			inode->i_ino, vmf->pgoff, addr);
+		/* fall through do PMD insert anyway */
+	case PE_SIZE_PMD:
+		bn = linear_page_index(vma, addr & PMD_MASK);
+		pfn = md_pfn(&z_pmem->md, bn);
+		pfnt = phys_to_pfn_t(PFN_PHYS(pfn), PFN_MAP | PFN_DEV);
+		flt = vmf_insert_pfn_pmd(vma, addr, vmf->pmd, pfnt, true);
+		zuf_dbg_t1("[%ld] PMD pfn-0x%lx addr=0x%lx bn=0x%lx pgoff=0x%lx => %d\n",
+			inode->i_ino, pfn, addr, bn, vmf->pgoff, flt);
+		break;
+	default:
+		/* FIXME: PE_SIZE_PUD is easy to support; it just needs
+		 * zufr_get_unmapped_area() to align to PUD_MASK. But this is
+		 * hard today because the nvdimm lib takes 2M for its page-flag
+		 * information with NFIT. (That need not be there in any
+		 * case.)
+		 * Which means zufr_get_unmapped_area() needs to return an
+		 * align1G+2M start address, with the first 1G mapped at PMD
+		 * size. Very ugly, sigh.
+		 * One thing I do not understand: when vma->vm_start is not
+		 * PUD aligned and a fault requests index zero, the system
+		 * asks for PE_SIZE_PUD anyway. Say my 0 index is 1G aligned;
+		 * vmf_insert_pfn_pud() will always fail because the aligned
+		 * vm_addr is outside the vma.
+		 */
+		flt = VM_FAULT_FALLBACK;
+		zuf_dbg_t1("[%ld] default? pgoff=0x%lx addr=0x%lx pe_size=0x%x => %d\n",
+			   inode->i_ino, vmf->pgoff, addr, pe_size, flt);
+	}
+
+	return flt;
+}
+
+static vm_fault_t t1_fault_pte(struct vm_fault *vmf)
+{
+	return t1_fault(vmf, PE_SIZE_PTE);
+}
+
+static const struct vm_operations_struct t1_vm_ops = {
+	.huge_fault	= t1_fault,
+	.fault		= t1_fault_pte,
+};
+
+int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct zuf_special_file *zsf = file->private_data;
+
+	if (!zsf || zsf->type != zlfs_e_pmem)
+		return -EPERM;
+
+
+	/* FIXME:  MIXEDMAP for the support of pmem-pages (Why?)
+	 */
+	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	vma->vm_ops = &t1_vm_ops;
+
+	zuf_dbg_vfs("[%ld] start=0x%lx end=0x%lx flags=0x%lx page_prot=0x%lx\n",
+		     file->f_mapping->host->i_ino, vma->vm_start, vma->vm_end,
+		     vma->vm_flags, pgprot_val(vma->vm_page_prot));
+
+	return 0;
+}
+
diff --git a/fs/zuf/t2.c b/fs/zuf/t2.c
new file mode 100644
index 000000000000..7bc7b42466b9
--- /dev/null
+++ b/fs/zuf/t2.c
@@ -0,0 +1,375 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Tier-2 operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#include "_pr.h"
+#include "t2.h"
+
+#include <linux/bitops.h>
+#include <linux/bio.h>
+
+#include "zuf.h"
+
+#define t2_dbg(fmt, args ...) zuf_dbg_t2(fmt, ##args)
+#define t2_warn(fmt, args ...) zuf_warn(fmt, ##args)
+
+const char *_pr_rw(int rw)
+{
+	return (rw & WRITE) ? "WRITE" : "READ";
+}
+#define t2_tis_dbg(tis, fmt, args ...) \
+	zuf_dbg_t2("%s: r=%d f=0x%lx " fmt, _pr_rw(tis->rw_flags),	       \
+		    atomic_read(&tis->refcount), tis->rw_flags, ##args)
+
+#define t2_tis_dbg_rw(tis, fmt, args ...) \
+	zuf_dbg_t2_rw("%s<%p>: r=%d f=0x%lx " fmt, _pr_rw(tis->rw_flags),     \
+		    tis->priv, atomic_read(&tis->refcount), tis->rw_flags,\
+		    ##args)
+
+/* ~~~~~~~~~~~~ Async read/write ~~~~~~~~~~ */
+void t2_io_begin(struct multi_devices *md, int rw, t2_io_done_fn done,
+		 void *priv, uint n_vects, struct t2_io_state *tis)
+{
+	atomic_set(&tis->refcount, 1);
+	tis->md = md;
+	tis->done = done;
+	tis->priv = priv;
+	tis->n_vects = min(n_vects ? n_vects : 1, (uint)BIO_MAX_PAGES);
+	tis->rw_flags = rw;
+	tis->last_t2 = -1;
+	tis->cur_bio = NULL;
+	tis->index = ~0;
+	bio_list_init(&tis->delayed_bios);
+	tis->err = 0;
+	blk_start_plug(&tis->plug);
+	t2_tis_dbg_rw(tis, "done=%pS n_vects=%d\n", done, n_vects);
+}
+
+static void _tis_put(struct t2_io_state *tis)
+{
+	t2_tis_dbg_rw(tis, "done=%pS\n", tis->done);
+
+	if (test_bit(B_TIS_FREE_AFTER_WAIT, &tis->rw_flags))
+		wake_up_var(&tis->refcount);
+	else if (tis->done)
+		/* last - done may free the tis */
+		tis->done(tis, NULL, true);
+}
+
+static inline void tis_get(struct t2_io_state *tis)
+{
+	atomic_inc(&tis->refcount);
+}
+
+static inline int tis_put(struct t2_io_state *tis)
+{
+	if (atomic_dec_and_test(&tis->refcount)) {
+		_tis_put(tis);
+		return 1;
+	}
+	return 0;
+}
+
+static inline bool _err_set_reported(struct md_dev_info *mdi, bool write)
+{
+	bool *reported = write ? &mdi->t2i.err_write_reported :
+				 &mdi->t2i.err_read_reported;
+
+	if (!(*reported)) {
+		*reported = true;
+		return true;
+	}
+	return false;
+}
+
+static int _status_to_errno(blk_status_t status)
+{
+	return -EIO;
+}
+
+void t2_io_done(struct t2_io_state *tis, struct bio *bio, bool last)
+{
+	struct bio_vec *bv;
+	uint i;
+
+	if (!bio)
+		return;
+
+	bio_for_each_segment_all(bv, bio, i)
+		put_page(bv->bv_page);
+}
+
+static void _tis_bio_done(struct bio *bio)
+{
+	struct t2_io_state *tis = bio->bi_private;
+	struct md_dev_info *mdi = md_t2_dev(tis->md, 0);
+
+	t2_tis_dbg(tis, "done=%pS err=%d\n", tis->done, bio->bi_status);
+
+	if (unlikely(bio->bi_status)) {
+		zuf_dbg_err("%s: err=%d last-err=%d\n",
+			     _pr_rw(tis->rw_flags), bio->bi_status, tis->err);
+		if (_err_set_reported(mdi, 0 != (tis->rw_flags & WRITE)))
+			zuf_err("%s: err=%d\n",
+				 _pr_rw(tis->rw_flags), bio->bi_status);
+		/* Store the last one */
+		tis->err = _status_to_errno(bio->bi_status);
+	} else if (unlikely(mdi->t2i.err_write_reported ||
+			    mdi->t2i.err_read_reported)) {
+		if (tis->rw_flags & WRITE)
+			mdi->t2i.err_write_reported = false;
+		else
+			mdi->t2i.err_read_reported = false;
+	}
+
+	if (tis->done)
+		tis->done(tis, bio, false);
+	else
+		t2_io_done(tis, bio, false);
+
+	bio_put(bio);
+	tis_put(tis);
+}
+
+static bool _tis_delay(struct t2_io_state *tis)
+{
+	return 0 != (tis->rw_flags & TIS_DELAY_SUBMIT);
+}
+
+#define bio_list_for_each_safe(bio, btmp, bl)				\
+	for (bio = (bl)->head,	btmp = bio ? bio->bi_next : NULL;	\
+	     bio; bio = btmp,	btmp = bio ? bio->bi_next : NULL)
+
+static void _tis_submit_bio(struct t2_io_state *tis, bool flush, bool done)
+{
+	if (flush || done) {
+		if (_tis_delay(tis)) {
+			struct bio *btmp, *bio;
+
+			bio_list_for_each_safe(bio, btmp, &tis->delayed_bios) {
+				bio->bi_next = NULL;
+				if (bio->bi_iter.bi_sector == -1) {
+					t2_warn("!!!!!!!!!!!!!\n");
+					bio_put(bio);
+					continue;
+				}
+				t2_tis_dbg(tis, "submit bio[%d] max_v=%d\n",
+					    bio->bi_vcnt, tis->n_vects);
+				submit_bio(bio);
+			}
+			bio_list_init(&tis->delayed_bios);
+		}
+
+		if (!tis->cur_bio)
+			return;
+
+		if (tis->cur_bio->bi_iter.bi_sector != -1) {
+			t2_tis_dbg(tis, "submit bio[%d] max_v=%d\n",
+				    tis->cur_bio->bi_vcnt, tis->n_vects);
+			submit_bio(tis->cur_bio);
+			tis->cur_bio = NULL;
+			tis->index = ~0;
+		} else if (done) {
+			t2_tis_dbg(tis, "put cur_bio=%p\n", tis->cur_bio);
+			bio_put(tis->cur_bio);
+			WARN_ON(tis_put(tis));
+		}
+	} else if (tis->cur_bio && (tis->cur_bio->bi_iter.bi_sector != -1)) {
+		/* Not flushing regular progress */
+		if (_tis_delay(tis)) {
+			t2_tis_dbg(tis, "list_add cur_bio=%p\n", tis->cur_bio);
+			bio_list_add(&tis->delayed_bios, tis->cur_bio);
+		} else {
+			t2_tis_dbg(tis, "submit bio[%d] max_v=%d\n",
+				    tis->cur_bio->bi_vcnt, tis->n_vects);
+			submit_bio(tis->cur_bio);
+		}
+		tis->cur_bio = NULL;
+		tis->index = ~0;
+	}
+}
+
+/* tis->cur_bio MUST be NULL, checked by caller */
+static void _tis_alloc(struct t2_io_state *tis, struct md_dev_info *mdi,
+		       gfp_t gfp)
+{
+	struct bio *bio = bio_alloc(gfp, tis->n_vects);
+	int bio_op;
+
+	if (unlikely(!bio)) {
+		if (!_tis_delay(tis))
+			t2_warn("!!! failed to alloc bio");
+		tis->err = -ENOMEM;
+		return;
+	}
+
+	if (WARN_ON(!tis || !tis->md)) {
+		tis->err = -ENOMEM;
+		return;
+	}
+
+	/* FIXME: bio_set_op_attrs macro has a BUG which does not allow this
+	 * question inline.
+	 */
+	bio_op = (tis->rw_flags & WRITE) ? REQ_OP_WRITE : REQ_OP_READ;
+	bio_set_op_attrs(bio, bio_op, 0);
+
+	bio->bi_iter.bi_sector = -1;
+	bio->bi_end_io = _tis_bio_done;
+	bio->bi_private = tis;
+
+	if (mdi) {
+		bio_set_dev(bio, mdi->bdev);
+		tis->index = mdi->index;
+	} else {
+		tis->index = ~0;
+	}
+	tis->last_t2 = -1;
+	tis->cur_bio = bio;
+	tis_get(tis);
+	t2_tis_dbg(tis, "New bio n_vects=%d\n", tis->n_vects);
+}
+
+int t2_io_prealloc(struct t2_io_state *tis, uint n_vects)
+{
+	tis->err = 0; /* reset any -ENOMEM from a previous t2_io_add */
+
+	_tis_submit_bio(tis, true, false);
+	tis->n_vects = min(n_vects ? n_vects : 1, (uint)BIO_MAX_PAGES);
+
+	t2_tis_dbg(tis, "n_vects=%d cur_bio=%p\n", tis->n_vects, tis->cur_bio);
+
+	if (!tis->cur_bio)
+		_tis_alloc(tis, NULL, GFP_NOFS);
+	return tis->err;
+}
+
+int t2_io_add(struct t2_io_state *tis, ulong t2, struct page *page)
+{
+	struct md_dev_info *mdi;
+	ulong local_t2;
+	int ret;
+
+	if (t2 >= md_t2_blocks(tis->md)) {
+		zuf_err("bad t2 (0x%lx) offset\n", t2);
+		return -EFAULT;
+	}
+	get_page(page);
+
+	mdi = md_bn_t2_dev(tis->md, t2);
+	WARN_ON(!mdi);
+
+	local_t2 = md_t2_local_bn(tis->md, t2);
+	if (((local_t2 != (tis->last_t2 + 1)) && (tis->last_t2 != -1)) ||
+	   ((0 < tis->index) && (tis->index != mdi->index)))
+		_tis_submit_bio(tis, false, false);
+
+start:
+	if (!tis->cur_bio) {
+		_tis_alloc(tis, mdi, _tis_delay(tis) ? GFP_ATOMIC : GFP_NOFS);
+		if (unlikely(tis->err)) {
+			put_page(page);
+			return tis->err;
+		}
+	} else if (tis->index == ~0) {
+		/* the bio was allocated during t2_io_prealloc */
+		tis->index = mdi->index;
+		bio_set_dev(tis->cur_bio, mdi->bdev);
+	}
+
+	if (tis->last_t2 == -1)
+		tis->cur_bio->bi_iter.bi_sector = local_t2 * T2_SECTORS_PER_PAGE;
+
+	ret = bio_add_page(tis->cur_bio, page, PAGE_SIZE, 0);
+	if (unlikely(ret != PAGE_SIZE)) {
+		t2_tis_dbg(tis, "bio_add_page=>%d bi_vcnt=%d n_vects=%d\n",
+			   ret, tis->cur_bio->bi_vcnt, tis->n_vects);
+		_tis_submit_bio(tis, false, false);
+		goto start; /* device does not support tis->n_vects */
+	}
+
+	if ((tis->cur_bio->bi_vcnt == tis->n_vects) && (tis->n_vects != 1))
+		_tis_submit_bio(tis, false, false);
+
+	t2_tis_dbg(tis, "t2=0x%lx last_t2=0x%lx local_t2=0x%lx t1=0x%lx\n",
+		   t2, tis->last_t2, local_t2, md_page_to_bn(tis->md, page));
+
+	tis->last_t2 = local_t2;
+	return 0;
+}
+
+int t2_io_end(struct t2_io_state *tis, bool wait)
+{
+	if (unlikely(!tis || !tis->md))
+		return 0; /* never initialized nothing to do */
+
+	t2_tis_dbg_rw(tis, "wait=%d\n", wait);
+
+	_tis_submit_bio(tis, true, true);
+	blk_finish_plug(&tis->plug);
+
+	if (wait)
+		set_bit(B_TIS_FREE_AFTER_WAIT, &tis->rw_flags);
+	tis_put(tis);
+
+	if (wait) {
+		wait_var_event(&tis->refcount, !atomic_read(&tis->refcount));
+		if (tis->done)
+			tis->done(tis, NULL, true);
+	}
+
+	return tis->err;
+}
+
+/* ~~~~~~~ Sync read/write ~~~~~~~ TODO: Remove soon */
+static int _sync_io_page(struct multi_devices *md, int rw, ulong bn,
+			 struct page *page)
+{
+	struct t2_io_state tis;
+	int err;
+
+	t2_io_begin(md, rw, NULL, NULL, 1, &tis);
+
+	t2_tis_dbg((&tis), "bn=0x%lx p-i=0x%lx\n", bn, page->index);
+
+	err = t2_io_add(&tis, bn, page);
+	if (unlikely(err))
+		return err;
+
+	err = submit_bio_wait(tis.cur_bio);
+	if (unlikely(err)) {
+		SetPageError(page);
+		/*
+		 * We failed to write the page out to tier-2.
+		 * Print a dire warning that things will go BAD (tm)
+		 * very quickly.
+		 */
+		zuf_err("io-error bn=0x%lx => %d\n", bn, err);
+	}
+
+	/* Same as t2_io_end+_tis_bio_done but without the kref stuff */
+	blk_finish_plug(&tis.plug);
+	put_page(page);
+	if (likely(tis.cur_bio))
+		bio_put(tis.cur_bio);
+
+	return err;
+}
+
+int t2_writepage(struct multi_devices *md, ulong bn, struct page *page)
+{
+	return _sync_io_page(md, WRITE, bn, page);
+}
+
+int t2_readpage(struct multi_devices *md, ulong bn, struct page *page)
+{
+	return _sync_io_page(md, READ, bn, page);
+}
diff --git a/fs/zuf/t2.h b/fs/zuf/t2.h
new file mode 100644
index 000000000000..cbd23dd409eb
--- /dev/null
+++ b/fs/zuf/t2.h
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
+/*
+ * Tier-2 Header file.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#ifndef __T2_H__
+#define __T2_H__
+
+#include <linux/blkdev.h>
+#include <linux/fs.h>
+#include <linux/bio.h>
+#include <linux/kref.h>
+#include "md.h"
+
+#define T2_SECTORS_PER_PAGE	(PAGE_SIZE / 512)
+
+/* t2.c */
+
+/* Sync read/write */
+int t2_writepage(struct multi_devices *md, ulong bn, struct page *page);
+int t2_readpage(struct multi_devices *md, ulong bn, struct page *page);
+
+/* Async read/write */
+struct t2_io_state;
+typedef void (*t2_io_done_fn)(struct t2_io_state *tis, struct bio *bio,
+			      bool last);
+
+struct t2_io_state {
+	atomic_t refcount; /* counts in-flight bios */
+	struct blk_plug plug;
+
+	struct multi_devices	*md;
+	int		index;
+	t2_io_done_fn	done;
+	void		*priv;
+
+	uint		n_vects;
+	ulong		rw_flags;
+	ulong		last_t2;
+	struct bio	*cur_bio;
+	struct bio_list	delayed_bios;
+	int		err;
+};
+
+/* For rw_flags above */
+/* From Kernel: WRITE		(1U << 0) */
+#define TIS_DELAY_SUBMIT	(1U << 2)
+enum {B_TIS_FREE_AFTER_WAIT = 3};
+#define TIS_FREE_AFTER_WAIT	(1U << B_TIS_FREE_AFTER_WAIT)
+#define TIS_USER_DEF_FIRST	(1U << 8)
+
+void t2_io_begin(struct multi_devices *md, int rw, t2_io_done_fn done,
+		 void *priv, uint n_vects, struct t2_io_state *tis);
+int t2_io_prealloc(struct t2_io_state *tis, uint n_vects);
+int t2_io_add(struct t2_io_state *tis, ulong t2, struct page *page);
+int t2_io_end(struct t2_io_state *tis, bool wait);
+
+/* This is done by default if t2_io_done_fn above is NULL
+ * Can also be chain-called by users.
+ */
+void t2_io_done(struct t2_io_state *tis, struct bio *bio, bool last);
+
+#endif /*def __T2_H__*/
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index 95582c0a4ba5..d94c2e6d7578 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -339,6 +339,89 @@ static int _zu_numa_map(struct file *file, void *parg)
 	return err;
 }
 
+/* ~~~~ PMEM GRAB ~~~~ */
+static int zufr_find_pmem(struct zuf_root_info *zri,
+		   uint pmem_kern_id, struct zuf_pmem **pmem_md)
+{
+	struct zuf_pmem *z_pmem;
+
+	list_for_each_entry(z_pmem, &zri->pmem_list, list) {
+		if (z_pmem->pmem_id == pmem_kern_id) {
+			*pmem_md = z_pmem;
+			return 0;
+		}
+	}
+
+	return -ENODEV;
+}
+
+/*FIXME: At pmem the struct md_dev_list for t1(s) is not properly set
+ * For now we do not fix it and re-write the mdt. So just fix the one
+ * we are about to send to Server
+ */
+void _fix_numa_ids(struct multi_devices *md, struct md_dev_list *mdl)
+{
+	int i;
+
+	for (i = 0; i < md->t1_count; ++i)
+		if (md->devs[i].nid != __dev_id_nid(&mdl->dev_ids[i]))
+			__dev_id_nid_set(&mdl->dev_ids[i], md->devs[i].nid);
+}
+
+static int _zu_grab_pmem(struct file *file, void *parg)
+{
+	struct zuf_root_info *zri = ZRI(file->f_inode->i_sb);
+	struct zufs_ioc_pmem __user *arg_pmem = parg;
+	struct zufs_ioc_pmem *zi_pmem = kzalloc(sizeof(*zi_pmem), GFP_KERNEL);
+	struct zuf_pmem *pmem_md;
+	size_t pmem_size;
+	int err;
+
+	if (unlikely(!zi_pmem))
+		return -ENOMEM;
+
+	err = get_user(zi_pmem->pmem_kern_id, &arg_pmem->pmem_kern_id);
+	if (err) {
+		zuf_err("\n");
+		goto out;
+	}
+
+	err = zufr_find_pmem(zri, zi_pmem->pmem_kern_id, &pmem_md);
+	if (err) {
+		zuf_err("!!! pmem_kern_id=%d not found\n",
+			zi_pmem->pmem_kern_id);
+		goto out;
+	}
+
+	if (pmem_md->hdr.file) {
+		zuf_err("[%u] pmem already taken\n", zi_pmem->pmem_kern_id);
+		err = -EIO;
+		goto out;
+	}
+
+	memcpy(&zi_pmem->mdt, md_zdt(&pmem_md->md), sizeof(zi_pmem->mdt));
+	_fix_numa_ids(&pmem_md->md, &zi_pmem->mdt.s_dev_list);
+
+	pmem_size = md_p2o(md_t1_blocks(&pmem_md->md));
+	if (mdt_test_option(md_zdt(&pmem_md->md), MDT_F_SHADOW))
+		pmem_size += pmem_size;
+	i_size_write(file->f_inode, pmem_size);
+	pmem_md->hdr.type = zlfs_e_pmem;
+	pmem_md->hdr.file = file;
+	file->private_data = &pmem_md->hdr;
+	zuf_dbg_core("pmem %d i_size=0x%llx GRABBED %s\n",
+		     zi_pmem->pmem_kern_id, i_size_read(file->f_inode),
+		     _bdev_name(md_t1_dev(&pmem_md->md, 0)->bdev));
+
+out:
+	zi_pmem->hdr.err = err;
+	err = copy_to_user(parg, zi_pmem, sizeof(*zi_pmem)) ? -EFAULT : 0;
+	if (err)
+		zuf_err("=>%d\n", err);
+	kfree(zi_pmem);
+	return err;
+}
+
 static int _map_pages(struct zufc_thread *zt, struct page **pages, uint nump,
 		      bool map_readonly)
 {
@@ -822,6 +905,8 @@ long zufc_ioctl(struct file *file, unsigned int cmd, ulong arg)
 		return _zu_mount(file, parg);
 	case ZU_IOC_NUMA_MAP:
 		return _zu_numa_map(file, parg);
+	case ZU_IOC_GRAB_PMEM:
+		return _zu_grab_pmem(file, parg);
 	case ZU_IOC_INIT_THREAD:
 		return _zu_init(file, parg);
 	case ZU_IOC_WAIT_OPT:
@@ -1065,6 +1150,8 @@ int zufc_mmap(struct file *file, struct vm_area_struct *vma)
 	switch (zsf->type) {
 	case zlfs_e_zt:
 		return zufc_zt_mmap(file, vma);
+	case zlfs_e_pmem:
+		return zuf_pmem_mmap(file, vma);
 	case zlfs_e_dpp_buff:
 		return zufc_ebuff_mmap(file, vma);
 	default:
diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
index a33f5908155d..0689f2031ec7 100644
--- a/fs/zuf/zuf.h
+++ b/fs/zuf/zuf.h
@@ -29,6 +29,8 @@
 
 #include "relay.h"
 #include "_pr.h"
+#include "md.h"
+#include "t2.h"
 
 enum zlfs_e_special_file {
 	zlfs_e_zt = 1,
@@ -42,6 +44,14 @@ struct zuf_special_file {
 	struct file *file;
 };
 
+/* Our special md structure */
+struct zuf_pmem {
+	struct multi_devices md; /* must be first */
+	struct list_head list;
+	struct zuf_special_file hdr;
+	uint pmem_id;
+};
+
 /* This is the zuf-root.c mini filesystem */
 struct zuf_root_info {
 	struct __mount_thread_info {
@@ -94,6 +104,35 @@ static inline void zuf_add_fs_type(struct zuf_root_info *zri,
 	list_add(&zft->list, &zri->fst_list);
 }
 
+static inline void zuf_add_pmem(struct zuf_root_info *zri,
+				   struct multi_devices *md)
+{
+	struct zuf_pmem *z_pmem = (void *)md;
+
+	z_pmem->pmem_id = ++zri->next_pmem_id; /* Avoid 0 id */
+
+	/* Unlocked for now only one mount-thread with zus */
+	INIT_LIST_HEAD(&z_pmem->list);
+	list_add(&z_pmem->list, &zri->pmem_list);
+}
+
+static inline void zuf_rm_pmem(struct multi_devices *md)
+{
+	struct zuf_pmem *z_pmem = (void *)md;
+
+	if (z_pmem->pmem_id) /* We avoided 0 id */
+		list_del_init(&z_pmem->list);
+}
+
+static inline uint zuf_pmem_id(struct multi_devices *md)
+{
+	struct zuf_pmem *z_pmem = container_of(md, struct zuf_pmem, md);
+
+	return z_pmem->pmem_id;
+}
+
+// void zuf_del_fs_type(struct zuf_root_info *zri, struct zuf_fs_type *zft);
+
 /*
  * ZUF per-inode data in memory
  */
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 3319a70b5ccc..e0c439b6c8e9 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -19,6 +19,8 @@
 #include <stddef.h>
 #include <linux/statfs.h>
 
+#include "md_def.h"
+
 /*
  * Version rules:
  *   This is the zus-to-zuf API version. And not the Filesystem
@@ -74,6 +76,10 @@
  */
 #define EZUF_RETRY_DONE 540
 
+
+/* All device sizes and offsets must align on 2M */
+#define ZUFS_ALLOC_MASK		(1024 * 1024 * 2 - 1)
+
 /**
  * zufs dual port memory
  * This is a special type of offset to either memory or persistent-memory,
@@ -221,6 +227,18 @@ struct zufs_ioc_numa_map {
 };
 #define ZU_IOC_NUMA_MAP	_IOWR('Z', 12, struct zufs_ioc_numa_map)
 
+struct zufs_ioc_pmem {
+	/* Set by zus */
+	struct zufs_ioc_hdr hdr;
+	__u32 pmem_kern_id;
+
+	/* Returned to zus */
+	struct md_dev_table mdt;
+
+};
+/* A GRAB is never un-grabbed; umount or file close cleans it all up */
+#define ZU_IOC_GRAB_PMEM	_IOWR('Z', 13, struct zufs_ioc_pmem)
+
 /* ZT init */
 enum { ZUFS_MAX_ZT_CHANNELS = 64 };
 
-- 
2.20.1


* [RFC PATCH 06/17] zuf: mounting
  2019-02-19 11:51 [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Boaz harrosh
                   ` (4 preceding siblings ...)
  2019-02-19 11:51 ` [RFC PATCH 05/17] zuf: Multy Devices Boaz harrosh
@ 2019-02-19 11:51 ` Boaz harrosh
  2019-02-19 11:51 ` [RFC PATCH 07/17] zuf: Namei and directory operations Boaz harrosh
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 31+ messages in thread
From: Boaz harrosh @ 2019-02-19 11:51 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro
  Cc: Ric Wheeler, Miklos Szeredi, Steven Whitehouse, Jefff moyer,
	Amir Goldstein, Amit Golander, Sagi Manole

From: Boaz Harrosh <boazh@netapp.com>

This patch already establishes a mounted filesystem.

These are the steps for mounting a zufs Filesystem:

* All devices (Single or DT) are opened and established in an md
  object. This md-object is given a pmem-id.

* mount_bdev is called with the main (first) device, which in turn
  calls fill_super.

* fill_super dispatches a mount_operation(register_fs_info) to the
  Server with the pmem-id of the md-object above.

* The Server, in the zus mount routine, first does a
  GRAB_PMEM(pmem-id) ioctl call to establish a special filehandle
  through which it has full access to all of its pmem space.
  With that it calls the zusFS to continue to inspect the content
  of the pmem and mount the FS.

* On return from mount the zusFS returns the root inode info.

* fill_super continues to create a root vfs-inode and returns
  successfully.

* We now have a mounted super_block, with corresponding super_block
  objects in the Server.
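
The GRAB_PMEM step above can be modelled in plain user-space C. This is
only a sketch of the bookkeeping (look up the md-object by pmem-id and
refuse a second grab). struct pmem_slot, grab_pmem and the flat table
are hypothetical stand-ins, not the zuf structures; only the error
codes are taken from _zu_grab_pmem() and zufr_find_pmem().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one md-object registered at mount time. */
struct pmem_slot {
	unsigned int pmem_id;	/* 0 means "unused"; real ids start at 1 */
	int grabbed;		/* set once the Server owns the mapping */
};

/*
 * Find the slot for pmem_id and mark it grabbed.
 * Returns 0 on success, -ENODEV if the id is unknown and -EIO if the
 * pmem was already taken.
 */
static int grab_pmem(struct pmem_slot *slots, size_t n, unsigned int pmem_id)
{
	size_t i;

	for (i = 0; i < n; ++i) {
		if (!slots[i].pmem_id || slots[i].pmem_id != pmem_id)
			continue;
		if (slots[i].grabbed)
			return -EIO;
		slots[i].grabbed = 1;
		return 0;
	}
	return -ENODEV;
}
```

In the kernel code the "grabbed" state is the hdr.file pointer of the
zuf_pmem, which is why a grab is only released by umount or file close.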

Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile  |   2 +-
 fs/zuf/_extern.h |   4 +
 fs/zuf/inode.c   |  22 ++
 fs/zuf/super.c   | 623 ++++++++++++++++++++++++++++++++++++++++++++++-
 fs/zuf/zuf.h     | 122 ++++++++++
 fs/zuf/zus_api.h |  57 +++++
 6 files changed, 828 insertions(+), 2 deletions(-)
 create mode 100644 fs/zuf/inode.c

diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index 091bf053a6ed..eaeffc65078f 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -17,5 +17,5 @@ zuf-y += md.o t1.o t2.o
 zuf-y += zuf-core.o zuf-root.o
 
 # Main FS
-zuf-y += super.o
+zuf-y += super.o inode.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index 15d632ea5ed2..dc6b41b6410b 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -20,6 +20,10 @@
  * extern functions declarations
  */
 
+/* inode.c */
+struct inode *zuf_iget(struct super_block *sb, struct zus_inode_info *zus_ii,
+		       zu_dpp_t _zi, bool *exist);
+
 /* super.c */
 int zuf_init_inodecache(void);
 void zuf_destroy_inodecache(void);
diff --git a/fs/zuf/inode.c b/fs/zuf/inode.c
new file mode 100644
index 000000000000..315a273e6f6d
--- /dev/null
+++ b/fs/zuf/inode.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Inode methods (allocate/free/read/write).
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#include "zuf.h"
+
+struct inode *zuf_iget(struct super_block *sb, struct zus_inode_info *zus_ii,
+		       zu_dpp_t _zi, bool *exist)
+{
+	return ERR_PTR(-ENOTSUPP);
+}
diff --git a/fs/zuf/super.c b/fs/zuf/super.c
index f7f7798425a9..7f819be7056e 100644
--- a/fs/zuf/super.c
+++ b/fs/zuf/super.c
@@ -18,8 +18,575 @@
 
 #include "zuf.h"
 
+static struct super_operations zuf_sops;
 static struct kmem_cache *zuf_inode_cachep;
 
+enum {
+	Opt_uid,
+	Opt_gid,
+	Opt_pedantic,
+	Opt_ephemeral,
+	Opt_dax,
+	Opt_err
+};
+
+static const match_table_t tokens = {
+	{ Opt_pedantic,		"pedantic"		},
+	{ Opt_pedantic,		"pedantic=%d"		},
+	{ Opt_ephemeral,	"ephemeral"		},
+	{ Opt_dax,		"dax"			},
+	{ Opt_err,		NULL			},
+};
+
+static int _parse_options(struct zuf_sb_info *sbi, const char *data,
+			  bool remount, struct zufs_parse_options *po)
+{
+	char *orig_options, *options;
+	char *p;
+	substring_t args[MAX_OPT_ARGS];
+	int err = 0;
+	bool ephemeral = false;
+	size_t mount_options_len = 0;
+
+	/* no options given */
+	if (!data)
+		return 0;
+
+	options = orig_options = kstrdup(data, GFP_KERNEL);
+	if (!options)
+		return -ENOMEM;
+
+	while ((p = strsep(&options, ",")) != NULL) {
+		int token;
+
+		if (!*p)
+			continue;
+
+		/* Initialize args struct so we know whether arg was found */
+		args[0].to = args[0].from = NULL;
+		token = match_token(p, tokens, args);
+		switch (token) {
+		case Opt_pedantic:
+			if (!args[0].from) {
+				po->mount_flags |= ZUFS_M_PEDANTIC;
+				set_opt(sbi, PEDANTIC);
+				continue;
+			}
+			if (match_int(&args[0], &po->pedantic))
+				goto bad_opt;
+			break;
+		case Opt_ephemeral:
+			po->mount_flags |= ZUFS_M_EPHEMERAL;
+			set_opt(sbi, EPHEMERAL);
+			ephemeral = true;
+			break;
+		case Opt_dax:
+			set_opt(sbi, DAX);
+			break;
+		default: {
+			if (mount_options_len != 0) {
+				po->mount_options[mount_options_len] = ',';
+				mount_options_len++;
+			}
+			strcat(po->mount_options, p);
+			mount_options_len += strlen(p);
+		}
+		}
+	}
+
+	if (remount && test_opt(sbi, EPHEMERAL) && (ephemeral == false))
+		clear_opt(sbi, EPHEMERAL);
+out:
+	kfree(orig_options);
+	return err;
+
+bad_opt:
+	zuf_warn_cnd(test_opt(sbi, SILENT), "Bad mount option: \"%s\"\n", p);
+	err = -EINVAL;
+	goto out;
+}
+
+static int _print_tier_info(struct multi_devices *md, char **buff, int start,
+			    int count, int *_space, char *str)
+{
+	int space = *_space;
+	char *b = *buff;
+	int printed;
+	int i;
+
+	printed = snprintf(b, space, "%s", str);
+	if (unlikely(printed >= space))
+		return -ENOSPC;
+
+	b += printed;
+	space -= printed;
+
+	for (i = start; i < start + count; ++i) {
+		printed = snprintf(b, space, "%s%s", i == start ? "" : ",",
+				   _bdev_name(md_dev_info(md, i)->bdev));
+
+		if (unlikely(printed >= space))
+			return -ENOSPC;
+
+		b += printed;
+		space -= printed;
+	}
+	*_space = space;
+	*buff = b;
+
+	return 0;
+}
+
+static void _print_mount_info(struct zuf_sb_info *sbi, char *mount_options)
+{
+	struct multi_devices *md = sbi->md;
+	char buff[992];
+	int space = sizeof(buff);
+	char *b = buff;
+	int err;
+
+	err = _print_tier_info(md, &b, 0, md->t1_count, &space, "t1=");
+	if (unlikely(err))
+		goto no_space;
+
+	if (md->t2_count == 0)
+		goto print_options;
+
+	err = _print_tier_info(md, &b, md->t1_count, md->t2_count, &space,
+			       " t2=");
+	if (unlikely(err))
+		goto no_space;
+
+print_options:
+	if (mount_options) {
+		int printed = snprintf(b, space, " -o %s", mount_options);
+
+		if (unlikely(printed >= space))
+			goto no_space;
+	}
+
+print:
+	zuf_info("mounted %s (0x%lx/0x%lx)\n", buff,
+		 md_t1_blocks(sbi->md), md_t2_blocks(sbi->md));
+	return;
+
+no_space:
+	snprintf(buff + sizeof(buff) - 4, 4, "...");
+	goto print;
+}
+
+static void _sb_mwtime_now(struct super_block *sb, struct md_dev_table *zdt)
+{
+	struct timespec64 now = current_time(sb->s_root->d_inode);
+
+	timespec_to_mt(&zdt->s_mtime, &now);
+	zdt->s_wtime = zdt->s_mtime;
+	/* TODO _persist_md(sb, &zdt->s_mtime, 2*sizeof(zdt->s_mtime)); */
+}
+
+static int _setup_bdi(struct super_block *sb, const char *device_name)
+{
+	int err;
+
+	if (sb->s_bdi && (sb->s_bdi != &noop_backing_dev_info)) {
+		/*
+		 * sb->s_bdi points to blkdev's bdi however we want to redirect
+		 * it to our private bdi...
+		 */
+		bdi_put(sb->s_bdi);
+	}
+	sb->s_bdi = &noop_backing_dev_info;
+
+	err = super_setup_bdi_name(sb, "zuf-%s", device_name);
+	if (unlikely(err)) {
+		zuf_err("Failed to super_setup_bdi\n");
+		return err;
+	}
+
+	sb->s_bdi->ra_pages = ZUFS_READAHEAD_PAGES;
+	sb->s_bdi->capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK;
+	return 0;
+}
+
+static int _sb_add(struct zuf_root_info *zri, struct super_block *sb,
+		   __u64 *sb_id)
+{
+	uint i;
+	int err;
+
+	mutex_lock(&zri->sbl_lock);
+
+	if (zri->sbl.num == zri->sbl.max) {
+		struct super_block **new_array;
+
+		new_array = krealloc(zri->sbl.array,
+				  (zri->sbl.max + SBL_INC) * sizeof(*new_array),
+				  GFP_KERNEL | __GFP_ZERO);
+		if (unlikely(!new_array)) {
+			err = -ENOMEM;
+			goto out;
+		}
+		zri->sbl.max += SBL_INC;
+		zri->sbl.array = new_array;
+	}
+
+	for (i = 0; i < zri->sbl.max; ++i)
+		if (!zri->sbl.array[i])
+			break;
+
+	if (unlikely(i == zri->sbl.max)) {
+		zuf_err("!!!!! can't be! i=%d g_sbl.num=%d g_sbl.max=%d\n",
+			i, zri->sbl.num, zri->sbl.max);
+		err = -EFAULT;
+		goto out;
+	}
+
+	++zri->sbl.num;
+	zri->sbl.array[i] = sb;
+	*sb_id = i + 1;
+	err = 0;
+
+	zuf_dbg_vfs("sb_id=%lld\n", *sb_id);
+out:
+	mutex_unlock(&zri->sbl_lock);
+	return err;
+}
+
+static void _sb_remove(struct zuf_root_info *zri, struct super_block *sb)
+{
+	uint i;
+
+	mutex_lock(&zri->sbl_lock);
+
+	for (i = 0; i < zri->sbl.max; ++i)
+		if (zri->sbl.array[i] == sb)
+			break;
+	if (unlikely(i == zri->sbl.max)) {
+		zuf_err("!!!!! can't be! i=%d g_sbl.num=%d g_sbl.max=%d\n",
+			i, zri->sbl.num, zri->sbl.max);
+		goto out;
+	}
+
+	zri->sbl.array[i] = NULL;
+	--zri->sbl.num;
+out:
+	mutex_unlock(&zri->sbl_lock);
+}
+
+struct super_block *zuf_sb_from_id(struct zuf_root_info *zri, __u64 sb_id,
+				   struct zus_sb_info *zus_sbi)
+{
+	struct super_block *sb;
+
+	--sb_id;
+
+	if (zri->sbl.max <= sb_id) {
+		zuf_err("Invalid SB_ID 0x%llx\n", sb_id);
+		return NULL;
+	}
+
+	sb = zri->sbl.array[sb_id];
+	if (!sb) {
+		zuf_err("Stale SB_ID 0x%llx\n", sb_id);
+		return NULL;
+	}
+
+	return sb;
+}
+
+static void zuf_put_super(struct super_block *sb)
+{
+	struct zuf_sb_info *sbi = SBI(sb);
+
+	if (sbi->zus_sbi) {
+		struct zufs_ioc_mount zim = {
+			.zmi.zus_sbi = sbi->zus_sbi,
+		};
+
+		zufc_dispatch_mount(ZUF_ROOT(sbi), NULL, ZUFS_M_UMOUNT, &zim);
+		sbi->zus_sbi = NULL;
+	}
+
+	/* NOTE!!! this is a HACK! We should not touch the s_umount
+	 * lock, but to keep lockdep happy we drop it here, since our
+	 * devices are held exclusively. Needs to be revisited on every
+	 * kernel version change.
+	 */
+	if (sbi->md) {
+		up_write(&sb->s_umount);
+		zuf_rm_pmem(sbi->md);
+		md_fini(sbi->md, sb->s_bdev);
+		down_write(&sb->s_umount);
+	}
+
+	_sb_remove(ZUF_ROOT(sbi), sb);
+	sb->s_fs_info = NULL;
+	if (!test_opt(sbi, FAILED))
+		zuf_info("unmounted /dev/%s\n", _bdev_name(sb->s_bdev));
+	kfree(sbi);
+}
+
+struct __fill_super_params {
+	struct multi_devices *md;
+	char *mount_options;
+};
+
+static int zuf_fill_super(struct super_block *sb, void *data, int silent)
+{
+	struct zuf_sb_info *sbi = NULL;
+	struct __fill_super_params *fsp = data;
+	struct zufs_ioc_mount zim = {};
+	struct zufs_ioc_mount *ioc_mount;
+	enum big_alloc_type bat;
+	struct register_fs_info *rfi;
+	struct inode *root_i;
+	size_t zim_size, mount_options_len;
+	bool exist;
+	int err;
+
+	BUILD_BUG_ON(sizeof(struct md_dev_table) > MDT_SIZE);
+	BUILD_BUG_ON(sizeof(struct zus_inode) != ZUFS_INODE_SIZE);
+
+	mount_options_len = (fsp->mount_options ?
+					strlen(fsp->mount_options) : 0) + 1;
+	zim_size = sizeof(zim) + mount_options_len;
+	ioc_mount = big_alloc(zim_size, sizeof(zim), &zim,
+			      GFP_KERNEL | __GFP_ZERO, &bat);
+	if (unlikely(!ioc_mount))
+		return -ENOMEM;
+
+	ioc_mount->zmi.po.mount_options_len = mount_options_len;
+
+	err = _sb_add(zuf_fst(sb)->zri, sb, &ioc_mount->zmi.sb_id);
+	if (unlikely(err))
+		goto error;
+
+	sbi = kzalloc(sizeof(struct zuf_sb_info), GFP_KERNEL);
+	if (!sbi) {
+		_sb_remove(zuf_fst(sb)->zri, sb); /* undo _sb_add */
+		err = -ENOMEM;
+		goto error;
+	}
+	sb->s_fs_info = sbi;
+	sbi->sb = sb;
+
+	/* Initialize embedded objects */
+	spin_lock_init(&sbi->s_mmap_dirty_lock);
+	INIT_LIST_HEAD(&sbi->s_mmap_dirty);
+	if (silent) {
+		ioc_mount->zmi.po.mount_flags |= ZUFS_M_SILENT;
+		set_opt(sbi, SILENT);
+	}
+
+	sbi->md = fsp->md;
+	err = md_set_sb(sbi->md, sb->s_bdev, sb, silent);
+	if (unlikely(err))
+		goto error;
+	zuf_add_pmem(zuf_fst(sb)->zri, sbi->md);
+
+	err = _parse_options(sbi, fsp->mount_options, 0, &ioc_mount->zmi.po);
+	if (err)
+		goto error;
+
+	err = _setup_bdi(sb, _bdev_name(sb->s_bdev));
+	if (err) {
+		zuf_err_cnd(silent, "Failed to setup bdi => %d\n", err);
+		goto error;
+	}
+
+	/* Tell ZUS to mount an FS for us */
+	ioc_mount->zmi.pmem_kern_id = zuf_pmem_id(sbi->md);
+	err = zufc_dispatch_mount(ZUF_ROOT(sbi), zuf_fst(sb)->zus_zfi,
+				  ZUFS_M_MOUNT, ioc_mount);
+	if (unlikely(err))
+		goto error;
+	sbi->zus_sbi = ioc_mount->zmi.zus_sbi;
+
+	/* Init with default values */
+	sb->s_blocksize_bits = ioc_mount->zmi.s_blocksize_bits;
+	sb->s_blocksize = 1 << ioc_mount->zmi.s_blocksize_bits;
+
+	rfi = &zuf_fst(sb)->rfi;
+
+	sb->s_magic = rfi->FS_magic;
+	sb->s_time_gran = rfi->s_time_gran;
+	sb->s_maxbytes = rfi->s_maxbytes;
+	sb->s_flags |= SB_NOSEC | (ioc_mount->zmi.acl_on ? SB_POSIXACL : 0);
+
+	sb->s_op = &zuf_sops;
+
+	root_i = zuf_iget(sb, ioc_mount->zmi.zus_ii, ioc_mount->zmi._zi,
+			  &exist);
+	if (IS_ERR(root_i)) {
+		err = PTR_ERR(root_i);
+		goto error;
+	}
+	WARN_ON(exist);
+
+	sb->s_root = d_make_root(root_i);
+	if (!sb->s_root) {
+		zuf_err_cnd(silent, "d_make_root failed for root inode\n");
+		/* d_make_root() has already dropped root_i on failure */
+		err = -ENOMEM;
+		goto error;
+	}
+
+	if (!zuf_rdonly(sb))
+		_sb_mwtime_now(sb, md_zdt(sbi->md));
+
+	mt_to_timespec(&root_i->i_ctime, &zus_zi(root_i)->i_ctime);
+	mt_to_timespec(&root_i->i_mtime, &zus_zi(root_i)->i_mtime);
+
+	_print_mount_info(sbi, fsp->mount_options);
+	clear_opt(sbi, SILENT);
+	big_free(ioc_mount, bat);
+	return 0;
+
+error:
+	zuf_warn("NOT mounting => %d\n", err);
+	if (sbi) {
+		set_opt(sbi, FAILED);
+		zuf_put_super(sb);
+	}
+	big_free(ioc_mount, bat);
+	return err;
+}
+
+static void _zst_to_kst(const struct statfs64 *zst, struct kstatfs *kst)
+{
+	kst->f_type	= zst->f_type;
+	kst->f_bsize	= zst->f_bsize;
+	kst->f_blocks	= zst->f_blocks;
+	kst->f_bfree	= zst->f_bfree;
+	kst->f_bavail	= zst->f_bavail;
+	kst->f_files	= zst->f_files;
+	kst->f_ffree	= zst->f_ffree;
+	kst->f_fsid	= zst->f_fsid;
+	kst->f_namelen	= zst->f_namelen;
+	kst->f_frsize	= zst->f_frsize;
+	kst->f_flags	= zst->f_flags;
+}
+
+static int zuf_statfs(struct dentry *d, struct kstatfs *buf)
+{
+	struct zuf_sb_info *sbi = SBI(d->d_sb);
+	struct zufs_ioc_statfs ioc_statfs = {
+		.hdr.in_len = offsetof(struct zufs_ioc_statfs, statfs_out),
+		.hdr.out_len = sizeof(ioc_statfs),
+		.hdr.operation = ZUFS_OP_STATFS,
+		.zus_sbi = sbi->zus_sbi,
+	};
+	int err;
+
+	err = zufc_dispatch(ZUF_ROOT(sbi), &ioc_statfs.hdr, NULL, 0);
+	if (unlikely(err && err != -EINTR))
+		zuf_err("zufc_dispatch failed op=ZUFS_OP_STATFS => %d\n", err);
+	if (unlikely(err))
+		return err;
+
+	_zst_to_kst(&ioc_statfs.statfs_out, buf);
+	return 0;
+}
+
+static int zuf_show_options(struct seq_file *seq, struct dentry *root)
+{
+	struct zuf_sb_info *sbi = SBI(root->d_sb);
+
+	if (test_opt(sbi, EPHEMERAL))
+		seq_puts(seq, ",ephemeral");
+	if (test_opt(sbi, DAX))
+		seq_puts(seq, ",dax");
+
+	return 0;
+}
+
+static int zuf_show_devname(struct seq_file *seq, struct dentry *root)
+{
+	seq_printf(seq, "/dev/%s", _bdev_name(root->d_sb->s_bdev));
+
+	return 0;
+}
+
+static int zuf_remount(struct super_block *sb, int *mntflags, char *data)
+{
+	struct zuf_sb_info *sbi = SBI(sb);
+	struct zufs_ioc_mount zim = {};
+	struct zufs_ioc_mount *ioc_mount;
+	size_t remount_options_len, zim_size;
+	enum big_alloc_type bat;
+	int err;
+
+	zuf_info("remount... -o %s\n", data);
+
+	remount_options_len = data ? (strlen(data) + 1) : 0;
+	zim_size = sizeof(zim) + remount_options_len;
+	ioc_mount = big_alloc(zim_size, sizeof(zim), &zim,
+			      GFP_KERNEL | __GFP_ZERO, &bat);
+	if (unlikely(!ioc_mount))
+		return -ENOMEM;
+
+	ioc_mount->zmi.zus_sbi = sbi->zus_sbi;
+	ioc_mount->zmi.remount_flags = zuf_rdonly(sb) ? ZUFS_REM_WAS_RO : 0;
+	ioc_mount->zmi.po.mount_options_len = remount_options_len;
+
+	/* Store the old options */
+	ioc_mount->zmi.old_mount_opt = sbi->s_mount_opt;
+
+	err = _parse_options(sbi, data, 1, &ioc_mount->zmi.po);
+	if (unlikely(err))
+		goto fail;
+
+	if (*mntflags & MS_RDONLY) {
+		ioc_mount->zmi.remount_flags |= ZUFS_REM_WILL_RO;
+
+		if (!zuf_rdonly(sb))
+			_sb_mwtime_now(sb, md_zdt(sbi->md));
+	} else if (zuf_rdonly(sb)) {
+		_sb_mwtime_now(sb, md_zdt(sbi->md));
+	}
+
+	err = zufc_dispatch_mount(ZUF_ROOT(sbi), zuf_fst(sb)->zus_zfi,
+				  ZUFS_M_REMOUNT, ioc_mount);
+	if (unlikely(err))
+		goto fail;
+
+	big_free(ioc_mount, bat);
+	return 0;
+
+fail:
+	sbi->s_mount_opt = ioc_mount->zmi.old_mount_opt;
+	big_free(ioc_mount, bat);
+	zuf_dbg_err("remount failed, old mount options restored\n");
+	return err;
+}
+
+static int zuf_update_s_wtime(struct super_block *sb)
+{
+	if (!(sb->s_flags & MS_RDONLY)) {
+		struct timespec64 now = current_time(sb->s_root->d_inode);
+
+		timespec_to_mt(&md_zdt(SBI(sb)->md)->s_wtime, &now);
+	}
+	return 0;
+}
+
+static struct inode *zuf_alloc_inode(struct super_block *sb)
+{
+	struct zuf_inode_info *zii;
+
+	zii = kmem_cache_alloc(zuf_inode_cachep, GFP_NOFS);
+	if (!zii)
+		return NULL;
+
+	zii->vfs_inode.i_version.counter = 1;
+	return &zii->vfs_inode;
+}
+
+static void zuf_destroy_inode(struct inode *inode)
+{
+	kmem_cache_free(zuf_inode_cachep, ZUII(inode));
+}
+
 static void _init_once(void *foo)
 {
 	struct zuf_inode_info *zii = foo;
@@ -46,8 +613,62 @@ void zuf_destroy_inodecache(void)
 	kmem_cache_destroy(zuf_inode_cachep);
 }
 
+static struct super_operations zuf_sops = {
+	.alloc_inode	= zuf_alloc_inode,
+	.destroy_inode	= zuf_destroy_inode,
+	.put_super	= zuf_put_super,
+	.freeze_fs	= zuf_update_s_wtime,
+	.unfreeze_fs	= zuf_update_s_wtime,
+	.statfs		= zuf_statfs,
+	.remount_fs	= zuf_remount,
+	.show_options	= zuf_show_options,
+	.show_devname	= zuf_show_devname,
+};
+
 struct dentry *zuf_mount(struct file_system_type *fs_type, int flags,
 			 const char *dev_name, void *data)
 {
-	return ERR_PTR(-ENOTSUPP);
+	int silent = flags & MS_SILENT ? 1 : 0;
+	struct __fill_super_params fsp = {
+		.mount_options = data,
+	};
+	struct zuf_fs_type *fst = ZUF_FST(fs_type);
+	struct register_fs_info *rfi = &fst->rfi;
+	struct mdt_check mc = {
+		.alloc_mask	= ZUFS_ALLOC_MASK,
+		.major_ver	= rfi->FS_ver_major,
+		.minor_ver	= rfi->FS_ver_minor,
+		.magic		= rfi->FS_magic,
+
+		.holder = fs_type,
+		.silent = silent,
+	};
+	struct dentry *ret = NULL;
+	const char *dev_path = NULL;
+	int err;
+
+	zuf_dbg_vfs("dev_name=%s, data=%s\n", dev_name, (const char *)data);
+
+	fsp.md = md_alloc(sizeof(struct zuf_pmem));
+	if (IS_ERR(fsp.md)) {
+		err = PTR_ERR(fsp.md);
+		fsp.md = NULL;
+		goto out;
+	}
+
+	err = md_init(fsp.md, dev_name, &mc, &dev_path);
+	if (unlikely(err)) {
+		zuf_err_cnd(silent, "md_init failed! => %d\n", err);
+		goto out;
+	}
+
+	zuf_dbg_vfs("mounting with dev_path=%s\n", dev_path);
+	ret = mount_bdev(fs_type, flags, dev_path, &fsp, zuf_fill_super);
+
+out:
+	if (unlikely(err) && fsp.md)
+		md_fini(fsp.md, NULL);
+
+	kfree(dev_path);
+	return err ? ERR_PTR(err) : ret;
 }
diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
index 0689f2031ec7..e23907f5e94e 100644
--- a/fs/zuf/zuf.h
+++ b/fs/zuf/zuf.h
@@ -133,11 +133,32 @@ static inline uint zuf_pmem_id(struct multi_devices *md)
 
 // void zuf_del_fs_type(struct zuf_root_info *zri, struct zuf_fs_type *zft);
 
+/*
+ * Private Super-block flags
+ */
+enum {
+	ZUF_MOUNT_PEDANTIC	= 0x000001,	/* Check for memory leaks */
+	ZUF_MOUNT_PEDANTIC_SHADOW = 0x00002,	/* */
+	ZUF_MOUNT_SILENT	= 0x000004,	/* verbosity is silent */
+	ZUF_MOUNT_EPHEMERAL	= 0x000008,	/* Don't persist the data */
+	ZUF_MOUNT_FAILED	= 0x000010,	/* mark a failed-mount */
+	ZUF_MOUNT_DAX		= 0x000020,	/* mounted with dax option */
+	ZUF_MOUNT_POSIXACL	= 0x000040,	/* mounted with posix acls */
+};
+
+#define clear_opt(sbi, opt)	((sbi)->s_mount_opt &= ~ZUF_MOUNT_ ## opt)
+#define set_opt(sbi, opt)	((sbi)->s_mount_opt |= ZUF_MOUNT_ ## opt)
+#define test_opt(sbi, opt)	((sbi)->s_mount_opt & ZUF_MOUNT_ ## opt)
+
 /*
  * ZUF per-inode data in memory
  */
 struct zuf_inode_info {
 	struct inode		vfs_inode;
+
+	/* cookies from Server */
+	struct zus_inode	*zi;
+	struct zus_inode_info	*zus_ii;
 };
 
 static inline struct zuf_inode_info *ZUII(struct inode *inode)
@@ -145,6 +166,28 @@ static inline struct zuf_inode_info *ZUII(struct inode *inode)
 	return container_of(inode, struct zuf_inode_info, vfs_inode);
 }
 
+/*
+ * ZUF super-block data in memory
+ */
+struct zuf_sb_info {
+	struct super_block *sb;
+	struct multi_devices *md;
+
+	/* zus cookie */
+	struct zus_sb_info *zus_sbi;
+
+	/* Mount options */
+	unsigned long	s_mount_opt;
+
+	spinlock_t		s_mmap_dirty_lock;
+	struct list_head	s_mmap_dirty;
+};
+
+static inline struct zuf_sb_info *SBI(struct super_block *sb)
+{
+	return sb->s_fs_info;
+}
+
 static inline struct zuf_fs_type *ZUF_FST(struct file_system_type *fs_type)
 {
 	return container_of(fs_type, struct zuf_fs_type, vfs_fst);
@@ -155,6 +198,85 @@ static inline struct zuf_fs_type *zuf_fst(struct super_block *sb)
 	return ZUF_FST(sb->s_type);
 }
 
+static inline struct zuf_root_info *ZUF_ROOT(struct zuf_sb_info *sbi)
+{
+	return zuf_fst(sbi->sb)->zri;
+}
+
+static inline bool zuf_rdonly(struct super_block *sb)
+{
+	return sb->s_flags & MS_RDONLY;
+}
+
+static inline struct zus_inode *zus_zi(struct inode *inode)
+{
+	return ZUII(inode)->zi;
+}
+
+static inline void mt_to_timespec(struct timespec64 *t, __le64 *mt)
+{
+	u32 nsec;
+
+	t->tv_sec = div_u64_rem(le64_to_cpu(*mt), NSEC_PER_SEC, &nsec);
+	t->tv_nsec = nsec;
+}
+
+static inline void timespec_to_mt(__le64 *mt, struct timespec64 *t)
+{
+	*mt = cpu_to_le64(t->tv_sec * NSEC_PER_SEC + t->tv_nsec);
+}
+
+/* CAREFUL: Needs an sfence eventually, after this call */
+static inline
+void zus_inode_cmtime_now(struct inode *inode, struct zus_inode *zi)
+{
+	inode->i_mtime = inode->i_ctime = current_time(inode);
+	timespec_to_mt(&zi->i_ctime, &inode->i_ctime);
+	zi->i_mtime = zi->i_ctime;
+}
+
+static inline
+void zus_inode_ctime_now(struct inode *inode, struct zus_inode *zi)
+{
+	inode->i_ctime = current_time(inode);
+	timespec_to_mt(&zi->i_ctime, &inode->i_ctime);
+}
+
+enum big_alloc_type { ba_stack, ba_kmalloc, ba_vmalloc };
+
+static inline
+void *big_alloc(uint bytes, uint local_size, void *local, gfp_t gfp,
+		enum big_alloc_type *bat)
+{
+	void *ptr;
+
+	if (bytes <= local_size) {
+		*bat = ba_stack;
+		ptr = local;
+	} else if (bytes <= PAGE_SIZE) {
+		*bat = ba_kmalloc;
+		ptr = kmalloc(bytes, gfp);
+	} else {
+		*bat = ba_vmalloc;
+		ptr = (gfp & __GFP_ZERO) ? vzalloc(bytes) : vmalloc(bytes);
+	}
+
+	return ptr;
+}
+
+static inline void big_free(void *ptr, enum big_alloc_type bat)
+{
+	switch (bat) {
+	case ba_stack:
+		break;
+	case ba_kmalloc:
+		kfree(ptr);
+		break;
+	case ba_vmalloc:
+		vfree(ptr);
+	}
+}
+
 struct zuf_dispatch_op;
 typedef int (*overflow_handler)(struct zuf_dispatch_op *zdo, void *parg,
 				ulong zt_max_bytes);
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index e0c439b6c8e9..ca8e10a1f5a8 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -77,6 +77,8 @@
 #define EZUF_RETRY_DONE 540
 
 
+#define ZUFS_READAHEAD_PAGES	8
+
 /* All device sizes offsets must align on 2M */
 #define ZUFS_ALLOC_MASK		(1024 * 1024 * 2 - 1)
 
@@ -107,6 +109,45 @@ static inline zu_dpp_t enc_zu_dpp_t(ulong v, uint pool)
 	return v | pool;
 }
 
+/*
+ * Structure of a ZUS inode.
+ * This is all the inode fields
+ */
+
+/* zus_inode size */
+#define ZUFS_INODE_SIZE 128    /* must be power of two */
+#define ZUFS_INODE_BITS   7
+
+struct zus_inode {
+	__le16	i_flags;	/* Inode flags */
+	__le16	i_mode;		/* File mode */
+	__le32	i_nlink;	/* Links count */
+	__le64	i_size;		/* Size of data in bytes */
+/* 16*/	struct __zi_on_disk_desc {
+		__le64	a[2];
+	}	i_on_disk;	/* FS-specific on-disk placement */
+/* 32*/	__le64	i_blocks;
+	__le64	i_mtime;	/* Inode/data Modification time */
+	__le64	i_ctime;	/* Inode/data Changed time */
+	__le64	i_atime;	/* Data Access time */
+/* 64 - cache-line boundary */
+	__le64	i_ino;		/* Inode number */
+	__le32	i_uid;		/* Owner Uid */
+	__le32	i_gid;		/* Group Id */
+	__le64	i_xattr;	/* FS-specific Extended attribute block */
+	__le64	i_generation;	/* File version (for NFS) */
+/* 96*/	union {
+		__le32	i_rdev;		/* special-inode major/minor etc ...*/
+		u8	i_symlink[32];	/* if i_size < sizeof(i_symlink) */
+		__le64	i_sym_dpp;	/* Link location if long symlink */
+		struct  _zu_dir {
+			__le64	dir_root;
+			__le64  parent;
+		}	i_dir;
+	};
+	/* Total ZUFS_INODE_SIZE bytes always */
+};
+
 /* ~~~~~ ZUFS API ioctl commands ~~~~~ */
 enum {
 	ZUS_API_MAP_MAX_PAGES	= 1024,
@@ -176,6 +217,11 @@ enum e_mount_operation {
 	ZUFS_M_DDBG_WR,
 };
 
+enum {
+	ZUFS_REM_WAS_RO	= 0x00000001,
+	ZUFS_REM_WILL_RO	= 0x00000002,
+};
+
 struct zufs_mount_info {
 	/* IN */
 	struct zus_fs_info *zus_zfi;
@@ -269,11 +315,22 @@ struct zufs_ioc_wait_operation {
  */
 enum e_zufs_operation {
 	ZUFS_OP_NULL = 0,
+	ZUFS_OP_STATFS,
 
 	ZUFS_OP_BREAK,		/* Kernel telling Server to exit */
 	ZUFS_OP_MAX_OPT,
 };
 
+/* ZUFS_OP_STATFS */
+struct zufs_ioc_statfs {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_sb_info *zus_sbi;
+
+	/* OUT */
+	struct statfs64 statfs_out;
+};
+
 /* Allocate a special_file that will be a dual-port communication buffer with
  * user mode.
  * Server will access the buffer via the mmap of this file.
-- 
2.20.1
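
Note on the sb_id table used by _sb_add(), _sb_remove() and
zuf_sb_from_id() above: the id handed to the Server is the array index
plus one, so 0 is never a valid id, and a freed slot is reused by the
next mount. Below is a minimal user-space sketch of that scheme, not the
kernel code itself. The names slot_table, slot_add, slot_remove,
slot_from_id and SLOT_INC are made up for illustration, and locking
(sbl_lock in the patch) is left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SLOT_INC 4	/* hypothetical growth step, standing in for SBL_INC */

struct slot_table {
	void **array;		/* a NULL entry is a free slot */
	unsigned int num;	/* slots in use */
	unsigned int max;	/* slots allocated */
};

/* Store obj in the first free slot; return its 1-based id, 0 if out of memory. */
static unsigned long long slot_add(struct slot_table *t, void *obj)
{
	unsigned int i;

	if (t->num == t->max) {
		void **grown = realloc(t->array,
				       (t->max + SLOT_INC) * sizeof(*grown));

		if (!grown)
			return 0;
		/* realloc() leaves the new tail uninitialized, clear it by hand */
		memset(grown + t->max, 0, SLOT_INC * sizeof(*grown));
		t->array = grown;
		t->max += SLOT_INC;
	}

	for (i = 0; i < t->max; ++i)
		if (!t->array[i])
			break;
	assert(i < t->max);	/* num < max guarantees a free slot */

	t->array[i] = obj;
	++t->num;
	return (unsigned long long)i + 1;
}

/* Free the slot holding obj, if any. */
static void slot_remove(struct slot_table *t, void *obj)
{
	unsigned int i;

	for (i = 0; i < t->max; ++i) {
		if (t->array[i] == obj) {
			t->array[i] = NULL;
			--t->num;
			return;
		}
	}
}

/* Map a 1-based id back to its object; NULL if out of range or stale. */
static void *slot_from_id(const struct slot_table *t, unsigned long long id)
{
	--id;	/* id 0 wraps to the largest value and fails the range check */
	if (id >= t->max)
		return NULL;
	return t->array[id];
}
```

Starting from a zeroed table, the first two slot_add() calls return ids 1
and 2. After slot_remove() of the first object, id 1 looks up as NULL
(stale) and the next slot_add() hands out id 1 again.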



* [RFC PATCH 07/17] zuf: Namei and directory operations
  2019-02-19 11:51 [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Boaz harrosh
                   ` (5 preceding siblings ...)
  2019-02-19 11:51 ` [RFC PATCH 06/17] zuf: mounting Boaz harrosh
@ 2019-02-19 11:51 ` Boaz harrosh
  2019-02-19 11:51 ` [RFC PATCH 08/17] zuf: readdir operation Boaz harrosh
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 31+ messages in thread
From: Boaz harrosh @ 2019-02-19 11:51 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro
  Cc: Ric Wheeler, Miklos Szeredi, Steven Whitehouse, Jefff moyer,
	Amir Goldstein, Amit Golander, Sagi Manole

From: Boaz Harrosh <boazh@netapp.com>

Introduce creation and deletion of files, adding and removing
directory entries, and the remaining namei operations.

Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
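
Note on timestamps: the directory helpers in this patch stamp ctime/mtime
through zus_inode_cmtime_now(), and _times_equal() compares a VFS time
against the zus_inode one. Both rely on the encoding from zuf.h, where a
time is a single 64-bit count of nanoseconds. Below is a user-space
sketch of that round trip, under the assumption that byte order can be
ignored (the kernel stores the value as __le64). struct ts64, ts_to_mt,
mt_to_ts and times_equal are illustrative names, not the patch's API.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Stand-in for the kernel's struct timespec64 */
struct ts64 {
	int64_t tv_sec;
	long tv_nsec;
};

/* Pack seconds + nanoseconds into one 64-bit nanosecond count. */
static uint64_t ts_to_mt(const struct ts64 *t)
{
	/* the unsigned encoding cannot represent times before the epoch */
	assert(t->tv_sec >= 0 && t->tv_nsec >= 0 &&
	       t->tv_nsec < 1000000000L);
	return (uint64_t)t->tv_sec * NSEC_PER_SEC + (uint64_t)t->tv_nsec;
}

/* Split the nanosecond count back into seconds and a remainder. */
static struct ts64 mt_to_ts(uint64_t mt)
{
	struct ts64 t;

	t.tv_sec = (int64_t)(mt / NSEC_PER_SEC);
	t.tv_nsec = (long)(mt % NSEC_PER_SEC);
	return t;
}

/* Compare in the packed domain, as _times_equal() does. */
static int times_equal(const struct ts64 *t, uint64_t mt)
{
	return ts_to_mt(t) == mt;
}
```

An unsigned 64-bit nanosecond count spans roughly 584 years from the
epoch, so the packed form loses no precision for current times, and
comparing packed values is a single integer compare.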
 fs/zuf/Makefile    |   3 +-
 fs/zuf/_extern.h   |  38 +++
 fs/zuf/directory.c |  94 +++++++
 fs/zuf/file.c      |  26 ++
 fs/zuf/inode.c     | 599 ++++++++++++++++++++++++++++++++++++++++++++-
 fs/zuf/namei.c     | 402 ++++++++++++++++++++++++++++++
 fs/zuf/rw.c        |  25 ++
 fs/zuf/super.c     |   2 +
 fs/zuf/zuf-core.c  |   9 +
 fs/zuf/zuf.h       |  83 +++++++
 fs/zuf/zus_api.h   | 100 ++++++++
 11 files changed, 1379 insertions(+), 2 deletions(-)
 create mode 100644 fs/zuf/directory.c
 create mode 100644 fs/zuf/file.c
 create mode 100644 fs/zuf/namei.c
 create mode 100644 fs/zuf/rw.c
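
Note on flag inheritance: _calc_flags() in inode.c decides which of the
parent directory's flags a new inode keeps. Directories keep all
inheritable flags, regular files drop the directory-only one, and every
other file type keeps only noatime. The third argument of _calc_flags()
is not referenced in its body, so the sketch below omits it. This is a
user-space illustration only: the FL_* names and their bit values are
assumptions standing in for S_SYNC, S_NOATIME and S_DIRSYNC.

```c
#define _XOPEN_SOURCE 700	/* expose S_IF* in strict ISO C modes */
#include <assert.h>
#include <sys/stat.h>

/* Illustrative bit values, not taken from the patch */
enum {
	FL_SYNC    = 0x01,
	FL_NOATIME = 0x02,
	FL_DIRSYNC = 0x40,
};

#define FL_INHERITED	(FL_SYNC | FL_NOATIME | FL_DIRSYNC)
#define FL_REG_MASK	(~FL_DIRSYNC)	/* all but dir-specific flags */
#define FL_OTHER_MASK	(FL_NOATIME)	/* non-dir, non-regular files */

/* Return the subset of dir_flags a new inode of type 'mode' inherits. */
static unsigned int inherit_flags(mode_t mode, unsigned int dir_flags)
{
	unsigned int flags = dir_flags & FL_INHERITED;

	if (S_ISREG(mode))
		flags &= FL_REG_MASK;
	else if (!S_ISDIR(mode))
		flags &= FL_OTHER_MASK;

	/* nothing outside the inheritable set may leak through */
	assert((flags & ~FL_INHERITED) == 0);
	return flags;
}
```

With a parent that has all three flags set, a new sub-directory gets all
three, a regular file gets sync and noatime, and a FIFO gets noatime
only.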

diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index eaeffc65078f..501561d35b8a 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -17,5 +17,6 @@ zuf-y += md.o t1.o t2.o
 zuf-y += zuf-core.o zuf-root.o
 
 # Main FS
-zuf-y += super.o inode.o
+zuf-y += rw.o
+zuf-y += super.o inode.o directory.o namei.o file.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index dc6b41b6410b..76634904eca3 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -20,9 +20,28 @@
  * extern functions declarations
  */
 
+/* directory.c */
+int zuf_add_dentry(struct inode *dir, struct qstr *str, struct inode *inode);
+int zuf_remove_dentry(struct inode *dir, struct qstr *str, struct inode *inode);
+
 /* inode.c */
+int zuf_evict_dispatch(struct super_block *sb, struct zus_inode_info *zus_ii,
+		       int operation, uint flags);
 struct inode *zuf_iget(struct super_block *sb, struct zus_inode_info *zus_ii,
 		       zu_dpp_t _zi, bool *exist);
+void zuf_evict_inode(struct inode *inode);
+struct inode *zuf_new_inode(struct inode *dir, umode_t mode,
+			    const struct qstr *qstr, const char *symname,
+			    ulong rdev_or_isize, bool tmpfile);
+int zuf_write_inode(struct inode *inode, struct writeback_control *wbc);
+int zuf_update_time(struct inode *inode, struct timespec64 *time, int flags);
+int zuf_setattr(struct dentry *dentry, struct iattr *attr);
+int zuf_getattr(const struct path *path, struct kstat *stat,
+		 u32 request_mask, unsigned int flags);
+void zuf_set_inode_flags(struct inode *inode, struct zus_inode *zi);
+
+/* rw.c */
+int zuf_trim_edge(struct inode *inode, ulong filepos, uint len);
 
 /* super.c */
 int zuf_init_inodecache(void);
@@ -64,4 +83,23 @@ int zufr_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs);
 /* t1.c */
 int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma);
 
+/*
+ * Inodes and files operations
+ */
+
+/* directory.c */
+extern const struct file_operations zuf_dir_operations;
+
+/* file.c */
+extern const struct inode_operations zuf_file_inode_operations;
+extern const struct file_operations zuf_file_operations;
+
+/* inode.c */
+extern const struct address_space_operations zuf_aops;
+void zuf_zii_sync(struct inode *inode, bool sync_nlink);
+
+/* namei.c */
+extern const struct inode_operations zuf_dir_inode_operations;
+extern const struct inode_operations zuf_special_inode_operations;
+
 #endif	/*ndef __ZUF_EXTERN_H__*/
diff --git a/fs/zuf/directory.c b/fs/zuf/directory.c
new file mode 100644
index 000000000000..eb73a5c7cabf
--- /dev/null
+++ b/fs/zuf/directory.c
@@ -0,0 +1,94 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * File operations for directories.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#include <linux/fs.h>
+#include <linux/vmalloc.h>
+#include "zuf.h"
+
+/*
+ *FIXME comment to full git diff
+ */
+
+static int _dentry_dispatch(struct inode *dir, struct inode *inode,
+			    struct qstr *str, int operation)
+{
+	struct zufs_ioc_dentry ioc_dentry = {
+		.hdr.operation = operation,
+		.hdr.in_len = sizeof(ioc_dentry),
+		.hdr.out_len = sizeof(ioc_dentry),
+		.zus_ii = inode ? ZUII(inode)->zus_ii : NULL,
+		.zus_dir_ii = ZUII(dir)->zus_ii,
+		.str.len = str->len,
+	};
+	int err;
+
+	memcpy(&ioc_dentry.str.name, str->name, str->len);
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(dir->i_sb)), &ioc_dentry.hdr, NULL, 0);
+	if (unlikely(err)) {
+		zuf_dbg_err("[%ld] op=%d zufc_dispatch failed => %d\n",
+			    dir->i_ino, operation, err);
+		return err;
+	}
+
+	return 0;
+}
+
+/* Returns 0 on success, a negative error code on failure */
+int zuf_add_dentry(struct inode *dir, struct qstr *str, struct inode *inode)
+{
+	struct zuf_inode_info *zii = ZUII(dir);
+	int err;
+
+	if (!str->len || !zii->zi)
+		return -EINVAL;
+
+	zus_inode_cmtime_now(dir, zii->zi);
+	err = _dentry_dispatch(dir, inode, str, ZUFS_OP_ADD_DENTRY);
+	if (unlikely(err)) {
+		zuf_dbg_err("[%ld] _dentry_dispatch failed => %d\n",
+			    dir->i_ino, err);
+		return err;
+	}
+	zuf_zii_sync(dir, false);
+
+	return 0;
+}
+
+int zuf_remove_dentry(struct inode *dir, struct qstr *str, struct inode *inode)
+{
+	struct zuf_inode_info *zii = ZUII(dir);
+	int err;
+
+	if (!str->len || !zii->zi)
+		return -EINVAL;
+
+	zus_inode_cmtime_now(dir, zii->zi);
+	err = _dentry_dispatch(dir, inode, str, ZUFS_OP_REMOVE_DENTRY);
+	if (unlikely(err)) {
+		zuf_dbg_err("[%ld] _dentry_dispatch failed => %d\n",
+			    dir->i_ino, err);
+		return err;
+	}
+	zuf_zii_sync(dir, false);
+
+	return 0;
+}
+
+const struct file_operations zuf_dir_operations = {
+	.llseek		= generic_file_llseek,
+	.read		= generic_read_dir,
+	.fsync		= noop_fsync,
+};
diff --git a/fs/zuf/file.c b/fs/zuf/file.c
new file mode 100644
index 000000000000..c6c8ca71e957
--- /dev/null
+++ b/fs/zuf/file.c
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * File operations for files.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#include "zuf.h"
+
+const struct file_operations zuf_file_operations = {
+	.open			= generic_file_open,
+};
+
+const struct inode_operations zuf_file_inode_operations = {
+	.setattr	= zuf_setattr,
+	.getattr	= zuf_getattr,
+	.update_time	= zuf_update_time,
+};
diff --git a/fs/zuf/inode.c b/fs/zuf/inode.c
index 315a273e6f6d..ad424a305063 100644
--- a/fs/zuf/inode.c
+++ b/fs/zuf/inode.c
@@ -13,10 +13,607 @@
  *	Sagi Manole <sagim@netapp.com>"
  */
 
+#include <linux/fs.h>
+#include <linux/aio.h>
+#include <linux/highuid.h>
+#include <linux/module.h>
+#include <linux/mpage.h>
+#include <linux/backing-dev.h>
+#include <linux/types.h>
+#include <linux/ratelimit.h>
+#include <linux/posix_acl_xattr.h>
+#include <linux/security.h>
+#include <linux/delay.h>
 #include "zuf.h"
 
+/* Flags that should be inherited by new inodes from their parent. */
+#define ZUFS_FL_INHERITED (S_SYNC | S_NOATIME | S_DIRSYNC)
+
+/* Flags that are appropriate for regular files (all but dir-specific ones). */
+#define ZUFS_FL_REG_MASK (~S_DIRSYNC)
+
+/* Flags that are appropriate for non-dir/non-regular files. */
+#define ZUFS_FL_OTHER_MASK (S_NOATIME)
+
+static bool _zi_valid(struct zus_inode *zi)
+{
+	if (!_zi_active(zi))
+		return false;
+
+	switch (le16_to_cpu(zi->i_mode) & S_IFMT) {
+	case S_IFREG:
+	case S_IFDIR:
+	case S_IFLNK:
+	case S_IFBLK:
+	case S_IFCHR:
+	case S_IFIFO:
+	case S_IFSOCK:
+		return true;
+	default:
+		zuf_err("unknown file type ino=%lld mode=%d\n", zi->i_ino,
+			  zi->i_mode);
+		return false;
+	}
+}
+
+static void _set_inode_from_zi(struct inode *inode, struct zus_inode *zi)
+{
+	inode->i_mode = le16_to_cpu(zi->i_mode);
+	inode->i_uid = KUIDT_INIT(le32_to_cpu(zi->i_uid));
+	inode->i_gid = KGIDT_INIT(le32_to_cpu(zi->i_gid));
+	set_nlink(inode, le32_to_cpu(zi->i_nlink));
+	inode->i_size = le64_to_cpu(zi->i_size);
+	inode->i_blocks = le64_to_cpu(zi->i_blocks);
+	mt_to_timespec(&inode->i_atime, &zi->i_atime);
+	mt_to_timespec(&inode->i_ctime, &zi->i_ctime);
+	mt_to_timespec(&inode->i_mtime, &zi->i_mtime);
+	inode->i_generation = le64_to_cpu(zi->i_generation);
+	zuf_set_inode_flags(inode, zi);
+
+	inode->i_blocks = le64_to_cpu(zi->i_blocks);
+	inode->i_mapping->a_ops = &zuf_aops;
+
+	switch (inode->i_mode & S_IFMT) {
+	case S_IFREG:
+		inode->i_op = &zuf_file_inode_operations;
+		inode->i_fop = &zuf_file_operations;
+		break;
+	case S_IFDIR:
+		inode->i_op = &zuf_dir_inode_operations;
+		inode->i_fop = &zuf_dir_operations;
+		break;
+	case S_IFBLK:
+	case S_IFCHR:
+	case S_IFIFO:
+	case S_IFSOCK:
+		inode->i_size = 0;
+		inode->i_op = &zuf_special_inode_operations;
+		init_special_inode(inode, inode->i_mode,
+				   le32_to_cpu(zi->i_rdev));
+		break;
+	default:
+		zuf_err("unknown file type ino=%lld mode=%d\n", zi->i_ino,
+			  zi->i_mode);
+		break;
+	}
+
+	inode->i_ino = le64_to_cpu(zi->i_ino);
+}
+
+/* Mask out flags that are inappropriate for the given type of inode. */
+static uint _calc_flags(umode_t mode, uint dir_flags, uint flags)
+{
+	uint zufs_flags = dir_flags & ZUFS_FL_INHERITED;
+
+	if (S_ISREG(mode))
+		zufs_flags &= ZUFS_FL_REG_MASK;
+	else if (!S_ISDIR(mode))
+		zufs_flags &= ZUFS_FL_OTHER_MASK;
+
+	return zufs_flags;
+}
+
+static int _set_zi_from_inode(struct inode *dir, struct zus_inode *zi,
+			      struct inode *inode)
+{
+	struct zus_inode *zidir = zus_zi(dir);
+
+	if (unlikely(!zidir))
+		return -EACCES;
+
+	zi->i_mode = cpu_to_le16(inode->i_mode);
+	zi->i_uid = cpu_to_le32(__kuid_val(inode->i_uid));
+	zi->i_gid = cpu_to_le32(__kgid_val(inode->i_gid));
+	/* NOTE: zus is boss of i_nlink (but let it know what we think) */
+	zi->i_nlink = cpu_to_le32(inode->i_nlink);
+	zi->i_size = cpu_to_le64(inode->i_size);
+	zi->i_blocks = cpu_to_le64(inode->i_blocks);
+	timespec_to_mt(&zi->i_atime, &inode->i_atime);
+	timespec_to_mt(&zi->i_mtime, &inode->i_mtime);
+	timespec_to_mt(&zi->i_ctime, &inode->i_ctime);
+	zi->i_generation = cpu_to_le64(inode->i_generation);
+
+	if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode))
+		zi->i_rdev = cpu_to_le32(inode->i_rdev);
+
+	zi->i_flags = cpu_to_le16(_calc_flags(inode->i_mode,
+					      le16_to_cpu(zidir->i_flags),
+					      inode->i_flags));
+	return 0;
+}
+
+static bool _times_equal(struct timespec64 *t, __le64 *mt)
+{
+	__le64 time;
+
+	timespec_to_mt(&time, t);
+	return time == *mt;
+}
+
+/* This function checks if VFS's inode and zus_inode are in sync */
+static void _warn_inode_dirty(struct inode *inode, struct zus_inode *zi)
+{
+#define __MISMACH_INT(inode, X, Y)	\
+	if (X != Y)			\
+		zuf_warn("[%ld] " #X"=0x%lx " #Y"=0x%lx""\n",	\
+			  inode->i_ino, (ulong)(X), (ulong)(Y))
+#define __MISMACH_TIME(inode, X, Y)	\
+	if (!_times_equal(X, Y)) {	\
+		struct timespec64 t;	\
+		mt_to_timespec(&t, (Y));\
+		zuf_warn("[%ld] " #X"=%lld:%ld " #Y"=%lld:%ld""\n",	\
+			  inode->i_ino, (X)->tv_sec, (X)->tv_nsec,	\
+			  t.tv_sec, t.tv_nsec);		\
+	}
+
+	if (!_times_equal(&inode->i_ctime, &zi->i_ctime) ||
+	    !_times_equal(&inode->i_mtime, &zi->i_mtime) ||
+	    !_times_equal(&inode->i_atime, &zi->i_atime) ||
+	    inode->i_size != le64_to_cpu(zi->i_size) ||
+	    inode->i_mode != le16_to_cpu(zi->i_mode) ||
+	    __kuid_val(inode->i_uid) != le32_to_cpu(zi->i_uid) ||
+	    __kgid_val(inode->i_gid) != le32_to_cpu(zi->i_gid) ||
+	    inode->i_nlink != le32_to_cpu(zi->i_nlink) ||
+	    inode->i_ino != _zi_ino(zi) ||
+	    inode->i_blocks != le64_to_cpu(zi->i_blocks)) {
+		__MISMACH_TIME(inode, &inode->i_ctime, &zi->i_ctime);
+		__MISMACH_TIME(inode, &inode->i_mtime, &zi->i_mtime);
+		__MISMACH_TIME(inode, &inode->i_atime, &zi->i_atime);
+		__MISMACH_INT(inode, inode->i_size, le64_to_cpu(zi->i_size));
+		__MISMACH_INT(inode, inode->i_mode, le16_to_cpu(zi->i_mode));
+		__MISMACH_INT(inode, __kuid_val(inode->i_uid),
+			      le32_to_cpu(zi->i_uid));
+		__MISMACH_INT(inode, __kgid_val(inode->i_gid),
+			      le32_to_cpu(zi->i_gid));
+		__MISMACH_INT(inode, inode->i_nlink, le32_to_cpu(zi->i_nlink));
+		__MISMACH_INT(inode, inode->i_ino, _zi_ino(zi));
+		__MISMACH_INT(inode, inode->i_blocks,
+			      le64_to_cpu(zi->i_blocks));
+	}
+}
+
+static void _zii_connect(struct inode *inode, struct zus_inode *zi,
+			 struct zus_inode_info *zus_ii)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	zii->zi = zi;
+	zii->zus_ii = zus_ii;
+}
+
 struct inode *zuf_iget(struct super_block *sb, struct zus_inode_info *zus_ii,
 		       zu_dpp_t _zi, bool *exist)
 {
-	return ERR_PTR(-ENOTSUPP);
+	struct zus_inode *zi = zuf_dpp_t_addr(sb, _zi);
+	struct inode *inode;
+
+	*exist = false;
+	if (unlikely(!zi)) {
+		/* Don't trust ZUS pointers */
+		zuf_err("Bad zus_inode 0x%llx\n", _zi);
+		return ERR_PTR(-EIO);
+	}
+	if (unlikely(!zus_ii)) {
+		zuf_err("zus_ii NULL\n");
+		return ERR_PTR(-EIO);
+	}
+
+	if (!_zi_valid(zi)) {
+		zuf_err("inactive inode ino=%lld links=%d mode=%d\n", zi->i_ino,
+			  zi->i_nlink, zi->i_mode);
+		return ERR_PTR(-ESTALE);
+	}
+
+	zuf_dbg_zus("[%lld] size=0x%llx, blocks=0x%llx ct=0x%llx mt=0x%llx link=0x%x mode=0x%x xattr=0x%llx\n",
+		    zi->i_ino, zi->i_size, zi->i_blocks, zi->i_ctime,
+		    zi->i_mtime, zi->i_nlink, zi->i_mode, zi->i_xattr);
+
+	inode = iget_locked(sb, _zi_ino(zi));
+	if (unlikely(!inode))
+		return ERR_PTR(-ENOMEM);
+
+	if (!(inode->i_state & I_NEW)) {
+		*exist = true;
+		return inode;
+	}
+
+	_set_inode_from_zi(inode, zi);
+	_zii_connect(inode, zi, zus_ii);
+
+	unlock_new_inode(inode);
+	return inode;
+}
+
+int zuf_evict_dispatch(struct super_block *sb, struct zus_inode_info *zus_ii,
+		       int operation, uint flags)
+{
+	struct zufs_ioc_evict_inode ioc_evict_inode = {
+		.hdr.in_len = sizeof(ioc_evict_inode),
+		.hdr.out_len = sizeof(ioc_evict_inode),
+		.hdr.operation = operation,
+		.zus_ii = zus_ii,
+		.flags = flags,
+	};
+	int err;
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(sb)), &ioc_evict_inode.hdr, NULL, 0);
+	if (unlikely(err && err != -EINTR))
+		zuf_err("zufc_dispatch failed op=%s => %d\n",
+			 zuf_op_name(operation), err);
+	return err;
+}
+
+void zuf_evict_inode(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	if (!inode->i_nlink) {
+		if (unlikely(!zii->zi)) {
+			zuf_dbg_err("[%ld] inode without zi mode=0x%x size=0x%llx\n",
+				    inode->i_ino, inode->i_mode, inode->i_size);
+			goto out;
+		}
+
+		if (unlikely(is_bad_inode(inode)))
+			zuf_dbg_err("[%ld] inode is bad mode=0x%x zi=%p\n",
+				    inode->i_ino, inode->i_mode, zii->zi);
+		else
+			_warn_inode_dirty(inode, zii->zi);
+
+		zuf_w_lock(zii);
+
+		zuf_evict_dispatch(sb, zii->zus_ii, ZUFS_OP_FREE_INODE, 0);
+
+		inode->i_mtime = inode->i_ctime = current_time(inode);
+		inode->i_size = 0;
+
+		zuf_w_unlock(zii);
+	} else {
+		zuf_dbg_vfs("[%ld] inode is going down?\n", inode->i_ino);
+
+		zuf_smw_lock(zii);
+
+		zuf_evict_dispatch(sb, zii->zus_ii, ZUFS_OP_EVICT_INODE, 0);
+
+		zuf_smw_unlock(zii);
+	}
+
+out:
+	zii->zus_ii = NULL;
+	zii->zi = NULL;
+
+	if (zii->zero_page) {
+		zii->zero_page->mapping = NULL;
+		__free_pages(zii->zero_page, 0);
+		zii->zero_page = NULL;
+	}
+
+	clear_inode(inode);
+}
+
+/* @rdev_or_isize is i_size in the case of a symlink
+ * and rdev in the case of special-files
+ */
+struct inode *zuf_new_inode(struct inode *dir, umode_t mode,
+			    const struct qstr *qstr, const char *symname,
+			    ulong rdev_or_isize, bool tmpfile)
+{
+	struct super_block *sb = dir->i_sb;
+	struct zuf_sb_info *sbi = SBI(sb);
+	struct zufs_ioc_new_inode ioc_new_inode = {
+		.hdr.in_len = sizeof(ioc_new_inode),
+		.hdr.out_len = sizeof(ioc_new_inode),
+		.hdr.operation = ZUFS_OP_NEW_INODE,
+		.dir_ii = ZUII(dir)->zus_ii,
+		.flags = tmpfile ? ZI_TMPFILE : 0,
+		.str.len = qstr->len,
+	};
+	struct inode *inode;
+	struct zus_inode *zi = NULL;
+	struct page *pages[2];
+	uint nump = 0;
+	int err;
+
+	memcpy(&ioc_new_inode.str.name, qstr->name, qstr->len);
+
+	inode = new_inode(sb);
+	if (!inode)
+		return ERR_PTR(-ENOMEM);
+
+	inode_init_owner(inode, dir, mode);
+	inode->i_blocks = inode->i_size = 0;
+	inode->i_ctime = inode->i_mtime = current_time(dir);
+	inode->i_atime = inode->i_ctime;
+
+	zuf_dbg_verbose("inode=%p name=%s\n", inode, qstr->name);
+
+	err = security_inode_init_security(inode, dir, qstr, NULL, NULL);
+	if (err && err != -EOPNOTSUPP)
+		goto fail;
+
+	zuf_set_inode_flags(inode, &ioc_new_inode.zi);
+
+	if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode) ||
+	    S_ISFIFO(inode->i_mode) || S_ISSOCK(inode->i_mode)) {
+		init_special_inode(inode, mode, rdev_or_isize);
+	}
+
+	err = _set_zi_from_inode(dir, &ioc_new_inode.zi, inode);
+	if (unlikely(err))
+		goto fail;
+
+	zus_inode_cmtime_now(dir, zus_zi(dir));
+
+	err = zufc_dispatch(ZUF_ROOT(sbi), &ioc_new_inode.hdr, pages, nump);
+	if (unlikely(err)) {
+		zuf_dbg_err("zufc_dispatch failed => %d\n", err);
+		goto fail;
+	}
+	zi = zuf_dpp_t_addr(sb, ioc_new_inode._zi);
+
+	_zii_connect(inode, zi, ioc_new_inode.zus_ii);
+
+	/* update inode fields from filesystem inode */
+	inode->i_ino = le64_to_cpu(zi->i_ino);
+	inode->i_size = le64_to_cpu(zi->i_size);
+	inode->i_generation = le64_to_cpu(zi->i_generation);
+	inode->i_blocks = le64_to_cpu(zi->i_blocks);
+	set_nlink(inode, le16_to_cpu(zi->i_nlink));
+	zuf_zii_sync(dir, false);
+
+	zuf_dbg_zus("[%lld] size=0x%llx, blocks=0x%llx ct=0x%llx mt=0x%llx link=0x%x mode=0x%x xattr=0x%llx\n",
+		    zi->i_ino, zi->i_size, zi->i_blocks, zi->i_ctime,
+		    zi->i_mtime, zi->i_nlink, zi->i_mode, zi->i_xattr);
+
+	zuf_dbg_verbose("allocating inode %ld (zi=%p)\n", _zi_ino(zi), zi);
+
+	err = insert_inode_locked(inode);
+	if (unlikely(err)) {
+		zuf_dbg_err("[%ld:%s] generation=%lld insert_inode_locked => %d\n",
+			    inode->i_ino, qstr->name, zi->i_generation, err);
+		goto fail;
+	}
+
+	return inode;
+
+fail:
+	clear_nlink(inode);
+	if (zi)
+		zi->i_nlink = 0;
+	make_bad_inode(inode);
+	iput(inode);
+	return ERR_PTR(err);
+}
+
+int zuf_write_inode(struct inode *inode, struct writeback_control *wbc)
+{
+	/* write_inode should never be called because we always keep our inodes
+	 * clean. So let us know if write_inode ever gets called.
+	 */
+
+	/* d_tmpfile() calls mark_inode_dirty(), so only complain on linked
+	 * files. TODO: How? The check is disabled for now:
+	 * WARN_ON(inode->i_nlink);
+	 */
+
+	return 0;
+}
+
+/*
+ * Mostly supports file_accessed() for now, which is the only one we use.
+ *
+ * However, file_update_time() is also used by the fifo code.
+ */
+int zuf_update_time(struct inode *inode, struct timespec64 *time, int flags)
+{
+	struct zus_inode *zi = zus_zi(inode);
+
+	if (flags & S_ATIME) {
+		inode->i_atime = *time;
+		timespec_to_mt(&zi->i_atime, &inode->i_atime);
+		/* FIXME: Set a flag that zi needs flushing
+		 * for now every read needs zi-flushing.
+		 */
+	}
+
+	/* file_update_time() is not used by zuf.
+	 * FIXME: One exception is O_TMPFILE, where the vfs calls
+	 * file_update_time() internally, bypassing the FS. So just do it
+	 * silently. The zus O_TMPFILE create protocol knows it needs flushing.
+	 */
+	if ((flags & S_CTIME) || (flags & S_MTIME)) {
+		if (flags & S_CTIME) {
+			inode->i_ctime = *time;
+			timespec_to_mt(&zi->i_ctime, &inode->i_ctime);
+		}
+		if (flags & S_MTIME) {
+			inode->i_mtime = *time;
+			timespec_to_mt(&zi->i_mtime, &inode->i_mtime);
+		}
+		zuf_dbg_vfs("called for S_CTIME | S_MTIME 0x%x\n", flags);
+	}
+
+	if (flags & ~(S_CTIME | S_MTIME | S_ATIME))
+		zuf_err("called for 0x%x\n", flags);
+
+	return 0;
+}
+
+int zuf_getattr(const struct path *path, struct kstat *stat, u32 request_mask,
+		unsigned int flags)
+{
+	struct dentry *dentry = path->dentry;
+	struct inode *inode = d_inode(dentry);
+
+	if (inode->i_flags & S_APPEND)
+		stat->attributes |= STATX_ATTR_APPEND;
+	if (inode->i_flags & S_IMMUTABLE)
+		stat->attributes |= STATX_ATTR_IMMUTABLE;
+
+	stat->attributes_mask |= (STATX_ATTR_APPEND |
+				  STATX_ATTR_IMMUTABLE);
+	generic_fillattr(inode, stat);
+	/* stat->blocks should be the number of 512B blocks */
+	stat->blocks = inode->i_blocks << (inode->i_sb->s_blocksize_bits - 9);
+
+	return 0;
+}
+
+int zuf_setattr(struct dentry *dentry, struct iattr *attr)
+{
+	struct inode *inode = dentry->d_inode;
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zus_inode *zi = zii->zi;
+	struct zufs_ioc_attr ioc_attr = {
+		.hdr.in_len = sizeof(ioc_attr),
+		.hdr.out_len = sizeof(ioc_attr),
+		.hdr.operation = ZUFS_OP_SETATTR,
+		.zus_ii = zii->zus_ii,
+	};
+	int err;
+
+	if (!zi)
+		return -EACCES;
+
+	err = setattr_prepare(dentry, attr);
+	if (unlikely(err))
+		return err;
+
+	if (attr->ia_valid & ATTR_MODE) {
+		zuf_dbg_vfs("[%ld] ATTR_MODE=0x%x\n",
+			     inode->i_ino, attr->ia_mode);
+		ioc_attr.zuf_attr |= STATX_MODE;
+		inode->i_mode = attr->ia_mode;
+		zi->i_mode = cpu_to_le16(inode->i_mode);
+		if (test_opt(SBI(inode->i_sb), POSIXACL)) {
+			err = posix_acl_chmod(inode, inode->i_mode);
+			if (unlikely(err))
+				return err;
+		}
+	}
+
+	if (attr->ia_valid & ATTR_UID) {
+		zuf_dbg_vfs("[%ld] ATTR_UID=0x%x\n",
+			     inode->i_ino, __kuid_val(attr->ia_uid));
+		ioc_attr.zuf_attr |= STATX_UID;
+		inode->i_uid = attr->ia_uid;
+		zi->i_uid = cpu_to_le32(__kuid_val(inode->i_uid));
+	}
+	if (attr->ia_valid & ATTR_GID) {
+		zuf_dbg_vfs("[%ld] ATTR_GID=0x%x\n",
+			     inode->i_ino, __kgid_val(attr->ia_gid));
+		ioc_attr.zuf_attr |= STATX_GID;
+		inode->i_gid = attr->ia_gid;
+		zi->i_gid = cpu_to_le32(__kgid_val(inode->i_gid));
+	}
+
+	if ((attr->ia_valid & ATTR_SIZE)) {
+		ulong off = attr->ia_size & (inode->i_sb->s_blocksize - 1);
+
+		zuf_dbg_vfs("[%ld] ATTR_SIZE=0x%llx\n",
+			     inode->i_ino, attr->ia_size);
+		if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
+		      S_ISLNK(inode->i_mode))) {
+			zuf_err("[%ld] wrong file mode=%x\n",
+				inode->i_ino, inode->i_mode);
+			return -EINVAL;
+		}
+		ioc_attr.zuf_attr |= STATX_SIZE;
+
+		ZUF_CHECK_I_W_LOCK(inode);
+		zuf_smw_lock(zii);
+
+		if (attr->ia_size < inode->i_size) {
+			/* Make all mmap() users FAULT for truncated pages */
+			unmap_mapping_range(inode->i_mapping,
+					attr->ia_size + PAGE_SIZE - 1, 0, 1);
+
+			if (off)
+				zuf_trim_edge(inode, attr->ia_size,
+					      inode->i_sb->s_blocksize - off);
+		}
+
+		ioc_attr.truncate_size = attr->ia_size;
+		/* on attr_size we want to update times as well */
+		attr->ia_valid |= ATTR_CTIME | ATTR_MTIME;
+	}
+
+	if (attr->ia_valid & ATTR_ATIME) {
+		ioc_attr.zuf_attr |= STATX_ATIME;
+		inode->i_atime = attr->ia_atime;
+		timespec_to_mt(&zi->i_atime, &inode->i_atime);
+		zuf_dbg_vfs("[%ld] ATTR_ATIME=0x%llx\n",
+			     inode->i_ino, zi->i_atime);
+	}
+	if (attr->ia_valid & ATTR_CTIME) {
+		ioc_attr.zuf_attr |= STATX_CTIME;
+		inode->i_ctime = attr->ia_ctime;
+		timespec_to_mt(&zi->i_ctime, &inode->i_ctime);
+		zuf_dbg_vfs("[%ld] ATTR_CTIME=0x%llx\n",
+			     inode->i_ino, zi->i_ctime);
+	}
+	if (attr->ia_valid & ATTR_MTIME) {
+		ioc_attr.zuf_attr |= STATX_MTIME;
+		inode->i_mtime = attr->ia_mtime;
+		timespec_to_mt(&zi->i_mtime, &inode->i_mtime);
+		zuf_dbg_vfs("[%ld] ATTR_MTIME=0x%llx\n",
+			     inode->i_ino, zi->i_mtime);
+	}
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_attr.hdr, NULL, 0);
+	if (unlikely(err))
+		zuf_dbg_err("[%ld] set_attr=0x%x failed => %d\n",
+			    inode->i_ino, ioc_attr.zuf_attr, err);
+
+	if ((attr->ia_valid & ATTR_SIZE)) {
+		i_size_write(inode, le64_to_cpu(zi->i_size));
+		inode->i_blocks = le64_to_cpu(zi->i_blocks);
+
+		zuf_smw_unlock(zii);
+	}
+
+	return err;
+}
+
+void zuf_set_inode_flags(struct inode *inode, struct zus_inode *zi)
+{
+	unsigned int flags = le32_to_cpu(zi->i_flags);
+
+	inode->i_flags &=
+		~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC);
+	inode->i_flags |= flags;
+	if (!zi->i_xattr)
+		inode_has_no_xattr(inode);
+}
+
+/* direct_IO is not called. We set an empty one so open(O_DIRECT) will be happy
+ */
+static ssize_t zuf_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
+{
+	WARN_ON(1);
+	return 0;
 }
+const struct address_space_operations zuf_aops = {
+	.direct_IO		= zuf_direct_IO,
+};
diff --git a/fs/zuf/namei.c b/fs/zuf/namei.c
new file mode 100644
index 000000000000..299134ca7c07
--- /dev/null
+++ b/fs/zuf/namei.c
@@ -0,0 +1,402 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Inode operations for directories.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+#include <linux/fs.h>
+#include "zuf.h"
+
+
+static struct inode *d_parent(struct dentry *dentry)
+{
+	return dentry->d_parent->d_inode;
+}
+
+static void _set_nlink(struct inode *inode, struct zus_inode *zi)
+{
+	set_nlink(inode, le32_to_cpu(zi->i_nlink));
+}
+
+void zuf_zii_sync(struct inode *inode, bool sync_nlink)
+{
+	struct zus_inode *zi = zus_zi(inode);
+
+	if (inode->i_size != le64_to_cpu(zi->i_size) ||
+	    inode->i_blocks != le64_to_cpu(zi->i_blocks)) {
+		i_size_write(inode, le64_to_cpu(zi->i_size));
+		inode->i_blocks = le64_to_cpu(zi->i_blocks);
+	}
+
+	if (sync_nlink)
+		_set_nlink(inode, zi);
+}
+
+static void _instantiate_unlock(struct dentry *dentry, struct inode *inode)
+{
+	d_instantiate(dentry, inode);
+	unlock_new_inode(inode);
+}
+
+static struct dentry *zuf_lookup(struct inode *dir, struct dentry *dentry,
+				 uint flags)
+{
+	struct super_block *sb = dir->i_sb;
+	struct qstr *str = &dentry->d_name;
+	uint in_len = offsetof(struct zufs_ioc_lookup, _zi);
+	struct zufs_ioc_lookup ioc_lu = {
+		.hdr.in_len = in_len,
+		.hdr.out_start = in_len,
+		.hdr.out_len = sizeof(ioc_lu) - in_len,
+		.hdr.operation = ZUFS_OP_LOOKUP,
+		.dir_ii = ZUII(dir)->zus_ii,
+		.str.len = str->len,
+	};
+	struct inode *inode = NULL;
+	bool exist = false;
+	int err;
+
+	zuf_dbg_vfs("[%ld] dentry-name=%s\n", dir->i_ino, dentry->d_name.name);
+
+	if (dentry->d_name.len > ZUFS_NAME_LEN)
+		return ERR_PTR(-ENAMETOOLONG);
+
+	memcpy(&ioc_lu.str.name, str->name, str->len);
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(sb)), &ioc_lu.hdr, NULL, 0);
+	if (unlikely(err)) {
+		zuf_dbg_err("zufc_dispatch failed => %d\n", err);
+		goto out;
+	}
+
+	inode = zuf_iget(dir->i_sb, ioc_lu.zus_ii, ioc_lu._zi, &exist);
+	if (exist) {
+		zuf_dbg_err("race in lookup\n");
+		zuf_evict_dispatch(sb, ioc_lu.zus_ii, ZUFS_OP_EVICT_INODE,
+				   ZI_LOOKUP_RACE);
+	}
+
+out:
+	return d_splice_alias(inode, dentry);
+}
+
+/*
+ * By the time this is called, we already have created
+ * the directory cache entry for the new file, but it
+ * is so far negative - it has no inode.
+ *
+ * If the create succeeds, we fill in the inode information
+ * with d_instantiate().
+ */
+static int zuf_create(struct inode *dir, struct dentry *dentry, umode_t mode,
+		      bool excl)
+{
+	struct inode *inode;
+
+	zuf_dbg_vfs("[%ld] dentry-name=%s mode=0x%x\n",
+		     dir->i_ino, dentry->d_name.name, mode);
+
+	inode = zuf_new_inode(dir, mode, &dentry->d_name, NULL, 0, false);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	inode->i_op = &zuf_file_inode_operations;
+	inode->i_mapping->a_ops = &zuf_aops;
+	inode->i_fop = &zuf_file_operations;
+
+	_instantiate_unlock(dentry, inode);
+
+	return 0;
+}
+
+static int zuf_mknod(struct inode *dir, struct dentry *dentry, umode_t mode,
+		     dev_t rdev)
+{
+	struct inode *inode;
+
+	zuf_dbg_vfs("[%ld] mode=0x%x rdev=0x%x\n", dir->i_ino, mode, rdev);
+
+	inode = zuf_new_inode(dir, mode, &dentry->d_name, NULL, rdev, false);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	inode->i_op = &zuf_special_inode_operations;
+
+	_instantiate_unlock(dentry, inode);
+
+	return 0;
+}
+
+static int zuf_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+	struct inode *inode;
+
+	inode = zuf_new_inode(dir, mode, &dentry->d_name, NULL, 0, true);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	/* TODO: See about more ephemeral operations on this file, around
+	 * mmap and such.
+	 * Must see about the tmpfile mode that is later linked via linkat()
+	 * (probably the !O_EXCL flag)
+	 */
+	inode->i_op = &zuf_file_inode_operations;
+	inode->i_mapping->a_ops = &zuf_aops;
+	inode->i_fop = &zuf_file_operations;
+
+	set_nlink(inode, 1); /* user_mode knows nothing */
+	d_tmpfile(dentry, inode);
+	/* tmpfiles operate on nlink=0. Since this is a tmp file we do not care
+	 * about cl_flushing. If this file is later linked into a dir, the
+	 * add_dentry will flush the zi.
+	 */
+	zus_zi(inode)->i_nlink = inode->i_nlink;
+
+	unlock_new_inode(inode);
+	return 0;
+}
+
+static int zuf_link(struct dentry *dest_dentry, struct inode *dir,
+		    struct dentry *dentry)
+{
+	struct inode *inode = dest_dentry->d_inode;
+	int err;
+
+	zuf_dbg_vfs("[%ld] dentry-ino=%ld dentry-name=%s dentry-parent=%ld dest_d-ino=%ld dest_d-name=%s\n",
+		     dir->i_ino, inode->i_ino, dentry->d_name.name,
+		     d_parent(dentry)->i_ino,
+		     dest_dentry->d_inode->i_ino, dest_dentry->d_name.name);
+
+	if (inode->i_nlink >= ZUFS_LINK_MAX)
+		return -EMLINK;
+
+	ihold(inode);
+
+	zus_inode_cmtime_now(dir, zus_zi(dir));
+	zus_inode_ctime_now(inode, zus_zi(inode));
+
+	err = zuf_add_dentry(dir, &dentry->d_name, inode);
+	if (unlikely(err)) {
+		iput(inode);
+		return err;
+	}
+
+	_set_nlink(inode, zus_zi(inode));
+
+	d_instantiate(dentry, inode);
+
+	return 0;
+}
+
+static int zuf_unlink(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = dentry->d_inode;
+	int err;
+
+	zuf_dbg_vfs("[%ld] dentry-ino=%ld dentry-name=%s dentry-parent=%ld\n",
+		     dir->i_ino, inode->i_ino, dentry->d_name.name,
+		     d_parent(dentry)->i_ino);
+
+	inode->i_ctime = dir->i_ctime;
+	timespec_to_mt(&zus_zi(inode)->i_ctime, &inode->i_ctime);
+
+	err = zuf_remove_dentry(dir, &dentry->d_name, inode);
+	if (unlikely(err))
+		return err;
+
+	zuf_zii_sync(inode, true);
+	zuf_zii_sync(dir, true);
+
+	return 0;
+}
+
+static int zuf_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+	struct inode *inode;
+
+	zuf_dbg_vfs("[%ld] dentry-name=%s dentry-parent=%ld mode=0x%x\n",
+		     dir->i_ino, dentry->d_name.name, d_parent(dentry)->i_ino,
+		     mode);
+
+	if (dir->i_nlink >= ZUFS_LINK_MAX)
+		return -EMLINK;
+
+	inode = zuf_new_inode(dir, S_IFDIR | mode, &dentry->d_name, NULL, 0,
+			      false);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	inode->i_op = &zuf_dir_inode_operations;
+	inode->i_fop = &zuf_dir_operations;
+	inode->i_mapping->a_ops = &zuf_aops;
+
+	zuf_zii_sync(dir, true);
+
+	_instantiate_unlock(dentry, inode);
+
+	return 0;
+}
+
+static bool _empty_dir(struct inode *dir)
+{
+	if (dir->i_nlink != 2) {
+		zuf_dbg_verbose("[%ld] directory has nlink(%d) != 2\n",
+				dir->i_ino, dir->i_nlink);
+		return false;
+	}
+	/* NOTE: The above is not the only -ENOTEMPTY check. The zus-fs must
+	 * also check the "only-files", no-subdirs case and return -ENOTEMPTY.
+	 */
+	return true;
+}
+
+static int zuf_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = dentry->d_inode;
+	int err;
+
+	if (!inode)
+		return -ENOENT;
+
+	zuf_dbg_vfs("[%ld] dentry-ino=%ld dentry-name=%s dentry-parent=%ld\n",
+		     dir->i_ino, inode->i_ino, dentry->d_name.name,
+		     d_parent(dentry)->i_ino);
+
+	if (!_empty_dir(inode))
+		return -ENOTEMPTY;
+
+	zus_inode_cmtime_now(dir, zus_zi(dir));
+	inode->i_ctime = dir->i_ctime;
+	timespec_to_mt(&zus_zi(inode)->i_ctime, &inode->i_ctime);
+
+	err = zuf_remove_dentry(dir, &dentry->d_name, inode);
+	if (unlikely(err))
+		return err;
+
+	zuf_zii_sync(inode, true);
+	zuf_zii_sync(dir, true);
+
+	return 0;
+}
+
+/* Structure of a directory element; */
+struct zuf_dir_element {
+	__le64  ino;
+	char name[254];
+};
+
+static int zuf_rename(struct inode *old_dir, struct dentry *old_dentry,
+		      struct inode *new_dir, struct dentry *new_dentry,
+		      uint flags)
+{
+	struct inode *old_inode = d_inode(old_dentry);
+	struct inode *new_inode = d_inode(new_dentry);
+	struct zuf_sb_info *sbi = SBI(old_inode->i_sb);
+	struct zufs_ioc_rename ioc_rename = {
+		.hdr.in_len = sizeof(ioc_rename),
+		.hdr.out_len = sizeof(ioc_rename),
+		.hdr.operation = ZUFS_OP_RENAME,
+		.old_dir_ii = ZUII(old_dir)->zus_ii,
+		.new_dir_ii = ZUII(new_dir)->zus_ii,
+		.old_zus_ii = ZUII(old_inode)->zus_ii,
+		.new_zus_ii = new_inode ? ZUII(new_inode)->zus_ii : NULL,
+		.old_d_str.len = old_dentry->d_name.len,
+		.new_d_str.len = new_dentry->d_name.len,
+		.flags = flags,
+	};
+	struct timespec64 time = current_time(old_dir);
+	int err;
+
+	zuf_dbg_vfs(
+		"old_inode=%ld new_inode=%ld old_name=%s new_name=%s f=0x%x\n",
+		old_inode->i_ino, new_inode ? new_inode->i_ino : 0,
+		old_dentry->d_name.name, new_dentry->d_name.name, flags);
+
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE /*| RENAME_WHITEOUT*/))
+		return -EINVAL;
+
+	if (flags & RENAME_EXCHANGE) {
+		/* A subdir holds a ref on parent, see if we need to
+		 * exchange refs
+		 */
+		if (unlikely(!new_inode))
+			return -EINVAL;
+
+		if ((S_ISDIR(old_inode->i_mode) != S_ISDIR(new_inode->i_mode))
+		    && (old_dir != new_dir)) {
+			if (S_ISDIR(old_inode->i_mode)) {
+				if (ZUFS_LINK_MAX <= new_dir->i_nlink)
+					return -EMLINK;
+			} else {
+				if (ZUFS_LINK_MAX <= old_dir->i_nlink)
+					return -EMLINK;
+			}
+		}
+	} else if (S_ISDIR(old_inode->i_mode)) {
+		if (new_inode) {
+			if (!_empty_dir(new_inode))
+				return -ENOTEMPTY;
+		} else if (ZUFS_LINK_MAX <= new_dir->i_nlink) {
+			return -EMLINK;
+		}
+	}
+
+	memcpy(&ioc_rename.old_d_str.name, old_dentry->d_name.name,
+		old_dentry->d_name.len);
+	memcpy(&ioc_rename.new_d_str.name, new_dentry->d_name.name,
+		new_dentry->d_name.len);
+	timespec_to_mt(&ioc_rename.time, &time);
+
+	zus_inode_cmtime_now(old_dir, zus_zi(old_dir));
+	if (old_dir != new_dir)
+		zus_inode_cmtime_now(new_dir, zus_zi(new_dir));
+
+	if (new_inode)
+		zus_inode_ctime_now(new_inode, zus_zi(new_inode));
+	else
+		zus_inode_ctime_now(old_inode, zus_zi(old_inode));
+
+	err = zufc_dispatch(ZUF_ROOT(sbi), &ioc_rename.hdr, NULL, 0);
+
+	zuf_zii_sync(old_dir, true);
+	zuf_zii_sync(new_dir, true);
+
+	if (unlikely(err)) {
+		zuf_dbg_err("zufc_dispatch failed => %d\n", err);
+		return err;
+	}
+
+	if (new_inode)
+		_set_nlink(new_inode, zus_zi(new_inode));
+
+	return 0;
+}
+
+const struct inode_operations zuf_dir_inode_operations = {
+	.create		= zuf_create,
+	.lookup		= zuf_lookup,
+	.link		= zuf_link,
+	.unlink		= zuf_unlink,
+	.mkdir		= zuf_mkdir,
+	.rmdir		= zuf_rmdir,
+	.mknod		= zuf_mknod,
+	.tmpfile	= zuf_tmpfile,
+	.rename		= zuf_rename,
+	.setattr	= zuf_setattr,
+	.getattr	= zuf_getattr,
+	.update_time	= zuf_update_time,
+};
+
+const struct inode_operations zuf_special_inode_operations = {
+	.setattr	= zuf_setattr,
+	.getattr	= zuf_getattr,
+	.update_time	= zuf_update_time,
+};
diff --git a/fs/zuf/rw.c b/fs/zuf/rw.c
new file mode 100644
index 000000000000..1eb8453da564
--- /dev/null
+++ b/fs/zuf/rw.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Read/Write operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+#include <linux/fadvise.h>
+#include <linux/uio.h>
+#include <linux/delay.h>
+
+#include "zuf.h"
+#include "t2.h"
+
+/* ZERO a part of a single block. len does not cross a block boundary */
+int zuf_trim_edge(struct inode *inode, ulong filepos, uint len)
+{
+	return -EIO;
+}
diff --git a/fs/zuf/super.c b/fs/zuf/super.c
index 7f819be7056e..2afa7b405945 100644
--- a/fs/zuf/super.c
+++ b/fs/zuf/super.c
@@ -616,6 +616,8 @@ void zuf_destroy_inodecache(void)
 static struct super_operations zuf_sops = {
 	.alloc_inode	= zuf_alloc_inode,
 	.destroy_inode	= zuf_destroy_inode,
+	.write_inode	= zuf_write_inode,
+	.evict_inode	= zuf_evict_inode,
 	.put_super	= zuf_put_super,
 	.freeze_fs	= zuf_update_s_wtime,
 	.unfreeze_fs	= zuf_update_s_wtime,
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index d94c2e6d7578..3b61a4845af7 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -765,6 +765,15 @@ const char *zuf_op_name(enum e_zufs_operation op)
 {
 #define CASE_ENUM_NAME(e) case e: return #e
 	switch  (op) {
+		CASE_ENUM_NAME(ZUFS_OP_STATFS		);
+		CASE_ENUM_NAME(ZUFS_OP_NEW_INODE	);
+		CASE_ENUM_NAME(ZUFS_OP_FREE_INODE	);
+		CASE_ENUM_NAME(ZUFS_OP_EVICT_INODE	);
+		CASE_ENUM_NAME(ZUFS_OP_LOOKUP		);
+		CASE_ENUM_NAME(ZUFS_OP_ADD_DENTRY	);
+		CASE_ENUM_NAME(ZUFS_OP_REMOVE_DENTRY	);
+		CASE_ENUM_NAME(ZUFS_OP_RENAME		);
+		CASE_ENUM_NAME(ZUFS_OP_SETATTR		);
 		CASE_ENUM_NAME(ZUFS_OP_BREAK		);
 	default:
 		return "UNKNOWN";
diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
index e23907f5e94e..7d79189bfe60 100644
--- a/fs/zuf/zuf.h
+++ b/fs/zuf/zuf.h
@@ -156,6 +156,10 @@ enum {
 struct zuf_inode_info {
 	struct inode		vfs_inode;
 
+	/* Stuff for mmap write */
+	struct rw_semaphore	in_sync;
+	struct page		*zero_page; /* TODO: Remove */
+
 	/* cookies from Server */
 	struct zus_inode	*zi;
 	struct zus_inode_info	*zus_ii;
@@ -213,6 +217,17 @@ static inline struct zus_inode *zus_zi(struct inode *inode)
 	return ZUII(inode)->zi;
 }
 
+/* An accessor because of the frequent use in prints */
+static inline ulong _zi_ino(struct zus_inode *zi)
+{
+	return le64_to_cpu(zi->i_ino);
+}
+
+static inline bool _zi_active(struct zus_inode *zi)
+{
+	return (zi->i_nlink || zi->i_mode);
+}
+
 static inline void mt_to_timespec(struct timespec64 *t, __le64 *mt)
 {
 	u32 nsec;
@@ -226,6 +241,65 @@ static inline void timespec_to_mt(__le64 *mt, struct timespec64 *t)
 	*mt = cpu_to_le64(t->tv_sec * NSEC_PER_SEC + t->tv_nsec);
 }
 
+static inline void zuf_r_lock(struct zuf_inode_info *zii)
+{
+	inode_lock_shared(&zii->vfs_inode);
+}
+static inline void zuf_r_unlock(struct zuf_inode_info *zii)
+{
+	inode_unlock_shared(&zii->vfs_inode);
+}
+
+static inline void zuf_smr_lock(struct zuf_inode_info *zii)
+{
+	down_read_nested(&zii->in_sync, 1);
+}
+static inline void zuf_smr_lock_pagefault(struct zuf_inode_info *zii)
+{
+	down_read_nested(&zii->in_sync, 2);
+}
+static inline void zuf_smr_unlock(struct zuf_inode_info *zii)
+{
+	up_read(&zii->in_sync);
+}
+
+static inline void zuf_smw_lock(struct zuf_inode_info *zii)
+{
+	down_write(&zii->in_sync);
+}
+static inline void zuf_smw_lock_nested(struct zuf_inode_info *zii)
+{
+	down_write_nested(&zii->in_sync, 1);
+}
+static inline void zuf_smw_unlock(struct zuf_inode_info *zii)
+{
+	up_write(&zii->in_sync);
+}
+
+static inline void zuf_w_lock(struct zuf_inode_info *zii)
+{
+	inode_lock(&zii->vfs_inode);
+	zuf_smw_lock(zii);
+}
+static inline void zuf_w_lock_nested(struct zuf_inode_info *zii)
+{
+	inode_lock_nested(&zii->vfs_inode, 2);
+	zuf_smw_lock_nested(zii);
+}
+static inline void zuf_w_unlock(struct zuf_inode_info *zii)
+{
+	zuf_smw_unlock(zii);
+	inode_unlock(&zii->vfs_inode);
+}
+
+static inline void ZUF_CHECK_I_W_LOCK(struct inode *inode)
+{
+#ifdef CONFIG_ZUF_DEBUG
+	if (WARN_ON(down_write_trylock(&inode->i_rwsem)))
+		up_write(&inode->i_rwsem);
+#endif
+}
+
 /* CAREFUL: Needs an sfence eventually, after this call */
 static inline
 void zus_inode_cmtime_now(struct inode *inode, struct zus_inode *zi)
@@ -242,6 +316,15 @@ void zus_inode_ctime_now(struct inode *inode, struct zus_inode *zi)
 	timespec_to_mt(&zi->i_ctime, &inode->i_ctime);
 }
 
+static inline void *zuf_dpp_t_addr(struct super_block *sb, zu_dpp_t v)
+{
+	/* TODO: Implement zufs_ioc_create_mempool already */
+	if (WARN_ON(zu_dpp_t_pool(v)))
+		return NULL;
+
+	return md_addr_verify(SBI(sb)->md, zu_dpp_t_val(v));
+}
+
 enum big_alloc_type { ba_stack, ba_kmalloc, ba_vmalloc };
 
 static inline
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index ca8e10a1f5a8..9d66a38ab585 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -76,7 +76,18 @@
  */
 #define EZUF_RETRY_DONE 540
 
+/* TODO: i_flags & i_version are missing from the STATX_ attrs; should send a
+ * patch to add them.
+ */
+#define ZUFS_STATX_FLAGS	0x20000000U
+#define ZUFS_STATX_VERSION	0x40000000U
 
+/*
+ * Maximal count of links to a file
+ */
+#define ZUFS_LINK_MAX          32000
+#define ZUFS_MAX_SYMLINK	PAGE_SIZE
+#define ZUFS_NAME_LEN		255
 #define ZUFS_READAHEAD_PAGES	8
 
 /* All device sizes offsets must align on 2M */
@@ -317,6 +328,17 @@ enum e_zufs_operation {
 	ZUFS_OP_NULL = 0,
 	ZUFS_OP_STATFS,
 
+	ZUFS_OP_NEW_INODE,
+	ZUFS_OP_FREE_INODE,
+	ZUFS_OP_EVICT_INODE,
+
+	ZUFS_OP_LOOKUP,
+	ZUFS_OP_ADD_DENTRY,
+	ZUFS_OP_REMOVE_DENTRY,
+	ZUFS_OP_RENAME,
+
+	ZUFS_OP_SETATTR,
+
 	ZUFS_OP_BREAK,		/* Kernel telling Server to exit */
 	ZUFS_OP_MAX_OPT,
 };
@@ -331,6 +353,84 @@ struct zufs_ioc_statfs {
 	struct statfs64 statfs_out;
 };
 
+/* zufs_ioc_new_inode and zufs_ioc_evict_inode flags: */
+enum zi_flags {
+	ZI_TMPFILE = 1,		/* for new_inode */
+	ZI_LOOKUP_RACE = 1,	/* for evict */
+};
+
+struct zufs_str {
+	__u8 len;
+	char name[ZUFS_NAME_LEN];
+};
+
+/* ZUFS_OP_NEW_INODE */
+struct zufs_ioc_new_inode {
+	struct zufs_ioc_hdr hdr;
+	 /* IN */
+	struct zus_inode zi;
+	struct zus_inode_info *dir_ii; /* If mktmp this is the root */
+	struct zufs_str str;
+	__u64 flags;
+
+	 /* OUT */
+	zu_dpp_t _zi;
+	struct zus_inode_info *zus_ii;
+};
+
+/* ZUFS_OP_FREE_INODE, ZUFS_OP_EVICT_INODE */
+struct zufs_ioc_evict_inode {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *zus_ii;
+	__u64 flags;
+};
+
+/* ZUFS_OP_LOOKUP */
+struct zufs_ioc_lookup {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *dir_ii;
+	struct zufs_str str;
+
+	 /* OUT */
+	zu_dpp_t _zi;
+	struct zus_inode_info *zus_ii;
+};
+
+/* ZUFS_OP_ADD_DENTRY, ZUFS_OP_REMOVE_DENTRY */
+struct zufs_ioc_dentry {
+	struct zufs_ioc_hdr hdr;
+	struct zus_inode_info *zus_ii; /* IN */
+	struct zus_inode_info *zus_dir_ii; /* IN */
+	struct zufs_str str; /* IN */
+	__u64 ino; /* OUT - only for lookup */
+};
+
+/* ZUFS_OP_RENAME */
+struct zufs_ioc_rename {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *old_dir_ii;
+	struct zus_inode_info *new_dir_ii;
+	struct zus_inode_info *old_zus_ii;
+	struct zus_inode_info *new_zus_ii;
+	struct zufs_str old_d_str;
+	struct zufs_str new_d_str;
+	__u64 time;
+	__u32 flags;
+};
+
+/* ZUFS_OP_SETATTR */
+struct zufs_ioc_attr {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *zus_ii;
+	__u64 truncate_size;
+	__u32 zuf_attr;
+	__u32 pad;
+};
+
 /* Allocate a special_file that will be a dual-port communication buffer with
  * user mode.
  * Server will access the buffer via the mmap of this file.
-- 
2.20.1
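
Note on the size arithmetic in this patch: zuf_setattr() zeroes the tail of
the last block when a file is truncated to an unaligned size, and
zuf_getattr() converts filesystem blocks into the 512-byte units that
stat(2) reports. The following is a minimal user-space sketch of both
calculations, not code from the patch. The demo_* names are hypothetical,
and the block size is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of bytes to zero from new_size up to the end of its block.
 * Mirrors the patch: off = ia_size & (s_blocksize - 1), and the trim
 * length is s_blocksize - off. Returns 0 when new_size is block aligned,
 * matching the "if (off)" guard. blocksize must be a power of two.
 */
static uint32_t demo_trim_len(uint64_t new_size, uint32_t blocksize)
{
	uint32_t off = (uint32_t)(new_size & (blocksize - 1));

	return off ? blocksize - off : 0;
}

/*
 * Convert a count of filesystem blocks into 512-byte units, as done for
 * stat->blocks: i_blocks << (s_blocksize_bits - 9).
 * blocksize_bits must be at least 9.
 */
static uint64_t demo_stat_blocks(uint64_t fs_blocks,
				 unsigned int blocksize_bits)
{
	return fs_blocks << (blocksize_bits - 9);
}
```

With a 4096-byte block, truncating to 5000 bytes leaves 904 bytes in the
last block, so 3192 bytes are zeroed. Three 4K blocks are reported as 24
units of 512 bytes.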



* [RFC PATCH 08/17] zuf: readdir operation
  2019-02-19 11:51 [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Boaz harrosh
                   ` (6 preceding siblings ...)
  2019-02-19 11:51 ` [RFC PATCH 07/17] zuf: Namei and directory operations Boaz harrosh
@ 2019-02-19 11:51 ` Boaz harrosh
  2019-02-19 11:51 ` [RFC PATCH 09/17] zuf: symlink Boaz harrosh
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 31+ messages in thread
From: Boaz harrosh @ 2019-02-19 11:51 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro
  Cc: Ric Wheeler, Miklos Szeredi, Steven Whitehouse, Jefff moyer,
	Amir Goldstein, Amit Golander, Sagi Manole

From: Boaz Harrosh <boazh@netapp.com>

Establish protocol with Server for readdir

Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/_extern.h   |  3 ++
 fs/zuf/directory.c | 69 +++++++++++++++++++++++++++++++++
 fs/zuf/zuf-core.c  |  1 +
 fs/zuf/zus_api.h   | 96 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 169 insertions(+)
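
Note on the readdir protocol: the Server packs variable-length entries into
the shared buffer, each padded to 8 bytes, and the Kernel walks them until it
hits a zero-length name or runs out of space. The following is a minimal
user-space sketch of that packing and walking, assuming a layout similar to
struct zufs_dir_entry. All demo_* names are hypothetical. As a
simplification, type and pos are stored in one 64-bit word instead of the
bit-fields used in the patch, and the buffer is assumed to be 8-byte aligned
and zero filled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_NAME_LEN 255

struct demo_str {
	uint8_t len;
	char name[DEMO_NAME_LEN];
};

struct demo_dir_entry {
	uint64_t ino;
	uint64_t type_pos;	/* type in low 8 bits, pos in upper 56 bits */
	struct demo_str zstr;
};

/* Fixed part of an entry: everything up to the first name byte */
enum { DEMO_HDR_SIZE = offsetof(struct demo_dir_entry, zstr) +
		       offsetof(struct demo_str, name) };

struct demo_iter {
	char *cur, *last;
	int more;
};

/* Entry size for a given name length, rounded up to 8 bytes */
static size_t demo_entry_len(uint8_t name_len)
{
	return (DEMO_HDR_SIZE + name_len + 7u) & ~(size_t)7u;
}

static void demo_iter_init(struct demo_iter *it, void *buf, size_t len)
{
	it->cur = buf;
	it->last = (char *)buf + len;
	it->more = 0;
}

/* Server side: append one entry. Returns 0 and sets ->more when full. */
static int demo_emit(struct demo_iter *it, uint64_t ino, uint8_t type,
		     uint64_t pos, const char *name, uint8_t len)
{
	struct demo_dir_entry *de = (struct demo_dir_entry *)it->cur;

	if (it->last <= it->cur + demo_entry_len(len)) {
		it->more = 1;
		return 0;
	}
	de->ino = ino;
	de->type_pos = (pos << 8) | type;
	memcpy(de->zstr.name, name, len);
	de->zstr.len = len;
	it->cur += demo_entry_len(len);
	return 1;
}

/* Kernel side: next entry, or NULL at the terminator or end of buffer */
static struct demo_dir_entry *demo_next(struct demo_iter *it)
{
	struct demo_dir_entry *de = (struct demo_dir_entry *)it->cur;

	if (it->last <= it->cur + DEMO_HDR_SIZE || de->zstr.len == 0)
		return NULL;
	if (it->last <= it->cur + demo_entry_len(de->zstr.len))
		return NULL;
	it->cur += demo_entry_len(de->zstr.len);
	return de;
}
```

As in the patch, the bounds checks use "<=", so an entry that would end
exactly at the end of the buffer is rejected and reported through ->more
rather than emitted.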

diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index 76634904eca3..ec9816d51aa3 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -39,6 +39,9 @@ int zuf_setattr(struct dentry *dentry, struct iattr *attr);
 int zuf_getattr(const struct path *path, struct kstat *stat,
 		 u32 request_mask, unsigned int flags);
 void zuf_set_inode_flags(struct inode *inode, struct zus_inode *zi);
+bool zuf_dir_emit(struct super_block *sb, struct dir_context *ctx,
+		  ulong ino, const char *name, int length);
+
 
 /* rw.c */
 int zuf_trim_edge(struct inode *inode, ulong filepos, uint len);
diff --git a/fs/zuf/directory.c b/fs/zuf/directory.c
index eb73a5c7cabf..645dd367fd8c 100644
--- a/fs/zuf/directory.c
+++ b/fs/zuf/directory.c
@@ -17,6 +17,74 @@
 #include <linux/vmalloc.h>
 #include "zuf.h"
 
+static int zuf_readdir(struct file *file, struct dir_context *ctx)
+{
+	struct inode *inode = file_inode(file);
+	struct super_block *sb = inode->i_sb;
+	loff_t i_size = i_size_read(inode);
+	struct zufs_ioc_readdir ioc_readdir = {
+		.hdr.in_len = sizeof(ioc_readdir),
+		.hdr.out_len = sizeof(ioc_readdir),
+		.hdr.operation = ZUFS_OP_READDIR,
+		.dir_ii = ZUII(inode)->zus_ii,
+	};
+	struct zufs_readdir_iter rdi;
+	struct page *pages[ZUS_API_MAP_MAX_PAGES];
+	struct zufs_dir_entry *zde;
+	void *addr, *__a;
+	uint nump, i;
+	int err;
+
+	if (ctx->pos && i_size <= ctx->pos)
+		return 0;
+	if (!i_size)
+		i_size = PAGE_SIZE; /* Just for the . && .. */
+	if (i_size - ctx->pos < PAGE_SIZE)
+		ioc_readdir.hdr.len = PAGE_SIZE;
+	else
+		ioc_readdir.hdr.len = min_t(loff_t, i_size - ctx->pos,
+					    ZUS_API_MAP_MAX_SIZE);
+	nump = md_o2p_up(ioc_readdir.hdr.len);
+	addr = vzalloc(md_p2o(nump));
+	if (unlikely(!addr))
+		return -ENOMEM;
+
+	WARN_ON((ulong)addr & (PAGE_SIZE - 1));
+
+	__a = addr;
+	for (i = 0; i < nump; ++i) {
+		pages[i] = vmalloc_to_page(__a);
+		__a += PAGE_SIZE;
+	}
+
+more:
+	ioc_readdir.pos = ctx->pos;
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(sb)), &ioc_readdir.hdr, pages, nump);
+	if (unlikely(err && err != -EINTR)) {
+		zuf_err("zufc_dispatch failed => %d\n", err);
+		goto out;
+	}
+
+	zufs_readdir_iter_init(&rdi, &ioc_readdir, addr);
+	while ((zde = zufs_next_zde(&rdi)) != NULL) {
+		zuf_dbg_verbose("%s pos=0x%lx\n",
+				zde->zstr.name, (ulong)zde->pos);
+		ctx->pos = zde->pos;
+		if (!dir_emit(ctx, zde->zstr.name, zde->zstr.len, zde->ino,
+			      zde->type))
+			goto out;
+	}
+	ctx->pos = ioc_readdir.pos;
+	if (ioc_readdir.more) {
+		zuf_dbg_err("more\n");
+		goto more;
+	}
+out:
+	vfree(addr);
+	return err;
+}
+
 /*
  *FIXME comment to full git diff
  */
@@ -90,5 +158,6 @@ int zuf_remove_dentry(struct inode *dir, struct qstr *str, struct inode *inode)
 const struct file_operations zuf_dir_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= generic_read_dir,
+	.iterate_shared	= zuf_readdir,
 	.fsync		= noop_fsync,
 };
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index 3b61a4845af7..3d38f284d387 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -773,6 +773,7 @@ const char *zuf_op_name(enum e_zufs_operation op)
 		CASE_ENUM_NAME(ZUFS_OP_ADD_DENTRY	);
 		CASE_ENUM_NAME(ZUFS_OP_REMOVE_DENTRY	);
 		CASE_ENUM_NAME(ZUFS_OP_RENAME		);
+		CASE_ENUM_NAME(ZUFS_OP_READDIR		);
 		CASE_ENUM_NAME(ZUFS_OP_SETATTR		);
 		CASE_ENUM_NAME(ZUFS_OP_BREAK		);
 	default:
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 9d66a38ab585..8a4e597414a8 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -336,6 +336,7 @@ enum e_zufs_operation {
 	ZUFS_OP_ADD_DENTRY,
 	ZUFS_OP_REMOVE_DENTRY,
 	ZUFS_OP_RENAME,
+	ZUFS_OP_READDIR,
 
 	ZUFS_OP_SETATTR,
 
@@ -421,6 +422,101 @@ struct zufs_ioc_rename {
 	__u32 flags;
 };
 
+/* ZUFS_OP_READDIR */
+struct zufs_ioc_readdir {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *dir_ii;
+	loff_t pos;
+
+	/* OUT */
+	__u8	more;
+};
+
+struct zufs_dir_entry {
+	__le64 ino;
+	struct {
+		unsigned	type	: 8;
+		ulong		pos	: 56;
+	};
+	struct zufs_str zstr;
+};
+
+struct zufs_readdir_iter {
+	void *__zde, *last;
+	struct zufs_ioc_readdir *ioc_readdir;
+};
+
+enum {E_ZDE_HDR_SIZE =
+	offsetof(struct zufs_dir_entry, zstr) + offsetof(struct zufs_str, name),
+};
+
+static inline void zufs_readdir_iter_init(struct zufs_readdir_iter *rdi,
+					  struct zufs_ioc_readdir *ioc_readdir,
+					  void *app_ptr)
+{
+	rdi->__zde = app_ptr;
+	rdi->last = app_ptr + ioc_readdir->hdr.len;
+	rdi->ioc_readdir = ioc_readdir;
+	ioc_readdir->more = false;
+}
+
+static inline uint zufs_dir_entry_len(__u8 name_len)
+{
+	return ALIGN(E_ZDE_HDR_SIZE + name_len, sizeof(__u64));
+}
+
+static inline
+struct zufs_dir_entry *zufs_next_zde(struct zufs_readdir_iter *rdi)
+{
+	struct zufs_dir_entry *zde = rdi->__zde;
+	uint len;
+
+	if (rdi->last <= rdi->__zde + E_ZDE_HDR_SIZE)
+		return NULL;
+	if (zde->zstr.len == 0)
+		return NULL;
+	len = zufs_dir_entry_len(zde->zstr.len);
+	if (rdi->last <= rdi->__zde + len)
+		return NULL;
+
+	rdi->__zde += len;
+	return zde;
+}
+
+static inline bool zufs_zde_emit(struct zufs_readdir_iter *rdi, __u64 ino,
+				 __u8 type, __u64 pos, const char *name,
+				 __u8 len)
+{
+	struct zufs_dir_entry *zde = rdi->__zde;
+
+	if (rdi->last <= rdi->__zde + zufs_dir_entry_len(len)) {
+		rdi->ioc_readdir->more = true;
+		return false;
+	}
+
+	rdi->ioc_readdir->more = 0;
+	zde->ino = ino;
+	zde->type = type;
+	/*ASSERT(0 == (pos & ~((1ULL << 56) - 1)));*/
+	zde->pos = pos;
+	strncpy(zde->zstr.name, name, len);
+	zde->zstr.len = len;
+	zufs_next_zde(rdi);
+
+	return true;
+}
+
+/* ZUFS_OP_GET_SYMLINK */
+struct zufs_ioc_get_link {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *zus_ii;
+
+	/* OUT */
+	zu_dpp_t _link;
+};
+
 /* ZUFS_OP_SETATTR */
 struct zufs_ioc_attr {
 	struct zufs_ioc_hdr hdr;
-- 
2.20.1
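
For readers following the packed readdir buffer used by zufs_zde_emit()
and zufs_next_zde() in this patch, here is a minimal user-space sketch of
the same idea: variable-length entries, each rounded up to 8 bytes, walked
until a zero name length or the end of the buffer. The demo_* names and the
struct layout are hypothetical and are not the real ABI (the patch uses a
bitfield and struct zufs_str); the reader relies on the buffer being zeroed
beforehand, as the kernel side does with vzalloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct zufs_dir_entry; NOT the real layout. */
struct demo_dirent {
	uint64_t ino;
	uint64_t type_pos;	/* low 8 bits: type, upper 56 bits: pos */
	uint8_t  name_len;
	char     name[];
};

enum { DEMO_HDR_SIZE = offsetof(struct demo_dirent, name) };

struct demo_iter {
	char *cur, *last;
};

/* Header plus name, rounded up to a multiple of 8 bytes. */
static size_t demo_entry_len(uint8_t name_len)
{
	return (DEMO_HDR_SIZE + name_len + 7u) & ~(size_t)7u;
}

/* Writer side: returns 0 when the entry does not fit ("more" case). */
static int demo_emit(struct demo_iter *it, uint64_t ino, uint8_t type,
		     uint64_t pos, const char *name, uint8_t name_len)
{
	struct demo_dirent *de = (struct demo_dirent *)(void *)it->cur;
	size_t len = demo_entry_len(name_len);

	assert(name_len > 0 && (pos >> 56) == 0);
	if ((size_t)(it->last - it->cur) <= len)
		return 0;

	de->ino = ino;
	de->type_pos = (pos << 8) | type;
	de->name_len = name_len;
	memcpy(de->name, name, name_len);
	it->cur += len;
	return 1;
}

/* Reader side: returns the next entry, or NULL at the end of the data. */
static struct demo_dirent *demo_next(struct demo_iter *it)
{
	struct demo_dirent *de = (struct demo_dirent *)(void *)it->cur;
	size_t len;

	if ((size_t)(it->last - it->cur) <= DEMO_HDR_SIZE)
		return NULL;
	if (de->name_len == 0)	/* zeroed tail: nothing more was emitted */
		return NULL;
	len = demo_entry_len(de->name_len);
	if ((size_t)(it->last - it->cur) <= len)
		return NULL;

	it->cur += len;
	return de;
}
```

A writer fills the buffer with demo_emit() until it returns 0, and a reader
then walks the same buffer with demo_next() from the start. Both sides use
the same "strictly more room than the entry needs" test as the patch does.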



* [RFC PATCH 09/17] zuf: symlink
  2019-02-19 11:51 [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Boaz harrosh
                   ` (7 preceding siblings ...)
  2019-02-19 11:51 ` [RFC PATCH 08/17] zuf: readdir operation Boaz harrosh
@ 2019-02-19 11:51 ` Boaz harrosh
  2019-02-20 11:05   ` Greg KH
  2019-02-19 11:51 ` [RFC PATCH 10/17] zuf: More file operation Boaz harrosh
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 31+ messages in thread
From: Boaz harrosh @ 2019-02-19 11:51 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro
  Cc: Ric Wheeler, Miklos Szeredi, Steven Whitehouse, Jefff moyer,
	Amir Goldstein, Amit Golander, Sagi Manole

From: Boaz Harrosh <boazh@netapp.com>

Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile   |  2 +-
 fs/zuf/_extern.h  |  7 +++++
 fs/zuf/inode.c    |  7 +++++
 fs/zuf/namei.c    | 27 ++++++++++++++++++
 fs/zuf/symlink.c  | 73 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/zuf-core.c |  1 +
 fs/zuf/zus_api.h  |  1 +
 7 files changed, 117 insertions(+), 1 deletion(-)
 create mode 100644 fs/zuf/symlink.c

diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index 501561d35b8a..9b7123f2af3e 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -18,5 +18,5 @@ zuf-y += zuf-core.o zuf-root.o
 
 # Main FS
 zuf-y += rw.o
-zuf-y += super.o inode.o directory.o namei.o file.o
+zuf-y += super.o inode.o directory.o namei.o file.o symlink.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index ec9816d51aa3..32a381ac4bd7 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -42,6 +42,10 @@ void zuf_set_inode_flags(struct inode *inode, struct zus_inode *zi);
 bool zuf_dir_emit(struct super_block *sb, struct dir_context *ctx,
 		  ulong ino, const char *name, int length);
 
+/* symlink.c */
+uint zuf_prepare_symname(struct zufs_ioc_new_inode *ioc_new_inode,
+			const char *symname, ulong len, struct page *pages[2]);
+
 
 /* rw.c */
 int zuf_trim_edge(struct inode *inode, ulong filepos, uint len);
@@ -105,4 +109,7 @@ void zuf_zii_sync(struct inode *inode, bool sync_nlink);
 extern const struct inode_operations zuf_dir_inode_operations;
 extern const struct inode_operations zuf_special_inode_operations;
 
+/* symlink.c */
+extern const struct inode_operations zuf_symlink_inode_operations;
+
 #endif	/*ndef __ZUF_EXTERN_H__*/
diff --git a/fs/zuf/inode.c b/fs/zuf/inode.c
index ad424a305063..2b49a0c31a02 100644
--- a/fs/zuf/inode.c
+++ b/fs/zuf/inode.c
@@ -82,6 +82,9 @@ static void _set_inode_from_zi(struct inode *inode, struct zus_inode *zi)
 		inode->i_op = &zuf_dir_inode_operations;
 		inode->i_fop = &zuf_dir_operations;
 		break;
+	case S_IFLNK:
+		inode->i_op = &zuf_symlink_inode_operations;
+		break;
 	case S_IFBLK:
 	case S_IFCHR:
 	case S_IFIFO:
@@ -357,6 +360,10 @@ struct inode *zuf_new_inode(struct inode *dir, umode_t mode,
 	if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode) ||
 	    S_ISFIFO(inode->i_mode) || S_ISSOCK(inode->i_mode)) {
 		init_special_inode(inode, mode, rdev_or_isize);
+	} else if (symname) {
+		inode->i_size = rdev_or_isize;
+		nump = zuf_prepare_symname(&ioc_new_inode, symname,
+					   rdev_or_isize, pages);
 	}
 
 	err = _set_zi_from_inode(dir, &ioc_new_inode.zi, inode);
diff --git a/fs/zuf/namei.c b/fs/zuf/namei.c
index 299134ca7c07..e78aa04f10d5 100644
--- a/fs/zuf/namei.c
+++ b/fs/zuf/namei.c
@@ -164,6 +164,32 @@ static int zuf_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 	return 0;
 }
 
+static int zuf_symlink(struct inode *dir, struct dentry *dentry,
+		       const char *symname)
+{
+	struct inode *inode;
+	ulong len;
+
+	zuf_dbg_vfs("[%ld] de->name=%s symname=%s\n",
+			dir->i_ino, dentry->d_name.name, symname);
+
+	len = strlen(symname);
+	if (len + 1 > ZUFS_MAX_SYMLINK)
+		return -ENAMETOOLONG;
+
+	inode = zuf_new_inode(dir, S_IFLNK|S_IRWXUGO, &dentry->d_name,
+			       symname, len, false);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	inode->i_op = &zuf_symlink_inode_operations;
+	inode->i_mapping->a_ops = &zuf_aops;
+
+	_instantiate_unlock(dentry, inode);
+
+	return 0;
+}
+
 static int zuf_link(struct dentry *dest_dentry, struct inode *dir,
 		    struct dentry *dentry)
 {
@@ -385,6 +411,7 @@ const struct inode_operations zuf_dir_inode_operations = {
 	.lookup		= zuf_lookup,
 	.link		= zuf_link,
 	.unlink		= zuf_unlink,
+	.symlink	= zuf_symlink,
 	.mkdir		= zuf_mkdir,
 	.rmdir		= zuf_rmdir,
 	.mknod		= zuf_mknod,
diff --git a/fs/zuf/symlink.c b/fs/zuf/symlink.c
new file mode 100644
index 000000000000..1446bdf60cb9
--- /dev/null
+++ b/fs/zuf/symlink.c
@@ -0,0 +1,73 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Symlink operations
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#include "zuf.h"
+
+/* Can never fail; all checks were already made by the caller.
+ * Returns: The number of pages stored in @pages
+ */
+uint zuf_prepare_symname(struct zufs_ioc_new_inode *ioc_new_inode,
+			 const char *symname, ulong len,
+			 struct page *pages[2])
+{
+	uint nump;
+
+	ioc_new_inode->zi.i_size = cpu_to_le64(len);
+	if (len < sizeof(ioc_new_inode->zi.i_symlink)) {
+		memcpy(&ioc_new_inode->zi.i_symlink, symname, len);
+		return 0;
+	}
+
+	pages[0] = virt_to_page(symname);
+	nump = 1;
+
+	ioc_new_inode->hdr.len = len;
+	ioc_new_inode->hdr.offset = (ulong)symname & (PAGE_SIZE - 1);
+
+	if (PAGE_SIZE < ioc_new_inode->hdr.offset + len) {
+		pages[1] = virt_to_page(symname + PAGE_SIZE);
+		++nump;
+	}
+
+	return nump;
+}
+
+/*
+ * In case of short symlink, we serve it directly from zi; otherwise, read
+ * symlink value directly from pmem using dpp mapping.
+ */
+static const char *zuf_get_link(struct dentry *dentry, struct inode *inode,
+				struct delayed_call *notused)
+{
+	const char *link;
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	if (inode->i_size < sizeof(zii->zi->i_symlink))
+		return zii->zi->i_symlink;
+
+	link = zuf_dpp_t_addr(inode->i_sb, le64_to_cpu(zii->zi->i_sym_dpp));
+	if (!link) {
+		zuf_err("bad symlink: i_sym_dpp=0x%llx\n", zii->zi->i_sym_dpp);
+		return ERR_PTR(-EIO);
+	}
+	return link;
+}
+
+const struct inode_operations zuf_symlink_inode_operations = {
+	.get_link	= zuf_get_link,
+	.update_time	= zuf_update_time,
+	.setattr	= zuf_setattr,
+	.getattr	= zuf_getattr,
+};
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index 3d38f284d387..3a264e6475c4 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -774,6 +774,7 @@ const char *zuf_op_name(enum e_zufs_operation op)
 		CASE_ENUM_NAME(ZUFS_OP_REMOVE_DENTRY	);
 		CASE_ENUM_NAME(ZUFS_OP_RENAME		);
 		CASE_ENUM_NAME(ZUFS_OP_READDIR		);
+		CASE_ENUM_NAME(ZUFS_OP_GET_SYMLINK	);
 		CASE_ENUM_NAME(ZUFS_OP_SETATTR		);
 		CASE_ENUM_NAME(ZUFS_OP_BREAK		);
 	default:
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 8a4e597414a8..74f69a12a263 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -338,6 +338,7 @@ enum e_zufs_operation {
 	ZUFS_OP_RENAME,
 	ZUFS_OP_READDIR,
 
+	ZUFS_OP_GET_SYMLINK,
 	ZUFS_OP_SETATTR,
 
 	ZUFS_OP_BREAK,		/* Kernel telling Server to exit */
-- 
2.20.1
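
As a note on zuf_prepare_symname() in this patch: for a long symlink the
name is passed by page reference, so the only arithmetic involved is
whether the string crosses a page boundary. The sketch below shows that
calculation in user space. The demo_* names and the 4 KiB page size are
assumptions for illustration, and it assumes the length never exceeds one
page (in the patch the caller checks it against ZUFS_MAX_SYMLINK).

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096ul	/* assumption: 4 KiB pages */

/*
 * Return how many pages the byte range [addr, addr + len) touches and
 * store the offset of addr within its first page in *offset.
 */
static unsigned demo_pages_spanned(uintptr_t addr, unsigned long len,
				   unsigned long *offset)
{
	assert(len > 0 && len <= DEMO_PAGE_SIZE);

	*offset = addr & (DEMO_PAGE_SIZE - 1);
	/* Same test as the patch: a second page only if we run past the first */
	return (DEMO_PAGE_SIZE < *offset + len) ? 2 : 1;
}
```

A range that ends exactly on the page boundary still counts as one page,
which is why the comparison is strict.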



* [RFC PATCH 10/17] zuf: More file operation
  2019-02-19 11:51 [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Boaz harrosh
                   ` (8 preceding siblings ...)
  2019-02-19 11:51 ` [RFC PATCH 09/17] zuf: symlink Boaz harrosh
@ 2019-02-19 11:51 ` Boaz harrosh
  2019-02-19 11:51 ` [RFC PATCH 11/17] zuf: Write/Read implementation Boaz harrosh
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 31+ messages in thread
From: Boaz harrosh @ 2019-02-19 11:51 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro
  Cc: Ric Wheeler, Miklos Szeredi, Steven Whitehouse, Jefff moyer,
	Amir Goldstein, Amit Golander, Sagi Manole

From: Boaz Harrosh <boazh@netapp.com>

Add more file operations.
Some of them call stubs in other files.

Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/_extern.h  |   4 +
 fs/zuf/file.c     | 429 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/rw.c       |  12 ++
 fs/zuf/zuf-core.c |   4 +
 fs/zuf/zus_api.h  |  45 +++++
 5 files changed, 494 insertions(+)

diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index 32a381ac4bd7..391484b0e125 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -48,6 +48,10 @@ uint zuf_prepare_symname(struct zufs_ioc_new_inode *ioc_new_inode,
 
 
 /* rw.c */
+ssize_t zuf_rw_read_iter(struct super_block *sb, struct inode *inode,
+			 struct kiocb *kiocb, struct iov_iter *ii);
+ssize_t zuf_rw_write_iter(struct super_block *sb, struct inode *inode,
+			  struct kiocb *kiocb, struct iov_iter *ii);
 int zuf_trim_edge(struct inode *inode, ulong filepos, uint len);
 
 /* super.c */
diff --git a/fs/zuf/file.c b/fs/zuf/file.c
index c6c8ca71e957..0e62145e923a 100644
--- a/fs/zuf/file.c
+++ b/fs/zuf/file.c
@@ -13,14 +13,443 @@
  *	Sagi Manole <sagim@netapp.com>"
  */
 
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uio.h>
+#include <linux/mm.h>
+#include <linux/uaccess.h>
+#include <linux/falloc.h>
+#include <linux/mman.h>
+#include <linux/fadvise.h>
+#include <linux/delay.h>
 #include "zuf.h"
 
+static long zuf_fallocate(struct file *file, int mode, loff_t offset,
+			   loff_t len)
+{
+	struct inode *inode = file->f_path.dentry->d_inode;
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zufs_ioc_range ioc_range = {
+		.hdr.in_len = sizeof(ioc_range),
+		.hdr.operation = ZUFS_OP_FALLOCATE,
+		.zus_ii = ZUII(inode)->zus_ii,
+		.offset = offset,
+		.length = len,
+		.opflags = mode,
+	};
+	enum {FALLOC_RETRY = 7};
+	int retry = 0;
+	int err = 0;
+
+	zuf_dbg_vfs("[%ld] mode=0x%x offset=0x%llx len=0x%llx\n",
+		     inode->i_ino, mode, offset, len);
+
+	if (!S_ISREG(inode->i_mode))
+		return -EINVAL;
+
+	zuf_w_lock(zii);
+
+	if (!(mode & FALLOC_FL_KEEP_SIZE) &&
+	     (i_size_read(inode) < offset + len)) {
+		err = inode_newsize_ok(inode, offset + len);
+		if (unlikely(err))
+			goto out;
+	}
+
+	zus_inode_cmtime_now(inode, zii->zi);
+
+	if (mode & (FALLOC_FL_ZERO_RANGE | FALLOC_FL_PUNCH_HOLE)) {
+		/* ASSUMING FS supports these two */
+		struct super_block *sb = inode->i_sb;
+		ulong off1 = offset & (sb->s_blocksize - 1);
+		ulong off2 = (offset + len) & (sb->s_blocksize - 1);
+
+		if (md_o2p(offset) == md_o2p(offset + len)) {
+			/* Same block. Just nullify the range and goto out */
+			err = zuf_trim_edge(inode, offset, off2 - off1);
+			goto out_update;
+		}
+		if (off1) {
+			uint l = sb->s_blocksize - off1;
+
+			err = zuf_trim_edge(inode, offset, l);
+			if (unlikely(err))
+				goto out;
+			if (mode & FALLOC_FL_ZERO_RANGE) {
+				ioc_range.offset += l;
+				ioc_range.length -= l;
+			}
+		}
+		if (off2) {
+			err = zuf_trim_edge(inode, (offset + len) - off2, off2);
+			if (unlikely(err))
+				goto out;
+			if (mode & FALLOC_FL_ZERO_RANGE)
+				ioc_range.length -= off2;
+		}
+	}
+
+	/* no length remains, but size might have changed in trim_edge */
+	if (!ioc_range.length)
+		goto out_update;
+
+again:
+	err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_range.hdr,
+			    NULL, 0);
+	if (unlikely(err)) {
+		if (err == -EZUFS_RETRY) {
+			if (FALLOC_RETRY < retry++) {
+				zuf_dbg_err("[%ld] retry=%d\n",
+					    inode->i_ino, retry);
+				msleep(retry - FALLOC_RETRY);
+			}
+			goto again;
+		}
+		zuf_dbg_err("[%ld] zufc_dispatch failed => %d\n",
+			    inode->i_ino, err);
+	}
+
+out_update:
+	i_size_write(inode, le64_to_cpu(zii->zi->i_size));
+	inode->i_blocks = le64_to_cpu(zii->zi->i_blocks);
+
+out:
+	zuf_w_unlock(zii);
+
+	return err;
+}
+
+static loff_t zuf_llseek(struct file *file, loff_t offset, int whence)
+{
+	struct inode *inode = file->f_path.dentry->d_inode;
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zufs_ioc_seek ioc_seek = {
+		.hdr.in_len = sizeof(ioc_seek),
+		.hdr.out_len = sizeof(ioc_seek),
+		.hdr.operation = ZUFS_OP_LLSEEK,
+		.zus_ii = zii->zus_ii,
+		.offset_in = offset,
+		.whence = whence,
+	};
+	int err = 0;
+
+	zuf_dbg_vfs("[%ld] offset=0x%llx whence=%d\n",
+		     inode->i_ino, offset, whence);
+
+	if (whence != SEEK_DATA && whence != SEEK_HOLE)
+		return generic_file_llseek(file, offset, whence);
+
+	zuf_r_lock(zii);
+
+	if ((offset < 0 && !(file->f_mode & FMODE_UNSIGNED_OFFSET)) ||
+	    offset > inode->i_sb->s_maxbytes) {
+		err = -EINVAL;
+		goto out;
+	} else if (inode->i_size <= offset) {
+		err = -ENXIO;
+		goto out;
+	} else if (!inode->i_blocks) {
+		if (whence == SEEK_HOLE)
+			ioc_seek.offset_out = i_size_read(inode);
+		else
+			err = -ENXIO;
+		goto out;
+	}
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_seek.hdr, NULL, 0);
+	if (unlikely(err)) {
+		zuf_dbg_err("zufc_dispatch failed => %d\n", err);
+		goto out;
+	}
+
+	if (ioc_seek.offset_out != file->f_pos) {
+		file->f_pos = ioc_seek.offset_out;
+		file->f_version = 0;
+	}
+
+out:
+	zuf_r_unlock(zii);
+
+	return err ?: ioc_seek.offset_out;
+}
+
+/* This callback is called when a file is closed */
+static int zuf_flush(struct file *file, fl_owner_t id)
+{
+	zuf_dbg_vfs("[%ld]\n", file->f_inode->i_ino);
+
+	return 0;
+}
+
+static int tozu_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+		       u64 offset, u64 len)
+{
+	int err = -EOPNOTSUPP;
+	ulong start_index = md_o2p(offset);
+	ulong end_index = md_o2p_up(offset + len);
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	zuf_dbg_vfs(
+		"[%ld] offset=0x%llx len=0x%llx i-start=0x%lx i-end=0x%lx\n",
+		inode->i_ino, offset, len, start_index, end_index);
+
+	if (fiemap_check_flags(fieinfo, FIEMAP_FLAG_SYNC))
+		return -EBADR;
+
+	zuf_r_lock(zii);
+
+	/* TODO: ZUS fiemap (&msi)*/
+
+	zuf_r_unlock(zii);
+	return err;
+}
+
+static void _lock_two_ziis(struct zuf_inode_info *zii1,
+			   struct zuf_inode_info *zii2)
+{
+	if (zii1 > zii2)
+		swap(zii1, zii2);
+
+	zuf_w_lock(zii1);
+	if (zii1 != zii2)
+		zuf_w_lock_nested(zii2);
+}
+
+static void _unlock_two_ziis(struct zuf_inode_info *zii1,
+		      struct zuf_inode_info *zii2)
+{
+	if (zii1 > zii2)
+		swap(zii1, zii2);
+
+	if (zii1 != zii2)
+		zuf_w_unlock(zii2);
+	zuf_w_unlock(zii1);
+}
+
+static int _clone_file_range(struct inode *src_inode, loff_t pos_in,
+			     struct inode *dst_inode, loff_t pos_out,
+			     u64 len, u64 len_up, int operation)
+{
+	struct zuf_inode_info *src_zii = ZUII(src_inode);
+	struct zuf_inode_info *dst_zii = ZUII(dst_inode);
+	struct zus_inode *dst_zi = dst_zii->zi;
+	struct super_block *sb = src_inode->i_sb;
+	struct zufs_ioc_clone ioc_clone = {
+		.hdr.in_len = sizeof(ioc_clone),
+		.hdr.out_len = sizeof(ioc_clone),
+		.hdr.operation = operation,
+		.src_zus_ii = src_zii->zus_ii,
+		.dst_zus_ii = dst_zii->zus_ii,
+		.pos_in = pos_in,
+		.pos_out = pos_out,
+		.len = len,
+		.len_up = len_up,
+	};
+	int err;
+
+	_lock_two_ziis(src_zii, dst_zii);
+
+	/* NOTE: len==0 means to-end-of-file which is what we want */
+	unmap_mapping_range(src_inode->i_mapping, pos_in,  len, 0);
+	unmap_mapping_range(dst_inode->i_mapping, pos_out, len, 0);
+
+	zus_inode_cmtime_now(dst_inode, dst_zi);
+	err = zufc_dispatch(ZUF_ROOT(SBI(sb)), &ioc_clone.hdr, NULL, 0);
+	if (unlikely(err && err != -EINTR)) {
+		zuf_err("failed to clone %ld -> %ld ; err=%d\n",
+			 src_inode->i_ino, dst_inode->i_ino, err);
+		goto out;
+	}
+
+	dst_inode->i_blocks = le64_to_cpu(dst_zi->i_blocks);
+	i_size_write(dst_inode, le64_to_cpu(dst_zi->i_size));
+
+out:
+	_unlock_two_ziis(src_zii, dst_zii);
+
+	return err;
+}
+
+static loff_t zuf_clone_file_range(struct file *file_in, loff_t pos_in,
+				struct file *file_out, loff_t pos_out,
+				loff_t len, uint remap_flags)
+{
+	struct inode *src_inode = file_inode(file_in);
+	struct inode *dst_inode = file_inode(file_out);
+	ulong src_size = i_size_read(src_inode);
+	ulong dst_size = i_size_read(dst_inode);
+	struct super_block *sb = src_inode->i_sb;
+	ulong len_up = len;
+	int err;
+
+	zuf_dbg_vfs(
+		"ino-in=%ld ino-out=%ld pos_in=0x%llx pos_out=0x%llx length=0x%llx\n",
+		src_inode->i_ino, dst_inode->i_ino, pos_in, pos_out, len);
+
+	if (remap_flags & ~REMAP_FILE_ADVISORY)
+		return -EINVAL;
+
+	if (src_inode == dst_inode) {
+		if (pos_in == pos_out) {
+			zuf_dbg_err("[%ld] Clone nothing!!\n",
+				src_inode->i_ino);
+			return 0;
+		}
+		if (pos_in < pos_out) {
+			if (pos_in + len > pos_out) {
+				zuf_dbg_err(
+					"[%ld] overlapping pos_in < pos_out?? => EINVAL\n",
+					src_inode->i_ino);
+				return -EINVAL;
+			}
+		} else {
+			if (pos_out + len > pos_in) {
+				zuf_dbg_err("[%ld] overlapping pos_out < pos_in?? => EINVAL\n",
+					src_inode->i_ino);
+				return -EINVAL;
+			}
+		}
+	}
+
+	if ((pos_in & (sb->s_blocksize - 1)) ||
+	    (pos_out & (sb->s_blocksize - 1))) {
+		zuf_err("[%ld] Not aligned len=0x%llx pos_in=0x%llx "
+			"pos_out=0x%llx src-size=0x%llx dst-size=0x%llx\n",
+			 src_inode->i_ino, len, pos_in, pos_out,
+			 i_size_read(src_inode), i_size_read(dst_inode));
+		return -EINVAL;
+	}
+
+	/* STD says that len==0 means up to end of SRC */
+	if (!len)
+		len_up = len = src_size - pos_in;
+
+	if (!pos_in && !pos_out && (src_size <= pos_in + len) &&
+	    (dst_size <= src_size)) {
+		len_up = 0;
+	} else if (len & (sb->s_blocksize - 1)) {
+		/* un-aligned len, see if it is beyond EOF */
+		if ((src_size > pos_in  + len) ||
+		    (dst_size > pos_out + len)) {
+			zuf_err("[%ld] Not aligned len=0x%llx pos_in=0x%llx "
+				"pos_out=0x%llx src-size=0x%lx dst-size=0x%lx\n",
+				src_inode->i_ino, len, pos_in, pos_out,
+				src_size, dst_size);
+			return -EINVAL;
+		}
+		len_up = md_p2o(md_o2p_up(len));
+	}
+
+	err = _clone_file_range(src_inode, pos_in, dst_inode, pos_out, len,
+				len_up, ZUFS_OP_CLONE);
+	if (unlikely(err))
+		zuf_err("_clone_file_range failed => %d\n", err);
+
+	return err ? err : len;
+}
+
+static ssize_t zuf_copy_file_range(struct file *file_in, loff_t pos_in,
+				   struct file *file_out, loff_t pos_out,
+				   size_t len, uint flags)
+{
+	struct inode *src_inode = file_inode(file_in);
+	struct inode *dst_inode = file_inode(file_out);
+	ssize_t ret;
+
+	zuf_dbg_vfs("ino-in=%ld ino-out=%ld pos_in=0x%llx pos_out=0x%llx length=0x%lx\n",
+		    src_inode->i_ino, dst_inode->i_ino, pos_in, pos_out, len);
+
+	ret = zuf_clone_file_range(file_in, pos_in, file_out, pos_out, len,
+				   REMAP_FILE_ADVISORY);
+
+	return ret ?: len;
+}
+
+/* ZUFS:
+ * make sure we clean up the resources consumed by zufs_init()
+ */
+static int zuf_file_release(struct inode *inode, struct file *filp)
+{
+	if (unlikely(filp->private_data))
+		zuf_err("not yet\n");
+
+	return 0;
+}
+
+static ssize_t zuf_read_iter(struct kiocb *kiocb, struct iov_iter *ii)
+{
+	struct inode *inode = file_inode(kiocb->ki_filp);
+	struct zuf_inode_info *zii = ZUII(inode);
+	ssize_t ret;
+
+	zuf_dbg_vfs("[%ld] ppos=0x%llx len=0x%zx\n",
+		     inode->i_ino, kiocb->ki_pos, iov_iter_count(ii));
+
+	file_accessed(kiocb->ki_filp);
+
+	zuf_r_lock(zii);
+
+	ret = zuf_rw_read_iter(inode->i_sb, inode, kiocb, ii);
+
+	zuf_r_unlock(zii);
+
+	zuf_dbg_vfs("[%ld] => 0x%lx\n", inode->i_ino, ret);
+	return ret;
+}
+
+static ssize_t zuf_write_iter(struct kiocb *kiocb, struct iov_iter *ii)
+{
+	struct inode *inode = file_inode(kiocb->ki_filp);
+	struct zuf_inode_info *zii = ZUII(inode);
+	ssize_t ret;
+
+	ret = generic_write_checks(kiocb, ii);
+	if (unlikely(ret < 0)) {
+		zuf_dbg_vfs("[%ld] generic_write_checks => 0x%lx\n",
+			    inode->i_ino, ret);
+		return ret;
+	}
+
+	zuf_r_lock(zii);
+
+	ret = file_remove_privs(kiocb->ki_filp);
+	if (unlikely(ret < 0))
+		goto out;
+
+	zus_inode_cmtime_now(inode, zii->zi);
+
+	ret = zuf_rw_write_iter(inode->i_sb, inode, kiocb, ii);
+	if (unlikely(ret < 0))
+		goto out;
+
+	if (i_size_read(inode) <= le64_to_cpu(zii->zi->i_size))
+		i_size_write(inode, le64_to_cpu(zii->zi->i_size));
+
+	inode->i_blocks = le64_to_cpu(zii->zi->i_blocks);
+
+out:
+	zuf_r_unlock(zii);
+
+	zuf_dbg_vfs("[%ld] => 0x%lx\n", inode->i_ino, ret);
+	return ret;
+}
+
 const struct file_operations zuf_file_operations = {
+	.llseek			= zuf_llseek,
+	.read_iter		= zuf_read_iter,
+	.write_iter		= zuf_write_iter,
 	.open			= generic_file_open,
+	.flush			= zuf_flush,
+	.release		= zuf_file_release,
+	.fallocate		= zuf_fallocate,
+	.copy_file_range	= zuf_copy_file_range,
+	.remap_file_range	= zuf_clone_file_range,
 };
 
 const struct inode_operations zuf_file_inode_operations = {
 	.setattr	= zuf_setattr,
 	.getattr	= zuf_getattr,
 	.update_time	= zuf_update_time,
+	.fiemap		= tozu_fiemap,
 };
diff --git a/fs/zuf/rw.c b/fs/zuf/rw.c
index 1eb8453da564..335bfd256499 100644
--- a/fs/zuf/rw.c
+++ b/fs/zuf/rw.c
@@ -23,3 +23,15 @@ int zuf_trim_edge(struct inode *inode, ulong filepos, uint len)
 {
 	return -EIO;
 }
+
+ssize_t zuf_rw_read_iter(struct super_block *sb, struct inode *inode,
+			 struct kiocb *kiocb, struct iov_iter *ii)
+{
+	return -EIO;
+}
+
+ssize_t zuf_rw_write_iter(struct super_block *sb, struct inode *inode,
+			  struct kiocb *kiocb, struct iov_iter *ii)
+{
+	return -EIO;
+}
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index 3a264e6475c4..96ffc6244daa 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -774,8 +774,12 @@ const char *zuf_op_name(enum e_zufs_operation op)
 		CASE_ENUM_NAME(ZUFS_OP_REMOVE_DENTRY	);
 		CASE_ENUM_NAME(ZUFS_OP_RENAME		);
 		CASE_ENUM_NAME(ZUFS_OP_READDIR		);
+		CASE_ENUM_NAME(ZUFS_OP_CLONE		);
+		CASE_ENUM_NAME(ZUFS_OP_COPY		);
 		CASE_ENUM_NAME(ZUFS_OP_GET_SYMLINK	);
 		CASE_ENUM_NAME(ZUFS_OP_SETATTR		);
+		CASE_ENUM_NAME(ZUFS_OP_FALLOCATE	);
+		CASE_ENUM_NAME(ZUFS_OP_LLSEEK		);
 		CASE_ENUM_NAME(ZUFS_OP_BREAK		);
 	default:
 		return "UNKNOWN";
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 74f69a12a263..32e8c2cae518 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -337,9 +337,13 @@ enum e_zufs_operation {
 	ZUFS_OP_REMOVE_DENTRY,
 	ZUFS_OP_RENAME,
 	ZUFS_OP_READDIR,
+	ZUFS_OP_CLONE,
+	ZUFS_OP_COPY,
 
 	ZUFS_OP_GET_SYMLINK,
 	ZUFS_OP_SETATTR,
+	ZUFS_OP_FALLOCATE,
+	ZUFS_OP_LLSEEK,
 
 	ZUFS_OP_BREAK,		/* Kernel telling Server to exit */
 	ZUFS_OP_MAX_OPT,
@@ -528,6 +532,47 @@ struct zufs_ioc_attr {
 	__u32 pad;
 };
 
+enum ZUFS_RANGE_FLAGS {
+	ZUFS_RF_DONTNEED		= 0x00000001,
+};
+
+/* ZUFS_OP_ISYNC, ZUFS_OP_FALLOCATE */
+struct zufs_ioc_range {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *zus_ii;
+	__u64 offset, length;
+	__u32 opflags;
+	__u32 ioc_flags;
+
+	/* OUT */
+	__u64 write_unmapped;
+};
+
+/* ZUFS_OP_CLONE */
+struct zufs_ioc_clone {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *src_zus_ii;
+	struct zus_inode_info *dst_zus_ii;
+	__u64 pos_in, pos_out;
+	__u64 len;
+	__u64 len_up;
+};
+
+/* ZUFS_OP_LLSEEK */
+struct zufs_ioc_seek {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *zus_ii;
+	__u64 offset_in;
+	__u32 whence;
+	__u32 pad;
+
+	/* OUT */
+	__u64 offset_out;
+};
+
 /* Allocate a special_file that will be a dual-port communication buffer with
  * user mode.
  * Server will access the buffer via the mmap of this file.
-- 
2.20.1
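
The argument checks in zuf_clone_file_range() in this patch come down to
two small pieces of arithmetic: rejecting a same-file clone whose source
and destination ranges intersect, and rounding an unaligned length up to
the block size. Below is a minimal user-space sketch of both. The demo_*
names are hypothetical, and the block size is assumed to be a power of
two, as sb->s_blocksize is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True when [pos_in, pos_in + len) and [pos_out, pos_out + len) intersect.
 * Equal positions with a non-zero length count as overlapping here; the
 * patch handles that case earlier as "clone nothing".
 */
static bool demo_ranges_overlap(uint64_t pos_in, uint64_t pos_out,
				uint64_t len)
{
	if (pos_in < pos_out)
		return pos_in + len > pos_out;
	return pos_out + len > pos_in;
}

/* Round len up to the next multiple of blocksize (a power of two). */
static uint64_t demo_round_up(uint64_t len, uint64_t blocksize)
{
	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);

	return (len + blocksize - 1) & ~(blocksize - 1);
}
```

Ranges that merely touch (one ends where the other starts) are not treated
as overlapping, matching the strict comparisons in the patch.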



* [RFC PATCH 11/17] zuf: Write/Read implementation
  2019-02-19 11:51 [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Boaz harrosh
                   ` (9 preceding siblings ...)
  2019-02-19 11:51 ` [RFC PATCH 10/17] zuf: More file operation Boaz harrosh
@ 2019-02-19 11:51 ` Boaz harrosh
  2019-02-19 11:51 ` [RFC PATCH 12/17] zuf: mmap & sync Boaz harrosh
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 31+ messages in thread
From: Boaz harrosh @ 2019-02-19 11:51 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro
  Cc: Ric Wheeler, Miklos Szeredi, Steven Whitehouse, Jefff moyer,
	Amir Goldstein, Amit Golander, Sagi Manole

From: Boaz Harrosh <boazh@netapp.com>

The dispatch to the server can operate on buffers of up to
4 megabytes. Larger operations are split up and dispatched
in chunks of this size.
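
To illustrate the splitting, here is a minimal user-space sketch that cuts
a request into pieces no larger than the limit and hands each one to a
callback. It is not the code in rw.c: the demo_* names, the callback
signature and the "return what was done so far on error" behaviour are
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_IO ((size_t)4 << 20)	/* 4 MiB per dispatch */

typedef int (*demo_dispatch_fn)(size_t pos, size_t len, void *ctx);

/*
 * Dispatch [pos, pos + len) in chunks of at most DEMO_MAX_IO bytes.
 * Returns the number of bytes dispatched, or a negative error if the
 * very first chunk failed.
 */
static long long demo_split_io(size_t pos, size_t len,
			       demo_dispatch_fn fn, void *ctx)
{
	size_t done = 0;

	assert(fn != NULL);
	while (done < len) {
		size_t chunk = len - done;
		int err;

		if (chunk > DEMO_MAX_IO)
			chunk = DEMO_MAX_IO;
		err = fn(pos + done, chunk, ctx);
		if (err)
			return done ? (long long)done : err;
		done += chunk;
	}
	return (long long)done;
}

/* Example callback: counts calls and remembers the largest chunk seen. */
struct demo_stats {
	unsigned calls;
	size_t max_chunk;
};

static int demo_record(size_t pos, size_t len, void *ctx)
{
	struct demo_stats *st = ctx;

	(void)pos;
	st->calls++;
	if (len > st->max_chunk)
		st->max_chunk = len;
	return 0;
}
```

With a 10 MiB request the callback is invoked for two full 4 MiB chunks
and one 2 MiB remainder.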

Also, if a multi-segment aio is used, each segment is dispatched
on its own. (TODO: this can easily be fixed with sg operations.)

On write, if any mmapped buffers changed, for example because holes
were newly allocated due to this write or a previously mmapped COW
page was written, a subrange of the written range can be returned
for the Kernel to call unmap_mapping_range() on.

rw.c here also includes some operations for mmap. These will be used
in the next patch.

Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/_extern.h  |  13 +-
 fs/zuf/_pr.h      |   1 +
 fs/zuf/rw.c       | 562 +++++++++++++++++++++++++++++++++++++++++++++-
 fs/zuf/zuf-core.c |  79 ++++++-
 fs/zuf/zus_api.h  | 191 ++++++++++++++++
 5 files changed, 841 insertions(+), 5 deletions(-)

diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index 391484b0e125..2905fe20cec7 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -46,13 +46,21 @@ bool zuf_dir_emit(struct super_block *sb, struct dir_context *ctx,
 uint zuf_prepare_symname(struct zufs_ioc_new_inode *ioc_new_inode,
 			const char *symname, ulong len, struct page *pages[2]);
 
-
 /* rw.c */
+int _zuf_get_put_block(struct zuf_sb_info *sbi, struct zuf_inode_info *zii,
+			  enum e_zufs_operation op, int rw, ulong index,
+			  struct zufs_ioc_IO *get_block);
+int zuf_rw_read_page(struct zuf_sb_info *sbi, struct inode *inode,
+		     struct page *page, u64 filepos);
 ssize_t zuf_rw_read_iter(struct super_block *sb, struct inode *inode,
 			 struct kiocb *kiocb, struct iov_iter *ii);
 ssize_t zuf_rw_write_iter(struct super_block *sb, struct inode *inode,
 			  struct kiocb *kiocb, struct iov_iter *ii);
 int zuf_trim_edge(struct inode *inode, ulong filepos, uint len);
+int zuf_iom_execute_sync(struct super_block *sb, struct inode *inode,
+			 __u64 *iom_e, uint iom_n);
+int zuf_iom_execute_async(struct super_block *sb, struct zus_iomap_build *iomb,
+			 __u64 *iom_e_user, uint iom_n);
 
 /* super.c */
 int zuf_init_inodecache(void);
@@ -61,6 +69,9 @@ void zuf_destroy_inodecache(void);
 struct dentry *zuf_mount(struct file_system_type *fs_type, int flags,
 			 const char *dev_name, void *data);
 
+struct super_block *zuf_sb_from_id(struct zuf_root_info *zri, __u64 sb_id,
+				   struct zus_sb_info *zus_sbi);
+
 /* zuf-core.c */
 int zufc_zts_init(struct zuf_root_info *zri); /* Some private types in core */
 void zufc_zts_fini(struct zuf_root_info *zri);
diff --git a/fs/zuf/_pr.h b/fs/zuf/_pr.h
index 85641b6f1478..151e127f513b 100644
--- a/fs/zuf/_pr.h
+++ b/fs/zuf/_pr.h
@@ -40,6 +40,7 @@
 /* ~~~ channel prints ~~~ */
 #define zuf_dbg_err(s, args ...)	zuf_chan_debug("error", s, ##args)
 #define zuf_dbg_vfs(s, args ...)	zuf_chan_debug("vfs  ", s, ##args)
+#define zuf_dbg_rw(s, args ...)		zuf_chan_debug("rw   ", s, ##args)
 #define zuf_dbg_t1(s, args ...)		zuf_chan_debug("t1   ", s, ##args)
 #define zuf_dbg_t2(s, args ...)		zuf_chan_debug("t2dbg", s, ##args)
 #define zuf_dbg_t2_rw(s, args ...)	zuf_chan_debug("t2grw", s, ##args)
diff --git a/fs/zuf/rw.c b/fs/zuf/rw.c
index 335bfd256499..400d24ea7914 100644
--- a/fs/zuf/rw.c
+++ b/fs/zuf/rw.c
@@ -18,20 +18,576 @@
 #include "zuf.h"
 #include "t2.h"
 
+#define	rand_tag(kiocb)	(kiocb->ki_filp->f_mode & FMODE_RANDOM)
+#define	kiocb_ra(kiocb)	(&kiocb->ki_filp->f_ra)
+
+static int _ioc_bounds_check(struct zufs_iomap *ziom,
+			     struct zufs_iomap *user_ziom, void *ziom_end)
+{
+	size_t iom_max_bytes = ziom_end - (void *)&user_ziom->iom_e;
+
+	if (unlikely((iom_max_bytes / sizeof(__u64) < ziom->iom_max))) {
+		zuf_err("kernel-buff-size(0x%zx) < ziom->iom_max(0x%x)\n",
+			(iom_max_bytes / sizeof(__u64)), ziom->iom_max);
+		return -EINVAL;
+	}
+
+	if (unlikely(ziom->iom_max < ziom->iom_n)) {
+		zuf_err("ziom->iom_max(0x%x) < ziom->iom_n(0x%x)\n",
+			ziom->iom_max, ziom->iom_n);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int rw_overflow_handler(struct zuf_dispatch_op *zdo, void *arg,
+			       ulong max_bytes)
+{
+	struct zufs_ioc_IO *io = container_of(zdo->hdr, typeof(*io), hdr);
+	struct zufs_ioc_IO *io_user = arg;
+	ulong flags = io->flags;
+	int err;
+
+	*io = *io_user;
+
+	/* FIXME: Why do we need to restore the original flags? */
+	io_user->flags = flags;
+
+	err = _ioc_bounds_check(&io->ziom, &io_user->ziom, arg + max_bytes);
+	if (unlikely(err))
+		return err;
+
+	if ((io->hdr.err == -EZUFS_RETRY) &&
+	    io->ziom.iom_n && _zufs_iom_pop(io->iom_e)) {
+
+		zuf_dbg_rw(
+			"[%s]zuf_iom_execute_sync(%d) max=0x%lx iom_e[%d] => %d\n",
+			zuf_op_name(io->hdr.operation), io->ziom.iom_n,
+			max_bytes, _zufs_iom_opt_type(io_user->iom_e),
+			io->hdr.err);
+
+		io->hdr.err = zuf_iom_execute_sync(zdo->sb, zdo->inode,
+						   io_user->iom_e,
+						   io->ziom.iom_n);
+		err = EZUF_RETRY_DONE;
+	} else {
+		if (io->hdr.err == -EZUFS_RETRY)
+			zuf_err("ZUSfs violating API\n");
+		err = 0;
+	}
+
+	return err;
+}
+
+static int _IO_dispatch(struct zuf_sb_info *sbi, struct zufs_ioc_IO *IO,
+			struct zuf_inode_info *zii, int operation,
+			uint pgoffset, struct page **pages, uint nump,
+			u64 filepos, uint len)
+{
+	struct zuf_dispatch_op zdo;
+	int err;
+
+	IO->hdr.operation = operation;
+	IO->hdr.in_len = sizeof(*IO);
+	IO->hdr.out_len = sizeof(*IO);
+	IO->hdr.offset = pgoffset;
+	IO->hdr.len = len;
+	IO->zus_ii = zii->zus_ii;
+	IO->filepos = filepos;
+	IO->last_pos = filepos;
+
+	zuf_dispatch_init(&zdo, &IO->hdr, pages, nump);
+	zdo.oh = rw_overflow_handler;
+	zdo.sb = sbi->sb;
+	zdo.inode = &zii->vfs_inode;
+
+	zuf_dbg_verbose("[%ld][%s] fp=0x%llx nump=0x%x len=0x%x\n",
+			zdo.inode ? zdo.inode->i_ino : -1,
+			zuf_op_name(operation), filepos, nump, len);
+
+	err = __zufc_dispatch(ZUF_ROOT(sbi), &zdo);
+	if (unlikely(err == -EZUFS_RETRY)) {
+		zuf_err("Unexpected ZUS return => %d\n", err);
+		err = -EIO;
+	}
+	return err;
+}
+
+int _zuf_get_put_block(struct zuf_sb_info *sbi, struct zuf_inode_info *zii,
+			  enum e_zufs_operation op, int rw, ulong index,
+			  struct zufs_ioc_IO *get_block)
+{
+	if (op == ZUFS_OP_GET_BLOCK)
+		get_block->gp_block.rw = rw;
+	/* for put keep untouched, return as was set by server */
+
+	return _IO_dispatch(sbi, get_block, zii, op, 0, NULL, 0,
+			    md_p2o(index), 0);
+}
+
+int zuf_rw_read_page(struct zuf_sb_info *sbi, struct inode *inode,
+		     struct page *page, u64 filepos)
+{
+	struct zufs_ioc_IO io = {};
+	struct page *pages[1];
+	uint nump;
+	int err;
+
+	pages[0] = page;
+	nump = 1;
+
+	err = _IO_dispatch(sbi, &io, ZUII(inode), ZUFS_OP_READ, 0, pages, nump,
+			   filepos, PAGE_SIZE);
+	return err;
+}
+
 /* ZERO a part of a single block. len does not cross a block boundary */
 int zuf_trim_edge(struct inode *inode, ulong filepos, uint len)
 {
-	return -EIO;
+	struct page *zero_page = ZERO_PAGE(0);
+	struct zufs_ioc_IO io = {};
+	struct page *pages[1];
+	uint nump;
+	int err;
+
+	pages[0] = zero_page;
+	nump = 1;
+
+	err = _IO_dispatch(SBI(inode->i_sb), &io, ZUII(inode), ZUFS_OP_WRITE,
+			   0, pages, nump, filepos, len);
+	return err;
+
+}
+
+static struct page *_addr_to_page(unsigned long addr)
+{
+	const void *p = (const void *)addr;
+
+	return is_vmalloc_addr(p) ? vmalloc_to_page(p) : virt_to_page(p);
+}
+
+static ssize_t _iov_iter_get_pages_kvec(struct iov_iter *ii,
+		   struct page **pages, size_t maxsize, uint maxpages,
+		   size_t *start)
+{
+	ssize_t bytes;
+	size_t i, nump;
+	unsigned long addr = (unsigned long)ii->kvec->iov_base;
+
+	*start = addr & (PAGE_SIZE - 1);
+	bytes = min_t(ssize_t, iov_iter_single_seg_count(ii), maxsize);
+	nump = min_t(size_t, DIV_ROUND_UP(bytes + *start, PAGE_SIZE), maxpages);
+
+	/* TODO: FUSE assumes single page for ITER_KVEC. Boaz: Remove? */
+	WARN_ON(nump > 1);
+
+	for (i = 0; i < nump; ++i) {
+		pages[i] = _addr_to_page(addr + (i * PAGE_SIZE));
+
+		get_page(pages[i]);
+	}
+	return bytes;
+}
+
+static ssize_t _iov_iter_get_pages_any(struct iov_iter *ii,
+		   struct page **pages, size_t maxsize, uint maxpages,
+		   size_t *start)
+{
+	ssize_t bytes;
+
+	bytes = unlikely(ii->type & ITER_KVEC) ?
+		_iov_iter_get_pages_kvec(ii, pages, maxsize, maxpages, start) :
+		iov_iter_get_pages(ii, pages, maxsize, maxpages, start);
+
+	if (unlikely(bytes < 0))
+		zuf_dbg_err("[%d] bytes=%ld type=%d count=%lu",
+			smp_processor_id(), bytes, ii->type, ii->count);
+
+	return bytes;
+}
+
+static ssize_t _zufs_IO(struct zuf_sb_info *sbi, struct inode *inode,
+			struct file_ra_state *ra, int operation,
+			struct iov_iter *ii, loff_t pos, ulong flags)
+{
+	int err = 0;
+	loff_t start_pos = pos;
+
+	while (iov_iter_count(ii)) {
+		struct zufs_ioc_IO io = {};
+		struct page *pages[ZUS_API_MAP_MAX_PAGES];
+		uint nump;
+		ssize_t bytes;
+		size_t pgoffset;
+		uint i;
+
+		if (ra) {
+			io.ra.start		= ra->start;
+			io.ra.ra_pages	= ra->ra_pages;
+			io.ra.prev_pos	= ra->prev_pos;
+		}
+		io.flags = flags;
+
+		bytes = _iov_iter_get_pages_any(ii, pages,
+					ZUS_API_MAP_MAX_SIZE,
+					ZUS_API_MAP_MAX_PAGES, &pgoffset);
+		if (unlikely(bytes < 0)) {
+			err = bytes;
+			break;
+		}
+
+		nump = DIV_ROUND_UP(bytes + pgoffset, PAGE_SIZE);
+
+		err = _IO_dispatch(sbi, &io, ZUII(inode), operation,
+				   pgoffset, pages, nump, pos, bytes);
+
+		bytes = io.last_pos - pos;
+
+		iov_iter_advance(ii, bytes);
+		pos += bytes;
+
+		if (ra) {
+			ra->start	= io.ra.start;
+			ra->ra_pages	= io.ra.ra_pages;
+			ra->prev_pos	= io.ra.prev_pos;
+		}
+		if (io.wr_unmap.len)
+			unmap_mapping_range(inode->i_mapping,
+					    io.wr_unmap.offset,
+					    io.wr_unmap.len, 0);
+
+		for (i = 0; i < nump; ++i)
+			put_page(pages[i]);
+
+		if (unlikely(err))
+			break;
+	}
+
+	if (unlikely(pos == start_pos))
+		return err;
+	return pos - start_pos;
 }
 
 ssize_t zuf_rw_read_iter(struct super_block *sb, struct inode *inode,
 			 struct kiocb *kiocb, struct iov_iter *ii)
 {
-	return -EIO;
+	ssize_t ret;
+
+	/* EOF protection */
+	if (unlikely(kiocb->ki_pos > i_size_read(inode)))
+		return 0;
+
+	iov_iter_truncate(ii, i_size_read(inode) - kiocb->ki_pos);
+	if (unlikely(!iov_iter_count(ii))) {
+		/* Don't let zero len reads have any effect */
+		zuf_dbg_rw("called with NULL len\n");
+		return 0;
+	}
+
+	ret = _zufs_IO(SBI(sb), inode, kiocb_ra(kiocb), ZUFS_OP_READ, ii,
+		       kiocb->ki_pos, rand_tag(kiocb) ? ZUFS_IO_RAND : 0);
+	if (unlikely(ret < 0))
+		return ret;
+
+	kiocb->ki_pos += ret;
+	return ret;
 }
 
 ssize_t zuf_rw_write_iter(struct super_block *sb, struct inode *inode,
 			  struct kiocb *kiocb, struct iov_iter *ii)
 {
-	return -EIO;
+	ssize_t ret;
+	ulong flags = 0;
+
+	if (kiocb->ki_filp->f_flags & O_DSYNC ||
+	    IS_SYNC(kiocb->ki_filp->f_mapping->host))
+		flags = ZUFS_IO_DSYNC;
+	if (kiocb->ki_filp->f_flags & O_DIRECT)
+		flags |= ZUFS_IO_DIRECT;
+
+	ret = _zufs_IO(SBI(inode->i_sb), inode, NULL, ZUFS_OP_WRITE, ii,
+		       kiocb->ki_pos, flags);
+	if (unlikely(ret < 0))
+		return ret;
+
+	kiocb->ki_pos += ret;
+	return ret;
+}
+
+/* ~~~~ iom_dec.c ~~~ */
+/* for now here (at rw.c) looks logical */
+
+static int __iom_add_t2_io_len(struct super_block *sb, struct t2_io_state *tis,
+			       zu_dpp_t t1, ulong t2_bn, __u64 num_pages)
+{
+	void *ptr;
+	struct page *page;
+	int i, err;
+
+	ptr = zuf_dpp_t_addr(sb, t1);
+	if (unlikely(!ptr)) {
+		zuf_err("Bad t1 zu_dpp_t t1=0x%llx t2=0x%lx num_pages=0x%llx\n",
+			t1, t2_bn, num_pages);
+		return -EFAULT; /* zuf_dpp_t_addr already yelled */
+	}
+
+	page = virt_to_page(ptr);
+	if (unlikely(!page)) {
+		zuf_err("bad t1(0x%llx)\n", t1);
+		return -EFAULT;
+	}
+
+	for (i = 0; i < num_pages; ++i) {
+		err = t2_io_add(tis, t2_bn++, page++);
+		if (unlikely(err))
+			return err;
+	}
+	return 0;
+}
+
+static int iom_add_t2_io_len(struct super_block *sb, struct t2_io_state *tis,
+			     __u64 **cur_e)
+{
+	struct zufs_iom_t2_io_len *t2iol = (void *)*cur_e;
+	int err = __iom_add_t2_io_len(sb, tis, t2iol->iom.t1_val,
+				      _zufs_iom_first_val(&t2iol->iom.t2_val),
+				      t2iol->num_pages);
+
+	*cur_e = (void *)(t2iol + 1);
+	return err;
+}
+
+static int iom_add_t2_io(struct super_block *sb, struct t2_io_state *tis,
+			 __u64 **cur_e)
+{
+	struct zufs_iom_t2_io *t2io = (void *)*cur_e;
+
+	int err = __iom_add_t2_io_len(sb, tis, t2io->t1_val,
+				      _zufs_iom_first_val(&t2io->t2_val), 1);
+
+	*cur_e = (void *)(t2io + 1);
+	return err;
+}
+
+static int iom_t2_zusmem_io(struct super_block *sb, struct t2_io_state *tis,
+			    __u64 **cur_e)
+{
+	struct zufs_iom_t2_zusmem_io *mem_io = (void *)*cur_e;
+	ulong t2_bn = _zufs_iom_first_val(&mem_io->t2_val);
+	ulong user_ptr = (ulong)mem_io->zus_mem_ptr;
+	int rw = _zufs_iom_opt_type(*cur_e) == IOM_T2_ZUSMEM_WRITE ?
+						WRITE : READ;
+	int num_p = md_o2p_up(mem_io->len);
+	int num_p_r;
+	struct page *pages[16];
+	int i, err = 0;
+
+	if (16 < num_p) {
+		zuf_err("num_p(%d) > 16\n", num_p);
+		return -EINVAL;
+	}
+
+	num_p_r = get_user_pages_fast(user_ptr, num_p, rw,
+				      pages);
+	if (num_p_r != num_p) {
+		zuf_err("!!!! get_user_pages_fast num_p_r(%d) != num_p(%d)\n",
+			num_p_r, num_p);
+		err = -EFAULT;
+		goto out;
+	}
+
+	for (i = 0; i < num_p_r && !err; ++i)
+		err = t2_io_add(tis, t2_bn++, pages[i]);
+
+out:
+	for (i = 0; i < num_p_r; ++i)
+		put_page(pages[i]);
+
+	*cur_e = (void *)(mem_io + 1);
+	return err;
+}
+
+static int iom_unmap(struct super_block *sb, struct inode *inode, __u64 **cur_e)
+{
+	struct zufs_iom_unmap *iom_unmap = (void *)*cur_e;
+	struct inode *inode_look = NULL;
+	ulong	unmap_index = _zufs_iom_first_val(&iom_unmap->unmap_index);
+	ulong	unmap_n = iom_unmap->unmap_n;
+	ulong	ino = iom_unmap->ino;
+
+	if (!inode || ino) {
+		if (WARN_ON(!ino)) {
+			zuf_err("[%ld] 0x%lx-0x%lx\n",
+				inode ? inode->i_ino : -1, unmap_index,
+				unmap_n);
+			goto out;
+		}
+		inode_look = ilookup(sb, ino);
+		if (!inode_look) {
+			/* Between the time we requested the unmap and now,
+			 * the inode was evicted from cache, so it surely no
+			 * longer has any mappings. The job was already done
+			 * for us. Even if a racing thread reloads the inode it
+			 * will not have the mapping we wanted to clear, only
+			 * new ones.
+			 * TODO: For now warn when this happens, because in
+			 *    current usage it cannot happen. But before
+			 *    upstream we should convert to zuf_dbg_err
+			 */
+			zuf_warn("[%ld] 0x%lx-0x%lx\n",
+				 ino, unmap_index, unmap_n);
+			goto out;
+		}
+
+		inode = inode_look;
+	}
+
+	zuf_dbg_rw("[%ld] 0x%lx-0x%lx\n", inode->i_ino, unmap_index, unmap_n);
+
+	unmap_mapping_range(inode->i_mapping, md_p2o(unmap_index),
+			    md_p2o(unmap_n), 0);
+
+	if (inode_look)
+		iput(inode_look);
+
+out:
+	*cur_e = (void *)(iom_unmap + 1);
+	return 0;
+}
+
+struct _iom_exec_info {
+	struct super_block *sb;
+	struct inode *inode;
+	struct t2_io_state *rd_tis;
+	struct t2_io_state *wr_tis;
+	__u64 *iom_e;
+	uint iom_n;
+	bool print;
+};
+
+static int _iom_execute_inline(struct _iom_exec_info *iei)
+{
+	__u64 *cur_e, *end_e;
+	int err = 0;
+#ifdef CONFIG_ZUF_DEBUG
+	uint wrs = 0;
+	uint rds = 0;
+	uint uns = 0;
+	uint wrmem = 0;
+	uint rdmem = 0;
+#	define	WRS()	(++wrs)
+#	define	RDS()	(++rds)
+#	define	UNS()	(++uns)
+#	define	WRMEM()	(++wrmem)
+#	define	RDMEM()	(++rdmem)
+#else
+#	define	WRS()
+#	define	RDS()
+#	define	UNS()
+#	define	WRMEM()
+#	define	RDMEM()
+#endif /* !def CONFIG_ZUF_DEBUG */
+
+	cur_e =  iei->iom_e;
+	end_e = cur_e + iei->iom_n;
+	while (cur_e && (cur_e < end_e)) {
+		uint op;
+
+		op = _zufs_iom_opt_type(cur_e);
+
+		switch (op) {
+		case IOM_NONE:
+			return 0;
+
+		case IOM_T2_WRITE:
+			err = iom_add_t2_io(iei->sb, iei->wr_tis, &cur_e);
+			WRS();
+			break;
+		case IOM_T2_READ:
+			err = iom_add_t2_io(iei->sb, iei->rd_tis, &cur_e);
+			RDS();
+			break;
+
+		case IOM_T2_WRITE_LEN:
+			err = iom_add_t2_io_len(iei->sb, iei->wr_tis, &cur_e);
+			WRS();
+			break;
+		case IOM_T2_READ_LEN:
+			err = iom_add_t2_io_len(iei->sb, iei->rd_tis, &cur_e);
+			RDS();
+			break;
+
+		case IOM_T2_ZUSMEM_WRITE:
+			err = iom_t2_zusmem_io(iei->sb, iei->wr_tis, &cur_e);
+			WRMEM();
+			break;
+		case IOM_T2_ZUSMEM_READ:
+			err = iom_t2_zusmem_io(iei->sb, iei->rd_tis, &cur_e);
+			RDMEM();
+			break;
+
+		case IOM_UNMAP:
+			err = iom_unmap(iei->sb, iei->inode, &cur_e);
+			UNS();
+			break;
+
+		default:
+			zuf_err("!!!!! Bad opt %d\n",
+				_zufs_iom_opt_type(cur_e));
+			err = -EIO;
+			break;
+		}
+
+		if (unlikely(err))
+			break;
+	}
+
+#ifdef CONFIG_ZUF_DEBUG
+	zuf_dbg_rw("exec wrs=%d rds=%d uns=%d rdmem=%d wrmem=%d => %d\n",
+		   wrs, rds, uns, rdmem, wrmem, err);
+#endif
+
+	return err;
+}
+
+/* inode here is the default inode, used when iom_unmap->ino is zero.
+ * This is an optimization for the unmap done at the write_iter hot path.
+ */
+int zuf_iom_execute_sync(struct super_block *sb, struct inode *inode,
+			 __u64 *iom_e_user, uint iom_n)
+{
+	struct zuf_sb_info *sbi = SBI(sb);
+	struct t2_io_state rd_tis = {};
+	struct t2_io_state wr_tis = {};
+	struct _iom_exec_info iei = {};
+	int err, err_r, err_w;
+
+	t2_io_begin(sbi->md, READ, NULL, 0, -1, &rd_tis);
+	t2_io_begin(sbi->md, WRITE, NULL, 0, -1, &wr_tis);
+
+	iei.sb = sb;
+	iei.inode = inode;
+	iei.rd_tis = &rd_tis;
+	iei.wr_tis = &wr_tis;
+	iei.iom_e = iom_e_user;
+	iei.iom_n = iom_n;
+	iei.print = 0;
+
+	err = _iom_execute_inline(&iei);
+
+	err_r = t2_io_end(&rd_tis, true);
+	err_w = t2_io_end(&wr_tis, true);
+
+	/* TODO: not sure if this is OK when _iom_execute_inline returns
+	 * -ENOMEM. In such a case, we might be better off skipping t2_io_ends.
+	 */
+	return err ?: (err_r ?: err_w);
+}
+
+int zuf_iom_execute_async(struct super_block *sb, struct zus_iomap_build *iomb,
+			 __u64 *iom_e_user, uint iom_n)
+{
+	zuf_err("Async IOM NOT supported Yet!!!\n");
+	return -EFAULT;
 }
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index 96ffc6244daa..371c2e93dd81 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -657,7 +657,8 @@ static int _zu_wait(struct file *file, void *parg)
 		 * we should have a bit set in zt->zdo->hdr set per operation.
 		 * TODO: Why this does not work?
 		 */
-		_map_pages(zt, zt->zdo->pages, zt->zdo->nump, 0);
+		_map_pages(zt, zt->zdo->pages, zt->zdo->nump,
+			   zt->zdo->hdr->operation == ZUFS_OP_WRITE);
 		memcpy(zt->opt_buff, zt->zdo->hdr, zt->zdo->hdr->in_len);
 	} else {
 		struct zufs_ioc_hdr *hdr = zt->opt_buff;
@@ -776,6 +777,10 @@ const char *zuf_op_name(enum e_zufs_operation op)
 		CASE_ENUM_NAME(ZUFS_OP_READDIR		);
 		CASE_ENUM_NAME(ZUFS_OP_CLONE		);
 		CASE_ENUM_NAME(ZUFS_OP_COPY		);
+		CASE_ENUM_NAME(ZUFS_OP_READ		);
+		CASE_ENUM_NAME(ZUFS_OP_WRITE		);
+		CASE_ENUM_NAME(ZUFS_OP_GET_BLOCK	);
+		CASE_ENUM_NAME(ZUFS_OP_PUT_BLOCK	);
 		CASE_ENUM_NAME(ZUFS_OP_GET_SYMLINK	);
 		CASE_ENUM_NAME(ZUFS_OP_SETATTR		);
 		CASE_ENUM_NAME(ZUFS_OP_FALLOCATE	);
@@ -832,6 +837,30 @@ static inline struct zu_exec_buff *_ebuff_from_file(struct file *file)
 	return ebuff;
 }
 
+static int _ebuff_bounds_check(struct zu_exec_buff *ebuff, ulong buff,
+			       struct zufs_iomap *ziom,
+			       struct zufs_iomap *user_ziom, void *ziom_end)
+{
+	size_t iom_max_bytes = ziom_end - (void *)&user_ziom->iom_e;
+
+	if (buff != ebuff->vma->vm_start ||
+	    ebuff->vma->vm_end < buff + iom_max_bytes) {
+		WARN_ON_ONCE(1);
+		zuf_err("Executing out of bounds vm_start=0x%lx vm_end=0x%lx buff=0x%lx buff_end=0x%lx\n",
+			ebuff->vma->vm_start, ebuff->vma->vm_end, buff,
+			buff + iom_max_bytes);
+		return -EINVAL;
+	}
+
+	if (unlikely((iom_max_bytes / sizeof(__u64) < ziom->iom_max)))
+		return -EINVAL;
+
+	if (unlikely(ziom->iom_max < ziom->iom_n))
+		return -EINVAL;
+
+	return 0;
+}
+
 static int _zu_ebuff_alloc(struct file *file, void *arg)
 {
 	struct zufs_ioc_alloc_buffer ioc_alloc;
@@ -884,6 +913,52 @@ static void zufc_ebuff_release(struct file *file)
 	kfree(ebuff);
 }
 
+static int _zu_iomap_exec(struct file *file, void *arg)
+{
+	struct zuf_root_info *zri = ZRI(file->f_inode->i_sb);
+	struct zu_exec_buff *ebuff = _ebuff_from_file(file);
+	struct zufs_ioc_iomap_exec ioc_iomap;
+	struct zufs_ioc_iomap_exec *user_iomap;
+
+	struct super_block *sb;
+	int err;
+
+	if (unlikely(!ebuff))
+		return -EINVAL;
+
+	user_iomap = ebuff->opt_buff;
+	/* do all checks on a kernel copy so malicious Server cannot
+	 * crash the Kernel
+	 */
+	ioc_iomap = *user_iomap;
+
+	err = _ebuff_bounds_check(ebuff, (ulong)arg, &ioc_iomap.ziom,
+				  &user_iomap->ziom,
+				  ebuff->opt_buff + ebuff->alloc_size);
+	if (unlikely(err)) {
+		zuf_err("illegal iomap: iom_max=%u iom_n=%u\n",
+			ioc_iomap.ziom.iom_max, ioc_iomap.ziom.iom_n);
+		return err;
+	}
+
+	/* The ID of the super block received in mount */
+	sb = zuf_sb_from_id(zri, ioc_iomap.sb_id, ioc_iomap.zus_sbi);
+	if (unlikely(!sb))
+		return -EINVAL;
+
+	if (ioc_iomap.wait_for_done)
+		err = zuf_iom_execute_sync(sb, NULL, user_iomap->ziom.iom_e,
+					   ioc_iomap.ziom.iom_n);
+	else
+		err = zuf_iom_execute_async(sb, ioc_iomap.ziom.iomb,
+					     user_iomap->ziom.iom_e,
+					     ioc_iomap.ziom.iom_n);
+
+	user_iomap->hdr.err = err;
+	zuf_dbg_core("OUT => %d\n", err);
+	return 0; /* report err at hdr, but the command was executed */
+}
+
 static int _zu_break(struct file *filp, void *parg)
 {
 	struct zuf_root_info *zri = ZRI(filp->f_inode->i_sb);
@@ -928,6 +1003,8 @@ long zufc_ioctl(struct file *file, unsigned int cmd, ulong arg)
 		return _zu_wait(file, parg);
 	case ZU_IOC_ALLOC_BUFFER:
 		return _zu_ebuff_alloc(file, parg);
+	case ZU_IOC_IOMAP_EXEC:
+		return _zu_iomap_exec(file, parg);
 	case ZU_IOC_BREAK_ALL:
 		return _zu_break(file, parg);
 	default:
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 32e8c2cae518..26b7b56f96c4 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -340,6 +340,10 @@ enum e_zufs_operation {
 	ZUFS_OP_CLONE,
 	ZUFS_OP_COPY,
 
+	ZUFS_OP_READ,
+	ZUFS_OP_WRITE,
+	ZUFS_OP_GET_BLOCK,
+	ZUFS_OP_PUT_BLOCK,
 	ZUFS_OP_GET_SYMLINK,
 	ZUFS_OP_SETATTR,
 	ZUFS_OP_FALLOCATE,
@@ -573,6 +577,116 @@ struct zufs_ioc_seek {
 	__u64 offset_out;
 };
 
+/* ~~~~ io_map structures && IOCTL(s) ~~~~ */
+/*
+ * This set of structures and helpers is used in the return of zufs_ioc_IO and
+ * also at ZU_IOC_IOMAP_EXEC, as a NULL-terminated list (array).
+ *
+ * Each iom_element starts with a __u64 of which the 8 high bits carry an
+ * operation_type, and the low 56 bits denote a page offset (md_o2p()) or a
+ * length. operation_type is one of the ZUFS_IOM_TYPE enum.
+ * The interpreter then jumps to the next operation depending on the size
+ * of the defined operation.
+ */
+
+enum ZUFS_IOM_TYPE {
+	IOM_NONE	= 0,
+	IOM_T1_WRITE	= 1,
+	IOM_T1_READ	= 2,
+
+	IOM_T2_WRITE	= 3,
+	IOM_T2_READ	= 4,
+	IOM_T2_WRITE_LEN = 5,
+	IOM_T2_READ_LEN	= 6,
+
+	IOM_T2_ZUSMEM_WRITE = 7,
+	IOM_T2_ZUSMEM_READ = 8,
+
+	IOM_UNMAP	= 9,
+
+	IOM_NUM_LEGAL_OPT,
+};
+
+#define ZUFS_IOM_VAL_BITS	56
+#define ZUFS_IOM_FIRST_VAL_MASK ((1ULL << ZUFS_IOM_VAL_BITS) - 1)
+
+static inline ulong _zufs_iom_first_val(__u64 *iom_elemets)
+{
+	return *iom_elemets & ZUFS_IOM_FIRST_VAL_MASK;
+}
+
+static inline enum ZUFS_IOM_TYPE _zufs_iom_opt_type(__u64 *iom_e)
+{
+	uint ret = (*iom_e) >> ZUFS_IOM_VAL_BITS;
+
+	if (ret >= IOM_NUM_LEGAL_OPT)
+		return IOM_NONE;
+	return ret;
+}
+
+static inline bool _zufs_iom_pop(__u64 *iom_e)
+{
+	return _zufs_iom_opt_type(iom_e) != IOM_NONE;
+}
+
+/* IOM_T2_WRITE / IOM_T2_READ */
+struct zufs_iom_t2_io {
+	__u64	t2_val;
+	zu_dpp_t t1_val;
+};
+
+/* IOM_T2_WRITE_LEN / IOM_T2_READ_LEN */
+struct zufs_iom_t2_io_len {
+	struct zufs_iom_t2_io iom;
+	__u64 num_pages;
+} __packed;
+
+/* IOM_T2_ZUSMEM_WRITE / IOM_T2_ZUSMEM_READ */
+struct zufs_iom_t2_zusmem_io {
+	__u64	t2_val;
+	__u64	zus_mem_ptr; /* needs a get_user_pages() */
+	__u64	len;
+};
+
+/* IOM_UNMAP:
+ *	Executes unmap_mapping_range & remove of zuf's block-caching
+ *
+ * For now iom_unmap means even_cows=0, because Kernel takes care of all
+ * the cases of the even_cows=1. In future if needed it will be on the high
+ * bit of unmap_n.
+ */
+struct zufs_iom_unmap {
+	__u64	unmap_index;	/* Offset in pages of inode */
+	__u64	unmap_n;	/* Num pages to unmap (0 means: to eof) */
+	__u64	ino;		/* Pages of this inode */
+} __packed;
+
+#define ZUFS_WRITE_OP_SPACE						\
+	((sizeof(struct zufs_iom_unmap) +				\
+	  sizeof(struct zufs_iom_t2_io)) / sizeof(__u64) + sizeof(__u64))
+
+struct zus_iomap_build;
+/* For ZUFS_OP_IOM_DONE */
+struct zufs_ioc_iomap_done {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_sb_info *zus_sbi;
+
+	/* The cookie received from zufs_ioc_iomap_exec */
+	struct	zus_iomap_build *iomb;
+};
+
+struct zufs_iomap {
+	/* A cookie from zus to return when execution is done */
+	struct	zus_iomap_build *iomb;
+
+	__u32	iom_max;	/* num of __u64 allocated	 */
+	__u32	iom_n;		/* num of valid __u64 in iom_e	 */
+	__u64	iom_e[];	/* encoded operations to execute */
+
+	/* This struct must be last */
+};
+
 /* Allocate a special_file that will be a dual-port communication buffer with
  * user mode.
  * Server will access the buffer via the mmap of this file.
@@ -604,4 +718,81 @@ struct zufs_ioc_alloc_buffer {
 };
 #define ZU_IOC_ALLOC_BUFFER	_IOWR('Z', 17, struct zufs_ioc_init)
 
+/*
+ * Execute an iomap on behalf of the Server
+ *
+ * NOTE: this IOCTL must be issued on a file of the above ZU_IOC_ALLOC_BUFFER
+ * type, and the passed arg-buffer must be the pointer returned from an mmap
+ * call performed on that file, before the call to this IOC.
+ * If this is not done the IOCTL will return EINVAL.
+ */
+struct zufs_ioc_iomap_exec {
+	struct zufs_ioc_hdr hdr;
+	/* The ID of the super block received in mount */
+	__u64	sb_id;
+	/* We verify the sb_id validity against zus_sbi */
+	struct zus_sb_info *zus_sbi;
+	/* If there are application buffers, they are from this IO */
+	__u64	zt_iocontext;
+	/* Only return from IOCTL when finished. iomap_done NOT called */
+	__u32	wait_for_done;
+	__u32	__pad;
+
+	struct zufs_iomap ziom; /* must be last */
+};
+#define ZU_IOC_IOMAP_EXEC	_IOWR('Z', 18, struct zufs_ioc_iomap_exec)
+
+/*
+ * ZUFS_OP_READ / ZUFS_OP_WRITE
+ *       also
+ * ZUFS_OP_GET_BLOCK / ZUFS_OP_PUT_BLOCK
+ */
+/* flags for gp_block->ret_flags */
+enum {
+	ZUFS_GBF_RESERVED = 1,	/* Not used */
+	ZUFS_GBF_NEW = 2,	/* In write, a new block was allocated; needs unmap */
+};
+
+/* zufs_ioc_IO extra write flags */
+#define ZUFS_IO_DSYNC	(1 << 0)
+#define ZUFS_IO_DIRECT	(1 << 1)
+#define ZUFS_IO_RAND	(1 << 2)
+
+struct zufs_ioc_IO {
+	struct zufs_ioc_hdr hdr;
+
+	/* IN */
+	struct zus_inode_info *zus_ii;
+	__u64 filepos;
+	__u64 flags;		/* read/write flags IN */
+
+	/* in / OUT */
+	union {
+		/* For reads */
+		struct __zufs_ra {
+			ulong start;
+			__u64 prev_pos;
+			__u32 ra_pages;
+			__u32 ra_pad; /* we need this*/
+		} ra;
+		/* For writes */
+		struct __zufs_write_unmap {
+			__u32  offset;
+			__u32  len;
+		} wr_unmap;
+		struct __zufs_get_put_block {
+			zu_dpp_t pmem_bn; /* zero means: map-read a hole */
+			__u32 rw;	  /* rw flags also return from GB */
+			__u32 ret_flags;  /* One of ZUFS_GBF_XXX */
+			void *priv;	  /**/
+		} gp_block;
+	};
+
+	/* The i_size I saw in this IO. If 0, then error code at .hdr.err */
+	__u64 last_pos;
+
+	struct zufs_iomap ziom;
+	__u64 iom_e[ZUFS_WRITE_OP_SPACE]; /* One tier_up for WRITE or GB */
+};
+
 #endif /* _LINUX_ZUFS_API_H */
-- 
2.20.1
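
The iom_e encoding described in zus_api.h above (8 high bits of operation
type, 56 low bits of value, variable-length elements) can be sketched in
plain user-space C as below. This is only an illustration, not the patch's
code: iom_pack(), iom_words() and iom_count_ops() are hypothetical helpers,
the element sizes assume zu_dpp_t is a 64-bit value, and the "element runs
past the end" check is an addition that the kernel interpreter shown above
does not have. The sketch also stops at the IOM_T1_* types, whereas the
interpreter above rejects them with -EIO.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values mirror enum ZUFS_IOM_TYPE and ZUFS_IOM_VAL_BITS from zus_api.h. */
enum iom_type {
	IOM_NONE = 0,
	IOM_T1_WRITE = 1,
	IOM_T1_READ = 2,
	IOM_T2_WRITE = 3,
	IOM_T2_READ = 4,
	IOM_T2_WRITE_LEN = 5,
	IOM_T2_READ_LEN = 6,
	IOM_T2_ZUSMEM_WRITE = 7,
	IOM_T2_ZUSMEM_READ = 8,
	IOM_UNMAP = 9,
	IOM_NUM_LEGAL_OPT,
};

#define IOM_VAL_BITS 56
#define IOM_VAL_MASK ((UINT64_C(1) << IOM_VAL_BITS) - 1)

/* Hypothetical helper: pack an operation type and a 56-bit value. */
static uint64_t iom_pack(enum iom_type type, uint64_t val)
{
	assert(type < IOM_NUM_LEGAL_OPT);
	return ((uint64_t)type << IOM_VAL_BITS) | (val & IOM_VAL_MASK);
}

/* Same rule as _zufs_iom_opt_type(): unknown types decode as IOM_NONE. */
static enum iom_type iom_opt_type(uint64_t e)
{
	uint64_t t = e >> IOM_VAL_BITS;

	return t >= IOM_NUM_LEGAL_OPT ? IOM_NONE : (enum iom_type)t;
}

/* Same rule as _zufs_iom_first_val(): keep the low 56 bits. */
static uint64_t iom_first_val(uint64_t e)
{
	return e & IOM_VAL_MASK;
}

/* Element size in __u64 words; 0 for types this sketch does not walk. */
static size_t iom_words(enum iom_type type)
{
	switch (type) {
	case IOM_T2_WRITE:
	case IOM_T2_READ:
		return 2;	/* t2_val, t1_val */
	case IOM_T2_WRITE_LEN:
	case IOM_T2_READ_LEN:
	case IOM_T2_ZUSMEM_WRITE:
	case IOM_T2_ZUSMEM_READ:
	case IOM_UNMAP:
		return 3;	/* three __u64 fields each */
	default:
		return 0;
	}
}

/*
 * Count the operations in iom_e[0..n). Stops at IOM_NONE (or a type this
 * sketch does not walk); returns -1 when an element would run past the end.
 */
static int iom_count_ops(const uint64_t *iom_e, size_t n)
{
	size_t i = 0;
	int ops = 0;

	while (i < n) {
		size_t w = iom_words(iom_opt_type(iom_e[i]));

		if (!w)
			break;
		if (w > n - i)
			return -1;
		i += w;
		++ops;
	}
	return ops;
}
```

Stepping by a per-type size is what lets one flat __u64 array carry mixed
operations without a separate length field per element.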



* [RFC PATCH 12/17] zuf: mmap & sync
  2019-02-19 11:51 [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Boaz harrosh
                   ` (10 preceding siblings ...)
  2019-02-19 11:51 ` [RFC PATCH 11/17] zuf: Write/Read implementation Boaz harrosh
@ 2019-02-19 11:51 ` Boaz harrosh
  2019-02-19 11:51 ` [RFC PATCH 13/17] zuf: ioctl implementation Boaz harrosh
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 31+ messages in thread
From: Boaz harrosh @ 2019-02-19 11:51 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro
  Cc: Ric Wheeler, Miklos Szeredi, Steven Whitehouse, Jefff moyer,
	Amir Goldstein, Amit Golander, Sagi Manole

From: Boaz Harrosh <boazh@netapp.com>

On page-fault, call the zusFS for the page information. We always
mmap pmem pages directly (no page cache).

With write-mmap on pmem we need to keep track of dirty inodes
and call the zusFS when one of the sync variants is called.

This is because the Server will need to do a cl_flush on all
dirty pages.

If the inode was never write-mmapped we do nothing on sync.

Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile   |   2 +-
 fs/zuf/_extern.h  |  10 ++
 fs/zuf/_pr.h      |   1 +
 fs/zuf/file.c     |  65 +++++++++
 fs/zuf/inode.c    |  10 ++
 fs/zuf/mmap.c     | 339 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/super.c    |  91 +++++++++++++
 fs/zuf/zuf-core.c |   2 +
 fs/zuf/zuf.h      |   3 +
 fs/zuf/zus_api.h  |   9 ++
 10 files changed, 531 insertions(+), 1 deletion(-)
 create mode 100644 fs/zuf/mmap.c
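
The dirty tracking described in the commit message can be modelled with a
toy user-space sketch. It mirrors the range clamping and the early exits of
zuf_isync() below, under loud assumptions: struct toy_inode, toy_sync_inc()
and toy_isync() are hypothetical names, and because zuf_sync_dec() lives in
super.c, the sketch simply assumes the Server cleaned every dirty page and
resets the counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-inode state; only two fields modelled. */
struct toy_inode {
	atomic_int write_mapped;	/* write faults seen since last sync */
	uint64_t i_size;
};

/* Called from the write-fault path, like zuf_sync_inc() in the patch. */
static void toy_sync_inc(struct toy_inode *ti)
{
	atomic_fetch_add(&ti->write_mapped, 1);
}

/*
 * Clamp [start, end] (end is inclusive) to the file size and report in *len
 * how many bytes would be handed to the Server for syncing.
 * Returns 0 with *len == 0 when there is nothing to do.
 */
static int toy_isync(struct toy_inode *ti, uint64_t start, uint64_t end,
		     uint64_t *len)
{
	uint64_t uend = end + 1;

	assert(len);
	*len = 0;

	if (!ti->i_size)
		return 0;		/* empty file */
	if (ti->i_size < uend)
		uend = ti->i_size;
	if (uend < start)
		return -ENODATA;	/* range starts past EOF */
	if (!atomic_load(&ti->write_mapped))
		return 0;		/* never write-mmapped: nothing to do */

	*len = uend - start;
	/*
	 * The real code unmaps the range and dispatches ZUFS_OP_SYNC here,
	 * then calls zuf_sync_dec() with the count the Server reports.
	 * Assumption of this sketch: everything was cleaned.
	 */
	atomic_store(&ti->write_mapped, 0);
	return 0;
}
```

The point of the counter is that an fsync() on an inode that was never
write-mmapped costs no round trip to the Server.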

diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index 9b7123f2af3e..970062d6b13f 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -17,6 +17,6 @@ zuf-y += md.o t1.o t2.o
 zuf-y += zuf-core.o zuf-root.o
 
 # Main FS
-zuf-y += rw.o
+zuf-y += rw.o mmap.o
 zuf-y += super.o inode.o directory.o namei.o file.o symlink.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index 2905fe20cec7..5029f865655a 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -46,6 +46,10 @@ bool zuf_dir_emit(struct super_block *sb, struct dir_context *ctx,
 uint zuf_prepare_symname(struct zufs_ioc_new_inode *ioc_new_inode,
 			const char *symname, ulong len, struct page *pages[2]);
 
+
+/* mmap.c */
+int zuf_file_mmap(struct file *file, struct vm_area_struct *vma);
+
 /* rw.c */
 int _zuf_get_put_block(struct zuf_sb_info *sbi, struct zuf_inode_info *zii,
 			  enum e_zufs_operation op, int rw, ulong index,
@@ -61,11 +65,17 @@ int zuf_iom_execute_sync(struct super_block *sb, struct inode *inode,
 			 __u64 *iom_e, uint iom_n);
 int zuf_iom_execute_async(struct super_block *sb, struct zus_iomap_build *iomb,
 			 __u64 *iom_e_user, uint iom_n);
+/* file.c */
+int zuf_isync(struct inode *inode, loff_t start, loff_t end, int datasync);
+
 
 /* super.c */
 int zuf_init_inodecache(void);
 void zuf_destroy_inodecache(void);
 
+void zuf_sync_inc(struct inode *inode);
+void zuf_sync_dec(struct inode *inode, ulong write_unmapped);
+
 struct dentry *zuf_mount(struct file_system_type *fs_type, int flags,
 			 const char *dev_name, void *data);
 
diff --git a/fs/zuf/_pr.h b/fs/zuf/_pr.h
index 151e127f513b..a1ceab2abce2 100644
--- a/fs/zuf/_pr.h
+++ b/fs/zuf/_pr.h
@@ -45,6 +45,7 @@
 #define zuf_dbg_t2(s, args ...)		zuf_chan_debug("t2dbg", s, ##args)
 #define zuf_dbg_t2_rw(s, args ...)	zuf_chan_debug("t2grw", s, ##args)
 #define zuf_dbg_core(s, args ...)	zuf_chan_debug("core ", s, ##args)
+#define zuf_dbg_mmap(s, args ...)	zuf_chan_debug("mmap ", s, ##args)
 #define zuf_dbg_zus(s, args ...)	zuf_chan_debug("zusdg", s, ##args)
 #define zuf_dbg_verbose(s, args ...)	zuf_chan_debug("d-oto", s, ##args)
 
diff --git a/fs/zuf/file.c b/fs/zuf/file.c
index 0e62145e923a..392b1a0d5881 100644
--- a/fs/zuf/file.c
+++ b/fs/zuf/file.c
@@ -174,6 +174,69 @@ static loff_t zuf_llseek(struct file *file, loff_t offset, int whence)
 	return err ?: ioc_seek.offset_out;
 }
 
+/* This function is called by both msync() and fsync(). */
+int zuf_isync(struct inode *inode, loff_t start, loff_t end, int datasync)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zufs_ioc_range ioc_range = {
+		.hdr.in_len = sizeof(ioc_range),
+		.hdr.operation = ZUFS_OP_SYNC,
+		.zus_ii = zii->zus_ii,
+		.offset = start,
+		.opflags = datasync,
+	};
+	loff_t isize;
+	ulong uend = end + 1;
+	int err = 0;
+
+	zuf_dbg_vfs(
+		"[%ld] start=0x%llx end=0x%llx  datasync=%d write_mapped=%d\n",
+		inode->i_ino, start, end, datasync,
+		atomic_read(&zii->write_mapped));
+
+	/* We want to serialize the syncs so they don't fight with each other,
+	 * which is also more efficient. But we do not want to lock out
+	 * read/writes and page-faults, so we have a special sync semaphore.
+	 */
+	zuf_smw_lock(zii);
+
+	isize = i_size_read(inode);
+	if (!isize) {
+		zuf_dbg_mmap("[%ld] file is empty\n", inode->i_ino);
+		goto out;
+	}
+	if (isize < uend)
+		uend = isize;
+	if (uend < start) {
+		zuf_dbg_mmap("[%ld] isize=0x%llx start=0x%llx end=0x%lx\n",
+				 inode->i_ino, isize, start, uend);
+		err = -ENODATA;
+		goto out;
+	}
+
+	if (!atomic_read(&zii->write_mapped))
+		goto out; /* Nothing to do on this inode */
+
+	ioc_range.length = uend - start;
+	unmap_mapping_range(inode->i_mapping, start, ioc_range.length, 0);
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_range.hdr,
+			    NULL, 0);
+	if (unlikely(err))
+		zuf_dbg_err("zufc_dispatch failed => %d\n", err);
+
+	zuf_sync_dec(inode, ioc_range.write_unmapped);
+
+out:
+	zuf_smw_unlock(zii);
+	return err;
+}
+
+static int zuf_fsync(struct file *file, loff_t start, loff_t end, int datasync)
+{
+	return zuf_isync(file_inode(file), start, end, datasync);
+}
+
 /* This callback is called when a file is closed */
 static int zuf_flush(struct file *file, fl_owner_t id)
 {
@@ -439,7 +502,9 @@ const struct file_operations zuf_file_operations = {
 	.llseek			= zuf_llseek,
 	.read_iter		= zuf_read_iter,
 	.write_iter		= zuf_write_iter,
+	.mmap			= zuf_file_mmap,
 	.open			= generic_file_open,
+	.fsync			= zuf_fsync,
 	.flush			= zuf_flush,
 	.release		= zuf_file_release,
 	.fallocate		= zuf_fallocate,
diff --git a/fs/zuf/inode.c b/fs/zuf/inode.c
index 2b49a0c31a02..8f9b4f28c556 100644
--- a/fs/zuf/inode.c
+++ b/fs/zuf/inode.c
@@ -270,6 +270,7 @@ void zuf_evict_inode(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
 	struct zuf_inode_info *zii = ZUII(inode);
+	int write_mapped;
 
 	if (!inode->i_nlink) {
 		if (unlikely(!zii->zi)) {
@@ -312,6 +313,15 @@ void zuf_evict_inode(struct inode *inode)
 		zii->zero_page = NULL;
 	}
 
+	/* ZUS on evict has synced all mmap dirty pages, YES? */
+	write_mapped = atomic_read(&zii->write_mapped);
+	if (unlikely(write_mapped || !list_empty(&zii->i_mmap_dirty))) {
+		zuf_dbg_mmap("[%ld] !!!! write_mapped=%d list_empty=%d\n",
+			      inode->i_ino, write_mapped,
+			      list_empty(&zii->i_mmap_dirty));
+		zuf_sync_dec(inode, write_mapped);
+	}
+
 	clear_inode(inode);
 }
 
diff --git a/fs/zuf/mmap.c b/fs/zuf/mmap.c
new file mode 100644
index 000000000000..4a4eb117a6b0
--- /dev/null
+++ b/fs/zuf/mmap.c
@@ -0,0 +1,339 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * mmap and page-fault operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#include <linux/pfn_t.h>
+#include "zuf.h"
+
+/* ~~~ Functions for mmap and page faults ~~~ */
+
+/* MAP_PRIVATE, copy data to user private page (cow_page) */
+static int _cow_private_page(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct inode *inode = vma->vm_file->f_mapping->host;
+	struct zuf_sb_info *sbi = SBI(inode->i_sb);
+	int err;
+
+	/* Basically a READ into vmf->cow_page */
+	err = zuf_rw_read_page(sbi, inode, vmf->cow_page,
+			       md_p2o(vmf->pgoff));
+	if (unlikely(err && err != -EINTR)) {
+		zuf_err("[%ld] read_page failed bn=0x%lx address=0x%lx => %d\n",
+			inode->i_ino, vmf->pgoff, vmf->address, err);
+		/* FIXME: Probably return VM_FAULT_SIGBUS */
+	}
+
+	/* HACK: This is a hack needed since Kernel v4.7, where a
+	 * VM_FAULT_LOCKED with vmf->page==NULL is no longer supported. For now
+	 * this way works well. We let mm mess around with unlocking and
+	 * putting its own cow_page.
+	 */
+	vmf->page = vmf->cow_page;
+	get_page(vmf->page);
+	lock_page(vmf->page);
+
+	return VM_FAULT_LOCKED;
+}
+
+int _rw_init_zero_page(struct zuf_inode_info *zii)
+{
+	if (zii->zero_page)
+		return 0;
+
+	zii->zero_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	if (unlikely(!zii->zero_page))
+		return -ENOMEM;
+	zii->zero_page->mapping = zii->vfs_inode.i_mapping;
+	return 0;
+}
+
+static int zuf_write_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+			   bool pfn_mkwrite)
+{
+	struct inode *inode = vma->vm_file->f_mapping->host;
+	struct zuf_sb_info *sbi = SBI(inode->i_sb);
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zus_inode *zi = zii->zi;
+	struct zufs_ioc_IO get_block = {};
+	int fault = VM_FAULT_SIGBUS;
+	ulong addr = vmf->address;
+	pgoff_t size;
+	pfn_t pfnt;
+	ulong pfn;
+	int err;
+
+	zuf_dbg_mmap("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx "
+		    "pgoff=0x%lx vmf_flags=0x%x cow_page=%p page=%p\n",
+		    _zi_ino(zi), vma->vm_start, vma->vm_end, addr, vmf->pgoff,
+		    vmf->flags, vmf->cow_page, vmf->page);
+
+	if (unlikely(vmf->page && vmf->page != zii->zero_page)) {
+		zuf_err("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx "
+			"pgoff=0x%lx vmf_flags=0x%x page=%p cow_page=%p\n",
+			_zi_ino(zi), vma->vm_start, vma->vm_end, addr,
+			vmf->pgoff, vmf->flags, vmf->page, vmf->cow_page);
+		return VM_FAULT_SIGBUS;
+	}
+
+	sb_start_pagefault(inode->i_sb);
+	zuf_smr_lock_pagefault(zii);
+
+	size = md_o2p_up(i_size_read(inode));
+	if (unlikely(vmf->pgoff >= size)) {
+		ulong pgoff = vma->vm_pgoff + md_o2p(addr - vma->vm_start);
+
+		zuf_err("[%ld] pgoff(0x%lx)(0x%lx) >= size(0x%lx) => SIGBUS\n",
+			 _zi_ino(zi), vmf->pgoff, pgoff, size);
+
+		fault = VM_FAULT_SIGBUS;
+		goto out;
+	}
+
+	if (vmf->cow_page) {
+		fault = _cow_private_page(vma, vmf);
+		goto out;
+	}
+
+	zus_inode_cmtime_now(inode, zi);
+	/* NOTE: zus needs to flush the zi */
+
+	err = _zuf_get_put_block(sbi, zii, ZUFS_OP_GET_BLOCK, WRITE, vmf->pgoff,
+			     &get_block);
+	if (unlikely(err)) {
+		zuf_dbg_err("_get_put_block failed => %d\n", err);
+		goto out;
+	}
+
+	if ((get_block.gp_block.ret_flags & ZUFS_GBF_NEW) || !pfn_mkwrite) {
+		inode->i_blocks = le64_to_cpu(zii->zi->i_blocks);
+		/* newly created block */
+		unmap_mapping_range(inode->i_mapping, vmf->pgoff << PAGE_SHIFT,
+				    PAGE_SIZE, 0);
+	} else if (pfn_mkwrite) {
+		/* If the block did not change just tell mm to flip
+		 * the write bit
+		 */
+		fault = VM_FAULT_WRITE;
+		goto skip_insert;
+	}
+
+	if (unlikely(get_block.gp_block.pmem_bn == 0)) {
+		zuf_err("[%ld] pmem_bn=0  rw=0x%x ret_flags=0x%x priv=0x%lx but no error?\n",
+			_zi_ino(zi), get_block.gp_block.rw,
+			get_block.gp_block.ret_flags,
+			(ulong)get_block.gp_block.priv);
+		fault = VM_FAULT_SIGBUS;
+		goto out;
+	}
+
+	pfn = md_pfn(sbi->md, get_block.gp_block.pmem_bn);
+	pfnt = phys_to_pfn_t(PFN_PHYS(pfn), PFN_MAP | PFN_DEV);
+	fault = vmf_insert_mixed_mkwrite(vma, addr, pfnt);
+	err = zuf_flt_to_err(fault);
+	if (unlikely(err)) {
+		zuf_err("vmf_insert_mixed_mkwrite failed => %d\n", err);
+		goto put;
+	}
+
+	zuf_dbg_mmap("[%ld] vm_insert_mixed 0x%lx prot=0x%lx => %d\n",
+		    _zi_ino(zi), pfn, vma->vm_page_prot.pgprot, err);
+
+skip_insert:
+	zuf_sync_inc(inode);
+put:
+	_zuf_get_put_block(sbi, zii, ZUFS_OP_PUT_BLOCK, WRITE, vmf->pgoff,
+			     &get_block);
+out:
+	zuf_smr_unlock(zii);
+	sb_end_pagefault(inode->i_sb);
+	return fault;
+}
+
+static int zuf_pfn_mkwrite(struct vm_fault *vmf)
+{
+	return zuf_write_fault(vmf->vma, vmf, true);
+}
+
+static int zuf_read_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct inode *inode = vma->vm_file->f_mapping->host;
+	struct zuf_sb_info *sbi = SBI(inode->i_sb);
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zus_inode *zi = zii->zi;
+	struct zufs_ioc_IO get_block = {};
+	int fault = VM_FAULT_SIGBUS;
+	ulong addr = vmf->address;
+	pgoff_t size;
+	pfn_t pfnt;
+	ulong pfn;
+	int err;
+
+	zuf_dbg_mmap("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx "
+		    "pgoff=0x%lx vmf_flags=0x%x cow_page=%p page=%p\n",
+		    _zi_ino(zi), vma->vm_start, vma->vm_end, addr, vmf->pgoff,
+		    vmf->flags, vmf->cow_page, vmf->page);
+
+	zuf_smr_lock_pagefault(zii);
+
+	size = md_o2p_up(i_size_read(inode));
+	if (unlikely(vmf->pgoff >= size)) {
+		ulong pgoff = vma->vm_pgoff + md_o2p(addr - vma->vm_start);
+
+		zuf_err("[%ld] pgoff(0x%lx)(0x%lx) >= size(0x%lx) => SIGBUS\n",
+			 _zi_ino(zi), vmf->pgoff, pgoff, size);
+		goto out;
+	}
+
+	if (vmf->cow_page) {
+		zuf_warn("cow is read\n");
+		fault = _cow_private_page(vma, vmf);
+		goto out;
+	}
+
+	file_accessed(vma->vm_file);
+	/* NOTE: zus needs to flush the zi */
+
+	err = _zuf_get_put_block(sbi, zii, ZUFS_OP_GET_BLOCK, READ, vmf->pgoff,
+			     &get_block);
+	if (unlikely(err && err != -EINTR)) {
+		zuf_err("_get_put_block failed => %d\n", err);
+		goto out;
+	}
+
+	if (get_block.gp_block.pmem_bn == 0) {
+		/* Hole in file */
+		err = _rw_init_zero_page(zii);
+		if (unlikely(err))
+			goto out;
+
+		err = vm_insert_page(vma, addr, zii->zero_page);
+		zuf_dbg_mmap("[%ld] inserted zero\n", _zi_ino(zi));
+
+		/* NOTE: we are fooling mm, we do not need this page
+		 * to be locked and get(ed)
+		 */
+		fault = VM_FAULT_NOPAGE;
+		goto out;
+	}
+
+	/* We have a real page */
+	pfn = md_pfn(sbi->md, get_block.gp_block.pmem_bn);
+	pfnt = phys_to_pfn_t(PFN_PHYS(pfn), PFN_MAP | PFN_DEV);
+	fault = vmf_insert_mixed(vma, addr, pfnt);
+	err = zuf_flt_to_err(fault);
+	if (unlikely(err)) {
+		zuf_err("[%ld] vm_insert_page/mixed => %d\n", _zi_ino(zi), err);
+		goto put;
+	}
+
+	zuf_dbg_mmap("[%ld] vm_insert_mixed 0x%lx prot=0x%lx => %d\n",
+		    _zi_ino(zi), pfn, vma->vm_page_prot.pgprot, err);
+
+put:
+	_zuf_get_put_block(sbi, zii, ZUFS_OP_PUT_BLOCK, READ, vmf->pgoff,
+		       &get_block);
+out:
+	zuf_smr_unlock(zii);
+	return fault;
+}
+
+static int zuf_fault(struct vm_fault *vmf)
+{
+	bool write_fault = (0 != (vmf->flags & FAULT_FLAG_WRITE));
+
+	if (write_fault)
+		return zuf_write_fault(vmf->vma, vmf, false);
+	else
+		return zuf_read_fault(vmf->vma, vmf);
+}
+
+static int zuf_page_mkwrite(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct inode *inode = vma->vm_file->f_mapping->host;
+	ulong addr = vmf->address;
+
+	/* Our zero page doesn't really hold the correct file offset in
+	 * page->index, so vmf->pgoff is incorrect. Let's fix that.
+	 */
+	vmf->pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
+
+	zuf_dbg_mmap("[%ld] pgoff=0x%lx\n", inode->i_ino, vmf->pgoff);
+
+	/* call fault handler to get a real page for writing */
+	return zuf_write_fault(vma, vmf, false);
+}
+
+static void zuf_mmap_open(struct vm_area_struct *vma)
+{
+	struct zuf_inode_info *zii = ZUII(file_inode(vma->vm_file));
+
+	atomic_inc(&zii->vma_count);
+}
+
+static void zuf_mmap_close(struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	int vma_count = atomic_dec_return(&ZUII(inode)->vma_count);
+
+	if (unlikely(vma_count < 0))
+		zuf_err("[%ld] WHAT??? vma_count=%d\n",
+			 inode->i_ino, vma_count);
+	else if (unlikely(vma_count == 0)) {
+		struct zuf_inode_info *zii = ZUII(inode);
+		struct zufs_ioc_mmap_close mmap_close = {};
+		int err;
+
+		mmap_close.hdr.operation = ZUFS_OP_MMAP_CLOSE;
+		mmap_close.hdr.in_len = sizeof(mmap_close);
+
+		mmap_close.zus_ii = zii->zus_ii;
+		mmap_close.rw = 0; /* TODO: Do we need this */
+
+		zuf_smr_lock(zii);
+
+		err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &mmap_close.hdr,
+				    NULL, 0);
+		if (unlikely(err))
+			zuf_dbg_err("[%ld] err=%d\n", inode->i_ino, err);
+
+		zuf_smr_unlock(zii);
+	}
+}
+
+static const struct vm_operations_struct zuf_vm_ops = {
+	.fault		= zuf_fault,
+	.page_mkwrite	= zuf_page_mkwrite,
+	.pfn_mkwrite	= zuf_pfn_mkwrite,
+	.open           = zuf_mmap_open,
+	.close		= zuf_mmap_close,
+};
+
+int zuf_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(file);
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	file_accessed(file);
+
+	vma->vm_flags |= VM_MIXEDMAP;
+	vma->vm_ops = &zuf_vm_ops;
+
+	atomic_inc(&zii->vma_count);
+
+	zuf_dbg_vfs("[%ld] start=0x%lx end=0x%lx flags=0x%lx page_prot=0x%lx\n",
+		     file->f_mapping->host->i_ino, vma->vm_start, vma->vm_end,
+		     vma->vm_flags, pgprot_val(vma->vm_page_prot));
+
+	return 0;
+}
diff --git a/fs/zuf/super.c b/fs/zuf/super.c
index 2afa7b405945..2f1dd44290a2 100644
--- a/fs/zuf/super.c
+++ b/fs/zuf/super.c
@@ -570,6 +570,90 @@ static int zuf_update_s_wtime(struct super_block *sb)
 	return 0;
 }
 
+static void _sync_add_inode(struct inode *inode)
+{
+	struct zuf_sb_info *sbi = SBI(inode->i_sb);
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	zuf_dbg_mmap("[%ld] write_mapped=%d\n",
+		      inode->i_ino, atomic_read(&zii->write_mapped));
+
+	spin_lock(&sbi->s_mmap_dirty_lock);
+
+	/* Because we remove inodes lazily, only on an fsync or an
+	 * evict_inode, it is fine if we are called multiple times.
+	 */
+	if (list_empty(&zii->i_mmap_dirty))
+		list_add(&zii->i_mmap_dirty, &sbi->s_mmap_dirty);
+
+	spin_unlock(&sbi->s_mmap_dirty_lock);
+}
+
+static void _sync_remove_inode(struct inode *inode)
+{
+	struct zuf_sb_info *sbi = SBI(inode->i_sb);
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	zuf_dbg_mmap("[%ld] write_mapped=%d\n",
+		      inode->i_ino, atomic_read(&zii->write_mapped));
+
+	spin_lock(&sbi->s_mmap_dirty_lock);
+	list_del_init(&zii->i_mmap_dirty);
+	spin_unlock(&sbi->s_mmap_dirty_lock);
+}
+
+void zuf_sync_inc(struct inode *inode)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	if (1 == atomic_inc_return(&zii->write_mapped))
+		_sync_add_inode(inode);
+}
+
+/* zuf_sync_dec accounts for unmapped pages in batches */
+void zuf_sync_dec(struct inode *inode, ulong write_unmapped)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	if (0 == atomic_sub_return(write_unmapped, &zii->write_mapped))
+		_sync_remove_inode(inode);
+}
+
+/*
+ * We must fsync any mmap-active inodes
+ */
+static int zuf_sync_fs(struct super_block *sb, int wait)
+{
+	struct zuf_sb_info *sbi = SBI(sb);
+	struct zuf_inode_info *zii, *t;
+	enum {to_clean_size = 120};
+	struct zuf_inode_info *zii_to_clean[to_clean_size];
+	uint i, to_clean;
+
+	zuf_dbg_vfs("Syncing wait=%d\n", wait);
+more_inodes:
+	spin_lock(&sbi->s_mmap_dirty_lock);
+	to_clean = 0;
+	list_for_each_entry_safe(zii, t, &sbi->s_mmap_dirty, i_mmap_dirty) {
+		list_del_init(&zii->i_mmap_dirty);
+		zii_to_clean[to_clean++] = zii;
+		if (to_clean >= to_clean_size)
+			break;
+	}
+	spin_unlock(&sbi->s_mmap_dirty_lock);
+
+	if (!to_clean)
+		return 0;
+
+	for (i = 0; i < to_clean; ++i)
+		zuf_isync(&zii_to_clean[i]->vfs_inode, 0, ~0 - 1, 1);
+
+	if (to_clean == to_clean_size)
+		goto more_inodes;
+
+	return 0;
+}
+
 static struct inode *zuf_alloc_inode(struct super_block *sb)
 {
 	struct zuf_inode_info *zii;
@@ -592,6 +676,12 @@ static void _init_once(void *foo)
 	struct zuf_inode_info *zii = foo;
 
 	inode_init_once(&zii->vfs_inode);
+	INIT_LIST_HEAD(&zii->i_mmap_dirty);
+	zii->zi = NULL;
+	zii->zero_page = NULL;
+	init_rwsem(&zii->in_sync);
+	atomic_set(&zii->vma_count, 0);
+	atomic_set(&zii->write_mapped, 0);
 }
 
 int __init zuf_init_inodecache(void)
@@ -621,6 +711,7 @@ static struct super_operations zuf_sops = {
 	.put_super	= zuf_put_super,
 	.freeze_fs	= zuf_update_s_wtime,
 	.unfreeze_fs	= zuf_update_s_wtime,
+	.sync_fs	= zuf_sync_fs,
 	.statfs		= zuf_statfs,
 	.remount_fs	= zuf_remount,
 	.show_options	= zuf_show_options,
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index 371c2e93dd81..86f624031d8d 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -781,8 +781,10 @@ const char *zuf_op_name(enum e_zufs_operation op)
 		CASE_ENUM_NAME(ZUFS_OP_WRITE		);
 		CASE_ENUM_NAME(ZUFS_OP_GET_BLOCK	);
 		CASE_ENUM_NAME(ZUFS_OP_PUT_BLOCK	);
+		CASE_ENUM_NAME(ZUFS_OP_MMAP_CLOSE	);
 		CASE_ENUM_NAME(ZUFS_OP_GET_SYMLINK	);
 		CASE_ENUM_NAME(ZUFS_OP_SETATTR		);
+		CASE_ENUM_NAME(ZUFS_OP_SYNC		);
 		CASE_ENUM_NAME(ZUFS_OP_FALLOCATE	);
 		CASE_ENUM_NAME(ZUFS_OP_LLSEEK		);
 		CASE_ENUM_NAME(ZUFS_OP_BREAK		);
diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
index 7d79189bfe60..98f4ea088671 100644
--- a/fs/zuf/zuf.h
+++ b/fs/zuf/zuf.h
@@ -158,6 +158,9 @@ struct zuf_inode_info {
 
 	/* Stuff for mmap write */
 	struct rw_semaphore	in_sync;
+	struct list_head	i_mmap_dirty;
+	atomic_t		write_mapped;
+	atomic_t		vma_count;
 	struct page		*zero_page; /* TODO: Remove */
 
 	/* cookies from Server */
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 26b7b56f96c4..3d6481768308 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -344,8 +344,10 @@ enum e_zufs_operation {
 	ZUFS_OP_WRITE,
 	ZUFS_OP_GET_BLOCK,
 	ZUFS_OP_PUT_BLOCK,
+	ZUFS_OP_MMAP_CLOSE,
 	ZUFS_OP_GET_SYMLINK,
 	ZUFS_OP_SETATTR,
+	ZUFS_OP_SYNC,
 	ZUFS_OP_FALLOCATE,
 	ZUFS_OP_LLSEEK,
 
@@ -516,6 +518,13 @@ static inline bool zufs_zde_emit(struct zufs_readdir_iter *rdi, __u64 ino,
 	return true;
 }
 
+struct zufs_ioc_mmap_close {
+	struct zufs_ioc_hdr hdr;
+	 /* IN */
+	struct zus_inode_info *zus_ii;
+	__u64 rw; /* Some flags + READ or WRITE */
+};
+
 /* ZUFS_OP_GET_SYMLINK */
 struct zufs_ioc_get_link {
 	struct zufs_ioc_hdr hdr;
-- 
2.20.1



* [RFC PATCH 13/17] zuf: ioctl implementation
  2019-02-19 11:51 [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Boaz harrosh
                   ` (11 preceding siblings ...)
  2019-02-19 11:51 ` [RFC PATCH 12/17] zuf: mmap & sync Boaz harrosh
@ 2019-02-19 11:51 ` Boaz harrosh
  2019-02-19 11:51 ` [RFC PATCH 14/17] zuf: xattr implementation Boaz harrosh
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 31+ messages in thread
From: Boaz harrosh @ 2019-02-19 11:51 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro
  Cc: Ric Wheeler, Miklos Szeredi, Steven Whitehouse, Jefff moyer,
	Amir Goldstein, Amit Golander, Sagi Manole

From: Boaz Harrosh <boazh@netapp.com>

* Support for some generic IOCTLs:
  FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_IOC_GETVERSION, FS_IOC_SETVERSION

* Simple support for zusFS-defined IOCTLs.
  We only support flat structures
  (no embedded pointers within the IOCTL structures).
  We try to deduce the size of the IOCTL from _IOC_SIZE(cmd).
  If zusFS needs a bigger copy it will send a retry with the
  new size, so badly defined IOCTLs always take 2 trips to userland.
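
The size deduction and single retry described above can be sketched in
user space. This is only an illustration, not the patch's code: the
struct, the two command numbers and the demo_* helpers are hypothetical,
and the "server" is reduced to a function that reports how many payload
bytes it wants. Only the _IO/_IOWR/_IOC_SIZE macros are the real ones
from <linux/ioctl.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/ioctl.h>	/* _IO, _IOWR, _IOC_SIZE */

/* Hypothetical flat argument: no embedded pointers. */
struct demo_flat_arg {
	unsigned int a;
	unsigned long long b;
	char name[16];
};

/* Well defined: the argument size is encoded in the command number. */
#define DEMO_IOC_GOOD	_IOWR('z', 1, struct demo_flat_arg)
/* Badly defined: carries the same payload but encodes a size of 0. */
#define DEMO_IOC_BAD	_IO('z', 2)

/* Hypothetical server side: how many payload bytes does cmd need? */
static size_t demo_server_needs(unsigned int cmd)
{
	(void)cmd;
	return sizeof(struct demo_flat_arg);
}

/*
 * Model of the dispatch: the first trip uses the size deduced from
 * the command number. If the server wants more, retry exactly once
 * with the size it asked for. Returns the number of trips taken.
 */
static int demo_dispatch_trips(unsigned int cmd)
{
	size_t ioc_size = _IOC_SIZE(cmd);
	int trips = 1;

	if (ioc_size < demo_server_needs(cmd)) {
		ioc_size = demo_server_needs(cmd);
		trips++;
	}
	assert(ioc_size >= demo_server_needs(cmd));
	return trips;
}
```

With these definitions demo_dispatch_trips(DEMO_IOC_GOOD) takes one trip
and demo_dispatch_trips(DEMO_IOC_BAD) takes two, which is the cost the
last bullet refers to.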

Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile    |   2 +-
 fs/zuf/_extern.h   |   5 +
 fs/zuf/directory.c |   4 +
 fs/zuf/file.c      |   4 +
 fs/zuf/ioctl.c     | 282 +++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/zuf-core.c  |   1 +
 fs/zuf/zus_api.h   |  16 +++
 7 files changed, 313 insertions(+), 1 deletion(-)
 create mode 100644 fs/zuf/ioctl.c

diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index 970062d6b13f..5304aba901b2 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -17,6 +17,6 @@ zuf-y += md.o t1.o t2.o
 zuf-y += zuf-core.o zuf-root.o
 
 # Main FS
-zuf-y += rw.o mmap.o
+zuf-y += rw.o mmap.o ioctl.o
 zuf-y += super.o inode.o directory.o namei.o file.o symlink.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index 5029f865655a..b8e24c6a66d9 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -46,6 +46,11 @@ bool zuf_dir_emit(struct super_block *sb, struct dir_context *ctx,
 uint zuf_prepare_symname(struct zufs_ioc_new_inode *ioc_new_inode,
 			const char *symname, ulong len, struct page *pages[2]);
 
+/* ioctl.c */
+long zuf_ioctl(struct file *filp, uint cmd, ulong arg);
+#ifdef CONFIG_COMPAT
+long zuf_compat_ioctl(struct file *file, uint cmd, ulong arg);
+#endif
 
 /* mmap.c */
 int zuf_file_mmap(struct file *file, struct vm_area_struct *vma);
diff --git a/fs/zuf/directory.c b/fs/zuf/directory.c
index 645dd367fd8c..11fcbe0ba6ff 100644
--- a/fs/zuf/directory.c
+++ b/fs/zuf/directory.c
@@ -160,4 +160,8 @@ const struct file_operations zuf_dir_operations = {
 	.read		= generic_read_dir,
 	.iterate_shared	= zuf_readdir,
 	.fsync		= noop_fsync,
+	.unlocked_ioctl = zuf_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= zuf_compat_ioctl,
+#endif
 };
diff --git a/fs/zuf/file.c b/fs/zuf/file.c
index 392b1a0d5881..48b339cb5f8f 100644
--- a/fs/zuf/file.c
+++ b/fs/zuf/file.c
@@ -507,7 +507,11 @@ const struct file_operations zuf_file_operations = {
 	.fsync			= zuf_fsync,
 	.flush			= zuf_flush,
 	.release		= zuf_file_release,
+	.unlocked_ioctl		= zuf_ioctl,
 	.fallocate		= zuf_fallocate,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl		= zuf_compat_ioctl,
+#endif
 	.copy_file_range	= zuf_copy_file_range,
 	.remap_file_range	= zuf_clone_file_range,
 };
diff --git a/fs/zuf/ioctl.c b/fs/zuf/ioctl.c
new file mode 100644
index 000000000000..13ce65764c38
--- /dev/null
+++ b/fs/zuf/ioctl.c
@@ -0,0 +1,282 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Ioctl operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole <sagim@netapp.com>
+ */
+
+#include <linux/capability.h>
+#include <linux/time.h>
+#include <linux/sched.h>
+#include <linux/compat.h>
+#include <linux/mount.h>
+#include <linux/fadvise.h>
+#include <linux/vmalloc.h>
+#include <linux/capability.h>
+
+#include "zuf.h"
+
+#define ZUFS_SUPPORTED_FS_FLAGS (FS_SYNC_FL | FS_APPEND_FL | FS_IMMUTABLE_FL | \
+				 FS_NOATIME_FL | FS_DIRTY_FL)
+
+#define ZUS_IOCTL_MAX_PAGES	8
+
+static int _ioctl_dispatch(struct inode *inode, uint cmd, ulong arg)
+{
+	struct _ioctl_info {
+		struct zufs_ioc_ioctl ctl;
+		char buf[900];
+	} ctl_alloc = {};
+	enum big_alloc_type bat;
+	struct zufs_ioc_ioctl *ioc_ioctl;
+	size_t ioc_size = _IOC_SIZE(cmd);
+	void __user *parg = (void __user *)arg;
+	struct timespec64 time = current_time(inode);
+	size_t size;
+	bool retry = false;
+	int err;
+
+again:
+	size = sizeof(*ioc_ioctl) + ioc_size;
+
+	zuf_dbg_vfs("[%ld] cmd=0x%x arg=0x%lx size=0x%zx IOC(%d, %d, %zd)\n",
+		    inode->i_ino, cmd, arg, size, _IOC_TYPE(cmd),
+		    _IOC_NR(cmd), ioc_size);
+
+	ioc_ioctl = big_alloc(size, sizeof(ctl_alloc), &ctl_alloc, GFP_KERNEL,
+			      &bat);
+	if (unlikely(!ioc_ioctl))
+		return -ENOMEM;
+
+	memset(ioc_ioctl, 0, sizeof(*ioc_ioctl));
+	ioc_ioctl->hdr.in_len = size;
+	ioc_ioctl->hdr.out_start = offsetof(struct zufs_ioc_ioctl, arg);
+	ioc_ioctl->hdr.out_max = size;
+	ioc_ioctl->hdr.out_len = 0;
+	ioc_ioctl->hdr.operation = ZUFS_OP_IOCTL;
+	ioc_ioctl->zus_ii = ZUII(inode)->zus_ii;
+	ioc_ioctl->cmd = cmd;
+	timespec_to_mt(&ioc_ioctl->time, &time);
+
+	if (arg && ioc_size) {
+		if (copy_from_user(ioc_ioctl->arg, parg, ioc_size)) {
+			err = -EFAULT;
+			goto out;
+		}
+	}
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_ioctl->hdr,
+			    NULL, 0);
+
+	if (!retry && err == -EZUFS_RETRY) {
+		ioc_size = ioc_ioctl->new_size - sizeof(*ioc_ioctl);
+		big_free(ioc_ioctl, bat);
+		retry = true;
+		goto again;
+	}
+
+	if (unlikely(err)) {
+		zuf_dbg_err("zufc_dispatch failed => %d IOC(%d, %d, %zd)\n",
+			    err, _IOC_TYPE(cmd), _IOC_NR(cmd), ioc_size);
+		goto out;
+	}
+
+	if (ioc_ioctl->hdr.out_len) {
+		if (copy_to_user(parg, ioc_ioctl->arg,
+		    ioc_ioctl->hdr.out_len)) {
+			err = -EFAULT;
+			goto out;
+		}
+	}
+
+out:
+	big_free(ioc_ioctl, bat);
+
+	return err;
+}
+
+static uint _translate_to_ioc_flags(struct zus_inode *zi)
+{
+	uint zi_flags = le16_to_cpu(zi->i_flags);
+	uint ioc_flags = 0;
+
+	if (zi_flags & S_SYNC)
+		ioc_flags |= FS_SYNC_FL;
+	if (zi_flags & S_APPEND)
+		ioc_flags |= FS_APPEND_FL;
+	if (zi_flags & S_IMMUTABLE)
+		ioc_flags |= FS_IMMUTABLE_FL;
+	if (zi_flags & S_NOATIME)
+		ioc_flags |= FS_NOATIME_FL;
+	if (zi_flags & S_DIRSYNC)
+		ioc_flags |= FS_DIRSYNC_FL;
+
+	return ioc_flags;
+}
+
+static int _ioc_getflags(struct inode *inode, uint __user *parg)
+{
+	struct zus_inode *zi = zus_zi(inode);
+	uint flags = _translate_to_ioc_flags(zi);
+
+	return put_user(flags, parg);
+}
+
+static void _translate_to_zi_flags(struct zus_inode *zi, unsigned int flags)
+{
+	uint zi_flags = le16_to_cpu(zi->i_flags);
+
+	zi_flags &=
+		~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC);
+
+	if (flags & FS_SYNC_FL)
+		zi_flags |= S_SYNC;
+	if (flags & FS_APPEND_FL)
+		zi_flags |= S_APPEND;
+	if (flags & FS_IMMUTABLE_FL)
+		zi_flags |= S_IMMUTABLE;
+	if (flags & FS_NOATIME_FL)
+		zi_flags |= S_NOATIME;
+	if (flags & FS_DIRSYNC_FL)
+		zi_flags |= S_DIRSYNC;
+
+	zi->i_flags = cpu_to_le16(zi_flags);
+}
+
+/* use statx ioc to flush zi changes to fs */
+static int __ioc_dispatch_zi_update(struct inode *inode, uint flags)
+{
+	struct zufs_ioc_attr ioc_attr = {
+		.hdr.in_len = sizeof(ioc_attr),
+		.hdr.out_len = sizeof(ioc_attr),
+		.hdr.operation = ZUFS_OP_SETATTR,
+		.zus_ii = ZUII(inode)->zus_ii,
+		.zuf_attr = flags,
+	};
+	int err;
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_attr.hdr, NULL, 0);
+	if (unlikely(err && err != -EINTR))
+		zuf_err("zufc_dispatch failed => %d\n", err);
+
+	return err;
+}
+
+static int _ioc_setflags(struct inode *inode, uint __user *parg)
+{
+	struct zus_inode *zi = zus_zi(inode);
+	uint flags, oldflags;
+	int err;
+
+	if (!inode_owner_or_capable(inode))
+		return -EPERM;
+
+	if (get_user(flags, parg))
+		return -EFAULT;
+
+	if (flags & ~ZUFS_SUPPORTED_FS_FLAGS)
+		return -EOPNOTSUPP;
+
+	inode_lock(inode);
+
+	oldflags = le16_to_cpu(zi->i_flags);
+
+	if ((flags ^ oldflags) &
+		(FS_APPEND_FL | FS_IMMUTABLE_FL)) {
+		if (!capable(CAP_LINUX_IMMUTABLE)) {
+			inode_unlock(inode);
+			return -EPERM;
+		}
+	}
+
+	if (!S_ISDIR(inode->i_mode))
+		flags &= ~FS_DIRSYNC_FL;
+
+	flags = flags & FS_FL_USER_MODIFIABLE;
+	flags |= oldflags & ~FS_FL_USER_MODIFIABLE;
+	inode->i_ctime = current_time(inode);
+	timespec_to_mt(&zi->i_ctime, &inode->i_ctime);
+	_translate_to_zi_flags(zi, flags);
+	zuf_set_inode_flags(inode, zi);
+
+	err = __ioc_dispatch_zi_update(inode, ZUFS_STATX_FLAGS | STATX_CTIME);
+
+	inode_unlock(inode);
+	return err;
+}
+
+static int _ioc_setversion(struct inode *inode, uint __user *parg)
+{
+	struct zus_inode *zi = zus_zi(inode);
+	__u32 generation;
+	int err;
+
+	if (!inode_owner_or_capable(inode))
+		return -EPERM;
+
+	if (get_user(generation, parg))
+		return -EFAULT;
+
+	inode_lock(inode);
+
+	inode->i_ctime = current_time(inode);
+	inode->i_generation = generation;
+	timespec_to_mt(&zi->i_ctime, &inode->i_ctime);
+	zi->i_generation = cpu_to_le32(inode->i_generation);
+
+	err = __ioc_dispatch_zi_update(inode, ZUFS_STATX_VERSION | STATX_CTIME);
+
+	inode_unlock(inode);
+	return err;
+}
+
+long zuf_ioctl(struct file *filp, unsigned int cmd, ulong arg)
+{
+	struct inode *inode = filp->f_inode;
+	void __user *parg = (void __user *)arg;
+
+	switch (cmd) {
+	case FS_IOC_GETFLAGS:
+		return _ioc_getflags(inode, parg);
+	case FS_IOC_SETFLAGS:
+		return _ioc_setflags(inode, parg);
+	case FS_IOC_GETVERSION:
+		return put_user(inode->i_generation, (int __user *)arg);
+	case FS_IOC_SETVERSION:
+		return _ioc_setversion(inode, parg);
+	default:
+		return _ioctl_dispatch(inode, cmd, arg);
+	}
+}
+
+#ifdef CONFIG_COMPAT
+long zuf_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	switch (cmd) {
+	case FS_IOC32_GETFLAGS:
+		cmd = FS_IOC_GETFLAGS;
+		break;
+	case FS_IOC32_SETFLAGS:
+		cmd = FS_IOC_SETFLAGS;
+		break;
+	case FS_IOC32_GETVERSION:
+		cmd = FS_IOC_GETVERSION;
+		break;
+	case FS_IOC32_SETVERSION:
+		cmd = FS_IOC_SETVERSION;
+		break;
+	default:
+		return -ENOIOCTLCMD;
+	}
+	return zuf_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
+}
+#endif
+
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index 86f624031d8d..09ad210318f8 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -787,6 +787,7 @@ const char *zuf_op_name(enum e_zufs_operation op)
 		CASE_ENUM_NAME(ZUFS_OP_SYNC		);
 		CASE_ENUM_NAME(ZUFS_OP_FALLOCATE	);
 		CASE_ENUM_NAME(ZUFS_OP_LLSEEK		);
+		CASE_ENUM_NAME(ZUFS_OP_IOCTL		);
 		CASE_ENUM_NAME(ZUFS_OP_BREAK		);
 	default:
 		return "UNKNOWN";
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 3d6481768308..f32ee615b937 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -350,6 +350,7 @@ enum e_zufs_operation {
 	ZUFS_OP_SYNC,
 	ZUFS_OP_FALLOCATE,
 	ZUFS_OP_LLSEEK,
+	ZUFS_OP_IOCTL,
 
 	ZUFS_OP_BREAK,		/* Kernel telling Server to exit */
 	ZUFS_OP_MAX_OPT,
@@ -586,6 +587,21 @@ struct zufs_ioc_seek {
 	__u64 offset_out;
 };
 
+/* ZUFS_OP_IOCTL */
+struct zufs_ioc_ioctl {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *zus_ii;
+	__u32 cmd;
+	__u64 time;
+
+	/* OUT */
+	union {
+		__u32 new_size;
+		char arg[0];
+	};
+};
+
 /* ~~~~ io_map structures && IOCTL(s) ~~~~ */
 /*
  * These set of structures and helpers are used in return of zufs_ioc_IO and
-- 
2.20.1



* [RFC PATCH 14/17] zuf: xattr implementation
  2019-02-19 11:51 [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Boaz harrosh
                   ` (12 preceding siblings ...)
  2019-02-19 11:51 ` [RFC PATCH 13/17] zuf: ioctl implementation Boaz harrosh
@ 2019-02-19 11:51 ` Boaz harrosh
  2019-02-19 11:51 ` [RFC PATCH 15/17] zuf: ACL support Boaz harrosh
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 31+ messages in thread
From: Boaz harrosh @ 2019-02-19 11:51 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro
  Cc: Ric Wheeler, Miklos Szeredi, Steven Whitehouse, Jefff moyer,
	Amir Goldstein, Amit Golander, Sagi Manole

From: Boaz Harrosh <boazh@netapp.com>

We establish the usual dispatch to user-mode for get and set.
Since the buffers are variable length we utilize the
zdo->overflow_handler for the extra copy from the Server
(see also zuf-core.c).
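
As a rough user-space illustration of the variable-length layout used
for set (the NUL-terminated name followed directly by the value in one
flat buffer), here is a minimal sketch. The demo_xattr_req struct and
the demo_* helpers are hypothetical stand-ins and not the zufs_ioc_xattr
ABI; they only mirror the name_len + size arithmetic.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical request: fixed header, then name (with NUL), then value. */
struct demo_xattr_req {
	unsigned int name_len;	/* includes the terminating NUL */
	unsigned int value_len;
	char buf[];		/* name, immediately followed by value */
};

/* Allocate and fill one flat request. The caller frees the result. */
static struct demo_xattr_req *demo_pack_xattr(const char *name,
					      const void *value, size_t size)
{
	size_t name_len = strlen(name) + 1;
	struct demo_xattr_req *req;

	req = calloc(1, sizeof(*req) + name_len + size);
	if (!req)
		return NULL;

	req->name_len = name_len;
	req->value_len = size;
	memcpy(req->buf, name, name_len);
	if (value && size)
		memcpy(req->buf + name_len, value, size);
	return req;
}

/* The value starts right after the name's NUL byte. */
static const char *demo_xattr_value(const struct demo_xattr_req *req)
{
	return req->buf + req->name_len;
}
```

Packing "user.demo" with the 3-byte value "abc" yields name_len 10 and a
13-byte payload after the header, so the receiver can find the value
from name_len alone without any embedded pointer.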

Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile   |   1 +
 fs/zuf/_extern.h  |  10 ++
 fs/zuf/_pr.h      |   1 +
 fs/zuf/file.c     |   1 +
 fs/zuf/inode.c    |   3 +-
 fs/zuf/namei.c    |   2 +
 fs/zuf/super.c    |   1 +
 fs/zuf/symlink.c  |   1 +
 fs/zuf/xattr.c    | 314 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/zuf-core.c |   3 +
 fs/zuf/zuf.h      |   7 ++
 fs/zuf/zus_api.h  |  22 ++++
 12 files changed, 365 insertions(+), 1 deletion(-)
 create mode 100644 fs/zuf/xattr.c

diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index 5304aba901b2..5d638760a82f 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -17,6 +17,7 @@ zuf-y += md.o t1.o t2.o
 zuf-y += zuf-core.o zuf-root.o
 
 # Main FS
+zuf-y += xattr.o
 zuf-y += rw.o mmap.o ioctl.o
 zuf-y += super.o inode.o directory.o namei.o file.o symlink.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index b8e24c6a66d9..1f4b39911a5d 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -73,6 +73,16 @@ int zuf_iom_execute_async(struct super_block *sb, struct zus_iomap_build *iomb,
 /* file.c */
 int zuf_isync(struct inode *inode, loff_t start, loff_t end, int datasync);
 
+/* xattr.c */
+int zuf_initxattrs(struct inode *inode, const struct xattr *xattr_array,
+		   void *fs_info);
+ssize_t __zuf_getxattr(struct inode *inode, int type, const char *name,
+		       void *buffer, size_t size);
+int __zuf_setxattr(struct inode *inode, int type, const char *name,
+		   const void *value, size_t size, int flags);
+ssize_t zuf_listxattr(struct dentry *dentry, char *buffer, size_t size);
+extern const struct xattr_handler *zuf_xattr_handlers[];
+
 
 /* super.c */
 int zuf_init_inodecache(void);
diff --git a/fs/zuf/_pr.h b/fs/zuf/_pr.h
index a1ceab2abce2..04b99f57f2b5 100644
--- a/fs/zuf/_pr.h
+++ b/fs/zuf/_pr.h
@@ -42,6 +42,7 @@
 #define zuf_dbg_vfs(s, args ...)	zuf_chan_debug("vfs  ", s, ##args)
 #define zuf_dbg_rw(s, args ...)		zuf_chan_debug("rw   ", s, ##args)
 #define zuf_dbg_t1(s, args ...)		zuf_chan_debug("t1   ", s, ##args)
+#define zuf_dbg_xattr(s, args ...)	zuf_chan_debug("xattr", s, ##args)
 #define zuf_dbg_t2(s, args ...)		zuf_chan_debug("t2dbg", s, ##args)
 #define zuf_dbg_t2_rw(s, args ...)	zuf_chan_debug("t2grw", s, ##args)
 #define zuf_dbg_core(s, args ...)	zuf_chan_debug("core ", s, ##args)
diff --git a/fs/zuf/file.c b/fs/zuf/file.c
index 48b339cb5f8f..814a75105321 100644
--- a/fs/zuf/file.c
+++ b/fs/zuf/file.c
@@ -521,4 +521,5 @@ const struct inode_operations zuf_file_inode_operations = {
 	.getattr	= zuf_getattr,
 	.update_time	= zuf_update_time,
 	.fiemap		= tozu_fiemap,
+	.listxattr	= zuf_listxattr,
 };
diff --git a/fs/zuf/inode.c b/fs/zuf/inode.c
index 8f9b4f28c556..73f94e7062e5 100644
--- a/fs/zuf/inode.c
+++ b/fs/zuf/inode.c
@@ -361,7 +361,8 @@ struct inode *zuf_new_inode(struct inode *dir, umode_t mode,
 
 	zuf_dbg_verbose("inode=%p name=%s\n", inode, qstr->name);
 
-	err = security_inode_init_security(inode, dir, qstr, NULL, NULL);
+	err = security_inode_init_security(inode, dir, qstr, zuf_initxattrs,
+					   NULL);
 	if (err && err != -EOPNOTSUPP)
 		goto fail;
 
diff --git a/fs/zuf/namei.c b/fs/zuf/namei.c
index e78aa04f10d5..803069423674 100644
--- a/fs/zuf/namei.c
+++ b/fs/zuf/namei.c
@@ -420,10 +420,12 @@ const struct inode_operations zuf_dir_inode_operations = {
 	.setattr	= zuf_setattr,
 	.getattr	= zuf_getattr,
 	.update_time	= zuf_update_time,
+	.listxattr	= zuf_listxattr,
 };
 
 const struct inode_operations zuf_special_inode_operations = {
 	.setattr	= zuf_setattr,
 	.getattr	= zuf_getattr,
 	.update_time	= zuf_update_time,
+	.listxattr	= zuf_listxattr,
 };
diff --git a/fs/zuf/super.c b/fs/zuf/super.c
index 2f1dd44290a2..588558066333 100644
--- a/fs/zuf/super.c
+++ b/fs/zuf/super.c
@@ -414,6 +414,7 @@ static int zuf_fill_super(struct super_block *sb, void *data, int silent)
 	sb->s_flags |= MS_NOSEC | (ioc_mount->zmi.acl_on ? SB_POSIXACL : 0);
 
 	sb->s_op = &zuf_sops;
+	sb->s_xattr = zuf_xattr_handlers;
 
 	root_i = zuf_iget(sb, ioc_mount->zmi.zus_ii, ioc_mount->zmi._zi,
 			  &exist);
diff --git a/fs/zuf/symlink.c b/fs/zuf/symlink.c
index 1446bdf60cb9..5e9115ba4cbd 100644
--- a/fs/zuf/symlink.c
+++ b/fs/zuf/symlink.c
@@ -70,4 +70,5 @@ const struct inode_operations zuf_symlink_inode_operations = {
 	.update_time	= zuf_update_time,
 	.setattr	= zuf_setattr,
 	.getattr	= zuf_getattr,
+	.listxattr	= zuf_listxattr,
 };
diff --git a/fs/zuf/xattr.c b/fs/zuf/xattr.c
new file mode 100644
index 000000000000..6aae297c09e3
--- /dev/null
+++ b/fs/zuf/xattr.c
@@ -0,0 +1,314 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Extended Attributes
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#include <linux/fs.h>
+#include <linux/posix_acl_xattr.h>
+#include <linux/xattr.h>
+
+#include "zuf.h"
+
+/* ~~~~~~~~~~~~~~~ xattr get ~~~~~~~~~~~~~~~ */
+
+struct _xxxattr {
+	void *user_buffer;
+	union {
+		struct zufs_ioc_xattr ioc_xattr;
+		char buf[512];
+	} d;
+};
+
+static inline uint _XXXATTR_SIZE(uint ioc_size)
+{
+	struct _xxxattr *_xxxattr;
+
+	return ioc_size + (sizeof(*_xxxattr) - sizeof(_xxxattr->d));
+}
+
+static int _xattr_oh(struct zuf_dispatch_op *zdo, void *parg, ulong max_bytes)
+{
+	struct zufs_ioc_hdr *hdr = zdo->hdr;
+	struct zufs_ioc_xattr *ioc_xattr =
+			container_of(hdr, typeof(*ioc_xattr), hdr);
+	struct _xxxattr *_xxattr =
+			container_of(ioc_xattr, typeof(*_xxattr), d.ioc_xattr);
+	struct zufs_ioc_xattr *user_ioc_xattr = parg;
+
+	if (hdr->err)
+		return 0;
+
+	ioc_xattr->user_buf_size = user_ioc_xattr->user_buf_size;
+
+	hdr->out_len -= sizeof(ioc_xattr->user_buf_size);
+	memcpy(_xxattr->user_buffer, user_ioc_xattr->buf, hdr->out_len);
+	return 0;
+}
+
+ssize_t __zuf_getxattr(struct inode *inode, int type, const char *name,
+		       void *buffer, size_t size)
+{
+	size_t name_len = strlen(name) + 1; /* plus the terminating NUL */
+	struct _xxxattr *p_xattr;
+	struct _xxxattr s_xattr;
+	enum big_alloc_type bat;
+	struct zufs_ioc_xattr *ioc_xattr;
+	size_t ioc_size = sizeof(*ioc_xattr) + name_len;
+	struct zuf_dispatch_op zdo;
+	int err;
+	ssize_t ret;
+
+	zuf_dbg_vfs("[%ld] type=%d name=%s size=%lu ioc_size=%lu\n",
+			inode->i_ino, type, name, size, ioc_size);
+
+	p_xattr = big_alloc(_XXXATTR_SIZE(ioc_size), sizeof(s_xattr), &s_xattr,
+			    GFP_KERNEL, &bat);
+	if (unlikely(!p_xattr))
+		return -ENOMEM;
+
+	ioc_xattr = &p_xattr->d.ioc_xattr;
+	memset(ioc_xattr, 0, sizeof(*ioc_xattr));
+	p_xattr->user_buffer = buffer;
+
+	ioc_xattr->hdr.in_len = ioc_size;
+	ioc_xattr->hdr.out_start =
+				offsetof(struct zufs_ioc_xattr, user_buf_size);
+	 /* out_len updated by zus */
+	ioc_xattr->hdr.out_len = sizeof(ioc_xattr->user_buf_size);
+	ioc_xattr->hdr.out_max = 0;
+	ioc_xattr->hdr.operation = ZUFS_OP_XATTR_GET;
+	ioc_xattr->zus_ii = ZUII(inode)->zus_ii;
+	ioc_xattr->type = type;
+	ioc_xattr->name_len = name_len;
+	ioc_xattr->user_buf_size = size;
+
+	strcpy(ioc_xattr->buf, name);
+
+	zuf_dispatch_init(&zdo, &ioc_xattr->hdr, NULL, 0);
+	zdo.oh = _xattr_oh;
+	err = __zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &zdo);
+	ret = ioc_xattr->user_buf_size;
+
+	big_free(p_xattr, bat);
+
+	if (unlikely(err))
+		return err;
+
+	return ret;
+}
+
+/* ~~~~~~~~~~~~~~~ xattr set ~~~~~~~~~~~~~~~ */
+
+int __zuf_setxattr(struct inode *inode, int type, const char *name,
+		   const void *value, size_t size, int flags)
+{
+	size_t name_len = strlen(name) + 1;
+	struct _xxxattr *p_xattr;
+	struct _xxxattr s_xattr;
+	enum big_alloc_type bat;
+	struct zufs_ioc_xattr *ioc_xattr;
+	size_t ioc_size = sizeof(*ioc_xattr) + name_len + size;
+	int err;
+
+	zuf_dbg_vfs("[%ld] type=%d name=%s size=%lu ioc_size=%lu\n",
+			inode->i_ino, type, name, size, ioc_size);
+
+	p_xattr = big_alloc(_XXXATTR_SIZE(ioc_size), sizeof(s_xattr), &s_xattr,
+			    GFP_KERNEL, &bat);
+	if (unlikely(!p_xattr))
+		return -ENOMEM;
+
+	ioc_xattr = &p_xattr->d.ioc_xattr;
+	memset(ioc_xattr, 0, sizeof(*ioc_xattr));
+
+	ioc_xattr->hdr.in_len = ioc_size;
+	ioc_xattr->hdr.out_len = 0;
+	ioc_xattr->hdr.operation = ZUFS_OP_XATTR_SET;
+	ioc_xattr->zus_ii = ZUII(inode)->zus_ii;
+	ioc_xattr->type = type;
+	ioc_xattr->name_len = name_len;
+	ioc_xattr->user_buf_size = size;
+	ioc_xattr->flags = flags;
+
+	if (value && !size)
+		ioc_xattr->ioc_flags = ZUFS_XATTR_SET_EMPTY;
+
+	strcpy(ioc_xattr->buf, name);
+	if (value)
+		memcpy(ioc_xattr->buf + name_len, value, size);
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_xattr->hdr,
+			    NULL, 0);
+
+	big_free(p_xattr, bat);
+
+	return err;
+}
+
+/* ~~~~~~~~~~~~~~~ xattr list ~~~~~~~~~~~~~~~ */
+
+static ssize_t __zuf_listxattr(struct inode *inode, char *buffer, size_t size)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct _xxxattr s_xattr;
+	struct zufs_ioc_xattr *ioc_xattr;
+	struct zuf_dispatch_op zdo;
+
+	int err;
+
+	zuf_dbg_vfs("[%ld] size=%lu\n", inode->i_ino, size);
+
+	ioc_xattr = &s_xattr.d.ioc_xattr;
+	memset(ioc_xattr, 0, sizeof(*ioc_xattr));
+	s_xattr.user_buffer = buffer;
+
+	ioc_xattr->hdr.in_len = sizeof(*ioc_xattr);
+	ioc_xattr->hdr.out_start =
+				offsetof(struct zufs_ioc_xattr, user_buf_size);
+	 /* out_len updated by zus */
+	ioc_xattr->hdr.out_len = sizeof(ioc_xattr->user_buf_size);
+	ioc_xattr->hdr.out_max = 0;
+	ioc_xattr->hdr.operation = ZUFS_OP_XATTR_LIST;
+	ioc_xattr->zus_ii = zii->zus_ii;
+	ioc_xattr->name_len = 0;
+	ioc_xattr->user_buf_size = size;
+	ioc_xattr->ioc_flags = capable(CAP_SYS_ADMIN) ? ZUFS_XATTR_TRUSTED : 0;
+
+	zuf_dispatch_init(&zdo, &ioc_xattr->hdr, NULL, 0);
+	zdo.oh = _xattr_oh;
+	err = __zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &zdo);
+	if (unlikely(err))
+		return err;
+
+	return ioc_xattr->user_buf_size;
+}
+
+ssize_t zuf_listxattr(struct dentry *dentry, char *buffer, size_t size)
+{
+	struct inode *inode = dentry->d_inode;
+	struct zuf_inode_info *zii = ZUII(inode);
+	ssize_t ret;
+
+	zuf_smr_lock(zii);
+
+	ret = __zuf_listxattr(inode, buffer, size);
+
+	zuf_smr_unlock(zii);
+
+	return ret;
+}
+
+/* ~~~~~~~~~~~~~~~ xattr sb handlers ~~~~~~~~~~~~~~~ */
+static bool zuf_xattr_handler_list(struct dentry *dentry)
+{
+	return true;
+}
+
+static
+int zuf_xattr_handler_get(const struct xattr_handler *handler,
+			  struct dentry *dentry, struct inode *inode,
+			  const char *name, void *value, size_t size)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+	int ret;
+
+	zuf_dbg_xattr("[%ld] name=%s\n", inode->i_ino, name);
+
+	zuf_smr_lock(zii);
+
+	ret = __zuf_getxattr(inode, handler->flags, name, value, size);
+
+	zuf_smr_unlock(zii);
+
+	return ret;
+}
+
+static
+int zuf_xattr_handler_set(const struct xattr_handler *handler,
+			  struct dentry *d_notused, struct inode *inode,
+			  const char *name, const void *value, size_t size,
+			  int flags)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+	int err;
+
+	zuf_dbg_xattr("[%ld] name=%s size=0x%lx flags=0x%x\n",
+			inode->i_ino, name, size, flags);
+
+	zuf_smw_lock(zii);
+
+	err = __zuf_setxattr(inode, handler->flags, name, value, size, flags);
+
+	zuf_smw_unlock(zii);
+
+	return err;
+}
+
+const struct xattr_handler zuf_xattr_security_handler = {
+	.prefix	= XATTR_SECURITY_PREFIX,
+	.flags = ZUF_XF_SECURITY,
+	.list	= zuf_xattr_handler_list,
+	.get	= zuf_xattr_handler_get,
+	.set	= zuf_xattr_handler_set,
+};
+
+const struct xattr_handler zuf_xattr_trusted_handler = {
+	.prefix	= XATTR_TRUSTED_PREFIX,
+	.flags = ZUF_XF_TRUSTED,
+	.list	= zuf_xattr_handler_list,
+	.get	= zuf_xattr_handler_get,
+	.set	= zuf_xattr_handler_set,
+};
+
+const struct xattr_handler zuf_xattr_user_handler = {
+	.prefix	= XATTR_USER_PREFIX,
+	.flags = ZUF_XF_USER,
+	.list	= zuf_xattr_handler_list,
+	.get	= zuf_xattr_handler_get,
+	.set	= zuf_xattr_handler_set,
+};
+
+const struct xattr_handler *zuf_xattr_handlers[] = {
+	&zuf_xattr_user_handler,
+	&zuf_xattr_trusted_handler,
+	&zuf_xattr_security_handler,
+	&posix_acl_access_xattr_handler,
+	&posix_acl_default_xattr_handler,
+	NULL
+};
+
+/*
+ * Callback for security_inode_init_security() for acquiring xattrs.
+ */
+int zuf_initxattrs(struct inode *inode, const struct xattr *xattr_array,
+		   void *fs_info)
+{
+	const struct xattr *xattr;
+
+	for (xattr = xattr_array; xattr->name != NULL; xattr++) {
+		int err;
+
+		/* REMOVEME: We had a BUG here for a long time that never
+		 * crashed, I want to see this is called, please.
+		 */
+		zuf_warn("Yes it is name=%s value-size=%zd\n",
+			  xattr->name, xattr->value_len);
+
+		err = zuf_xattr_handler_set(&zuf_xattr_security_handler, NULL,
+					    inode, xattr->name, xattr->value,
+					    xattr->value_len, 0);
+		if (unlikely(err)) {
+			zuf_err("[%ld] failed to init xattrs err=%d\n",
+				 inode->i_ino, err);
+			return err;
+		}
+	}
+	return 0;
+}
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index 09ad210318f8..c6b614465ab3 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -788,6 +788,9 @@ const char *zuf_op_name(enum e_zufs_operation op)
 		CASE_ENUM_NAME(ZUFS_OP_FALLOCATE	);
 		CASE_ENUM_NAME(ZUFS_OP_LLSEEK		);
 		CASE_ENUM_NAME(ZUFS_OP_IOCTL		);
+		CASE_ENUM_NAME(ZUFS_OP_XATTR_GET	);
+		CASE_ENUM_NAME(ZUFS_OP_XATTR_SET	);
+		CASE_ENUM_NAME(ZUFS_OP_XATTR_LIST	);
 		CASE_ENUM_NAME(ZUFS_OP_BREAK		);
 	default:
 		return "UNKNOWN";
diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
index 98f4ea088671..13b246189d8b 100644
--- a/fs/zuf/zuf.h
+++ b/fs/zuf/zuf.h
@@ -328,6 +328,13 @@ static inline void *zuf_dpp_t_addr(struct super_block *sb, zu_dpp_t v)
 	return md_addr_verify(SBI(sb)->md, zu_dpp_t_val(v));
 }
 
+/* xattr types */
+enum {	ZUF_XF_SECURITY    = 1,
+	ZUF_XF_SYSTEM      = 2,
+	ZUF_XF_TRUSTED     = 3,
+	ZUF_XF_USER        = 4,
+};
+
 enum big_alloc_type { ba_stack, ba_kmalloc, ba_vmalloc };
 
 static inline
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index f32ee615b937..40f369d20306 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -351,6 +351,9 @@ enum e_zufs_operation {
 	ZUFS_OP_FALLOCATE,
 	ZUFS_OP_LLSEEK,
 	ZUFS_OP_IOCTL,
+	ZUFS_OP_XATTR_GET,
+	ZUFS_OP_XATTR_SET,
+	ZUFS_OP_XATTR_LIST,
 
 	ZUFS_OP_BREAK,		/* Kernel telling Server to exit */
 	ZUFS_OP_MAX_OPT,
@@ -602,6 +605,25 @@ struct zufs_ioc_ioctl {
 	};
 };
 
+/* xattr ioc_flags */
+#define ZUFS_XATTR_SET_EMPTY	(1 << 0)
+#define ZUFS_XATTR_TRUSTED	(1 << 1)
+
+/* ZUFS_OP_XATTR */
+struct zufs_ioc_xattr {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *zus_ii;
+	__u32	flags;
+	__u32	type;
+	__u16	name_len;
+	__u16	ioc_flags;
+
+	/* OUT */
+	__u32	user_buf_size;
+	char	buf[0];
+} __packed;
+
 /* ~~~~ io_map structures && IOCTL(s) ~~~~ */
 /*
  * These set of structures and helpers are used in return of zufs_ioc_IO and
-- 
2.20.1



* [RFC PATCH 15/17] zuf: ACL support
  2019-02-19 11:51 [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Boaz harrosh
                   ` (13 preceding siblings ...)
  2019-02-19 11:51 ` [RFC PATCH 14/17] zuf: xattr implementation Boaz harrosh
@ 2019-02-19 11:51 ` Boaz harrosh
  2019-02-19 11:51 ` [RFC PATCH 16/17] zuf: Special IOCTL fadvise (TODO) Boaz harrosh
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 31+ messages in thread
From: Boaz harrosh @ 2019-02-19 11:51 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro
  Cc: Ric Wheeler, Miklos Szeredi, Steven Whitehouse, Jefff moyer,
	Amir Goldstein, Amit Golander, Sagi Manole

From: Boaz Harrosh <boazh@netapp.com>

ACL support is implemented entirely in the Kernel. It adds no
new API towards zusFS.
We define the internal structure of the ACL inside
an opaque xattr and store it via the xattr zus_api.

TODO:
  Future FSs that have their own on-disk ACL format, or
  network zusFSs that have their own verifiers for the ACL,
  will need to establish an alternative API for the ACL.
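
To make the storage format concrete, here is a minimal user-space sketch of
how ACL entries could be laid out inside the opaque xattr value, assuming the
8-byte little-endian record (tag, perm, id) that this patch adds as
struct zuf_acl. The names demo_acl_entry, demo_acl_pack and demo_acl_unpack
are hypothetical and exist only for illustration; the real conversion is done
in the Kernel by acl.c. Unlike the patch, the sketch writes the id field for
every entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One stored record: __le16 tag, __le16 perm, __le32 id (as in struct zuf_acl) */
enum { DEMO_ACL_REC_SIZE = 8 };

/* Hypothetical in-memory entry, standing in for struct posix_acl_entry */
struct demo_acl_entry {
	uint16_t tag;
	uint16_t perm;
	uint32_t id;	/* uid or gid; only meaningful for ACL_USER/ACL_GROUP */
};

static void put_le16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)(v >> 8);
}

static void put_le32(uint8_t *p, uint32_t v)
{
	put_le16(p, (uint16_t)(v & 0xffff));
	put_le16(p + 2, (uint16_t)(v >> 16));
}

static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)get_le16(p) | ((uint32_t)get_le16(p + 2) << 16);
}

/* Pack n entries into out (n * 8 bytes); returns the xattr value size */
static size_t demo_acl_pack(const struct demo_acl_entry *e, size_t n,
			    uint8_t *out)
{
	size_t i;

	assert(e && out);
	for (i = 0; i < n; i++, out += DEMO_ACL_REC_SIZE) {
		put_le16(out, e[i].tag);
		put_le16(out + 2, e[i].perm);
		put_le32(out + 4, e[i].id);
	}
	return n * DEMO_ACL_REC_SIZE;
}

/* Unpack a value; returns the entry count, or -1 on a malformed size */
static int demo_acl_unpack(const uint8_t *in, size_t size,
			   struct demo_acl_entry *e, size_t max)
{
	size_t i, n = size / DEMO_ACL_REC_SIZE;

	assert(in && e);
	if (size % DEMO_ACL_REC_SIZE || n > max)
		return -1;
	for (i = 0; i < n; i++, in += DEMO_ACL_REC_SIZE) {
		e[i].tag = get_le16(in);
		e[i].perm = get_le16(in + 2);
		e[i].id = get_le32(in + 4);
	}
	return (int)n;
}
```

With this layout a 3-entry ACL occupies 24 bytes. _value_to_acl() in the
patch derives the entry count the same way, from
size / sizeof(struct zuf_acl).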

Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/Makefile  |   2 +-
 fs/zuf/_extern.h |   9 ++
 fs/zuf/_pr.h     |   1 +
 fs/zuf/acl.c     | 283 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/file.c    |   2 +
 fs/zuf/inode.c   |  11 ++
 fs/zuf/namei.c   |   4 +
 fs/zuf/zuf.h     |   6 +
 8 files changed, 317 insertions(+), 1 deletion(-)
 create mode 100644 fs/zuf/acl.c

diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index 5d638760a82f..f53504b47c2a 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -17,7 +17,7 @@ zuf-y += md.o t1.o t2.o
 zuf-y += zuf-core.o zuf-root.o
 
 # Main FS
-zuf-y += xattr.o
+zuf-y += acl.o xattr.o
 zuf-y += rw.o mmap.o ioctl.o
 zuf-y += super.o inode.o directory.o namei.o file.o symlink.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index 1f4b39911a5d..2e515af0bb22 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -83,6 +83,15 @@ int __zuf_setxattr(struct inode *inode, int type, const char *name,
 ssize_t zuf_listxattr(struct dentry *dentry, char *buffer, size_t size);
 extern const struct xattr_handler *zuf_xattr_handlers[];
 
+/* acl.c */
+int zuf_set_acl(struct inode *inode, struct posix_acl *acl, int type);
+struct posix_acl *zuf_get_acl(struct inode *inode, int type);
+int zuf_acls_create_pre(struct inode *dir, struct inode *inode,
+			struct posix_acl **user_acl);
+int zuf_acls_create_post(struct inode *dir, struct inode *inode,
+			 struct posix_acl *acl);
+extern const struct xattr_handler zuf_acl_access_xattr_handler;
+extern const struct xattr_handler zuf_acl_default_xattr_handler;
 
 /* super.c */
 int zuf_init_inodecache(void);
diff --git a/fs/zuf/_pr.h b/fs/zuf/_pr.h
index 04b99f57f2b5..7d7e4808dcf0 100644
--- a/fs/zuf/_pr.h
+++ b/fs/zuf/_pr.h
@@ -43,6 +43,7 @@
 #define zuf_dbg_rw(s, args ...)		zuf_chan_debug("rw   ", s, ##args)
 #define zuf_dbg_t1(s, args ...)		zuf_chan_debug("t1   ", s, ##args)
 #define zuf_dbg_xattr(s, args ...)	zuf_chan_debug("xattr", s, ##args)
+#define zuf_dbg_acl(s, args ...)	zuf_chan_debug("acl  ", s, ##args)
 #define zuf_dbg_t2(s, args ...)		zuf_chan_debug("t2dbg", s, ##args)
 #define zuf_dbg_t2_rw(s, args ...)	zuf_chan_debug("t2grw", s, ##args)
 #define zuf_dbg_core(s, args ...)	zuf_chan_debug("core ", s, ##args)
diff --git a/fs/zuf/acl.c b/fs/zuf/acl.c
new file mode 100644
index 000000000000..ccea8ce455fb
--- /dev/null
+++ b/fs/zuf/acl.c
@@ -0,0 +1,283 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Access Control List
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ */
+
+#include <linux/fs.h>
+#include <linux/posix_acl_xattr.h>
+#include <linux/xattr.h>
+#include "zuf.h"
+
+static void _acl_to_value(const struct posix_acl *acl, void *value)
+{
+	int n;
+	struct zuf_acl *macl = value;
+
+	zuf_dbg_acl("acl->count=%d\n", acl->a_count);
+
+	for (n = 0; n < acl->a_count; n++) {
+		const struct posix_acl_entry *entry = &acl->a_entries[n];
+
+		zuf_dbg_acl("aclno=%d tag=0x%x perm=0x%x\n",
+			     n, entry->e_tag, entry->e_perm);
+
+		macl->tag = cpu_to_le16(entry->e_tag);
+		macl->perm = cpu_to_le16(entry->e_perm);
+
+		switch (entry->e_tag) {
+		case ACL_USER:
+			macl->id = cpu_to_le32(__kuid_val(entry->e_uid));
+			break;
+		case ACL_GROUP:
+			macl->id = cpu_to_le32(__kgid_val(entry->e_gid));
+			break;
+		case ACL_USER_OBJ:
+		case ACL_GROUP_OBJ:
+		case ACL_MASK:
+		case ACL_OTHER:
+			break;
+		default:
+			zuf_dbg_err("e_tag=0x%x\n", entry->e_tag);
+			return;
+		}
+		macl++;
+	}
+}
+
+int zuf_set_acl(struct inode *inode, struct posix_acl *acl, int type)
+{
+	char *name = NULL;
+	void *buf;
+	int err;
+	size_t size;
+
+	zuf_dbg_acl("[%ld] acl=%p type=0x%x\n", inode->i_ino, acl, type);
+
+	switch (type) {
+	case ACL_TYPE_ACCESS: {
+		struct zus_inode *zi = ZUII(inode)->zi;
+
+		name = XATTR_POSIX_ACL_ACCESS;
+		if (acl) {
+			err = posix_acl_update_mode(inode, &inode->i_mode,
+						    &acl);
+			if (err < 0)
+				return err;
+
+			inode->i_ctime = current_time(inode);
+			timespec_to_mt(&zi->i_ctime, &inode->i_ctime);
+		}
+		zi->i_mode = cpu_to_le16(inode->i_mode);
+		break;
+	}
+	case ACL_TYPE_DEFAULT:
+		name = XATTR_POSIX_ACL_DEFAULT;
+		if (!S_ISDIR(inode->i_mode))
+			return acl ? -EACCES : 0;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	size = acl ? acl->a_count * sizeof(struct zuf_acl) : 0;
+	buf = kzalloc(size, GFP_KERNEL);
+	if (unlikely(!buf))
+		return -ENOMEM;
+
+	if (acl)
+		_acl_to_value(acl, buf);
+
+	err = __zuf_setxattr(inode, ZUF_XF_SYSTEM, name, buf, size, 0);
+	if (!err)
+		set_cached_acl(inode, type, acl);
+
+	kfree(buf);
+	return err;
+}
+
+static struct posix_acl *_value_to_acl(void *value, size_t size)
+{
+	int n, count;
+	struct posix_acl *acl;
+	struct zuf_acl *macl = value;
+	void *end = value + size;
+
+	if (!value)
+		return NULL;
+
+	count = size / sizeof(struct zuf_acl);
+	if (count < 0)
+		return ERR_PTR(-EINVAL);
+	if (count == 0)
+		return NULL;
+
+	acl = posix_acl_alloc(count, GFP_NOFS);
+	if (unlikely(!acl))
+		return ERR_PTR(-ENOMEM);
+
+	for (n = 0; n < count; n++) {
+		if (end < (void *)macl + sizeof(struct zuf_acl))
+			goto fail;
+
+		zuf_dbg_acl("aclno=%d tag=0x%x perm=0x%x id=0x%x\n",
+			     n, le16_to_cpu(macl->tag), le16_to_cpu(macl->perm),
+			     le32_to_cpu(macl->id));
+
+		acl->a_entries[n].e_tag  = le16_to_cpu(macl->tag);
+		acl->a_entries[n].e_perm = le16_to_cpu(macl->perm);
+
+		switch (acl->a_entries[n].e_tag) {
+		case ACL_USER_OBJ:
+		case ACL_GROUP_OBJ:
+		case ACL_MASK:
+		case ACL_OTHER:
+			macl++;
+			break;
+		case ACL_USER:
+			acl->a_entries[n].e_uid =
+					     KUIDT_INIT(le32_to_cpu(macl->id));
+			macl++;
+			if (end < (void *)macl)
+				goto fail;
+			break;
+		case ACL_GROUP:
+			acl->a_entries[n].e_gid =
+					     KGIDT_INIT(le32_to_cpu(macl->id));
+			macl++;
+			if (end < (void *)macl)
+				goto fail;
+			break;
+
+		default:
+			goto fail;
+		}
+	}
+	if (macl != end)
+		goto fail;
+	return acl;
+
+fail:
+	posix_acl_release(acl);
+	return ERR_PTR(-EINVAL);
+}
+
+struct posix_acl *zuf_get_acl(struct inode *inode, int type)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+	char *name = NULL;
+	void *buf;
+	struct posix_acl *acl = NULL;
+	int ret;
+
+	zuf_dbg_acl("[%ld] type=0x%x\n", inode->i_ino, type);
+
+	switch (type) {
+	case ACL_TYPE_ACCESS:
+		name = XATTR_POSIX_ACL_ACCESS;
+		break;
+	case ACL_TYPE_DEFAULT:
+		name = XATTR_POSIX_ACL_DEFAULT;
+		break;
+	default:
+		WARN_ON(1);
+		return ERR_PTR(-EINVAL);
+	}
+
+	buf = (void *)__get_free_page(GFP_KERNEL);
+	if (unlikely(!buf))
+		return ERR_PTR(-ENOMEM);
+
+	zuf_smr_lock(zii);
+
+	ret = __zuf_getxattr(inode, ZUF_XF_SYSTEM, name, buf, PAGE_SIZE);
+	if (likely(ret > 0)) {
+		acl = _value_to_acl(buf, ret);
+	} else if (ret != -ENODATA) {
+		if (ret != 0)
+			zuf_dbg_err("failed to getattr ret=%d\n", ret);
+		acl = ERR_PTR(ret);
+	}
+
+	if (!IS_ERR(acl))
+		set_cached_acl(inode, type, acl);
+
+	zuf_smr_unlock(zii);
+
+	free_page((ulong)buf);
+
+	return acl;
+}
+
+/* Used by creation of new inodes */
+int zuf_acls_create_pre(struct inode *dir, struct inode *inode,
+			struct posix_acl **user_acl)
+{
+	struct posix_acl *acl;
+
+	if (!IS_POSIXACL(dir))
+		return 0;
+
+	zuf_dbg_acl("[%ld] i_ino=%ld i_mode=o%o\n",
+		     dir->i_ino, inode->i_ino, inode->i_mode);
+
+	if (S_ISLNK(inode->i_mode))
+		return 0;
+
+	acl = get_acl(dir, ACL_TYPE_DEFAULT);
+	if (IS_ERR(acl))
+		return PTR_ERR(acl);
+	if (!acl)
+		inode->i_mode &= ~current_umask();
+	else
+		*user_acl = acl;
+
+	return 0;
+}
+
+int zuf_acls_create_post(struct inode *dir, struct inode *inode,
+			 struct posix_acl *acl)
+{
+	int err;
+
+	zuf_dbg_acl("[%ld] i_ino=%ld i_mode=o%o\n",
+		     dir->i_ino, inode->i_ino, inode->i_mode);
+
+	if (S_ISDIR(inode->i_mode)) {
+		err = zuf_set_acl(inode, acl, ACL_TYPE_DEFAULT);
+		if (err)
+			goto cleanup;
+	}
+	err = __posix_acl_create(&acl, GFP_NOFS, &inode->i_mode);
+	if (unlikely(err < 0))
+		return err;
+
+	zus_zi(inode)->i_mode = cpu_to_le16(inode->i_mode);
+	if (err > 0) { /* This is an extended ACL */
+		err = zuf_set_acl(inode, acl, ACL_TYPE_ACCESS);
+	} else {
+		/* NOTE: Boaz think we will cry over this... */
+		struct zufs_ioc_attr ioc_attr = {
+			.hdr.in_len = sizeof(ioc_attr),
+			.hdr.out_len = sizeof(ioc_attr),
+			.hdr.operation = ZUFS_OP_SETATTR,
+			.zus_ii = ZUII(inode)->zus_ii,
+			.zuf_attr = STATX_MODE,
+		};
+
+		err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)),
+				    &ioc_attr.hdr, NULL, 0);
+		if (unlikely(err && err != -EINTR))
+			zuf_err("zufc_dispatch failed => %d\n", err);
+	}
+
+cleanup:
+	posix_acl_release(acl);
+	return err;
+}
diff --git a/fs/zuf/file.c b/fs/zuf/file.c
index 814a75105321..0ec87ec4d078 100644
--- a/fs/zuf/file.c
+++ b/fs/zuf/file.c
@@ -521,5 +521,7 @@ const struct inode_operations zuf_file_inode_operations = {
 	.getattr	= zuf_getattr,
 	.update_time	= zuf_update_time,
 	.fiemap		= tozu_fiemap,
+	.get_acl	= zuf_get_acl,
+	.set_acl	= zuf_set_acl,
 	.listxattr	= zuf_listxattr,
 };
diff --git a/fs/zuf/inode.c b/fs/zuf/inode.c
index 73f94e7062e5..3b9e78feab06 100644
--- a/fs/zuf/inode.c
+++ b/fs/zuf/inode.c
@@ -342,6 +342,7 @@ struct inode *zuf_new_inode(struct inode *dir, umode_t mode,
 		.flags = tmpfile ? ZI_TMPFILE : 0,
 		.str.len = qstr->len,
 	};
+	struct posix_acl *acl = NULL;
 	struct inode *inode;
 	struct zus_inode *zi = NULL;
 	struct page *pages[2];
@@ -366,6 +367,10 @@ struct inode *zuf_new_inode(struct inode *dir, umode_t mode,
 	if (err && err != -EOPNOTSUPP)
 		goto fail;
 
+	err = zuf_acls_create_pre(dir, inode, &acl);
+	if (unlikely(err))
+		goto fail;
+
 	zuf_set_inode_flags(inode, &ioc_new_inode.zi);
 
 	if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode) ||
@@ -406,6 +411,12 @@ struct inode *zuf_new_inode(struct inode *dir, umode_t mode,
 
 	zuf_dbg_verbose("allocating inode %ld (zi=%p)\n", _zi_ino(zi), zi);
 
+	if (acl && !symname) {
+		err = zuf_acls_create_post(dir, inode, acl);
+		if (unlikely(err))
+			goto fail;
+	}
+
 	err = insert_inode_locked(inode);
 	if (unlikely(err)) {
 		zuf_dbg_err("[%ld:%s] generation=%lld insert_inode_locked => %d\n",
diff --git a/fs/zuf/namei.c b/fs/zuf/namei.c
index 803069423674..a33745c328b9 100644
--- a/fs/zuf/namei.c
+++ b/fs/zuf/namei.c
@@ -420,6 +420,8 @@ const struct inode_operations zuf_dir_inode_operations = {
 	.setattr	= zuf_setattr,
 	.getattr	= zuf_getattr,
 	.update_time	= zuf_update_time,
+	.get_acl	= zuf_get_acl,
+	.set_acl	= zuf_set_acl,
 	.listxattr	= zuf_listxattr,
 };
 
@@ -427,5 +429,7 @@ const struct inode_operations zuf_special_inode_operations = {
 	.setattr	= zuf_setattr,
 	.getattr	= zuf_getattr,
 	.update_time	= zuf_update_time,
+	.get_acl	= zuf_get_acl,
+	.set_acl	= zuf_set_acl,
 	.listxattr	= zuf_listxattr,
 };
diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
index 13b246189d8b..b6347dc6eb6a 100644
--- a/fs/zuf/zuf.h
+++ b/fs/zuf/zuf.h
@@ -335,6 +335,12 @@ enum {	ZUF_XF_SECURITY    = 1,
 	ZUF_XF_USER        = 4,
 };
 
+struct zuf_acl {
+	__le16	tag;
+	__le16	perm;
+	__le32	id;
+} __packed;
+
 enum big_alloc_type { ba_stack, ba_kmalloc, ba_vmalloc };
 
 static inline
-- 
2.20.1



* [RFC PATCH 16/17] zuf: Special IOCTL fadvise (TODO)
  2019-02-19 11:51 [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Boaz harrosh
                   ` (14 preceding siblings ...)
  2019-02-19 11:51 ` [RFC PATCH 15/17] zuf: ACL support Boaz harrosh
@ 2019-02-19 11:51 ` Boaz harrosh
  2019-02-19 11:51 ` [RFC PATCH 17/17] zuf: Support for dynamic-debug of zusFSs Boaz harrosh
  2019-02-19 12:15 ` [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Matthew Wilcox
  17 siblings, 0 replies; 31+ messages in thread
From: Boaz harrosh @ 2019-02-19 11:51 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro
  Cc: Ric Wheeler, Miklos Szeredi, Steven Whitehouse, Jefff moyer,
	Amir Goldstein, Amit Golander, Sagi Manole

From: Boaz Harrosh <boazh@netapp.com>

We establish an fadvise dispatch with zus.

We also define a new IOCTL to drive this into
zuf. The IOCTL has the same structure and constants
as the fadvise syscall.

However:
TODO:
  The VFS does not call into the FS for fadvise, and
  since we do not use a page-cache it is a no-op for
  zuf.
  We need to send a patch to the VFS that lets an FS
  hop in if it wants to implement its own fadvise.
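
As an illustration of how an application might drive this IOCTL, below is a
minimal user-space sketch. The struct layout and the ZUFS_IOC_FADVISE
definition are copied from the zus_api.h hunk of this patch; the helper
demo_zuf_fadvise is hypothetical, and the sketch assumes that fd refers to a
regular file on a mounted zuf filesystem.

```c
#include <assert.h>
#include <fcntl.h>		/* POSIX_FADV_* constants for callers */
#include <stdint.h>
#include <linux/ioctl.h>	/* _IOW */
#include <sys/ioctl.h>		/* ioctl() */

/* Mirrors struct zufs_ioc_fadvise from this patch: three packed __u64 fields */
struct zufs_ioc_fadvise {
	uint64_t offset;
	uint64_t length;	/* if 0 all file */
	uint64_t advise;	/* same constants as the fadvise syscall */
} __attribute__((packed));

#define ZUFS_IOC_FADVISE	_IOW('S', 2, struct zufs_ioc_fadvise)

/*
 * Hypothetical helper: pass fadvise-style advice to zuf through the IOCTL.
 * Returns the ioctl() result: 0 on success, -1 with errno set on failure.
 */
static int demo_zuf_fadvise(int fd, uint64_t offset, uint64_t length,
			    int advise)
{
	struct zufs_ioc_fadvise iof = {
		.offset = offset,
		.length = length,
		.advise = (uint64_t)advise,
	};

	assert(advise >= 0);	/* POSIX_FADV_* values are non-negative */
	return ioctl(fd, ZUFS_IOC_FADVISE, &iof);
}
```

A caller would use it like posix_fadvise(), for example
demo_zuf_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED) to request a pre-read of the
whole file. On a file that is not on zuf the call is expected to fail rather
than be silently ignored.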

Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/_extern.h  |  2 ++
 fs/zuf/ioctl.c    | 25 ++++++++++++++++++++++
 fs/zuf/rw.c       | 54 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/zuf-core.c |  1 +
 fs/zuf/zus_api.h  | 14 ++++++++++++
 5 files changed, 96 insertions(+)

diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index 2e515af0bb22..e345b737499d 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -66,6 +66,8 @@ ssize_t zuf_rw_read_iter(struct super_block *sb, struct inode *inode,
 ssize_t zuf_rw_write_iter(struct super_block *sb, struct inode *inode,
 			  struct kiocb *kiocb, struct iov_iter *ii);
 int zuf_trim_edge(struct inode *inode, ulong filepos, uint len);
+int zuf_fadvise(struct super_block *sb, struct inode *inode,
+		loff_t offset, loff_t len, int advise, bool rand);
 int zuf_iom_execute_sync(struct super_block *sb, struct inode *inode,
 			 __u64 *iom_e, uint iom_n);
 int zuf_iom_execute_async(struct super_block *sb, struct zus_iomap_build *iomb,
diff --git a/fs/zuf/ioctl.c b/fs/zuf/ioctl.c
index 13ce65764c38..fb9727ab9d31 100644
--- a/fs/zuf/ioctl.c
+++ b/fs/zuf/ioctl.c
@@ -238,6 +238,29 @@ static int _ioc_setversion(struct inode *inode, uint __user *parg)
 	return err;
 }
 
+static int _ioc_fadvise(struct file *file, ulong arg)
+{
+	struct inode *inode = file_inode(file);
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zufs_ioc_fadvise iof = {};
+	int err;
+
+	if (!S_ISREG(inode->i_mode))
+		return -EINVAL;
+
+	if (arg && copy_from_user(&iof, (void __user *)arg, sizeof(iof)))
+		return -EFAULT;
+
+	zuf_r_lock(zii);
+
+	err = zuf_fadvise(inode->i_sb, inode, iof.offset, iof.length,
+			  iof.advise, file->f_mode & FMODE_RANDOM);
+
+	zuf_r_unlock(zii);
+
+	return err;
+}
+
 long zuf_ioctl(struct file *filp, unsigned int cmd, ulong arg)
 {
 	struct inode *inode = filp->f_inode;
@@ -252,6 +275,8 @@ long zuf_ioctl(struct file *filp, unsigned int cmd, ulong arg)
 		return put_user(inode->i_generation, (int __user *)arg);
 	case FS_IOC_SETVERSION:
 		return _ioc_setversion(inode, parg);
+	case ZUFS_IOC_FADVISE:
+		return _ioc_fadvise(filp, arg);
 	default:
 		return _ioctl_dispatch(inode, cmd, arg);
 	}
diff --git a/fs/zuf/rw.c b/fs/zuf/rw.c
index 400d24ea7914..0cdb3c257ff8 100644
--- a/fs/zuf/rw.c
+++ b/fs/zuf/rw.c
@@ -315,6 +315,60 @@ ssize_t zuf_rw_write_iter(struct super_block *sb, struct inode *inode,
 	return ret;
 }
 
+static int _fadv_willneed(struct super_block *sb, struct inode *inode,
+			  loff_t offset, loff_t len, bool rand)
+{
+	struct zufs_ioc_IO io = {};
+	struct __zufs_ra ra = {
+		.start = md_o2p(offset),
+		.ra_pages = md_o2p_up(len),
+		.prev_pos = offset - 1,
+	};
+	int err;
+
+	io.ra.start = ra.start;
+	io.ra.ra_pages = ra.ra_pages;
+	io.ra.prev_pos = ra.prev_pos;
+	io.flags = rand ? ZUFS_IO_RAND : 0;
+
+	err = _IO_dispatch(SBI(sb), &io, ZUII(inode), ZUFS_OP_PRE_READ, 0,
+			   NULL, 0, offset, 0);
+	return err;
+}
+
+static int _fadv_dontneed(struct super_block *sb, struct inode *inode,
+			  loff_t offset, loff_t len)
+{
+	struct zufs_ioc_range ioc_range = {
+		.hdr.in_len = sizeof(ioc_range),
+		.hdr.operation = ZUFS_OP_SYNC,
+		.zus_ii = ZUII(inode)->zus_ii,
+		.offset = offset,
+		.length = len,
+		.ioc_flags = ZUFS_RF_DONTNEED,
+	};
+
+	return zufc_dispatch(ZUF_ROOT(SBI(sb)), &ioc_range.hdr, NULL, 0);
+}
+
+int zuf_fadvise(struct super_block *sb, struct inode *inode,
+		loff_t offset, loff_t len, int advise, bool rand)
+{
+	switch (advise) {
+	case POSIX_FADV_WILLNEED:
+		return _fadv_willneed(sb, inode, offset, len, rand);
+	case POSIX_FADV_DONTNEED:
+		return _fadv_dontneed(sb, inode, offset, len);
+	case POSIX_FADV_NOREUSE: /* TODO */
+	case POSIX_FADV_SEQUENTIAL: /* TODO: turn off random */
+	case POSIX_FADV_NORMAL:
+		return 0;
+	default:
+		return -EINVAL;
+	}
+	return -EINVAL;
+}
+
 /* ~~~~ iom_dec.c ~~~ */
 /* for now here (at rw.c) looks logical */
 
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index c6b614465ab3..2afccfcf90bb 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -778,6 +778,7 @@ const char *zuf_op_name(enum e_zufs_operation op)
 		CASE_ENUM_NAME(ZUFS_OP_CLONE		);
 		CASE_ENUM_NAME(ZUFS_OP_COPY		);
 		CASE_ENUM_NAME(ZUFS_OP_READ		);
+		CASE_ENUM_NAME(ZUFS_OP_PRE_READ		);
 		CASE_ENUM_NAME(ZUFS_OP_WRITE		);
 		CASE_ENUM_NAME(ZUFS_OP_GET_BLOCK	);
 		CASE_ENUM_NAME(ZUFS_OP_PUT_BLOCK	);
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 40f369d20306..95fb5c35cde5 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -159,6 +159,19 @@ struct zus_inode {
 	/* Total ZUFS_INODE_SIZE bytes always */
 };
 
+/* ~~~~~ vfs extension ioctl commands ~~~~~ */
+
+/* TODO: This one needs to be an FS vector called from
+ * the fadvise()  system call. (Future patch)
+ */
+struct zufs_ioc_fadvise {
+	__u64	offset;
+	__u64	length;		/* if 0 all file */
+	__u64	advise;
+} __packed;
+
+#define ZUFS_IOC_FADVISE	_IOW('S', 2, struct zufs_ioc_fadvise)
+
 /* ~~~~~ ZUFS API ioctl commands ~~~~~ */
 enum {
 	ZUS_API_MAP_MAX_PAGES	= 1024,
@@ -341,6 +354,7 @@ enum e_zufs_operation {
 	ZUFS_OP_COPY,
 
 	ZUFS_OP_READ,
+	ZUFS_OP_PRE_READ,
 	ZUFS_OP_WRITE,
 	ZUFS_OP_GET_BLOCK,
 	ZUFS_OP_PUT_BLOCK,
-- 
2.20.1



* [RFC PATCH 17/17] zuf: Support for dynamic-debug of zusFSs
  2019-02-19 11:51 [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Boaz harrosh
                   ` (15 preceding siblings ...)
  2019-02-19 11:51 ` [RFC PATCH 16/17] zuf: Special IOCTL fadvise (TODO) Boaz harrosh
@ 2019-02-19 11:51 ` Boaz harrosh
  2019-02-19 12:15 ` [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Matthew Wilcox
  17 siblings, 0 replies; 31+ messages in thread
From: Boaz harrosh @ 2019-02-19 11:51 UTC (permalink / raw)
  To: linux-fsdevel, Anna Schumaker, Al Viro
  Cc: Ric Wheeler, Miklos Szeredi, Steven Whitehouse, Jefff moyer,
	Amir Goldstein, Amit Golander, Sagi Manole

From: Boaz Harrosh <boazh@netapp.com>

In zus we support dynamic-debug prints, i.e. the user can
turn the prints on and off at run time by writing
to a special file.

The API is exactly the same as the Kernel's dynamic-debug;
only the special file that we read and write is:
	/sys/fs/zuf/ddbg

Otherwise it is identical to the Kernel's.

The Kernel code is a thin wrapper that dispatches the
reads and writes of the /sys/fs/zuf/ddbg file to the zus
server.
The heavy lifting is done by the zus project build system
and core code. See the zus project for how this is done.

This facility is dispatched on the mount-thread and not
on the regular ZTs, because it is available globally, before
any mounts.
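
For illustration, a control message could be sent from user space roughly as
sketched below. The path /sys/fs/zuf/ddbg and the 512-byte limit come from
this patch; the helper demo_ddbg_write is hypothetical, and which query
keywords the zus server actually accepts is an assumption based on the
statement that the API matches the Kernel's dynamic-debug.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define ZUF_DDBG_PATH	"/sys/fs/zuf/ddbg"	/* from this patch */
#define ZUF_DDBG_MAX	512	/* _zus_ddbg_write() rejects longer messages */

/*
 * Hypothetical helper: write one dynamic-debug control string to `path`.
 * Returns 0 on success, -1 on a bad length or an I/O error.
 */
static int demo_ddbg_write(const char *path, const char *query)
{
	size_t len;
	FILE *f;
	int ok;

	assert(path && query);
	len = strlen(query);
	if (len == 0 || len > ZUF_DDBG_MAX)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fwrite(query, 1, len, f) == len;
	if (fclose(f) != 0)	/* buffered data is flushed here */
		ok = 0;
	return ok ? 0 : -1;
}
```

Assuming Kernel-style syntax, something like
demo_ddbg_write(ZUF_DDBG_PATH, "file foofs.c +p") would turn on the prints of
one source file, where foofs.c is a placeholder name. Checking the length in
user space avoids a round trip that the Kernel side would reject with -EINVAL.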

Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 fs/zuf/zuf-core.c |  1 +
 fs/zuf/zuf-root.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/zus_api.h  | 12 +++++++-
 3 files changed, 89 insertions(+), 1 deletion(-)

diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index 2afccfcf90bb..a9034bb196db 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -777,6 +777,7 @@ const char *zuf_op_name(enum e_zufs_operation op)
 		CASE_ENUM_NAME(ZUFS_OP_READDIR		);
 		CASE_ENUM_NAME(ZUFS_OP_CLONE		);
 		CASE_ENUM_NAME(ZUFS_OP_COPY		);
+
 		CASE_ENUM_NAME(ZUFS_OP_READ		);
 		CASE_ENUM_NAME(ZUFS_OP_PRE_READ		);
 		CASE_ENUM_NAME(ZUFS_OP_WRITE		);
diff --git a/fs/zuf/zuf-root.c b/fs/zuf/zuf-root.c
index 37b70ca33d3c..5c8c4af1a7e7 100644
--- a/fs/zuf/zuf-root.c
+++ b/fs/zuf/zuf-root.c
@@ -79,6 +79,82 @@ static void _fs_type_free(struct zuf_fs_type *zft)
 }
 #endif /*CONFIG_LOCKDEP*/
 
+#define DDBG_MAX_BUF_SIZE	(8 * PAGE_SIZE)
+/* We use ppos as a cookie for the dynamic debug ID we want to read from */
+static ssize_t _zus_ddbg_read(struct file *file, char __user *buf, size_t len,
+			      loff_t *ppos)
+{
+	struct zufs_ioc_mount *zim;
+	size_t buf_size = (DDBG_MAX_BUF_SIZE <= len) ? DDBG_MAX_BUF_SIZE : len;
+	size_t zim_size =  sizeof(zim->hdr) + sizeof(zim->zdi);
+	ssize_t err;
+
+	zim = vzalloc(zim_size + buf_size);
+	if (unlikely(!zim))
+		return -ENOMEM;
+
+	/* null terminate the 1st character in the buffer, hence the '+ 1' */
+	zim->hdr.in_len = zim_size + 1;
+	zim->hdr.out_len = zim_size + buf_size;
+	zim->zdi.len = buf_size;
+	zim->zdi.id = *ppos;
+	*ppos = 0;
+
+	err = __zufc_dispatch_mount(ZRI(file->f_inode->i_sb), ZUFS_M_DDBG_RD,
+				    zim);
+	if (unlikely(err)) {
+		zuf_err("error dispatching control message => %ld\n", err);
+		goto out;
+	}
+
+	err = simple_read_from_buffer(buf, zim->zdi.len, ppos, zim->zdi.msg,
+				      buf_size);
+	if (unlikely(err <= 0))
+		goto out;
+
+	*ppos = zim->zdi.id;
+out:
+	vfree(zim);
+	return err;
+}
+
+static ssize_t _zus_ddbg_write(struct file *file, const char __user *buf,
+			       size_t len, loff_t *ofst)
+{
+	struct _ddbg_info {
+		struct zufs_ioc_mount zim;
+		char buf[512];
+	} ddi = {};
+	ssize_t err;
+
+	if (unlikely(512 < len)) {
+		zuf_err("ddbg control message too long\n");
+		return -EINVAL;
+	}
+
+	memset(&ddi, 0, sizeof(ddi));
+	if (copy_from_user(ddi.zim.zdi.msg, buf, len))
+		return -EFAULT;
+
+	ddi.zim.hdr.in_len = sizeof(ddi);
+	ddi.zim.hdr.out_len = sizeof(ddi.zim);
+	err = __zufc_dispatch_mount(ZRI(file->f_inode->i_sb), ZUFS_M_DDBG_WR,
+				    &ddi.zim);
+	if (unlikely(err)) {
+		zuf_err("error dispatching control message => %ld\n", err);
+		return err;
+	}
+
+	return len;
+}
+
+static const struct file_operations _zus_ddbg_ops = {
+	.open = nonseekable_open,
+	.read = _zus_ddbg_read,
+	.write = _zus_ddbg_write,
+	.llseek = no_llseek,
+};
+
 int zufr_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs)
 {
 	struct zuf_fs_type *zft = _fs_type_alloc();
@@ -256,6 +332,7 @@ static const struct super_operations zufr_super_operations = {
 static int zufr_fill_super(struct super_block *sb, void *data, int silent)
 {
 	static struct tree_descr zufr_files[] = {
+		[2] = {"ddbg", &_zus_ddbg_ops, S_IFREG | 0600},
 		{""},
 	};
 	struct zuf_root_info *zri;
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 95fb5c35cde5..14ada4c760ae 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -268,10 +268,20 @@ struct zufs_mount_info {
 	struct zufs_parse_options po;
 };
 
+struct zufs_ddbg_info {
+	__u64 id; /* IN where to start from, OUT last ID */
+	/* IN size of buffer, OUT size of dynamic debug message */
+	__u64 len;
+	char msg[0];
+};
+
 /* mount / umount */
 struct  zufs_ioc_mount {
 	struct zufs_ioc_hdr hdr;
-	struct zufs_mount_info zmi;
+	union {
+		struct zufs_mount_info zmi;
+		struct zufs_ddbg_info zdi;
+	};
 };
 #define ZU_IOC_MOUNT	_IOWR('Z', 11, struct zufs_ioc_mount)
 
-- 
2.20.1



* Re: [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem
  2019-02-19 11:51 [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Boaz harrosh
                   ` (16 preceding siblings ...)
  2019-02-19 11:51 ` [RFC PATCH 17/17] zuf: Support for dynamic-debug of zusFSs Boaz harrosh
@ 2019-02-19 12:15 ` Matthew Wilcox
  2019-02-19 19:15   ` Boaz Harrosh
  17 siblings, 1 reply; 31+ messages in thread
From: Matthew Wilcox @ 2019-02-19 12:15 UTC (permalink / raw)
  To: Boaz harrosh
  Cc: linux-fsdevel, Anna Schumaker, Al Viro, Ric Wheeler,
	Miklos Szeredi, Steven Whitehouse, Jefff moyer, Amir Goldstein,
	Amit Golander, Sagi Manole

On Tue, Feb 19, 2019 at 01:51:19PM +0200, Boaz harrosh wrote:
> Please see first patch for License of this project
> 
> Current status: There are a couple of trivial open-source filesystem
> implementations and a full blown proprietary implementation from Netapp.

I regard this patchset as being an attempt to avoid your obligations
under the GPL.  As such, I will not be reviewing this code and I oppose
its inclusion.


* Re: [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem
  2019-02-19 12:15 ` [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Matthew Wilcox
@ 2019-02-19 19:15   ` Boaz Harrosh
  0 siblings, 0 replies; 31+ messages in thread
From: Boaz Harrosh @ 2019-02-19 19:15 UTC (permalink / raw)
  To: Matthew Wilcox, Boaz harrosh
  Cc: linux-fsdevel, Anna Schumaker, Al Viro, Ric Wheeler,
	Miklos Szeredi, Steven Whitehouse, Jefff moyer, Amir Goldstein,
	Amit Golander, Sagi Manole

On 19/02/19 14:15, Matthew Wilcox wrote:
> On Tue, Feb 19, 2019 at 01:51:19PM +0200, Boaz harrosh wrote:
>> Please see first patch for License of this project
>>
>> Current status: There are a couple of trivial open-source filesystem
>> implementations and a full blown proprietary implementation from Netapp.
> 
> I regard this patchset as being an attempt to avoid your obligations
> under the GPL.  As such, I will not be reviewing this code and I oppose
> its inclusion.
> 

Dearest Matthew

One day we'll sit over a beer and you can explain it to me. I do trust your
opinion, but I do not understand.

Specifically, the above "full blown proprietary implementation from Netapp"
does not break the GPL at all. Parts of it are written in languages alien
to the Kernel, and parts use user-mode libs and code IP that are not able
to live in the Kernel. At the beginning we had code to inject the FS into
the application of choice via ld.so, so that only selected apps like a DB would
have a view of the filesystem. But you can imagine what a nightmare this is
for IT. Being POSIX under the Kernel means so much less reinventing of the wheel
for, say: backup, disaster-recovery, cloud ....

Now, if you actually look at the code submitted you will see that we are using
very little out of the Kernel. For comparison, FUSE uses the Kernel much more
heavily: it utilizes the page-cache, Kernel reclaimers, smart write-back,
the lot. In ZUFS we take the uppermost interfaces and send them downstream as is.
Wherever there is depth of stack we take the topmost level and push that to
the server as is, completely synchronous to the app threads.

The only real novelty in this project, something completely new in this
submission, is the new RPC we invented here, which utilizes per-cpu techniques
to show a kind of performance never seen before between two processes.

You are a Kernel contributor, you have IP in the Kernel. Your opinion is very
important to me and to Netapp. Please point me to the areas where you feel
I have stepped on your IP and have not respected the GPL. I would very much
want to fix them.

Or maybe my sin is that I am too successful? Is the GPL guarded by speed?
I mean, FUSE is already committing all these sins, and so are other subsystems
that bridge Kernel functionality to user-mode. There are other user-mode
"drivers" all over the place. But they are all so slooooooow. So a serious
FS or server needs to sit in the Kernel. With zufs we can now delegate to user-mode.
The kernel becomes a micro-kernel, a very fast bridge, and moves out of the way,
creating space for serious servers to sit in userland.

To summarize: I take your statement very seriously. Please state which service
of the GPLed Kernel I am exposing and circumventing, and I will want to fix it ASAP.
My philosophy was to take the POSIX interfaces as high as possible
and shove them to userland, in an RPC manner that I invented and that is very fast.
If there are areas where I am not doing so, please show me.

Best regards
Boaz


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 02/17] zuf: Preliminary Documentation
  2019-02-19 11:51 ` [RFC PATCH 02/17] zuf: Preliminary Documentation Boaz harrosh
@ 2019-02-20  8:27   ` Miklos Szeredi
  2019-02-20 14:24     ` Boaz Harrosh
  0 siblings, 1 reply; 31+ messages in thread
From: Miklos Szeredi @ 2019-02-20  8:27 UTC (permalink / raw)
  To: Boaz harrosh
  Cc: linux-fsdevel, Anna Schumaker, Al Viro, Ric Wheeler,
	Steven Whitehouse, Jefff moyer, Amir Goldstein, Amit Golander,
	Sagi Manole

On Tue, Feb 19, 2019 at 12:51 PM Boaz harrosh <boaz@plexistor.com> wrote:
>

> +4. An FS operation like create or WRITE/READ and so on arrives from application
> +   via VFS. Eventually an Operation is dispatched to zus:
> +   ▪ A special per-operation descriptor is filled up with all parameters.
> +   ▪ A current CPU channel is grabbed. the operation descriptor is put on
> +     that channel (ZT). Including get_user_pages or Kernel-pages associated
> +     with this OPT.
> +   ▪ The ZT is awaken, app thread put to sleep.
> +   ▪ In ZT context pages are mapped to that ZT-vma. This is so we are sure
> +     the map is only on a single core. And no other core's TLB is affected.
> +     (This here is the all performance secret)

I still don't get it.  You say mapping the page to server address
space is the performance secret.  I say it's a security issue as well
as being perfectly useless, except for special cases.

So let's see.  There's the pmem case, which seems to be what this is
mostly about.    So we take a pmem filesystem and e.g. a read(2)
syscall.  There's no zero copy to talk about in that case since data
needs to be copied from pmem to application buffer.   If we want to
minimize memory copies, then it will be a single memory copy from pmem
to the app buffer.   Your implementation chooses to do this copy in
the userland server, via the application buffer mapped into its
address space.   But the memory copy could just as well be done by the
kernel, from one virtual address to another; the kernel just has to
juggle with physical page lookups, but does not have to establish a
new mapping, which makes it all the more secure and performant.

Sure, we've heard about arguments why the above doesn't work if the
data comes from e.g. a network where the network driver could be
writing data directly to application buffer.  But I've shown how that
could also be done with a new rsplice() interface, again without
having to insert the application page into the server address space.
And no, this doesn't have to involve syscall overhead, if that turns
out to be the limiting factor: mapped pipes might be an interesting
new concept as well.

The only way I see this implementation actually save a memory copy is
if a transformation is required (which again, already implies at least
a single memory copy) such as compression or encryption.   But even
that could be delegated to the kernel, since it does have plenty of
such transformations implemented, only an interface is required to
tell the kernel which one to do on the buffer.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 01/17] fs: Add the ZUF filesystem to the build + License
  2019-02-19 11:51 ` [RFC PATCH 01/17] fs: Add the ZUF filesystem to the build + License Boaz harrosh
@ 2019-02-20 11:03   ` Greg KH
  2019-02-20 14:55     ` Boaz Harrosh
  2019-02-26 17:55   ` Schumaker, Anna
  1 sibling, 1 reply; 31+ messages in thread
From: Greg KH @ 2019-02-20 11:03 UTC (permalink / raw)
  To: Boaz harrosh
  Cc: linux-fsdevel, Anna Schumaker, Al Viro, Ric Wheeler,
	Miklos Szeredi, Steven Whitehouse, Jefff moyer, Amir Goldstein,
	Amit Golander, Sagi Manole

On Tue, Feb 19, 2019 at 01:51:20PM +0200, Boaz harrosh wrote:
>   However the ZUS user mode Server is a BSD-3-Clause licensed
>   project.
>   Therefor you will see that:
> 	zus_api.h
> 	md_def.h
> 	md.h
>   These are common files with ZUS project are separately Licensed as
>   BSD-3-Clause. Any contributor to these files should note that his
>   code for these files is submitted as BSD-3-Clause.
>   This is for the obvious reasons as these define the API between Kernel
>   an user-mode Server. It is the opinion of this  project authors, as is
>   of Linus that a pure API header is not governed by ANY license. But
>   to make it clear.

No, that is not true, the files do have a license.  Please see the
proper SPDX text that we use for kernel api .h files for what that is,
and what you should also use for yours, if you go through with this.
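
For reference, a sketch of the forms found in existing kernel UAPI headers.
Whether either form suits these particular files is a licensing question this
example does not settle. Headers under include/uapi/ carry the syscall-note
exception, and dual-licensed ones combine it with a second license using OR
inside parentheses (BSD-3-Clause is used below only as the example second
license):

```c
/* Form used by GPL-only UAPI headers: */
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */

/* Form used by dual-licensed UAPI headers: */
/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
```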

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 09/17] zuf: symlink
  2019-02-19 11:51 ` [RFC PATCH 09/17] zuf: symlink Boaz harrosh
@ 2019-02-20 11:05   ` Greg KH
  2019-02-20 14:12     ` Boaz Harrosh
  0 siblings, 1 reply; 31+ messages in thread
From: Greg KH @ 2019-02-20 11:05 UTC (permalink / raw)
  To: Boaz harrosh
  Cc: linux-fsdevel, Anna Schumaker, Al Viro, Ric Wheeler,
	Miklos Szeredi, Steven Whitehouse, Jefff moyer, Amir Goldstein,
	Amit Golander, Sagi Manole

On Tue, Feb 19, 2019 at 01:51:28PM +0200, Boaz harrosh wrote:
> From: Boaz Harrosh <boazh@netapp.com>
> 
> Signed-off-by: Boaz Harrosh <boazh@netapp.com>

I know I never accept patches without any changelog text...

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 09/17] zuf: symlink
  2019-02-20 11:05   ` Greg KH
@ 2019-02-20 14:12     ` Boaz Harrosh
  0 siblings, 0 replies; 31+ messages in thread
From: Boaz Harrosh @ 2019-02-20 14:12 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-fsdevel, Anna Schumaker, Al Viro, Ric Wheeler,
	Miklos Szeredi, Steven Whitehouse, Jefff moyer, Amir Goldstein,
	Amit Golander, Sagi Manole

On 20/02/19 13:05, Greg KH wrote:
> On Tue, Feb 19, 2019 at 01:51:28PM +0200, Boaz harrosh wrote:
>> From: Boaz Harrosh <boazh@netapp.com>
>>
>> Signed-off-by: Boaz Harrosh <boazh@netapp.com>
> 
> I know I never accept patches without any changelog text...
> 

Ooops sorry, totally right. Me too ;-)

Will fix ASAP

Thanks
Boaz 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 02/17] zuf: Preliminary Documentation
  2019-02-20  8:27   ` Miklos Szeredi
@ 2019-02-20 14:24     ` Boaz Harrosh
  0 siblings, 0 replies; 31+ messages in thread
From: Boaz Harrosh @ 2019-02-20 14:24 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linux-fsdevel, Anna Schumaker, Al Viro, Ric Wheeler,
	Steven Whitehouse, Jefff moyer, Amir Goldstein, Amit Golander,
	Sagi Manole

On 20/02/19 10:27, Miklos Szeredi wrote:
> On Tue, Feb 19, 2019 at 12:51 PM Boaz harrosh <boaz@plexistor.com> wrote:
>>
> 
>> +4. An FS operation like create or WRITE/READ and so on arrives from application
>> +   via VFS. Eventually an Operation is dispatched to zus:
>> +   ▪ A special per-operation descriptor is filled up with all parameters.
>> +   ▪ A current CPU channel is grabbed. the operation descriptor is put on
>> +     that channel (ZT). Including get_user_pages or Kernel-pages associated
>> +     with this OPT.
>> +   ▪ The ZT is awaken, app thread put to sleep.
>> +   ▪ In ZT context pages are mapped to that ZT-vma. This is so we are sure
>> +     the map is only on a single core. And no other core's TLB is affected.
>> +     (This here is the all performance secret)
> 
> I still don't get it.  You say mapping the page to server address
> space is the performance secret.  I say it's a security issue as well
> as being perfectly useless, except for special cases.
> 

Already working on this. Will submit new code in a month. There is already
an API of GET_BLOCK that returns one or more server blocks (via dpp_t) to
the Kernel. Today it is used in mmap, to directly map pmem to application space.
We will use this API to memcpy(nt) in kernel space.

The mapping facility will be optional for zusFS(s) but the preferred way
will be Kernel side copy.

This is because vm_zap_ptes() is way, way too slow in the current design
and I need things faster ...

<>
> Thanks,
> Miklos
> 

Thanks
Boaz


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 01/17] fs: Add the ZUF filesystem to the build + License
  2019-02-20 11:03   ` Greg KH
@ 2019-02-20 14:55     ` Boaz Harrosh
  2019-02-20 19:40       ` Greg KH
  0 siblings, 1 reply; 31+ messages in thread
From: Boaz Harrosh @ 2019-02-20 14:55 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-fsdevel, Anna Schumaker, Al Viro, Ric Wheeler,
	Miklos Szeredi, Steven Whitehouse, Jefff moyer, Amir Goldstein,
	Amit Golander, Sagi Manole

On 20/02/19 13:03, Greg KH wrote:
> On Tue, Feb 19, 2019 at 01:51:20PM +0200, Boaz harrosh wrote:
>>   However the ZUS user mode Server is a BSD-3-Clause licensed
>>   project.
>>   Therefor you will see that:
>> 	zus_api.h
>> 	md_def.h
>> 	md.h
>>   These are common files with ZUS project are separately Licensed as
>>   BSD-3-Clause. Any contributor to these files should note that his
>>   code for these files is submitted as BSD-3-Clause.
>>   This is for the obvious reasons as these define the API between Kernel
>>   an user-mode Server. It is the opinion of this  project authors, as is
>>   of Linus that a pure API header is not governed by ANY license. But
>>   to make it clear.
> 
> No, that is not true, the files do have a license.  Please see the
> proper SPDX text that we use for kernel api .h files for what that is,
> and what you should also use for yours, if you go through with this.
> 

OK Good I understand more now. I looked at UAPI/ headers and read the
docs you mentioned.

Please help me nail this properly. I will change these headers to be:

/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note AND BSD-3-Clause */
(I will also change the message text to reflect this conversation)

My motivation is that those Interface HEADERS can also be used in proprietary
OSs such as Mac OS X and Windows, to compile the ZUS Server application project,
which is licensed as SPDX BSD-3-Clause.

So I changed the OR to an AND, right? Is it now clear that anyone adding
a new API interface, like new operations, understands that this applies also
to BSD-3-Clause projects?

> thanks,
> greg k-h

Thanks very much for looking
Boaz


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 01/17] fs: Add the ZUF filesystem to the build + License
  2019-02-20 14:55     ` Boaz Harrosh
@ 2019-02-20 19:40       ` Greg KH
  0 siblings, 0 replies; 31+ messages in thread
From: Greg KH @ 2019-02-20 19:40 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: linux-fsdevel, Anna Schumaker, Al Viro, Ric Wheeler,
	Miklos Szeredi, Steven Whitehouse, Jefff moyer, Amir Goldstein,
	Amit Golander, Sagi Manole

On Wed, Feb 20, 2019 at 04:55:19PM +0200, Boaz Harrosh wrote:
> On 20/02/19 13:03, Greg KH wrote:
> > On Tue, Feb 19, 2019 at 01:51:20PM +0200, Boaz harrosh wrote:
> >>   However the ZUS user mode Server is a BSD-3-Clause licensed
> >>   project.
> >>   Therefor you will see that:
> >> 	zus_api.h
> >> 	md_def.h
> >> 	md.h
> >>   These are common files with ZUS project are separately Licensed as
> >>   BSD-3-Clause. Any contributor to these files should note that his
> >>   code for these files is submitted as BSD-3-Clause.
> >>   This is for the obvious reasons as these define the API between Kernel
> >>   an user-mode Server. It is the opinion of this  project authors, as is
> >>   of Linus that a pure API header is not governed by ANY license. But
> >>   to make it clear.
> > 
> > No, that is not true, the files do have a license.  Please see the
> > proper SPDX text that we use for kernel api .h files for what that is,
> > and what you should also use for yours, if you go through with this.
> > 
> 
> OK Good I understand more now. I looked at UAPI/ headers and read the
> docs you mentioned.
> 
> Please help me nail this properly. I will change these headers to be:
> 
> /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note AND BSD-3-Clause */
> (I will also change the message text to reflect this conversation)
> 
> My motivation is that those Interface HEADERS can also be used in proprietary
> OSs such as Mac OS X and Windows, to compile the ZUS Server application project,
> which is licensed as SPDX BSD-3-Clause.

You are asking legal questions to a technical person?  Would you ask
medical questions to a lawyer?

Consult your lawyers please, they know the answers to this.

And if they do not, go find your corporate lawyers who do.

Hint, as a test, ask them why what you wrote as a suggested SPDX line
above makes no sense :)

good luck!

greg k-h

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 01/17] fs: Add the ZUF filesystem to the build + License
  2019-02-19 11:51 ` [RFC PATCH 01/17] fs: Add the ZUF filesystem to the build + License Boaz harrosh
  2019-02-20 11:03   ` Greg KH
@ 2019-02-26 17:55   ` Schumaker, Anna
  2019-02-28 16:42     ` Boaz Harrosh
  1 sibling, 1 reply; 31+ messages in thread
From: Schumaker, Anna @ 2019-02-26 17:55 UTC (permalink / raw)
  To: viro, boaz, linux-fsdevel
  Cc: Manole, Sagi, swhiteho, amir73il, rwheeler, mszeredi, Golander,
	Amit, jmoyer

On Tue, 2019-02-19 at 13:51 +0200, Boaz harrosh wrote:
> From: Boaz Harrosh <boazh@netapp.com>
> 
> This adds the ZUF filesystem-in-user_mode module to the
> fs/ build system.
> 
> Also added:
>         * fs/zuf/Kconfig
>         * fs/zuf/module.c - This file contains the LICENCE
>                             of zuf code base
>         * fs/zuf/Makefile - Rather empty Makefile with only
>                             module.c above
> 
> I add the fs/zuf/Makefile to demonstrate that at every
> patchset stage code still compiles and there are no external
> references outside of the code already submitted.
> 
> Off course only at the very last patch we have a working
> ZUF feeder
> 
> [LICENCE]
> 
>   zuf.ko is a GPLv2 licensed project.
> 
>   However the ZUS user mode Server is a BSD-3-Clause licensed
>   project.
>   Therefor you will see that:
>         zus_api.h
>         md_def.h
>         md.h
>   These are common files with ZUS project are separately Licensed as
>   BSD-3-Clause. Any contributor to these files should note that his
>   code for these files is submitted as BSD-3-Clause.
>   This is for the obvious reasons as these define the API between Kernel
>   an user-mode Server. It is the opinion of this  project authors, as is
>   of Linus that a pure API header is not governed by ANY license. But
>   to make it clear.
> 
> Signed-off-by: Boaz Harrosh <boazh@netapp.com>
> ---
>  fs/Kconfig       |  1 +
>  fs/Makefile      |  1 +
>  fs/zuf/Kconfig   | 28 ++++++++++++++++++++
>  fs/zuf/Makefile  | 14 ++++++++++
>  fs/zuf/module.c  | 28 ++++++++++++++++++++
>  fs/zuf/zus_api.h | 69 ++++++++++++++++++++++++++++++++++++++++++++++++
>  6 files changed, 141 insertions(+)
>  create mode 100644 fs/zuf/Kconfig
>  create mode 100644 fs/zuf/Makefile
>  create mode 100644 fs/zuf/module.c
>  create mode 100644 fs/zuf/zus_api.h
> 
> diff --git a/fs/Kconfig b/fs/Kconfig
> index ac474a61be37..5b23bb58e902 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -254,6 +254,7 @@ source "fs/romfs/Kconfig"
>  source "fs/pstore/Kconfig"
>  source "fs/sysv/Kconfig"
>  source "fs/ufs/Kconfig"
> +source "fs/zuf/Kconfig"
>  source "fs/exofs/Kconfig"
> 
>  endif # MISC_FILESYSTEMS
> diff --git a/fs/Makefile b/fs/Makefile
> index 293733f61594..168f178a7c89 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -128,3 +128,4 @@ obj-y                               += exofs/ # Multiple
> modules
>  obj-$(CONFIG_CEPH_FS)          += ceph/
>  obj-$(CONFIG_PSTORE)           += pstore/
>  obj-$(CONFIG_EFIVAR_FS)                += efivarfs/
> +obj-$(CONFIG_ZUF)              += zuf/
> diff --git a/fs/zuf/Kconfig b/fs/zuf/Kconfig
> new file mode 100644
> index 000000000000..19fff3b75b69
> --- /dev/null
> +++ b/fs/zuf/Kconfig
> @@ -0,0 +1,28 @@
> +menuconfig ZUF

Shouldn't this be "CONFIG_ZUF_FS" to stay consistent with other filesystems?

> +       tristate "ZUF - Zero-copy User-mode Feeder"
> +       depends on BLOCK
> +       depends on ZONE_DEVICE
> +       select CRC16
> +       select MEMCG
> +       help
> +          ZUFS Kernel part.
> +          To enable say Y here.
> +
> +          To compile this as a module,  choose M here: the module will be
> +          called zuf.ko
> +
> +if ZUF
> +
> +config ZUF_DEBUG
> +       bool "ZUF: enable debug subsystems use"
> +       depends on ZUF
> +       default n
> +       help
> +         INTERNAL QA USE ONLY!!! DO NOT USE!!!
> +         Please leave as N here
> +
> +         This option adds some extra code that helps
> +         in QA testing of the code. It may slow the
> +         operation and produce bigger code
> +
> +endif
> diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
> new file mode 100644
> index 000000000000..e75ba8a77974
> --- /dev/null
> +++ b/fs/zuf/Makefile
> @@ -0,0 +1,14 @@
> +#
> +# ZUF: Zero-copy User-mode Feeder
> +#
> +# Copyright (c) 2018 NetApp Inc. All rights reserved.
> +#
> +# ZUFS-License: GPL-2.0. See module.c for LICENSE details.
> +#
> +# Makefile for the Linux zufs Kernel Feeder.
> +#
> +
> +obj-$(CONFIG_ZUF) += zuf.o
> +
> +# Main FS
> +zuf-y += module.o
> diff --git a/fs/zuf/module.c b/fs/zuf/module.c
> new file mode 100644
> index 000000000000..523633c1bf9d
> --- /dev/null
> +++ b/fs/zuf/module.c
> @@ -0,0 +1,28 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * zuf - Zero-copy User-mode Feeder
> + *
> + * Copyright (c) 2018 NetApp Inc. All rights reserved.
> + *
> + * This program is free software; you may redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <https://www.gnu.org/licenses/>.
> + */
> +#include <linux/module.h>
> +
> +#include "zus_api.h"
> +
> +MODULE_AUTHOR("Boaz Harrosh <boazh@netapp.com>");
> +MODULE_AUTHOR("Sagi Manole <sagim@netapp.com>");
> +MODULE_DESCRIPTION("Zero-copy User-mode Feeder");
> +MODULE_LICENSE("GPL");
> +MODULE_VERSION(__stringify(ZUFS_MAJOR_VERSION) "."
> +               __stringify(ZUFS_MINOR_VERSION));
> diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
> new file mode 100644
> index 000000000000..f01db11721f4
> --- /dev/null
> +++ b/fs/zuf/zus_api.h
> @@ -0,0 +1,69 @@
> +/* SPDX-License-Identifier: GPL-2.0 or BSD-3-Clause */
> +/*
> + * zufs_api.h:
> + *     ZUFS (Zero-copy User-mode File System) is:
> + *             zuf (Zero-copy User-mode Feeder (Kernel)) +
> + *             zus (Zero-copy User-mode Server (daemon))
> + *
> + * Copyright (c) 2018 NetApp Inc. All rights reserved.
> + *
> + * Authors:
> + *     Boaz Harrosh <boazh@netapp.com>
> + *     Sagi Manole <sagim@netapp.com>"
> + */
> +#ifndef _LINUX_ZUFS_API_H
> +#define _LINUX_ZUFS_API_H
> +
> +#include <linux/types.h>
> +#include <linux/uuid.h>
> +#include <stddef.h>
> +#include <linux/statfs.h>
> +
> +/*
> + * Version rules:
> + *   This is the zus-to-zuf API version. And not the Filesystem
> + * on disk structures versions. These are left to the FS-plugging
> + * to supply and check.
> + * Specifically any of the API structures and constants found in this
> + * file.
> + * If the changes are made in a way backward compatible with old
> + * user-space, MINOR is incremented. Else MAJOR is incremented.
> + *
> + * We believe that the zus Server application comes with the
> + * Distro and should be dependent on the Kernel package.
> + * (In rhel they are both in the same package)
> + *
> + * The more stable ABI is between the zus Server and its FS plugins.
> + */
> +#define ZUFS_MINORS_PER_MAJOR  1024
> +#define ZUFS_MAJOR_VERSION 1
> +#define ZUFS_MINOR_VERSION 0
> +
> +/* User space compatibility definitions */
> +#ifndef __KERNEL__
> +
> +#include <string.h>
> +
> +#define u8 uint8_t
> +#define umode_t uint16_t
> +
> +#define PAGE_SHIFT     12
> +#define PAGE_SIZE      (1 << PAGE_SHIFT)
> +
> +#ifndef __packed
> +#      define __packed __attribute__((packed))
> +#endif
> +
> +#ifndef ALIGN
> +#define ALIGN(x, a)            ALIGN_MASK(x, (typeof(x))(a) - 1)
> +#define ALIGN_MASK(x, mask)    (((x) + (mask)) & ~(mask))
> +#endif
> +
> +/* RHEL/CentOS7 specifics */
> +#ifndef FALLOC_FL_UNSHARE_RANGE
> +#define FALLOC_FL_UNSHARE_RANGE         0x40
> +#endif
> +
> +#endif /*  ndef __KERNEL__ */
> +
> +#endif /* _LINUX_ZUFS_API_H */
> --
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 04/17] zuf: zuf-core The ZTs
  2019-02-19 11:51 ` [RFC PATCH 04/17] zuf: zuf-core The ZTs Boaz harrosh
@ 2019-02-26 18:34   ` Schumaker, Anna
  2019-02-28 17:01     ` Boaz Harrosh
  0 siblings, 1 reply; 31+ messages in thread
From: Schumaker, Anna @ 2019-02-26 18:34 UTC (permalink / raw)
  To: viro, boaz, linux-fsdevel
  Cc: Manole, Sagi, swhiteho, amir73il, rwheeler, mszeredi, Golander,
	Amit, jmoyer

On Tue, 2019-02-19 at 13:51 +0200, Boaz harrosh wrote:
> From: Boaz Harrosh <boazh@netapp.com>
> 
> zuf-core established the communication channels with the ZUS
> User Mode Server.
> 
> In this patch we have the core communication mechanics.
> Which is the Novelty of this project.
> (See previous submitted documentation for more info)
> 
> Users will come later in the patchset
> 
> Signed-off-by: Boaz Harrosh <boazh@netapp.com>
> ---
>  fs/zuf/_extern.h  |   22 +
>  fs/zuf/_pr.h      |    4 +
>  fs/zuf/relay.h    |   88 ++++
>  fs/zuf/zuf-core.c | 1016 ++++++++++++++++++++++++++++++++++++++++++++-
>  fs/zuf/zuf-root.c |    7 +
>  fs/zuf/zuf.h      |   46 ++
>  fs/zuf/zus_api.h  |  185 +++++++++
>  7 files changed, 1367 insertions(+), 1 deletion(-)
>  create mode 100644 fs/zuf/relay.h
> 
> diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
> index 3bb9f1d9acf6..52bb6b9deafe 100644
> --- a/fs/zuf/_extern.h
> +++ b/fs/zuf/_extern.h
> @@ -28,10 +28,32 @@ struct dentry *zuf_mount(struct file_system_type *fs_type,
> int flags,
>                          const char *dev_name, void *data);
> 
>  /* zuf-core.c */
> +int zufc_zts_init(struct zuf_root_info *zri); /* Some private types in core
> */
> +void zufc_zts_fini(struct zuf_root_info *zri);
> +
>  long zufc_ioctl(struct file *filp, unsigned int cmd, ulong arg);
>  int zufc_release(struct inode *inode, struct file *file);
>  int zufc_mmap(struct file *file, struct vm_area_struct *vma);
> 
> +int __zufc_dispatch_mount(struct zuf_root_info *zri,
> +                         enum e_mount_operation op,
> +                         struct zufs_ioc_mount *zim);
> +int zufc_dispatch_mount(struct zuf_root_info *zri, struct zus_fs_info
> *zus_zfi,
> +                       enum e_mount_operation operation,
> +                       struct zufs_ioc_mount *zim);
> +
> +const char *zuf_op_name(enum e_zufs_operation op);
> +int __zufc_dispatch(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo);
> +static inline
> +int zufc_dispatch(struct zuf_root_info *zri, struct zufs_ioc_hdr *hdr,
> +                 struct page **pages, uint nump)
> +{
> +       struct zuf_dispatch_op zdo;
> +
> +       zuf_dispatch_init(&zdo, hdr, pages, nump);
> +       return __zufc_dispatch(zri, &zdo);
> +}
> +
>  /* zuf-root.c */
>  int zufr_register_fs(struct super_block *sb, struct zufs_ioc_register_fs
> *rfs);
> 
> diff --git a/fs/zuf/_pr.h b/fs/zuf/_pr.h
> index 30b8cf912c1f..dc9f85453890 100644
> --- a/fs/zuf/_pr.h
> +++ b/fs/zuf/_pr.h
> @@ -39,5 +39,9 @@
> 
>  /* ~~~ channel prints ~~~ */
>  #define zuf_dbg_err(s, args ...)       zuf_chan_debug("error", s, ##args)
> +#define zuf_dbg_vfs(s, args ...)       zuf_chan_debug("vfs  ", s, ##args)
> +#define zuf_dbg_core(s, args ...)      zuf_chan_debug("core ", s, ##args)
> +#define zuf_dbg_zus(s, args ...)       zuf_chan_debug("zusdg", s, ##args)
> +#define zuf_dbg_verbose(s, args ...)   zuf_chan_debug("d-oto", s, ##args)
> 
>  #endif /* define __ZUF_PR_H__ */
> diff --git a/fs/zuf/relay.h b/fs/zuf/relay.h
> new file mode 100644
> index 000000000000..a17d242b313a
> --- /dev/null
> +++ b/fs/zuf/relay.h
> @@ -0,0 +1,88 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Relay scheduler-object Header file.
> + *
> + * Copyright (c) 2018 NetApp Inc. All rights reserved.
> + *
> + * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
> + *
> + * Authors:
> + *     Boaz Harrosh <boazh@netapp.com>
> + */
> +
> +#ifndef __RELAY_H__
> +#define __RELAY_H__
> +
> +/* ~~~~ Relay ~~~~ */
> +struct relay {
> +       wait_queue_head_t fss_wq;
> +       bool fss_wakeup;
> +       bool fss_waiting;
> +
> +       wait_queue_head_t app_wq;
> +       bool app_wakeup;
> +       bool app_waiting;
> +
> +       cpumask_t cpus_allowed;
> +};
> +
> +static inline void relay_init(struct relay *relay)
> +{
> +       init_waitqueue_head(&relay->fss_wq);
> +       init_waitqueue_head(&relay->app_wq);
> +}
> +
> +static inline bool relay_is_app_waiting(struct relay *relay)
> +{
> +       return relay->app_waiting;
> +}
> +
> +static inline void relay_app_wakeup(struct relay *relay)
> +{
> +       relay->app_waiting = false;
> +
> +       relay->app_wakeup = true;
> +       wake_up(&relay->app_wq);
> +}
> +
> +static inline int relay_fss_wait(struct relay *relay)
> +{
> +       int err;
> +
> +       relay->fss_waiting = true;
> +       relay->fss_wakeup = false;
> +       err =  wait_event_interruptible(relay->fss_wq, relay->fss_wakeup);
> +
> +       return err;

Could you just do: "return wait_event_interruptible()" directly, instead of
using the err variable?
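
A minimal user-space sketch of the simplified shape, assuming nothing else
needs the intermediate value. struct relay_sketch and wait_stub() are
hypothetical stand-ins so the snippet compiles outside the kernel; in the
patch the call would be wait_event_interruptible(relay->fss_wq,
relay->fss_wakeup), which returns 0 once the condition holds or -ERESTARTSYS
if a signal interrupts the sleep.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two fields relay_fss_wait() touches. */
struct relay_sketch {
	bool fss_waiting;
	bool fss_wakeup;
};

/*
 * Stand-in for wait_event_interruptible(). It sets the wakeup flag itself
 * so the sketch terminates, then reports success (0) like the real macro
 * does once its condition is true.
 */
static int wait_stub(struct relay_sketch *relay)
{
	relay->fss_wakeup = true;
	return 0;
}

/* Same steps as relay_fss_wait(), returning the wait result directly. */
static inline int relay_fss_wait_sketch(struct relay_sketch *relay)
{
	relay->fss_waiting = true;
	relay->fss_wakeup = false;
	return wait_stub(relay);
}
```

The behavior is unchanged; only the local err variable goes away.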

> +}
> +
> +static inline bool relay_is_fss_waiting_grab(struct relay *relay)
> +{
> +       if (relay->fss_waiting) {
> +               relay->fss_waiting = false;
> +               return true;
> +       }
> +       return false;
> +}
> +
> +static inline void relay_fss_wakeup(struct relay *relay)
> +{
> +       relay->fss_wakeup = true;
> +       wake_up(&relay->fss_wq);
> +}
> +
> +static inline void relay_fss_wakeup_app_wait(struct relay *relay,
> +                                            spinlock_t *spinlock)
> +{
> +       relay->app_waiting = true;
> +
> +       relay_fss_wakeup(relay);
> +
> +       relay->app_wakeup = false;
> +       if (spinlock)
> +               spin_unlock(spinlock);
> +
> +       wait_event(relay->app_wq, relay->app_wakeup);
> +}
> +
> +#endif /* ifndef __RELAY_H__ */
> diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
> index e12cae584f8a..95582c0a4ba5 100644
> --- a/fs/zuf/zuf-core.c
> +++ b/fs/zuf/zuf-core.c
> @@ -18,14 +18,820 @@
>  #include <linux/delay.h>
>  #include <linux/pfn_t.h>
>  #include <linux/sched/signal.h>
> +#include <linux/uaccess.h>
> 
>  #include "zuf.h"
> 
> +struct zufc_thread {
> +       struct zuf_special_file hdr;
> +       struct relay relay;
> +       struct vm_area_struct *vma;
> +       int no;
> +       int chan;
> +
> +       /* Kernel side allocated IOCTL buffer */
> +       struct vm_area_struct *opt_buff_vma;
> +       void *opt_buff;
> +       ulong max_zt_command;
> +
> +       /* Next operation*/
> +       struct zuf_dispatch_op *zdo;
> +};
> +
> +enum { INITIAL_ZT_CHANNELS = 3 };
> +
> +struct zuf_threads_pool {
> +       uint _max_zts;
> +       uint _max_channels;
> +        /* array of pcp_arrays */
> +       struct zufc_thread *_all_zt[ZUFS_MAX_ZT_CHANNELS];
> +};
> +
> +static int _alloc_zts_channel(struct zuf_root_info *zri, int channel)
> +{
> +       zri->_ztp->_all_zt[channel] = alloc_percpu(struct zufc_thread);
> +       if (unlikely(!zri->_ztp->_all_zt[channel])) {
> +               zuf_err("!!! alloc_percpu channel=%d failed\n", channel);
> +               return -ENOMEM;
> +       }
> +       return 0;
> +}
> +
> +static inline ulong _zt_pr_no(struct zufc_thread *zt)
> +{
> +       /* So in hex it will be channel as first nibble and cpu as 3rd and on
> */
> +       return ((ulong)zt->no << 8) | zt->chan;
> +}
> +
> +int zufc_zts_init(struct zuf_root_info *zri)
> +{
> +       int c;
> +
> +       zri->_ztp = kcalloc(1, sizeof(struct zuf_threads_pool), GFP_KERNEL);
> +       if (unlikely(!zri->_ztp))
> +               return -ENOMEM;
> +
> +       zri->_ztp->_max_zts = num_online_cpus();
> +       zri->_ztp->_max_channels = INITIAL_ZT_CHANNELS;
> +
> +       for (c = 0; c < INITIAL_ZT_CHANNELS; ++c) {
> +               int err = _alloc_zts_channel(zri, c);
> +
> +               if (unlikely(err))
> +                       return err;
> +       }
> +
> +       return 0;
> +}
> +
> +void zufc_zts_fini(struct zuf_root_info *zri)
> +{
> +       int c;
> +
> +       /* Always safe/must call zufc_zts_fini */
> +       if (!zri->_ztp)
> +               return;
> +
> +       for (c = 0; c < zri->_ztp->_max_channels; ++c) {
> +               if (zri->_ztp->_all_zt[c])
> +                       free_percpu(zri->_ztp->_all_zt[c]);
> +       }
> +       kfree(zri->_ztp);
> +       zri->_ztp = NULL;
> +}
> +
> +static struct zufc_thread *_zt_from_cpu(struct zuf_root_info *zri,
> +                                       int cpu, uint chan)
> +{
> +       return per_cpu_ptr(zri->_ztp->_all_zt[chan], cpu);
> +}
> +
> +static int _zt_from_f(struct file *filp, int cpu, uint chan,
> +                     struct zufc_thread **ztp)
> +{
> +       *ztp = _zt_from_cpu(ZRI(filp->f_inode->i_sb), cpu, chan);
> +       if (unlikely(!*ztp))
> +               return -ERANGE;
> +       return 0;

I'm curious if there is a reason you did it this way instead of making use of
the ERR_PTR() macro to return ztp directly?
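
For illustration, a minimal userspace sketch of the pattern being suggested. It
is not the patch's code: ERR_PTR(), IS_ERR() and PTR_ERR() are re-declared here
as stand-ins for the helpers in <linux/err.h>, and zt_stub, stub_pool,
zt_lookup() and zt_init_example() are hypothetical names made up for the
example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel helpers from <linux/err.h>. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for encoded errnos. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical, trimmed-down stand-in for struct zufc_thread. */
struct zt_stub {
	int no;
	int chan;
};

enum { STUB_CPUS = 4, STUB_CHANNELS = 3 };

static struct zt_stub stub_pool[STUB_CHANNELS][STUB_CPUS];

/* Hypothetical lookup: returns the thread, or an errno encoded as a pointer. */
static struct zt_stub *zt_lookup(int cpu, unsigned int chan)
{
	if (cpu < 0 || cpu >= STUB_CPUS || chan >= STUB_CHANNELS)
		return ERR_PTR(-ERANGE);
	return &stub_pool[chan][cpu];
}

/* Caller side: a single return value carries either the pointer or the error. */
static int zt_init_example(int cpu, unsigned int chan)
{
	struct zt_stub *zt = zt_lookup(cpu, chan);

	if (IS_ERR(zt))
		return (int)PTR_ERR(zt);

	zt->no = cpu;
	zt->chan = (int)chan;
	return 0;
}
```

With this shape the caller needs no output parameter; it tests the returned
pointer with IS_ERR() and extracts the errno with PTR_ERR().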

> +}
> +
> +static int _zu_register_fs(struct file *file, void *parg)
> +{
> +       struct zufs_ioc_register_fs rfs;
> +       int err;
> +
> +       err = copy_from_user(&rfs, parg, sizeof(rfs));
> +       if (unlikely(err)) {
> +               zuf_err("=>%d\n", err);
> +               return err;
> +       }
> +
> +       err = zufr_register_fs(file->f_inode->i_sb, &rfs);
> +       if (err)
> +               zuf_err("=>%d\n", err);
> +       err = put_user(err, (int *)parg);
> +       return err;
> +}
> +
> +/* ~~~~ mounting ~~~~*/
> +int __zufc_dispatch_mount(struct zuf_root_info *zri,
> +                         enum e_mount_operation operation,
> +                         struct zufs_ioc_mount *zim)
> +{
> +       zim->hdr.operation = operation;
> +
> +       for (;;) {
> +               bool fss_waiting;
> +
> +               spin_lock(&zri->mount.lock);
> +
> +               if (unlikely(!zri->mount.zsf.file)) {
> +                       spin_unlock(&zri->mount.lock);
> +                       zuf_err("Server not up\n");
> +                       zim->hdr.err = -EIO;
> +                       return zim->hdr.err;
> +               }
> +
> +               fss_waiting = relay_is_fss_waiting_grab(&zri->mount.relay);
> +               if (fss_waiting)
> +                       break;
> +               /* in case of break above spin_unlock is done inside
> +                * relay_fss_wakeup_app_wait
> +                */
> +
> +               spin_unlock(&zri->mount.lock);
> +
> +               /* It is OK to wait if user storms mounts */
> +               zuf_dbg_verbose("waiting\n");
> +               msleep(100);
> +       }
> +
> +       zri->mount.zim = zim;
> +       relay_fss_wakeup_app_wait(&zri->mount.relay, &zri->mount.lock);
> +
> +       return zim->hdr.err;
> +}
> +
> +int zufc_dispatch_mount(struct zuf_root_info *zri, struct zus_fs_info *zus_zfi,
> +                       enum e_mount_operation operation,
> +                       struct zufs_ioc_mount *zim)
> +{
> +       zim->hdr.out_len = sizeof(*zim);
> +       zim->hdr.in_len = sizeof(*zim);
> +       if (operation == ZUFS_M_MOUNT || operation == ZUFS_M_REMOUNT)
> +               zim->hdr.in_len += zim->zmi.po.mount_options_len;
> +       zim->zmi.zus_zfi = zus_zfi;
> +       zim->zmi.num_cpu = zri->_ztp->_max_zts;
> +       zim->zmi.num_channels = zri->_ztp->_max_channels;
> +
> +       return __zufc_dispatch_mount(zri, operation, zim);
> +}
> +
> +static int _zu_mount(struct file *file, void *parg)
> +{
> +       struct super_block *sb = file->f_inode->i_sb;
> +       struct zuf_root_info *zri = ZRI(sb);
> +       bool waiting_for_reply;
> +       struct zufs_ioc_mount *zim;
> +       ulong cp_ret;
> +       int err;
> +
> +       spin_lock(&zri->mount.lock);
> +
> +       if (unlikely(!file->private_data)) {
> +               /* First time register this file as the mount-thread owner */
> +               zri->mount.zsf.type = zlfs_e_mout_thread;
> +               zri->mount.zsf.file = file;
> +               file->private_data = &zri->mount.zsf;
> +       } else if (unlikely(file->private_data != &zri->mount)) {
> +               spin_unlock(&zri->mount.lock);
> +               zuf_err("Say what?? %p != %p\n",
> +                       file->private_data, &zri->mount);
> +               return -EIO;
> +       }
> +
> +       zim = zri->mount.zim;
> +       zri->mount.zim = NULL;
> +       waiting_for_reply = zim && relay_is_app_waiting(&zri->mount.relay);
> +
> +       spin_unlock(&zri->mount.lock);
> +
> +       if (waiting_for_reply) {
> +               cp_ret = copy_from_user(zim, parg, zim->hdr.out_len);
> +               if (unlikely(cp_ret)) {
> +                       zuf_err("copy_from_user => %ld\n", cp_ret);
> +                        zim->hdr.err = -EFAULT;
> +               }
> +
> +               relay_app_wakeup(&zri->mount.relay);
> +       }
> +
> +       /* This gets to sleep until a mount comes */
> +       err = relay_fss_wait(&zri->mount.relay);
> +       if (unlikely(err || !zri->mount.zim)) {
> +               struct zufs_ioc_hdr *hdr = parg;
> +
> +               /* Released by _zu_break INTER or crash */
> +               zuf_dbg_zus("_zu_break? %p => %d\n", zri->mount.zim, err);
> +               put_user(ZUFS_OP_BREAK, &hdr->operation);
> +               put_user(EIO, &hdr->err);
> +               return err;
> +       }
> +
> +       zim = zri->mount.zim;
> +       cp_ret = copy_to_user(parg, zim, zim->hdr.in_len);
> +       if (unlikely(cp_ret)) {
> +               err = -EFAULT;
> +               zuf_err("copy_to_user =>%ld\n", cp_ret);
> +       }
> +       return err;
> +}
> +
> +static void zufc_mounter_release(struct file *file)
> +{
> +       struct zuf_root_info *zri = ZRI(file->f_inode->i_sb);
> +
> +       zuf_dbg_zus("closed fu=%d au=%d fw=%d aw=%d\n",
> +                 zri->mount.relay.fss_wakeup, zri->mount.relay.app_wakeup,
> +                 zri->mount.relay.fss_waiting, zri->mount.relay.app_waiting);
> +
> +       spin_lock(&zri->mount.lock);
> +       zri->mount.zsf.file = NULL;
> +       if (relay_is_app_waiting(&zri->mount.relay)) {
> +               zuf_err("server emergency exit while IO\n");
> +
> +               if (zri->mount.zim)
> +                       zri->mount.zim->hdr.err = -EIO;
> +               spin_unlock(&zri->mount.lock);
> +
> +               relay_app_wakeup(&zri->mount.relay);
> +               msleep(1000); /* crap */
> +       } else {
> +               if (zri->mount.zim)
> +                       zri->mount.zim->hdr.err = 0;
> +               spin_unlock(&zri->mount.lock);
> +       }
> +}
> +
> +/* ~~~~ ZU_IOC_NUMA_MAP ~~~~ */
> +static int _zu_numa_map(struct file *file, void *parg)
> +{
> +       struct zufs_ioc_numa_map *numa_map;
> +       int n_nodes = num_online_nodes();
> +       int n_cpus = num_online_cpus();
> +       uint *nodes_cpu_count;
> +       uint max_cpu_per_node = 0;
> +       uint alloc_size;
> +       int cpu, i, err;
> +
> +       alloc_size = sizeof(*numa_map) + n_cpus; /* char per cpu */
> +
> +       if ((n_nodes > 255) || (alloc_size > PAGE_SIZE)) {
> +               zuf_warn("!!!unexpected big machine with %d nodes alloc_size=0x%x\n",
> +                         n_nodes, alloc_size);
> +               return -ENOTSUPP;
> +       }
> +
> +       nodes_cpu_count = kcalloc(n_nodes, sizeof(uint), GFP_KERNEL);
> +       if (unlikely(!nodes_cpu_count))
> +               return -ENOMEM;
> +
> +       numa_map = kzalloc(alloc_size, GFP_KERNEL);
> +       if (unlikely(!numa_map)) {
> +               err = -ENOMEM;
> +               goto out;
> +       }
> +
> +       numa_map->possible_nodes        = num_possible_nodes();
> +       numa_map->possible_cpus         = num_possible_cpus();
> +
> +       numa_map->online_nodes          = n_nodes;
> +       numa_map->online_cpus           = n_cpus;
> +
> +       for_each_cpu(cpu, cpu_online_mask) {
> +               uint ctn  = cpu_to_node(cpu);
> +               uint ncc = ++nodes_cpu_count[ctn];
> +
> +               numa_map->cpu_to_node[cpu] = ctn;
> +               max_cpu_per_node = max(max_cpu_per_node, ncc);
> +       }
> +
> +       for (i = 1; i < n_nodes; ++i) {
> +               if (nodes_cpu_count[i] != nodes_cpu_count[0]) {
> +                       zuf_info("@[%d]=%d Unbalanced CPU sockets @[0]=%d\n",
> +                                 i, nodes_cpu_count[i], nodes_cpu_count[0]);
> +                       numa_map->nodes_not_symmetrical = true;
> +                       break;
> +               }
> +       }
> +
> +       numa_map->max_cpu_per_node = max_cpu_per_node;
> +
> +       zuf_dbg_verbose(
> +               "possible_nodes=%d possible_cpus=%d online_nodes=%d online_cpus=%d\n",
> +               numa_map->possible_nodes, numa_map->possible_cpus,
> +               n_nodes, n_cpus);
> +
> +       err = copy_to_user(parg, numa_map, alloc_size);
> +       kfree(numa_map);
> +out:
> +       kfree(nodes_cpu_count);
> +       return err;
> +}
> +
> +static int _map_pages(struct zufc_thread *zt, struct page **pages, uint nump,
> +                     bool map_readonly)
> +{
> +       int p, err;
> +
> +       if (!(zt->vma && pages && nump))
> +               return 0;
> +
> +       for (p = 0; p < nump; ++p) {
> +               ulong zt_addr = zt->vma->vm_start + p * PAGE_SIZE;
> +               ulong pfn = page_to_pfn(pages[p]);
> +               pfn_t pfnt = phys_to_pfn_t(PFN_PHYS(pfn), PFN_MAP | PFN_DEV);
> +               vm_fault_t flt;
> +
> +               if (map_readonly)
> +                       flt = vmf_insert_mixed(zt->vma, zt_addr, pfnt);
> +               else
> +                       flt = vmf_insert_mixed_mkwrite(zt->vma, zt_addr, pfnt);
> +               err = zuf_flt_to_err(flt);
> +               if (unlikely(err)) {
> +                       zuf_err("zuf: remap_pfn_range => %d p=0x%x start=0x%lx\n",
> +                                err, p, zt->vma->vm_start);
> +                       return err;
> +               }
> +       }
> +       return 0;
> +}
> +
> +static void _unmap_pages(struct zufc_thread *zt, struct page **pages, uint nump)
> +{
> +       if (!(zt->vma && zt->zdo && pages && nump))
> +               return;
> +
> +       zt->zdo->pages = NULL;
> +       zt->zdo->nump = 0;
> +
> +       zap_vma_ptes(zt->vma, zt->vma->vm_start, nump * PAGE_SIZE);
> +}
> +
> +static void _fill_buff(ulong *buff, uint size)
> +{
> +       ulong *buff_end = buff + size;
> +       ulong val = 0;
> +
> +       for (; buff < buff_end; ++buff, ++val)
> +               *buff = val;
> +}
> +
> +static int _zu_init(struct file *file, void *parg)
> +{
> +       struct zufc_thread *zt;
> +       int cpu = smp_processor_id();
> +       struct zufs_ioc_init zi_init;
> +       int err;
> +
> +       err = copy_from_user(&zi_init, parg, sizeof(zi_init));
> +       if (unlikely(err)) {
> +               zuf_err("=>%d\n", err);
> +               return err;
> +       }
> +       if (unlikely(zi_init.channel_no >= ZUFS_MAX_ZT_CHANNELS)) {
> +               zuf_err("[%d] channel_no=%d\n", cpu, zi_init.channel_no);
> +               return -EINVAL;
> +       }
> +
> +       zuf_dbg_zus("[%d] aff=0x%lx channel=%d\n",
> +                   cpu, zi_init.affinity, zi_init.channel_no);
> +
> +       zi_init.hdr.err = _zt_from_f(file, cpu, zi_init.channel_no, &zt);
> +       if (unlikely(zi_init.hdr.err)) {
> +               zuf_err("=>%d\n", err);
> +               goto out;
> +       }
> +
> +       if (unlikely(zt->hdr.file)) {
> +               zi_init.hdr.err = -EINVAL;
> +               zuf_err("[%d] !!! thread already set\n", cpu);
> +               goto out;
> +       }
> +
> +       relay_init(&zt->relay);
> +       zt->hdr.type = zlfs_e_zt;
> +       zt->hdr.file = file;
> +       zt->no = cpu;
> +       zt->chan = zi_init.channel_no;
> +
> +       zt->max_zt_command = zi_init.max_command;
> +       zt->opt_buff = vmalloc(zi_init.max_command);
> +       if (unlikely(!zt->opt_buff)) {
> +               zi_init.hdr.err = -ENOMEM;
> +               goto out;
> +       }
> +       _fill_buff(zt->opt_buff, zi_init.max_command / sizeof(ulong));
> +
> +       file->private_data = &zt->hdr;
> +out:
> +       err = copy_to_user(parg, &zi_init, sizeof(zi_init));
> +       if (err)
> +               zuf_err("=>%d\n", err);
> +       return err;
> +}
> +
> +struct zufc_thread *_zt_from_f_private(struct file *file)
> +{
> +       struct zuf_special_file *zsf = file->private_data;
> +
> +       WARN_ON(zsf->type != zlfs_e_zt);
> +       return container_of(zsf, struct zufc_thread, hdr);
> +}
> +
> +/* Caller checks that file->private_data != NULL */
> +static void zufc_zt_release(struct file *file)
> +{
> +       struct zufc_thread *zt = _zt_from_f_private(file);
> +
> +       if (unlikely(zt->hdr.file != file))
> +               zuf_err("What happened zt->file(%p) != file(%p)\n",
> +                       zt->hdr.file, file);
> +
> +       zuf_dbg_zus("[%d] closed fu=%d au=%d fw=%d aw=%d\n",
> +                 zt->no, zt->relay.fss_wakeup, zt->relay.app_wakeup,
> +                 zt->relay.fss_waiting, zt->relay.app_waiting);
> +
> +       if (relay_is_app_waiting(&zt->relay)) {
> +               zuf_err("server emergency exit while IO\n");
> +
> +               /* NOTE: Do not call _unmap_pages the vma is gone */
> +               zt->hdr.file = NULL;
> +
> +               relay_app_wakeup(&zt->relay);
> +               msleep(1000); /* crap */
> +       }
> +
> +       vfree(zt->opt_buff);
> +       memset(zt, 0, sizeof(*zt));
> +}
> +
> +static int _copy_outputs(struct zufc_thread *zt, void *arg)
> +{
> +       struct zufs_ioc_hdr *hdr = zt->zdo->hdr;
> +       struct zufs_ioc_hdr *user_hdr = zt->opt_buff;
> +
> +       if (zt->opt_buff_vma->vm_start != (ulong)arg) {
> +               zuf_err("malicious Server\n");
> +               return -EINVAL;
> +       }
> +
> +       /* Update on the user out_len and return-code */
> +       hdr->err = user_hdr->err;
> +       hdr->out_len = user_hdr->out_len;
> +
> +       if (!hdr->out_len)
> +               return 0;
> +
> +       if ((hdr->err == -EZUFS_RETRY) || (hdr->out_max < hdr->out_len)) {
> +               if (WARN_ON(!zt->zdo->oh)) {
> +                       zuf_err("Trouble op(%s) out_max=%d out_len=%d\n",
> +                               zuf_op_name(hdr->operation),
> +                               hdr->out_max, hdr->out_len);
> +                       return -EFAULT;
> +               }
> +               zuf_dbg_zus("[%s] %d %d => %d\n",
> +                           zuf_op_name(hdr->operation),
> +                           hdr->out_max, hdr->out_len, hdr->err);
> +               return zt->zdo->oh(zt->zdo, zt->opt_buff, zt->max_zt_command);
> +       } else {
> +               void *rply = (void *)hdr + hdr->out_start;
> +               void *from = zt->opt_buff + hdr->out_start;
> +
> +               memcpy(rply, from, hdr->out_len);
> +               return 0;
> +       }
> +}
> +
> +static int _zu_wait(struct file *file, void *parg)
> +{
> +       struct zufc_thread *zt;
> +       int err;
> +
> +       zt = _zt_from_f_private(file);
> +       if (unlikely(!zt)) {
> +               zuf_err("Unexpected ZT state\n");
> +               err = -ERANGE;
> +               goto err;
> +       }
> +
> +       if (!zt->hdr.file || file != zt->hdr.file) {
> +               zuf_err("fatal\n");
> +               err = -E2BIG;
> +               goto err;
> +       }
> +       if (unlikely((ulong)parg != zt->opt_buff_vma->vm_start)) {
> +               zuf_err("fatal 2\n");
> +               err = -EINVAL;
> +               goto err;
> +       }
> +
> +       if (relay_is_app_waiting(&zt->relay)) {
> +               if (unlikely(!zt->zdo)) {
> +                       zuf_err("User has gone...\n");
> +                       err = -E2BIG;
> +                       goto err;
> +               } else {
> +                       /* overflow_handler might decide to execute the
> +                        *parg here at zus context and return to server
> +                        * If it also has an error to report to zus it
> +                        * will set zdo->hdr->err.
> +                        * EZUS_RETRY_DONE is when that happens.
> +                        * In this case pages stay mapped in zt->vma
> +                        */
> +                       err = _copy_outputs(zt, parg);
> +                       if (err == EZUF_RETRY_DONE) {
> +                               put_user(zt->zdo->hdr->err, (int *)parg);
> +                               return 0;
> +                       }
> +
> +                       _unmap_pages(zt, zt->zdo->pages, zt->zdo->nump);
> +                       zt->zdo = NULL;
> +                       if (unlikely(err)) /* _copy_outputs returned an err */
> +                               goto err;
> +               }
> +               relay_app_wakeup(&zt->relay);
> +       }
> +
> +       err = relay_fss_wait(&zt->relay);
> +       if (err)
> +               zuf_dbg_err("[%d] relay error: %d\n", zt->no, err);
> +
> +       if (zt->zdo &&  zt->zdo->hdr &&
> +           zt->zdo->hdr->operation < ZUFS_OP_BREAK) {
> +               /* call map here at the zuf thread so we need no locks
> +                * TODO: Currently only ZUFS_OP_WRITE protects user-buffers
> +                * we should have a bit set in zt->zdo->hdr set per operation.
> +                * TODO: Why this does not work?
> +                */
> +               _map_pages(zt, zt->zdo->pages, zt->zdo->nump, 0);
> +               memcpy(zt->opt_buff, zt->zdo->hdr, zt->zdo->hdr->in_len);
> +       } else {
> +               struct zufs_ioc_hdr *hdr = zt->opt_buff;
> +
> +               /* This Means we were released by _zu_break */
> +               zuf_dbg_zus("_zu_break? => %d\n", err);
> +               hdr->operation = ZUFS_OP_BREAK;
> +               hdr->err = err;
> +       }
> +
> +       return err;
> +
> +err:
> +       put_user(err, (int *)parg);
> +       return err;
> +}
> +
> +static int _try_grab_zt_channel(struct zuf_root_info *zri, int cpu,
> +                                struct zufc_thread **ztp)
> +{
> +       struct zufc_thread *zt;
> +       int c;
> +
> +       for (c = 0; ; ++c) {
> +               zt = _zt_from_cpu(zri, cpu, c);
> +               if (unlikely(!zt || !zt->hdr.file))
> +                       break;
> +
> +               if (relay_is_fss_waiting_grab(&zt->relay)) {
> +                       *ztp = zt;
> +                       return true;
> +               }
> +       }
> +
> +       *ztp = _zt_from_cpu(zri, cpu, 0);
> +       return false;
> +}
> +
> +#define _zuf_get_cpu() get_cpu()
> +#define _zuf_put_cpu() put_cpu()
> +
> +#ifdef CONFIG_ZUF_DEBUG
> +static
> +int _r_zufs_dispatch(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo)
> +#else
> +int __zufc_dispatch(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo)
> +#endif
> +{
> +       struct task_struct *app = get_current();
> +       struct zufs_ioc_hdr *hdr = zdo->hdr;
> +       int cpu, cpu2;
> +       struct zufc_thread *zt;
> +
> +       if (unlikely(hdr->out_len && !hdr->out_max)) {
> +               /* TODO: Complain here and let caller code do this proper */
> +               hdr->out_max = hdr->out_len;
> +       }
> +
> +channel_busy:
> +       cpu = _zuf_get_cpu();
> +
> +       if (!_try_grab_zt_channel(zri, cpu, &zt)) {
> +               _zuf_put_cpu();
> +
> +               /* If channel was grabbed then maybe a break_all is in progress
> +                * on a different CPU make sure zt->file on this core is
> +                * updated
> +                */
> +               mb();
> +               if (unlikely(!zt->hdr.file)) {
> +                       zuf_err("[%d] !zt->file\n", cpu);
> +                       return -EIO;
> +               }
> +               zuf_dbg_err("[%d] can this be\n", cpu);
> +               /* FIXME: Do something much smarter */
> +               msleep(10);
> +               if (signal_pending(get_current())) {
> +                       zuf_dbg_err("[%d] => EINTR\n", cpu);
> +                       return -EINTR;
> +               }
> +               goto channel_busy;
> +       }
> +
> +       /* lock app to this cpu while waiting */
> +       cpumask_copy(&zt->relay.cpus_allowed, &app->cpus_allowed);
> +       cpumask_copy(&app->cpus_allowed,  cpumask_of(smp_processor_id()));
> +
> +       zt->zdo = zdo;
> +
> +       _zuf_put_cpu();
> +
> +       relay_fss_wakeup_app_wait(&zt->relay, NULL);
> +
> +       /* restore cpu affinity after wakeup */
> +       cpumask_copy(&app->cpus_allowed, &zt->relay.cpus_allowed);
> +
> +cpu2 = smp_processor_id();
> +if (cpu2 != cpu)
> +       zuf_warn("App switched cpu1=%u cpu2=%u\n", cpu, cpu2);
> +
> +       return zt->hdr.file ? hdr->err : -EIO;
> +}
> +
> +const char *zuf_op_name(enum e_zufs_operation op)
> +{
> +#define CASE_ENUM_NAME(e) case e: return #e
> +       switch  (op) {
> +               CASE_ENUM_NAME(ZUFS_OP_BREAK            );
> +       default:
> +               return "UNKNOWN";
> +       }
> +}
> +
> +#ifdef CONFIG_ZUF_DEBUG
> +
> +#define MAX_ZT_SEC 5
> +int __zufc_dispatch(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo)
> +{
> +       u64 t1, t2;
> +       int err;
> +
> +       t1 = ktime_get_ns();
> +       err = _r_zufs_dispatch(zri, zdo);
> +       t2 = ktime_get_ns();
> +
> +       if ((t2 - t1) > MAX_ZT_SEC * NSEC_PER_SEC)
> +               zuf_err("zufc_dispatch(%s, [0x%x-0x%x]) took %lld sec\n",
> +                       zuf_op_name(zdo->hdr->operation), zdo->hdr->offset,
> +                       zdo->hdr->len,
> +                       (t2 - t1) / NSEC_PER_SEC);
> +
> +       return err;
> +}
> +#endif /* def CONFIG_ZUF_DEBUG */
> +
> +/* ~~~ iomap_exec && exec_buffer allocation ~~~ */
> +struct zu_exec_buff {
> +       struct zuf_special_file hdr;
> +       struct vm_area_struct *vma;
> +       void *opt_buff;
> +       ulong alloc_size;
> +};
> +
> +/* Do some common checks and conversions */
> +static inline struct zu_exec_buff *_ebuff_from_file(struct file *file)
> +{
> +       struct zu_exec_buff *ebuff = file->private_data;
> +
> +       if (WARN_ON_ONCE(ebuff->hdr.type != zlfs_e_dpp_buff)) {
> +               zuf_err("Must call ZU_IOC_ALLOC_BUFFER first\n");
> +               return NULL;
> +       }
> +
> +       if (WARN_ON_ONCE(ebuff->hdr.file != file))
> +               return NULL;
> +
> +       return ebuff;
> +}
> +
> +static int _zu_ebuff_alloc(struct file *file, void *arg)
> +{
> +       struct zufs_ioc_alloc_buffer ioc_alloc;
> +       struct zu_exec_buff *ebuff;
> +       int err;
> +
> +       err = copy_from_user(&ioc_alloc, arg, sizeof(ioc_alloc));
> +       if (unlikely(err)) {
> +               zuf_err("=>%d\n", err);
> +               return err;
> +       }
> +
> +       if (ioc_alloc.init_size > ioc_alloc.max_size)
> +               return -EINVAL;
> +
> +       /* TODO: Easily Support growing */
> +       /* TODO: Support global pools, also easy */
> +       if (ioc_alloc.pool_no || ioc_alloc.init_size != ioc_alloc.max_size)
> +               return -ENOTSUPP;
> +
> +       ebuff = kzalloc(sizeof(*ebuff), GFP_KERNEL);
> +       if (unlikely(!ebuff))
> +               return -ENOMEM;
> +
> +       ebuff->hdr.type = zlfs_e_dpp_buff;
> +       ebuff->hdr.file = file;
> +       i_size_write(file->f_inode, ioc_alloc.max_size);
> +       ebuff->alloc_size =  ioc_alloc.init_size;
> +       ebuff->opt_buff = vmalloc(ioc_alloc.init_size);
> +       if (unlikely(!ebuff->opt_buff)) {
> +               kfree(ebuff);
> +               return -ENOMEM;
> +       }
> +       _fill_buff(ebuff->opt_buff, ioc_alloc.init_size / sizeof(ulong));
> +
> +       file->private_data = &ebuff->hdr;
> +       return 0;
> +}
> +
> +static void zufc_ebuff_release(struct file *file)
> +{
> +       struct zu_exec_buff *ebuff = _ebuff_from_file(file);
> +
> +       if (unlikely(!ebuff))
> +               return;
> +
> +       vfree(ebuff->opt_buff);
> +       ebuff->hdr.type = 0;
> +       ebuff->hdr.file = NULL; /* for none-dbg Kernels && use-after-free */
> +       kfree(ebuff);
> +}
> +
> +static int _zu_break(struct file *filp, void *parg)
> +{
> +       struct zuf_root_info *zri = ZRI(filp->f_inode->i_sb);
> +       int i, c;
> +
> +       zuf_dbg_core("enter\n");
> +       mb(); /* TODO how to schedule on all CPU's */
> +
> +       for (i = 0; i < zri->_ztp->_max_zts; ++i) {
> +               for (c = 0; c < zri->_ztp->_max_channels; ++c) {
> +                       struct zufc_thread *zt = _zt_from_cpu(zri, i, c);
> +
> +                       if (unlikely(!(zt && zt->hdr.file)))
> +                               continue;
> +                       relay_fss_wakeup(&zt->relay);
> +               }
> +       }
> +
> +       if (zri->mount.zsf.file)
> +               relay_fss_wakeup(&zri->mount.relay);
> +
> +       zuf_dbg_core("exit\n");
> +       return 0;
> +}
> +
>  long zufc_ioctl(struct file *file, unsigned int cmd, ulong arg)
>  {
> +       void __user *parg = (void __user *)arg;
> +
>         switch (cmd) {
> +       case ZU_IOC_REGISTER_FS:
> +               return _zu_register_fs(file, parg);
> +       case ZU_IOC_MOUNT:
> +               return _zu_mount(file, parg);
> +       case ZU_IOC_NUMA_MAP:
> +               return _zu_numa_map(file, parg);
> +       case ZU_IOC_INIT_THREAD:
> +               return _zu_init(file, parg);
> +       case ZU_IOC_WAIT_OPT:
> +               return _zu_wait(file, parg);
> +       case ZU_IOC_ALLOC_BUFFER:
> +               return _zu_ebuff_alloc(file, parg);
> +       case ZU_IOC_BREAK_ALL:
> +               return _zu_break(file, parg);
>         default:
> -               zuf_err("%d\n", cmd);
> +               zuf_err("%d %ld\n", cmd, ZU_IOC_WAIT_OPT);
>                 return -ENOTTY;
>         }
>  }
> @@ -38,11 +844,215 @@ int zufc_release(struct inode *inode, struct file *file)
>                 return 0;
> 
>         switch (zsf->type) {
> +       case zlfs_e_zt:
> +               zufc_zt_release(file);
> +               return 0;
> +       case zlfs_e_mout_thread:
> +               zufc_mounter_release(file);
> +               return 0;
> +       case zlfs_e_pmem:
> +               /* NOTHING to clean for pmem file yet */
> +               /* zuf_pmem_release(file);*/
> +               return 0;
> +       case zlfs_e_dpp_buff:
> +               zufc_ebuff_release(file);
> +               return 0;
>         default:
>                 return 0;
>         }
>  }
> 
> +/* ~~~~  mmap area of app buffers into server ~~~~ */
> +
> +static int zuf_zt_fault(struct vm_fault *vmf)
> +{
> +       zuf_err("should not fault\n");
> +       return VM_FAULT_SIGBUS;
> +}
> +
> +static const struct vm_operations_struct zuf_vm_ops = {
> +       .fault          = zuf_zt_fault,
> +};
> +
> +static int _zufc_zt_mmap(struct file *file, struct vm_area_struct *vma,
> +                        struct zufc_thread *zt)
> +{
> +       /* Tell Kernel We will only access on a single core */
> +       vma->vm_flags |= VM_MIXEDMAP;
> +       vma->vm_ops = &zuf_vm_ops;
> +
> +       zt->vma = vma;
> +
> +       zuf_dbg_core(
> +               "[0x%lx] start=0x%lx end=0x%lx flags=0x%lx file-start=0x%lx\n",
> +               _zt_pr_no(zt), vma->vm_start, vma->vm_end, vma->vm_flags,
> +               vma->vm_pgoff);
> +
> +       return 0;
> +}
> +
> +/* ~~~~  mmap the Kernel allocated IOCTL buffer per ZT ~~~~ */
> +static int _opt_buff_mmap(struct vm_area_struct *vma, void *opt_buff,
> +                         ulong opt_size)
> +{
> +       ulong offset;
> +
> +       if (!opt_buff)
> +               return -ENOMEM;
> +
> +       for (offset = 0; offset < opt_size; offset += PAGE_SIZE) {
> +               ulong addr = vma->vm_start + offset;
> +               ulong pfn = vmalloc_to_pfn(opt_buff +  offset);
> +               pfn_t pfnt = phys_to_pfn_t(PFN_PHYS(pfn), PFN_MAP | PFN_DEV);
> +               int err;
> +
> +               zuf_dbg_verbose("[0x%lx] pfn-0x%lx addr=0x%lx buff=0x%lx\n",
> +                               offset, pfn, addr, (ulong)opt_buff + offset);
> +
> +               err = zuf_flt_to_err(vmf_insert_mixed_mkwrite(vma, addr, pfnt));
> +               if (unlikely(err)) {
> +                       zuf_err("zuf: zuf_insert_mixed_mkwrite => %d offset=0x%lx addr=0x%lx\n",
> +                                err, offset, addr);
> +                       return err;
> +               }
> +       }
> +
> +       return 0;
> +}
> +
> +static int zuf_obuff_fault(struct vm_fault *vmf)
> +{
> +       struct vm_area_struct *vma = vmf->vma;
> +       struct zufc_thread *zt = _zt_from_f_private(vma->vm_file);
> +       long offset = (vmf->pgoff << PAGE_SHIFT) - ZUS_API_MAP_MAX_SIZE;
> +       int err;
> +
> +       zuf_dbg_core(
> +               "[0x%lx] start=0x%lx end=0x%lx file-start=0x%lx offset=0x%lx\n",
> +               _zt_pr_no(zt), vma->vm_start, vma->vm_end, vma->vm_pgoff,
> +               offset);
> +
> +       /* if Server overruns its buffer crash it dead */
> +       if (unlikely((offset < 0) || (zt->max_zt_command < offset))) {
> +               zuf_err("[0x%lx] start=0x%lx end=0x%lx file-start=0x%lx offset=0x%lx\n",
> +                       _zt_pr_no(zt), vma->vm_start,
> +                       vma->vm_end, vma->vm_pgoff, offset);
> +               return VM_FAULT_SIGBUS;
> +       }
> +
> +       /* We never released a zus-core.c that does not fault the
> +        * first page first. I want to see if this happens
> +        */
> +       if (unlikely(offset))
> +               zuf_warn("Suspicious server activity\n");
> +
> +       /* This faults only once at very first access */
> +       err = _opt_buff_mmap(vma, zt->opt_buff, zt->max_zt_command);
> +       if (unlikely(err))
> +               return VM_FAULT_SIGBUS;
> +
> +       return VM_FAULT_NOPAGE;
> +}
> +
> +static const struct vm_operations_struct zuf_obuff_ops = {
> +       .fault          = zuf_obuff_fault,
> +};
> +
> +static int _zufc_obuff_mmap(struct file *file, struct vm_area_struct *vma,
> +                           struct zufc_thread *zt)
> +{
> +       vma->vm_flags |= VM_MIXEDMAP;
> +       vma->vm_ops = &zuf_obuff_ops;
> +
> +       zt->opt_buff_vma = vma;
> +
> +       zuf_dbg_core(
> +               "[0x%lx] start=0x%lx end=0x%lx flags=0x%lx file-start=0x%lx\n",
> +               _zt_pr_no(zt), vma->vm_start, vma->vm_end, vma->vm_flags,
> +               vma->vm_pgoff);
> +
> +       return 0;
> +}
> +
> +/* ~~~ */
> +
> +static int zufc_zt_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +       struct zufc_thread *zt = _zt_from_f_private(file);
> +
> +       /* We have two areas of mmap in this special file.
> +        * 0 to ZUS_API_MAP_MAX_SIZE:
> +        *      The first part where app pages are mapped
> +        *      into server per operation.
> +        * ZUS_API_MAP_MAX_SIZE of size zuf_root_info->max_zt_command
> +        *      Is where we map the per ZT ioctl-buffer, later passed
> +        *      to the zus_ioc_wait IOCTL call
> +        */
> +       if (vma->vm_pgoff == ZUS_API_MAP_MAX_SIZE / PAGE_SIZE)
> +               return _zufc_obuff_mmap(file, vma, zt);
> +
> +       /* zuf ZT API is very particular about where in its
> +        * special file we communicate
> +        */
> +       if (unlikely(vma->vm_pgoff))
> +               return -EINVAL;
> +
> +       return _zufc_zt_mmap(file, vma, zt);
> +}
> +
> +/* ~~~~ Implementation of the ZU_IOC_ALLOC_BUFFER mmap facility ~~~~ */
> +
> +static int zuf_ebuff_fault(struct vm_fault *vmf)
> +{
> +       struct vm_area_struct *vma = vmf->vma;
> +       struct zu_exec_buff *ebuff = _ebuff_from_file(vma->vm_file);
> +       long offset = (vmf->pgoff << PAGE_SHIFT);
> +       int err;
> +
> +       zuf_dbg_core("start=0x%lx end=0x%lx file-start=0x%lx file-off=0x%lx\n",
> +                    vma->vm_start, vma->vm_end, vma->vm_pgoff, offset);
> +
> +       /* if Server overruns its buffer crash it dead */
> +       if (unlikely((offset < 0) || (ebuff->alloc_size < offset))) {
> +               zuf_err("start=0x%lx end=0x%lx file-start=0x%lx file-off=0x%lx\n",
> +                       vma->vm_start, vma->vm_end, vma->vm_pgoff,
> +                       offset);
> +               return VM_FAULT_SIGBUS;
> +       }
> +
> +       /* We never released a zus-core.c that does not fault the
> +        * first page first. I want to see if this happens
> +        */
> +       if (unlikely(offset))
> +               zuf_warn("Suspicious server activity\n");
> +
> +       /* This faults only once at very first access */
> +       err = _opt_buff_mmap(vma, ebuff->opt_buff, ebuff->alloc_size);
> +       if (unlikely(err))
> +               return VM_FAULT_SIGBUS;
> +
> +       return VM_FAULT_NOPAGE;
> +}
> +
> +static const struct vm_operations_struct zuf_ebuff_ops = {
> +       .fault          = zuf_ebuff_fault,
> +};
> +
> +static int zufc_ebuff_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +       struct zu_exec_buff *ebuff = _ebuff_from_file(vma->vm_file);
> +
> +       vma->vm_flags |= VM_MIXEDMAP;
> +       vma->vm_ops = &zuf_ebuff_ops;
> +
> +       ebuff->vma = vma;
> +
> +       zuf_dbg_core("start=0x%lx end=0x%lx flags=0x%lx file-start=0x%lx\n",
> +                     vma->vm_start, vma->vm_end, vma->vm_flags, vma->vm_pgoff);
> +
> +       return 0;
> +}
> +
>  int zufc_mmap(struct file *file, struct vm_area_struct *vma)
>  {
>         struct zuf_special_file *zsf = file->private_data;
> @@ -53,6 +1063,10 @@ int zufc_mmap(struct file *file, struct vm_area_struct *vma)
>         }
> 
>         switch (zsf->type) {
> +       case zlfs_e_zt:
> +               return zufc_zt_mmap(file, vma);
> +       case zlfs_e_dpp_buff:
> +               return zufc_ebuff_mmap(file, vma);
>         default:
>                 zuf_err("type=%d\n", zsf->type);
>                 return -ENOTTY;
> diff --git a/fs/zuf/zuf-root.c b/fs/zuf/zuf-root.c
> index 55a839dbc854..37b70ca33d3c 100644
> --- a/fs/zuf/zuf-root.c
> +++ b/fs/zuf/zuf-root.c
> @@ -227,6 +227,7 @@ static void zufr_put_super(struct super_block *sb)
>  {
>         struct zuf_root_info *zri = ZRI(sb);
> 
> +       zufc_zts_fini(zri);
>         _unregister_all_fses(zri);
> 
>         zuf_info("zuf_root umount\n");
> @@ -282,10 +283,16 @@ static int zufr_fill_super(struct super_block *sb, void *data, int silent)
>         root_i->i_fop = &zufr_file_dir_operations;
>         root_i->i_op = &zufr_inode_operations;
> 
> +       spin_lock_init(&zri->mount.lock);
>         mutex_init(&zri->sbl_lock);
> +       relay_init(&zri->mount.relay);
>         INIT_LIST_HEAD(&zri->fst_list);
>         INIT_LIST_HEAD(&zri->pmem_list);
> 
> +       err = zufc_zts_init(zri);
> +       if (unlikely(err))
> +               return err; /* put will be called we have a root */
> +
>         return 0;
>  }
> 
> diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
> index f979d8cbe60c..a33f5908155d 100644
> --- a/fs/zuf/zuf.h
> +++ b/fs/zuf/zuf.h
> @@ -23,9 +23,11 @@
>  #include <linux/xattr.h>
>  #include <linux/exportfs.h>
>  #include <linux/page_ref.h>
> +#include <linux/mm.h>
> 
>  #include "zus_api.h"
> 
> +#include "relay.h"
>  #include "_pr.h"
> 
>  enum zlfs_e_special_file {
> @@ -44,6 +46,8 @@ struct zuf_special_file {
>  struct zuf_root_info {
>         struct __mount_thread_info {
>                 struct zuf_special_file zsf;
> +               spinlock_t lock;
> +               struct relay relay;
>                 struct zufs_ioc_mount *zim;
>         } mount;
> 
> @@ -102,6 +106,48 @@ static inline struct zuf_inode_info *ZUII(struct inode *inode)
>         return container_of(inode, struct zuf_inode_info, vfs_inode);
>  }
> 
> +static inline struct zuf_fs_type *ZUF_FST(struct file_system_type *fs_type)
> +{
> +       return container_of(fs_type, struct zuf_fs_type, vfs_fst);
> +}
> +
> +static inline struct zuf_fs_type *zuf_fst(struct super_block *sb)
> +{
> +       return ZUF_FST(sb->s_type);
> +}
> +
> +struct zuf_dispatch_op;
> +typedef int (*overflow_handler)(struct zuf_dispatch_op *zdo, void *parg,
> +                               ulong zt_max_bytes);
> +struct zuf_dispatch_op {
> +       struct zufs_ioc_hdr *hdr;
> +       struct page **pages;
> +       uint nump;
> +       overflow_handler oh;
> +       struct super_block *sb;
> +       struct inode *inode;
> +};
> +
> +static inline void
> +zuf_dispatch_init(struct zuf_dispatch_op *zdo, struct zufs_ioc_hdr *hdr,
> +                struct page **pages, uint nump)
> +{
> +       memset(zdo, 0, sizeof(*zdo));
> +       zdo->hdr = hdr;
> +       zdo->pages = pages; zdo->nump = nump;
> +}
> +
> +static inline int zuf_flt_to_err(vm_fault_t flt)
> +{
> +       if (likely(flt == VM_FAULT_NOPAGE))
> +               return 0;
> +
> +       if (flt == VM_FAULT_OOM)
> +               return -ENOMEM;
> +
> +       return -EACCES;
> +}
> +
>  /* Keep this include last thing in file */
>  #include "_extern.h"
> 
> diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
> index 34e3e1a9a107..3319a70b5ccc 100644
> --- a/fs/zuf/zus_api.h
> +++ b/fs/zuf/zus_api.h
> @@ -66,6 +66,47 @@
> 
>  #endif /*  ndef __KERNEL__ */
> 
> +/* first available error code after include/linux/errno.h */
> +#define EZUFS_RETRY    531
> +
> +/* The below is private to zuf Kernel only. Is not exposed to VFS nor zus
> + * (defined here to allocate the constant)
> + */
> +#define EZUF_RETRY_DONE 540
> +
> +/**
> + * zufs dual port memory
> + * This is a special type of offset to either memory or persistent-memory,
> + * that is designed to be used in the interface mechanism between userspace
> + * and kernel, and can be accessed by both.
> + * 3 first bits denote a mem-pool:
> + * 0   - pmem pool
> + * 1-6 - established shared pool by a call to zufs_ioc_create_mempool (below)
> + * 7   - offset into app memory
> + */
> +typedef __u64 __bitwise zu_dpp_t;
> +
> +static inline uint zu_dpp_t_pool(zu_dpp_t t)
> +{
> +       return t & 0x7;
> +}
> +
> +static inline ulong zu_dpp_t_val(zu_dpp_t t)
> +{
> +       return t & ~0x7;
> +}
> +
> +static inline zu_dpp_t enc_zu_dpp_t(ulong v, uint pool)
> +{
> +       return v | pool;
> +}
> +
> +/* ~~~~~ ZUFS API ioctl commands ~~~~~ */
> +enum {
> +       ZUS_API_MAP_MAX_PAGES   = 1024,
> +       ZUS_API_MAP_MAX_SIZE    = ZUS_API_MAP_MAX_PAGES * PAGE_SIZE,
> +};
> +
>  struct zufs_ioc_hdr {
>         __u32 err;      /* IN/OUT must be first */
>         __u16 in_len;   /* How much to be copied *to* zus */
> @@ -102,4 +143,148 @@ struct zufs_ioc_register_fs {
>  };
>  #define ZU_IOC_REGISTER_FS     _IOWR('Z', 10, struct zufs_ioc_register_fs)
> 
> +/* A cookie from user-mode returned by mount */
> +struct zus_sb_info;
> +
> +/* zus cookie per inode */
> +struct zus_inode_info;
> +
> +enum ZUFS_M_FLAGS {
> +       ZUFS_M_PEDANTIC         = 0x00000001,
> +       ZUFS_M_EPHEMERAL        = 0x00000002,
> +       ZUFS_M_SILENT           = 0x00000004,
> +};
> +
> +struct zufs_parse_options {
> +       __u32 mount_options_len;
> +       __u32 pedantic;
> +       __u64 mount_flags;
> +       char mount_options[0];
> +};
> +
> +enum e_mount_operation {
> +       ZUFS_M_MOUNT    = 1,
> +       ZUFS_M_UMOUNT,
> +       ZUFS_M_REMOUNT,
> +       ZUFS_M_DDBG_RD,
> +       ZUFS_M_DDBG_WR,
> +};
> +
> +struct zufs_mount_info {
> +       /* IN */
> +       struct zus_fs_info *zus_zfi;
> +       __u16   num_cpu;
> +       __u16   num_channels;
> +       __u32   pmem_kern_id;
> +       __u64   sb_id;
> +
> +       /* OUT */
> +       struct zus_sb_info *zus_sbi;
> +       /* mount is also iget of root */
> +       struct zus_inode_info *zus_ii;
> +       zu_dpp_t _zi;
> +       __u64   old_mount_opt;
> +       __u64   remount_flags;
> +
> +       /* More FS specific info */
> +       __u32 s_blocksize_bits;
> +       __u8    acl_on;
> +       struct zufs_parse_options po;
> +};
> +
> +/* mount / umount */
> +struct  zufs_ioc_mount {
> +       struct zufs_ioc_hdr hdr;
> +       struct zufs_mount_info zmi;
> +};
> +#define ZU_IOC_MOUNT   _IOWR('Z', 11, struct zufs_ioc_mount)
> +
> +/* pmem  */
> +struct zufs_ioc_numa_map {
> +       /* Set by zus */
> +       struct zufs_ioc_hdr hdr;
> +
> +       __u32   possible_nodes;
> +       __u32   possible_cpus;
> +       __u32   online_nodes;
> +       __u32   online_cpus;
> +
> +       __u32   max_cpu_per_node;
> +
> +       /* This indicates that NOT all nodes have @max_cpu_per_node cpus */
> +       bool    nodes_not_symmetrical;
> +
> +       /* Variable size must keep last
> +        * size @online_cpus
> +        */
> +       __u8    cpu_to_node[];
> +};
> +#define ZU_IOC_NUMA_MAP        _IOWR('Z', 12, struct zufs_ioc_numa_map)
> +
> +/* ZT init */
> +enum { ZUFS_MAX_ZT_CHANNELS = 64 };
> +
> +struct zufs_ioc_init {
> +       struct zufs_ioc_hdr hdr;
> +       ulong affinity; /* IN */
> +       uint channel_no;
> +       uint max_command;
> +};
> +#define ZU_IOC_INIT_THREAD     _IOWR('Z', 14, struct zufs_ioc_init)
> +
> +/* break_all (Server telling kernel to clean) */
> +struct zufs_ioc_break_all {
> +       struct zufs_ioc_hdr hdr;
> +};
> +#define ZU_IOC_BREAK_ALL       _IOWR('Z', 15, struct zufs_ioc_break_all)
> +
> +/* ~~~  zufs_ioc_wait_operation ~~~ */
> +struct zufs_ioc_wait_operation {
> +       struct zufs_ioc_hdr hdr;
> +       /* maximum size is governed by zufs_ioc_init->max_command */
> +       char opt_buff[];
> +};
> +#define ZU_IOC_WAIT_OPT                _IOWR('Z', 16, struct zufs_ioc_wait_operation)
> +
> +/* These are the possible operations sent from Kernel to the Server in the
> + * return of the ZU_IOC_WAIT_OPT.
> + */
> +enum e_zufs_operation {
> +       ZUFS_OP_NULL = 0,
> +
> +       ZUFS_OP_BREAK,          /* Kernel telling Server to exit */
> +       ZUFS_OP_MAX_OPT,
> +};
> +
> +/* Allocate a special_file that will be a dual-port communication buffer with
> + * user mode.
> + * Server will access the buffer via the mmap of this file.
> + * Kernel will access the buffer via the vmalloc() pointer
> + *
> + * Some IOCTLs below demand use of this kind of buffer for communication
> + * TODO:
> + * pool_no is if we want to associate this buffer onto the 6 possible
> + * mem-pools per zuf_sbi. So anywhere we have a zu_dpp_t it will mean
> + * access from this pool.
> + * If pool_no is zero then it is private to only this file. In this case
> + * sb_id && zus_sbi are ignored / not needed.
> + */
> +struct zufs_ioc_alloc_buffer {
> +       struct zufs_ioc_hdr hdr;
> +       /* The ID of the super block received in mount */
> +       __u64   sb_id;
> +       /* We verify the sb_id validity against zus_sbi */
> +       struct zus_sb_info *zus_sbi;
> +       /* max size of buffer allowed (size of mmap) */
> +       __u32 max_size;
> +       /* allocate this much on initial call and set into vma */
> +       __u32 init_size;
> +
> +       /* TODO: These below are now set to ZERO. Need implementation */
> +       __u16 pool_no;
> +       __u16 flags;
> +       __u32 reserved;
> +};
> +#define ZU_IOC_ALLOC_BUFFER    _IOWR('Z', 17, struct zufs_ioc_alloc_buffer)
> +
>  #endif /* _LINUX_ZUFS_API_H */
> --
> 2.20.1
> 
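
The zu_dpp_t comment in the quoted zus_api.h hunk says the low 3 bits select
a mem-pool and the remaining bits carry the offset. A minimal userspace
sketch of that packing follows, assuming offsets are 8-byte aligned so the
low bits are free. The uint64_t typedef stands in for the patch's
__u64 __bitwise type, and ZU_DPP_POOL_BITS, ZU_DPP_POOL_MASK and
enc_zu_dpp_t_checked() are hypothetical names that are not part of the patch.

```c
#include <stdint.h>

/* Userspace stand-in for the patch's "__u64 __bitwise zu_dpp_t". */
typedef uint64_t zu_dpp_t;

/* Hypothetical names for the 3 pool bits described in the patch comment. */
#define ZU_DPP_POOL_BITS 3u
#define ZU_DPP_POOL_MASK ((UINT64_C(1) << ZU_DPP_POOL_BITS) - 1) /* 0x7 */

/* Pool number: 0 = pmem, 1-6 = shared pools, 7 = app memory. */
static inline unsigned zu_dpp_t_pool(zu_dpp_t t)
{
	return (unsigned)(t & ZU_DPP_POOL_MASK);
}

/* Offset part, with the pool bits cleared. */
static inline uint64_t zu_dpp_t_val(zu_dpp_t t)
{
	return t & ~ZU_DPP_POOL_MASK;
}

/*
 * Hypothetical checked encoder (the patch's enc_zu_dpp_t() does a plain
 * "v | pool"). It rejects an unaligned offset or an out-of-range pool,
 * either of which would corrupt the other field.
 * Returns 0 on success, -1 if the inputs cannot be represented.
 */
static inline int enc_zu_dpp_t_checked(uint64_t v, unsigned pool,
				       zu_dpp_t *out)
{
	if ((v & ZU_DPP_POOL_MASK) || pool > ZU_DPP_POOL_MASK)
		return -1;
	*out = v | pool;
	return 0;
}
```

The round trip only holds when both checks pass: an offset with any of its
low three bits set would be read back as a different pool.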


* Re: [RFC PATCH 01/17] fs: Add the ZUF filesystem to the build + License
  2019-02-26 17:55   ` Schumaker, Anna
@ 2019-02-28 16:42     ` Boaz Harrosh
  0 siblings, 0 replies; 31+ messages in thread
From: Boaz Harrosh @ 2019-02-28 16:42 UTC (permalink / raw)
  To: Schumaker, Anna, viro, boaz, linux-fsdevel
  Cc: Manole, Sagi, swhiteho, amir73il, rwheeler, mszeredi, Golander,
	Amit, jmoyer

On 26/02/19 19:55, Schumaker, Anna wrote:
> On Tue, 2019-02-19 at 13:51 +0200, Boaz harrosh wrote:
<>
>> diff --git a/fs/zuf/Kconfig b/fs/zuf/Kconfig
>> new file mode 100644
>> index 000000000000..19fff3b75b69
>> --- /dev/null
>> +++ b/fs/zuf/Kconfig
>> @@ -0,0 +1,28 @@
>> +menuconfig ZUF
> 
> Shouldn't this be "CONFIG_ZUF_FS" to stay consistent with other filesystems?
> 

Sure, will change. Maybe ZUFS though? I'll look at what's common.

>> +       tristate "ZUF - Zero-copy User-mode Feeder"
>> +       depends on BLOCK
>> +       depends on ZONE_DEVICE
>> +       select CRC16
>> +       select MEMCG
>> +       help
>> +          ZUFS Kernel part.
>> +          To enable say Y here.
>> +
>> +          To compile this as a module,  choose M here: the module will be
>> +          called zuf.ko
>> +
>> +if ZUF
>> +
<>


* Re: [RFC PATCH 04/17] zuf: zuf-core The ZTs
  2019-02-26 18:34   ` Schumaker, Anna
@ 2019-02-28 17:01     ` Boaz Harrosh
  0 siblings, 0 replies; 31+ messages in thread
From: Boaz Harrosh @ 2019-02-28 17:01 UTC (permalink / raw)
  To: Schumaker, Anna, viro, boaz, linux-fsdevel
  Cc: Manole, Sagi, swhiteho, amir73il, rwheeler, mszeredi, Golander,
	Amit, jmoyer

On 26/02/19 20:34, Schumaker, Anna wrote:
> On Tue, 2019-02-19 at 13:51 +0200, Boaz harrosh wrote:
<>
>> zuf-core established the communication channels with the ZUS
>> User Mode Server.
>>
>> In this patch we have the core communication mechanics.
>> Which is the Novelty of this project.
>> (See previous submitted documentation for more info)
>>
>> Users will come later in the patchset
>>
<>
>> +static inline int relay_fss_wait(struct relay *relay)
>> +{
>> +       int err;
>> +
>> +       relay->fss_waiting = true;
>> +       relay->fss_wakeup = false;
>> +       err =  wait_event_interruptible(relay->fss_wq, relay->fss_wakeup);
>> +
>> +       return err;
> 
> Could you just do: "return wait_event_interruptible()" directly, instead of
> using the err variable?
> 

Totally. There used to be a dbg_print here; the err variable is a leftover
from that time. Will change ...

>> +}
>> +
<>
>> +static struct zufc_thread *_zt_from_cpu(struct zuf_root_info *zri,
>> +                                       int cpu, uint chan)
>> +{
>> +       return per_cpu_ptr(zri->_ztp->_all_zt[chan], cpu);
>> +}
>> +
>> +static int _zt_from_f(struct file *filp, int cpu, uint chan,
>> +                     struct zufc_thread **ztp)
>> +{
>> +       *ztp = _zt_from_cpu(ZRI(filp->f_inode->i_sb), cpu, chan);
>> +       if (unlikely(!*ztp))
>> +               return -ERANGE;
>> +       return 0;
> 
> I'm curious if there is a reason you did it this way instead of making use of
> the ERR_PTR() macro to return ztp directly?
> 

For one, now that I look at it, I hate the name; it's wrong. I will change that.
It is done like that because it used to be used in many places, and I did not
want every place to have its own print and invent its own error code.

But now that it has a single user, I might just fold it into that user.
All other places must use _zt_from_f_private. Cool, I'll kill it.

>> +}
>> +
<>

Thanks, will fix
Boaz

>> +static int _zu_init(struct file *file, void *parg)
>> +{
>> +       struct zufc_thread *zt;
>> +       int cpu = smp_processor_id();
>> +       struct zufs_ioc_init zi_init;
>> +       int err;
>> +
>> +       err = copy_from_user(&zi_init, parg, sizeof(zi_init));
>> +       if (unlikely(err)) {
>> +               zuf_err("=>%d\n", err);
>> +               return err;
>> +       }
>> +       if (unlikely(zi_init.channel_no >= ZUFS_MAX_ZT_CHANNELS)) {
>> +               zuf_err("[%d] channel_no=%d\n", cpu, zi_init.channel_no);
>> +               return -EINVAL;
>> +       }
>> +
>> +       zuf_dbg_zus("[%d] aff=0x%lx channel=%d\n",
>> +                   cpu, zi_init.affinity, zi_init.channel_no);
>> +
>> +       zi_init.hdr.err = _zt_from_f(file, cpu, zi_init.channel_no, &zt);
>> +       if (unlikely(zi_init.hdr.err)) {
>> +               zuf_err("=>%d\n", zi_init.hdr.err);
>> +               goto out;
>> +       }
>> +
>> +       if (unlikely(zt->hdr.file)) {
>> +               zi_init.hdr.err = -EINVAL;
>> +               zuf_err("[%d] !!! thread already set\n", cpu);
>> +               goto out;
>> +       }
>> +
>> +       relay_init(&zt->relay);
>> +       zt->hdr.type = zlfs_e_zt;
>> +       zt->hdr.file = file;
>> +       zt->no = cpu;
>> +       zt->chan = zi_init.channel_no;
>> +
>> +       zt->max_zt_command = zi_init.max_command;
>> +       zt->opt_buff = vmalloc(zi_init.max_command);
>> +       if (unlikely(!zt->opt_buff)) {
>> +               zi_init.hdr.err = -ENOMEM;
>> +               goto out;
>> +       }
>> +       _fill_buff(zt->opt_buff, zi_init.max_command / sizeof(ulong));
>> +
>> +       file->private_data = &zt->hdr;
>> +out:
>> +       err = copy_to_user(parg, &zi_init, sizeof(zi_init));
>> +       if (err)
>> +               zuf_err("=>%d\n", err);
>> +       return err;
>> +}
>> +
>> +struct zufc_thread *_zt_from_f_private(struct file *file)
>> +{
>> +       struct zuf_special_file *zsf = file->private_data;
>> +
>> +       WARN_ON(zsf->type != zlfs_e_zt);
>> +       return container_of(zsf, struct zufc_thread, hdr);
>> +}
>> +
>> +/* Caller checks that file->private_data != NULL */
>> +static void zufc_zt_release(struct file *file)
>> +{
>> +       struct zufc_thread *zt = _zt_from_f_private(file);
>> +
<>
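
For readers unfamiliar with the ERR_PTR() convention raised in the review of
_zt_from_f() above, here is a small userspace sketch contrasting the two
return styles. It is only an illustration: ERR_PTR(), PTR_ERR() and IS_ERR()
are re-implemented here as stand-ins for the kernel's <linux/err.h> helpers,
and struct zt_slot, the slots table and both lookup functions are
hypothetical, with a simple range check in place of the per-cpu lookup.

```c
#include <errno.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's <linux/err.h> helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Error codes live in the top MAX_ERRNO values of the address space. */
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for struct zufc_thread and its per-cpu table. */
struct zt_slot { int cpu; unsigned chan; };
enum { NCPU = 4, NCHAN = 2 };
static struct zt_slot slots[NCHAN][NCPU];

/* Out-parameter style, shaped like the quoted _zt_from_f(). */
static int zt_lookup_out(int cpu, unsigned chan, struct zt_slot **ztp)
{
	if (cpu < 0 || cpu >= NCPU || chan >= NCHAN)
		return -ERANGE;
	*ztp = &slots[chan][cpu];
	return 0;
}

/* ERR_PTR style: the pointer itself carries either the slot or the error. */
static struct zt_slot *zt_lookup(int cpu, unsigned chan)
{
	if (cpu < 0 || cpu >= NCPU || chan >= NCHAN)
		return ERR_PTR(-ERANGE);
	return &slots[chan][cpu];
}
```

A caller of the second form checks IS_ERR(zt) and returns PTR_ERR(zt) on
failure, so the error code is still chosen in one place without the extra
out-parameter.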


end of thread, other threads:[~2019-02-28 17:01 UTC | newest]

Thread overview: 31+ messages
2019-02-19 11:51 [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Boaz harrosh
2019-02-19 11:51 ` [RFC PATCH 01/17] fs: Add the ZUF filesystem to the build + License Boaz harrosh
2019-02-20 11:03   ` Greg KH
2019-02-20 14:55     ` Boaz Harrosh
2019-02-20 19:40       ` Greg KH
2019-02-26 17:55   ` Schumaker, Anna
2019-02-28 16:42     ` Boaz Harrosh
2019-02-19 11:51 ` [RFC PATCH 02/17] zuf: Preliminary Documentation Boaz harrosh
2019-02-20  8:27   ` Miklos Szeredi
2019-02-20 14:24     ` Boaz Harrosh
2019-02-19 11:51 ` [RFC PATCH 03/17] zuf: zuf-rootfs Boaz harrosh
2019-02-19 11:51 ` [RFC PATCH 04/17] zuf: zuf-core The ZTs Boaz harrosh
2019-02-26 18:34   ` Schumaker, Anna
2019-02-28 17:01     ` Boaz Harrosh
2019-02-19 11:51 ` [RFC PATCH 05/17] zuf: Multy Devices Boaz harrosh
2019-02-19 11:51 ` [RFC PATCH 06/17] zuf: mounting Boaz harrosh
2019-02-19 11:51 ` [RFC PATCH 07/17] zuf: Namei and directory operations Boaz harrosh
2019-02-19 11:51 ` [RFC PATCH 08/17] zuf: readdir operation Boaz harrosh
2019-02-19 11:51 ` [RFC PATCH 09/17] zuf: symlink Boaz harrosh
2019-02-20 11:05   ` Greg KH
2019-02-20 14:12     ` Boaz Harrosh
2019-02-19 11:51 ` [RFC PATCH 10/17] zuf: More file operation Boaz harrosh
2019-02-19 11:51 ` [RFC PATCH 11/17] zuf: Write/Read implementation Boaz harrosh
2019-02-19 11:51 ` [RFC PATCH 12/17] zuf: mmap & sync Boaz harrosh
2019-02-19 11:51 ` [RFC PATCH 13/17] zuf: ioctl implementation Boaz harrosh
2019-02-19 11:51 ` [RFC PATCH 14/17] zuf: xattr implementation Boaz harrosh
2019-02-19 11:51 ` [RFC PATCH 15/17] zuf: ACL support Boaz harrosh
2019-02-19 11:51 ` [RFC PATCH 16/17] zuf: Special IOCTL fadvise (TODO) Boaz harrosh
2019-02-19 11:51 ` [RFC PATCH 17/17] zuf: Support for dynamic-debug of zusFSs Boaz harrosh
2019-02-19 12:15 ` [RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem Matthew Wilcox
2019-02-19 19:15   ` Boaz Harrosh
