From: Boaz Harrosh <boaz@plexistor.com>
To: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Anna Schumaker <Anna.Schumaker@netapp.com>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Matt Benjamin <mbenjami@redhat.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>,
	Amir Goldstein <amir73il@gmail.com>,
	Sagi Manole <sagim@netapp.com>,
	Matthew Wilcox <willy@infradead.org>,
	Dan Williams <dan.j.williams@intel.com>
Subject: [PATCH 03/16] zuf: Preliminary Documentation
Date: Thu, 26 Sep 2019 05:07:12 +0300
Message-ID: <20190926020725.19601-4-boazh@netapp.com>
In-Reply-To: <20190926020725.19601-1-boazh@netapp.com>

Add Documentation/filesystems/zufs.txt.

Add some documentation first, to give reviewers of the coming
patch-set some background and an overview of the whole system.

[v2]
  Incorporated Randy's few comments.

Randy, please give it a harder review.

CC: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 Documentation/filesystems/zufs.txt | 386 +++++++++++++++++++++++++++++
 1 file changed, 386 insertions(+)
 create mode 100644 Documentation/filesystems/zufs.txt

diff --git a/Documentation/filesystems/zufs.txt b/Documentation/filesystems/zufs.txt
new file mode 100644
index 000000000000..2a347a446aa7
--- /dev/null
+++ b/Documentation/filesystems/zufs.txt
@@ -0,0 +1,386 @@
+ZUFS - Zero-copy User-mode FileSystem
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Trees:
+	git clone https://github.com/NetApp/zufs-zuf -b upstream
+	git clone https://github.com/NetApp/zufs-zus -b upstream
+
+patches, comments, questions, requests to:
+	boazh@netapp.com
+
+Introduction:
+~~~~~~~~~~~~~
+
+ZUFS stands for Zero-copy User-mode FS
+▪ It is geared towards true end-to-end zero copy of both data and metadata.
+▪ It is geared towards very *low latency*, very high CPU locality, lock-less
+  parallelism.
+▪ Synchronous operations
+▪ NUMA awareness
+
+  ZUFS is a from-scratch implementation of a filesystem-in-user-space, which
+tries to address the above goals. It is aimed at pmem-based FSs, but supports
+any other type of FS as well.
+
+Glossary and names:
+~~~~~~~~~~~~~~~~~~~
+
+ZUF - Zero-copy User-mode Feeder
+  zuf.ko is the Kernel VFS component. Its job is to interface with the Kernel
+  VFS and dispatch commands to a User-mode application Server.
+  Up-to-date code is found at:
+	git clone https://github.com/NetApp/zufs-zuf -b upstream
+
+ZUS - Zero-copy User-mode Server
+  zufs utilizes a User-mode server application that takes care of the detailed
+  communication protocol and correctness with the Kernel.
+  In turn it utilizes many zusFS Filesystem plugins to implement the actual
+  on-disk Filesystem.
+  Up-to-date code is found at:
+	git clone https://github.com/NetApp/zufs-zus -b upstream
+
+zusFS - FS plugins
+  These are .so loadable modules that implement one or more Filesystem-types
+  (mount -t xyz).
+  The zus server communicates with the plugin via a set of function vectors
+  for the different operations, and establishes communication via defined
+  structures.
+
+Filesystem-type:
+  At startup zus registers one or more Filesystem-type(s) with the Kernel.
+  Associated with each type is a unique type-name (mount -t foofs) plus
+  further info about the fs, like a magic number and so on.
+  One Server can support many FS-types; in turn each FS-type can mount
+  multiple super-blocks, each supporting multiple devices.
+
+Device-Table (MDT) - A zufs FS can support multiple devices
+  ZUF in Kernel may receive, like any mount command, a block-device or none.
+  For the former, if the specified FS-type states so in a special field,
+  the mount will look for a Device table: a list of devices in a specific
+  order, sitting at some offset on each block-device. The system will then
+  proceed to open and own all these devices and associate them with the
+  mounting super-block.
+  If the FS-type specifies a -1 at DT_offset then there is no device table
+  and a DT of a single device is created. (If no device is specified, then
+  we operate without any block devices; mount options give some indication
+  of the storage information.)
+  The device table has special consideration for pmem devices and will
+  present the whole linear array of devices to zus as one flat mmap space.
+  All non-pmem devices are also provided with an interface, with a
+  facility for data movement from pmem to slower devices.
+  Detailed NUMA info is exported to the Server for maximum utilization.
+  Each device has an associated NUMA node, so the Server can optimize IO to
+  these devices.
+
+pmem: (Also called t1)
+  Multiple pmem devices are presented to the server as a single
+  linear file mmap, something like /dev/dax. But it is strictly
+  available only to the specific super-block that owns it.
+
+Shadow: (For debugging)
+  "Shadow" is used for debugging the correct persistence of pmem based
+  filesystems. With pmem, after data is modified a user must call
+  cl_flush/sfence for the data to be guaranteed persistent. This is very hard
+  to test and time consuming. So for that we invented the shadow.
+  There is a special mode bit in the MDT header that denotes a shadow
+  system. In a shadow setup each pmem device is divided in half. The first
+  half is available for FS storage. The second half is a Shadow, i.e.
+  each time the FS calls cl_flush or mov_nt the data is then memcopied
+  to the shadow.
+  At mount time the Shadow is copied onto the main part, thus
+  presenting only those bits that were persisted by the FS. So a simple
+  remount can simulate a full machine reboot.
+  The Shadow is presented as the upper part of the mmaped region, i.e.
+  the whole t1 range is repeated again. The zus core code facilitates
+  zusFS implementors in accessing this facility.
+
+zufs_dpp_t - Dual port pointer type
+  At some points in the protocol there are objects that return from zus
+  (The Server) to the Kernel via a dpp_t. This is a special kind of pointer.
+  It is actually an 8-byte-aligned offset, with the 3 low bits specifying
+  a pool code: [offset = dpp_t & ~0x7] [pool = dpp_t & 0x7]
+  pool == 0 means the offset is in pmem, whose management is by zuf and
+  to which zus is given full and easy access.
+
+  pool != 0 is a pre-established file (up to 6 such files per sb) where
+  zus has an mmap on the file and the Kernel can access that data
+  via an offset into the file.
+  pool == 7 denotes an offset into the application buffers associated
+  with the current IO.
+  The lifetime rules of all dpp_t objects are strictly defined.
+  The primary use of dpp_t is the on-pmem inode structure. Both
+  zus and zuf can access and change this structure. On any modification
+  zus is called, so it is notified of any changes and of persistence.
+  More such objects are: symlinks, xattrs, data-blocks, etc.
+
+Relay-wait-object:
+  Communication between Kernel and server is done via zus-threads that
+  sleep in Kernel (inside an IOCTL) and wait for commands. Once one is received
+  the IOCTL returns the operation id; once executed, the return info is passed
+  via a new IOCTL call, which then waits for the next operation.
+  To wake up the sleeping thread we use a Relay-wait-object. Currently
+  it is two waitqueue_head(s) back to back.
+  In future we should investigate the use of a new special scheduler object
+  that switches from thread A to a predefined ZT thread context without passing
+  through the scheduler at all.
+  (The switching is already very fast, faster than anything currently
+   in the Kernel. But I believe I can shave another microsecond off a roundtrip)
+
+ZT-threads-array:
+  The novelty of zufs is the ZT-threads system. 3 threads or more are
+  pre-created for each active core in the system.
+  ▪ Each thread has its affinity set to that single core only.
+  ▪ Special communication file per ZT (O_TMPFILE + IOCTL_ZUFS_INIT)
+    At initialization the ZT thread communicates through a ZT_INIT ioctl
+    and registers as the handler of that core (Channel)
+  ▪ Also for each ZT, Kernel allocates an IOCTL-buffer that is directly
+    accessed by Kernel. In turn that IOCTL-buffer is mmaped by zus
+    for the Server access of that communication buffer. (This is for zero
+    copy operations as well as avoiding the smem memory barrier)
+  ▪ IOCTL_ZU_WAIT_OPT – the thread sleeps in Kernel waiting for an operation
+    via the IOCTL_ZU_WAIT_OPT call.
+
+  ▪ On operation dispatch the current CPU's free ZT channel is selected.
+    Operation info is set into the IOCTL-buffer, the ZT is woken and the
+    application thread is put to sleep.
+  ▪ After execution, ZT returns to kernel (IOCTL_ZU_WAIT_OPT), the app is
+    released, and the Server waits for a new operation on that CPU.
+  ▪ Each ZT has a cyclic logic. Each call to IOCTL_ZU_WAIT_OPT from Server
+    returns the results of the previous operation, before going to sleep
+    waiting to receive a new operation.
+	zus			zuf-zt				application
+    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     ---> IOCTL_ZU_WAIT_OPT    if (app-waiting)
+     |					wake-up-application	 -> return to app
+     |				FS-WAIT
+     |				|				<- POSIX call
+     |				V		<- fs-wake-up(dispatch)
+     |			<- return with new command
+     |--<- do_new_operation
+
+ZUS-mount-thread:
+  The system utilizes a single mount thread. (This thread has no affinity to
+  any core).
+  ▪ It will first register all FS-types supported by this Server (by calling
+    all zusFS plugins to register their supported types).
+  ▪ Once done, the thread sleeps in Kernel, as above, via IOCTL_ZU_MOUNT.
+  ▪ When the Kernel receives a mount request (vfs calls the fs_type->mount op)
+    a mount is dispatched back to zus.
+  ▪ NOTE: only on the very first mount is the above ZT-threads-array created;
+    the same ZT-array is then used for all super-blocks in the system.
+  ▪ As part of the mount command, in the context of this same mount-thread,
+    a call to IOCTL_ZU_GRAB_PMEM will establish an interface to the pmem
+    associated with this super_block.
+  ▪ On return, like above, a new call to IOCTL_ZU_MOUNT will return info of the
+    mount before sleeping in kernel waiting for a new dispatch. All SB info
+    is provided to zuf, including the root inode info. Kernel then proceeds
+    to complete the mount call.
+  ▪ NOTE that since there is a single mount thread, much of the FS-registration,
+    super_block and pmem management is lockless.
+
+Philosophy of operations:
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1. [zuf-root]
+
+On module load (zuf.ko) a special pseudo FS is mounted on /sys/fs/zuf. This is
+called zuf-root.
+The zuf-root has no visible files. All communication is done via special-files.
+Special-files are open(O_TMPFILE) and establish a special role via an
+IOCTL. (The ZT-thread file above is one such special file.)
+All communications with the server are done via the zuf-root. Each root owns
+many FS-types and each FS-type owns many super-blocks of this type, all sharing
+the same communication channels.
+All FS-type Servers live in the same zus application address space. If, at
+times, the administrator wants to separate different servers, he/she
+can mount a new zuf-root and point a new server instance at that new mount,
+registering other FS-types on that other instance. The whole communication
+array will then be duplicated as well.
+(Otherwise pointing a new server instance at a busy root will return an error)
+
+2. [zus server start]
+  ▪ On load all configured zusFS plugins are loaded.
+  ▪ The Server starts by starting a single mount thread.
+  ▪ It then proceeds to register with Kernel all FS-types it will support.
+    (This is done on the single mount thread, so FS-registration and
+     mount/umount operate in a single thread and therefore need no locks)
+  ▪ It then sleeps in the Kernel on a special-file of that zuf-root, waiting
+    for a mount command.
+
+3. [mount -t xyz]
+  [In Kernel]
+  ▪ If xyz was registered above as part of the Server startup, the regular
+    mount command will come to the zuf module with a zuf_mount() call, with
+    the xyz-FS-info. In turn this points to a zuf-root.
+  ▪ Code then proceeds to load a device-table of devices as specified above.
+    It then establishes a multi_devices object with a specific sb_id.
+  ▪ It proceeds to call mount_bdev, always with the same main-device,
+    thus fully supporting automatic bind mounts, even if different
+    devices are given to the mount command.
+  ▪ In zuf_fill_super it will then dispatch (awaken) the mount thread
+    specifying two parameters: first the FS-type to mount, and then
+    the sb_id associated with this super_block.
+
+  [In zus]
+  ▪ A zus_super_block_info is allocated.
+  ▪ zus calls PMEM_GRAB(sb_id) to establish a direct mapping to its
+    pmem devices. On return we have full access to our PMEM
+
+  ▪ ZT-threads-array
+    If this is the first mount the ZT-threads-array is created and
+    established. The mount thread will wait until all zt-threads have finished
+    initialization and are ready to rock.
+  ▪ Root-zus_inode is loaded and is returned to kernel
+  ▪ More mount info, like block sizes and so on, is returned to kernel.
+
+  [In Kernel]
+   zuf_fill_super is finalized, vectors are established, and we have a new
+   super_block ready for operations.
+
+4. An FS operation like create or WRITE/READ and so on arrives from application
+   via VFS. Eventually an Operation is dispatched to zus:
+   ▪ A special per-operation descriptor is filled up with all parameters.
+   ▪ A current CPU channel is grabbed. The operation descriptor is put on
+     that channel (ZT), including get_user_pages or Kernel-pages associated
+     with this OPT.
+   ▪ The ZT is awakened, app thread put to sleep.
+   ▪ Optionally in ZT context pages are mapped to that ZT-vma. This is so we
+     are sure the map is only on a single core, and no other core's TLB is
+     affected.
+   ▪ ZT thread is returned to user-space.
+   ▪ In ZT context the zus Server calls the appropriate zusFS->operation
+     vector. Output params filled.
+   ▪ zus calls again with an IOCTL_ZU_WAIT_OPT with the same descriptor
+     to return the requested info.
+   ▪ At Kernel (zuf) the app thread is awakened with the results, and the
+     ZT thread goes back to sleep waiting for a new operation.
+
+   ZT rules:
+       A ZT thread should try to minimize its sleeps. It might take locks,
+   in which case we will see that the same CPU channel is reentered via another
+   application/thread. But now that CPU channel is taken. What we do is we
+   utilize a few channels (ZTs) per core and those threads may grab another
+   channel. But this only postpones the problem. On a busy contended system,
+   all such channels will be consumed. If all channels are taken the
+   application thread is put on a busy scheduling wait until a channel can
+   be grabbed.
+   If the server needs to sleep for a long time it should utilize the
+   ZUFS_ASYNC return option. The app is then kept sleeping on an
+   operation-context object and the ZT is freed for foreground operation.
+   At some point in time when the server completes the delayed operation
+   it will notify the Kernel with a special async IO-context cookie,
+   and the app will be awakened.
+
+5. On umount the operation is reversed and all resources are released.
+6. In case of an application or Server crash: all resources are associated
+   with files, so on file_release these resources are caught and freed.
+
+Objects and life-time
+~~~~~~~~~~~~~~~~~~~~~
+
+Each Kernel object type has an associated zus Server object type whose lifetime
+is governed by the lifetime of the Kernel object. Therefore the Server's
+job is easy because it need not establish any object caches / hashes and so on.
+
+Inside zus all objects are allocated by the zusFS plugin. So in turn it can
+allocate a bigger space for its own private data and access it via the
+container_of() coding pattern. So when I say below a zus-object I mean both
+zus public part + zusFS private part of the same object.
+
+All operations return User-mode pointers that are opaque to the Kernel
+code; they are just a cookie which is returned back to zus when needed.
+At times when we want the Kernel to have direct access to a zus object like
+zufs_inode, along with the cookie we also return a dpp_t, with a defined
+structure.
+
+Kernel object 			| zus object 		| Kernel access (via dpp_t)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+zuf_fs_type
+	file_system_type	| zus_fs_info		| no
+
+zuf_sb_info
+	super_block		| zus_sb_info		| no
+
+zuf_inode_info			|			|
+	vfs_inode		| zus_inode_info	| no
+	zufs_inode *		| 	zufs_inode *	| yes
+	symlink *		|	char-array	| yes
+	xattr**			|	zus_xattr	| yes
+
+When it is time for a Kernel object to die, a final call to zus is
+dispatched so the associated object can also be freed. This means
+that under memory pressure, when object caches are evicted, the zus
+memory resources are freed as well.
+
+
+How to use zufs:
+~~~~~~~~~~~~~~~~
+
+The most up-to-date documentation of how to use the latest code bases
+is the script (set of scripts) at fs/do-zu/zudo in the zus git tree.
+
+We, the developers at NetApp, use this script to mount and test our
+latest code. So any new secret will be found in these scripts. Please
+read them as the ultimate source of how to operate things.
+
+We assume you cloned these git trees:
+[]$ mkdir zufs; cd zufs
+[]$ git clone https://github.com/NetApp/zufs-zuf -b upstream zuf
+[]$ git clone https://github.com/NetApp/zufs-zus -b upstream zus
+
+This will create the following trees
+zufs/zus - Source code for Server
+zufs/zuf - Linux Kernel source tree to compile and install on your machine
+
+Also specifically:
+zufs/zus/fs/do-zu/zudo - script Documenting how to run things
+
+[]$ cd zufs
+
+First time
+[]$ zus/fs/do-zu/zudo
+This will create a file:
+	zus/fs/do-zu/zu.conf
+
+Edit this file for your environment: devices, mount-point and so on.
+On first run an example file will be created for you. Fill in the
+blanks. Most params can stay as is in most cases.
+
+Now let's start running:
+
+[1]$ zus/fs/do-zu/zudo mkfs
+This will run the proper mkfs command selected at zu.conf file
+with the proper devices.
+
+[2]$ zus/fs/do-zu/zudo zuf-insmod
+This loads the zuf.ko module
+
+[3]$ zus/fs/do-zu/zudo zuf-root
+This mounts the zuf-root FS above on /sys/fs/zuf (automatically created in [2])
+
+[4]$ zus/fs/do-zu/zudo zus-up
+This runs the zus daemon in the background
+
+[5]$ zus/fs/do-zu/zudo mount
+This mounts the FS made by mkfs above on the dir specified in zu.conf
+
+To run all the 5 commands above at once do:
+[]$ zus/fs/do-zu/zudo up
+
+To undo all the above in reverse order do:
+[]$ zus/fs/do-zu/zudo down
+
+And the most magic command is:
+[]$ zus/fs/do-zu/zudo again
+This will do a "down", then update-mods, then "up"
+(update-mods is a special script to copy the latest compiled binaries)
+
+Now you are ready for some:
+[]$ zus/fs/do-zu/zudo xfstest
+xfstests is assumed to be installed in the regular /opt/xfstests dir
+
+Again, please look inside the scripts to see what each command does.
+These scripts are the ultimate documentation; do not believe
+anything I'm saying here. (Because it is outdated by now)
+
+Have a nice day
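
The dpp_t layout from the glossary ([offset = dpp_t & ~0x7] [pool = dpp_t & 0x7])
can be sketched in C roughly as follows. This is only an illustration: the
64-bit width of zufs_dpp_t and the helper names dpp_make(), dpp_offset() and
dpp_pool() are assumptions, not taken from the zuf or zus sources.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a dpp_t fits in 64 bits; the text does not state its width. */
typedef uint64_t zufs_dpp_t;

/* The 3 low bits carry the pool code; the rest is an 8-byte-aligned offset. */
#define DPP_POOL_MASK 0x7ULL

/* Hypothetical helper: pack an aligned offset and a pool code into a dpp_t. */
static inline zufs_dpp_t dpp_make(uint64_t offset, unsigned int pool)
{
	assert((offset & DPP_POOL_MASK) == 0);	/* offset must be 8-byte aligned */
	assert(pool <= DPP_POOL_MASK);		/* pool code must fit in 3 bits */
	return offset | pool;
}

/* offset = dpp_t & ~0x7 */
static inline uint64_t dpp_offset(zufs_dpp_t dpp)
{
	return dpp & ~DPP_POOL_MASK;
}

/* pool = dpp_t & 0x7 (0: pmem, 7: application buffers of the current IO) */
static inline unsigned int dpp_pool(zufs_dpp_t dpp)
{
	return (unsigned int)(dpp & DPP_POOL_MASK);
}
```

For example, dpp_make(0x1000, 3) packs offset 0x1000 with pool code 3;
dpp_pool() then gives back 3 and dpp_offset() gives back 0x1000.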
-- 
2.21.0
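
The "Objects and life-time" section says that a zusFS plugin allocates each
zus object itself, so it can reserve extra room for private data and get back
to it from the public part. A minimal sketch of that pattern, assuming the
usual container_of() idiom, is shown here. The struct layouts, the "foofs"
plugin and all foofs_* / FOOFS_I names are hypothetical; the real
zus_inode_info has different fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Usual idiom: recover the enclosing struct from a pointer to a member. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for the public zus object (assumed layout, illustration only). */
struct zus_inode_info {
	unsigned long ino;
};

/* Hypothetical plugin object: private data plus the embedded public part. */
struct foofs_inode_info {
	int private_flags;
	struct zus_inode_info zii;
};

/* The plugin allocates the bigger object but hands out the public part. */
static struct zus_inode_info *foofs_zii_alloc(unsigned long ino)
{
	struct foofs_inode_info *fii = calloc(1, sizeof(*fii));

	if (!fii)
		return NULL;
	fii->private_flags = 0x5;
	fii->zii.ino = ino;
	return &fii->zii;
}

/* Given the public part, get back to the plugin-private object. */
static struct foofs_inode_info *FOOFS_I(struct zus_inode_info *zii)
{
	assert(zii != NULL);
	return container_of(zii, struct foofs_inode_info, zii);
}

/* Free the whole object through its public part. */
static void foofs_zii_free(struct zus_inode_info *zii)
{
	free(FOOFS_I(zii));
}
```

With this layout the server core only ever sees struct zus_inode_info, while
the plugin calls FOOFS_I() whenever it needs its own fields.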

