From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-fsdevel-owner@vger.kernel.org>
Received: from mx143.netapp.com ([216.240.21.24]:53231 "EHLO mx143.netapp.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1751289AbeCNSGs (ORCPT <rfc822;linux-fsdevel@vger.kernel.org>);
        Wed, 14 Mar 2018 14:06:48 -0400
Subject: Re: [RFC 3/7] zuf: Preliminary Documentation
To: Randy Dunlap <rdunlap@infradead.org>,
        Boaz Harrosh <boazh@netapp.com>,
        linux-fsdevel <linux-fsdevel@vger.kernel.org>
References: <d5db55b6-e488-986a-81b1-a3514e8eba81@netapp.com>
 <287b77e0-87c6-2fa6-fc52-0141050e82ef@netapp.com>
 <5e04d191-d789-d221-112d-ffea088ab52d@infradead.org>
CC: Ric Wheeler <rwheeler@redhat.com>,
        Miklos Szeredi <mszeredi@redhat.com>,
        Steve French <smfrench@gmail.com>,
        Steven Whitehouse <swhiteho@redhat.com>,
        Jefff moyer <jmoyer@redhat.com>, Sage Weil <sweil@redhat.com>,
        Jan Kara <jack@suse.cz>, Amir Goldstein <amir73il@gmail.com>,
        Andy Rudof <andy.rudoff@intel.com>,
        Anna Schumaker <Anna.Schumaker@netapp.com>,
        Amit Golander <Amit.Golander@netapp.com>,
        Sagi Manole <sagim@netapp.com>,
        Shachar Sharon <Shachar.Sharon@netapp.com>
From: Boaz Harrosh <boazh@netapp.com>
Message-ID: <2d89f7cf-63a4-1eef-4f9d-08f351deb87d@netapp.com>
Date: Wed, 14 Mar 2018 20:01:51 +0200
MIME-Version: 1.0
In-Reply-To: <5e04d191-d789-d221-112d-ffea088ab52d@infradead.org>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On 13/03/18 22:32, Randy Dunlap wrote:
> Hi,
> 
> Just a few questions.  Very little editing.  :)
> 
> 
> On 03/13/2018 10:18 AM, Boaz Harrosh wrote:
>>
>> Adding Documentation/filesystems/zufs.txt
>>
>> Signed-off-by: Boaz Harrosh <boazh@netapp.com>
>> ---
>>  Documentation/filesystems/zufs.txt | 351 +++++++++++++++++++++++++++++++++++++
>>  1 file changed, 351 insertions(+)
>>  create mode 100644 Documentation/filesystems/zufs.txt
>>
>> diff --git a/Documentation/filesystems/zufs.txt b/Documentation/filesystems/zufs.txt
>> new file mode 100644
>> index 0000000..779f14b
>> --- /dev/null
>> +++ b/Documentation/filesystems/zufs.txt
>> @@ -0,0 +1,351 @@
>> +ZUFS - Zero-copy User-mode FileSystem
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +Trees:
>> +	git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream
>> +	git clone https://github.com/NetApp/zufs-zus -b zus-github
>> +
>> +patches, comments, questions, requests to:
>> +	boazh@netapp.com
>> +
>> +Introduction:
>> +~~~~~~~~~~~~~
>> +
>> +ZUFS - stands for Zero-copy User-mode FS
>> +▪ It is geared towards true zero copy end to end of both data and meta data.
>> +▪ It is geared towards very *low latency*, very high CPU locality, lock-less
>> +  parallelism.
>> +▪ Synchronous operations
>> +▪ Numa awareness
>> +
>> + ZUFS is a, from scratch, implementation of a filesystem-in-user-space, which
>> +tries to address the above goals. It is aimed for pmem based FSs. But can easily
>> +support any other type of FSs that can utilize x10 latency and parallelism
>> +improvements.
>> +
>> +Glossary and names:
>> +~~~~~~~~~~~~~~~~~~~
>> +
>> +ZUF - Zero-copy User-mode Feeder
>> +  zuf.ko is the Kernel VFS component. Its job is to interface with the Kernel
>> +  VFS and dispatch commands to a User-mode application Server.
>> +  Uptodate code is found at:
>> +	git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream
>> +
>> +ZUS - Zero-copy User-mode Server
>> +  zufs utilizes a User-mode server application. That takes care of the detailed
>> +  communication protocol and correctness with the Kernel.
>> +  In turn it utilizes many zusFS Filesystem plugins to implement the actual
>> +  on disc Filesystem.
>> +  Uptodate code is found at:
>> +	git clone https://github.com/NetApp/zufs-zus -b zus-github
>> +
>> +zusFS - FS plugins
>> +  These are .so loadable modules that implement one or more Filesystem-types
>> +  (-t xyz).
>> +  The zus server communicates with the plugin via a set of function vectors
>> +  for the different operations. And establishes communication via defined
>> +  structures.
>> +
>> +Filesystem-type:
>> +  At startup zus registers with the Kernel one or more Filesystem-type(s)
>> +  Associated with the type is a 4 letter type-name (-t fstn) different
> 
> (?)                           is a unique 4-letter type-name
> 

FStype-name I guess. this field in the Kernel is called file_system_type.name
and is just a pointer so I guess it can be any length. I have space for 15
characters in this API.
But I agree that the sentence above is clear as mud I will reword it

>> +  info about the fs, like a magic number and so on.
>> +  One Server can support many FS-types, in turn each FS-type can mount
>> +  multiple super-blocks, each supporting multiple devices.
>> +
>> +Device-Table (DT) - A zufs FS can support multiple devices
>> +  ZUF in Kernel may receive, like any mount command a block-device or none.
>> +  For the former if the specified FS-types states so in a special field.
>> +  The mount will look for a Device table. A list of devices in a specific
>> +  order sitting at some offset on the block-device. The system will then
>> +  proceed to open and own all these devices and associate them to the mounting
>> +  super-block.
>> +  If FS-type specifies a -1 at DT_offset then there is no device table
>> +  and a DT of a single device is created. (If we have no devices, none
>> +  is specified than we operate without any block devices. (Mount options give
>> +  some indication of the storage information)
> 
> missing one ')'
> 

yep

>> +  The device table has special consideration for pmem devices and will
>> +  present the all linear array of devices to zus, as one flat mmap space.
>> +  Alternatively all none pmem devices are also provided an interface
> 
>                    all known (?)
> 

No this is none-pmem .i.e other devices that are regular bdev(s) and do
not give the special pmem interface. Will fix

>> +  with facility of data movement from pmem to a slower device.
>> +  A detailed NUMA info is exported to the Server for maximum utilization.
>> +
>> +pmem:
>> +  Multiple pmem devices are presented to the server as a single
>> +  linear file mmap. Something like /dev/dax. But it is strictly
>> +  available only to that specific super-block that owns it.
>> +
>> +dpp_t - Dual port pointer type
>> +  At some points in the protocol there are objects that return from zus
>> +  (The Server) to the Kernel via a dpp_t. This is a special kind of pointer
>> +  It is actually an offset 8 bytes aligned with the 3 low bits specifying
>> +  a pool code: [offset = dpp_t & ~0x7] [pool = dpp_t & 0x7]
>> +  pool == 0 means the offset is in pmem who's management is by zuf and
>> +  a full easy access is provided for zus.
>> +
>> +  pool != 0 Is a pre-established tempfs file (up to 6 such files) where
>> +  the zus has an mmap on the file and the Kernel can access that data
>> +  via an offset into the file.
> 
> so non-zero pool [dpp_t & 0x7] can be a value of 1 - 7, and above says up to
> 6 such tempfs files.  What is the other pool value used for?
> 

Yes my bad. Thanks

>> +  All dpp_t objects life time rules are strictly defined.
>> +  Mainly the primary use of dpp_t is the on-pmem inode structure. Both
>> +  zus and zuf can access and change this structure. On any modification
>> +  the zus is called so to be notified of any changes, persistence.
>> +  More such objects are: Symlinks, xattrs, mmap-data-blocks etc...
>> +
>> +Relay-wait-object:
>> +  communication between Kernel and server are done via zus-threads that
>> +  sleep in Kernel (inside an IOCTL) and wait for commands. Once received
>> +  the IOCTL returns operation id executed and the return info is returned via
>> +  a new IOCTL call, which then waits for the next operation.
> 
> Does that say 2 IOCTLs per command?  One to start it and one to fetch return info?
> 

Not really, I guess I did not explain well. It is one IOCTL per command + the very very
first one. It is a cyclic system. The IOCTL that returns results is then sleeping
again to wait for a new command.

>>From the perspective of the user-mode code this is a single IORW (both write+read)
type IOCTL that sends some information - command-results - and receives some
information - A new command.

How to explain this better.
>> +  To wake up the sleeping thread we use a Relay-wait-object. Currently
>> +  it is two waitqueue_head(s) back to back.
>> +  In future we should investigate the use of that special binder object
>> +  that releases its thread time slice to the other thread without going through
>> +  the scheduler.
>> +
>> +ZT-threads-array:
>> +  The novelty of the zufs is the ZT-threads system. One thread or more is
>> +  pre-created for each active core in the system.
>> +  ▪ The thread is AFFINITY set for that single core only.
>> +  ▪ Special communication file per ZT (O_TMPFILE + IOCTL_ZUFS_INIT)
>> +    At initialization the ZT thread communicates through a ZT_INIT ioctl
>> +    and registers as the handler of that core (Channel)
>> +  ▪ ZT-vma - Mmap 4M vma zero copy communication area per ZT
>> +    Pre allocated vma is created into which will be mapped the application
>> +    or Kernel buffers for the current operation.
>> +  ▪ IOCTL_ZU_WAIT_OPT – threads sleeps in Kernel waiting for an operation
>> +    via the IOCTL_ZU_WAIT_OPT call. supplying a 4k communication buffer
>> +
>> +  ▪ On an operation dispatch current CPU's ZT is selected, app pages mapped
>> +    into the ZT-vma. Server thread released with an operation to execute.
>> +  ▪ After execution, ZT returns to kernel (IOCTL_ZU_WAIT_OPT), app is released,
>> +    Server wait for new operation on that CPU.
>> +
>> +ZUS-mount-thread:
>> +  The system utilizes a single mount thread. (This thread is not affinity to any
>> +  core).
>> +  ▪ It will first Register all FS-types supported by this Server (By calling
>> +    all zusFS plugins to register their supported types). Once done
>> +  ▪ As above the thread sleeps in Kernel via the IOCTL_ZU_MOUNT call.
>> +  ▪ When the Kernel receives a mount request (vfs calles the fs_type->mount opt)
>> +    a mount is dispatched back to zus.
>> +  ▪ NOTE: That only on very first mount the above ZT-threads-array is created
>> +    the same array is then used for all super-blocks in the system
>> +  ▪ As part of the mount command in the context of this same mount-thread
>> +    a call to IOCTL_ZU_GRAB_PMEM will establish an interface to the pmem
>> +    Associated with this super_block
>> +  ▪ On return (like above a new call to IOCTL_ZU_MOUNT will return info of the
> 
> missing ')' somewhere.
> 
Yes
>> +    mount before sleeping in kernel waiting for a new dispatch. All SB info
>> +    is provided to zuf, including the root inode info. Kernel then proceeds
>> +    to complete the mount call.
>> +  ▪ NOTE that since there is a single mount thread all FS-registration
>> +    super_block and pmem management are lockless.
>> +  
>> +Philosophy of operations:
>> +~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +1. [zuf-root]
>> +
>> +On module load  (zuf.ko) A special pseudo FS is mounted on /sys/fs/zuf. This is
>> +called zuf-root.
>> +The zuf-root has no visible files. All communication is done via special-files.
>> +special-files are open(O_TMPFILE) and establish a special role via an
>> +IOCTL.
>> +All communications with the server are done via the zuf-root. Each root owns
>> +many FS-types and each FS-type owns many super-blocks of this type. All Sharing
>> +the same communication channels.
>> +Since all FS-type Servers live in the same zus application address space, at
>> +times. If the administrator wants to separate between different servers, he/she
>> +can mount a new zuf-root and point a new server instance on that new mount,
>> +registering other FS-types on that other instance. The all communication array
>> +will then be duplicated as well.
>> +(Otherwise pointing a new server instance on a busy root will return an error)
>> +
>> +2. [zus server start]
>> +  ▪ On load all configured zusFS plugins are loaded.
>> +  ▪ The Server starts by starting a single mount thread.
>> +  ▪ It than proceeds to register with Kernel all FS-types it will support.
>> +    (This is done on the single mount thread, so all FS-registration and
>> +     mount/umount operate in a single thread and therefor need not any locks)
>> +  ▪ Sleeping in the Kernel on a special-file of that zuf-root. waiting for a mount
>> +    command.
>> +
>> +3. [mount -t xyz]
>> +  [In Kernel]
>> +  ▪ If xyz was registered above as part of the Server startup. the regular
>> +    mount command will come to the zuf module with a zuf_mount() call. with
>> +    the xyz-FS-info. In turn this points to a zuf-root.
>> +  ▪ Code than proceed to load a device-table of devices as  specified above.
>> +    It then establishes an md object with a specific pmem_id.
> 
>                               md ??
> 

Thanks yes I should explain better MD stands for multi_devices which is the
name of the structure. I will put some definitions somewhere

>> +  ▪ It proceeds to call mount_bdev. Always with the same main-device
>> +    thous fully sporting automatic bind mounts. Even if different
>> +    devices are given to the mount command.
>> +  ▪ In zuf_fill_super it will then dispatch (awaken) the mount thread
>> +    specifying two parameters. One the FS-type to mount, and then
>> +    the pmem_id Associated with this super_block.
>> +
>> +  [In zus]
>> +  ▪ A zus_super_block_info is allocated.
>> +  ▪ zus calls PMEM_GRAB(pmem_id) to establish a direct mapping to its
>> +    pmem devices. On return we have full access to our PMEM
>> +
>> +  ▪ ZT-threads-array
>> +    If this is the first mount the all ZT-threads-array is created and
>> +    established. The mount thread will wait until all zt-threads finished
>> +    initialization and ready to rock.
>> +  ▪ Root-zus_inode is loaded and is returned to kernel
>> +  ▪ More info about the mount like block sizes and so on are returned to kernel.
>> +
>> +  [In Kernel]
>> +   The zuf_fill_super is finalized vectors established and we have a new
>> +   super_block ready for operations.
>> +
>> +4. An FS operation like create or WRITE/READ and so on arrives from application
>> +   via VFS. Eventually an Operation is dispatched to zus:
>> +   ▪ A special per-operation descriptor is filled up with all parameters.
>> +   ▪ A current CPU channel is grabbed. the operation descriptor is put on
>> +     that channel (ZT). Including get_user_pages or Kernel-pages associated
>> +     with this OPT.
>> +   ▪ The ZT is awaken, app thread put to sleep.
>> +   ▪ In ZT context pages are mapped to that ZT-vma. This is so we are sure
>> +     the map is only on a single core. And no other core's TLB is affected.
>> +     (This here is the all performance secret)
>> +   ▪ ZT thread is returned to user-space.
>> +   ▪ In ZT context the zus Server calls the appropriate zusFS->operation
>> +     vector. Output params filled.
>> +   ▪ zus calls again with an IOCTL_ZU_WAIT_OPT with the same descriptor
>> +     to return the requested info.
>> +   ▪ At Kernel (zuf) the app thread is awaken with the results, and the
>> +     ZT thread goes back to sleep waiting a new operation.
>> +     
>> +   ZT rules:
>> +       A ZT thread must not return back to Kernel. One exception is locks
>> +   if needed it might sleep waiting for a lock. In which case we will see that
>> +   the same CPU channel is reentered via another application and/or thread.
>> +   But now that CPU channel is taken.  What we do is we utilize a few channels
>> +   (ZTs) per core and the threads may grab another channel. But this only
>> +   postpones the problem on a busy contended system, all such channels will be
>> +   consumed. If all channels are taken the application thread is put on a busy
>> +   scheduling wait until a channel can be grabbed.
>> +   Therefor Server must not sleep on a ZT. If it needs such a sleeping operation
>> +   it will return -EAGAIN to zuf. The app is kept sleeping the operation is put
>> +   on an asynchronous Q and the ZT freed for foreground operation. At some point
>> +   when the server completes the delayed operation it will complete notify
>> +   the Kernel with a special async cookie. And the app will be awakened.
>> +   (Here too we utilize pre allocated asyc channels and vmas. If all channels
>> +    are busy, application is kept sleeping waiting its free slot turn)
>> +
>> +4. On umount the operation is reversed and all resources are torn down.
>> +5. In case of an application or Server crash, all resources are Associated
>> +   with files, on file_release these resources are caught and freed.
>> +
>> +Objects and life-time
>> +~~~~~~~~~~~~~~~~~~~~~
>> +
>> +Each Kernel object type has an assosiated zus Server object type who's life
>> +time is governed by the life-time of the Kernel object. Therefor the Server's
>> +job is easy because it need not establish any object caches / hashes and so on.
>> +
>> +Inside zus all objects are allocated by the zusFS plugin. So in turn it can
>> +allocate a bigger space for its own private data and access it via the
>> +container_off() coding pattern. So when I say below a zus-object I mean both
>> +zus public part + zusFS private part of the same object.
>> +
>> +All operations return a UM pointer that are OPEC the the Kernel code, they
> 
> -ETOOMANYNLA
> 
> 2LA: UM
> 4LA: OPEC

Yes thanks I will change

>> +are just a cookie which is returned back to zus, when needed.
>> +At times when we want the Kernel to have direct access to a zus object like
>> +zus_inode, along with the cookie we also return a dpp_t, with a defined structure.
>> +
>> +Kernel object 			| zus object 		| Kernel access (via dpp_t)
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +zuf_fs_type
>> +	file_system_type	| zus_fs_info		| no
>> +
>> +zuf_sb_info
>> +	super_block		| zus_sb_info		| no
>> +	
>> +zuf_inode_info			|			|
>> +	vfs_inode		| zus_inode_info	| no
>> +	zus_inode *		| 	zus_inode *	| yes
>> +	synlink *		|	char-array	| yes
>> +	xattr**			|	zus_xattr	| yes
>> +
>> +When a Kernel object's time is to die, a final call to zus is
>> +dispatched so the associated object can also be freed. Which means
>> +that on memory pressure when object caches are evicted also the zus
>> +memory resources are freed.
>> +
>> +
>> +How to use zufs:
>> +~~~~~~~~~~~~~~~~
>> +
>> +The most updated documentation of how to use the latest code bases
>> +is the script (set of scripts) at fs/do-zu/zudo on the zus git tree
>> +
>> +We the developers at Netapp use this script to mount and test our
>> +latest code. So any new Secret will be found in these scripts. Please
>> +read them as the ultimate source of how to operate things.
>> +
>> +TODO: We are looking for exports in system-d and udev to properly
>> +integrate these tools into a destro.
>> +
>> +We assume you cloned these git trees:
>> +[]$ mkdir zufs; cd zufs
>> +[]$ git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream
>> +[]$ git clone https://github.com/NetApp/zufs-zuf -b zus-github
>> +
>> +This will create the following trees
>> +zufs/zus - Source code for Server
>> +zufs/zuf - Linux Kernel source tree to compile and install on your machine
>> +
>> +Also specifically:
>> +zufs/zus/fs/do-zu/zudo - script Documenting how to run things
>> +
>> +[]$ cd zuf
>> +
>> +First time
>> +[] ../zus/fs/do-zu/zudo
>> +this will create a file:
>> +	../zus/fs/do-zu/zu.conf
>> +
>> +Edit this file for your environment. Devices, mount-point and so on.
>> +On first run an example file will be created for you. Fill in the
>> +blanks. Most params can stay as is in most cases
>> +
>> +Now lest start running:
>> +
>> +[1]$ ../zus/fs/do-zu/zudo mkfs
>> +This will run the proper mkfs command selected at zu.conf file
>> +with the proper devices.
>> +
>> +[2]$ ../zus/fs/do-zu/zudo zuf-insmod
>> +This loads the zuf.ko module
>> +
>> +[3]$ ../zus/fs/do-zu/zudo zuf-root
>> +This mounts the zuf-root FS above on /sys/fs/zuf (automatically created above)
>> +
>> +[4]$ ../zus/fs/do-zu/zudo zus-up
>> +This runs the zus daemon in the background
>> +
>> +[5]$ ../zus/fs/do-zu/zudo mount
>> +This mount the mkfs FS above on the specified dir in zu.conf
>> +
>> +To run all the 5 commands above at once do:
>> +[]$ ../zus/fs/do-zu/zudo up
>> +
>> +To undo all the above in reverse order do:
>> +[]$ ../zus/fs/do-zu/zudo down
>> +
>> +And the most magic command is:
>> +[]$ ../zus/fs/do-zu/zudo again
>> +Will do a "down", then update-mods, then "up"
>> +(update-mods is a special script to copy the latest compiled binaries)
>> +
>> +Now you are ready for some:
>> +[]$ ../zus/fs/do-zu/zudo xfstest
>> +xfstests is assumed to be installed in the regular /opt/xfstests dir
>> +
>> +Again please see inside the scripts what each command does
>> +these scripts are the ultimate Documentation, do not believe
>> +anything I'm saying here. (Because it is outdated by now)
>>
> 
> thanks,
> 

Thank you Randy for all the input.
I always have such hard time with the Documentation and very low confidence
in it. As I was slaving over this I hoped you'd have a bit of time to look
it over.

Will fix
Boaz