From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx143.netapp.com ([216.240.21.24]:53231 "EHLO mx143.netapp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751289AbeCNSGs (ORCPT ); Wed, 14 Mar 2018 14:06:48 -0400 Subject: Re: [RFC 3/7] zuf: Preliminary Documentation To: Randy Dunlap , Boaz Harrosh , linux-fsdevel References: <287b77e0-87c6-2fa6-fc52-0141050e82ef@netapp.com> <5e04d191-d789-d221-112d-ffea088ab52d@infradead.org> CC: Ric Wheeler , Miklos Szeredi , Steve French , Steven Whitehouse , Jefff moyer , Sage Weil , Jan Kara , Amir Goldstein , Andy Rudof , Anna Schumaker , Amit Golander , Sagi Manole , Shachar Sharon From: Boaz Harrosh Message-ID: <2d89f7cf-63a4-1eef-4f9d-08f351deb87d@netapp.com> Date: Wed, 14 Mar 2018 20:01:51 +0200 MIME-Version: 1.0 In-Reply-To: <5e04d191-d789-d221-112d-ffea088ab52d@infradead.org> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Sender: linux-fsdevel-owner@vger.kernel.org List-ID: On 13/03/18 22:32, Randy Dunlap wrote: > Hi, > > Just a few questions. Very little editing. :) > > > On 03/13/2018 10:18 AM, Boaz Harrosh wrote: >> >> Adding Documentation/filesystems/zufs.txt >> >> Signed-off-by: Boaz Harrosh >> --- >> Documentation/filesystems/zufs.txt | 351 +++++++++++++++++++++++++++++++++++++ >> 1 file changed, 351 insertions(+) >> create mode 100644 Documentation/filesystems/zufs.txt >> >> diff --git a/Documentation/filesystems/zufs.txt b/Documentation/filesystems/zufs.txt >> new file mode 100644 >> index 0000000..779f14b >> --- /dev/null >> +++ b/Documentation/filesystems/zufs.txt >> @@ -0,0 +1,351 @@ >> +ZUFS - Zero-copy User-mode FileSystem >> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ >> + >> +Trees: >> + git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream >> + git clone https://github.com/NetApp/zufs-zus -b zus-github >> + >> +patches, comments, questions, requests to: >> + boazh@netapp.com >> + >> +Introduction: >> +~~~~~~~~~~~~~ >> + >> +ZUFS - stands for Zero-copy User-mode FS >> +▪ It is geared towards true zero copy end to end of both data and meta data. >> +▪ It is geared towards very *low latency*, very high CPU locality, lock-less >> + parallelism. >> +▪ Synchronous operations >> +▪ Numa awareness >> + >> + ZUFS is a, from scratch, implementation of a filesystem-in-user-space, which >> +tries to address the above goals. It is aimed for pmem based FSs. But can easily >> +support any other type of FSs that can utilize x10 latency and parallelism >> +improvements. >> + >> +Glossary and names: >> +~~~~~~~~~~~~~~~~~~~ >> + >> +ZUF - Zero-copy User-mode Feeder >> + zuf.ko is the Kernel VFS component. Its job is to interface with the Kernel >> + VFS and dispatch commands to a User-mode application Server. >> + Uptodate code is found at: >> + git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream >> + >> +ZUS - Zero-copy User-mode Server >> + zufs utilizes a User-mode server application. That takes care of the detailed >> + communication protocol and correctness with the Kernel. >> + In turn it utilizes many zusFS Filesystem plugins to implement the actual >> + on disc Filesystem. >> + Uptodate code is found at: >> + git clone https://github.com/NetApp/zufs-zus -b zus-github >> + >> +zusFS - FS plugins >> + These are .so loadable modules that implement one or more Filesystem-types >> + (-t xyz). >> + The zus server communicates with the plugin via a set of function vectors >> + for the different operations. And establishes communication via defined >> + structures. >> + >> +Filesystem-type: >> + At startup zus registers with the Kernel one or more Filesystem-type(s) >> + Associated with the type is a 4 letter type-name (-t fstn) different > > (?) is a unique 4-letter type-name > FStype-name I guess. this field in the Kernel is called file_system_type.name and is just a pointer so I guess it can be any length. I have space for 15 characters in this API. But I agree that the sentence above is clear as mud I will reword it >> + info about the fs, like a magic number and so on. >> + One Server can support many FS-types, in turn each FS-type can mount >> + multiple super-blocks, each supporting multiple devices. >> + >> +Device-Table (DT) - A zufs FS can support multiple devices >> + ZUF in Kernel may receive, like any mount command a block-device or none. >> + For the former if the specified FS-types states so in a special field. >> + The mount will look for a Device table. A list of devices in a specific >> + order sitting at some offset on the block-device. The system will then >> + proceed to open and own all these devices and associate them to the mounting >> + super-block. >> + If FS-type specifies a -1 at DT_offset then there is no device table >> + and a DT of a single device is created. (If we have no devices, none >> + is specified than we operate without any block devices. (Mount options give >> + some indication of the storage information) > > missing one ')' > yep >> + The device table has special consideration for pmem devices and will >> + present the all linear array of devices to zus, as one flat mmap space. >> + Alternatively all none pmem devices are also provided an interface > > all known (?) > No this is none-pmem .i.e other devices that are regular bdev(s) and do not give the special pmem interface. Will fix >> + with facility of data movement from pmem to a slower device. >> + A detailed NUMA info is exported to the Server for maximum utilization. >> + >> +pmem: >> + Multiple pmem devices are presented to the server as a single >> + linear file mmap. Something like /dev/dax. But it is strictly >> + available only to that specific super-block that owns it. >> + >> +dpp_t - Dual port pointer type >> + At some points in the protocol there are objects that return from zus >> + (The Server) to the Kernel via a dpp_t. This is a special kind of pointer >> + It is actually an offset 8 bytes aligned with the 3 low bits specifying >> + a pool code: [offset = dpp_t & ~0x7] [pool = dpp_t & 0x7] >> + pool == 0 means the offset is in pmem who's management is by zuf and >> + a full easy access is provided for zus. >> + >> + pool != 0 Is a pre-established tempfs file (up to 6 such files) where >> + the zus has an mmap on the file and the Kernel can access that data >> + via an offset into the file. > > so non-zero pool [dpp_t & 0x7] can be a value of 1 - 7, and above says up to > 6 such tempfs files. What is the other pool value used for? > Yes my bad. Thanks >> + All dpp_t objects life time rules are strictly defined. >> + Mainly the primary use of dpp_t is the on-pmem inode structure. Both >> + zus and zuf can access and change this structure. On any modification >> + the zus is called so to be notified of any changes, persistence. >> + More such objects are: Symlinks, xattrs, mmap-data-blocks etc... >> + >> +Relay-wait-object: >> + communication between Kernel and server are done via zus-threads that >> + sleep in Kernel (inside an IOCTL) and wait for commands. Once received >> + the IOCTL returns operation id executed and the return info is returned via >> + a new IOCTL call, which then waits for the next operation. > > Does that say 2 IOCTLs per command? One to start it and one to fetch return info? > Not really, I guess I did not explain well. It is one IOCTL per command + the very very first one. It is a cyclic system. The IOCTL that returns results is then sleeping again to wait for a new command. >>From the perspective of the user-mode code this is a single IORW (both write+read) type IOCTL that sends some information - command-results - and receives some information - A new command. How to explain this better. >> + To wake up the sleeping thread we use a Relay-wait-object. Currently >> + it is two waitqueue_head(s) back to back. >> + In future we should investigate the use of that special binder object >> + that releases its thread time slice to the other thread without going through >> + the scheduler. >> + >> +ZT-threads-array: >> + The novelty of the zufs is the ZT-threads system. One thread or more is >> + pre-created for each active core in the system. >> + ▪ The thread is AFFINITY set for that single core only. >> + ▪ Special communication file per ZT (O_TMPFILE + IOCTL_ZUFS_INIT) >> + At initialization the ZT thread communicates through a ZT_INIT ioctl >> + and registers as the handler of that core (Channel) >> + ▪ ZT-vma - Mmap 4M vma zero copy communication area per ZT >> + Pre allocated vma is created into which will be mapped the application >> + or Kernel buffers for the current operation. >> + ▪ IOCTL_ZU_WAIT_OPT – threads sleeps in Kernel waiting for an operation >> + via the IOCTL_ZU_WAIT_OPT call. supplying a 4k communication buffer >> + >> + ▪ On an operation dispatch current CPU's ZT is selected, app pages mapped >> + into the ZT-vma. Server thread released with an operation to execute. >> + ▪ After execution, ZT returns to kernel (IOCTL_ZU_WAIT_OPT), app is released, >> + Server wait for new operation on that CPU. >> + >> +ZUS-mount-thread: >> + The system utilizes a single mount thread. (This thread is not affinity to any >> + core). >> + ▪ It will first Register all FS-types supported by this Server (By calling >> + all zusFS plugins to register their supported types). Once done >> + ▪ As above the thread sleeps in Kernel via the IOCTL_ZU_MOUNT call. >> + ▪ When the Kernel receives a mount request (vfs calles the fs_type->mount opt) >> + a mount is dispatched back to zus. >> + ▪ NOTE: That only on very first mount the above ZT-threads-array is created >> + the same array is then used for all super-blocks in the system >> + ▪ As part of the mount command in the context of this same mount-thread >> + a call to IOCTL_ZU_GRAB_PMEM will establish an interface to the pmem >> + Associated with this super_block >> + ▪ On return (like above a new call to IOCTL_ZU_MOUNT will return info of the > > missing ')' somewhere. > Yes >> + mount before sleeping in kernel waiting for a new dispatch. All SB info >> + is provided to zuf, including the root inode info. Kernel then proceeds >> + to complete the mount call. >> + ▪ NOTE that since there is a single mount thread all FS-registration >> + super_block and pmem management are lockless. >> + >> +Philosophy of operations: >> +~~~~~~~~~~~~~~~~~~~~~~~~~ >> + >> +1. [zuf-root] >> + >> +On module load (zuf.ko) A special pseudo FS is mounted on /sys/fs/zuf. This is >> +called zuf-root. >> +The zuf-root has no visible files. All communication is done via special-files. >> +special-files are open(O_TMPFILE) and establish a special role via an >> +IOCTL. >> +All communications with the server are done via the zuf-root. Each root owns >> +many FS-types and each FS-type owns many super-blocks of this type. All Sharing >> +the same communication channels. >> +Since all FS-type Servers live in the same zus application address space, at >> +times. If the administrator wants to separate between different servers, he/she >> +can mount a new zuf-root and point a new server instance on that new mount, >> +registering other FS-types on that other instance. The all communication array >> +will then be duplicated as well. >> +(Otherwise pointing a new server instance on a busy root will return an error) >> + >> +2. [zus server start] >> + ▪ On load all configured zusFS plugins are loaded. >> + ▪ The Server starts by starting a single mount thread. >> + ▪ It than proceeds to register with Kernel all FS-types it will support. >> + (This is done on the single mount thread, so all FS-registration and >> + mount/umount operate in a single thread and therefor need not any locks) >> + ▪ Sleeping in the Kernel on a special-file of that zuf-root. waiting for a mount >> + command. >> + >> +3. [mount -t xyz] >> + [In Kernel] >> + ▪ If xyz was registered above as part of the Server startup. the regular >> + mount command will come to the zuf module with a zuf_mount() call. with >> + the xyz-FS-info. In turn this points to a zuf-root. >> + ▪ Code than proceed to load a device-table of devices as specified above. >> + It then establishes an md object with a specific pmem_id. > > md ?? > Thanks yes I should explain better MD stands for multi_devices which is the name of the structure. I will put some definitions somewhere >> + ▪ It proceeds to call mount_bdev. Always with the same main-device >> + thous fully sporting automatic bind mounts. Even if different >> + devices are given to the mount command. >> + ▪ In zuf_fill_super it will then dispatch (awaken) the mount thread >> + specifying two parameters. One the FS-type to mount, and then >> + the pmem_id Associated with this super_block. >> + >> + [In zus] >> + ▪ A zus_super_block_info is allocated. >> + ▪ zus calls PMEM_GRAB(pmem_id) to establish a direct mapping to its >> + pmem devices. On return we have full access to our PMEM >> + >> + ▪ ZT-threads-array >> + If this is the first mount the all ZT-threads-array is created and >> + established. The mount thread will wait until all zt-threads finished >> + initialization and ready to rock. >> + ▪ Root-zus_inode is loaded and is returned to kernel >> + ▪ More info about the mount like block sizes and so on are returned to kernel. >> + >> + [In Kernel] >> + The zuf_fill_super is finalized vectors established and we have a new >> + super_block ready for operations. >> + >> +4. An FS operation like create or WRITE/READ and so on arrives from application >> + via VFS. Eventually an Operation is dispatched to zus: >> + ▪ A special per-operation descriptor is filled up with all parameters. >> + ▪ A current CPU channel is grabbed. the operation descriptor is put on >> + that channel (ZT). Including get_user_pages or Kernel-pages associated >> + with this OPT. >> + ▪ The ZT is awaken, app thread put to sleep. >> + ▪ In ZT context pages are mapped to that ZT-vma. This is so we are sure >> + the map is only on a single core. And no other core's TLB is affected. >> + (This here is the all performance secret) >> + ▪ ZT thread is returned to user-space. >> + ▪ In ZT context the zus Server calls the appropriate zusFS->operation >> + vector. Output params filled. >> + ▪ zus calls again with an IOCTL_ZU_WAIT_OPT with the same descriptor >> + to return the requested info. >> + ▪ At Kernel (zuf) the app thread is awaken with the results, and the >> + ZT thread goes back to sleep waiting a new operation. >> + >> + ZT rules: >> + A ZT thread must not return back to Kernel. One exception is locks >> + if needed it might sleep waiting for a lock. In which case we will see that >> + the same CPU channel is reentered via another application and/or thread. >> + But now that CPU channel is taken. What we do is we utilize a few channels >> + (ZTs) per core and the threads may grab another channel. But this only >> + postpones the problem on a busy contended system, all such channels will be >> + consumed. If all channels are taken the application thread is put on a busy >> + scheduling wait until a channel can be grabbed. >> + Therefor Server must not sleep on a ZT. If it needs such a sleeping operation >> + it will return -EAGAIN to zuf. The app is kept sleeping the operation is put >> + on an asynchronous Q and the ZT freed for foreground operation. At some point >> + when the server completes the delayed operation it will complete notify >> + the Kernel with a special async cookie. And the app will be awakened. >> + (Here too we utilize pre allocated asyc channels and vmas. If all channels >> + are busy, application is kept sleeping waiting its free slot turn) >> + >> +4. On umount the operation is reversed and all resources are torn down. >> +5. In case of an application or Server crash, all resources are Associated >> + with files, on file_release these resources are caught and freed. >> + >> +Objects and life-time >> +~~~~~~~~~~~~~~~~~~~~~ >> + >> +Each Kernel object type has an assosiated zus Server object type who's life >> +time is governed by the life-time of the Kernel object. Therefor the Server's >> +job is easy because it need not establish any object caches / hashes and so on. >> + >> +Inside zus all objects are allocated by the zusFS plugin. So in turn it can >> +allocate a bigger space for its own private data and access it via the >> +container_off() coding pattern. So when I say below a zus-object I mean both >> +zus public part + zusFS private part of the same object. >> + >> +All operations return a UM pointer that are OPEC the the Kernel code, they > > -ETOOMANYNLA > > 2LA: UM > 4LA: OPEC Yes thanks I will change >> +are just a cookie which is returned back to zus, when needed. >> +At times when we want the Kernel to have direct access to a zus object like >> +zus_inode, along with the cookie we also return a dpp_t, with a defined structure. >> + >> +Kernel object | zus object | Kernel access (via dpp_t) >> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ >> +zuf_fs_type >> + file_system_type | zus_fs_info | no >> + >> +zuf_sb_info >> + super_block | zus_sb_info | no >> + >> +zuf_inode_info | | >> + vfs_inode | zus_inode_info | no >> + zus_inode * | zus_inode * | yes >> + synlink * | char-array | yes >> + xattr** | zus_xattr | yes >> + >> +When a Kernel object's time is to die, a final call to zus is >> +dispatched so the associated object can also be freed. Which means >> +that on memory pressure when object caches are evicted also the zus >> +memory resources are freed. >> + >> + >> +How to use zufs: >> +~~~~~~~~~~~~~~~~ >> + >> +The most updated documentation of how to use the latest code bases >> +is the script (set of scripts) at fs/do-zu/zudo on the zus git tree >> + >> +We the developers at Netapp use this script to mount and test our >> +latest code. So any new Secret will be found in these scripts. Please >> +read them as the ultimate source of how to operate things. >> + >> +TODO: We are looking for exports in system-d and udev to properly >> +integrate these tools into a destro. >> + >> +We assume you cloned these git trees: >> +[]$ mkdir zufs; cd zufs >> +[]$ git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream >> +[]$ git clone https://github.com/NetApp/zufs-zuf -b zus-github >> + >> +This will create the following trees >> +zufs/zus - Source code for Server >> +zufs/zuf - Linux Kernel source tree to compile and install on your machine >> + >> +Also specifically: >> +zufs/zus/fs/do-zu/zudo - script Documenting how to run things >> + >> +[]$ cd zuf >> + >> +First time >> +[] ../zus/fs/do-zu/zudo >> +this will create a file: >> + ../zus/fs/do-zu/zu.conf >> + >> +Edit this file for your environment. Devices, mount-point and so on. >> +On first run an example file will be created for you. Fill in the >> +blanks. Most params can stay as is in most cases >> + >> +Now lest start running: >> + >> +[1]$ ../zus/fs/do-zu/zudo mkfs >> +This will run the proper mkfs command selected at zu.conf file >> +with the proper devices. >> + >> +[2]$ ../zus/fs/do-zu/zudo zuf-insmod >> +This loads the zuf.ko module >> + >> +[3]$ ../zus/fs/do-zu/zudo zuf-root >> +This mounts the zuf-root FS above on /sys/fs/zuf (automatically created above) >> + >> +[4]$ ../zus/fs/do-zu/zudo zus-up >> +This runs the zus daemon in the background >> + >> +[5]$ ../zus/fs/do-zu/zudo mount >> +This mount the mkfs FS above on the specified dir in zu.conf >> + >> +To run all the 5 commands above at once do: >> +[]$ ../zus/fs/do-zu/zudo up >> + >> +To undo all the above in reverse order do: >> +[]$ ../zus/fs/do-zu/zudo down >> + >> +And the most magic command is: >> +[]$ ../zus/fs/do-zu/zudo again >> +Will do a "down", then update-mods, then "up" >> +(update-mods is a special script to copy the latest compiled binaries) >> + >> +Now you are ready for some: >> +[]$ ../zus/fs/do-zu/zudo xfstest >> +xfstests is assumed to be installed in the regular /opt/xfstests dir >> + >> +Again please see inside the scripts what each command does >> +these scripts are the ultimate Documentation, do not believe >> +anything I'm saying here. (Because it is outdated by now) >> > > thanks, > Thank you Randy for all the input. I always have such hard time with the Documentation and very low confidence in it. As I was slaving over this I hoped you'd have a bit of time to look it over. Will fix Boaz