[RFC PATCH 00/17] zuf: ZUFS Zero-copy User-mode FileSystem

From: Boaz Harrosh @ 2019-02-19 11:51 UTC
  To: linux-fsdevel, Anna Schumaker, Al Viro
  Cc: Ric Wheeler, Miklos Szeredi, Steven Whitehouse, Jeff Moyer,
	Amir Goldstein, Amit Golander, Sagi Manole

From: Boaz Harrosh <boazh@netapp.com>

I would like to present the ZUFS filesystem. This patchset contains its
Kernel code.

The Kernel code presented here can be found at:
	https://github.com/NetApp/zufs-zuf

And the User-mode Server + example FSs here:
	https://github.com/NetApp/zufs-zus

ZUFS stands for Zero-copy User-mode FS.
* It is geared towards true end-to-end zero copy of both data and metadata.
* It is geared towards very *low latency*, very high CPU locality and
  lock-less parallelism.
* Synchronous operations (for low latency)
* NUMA awareness

Short description:
  ZUFS is a from-scratch implementation of a filesystem-in-user-space which
  tries to address the above goals. From the get-go it is aimed at pmem-based
  FSs, but it can easily support other types of FSs that can utilize x10
  latency and parallelism improvements.
  The novelty of this project is that the interface is designed with a modern
  multi-core NUMA machine in mind, down to the ABI, in order to reach these
  goals.

Please see the first patch for the License of this project.

Current status: There are a couple of trivial open-source filesystem
implementations and a full-blown proprietary implementation from NetApp.

Together with the User-mode Server and the zusFSs User-mode plugins, the
Kernel module submitted here passes NetApp QA, including xfstests and
internal QA tests, and was released to customers as Maxdata 1.2.
So it is very stable.

The git repository above also has a backport for RHEL 7.6, including rpm
packages for the Kernel and Server components.
(Evaluation licenses of Maxdata 1.2 are also available for developers.
 Please contact Amit Golander <Amit.Golander@netapp.com> if you need one.)

As I said, this project is all about performance and low latency, so to get
some points across here are some results from runs I have made
(wr_bw is in KiB/s, wr_lat in microseconds):

[fuse]
threads	wr_iops	wr_bw	wr_lat
1	33606	134424	26.53226
2	57056	228224	30.38476
3	73142	292571	35.75727
4	88667	354668	40.12783
5	102280	409122	42.13261
6	110122	440488	48.29697
7	116561	466245	53.98572
8	129134	516539	55.6134

[fuse-splice]
threads	wr_iops	wr_bw	wr_lat
1	39670	158682	21.8399
2	51100	204400	34.63294
3	62385	249542	39.28847
4	75220	300882	47.42344
5	84522	338088	52.97299
6	93042	372168	57.40804
7	97706	390825	63.04435
8	98034	392137	73.24263

[xfs-dax]
threads	wr_iops	wr_bw	wr_lat   
1	19449	77799	48.03282
2	37704	150819	37.2343
3	55415	221663	30.59375
4	72285	289142	26.08636
5	90348	361392	23.89037
6	103696	414787	22.38045
7	120638	482552	21.38869
8	134157	536630	21.1426

[Maxdata-1.2-zufs]
threads	wr_iops	wr_bw	wr_lat   
1	57506	230026	14.387113
2	98624	394498	16.790232
3	142276	569106	17.344622
4	187984	751936	17.527123
5	190304	761219	19.504314
6	221407	885628	20.862000
7	211579	846316	23.262040
8	246029	984116	24.630604

[*1]
  The Maxdata-1.2-zufs results above were measured with an mm patch applied.
  The patch introduces a VM_LOCAL_CPU flag, which keeps zap_vma_ptes from
  scheduling on all CPUs when a per-cpu VMA is created.
  This patch was not accepted by the Linux Kernel community and is not
  part of this patchset (it is available for review on demand).
  A few weeks from now I will submit some incremental changes to the code
  which will bring the numbers back to the above, and even better for some
  benchmarks, without the mm patch.
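
As a reading aid for the four tables above, here is a small C sketch that
compares the write IOPS columns. The numbers are copied from the 1-thread and
8-thread rows; the struct and helper names are made up for illustration and
are not part of the patchset.

```c
#include <stddef.h>
#include <string.h>

/* Write IOPS at 1 and 8 threads, copied from the tables above. */
struct result {
	const char *name;
	double iops_1t;
	double iops_8t;
};

static const struct result results[] = {
	{ "fuse",             33606, 129134 },
	{ "fuse-splice",      39670,  98034 },
	{ "xfs-dax",          19449, 134157 },
	{ "Maxdata-1.2-zufs", 57506, 246029 },
};

/* Look up a table by its label; returns NULL if there is no such table. */
static const struct result *find_result(const char *name)
{
	size_t n = sizeof(results) / sizeof(results[0]);

	for (size_t i = 0; i < n; i++)
		if (strcmp(results[i].name, name) == 0)
			return &results[i];
	return NULL;
}

/* How much the write IOPS grow when going from 1 to 8 threads. */
static double scaling_1_to_8(const struct result *r)
{
	return r->iops_8t / r->iops_1t;
}

/* Write IOPS of a relative to b, both at 8 threads. */
static double iops_ratio_8t(const struct result *a, const struct result *b)
{
	return a->iops_8t / b->iops_8t;
}
```

Calling iops_ratio_8t() for "Maxdata-1.2-zufs" against "fuse" gives about
1.9, and scaling_1_to_8() for "Maxdata-1.2-zufs" gives about 4.3.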

I used an 8-way KVM-qemu guest with 2 NUMA nodes, running fio with 4k
random writes, O_DIRECT | O_SYNC, to DRAM-simulated pmem (memmap=! at
grub). The Fuse-fs was a null-FS doing the same 4k memcpy.
fio was then run with more and more threads (see the threads column)
to test for scalability.
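
For readers who want to picture the I/O pattern without fio, the following
userspace C sketch issues 4k writes at random block-aligned offsets on a file
opened with O_DIRECT | O_SYNC. It is not the benchmark that produced the
numbers above: the function name, the rand()-based offsets and the fallback
for filesystems that reject O_DIRECT are all assumptions added for
illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* exposes O_DIRECT on Linux/glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0		/* assumption: degrade to buffered I/O elsewhere */
#endif

#define BLOCK_SIZE 4096

/*
 * Write `count` 4k blocks at random block-aligned offsets inside the first
 * `file_size` bytes of `path`. Returns the number of blocks written, or -1
 * if the file or the buffer could not be set up.
 */
static long random_write_4k(const char *path, off_t file_size, long count,
			    unsigned int seed)
{
	off_t nr_blocks = file_size / BLOCK_SIZE;
	void *buf = NULL;
	long done = 0;
	int fd;

	if (nr_blocks <= 0)
		return -1;

	fd = open(path, O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0600);
	if (fd < 0 && errno == EINVAL)
		/* Assumption: retry without O_DIRECT if the FS rejects it. */
		fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
	if (fd < 0)
		return -1;

	/* O_DIRECT requires a suitably aligned buffer. */
	if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE) != 0) {
		close(fd);
		return -1;
	}
	memset(buf, 0xab, BLOCK_SIZE);

	srand(seed);
	for (; done < count; done++) {
		off_t off = (off_t)(rand() % nr_blocks) * BLOCK_SIZE;

		/* Each write returns only after it has reached the medium. */
		if (pwrite(fd, buf, BLOCK_SIZE, off) != BLOCK_SIZE)
			break;
	}

	free(buf);
	close(fd);
	return done;
}
```

Running this function from N threads at once, each timing its own pwrite()
calls, would roughly correspond to one row of the tables above.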

We are still more than x2 slower than I would like
(compared to an in-kernel pmem-based FS).
But I believe I can shave off another 1-2 us by further optimizing
the app-to-server thread switch: developing a new scheduler-object
so as to avoid going through the scheduler (and its locks) altogether
when switching VMs.
(Currently this uses a couple of wait_queue_head_t with wait_event() calls.
 See relay.h in the patches.)
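
The app-to-server hand-off can be pictured with a userspace analogy. This is
not the code in relay.h: it swaps the kernel's wait_queue_head_t and
wait_event() for a pthread mutex and two condition variables, assumes a
single caller per relay, and every name in it is hypothetical. It only shows
the two-wait-point pattern, where the application thread posts a request and
sleeps until the server thread signals completion.

```c
#include <pthread.h>
#include <stdbool.h>

struct relay {
	pthread_mutex_t lock;
	pthread_cond_t server_wq;	/* the server thread sleeps here */
	pthread_cond_t app_wq;		/* the application thread sleeps here */
	bool request_ready;
	bool reply_ready;
	bool stop;
	int arg;			/* request payload */
	int result;			/* reply payload */
};

static void relay_init(struct relay *r)
{
	pthread_mutex_init(&r->lock, NULL);
	pthread_cond_init(&r->server_wq, NULL);
	pthread_cond_init(&r->app_wq, NULL);
	r->request_ready = r->reply_ready = r->stop = false;
	r->arg = r->result = 0;
}

/* Application side: post a request, wake the server, sleep until the reply. */
static int relay_call(struct relay *r, int arg)
{
	int result;

	pthread_mutex_lock(&r->lock);
	r->arg = arg;
	r->request_ready = true;
	pthread_cond_signal(&r->server_wq);
	while (!r->reply_ready)
		pthread_cond_wait(&r->app_wq, &r->lock);
	r->reply_ready = false;
	result = r->result;
	pthread_mutex_unlock(&r->lock);
	return result;
}

/* Ask the server loop to exit. */
static void relay_stop(struct relay *r)
{
	pthread_mutex_lock(&r->lock);
	r->stop = true;
	pthread_cond_signal(&r->server_wq);
	pthread_mutex_unlock(&r->lock);
}

/* Server side: sleep until a request arrives, handle it, wake the caller. */
static void *relay_server(void *data)
{
	struct relay *r = data;

	pthread_mutex_lock(&r->lock);
	for (;;) {
		while (!r->request_ready && !r->stop)
			pthread_cond_wait(&r->server_wq, &r->lock);
		if (r->stop)
			break;
		r->request_ready = false;
		r->result = r->arg * 2;	/* placeholder for the real operation */
		r->reply_ready = true;
		pthread_cond_signal(&r->app_wq);
	}
	pthread_mutex_unlock(&r->lock);
	return NULL;
}
```

In this analogy every relay_call() costs two wake-ups, one in each direction,
and both go through the scheduler. That round trip is the part the
scheduler-object idea is meant to shorten.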

Please review and ask any questions, big or trivial. I would love to
iron out this code and submit it upstream.

Thank you for reading
Boaz

~~~~~~~~~~~~~~~~~~
Boaz Harrosh (17):
  fs: Add the ZUF filesystem to the build + License
  zuf: Preliminary Documentation
  zuf: zuf-rootfs
  zuf: zuf-core The ZTs
  zuf: Multy Devices
  zuf: mounting
  zuf: Namei and directory operations
  zuf: readdir operation
  zuf: symlink
  zuf: More file operation
  zuf: Write/Read implementation
  zuf: mmap & sync
  zuf: ioctl implementation
  zuf: xattr implementation
  zuf: ACL support
  zuf: Special IOCTL fadvise (TODO)
  zuf: Support for dynamic-debug of zusFSs

 Documentation/filesystems/zufs.txt |  351 ++++++++
 fs/Kconfig                         |    1 +
 fs/Makefile                        |    1 +
 fs/zuf/Kconfig                     |   23 +
 fs/zuf/Makefile                    |   23 +
 fs/zuf/_extern.h                   |  166 ++++
 fs/zuf/_pr.h                       |   62 ++
 fs/zuf/acl.c                       |  281 +++++++
 fs/zuf/directory.c                 |  167 ++++
 fs/zuf/file.c                      |  527 ++++++++++++
 fs/zuf/inode.c                     |  648 ++++++++++++++
 fs/zuf/ioctl.c                     |  306 +++++++
 fs/zuf/md.c                        |  761 +++++++++++++++++
 fs/zuf/md.h                        |  318 +++++++
 fs/zuf/md_def.h                    |  145 ++++
 fs/zuf/mmap.c                      |  336 ++++++++
 fs/zuf/module.c                    |   28 +
 fs/zuf/namei.c                     |  435 ++++++++++
 fs/zuf/relay.h                     |   88 ++
 fs/zuf/rw.c                        |  705 ++++++++++++++++
 fs/zuf/super.c                     |  771 +++++++++++++++++
 fs/zuf/symlink.c                   |   74 ++
 fs/zuf/t1.c                        |  138 +++
 fs/zuf/t2.c                        |  375 +++++++++
 fs/zuf/t2.h                        |   68 ++
 fs/zuf/xattr.c                     |  310 +++++++
 fs/zuf/zuf-core.c                  | 1257 ++++++++++++++++++++++++++++
 fs/zuf/zuf-root.c                  |  431 ++++++++++
 fs/zuf/zuf.h                       |  414 +++++++++
 fs/zuf/zus_api.h                   |  869 +++++++++++++++++++
 30 files changed, 10079 insertions(+)
 create mode 100644 Documentation/filesystems/zufs.txt
 create mode 100644 fs/zuf/Kconfig
 create mode 100644 fs/zuf/Makefile
 create mode 100644 fs/zuf/_extern.h
 create mode 100644 fs/zuf/_pr.h
 create mode 100644 fs/zuf/acl.c
 create mode 100644 fs/zuf/directory.c
 create mode 100644 fs/zuf/file.c
 create mode 100644 fs/zuf/inode.c
 create mode 100644 fs/zuf/ioctl.c
 create mode 100644 fs/zuf/md.c
 create mode 100644 fs/zuf/md.h
 create mode 100644 fs/zuf/md_def.h
 create mode 100644 fs/zuf/mmap.c
 create mode 100644 fs/zuf/module.c
 create mode 100644 fs/zuf/namei.c
 create mode 100644 fs/zuf/relay.h
 create mode 100644 fs/zuf/rw.c
 create mode 100644 fs/zuf/super.c
 create mode 100644 fs/zuf/symlink.c
 create mode 100644 fs/zuf/t1.c
 create mode 100644 fs/zuf/t2.c
 create mode 100644 fs/zuf/t2.h
 create mode 100644 fs/zuf/xattr.c
 create mode 100644 fs/zuf/zuf-core.c
 create mode 100644 fs/zuf/zuf-root.c
 create mode 100644 fs/zuf/zuf.h
 create mode 100644 fs/zuf/zus_api.h

-- 
2.20.1
