linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/52] [RFC] virtio-fs: shared file system for virtual machines
@ 2018-12-10 17:12 Vivek Goyal
  2018-12-10 17:12 ` [PATCH 01/52] fuse: add skeleton virtio_fs.ko module Vivek Goyal
                   ` (54 more replies)
  0 siblings, 55 replies; 98+ messages in thread
From: Vivek Goyal @ 2018-12-10 17:12 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, kvm
  Cc: vgoyal, miklos, stefanha, dgilbert, sweil, swhiteho

Hi,

Here are RFC patches for virtio-fs. Looking for feedback on this approach.

These patches should apply on top of 4.20-rc5. We have also put code for
various components here.

https://gitlab.com/virtio-fs

Problem Description
===================
We want to be able to take a directory tree on the host and share it with
guest[s]. Our goal is to be able to do it in a fast, consistent and secure
manner. Our primary use case is kata containers, but it should be usable in
other scenarios as well.

Containers may rely on local file system semantics for shared volumes,
read-write mounts that multiple containers access simultaneously.  File
system changes must be visible to other containers with the same consistency
expected of a local file system, including mmap MAP_SHARED.

Existing Solutions
==================
We looked at existing solutions and virtio-9p already provides basic shared
file system functionality although does not offer local file system semantics,
causing some workloads and test suites to fail. In addition, virtio-9p
performance has been an issue for Kata Containers and we believe this cannot
be alleviated without major changes that do not fit into the 9P protocol.

Design Overview
===============
With the goal of designing something with better performance and local file
system semantics, a bunch of ideas were proposed.

- Use fuse protocol (instead of 9p) for communication between guest
  and host. Guest kernel will be fuse client and a fuse server will
  run on host to serve the requests. Benchmark results (see below) are
  encouraging and show this approach performs well (2x to 8x improvement
  depending on test being run).

- For data access inside guest, mmap portion of file in QEMU address
  space and guest accesses this memory using dax. That way guest page
  cache is bypassed and there is only one copy of data (on host). This
  will also enable mmap(MAP_SHARED) between guests.

- For metadata coherency, there is a shared memory region which contains
  version number associated with metadata and any guest changing metadata
  updates version number and other guests refresh metadata on next
  access. This is still experimental and implementation is not complete.

How virtio-fs differs from existing approaches
==============================================
The unique idea behind virtio-fs is to take advantage of the co-location
of the virtual machine and hypervisor to avoid communication (vmexits).

DAX allows file contents to be accessed without communication with the
hypervisor. The shared memory region for metadata avoids communication in
the common case where metadata is unchanged.

By replacing expensive communication with cheaper shared memory accesses,
we expect to achieve better performance than approaches based on network
file system protocols. In addition, this also makes it easier to achieve
local file system semantics (coherency).

These techniques are not applicable to network file system protocols since
the communications channel is bypassed by taking advantage of shared memory
on a local machine. This is why we decided to build virtio-fs rather than
focus on 9P or NFS.

HOWTO
======
We have put instructions on how to use it here.

https://virtio-fs.gitlab.io/

Caching Modes
=============
Like virtio-9p, different caching modes are supported which determine the
coherency level as well. The “cache=FOO” and “writeback” options control the
level of coherence between the guest and host filesystems. The “shared” option
only has an effect on coherence between virtio-fs filesystem instances
running inside different guests.

- cache=none
  metadata, data and pathname lookup are not cached in guest. They are always
  fetched from host and any changes are immediately pushed to host.

- cache=always
  metadata, data and pathname lookup are cached in guest and never expire.

- cache=auto
  metadata and pathname lookup cache expires after a configured amount of time
  (default is 1 second). Data is cached while the file is open (close to open
  consistency).

- writeback/no_writeback
  These options control the writeback strategy.  If writeback is disabled,
  then normal writes will immediately be synchronized with the host fs. If
  writeback is enabled, then writes may be cached in the guest until the file
  is closed or an fsync(2) performed. This option has no effect on mmap-ed
  writes or writes going through the DAX mechanism.

- shared/no_shared
  These options control the  use of the shared version table. If shared mode
  is enabled then metadata and pathname lookup is cached in guest, but is
  refreshed due to changes in another virtio-fs instance.

DAX 
===
- dax can be turned on/off when mounting virtio-fs inside guest.

WHAT WORKS
==========
- As of now primarily cache options none, auto and always are working.
  shared option is still being worked on.

- Dax on/off seems to work. It does not seem to be as fast as we were
  expecting it to be. Still need to look into optimization opportunities.

TODO
====
- Complete "cache=shared" implementation.
- Look into improving performance for dax. It seems slow.
- Lot of bug fixing, cleanup and performance improvement. 

RESULTS
=======
- pjdfstests are passing. Have tried cache=none/auto/always and dax on/off).
  
  https://github.com/pjd/pjdfstest

  (one symlink test fails and that seems to be due xfs on host. Yet to
   look into it).

- We have run some basic tests and compared with virtio-9p and it seems
  to be faster. I ran "smallfile" utility and a simple fio job to test
  mmap performance.

Test Setup
-----------
- A fedora 28 host with 32G RAM, 2 sockets (6 cores per socket, 2
  threads per core)

- Using a PCIE SSD at host as backing store.

- Created a VM with 16 VCPUS and 6GB memory. A 2GB cache window (for dax
  mmap).
  
fio mmap
--------
Wrote simple fio job to run mmap and READ. Ran test on 1 file and 4
files and different caching modes. File size is 4G. Dropped cache in
guest before each run. Cache on host was untouched. So data on host must
have been cached. These results are average of 3 runs.

		cache mode 	1-file(one thread) 	4-files(4 threads)

virtio-9p	mmap		28 MB/s			140 MB/s
virtio-fs	none + dax	126 MB/s		501 MB/s


virtio-9p	loose	 	31 MB/s			135 MB/s
virtio-fs	always		235 MB/s		858 MB/s
virtio-fs	always + dax	121 MB/s		487 MB/s


smallfile
---------
https://github.com/distributed-system-analysis/smallfile

I basically ran bunch of operations like create, ls-l, read, append,
rename and delete-renamed and measured performance over 3 runs and
took average. Dropped cache after before each operation started
running. Used effectively following command for each operation.

# python smallfile_cli.py --operation create --threads 8 --file-size 1024 --files 2048 --top <test-dir>


		cache mode 	operation	(files/sec) 

virtio-9p	none		create		194
virtio-fs	none		create		714

virtio-9p	mmap		create		201
virtio-fs	none + dax	create		759

virtio-9p	loose		create		16
virtio-fs	always		create          685
virtio-fs	always + dax	create		735

virtio-9p	none		ls-l		2038
virtio-fs	none		ls-l		4615

virtio-9p	mmap		ls-l		2087	
virtio-fs	none + dax	ls-l		4616

virtio-9p	loose		ls-l		1619
virtio-fs	always		ls-l		13571
virtio-fs	always + dax	ls-l		12626

virtio-9p	none		read		199
virtio-fs	none		read		1405

virtio-9p	mmap		read		203	
virtio-fs	none + dax	read		1345

virtio-9p	loose		read		207
virtio-fs	always		read		1436
virtio-fs	always + dax	read		1368

virtio-9p	none		append		197
virtio-fs	none		append		717

virtio-9p	mmap		append		200	
virtio-fs	none + dax	append		645

virtio-9p	loose		append		16	
virtio-fs	always		append		651	
virtio-fs	always + dax	append		704	

virtio-9p	none		rename		2442
virtio-fs	none		rename		5797

virtio-9p	mmap		rename		2518	
virtio-fs	none + dax	rename		6386

virtio-9p	loose		rename		4178
virtio-fs	always		rename		15834
virtio-fs	always + dax	rename		15529

Thanks
Vivek

Dr. David Alan Gilbert (5):
  virtio-fs: Add VIRTIO_PCI_CAP_SHARED_MEMORY_CFG and utility to find
    them
  virito-fs: Make dax optional
  virtio: Free fuse devices on umount
  virtio-fs: Retrieve shm capabilities for version table
  virtio-fs: Map using the values from the capabilities

Miklos Szeredi (8):
  fuse: simplify fuse_fill_super_common() calling
  fuse: delete dentry if timeout is zero
  fuse: multiplex cached/direct_io/dax file operations
  virtio-fs: pass version table pointer to fuse
  fuse: don't crash if version table is NULL
  fuse: add shared version support (virtio-fs only)
  fuse: shared version cleanups
  fuse: fix fuse_permission() for the default_permissions case

Stefan Hajnoczi (17):
  fuse: add skeleton virtio_fs.ko module
  fuse: add probe/remove virtio driver
  fuse: rely on mutex_unlock() barrier instead of fput()
  fuse: extract fuse_fill_super_common()
  virtio_fs: get mount working
  fuse: export fuse_end_request()
  fuse: export fuse_len_args()
  fuse: add fuse_iqueue_ops callbacks
  fuse: process requests queues
  fuse: export fuse_get_unique()
  fuse: implement FUSE_FORGET for virtio-fs
  virtio_fs: Set up dax_device
  dax: remove block device dependencies
  fuse: add fuse_conn->dax_dev field
  fuse: map virtio_fs DAX window BAR
  fuse: Implement basic DAX read/write support commands
  fuse: add DAX mmap support

Vivek Goyal (22):
  virtio-fs: Retrieve shm capabilities for cache
  virtio-fs: Map cache using the values from the capabilities
  Limit number of pages returned by direct_access()
  fuse: Introduce fuse_dax_mapping
  Create a list of free memory ranges
  fuse: Introduce setupmapping/removemapping commands
  Introduce interval tree basic data structures
  fuse: Maintain a list of busy elements
  Do fallocate() to grow file before mapping for file growing writes
  dax: Pass dax_dev to dax_writeback_mapping_range()
  fuse: Define dax address space operations
  fuse, dax: Take ->i_mmap_sem lock during dax page fault
  fuse: Add logic to free up a memory range
  fuse: Add logic to do direct reclaim of memory
  fuse: Kick worker when free memory drops below 20% of total ranges
  Dispatch FORGET requests later instead of dropping them
  Release file in process context
  fuse: Do not block on inode lock while freeing memory range
  fuse: Reschedule dax free work if too many EAGAIN attempts
  fuse: Wait for memory ranges to become free
  fuse: Take inode lock for dax inode truncation
  fuse: Clear setuid bit even in direct I/O path

 drivers/dax/super.c             |    3 +-
 fs/dax.c                        |   23 +-
 fs/ext4/inode.c                 |    2 +-
 fs/fuse/Kconfig                 |   11 +
 fs/fuse/Makefile                |    1 +
 fs/fuse/cuse.c                  |    3 +-
 fs/fuse/dev.c                   |   80 ++-
 fs/fuse/dir.c                   |  282 +++++++--
 fs/fuse/file.c                  | 1012 +++++++++++++++++++++++++++--
 fs/fuse/fuse_i.h                |  234 ++++++-
 fs/fuse/inode.c                 |  278 ++++++--
 fs/fuse/readdir.c               |   12 +-
 fs/fuse/virtio_fs.c             | 1336 +++++++++++++++++++++++++++++++++++++++
 fs/splice.c                     |    3 +-
 fs/xfs/xfs_aops.c               |    2 +-
 include/linux/dax.h             |    6 +-
 include/linux/fs.h              |    2 +
 include/uapi/linux/fuse.h       |   39 ++
 include/uapi/linux/virtio_fs.h  |   46 ++
 include/uapi/linux/virtio_ids.h |    1 +
 include/uapi/linux/virtio_pci.h |   10 +
 21 files changed, 3151 insertions(+), 235 deletions(-)
 create mode 100644 fs/fuse/virtio_fs.c
 create mode 100644 include/uapi/linux/virtio_fs.h

-- 
2.13.6


^ permalink raw reply	[flat|nested] 98+ messages in thread

end of thread, other threads:[~2019-03-20 10:42 UTC | newest]

Thread overview: 98+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-10 17:12 [PATCH 00/52] [RFC] virtio-fs: shared file system for virtual machines Vivek Goyal
2018-12-10 17:12 ` [PATCH 01/52] fuse: add skeleton virtio_fs.ko module Vivek Goyal
2018-12-10 17:12 ` [PATCH 02/52] fuse: add probe/remove virtio driver Vivek Goyal
2018-12-10 17:12 ` [PATCH 03/52] fuse: rely on mutex_unlock() barrier instead of fput() Vivek Goyal
2018-12-10 17:12 ` [PATCH 04/52] fuse: extract fuse_fill_super_common() Vivek Goyal
2018-12-10 17:12 ` [PATCH 05/52] virtio_fs: get mount working Vivek Goyal
2018-12-10 17:12 ` [PATCH 06/52] fuse: export fuse_end_request() Vivek Goyal
2018-12-10 17:12 ` [PATCH 07/52] fuse: export fuse_len_args() Vivek Goyal
2018-12-10 17:12 ` [PATCH 08/52] fuse: add fuse_iqueue_ops callbacks Vivek Goyal
2018-12-10 17:12 ` [PATCH 09/52] fuse: process requests queues Vivek Goyal
2018-12-10 17:12 ` [PATCH 10/52] fuse: export fuse_get_unique() Vivek Goyal
2018-12-10 17:12 ` [PATCH 11/52] fuse: implement FUSE_FORGET for virtio-fs Vivek Goyal
2018-12-10 17:12 ` [PATCH 12/52] virtio_fs: Set up dax_device Vivek Goyal
2018-12-10 17:12 ` [PATCH 13/52] dax: remove block device dependencies Vivek Goyal
2018-12-10 17:12 ` [PATCH 14/52] fuse: add fuse_conn->dax_dev field Vivek Goyal
2018-12-10 17:12 ` [PATCH 15/52] fuse: map virtio_fs DAX window BAR Vivek Goyal
2018-12-12 16:37   ` Christian Borntraeger
2018-12-13 11:55     ` Stefan Hajnoczi
2018-12-13 16:06   ` kbuild test robot
2018-12-13 19:55   ` Dan Williams
2018-12-13 20:09     ` Dr. David Alan Gilbert
2018-12-13 20:15       ` Dan Williams
2018-12-13 20:40         ` Vivek Goyal
2018-12-13 21:18           ` Vivek Goyal
2018-12-14 10:09             ` Dr. David Alan Gilbert
2018-12-10 17:12 ` [PATCH 16/52] virtio-fs: Add VIRTIO_PCI_CAP_SHARED_MEMORY_CFG and utility to find them Vivek Goyal
2018-12-12 16:36   ` [PATCH] virtio-fs: fix semicolon.cocci warnings kbuild test robot
2018-12-12 16:36   ` [PATCH 16/52] virtio-fs: Add VIRTIO_PCI_CAP_SHARED_MEMORY_CFG and utility to find them kbuild test robot
2018-12-10 17:12 ` [PATCH 17/52] virtio-fs: Retrieve shm capabilities for cache Vivek Goyal
2018-12-10 17:12 ` [PATCH 18/52] virtio-fs: Map cache using the values from the capabilities Vivek Goyal
2018-12-13  9:10   ` David Hildenbrand
2018-12-13  9:13     ` Dr. David Alan Gilbert
2018-12-13  9:34       ` David Hildenbrand
2018-12-13 10:00         ` Dr. David Alan Gilbert
2018-12-13 11:26           ` David Hildenbrand
2018-12-13 12:15             ` Dr. David Alan Gilbert
2018-12-13 12:24               ` David Hildenbrand
2018-12-13 12:38                 ` Cornelia Huck
2018-12-14 13:44                   ` Stefan Hajnoczi
2018-12-14 13:50                     ` Cornelia Huck
2018-12-14 14:06                       ` Dr. David Alan Gilbert
2018-12-17 11:25                       ` Stefan Hajnoczi
2018-12-17 10:53                     ` David Hildenbrand
2018-12-17 14:56                       ` Stefan Hajnoczi
2018-12-18 17:13                         ` Cornelia Huck
2018-12-18 17:25                           ` David Hildenbrand
2019-01-02 10:24                             ` Stefan Hajnoczi
2019-03-17  0:33   ` Liu Bo
2019-03-20 10:42     ` Dr. David Alan Gilbert
2019-03-17  0:35   ` [PATCH] virtio-fs: fix multiple tag support Liu Bo
2019-03-19 20:26     ` Vivek Goyal
2019-03-20  2:04       ` Liu Bo
2018-12-10 17:12 ` [PATCH 19/52] virito-fs: Make dax optional Vivek Goyal
2018-12-10 17:12 ` [PATCH 20/52] Limit number of pages returned by direct_access() Vivek Goyal
2018-12-10 17:12 ` [PATCH 21/52] fuse: Introduce fuse_dax_mapping Vivek Goyal
2018-12-10 17:12 ` [PATCH 22/52] Create a list of free memory ranges Vivek Goyal
2018-12-11 17:44   ` kbuild test robot
2018-12-15 19:22   ` kbuild test robot
2018-12-10 17:12 ` [PATCH 23/52] fuse: simplify fuse_fill_super_common() calling Vivek Goyal
2018-12-10 17:12 ` [PATCH 24/52] fuse: Introduce setupmapping/removemapping commands Vivek Goyal
2018-12-10 17:12 ` [PATCH 25/52] Introduce interval tree basic data structures Vivek Goyal
2018-12-10 17:12 ` [PATCH 26/52] fuse: Implement basic DAX read/write support commands Vivek Goyal
2018-12-10 17:12 ` [PATCH 27/52] fuse: Maintain a list of busy elements Vivek Goyal
2018-12-10 17:12 ` [PATCH 28/52] Do fallocate() to grow file before mapping for file growing writes Vivek Goyal
2018-12-11  6:13   ` kbuild test robot
2018-12-11  6:20   ` kbuild test robot
2018-12-10 17:12 ` [PATCH 29/52] fuse: add DAX mmap support Vivek Goyal
2018-12-10 17:12 ` [PATCH 30/52] fuse: delete dentry if timeout is zero Vivek Goyal
2018-12-10 17:12 ` [PATCH 31/52] dax: Pass dax_dev to dax_writeback_mapping_range() Vivek Goyal
2018-12-11  6:12   ` kbuild test robot
2018-12-11 17:38   ` kbuild test robot
2018-12-10 17:12 ` [PATCH 32/52] fuse: Define dax address space operations Vivek Goyal
2018-12-10 17:12 ` [PATCH 33/52] fuse, dax: Take ->i_mmap_sem lock during dax page fault Vivek Goyal
2018-12-10 17:13 ` [PATCH 34/52] fuse: Add logic to free up a memory range Vivek Goyal
2018-12-10 17:13 ` [PATCH 35/52] fuse: Add logic to do direct reclaim of memory Vivek Goyal
2018-12-10 17:13 ` [PATCH 36/52] fuse: Kick worker when free memory drops below 20% of total ranges Vivek Goyal
2018-12-10 17:13 ` [PATCH 37/52] fuse: multiplex cached/direct_io/dax file operations Vivek Goyal
2018-12-10 17:13 ` [PATCH 38/52] Dispatch FORGET requests later instead of dropping them Vivek Goyal
2018-12-10 17:13 ` [PATCH 39/52] Release file in process context Vivek Goyal
2018-12-10 17:13 ` [PATCH 40/52] fuse: Do not block on inode lock while freeing memory range Vivek Goyal
2018-12-10 17:13 ` [PATCH 41/52] fuse: Reschedule dax free work if too many EAGAIN attempts Vivek Goyal
2018-12-10 17:13 ` [PATCH 42/52] fuse: Wait for memory ranges to become free Vivek Goyal
2018-12-10 17:13 ` [PATCH 43/52] fuse: Take inode lock for dax inode truncation Vivek Goyal
2018-12-10 17:13 ` [PATCH 44/52] fuse: Clear setuid bit even in direct I/O path Vivek Goyal
2018-12-10 17:13 ` [PATCH 45/52] virtio: Free fuse devices on umount Vivek Goyal
2018-12-10 17:13 ` [PATCH 46/52] virtio-fs: Retrieve shm capabilities for version table Vivek Goyal
2018-12-10 17:13 ` [PATCH 47/52] virtio-fs: Map using the values from the capabilities Vivek Goyal
2018-12-10 17:13 ` [PATCH 48/52] virtio-fs: pass version table pointer to fuse Vivek Goyal
2018-12-10 17:13 ` [PATCH 49/52] fuse: don't crash if version table is NULL Vivek Goyal
2018-12-10 17:13 ` [PATCH 50/52] fuse: add shared version support (virtio-fs only) Vivek Goyal
2018-12-10 17:13 ` [PATCH 51/52] fuse: shared version cleanups Vivek Goyal
2018-12-10 17:13 ` [PATCH 52/52] fuse: fix fuse_permission() for the default_permissions case Vivek Goyal
2018-12-19 21:25   ` kbuild test robot
2018-12-11 12:54 ` [PATCH 00/52] [RFC] virtio-fs: shared file system for virtual machines Stefan Hajnoczi
2018-12-12 20:30 ` Konrad Rzeszutek Wilk
2018-12-12 21:22   ` Vivek Goyal
2019-02-12 15:56 ` Aneesh Kumar K.V
2019-02-12 18:57   ` Vivek Goyal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).