From: Vivek Goyal <vgoyal@redhat.com>
To: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: vgoyal@redhat.com, miklos@szeredi.hu, stefanha@redhat.com,
    dgilbert@redhat.com, sweil@redhat.com, swhiteho@redhat.com
Subject: [PATCH 00/52] [RFC] virtio-fs: shared file system for virtual machines
Date: Mon, 10 Dec 2018 12:12:26 -0500
Message-Id: <20181210171318.16998-1-vgoyal@redhat.com>

Hi,

Here are RFC patches for virtio-fs. We are looking for feedback on this
approach. These patches should apply on top of 4.20-rc5. We have also
put the code for the various components here:

https://gitlab.com/virtio-fs

Problem Description
===================
We want to be able to take a directory tree on the host and share it
with guest[s]. Our goal is to do this in a fast, consistent and secure
manner. Our primary use case is Kata Containers, but it should be
usable in other scenarios as well.

Containers may rely on local file system semantics for shared volumes:
read-write mounts that multiple containers access simultaneously. File
system changes must be visible to other containers with the same
consistency expected of a local file system, including mmap MAP_SHARED.

Existing Solutions
==================
We looked at existing solutions. virtio-9p already provides basic
shared file system functionality, but it does not offer local file
system semantics, causing some workloads and test suites to fail. In
addition, virtio-9p performance has been an issue for Kata Containers,
and we believe this cannot be alleviated without major changes that do
not fit into the 9P protocol.

Design Overview
===============
With the goal of designing something with better performance and local
file system semantics, a number of ideas were proposed:
- Use the FUSE protocol (instead of 9p) for communication between guest
  and host. The guest kernel acts as the FUSE client and a FUSE server
  runs on the host to serve the requests. Benchmark results (see below)
  are encouraging and show this approach performs well (2x to 8x
  improvement depending on the test being run).

- For data access inside the guest, mmap portions of files into the
  QEMU address space and let the guest access this memory using DAX.
  That way the guest page cache is bypassed and there is only one copy
  of the data (on the host). This also enables mmap(MAP_SHARED) between
  guests.

- For metadata coherency, there is a shared memory region containing
  version numbers associated with metadata. Any guest changing metadata
  updates the version number, and other guests refresh their metadata
  on the next access. This is still experimental and the implementation
  is not complete.

How virtio-fs differs from existing approaches
==============================================
The unique idea behind virtio-fs is to take advantage of the
co-location of the virtual machine and the hypervisor to avoid
communication (vmexits).

DAX allows file contents to be accessed without communication with the
hypervisor. The shared memory region for metadata avoids communication
in the common case where metadata is unchanged. By replacing expensive
communication with cheaper shared memory accesses, we expect to achieve
better performance than approaches based on network file system
protocols. In addition, this also makes it easier to achieve local file
system semantics (coherency).

These techniques are not applicable to network file system protocols,
since they bypass the communications channel by taking advantage of
shared memory on a local machine. This is why we decided to build
virtio-fs rather than focus on 9P or NFS.

HOWTO
=====
We have put instructions on how to use it here:

https://virtio-fs.gitlab.io/

Caching Modes
=============
Like virtio-9p, different caching modes are supported, and they
determine the coherency level as well.
The “cache=FOO” and “writeback” options control the level of coherence
between the guest and host filesystems. The “shared” option only has an
effect on coherence between virtio-fs filesystem instances running
inside different guests.

- cache=none
  Metadata, data and pathname lookup are not cached in the guest. They
  are always fetched from the host and any changes are immediately
  pushed to the host.

- cache=always
  Metadata, data and pathname lookup are cached in the guest and never
  expire.

- cache=auto
  Metadata and pathname lookup cache entries expire after a configured
  amount of time (default is 1 second). Data is cached while the file
  is open (close-to-open consistency).

- writeback/no_writeback
  These options control the writeback strategy. If writeback is
  disabled, then normal writes are immediately synchronized with the
  host fs. If writeback is enabled, then writes may be cached in the
  guest until the file is closed or an fsync(2) is performed. This
  option has no effect on mmap-ed writes or writes going through the
  DAX mechanism.

- shared/no_shared
  These options control the use of the shared version table. If shared
  mode is enabled, then metadata and pathname lookup results are cached
  in the guest, but are refreshed when another virtio-fs instance makes
  changes.

DAX
===
- DAX can be turned on/off when mounting virtio-fs inside the guest.

WHAT WORKS
==========
- As of now, primarily the cache options none, auto and always are
  working. The shared option is still being worked on.

- DAX on/off seems to work. It does not seem to be as fast as we were
  expecting it to be. We still need to look into optimization
  opportunities.

TODO
====
- Complete the "cache=shared" implementation.
- Look into improving performance for DAX. It seems slow.
- Lots of bug fixing, cleanup and performance improvement.

RESULTS
=======
- pjdfstests are passing. We have tried cache=none/auto/always and DAX
  on/off.

  https://github.com/pjd/pjdfstest

  (One symlink test fails, and that seems to be due to xfs on the host.
  Yet to look into it.)
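For context, the caching modes and the DAX switch described above are
all selected at mount time inside the guest. A rough sketch, assuming
the host side has already been set up per the HOWTO and exports a
device tagged "myfs" (the tag, the mount point and the exact option
spellings here are illustrative, not taken from this series):

```shell
# Inside the guest: mount the host's shared directory through virtio-fs.
# "myfs" is the tag assigned on the host (QEMU/virtiofsd) side.
mount -t virtio_fs myfs /mnt/virtio-fs -o cache=none,dax

# Remount with a different coherency level, e.g. close-to-open caching
# without the DAX window:
umount /mnt/virtio-fs
mount -t virtio_fs myfs /mnt/virtio-fs -o cache=auto
```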
- We have run some basic tests comparing with virtio-9p, and virtio-fs
  seems to be faster. I ran the "smallfile" utility and a simple fio
  job to test mmap performance.

Test Setup
----------
- A Fedora 28 host with 32G RAM, 2 sockets (6 cores per socket, 2
  threads per core).
- Using a PCIe SSD on the host as the backing store.
- Created a VM with 16 VCPUS and 6GB memory, with a 2GB cache window
  (for DAX mmap).

fio mmap
--------
Wrote a simple fio job to run mmap READ. Ran the test on 1 file and on
4 files with different caching modes. File size is 4G. Dropped caches
in the guest before each run. The cache on the host was untouched, so
data on the host must have been cached. These results are the average
of 3 runs.

  cache mode              1-file (1 thread)   4-files (4 threads)
  virtio-9p mmap          28 MB/s             140 MB/s
  virtio-fs none + dax    126 MB/s            501 MB/s
  virtio-9p loose         31 MB/s             135 MB/s
  virtio-fs always        235 MB/s            858 MB/s
  virtio-fs always + dax  121 MB/s            487 MB/s

smallfile
---------
https://github.com/distributed-system-analysis/smallfile

I basically ran a bunch of operations like create, ls-l, read, append,
rename and delete-renamed, measured performance over 3 runs and took
the average. Dropped caches before each operation started running.
Effectively used the following command for each operation:
  # python smallfile_cli.py --operation create --threads 8 \
        --file-size 1024 --files 2048 --top

  cache mode              operation   files/sec
  virtio-9p none          create      194
  virtio-fs none          create      714
  virtio-9p mmap          create      201
  virtio-fs none + dax    create      759
  virtio-9p loose         create      16
  virtio-fs always        create      685
  virtio-fs always + dax  create      735

  virtio-9p none          ls-l        2038
  virtio-fs none          ls-l        4615
  virtio-9p mmap          ls-l        2087
  virtio-fs none + dax    ls-l        4616
  virtio-9p loose         ls-l        1619
  virtio-fs always        ls-l        13571
  virtio-fs always + dax  ls-l        12626

  virtio-9p none          read        199
  virtio-fs none          read        1405
  virtio-9p mmap          read        203
  virtio-fs none + dax    read        1345
  virtio-9p loose         read        207
  virtio-fs always        read        1436
  virtio-fs always + dax  read        1368

  virtio-9p none          append      197
  virtio-fs none          append      717
  virtio-9p mmap          append      200
  virtio-fs none + dax    append      645
  virtio-9p loose         append      16
  virtio-fs always        append      651
  virtio-fs always + dax  append      704

  virtio-9p none          rename      2442
  virtio-fs none          rename      5797
  virtio-9p mmap          rename      2518
  virtio-fs none + dax    rename      6386
  virtio-9p loose         rename      4178
  virtio-fs always        rename      15834
  virtio-fs always + dax  rename      15529

Thanks
Vivek

Dr. David Alan Gilbert (5):
  virtio-fs: Add VIRTIO_PCI_CAP_SHARED_MEMORY_CFG and utility to find
    them
  virito-fs: Make dax optional
  virtio: Free fuse devices on umount
  virtio-fs: Retrieve shm capabilities for version table
  virtio-fs: Map using the values from the capabilities

Miklos Szeredi (8):
  fuse: simplify fuse_fill_super_common() calling
  fuse: delete dentry if timeout is zero
  fuse: multiplex cached/direct_io/dax file operations
  virtio-fs: pass version table pointer to fuse
  fuse: don't crash if version table is NULL
  fuse: add shared version support (virtio-fs only)
  fuse: shared version cleanups
  fuse: fix fuse_permission() for the default_permissions case

Stefan Hajnoczi (17):
  fuse: add skeleton virtio_fs.ko module
  fuse: add probe/remove virtio driver
  fuse: rely on mutex_unlock() barrier instead of fput()
  fuse: extract fuse_fill_super_common()
  virtio_fs: get mount working
  fuse: export fuse_end_request()
  fuse: export fuse_len_args()
  fuse: add fuse_iqueue_ops callbacks
  fuse: process requests queues
  fuse: export fuse_get_unique()
  fuse: implement FUSE_FORGET for virtio-fs
  virtio_fs: Set up dax_device
  dax: remove block device dependencies
  fuse: add fuse_conn->dax_dev field
  fuse: map virtio_fs DAX window BAR
  fuse: Implement basic DAX read/write support commands
  fuse: add DAX mmap support

Vivek Goyal (22):
  virtio-fs: Retrieve shm capabilities for cache
  virtio-fs: Map cache using the values from the capabilities
  Limit number of pages returned by direct_access()
  fuse: Introduce fuse_dax_mapping
  Create a list of free memory ranges
  fuse: Introduce setupmapping/removemapping commands
  Introduce interval tree basic data structures
  fuse: Maintain a list of busy elements
  Do fallocate() to grow file before mapping for file growing writes
  dax: Pass dax_dev to dax_writeback_mapping_range()
  fuse: Define dax address space operations
  fuse, dax: Take ->i_mmap_sem lock during dax page fault
  fuse: Add logic to free up a memory range
  fuse: Add logic to do direct reclaim of memory
  fuse: Kick worker when free memory drops below 20% of total ranges
  Dispatch FORGET requests later instead of dropping them
  Release file in process context
  fuse: Do not block on inode lock while freeing memory range
  fuse: Reschedule dax free work if too many EAGAIN attempts
  fuse: Wait for memory ranges to become free
  fuse: Take inode lock for dax inode truncation
  fuse: Clear setuid bit even in direct I/O path

 drivers/dax/super.c             |    3 +-
 fs/dax.c                        |   23 +-
 fs/ext4/inode.c                 |    2 +-
 fs/fuse/Kconfig                 |   11 +
 fs/fuse/Makefile                |    1 +
 fs/fuse/cuse.c                  |    3 +-
 fs/fuse/dev.c                   |   80 ++-
 fs/fuse/dir.c                   |  282 +++++++-
 fs/fuse/file.c                  | 1012 +++++++++++++++++++++++++++--
 fs/fuse/fuse_i.h                |  234 ++++++-
 fs/fuse/inode.c                 |  278 ++++++--
 fs/fuse/readdir.c               |   12 +-
 fs/fuse/virtio_fs.c             | 1336 +++++++++++++++++++++++++++++++++++++++
 fs/splice.c                     |    3 +-
 fs/xfs/xfs_aops.c               |    2 +-
 include/linux/dax.h             |    6 +-
 include/linux/fs.h              |    2 +
 include/uapi/linux/fuse.h       |   39 ++
 include/uapi/linux/virtio_fs.h  |   46 ++
 include/uapi/linux/virtio_ids.h |    1 +
 include/uapi/linux/virtio_pci.h |   10 +
 21 files changed, 3151 insertions(+), 235 deletions(-)
 create mode 100644 fs/fuse/virtio_fs.c
 create mode 100644 include/uapi/linux/virtio_fs.h

-- 
2.13.6