linux-kernel.vger.kernel.org archive mirror
From: Maxim Levitsky <mlevitsk@redhat.com>
To: linux-nvme@lists.infradead.org
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	Jens Axboe <axboe@fb.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Keith Busch <keith.busch@intel.com>,
	Christoph Hellwig <hch@lst.de>, Sagi Grimberg <sagi@grimberg.me>,
	Kirti Wankhede <kwankhede@nvidia.com>,
	"David S . Miller" <davem@davemloft.net>,
	Mauro Carvalho Chehab <mchehab+samsung@kernel.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Wolfram Sang <wsa@the-dreams.de>,
	Nicolas Ferre <nicolas.ferre@microchip.com>,
	"Paul E . McKenney" <paulmck@linux.ibm.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Liang Cunming <cunming.liang@intel.com>,
	Liu Changpeng <changpeng.liu@intel.com>,
	Fam Zheng <fam@euphon.net>, Amnon Ilan <ailan@redhat.com>,
	John Ferlan <jferlan@redhat.com>
Subject: [PATCH 0/9] RFC: NVME VFIO mediated device
Date: Tue, 19 Mar 2019 16:58:16 +0200	[thread overview]
Message-ID: <d41484848b1832192c6978c7054bec5c326afa6d.camel@redhat.com> (raw)
In-Reply-To: <20190319144116.400-1-mlevitsk@redhat.com>

Oops, I placed the subject in the wrong place.

Best regards,
	Maxim Levitsky

On Tue, 2019-03-19 at 16:41 +0200, Maxim Levitsky wrote:
> Date: Tue, 19 Mar 2019 14:45:45 +0200
> Subject: [PATCH 0/9] RFC: NVME VFIO mediated device
> 
> Hi everyone!
> 
> In this patch series I would like to introduce my take on the problem of
> virtualizing storage as fast as possible, with an emphasis on low latency.
> 
> The series implements a kernel VFIO-based mediated device that allows the
> user to pass through a partition and/or a whole namespace to a guest.
> 
> The idea behind this driver is similar to the one described in the paper at
> https://www.usenix.org/conference/atc18/presentation/peng,
> although note that I started the development independently, prior to
> reading this paper.
> 
> In addition, the implementation is not based on the code used in the paper,
> as I was not able to obtain that source code at the time.
> 
> ***Key points about the implementation:***
> 
> * A polling kernel thread is used. The polling is stopped after a
>   predefined idle timeout (1/2 second by default).
>   A fully interrupt-driven mode is planned as well, and a prototype of it
>   shows promising results.
> 
> * The guest sees a standard NVMe device - this allows running guests with
>   unmodified drivers, for example Windows guests.
> 
> * The NVMe device is shared between host and guest.
>   Even a single namespace can be split between host and guest by
>   assigning different partitions to each.
> 
> * Simple configuration
> 
> *** Performance ***
> 
> Performance was tested on an Intel DC P3700 with a Xeon E5-2620 v2;
> both latency and throughput are very similar to SPDK.
> 
> Soon I will test this on a better server and NVMe device and provide
> more formal performance numbers.
> 
> Latency numbers:
> ~80ms - spdk with fio plugin on the host.
> ~84ms - nvme driver on the host
> ~87ms - mdev-nvme + nvme driver in the guest
> 
> Throughput was following similar pattern as well.
> 
> * Configuration example
>   $ modprobe nvme mdev_queues=4
>   $ modprobe nvme-mdev
> 
>   $ UUID=$(uuidgen)
>   $ DEVICE='device pci address'
>   $ echo $UUID > /sys/bus/pci/devices/$DEVICE/mdev_supported_types/nvme-2Q_V1/create
>   $ echo n1p3 > /sys/bus/mdev/devices/$UUID/namespaces/add_namespace # attach host namespace 1, partition 3
>   $ echo 11 > /sys/bus/mdev/devices/$UUID/settings/iothread_cpu # pin the IO thread to CPU 11
> 
>   Afterwards, boot QEMU with:
>   -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID
> 
>   Zero configuration is needed in the guest.
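For convenience, the steps above can be collected into one script. This is a sketch only: the PCI address is a placeholder you must replace, and by default the script just prints the sysfs writes (set DRY_RUN=0 and run as root on a host with nvme-mdev loaded to actually execute them):

```shell
#!/bin/sh
# Sketch of the configuration steps above. DEVICE is a placeholder
# PCI address; the sysfs paths are those from the example.
DRY_RUN=${DRY_RUN:-1}

run() {
	if [ "$DRY_RUN" = 1 ]; then
		echo "WOULD RUN: $*"
	else
		sh -c "$*" || exit 1
	fi
}

UUID=$(uuidgen 2>/dev/null || echo 11111111-2222-3333-4444-555555555555)
DEVICE=${DEVICE:-0000:01:00.0}   # placeholder: your NVMe device's PCI address

run "echo $UUID > /sys/bus/pci/devices/$DEVICE/mdev_supported_types/nvme-2Q_V1/create"
run "echo n1p3 > /sys/bus/mdev/devices/$UUID/namespaces/add_namespace"
run "echo 11 > /sys/bus/mdev/devices/$UUID/settings/iothread_cpu"
echo "boot with: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID"
```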
>   
> *** FAQ ***
> 
> * Why do this in the kernel? Why is this better than SPDK?
> 
>   -> Reuses the existing nvme kernel driver on the host. No new drivers in
>      the guest.
>   
>   -> Shares the NVMe device between host and guest.
>      Even in fully virtualized configurations, some partitions of an NVMe
>      device can be exposed to guests as emulated block devices while others
>      are passed through with nvme-mdev, balancing the features of full IO
>      stack emulation against performance.
>   
>   -> NVME-MDEV is a bit faster because an in-kernel driver can send
>      interrupts to the guest directly, without a context switch that can
>      be expensive due to the Meltdown mitigations.
> 
>   -> Can utilize interrupts to get reasonable performance. This is
>      implemented only as a proof of concept and not included in these
>      patches, but the interrupt-driven mode already shows reasonable
>      performance.
>      
>   -> This is a framework that can later be used to support NVMe devices
>      with more of the IO virtualization built in
>      (an IOMMU with PASID support coupled with a device that supports it).
> 
> * Why attach directly to the nvme-pci driver and not use block layer IO?
>   -> Direct attachment allows for better performance, but I will
>      investigate the possibility of using block IO, especially for the
>      fabrics drivers.
>   
> *** Implementation notes ***
> 
> *  All guest memory is mapped into the physical NVMe device, though not
>    1:1 as vfio-pci would do it. This allows very efficient DMA.
>    To support this, patch 2 adds the ability for a mdev device to listen
>    for the guest's memory map events.
>    Any such memory is immediately pinned and then DMA mapped.
>    (Support for fabric drivers, where this is not possible, exists too;
>    in that case the fabric driver does its own DMA mapping.)
> 
> *  The nvme core driver is modified to announce the appearance and
>    disappearance of NVMe controllers and namespaces; the nvme-mdev
>    driver subscribes to these events.
>  
> *  The nvme-pci driver is modified to expose a raw interface for attaching
>    to, submitting to, and polling the IO queues.
>    This allows the mdev driver to submit and poll IO very efficiently.
>    By default, one host queue is used per mediated device.
>    (Support for other fabric-based host drivers is planned.)
> 
> * nvme-mdev doesn't assume the presence of KVM, so any VFIO user - including
>   SPDK, a QEMU running with TCG, ... - can use this virtual device.
> 
> *** Testing ***
> 
> The device was tested with stock QEMU 3.0 on the host; the host ran a 5.0
> kernel with nvme-mdev added, on the following hardware:
>  * QEMU nvme virtual device (with nested guest)
>  * Intel DC P3700 on Xeon E5-2620 v2 server
>  * Samsung SM981 (in a Thunderbolt enclosure, with my laptop)
>  * Lenovo NVME device found in my laptop
> 
> The guest was tested with kernels 4.16, 4.18, 4.20 and the same
> custom-compiled 5.0 kernel.
> A Windows 10 guest was tested too, with both Microsoft's inbox driver and
> the open source community NVMe driver
> (https://lists.openfabrics.org/pipermail/nvmewin/2016-December/001420.html)
> 
> Testing was mostly done on x86_64, but 32 bit host/guest combination
> was lightly tested too.
> 
> In addition, the virtual device was tested with a nested guest, by passing
> the virtual device through to it using PCI passthrough, the QEMU userspace
> NVMe driver, and SPDK.
> 
> 
> PS: I used to contribute to the kernel as a hobby using the
>     maximlevitsky@gmail.com address
> 
> Maxim Levitsky (9):
>   vfio/mdev: add .request callback
>   nvme/core: add some more values from the spec
>   nvme/core: add NVME_CTRL_SUSPENDED controller state
>   nvme/pci: use the NVME_CTRL_SUSPENDED state
>   nvme/pci: add known admin effects to augument admin effects log page
>   nvme/pci: init shadow doorbell after each reset
>   nvme/core: add mdev interfaces
>   nvme/core: add nvme-mdev core driver
>   nvme/pci: implement the mdev external queue allocation interface
> 
>  MAINTAINERS                   |   5 +
>  drivers/nvme/Kconfig          |   1 +
>  drivers/nvme/Makefile         |   1 +
>  drivers/nvme/host/core.c      | 149 +++++-
>  drivers/nvme/host/nvme.h      |  55 ++-
>  drivers/nvme/host/pci.c       | 385 ++++++++++++++-
>  drivers/nvme/mdev/Kconfig     |  16 +
>  drivers/nvme/mdev/Makefile    |   5 +
>  drivers/nvme/mdev/adm.c       | 873 ++++++++++++++++++++++++++++++++++
>  drivers/nvme/mdev/events.c    | 142 ++++++
>  drivers/nvme/mdev/host.c      | 491 +++++++++++++++++++
>  drivers/nvme/mdev/instance.c  | 802 +++++++++++++++++++++++++++++++
>  drivers/nvme/mdev/io.c        | 563 ++++++++++++++++++++++
>  drivers/nvme/mdev/irq.c       | 264 ++++++++++
>  drivers/nvme/mdev/mdev.h      |  56 +++
>  drivers/nvme/mdev/mmio.c      | 591 +++++++++++++++++++++++
>  drivers/nvme/mdev/pci.c       | 247 ++++++++++
>  drivers/nvme/mdev/priv.h      | 700 +++++++++++++++++++++++++++
>  drivers/nvme/mdev/udata.c     | 390 +++++++++++++++
>  drivers/nvme/mdev/vcq.c       | 207 ++++++++
>  drivers/nvme/mdev/vctrl.c     | 514 ++++++++++++++++++++
>  drivers/nvme/mdev/viommu.c    | 322 +++++++++++++
>  drivers/nvme/mdev/vns.c       | 356 ++++++++++++++
>  drivers/nvme/mdev/vsq.c       | 178 +++++++
>  drivers/vfio/mdev/vfio_mdev.c |  11 +
>  include/linux/mdev.h          |   4 +
>  include/linux/nvme.h          |  88 +++-
>  27 files changed, 7375 insertions(+), 41 deletions(-)
>  create mode 100644 drivers/nvme/mdev/Kconfig
>  create mode 100644 drivers/nvme/mdev/Makefile
>  create mode 100644 drivers/nvme/mdev/adm.c
>  create mode 100644 drivers/nvme/mdev/events.c
>  create mode 100644 drivers/nvme/mdev/host.c
>  create mode 100644 drivers/nvme/mdev/instance.c
>  create mode 100644 drivers/nvme/mdev/io.c
>  create mode 100644 drivers/nvme/mdev/irq.c
>  create mode 100644 drivers/nvme/mdev/mdev.h
>  create mode 100644 drivers/nvme/mdev/mmio.c
>  create mode 100644 drivers/nvme/mdev/pci.c
>  create mode 100644 drivers/nvme/mdev/priv.h
>  create mode 100644 drivers/nvme/mdev/udata.c
>  create mode 100644 drivers/nvme/mdev/vcq.c
>  create mode 100644 drivers/nvme/mdev/vctrl.c
>  create mode 100644 drivers/nvme/mdev/viommu.c
>  create mode 100644 drivers/nvme/mdev/vns.c
>  create mode 100644 drivers/nvme/mdev/vsq.c
> 



Thread overview: 41+ messages
     [not found] <20190319144116.400-1-mlevitsk@redhat.com>
2019-03-19 14:41 ` [PATCH 1/9] vfio/mdev: add .request callback Maxim Levitsky
2019-03-19 14:41 ` [PATCH 2/9] nvme/core: add some more values from the spec Maxim Levitsky
2019-03-19 14:41 ` [PATCH 3/9] nvme/core: add NVME_CTRL_SUSPENDED controller state Maxim Levitsky
2019-03-19 14:41 ` [PATCH 4/9] nvme/pci: use the NVME_CTRL_SUSPENDED state Maxim Levitsky
2019-03-20  2:54   ` Fam Zheng
2019-03-19 14:41 ` [PATCH 5/9] nvme/pci: add known admin effects to augument admin effects log page Maxim Levitsky
2019-03-19 14:41 ` [PATCH 6/9] nvme/pci: init shadow doorbell after each reset Maxim Levitsky
2019-03-19 14:41 ` [PATCH 7/9] nvme/core: add mdev interfaces Maxim Levitsky
2019-03-20 11:46   ` Stefan Hajnoczi
2019-03-20 12:50     ` Maxim Levitsky
2019-03-19 14:41 ` [PATCH 8/9] nvme/core: add nvme-mdev core driver Maxim Levitsky
2019-03-19 14:41 ` [PATCH 9/9] nvme/pci: implement the mdev external queue allocation interface Maxim Levitsky
2019-03-19 14:58 ` Maxim Levitsky [this message]
2019-03-25 18:52   ` [PATCH 0/9] RFC: NVME VFIO mediated device [BENCHMARKS] Maxim Levitsky
2019-03-26  9:38     ` Stefan Hajnoczi
2019-03-26  9:50       ` Maxim Levitsky
2019-03-19 15:22 ` your mail Keith Busch
2019-03-19 23:49   ` Chaitanya Kulkarni
2019-03-20 16:44     ` Maxim Levitsky
2019-03-20 16:30   ` Maxim Levitsky
2019-03-20 17:03     ` Keith Busch
2019-03-20 17:33       ` Maxim Levitsky
2019-04-08 10:04   ` Maxim Levitsky
2019-03-20 11:03 ` Felipe Franciosi
2019-03-20 19:08   ` Re: Maxim Levitsky
2019-03-21 16:12     ` Re: Stefan Hajnoczi
2019-03-21 16:21       ` Re: Keith Busch
2019-03-21 16:41         ` Re: Felipe Franciosi
2019-03-21 17:04           ` Re: Maxim Levitsky
2019-03-22  7:54             ` Re: Felipe Franciosi
2019-03-22 10:32               ` Re: Maxim Levitsky
2019-03-22 15:30               ` Re: Keith Busch
2019-03-25 15:44                 ` Re: Felipe Franciosi
2019-03-20 15:08 ` [PATCH 0/9] RFC: NVME VFIO mediated device Bart Van Assche
2019-03-20 16:48   ` Maxim Levitsky
2019-03-20 15:28 ` Bart Van Assche
2019-03-20 16:42   ` Maxim Levitsky
2019-03-20 17:03     ` Alex Williamson
2019-03-21 16:13 ` your mail Stefan Hajnoczi
2019-03-21 17:07   ` Maxim Levitsky
2019-03-25 16:46     ` Stefan Hajnoczi
