[RFC] vhost: new rte_vhost API proposal

From: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
To: dev@dpdk.org, Maxime Coquelin <maxime.coquelin@redhat.com>,
	Tiwei Bie <tiwei.bie@intel.com>,
	Tetsuya Mukawa <mtetsuyah@gmail.com>,
	Thomas Monjalon <thomas@monjalon.net>
Cc: yliu@fridaylinux.org, Stefan Hajnoczi <stefanha@redhat.com>,
	Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Subject: [RFC] vhost: new rte_vhost API proposal
Date: Thu, 10 May 2018 15:22:53 +0200
Message-ID: <1525958573-184361-1-git-send-email-dariuszx.stojaczyk@intel.com>

rte_vhost has been confirmed not to work with some Virtio devices
(it is not vhost-user spec compliant, see details below). Fixing
it directly would require a large number of changes that would
completely break backwards compatibility. The new rte_virtio library
proposed here is intended to smooth out the transition. It exposes a
low-level API for implementing new Virtio drivers/targets. The existing
rte_vhost is about to be refactored to use the rte_virtio library
underneath, and demanding drivers could use rte_virtio directly.

rte_virtio would offer both vhost and virtio driver APIs. These two
share a lot of code for vhost-user handling or PCI access for
initiator/virtio-vhost-user (and possibly vDPA), so there is little
sense in keeping target and initiator code split between different
libs. Of course, the APIs would be separate; only some parts of
the code would be shared.

rte_virtio intends to abstract away most vhost-user/virtio-vhost-user
specifics and to allow developers to implement Virtio targets/drivers
with ease. It calls user-provided callbacks once the proper device
initialization state has been reached, that is, when memory mappings
have changed, virtqueues are ready to be processed, features have
changed at runtime, etc.
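
To illustrate the callback model, here is a minimal standalone mock-up,
not code from this patch. All demo_* names are hypothetical stand-ins:
demo_cb_complete() plays the role of rte_virtio_tgt_cb_complete(), and
the two handlers mirror queue_start/queue_stop from rte_virtio_tgt_ops.
It assumes the target tracks its own in-flight requests, and shows a
stop that is completed only after the handler has already returned.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for rte_virtio_dev / rte_virtio_vq. */
struct demo_dev { int completions; int last_rc; };
struct demo_vq { int running; int inflight; int stop_pending; };

/* Stand-in for rte_virtio_tgt_cb_complete(): just records the result. */
static int
demo_cb_complete(struct demo_dev *vdev, int rc)
{
	vdev->completions++;
	vdev->last_rc = rc;
	return 0;
}

/* queue_start: nothing to wait for, so complete before returning. */
static void
demo_queue_start(struct demo_dev *vdev, struct demo_vq *vq)
{
	vq->running = 1;
	demo_cb_complete(vdev, 0);
}

/* queue_stop: defer completion while requests are still in flight. */
static void
demo_queue_stop(struct demo_dev *vdev, struct demo_vq *vq)
{
	vq->running = 0;
	if (vq->inflight == 0) {
		demo_cb_complete(vdev, 0);
		return;
	}
	vq->stop_pending = 1;
}

/* Called by the application's own poller when one request finishes. */
static void
demo_io_done(struct demo_dev *vdev, struct demo_vq *vq)
{
	assert(vq->inflight > 0);
	vq->inflight--;
	if (vq->stop_pending && vq->inflight == 0) {
		vq->stop_pending = 0;
		demo_cb_complete(vdev, 0);
	}
}

/* Hypothetical ops table shaped like a subset of rte_virtio_tgt_ops. */
struct demo_tgt_ops {
	void (*queue_start)(struct demo_dev *vdev, struct demo_vq *vq);
	void (*queue_stop)(struct demo_dev *vdev, struct demo_vq *vq);
};

static const struct demo_tgt_ops demo_ops = {
	.queue_start = demo_queue_start,
	.queue_stop = demo_queue_stop,
};
```

A synchronous application would simply call the completion function as
the last statement of every handler, as demo_queue_start() does.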

Compared to rte_vhost, this lib additionally offers the following:
* the ability to start/stop particular queues. This is required
  by the vhost-user spec. rte_vhost has already been confirmed
  not to work with some Virtio devices which do not initialize
  some of their management queues.
* most callbacks are now asynchronous. This greatly simplifies
  event handling for asynchronous applications and doesn't
  make anything harder for synchronous ones.
* this is a low-level API. It doesn't have any vhost-net, nvme
  or crypto references. These backend-specific libraries will
  later be refactored to use *this* generic library underneath.
  This implies that the library doesn't do any virtqueue processing;
  it only delivers vring addresses to the user, who processes the
  virtqueues themselves.
* PCI/vhost-user is abstracted away.
* the API dictates how public functions can be called and how
  internal data can change, so only minimal work is required
  to ensure thread-safety. Possibly no mutexes are required at all.
* full Virtio 1.0/vhost-user specification compliance.
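
Since virtqueue processing is left to the user, buffers that span more
than one memory region have to be translated piecewise. The sketch below
is a standalone illustration, not part of the patch. struct mem_region
and struct guest_memory are trimmed stand-ins for rte_virtio_mem_region
and rte_virtio_memory, gpa_to_vva() follows the same contract as the
proposed rte_virtio_gpa_to_vva(), and copy_from_guest() is a
hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for struct rte_virtio_mem_region (only the fields used). */
struct mem_region {
	uint64_t guest_phys_addr;
	uint64_t host_user_addr;
	uint64_t size;
};

/* Stand-in for struct rte_virtio_memory, fixed-size for brevity. */
struct guest_memory {
	uint32_t nregions;
	struct mem_region regions[4];
};

/* Translate gpa; *len is clamped to the end of the containing region. */
static void *
gpa_to_vva(const struct guest_memory *mem, uint64_t gpa, uint64_t *len)
{
	uint32_t i;

	for (i = 0; i < mem->nregions; i++) {
		const struct mem_region *r = &mem->regions[i];

		if (gpa >= r->guest_phys_addr &&
		    gpa - r->guest_phys_addr < r->size) {
			uint64_t off = gpa - r->guest_phys_addr;

			if (*len > r->size - off)
				*len = r->size - off;
			return (void *)(uintptr_t)(r->host_user_addr + off);
		}
	}
	*len = 0;
	return NULL;
}

/*
 * Hypothetical helper: copy `len` bytes starting at `gpa` into `dst`,
 * translating again whenever a region boundary is hit.
 * Returns 0 on success, -1 if any part of the range is not mapped.
 */
static int
copy_from_guest(const struct guest_memory *mem, void *dst,
		uint64_t gpa, uint64_t len)
{
	uint8_t *out = dst;

	assert(mem != NULL && dst != NULL);
	while (len > 0) {
		uint64_t chunk = len;
		void *src = gpa_to_vva(mem, gpa, &chunk);

		if (src == NULL)
			return -1;
		memcpy(out, src, chunk);
		out += chunk;
		gpa += chunk;
		len -= chunk;
	}
	return 0;
}
```

Note that on failure the helper may already have copied a prefix of the
range; a real target would likely validate the whole descriptor first.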

This patch only introduces the API. Some additional functions
for vDPA might still be required, but everything present here
so far shouldn't need changing.

Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
---
 lib/librte_virtio/rte_virtio.h | 245 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 245 insertions(+)
 create mode 100644 lib/librte_virtio/rte_virtio.h

diff --git a/lib/librte_virtio/rte_virtio.h b/lib/librte_virtio/rte_virtio.h
new file mode 100644
index 0000000..0203d5e
--- /dev/null
+++ b/lib/librte_virtio/rte_virtio.h
@@ -0,0 +1,245 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright (c) Intel Corporation.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <linux/vhost.h>
+
+/** Single memory region. Both physically and virtually contiguous */
+struct rte_virtio_mem_region {
+ uint64_t guest_phys_addr;
+ uint64_t guest_user_addr;
+ uint64_t host_user_addr;
+ uint64_t size;
+ void *mmap_addr;
+ uint64_t mmap_size;
+ int fd;
+};
+
+struct rte_virtio_memory {
+ uint32_t nregions;
+ struct rte_virtio_mem_region regions[];
+};
+
+/**
+ * Vhost device created and managed by rte_virtio. Accessible via
+ * \c rte_virtio_tgt_ops callbacks. This is only a part of the real
+ * vhost device data. This struct is published just for inline vdev
+ * functions to access their data directly.
+ */
+struct rte_virtio_dev {
+ struct rte_virtio_memory *mem;
+ uint64_t features;
+};
+
+/**
+ * Virtqueue created and managed by rte_virtio. Accessible via
+ * \c rte_virtio_tgt_ops callbacks.
+ */
+struct rte_virtio_vq {
+ struct vring_desc *desc;
+ struct vring_avail *avail;
+ struct vring_used *used;
+ /* available only if F_LOG_ALL has been negotiated */
+ void *log;
+ uint16_t size;
+};
+
+/**
+ * Device/queue related callbacks, all optional. Provided callback
+ * parameters are guaranteed not to be NULL unless explicitly specified.
+ */
+struct rte_virtio_tgt_ops {
+ /** New initiator connected. */
+ void (*device_create)(struct rte_virtio_dev *vdev);
+ /**
+ * Device is ready to operate. vdev->mem is now available.
+ * This callback may be called multiple times as memory mappings
+ * can change dynamically. All queues are guaranteed to be stopped
+ * by now.
+ */
+ void (*device_init)(struct rte_virtio_dev *vdev);
+ /**
+ * Features have changed at runtime. Queues might still be running
+ * at this point.
+ */
+ void (*device_features_changed)(struct rte_virtio_dev *vdev);
+ /**
+ * Start processing vq. The `vq` is guaranteed not to be modified before
+ * `queue_stop` is called.
+ */
+ void (*queue_start)(struct rte_virtio_dev *vdev, struct rte_virtio_vq *vq);
+ /**
+ * Stop processing vq. It shouldn't be accessed after this callback
+ * completes (via tgt_cb_complete). This can be called prior to shutdown
+ * or before actions that require changing vhost device/vq state.
+ */
+ void (*queue_stop)(struct rte_virtio_dev *vdev, struct rte_virtio_vq *vq);
+ /** Device disconnected. All queues are guaranteed to be stopped by now */
+ void (*device_destroy)(struct rte_virtio_dev *vdev);
+ /**
+ * Custom message handler. `vdev` and `vq` can be NULL. This is called
+ * for backend-specific actions. The `id` should be prefixed by the
+ * backend name (net/crypto/scsi) and `ctx` is message-specific data
+ * that should be available until tgt_cb_complete is called.
+ */
+ void (*custom_msg)(struct rte_virtio_dev *vdev, struct rte_virtio_vq *vq,
+   char *id, void *ctx);
+
+ /**
+ * Interrupt handler, synchronous. If this callback is set to NULL,
+ * rte_virtio will hint the initiators not to send any interrupts.
+ */
+ void (*queue_kick)(struct rte_virtio_dev *vdev, struct rte_virtio_vq *vq);
+ /** Device config read, synchronous. */
+ int (*get_config)(struct rte_virtio_dev *vdev, uint8_t *config,
+  uint32_t config_len);
+ /** Device config changed by the driver, synchronous. */
+ int (*set_config)(struct rte_virtio_dev *vdev, uint8_t *config,
+  uint32_t offset, uint32_t len, uint32_t flags);
+};
+
+/**
+ * Registers a new vhost target accepting remote connections. Multiple
+ * transports are available. It is possible to create a Vhost-user
+ * Unix domain socket polling local connections or connect to a physical
+ * Virtio device and install an interrupt handler.
+ * \param trtype type of the transport used, e.g. "PCI", "PCI-vhost-user",
+ * "PCI-vDPA", "vhost-user".
+ * \param trid identifier of the device. For PCI this would be the BDF address,
+ * for vhost-user the socket name.
+ * \param trctx additional data for the specified transport. Can be NULL.
+ * \param tgt_ops callbacks to be called upon reaching specific initialization
+ * states.
+ * \param features supported Virtio features. To be negotiated with the
+ * driver ones. rte_virtio will append a couple of generic feature bits
+ * which are required by the Virtio spec. TODO list these features here
+ * \return 0 on success, negative errno otherwise
+ */
+int rte_virtio_tgt_register(char *trtype, char *trid, void *trctx,
+   struct rte_virtio_tgt_ops *tgt_ops,
+   uint64_t features);
+
+/**
+ * Finish async device tgt ops callback. Unless a tgt op has been documented
+ * as 'synchronous' this function must be called at the end of the op handler.
+ * It can be called either before or after the op handler returns. rte_virtio
+ * won't call any callbacks while another one hasn't been finished yet.
+ * \param vdev vhost device
+ * \param rc 0 on success, negative errno otherwise.
+ */
+int rte_virtio_tgt_cb_complete(struct rte_virtio_dev *vdev, int rc);
+
+/**
+ * Unregisters the vhost target identified by \c trid asynchronously.
+ * \param cb_fn callback to be called on finish
+ * \param cb_arg argument for \c cb_fn
+ */
+void rte_virtio_tgt_unregister(char *trid,
+      void (*cb_fn)(void *arg), void *cb_arg);
+
+/**
+ * Bypass F_IOMMU_PLATFORM and translate gpa directly.
+ * \param mem vhost device memory
+ * \param gpa guest physical address
+ * \param len length of the memory to translate (in bytes). If requested
+ * memory chunk crosses memory region boundary, the *len will be set to
+ * the remaining, maximum length of virtually contiguous memory. In such
+ * case the user will be required to call another gpa_to_vva(gpa + *len).
+ * \return vhost virtual address or NULL if requested `gpa` is not mapped.
+ */
+static inline void *
+rte_virtio_gpa_to_vva(struct rte_virtio_memory *mem, uint64_t gpa, uint64_t *len)
+{
+ struct rte_virtio_mem_region *r;
+ uint32_t i;
+
+ for (i = 0; i < mem->nregions; i++) {
+ r = &mem->regions[i];
+ if (gpa >= r->guest_phys_addr &&
+    gpa <  r->guest_phys_addr + r->size) {
+
+ if (unlikely(*len > r->guest_phys_addr + r->size - gpa)) {
+ *len = r->guest_phys_addr + r->size - gpa;
+ }
+
+ return (void *)(uintptr_t)(gpa - r->guest_phys_addr +
+       r->host_user_addr);
+ }
+ }
+ *len = 0;
+
+ return NULL;
+}
+
+/**
+ * Translate I/O virtual address to vhost address space.
+ * If F_IOMMU_PLATFORM has been negotiated, this might potentially
+ * send a TLB miss and wait for the TLB update response.
+ * If F_IOMMU_PLATFORM has not been negotiated, `iova` is
+ * a physical address and `perm` is ignored.
+ * \param vdev vhost device
+ * \param iova I/O virtual address
+ * \param len length of the memory to translate (in bytes). If requested
+ * memory chunk crosses memory region boundary, the *len will be set to
+ * the remaining, maximum length of virtually contiguous memory. In such
+ * case the user will be required to call another iova_to_vva(iova + *len).
+ * \param perm VHOST_ACCESS_RO, VHOST_ACCESS_WO or VHOST_ACCESS_RW
+ * \return vhost virtual address or NULL if requested `iova` is not mapped
+ * or the `perm` doesn't match.
+ */
+static inline void *
+rte_virtio_iova_to_vva(struct rte_virtio_dev *vdev, struct rte_virtio_vq *vq,
+      uint64_t iova, uint64_t *len, uint8_t perm)
+{
+ void *__vhost_iova_to_vva(struct virtio_net * dev, struct vhost_virtqueue * vq,
+  uint64_t iova, uint64_t *len, uint8_t perm);
+
+ if (!(vdev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))) {
+ return rte_virtio_gpa_to_vva(vdev->mem, iova, len);
+ }
+
+ return __vhost_iova_to_vva(vdev, vq, iova, len, perm);
+}
+
+/**
+ * Notify the driver about vq change. This is an eventfd_write for vhost-user
+ * or MMIO write for PCI devices.
+ */
+void rte_virtio_dev_call(struct rte_virtio_dev *vdev, struct rte_virtio_vq *vq);
+
+/**
+ * Notify the driver about device config change. This will result in \c
+ * rte_virtio_tgt_ops->get_config being called. This is an eventfd_write
+ * for vhost-user or MMIO write for PCI devices.
+ */
+void rte_virtio_dev_cfg_call(struct rte_virtio_dev *vdev, struct rte_virtio_vq *vq);
+
-- 
2.7.4