Re: [RFC v2] vhost: new rte_vhost API proposal

From: Maxime Coquelin <maxime.coquelin@redhat.com>
To: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>,
	dev@dpdk.org, Tiwei Bie <tiwei.bie@intel.com>,
	Tetsuya Mukawa <mtetsuyah@gmail.com>,
	Thomas Monjalon <thomas@monjalon.net>
Cc: yliu@fridaylinux.org, Stefan Hajnoczi <stefanha@redhat.com>,
	James Harris <james.r.harris@intel.com>
Subject: Re: [RFC v2] vhost: new rte_vhost API proposal
Date: Fri, 18 May 2018 15:50:45 +0200	[thread overview]
Message-ID: <27ce772e-9f01-dff9-1f82-b99924efa950@redhat.com> (raw)
In-Reply-To: <1526648465-62579-1-git-send-email-dariuszx.stojaczyk@intel.com>

Hi Dariusz,

On 05/18/2018 03:01 PM, Dariusz Stojaczyk wrote:
> rte_vhost is not vhost-user spec compliant. Some Vhost drivers have
> been already confirmed not to work with rte_vhost. virtio-user-scsi-pci
> in QEMU 2.12 doesn't fully initialize its management queues at SeaBIOS
> stage. This is perfectly fine from the Vhost-user spec perspective, but
> doesn't meet rte_vhost expectations. rte_vhost waits for all queues
> to be fully initialized before it allows the entire device to be
> processed. qFixing rte_vhost directly would require quite a big amount
> of changes, which would completely break backwards compatibility.
> 
> This rte_vhost2 library is intended to smooth out the transition.
> It exposes a low-level API for implementing new Vhost-user slaves.
> The existing rte_vhost is about to be refactored to use rte_vhost2
> library underneath, and demanding backends could now use rte_vhost2
> directly.

I like the idea, and the proposed way to smooth the transition.

I will certainly have other comments later, but please find below
the ones I have for the moment.

> We're designing a fresh library here, so this opens up a great
> amount of possibilities and improvements we can make to the public
> API that will pay off significantly for the sake of future
> specification/library extensions.
> 
> rte_vhost2 abstracts away most Vhost-user/virtio-vhost-user specifics
> and allows developers to implement Vhost devices with an ease.
> It calls user-provided callbacks once proper device initialization
> state has been reached. That is - memory mappings have changed,
> virtqueues are ready to be processed, features have changed in
> runtime, etc.
> 
> Compared to the rte_vhost, this lib additionally allows the following:
> * ability to start/stop particular queues
> * most callbacks are now asynchronous - it greatly simplifies
> the event handling for asynchronous applications and doesn't
> make anything harder for synchronous ones.
> * this is low-level API. It doesn't have any vhost-net, nvme
> or crypto references. These backend-specific libraries will
> be later refactored to use *this* generic library underneath.
> This implies that the library doesn't do any virtqueue processing,
> it only delivers vring addresses to the user, so he can process
> virtqueues by himself.
> * abstracting away Unix domain sockets (vhost-user) and PCI
> (virtio-vhost-user).
> * The API imposes how public functions can be called and how
> internal data can change, so there's only a minimal work required
> to ensure thread-safety. Possibly no mutexes are required at all.
> * full Virtio 1.0/vhost-user specification compliance.
> 
> This patch only introduces the API. Some additional functions
> for vDPA might be still required, but everything present here
> so far shouldn't need changing.
> 
> Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
> ---
>   lib/librte_vhost2/rte_vhost2.h | 331 +++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 331 insertions(+)
>   create mode 100644 lib/librte_vhost2/rte_vhost2.h
> 
> diff --git a/lib/librte_vhost2/rte_vhost2.h b/lib/librte_vhost2/rte_vhost2.h
> new file mode 100644
> index 0000000..385b093
> --- /dev/null
> +++ b/lib/librte_vhost2/rte_vhost2.h
> @@ -0,0 +1,331 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright (c) Intel Corporation.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef _RTE_VHOST2_H_
> +#define _RTE_VHOST2_H_
> +
> +/**
> + * @file
> + * This library abstracts away most Vhost-user/virtio-vhost-user specifics
> + * and allows developers to implement Vhost devices with an ease.
> + * It calls user-provided callbacks once proper device initialization
> + * state has been reached. That is - memory mappings have changed,
> + * virtqueues are ready to be processed, features have changed in runtime, etc.
> + */
> +
> +#include <stdint.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/* Not C++-aware. */
> +#include <linux/vhost.h>
> +
> +#define RTE_VHOST2_MEMORY_MAX_NREGIONS	8
> +
> +#define RTE_VHOST2_CLIENT		(1ULL << 0)
> +#define RTE_VHOST2_NO_RECONNECT		(1ULL << 1)
> +
> +enum rte_vhost2_set_config_type {
> +	/** Config changed on request by the vhost driver. */
> +	VHOST_SET_CONFIG_TYPE_MASTER = 0,
> +	/** Config is being restored after a successful migration. */
> +	VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> +};
> +
> +#define RTE_VHOST2_MSG_VERSION_MASK	(0x3)
> +#define RTE_VHOST2_MSG_REPLY_MASK	(0x1 << 2)
> +#define RTE_VHOST2_MSG_NEED_REPLY	(0x1 << 3)
> +
> +struct rte_vhost2_msg {
> +	uint32_t id;
> +	uint32_t flags;
> +	uint32_t size; /**< The following payload size. */
> +	void *payload;
> +	int fds[RTE_VHOST2_MEMORY_MAX_NREGIONS];
> +};
> +
> +/** Single memory region. Both physically and virtually contiguous */
> +struct rte_vhost2_mem_region {
> +	uint64_t guest_phys_addr;
> +	uint64_t guest_user_addr;
> +	uint64_t host_user_addr;
> +	uint64_t size;
> +	void *mmap_addr;
> +	uint64_t mmap_size;
> +	int fd;
> +};
> +
> +struct rte_vhost2_memory {
> +	uint32_t nregions;
> +	struct rte_vhost2_mem_region regions[];
> +};
> +
> +/**
> + * Vhost device created and managed by rte_vhost2. Accessible via
> + * \c rte_vhost2_tgt_ops callbacks. This is only a part of the real
> + * vhost device data. This struct is published just for inline vdev
> + * functions to access their data directly.
> + */
> +struct rte_vhost2_dev {
> +	struct rte_vhost2_memory *mem;
> +	bool iommu; /**< \c VIRTIO_F_IOMMU_PLATFORM has been negotiated */
> +};
> +
> +/**
> + * Virtqueue created and managed by rte_vhost2. Accessible via
> + * \c rte_vhost2_tgt_ops callbacks.
> + */
> +struct rte_vhost2_vq {
> +	 struct vring_desc *desc;
> +	 struct vring_avail *avail;
> +	 struct vring_used *used;
> +	 /* available only if \c VHOST_F_LOG_ALL has been negotiated */
> +	 void *log;
> +	 uint16_t size;
> +};
> +
> +/**
> + * Device/queue related callbacks, all optional. Provided callback
> + * parameters are guaranteed not to be NULL unless explicitly specified.
> + */
> +struct rte_vhost2_tgt_ops {
> +	/**
> +	 * New driver connected. If this is completed with a non-zero status,
> +	 * rte_vhost2 will terminate the connection.
> +	 */
> +	void (*device_create)(struct rte_vhost2_dev *vdev);
> +	/**
> +	* Device is ready to operate. vdev data is now initialized. This callback
> +	* may be called multiple times as e.g. memory mappings can change
> +	* dynamically. All queues are guaranteed to be stopped by now.
> +	*/
> +	void (*device_init)(struct rte_vhost2_dev *vdev);
> +	/**
> +	* Features have changed in runtime. This is called at least once during
> +	* initialization before `device_init`. Queues might be still running
> +	* at this point.
> +	*/
> +	void (*device_features_changed)(struct rte_vhost2_dev *vdev,
> +			uint64_t features);
> +	/**
> +	* Start processing vq. The `vq` is guaranteed not to be modified before
> +	* `queue_stop` is called.
> +	*/
> +	void (*queue_start)(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
> +	/**
> +	* Stop processing vq. It shouldn't be accessed after this callback
> +	* completes (via \c rte_vhost2_tgt_cb_complete). This can be called
> +	* prior to shutdown or before actions that require changing vhost
> +	* device/vq state.
> +	*/
> +	void (*queue_stop)(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
> +	/** Device disconnected. All queues are guaranteed to be stopped by now */
> +	void (*device_destroy)(struct rte_vhost2_dev *vdev);
> +	/**
> +	* Custom vhost-user message handler. This is called for
> +	* backend-specific messages (net/crypto/scsi) that weren't recognized
> +	* by the generic message parser. `msg` is available until
> +	* \c rte_vhost2_tgt_cb_complete is called.
> +	*/
> +	void (*custom_msg)(struct rte_vhost2_dev *vdev, struct rte_vhost2_msg *msg);
> +
> +	/** Interrupt handler, synchronous. */
> +	void (*queue_kick)(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
> +	/**
> +	 * Full device config read, synchronous. Return 0 if `len` bytes of
> +	 * `config` have been successfully set, -1 otherwise.
> +	 */
> +	int (*get_config)(struct rte_vhost2_dev *vdev, uint8_t *config,
> +			  uint32_t len);
> +	/**
> +	 * Device config changed by the driver, synchronous. `type` indicates
> +	 * the reason of change.
> +	 */
> +	int (*set_config)(struct rte_vhost2_dev *vdev, uint8_t *config,
> +			  uint32_t offset, uint32_t len,
> +			  enum rte_vhost2_set_config_type type);
> +
> +	void *reserved[8]; /**< Reserved for future extension */
> +};
> +
> +/**
> + * Registers a new vhost target accepting remote connections. Multiple
> + * available transports are available. It is possible to create a Vhost-user
> + * Unix domain socket polling local connections or connect to a physical
> + * Virtio device and install an interrupt handler .
> + *
> + * This function is thread-safe.
> + *
> + * \param trtype type of the transport used, e.g. "vhost-user",
> + * "PCI-vhost-user", "PCI-vDPA".
> + * \param trid identifier of the device. For PCI this would be the BDF address,
> + * for vhost-user the socket name.
> + * \param trflags additional options for the specified transport
> + * \param trctx additional data for the specified transport. Can be NULL.
> + * \param tgt_ops callbacks to be called upon reaching specific initialization
> + * states.
> + * \param features supported Virtio features. To be negotiated with the
> + * driver ones. rte_vhost2 will append a couple of generic feature bits
> + * which are required by the Virtio spec. TODO list these features here
> + * \return 0 on success, negative errno otherwise
> + */
> +int rte_vhost2_tgt_register(const char *trtype, const char *trid,
> +			    uint64_t trflags, void *trctx,
> +			    struct rte_vhost2_tgt_ops *tgt_ops,
> +			    uint64_t features);

Couldn't the register API also pass the vdev?
Doing this, the backend could have rte_vhost2_dev in its device
struct.

> +
> +/**
> + * Finish async device tgt ops callback. Unless a tgt op has been documented
> + * as 'synchronous' this function must be called at the end of the op handler.
> + * It can be called either before or after the op handler returns. rte_vhost2
> + * won't call any tgt ops callbacks while another one hasn't been finished yet.
> + *
> + * This function is thread-safe.
> + *
> + * \param vdev vhost device
> + * \param rc 0 on success, negative errno otherwise. If non-zero value is
> + * given, the current callback will be perceived as failed. A queue that failed
> + * to start won't need to be stopped.
> + */
> +void rte_vhost2_tgt_cb_complete(struct rte_vhost2_dev *vdev, int rc);
> +
> +/**
> + * Unregisters a vhost target asynchronously. All active queue will be stopped
> + * and all devices destroyed.
> + *
> + * This function is thread-safe.
> + *
> + * \param cb_fn callback to be called on finish. It'll be called from the same
> + * thread that calls \c rte_vhost2_tgt_ops.
> + * \param cb_arg argument for \c cb_fn
> + * \return 0 on success, negative errno otherwise. `cb_fn` won't be called
> + * if non-zero value is returned.
> + */
> +int rte_vhost2_tgt_unregister(const char *trtype, const char *trid,
> +			       void (*cb_fn)(void *arg), void *cb_arg);
> +
> +/**
> + * Bypass VIRTIO_F_IOMMU_PLATFORM and translate gpa directly.
> + *
> + * This function is thread-safe.
> + *
> + * \param mem vhost device memory
> + * \param gpa guest physical address
> + * \param len length of the memory to translate (in bytes). If requested
> + * memory chunk crosses memory region boundary, the *len will be set to
> + * the remaining, maximum length of virtually contiguous memory. In such
> + * case the user will be required to call another gpa_to_vva(gpa + *len).
> + * \return vhost virtual address or NULL if requested `gpa` is not mapped.
> + */
> +static inline void *
> +rte_vhost2_gpa_to_vva(struct rte_vhost2_memory *mem, uint64_t gpa, uint64_t *len)
> +{
> +	struct rte_vhost2_mem_region *r;
> +	uint32_t i;
> +
> +	for (i = 0; i < mem->nregions; i++) {
> +		r = &mem->regions[i];
> +		if (gpa >= r->guest_phys_addr &&
> +		    gpa <  r->guest_phys_addr + r->size) {
> +
> +			if (unlikely(*len > r->guest_phys_addr + r->size - gpa)) {
> +				*len = r->guest_phys_addr + r->size - gpa;
> +			}
> +
> +			return gpa - r->guest_phys_addr + r->host_user_addr;
> +		}
> +	}
> +	*len = 0;
> +
> +	return 0;
> +}

Maybe we could take the opportunity to only have rte_vhost2_iova_to_vva.

> +/**
> + * Translate I/O virtual address to vhost address space.
> + *
> + * If VIRTIO_F_IOMMU_PLATFORM has been negotiated, this might potentially send
> + * a TLB miss and wait for the TLB update response.
> + * If VIRTIO_F_IOMMU_PLATFORM has not been negotiated, `iova` is a physical
> + * address and `perm` is ignored.
> + *
> + * This function is thread-safe.
> + *
> + * \param vdev vhost device
> + * \param vq virtqueue. Must be started.
> + * \param iova I/O virtual address
> + * \param len length of the memory to translate (in bytes). If requested
> + * memory chunk crosses memory region boundary, the *len will be set to
> + * the remaining, maximum length of virtually contiguous memory. In such
> + * case the user will be required to call another gpa_to_vva(gpa + *len).
> + * \param perm VHOST_ACCESS_RO,VHOST_ACCESS_WO or VHOST_ACCESS_RW
> + * \return vhost virtual address or NULL if requested `iova` is not mapped
> + * or the `perm` doesn't match.
> + */
> +static inline void *
> +rte_vhost2_iova_to_vva(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq,
> +		       uint64_t iova, uint32_t *len, uint8_t perm)
> +{
> +	void *__vhost_iova_to_vva(struct virtio_net * dev, struct vhost_virtqueue * vq,
> +				  uint64_t iova, uint64_t size, uint8_t perm);
> +
> +	if (!vdev->iommu) {
> +		return rte_vhost2_gpa_to_vva(vdev->mem, iova, len);
> +	}
> +
> +	return __vhost_iova_to_vva(vdev, vq, iova, len, perm);
> +}
> +
> +/**
> + * Notify the driver about vq change. This is an eventfd_write for vhost-user
> + * or MMIO write for PCI devices.
> + *
> + * \param vdev vhost device
> + * \param vq virtqueue. Must be started.
> + */
> +void rte_vhost2_dev_call(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
> +
> +/**
> + * Notify the driver about device config change. This will result in \c
> + * rte_vhost2_tgt_ops->get_config being called.
> + *
> + * \param vdev vhost device
> + */
> +void rte_vhost2_dev_cfg_call(struct rte_vhost2_dev *vdev);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_VHOST2_H_ */
>