From: Maxime Coquelin <maxime.coquelin@redhat.com>
To: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>,
	dev@dpdk.org, Tiwei Bie <tiwei.bie@intel.com>,
	Tetsuya Mukawa <mtetsuyah@gmail.com>,
	Thomas Monjalon <thomas@monjalon.net>
Cc: yliu@fridaylinux.org, Stefan Hajnoczi <stefanha@redhat.com>,
	James Harris <james.r.harris@intel.com>
Subject: Re: [RFC v2] vhost: new rte_vhost API proposal
Date: Fri, 18 May 2018 15:50:45 +0200	[thread overview]
Message-ID: <27ce772e-9f01-dff9-1f82-b99924efa950@redhat.com> (raw)
In-Reply-To: <1526648465-62579-1-git-send-email-dariuszx.stojaczyk@intel.com>

Hi Dariusz,

On 05/18/2018 03:01 PM, Dariusz Stojaczyk wrote:
> rte_vhost is not vhost-user spec compliant. Some Vhost drivers have
> been already confirmed not to work with rte_vhost. virtio-user-scsi-pci
> in QEMU 2.12 doesn't fully initialize its management queues at SeaBIOS
> stage. This is perfectly fine from the Vhost-user spec perspective, but
> doesn't meet rte_vhost expectations. rte_vhost waits for all queues
> to be fully initialized before it allows the entire device to be
> processed. Fixing rte_vhost directly would require a large number
> of changes, which would completely break backwards compatibility.
> 
> This rte_vhost2 library is intended to smooth out the transition.
> It exposes a low-level API for implementing new Vhost-user slaves.
> The existing rte_vhost is about to be refactored to use rte_vhost2
> library underneath, and demanding backends could now use rte_vhost2
> directly.

I like the idea, and the proposed way to smooth the transition.

I will certainly have other comments later, but please find below
the ones I have for the moment.

> We're designing a fresh library here, so this opens up a great
> amount of possibilities and improvements we can make to the public
> API that will pay off significantly for the sake of future
> specification/library extensions.
> 
> rte_vhost2 abstracts away most Vhost-user/virtio-vhost-user specifics
> and allows developers to implement Vhost devices with ease.
> It calls user-provided callbacks once proper device initialization
> state has been reached. That is - memory mappings have changed,
> virtqueues are ready to be processed, features have changed in
> runtime, etc.
> 
> Compared to the rte_vhost, this lib additionally allows the following:
> * ability to start/stop particular queues
> * most callbacks are now asynchronous - it greatly simplifies
> the event handling for asynchronous applications and doesn't
> make anything harder for synchronous ones.
> * this is low-level API. It doesn't have any vhost-net, nvme
> or crypto references. These backend-specific libraries will
> be later refactored to use *this* generic library underneath.
> This implies that the library doesn't do any virtqueue processing,
> it only delivers vring addresses to the user, who can then process
> virtqueues directly.
> * abstracting away Unix domain sockets (vhost-user) and PCI
> (virtio-vhost-user).
> * The API constrains how public functions can be called and how
> internal data can change, so there's only minimal work required
> to ensure thread-safety. Possibly no mutexes are required at all.
> * full Virtio 1.0/vhost-user specification compliance.
> 
> This patch only introduces the API. Some additional functions
> for vDPA might be still required, but everything present here
> so far shouldn't need changing.
> 
> Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
> ---
>   lib/librte_vhost2/rte_vhost2.h | 331 +++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 331 insertions(+)
>   create mode 100644 lib/librte_vhost2/rte_vhost2.h
> 
> diff --git a/lib/librte_vhost2/rte_vhost2.h b/lib/librte_vhost2/rte_vhost2.h
> new file mode 100644
> index 0000000..385b093
> --- /dev/null
> +++ b/lib/librte_vhost2/rte_vhost2.h
> @@ -0,0 +1,331 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright (c) Intel Corporation.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef _RTE_VHOST2_H_
> +#define _RTE_VHOST2_H_
> +
> +/**
> + * @file
> + * This library abstracts away most Vhost-user/virtio-vhost-user specifics
> + * and allows developers to implement Vhost devices with ease.
> + * It calls user-provided callbacks once proper device initialization
> + * state has been reached. That is - memory mappings have changed,
> + * virtqueues are ready to be processed, features have changed in runtime, etc.
> + */
> +
> +#include <stdbool.h>
> +#include <stdint.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/* Not C++-aware. */
> +#include <linux/vhost.h>
> +
> +#define RTE_VHOST2_MEMORY_MAX_NREGIONS	8
> +
> +#define RTE_VHOST2_CLIENT		(1ULL << 0)
> +#define RTE_VHOST2_NO_RECONNECT		(1ULL << 1)
> +
> +enum rte_vhost2_set_config_type {
> +	/** Config changed on request by the vhost driver. */
> +	VHOST_SET_CONFIG_TYPE_MASTER = 0,
> +	/** Config is being restored after a successful migration. */
> +	VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> +};
> +
> +#define RTE_VHOST2_MSG_VERSION_MASK	(0x3)
> +#define RTE_VHOST2_MSG_REPLY_MASK	(0x1 << 2)
> +#define RTE_VHOST2_MSG_NEED_REPLY	(0x1 << 3)
> +
> +struct rte_vhost2_msg {
> +	uint32_t id;
> +	uint32_t flags;
> +	uint32_t size; /**< The following payload size. */
> +	void *payload;
> +	int fds[RTE_VHOST2_MEMORY_MAX_NREGIONS];
> +};
> +
> +/** Single memory region. Both physically and virtually contiguous */
> +struct rte_vhost2_mem_region {
> +	uint64_t guest_phys_addr;
> +	uint64_t guest_user_addr;
> +	uint64_t host_user_addr;
> +	uint64_t size;
> +	void *mmap_addr;
> +	uint64_t mmap_size;
> +	int fd;
> +};
> +
> +struct rte_vhost2_memory {
> +	uint32_t nregions;
> +	struct rte_vhost2_mem_region regions[];
> +};
> +
> +/**
> + * Vhost device created and managed by rte_vhost2. Accessible via
> + * \c rte_vhost2_tgt_ops callbacks. This is only a part of the real
> + * vhost device data. This struct is published just for inline vdev
> + * functions to access its data directly.
> + */
> +struct rte_vhost2_dev {
> +	struct rte_vhost2_memory *mem;
> +	bool iommu; /**< \c VIRTIO_F_IOMMU_PLATFORM has been negotiated */
> +};
> +
> +/**
> + * Virtqueue created and managed by rte_vhost2. Accessible via
> + * \c rte_vhost2_tgt_ops callbacks.
> + */
> +struct rte_vhost2_vq {
> +	 struct vring_desc *desc;
> +	 struct vring_avail *avail;
> +	 struct vring_used *used;
> +	 /* available only if \c VHOST_F_LOG_ALL has been negotiated */
> +	 void *log;
> +	 uint16_t size;
> +};
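
Just to make sure I read the intent right: since the library does no vring
processing itself, the backend is expected to walk these rings on its own,
roughly like below (only a sketch; index wrap-around, memory barriers and
error handling are omitted and the my_* names are made up):

static void my_process_vq(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq,
			  uint16_t *last_avail_idx)
{
	uint16_t avail_idx = vq->avail->idx;

	while (*last_avail_idx != avail_idx) {
		uint16_t head = vq->avail->ring[(*last_avail_idx)++ % vq->size];

		/* vq->desc[head] starts the descriptor chain; translate its
		 * addr with rte_vhost2_iova_to_vva() and process it here */

		vq->used->ring[vq->used->idx % vq->size].id = head;
		vq->used->ring[vq->used->idx % vq->size].len = 0;
		vq->used->idx++;
	}

	/* let the driver know the used ring has changed */
	rte_vhost2_dev_call(vdev, vq);
}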
> +
> +/**
> + * Device/queue related callbacks, all optional. Provided callback
> + * parameters are guaranteed not to be NULL unless explicitly specified.
> + */
> +struct rte_vhost2_tgt_ops {
> +	/**
> +	 * New driver connected. If this is completed with a non-zero status,
> +	 * rte_vhost2 will terminate the connection.
> +	 */
> +	void (*device_create)(struct rte_vhost2_dev *vdev);
> +	/**
> +	* Device is ready to operate. vdev data is now initialized. This callback
> +	* may be called multiple times as e.g. memory mappings can change
> +	* dynamically. All queues are guaranteed to be stopped by now.
> +	*/
> +	void (*device_init)(struct rte_vhost2_dev *vdev);
> +	/**
> +	* Features have changed in runtime. This is called at least once during
> +	* initialization before `device_init`. Queues might still be running
> +	* at this point.
> +	*/
> +	void (*device_features_changed)(struct rte_vhost2_dev *vdev,
> +			uint64_t features);
> +	/**
> +	* Start processing vq. The `vq` is guaranteed not to be modified before
> +	* `queue_stop` is called.
> +	*/
> +	void (*queue_start)(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
> +	/**
> +	* Stop processing vq. It shouldn't be accessed after this callback
> +	* completes (via \c rte_vhost2_tgt_cb_complete). This can be called
> +	* prior to shutdown or before actions that require changing vhost
> +	* device/vq state.
> +	*/
> +	void (*queue_stop)(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
> +	/** Device disconnected. All queues are guaranteed to be stopped by now */
> +	void (*device_destroy)(struct rte_vhost2_dev *vdev);
> +	/**
> +	* Custom vhost-user message handler. This is called for
> +	* backend-specific messages (net/crypto/scsi) that weren't recognized
> +	* by the generic message parser. `msg` is available until
> +	* \c rte_vhost2_tgt_cb_complete is called.
> +	*/
> +	void (*custom_msg)(struct rte_vhost2_dev *vdev, struct rte_vhost2_msg *msg);
> +
> +	/** Interrupt handler, synchronous. */
> +	void (*queue_kick)(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
> +	/**
> +	 * Full device config read, synchronous. Return 0 if `len` bytes of
> +	 * `config` have been successfully filled, -1 otherwise.
> +	 */
> +	int (*get_config)(struct rte_vhost2_dev *vdev, uint8_t *config,
> +			  uint32_t len);
> +	/**
> +	 * Device config changed by the driver, synchronous. `type` indicates
> +	 * the reason for the change.
> +	 */
> +	int (*set_config)(struct rte_vhost2_dev *vdev, uint8_t *config,
> +			  uint32_t offset, uint32_t len,
> +			  enum rte_vhost2_set_config_type type);
> +
> +	void *reserved[8]; /**< Reserved for future extension */
> +};
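
For reference, I picture a backend ending up with an ops table along these
lines (a sketch only; the my_* handlers are backend-specific and unused
callbacks simply stay NULL):

static struct rte_vhost2_tgt_ops my_tgt_ops = {
	.device_create	= my_device_create,
	.device_init	= my_device_init,
	.queue_start	= my_queue_start,
	.queue_stop	= my_queue_stop,
	.queue_kick	= my_queue_kick,
	.device_destroy	= my_device_destroy,
	/* custom_msg, get_config, set_config, ... left NULL */
};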
> +
> +/**
> + * Registers a new vhost target accepting remote connections. Multiple
> + * transports are available. It is possible to create a Vhost-user
> + * Unix domain socket polling for local connections, or to connect to a
> + * physical Virtio device and install an interrupt handler.
> + *
> + * This function is thread-safe.
> + *
> + * \param trtype type of the transport used, e.g. "vhost-user",
> + * "PCI-vhost-user", "PCI-vDPA".
> + * \param trid identifier of the device. For PCI this would be the BDF address,
> + * for vhost-user the socket name.
> + * \param trflags additional options for the specified transport
> + * \param trctx additional data for the specified transport. Can be NULL.
> + * \param tgt_ops callbacks to be called upon reaching specific initialization
> + * states.
> + * \param features supported Virtio features, to be negotiated with the
> + * driver. rte_vhost2 will append a couple of generic feature bits
> + * which are required by the Virtio spec. TODO list these features here
> + * \return 0 on success, negative errno otherwise
> + */
> +int rte_vhost2_tgt_register(const char *trtype, const char *trid,
> +			    uint64_t trflags, void *trctx,
> +			    struct rte_vhost2_tgt_ops *tgt_ops,
> +			    uint64_t features);

Couldn't the register API also pass the vdev?
That way, the backend could embed rte_vhost2_dev in its own device
struct.
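
Something along these lines is what I have in mind - purely hypothetical,
not what the patch currently proposes (container_of() below being the
usual rte_common.h macro):

struct my_dev {
	struct rte_vhost2_dev vdev;	/* embedded, owned by the backend */
	/* backend-private state follows */
};

static void my_queue_kick(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq)
{
	struct my_dev *dev = container_of(vdev, struct my_dev, vdev);

	/* backend state is right there in dev, no vdev -> context lookup */
}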

> +
> +/**
> + * Finish an asynchronous device tgt ops callback. Unless a tgt op has been
> + * documented as 'synchronous', this function must be called at the end of
> + * the op handler. It can be called either before or after the op handler
> + * returns. rte_vhost2 won't call another tgt ops callback until the current
> + * one has been completed.
> + *
> + * This function is thread-safe.
> + *
> + * \param vdev vhost device
> + * \param rc 0 on success, negative errno otherwise. If a non-zero value is
> + * given, the current callback will be treated as failed. A queue that failed
> + * to start won't need to be stopped.
> + */
> +void rte_vhost2_tgt_cb_complete(struct rte_vhost2_dev *vdev, int rc);
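
To check I understand the completion model: a queue_start handler in an
asynchronous backend would then look roughly like this (sketch only;
my_send_event() and the poller thread are made-up backend helpers):

static void my_queue_start(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq)
{
	/* just forward the request to the backend's own event loop;
	 * my_send_event() is a made-up backend helper */
	my_send_event(MY_EVENT_QUEUE_START, vdev, vq);
}

/* later, on the poller thread, once the queue is really being processed */
static void my_queue_start_done(struct rte_vhost2_dev *vdev, int rc)
{
	rte_vhost2_tgt_cb_complete(vdev, rc);
}

That matches the "either before or after the op handler returns" part
above, right?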
> +
> +/**
> + * Unregisters a vhost target asynchronously. All active queues will be stopped
> + * and all devices destroyed.
> + *
> + * This function is thread-safe.
> + *
> + * \param cb_fn callback to be called on finish. It'll be called from the same
> + * thread that calls the \c rte_vhost2_tgt_ops callbacks.
> + * \param cb_arg argument for \c cb_fn
> + * \return 0 on success, negative errno otherwise. `cb_fn` won't be called
> + * if a non-zero value is returned.
> + */
> +int rte_vhost2_tgt_unregister(const char *trtype, const char *trid,
> +			       void (*cb_fn)(void *arg), void *cb_arg);
> +
> +/**
> + * Bypass VIRTIO_F_IOMMU_PLATFORM and translate gpa directly.
> + *
> + * This function is thread-safe.
> + *
> + * \param mem vhost device memory
> + * \param gpa guest physical address
> + * \param len length of the memory to translate (in bytes). If the requested
> + * memory chunk crosses a memory region boundary, *len will be set to the
> + * maximum remaining length of virtually contiguous memory. In that case
> + * the user will have to call gpa_to_vva() again with gpa + *len.
> + * \return vhost virtual address or NULL if requested `gpa` is not mapped.
> + */
> +static inline void *
> +rte_vhost2_gpa_to_vva(struct rte_vhost2_memory *mem, uint64_t gpa, uint64_t *len)
> +{
> +	struct rte_vhost2_mem_region *r;
> +	uint32_t i;
> +
> +	for (i = 0; i < mem->nregions; i++) {
> +		r = &mem->regions[i];
> +		if (gpa >= r->guest_phys_addr &&
> +		    gpa <  r->guest_phys_addr + r->size) {
> +
> +			if (unlikely(*len > r->guest_phys_addr + r->size - gpa)) {
> +				*len = r->guest_phys_addr + r->size - gpa;
> +			}
> +
> +			return (void *)(uintptr_t)(gpa - r->guest_phys_addr +
> +					r->host_user_addr);
> +		}
> +	}
> +	*len = 0;
> +
> +	return 0;
> +}

Maybe we could take the opportunity to only have rte_vhost2_iova_to_vva.
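
In other words a backend would only ever need a single translation entry
point, whether an IOMMU is in place or not, e.g. (sketch only;
VHOST_ACCESS_RO assumes a device-readable buffer and the region-boundary
handling is just stubbed out):

static int my_map_desc(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq,
		       struct vring_desc *desc, void **vva, uint32_t *len)
{
	*len = desc->len;
	*vva = rte_vhost2_iova_to_vva(vdev, vq, desc->addr, len, VHOST_ACCESS_RO);
	if (*vva == NULL)
		return -1;	/* not mapped or insufficient permissions */
	if (*len < desc->len) {
		/* the buffer crosses a region boundary; translate the rest
		 * starting at desc->addr + *len, or fail if unsupported */
	}
	return 0;
}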

> +/**
> + * Translate I/O virtual address to vhost address space.
> + *
> + * If VIRTIO_F_IOMMU_PLATFORM has been negotiated, this may send
> + * a TLB miss message and wait for the TLB update response.
> + * If VIRTIO_F_IOMMU_PLATFORM has not been negotiated, `iova` is a physical
> + * address and `perm` is ignored.
> + *
> + * This function is thread-safe.
> + *
> + * \param vdev vhost device
> + * \param vq virtqueue. Must be started.
> + * \param iova I/O virtual address
> + * \param len length of the memory to translate (in bytes). If the requested
> + * memory chunk crosses a memory region boundary, *len will be set to the
> + * maximum remaining length of virtually contiguous memory. In that case
> + * the user will have to call this function again with iova + *len.
> + * \param perm VHOST_ACCESS_RO,VHOST_ACCESS_WO or VHOST_ACCESS_RW
> + * \return vhost virtual address or NULL if requested `iova` is not mapped
> + * or the `perm` doesn't match.
> + */
> +static inline void *
> +rte_vhost2_iova_to_vva(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq,
> +		       uint64_t iova, uint32_t *len, uint8_t perm)
> +{
> +	void *__vhost_iova_to_vva(struct virtio_net * dev, struct vhost_virtqueue * vq,
> +				  uint64_t iova, uint64_t size, uint8_t perm);
> +
> +	if (!vdev->iommu) {
> +		return rte_vhost2_gpa_to_vva(vdev->mem, iova, len);
> +	}
> +
> +	return __vhost_iova_to_vva(vdev, vq, iova, len, perm);
> +}
> +
> +/**
> + * Notify the driver about a vq change. This is an eventfd_write for
> + * vhost-user or an MMIO write for PCI devices.
> + *
> + * \param vdev vhost device
> + * \param vq virtqueue. Must be started.
> + */
> +void rte_vhost2_dev_call(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
> +
> +/**
> + * Notify the driver about device config change. This will result in \c
> + * rte_vhost2_tgt_ops->get_config being called.
> + *
> + * \param vdev vhost device
> + */
> +void rte_vhost2_dev_cfg_call(struct rte_vhost2_dev *vdev);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_VHOST2_H_ */
> 
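
One more thing, just to picture the whole flow: am I right that a backend
would basically only need the below to get its callbacks running? (sketch
only; the socket path and MY_SUPPORTED_FEATURES are made up)

static int my_backend_start(void)
{
	return rte_vhost2_tgt_register("vhost-user", "/tmp/my-backend.sock",
				       0 /* trflags */, NULL /* trctx */,
				       &my_tgt_ops, MY_SUPPORTED_FEATURES);
}

static void my_unregister_done(void *arg)
{
	/* all queues stopped and all devices destroyed by now */
}

static int my_backend_stop(void)
{
	return rte_vhost2_tgt_unregister("vhost-user", "/tmp/my-backend.sock",
					 my_unregister_done, NULL);
}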
