Re: [RFC v2] vhost: new rte_vhost API proposal

From: "Stojaczyk, DariuszX" <dariuszx.stojaczyk@intel.com>
To: Maxime Coquelin <maxime.coquelin@redhat.com>,
	"dev@dpdk.org" <dev@dpdk.org>, "Bie, Tiwei" <tiwei.bie@intel.com>,
	Tetsuya Mukawa <mtetsuyah@gmail.com>,
	Thomas Monjalon <thomas@monjalon.net>
Cc: "yliu@fridaylinux.org" <yliu@fridaylinux.org>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	"Harris, James R" <james.r.harris@intel.com>
Subject: Re: [RFC v2] vhost: new rte_vhost API proposal
Date: Tue, 22 May 2018 10:19:53 +0000	[thread overview]
Message-ID: <FBE7E039FA50BF47A673AD0BD3CD56A8449A61D1@HASMSX105.ger.corp.intel.com> (raw)
In-Reply-To: <27ce772e-9f01-dff9-1f82-b99924efa950@redhat.com>

Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Friday, May 18, 2018 9:51 PM
> On 05/18/2018 03:01 PM, Dariusz Stojaczyk wrote:
> > rte_vhost is not vhost-user spec compliant. Some Vhost drivers have
> > been already confirmed not to work with rte_vhost. virtio-user-scsi-pci
> > in QEMU 2.12 doesn't fully initialize its management queues at SeaBIOS
> > stage. This is perfectly fine from the Vhost-user spec perspective, but
> > doesn't meet rte_vhost expectations. rte_vhost waits for all queues
> > to be fully initialized before it allows the entire device to be
> > processed. qFixing rte_vhost directly would require quite a big amount
> > of changes, which would completely break backwards compatibility.
> >
> > This rte_vhost2 library is intended to smooth out the transition.
> > It exposes a low-level API for implementing new Vhost-user slaves.
> > The existing rte_vhost is about to be refactored to use rte_vhost2
> > library underneath, and demanding backends could now use rte_vhost2
> > directly.
> 
> I like the idea, and the proposed way to smooth the transition.
> 
> I will certainly have other comments later, but please find below
> the ones I have for the moment.
> 
> > <snip>
> > +
> > +/**
> > + * Registers a new vhost target accepting remote connections. Multiple
> > + * available transports are available. It is possible to create a Vhost-
> user
> > + * Unix domain socket polling local connections or connect to a
> physical
> > + * Virtio device and install an interrupt handler .
> > + *
> > + * This function is thread-safe.
> > + *
> > + * \param trtype type of the transport used, e.g. "vhost-user",
> > + * "PCI-vhost-user", "PCI-vDPA".
> > + * \param trid identifier of the device. For PCI this would be the BDF
> address,
> > + * for vhost-user the socket name.
> > + * \param trflags additional options for the specified transport
> > + * \param trctx additional data for the specified transport. Can be
> NULL.
> > + * \param tgt_ops callbacks to be called upon reaching specific
> initialization
> > + * states.
> > + * \param features supported Virtio features. To be negotiated with
> the
> > + * driver ones. rte_vhost2 will append a couple of generic feature bits
> > + * which are required by the Virtio spec. TODO list these features here
> > + * \return 0 on success, negative errno otherwise
> > + */
> > +int rte_vhost2_tgt_register(const char *trtype, const char *trid,
> > +			    uint64_t trflags, void *trctx,
> > +			    struct rte_vhost2_tgt_ops *tgt_ops,
> > +			    uint64_t features);
> 
> Couldn't the register API also pass the vdev?
> Doing this, the backend could have rte_vhost2_dev in its device
> struct.

Please notice the register API is for registering targets, not devices. A single Vhost-user server target can spawn multiple devices - one for each connection. I know the nomenclature is different from rte_vhost, but since each connection uses its own (virt)queues it makes sense to call things this way.

Initially I thought about adding some rte_vhost2_tgt struct declaration that register function would return, but later on came to a conclusion that it would only complicate things for the library user. A parent struct that would keep rte_vhost2_tgt* needs to contain `const char *trtype` and `const char *trid` anyway, so it's just easier to use these two strings for target identification. 

> > <snip>
> > +/**
> > + * Bypass VIRTIO_F_IOMMU_PLATFORM and translate gpa directly.
> > + *
> > + * This function is thread-safe.
> > + *
> > + * \param mem vhost device memory
> > + * \param gpa guest physical address
> > + * \param len length of the memory to translate (in bytes). If
> requested
> > + * memory chunk crosses memory region boundary, the *len will be
> set to
> > + * the remaining, maximum length of virtually contiguous memory. In
> such
> > + * case the user will be required to call another gpa_to_vva(gpa +
> *len).
> > + * \return vhost virtual address or NULL if requested `gpa` is not
> mapped.
> > + */
> > +static inline void *
> > +rte_vhost2_gpa_to_vva(struct rte_vhost2_memory *mem, uint64_t
> gpa, uint64_t *len)
> > +{
> > +	struct rte_vhost2_mem_region *r;
> > +	uint32_t i;
> > +
> > +	for (i = 0; i < mem->nregions; i++) {
> > +		r = &mem->regions[i];
> > +		if (gpa >= r->guest_phys_addr &&
> > +		    gpa <  r->guest_phys_addr + r->size) {
> > +
> > +			if (unlikely(*len > r->guest_phys_addr + r->size -
> gpa)) {
> > +				*len = r->guest_phys_addr + r->size - gpa;
> > +			}
> > +
> > +			return gpa - r->guest_phys_addr + r-
> >host_user_addr;
> > +		}
> > +	}
> > +	*len = 0;
> > +
> > +	return 0;
> > +}
> 
> Maybe we could take the opportunity to only have
> rte_vhost2_iova_to_vva.

Good idea; will remove it in v3.

Thanks,
D.