Re: [PATCH 2/2] vhost: re-introducing metadata acceleration through kernel virtual address

From: "Michael S. Tsirkin" <mst@redhat.com>
To: Jason Wang <jasowang@redhat.com>
Cc: kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	jgg@mellanox.com, aarcange@redhat.com, jglisse@redhat.com,
	linux-mm@kvack.org,
	James Bottomley <James.Bottomley@hansenpartnership.com>,
	Christoph Hellwig <hch@infradead.org>,
	David Miller <davem@davemloft.net>,
	linux-arm-kernel@lists.infradead.org,
	linux-parisc@vger.kernel.org
Subject: Re: [PATCH 2/2] vhost: re-introducing metadata acceleration through kernel virtual address
Date: Mon, 9 Sep 2019 00:45:40 -0400	[thread overview]
Message-ID: <20190909004504-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <1cb5aa8d-6213-5fce-5a77-fcada572c882@redhat.com>

On Mon, Sep 09, 2019 at 10:18:57AM +0800, Jason Wang wrote:
> 
> On 2019/9/8 下午7:05, Michael S. Tsirkin wrote:
> > On Thu, Sep 05, 2019 at 08:27:36PM +0800, Jason Wang wrote:
> > > This is a rework on the commit 7f466032dc9e ("vhost: access vq
> > > metadata through kernel virtual address").
> > > 
> > > It was noticed that the copy_to/from_user() friends that was used to
> > > access virtqueue metdata tends to be very expensive for dataplane
> > > implementation like vhost since it involves lots of software checks,
> > > speculation barriers,
> > So if we drop speculation barrier,
> > there's a problem here in access will now be speculated.
> > This effectively disables the defence in depth effect of
> > b3bbfb3fb5d25776b8e3f361d2eedaabb0b496cd
> >      x86: Introduce __uaccess_begin_nospec() and uaccess_try_nospec
> > 
> > 
> > So now we need to sprinkle array_index_nospec or barrier_nospec over the
> > code whenever we use an index we got from userspace.
> > See below for some examples.
> > 
> > 
> > > hardware feature toggling (e.g SMAP). The
> > > extra cost will be more obvious when transferring small packets since
> > > the time spent on metadata accessing become more significant.
> > > 
> > > This patch tries to eliminate those overheads by accessing them
> > > through direct mapping of those pages. Invalidation callbacks is
> > > implemented for co-operation with general VM management (swap, KSM,
> > > THP or NUMA balancing). We will try to get the direct mapping of vq
> > > metadata before each round of packet processing if it doesn't
> > > exist. If we fail, we will simplely fallback to copy_to/from_user()
> > > friends.
> > > 
> > > This invalidation, direct mapping access and set are synchronized
> > > through spinlock. This takes a step back from the original commit
> > > 7f466032dc9e ("vhost: access vq metadata through kernel virtual
> > > address") which tries to RCU which is suspicious and hard to be
> > > reviewed. This won't perform as well as RCU because of the atomic,
> > > this could be addressed by the future optimization.
> > > 
> > > This method might does not work for high mem page which requires
> > > temporary mapping so we just fallback to normal
> > > copy_to/from_user() and may not for arch that has virtual tagged cache
> > > since extra cache flushing is needed to eliminate the alias. This will
> > > result complex logic and bad performance. For those archs, this patch
> > > simply go for copy_to/from_user() friends. This is done by ruling out
> > > kernel mapping codes through ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE.
> > > 
> > > Note that this is only done when device IOTLB is not enabled. We
> > > could use similar method to optimize IOTLB in the future.
> > > 
> > > Tests shows at most about 22% improvement on TX PPS when using
> > > virtio-user + vhost_net + xdp1 + TAP on 4.0GHz Kaby Lake.
> > > 
> > >          SMAP on | SMAP off
> > > Before: 4.9Mpps | 6.9Mpps
> > > After:  6.0Mpps | 7.5Mpps
> > > 
> > > On a elder CPU Sandy Bridge without SMAP support. TX PPS doesn't see
> > > any difference.
> > Why is not Kaby Lake with SMAP off the same as Sandy Bridge?
> 
> 
> I don't know, I guess it was because the atomic is l
> 
> 
> > 
> > 
> > > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > > Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
> > > Cc: Christoph Hellwig <hch@infradead.org>
> > > Cc: David Miller <davem@davemloft.net>
> > > Cc: Jerome Glisse <jglisse@redhat.com>
> > > Cc: Jason Gunthorpe <jgg@mellanox.com>
> > > Cc: linux-mm@kvack.org
> > > Cc: linux-arm-kernel@lists.infradead.org
> > > Cc: linux-parisc@vger.kernel.org
> > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > ---
> > >   drivers/vhost/vhost.c | 551 +++++++++++++++++++++++++++++++++++++++++-
> > >   drivers/vhost/vhost.h |  41 ++++
> > >   2 files changed, 589 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > index 791562e03fe0..f98155f28f02 100644
> > > --- a/drivers/vhost/vhost.c
> > > +++ b/drivers/vhost/vhost.c
> > > @@ -298,6 +298,182 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
> > >   		__vhost_vq_meta_reset(d->vqs[i]);
> > >   }
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +static void vhost_map_unprefetch(struct vhost_map *map)
> > > +{
> > > +	kfree(map->pages);
> > > +	kfree(map);
> > > +}
> > > +
> > > +static void vhost_set_map_dirty(struct vhost_virtqueue *vq,
> > > +				struct vhost_map *map, int index)
> > > +{
> > > +	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
> > > +	int i;
> > > +
> > > +	if (uaddr->write) {
> > > +		for (i = 0; i < map->npages; i++)
> > > +			set_page_dirty(map->pages[i]);
> > > +	}
> > > +}
> > > +
> > > +static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
> > > +{
> > > +	struct vhost_map *map[VHOST_NUM_ADDRS];
> > > +	int i;
> > > +
> > > +	spin_lock(&vq->mmu_lock);
> > > +	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> > > +		map[i] = vq->maps[i];
> > > +		if (map[i]) {
> > > +			vhost_set_map_dirty(vq, map[i], i);
> > > +			vq->maps[i] = NULL;
> > > +		}
> > > +	}
> > > +	spin_unlock(&vq->mmu_lock);
> > > +
> > > +	/* No need for synchronization since we are serialized with
> > > +	 * memory accessors (e.g vq mutex held).
> > > +	 */
> > > +
> > > +	for (i = 0; i < VHOST_NUM_ADDRS; i++)
> > > +		if (map[i])
> > > +			vhost_map_unprefetch(map[i]);
> > > +
> > > +}
> > > +
> > > +static void vhost_reset_vq_maps(struct vhost_virtqueue *vq)
> > > +{
> > > +	int i;
> > > +
> > > +	vhost_uninit_vq_maps(vq);
> > > +	for (i = 0; i < VHOST_NUM_ADDRS; i++)
> > > +		vq->uaddrs[i].size = 0;
> > > +}
> > > +
> > > +static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
> > > +				     unsigned long start,
> > > +				     unsigned long end)
> > > +{
> > > +	if (unlikely(!uaddr->size))
> > > +		return false;
> > > +
> > > +	return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
> > > +}
> > > +
> > > +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
> > > +{
> > > +	spin_lock(&vq->mmu_lock);
> > > +}
> > > +
> > > +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
> > > +{
> > > +	spin_unlock(&vq->mmu_lock);
> > > +}
> > > +
> > > +static int vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
> > > +				     int index,
> > > +				     unsigned long start,
> > > +				     unsigned long end,
> > > +				     bool blockable)
> > > +{
> > > +	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
> > > +	struct vhost_map *map;
> > > +
> > > +	if (!vhost_map_range_overlap(uaddr, start, end))
> > > +		return 0;
> > > +	else if (!blockable)
> > > +		return -EAGAIN;
> > > +
> > > +	spin_lock(&vq->mmu_lock);
> > > +	++vq->invalidate_count;
> > > +
> > > +	map = vq->maps[index];
> > > +	if (map)
> > > +		vq->maps[index] = NULL;
> > > +	spin_unlock(&vq->mmu_lock);
> > > +
> > > +	if (map) {
> > > +		vhost_set_map_dirty(vq, map, index);
> > > +		vhost_map_unprefetch(map);
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static void vhost_invalidate_vq_end(struct vhost_virtqueue *vq,
> > > +				    int index,
> > > +				    unsigned long start,
> > > +				    unsigned long end)
> > > +{
> > > +	if (!vhost_map_range_overlap(&vq->uaddrs[index], start, end))
> > > +		return;
> > > +
> > > +	spin_lock(&vq->mmu_lock);
> > > +	--vq->invalidate_count;
> > > +	spin_unlock(&vq->mmu_lock);
> > > +}
> > > +
> > > +static int vhost_invalidate_range_start(struct mmu_notifier *mn,
> > > +					const struct mmu_notifier_range *range)
> > > +{
> > > +	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
> > > +					     mmu_notifier);
> > > +	bool blockable = mmu_notifier_range_blockable(range);
> > > +	int i, j, ret;
> > > +
> > > +	for (i = 0; i < dev->nvqs; i++) {
> > > +		struct vhost_virtqueue *vq = dev->vqs[i];
> > > +
> > > +		for (j = 0; j < VHOST_NUM_ADDRS; j++) {
> > > +			ret = vhost_invalidate_vq_start(vq, j,
> > > +							range->start,
> > > +							range->end, blockable);
> > > +			if (ret)
> > > +				return ret;
> > > +		}
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static void vhost_invalidate_range_end(struct mmu_notifier *mn,
> > > +				       const struct mmu_notifier_range *range)
> > > +{
> > > +	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
> > > +					     mmu_notifier);
> > > +	int i, j;
> > > +
> > > +	for (i = 0; i < dev->nvqs; i++) {
> > > +		struct vhost_virtqueue *vq = dev->vqs[i];
> > > +
> > > +		for (j = 0; j < VHOST_NUM_ADDRS; j++)
> > > +			vhost_invalidate_vq_end(vq, j,
> > > +						range->start,
> > > +						range->end);
> > > +	}
> > > +}
> > > +
> > > +static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
> > > +	.invalidate_range_start = vhost_invalidate_range_start,
> > > +	.invalidate_range_end = vhost_invalidate_range_end,
> > > +};
> > > +
> > > +static void vhost_init_maps(struct vhost_dev *dev)
> > > +{
> > > +	struct vhost_virtqueue *vq;
> > > +	int i, j;
> > > +
> > > +	dev->mmu_notifier.ops = &vhost_mmu_notifier_ops;
> > > +
> > > +	for (i = 0; i < dev->nvqs; ++i) {
> > > +		vq = dev->vqs[i];
> > > +		for (j = 0; j < VHOST_NUM_ADDRS; j++)
> > > +			vq->maps[j] = NULL;
> > > +	}
> > > +}
> > > +#endif
> > > +
> > >   static void vhost_vq_reset(struct vhost_dev *dev,
> > >   			   struct vhost_virtqueue *vq)
> > >   {
> > > @@ -326,7 +502,11 @@ static void vhost_vq_reset(struct vhost_dev *dev,
> > >   	vq->busyloop_timeout = 0;
> > >   	vq->umem = NULL;
> > >   	vq->iotlb = NULL;
> > > +	vq->invalidate_count = 0;
> > >   	__vhost_vq_meta_reset(vq);
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	vhost_reset_vq_maps(vq);
> > > +#endif
> > >   }
> > >   static int vhost_worker(void *data)
> > > @@ -471,12 +651,15 @@ void vhost_dev_init(struct vhost_dev *dev,
> > >   	dev->iov_limit = iov_limit;
> > >   	dev->weight = weight;
> > >   	dev->byte_weight = byte_weight;
> > > +	dev->has_notifier = false;
> > >   	init_llist_head(&dev->work_list);
> > >   	init_waitqueue_head(&dev->wait);
> > >   	INIT_LIST_HEAD(&dev->read_list);
> > >   	INIT_LIST_HEAD(&dev->pending_list);
> > >   	spin_lock_init(&dev->iotlb_lock);
> > > -
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	vhost_init_maps(dev);
> > > +#endif
> > >   	for (i = 0; i < dev->nvqs; ++i) {
> > >   		vq = dev->vqs[i];
> > > @@ -485,6 +668,7 @@ void vhost_dev_init(struct vhost_dev *dev,
> > >   		vq->heads = NULL;
> > >   		vq->dev = dev;
> > >   		mutex_init(&vq->mutex);
> > > +		spin_lock_init(&vq->mmu_lock);
> > >   		vhost_vq_reset(dev, vq);
> > >   		if (vq->handle_kick)
> > >   			vhost_poll_init(&vq->poll, vq->handle_kick,
> > > @@ -564,7 +748,19 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
> > >   	if (err)
> > >   		goto err_cgroup;
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	err = mmu_notifier_register(&dev->mmu_notifier, dev->mm);
> > > +	if (err)
> > > +		goto err_mmu_notifier;
> > > +#endif
> > > +	dev->has_notifier = true;
> > > +
> > >   	return 0;
> > > +
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +err_mmu_notifier:
> > > +	vhost_dev_free_iovecs(dev);
> > > +#endif
> > >   err_cgroup:
> > >   	kthread_stop(worker);
> > >   	dev->worker = NULL;
> > > @@ -655,6 +851,107 @@ static void vhost_clear_msg(struct vhost_dev *dev)
> > >   	spin_unlock(&dev->iotlb_lock);
> > >   }
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +static void vhost_setup_uaddr(struct vhost_virtqueue *vq,
> > > +			      int index, unsigned long uaddr,
> > > +			      size_t size, bool write)
> > > +{
> > > +	struct vhost_uaddr *addr = &vq->uaddrs[index];
> > > +
> > > +	addr->uaddr = uaddr;
> > > +	addr->size = size;
> > > +	addr->write = write;
> > > +}
> > > +
> > > +static void vhost_setup_vq_uaddr(struct vhost_virtqueue *vq)
> > > +{
> > > +	vhost_setup_uaddr(vq, VHOST_ADDR_DESC,
> > > +			  (unsigned long)vq->desc,
> > > +			  vhost_get_desc_size(vq, vq->num),
> > > +			  false);
> > > +	vhost_setup_uaddr(vq, VHOST_ADDR_AVAIL,
> > > +			  (unsigned long)vq->avail,
> > > +			  vhost_get_avail_size(vq, vq->num),
> > > +			  false);
> > > +	vhost_setup_uaddr(vq, VHOST_ADDR_USED,
> > > +			  (unsigned long)vq->used,
> > > +			  vhost_get_used_size(vq, vq->num),
> > > +			  true);
> > > +}
> > > +
> > > +static int vhost_map_prefetch(struct vhost_virtqueue *vq,
> > > +			       int index)
> > > +{
> > > +	struct vhost_map *map;
> > > +	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
> > > +	struct page **pages;
> > > +	int npages = DIV_ROUND_UP(uaddr->size, PAGE_SIZE);
> > > +	int npinned;
> > > +	void *vaddr, *v;
> > > +	int err;
> > > +	int i;
> > > +
> > > +	spin_lock(&vq->mmu_lock);
> > > +
> > > +	err = -EFAULT;
> > > +	if (vq->invalidate_count)
> > > +		goto err;
> > > +
> > > +	err = -ENOMEM;
> > > +	map = kmalloc(sizeof(*map), GFP_ATOMIC);
> > > +	if (!map)
> > > +		goto err;
> > > +
> > > +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_ATOMIC);
> > > +	if (!pages)
> > > +		goto err_pages;
> > > +
> > > +	err = EFAULT;
> > > +	npinned = __get_user_pages_fast(uaddr->uaddr, npages,
> > > +					uaddr->write, pages);
> > > +	if (npinned > 0)
> > > +		release_pages(pages, npinned);
> > > +	if (npinned != npages)
> > > +		goto err_gup;
> > > +
> > > +	for (i = 0; i < npinned; i++)
> > > +		if (PageHighMem(pages[i]))
> > > +			goto err_gup;
> > > +
> > > +	vaddr = v = page_address(pages[0]);
> > > +
> > > +	/* For simplicity, fallback to userspace address if VA is not
> > > +	 * contigious.
> > > +	 */
> > > +	for (i = 1; i < npinned; i++) {
> > > +		v += PAGE_SIZE;
> > > +		if (v != page_address(pages[i]))
> > > +			goto err_gup;
> > > +	}
> > > +
> > > +	map->addr = vaddr + (uaddr->uaddr & (PAGE_SIZE - 1));
> > > +	map->npages = npages;
> > > +	map->pages = pages;
> > > +
> > > +	vq->maps[index] = map;
> > > +	/* No need for a synchronize_rcu(). This function should be
> > > +	 * called by dev->worker so we are serialized with all
> > > +	 * readers.
> > > +	 */
> > > +	spin_unlock(&vq->mmu_lock);
> > > +
> > > +	return 0;
> > > +
> > > +err_gup:
> > > +	kfree(pages);
> > > +err_pages:
> > > +	kfree(map);
> > > +err:
> > > +	spin_unlock(&vq->mmu_lock);
> > > +	return err;
> > > +}
> > > +#endif
> > > +
> > >   void vhost_dev_cleanup(struct vhost_dev *dev)
> > >   {
> > >   	int i;
> > > @@ -684,8 +981,20 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
> > >   		kthread_stop(dev->worker);
> > >   		dev->worker = NULL;
> > >   	}
> > > -	if (dev->mm)
> > > +	if (dev->mm) {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +		if (dev->has_notifier) {
> > > +			mmu_notifier_unregister(&dev->mmu_notifier,
> > > +						dev->mm);
> > > +			dev->has_notifier = false;
> > > +		}
> > > +#endif
> > >   		mmput(dev->mm);
> > > +	}
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	for (i = 0; i < dev->nvqs; i++)
> > > +		vhost_uninit_vq_maps(dev->vqs[i]);
> > > +#endif
> > >   	dev->mm = NULL;
> > >   }
> > >   EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
> > > @@ -914,6 +1223,26 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
> > >   static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
> > >   {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_used *used;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_USED];
> > > +		if (likely(map)) {
> > > +			used = map->addr;
> > > +			*((__virtio16 *)&used->ring[vq->num]) =
> > > +				cpu_to_vhost16(vq, vq->avail_idx);
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >   	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
> > >   			      vhost_avail_event(vq));
> > >   }
> > > @@ -922,6 +1251,27 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
> > >   				 struct vring_used_elem *head, int idx,
> > >   				 int count)
> > >   {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_used *used;
> > > +	size_t size;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_USED];
> > > +		if (likely(map)) {
> > > +			used = map->addr;
> > > +			size = count * sizeof(*head);
> > > +			memcpy(used->ring + idx, head, size);
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >   	return vhost_copy_to_user(vq, vq->used->ring + idx, head,
> > >   				  count * sizeof(*head));
> > >   }
> > > @@ -929,6 +1279,25 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
> > >   static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
> > >   {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_used *used;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_USED];
> > > +		if (likely(map)) {
> > > +			used = map->addr;
> > > +			used->flags = cpu_to_vhost16(vq, vq->used_flags);
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >   	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
> > >   			      &vq->used->flags);
> > >   }
> > > @@ -936,6 +1305,25 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
> > >   static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
> > >   {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_used *used;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_USED];
> > > +		if (likely(map)) {
> > > +			used = map->addr;
> > > +			used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >   	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
> > >   			      &vq->used->idx);
> > >   }
> > > @@ -981,12 +1369,50 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
> > >   static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> > >   				      __virtio16 *idx)
> > >   {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_avail *avail;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_AVAIL];
> > > +		if (likely(map)) {
> > > +			avail = map->addr;
> > > +			*idx = avail->idx;
> > index can now be speculated.
> 
> [...]
> 
> 
> > +		vhost_vq_access_map_begin(vq);
> > +
> > +		map = vq->maps[VHOST_ADDR_AVAIL];
> > +		if (likely(map)) {
> > +			avail = map->addr;
> > +			*head = avail->ring[idx & (vq->num - 1)];
> > 
> > Since idx can be speculated, I guess we need array_index_nospec here?
> 
> 
> So we have
> 
> ACQUIRE(mmu_lock)
> 
> get idx
> 
> RELEASE(mmu_lock)
> 
> ACQUIRE(mmu_lock)
> 
> read array[idx]
> 
> RELEASE(mmu_lock)
> 
> Then I think idx can't be speculated consider we've passed RELEASE +
> ACQUIRE?

I don't think memory barriers have anything to do with speculation,
they are architectural.

> 
> > 
> > 
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >   	return vhost_get_avail(vq, *head,
> > >   			       &vq->avail->ring[idx & (vq->num - 1)]);
> > >   }
> > > @@ -994,24 +1420,98 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> > >   static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> > >   					__virtio16 *flags)
> > >   {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_avail *avail;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_AVAIL];
> > > +		if (likely(map)) {
> > > +			avail = map->addr;
> > > +			*flags = avail->flags;
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >   	return vhost_get_avail(vq, *flags, &vq->avail->flags);
> > >   }
> > >   static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
> > >   				       __virtio16 *event)
> > >   {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_avail *avail;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +		map = vq->maps[VHOST_ADDR_AVAIL];
> > > +		if (likely(map)) {
> > > +			avail = map->addr;
> > > +			*event = (__virtio16)avail->ring[vq->num];
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >   	return vhost_get_avail(vq, *event, vhost_used_event(vq));
> > >   }
> > >   static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
> > >   				     __virtio16 *idx)
> > >   {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_used *used;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_USED];
> > > +		if (likely(map)) {
> > > +			used = map->addr;
> > > +			*idx = used->idx;
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >   	return vhost_get_used(vq, *idx, &vq->used->idx);
> > >   }
> > 
> > This seems to be used during init. Why do we bother
> > accelerating this?
> 
> 
> Ok, I can remove this part in next version.
> 
> 
> > 
> > 
> > >   static inline int vhost_get_desc(struct vhost_virtqueue *vq,
> > >   				 struct vring_desc *desc, int idx)
> > >   {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_desc *d;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_DESC];
> > > +		if (likely(map)) {
> > > +			d = map->addr;
> > > +			*desc = *(d + idx);
> > 
> > Since idx can be speculated, I guess we need array_index_nospec here?
> 
> 
> This is similar to the above avail idx case.
> 
> 
> > 
> > 
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >   	return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
> > >   }
> > I also wonder about the userspace address we get eventualy.
> > It would seem that we need to prevent that from speculating -
> > and that seems like a good idea even if this patch isn't
> > applied. As you are playing with micro-benchmarks, maybe
> > you could the below patch?
> 
> 
> Let me test it.
> 
> Thanks
> 
> 
> > It's unfortunately untested.
> > Thanks a lot in advance!
> > 
> > ===>
> > vhost: block speculation of translated descriptors
> > 
> > iovec addresses coming from vhost are assumed to be
> > pre-validated, but in fact can be speculated to a value
> > out of range.
> > 
> > Userspace address are later validated with array_index_nospec so we can
> > be sure kernel info does not leak through these addresses, but vhost
> > must also not leak userspace info outside the allowed memory table to
> > guests.
> > 
> > Following the defence in depth principle, make sure
> > the address is not validated out of node range.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > ---
> > 
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 5dc174ac8cac..863e25011ef6 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -2072,7 +2076,9 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
> >   		size = node->size - addr + node->start;
> >   		_iov->iov_len = min((u64)len - s, size);
> >   		_iov->iov_base = (void __user *)(unsigned long)
> > -			(node->userspace_addr + addr - node->start);
> > +			(node->userspace_addr +
> > +			 array_index_nospec(addr - node->start,
> > +					    node->size));
> >   		s += size;
> >   		addr += size;
> >   		++ret;