Subject: Re: [PATCH 2/2] vhost: re-introducing metadata acceleration through kernel virtual address
To: "Michael S. Tsirkin"
Tsirkin" Cc: kvm@vger.kernel.org, virtualization@lists.linux-foundation.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, jgg@mellanox.com, aarcange@redhat.com, jglisse@redhat.com, linux-mm@kvack.org, James Bottomley , Christoph Hellwig , David Miller , linux-arm-kernel@lists.infradead.org, linux-parisc@vger.kernel.org References: <20190905122736.19768-1-jasowang@redhat.com> <20190905122736.19768-3-jasowang@redhat.com> <20190908063618-mutt-send-email-mst@kernel.org> From: Jason Wang Message-ID: <1cb5aa8d-6213-5fce-5a77-fcada572c882@redhat.com> Date: Mon, 9 Sep 2019 10:18:57 +0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.8.0 MIME-Version: 1.0 In-Reply-To: <20190908063618-mutt-send-email-mst@kernel.org> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.6.2 (mx1.redhat.com [10.5.110.68]); Mon, 09 Sep 2019 02:19:35 +0000 (UTC) Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 2019/9/8 =E4=B8=8B=E5=8D=887:05, Michael S. Tsirkin wrote: > On Thu, Sep 05, 2019 at 08:27:36PM +0800, Jason Wang wrote: >> This is a rework on the commit 7f466032dc9e ("vhost: access vq >> metadata through kernel virtual address"). >> >> It was noticed that the copy_to/from_user() friends that was used to >> access virtqueue metdata tends to be very expensive for dataplane >> implementation like vhost since it involves lots of software checks, >> speculation barriers, > So if we drop speculation barrier, > there's a problem here in access will now be speculated. > This effectively disables the defence in depth effect of > b3bbfb3fb5d25776b8e3f361d2eedaabb0b496cd > x86: Introduce __uaccess_begin_nospec() and uaccess_try_nospec > > > So now we need to sprinkle array_index_nospec or barrier_nospec over th= e > code whenever we use an index we got from userspace. > See below for some examples. > > >> hardware feature toggling (e.g SMAP). The >> extra cost will be more obvious when transferring small packets since >> the time spent on metadata accessing become more significant. >> >> This patch tries to eliminate those overheads by accessing them >> through direct mapping of those pages. Invalidation callbacks is >> implemented for co-operation with general VM management (swap, KSM, >> THP or NUMA balancing). We will try to get the direct mapping of vq >> metadata before each round of packet processing if it doesn't >> exist. If we fail, we will simplely fallback to copy_to/from_user() >> friends. >> >> This invalidation, direct mapping access and set are synchronized >> through spinlock. This takes a step back from the original commit >> 7f466032dc9e ("vhost: access vq metadata through kernel virtual >> address") which tries to RCU which is suspicious and hard to be >> reviewed. This won't perform as well as RCU because of the atomic, >> this could be addressed by the future optimization. >> >> This method might does not work for high mem page which requires >> temporary mapping so we just fallback to normal >> copy_to/from_user() and may not for arch that has virtual tagged cache >> since extra cache flushing is needed to eliminate the alias. This will >> result complex logic and bad performance. For those archs, this patch >> simply go for copy_to/from_user() friends. 
>> hardware feature toggling (e.g. SMAP). The
>> extra cost will be more obvious when transferring small packets, since
>> the time spent on metadata accesses becomes more significant.
>>
>> This patch tries to eliminate those overheads by accessing the metadata
>> through a direct mapping of those pages. Invalidation callbacks are
>> implemented for co-operation with general VM management (swap, KSM,
>> THP or NUMA balancing). We will try to get the direct mapping of vq
>> metadata before each round of packet processing if it doesn't
>> exist. If we fail, we simply fall back to the copy_to/from_user()
>> friends.
>>
>> The invalidation, direct mapping access and setup are synchronized
>> through a spinlock. This is a step back from the original commit
>> 7f466032dc9e ("vhost: access vq metadata through kernel virtual
>> address"), which tried to use RCU, which was suspicious and hard to
>> review. This won't perform as well as RCU because of the atomic;
>> that could be addressed by future optimization.
>>
>> This method does not work for highmem pages, which require a
>> temporary mapping, so we just fall back to the normal
>> copy_to/from_user(), and it may not work for archs with virtually
>> tagged caches, since extra cache flushing would be needed to eliminate
>> the alias. That would result in complex logic and bad performance.
>> For those archs, this patch simply goes for the copy_to/from_user()
>> friends. This is done by ruling out the kernel-mapping code through
>> ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE.
>>
>> Note that this is only done when the device IOTLB is not enabled. We
>> could use a similar method to optimize the IOTLB in the future.
>>
>> Tests show at most about 22% improvement in TX PPS when using
>> virtio-user + vhost_net + xdp1 + TAP on a 4.0GHz Kaby Lake.
>>
>>         SMAP on | SMAP off
>> Before: 4.9Mpps | 6.9Mpps
>> After:  6.0Mpps | 7.5Mpps
>>
>> On an older Sandy Bridge CPU without SMAP support, TX PPS doesn't see
>> any difference.
> Why isn't Kaby Lake with SMAP off the same as Sandy Bridge?

I don't know, I guess it was because the atomic is l

>
>
>> Cc: Andrea Arcangeli
>> Cc: James Bottomley
>> Cc: Christoph Hellwig
>> Cc: David Miller
>> Cc: Jerome Glisse
>> Cc: Jason Gunthorpe
>> Cc: linux-mm@kvack.org
>> Cc: linux-arm-kernel@lists.infradead.org
>> Cc: linux-parisc@vger.kernel.org
>> Signed-off-by: Jason Wang
>> Signed-off-by: Michael S. Tsirkin
>> ---
>>  drivers/vhost/vhost.c | 551 +++++++++++++++++++++++++++++++++++++++++-
>>  drivers/vhost/vhost.h |  41 ++++
>>  2 files changed, 589 insertions(+), 3 deletions(-)
>>
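Before diving into the diff, here is a much-simplified sketch (mine, not code from the
patch) of the two paths a metadata read can take: the classic uaccess path with its
checks and SMAP toggling, versus a plain load through the pinned kernel mapping,
which is only legal while the mmu_lock-protected map pointer is non-NULL.

```c
/* Simplified illustration of the idea, not the actual patch code. */
static int example_get_used_idx(struct vhost_virtqueue *vq, __virtio16 *idx)
{
        struct vhost_map *map;

        spin_lock(&vq->mmu_lock);
        map = vq->maps[VHOST_ADDR_USED];
        if (map) {
                struct vring_used *used = map->addr;

                *idx = used->idx;       /* plain load, no STAC/CLAC or checks */
                spin_unlock(&vq->mmu_lock);
                return 0;
        }
        spin_unlock(&vq->mmu_lock);

        /* Fallback: classic uaccess path with all of its overhead. */
        return get_user(*idx, &vq->used->idx);
}
```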
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index 791562e03fe0..f98155f28f02 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -298,6 +298,182 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
>>  		__vhost_vq_meta_reset(d->vqs[i]);
>>  }
>>
>> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> +static void vhost_map_unprefetch(struct vhost_map *map)
>> +{
>> +	kfree(map->pages);
>> +	kfree(map);
>> +}
>> +
>> +static void vhost_set_map_dirty(struct vhost_virtqueue *vq,
>> +				struct vhost_map *map, int index)
>> +{
>> +	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
>> +	int i;
>> +
>> +	if (uaddr->write) {
>> +		for (i = 0; i < map->npages; i++)
>> +			set_page_dirty(map->pages[i]);
>> +	}
>> +}
>> +
>> +static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
>> +{
>> +	struct vhost_map *map[VHOST_NUM_ADDRS];
>> +	int i;
>> +
>> +	spin_lock(&vq->mmu_lock);
>> +	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
>> +		map[i] = vq->maps[i];
>> +		if (map[i]) {
>> +			vhost_set_map_dirty(vq, map[i], i);
>> +			vq->maps[i] = NULL;
>> +		}
>> +	}
>> +	spin_unlock(&vq->mmu_lock);
>> +
>> +	/* No need for synchronization since we are serialized with
>> +	 * memory accessors (e.g vq mutex held).
>> +	 */
>> +
>> +	for (i = 0; i < VHOST_NUM_ADDRS; i++)
>> +		if (map[i])
>> +			vhost_map_unprefetch(map[i]);
>> +
>> +}
>> +
>> +static void vhost_reset_vq_maps(struct vhost_virtqueue *vq)
>> +{
>> +	int i;
>> +
>> +	vhost_uninit_vq_maps(vq);
>> +	for (i = 0; i < VHOST_NUM_ADDRS; i++)
>> +		vq->uaddrs[i].size = 0;
>> +}
>> +
>> +static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
>> +				    unsigned long start,
>> +				    unsigned long end)
>> +{
>> +	if (unlikely(!uaddr->size))
>> +		return false;
>> +
>> +	return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
>> +}
>> +
>> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
>> +{
>> +	spin_lock(&vq->mmu_lock);
>> +}
>> +
>> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
>> +{
>> +	spin_unlock(&vq->mmu_lock);
>> +}
>> +
>> +static int vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>> +				     int index,
>> +				     unsigned long start,
>> +				     unsigned long end,
>> +				     bool blockable)
>> +{
>> +	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
>> +	struct vhost_map *map;
>> +
>> +	if (!vhost_map_range_overlap(uaddr, start, end))
>> +		return 0;
>> +	else if (!blockable)
>> +		return -EAGAIN;
>> +
>> +	spin_lock(&vq->mmu_lock);
>> +	++vq->invalidate_count;
>> +
>> +	map = vq->maps[index];
>> +	if (map)
>> +		vq->maps[index] = NULL;
>> +	spin_unlock(&vq->mmu_lock);
>> +
>> +	if (map) {
>> +		vhost_set_map_dirty(vq, map, index);
>> +		vhost_map_unprefetch(map);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static void vhost_invalidate_vq_end(struct vhost_virtqueue *vq,
>> +				    int index,
>> +				    unsigned long start,
>> +				    unsigned long end)
>> +{
>> +	if (!vhost_map_range_overlap(&vq->uaddrs[index], start, end))
>> +		return;
>> +
>> +	spin_lock(&vq->mmu_lock);
>> +	--vq->invalidate_count;
>> +	spin_unlock(&vq->mmu_lock);
>> +}
>> +
>> +static int vhost_invalidate_range_start(struct mmu_notifier *mn,
>> +					const struct mmu_notifier_range *range)
>> +{
>> +	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
>> +					     mmu_notifier);
>> +	bool blockable = mmu_notifier_range_blockable(range);
>> +	int i, j, ret;
>> +
>> +	for (i = 0; i < dev->nvqs; i++) {
>> +		struct vhost_virtqueue *vq = dev->vqs[i];
>> +
>> +		for (j = 0; j < VHOST_NUM_ADDRS; j++) {
>> +			ret = vhost_invalidate_vq_start(vq, j,
>> +							range->start,
>> +							range->end, blockable);
>> +			if (ret)
>> +				return ret;
>> +		}
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static void vhost_invalidate_range_end(struct mmu_notifier *mn,
>> +				       const struct mmu_notifier_range *range)
>> +{
>> +	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
>> +					     mmu_notifier);
>> +	int i, j;
>> +
>> +	for (i = 0; i < dev->nvqs; i++) {
>> +		struct vhost_virtqueue *vq = dev->vqs[i];
>> +
>> +		for (j = 0; j < VHOST_NUM_ADDRS; j++)
>> +			vhost_invalidate_vq_end(vq, j,
>> +						range->start,
>> +						range->end);
>> +	}
>> +}
>> +
>> +static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
>> +	.invalidate_range_start = vhost_invalidate_range_start,
>> +	.invalidate_range_end = vhost_invalidate_range_end,
>> +};
>> +
>> +static void vhost_init_maps(struct vhost_dev *dev)
>> +{
>> +	struct vhost_virtqueue *vq;
>> +	int i, j;
>> +
>> +	dev->mmu_notifier.ops = &vhost_mmu_notifier_ops;
>> +
>> +	for (i = 0; i < dev->nvqs; ++i) {
>> +		vq = dev->vqs[i];
>> +		for (j = 0; j < VHOST_NUM_ADDRS; j++)
>> +			vq->maps[j] = NULL;
>> +	}
>> +}
>> +#endif
>> +
>>  static void vhost_vq_reset(struct vhost_dev *dev,
>>  			   struct vhost_virtqueue *vq)
>>  {
>> @@ -326,7 +502,11 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>>  	vq->busyloop_timeout = 0;
>>  	vq->umem = NULL;
>>  	vq->iotlb = NULL;
>> +	vq->invalidate_count = 0;
>>  	__vhost_vq_meta_reset(vq);
>> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> +	vhost_reset_vq_maps(vq);
>> +#endif
>>  }
>>
>>  static int vhost_worker(void *data)
>> @@ -471,12 +651,15 @@ void vhost_dev_init(struct vhost_dev *dev,
>>  	dev->iov_limit = iov_limit;
>>  	dev->weight = weight;
>>  	dev->byte_weight = byte_weight;
>> +	dev->has_notifier = false;
>>  	init_llist_head(&dev->work_list);
>>  	init_waitqueue_head(&dev->wait);
>>  	INIT_LIST_HEAD(&dev->read_list);
>>  	INIT_LIST_HEAD(&dev->pending_list);
>>  	spin_lock_init(&dev->iotlb_lock);
>> -
>> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> +	vhost_init_maps(dev);
>> +#endif
>>
>>  	for (i = 0; i < dev->nvqs; ++i) {
>>  		vq = dev->vqs[i];
>> @@ -485,6 +668,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>>  		vq->heads = NULL;
>>  		vq->dev = dev;
>>  		mutex_init(&vq->mutex);
>> +		spin_lock_init(&vq->mmu_lock);
>>  		vhost_vq_reset(dev, vq);
>>  		if (vq->handle_kick)
>>  			vhost_poll_init(&vq->poll, vq->handle_kick,
>> @@ -564,7 +748,19 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
>>  	if (err)
>>  		goto err_cgroup;
>>
>> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> +	err = mmu_notifier_register(&dev->mmu_notifier, dev->mm);
>> +	if (err)
>> +		goto err_mmu_notifier;
>> +#endif
>> +	dev->has_notifier = true;
>> +
>>  	return 0;
>> +
>> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> +err_mmu_notifier:
>> +	vhost_dev_free_iovecs(dev);
>> +#endif
>>  err_cgroup:
>>  	kthread_stop(worker);
>>  	dev->worker = NULL;
>> @@ -655,6 +851,107 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>>  	spin_unlock(&dev->iotlb_lock);
>>  }
>>
>> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> +static void vhost_setup_uaddr(struct vhost_virtqueue *vq,
>> +			      int index, unsigned long uaddr,
>> +			      size_t size, bool write)
>> +{
>> +	struct vhost_uaddr *addr = &vq->uaddrs[index];
>> +
>> +	addr->uaddr = uaddr;
>> +	addr->size = size;
>> +	addr->write = write;
>> +}
>> +
>> +static void vhost_setup_vq_uaddr(struct vhost_virtqueue *vq)
>> +{
>> +	vhost_setup_uaddr(vq, VHOST_ADDR_DESC,
>> +			  (unsigned long)vq->desc,
>> +			  vhost_get_desc_size(vq, vq->num),
>> +			  false);
>> +	vhost_setup_uaddr(vq, VHOST_ADDR_AVAIL,
>> +			  (unsigned long)vq->avail,
>> +			  vhost_get_avail_size(vq, vq->num),
>> +			  false);
>> +	vhost_setup_uaddr(vq, VHOST_ADDR_USED,
>> +			  (unsigned long)vq->used,
>> +			  vhost_get_used_size(vq, vq->num),
>> +			  true);
>> +}
>> +
>> +static int vhost_map_prefetch(struct vhost_virtqueue *vq,
>> +			      int index)
>> +{
>> +	struct vhost_map *map;
>> +	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
>> +	struct page **pages;
>> +	int npages = DIV_ROUND_UP(uaddr->size, PAGE_SIZE);
>> +	int npinned;
>> +	void *vaddr, *v;
>> +	int err;
>> +	int i;
>> +
>> +	spin_lock(&vq->mmu_lock);
>> +
>> +	err = -EFAULT;
>> +	if (vq->invalidate_count)
>> +		goto err;
>> +
>> +	err = -ENOMEM;
>> +	map = kmalloc(sizeof(*map), GFP_ATOMIC);
>> +	if (!map)
>> +		goto err;
>> +
>> +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_ATOMIC);
>> +	if (!pages)
>> +		goto err_pages;
>> +
>> +	err = -EFAULT;
>> +	npinned = __get_user_pages_fast(uaddr->uaddr, npages,
>> +					uaddr->write, pages);
>> +	if (npinned > 0)
>> +		release_pages(pages, npinned);
>> +	if (npinned != npages)
>> +		goto err_gup;
>> +
>> +	for (i = 0; i < npinned; i++)
>> +		if (PageHighMem(pages[i]))
>> +			goto err_gup;
>> +
>> +	vaddr = v = page_address(pages[0]);
>> +
>> +	/* For simplicity, fall back to the userspace address if the VA is
>> +	 * not contiguous.
>> +	 */
>> +	for (i = 1; i < npinned; i++) {
>> +		v += PAGE_SIZE;
>> +		if (v != page_address(pages[i]))
>> +			goto err_gup;
>> +	}
>> +
>> +	map->addr = vaddr + (uaddr->uaddr & (PAGE_SIZE - 1));
>> +	map->npages = npages;
>> +	map->pages = pages;
>> +
>> +	vq->maps[index] = map;
>> +	/* No need for a synchronize_rcu(). This function should be
>> +	 * called by dev->worker so we are serialized with all
>> +	 * readers.
>> +	 */
>> +	spin_unlock(&vq->mmu_lock);
>> +
>> +	return 0;
>> +
>> +err_gup:
>> +	kfree(pages);
>> +err_pages:
>> +	kfree(map);
>> +err:
>> +	spin_unlock(&vq->mmu_lock);
>> +	return err;
>> +}
>> +#endif
>> +
>>  void vhost_dev_cleanup(struct vhost_dev *dev)
>>  {
>>  	int i;
>> @@ -684,8 +981,20 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
>>  		kthread_stop(dev->worker);
>>  		dev->worker = NULL;
>>  	}
>> -	if (dev->mm)
>> +	if (dev->mm) {
>> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> +		if (dev->has_notifier) {
>> +			mmu_notifier_unregister(&dev->mmu_notifier,
>> +						dev->mm);
>> +			dev->has_notifier = false;
>> +		}
>> +#endif
>>  		mmput(dev->mm);
>> +	}
>> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> +	for (i = 0; i < dev->nvqs; i++)
>> +		vhost_uninit_vq_maps(dev->vqs[i]);
>> +#endif
>>  	dev->mm = NULL;
>>  }
>>  EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
>> @@ -914,6 +1223,26 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
>>
>>  static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
>>  {
>> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> +	struct vhost_map *map;
>> +	struct vring_used *used;
>> +
>> +	if (!vq->iotlb) {
>> +		vhost_vq_access_map_begin(vq);
>> +
>> +		map = vq->maps[VHOST_ADDR_USED];
>> +		if (likely(map)) {
>> +			used = map->addr;
>> +			*((__virtio16 *)&used->ring[vq->num]) =
>> +				cpu_to_vhost16(vq, vq->avail_idx);
>> +			vhost_vq_access_map_end(vq);
>> +			return 0;
>> +		}
>> +
>> +		vhost_vq_access_map_end(vq);
>> +	}
>> +#endif
>> +
>>  	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
>> 			      vhost_avail_event(vq));
>>  }
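(Side note for readers not fluent in the vring layout: the cast in
vhost_put_avail_event() above is because, with VIRTIO_RING_F_EVENT_IDX, the
avail_event word sits immediately after the used ring entries. A tiny illustration,
sketched from the virtio ring layout rather than taken from the patch:)

```c
#include <linux/virtio_ring.h>

/* Illustration only: where avail_event lives when
 * VIRTIO_RING_F_EVENT_IDX is negotiated -- right past the used ring
 * entries, which is exactly what &used->ring[num] names.
 */
static inline __virtio16 *example_avail_event_ptr(struct vring_used *used,
                                                  unsigned int num)
{
        return (__virtio16 *)&used->ring[num];
}
```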
>> @@ -922,6 +1251,27 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
>>  				 struct vring_used_elem *head, int idx,
>>  				 int count)
>>  {
>> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> +	struct vhost_map *map;
>> +	struct vring_used *used;
>> +	size_t size;
>> +
>> +	if (!vq->iotlb) {
>> +		vhost_vq_access_map_begin(vq);
>> +
>> +		map = vq->maps[VHOST_ADDR_USED];
>> +		if (likely(map)) {
>> +			used = map->addr;
>> +			size = count * sizeof(*head);
>> +			memcpy(used->ring + idx, head, size);
>> +			vhost_vq_access_map_end(vq);
>> +			return 0;
>> +		}
>> +
>> +		vhost_vq_access_map_end(vq);
>> +	}
>> +#endif
>> +
>>  	return vhost_copy_to_user(vq, vq->used->ring + idx, head,
>>  				  count * sizeof(*head));
>>  }
>> @@ -929,6 +1279,25 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
>>  static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>>
>>  {
>> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> +	struct vhost_map *map;
>> +	struct vring_used *used;
>> +
>> +	if (!vq->iotlb) {
>> +		vhost_vq_access_map_begin(vq);
>> +
>> +		map = vq->maps[VHOST_ADDR_USED];
>> +		if (likely(map)) {
>> +			used = map->addr;
>> +			used->flags = cpu_to_vhost16(vq, vq->used_flags);
>> +			vhost_vq_access_map_end(vq);
>> +			return 0;
>> +		}
>> +
>> +		vhost_vq_access_map_end(vq);
>> +	}
>> +#endif
>> +
>>  	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
>>  			      &vq->used->flags);
>>  }
>> @@ -936,6 +1305,25 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>>  static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
>>
>>  {
>> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> +	struct vhost_map *map;
>> +	struct vring_used *used;
>> +
>> +	if (!vq->iotlb) {
>> +		vhost_vq_access_map_begin(vq);
>> +
>> +		map = vq->maps[VHOST_ADDR_USED];
>> +		if (likely(map)) {
>> +			used = map->addr;
>> +			used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
>> +			vhost_vq_access_map_end(vq);
>> +			return 0;
>> +		}
>> +
>> +		vhost_vq_access_map_end(vq);
>> +	}
>> +#endif
>> +
>>  	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
>>  			      &vq->used->idx);
>>  }
>> @@ -981,12 +1369,50 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
>>  static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
>>  				      __virtio16 *idx)
>>  {
>> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> +	struct vhost_map *map;
>> +	struct vring_avail *avail;
>> +
>> +	if (!vq->iotlb) {
>> +		vhost_vq_access_map_begin(vq);
>> +
>> +		map = vq->maps[VHOST_ADDR_AVAIL];
>> +		if (likely(map)) {
>> +			avail = map->addr;
>> +			*idx = avail->idx;
> index can now be speculated.

[...]

> +		vhost_vq_access_map_begin(vq);
> +
> +		map = vq->maps[VHOST_ADDR_AVAIL];
> +		if (likely(map)) {
> +			avail = map->addr;
> +			*head = avail->ring[idx & (vq->num - 1)];
>
> Since idx can be speculated, I guess we need array_index_nospec here?

So we have

ACQUIRE(mmu_lock)
get idx
RELEASE(mmu_lock)

ACQUIRE(mmu_lock)
read array[idx]
RELEASE(mmu_lock)

Then I think idx can't be speculated, considering we've passed a RELEASE +
ACQUIRE pair?
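(If the unlock/lock pair turns out not to be enough, the change you are suggesting
would presumably look something like the following; my sketch only, untested, and
not part of the patch.)

```c
#include <linux/nospec.h>
#include <linux/virtio_ring.h>

/* Sketch only (untested): read one avail ring entry with the index
 * clamped on the speculative path as well.  "num" is the ring size,
 * which in vhost is always a power of two.
 */
static inline __virtio16 example_get_avail_head(const struct vring_avail *avail,
                                                unsigned int idx,
                                                unsigned int num)
{
        idx = array_index_nospec(idx & (num - 1), num);
        return avail->ring[idx];
}
```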
>
>
>> +			vhost_vq_access_map_end(vq);
>> +			return 0;
>> +		}
>> +
>> +		vhost_vq_access_map_end(vq);
>> +	}
>> +#endif
>> +
>>  	return vhost_get_avail(vq, *head,
>> 			       &vq->avail->ring[idx & (vq->num - 1)]);
>>  }
>> @@ -994,24 +1420,98 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
>>  static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
>>  					__virtio16 *flags)
>>  {
>> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> +	struct vhost_map *map;
>> +	struct vring_avail *avail;
>> +
>> +	if (!vq->iotlb) {
>> +		vhost_vq_access_map_begin(vq);
>> +
>> +		map = vq->maps[VHOST_ADDR_AVAIL];
>> +		if (likely(map)) {
>> +			avail = map->addr;
>> +			*flags = avail->flags;
>> +			vhost_vq_access_map_end(vq);
>> +			return 0;
>> +		}
>> +
>> +		vhost_vq_access_map_end(vq);
>> +	}
>> +#endif
>> +
>>  	return vhost_get_avail(vq, *flags, &vq->avail->flags);
>>  }
>>
>>  static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
>>  				       __virtio16 *event)
>>  {
>> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> +	struct vhost_map *map;
>> +	struct vring_avail *avail;
>> +
>> +	if (!vq->iotlb) {
>> +		vhost_vq_access_map_begin(vq);
>> +		map = vq->maps[VHOST_ADDR_AVAIL];
>> +		if (likely(map)) {
>> +			avail = map->addr;
>> +			*event = (__virtio16)avail->ring[vq->num];
>> +			vhost_vq_access_map_end(vq);
>> +			return 0;
>> +		}
>> +		vhost_vq_access_map_end(vq);
>> +	}
>> +#endif
>> +
>>  	return vhost_get_avail(vq, *event, vhost_used_event(vq));
>>  }
>>
>>  static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
>>  				     __virtio16 *idx)
>>  {
>> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> +	struct vhost_map *map;
>> +	struct vring_used *used;
>> +
>> +	if (!vq->iotlb) {
>> +		vhost_vq_access_map_begin(vq);
>> +
>> +		map = vq->maps[VHOST_ADDR_USED];
>> +		if (likely(map)) {
>> +			used = map->addr;
>> +			*idx = used->idx;
>> +			vhost_vq_access_map_end(vq);
>> +			return 0;
>> +		}
>> +
>> +		vhost_vq_access_map_end(vq);
>> +	}
>> +#endif
>> +
>>  	return vhost_get_used(vq, *idx, &vq->used->idx);
>>  }
>
> This seems to be used during init. Why do we bother
> accelerating this?

Ok, I can remove this part in the next version.

>
>>
>>  static inline int vhost_get_desc(struct vhost_virtqueue *vq,
>>  				 struct vring_desc *desc, int idx)
>>  {
>> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> +	struct vhost_map *map;
>> +	struct vring_desc *d;
>> +
>> +	if (!vq->iotlb) {
>> +		vhost_vq_access_map_begin(vq);
>> +
>> +		map = vq->maps[VHOST_ADDR_DESC];
>> +		if (likely(map)) {
>> +			d = map->addr;
>> +			*desc = *(d + idx);
>
> Since idx can be speculated, I guess we need array_index_nospec here?

This is similar to the above avail idx case.

>
>> +			vhost_vq_access_map_end(vq);
>> +			return 0;
>> +		}
>> +
>> +		vhost_vq_access_map_end(vq);
>> +	}
>> +#endif
>> +
>>  	return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
>>  }
>>
> I also wonder about the userspace address we get eventually.
> It would seem that we need to prevent that from speculating -
> and that seems like a good idea even if this patch isn't
> applied. As you are playing with micro-benchmarks, maybe
> you could try the below patch?

Let me test it. Thanks

> It's unfortunately untested.
> Thanks a lot in advance!
>
> ===>
> vhost: block speculation of translated descriptors
>
> iovec addresses coming from vhost are assumed to be
> pre-validated, but in fact can be speculated to a value
> out of range.
>
> Userspace addresses are later validated with array_index_nospec so we can
> be sure kernel info does not leak through these addresses, but vhost
> must also not leak userspace info outside the allowed memory table to
> guests.
>
> Following the defence in depth principle, make sure
> the address is not validated out of node range.
>
> Signed-off-by: Michael S. Tsirkin
>
> ---
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 5dc174ac8cac..863e25011ef6 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2072,7 +2076,9 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
>  		size = node->size - addr + node->start;
>  		_iov->iov_len = min((u64)len - s, size);
>  		_iov->iov_base = (void __user *)(unsigned long)
> -			(node->userspace_addr + addr - node->start);
> +			(node->userspace_addr +
> +			 array_index_nospec(addr - node->start,
> +					    node->size));
>  		s += size;
>  		addr += size;
>  		++ret;
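(For readers skimming: the hunk above clamps the offset into the memory-table node
before it is added to userspace_addr, so even a speculated out-of-range offset cannot
form an address outside the region. A tiny self-contained illustration of the same
idea, not the actual vhost code:)

```c
#include <linux/nospec.h>
#include <linux/types.h>

/* Illustration only: clamp the offset into a region of "region_size"
 * bytes before forming the final userspace address, so a speculated
 * out-of-range offset can never address memory outside the region.
 */
static inline unsigned long example_translate(unsigned long region_base,
                                              unsigned long region_size,
                                              unsigned long offset)
{
        offset = array_index_nospec(offset, region_size);
        return region_base + offset;
}
```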