From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.0 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0B47CC43387 for ; Mon, 7 Jan 2019 08:40:30 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id A00FF206B7 for ; Mon, 7 Jan 2019 08:40:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726614AbfAGIk2 (ORCPT ); Mon, 7 Jan 2019 03:40:28 -0500 Received: from mx1.redhat.com ([209.132.183.28]:56484 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726453AbfAGIk1 (ORCPT ); Mon, 7 Jan 2019 03:40:27 -0500 Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com [10.5.11.13]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 1D1EA356E5; Mon, 7 Jan 2019 08:40:27 +0000 (UTC) Received: from [10.72.12.209] (ovpn-12-209.pek2.redhat.com [10.72.12.209]) by smtp.corp.redhat.com (Postfix) with ESMTPS id C31AA608C5; Mon, 7 Jan 2019 08:40:21 +0000 (UTC) Subject: Re: [RFC PATCH V3 5/5] vhost: access vq metadata through kernel virtual address To: "Michael S. Tsirkin" Cc: kvm@vger.kernel.org, virtualization@lists.linux-foundation.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, davem@davemloft.net References: <20181229124656.3900-1-jasowang@redhat.com> <20181229124656.3900-6-jasowang@redhat.com> <20190104162600-mutt-send-email-mst@kernel.org> From: Jason Wang Message-ID: Date: Mon, 7 Jan 2019 16:40:19 +0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.2.1 MIME-Version: 1.0 In-Reply-To: <20190104162600-mutt-send-email-mst@kernel.org> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit Content-Language: en-US X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.30]); Mon, 07 Jan 2019 08:40:27 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 2019/1/5 上午5:34, Michael S. Tsirkin wrote: > On Sat, Dec 29, 2018 at 08:46:56PM +0800, Jason Wang wrote: >> It was noticed that the copy_user() friends that was used to access >> virtqueue metdata tends to be very expensive for dataplane >> implementation like vhost since it involves lots of software checks, >> speculation barrier, hardware feature toggling (e.g SMAP). The >> extra cost will be more obvious when transferring small packets since >> the time spent on metadata accessing become significant.. >> >> This patch tries to eliminate those overhead by accessing them through >> kernel virtual address by vmap(). To make the pages can be migrated, >> instead of pinning them through GUP, we use mmu notifiers to >> invalidate vmaps and re-establish vmaps during each round of metadata >> prefetching in necessary. For devices that doesn't use metadata >> prefetching, the memory acessors fallback to normal copy_user() >> implementation gracefully. The invalidation was synchronized with >> datapath through vq mutex, and in order to avoid hold vq mutex during >> range checking, MMU notifier was teared down when trying to modify vq >> metadata. >> >> Note that this was only done when device IOTLB is not enabled. We >> could use similar method to optimize it in the future. >> >> Tests shows about ~24% improvement on TX PPS when using virtio-user + >> vhost_net + xdp1 on TAP: >> >> Before: ~5.0Mpps >> After: ~6.1Mpps >> >> Signed-off-by: Jason Wang >> --- >> drivers/vhost/vhost.c | 263 +++++++++++++++++++++++++++++++++++++++++- >> drivers/vhost/vhost.h | 13 +++ >> 2 files changed, 274 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c >> index 54b43feef8d9..e1ecb8acf8a3 100644 >> --- a/drivers/vhost/vhost.c >> +++ b/drivers/vhost/vhost.c >> @@ -440,6 +440,9 @@ void vhost_dev_init(struct vhost_dev *dev, >> vq->indirect = NULL; >> vq->heads = NULL; >> vq->dev = dev; >> + memset(&vq->avail_ring, 0, sizeof(vq->avail_ring)); >> + memset(&vq->used_ring, 0, sizeof(vq->used_ring)); >> + memset(&vq->desc_ring, 0, sizeof(vq->desc_ring)); >> mutex_init(&vq->mutex); >> vhost_vq_reset(dev, vq); >> if (vq->handle_kick) >> @@ -510,6 +513,73 @@ static size_t vhost_get_desc_size(struct vhost_virtqueue *vq, int num) >> return sizeof(*vq->desc) * num; >> } >> >> +static void vhost_uninit_vmap(struct vhost_vmap *map) >> +{ >> + if (map->addr) >> + vunmap(map->unmap_addr); >> + >> + map->addr = NULL; >> + map->unmap_addr = NULL; >> +} >> + >> +static int vhost_invalidate_vmap(struct vhost_virtqueue *vq, >> + struct vhost_vmap *map, >> + unsigned long ustart, >> + size_t size, >> + unsigned long start, >> + unsigned long end, >> + bool blockable) >> +{ >> + if (end < ustart || start > ustart - 1 + size) >> + return 0; >> + >> + if (!blockable) >> + return -EAGAIN; >> + >> + mutex_lock(&vq->mutex); >> + vhost_uninit_vmap(map); >> + mutex_unlock(&vq->mutex); >> + >> + return 0; >> +} >> + >> +static int vhost_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, >> + struct mm_struct *mm, >> + unsigned long start, >> + unsigned long end, >> + bool blockable) >> +{ >> + struct vhost_dev *dev = container_of(mn, struct vhost_dev, >> + mmu_notifier); >> + int i; >> + >> + for (i = 0; i < dev->nvqs; i++) { >> + struct vhost_virtqueue *vq = dev->vqs[i]; >> + >> + if (vhost_invalidate_vmap(vq, &vq->avail_ring, >> + (unsigned long)vq->avail, >> + vhost_get_avail_size(vq, vq->num), >> + start, end, blockable)) >> + return -EAGAIN; >> + if (vhost_invalidate_vmap(vq, &vq->desc_ring, >> + (unsigned long)vq->desc, >> + vhost_get_desc_size(vq, vq->num), >> + start, end, blockable)) >> + return -EAGAIN; >> + if (vhost_invalidate_vmap(vq, &vq->used_ring, >> + (unsigned long)vq->used, >> + vhost_get_used_size(vq, vq->num), >> + start, end, blockable)) >> + return -EAGAIN; >> + } >> + >> + return 0; >> +} >> + >> +static const struct mmu_notifier_ops vhost_mmu_notifier_ops = { >> + .invalidate_range_start = vhost_mmu_notifier_invalidate_range_start, >> +}; >> + >> /* Caller should have device mutex */ >> long vhost_dev_set_owner(struct vhost_dev *dev) >> { >> @@ -541,7 +611,14 @@ long vhost_dev_set_owner(struct vhost_dev *dev) >> if (err) >> goto err_cgroup; >> >> + dev->mmu_notifier.ops = &vhost_mmu_notifier_ops; >> + err = mmu_notifier_register(&dev->mmu_notifier, dev->mm); >> + if (err) >> + goto err_mmu_notifier; >> + >> return 0; >> +err_mmu_notifier: >> + vhost_dev_free_iovecs(dev); >> err_cgroup: >> kthread_stop(worker); >> dev->worker = NULL; >> @@ -632,6 +709,72 @@ static void vhost_clear_msg(struct vhost_dev *dev) >> spin_unlock(&dev->iotlb_lock); >> } >> >> +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr, >> + size_t size, int write) >> +{ >> + struct page **pages; >> + int npages = DIV_ROUND_UP(size, PAGE_SIZE); >> + int npinned; >> + void *vaddr; >> + int err = 0; >> + >> + pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL); >> + if (!pages) >> + return -ENOMEM; >> + >> + npinned = get_user_pages_fast(uaddr, npages, write, pages); >> + if (npinned != npages) { >> + err = -EFAULT; >> + goto err; >> + } >> + >> + vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL); >> + if (!vaddr) { >> + err = EFAULT; >> + goto err; >> + } >> + >> + map->addr = vaddr + (uaddr & (PAGE_SIZE - 1)); >> + map->unmap_addr = vaddr; >> + >> +err: >> + /* Don't pin pages, mmu notifier will notify us about page >> + * migration. >> + */ >> + if (npinned > 0) >> + release_pages(pages, npinned); >> + kfree(pages); >> + return err; >> +} >> + >> +static void vhost_clean_vmaps(struct vhost_virtqueue *vq) >> +{ >> + vhost_uninit_vmap(&vq->avail_ring); >> + vhost_uninit_vmap(&vq->desc_ring); >> + vhost_uninit_vmap(&vq->used_ring); >> +} >> + >> +static int vhost_setup_avail_vmap(struct vhost_virtqueue *vq, >> + unsigned long avail) >> +{ >> + return vhost_init_vmap(&vq->avail_ring, avail, >> + vhost_get_avail_size(vq, vq->num), false); >> +} >> + >> +static int vhost_setup_desc_vmap(struct vhost_virtqueue *vq, >> + unsigned long desc) >> +{ >> + return vhost_init_vmap(&vq->desc_ring, desc, >> + vhost_get_desc_size(vq, vq->num), false); >> +} >> + >> +static int vhost_setup_used_vmap(struct vhost_virtqueue *vq, >> + unsigned long used) >> +{ >> + return vhost_init_vmap(&vq->used_ring, used, >> + vhost_get_used_size(vq, vq->num), true); >> +} >> + >> void vhost_dev_cleanup(struct vhost_dev *dev) >> { >> int i; >> @@ -661,8 +804,12 @@ void vhost_dev_cleanup(struct vhost_dev *dev) >> kthread_stop(dev->worker); >> dev->worker = NULL; >> } >> - if (dev->mm) >> + if (dev->mm) { >> + mmu_notifier_unregister(&dev->mmu_notifier, dev->mm); >> mmput(dev->mm); >> + } >> + for (i = 0; i < dev->nvqs; i++) >> + vhost_clean_vmaps(dev->vqs[i]); >> dev->mm = NULL; >> } >> EXPORT_SYMBOL_GPL(vhost_dev_cleanup); >> @@ -891,6 +1038,16 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq, >> >> static inline int vhost_put_avail_event(struct vhost_virtqueue *vq) >> { >> + if (!vq->iotlb) { > Do we have to limit this to !iotlb? No need, but for simplicity I leave this for the future. > >> + struct vring_used *used = vq->used_ring.addr; >> + >> + if (likely(used)) { >> + *((__virtio16 *)&used->ring[vq->num]) = >> + cpu_to_vhost16(vq, vq->avail_idx); > So here we are modifying userspace memory without marking it dirty. > Is this OK? And why? Probably not, any suggestion to fix this? > > >> + return 0; >> + } >> + } >> + >> return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx), >> vhost_avail_event(vq)); >> } >> @@ -899,6 +1056,16 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq, >> struct vring_used_elem *head, int idx, >> int count) >> { >> + if (!vq->iotlb) { >> + struct vring_used *used = vq->used_ring.addr; >> + >> + if (likely(used)) { >> + memcpy(used->ring + idx, head, >> + count * sizeof(*head)); >> + return 0; > Same here. > >> + } >> + } >> + >> return vhost_copy_to_user(vq, vq->used->ring + idx, head, >> count * sizeof(*head)); >> } >> @@ -906,6 +1073,15 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq, >> static inline int vhost_put_used_flags(struct vhost_virtqueue *vq) >> >> { >> + if (!vq->iotlb) { >> + struct vring_used *used = vq->used_ring.addr; >> + >> + if (likely(used)) { >> + used->flags = cpu_to_vhost16(vq, vq->used_flags); >> + return 0; >> + } >> + } >> + >> return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags), >> &vq->used->flags); >> } >> @@ -913,6 +1089,15 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq) >> static inline int vhost_put_used_idx(struct vhost_virtqueue *vq) >> >> { >> + if (!vq->iotlb) { >> + struct vring_used *used = vq->used_ring.addr; >> + >> + if (likely(used)) { >> + used->idx = cpu_to_vhost16(vq, vq->last_used_idx); >> + return 0; >> + } >> + } >> + >> return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx), >> &vq->used->idx); >> } >> @@ -958,12 +1143,30 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d) >> static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq, >> __virtio16 *idx) >> { >> + if (!vq->iotlb) { >> + struct vring_avail *avail = vq->avail_ring.addr; >> + >> + if (likely(avail)) { >> + *idx = avail->idx; >> + return 0; >> + } >> + } >> + >> return vhost_get_avail(vq, *idx, &vq->avail->idx); >> } >> >> static inline int vhost_get_avail_head(struct vhost_virtqueue *vq, >> __virtio16 *head, int idx) >> { >> + if (!vq->iotlb) { >> + struct vring_avail *avail = vq->avail_ring.addr; >> + >> + if (likely(avail)) { >> + *head = avail->ring[idx & (vq->num - 1)]; >> + return 0; >> + } >> + } >> + >> return vhost_get_avail(vq, *head, >> &vq->avail->ring[idx & (vq->num - 1)]); >> } >> @@ -971,24 +1174,60 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq, >> static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq, >> __virtio16 *flags) >> { >> + if (!vq->iotlb) { >> + struct vring_avail *avail = vq->avail_ring.addr; >> + >> + if (likely(avail)) { >> + *flags = avail->flags; >> + return 0; >> + } >> + } >> + >> return vhost_get_avail(vq, *flags, &vq->avail->flags); >> } >> >> static inline int vhost_get_used_event(struct vhost_virtqueue *vq, >> __virtio16 *event) >> { >> + if (!vq->iotlb) { >> + struct vring_avail *avail = vq->avail_ring.addr; >> + >> + if (likely(avail)) { >> + *event = (__virtio16)avail->ring[vq->num]; >> + return 0; >> + } >> + } >> + >> return vhost_get_avail(vq, *event, vhost_used_event(vq)); >> } >> >> static inline int vhost_get_used_idx(struct vhost_virtqueue *vq, >> __virtio16 *idx) >> { >> + if (!vq->iotlb) { >> + struct vring_used *used = vq->used_ring.addr; >> + >> + if (likely(used)) { >> + *idx = used->idx; >> + return 0; >> + } >> + } >> + >> return vhost_get_used(vq, *idx, &vq->used->idx); >> } >> >> static inline int vhost_get_desc(struct vhost_virtqueue *vq, >> struct vring_desc *desc, int idx) >> { >> + if (!vq->iotlb) { >> + struct vring_desc *d = vq->desc_ring.addr; >> + >> + if (likely(d)) { >> + *desc = *(d + idx); >> + return 0; >> + } >> + } >> + >> return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc)); >> } >> >> @@ -1325,8 +1564,16 @@ int vq_meta_prefetch(struct vhost_virtqueue *vq) >> { >> unsigned int num = vq->num; >> >> - if (!vq->iotlb) >> + if (!vq->iotlb) { >> + if (unlikely(!vq->avail_ring.addr)) >> + vhost_setup_avail_vmap(vq, (unsigned long)vq->avail); >> + if (unlikely(!vq->desc_ring.addr)) >> + vhost_setup_desc_vmap(vq, (unsigned long)vq->desc); >> + if (unlikely(!vq->used_ring.addr)) >> + vhost_setup_used_vmap(vq, (unsigned long)vq->used); >> + >> return 1; >> + } >> >> return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc, >> vhost_get_desc_size(vq, num), VHOST_ADDR_DESC) && >> @@ -1478,6 +1725,13 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg >> >> mutex_lock(&vq->mutex); >> >> + /* Unregister MMU notifer to allow invalidation callback >> + * can access vq->avail, vq->desc , vq->used and vq->num >> + * without holding vq->mutex. >> + */ >> + if (d->mm) >> + mmu_notifier_unregister(&d->mmu_notifier, d->mm); >> + >> switch (ioctl) { >> case VHOST_SET_VRING_NUM: >> /* Resizing ring with an active backend? >> @@ -1494,6 +1748,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg >> r = -EINVAL; >> break; >> } >> + vhost_clean_vmaps(vq); >> vq->num = s.num; >> break; >> case VHOST_SET_VRING_BASE: >> @@ -1571,6 +1826,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg >> } >> } >> >> + vhost_clean_vmaps(vq); >> + >> vq->log_used = !!(a.flags & (0x1 << VHOST_VRING_F_LOG)); >> vq->desc = (void __user *)(unsigned long)a.desc_user_addr; >> vq->avail = (void __user *)(unsigned long)a.avail_user_addr; >> @@ -1651,6 +1908,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg >> if (pollstart && vq->handle_kick) >> r = vhost_poll_start(&vq->poll, vq->kick); >> >> + if (d->mm) >> + mmu_notifier_register(&d->mmu_notifier, d->mm); >> mutex_unlock(&vq->mutex); >> >> if (pollstop && vq->handle_kick) >> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h >> index 0d1ff977a43e..00f016a4f198 100644 >> --- a/drivers/vhost/vhost.h >> +++ b/drivers/vhost/vhost.h >> @@ -12,6 +12,8 @@ >> #include >> #include >> #include >> +#include >> +#include >> >> struct vhost_work; >> typedef void (*vhost_work_fn_t)(struct vhost_work *work); >> @@ -80,6 +82,11 @@ enum vhost_uaddr_type { >> VHOST_NUM_ADDRS = 3, >> }; >> >> +struct vhost_vmap { >> + void *addr; >> + void *unmap_addr; >> +}; >> + > How about using actual types like struct vring_used etc so we get type > safety and do not need to cast on access? Yes, we can. Thanks > > >> /* The virtqueue structure describes a queue attached to a device. */ >> struct vhost_virtqueue { >> struct vhost_dev *dev; >> @@ -90,6 +97,11 @@ struct vhost_virtqueue { >> struct vring_desc __user *desc; >> struct vring_avail __user *avail; >> struct vring_used __user *used; >> + >> + struct vhost_vmap avail_ring; >> + struct vhost_vmap desc_ring; >> + struct vhost_vmap used_ring; >> + >> const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS]; >> struct file *kick; >> struct eventfd_ctx *call_ctx; >> @@ -158,6 +170,7 @@ struct vhost_msg_node { >> >> struct vhost_dev { >> struct mm_struct *mm; >> + struct mmu_notifier mmu_notifier; >> struct mutex mutex; >> struct vhost_virtqueue **vqs; >> int nvqs; >> -- >> 2.17.1