From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH v16 Kernel 4/7] vfio iommu: Implementation of ioctl for
 dirty pages tracking.
To: Yan Zhao
References: <1585084732-18473-1-git-send-email-kwankhede@nvidia.com>
 <20200325021135.GB20109@joy-OptiPlex-7040>
From: Kirti Wankhede
Message-ID: <33d38629-aeaf-1c30-26d4-958b998620b0@nvidia.com>
Date: Fri, 27 Mar 2020 03:09:01 +0530
In-Reply-To: <20200325021135.GB20109@joy-OptiPlex-7040>
Cc: Zhengxiao.zx@Alibaba-inc.com, "Tian, Kevin", "Liu, Yi L",
 cjia@nvidia.com, kvm@vger.kernel.org, eskultet@redhat.com, "Yang, Ziye",
 qemu-devel@nongnu.org, cohuck@redhat.com, shuangtai.tst@alibaba-inc.com,
 dgilbert@redhat.com, "Wang, Zhi A", mlevitsk@redhat.com,
 pasic@linux.ibm.com, aik@ozlabs.ru, alex.williamson@redhat.com,
 eauger@redhat.com, felipe@nutanix.com, jonathan.davies@nutanix.com,
 "Liu, Changpeng", Ken.Xue@amd.com

On 3/25/2020 7:41 AM, Yan Zhao wrote:
> On Wed, Mar 25, 2020 at 05:18:52AM +0800, Kirti Wankhede wrote:
>> VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
>> - Start dirty pages tracking while migration is active
>> - Stop dirty pages tracking.
>> - Get dirty pages bitmap. It is the user space application's responsibility
>>   to copy the content of dirty pages from source to destination during
>>   migration.
>>
>> To prevent a DoS attack, memory for the bitmap is allocated per vfio_dma
>> structure. The bitmap size is calculated considering the smallest supported
>> page size. Bitmaps are allocated for all vfio_dmas when dirty logging is
>> enabled.
>>
>> The bitmap is populated for already pinned pages when it is allocated for a
>> vfio_dma, using the smallest supported page size. The bitmap is updated from
>> the pinning functions when tracking is enabled. When a user application
>> queries the bitmap, check whether the requested page size is the same as the
>> page size used to populate the bitmap. If it is equal, copy the bitmap;
>> otherwise return an error.
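
To make the three operations above concrete, here is a minimal user-space
sketch (illustrative only, not part of this patch). It assumes the
VFIO_IOMMU_DIRTY_PAGES uapi added earlier in this series (struct
vfio_iommu_type1_dirty_bitmap with argsz/flags followed by data[], and
struct vfio_iommu_type1_dirty_bitmap_get with the iova, size and bitmap
fields used below), plus a container fd that already has a
VFIO_IOMMU_MAP_DMA mapping of the given iova/size:

/*
 * Sketch only: start tracking, query the bitmap for one mapped range at the
 * smallest supported page size, then stop tracking.
 */
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int query_dirty_pages(int container_fd, uint64_t iova, uint64_t size,
			     uint64_t pgsize)
{
	struct vfio_iommu_type1_dirty_bitmap track = {
		.argsz = sizeof(track),
		.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_START,
	};
	struct vfio_iommu_type1_dirty_bitmap *dbitmap;
	struct vfio_iommu_type1_dirty_bitmap_get *range;
	uint64_t npages = size / pgsize;
	/* 1 bit per page, rounded up to whole u64 words (DIRTY_BITMAP_BYTES) */
	uint64_t bitmap_bytes = ((npages + 63) / 64) * 8;
	void *bitmap_buf;
	int ret = -1;

	/* Start dirty pages tracking: per-vfio_dma bitmaps get allocated. */
	if (ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, &track))
		return -1;

	/* Get the bitmap: header followed by one range descriptor. */
	dbitmap = calloc(1, sizeof(*dbitmap) + sizeof(*range));
	if (!dbitmap)
		goto out_stop;
	dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
	dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
	range = (struct vfio_iommu_type1_dirty_bitmap_get *)dbitmap->data;
	range->iova = iova;		/* must match the vfio_dma exactly */
	range->size = size;
	range->bitmap.pgsize = pgsize;	/* smallest supported page size */
	range->bitmap.size = bitmap_bytes;
	bitmap_buf = calloc(1, bitmap_bytes);
	if (!bitmap_buf)
		goto out_free;
	range->bitmap.data = bitmap_buf;

	ret = ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
	/* On success, bit N set means the page at iova + N * pgsize is dirty. */

	free(bitmap_buf);
out_free:
	free(dbitmap);
out_stop:
	/* Stop tracking: the kernel frees the per-vfio_dma bitmaps. */
	track.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
	ioctl(container_fd, VFIO_IOMMU_DIRTY_PAGES, &track);
	return ret;
}
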
>>
>> Signed-off-by: Kirti Wankhede
>> Reviewed-by: Neo Jia
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 266 +++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 260 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 70aeab921d0f..874a1a7ae925 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -71,6 +71,7 @@ struct vfio_iommu {
>>  	unsigned int dma_avail;
>>  	bool v2;
>>  	bool nesting;
>> +	bool dirty_page_tracking;
>>  };
>>
>>  struct vfio_domain {
>> @@ -91,6 +92,7 @@ struct vfio_dma {
>>  	bool lock_cap;	/* capable(CAP_IPC_LOCK) */
>>  	struct task_struct *task;
>>  	struct rb_root pfn_list;	/* Ex-user pinned pfn list */
>> +	unsigned long *bitmap;
>>  };
>>
>>  struct vfio_group {
>> @@ -125,7 +127,21 @@ struct vfio_regions {
>>  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
>>  					(!list_empty(&iommu->domain_list))
>>
>> +#define DIRTY_BITMAP_BYTES(n)	(ALIGN(n, BITS_PER_TYPE(u64)) / BITS_PER_BYTE)
>> +
>> +/*
>> + * Input argument of number of bits to bitmap_set() is unsigned integer, which
>> + * further casts to signed integer for unaligned multi-bit operation,
>> + * __bitmap_set().
>> + * Then maximum bitmap size supported is 2^31 bits divided by 2^3 bits/byte,
>> + * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K page
>> + * system.
>> + */
>> +#define DIRTY_BITMAP_PAGES_MAX	(uint64_t)(INT_MAX - 1)
>> +#define DIRTY_BITMAP_SIZE_MAX	DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
>> +
>>  static int put_pfn(unsigned long pfn, int prot);
>> +static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu);
>>
>>  /*
>>   * This code handles mapping and unmapping of user data buffers
>> @@ -175,6 +191,77 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
>>  	rb_erase(&old->node, &iommu->dma_list);
>>  }
>>
>> +
>> +static int vfio_dma_bitmap_alloc(struct vfio_dma *dma, uint64_t pgsize)
>> +{
>> +	uint64_t npages = dma->size / pgsize;
>> +
>> +	if (npages > DIRTY_BITMAP_PAGES_MAX)
>> +		return -EINVAL;
>> +
>> +	dma->bitmap = kvzalloc(DIRTY_BITMAP_BYTES(npages), GFP_KERNEL);
>> +	if (!dma->bitmap)
>> +		return -ENOMEM;
>> +
>> +	return 0;
>> +}
>> +
>> +static void vfio_dma_bitmap_free(struct vfio_dma *dma)
>> +{
>> +	kfree(dma->bitmap);
>> +	dma->bitmap = NULL;
>> +}
>> +
>> +static void vfio_dma_populate_bitmap(struct vfio_dma *dma, uint64_t pgsize)
>> +{
>> +	struct rb_node *p;
>> +
>> +	if (RB_EMPTY_ROOT(&dma->pfn_list))
>> +		return;
>> +
>> +	for (p = rb_first(&dma->pfn_list); p; p = rb_next(p)) {
>> +		struct vfio_pfn *vpfn = rb_entry(p, struct vfio_pfn, node);
>> +
>> +		bitmap_set(dma->bitmap, (vpfn->iova - dma->iova) / pgsize, 1);
>> +	}
>> +}
>> +
>> +static int vfio_dma_bitmap_alloc_all(struct vfio_iommu *iommu, uint64_t pgsize)
>> +{
>> +	struct rb_node *n = rb_first(&iommu->dma_list);
>> +
>> +	for (; n; n = rb_next(n)) {
>> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
>> +		int ret;
>> +
>> +		ret = vfio_dma_bitmap_alloc(dma, pgsize);
>> +		if (ret) {
>> +			struct rb_node *p = rb_prev(n);
>> +
>> +			for (; p; p = rb_prev(p)) {
>> +				struct vfio_dma *dma = rb_entry(n,
>> +						struct vfio_dma, node);
>> +
>> +				vfio_dma_bitmap_free(dma);
>> +			}
>> +			return ret;
>> +		}
>> +		vfio_dma_populate_bitmap(dma, pgsize);
>> +	}
>> +	return 0;
>> +}
>> +
>> +static void vfio_dma_bitmap_free_all(struct vfio_iommu *iommu)
>> +{
>> +	struct rb_node *n = rb_first(&iommu->dma_list);
>> +
>> +	for (; n; n = rb_next(n)) {
>> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
>> +
>> +		vfio_dma_bitmap_free(dma);
>> +	}
>> +}
>> +
>>  /*
>>   * Helper Functions for host iova-pfn list
>>   */
>> @@ -567,6 +654,18 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>  			vfio_unpin_page_external(dma, iova, do_accounting);
>>  			goto pin_unwind;
>>  		}
>> +
>> +		if (iommu->dirty_page_tracking) {
>> +			unsigned long pgshift =
>> +					__ffs(vfio_pgsize_bitmap(iommu));
>> +
>> +			/*
>> +			 * Bitmap populated with the smallest supported page
>> +			 * size
>> +			 */
>> +			bitmap_set(dma->bitmap,
>> +				   (vpfn->iova - dma->iova) >> pgshift, 1);
>> +		}
>>  	}
>>
>>  	ret = i;
>> @@ -801,6 +900,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>  	vfio_unmap_unpin(iommu, dma, true);
>>  	vfio_unlink_dma(iommu, dma);
>>  	put_task_struct(dma->task);
>> +	vfio_dma_bitmap_free(dma);
>>  	kfree(dma);
>>  	iommu->dma_avail++;
>>  }
>> @@ -831,6 +931,57 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>>  	return bitmap;
>>  }
>>
>> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
>> +				  size_t size, uint64_t pgsize,
>> +				  u64 __user *bitmap)
>> +{
>> +	struct vfio_dma *dma;
>> +	unsigned long pgshift = __ffs(pgsize);
>> +	unsigned int npages, bitmap_size;
>> +
>> +	dma = vfio_find_dma(iommu, iova, 1);
>> +
>> +	if (!dma)
>> +		return -EINVAL;
>> +
>> +	if (dma->iova != iova || dma->size != size)
>> +		return -EINVAL;
>> +
> Still not sure if it's a good practice.
> I saw the qemu implementation.
> Qemu just iterates the whole IOVA address space.
> It needs to find an IOTLB entry for an IOVA:
> (1) if it can find an IOTLB entry for an IOVA, do the DIRTY_PAGES ioctl and
> increment IOVA by (iotlb.addr_mask + 1)
>
> (2) if no existing IOTLB entry is found, imrc->translate needs to go searching
> the shadow page table to try to generate one. If it still fails (the most
> probable case, as the IOMMU only maps a small part of its address space),
> increment IOVA by 1 page.
>
> So, if the address space width is 39 bits, and if there's only one page
> mapped, you still have to translate IOVA around 2^27 times in each
> query. Isn't it too inefficient?
>

This is the QEMU-side implementation; let's discuss it on the QEMU patches.

> So, IMHO, why could we not just save an rb tree specific to dirty pages, then
> generate a bitmap for each query?

This is looping back to the implementation in the v10 - v12 versions. Problems
with that approach were discussed during v10 to v12 of this series:
- Populating the dirty bitmap at query time adds more CPU cycles.
- Saving those CPU cycles means dirty pages have to be tracked when they are
  pinned or dirtied by the CPU, which is why the per-vfio_dma bitmap was
  introduced. If the queried ranges are not vfio_dma aligned, copying the
  bitmap to user space becomes complicated and inefficient.

So we decided to go with the approach implemented here.

>
>> +	npages = dma->size >> pgshift;
>> +	bitmap_size = DIRTY_BITMAP_BYTES(npages);
>> +
>> +	/* mark all pages dirty if all pages are pinned and mapped. */
>> +	if (dma->iommu_mapped)
>> +		bitmap_set(dma->bitmap, 0, npages);
>> +
>> +	if (copy_to_user((void __user *)bitmap, dma->bitmap, bitmap_size))
>> +		return -EFAULT;
>> +
>> +	/*
>> +	 * Re-populate bitmap to include all pinned pages which are considered
>> +	 * as dirty but exclude pages which are unpinned and pages which are
>> +	 * marked dirty by vfio_dma_rw()
>> +	 */
>> +	bitmap_clear(dma->bitmap, 0, npages);
>> +	vfio_dma_populate_bitmap(dma, pgsize);
> will this also repopulate the bitmap for pinned pages set by pass-through
> devices in patch 07 ?
>

If the pass-through device's driver pins pages using vfio_pin_pages() and all
devices in the group pin pages through vfio_pin_pages(), then
iommu->pinned_page_dirty_scope is set to true and the bitmap is repopulated.

Thanks,
Kirti

>
>> +	return 0;
>> +}
>> +
>> +static int verify_bitmap_size(uint64_t npages, uint64_t bitmap_size)
>> +{
>> +	uint64_t bsize;
>> +
>> +	if (!npages || !bitmap_size || (bitmap_size > DIRTY_BITMAP_SIZE_MAX))
>> +		return -EINVAL;
>> +
>> +	bsize = DIRTY_BITMAP_BYTES(npages);
>> +
>> +	if (bitmap_size < bsize)
>> +		return -EINVAL;
>> +
>> +	return 0;
>> +}
>> +
>>  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>  			     struct vfio_iommu_type1_dma_unmap *unmap)
>>  {
>> @@ -1038,16 +1189,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  	unsigned long vaddr = map->vaddr;
>>  	size_t size = map->size;
>>  	int ret = 0, prot = 0;
>> -	uint64_t mask;
>> +	uint64_t pgsize;
>>  	struct vfio_dma *dma;
>>
>>  	/* Verify that none of our __u64 fields overflow */
>>  	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
>>  		return -EINVAL;
>>
>> -	mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
>> +	pgsize = (uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu));
>>
>> -	WARN_ON(mask & PAGE_MASK);
>> +	WARN_ON((pgsize - 1) & PAGE_MASK);
>>
>>  	/* READ/WRITE from device perspective */
>>  	if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
>> @@ -1055,7 +1206,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  	if (map->flags & VFIO_DMA_MAP_FLAG_READ)
>>  		prot |= IOMMU_READ;
>>
>> -	if (!prot || !size || (size | iova | vaddr) & mask)
>> +	if (!prot || !size || (size | iova | vaddr) & (pgsize - 1))
>>  		return -EINVAL;
>>
>>  	/* Don't allow IOVA or virtual address wrap */
>> @@ -1130,6 +1281,12 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  	else
>>  		ret = vfio_pin_map_dma(iommu, dma, size);
>>
>> +	if (!ret && iommu->dirty_page_tracking) {
>> +		ret = vfio_dma_bitmap_alloc(dma, pgsize);
>> +		if (ret)
>> +			vfio_remove_dma(iommu, dma);
>> +	}
>> +
>>  out_unlock:
>>  	mutex_unlock(&iommu->lock);
>>  	return ret;
>> @@ -2278,6 +2435,93 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>
>>  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>>  			-EFAULT : 0;
>> +	} else if (cmd == VFIO_IOMMU_DIRTY_PAGES) {
>> +		struct vfio_iommu_type1_dirty_bitmap dirty;
>> +		uint32_t mask = VFIO_IOMMU_DIRTY_PAGES_FLAG_START |
>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP |
>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;
>> +		int ret = 0;
>> +
>> +		if (!iommu->v2)
>> +			return -EACCES;
>> +
>> +		minsz = offsetofend(struct vfio_iommu_type1_dirty_bitmap,
>> +				    flags);
>> +
>> +		if (copy_from_user(&dirty, (void __user *)arg, minsz))
>> +			return -EFAULT;
>> +
>> +		if (dirty.argsz < minsz || dirty.flags & ~mask)
>> +			return -EINVAL;
>> +
>> +		/* only one flag should be set at a time */
>> +		if (__ffs(dirty.flags) != __fls(dirty.flags))
>> +			return -EINVAL;
>> +
>> +		if (dirty.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
>> +			uint64_t pgsize = 1 << __ffs(vfio_pgsize_bitmap(iommu));
>> +
>> +			mutex_lock(&iommu->lock);
>> +			if (!iommu->dirty_page_tracking) {
>> +				ret = vfio_dma_bitmap_alloc_all(iommu, pgsize);
>> +				if (!ret)
>> +					iommu->dirty_page_tracking = true;
>> +			}
>> +			mutex_unlock(&iommu->lock);
>> +			return ret;
>> +		} else if (dirty.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
>> +			mutex_lock(&iommu->lock);
>> +			if (iommu->dirty_page_tracking) {
>> +				iommu->dirty_page_tracking = false;
>> +				vfio_dma_bitmap_free_all(iommu);
>> +			}
>> +			mutex_unlock(&iommu->lock);
>> +			return 0;
>> +		} else if (dirty.flags &
>> +				VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
>> +			struct vfio_iommu_type1_dirty_bitmap_get range;
>> +			unsigned long pgshift;
>> +			size_t data_size = dirty.argsz - minsz;
>> +			uint64_t iommu_pgsize =
>> +					1 << __ffs(vfio_pgsize_bitmap(iommu));
>> +
>> +			if (!data_size || data_size < sizeof(range))
>> +				return -EINVAL;
>> +
>> +			if (copy_from_user(&range, (void __user *)(arg + minsz),
>> +					   sizeof(range)))
>> +				return -EFAULT;
>> +
>> +			/* allow only smallest supported pgsize */
>> +			if (range.bitmap.pgsize != iommu_pgsize)
>> +				return -EINVAL;
>> +			if (range.iova & (iommu_pgsize - 1))
>> +				return -EINVAL;
>> +			if (!range.size || range.size & (iommu_pgsize - 1))
>> +				return -EINVAL;
>> +			if (range.iova + range.size < range.iova)
>> +				return -EINVAL;
>> +			if (!access_ok((void __user *)range.bitmap.data,
>> +				       range.bitmap.size))
>> +				return -EINVAL;
>> +
>> +			pgshift = __ffs(range.bitmap.pgsize);
>> +			ret = verify_bitmap_size(range.size >> pgshift,
>> +						 range.bitmap.size);
>> +			if (ret)
>> +				return ret;
>> +
>> +			mutex_lock(&iommu->lock);
>> +			if (iommu->dirty_page_tracking)
>> +				ret = vfio_iova_dirty_bitmap(iommu, range.iova,
>> +					range.size, range.bitmap.pgsize,
>> +					range.bitmap.data);
>> +			else
>> +				ret = -EINVAL;
>> +			mutex_unlock(&iommu->lock);
>> +
>> +			return ret;
>> +		}
>>  	}
>>
>>  	return -ENOTTY;
>> @@ -2345,10 +2589,20 @@ static int vfio_iommu_type1_dma_rw_chunk(struct vfio_iommu *iommu,
>>
>>  	vaddr = dma->vaddr + offset;
>>
>> -	if (write)
>> +	if (write) {
>>  		*copied = __copy_to_user((void __user *)vaddr, data,
>>  					 count) ? 0 : count;
>> -	else
>> +		if (*copied && iommu->dirty_page_tracking) {
>> +			unsigned long pgshift =
>> +				__ffs(vfio_pgsize_bitmap(iommu));
>> +			/*
>> +			 * Bitmap populated with the smallest supported page
>> +			 * size
>> +			 */
>> +			bitmap_set(dma->bitmap, offset >> pgshift,
>> +				   *copied >> pgshift);
>> +		}
>> +	} else
>>  		*copied = __copy_from_user(data, (void __user *)vaddr,
>>  					   count) ? 0 : count;
>>  	if (kthread)
>> --
>> 2.7.0
>>
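
For reference, a quick arithmetic cross-check of the DIRTY_BITMAP_PAGES_MAX /
DIRTY_BITMAP_SIZE_MAX comment in the quoted patch above. This is a standalone
sketch that mirrors the patch's DIRTY_BITMAP_BYTES() calculation rather than
reusing kernel headers:

/* Sketch only: reproduces the bitmap sizing arithmetic in user space. */
#include <inttypes.h>
#include <limits.h>
#include <stdio.h>

/* 1 bit per page, rounded up to a whole number of u64 words,
 * same as DIRTY_BITMAP_BYTES() in the patch. */
static uint64_t dirty_bitmap_bytes(uint64_t npages)
{
	return ((npages + 63) / 64) * 8;
}

int main(void)
{
	uint64_t pgsize = 4096;		/* smallest supported page size */
	uint64_t one_gb = 1ULL << 30;
	uint64_t max_pages = (uint64_t)(INT_MAX - 1);	/* DIRTY_BITMAP_PAGES_MAX */

	/* A 1 GB vfio_dma costs 32 KB of bitmap at 4K pages. */
	printf("1 GB mapping -> %" PRIu64 " bitmap bytes\n",
	       dirty_bitmap_bytes(one_gb / pgsize));

	/* The cap: ~2^31 pages -> 256 MB of bitmap, covering ~8 TB at 4K. */
	printf("max (~%.0f TB covered) -> %" PRIu64 " bitmap bytes\n",
	       (double)max_pages * pgsize / (double)(1ULL << 40),
	       dirty_bitmap_bytes(max_pages));
	return 0;
}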