Subject: Re: [PATCH] vfio iommu type1: Improve vfio_iommu_type1_pin_pages performance
From: "xuxiaoyang (C)"
To: Alex Williamson
Date: Tue, 17 Nov 2020 11:31:56 +0800
Message-ID: <46a36d12-cc40-e6b5-66a0-192fcfb53c4a@huawei.com>
In-Reply-To: <20201116093349.19fb84c1@w520.home>
References: <2553f102-de17-b23b-4cd8-fefaf2a04f24@huawei.com> <20201113094417.51d73615@w520.home> <20201116093349.19fb84c1@w520.home>
List-ID: linux-kernel@vger.kernel.org

On 2020/11/17 0:33, Alex Williamson wrote:
> On Mon, 16 Nov 2020 21:47:33 +0800
> "xuxiaoyang (C)" wrote:
>
>> On 2020/11/14 0:44, Alex Williamson wrote:
>>> On Tue, 10 Nov 2020 21:42:33 +0800
>>> "xuxiaoyang (C)" wrote:
>>>
>>>> vfio_iommu_type1_pin_pages is very inefficient because
>>>> it is processed page by page when calling vfio_pin_page_external.
>>>> Added contiguous_vaddr_get_pfn to process continuous pages
>>>> to reduce the number of loops, thereby improving performance.
>>>>
>>>> Signed-off-by: Xiaoyang Xu
>>>> ---
>>>>  drivers/vfio/vfio_iommu_type1.c | 241 ++++++++++++++++++++++++++++----
>>>>  1 file changed, 214 insertions(+), 27 deletions(-)
>>>
>>> Sorry for my previous misunderstanding of the logic here.  Still, this
>>> adds a lot of complexity, can you quantify the performance improvement
>>> you're seeing?  Would it instead be more worthwhile to support an
>>> interface that pins based on an iova and range?  Further comments below.
>>>
>> Thank you for your reply.  I have a set of performance test data for reference.
>> The host kernel version of my test environment is 4.19.36, and the test
>> cases pin one page and 512 pages.  When pinning 512 pages, the input is a
>> contiguous range of iova addresses.  I also varied whether the memory
>> backend uses huge pages.  The following is the average of multiple runs.
>>
>> Without the patch applied:
>>                    1 page      512 pages
>> no huge pages:     1638ns      223651ns
>> THP:               1668ns      222330ns
>> HugeTLB:           1526ns      208151ns
>>
>> With the patch applied:
>>                    1 page      512 pages
>> no huge pages:     1735ns      167286ns
>> THP:               1934ns      126900ns
>> HugeTLB:           1713ns      102188ns
>>
>> Single-page pins become slightly slower, while the time to pin a run of
>> contiguous pages drops to roughly half of the original.
>> If you have other suggestions for testing methods, please let me know.
>
> These are artificial test results, which in fact show a performance
> decrease for the single page use cases.  What do we see in typical real
> world scenarios?  Do those real world scenarios tend more towards large
> arrays of contiguous IOVAs or single pages, or large arrays of
> discontiguous IOVAs?  What's the resulting change in device
> performance?  Are pages pinned at a high enough frequency that pinning
> latency results in a measurable benefit at the device or to the
> overhead on the host?
>

Thank you for your reply.  It is difficult for me to construct these use
cases.  When a device is doing DMA, frequent pin/unpin calls are a large
overhead, so we pin most of the memory up front to build a memory pool.
We therefore care mainly about the performance of pinning contiguous pages.
Handling a single page through contiguous_vaddr_get_pfn would reduce
performance, so we can keep the existing vfio_pin_page_external path for
that case.  As long as vfio_get_contiguous_pages_length returns quickly,
the performance loss for single-page pins is minimal.

>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>> index 67e827638995..935f80807527 100644
>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>> @@ -628,6 +628,206 @@ static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
>>>>  	return unlocked;
>>>>  }
>>>>
>>>> +static int contiguous_vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>>>> +				    int prot, long npage, unsigned long *phys_pfn)
>>>> +{
>>>> +	struct page **pages = NULL;
>>>> +	unsigned int flags = 0;
>>>> +	int i, ret;
>>>> +
>>>> +	pages = kvmalloc_array(npage, sizeof(struct page *), GFP_KERNEL);
>>>> +	if (!pages)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	if (prot & IOMMU_WRITE)
>>>> +		flags |= FOLL_WRITE;
>>>> +
>>>> +	mmap_read_lock(mm);
>>>> +	ret = pin_user_pages_remote(mm, vaddr, npage, flags | FOLL_LONGTERM,
>>>> +				    pages, NULL, NULL);
>>>
>>> This is essentially the root of the performance improvement claim,
>>> right?  ie. we're pinning a range of pages rather than individually.
>>> Unfortunately it means we also add a dynamic memory allocation even
>>> when npage=1.
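
As an illustration of the single-page concern, here is a minimal sketch
(not the code in the posted patch) of how the helper could keep an
on-stack slot for the npage == 1 case, so the common single-page pin
avoids the kvmalloc_array()/kvfree() round trip.  It reuses only the
calls already shown in the hunk above:

/*
 * Illustrative sketch only: batch the pin as in the patch, but fall back
 * to an on-stack page pointer when only one page is requested.
 */
static int contiguous_vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
				    int prot, long npage, unsigned long *phys_pfn)
{
	struct page *stack_page;
	struct page **pages = &stack_page;
	unsigned int flags = 0;
	int i, ret;

	if (npage > 1) {
		/* Only the multi-page case pays for the allocation. */
		pages = kvmalloc_array(npage, sizeof(struct page *), GFP_KERNEL);
		if (!pages)
			return -ENOMEM;
	}

	if (prot & IOMMU_WRITE)
		flags |= FOLL_WRITE;

	mmap_read_lock(mm);
	ret = pin_user_pages_remote(mm, vaddr, npage, flags | FOLL_LONGTERM,
				    pages, NULL, NULL);
	mmap_read_unlock(mm);

	/* Report back a pfn only for pages that were actually pinned. */
	for (i = 0; i < ret; i++)
		phys_pfn[i] = page_to_pfn(pages[i]);

	if (npage > 1)
		kvfree(pages);

	return ret;
}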
>>> >> Yes, when npage = 1, I should keep the previous scheme of the scene >> and use local variables to save time in allocated memory >>> >>>> + mmap_read_unlock(mm); >>>> + >>>> + for (i = 0; i < ret; i++) >>>> + *(phys_pfn + i) = page_to_pfn(pages[i]); >>>> + >>>> + kvfree(pages); >>>> + >>>> + return ret; >>>> +} >>>> + >>>> +static int vfio_pin_contiguous_pages_external(struct vfio_iommu *iommu, >>>> + struct vfio_dma *dma, >>>> + unsigned long *user_pfn, >>>> + int npage, unsigned long *phys_pfn, >>>> + bool do_accounting) >>>> +{ >>>> + int ret, i, j, lock_acct = 0; >>>> + unsigned long remote_vaddr; >>>> + dma_addr_t iova; >>>> + struct mm_struct *mm; >>>> + struct vfio_pfn *vpfn; >>>> + >>>> + mm = get_task_mm(dma->task); >>>> + if (!mm) >>>> + return -ENODEV; >>>> + >>>> + iova = user_pfn[0] << PAGE_SHIFT; >>>> + remote_vaddr = dma->vaddr + iova - dma->iova; >>>> + ret = contiguous_vaddr_get_pfn(mm, remote_vaddr, dma->prot, >>>> + npage, phys_pfn); >>>> + mmput(mm); >>>> + if (ret <= 0) >>>> + return ret; >>>> + >>>> + npage = ret; >>>> + for (i = 0; i < npage; i++) { >>>> + iova = user_pfn[i] << PAGE_SHIFT; >>>> + ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]); >>>> + if (ret) >>>> + goto unwind; >>>> + >>>> + if (!is_invalid_reserved_pfn(phys_pfn[i])) >>>> + lock_acct++; >>>> + >>>> + if (iommu->dirty_page_tracking) { >>>> + unsigned long pgshift = __ffs(iommu->pgsize_bitmap); >>>> + >>>> + /* >>>> + * Bitmap populated with the smallest supported page >>>> + * size >>>> + */ >>>> + bitmap_set(dma->bitmap, >>>> + (iova - dma->iova) >> pgshift, 1); >>>> + } >>> >>> It doesn't make sense that we wouldn't also optimize this to simply set >>> npage bits. There's also no unwind for this. >>> >> Thank you for your correction, I should set npage to simplify. >> There is no unwind process on the current master branch. Is this a bug? > > It's a bit tricky to know when it's ok to clear a bit in the dirty > bitmap, it's much, much worse to accidentally clear a bit to > incorrectly report a page a clean than it is to fail to unwind and > leave pages marked as dirty. Given that this is an error path, we > might be willing to incorrectly leave pages marked dirty rather than > risk the alternative. > >>>> + } >>>> + >>>> + if (do_accounting) { >>>> + ret = vfio_lock_acct(dma, lock_acct, true); >>>> + if (ret) { >>>> + if (ret == -ENOMEM) >>>> + pr_warn("%s: Task %s (%d) RLIMIT_MEMLOCK (%ld) exceeded\n", >>>> + __func__, dma->task->comm, task_pid_nr(dma->task), >>>> + task_rlimit(dma->task, RLIMIT_MEMLOCK)); >>>> + goto unwind; >>>> + } >>>> + } >>> >>> This algorithm allows pinning many more pages in advance of testing >>> whether we've exceeded the task locked memory limit than the per page >>> approach. >>> >> More than 1~VFIO_PIN_PAGES_MAX_ENTRIES. But after failure, all pinned >> pages will be released. Is there a big impact here? >>> >>>> + >>>> + return i; >>>> +unwind: >>>> + for (j = 0; j < npage; j++) { >>>> + put_pfn(phys_pfn[j], dma->prot); >>>> + phys_pfn[j] = 0; >>>> + } >>>> + >>>> + for (j = 0; j < i; j++) { >>>> + iova = user_pfn[j] << PAGE_SHIFT; >>>> + vpfn = vfio_find_vpfn(dma, iova); >>>> + if (vpfn) >>> >>> Seems like not finding a vpfn would be an error. >>> >> I think the above process can ensure that vpfn is not NULL >> and delete this judgment. >>>> + vfio_remove_from_pfn_list(dma, vpfn); >>> >>> It seems poor form not to use vfio_iova_put_vfio_pfn() here even if we >>> know we hold the only reference. >>> >> The logic here can be reduced to a loop. 
>> When j < i, call vfio_iova_put_vfio_pfn. >>> >>>> + } >>>> + >>>> + return ret; >>>> +} >>>> + >>>> +static int vfio_iommu_type1_pin_contiguous_pages(struct vfio_iommu *iommu, >>>> + struct vfio_dma *dma, >>>> + unsigned long *user_pfn, >>>> + int npage, unsigned long *phys_pfn, >>>> + bool do_accounting) >>>> +{ >>>> + int ret, i, j; >>>> + unsigned long remote_vaddr; >>>> + dma_addr_t iova; >>>> + >>>> + ret = vfio_pin_contiguous_pages_external(iommu, dma, user_pfn, npage, >>>> + phys_pfn, do_accounting); >>>> + if (ret == npage) >>>> + return ret; >>>> + >>>> + if (ret < 0) >>>> + ret = 0; >>> >>> >>> I'm lost, why do we need the single page iteration below?? >>> >> Since there is no retry in contiguous_vaddr_get_pfn, oncean error occurs, >> the remaining pages will be processed in a single page. >> Is it better to increase retry in contiguous_vaddr_get_pfn? > > Do we expect it to work if we call it again? Do we expect the below > single page iteration to work if the npage pinning failed? Why? > Vaddr_get_pfn contains error handling for the VM_PFNMAP type of mapping area. In this case, we hope he can continue to work. As mentioned above, the scene of n=1 will be processed directly using vfio_pin_page_external. > >>>> + for (i = ret; i < npage; i++) { >>>> + iova = user_pfn[i] << PAGE_SHIFT; >>>> + remote_vaddr = dma->vaddr + iova - dma->iova; >>>> + >>>> + ret = vfio_pin_page_external(dma, remote_vaddr, &phys_pfn[i], >>>> + do_accounting); >>>> + if (ret) >>>> + goto pin_unwind; >>>> + >>>> + ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]); >>>> + if (ret) { >>>> + if (put_pfn(phys_pfn[i], dma->prot) && do_accounting) >>>> + vfio_lock_acct(dma, -1, true); >>>> + goto pin_unwind; >>>> + } >>>> + >>>> + if (iommu->dirty_page_tracking) { >>>> + unsigned long pgshift = __ffs(iommu->pgsize_bitmap); >>>> + >>>> + /* >>>> + * Bitmap populated with the smallest supported page >>>> + * size >>>> + */ >>>> + bitmap_set(dma->bitmap, >>>> + (iova - dma->iova) >> pgshift, 1); >>>> + } >>>> + } >>>> + >>>> + return i; >>>> + >>>> +pin_unwind: >>>> + phys_pfn[i] = 0; >>>> + for (j = 0; j < i; j++) { >>>> + dma_addr_t iova; >>>> + >>>> + iova = user_pfn[j] << PAGE_SHIFT; >>>> + vfio_unpin_page_external(dma, iova, do_accounting); >>>> + phys_pfn[j] = 0; >>>> + } >>>> + >>>> + return ret; >>>> +} >>>> + >>>> +static int vfio_iommu_type1_get_contiguous_pages_length(struct vfio_iommu *iommu, >>>> + unsigned long *user_pfn, int npage, int prot) >>>> +{ >>>> + struct vfio_dma *dma_base; >>>> + int i; >>>> + dma_addr_t iova; >>>> + struct vfio_pfn *vpfn; >>>> + >>>> + if (npage <= 1) >>>> + return npage; >>>> + >>>> + iova = user_pfn[0] << PAGE_SHIFT; >>>> + dma_base = vfio_find_dma(iommu, iova, PAGE_SIZE); >>>> + if (!dma_base) >>>> + return -EINVAL; >>>> + >>>> + if ((dma_base->prot & prot) != prot) >>>> + return -EPERM; >>>> + >>>> + for (i = 1; i < npage; i++) { >>>> + iova = user_pfn[i] << PAGE_SHIFT; >>>> + >>>> + if (iova >= dma_base->iova + dma_base->size || >>>> + iova + PAGE_SIZE <= dma_base->iova) >>>> + break; >>>> + >>>> + vpfn = vfio_iova_get_vfio_pfn(dma_base, iova); >>>> + if (vpfn) { >>>> + vfio_iova_put_vfio_pfn(dma_base, vpfn); >>> >>> Why not just use vfio_find_vpfn() rather than get+put? >>> >> Thank you for your correction, I should use vfio_find_vpfn here. >>>> + break; >>>> + } >>>> + >>>> + if (user_pfn[i] != user_pfn[0] + i) >>> >>> Shouldn't this be the first test? >>> >> Thank you for your correction, the least costly judgment should be >> the first test. 
>>>> + break; >>>> + } >>>> + return i; >>>> +} >>>> + >>>> static int vfio_iommu_type1_pin_pages(void *iommu_data, >>>> struct iommu_group *iommu_group, >>>> unsigned long *user_pfn, >>>> @@ -637,9 +837,9 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data, >>>> struct vfio_iommu *iommu = iommu_data; >>>> struct vfio_group *group; >>>> int i, j, ret; >>>> - unsigned long remote_vaddr; >>>> struct vfio_dma *dma; >>>> bool do_accounting; >>>> + int contiguous_npage; >>>> >>>> if (!iommu || !user_pfn || !phys_pfn) >>>> return -EINVAL; >>>> @@ -663,7 +863,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data, >>>> */ >>>> do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu); >>>> >>>> - for (i = 0; i < npage; i++) { >>>> + for (i = 0; i < npage; i += contiguous_npage) { >>>> dma_addr_t iova; >>>> struct vfio_pfn *vpfn; >>>> >>>> @@ -682,31 +882,18 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data, >>>> vpfn = vfio_iova_get_vfio_pfn(dma, iova); >>>> if (vpfn) { >>>> phys_pfn[i] = vpfn->pfn; >>>> - continue; >>>> - } >>>> - >>>> - remote_vaddr = dma->vaddr + (iova - dma->iova); >>>> - ret = vfio_pin_page_external(dma, remote_vaddr, &phys_pfn[i], >>>> - do_accounting); >>>> - if (ret) >>>> - goto pin_unwind; >>>> - >>>> - ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]); >>>> - if (ret) { >>>> - if (put_pfn(phys_pfn[i], dma->prot) && do_accounting) >>>> - vfio_lock_acct(dma, -1, true); >>>> - goto pin_unwind; >>>> - } >>>> - >>>> - if (iommu->dirty_page_tracking) { >>>> - unsigned long pgshift = __ffs(iommu->pgsize_bitmap); >>>> - >>>> - /* >>>> - * Bitmap populated with the smallest supported page >>>> - * size >>>> - */ >>>> - bitmap_set(dma->bitmap, >>>> - (iova - dma->iova) >> pgshift, 1); >>>> + contiguous_npage = 1; >>>> + } else { >>>> + ret = vfio_iommu_type1_get_contiguous_pages_length(iommu, >>>> + &user_pfn[i], npage - i, prot); >>> >>> >>> It doesn't make a lot of sense to me that this isn't more integrated >>> into the base function. For example, we're passing &user_pfn[i] for >>> which we've already converted to an iova, found the vfio_dma associated >>> to that iova, and checked the protection. This callout does all of >>> that again on the same. >>> >> Thanks for your correction, I will delete the redundant check in >> vfio_iommu_type1_get_contiguous_pages_length and simplify the function to >> static int vfio_get_contiguous_pages_length (struct vfio_dma * dma, >> unsigned long * user_pfn, int npage) >>>> + if (ret < 0) >>>> + goto pin_unwind; >>>> + >>>> + ret = vfio_iommu_type1_pin_contiguous_pages(iommu, >>>> + dma, &user_pfn[i], ret, &phys_pfn[i], do_accounting); >>>> + if (ret < 0) >>>> + goto pin_unwind; >>>> + contiguous_npage = ret; >>>> } >>>> } >>>> ret = i; >>> >>> >>> This all seems _way_ more complicated than it needs to be, there are >>> too many different ways to flow through this code and claims of a >>> performance improvement are not substantiated with evidence. The type1 >>> code is already too fragile. Please simplify and justify. Thanks, >>> >>> Alex >>> >>> . >>> >> The main issue is the balance between performance and complexity. >> The test data has been given above, please tell me your opinion. > > I think the implementation here is overly complicated, there are too > many code paths and it's not clear what real world improvement to > expect. 
The test data only shows us the theoretical best case
> improvement of optimizing for a specific use case, without indicating
> how prevalent or frequent that use case occurs in operation of an
> actual device.  I suspect that it's possible to make the optimization
> you're trying to achieve without this degree of complexity.  Thanks,
>
> Alex
>

Whether the contiguous case is common in real-world workloads will need
feedback from other users.  I will post another patch that fixes the
earlier bug and improves pin-page performance, and perhaps others can
build further improvements on that basis.

Regards,
Xu
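
For reference, a rough sketch of the simplified contiguity helper agreed
on earlier in the thread (cheapest index check first, a single vfio_dma
range check, and vfio_find_vpfn() instead of a get/put pair).  This is an
illustration of the proposal, not code from a posted patch:

/*
 * Sketch: return how many of the npage user pfns, starting at
 * user_pfn[0], form a run that can be pinned in one batch.  The caller
 * has already looked up the vfio_dma for user_pfn[0] and checked prot.
 */
static int vfio_get_contiguous_pages_length(struct vfio_dma *dma,
					    unsigned long *user_pfn, int npage)
{
	int i;

	if (npage <= 1)
		return npage;

	for (i = 1; i < npage; i++) {
		dma_addr_t iova;

		/* Cheapest test first: the user pfns must be consecutive. */
		if (user_pfn[i] != user_pfn[0] + i)
			break;

		/* The run must stay inside the same vfio_dma mapping. */
		iova = user_pfn[i] << PAGE_SHIFT;
		if (iova >= dma->iova + dma->size)
			break;

		/* Stop at pages that are already pinned and accounted. */
		if (vfio_find_vpfn(dma, iova))
			break;
	}

	return i;
}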