From mboxrd@z Thu Jan  1 00:00:00 1970
From: Daniel Jordan
Subject: [RFC PATCH v4 09/13] vfio: relieve mmap_sem reader cacheline bouncing by holding it longer
Date: Mon,  5 Nov 2018 11:55:54 -0500
Message-ID: <20181105165558.11698-10-daniel.m.jordan@oracle.com>
References: <20181105165558.11698-1-daniel.m.jordan@oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Cc: aarcange@redhat.com, aaron.lu@intel.com, akpm@linux-foundation.org,
    alex.williamson@redhat.com, bsd@redhat.com, daniel.m.jordan@oracle.com,
    darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com,
    jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org,
    mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com,
    prasad.singamsetty@oracle.com, rdunlap@infradead.org,
    steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org,
    vbabka@suse.cz
To: linux-mm@kvack.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20181105165558.11698-1-daniel.m.jordan@oracle.com>
Sender: linux-kernel-owner@vger.kernel.org
List-Id: kvm.vger.kernel.org

Profiling shows significant time being spent on atomic ops in mmap_sem
reader acquisition.  mmap_sem is taken and dropped for every single base
page during pinning, so this is not surprising.

Reduce the number of times mmap_sem is taken by holding it for longer,
which relieves atomic cacheline bouncing.
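To illustrate the locking pattern this patch moves to, here is a minimal
userspace sketch with pthread_rwlock_t standing in for mmap_sem; the
function names and page count are made up for the example, and the real
change is in vfio_pin_pages_remote() in the diff below:

/*
 * Illustration only, not kernel code.  Each rdlock/unlock pair does an
 * atomic RMW on the lock's shared cacheline, so with many threads the
 * per-page variant bounces that line around the machine, while the
 * batched variant pays for the reader atomics once per call.
 */
#include <pthread.h>
#include <stddef.h>

static pthread_rwlock_t fake_mmap_sem = PTHREAD_RWLOCK_INITIALIZER;

/* Old pattern: take and drop the reader lock for every base page. */
static void pin_pages_per_page(size_t npages)
{
	for (size_t i = 0; i < npages; i++) {
		pthread_rwlock_rdlock(&fake_mmap_sem);
		/* ... pin page i ... */
		pthread_rwlock_unlock(&fake_mmap_sem);
	}
}

/* New pattern: hold the reader lock across the whole batch. */
static void pin_pages_batched(size_t npages)
{
	pthread_rwlock_rdlock(&fake_mmap_sem);
	for (size_t i = 0; i < npages; i++) {
		/* ... pin page i ... */
	}
	pthread_rwlock_unlock(&fake_mmap_sem);
}

int main(void)
{
	pin_pages_per_page(1024);
	pin_pages_batched(1024);
	return 0;
}

The tradeoff is that the read lock is now held across the whole batch,
which can keep mmap_sem writers waiting for longer.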
Results for all VFIO page pinning patches
-----------------------------------------

The test measures the time from qemu invocation to the start of guest
boot.  The guest uses kvm with 320G memory backed by THP.  320G fits in
a single node on the test machine used here, so there was no thrashing
in reclaim because of __GFP_THISNODE in THP allocations[1].

  CPU:              2 nodes * 24 cores/node * 2 threads/core = 96 CPUs
                    Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
  memory:           754G split evenly between nodes
  scaling_governor: performance

            patch 6                   patch 8                  patch 9 (this one)
        -----------------------   -----------------------   -----------------------
 thr    speedup   average sec     speedup   average sec     speedup   average sec
   1              65.0 (± 0.6%)             65.2 (± 0.5%)             65.5 (± 0.4%)
   2    1.5x      42.8 (± 5.8%)    1.8x     36.0 (± 0.9%)    1.9x     34.4 (± 0.3%)
   3    1.9x      35.0 (±11.3%)    2.5x     26.4 (± 4.2%)    2.8x     23.7 (± 0.2%)
   4    2.3x      28.5 (± 1.3%)    3.1x     21.2 (± 2.8%)    3.6x     18.3 (± 0.3%)
   5    2.5x      26.2 (± 1.5%)    3.6x     17.9 (± 0.9%)    4.3x     15.1 (± 0.3%)
   6    2.7x      24.5 (± 1.8%)    4.0x     16.5 (± 3.0%)    5.1x     12.9 (± 0.1%)
   7    2.8x      23.5 (± 4.9%)    4.2x     15.4 (± 2.7%)    5.7x     11.5 (± 0.6%)
   8    2.8x      22.8 (± 1.8%)    4.2x     15.5 (± 4.7%)    6.4x     10.3 (± 0.8%)
  12    3.2x      20.2 (± 1.4%)    4.4x     14.7 (± 2.9%)    8.6x      7.6 (± 0.6%)
  16    3.3x      20.0 (± 0.7%)    4.3x     15.4 (± 1.3%)   10.2x      6.4 (± 0.6%)

At patch 6, lock_stat showed long reader wait times behind mmap_sem
writers, leading to patch 8.  At patch 8, profiling revealed the
mmap_sem reader issue described above.

Across all three patches, performance consistently improves as the
thread count increases.  The one exception is the antiscaling with
nthr=16 in patch 8: those mmap_sem atomics are really bouncing around
the machine.

The performance with patch 9 looks pretty good overall.  I'm working on
finding the next bottleneck, and this is where it stands: when nthr=16,
the obvious issue profiling showed was contention on the split PMD page
table lock when pages are faulted in during pinning (>2% of the time).

A split PMD lock protects a PUD_SIZE's worth of page table mappings
(1G on x86), so if threads were operating on smaller chunks and
contending in the same PUD_SIZE range, this could be the source of the
contention.  However, when nthr=16, threads operate on 5G chunks
(320G / 16 threads / (1<
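As a rough back-of-envelope check of those figures (a sketch only: the
1G PUD_SIZE assumes x86_64 with 4-level paging, and the 320G guest,
16 threads, and 5G chunk size are taken from the text above):

#include <stdio.h>

int main(void)
{
	unsigned long long gib       = 1ULL << 30;
	unsigned long long pud_size  = 1ULL << 30;	/* PUD_SIZE on x86_64: 1G */
	unsigned long long guest_mem = 320 * gib;	/* guest memory in the test */
	unsigned long long nthreads  = 16;
	unsigned long long chunk     = 5 * gib;		/* chunk size at nthr=16 */

	/* Each thread's share of the guest: 320G / 16 = 20G. */
	printf("per-thread share: %lluG\n", guest_mem / nthreads / gib);

	/*
	 * A 5G chunk spans five separate PUD_SIZE (1G) ranges, so it is
	 * covered by five different split PMD locks.
	 */
	printf("PUD ranges per chunk: %llu\n", chunk / pud_size);
	return 0;
}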
---
 drivers/vfio/vfio_iommu_type1.c | 28 +++++++++++++++++++---------
 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 9e52a24eb45b..0d6ec0369de5 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -331,7 +331,7 @@ static int put_pfn(unsigned long pfn, int prot)
 }
 
 static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
-			 int prot, unsigned long *pfn)
+			 int prot, unsigned long *pfn, bool handle_mmap_sem)
 {
 	struct page *page[1];
 	struct vm_area_struct *vma;
@@ -342,7 +342,8 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 	if (prot & IOMMU_WRITE)
 		flags |= FOLL_WRITE;
 
-	down_read(&mm->mmap_sem);
+	if (handle_mmap_sem)
+		down_read(&mm->mmap_sem);
 	if (mm == current->mm) {
 		ret = get_user_pages_longterm(vaddr, 1, flags, page, vmas);
 	} else {
@@ -360,14 +361,16 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 			put_page(page[0]);
 		}
 	}
-	up_read(&mm->mmap_sem);
+	if (handle_mmap_sem)
+		up_read(&mm->mmap_sem);
 
 	if (ret == 1) {
 		*pfn = page_to_pfn(page[0]);
 		return 0;
 	}
 
-	down_read(&mm->mmap_sem);
+	if (handle_mmap_sem)
+		down_read(&mm->mmap_sem);
 
 	vma = find_vma_intersection(mm, vaddr, vaddr + 1);
 
@@ -377,7 +380,8 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 			ret = 0;
 	}
 
-	up_read(&mm->mmap_sem);
+	if (handle_mmap_sem)
+		up_read(&mm->mmap_sem);
 	return ret;
 }
 
@@ -399,9 +403,12 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 	if (!mm)
 		return -ENODEV;
 
-	ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base);
-	if (ret)
+	down_read(&mm->mmap_sem);
+	ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base, false);
+	if (ret) {
+		up_read(&mm->mmap_sem);
 		return ret;
+	}
 
 	pinned++;
 	rsvd = is_invalid_reserved_pfn(*pfn_base);
@@ -416,6 +423,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 			put_pfn(*pfn_base, dma->prot);
 			pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
 					limit << PAGE_SHIFT);
+			up_read(&mm->mmap_sem);
 			return -ENOMEM;
 		}
 		lock_acct++;
@@ -427,7 +435,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 	/* Lock all the consecutive pages from pfn_base */
 	for (vaddr += PAGE_SIZE, iova += PAGE_SIZE; pinned < npage;
 	     pinned++, vaddr += PAGE_SIZE, iova += PAGE_SIZE) {
-		ret = vaddr_get_pfn(mm, vaddr, dma->prot, &pfn);
+		ret = vaddr_get_pfn(mm, vaddr, dma->prot, &pfn, false);
 		if (ret)
 			break;
 
@@ -444,6 +452,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 				pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
 					__func__, limit << PAGE_SHIFT);
 				ret = -ENOMEM;
+				up_read(&mm->mmap_sem);
 				goto unpin_out;
 			}
 			lock_acct++;
@@ -451,6 +460,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 	}
 
 out:
+	up_read(&mm->mmap_sem);
 	ret = vfio_lock_acct(dma, lock_acct, false);
 
 unpin_out:
@@ -497,7 +507,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
 	if (!mm)
 		return -ENODEV;
 
-	ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base);
+	ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base, true);
 	if (!ret && do_accounting && !is_invalid_reserved_pfn(*pfn_base)) {
 		ret = vfio_lock_acct(dma, 1, true);
 		if (ret) {

-- 
2.19.1