Re: [BUG] vfio device assignment regression with THP ref counting redesign

From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: kirill.shutemov@linux.intel.com,
	Andrea Arcangeli <aarcange@redhat.com>,
	linux-kernel@vger.kernel.org,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [BUG] vfio device assignment regression with THP ref counting redesign
Date: Thu, 28 Apr 2016 21:17:26 +0300	[thread overview]
Message-ID: <20160428181726.GA2847@node.shutemov.name> (raw)
In-Reply-To: <20160428102051.17d1c728@t450s.home>

On Thu, Apr 28, 2016 at 10:20:51AM -0600, Alex Williamson wrote:
> Hi,
> 
> vfio-based device assignment makes use of get_user_pages_fast() in order
> to pin pages for mapping through the iommu for userspace drivers.
> Until the recent redesign of THP reference counting in the v4.5 kernel,
> this all worked well.  Now we're seeing cases where a sanity test
> before we release our "pinned" mapping results in a different page
> address than what we programmed into the iommu.  So something is
> occurring which pretty much negates the pinning we're trying to do.
> 
> The test program I'm using is here:
> 
> https://github.com/awilliam/tests/blob/master/vfio-iommu-map-unmap.c
> 
> Apologies for lack of makefile, simply build with gcc -o <out> <in.c>.
> 
> To run this, enable the IOMMU on your system - enable in BIOS plus add
> intel_iommu=on to the kernel commandline (only Intel x86_64 tested).
> 
> Pick a target PCI device, it doesn't matter what it is, the test only
> needs a device for the purpose of creating an iommu domain, the device
> is never actually touched.  In my case I use a spare NIC at 00:19.0.
> libvirt tools are useful for setting this up, simply run 'virsh
> nodedev-detach pci_0000_00_19_0'.  Otherwise bind the device manually
> to vfio-pci using the standard new_id bind (ask, I can provide
> instructions).
> 
> I also tweak THP scanning to make sure it is actively trying to
> collapse pages:
> 
> echo always > /sys/kernel/mm/transparent_hugepage/defrag
> echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
> echo 65536 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
> 
> Run the test with 'vfio-iommu-map-unmap 0000:00:19.0', or your chosen
> target device.
> 
> Of course to see that the mappings are moving, we need additional
> sanity testing in the vfio iommu driver.  For that:
> 
> https://github.com/awilliam/linux-vfio/commit/379f324e3629349a7486018ad1cc5d4877228d1e
> 
> When we map memory for vfio, we use get_user_pages_fast() on the
> process vaddr to give us a page.  page_to_pfn() then gives us the
> physical memory address which we program into the iommu.  Obviously we
> expect this mapping to be stable so long as we hold the page
> reference.  On unmap we generally retrieve the physical memory address
> from the iommu, convert it back to a page, and release our reference to
> it.  The debug code above adds an additional sanity test where on unmap
> we also call get_user_pages_fast() again before we're released the
> mapping reference and compare whether the physical page address still
> matches what we previously stored in the iommu.  On a v4.4 kernel this
> works every time.  On v4.5+, we get mismatches in dmesg within a few
> lines of output from the test program.
> 
> It's difficult to bisect around the THP reference counting redesign
> since THP becomes disabled for much of it.  I have discovered that this
> commit is a significant contributor:
> 
> 1f25fe2 mm, thp: adjust conditions when we can reuse the page on WP fault
> 
> Particularly the middle chunk in huge_memory.c.  Reverting this change
> alone significantly improves the problem, but does not lead to a stable
> system.
> 
> I'm not an mm expert, so I'm looking for help debugging this.  As shown
> above this issue is reproducible without KVM, so Andrea's previous KVM
> specific fix to this code is not applicable.  It also still occurs on
> kernels as recent as v4.6-rc5, so the issue hasn't been silently fixed
> yet.  I'm able to reproduce this fairly quickly with the above test,
> but it's not hard to imagine a test w/o any iommu dependencies which
> simply does a user directed get_user_pages_fast() on a set of userspace
> addresses, retains the reference, and at some point later rechecks that
> a new get_user_pages_fast() results in the same page address.  It
> appears that any sort of device assignment, either vfio or legacy kvm,
> should be susceptible to this issue and therefore unsafe to use on v4.5+
> kernels without using explicit hugepages or disabling THP.  Thanks,

I'm not able to reproduce it so far. How long does it usually take?

How much memory your system has? Could you share your kernel config?

I've modified your instrumentation slightly to provide more info.
Could you try this:

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 75b24e93cedb..434954841d19 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -59,6 +59,7 @@ struct vfio_iommu {
 	struct rb_root		dma_list;
 	bool			v2;
 	bool			nesting;
+	bool			dying;
 };
 
 struct vfio_domain {
@@ -336,8 +337,10 @@ static long vfio_unpin_pages(unsigned long pfn, long npage,
 static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
 	dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
+	unsigned long vaddr = dma->vaddr;
 	struct vfio_domain *domain, *d;
 	long unlocked = 0;
+	struct page *page;
 
 	if (!dma->size)
 		return;
@@ -363,9 +366,20 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 		phys = iommu_iova_to_phys(domain->domain, iova);
 		if (WARN_ON(!phys)) {
 			iova += PAGE_SIZE;
+			vaddr += PAGE_SIZE;
 			continue;
 		}
 
+		if (!iommu->dying && get_user_pages_fast(vaddr, 1, !!(dma->prot & IOMMU_WRITE), &page) == 1) {
+			if (phys >> PAGE_SHIFT != page_to_pfn(page)) {
+				dump_page(page, NULL);
+				if (PageTail(page))
+					dump_page(compound_head(page), NULL);
+				dump_page(pfn_to_page(phys >> PAGE_SHIFT), "1");
+			}
+			put_page(page);
+		}
+
 		/*
 		 * To optimize for fewer iommu_unmap() calls, each of which
 		 * may require hardware cache flushing, try to find the
@@ -374,6 +388,17 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 		for (len = PAGE_SIZE;
 		     !domain->fgsp && iova + len < end; len += PAGE_SIZE) {
 			next = iommu_iova_to_phys(domain->domain, iova + len);
+
+			if (!iommu->dying && get_user_pages_fast(vaddr + len, 1, !!(dma->prot & IOMMU_WRITE), &page) == 1) {
+				if (next >> PAGE_SHIFT != page_to_pfn(page)) {
+					dump_page(page, NULL);
+					if (PageTail(page))
+						dump_page(compound_head(page), NULL);
+					dump_page(pfn_to_page(next >> PAGE_SHIFT), "2");
+				}
+				put_page(page);
+			}
+
 			if (next != phys + len)
 				break;
 		}
@@ -386,6 +411,7 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 					     unmapped >> PAGE_SHIFT,
 					     dma->prot, false);
 		iova += unmapped;
+		vaddr += unmapped;
 
 		cond_resched();
 	}
@@ -855,6 +881,8 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
 {
 	struct rb_node *node;
 
+	iommu->dying = true;
+
 	while ((node = rb_first(&iommu->dma_list)))
 		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
 }
-- 
 Kirill A. Shutemov