linux-kernel.vger.kernel.org archive mirror
* [BUG] vfio device assignment regression with THP ref counting redesign
@ 2016-04-28 16:20 Alex Williamson
  2016-04-28 18:17 ` Kirill A. Shutemov
  0 siblings, 1 reply; 36+ messages in thread
From: Alex Williamson @ 2016-04-28 16:20 UTC (permalink / raw)
  To: kirill.shutemov; +Cc: Andrea Arcangeli, linux-kernel, linux-mm

Hi,

vfio-based device assignment makes use of get_user_pages_fast() in order
to pin pages for mapping through the iommu for userspace drivers.
Until the recent redesign of THP reference counting in the v4.5 kernel,
this all worked well.  Now we're seeing cases where a sanity test
before we release our "pinned" mapping results in a different page
address than what we programmed into the iommu.  So something is
occurring which pretty much negates the pinning we're trying to do.

The test program I'm using is here:

https://github.com/awilliam/tests/blob/master/vfio-iommu-map-unmap.c

Apologies for lack of makefile, simply build with gcc -o <out> <in.c>.
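
For reference, the core of the map step the test exercises looks
roughly like this (a condensed sketch, not the actual test program; it
assumes 'container' is an open /dev/vfio/vfio fd with the group already
attached and VFIO_TYPE1_IOMMU selected):

#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

static int map_buffer(int container, unsigned long iova, size_t size)
{
	struct vfio_iommu_type1_dma_map map = { .argsz = sizeof(map) };
	void *buf;

	/* anonymous memory, THP-eligible if size and alignment allow */
	buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	memset(buf, 0, size);	/* fault the pages in */

	map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	map.vaddr = (unsigned long)buf;
	map.iova = iova;
	map.size = size;

	/* the kernel pins the pages with get_user_pages_fast() here */
	return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
}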

To run this, enable the IOMMU on your system: enable it in the BIOS and
add intel_iommu=on to the kernel command line (only Intel x86_64 tested).
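
A quick sanity check that the IOMMU actually came up after rebooting
(exact messages vary by platform, this is just an example):

dmesg | grep -i -e DMAR -e IOMMU
ls /sys/kernel/iommu_groups/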

Pick a target PCI device; it doesn't matter which one, as the test
only needs a device for the purpose of creating an iommu domain and
never actually touches it.  In my case I use a spare NIC at 00:19.0.
The libvirt tools are useful for setting this up; simply run 'virsh
nodedev-detach pci_0000_00_19_0'.  Otherwise bind the device manually
to vfio-pci using the standard new_id bind (a rough sketch follows
below).
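
For the manual route, the standard new_id bind is roughly the following
(the device address and IDs below are only examples; substitute your
device's address and the vendor/device IDs from 'lspci -n -s <address>'):

modprobe vfio-pci
echo 0000:00:19.0 > /sys/bus/pci/devices/0000:00:19.0/driver/unbind
echo 8086 1502 > /sys/bus/pci/drivers/vfio-pci/new_id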

I also tweak THP scanning to make sure it is actively trying to
collapse pages:

echo always > /sys/kernel/mm/transparent_hugepage/defrag
echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
echo 65536 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
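
To confirm khugepaged is really collapsing pages while the test runs,
the THP counters are useful:

grep thp_collapse /proc/vmstat
grep AnonHugePages /proc/meminfo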

Run the test with 'vfio-iommu-map-unmap 0000:00:19.0', or your chosen
target device.

Of course to see that the mappings are moving, we need additional
sanity testing in the vfio iommu driver.  For that:

https://github.com/awilliam/linux-vfio/commit/379f324e3629349a7486018ad1cc5d4877228d1e

When we map memory for vfio, we use get_user_pages_fast() on the
process vaddr to give us a page.  page_to_pfn() then gives us the
physical memory address which we program into the iommu.  Obviously we
expect this mapping to be stable so long as we hold the page
reference.  On unmap we generally retrieve the physical memory address
from the iommu, convert it back to a page, and release our reference to
it.  The debug code above adds an additional sanity test where on unmap
we also call get_user_pages_fast() again before we release the
mapping reference and compare whether the physical page address still
matches what we previously stored in the iommu.  On a v4.4 kernel this
works every time.  On v4.5+, we get mismatches in dmesg within a few
lines of output from the test program.
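
In kernel terms, the pattern the sanity test relies on is roughly the
following (a simplified sketch of the logic described above, not the
actual vfio code; it assumes the usual <linux/mm.h> and <linux/iommu.h>
includes and uses the v4.5-era get_user_pages_fast() signature):

static int pin_map_and_recheck(struct iommu_domain *domain,
			       unsigned long vaddr, unsigned long iova)
{
	struct page *page, *recheck;
	phys_addr_t phys;
	int ret;

	/* map: pin the user page and program its pfn into the iommu */
	if (get_user_pages_fast(vaddr, 1, 1 /* write */, &page) != 1)
		return -EFAULT;

	ret = iommu_map(domain, iova, page_to_pfn(page) << PAGE_SHIFT,
			PAGE_SIZE, IOMMU_READ | IOMMU_WRITE);
	if (ret) {
		put_page(page);
		return ret;
	}

	/* ... time passes, khugepaged may work on the range ... */

	/* unmap: the extra sanity test re-pins and compares */
	phys = iommu_iova_to_phys(domain, iova);
	if (get_user_pages_fast(vaddr, 1, 1, &recheck) == 1) {
		WARN_ON(phys >> PAGE_SHIFT != page_to_pfn(recheck));
		put_page(recheck);
	}

	iommu_unmap(domain, iova, PAGE_SIZE);
	put_page(page);
	return 0;
}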

It's difficult to bisect around the THP reference counting redesign
since THP becomes disabled for much of it.  I have discovered that this
commit is a significant contributor:

1f25fe2 mm, thp: adjust conditions when we can reuse the page on WP fault

Particularly the middle chunk in huge_memory.c.  Reverting this change
alone makes the problem much less frequent, but does not lead to a stable
system.

I'm not an mm expert, so I'm looking for help debugging this.  As shown
above this issue is reproducible without KVM, so Andrea's previous KVM
specific fix to this code is not applicable.  It also still occurs on
kernels as recent as v4.6-rc5, so the issue hasn't been silently fixed
yet.  I'm able to reproduce this fairly quickly with the above test,
but it's not hard to imagine a test w/o any iommu dependencies which
simply does a user directed get_user_pages_fast() on a set of userspace
addresses, retains the reference, and at some point later rechecks that
a new get_user_pages_fast() results in the same page address.  It
appears that any sort of device assignment, either vfio or legacy kvm,
should be susceptible to this issue and therefore unsafe to use on v4.5+
kernels without using explicit hugepages or disabling THP.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [BUG] vfio device assignment regression with THP ref counting redesign
  2016-04-28 16:20 [BUG] vfio device assignment regression with THP ref counting redesign Alex Williamson
@ 2016-04-28 18:17 ` Kirill A. Shutemov
  2016-04-28 18:58   ` Alex Williamson
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2016-04-28 18:17 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kirill.shutemov, Andrea Arcangeli, linux-kernel, linux-mm

On Thu, Apr 28, 2016 at 10:20:51AM -0600, Alex Williamson wrote:
> Hi,
> 
> vfio-based device assignment makes use of get_user_pages_fast() in order
> to pin pages for mapping through the iommu for userspace drivers.
> Until the recent redesign of THP reference counting in the v4.5 kernel,
> this all worked well.  Now we're seeing cases where a sanity test
> before we release our "pinned" mapping results in a different page
> address than what we programmed into the iommu.  So something is
> occurring which pretty much negates the pinning we're trying to do.
> 
> The test program I'm using is here:
> 
> https://github.com/awilliam/tests/blob/master/vfio-iommu-map-unmap.c
> 
> Apologies for lack of makefile, simply build with gcc -o <out> <in.c>.
> 
> To run this, enable the IOMMU on your system - enable in BIOS plus add
> intel_iommu=on to the kernel commandline (only Intel x86_64 tested).
> 
> Pick a target PCI device, it doesn't matter what it is, the test only
> needs a device for the purpose of creating an iommu domain, the device
> is never actually touched.  In my case I use a spare NIC at 00:19.0.
> libvirt tools are useful for setting this up, simply run 'virsh
> nodedev-detach pci_0000_00_19_0'.  Otherwise bind the device manually
> to vfio-pci using the standard new_id bind (ask, I can provide
> instructions).
> 
> I also tweak THP scanning to make sure it is actively trying to
> collapse pages:
> 
> echo always > /sys/kernel/mm/transparent_hugepage/defrag
> echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
> echo 65536 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
> 
> Run the test with 'vfio-iommu-map-unmap 0000:00:19.0', or your chosen
> target device.
> 
> Of course to see that the mappings are moving, we need additional
> sanity testing in the vfio iommu driver.  For that:
> 
> https://github.com/awilliam/linux-vfio/commit/379f324e3629349a7486018ad1cc5d4877228d1e
> 
> When we map memory for vfio, we use get_user_pages_fast() on the
> process vaddr to give us a page.  page_to_pfn() then gives us the
> physical memory address which we program into the iommu.  Obviously we
> expect this mapping to be stable so long as we hold the page
> reference.  On unmap we generally retrieve the physical memory address
> from the iommu, convert it back to a page, and release our reference to
> it.  The debug code above adds an additional sanity test where on unmap
> we also call get_user_pages_fast() again before we're released the
> mapping reference and compare whether the physical page address still
> matches what we previously stored in the iommu.  On a v4.4 kernel this
> works every time.  On v4.5+, we get mismatches in dmesg within a few
> lines of output from the test program.
> 
> It's difficult to bisect around the THP reference counting redesign
> since THP becomes disabled for much of it.  I have discovered that this
> commit is a significant contributor:
> 
> 1f25fe2 mm, thp: adjust conditions when we can reuse the page on WP fault
> 
> Particularly the middle chunk in huge_memory.c.  Reverting this change
> alone significantly improves the problem, but does not lead to a stable
> system.
> 
> I'm not an mm expert, so I'm looking for help debugging this.  As shown
> above this issue is reproducible without KVM, so Andrea's previous KVM
> specific fix to this code is not applicable.  It also still occurs on
> kernels as recent as v4.6-rc5, so the issue hasn't been silently fixed
> yet.  I'm able to reproduce this fairly quickly with the above test,
> but it's not hard to imagine a test w/o any iommu dependencies which
> simply does a user directed get_user_pages_fast() on a set of userspace
> addresses, retains the reference, and at some point later rechecks that
> a new get_user_pages_fast() results in the same page address.  It
> appears that any sort of device assignment, either vfio or legacy kvm,
> should be susceptible to this issue and therefore unsafe to use on v4.5+
> kernels without using explicit hugepages or disabling THP.  Thanks,

I'm not able to reproduce it so far. How long does it usually take?

How much memory does your system have? Could you share your kernel config?

I've modified your instrumentation slightly to provide more info.
Could you try this:

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 75b24e93cedb..434954841d19 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -59,6 +59,7 @@ struct vfio_iommu {
 	struct rb_root		dma_list;
 	bool			v2;
 	bool			nesting;
+	bool			dying;
 };
 
 struct vfio_domain {
@@ -336,8 +337,10 @@ static long vfio_unpin_pages(unsigned long pfn, long npage,
 static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
 	dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
+	unsigned long vaddr = dma->vaddr;
 	struct vfio_domain *domain, *d;
 	long unlocked = 0;
+	struct page *page;
 
 	if (!dma->size)
 		return;
@@ -363,9 +366,20 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 		phys = iommu_iova_to_phys(domain->domain, iova);
 		if (WARN_ON(!phys)) {
 			iova += PAGE_SIZE;
+			vaddr += PAGE_SIZE;
 			continue;
 		}
 
+		if (!iommu->dying && get_user_pages_fast(vaddr, 1, !!(dma->prot & IOMMU_WRITE), &page) == 1) {
+			if (phys >> PAGE_SHIFT != page_to_pfn(page)) {
+				dump_page(page, NULL);
+				if (PageTail(page))
+					dump_page(compound_head(page), NULL);
+				dump_page(pfn_to_page(phys >> PAGE_SHIFT), "1");
+			}
+			put_page(page);
+		}
+
 		/*
 		 * To optimize for fewer iommu_unmap() calls, each of which
 		 * may require hardware cache flushing, try to find the
@@ -374,6 +388,17 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 		for (len = PAGE_SIZE;
 		     !domain->fgsp && iova + len < end; len += PAGE_SIZE) {
 			next = iommu_iova_to_phys(domain->domain, iova + len);
+
+			if (!iommu->dying && get_user_pages_fast(vaddr + len, 1, !!(dma->prot & IOMMU_WRITE), &page) == 1) {
+				if (next >> PAGE_SHIFT != page_to_pfn(page)) {
+					dump_page(page, NULL);
+					if (PageTail(page))
+						dump_page(compound_head(page), NULL);
+					dump_page(pfn_to_page(next >> PAGE_SHIFT), "2");
+				}
+				put_page(page);
+			}
+
 			if (next != phys + len)
 				break;
 		}
@@ -386,6 +411,7 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 					     unmapped >> PAGE_SHIFT,
 					     dma->prot, false);
 		iova += unmapped;
+		vaddr += unmapped;
 
 		cond_resched();
 	}
@@ -855,6 +881,8 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
 {
 	struct rb_node *node;
 
+	iommu->dying = true;
+
 	while ((node = rb_first(&iommu->dma_list)))
 		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
 }
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [BUG] vfio device assignment regression with THP ref counting redesign
  2016-04-28 18:17 ` Kirill A. Shutemov
@ 2016-04-28 18:58   ` Alex Williamson
  2016-04-28 23:21     ` Andrea Arcangeli
  0 siblings, 1 reply; 36+ messages in thread
From: Alex Williamson @ 2016-04-28 18:58 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: kirill.shutemov, Andrea Arcangeli, linux-kernel, linux-mm

On Thu, 28 Apr 2016 21:17:26 +0300
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Thu, Apr 28, 2016 at 10:20:51AM -0600, Alex Williamson wrote:
> > Hi,
> > 
> > vfio-based device assignment makes use of get_user_pages_fast() in order
> > to pin pages for mapping through the iommu for userspace drivers.
> > Until the recent redesign of THP reference counting in the v4.5 kernel,
> > this all worked well.  Now we're seeing cases where a sanity test
> > before we release our "pinned" mapping results in a different page
> > address than what we programmed into the iommu.  So something is
> > occurring which pretty much negates the pinning we're trying to do.
> > 
> > The test program I'm using is here:
> > 
> > https://github.com/awilliam/tests/blob/master/vfio-iommu-map-unmap.c
> > 
> > Apologies for lack of makefile, simply build with gcc -o <out> <in.c>.
> > 
> > To run this, enable the IOMMU on your system - enable in BIOS plus add
> > intel_iommu=on to the kernel commandline (only Intel x86_64 tested).
> > 
> > Pick a target PCI device, it doesn't matter what it is, the test only
> > needs a device for the purpose of creating an iommu domain, the device
> > is never actually touched.  In my case I use a spare NIC at 00:19.0.
> > libvirt tools are useful for setting this up, simply run 'virsh
> > nodedev-detach pci_0000_00_19_0'.  Otherwise bind the device manually
> > to vfio-pci using the standard new_id bind (ask, I can provide
> > instructions).
> > 
> > I also tweak THP scanning to make sure it is actively trying to
> > collapse pages:
> > 
> > echo always > /sys/kernel/mm/transparent_hugepage/defrag
> > echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
> > echo 65536 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
> > 
> > Run the test with 'vfio-iommu-map-unmap 0000:00:19.0', or your chosen
> > target device.
> > 
> > Of course to see that the mappings are moving, we need additional
> > sanity testing in the vfio iommu driver.  For that:
> > 
> > https://github.com/awilliam/linux-vfio/commit/379f324e3629349a7486018ad1cc5d4877228d1e
> > 
> > When we map memory for vfio, we use get_user_pages_fast() on the
> > process vaddr to give us a page.  page_to_pfn() then gives us the
> > physical memory address which we program into the iommu.  Obviously we
> > expect this mapping to be stable so long as we hold the page
> > reference.  On unmap we generally retrieve the physical memory address
> > from the iommu, convert it back to a page, and release our reference to
> > it.  The debug code above adds an additional sanity test where on unmap
> > we also call get_user_pages_fast() again before we're released the
> > mapping reference and compare whether the physical page address still
> > matches what we previously stored in the iommu.  On a v4.4 kernel this
> > works every time.  On v4.5+, we get mismatches in dmesg within a few
> > lines of output from the test program.
> > 
> > It's difficult to bisect around the THP reference counting redesign
> > since THP becomes disabled for much of it.  I have discovered that this
> > commit is a significant contributor:
> > 
> > 1f25fe2 mm, thp: adjust conditions when we can reuse the page on WP fault
> > 
> > Particularly the middle chunk in huge_memory.c.  Reverting this change
> > alone significantly improves the problem, but does not lead to a stable
> > system.
> > 
> > I'm not an mm expert, so I'm looking for help debugging this.  As shown
> > above this issue is reproducible without KVM, so Andrea's previous KVM
> > specific fix to this code is not applicable.  It also still occurs on
> > kernels as recent as v4.6-rc5, so the issue hasn't been silently fixed
> > yet.  I'm able to reproduce this fairly quickly with the above test,
> > but it's not hard to imagine a test w/o any iommu dependencies which
> > simply does a user directed get_user_pages_fast() on a set of userspace
> > addresses, retains the reference, and at some point later rechecks that
> > a new get_user_pages_fast() results in the same page address.  It
> > appears that any sort of device assignment, either vfio or legacy kvm,
> > should be susceptible to this issue and therefore unsafe to use on v4.5+
> > kernels without using explicit hugepages or disabling THP.  Thanks,  
> 
> I'm not able to reproduce it so far. How long does it usually take?

Generally within the first line of output from the test program.
 
> How much memory your system has? Could you share your kernel config?

24G, dual-socket Ivy Bridge EP.

Config:
https://paste.fedoraproject.org/360803/14618689/
 
> I've modified your instrumentation slightly to provide more info.
> Could you try this:

Thanks!  Results in:

[   83.429809] page:ffffea0010e57fc0 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.439696] flags: 0x2fffff80000000()
[   83.443408] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.454001] flags: 0x2fffff80044048(uptodate|active|head|swapbacked)
[   83.460456] page:ffffea0018a67fc0 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   83.470298] flags: 0x6fffff80000000()
[   83.473973] page dumped because: 1
[   83.477412] page:ffffea0010e57f80 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.487283] flags: 0x2fffff80000000()
[   83.490969] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.501502] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[   83.508915] page:ffffea0018a67f80 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   83.518758] flags: 0x6fffff80000000()
[   83.522443] page dumped because: 1
[   83.525874] page:ffffea0010e57f40 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.535737] flags: 0x2fffff80000000()
[   83.539434] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.549979] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[   83.557412] page:ffffea0018a67f40 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   83.567260] flags: 0x6fffff80000000()
[   83.570943] page dumped because: 1
[   83.574366] page:ffffea0010e57f00 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.584211] flags: 0x2fffff80000000()
[   83.587878] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.598413] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[   83.605862] page:ffffea0018a67f00 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   83.615722] flags: 0x6fffff80000000()
[   83.619399] page dumped because: 1
[   83.622835] page:ffffea0010e57ec0 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.632673] flags: 0x2fffff80000000()
[   83.636363] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.646893] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[   83.654302] page:ffffea0018a67ec0 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   83.664150] flags: 0x6fffff80000000()
[   83.667840] page dumped because: 1
[   83.671255] page:ffffea0010e57e80 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.681108] flags: 0x2fffff80000000()
[   83.684783] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.695335] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[   83.702773] page:ffffea0018a67e80 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   83.712640] flags: 0x6fffff80000000()
[   83.716335] page dumped because: 1
[   83.719746] page:ffffea0010e57e40 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.729591] flags: 0x2fffff80000000()
[   83.733279] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.743843] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[   83.751268] page:ffffea0018a67e40 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   83.761108] flags: 0x6fffff80000000()
[   83.764784] page dumped because: 1
[   83.768206] page:ffffea0010e57e00 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.778076] flags: 0x2fffff80000000()
[   83.781754] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.792283] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[   83.799712] page:ffffea0018a67e00 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   83.809559] flags: 0x6fffff80000000()
[   83.813257] page dumped because: 1
[   83.816722] page:ffffea0010e57dc0 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.826605] flags: 0x2fffff80000000()
[   83.830285] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.840877] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[   83.848321] page:ffffea0018a67dc0 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   83.858214] flags: 0x6fffff80000000()
[   83.861899] page dumped because: 1
[   83.865355] page:ffffea0010e57d80 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.875246] flags: 0x2fffff80000000()
[   83.878930] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.889525] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[   83.896970] page:ffffea0018a67d80 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   83.906883] flags: 0x6fffff80000000()
[   83.910563] page dumped because: 1
[   83.914018] page:ffffea0010e57d40 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.923862] flags: 0x2fffff80000000()
[   83.927540] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.938079] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[   83.945493] page:ffffea0018a67d40 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   83.955341] flags: 0x6fffff80000000()
[   83.959022] page dumped because: 1
[   83.962446] page:ffffea0010e57d00 count:0 mapcount:1 mapping:dead000000000400 index:0x1 compound_mapcount: 1
[   83.972296] flags: 0x2fffff80000000()
[   83.975980] page:ffffea0010e50000 count:3 mapcount:1 mapping:ffff88044c0fa8a1 index:0x7f8ae1400 compound_mapcount: 1
[   83.986516] flags: 0x2fffff8004404c(referenced|uptodate|active|head|swapbacked)
[   83.993932] page:ffffea0018a67d00 count:0 mapcount:0 mapping:dead000000000400 index:0x0 compound_mapcount: 0
[   84.003778] flags: 0x6fffff80000000()
[   84.007456] page dumped because: 1
...

As you can see by the kernel timestamp, this happened almost
immediately for me.  Thanks for taking a look at this,

Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [BUG] vfio device assignment regression with THP ref counting redesign
  2016-04-28 18:58   ` Alex Williamson
@ 2016-04-28 23:21     ` Andrea Arcangeli
  2016-04-29  0:44       ` Alex Williamson
  2016-04-29  0:51       ` Kirill A. Shutemov
  0 siblings, 2 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2016-04-28 23:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirill A. Shutemov, kirill.shutemov, linux-kernel, linux-mm

Hello Alex and Kirill,

On Thu, Apr 28, 2016 at 12:58:08PM -0600, Alex Williamson wrote:
> > > specific fix to this code is not applicable.  It also still occurs on
> > > kernels as recent as v4.6-rc5, so the issue hasn't been silently fixed
> > > yet.  I'm able to reproduce this fairly quickly with the above test,
> > > but it's not hard to imagine a test w/o any iommu dependencies which
> > > simply does a user directed get_user_pages_fast() on a set of userspace
> > > addresses, retains the reference, and at some point later rechecks that
> > > a new get_user_pages_fast() results in the same page address.  It

Can you try to "git revert 1f25fe20a76af0d960172fb104d4b13697cafa84"
and then apply the below patch on top of the revert?

Totally untested... if I missed something and it isn't correct, I hope
this brings us in the right direction faster at least.

Overall the problem I think is that we need to restore full accuracy:
we can't deal with false positive COWs (which aren't entirely cheap
either... reading 512 cachelines should be much faster than copying
2MB and touching 4MB of CPU cache, i.e. 32k vs 4MB).  The problem of
course is that when we really do need a COW we'll waste an additional
32k, but then it doesn't matter that much as we'd be forced to load
4MB of cache anyway in that case.  There's room for optimizations but
even the simple patch below would be ok for now.
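
Spelled out (assuming 4k pages, a 2MB THP and 64 byte cachelines):

  512 subpage mapcounts * 64 bytes  = 32k read for the accurate check
  2MB copied (source + destination) = 4MB of CPU cache touched by a COW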

From 09e3d1ff10b49fb9c3ab77f0b96a862848e30067 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Fri, 29 Apr 2016 01:05:06 +0200
Subject: [PATCH 1/1] mm: thp: calculate page_mapcount() correctly for THP
 pages

This allows reverting commit 1f25fe20a76af0d960172fb104d4b13697cafa84
and provides full accuracy for wrprotect faults, so page pinning
will stop causing false positive copy-on-writes.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/util.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/util.c b/mm/util.c
index 6cc81e7..a0b9f63 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -383,9 +383,10 @@ struct address_space *page_mapping(struct page *page)
 /* Slow path of page_mapcount() for compound pages */
 int __page_mapcount(struct page *page)
 {
-	int ret;
+	int ret = 0, i;
 
-	ret = atomic_read(&page->_mapcount) + 1;
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		ret = max(ret, atomic_read(&page[i]._mapcount) + 1);
 	page = compound_head(page);
 	ret += atomic_read(compound_mapcount_ptr(page)) + 1;
 	if (PageDoubleMap(page))

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [BUG] vfio device assignment regression with THP ref counting redesign
  2016-04-28 23:21     ` Andrea Arcangeli
@ 2016-04-29  0:44       ` Alex Williamson
  2016-04-29  0:51       ` Kirill A. Shutemov
  1 sibling, 0 replies; 36+ messages in thread
From: Alex Williamson @ 2016-04-29  0:44 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Kirill A. Shutemov, kirill.shutemov, linux-kernel, linux-mm

On Fri, 29 Apr 2016 01:21:27 +0200
Andrea Arcangeli <aarcange@redhat.com> wrote:

> Hello Alex and Kirill,
> 
> On Thu, Apr 28, 2016 at 12:58:08PM -0600, Alex Williamson wrote:
> > > > specific fix to this code is not applicable.  It also still occurs on
> > > > kernels as recent as v4.6-rc5, so the issue hasn't been silently fixed
> > > > yet.  I'm able to reproduce this fairly quickly with the above test,
> > > > but it's not hard to imagine a test w/o any iommu dependencies which
> > > > simply does a user directed get_user_pages_fast() on a set of userspace
> > > > addresses, retains the reference, and at some point later rechecks that
> > > > a new get_user_pages_fast() results in the same page address.  It  
> 
> Can you try to "git revert 1f25fe20a76af0d960172fb104d4b13697cafa84"
> and then apply the below patch on top of the revert?

Looking good so far!  I haven't seen any errors yet with this
combination of v4.5, 1f25fe20a reverted, and your patch applied on
top.  I'll keep testing since reverting 1f25fe20a alone already made
the bug much more elusive.  Thanks Andrea!

Alex

> Totally untested... if I missed something and it isn't correct, I hope
> this brings us in the right direction faster at least.
> 
> Overall the problem I think is that we need to restore full accuracy
> and we can't deal with false positive COWs (which aren't entirely
> cheap either... reading 512 cachelines should be much faster than
> copying 2MB and using 4MB of CPU cache). 32k vs 4MB. The problem of
> course is when we really need a COW, we'll waste an additional 32k,
> but then it doesn't matter that much as we'd be forced to load 4MB of
> cache anyway in such case. There's room for optimizations but even the
> simple below patch would be ok for now.
> 
> From 09e3d1ff10b49fb9c3ab77f0b96a862848e30067 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Fri, 29 Apr 2016 01:05:06 +0200
> Subject: [PATCH 1/1] mm: thp: calculate page_mapcount() correctly for THP
>  pages
> 
> This allows to revert commit 1f25fe20a76af0d960172fb104d4b13697cafa84
> and it provides fully accuracy with wrprotect faults so page pinning
> will stop causing false positive copy-on-writes.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  mm/util.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/util.c b/mm/util.c
> index 6cc81e7..a0b9f63 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -383,9 +383,10 @@ struct address_space *page_mapping(struct page *page)
>  /* Slow path of page_mapcount() for compound pages */
>  int __page_mapcount(struct page *page)
>  {
> -	int ret;
> +	int ret = 0, i;
>  
> -	ret = atomic_read(&page->_mapcount) + 1;
> +	for (i = 0; i < HPAGE_PMD_NR; i++)
> +		ret = max(ret, atomic_read(&page->_mapcount) + 1);
>  	page = compound_head(page);
>  	ret += atomic_read(compound_mapcount_ptr(page)) + 1;
>  	if (PageDoubleMap(page))

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [BUG] vfio device assignment regression with THP ref counting redesign
  2016-04-28 23:21     ` Andrea Arcangeli
  2016-04-29  0:44       ` Alex Williamson
@ 2016-04-29  0:51       ` Kirill A. Shutemov
  2016-04-29  2:45         ` Alex Williamson
  1 sibling, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2016-04-29  0:51 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Alex Williamson, kirill.shutemov, linux-kernel, linux-mm

On Fri, Apr 29, 2016 at 01:21:27AM +0200, Andrea Arcangeli wrote:
> Hello Alex and Kirill,
> 
> On Thu, Apr 28, 2016 at 12:58:08PM -0600, Alex Williamson wrote:
> > > > specific fix to this code is not applicable.  It also still occurs on
> > > > kernels as recent as v4.6-rc5, so the issue hasn't been silently fixed
> > > > yet.  I'm able to reproduce this fairly quickly with the above test,
> > > > but it's not hard to imagine a test w/o any iommu dependencies which
> > > > simply does a user directed get_user_pages_fast() on a set of userspace
> > > > addresses, retains the reference, and at some point later rechecks that
> > > > a new get_user_pages_fast() results in the same page address.  It
> 
> Can you try to "git revert 1f25fe20a76af0d960172fb104d4b13697cafa84"
> and then apply the below patch on top of the revert?
> 
> Totally untested... if I missed something and it isn't correct, I hope
> this brings us in the right direction faster at least.
> 
> Overall the problem I think is that we need to restore full accuracy
> and we can't deal with false positive COWs (which aren't entirely
> cheap either... reading 512 cachelines should be much faster than
> copying 2MB and using 4MB of CPU cache). 32k vs 4MB. The problem of
> course is when we really need a COW, we'll waste an additional 32k,
> but then it doesn't matter that much as we'd be forced to load 4MB of
> cache anyway in such case. There's room for optimizations but even the
> simple below patch would be ok for now.
> 
> From 09e3d1ff10b49fb9c3ab77f0b96a862848e30067 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Fri, 29 Apr 2016 01:05:06 +0200
> Subject: [PATCH 1/1] mm: thp: calculate page_mapcount() correctly for THP
>  pages
> 
> This allows to revert commit 1f25fe20a76af0d960172fb104d4b13697cafa84
> and it provides fully accuracy with wrprotect faults so page pinning
> will stop causing false positive copy-on-writes.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  mm/util.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/util.c b/mm/util.c
> index 6cc81e7..a0b9f63 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -383,9 +383,10 @@ struct address_space *page_mapping(struct page *page)
>  /* Slow path of page_mapcount() for compound pages */
>  int __page_mapcount(struct page *page)
>  {
> -	int ret;
> +	int ret = 0, i;
>  
> -	ret = atomic_read(&page->_mapcount) + 1;
> +	for (i = 0; i < HPAGE_PMD_NR; i++)
> +		ret = max(ret, atomic_read(&page->_mapcount) + 1);
>  	page = compound_head(page);
>  	ret += atomic_read(compound_mapcount_ptr(page)) + 1;
>  	if (PageDoubleMap(page))

You are right about the cause. I spent some time on the wrong path: I
was only able to trigger the bug with NUMA balancing enabled, so I
assumed something was wrong in that code...

I would like to preserve the current page_mapcount() behaviour.
I think this fix is better:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 86f9f8b82f8e..163c10f48e1b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1298,15 +1298,9 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
        VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
        /*
         * We can only reuse the page if nobody else maps the huge page or it's
-        * part. We can do it by checking page_mapcount() on each sub-page, but
-        * it's expensive.
-        * The cheaper way is to check page_count() to be equal 1: every
-        * mapcount takes page reference reference, so this way we can
-        * guarantee, that the PMD is the only mapping.
-        * This can give false negative if somebody pinned the page, but that's
-        * fine.
+        * part.
         */
-       if (page_mapcount(page) == 1 && page_count(page) == 1) {
+       if (total_mapcount(page) == 1) {
                pmd_t entry;
                entry = pmd_mkyoung(orig_pmd);
                entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [BUG] vfio device assignment regression with THP ref counting redesign
  2016-04-29  0:51       ` Kirill A. Shutemov
@ 2016-04-29  2:45         ` Alex Williamson
  2016-04-29  7:06           ` Kirill A. Shutemov
  0 siblings, 1 reply; 36+ messages in thread
From: Alex Williamson @ 2016-04-29  2:45 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, kirill.shutemov, linux-kernel, linux-mm

On Fri, 29 Apr 2016 03:51:06 +0300
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Fri, Apr 29, 2016 at 01:21:27AM +0200, Andrea Arcangeli wrote:
> > Hello Alex and Kirill,
> > 
> > On Thu, Apr 28, 2016 at 12:58:08PM -0600, Alex Williamson wrote:  
> > > > > specific fix to this code is not applicable.  It also still occurs on
> > > > > kernels as recent as v4.6-rc5, so the issue hasn't been silently fixed
> > > > > yet.  I'm able to reproduce this fairly quickly with the above test,
> > > > > but it's not hard to imagine a test w/o any iommu dependencies which
> > > > > simply does a user directed get_user_pages_fast() on a set of userspace
> > > > > addresses, retains the reference, and at some point later rechecks that
> > > > > a new get_user_pages_fast() results in the same page address.  It  
> > 
> > Can you try to "git revert 1f25fe20a76af0d960172fb104d4b13697cafa84"
> > and then apply the below patch on top of the revert?
> > 
> > Totally untested... if I missed something and it isn't correct, I hope
> > this brings us in the right direction faster at least.
> > 
> > Overall the problem I think is that we need to restore full accuracy
> > and we can't deal with false positive COWs (which aren't entirely
> > cheap either... reading 512 cachelines should be much faster than
> > copying 2MB and using 4MB of CPU cache). 32k vs 4MB. The problem of
> > course is when we really need a COW, we'll waste an additional 32k,
> > but then it doesn't matter that much as we'd be forced to load 4MB of
> > cache anyway in such case. There's room for optimizations but even the
> > simple below patch would be ok for now.
> > 
> > From 09e3d1ff10b49fb9c3ab77f0b96a862848e30067 Mon Sep 17 00:00:00 2001
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > Date: Fri, 29 Apr 2016 01:05:06 +0200
> > Subject: [PATCH 1/1] mm: thp: calculate page_mapcount() correctly for THP
> >  pages
> > 
> > This allows to revert commit 1f25fe20a76af0d960172fb104d4b13697cafa84
> > and it provides fully accuracy with wrprotect faults so page pinning
> > will stop causing false positive copy-on-writes.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > ---
> >  mm/util.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/util.c b/mm/util.c
> > index 6cc81e7..a0b9f63 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -383,9 +383,10 @@ struct address_space *page_mapping(struct page *page)
> >  /* Slow path of page_mapcount() for compound pages */
> >  int __page_mapcount(struct page *page)
> >  {
> > -	int ret;
> > +	int ret = 0, i;
> >  
> > -	ret = atomic_read(&page->_mapcount) + 1;
> > +	for (i = 0; i < HPAGE_PMD_NR; i++)
> > +		ret = max(ret, atomic_read(&page->_mapcount) + 1);
> >  	page = compound_head(page);
> >  	ret += atomic_read(compound_mapcount_ptr(page)) + 1;
> >  	if (PageDoubleMap(page))  
> 
> You are right about the cause. I spend some time on wrong path: I was only
> able to trigger the bug with numa balancing enabled, so I assumed
> something is wrong in that code...
> 
> I would like to preserve current page_mapcount() behaviouts.
> I think this fix is better:

This also seems to work in my testing, but assuming all else being
equal, there is a performance difference between the two for this test
case in favor of Andrea's solution.  Modifying the test to exit after
the first set of iterations, my system takes on average 107s to complete
with the solution below or 103.5s with the other approach.  Please note
that I have every mm debugging option I could find enabled and THP
scanning at full speed on the system, so I don't know how this would play
out in a more tuned configuration.

The only reason I noticed is that I added a side test to sleep a random
number of seconds and kill the test program because sometimes killing
the test triggers errors.  I didn't see any errors with either of these
solutions, but suspected the first solution was completing more
iterations for similar intervals.  Modifying the test to exit seems to
prove that true.

I can't speak to which is the more architecturally correct solution,
but there may be a measurable performance difference to consider.
Thanks,

Alex

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 86f9f8b82f8e..163c10f48e1b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1298,15 +1298,9 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>         VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
>         /*
>          * We can only reuse the page if nobody else maps the huge page or it's
> -        * part. We can do it by checking page_mapcount() on each sub-page, but
> -        * it's expensive.
> -        * The cheaper way is to check page_count() to be equal 1: every
> -        * mapcount takes page reference reference, so this way we can
> -        * guarantee, that the PMD is the only mapping.
> -        * This can give false negative if somebody pinned the page, but that's
> -        * fine.
> +        * part.
>          */
> -       if (page_mapcount(page) == 1 && page_count(page) == 1) {
> +       if (total_mapcount(page) == 1) {
>                 pmd_t entry;
>                 entry = pmd_mkyoung(orig_pmd);
>                 entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [BUG] vfio device assignment regression with THP ref counting redesign
  2016-04-29  2:45         ` Alex Williamson
@ 2016-04-29  7:06           ` Kirill A. Shutemov
  2016-04-29 15:12             ` Alex Williamson
  2016-04-29 16:34             ` Andrea Arcangeli
  0 siblings, 2 replies; 36+ messages in thread
From: Kirill A. Shutemov @ 2016-04-29  7:06 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Andrea Arcangeli, kirill.shutemov, linux-kernel, linux-mm

On Thu, Apr 28, 2016 at 08:45:42PM -0600, Alex Williamson wrote:
> On Fri, 29 Apr 2016 03:51:06 +0300
> "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
> 
> > On Fri, Apr 29, 2016 at 01:21:27AM +0200, Andrea Arcangeli wrote:
> > > Hello Alex and Kirill,
> > > 
> > > On Thu, Apr 28, 2016 at 12:58:08PM -0600, Alex Williamson wrote:  
> > > > > > specific fix to this code is not applicable.  It also still occurs on
> > > > > > kernels as recent as v4.6-rc5, so the issue hasn't been silently fixed
> > > > > > yet.  I'm able to reproduce this fairly quickly with the above test,
> > > > > > but it's not hard to imagine a test w/o any iommu dependencies which
> > > > > > simply does a user directed get_user_pages_fast() on a set of userspace
> > > > > > addresses, retains the reference, and at some point later rechecks that
> > > > > > a new get_user_pages_fast() results in the same page address.  It  
> > > 
> > > Can you try to "git revert 1f25fe20a76af0d960172fb104d4b13697cafa84"
> > > and then apply the below patch on top of the revert?
> > > 
> > > Totally untested... if I missed something and it isn't correct, I hope
> > > this brings us in the right direction faster at least.
> > > 
> > > Overall the problem I think is that we need to restore full accuracy
> > > and we can't deal with false positive COWs (which aren't entirely
> > > cheap either... reading 512 cachelines should be much faster than
> > > copying 2MB and using 4MB of CPU cache). 32k vs 4MB. The problem of
> > > course is when we really need a COW, we'll waste an additional 32k,
> > > but then it doesn't matter that much as we'd be forced to load 4MB of
> > > cache anyway in such case. There's room for optimizations but even the
> > > simple below patch would be ok for now.
> > > 
> > > From 09e3d1ff10b49fb9c3ab77f0b96a862848e30067 Mon Sep 17 00:00:00 2001
> > > From: Andrea Arcangeli <aarcange@redhat.com>
> > > Date: Fri, 29 Apr 2016 01:05:06 +0200
> > > Subject: [PATCH 1/1] mm: thp: calculate page_mapcount() correctly for THP
> > >  pages
> > > 
> > > This allows to revert commit 1f25fe20a76af0d960172fb104d4b13697cafa84
> > > and it provides fully accuracy with wrprotect faults so page pinning
> > > will stop causing false positive copy-on-writes.
> > > 
> > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > > ---
> > >  mm/util.c | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/mm/util.c b/mm/util.c
> > > index 6cc81e7..a0b9f63 100644
> > > --- a/mm/util.c
> > > +++ b/mm/util.c
> > > @@ -383,9 +383,10 @@ struct address_space *page_mapping(struct page *page)
> > >  /* Slow path of page_mapcount() for compound pages */
> > >  int __page_mapcount(struct page *page)
> > >  {
> > > -	int ret;
> > > +	int ret = 0, i;
> > >  
> > > -	ret = atomic_read(&page->_mapcount) + 1;
> > > +	for (i = 0; i < HPAGE_PMD_NR; i++)
> > > +		ret = max(ret, atomic_read(&page->_mapcount) + 1);
> > >  	page = compound_head(page);
> > >  	ret += atomic_read(compound_mapcount_ptr(page)) + 1;
> > >  	if (PageDoubleMap(page))  
> > 
> > You are right about the cause. I spend some time on wrong path: I was only
> > able to trigger the bug with numa balancing enabled, so I assumed
> > something is wrong in that code...
> > 
> > I would like to preserve current page_mapcount() behaviouts.
> > I think this fix is better:
> 
> This also seems to work in my testing, but assuming all else being
> equal, there is a performance difference between the two for this test
> case in favor of Andrea's solution.  Modifying the test to exit after
> the first set of iterations, my system takes on average 107s to complete
> with the solution below or 103.5s with the other approach.  Please note
> that I have every mm debugging option I could find enabled and THP
> scanning full speed on the system, so I don't know how this would play
> out in a more tuned configuration.
> 
> The only reason I noticed is that I added a side test to sleep a random
> number of seconds and kill the test program because sometimes killing
> the test triggers errors.  I didn't see any errors with either of these
> solutions, but suspected the first solution was completing more
> iterations for similar intervals.  Modifying the test to exit seems to
> prove that true.
> 
> I can't speak to which is the more architecturally correct solution,
> but there may be a measurable performance difference to consider.

Hm. I just woke up and haven't got any coffee yet, but I don't see why my
approach would be worse for performance. Both have the same algorithmic
complexity.

> Thanks,
> 
> Alex
> 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 86f9f8b82f8e..163c10f48e1b 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1298,15 +1298,9 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >         VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
> >         /*
> >          * We can only reuse the page if nobody else maps the huge page or it's
> > -        * part. We can do it by checking page_mapcount() on each sub-page, but
> > -        * it's expensive.
> > -        * The cheaper way is to check page_count() to be equal 1: every
> > -        * mapcount takes page reference reference, so this way we can
> > -        * guarantee, that the PMD is the only mapping.
> > -        * This can give false negative if somebody pinned the page, but that's
> > -        * fine.
> > +        * part.
> >          */
> > -       if (page_mapcount(page) == 1 && page_count(page) == 1) {
> > +       if (total_mapcount(page) == 1) {
> >                 pmd_t entry;
> >                 entry = pmd_mkyoung(orig_pmd);
> >                 entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [BUG] vfio device assignment regression with THP ref counting redesign
  2016-04-29  7:06           ` Kirill A. Shutemov
@ 2016-04-29 15:12             ` Alex Williamson
  2016-04-29 16:34             ` Andrea Arcangeli
  1 sibling, 0 replies; 36+ messages in thread
From: Alex Williamson @ 2016-04-29 15:12 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, kirill.shutemov, linux-kernel, linux-mm

On Fri, 29 Apr 2016 10:06:11 +0300
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Thu, Apr 28, 2016 at 08:45:42PM -0600, Alex Williamson wrote:
> > On Fri, 29 Apr 2016 03:51:06 +0300
> > "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
> >   
> > > On Fri, Apr 29, 2016 at 01:21:27AM +0200, Andrea Arcangeli wrote:  
> > > > Hello Alex and Kirill,
> > > > 
> > > > On Thu, Apr 28, 2016 at 12:58:08PM -0600, Alex Williamson wrote:    
> > > > > > > specific fix to this code is not applicable.  It also still occurs on
> > > > > > > kernels as recent as v4.6-rc5, so the issue hasn't been silently fixed
> > > > > > > yet.  I'm able to reproduce this fairly quickly with the above test,
> > > > > > > but it's not hard to imagine a test w/o any iommu dependencies which
> > > > > > > simply does a user directed get_user_pages_fast() on a set of userspace
> > > > > > > addresses, retains the reference, and at some point later rechecks that
> > > > > > > a new get_user_pages_fast() results in the same page address.  It    
> > > > 
> > > > Can you try to "git revert 1f25fe20a76af0d960172fb104d4b13697cafa84"
> > > > and then apply the below patch on top of the revert?
> > > > 
> > > > Totally untested... if I missed something and it isn't correct, I hope
> > > > this brings us in the right direction faster at least.
> > > > 
> > > > Overall the problem I think is that we need to restore full accuracy
> > > > and we can't deal with false positive COWs (which aren't entirely
> > > > cheap either... reading 512 cachelines should be much faster than
> > > > copying 2MB and using 4MB of CPU cache). 32k vs 4MB. The problem of
> > > > course is when we really need a COW, we'll waste an additional 32k,
> > > > but then it doesn't matter that much as we'd be forced to load 4MB of
> > > > cache anyway in such case. There's room for optimizations but even the
> > > > simple below patch would be ok for now.
> > > > 
> > > > From 09e3d1ff10b49fb9c3ab77f0b96a862848e30067 Mon Sep 17 00:00:00 2001
> > > > From: Andrea Arcangeli <aarcange@redhat.com>
> > > > Date: Fri, 29 Apr 2016 01:05:06 +0200
> > > > Subject: [PATCH 1/1] mm: thp: calculate page_mapcount() correctly for THP
> > > >  pages
> > > > 
> > > > This allows to revert commit 1f25fe20a76af0d960172fb104d4b13697cafa84
> > > > and it provides fully accuracy with wrprotect faults so page pinning
> > > > will stop causing false positive copy-on-writes.
> > > > 
> > > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > > > ---
> > > >  mm/util.c | 5 +++--
> > > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/mm/util.c b/mm/util.c
> > > > index 6cc81e7..a0b9f63 100644
> > > > --- a/mm/util.c
> > > > +++ b/mm/util.c
> > > > @@ -383,9 +383,10 @@ struct address_space *page_mapping(struct page *page)
> > > >  /* Slow path of page_mapcount() for compound pages */
> > > >  int __page_mapcount(struct page *page)
> > > >  {
> > > > -	int ret;
> > > > +	int ret = 0, i;
> > > >  
> > > > -	ret = atomic_read(&page->_mapcount) + 1;
> > > > +	for (i = 0; i < HPAGE_PMD_NR; i++)
> > > > +		ret = max(ret, atomic_read(&page->_mapcount) + 1);
> > > >  	page = compound_head(page);
> > > >  	ret += atomic_read(compound_mapcount_ptr(page)) + 1;
> > > >  	if (PageDoubleMap(page))    
> > > 
> > > You are right about the cause. I spend some time on wrong path: I was only
> > > able to trigger the bug with numa balancing enabled, so I assumed
> > > something is wrong in that code...
> > > 
> > > I would like to preserve current page_mapcount() behaviouts.
> > > I think this fix is better:  
> > 
> > This also seems to work in my testing, but assuming all else being
> > equal, there is a performance difference between the two for this test
> > case in favor of Andrea's solution.  Modifying the test to exit after
> > the first set of iterations, my system takes on average 107s to complete
> > with the solution below or 103.5s with the other approach.  Please note
> > that I have every mm debugging option I could find enabled and THP
> > scanning full speed on the system, so I don't know how this would play
> > out in a more tuned configuration.
> > 
> > The only reason I noticed is that I added a side test to sleep a random
> > number of seconds and kill the test program because sometimes killing
> > the test triggers errors.  I didn't see any errors with either of these
> > solutions, but suspected the first solution was completing more
> > iterations for similar intervals.  Modifying the test to exit seems to
> > prove that true.
> > 
> > I can't speak to which is the more architecturally correct solution,
> > but there may be a measurable performance difference to consider.  
> 
> Hm. I just woke up and haven't got any coffee yet, but I don't why my
> approach would be worse for performance. Both have the same algorithmic
> complexity.

I can't explain it either; I won't claim to understand either solution,
but with all the kernel hacking vm debug/sanity options disabled, there
still appears to be a very slight advantage to Andrea's proposal.  I
expect the test program should show this even if you're having a
difficult time using it to reproduce the bug.  I ran Andrea's patch
overnight, no issues reported.  Running your patch for an extended test
now.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [BUG] vfio device assignment regression with THP ref counting redesign
  2016-04-29  7:06           ` Kirill A. Shutemov
  2016-04-29 15:12             ` Alex Williamson
@ 2016-04-29 16:34             ` Andrea Arcangeli
  2016-04-29 22:34               ` Alex Williamson
  2016-05-02 10:41               ` Kirill A. Shutemov
  1 sibling, 2 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2016-04-29 16:34 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Alex Williamson, kirill.shutemov, linux-kernel, linux-mm

On Fri, Apr 29, 2016 at 10:06:11AM +0300, Kirill A. Shutemov wrote:
> Hm. I just woke up and haven't got any coffee yet, but I don't why my
> approach would be worse for performance. Both have the same algorithmic
> complexity.

Even before looking at the overall performance, I'm not sure your
patch is really fixing it all: you didn't touch reuse_swap_page which
is used by do_wp_page to know if it can call do_wp_page_reuse. Your
patch would still trigger a COW instead of calling do_wp_page_reuse,
but it would only happen if the page was pinned after the pmd split,
which is probably not what the testcase is triggering. My patch
instead fixed that too.

total_mapcount returns the wrong value for reuse_swap_page, which is
probably why you didn't try to use it there.

The main issue with my patch is its performance downside:
page_mapcount() becomes expensive for all other usages.  That is still
better than breaking vfio, but I couldn't use total_mapcount() here
either because it counts things wrong in reuse_swap_page.

Like I said there's room for optimizations so today I tried to
optimize more stuff...

From 74f1fd7fab71a2cce0d1796fb38241acde2c1224 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Fri, 29 Apr 2016 01:05:06 +0200
Subject: [PATCH 1/1] mm: thp: calculate the mapcount correctly for THP pages
 during WP faults

This will provide full accuracy in the mapcount calculation for
write protect faults, so page pinning will not get broken by false
positive copy-on-writes.

total_mapcount() isn't the right calculation needed in
reuse_swap_page(), so this introduces page_trans_huge_mapcount(),
which is effectively the fully accurate return value for
page_mapcount() when dealing with Transparent Hugepages.  However, we
only use page_trans_huge_mapcount() during COW faults, where it is
strictly needed, due to its higher runtime cost.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm.h   |  5 +++++
 include/linux/swap.h |  3 +--
 mm/huge_memory.c     | 44 ++++++++++++++++++++++++++++++++++++--------
 mm/swapfile.c        |  5 +----
 4 files changed, 43 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8fb3604..c2026a1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -501,11 +501,16 @@ static inline int page_mapcount(struct page *page)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 int total_mapcount(struct page *page);
+int page_trans_huge_mapcount(struct page *page);
 #else
 static inline int total_mapcount(struct page *page)
 {
 	return page_mapcount(page);
 }
+static inline int page_trans_huge_mapcount(struct page *page)
+{
+	return page_mapcount(page);
+}
 #endif
 
 static inline struct page *virt_to_head_page(const void *x)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2f6478f..905bf8e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -517,8 +517,7 @@ static inline int swp_swapcount(swp_entry_t entry)
 	return 0;
 }
 
-#define reuse_swap_page(page) \
-	(!PageTransCompound(page) && page_mapcount(page) == 1)
+#define reuse_swap_page(page) (page_trans_huge_mapcount(page) == 1)
 
 static inline int try_to_free_swap(struct page *page)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 06bce0f..6a6d9c0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1298,15 +1298,9 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
 	/*
 	 * We can only reuse the page if nobody else maps the huge page or it's
-	 * part. We can do it by checking page_mapcount() on each sub-page, but
-	 * it's expensive.
-	 * The cheaper way is to check page_count() to be equal 1: every
-	 * mapcount takes page reference reference, so this way we can
-	 * guarantee, that the PMD is the only mapping.
-	 * This can give false negative if somebody pinned the page, but that's
-	 * fine.
+	 * part.
 	 */
-	if (page_mapcount(page) == 1 && page_count(page) == 1) {
+	if (page_trans_huge_mapcount(page) == 1) {
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -3226,6 +3220,40 @@ int total_mapcount(struct page *page)
 }
 
 /*
+ * This calculates accurately how many mappings a transparent hugepage
+ * has (unlike page_mapcount() which isn't fully accurate). This full
+ * accuracy is primarily needed to know if copy-on-write faults can
+ * takeover the page and change the mapping to read-write instead of
+ * copying them. This is different from total_mapcount() too: we must
+ * not count all mappings on the subpages individually, but instead we
+ * must check the highest mapcount any one of the subpages has.
+ *
+ * It would be entirely safe and even more correct to replace
+ * page_mapcount() with page_trans_huge_mapcount(), however we only
+ * use page_trans_huge_mapcount() in the copy-on-write faults where we
+ * need full accuracy to avoid breaking page pinning.
+ */
+int page_trans_huge_mapcount(struct page *page)
+{
+	int i, ret;
+
+	VM_BUG_ON_PAGE(PageTail(page), page);
+
+	if (likely(!PageCompound(page)))
+		return atomic_read(&page->_mapcount) + 1;
+
+	ret = 0;
+	if (likely(!PageHuge(page))) {
+		for (i = 0; i < HPAGE_PMD_NR; i++)
+			ret = max(ret, atomic_read(&page[i]._mapcount) + 1);
+		if (PageDoubleMap(page))
+			ret -= 1;
+	}
+	ret += compound_mapcount(page);
+	return ret;
+}
+
+/*
  * This function splits huge page into normal pages. @page can point to any
  * subpage of huge page to split. Split doesn't change the position of @page.
  *
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 83874ec..984470a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -930,10 +930,7 @@ int reuse_swap_page(struct page *page)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	if (unlikely(PageKsm(page)))
 		return 0;
-	/* The page is part of THP and cannot be reused */
-	if (PageTransCompound(page))
-		return 0;
-	count = page_mapcount(page);
+	count = page_trans_huge_mapcount(page);
 	if (count <= 1 && PageSwapCache(page)) {
 		count += page_swapcount(page);
 		if (count == 1 && !PageWriteback(page)) {

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [BUG] vfio device assignment regression with THP ref counting redesign
  2016-04-29 16:34             ` Andrea Arcangeli
@ 2016-04-29 22:34               ` Alex Williamson
  2016-05-02 10:41               ` Kirill A. Shutemov
  1 sibling, 0 replies; 36+ messages in thread
From: Alex Williamson @ 2016-04-29 22:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Kirill A. Shutemov, kirill.shutemov, linux-kernel, linux-mm

On Fri, 29 Apr 2016 18:34:44 +0200
Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Fri, Apr 29, 2016 at 10:06:11AM +0300, Kirill A. Shutemov wrote:
> > Hm. I just woke up and haven't got any coffee yet, but I don't why my
> > approach would be worse for performance. Both have the same algorithmic
> > complexity.  
> 
> Even before looking at the overall performance, I'm not sure your
> patch is really fixing it all: you didn't touch reuse_swap_page which
> is used by do_wp_page to know if it can call do_wp_page_reuse. Your
> patch would still trigger a COW instead of calling do_wp_page_reuse,
> but it would only happen if the page was pinned after the pmd split,
> which is probably not what the testcase is triggering. My patch
> instead fixed that too.
> 
> total_mapcount returns the wrong value for reuse_swap_page, which is
> probably why you didn't try to use it there.
> 
> The main issue of my patch is that it has a performance downside that
> is page_mapcount becomes expensive for all other usages, which is
> better than breaking vfio but I couldn't use total_mapcount again
> because it counts things wrong in reuse_swap_page.
> 
> Like I said there's room for optimizations so today I tried to
> optimize more stuff...

I've had this under test for several hours without error.  Thanks!

Alex


> From 74f1fd7fab71a2cce0d1796fb38241acde2c1224 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Fri, 29 Apr 2016 01:05:06 +0200
> Subject: [PATCH 1/1] mm: thp: calculate the mapcount correctly for THP pages
>  during WP faults
> 
> This will provide fully accuracy to the mapcount calculation in the
> write protect faults, so page pinning will not get broken by false
> positive copy-on-writes.
> 
> total_mapcount() isn't the right calculation needed in
> reuse_swap_page, so this introduces a page_trans_huge_mapcount() that
> is effectively the full accurate return value for page_mapcount() if
> dealing with Transparent Hugepages, however we only use the
> page_trans_huge_mapcount() during COW faults where it strictly needed,
> due to its higher runtime cost.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/mm.h   |  5 +++++
>  include/linux/swap.h |  3 +--
>  mm/huge_memory.c     | 44 ++++++++++++++++++++++++++++++++++++--------
>  mm/swapfile.c        |  5 +----
>  4 files changed, 43 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8fb3604..c2026a1 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -501,11 +501,16 @@ static inline int page_mapcount(struct page *page)
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  int total_mapcount(struct page *page);
> +int page_trans_huge_mapcount(struct page *page);
>  #else
>  static inline int total_mapcount(struct page *page)
>  {
>  	return page_mapcount(page);
>  }
> +static inline int page_trans_huge_mapcount(struct page *page)
> +{
> +	return page_mapcount(page);
> +}
>  #endif
>  
>  static inline struct page *virt_to_head_page(const void *x)
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2f6478f..905bf8e 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -517,8 +517,7 @@ static inline int swp_swapcount(swp_entry_t entry)
>  	return 0;
>  }
>  
> -#define reuse_swap_page(page) \
> -	(!PageTransCompound(page) && page_mapcount(page) == 1)
> +#define reuse_swap_page(page) (page_trans_huge_mapcount(page) == 1)
>  
>  static inline int try_to_free_swap(struct page *page)
>  {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 06bce0f..6a6d9c0 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1298,15 +1298,9 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
>  	/*
>  	 * We can only reuse the page if nobody else maps the huge page or it's
> -	 * part. We can do it by checking page_mapcount() on each sub-page, but
> -	 * it's expensive.
> -	 * The cheaper way is to check page_count() to be equal 1: every
> -	 * mapcount takes page reference reference, so this way we can
> -	 * guarantee, that the PMD is the only mapping.
> -	 * This can give false negative if somebody pinned the page, but that's
> -	 * fine.
> +	 * part.
>  	 */
> -	if (page_mapcount(page) == 1 && page_count(page) == 1) {
> +	if (page_trans_huge_mapcount(page) == 1) {
>  		pmd_t entry;
>  		entry = pmd_mkyoung(orig_pmd);
>  		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> @@ -3226,6 +3220,40 @@ int total_mapcount(struct page *page)
>  }
>  
>  /*
> + * This calculates accurately how many mappings a transparent hugepage
> + * has (unlike page_mapcount() which isn't fully accurate). This full
> + * accuracy is primarily needed to know if copy-on-write faults can
> + * takeover the page and change the mapping to read-write instead of
> + * copying them. This is different from total_mapcount() too: we must
> + * not count all mappings on the subpages individually, but instead we
> + * must check the highest mapcount any one of the subpages has.
> + *
> + * It would be entirely safe and even more correct to replace
> + * page_mapcount() with page_trans_huge_mapcount(), however we only
> + * use page_trans_huge_mapcount() in the copy-on-write faults where we
> + * need full accuracy to avoid breaking page pinning.
> + */
> +int page_trans_huge_mapcount(struct page *page)
> +{
> +	int i, ret;
> +
> +	VM_BUG_ON_PAGE(PageTail(page), page);
> +
> +	if (likely(!PageCompound(page)))
> +		return atomic_read(&page->_mapcount) + 1;
> +
> +	ret = 0;
> +	if (likely(!PageHuge(page))) {
> +		for (i = 0; i < HPAGE_PMD_NR; i++)
> +			ret = max(ret, atomic_read(&page[i]._mapcount) + 1);
> +		if (PageDoubleMap(page))
> +			ret -= 1;
> +	}
> +	ret += compound_mapcount(page);
> +	return ret;
> +}
> +
> +/*
>   * This function splits huge page into normal pages. @page can point to any
>   * subpage of huge page to split. Split doesn't change the position of @page.
>   *
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 83874ec..984470a 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -930,10 +930,7 @@ int reuse_swap_page(struct page *page)
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  	if (unlikely(PageKsm(page)))
>  		return 0;
> -	/* The page is part of THP and cannot be reused */
> -	if (PageTransCompound(page))
> -		return 0;
> -	count = page_mapcount(page);
> +	count = page_trans_huge_mapcount(page);
>  	if (count <= 1 && PageSwapCache(page)) {
>  		count += page_swapcount(page);
>  		if (count == 1 && !PageWriteback(page)) {

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [BUG] vfio device assignment regression with THP ref counting redesign
  2016-04-29 16:34             ` Andrea Arcangeli
  2016-04-29 22:34               ` Alex Williamson
@ 2016-05-02 10:41               ` Kirill A. Shutemov
  2016-05-02 11:15                 ` Jerome Glisse
  2016-05-02 15:23                 ` [BUG] vfio device assignment regression with THP ref counting redesign Andrea Arcangeli
  1 sibling, 2 replies; 36+ messages in thread
From: Kirill A. Shutemov @ 2016-05-02 10:41 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Alex Williamson, kirill.shutemov, linux-kernel, linux-mm

On Fri, Apr 29, 2016 at 06:34:44PM +0200, Andrea Arcangeli wrote:
> On Fri, Apr 29, 2016 at 10:06:11AM +0300, Kirill A. Shutemov wrote:
> > Hm. I just woke up and haven't got any coffee yet, but I don't why my
> > approach would be worse for performance. Both have the same algorithmic
> > complexity.
> 
> Even before looking at the overall performance, I'm not sure your
> patch is really fixing it all: you didn't touch reuse_swap_page which
> is used by do_wp_page to know if it can call do_wp_page_reuse. Your
> patch would still trigger a COW instead of calling do_wp_page_reuse,
> but it would only happen if the page was pinned after the pmd split,
> which is probably not what the testcase is triggering. My patch
> instead fixed that too.
> 
> total_mapcount returns the wrong value for reuse_swap_page, which is
> probably why you didn't try to use it there.
> 
> The main issue of my patch is that it has a performance downside that
> is page_mapcount becomes expensive for all other usages, which is
> better than breaking vfio but I couldn't use total_mapcount again
> because it counts things wrong in reuse_swap_page.
> 
> Like I said there's room for optimizations so today I tried to
> optimize more stuff...
> 
> From 74f1fd7fab71a2cce0d1796fb38241acde2c1224 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Fri, 29 Apr 2016 01:05:06 +0200
> Subject: [PATCH 1/1] mm: thp: calculate the mapcount correctly for THP pages
>  during WP faults
> 
> This will provide fully accuracy to the mapcount calculation in the
> write protect faults, so page pinning will not get broken by false
> positive copy-on-writes.
> 
> total_mapcount() isn't the right calculation needed in
> reuse_swap_page, so this introduces a page_trans_huge_mapcount() that
> is effectively the full accurate return value for page_mapcount() if
> dealing with Transparent Hugepages, however we only use the
> page_trans_huge_mapcount() during COW faults where it strictly needed,
> due to its higher runtime cost.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/mm.h   |  5 +++++
>  include/linux/swap.h |  3 +--
>  mm/huge_memory.c     | 44 ++++++++++++++++++++++++++++++++++++--------
>  mm/swapfile.c        |  5 +----
>  4 files changed, 43 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8fb3604..c2026a1 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -501,11 +501,16 @@ static inline int page_mapcount(struct page *page)
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  int total_mapcount(struct page *page);
> +int page_trans_huge_mapcount(struct page *page);
>  #else
>  static inline int total_mapcount(struct page *page)
>  {
>  	return page_mapcount(page);
>  }
> +static inline int page_trans_huge_mapcount(struct page *page)
> +{
> +	return page_mapcount(page);
> +}
>  #endif
>  
>  static inline struct page *virt_to_head_page(const void *x)
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2f6478f..905bf8e 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -517,8 +517,7 @@ static inline int swp_swapcount(swp_entry_t entry)
>  	return 0;
>  }
>  
> -#define reuse_swap_page(page) \
> -	(!PageTransCompound(page) && page_mapcount(page) == 1)
> +#define reuse_swap_page(page) (page_trans_huge_mapcount(page) == 1)
>  
>  static inline int try_to_free_swap(struct page *page)
>  {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 06bce0f..6a6d9c0 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1298,15 +1298,9 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
>  	/*
>  	 * We can only reuse the page if nobody else maps the huge page or it's
> -	 * part. We can do it by checking page_mapcount() on each sub-page, but
> -	 * it's expensive.
> -	 * The cheaper way is to check page_count() to be equal 1: every
> -	 * mapcount takes page reference reference, so this way we can
> -	 * guarantee, that the PMD is the only mapping.
> -	 * This can give false negative if somebody pinned the page, but that's
> -	 * fine.
> +	 * part.
>  	 */
> -	if (page_mapcount(page) == 1 && page_count(page) == 1) {
> +	if (page_trans_huge_mapcount(page) == 1) {
>  		pmd_t entry;
>  		entry = pmd_mkyoung(orig_pmd);
>  		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> @@ -3226,6 +3220,40 @@ int total_mapcount(struct page *page)
>  }
>  
>  /*
> + * This calculates accurately how many mappings a transparent hugepage
> + * has (unlike page_mapcount() which isn't fully accurate). This full
> + * accuracy is primarily needed to know if copy-on-write faults can
> + * takeover the page and change the mapping to read-write instead of
> + * copying them. This is different from total_mapcount() too: we must
> + * not count all mappings on the subpages individually, but instead we
> + * must check the highest mapcount any one of the subpages has.
> + *
> + * It would be entirely safe and even more correct to replace
> + * page_mapcount() with page_trans_huge_mapcount(), however we only
> + * use page_trans_huge_mapcount() in the copy-on-write faults where we
> + * need full accuracy to avoid breaking page pinning.
> + */
> +int page_trans_huge_mapcount(struct page *page)
> +{
> +	int i, ret;
> +
> +	VM_BUG_ON_PAGE(PageTail(page), page);
> +
> +	if (likely(!PageCompound(page)))
> +		return atomic_read(&page->_mapcount) + 1;
> +
> +	ret = 0;
> +	if (likely(!PageHuge(page))) {
> +		for (i = 0; i < HPAGE_PMD_NR; i++)
> +			ret = max(ret, atomic_read(&page[i]._mapcount) + 1);
> +		if (PageDoubleMap(page))
> +			ret -= 1;
> +	}
> +	ret += compound_mapcount(page);
> +	return ret;
> +}
> +
> +/*
>   * This function splits huge page into normal pages. @page can point to any
>   * subpage of huge page to split. Split doesn't change the position of @page.
>   *
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 83874ec..984470a 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -930,10 +930,7 @@ int reuse_swap_page(struct page *page)
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  	if (unlikely(PageKsm(page)))
>  		return 0;
> -	/* The page is part of THP and cannot be reused */
> -	if (PageTransCompound(page))
> -		return 0;
> -	count = page_mapcount(page);
> +	count = page_trans_huge_mapcount(page);
>  	if (count <= 1 && PageSwapCache(page)) {
>  		count += page_swapcount(page);
>  		if (count == 1 && !PageWriteback(page)) {

I don't think this would work correctly. Let's check one of callers:

static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
		unsigned long address, pte_t *page_table, pmd_t *pmd,
		spinlock_t *ptl, pte_t orig_pte)
	__releases(ptl)
{
...
		if (reuse_swap_page(old_page)) {
			/*
			 * The page is all ours.  Move it to our anon_vma so
			 * the rmap code will not search our parent or siblings.
			 * Protected against the rmap code by the page lock.
			 */
			page_move_anon_rmap(old_page, vma, address);
			unlock_page(old_page);
			return wp_page_reuse(mm, vma, address, page_table, ptl,
					     orig_pte, old_page, 0, 0);
		}

The first thing to notice is that old_page can be a tail page here,
therefore page_move_anon_rmap() should be able to handle this after your
patch, which it doesn't.

But I think there's a bigger problem.

Consider the following situation: after split_huge_pmd() we have a
pte-mapped THP, fork() comes and now the page is shared between two
processes. The child process munmap()s one half of the THP page, the
parent munmap()s the other half.

IIUC, after that page_trans_huge_mapcount() would give us 1, as all 4k
subpages have a mapcount of exactly one. A fault in the child would
trigger do_wp_page() and reuse_swap_page() would return true, which
would lead to page_move_anon_rmap() transferring the whole compound
page to the child's anon_vma. That's not correct.

We should at least avoid page_move_anon_rmap() for compound pages there.
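
Just to make that sequence concrete, here is a rough userspace sketch of it
(untested and illustrative only; it assumes x86-64 with a 2MB THP, that
MADV_HUGEPAGE actually gives us a huge page here, and that mprotect() on a
single 4k subpage is what pmd-splits the mapping):

#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define THP	(2UL << 20)
#define PAGE	4096UL

int main(void)
{
	/* THP-aligned anonymous mapping, faulted in as a huge page */
	char *map = mmap(NULL, 2 * THP, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *p = (char *)(((unsigned long)map + THP - 1) & ~(THP - 1));

	madvise(p, THP, MADV_HUGEPAGE);
	memset(p, 1, THP);

	/* split_huge_pmd(): the THP stays compound but is now pte-mapped */
	mprotect(p, PAGE, PROT_READ);
	mprotect(p, PAGE, PROT_READ | PROT_WRITE);

	if (fork() == 0) {
		/* the child keeps only the second half... */
		munmap(p, THP / 2);
		sleep(1);	/* let the parent drop its half first */
		/*
		 * COW fault: every remaining subpage now has mapcount 1, so
		 * a mapcount-only reuse check says "reuse", and
		 * page_move_anon_rmap() would move the whole compound page
		 * to the child's anon_vma even though the first half is
		 * still mapped only by the parent.
		 */
		p[THP / 2] = 2;
		_exit(0);
	}
	/* ...while the parent keeps only the first half */
	munmap(p + THP / 2, THP / 2);
	wait(NULL);
	return 0;
}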


The other thing I would like to discuss is whether there's a problem on the
vfio side. To me it looks like vfio expects a guarantee from
get_user_pages() which it doesn't provide: obtaining a pin on the page
doesn't guarantee that the page is going to remain mapped into userspace
until the pin is gone.

Even with the THP COW regression fixed, vfio would stay fragile: any
MADV_DONTNEED/fork()/mremap()/whatever would break vfio's expectation.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [BUG] vfio device assignment regression with THP ref counting redesign
  2016-05-02 10:41               ` Kirill A. Shutemov
@ 2016-05-02 11:15                 ` Jerome Glisse
  2016-05-02 12:14                   ` GUP guarantees wrt to userspace mappings redesign Kirill A. Shutemov
  2016-05-02 15:23                 ` [BUG] vfio device assignment regression with THP ref counting redesign Andrea Arcangeli
  1 sibling, 1 reply; 36+ messages in thread
From: Jerome Glisse @ 2016-05-02 11:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Alex Williamson, kirill.shutemov, linux-kernel,
	linux-mm

On Mon, May 02, 2016 at 01:41:19PM +0300, Kirill A. Shutemov wrote:
> On Fri, Apr 29, 2016 at 06:34:44PM +0200, Andrea Arcangeli wrote:
> > On Fri, Apr 29, 2016 at 10:06:11AM +0300, Kirill A. Shutemov wrote:
> > > Hm. I just woke up and haven't got any coffee yet, but I don't why my
> > > approach would be worse for performance. Both have the same algorithmic
> > > complexity.
> > 
> > Even before looking at the overall performance, I'm not sure your
> > patch is really fixing it all: you didn't touch reuse_swap_page which
> > is used by do_wp_page to know if it can call do_wp_page_reuse. Your
> > patch would still trigger a COW instead of calling do_wp_page_reuse,
> > but it would only happen if the page was pinned after the pmd split,
> > which is probably not what the testcase is triggering. My patch
> > instead fixed that too.
> > 
> > total_mapcount returns the wrong value for reuse_swap_page, which is
> > probably why you didn't try to use it there.
> > 
> > The main issue of my patch is that it has a performance downside that
> > is page_mapcount becomes expensive for all other usages, which is
> > better than breaking vfio but I couldn't use total_mapcount again
> > because it counts things wrong in reuse_swap_page.
> > 
> > Like I said there's room for optimizations so today I tried to
> > optimize more stuff...
> > 
> > From 74f1fd7fab71a2cce0d1796fb38241acde2c1224 Mon Sep 17 00:00:00 2001
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > Date: Fri, 29 Apr 2016 01:05:06 +0200
> > Subject: [PATCH 1/1] mm: thp: calculate the mapcount correctly for THP pages
> >  during WP faults
> > 
> > This will provide fully accuracy to the mapcount calculation in the
> > write protect faults, so page pinning will not get broken by false
> > positive copy-on-writes.
> > 
> > total_mapcount() isn't the right calculation needed in
> > reuse_swap_page, so this introduces a page_trans_huge_mapcount() that
> > is effectively the full accurate return value for page_mapcount() if
> > dealing with Transparent Hugepages, however we only use the
> > page_trans_huge_mapcount() during COW faults where it strictly needed,
> > due to its higher runtime cost.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > ---
> >  include/linux/mm.h   |  5 +++++
> >  include/linux/swap.h |  3 +--
> >  mm/huge_memory.c     | 44 ++++++++++++++++++++++++++++++++++++--------
> >  mm/swapfile.c        |  5 +----
> >  4 files changed, 43 insertions(+), 14 deletions(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 8fb3604..c2026a1 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -501,11 +501,16 @@ static inline int page_mapcount(struct page *page)
> >  
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >  int total_mapcount(struct page *page);
> > +int page_trans_huge_mapcount(struct page *page);
> >  #else
> >  static inline int total_mapcount(struct page *page)
> >  {
> >  	return page_mapcount(page);
> >  }
> > +static inline int page_trans_huge_mapcount(struct page *page)
> > +{
> > +	return page_mapcount(page);
> > +}
> >  #endif
> >  
> >  static inline struct page *virt_to_head_page(const void *x)
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 2f6478f..905bf8e 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -517,8 +517,7 @@ static inline int swp_swapcount(swp_entry_t entry)
> >  	return 0;
> >  }
> >  
> > -#define reuse_swap_page(page) \
> > -	(!PageTransCompound(page) && page_mapcount(page) == 1)
> > +#define reuse_swap_page(page) (page_trans_huge_mapcount(page) == 1)
> >  
> >  static inline int try_to_free_swap(struct page *page)
> >  {
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 06bce0f..6a6d9c0 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1298,15 +1298,9 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >  	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
> >  	/*
> >  	 * We can only reuse the page if nobody else maps the huge page or it's
> > -	 * part. We can do it by checking page_mapcount() on each sub-page, but
> > -	 * it's expensive.
> > -	 * The cheaper way is to check page_count() to be equal 1: every
> > -	 * mapcount takes page reference reference, so this way we can
> > -	 * guarantee, that the PMD is the only mapping.
> > -	 * This can give false negative if somebody pinned the page, but that's
> > -	 * fine.
> > +	 * part.
> >  	 */
> > -	if (page_mapcount(page) == 1 && page_count(page) == 1) {
> > +	if (page_trans_huge_mapcount(page) == 1) {
> >  		pmd_t entry;
> >  		entry = pmd_mkyoung(orig_pmd);
> >  		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> > @@ -3226,6 +3220,40 @@ int total_mapcount(struct page *page)
> >  }
> >  
> >  /*
> > + * This calculates accurately how many mappings a transparent hugepage
> > + * has (unlike page_mapcount() which isn't fully accurate). This full
> > + * accuracy is primarily needed to know if copy-on-write faults can
> > + * takeover the page and change the mapping to read-write instead of
> > + * copying them. This is different from total_mapcount() too: we must
> > + * not count all mappings on the subpages individually, but instead we
> > + * must check the highest mapcount any one of the subpages has.
> > + *
> > + * It would be entirely safe and even more correct to replace
> > + * page_mapcount() with page_trans_huge_mapcount(), however we only
> > + * use page_trans_huge_mapcount() in the copy-on-write faults where we
> > + * need full accuracy to avoid breaking page pinning.
> > + */
> > +int page_trans_huge_mapcount(struct page *page)
> > +{
> > +	int i, ret;
> > +
> > +	VM_BUG_ON_PAGE(PageTail(page), page);
> > +
> > +	if (likely(!PageCompound(page)))
> > +		return atomic_read(&page->_mapcount) + 1;
> > +
> > +	ret = 0;
> > +	if (likely(!PageHuge(page))) {
> > +		for (i = 0; i < HPAGE_PMD_NR; i++)
> > +			ret = max(ret, atomic_read(&page[i]._mapcount) + 1);
> > +		if (PageDoubleMap(page))
> > +			ret -= 1;
> > +	}
> > +	ret += compound_mapcount(page);
> > +	return ret;
> > +}
> > +
> > +/*
> >   * This function splits huge page into normal pages. @page can point to any
> >   * subpage of huge page to split. Split doesn't change the position of @page.
> >   *
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 83874ec..984470a 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -930,10 +930,7 @@ int reuse_swap_page(struct page *page)
> >  	VM_BUG_ON_PAGE(!PageLocked(page), page);
> >  	if (unlikely(PageKsm(page)))
> >  		return 0;
> > -	/* The page is part of THP and cannot be reused */
> > -	if (PageTransCompound(page))
> > -		return 0;
> > -	count = page_mapcount(page);
> > +	count = page_trans_huge_mapcount(page);
> >  	if (count <= 1 && PageSwapCache(page)) {
> >  		count += page_swapcount(page);
> >  		if (count == 1 && !PageWriteback(page)) {
> 
> I don't think this would work correctly. Let's check one of callers:
> 
> static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> 		unsigned long address, pte_t *page_table, pmd_t *pmd,
> 		spinlock_t *ptl, pte_t orig_pte)
> 	__releases(ptl)
> {
> ...
> 		if (reuse_swap_page(old_page)) {
> 			/*
> 			 * The page is all ours.  Move it to our anon_vma so
> 			 * the rmap code will not search our parent or siblings.
> 			 * Protected against the rmap code by the page lock.
> 			 */
> 			page_move_anon_rmap(old_page, vma, address);
> 			unlock_page(old_page);
> 			return wp_page_reuse(mm, vma, address, page_table, ptl,
> 					     orig_pte, old_page, 0, 0);
> 		}
> 
> The first thing to notice is that old_page can be a tail page here
> therefore page_move_anon_rmap() should be able to handle this after you
> patch, which it doesn't.
> 
> But I think there's a bigger problem.
> 
> Consider the following situation: after split_huge_pmd() we have
> pte-mapped THP, fork() comes and now the pages is shared between two
> processes. Child process munmap()s one half of the THP page, parent
> munmap()s the other half.
> 
> IIUC, afther that page_trans_huge_mapcount() would give us 1 as all 4k
> subpages have mapcount exactly one. Fault in the child would trigger
> do_wp_page() and reuse_swap_page() returns true, which would lead to
> page_move_anon_rmap() tranferring the whole compound page to child's
> anon_vma. That's not correct.
> 
> We should at least avoid page_move_anon_rmap() for compound pages there.
> 
> 
> Other thing I would like to discuss is if there's a problem on vfio side.
> To me it looks like vfio expects guarantee from get_user_pages() which it
> doesn't provide: obtaining pin on the page doesn't guarantee that the page
> is going to remain mapped into userspace until the pin is gone.
> 
> Even with THP COW regressing fixed, vfio would stay fragile: any
> MADV_DONTNEED/fork()/mremap()/whatever what would make vfio expectation
> broken.
> 

Well, I don't think that is a fair/accurate assessment of get_user_pages():
the page must remain mapped at the same virtual address until the pin is
gone. I am ignoring mremap() as it is a deliberate decision from userspace,
and while the virtual address changes in that case, the pinned page behind
it should move with the mapping. Same for MADV_DONTNEED. I agree that
get_user_pages() is broken after fork(), but that has been the case since
the dawn of time, so it is something expected.

If not vfio, then direct-io has been expecting this kind of behavior for a
long time, so I see this as part of the get_user_pages() guarantee.

Concerning vfio, not providing this guarantee will break countless
workloads. Things like qemu/kvm allocate anonymous memory and hand it over
to the guest kernel, which presents it as memory. Now a device driver
inside the guest kernel needs to get a bus mapping for a given (guest)
page, which from the host point of view means a mapping from an anonymous
page to a bus mapping; but for the guest to keep accessing the same page,
the anonymous mapping (i.e. a specific virtual address on the host side)
must keep pointing to the same page. This has been the case with
get_user_pages() until now, so whether we like it or not we must keep that
guarantee.

These kinds of workloads know that they can't do mremap()/fork()/... and
keep that guarantee, but they do expect the existing guarantee and I don't
think we can break that.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 36+ messages in thread

* GUP guarantees wrt to userspace mappings redesign
  2016-05-02 11:15                 ` Jerome Glisse
@ 2016-05-02 12:14                   ` Kirill A. Shutemov
  2016-05-02 13:39                     ` Jerome Glisse
                                       ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Kirill A. Shutemov @ 2016-05-02 12:14 UTC (permalink / raw)
  To: Jerome Glisse, Oleg Nesterov, Hugh Dickins, Linus Torvalds,
	Andrew Morton
  Cc: Andrea Arcangeli, Alex Williamson, kirill.shutemov, linux-kernel,
	linux-mm

On Mon, May 02, 2016 at 01:15:13PM +0200, Jerome Glisse wrote:
> On Mon, May 02, 2016 at 01:41:19PM +0300, Kirill A. Shutemov wrote:
> > Other thing I would like to discuss is if there's a problem on vfio side.
> > To me it looks like vfio expects guarantee from get_user_pages() which it
> > doesn't provide: obtaining pin on the page doesn't guarantee that the page
> > is going to remain mapped into userspace until the pin is gone.
> > 
> > Even with THP COW regressing fixed, vfio would stay fragile: any
> > MADV_DONTNEED/fork()/mremap()/whatever what would make vfio expectation
> > broken.
> > 
> 
> Well i don't think it is fair/accurate assessment of get_user_pages(), page
> must remain mapped to same virtual address until pin is gone. I am ignoring
> mremap() as it is a scient decision from userspace and while virtual address
> change in that case, the pined page behind should move with the mapping.
> Same of MADV_DONTNEED. I agree that get_user_pages() is broken after fork()
> but this have been the case since dawn of time, so it is something expected.
> 
> If not vfio, then direct-io, have been expecting this kind of behavior for
> long time, so i see this as part of get_user_pages() guarantee.
> 
> Concerning vfio, not providing this guarantee will break countless number of
> workload. Thing like qemu/kvm allocate anonymous memory and hand it over to
> the guest kernel which presents it as memory. Now a device driver inside the
> guest kernel need to get bus mapping for a given (guest) page, which from
> host point of view means a mapping from anonymous page to bus mapping but
> for guest to keep accessing the same page the anonymous mapping (ie a
> specific virtual address on the host side) must keep pointing to the same
> page. This have been the case with get_user_pages() until now, so whether
> we like it or not we must keep that guarantee.
> 
> This kind of workload knows that they can't do mremap()/fork()/... and keep
> that guarantee but they at expect existing guarantee and i don't think we
> can break that.

Quick look around:

 - I don't see any page_count() check around __replace_page() in uprobes,
   so it can easily replace a pinned page.

 - KSM has the page_count() check, but there's still a race wrt GUP_fast:
   it can take the pin between the check and establishing the new pte
   entry.

 - khugepaged: the same story as with KSM.

I don't see how we can deliver on the guarantee, especially with lockless
GUP_fast.

Or am I missing something important?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: GUP guarantees wrt to userspace mappings redesign
  2016-05-02 12:14                   ` GUP guarantees wrt to userspace mappings redesign Kirill A. Shutemov
@ 2016-05-02 13:39                     ` Jerome Glisse
  2016-05-02 15:00                       ` GUP guarantees wrt to userspace mappings Kirill A. Shutemov
  2016-05-02 14:15                     ` GUP guarantees wrt to userspace mappings redesign Oleg Nesterov
  2016-05-02 18:56                     ` Andrea Arcangeli
  2 siblings, 1 reply; 36+ messages in thread
From: Jerome Glisse @ 2016-05-02 13:39 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Oleg Nesterov, Hugh Dickins, Linus Torvalds, Andrew Morton,
	Andrea Arcangeli, Alex Williamson, kirill.shutemov, linux-kernel,
	linux-mm

On Mon, May 02, 2016 at 03:14:02PM +0300, Kirill A. Shutemov wrote:
> On Mon, May 02, 2016 at 01:15:13PM +0200, Jerome Glisse wrote:
> > On Mon, May 02, 2016 at 01:41:19PM +0300, Kirill A. Shutemov wrote:
> > > Other thing I would like to discuss is if there's a problem on vfio side.
> > > To me it looks like vfio expects guarantee from get_user_pages() which it
> > > doesn't provide: obtaining pin on the page doesn't guarantee that the page
> > > is going to remain mapped into userspace until the pin is gone.
> > > 
> > > Even with THP COW regressing fixed, vfio would stay fragile: any
> > > MADV_DONTNEED/fork()/mremap()/whatever what would make vfio expectation
> > > broken.
> > > 
> > 
> > Well i don't think it is fair/accurate assessment of get_user_pages(), page
> > must remain mapped to same virtual address until pin is gone. I am ignoring
> > mremap() as it is a scient decision from userspace and while virtual address
> > change in that case, the pined page behind should move with the mapping.
> > Same of MADV_DONTNEED. I agree that get_user_pages() is broken after fork()
> > but this have been the case since dawn of time, so it is something expected.
> > 
> > If not vfio, then direct-io, have been expecting this kind of behavior for
> > long time, so i see this as part of get_user_pages() guarantee.
> > 
> > Concerning vfio, not providing this guarantee will break countless number of
> > workload. Thing like qemu/kvm allocate anonymous memory and hand it over to
> > the guest kernel which presents it as memory. Now a device driver inside the
> > guest kernel need to get bus mapping for a given (guest) page, which from
> > host point of view means a mapping from anonymous page to bus mapping but
> > for guest to keep accessing the same page the anonymous mapping (ie a
> > specific virtual address on the host side) must keep pointing to the same
> > page. This have been the case with get_user_pages() until now, so whether
> > we like it or not we must keep that guarantee.
> > 
> > This kind of workload knows that they can't do mremap()/fork()/... and keep
> > that guarantee but they at expect existing guarantee and i don't think we
> > can break that.
> 
> Quick look around:
> 
>  - I don't see any check page_count() around __replace_page() in uprobes,
>    so it can easily replace pinned page.

Not an issue for existing users, as this is only used to instrument code;
existing users do not execute code from virtual addresses for which they
have done a GUP.

> 
>  - KSM has the page_count() check, there's still race wrt GUP_fast: it can
>    take the pin between the check and establishing new pte entry.

KSM is not an issue for existing users, as they all do get_user_pages()
with write = 1, and KSM first maps the page read-only before considering
replacing it and checking the page refcount. So there can be no race with
gup_fast there.

> 
>  - khugepaged: the same story as with KSM.

I am assuming you are talking about collapse_huge_page() here; if you look
in that function there is a comment about GUP_fast. Nonetheless I believe
the comment is wrong, as I believe there is an existing race window between
pmdp_collapse_flush() and __collapse_huge_page_isolate():

  get_user_pages_fast()          | collapse_huge_page()
   gup_pmd_range() -> valid pmd  | ...
                                 | pmdp_collapse_flush() clear pmd
                                 | ...
                                 | __collapse_huge_page_isolate()
                                 | [Above check page count and see no GUP]
   gup_pte_range() -> ref page   |

This is a very unlikely race because get_user_pages_fast() cannot be
preempted, while collapse_huge_page() can be preempted between
pmdp_collapse_flush() and __collapse_huge_page_isolate(); moreover,
collapse_huge_page() has a lot more instructions to chew on than
get_user_pages_fast() has between gup_pmd_range() and gup_pte_range().

So I think this is an unlikely race. I am not sure how to forbid it from
happening, except maybe in get_user_pages_fast() by checking that the pmd
is still valid after gup_pte_range().

> 
> I don't see how we can deliver on the guarantee, especially with lockless
> GUP_fast.
> 
> Or am I missing something important?

So as said above, I think existing users of get_user_pages() are not
sensitive to the races you pointed out above. I am sure there are some
corner cases where the guarantee that GUP pins a page against a virtual
address is violated, but I do not think they apply to any existing user of
GUP.

Note that I would personally prefer that this existing assumption about GUP
did not exist. I hate it, but the fact is that it does exist and nobody can
remember where Doc parked the DeLorean.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: GUP guarantees wrt to userspace mappings redesign
  2016-05-02 12:14                   ` GUP guarantees wrt to userspace mappings redesign Kirill A. Shutemov
  2016-05-02 13:39                     ` Jerome Glisse
@ 2016-05-02 14:15                     ` Oleg Nesterov
  2016-05-02 16:21                       ` Kirill A. Shutemov
  2016-05-02 18:56                     ` Andrea Arcangeli
  2 siblings, 1 reply; 36+ messages in thread
From: Oleg Nesterov @ 2016-05-02 14:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Jerome Glisse, Hugh Dickins, Linus Torvalds, Andrew Morton,
	Andrea Arcangeli, Alex Williamson, kirill.shutemov, linux-kernel,
	linux-mm

I am sure I missed the problem, but...

On 05/02, Kirill A. Shutemov wrote:
>
> Quick look around:
>
>  - I don't see any check page_count() around __replace_page() in uprobes,
>    so it can easily replace pinned page.

Why should it? Even if it races with get_user_pages_fast()... this doesn't
differ from the case when an application writes to a MAP_PRIVATE
non-anonymous region, no?

Oleg.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: GUP guarantees wrt to userspace mappings
  2016-05-02 13:39                     ` Jerome Glisse
@ 2016-05-02 15:00                       ` Kirill A. Shutemov
  2016-05-02 15:22                         ` Jerome Glisse
  2016-05-02 19:02                         ` Andrea Arcangeli
  0 siblings, 2 replies; 36+ messages in thread
From: Kirill A. Shutemov @ 2016-05-02 15:00 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Oleg Nesterov, Hugh Dickins, Linus Torvalds, Andrew Morton,
	Andrea Arcangeli, Alex Williamson, kirill.shutemov, linux-kernel,
	linux-mm

On Mon, May 02, 2016 at 03:39:20PM +0200, Jerome Glisse wrote:
> On Mon, May 02, 2016 at 03:14:02PM +0300, Kirill A. Shutemov wrote:
> > On Mon, May 02, 2016 at 01:15:13PM +0200, Jerome Glisse wrote:
> > > On Mon, May 02, 2016 at 01:41:19PM +0300, Kirill A. Shutemov wrote:
> > > > Other thing I would like to discuss is if there's a problem on vfio side.
> > > > To me it looks like vfio expects guarantee from get_user_pages() which it
> > > > doesn't provide: obtaining pin on the page doesn't guarantee that the page
> > > > is going to remain mapped into userspace until the pin is gone.
> > > > 
> > > > Even with THP COW regressing fixed, vfio would stay fragile: any
> > > > MADV_DONTNEED/fork()/mremap()/whatever what would make vfio expectation
> > > > broken.
> > > > 
> > > 
> > > Well i don't think it is fair/accurate assessment of get_user_pages(), page
> > > must remain mapped to same virtual address until pin is gone. I am ignoring
> > > mremap() as it is a scient decision from userspace and while virtual address
> > > change in that case, the pined page behind should move with the mapping.
> > > Same of MADV_DONTNEED. I agree that get_user_pages() is broken after fork()
> > > but this have been the case since dawn of time, so it is something expected.
> > > 
> > > If not vfio, then direct-io, have been expecting this kind of behavior for
> > > long time, so i see this as part of get_user_pages() guarantee.
> > > 
> > > Concerning vfio, not providing this guarantee will break countless number of
> > > workload. Thing like qemu/kvm allocate anonymous memory and hand it over to
> > > the guest kernel which presents it as memory. Now a device driver inside the
> > > guest kernel need to get bus mapping for a given (guest) page, which from
> > > host point of view means a mapping from anonymous page to bus mapping but
> > > for guest to keep accessing the same page the anonymous mapping (ie a
> > > specific virtual address on the host side) must keep pointing to the same
> > > page. This have been the case with get_user_pages() until now, so whether
> > > we like it or not we must keep that guarantee.
> > > 
> > > This kind of workload knows that they can't do mremap()/fork()/... and keep
> > > that guarantee but they at expect existing guarantee and i don't think we
> > > can break that.
> > 
> > Quick look around:
> > 
> >  - I don't see any check page_count() around __replace_page() in uprobes,
> >    so it can easily replace pinned page.
> 
> Not an issue for existing user as this is only use to instrument code, existing
> user do not execute code from virtual address for which they have done a GUP.

Okay, so we can establish that GUP doesn't provide the guarantee in some
cases.

> >  - KSM has the page_count() check, there's still race wrt GUP_fast: it can
> >    take the pin between the check and establishing new pte entry.
> 
> KSM is not an issue for existing user as they all do get_user_pages() with
> write = 1 and the KSM first map page read only before considering to replace
> them and check page refcount. So there can be no race with gup_fast there.

In the vfio case, 'write' is conditional on IOMMU_WRITE, meaning not all
get_user_pages() calls are done with write=1.
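
For reference, the pinning path in vfio_iommu_type1 looks roughly like the
sketch below (paraphrased from memory, not quoted from the tree -- check
the actual code before relying on details), which is where that conditional
write comes from:

/* Rough paraphrase of vfio_iommu_type1's pinning helper. */
static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
{
	struct page *page[1];

	/* write=1 only when the iommu mapping is writable */
	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
		*pfn = page_to_pfn(page[0]);
		return 0;
	}

	/* (the real function also falls back to a follow-the-vma path) */
	return -EFAULT;
}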

> >  - khugepaged: the same story as with KSM.
> 
> I am assuming you are talking about collapse_huge_page() here, if you look in
> that function there is a comment about GUP_fast. Noneless i believe the comment
> is wrong as i believe there is an existing race window btw pmdp_collapse_flush()
> and __collapse_huge_page_isolate() :
> 
>   get_user_pages_fast()          | collapse_huge_page()
>    gup_pmd_range() -> valid pmd  | ...
>                                  | pmdp_collapse_flush() clear pmd
>                                  | ...
>                                  | __collapse_huge_page_isolate()
>                                  | [Above check page count and see no GUP]
>    gup_pte_range() -> ref page   |
> 
> This is a very unlikely race because get_user_pages_fast() can not be preempted
> while collapse_huge_page() can be preempted btw pmdp_collapse_flush() and
> __collapse_huge_page_isolate(), more over collapse_huge_page() has lot more
> instructions to chew on than get_user_pages_fast() btw gup_pmd_range() and
> gup_pte_range().

Yes, the race window is small, but it is there.

> So i think this is an unlikely race. I am not sure how to forbid it from
> happening, except maybe in get_user_pages_fast() by checking pmd is still
> valid after gup_pte_range().

Switching to non-fast GUP would help :-P

> > I don't see how we can deliver on the guarantee, especially with lockless
> > GUP_fast.
> > 
> > Or am I missing something important?
> 
> So as said above, i think existing user of get_user_pages() are not sensitive
> to the races you pointed above. I am sure there are some corner case where
> the guarantee that GUP pin a page against a virtual address is violated but
> i do not think they apply to any existing user of GUP.
> 
> Note that i would personaly like that this existing assumption about GUP did
> not exist. I hate it, but fact is that it does exist and nobody can remember
> where the Doc did park the Delorean

Drivers that want the guarantee can provide their own ->mmap and have more
control over what is visible in userspace.

Alternatively, we have mmu_notifiers to track changes in userspace
mappings.
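
For the mmu_notifier route, a minimal sketch (roughly against the v4.5-era
API; my_dma_ctx and my_iommu_unmap are made-up placeholders for whatever
the driver would actually do) could look like:

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

struct my_dma_ctx {
	struct mmu_notifier mn;
	/* driver-private iommu state would live here */
};

/* Placeholder: tear down (or quiesce) the iommu mappings shadowing
 * [start, end); the real work is device specific. */
static void my_iommu_unmap(struct my_dma_ctx *ctx,
			   unsigned long start, unsigned long end)
{
}

static void my_invalidate_range_start(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start, unsigned long end)
{
	struct my_dma_ctx *ctx = container_of(mn, struct my_dma_ctx, mn);

	/*
	 * The userspace mapping of [start, end) is about to change: stop
	 * relying on a GUP pin keeping the old page in place and drop the
	 * iommu mappings that shadow this range.
	 */
	my_iommu_unmap(ctx, start, end);
}

static const struct mmu_notifier_ops my_mn_ops = {
	.invalidate_range_start = my_invalidate_range_start,
};

static int my_track_mm(struct my_dma_ctx *ctx, struct mm_struct *mm)
{
	ctx->mn.ops = &my_mn_ops;
	return mmu_notifier_register(&ctx->mn, mm);
}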

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: GUP guarantees wrt to userspace mappings
  2016-05-02 15:00                       ` GUP guarantees wrt to userspace mappings Kirill A. Shutemov
@ 2016-05-02 15:22                         ` Jerome Glisse
  2016-05-02 16:12                           ` Kirill A. Shutemov
  2016-05-02 19:11                           ` Andrea Arcangeli
  2016-05-02 19:02                         ` Andrea Arcangeli
  1 sibling, 2 replies; 36+ messages in thread
From: Jerome Glisse @ 2016-05-02 15:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Oleg Nesterov, Hugh Dickins, Linus Torvalds, Andrew Morton,
	Andrea Arcangeli, Alex Williamson, kirill.shutemov, linux-kernel,
	linux-mm

On Mon, May 02, 2016 at 06:00:13PM +0300, Kirill A. Shutemov wrote:
> On Mon, May 02, 2016 at 03:39:20PM +0200, Jerome Glisse wrote:
> > On Mon, May 02, 2016 at 03:14:02PM +0300, Kirill A. Shutemov wrote:
> > > On Mon, May 02, 2016 at 01:15:13PM +0200, Jerome Glisse wrote:
> > > > On Mon, May 02, 2016 at 01:41:19PM +0300, Kirill A. Shutemov wrote:
> > > > > Other thing I would like to discuss is if there's a problem on vfio side.
> > > > > To me it looks like vfio expects guarantee from get_user_pages() which it
> > > > > doesn't provide: obtaining pin on the page doesn't guarantee that the page
> > > > > is going to remain mapped into userspace until the pin is gone.
> > > > > 
> > > > > Even with THP COW regressing fixed, vfio would stay fragile: any
> > > > > MADV_DONTNEED/fork()/mremap()/whatever what would make vfio expectation
> > > > > broken.
> > > > > 
> > > > 
> > > > Well i don't think it is fair/accurate assessment of get_user_pages(), page
> > > > must remain mapped to same virtual address until pin is gone. I am ignoring
> > > > mremap() as it is a scient decision from userspace and while virtual address
> > > > change in that case, the pined page behind should move with the mapping.
> > > > Same of MADV_DONTNEED. I agree that get_user_pages() is broken after fork()
> > > > but this have been the case since dawn of time, so it is something expected.
> > > > 
> > > > If not vfio, then direct-io, have been expecting this kind of behavior for
> > > > long time, so i see this as part of get_user_pages() guarantee.
> > > > 
> > > > Concerning vfio, not providing this guarantee will break countless number of
> > > > workload. Thing like qemu/kvm allocate anonymous memory and hand it over to
> > > > the guest kernel which presents it as memory. Now a device driver inside the
> > > > guest kernel need to get bus mapping for a given (guest) page, which from
> > > > host point of view means a mapping from anonymous page to bus mapping but
> > > > for guest to keep accessing the same page the anonymous mapping (ie a
> > > > specific virtual address on the host side) must keep pointing to the same
> > > > page. This have been the case with get_user_pages() until now, so whether
> > > > we like it or not we must keep that guarantee.
> > > > 
> > > > This kind of workload knows that they can't do mremap()/fork()/... and keep
> > > > that guarantee but they at expect existing guarantee and i don't think we
> > > > can break that.
> > > 
> > > Quick look around:
> > > 
> > >  - I don't see any check page_count() around __replace_page() in uprobes,
> > >    so it can easily replace pinned page.
> > 
> > Not an issue for existing user as this is only use to instrument code, existing
> > user do not execute code from virtual address for which they have done a GUP.
> 
> Okay, so we can establish that GUP doesn't provide the guarantee in some
> cases.

Correct, but it used to provide that guarantee with respect to THP.


> > >  - KSM has the page_count() check, there's still race wrt GUP_fast: it can
> > >    take the pin between the check and establishing new pte entry.
> > 
> > KSM is not an issue for existing user as they all do get_user_pages() with
> > write = 1 and the KSM first map page read only before considering to replace
> > them and check page refcount. So there can be no race with gup_fast there.
> 
> In vfio case, 'write' is conditional on IOMMU_WRITE, meaning not all
> get_user_pages() are with write=1.

I think this is still fine, as it means that the device will only read and
thus you can migrate to a different page (i.e. the guest is not expecting
to read back anything written by the device, and the device writing to the
page would be illegal and a proper IOMMU would forbid it). So it is like
direct-io when you write from anonymous memory to a file.


> > >  - khugepaged: the same story as with KSM.
> > 
> > I am assuming you are talking about collapse_huge_page() here, if you look in
> > that function there is a comment about GUP_fast. Noneless i believe the comment
> > is wrong as i believe there is an existing race window btw pmdp_collapse_flush()
> > and __collapse_huge_page_isolate() :
> > 
> >   get_user_pages_fast()          | collapse_huge_page()
> >    gup_pmd_range() -> valid pmd  | ...
> >                                  | pmdp_collapse_flush() clear pmd
> >                                  | ...
> >                                  | __collapse_huge_page_isolate()
> >                                  | [Above check page count and see no GUP]
> >    gup_pte_range() -> ref page   |
> > 
> > This is a very unlikely race because get_user_pages_fast() can not be preempted
> > while collapse_huge_page() can be preempted btw pmdp_collapse_flush() and
> > __collapse_huge_page_isolate(), more over collapse_huge_page() has lot more
> > instructions to chew on than get_user_pages_fast() btw gup_pmd_range() and
> > gup_pte_range().
> 
> Yes, the race window is small, but there.

Now that I think about it again, I don't think it exists.
pmdp_collapse_flush() will flush the TLB and thus send an IPI, but
get_user_pages_fast() can't be preempted, so the flush will have to wait
for any in-progress get_user_pages_fast() to complete. Or am I
misunderstanding the flush? So khugepaged is safe from the GUP_fast point
of view, like the comment inside it says.


> > So i think this is an unlikely race. I am not sure how to forbid it from
> > happening, except maybe in get_user_pages_fast() by checking pmd is still
> > valid after gup_pte_range().
> 
> Switching to non-fast GUP would help :-P
> 
> > > I don't see how we can deliver on the guarantee, especially with lockless
> > > GUP_fast.
> > > 
> > > Or am I missing something important?
> > 
> > So as said above, i think existing user of get_user_pages() are not sensitive
> > to the races you pointed above. I am sure there are some corner case where
> > the guarantee that GUP pin a page against a virtual address is violated but
> > i do not think they apply to any existing user of GUP.
> > 
> > Note that i would personaly like that this existing assumption about GUP did
> > not exist. I hate it, but fact is that it does exist and nobody can remember
> > where the Doc did park the Delorean
> 
> The drivers who want the guarantee can provide own ->mmap and have more
> control on what is visible in userspace.
> 
> Alternatively, we have mmu_notifiers to track changes in userspace
> mappings.
> 

Well, you cannot rely on a special vma here. Qemu allocates anonymous
memory and hands it over to the guest; then a guest driver (i.e. running in
the guest, not on the host) tries to map that memory and needs a valid DMA
address for it. This is when vfio (in the host kernel) starts pinning
memory of a regular anonymous vma (on the host). That same memory might
back some special vma with a ->mmap callback, but in the guest. The point
is that there is no driver on the host and no special vma. From the host
point of view this is anonymous memory, but from the guest POV it is just
memory.

Requiring a special vma would need major changes to kvm and probably xen,
with respect to how they support things like PCI passthrough.

In existing workloads, the host kernel cannot make assumptions about how
anonymous memory is going to be used.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [BUG] vfio device assignment regression with THP ref counting redesign
  2016-05-02 10:41               ` Kirill A. Shutemov
  2016-05-02 11:15                 ` Jerome Glisse
@ 2016-05-02 15:23                 ` Andrea Arcangeli
  2016-05-02 16:00                   ` Kirill A. Shutemov
  1 sibling, 1 reply; 36+ messages in thread
From: Andrea Arcangeli @ 2016-05-02 15:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Alex Williamson, kirill.shutemov, linux-kernel, linux-mm

On Mon, May 02, 2016 at 01:41:19PM +0300, Kirill A. Shutemov wrote:
> I don't think this would work correctly. Let's check one of callers:
> 
> static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> 		unsigned long address, pte_t *page_table, pmd_t *pmd,
> 		spinlock_t *ptl, pte_t orig_pte)
> 	__releases(ptl)
> {
> ...
> 		if (reuse_swap_page(old_page)) {
> 			/*
> 			 * The page is all ours.  Move it to our anon_vma so
> 			 * the rmap code will not search our parent or siblings.
> 			 * Protected against the rmap code by the page lock.
> 			 */
> 			page_move_anon_rmap(old_page, vma, address);
> 			unlock_page(old_page);
> 			return wp_page_reuse(mm, vma, address, page_table, ptl,
> 					     orig_pte, old_page, 0, 0);
> 		}
> 
> The first thing to notice is that old_page can be a tail page here
> therefore page_move_anon_rmap() should be able to handle this after you
> patch, which it doesn't.

Agreed, that's an implementation error and easy to fix.

> But I think there's a bigger problem.
> 
> Consider the following situation: after split_huge_pmd() we have
> pte-mapped THP, fork() comes and now the pages is shared between two
> processes. Child process munmap()s one half of the THP page, parent
> munmap()s the other half.
> 
> IIUC, afther that page_trans_huge_mapcount() would give us 1 as all 4k
> subpages have mapcount exactly one. Fault in the child would trigger
> do_wp_page() and reuse_swap_page() returns true, which would lead to
> page_move_anon_rmap() tranferring the whole compound page to child's
> anon_vma. That's not correct.
> 
> We should at least avoid page_move_anon_rmap() for compound pages there.

So (the missing compound_head() aside) the calculation I was doing is
correct with regard to taking over the page and marking the pagetable
read-write instead of triggering a COW and breaking the pinning, and it's
only wrong in terms of calling page_move_anon_rmap? The child or parent
would then lose visibility of its ptes if the compound page is moved to
the local vma->anon_vma.

The fix should just be to change page_trans_huge_mapcount() to return two
refcounts: one "hard" count for the pinning decision, and one "soft" count
for the rmap, which will be the same as total_mapcount. The runtime cost
will remain the same, so a fix should be easy for this one too.
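
A sketch of what that two-refcount interface could look like (illustrative
only, not the actual follow-up patch; it also takes compound_head() to
cover the tail-page caller noted above, and leaves hugetlbfs out since
reuse_swap_page never sees those pages):

int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
{
	int i, ret, _total_mapcount, mapcount;

	if (likely(!PageCompound(page))) {
		mapcount = atomic_read(&page->_mapcount) + 1;
		if (total_mapcount)
			*total_mapcount = mapcount;
		return mapcount;
	}

	/* callers like do_wp_page() can pass a tail page */
	page = compound_head(page);

	_total_mapcount = ret = 0;
	for (i = 0; i < HPAGE_PMD_NR; i++) {
		mapcount = atomic_read(&page[i]._mapcount) + 1;
		/* "hard" count: the highest mapcount of any one subpage */
		ret = max(ret, mapcount);
		/* "soft" count: all mappings summed, like total_mapcount */
		_total_mapcount += mapcount;
	}
	if (PageDoubleMap(page)) {
		ret -= 1;
		_total_mapcount -= HPAGE_PMD_NR;
	}
	/* add the pmd (compound) mappings on top */
	mapcount = compound_mapcount(page);
	ret += mapcount;
	_total_mapcount += mapcount;
	if (total_mapcount)
		*total_mapcount = _total_mapcount;
	return ret;
}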

> Other thing I would like to discuss is if there's a problem on vfio side.
> To me it looks like vfio expects guarantee from get_user_pages() which it
> doesn't provide: obtaining pin on the page doesn't guarantee that the page
> is going to remain mapped into userspace until the pin is gone.
> 
> Even with THP COW regressing fixed, vfio would stay fragile: any
> MADV_DONTNEED/fork()/mremap()/whatever what would make vfio expectation
> broken.

vfio must run as root and it will take care of not doing such things; it
just needs a way to prevent the page from being moved so it can DMA into
it, and mlock is not enough. This clearly has to be caused by a
get_user_pages(write=0), or by a serialized fork/exec() while a
longstanding page pin is being held (and to be safe, fork/exec had to be
serialized in a way that the parent process wouldn't write to the pinned
page until after exec has run in the child, or it's already racy no
matter what kernel).

I agree it's somewhat fragile; the problem here is that the THP
refcounting change made it even weaker than it already was.

Ideally the MMU notifier invalidate should be used instead of pinning
the page, that would make it 100% robust and it wouldn't even pin the
page at all.

However we can't send an MMU notifier invalidate to an IOMMU because
next time the IOMMU non-present physical address is used it would kill
the app. Some new IOMMUs can raise an exception synchronously that we
could use to implement an IOMMU secondary MMU page fault and make the
MMU notifier model work with IOMMUs too, but that's not feasible with
most IOMMUs out there, which raise an unrecoverable asynchronous
exception instead and can't implement a proper "IOMMU page
fault". Furthermore the speed of the invalidate may not be optimal
with IOMMUs which would then be an added cost to pay for swapping and
memory migration.

This is anyway a regression of the guarantees a pin used to provide.
If we want to bring back the old semantics of a page pin, I think
fixing both places like I attempted to do (modulo two implementation
bugs) is better than fixing only the THP case.

If we instead leave things as is and weaken the semantics of a page
pin, the alternative, to deal with the weakened semantics inside the
vfio code, is to use get_user_pages with write=1 forced; then it'll
probably work also with current upstream (unless it's fork/exec, but I
don't think it is, MADV_DONTFORK would be recommended anyway for
usages like this with vfio if fork can ever run and there are threads
in the parent, even O_DIRECT generates data corruption without
MADV_DONTFORK in such conditions for similar reasons).
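
As an illustration of that recommendation, here is a minimal userspace
sketch (not part of the thread; the actual pinning step, e.g. the vfio
DMA map ioctl, is left as a placeholder):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define BUF_SIZE (2UL << 20)	/* 2MB, THP sized */

int main(void)
{
	void *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* keep this range out of any child: fork() can no longer mark it
	 * COW, so a longstanding pin on it stays coherent in the parent */
	if (madvise(buf, BUF_SIZE, MADV_DONTFORK)) {
		perror("madvise(MADV_DONTFORK)");
		return 1;
	}
	/* ... hand buf to the kernel here for long-term pinning
	 * (vfio VFIO_IOMMU_MAP_DMA, O_DIRECT I/O, etc.) ... */
	return 0;
}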

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [BUG] vfio device assignment regression with THP ref counting redesign
  2016-05-02 15:23                 ` [BUG] vfio device assignment regression with THP ref counting redesign Andrea Arcangeli
@ 2016-05-02 16:00                   ` Kirill A. Shutemov
  2016-05-02 18:03                     ` Andrea Arcangeli
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2016-05-02 16:00 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Alex Williamson, kirill.shutemov, linux-kernel, linux-mm

On Mon, May 02, 2016 at 05:23:07PM +0200, Andrea Arcangeli wrote:
> On Mon, May 02, 2016 at 01:41:19PM +0300, Kirill A. Shutemov wrote:
> > I don't think this would work correctly. Let's check one of callers:
> > 
> > static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> > 		unsigned long address, pte_t *page_table, pmd_t *pmd,
> > 		spinlock_t *ptl, pte_t orig_pte)
> > 	__releases(ptl)
> > {
> > ...
> > 		if (reuse_swap_page(old_page)) {
> > 			/*
> > 			 * The page is all ours.  Move it to our anon_vma so
> > 			 * the rmap code will not search our parent or siblings.
> > 			 * Protected against the rmap code by the page lock.
> > 			 */
> > 			page_move_anon_rmap(old_page, vma, address);
> > 			unlock_page(old_page);
> > 			return wp_page_reuse(mm, vma, address, page_table, ptl,
> > 					     orig_pte, old_page, 0, 0);
> > 		}
> > 
> > The first thing to notice is that old_page can be a tail page here
> > therefore page_move_anon_rmap() should be able to handle this after your
> > patch, which it doesn't.
> 
> Agreed, that's an implementation error and easy to fix.
> 
> > But I think there's a bigger problem.
> > 
> > Consider the following situation: after split_huge_pmd() we have
> > pte-mapped THP, fork() comes and now the page is shared between two
> > processes. Child process munmap()s one half of the THP page, parent
> > munmap()s the other half.
> > 
> > IIUC, after that page_trans_huge_mapcount() would give us 1 as all 4k
> > subpages have mapcount exactly one. Fault in the child would trigger
> > do_wp_page() and reuse_swap_page() returns true, which would lead to
> > page_move_anon_rmap() transferring the whole compound page to child's
> > anon_vma. That's not correct.
> > 
> > We should at least avoid page_move_anon_rmap() for compound pages there.
> 
> So (compound_head() missing aside) the calculation I was doing is
> correct with regard to taking over the page and marking the pagetable
> read-write instead of triggering a COW and breaking the pinning, but
> it's wrong only in terms of calling page_move_anon_rmap()? The child
> or parent would then lose visibility on its ptes if the compound page
> is moved to the local vma->anon_vma.
> 
> The fix should be just to change page_trans_huge_mapcount() to return
> two refcounts, one "hard" for the pinning, and one "soft" for the rmap
> which will be the same as total_mapcount. The runtime cost will remain
> the same, so a fix can be easy for this one too.

Sounds correct, but code is going to be ugly :-/

> > Other thing I would like to discuss is if there's a problem on vfio side.
> > To me it looks like vfio expects a guarantee from get_user_pages() which it
> > doesn't provide: obtaining a pin on the page doesn't guarantee that the page
> > is going to remain mapped into userspace until the pin is gone.
> > 
> > Even with the THP COW regression fixed, vfio would stay fragile: any
> > MADV_DONTNEED/fork()/mremap()/whatever would break vfio's expectation.
> 
> vfio must run as root, it will take care of not doing such things, it
> just needs a way to prevent the page from being moved so it can DMA into it
> and mlock is not enough. This clearly has to be caused by a
> get_user_pages(write=0) or by a serialized fork/exec() while a
> longstanding page pin is being held (and to be safe fork/exec had to
> be serialized in a way that the parent process wouldn't write to the
> pinned page until after exec has run in the child, or it's already
> racy no matter what kernel).
> 
> I agree it's somewhat fragile, the problem here is that the THP
> refcounting change made it even weaker than it already was.

I didn't say we shouldn't fix the problem on THP side. But the attitude
"get_user_pages() would magically freeze page tables" worries me.

> Ideally the MMU notifier invalidate should be used instead of pinning
> the page, that would make it 100% robust and it wouldn't even pin the
> page at all.
> 
> However we can't send an MMU notifier invalidate to an IOMMU because
> next time the IOMMU non-present physical address is used it would kill
> the app. Some new IOMMUs can raise an exception synchronously that we
> could use to implement an IOMMU secondary MMU page fault and make the
> MMU notifier model work with IOMMUs too, but that's not feasible with
> most IOMMUs out there, which raise an unrecoverable asynchronous
> exception instead and can't implement a proper "IOMMU page
> fault". Furthermore the speed of the invalidate may not be optimal
> with IOMMUs which would then be an added cost to pay for swapping and
> memory migration.
> 
> This is anyway a regression of the guarantees a pin used to provide.
> If we want to bring back the old semantics of a page pin, I think
> fixing both places like I attempted to do (modulo two implementation
> bugs) is better than fixing only the THP case.

Agreed. I just didn't see the two-refcounts solution.

> If we instead leave things as is and weaken the semantics of a page
> pin, the alternative, to deal with the weakened semantics inside the
> vfio code, is to use get_user_pages with write=1 forced; then it'll
> probably work also with current upstream (unless it's fork/exec, but I
> don't think it is, MADV_DONTFORK would be recommended anyway for
> usages like this with vfio if fork can ever run and there are threads
> in the parent, even O_DIRECT generates data corruption without
> MADV_DONTFORK in such conditions for similar reasons).

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: GUP guarantees wrt to userspace mappings
  2016-05-02 15:22                         ` Jerome Glisse
@ 2016-05-02 16:12                           ` Kirill A. Shutemov
  2016-05-02 19:14                             ` Andrea Arcangeli
  2016-05-02 19:11                           ` Andrea Arcangeli
  1 sibling, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2016-05-02 16:12 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Oleg Nesterov, Hugh Dickins, Linus Torvalds, Andrew Morton,
	Andrea Arcangeli, Alex Williamson, kirill.shutemov, linux-kernel,
	linux-mm

On Mon, May 02, 2016 at 05:22:49PM +0200, Jerome Glisse wrote:
> On Mon, May 02, 2016 at 06:00:13PM +0300, Kirill A. Shutemov wrote:
> > > > Quick look around:
> > > > 
> > > >  - I don't see any check page_count() around __replace_page() in uprobes,
> > > >    so it can easily replace pinned page.
> > > 
> > > Not an issue for existing users as this is only used to instrument code; existing
> > > users do not execute code from a virtual address for which they have done a GUP.
> > 
> > Okay, so we can establish that GUP doesn't provide the guarantee in some
> > cases.
> 
> Correct, but it used to provide that guarantee with respect to THP.

Yes, the THP regression needs to be fixed. I don't argue with that.

> > > >  - KSM has the page_count() check, there's still race wrt GUP_fast: it can
> > > >    take the pin between the check and establishing new pte entry.
> > > 
> > > KSM is not an issue for existing users as they all do get_user_pages() with
> > > write = 1, and KSM first maps the page read only before considering replacing
> > > it and checking the page refcount. So there can be no race with gup_fast there.
> > 
> > In vfio case, 'write' is conditional on IOMMU_WRITE, meaning not all
> > get_user_pages() are with write=1.
> 
> I think this is still fine as it means that the device will only read and thus
> you can migrate to a different page (ie the guest is not expecting to read back
> anything written by the device, and the device writing to the page would be
> illegal and a proper IOMMU would forbid it). So it is like direct-io when you
> write from anonymous memory to a file.

Hm. Okay.

> > > >  - khugepaged: the same story as with KSM.
> > > 
> > > I am assuming you are talking about collapse_huge_page() here; if you look in
> > > that function there is a comment about GUP_fast. Nonetheless I believe the
> > > comment is wrong as I believe there is an existing race window between
> > > pmdp_collapse_flush() and __collapse_huge_page_isolate():
> > > 
> > >   get_user_pages_fast()          | collapse_huge_page()
> > >    gup_pmd_range() -> valid pmd  | ...
> > >                                  | pmdp_collapse_flush() clear pmd
> > >                                  | ...
> > >                                  | __collapse_huge_page_isolate()
> > >                                  | [Above check page count and see no GUP]
> > >    gup_pte_range() -> ref page   |
> > > 
> > > This is a very unlikely race because get_user_pages_fast() cannot be preempted
> > > while collapse_huge_page() can be preempted between pmdp_collapse_flush() and
> > > __collapse_huge_page_isolate(); moreover collapse_huge_page() has a lot more
> > > instructions to chew on than get_user_pages_fast() has between gup_pmd_range()
> > > and gup_pte_range().
> > 
> > Yes, the race window is small, but there.
> 
> Now that I think again about it, I don't think it exists. pmdp_collapse_flush()
> will flush the tlb and thus send an IPI, but get_user_pages_fast() can't be
> preempted so the flush will have to wait for any existing get_user_pages_fast()
> to complete. Or am I misunderstanding flush? So khugepaged is safe from the
> GUP_fast point of view, like the comment inside it says.

You are right. It's safe too.

> > > So as said above, I think existing users of get_user_pages() are not sensitive
> > > to the races you pointed out above. I am sure there are some corner cases where
> > > the guarantee that GUP pins a page against a virtual address is violated, but
> > > I do not think they apply to any existing user of GUP.
> > > 
> > > Note that I would personally prefer that this existing assumption about GUP did
> > > not exist. I hate it, but the fact is that it does exist and nobody can remember
> > > where the Doc parked the DeLorean.
> > 
> > The drivers who want the guarantee can provide own ->mmap and have more
> > control on what is visible in userspace.
> > 
> > Alternatively, we have mmu_notifiers to track changes in userspace
> > mappings.
> > 
> 
> Well you can't rely on a special vma here. Qemu allocates anonymous memory and
> hands it over to the guest, then a guest driver (ie running in the guest, not
> on the host) tries to map that memory and needs a valid DMA address for it;
> this is when vfio (on the host kernel) starts pinning memory of a regular
> anonymous vma (on the host). That same memory might back some special vma with
> an ->mmap callback, but in the guest. The point is there is no driver on the
> host and no special vma. From the host point of view this is anonymous memory,
> but from the guest POV it is just memory.
> 
> Requiring a special vma would need major changes to kvm and probably xen, in
> respect of how they support things like PCI passthrough.
> 
> In existing workloads, the host kernel can not make assumptions on how
> anonymous memory is gonna be used.

Any reason why mmu_notifier is not an option?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: GUP guarantees wrt to userspace mappings redesign
  2016-05-02 14:15                     ` GUP guarantees wrt to userspace mappings redesign Oleg Nesterov
@ 2016-05-02 16:21                       ` Kirill A. Shutemov
  2016-05-02 16:22                         ` Oleg Nesterov
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2016-05-02 16:21 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jerome Glisse, Hugh Dickins, Linus Torvalds, Andrew Morton,
	Andrea Arcangeli, Alex Williamson, kirill.shutemov, linux-kernel,
	linux-mm

On Mon, May 02, 2016 at 04:15:38PM +0200, Oleg Nesterov wrote:
> I am sure I missed the problem, but...
> 
> On 05/02, Kirill A. Shutemov wrote:
> >
> > Quick look around:
> >
> >  - I don't see any check page_count() around __replace_page() in uprobes,
> >    so it can easily replace pinned page.
> 
> Why it should? even if it races with get_user_pages_fast()... this doesn't
> differ from the case when an application writes to MAP_PRIVATE non-anonymous
> region, no?

< I know nothing about uprobes or ptrace in general >

I think the difference is that the write is initiated by the process
itself, but IIUC __replace_page() can be initiated by other process, so
it's out of control of the application.

So we have pages pinned by a driver and the driver expects the pinned
pages to be mapped into userspace, then __replace_page() kicks in and put
different page there -- driver's expectation is broken.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: GUP guarantees wrt to userspace mappings redesign
  2016-05-02 16:21                       ` Kirill A. Shutemov
@ 2016-05-02 16:22                         ` Oleg Nesterov
  2016-05-02 18:03                           ` Kirill A. Shutemov
  0 siblings, 1 reply; 36+ messages in thread
From: Oleg Nesterov @ 2016-05-02 16:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Jerome Glisse, Hugh Dickins, Linus Torvalds, Andrew Morton,
	Andrea Arcangeli, Alex Williamson, kirill.shutemov, linux-kernel,
	linux-mm

On 05/02, Kirill A. Shutemov wrote:
>
> On Mon, May 02, 2016 at 04:15:38PM +0200, Oleg Nesterov wrote:
> > >
> > >  - I don't see any check page_count() around __replace_page() in uprobes,
> > >    so it can easily replace pinned page.
> >
> > Why it should? even if it races with get_user_pages_fast()... this doesn't
> > differ from the case when an application writes to MAP_PRIVATE non-anonymous
> > region, no?
>
> < I know nothing about uprobes or ptrace in general >
>
> I think the difference is that the write is initiated by the process
> itself, but IIUC __replace_page() can be initiated by other process, so
> it's out of control of the application.

Yes. Just like gdb can insert a breakpoint into the read-only executable vma.

> So we have pages pinned by a driver and the driver expects the pinned
> pages to be mapped into userspace, then __replace_page() kicks in and put
> different page there -- driver's expectation is broken.

Yes... but I don't understand the problem space. I mean, I do not know why
this driver should expect this, how it can be broken, etc.

I do not even understand why "initiated by other process" can make any
difference... Unless this driver somehow controls all threads which could
have this page mapped.

Oleg.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: GUP guarantees wrt to userspace mappings redesign
  2016-05-02 18:03                           ` Kirill A. Shutemov
@ 2016-05-02 17:41                             ` Oleg Nesterov
  0 siblings, 0 replies; 36+ messages in thread
From: Oleg Nesterov @ 2016-05-02 17:41 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Jerome Glisse, Hugh Dickins, Linus Torvalds, Andrew Morton,
	Andrea Arcangeli, Alex Williamson, kirill.shutemov, linux-kernel,
	linux-mm

On 05/02, Kirill A. Shutemov wrote:
>
> On Mon, May 02, 2016 at 06:22:11PM +0200, Oleg Nesterov wrote:

> > > So we have pages pinned by a driver and the driver expects the pinned
> > > pages to be mapped into userspace, then __replace_page() kicks in and put
> > > different page there -- driver's expectation is broken.
> >
> > Yes... but I don't understand the problem space. I mean, I do not know why
> > this driver should expect this, how it can be broken, etc.
> >
> > I do not even understand why "initiated by other process" can make any
> > difference... Unless this driver somehow controls all threads which could
> > have this page mapped.
>
> Okay, my understanding is following:
>
> Some drivers (e.g. vfio) rely on get_user_pages{,_fast}() to pin the memory
> and expect pinned pages to be mapped into userspace until the pin is gone.
> This memory is used to communicate between kernel and userspace.

Thanks Kirill.

Then I think uprobes should be fine,

> I don't think there's anything to fix on the uprobe side. It's part of the
> debugging interface. Debuggers can be destructive, nothing new there.

Yes, exactly. And as for uprobes in particular, __replace_page() can
only be called if vma->vm_file is set and the mapping is private/executable;
VM_MAYSHARE must not be set.

It's unlikely userspace can read or write this memory to communicate with
the kernel or something else.

Thanks,

Oleg.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: GUP guarantees wrt to userspace mappings redesign
  2016-05-02 16:22                         ` Oleg Nesterov
@ 2016-05-02 18:03                           ` Kirill A. Shutemov
  2016-05-02 17:41                             ` Oleg Nesterov
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2016-05-02 18:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jerome Glisse, Hugh Dickins, Linus Torvalds, Andrew Morton,
	Andrea Arcangeli, Alex Williamson, kirill.shutemov, linux-kernel,
	linux-mm

On Mon, May 02, 2016 at 06:22:11PM +0200, Oleg Nesterov wrote:
> On 05/02, Kirill A. Shutemov wrote:
> >
> > On Mon, May 02, 2016 at 04:15:38PM +0200, Oleg Nesterov wrote:
> > > >
> > > >  - I don't see any check page_count() around __replace_page() in uprobes,
> > > >    so it can easily replace pinned page.
> > >
> > > Why it should? even if it races with get_user_pages_fast()... this doesn't
> > > differ from the case when an application writes to MAP_PRIVATE non-anonymous
> > > region, no?
> >
> > < I know nothing about uprobes or ptrace in general >
> >
> > I think the difference is that the write is initiated by the process
> > itself, but IIUC __replace_page() can be initiated by other process, so
> > it's out of control of the application.
> 
> Yes. Just like gdb can insert a breakpoint into the read-only executable vma.
> 
> > So we have pages pinned by a driver and the driver expects the pinned
> > pages to be mapped into userspace, then __replace_page() kicks in and put
> > different page there -- driver's expectation is broken.
> 
> Yes... but I don't understand the problem space. I mean, I do not know why
> this driver should expect this, how it can be broken, etc.
> 
> I do not even understand why "initiated by other process" can make any
> difference... Unless this driver somehow controls all threads which could
> have this page mapped.

Okay, my understanding is following:

Some drivers (e.g. vfio) rely on get_user_pages{,_fast}() to pin the memory
and expect pinned pages to be mapped into userspace until the pin is gone.
This memory is used to communicate between kernel and userspace.

This kinda works unless an application does something that can change page
tables in the area: fork() (due to CoW), mremap(), munmap(), MADV_DONTNEED,
etc.

This model is really fragile, but the application has (kinda) control over
the situation: if it's very careful, it can keep the mapping intact.

With __replace_page() the situation is different: it can be triggered from
outside the process and is therefore out of the application's control.

I don't think there's anything to fix on the uprobe side. It's part of the
debugging interface. Debuggers can be destructive, nothing new there.
I was just trying to find the cases where this usage of GUP is not correct...
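
To make the fork()/CoW case above concrete, a small sketch (not from the
thread) can watch the physical page behind one anonymous virtual address
through /proc/self/pagemap; note that non-zero PFNs are only reported
with CAP_SYS_ADMIN on recent kernels:

#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

static uint64_t pfn_of(const void *addr)
{
	uint64_t ent = 0;
	int fd = open("/proc/self/pagemap", O_RDONLY);
	off_t off = ((uintptr_t)addr / sysconf(_SC_PAGESIZE)) * sizeof(ent);

	if (fd >= 0) {
		pread(fd, &ent, sizeof(ent), off);
		close(fd);
	}
	return ent & ((1ULL << 55) - 1);	/* bits 0-54: PFN if present */
}

int main(void)
{
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	memset(p, 0xaa, 4096);			/* fault the page in */
	printf("before: pfn=%" PRIx64 "\n", pfn_of(p));

	if (fork() == 0) {			/* child shares the page */
		sleep(1);
		_exit(0);
	}
	p[0] = 0x55;	/* write while the child still maps it: COW gives the
			 * parent a new physical page; a pre-existing GUP pin
			 * would still point at the old one */
	printf("after:  pfn=%" PRIx64 "\n", pfn_of(p));
	wait(NULL);
	return 0;
}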

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [BUG] vfio device assignment regression with THP ref counting redesign
  2016-05-02 16:00                   ` Kirill A. Shutemov
@ 2016-05-02 18:03                     ` Andrea Arcangeli
  2016-05-05  1:19                       ` Alex Williamson
  0 siblings, 1 reply; 36+ messages in thread
From: Andrea Arcangeli @ 2016-05-02 18:03 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Alex Williamson, kirill.shutemov, linux-kernel, linux-mm

On Mon, May 02, 2016 at 07:00:42PM +0300, Kirill A. Shutemov wrote:
> Sounds correct, but code is going to be ugly :-/

Now if a page is not shared in the parent, it is already in the local
anon_vma. The only thing we could lose here is a pmd split in the
child caused by swapping, followed by the parent releasing the page and
the child reusing it while it stays in the anon_vma of the parent. It
doesn't sound like a major concern.

What we could do to improve this, though, is a rmap walk after the
physical split_huge_page succeeded, to relocate the page->mapping to
the local vma->anon_vma of the child if page_mapcount() is 1, before
releasing the (root) anon_vma lock. If a page got a pmd split it'll be
a candidate for a physical split if any of the ptes has been
unmapped.

If it wasn't for THP in tmpfs, I think the old THP refcounting overall
would have been simpler (it never had issues like these), but having a
single model for all THP sounds much easier to maintain over time than
dealing with totally different models, rules and locking for every
filesystem and MM part. This is why I hope we'll soon leverage all this
work in tmpfs too. With tmpfs, being able to map the compound THP with
both ptes and pmds is mandatory; the old refcounting had too big a
constraint to make compound THP work in tmpfs.

> I didn't say we shouldn't fix the problem on THP side. But the attitude
> "get_user_pages() would magically freeze page tables" worries me.

It doesn't need to freeze page tables, it only needs to prevent the
pages from being freed. In fact this bug cannot generate random kernel
corruption no matter what, but the userland view of the memory will
go out of sync and it can generate data corruption in userland (in RAM
or hardware device DMA).

What the THP refcounting redesign did is just to change some
expectations on the userland side in terms of when the view of the
pinned pages gets lost and replaced by copies, depending on what
userland did. A process that invokes page pinning must have root (or
enough capabilities anyway), so it's somewhat tied to the kernel
behavior; it's non-standard.

However issues like this are userland visible even to unprivileged
tasks; in fact it's strongly recommended to use MADV_DONTFORK on the
get_user_pages addresses if a program is using
get_user_pages/O_DIRECT+fork+threads, to avoid silent data corruption.

> Agreed. I just didn't see the two-refcounts solution.

If you didn't do it already or if you're busy with something else,
I can change the patch to the two refcount solution, which should
restore the old semantics without breaking rmap.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: GUP guarantees wrt to userspace mappings redesign
  2016-05-02 12:14                   ` GUP guarantees wrt to userspace mappings redesign Kirill A. Shutemov
  2016-05-02 13:39                     ` Jerome Glisse
  2016-05-02 14:15                     ` GUP guarantees wrt to userspace mappings redesign Oleg Nesterov
@ 2016-05-02 18:56                     ` Andrea Arcangeli
  2 siblings, 0 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2016-05-02 18:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Jerome Glisse, Oleg Nesterov, Hugh Dickins, Linus Torvalds,
	Andrew Morton, Alex Williamson, kirill.shutemov, linux-kernel,
	linux-mm

On Mon, May 02, 2016 at 03:14:02PM +0300, Kirill A. Shutemov wrote:
> Quick look around:
> 
>  - I don't see any check page_count() around __replace_page() in uprobes,
>    so it can easily replace pinned page.
> 
>  - KSM has the page_count() check, there's still race wrt GUP_fast: it can
>    take the pin between the check and establishing new pte entry.

		 * Ok this is tricky, when get_user_pages_fast() run it doesn't
		 * take any lock, therefore the check that we are going to make
		 * with the pagecount against the mapcount is racey and
		 * O_DIRECT can happen right after the check.
		 * So we clear the pte and flush the tlb before the check
		 * this assure us that no O_DIRECT can happen after the check
		 * or in the middle of the check.
		 */
		entry = ptep_clear_flush_notify(vma, addr, ptep);

KSM takes care of that, or it wouldn't be safe to run KSM on memory
under O_DIRECT.
 
>  - khugepaged: the same story as with KSM.

In __collapse_huge_page_isolate we do:

		/*
		 * cannot use mapcount: can't collapse if there's a gup pin.
		 * The page must only be referenced by the scanned process
		 * and page swap cache.
		 */
		if (page_count(page) != 1 + !!PageSwapCache(page)) {
			unlock_page(page);
			result = SCAN_PAGE_COUNT;
			goto out;
		}

At that point the pmd has been zapped (pmdp_collapse_flush has already
run) and, like in the KSM case, that is enough to ensure
get_user_pages_fast can't succeed and it'll have to fall back to the
slow get_user_pages.

These two issues are not specific to vfio and IOMMUs; this must be
correct or O_DIRECT would generate data corruption in the presence of
KSM/khugepaged. Both look fine to me.

> I don't see how we can deliver on the guarantee, especially with lockless
> GUP_fast.

By zapping the pmd_trans_huge/pte and sending IPIs if needed
(get_user_pages_fast runs with irq disabled), before checking
page_count.

With the RCU version of it it's the same, but instead of sending IPIs,
we'll wait for a quiescent point to be sure of having flushed any
concurrent get_user_pages_fast out of the other CPUs, before we
proceed to check page_count (then no other get_user_pages_fast can
increase the page count for this page on this "mm" anymore).

That's how the guarantee is provided against get_user_pages_fast.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: GUP guarantees wrt to userspace mappings
  2016-05-02 15:00                       ` GUP guarantees wrt to userspace mappings Kirill A. Shutemov
  2016-05-02 15:22                         ` Jerome Glisse
@ 2016-05-02 19:02                         ` Andrea Arcangeli
  1 sibling, 0 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2016-05-02 19:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Jerome Glisse, Oleg Nesterov, Hugh Dickins, Linus Torvalds,
	Andrew Morton, Alex Williamson, kirill.shutemov, linux-kernel,
	linux-mm

On Mon, May 02, 2016 at 06:00:13PM +0300, Kirill A. Shutemov wrote:
> Switching to non-fast GUP would help :-P

If we had a race in khugepaged or ksmd against gup_fast O_DIRECT we'd
get a flood of bug reports of data corruption with KVM running with
cache=direct.

Just wanted to reassure you there's no race; I explained how the
serialization that forces a fallback to non-fast GUP works in the
previous email.

This issue we're fixing for the COW is totally unrelated to KVM too,
because it uses MADV_DONTFORK, but the other races with O_DIRECT
against khugepaged/ksmd would still happen if we didn't already have
proper serialization against get_user_pages_fast.

> Alternatively, we have mmu_notifiers to track changes in userspace
> mappings.

This is always the absolute best solution: then no gup pins are used
at all and all VM functionality stays active regardless of the
secondary MMU. The problem is that most IOMMUs can't generate a
synchronous page fault; when they fault, the I/O is undefined. It'd be
as if, when you get a page fault in the CPU, you returned from the
fault straight to the next instruction, with no way during the fault
to even emulate the faulting instruction.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: GUP guarantees wrt to userspace mappings
  2016-05-02 15:22                         ` Jerome Glisse
  2016-05-02 16:12                           ` Kirill A. Shutemov
@ 2016-05-02 19:11                           ` Andrea Arcangeli
  1 sibling, 0 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2016-05-02 19:11 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Kirill A. Shutemov, Oleg Nesterov, Hugh Dickins, Linus Torvalds,
	Andrew Morton, Alex Williamson, kirill.shutemov, linux-kernel,
	linux-mm

On Mon, May 02, 2016 at 05:22:49PM +0200, Jerome Glisse wrote:
> I think this is still fine as it means that the device will only read and thus
> you can migrate to a different page (ie the guest is not expecting to read back
> anything written by the device, and the device writing to the page would be
> illegal and a proper IOMMU would forbid it). So it is like direct-io when you
> write from anonymous memory to a file.

Agreed. write=1 is so that if there's an O_DIRECT write() and the app
is only reading, there will be no COW generated on shared anonymous
memory/MAP_PRIVATE-filebacked.

> Now that I think again about it, I don't think it exists. pmdp_collapse_flush()
> will flush the tlb and thus send an IPI, but get_user_pages_fast() can't be
> preempted so the flush will have to wait for any existing get_user_pages_fast()
> to complete. Or am I misunderstanding flush? So khugepaged is safe from the
> GUP_fast point of view, like the comment inside it says.

This is exactly correct, there's no race window.

The IPI (or the quiescent point, in the case of the gup_fast RCU
version) is what flushes any concurrent get_user_pages_fast away when
pmdp_collapse_flush() runs.

> Well you can't rely on a special vma here. Qemu allocates anonymous memory and
> hands it over to the guest, then a guest driver (ie running in the guest, not
> on the host) tries to map that memory and needs a valid DMA address for it;
> this is when vfio (on the host kernel) starts pinning memory of a regular
> anonymous vma (on the host). That same memory might back some special vma with
> an ->mmap callback, but in the guest. The point is there is no driver on the
> host and no special vma. From the host point of view this is anonymous memory,
> but from the guest POV it is just memory.

It's quite important that it stays regular tmpfs/anon memory: device
memory is managed by the device and we'd lose everything
(KSM/swapping/NUMA balancing/compaction/memory-hotunplug/CMA, etc.).

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: GUP guarantees wrt to userspace mappings
  2016-05-02 16:12                           ` Kirill A. Shutemov
@ 2016-05-02 19:14                             ` Andrea Arcangeli
  0 siblings, 0 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2016-05-02 19:14 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Jerome Glisse, Oleg Nesterov, Hugh Dickins, Linus Torvalds,
	Andrew Morton, Alex Williamson, kirill.shutemov, linux-kernel,
	linux-mm

On Mon, May 02, 2016 at 07:12:52PM +0300, Kirill A. Shutemov wrote:
> Any reason why mmu_notifier is not an option?

There's no way to trigger a hardware-retried secondary MMU fault as a
result of a PCI DMA memory access, and it's expensive to do an MMU
notifier invalidate if it requires waiting for the DMA to complete (but
since the MMU notifier is now sleepable, the latter is a secondary
concern).

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [BUG] vfio device assignment regression with THP ref counting redesign
  2016-05-02 18:03                     ` Andrea Arcangeli
@ 2016-05-05  1:19                       ` Alex Williamson
  2016-05-05 14:39                         ` Andrea Arcangeli
  0 siblings, 1 reply; 36+ messages in thread
From: Alex Williamson @ 2016-05-05  1:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Kirill A. Shutemov, kirill.shutemov, linux-kernel, linux-mm

On Mon, 2 May 2016 20:03:07 +0200
Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Mon, May 02, 2016 at 07:00:42PM +0300, Kirill A. Shutemov wrote:
> > Agreed. I just didn't see the two-refcounts solution.  
> 
> If you didn't do it already or if you're busy with something else,
> I can change the patch to the two refcount solution, which should
> restore the old semantics without breaking rmap.

I didn't see any follow-up beyond this nor patches on lkml.  Do we have
something we feel confident for posting to v4.6 with a stable backport
to v4.5?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [BUG] vfio device assignment regression with THP ref counting redesign
  2016-05-05  1:19                       ` Alex Williamson
@ 2016-05-05 14:39                         ` Andrea Arcangeli
  2016-05-05 15:09                           ` Andrea Arcangeli
  2016-05-05 15:11                           ` Kirill A. Shutemov
  0 siblings, 2 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2016-05-05 14:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirill A. Shutemov, kirill.shutemov, linux-kernel, linux-mm

Hello Alex,

On Wed, May 04, 2016 at 07:19:27PM -0600, Alex Williamson wrote:
> On Mon, 2 May 2016 20:03:07 +0200
> Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> > On Mon, May 02, 2016 at 07:00:42PM +0300, Kirill A. Shutemov wrote:
> > > Agreed. I just didn't see the two-refcounts solution.  
> > 
> > If you didn't do it already or if you're busy with something else,
> > I can change the patch to the two refcount solution, which should
> > restore the old semantics without breaking rmap.
> 
> I didn't see any follow-up beyond this nor patches on lkml.  Do we have
> something we feel confident for posting to v4.6 with a stable backport
> to v4.5?  Thanks,

I'm currently testing this:

From c327b17f4de0c968bb3b9035fe36d80b2c28b2f8 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Fri, 29 Apr 2016 01:05:06 +0200
Subject: [PATCH 1/3] mm: thp: calculate the mapcount correctly for THP pages
 during WP faults

This will provide full accuracy to the mapcount calculation in the
write protect faults, so page pinning will not get broken by false
positive copy-on-writes.

total_mapcount() isn't the right calculation needed in
reuse_swap_page(), so this introduces a page_trans_huge_mapcount()
that is effectively the fully accurate return value for page_mapcount()
if dealing with Transparent Hugepages, however we only use the
page_trans_huge_mapcount() during COW faults where it is strictly needed,
due to its higher runtime cost.

This also provides at practically zero cost the total_mapcount
information which is needed to know if we can still relocate the page
anon_vma to the local vma. If page_trans_huge_mapcount() returns 1 we
can reuse the page no matter if it's a pte or a pmd_trans_huge
triggering the fault, but we can only relocate the page anon_vma to
the local vma->anon_vma if we're sure it's only this "vma" mapping the
whole THP physical range.

Kirill A. Shutemov reported the problem with moving the page anon_vma
to the local vma->anon_vma in a previous version of this patch.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm.h   |  6 +++++
 include/linux/swap.h |  8 ++++---
 mm/huge_memory.c     | 67 +++++++++++++++++++++++++++++++++++++++++++++-------
 mm/memory.c          | 21 +++++++++-------
 mm/swapfile.c        | 13 +++++-----
 5 files changed, 89 insertions(+), 26 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a55e5be..4b532d8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -500,11 +500,17 @@ static inline int page_mapcount(struct page *page)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 int total_mapcount(struct page *page);
+int page_trans_huge_mapcount(struct page *page, int *total_mapcount);
 #else
 static inline int total_mapcount(struct page *page)
 {
 	return page_mapcount(page);
 }
+static inline int page_trans_huge_mapcount(struct page *page,
+					   int *total_mapcount)
+{
+	return *total_mapcount = page_mapcount(page);
+}
 #endif
 
 static inline struct page *virt_to_head_page(const void *x)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2b83359..acef20d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -418,7 +418,7 @@ extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
-extern int reuse_swap_page(struct page *);
+extern bool reuse_swap_page(struct page *, int *);
 extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
 
@@ -513,8 +513,10 @@ static inline int swp_swapcount(swp_entry_t entry)
 	return 0;
 }
 
-#define reuse_swap_page(page) \
-	(!PageTransCompound(page) && page_mapcount(page) == 1)
+static inline bool reuse_swap_page(struct page *page, int *total_mapcount)
+{
+	return page_trans_huge_mapcount(page, total_mapcount) == 1;
+}
 
 static inline int try_to_free_swap(struct page *page)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 86f9f8b..d368620 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1298,15 +1298,9 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
 	/*
 	 * We can only reuse the page if nobody else maps the huge page or it's
-	 * part. We can do it by checking page_mapcount() on each sub-page, but
-	 * it's expensive.
-	 * The cheaper way is to check page_count() to be equal 1: every
-	 * mapcount takes page reference reference, so this way we can
-	 * guarantee, that the PMD is the only mapping.
-	 * This can give false negative if somebody pinned the page, but that's
-	 * fine.
+	 * part.
 	 */
-	if (page_mapcount(page) == 1 && page_count(page) == 1) {
+	if (page_trans_huge_mapcount(page, NULL) == 1) {
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -2080,7 +2074,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		if (pte_write(pteval)) {
 			writable = true;
 		} else {
-			if (PageSwapCache(page) && !reuse_swap_page(page)) {
+			if (PageSwapCache(page) &&
+			    !reuse_swap_page(page, NULL)) {
 				unlock_page(page);
 				result = SCAN_SWAP_CACHE_PAGE;
 				goto out;
@@ -3225,6 +3220,60 @@ int total_mapcount(struct page *page)
 }
 
 /*
+ * This calculates accurately how many mappings a transparent hugepage
+ * has (unlike page_mapcount() which isn't fully accurate). This full
+ * accuracy is primarily needed to know if copy-on-write faults can
+ * takeover the page and change the mapping to read-write instead of
+ * copying them. At the same time this returns the total_mapcount too.
+ *
+ * The return value is telling if the page can be reused as it returns
+ * the highest mapcount any one of the subpages has. If the return
+ * value is one, even if different processes are mapping different
+ * subpages of the transparent hugepage, they can all reuse it,
+ * because each process is reusing a different subpage.
+ *
+ * The total_mapcount is instead counting all virtual mappings of the
+ * subpages. If the total_mapcount is equal to "one", it tells the
+ * caller all mappings belong to the same "mm" and in turn the
+ * anon_vma of the transparent hugepage can become the vma->anon_vma
+ * local one as no other process may be mapping any of the subpages.
+ *
+ * It would be more accurate to replace page_mapcount() with
+ * page_trans_huge_mapcount(), however we only use
+ * page_trans_huge_mapcount() in the copy-on-write faults where we
+ * need full accuracy to avoid breaking page pinning, because
+ * page_trans_huge_mapcount is slower than page_mapcount().
+ */
+int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
+{
+	int i, ret, _total_mapcount, mapcount;
+
+	/* hugetlbfs shouldn't call it */
+	VM_BUG_ON_PAGE(PageHuge(page), page);
+
+	if (likely(!PageTransCompound(page)))
+		return atomic_read(&page->_mapcount) + 1;
+
+	page = compound_head(page);
+
+	_total_mapcount = ret = 0;
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		mapcount = atomic_read(&page[i]._mapcount) + 1;
+		ret = max(ret, mapcount);
+		_total_mapcount += mapcount;
+	}
+	if (PageDoubleMap(page)) {
+		ret -= 1;
+		_total_mapcount -= HPAGE_PMD_NR;
+	}
+	mapcount = compound_mapcount(page);
+	ret += mapcount;
+	_total_mapcount += mapcount;
+	*total_mapcount = _total_mapcount;
+	return ret;
+}
+
+/*
  * This function splits huge page into normal pages. @page can point to any
  * subpage of huge page to split. Split doesn't change the position of @page.
  *
diff --git a/mm/memory.c b/mm/memory.c
index 93897f2..1589aa4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2340,6 +2340,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * not dirty accountable.
 	 */
 	if (PageAnon(old_page) && !PageKsm(old_page)) {
+		int total_mapcount;
 		if (!trylock_page(old_page)) {
 			get_page(old_page);
 			pte_unmap_unlock(page_table, ptl);
@@ -2354,13 +2355,17 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			}
 			put_page(old_page);
 		}
-		if (reuse_swap_page(old_page)) {
-			/*
-			 * The page is all ours.  Move it to our anon_vma so
-			 * the rmap code will not search our parent or siblings.
-			 * Protected against the rmap code by the page lock.
-			 */
-			page_move_anon_rmap(old_page, vma, address);
+		if (reuse_swap_page(old_page, &total_mapcount)) {
+			if (total_mapcount == 1) {
+				/*
+				 * The page is all ours. Move it to
+				 * our anon_vma so the rmap code will
+				 * not search our parent or siblings.
+				 * Protected against the rmap code by
+				 * the page lock.
+				 */
+				page_move_anon_rmap(old_page, vma, address);
+			}
 			unlock_page(old_page);
 			return wp_page_reuse(mm, vma, address, page_table, ptl,
 					     orig_pte, old_page, 0, 0);
@@ -2584,7 +2589,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	dec_mm_counter_fast(mm, MM_SWAPENTS);
 	pte = mk_pte(page, vma->vm_page_prot);
-	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
+	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 83874ec..031713ab 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -922,18 +922,19 @@ out:
  * to it.  And as a side-effect, free up its swap: because the old content
  * on disk will never be read, and seeking back there to write new content
  * later would only waste time away from clustering.
+ *
+ * NOTE: total_mapcount should not be relied upon by the caller if
+ * reuse_swap_page() returns false, but it may be always overwritten
+ * (see the other implementation for CONFIG_SWAP=n).
  */
-int reuse_swap_page(struct page *page)
+bool reuse_swap_page(struct page *page, int *total_mapcount)
 {
 	int count;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	if (unlikely(PageKsm(page)))
-		return 0;
-	/* The page is part of THP and cannot be reused */
-	if (PageTransCompound(page))
-		return 0;
-	count = page_mapcount(page);
+		return false;
+	count = page_trans_huge_mapcount(page, total_mapcount);
 	if (count <= 1 && PageSwapCache(page)) {
 		count += page_swapcount(page);
 		if (count == 1 && !PageWriteback(page)) {



From b3cd271859f4c8243b58b4b55998fcc9ee0a0988 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Sat, 30 Apr 2016 18:35:34 +0200
Subject: [PATCH 3/4] mm: thp: microoptimize compound_mapcount()

compound_mapcount() is only called after PageCompound() has already
been checked by the caller, so there's no point in checking it again. Gcc
may optimize it away too because it's inline, but this will remove the
runtime check for sure and it'll add an assert instead.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4b532d8..119325d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -471,8 +471,7 @@ static inline atomic_t *compound_mapcount_ptr(struct page *page)
 
 static inline int compound_mapcount(struct page *page)
 {
-	if (!PageCompound(page))
-		return 0;
+	VM_BUG_ON_PAGE(!PageCompound(page), page);
 	page = compound_head(page);
 	return atomic_read(compound_mapcount_ptr(page)) + 1;
 }



From a2f1172344b87b1b0e18d07014ee5ab2027fac10 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Thu, 5 May 2016 00:59:27 +0200
Subject: [PATCH 4/4] mm: thp: split_huge_pmd_address() comment improvement

The comment is partly wrong; this improves it by covering the case of
split_huge_pmd_address() being called by try_to_unmap_one() when
TTU_SPLIT_HUGE_PMD is set.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f8f07e4..e716726 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3032,8 +3032,10 @@ void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
 		return;
 
 	/*
-	 * Caller holds the mmap_sem write mode, so a huge pmd cannot
-	 * materialize from under us.
+	 * Caller holds the mmap_sem write mode or the anon_vma lock,
+	 * so a huge pmd cannot materialize from under us (khugepaged
+	 * holds both the mmap_sem write mode and the anon_vma lock
+	 * write mode).
 	 */
 	__split_huge_pmd(vma, pmd, address, freeze);
 }


I also noticed we aren't calling page_move_anon_rmap in
do_huge_pmd_wp_page when page_trans_huge_mapcount returns 1; that's a
longstanding inefficiency but it's not a bug. We're not locking the
page down in the THP COW because we don't have to deal with swapcache,
and in turn we can't overwrite page->mapping. I think in practice
it would be safe anyway because it's an atomic write and no matter if
the rmap_walk reader sees the value before or after the write, it'll
still be able to find the pmd_trans_huge during the rmap walk. However,
if page->mapping can change under the reader (i.e. rmap_walk) then the
reader should use READ_ONCE to access page->mapping (or page->mapping
should become volatile). Otherwise it'd be a bug with respect to the C
standard where gcc could get confused in theory (in practice it would
work fine, as we're mostly just dereferencing that page->mapping
pointer and not using it for switch/case or stuff like that where gcc
could use a hash).

Regardless, for robustness it'd be better to take appropriate locking:
first check if page->mapping is already pointing to the local
vma->anon_vma, and if not, take the page lock on the head THP and call
page_move_anon_rmap. Because this is a longstanding problem I didn't
address it yet, and it's only a missing optimization, but it'd be nice
to get that covered too (considering we just worsened the optimization
a bit in the presence of a COW after a pmd split and before the
physical split).
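
For what it's worth, a userspace toy model of the READ_ONCE pattern
mentioned above might look like this (READ_ONCE is redefined locally and
struct toy_page is a stand-in, not the kernel's struct page):

#include <stdio.h>

#define READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))

struct toy_anon_vma { const char *name; };
struct toy_page { struct toy_anon_vma *mapping; };

static struct toy_anon_vma parent_av = { "parent" };
static struct toy_anon_vma child_av  = { "child" };
static struct toy_page page = { &parent_av };

/* an rmap_walk-style reader: load page.mapping exactly once and use the
 * local copy for the whole walk, so a concurrent atomic update (like
 * page_move_anon_rmap changing it) can't be observed half-way through */
static void rmap_walk_reader(void)
{
	struct toy_anon_vma *av = READ_ONCE(page.mapping);

	printf("walking anon_vma: %s\n", av->name);
}

int main(void)
{
	rmap_walk_reader();
	page.mapping = &child_av;	/* models page_move_anon_rmap() */
	rmap_walk_reader();
	return 0;
}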

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [BUG] vfio device assignment regression with THP ref counting redesign
  2016-05-05 14:39                         ` Andrea Arcangeli
@ 2016-05-05 15:09                           ` Andrea Arcangeli
  2016-05-05 15:11                           ` Kirill A. Shutemov
  1 sibling, 0 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2016-05-05 15:09 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirill A. Shutemov, kirill.shutemov, linux-kernel, linux-mm

On Thu, May 05, 2016 at 04:39:24PM +0200, Andrea Arcangeli wrote:
> I'm currently testing this:

I must have been testing an earlier version; the version below has a
better chance of not oopsing. There's a reason I didn't attempt a
proper submit yet... this is just for testing until we're sure this is ok.

I also had a version of it initially that added a "*local_rmap"
parameter to reuse_swap_page instead of a *total_mapcount one, but I
don't think gcc would have optimized things away enough depending on
the kernel config: reuse_swap_page couldn't just forward the
"total_mapcount" pointer from caller to callee anymore. So this
results in less code at the very least, but then we have to check
"total_mapcount == 1" instead of just a true/false "local_rmap" (which
might have been more self-explanatory) if reuse_swap_page returns
true.

From 4158c4b38883fb3481d70f6a4eebbbf2b379a791 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Fri, 29 Apr 2016 01:05:06 +0200
Subject: [PATCH 1/1] mm: thp: calculate the mapcount correctly for THP pages
 during WP faults

This will provide full accuracy to the mapcount calculation in the
write protect faults, so page pinning will not get broken by false
positive copy-on-writes.

total_mapcount() isn't the right calculation needed in
reuse_swap_page(), so this introduces a page_trans_huge_mapcount()
that is effectively the fully accurate return value for page_mapcount()
if dealing with Transparent Hugepages, however we only use the
page_trans_huge_mapcount() during COW faults where it is strictly needed,
due to its higher runtime cost.

This also provides at practically zero cost the total_mapcount
information which is needed to know if we can still relocate the page
anon_vma to the local vma. If page_trans_huge_mapcount() returns 1 we
can reuse the page no matter if it's a pte or a pmd_trans_huge
triggering the fault, but we can only relocate the page anon_vma to
the local vma->anon_vma if we're sure it's only this "vma" mapping the
whole THP physical range.

Kirill A. Shutemov reported the problem with moving the page anon_vma
to the local vma->anon_vma in a previous version of this patch.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm.h   |  9 +++++++
 include/linux/swap.h |  8 ++++---
 mm/huge_memory.c     | 68 +++++++++++++++++++++++++++++++++++++++++++++-------
 mm/memory.c          | 21 +++++++++-------
 mm/swapfile.c        | 13 +++++-----
 5 files changed, 93 insertions(+), 26 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a55e5be..263f229 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -500,11 +500,20 @@ static inline int page_mapcount(struct page *page)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 int total_mapcount(struct page *page);
+int page_trans_huge_mapcount(struct page *page, int *total_mapcount);
 #else
 static inline int total_mapcount(struct page *page)
 {
 	return page_mapcount(page);
 }
+static inline int page_trans_huge_mapcount(struct page *page,
+					   int *total_mapcount)
+{
+	int mapcount = page_mapcount(page);
+	if (total_mapcount)
+		*total_mapcount = mapcount;
+	return mapcount;
+}
 #endif
 
 static inline struct page *virt_to_head_page(const void *x)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2b83359..acef20d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -418,7 +418,7 @@ extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
-extern int reuse_swap_page(struct page *);
+extern bool reuse_swap_page(struct page *, int *);
 extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
 
@@ -513,8 +513,10 @@ static inline int swp_swapcount(swp_entry_t entry)
 	return 0;
 }
 
-#define reuse_swap_page(page) \
-	(!PageTransCompound(page) && page_mapcount(page) == 1)
+static inline bool reuse_swap_page(struct page *page, int *total_mapcount)
+{
+	return page_trans_huge_mapcount(page, total_mapcount) == 1;
+}
 
 static inline int try_to_free_swap(struct page *page)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 86f9f8b..d4d61f5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1298,15 +1298,9 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
 	/*
 	 * We can only reuse the page if nobody else maps the huge page or it's
-	 * part. We can do it by checking page_mapcount() on each sub-page, but
-	 * it's expensive.
-	 * The cheaper way is to check page_count() to be equal 1: every
-	 * mapcount takes page reference reference, so this way we can
-	 * guarantee, that the PMD is the only mapping.
-	 * This can give false negative if somebody pinned the page, but that's
-	 * fine.
+	 * part.
 	 */
-	if (page_mapcount(page) == 1 && page_count(page) == 1) {
+	if (page_trans_huge_mapcount(page, NULL) == 1) {
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -2080,7 +2074,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		if (pte_write(pteval)) {
 			writable = true;
 		} else {
-			if (PageSwapCache(page) && !reuse_swap_page(page)) {
+			if (PageSwapCache(page) &&
+			    !reuse_swap_page(page, NULL)) {
 				unlock_page(page);
 				result = SCAN_SWAP_CACHE_PAGE;
 				goto out;
@@ -3225,6 +3220,61 @@ int total_mapcount(struct page *page)
 }
 
 /*
+ * This calculates accurately how many mappings a transparent hugepage
+ * has (unlike page_mapcount() which isn't fully accurate). This full
+ * accuracy is primarily needed to know if copy-on-write faults can
+ * takeover the page and change the mapping to read-write instead of
+ * copying them. At the same time this returns the total_mapcount too.
+ *
+ * The return value is telling if the page can be reused as it returns
+ * the highest mapcount any one of the subpages has. If the return
+ * value is one, even if different processes are mapping different
+ * subpages of the transparent hugepage, they can all reuse it,
+ * because each process is reusing a different subpage.
+ *
+ * The total_mapcount is instead counting all virtual mappings of the
+ * subpages. If the total_mapcount is equal to "one", it tells the
+ * caller all mappings belong to the same "mm" and in turn the
+ * anon_vma of the transparent hugepage can become the vma->anon_vma
+ * local one as no other process may be mapping any of the subpages.
+ *
+ * It would be more accurate to replace page_mapcount() with
+ * page_trans_huge_mapcount(), however we only use
+ * page_trans_huge_mapcount() in the copy-on-write faults where we
+ * need full accuracy to avoid breaking page pinning, because
+ * page_trans_huge_mapcount is slower than page_mapcount().
+ */
+int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
+{
+	int i, ret, _total_mapcount, mapcount;
+
+	/* hugetlbfs shouldn't call it */
+	VM_BUG_ON_PAGE(PageHuge(page), page);
+
+	if (likely(!PageTransCompound(page)))
+		return atomic_read(&page->_mapcount) + 1;
+
+	page = compound_head(page);
+
+	_total_mapcount = ret = 0;
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		mapcount = atomic_read(&page[i]._mapcount) + 1;
+		ret = max(ret, mapcount);
+		_total_mapcount += mapcount;
+	}
+	if (PageDoubleMap(page)) {
+		ret -= 1;
+		_total_mapcount -= HPAGE_PMD_NR;
+	}
+	mapcount = compound_mapcount(page);
+	ret += mapcount;
+	_total_mapcount += mapcount;
+	if (total_mapcount)
+		*total_mapcount = _total_mapcount;
+	return ret;
+}
+
+/*
  * This function splits huge page into normal pages. @page can point to any
  * subpage of huge page to split. Split doesn't change the position of @page.
  *
diff --git a/mm/memory.c b/mm/memory.c
index 93897f2..1589aa4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2340,6 +2340,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * not dirty accountable.
 	 */
 	if (PageAnon(old_page) && !PageKsm(old_page)) {
+		int total_mapcount;
 		if (!trylock_page(old_page)) {
 			get_page(old_page);
 			pte_unmap_unlock(page_table, ptl);
@@ -2354,13 +2355,17 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			}
 			put_page(old_page);
 		}
-		if (reuse_swap_page(old_page)) {
-			/*
-			 * The page is all ours.  Move it to our anon_vma so
-			 * the rmap code will not search our parent or siblings.
-			 * Protected against the rmap code by the page lock.
-			 */
-			page_move_anon_rmap(old_page, vma, address);
+		if (reuse_swap_page(old_page, &total_mapcount)) {
+			if (total_mapcount == 1) {
+				/*
+				 * The page is all ours. Move it to
+				 * our anon_vma so the rmap code will
+				 * not search our parent or siblings.
+				 * Protected against the rmap code by
+				 * the page lock.
+				 */
+				page_move_anon_rmap(old_page, vma, address);
+			}
 			unlock_page(old_page);
 			return wp_page_reuse(mm, vma, address, page_table, ptl,
 					     orig_pte, old_page, 0, 0);
@@ -2584,7 +2589,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	dec_mm_counter_fast(mm, MM_SWAPENTS);
 	pte = mk_pte(page, vma->vm_page_prot);
-	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
+	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 83874ec..031713ab 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -922,18 +922,19 @@ out:
  * to it.  And as a side-effect, free up its swap: because the old content
  * on disk will never be read, and seeking back there to write new content
  * later would only waste time away from clustering.
+ *
+ * NOTE: total_mapcount should not be relied upon by the caller if
+ * reuse_swap_page() returns false, but it may be always overwritten
+ * (see the other implementation for CONFIG_SWAP=n).
  */
-int reuse_swap_page(struct page *page)
+bool reuse_swap_page(struct page *page, int *total_mapcount)
 {
 	int count;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	if (unlikely(PageKsm(page)))
-		return 0;
-	/* The page is part of THP and cannot be reused */
-	if (PageTransCompound(page))
-		return 0;
-	count = page_mapcount(page);
+		return false;
+	count = page_trans_huge_mapcount(page, total_mapcount);
 	if (count <= 1 && PageSwapCache(page)) {
 		count += page_swapcount(page);
 		if (count == 1 && !PageWriteback(page)) {
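
To make concrete why the old check broke pinning, here is a purely
illustrative comparison of the two reuse checks (not part of the patch,
helper names are made up) for a THP that has a single mapping plus one
extra reference held by get_user_pages_fast():

static bool old_can_reuse(struct page *page)
{
	/*
	 * Pre-patch check: the extra reference from the pin means
	 * page_count() == 2 even though page_mapcount() == 1, so this
	 * returns false, the WP fault copies the huge page, and the
	 * pinned physical page no longer backs the vaddr that was
	 * programmed into the iommu.
	 */
	return page_mapcount(page) == 1 && page_count(page) == 1;
}

static bool new_can_reuse(struct page *page)
{
	/*
	 * Patched check: only mapcounts are considered, so an elevated
	 * refcount from a pin can no longer force a spurious
	 * copy-on-write and the pinned mapping stays stable.
	 */
	return page_trans_huge_mapcount(page, NULL) == 1;
}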

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [BUG] vfio device assignment regression with THP ref counting redesign
  2016-05-05 14:39                         ` Andrea Arcangeli
  2016-05-05 15:09                           ` Andrea Arcangeli
@ 2016-05-05 15:11                           ` Kirill A. Shutemov
  2016-05-05 15:24                             ` Andrea Arcangeli
  1 sibling, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2016-05-05 15:11 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Alex Williamson, kirill.shutemov, linux-kernel, linux-mm

On Thu, May 05, 2016 at 04:39:24PM +0200, Andrea Arcangeli wrote:
> Hello Alex,
> 
> On Wed, May 04, 2016 at 07:19:27PM -0600, Alex Williamson wrote:
> > On Mon, 2 May 2016 20:03:07 +0200
> > Andrea Arcangeli <aarcange@redhat.com> wrote:
> > 
> > > On Mon, May 02, 2016 at 07:00:42PM +0300, Kirill A. Shutemov wrote:
> > > > Agreed. I just didn't see the two-refcounts solution.  
> > > 
> > > If you didn't do it already or if you're busy with something else,
> > > I can change the patch to the two refcount solution, which should
> > > restore the old semantics without breaking rmap.
> > 
> > I didn't see any follow-up beyond this nor patches on lkml.  Do we have
> > something we feel confident for posting to v4.6 with a stable backport
> > to v4.5?  Thanks,
> 
> I'm currently testing this:
> 
> From c327b17f4de0c968bb3b9035fe36d80b2c28b2f8 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Fri, 29 Apr 2016 01:05:06 +0200
> Subject: [PATCH 1/3] mm: thp: calculate the mapcount correctly for THP pages
>  during WP faults
> 
> This will provide full accuracy to the mapcount calculation in the
> write protect faults, so page pinning will not get broken by false
> positive copy-on-writes.
> 
> total_mapcount() isn't the right calculation needed in
> reuse_swap_page(), so this introduces a page_trans_huge_mapcount()
> that is effectively the fully accurate return value for page_mapcount()
> when dealing with Transparent Hugepages; however, we only use
> page_trans_huge_mapcount() during COW faults, where it is strictly
> needed, due to its higher runtime cost.
> 
> This also provides, at practically zero cost, the total_mapcount
> information which is needed to know if we can still relocate the page
> anon_vma to the local vma. If page_trans_huge_mapcount() returns 1 we
> can reuse the page no matter if it's a pte or a pmd_trans_huge
> triggering the fault, but we can only relocate the page anon_vma to
> the local vma->anon_vma if we're sure it's only this "vma" mapping the
> whole THP physical range.
> 
> Kirill A. Shutemov reported the problem with moving the page anon_vma
> to the local vma->anon_vma in a previous version of this patch.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/mm.h   |  6 +++++
>  include/linux/swap.h |  8 ++++---
>  mm/huge_memory.c     | 67 +++++++++++++++++++++++++++++++++++++++++++++-------
>  mm/memory.c          | 21 +++++++++-------
>  mm/swapfile.c        | 13 +++++-----
>  5 files changed, 89 insertions(+), 26 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a55e5be..4b532d8 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -500,11 +500,17 @@ static inline int page_mapcount(struct page *page)
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  int total_mapcount(struct page *page);
> +int page_trans_huge_mapcount(struct page *page, int *total_mapcount);
>  #else
>  static inline int total_mapcount(struct page *page)
>  {
>  	return page_mapcount(page);
>  }
> +static inline int page_trans_huge_mapcount(struct page *page,
> +					   int *total_mapcount)
> +{
> +	return *total_mapcount = page_mapcount(page);
> +}
>  #endif
>  
>  static inline struct page *virt_to_head_page(const void *x)
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2b83359..acef20d 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -418,7 +418,7 @@ extern sector_t swapdev_block(int, pgoff_t);
>  extern int page_swapcount(struct page *);
>  extern int swp_swapcount(swp_entry_t entry);
>  extern struct swap_info_struct *page_swap_info(struct page *);
> -extern int reuse_swap_page(struct page *);
> +extern bool reuse_swap_page(struct page *, int *);
>  extern int try_to_free_swap(struct page *);
>  struct backing_dev_info;
>  
> @@ -513,8 +513,10 @@ static inline int swp_swapcount(swp_entry_t entry)
>  	return 0;
>  }
>  
> -#define reuse_swap_page(page) \
> -	(!PageTransCompound(page) && page_mapcount(page) == 1)
> +static inline bool reuse_swap_page(struct page *page, int *total_mapcount)
> +{
> +	return page_trans_huge_mapcount(page, total_mapcount) == 1;
> +}
>  
>  static inline int try_to_free_swap(struct page *page)
>  {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 86f9f8b..d368620 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1298,15 +1298,9 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
>  	/*
>  	 * We can only reuse the page if nobody else maps the huge page or it's
> -	 * part. We can do it by checking page_mapcount() on each sub-page, but
> -	 * it's expensive.
> -	 * The cheaper way is to check page_count() to be equal 1: every
> -	 * mapcount takes page reference reference, so this way we can
> -	 * guarantee, that the PMD is the only mapping.
> -	 * This can give false negative if somebody pinned the page, but that's
> -	 * fine.
> +	 * part.
>  	 */
> -	if (page_mapcount(page) == 1 && page_count(page) == 1) {
> +	if (page_trans_huge_mapcount(page, NULL) == 1) {

Hm. How would total_mapcount equal to NULL not lead to a NULL-pointer
dereference inside page_trans_huge_mapcount()?

>  		pmd_t entry;
>  		entry = pmd_mkyoung(orig_pmd);
>  		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> @@ -2080,7 +2074,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		if (pte_write(pteval)) {
>  			writable = true;
>  		} else {
> -			if (PageSwapCache(page) && !reuse_swap_page(page)) {
> +			if (PageSwapCache(page) &&
> +			    !reuse_swap_page(page, NULL)) {

Ditto.

>  				unlock_page(page);
>  				result = SCAN_SWAP_CACHE_PAGE;
>  				goto out;
> @@ -3225,6 +3220,60 @@ int total_mapcount(struct page *page)
>  }
>  
>  /*
> + * This calculates accurately how many mappings a transparent hugepage
> + * has (unlike page_mapcount() which isn't fully accurate). This full
> + * accuracy is primarily needed to know if copy-on-write faults can
> + * takeover the page and change the mapping to read-write instead of
> + * copying them. At the same time this returns the total_mapcount too.
> + *
> + * The return value is telling if the page can be reused as it returns
> + * the highest mapcount any one of the subpages has. If the return
> + * value is one, even if different processes are mapping different
> + * subpages of the transparent hugepage, they can all reuse it,
> + * because each process is reusing a different subpage.
> + *
> + * The total_mapcount is instead counting all virtual mappings of the
> + * subpages. If the total_mapcount is equal to "one", it tells the
> + * caller all mappings belong to the same "mm" and in turn the
> + * anon_vma of the transparent hugepage can become the vma->anon_vma
> + * local one as no other process may be mapping any of the subpages.
> + *
> + * It would be more accurate to replace page_mapcount() with
> + * page_trans_huge_mapcount(), however we only use
> + * page_trans_huge_mapcount() in the copy-on-write faults where we
> + * need full accuracy to avoid breaking page pinning, because
> + * page_trans_huge_mapcount is slower than page_mapcount().
> + */
> +int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
> +{
> +	int i, ret, _total_mapcount, mapcount;
> +
> +	/* hugetlbfs shouldn't call it */
> +	VM_BUG_ON_PAGE(PageHuge(page), page);
> +
> +	if (likely(!PageTransCompound(page)))
> +		return atomic_read(&page->_mapcount) + 1;
> +
> +	page = compound_head(page);
> +
> +	_total_mapcount = ret = 0;
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		mapcount = atomic_read(&page[i]._mapcount) + 1;
> +		ret = max(ret, mapcount);
> +		_total_mapcount += mapcount;
> +	}
> +	if (PageDoubleMap(page)) {
> +		ret -= 1;
> +		_total_mapcount -= HPAGE_PMD_NR;
> +	}
> +	mapcount = compound_mapcount(page);
> +	ret += mapcount;
> +	_total_mapcount += mapcount;
> +	*total_mapcount = _total_mapcount;
> +	return ret;
> +}
> +
> +/*
>   * This function splits huge page into normal pages. @page can point to any
>   * subpage of huge page to split. Split doesn't change the position of @page.
>   *
> diff --git a/mm/memory.c b/mm/memory.c
> index 93897f2..1589aa4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2340,6 +2340,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	 * not dirty accountable.
>  	 */
>  	if (PageAnon(old_page) && !PageKsm(old_page)) {
> +		int total_mapcount;
>  		if (!trylock_page(old_page)) {
>  			get_page(old_page);
>  			pte_unmap_unlock(page_table, ptl);
> @@ -2354,13 +2355,17 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			}
>  			put_page(old_page);
>  		}
> -		if (reuse_swap_page(old_page)) {
> -			/*
> -			 * The page is all ours.  Move it to our anon_vma so
> -			 * the rmap code will not search our parent or siblings.
> -			 * Protected against the rmap code by the page lock.
> -			 */
> -			page_move_anon_rmap(old_page, vma, address);
> +		if (reuse_swap_page(old_page, &total_mapcount)) {
> +			if (total_mapcount == 1) {
> +				/*
> +				 * The page is all ours. Move it to
> +				 * our anon_vma so the rmap code will
> +				 * not search our parent or siblings.
> +				 * Protected against the rmap code by
> +				 * the page lock.
> +				 */
> +				page_move_anon_rmap(old_page, vma, address);

compound_head() is missing, I believe.

> +			}
>  			unlock_page(old_page);
>  			return wp_page_reuse(mm, vma, address, page_table, ptl,
>  					     orig_pte, old_page, 0, 0);
> @@ -2584,7 +2589,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	inc_mm_counter_fast(mm, MM_ANONPAGES);
>  	dec_mm_counter_fast(mm, MM_SWAPENTS);
>  	pte = mk_pte(page, vma->vm_page_prot);
> -	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
> +	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
>  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>  		flags &= ~FAULT_FLAG_WRITE;
>  		ret |= VM_FAULT_WRITE;
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 83874ec..031713ab 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -922,18 +922,19 @@ out:
>   * to it.  And as a side-effect, free up its swap: because the old content
>   * on disk will never be read, and seeking back there to write new content
>   * later would only waste time away from clustering.
> + *
> + * NOTE: total_mapcount should not be relied upon by the caller if
> + * reuse_swap_page() returns false, but it may be always overwritten
> + * (see the other implementation for CONFIG_SWAP=n).
>   */
> -int reuse_swap_page(struct page *page)
> +bool reuse_swap_page(struct page *page, int *total_mapcount)
>  {
>  	int count;
>  
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  	if (unlikely(PageKsm(page)))
> -		return 0;
> -	/* The page is part of THP and cannot be reused */
> -	if (PageTransCompound(page))
> -		return 0;
> -	count = page_mapcount(page);
> +		return false;
> +	count = page_trans_huge_mapcount(page, total_mapcount);
>  	if (count <= 1 && PageSwapCache(page)) {
>  		count += page_swapcount(page);
>  		if (count == 1 && !PageWriteback(page)) {
> 
> 
> 
> From b3cd271859f4c8243b58b4b55998fcc9ee0a0988 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Sat, 30 Apr 2016 18:35:34 +0200
> Subject: [PATCH 3/4] mm: thp: microoptimize compound_mapcount()
> 
> compound_mapcount() is only called after PageCompound() has already
> been checked by the caller, so there's no point in checking it again.
> Gcc may optimize the check away anyway because it's inline, but this
> removes the runtime check for sure and adds an assert instead.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/mm.h | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 4b532d8..119325d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -471,8 +471,7 @@ static inline atomic_t *compound_mapcount_ptr(struct page *page)
>  
>  static inline int compound_mapcount(struct page *page)
>  {
> -	if (!PageCompound(page))
> -		return 0;
> +	VM_BUG_ON_PAGE(!PageCompound(page), page);
>  	page = compound_head(page);
>  	return atomic_read(compound_mapcount_ptr(page)) + 1;
>  }
> 
> 
> 
> From a2f1172344b87b1b0e18d07014ee5ab2027fac10 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Thu, 5 May 2016 00:59:27 +0200
> Subject: [PATCH 4/4] mm: thp: split_huge_pmd_address() comment improvement
> 
> The comment is partly wrong; this improves it by including the case of
> split_huge_pmd_address() being called by try_to_unmap_one() when
> TTU_SPLIT_HUGE_PMD is set.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  mm/huge_memory.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index f8f07e4..e716726 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3032,8 +3032,10 @@ void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
>  		return;
>  
>  	/*
> -	 * Caller holds the mmap_sem write mode, so a huge pmd cannot
> -	 * materialize from under us.
> +	 * Caller holds the mmap_sem write mode or the anon_vma lock,
> +	 * so a huge pmd cannot materialize from under us (khugepaged
> +	 * holds both the mmap_sem write mode and the anon_vma lock
> +	 * write mode).
>  	 */
>  	__split_huge_pmd(vma, pmd, address, freeze);
>  }
> 
> 
> I also noticed we aren't calling page_move_anon_rmap in
> do_huge_pmd_wp_page when page_trans_huge_mapcount returns 1; that's a
> longstanding inefficiency but it's not a bug. We're not locking the
> page down in the THP COW because we don't have to deal with swapcache,
> and in turn we can't overwrite page->mapping.
> 
> I think in practice it would be safe anyway because it's an atomic
> write, and no matter whether the rmap_walk reader sees the value
> before or after the write, it'll still be able to find the
> pmd_trans_huge during the rmap walk. However, if page->mapping can
> change under the reader (i.e. rmap_walk), then the reader should use
> READ_ONCE to access page->mapping (or page->mapping should become
> volatile). Otherwise it'd be a bug as far as the C standard is
> concerned, where gcc could get confused in theory (in practice it
> would work fine as we're mostly just dereferencing that page->mapping
> pointer and not using it for switch/case or anything else where gcc
> could use a hash).
> 
> Regardless, for robustness it'd be better to take appropriate locking:
> first check if page->mapping already points to the local
> vma->anon_vma, and if not, take the page lock on the head THP and call
> page_move_anon_rmap. Because this is a longstanding problem I didn't
> address it yet, and it's only a missing optimization, but it'd be nice
> to get that covered too (considering we just worsened the optimization
> a bit in the presence of a COW after a pmd split and before the
> physical split).

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [BUG] vfio device assignment regression with THP ref counting redesign
  2016-05-05 15:11                           ` Kirill A. Shutemov
@ 2016-05-05 15:24                             ` Andrea Arcangeli
  2016-05-06  7:29                               ` Kirill A. Shutemov
  0 siblings, 1 reply; 36+ messages in thread
From: Andrea Arcangeli @ 2016-05-05 15:24 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Alex Williamson, kirill.shutemov, linux-kernel, linux-mm

On Thu, May 05, 2016 at 06:11:10PM +0300, Kirill A. Shutemov wrote:
> Hm. How would total_mapcount equal to NULL not lead to a NULL-pointer
> dereference inside page_trans_huge_mapcount()?

Sorry for the confusion; this was still a work in progress, and then I
saw the email from Alex and sent the last version I had committed right
away. An earlier version of course had the proper checks for NULL, but
they got wiped as I transitioned from one model to another and back.

> > +				page_move_anon_rmap(old_page, vma, address);
> 
> compound_head() is missing, I believe.

Oh yes, fixed that too.

			if (total_mapcount == 1) {
				/*
				 * The page is all ours. Move it to
				 * our anon_vma so the rmap code will
				 * not search our parent or siblings.
				 * Protected against the rmap code by
				 * the page lock.
				 */
				page_move_anon_rmap(compound_head(old_page),
						    vma, address);
			}


If there's no other issue I can git send-email.

Then we should look into calling page_move_anon_rmap from THP COWs
too; hugetlbfs calls it as well. I think we probably need to make
page_move_anon_rmap smarter and optionally let it take the lock for us,
after reading page->mapping first to be sure it's really moving it.

The question then is whether trylock or lock_page should be used; my
preference would be just trylock.
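
Something like this untested sketch is what I have in mind (the
function name is made up and the caller would still have to verify
total_mapcount == 1 first, exactly like do_wp_page does now):

static void page_try_move_anon_rmap(struct page *page,
				    struct vm_area_struct *vma,
				    unsigned long address)
{
	struct anon_vma *anon_vma = vma->anon_vma;

	page = compound_head(page);
	/* nothing to do if page->mapping already points to the local anon_vma */
	if (READ_ONCE(page->mapping) ==
	    (struct address_space *)((unsigned long)anon_vma | PAGE_MAPPING_ANON))
		return;
	/* best effort: relocating the anon_vma is only an optimization */
	if (!trylock_page(page))
		return;
	if (PageAnon(page) && !PageKsm(page))
		page_move_anon_rmap(page, vma, address);
	unlock_page(page);
}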

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [BUG] vfio device assignment regression with THP ref counting redesign
  2016-05-05 15:24                             ` Andrea Arcangeli
@ 2016-05-06  7:29                               ` Kirill A. Shutemov
  0 siblings, 0 replies; 36+ messages in thread
From: Kirill A. Shutemov @ 2016-05-06  7:29 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Alex Williamson, kirill.shutemov, linux-kernel, linux-mm

On Thu, May 05, 2016 at 05:24:06PM +0200, Andrea Arcangeli wrote:
> On Thu, May 05, 2016 at 06:11:10PM +0300, Kirill A. Shutemov wrote:
> > Hm. How would total_mapcount equal to NULL not lead to a NULL-pointer
> > dereference inside page_trans_huge_mapcount()?
> 
> Sorry for the confusion; this was still a work in progress, and then I
> saw the email from Alex and sent the last version I had committed right
> away. An earlier version of course had the proper checks for NULL, but
> they got wiped as I transitioned from one model to another and back.
> 
> > > +				page_move_anon_rmap(old_page, vma, address);
> > 
> > compound_head() is missing, I believe.
> 
> Oh yes, fixed that too.
> 
> 			if (total_mapcount == 1) {
> 				/*
> 				 * The page is all ours. Move it to
> 				 * our anon_vma so the rmap code will
> 				 * not search our parent or siblings.
> 				 * Protected against the rmap code by
> 				 * the page lock.
> 				 */
> 				page_move_anon_rmap(compound_head(old_page),
> 						    vma, address);
> 			}
> 
> 
> If there's no other issue I can git send-email.

I don't see any.

> Then we should look into calling page_move_anon_rmap from THP COWs
> too; hugetlbfs calls it as well. I think we probably need to make
> page_move_anon_rmap smarter and optionally let it take the lock for us,
> after reading page->mapping first to be sure it's really moving it.
> 
> The question then is whether trylock or lock_page should be used; my
> preference would be just trylock.

trylock is probably fine. It's no big deal if we don't move the page
to the new anon_vma; it's just nice-to-have.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2016-05-06  7:29 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-28 16:20 [BUG] vfio device assignment regression with THP ref counting redesign Alex Williamson
2016-04-28 18:17 ` Kirill A. Shutemov
2016-04-28 18:58   ` Alex Williamson
2016-04-28 23:21     ` Andrea Arcangeli
2016-04-29  0:44       ` Alex Williamson
2016-04-29  0:51       ` Kirill A. Shutemov
2016-04-29  2:45         ` Alex Williamson
2016-04-29  7:06           ` Kirill A. Shutemov
2016-04-29 15:12             ` Alex Williamson
2016-04-29 16:34             ` Andrea Arcangeli
2016-04-29 22:34               ` Alex Williamson
2016-05-02 10:41               ` Kirill A. Shutemov
2016-05-02 11:15                 ` Jerome Glisse
2016-05-02 12:14                   ` GUP guarantees wrt to userspace mappings redesign Kirill A. Shutemov
2016-05-02 13:39                     ` Jerome Glisse
2016-05-02 15:00                       ` GUP guarantees wrt to userspace mappings Kirill A. Shutemov
2016-05-02 15:22                         ` Jerome Glisse
2016-05-02 16:12                           ` Kirill A. Shutemov
2016-05-02 19:14                             ` Andrea Arcangeli
2016-05-02 19:11                           ` Andrea Arcangeli
2016-05-02 19:02                         ` Andrea Arcangeli
2016-05-02 14:15                     ` GUP guarantees wrt to userspace mappings redesign Oleg Nesterov
2016-05-02 16:21                       ` Kirill A. Shutemov
2016-05-02 16:22                         ` Oleg Nesterov
2016-05-02 18:03                           ` Kirill A. Shutemov
2016-05-02 17:41                             ` Oleg Nesterov
2016-05-02 18:56                     ` Andrea Arcangeli
2016-05-02 15:23                 ` [BUG] vfio device assignment regression with THP ref counting redesign Andrea Arcangeli
2016-05-02 16:00                   ` Kirill A. Shutemov
2016-05-02 18:03                     ` Andrea Arcangeli
2016-05-05  1:19                       ` Alex Williamson
2016-05-05 14:39                         ` Andrea Arcangeli
2016-05-05 15:09                           ` Andrea Arcangeli
2016-05-05 15:11                           ` Kirill A. Shutemov
2016-05-05 15:24                             ` Andrea Arcangeli
2016-05-06  7:29                               ` Kirill A. Shutemov
