linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 0/2] huge_pmd_unshare migration and flushing
@ 2018-08-21 20:59 Mike Kravetz
  2018-08-21 20:59 ` [PATCH v3 1/2] mm: migration: fix migration of huge PMD shared pages Mike Kravetz
  2018-08-21 20:59 ` [PATCH v3 2/2] hugetlb: take PMD sharing into account when flushing tlb/caches Mike Kravetz
  0 siblings, 2 replies; 21+ messages in thread
From: Mike Kravetz @ 2018-08-21 20:59 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Kirill A . Shutemov, Jérôme Glisse, Vlastimil Babka,
	Naoya Horiguchi, Davidlohr Bueso, Michal Hocko, Andrew Morton,
	Mike Kravetz

v3 of a series to correct a data corruption issue caused by improper
handling of shared huge PMDs during page migration.  The issue was
observed in a customer environment and can be recreated fairly easily
with a test program.  Patch 0001 addresses only this issue and is
copied to stable with the intention that it be applied to stable
releases.  The bug has existed since the addition of shared huge PMD
support.

While considering the issue above, Kirill Shutemov noticed that other
callers of huge_pmd_unshare have potential issues with cache and TLB
flushing.  A separate patch (0002) takes advantage of the new routine
huge_pmd_sharing_possible() to adjust flushing ranges in the cases
where huge PMD sharing is possible.  There is no copy to stable for this
patch, as the problem has not been reported in practice and was found
only via code inspection.
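
For orientation, the calling pattern these patches introduce looks roughly
like the sketch below (abridged from the hunks in the patches themselves;
f_start/f_end are the local names used in hugetlb_change_protection):

	unsigned long f_start = start, f_end = end;

	/*
	 * Widen the flush/invalidate range to cover full PUD_SIZE areas if
	 * this vma could be using shared PMDs; effectively a no-op otherwise.
	 */
	(void)huge_pmd_sharing_possible(vma, &f_start, &f_end);

	flush_cache_range(vma, f_start, f_end);
	mmu_notifier_invalidate_range_start(mm, f_start, f_end);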

Mike Kravetz (2):
  mm: migration: fix migration of huge PMD shared pages
  hugetlb: take PMD sharing into account when flushing tlb/caches

 include/linux/hugetlb.h | 14 +++++++
 mm/hugetlb.c            | 93 +++++++++++++++++++++++++++++++++++++----
 mm/rmap.c               | 42 +++++++++++++++++--
 3 files changed, 137 insertions(+), 12 deletions(-)

-- 
2.17.1



* [PATCH v3 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-21 20:59 [PATCH v3 0/2] huge_pmd_unshare migration and flushing Mike Kravetz
@ 2018-08-21 20:59 ` Mike Kravetz
  2018-08-21 22:03   ` kbuild test robot
  2018-08-22  0:51   ` kbuild test robot
  2018-08-21 20:59 ` [PATCH v3 2/2] hugetlb: take PMD sharing into account when flushing tlb/caches Mike Kravetz
  1 sibling, 2 replies; 21+ messages in thread
From: Mike Kravetz @ 2018-08-21 20:59 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Kirill A . Shutemov, Jérôme Glisse, Vlastimil Babka,
	Naoya Horiguchi, Davidlohr Bueso, Michal Hocko, Andrew Morton,
	Mike Kravetz, stable

The page migration code employs try_to_unmap() to try and unmap the
source page.  This is accomplished by using rmap_walk to find all
vmas where the page is mapped.  This search stops when page mapcount
is zero.  For shared PMD huge pages, the page map count is always 1
no matter the number of mappings.  Shared mappings are tracked via
the reference count of the PMD page.  Therefore, try_to_unmap stops
prematurely and does not completely unmap all mappings of the source
page.

This problem can result in data corruption, as writes to the original
source page can happen after the contents of the page have been copied
to the target page.  Hence, data is lost.

This problem was originally seen as DB corruption of shared global
areas after a huge page was soft offlined due to ECC memory errors.
DB developers noticed they could reproduce the issue by (hotplug)
offlining memory used to back huge pages.  A simple testcase can
reproduce the problem by creating a shared PMD mapping (note that
this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on
x86)), and using migrate_pages() to migrate process pages between
nodes while continually writing to the huge pages being migrated.
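
Purely as an illustration (this is not the actual test program referenced
above), a minimal reproducer sketch could look like the following.  It
assumes x86-64 with 2MB default hugetlb pages, at least two NUMA nodes,
enough hugetlb pages reserved, and libnuma's <numaif.h> for the
migrate_pages() wrapper (build with -lnuma); the mapping address hint and
iteration count are arbitrary choices:

	#define _GNU_SOURCE
	#include <signal.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <unistd.h>
	#include <numaif.h>		/* migrate_pages(); link with -lnuma */

	#define PUD_SZ		(1UL << 30)	/* PMD sharing granularity on x86 */
	#define HPAGE_SZ	(2UL << 20)	/* 2MB hugetlb page */

	int main(void)
	{
		/* PUD-aligned hint so the shared mapping can use shared PMDs */
		char *map = mmap((void *)(16UL << 30), PUD_SZ,
				 PROT_READ | PROT_WRITE,
				 MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		unsigned long from = 1, to = 2;	/* nodemasks: node 0 <-> node 1 */
		unsigned long off;
		pid_t child;
		int i;

		if (map == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		for (off = 0; off < PUD_SZ; off += HPAGE_SZ)	/* populate PMDs */
			map[off] = 1;

		child = fork();
		if (child == 0)			/* second mm, same pages: keep writing */
			for (;;)
				for (off = 0; off < PUD_SZ; off += HPAGE_SZ)
					map[off]++;

		for (i = 0; i < 1000; i++) {	/* bounce the child's pages around */
			unsigned long tmp;

			if (migrate_pages(child, 8 * sizeof(from), &from, &to) < 0)
				perror("migrate_pages");
			tmp = from; from = to; to = tmp;
		}

		kill(child, SIGKILL);
		waitpid(child, NULL, 0);
		return 0;
	}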

To fix, have the try_to_unmap_one routine check for huge PMD sharing
by calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a
shared mapping it will be 'unshared' which removes the page table
entry and drops the reference on the PMD page.  After this, flush
caches and TLB.

mmu notifiers are called before locking page tables, but we can not
be sure of PMD sharing until page tables are locked.  Therefore,
check for the possibility of PMD sharing before locking so that
notifiers can prepare for the worst possible case.

Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h | 14 ++++++++++++++
 mm/hugetlb.c            | 40 +++++++++++++++++++++++++++++++++++++++
 mm/rmap.c               | 42 ++++++++++++++++++++++++++++++++++++++---
 3 files changed, 93 insertions(+), 3 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 36fa6a2a82e3..1c6cde68487f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -140,6 +140,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
+bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end);
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
 			      int write);
 struct page *follow_huge_pd(struct vm_area_struct *vma,
@@ -170,6 +172,18 @@ static inline unsigned long hugetlb_total_pages(void)
 	return 0;
 }
 
+static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr,
+					pte_t *ptep)
+{
+	return 0;
+}
+
+bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end)
+{
+	return false;
+}
+
 #define follow_hugetlb_page(m,v,p,vs,a,b,i,w,n)	({ BUG(); 0; })
 #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
 #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3103099f64fd..fd155dc52117 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4555,6 +4555,9 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 
 	/*
 	 * check on proper vm_flags and page table alignment
+	 *
+	 * Note that this is the same check used in huge_pmd_sharing_possible.
+	 * If you change one, consider changing both.
 	 */
 	if (vma->vm_flags & VM_MAYSHARE &&
 	    vma->vm_start <= base && end <= vma->vm_end)
@@ -4562,6 +4565,43 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 	return false;
 }
 
+/*
+ * Determine if start,end range within vma could be mapped by shared pmd.
+ * If yes, adjust start and end to cover range associated with possible
+ * shared pmd mappings.
+ */
+bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end)
+{
+	unsigned long check_addr = *start;
+	bool ret = false;
+
+	if (!(vma->vm_flags & VM_MAYSHARE))
+		return ret;
+
+	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
+		unsigned long a_start = check_addr & PUD_MASK;
+		unsigned long a_end = a_start + PUD_SIZE;
+
+		/*
+		 * If sharing is possible, adjust start/end if necessary.
+		 *
+		 * Note that this is the same check used in vma_shareable.  If
+		 * you change one, consider changing both.
+		 */
+		if (vma->vm_start <= a_start && a_end <= vma->vm_end) {
+			if (a_start < *start)
+				*start = a_start;
+			if (a_end > *end)
+				*end = a_end;
+
+			ret = true;
+		}
+	}
+
+	return ret;
+}
+
 /*
  * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
  * and returns the corresponding pte. While this is not necessary for the
diff --git a/mm/rmap.c b/mm/rmap.c
index eb477809a5c0..8cf853a4b093 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	}
 
 	/*
-	 * We have to assume the worse case ie pmd for invalidation. Note that
-	 * the page can not be free in this function as call of try_to_unmap()
-	 * must hold a reference on the page.
+	 * For THP, we have to assume the worse case ie pmd for invalidation.
+	 * For hugetlb, it could be much worse if we need to do pud
+	 * invalidation in the case of pmd sharing.
+	 *
+	 * Note that the page can not be free in this function as call of
+	 * try_to_unmap() must hold a reference on the page.
 	 */
 	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
+	if (PageHuge(page)) {
+		/*
+		 * If sharing is possible, start and end will be adjusted
+		 * accordingly.
+		 */
+		(void)huge_pmd_sharing_possible(vma, &start, &end);
+	}
 	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
 
 	while (page_vma_mapped_walk(&pvmw)) {
@@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
 		address = pvmw.address;
 
+		if (PageHuge(page)) {
+			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
+				/*
+				 * huge_pmd_unshare unmapped an entire PMD
+				 * page.  There is no way of knowing exactly
+				 * which PMDs may be cached for this mm, so
+				 * we must flush them all.  start/end were
+				 * already adjusted above to cover this range.
+				 */
+				flush_cache_range(vma, start, end);
+				flush_tlb_range(vma, start, end);
+				mmu_notifier_invalidate_range(mm, start, end);
+
+				/*
+				 * The ref count of the PMD page was dropped
+				 * which is part of the way map counting
+				 * is done for shared PMDs.  Return 'true'
+				 * here.  When there is no other sharing,
+				 * huge_pmd_unshare returns false and we will
+				 * unmap the actual page and drop map count
+				 * to zero.
+				 */
+				page_vma_mapped_walk_done(&pvmw);
+				break;
+			}
+		}
 
 		if (IS_ENABLED(CONFIG_MIGRATION) &&
 		    (flags & TTU_MIGRATION) &&
-- 
2.17.1



* [PATCH v3 2/2] hugetlb: take PMD sharing into account when flushing tlb/caches
  2018-08-21 20:59 [PATCH v3 0/2] huge_pmd_unshare migration and flushing Mike Kravetz
  2018-08-21 20:59 ` [PATCH v3 1/2] mm: migration: fix migration of huge PMD shared pages Mike Kravetz
@ 2018-08-21 20:59 ` Mike Kravetz
  2018-08-21 23:07   ` kbuild test robot
  2018-08-22  1:20   ` kbuild test robot
  1 sibling, 2 replies; 21+ messages in thread
From: Mike Kravetz @ 2018-08-21 20:59 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Kirill A . Shutemov, Jérôme Glisse, Vlastimil Babka,
	Naoya Horiguchi, Davidlohr Bueso, Michal Hocko, Andrew Morton,
	Mike Kravetz

When fixing an issue with PMD sharing and migration, it was discovered
via code inspection that other callers of huge_pmd_unshare potentially
have an issue with cache and tlb flushing.

Use the routine huge_pmd_sharing_possible() to calculate worst case
ranges for mmu notifiers.  Ensure that this range is flushed if
huge_pmd_unshare succeeds and unmaps a PUD_SIZE area.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 53 +++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 44 insertions(+), 9 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index fd155dc52117..c31d92889775 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3333,8 +3333,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	struct page *page;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
-	const unsigned long mmun_start = start;	/* For mmu_notifiers */
-	const unsigned long mmun_end   = end;	/* For mmu_notifiers */
+	unsigned long mmun_start = start;	/* For mmu_notifiers */
+	unsigned long mmun_end   = end;		/* For mmu_notifiers */
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
 	BUG_ON(start & ~huge_page_mask(h));
@@ -3346,6 +3346,11 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 */
 	tlb_remove_check_page_size_change(tlb, sz);
 	tlb_start_vma(tlb, vma);
+
+	/*
+	 * If sharing possible, alert mmu notifiers of worst case.
+	 */
+	(void)huge_pmd_sharing_possible(vma, &mmun_start, &mmun_end);
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	address = start;
 	for (; address < end; address += sz) {
@@ -3356,6 +3361,10 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		ptl = huge_pte_lock(h, mm, ptep);
 		if (huge_pmd_unshare(mm, &address, ptep)) {
 			spin_unlock(ptl);
+			/*
+			 * We just unmapped a page of PMDs by clearing a PUD.
+			 * The caller's TLB flush range should cover this area.
+			 */
 			continue;
 		}
 
@@ -3438,12 +3447,23 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 {
 	struct mm_struct *mm;
 	struct mmu_gather tlb;
+	unsigned long tlb_start = start;
+	unsigned long tlb_end = end;
+
+	/*
+	 * If shared PMDs were possibly used within this vma range, adjust
+	 * start/end for worst case tlb flushing.
+	 * Note that we can not be sure if PMDs are shared until we try to
+	 * unmap pages.  However, we want to make sure TLB flushing covers
+	 * the largest possible range.
+	 */
+	(void)huge_pmd_sharing_possible(vma, &tlb_start, &tlb_end);
 
 	mm = vma->vm_mm;
 
-	tlb_gather_mmu(&tlb, mm, start, end);
+	tlb_gather_mmu(&tlb, mm, tlb_start, tlb_end);
 	__unmap_hugepage_range(&tlb, vma, start, end, ref_page);
-	tlb_finish_mmu(&tlb, start, end);
+	tlb_finish_mmu(&tlb, tlb_start, tlb_end);
 }
 
 /*
@@ -4309,11 +4329,21 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long pages = 0;
+	unsigned long f_start = start;
+	unsigned long f_end = end;
+	bool shared_pmd = false;
+
+	/*
+	 * In the case of shared PMDs, the area to flush could be beyond
+	 * start/end.  Set f_start/f_end to cover the maximum possible
+	 * range if PMD sharing is possible.
+	 */
+	(void)huge_pmd_sharing_possible(vma, &f_start, &f_end);
 
 	BUG_ON(address >= end);
-	flush_cache_range(vma, address, end);
+	flush_cache_range(vma, f_start, f_end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, f_start, f_end);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -4324,6 +4354,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 		if (huge_pmd_unshare(mm, &address, ptep)) {
 			pages++;
 			spin_unlock(ptl);
+			shared_pmd = true;
 			continue;
 		}
 		pte = huge_ptep_get(ptep);
@@ -4359,9 +4390,13 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	 * Must flush TLB before releasing i_mmap_rwsem: x86's huge_pmd_unshare
 	 * may have cleared our pud entry and done put_page on the page table:
 	 * once we release i_mmap_rwsem, another task can do the final put_page
-	 * and that page table be reused and filled with junk.
+	 * and that page table be reused and filled with junk.  If we actually
+	 * did unshare a page of pmds, flush the range corresponding to the pud.
 	 */
-	flush_hugetlb_tlb_range(vma, start, end);
+	if (shared_pmd)
+		flush_hugetlb_tlb_range(vma, f_start, f_end);
+	else
+		flush_hugetlb_tlb_range(vma, start, end);
 	/*
 	 * No need to call mmu_notifier_invalidate_range() we are downgrading
 	 * page table protection not changing it to point to a new page.
@@ -4369,7 +4404,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	 * See Documentation/vm/mmu_notifier.rst
 	 */
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, f_start, f_end);
 
 	return pages << h->order;
 }
-- 
2.17.1



* Re: [PATCH v3 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-21 20:59 ` [PATCH v3 1/2] mm: migration: fix migration of huge PMD shared pages Mike Kravetz
@ 2018-08-21 22:03   ` kbuild test robot
  2018-08-21 23:06     ` Mike Kravetz
  2018-08-22  0:51   ` kbuild test robot
  1 sibling, 1 reply; 21+ messages in thread
From: kbuild test robot @ 2018-08-21 22:03 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: kbuild-all, linux-mm, linux-kernel, Kirill A . Shutemov,
	Jérôme Glisse, Vlastimil Babka, Naoya Horiguchi,
	Davidlohr Bueso, Michal Hocko, Andrew Morton, Mike Kravetz,
	stable

[-- Attachment #1: Type: text/plain, Size: 7732 bytes --]

Hi Mike,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18 next-20180821]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Mike-Kravetz/huge_pmd_unshare-migration-and-flushing/20180822-050255
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   arch/x86/mm/fault.o: In function `huge_pmd_sharing_possible':
>> fault.c:(.text+0xa06): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   arch/x86/mm/pgtable.o: In function `huge_pmd_sharing_possible':
   pgtable.c:(.text+0x4): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/fork.o: In function `huge_pmd_sharing_possible':
   fork.c:(.text+0x309): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sysctl.o: In function `huge_pmd_sharing_possible':
   sysctl.c:(.text+0x0): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/core.o: In function `huge_pmd_sharing_possible':
   core.c:(.text+0x299): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/loadavg.o: In function `huge_pmd_sharing_possible':
   loadavg.c:(.text+0x0): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/clock.o: In function `huge_pmd_sharing_possible':
   clock.c:(.text+0x0): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/cputime.o: In function `huge_pmd_sharing_possible':
   cputime.c:(.text+0x0): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/idle.o: In function `huge_pmd_sharing_possible':
   idle.c:(.text+0x36): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/fair.o: In function `huge_pmd_sharing_possible':
   fair.c:(.text+0x864): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/rt.o: In function `huge_pmd_sharing_possible':
   rt.c:(.text+0x72b): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/deadline.o: In function `huge_pmd_sharing_possible':
   deadline.c:(.text+0xac7): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/wait.o: In function `huge_pmd_sharing_possible':
   wait.c:(.text+0x16e): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/wait_bit.o: In function `huge_pmd_sharing_possible':
   wait_bit.c:(.text+0x7b): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/swait.o: In function `huge_pmd_sharing_possible':
   swait.c:(.text+0x4): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/completion.o: In function `huge_pmd_sharing_possible':
   completion.c:(.text+0x4): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/filemap.o: In function `huge_pmd_sharing_possible':
   filemap.c:(.text+0x3ca): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/page_alloc.o: In function `huge_pmd_sharing_possible':
   page_alloc.c:(.text+0xa95): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/swap.o: In function `huge_pmd_sharing_possible':
   swap.c:(.text+0x551): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/vmscan.o: In function `huge_pmd_sharing_possible':
   vmscan.c:(.text+0x5bb): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/shmem.o: In function `huge_pmd_sharing_possible':
   shmem.c:(.text+0x6d): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/util.o: In function `huge_pmd_sharing_possible':
   util.c:(.text+0xc): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/compaction.o: In function `huge_pmd_sharing_possible':
   compaction.c:(.text+0x0): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/debug.o: In function `huge_pmd_sharing_possible':
   debug.c:(.text+0x0): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/gup.o: In function `huge_pmd_sharing_possible':
   gup.c:(.text+0x17c): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/memory.o: In function `huge_pmd_sharing_possible':
   memory.c:(.text+0x5f9): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/mincore.o: In function `huge_pmd_sharing_possible':
   mincore.c:(.text+0x150): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/mlock.o: In function `huge_pmd_sharing_possible':
   mlock.c:(.text+0x245): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/mmap.o: In function `huge_pmd_sharing_possible':
   mmap.c:(.text+0x565): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/mprotect.o: In function `huge_pmd_sharing_possible':
   mprotect.c:(.text+0x39): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/mremap.o: In function `huge_pmd_sharing_possible':
   mremap.c:(.text+0xf2): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/page_vma_mapped.o: In function `huge_pmd_sharing_possible':
   page_vma_mapped.c:(.text+0x0): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/pagewalk.o: In function `huge_pmd_sharing_possible':
   pagewalk.c:(.text+0x13d): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/rmap.o: In function `huge_pmd_sharing_possible':
   rmap.c:(.text+0x3bb): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 6482 bytes --]


* Re: [PATCH v3 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-21 22:03   ` kbuild test robot
@ 2018-08-21 23:06     ` Mike Kravetz
  0 siblings, 0 replies; 21+ messages in thread
From: Mike Kravetz @ 2018-08-21 23:06 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, linux-mm, linux-kernel, Kirill A . Shutemov,
	Jérôme Glisse, Vlastimil Babka, Naoya Horiguchi,
	Davidlohr Bueso, Michal Hocko, Andrew Morton, stable

On 08/21/2018 03:03 PM, kbuild test robot wrote:
> Hi Mike,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on linus/master]
> [also build test ERROR on v4.18 next-20180821]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Mike-Kravetz/huge_pmd_unshare-migration-and-flushing/20180822-050255
> config: i386-tinyconfig (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386 

Oops, the !CONFIG_HUGETLB_PAGE stub was missing 'static inline', which caused
the multiple definition errors.  Simple build fix addressed in the updated
patch below.

From 57b7d63d88f65b1c385a5f15c7cd5fa07513e955 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Tue, 21 Aug 2018 10:50:17 -0700
Subject: [PATCH v4 1/2] mm: migration: fix migration of huge PMD shared pages

The page migration code employs try_to_unmap() to try and unmap the
source page.  This is accomplished by using rmap_walk to find all
vmas where the page is mapped.  This search stops when page mapcount
is zero.  For shared PMD huge pages, the page map count is always 1
no matter the number of mappings.  Shared mappings are tracked via
the reference count of the PMD page.  Therefore, try_to_unmap stops
prematurely and does not completely unmap all mappings of the source
page.

This problem can result in data corruption, as writes to the original
source page can happen after the contents of the page have been copied
to the target page.  Hence, data is lost.

This problem was originally seen as DB corruption of shared global
areas after a huge page was soft offlined due to ECC memory errors.
DB developers noticed they could reproduce the issue by (hotplug)
offlining memory used to back huge pages.  A simple testcase can
reproduce the problem by creating a shared PMD mapping (note that
this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on
x86)), and using migrate_pages() to migrate process pages between
nodes while continually writing to the huge pages being migrated.

To fix, have the try_to_unmap_one routine check for huge PMD sharing
by calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a
shared mapping it will be 'unshared' which removes the page table
entry and drops the reference on the PMD page.  After this, flush
caches and TLB.

mmu notifiers are called before locking page tables, but we can not
be sure of PMD sharing until page tables are locked.  Therefore,
check for the possibility of PMD sharing before locking so that
notifiers can prepare for the worst possible case.

Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h | 14 ++++++++++++++
 mm/hugetlb.c            | 40 +++++++++++++++++++++++++++++++++++++++
 mm/rmap.c               | 42 ++++++++++++++++++++++++++++++++++++++---
 3 files changed, 93 insertions(+), 3 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 36fa6a2a82e3..840b43a085cc 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -140,6 +140,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
+bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end);
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
 			      int write);
 struct page *follow_huge_pd(struct vm_area_struct *vma,
@@ -170,6 +172,18 @@ static inline unsigned long hugetlb_total_pages(void)
 	return 0;
 }
 
+static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr,
+					pte_t *ptep)
+{
+	return 0;
+}
+
+static inline bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end)
+{
+	return false;
+}
+
 #define follow_hugetlb_page(m,v,p,vs,a,b,i,w,n)	({ BUG(); 0; })
 #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
 #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3103099f64fd..fd155dc52117 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4555,6 +4555,9 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 
 	/*
 	 * check on proper vm_flags and page table alignment
+	 *
+	 * Note that this is the same check used in huge_pmd_sharing_possible.
+	 * If you change one, consider changing both.
 	 */
 	if (vma->vm_flags & VM_MAYSHARE &&
 	    vma->vm_start <= base && end <= vma->vm_end)
@@ -4562,6 +4565,43 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 	return false;
 }
 
+/*
+ * Determine if start,end range within vma could be mapped by shared pmd.
+ * If yes, adjust start and end to cover range associated with possible
+ * shared pmd mappings.
+ */
+bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end)
+{
+	unsigned long check_addr = *start;
+	bool ret = false;
+
+	if (!(vma->vm_flags & VM_MAYSHARE))
+		return ret;
+
+	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
+		unsigned long a_start = check_addr & PUD_MASK;
+		unsigned long a_end = a_start + PUD_SIZE;
+
+		/*
+		 * If sharing is possible, adjust start/end if necessary.
+		 *
+		 * Note that this is the same check used in vma_shareable.  If
+		 * you change one, consider changing both.
+		 */
+		if (vma->vm_start <= a_start && a_end <= vma->vm_end) {
+			if (a_start < *start)
+				*start = a_start;
+			if (a_end > *end)
+				*end = a_end;
+
+			ret = true;
+		}
+	}
+
+	return ret;
+}
+
 /*
  * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
  * and returns the corresponding pte. While this is not necessary for the
diff --git a/mm/rmap.c b/mm/rmap.c
index eb477809a5c0..8cf853a4b093 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	}
 
 	/*
-	 * We have to assume the worse case ie pmd for invalidation. Note that
-	 * the page can not be free in this function as call of try_to_unmap()
-	 * must hold a reference on the page.
+	 * For THP, we have to assume the worse case ie pmd for invalidation.
+	 * For hugetlb, it could be much worse if we need to do pud
+	 * invalidation in the case of pmd sharing.
+	 *
+	 * Note that the page can not be free in this function as call of
+	 * try_to_unmap() must hold a reference on the page.
 	 */
 	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
+	if (PageHuge(page)) {
+		/*
+		 * If sharing is possible, start and end will be adjusted
+		 * accordingly.
+		 */
+		(void)huge_pmd_sharing_possible(vma, &start, &end);
+	}
 	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
 
 	while (page_vma_mapped_walk(&pvmw)) {
@@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
 		address = pvmw.address;
 
+		if (PageHuge(page)) {
+			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
+				/*
+				 * huge_pmd_unshare unmapped an entire PMD
+				 * page.  There is no way of knowing exactly
+				 * which PMDs may be cached for this mm, so
+				 * we must flush them all.  start/end were
+				 * already adjusted above to cover this range.
+				 */
+				flush_cache_range(vma, start, end);
+				flush_tlb_range(vma, start, end);
+				mmu_notifier_invalidate_range(mm, start, end);
+
+				/*
+				 * The ref count of the PMD page was dropped
+				 * which is part of the way map counting
+				 * is done for shared PMDs.  Return 'true'
+				 * here.  When there is no other sharing,
+				 * huge_pmd_unshare returns false and we will
+				 * unmap the actual page and drop map count
+				 * to zero.
+				 */
+				page_vma_mapped_walk_done(&pvmw);
+				break;
+			}
+		}
 
 		if (IS_ENABLED(CONFIG_MIGRATION) &&
 		    (flags & TTU_MIGRATION) &&
-- 
2.17.1



* Re: [PATCH v3 2/2] hugetlb: take PMD sharing into account when flushing tlb/caches
  2018-08-21 20:59 ` [PATCH v3 2/2] hugetlb: take PMD sharing into account when flushing tlb/caches Mike Kravetz
@ 2018-08-21 23:07   ` kbuild test robot
  2018-08-22  1:20   ` kbuild test robot
  1 sibling, 0 replies; 21+ messages in thread
From: kbuild test robot @ 2018-08-21 23:07 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: kbuild-all, linux-mm, linux-kernel, Kirill A . Shutemov,
	Jérôme Glisse, Vlastimil Babka, Naoya Horiguchi,
	Davidlohr Bueso, Michal Hocko, Andrew Morton, Mike Kravetz

[-- Attachment #1: Type: text/plain, Size: 1553 bytes --]

Hi Mike,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18 next-20180821]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Mike-Kravetz/huge_pmd_unshare-migration-and-flushing/20180822-050255
config: powerpc-defconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.2.0 make.cross ARCH=powerpc 

All errors (new ones prefixed by >>):

   powerpc64-linux-gnu-ld: warning: orphan section `.gnu.hash' from `linker stubs' being placed in section `.gnu.hash'.
   mm/rmap.o: In function `.try_to_unmap_one':
   rmap.c:(.text+0x2468): undefined reference to `.huge_pmd_sharing_possible'
   mm/hugetlb.o: In function `.__unmap_hugepage_range':
>> (.text+0x6834): undefined reference to `.huge_pmd_sharing_possible'
   mm/hugetlb.o: In function `.unmap_hugepage_range':
   (.text+0x6de8): undefined reference to `.huge_pmd_sharing_possible'
   mm/hugetlb.o: In function `.hugetlb_change_protection':
   (.text+0x91bc): undefined reference to `.huge_pmd_sharing_possible'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 23879 bytes --]


* Re: [PATCH v3 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-21 20:59 ` [PATCH v3 1/2] mm: migration: fix migration of huge PMD shared pages Mike Kravetz
  2018-08-21 22:03   ` kbuild test robot
@ 2018-08-22  0:51   ` kbuild test robot
  2018-08-22  1:10     ` Mike Kravetz
  1 sibling, 1 reply; 21+ messages in thread
From: kbuild test robot @ 2018-08-22  0:51 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: kbuild-all, linux-mm, linux-kernel, Kirill A . Shutemov,
	Jérôme Glisse, Vlastimil Babka, Naoya Horiguchi,
	Davidlohr Bueso, Michal Hocko, Andrew Morton, Mike Kravetz,
	stable

[-- Attachment #1: Type: text/plain, Size: 1618 bytes --]

Hi Mike,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18 next-20180821]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Mike-Kravetz/huge_pmd_unshare-migration-and-flushing/20180822-050255
config: sparc64-allyesconfig (attached as .config)
compiler: sparc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.2.0 make.cross ARCH=sparc64 

All errors (new ones prefixed by >>):

   mm/rmap.o: In function `try_to_unmap_one':
>> rmap.c:(.text+0x2708): undefined reference to `huge_pmd_sharing_possible'
   `.exit.data' referenced in section `.exit.text' of drivers/tty/n_hdlc.o: defined in discarded section `.exit.data' of drivers/tty/n_hdlc.o
   `.exit.data' referenced in section `.exit.text' of drivers/tty/n_hdlc.o: defined in discarded section `.exit.data' of drivers/tty/n_hdlc.o
   `.exit.data' referenced in section `.exit.text' of drivers/tty/n_hdlc.o: defined in discarded section `.exit.data' of drivers/tty/n_hdlc.o
   `.exit.data' referenced in section `.exit.text' of drivers/tty/n_hdlc.o: defined in discarded section `.exit.data' of drivers/tty/n_hdlc.o

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 55174 bytes --]


* Re: [PATCH v3 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-22  0:51   ` kbuild test robot
@ 2018-08-22  1:10     ` Mike Kravetz
  2018-08-22 12:28       ` Michal Hocko
                         ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Mike Kravetz @ 2018-08-22  1:10 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, linux-mm, linux-kernel, Kirill A . Shutemov,
	Jérôme Glisse, Vlastimil Babka, Naoya Horiguchi,
	Davidlohr Bueso, Michal Hocko, Andrew Morton, stable

On 08/21/2018 05:51 PM, kbuild test robot wrote:
> Hi Mike,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on linus/master]
> [also build test ERROR on v4.18 next-20180821]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Mike-Kravetz/huge_pmd_unshare-migration-and-flushing/20180822-050255
> config: sparc64-allyesconfig (attached as .config)
> compiler: sparc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         GCC_VERSION=7.2.0 make.cross ARCH=sparc64 
> 

Ok, this should take care of all the build errors.  Stubs were needed for
both the !CONFIG_HUGETLB_PAGE and !CONFIG_ARCH_WANT_HUGE_PMD_SHARE
configurations.

From 2868cea3ff33fccd03da9188dde2ae5b619bf300 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Tue, 21 Aug 2018 10:50:17 -0700
Subject: [PATCH v5 1/2] mm: migration: fix migration of huge PMD shared pages

The page migration code employs try_to_unmap() to try and unmap the
source page.  This is accomplished by using rmap_walk to find all
vmas where the page is mapped.  This search stops when page mapcount
is zero.  For shared PMD huge pages, the page map count is always 1
no matter the number of mappings.  Shared mappings are tracked via
the reference count of the PMD page.  Therefore, try_to_unmap stops
prematurely and does not completely unmap all mappings of the source
page.

This problem can result in data corruption, as writes to the original
source page can happen after the contents of the page have been copied
to the target page.  Hence, data is lost.

This problem was originally seen as DB corruption of shared global
areas after a huge page was soft offlined due to ECC memory errors.
DB developers noticed they could reproduce the issue by (hotplug)
offlining memory used to back huge pages.  A simple testcase can
reproduce the problem by creating a shared PMD mapping (note that
this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on
x86)), and using migrate_pages() to migrate process pages between
nodes while continually writing to the huge pages being migrated.

To fix, have the try_to_unmap_one routine check for huge PMD sharing
by calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a
shared mapping it will be 'unshared' which removes the page table
entry and drops the reference on the PMD page.  After this, flush
caches and TLB.

mmu notifiers are called before locking page tables, but we can not
be sure of PMD sharing until page tables are locked.  Therefore,
check for the possibility of PMD sharing before locking so that
notifiers can prepare for the worst possible case.

Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h | 14 +++++++++++++
 mm/hugetlb.c            | 46 +++++++++++++++++++++++++++++++++++++++++
 mm/rmap.c               | 42 ++++++++++++++++++++++++++++++++++---
 3 files changed, 99 insertions(+), 3 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 36fa6a2a82e3..840b43a085cc 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -140,6 +140,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
+bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end);
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
 			      int write);
 struct page *follow_huge_pd(struct vm_area_struct *vma,
@@ -170,6 +172,18 @@ static inline unsigned long hugetlb_total_pages(void)
 	return 0;
 }
 
+static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr,
+					pte_t *ptep)
+{
+	return 0;
+}
+
+static inline bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end)
+{
+	return false;
+}
+
 #define follow_hugetlb_page(m,v,p,vs,a,b,i,w,n)	({ BUG(); 0; })
 #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
 #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3103099f64fd..f085019a4724 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4555,6 +4555,9 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 
 	/*
 	 * check on proper vm_flags and page table alignment
+	 *
+	 * Note that this is the same check used in huge_pmd_sharing_possible.
+	 * If you change one, consider changing both.
 	 */
 	if (vma->vm_flags & VM_MAYSHARE &&
 	    vma->vm_start <= base && end <= vma->vm_end)
@@ -4562,6 +4565,43 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 	return false;
 }
 
+/*
+ * Determine if start,end range within vma could be mapped by shared pmd.
+ * If yes, adjust start and end to cover range associated with possible
+ * shared pmd mappings.
+ */
+bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end)
+{
+	unsigned long check_addr = *start;
+	bool ret = false;
+
+	if (!(vma->vm_flags & VM_MAYSHARE))
+		return ret;
+
+	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
+		unsigned long a_start = check_addr & PUD_MASK;
+		unsigned long a_end = a_start + PUD_SIZE;
+
+		/*
+		 * If sharing is possible, adjust start/end if necessary.
+		 *
+		 * Note that this is the same check used in vma_shareable.  If
+		 * you change one, consider changing both.
+		 */
+		if (vma->vm_start <= a_start && a_end <= vma->vm_end) {
+			if (a_start < *start)
+				*start = a_start;
+			if (a_end > *end)
+				*end = a_end;
+
+			ret = true;
+		}
+	}
+
+	return ret;
+}
+
 /*
  * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
  * and returns the corresponding pte. While this is not necessary for the
@@ -4659,6 +4699,12 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 {
 	return 0;
 }
+
+bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end)
+{
+	return false;
+}
 #define want_pmd_share()	(0)
 #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
 
diff --git a/mm/rmap.c b/mm/rmap.c
index eb477809a5c0..8cf853a4b093 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	}
 
 	/*
-	 * We have to assume the worse case ie pmd for invalidation. Note that
-	 * the page can not be free in this function as call of try_to_unmap()
-	 * must hold a reference on the page.
+	 * For THP, we have to assume the worse case ie pmd for invalidation.
+	 * For hugetlb, it could be much worse if we need to do pud
+	 * invalidation in the case of pmd sharing.
+	 *
+	 * Note that the page can not be free in this function as call of
+	 * try_to_unmap() must hold a reference on the page.
 	 */
 	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
+	if (PageHuge(page)) {
+		/*
+		 * If sharing is possible, start and end will be adjusted
+		 * accordingly.
+		 */
+		(void)huge_pmd_sharing_possible(vma, &start, &end);
+	}
 	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
 
 	while (page_vma_mapped_walk(&pvmw)) {
@@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
 		address = pvmw.address;
 
+		if (PageHuge(page)) {
+			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
+				/*
+				 * huge_pmd_unshare unmapped an entire PMD
+				 * page.  There is no way of knowing exactly
+				 * which PMDs may be cached for this mm, so
+				 * we must flush them all.  start/end were
+				 * already adjusted above to cover this range.
+				 */
+				flush_cache_range(vma, start, end);
+				flush_tlb_range(vma, start, end);
+				mmu_notifier_invalidate_range(mm, start, end);
+
+				/*
+				 * The ref count of the PMD page was dropped
+				 * which is part of the way map counting
+				 * is done for shared PMDs.  Return 'true'
+				 * here.  When there is no other sharing,
+				 * huge_pmd_unshare returns false and we will
+				 * unmap the actual page and drop map count
+				 * to zero.
+				 */
+				page_vma_mapped_walk_done(&pvmw);
+				break;
+			}
+		}
 
 		if (IS_ENABLED(CONFIG_MIGRATION) &&
 		    (flags & TTU_MIGRATION) &&
-- 
2.17.1



* Re: [PATCH v3 2/2] hugetlb: take PMD sharing into account when flushing tlb/caches
  2018-08-21 20:59 ` [PATCH v3 2/2] hugetlb: take PMD sharing into account when flushing tlb/caches Mike Kravetz
  2018-08-21 23:07   ` kbuild test robot
@ 2018-08-22  1:20   ` kbuild test robot
  1 sibling, 0 replies; 21+ messages in thread
From: kbuild test robot @ 2018-08-22  1:20 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: kbuild-all, linux-mm, linux-kernel, Kirill A . Shutemov,
	Jérôme Glisse, Vlastimil Babka, Naoya Horiguchi,
	Davidlohr Bueso, Michal Hocko, Andrew Morton, Mike Kravetz

[-- Attachment #1: Type: text/plain, Size: 2024 bytes --]

Hi Mike,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18 next-20180821]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Mike-Kravetz/huge_pmd_unshare-migration-and-flushing/20180822-050255
config: sparc64-allyesconfig (attached as .config)
compiler: sparc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.2.0 make.cross ARCH=sparc64 

All errors (new ones prefixed by >>):

   mm/rmap.o: In function `try_to_unmap_one':
   rmap.c:(.text+0x2708): undefined reference to `huge_pmd_sharing_possible'
   mm/hugetlb.o: In function `__unmap_hugepage_range':
>> hugetlb.c:(.text+0x5530): undefined reference to `huge_pmd_sharing_possible'
   mm/hugetlb.o: In function `unmap_hugepage_range':
   hugetlb.c:(.text+0x57bc): undefined reference to `huge_pmd_sharing_possible'
   mm/hugetlb.o: In function `hugetlb_change_protection':
   hugetlb.c:(.text+0x7714): undefined reference to `huge_pmd_sharing_possible'
   `.exit.data' referenced in section `.exit.text' of drivers/tty/n_hdlc.o: defined in discarded section `.exit.data' of drivers/tty/n_hdlc.o
   `.exit.data' referenced in section `.exit.text' of drivers/tty/n_hdlc.o: defined in discarded section `.exit.data' of drivers/tty/n_hdlc.o
   `.exit.data' referenced in section `.exit.text' of drivers/tty/n_hdlc.o: defined in discarded section `.exit.data' of drivers/tty/n_hdlc.o
   `.exit.data' referenced in section `.exit.text' of drivers/tty/n_hdlc.o: defined in discarded section `.exit.data' of drivers/tty/n_hdlc.o

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 55174 bytes --]


* Re: [PATCH v3 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-22  1:10     ` Mike Kravetz
@ 2018-08-22 12:28       ` Michal Hocko
  2018-08-22 16:48         ` Mike Kravetz
  2018-08-22 21:05       ` Kirill A. Shutemov
  2018-08-23 12:48       ` Michal Hocko
  2 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2018-08-22 12:28 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: kbuild test robot, kbuild-all, linux-mm, linux-kernel,
	Kirill A . Shutemov, Jérôme Glisse, Vlastimil Babka,
	Naoya Horiguchi, Davidlohr Bueso, Andrew Morton, stable

On Tue 21-08-18 18:10:42, Mike Kravetz wrote:
[...]
> diff --git a/mm/rmap.c b/mm/rmap.c
> index eb477809a5c0..8cf853a4b093 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	}
>  
>  	/*
> -	 * We have to assume the worse case ie pmd for invalidation. Note that
> -	 * the page can not be free in this function as call of try_to_unmap()
> -	 * must hold a reference on the page.
> +	 * For THP, we have to assume the worse case ie pmd for invalidation.
> +	 * For hugetlb, it could be much worse if we need to do pud
> +	 * invalidation in the case of pmd sharing.
> +	 *
> +	 * Note that the page can not be free in this function as call of
> +	 * try_to_unmap() must hold a reference on the page.
>  	 */
>  	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
> +	if (PageHuge(page)) {
> +		/*
> +		 * If sharing is possible, start and end will be adjusted
> +		 * accordingly.
> +		 */
> +		(void)huge_pmd_sharing_possible(vma, &start, &end);
> +	}
>  	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);

I do not get this part. Why don't we simply unconditionally invalidate
the whole huge page range?

>  
>  	while (page_vma_mapped_walk(&pvmw)) {
> @@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
>  		address = pvmw.address;
>  
> +		if (PageHuge(page)) {
> +			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {

huge_pmd_unshare is documented to require a pte lock. Where do we take
it?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v3 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-22 12:28       ` Michal Hocko
@ 2018-08-22 16:48         ` Mike Kravetz
  2018-08-23  7:30           ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Mike Kravetz @ 2018-08-22 16:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kbuild test robot, kbuild-all, linux-mm, linux-kernel,
	Kirill A . Shutemov, Jérôme Glisse, Vlastimil Babka,
	Naoya Horiguchi, Davidlohr Bueso, Andrew Morton, stable

On 08/22/2018 05:28 AM, Michal Hocko wrote:
> On Tue 21-08-18 18:10:42, Mike Kravetz wrote:
> [...]
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index eb477809a5c0..8cf853a4b093 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>  	}
>>  
>>  	/*
>> -	 * We have to assume the worse case ie pmd for invalidation. Note that
>> -	 * the page can not be free in this function as call of try_to_unmap()
>> -	 * must hold a reference on the page.
>> +	 * For THP, we have to assume the worse case ie pmd for invalidation.
>> +	 * For hugetlb, it could be much worse if we need to do pud
>> +	 * invalidation in the case of pmd sharing.
>> +	 *
>> +	 * Note that the page can not be free in this function as call of
>> +	 * try_to_unmap() must hold a reference on the page.
>>  	 */
>>  	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
>> +	if (PageHuge(page)) {
>> +		/*
>> +		 * If sharing is possible, start and end will be adjusted
>> +		 * accordingly.
>> +		 */
>> +		(void)huge_pmd_sharing_possible(vma, &start, &end);
>> +	}
>>  	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> 
> I do not get this part. Why don't we simply unconditionally invalidate
> the whole huge page range?

In this routine, we are only unmapping a single page.  The existing code
limits the invalidate range to that page size: 4K or 2M.  With shared
PMDs, we have the possibility of unmapping a PUD_SIZE area: 1G.  I don't
think we want to unconditionally invalidate 1G.  Is that what you are asking?

I do not know how often PMD sharing is exercised.  It certainly is used by
DBs for large shared areas.  I suspect it is less frequent than hugetlb pages
in general, and certainly less frequent than THP or base pages.
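
For concreteness, the arithmetic behind those numbers (x86-64, 2MB hugetlb
pages) is:

	/*
	 * Existing worst case, one compound page:
	 *	end - start = PAGE_SIZE << compound_order(page) = 4KB << 9 = 2MB
	 *
	 * With a shared PMD, huge_pmd_unshare() clears a PUD entry, so the
	 * flush/invalidate has to cover the whole aligned PUD range:
	 *	end - start = PUD_SIZE = 1GB, i.e. 512 times larger
	 */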

>>  
>>  	while (page_vma_mapped_walk(&pvmw)) {
>> @@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>  		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
>>  		address = pvmw.address;
>>  
>> +		if (PageHuge(page)) {
>> +			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
> 
> huge_pmd_unshare is documented to require a pte lock. Where do we take
> it?

It is somewhat hidden, but we are in the loop:

	while (page_vma_mapped_walk(&pvmw)) {

The routine page_vma_mapped_walk will acquire the lock, and it correctly
checks for huge pages and uses huge_pte_lockptr().

page_vma_mapped_walk_done() will release the lock.
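
In outline (abridged from the try_to_unmap_one() hunk in the patch; the
'...' elides the non-hugetlb cases):

	while (page_vma_mapped_walk(&pvmw)) {
		/*
		 * pvmw.ptl is held here; for hugetlb pages it comes from
		 * huge_pte_lockptr(), which satisfies huge_pmd_unshare's
		 * locking requirement.
		 */
		...
		if (PageHuge(page) && huge_pmd_unshare(mm, &address, pvmw.pte)) {
			flush_cache_range(vma, start, end);
			flush_tlb_range(vma, start, end);
			mmu_notifier_invalidate_range(mm, start, end);
			page_vma_mapped_walk_done(&pvmw);	/* drops the lock */
			break;
		}
		...
	}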
-- 
Mike Kravetz


* Re: [PATCH v3 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-22  1:10     ` Mike Kravetz
  2018-08-22 12:28       ` Michal Hocko
@ 2018-08-22 21:05       ` Kirill A. Shutemov
  2018-08-22 21:48         ` Mike Kravetz
  2018-08-23 12:48       ` Michal Hocko
  2 siblings, 1 reply; 21+ messages in thread
From: Kirill A. Shutemov @ 2018-08-22 21:05 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: kbuild test robot, kbuild-all, linux-mm, linux-kernel,
	Kirill A . Shutemov, Jérôme Glisse, Vlastimil Babka,
	Naoya Horiguchi, Davidlohr Bueso, Michal Hocko, Andrew Morton,
	stable

On Tue, Aug 21, 2018 at 06:10:42PM -0700, Mike Kravetz wrote:
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3103099f64fd..f085019a4724 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4555,6 +4555,9 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>  
>  	/*
>  	 * check on proper vm_flags and page table alignment
> +	 *
> +	 * Note that this is the same check used in huge_pmd_sharing_possible.
> +	 * If you change one, consider changing both.

Should we have a helper to isolate the check in one place?

>  	 */
>  	if (vma->vm_flags & VM_MAYSHARE &&
>  	    vma->vm_start <= base && end <= vma->vm_end)
> @@ -4562,6 +4565,43 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>  	return false;
>  }
>  
> +/*
> + * Determine if start,end range within vma could be mapped by shared pmd.
> + * If yes, adjust start and end to cover range associated with possible
> + * shared pmd mappings.
> + */
> +bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end)
> +{
> +	unsigned long check_addr = *start;
> +	bool ret = false;
> +
> +	if (!(vma->vm_flags & VM_MAYSHARE))
> +		return ret;

Do we ever use the return value? I don't see it.

And in that case the function name does not really work...

> +	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
> +		unsigned long a_start = check_addr & PUD_MASK;
> +		unsigned long a_end = a_start + PUD_SIZE;
> +
> +		/*
> +		 * If sharing is possible, adjust start/end if necessary.
> +		 *
> +		 * Note that this is the same check used in vma_shareable.  If
> +		 * you change one, consider changing both.
> +		 */
> +		if (vma->vm_start <= a_start && a_end <= vma->vm_end) {
> +			if (a_start < *start)
> +				*start = a_start;
> +			if (a_end > *end)
> +				*end = a_end;
> +
> +			ret = true;
> +		}
> +	}
> +
> +	return ret;
> +}
> +

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-22 21:05       ` Kirill A. Shutemov
@ 2018-08-22 21:48         ` Mike Kravetz
  0 siblings, 0 replies; 21+ messages in thread
From: Mike Kravetz @ 2018-08-22 21:48 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: kbuild test robot, kbuild-all, linux-mm, linux-kernel,
	Kirill A . Shutemov, Jérôme Glisse, Vlastimil Babka,
	Naoya Horiguchi, Davidlohr Bueso, Michal Hocko, Andrew Morton,
	stable

On 08/22/2018 02:05 PM, Kirill A. Shutemov wrote:
> On Tue, Aug 21, 2018 at 06:10:42PM -0700, Mike Kravetz wrote:
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 3103099f64fd..f085019a4724 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -4555,6 +4555,9 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>>  
>>  	/*
>>  	 * check on proper vm_flags and page table alignment
>> +	 *
>> +	 * Note that this is the same check used in huge_pmd_sharing_possible.
>> +	 * If you change one, consider changing both.
> 
> Should we have a helper to isolate the check in one place?
> 

Yes, I will create one.  Most likely just a #define.
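
Something along these lines, perhaps (name and exact form made up here, just
to show the idea of isolating the check in one place):

/*
 * Hypothetical helper: the "PUD-aligned range sits fully inside a
 * VM_MAYSHARE vma" test that vma_shareable() and
 * huge_pmd_sharing_possible() currently duplicate.
 */
#define vma_pmd_shareable(vma, base, end)			\
	(((vma)->vm_flags & VM_MAYSHARE) &&			\
	 (vma)->vm_start <= (base) && (end) <= (vma)->vm_end)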

>>  	 */
>>  	if (vma->vm_flags & VM_MAYSHARE &&
>>  	    vma->vm_start <= base && end <= vma->vm_end)
>> @@ -4562,6 +4565,43 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>>  	return false;
>>  }
>>  
>> +/*
>> + * Determine if start,end range within vma could be mapped by shared pmd.
>> + * If yes, adjust start and end to cover range associated with possible
>> + * shared pmd mappings.
>> + */
>> +bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
>> +				unsigned long *start, unsigned long *end)
>> +{
>> +	unsigned long check_addr = *start;
>> +	bool ret = false;
>> +
>> +	if (!(vma->vm_flags & VM_MAYSHARE))
>> +		return ret;
> 
> Do we ever use the return value? I don't see it.
> 
> And in that case the function name does not really work...

You are correct.  None of the code uses the return value.  I initially
thought some caller would use it.  But every caller wants/needs to
adjust the range if sharing is possible.  This is a really long name
but how about:

void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
				unsigned long *start, unsigned long *end)

I'm open to other names and will update patch with suggestions.
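
For what it's worth, with the void variant the rmap.c call site would then
read simply (sketch, assuming the name above):

	if (PageHuge(page)) {
		/*
		 * If sharing is possible, start and end will be adjusted
		 * accordingly; nobody needs a return value.
		 */
		adjust_range_if_pmd_sharing_possible(vma, &start, &end);
	}
	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);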
-- 
Mike Kravetz

> 
>> +	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
>> +		unsigned long a_start = check_addr & PUD_MASK;
>> +		unsigned long a_end = a_start + PUD_SIZE;
>> +
>> +		/*
>> +		 * If sharing is possible, adjust start/end if necessary.
>> +		 *
>> +		 * Note that this is the same check used in vma_shareable.  If
>> +		 * you change one, consider changing both.
>> +		 */
>> +		if (vma->vm_start <= a_start && a_end <= vma->vm_end) {
>> +			if (a_start < *start)
>> +				*start = a_start;
>> +			if (a_end > *end)
>> +				*end = a_end;
>> +
>> +			ret = true;
>> +		}
>> +	}
>> +
>> +	return ret;
>> +}
>> +

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-22 16:48         ` Mike Kravetz
@ 2018-08-23  7:30           ` Michal Hocko
  2018-08-23  8:21             ` Kirill A. Shutemov
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2018-08-23  7:30 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: kbuild test robot, kbuild-all, linux-mm, linux-kernel,
	Kirill A . Shutemov, Jérôme Glisse, Vlastimil Babka,
	Naoya Horiguchi, Davidlohr Bueso, Andrew Morton, stable

On Wed 22-08-18 09:48:16, Mike Kravetz wrote:
> On 08/22/2018 05:28 AM, Michal Hocko wrote:
> > On Tue 21-08-18 18:10:42, Mike Kravetz wrote:
> > [...]
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index eb477809a5c0..8cf853a4b093 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>  	}
> >>  
> >>  	/*
> >> -	 * We have to assume the worse case ie pmd for invalidation. Note that
> >> -	 * the page can not be free in this function as call of try_to_unmap()
> >> -	 * must hold a reference on the page.
> >> +	 * For THP, we have to assume the worse case ie pmd for invalidation.
> >> +	 * For hugetlb, it could be much worse if we need to do pud
> >> +	 * invalidation in the case of pmd sharing.
> >> +	 *
> >> +	 * Note that the page can not be free in this function as call of
> >> +	 * try_to_unmap() must hold a reference on the page.
> >>  	 */
> >>  	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
> >> +	if (PageHuge(page)) {
> >> +		/*
> >> +		 * If sharing is possible, start and end will be adjusted
> >> +		 * accordingly.
> >> +		 */
> >> +		(void)huge_pmd_sharing_possible(vma, &start, &end);
> >> +	}
> >>  	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> > 
> > I do not get this part. Why don't we simply unconditionally invalidate
> > the whole huge page range?
> 
> In this routine, we are only unmapping a single page.  The existing code
> is limiting the invalidate range to that page size: 4K or 2M.  With shared
> PMDs, we have the possibility of unmapping a PUD_SIZE area: 1G.  I don't
> think we want to unconditionally invalidate 1G.  Is that what you are asking?

But we know that huge_pmd_unshare unmapped a shared pte so we know when
to flush 2MB or 1GB. I really do not like how huge_pmd_sharing_possible
a) duplicates some checks and b) it updates start/stop out of line.

> I do not know how often PMD sharing is exercised.  It certainly is used by
> DBs for large shared areas.  I suspect it is less frequent than hugetlb pages
> in general, and certainly less frequent than THP or base pages.
> 
> >>  
> >>  	while (page_vma_mapped_walk(&pvmw)) {
> >> @@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>  		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
> >>  		address = pvmw.address;
> >>  
> >> +		if (PageHuge(page)) {
> >> +			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
> > 
> > huge_pmd_unshare is documented to require a pte lock. Where do we take
> > it?
> 
> It is somewhat hidden, but we are in the loop:
> 
> 	while (page_vma_mapped_walk(&pvmw)) {
> 
> The routine page_vma_mapped_walk will acquire the lock, and it correctly
> checks for huge pages and uses huge_pte_lockptr().
> 
> page_vma_mapped_walk_done() will release the lock.

OK, I can see it now. Thanks for the clarification. page_vma_mapped_walk
is quite hard to follow.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-23  7:30           ` Michal Hocko
@ 2018-08-23  8:21             ` Kirill A. Shutemov
  2018-08-23 10:33               ` Michal Hocko
  2018-08-23 16:45               ` Mike Kravetz
  0 siblings, 2 replies; 21+ messages in thread
From: Kirill A. Shutemov @ 2018-08-23  8:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, kbuild test robot, kbuild-all, linux-mm,
	linux-kernel, Kirill A . Shutemov, Jérôme Glisse,
	Vlastimil Babka, Naoya Horiguchi, Davidlohr Bueso, Andrew Morton,
	stable

On Thu, Aug 23, 2018 at 09:30:35AM +0200, Michal Hocko wrote:
> On Wed 22-08-18 09:48:16, Mike Kravetz wrote:
> > On 08/22/2018 05:28 AM, Michal Hocko wrote:
> > > On Tue 21-08-18 18:10:42, Mike Kravetz wrote:
> > > [...]
> > >> diff --git a/mm/rmap.c b/mm/rmap.c
> > >> index eb477809a5c0..8cf853a4b093 100644
> > >> --- a/mm/rmap.c
> > >> +++ b/mm/rmap.c
> > >> @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > >>  	}
> > >>  
> > >>  	/*
> > >> -	 * We have to assume the worse case ie pmd for invalidation. Note that
> > >> -	 * the page can not be free in this function as call of try_to_unmap()
> > >> -	 * must hold a reference on the page.
> > >> +	 * For THP, we have to assume the worse case ie pmd for invalidation.
> > >> +	 * For hugetlb, it could be much worse if we need to do pud
> > >> +	 * invalidation in the case of pmd sharing.
> > >> +	 *
> > >> +	 * Note that the page can not be free in this function as call of
> > >> +	 * try_to_unmap() must hold a reference on the page.
> > >>  	 */
> > >>  	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
> > >> +	if (PageHuge(page)) {
> > >> +		/*
> > >> +		 * If sharing is possible, start and end will be adjusted
> > >> +		 * accordingly.
> > >> +		 */
> > >> +		(void)huge_pmd_sharing_possible(vma, &start, &end);
> > >> +	}
> > >>  	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> > > 
> > > I do not get this part. Why don't we simply unconditionally invalidate
> > > the whole huge page range?
> > 
> > In this routine, we are only unmapping a single page.  The existing code
> > is limiting the invalidate range to that page size: 4K or 2M.  With shared
> > PMDs, we have the possibility of unmapping a PUD_SIZE area: 1G.  I don't
> > think we want to unconditionally invalidate 1G.  Is that what you are asking?
> 
> But we know that huge_pmd_unshare unmapped a shared pte so we know when
> to flush 2MB or 1GB. I really do not like how huge_pmd_sharing_possible
> a) duplicates some checks and b) it updates start/stop out of line.

My reading on this is that mmu_notifier_invalidate_range_start() has to be
called from sleepable context on the full range that *can* be invalidated
before the matching mmu_notifier_invalidate_range_end().

In this case huge_pmd_unshare() may unmap an aligned PUD_SIZE area around the
PMD page, which effectively enlarges the range that has to be covered by
mmu_notifier_invalidate_range_start(). We cannot yet know whether there are any
shared page tables in the range, so we need to go with the worst case
scenario.

I don't see a conceptually better solution than what is proposed.
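
Spelled out as a skeleton, the ordering constraint looks like this (sketch
only, not a hunk from the patch):

	/* widen to the worst case before anything else */
	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
	if (PageHuge(page))
		(void)huge_pmd_sharing_possible(vma, &start, &end);

	/* sleepable context, no pte lock held yet */
	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);

	while (page_vma_mapped_walk(&pvmw)) {	/* takes the pte lock */
		/*
		 * Only here, under the lock, do we learn whether the pmd was
		 * really shared; the pre-announced worst-case range already
		 * covers the PUD_SIZE area huge_pmd_unshare() may tear down.
		 */
	}

	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);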

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-23  8:21             ` Kirill A. Shutemov
@ 2018-08-23 10:33               ` Michal Hocko
  2018-08-23 16:45               ` Mike Kravetz
  1 sibling, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2018-08-23 10:33 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Mike Kravetz, kbuild test robot, kbuild-all, linux-mm,
	linux-kernel, Kirill A . Shutemov, Jérôme Glisse,
	Vlastimil Babka, Naoya Horiguchi, Davidlohr Bueso, Andrew Morton,
	stable

On Thu 23-08-18 11:21:12, Kirill A. Shutemov wrote:
> On Thu, Aug 23, 2018 at 09:30:35AM +0200, Michal Hocko wrote:
> > On Wed 22-08-18 09:48:16, Mike Kravetz wrote:
> > > On 08/22/2018 05:28 AM, Michal Hocko wrote:
> > > > On Tue 21-08-18 18:10:42, Mike Kravetz wrote:
> > > > [...]
> > > >> diff --git a/mm/rmap.c b/mm/rmap.c
> > > >> index eb477809a5c0..8cf853a4b093 100644
> > > >> --- a/mm/rmap.c
> > > >> +++ b/mm/rmap.c
> > > >> @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > >>  	}
> > > >>  
> > > >>  	/*
> > > >> -	 * We have to assume the worse case ie pmd for invalidation. Note that
> > > >> -	 * the page can not be free in this function as call of try_to_unmap()
> > > >> -	 * must hold a reference on the page.
> > > >> +	 * For THP, we have to assume the worse case ie pmd for invalidation.
> > > >> +	 * For hugetlb, it could be much worse if we need to do pud
> > > >> +	 * invalidation in the case of pmd sharing.
> > > >> +	 *
> > > >> +	 * Note that the page can not be free in this function as call of
> > > >> +	 * try_to_unmap() must hold a reference on the page.
> > > >>  	 */
> > > >>  	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
> > > >> +	if (PageHuge(page)) {
> > > >> +		/*
> > > >> +		 * If sharing is possible, start and end will be adjusted
> > > >> +		 * accordingly.
> > > >> +		 */
> > > >> +		(void)huge_pmd_sharing_possible(vma, &start, &end);
> > > >> +	}
> > > >>  	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> > > > 
> > > > I do not get this part. Why don't we simply unconditionally invalidate
> > > > the whole huge page range?
> > > 
> > > In this routine, we are only unmapping a single page.  The existing code
> > > is limiting the invalidate range to that page size: 4K or 2M.  With shared
> > > PMDs, we have the possibility of unmapping a PUD_SIZE area: 1G.  I don't
> > > think we want to unconditionally invalidate 1G.  Is that what you are asking?
> > 
> > But we know that huge_pmd_unshare unmapped a shared pte so we know when
> > to flush 2MB or 1GB. I really do not like how huge_pmd_sharing_possible
> > a) duplicates some checks and b) it updates start/stop out of line.
> 
> My reading on this is that mmu_notifier_invalidate_range_start() has to be
> called from sleepable context on the full range that *can* be invalidated
> before the matching mmu_notifier_invalidate_range_end().
> 
> In this case huge_pmd_unshare() may unmap an aligned PUD_SIZE area around the
> PMD page, which effectively enlarges the range that has to be covered by
> mmu_notifier_invalidate_range_start(). We cannot yet know whether there are any
> shared page tables in the range, so we need to go with the worst case
> scenario.
> 
> I don't see a conceptually better solution than what is proposed.

I was thinking we would just pull PageHuge outside of the
page_vma_mapped_walk. I thought it would look much more straightforward,
but I tried to put something together and it grew into ugly code
as well. So going Mike's way might be a better option after all.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-22  1:10     ` Mike Kravetz
  2018-08-22 12:28       ` Michal Hocko
  2018-08-22 21:05       ` Kirill A. Shutemov
@ 2018-08-23 12:48       ` Michal Hocko
  2018-08-23 17:01         ` Mike Kravetz
  2 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2018-08-23 12:48 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: kbuild test robot, kbuild-all, linux-mm, linux-kernel,
	Kirill A . Shutemov, Jérôme Glisse, Vlastimil Babka,
	Naoya Horiguchi, Davidlohr Bueso, Andrew Morton, stable

On Tue 21-08-18 18:10:42, Mike Kravetz wrote:
[...]

OK, after burning myself when trying to be clever here it seems like
your proposed solution is indeed simpler.

> +bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end)
> +{
> +	unsigned long check_addr = *start;
> +	bool ret = false;
> +
> +	if (!(vma->vm_flags & VM_MAYSHARE))
> +		return ret;
> +
> +	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
> +		unsigned long a_start = check_addr & PUD_MASK;
> +		unsigned long a_end = a_start + PUD_SIZE;

I guess this should be rather in HPAGE_SIZE * PTRS_PER_PTE units as
huge_pmd_unshare does.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-23  8:21             ` Kirill A. Shutemov
  2018-08-23 10:33               ` Michal Hocko
@ 2018-08-23 16:45               ` Mike Kravetz
  1 sibling, 0 replies; 21+ messages in thread
From: Mike Kravetz @ 2018-08-23 16:45 UTC (permalink / raw)
  To: Kirill A. Shutemov, Michal Hocko
  Cc: kbuild test robot, kbuild-all, linux-mm, linux-kernel,
	Kirill A . Shutemov, Jérôme Glisse, Vlastimil Babka,
	Naoya Horiguchi, Davidlohr Bueso, Andrew Morton, stable

On 08/23/2018 01:21 AM, Kirill A. Shutemov wrote:
> On Thu, Aug 23, 2018 at 09:30:35AM +0200, Michal Hocko wrote:
>> On Wed 22-08-18 09:48:16, Mike Kravetz wrote:
>>> On 08/22/2018 05:28 AM, Michal Hocko wrote:
>>>> On Tue 21-08-18 18:10:42, Mike Kravetz wrote:
>>>> [...]
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index eb477809a5c0..8cf853a4b093 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>>>  	}
>>>>>  
>>>>>  	/*
>>>>> -	 * We have to assume the worse case ie pmd for invalidation. Note that
>>>>> -	 * the page can not be free in this function as call of try_to_unmap()
>>>>> -	 * must hold a reference on the page.
>>>>> +	 * For THP, we have to assume the worse case ie pmd for invalidation.
>>>>> +	 * For hugetlb, it could be much worse if we need to do pud
>>>>> +	 * invalidation in the case of pmd sharing.
>>>>> +	 *
>>>>> +	 * Note that the page can not be free in this function as call of
>>>>> +	 * try_to_unmap() must hold a reference on the page.
>>>>>  	 */
>>>>>  	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
>>>>> +	if (PageHuge(page)) {
>>>>> +		/*
>>>>> +		 * If sharing is possible, start and end will be adjusted
>>>>> +		 * accordingly.
>>>>> +		 */
>>>>> +		(void)huge_pmd_sharing_possible(vma, &start, &end);
>>>>> +	}
>>>>>  	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
>>>>
>>>> I do not get this part. Why don't we simply unconditionally invalidate
>>>> the whole huge page range?
>>>
>>> In this routine, we are only unmapping a single page.  The existing code
>>> is limiting the invalidate range to that page size: 4K or 2M.  With shared
>>> PMDs, we have the possibility of unmapping a PUD_SIZE area: 1G.  I don't
>>> think we want to unconditionally invalidate 1G.  Is that what you are asking?
>>
>> But we know that huge_pmd_unshare unmapped a shared pte so we know when
>> to flush 2MB or 1GB. I really do not like how huge_pmd_sharing_possible
>> a) duplicates some checks and b) it updates start/stop out of line.
> 
> My reading on this is that mmu_notifier_invalidate_range_start() has to be
> called from sleepable context on the full range that *can* be invalidated
> before the matching mmu_notifier_invalidate_range_end().
> 
> In this case huge_pmd_unshare() may unmap an aligned PUD_SIZE area around the
> PMD page, which effectively enlarges the range that has to be covered by
> mmu_notifier_invalidate_range_start(). We cannot yet know whether there are any
> shared page tables in the range, so we need to go with the worst case
> scenario.

Yes, that is a good summary.  We cannot know for sure if there is PMD
sharing until we hold the page table lock.  So, we don't know if we should
invalidate/flush 2M or 1G.  Yet, we need to call
mmu_notifier_invalidate_range_start before taking the lock.  And, the notifiers
need to know the range of the worst possible case.  The best approach I came up
with is to adjust the range if sharing is 'possible'.

> 
> I don't see a conceptually better solution than what is proposed.
> 

I have updated the patches based on Kirill's previous comments and will send out
a new version later today.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-23 12:48       ` Michal Hocko
@ 2018-08-23 17:01         ` Mike Kravetz
  2018-08-23 17:56           ` Mike Kravetz
  0 siblings, 1 reply; 21+ messages in thread
From: Mike Kravetz @ 2018-08-23 17:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kbuild test robot, kbuild-all, linux-mm, linux-kernel,
	Kirill A . Shutemov, Jérôme Glisse, Vlastimil Babka,
	Naoya Horiguchi, Davidlohr Bueso, Andrew Morton, stable

On 08/23/2018 05:48 AM, Michal Hocko wrote:
> On Tue 21-08-18 18:10:42, Mike Kravetz wrote:
> [...]
> 
> OK, after burning myself when trying to be clever here it seems like
> your proposed solution is indeed simpler.
> 
>> +bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
>> +				unsigned long *start, unsigned long *end)
>> +{
>> +	unsigned long check_addr = *start;
>> +	bool ret = false;
>> +
>> +	if (!(vma->vm_flags & VM_MAYSHARE))
>> +		return ret;
>> +
>> +	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
>> +		unsigned long a_start = check_addr & PUD_MASK;
>> +		unsigned long a_end = a_start + PUD_SIZE;
> 
> I guess this should be rather in HPAGE_SIZE * PTRS_PER_PTE units as
> huge_pmd_unshare does.

Sure, I can do that.

However, I consider the statement making that calculation in huge_pmd_unshare
to be VERY ugly and confusing code.

	*addr = ALIGN(*addr, HPAGE_SIZE * PTRS_PER_PTE) - HPAGE_SIZE;

Note that it is adjusting the value of passed argument 'unsigned long *addr'.
This argument/value is part of a loop condition in all current callers of
huge_pmd_unshare.  For instance:

	for (; address < end; address += huge_page_size(h)) {

So, that calculation in huge_pmd_unshare gets the calling loop back to
the starting address of the unmapped range.  It even takes the loop increment
'huge_page_size(h)' into account.  That is why that ' - HPAGE_SIZE' is at
the end of the calculation.

ugly and confusing!  And on my list of things to clean up.
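
To make the interplay concrete, a worked example with made-up addresses and
x86_64 sizes (HPAGE_SIZE = 2M, PTRS_PER_PTE = 512, so the alignment unit is
HPAGE_SIZE * PTRS_PER_PTE = 1G):

	/*
	 * caller's loop is at address = 0x40000000 and huge_pmd_unshare()
	 * drops the shared PMD page covering [0x40000000, 0x80000000)
	 *
	 *   *addr = ALIGN(0x40000000, 1G) - HPAGE_SIZE
	 *         = 0x40000000 - 0x200000
	 *         = 0x3fe00000
	 *
	 * back in the caller:
	 *
	 *   address += huge_page_size(h);	-> 0x40000000
	 *
	 * i.e. the next iteration resumes at the start of the range that was
	 * just unshared, which is exactly what the ' - HPAGE_SIZE' is for.
	 */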
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-23 17:01         ` Mike Kravetz
@ 2018-08-23 17:56           ` Mike Kravetz
  2018-08-23 19:36             ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Mike Kravetz @ 2018-08-23 17:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kbuild test robot, kbuild-all, linux-mm, linux-kernel,
	Kirill A . Shutemov, Jérôme Glisse, Vlastimil Babka,
	Naoya Horiguchi, Davidlohr Bueso, Andrew Morton, stable

On 08/23/2018 10:01 AM, Mike Kravetz wrote:
> On 08/23/2018 05:48 AM, Michal Hocko wrote:
>> On Tue 21-08-18 18:10:42, Mike Kravetz wrote:
>> [...]
>>
>> OK, after burning myself when trying to be clever here it seems like
>> your proposed solution is indeed simpler.
>>
>>> +bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
>>> +				unsigned long *start, unsigned long *end)
>>> +{
>>> +	unsigned long check_addr = *start;
>>> +	bool ret = false;
>>> +
>>> +	if (!(vma->vm_flags & VM_MAYSHARE))
>>> +		return ret;
>>> +
>>> +	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
>>> +		unsigned long a_start = check_addr & PUD_MASK;
>>> +		unsigned long a_end = a_start + PUD_SIZE;
>>
>> I guess this should be rather in HPAGE_SIZE * PTRS_PER_PTE units as
>> huge_pmd_unshare does.
> 
> Sure, I can do that.

On second thought, this is more similar to vma_shareable() which uses
PUD_MASK and PUD_SIZE.  In fact Kirill asked me to put in a common helper
for this and vma_shareable.  So, I would prefer to leave it as PUD* unless
you feel strongly.

IMO, it would make more sense to change this in huge_pmd_unshare as PMD
sharing is pretty explicitly tied to PUD_SIZE.  But, that is a future cleanup.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-23 17:56           ` Mike Kravetz
@ 2018-08-23 19:36             ` Michal Hocko
  0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2018-08-23 19:36 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: kbuild test robot, kbuild-all, linux-mm, linux-kernel,
	Kirill A . Shutemov, Jérôme Glisse, Vlastimil Babka,
	Naoya Horiguchi, Davidlohr Bueso, Andrew Morton, stable

On Thu 23-08-18 10:56:30, Mike Kravetz wrote:
> On 08/23/2018 10:01 AM, Mike Kravetz wrote:
> > On 08/23/2018 05:48 AM, Michal Hocko wrote:
> >> On Tue 21-08-18 18:10:42, Mike Kravetz wrote:
> >> [...]
> >>
> >> OK, after burning myself when trying to be clever here it seems like
> >> your proposed solution is indeed simpler.
> >>
> >>> +bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
> >>> +				unsigned long *start, unsigned long *end)
> >>> +{
> >>> +	unsigned long check_addr = *start;
> >>> +	bool ret = false;
> >>> +
> >>> +	if (!(vma->vm_flags & VM_MAYSHARE))
> >>> +		return ret;
> >>> +
> >>> +	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
> >>> +		unsigned long a_start = check_addr & PUD_MASK;
> >>> +		unsigned long a_end = a_start + PUD_SIZE;
> >>
> >> I guess this should be rather in HPAGE_SIZE * PTRS_PER_PTE units as
> >> huge_pmd_unshare does.
> > 
> > Sure, I can do that.
> 
> On second thought, this is more similar to vma_shareable() which uses
> PUD_MASK and PUD_SIZE.  In fact Kirill asked me to put in a common helper
> for this and vma_shareable.  So, I would prefer to leave it as PUD* unless
> you feel strongly.

I don't
 
> IMO, it would make more sense to change this in huge_pmd_unshare as PMD
> sharing is pretty explicitly tied to PUD_SIZE.  But, that is a future cleanup.

Fair enough. I didn't realize we are PUD-explicit elsewhere. So if you
plan to update huge_pmd_unshare to be in sync, then no objections from me
at all. I merely wanted to stay in sync with huge_pmd_unshare because it is a
central point to look at wrt pmd sharing.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2018-08-23 19:36 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-21 20:59 [PATCH v3 0/2] huge_pmd_unshare migration and flushing Mike Kravetz
2018-08-21 20:59 ` [PATCH v3 1/2] mm: migration: fix migration of huge PMD shared pages Mike Kravetz
2018-08-21 22:03   ` kbuild test robot
2018-08-21 23:06     ` Mike Kravetz
2018-08-22  0:51   ` kbuild test robot
2018-08-22  1:10     ` Mike Kravetz
2018-08-22 12:28       ` Michal Hocko
2018-08-22 16:48         ` Mike Kravetz
2018-08-23  7:30           ` Michal Hocko
2018-08-23  8:21             ` Kirill A. Shutemov
2018-08-23 10:33               ` Michal Hocko
2018-08-23 16:45               ` Mike Kravetz
2018-08-22 21:05       ` Kirill A. Shutemov
2018-08-22 21:48         ` Mike Kravetz
2018-08-23 12:48       ` Michal Hocko
2018-08-23 17:01         ` Mike Kravetz
2018-08-23 17:56           ` Mike Kravetz
2018-08-23 19:36             ` Michal Hocko
2018-08-21 20:59 ` [PATCH v3 2/2] hugetlb: take PMD sharing into account when flushing tlb/caches Mike Kravetz
2018-08-21 23:07   ` kbuild test robot
2018-08-22  1:20   ` kbuild test robot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).