* [PATCH v6 0/2] huge_pmd_unshare migration and flushing
@ 2018-08-23 20:59 Mike Kravetz
  2018-08-23 20:59 ` [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages Mike Kravetz
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Mike Kravetz @ 2018-08-23 20:59 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Kirill A . Shutemov, Jérôme Glisse, Vlastimil Babka,
	Naoya Horiguchi, Davidlohr Bueso, Michal Hocko, Andrew Morton,
	Mike Kravetz

Correct a data corruption issue caused by improper handling of shared
huge PMDs during page migration.  This issue was observed in a customer
environment and can be recreated fairly easily with a test program.
Patch 0001 addresses this issue only and is copied to stable with the
intention that this will go to stable releases.  It has existed since
the addition of shared huge PMD support.

While considering the issue above, Kirill Shutemov noticed that other
callers of huge_pmd_unshare have potential issues with cache and TLB
flushing.  A separate patch (0002) takes advantage of the new routine
adjust_range_if_pmd_sharing_possible() to adjust flushing ranges in
the cases where huge PMD sharing is possible.  There is no copy to
stable for this patch as it has not been reported as an issue and was
discovered only via code inspection.

v5-v6:	Rename and update 'sharing possible' routine as suggested by
	Kirill.
v3-v5:  Address build errors if !CONFIG_HUGETLB_PAGE and
        !CONFIG_ARCH_WANT_HUGE_PMD_SHARE

Mike Kravetz (2):
  mm: migration: fix migration of huge PMD shared pages
  hugetlb: take PMD sharing into account when flushing tlb/caches

 include/linux/hugetlb.h | 14 +++++++
 mm/hugetlb.c            | 93 ++++++++++++++++++++++++++++++++++++-----
 mm/rmap.c               | 42 +++++++++++++++++--
 3 files changed, 135 insertions(+), 14 deletions(-)

-- 
2.17.1



* [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-23 20:59 [PATCH v6 0/2] huge_pmd_unshare migration and flushing Mike Kravetz
@ 2018-08-23 20:59 ` Mike Kravetz
  2018-08-24  2:59   ` Naoya Horiguchi
                     ` (2 more replies)
  2018-08-23 20:59 ` [PATCH v6 2/2] hugetlb: take PMD sharing into account when flushing tlb/caches Mike Kravetz
  2018-08-24 11:35 ` [PATCH v6 0/2] huge_pmd_unshare migration and flushing Kirill A. Shutemov
  2 siblings, 3 replies; 29+ messages in thread
From: Mike Kravetz @ 2018-08-23 20:59 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Kirill A . Shutemov, Jérôme Glisse, Vlastimil Babka,
	Naoya Horiguchi, Davidlohr Bueso, Michal Hocko, Andrew Morton,
	Mike Kravetz, stable

The page migration code employs try_to_unmap() to try and unmap the
source page.  This is accomplished by using rmap_walk to find all
vmas where the page is mapped.  This search stops when page mapcount
is zero.  For shared PMD huge pages, the page map count is always 1
no matter the number of mappings.  Shared mappings are tracked via
the reference count of the PMD page.  Therefore, try_to_unmap stops
prematurely and does not completely unmap all mappings of the source
page.

This problem can result in data corruption as writes to the original
source page can happen after contents of the page are copied to the
target page.  Hence, data is lost.

This problem was originally seen as DB corruption of shared global
areas after a huge page was soft offlined due to ECC memory errors.
DB developers noticed they could reproduce the issue by (hotplug)
offlining memory used to back huge pages.  A simple testcase can
reproduce the problem by creating a shared PMD mapping (note that
this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on
x86)), and using migrate_pages() to migrate process pages between
nodes while continually writing to the huge pages being migrated.
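
A minimal sketch of such a reproducer (illustrative only, not the actual
test program referenced above) could look roughly like the following.  It
assumes x86-64 where PUD_SIZE is 1GB, a 2MB hugetlbfs mount at
/dev/hugepages, at least two NUMA nodes (0 and 1), and the migrate_pages()
wrapper from libnuma's <numaif.h>; the file name, iteration count and the
assumption that mmap() returns a PUD_SIZE aligned address are all
placeholders:

#define _GNU_SOURCE
#include <fcntl.h>
#include <numaif.h>		/* migrate_pages() wrapper from libnuma */
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAP_LEN	(1UL << 30)	/* PUD_SIZE on x86-64 */

int main(void)
{
	unsigned long node0 = 1UL << 0, node1 = 1UL << 1;
	pid_t writer;
	char *map;
	int fd, i;

	/* illustrative path; assumes a 2MB hugetlbfs mount here */
	fd = open("/dev/hugepages/pmd_share_repro", O_CREAT | O_RDWR, 0600);
	if (fd < 0 || ftruncate(fd, MAP_LEN))
		return 1;

	/*
	 * PMD sharing needs a MAP_SHARED hugetlb mapping that is at least
	 * PUD_SIZE in size and PUD_SIZE aligned; a real test would force
	 * the alignment rather than assume mmap() provides it.
	 */
	map = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	memset(map, 0, MAP_LEN);	/* populate the parent's page tables */

	writer = fork();
	if (writer < 0)
		return 1;
	if (writer == 0) {
		/* child: keep writing to the huge pages being migrated */
		for (;;)
			memset(map, 0xab, MAP_LEN);
	}

	/* parent: bounce the writer's pages between nodes 0 and 1 */
	for (i = 0; i < 100; i++) {
		migrate_pages(writer, 8 * sizeof(unsigned long), &node0, &node1);
		migrate_pages(writer, 8 * sizeof(unsigned long), &node1, &node0);
	}

	kill(writer, SIGKILL);
	wait(NULL);
	return 0;
}

With the unfixed kernel, some of the child's writes can land in the old
source page after its contents have been copied to the target page, which
is exactly the data loss described above.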

To fix, have the try_to_unmap_one routine check for huge PMD sharing
by calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a
shared mapping it will be 'unshared' which removes the page table
entry and drops the reference on the PMD page.  After this, flush
caches and TLB.

mmu notifiers are called before locking page tables, but we can not
be sure of PMD sharing until page tables are locked.  Therefore,
check for the possibility of PMD sharing before locking so that
notifiers can prepare for the worst possible case.

Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h | 14 ++++++++++++++
 mm/hugetlb.c            | 40 +++++++++++++++++++++++++++++++++++++--
 mm/rmap.c               | 42 ++++++++++++++++++++++++++++++++++++++---
 3 files changed, 91 insertions(+), 5 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 36fa6a2a82e3..4ee95d8c8413 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -140,6 +140,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
+void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end);
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
 			      int write);
 struct page *follow_huge_pd(struct vm_area_struct *vma,
@@ -170,6 +172,18 @@ static inline unsigned long hugetlb_total_pages(void)
 	return 0;
 }
 
+static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr,
+					pte_t *ptep)
+{
+	return 0;
+}
+
+static inline void adjust_range_if_pmd_sharing_possible(
+				struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end)
+{
+}
+
 #define follow_hugetlb_page(m,v,p,vs,a,b,i,w,n)	({ BUG(); 0; })
 #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
 #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3103099f64fd..a73c5728e961 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4548,6 +4548,9 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
 	return saddr;
 }
 
+#define _range_in_vma(vma, start, end) \
+	((vma)->vm_start <= (start) && (end) <= (vma)->vm_end)
+
 static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 {
 	unsigned long base = addr & PUD_MASK;
@@ -4556,12 +4559,40 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 	/*
 	 * check on proper vm_flags and page table alignment
 	 */
-	if (vma->vm_flags & VM_MAYSHARE &&
-	    vma->vm_start <= base && end <= vma->vm_end)
+	if (vma->vm_flags & VM_MAYSHARE && _range_in_vma(vma, base, end))
 		return true;
 	return false;
 }
 
+/*
+ * Determine if start,end range within vma could be mapped by shared pmd.
+ * If yes, adjust start and end to cover range associated with possible
+ * shared pmd mappings.
+ */
+void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end)
+{
+	unsigned long check_addr = *start;
+
+	if (!(vma->vm_flags & VM_MAYSHARE))
+		return;
+
+	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
+		unsigned long a_start = check_addr & PUD_MASK;
+		unsigned long a_end = a_start + PUD_SIZE;
+
+		/*
+		 * If sharing is possible, adjust start/end if necessary.
+		 */
+		if (_range_in_vma(vma, a_start, a_end)) {
+			if (a_start < *start)
+				*start = a_start;
+			if (a_end > *end)
+				*end = a_end;
+		}
+	}
+}
+
 /*
  * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
  * and returns the corresponding pte. While this is not necessary for the
@@ -4659,6 +4690,11 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 {
 	return 0;
 }
+
+void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end)
+{
+}
 #define want_pmd_share()	(0)
 #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
 
diff --git a/mm/rmap.c b/mm/rmap.c
index eb477809a5c0..1e79fac3186b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	}
 
 	/*
-	 * We have to assume the worse case ie pmd for invalidation. Note that
-	 * the page can not be free in this function as call of try_to_unmap()
-	 * must hold a reference on the page.
+	 * For THP, we have to assume the worse case ie pmd for invalidation.
+	 * For hugetlb, it could be much worse if we need to do pud
+	 * invalidation in the case of pmd sharing.
+	 *
+	 * Note that the page can not be free in this function as call of
+	 * try_to_unmap() must hold a reference on the page.
 	 */
 	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
+	if (PageHuge(page)) {
+		/*
+		 * If sharing is possible, start and end will be adjusted
+		 * accordingly.
+		 */
+		adjust_range_if_pmd_sharing_possible(vma, &start, &end);
+	}
 	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
 
 	while (page_vma_mapped_walk(&pvmw)) {
@@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
 		address = pvmw.address;
 
+		if (PageHuge(page)) {
+			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
+				/*
+				 * huge_pmd_unshare unmapped an entire PMD
+				 * page.  There is no way of knowing exactly
+				 * which PMDs may be cached for this mm, so
+				 * we must flush them all.  start/end were
+				 * already adjusted above to cover this range.
+				 */
+				flush_cache_range(vma, start, end);
+				flush_tlb_range(vma, start, end);
+				mmu_notifier_invalidate_range(mm, start, end);
+
+				/*
+				 * The ref count of the PMD page was dropped
+				 * which is part of the way map counting
+				 * is done for shared PMDs.  Return 'true'
+				 * here.  When there is no other sharing,
+				 * huge_pmd_unshare returns false and we will
+				 * unmap the actual page and drop map count
+				 * to zero.
+				 */
+				page_vma_mapped_walk_done(&pvmw);
+				break;
+			}
+		}
 
 		if (IS_ENABLED(CONFIG_MIGRATION) &&
 		    (flags & TTU_MIGRATION) &&
-- 
2.17.1



* [PATCH v6 2/2] hugetlb: take PMD sharing into account when flushing tlb/caches
  2018-08-23 20:59 [PATCH v6 0/2] huge_pmd_unshare migration and flushing Mike Kravetz
  2018-08-23 20:59 ` [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages Mike Kravetz
@ 2018-08-23 20:59 ` Mike Kravetz
  2018-08-24  3:07   ` Naoya Horiguchi
  2018-08-24 11:35 ` [PATCH v6 0/2] huge_pmd_unshare migration and flushing Kirill A. Shutemov
  2 siblings, 1 reply; 29+ messages in thread
From: Mike Kravetz @ 2018-08-23 20:59 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Kirill A . Shutemov, Jérôme Glisse, Vlastimil Babka,
	Naoya Horiguchi, Davidlohr Bueso, Michal Hocko, Andrew Morton,
	Mike Kravetz

When fixing an issue with PMD sharing and migration, it was discovered
via code inspection that other callers of huge_pmd_unshare potentially
have an issue with cache and tlb flushing.

Use the routine adjust_range_if_pmd_sharing_possible() to calculate
worst case ranges for mmu notifiers.  Ensure that this range is flushed
if huge_pmd_unshare succeeds and unmaps a PUD_SUZE area.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 53 +++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 44 insertions(+), 9 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a73c5728e961..082cddf46b4f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3333,8 +3333,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	struct page *page;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
-	const unsigned long mmun_start = start;	/* For mmu_notifiers */
-	const unsigned long mmun_end   = end;	/* For mmu_notifiers */
+	unsigned long mmun_start = start;	/* For mmu_notifiers */
+	unsigned long mmun_end   = end;		/* For mmu_notifiers */
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
 	BUG_ON(start & ~huge_page_mask(h));
@@ -3346,6 +3346,11 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 */
 	tlb_remove_check_page_size_change(tlb, sz);
 	tlb_start_vma(tlb, vma);
+
+	/*
+	 * If sharing possible, alert mmu notifiers of worst case.
+	 */
+	adjust_range_if_pmd_sharing_possible(vma, &mmun_start, &mmun_end);
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	address = start;
 	for (; address < end; address += sz) {
@@ -3356,6 +3361,10 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		ptl = huge_pte_lock(h, mm, ptep);
 		if (huge_pmd_unshare(mm, &address, ptep)) {
 			spin_unlock(ptl);
+			/*
+			 * We just unmapped a page of PMDs by clearing a PUD.
+			 * The caller's TLB flush range should cover this area.
+			 */
 			continue;
 		}
 
@@ -3438,12 +3447,23 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 {
 	struct mm_struct *mm;
 	struct mmu_gather tlb;
+	unsigned long tlb_start = start;
+	unsigned long tlb_end = end;
+
+	/*
+	 * If shared PMDs were possibly used within this vma range, adjust
+	 * start/end for worst case tlb flushing.
+	 * Note that we can not be sure if PMDs are shared until we try to
+	 * unmap pages.  However, we want to make sure TLB flushing covers
+	 * the largest possible range.
+	 */
+	adjust_range_if_pmd_sharing_possible(vma, &tlb_start, &tlb_end);
 
 	mm = vma->vm_mm;
 
-	tlb_gather_mmu(&tlb, mm, start, end);
+	tlb_gather_mmu(&tlb, mm, tlb_start, tlb_end);
 	__unmap_hugepage_range(&tlb, vma, start, end, ref_page);
-	tlb_finish_mmu(&tlb, start, end);
+	tlb_finish_mmu(&tlb, tlb_start, tlb_end);
 }
 
 /*
@@ -4309,11 +4329,21 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long pages = 0;
+	unsigned long f_start = start;
+	unsigned long f_end = end;
+	bool shared_pmd = false;
+
+	/*
+	 * In the case of shared PMDs, the area to flush could be beyond
+	 * start/end.  Set f_start/f_end to cover the maximum possible
+	 * range if PMD sharing is possible.
+	 */
+	adjust_range_if_pmd_sharing_possible(vma, &f_start, &f_end);
 
 	BUG_ON(address >= end);
-	flush_cache_range(vma, address, end);
+	flush_cache_range(vma, f_start, f_end);
 
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, f_start, f_end);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	for (; address < end; address += huge_page_size(h)) {
 		spinlock_t *ptl;
@@ -4324,6 +4354,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 		if (huge_pmd_unshare(mm, &address, ptep)) {
 			pages++;
 			spin_unlock(ptl);
+			shared_pmd = true;
 			continue;
 		}
 		pte = huge_ptep_get(ptep);
@@ -4359,9 +4390,13 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	 * Must flush TLB before releasing i_mmap_rwsem: x86's huge_pmd_unshare
 	 * may have cleared our pud entry and done put_page on the page table:
 	 * once we release i_mmap_rwsem, another task can do the final put_page
-	 * and that page table be reused and filled with junk.
+	 * and that page table be reused and filled with junk.  If we actually
+	 * did unshare a page of pmds, flush the range corresponding to the pud.
 	 */
-	flush_hugetlb_tlb_range(vma, start, end);
+	if (shared_pmd)
+		flush_hugetlb_tlb_range(vma, f_start, f_end);
+	else
+		flush_hugetlb_tlb_range(vma, start, end);
 	/*
 	 * No need to call mmu_notifier_invalidate_range() we are downgrading
 	 * page table protection not changing it to point to a new page.
@@ -4369,7 +4404,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	 * See Documentation/vm/mmu_notifier.rst
 	 */
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, f_start, f_end);
 
 	return pages << h->order;
 }
-- 
2.17.1



* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-23 20:59 ` [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages Mike Kravetz
@ 2018-08-24  2:59   ` Naoya Horiguchi
  2018-08-24  8:41   ` Michal Hocko
  2018-08-24  9:25   ` Michal Hocko
  2 siblings, 0 replies; 29+ messages in thread
From: Naoya Horiguchi @ 2018-08-24  2:59 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Kirill A . Shutemov,
	Jérôme Glisse, Vlastimil Babka, Davidlohr Bueso,
	Michal Hocko, Andrew Morton, stable

On Thu, Aug 23, 2018 at 01:59:16PM -0700, Mike Kravetz wrote:
> The page migration code employs try_to_unmap() to try and unmap the
> source page.  This is accomplished by using rmap_walk to find all
> vmas where the page is mapped.  This search stops when page mapcount
> is zero.  For shared PMD huge pages, the page map count is always 1
> no matter the number of mappings.  Shared mappings are tracked via
> the reference count of the PMD page.  Therefore, try_to_unmap stops
> prematurely and does not completely unmap all mappings of the source
> page.
> 
> This problem can result in data corruption as writes to the original
> source page can happen after contents of the page are copied to the
> target page.  Hence, data is lost.
> 
> This problem was originally seen as DB corruption of shared global
> areas after a huge page was soft offlined due to ECC memory errors.
> DB developers noticed they could reproduce the issue by (hotplug)
> offlining memory used to back huge pages.  A simple testcase can
> reproduce the problem by creating a shared PMD mapping (note that
> this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on
> x86)), and using migrate_pages() to migrate process pages between
> nodes while continually writing to the huge pages being migrated.
> 
> To fix, have the try_to_unmap_one routine check for huge PMD sharing
> by calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a
> shared mapping it will be 'unshared' which removes the page table
> entry and drops the reference on the PMD page.  After this, flush
> caches and TLB.
> 
> mmu notifiers are called before locking page tables, but we can not
> be sure of PMD sharing until page tables are locked.  Therefore,
> check for the possibility of PMD sharing before locking so that
> notifiers can prepare for the worst possible case.
> 
> Fixes: 39dde65c9940 ("shared page table for hugetlb page")
> Cc: stable@vger.kernel.org
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Thanks Mike,

Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

> ---
>  include/linux/hugetlb.h | 14 ++++++++++++++
>  mm/hugetlb.c            | 40 +++++++++++++++++++++++++++++++++++++--
>  mm/rmap.c               | 42 ++++++++++++++++++++++++++++++++++++++---
>  3 files changed, 91 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 36fa6a2a82e3..4ee95d8c8413 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -140,6 +140,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
>  pte_t *huge_pte_offset(struct mm_struct *mm,
>  		       unsigned long addr, unsigned long sz);
>  int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
> +void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end);
>  struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
>  			      int write);
>  struct page *follow_huge_pd(struct vm_area_struct *vma,
> @@ -170,6 +172,18 @@ static inline unsigned long hugetlb_total_pages(void)
>  	return 0;
>  }
>  
> +static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr,
> +					pte_t *ptep)
> +{
> +	return 0;
> +}
> +
> +static inline void adjust_range_if_pmd_sharing_possible(
> +				struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end)
> +{
> +}
> +
>  #define follow_hugetlb_page(m,v,p,vs,a,b,i,w,n)	({ BUG(); 0; })
>  #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
>  #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3103099f64fd..a73c5728e961 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4548,6 +4548,9 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
>  	return saddr;
>  }
>  
> +#define _range_in_vma(vma, start, end) \
> +	((vma)->vm_start <= (start) && (end) <= (vma)->vm_end)
> +
>  static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>  {
>  	unsigned long base = addr & PUD_MASK;
> @@ -4556,12 +4559,40 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>  	/*
>  	 * check on proper vm_flags and page table alignment
>  	 */
> -	if (vma->vm_flags & VM_MAYSHARE &&
> -	    vma->vm_start <= base && end <= vma->vm_end)
> +	if (vma->vm_flags & VM_MAYSHARE && _range_in_vma(vma, base, end))
>  		return true;
>  	return false;
>  }
>  
> +/*
> + * Determine if start,end range within vma could be mapped by shared pmd.
> + * If yes, adjust start and end to cover range associated with possible
> + * shared pmd mappings.
> + */
> +void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end)
> +{
> +	unsigned long check_addr = *start;
> +
> +	if (!(vma->vm_flags & VM_MAYSHARE))
> +		return;
> +
> +	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
> +		unsigned long a_start = check_addr & PUD_MASK;
> +		unsigned long a_end = a_start + PUD_SIZE;
> +
> +		/*
> +		 * If sharing is possible, adjust start/end if necessary.
> +		 */
> +		if (_range_in_vma(vma, a_start, a_end)) {
> +			if (a_start < *start)
> +				*start = a_start;
> +			if (a_end > *end)
> +				*end = a_end;
> +		}
> +	}
> +}
> +
>  /*
>   * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
>   * and returns the corresponding pte. While this is not necessary for the
> @@ -4659,6 +4690,11 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
>  {
>  	return 0;
>  }
> +
> +void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end)
> +{
> +}
>  #define want_pmd_share()	(0)
>  #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
>  
> diff --git a/mm/rmap.c b/mm/rmap.c
> index eb477809a5c0..1e79fac3186b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	}
>  
>  	/*
> -	 * We have to assume the worse case ie pmd for invalidation. Note that
> -	 * the page can not be free in this function as call of try_to_unmap()
> -	 * must hold a reference on the page.
> +	 * For THP, we have to assume the worse case ie pmd for invalidation.
> +	 * For hugetlb, it could be much worse if we need to do pud
> +	 * invalidation in the case of pmd sharing.
> +	 *
> +	 * Note that the page can not be free in this function as call of
> +	 * try_to_unmap() must hold a reference on the page.
>  	 */
>  	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
> +	if (PageHuge(page)) {
> +		/*
> +		 * If sharing is possible, start and end will be adjusted
> +		 * accordingly.
> +		 */
> +		adjust_range_if_pmd_sharing_possible(vma, &start, &end);
> +	}
>  	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
>  
>  	while (page_vma_mapped_walk(&pvmw)) {
> @@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
>  		address = pvmw.address;
>  
> +		if (PageHuge(page)) {
> +			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
> +				/*
> +				 * huge_pmd_unshare unmapped an entire PMD
> +				 * page.  There is no way of knowing exactly
> +				 * which PMDs may be cached for this mm, so
> +				 * we must flush them all.  start/end were
> +				 * already adjusted above to cover this range.
> +				 */
> +				flush_cache_range(vma, start, end);
> +				flush_tlb_range(vma, start, end);
> +				mmu_notifier_invalidate_range(mm, start, end);
> +
> +				/*
> +				 * The ref count of the PMD page was dropped
> +				 * which is part of the way map counting
> +				 * is done for shared PMDs.  Return 'true'
> +				 * here.  When there is no other sharing,
> +				 * huge_pmd_unshare returns false and we will
> +				 * unmap the actual page and drop map count
> +				 * to zero.
> +				 */
> +				page_vma_mapped_walk_done(&pvmw);
> +				break;
> +			}
> +		}
>  
>  		if (IS_ENABLED(CONFIG_MIGRATION) &&
>  		    (flags & TTU_MIGRATION) &&
> -- 
> 2.17.1
> 
> 


* Re: [PATCH v6 2/2] hugetlb: take PMD sharing into account when flushing tlb/caches
  2018-08-23 20:59 ` [PATCH v6 2/2] hugetlb: take PMD sharing into account when flushing tlb/caches Mike Kravetz
@ 2018-08-24  3:07   ` Naoya Horiguchi
  0 siblings, 0 replies; 29+ messages in thread
From: Naoya Horiguchi @ 2018-08-24  3:07 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Kirill A . Shutemov,
	Jérôme Glisse, Vlastimil Babka, Davidlohr Bueso,
	Michal Hocko, Andrew Morton

On Thu, Aug 23, 2018 at 01:59:17PM -0700, Mike Kravetz wrote:
> When fixing an issue with PMD sharing and migration, it was discovered
> via code inspection that other callers of huge_pmd_unshare potentially
> have an issue with cache and tlb flushing.
> 
> Use the routine adjust_range_if_pmd_sharing_possible() to calculate
> worst case ranges for mmu notifiers.  Ensure that this range is flushed
> if huge_pmd_unshare succeeds and unmaps a PUD_SUZE area.

s/PUD_SUZE/PUD_SIZE/

> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Looks good to me.

Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

> ---
>  mm/hugetlb.c | 53 +++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 44 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a73c5728e961..082cddf46b4f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3333,8 +3333,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	struct page *page;
>  	struct hstate *h = hstate_vma(vma);
>  	unsigned long sz = huge_page_size(h);
> -	const unsigned long mmun_start = start;	/* For mmu_notifiers */
> -	const unsigned long mmun_end   = end;	/* For mmu_notifiers */
> +	unsigned long mmun_start = start;	/* For mmu_notifiers */
> +	unsigned long mmun_end   = end;		/* For mmu_notifiers */
>  
>  	WARN_ON(!is_vm_hugetlb_page(vma));
>  	BUG_ON(start & ~huge_page_mask(h));
> @@ -3346,6 +3346,11 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	 */
>  	tlb_remove_check_page_size_change(tlb, sz);
>  	tlb_start_vma(tlb, vma);
> +
> +	/*
> +	 * If sharing possible, alert mmu notifiers of worst case.
> +	 */
> +	adjust_range_if_pmd_sharing_possible(vma, &mmun_start, &mmun_end);
>  	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
>  	address = start;
>  	for (; address < end; address += sz) {
> @@ -3356,6 +3361,10 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		ptl = huge_pte_lock(h, mm, ptep);
>  		if (huge_pmd_unshare(mm, &address, ptep)) {
>  			spin_unlock(ptl);
> +			/*
> +			 * We just unmapped a page of PMDs by clearing a PUD.
> +			 * The caller's TLB flush range should cover this area.
> +			 */
>  			continue;
>  		}
>  
> @@ -3438,12 +3447,23 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
>  {
>  	struct mm_struct *mm;
>  	struct mmu_gather tlb;
> +	unsigned long tlb_start = start;
> +	unsigned long tlb_end = end;
> +
> +	/*
> +	 * If shared PMDs were possibly used within this vma range, adjust
> +	 * start/end for worst case tlb flushing.
> +	 * Note that we can not be sure if PMDs are shared until we try to
> +	 * unmap pages.  However, we want to make sure TLB flushing covers
> +	 * the largest possible range.
> +	 */
> +	adjust_range_if_pmd_sharing_possible(vma, &tlb_start, &tlb_end);
>  
>  	mm = vma->vm_mm;
>  
> -	tlb_gather_mmu(&tlb, mm, start, end);
> +	tlb_gather_mmu(&tlb, mm, tlb_start, tlb_end);
>  	__unmap_hugepage_range(&tlb, vma, start, end, ref_page);
> -	tlb_finish_mmu(&tlb, start, end);
> +	tlb_finish_mmu(&tlb, tlb_start, tlb_end);
>  }
>  
>  /*
> @@ -4309,11 +4329,21 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  	pte_t pte;
>  	struct hstate *h = hstate_vma(vma);
>  	unsigned long pages = 0;
> +	unsigned long f_start = start;
> +	unsigned long f_end = end;
> +	bool shared_pmd = false;
> +
> +	/*
> +	 * In the case of shared PMDs, the area to flush could be beyond
> +	 * start/end.  Set f_start/f_end to cover the maximum possible
> +	 * range if PMD sharing is possible.
> +	 */
> +	adjust_range_if_pmd_sharing_possible(vma, &f_start, &f_end);
>  
>  	BUG_ON(address >= end);
> -	flush_cache_range(vma, address, end);
> +	flush_cache_range(vma, f_start, f_end);
>  
> -	mmu_notifier_invalidate_range_start(mm, start, end);
> +	mmu_notifier_invalidate_range_start(mm, f_start, f_end);
>  	i_mmap_lock_write(vma->vm_file->f_mapping);
>  	for (; address < end; address += huge_page_size(h)) {
>  		spinlock_t *ptl;
> @@ -4324,6 +4354,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  		if (huge_pmd_unshare(mm, &address, ptep)) {
>  			pages++;
>  			spin_unlock(ptl);
> +			shared_pmd = true;
>  			continue;
>  		}
>  		pte = huge_ptep_get(ptep);
> @@ -4359,9 +4390,13 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  	 * Must flush TLB before releasing i_mmap_rwsem: x86's huge_pmd_unshare
>  	 * may have cleared our pud entry and done put_page on the page table:
>  	 * once we release i_mmap_rwsem, another task can do the final put_page
> -	 * and that page table be reused and filled with junk.
> +	 * and that page table be reused and filled with junk.  If we actually
> +	 * did unshare a page of pmds, flush the range corresponding to the pud.
>  	 */
> -	flush_hugetlb_tlb_range(vma, start, end);
> +	if (shared_pmd)
> +		flush_hugetlb_tlb_range(vma, f_start, f_end);
> +	else
> +		flush_hugetlb_tlb_range(vma, start, end);
>  	/*
>  	 * No need to call mmu_notifier_invalidate_range() we are downgrading
>  	 * page table protection not changing it to point to a new page.
> @@ -4369,7 +4404,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  	 * See Documentation/vm/mmu_notifier.rst
>  	 */
>  	i_mmap_unlock_write(vma->vm_file->f_mapping);
> -	mmu_notifier_invalidate_range_end(mm, start, end);
> +	mmu_notifier_invalidate_range_end(mm, f_start, f_end);
>  
>  	return pages << h->order;
>  }
> -- 
> 2.17.1
> 
> 


* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-23 20:59 ` [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages Mike Kravetz
  2018-08-24  2:59   ` Naoya Horiguchi
@ 2018-08-24  8:41   ` Michal Hocko
  2018-08-24 18:08     ` Mike Kravetz
  2018-08-24  9:25   ` Michal Hocko
  2 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2018-08-24  8:41 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Kirill A . Shutemov,
	Jérôme Glisse, Vlastimil Babka, Naoya Horiguchi,
	Davidlohr Bueso, Andrew Morton, stable

On Thu 23-08-18 13:59:16, Mike Kravetz wrote:
> The page migration code employs try_to_unmap() to try and unmap the
> source page.  This is accomplished by using rmap_walk to find all
> vmas where the page is mapped.  This search stops when page mapcount
> is zero.  For shared PMD huge pages, the page map count is always 1
> no matter the number of mappings.  Shared mappings are tracked via
> the reference count of the PMD page.  Therefore, try_to_unmap stops
> prematurely and does not completely unmap all mappings of the source
> page.
> 
> This problem can result in data corruption as writes to the original
> source page can happen after contents of the page are copied to the
> target page.  Hence, data is lost.
> 
> This problem was originally seen as DB corruption of shared global
> areas after a huge page was soft offlined due to ECC memory errors.
> DB developers noticed they could reproduce the issue by (hotplug)
> offlining memory used to back huge pages.  A simple testcase can
> reproduce the problem by creating a shared PMD mapping (note that
> this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on
> x86)), and using migrate_pages() to migrate process pages between
> nodes while continually writing to the huge pages being migrated.
> 
> To fix, have the try_to_unmap_one routine check for huge PMD sharing
> by calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a
> shared mapping it will be 'unshared' which removes the page table
> entry and drops the reference on the PMD page.  After this, flush
> caches and TLB.
> 
> mmu notifiers are called before locking page tables, but we can not
> be sure of PMD sharing until page tables are locked.  Therefore,
> check for the possibility of PMD sharing before locking so that
> notifiers can prepare for the worst possible case.
> 
> Fixes: 39dde65c9940 ("shared page table for hugetlb page")
> Cc: stable@vger.kernel.org
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Acked-by: Michal Hocko <mhocko@suse.com>

One nit below.

[...]
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3103099f64fd..a73c5728e961 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4548,6 +4548,9 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
>  	return saddr;
>  }
>  
> +#define _range_in_vma(vma, start, end) \
> +	((vma)->vm_start <= (start) && (end) <= (vma)->vm_end)
> +

static inline please. Macros and potential side effects on given
arguments are just not worth the risk. I also think this is something
for more general use. We have that pattern at many places. So I would
stick that to linux/mm.h

Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-23 20:59 ` [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages Mike Kravetz
  2018-08-24  2:59   ` Naoya Horiguchi
  2018-08-24  8:41   ` Michal Hocko
@ 2018-08-24  9:25   ` Michal Hocko
  2 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2018-08-24  9:25 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Kirill A . Shutemov,
	Jérôme Glisse, Vlastimil Babka, Naoya Horiguchi,
	Davidlohr Bueso, Andrew Morton, stable

On Thu 23-08-18 13:59:16, Mike Kravetz wrote:
[...]
> @@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
>  		address = pvmw.address;
>  
> +		if (PageHuge(page)) {
> +			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
> +				/*
> +				 * huge_pmd_unshare unmapped an entire PMD
> +				 * page.  There is no way of knowing exactly
> +				 * which PMDs may be cached for this mm, so
> +				 * we must flush them all.  start/end were
> +				 * already adjusted above to cover this range.
> +				 */
> +				flush_cache_range(vma, start, end);
> +				flush_tlb_range(vma, start, end);
> +				mmu_notifier_invalidate_range(mm, start, end);
> +
> +				/*
> +				 * The ref count of the PMD page was dropped
> +				 * which is part of the way map counting
> +				 * is done for shared PMDs.  Return 'true'
> +				 * here.  When there is no other sharing,
> +				 * huge_pmd_unshare returns false and we will
> +				 * unmap the actual page and drop map count
> +				 * to zero.
> +				 */
> +				page_vma_mapped_walk_done(&pvmw);
> +				break;
> +			}
> +		}

Wait a second. This is not correct, right? You have to call the
notifiers after page_vma_mapped_walk_done because they might be
sleepable and we are still holding the pte lock. This is btw. a problem
for other users of mmu_notifier_invalidate_range in try_to_unmap_one,
unless I am terribly confused. This would suggest 369ea8242c0fb is
incorrect.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v6 0/2] huge_pmd_unshare migration and flushing
  2018-08-23 20:59 [PATCH v6 0/2] huge_pmd_unshare migration and flushing Mike Kravetz
  2018-08-23 20:59 ` [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages Mike Kravetz
  2018-08-23 20:59 ` [PATCH v6 2/2] hugetlb: take PMD sharing into account when flushing tlb/caches Mike Kravetz
@ 2018-08-24 11:35 ` Kirill A. Shutemov
  2 siblings, 0 replies; 29+ messages in thread
From: Kirill A. Shutemov @ 2018-08-24 11:35 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Jérôme Glisse, Vlastimil Babka,
	Naoya Horiguchi, Davidlohr Bueso, Michal Hocko, Andrew Morton

On Thu, Aug 23, 2018 at 08:59:15PM +0000, Mike Kravetz wrote:
> Correct a data corruption issue caused by improper handling of shared
> huge PMDs during page migration.  This issue was observed in a customer
> environment and can be recreated fairly easily with a test program.
> Patch 0001 addresses this issue only and is copied to stable with the
> intention that this will go to stable releases.  It has existed since
> the addition of shared huge PMD support.
> 
> While considering the issue above, Kirill Shutemov noticed that other
> callers of huge_pmd_unshare have potential issues with cache and TLB
> flushing.  A separate patch (0002) takes advantage of the new routine
> adjust_range_if_pmd_sharing_possible() to adjust flushing ranges in
> the cases where huge PMD sharing is possible.  There is no copy to
> stable for this patch as it has not been reported as an issue and was
> discovered only via code inspection.

Looks good to me.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov


* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-24  8:41   ` Michal Hocko
@ 2018-08-24 18:08     ` Mike Kravetz
  2018-08-27  7:46       ` Michal Hocko
  2018-08-27 19:11       ` Michal Hocko
  0 siblings, 2 replies; 29+ messages in thread
From: Mike Kravetz @ 2018-08-24 18:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Kirill A . Shutemov,
	Jérôme Glisse, Vlastimil Babka, Naoya Horiguchi,
	Davidlohr Bueso, Andrew Morton, stable

On 08/24/2018 01:41 AM, Michal Hocko wrote:
> On Thu 23-08-18 13:59:16, Mike Kravetz wrote:
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> 
> One nit below.
> 
> [...]
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 3103099f64fd..a73c5728e961 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -4548,6 +4548,9 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
>>  	return saddr;
>>  }
>>  
>> +#define _range_in_vma(vma, start, end) \
>> +	((vma)->vm_start <= (start) && (end) <= (vma)->vm_end)
>> +
> 
> static inline please. Macros and potential side effects on given
> arguments are just not worth the risk. I also think this is something
> for more general use. We have that pattern at many places. So I would
> stick that to linux/mm.h

Thanks Michal,

Here is an updated patch which does as you suggest above.

-- 
Mike Kravetz

From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Fri, 24 Aug 2018 10:58:20 -0700
Subject: [PATCH v7 1/2] mm: migration: fix migration of huge PMD shared pages

The page migration code employs try_to_unmap() to try and unmap the
source page.  This is accomplished by using rmap_walk to find all
vmas where the page is mapped.  This search stops when page mapcount
is zero.  For shared PMD huge pages, the page map count is always 1
no matter the number of mappings.  Shared mappings are tracked via
the reference count of the PMD page.  Therefore, try_to_unmap stops
prematurely and does not completely unmap all mappings of the source
page.

This problem can result in data corruption as writes to the original
source page can happen after contents of the page are copied to the
target page.  Hence, data is lost.

This problem was originally seen as DB corruption of shared global
areas after a huge page was soft offlined due to ECC memory errors.
DB developers noticed they could reproduce the issue by (hotplug)
offlining memory used to back huge pages.  A simple testcase can
reproduce the problem by creating a shared PMD mapping (note that
this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on
x86)), and using migrate_pages() to migrate process pages between
nodes while continually writing to the huge pages being migrated.

To fix, have the try_to_unmap_one routine check for huge PMD sharing
by calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a
shared mapping it will be 'unshared' which removes the page table
entry and drops the reference on the PMD page.  After this, flush
caches and TLB.

mmu notifiers are called before locking page tables, but we can not
be sure of PMD sharing until page tables are locked.  Therefore,
check for the possibility of PMD sharing before locking so that
notifiers can prepare for the worst possible case.

Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h | 14 ++++++++++++++
 include/linux/mm.h      |  6 ++++++
 mm/hugetlb.c            | 37 ++++++++++++++++++++++++++++++++++--
 mm/rmap.c               | 42 ++++++++++++++++++++++++++++++++++++++---
 4 files changed, 94 insertions(+), 5 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 36fa6a2a82e3..4ee95d8c8413 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -140,6 +140,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
+void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end);
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
 			      int write);
 struct page *follow_huge_pd(struct vm_area_struct *vma,
@@ -170,6 +172,18 @@ static inline unsigned long hugetlb_total_pages(void)
 	return 0;
 }
 
+static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr,
+					pte_t *ptep)
+{
+	return 0;
+}
+
+static inline void adjust_range_if_pmd_sharing_possible(
+				struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end)
+{
+}
+
 #define follow_hugetlb_page(m,v,p,vs,a,b,i,w,n)	({ BUG(); 0; })
 #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
 #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 68a5121694ef..40ad93bc9548 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2463,6 +2463,12 @@ static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
 	return vma;
 }
 
+static inline bool range_in_vma(struct vm_area_struct *vma,
+				unsigned long start, unsigned long end)
+{
+	return (vma && vma->vm_start <= start && end <= vma->vm_end);
+}
+
 #ifdef CONFIG_MMU
 pgprot_t vm_get_page_prot(unsigned long vm_flags);
 void vma_set_page_prot(struct vm_area_struct *vma);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3103099f64fd..f469315a6a0f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4556,12 +4556,40 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 	/*
 	 * check on proper vm_flags and page table alignment
 	 */
-	if (vma->vm_flags & VM_MAYSHARE &&
-	    vma->vm_start <= base && end <= vma->vm_end)
+	if (vma->vm_flags & VM_MAYSHARE && range_in_vma(vma, base, end))
 		return true;
 	return false;
 }
 
+/*
+ * Determine if start,end range within vma could be mapped by shared pmd.
+ * If yes, adjust start and end to cover range associated with possible
+ * shared pmd mappings.
+ */
+void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end)
+{
+	unsigned long check_addr = *start;
+
+	if (!(vma->vm_flags & VM_MAYSHARE))
+		return;
+
+	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
+		unsigned long a_start = check_addr & PUD_MASK;
+		unsigned long a_end = a_start + PUD_SIZE;
+
+		/*
+		 * If sharing is possible, adjust start/end if necessary.
+		 */
+		if (range_in_vma(vma, a_start, a_end)) {
+			if (a_start < *start)
+				*start = a_start;
+			if (a_end > *end)
+				*end = a_end;
+		}
+	}
+}
+
 /*
  * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
  * and returns the corresponding pte. While this is not necessary for the
@@ -4659,6 +4687,11 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 {
 	return 0;
 }
+
+void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end)
+{
+}
 #define want_pmd_share()	(0)
 #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
 
diff --git a/mm/rmap.c b/mm/rmap.c
index eb477809a5c0..1e79fac3186b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	}
 
 	/*
-	 * We have to assume the worse case ie pmd for invalidation. Note that
-	 * the page can not be free in this function as call of try_to_unmap()
-	 * must hold a reference on the page.
+	 * For THP, we have to assume the worse case ie pmd for invalidation.
+	 * For hugetlb, it could be much worse if we need to do pud
+	 * invalidation in the case of pmd sharing.
+	 *
+	 * Note that the page can not be free in this function as call of
+	 * try_to_unmap() must hold a reference on the page.
 	 */
 	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
+	if (PageHuge(page)) {
+		/*
+		 * If sharing is possible, start and end will be adjusted
+		 * accordingly.
+		 */
+		adjust_range_if_pmd_sharing_possible(vma, &start, &end);
+	}
 	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
 
 	while (page_vma_mapped_walk(&pvmw)) {
@@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
 		address = pvmw.address;
 
+		if (PageHuge(page)) {
+			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
+				/*
+				 * huge_pmd_unshare unmapped an entire PMD
+				 * page.  There is no way of knowing exactly
+				 * which PMDs may be cached for this mm, so
+				 * we must flush them all.  start/end were
+				 * already adjusted above to cover this range.
+				 */
+				flush_cache_range(vma, start, end);
+				flush_tlb_range(vma, start, end);
+				mmu_notifier_invalidate_range(mm, start, end);
+
+				/*
+				 * The ref count of the PMD page was dropped
+				 * which is part of the way map counting
+				 * is done for shared PMDs.  Return 'true'
+				 * here.  When there is no other sharing,
+				 * huge_pmd_unshare returns false and we will
+				 * unmap the actual page and drop map count
+				 * to zero.
+				 */
+				page_vma_mapped_walk_done(&pvmw);
+				break;
+			}
+		}
 
 		if (IS_ENABLED(CONFIG_MIGRATION) &&
 		    (flags & TTU_MIGRATION) &&
-- 
2.17.1



* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-24 18:08     ` Mike Kravetz
@ 2018-08-27  7:46       ` Michal Hocko
  2018-08-27 13:46         ` Jerome Glisse
  2018-08-27 16:42         ` Mike Kravetz
  2018-08-27 19:11       ` Michal Hocko
  1 sibling, 2 replies; 29+ messages in thread
From: Michal Hocko @ 2018-08-27  7:46 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Kirill A . Shutemov,
	Jérôme Glisse, Vlastimil Babka, Naoya Horiguchi,
	Davidlohr Bueso, Andrew Morton, stable

On Fri 24-08-18 11:08:24, Mike Kravetz wrote:
> On 08/24/2018 01:41 AM, Michal Hocko wrote:
> > On Thu 23-08-18 13:59:16, Mike Kravetz wrote:
> > 
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > 
> > One nit below.
> > 
> > [...]
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index 3103099f64fd..a73c5728e961 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -4548,6 +4548,9 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
> >>  	return saddr;
> >>  }
> >>  
> >> +#define _range_in_vma(vma, start, end) \
> >> +	((vma)->vm_start <= (start) && (end) <= (vma)->vm_end)
> >> +
> > 
> > static inline please. Macros and potential side effects on given
> > arguments are just not worth the risk. I also think this is something
> > for more general use. We have that pattern at many places. So I would
> > stick that to linux/mm.h
> 
> Thanks Michal,
> 
> Here is an updated patch which does as you suggest above.
[...]
> @@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
>  		address = pvmw.address;
>  
> +		if (PageHuge(page)) {
> +			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
> +				/*
> +				 * huge_pmd_unshare unmapped an entire PMD
> +				 * page.  There is no way of knowing exactly
> +				 * which PMDs may be cached for this mm, so
> +				 * we must flush them all.  start/end were
> +				 * already adjusted above to cover this range.
> +				 */
> +				flush_cache_range(vma, start, end);
> +				flush_tlb_range(vma, start, end);
> +				mmu_notifier_invalidate_range(mm, start, end);
> +
> +				/*
> +				 * The ref count of the PMD page was dropped
> +				 * which is part of the way map counting
> +				 * is done for shared PMDs.  Return 'true'
> +				 * here.  When there is no other sharing,
> +				 * huge_pmd_unshare returns false and we will
> +				 * unmap the actual page and drop map count
> +				 * to zero.
> +				 */
> +				page_vma_mapped_walk_done(&pvmw);
> +				break;
> +			}

This still calls into notifier while holding the ptl lock. Either I am
missing something or the invalidation is broken in this loop (not also
for other invalidations).

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-27  7:46       ` Michal Hocko
@ 2018-08-27 13:46         ` Jerome Glisse
  2018-08-27 19:09           ` Michal Hocko
  2018-08-29 17:24           ` Mike Kravetz
  2018-08-27 16:42         ` Mike Kravetz
  1 sibling, 2 replies; 29+ messages in thread
From: Jerome Glisse @ 2018-08-27 13:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, linux-mm, linux-kernel, Kirill A . Shutemov,
	Vlastimil Babka, Naoya Horiguchi, Davidlohr Bueso, Andrew Morton,
	stable

On Mon, Aug 27, 2018 at 09:46:45AM +0200, Michal Hocko wrote:
> On Fri 24-08-18 11:08:24, Mike Kravetz wrote:
> > On 08/24/2018 01:41 AM, Michal Hocko wrote:
> > > On Thu 23-08-18 13:59:16, Mike Kravetz wrote:
> > > 
> > > Acked-by: Michal Hocko <mhocko@suse.com>
> > > 
> > > One nit below.
> > > 
> > > [...]
> > >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > >> index 3103099f64fd..a73c5728e961 100644
> > >> --- a/mm/hugetlb.c
> > >> +++ b/mm/hugetlb.c
> > >> @@ -4548,6 +4548,9 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
> > >>  	return saddr;
> > >>  }
> > >>  
> > >> +#define _range_in_vma(vma, start, end) \
> > >> +	((vma)->vm_start <= (start) && (end) <= (vma)->vm_end)
> > >> +
> > > 
> > > static inline please. Macros and potential side effects on given
> > > arguments are just not worth the risk. I also think this is something
> > > for more general use. We have that pattern at many places. So I would
> > > stick that to linux/mm.h
> > 
> > Thanks Michal,
> > 
> > Here is an updated patch which does as you suggest above.
> [...]
> > @@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
> >  		address = pvmw.address;
> >  
> > +		if (PageHuge(page)) {
> > +			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
> > +				/*
> > +				 * huge_pmd_unshare unmapped an entire PMD
> > +				 * page.  There is no way of knowing exactly
> > +				 * which PMDs may be cached for this mm, so
> > +				 * we must flush them all.  start/end were
> > +				 * already adjusted above to cover this range.
> > +				 */
> > +				flush_cache_range(vma, start, end);
> > +				flush_tlb_range(vma, start, end);
> > +				mmu_notifier_invalidate_range(mm, start, end);
> > +
> > +				/*
> > +				 * The ref count of the PMD page was dropped
> > +				 * which is part of the way map counting
> > +				 * is done for shared PMDs.  Return 'true'
> > +				 * here.  When there is no other sharing,
> > +				 * huge_pmd_unshare returns false and we will
> > +				 * unmap the actual page and drop map count
> > +				 * to zero.
> > +				 */
> > +				page_vma_mapped_walk_done(&pvmw);
> > +				break;
> > +			}
> 
> This still calls into notifier while holding the ptl lock. Either I am
> missing something or the invalidation is broken in this loop (not also
> for other invalidations).

mmu_notifier_invalidate_range() is done with the pt lock held; only the start
and end versions need to happen outside the pt lock.
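
To make that rule concrete, a rough sketch of the pattern (mirroring what
the patch above already does, not new code from this series): the
sleepable start/end callbacks bracket the page table lock, while the
range callback itself may be issued under it.

	mmu_notifier_invalidate_range_start(mm, start, end);	/* may sleep, so no ptl yet */

	ptl = huge_pte_lock(h, mm, ptep);			/* page table spinlock */
	/* ... clear or rewrite the page table entries ... */
	mmu_notifier_invalidate_range(mm, start, end);		/* non-sleeping, fine under ptl */
	spin_unlock(ptl);

	mmu_notifier_invalidate_range_end(mm, start, end);	/* may sleep, ptl released */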

Cheers,
Jérôme


* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-27  7:46       ` Michal Hocko
  2018-08-27 13:46         ` Jerome Glisse
@ 2018-08-27 16:42         ` Mike Kravetz
  1 sibling, 0 replies; 29+ messages in thread
From: Mike Kravetz @ 2018-08-27 16:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Kirill A . Shutemov,
	Jérôme Glisse, Vlastimil Babka, Naoya Horiguchi,
	Davidlohr Bueso, Andrew Morton, stable

On 08/27/2018 12:46 AM, Michal Hocko wrote:
> On Fri 24-08-18 11:08:24, Mike Kravetz wrote:
>> On 08/24/2018 01:41 AM, Michal Hocko wrote:
>>> On Thu 23-08-18 13:59:16, Mike Kravetz wrote:
>>>
>>> Acked-by: Michal Hocko <mhocko@suse.com>
>>>
>>> One nit below.
>>>
>>> [...]
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index 3103099f64fd..a73c5728e961 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -4548,6 +4548,9 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
>>>>  	return saddr;
>>>>  }
>>>>  
>>>> +#define _range_in_vma(vma, start, end) \
>>>> +	((vma)->vm_start <= (start) && (end) <= (vma)->vm_end)
>>>> +
>>>
>>> static inline please. Macros and potential side effects on given
>>> arguments are just not worth the risk. I also think this is something
>>> for more general use. We have that pattern at many places. So I would
>>> stick that to linux/mm.h
>>
>> Thanks Michal,
>>
>> Here is an updated patch which does as you suggest above.
> [...]
>> @@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>  		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
>>  		address = pvmw.address;
>>  
>> +		if (PageHuge(page)) {
>> +			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
>> +				/*
>> +				 * huge_pmd_unshare unmapped an entire PMD
>> +				 * page.  There is no way of knowing exactly
>> +				 * which PMDs may be cached for this mm, so
>> +				 * we must flush them all.  start/end were
>> +				 * already adjusted above to cover this range.
>> +				 */
>> +				flush_cache_range(vma, start, end);
>> +				flush_tlb_range(vma, start, end);
>> +				mmu_notifier_invalidate_range(mm, start, end);
>> +
>> +				/*
>> +				 * The ref count of the PMD page was dropped
>> +				 * which is part of the way map counting
>> +				 * is done for shared PMDs.  Return 'true'
>> +				 * here.  When there is no other sharing,
>> +				 * huge_pmd_unshare returns false and we will
>> +				 * unmap the actual page and drop map count
>> +				 * to zero.
>> +				 */
>> +				page_vma_mapped_walk_done(&pvmw);
>> +				break;
>> +			}
> 
> This still calls into notifier while holding the ptl lock. Either I am
> missing something or the invalidation is broken in this loop (not also
> for other invalidations).

As Jerome said ...

When creating this patch, I started by using the same flush/invalidation
routines used by the existing code.  This is because it is not obvious what
interfaces can be called in what context, and I didn't want to do anything
different.  The best 'documentation' is the set of comments in the
mmu_notifier_ops definition.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-27 13:46         ` Jerome Glisse
@ 2018-08-27 19:09           ` Michal Hocko
  2018-08-29 17:24           ` Mike Kravetz
  1 sibling, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2018-08-27 19:09 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Mike Kravetz, linux-mm, linux-kernel, Kirill A . Shutemov,
	Vlastimil Babka, Naoya Horiguchi, Davidlohr Bueso, Andrew Morton,
	stable

On Mon 27-08-18 09:46:33, Jerome Glisse wrote:
> On Mon, Aug 27, 2018 at 09:46:45AM +0200, Michal Hocko wrote:
> > On Fri 24-08-18 11:08:24, Mike Kravetz wrote:
> > > On 08/24/2018 01:41 AM, Michal Hocko wrote:
> > > > On Thu 23-08-18 13:59:16, Mike Kravetz wrote:
> > > > 
> > > > Acked-by: Michal Hocko <mhocko@suse.com>
> > > > 
> > > > One nit below.
> > > > 
> > > > [...]
> > > >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > >> index 3103099f64fd..a73c5728e961 100644
> > > >> --- a/mm/hugetlb.c
> > > >> +++ b/mm/hugetlb.c
> > > >> @@ -4548,6 +4548,9 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
> > > >>  	return saddr;
> > > >>  }
> > > >>  
> > > >> +#define _range_in_vma(vma, start, end) \
> > > >> +	((vma)->vm_start <= (start) && (end) <= (vma)->vm_end)
> > > >> +
> > > > 
> > > > static inline please. Macros and potential side effects on given
> > > > arguments are just not worth the risk. I also think this is something
> > > > for more general use. We have that pattern at many places. So I would
> > > > stick that to linux/mm.h
> > > 
> > > Thanks Michal,
> > > 
> > > Here is an updated patch which does as you suggest above.
> > [...]
> > > @@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > >  		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
> > >  		address = pvmw.address;
> > >  
> > > +		if (PageHuge(page)) {
> > > +			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
> > > +				/*
> > > +				 * huge_pmd_unshare unmapped an entire PMD
> > > +				 * page.  There is no way of knowing exactly
> > > +				 * which PMDs may be cached for this mm, so
> > > +				 * we must flush them all.  start/end were
> > > +				 * already adjusted above to cover this range.
> > > +				 */
> > > +				flush_cache_range(vma, start, end);
> > > +				flush_tlb_range(vma, start, end);
> > > +				mmu_notifier_invalidate_range(mm, start, end);
> > > +
> > > +				/*
> > > +				 * The ref count of the PMD page was dropped
> > > +				 * which is part of the way map counting
> > > +				 * is done for shared PMDs.  Return 'true'
> > > +				 * here.  When there is no other sharing,
> > > +				 * huge_pmd_unshare returns false and we will
> > > +				 * unmap the actual page and drop map count
> > > +				 * to zero.
> > > +				 */
> > > +				page_vma_mapped_walk_done(&pvmw);
> > > +				break;
> > > +			}
> > 
> > This still calls into notifier while holding the ptl lock. Either I am
> > missing something or the invalidation is broken in this loop (not also
> > for other invalidations).
> 
> mmu_notifier_invalidate_range() is done with pt lock held only the start
> and end versions need to happen outside pt lock.

OK, that was not clear to me. In particular, the srcu_read_lock in
__mmu_notifier_invalidate_range suggests the callback might sleep. There
is no note about the pte lock. There is even a note about possible
blocking:
	 * If this callback cannot block, and invalidate_range_{start,end}
	 * cannot block, mmu_notifier_ops.flags should have
	 * MMU_INVALIDATE_DOES_NOT_BLOCK set.

I am removing that part of the comment but it really confused me.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-24 18:08     ` Mike Kravetz
  2018-08-27  7:46       ` Michal Hocko
@ 2018-08-27 19:11       ` Michal Hocko
  1 sibling, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2018-08-27 19:11 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Kirill A . Shutemov,
	Jérôme Glisse, Vlastimil Babka, Naoya Horiguchi,
	Davidlohr Bueso, Andrew Morton, stable

On Fri 24-08-18 11:08:24, Mike Kravetz wrote:
[...]
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Fri, 24 Aug 2018 10:58:20 -0700
> Subject: [PATCH v7 1/2] mm: migration: fix migration of huge PMD shared pages
> 
> The page migration code employs try_to_unmap() to try and unmap the
> source page.  This is accomplished by using rmap_walk to find all
> vmas where the page is mapped.  This search stops when page mapcount
> is zero.  For shared PMD huge pages, the page map count is always 1
> no matter the number of mappings.  Shared mappings are tracked via
> the reference count of the PMD page.  Therefore, try_to_unmap stops
> prematurely and does not completely unmap all mappings of the source
> page.
> 
> This problem can result is data corruption as writes to the original
> source page can happen after contents of the page are copied to the
> target page.  Hence, data is lost.
> 
> This problem was originally seen as DB corruption of shared global
> areas after a huge page was soft offlined due to ECC memory errors.
> DB developers noticed they could reproduce the issue by (hotplug)
> offlining memory used to back huge pages.  A simple testcase can
> reproduce the problem by creating a shared PMD mapping (note that
> this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on
> x86)), and using migrate_pages() to migrate process pages between
> nodes while continually writing to the huge pages being migrated.
> 
> To fix, have the try_to_unmap_one routine check for huge PMD sharing
> by calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a
> shared mapping it will be 'unshared' which removes the page table
> entry and drops the reference on the PMD page.  After this, flush
> caches and TLB.
> 
> mmu notifiers are called before locking page tables, but we can not
> be sure of PMD sharing until page tables are locked.  Therefore,
> check for the possibility of PMD sharing before locking so that
> notifiers can prepare for the worst possible case.
> 
> Fixes: 39dde65c9940 ("shared page table for hugetlb page")
> Cc: stable@vger.kernel.org
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

With the locking expectations for mmu_notifier_invalidate_range
exlained I do not see any other issues.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/hugetlb.h | 14 ++++++++++++++
>  include/linux/mm.h      |  6 ++++++
>  mm/hugetlb.c            | 37 ++++++++++++++++++++++++++++++++++--
>  mm/rmap.c               | 42 ++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 94 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 36fa6a2a82e3..4ee95d8c8413 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -140,6 +140,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
>  pte_t *huge_pte_offset(struct mm_struct *mm,
>  		       unsigned long addr, unsigned long sz);
>  int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
> +void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end);
>  struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
>  			      int write);
>  struct page *follow_huge_pd(struct vm_area_struct *vma,
> @@ -170,6 +172,18 @@ static inline unsigned long hugetlb_total_pages(void)
>  	return 0;
>  }
>  
> +static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr,
> +					pte_t *ptep)
> +{
> +	return 0;
> +}
> +
> +static inline void adjust_range_if_pmd_sharing_possible(
> +				struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end)
> +{
> +}
> +
>  #define follow_hugetlb_page(m,v,p,vs,a,b,i,w,n)	({ BUG(); 0; })
>  #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
>  #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 68a5121694ef..40ad93bc9548 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2463,6 +2463,12 @@ static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
>  	return vma;
>  }
>  
> +static inline bool range_in_vma(struct vm_area_struct *vma,
> +				unsigned long start, unsigned long end)
> +{
> +	return (vma && vma->vm_start <= start && end <= vma->vm_end);
> +}
> +
>  #ifdef CONFIG_MMU
>  pgprot_t vm_get_page_prot(unsigned long vm_flags);
>  void vma_set_page_prot(struct vm_area_struct *vma);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3103099f64fd..f469315a6a0f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4556,12 +4556,40 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>  	/*
>  	 * check on proper vm_flags and page table alignment
>  	 */
> -	if (vma->vm_flags & VM_MAYSHARE &&
> -	    vma->vm_start <= base && end <= vma->vm_end)
> +	if (vma->vm_flags & VM_MAYSHARE && range_in_vma(vma, base, end))
>  		return true;
>  	return false;
>  }
>  
> +/*
> + * Determine if start,end range within vma could be mapped by shared pmd.
> + * If yes, adjust start and end to cover range associated with possible
> + * shared pmd mappings.
> + */
> +void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end)
> +{
> +	unsigned long check_addr = *start;
> +
> +	if (!(vma->vm_flags & VM_MAYSHARE))
> +		return;
> +
> +	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
> +		unsigned long a_start = check_addr & PUD_MASK;
> +		unsigned long a_end = a_start + PUD_SIZE;
> +
> +		/*
> +		 * If sharing is possible, adjust start/end if necessary.
> +		 */
> +		if (range_in_vma(vma, a_start, a_end)) {
> +			if (a_start < *start)
> +				*start = a_start;
> +			if (a_end > *end)
> +				*end = a_end;
> +		}
> +	}
> +}
> +
>  /*
>   * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
>   * and returns the corresponding pte. While this is not necessary for the
> @@ -4659,6 +4687,11 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
>  {
>  	return 0;
>  }
> +
> +void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end)
> +{
> +}
>  #define want_pmd_share()	(0)
>  #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
>  
> diff --git a/mm/rmap.c b/mm/rmap.c
> index eb477809a5c0..1e79fac3186b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	}
>  
>  	/*
> -	 * We have to assume the worse case ie pmd for invalidation. Note that
> -	 * the page can not be free in this function as call of try_to_unmap()
> -	 * must hold a reference on the page.
> +	 * For THP, we have to assume the worse case ie pmd for invalidation.
> +	 * For hugetlb, it could be much worse if we need to do pud
> +	 * invalidation in the case of pmd sharing.
> +	 *
> +	 * Note that the page can not be free in this function as call of
> +	 * try_to_unmap() must hold a reference on the page.
>  	 */
>  	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
> +	if (PageHuge(page)) {
> +		/*
> +		 * If sharing is possible, start and end will be adjusted
> +		 * accordingly.
> +		 */
> +		adjust_range_if_pmd_sharing_possible(vma, &start, &end);
> +	}
>  	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
>  
>  	while (page_vma_mapped_walk(&pvmw)) {
> @@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
>  		address = pvmw.address;
>  
> +		if (PageHuge(page)) {
> +			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
> +				/*
> +				 * huge_pmd_unshare unmapped an entire PMD
> +				 * page.  There is no way of knowing exactly
> +				 * which PMDs may be cached for this mm, so
> +				 * we must flush them all.  start/end were
> +				 * already adjusted above to cover this range.
> +				 */
> +				flush_cache_range(vma, start, end);
> +				flush_tlb_range(vma, start, end);
> +				mmu_notifier_invalidate_range(mm, start, end);
> +
> +				/*
> +				 * The ref count of the PMD page was dropped
> +				 * which is part of the way map counting
> +				 * is done for shared PMDs.  Return 'true'
> +				 * here.  When there is no other sharing,
> +				 * huge_pmd_unshare returns false and we will
> +				 * unmap the actual page and drop map count
> +				 * to zero.
> +				 */
> +				page_vma_mapped_walk_done(&pvmw);
> +				break;
> +			}
> +		}
>  
>  		if (IS_ENABLED(CONFIG_MIGRATION) &&
>  		    (flags & TTU_MIGRATION) &&
> -- 
> 2.17.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-27 13:46         ` Jerome Glisse
  2018-08-27 19:09           ` Michal Hocko
@ 2018-08-29 17:24           ` Mike Kravetz
  2018-08-29 18:14             ` Jerome Glisse
  1 sibling, 1 reply; 29+ messages in thread
From: Mike Kravetz @ 2018-08-29 17:24 UTC (permalink / raw)
  To: Jerome Glisse, Michal Hocko
  Cc: linux-mm, linux-kernel, Kirill A . Shutemov, Vlastimil Babka,
	Naoya Horiguchi, Davidlohr Bueso, Andrew Morton, stable

On 08/27/2018 06:46 AM, Jerome Glisse wrote:
> On Mon, Aug 27, 2018 at 09:46:45AM +0200, Michal Hocko wrote:
>> On Fri 24-08-18 11:08:24, Mike Kravetz wrote:
>>> Here is an updated patch which does as you suggest above.
>> [...]
>>> @@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>  		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
>>>  		address = pvmw.address;
>>>  
>>> +		if (PageHuge(page)) {
>>> +			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
>>> +				/*
>>> +				 * huge_pmd_unshare unmapped an entire PMD
>>> +				 * page.  There is no way of knowing exactly
>>> +				 * which PMDs may be cached for this mm, so
>>> +				 * we must flush them all.  start/end were
>>> +				 * already adjusted above to cover this range.
>>> +				 */
>>> +				flush_cache_range(vma, start, end);
>>> +				flush_tlb_range(vma, start, end);
>>> +				mmu_notifier_invalidate_range(mm, start, end);
>>> +
>>> +				/*
>>> +				 * The ref count of the PMD page was dropped
>>> +				 * which is part of the way map counting
>>> +				 * is done for shared PMDs.  Return 'true'
>>> +				 * here.  When there is no other sharing,
>>> +				 * huge_pmd_unshare returns false and we will
>>> +				 * unmap the actual page and drop map count
>>> +				 * to zero.
>>> +				 */
>>> +				page_vma_mapped_walk_done(&pvmw);
>>> +				break;
>>> +			}
>>
>> This still calls into notifier while holding the ptl lock. Either I am
>> missing something or the invalidation is broken in this loop (not also
>> for other invalidations).
> 
> mmu_notifier_invalidate_range() is done with pt lock held only the start
> and end versions need to happen outside pt lock.

Hi Jérôme (and anyone else having good understanding of mmu notifier API),

Michal and I have been looking at backports to stable releases.  If you look
at the v4.4 version of try_to_unmap_one(), it does not use the
mmu_notifier_invalidate_range_start/end interfaces. Rather, it uses the
mmu_notifier_invalidate_page(), passing in the address of the page it
unmapped.  This is done after releasing the ptl lock.  I'm not even sure if
this works for huge pages, as it appears some THP supporting code was added
to try_to_unmap_one() after v4.4.

But, we were wondering what mmu notifier interface to use in the case where
try_to_unmap_one() unmaps a shared pmd huge page as addressed in the patch
above.  In this case, a PUD sized area is effectively unmapped.  In the
code/patch above we have the invalidate range (start and end as well) take
the PUD sized area into account.

What would be the best mmu notifier interface to use where there are no
start/end calls?
Or, is the best solution to add the start/end calls as is done in later
versions of the code?  If that is the suggestion, has there been any change
in invalidate start/end semantics that we should take into account?

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-29 17:24           ` Mike Kravetz
@ 2018-08-29 18:14             ` Jerome Glisse
  2018-08-29 18:39               ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Jerome Glisse @ 2018-08-29 18:14 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Michal Hocko, linux-mm, linux-kernel, Kirill A . Shutemov,
	Vlastimil Babka, Naoya Horiguchi, Davidlohr Bueso, Andrew Morton,
	stable

On Wed, Aug 29, 2018 at 10:24:44AM -0700, Mike Kravetz wrote:
> On 08/27/2018 06:46 AM, Jerome Glisse wrote:
> > On Mon, Aug 27, 2018 at 09:46:45AM +0200, Michal Hocko wrote:
> >> On Fri 24-08-18 11:08:24, Mike Kravetz wrote:
> >>> Here is an updated patch which does as you suggest above.
> >> [...]
> >>> @@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>>  		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
> >>>  		address = pvmw.address;
> >>>  
> >>> +		if (PageHuge(page)) {
> >>> +			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
> >>> +				/*
> >>> +				 * huge_pmd_unshare unmapped an entire PMD
> >>> +				 * page.  There is no way of knowing exactly
> >>> +				 * which PMDs may be cached for this mm, so
> >>> +				 * we must flush them all.  start/end were
> >>> +				 * already adjusted above to cover this range.
> >>> +				 */
> >>> +				flush_cache_range(vma, start, end);
> >>> +				flush_tlb_range(vma, start, end);
> >>> +				mmu_notifier_invalidate_range(mm, start, end);
> >>> +
> >>> +				/*
> >>> +				 * The ref count of the PMD page was dropped
> >>> +				 * which is part of the way map counting
> >>> +				 * is done for shared PMDs.  Return 'true'
> >>> +				 * here.  When there is no other sharing,
> >>> +				 * huge_pmd_unshare returns false and we will
> >>> +				 * unmap the actual page and drop map count
> >>> +				 * to zero.
> >>> +				 */
> >>> +				page_vma_mapped_walk_done(&pvmw);
> >>> +				break;
> >>> +			}
> >>
> >> This still calls into notifier while holding the ptl lock. Either I am
> >> missing something or the invalidation is broken in this loop (not also
> >> for other invalidations).
> > 
> > mmu_notifier_invalidate_range() is done with pt lock held only the start
> > and end versions need to happen outside pt lock.
> 
> Hi Jérôme (and anyone else having good understanding of mmu notifier API),
> 
> Michal and I have been looking at backports to stable releases.  If you look
> at the v4.4 version of try_to_unmap_one(), it does not use the
> mmu_notifier_invalidate_range_start/end interfaces. Rather, it uses the
> mmu_notifier_invalidate_page(), passing in the address of the page it
> unmapped.  This is done after releasing the ptl lock.  I'm not even sure if
> this works for huge pages, as it appears some THP supporting code was added
> to try_to_unmap_one() after v4.4.
> 
> But, we were wondering what mmu notifier interface to use in the case where
> try_to_unmap_one() unmaps a shared pmd huge page as addressed in the patch
> above.  In this case, a PUD sized area is effectively unmapped.  In the
> code/patch above we have the invalidate range (start and end as well) take
> the PUD sized area into account.
> 
> What would be the best mmu notifier interface to use where there are no
> start/end calls?
> Or, is the best solution to add the start/end calls as is done in later
> versions of the code?  If that is the suggestion, has there been any change
> in invalidate start/end semantics that we should take into account?

start/end would be the ones to add; 4.4 seems broken with respect to THP
and mmu notification. Another solution is to fix the users of mmu notifiers;
there were only a handful back then. For instance, properly adjusting the
address to match the first address covered by the pmd or pud and passing down
the correct page size to mmu_notifier_invalidate_page() would allow fixing
this easily.

This is ok because try_to_unmap_one() replaces the pte/pmd/pud
with an invalid one (either poison, migration or swap entry) inside the
function. So anyone racing would synchronize on those special entries,
which is why it is fine to delay mmu_notifier_invalidate_page() until
after dropping the page table lock.

Adding start/end might be the solution with less code churn, as you would
only need to change try_to_unmap_one().
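(A rough sketch of that option, for illustration only; it assumes the 4.4-era
code, and the actual backport patches later in this thread are the
authoritative versions.)

	static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
				    unsigned long address, void *arg)
	{
		unsigned long start = address, end;
		int ret = SWAP_AGAIN;

		/* worst case: invalidate the full compound page range */
		end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
		mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);

		/* ... existing pte lookup and unmap logic, unchanged ... */

		mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
		return ret;
	}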

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-29 18:14             ` Jerome Glisse
@ 2018-08-29 18:39               ` Michal Hocko
  2018-08-29 21:11                 ` Jerome Glisse
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2018-08-29 18:39 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Mike Kravetz, linux-mm, linux-kernel, Kirill A . Shutemov,
	Vlastimil Babka, Naoya Horiguchi, Davidlohr Bueso, Andrew Morton,
	stable

On Wed 29-08-18 14:14:25, Jerome Glisse wrote:
> On Wed, Aug 29, 2018 at 10:24:44AM -0700, Mike Kravetz wrote:
[...]
> > What would be the best mmu notifier interface to use where there are no
> > start/end calls?
> > Or, is the best solution to add the start/end calls as is done in later
> > versions of the code?  If that is the suggestion, has there been any change
> > in invalidate start/end semantics that we should take into account?
> 
> start/end would be the one to add, 4.4 seems broken in respect to THP
> and mmu notification. Another solution is to fix user of mmu notifier,
> they were only a handful back then. For instance properly adjust the
> address to match first address covered by pmd or pud and passing down
> correct page size to mmu_notifier_invalidate_page() would allow to fix
> this easily.
> 
> This is ok because user of try_to_unmap_one() replace the pte/pmd/pud
> with an invalid one (either poison, migration or swap) inside the
> function. So anyone racing would synchronize on those special entry
> hence why it is fine to delay mmu_notifier_invalidate_page() to after
> dropping the page table lock.
> 
> Adding start/end might the solution with less code churn as you would
> only need to change try_to_unmap_one().

What about dependencies? 369ea8242c0fb sounds like it needs work, as all
notifiers need to be updated as well.

Anyway, I am wondering why we haven't seen any bugs coming from
incomplete range invalidation. How would those manifest?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-29 18:39               ` Michal Hocko
@ 2018-08-29 21:11                 ` Jerome Glisse
  2018-08-30  0:40                   ` Mike Kravetz
  2018-08-30 10:56                   ` Michal Hocko
  0 siblings, 2 replies; 29+ messages in thread
From: Jerome Glisse @ 2018-08-29 21:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, linux-mm, linux-kernel, Kirill A . Shutemov,
	Vlastimil Babka, Naoya Horiguchi, Davidlohr Bueso, Andrew Morton,
	stable, linux-rdma, Matan Barak, Leon Romanovsky,
	Dimitri Sivanich

On Wed, Aug 29, 2018 at 08:39:06PM +0200, Michal Hocko wrote:
> On Wed 29-08-18 14:14:25, Jerome Glisse wrote:
> > On Wed, Aug 29, 2018 at 10:24:44AM -0700, Mike Kravetz wrote:
> [...]
> > > What would be the best mmu notifier interface to use where there are no
> > > start/end calls?
> > > Or, is the best solution to add the start/end calls as is done in later
> > > versions of the code?  If that is the suggestion, has there been any change
> > > in invalidate start/end semantics that we should take into account?
> > 
> > start/end would be the one to add, 4.4 seems broken in respect to THP
> > and mmu notification. Another solution is to fix user of mmu notifier,
> > they were only a handful back then. For instance properly adjust the
> > address to match first address covered by pmd or pud and passing down
> > correct page size to mmu_notifier_invalidate_page() would allow to fix
> > this easily.
> > 
> > This is ok because user of try_to_unmap_one() replace the pte/pmd/pud
> > with an invalid one (either poison, migration or swap) inside the
> > function. So anyone racing would synchronize on those special entry
> > hence why it is fine to delay mmu_notifier_invalidate_page() to after
> > dropping the page table lock.
> > 
> > Adding start/end might the solution with less code churn as you would
> > only need to change try_to_unmap_one().
> 
> What about dependencies? 369ea8242c0fb sounds like it needs work for all
> notifiers need to be updated as well.

This commit removes mmu_notifier_invalidate_page(), which is why everything
needs to be updated. But in 4.4 you can get away with just adding start/
end and keeping mmu_notifier_invalidate_page() around to minimize disruption.

So the new semantic in 369ea8242c0fb is that all page table changes are
bracketed with mmu notifier start/end calls, with invalidate_range right
after the tlb flush. This simplifies things and makes it more reliable for mmu
notifier users like the IOMMU, ODP or GPU drivers.
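(A short illustrative contrast of the two semantics, sketched from the
description above; not copied from any particular kernel version.)

	/* pre-369ea8242c0fb pattern in try_to_unmap_one(): one call, after the fact */
	pte_unmap_unlock(pte, ptl);
	mmu_notifier_invalidate_page(mm, address);

	/* new pattern: every page table change is bracketed */
	mmu_notifier_invalidate_range_start(mm, start, end);
	/* ... modify page table entries with the pt lock held ... */
	flush_tlb_range(vma, start, end);
	mmu_notifier_invalidate_range(mm, start, end);  /* right after the tlb flush */
	/* ... drop the pt lock ... */
	mmu_notifier_invalidate_range_end(mm, start, end);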


> Anyway, I am wondering why we haven't see any bugs coming from
> incomplete range invalidation. How would those exhibit?

Reading back the 4.4 code, try_to_unmap() can only be called against a
hugetlb page, and only when migrating one, ie through migrate_pages().
So this highly limits the cases where issues would happen. I believe
no one uses hugetlbfs as backing for guest memory, so xen and kvm
would never face such a case.

So what is left is ODP, i915, radeon, amd gpu, SGI and IOMMU drivers.

The IOMMU drivers are never used that way AFAICT; in 4.4 there were no
drivers upstream for a PCIE device that would support ATS/PASID, and
thus the notifier path would never be used. Back then the only device
was the AMD APU AFAIK, which was never really used with that feature
due to the lack of mature userspace to use this.

For i915, radeon and amd GPU we would never see this either, as the
mmu notifier is only used for userptr GEM objects, which are only used
to upload textures from either anonymous vmas or file-backed vmas. I
have never heard of an xorg server, ddx or mesa driver which would
use hugetlbfs.

So the only ones that might have issues AFAICT are ODP and SGI.
I am unsure how likely either is to be used in conjunction with
hugetlbfs. CCing the maintainers for those so they can comment.


The symptoms would either be memory corruption, ie RDMA or SGI would
write to the old huge page and not the new one, or, even harder to
spot, use of stale/invalid data, ie RDMA or SGI reading from
the old huge page instead of the new one.

Corruption to a hugetlbfs page can likely go unnoticed.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-29 21:11                 ` Jerome Glisse
@ 2018-08-30  0:40                   ` Mike Kravetz
  2018-08-30 10:56                   ` Michal Hocko
  1 sibling, 0 replies; 29+ messages in thread
From: Mike Kravetz @ 2018-08-30  0:40 UTC (permalink / raw)
  To: Jerome Glisse, Michal Hocko
  Cc: linux-mm, linux-kernel, Kirill A . Shutemov, Vlastimil Babka,
	Naoya Horiguchi, Davidlohr Bueso, Andrew Morton, stable,
	linux-rdma, Matan Barak, Leon Romanovsky, Dimitri Sivanich

On 08/29/2018 02:11 PM, Jerome Glisse wrote:
> On Wed, Aug 29, 2018 at 08:39:06PM +0200, Michal Hocko wrote:
>> On Wed 29-08-18 14:14:25, Jerome Glisse wrote:
>>> On Wed, Aug 29, 2018 at 10:24:44AM -0700, Mike Kravetz wrote:
>> [...]
>>>> What would be the best mmu notifier interface to use where there are no
>>>> start/end calls?
>>>> Or, is the best solution to add the start/end calls as is done in later
>>>> versions of the code?  If that is the suggestion, has there been any change
>>>> in invalidate start/end semantics that we should take into account?
>>>
>>> start/end would be the one to add, 4.4 seems broken in respect to THP
>>> and mmu notification. Another solution is to fix user of mmu notifier,
>>> they were only a handful back then. For instance properly adjust the
>>> address to match first address covered by pmd or pud and passing down
>>> correct page size to mmu_notifier_invalidate_page() would allow to fix
>>> this easily.
>>>
>>> This is ok because user of try_to_unmap_one() replace the pte/pmd/pud
>>> with an invalid one (either poison, migration or swap) inside the
>>> function. So anyone racing would synchronize on those special entry
>>> hence why it is fine to delay mmu_notifier_invalidate_page() to after
>>> dropping the page table lock.
>>>
>>> Adding start/end might the solution with less code churn as you would
>>> only need to change try_to_unmap_one().
>>
>> What about dependencies? 369ea8242c0fb sounds like it needs work for all
>> notifiers need to be updated as well.
> 
> This commit remove mmu_notifier_invalidate_page() hence why everything
> need to be updated. But in 4.4 you can get away with just adding start/
> end and keep around mmu_notifier_invalidate_page() to minimize disruption.
> 
> So the new semantic in 369ea8242c0fb is that all page table changes are
> bracketed with mmu notifier start/end calls and invalidate_range right
> after tlb flush. This simplify thing and make it more reliable for mmu
> notifier users like IOMMU or ODP or GPUs drivers.

Here is what I came up with by adding the start/end calls to the 4.4 version
of try_to_unmap_one.  Note that this assumes/uses the new routine
adjust_range_if_pmd_sharing_possible to adjust the notifier/flush range if
huge pmd sharing is possible.  I changed the mmu_notifier_invalidate_page
to a mmu_notifier_invalidate_range, but am not sure if that needs to happen
earlier in the routine (like right after tlb flush as you said above).
Does this look reasonable?

diff --git a/mm/rmap.c b/mm/rmap.c
index b577fbb98d4b..7ba8bfeddb4b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1302,11 +1302,30 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	pte_t pteval;
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
+	unsigned long start = address, end;
 	enum ttu_flags flags = (enum ttu_flags)arg;
 
 	/* munlock has nothing to gain from examining un-locked vmas */
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
-		goto out;
+		return ret;
+
+	/*
+	 * For THP, we have to assume the worse case ie pmd for invalidation.
+	 * For hugetlb, it could be much worse if we need to do pud
+	 * invalidation in the case of pmd sharing.
+	 *
+	 * Note that the page can not be free in this function as call of
+	 * try_to_unmap() must hold a reference on the page.
+	 */
+	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
+	if (PageHuge(page)) {
+		/*
+		 * If sharing is possible, start and end will be adjusted
+		 * accordingly.
+		 */
+		adjust_range_if_pmd_sharing_possible(vma, &start, &end);
+	}
+	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
 
 	pte = page_check_address(page, mm, address, &ptl, 0);
 	if (!pte)
@@ -1334,6 +1353,29 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		}
   	}
 
+	if (PageHuge(page) && huge_pmd_unshare(mm, &address, pte)) {
+		/*
+		 * huge_pmd_unshare unmapped an entire PMD page.  There is
+		 * no way of knowing exactly which PMDs may be cached for
+		 * this mm, so flush them all.  start/end were already
+		 * adjusted to cover this range.
+		 */
+		flush_cache_range(vma, start, end);
+		flush_tlb_range(vma, start, end);
+
+		/*
+		 * The ref count of the PMD page was dropped which is part
+		 * of the way map counting is done for shared PMDs.  When
+		 * there is no other sharing, huge_pmd_unshare returns false
+		 * and we will unmap the actual page and drop map count
+		 * to zero.
+		 *
+		 * Note that huge_pmd_unshare modified address and is likely
+		 * not what you would expect.
+		 */
+		goto out_unmap;
+	}
+
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
 	if (should_defer_flush(mm, flags)) {
@@ -1424,10 +1466,11 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	page_cache_release(page);
 
 out_unmap:
-	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_range(mm, start, end);
+	pte_unmap_unlock(pte, ptl);
 out:
+	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
 	return ret;
 }
 
-- 
Mike Kravetz

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-29 21:11                 ` Jerome Glisse
  2018-08-30  0:40                   ` Mike Kravetz
@ 2018-08-30 10:56                   ` Michal Hocko
  2018-08-30 14:08                     ` Jerome Glisse
  1 sibling, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2018-08-30 10:56 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Mike Kravetz, linux-mm, linux-kernel, Kirill A . Shutemov,
	Vlastimil Babka, Naoya Horiguchi, Davidlohr Bueso, Andrew Morton,
	stable, linux-rdma, Matan Barak, Leon Romanovsky,
	Dimitri Sivanich

On Wed 29-08-18 17:11:07, Jerome Glisse wrote:
> On Wed, Aug 29, 2018 at 08:39:06PM +0200, Michal Hocko wrote:
> > On Wed 29-08-18 14:14:25, Jerome Glisse wrote:
> > > On Wed, Aug 29, 2018 at 10:24:44AM -0700, Mike Kravetz wrote:
> > [...]
> > > > What would be the best mmu notifier interface to use where there are no
> > > > start/end calls?
> > > > Or, is the best solution to add the start/end calls as is done in later
> > > > versions of the code?  If that is the suggestion, has there been any change
> > > > in invalidate start/end semantics that we should take into account?
> > > 
> > > start/end would be the one to add, 4.4 seems broken in respect to THP
> > > and mmu notification. Another solution is to fix user of mmu notifier,
> > > they were only a handful back then. For instance properly adjust the
> > > address to match first address covered by pmd or pud and passing down
> > > correct page size to mmu_notifier_invalidate_page() would allow to fix
> > > this easily.
> > > 
> > > This is ok because user of try_to_unmap_one() replace the pte/pmd/pud
> > > with an invalid one (either poison, migration or swap) inside the
> > > function. So anyone racing would synchronize on those special entry
> > > hence why it is fine to delay mmu_notifier_invalidate_page() to after
> > > dropping the page table lock.
> > > 
> > > Adding start/end might the solution with less code churn as you would
> > > only need to change try_to_unmap_one().
> > 
> > What about dependencies? 369ea8242c0fb sounds like it needs work for all
> > notifiers need to be updated as well.
> 
> This commit remove mmu_notifier_invalidate_page() hence why everything
> need to be updated. But in 4.4 you can get away with just adding start/
> end and keep around mmu_notifier_invalidate_page() to minimize disruption.

OK, this is really interesting. I was really worried about changing the
semantics of the mmu notifiers in stable kernels because this is really
a hard-to-review change and high risk for anybody running those old
kernels. If we can keep mmu_notifier_invalidate_page and wrap it
in the range scope API then this sounds like the best way forward.

So just to make sure we are on the same page: does this sound good for a
stable 4.4 backport? Mike's hugetlb pmd sharing fixup can be applied on
top. What do you think?


From 70a2285b058073eeb2971b94b7e6c8067d2d161a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@redhat.com>
Date: Thu, 31 Aug 2017 17:17:27 -0400
Subject: [PATCH] mm/rmap: update to new mmu_notifier semantic v2
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

commit 369ea8242c0fb5239b4ddf0dc568f694bd244de4 upstream.

Please note that this patch differs from the mainline because we do not
really replace mmu_notifier_invalidate_page by mmu_notifier_invalidate_range
because that requires changes to most of existing mmu notifiers. We also
do not want to change the semantic of this API in old kernels. Anyway
Jerome has suggested that it should be sufficient to simply wrap
mmu_notifier_invalidate_page by *_invalidate_range_start()/end() to fix
invalidation of larger than pte mappings (e.g. THP/hugetlb pages during
migration). We need this change to handle large (hugetlb/THP) pages
migration properly.

Note that because we can not presume the pmd value or pte value we have
to assume the worst and unconditionally report an invalidation as
happening.

Changed since v2:
  - try_to_unmap_one() only one call to mmu_notifier_invalidate_range()
  - compute end with PAGE_SIZE << compound_order(page)
  - fix PageHuge() case in try_to_unmap_one()

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Bernhard Held <berny156@gmx.de>
Cc: Adam Borowski <kilobyte@angband.pl>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: axie <axie@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Michal Hocko <mhocko@suse.com> # backport to 4.4
---
 mm/rmap.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/rmap.c b/mm/rmap.c
index 1bceb49aa214..364d245e6411 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1324,6 +1324,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	pte_t pteval;
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
+	unsigned long start = address, end;
 	enum ttu_flags flags = (enum ttu_flags)arg;
 
 	/* munlock has nothing to gain from examining un-locked vmas */
@@ -1356,6 +1357,14 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		}
   	}
 
+	/*
+	 * We have to assume the worse case ie pmd for invalidation. Note that
+	 * the page can not be free in this function as call of try_to_unmap()
+	 * must hold a reference on the page.
+	 */
+	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
+	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
+
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
 	if (should_defer_flush(mm, flags)) {
@@ -1449,6 +1458,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
 		mmu_notifier_invalidate_page(mm, address);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
 out:
 	return ret;
 }
-- 
2.18.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-30 10:56                   ` Michal Hocko
@ 2018-08-30 14:08                     ` Jerome Glisse
  2018-08-30 16:19                       ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Jerome Glisse @ 2018-08-30 14:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, linux-mm, linux-kernel, Kirill A . Shutemov,
	Vlastimil Babka, Naoya Horiguchi, Davidlohr Bueso, Andrew Morton,
	stable, linux-rdma, Matan Barak, Leon Romanovsky,
	Dimitri Sivanich

On Thu, Aug 30, 2018 at 12:56:16PM +0200, Michal Hocko wrote:
> On Wed 29-08-18 17:11:07, Jerome Glisse wrote:
> > On Wed, Aug 29, 2018 at 08:39:06PM +0200, Michal Hocko wrote:
> > > On Wed 29-08-18 14:14:25, Jerome Glisse wrote:
> > > > On Wed, Aug 29, 2018 at 10:24:44AM -0700, Mike Kravetz wrote:
> > > [...]
> > > > > What would be the best mmu notifier interface to use where there are no
> > > > > start/end calls?
> > > > > Or, is the best solution to add the start/end calls as is done in later
> > > > > versions of the code?  If that is the suggestion, has there been any change
> > > > > in invalidate start/end semantics that we should take into account?
> > > > 
> > > > start/end would be the one to add, 4.4 seems broken in respect to THP
> > > > and mmu notification. Another solution is to fix user of mmu notifier,
> > > > they were only a handful back then. For instance properly adjust the
> > > > address to match first address covered by pmd or pud and passing down
> > > > correct page size to mmu_notifier_invalidate_page() would allow to fix
> > > > this easily.
> > > > 
> > > > This is ok because user of try_to_unmap_one() replace the pte/pmd/pud
> > > > with an invalid one (either poison, migration or swap) inside the
> > > > function. So anyone racing would synchronize on those special entry
> > > > hence why it is fine to delay mmu_notifier_invalidate_page() to after
> > > > dropping the page table lock.
> > > > 
> > > > Adding start/end might the solution with less code churn as you would
> > > > only need to change try_to_unmap_one().
> > > 
> > > What about dependencies? 369ea8242c0fb sounds like it needs work for all
> > > notifiers need to be updated as well.
> > 
> > This commit remove mmu_notifier_invalidate_page() hence why everything
> > need to be updated. But in 4.4 you can get away with just adding start/
> > end and keep around mmu_notifier_invalidate_page() to minimize disruption.
> 
> OK, this is really interesting. I was really worried to change the
> semantic of the mmu notifiers in stable kernels because this is really
> a hard to review change and high risk for anybody running those old
> kernels. If we can keep the mmu_notifier_invalidate_page and wrap them
> into the range scope API then this sounds like the best way forward.
> 
> So just to make sure we are at the same page. Does this sounds goo for
> stable 4.4. backport? Mike's hugetlb pmd shared fixup can be applied on
> top. What do you think?

You need to invalidate outside the page table lock, so before the call to
page_check_address(). For instance like the patch below, which also only
does the range invalidation for huge pages, which would avoid too much of
a behavior change for users of mmu notifiers.

From 1be4109cfbf1c475ad67a5a57c87c74fd183ab1d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@redhat.com>
Date: Thu, 31 Aug 2017 17:17:27 -0400
Subject: [PATCH] mm/rmap: update to new mmu_notifier semantic v2
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

commit 369ea8242c0fb5239b4ddf0dc568f694bd244de4 upstream.

Please note that this patch differs from the mainline because we do not
really replace mmu_notifier_invalidate_page by mmu_notifier_invalidate_range
because that requires changes to most of existing mmu notifiers. We also
do not want to change the semantic of this API in old kernels. Anyway
Jérôme has suggested that it should be sufficient to simply wrap
mmu_notifier_invalidate_page by *_invalidate_range_start()/end() to fix
invalidation of larger than pte mappings (e.g. THP/hugetlb pages during
migration). We need this change to handle large (hugetlb/THP) pages
migration properly.

Note that because we can not presume the pmd value or pte value we have
to assume the worst and unconditionally report an invalidation as
happening.

Changed since v2:
  - try_to_unmap_one() only one call to mmu_notifier_invalidate_range()
  - compute end with PAGE_SIZE << compound_order(page)
  - fix PageHuge() case in try_to_unmap_one()

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Bernhard Held <berny156@gmx.de>
Cc: Adam Borowski <kilobyte@angband.pl>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: axie <axie@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Michal Hocko <mhocko@suse.com> # backport to 4.4
---
 mm/rmap.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index b577fbb98d4b..a77f15dc0cf1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1302,15 +1302,30 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	pte_t pteval;
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
+	unsigned long start = address, end;
 	enum ttu_flags flags = (enum ttu_flags)arg;
 
 	/* munlock has nothing to gain from examining un-locked vmas */
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
 		goto out;
 
+	if (unlikely(PageHuge(page))) {
+		/*
+		 * We have to assume the worse case ie pmd for invalidation.
+		 * Note that the page can not be free in this function as call
+		 * of try_to_unmap() must hold a reference on the page.
+		 *
+		 * This is ok to invalidate even if are not unmapping anything
+		 * ie below page_check_address() returning NULL.
+		 */
+		end = min(vma->vm_end, start + (PAGE_SIZE <<
+						compound_order(page)));
+		mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
+	}
+
 	pte = page_check_address(page, mm, address, &ptl, 0);
 	if (!pte)
-		goto out;
+		goto out_notify;
 
 	/*
 	 * If the page is mlock()d, we cannot swap it out.
@@ -1427,6 +1442,9 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
 		mmu_notifier_invalidate_page(mm, address);
+out_notify:
+	if (unlikely(PageHuge(page)))
+		mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
 out:
 	return ret;
 }
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-30 14:08                     ` Jerome Glisse
@ 2018-08-30 16:19                       ` Michal Hocko
  2018-08-30 16:57                         ` Jerome Glisse
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2018-08-30 16:19 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Mike Kravetz, linux-mm, linux-kernel, Kirill A . Shutemov,
	Vlastimil Babka, Naoya Horiguchi, Davidlohr Bueso, Andrew Morton,
	stable, linux-rdma, Matan Barak, Leon Romanovsky,
	Dimitri Sivanich

On Thu 30-08-18 10:08:25, Jerome Glisse wrote:
> On Thu, Aug 30, 2018 at 12:56:16PM +0200, Michal Hocko wrote:
> > On Wed 29-08-18 17:11:07, Jerome Glisse wrote:
> > > On Wed, Aug 29, 2018 at 08:39:06PM +0200, Michal Hocko wrote:
> > > > On Wed 29-08-18 14:14:25, Jerome Glisse wrote:
> > > > > On Wed, Aug 29, 2018 at 10:24:44AM -0700, Mike Kravetz wrote:
> > > > [...]
> > > > > > What would be the best mmu notifier interface to use where there are no
> > > > > > start/end calls?
> > > > > > Or, is the best solution to add the start/end calls as is done in later
> > > > > > versions of the code?  If that is the suggestion, has there been any change
> > > > > > in invalidate start/end semantics that we should take into account?
> > > > > 
> > > > > start/end would be the one to add, 4.4 seems broken in respect to THP
> > > > > and mmu notification. Another solution is to fix user of mmu notifier,
> > > > > they were only a handful back then. For instance properly adjust the
> > > > > address to match first address covered by pmd or pud and passing down
> > > > > correct page size to mmu_notifier_invalidate_page() would allow to fix
> > > > > this easily.
> > > > > 
> > > > > This is ok because user of try_to_unmap_one() replace the pte/pmd/pud
> > > > > with an invalid one (either poison, migration or swap) inside the
> > > > > function. So anyone racing would synchronize on those special entry
> > > > > hence why it is fine to delay mmu_notifier_invalidate_page() to after
> > > > > dropping the page table lock.
> > > > > 
> > > > > Adding start/end might the solution with less code churn as you would
> > > > > only need to change try_to_unmap_one().
> > > > 
> > > > What about dependencies? 369ea8242c0fb sounds like it needs work for all
> > > > notifiers need to be updated as well.
> > > 
> > > This commit remove mmu_notifier_invalidate_page() hence why everything
> > > need to be updated. But in 4.4 you can get away with just adding start/
> > > end and keep around mmu_notifier_invalidate_page() to minimize disruption.
> > 
> > OK, this is really interesting. I was really worried to change the
> > semantic of the mmu notifiers in stable kernels because this is really
> > a hard to review change and high risk for anybody running those old
> > kernels. If we can keep the mmu_notifier_invalidate_page and wrap them
> > into the range scope API then this sounds like the best way forward.
> > 
> > So just to make sure we are at the same page. Does this sounds goo for
> > stable 4.4. backport? Mike's hugetlb pmd shared fixup can be applied on
> > top. What do you think?
> 
> You need to invalidate outside page table lock so before the call to
> page_check_address(). For instance like below patch, which also only
> do the range invalidation for huge page which would avoid too much of
> a behavior change for user of mmu notifier.

Right. I would rather not make this PageHuge special though. So the
fixed version should be.

From c05849f6789ec36e2ff11adcd8fa6cfb05e870a9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@redhat.com>
Date: Thu, 31 Aug 2017 17:17:27 -0400
Subject: [PATCH] mm/rmap: update to new mmu_notifier semantic v2
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

commit 369ea8242c0fb5239b4ddf0dc568f694bd244de4 upstream.

Please note that this patch differs from the mainline because we do not
really replace mmu_notifier_invalidate_page by mmu_notifier_invalidate_range
because that requires changes to most of existing mmu notifiers. We also
do not want to change the semantic of this API in old kernels. Anyway
Jerome has suggested that it should be sufficient to simply wrap
mmu_notifier_invalidate_page by *_invalidate_range_start()/end() to fix
invalidation of larger than pte mappings (e.g. THP/hugetlb pages during
migration). We need this change to handle large (hugetlb/THP) pages
migration properly.

Note that because we can not presume the pmd value or pte value we have
to assume the worst and unconditionally report an invalidation as
happening.

Changed since v2:
  - try_to_unmap_one() only one call to mmu_notifier_invalidate_range()
  - compute end with PAGE_SIZE << compound_order(page)
  - fix PageHuge() case in try_to_unmap_one()

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Bernhard Held <berny156@gmx.de>
Cc: Adam Borowski <kilobyte@angband.pl>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: axie <axie@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Michal Hocko <mhocko@suse.com> # backport to 4.4
---
 mm/rmap.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/rmap.c b/mm/rmap.c
index 1bceb49aa214..aba994f55d6c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1324,12 +1324,21 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	pte_t pteval;
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
+	unsigned long start = address, end;
 	enum ttu_flags flags = (enum ttu_flags)arg;
 
 	/* munlock has nothing to gain from examining un-locked vmas */
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
 		goto out;
 
+	/*
+	 * We have to assume the worse case ie pmd for invalidation. Note that
+	 * the page can not be free in this function as call of try_to_unmap()
+	 * must hold a reference on the page.
+	 */
+	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
+	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
+
 	pte = page_check_address(page, mm, address, &ptl, 0);
 	if (!pte)
 		goto out;
@@ -1450,6 +1459,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
 		mmu_notifier_invalidate_page(mm, address);
 out:
+	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
 	return ret;
 }
 
-- 
2.18.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-30 16:19                       ` Michal Hocko
@ 2018-08-30 16:57                         ` Jerome Glisse
  2018-08-30 18:05                           ` Mike Kravetz
  0 siblings, 1 reply; 29+ messages in thread
From: Jerome Glisse @ 2018-08-30 16:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, linux-mm, linux-kernel, Kirill A . Shutemov,
	Vlastimil Babka, Naoya Horiguchi, Davidlohr Bueso, Andrew Morton,
	stable, linux-rdma, Matan Barak, Leon Romanovsky,
	Dimitri Sivanich

On Thu, Aug 30, 2018 at 06:19:52PM +0200, Michal Hocko wrote:
> On Thu 30-08-18 10:08:25, Jerome Glisse wrote:
> > On Thu, Aug 30, 2018 at 12:56:16PM +0200, Michal Hocko wrote:
> > > On Wed 29-08-18 17:11:07, Jerome Glisse wrote:
> > > > On Wed, Aug 29, 2018 at 08:39:06PM +0200, Michal Hocko wrote:
> > > > > On Wed 29-08-18 14:14:25, Jerome Glisse wrote:
> > > > > > On Wed, Aug 29, 2018 at 10:24:44AM -0700, Mike Kravetz wrote:
> > > > > [...]
> > > > > > > What would be the best mmu notifier interface to use where there are no
> > > > > > > start/end calls?
> > > > > > > Or, is the best solution to add the start/end calls as is done in later
> > > > > > > versions of the code?  If that is the suggestion, has there been any change
> > > > > > > in invalidate start/end semantics that we should take into account?
> > > > > > 
> > > > > > start/end would be the one to add, 4.4 seems broken in respect to THP
> > > > > > and mmu notification. Another solution is to fix user of mmu notifier,
> > > > > > they were only a handful back then. For instance properly adjust the
> > > > > > address to match first address covered by pmd or pud and passing down
> > > > > > correct page size to mmu_notifier_invalidate_page() would allow to fix
> > > > > > this easily.
> > > > > > 
> > > > > > This is ok because user of try_to_unmap_one() replace the pte/pmd/pud
> > > > > > with an invalid one (either poison, migration or swap) inside the
> > > > > > function. So anyone racing would synchronize on those special entry
> > > > > > hence why it is fine to delay mmu_notifier_invalidate_page() to after
> > > > > > dropping the page table lock.
> > > > > > 
> > > > > > Adding start/end might the solution with less code churn as you would
> > > > > > only need to change try_to_unmap_one().
> > > > > 
> > > > > What about dependencies? 369ea8242c0fb sounds like it needs work for all
> > > > > notifiers need to be updated as well.
> > > > 
> > > > This commit remove mmu_notifier_invalidate_page() hence why everything
> > > > need to be updated. But in 4.4 you can get away with just adding start/
> > > > end and keep around mmu_notifier_invalidate_page() to minimize disruption.
> > > 
> > > OK, this is really interesting. I was really worried to change the
> > > semantic of the mmu notifiers in stable kernels because this is really
> > > a hard to review change and high risk for anybody running those old
> > > kernels. If we can keep the mmu_notifier_invalidate_page and wrap them
> > > into the range scope API then this sounds like the best way forward.
> > > 
> > > So just to make sure we are at the same page. Does this sounds goo for
> > > stable 4.4. backport? Mike's hugetlb pmd shared fixup can be applied on
> > > top. What do you think?
> > 
> > You need to invalidate outside page table lock so before the call to
> > page_check_address(). For instance like below patch, which also only
> > do the range invalidation for huge page which would avoid too much of
> > a behavior change for user of mmu notifier.
> 
> Right. I would rather not make this PageHuge special though. So the
> fixed version should be.

Why not test for huge? Only huge pages are broken, and thus only they
need the extra range invalidation. Doing the double invalidation
for a single page is a bit overkill.

Also, the patch below is bogus: you need to add an out_notify: label to
avoid an imbalance in the start/end callbacks.
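
Put together, the structure being suggested would look roughly like the
sketch below (against the 4.4 try_to_unmap_one() prototype; the PageHuge()
guard reflects the suggestion above and is not part of the patch quoted
below, and the elided parts are the unchanged 4.4 code):

	static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
				    unsigned long address, void *arg)
	{
		...
		unsigned long start = address, end = address;
		...
		/* the munlock early exit above still uses "goto out":
		 * range_start has not been called yet at that point */

		/* only hugetlb pages need the wider range invalidation */
		if (PageHuge(page)) {
			end = min(vma->vm_end,
				  start + (PAGE_SIZE << compound_order(page)));
			mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
		}

		pte = page_check_address(page, mm, address, &ptl, 0);
		if (!pte)
			goto out_notify;	/* not "goto out": range_end() must still run */
		...
		if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
			mmu_notifier_invalidate_page(mm, address);
	out_notify:
		if (PageHuge(page))
			mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
	out:
		return ret;
	}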

> 
> From c05849f6789ec36e2ff11adcd8fa6cfb05e870a9 Mon Sep 17 00:00:00 2001
> From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@redhat.com>
> Date: Thu, 31 Aug 2017 17:17:27 -0400
> Subject: [PATCH] mm/rmap: update to new mmu_notifier semantic v2
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> commit 369ea8242c0fb5239b4ddf0dc568f694bd244de4 upstrea.
> 
> Please note that this patch differs from the mainline because we do not
> really replace mmu_notifier_invalidate_page by mmu_notifier_invalidate_range
> because that requires changes to most of existing mmu notifiers. We also
> do not want to change the semantic of this API in old kernels. Anyway
> Jerome has suggested that it should be sufficient to simply wrap
> mmu_notifier_invalidate_page by *_invalidate_range_start()/end() to fix
> invalidation of larger than pte mappings (e.g. THP/hugetlb pages during
> migration). We need this change to handle large (hugetlb/THP) pages
> migration properly.
> 
> Note that because we can not presume the pmd value or pte value we have
> to assume the worst and unconditionaly report an invalidation as
> happening.
> 
> Changed since v2:
>   - try_to_unmap_one() only one call to mmu_notifier_invalidate_range()
>   - compute end with PAGE_SIZE << compound_order(page)
>   - fix PageHuge() case in try_to_unmap_one()
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Bernhard Held <berny156@gmx.de>
> Cc: Adam Borowski <kilobyte@angband.pl>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: Wanpeng Li <kernellwp@gmail.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Takashi Iwai <tiwai@suse.de>
> Cc: Nadav Amit <nadav.amit@gmail.com>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: axie <axie@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Michal Hocko <mhocko@suse.com> # backport to 4.4
> ---
>  mm/rmap.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 1bceb49aa214..aba994f55d6c 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1324,12 +1324,21 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	pte_t pteval;
>  	spinlock_t *ptl;
>  	int ret = SWAP_AGAIN;
> +	unsigned long start = address, end;
>  	enum ttu_flags flags = (enum ttu_flags)arg;
>  
>  	/* munlock has nothing to gain from examining un-locked vmas */
>  	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
>  		goto out;
>  
> +	/*
> +	 * We have to assume the worse case ie pmd for invalidation. Note that
> +	 * the page can not be free in this function as call of try_to_unmap()
> +	 * must hold a reference on the page.
> +	 */
> +	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
> +	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> +
>  	pte = page_check_address(page, mm, address, &ptl, 0);
>  	if (!pte)
>  		goto out;

Instead:

>  		goto out_notify;

> @@ -1450,6 +1459,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
>  		mmu_notifier_invalidate_page(mm, address);

+out_notify:

> +	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
>  out:
>  	return ret;
>  }
>  
> -- 
> 2.18.0
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-30 16:57                         ` Jerome Glisse
@ 2018-08-30 18:05                           ` Mike Kravetz
  2018-08-30 18:39                             ` Jerome Glisse
  0 siblings, 1 reply; 29+ messages in thread
From: Mike Kravetz @ 2018-08-30 18:05 UTC (permalink / raw)
  To: Jerome Glisse, Michal Hocko
  Cc: linux-mm, linux-kernel, Kirill A . Shutemov, Vlastimil Babka,
	Naoya Horiguchi, Davidlohr Bueso, Andrew Morton, stable,
	linux-rdma, Matan Barak, Leon Romanovsky, Dimitri Sivanich

On 08/30/2018 09:57 AM, Jerome Glisse wrote:
> On Thu, Aug 30, 2018 at 06:19:52PM +0200, Michal Hocko wrote:
>> On Thu 30-08-18 10:08:25, Jerome Glisse wrote:
>>> On Thu, Aug 30, 2018 at 12:56:16PM +0200, Michal Hocko wrote:
>>>> On Wed 29-08-18 17:11:07, Jerome Glisse wrote:
>>>>> On Wed, Aug 29, 2018 at 08:39:06PM +0200, Michal Hocko wrote:
>>>>>> On Wed 29-08-18 14:14:25, Jerome Glisse wrote:
>>>>>>> On Wed, Aug 29, 2018 at 10:24:44AM -0700, Mike Kravetz wrote:
>>>>>> [...]
>>>>>>>> What would be the best mmu notifier interface to use where there are no
>>>>>>>> start/end calls?
>>>>>>>> Or, is the best solution to add the start/end calls as is done in later
>>>>>>>> versions of the code?  If that is the suggestion, has there been any change
>>>>>>>> in invalidate start/end semantics that we should take into account?
>>>>>>>
>>>>>>> start/end would be the one to add, 4.4 seems broken in respect to THP
>>>>>>> and mmu notification. Another solution is to fix user of mmu notifier,
>>>>>>> they were only a handful back then. For instance properly adjust the
>>>>>>> address to match first address covered by pmd or pud and passing down
>>>>>>> correct page size to mmu_notifier_invalidate_page() would allow to fix
>>>>>>> this easily.
>>>>>>>
>>>>>>> This is ok because user of try_to_unmap_one() replace the pte/pmd/pud
>>>>>>> with an invalid one (either poison, migration or swap) inside the
>>>>>>> function. So anyone racing would synchronize on those special entry
>>>>>>> hence why it is fine to delay mmu_notifier_invalidate_page() to after
>>>>>>> dropping the page table lock.
>>>>>>>
>>>>>>> Adding start/end might the solution with less code churn as you would
>>>>>>> only need to change try_to_unmap_one().
>>>>>>
>>>>>> What about dependencies? 369ea8242c0fb sounds like it needs work for all
>>>>>> notifiers need to be updated as well.
>>>>>
>>>>> This commit remove mmu_notifier_invalidate_page() hence why everything
>>>>> need to be updated. But in 4.4 you can get away with just adding start/
>>>>> end and keep around mmu_notifier_invalidate_page() to minimize disruption.
>>>>
>>>> OK, this is really interesting. I was really worried to change the
>>>> semantic of the mmu notifiers in stable kernels because this is really
>>>> a hard to review change and high risk for anybody running those old
>>>> kernels. If we can keep the mmu_notifier_invalidate_page and wrap them
>>>> into the range scope API then this sounds like the best way forward.
>>>>
>>>> So just to make sure we are at the same page. Does this sounds goo for
>>>> stable 4.4. backport? Mike's hugetlb pmd shared fixup can be applied on
>>>> top. What do you think?
>>>
>>> You need to invalidate outside page table lock so before the call to
>>> page_check_address(). For instance like below patch, which also only
>>> do the range invalidation for huge page which would avoid too much of
>>> a behavior change for user of mmu notifier.
>>
>> Right. I would rather not make this PageHuge special though. So the
>> fixed version should be.
> 
> Why not testing for huge ? Only huge is broken and thus only that
> need the extra range invalidation. Doing the double invalidation
> for single page is bit overkill.

I am a bit confused, and I hope this does not add to anyone else's confusion.

IIUC, the patch below does not attempt to 'fix' anything.  It is simply
there to add the start/end notifiers to the v4.4 version of this routine
so that a subsequent patch can use them (with modified ranges) to handle
unmapping a shared pmd huge page.  That is the mainline fix which started
this thread.

Since we are only/mostly interested in fixing the shared pmd issue in
4.4, how about just adding the start/end notifiers to the very specific
case where pmd sharing is possible?
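
Very roughly, that narrower approach could look like the sketch below.
pmd_sharing_possible() is a hypothetical helper standing in for whatever
test the shared pmd patch ends up using; PUD_SIZE is the range a shared
PMD page covers:

	unsigned long start = address, end = address;
	bool huge_range = false;

	if (PageHuge(page) && pmd_sharing_possible(vma, address)) {
		/*
		 * huge_pmd_unshare() can tear down the whole shared PMD
		 * page, so cover the full PUD_SIZE-aligned range.
		 */
		start = address & PUD_MASK;
		end = start + PUD_SIZE;
		huge_range = true;
		mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
	}

	... existing unmap logic ...

	if (huge_range)
		mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);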

I can see the value in trying to backport dependent patches such as this
so that stable releases look more like mainline.  However, I am not sure
of the value in this case, as this patch was part of a larger set changing
notifier semantics.

-- 
Mike Kravetz

> Also below is bogus you need to add a out_notify: label to avoid
> an inbalance in start/end callback.
> 
>>
>> From c05849f6789ec36e2ff11adcd8fa6cfb05e870a9 Mon Sep 17 00:00:00 2001
>> From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@redhat.com>
>> Date: Thu, 31 Aug 2017 17:17:27 -0400
>> Subject: [PATCH] mm/rmap: update to new mmu_notifier semantic v2
>> MIME-Version: 1.0
>> Content-Type: text/plain; charset=UTF-8
>> Content-Transfer-Encoding: 8bit
>>
>> commit 369ea8242c0fb5239b4ddf0dc568f694bd244de4 upstrea.
>>
>> Please note that this patch differs from the mainline because we do not
>> really replace mmu_notifier_invalidate_page by mmu_notifier_invalidate_range
>> because that requires changes to most of existing mmu notifiers. We also
>> do not want to change the semantic of this API in old kernels. Anyway
>> Jerome has suggested that it should be sufficient to simply wrap
>> mmu_notifier_invalidate_page by *_invalidate_range_start()/end() to fix
>> invalidation of larger than pte mappings (e.g. THP/hugetlb pages during
>> migration). We need this change to handle large (hugetlb/THP) pages
>> migration properly.
>>
>> Note that because we can not presume the pmd value or pte value we have
>> to assume the worst and unconditionaly report an invalidation as
>> happening.
>>
>> Changed since v2:
>>   - try_to_unmap_one() only one call to mmu_notifier_invalidate_range()
>>   - compute end with PAGE_SIZE << compound_order(page)
>>   - fix PageHuge() case in try_to_unmap_one()
>>
>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
>> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Cc: Bernhard Held <berny156@gmx.de>
>> Cc: Adam Borowski <kilobyte@angband.pl>
>> Cc: Radim Krčmář <rkrcmar@redhat.com>
>> Cc: Wanpeng Li <kernellwp@gmail.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: Takashi Iwai <tiwai@suse.de>
>> Cc: Nadav Amit <nadav.amit@gmail.com>
>> Cc: Mike Galbraith <efault@gmx.de>
>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: axie <axie@amd.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>> Signed-off-by: Michal Hocko <mhocko@suse.com> # backport to 4.4
>> ---
>>  mm/rmap.c | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 1bceb49aa214..aba994f55d6c 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1324,12 +1324,21 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>  	pte_t pteval;
>>  	spinlock_t *ptl;
>>  	int ret = SWAP_AGAIN;
>> +	unsigned long start = address, end;
>>  	enum ttu_flags flags = (enum ttu_flags)arg;
>>  
>>  	/* munlock has nothing to gain from examining un-locked vmas */
>>  	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
>>  		goto out;
>>  
>> +	/*
>> +	 * We have to assume the worse case ie pmd for invalidation. Note that
>> +	 * the page can not be free in this function as call of try_to_unmap()
>> +	 * must hold a reference on the page.
>> +	 */
>> +	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
>> +	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
>> +
>>  	pte = page_check_address(page, mm, address, &ptl, 0);
>>  	if (!pte)
>>  		goto out;
> 
> Instead
> 
>>  		goto out_notify;
> 
>> @@ -1450,6 +1459,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>  	if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
>>  		mmu_notifier_invalidate_page(mm, address);
> 
> +out_notify:
> 
>> +	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
>>  out:
>>  	return ret;
>>  }
>>  
>> -- 
>> 2.18.0
>>
>> -- 
>> Michal Hocko
>> SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-30 18:05                           ` Mike Kravetz
@ 2018-08-30 18:39                             ` Jerome Glisse
  2018-09-03  5:56                               ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Jerome Glisse @ 2018-08-30 18:39 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Michal Hocko, linux-mm, linux-kernel, Kirill A . Shutemov,
	Vlastimil Babka, Naoya Horiguchi, Davidlohr Bueso, Andrew Morton,
	stable, linux-rdma, Matan Barak, Leon Romanovsky,
	Dimitri Sivanich

On Thu, Aug 30, 2018 at 11:05:16AM -0700, Mike Kravetz wrote:
> On 08/30/2018 09:57 AM, Jerome Glisse wrote:
> > On Thu, Aug 30, 2018 at 06:19:52PM +0200, Michal Hocko wrote:
> >> On Thu 30-08-18 10:08:25, Jerome Glisse wrote:
> >>> On Thu, Aug 30, 2018 at 12:56:16PM +0200, Michal Hocko wrote:
> >>>> On Wed 29-08-18 17:11:07, Jerome Glisse wrote:
> >>>>> On Wed, Aug 29, 2018 at 08:39:06PM +0200, Michal Hocko wrote:
> >>>>>> On Wed 29-08-18 14:14:25, Jerome Glisse wrote:
> >>>>>>> On Wed, Aug 29, 2018 at 10:24:44AM -0700, Mike Kravetz wrote:
> >>>>>> [...]
> >>>>>>>> What would be the best mmu notifier interface to use where there are no
> >>>>>>>> start/end calls?
> >>>>>>>> Or, is the best solution to add the start/end calls as is done in later
> >>>>>>>> versions of the code?  If that is the suggestion, has there been any change
> >>>>>>>> in invalidate start/end semantics that we should take into account?
> >>>>>>>
> >>>>>>> start/end would be the one to add, 4.4 seems broken in respect to THP
> >>>>>>> and mmu notification. Another solution is to fix user of mmu notifier,
> >>>>>>> they were only a handful back then. For instance properly adjust the
> >>>>>>> address to match first address covered by pmd or pud and passing down
> >>>>>>> correct page size to mmu_notifier_invalidate_page() would allow to fix
> >>>>>>> this easily.
> >>>>>>>
> >>>>>>> This is ok because user of try_to_unmap_one() replace the pte/pmd/pud
> >>>>>>> with an invalid one (either poison, migration or swap) inside the
> >>>>>>> function. So anyone racing would synchronize on those special entry
> >>>>>>> hence why it is fine to delay mmu_notifier_invalidate_page() to after
> >>>>>>> dropping the page table lock.
> >>>>>>>
> >>>>>>> Adding start/end might the solution with less code churn as you would
> >>>>>>> only need to change try_to_unmap_one().
> >>>>>>
> >>>>>> What about dependencies? 369ea8242c0fb sounds like it needs work for all
> >>>>>> notifiers need to be updated as well.
> >>>>>
> >>>>> This commit remove mmu_notifier_invalidate_page() hence why everything
> >>>>> need to be updated. But in 4.4 you can get away with just adding start/
> >>>>> end and keep around mmu_notifier_invalidate_page() to minimize disruption.
> >>>>
> >>>> OK, this is really interesting. I was really worried to change the
> >>>> semantic of the mmu notifiers in stable kernels because this is really
> >>>> a hard to review change and high risk for anybody running those old
> >>>> kernels. If we can keep the mmu_notifier_invalidate_page and wrap them
> >>>> into the range scope API then this sounds like the best way forward.
> >>>>
> >>>> So just to make sure we are at the same page. Does this sounds goo for
> >>>> stable 4.4. backport? Mike's hugetlb pmd shared fixup can be applied on
> >>>> top. What do you think?
> >>>
> >>> You need to invalidate outside page table lock so before the call to
> >>> page_check_address(). For instance like below patch, which also only
> >>> do the range invalidation for huge page which would avoid too much of
> >>> a behavior change for user of mmu notifier.
> >>
> >> Right. I would rather not make this PageHuge special though. So the
> >> fixed version should be.
> > 
> > Why not testing for huge ? Only huge is broken and thus only that
> > need the extra range invalidation. Doing the double invalidation
> > for single page is bit overkill.
> 
> I am a bit confused, and hope this does not add to any confusion by others.
> 
> IIUC, the patch below does not attempt to 'fix' anything.  It is simply
> there to add the start/end notifiers to the v4.4 version of this routine
> so that a subsequent patch can use them (with modified ranges) to handle
> unmapping a shared pmd huge page.  That is the mainline fix which started
> this thread.
> 
> Since we are only/mostly interested in fixing the shared pmd issue in
> 4.4, how about just adding the start/end notifiers to the very specific
> case where pmd sharing is possible?
> 
> I can see the value in trying to back port dependent patches such as this
> so that stable releases look more like mainline.  However, I am not sure of
> the value in this case as this patch was part of a larger set changing
> notifier semantics.

For all intents and purposes this is not a backport of the original
patch, so maybe we should just drop the commit reference and simply
explain that it is there to fix the mmu notifier with respect to huge
page migration.

The original patches fix more than this case because newer features
like THP migration, THP swapping, ... added more cases where things
would have been wrong. But in the 4.4 time frame there is only hugetlbfs
migration.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-08-30 18:39                             ` Jerome Glisse
@ 2018-09-03  5:56                               ` Michal Hocko
  2018-09-04 14:00                                 ` Jerome Glisse
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2018-09-03  5:56 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Mike Kravetz, linux-mm, linux-kernel, Kirill A . Shutemov,
	Vlastimil Babka, Naoya Horiguchi, Davidlohr Bueso, Andrew Morton,
	stable, linux-rdma, Matan Barak, Leon Romanovsky,
	Dimitri Sivanich

On Thu 30-08-18 14:39:44, Jerome Glisse wrote:
> On Thu, Aug 30, 2018 at 11:05:16AM -0700, Mike Kravetz wrote:
> > On 08/30/2018 09:57 AM, Jerome Glisse wrote:
> > > On Thu, Aug 30, 2018 at 06:19:52PM +0200, Michal Hocko wrote:
> > >> On Thu 30-08-18 10:08:25, Jerome Glisse wrote:
> > >>> On Thu, Aug 30, 2018 at 12:56:16PM +0200, Michal Hocko wrote:
> > >>>> On Wed 29-08-18 17:11:07, Jerome Glisse wrote:
> > >>>>> On Wed, Aug 29, 2018 at 08:39:06PM +0200, Michal Hocko wrote:
> > >>>>>> On Wed 29-08-18 14:14:25, Jerome Glisse wrote:
> > >>>>>>> On Wed, Aug 29, 2018 at 10:24:44AM -0700, Mike Kravetz wrote:
> > >>>>>> [...]
> > >>>>>>>> What would be the best mmu notifier interface to use where there are no
> > >>>>>>>> start/end calls?
> > >>>>>>>> Or, is the best solution to add the start/end calls as is done in later
> > >>>>>>>> versions of the code?  If that is the suggestion, has there been any change
> > >>>>>>>> in invalidate start/end semantics that we should take into account?
> > >>>>>>>
> > >>>>>>> start/end would be the one to add, 4.4 seems broken in respect to THP
> > >>>>>>> and mmu notification. Another solution is to fix user of mmu notifier,
> > >>>>>>> they were only a handful back then. For instance properly adjust the
> > >>>>>>> address to match first address covered by pmd or pud and passing down
> > >>>>>>> correct page size to mmu_notifier_invalidate_page() would allow to fix
> > >>>>>>> this easily.
> > >>>>>>>
> > >>>>>>> This is ok because user of try_to_unmap_one() replace the pte/pmd/pud
> > >>>>>>> with an invalid one (either poison, migration or swap) inside the
> > >>>>>>> function. So anyone racing would synchronize on those special entry
> > >>>>>>> hence why it is fine to delay mmu_notifier_invalidate_page() to after
> > >>>>>>> dropping the page table lock.
> > >>>>>>>
> > >>>>>>> Adding start/end might the solution with less code churn as you would
> > >>>>>>> only need to change try_to_unmap_one().
> > >>>>>>
> > >>>>>> What about dependencies? 369ea8242c0fb sounds like it needs work for all
> > >>>>>> notifiers need to be updated as well.
> > >>>>>
> > >>>>> This commit remove mmu_notifier_invalidate_page() hence why everything
> > >>>>> need to be updated. But in 4.4 you can get away with just adding start/
> > >>>>> end and keep around mmu_notifier_invalidate_page() to minimize disruption.
> > >>>>
> > >>>> OK, this is really interesting. I was really worried to change the
> > >>>> semantic of the mmu notifiers in stable kernels because this is really
> > >>>> a hard to review change and high risk for anybody running those old
> > >>>> kernels. If we can keep the mmu_notifier_invalidate_page and wrap them
> > >>>> into the range scope API then this sounds like the best way forward.
> > >>>>
> > >>>> So just to make sure we are at the same page. Does this sounds goo for
> > >>>> stable 4.4. backport? Mike's hugetlb pmd shared fixup can be applied on
> > >>>> top. What do you think?
> > >>>
> > >>> You need to invalidate outside page table lock so before the call to
> > >>> page_check_address(). For instance like below patch, which also only
> > >>> do the range invalidation for huge page which would avoid too much of
> > >>> a behavior change for user of mmu notifier.
> > >>
> > >> Right. I would rather not make this PageHuge special though. So the
> > >> fixed version should be.
> > > 
> > > Why not testing for huge ? Only huge is broken and thus only that
> > > need the extra range invalidation. Doing the double invalidation
> > > for single page is bit overkill.
> > 
> > I am a bit confused, and hope this does not add to any confusion by others.
> > 
> > IIUC, the patch below does not attempt to 'fix' anything.  It is simply
> > there to add the start/end notifiers to the v4.4 version of this routine
> > so that a subsequent patch can use them (with modified ranges) to handle
> > unmapping a shared pmd huge page.  That is the mainline fix which started
> > this thread.
> > 
> > Since we are only/mostly interested in fixing the shared pmd issue in
> > 4.4, how about just adding the start/end notifiers to the very specific
> > case where pmd sharing is possible?
> > 
> > I can see the value in trying to back port dependent patches such as this
> > so that stable releases look more like mainline.  However, I am not sure of
> > the value in this case as this patch was part of a larger set changing
> > notifier semantics.
> 
> For all intents and purposes this is not a backport of the original
> patch so maybe we should just drop the commit reference and just
> explains that it is there to fix mmu notifier in respect to huge page
> migration.
> 
> The original patches fix more than this case because newer featurers
> like THP migration, THP swapping, ... added more cases where things
> would have been wrong. But in 4.4 frame there is only huge tlb fs
> migration.

And THP migration is still a problem with 4.4 AFAICS. All other cases
simply split the huge page, but THP migration keeps it in one piece and
as such is theoretically broken, as you have explained. So I would
stick with what I posted, with some more clarifications in the changelog
if you think that is appropriate (suggestions welcome).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-09-03  5:56                               ` Michal Hocko
@ 2018-09-04 14:00                                 ` Jerome Glisse
  2018-09-04 17:55                                   ` Mike Kravetz
  2018-09-05  6:57                                   ` Michal Hocko
  0 siblings, 2 replies; 29+ messages in thread
From: Jerome Glisse @ 2018-09-04 14:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, linux-mm, linux-kernel, Kirill A . Shutemov,
	Vlastimil Babka, Naoya Horiguchi, Davidlohr Bueso, Andrew Morton,
	stable, linux-rdma, Matan Barak, Leon Romanovsky,
	Dimitri Sivanich

On Mon, Sep 03, 2018 at 07:56:54AM +0200, Michal Hocko wrote:
> On Thu 30-08-18 14:39:44, Jerome Glisse wrote:
> > On Thu, Aug 30, 2018 at 11:05:16AM -0700, Mike Kravetz wrote:
> > > On 08/30/2018 09:57 AM, Jerome Glisse wrote:
> > > > On Thu, Aug 30, 2018 at 06:19:52PM +0200, Michal Hocko wrote:
> > > >> On Thu 30-08-18 10:08:25, Jerome Glisse wrote:
> > > >>> On Thu, Aug 30, 2018 at 12:56:16PM +0200, Michal Hocko wrote:
> > > >>>> On Wed 29-08-18 17:11:07, Jerome Glisse wrote:
> > > >>>>> On Wed, Aug 29, 2018 at 08:39:06PM +0200, Michal Hocko wrote:
> > > >>>>>> On Wed 29-08-18 14:14:25, Jerome Glisse wrote:
> > > >>>>>>> On Wed, Aug 29, 2018 at 10:24:44AM -0700, Mike Kravetz wrote:
> > > >>>>>> [...]
> > > >>>>>>>> What would be the best mmu notifier interface to use where there are no
> > > >>>>>>>> start/end calls?
> > > >>>>>>>> Or, is the best solution to add the start/end calls as is done in later
> > > >>>>>>>> versions of the code?  If that is the suggestion, has there been any change
> > > >>>>>>>> in invalidate start/end semantics that we should take into account?
> > > >>>>>>>
> > > >>>>>>> start/end would be the one to add, 4.4 seems broken in respect to THP
> > > >>>>>>> and mmu notification. Another solution is to fix user of mmu notifier,
> > > >>>>>>> they were only a handful back then. For instance properly adjust the
> > > >>>>>>> address to match first address covered by pmd or pud and passing down
> > > >>>>>>> correct page size to mmu_notifier_invalidate_page() would allow to fix
> > > >>>>>>> this easily.
> > > >>>>>>>
> > > >>>>>>> This is ok because user of try_to_unmap_one() replace the pte/pmd/pud
> > > >>>>>>> with an invalid one (either poison, migration or swap) inside the
> > > >>>>>>> function. So anyone racing would synchronize on those special entry
> > > >>>>>>> hence why it is fine to delay mmu_notifier_invalidate_page() to after
> > > >>>>>>> dropping the page table lock.
> > > >>>>>>>
> > > >>>>>>> Adding start/end might the solution with less code churn as you would
> > > >>>>>>> only need to change try_to_unmap_one().
> > > >>>>>>
> > > >>>>>> What about dependencies? 369ea8242c0fb sounds like it needs work for all
> > > >>>>>> notifiers need to be updated as well.
> > > >>>>>
> > > >>>>> This commit remove mmu_notifier_invalidate_page() hence why everything
> > > >>>>> need to be updated. But in 4.4 you can get away with just adding start/
> > > >>>>> end and keep around mmu_notifier_invalidate_page() to minimize disruption.
> > > >>>>
> > > >>>> OK, this is really interesting. I was really worried to change the
> > > >>>> semantic of the mmu notifiers in stable kernels because this is really
> > > >>>> a hard to review change and high risk for anybody running those old
> > > >>>> kernels. If we can keep the mmu_notifier_invalidate_page and wrap them
> > > >>>> into the range scope API then this sounds like the best way forward.
> > > >>>>
> > > >>>> So just to make sure we are at the same page. Does this sounds goo for
> > > >>>> stable 4.4. backport? Mike's hugetlb pmd shared fixup can be applied on
> > > >>>> top. What do you think?
> > > >>>
> > > >>> You need to invalidate outside page table lock so before the call to
> > > >>> page_check_address(). For instance like below patch, which also only
> > > >>> do the range invalidation for huge page which would avoid too much of
> > > >>> a behavior change for user of mmu notifier.
> > > >>
> > > >> Right. I would rather not make this PageHuge special though. So the
> > > >> fixed version should be.
> > > > 
> > > > Why not testing for huge ? Only huge is broken and thus only that
> > > > need the extra range invalidation. Doing the double invalidation
> > > > for single page is bit overkill.
> > > 
> > > I am a bit confused, and hope this does not add to any confusion by others.
> > > 
> > > IIUC, the patch below does not attempt to 'fix' anything.  It is simply
> > > there to add the start/end notifiers to the v4.4 version of this routine
> > > so that a subsequent patch can use them (with modified ranges) to handle
> > > unmapping a shared pmd huge page.  That is the mainline fix which started
> > > this thread.
> > > 
> > > Since we are only/mostly interested in fixing the shared pmd issue in
> > > 4.4, how about just adding the start/end notifiers to the very specific
> > > case where pmd sharing is possible?
> > > 
> > > I can see the value in trying to back port dependent patches such as this
> > > so that stable releases look more like mainline.  However, I am not sure of
> > > the value in this case as this patch was part of a larger set changing
> > > notifier semantics.
> > 
> > For all intents and purposes this is not a backport of the original
> > patch so maybe we should just drop the commit reference and just
> > explains that it is there to fix mmu notifier in respect to huge page
> > migration.
> > 
> > The original patches fix more than this case because newer featurers
> > like THP migration, THP swapping, ... added more cases where things
> > would have been wrong. But in 4.4 frame there is only huge tlb fs
> > migration.
> 
> And THP migration is still a problem with 4.4 AFAICS. All other cases
> simply split the huge page but THP migration keeps it in one piece and
> as such it is theoretically broken as you have explained. So I would
> stick with what I posted with some more clarifications in the changelog
> if you think it is appropriate (suggestions welcome).

Reading the code, there is no THP migration in 4.4, only hugetlb migration.
Look at handle_mm_fault(), which does not know how to handle a swap pmd;
only the hugetlbfs fault handler knows how to handle those. Hence why I
was checking for hugetlb exactly as page_check_address() does, to only do
the range invalidation for hugetlbfs migration.
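
For reference, the hugetlb special case in 4.4's page_check_address()
being mirrored here looks roughly like this (approximate sketch):

	if (unlikely(PageHuge(page))) {
		/* hugetlb pages are mapped by huge ptes, not the normal pte path */
		pte = huge_pte_offset(mm, address);
		if (!pte)
			return NULL;

		ptl = huge_pte_lockptr(page_hstate(page), mm, pte);
		goto check;
	}

The same PageHuge() test in try_to_unmap_one() is what limits the range
invalidation to hugetlbfs migration.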

But I am fine with doing the range invalidation for all pages.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-09-04 14:00                                 ` Jerome Glisse
@ 2018-09-04 17:55                                   ` Mike Kravetz
  2018-09-05  6:57                                   ` Michal Hocko
  1 sibling, 0 replies; 29+ messages in thread
From: Mike Kravetz @ 2018-09-04 17:55 UTC (permalink / raw)
  To: Jerome Glisse, Michal Hocko
  Cc: linux-mm, linux-kernel, Kirill A . Shutemov, Vlastimil Babka,
	Naoya Horiguchi, Davidlohr Bueso, Andrew Morton, stable,
	linux-rdma, Matan Barak, Leon Romanovsky, Dimitri Sivanich

On 09/04/2018 07:00 AM, Jerome Glisse wrote:
> On Mon, Sep 03, 2018 at 07:56:54AM +0200, Michal Hocko wrote:
>> On Thu 30-08-18 14:39:44, Jerome Glisse wrote:
>>> For all intents and purposes this is not a backport of the original
>>> patch so maybe we should just drop the commit reference and just
>>> explains that it is there to fix mmu notifier in respect to huge page
>>> migration.
>>>
>>> The original patches fix more than this case because newer featurers
>>> like THP migration, THP swapping, ... added more cases where things
>>> would have been wrong. But in 4.4 frame there is only huge tlb fs
>>> migration.
>>
>> And THP migration is still a problem with 4.4 AFAICS. All other cases
>> simply split the huge page but THP migration keeps it in one piece and
>> as such it is theoretically broken as you have explained. So I would
>> stick with what I posted with some more clarifications in the changelog
>> if you think it is appropriate (suggestions welcome).
> 
> Reading code there is no THP migration in 4.4 only huge tlb migration.
> Look at handle_mm_fault which do not know how to handle swap pmd, only
> the huge tlb fs fault handler knows how to handle those. Hence why i
> was checking for huge tlb exactly as page_check_address() to only range
> invalidate for huge tlb fs migration.

I agree with Jérôme that THP migration was added after 4.4.  But I could
be missing something.

> But i am fine with doing the range invalidation with all.

Since the shared pmd patch, which will ultimately go on top of this, needs
the PageHuge checks, my preference would be Jérôme's patch.

However, I am not certain we really need/want a separate patch.  We
could just add the notifiers to the shared pmd patch.  Backporting the
shared pmd patch will also require some fixup.

Either would work.  I'll admit I do not know what stable maintainers would
prefer.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages
  2018-09-04 14:00                                 ` Jerome Glisse
  2018-09-04 17:55                                   ` Mike Kravetz
@ 2018-09-05  6:57                                   ` Michal Hocko
  1 sibling, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2018-09-05  6:57 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Mike Kravetz, linux-mm, linux-kernel, Kirill A . Shutemov,
	Vlastimil Babka, Naoya Horiguchi, Davidlohr Bueso, Andrew Morton,
	stable, linux-rdma, Matan Barak, Leon Romanovsky,
	Dimitri Sivanich

On Tue 04-09-18 10:00:36, Jerome Glisse wrote:
> On Mon, Sep 03, 2018 at 07:56:54AM +0200, Michal Hocko wrote:
[...]
> > And THP migration is still a problem with 4.4 AFAICS. All other cases
> > simply split the huge page but THP migration keeps it in one piece and
> > as such it is theoretically broken as you have explained. So I would
> > stick with what I posted with some more clarifications in the changelog
> > if you think it is appropriate (suggestions welcome).
> 
> Reading code there is no THP migration in 4.4 only huge tlb migration.

Meh, you are right. For some reason I misread unmap_and_move_huge_page()
as also covering THP. Sorry for the confusion. My fault!

Then it would indeed be safer to use your backport.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2018-09-05  6:57 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-23 20:59 [PATCH v6 0/2] huge_pmd_unshare migration and flushing Mike Kravetz
2018-08-23 20:59 ` [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages Mike Kravetz
2018-08-24  2:59   ` Naoya Horiguchi
2018-08-24  8:41   ` Michal Hocko
2018-08-24 18:08     ` Mike Kravetz
2018-08-27  7:46       ` Michal Hocko
2018-08-27 13:46         ` Jerome Glisse
2018-08-27 19:09           ` Michal Hocko
2018-08-29 17:24           ` Mike Kravetz
2018-08-29 18:14             ` Jerome Glisse
2018-08-29 18:39               ` Michal Hocko
2018-08-29 21:11                 ` Jerome Glisse
2018-08-30  0:40                   ` Mike Kravetz
2018-08-30 10:56                   ` Michal Hocko
2018-08-30 14:08                     ` Jerome Glisse
2018-08-30 16:19                       ` Michal Hocko
2018-08-30 16:57                         ` Jerome Glisse
2018-08-30 18:05                           ` Mike Kravetz
2018-08-30 18:39                             ` Jerome Glisse
2018-09-03  5:56                               ` Michal Hocko
2018-09-04 14:00                                 ` Jerome Glisse
2018-09-04 17:55                                   ` Mike Kravetz
2018-09-05  6:57                                   ` Michal Hocko
2018-08-27 16:42         ` Mike Kravetz
2018-08-27 19:11       ` Michal Hocko
2018-08-24  9:25   ` Michal Hocko
2018-08-23 20:59 ` [PATCH v6 2/2] hugetlb: take PMD sharing into account when flushing tlb/caches Mike Kravetz
2018-08-24  3:07   ` Naoya Horiguchi
2018-08-24 11:35 ` [PATCH v6 0/2] huge_pmd_unshare migration and flushing Kirill A. Shutemov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).