* [PATCH v2 02/10] mm/thp: make is_huge_zero_pmd() safe and quicker
From: Hugh Dickins @ 2021-06-09  4:08 UTC
  To: Andrew Morton
  Cc: Hugh Dickins, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Shakeel Butt, Oscar Salvador, linux-mm, linux-kernel

Most callers of is_huge_zero_pmd() supply a pmd already verified present;
but a few (notably zap_huge_pmd()) do not - it might be a pmd migration
entry, in which case the pfn is encoded differently from a present pmd:
such an entry might pass the is_huge_zero_pmd() test (though not on x86,
since L1TF forced us to protect against that), or perhaps even crash in
pmd_page() applied to a swap-like entry.

Make it safe by adding a pmd_present() check into is_huge_zero_pmd()
itself; and make it quicker by saving huge_zero_pfn, so that
is_huge_zero_pmd() does not need to do that pmd_page() lookup each time.

__split_huge_pmd_locked() checked pmd_trans_huge() before: that worked,
but is unnecessary now that is_huge_zero_pmd() checks for presence itself.

Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
Patch added (replacing part of first) since the v1 series was posted.
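
As an aside, a stand-alone user-space sketch of the pattern adopted here
may help; the bit layout of the toy "pmd" below is invented purely for
illustration, it is not the kernel's encoding:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define TOY_PRESENT 0x1UL

static _Atomic unsigned long toy_huge_zero_pfn = ~0UL;	/* ~0UL means "none" */

static bool toy_pmd_present(unsigned long pmd) { return pmd & TOY_PRESENT; }
static unsigned long toy_pmd_pfn(unsigned long pmd) { return pmd >> 1; }

/* Mirrors the new check: one cheap compare, no pmd_page() lookup. */
static bool toy_is_huge_zero_pmd(unsigned long pmd)
{
	return atomic_load(&toy_huge_zero_pfn) == toy_pmd_pfn(pmd) &&
	       toy_pmd_present(pmd);
}

int main(void)
{
	unsigned long present_pmd, migration_pmd;

	atomic_store(&toy_huge_zero_pfn, 42);	/* "huge zero page allocated" */

	present_pmd = (42UL << 1) | TOY_PRESENT;
	migration_pmd = 42UL << 1;		/* same pfn bits, but not present */

	printf("present pmd:   %d\n", toy_is_huge_zero_pmd(present_pmd));	/* 1 */
	printf("migration pmd: %d\n", toy_is_huge_zero_pmd(migration_pmd));	/* 0 */
	return 0;
}

Build with e.g. "cc -std=c11": the point is only that the presence check
keeps a non-present entry, whose pfn bits happen to match, from being
mistaken for the huge zero page.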

 include/linux/huge_mm.h | 8 +++++++-
 mm/huge_memory.c        | 5 ++++-
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9626fda5efce..2a8ebe6c222e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -286,6 +286,7 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
 
 extern struct page *huge_zero_page;
+extern unsigned long huge_zero_pfn;
 
 static inline bool is_huge_zero_page(struct page *page)
 {
@@ -294,7 +295,7 @@ static inline bool is_huge_zero_page(struct page *page)
 
 static inline bool is_huge_zero_pmd(pmd_t pmd)
 {
-	return is_huge_zero_page(pmd_page(pmd));
+	return READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd) && pmd_present(pmd);
 }
 
 static inline bool is_huge_zero_pud(pud_t pud)
@@ -440,6 +441,11 @@ static inline bool is_huge_zero_page(struct page *page)
 	return false;
 }
 
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+	return false;
+}
+
 static inline bool is_huge_zero_pud(pud_t pud)
 {
 	return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 42cfefc6e66e..5885c5f5836f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -62,6 +62,7 @@ static struct shrinker deferred_split_shrinker;
 
 static atomic_t huge_zero_refcount;
 struct page *huge_zero_page __read_mostly;
+unsigned long huge_zero_pfn __read_mostly = ~0UL;
 
 bool transparent_hugepage_enabled(struct vm_area_struct *vma)
 {
@@ -98,6 +99,7 @@ static bool get_huge_zero_page(void)
 		__free_pages(zero_page, compound_order(zero_page));
 		goto retry;
 	}
+	WRITE_ONCE(huge_zero_pfn, page_to_pfn(zero_page));
 
 	/* We take additional reference here. It will be put back by shrinker */
 	atomic_set(&huge_zero_refcount, 2);
@@ -147,6 +149,7 @@ static unsigned long shrink_huge_zero_page_scan(struct shrinker *shrink,
 	if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
 		struct page *zero_page = xchg(&huge_zero_page, NULL);
 		BUG_ON(zero_page == NULL);
+		WRITE_ONCE(huge_zero_pfn, ~0UL);
 		__free_pages(zero_page, compound_order(zero_page));
 		return HPAGE_PMD_NR;
 	}
@@ -2071,7 +2074,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		return;
 	}
 
-	if (pmd_trans_huge(*pmd) && is_huge_zero_pmd(*pmd)) {
+	if (is_huge_zero_pmd(*pmd)) {
 		/*
 		 * FIXME: Do we want to invalidate secondary mmu by calling
 		 * mmu_notifier_invalidate_range() see comments below inside
-- 
2.26.2



* [PATCH v2 04/10] mm/thp: fix vma_address() if virtual address below file offset
From: Hugh Dickins @ 2021-06-09  4:14 UTC
  To: Andrew Morton
  Cc: Hugh Dickins, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Shakeel Butt, Oscar Salvador, linux-mm, linux-kernel

Running certain tests with a DEBUG_VM kernel would crash within hours,
on the total_mapcount BUG() in split_huge_page_to_list(), while trying
to free up some memory by punching a hole in a shmem huge page: split's
try_to_unmap() was unable to find all the mappings of the page (which,
on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).

When that BUG() was changed to a WARN(), it would later crash on the
VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma) in
mm/internal.h:vma_address(), used by rmap_walk_file() for try_to_unmap().

vma_address() is usually correct, but there's a wraparound case when the
vm_start address is unusually low but vm_pgoff is not so low: vma_address()
chooses max(start, vma->vm_start), but that picks the wrong address,
because start has wrapped around to almost ULONG_MAX.

Rewrite vma_address() to be more careful about vm_pgoff; move the
VM_BUG_ON_VMA() out of it, returning -EFAULT for errors, so that it can
be safely used from page_mapped_in_vma() and page_address_in_vma() too.

Add vma_address_end() to apply similar care to the end address calculation,
in page_vma_mapped_walk(), page_mkclean_one() and try_to_unmap_one();
though it raises the question of whether callers would do better to supply
pvmw->end to page_vma_mapped_walk() - I chose not to, to keep the patch
smaller.

An irritation is that their apparent generality breaks down on KSM pages,
which cannot be located by the page->index that page_to_pgoff() uses: as
4b0ece6fa016 ("mm: migrate: fix remove_migration_pte() for ksm pages")
once discovered.  I dithered over the best thing to do about that, and
have ended up with a VM_BUG_ON_PAGE(PageKsm) in both vma_address() and
vma_address_end(); though the only place in danger of using these helpers
on KSM pages was try_to_unmap_one().

Sidenote: vma_address() and vma_address_end() now use compound_nr() on
a head page, instead of thp_size(): to make the right calculation on a
hugetlbfs page, whether or not THPs are configured.  try_to_unmap() is
used on hugetlbfs pages, but perhaps the wrong calculation never mattered.

Fixes: a8fa41ad2f6f ("mm, rmap: check all VMAs that PTE-mapped THP can be part of")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
---
v2 series: adjust vma_address() to avoid 32-bit wrap, per Matthew
v2: use compound_nr() as Matthew suggested
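
For illustration only, here is a small stand-alone C program (hypothetical
numbers, plain user-space arithmetic) showing the unsigned wraparound that
the rewritten vma_address() guards against:

#include <stdio.h>

#define PAGE_SHIFT 12

int main(void)
{
	unsigned long vm_start = 0x1000;	/* unusually low vm_start */
	unsigned long vm_pgoff = 0x10;		/* but vm_pgoff not so low */
	unsigned long pgoff    = 0x8;		/* head page starts before vm_pgoff */
	unsigned long old_start, new_start;

	/* Old calculation: wraps to nearly ULONG_MAX, so max() picks it. */
	old_start = vm_start + ((pgoff - vm_pgoff) << PAGE_SHIFT);
	printf("old start: %#lx\n", old_start);

	/*
	 * New logic: notice pgoff < vm_pgoff and clamp to vm_start instead,
	 * assuming the compound page still overlaps this vma.
	 */
	new_start = (pgoff >= vm_pgoff) ?
		vm_start + ((pgoff - vm_pgoff) << PAGE_SHIFT) : vm_start;
	printf("new start: %#lx\n", new_start);
	return 0;
}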

 mm/internal.h        | 53 ++++++++++++++++++++++++++++++++++------------
 mm/page_vma_mapped.c | 16 ++++++--------
 mm/rmap.c            | 16 +++++++-------
 3 files changed, 53 insertions(+), 32 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 2f1182948aa6..e8fdb531f887 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -384,27 +384,52 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 
 /*
- * At what user virtual address is page expected in @vma?
+ * At what user virtual address is page expected in vma?
+ * Returns -EFAULT if all of the page is outside the range of vma.
+ * If page is a compound head, the entire compound page is considered.
  */
 static inline unsigned long
-__vma_address(struct page *page, struct vm_area_struct *vma)
+vma_address(struct page *page, struct vm_area_struct *vma)
 {
-	pgoff_t pgoff = page_to_pgoff(page);
-	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	pgoff_t pgoff;
+	unsigned long address;
+
+	VM_BUG_ON_PAGE(PageKsm(page), page);	/* KSM page->index unusable */
+	pgoff = page_to_pgoff(page);
+	if (pgoff >= vma->vm_pgoff) {
+		address = vma->vm_start +
+			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+		/* Check for address beyond vma (or wrapped through 0?) */
+		if (address < vma->vm_start || address >= vma->vm_end)
+			address = -EFAULT;
+	} else if (PageHead(page) &&
+		   pgoff + compound_nr(page) - 1 >= vma->vm_pgoff) {
+		/* Test above avoids possibility of wrap to 0 on 32-bit */
+		address = vma->vm_start;
+	} else {
+		address = -EFAULT;
+	}
+	return address;
 }
 
+/*
+ * Then at what user virtual address will none of the page be found in vma?
+ * Assumes that vma_address() already returned a good starting address.
+ * If page is a compound head, the entire compound page is considered.
+ */
 static inline unsigned long
-vma_address(struct page *page, struct vm_area_struct *vma)
+vma_address_end(struct page *page, struct vm_area_struct *vma)
 {
-	unsigned long start, end;
-
-	start = __vma_address(page, vma);
-	end = start + thp_size(page) - PAGE_SIZE;
-
-	/* page should be within @vma mapping range */
-	VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma);
-
-	return max(start, vma->vm_start);
+	pgoff_t pgoff;
+	unsigned long address;
+
+	VM_BUG_ON_PAGE(PageKsm(page), page);	/* KSM page->index unusable */
+	pgoff = page_to_pgoff(page) + compound_nr(page);
+	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	/* Check for address beyond vma (or wrapped through 0?) */
+	if (address < vma->vm_start || address > vma->vm_end)
+		address = vma->vm_end;
+	return address;
 }
 
 static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 5b559967410e..e37bd43904af 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -228,18 +228,18 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	if (!map_pte(pvmw))
 		goto next_pte;
 	while (1) {
+		unsigned long end;
+
 		if (check_pte(pvmw))
 			return true;
 next_pte:
 		/* Seek to next pte only makes sense for THP */
 		if (!PageTransHuge(pvmw->page) || PageHuge(pvmw->page))
 			return not_found(pvmw);
+		end = vma_address_end(pvmw->page, pvmw->vma);
 		do {
 			pvmw->address += PAGE_SIZE;
-			if (pvmw->address >= pvmw->vma->vm_end ||
-			    pvmw->address >=
-					__vma_address(pvmw->page, pvmw->vma) +
-					thp_size(pvmw->page))
+			if (pvmw->address >= end)
 				return not_found(pvmw);
 			/* Did we cross page table boundary? */
 			if (pvmw->address % PMD_SIZE == 0) {
@@ -277,14 +277,10 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
 		.vma = vma,
 		.flags = PVMW_SYNC,
 	};
-	unsigned long start, end;
-
-	start = __vma_address(page, vma);
-	end = start + thp_size(page) - PAGE_SIZE;
 
-	if (unlikely(end < vma->vm_start || start >= vma->vm_end))
+	pvmw.address = vma_address(page, vma);
+	if (pvmw.address == -EFAULT)
 		return 0;
-	pvmw.address = max(start, vma->vm_start);
 	if (!page_vma_mapped_walk(&pvmw))
 		return 0;
 	page_vma_mapped_walk_done(&pvmw);
diff --git a/mm/rmap.c b/mm/rmap.c
index 07811b4ae793..144de54efc1c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -707,7 +707,6 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
  */
 unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
 {
-	unsigned long address;
 	if (PageAnon(page)) {
 		struct anon_vma *page__anon_vma = page_anon_vma(page);
 		/*
@@ -722,10 +721,8 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
 			return -EFAULT;
 	} else
 		return -EFAULT;
-	address = __vma_address(page, vma);
-	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
-		return -EFAULT;
-	return address;
+
+	return vma_address(page, vma);
 }
 
 pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
@@ -919,7 +916,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	 */
 	mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
 				0, vma, vma->vm_mm, address,
-				min(vma->vm_end, address + page_size(page)));
+				vma_address_end(page, vma));
 	mmu_notifier_invalidate_range_start(&range);
 
 	while (page_vma_mapped_walk(&pvmw)) {
@@ -1435,9 +1432,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	 * Note that the page can not be free in this function as call of
 	 * try_to_unmap() must hold a reference on the page.
 	 */
+	range.end = PageKsm(page) ?
+			address + PAGE_SIZE : vma_address_end(page, vma);
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
-				address,
-				min(vma->vm_end, address + page_size(page)));
+				address, range.end);
 	if (PageHuge(page)) {
 		/*
 		 * If sharing is possible, start and end will be adjusted
@@ -1889,6 +1887,7 @@ static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc,
 		struct vm_area_struct *vma = avc->vma;
 		unsigned long address = vma_address(page, vma);
 
+		VM_BUG_ON_VMA(address == -EFAULT, vma);
 		cond_resched();
 
 		if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
@@ -1943,6 +1942,7 @@ static void rmap_walk_file(struct page *page, struct rmap_walk_control *rwc,
 			pgoff_start, pgoff_end) {
 		unsigned long address = vma_address(page, vma);
 
+		VM_BUG_ON_VMA(address == -EFAULT, vma);
 		cond_resched();
 
 		if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
-- 
2.26.2



* [PATCH v2 05/10] mm/thp: fix page_address_in_vma() on file THP tails
From: Hugh Dickins @ 2021-06-09  4:16 UTC
  To: Andrew Morton
  Cc: Hugh Dickins, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Shakeel Butt, Oscar Salvador, linux-mm, linux-kernel

From: Jue Wang <juew@google.com>

Anon THP tails were already supported, but memory-failure may need to use
page_address_in_vma() on file THP tails, which its page->mapping check did
not permit: fix it.

hughd adds: no current usage is known to hit the issue, but this does fix
a subtle trap in a general helper: best fixed in stable sooner rather than
later.

Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Signed-off-by: Jue Wang <juew@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
---
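For readers unfamiliar with compound pages, a stand-alone toy model (plain
C with invented fields, not the real struct page) of why checking the
tail's own ->mapping misses a file THP tail:

#include <stdio.h>

struct toy_page {
	struct toy_page *head;	/* a head page points to itself */
	void *mapping;		/* meaningful on the head page only */
};

static struct toy_page *toy_compound_head(struct toy_page *p)
{
	return p->head;
}

int main(void)
{
	int file_mapping;	/* stands in for the file's address_space */
	struct toy_page head = { &head, &file_mapping };
	struct toy_page tail = { &head, NULL };

	/* Old check: the tail's own ->mapping never matches the file's. */
	printf("old check on tail: %s\n",
	       tail.mapping == (void *)&file_mapping ? "match" : "-EFAULT");

	/* New check: look through the compound head first. */
	printf("new check on tail: %s\n",
	       toy_compound_head(&tail)->mapping == (void *)&file_mapping ?
	       "match" : "-EFAULT");
	return 0;
}
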
 mm/rmap.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 144de54efc1c..e05c300048e6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -716,11 +716,11 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
 		if (!vma->anon_vma || !page__anon_vma ||
 		    vma->anon_vma->root != page__anon_vma->root)
 			return -EFAULT;
-	} else if (page->mapping) {
-		if (!vma->vm_file || vma->vm_file->f_mapping != page->mapping)
-			return -EFAULT;
-	} else
+	} else if (!vma->vm_file) {
+		return -EFAULT;
+	} else if (vma->vm_file->f_mapping != compound_head(page)->mapping) {
 		return -EFAULT;
+	}
 
 	return vma_address(page, vma);
 }
-- 
2.26.2



* [PATCH v2 06/10] mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
From: Hugh Dickins @ 2021-06-09  4:19 UTC
  To: Andrew Morton
  Cc: Hugh Dickins, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Shakeel Butt, Oscar Salvador, linux-mm, linux-kernel

There is a race between THP unmapping and truncation, when truncate sees
pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared it,
but before its page_remove_rmap() gets to decrement compound_mapcount:
generating false "BUG: Bad page cache" reports that the page is still
mapped when deleted.  This commit fixes that, but not in the way I hoped.

The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
instead of unmap_mapping_range() in truncate_cleanup_page(): it has often
been an annoyance that we usually call unmap_mapping_range() with no pages
locked, but in truncate_cleanup_page() apply it to a single locked page.
try_to_unmap() looks more suitable for a single locked page.

However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
it is used to insert THP migration entries, but not used to unmap THPs.
Copy zap_huge_pmd() and add THP handling now?  Perhaps, but their TLB
needs are different, I'm too ignorant of the DAX cases, and couldn't
decide how far to go for anon+swap.  Set that aside.

The second attempt took a different tack: make no change in truncate.c,
but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
clearing it initially, then pmd_clear() between page_remove_rmap() and
unlocking at the end.  Nice.  But powerpc blows that approach out of the
water, with its serialize_against_pte_lookup(), and interesting pgtable
usage.  It would need serious help to get working on powerpc (with a
minor optimization issue on s390 too).  Set that aside.

Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
delay, after unmapping in truncate_cleanup_page()?  Perhaps: though
that's likely to reduce or eliminate the number of incidents, it would
give less assurance that we had identified the problem correctly.

This successful iteration introduces "unmap_mapping_page(page)" instead
of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
with an addition to details.  Then zap_pmd_range() watches for this case,
and does spin_unlock(pmd_lock) if so - just like page_vma_mapped_walk()
now does in the PVMW_SYNC case.  Not pretty, but safe.

Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
assert its interface; but currently that's only used to make sure that
page->mapping is stable, and zap_pmd_range() doesn't care if the page is
locked or not.  Along these lines, in invalidate_inode_pages2_range()
move the initial unmap_mapping_range() out from under page lock, before
then calling unmap_mapping_page() under page lock if still mapped.

Fixes: fc127da085c2 ("truncate: handle file thp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
---
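As a stand-alone illustration of the "take and drop the lock" trick used
in zap_pmd_range() here, the pthreads sketch below (not kernel code, all
names invented) shows why briefly acquiring the lock is enough: once it is
acquired, the concurrent zapper that cleared the entry under that lock has
also finished its decrement.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static pthread_mutex_t toy_pmd_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int entry_cleared;	/* models pmd_none() becoming true */
static atomic_int mapcount = 1;		/* models compound_mapcount */

static void *toy_zapper(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&toy_pmd_lock);
	atomic_store(&entry_cleared, 1);	/* pmd cleared ... */
	atomic_fetch_sub(&mapcount, 1);		/* ... rmap dropped before unlock */
	pthread_mutex_unlock(&toy_pmd_lock);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, toy_zapper, NULL);

	if (atomic_load(&entry_cleared)) {
		/*
		 * We saw the cleared entry without holding the lock.  Take
		 * and drop the lock: by the time we get it, the zapper has
		 * also dropped mapcount, so we cannot report the page as
		 * still mapped.
		 */
		pthread_mutex_lock(&toy_pmd_lock);
		pthread_mutex_unlock(&toy_pmd_lock);
		printf("mapcount after lock barrier: %d\n", atomic_load(&mapcount));
	}

	pthread_join(t, NULL);
	return 0;
}

Build with "cc -pthread": whenever the "if" branch runs, it prints 0.
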
 include/linux/mm.h |  3 +++
 mm/memory.c        | 40 ++++++++++++++++++++++++++++++++++++++++
 mm/truncate.c      | 43 +++++++++++++++++++------------------------
 3 files changed, 62 insertions(+), 24 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c274f75efcf9..8ae31622deef 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1719,6 +1719,7 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
+	struct page *single_page;		/* Locked page to be unmapped */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
@@ -1766,6 +1767,7 @@ extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
 extern int fixup_user_fault(struct mm_struct *mm,
 			    unsigned long address, unsigned int fault_flags,
 			    bool *unlocked);
+void unmap_mapping_page(struct page *page);
 void unmap_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t nr, bool even_cows);
 void unmap_mapping_range(struct address_space *mapping,
@@ -1786,6 +1788,7 @@ static inline int fixup_user_fault(struct mm_struct *mm, unsigned long address,
 	BUG();
 	return -EFAULT;
 }
+static inline void unmap_mapping_page(struct page *page) { }
 static inline void unmap_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t nr, bool even_cows) { }
 static inline void unmap_mapping_range(struct address_space *mapping,
diff --git a/mm/memory.c b/mm/memory.c
index f3ffab9b9e39..ee1163df3a53 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1361,7 +1361,17 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 			else if (zap_huge_pmd(tlb, vma, pmd, addr))
 				goto next;
 			/* fall through */
+		} else if (details && details->single_page &&
+			   PageTransCompound(details->single_page) &&
+			   next - addr == HPAGE_PMD_SIZE && pmd_none(*pmd)) {
+			/*
+			 * Take and drop THP pmd lock so that we cannot return
+			 * prematurely, while zap_huge_pmd() has cleared *pmd,
+			 * but not yet decremented compound_mapcount().
+			 */
+			spin_unlock(pmd_lock(tlb->mm, pmd));
 		}
+
 		/*
 		 * Here there can be other concurrent MADV_DONTNEED or
 		 * trans huge page faults running, and if the pmd is
@@ -3236,6 +3246,36 @@ static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
 	}
 }
 
+/**
+ * unmap_mapping_page() - Unmap single page from processes.
+ * @page: The locked page to be unmapped.
+ *
+ * Unmap this page from any userspace process which still has it mmaped.
+ * Typically, for efficiency, the range of nearby pages has already been
+ * unmapped by unmap_mapping_pages() or unmap_mapping_range().  But once
+ * truncation or invalidation holds the lock on a page, it may find that
+ * the page has been remapped again: and then uses unmap_mapping_page()
+ * to unmap it finally.
+ */
+void unmap_mapping_page(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	struct zap_details details = { };
+
+	VM_BUG_ON(!PageLocked(page));
+	VM_BUG_ON(PageTail(page));
+
+	details.check_mapping = mapping;
+	details.first_index = page->index;
+	details.last_index = page->index + thp_nr_pages(page) - 1;
+	details.single_page = page;
+
+	i_mmap_lock_write(mapping);
+	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
+		unmap_mapping_range_tree(&mapping->i_mmap, &details);
+	i_mmap_unlock_write(mapping);
+}
+
 /**
  * unmap_mapping_pages() - Unmap pages from processes.
  * @mapping: The address space containing pages to be unmapped.
diff --git a/mm/truncate.c b/mm/truncate.c
index 95af244b112a..234ddd879caa 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -167,13 +167,10 @@ void do_invalidatepage(struct page *page, unsigned int offset,
  * its lock, b) when a concurrent invalidate_mapping_pages got there first and
  * c) when tmpfs swizzles a page between a tmpfs inode and swapper_space.
  */
-static void
-truncate_cleanup_page(struct address_space *mapping, struct page *page)
+static void truncate_cleanup_page(struct page *page)
 {
-	if (page_mapped(page)) {
-		unsigned int nr = thp_nr_pages(page);
-		unmap_mapping_pages(mapping, page->index, nr, false);
-	}
+	if (page_mapped(page))
+		unmap_mapping_page(page);
 
 	if (page_has_private(page))
 		do_invalidatepage(page, 0, thp_size(page));
@@ -218,7 +215,7 @@ int truncate_inode_page(struct address_space *mapping, struct page *page)
 	if (page->mapping != mapping)
 		return -EIO;
 
-	truncate_cleanup_page(mapping, page);
+	truncate_cleanup_page(page);
 	delete_from_page_cache(page);
 	return 0;
 }
@@ -325,7 +322,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		index = indices[pagevec_count(&pvec) - 1] + 1;
 		truncate_exceptional_pvec_entries(mapping, &pvec, indices);
 		for (i = 0; i < pagevec_count(&pvec); i++)
-			truncate_cleanup_page(mapping, pvec.pages[i]);
+			truncate_cleanup_page(pvec.pages[i]);
 		delete_from_page_cache_batch(mapping, &pvec);
 		for (i = 0; i < pagevec_count(&pvec); i++)
 			unlock_page(pvec.pages[i]);
@@ -639,6 +636,16 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 				continue;
 			}
 
+			if (!did_range_unmap && page_mapped(page)) {
+				/*
+				 * If page is mapped, before taking its lock,
+				 * zap the rest of the file in one hit.
+				 */
+				unmap_mapping_pages(mapping, index,
+						(1 + end - index), false);
+				did_range_unmap = 1;
+			}
+
 			lock_page(page);
 			WARN_ON(page_to_index(page) != index);
 			if (page->mapping != mapping) {
@@ -646,23 +653,11 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 				continue;
 			}
 			wait_on_page_writeback(page);
-			if (page_mapped(page)) {
-				if (!did_range_unmap) {
-					/*
-					 * Zap the rest of the file in one hit.
-					 */
-					unmap_mapping_pages(mapping, index,
-						(1 + end - index), false);
-					did_range_unmap = 1;
-				} else {
-					/*
-					 * Just zap this page
-					 */
-					unmap_mapping_pages(mapping, index,
-								1, false);
-				}
-			}
+
+			if (page_mapped(page))
+				unmap_mapping_page(page);
 			BUG_ON(page_mapped(page));
+
 			ret2 = do_launder_page(mapping, page);
 			if (ret2 == 0) {
 				if (!invalidate_complete_page2(mapping, page))
-- 
2.26.2



* [PATCH v2 07/10] mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split
From: Hugh Dickins @ 2021-06-09  4:22 UTC
  To: Andrew Morton
  Cc: Hugh Dickins, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Shakeel Butt, Oscar Salvador, linux-mm, linux-kernel

From: Yang Shi <shy828301@gmail.com>

When debugging the bug reported by Wang Yugui [1], try_to_unmap() may
fail, but the first VM_BUG_ON_PAGE() just checks page_mapcount(), so it
may miss the failure when the head page is unmapped but another subpage
is mapped.  Then the second DEBUG_VM BUG(), which checks total mapcount,
would catch it.  This can be confusing.  And this is not a fatal issue,
so consolidate the two DEBUG_VM checks into one VM_WARN_ON_ONCE_PAGE().

[1] https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/

Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
Patch inserted since the v1 series was posted.
v5: Rediffed by Hugh to fit after 6/7 in his mm/thp series; Cc stable.
v4: Updated the subject and commit log per Hugh.
    Reordered the patches per Hugh.
v3: Incorporated the comments from Hugh. Keep Zi Yan's reviewed-by tag
    since there is no fundamental change against v2.
v2: Removed dead code and updated the comment of try_to_unmap() per Zi
    Yan.
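
For context, a minimal stand-alone model (plain C, made-up counters rather
than the real mapcount fields) of why a head-only mapcount check can miss
a still-mapped subpage, which is what the single page_mapped() warning now
covers:

#include <stdbool.h>
#include <stdio.h>

#define NR_SUBPAGES 512		/* a PMD-sized THP with 4K pages */

static int subpage_mapcount[NR_SUBPAGES];	/* index 0 is the head */
static int compound_mapcount;			/* whole-THP (PMD) mappings */

/* Roughly what the old first check looked at: the head page alone. */
static bool head_only_mapped(void)
{
	return subpage_mapcount[0] + compound_mapcount > 0;
}

/* Roughly what page_mapped() tests on a compound page: every subpage. */
static bool any_subpage_mapped(void)
{
	if (compound_mapcount > 0)
		return true;
	for (int i = 0; i < NR_SUBPAGES; i++)
		if (subpage_mapcount[i] > 0)
			return true;
	return false;
}

int main(void)
{
	subpage_mapcount[7] = 1;	/* one PTE-mapped tail survived unmap_page() */

	printf("head-only check: %d\n", head_only_mapped());	/* prints 0 */
	printf("page_mapped():   %d\n", any_subpage_mapped());	/* prints 1 */
	return 0;
}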

 mm/huge_memory.c | 24 +++++++-----------------
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 84ab735139dc..6d2a0119fc58 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2352,15 +2352,15 @@ static void unmap_page(struct page *page)
 {
 	enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_SYNC |
 		TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
-	bool unmap_success;
 
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
 	if (PageAnon(page))
 		ttu_flags |= TTU_SPLIT_FREEZE;
 
-	unmap_success = try_to_unmap(page, ttu_flags);
-	VM_BUG_ON_PAGE(!unmap_success, page);
+	try_to_unmap(page, ttu_flags);
+
+	VM_WARN_ON_ONCE_PAGE(page_mapped(page), page);
 }
 
 static void remap_page(struct page *page, unsigned int nr)
@@ -2671,7 +2671,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	struct deferred_split *ds_queue = get_deferred_split_queue(head);
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
-	int count, mapcount, extra_pins, ret;
+	int extra_pins, ret;
 	pgoff_t end;
 
 	VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
@@ -2730,7 +2730,6 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	}
 
 	unmap_page(head);
-	VM_BUG_ON_PAGE(compound_mapcount(head), head);
 
 	/* block interrupt reentry in xa_lock and spinlock */
 	local_irq_disable();
@@ -2748,9 +2747,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 
 	/* Prevent deferred_split_scan() touching ->_refcount */
 	spin_lock(&ds_queue->split_queue_lock);
-	count = page_count(head);
-	mapcount = total_mapcount(head);
-	if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
+	if (page_ref_freeze(head, 1 + extra_pins)) {
 		if (!list_empty(page_deferred_list(head))) {
 			ds_queue->split_queue_len--;
 			list_del(page_deferred_list(head));
@@ -2770,16 +2767,9 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		__split_huge_page(page, list, end);
 		ret = 0;
 	} else {
-		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
-			pr_alert("total_mapcount: %u, page_count(): %u\n",
-					mapcount, count);
-			if (PageTail(page))
-				dump_page(head, NULL);
-			dump_page(page, "total_mapcount(head) > 0");
-			BUG();
-		}
 		spin_unlock(&ds_queue->split_queue_lock);
-fail:		if (mapping)
+fail:
+		if (mapping)
 			xa_unlock(&mapping->i_pages);
 		local_irq_enable();
 		remap_page(head, thp_nr_pages(head));
-- 
2.26.2



* [PATCH v2 08/10] mm: rmap: make try_to_unmap() void function
       [not found] <af88612-1473-2eaa-903-8d1a448b26@google.com>
@ 2021-06-09  4:25   ` Hugh Dickins
  2021-06-09  4:14   ` Hugh Dickins
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 37+ messages in thread
From: Hugh Dickins @ 2021-06-09  4:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Shakeel Butt, Oscar Salvador, linux-mm, linux-kernel

From: Yang Shi <shy828301@gmail.com>

Currently try_to_unmap() returns a bool value computed by checking
page_mapcount(); however, this may yield a false positive, since
page_mapcount() doesn't check all subpages of a compound page.
total_mapcount() could be used instead, but its cost is higher since it
traverses all subpages.

In fact, most callers of try_to_unmap() don't care about the return
value at all.  So make try_to_unmap() void, and let the callers which do
care check whether the page is still mapped with page_mapped() when
necessary; page_mapped() bails out early once it finds a mapped subpage.
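
In its simplest form, the resulting caller pattern looks like this sketch
(illustrative only; unmap_and_check_sketch() is a made-up helper, the real
callers are updated in the hunks below):

/*
 * Sketch: try_to_unmap() no longer reports success, so a caller that
 * cares re-checks with page_mapped(), which returns as soon as it
 * finds a mapped subpage.
 */
static bool unmap_and_check_sketch(struct page *page, enum ttu_flags flags)
{
	try_to_unmap(page, flags);

	return !page_mapped(page);
}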

Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
---
Patch inserted since the v1 series was posted.
v5: Rediffed by Hugh to fit before 7/7 in mm/thp series; akpm fixed grammar.
v4: Updated the comment of try_to_unmap() per Minchan.
    Minor fix and patch reorder per Hugh.
    Collected ack tag from Hugh.

 include/linux/rmap.h |  2 +-
 mm/memory-failure.c  | 15 +++++++--------
 mm/rmap.c            | 15 ++++-----------
 mm/vmscan.c          |  3 ++-
 4 files changed, 14 insertions(+), 21 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 8d04e7deedc6..ed31a559e857 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -195,7 +195,7 @@ static inline void page_dup_rmap(struct page *page, bool compound)
 int page_referenced(struct page *, int is_locked,
 			struct mem_cgroup *memcg, unsigned long *vm_flags);
 
-bool try_to_unmap(struct page *, enum ttu_flags flags);
+void try_to_unmap(struct page *, enum ttu_flags flags);
 
 /* Avoid racy checks */
 #define PVMW_SYNC		(1 << 0)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 85ad98c00fd9..b6806e446567 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1063,7 +1063,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	enum ttu_flags ttu = TTU_IGNORE_MLOCK;
 	struct address_space *mapping;
 	LIST_HEAD(tokill);
-	bool unmap_success = true;
+	bool unmap_success;
 	int kill = 1, forcekill;
 	struct page *hpage = *hpagep;
 	bool mlocked = PageMlocked(hpage);
@@ -1126,7 +1126,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 		collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
 
 	if (!PageHuge(hpage)) {
-		unmap_success = try_to_unmap(hpage, ttu);
+		try_to_unmap(hpage, ttu);
 	} else {
 		if (!PageAnon(hpage)) {
 			/*
@@ -1138,17 +1138,16 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 			 */
 			mapping = hugetlb_page_mapping_lock_write(hpage);
 			if (mapping) {
-				unmap_success = try_to_unmap(hpage,
-						     ttu|TTU_RMAP_LOCKED);
+				try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED);
 				i_mmap_unlock_write(mapping);
-			} else {
+			} else
 				pr_info("Memory failure: %#lx: could not lock mapping for mapped huge page\n", pfn);
-				unmap_success = false;
-			}
 		} else {
-			unmap_success = try_to_unmap(hpage, ttu);
+			try_to_unmap(hpage, ttu);
 		}
 	}
+
+	unmap_success = !page_mapped(hpage);
 	if (!unmap_success)
 		pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
 		       pfn, page_mapcount(hpage));
diff --git a/mm/rmap.c b/mm/rmap.c
index e05c300048e6..f9fd5bc54f0a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1405,7 +1405,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	/*
 	 * When racing against e.g. zap_pte_range() on another cpu,
 	 * in between its ptep_get_and_clear_full() and page_remove_rmap(),
-	 * try_to_unmap() may return false when it is about to become true,
+	 * try_to_unmap() may return before page_mapped() has become false,
 	 * if page table locking is skipped: use TTU_SYNC to wait for that.
 	 */
 	if (flags & TTU_SYNC)
@@ -1756,9 +1756,10 @@ static int page_not_mapped(struct page *page)
  * Tries to remove all the page table entries which are mapping this
  * page, used in the pageout path.  Caller must hold the page lock.
  *
- * If unmap is successful, return true. Otherwise, false.
+ * It is the caller's responsibility to check if the page is still
+ * mapped when needed (use TTU_SYNC to prevent accounting races).
  */
-bool try_to_unmap(struct page *page, enum ttu_flags flags)
+void try_to_unmap(struct page *page, enum ttu_flags flags)
 {
 	struct rmap_walk_control rwc = {
 		.rmap_one = try_to_unmap_one,
@@ -1783,14 +1784,6 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
 		rmap_walk_locked(page, &rwc);
 	else
 		rmap_walk(page, &rwc);
-
-	/*
-	 * When racing against e.g. zap_pte_range() on another cpu,
-	 * in between its ptep_get_and_clear_full() and page_remove_rmap(),
-	 * try_to_unmap() may return false when it is about to become true,
-	 * if page table locking is skipped: use TTU_SYNC to wait for that.
-	 */
-	return !page_mapcount(page);
 }
 
 /**
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5199b9696bab..db49cb1dc052 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1499,7 +1499,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 			if (unlikely(PageTransHuge(page)))
 				flags |= TTU_SPLIT_HUGE_PMD;
 
-			if (!try_to_unmap(page, flags)) {
+			try_to_unmap(page, flags);
+			if (page_mapped(page)) {
 				stat->nr_unmap_fail += nr_pages;
 				if (!was_swapbacked && PageSwapBacked(page))
 					stat->nr_lazyfree_fail += nr_pages;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 09/10] mm/thp: remap_page() is only needed on anonymous THP
       [not found] <af88612-1473-2eaa-903-8d1a448b26@google.com>
@ 2021-06-09  4:27   ` Hugh Dickins
  2021-06-09  4:14   ` Hugh Dickins
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 37+ messages in thread
From: Hugh Dickins @ 2021-06-09  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Shakeel Butt, Oscar Salvador, linux-mm, linux-kernel

THP splitting's unmap_page() only sets TTU_SPLIT_FREEZE when PageAnon,
and migration entries are only inserted when TTU_MIGRATION (unused here)
or TTU_SPLIT_FREEZE is set: so it's just a waste of time for remap_page()
to search for migration entries to remove when !PageAnon.
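
As a reduced sketch (illustrative only; remap_page_sketch is a made-up
name and the per-subpage loop is elided): the guard is safe because
unmap_page() only ever sets TTU_SPLIT_FREEZE, and so only ever inserts
migration entries, for PageAnon pages:

static void remap_page_sketch(struct page *page)
{
	/* File THP: unmap_page() inserted no migration entries */
	if (!PageAnon(page))
		return;

	remove_migration_ptes(page, page, true);
}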

Fixes: baa355fd3314 ("thp: file pages support for split_huge_page()")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6d2a0119fc58..319a1a078451 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2355,6 +2355,7 @@ static void unmap_page(struct page *page)
 
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
+	/* If TTU_SPLIT_FREEZE is ever extended to file, update remap_page() */
 	if (PageAnon(page))
 		ttu_flags |= TTU_SPLIT_FREEZE;
 
@@ -2366,6 +2367,10 @@ static void unmap_page(struct page *page)
 static void remap_page(struct page *page, unsigned int nr)
 {
 	int i;
+
+	/* If TTU_SPLIT_FREEZE is ever extended to file, remove this check */
+	if (!PageAnon(page))
+		return;
 	if (PageTransHuge(page)) {
 		remove_migration_ptes(page, page, true);
 	} else {
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 10/10] mm: hwpoison_user_mappings() try_to_unmap() with TTU_SYNC
       [not found] <af88612-1473-2eaa-903-8d1a448b26@google.com>
@ 2021-06-09  4:30   ` Hugh Dickins
  2021-06-09  4:14   ` Hugh Dickins
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 37+ messages in thread
From: Hugh Dickins @ 2021-06-09  4:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Shakeel Butt, Oscar Salvador, linux-mm, linux-kernel

TTU_SYNC prevents an unlikely race, when try_to_unmap() returns shortly
before the page is accounted as unmapped.  It is unlikely to coincide
with hwpoisoning, but now that we have the flag, hwpoison_user_mappings()
would do well to use it.
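
A reduced sketch of the unmap step that results (illustrative only;
hwpoison_unmap_sketch is a made-up name, the real code is
hwpoison_user_mappings(), and the hugetlb/rmap-locked cases are elided):

static bool hwpoison_unmap_sketch(struct page *hpage)
{
	enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_SYNC;

	/*
	 * TTU_SYNC makes try_to_unmap() wait on the page table lock of a
	 * racing zap, so page_mapped() below does not falsely report the
	 * page as still mapped while that zap is midway through.
	 */
	try_to_unmap(hpage, ttu);

	return !page_mapped(hpage);
}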

Signed-off-by: Hugh Dickins <hughd@google.com>
---
Patch added since the v1 series was posted.

 mm/memory-failure.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index b6806e446567..e16edefca523 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1060,7 +1060,7 @@ static int get_hwpoison_page(struct page *p, unsigned long flags,
 static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 				  int flags, struct page **hpagep)
 {
-	enum ttu_flags ttu = TTU_IGNORE_MLOCK;
+	enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_SYNC;
 	struct address_space *mapping;
 	LIST_HEAD(tokill);
 	bool unmap_success;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 02/10] mm/thp: make is_huge_zero_pmd() safe and quicker
  2021-06-09  4:08   ` Hugh Dickins
  (?)
@ 2021-06-09 10:22   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 37+ messages in thread
From: Kirill A. Shutemov @ 2021-06-09 10:22 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Shakeel Butt, Oscar Salvador, linux-mm, linux-kernel

On Tue, Jun 08, 2021 at 09:08:09PM -0700, Hugh Dickins wrote:
> Most callers of is_huge_zero_pmd() supply a pmd already verified present;
> but a few (notably zap_huge_pmd()) do not - it might be a pmd migration
> entry, in which the pfn is encoded differently from a present pmd: which
> might pass the is_huge_zero_pmd() test (though not on x86, since L1TF
> forced us to protect against that); or perhaps even crash in pmd_page()
> applied to a swap-like entry.
> 
> Make it safe by adding pmd_present() check into is_huge_zero_pmd() itself;
> and make it quicker by saving huge_zero_pfn, so that is_huge_zero_pmd()
> will not need to do that pmd_page() lookup each time.
> 
> __split_huge_pmd_locked() checked pmd_trans_huge() before: that worked,
> but is unnecessary now that is_huge_zero_pmd() checks present.
> 
> Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 10/10] mm: hwpoison_user_mappings() try_to_unmap() with TTU_SYNC
  2021-06-09  4:30   ` Hugh Dickins
  (?)
@ 2021-06-09 10:27   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 37+ messages in thread
From: Kirill A. Shutemov @ 2021-06-09 10:27 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Shakeel Butt, Oscar Salvador, linux-mm, linux-kernel

On Tue, Jun 08, 2021 at 09:30:00PM -0700, Hugh Dickins wrote:
> TTU_SYNC prevents an unlikely race, when try_to_unmap() returns shortly
> before the page is accounted as unmapped.  It is unlikely to coincide
> with hwpoisoning, but now that we have the flag, hwpoison_user_mappings()
> would do well to use it.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 02/10] mm/thp: make is_huge_zero_pmd() safe and quicker
  2021-06-09  4:08   ` Hugh Dickins
@ 2021-06-09 16:56     ` Yang Shi
  -1 siblings, 0 replies; 37+ messages in thread
From: Yang Shi @ 2021-06-09 16:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Wang Yugui, Matthew Wilcox,
	Naoya Horiguchi, Alistair Popple, Ralph Campbell, Zi Yan,
	Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Shakeel Butt, Oscar Salvador, Linux MM,
	Linux Kernel Mailing List

On Tue, Jun 8, 2021 at 9:08 PM Hugh Dickins <hughd@google.com> wrote:
>
> Most callers of is_huge_zero_pmd() supply a pmd already verified present;
> but a few (notably zap_huge_pmd()) do not - it might be a pmd migration
> entry, in which the pfn is encoded differently from a present pmd: which
> might pass the is_huge_zero_pmd() test (though not on x86, since L1TF
> forced us to protect against that); or perhaps even crash in pmd_page()
> applied to a swap-like entry.
>
> Make it safe by adding pmd_present() check into is_huge_zero_pmd() itself;
> and make it quicker by saving huge_zero_pfn, so that is_huge_zero_pmd()
> will not need to do that pmd_page() lookup each time.
>
> __split_huge_pmd_locked() checked pmd_trans_huge() before: that worked,
> but is unnecessary now that is_huge_zero_pmd() checks present.
>
> Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>

Reviewed-by: Yang Shi <shy828301@gmail.com>

> ---
> Patch added (replacing part of first) since the v1 series was posted.
>
>  include/linux/huge_mm.h | 8 +++++++-
>  mm/huge_memory.c        | 5 ++++-
>  2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 9626fda5efce..2a8ebe6c222e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -286,6 +286,7 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
>  vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
>
>  extern struct page *huge_zero_page;
> +extern unsigned long huge_zero_pfn;
>
>  static inline bool is_huge_zero_page(struct page *page)
>  {
> @@ -294,7 +295,7 @@ static inline bool is_huge_zero_page(struct page *page)
>
>  static inline bool is_huge_zero_pmd(pmd_t pmd)
>  {
> -       return is_huge_zero_page(pmd_page(pmd));
> +       return READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd) && pmd_present(pmd);
>  }
>
>  static inline bool is_huge_zero_pud(pud_t pud)
> @@ -440,6 +441,11 @@ static inline bool is_huge_zero_page(struct page *page)
>         return false;
>  }
>
> +static inline bool is_huge_zero_pmd(pmd_t pmd)
> +{
> +       return false;
> +}
> +
>  static inline bool is_huge_zero_pud(pud_t pud)
>  {
>         return false;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 42cfefc6e66e..5885c5f5836f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -62,6 +62,7 @@ static struct shrinker deferred_split_shrinker;
>
>  static atomic_t huge_zero_refcount;
>  struct page *huge_zero_page __read_mostly;
> +unsigned long huge_zero_pfn __read_mostly = ~0UL;
>
>  bool transparent_hugepage_enabled(struct vm_area_struct *vma)
>  {
> @@ -98,6 +99,7 @@ static bool get_huge_zero_page(void)
>                 __free_pages(zero_page, compound_order(zero_page));
>                 goto retry;
>         }
> +       WRITE_ONCE(huge_zero_pfn, page_to_pfn(zero_page));
>
>         /* We take additional reference here. It will be put back by shrinker */
>         atomic_set(&huge_zero_refcount, 2);
> @@ -147,6 +149,7 @@ static unsigned long shrink_huge_zero_page_scan(struct shrinker *shrink,
>         if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
>                 struct page *zero_page = xchg(&huge_zero_page, NULL);
>                 BUG_ON(zero_page == NULL);
> +               WRITE_ONCE(huge_zero_pfn, ~0UL);
>                 __free_pages(zero_page, compound_order(zero_page));
>                 return HPAGE_PMD_NR;
>         }
> @@ -2071,7 +2074,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>                 return;
>         }
>
> -       if (pmd_trans_huge(*pmd) && is_huge_zero_pmd(*pmd)) {
> +       if (is_huge_zero_pmd(*pmd)) {
>                 /*
>                  * FIXME: Do we want to invalidate secondary mmu by calling
>                  * mmu_notifier_invalidate_range() see comments below inside
> --
> 2.26.2
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 06/10] mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
  2021-06-09  4:19   ` Hugh Dickins
@ 2021-06-09 17:02     ` Yang Shi
  -1 siblings, 0 replies; 37+ messages in thread
From: Yang Shi @ 2021-06-09 17:02 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Wang Yugui, Matthew Wilcox,
	Naoya Horiguchi, Alistair Popple, Ralph Campbell, Zi Yan,
	Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Shakeel Butt, Oscar Salvador, Linux MM,
	Linux Kernel Mailing List

On Tue, Jun 8, 2021 at 9:19 PM Hugh Dickins <hughd@google.com> wrote:
>
> There is a race between THP unmapping and truncation, when truncate sees
> pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared it,
> but before its page_remove_rmap() gets to decrement compound_mapcount:
> generating false "BUG: Bad page cache" reports that the page is still
> mapped when deleted.  This commit fixes that, but not in the way I hoped.
>
> The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
> instead of unmap_mapping_range() in truncate_cleanup_page(): it has often
> been an annoyance that we usually call unmap_mapping_range() with no pages
> locked, but there apply it to a single locked page.  try_to_unmap() looks
> more suitable for a single locked page.
>
> However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
> it is used to insert THP migration entries, but not used to unmap THPs.
> Copy zap_huge_pmd() and add THP handling now?  Perhaps, but their TLB
> needs are different, I'm too ignorant of the DAX cases, and couldn't
> decide how far to go for anon+swap.  Set that aside.
>
> The second attempt took a different tack: make no change in truncate.c,
> but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
> clearing it initially, then pmd_clear() between page_remove_rmap() and
> unlocking at the end.  Nice.  But powerpc blows that approach out of the
> water, with its serialize_against_pte_lookup(), and interesting pgtable
> usage.  It would need serious help to get working on powerpc (with a
> minor optimization issue on s390 too).  Set that aside.
>
> Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
> delay, after unmapping in truncate_cleanup_page()?  Perhaps, but though
> that's likely to reduce or eliminate the number of incidents, it would
> give less assurance of whether we had identified the problem correctly.
>
> This successful iteration introduces "unmap_mapping_page(page)" instead
> of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
> with an addition to details.  Then zap_pmd_range() watches for this case,
> and does spin_unlock(pmd_lock) if so - just like page_vma_mapped_walk()
> now does in the PVMW_SYNC case.  Not pretty, but safe.
>
> Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
> assert its interface; but currently that's only used to make sure that
> page->mapping is stable, and zap_pmd_range() doesn't care if the page is
> locked or not.  Along these lines, in invalidate_inode_pages2_range()
> move the initial unmap_mapping_range() out from under page lock, before
> then calling unmap_mapping_page() under page lock if still mapped.
>
> Fixes: fc127da085c2 ("truncate: handle file thp")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: <stable@vger.kernel.org>
> ---
>  include/linux/mm.h |  3 +++
>  mm/memory.c        | 40 ++++++++++++++++++++++++++++++++++++++++
>  mm/truncate.c      | 43 +++++++++++++++++++------------------------
>  3 files changed, 62 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c274f75efcf9..8ae31622deef 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1719,6 +1719,7 @@ struct zap_details {
>         struct address_space *check_mapping;    /* Check page->mapping if set */
>         pgoff_t first_index;                    /* Lowest page->index to unmap */
>         pgoff_t last_index;                     /* Highest page->index to unmap */
> +       struct page *single_page;               /* Locked page to be unmapped */
>  };
>
>  struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> @@ -1766,6 +1767,7 @@ extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
>  extern int fixup_user_fault(struct mm_struct *mm,
>                             unsigned long address, unsigned int fault_flags,
>                             bool *unlocked);
> +void unmap_mapping_page(struct page *page);
>  void unmap_mapping_pages(struct address_space *mapping,
>                 pgoff_t start, pgoff_t nr, bool even_cows);
>  void unmap_mapping_range(struct address_space *mapping,
> @@ -1786,6 +1788,7 @@ static inline int fixup_user_fault(struct mm_struct *mm, unsigned long address,
>         BUG();
>         return -EFAULT;
>  }
> +static inline void unmap_mapping_page(struct page *page) { }
>  static inline void unmap_mapping_pages(struct address_space *mapping,
>                 pgoff_t start, pgoff_t nr, bool even_cows) { }
>  static inline void unmap_mapping_range(struct address_space *mapping,
> diff --git a/mm/memory.c b/mm/memory.c
> index f3ffab9b9e39..ee1163df3a53 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1361,7 +1361,17 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
>                         else if (zap_huge_pmd(tlb, vma, pmd, addr))
>                                 goto next;
>                         /* fall through */
> +               } else if (details && details->single_page &&
> +                          PageTransCompound(details->single_page) &&
> +                          next - addr == HPAGE_PMD_SIZE && pmd_none(*pmd)) {
> +                       /*
> +                        * Take and drop THP pmd lock so that we cannot return
> +                        * prematurely, while zap_huge_pmd() has cleared *pmd,
> +                        * but not yet decremented compound_mapcount().
> +                        */
> +                       spin_unlock(pmd_lock(tlb->mm, pmd));

Just a nit, but why not follow the style of patch #3 and put the lock
and unlock on separate lines?


>                 }
> +
>                 /*
>                  * Here there can be other concurrent MADV_DONTNEED or
>                  * trans huge page faults running, and if the pmd is
> @@ -3236,6 +3246,36 @@ static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
>         }
>  }
>
> +/**
> + * unmap_mapping_page() - Unmap single page from processes.
> + * @page: The locked page to be unmapped.
> + *
> + * Unmap this page from any userspace process which still has it mmaped.
> + * Typically, for efficiency, the range of nearby pages has already been
> + * unmapped by unmap_mapping_pages() or unmap_mapping_range().  But once
> + * truncation or invalidation holds the lock on a page, it may find that
> + * the page has been remapped again: and then uses unmap_mapping_page()
> + * to unmap it finally.
> + */
> +void unmap_mapping_page(struct page *page)
> +{
> +       struct address_space *mapping = page->mapping;
> +       struct zap_details details = { };
> +
> +       VM_BUG_ON(!PageLocked(page));
> +       VM_BUG_ON(PageTail(page));
> +
> +       details.check_mapping = mapping;
> +       details.first_index = page->index;
> +       details.last_index = page->index + thp_nr_pages(page) - 1;
> +       details.single_page = page;
> +
> +       i_mmap_lock_write(mapping);
> +       if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
> +               unmap_mapping_range_tree(&mapping->i_mmap, &details);
> +       i_mmap_unlock_write(mapping);
> +}
> +
>  /**
>   * unmap_mapping_pages() - Unmap pages from processes.
>   * @mapping: The address space containing pages to be unmapped.
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 95af244b112a..234ddd879caa 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -167,13 +167,10 @@ void do_invalidatepage(struct page *page, unsigned int offset,
>   * its lock, b) when a concurrent invalidate_mapping_pages got there first and
>   * c) when tmpfs swizzles a page between a tmpfs inode and swapper_space.
>   */
> -static void
> -truncate_cleanup_page(struct address_space *mapping, struct page *page)
> +static void truncate_cleanup_page(struct page *page)
>  {
> -       if (page_mapped(page)) {
> -               unsigned int nr = thp_nr_pages(page);
> -               unmap_mapping_pages(mapping, page->index, nr, false);
> -       }
> +       if (page_mapped(page))
> +               unmap_mapping_page(page);
>
>         if (page_has_private(page))
>                 do_invalidatepage(page, 0, thp_size(page));
> @@ -218,7 +215,7 @@ int truncate_inode_page(struct address_space *mapping, struct page *page)
>         if (page->mapping != mapping)
>                 return -EIO;
>
> -       truncate_cleanup_page(mapping, page);
> +       truncate_cleanup_page(page);
>         delete_from_page_cache(page);
>         return 0;
>  }
> @@ -325,7 +322,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>                 index = indices[pagevec_count(&pvec) - 1] + 1;
>                 truncate_exceptional_pvec_entries(mapping, &pvec, indices);
>                 for (i = 0; i < pagevec_count(&pvec); i++)
> -                       truncate_cleanup_page(mapping, pvec.pages[i]);
> +                       truncate_cleanup_page(pvec.pages[i]);
>                 delete_from_page_cache_batch(mapping, &pvec);
>                 for (i = 0; i < pagevec_count(&pvec); i++)
>                         unlock_page(pvec.pages[i]);
> @@ -639,6 +636,16 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
>                                 continue;
>                         }
>
> +                       if (!did_range_unmap && page_mapped(page)) {
> +                               /*
> +                                * If page is mapped, before taking its lock,
> +                                * zap the rest of the file in one hit.
> +                                */
> +                               unmap_mapping_pages(mapping, index,
> +                                               (1 + end - index), false);
> +                               did_range_unmap = 1;
> +                       }
> +
>                         lock_page(page);
>                         WARN_ON(page_to_index(page) != index);
>                         if (page->mapping != mapping) {
> @@ -646,23 +653,11 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
>                                 continue;
>                         }
>                         wait_on_page_writeback(page);
> -                       if (page_mapped(page)) {
> -                               if (!did_range_unmap) {
> -                                       /*
> -                                        * Zap the rest of the file in one hit.
> -                                        */
> -                                       unmap_mapping_pages(mapping, index,
> -                                               (1 + end - index), false);
> -                                       did_range_unmap = 1;
> -                               } else {
> -                                       /*
> -                                        * Just zap this page
> -                                        */
> -                                       unmap_mapping_pages(mapping, index,
> -                                                               1, false);
> -                               }
> -                       }
> +
> +                       if (page_mapped(page))
> +                               unmap_mapping_page(page);
>                         BUG_ON(page_mapped(page));
> +
>                         ret2 = do_launder_page(mapping, page);
>                         if (ret2 == 0) {
>                                 if (!invalidate_complete_page2(mapping, page))
> --
> 2.26.2
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 06/10] mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
  2021-06-09 17:02     ` Yang Shi
@ 2021-06-09 21:11       ` Hugh Dickins
  -1 siblings, 0 replies; 37+ messages in thread
From: Hugh Dickins @ 2021-06-09 21:11 UTC (permalink / raw)
  To: Yang Shi
  Cc: Hugh Dickins, Andrew Morton, Kirill A. Shutemov, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Shakeel Butt, Oscar Salvador, Linux MM,
	Linux Kernel Mailing List

On Wed, 9 Jun 2021, Yang Shi wrote:
> On Tue, Jun 8, 2021 at 9:19 PM Hugh Dickins <hughd@google.com> wrote:
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1361,7 +1361,17 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
> >                         else if (zap_huge_pmd(tlb, vma, pmd, addr))
> >                                 goto next;
> >                         /* fall through */
> > +               } else if (details && details->single_page &&
> > +                          PageTransCompound(details->single_page) &&
> > +                          next - addr == HPAGE_PMD_SIZE && pmd_none(*pmd)) {
> > +                       /*
> > +                        * Take and drop THP pmd lock so that we cannot return
> > +                        * prematurely, while zap_huge_pmd() has cleared *pmd,
> > +                        * but not yet decremented compound_mapcount().
> > +                        */
> > +                       spin_unlock(pmd_lock(tlb->mm, pmd));
> 
> Just a nit, why not follow the style of patch #3 to have lock and
> unlock with separate lines?

Good point.  Doing it on one line is my own preferred style (particularly
with the "take and drop lock" comment just above), carried forward from
before.  I simply didn't think to change this one when I agreed to change
the other.  You're right to question it: v3 of just this 06/10 follows.
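
For anyone skimming the thread: the difference is purely stylistic.
Condensed from the v2 hunk quoted above and the v3 hunk which follows,
it comes down to

	/* v2: take and drop the THP pmd lock in one line */
	spin_unlock(pmd_lock(tlb->mm, pmd));

versus

	/* v3: the same, expanded to match the style of 03/10 */
	spinlock_t *ptl = pmd_lock(tlb->mm, pmd);
	spin_unlock(ptl);

with no change in behaviour.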

And thank you to Kirill and to you for these rapid further reviews (and
for silently forgiving my screwup in omitting to Cc linux-mm and lkml
on those early ones).

Hugh

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v3 06/10] mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
  2021-06-09 21:11       ` Hugh Dickins
@ 2021-06-09 21:16         ` Hugh Dickins
  -1 siblings, 0 replies; 37+ messages in thread
From: Hugh Dickins @ 2021-06-09 21:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Yang Shi, Kirill A. Shutemov, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Shakeel Butt, Oscar Salvador, Linux MM,
	Linux Kernel Mailing List

There is a race between THP unmapping and truncation, when truncate sees
pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared it,
but before its page_remove_rmap() gets to decrement compound_mapcount:
generating false "BUG: Bad page cache" reports that the page is still
mapped when deleted.  This commit fixes that, but not in the way I hoped.
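
Roughly, the interleaving in question (a sketch using the function names
above and in the diff below; exact call paths vary by caller):

  munmap                              truncate
  ------                              --------
  zap_huge_pmd()
    clears *pmd
                                      truncate_cleanup_page()
                                        unmap_mapping_pages()
                                          zap_pmd_range(): pmd_none(*pmd), skip
                                      delete_from_page_cache()
                                        page_mapped() still true:
                                        "BUG: Bad page cache"
    page_remove_rmap()
      compound_mapcount decremented, too late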

The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
instead of unmap_mapping_range() in truncate_cleanup_page(): it has often
been an annoyance that we usually call unmap_mapping_range() with no pages
locked, but there we apply it to a single locked page.  try_to_unmap() looks
more suitable for a single locked page.

However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
it is used to insert THP migration entries, but not used to unmap THPs.
Copy zap_huge_pmd() and add THP handling now?  Perhaps, but their TLB
needs are different, I'm too ignorant of the DAX cases, and couldn't
decide how far to go for anon+swap.  Set that aside.

The second attempt took a different tack: make no change in truncate.c,
but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
clearing it initially, then pmd_clear() between page_remove_rmap() and
unlocking at the end.  Nice.  But powerpc blows that approach out of the
water, with its serialize_against_pte_lookup(), and interesting pgtable
usage.  It would need serious help to get working on powerpc (with a
minor optimization issue on s390 too).  Set that aside.

Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
delay, after unmapping in truncate_cleanup_page()?  Perhaps, but though
that's likely to reduce or eliminate the number of incidents, it would
give less assurance of whether we had identified the problem correctly.

This successful iteration introduces "unmap_mapping_page(page)" instead
of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
with an addition to details.  Then zap_pmd_range() watches for this case,
and does spin_unlock(pmd_lock) if so - just like page_vma_mapped_walk()
now does in the PVMW_SYNC case.  Not pretty, but safe.

Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
assert its interface; but currently that's only used to make sure that
page->mapping is stable, and zap_pmd_range() doesn't care if the page is
locked or not.  Along these lines, in invalidate_inode_pages2_range()
move the initial unmap_mapping_range() out from under page lock, before
then calling unmap_mapping_page() under page lock if still mapped.

Fixes: fc127da085c2 ("truncate: handle file thp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
---
v3: expanded spin_unlock(pmd_lock()) like in 03/10, per Yang Shi

 include/linux/mm.h |  3 +++
 mm/memory.c        | 41 +++++++++++++++++++++++++++++++++++++++++
 mm/truncate.c      | 43 +++++++++++++++++++------------------------
 3 files changed, 63 insertions(+), 24 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c274f75efcf9..8ae31622deef 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1719,6 +1719,7 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
+	struct page *single_page;		/* Locked page to be unmapped */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
@@ -1766,6 +1767,7 @@ extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
 extern int fixup_user_fault(struct mm_struct *mm,
 			    unsigned long address, unsigned int fault_flags,
 			    bool *unlocked);
+void unmap_mapping_page(struct page *page);
 void unmap_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t nr, bool even_cows);
 void unmap_mapping_range(struct address_space *mapping,
@@ -1786,6 +1788,7 @@ static inline int fixup_user_fault(struct mm_struct *mm, unsigned long address,
 	BUG();
 	return -EFAULT;
 }
+static inline void unmap_mapping_page(struct page *page) { }
 static inline void unmap_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t nr, bool even_cows) { }
 static inline void unmap_mapping_range(struct address_space *mapping,
diff --git a/mm/memory.c b/mm/memory.c
index f3ffab9b9e39..486f4a2874e7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1361,7 +1361,18 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 			else if (zap_huge_pmd(tlb, vma, pmd, addr))
 				goto next;
 			/* fall through */
+		} else if (details && details->single_page &&
+			   PageTransCompound(details->single_page) &&
+			   next - addr == HPAGE_PMD_SIZE && pmd_none(*pmd)) {
+			spinlock_t *ptl = pmd_lock(tlb->mm, pmd);
+			/*
+			 * Take and drop THP pmd lock so that we cannot return
+			 * prematurely, while zap_huge_pmd() has cleared *pmd,
+			 * but not yet decremented compound_mapcount().
+			 */
+			spin_unlock(ptl);
 		}
+
 		/*
 		 * Here there can be other concurrent MADV_DONTNEED or
 		 * trans huge page faults running, and if the pmd is
@@ -3236,6 +3247,36 @@ static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
 	}
 }
 
+/**
+ * unmap_mapping_page() - Unmap single page from processes.
+ * @page: The locked page to be unmapped.
+ *
+ * Unmap this page from any userspace process which still has it mmaped.
+ * Typically, for efficiency, the range of nearby pages has already been
+ * unmapped by unmap_mapping_pages() or unmap_mapping_range().  But once
+ * truncation or invalidation holds the lock on a page, it may find that
+ * the page has been remapped again: and then uses unmap_mapping_page()
+ * to unmap it finally.
+ */
+void unmap_mapping_page(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	struct zap_details details = { };
+
+	VM_BUG_ON(!PageLocked(page));
+	VM_BUG_ON(PageTail(page));
+
+	details.check_mapping = mapping;
+	details.first_index = page->index;
+	details.last_index = page->index + thp_nr_pages(page) - 1;
+	details.single_page = page;
+
+	i_mmap_lock_write(mapping);
+	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
+		unmap_mapping_range_tree(&mapping->i_mmap, &details);
+	i_mmap_unlock_write(mapping);
+}
+
 /**
  * unmap_mapping_pages() - Unmap pages from processes.
  * @mapping: The address space containing pages to be unmapped.
diff --git a/mm/truncate.c b/mm/truncate.c
index 95af244b112a..234ddd879caa 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -167,13 +167,10 @@ void do_invalidatepage(struct page *page, unsigned int offset,
  * its lock, b) when a concurrent invalidate_mapping_pages got there first and
  * c) when tmpfs swizzles a page between a tmpfs inode and swapper_space.
  */
-static void
-truncate_cleanup_page(struct address_space *mapping, struct page *page)
+static void truncate_cleanup_page(struct page *page)
 {
-	if (page_mapped(page)) {
-		unsigned int nr = thp_nr_pages(page);
-		unmap_mapping_pages(mapping, page->index, nr, false);
-	}
+	if (page_mapped(page))
+		unmap_mapping_page(page);
 
 	if (page_has_private(page))
 		do_invalidatepage(page, 0, thp_size(page));
@@ -218,7 +215,7 @@ int truncate_inode_page(struct address_space *mapping, struct page *page)
 	if (page->mapping != mapping)
 		return -EIO;
 
-	truncate_cleanup_page(mapping, page);
+	truncate_cleanup_page(page);
 	delete_from_page_cache(page);
 	return 0;
 }
@@ -325,7 +322,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		index = indices[pagevec_count(&pvec) - 1] + 1;
 		truncate_exceptional_pvec_entries(mapping, &pvec, indices);
 		for (i = 0; i < pagevec_count(&pvec); i++)
-			truncate_cleanup_page(mapping, pvec.pages[i]);
+			truncate_cleanup_page(pvec.pages[i]);
 		delete_from_page_cache_batch(mapping, &pvec);
 		for (i = 0; i < pagevec_count(&pvec); i++)
 			unlock_page(pvec.pages[i]);
@@ -639,6 +636,16 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 				continue;
 			}
 
+			if (!did_range_unmap && page_mapped(page)) {
+				/*
+				 * If page is mapped, before taking its lock,
+				 * zap the rest of the file in one hit.
+				 */
+				unmap_mapping_pages(mapping, index,
+						(1 + end - index), false);
+				did_range_unmap = 1;
+			}
+
 			lock_page(page);
 			WARN_ON(page_to_index(page) != index);
 			if (page->mapping != mapping) {
@@ -646,23 +653,11 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 				continue;
 			}
 			wait_on_page_writeback(page);
-			if (page_mapped(page)) {
-				if (!did_range_unmap) {
-					/*
-					 * Zap the rest of the file in one hit.
-					 */
-					unmap_mapping_pages(mapping, index,
-						(1 + end - index), false);
-					did_range_unmap = 1;
-				} else {
-					/*
-					 * Just zap this page
-					 */
-					unmap_mapping_pages(mapping, index,
-								1, false);
-				}
-			}
+
+			if (page_mapped(page))
+				unmap_mapping_page(page);
 			BUG_ON(page_mapped(page));
+
 			ret2 = do_launder_page(mapping, page);
 			if (ret2 == 0) {
 				if (!invalidate_complete_page2(mapping, page))
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 06/10] mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
  2021-06-09 21:16         ` Hugh Dickins
@ 2021-06-09 21:51           ` Yang Shi
  -1 siblings, 0 replies; 37+ messages in thread
From: Yang Shi @ 2021-06-09 21:51 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Wang Yugui, Matthew Wilcox,
	Naoya Horiguchi, Alistair Popple, Ralph Campbell, Zi Yan,
	Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Shakeel Butt, Oscar Salvador, Linux MM,
	Linux Kernel Mailing List

On Wed, Jun 9, 2021 at 2:16 PM Hugh Dickins <hughd@google.com> wrote:
>
> There is a race between THP unmapping and truncation, when truncate sees
> pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared it,
> but before its page_remove_rmap() gets to decrement compound_mapcount:
> generating false "BUG: Bad page cache" reports that the page is still
> mapped when deleted.  This commit fixes that, but not in the way I hoped.
>
> The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
> instead of unmap_mapping_range() in truncate_cleanup_page(): it has often
> been an annoyance that we usually call unmap_mapping_range() with no pages
> locked, but there we apply it to a single locked page.  try_to_unmap() looks
> more suitable for a single locked page.
>
> However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
> it is used to insert THP migration entries, but not used to unmap THPs.
> Copy zap_huge_pmd() and add THP handling now?  Perhaps, but their TLB
> needs are different, I'm too ignorant of the DAX cases, and couldn't
> decide how far to go for anon+swap.  Set that aside.
>
> The second attempt took a different tack: make no change in truncate.c,
> but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
> clearing it initially, then pmd_clear() between page_remove_rmap() and
> unlocking at the end.  Nice.  But powerpc blows that approach out of the
> water, with its serialize_against_pte_lookup(), and interesting pgtable
> usage.  It would need serious help to get working on powerpc (with a
> minor optimization issue on s390 too).  Set that aside.
>
> Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
> delay, after unmapping in truncate_cleanup_page()?  Perhaps, but though
> that's likely to reduce or eliminate the number of incidents, it would
> give less assurance of whether we had identified the problem correctly.
>
> This successful iteration introduces "unmap_mapping_page(page)" instead
> of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
> with an addition to details.  Then zap_pmd_range() watches for this case,
> and does spin_unlock(pmd_lock) if so - just like page_vma_mapped_walk()
> now does in the PVMW_SYNC case.  Not pretty, but safe.
>
> Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
> assert its interface; but currently that's only used to make sure that
> page->mapping is stable, and zap_pmd_range() doesn't care if the page is
> locked or not.  Along these lines, in invalidate_inode_pages2_range()
> move the initial unmap_mapping_range() out from under page lock, before
> then calling unmap_mapping_page() under page lock if still mapped.
>
> Fixes: fc127da085c2 ("truncate: handle file thp")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: <stable@vger.kernel.org>
> ---
> v3: expanded spin_unlock(pmd_lock()) like in 03/10, per Yang Shi

Thanks. Reviewed-by: Yang Shi <shy828301@gmail.com>

>
>  include/linux/mm.h |  3 +++
>  mm/memory.c        | 41 +++++++++++++++++++++++++++++++++++++++++
>  mm/truncate.c      | 43 +++++++++++++++++++------------------------
>  3 files changed, 63 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c274f75efcf9..8ae31622deef 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1719,6 +1719,7 @@ struct zap_details {
>         struct address_space *check_mapping;    /* Check page->mapping if set */
>         pgoff_t first_index;                    /* Lowest page->index to unmap */
>         pgoff_t last_index;                     /* Highest page->index to unmap */
> +       struct page *single_page;               /* Locked page to be unmapped */
>  };
>
>  struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> @@ -1766,6 +1767,7 @@ extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
>  extern int fixup_user_fault(struct mm_struct *mm,
>                             unsigned long address, unsigned int fault_flags,
>                             bool *unlocked);
> +void unmap_mapping_page(struct page *page);
>  void unmap_mapping_pages(struct address_space *mapping,
>                 pgoff_t start, pgoff_t nr, bool even_cows);
>  void unmap_mapping_range(struct address_space *mapping,
> @@ -1786,6 +1788,7 @@ static inline int fixup_user_fault(struct mm_struct *mm, unsigned long address,
>         BUG();
>         return -EFAULT;
>  }
> +static inline void unmap_mapping_page(struct page *page) { }
>  static inline void unmap_mapping_pages(struct address_space *mapping,
>                 pgoff_t start, pgoff_t nr, bool even_cows) { }
>  static inline void unmap_mapping_range(struct address_space *mapping,
> diff --git a/mm/memory.c b/mm/memory.c
> index f3ffab9b9e39..486f4a2874e7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1361,7 +1361,18 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
>                         else if (zap_huge_pmd(tlb, vma, pmd, addr))
>                                 goto next;
>                         /* fall through */
> +               } else if (details && details->single_page &&
> +                          PageTransCompound(details->single_page) &&
> +                          next - addr == HPAGE_PMD_SIZE && pmd_none(*pmd)) {
> +                       spinlock_t *ptl = pmd_lock(tlb->mm, pmd);
> +                       /*
> +                        * Take and drop THP pmd lock so that we cannot return
> +                        * prematurely, while zap_huge_pmd() has cleared *pmd,
> +                        * but not yet decremented compound_mapcount().
> +                        */
> +                       spin_unlock(ptl);
>                 }
> +
>                 /*
>                  * Here there can be other concurrent MADV_DONTNEED or
>                  * trans huge page faults running, and if the pmd is
> @@ -3236,6 +3247,36 @@ static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
>         }
>  }
>
> +/**
> + * unmap_mapping_page() - Unmap single page from processes.
> + * @page: The locked page to be unmapped.
> + *
> + * Unmap this page from any userspace process which still has it mmaped.
> + * Typically, for efficiency, the range of nearby pages has already been
> + * unmapped by unmap_mapping_pages() or unmap_mapping_range().  But once
> + * truncation or invalidation holds the lock on a page, it may find that
> + * the page has been remapped again: and then uses unmap_mapping_page()
> + * to unmap it finally.
> + */
> +void unmap_mapping_page(struct page *page)
> +{
> +       struct address_space *mapping = page->mapping;
> +       struct zap_details details = { };
> +
> +       VM_BUG_ON(!PageLocked(page));
> +       VM_BUG_ON(PageTail(page));
> +
> +       details.check_mapping = mapping;
> +       details.first_index = page->index;
> +       details.last_index = page->index + thp_nr_pages(page) - 1;
> +       details.single_page = page;
> +
> +       i_mmap_lock_write(mapping);
> +       if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
> +               unmap_mapping_range_tree(&mapping->i_mmap, &details);
> +       i_mmap_unlock_write(mapping);
> +}
> +
>  /**
>   * unmap_mapping_pages() - Unmap pages from processes.
>   * @mapping: The address space containing pages to be unmapped.
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 95af244b112a..234ddd879caa 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -167,13 +167,10 @@ void do_invalidatepage(struct page *page, unsigned int offset,
>   * its lock, b) when a concurrent invalidate_mapping_pages got there first and
>   * c) when tmpfs swizzles a page between a tmpfs inode and swapper_space.
>   */
> -static void
> -truncate_cleanup_page(struct address_space *mapping, struct page *page)
> +static void truncate_cleanup_page(struct page *page)
>  {
> -       if (page_mapped(page)) {
> -               unsigned int nr = thp_nr_pages(page);
> -               unmap_mapping_pages(mapping, page->index, nr, false);
> -       }
> +       if (page_mapped(page))
> +               unmap_mapping_page(page);
>
>         if (page_has_private(page))
>                 do_invalidatepage(page, 0, thp_size(page));
> @@ -218,7 +215,7 @@ int truncate_inode_page(struct address_space *mapping, struct page *page)
>         if (page->mapping != mapping)
>                 return -EIO;
>
> -       truncate_cleanup_page(mapping, page);
> +       truncate_cleanup_page(page);
>         delete_from_page_cache(page);
>         return 0;
>  }
> @@ -325,7 +322,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>                 index = indices[pagevec_count(&pvec) - 1] + 1;
>                 truncate_exceptional_pvec_entries(mapping, &pvec, indices);
>                 for (i = 0; i < pagevec_count(&pvec); i++)
> -                       truncate_cleanup_page(mapping, pvec.pages[i]);
> +                       truncate_cleanup_page(pvec.pages[i]);
>                 delete_from_page_cache_batch(mapping, &pvec);
>                 for (i = 0; i < pagevec_count(&pvec); i++)
>                         unlock_page(pvec.pages[i]);
> @@ -639,6 +636,16 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
>                                 continue;
>                         }
>
> +                       if (!did_range_unmap && page_mapped(page)) {
> +                               /*
> +                                * If page is mapped, before taking its lock,
> +                                * zap the rest of the file in one hit.
> +                                */
> +                               unmap_mapping_pages(mapping, index,
> +                                               (1 + end - index), false);
> +                               did_range_unmap = 1;
> +                       }
> +
>                         lock_page(page);
>                         WARN_ON(page_to_index(page) != index);
>                         if (page->mapping != mapping) {
> @@ -646,23 +653,11 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
>                                 continue;
>                         }
>                         wait_on_page_writeback(page);
> -                       if (page_mapped(page)) {
> -                               if (!did_range_unmap) {
> -                                       /*
> -                                        * Zap the rest of the file in one hit.
> -                                        */
> -                                       unmap_mapping_pages(mapping, index,
> -                                               (1 + end - index), false);
> -                                       did_range_unmap = 1;
> -                               } else {
> -                                       /*
> -                                        * Just zap this page
> -                                        */
> -                                       unmap_mapping_pages(mapping, index,
> -                                                               1, false);
> -                               }
> -                       }
> +
> +                       if (page_mapped(page))
> +                               unmap_mapping_page(page);
>                         BUG_ON(page_mapped(page));
> +
>                         ret2 = do_launder_page(mapping, page);
>                         if (ret2 == 0) {
>                                 if (!invalidate_complete_page2(mapping, page))
> --
> 2.26.2
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 10/10] mm: hwpoison_user_mappings() try_to_unmap() with TTU_SYNC
  2021-06-09  4:30   ` Hugh Dickins
  (?)
  (?)
@ 2021-06-10  7:38   ` HORIGUCHI NAOYA(堀口 直也)
  -1 siblings, 0 replies; 37+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-06-10  7:38 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Alistair Popple, Ralph Campbell, Zi Yan,
	Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Shakeel Butt, Oscar Salvador, linux-mm, linux-kernel

On Tue, Jun 08, 2021 at 09:30:00PM -0700, Hugh Dickins wrote:
> TTU_SYNC prevents an unlikely race, when try_to_unmap() returns shortly
> before the page is accounted as unmapped.  It is unlikely to coincide
> with hwpoisoning, but now that we have the flag, hwpoison_user_mappings()
> would do well to use it.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
> Patch added since the v1 series was posted.
>
>  mm/memory-failure.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index b6806e446567..e16edefca523 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1060,7 +1060,7 @@ static int get_hwpoison_page(struct page *p, unsigned long flags,
>  static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
>  				  int flags, struct page **hpagep)
>  {
> -	enum ttu_flags ttu = TTU_IGNORE_MLOCK;
> +	enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_SYNC;

Thanks for improving thp code.

Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 08/10] mm: rmap: make try_to_unmap() void function
  2021-06-09  4:25   ` Hugh Dickins
  (?)
@ 2021-06-10  7:57   ` HORIGUCHI NAOYA(堀口 直也)
  -1 siblings, 0 replies; 37+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-06-10  7:57 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Alistair Popple, Ralph Campbell, Zi Yan,
	Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Shakeel Butt, Oscar Salvador, linux-mm, linux-kernel

On Tue, Jun 08, 2021 at 09:25:22PM -0700, Hugh Dickins wrote:
> From: Yang Shi <shy828301@gmail.com>
> 
> > Currently try_to_unmap() returns a bool value by checking page_mapcount();
> > however, this may give a false positive since page_mapcount() doesn't
> > check all subpages of a compound page.  The total_mapcount() could be
> > used instead, but its cost is higher since it traverses all subpages.
> 
> > Actually, most callers of try_to_unmap() don't care about the return
> > value at all.  So we just need to check whether the page is still mapped
> > with page_mapped() when necessary.  And page_mapped() does bail out
> > early when it finds a mapped subpage.
> 
> Suggested-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> Acked-by: Minchan Kim <minchan@kernel.org>
> Reviewed-by: Shakeel Butt <shakeelb@google.com>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>

Looks good to me, thank you.

Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 00/10] mm/thp: fix THP splitting unmap BUGs and related
       [not found]   ` <2014832.e7zRqyNrDn@nvdebian>
@ 2021-06-11  0:15     ` Hugh Dickins
  2021-06-11  7:28       ` Alistair Popple
  0 siblings, 1 reply; 37+ messages in thread
From: Hugh Dickins @ 2021-06-11  0:15 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Andrew Morton, Hugh Dickins, Kirill A. Shutemov, Yang Shi,
	Wang Yugui, Matthew Wilcox, Naoya Horiguchi, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Shakeel Butt, Oscar Salvador, linux-mm, linux-kernel

On Fri, 11 Jun 2021, Alistair Popple wrote:
> On Friday, 11 June 2021 8:15:05 AM AEST Andrew Morton wrote:
> > On Tue, 8 Jun 2021 20:57:34 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
> > 
> > > These are against 5.13-rc5: expect mmotm conflicts with a couple of
> > > Alistair Popple's "Add support for SVM atomics in Nouveau" series:
> > > mm-remove-special-swap-entry-functions.patch
> > > mm-rmap-split-try_to_munlock-from-try_to_unmap.patch
> > 
> > I came unstuck at "mm/rmap: split migration into its own function".

Sorry about that, I hadn't yet gotten to trying my latest with mmotm.
And I think my previous mmotm-adjust.tar must have been incomplete;
and even if it were complete, would no longer apply properly anyway.

> > 
> > --- mm/huge_memory.c~mm-rmap-split-migration-into-its-own-function
> > +++ mm/huge_memory.c
> > @@ -2345,16 +2345,21 @@ void vma_adjust_trans_huge(struct vm_are
> > 
> >  static void unmap_page(struct page *page)
> >  {
> > -       enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK |
> > -               TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
> > +       enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
> >         bool unmap_success;
> > 
> >         VM_BUG_ON_PAGE(!PageHead(page), page);
> > 
> >         if (PageAnon(page))
> > -               ttu_flags |= TTU_SPLIT_FREEZE;
> > -
> > -       unmap_success = try_to_unmap(page, ttu_flags);
> > +               unmap_success = try_to_migrate(page, ttu_flags);
> > +       else
> > +               /*
> > +                * Don't install migration entries for file backed pages. This
> > +                * helps handle cases when i_size is in the middle of the page
> > +                * as there is no need to unmap pages beyond i_size manually.
> > +                */
> > +               unmap_success = try_to_unmap(page, ttu_flags |
> > +                                               TTU_IGNORE_MLOCK);
> >         VM_BUG_ON_PAGE(!unmap_success, page);
> >  }
> > 
> > 
> > Sigh.  I have a few todo's against Alistair's "Add support for SVM
> > atomics in Nouveau v9".  Including

Sigh shared!

> > 
> > https://lkml.kernel.org/r/20210525183710.fa2m2sbfixnhz7g5@revolver
> > https://lkml.kernel.org/r/20210604204934.sbspsmwdqdtmz73d@revolver
> > https://lkml.kernel.org/r/YK6mbf967dV0ljHn@t490s
> > https://lkml.kernel.org/r/2005328.bFqPmhE5MS@nvdebian
> > https://lkml.kernel.org/r/202105262107.LkxpsZsV-lkp@intel.com
> > https://lkml.kernel.org/r/YK6hYGEx+XzeZELV@t490s
> > 
> > So I think I'll drop that series and shall ask for it to be redone
> > against this lot, please.

Thank you, Andrew: that's certainly easiest for you and for me:
and I think the right thing to do for now.

> > 
> 
> I believe v10 of the series posted earlier this week should address those
> todo's. I will double check though and resend based on top of mmotm. Thanks.

Sorry to give you the bother, Alistair: it's worked out as a bad moment
to rewrite swapops.h and rmap.c, I'm afraid.

And the only help I've had time to give you was pointing Peter at your
series - many thanks to Peter, and to Shakeel.

Several times I've been on the point of asking you to keep the familiar
migration_entry_to_page(), along with your new pfn_swap_entry_to_page();
but each time I've looked, I've seen that it's hard to retain it sensibly at
the same time as overdue cleanup of the device_private_entry_to_page()s.
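
(Mechanically it could be kept as a one-line wrapper, something like

	static inline struct page *migration_entry_to_page(swp_entry_t entry)
	{
		return pfn_swap_entry_to_page(entry);
	}

but carrying both names around rather defeats the point of your cleanup.)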

So I guess I'm resigned to losing it; but there are at least three
bugs currently under discussion or fixes in flight, which border on
migration_entry_to_page() - Jann Horn's smaps syzbot bug, Xu Yu's
__migration_entry_wait() fix, my __split_huge_pmd_locked() fix
(and page_vma_mapped_walk() cleanup).

And regarding huge_memory.c's unmap_page(): I did not recognize the
"helps handle cases when i_size" comment you added there.  What I
ended up with (and thought was in mmotm-adjust.tar but seems not):

	/*
	 * Anon pages need migration entries to preserve them, but file
	 * pages can simply be left unmapped, then faulted back on demand.
	 * If that is ever changed (perhaps for mlock), update remap_page().
	 */
	if (PageAnon(page))
		try_to_migrate(page, ttu_flags);
	else
		try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK);

with
	/* If try_to_migrate() is used on file, remove this check */
in remap_page() to replace the
	/* If TTU_SPLIT_FREEZE is ever extended to file, remove this check */
comment my series puts there (since you delete TTU_SPLIT_FREEZE altogether).
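
(To be concrete about where that comment lands - a sketch of remap_page()
as I expect it to end up, not a quote of any patch, so details may differ:

	static void remap_page(struct page *page, unsigned int nr)
	{
		int i;

		/* If try_to_migrate() is used on file, remove this check */
		if (!PageAnon(page))
			return;
		if (PageTransHuge(page)) {
			remove_migration_ptes(page, page, true);
		} else {
			for (i = 0; i < nr; i++)
				remove_migration_ptes(page + i, page + i, true);
		}
	}

the surrounding code may of course look a little different in mmotm.)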

Hugh

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 00/10] mm/thp: fix THP splitting unmap BUGs and related
  2021-06-11  0:15     ` [PATCH v2 00/10] mm/thp: fix THP splitting unmap BUGs and related Hugh Dickins
@ 2021-06-11  7:28       ` Alistair Popple
  2021-06-11 20:56           ` Hugh Dickins
  0 siblings, 1 reply; 37+ messages in thread
From: Alistair Popple @ 2021-06-11  7:28 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Ralph Campbell, Zi Yan,
	Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Shakeel Butt, Oscar Salvador, linux-mm, linux-kernel

On Friday, 11 June 2021 10:15:51 AM AEST Hugh Dickins wrote:
> On Fri, 11 Jun 2021, Alistair Popple wrote:
> > On Friday, 11 June 2021 8:15:05 AM AEST Andrew Morton wrote:
> > > On Tue, 8 Jun 2021 20:57:34 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:


> > > --- mm/huge_memory.c~mm-rmap-split-migration-into-its-own-function
> > > +++ mm/huge_memory.c
> > > @@ -2345,16 +2345,21 @@ void vma_adjust_trans_huge(struct vm_are
> > >
> > >  static void unmap_page(struct page *page)
> > >  {
> > > -       enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK |
> > > -               TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
> > > +       enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
> > >         bool unmap_success;
> > >
> > >         VM_BUG_ON_PAGE(!PageHead(page), page);
> > >
> > >         if (PageAnon(page))
> > > -               ttu_flags |= TTU_SPLIT_FREEZE;
> > > -
> > > -       unmap_success = try_to_unmap(page, ttu_flags);
> > > +               unmap_success = try_to_migrate(page, ttu_flags);
> > > +       else
> > > +               /*
> > > +                * Don't install migration entries for file backed pages. This
> > > +                * helps handle cases when i_size is in the middle of the page
> > > +                * as there is no need to unmap pages beyond i_size manually.
> > > +                */
> > > +               unmap_success = try_to_unmap(page, ttu_flags |
> > > +                                               TTU_IGNORE_MLOCK);
> > >         VM_BUG_ON_PAGE(!unmap_success, page);
> > >  }
> > >
> > >
> > > Sigh.  I have a few todo's against Alistair's "Add support for SVM
> > > atomics in Nouveau v9".  Including
> 
> Sigh shared!
> 
> > >
> > > https://lkml.kernel.org/r/20210525183710.fa2m2sbfixnhz7g5@revolver
> > > https://lkml.kernel.org/r/20210604204934.sbspsmwdqdtmz73d@revolver
> > > https://lkml.kernel.org/r/YK6mbf967dV0ljHn@t490s
> > > https://lkml.kernel.org/r/2005328.bFqPmhE5MS@nvdebian
> > > https://lkml.kernel.org/r/202105262107.LkxpsZsV-lkp@intel.com
> > > https://lkml.kernel.org/r/YK6hYGEx+XzeZELV@t490s
> > >
> > > So I think I'll drop that series and shall ask for it to be redone
> > > against this lot, please.
> 
> Thank you, Andrew: that's certainly easiest for you and for me:
> and I think the right thing to do for now.

I guess this is where I sigh :-)

> > >
> >
> > I believe v10 of the series posted earlier this week should address those
> > todo's. I will double check though and resend based on top of mmotm. Thanks.
> 
> Sorry to give you the bother, Alistair: it's worked out as a bad moment
> to rewrite swapops.h and rmap.c, I'm afraid.

Indeed, but I don't think it's too bad. I've just tried rebasing it on this
series and it didn't run into too many problems. Obviously I ran into the same
issue Andrew did but I was able to fix that up. It also means try_to_migrate()
now returns 'void' instead of 'bool'.

Which brings me to the only real question I had during the rebase - does
migration also need to accept the TTU_SYNC flag? I think it does because if I
understand correctly we can still hit the same race with zap_pte_range() when
trying to establish migration entries which previously also returned the status
of page_mapped().

> And the only help I've had time to give you was pointing Peter at your
> series - many thanks to Peter, and to Shakeel.

Yes, thanks for the help there. I think the main questions I had for you were
around checking vma flags under the ptl in try_to_munlock_one but Shakeel was
able to clear that up for me. Thanks!

> Several times I've been on the point of asking you to keep the familiar
> migration_entry_to_page(), along with your new pfn_swap_entry_to_page();
> but each time I've looked, seen that it's hard to retain it sensibly at
> the same time as overdue cleanup of the device_private_entry_to_page()s.

Yeah, it would make things a bit funny to retain it IMHO. At least any fixups
should just be simple substitutions.

> So I guess I'm resigned to losing it; but there are at least three
> bugs currently under discussion or fixes in flight, which border on
> migration_entry_to_page() - Jann Horn's smaps syzbot bug, Xu Yu's
> __migration_entry_wait() fix, my __split_huge_pmd_locked() fix
> (and page_vma_mapped_walk() cleanup).
> 
> And regarding huge_memory.c's unmap_page(): I did not recognize the
> "helps handle cases when i_size" comment you added there.  What I
> ended up with (and thought was in mmotm-adjust.tar but seems not):
> 
>         /*
>          * Anon pages need migration entries to preserve them, but file
>          * pages can simply be left unmapped, then faulted back on demand.
>          * If that is ever changed (perhaps for mlock), update remap_page().
>          */

My comment was based somewhat on the commit message for the original change but
yours is much clearer so will incorporate it into my rebase, thanks.

As to sending my rebased series I suppose it would be best to wait until
linux-mm has been updated with whatever other fixes are needed before resending
it based on top of that. So far rebasing on this series didn't require too many
drastic changes to my v10 series. The most significant was to incorporate your
changes to unmap_page(). The remaining were just adding the TTU_SYNC case to
try_to_migrate{_one} and a single s/migration_entry_to_page/pfn_swap_entry_to_page/
in huge_memory.c
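
(For that last one, just to show the kind of one-liner I mean - assuming it
is the pmd migration entry path in __split_huge_pmd_locked(), though the
exact spot may have moved after your fixes:

	entry = pmd_to_swp_entry(old_pmd);
	/* was: page = migration_entry_to_page(entry); */
	page = pfn_swap_entry_to_page(entry);

nothing more than a rename at the call site.)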

>         if (PageAnon(page))
>                 try_to_migrate(page, ttu_flags);
>         else
>                 try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK);
> 
> with
>         /* If try_to_migrate() is used on file, remove this check */
> in remap_page() to replace the
>         /* If TTU_SPLIT_FREEZE is ever extended to file, remove this check */
> comment my series puts there (since you delete TTU_SPLIT_FREEZE altogether).

> Hugh

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 00/10] mm/thp: fix THP splitting unmap BUGs and related
  2021-06-11  7:28       ` Alistair Popple
@ 2021-06-11 20:56           ` Hugh Dickins
  0 siblings, 0 replies; 37+ messages in thread
From: Hugh Dickins @ 2021-06-11 20:56 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Hugh Dickins, Andrew Morton, Kirill A. Shutemov, Yang Shi,
	Wang Yugui, Matthew Wilcox, Naoya Horiguchi, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Shakeel Butt, Oscar Salvador, linux-mm, linux-kernel

On Fri, 11 Jun 2021, Alistair Popple wrote:
> On Friday, 11 June 2021 10:15:51 AM AEST Hugh Dickins wrote:
> > 
> > Sorry to give you the bother, Alistair: it's worked out as a bad moment
> > to rewrite swapops.h and rmap.c, I'm afraid.
> 
> Indeed, but I don't think it's too bad. I've just tried rebasing it on this
> series and it didn't run into too many problems. Obviously I ran into the same
> issue Andrew did but I was able to fix that up. It also means try_to_migrate()
> now returns 'void' instead of 'bool'.

Yes, void try_to_migrate().

> 
> Which brings me to the only real question I had during the rebase - does
> migration also need to accept the TTU_SYNC flag? I think it does because if I
> understand correctly we can still hit the same race with zap_pte_range() when
> trying to establish migration entries which previously also returned the status
> of page_mapped().

Yes, try_to_migrate() needs to accept TTU_SYNC too.
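
(Concretely - though only as a sketch, subject to how your rebase ends up -
try_to_migrate_one() wants the same two lines try_to_unmap_one() gained:

	if (flags & TTU_SYNC)
		pvmw.flags = PVMW_SYNC;

and if try_to_migrate() keeps a sanity check on which TTU flags it will
accept, then TTU_SYNC needs adding to that list too.)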

> 
> > And the only help I've had time to give you was pointing Peter at your
> > series - many thanks to Peter, and to Shakeel.
> 
> Yes, thanks for the help there. I think the main questions I had for you were
> around checking vma flags under the ptl in try_to_munlock_one but Shakeel was
> able to clear that up for me. Thanks!
> 
> > Several times I've been on the point of asking you to keep the familiar
> > migration_entry_to_page(), along with your new pfn_swap_entry_to_page();
> > but each time I've looked, seen that it's hard to retain it sensibly at
> > the same time as overdue cleanup of the device_private_entry_to_page()s.
> 
> Yeah, it would make things a bit funny to retain it IMHO. At least any fixups
> should just be simple substitutions.
> 
> > So I guess I'm resigned to losing it; but there are at least three
> > bugs currently under discussion or fixes in flight, which border on
> > migration_entry_to_page() - Jann Horn's smaps syzbot bug, Xu Yu's
> > __migration_entry_wait() fix, my __split_huge_pmd_locked() fix
> > (and page_vma_mapped_walk() cleanup).
> > 
> > And regarding huge_memory.c's unmap_page(): I did not recognize the
> > "helps handle cases when i_size" comment you added there.  What I
> > ended up with (and thought was in mmotm-adjust.tar but seems not):
> > 
> >         /*
> >          * Anon pages need migration entries to preserve them, but file
> >          * pages can simply be left unmapped, then faulted back on demand.
> >          * If that is ever changed (perhaps for mlock), update remap_page().
> >          */
> 
> My comment was based somewhat on the commit message for the original change but
> yours is much clearer so will incorporate it into my rebase, thanks.

Oh, you did better than I, I didn't think to look there on this occasion.
But even so, the i_size business is just one detail, and the new comment is
better, I think (I also disliked a comment on an else without { } around it).

> 
> As to sending my rebased series I suppose it would be best to wait until
> linux-mm has been updated with whatever other fixes are needed before resending
> it based on top of that. So far rebasing on this series didn't require too many
> drastic changes to my v10 series. The most significant was to incorporate your
> changes to unmap_page(). The remaining were just adding the TTU_SYNC case to
> try_to_migrate{_one} and a single s/migration_entry_to_page/pfn_swap_entry_to_page/
> in huge_memory.c

Yes, I think that's it.  But check your try_to_migrate_one(), it may
want the same range.end vma_address_end() mod I made in try_to_unmap_one().
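
(That mod being, roughly - this is from memory, so please check against the
actual 04/10 patch rather than trusting my sketch:

	range.end = vma_address_end(page, vma);
	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
				address, range.end);

so that the invalidation range covers every pte of the THP mapped in this
vma, rather than stopping short of them.)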

And does try_to_migrate_one() still have a comment referring to
try_to_unmap() when it should say try_to_migrate() there?

I've now located the diffs I missed from sending akpm before,
and diffed the diffs, and those are the points I see there;
but sending them now will just be a waste of everyone's time.
No substitute for me checking your end result when it comes,
though I fear to do so since there's much more in your series
than I can wrap my head around without a lot more education.

Hugh

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 00/10] mm/thp: fix THP splitting unmap BUGs and related
  2021-06-11 20:56           ` Hugh Dickins
@ 2021-06-12  7:34           ` Alistair Popple
  2021-06-12  8:20               ` Hugh Dickins
  -1 siblings, 1 reply; 37+ messages in thread
From: Alistair Popple @ 2021-06-12  7:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Ralph Campbell, Zi Yan,
	Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Shakeel Butt, Oscar Salvador, linux-mm, linux-kernel

On Saturday, 12 June 2021 6:56:36 AM AEST Hugh Dickins wrote:
> >
> > As to sending my rebased series I suppose it would be best to wait until
> > linux-mm has been updated with whatever other fixes are needed before resending
> > it based on top of that. So far rebasing on this series didn't require too many
> > drastic changes to my v10 series. The most significant was to incorporate your
> > changes to unmap_page(). The remaining were just adding the TTU_SYNC case to
> > try_to_migrate{_one} and a single s/migration_entry_to_page/pfn_swap_entry_to_page/
> > in huge_memory.c
> 
> Yes, I think that's it.  But check your try_to_migrate_one(), it may
> want the same range.end vma_address_end() mod I made in try_to_unmap_one().
> 
> And does try_to_migrate_one() still have a comment referring to
> try_to_unmap() when it should say try_to_migrate() there?

Thanks for the pointers, I had caught both those as well.

> I've now located the diffs I missed from sending akpm before,
> and diffed the diffs, and those are the points I see there;
> but sending them now will just be a waste of everyones time.
> No substitute for me checking your end result when it comes,
> though I fear to do so since there's much more in your series
> than I can wrap my head around without a lot more education.

The first few patches in the series (and the ones with conflicts) are
clean-ups, so shouldn't change any behaviour. I'm reasonably confident I caught
everything  but would certainly appreciate you checking the end result in the
early patches when I post just to make sure I didn't miss anything. Thanks.

Also I have been getting bounce responses trying to deliver mail to linux-mm
in case anyone is wondering why these might not be showing up on the mailing
list. Looks to be some kind of mail loop, but not sure if it's at my end or
somewhere else.

> Hugh

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 00/10] mm/thp: fix THP splitting unmap BUGs and related
  2021-06-12  7:34           ` Alistair Popple
@ 2021-06-12  8:20               ` Hugh Dickins
  0 siblings, 0 replies; 37+ messages in thread
From: Hugh Dickins @ 2021-06-12  8:20 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Hugh Dickins, Andrew Morton, Kirill A. Shutemov, Yang Shi,
	Wang Yugui, Matthew Wilcox, Naoya Horiguchi, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Shakeel Butt, Oscar Salvador, linux-mm, linux-kernel

On Sat, 12 Jun 2021, Alistair Popple wrote:
> On Saturday, 12 June 2021 6:56:36 AM AEST Hugh Dickins wrote:

I wonder how I arrived in Queensland - ^^^^ must I quarantine?

> > >
> > > As to sending my rebased series I suppose it would be best to wait until
> > > linux-mm has been updated with whatever other fixes are needed before resending
> > > it based on top of that. So far rebasing on this series didn't require too many
> > > drastic changes to my v10 series. The most significant was to incorporate your
> > > changes to unmap_page(). The remaining were just adding the TTU_SYNC case to
> > > try_to_migrate{_one} and a single s/migration_entry_to_page/pfn_swap_entry_to_page/
> > > in huge_memory.c
> > 
> > Yes, I think that's it.  But check your try_to_migrate_one(), it may
> > want the same range.end vma_address_end() mod I made in try_to_unmap_one().
> > 
> > And does try_to_migrate_one() still have a comment referring to
> > try_to_unmap() when it should say try_to_migrate() there?
> 
> Thanks for the pointers, I had caught both those as well.
> 
> > I've now located the diffs I missed from sending akpm before,
> > and diffed the diffs, and those are the points I see there;
> > but sending them now will just be a waste of everyone's time.
> > No substitute for me checking your end result when it comes,
> > though I fear to do so since there's much more in your series
> > than I can wrap my head around without a lot more education.
> 
> The first few patches in the series (and the ones with conflicts) are
> clean-ups, so shouldn't change any behaviour. I'm reasonably confident I caught
> everything  but would certainly appreciate you checking the end result in the
> early patches when I post just to make sure I didn't miss anything. Thanks.
> 
> Also I have been getting bounce responses trying to deliver mail to linux-mm
> in case anyone is wondering why these might not be showing up on the mailing
> list. Looks to be some kind of mail loop, but not sure if it's at my end or
> somewhere else.

linux-mm@kvack.org has been having trouble recently.

See https://lore.kernel.org/linux-mm/YMJrk81oZa5ArWPw@cmpxchg.org/

There are more messages in that thread, but only one has got out.

Hugh

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2021-06-12  8:20 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <af88612-1473-2eaa-903-8d1a448b26@google.com>
2021-06-09  4:08 ` [PATCH v2 02/10] mm/thp: make is_huge_zero_pmd() safe and quicker Hugh Dickins
2021-06-09  4:08   ` Hugh Dickins
2021-06-09 10:22   ` Kirill A. Shutemov
2021-06-09 16:56   ` Yang Shi
2021-06-09 16:56     ` Yang Shi
2021-06-09  4:14 ` [PATCH v2 04/10] mm/thp: fix vma_address() if virtual address below file offset Hugh Dickins
2021-06-09  4:14   ` Hugh Dickins
2021-06-09  4:16 ` [PATCH v2 05/10] mm/thp: fix page_address_in_vma() on file THP tails Hugh Dickins
2021-06-09  4:16   ` Hugh Dickins
2021-06-09  4:19 ` [PATCH v2 06/10] mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page() Hugh Dickins
2021-06-09  4:19   ` Hugh Dickins
2021-06-09 17:02   ` Yang Shi
2021-06-09 17:02     ` Yang Shi
2021-06-09 21:11     ` Hugh Dickins
2021-06-09 21:11       ` Hugh Dickins
2021-06-09 21:16       ` [PATCH v3 " Hugh Dickins
2021-06-09 21:16         ` Hugh Dickins
2021-06-09 21:51         ` Yang Shi
2021-06-09 21:51           ` Yang Shi
2021-06-09  4:22 ` [PATCH v2 07/10] mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split Hugh Dickins
2021-06-09  4:22   ` Hugh Dickins
2021-06-09  4:25 ` [PATCH v2 08/10] mm: rmap: make try_to_unmap() void function Hugh Dickins
2021-06-09  4:25   ` Hugh Dickins
2021-06-10  7:57   ` HORIGUCHI NAOYA(堀口 直也)
2021-06-09  4:27 ` [PATCH v2 09/10] mm/thp: remap_page() is only needed on anonymous THP Hugh Dickins
2021-06-09  4:27   ` Hugh Dickins
2021-06-09  4:30 ` [PATCH v2 10/10] mm: hwpoison_user_mappings() try_to_unmap() with TTU_SYNC Hugh Dickins
2021-06-09  4:30   ` Hugh Dickins
2021-06-09 10:27   ` Kirill A. Shutemov
2021-06-10  7:38   ` HORIGUCHI NAOYA(堀口 直也)
     [not found] ` <20210610151505.d0124033e55bda07fa3d4408@linux-foundation.org>
     [not found]   ` <2014832.e7zRqyNrDn@nvdebian>
2021-06-11  0:15     ` [PATCH v2 00/10] mm/thp: fix THP splitting unmap BUGs and related Hugh Dickins
2021-06-11  7:28       ` Alistair Popple
2021-06-11 20:56         ` Hugh Dickins
2021-06-11 20:56           ` Hugh Dickins
2021-06-12  7:34           ` Alistair Popple
2021-06-12  8:20             ` Hugh Dickins
2021-06-12  8:20               ` Hugh Dickins
