* [PATCH 0/7] mm/thp: fix THP splitting unmap BUGs and related
@ 2021-06-01 21:03 ` Hugh Dickins
From: Hugh Dickins @ 2021-06-01 21:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Yang Shi, Wang Yugui, Matthew Wilcox,
	Naoya Horiguchi, Alistair Popple, Ralph Campbell, Zi Yan,
	Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Hugh Dickins, linux-mm, linux-kernel

Here is a batch of long-standing THP bug fixes that I had not got
around to sending before, prompted now by Wang Yugui's report:
https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/

Wang Yugui has tested a rollup of these fixes applied to 5.10.39:
they have done no harm, but have *not* fixed that issue:
something more is needed, and I have no idea what.

But at least these clear up related issues, and should go to stable
(except for the last, which is just an optimization: it would be
fine in stable, but it's not required there).

These are against 5.13-rc4: easy for others to try out, but my next
task is to work out how to shoehorn them into mmotm and linux-next.

I would have said just before Yang Shi's related
mm-thp-replace-debug_vm-bug-with-vm_warn-when-unmap-fails-for-split.patch,
except that that patch (which should also go to stable) is currently
placed after Alistair Popple's "Add support for SVM atomics in Nouveau"
series, mm-rmap-split-try_to_munlock-from-try_to_unmap.patch etc.
I expect I shall offer you some rediffs of Alistair's; we'll see.

1/7 mm/thp: fix __split_huge_pmd_locked() on shmem migration entry
2/7 mm/thp: try_to_unmap() use TTU_SYNC for safe DEBUG_VM splitting
3/7 mm/thp: fix vma_address() if virtual address below file offset
4/7 mm/thp: fix page_address_in_vma() on file THP tails
5/7 mm/thp: fix page_vma_mapped_walk() if huge page mapped by ptes
6/7 mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
7/7 mm/thp: remap_page() is only needed on anonymous THP

 include/linux/mm.h   |    3 
 include/linux/rmap.h |    3 
 mm/huge_memory.c     |   47 ++++++++----
 mm/internal.h        |   54 ++++++++++----
 mm/memory.c          |   40 ++++++++++
 mm/page_vma_mapped.c |  163 ++++++++++++++++++++++++++-----------------
 mm/pgtable-generic.c |    5 -
 mm/rmap.c            |   39 +++++++---
 mm/truncate.c        |   43 +++++------
 9 files changed, 266 insertions(+), 131 deletions(-)

Hugh

* [PATCH 1/7] mm/thp: fix __split_huge_pmd_locked() on shmem migration entry
  2021-06-01 21:03 ` Hugh Dickins
@ 2021-06-01 21:05   ` Hugh Dickins
From: Hugh Dickins @ 2021-06-01 21:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	linux-mm, linux-kernel

Stressing huge tmpfs page migration racing hole punch often crashed on the
VM_BUG_ON(!pmd_present) in pmdp_huge_clear_flush(), with DEBUG_VM=y kernel;
or shortly afterwards, on a bad dereference in __split_huge_pmd_locked()
when DEBUG_VM=n.  They forgot to allow for pmd migration entries in the
non-anonymous case.

Full disclosure: those particular experiments were on a kernel with more
relaxed mmap_lock and i_mmap_rwsem locking, and were not repeated on the
vanilla kernel: it is conceivable that stricter locking happens to avoid
those cases, or makes them less likely; but __split_huge_pmd_locked()
already allowed for pmd migration entries when handling anonymous THPs,
so this commit brings the shmem and file THP handling into line.

Are there more places that need to be careful about pmd migration entries?
None hit in practice, but several of those is_huge_zero_pmd() tests were
done without checking pmd_present() first: I believe a pmd migration entry
could end up satisfying that test.  Ah, the inversion of swap offset, to
protect against L1TF, makes that impossible on x86; but other arches need
the pmd_present() check, and even x86 ought not to apply pmd_page() to a
swap-like pmd.  Fix those instances; __split_huge_pmd_locked() was not
wrong to be checking with pmd_trans_huge() instead, but I think it's
clearer to use pmd_present() in each instance.
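
For illustration only, here is a user-space toy model of that ordering
concern (the pmd encoding, bit layout and pfn values below are invented,
not the kernel's real ones): a swap-format pmd can satisfy a naive
zero-pmd comparison unless pmd_present() is checked first.

#include <stdio.h>
#include <stdint.h>

typedef uint64_t pmd_t;			/* toy pmd: bit 0 = present */
#define PMD_PRESENT	0x1UL

static const uint64_t huge_zero_pfn = 0x1234;	/* invented pfn */

static int pmd_present(pmd_t pmd)  { return pmd & PMD_PRESENT; }
static uint64_t pmd_pfn(pmd_t pmd) { return pmd >> 12; }

/* Naive check: applies the pfn comparison even to a swap-like pmd */
static int is_huge_zero_pmd_naive(pmd_t pmd)
{
	return pmd_pfn(pmd) == huge_zero_pfn;
}

int main(void)
{
	/* non-present entry whose offset bits happen to match the pfn */
	pmd_t migration_pmd = huge_zero_pfn << 12;

	printf("naive check mistakes it for the zero pmd: %d\n",
	       is_huge_zero_pmd_naive(migration_pmd));
	printf("pmd_present() first gets it right:       %d\n",
	       pmd_present(migration_pmd) &&
	       is_huge_zero_pmd_naive(migration_pmd));
	return 0;
}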

And while there: make it clearer to the eye that the !vma_is_anonymous()
and is_huge_zero_pmd() blocks make early returns (and don't return void).

Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
 mm/huge_memory.c     | 38 ++++++++++++++++++++++++--------------
 mm/pgtable-generic.c |  5 ++---
 2 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 63ed6b25deaa..9fb7b47da87e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1587,9 +1587,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		goto out_unlocked;
 
 	orig_pmd = *pmd;
-	if (is_huge_zero_pmd(orig_pmd))
-		goto out;
-
 	if (unlikely(!pmd_present(orig_pmd))) {
 		VM_BUG_ON(thp_migration_supported() &&
 				  !is_pmd_migration_entry(orig_pmd));
@@ -1597,6 +1594,9 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	}
 
 	page = pmd_page(orig_pmd);
+	if (is_huge_zero_page(page))
+		goto out;
+
 	/*
 	 * If other processes are mapping this page, we couldn't discard
 	 * the page unless they all do MADV_FREE so let's skip the page.
@@ -1676,7 +1676,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		spin_unlock(ptl);
 		if (is_huge_zero_pmd(orig_pmd))
 			tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
-	} else if (is_huge_zero_pmd(orig_pmd)) {
+	} else if (pmd_present(orig_pmd) && is_huge_zero_pmd(orig_pmd)) {
 		zap_deposited_table(tlb->mm, pmd);
 		spin_unlock(ptl);
 		tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
@@ -2044,7 +2044,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	count_vm_event(THP_SPLIT_PMD);
 
 	if (!vma_is_anonymous(vma)) {
-		_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+		old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
 		/*
 		 * We are going to unmap this huge page. So
 		 * just go ahead and zap it
@@ -2053,16 +2053,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			zap_deposited_table(mm, pmd);
 		if (vma_is_special_huge(vma))
 			return;
-		page = pmd_page(_pmd);
-		if (!PageDirty(page) && pmd_dirty(_pmd))
-			set_page_dirty(page);
-		if (!PageReferenced(page) && pmd_young(_pmd))
-			SetPageReferenced(page);
-		page_remove_rmap(page, true);
-		put_page(page);
+		if (unlikely(is_pmd_migration_entry(old_pmd))) {
+			swp_entry_t entry;
+
+			entry = pmd_to_swp_entry(old_pmd);
+			page = migration_entry_to_page(entry);
+		} else {
+			page = pmd_page(old_pmd);
+			if (!PageDirty(page) && pmd_dirty(old_pmd))
+				set_page_dirty(page);
+			if (!PageReferenced(page) && pmd_young(old_pmd))
+				SetPageReferenced(page);
+			page_remove_rmap(page, true);
+			put_page(page);
+		}
 		add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
 		return;
-	} else if (pmd_trans_huge(*pmd) && is_huge_zero_pmd(*pmd)) {
+	}
+
+	if (pmd_present(*pmd) && is_huge_zero_pmd(*pmd)) {
 		/*
 		 * FIXME: Do we want to invalidate secondary mmu by calling
 		 * mmu_notifier_invalidate_range() see comments below inside
@@ -2072,7 +2081,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 * small page also write protected so it does not seems useful
 		 * to invalidate secondary mmu at this time.
 		 */
-		return __split_huge_zero_page_pmd(vma, haddr, pmd);
+		__split_huge_zero_page_pmd(vma, haddr, pmd);
+		return;
 	}
 
 	/*
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index c2210e1cdb51..4e640baf9794 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -135,9 +135,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
 {
 	pmd_t pmd;
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-	VM_BUG_ON(!pmd_present(*pmdp));
-	/* Below assumes pmd_present() is true */
-	VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+	VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
+			   !pmd_devmap(*pmdp));
 	pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
 	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 	return pmd;
-- 
2.32.0.rc0.204.g9fa02ecfa5-goog


* [PATCH 2/7] mm/thp: try_to_unmap() use TTU_SYNC for safe DEBUG_VM splitting
  2021-06-01 21:03 ` Hugh Dickins
@ 2021-06-01 21:07   ` Hugh Dickins
From: Hugh Dickins @ 2021-06-01 21:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	linux-mm, linux-kernel

Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE
(!unmap_success): with dump_page() showing mapcount:1, but then its
raw struct page output showing _mapcount ffffffff i.e. mapcount 0.

And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed,
it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)),
and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG():
all indicative of some mapcount difficulty in development here perhaps.
But the !CONFIG_DEBUG_VM path handles the failures correctly and silently.

I believe the problem is that once a racing unmap has cleared pte or pmd,
try_to_unmap_one() may skip taking the page table lock, and emerge from
try_to_unmap() before the racing task has reached decrementing mapcount.
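
As a rough illustration of that window, here is a minimal user-space
model with pthreads (not kernel code: the pmd, mapcount and delay below
are invented stand-ins for *pmd, compound_mapcount() and the real
timing).  The zapper clears the pmd and decrements mapcount under one
lock hold; an unsynchronized checker can observe the cleared pmd while
mapcount still reads 1, whereas taking and dropping the lock first, as
TTU_SYNC/PVMW_SYNC does, serializes behind the zapper:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER;
static volatile int pmd_mapped = 1;	/* stand-in for *pmd */
static volatile int mapcount = 1;	/* stand-in for compound_mapcount */

/* Models the racing zap: clear pmd, then drop mapcount, all under ptl */
static void *zapper(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&ptl);
	pmd_mapped = 0;
	usleep(1000);			/* widen the window for the demo */
	mapcount--;
	pthread_mutex_unlock(&ptl);
	return NULL;
}

/* Models try_to_unmap()'s final mapcount check after seeing pmd cleared */
static int unmap_succeeded(int sync)
{
	while (pmd_mapped)
		;			/* spin until the clear is visible */
	if (sync) {
		/* TTU_SYNC: take and drop ptl, so the zapper has finished
		 * decrementing mapcount before we look at it */
		pthread_mutex_lock(&ptl);
		pthread_mutex_unlock(&ptl);
	}
	return mapcount == 0;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, zapper, NULL);
	printf("without sync: %d (0 here is the false failure)\n",
	       unmap_succeeded(0));
	printf("with sync:    %d\n", unmap_succeeded(1));
	pthread_join(t, NULL);
	return 0;
}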

Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that
follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding TTU_SYNC
to the options, and passing that from unmap_page() when CONFIG_DEBUG_VM=y.
It could be passed in the non-debug case too, but that would sometimes add
a little overhead, whereas it's rare for this race to result in failure.

mm/memory-failure.c:hwpoison_user_mappings() should probably use the new
TTU_SYNC option too, just in case this race coincides with its attempts to
unmap a failing page (THP or not); but this commit does not add that.

Fixes: fec89c109f3a ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
 include/linux/rmap.h |  3 ++-
 mm/huge_memory.c     |  4 ++++
 mm/page_vma_mapped.c |  8 ++++++++
 mm/rmap.c            | 17 ++++++++++++++++-
 4 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index def5c62c93b3..891599a4cb8d 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -97,7 +97,8 @@ enum ttu_flags {
 					 * do a final flush if necessary */
 	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
 					 * caller holds it */
-	TTU_SPLIT_FREEZE	= 0x100,		/* freeze pte under splitting thp */
+	TTU_SPLIT_FREEZE	= 0x100, /* freeze pte under splitting thp */
+	TTU_SYNC		= 0x200, /* avoid racy checks with PVMW_SYNC */
 };
 
 #ifdef CONFIG_MMU
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9fb7b47da87e..305f709a7aca 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2357,6 +2357,10 @@ static void unmap_page(struct page *page)
 	if (PageAnon(page))
 		ttu_flags |= TTU_SPLIT_FREEZE;
 
+	/* Make sure that the BUGs will not bite */
+	if (IS_ENABLED(CONFIG_DEBUG_VM))
+		ttu_flags |= TTU_SYNC;
+
 	unmap_success = try_to_unmap(page, ttu_flags);
 	VM_BUG_ON_PAGE(!unmap_success, page);
 }
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 2cf01d933f13..b45d22738b45 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -212,6 +212,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			pvmw->ptl = NULL;
 		}
 	} else if (!pmd_present(pmde)) {
+		/*
+		 * If PVMW_SYNC, take and drop THP pmd lock so that we
+		 * cannot return prematurely, while zap_huge_pmd() has
+		 * cleared *pmd but not decremented compound_mapcount().
+		 */
+		if ((pvmw->flags & PVMW_SYNC) &&
+		    PageTransCompound(pvmw->page))
+			spin_unlock(pmd_lock(mm, pvmw->pmd));
 		return false;
 	}
 	if (!map_pte(pvmw))
diff --git a/mm/rmap.c b/mm/rmap.c
index 693a610e181d..07811b4ae793 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1405,6 +1405,15 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
 
+	/*
+	 * When racing against e.g. zap_pte_range() on another cpu,
+	 * in between its ptep_get_and_clear_full() and page_remove_rmap(),
+	 * try_to_unmap() may return false when it is about to become true,
+	 * if page table locking is skipped: use TTU_SYNC to wait for that.
+	 */
+	if (flags & TTU_SYNC)
+		pvmw.flags = PVMW_SYNC;
+
 	/* munlock has nothing to gain from examining un-locked vmas */
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
 		return true;
@@ -1777,7 +1786,13 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
 	else
 		rmap_walk(page, &rwc);
 
-	return !page_mapcount(page) ? true : false;
+	/*
+	 * When racing against e.g. zap_pte_range() on another cpu,
+	 * in between its ptep_get_and_clear_full() and page_remove_rmap(),
+	 * try_to_unmap() may return false when it is about to become true,
+	 * if page table locking is skipped: use TTU_SYNC to wait for that.
+	 */
+	return !page_mapcount(page);
 }
 
 /**
-- 
2.32.0.rc0.204.g9fa02ecfa5-goog


* [PATCH 3/7] mm/thp: fix vma_address() if virtual address below file offset
  2021-06-01 21:03 ` Hugh Dickins
@ 2021-06-01 21:09   ` Hugh Dickins
From: Hugh Dickins @ 2021-06-01 21:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	linux-mm, linux-kernel

Running certain tests with a DEBUG_VM kernel would crash within hours,
on the total_mapcount BUG() in split_huge_page_to_list(), while trying
to free up some memory by punching a hole in a shmem huge page: split's
try_to_unmap() was unable to find all the mappings of the page (which,
on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).

When that BUG() was changed to a WARN(), it would later crash on the
VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma) in
mm/internal.h:vma_address(), used by rmap_walk_file() for try_to_unmap().

vma_address() is usually correct, but there's a wraparound case when the
vm_start address is unusually low while vm_pgoff is not so low:
vma_address() chooses max(start, vma->vm_start), but that picks the wrong
address, because start has wrapped around to almost ULONG_MAX.
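
The wraparound is plain unsigned arithmetic, easily reproduced in a few
lines of user-space C (the vma numbers below are invented, chosen only
so that the page's pgoff falls below vm_pgoff):

#include <stdio.h>

#define PAGE_SHIFT 12

int main(void)
{
	unsigned long vm_start = 0x2000;   /* unusually low vm_start */
	unsigned long vm_pgoff = 0x400;    /* but vm_pgoff not so low */
	unsigned long pgoff    = 0x3f0;    /* THP head below vm_pgoff */

	/* old __vma_address(): (pgoff - vm_pgoff) underflows */
	unsigned long start = vm_start + ((pgoff - vm_pgoff) << PAGE_SHIFT);

	printf("start = %#lx\n", start);   /* almost ULONG_MAX */
	/* so max(start, vma->vm_start) picks start: the wrong address */
	printf("max   = %#lx\n", start > vm_start ? start : vm_start);
	return 0;
}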

Rewrite vma_address() to be more careful about vm_pgoff; move the
VM_BUG_ON_VMA() out of it, returning -EFAULT for errors, so that it can
be safely used from page_mapped_in_vma() and page_address_in_vma() too.

Add vma_address_end() to apply similar care to end address calculation,
in page_vma_mapped_walk() and page_mkclean_one() and try_to_unmap_one();
though it raises a question of whether callers would do better to supply
pvmw->end to page_vma_mapped_walk() - I chose not, for a smaller patch.

An irritation is that their apparent generality breaks down on KSM pages,
which cannot be located by the page->index that page_to_pgoff() uses: as
4b0ece6fa016 ("mm: migrate: fix remove_migration_pte() for ksm pages")
once discovered.  I dithered over the best thing to do about that, and
have ended up with a VM_BUG_ON_PAGE(PageKsm) in both vma_address() and
vma_address_end(); though the only place in danger of using it on them
was try_to_unmap_one().

Sidenote: vma_address() and vma_address_end() now use compound_order() on
a head page, instead of thp_size(): to make the right calculation on a
hugetlbfs page, whether or not THPs are configured.  try_to_unmap() is
used on hugetlbfs pages, but perhaps the wrong calculation never mattered.

Fixes: a8fa41ad2f6f ("mm, rmap: check all VMAs that PTE-mapped THP can be part of")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
 mm/internal.h        | 56 +++++++++++++++++++++++++++++++++-----------
 mm/page_vma_mapped.c | 16 +++++--------
 mm/rmap.c            | 16 ++++++-------
 3 files changed, 56 insertions(+), 32 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 2f1182948aa6..00d2dfa534cf 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -384,27 +384,55 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 
 /*
- * At what user virtual address is page expected in @vma?
+ * At what user virtual address is page expected in vma?
+ * Returns -EFAULT if all of the page is outside the range of vma.
+ * If page is a compound head, the entire compound page is considered.
  */
 static inline unsigned long
-__vma_address(struct page *page, struct vm_area_struct *vma)
+vma_address(struct page *page, struct vm_area_struct *vma)
 {
-	pgoff_t pgoff = page_to_pgoff(page);
-	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	pgoff_t pgoff;
+	unsigned long address;
+
+	VM_BUG_ON_PAGE(PageKsm(page), page);	/* KSM page->index unusable */
+	pgoff = page_to_pgoff(page);
+	if (pgoff >= vma->vm_pgoff) {
+		address = vma->vm_start +
+			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+		/* Check for address beyond vma (or wrapped through 0?) */
+		if (address < vma->vm_start || address >= vma->vm_end)
+			address = -EFAULT;
+	} else if (PageHead(page) &&
+		   pgoff + (1UL << compound_order(page)) > vma->vm_pgoff) {
+		address = vma->vm_start;
+	} else {
+		address = -EFAULT;
+	}
+	return address;
 }
 
+/*
+ * Then at what user virtual address will none of the page be found in vma?
+ * Assumes that vma_address() already returned a good starting address.
+ * If page is a compound head, the entire compound page is considered.
+ */
 static inline unsigned long
-vma_address(struct page *page, struct vm_area_struct *vma)
+vma_address_end(struct page *page, struct vm_area_struct *vma)
 {
-	unsigned long start, end;
-
-	start = __vma_address(page, vma);
-	end = start + thp_size(page) - PAGE_SIZE;
-
-	/* page should be within @vma mapping range */
-	VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma);
-
-	return max(start, vma->vm_start);
+	pgoff_t pgoff;
+	unsigned long address;
+
+	VM_BUG_ON_PAGE(PageKsm(page), page);	/* KSM page->index unusable */
+	pgoff = page_to_pgoff(page);
+	if (PageHead(page))
+		pgoff += 1UL << compound_order(page);
+	else
+		pgoff++;
+	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	/* Check for address beyond vma (or wrapped through 0?) */
+	if (address < vma->vm_start || address > vma->vm_end)
+		address = vma->vm_end;
+	return address;
 }
 
 static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index b45d22738b45..92f0ecd50524 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -225,18 +225,18 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	if (!map_pte(pvmw))
 		goto next_pte;
 	while (1) {
+		unsigned long end;
+
 		if (check_pte(pvmw))
 			return true;
 next_pte:
 		/* Seek to next pte only makes sense for THP */
 		if (!PageTransHuge(pvmw->page) || PageHuge(pvmw->page))
 			return not_found(pvmw);
+		end = vma_address_end(pvmw->page, pvmw->vma);
 		do {
 			pvmw->address += PAGE_SIZE;
-			if (pvmw->address >= pvmw->vma->vm_end ||
-			    pvmw->address >=
-					__vma_address(pvmw->page, pvmw->vma) +
-					thp_size(pvmw->page))
+			if (pvmw->address >= end)
 				return not_found(pvmw);
 			/* Did we cross page table boundary? */
 			if (pvmw->address % PMD_SIZE == 0) {
@@ -274,14 +274,10 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
 		.vma = vma,
 		.flags = PVMW_SYNC,
 	};
-	unsigned long start, end;
-
-	start = __vma_address(page, vma);
-	end = start + thp_size(page) - PAGE_SIZE;
 
-	if (unlikely(end < vma->vm_start || start >= vma->vm_end))
+	pvmw.address = vma_address(page, vma);
+	if (pvmw.address == -EFAULT)
 		return 0;
-	pvmw.address = max(start, vma->vm_start);
 	if (!page_vma_mapped_walk(&pvmw))
 		return 0;
 	page_vma_mapped_walk_done(&pvmw);
diff --git a/mm/rmap.c b/mm/rmap.c
index 07811b4ae793..144de54efc1c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -707,7 +707,6 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
  */
 unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
 {
-	unsigned long address;
 	if (PageAnon(page)) {
 		struct anon_vma *page__anon_vma = page_anon_vma(page);
 		/*
@@ -722,10 +721,8 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
 			return -EFAULT;
 	} else
 		return -EFAULT;
-	address = __vma_address(page, vma);
-	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
-		return -EFAULT;
-	return address;
+
+	return vma_address(page, vma);
 }
 
 pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
@@ -919,7 +916,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	 */
 	mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
 				0, vma, vma->vm_mm, address,
-				min(vma->vm_end, address + page_size(page)));
+				vma_address_end(page, vma));
 	mmu_notifier_invalidate_range_start(&range);
 
 	while (page_vma_mapped_walk(&pvmw)) {
@@ -1435,9 +1432,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	 * Note that the page can not be free in this function as call of
 	 * try_to_unmap() must hold a reference on the page.
 	 */
+	range.end = PageKsm(page) ?
+			address + PAGE_SIZE : vma_address_end(page, vma);
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
-				address,
-				min(vma->vm_end, address + page_size(page)));
+				address, range.end);
 	if (PageHuge(page)) {
 		/*
 		 * If sharing is possible, start and end will be adjusted
@@ -1889,6 +1887,7 @@ static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc,
 		struct vm_area_struct *vma = avc->vma;
 		unsigned long address = vma_address(page, vma);
 
+		VM_BUG_ON_VMA(address == -EFAULT, vma);
 		cond_resched();
 
 		if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
@@ -1943,6 +1942,7 @@ static void rmap_walk_file(struct page *page, struct rmap_walk_control *rwc,
 			pgoff_start, pgoff_end) {
 		unsigned long address = vma_address(page, vma);
 
+		VM_BUG_ON_VMA(address == -EFAULT, vma);
 		cond_resched();
 
 		if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
-- 
2.32.0.rc0.204.g9fa02ecfa5-goog


* [PATCH 4/7] mm/thp: fix page_address_in_vma() on file THP tails
  2021-06-01 21:03 ` Hugh Dickins
@ 2021-06-01 21:11   ` Hugh Dickins
From: Hugh Dickins @ 2021-06-01 21:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	linux-mm, linux-kernel

From: Jue Wang <juew@google.com>

Anon THP tails were already supported, but memory-failure may need to use
page_address_in_vma() on file THP tails, which its page->mapping check did
not permit: fix it.

hughd adds: no current usage is known to hit the issue, but this does fix
a subtle trap in a general helper: best fixed in stable sooner than later.

Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Signed-off-by: Jue Wang <juew@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
 mm/rmap.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 144de54efc1c..e05c300048e6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -716,11 +716,11 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
 		if (!vma->anon_vma || !page__anon_vma ||
 		    vma->anon_vma->root != page__anon_vma->root)
 			return -EFAULT;
-	} else if (page->mapping) {
-		if (!vma->vm_file || vma->vm_file->f_mapping != page->mapping)
-			return -EFAULT;
-	} else
+	} else if (!vma->vm_file) {
+		return -EFAULT;
+	} else if (vma->vm_file->f_mapping != compound_head(page)->mapping) {
 		return -EFAULT;
+	}
 
 	return vma_address(page, vma);
 }
-- 
2.32.0.rc0.204.g9fa02ecfa5-goog


* [PATCH 5/7] mm/thp: fix page_vma_mapped_walk() if huge page mapped by ptes
  2021-06-01 21:03 ` Hugh Dickins
@ 2021-06-01 21:13   ` Hugh Dickins
From: Hugh Dickins @ 2021-06-01 21:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	linux-mm, linux-kernel

Running certain tests with a DEBUG_VM kernel would crash within hours,
on the total_mapcount BUG() in split_huge_page_to_list(), while trying
to free up some memory by punching a hole in a shmem huge page: split's
try_to_unmap() was unable to find all the mappings of the page (which,
on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).

Crash dumps showed two tail pages of a shmem huge page remained mapped
by pte: ptes in a non-huge-aligned vma of a gVisor process, at the end
of a long unmapped range; and no page table had yet been allocated for
the head of the huge page to be mapped into.

Although designed to handle these odd misaligned huge-page-mapped-by-pte
cases, page_vma_mapped_walk() falls short by returning false prematurely
when !pmd_present or !pud_present or !p4d_present or !pgd_present: there
are cases when a huge page may span such a boundary, with ptes present
in the next page table.

Restructure page_vma_mapped_walk() as a loop to continue in these cases,
while keeping its layout much as before.  Add a step_forward() helper to
advance pvmw->address across those boundaries: originally I tried to use
mm's standard p?d_addr_end() macros, but hit the same crash 512 times
less often: because redundant levels are folded together, yet folded
differently in different configurations, it was just too difficult to
use those macros correctly; and step_forward() is simpler anyway.
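
A minimal user-space sketch of what step_forward() does to the address
(PMD_SIZE and PGDIR_SIZE are hard-coded to typical x86_64 values here,
purely for illustration):

#include <stdio.h>

#define PMD_SIZE	(1UL << 21)
#define PGDIR_SIZE	(1UL << 39)

/* Round address up to the next 'size' boundary; saturate if it wraps */
static unsigned long step_forward(unsigned long address, unsigned long size)
{
	address = (address + size) & ~(size - 1);
	if (!address)
		address = ~0UL;		/* so the walk's end test stops it */
	return address;
}

int main(void)
{
	unsigned long addr = 0x7f1234567000UL;

	printf("%#lx\n", step_forward(addr, PMD_SIZE));   /* next pmd */
	printf("%#lx\n", step_forward(addr, PGDIR_SIZE)); /* next pgd */
	/* last pmd-aligned address: rounding wraps to 0, so it saturates */
	printf("%#lx\n", step_forward(~0UL & ~(PMD_SIZE - 1), PMD_SIZE));
	return 0;
}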

Merged various other minor fixes and cleanups into page_vma_mapped_walk()
as I worked on it: which I find much easier to enumerate here than to
prise apart into separate commits.

Handle all of the hugetlbfs PageHuge() case once at the start,
so we don't need to worry about it again further down.

Sometimes a local copy of pvmw->page was used, sometimes pvmw->page:
just use pvmw->page throughout (and continue to use pvmw->address
throughout, though we could take a local copy instead).

Use pmd_read_atomic() with barrier() instead of READ_ONCE() for pmde:
some architectures (e.g. i386 with PAE) have a multi-word pmd entry,
for which READ_ONCE() is not good enough.

Re-evaluate pmde after taking lock, then use it in subsequent tests,
instead of repeatedly dereferencing pvmw->pmd pointer.

Rearrange the !pmd_present block to follow the same "return not_found,
return not_found, return true" pattern as the block above it (note:
returning not_found there is never premature, since the existence or
prior existence of a huge pmd guarantees good alignment).

Adjust page table boundary test in case address was not page-aligned.

Reset pvmw->pte to NULL after unmapping that page table.

Respect the 80-column line limit.

Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
 mm/page_vma_mapped.c | 159 ++++++++++++++++++++++++++-----------------
 1 file changed, 95 insertions(+), 64 deletions(-)

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 92f0ecd50524..f38ea5aa4acf 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -116,6 +116,13 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 	return pfn_is_match(pvmw->page, pfn);
 }
 
+static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
+{
+	pvmw->address = (pvmw->address + size) & ~(size - 1);
+	if (!pvmw->address)
+		pvmw->address = ULONG_MAX;
+}
+
 /**
  * page_vma_mapped_walk - check if @pvmw->page is mapped in @pvmw->vma at
  * @pvmw->address
@@ -143,7 +150,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 {
 	struct mm_struct *mm = pvmw->vma->vm_mm;
-	struct page *page = pvmw->page;
+	unsigned long end;
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
@@ -153,109 +160,133 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	if (pvmw->pmd && !pvmw->pte)
 		return not_found(pvmw);
 
-	if (pvmw->pte)
-		goto next_pte;
-
 	if (unlikely(PageHuge(pvmw->page))) {
+		/* The only possible mapping was handled on last iteration */
+		if (pvmw->pte)
+			return not_found(pvmw);
+
 		/* when pud is not present, pte will be NULL */
-		pvmw->pte = huge_pte_offset(mm, pvmw->address, page_size(page));
+		pvmw->pte = huge_pte_offset(mm, pvmw->address,
+					    page_size(pvmw->page));
 		if (!pvmw->pte)
 			return false;
 
-		pvmw->ptl = huge_pte_lockptr(page_hstate(page), mm, pvmw->pte);
+		pvmw->ptl = huge_pte_lockptr(page_hstate(pvmw->page),
+					     mm, pvmw->pte);
 		spin_lock(pvmw->ptl);
 		if (!check_pte(pvmw))
 			return not_found(pvmw);
 		return true;
 	}
-restart:
-	pgd = pgd_offset(mm, pvmw->address);
-	if (!pgd_present(*pgd))
-		return false;
-	p4d = p4d_offset(pgd, pvmw->address);
-	if (!p4d_present(*p4d))
-		return false;
-	pud = pud_offset(p4d, pvmw->address);
-	if (!pud_present(*pud))
-		return false;
-	pvmw->pmd = pmd_offset(pud, pvmw->address);
+
 	/*
-	 * Make sure the pmd value isn't cached in a register by the
-	 * compiler and used as a stale value after we've observed a
-	 * subsequent update.
+	 * Seek to next pte only makes sense for THP.
+	 * But more important than that optimization, is to filter out
+	 * any PageKsm page: whose page->index misleads vma_address()
+	 * and vma_address_end() to disaster.
 	 */
-	pmde = READ_ONCE(*pvmw->pmd);
-	if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
-		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
-		if (likely(pmd_trans_huge(*pvmw->pmd))) {
-			if (pvmw->flags & PVMW_MIGRATION)
-				return not_found(pvmw);
-			if (pmd_page(*pvmw->pmd) != page)
-				return not_found(pvmw);
-			return true;
-		} else if (!pmd_present(*pvmw->pmd)) {
-			if (thp_migration_supported()) {
-				if (!(pvmw->flags & PVMW_MIGRATION))
+	end = PageTransCompound(pvmw->page) ?
+		vma_address_end(pvmw->page, pvmw->vma) :
+		pvmw->address + PAGE_SIZE;
+	if (pvmw->pte)
+		goto next_pte;
+restart:
+	do {
+		pgd = pgd_offset(mm, pvmw->address);
+		if (!pgd_present(*pgd)) {
+			step_forward(pvmw, PGDIR_SIZE);
+			continue;
+		}
+		p4d = p4d_offset(pgd, pvmw->address);
+		if (!p4d_present(*p4d)) {
+			step_forward(pvmw, P4D_SIZE);
+			continue;
+		}
+		pud = pud_offset(p4d, pvmw->address);
+		if (!pud_present(*pud)) {
+			step_forward(pvmw, PUD_SIZE);
+			continue;
+		}
+
+		pvmw->pmd = pmd_offset(pud, pvmw->address);
+		/*
+		 * Make sure the pmd value isn't cached in a register by the
+		 * compiler and used as a stale value after we've observed a
+		 * subsequent update.
+		 */
+		pmde = pmd_read_atomic(pvmw->pmd);
+		barrier();
+
+		if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
+			pvmw->ptl = pmd_lock(mm, pvmw->pmd);
+			pmde = *pvmw->pmd;
+			if (likely(pmd_trans_huge(pmde))) {
+				if (pvmw->flags & PVMW_MIGRATION)
+					return not_found(pvmw);
+				if (pmd_page(pmde) != pvmw->page)
 					return not_found(pvmw);
-				if (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) {
-					swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
+				return true;
+			}
+			if (!pmd_present(pmde)) {
+				swp_entry_t swap;
 
-					if (migration_entry_to_page(entry) != page)
-						return not_found(pvmw);
-					return true;
-				}
+				if (!thp_migration_supported() ||
+				    !(pvmw->flags & PVMW_MIGRATION))
+					return not_found(pvmw);
+				swap = pmd_to_swp_entry(pmde);
+				if (!is_migration_entry(swap) ||
+				    migration_entry_to_page(swap) != pvmw->page)
+					return not_found(pvmw);
+				return true;
 			}
-			return not_found(pvmw);
-		} else {
 			/* THP pmd was split under us: handle on pte level */
 			spin_unlock(pvmw->ptl);
 			pvmw->ptl = NULL;
+		} else if (!pmd_present(pmde)) {
+			/*
+			 * If PVMW_SYNC, take and drop THP pmd lock so that we
+			 * cannot return prematurely, while zap_huge_pmd() has
+			 * cleared *pmd but not decremented compound_mapcount().
+			 */
+			if ((pvmw->flags & PVMW_SYNC) &&
+			    PageTransCompound(pvmw->page))
+				spin_unlock(pmd_lock(mm, pvmw->pmd));
+			step_forward(pvmw, PMD_SIZE);
+			continue;
 		}
-	} else if (!pmd_present(pmde)) {
-		/*
-		 * If PVMW_SYNC, take and drop THP pmd lock so that we
-		 * cannot return prematurely, while zap_huge_pmd() has
-		 * cleared *pmd but not decremented compound_mapcount().
-		 */
-		if ((pvmw->flags & PVMW_SYNC) &&
-		    PageTransCompound(pvmw->page))
-			spin_unlock(pmd_lock(mm, pvmw->pmd));
-		return false;
-	}
-	if (!map_pte(pvmw))
-		goto next_pte;
-	while (1) {
-		unsigned long end;
 
+		if (!map_pte(pvmw))
+			goto next_pte;
+this_pte:
 		if (check_pte(pvmw))
 			return true;
 next_pte:
-		/* Seek to next pte only makes sense for THP */
-		if (!PageTransHuge(pvmw->page) || PageHuge(pvmw->page))
-			return not_found(pvmw);
-		end = vma_address_end(pvmw->page, pvmw->vma);
 		do {
 			pvmw->address += PAGE_SIZE;
 			if (pvmw->address >= end)
 				return not_found(pvmw);
+
 			/* Did we cross page table boundary? */
-			if (pvmw->address % PMD_SIZE == 0) {
-				pte_unmap(pvmw->pte);
+			if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
 				if (pvmw->ptl) {
 					spin_unlock(pvmw->ptl);
 					pvmw->ptl = NULL;
 				}
+				pte_unmap(pvmw->pte);
+				pvmw->pte = NULL;
 				goto restart;
-			} else {
-				pvmw->pte++;
 			}
+			pvmw->pte++;
 		} while (pte_none(*pvmw->pte));
 
 		if (!pvmw->ptl) {
 			pvmw->ptl = pte_lockptr(mm, pvmw->pmd);
 			spin_lock(pvmw->ptl);
 		}
-	}
+		goto this_pte;
+	} while (pvmw->address < end);
+
+	return false;
 }
 
 /**
-- 
2.32.0.rc0.204.g9fa02ecfa5-goog


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH 5/7] mm/thp: fix page_vma_mapped_walk() if huge page mapped by ptes
@ 2021-06-01 21:13   ` Hugh Dickins
  0 siblings, 0 replies; 70+ messages in thread
From: Hugh Dickins @ 2021-06-01 21:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	linux-mm, linux-kernel

Running certain tests with a DEBUG_VM kernel would crash within hours,
on the total_mapcount BUG() in split_huge_page_to_list(), while trying
to free up some memory by punching a hole in a shmem huge page: split's
try_to_unmap() was unable to find all the mappings of the page (which,
on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).

Crash dumps showed two tail pages of a shmem huge page remained mapped
by pte: ptes in a non-huge-aligned vma of a gVisor process, at the end
of a long unmapped range; and no page table had yet been allocated for
the head of the huge page to be mapped into.

Although designed to handle these odd misaligned huge-page-mapped-by-pte
cases, page_vma_mapped_walk() falls short by returning false prematurely
when !pmd_present or !pud_present or !p4d_present or !pgd_present: there
are cases when a huge page may span the boundary, with ptes present in
the next page table.
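
As a worked example (all numbers hypothetical, assuming 4K pages and
2M pmds): a pmd-sized THP whose head maps at a page-aligned but not
pmd-aligned address straddles two page tables, and the first of them
may never have been allocated:

#include <stdio.h>

#define PAGE_SIZE	(1UL << 12)	/* assumed 4K */
#define PMD_SIZE	(1UL << 21)	/* assumed 2M */
#define HPAGE_PMD_NR	512		/* subpages of a pmd-sized THP */

int main(void)
{
	/* hypothetical placement: page-aligned but not pmd-aligned */
	unsigned long head = 0x50001000UL;	/* head of the THP maps here */
	unsigned long end  = head + HPAGE_PMD_NR * PAGE_SIZE;
	unsigned long pmd0 = head & ~(PMD_SIZE - 1);
	unsigned long pmd1 = pmd0 + PMD_SIZE;

	printf("walk range %#lx-%#lx\n", head, end);
	printf("first pmd  %#lx (may have no page table allocated)\n", pmd0);
	printf("second pmd %#lx (the last tail pte lives here)\n", pmd1);
	/*
	 * Returning false as soon as the first pmd is !pmd_present never
	 * reaches the pte mapped under the second pmd.
	 */
	return 0;
}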

Restructure page_vma_mapped_walk() as a loop to continue in these cases,
while keeping its layout much as before. Add a step_forward() helper to
advance pvmw->address across those boundaries: originally I tried to use
mm's standard p?d_addr_end() macros, but hit the same crash 512 times
less often: because of the way redundant levels are folded together,
but folded differently in different configurations, it was just too
difficult to use them correctly; and step_forward() is simpler anyway.
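
A minimal userspace sketch of the arithmetic step_forward() relies on
(the sizes below are the usual x86_64 values, assumed here only for
illustration):

#include <stdio.h>

#define PMD_SIZE	(1UL << 21)	/* assumed 2M */
#define PGDIR_SIZE	(1UL << 39)	/* assumed x86_64 4-level */

/*
 * Same rounding as step_forward(): round the address up to the next
 * "size" boundary, saturating at ULONG_MAX if that wraps to 0, so the
 * caller's "while (pvmw->address < end)" still terminates.
 */
static unsigned long step(unsigned long address, unsigned long size)
{
	address = (address + size) & ~(size - 1);
	return address ? address : ~0UL;
}

int main(void)
{
	printf("%#lx\n", step(0x50201000UL, PMD_SIZE));	/* 0x50400000 */
	printf("%#lx\n", step(0x50400000UL, PMD_SIZE));	/* 0x50600000 */
	printf("%#lx\n", step(~0UL - 0xfff, PGDIR_SIZE));	/* saturates */
	return 0;
}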

Merged various other minor fixes and cleanups into page_vma_mapped_walk()
as I worked on it: which I find much easier to enumerate here than to
prise apart into separate commits.

Handle all of the hugetlbfs PageHuge() case once at the start,
so we don't need to worry about it again further down.

Sometimes a local copy of pvmw->page was used, sometimes pvmw->page itself:
just use pvmw->page throughout (and continue to use pvmw->address
throughout, though we could take a local copy instead).

Use pmd_read_atomic() with barrier() instead of READ_ONCE() for pmde:
some architectures (e.g. i386 with PAE) have a multi-word pmd entry,
for which READ_ONCE() is not good enough.

Re-evaluate pmde after taking lock, then use it in subsequent tests,
instead of repeatedly dereferencing pvmw->pmd pointer.

Rearrange the !pmd_present block to follow the same "return not_found,
return not_found, return true" pattern as the block above it (note:
returning not_found there is never premature, since the existence or
prior existence of a huge pmd guarantees good alignment).

Adjust page table boundary test in case address was not page-aligned.
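
To see why, a minimal sketch (again assuming 4K pages and 2M pmds):
with a sub-page offset in pvmw->address, the old modulo test never
fires, while the new mask test still catches the pmd crossing:

#include <stdio.h>

#define PAGE_SIZE	(1UL << 12)	/* assumed 4K */
#define PMD_SIZE	(1UL << 21)	/* assumed 2M */

int main(void)
{
	/* hypothetical walk address carrying a sub-page offset */
	unsigned long address = PMD_SIZE + 0x20;

	/* old test: true only for an exactly pmd-aligned address */
	printf("old %d\n", address % PMD_SIZE == 0);
	/* new test: ignores the bits below PAGE_SIZE */
	printf("new %d\n", (address & (PMD_SIZE - PAGE_SIZE)) == 0);
	return 0;
}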

Reset pvmw->pte to NULL after unmapping that page table.

Respect the 80-column line limit.

Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
 mm/page_vma_mapped.c | 159 ++++++++++++++++++++++++++-----------------
 1 file changed, 95 insertions(+), 64 deletions(-)

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 92f0ecd50524..f38ea5aa4acf 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -116,6 +116,13 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 	return pfn_is_match(pvmw->page, pfn);
 }
 
+static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
+{
+	pvmw->address = (pvmw->address + size) & ~(size - 1);
+	if (!pvmw->address)
+		pvmw->address = ULONG_MAX;
+}
+
 /**
  * page_vma_mapped_walk - check if @pvmw->page is mapped in @pvmw->vma at
  * @pvmw->address
@@ -143,7 +150,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 {
 	struct mm_struct *mm = pvmw->vma->vm_mm;
-	struct page *page = pvmw->page;
+	unsigned long end;
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
@@ -153,109 +160,133 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	if (pvmw->pmd && !pvmw->pte)
 		return not_found(pvmw);
 
-	if (pvmw->pte)
-		goto next_pte;
-
 	if (unlikely(PageHuge(pvmw->page))) {
+		/* The only possible mapping was handled on last iteration */
+		if (pvmw->pte)
+			return not_found(pvmw);
+
 		/* when pud is not present, pte will be NULL */
-		pvmw->pte = huge_pte_offset(mm, pvmw->address, page_size(page));
+		pvmw->pte = huge_pte_offset(mm, pvmw->address,
+					    page_size(pvmw->page));
 		if (!pvmw->pte)
 			return false;
 
-		pvmw->ptl = huge_pte_lockptr(page_hstate(page), mm, pvmw->pte);
+		pvmw->ptl = huge_pte_lockptr(page_hstate(pvmw->page),
+					     mm, pvmw->pte);
 		spin_lock(pvmw->ptl);
 		if (!check_pte(pvmw))
 			return not_found(pvmw);
 		return true;
 	}
-restart:
-	pgd = pgd_offset(mm, pvmw->address);
-	if (!pgd_present(*pgd))
-		return false;
-	p4d = p4d_offset(pgd, pvmw->address);
-	if (!p4d_present(*p4d))
-		return false;
-	pud = pud_offset(p4d, pvmw->address);
-	if (!pud_present(*pud))
-		return false;
-	pvmw->pmd = pmd_offset(pud, pvmw->address);
+
 	/*
-	 * Make sure the pmd value isn't cached in a register by the
-	 * compiler and used as a stale value after we've observed a
-	 * subsequent update.
+	 * Seek to next pte only makes sense for THP.
+	 * But more important than that optimization, is to filter out
+	 * any PageKsm page: whose page->index misleads vma_address()
+	 * and vma_address_end() to disaster.
 	 */
-	pmde = READ_ONCE(*pvmw->pmd);
-	if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
-		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
-		if (likely(pmd_trans_huge(*pvmw->pmd))) {
-			if (pvmw->flags & PVMW_MIGRATION)
-				return not_found(pvmw);
-			if (pmd_page(*pvmw->pmd) != page)
-				return not_found(pvmw);
-			return true;
-		} else if (!pmd_present(*pvmw->pmd)) {
-			if (thp_migration_supported()) {
-				if (!(pvmw->flags & PVMW_MIGRATION))
+	end = PageTransCompound(pvmw->page) ?
+		vma_address_end(pvmw->page, pvmw->vma) :
+		pvmw->address + PAGE_SIZE;
+	if (pvmw->pte)
+		goto next_pte;
+restart:
+	do {
+		pgd = pgd_offset(mm, pvmw->address);
+		if (!pgd_present(*pgd)) {
+			step_forward(pvmw, PGDIR_SIZE);
+			continue;
+		}
+		p4d = p4d_offset(pgd, pvmw->address);
+		if (!p4d_present(*p4d)) {
+			step_forward(pvmw, P4D_SIZE);
+			continue;
+		}
+		pud = pud_offset(p4d, pvmw->address);
+		if (!pud_present(*pud)) {
+			step_forward(pvmw, PUD_SIZE);
+			continue;
+		}
+
+		pvmw->pmd = pmd_offset(pud, pvmw->address);
+		/*
+		 * Make sure the pmd value isn't cached in a register by the
+		 * compiler and used as a stale value after we've observed a
+		 * subsequent update.
+		 */
+		pmde = pmd_read_atomic(pvmw->pmd);
+		barrier();
+
+		if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
+			pvmw->ptl = pmd_lock(mm, pvmw->pmd);
+			pmde = *pvmw->pmd;
+			if (likely(pmd_trans_huge(pmde))) {
+				if (pvmw->flags & PVMW_MIGRATION)
+					return not_found(pvmw);
+				if (pmd_page(pmde) != pvmw->page)
 					return not_found(pvmw);
-				if (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) {
-					swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
+				return true;
+			}
+			if (!pmd_present(pmde)) {
+				swp_entry_t swap;
 
-					if (migration_entry_to_page(entry) != page)
-						return not_found(pvmw);
-					return true;
-				}
+				if (!thp_migration_supported() ||
+				    !(pvmw->flags & PVMW_MIGRATION))
+					return not_found(pvmw);
+				swap = pmd_to_swp_entry(pmde);
+				if (!is_migration_entry(swap) ||
+				    migration_entry_to_page(swap) != pvmw->page)
+					return not_found(pvmw);
+				return true;
 			}
-			return not_found(pvmw);
-		} else {
 			/* THP pmd was split under us: handle on pte level */
 			spin_unlock(pvmw->ptl);
 			pvmw->ptl = NULL;
+		} else if (!pmd_present(pmde)) {
+			/*
+			 * If PVMW_SYNC, take and drop THP pmd lock so that we
+			 * cannot return prematurely, while zap_huge_pmd() has
+			 * cleared *pmd but not decremented compound_mapcount().
+			 */
+			if ((pvmw->flags & PVMW_SYNC) &&
+			    PageTransCompound(pvmw->page))
+				spin_unlock(pmd_lock(mm, pvmw->pmd));
+			step_forward(pvmw, PMD_SIZE);
+			continue;
 		}
-	} else if (!pmd_present(pmde)) {
-		/*
-		 * If PVMW_SYNC, take and drop THP pmd lock so that we
-		 * cannot return prematurely, while zap_huge_pmd() has
-		 * cleared *pmd but not decremented compound_mapcount().
-		 */
-		if ((pvmw->flags & PVMW_SYNC) &&
-		    PageTransCompound(pvmw->page))
-			spin_unlock(pmd_lock(mm, pvmw->pmd));
-		return false;
-	}
-	if (!map_pte(pvmw))
-		goto next_pte;
-	while (1) {
-		unsigned long end;
 
+		if (!map_pte(pvmw))
+			goto next_pte;
+this_pte:
 		if (check_pte(pvmw))
 			return true;
 next_pte:
-		/* Seek to next pte only makes sense for THP */
-		if (!PageTransHuge(pvmw->page) || PageHuge(pvmw->page))
-			return not_found(pvmw);
-		end = vma_address_end(pvmw->page, pvmw->vma);
 		do {
 			pvmw->address += PAGE_SIZE;
 			if (pvmw->address >= end)
 				return not_found(pvmw);
+
 			/* Did we cross page table boundary? */
-			if (pvmw->address % PMD_SIZE == 0) {
-				pte_unmap(pvmw->pte);
+			if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
 				if (pvmw->ptl) {
 					spin_unlock(pvmw->ptl);
 					pvmw->ptl = NULL;
 				}
+				pte_unmap(pvmw->pte);
+				pvmw->pte = NULL;
 				goto restart;
-			} else {
-				pvmw->pte++;
 			}
+			pvmw->pte++;
 		} while (pte_none(*pvmw->pte));
 
 		if (!pvmw->ptl) {
 			pvmw->ptl = pte_lockptr(mm, pvmw->pmd);
 			spin_lock(pvmw->ptl);
 		}
-	}
+		goto this_pte;
+	} while (pvmw->address < end);
+
+	return false;
 }
 
 /**
-- 
2.32.0.rc0.204.g9fa02ecfa5-goog



^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH 6/7] mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
  2021-06-01 21:03 ` Hugh Dickins
@ 2021-06-01 21:15   ` Hugh Dickins
  -1 siblings, 0 replies; 70+ messages in thread
From: Hugh Dickins @ 2021-06-01 21:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	linux-mm, linux-kernel

There is a race between THP unmapping and truncation, when truncate sees
pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared it,
but before its page_remove_rmap() gets to decrement compound_mapcount:
generating false "BUG: Bad page cache" reports that the page is still
mapped when deleted.  This commit fixes that, but not in the way I hoped.

The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
instead of unmap_mapping_range() in truncate_cleanup_page(): it has often
been an annoyance that we usually call unmap_mapping_range() with no pages
locked, but there it is applied to a single locked page.  try_to_unmap() looks
more suitable for a single locked page.

However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
it is used to insert THP migration entries, but not used to unmap THPs.
Copy zap_huge_pmd() and add THP handling now?  Perhaps, but their TLB
needs are different, I'm too ignorant of the DAX cases, and couldn't
decide how far to go for anon+swap.  Set that aside.

The second attempt took a different tack: make no change in truncate.c,
but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
clearing it initially, then pmd_clear() between page_remove_rmap() and
unlocking at the end.  Nice.  But powerpc blows that approach out of the
water, with its serialize_against_pte_lookup(), and interesting pgtable
usage.  It would need serious help to get working on powerpc (with a
minor optimization issue on s390 too).  Set that aside.

Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
delay, after unmapping in truncate_cleanup_page()?  Perhaps, but though
that's likely to reduce or eliminate the number of incidents, it would
give less assurance of whether we had identified the problem correctly.

This successful iteration introduces "unmap_mapping_page(page)" instead
of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
with an addition to details.  Then zap_pmd_range() watches for this case,
and does spin_unlock(pmd_lock) if so - just like page_vma_mapped_walk()
now does in the PVMW_SYNC case.  Not pretty, but safe.

Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
assert its interface; but currently that's only used to make sure that
page->mapping is stable, and zap_pmd_range() doesn't care if the page is
locked or not.  Along these lines, in invalidate_inode_pages2_range()
move the initial unmap_mapping_range() out from under page lock, before
then calling unmap_mapping_page() under page lock if still mapped.
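
For reference, the calling pattern that the truncate.c side converges
on, paraphrased (not verbatim) from the invalidate_inode_pages2_range()
hunk below:

	if (!did_range_unmap && page_mapped(page)) {
		/* page lock not yet held: zap the whole range in one hit */
		unmap_mapping_pages(mapping, index, 1 + end - index, false);
		did_range_unmap = 1;
	}

	lock_page(page);
	if (page->mapping == mapping) {
		wait_on_page_writeback(page);
		if (page_mapped(page))		/* remapped since the range zap? */
			unmap_mapping_page(page); /* page lock keeps page->mapping stable */
		BUG_ON(page_mapped(page));
	}
	unlock_page(page);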

Fixes: fc127da085c2 ("truncate: handle file thp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
 include/linux/mm.h |  3 +++
 mm/memory.c        | 40 ++++++++++++++++++++++++++++++++++++++++
 mm/truncate.c      | 43 +++++++++++++++++++------------------------
 3 files changed, 62 insertions(+), 24 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c274f75efcf9..8ae31622deef 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1719,6 +1719,7 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
+	struct page *single_page;		/* Locked page to be unmapped */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
@@ -1766,6 +1767,7 @@ extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
 extern int fixup_user_fault(struct mm_struct *mm,
 			    unsigned long address, unsigned int fault_flags,
 			    bool *unlocked);
+void unmap_mapping_page(struct page *page);
 void unmap_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t nr, bool even_cows);
 void unmap_mapping_range(struct address_space *mapping,
@@ -1786,6 +1788,7 @@ static inline int fixup_user_fault(struct mm_struct *mm, unsigned long address,
 	BUG();
 	return -EFAULT;
 }
+static inline void unmap_mapping_page(struct page *page) { }
 static inline void unmap_mapping_pages(struct address_space *mapping,
 		pgoff_t start, pgoff_t nr, bool even_cows) { }
 static inline void unmap_mapping_range(struct address_space *mapping,
diff --git a/mm/memory.c b/mm/memory.c
index 730daa00952b..26d7f8e265a9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1361,7 +1361,17 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 			else if (zap_huge_pmd(tlb, vma, pmd, addr))
 				goto next;
 			/* fall through */
+		} else if (details && details->single_page &&
+			   PageTransCompound(details->single_page) &&
+			   next - addr == HPAGE_PMD_SIZE && pmd_none(*pmd)) {
+			/*
+			 * Take and drop THP pmd lock so that we cannot return
+			 * prematurely, while zap_huge_pmd() has cleared *pmd,
+			 * but not yet decremented compound_mapcount().
+			 */
+			spin_unlock(pmd_lock(tlb->mm, pmd));
 		}
+
 		/*
 		 * Here there can be other concurrent MADV_DONTNEED or
 		 * trans huge page faults running, and if the pmd is
@@ -3235,6 +3245,36 @@ static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
 	}
 }
 
+/**
+ * unmap_mapping_page() - Unmap single page from processes.
+ * @page: The locked page to be unmapped.
+ *
+ * Unmap this page from any userspace process which still has it mmaped.
+ * Typically, for efficiency, the range of nearby pages has already been
+ * unmapped by unmap_mapping_pages() or unmap_mapping_range().  But once
+ * truncation or invalidation holds the lock on a page, it may find that
+ * the page has been remapped again: and then uses unmap_mapping_page()
+ * to unmap it finally.
+ */
+void unmap_mapping_page(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	struct zap_details details = { };
+
+	VM_BUG_ON(!PageLocked(page));
+	VM_BUG_ON(PageTail(page));
+
+	details.check_mapping = mapping;
+	details.first_index = page->index;
+	details.last_index = page->index + thp_nr_pages(page) - 1;
+	details.single_page = page;
+
+	i_mmap_lock_write(mapping);
+	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
+		unmap_mapping_range_tree(&mapping->i_mmap, &details);
+	i_mmap_unlock_write(mapping);
+}
+
 /**
  * unmap_mapping_pages() - Unmap pages from processes.
  * @mapping: The address space containing pages to be unmapped.
diff --git a/mm/truncate.c b/mm/truncate.c
index 95af244b112a..234ddd879caa 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -167,13 +167,10 @@ void do_invalidatepage(struct page *page, unsigned int offset,
  * its lock, b) when a concurrent invalidate_mapping_pages got there first and
  * c) when tmpfs swizzles a page between a tmpfs inode and swapper_space.
  */
-static void
-truncate_cleanup_page(struct address_space *mapping, struct page *page)
+static void truncate_cleanup_page(struct page *page)
 {
-	if (page_mapped(page)) {
-		unsigned int nr = thp_nr_pages(page);
-		unmap_mapping_pages(mapping, page->index, nr, false);
-	}
+	if (page_mapped(page))
+		unmap_mapping_page(page);
 
 	if (page_has_private(page))
 		do_invalidatepage(page, 0, thp_size(page));
@@ -218,7 +215,7 @@ int truncate_inode_page(struct address_space *mapping, struct page *page)
 	if (page->mapping != mapping)
 		return -EIO;
 
-	truncate_cleanup_page(mapping, page);
+	truncate_cleanup_page(page);
 	delete_from_page_cache(page);
 	return 0;
 }
@@ -325,7 +322,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		index = indices[pagevec_count(&pvec) - 1] + 1;
 		truncate_exceptional_pvec_entries(mapping, &pvec, indices);
 		for (i = 0; i < pagevec_count(&pvec); i++)
-			truncate_cleanup_page(mapping, pvec.pages[i]);
+			truncate_cleanup_page(pvec.pages[i]);
 		delete_from_page_cache_batch(mapping, &pvec);
 		for (i = 0; i < pagevec_count(&pvec); i++)
 			unlock_page(pvec.pages[i]);
@@ -639,6 +636,16 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 				continue;
 			}
 
+			if (!did_range_unmap && page_mapped(page)) {
+				/*
+				 * If page is mapped, before taking its lock,
+				 * zap the rest of the file in one hit.
+				 */
+				unmap_mapping_pages(mapping, index,
+						(1 + end - index), false);
+				did_range_unmap = 1;
+			}
+
 			lock_page(page);
 			WARN_ON(page_to_index(page) != index);
 			if (page->mapping != mapping) {
@@ -646,23 +653,11 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 				continue;
 			}
 			wait_on_page_writeback(page);
-			if (page_mapped(page)) {
-				if (!did_range_unmap) {
-					/*
-					 * Zap the rest of the file in one hit.
-					 */
-					unmap_mapping_pages(mapping, index,
-						(1 + end - index), false);
-					did_range_unmap = 1;
-				} else {
-					/*
-					 * Just zap this page
-					 */
-					unmap_mapping_pages(mapping, index,
-								1, false);
-				}
-			}
+
+			if (page_mapped(page))
+				unmap_mapping_page(page);
 			BUG_ON(page_mapped(page));
+
 			ret2 = do_launder_page(mapping, page);
 			if (ret2 == 0) {
 				if (!invalidate_complete_page2(mapping, page))
-- 
2.32.0.rc0.204.g9fa02ecfa5-goog


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH 7/7] mm/thp: remap_page() is only needed on anonymous THP
  2021-06-01 21:03 ` Hugh Dickins
@ 2021-06-01 21:17   ` Hugh Dickins
  -1 siblings, 0 replies; 70+ messages in thread
From: Hugh Dickins @ 2021-06-01 21:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	linux-mm, linux-kernel

THP splitting's unmap_page() only sets TTU_SPLIT_FREEZE when PageAnon,
and migration entries are only inserted when TTU_MIGRATION (unused here)
or TTU_SPLIT_FREEZE is set: so it's just a waste of time for remap_page()
to search for migration entries to remove when !PageAnon.

Fixes: baa355fd3314 ("thp: file pages support for split_huge_page()")
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/huge_memory.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 305f709a7aca..e4a83e310452 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2354,6 +2354,7 @@ static void unmap_page(struct page *page)
 
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
+	/* If TTU_SPLIT_FREEZE is ever extended to file, update remap_page() */
 	if (PageAnon(page))
 		ttu_flags |= TTU_SPLIT_FREEZE;
 
@@ -2368,6 +2369,10 @@ static void unmap_page(struct page *page)
 static void remap_page(struct page *page, unsigned int nr)
 {
 	int i;
+
+	/* If TTU_SPLIT_FREEZE is ever extended to file, remove this check */
+	if (!PageAnon(page))
+		return;
 	if (PageTransHuge(page)) {
 		remove_migration_ptes(page, page, true);
 	} else {
-- 
2.32.0.rc0.204.g9fa02ecfa5-goog


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [PATCH 3/7] mm/thp: fix vma_address() if virtual address below file offset
  2021-06-01 21:09   ` Hugh Dickins
  (?)
@ 2021-06-01 21:30   ` Matthew Wilcox
  2021-06-03 21:36       ` Hugh Dickins
  -1 siblings, 1 reply; 70+ messages in thread
From: Matthew Wilcox @ 2021-06-01 21:30 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Naoya Horiguchi, Alistair Popple, Ralph Campbell, Zi Yan,
	Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara, linux-mm,
	linux-kernel

On Tue, Jun 01, 2021 at 02:09:31PM -0700, Hugh Dickins wrote:
>  static inline unsigned long
> -__vma_address(struct page *page, struct vm_area_struct *vma)
> +vma_address(struct page *page, struct vm_area_struct *vma)
>  {
> -	pgoff_t pgoff = page_to_pgoff(page);
> -	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> +	pgoff_t pgoff;
> +	unsigned long address;
> +
> +	VM_BUG_ON_PAGE(PageKsm(page), page);	/* KSM page->index unusable */
> +	pgoff = page_to_pgoff(page);
> +	if (pgoff >= vma->vm_pgoff) {
> +		address = vma->vm_start +
> +			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> +		/* Check for address beyond vma (or wrapped through 0?) */
> +		if (address < vma->vm_start || address >= vma->vm_end)
> +			address = -EFAULT;
> +	} else if (PageHead(page) &&
> +		   pgoff + (1UL << compound_order(page)) > vma->vm_pgoff) {

	} else if (PageHead(page) &&
		   pgoff + compound_nr(page) > vma->vm_pgoff) {

> +vma_address_end(struct page *page, struct vm_area_struct *vma)
>  {
> +	pgoff_t pgoff;
> +	unsigned long address;
> +
> +	VM_BUG_ON_PAGE(PageKsm(page), page);	/* KSM page->index unusable */
> +	pgoff = page_to_pgoff(page);
> +	if (PageHead(page))
> +		pgoff += 1UL << compound_order(page);
> +	else
> +		pgoff++;

Again, can use compound_nr here.  In fact, the whole thing can be:

	pgoff += compound_nr(page);


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 4/7] mm/thp: fix page_address_in_vma() on file THP tails
  2021-06-01 21:11   ` Hugh Dickins
  (?)
@ 2021-06-01 21:32   ` Matthew Wilcox
  -1 siblings, 0 replies; 70+ messages in thread
From: Matthew Wilcox @ 2021-06-01 21:32 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Naoya Horiguchi, Alistair Popple, Ralph Campbell, Zi Yan,
	Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara, linux-mm,
	linux-kernel

On Tue, Jun 01, 2021 at 02:11:22PM -0700, Hugh Dickins wrote:
> From: Jue Wang <juew@google.com>
> 
> Anon THP tails were already supported, but memory-failure may need to use
> page_address_in_vma() on file THP tails, which its page->mapping check did
> not permit: fix it.
> 
> hughd adds: no current usage is known to hit the issue, but this does fix
> a subtle trap in a general helper: best fixed in stable sooner than later.
> 
> Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> Signed-off-by: Jue Wang <juew@google.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 2/7] mm/thp: try_to_unmap() use TTU_SYNC for safe DEBUG_VM splitting
  2021-06-01 21:07   ` Hugh Dickins
  (?)
@ 2021-06-02  1:59   ` Alistair Popple
  -1 siblings, 0 replies; 70+ messages in thread
From: Alistair Popple @ 2021-06-02  1:59 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Ralph Campbell, Zi Yan,
	Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara, linux-mm,
	linux-kernel

On Wednesday, 2 June 2021 7:07:53 AM AEST Hugh Dickins wrote:
> Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE
> (!unmap_success): with dump_page() showing mapcount:1, but then its
> raw struct page output showing _mapcount ffffffff i.e. mapcount 0.
> 
> And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed,
> it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)),
> and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG():
> all indicative of some mapcount difficulty in development here perhaps.
> But the !CONFIG_DEBUG_VM path handles the failures correctly and silently.
> 
> I believe the problem is that once a racing unmap has cleared pte or pmd,
> try_to_unmap_one() may skip taking the page table lock, and emerge from
> try_to_unmap() before the racing task has reached decrementing mapcount.
> 
> Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that
> follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding TTU_SYNC
> to the options, and passing that from unmap_page() when CONFIG_DEBUG_VM=y.
> It could be passed in the non-debug case too, but that would sometimes add
> a little overhead, whereas it's rare for this race to result in failure.
> 
> mm/memory-failure.c:hwpoison_user_mappings() should probably use the new
> TTU_SYNC option too, just in case this race coincides with its attempts to
> unmap a failing page (THP or not); but this commit does not add that.
> 
> Fixes: fec89c109f3a ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
>  include/linux/rmap.h |  3 ++-
>  mm/huge_memory.c     |  4 ++++
>  mm/page_vma_mapped.c |  8 ++++++++
>  mm/rmap.c            | 17 ++++++++++++++++-
>  4 files changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index def5c62c93b3..891599a4cb8d 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -97,7 +97,8 @@ enum ttu_flags {
>                                          * do a final flush if necessary */
>         TTU_RMAP_LOCKED         = 0x80, /* do not grab rmap lock:
>                                          * caller holds it */
> -       TTU_SPLIT_FREEZE        = 0x100,                /* freeze pte under splitting thp */
> +       TTU_SPLIT_FREEZE        = 0x100, /* freeze pte under splitting thp */
> +       TTU_SYNC                = 0x200, /* avoid racy checks with PVMW_SYNC */
>  };
> 
>  #ifdef CONFIG_MMU
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9fb7b47da87e..305f709a7aca 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2357,6 +2357,10 @@ static void unmap_page(struct page *page)
>         if (PageAnon(page))
>                 ttu_flags |= TTU_SPLIT_FREEZE;
> 
> +       /* Make sure that the BUGs will not bite */
> +       if (IS_ENABLED(CONFIG_DEBUG_VM))
> +               ttu_flags |= TTU_SYNC;
> +
>         unmap_success = try_to_unmap(page, ttu_flags);
>         VM_BUG_ON_PAGE(!unmap_success, page);
>  }
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 2cf01d933f13..b45d22738b45 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -212,6 +212,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>                          pvmw->ptl = NULL;
>                 }
>         } else if (!pmd_present(pmde)) {
> +               /*
> +                * If PVMW_SYNC, take and drop THP pmd lock so that we
> +                * cannot return prematurely, while zap_huge_pmd() has
> +                * cleared *pmd but not decremented compound_mapcount().
> +                */
> +               if ((pvmw->flags & PVMW_SYNC) &&
> +                   PageTransCompound(pvmw->page))
> +                       spin_unlock(pmd_lock(mm, pvmw->pmd));
>                 return false;
>         }
>         if (!map_pte(pvmw))
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 693a610e181d..07811b4ae793 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1405,6 +1405,15 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>         struct mmu_notifier_range range;
>         enum ttu_flags flags = (enum ttu_flags)(long)arg;
> 
> +       /*
> +        * When racing against e.g. zap_pte_range() on another cpu,
> +        * in between its ptep_get_and_clear_full() and page_remove_rmap(),
> +        * try_to_unmap() may return false when it is about to become true,
> +        * if page table locking is skipped: use TTU_SYNC to wait for that.
> +        */
> +       if (flags & TTU_SYNC)
> +               pvmw.flags = PVMW_SYNC;
> +

If this gets applied on top of my series then I think we would also need to 
add this to the start of try_to_migrate_one() as I assume you can hit this bug 
regardless of whether it is unmapping or installing swap migration entries.

We would also need to update the flag check at the start of try_to_migrate() 
to allow passing TTU_SYNC.

>         /* munlock has nothing to gain from examining un-locked vmas */
>         if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
>                 return true;
> @@ -1777,7 +1786,13 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
>         else
>                 rmap_walk(page, &rwc);
> 
> -       return !page_mapcount(page) ? true : false;
> +       /*
> +        * When racing against e.g. zap_pte_range() on another cpu,
> +        * in between its ptep_get_and_clear_full() and page_remove_rmap(),
> +        * try_to_unmap() may return false when it is about to become true,
> +        * if page table locking is skipped: use TTU_SYNC to wait for that.
> +        */
> +       return !page_mapcount(page);
>  }
> 
>  /**
> --
> 2.32.0.rc0.204.g9fa02ecfa5-goog





^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/7] mm/thp: fix THP splitting unmap BUGs and related
  2021-06-01 21:03 ` Hugh Dickins
                   ` (7 preceding siblings ...)
  (?)
@ 2021-06-02  2:07 ` Alistair Popple
  -1 siblings, 0 replies; 70+ messages in thread
From: Alistair Popple @ 2021-06-02  2:07 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Ralph Campbell, Zi Yan,
	Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara, linux-mm,
	linux-kernel

On Wednesday, 2 June 2021 7:03:30 AM AEST Hugh Dickins wrote:
> Here is a batch of long-standing THP bug fixes that I had not got
> around to sending before, but prompted now by Wang Yugui's report
> https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/
> 
> Wang Yugui has tested a rollup of these fixes applied to 5.10.39,
> and they have done no harm, but have *not* fixed that issue:
> something more is needed and I have no idea of what.
> 
> But at least these clear up related issues, and should go to stable
> (except for the last, which is just an optimization: it would be
> fine in stable, but it's not required there).
> 
> These are against 5.13-rc4: easy for others to try out, but my next
> task is to work out how to shoehorn them into mmotm and linux-next.
> 
> I would have said just before Yang Shi's related
> mm-thp-replace-debug_vm-bug-with-vm_warn-when-unmap-fails-for-split.patch
> except that (which should also go to stable) is currently placed after
> Alistair Popple's "Add support for SVM atomics in Nouveau" series,
> mm-rmap-split-try_to_munlock-from-try_to_unmap.patch etc.
> I expect I shall offer you some rediffs of Alistair's, we'll see.

I haven't looked at Yang Shi's patch yet but aside from patch 2 this series 
applies on top of mine fairly easily. The only other issue I noticed was 
needing to rename migration_entry_to_page() -> pfn_swap_entry_to_page() in 
patch 1 & 7 before applying.

> 1/7 mm/thp: fix __split_huge_pmd_locked() on shmem migration entry
> 2/7 mm/thp: try_to_unmap() use TTU_SYNC for safe DEBUG_VM splitting
> 3/7 mm/thp: fix vma_address() if virtual address below file offset
> 4/7 mm/thp: fix page_address_in_vma() on file THP tails
> 5/7 mm/thp: fix page_vma_mapped_walk() if huge page mapped by ptes
> 6/7 mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
> 7/7 mm/thp: remap_page() is only needed on anonymous THP
> 
>  include/linux/mm.h   |    3
>  include/linux/rmap.h |    3
>  mm/huge_memory.c     |   47 ++++++++----
>  mm/internal.h        |   54 ++++++++++----
>  mm/memory.c          |   40 ++++++++++
>  mm/page_vma_mapped.c |  163 ++++++++++++++++++++++++++-----------------
>  mm/pgtable-generic.c |    5 -
>  mm/rmap.c            |   39 +++++++---
>  mm/truncate.c        |   43 +++++------
>  9 files changed, 266 insertions(+), 131 deletions(-)
> 
> Hugh





^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 1/7] mm/thp: fix __split_huge_pmd_locked() on shmem migration entry
  2021-06-01 21:05   ` Hugh Dickins
@ 2021-06-03 21:26     ` Yang Shi
  -1 siblings, 0 replies; 70+ messages in thread
From: Yang Shi @ 2021-06-03 21:26 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Wang Yugui, Matthew Wilcox,
	Naoya Horiguchi, Alistair Popple, Ralph Campbell, Zi Yan,
	Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara, Linux MM,
	Linux Kernel Mailing List

On Tue, Jun 1, 2021 at 2:05 PM Hugh Dickins <hughd@google.com> wrote:
>
> Stressing huge tmpfs page migration racing hole punch often crashed on the
> VM_BUG_ON(!pmd_present) in pmdp_huge_clear_flush(), with DEBUG_VM=y kernel;
> or shortly afterwards, on a bad dereference in __split_huge_pmd_locked()
> when DEBUG_VM=n.  They forgot to allow for pmd migration entries in the
> non-anonymous case.
>
> Full disclosure: those particular experiments were on a kernel with more
> relaxed mmap_lock and i_mmap_rwsem locking, and were not repeated on the
> vanilla kernel: it is conceivable that stricter locking happens to avoid
> those cases, or makes them less likely; but __split_huge_pmd_locked()
> already allowed for pmd migration entries when handling anonymous THPs,
> so this commit brings the shmem and file THP handling into line.
>
> Are there more places that need to be careful about pmd migration entries?
> None hit in practice, but several of those is_huge_zero_pmd() tests were
> done without checking pmd_present() first: I believe a pmd migration entry
> could end up satisfying that test.  Ah, the inversion of swap offset, to
> protect against L1TF, makes that impossible on x86; but other arches need
> the pmd_present() check, and even x86 ought not to apply pmd_page() to a
> swap-like pmd.  Fix those instances; __split_huge_pmd_locked() was not
> wrong to be checking with pmd_trans_huge() instead, but I think it's
> clearer to use pmd_present() in each instance.
>
> And while there: make it clearer to the eye that the !vma_is_anonymous()
> and is_huge_zero_pmd() blocks make early returns (and don't return void).
>
> Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
>  mm/huge_memory.c     | 38 ++++++++++++++++++++++++--------------
>  mm/pgtable-generic.c |  5 ++---
>  2 files changed, 26 insertions(+), 17 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 63ed6b25deaa..9fb7b47da87e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1587,9 +1587,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>                 goto out_unlocked;
>
>         orig_pmd = *pmd;
> -       if (is_huge_zero_pmd(orig_pmd))
> -               goto out;
> -
>         if (unlikely(!pmd_present(orig_pmd))) {
>                 VM_BUG_ON(thp_migration_supported() &&
>                                   !is_pmd_migration_entry(orig_pmd));
> @@ -1597,6 +1594,9 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>         }
>
>         page = pmd_page(orig_pmd);
> +       if (is_huge_zero_page(page))
> +               goto out;
> +
>         /*
>          * If other processes are mapping this page, we couldn't discard
>          * the page unless they all do MADV_FREE so let's skip the page.
> @@ -1676,7 +1676,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>                 spin_unlock(ptl);
>                 if (is_huge_zero_pmd(orig_pmd))
>                         tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
> -       } else if (is_huge_zero_pmd(orig_pmd)) {
> +       } else if (pmd_present(orig_pmd) && is_huge_zero_pmd(orig_pmd)) {

If it is a huge zero migration entry, the code would fall back to the
"else". But IIUC the "else" case doesn't handle the huge zero page
correctly. It may mess up the rss counter.

>                 zap_deposited_table(tlb->mm, pmd);
>                 spin_unlock(ptl);
>                 tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
> @@ -2044,7 +2044,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>         count_vm_event(THP_SPLIT_PMD);
>
>         if (!vma_is_anonymous(vma)) {
> -               _pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> +               old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
>                 /*
>                  * We are going to unmap this huge page. So
>                  * just go ahead and zap it
> @@ -2053,16 +2053,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>                         zap_deposited_table(mm, pmd);
>                 if (vma_is_special_huge(vma))
>                         return;
> -               page = pmd_page(_pmd);
> -               if (!PageDirty(page) && pmd_dirty(_pmd))
> -                       set_page_dirty(page);
> -               if (!PageReferenced(page) && pmd_young(_pmd))
> -                       SetPageReferenced(page);
> -               page_remove_rmap(page, true);
> -               put_page(page);
> +               if (unlikely(is_pmd_migration_entry(old_pmd))) {
> +                       swp_entry_t entry;
> +
> +                       entry = pmd_to_swp_entry(old_pmd);
> +                       page = migration_entry_to_page(entry);
> +               } else {
> +                       page = pmd_page(old_pmd);
> +                       if (!PageDirty(page) && pmd_dirty(old_pmd))
> +                               set_page_dirty(page);
> +                       if (!PageReferenced(page) && pmd_young(old_pmd))
> +                               SetPageReferenced(page);
> +                       page_remove_rmap(page, true);
> +                       put_page(page);
> +               }
>                 add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
>                 return;
> -       } else if (pmd_trans_huge(*pmd) && is_huge_zero_pmd(*pmd)) {
> +       }
> +
> +       if (pmd_present(*pmd) && is_huge_zero_pmd(*pmd)) {
>                 /*
>                  * FIXME: Do we want to invalidate secondary mmu by calling
>                  * mmu_notifier_invalidate_range() see comments below inside
> @@ -2072,7 +2081,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>                  * small page also write protected so it does not seems useful
>                  * to invalidate secondary mmu at this time.
>                  */
> -               return __split_huge_zero_page_pmd(vma, haddr, pmd);
> +               __split_huge_zero_page_pmd(vma, haddr, pmd);
> +               return;
>         }
>
>         /*
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index c2210e1cdb51..4e640baf9794 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -135,9 +135,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
>  {
>         pmd_t pmd;
>         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -       VM_BUG_ON(!pmd_present(*pmdp));
> -       /* Below assumes pmd_present() is true */
> -       VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
> +       VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
> +                          !pmd_devmap(*pmdp));
>         pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
>         flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>         return pmd;
> --
> 2.32.0.rc0.204.g9fa02ecfa5-goog
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 1/7] mm/thp: fix __split_huge_pmd_locked() on shmem migration entry
@ 2021-06-03 21:26     ` Yang Shi
  0 siblings, 0 replies; 70+ messages in thread
From: Yang Shi @ 2021-06-03 21:26 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Wang Yugui, Matthew Wilcox,
	Naoya Horiguchi, Alistair Popple, Ralph Campbell, Zi Yan,
	Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara, Linux MM,
	Linux Kernel Mailing List

On Tue, Jun 1, 2021 at 2:05 PM Hugh Dickins <hughd@google.com> wrote:
>
> Stressing huge tmpfs page migration racing hole punch often crashed on the
> VM_BUG_ON(!pmd_present) in pmdp_huge_clear_flush(), with DEBUG_VM=y kernel;
> or shortly afterwards, on a bad dereference in __split_huge_pmd_locked()
> when DEBUG_VM=n.  They forgot to allow for pmd migration entries in the
> non-anonymous case.
>
> Full disclosure: those particular experiments were on a kernel with more
> relaxed mmap_lock and i_mmap_rwsem locking, and were not repeated on the
> vanilla kernel: it is conceivable that stricter locking happens to avoid
> those cases, or makes them less likely; but __split_huge_pmd_locked()
> already allowed for pmd migration entries when handling anonymous THPs,
> so this commit brings the shmem and file THP handling into line.
>
> Are there more places that need to be careful about pmd migration entries?
> None hit in practice, but several of those is_huge_zero_pmd() tests were
> done without checking pmd_present() first: I believe a pmd migration entry
> could end up satisfying that test.  Ah, the inversion of swap offset, to
> protect against L1TF, makes that impossible on x86; but other arches need
> the pmd_present() check, and even x86 ought not to apply pmd_page() to a
> swap-like pmd.  Fix those instances; __split_huge_pmd_locked() was not
> wrong to be checking with pmd_trans_huge() instead, but I think it's
> clearer to use pmd_present() in each instance.
>
> And while there: make it clearer to the eye that the !vma_is_anonymous()
> and is_huge_zero_pmd() blocks make early returns (and don't return void).
>
> Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
>  mm/huge_memory.c     | 38 ++++++++++++++++++++++++--------------
>  mm/pgtable-generic.c |  5 ++---
>  2 files changed, 26 insertions(+), 17 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 63ed6b25deaa..9fb7b47da87e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1587,9 +1587,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>                 goto out_unlocked;
>
>         orig_pmd = *pmd;
> -       if (is_huge_zero_pmd(orig_pmd))
> -               goto out;
> -
>         if (unlikely(!pmd_present(orig_pmd))) {
>                 VM_BUG_ON(thp_migration_supported() &&
>                                   !is_pmd_migration_entry(orig_pmd));
> @@ -1597,6 +1594,9 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>         }
>
>         page = pmd_page(orig_pmd);
> +       if (is_huge_zero_page(page))
> +               goto out;
> +
>         /*
>          * If other processes are mapping this page, we couldn't discard
>          * the page unless they all do MADV_FREE so let's skip the page.
> @@ -1676,7 +1676,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>                 spin_unlock(ptl);
>                 if (is_huge_zero_pmd(orig_pmd))
>                         tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
> -       } else if (is_huge_zero_pmd(orig_pmd)) {
> +       } else if (pmd_present(orig_pmd) && is_huge_zero_pmd(orig_pmd)) {

If it is a huge zero migration entry, the code would fall back to the
"else" branch. But IIUC the "else" case doesn't handle the huge zero page
correctly: it may mess up the rss counter.

>                 zap_deposited_table(tlb->mm, pmd);
>                 spin_unlock(ptl);
>                 tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
> @@ -2044,7 +2044,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>         count_vm_event(THP_SPLIT_PMD);
>
>         if (!vma_is_anonymous(vma)) {
> -               _pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> +               old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
>                 /*
>                  * We are going to unmap this huge page. So
>                  * just go ahead and zap it
> @@ -2053,16 +2053,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>                         zap_deposited_table(mm, pmd);
>                 if (vma_is_special_huge(vma))
>                         return;
> -               page = pmd_page(_pmd);
> -               if (!PageDirty(page) && pmd_dirty(_pmd))
> -                       set_page_dirty(page);
> -               if (!PageReferenced(page) && pmd_young(_pmd))
> -                       SetPageReferenced(page);
> -               page_remove_rmap(page, true);
> -               put_page(page);
> +               if (unlikely(is_pmd_migration_entry(old_pmd))) {
> +                       swp_entry_t entry;
> +
> +                       entry = pmd_to_swp_entry(old_pmd);
> +                       page = migration_entry_to_page(entry);
> +               } else {
> +                       page = pmd_page(old_pmd);
> +                       if (!PageDirty(page) && pmd_dirty(old_pmd))
> +                               set_page_dirty(page);
> +                       if (!PageReferenced(page) && pmd_young(old_pmd))
> +                               SetPageReferenced(page);
> +                       page_remove_rmap(page, true);
> +                       put_page(page);
> +               }
>                 add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
>                 return;
> -       } else if (pmd_trans_huge(*pmd) && is_huge_zero_pmd(*pmd)) {
> +       }
> +
> +       if (pmd_present(*pmd) && is_huge_zero_pmd(*pmd)) {
>                 /*
>                  * FIXME: Do we want to invalidate secondary mmu by calling
>                  * mmu_notifier_invalidate_range() see comments below inside
> @@ -2072,7 +2081,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>                  * small page also write protected so it does not seems useful
>                  * to invalidate secondary mmu at this time.
>                  */
> -               return __split_huge_zero_page_pmd(vma, haddr, pmd);
> +               __split_huge_zero_page_pmd(vma, haddr, pmd);
> +               return;
>         }
>
>         /*
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index c2210e1cdb51..4e640baf9794 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -135,9 +135,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
>  {
>         pmd_t pmd;
>         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -       VM_BUG_ON(!pmd_present(*pmdp));
> -       /* Below assumes pmd_present() is true */
> -       VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
> +       VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
> +                          !pmd_devmap(*pmdp));
>         pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
>         flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>         return pmd;
> --
> 2.32.0.rc0.204.g9fa02ecfa5-goog
>


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 3/7] mm/thp: fix vma_address() if virtual address below file offset
  2021-06-01 21:30   ` Matthew Wilcox
@ 2021-06-03 21:36       ` Hugh Dickins
  0 siblings, 0 replies; 70+ messages in thread
From: Hugh Dickins @ 2021-06-03 21:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Andrew Morton, Kirill A. Shutemov, Yang Shi,
	Wang Yugui, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	linux-mm, linux-kernel

On Tue, 1 Jun 2021, Matthew Wilcox wrote:
> On Tue, Jun 01, 2021 at 02:09:31PM -0700, Hugh Dickins wrote:
> >  static inline unsigned long
> > -__vma_address(struct page *page, struct vm_area_struct *vma)
> > +vma_address(struct page *page, struct vm_area_struct *vma)
> >  {
> > -	pgoff_t pgoff = page_to_pgoff(page);
> > -	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > +	pgoff_t pgoff;
> > +	unsigned long address;
> > +
> > +	VM_BUG_ON_PAGE(PageKsm(page), page);	/* KSM page->index unusable */
> > +	pgoff = page_to_pgoff(page);
> > +	if (pgoff >= vma->vm_pgoff) {
> > +		address = vma->vm_start +
> > +			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > +		/* Check for address beyond vma (or wrapped through 0?) */
> > +		if (address < vma->vm_start || address >= vma->vm_end)
> > +			address = -EFAULT;
> > +	} else if (PageHead(page) &&
> > +		   pgoff + (1UL << compound_order(page)) > vma->vm_pgoff) {
> 
> 	} else if (PageHead(page) &&
> 		   pgoff + compound_nr(page) > vma->vm_pgoff) {

Yes, that's better, thanks.

I was tempted to leave out the preliminary PageHead() test altogether;
but you didn't suggest that, and I've concluded that perhaps it makes
for good documentation, even though not strictly needed.

> 
> > +vma_address_end(struct page *page, struct vm_area_struct *vma)
> >  {
> > +	pgoff_t pgoff;
> > +	unsigned long address;
> > +
> > +	VM_BUG_ON_PAGE(PageKsm(page), page);	/* KSM page->index unusable */
> > +	pgoff = page_to_pgoff(page);
> > +	if (PageHead(page))
> > +		pgoff += 1UL << compound_order(page);
> > +	else
> > +		pgoff++;
> 
> Again, can use compound_nr here.  In fact, the whole thing can be:
> 
> 	pgoff += compound_nr(page);

Even nicer; and I've taken the liberty of just saying

	pgoff = page_to_pgoff(page) + compound_nr(page);

v2 of this patch coming up now, thanks.

Hugh

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH v2 3/7] mm/thp: fix vma_address() if virtual address below file offset
  2021-06-03 21:36       ` Hugh Dickins
@ 2021-06-03 21:40         ` Hugh Dickins
  -1 siblings, 0 replies; 70+ messages in thread
From: Hugh Dickins @ 2021-06-03 21:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, Hugh Dickins, Kirill A. Shutemov, Yang Shi,
	Wang Yugui, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	linux-mm, linux-kernel

Running certain tests with a DEBUG_VM kernel would crash within hours,
on the total_mapcount BUG() in split_huge_page_to_list(), while trying
to free up some memory by punching a hole in a shmem huge page: split's
try_to_unmap() was unable to find all the mappings of the page (which,
on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).

When that BUG() was changed to a WARN(), it would later crash on the
VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma) in
mm/internal.h:vma_address(), used by rmap_walk_file() for try_to_unmap().

vma_address() is usually correct, but there's a wraparound case when the
vm_start address is unusually low while vm_pgoff is not so low: vma_address()
chooses max(start, vma->vm_start), but that picks the wrong address,
because start has wrapped around to almost ULONG_MAX.
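
A userspace sketch of that arithmetic, with invented numbers (4kB pages
and 64-bit unsigned long assumed; an illustration only, not the kernel
code itself):

	#include <stdio.h>
	#define PAGE_SHIFT 12

	int main(void)
	{
		unsigned long vm_start = 0x1000; /* unusually low vm_start */
		unsigned long vm_pgoff = 0x200;  /* vm_pgoff not so low */
		unsigned long pgoff    = 0x100;  /* THP head below vm_pgoff */
		/* the old __vma_address() calculation */
		unsigned long start = vm_start +
				((pgoff - vm_pgoff) << PAGE_SHIFT);

		printf("start = %#lx\n", start); /* 0xfffffffffff01000 */
		printf("max   = %#lx\n", start > vm_start ? start : vm_start);
		return 0;
	}

pgoff - vm_pgoff wraps, start lands just below ULONG_MAX, max() keeps
that bogus value, and the start >= vma->vm_end half of the
VM_BUG_ON_VMA() fires.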

Rewrite vma_address() to be more careful about vm_pgoff; move the
VM_BUG_ON_VMA() out of it, returning -EFAULT for errors, so that it can
be safely used from page_mapped_in_vma() and page_address_in_vma() too.
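
And with the same invented numbers, a sketch of the rewritten decision
(mirroring the new vma_address() in the patch below, minus the PageHead()
test, as a standalone userspace program):

	#include <stdio.h>
	#define PAGE_SHIFT 12

	int main(void)
	{
		unsigned long vm_start = 0x1000, vm_end = 0x101000;
		unsigned long vm_pgoff = 0x200;
		unsigned long pgoff = 0x100, nr = 0x200; /* 2MB compound page */
		unsigned long address;

		if (pgoff >= vm_pgoff) {
			address = vm_start +
				((pgoff - vm_pgoff) << PAGE_SHIFT);
			if (address < vm_start || address >= vm_end)
				address = -14UL;	/* -EFAULT */
		} else if (pgoff + nr > vm_pgoff) {
			address = vm_start;	/* head below, tail inside vma */
		} else {
			address = -14UL;	/* -EFAULT */
		}
		printf("address = %#lx\n", address); /* 0x1000 == vm_start */
		return 0;
	}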

Add vma_address_end() to apply similar care to end address calculation,
in page_vma_mapped_walk() and page_mkclean_one() and try_to_unmap_one();
though it raises a question of whether callers would do better to supply
pvmw->end to page_vma_mapped_walk() - I chose not to, for a smaller patch.

An irritation is that their apparent generality breaks down on KSM pages,
which cannot be located by the page->index that page_to_pgoff() uses: as
4b0ece6fa016 ("mm: migrate: fix remove_migration_pte() for ksm pages")
once discovered.  I dithered over the best thing to do about that, and
have ended up with a VM_BUG_ON_PAGE(PageKsm) in both vma_address() and
vma_address_end(); though the only place in danger of using it on them
was try_to_unmap_one().

Sidenote: vma_address() and vma_address_end() now use compound_nr() on
a head page, instead of thp_size(): to make the right calculation on a
hugetlbfs page, whether or not THPs are configured.  try_to_unmap() is
used on hugetlbfs pages, but perhaps the wrong calculation never mattered.

Fixes: a8fa41ad2f6f ("mm, rmap: check all VMAs that PTE-mapped THP can be part of")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
v2: use compound_nr() as Matthew suggested.

 mm/internal.h        |   50 +++++++++++++++++++++++++++++++------------
 mm/page_vma_mapped.c |   16 +++++--------
 mm/rmap.c            |   16 ++++++-------
 3 files changed, 51 insertions(+), 31 deletions(-)

--- a/mm/internal.h
+++ b/mm/internal.h
@@ -384,27 +384,51 @@ static inline void mlock_migrate_page(st
 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 
 /*
- * At what user virtual address is page expected in @vma?
+ * At what user virtual address is page expected in vma?
+ * Returns -EFAULT if all of the page is outside the range of vma.
+ * If page is a compound head, the entire compound page is considered.
  */
 static inline unsigned long
-__vma_address(struct page *page, struct vm_area_struct *vma)
+vma_address(struct page *page, struct vm_area_struct *vma)
 {
-	pgoff_t pgoff = page_to_pgoff(page);
-	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	pgoff_t pgoff;
+	unsigned long address;
+
+	VM_BUG_ON_PAGE(PageKsm(page), page);	/* KSM page->index unusable */
+	pgoff = page_to_pgoff(page);
+	if (pgoff >= vma->vm_pgoff) {
+		address = vma->vm_start +
+			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+		/* Check for address beyond vma (or wrapped through 0?) */
+		if (address < vma->vm_start || address >= vma->vm_end)
+			address = -EFAULT;
+	} else if (PageHead(page) &&
+		   pgoff + compound_nr(page) > vma->vm_pgoff) {
+		address = vma->vm_start;
+	} else {
+		address = -EFAULT;
+	}
+	return address;
 }
 
+/*
+ * Then at what user virtual address will none of the page be found in vma?
+ * Assumes that vma_address() already returned a good starting address.
+ * If page is a compound head, the entire compound page is considered.
+ */
 static inline unsigned long
-vma_address(struct page *page, struct vm_area_struct *vma)
+vma_address_end(struct page *page, struct vm_area_struct *vma)
 {
-	unsigned long start, end;
-
-	start = __vma_address(page, vma);
-	end = start + thp_size(page) - PAGE_SIZE;
-
-	/* page should be within @vma mapping range */
-	VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma);
+	pgoff_t pgoff;
+	unsigned long address;
 
-	return max(start, vma->vm_start);
+	VM_BUG_ON_PAGE(PageKsm(page), page);	/* KSM page->index unusable */
+	pgoff = page_to_pgoff(page) + compound_nr(page);
+	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	/* Check for address beyond vma (or wrapped through 0?) */
+	if (address < vma->vm_start || address > vma->vm_end)
+		address = vma->vm_end;
+	return address;
 }
 
 static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -225,18 +225,18 @@ restart:
 	if (!map_pte(pvmw))
 		goto next_pte;
 	while (1) {
+		unsigned long end;
+
 		if (check_pte(pvmw))
 			return true;
 next_pte:
 		/* Seek to next pte only makes sense for THP */
 		if (!PageTransHuge(pvmw->page) || PageHuge(pvmw->page))
 			return not_found(pvmw);
+		end = vma_address_end(pvmw->page, pvmw->vma);
 		do {
 			pvmw->address += PAGE_SIZE;
-			if (pvmw->address >= pvmw->vma->vm_end ||
-			    pvmw->address >=
-					__vma_address(pvmw->page, pvmw->vma) +
-					thp_size(pvmw->page))
+			if (pvmw->address >= end)
 				return not_found(pvmw);
 			/* Did we cross page table boundary? */
 			if (pvmw->address % PMD_SIZE == 0) {
@@ -274,14 +274,10 @@ int page_mapped_in_vma(struct page *page
 		.vma = vma,
 		.flags = PVMW_SYNC,
 	};
-	unsigned long start, end;
-
-	start = __vma_address(page, vma);
-	end = start + thp_size(page) - PAGE_SIZE;
 
-	if (unlikely(end < vma->vm_start || start >= vma->vm_end))
+	pvmw.address = vma_address(page, vma);
+	if (pvmw.address == -EFAULT)
 		return 0;
-	pvmw.address = max(start, vma->vm_start);
 	if (!page_vma_mapped_walk(&pvmw))
 		return 0;
 	page_vma_mapped_walk_done(&pvmw);
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -707,7 +707,6 @@ static bool should_defer_flush(struct mm
  */
 unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
 {
-	unsigned long address;
 	if (PageAnon(page)) {
 		struct anon_vma *page__anon_vma = page_anon_vma(page);
 		/*
@@ -722,10 +721,8 @@ unsigned long page_address_in_vma(struct
 			return -EFAULT;
 	} else
 		return -EFAULT;
-	address = __vma_address(page, vma);
-	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
-		return -EFAULT;
-	return address;
+
+	return vma_address(page, vma);
 }
 
 pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
@@ -919,7 +916,7 @@ static bool page_mkclean_one(struct page
 	 */
 	mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
 				0, vma, vma->vm_mm, address,
-				min(vma->vm_end, address + page_size(page)));
+				vma_address_end(page, vma));
 	mmu_notifier_invalidate_range_start(&range);
 
 	while (page_vma_mapped_walk(&pvmw)) {
@@ -1435,9 +1432,10 @@ static bool try_to_unmap_one(struct page
 	 * Note that the page can not be free in this function as call of
 	 * try_to_unmap() must hold a reference on the page.
 	 */
+	range.end = PageKsm(page) ?
+			address + PAGE_SIZE : vma_address_end(page, vma);
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
-				address,
-				min(vma->vm_end, address + page_size(page)));
+				address, range.end);
 	if (PageHuge(page)) {
 		/*
 		 * If sharing is possible, start and end will be adjusted
@@ -1889,6 +1887,7 @@ static void rmap_walk_anon(struct page *
 		struct vm_area_struct *vma = avc->vma;
 		unsigned long address = vma_address(page, vma);
 
+		VM_BUG_ON_VMA(address == -EFAULT, vma);
 		cond_resched();
 
 		if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
@@ -1943,6 +1942,7 @@ static void rmap_walk_file(struct page *
 			pgoff_start, pgoff_end) {
 		unsigned long address = vma_address(page, vma);
 
+		VM_BUG_ON_VMA(address == -EFAULT, vma);
 		cond_resched();
 
 		if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 2/7] mm/thp: try_to_unmap() use TTU_SYNC for safe DEBUG_VM splitting
  2021-06-01 21:07   ` Hugh Dickins
@ 2021-06-03 21:45     ` Yang Shi
  -1 siblings, 0 replies; 70+ messages in thread
From: Yang Shi @ 2021-06-03 21:45 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Wang Yugui, Matthew Wilcox,
	Naoya Horiguchi, Alistair Popple, Ralph Campbell, Zi Yan,
	Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara, Linux MM,
	Linux Kernel Mailing List

On Tue, Jun 1, 2021 at 2:07 PM Hugh Dickins <hughd@google.com> wrote:
>
> Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE
> (!unmap_success): with dump_page() showing mapcount:1, but then its
> raw struct page output showing _mapcount ffffffff i.e. mapcount 0.
>
> And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed,
> it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)),
> and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG():
> all indicative of some mapcount difficulty in development here perhaps.
> But the !CONFIG_DEBUG_VM path handles the failures correctly and silently.
>
> I believe the problem is that once a racing unmap has cleared pte or pmd,
> try_to_unmap_one() may skip taking the page table lock, and emerge from
> try_to_unmap() before the racing task has reached decrementing mapcount.
>
> Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that
> follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding TTU_SYNC
> to the options, and passing that from unmap_page() when CONFIG_DEBUG_VM=y.
> It could be passed in the non-debug case too, but that would sometimes add
> a little overhead, whereas it's rare for this race to result in failure.

The above statement makes me feel this patch is just to relieve the
VM_BUG_ON; but my patch already changed it to a VM_WARN, the race sounds
acceptable (at least not fatal), and the splitting code can handle the
failure case as well. So I'm wondering whether we still need this patch
if it is only used to close the race when CONFIG_DEBUG_VM=y.

>
> mm/memory-failure.c:hwpoison_user_mappings() should probably use the new
> TTU_SYNC option too, just in case this race coincides with its attempts to
> unmap a failing page (THP or not); but this commit does not add that.
>
> Fixes: fec89c109f3a ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
>  include/linux/rmap.h |  3 ++-
>  mm/huge_memory.c     |  4 ++++
>  mm/page_vma_mapped.c |  8 ++++++++
>  mm/rmap.c            | 17 ++++++++++++++++-
>  4 files changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index def5c62c93b3..891599a4cb8d 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -97,7 +97,8 @@ enum ttu_flags {
>                                          * do a final flush if necessary */
>         TTU_RMAP_LOCKED         = 0x80, /* do not grab rmap lock:
>                                          * caller holds it */
> -       TTU_SPLIT_FREEZE        = 0x100,                /* freeze pte under splitting thp */
> +       TTU_SPLIT_FREEZE        = 0x100, /* freeze pte under splitting thp */
> +       TTU_SYNC                = 0x200, /* avoid racy checks with PVMW_SYNC */
>  };
>
>  #ifdef CONFIG_MMU
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9fb7b47da87e..305f709a7aca 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2357,6 +2357,10 @@ static void unmap_page(struct page *page)
>         if (PageAnon(page))
>                 ttu_flags |= TTU_SPLIT_FREEZE;
>
> +       /* Make sure that the BUGs will not bite */
> +       if (IS_ENABLED(CONFIG_DEBUG_VM))
> +               ttu_flags |= TTU_SYNC;
> +
>         unmap_success = try_to_unmap(page, ttu_flags);
>         VM_BUG_ON_PAGE(!unmap_success, page);
>  }
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 2cf01d933f13..b45d22738b45 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -212,6 +212,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>                         pvmw->ptl = NULL;
>                 }
>         } else if (!pmd_present(pmde)) {
> +               /*
> +                * If PVMW_SYNC, take and drop THP pmd lock so that we
> +                * cannot return prematurely, while zap_huge_pmd() has
> +                * cleared *pmd but not decremented compound_mapcount().
> +                */
> +               if ((pvmw->flags & PVMW_SYNC) &&
> +                   PageTransCompound(pvmw->page))
> +                       spin_unlock(pmd_lock(mm, pvmw->pmd));
>                 return false;
>         }
>         if (!map_pte(pvmw))
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 693a610e181d..07811b4ae793 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1405,6 +1405,15 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>         struct mmu_notifier_range range;
>         enum ttu_flags flags = (enum ttu_flags)(long)arg;
>
> +       /*
> +        * When racing against e.g. zap_pte_range() on another cpu,
> +        * in between its ptep_get_and_clear_full() and page_remove_rmap(),
> +        * try_to_unmap() may return false when it is about to become true,
> +        * if page table locking is skipped: use TTU_SYNC to wait for that.
> +        */
> +       if (flags & TTU_SYNC)
> +               pvmw.flags = PVMW_SYNC;
> +
>         /* munlock has nothing to gain from examining un-locked vmas */
>         if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
>                 return true;
> @@ -1777,7 +1786,13 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
>         else
>                 rmap_walk(page, &rwc);
>
> -       return !page_mapcount(page) ? true : false;
> +       /*
> +        * When racing against e.g. zap_pte_range() on another cpu,
> +        * in between its ptep_get_and_clear_full() and page_remove_rmap(),
> +        * try_to_unmap() may return false when it is about to become true,
> +        * if page table locking is skipped: use TTU_SYNC to wait for that.
> +        */
> +       return !page_mapcount(page);
>  }
>
>  /**
> --
> 2.32.0.rc0.204.g9fa02ecfa5-goog
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 2/7] mm/thp: try_to_unmap() use TTU_SYNC for safe DEBUG_VM splitting
  2021-06-01 21:07   ` Hugh Dickins
                     ` (2 preceding siblings ...)
  (?)
@ 2021-06-03 21:48   ` Peter Xu
  2021-06-04  2:54       ` Hugh Dickins
  -1 siblings, 1 reply; 70+ messages in thread
From: Peter Xu @ 2021-06-03 21:48 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Jan Kara, linux-mm,
	linux-kernel

On Tue, Jun 01, 2021 at 02:07:53PM -0700, Hugh Dickins wrote:
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 2cf01d933f13..b45d22738b45 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -212,6 +212,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  			pvmw->ptl = NULL;
>  		}
>  	} else if (!pmd_present(pmde)) {
> +		/*
> +		 * If PVMW_SYNC, take and drop THP pmd lock so that we
> +		 * cannot return prematurely, while zap_huge_pmd() has
> +		 * cleared *pmd but not decremented compound_mapcount().
> +		 */
> +		if ((pvmw->flags & PVMW_SYNC) &&
> +		    PageTransCompound(pvmw->page))
> +			spin_unlock(pmd_lock(mm, pvmw->pmd));
>  		return false;
>  	}
>  	if (!map_pte(pvmw))

Sorry if I missed something important, but I'm totally confused about how
this unlock pairs with a corresponding lock()..

And.. isn't PVMW_SYNC only meaningful at pte level (as I didn't see a
reference to it outside map_pte)?

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 4/7] mm/thp: fix page_address_in_vma() on file THP tails
  2021-06-01 21:11   ` Hugh Dickins
@ 2021-06-03 22:06     ` Yang Shi
  -1 siblings, 0 replies; 70+ messages in thread
From: Yang Shi @ 2021-06-03 22:06 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Wang Yugui, Matthew Wilcox,
	Naoya Horiguchi, Alistair Popple, Ralph Campbell, Zi Yan,
	Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara, Linux MM,
	Linux Kernel Mailing List

On Tue, Jun 1, 2021 at 2:11 PM Hugh Dickins <hughd@google.com> wrote:
>
> From: Jue Wang <juew@google.com>
>
> Anon THP tails were already supported, but memory-failure may need to use
> page_address_in_vma() on file THP tails, which its page->mapping check did
> not permit: fix it.
>
> hughd adds: no current usage is known to hit the issue, but this does fix
> a subtle trap in a general helper: best fixed in stable sooner than later.
>
> Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> Signed-off-by: Jue Wang <juew@google.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>

Reviewed-by: Yang Shi <shy828301@gmail.com>

> ---
>  mm/rmap.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 144de54efc1c..e05c300048e6 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -716,11 +716,11 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
>                 if (!vma->anon_vma || !page__anon_vma ||
>                     vma->anon_vma->root != page__anon_vma->root)
>                         return -EFAULT;
> -       } else if (page->mapping) {
> -               if (!vma->vm_file || vma->vm_file->f_mapping != page->mapping)
> -                       return -EFAULT;
> -       } else
> +       } else if (!vma->vm_file) {
> +               return -EFAULT;
> +       } else if (vma->vm_file->f_mapping != compound_head(page)->mapping) {
>                 return -EFAULT;
> +       }
>
>         return vma_address(page, vma);
>  }
> --
> 2.32.0.rc0.204.g9fa02ecfa5-goog
>
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 7/7] mm/thp: remap_page() is only needed on anonymous THP
  2021-06-01 21:17   ` Hugh Dickins
@ 2021-06-03 22:09     ` Yang Shi
  -1 siblings, 0 replies; 70+ messages in thread
From: Yang Shi @ 2021-06-03 22:09 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Wang Yugui, Matthew Wilcox,
	Naoya Horiguchi, Alistair Popple, Ralph Campbell, Zi Yan,
	Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara, Linux MM,
	Linux Kernel Mailing List

On Tue, Jun 1, 2021 at 2:17 PM Hugh Dickins <hughd@google.com> wrote:
>
> THP splitting's unmap_page() only sets TTU_SPLIT_FREEZE when PageAnon,
> and migration entries are only inserted when TTU_MIGRATION (unused here)
> or TTU_SPLIT_FREEZE is set: so it's just a waste of time for remap_page()
> to search for migration entries to remove when !PageAnon.
>
> Fixes: baa355fd3314 ("thp: file pages support for split_huge_page()")
> Signed-off-by: Hugh Dickins <hughd@google.com>

Reviewed-by: Yang Shi <shy828301@gmail.com>

> ---
>  mm/huge_memory.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 305f709a7aca..e4a83e310452 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2354,6 +2354,7 @@ static void unmap_page(struct page *page)
>
>         VM_BUG_ON_PAGE(!PageHead(page), page);
>
> +       /* If TTU_SPLIT_FREEZE is ever extended to file, update remap_page() */
>         if (PageAnon(page))
>                 ttu_flags |= TTU_SPLIT_FREEZE;
>
> @@ -2368,6 +2369,10 @@ static void unmap_page(struct page *page)
>  static void remap_page(struct page *page, unsigned int nr)
>  {
>         int i;
> +
> +       /* If TTU_SPLIT_FREEZE is ever extended to file, remove this check */
> +       if (!PageAnon(page))
> +               return;
>         if (PageTransHuge(page)) {
>                 remove_migration_ptes(page, page, true);
>         } else {
> --
> 2.32.0.rc0.204.g9fa02ecfa5-goog
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/7] mm/thp: fix THP splitting unmap BUGs and related
  2021-06-01 21:03 ` Hugh Dickins
@ 2021-06-03 22:21   ` Hugh Dickins
  -1 siblings, 0 replies; 70+ messages in thread
From: Hugh Dickins @ 2021-06-03 22:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	linux-mm, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4532 bytes --]

On Tue, 1 Jun 2021, Hugh Dickins wrote:

> Here is a batch of long-standing THP bug fixes that I had not got
> around to sending before, but prompted now by Wang Yugui's report
> https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/
> 
> Wang Yugui has tested a rollup of these fixes applied to 5.10.39,
> and they have done no harm, but have *not* fixed that issue:
> something more is needed and I have no idea of what.
> 
> But at least these clear up related issues, and should go to stable
> (except for the last, which is just an optimization: it would be
> fine in stable, but it's not required there).
> 
> These are against 5.13-rc4: easy for others to try out, but my next
> task is to work out how to shoehorn them into mmotm and linux-next.
> 
> I would have said just before Yang Shi's related
> mm-thp-replace-debug_vm-bug-with-vm_warn-when-unmap-fails-for-split.patch
> except that (which should also go to stable) is currently placed after
> Alistair Popple's "Add support for SVM atomics in Nouveau" series,
> mm-rmap-split-try_to_munlock-from-try_to_unmap.patch etc.
> I expect I shall offer you some rediffs of Alistair's, we'll see.

And here's my attempt to rewrite history: mmotm-adjust.tar contains

before/series
before/mm-remove-special-swap-entry-functions.patch
before/mm-rmap-split-try_to_munlock-from-try_to_unmap.patch
before/mm-huge_memoryc-remove-unnecessary-tlb_remove_page_size-for-huge-zero-pmd.patch
after/series
after/mm-remove-special-swap-entry-functions.patch
after/mm-rmap-split-try_to_munlock-from-try_to_unmap.patch
after/mm-huge_memoryc-remove-unnecessary-tlb_remove_page_size-for-huge-zero-pmd.patch

"before" contains the originals from mmotm .DATE=2021-06-01-19-57:
your series file, two patches from Alistair and one from Miaohe, which
gave rejects when I inserted this 1-7/7 patchset and Yang Shi's 1-2/2
earlier.  All other patches applied cleanly, but could be refreshed.

"after" contains my replacements to those: adjusting them to how they
would have been written if this patchset and Yang Shi's went first
(and in adjusting some of the comments in Alistair's, I did not
recognize the reason given for unmap_page() using try_to_unmap()
on file THPs, so substituted my own explanation).

As diff'ing the series will show, first thing is to delete the old

mm-rmap-make-try_to_unmap-void-function-fix-fix.patch
mm-rmap-make-try_to_unmap-void-function-fix.patch
mm-rmap-make-try_to_unmap-void-function.patch
mm-thp-replace-debug_vm-bug-with-vm_warn-when-unmap-fails-for-split.patch
mm-thp-check-total_mapcount-instead-of-page_mapcount-fix-fix-fix.patch

Then insert, from the mailing list

hughd1.patch
hughd2.patch
hughd3.patch
hughd4.patch
hughd5.patch
hughd6.patch
yangshi1.patch
yangshi2.patch
hughd7.patch

You'll name those differently of course.  I have just posted a v2
of hughd3.patch, with small changes suggested by Matthew; and after
sending this, I shall reply to it with a 6.1/7 and 6.2/7 which are
rediffs of Yang Shi's related patches to fit in here: 1/7 through
6.1/7 are Cc'ed to stable, 6.2/7 and 7/7 are cleanup and optimization.

I've placed them all together early, in the "mainline-later" section;
I expect you to be more cautious and place them later - anywhere before
mm-remove-special-swap-entry-functions.patch
works (but even the two not-for-stable ones need to come before that).

Then please replace those three "before" patches by their "after"s,
unless Alistair or someone else objects.

I've probably forgotten something,
but can't remember what it is :)
Let me know if I've screwed up.

Thanks,
Hugh

> 
> 1/7 mm/thp: fix __split_huge_pmd_locked() on shmem migration entry
> 2/7 mm/thp: try_to_unmap() use TTU_SYNC for safe DEBUG_VM splitting
> 3/7 mm/thp: fix vma_address() if virtual address below file offset
> 4/7 mm/thp: fix page_address_in_vma() on file THP tails
> 5/7 mm/thp: fix page_vma_mapped_walk() if huge page mapped by ptes
> 6/7 mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
> 7/7 mm/thp: remap_page() is only needed on anonymous THP
> 
>  include/linux/mm.h   |    3 
>  include/linux/rmap.h |    3 
>  mm/huge_memory.c     |   47 ++++++++----
>  mm/internal.h        |   54 ++++++++++----
>  mm/memory.c          |   40 ++++++++++
>  mm/page_vma_mapped.c |  163 ++++++++++++++++++++++++++-----------------
>  mm/pgtable-generic.c |    5 -
>  mm/rmap.c            |   39 +++++++---
>  mm/truncate.c        |   43 +++++------
>  9 files changed, 266 insertions(+), 131 deletions(-)
> 
> Hugh

[-- Attachment #2: Type: APPLICATION/x-tar, Size: 184320 bytes --]

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [PATCH 6.1/7] mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split
  2021-06-01 21:03 ` Hugh Dickins
@ 2021-06-03 22:26   ` Hugh Dickins
  -1 siblings, 0 replies; 70+ messages in thread
From: Hugh Dickins @ 2021-06-03 22:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	linux-mm, linux-kernel

From: Yang Shi <shy828301@gmail.com>

When debugging the bug reported by Wang Yugui [1], try_to_unmap() may
fail, but the first VM_BUG_ON_PAGE() just checks page_mapcount(), so it
may miss the failure when the head page is unmapped but another subpage
is mapped.  Then the second DEBUG_VM BUG(), which checks total mapcount,
would catch it.  This may cause some confusion.  And since this is not a
fatal issue, consolidate the two DEBUG_VM checks into one
VM_WARN_ON_ONCE_PAGE().

[1] https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/

Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
v5: Rediffed by Hugh to fit after 6/7 in his mm/thp series; Cc stable.
v4: Updated the subject and commit log per Hugh.
    Reordered the patches per Hugh.
v3: Incorporated the comments from Hugh. Keep Zi Yan's reviewed-by tag
    since there is no fundamental change against v2.
v2: Removed dead code and updated the comment of try_to_unmap() per Zi
    Yan.

 mm/huge_memory.c | 26 ++++++++------------------
 1 file changed, 8 insertions(+), 18 deletions(-)

--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2350,19 +2350,19 @@ static void unmap_page(struct page *page
 {
 	enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK |
 		TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
-	bool unmap_success;
 
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
 	if (PageAnon(page))
 		ttu_flags |= TTU_SPLIT_FREEZE;
 
-	/* Make sure that the BUGs will not bite */
+	/* If warning below, prevent a race in the mapped accounting */
 	if (IS_ENABLED(CONFIG_DEBUG_VM))
 		ttu_flags |= TTU_SYNC;
 
-	unmap_success = try_to_unmap(page, ttu_flags);
-	VM_BUG_ON_PAGE(!unmap_success, page);
+	try_to_unmap(page, ttu_flags);
+
+	VM_WARN_ON_ONCE_PAGE(page_mapped(page), page);
 }
 
 static void remap_page(struct page *page, unsigned int nr)
@@ -2673,7 +2673,7 @@ int split_huge_page_to_list(struct page
 	struct deferred_split *ds_queue = get_deferred_split_queue(head);
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
-	int count, mapcount, extra_pins, ret;
+	int extra_pins, ret;
 	pgoff_t end;
 
 	VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
@@ -2732,7 +2732,6 @@ int split_huge_page_to_list(struct page
 	}
 
 	unmap_page(head);
-	VM_BUG_ON_PAGE(compound_mapcount(head), head);
 
 	/* block interrupt reentry in xa_lock and spinlock */
 	local_irq_disable();
@@ -2750,9 +2749,7 @@ int split_huge_page_to_list(struct page
 
 	/* Prevent deferred_split_scan() touching ->_refcount */
 	spin_lock(&ds_queue->split_queue_lock);
-	count = page_count(head);
-	mapcount = total_mapcount(head);
-	if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
+	if (page_ref_freeze(head, 1 + extra_pins)) {
 		if (!list_empty(page_deferred_list(head))) {
 			ds_queue->split_queue_len--;
 			list_del(page_deferred_list(head));
@@ -2772,16 +2769,9 @@ int split_huge_page_to_list(struct page
 		__split_huge_page(page, list, end);
 		ret = 0;
 	} else {
-		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
-			pr_alert("total_mapcount: %u, page_count(): %u\n",
-					mapcount, count);
-			if (PageTail(page))
-				dump_page(head, NULL);
-			dump_page(page, "total_mapcount(head) > 0");
-			BUG();
-		}
 		spin_unlock(&ds_queue->split_queue_lock);
-fail:		if (mapping)
+fail:
+		if (mapping)
 			xa_unlock(&mapping->i_pages);
 		local_irq_enable();
 		remap_page(head, thp_nr_pages(head));

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 0/7] mm/thp: fix THP splitting unmap BUGs and related
  2021-06-03 22:21   ` Hugh Dickins
  (?)
@ 2021-06-03 23:03   ` Andrew Morton
  -1 siblings, 0 replies; 70+ messages in thread
From: Andrew Morton @ 2021-06-03 23:03 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Yang Shi, Wang Yugui, Matthew Wilcox,
	Naoya Horiguchi, Alistair Popple, Ralph Campbell, Zi Yan,
	Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara, linux-mm,
	linux-kernel

On Thu, 3 Jun 2021 15:21:35 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:

> And here's my attempt to rewrite history: mmotm-adjust.tar contains
> 
> before/series
> before/mm-remove-special-swap-entry-functions.patch
> before/mm-rmap-split-try_to_munlock-from-try_to_unmap.patch
> before/mm-huge_memoryc-remove-unnecessary-tlb_remove_page_size-for-huge-zero-pmd.patch
> after/series
> after/mm-remove-special-swap-entry-functions.patch
> after/mm-rmap-split-try_to_munlock-from-try_to_unmap.patch
> after/mm-huge_memoryc-remove-unnecessary-tlb_remove_page_size-for-huge-zero-pmd.patch
> 
> "before" contains the originals from mmotm .DATE=2021-06-01-19-57:
> your series file, two patches from Alistair and one from Miaohe, which
> gave rejects when I inserted this 1-7/7 patchset and Yang Shi's 1-2/2
> earlier.  All other patches applied cleanly, but could be refreshed.
> 
> "after" contains my replacements to those: adjusting them to how they
> would have been written if this patchset and Yang Shi's went first
> (and in adjusting some of the comments in Alistair's, I did not
> recognize the reason given for unmap_page() using try_to_unmap()
> on file THPs, so substituted my own explanation).
> 
> As diff'ing the series will show, first thing is to delete the old
> 
> mm-rmap-make-try_to_unmap-void-function-fix-fix.patch
> mm-rmap-make-try_to_unmap-void-function-fix.patch
> mm-rmap-make-try_to_unmap-void-function.patch
> mm-thp-replace-debug_vm-bug-with-vm_warn-when-unmap-fails-for-split.patch
> mm-thp-check-total_mapcount-instead-of-page_mapcount-fix-fix-fix.patch
> 
> Then insert, from the mailing list
> 
> hughd1.patch
> hughd2.patch
> hughd3.patch
> hughd4.patch
> hughd5.patch
> hughd6.patch
> yangshi1.patch
> yangshi2.patch
> hughd7.patch

There have been some significant-appearing review comments coming in on
hughdN.patch just today.

So...  I think I'll pass for now.  How about when the dust has settled
you send out a complete series of the hughd and yangshi patches against
Linus mainline?


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 1/7] mm/thp: fix __split_huge_pmd_locked() on shmem migration entry
  2021-06-03 21:26     ` Yang Shi
@ 2021-06-04  2:22       ` Hugh Dickins
  -1 siblings, 0 replies; 70+ messages in thread
From: Hugh Dickins @ 2021-06-04  2:22 UTC (permalink / raw)
  To: Yang Shi
  Cc: Hugh Dickins, Andrew Morton, Kirill A. Shutemov, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Linux MM, Linux Kernel Mailing List

On Thu, 3 Jun 2021, Yang Shi wrote:
> On Tue, Jun 1, 2021 at 2:05 PM Hugh Dickins <hughd@google.com> wrote:
> >
> > Are there more places that need to be careful about pmd migration entries?
> > None hit in practice, but several of those is_huge_zero_pmd() tests were
> > done without checking pmd_present() first: I believe a pmd migration entry
> > could end up satisfying that test.  Ah, the inversion of swap offset, to
> > protect against L1TF, makes that impossible on x86; but other arches need
> > the pmd_present() check, and even x86 ought not to apply pmd_page() to a
> > swap-like pmd.  Fix those instances; __split_huge_pmd_locked() was not
> > wrong to be checking with pmd_trans_huge() instead, but I think it's
> > clearer to use pmd_present() in each instance.
...
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 63ed6b25deaa..9fb7b47da87e 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1676,7 +1676,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >                 spin_unlock(ptl);
> >                 if (is_huge_zero_pmd(orig_pmd))
> >                         tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
> > -       } else if (is_huge_zero_pmd(orig_pmd)) {
> > +       } else if (pmd_present(orig_pmd) && is_huge_zero_pmd(orig_pmd)) {
> 
> If it is a huge zero migration entry, the code would fallback to the
> "else". But IIUC the "else" case doesn't handle the huge zero page
> correctly. It may mess up the rss counter.

A huge zero migration entry?  I hope that's not something special
that I've missed.

Do we ever migrate a huge zero page - and how do we find where it's
mapped, to insert the migration entries?  But if we do, I thought it
would use the usual kind of pmd migration entry; and the first check
in is_pmd_migration_entry() is !pmd_present(pmd).
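
For reference, the helper I mean looks roughly like this (quoting from
memory, so check swapops.h rather than trusting me):

	static inline int is_pmd_migration_entry(pmd_t pmd)
	{
		return !pmd_present(pmd) &&
			is_migration_entry(pmd_to_swp_entry(pmd));
	}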

(I have to be rather careful to check such details, after getting
burnt once by pmd_present(): which includes the "huge" bit even when
not otherwise present, to permit races with pmdp_invalidate().
I mentioned in private mail that I'd dropped one of my "fixes" because
it was harmless but mistaken: I had misunderstood pmd_present().)

The point here (see commit message above) is that some unrelated pmd
migration entry could pass the is_huge_zero_pmd() test, which rushes
off to use pmd_page() without even checking pmd_present() first.  And
most of its users have, one way or another, checked pmd_present() first;
but this place and a couple of others had not.

I'm just verifying that it's really a huge zero pmd before handling
its case; the "else" still does not need to handle the huge zero page.

Hugh

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 2/7] mm/thp: try_to_unmap() use TTU_SYNC for safe DEBUG_VM splitting
  2021-06-03 21:45     ` Yang Shi
@ 2021-06-04  2:45       ` Hugh Dickins
  -1 siblings, 0 replies; 70+ messages in thread
From: Hugh Dickins @ 2021-06-04  2:45 UTC (permalink / raw)
  To: Yang Shi
  Cc: Hugh Dickins, Andrew Morton, Kirill A. Shutemov, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Linux MM, Linux Kernel Mailing List

On Thu, 3 Jun 2021, Yang Shi wrote:
> On Tue, Jun 1, 2021 at 2:07 PM Hugh Dickins <hughd@google.com> wrote:
> >
> > Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that
> > follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding TTU_SYNC
> > to the options, and passing that from unmap_page() when CONFIG_DEBUG_VM=y.
> > It could be passed in the non-debug case too, but that would sometimes add
> > a little overhead, whereas it's rare for this race to result in failure.
> 
> The above statement makes me feel this patch is just to relieve the
> VM_BUG_ON, but my patch already changed it to VM_WARN, the race sounds
> acceptable (at least not fatal) and the splitting code can handle the
> failure case as well. So I'm wondering if we still need this patch or
> not if it is just used to close the race when CONFIG_DEBUG_VM=y.

I do agree that your 1/2 (what I'm calling 6.1/7) BUG->WARN patch
is the most important of them; but it didn't get marked for stable,
and has got placed behind conflicting mods never intended for stable.

And a lot of the descriptions had been written in terms of the prior
situation, with VM BUG there: it was easier to keep describing that way.

Whether your fix makes mine redundant is arguable (Wang Yugui thinks
not).  It's easier to argue that it makes the racy ones (like this)
redundant, than the persistent ones (like vma_address or pvm_walk).

Since I know of at least one customer who wants all these fixes in 5.10
longterm, I'm fighting to get them that far at least.  But the further
back they go, the less effort I'll make to backport them - will fall
back to porting your BUG->WARN only.

Hugh

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 2/7] mm/thp: try_to_unmap() use TTU_SYNC for safe DEBUG_VM splitting
  2021-06-03 21:48   ` Peter Xu
@ 2021-06-04  2:54       ` Hugh Dickins
  0 siblings, 0 replies; 70+ messages in thread
From: Hugh Dickins @ 2021-06-04  2:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: Hugh Dickins, Andrew Morton, Kirill A. Shutemov, Yang Shi,
	Wang Yugui, Matthew Wilcox, Naoya Horiguchi, Alistair Popple,
	Ralph Campbell, Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang,
	Jan Kara, linux-mm, linux-kernel

On Thu, 3 Jun 2021, Peter Xu wrote:
> On Tue, Jun 01, 2021 at 02:07:53PM -0700, Hugh Dickins wrote:
> > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > index 2cf01d933f13..b45d22738b45 100644
> > --- a/mm/page_vma_mapped.c
> > +++ b/mm/page_vma_mapped.c
> > @@ -212,6 +212,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> >  			pvmw->ptl = NULL;
> >  		}
> >  	} else if (!pmd_present(pmde)) {
> > +		/*
> > +		 * If PVMW_SYNC, take and drop THP pmd lock so that we
> > +		 * cannot return prematurely, while zap_huge_pmd() has
> > +		 * cleared *pmd but not decremented compound_mapcount().
> > +		 */
> > +		if ((pvmw->flags & PVMW_SYNC) &&
> > +		    PageTransCompound(pvmw->page))
> > +			spin_unlock(pmd_lock(mm, pvmw->pmd));
> >  		return false;
> >  	}
> >  	if (!map_pte(pvmw))
> 
> Sorry if I missed something important, but I'm totally confused on how this
> unlock is pairing with another lock()..

I imagine you're reading that as spin_unlock(pmd_lockptr(blah));
no, the lock is right there, inside spin_unlock(pmd_lock(blah)).
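
Spelled out, that line is just shorthand for (a pure readability
expansion, nothing else changes):

	spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
	spin_unlock(ptl);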

> 
> And.. isn't PVMW_SYNC only meaningful for pte-level only (as I didn't see a
> reference of it outside map_pte)?

But you are pointing directly to its reference outside map_pte()!

Hugh

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 2/7] mm/thp: try_to_unmap() use TTU_SYNC for safe DEBUG_VM splitting
  2021-06-04  2:54       ` Hugh Dickins
  (?)
@ 2021-06-04 14:48       ` Peter Xu
  2021-06-04 22:26           ` Hugh Dickins
  -1 siblings, 1 reply; 70+ messages in thread
From: Peter Xu @ 2021-06-04 14:48 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Jan Kara, linux-mm,
	linux-kernel

On Thu, Jun 03, 2021 at 07:54:11PM -0700, Hugh Dickins wrote:
> On Thu, 3 Jun 2021, Peter Xu wrote:
> > On Tue, Jun 01, 2021 at 02:07:53PM -0700, Hugh Dickins wrote:
> > > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > > index 2cf01d933f13..b45d22738b45 100644
> > > --- a/mm/page_vma_mapped.c
> > > +++ b/mm/page_vma_mapped.c
> > > @@ -212,6 +212,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > >  			pvmw->ptl = NULL;
> > >  		}
> > >  	} else if (!pmd_present(pmde)) {
> > > +		/*
> > > +		 * If PVMW_SYNC, take and drop THP pmd lock so that we
> > > +		 * cannot return prematurely, while zap_huge_pmd() has
> > > +		 * cleared *pmd but not decremented compound_mapcount().
> > > +		 */
> > > +		if ((pvmw->flags & PVMW_SYNC) &&
> > > +		    PageTransCompound(pvmw->page))
> > > +			spin_unlock(pmd_lock(mm, pvmw->pmd));
> > >  		return false;
> > >  	}
> > >  	if (!map_pte(pvmw))
> > 
> > Sorry if I missed something important, but I'm totally confused on how this
> > unlock is pairing with another lock()..
> 
> I imagine you're reading that as spin_unlock(pmd_lockptr(blah));
> no, the lock is right there, inside spin_unlock(pmd_lock(blah)).

Heh, yeah... Sorry about that.

> 
> > 
> > And.. isn't PVMW_SYNC only meaningful for pte-level only (as I didn't see a
> > reference of it outside map_pte)?
> 
> But you are pointing directly to its reference outside map_pte()!

Right, I was trying to look for the lock() so I needed to look at all the rest
besides this one. :)

I didn't follow Yang's patch, but if Yang's patch can keep the kernel from
crashing and fault handling is all done well, then I kind of agree with him:
having workaround code (like taking a lock and quickly releasing it) only for
the debug case seems an overkill to me, not to mention that the debug code
will be even stricter after this patch, which means it's even less likely
that one can reproduce a production host race with DEBUG_VM.  Debugging code
normally affects code execution already, and here we're enlarging that gap
"explicitly" - not sure whether that's good.

This also makes me curious: what if we made !DEBUG_VM strict too - how much
perf would we lose?  I haven't even tried to think it through in detail, just
raising it.  Say, is there any chance we could make the debug/non-debug paths
run the same logic (e.g. the SYNC version)?
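
Concretely I only mean something as simple as this in unmap_page(),
untested and purely to illustrate the question:

	/* pass TTU_SYNC unconditionally, not only under CONFIG_DEBUG_VM */
	ttu_flags |= TTU_SYNC;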

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 1/7] mm/thp: fix __split_huge_pmd_locked() on shmem migration entry
  2021-06-01 21:05   ` Hugh Dickins
  (?)
  (?)
@ 2021-06-04 15:34   ` Kirill A. Shutemov
  2021-06-04 21:29       ` Hugh Dickins
  -1 siblings, 1 reply; 70+ messages in thread
From: Kirill A. Shutemov @ 2021-06-04 15:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	linux-mm, linux-kernel

On Tue, Jun 01, 2021 at 02:05:45PM -0700, Hugh Dickins wrote:
> Stressing huge tmpfs page migration racing hole punch often crashed on the
> VM_BUG_ON(!pmd_present) in pmdp_huge_clear_flush(), with DEBUG_VM=y kernel;
> or shortly afterwards, on a bad dereference in __split_huge_pmd_locked()
> when DEBUG_VM=n.  They forgot to allow for pmd migration entries in the
> non-anonymous case.
> 
> Full disclosure: those particular experiments were on a kernel with more
> relaxed mmap_lock and i_mmap_rwsem locking, and were not repeated on the
> vanilla kernel: it is conceivable that stricter locking happens to avoid
> those cases, or makes them less likely; but __split_huge_pmd_locked()
> already allowed for pmd migration entries when handling anonymous THPs,
> so this commit brings the shmem and file THP handling into line.
> 
> Are there more places that need to be careful about pmd migration entries?
> None hit in practice, but several of those is_huge_zero_pmd() tests were
> done without checking pmd_present() first: I believe a pmd migration entry
> could end up satisfying that test.  Ah, the inversion of swap offset, to
> protect against L1TF, makes that impossible on x86; but other arches need
> the pmd_present() check, and even x86 ought not to apply pmd_page() to a
> swap-like pmd.  Fix those instances; __split_huge_pmd_locked() was not
> wrong to be checking with pmd_trans_huge() instead, but I think it's
> clearer to use pmd_present() in each instance.
> 
> And while there: make it clearer to the eye that the !vma_is_anonymous()
> and is_huge_zero_pmd() blocks make early returns (and don't return void).
> 
> Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>

Looks like two fixes got squashed into one patch. The zero-page fix and
the migration entries in __split_huge_pmd_locked() deserve separate
patches, no?

Maybe add VM_BUG_ON(!pmd_present()) in is_huge_zero_pmd()?
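
Something along these lines, say (just a sketch of the idea, with the
current body written from memory, so double-check against huge_mm.h):

	static inline bool is_huge_zero_pmd(pmd_t pmd)
	{
		VM_BUG_ON(!pmd_present(pmd));
		return is_huge_zero_page(pmd_page(pmd));
	}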

Also I wonder how much code we could remove if we did not establish
migration ptes for file pages. We could make these page table entries
'none' on migration.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 2/7] mm/thp: try_to_unmap() use TTU_SYNC for safe DEBUG_VM splitting
  2021-06-04  2:54       ` Hugh Dickins
  (?)
  (?)
@ 2021-06-04 15:47       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 70+ messages in thread
From: Kirill A. Shutemov @ 2021-06-04 15:47 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Xu, Andrew Morton, Kirill A. Shutemov, Yang Shi,
	Wang Yugui, Matthew Wilcox, Naoya Horiguchi, Alistair Popple,
	Ralph Campbell, Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang,
	Jan Kara, linux-mm, linux-kernel

On Thu, Jun 03, 2021 at 07:54:11PM -0700, Hugh Dickins wrote:
> On Thu, 3 Jun 2021, Peter Xu wrote:
> > On Tue, Jun 01, 2021 at 02:07:53PM -0700, Hugh Dickins wrote:
> > > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > > index 2cf01d933f13..b45d22738b45 100644
> > > --- a/mm/page_vma_mapped.c
> > > +++ b/mm/page_vma_mapped.c
> > > @@ -212,6 +212,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > >  			pvmw->ptl = NULL;
> > >  		}
> > >  	} else if (!pmd_present(pmde)) {
> > > +		/*
> > > +		 * If PVMW_SYNC, take and drop THP pmd lock so that we
> > > +		 * cannot return prematurely, while zap_huge_pmd() has
> > > +		 * cleared *pmd but not decremented compound_mapcount().
> > > +		 */
> > > +		if ((pvmw->flags & PVMW_SYNC) &&
> > > +		    PageTransCompound(pvmw->page))
> > > +			spin_unlock(pmd_lock(mm, pvmw->pmd));
> > >  		return false;
> > >  	}
> > >  	if (!map_pte(pvmw))
> > 
> > Sorry if I missed something important, but I'm totally confused on how this
> > unlock is pairing with another lock()..
> 
> I imagine you're reading that as spin_unlock(pmd_lockptr(blah));
> no, the lock is right there, inside spin_unlock(pmd_lock(blah)).

Oh.. It made me scratch my head too. Could you rewrite it in a more
readable way?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v2 3/7] mm/thp: fix vma_address() if virtual address below file offset
  2021-06-03 21:40         ` Hugh Dickins
  (?)
@ 2021-06-04 15:53         ` Kirill A. Shutemov
  -1 siblings, 0 replies; 70+ messages in thread
From: Kirill A. Shutemov @ 2021-06-04 15:53 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Matthew Wilcox, Kirill A. Shutemov, Yang Shi,
	Wang Yugui, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	linux-mm, linux-kernel

On Thu, Jun 03, 2021 at 02:40:30PM -0700, Hugh Dickins wrote:
> Running certain tests with a DEBUG_VM kernel would crash within hours,
> on the total_mapcount BUG() in split_huge_page_to_list(), while trying
> to free up some memory by punching a hole in a shmem huge page: split's
> try_to_unmap() was unable to find all the mappings of the page (which,
> on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).
> 
> When that BUG() was changed to a WARN(), it would later crash on the
> VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma) in
> mm/internal.h:vma_address(), used by rmap_walk_file() for try_to_unmap().
> 
> vma_address() is usually correct, but there's a wraparound case when the
> vm_start address is unusually low, but vm_pgoff not so low: vma_address()
> chooses max(start, vma->vm_start), but that decides on the wrong address,
> because start has become almost ULONG_MAX.
> 
> Rewrite vma_address() to be more careful about vm_pgoff; move the
> VM_BUG_ON_VMA() out of it, returning -EFAULT for errors, so that it can
> be safely used from page_mapped_in_vma() and page_address_in_vma() too.
> 
> Add vma_address_end() to apply similar care to end address calculation,
> in page_vma_mapped_walk() and page_mkclean_one() and try_to_unmap_one();
> though it raises a question of whether callers would do better to supply
> pvmw->end to page_vma_mapped_walk() - I chose not, for a smaller patch.
> 
> An irritation is that their apparent generality breaks down on KSM pages,
> which cannot be located by the page->index that page_to_pgoff() uses: as
> 4b0ece6fa016 ("mm: migrate: fix remove_migration_pte() for ksm pages")
> once discovered.  I dithered over the best thing to do about that, and
> have ended up with a VM_BUG_ON_PAGE(PageKsm) in both vma_address() and
> vma_address_end(); though the only place in danger of using it on them
> was try_to_unmap_one().
> 
> Sidenote: vma_address() and vma_address_end() now use compound_nr() on
> a head page, instead of thp_size(): to make the right calculation on a
> hugetlbfs page, whether or not THPs are configured.  try_to_unmap() is
> used on hugetlbfs pages, but perhaps the wrong calculation never mattered.
> 
> Fixes: a8fa41ad2f6f ("mm, rmap: check all VMAs that PTE-mapped THP can be part of")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 4/7] mm/thp: fix page_address_in_vma() on file THP tails
  2021-06-01 21:11   ` Hugh Dickins
                     ` (2 preceding siblings ...)
  (?)
@ 2021-06-04 15:54   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 70+ messages in thread
From: Kirill A. Shutemov @ 2021-06-04 15:54 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	linux-mm, linux-kernel

On Tue, Jun 01, 2021 at 02:11:22PM -0700, Hugh Dickins wrote:
> From: Jue Wang <juew@google.com>
> 
> Anon THP tails were already supported, but memory-failure may need to use
> page_address_in_vma() on file THP tails, which its page->mapping check did
> not permit: fix it.
> 
> hughd adds: no current usage is known to hit the issue, but this does fix
> a subtle trap in a general helper: best fixed in stable sooner than later.
> 
> Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> Signed-off-by: Jue Wang <juew@google.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 5/7] mm/thp: fix page_vma_mapped_walk() if huge page mapped by ptes
  2021-06-01 21:13   ` Hugh Dickins
  (?)
@ 2021-06-04 16:24   ` Kirill A. Shutemov
  2021-06-04 17:42     ` Matthew Wilcox
  2021-06-04 22:56       ` Hugh Dickins
  -1 siblings, 2 replies; 70+ messages in thread
From: Kirill A. Shutemov @ 2021-06-04 16:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	linux-mm, linux-kernel

On Tue, Jun 01, 2021 at 02:13:21PM -0700, Hugh Dickins wrote:
> Running certain tests with a DEBUG_VM kernel would crash within hours,
> on the total_mapcount BUG() in split_huge_page_to_list(), while trying
> to free up some memory by punching a hole in a shmem huge page: split's
> try_to_unmap() was unable to find all the mappings of the page (which,
> on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).
> 
> Crash dumps showed two tail pages of a shmem huge page remained mapped
> by pte: ptes in a non-huge-aligned vma of a gVisor process, at the end
> of a long unmapped range; and no page table had yet been allocated for
> the head of the huge page to be mapped into.
> 
> Although designed to handle these odd misaligned huge-page-mapped-by-pte
> cases, page_vma_mapped_walk() falls short by returning false prematurely
> when !pmd_present or !pud_present or !p4d_present or !pgd_present: there
> are cases when a huge page may span the boundary, with ptes present in
> the next.

Oh. My bad. I guess it was a pain to debug.

> Restructure page_vma_mapped_walk() as a loop to continue in these cases,
> while keeping its layout much as before. Add a step_forward() helper to
> advance pvmw->address across those boundaries: originally I tried to use
> mm's standard p?d_addr_end() macros, but hit the same crash 512 times
> less often: because of the way redundant levels are folded together,
> but folded differently in different configurations, it was just too
> difficult to use them correctly; and step_forward() is simpler anyway.
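
For reference, step_forward() in the patch is roughly the helper below -
my paraphrase rather than the exact hunk, so treat it as a sketch:

	static void step_forward(struct page_vma_mapped_walk *pvmw,
				 unsigned long size)
	{
		/* Advance to the next 'size'-aligned boundary */
		pvmw->address = (pvmw->address + size) & ~(size - 1);
		/* Saturate if we wrapped past the end of the address space */
		if (!pvmw->address)
			pvmw->address = ULONG_MAX;
	}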
> 
> Merged various other minor fixes and cleanups into page_vma_mapped_walk()
> as I worked on it: which I find much easier to enumerate here than to
> prise apart into separate commits.

But it makes it harder to review...

> Handle all of the hugetlbfs PageHuge() case once at the start,
> so we don't need to worry about it again further down.
> 
> Sometimes local copy of pvmw->page was used, sometimes pvmw->page:
> just use pvmw->page throughout (and continue to use pvmw->address
> throughout, though we could take a local copy instead).
> 
> Use pmd_read_atomic() with barrier() instead of READ_ONCE() for pmde:
> some architectures (e.g. i386 with PAE) have a multi-word pmd entry,
> for which READ_ONCE() is not good enough.
> 
> Re-evaluate pmde after taking lock, then use it in subsequent tests,
> instead of repeatedly dereferencing pvmw->pmd pointer.
> 
> Rearrange the !pmd_present block to follow the same "return not_found,
> return not_found, return true" pattern as the block above it (note:
> returning not_found there is never premature, since the existence or
> prior existence of a huge pmd guarantees good alignment).
> 
> Adjust page table boundary test in case address was not page-aligned.
> 
> Reset pvmw->pte to NULL after unmapping that page table.
> 
> Respect the 80-column line limit.
> 
> Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>

I tried to review it and superficially it looks good, but it has to be
split into a bunch of patches.


>  		/* when pud is not present, pte will be NULL */
> -		pvmw->pte = huge_pte_offset(mm, pvmw->address, page_size(page));
> +		pvmw->pte = huge_pte_offset(mm, pvmw->address,
> +					    page_size(pvmw->page));

AFAICS, it exactly fits into 80-column.

>  		if (!pvmw->pte)
>  			return false;
>  
> -		pvmw->ptl = huge_pte_lockptr(page_hstate(page), mm, pvmw->pte);
> +		pvmw->ptl = huge_pte_lockptr(page_hstate(pvmw->page),
> +					     mm, pvmw->pte);

And this one end on 79.

Hm?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 6/7] mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
  2021-06-01 21:15   ` Hugh Dickins
  (?)
@ 2021-06-04 16:39   ` Kirill A. Shutemov
  2021-06-04 23:07       ` Hugh Dickins
  -1 siblings, 1 reply; 70+ messages in thread
From: Kirill A. Shutemov @ 2021-06-04 16:39 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	linux-mm, linux-kernel

On Tue, Jun 01, 2021 at 02:15:35PM -0700, Hugh Dickins wrote:
> There is a race between THP unmapping and truncation, when truncate sees
> pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared it,
> but before its page_remove_rmap() gets to decrement compound_mapcount:
> generating false "BUG: Bad page cache" reports that the page is still
> mapped when deleted.  This commit fixes that, but not in the way I hoped.
> 
> The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
> instead of unmap_mapping_range() in truncate_cleanup_page(): it has often
> been an annoyance that we usually call unmap_mapping_range() with no pages
> locked, but there apply it to a single locked page.  try_to_unmap() looks
> more suitable for a single locked page.
> 
> However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
> it is used to insert THP migration entries, but not used to unmap THPs.
> Copy zap_huge_pmd() and add THP handling now?  Perhaps, but their TLB
> needs are different, I'm too ignorant of the DAX cases, and couldn't
> decide how far to go for anon+swap.  Set that aside.
> 
> The second attempt took a different tack: make no change in truncate.c,
> but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
> clearing it initially, then pmd_clear() between page_remove_rmap() and
> unlocking at the end.  Nice.  But powerpc blows that approach out of the
> water, with its serialize_against_pte_lookup(), and interesting pgtable
> usage.  It would need serious help to get working on powerpc (with a
> minor optimization issue on s390 too).  Set that aside.
> 
> Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
> delay, after unmapping in truncate_cleanup_page()?  Perhaps, but though
> that's likely to reduce or eliminate the number of incidents, it would
> give less assurance of whether we had identified the problem correctly.
> 
> This successful iteration introduces "unmap_mapping_page(page)" instead
> of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
> with an addition to details.  Then zap_pmd_range() watches for this case,
> and does spin_unlock(pmd_lock) if so - just like page_vma_mapped_walk()
> now does in the PVMW_SYNC case.  Not pretty, but safe.
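
The zap_pmd_range() addition referred to here is roughly the following -
my paraphrase from memory (in particular the details->single_page field
name should be checked against the actual patch), so only a sketch:

	} else if (details && details->single_page &&
		   PageTransCompound(details->single_page) &&
		   next - addr == HPAGE_PMD_SIZE && pmd_none(*pmd)) {
		spinlock_t *ptl = pmd_lock(tlb->mm, pmd);
		/*
		 * Take and drop the THP pmd lock so that we cannot
		 * return prematurely, while zap_huge_pmd() has cleared
		 * *pmd but not yet decremented compound_mapcount().
		 */
		spin_unlock(ptl);
	}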
> 
> Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
> assert its interface; but currently that's only used to make sure that
> page->mapping is stable, and zap_pmd_range() doesn't care if the page is
> locked or not.  Along these lines, in invalidate_inode_pages2_range()
> move the initial unmap_mapping_range() out from under page lock, before
> then calling unmap_mapping_page() under page lock if still mapped.
> 
> Fixes: fc127da085c2 ("truncate: handle file thp")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>

I think adding support for THP in try_to_unmap_one() is the most
future-proof way. We would need it there eventually.

But this works too:

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 7/7] mm/thp: remap_page() is only needed on anonymous THP
  2021-06-01 21:17   ` Hugh Dickins
  (?)
  (?)
@ 2021-06-04 16:41   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 70+ messages in thread
From: Kirill A. Shutemov @ 2021-06-04 16:41 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	linux-mm, linux-kernel

On Tue, Jun 01, 2021 at 02:17:19PM -0700, Hugh Dickins wrote:
> THP splitting's unmap_page() only sets TTU_SPLIT_FREEZE when PageAnon,
> and migration entries are only inserted when TTU_MIGRATION (unused here)
> or TTU_SPLIT_FREEZE is set: so it's just a waste of time for remap_page()
> to search for migration entries to remove when !PageAnon.
> 
> Fixes: baa355fd3314 ("thp: file pages support for split_huge_page()")
> Signed-off-by: Hugh Dickins <hughd@google.com>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v2 3/7] mm/thp: fix vma_address() if virtual address below file offset
  2021-06-03 21:40         ` Hugh Dickins
  (?)
  (?)
@ 2021-06-04 17:36         ` Matthew Wilcox
  2021-06-04 22:35             ` Hugh Dickins
  -1 siblings, 1 reply; 70+ messages in thread
From: Matthew Wilcox @ 2021-06-04 17:36 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Yang Shi, Wang Yugui,
	Naoya Horiguchi, Alistair Popple, Ralph Campbell, Zi Yan,
	Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara, linux-mm,
	linux-kernel

On Thu, Jun 03, 2021 at 02:40:30PM -0700, Hugh Dickins wrote:
>  static inline unsigned long
> -__vma_address(struct page *page, struct vm_area_struct *vma)
> +vma_address(struct page *page, struct vm_area_struct *vma)
>  {
> -	pgoff_t pgoff = page_to_pgoff(page);
> -	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> +	pgoff_t pgoff;
> +	unsigned long address;
> +
> +	VM_BUG_ON_PAGE(PageKsm(page), page);	/* KSM page->index unusable */
> +	pgoff = page_to_pgoff(page);
> +	if (pgoff >= vma->vm_pgoff) {
> +		address = vma->vm_start +
> +			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> +		/* Check for address beyond vma (or wrapped through 0?) */
> +		if (address < vma->vm_start || address >= vma->vm_end)
> +			address = -EFAULT;
> +	} else if (PageHead(page) &&
> +		   pgoff + compound_nr(page) > vma->vm_pgoff) {

I think on 32-bit, you need ...

		   pgoff + compound_nr(page) - 1 >= vma->vm_pgoff

... right?

> +		address = vma->vm_start;
> +	} else {
> +		address = -EFAULT;
> +	}
> +	return address;
>  }

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 5/7] mm/thp: fix page_vma_mapped_walk() if huge page mapped by ptes
  2021-06-04 16:24   ` Kirill A. Shutemov
@ 2021-06-04 17:42     ` Matthew Wilcox
  2021-06-04 22:56       ` Hugh Dickins
  1 sibling, 0 replies; 70+ messages in thread
From: Matthew Wilcox @ 2021-06-04 17:42 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrew Morton, Kirill A. Shutemov, Yang Shi,
	Wang Yugui, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	linux-mm, linux-kernel

On Fri, Jun 04, 2021 at 07:24:02PM +0300, Kirill A. Shutemov wrote:
> >  		/* when pud is not present, pte will be NULL */
> > -		pvmw->pte = huge_pte_offset(mm, pvmw->address, page_size(page));
> > +		pvmw->pte = huge_pte_offset(mm, pvmw->address,
> > +					    page_size(pvmw->page));
> 
> AFAICS, it exactly fits into 80-column.

.. not after putting the 'pvmw->' before 'page'.

> >  		if (!pvmw->pte)
> >  			return false;
> >  
> > -		pvmw->ptl = huge_pte_lockptr(page_hstate(page), mm, pvmw->pte);
> > +		pvmw->ptl = huge_pte_lockptr(page_hstate(pvmw->page),
> > +					     mm, pvmw->pte);
> 
> And this one end on 79.

likewise.  I mean, the 'mm' could go on the previous line, but that's
beyond my level of code format caring.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 1/7] mm/thp: fix __split_huge_pmd_locked() on shmem migration entry
  2021-06-04  2:22       ` Hugh Dickins
@ 2021-06-04 18:03         ` Yang Shi
  -1 siblings, 0 replies; 70+ messages in thread
From: Yang Shi @ 2021-06-04 18:03 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Wang Yugui, Matthew Wilcox,
	Naoya Horiguchi, Alistair Popple, Ralph Campbell, Zi Yan,
	Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara, Linux MM,
	Linux Kernel Mailing List

On Thu, Jun 3, 2021 at 7:23 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Thu, 3 Jun 2021, Yang Shi wrote:
> > On Tue, Jun 1, 2021 at 2:05 PM Hugh Dickins <hughd@google.com> wrote:
> > >
> > > Are there more places that need to be careful about pmd migration entries?
> > > None hit in practice, but several of those is_huge_zero_pmd() tests were
> > > done without checking pmd_present() first: I believe a pmd migration entry
> > > could end up satisfying that test.  Ah, the inversion of swap offset, to
> > > protect against L1TF, makes that impossible on x86; but other arches need
> > > the pmd_present() check, and even x86 ought not to apply pmd_page() to a
> > > swap-like pmd.  Fix those instances; __split_huge_pmd_locked() was not
> > > wrong to be checking with pmd_trans_huge() instead, but I think it's
> > > clearer to use pmd_present() in each instance.
> ...
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 63ed6b25deaa..9fb7b47da87e 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -1676,7 +1676,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > >                 spin_unlock(ptl);
> > >                 if (is_huge_zero_pmd(orig_pmd))
> > >                         tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
> > > -       } else if (is_huge_zero_pmd(orig_pmd)) {
> > > +       } else if (pmd_present(orig_pmd) && is_huge_zero_pmd(orig_pmd)) {
> >
> > If it is a huge zero migration entry, the code would fallback to the
> > "else". But IIUC the "else" case doesn't handle the huge zero page
> > correctly. It may mess up the rss counter.
>
> A huge zero migration entry?  I hope that's not something special
> that I've missed.
>
> Do we ever migrate a huge zero page - and how do we find where it's
> mapped, to insert the migration entries?  But if we do, I thought it
> would use the usual kind of pmd migration entry; and the first check
> in is_pmd_migration_entry() is !pmd_present(pmd).

I overlooked whether the huge zero page is migratable when I wrote
that comment; I was just focused on the if/else if/else conditions.

From a quick look, I don't think the huge zero page is migratable, since:
* mempolicy and NUMA hinting skip the huge zero pmd
* other migration call sites only try to migrate LRU pages

>
> (I have to be rather careful to check such details, after getting
> burnt once by pmd_present(): which includes the "huge" bit even when
> not otherwise present, to permit races with pmdp_invalidate().
> I mentioned in private mail that I'd dropped one of my "fixes" because
> it was harmless but mistaken: I had misunderstood pmd_present().)
>
> The point here (see commit message above) is that some unrelated pmd
> migration entry could pass the is_huge_zero_pmd() test, which rushes
> off to use pmd_page() without even checking pmd_present() first.  And
> most of its users have, one way or another, checked pmd_present() first;
> but this place and a couple of others had not.

Thanks for the elaboration. I wonder whether we'd better add a
comment in the code? Someone may submit a "fix" patch by visual
inspection in the future, having missed these points.

>
> I'm just verifying that it's really a huge zero pmd before handling
> its case; the "else" still does not need to handle the huge zero page.
>
> Hugh
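
For reference, the is_pmd_migration_entry() check mentioned in the
quoted text above is, in v5.13, essentially the following (a sketch -
see include/linux/swapops.h for the exact text; it assumes
CONFIG_ARCH_ENABLE_THP_MIGRATION):

static inline int is_pmd_migration_entry(pmd_t pmd)
{
	return !pmd_present(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
}

So a pmd migration entry is by definition !pmd_present(), which is why
checking pmd_present() before is_huge_zero_pmd()/pmd_page() filters
such entries out.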

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 2/7] mm/thp: try_to_unmap() use TTU_SYNC for safe DEBUG_VM splitting
  2021-06-04  2:45       ` Hugh Dickins
@ 2021-06-04 18:24         ` Yang Shi
  -1 siblings, 0 replies; 70+ messages in thread
From: Yang Shi @ 2021-06-04 18:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Kirill A. Shutemov, Wang Yugui, Matthew Wilcox,
	Naoya Horiguchi, Alistair Popple, Ralph Campbell, Zi Yan,
	Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara, Linux MM,
	Linux Kernel Mailing List

On Thu, Jun 3, 2021 at 7:45 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Thu, 3 Jun 2021, Yang Shi wrote:
> > On Tue, Jun 1, 2021 at 2:07 PM Hugh Dickins <hughd@google.com> wrote:
> > >
> > > Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that
> > > follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding TTU_SYNC
> > > to the options, and passing that from unmap_page() when CONFIG_DEBUG_VM=y.
> > > It could be passed in the non-debug case too, but that would sometimes add
> > > a little overhead, whereas it's rare for this race to result in failure.
> >
> > The above statement makes me feel this patch is just to relieve the
> > VM_BUG_ON; but my patch already changed it to VM_WARN, the race sounds
> > acceptable (at least not fatal), and the splitting code can handle the
> > failure case as well. So I'm wondering whether we still need this patch
> > if it is only used to close the race when CONFIG_DEBUG_VM=y.
>
> I do agree that your 1/2 (what I'm calling 6.1/7) BUG->WARN patch
> is the most important of them; but it didn't get marked for stable,
> and has got placed behind conflicting mods never intended for stable.

Sorry for not marking it for stable. I have no objection at all to
having it in stable; I just forgot to cc stable. :-(

>
> And a lot of the descriptions had been written in terms of the prior
> situation, with VM BUG there: it was easier to keep describing that way.

Understood.

>
> Whether your fix makes mine redundant is arguable (Wang Yugui thinks
> not).  It's easier to argue that it makes the racy ones (like this)
> redundant, than the persistent ones (like vma_address or pvm_walk).

The point is that if we just close the race for the DEBUG_VM case, it
doesn't seem worth it to me. How about doing it for !DEBUG_VM anyway?
How big could the overhead be? It looks like the heavy lifting (TLB
shootdown and page freeing) is done after the pmd lock is released, so
it should not block pvmw for long.
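
Concretely, the choice being weighed is roughly between gating the new
flag on CONFIG_DEBUG_VM at the unmap_page() call site and passing it
unconditionally; a sketch only, where ttu_flags stands for whatever
local flags variable the posted patch uses:

	/* debug-only: pay for the extra locking only when CONFIG_DEBUG_VM=y */
	if (IS_ENABLED(CONFIG_DEBUG_VM))
		ttu_flags |= TTU_SYNC;

versus simply:

	ttu_flags |= TTU_SYNC;	/* always synchronize, accepting some overhead */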

>
> Since I know of at least one customer who wants all these fixes in 5.10
> longterm, I'm fighting to get them that far at least.  But the further
> back they go, the less effort I'll make to backport them - will fall
> back to porting your BUG->WARN only.
>
> Hugh

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 1/7] mm/thp: fix __split_huge_pmd_locked() on shmem migration entry
  2021-06-04 15:34   ` Kirill A. Shutemov
@ 2021-06-04 21:29       ` Hugh Dickins
  0 siblings, 0 replies; 70+ messages in thread
From: Hugh Dickins @ 2021-06-04 21:29 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrew Morton, Kirill A. Shutemov, Yang Shi,
	Wang Yugui, Matthew Wilcox, Naoya Horiguchi, Alistair Popple,
	Ralph Campbell, Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang,
	Peter Xu, Jan Kara, linux-mm, linux-kernel

On Fri, 4 Jun 2021, Kirill A. Shutemov wrote:
> On Tue, Jun 01, 2021 at 02:05:45PM -0700, Hugh Dickins wrote:
> > Stressing huge tmpfs page migration racing hole punch often crashed on the
> > VM_BUG_ON(!pmd_present) in pmdp_huge_clear_flush(), with DEBUG_VM=y kernel;
> > or shortly afterwards, on a bad dereference in __split_huge_pmd_locked()
> > when DEBUG_VM=n.  They forgot to allow for pmd migration entries in the
> > non-anonymous case.
> > 
> > Full disclosure: those particular experiments were on a kernel with more
> > relaxed mmap_lock and i_mmap_rwsem locking, and were not repeated on the
> > vanilla kernel: it is conceivable that stricter locking happens to avoid
> > those cases, or makes them less likely; but __split_huge_pmd_locked()
> > already allowed for pmd migration entries when handling anonymous THPs,
> > so this commit brings the shmem and file THP handling into line.
> > 
> > Are there more places that need to be careful about pmd migration entries?
> > None hit in practice, but several of those is_huge_zero_pmd() tests were
> > done without checking pmd_present() first: I believe a pmd migration entry
> > could end up satisfying that test.  Ah, the inversion of swap offset, to
> > protect against L1TF, makes that impossible on x86; but other arches need
> > the pmd_present() check, and even x86 ought not to apply pmd_page() to a
> > swap-like pmd.  Fix those instances; __split_huge_pmd_locked() was not
> > wrong to be checking with pmd_trans_huge() instead, but I think it's
> > clearer to use pmd_present() in each instance.
> > 
> > And while there: make it clearer to the eye that the !vma_is_anonymous()
> > and is_huge_zero_pmd() blocks make early returns (and don't return void).
> > 
> > Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > Cc: <stable@vger.kernel.org>
> 
> Looks like two fixes got squashed into one patch. The zero-page fix and
> the migration entries in __split_huge_pmd_locked() deserve separate
> patches, no?

Okay, I'll divide in two (and probably lose the "don't return void"
cleanup; but still keep the clearer separation of those two blocks).

> 
> Maybe add VM_BUG_ON(!pmd_present()) in is_huge_zero_pmd()?

Certainly not as part of any patch I'm aiming at stable!  But
I've remembered another approach, I'll say in response to Yang.
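
For concreteness, Kirill's suggestion would amount to something like the
sketch below, assuming the v5.13 definition of the helper (a sketch, not
a posted patch):

static inline bool is_huge_zero_pmd(pmd_t pmd)
{
	VM_BUG_ON(!pmd_present(pmd));
	return is_huge_zero_page(pmd_page(pmd));
}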

> 
> Also I wonder how much code we could remove if we did not establish
> migration ptes for file pages. We could make these page table entries
> 'none' on migration.

I'm not sure how far you're wondering to go with that (just in THP
case, or file ptes generally?). But you may recall that I disagree,
especially on mlocked vmas, where we break the contract by not using
migration entries.  Anyway, not something to get into here.

Thanks a lot for all your reviews, I'll get on with it.

Hugh

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 1/7] mm/thp: fix __split_huge_pmd_locked() on shmem migration entry
  2021-06-04 18:03         ` Yang Shi
@ 2021-06-04 21:52           ` Hugh Dickins
  -1 siblings, 0 replies; 70+ messages in thread
From: Hugh Dickins @ 2021-06-04 21:52 UTC (permalink / raw)
  To: Yang Shi
  Cc: Hugh Dickins, Andrew Morton, Kirill A. Shutemov, Wang Yugui,
	Matthew Wilcox, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	Linux MM, Linux Kernel Mailing List

On Fri, 4 Jun 2021, Yang Shi wrote:
> On Thu, Jun 3, 2021 at 7:23 PM Hugh Dickins <hughd@google.com> wrote:
> > On Thu, 3 Jun 2021, Yang Shi wrote:
> > > On Tue, Jun 1, 2021 at 2:05 PM Hugh Dickins <hughd@google.com> wrote:
> >
> > The point here (see commit message above) is that some unrelated pmd
> > migration entry could pass the is_huge_zero_pmd() test, which rushes
> > off to use pmd_page() without even checking pmd_present() first.  And
> > most of its users have, one way or another, checked pmd_present() first;
> > but this place and a couple of others had not.
> 
> Thanks for the elaboration. I wonder whether we'd better add a
> comment in the code? Someone may submit a "fix" patch by visual
> inspection in the future, having missed these points.

I don't really want to add a comment on this, there in zap_huge_pmd():
I think it would be too much of a distraction from that dense code
sequence.  And the comment will be more obvious in the commit message,
once I split these is_huge_zero_pmd() fixes off from
__split_huge_pmd_locked() as Kirill asked.

But... now I think I'll scrap these parts of the patch, and instead
just add a pmd_present() check into is_huge_zero_pmd() itself.
pmd_present() is quick, but pmd_page() may not be: I may convert it
to use a __read_only huge_pmd_pfn, or may not: I'll see how that goes.
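
As a sketch of that direction (the name and the __read_mostly annotation
are placeholders here, not a final patch): cache the huge zero pfn once,
and let is_huge_zero_pmd() check pmd_present() itself, so that
pmd_page()/pmd_pfn() is never applied to a swap-like pmd:

static unsigned long huge_pmd_pfn __read_mostly = ~0UL;

static inline bool is_huge_zero_pmd(pmd_t pmd)
{
	return pmd_present(pmd) && pmd_pfn(pmd) == READ_ONCE(huge_pmd_pfn);
}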

Hugh

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 2/7] mm/thp: try_to_unmap() use TTU_SYNC for safe DEBUG_VM splitting
  2021-06-04 14:48       ` Peter Xu
@ 2021-06-04 22:26           ` Hugh Dickins
  0 siblings, 0 replies; 70+ messages in thread
From: Hugh Dickins @ 2021-06-04 22:26 UTC (permalink / raw)
  To: Peter Xu
  Cc: Hugh Dickins, Andrew Morton, Kirill A. Shutemov, Yang Shi,
	Wang Yugui, Matthew Wilcox, Naoya Horiguchi, Alistair Popple,
	Ralph Campbell, Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang,
	Jan Kara, linux-mm, linux-kernel

On Fri, 4 Jun 2021, Peter Xu wrote:
> On Thu, Jun 03, 2021 at 07:54:11PM -0700, Hugh Dickins wrote:
> > On Thu, 3 Jun 2021, Peter Xu wrote:
> > > On Tue, Jun 01, 2021 at 02:07:53PM -0700, Hugh Dickins wrote:
> > > > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > > > index 2cf01d933f13..b45d22738b45 100644
> > > > --- a/mm/page_vma_mapped.c
> > > > +++ b/mm/page_vma_mapped.c
> > > > @@ -212,6 +212,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > > >  			pvmw->ptl = NULL;
> > > >  		}
> > > >  	} else if (!pmd_present(pmde)) {
> > > > +		/*
> > > > +		 * If PVMW_SYNC, take and drop THP pmd lock so that we
> > > > +		 * cannot return prematurely, while zap_huge_pmd() has
> > > > +		 * cleared *pmd but not decremented compound_mapcount().
> > > > +		 */
> > > > +		if ((pvmw->flags & PVMW_SYNC) &&
> > > > +		    PageTransCompound(pvmw->page))
> > > > +			spin_unlock(pmd_lock(mm, pvmw->pmd));
> > > >  		return false;
> > > >  	}
> > > >  	if (!map_pte(pvmw))
> > > 
> > > Sorry if I missed something important, but I'm totally confused about how
> > > this unlock pairs with another lock()..
> > 
> > I imagine you're reading that as spin_unlock(pmd_lockptr(blah));
> > no, the lock is right there, inside spin_unlock(pmd_lock(blah)).
> 
> Heh, yeah... Sorry about that.

I'll expand that line, as Kirill asks too.
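
Expanded, that line would become something like the sketch below (the
local variable is illustrative):

	} else if (!pmd_present(pmde)) {
		/*
		 * If PVMW_SYNC, take and drop THP pmd lock so that we
		 * cannot return prematurely, while zap_huge_pmd() has
		 * cleared *pmd but not decremented compound_mapcount().
		 */
		if ((pvmw->flags & PVMW_SYNC) &&
		    PageTransCompound(pvmw->page)) {
			spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);

			spin_unlock(ptl);
		}
		return false;
	}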

> 
> > 
> > > 
> > > And.. isn't PVMW_SYNC only meaningful at pte level (as I didn't see a
> > > reference to it outside map_pte)?
> > 
> > But you are pointing directly to its reference outside map_pte()!
> 
> Right, I was trying to look for the lock() so I needed to look at all the rest
> besides this one. :)
> 
> I didn't follow Yang's patch, but if Yang's patch can keep the kernel from
> crashing and fault handling working well, then I kind of agree with him:
> having workaround code (like taking a lock and quickly releasing it..) only
> for debug code seems like overkill to me, not to mention that the debug code
> will be even more strict after this patch, as it means it's even less likely
> that one can reproduce a production host race with DEBUG_VM..  Normally
> debugging code affects code execution already, and with this one we're
> enlarging that gap "explicitly" - not sure whether that's good.
> 
> This also makes me curious: what if we make !DEBUG_VM strict too - how much
> perf would we lose?  I haven't even tried to think it through in detail,
> just raising it. Say, is there any chance we could make the debug/non-debug
> paths run the same logic (e.g. the SYNC version)?

And Yang Shi suggests the same.

Yes, I'm not fond of doing that differently for DEBUG_VM or not;
but could never quite decide which way to jump.

For so long as we worry about whether split_huge_page() is working
correctly (and Wang Yugui still has a case that we have not solved),
we do want the warning; and for so long as we have the warning, we
do need the TTU_SYNC to prevent showing the warning unnecessarily.

How much overhead added by doing TTU_SYNC now on !DEBUG_VM?  On any
sensible anon THP case, I don't think it could add overhead at all.
But in some shmem cases (multiply mapped but sparsely populated,
populated differently by different tasks) it could add overhead:
dirtying lock cachelines in tasks which don't have the page mapped.

But we're only talking about huge page splitting, that should not
be #1 on anyone's performance list; and has all sorts of other
overheads of its own.  I think I'll go with your preference, and
make this TTU_SYNC for all.  We can easily revert to DEBUG_VM only
if some regression is noticed.

Hugh

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v2 3/7] mm/thp: fix vma_address() if virtual address below file offset
  2021-06-04 17:36         ` Matthew Wilcox
@ 2021-06-04 22:35             ` Hugh Dickins
  0 siblings, 0 replies; 70+ messages in thread
From: Hugh Dickins @ 2021-06-04 22:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Andrew Morton, Kirill A. Shutemov, Yang Shi,
	Wang Yugui, Naoya Horiguchi, Alistair Popple, Ralph Campbell,
	Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang, Peter Xu, Jan Kara,
	linux-mm, linux-kernel

On Fri, 4 Jun 2021, Matthew Wilcox wrote:
> On Thu, Jun 03, 2021 at 02:40:30PM -0700, Hugh Dickins wrote:
> >  static inline unsigned long
> > -__vma_address(struct page *page, struct vm_area_struct *vma)
> > +vma_address(struct page *page, struct vm_area_struct *vma)
> >  {
> > -	pgoff_t pgoff = page_to_pgoff(page);
> > -	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > +	pgoff_t pgoff;
> > +	unsigned long address;
> > +
> > +	VM_BUG_ON_PAGE(PageKsm(page), page);	/* KSM page->index unusable */
> > +	pgoff = page_to_pgoff(page);
> > +	if (pgoff >= vma->vm_pgoff) {
> > +		address = vma->vm_start +
> > +			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > +		/* Check for address beyond vma (or wrapped through 0?) */
> > +		if (address < vma->vm_start || address >= vma->vm_end)
> > +			address = -EFAULT;
> > +	} else if (PageHead(page) &&
> > +		   pgoff + compound_nr(page) > vma->vm_pgoff) {
> 
> I think on 32-bit, you need ...
> 
> 		   pgoff + compound_nr(page) - 1 >= vma->vm_pgoff
> 
> ... right?

Hey, beating me at my own game ;-) I'm pretty sure you're right (and
it's true that I first wrote this patch before becoming conscious of
the 32-bit MAX_LFS_FILESIZE issue); but caution tells me to think
some more and check some places before committing to that.
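
For anyone following along, the wrap being guarded against can be shown
with made-up numbers, assuming a 32-bit pgoff_t and a 512-page PMD-sized
THP: with pgoff = 0xfffffe00, pgoff + compound_nr(page) wraps to 0, so
the ">" test is wrongly false, while pgoff + compound_nr(page) - 1 is
0xffffffff and the ">=" test still behaves. A hypothetical helper just
to make the arithmetic explicit:

static bool thp_head_overlaps_vma(unsigned long pgoff, unsigned long nr,
				  unsigned long vm_pgoff)
{
	/*
	 * nr is compound_nr(page); on 32-bit, pgoff + nr can wrap past 0,
	 * whereas for a naturally aligned THP head pgoff + nr - 1 stays
	 * representable, so comparing ">=" against vm_pgoff is safer.
	 */
	return pgoff + nr - 1 >= vm_pgoff;
}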

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 5/7] mm/thp: fix page_vma_mapped_walk() if huge page mapped by ptes
  2021-06-04 16:24   ` Kirill A. Shutemov
@ 2021-06-04 22:56       ` Hugh Dickins
  2021-06-04 22:56       ` Hugh Dickins
  1 sibling, 0 replies; 70+ messages in thread
From: Hugh Dickins @ 2021-06-04 22:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrew Morton, Kirill A. Shutemov, Yang Shi,
	Wang Yugui, Matthew Wilcox, Naoya Horiguchi, Alistair Popple,
	Ralph Campbell, Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang,
	Peter Xu, Jan Kara, linux-mm, linux-kernel

On Fri, 4 Jun 2021, Kirill A. Shutemov wrote:
> On Tue, Jun 01, 2021 at 02:13:21PM -0700, Hugh Dickins wrote:
> > 
> > Merged various other minor fixes and cleanups into page_vma_mapped_walk()
> > as I worked on it: which I find much easier to enumerate here than to
> > prise apart into separate commits.
> 
> But it makes it harder to review...
> 
> I tried to review it and superficially it looks good, but it has to be
> split into a bunch of patches.

I had hoped to avoid that.  Okay, I'll try bisecting it into a
preliminary cleanup patch then the bugfix patch, and see if that
looks reviewable; divide it up further if not.  But my idea of
what's reviewable may differ.

Hugh

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH 6/7] mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
  2021-06-04 16:39   ` Kirill A. Shutemov
@ 2021-06-04 23:07       ` Hugh Dickins
  0 siblings, 0 replies; 70+ messages in thread
From: Hugh Dickins @ 2021-06-04 23:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrew Morton, Kirill A. Shutemov, Yang Shi,
	Wang Yugui, Matthew Wilcox, Naoya Horiguchi, Alistair Popple,
	Ralph Campbell, Zi Yan, Miaohe Lin, Minchan Kim, Jue Wang,
	Peter Xu, Jan Kara, linux-mm, linux-kernel

On Fri, 4 Jun 2021, Kirill A. Shutemov wrote:
> 
> I think adding support for THP in try_to_unmap_one() is the most
> future-proof way. We would need it there eventually.

Yes, I'd love to delete unmap_mapping_page() and use try_to_unmap(),
but couldn't easily do it.  I did cut out one paragraph which said:

"Next iteration, I would probably refactor zap_huge_pmd() so that the
same code could be used from zap_pmd_range() or try_to_unmap_one();
but that seems riskier and more trouble than this current iteration."

And I've certainly not found time to go back and try that since.

> 
> But this works too:
> 
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 70+ messages in thread

end of thread, other threads:[~2021-06-05  7:25 UTC | newest]

Thread overview: 70+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-01 21:03 [PATCH 0/7] mm/thp: fix THP splitting unmap BUGs and related Hugh Dickins
2021-06-01 21:03 ` Hugh Dickins
2021-06-01 21:05 ` [PATCH 1/7] mm/thp: fix __split_huge_pmd_locked() on shmem migration entry Hugh Dickins
2021-06-01 21:05   ` Hugh Dickins
2021-06-03 21:26   ` Yang Shi
2021-06-03 21:26     ` Yang Shi
2021-06-04  2:22     ` Hugh Dickins
2021-06-04  2:22       ` Hugh Dickins
2021-06-04 18:03       ` Yang Shi
2021-06-04 18:03         ` Yang Shi
2021-06-04 21:52         ` Hugh Dickins
2021-06-04 21:52           ` Hugh Dickins
2021-06-04 15:34   ` Kirill A. Shutemov
2021-06-04 21:29     ` Hugh Dickins
2021-06-04 21:29       ` Hugh Dickins
2021-06-01 21:07 ` [PATCH 2/7] mm/thp: try_to_unmap() use TTU_SYNC for safe DEBUG_VM splitting Hugh Dickins
2021-06-01 21:07   ` Hugh Dickins
2021-06-02  1:59   ` Alistair Popple
2021-06-03 21:45   ` Yang Shi
2021-06-03 21:45     ` Yang Shi
2021-06-04  2:45     ` Hugh Dickins
2021-06-04  2:45       ` Hugh Dickins
2021-06-04 18:24       ` Yang Shi
2021-06-04 18:24         ` Yang Shi
2021-06-03 21:48   ` Peter Xu
2021-06-04  2:54     ` Hugh Dickins
2021-06-04  2:54       ` Hugh Dickins
2021-06-04 14:48       ` Peter Xu
2021-06-04 22:26         ` Hugh Dickins
2021-06-04 22:26           ` Hugh Dickins
2021-06-04 15:47       ` Kirill A. Shutemov
2021-06-01 21:09 ` [PATCH 3/7] mm/thp: fix vma_address() if virtual address below file offset Hugh Dickins
2021-06-01 21:09   ` Hugh Dickins
2021-06-01 21:30   ` Matthew Wilcox
2021-06-03 21:36     ` Hugh Dickins
2021-06-03 21:36       ` Hugh Dickins
2021-06-03 21:40       ` [PATCH v2 " Hugh Dickins
2021-06-03 21:40         ` Hugh Dickins
2021-06-04 15:53         ` Kirill A. Shutemov
2021-06-04 17:36         ` Matthew Wilcox
2021-06-04 22:35           ` Hugh Dickins
2021-06-04 22:35             ` Hugh Dickins
2021-06-01 21:11 ` [PATCH 4/7] mm/thp: fix page_address_in_vma() on file THP tails Hugh Dickins
2021-06-01 21:11   ` Hugh Dickins
2021-06-01 21:32   ` Matthew Wilcox
2021-06-03 22:06   ` Yang Shi
2021-06-03 22:06     ` Yang Shi
2021-06-04 15:54   ` Kirill A. Shutemov
2021-06-01 21:13 ` [PATCH 5/7] mm/thp: fix page_vma_mapped_walk() if huge page mapped by ptes Hugh Dickins
2021-06-01 21:13   ` Hugh Dickins
2021-06-04 16:24   ` Kirill A. Shutemov
2021-06-04 17:42     ` Matthew Wilcox
2021-06-04 22:56     ` Hugh Dickins
2021-06-04 22:56       ` Hugh Dickins
2021-06-01 21:15 ` [PATCH 6/7] mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page() Hugh Dickins
2021-06-01 21:15   ` Hugh Dickins
2021-06-04 16:39   ` Kirill A. Shutemov
2021-06-04 23:07     ` Hugh Dickins
2021-06-04 23:07       ` Hugh Dickins
2021-06-01 21:17 ` [PATCH 7/7] mm/thp: remap_page() is only needed on anonymous THP Hugh Dickins
2021-06-01 21:17   ` Hugh Dickins
2021-06-03 22:09   ` Yang Shi
2021-06-03 22:09     ` Yang Shi
2021-06-04 16:41   ` Kirill A. Shutemov
2021-06-02  2:07 ` [PATCH 0/7] mm/thp: fix THP splitting unmap BUGs and related Alistair Popple
2021-06-03 22:21 ` Hugh Dickins
2021-06-03 22:21   ` Hugh Dickins
2021-06-03 23:03   ` Andrew Morton
2021-06-03 22:26 ` [PATCH 6.1/7] mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split Hugh Dickins
2021-06-03 22:26   ` Hugh Dickins

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.