linux-kernel.vger.kernel.org archive mirror
* [PATCH v10 0/3] fix hugetlb MADV_DONTNEED vma_lock handling
@ 2022-11-14 23:55 Mike Kravetz
  2022-11-14 23:55 ` [PATCH v10 1/3] madvise: use zap_page_range_single for madvise dontneed Mike Kravetz
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Mike Kravetz @ 2022-11-14 23:55 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Naoya Horiguchi, David Hildenbrand, Axel Rasmussen, Mina Almasry,
	Peter Xu, Nadav Amit, Rik van Riel, Vlastimil Babka,
	Matthew Wilcox, Andrew Morton, Mike Kravetz

This series addresses the issue first reported in [1], and fully
described in patch 2.  Patches 1 and 2 address the user-visible issue
and are tagged for stable backports.

While exploring solutions to this issue, related problems with mmu
notification calls were discovered.  These are addressed in patch 3.
Since there are no user-visible effects, patch 3 is not tagged for
stable backports.

Previous discussions suggested further cleanup by removing the
routine zap_page_range.  This is possible because zap_page_range_single
is now exported, and all callers of zap_page_range pass ranges entirely
within a single vma.  This work will be done in a later patch so as not
to distract from this bug fix.
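
For illustration, a hedged sketch of what such a conversion could look
like for a hypothetical caller whose range is already known to lie within
one vma (the function name is made up; this is not part of the series):

/* Hypothetical caller: range already known to be within a single vma. */
static void example_discard_range(struct vm_area_struct *vma,
				  unsigned long start, unsigned long end)
{
	/* before: zap_page_range(vma, start, end - start); */
	zap_page_range_single(vma, start, end - start, NULL);
}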

[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/

v9-v10 Rearrange series and do not tag "remove duplicate mmu notifications"
       patch for stable as suggested by David.

Mike Kravetz (3):
  madvise: use zap_page_range_single for madvise dontneed
  hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
  hugetlb: remove duplicate mmu notifications

 include/linux/mm.h | 29 +++++++++++++++++++++--------
 mm/hugetlb.c       | 45 +++++++++++++++++++++++++--------------------
 mm/madvise.c       |  6 +++---
 mm/memory.c        | 25 ++++++++++++-------------
 4 files changed, 61 insertions(+), 44 deletions(-)

-- 
2.38.1



* [PATCH v10 1/3] madvise: use zap_page_range_single for madvise dontneed
  2022-11-14 23:55 [PATCH v10 0/3] fix hugetlb MADV_DONTNEED vma_lock handling Mike Kravetz
@ 2022-11-14 23:55 ` Mike Kravetz
  2022-11-14 23:55 ` [PATCH v10 2/3] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing Mike Kravetz
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Mike Kravetz @ 2022-11-14 23:55 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Naoya Horiguchi, David Hildenbrand, Axel Rasmussen, Mina Almasry,
	Peter Xu, Nadav Amit, Rik van Riel, Vlastimil Babka,
	Matthew Wilcox, Andrew Morton, Mike Kravetz, Wei Chen, stable

Expose the routine zap_page_range_single to zap a range within a single
vma.  The madvise routine madvise_dontneed_single_vma can use this
routine as it explicitly operates on a single vma.  Also, update the mmu
notification range in zap_page_range_single to take hugetlb pmd sharing
into account.  This is required as MADV_DONTNEED supports hugetlb vmas.
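
Roughly, the effect of that adjustment is to widen the range reported to
mmu notifiers out to PUD_SIZE boundaries when the vma could be sharing pmd
pages; a simplified sketch of the idea follows (not the actual
adjust_range_if_pmd_sharing_possible implementation; vma_may_share_pmds is
a hypothetical helper):

/*
 * Simplified sketch: if the hugetlb vma may share pmd pages, the range
 * reported to mmu notifiers is rounded out to PUD_SIZE boundaries, even
 * though the range actually unmapped stays as requested.
 */
static void sketch_adjust_range(struct vm_area_struct *vma,
				unsigned long *start, unsigned long *end)
{
	if (!vma_may_share_pmds(vma))		/* hypothetical helper */
		return;
	*start = ALIGN_DOWN(*start, PUD_SIZE);
	*end = ALIGN(*end, PUD_SIZE);
}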

Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Wei Chen <harperchen1110@gmail.com>
Cc: <stable@vger.kernel.org>
---
 include/linux/mm.h | 27 +++++++++++++++++++--------
 mm/madvise.c       |  6 +++---
 mm/memory.c        | 23 +++++++++++------------
 3 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9838b535fa21..dd5a38682537 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1870,6 +1870,23 @@ static void __maybe_unused show_free_areas(unsigned int flags, nodemask_t *nodem
 	__show_free_areas(flags, nodemask, MAX_NR_ZONES - 1);
 }
 
+/*
+ * Parameter block passed down to zap_pte_range in exceptional cases.
+ */
+struct zap_details {
+	struct folio *single_folio;	/* Locked folio to be unmapped */
+	bool even_cows;			/* Zap COWed private pages too? */
+	zap_flags_t zap_flags;		/* Extra flags for zapping */
+};
+
+/*
+ * Whether to drop the pte markers, for example, the uffd-wp information for
+ * file-backed memory.  This should only be specified when we will completely
+ * drop the page in the mm, either by truncation or unmapping of the vma.  By
+ * default, the flag is not set.
+ */
+#define  ZAP_FLAG_DROP_MARKER        ((__force zap_flags_t) BIT(0))
+
 #ifdef CONFIG_MMU
 extern bool can_do_mlock(void);
 #else
@@ -1887,6 +1904,8 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 		  unsigned long size);
 void zap_page_range(struct vm_area_struct *vma, unsigned long address,
 		    unsigned long size);
+void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
+			   unsigned long size, struct zap_details *details);
 void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
 		struct vm_area_struct *start_vma, unsigned long start,
 		unsigned long end);
@@ -3518,12 +3537,4 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
 }
 #endif
 
-/*
- * Whether to drop the pte markers, for example, the uffd-wp information for
- * file-backed memory.  This should only be specified when we will completely
- * drop the page in the mm, either by truncation or unmapping of the vma.  By
- * default, the flag is not set.
- */
-#define  ZAP_FLAG_DROP_MARKER        ((__force zap_flags_t) BIT(0))
-
 #endif /* _LINUX_MM_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index df62d9e1035a..a21b186eb7a0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -785,8 +785,8 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
  * Application no longer needs these pages.  If the pages are dirty,
  * it's OK to just throw them away.  The app will be more careful about
  * data it wants to keep.  Be sure to free swap resources too.  The
- * zap_page_range call sets things up for shrink_active_list to actually free
- * these pages later if no one else has touched them in the meantime,
+ * zap_page_range_single call sets things up for shrink_active_list to actually
+ * free these pages later if no one else has touched them in the meantime,
  * although we could add these pages to a global reuse list for
  * shrink_active_list to pick up before reclaiming other pages.
  *
@@ -803,7 +803,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 					unsigned long start, unsigned long end)
 {
-	zap_page_range(vma, start, end - start);
+	zap_page_range_single(vma, start, end - start, NULL);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 98ddb91df9a7..a177f6bbfafc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1294,15 +1294,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 	return ret;
 }
 
-/*
- * Parameter block passed down to zap_pte_range in exceptional cases.
- */
-struct zap_details {
-	struct folio *single_folio;	/* Locked folio to be unmapped */
-	bool even_cows;			/* Zap COWed private pages too? */
-	zap_flags_t zap_flags;		/* Extra flags for zapping */
-};
-
 /* Whether we should zap all COWed (private) pages too */
 static inline bool should_zap_cows(struct zap_details *details)
 {
@@ -1736,19 +1727,27 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
  *
  * The range must fit into one VMA.
  */
-static void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
+void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size, struct zap_details *details)
 {
+	const unsigned long end = address + size;
 	struct mmu_notifier_range range;
 	struct mmu_gather tlb;
 
 	lru_add_drain();
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
-				address, address + size);
+				address, end);
+	if (is_vm_hugetlb_page(vma))
+		adjust_range_if_pmd_sharing_possible(vma, &range.start,
+						     &range.end);
 	tlb_gather_mmu(&tlb, vma->vm_mm);
 	update_hiwater_rss(vma->vm_mm);
 	mmu_notifier_invalidate_range_start(&range);
-	unmap_single_vma(&tlb, vma, address, range.end, details);
+	/*
+	 * unmap 'address-end' not 'range.start-range.end' as range
+	 * could have been expanded for hugetlb pmd sharing.
+	 */
+	unmap_single_vma(&tlb, vma, address, end, details);
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_finish_mmu(&tlb);
 }
-- 
2.38.1



* [PATCH v10 2/3] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
  2022-11-14 23:55 [PATCH v10 0/3] fix hugetlb MADV_DONTNEED vma_lock handling Mike Kravetz
  2022-11-14 23:55 ` [PATCH v10 1/3] madvise: use zap_page_range_single for madvise dontneed Mike Kravetz
@ 2022-11-14 23:55 ` Mike Kravetz
  2022-11-14 23:55 ` [PATCH v10 3/3] hugetlb: remove duplicate mmu notifications Mike Kravetz
  2022-11-23  2:07 ` [PATCH v10 0/3] fix hugetlb MADV_DONTNEED vma_lock handling Andrew Morton
  3 siblings, 0 replies; 6+ messages in thread
From: Mike Kravetz @ 2022-11-14 23:55 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Naoya Horiguchi, David Hildenbrand, Axel Rasmussen, Mina Almasry,
	Peter Xu, Nadav Amit, Rik van Riel, Vlastimil Babka,
	Matthew Wilcox, Andrew Morton, Mike Kravetz, Wei Chen, stable

madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
tables associated with the address range.  For hugetlb vmas,
zap_page_range will call __unmap_hugepage_range_final.  However,
__unmap_hugepage_range_final assumes the passed vma is about to be removed
and deletes the vma_lock to prevent pmd sharing as the vma is on the way
out.  In the case of madvise(MADV_DONTNEED) the vma remains, but the
missing vma_lock prevents pmd sharing and could potentially lead to issues
with truncation/fault races.
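
For context, a minimal userspace sketch of the call pattern in question;
it is illustrative only (it assumes reserved 2MB hugetlb pages and does
not by itself reproduce the race), but it shows the madvise path that
reaches __unmap_hugepage_range_final:

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL << 20;		/* one 2MB huge page (default size assumed) */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	p[0] = 1;			/* populate the huge page */
	madvise(p, len, MADV_DONTNEED);	/* zaps the range; the vma itself remains */
	p[0] = 1;			/* a later fault on the same, still-present vma */
	return 0;
}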

This issue was originally reported here [1] as a BUG triggered in
page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
prevent pmd sharing.  Subsequent faults on this vma were confused:
VM_MAYSHARE indicates a sharable vma, but since it was no longer set,
page_mapping was not set in new pages added to the page table.  This
resulted in pages that appeared anonymous in a VM_SHARED vma and triggered
the BUG.

Address the issue by adding a new zap flag ZAP_FLAG_UNMAP to indicate an
unmap call from unmap_vmas().  This is used to indicate the 'final'
unmapping of a hugetlb vma.  When called via MADV_DONTNEED, this flag is
not set and the vma_lock is not deleted.

[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Wei Chen <harperchen1110@gmail.com>
Cc: <stable@vger.kernel.org>
---
 include/linux/mm.h |  2 ++
 mm/hugetlb.c       | 27 ++++++++++++++++-----------
 mm/memory.c        |  2 +-
 3 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dd5a38682537..a4e24dd2d96e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1886,6 +1886,8 @@ struct zap_details {
  * default, the flag is not set.
  */
 #define  ZAP_FLAG_DROP_MARKER        ((__force zap_flags_t) BIT(0))
+/* Set in unmap_vmas() to indicate a final unmap call.  Only used by hugetlb */
+#define  ZAP_FLAG_UNMAP              ((__force zap_flags_t) BIT(1))
 
 #ifdef CONFIG_MMU
 extern bool can_do_mlock(void);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9d765364231e..7559b9dfe782 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5210,17 +5210,22 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 
 	__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
 
-	/*
-	 * Unlock and free the vma lock before releasing i_mmap_rwsem.  When
-	 * the vma_lock is freed, this makes the vma ineligible for pmd
-	 * sharing.  And, i_mmap_rwsem is required to set up pmd sharing.
-	 * This is important as page tables for this unmapped range will
-	 * be asynchrously deleted.  If the page tables are shared, there
-	 * will be issues when accessed by someone else.
-	 */
-	__hugetlb_vma_unlock_write_free(vma);
-
-	i_mmap_unlock_write(vma->vm_file->f_mapping);
+	if (zap_flags & ZAP_FLAG_UNMAP) {	/* final unmap */
+		/*
+		 * Unlock and free the vma lock before releasing i_mmap_rwsem.
+		 * When the vma_lock is freed, this makes the vma ineligible
+		 * for pmd sharing.  And, i_mmap_rwsem is required to set up
+		 * pmd sharing.  This is important as page tables for this
+		 * unmapped range will be asynchronously deleted.  If the page
+		 * tables are shared, there will be issues when accessed by
+		 * someone else.
+		 */
+		__hugetlb_vma_unlock_write_free(vma);
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+	} else {
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+		hugetlb_vma_unlock_write(vma);
+	}
 }
 
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
diff --git a/mm/memory.c b/mm/memory.c
index a177f6bbfafc..6d77bc00bca1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1673,7 +1673,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
 {
 	struct mmu_notifier_range range;
 	struct zap_details details = {
-		.zap_flags = ZAP_FLAG_DROP_MARKER,
+		.zap_flags = ZAP_FLAG_DROP_MARKER | ZAP_FLAG_UNMAP,
 		/* Careful - we need to zap private pages too! */
 		.even_cows = true,
 	};
-- 
2.38.1



* [PATCH v10 3/3] hugetlb: remove duplicate mmu notifications
  2022-11-14 23:55 [PATCH v10 0/3] fix hugetlb MADV_DONTNEED vma_lock handling Mike Kravetz
  2022-11-14 23:55 ` [PATCH v10 1/3] madvise: use zap_page_range_single for madvise dontneed Mike Kravetz
  2022-11-14 23:55 ` [PATCH v10 2/3] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing Mike Kravetz
@ 2022-11-14 23:55 ` Mike Kravetz
  2022-11-23  2:07 ` [PATCH v10 0/3] fix hugetlb MADV_DONTNEED vma_lock handling Andrew Morton
  3 siblings, 0 replies; 6+ messages in thread
From: Mike Kravetz @ 2022-11-14 23:55 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Naoya Horiguchi, David Hildenbrand, Axel Rasmussen, Mina Almasry,
	Peter Xu, Nadav Amit, Rik van Riel, Vlastimil Babka,
	Matthew Wilcox, Andrew Morton, Mike Kravetz

The common hugetlb unmap routine __unmap_hugepage_range performs mmu
notification calls.  However, in the case where __unmap_hugepage_range
is called via __unmap_hugepage_range_final, mmu notification calls are
performed earlier in other calling routines.

Remove mmu notification calls from __unmap_hugepage_range.  Add
notification calls to the only other caller: unmap_hugepage_range.
unmap_hugepage_range is called for truncation and hole punch, so
change notification type from UNMAP to CLEAR as this is more appropriate.
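
As a rough summary of where the notifier calls live after this change (my
reading of the call paths, not text from the patch itself):

/*
 * Sketch of the call paths (hedged summary, not verbatim kernel code):
 *
 *   final unmap and MADV_DONTNEED:
 *     unmap_vmas() / zap_page_range_single()   <- notifier start/end issued here
 *       unmap_single_vma()
 *         __unmap_hugepage_range_final()
 *           __unmap_hugepage_range()           <- no longer issues its own notifiers
 *
 *   truncation and hole punch:
 *     unmap_hugepage_range()                   <- MMU_NOTIFY_CLEAR start/end added here
 *       __unmap_hugepage_range()
 */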

Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7559b9dfe782..0cdefa63f474 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5074,7 +5074,6 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 	struct page *page;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
-	struct mmu_notifier_range range;
 	unsigned long last_addr_mask;
 	bool force_flush = false;
 
@@ -5089,13 +5088,6 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 	tlb_change_page_size(tlb, sz);
 	tlb_start_vma(tlb, vma);
 
-	/*
-	 * If sharing possible, alert mmu notifiers of worst case.
-	 */
-	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, mm, start,
-				end);
-	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
-	mmu_notifier_invalidate_range_start(&range);
 	last_addr_mask = hugetlb_mask_last_page(h);
 	address = start;
 	for (; address < end; address += sz) {
@@ -5180,7 +5172,6 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 		if (ref_page)
 			break;
 	}
-	mmu_notifier_invalidate_range_end(&range);
 	tlb_end_vma(tlb, vma);
 
 	/*
@@ -5208,6 +5199,7 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 	hugetlb_vma_lock_write(vma);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 
+	/* mmu notification performed in caller */
 	__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
 
 	if (zap_flags & ZAP_FLAG_UNMAP) {	/* final unmap */
@@ -5232,10 +5224,18 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 			  unsigned long end, struct page *ref_page,
 			  zap_flags_t zap_flags)
 {
+	struct mmu_notifier_range range;
 	struct mmu_gather tlb;
 
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
+				start, end);
+	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
+	mmu_notifier_invalidate_range_start(&range);
 	tlb_gather_mmu(&tlb, vma->vm_mm);
+
 	__unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);
+
+	mmu_notifier_invalidate_range_end(&range);
 	tlb_finish_mmu(&tlb);
 }
 
-- 
2.38.1



* Re: [PATCH v10 0/3] fix hugetlb MADV_DONTNEED vma_lock handling
  2022-11-14 23:55 [PATCH v10 0/3] fix hugetlb MADV_DONTNEED vma_lock handling Mike Kravetz
                   ` (2 preceding siblings ...)
  2022-11-14 23:55 ` [PATCH v10 3/3] hugetlb: remove duplicate mmu notifications Mike Kravetz
@ 2022-11-23  2:07 ` Andrew Morton
  2022-11-23  2:21   ` Mike Kravetz
  3 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2022-11-23  2:07 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Naoya Horiguchi, David Hildenbrand,
	Axel Rasmussen, Mina Almasry, Peter Xu, Nadav Amit, Rik van Riel,
	Vlastimil Babka, Matthew Wilcox

Could this series be implicated in
https://lkml.kernel.org/r/00000000000041a69905edf8c1e3@google.com?


* Re: [PATCH v10 0/3] fix hugetlb MADV_DONTNEED vma_lock handling
  2022-11-23  2:07 ` [PATCH v10 0/3] fix hugetlb MADV_DONTNEED vma_lock handling Andrew Morton
@ 2022-11-23  2:21   ` Mike Kravetz
  0 siblings, 0 replies; 6+ messages in thread
From: Mike Kravetz @ 2022-11-23  2:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Naoya Horiguchi, David Hildenbrand,
	Axel Rasmussen, Mina Almasry, Peter Xu, Nadav Amit, Rik van Riel,
	Vlastimil Babka, Matthew Wilcox

On 11/22/22 18:07, Andrew Morton wrote:
> Could this series be implicated in
> https://lkml.kernel.org/r/00000000000041a69905edf8c1e3@google.com?

If I am reading the report correctly, I would say that this series (at least
the first two patches) would address that issue.  The bot is running against
6.1-rcX and those patches have not yet been sent to the 6.1 stream.
-- 
Mike Kravetz

