linux-kernel.vger.kernel.org archive mirror
* [PATCH v8 0/2] hugetlb MADV_DONTNEED fix and zap_page_range cleanup
@ 2022-11-08  1:19 Mike Kravetz
  2022-11-08  1:19 ` [PATCH v8 1/2] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing Mike Kravetz
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Mike Kravetz @ 2022-11-08  1:19 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Naoya Horiguchi, David Hildenbrand, Axel Rasmussen, Mina Almasry,
	Peter Xu, Nadav Amit, Rik van Riel, Vlastimil Babka,
	Matthew Wilcox, Andrew Morton, Mike Kravetz

Sending v8 as a two patch series.
Patch 1 addresses the hugetlb MADV_DONTNEED bug and should eventually go
to stable releases.
Patch 2 is a cleanup which removes zap_page_range and changes all callers
to use the new zap_vma_range routine, as they only pass ranges within a
single vma.

These changes were inspired by discussions with Nadav and Peter.

Andrew, I would suggest not replacing the patch currently in
mm-hotfixes-unstable branch until there is some feedback here.

Mike Kravetz (2):
  hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
  mm: remove zap_page_range and change callers to use zap_vma_range

 arch/arm64/kernel/vdso.c                |  4 +--
 arch/powerpc/kernel/vdso.c              |  2 +-
 arch/powerpc/platforms/book3s/vas-api.c |  2 +-
 arch/powerpc/platforms/pseries/vas.c    |  2 +-
 arch/riscv/kernel/vdso.c                |  4 +--
 arch/s390/kernel/vdso.c                 |  2 +-
 arch/s390/mm/gmap.c                     |  2 +-
 arch/x86/entry/vdso/vma.c               |  2 +-
 drivers/android/binder_alloc.c          |  2 +-
 include/linux/mm.h                      |  5 ++-
 mm/hugetlb.c                            | 45 +++++++++++++----------
 mm/madvise.c                            |  4 +--
 mm/memory.c                             | 47 +++++++------------------
 mm/page-writeback.c                     |  2 +-
 net/ipv4/tcp.c                          |  6 ++--
 15 files changed, 61 insertions(+), 70 deletions(-)

-- 
2.37.3



* [PATCH v8 1/2] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
  2022-11-08  1:19 [PATCH v8 0/2] hugetlb MADV_DONTNEED fix and zap_page_range cleanup Mike Kravetz
@ 2022-11-08  1:19 ` Mike Kravetz
  2022-11-10 20:56   ` Peter Xu
                     ` (2 more replies)
  2022-11-08  1:19 ` [PATCH v8 2/2] mm: remove zap_page_range and change callers to use zap_vma_range Mike Kravetz
  2022-11-10 19:46 ` [PATCH v8 0/2] hugetlb MADV_DONTNEED fix and zap_page_range cleanup Mike Kravetz
  2 siblings, 3 replies; 16+ messages in thread
From: Mike Kravetz @ 2022-11-08  1:19 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Naoya Horiguchi, David Hildenbrand, Axel Rasmussen, Mina Almasry,
	Peter Xu, Nadav Amit, Rik van Riel, Vlastimil Babka,
	Matthew Wilcox, Andrew Morton, Mike Kravetz, Wei Chen, stable

madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
tables associated with the address range.  For hugetlb vmas,
zap_page_range will call __unmap_hugepage_range_final.  However,
__unmap_hugepage_range_final assumes the passed vma is about to be removed
and deletes the vma_lock to prevent pmd sharing as the vma is on the way
out.  In the case of madvise(MADV_DONTNEED) the vma remains, but the
missing vma_lock prevents pmd sharing and could potentially lead to issues
with truncation/fault races.

This issue was originally reported here [1] as a BUG triggered in
page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
prevent pmd sharing.  Subsequent faults on this vma were confused:
VM_MAYSHARE indicates a sharable vma, but since it was no longer set,
page_mapping was not set in new pages added to the page table.  This
resulted in pages that
appeared anonymous in a VM_SHARED vma and triggered the BUG.
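
For illustration only, a minimal sequence that reaches this path (not a
full reproducer of the race; assumes MAP_HUGETLB is available and omits
error handling):

	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 2UL << 20;		/* one 2MB huge page */
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		p[0] = 1;			/* fault in the huge page */
		madvise(p, len, MADV_DONTNEED);	/* vma remains, vma_lock is deleted */
		p[0] = 1;			/* later faults run without the vma_lock */
		return 0;
	}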

Address issue by:
- Add a new zap flag ZAP_FLAG_UNMAP to indicate an unmap call from
  unmap_vmas().  This is used to indicate the 'final' unmapping of a vma.
  When called via MADV_DONTNEED, this flag is not set and the vma_lock is
  not deleted.
- mmu notification is removed from __unmap_hugepage_range to avoid
  duplication, and notification is added to the other calling routine
  (unmap_hugepage_range).
- notification range in zap_page_range_single is updated to take into
  account the possibility of hugetlb pmd sharing.
- zap_page_range_single renamed to __zap_page_range_single to be used
  internally within mm/memory.c
- zap_vma_range() interface created to zap a range within a single vma.
- madvise_dontneed_single_vma is updated to call zap_vma_range instead of
  zap_page_range as it only operates on a range within a single vma

[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Wei Chen <harperchen1110@gmail.com>
Cc: <stable@vger.kernel.org>
---
 include/linux/mm.h |  5 +++++
 mm/hugetlb.c       | 45 +++++++++++++++++++++++++++------------------
 mm/madvise.c       |  4 ++--
 mm/memory.c        | 17 +++++++++++++----
 4 files changed, 47 insertions(+), 24 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 978c17df053e..d205bcd9cd2e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1840,6 +1840,8 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 		  unsigned long size);
 void zap_page_range(struct vm_area_struct *vma, unsigned long address,
 		    unsigned long size);
+void zap_vma_range(struct vm_area_struct *vma, unsigned long address,
+		    unsigned long size);
 void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
 		struct vm_area_struct *start_vma, unsigned long start,
 		unsigned long end);
@@ -3464,4 +3466,7 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  */
 #define  ZAP_FLAG_DROP_MARKER        ((__force zap_flags_t) BIT(0))
 
+/* Set in unmap_vmas() to indicate an unmap call.  Only used by hugetlb */
+#define  ZAP_FLAG_UNMAP              ((__force zap_flags_t) BIT(1))
+
 #endif /* _LINUX_MM_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ceb47c4e183a..7c8fbce4441e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5072,7 +5072,6 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 	struct page *page;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
-	struct mmu_notifier_range range;
 	unsigned long last_addr_mask;
 	bool force_flush = false;
 
@@ -5087,13 +5086,6 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 	tlb_change_page_size(tlb, sz);
 	tlb_start_vma(tlb, vma);
 
-	/*
-	 * If sharing possible, alert mmu notifiers of worst case.
-	 */
-	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, mm, start,
-				end);
-	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
-	mmu_notifier_invalidate_range_start(&range);
 	last_addr_mask = hugetlb_mask_last_page(h);
 	address = start;
 	for (; address < end; address += sz) {
@@ -5178,7 +5170,6 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 		if (ref_page)
 			break;
 	}
-	mmu_notifier_invalidate_range_end(&range);
 	tlb_end_vma(tlb, vma);
 
 	/*
@@ -5203,32 +5194,50 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 			  unsigned long end, struct page *ref_page,
 			  zap_flags_t zap_flags)
 {
+	bool final = zap_flags & ZAP_FLAG_UNMAP;
+
 	hugetlb_vma_lock_write(vma);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 
 	__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
 
 	/*
-	 * Unlock and free the vma lock before releasing i_mmap_rwsem.  When
-	 * the vma_lock is freed, this makes the vma ineligible for pmd
-	 * sharing.  And, i_mmap_rwsem is required to set up pmd sharing.
-	 * This is important as page tables for this unmapped range will
-	 * be asynchrously deleted.  If the page tables are shared, there
-	 * will be issues when accessed by someone else.
+	 * When called via zap_vma_range (MADV_DONTNEED), this is not the
+	 * final unmap of the vma, and we do not want to delete the vma_lock.
 	 */
-	__hugetlb_vma_unlock_write_free(vma);
-
-	i_mmap_unlock_write(vma->vm_file->f_mapping);
+	if (final) {
+		/*
+		 * Unlock and free the vma lock before releasing i_mmap_rwsem.
+		 * When the vma_lock is freed, this makes the vma ineligible
+		 * for pmd sharing.  And, i_mmap_rwsem is required to set up
+		 * pmd sharing.  This is important as page tables for this
+		 * unmapped range will be asynchrously deleted.  If the page
+		 * tables are shared, there will be issues when accessed by
+		 * someone else.
+		 */
+		__hugetlb_vma_unlock_write_free(vma);
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+	} else {
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+		hugetlb_vma_unlock_write(vma);
+	}
 }
 
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 			  unsigned long end, struct page *ref_page,
 			  zap_flags_t zap_flags)
 {
+	struct mmu_notifier_range range;
 	struct mmu_gather tlb;
 
+	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
+				start, end);
+	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
 	tlb_gather_mmu(&tlb, vma->vm_mm);
+
 	__unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);
+
+	mmu_notifier_invalidate_range_end(&range);
 	tlb_finish_mmu(&tlb);
 }
 
diff --git a/mm/madvise.c b/mm/madvise.c
index c7105ec6d08c..9d2625b8029a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -772,7 +772,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
  * Application no longer needs these pages.  If the pages are dirty,
  * it's OK to just throw them away.  The app will be more careful about
  * data it wants to keep.  Be sure to free swap resources too.  The
- * zap_page_range call sets things up for shrink_active_list to actually free
+ * zap_vma_range call sets things up for shrink_active_list to actually free
  * these pages later if no one else has touched them in the meantime,
  * although we could add these pages to a global reuse list for
  * shrink_active_list to pick up before reclaiming other pages.
@@ -790,7 +790,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 					unsigned long start, unsigned long end)
 {
-	zap_page_range(vma, start, end - start);
+	zap_vma_range(vma, start, end - start);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 6090124b64f1..af3a4724b464 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1717,7 +1717,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 }
 
 /**
- * zap_page_range_single - remove user pages in a given range
+ * __zap_page_range_single - remove user pages in a given range
  * @vma: vm_area_struct holding the applicable pages
  * @address: starting address of pages to zap
  * @size: number of bytes to zap
@@ -1725,7 +1725,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
  *
  * The range must fit into one VMA.
  */
-static void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
+static void __zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size, struct zap_details *details)
 {
 	struct mmu_notifier_range range;
@@ -1734,6 +1734,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	lru_add_drain();
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
 				address, address + size);
+	if (is_vm_hugetlb_page(vma))
+		adjust_range_if_pmd_sharing_possible(vma, &range.start,
+							&range.end);
 	tlb_gather_mmu(&tlb, vma->vm_mm);
 	update_hiwater_rss(vma->vm_mm);
 	mmu_notifier_invalidate_range_start(&range);
@@ -1742,6 +1745,12 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	tlb_finish_mmu(&tlb);
 }
 
+void zap_vma_range(struct vm_area_struct *vma, unsigned long address,
+		unsigned long size)
+{
+	__zap_page_range_single(vma, address, size, NULL);
+}
+
 /**
  * zap_vma_ptes - remove ptes mapping the vma
  * @vma: vm_area_struct holding ptes to be zapped
@@ -1760,7 +1769,7 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 	    		!(vma->vm_flags & VM_PFNMAP))
 		return;
 
-	zap_page_range_single(vma, address, size, NULL);
+	__zap_page_range_single(vma, address, size, NULL);
 }
 EXPORT_SYMBOL_GPL(zap_vma_ptes);
 
@@ -3453,7 +3462,7 @@ static void unmap_mapping_range_vma(struct vm_area_struct *vma,
 		unsigned long start_addr, unsigned long end_addr,
 		struct zap_details *details)
 {
-	zap_page_range_single(vma, start_addr, end_addr - start_addr, details);
+	__zap_page_range_single(vma, start_addr, end_addr - start_addr, details);
 }
 
 static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
-- 
2.37.3



* [PATCH v8 2/2] mm: remove zap_page_range and change callers to use zap_vma_range
  2022-11-08  1:19 [PATCH v8 0/2] hugetlb MADV_DONTNEED fix and zap_page_range cleanup Mike Kravetz
  2022-11-08  1:19 ` [PATCH v8 1/2] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing Mike Kravetz
@ 2022-11-08  1:19 ` Mike Kravetz
  2022-11-10 21:09   ` Nadav Amit
  2022-11-10 19:46 ` [PATCH v8 0/2] hugetlb MADV_DONTNEED fix and zap_page_range cleanup Mike Kravetz
  2 siblings, 1 reply; 16+ messages in thread
From: Mike Kravetz @ 2022-11-08  1:19 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Naoya Horiguchi, David Hildenbrand, Axel Rasmussen, Mina Almasry,
	Peter Xu, Nadav Amit, Rik van Riel, Vlastimil Babka,
	Matthew Wilcox, Andrew Morton, Mike Kravetz

zap_page_range was originally designed to unmap pages within an address
range that could span multiple vmas.  However, today all callers of
zap_page_range pass a range entirely within a single vma.  In addition,
the mmu notification call within zap_page_range is not correct, as it
should be vma specific.

Instead of fixing zap_page_range, change all callers to use zap_vma_range
as it is designed for ranges within a single vma.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 arch/arm64/kernel/vdso.c                |  4 ++--
 arch/powerpc/kernel/vdso.c              |  2 +-
 arch/powerpc/platforms/book3s/vas-api.c |  2 +-
 arch/powerpc/platforms/pseries/vas.c    |  2 +-
 arch/riscv/kernel/vdso.c                |  4 ++--
 arch/s390/kernel/vdso.c                 |  2 +-
 arch/s390/mm/gmap.c                     |  2 +-
 arch/x86/entry/vdso/vma.c               |  2 +-
 drivers/android/binder_alloc.c          |  2 +-
 include/linux/mm.h                      |  2 --
 mm/memory.c                             | 30 -------------------------
 mm/page-writeback.c                     |  2 +-
 net/ipv4/tcp.c                          |  6 ++---
 13 files changed, 15 insertions(+), 47 deletions(-)

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 99ae81ab91a7..05aa0c68b609 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -141,10 +141,10 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
 		unsigned long size = vma->vm_end - vma->vm_start;
 
 		if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA64].dm))
-			zap_page_range(vma, vma->vm_start, size);
+			zap_vma_range(vma, vma->vm_start, size);
 #ifdef CONFIG_COMPAT_VDSO
 		if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA32].dm))
-			zap_page_range(vma, vma->vm_start, size);
+			zap_vma_range(vma, vma->vm_start, size);
 #endif
 	}
 
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index 4abc01949702..69210ca35dc8 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -123,7 +123,7 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
 		unsigned long size = vma->vm_end - vma->vm_start;
 
 		if (vma_is_special_mapping(vma, &vvar_spec))
-			zap_page_range(vma, vma->vm_start, size);
+			zap_vma_range(vma, vma->vm_start, size);
 	}
 	mmap_read_unlock(mm);
 
diff --git a/arch/powerpc/platforms/book3s/vas-api.c b/arch/powerpc/platforms/book3s/vas-api.c
index 40f5ae5e1238..475925723981 100644
--- a/arch/powerpc/platforms/book3s/vas-api.c
+++ b/arch/powerpc/platforms/book3s/vas-api.c
@@ -414,7 +414,7 @@ static vm_fault_t vas_mmap_fault(struct vm_fault *vmf)
 	/*
 	 * When the LPAR lost credits due to core removal or during
 	 * migration, invalidate the existing mapping for the current
-	 * paste addresses and set windows in-active (zap_page_range in
+	 * paste addresses and set windows in-active (zap_vma_range in
 	 * reconfig_close_windows()).
 	 * New mapping will be done later after migration or new credits
 	 * available. So continue to receive faults if the user space
diff --git a/arch/powerpc/platforms/pseries/vas.c b/arch/powerpc/platforms/pseries/vas.c
index 4ad6e510d405..b70afaa5e399 100644
--- a/arch/powerpc/platforms/pseries/vas.c
+++ b/arch/powerpc/platforms/pseries/vas.c
@@ -760,7 +760,7 @@ static int reconfig_close_windows(struct vas_caps *vcap, int excess_creds,
 		 * is done before the original mmap() and after the ioctl.
 		 */
 		if (vma)
-			zap_page_range(vma, vma->vm_start,
+			zap_vma_range(vma, vma->vm_start,
 					vma->vm_end - vma->vm_start);
 
 		mmap_write_unlock(task_ref->mm);
diff --git a/arch/riscv/kernel/vdso.c b/arch/riscv/kernel/vdso.c
index 123d05255fcf..47b767215d15 100644
--- a/arch/riscv/kernel/vdso.c
+++ b/arch/riscv/kernel/vdso.c
@@ -127,10 +127,10 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
 		unsigned long size = vma->vm_end - vma->vm_start;
 
 		if (vma_is_special_mapping(vma, vdso_info.dm))
-			zap_page_range(vma, vma->vm_start, size);
+			zap_vma_range(vma, vma->vm_start, size);
 #ifdef CONFIG_COMPAT
 		if (vma_is_special_mapping(vma, compat_vdso_info.dm))
-			zap_page_range(vma, vma->vm_start, size);
+			zap_vma_range(vma, vma->vm_start, size);
 #endif
 	}
 
diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index 119328e1e2b3..af50c3cefe45 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -78,7 +78,7 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
 
 		if (!vma_is_special_mapping(vma, &vvar_mapping))
 			continue;
-		zap_page_range(vma, vma->vm_start, size);
+		zap_vma_range(vma, vma->vm_start, size);
 		break;
 	}
 	mmap_read_unlock(mm);
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 02d15c8dc92e..32f1d4a3d241 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -723,7 +723,7 @@ void gmap_discard(struct gmap *gmap, unsigned long from, unsigned long to)
 		if (is_vm_hugetlb_page(vma))
 			continue;
 		size = min(to - gaddr, PMD_SIZE - (gaddr & ~PMD_MASK));
-		zap_page_range(vma, vmaddr, size);
+		zap_vma_range(vma, vmaddr, size);
 	}
 	mmap_read_unlock(gmap->mm);
 }
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index d45c5fcfeac2..b3c269cf28d0 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -134,7 +134,7 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
 		unsigned long size = vma->vm_end - vma->vm_start;
 
 		if (vma_is_special_mapping(vma, &vvar_mapping))
-			zap_page_range(vma, vma->vm_start, size);
+			zap_vma_range(vma, vma->vm_start, size);
 	}
 	mmap_read_unlock(mm);
 
diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
index 1c39cfce32fa..063a9b4a6c02 100644
--- a/drivers/android/binder_alloc.c
+++ b/drivers/android/binder_alloc.c
@@ -1012,7 +1012,7 @@ enum lru_status binder_alloc_free_page(struct list_head *item,
 	if (vma) {
 		trace_binder_unmap_user_start(alloc, index);
 
-		zap_page_range(vma, page_addr, PAGE_SIZE);
+		zap_vma_range(vma, page_addr, PAGE_SIZE);
 
 		trace_binder_unmap_user_end(alloc, index);
 	}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d205bcd9cd2e..16052a628ab2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1838,8 +1838,6 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 
 void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 		  unsigned long size);
-void zap_page_range(struct vm_area_struct *vma, unsigned long address,
-		    unsigned long size);
 void zap_vma_range(struct vm_area_struct *vma, unsigned long address,
 		    unsigned long size);
 void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
diff --git a/mm/memory.c b/mm/memory.c
index af3a4724b464..a9b2aa1149b2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1686,36 +1686,6 @@ void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
 	mmu_notifier_invalidate_range_end(&range);
 }
 
-/**
- * zap_page_range - remove user pages in a given range
- * @vma: vm_area_struct holding the applicable pages
- * @start: starting address of pages to zap
- * @size: number of bytes to zap
- *
- * Caller must protect the VMA list
- */
-void zap_page_range(struct vm_area_struct *vma, unsigned long start,
-		unsigned long size)
-{
-	struct maple_tree *mt = &vma->vm_mm->mm_mt;
-	unsigned long end = start + size;
-	struct mmu_notifier_range range;
-	struct mmu_gather tlb;
-	MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
-
-	lru_add_drain();
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
-				start, start + size);
-	tlb_gather_mmu(&tlb, vma->vm_mm);
-	update_hiwater_rss(vma->vm_mm);
-	mmu_notifier_invalidate_range_start(&range);
-	do {
-		unmap_single_vma(&tlb, vma, start, range.end, NULL);
-	} while ((vma = mas_find(&mas, end - 1)) != NULL);
-	mmu_notifier_invalidate_range_end(&range);
-	tlb_finish_mmu(&tlb);
-}
-
 /**
  * __zap_page_range_single - remove user pages in a given range
  * @vma: vm_area_struct holding the applicable pages
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7e9d8d857ecc..dbfa8b2062fc 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2601,7 +2601,7 @@ void folio_account_cleaned(struct folio *folio, struct bdi_writeback *wb)
  *
  * The caller must hold lock_page_memcg().  Most callers have the folio
  * locked.  A few have the folio blocked from truncation through other
- * means (eg zap_page_range() has it mapped and is holding the page table
+ * means (eg zap_vma_range() has it mapped and is holding the page table
  * lock).  This can also be called from mark_buffer_dirty(), which I
  * cannot prove is always protected against truncate.
  */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index de8f0cd7cb32..dea1d72ae4e2 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2092,7 +2092,7 @@ static int tcp_zerocopy_vm_insert_batch_error(struct vm_area_struct *vma,
 		maybe_zap_len = total_bytes_to_map -  /* All bytes to map */
 				*length + /* Mapped or pending */
 				(pages_remaining * PAGE_SIZE); /* Failed map. */
-		zap_page_range(vma, *address, maybe_zap_len);
+		zap_vma_range(vma, *address, maybe_zap_len);
 		err = 0;
 	}
 
@@ -2100,7 +2100,7 @@ static int tcp_zerocopy_vm_insert_batch_error(struct vm_area_struct *vma,
 		unsigned long leftover_pages = pages_remaining;
 		int bytes_mapped;
 
-		/* We called zap_page_range, try to reinsert. */
+		/* We called zap_vma_range, try to reinsert. */
 		err = vm_insert_pages(vma, *address,
 				      pending_pages,
 				      &pages_remaining);
@@ -2234,7 +2234,7 @@ static int tcp_zerocopy_receive(struct sock *sk,
 	total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
 	if (total_bytes_to_map) {
 		if (!(zc->flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT))
-			zap_page_range(vma, address, total_bytes_to_map);
+			zap_vma_range(vma, address, total_bytes_to_map);
 		zc->length = total_bytes_to_map;
 		zc->recv_skip_hint = 0;
 	} else {
-- 
2.37.3



* Re: [PATCH v8 0/2] hugetlb MADV_DONTNEED fix and zap_page_range cleanup
  2022-11-08  1:19 [PATCH v8 0/2] hugetlb MADV_DONTNEED fix and zap_page_range cleanup Mike Kravetz
  2022-11-08  1:19 ` [PATCH v8 1/2] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing Mike Kravetz
  2022-11-08  1:19 ` [PATCH v8 2/2] mm: remove zap_page_range and change callers to use zap_vma_range Mike Kravetz
@ 2022-11-10 19:46 ` Mike Kravetz
  2 siblings, 0 replies; 16+ messages in thread
From: Mike Kravetz @ 2022-11-10 19:46 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Naoya Horiguchi, David Hildenbrand, Axel Rasmussen, Mina Almasry,
	Peter Xu, Nadav Amit, Rik van Riel, Vlastimil Babka,
	Matthew Wilcox, Andrew Morton

On 11/07/22 17:19, Mike Kravetz wrote:
> Sending v8 as a two patch series.
> Patch 1 addresses the hugetlb MADV_DONTNEED bug and should eventually go
> to stable releases.
> Patch 2 is a cleanup which removes zap_page_range and changes all callers
> to use the new zap_vma_range routine, as they only pass ranges within a
> single vma.
> 
> These changes were inspired by discussions with Nadav and Peter.
> 
> Andrew, I would suggest not replacing the patch currently in
> mm-hotfixes-unstable branch until there is some feedback here.

Any comments?

Because of previous churn I asked Andrew not to add this series to his tree.
v7 with the zap_page_range fixup is sitting in mm-hotfixes-unstable.
-- 
Mike Kravetz


* Re: [PATCH v8 1/2] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
  2022-11-08  1:19 ` [PATCH v8 1/2] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing Mike Kravetz
@ 2022-11-10 20:56   ` Peter Xu
  2022-11-10 21:55     ` Mike Kravetz
  2022-11-10 20:59   ` Nadav Amit
  2022-11-10 21:07   ` Peter Xu
  2 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2022-11-10 20:56 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Naoya Horiguchi, David Hildenbrand,
	Axel Rasmussen, Mina Almasry, Nadav Amit, Rik van Riel,
	Vlastimil Babka, Matthew Wilcox, Andrew Morton, Wei Chen, stable

Hi, Mike,

Sorry to be late, took me quite some time working on another bug..

On Mon, Nov 07, 2022 at 05:19:09PM -0800, Mike Kravetz wrote:
> madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
> tables associated with the address range.  For hugetlb vmas,
> zap_page_range will call __unmap_hugepage_range_final.  However,
> __unmap_hugepage_range_final assumes the passed vma is about to be removed
> and deletes the vma_lock to prevent pmd sharing as the vma is on the way
> out.  In the case of madvise(MADV_DONTNEED) the vma remains, but the
> missing vma_lock prevents pmd sharing and could potentially lead to issues
> with truncation/fault races.
> 
> This issue was originally reported here [1] as a BUG triggered in
> page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
> vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
> prevent pmd sharing.  Subsequent faults on this vma were confused as
> VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping was
> not set in new pages added to the page table.  This resulted in pages that
> appeared anonymous in a VM_SHARED vma and triggered the BUG.
> 
> Address issue by:
> - Add a new zap flag ZAP_FLAG_UNMAP to indicate an unmap call from
>   unmap_vmas().  This is used to indicate the 'final' unmapping of a vma.
>   When called via MADV_DONTNEED, this flag is not set and the vm_lock is
>   not deleted.
> - mmu notification is removed from __unmap_hugepage_range to avoid
>   duplication, and notification is added to the other calling routine
>   (unmap_hugepage_range).
> - notification range in zap_page_range_single is updated to take into
>   account the possibility of hugetlb pmd sharing.
> - zap_page_range_single renamed to __zap_page_range_single to be used
>   internally within mm/memory.c
> - zap_vma_range() interface created to zap a range within a single vma.
> - madvise_dontneed_single_vma is updated to call zap_vma_range instead of
>   zap_page_range as it only operates on a range within a single vma
> 
> [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
> Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> Reported-by: Wei Chen <harperchen1110@gmail.com>
> Cc: <stable@vger.kernel.org>
> ---
>  include/linux/mm.h |  5 +++++
>  mm/hugetlb.c       | 45 +++++++++++++++++++++++++++------------------
>  mm/madvise.c       |  4 ++--
>  mm/memory.c        | 17 +++++++++++++----
>  4 files changed, 47 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 978c17df053e..d205bcd9cd2e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1840,6 +1840,8 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
>  		  unsigned long size);
>  void zap_page_range(struct vm_area_struct *vma, unsigned long address,
>  		    unsigned long size);
> +void zap_vma_range(struct vm_area_struct *vma, unsigned long address,
> +		    unsigned long size);
>  void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
>  		struct vm_area_struct *start_vma, unsigned long start,
>  		unsigned long end);
> @@ -3464,4 +3466,7 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>   */
>  #define  ZAP_FLAG_DROP_MARKER        ((__force zap_flags_t) BIT(0))
>  
> +/* Set in unmap_vmas() to indicate an unmap call.  Only used by hugetlb */
> +#define  ZAP_FLAG_UNMAP              ((__force zap_flags_t) BIT(1))

It seems this is not set anywhere in the patch?

> +
>  #endif /* _LINUX_MM_H */
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ceb47c4e183a..7c8fbce4441e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5072,7 +5072,6 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
>  	struct page *page;
>  	struct hstate *h = hstate_vma(vma);
>  	unsigned long sz = huge_page_size(h);
> -	struct mmu_notifier_range range;
>  	unsigned long last_addr_mask;
>  	bool force_flush = false;
>  
> @@ -5087,13 +5086,6 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
>  	tlb_change_page_size(tlb, sz);
>  	tlb_start_vma(tlb, vma);
>  
> -	/*
> -	 * If sharing possible, alert mmu notifiers of worst case.
> -	 */
> -	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, mm, start,
> -				end);
> -	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
> -	mmu_notifier_invalidate_range_start(&range);
>  	last_addr_mask = hugetlb_mask_last_page(h);
>  	address = start;
>  	for (; address < end; address += sz) {
> @@ -5178,7 +5170,6 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
>  		if (ref_page)
>  			break;
>  	}
> -	mmu_notifier_invalidate_range_end(&range);
>  	tlb_end_vma(tlb, vma);
>  
>  	/*
> @@ -5203,32 +5194,50 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
>  			  unsigned long end, struct page *ref_page,
>  			  zap_flags_t zap_flags)
>  {
> +	bool final = zap_flags & ZAP_FLAG_UNMAP;
> +
>  	hugetlb_vma_lock_write(vma);
>  	i_mmap_lock_write(vma->vm_file->f_mapping);
>  
>  	__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
>  
>  	/*
> -	 * Unlock and free the vma lock before releasing i_mmap_rwsem.  When
> -	 * the vma_lock is freed, this makes the vma ineligible for pmd
> -	 * sharing.  And, i_mmap_rwsem is required to set up pmd sharing.
> -	 * This is important as page tables for this unmapped range will
> -	 * be asynchrously deleted.  If the page tables are shared, there
> -	 * will be issues when accessed by someone else.
> +	 * When called via zap_vma_range (MADV_DONTNEED), this is not the
> +	 * final unmap of the vma, and we do not want to delete the vma_lock.
>  	 */
> -	__hugetlb_vma_unlock_write_free(vma);
> -
> -	i_mmap_unlock_write(vma->vm_file->f_mapping);
> +	if (final) {
> +		/*
> +		 * Unlock and free the vma lock before releasing i_mmap_rwsem.
> +		 * When the vma_lock is freed, this makes the vma ineligible
> +		 * for pmd sharing.  And, i_mmap_rwsem is required to set up
> +		 * pmd sharing.  This is important as page tables for this
> +		 * unmapped range will be asynchrously deleted.  If the page
> +		 * tables are shared, there will be issues when accessed by
> +		 * someone else.
> +		 */
> +		__hugetlb_vma_unlock_write_free(vma);
> +		i_mmap_unlock_write(vma->vm_file->f_mapping);
> +	} else {
> +		i_mmap_unlock_write(vma->vm_file->f_mapping);
> +		hugetlb_vma_unlock_write(vma);
> +	}
>  }
>  
>  void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
>  			  unsigned long end, struct page *ref_page,
>  			  zap_flags_t zap_flags)
>  {
> +	struct mmu_notifier_range range;
>  	struct mmu_gather tlb;
>  
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,

Should this be s/UNMAP/CLEAR/?  As IIUC the unmap path was only happening
in __unmap_hugepage_range_final().

> +				start, end);
> +	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
>  	tlb_gather_mmu(&tlb, vma->vm_mm);
> +
>  	__unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);
> +
> +	mmu_notifier_invalidate_range_end(&range);
>  	tlb_finish_mmu(&tlb);
>  }
>  
> diff --git a/mm/madvise.c b/mm/madvise.c
> index c7105ec6d08c..9d2625b8029a 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -772,7 +772,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>   * Application no longer needs these pages.  If the pages are dirty,
>   * it's OK to just throw them away.  The app will be more careful about
>   * data it wants to keep.  Be sure to free swap resources too.  The
> - * zap_page_range call sets things up for shrink_active_list to actually free
> + * zap_vma_range call sets things up for shrink_active_list to actually free
>   * these pages later if no one else has touched them in the meantime,
>   * although we could add these pages to a global reuse list for
>   * shrink_active_list to pick up before reclaiming other pages.
> @@ -790,7 +790,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>  static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
>  					unsigned long start, unsigned long end)
>  {
> -	zap_page_range(vma, start, end - start);
> +	zap_vma_range(vma, start, end - start);

I'd rather just call zap_page_range_single() directly with NULL passed
over, considering that this is for stable, but no strong opinions.

>  	return 0;
>  }
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 6090124b64f1..af3a4724b464 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1717,7 +1717,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
>  }
>  
>  /**
> - * zap_page_range_single - remove user pages in a given range
> + * __zap_page_range_single - remove user pages in a given range

Same nitpick here, I'd rather keep the name at least for this patch.  But
again, no strong opinion.

>   * @vma: vm_area_struct holding the applicable pages
>   * @address: starting address of pages to zap
>   * @size: number of bytes to zap
> @@ -1725,7 +1725,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
>   *
>   * The range must fit into one VMA.
>   */
> -static void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> +static void __zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
>  		unsigned long size, struct zap_details *details)
>  {
>  	struct mmu_notifier_range range;
> @@ -1734,6 +1734,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
>  	lru_add_drain();
>  	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
>  				address, address + size);
> +	if (is_vm_hugetlb_page(vma))
> +		adjust_range_if_pmd_sharing_possible(vma, &range.start,
> +							&range.end);
>  	tlb_gather_mmu(&tlb, vma->vm_mm);
>  	update_hiwater_rss(vma->vm_mm);
>  	mmu_notifier_invalidate_range_start(&range);
> @@ -1742,6 +1745,12 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
>  	tlb_finish_mmu(&tlb);
>  }
>  
> +void zap_vma_range(struct vm_area_struct *vma, unsigned long address,
> +		unsigned long size)
> +{
> +	__zap_page_range_single(vma, address, size, NULL);
> +}
> +
>  /**
>   * zap_vma_ptes - remove ptes mapping the vma
>   * @vma: vm_area_struct holding ptes to be zapped
> @@ -1760,7 +1769,7 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
>  	    		!(vma->vm_flags & VM_PFNMAP))
>  		return;
>  
> -	zap_page_range_single(vma, address, size, NULL);
> +	__zap_page_range_single(vma, address, size, NULL);
>  }
>  EXPORT_SYMBOL_GPL(zap_vma_ptes);
>  
> @@ -3453,7 +3462,7 @@ static void unmap_mapping_range_vma(struct vm_area_struct *vma,
>  		unsigned long start_addr, unsigned long end_addr,
>  		struct zap_details *details)
>  {
> -	zap_page_range_single(vma, start_addr, end_addr - start_addr, details);
> +	__zap_page_range_single(vma, start_addr, end_addr - start_addr, details);
>  }
>  
>  static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
> -- 
> 2.37.3
> 
> 

-- 
Peter Xu



* Re: [PATCH v8 1/2] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
  2022-11-08  1:19 ` [PATCH v8 1/2] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing Mike Kravetz
  2022-11-10 20:56   ` Peter Xu
@ 2022-11-10 20:59   ` Nadav Amit
  2022-11-10 21:48     ` Mike Kravetz
  2022-11-10 21:07   ` Peter Xu
  2 siblings, 1 reply; 16+ messages in thread
From: Nadav Amit @ 2022-11-10 20:59 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linux-MM, kernel list, Naoya Horiguchi, David Hildenbrand,
	Axel Rasmussen, Mina Almasry, Peter Xu, Rik van Riel,
	Vlastimil Babka, Matthew Wilcox, Andrew Morton, Wei Chen, stable

On Nov 7, 2022, at 5:19 PM, Mike Kravetz <mike.kravetz@oracle.com> wrote:

> madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
> tables associated with the address range.  For hugetlb vmas,
> zap_page_range will call __unmap_hugepage_range_final.  However,
> __unmap_hugepage_range_final assumes the passed vma is about to be removed
> and deletes the vma_lock to prevent pmd sharing as the vma is on the way
> out.  In the case of madvise(MADV_DONTNEED) the vma remains, but the
> missing vma_lock prevents pmd sharing and could potentially lead to issues
> with truncation/fault races.

I understand the problem in general. Please consider my feedback as partial
though.


> @@ -5203,32 +5194,50 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> 			  unsigned long end, struct page *ref_page,
> 			  zap_flags_t zap_flags)
> {
> +	bool final = zap_flags & ZAP_FLAG_UNMAP;
> +

Not sure why caching final in local variable helps.

> 
> void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
> 			  unsigned long end, struct page *ref_page,
> 			  zap_flags_t zap_flags)
> {
> +	struct mmu_notifier_range range;
> 	struct mmu_gather tlb;
> 
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
> +				start, end);
> +	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
> 	tlb_gather_mmu(&tlb, vma->vm_mm);
> +
> 	__unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);

Is there a reason for not using range.start and range.end?

It is just that every inconsistency is worrying…

> 
> @@ -1734,6 +1734,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
> 	lru_add_drain();
> 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> 				address, address + size);
> +	if (is_vm_hugetlb_page(vma))
> +		adjust_range_if_pmd_sharing_possible(vma, &range.start,
> +							&range.end);
> 	tlb_gather_mmu(&tlb, vma->vm_mm);
> 	update_hiwater_rss(vma->vm_mm);
> 	mmu_notifier_invalidate_range_start(&range);
> @@ -1742,6 +1745,12 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
> 	tlb_finish_mmu(&tlb);
> }
> 
> +void zap_vma_range(struct vm_area_struct *vma, unsigned long address,
> +		unsigned long size)
> +{
> +	__zap_page_range_single(vma, address, size, NULL);

Ugh. So zap_vma_range() would actually be emitted as a wrapper function that
only calls __zap_page_range_single() (or worse, __zap_page_range_single(),
which is large, would be inlined), unless you use LTO.

Another option is to declare __zap_page_range_single() in the header and move
this one to the header as an inline wrapper.
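
Something along these lines (untested sketch; only a pointer to zap_details
crosses the header boundary, so a forward declaration should do, and
__zap_page_range_single() would lose its "static" in mm/memory.c):

	/* include/linux/mm.h */
	struct zap_details;
	void __zap_page_range_single(struct vm_area_struct *vma,
			unsigned long address, unsigned long size,
			struct zap_details *details);

	static inline void zap_vma_range(struct vm_area_struct *vma,
			unsigned long address, unsigned long size)
	{
		__zap_page_range_single(vma, address, size, NULL);
	}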



* Re: [PATCH v8 1/2] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
  2022-11-08  1:19 ` [PATCH v8 1/2] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing Mike Kravetz
  2022-11-10 20:56   ` Peter Xu
  2022-11-10 20:59   ` Nadav Amit
@ 2022-11-10 21:07   ` Peter Xu
  2 siblings, 0 replies; 16+ messages in thread
From: Peter Xu @ 2022-11-10 21:07 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Naoya Horiguchi, David Hildenbrand,
	Axel Rasmussen, Mina Almasry, Nadav Amit, Rik van Riel,
	Vlastimil Babka, Matthew Wilcox, Andrew Morton, Wei Chen, stable

Maybe it'll also be good if we split this into a few smaller ones?

For example:

On Mon, Nov 07, 2022 at 05:19:09PM -0800, Mike Kravetz wrote:
> Address issue by:
> - Add a new zap flag ZAP_FLAG_UNMAP to indicate an unmap call from
>   unmap_vmas().  This is used to indicate the 'final' unmapping of a vma.
>   When called via MADV_DONTNEED, this flag is not set and the vm_lock is
>   not deleted.

One patch to fix the real thing (patch 2).

> - mmu notification is removed from __unmap_hugepage_range to avoid
>   duplication, and notification is added to the other calling routine
>   (unmap_hugepage_range).

One patch to fix the mmu notifier (patch 1).

> - notification range in zap_page_range_single is updated to take into
>   account the possibility of hugetlb pmd sharing.

Part of patch 2.

> - zap_page_range_single renamed to __zap_page_range_single to be used
>   internally within mm/memory.c
> - zap_vma_range() interface created to zap a range within a single vma.

Another patch 3 for the two.

> - madvise_dontneed_single_vma is updated to call zap_vma_range instead of
>   zap_page_range as it only operates on a range within a single vma

Part of patch 2.

Then patch 1 & 2 will need to copy stable, 3 won't need to.  Patch 2 in
this series will be patch 4.  Not sure whether that looks cleaner.

Mike, sorry again if this comment is too late; feel free to go with
whichever way you think proper.

Thanks,

-- 
Peter Xu



* Re: [PATCH v8 2/2] mm: remove zap_page_range and change callers to use zap_vma_range
  2022-11-08  1:19 ` [PATCH v8 2/2] mm: remove zap_page_range and change callers to use zap_vma_range Mike Kravetz
@ 2022-11-10 21:09   ` Nadav Amit
  2022-11-10 21:27     ` Peter Xu
  0 siblings, 1 reply; 16+ messages in thread
From: Nadav Amit @ 2022-11-10 21:09 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linux-MM, kernel list, Naoya Horiguchi, David Hildenbrand,
	Axel Rasmussen, Mina Almasry, Peter Xu, Rik van Riel,
	Vlastimil Babka, Matthew Wilcox, Andrew Morton

On Nov 7, 2022, at 5:19 PM, Mike Kravetz <mike.kravetz@oracle.com> wrote:

> zap_page_range was originally designed to unmap pages within an address
> range that could span multiple vmas.  However, today all callers of
> zap_page_range pass a range entirely within a single vma.  In addition,
> the mmu notification call within zap_page range is not correct as it
> should be vma specific.
> 
> Instead of fixing zap_page_range, change all callers to use zap_vma_range
> as it is designed for ranges within a single vma.

I understand the argument about mmu notifiers being broken (which is of
course fixable).

But, are the callers really able to guarantee that the ranges are all in a
single VMA? I am not familiar with the users, but how for instance
tcp_zerocopy_receive() can guarantee that no one did some mprotect() of some
sorts that caused the original VMA to be split?



* Re: [PATCH v8 2/2] mm: remove zap_page_range and change callers to use zap_vma_range
  2022-11-10 21:09   ` Nadav Amit
@ 2022-11-10 21:27     ` Peter Xu
  2022-11-10 22:02       ` Nadav Amit
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2022-11-10 21:27 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Mike Kravetz, Linux-MM, kernel list, Naoya Horiguchi,
	David Hildenbrand, Axel Rasmussen, Mina Almasry, Rik van Riel,
	Vlastimil Babka, Matthew Wilcox, Andrew Morton

Hi, Nadav,

On Thu, Nov 10, 2022 at 01:09:43PM -0800, Nadav Amit wrote:
> But, are the callers really able to guarantee that the ranges are all in a
> single VMA? I am not familiar with the users, but how for instance
> tcp_zerocopy_receive() can guarantee that no one did some mprotect() of some
> sorts that caused the original VMA to be split?

Let me try to answer this one for Mike..  We have two callers in tcp
zerocopy code for this function:

tcp_zerocopy_vm_insert_batch_error[2095] zap_page_range(vma, *address, maybe_zap_len);
tcp_zerocopy_receive[2237]     zap_page_range(vma, address, total_bytes_to_map);

Both of them take the mmap lock for read, so firstly mprotect is not
possible.

The 1st call has:

	mmap_read_lock(current->mm);

	vma = vma_lookup(current->mm, address);
	if (!vma || vma->vm_ops != &tcp_vm_ops) {
		mmap_read_unlock(current->mm);
		return -EINVAL;
	}
	vma_len = min_t(unsigned long, zc->length, vma->vm_end - address);
	avail_len = min_t(u32, vma_len, inq);
	total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
	if (total_bytes_to_map) {
		if (!(zc->flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT))
			zap_page_range(vma, address, total_bytes_to_map);

Here total_bytes_to_map comes from avail_len <--- vma_len, which is a min()
against the remaining vma range.  So total_bytes_to_map will never go beyond
the vma.

The 2nd call uses maybe_zap_len as the length; we need to look two layers up
the callers, but ultimately it is something smaller than the total_bytes_to_map
we discussed.  Hopefully that proves 100% safety for tcp zerocopy.

-- 
Peter Xu



* Re: [PATCH v8 1/2] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
  2022-11-10 20:59   ` Nadav Amit
@ 2022-11-10 21:48     ` Mike Kravetz
  2022-11-10 22:07       ` Peter Xu
  2022-11-10 22:22       ` Nadav Amit
  0 siblings, 2 replies; 16+ messages in thread
From: Mike Kravetz @ 2022-11-10 21:48 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux-MM, kernel list, Naoya Horiguchi, David Hildenbrand,
	Axel Rasmussen, Mina Almasry, Peter Xu, Rik van Riel,
	Vlastimil Babka, Matthew Wilcox, Andrew Morton, Wei Chen, stable

On 11/10/22 12:59, Nadav Amit wrote:
> On Nov 7, 2022, at 5:19 PM, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
> > madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
> > tables associated with the address range.  For hugetlb vmas,
> > zap_page_range will call __unmap_hugepage_range_final.  However,
> > __unmap_hugepage_range_final assumes the passed vma is about to be removed
> > and deletes the vma_lock to prevent pmd sharing as the vma is on the way
> > out.  In the case of madvise(MADV_DONTNEED) the vma remains, but the
> > missing vma_lock prevents pmd sharing and could potentially lead to issues
> > with truncation/fault races.
> 
> I understand the problem in general. Please consider my feedback as partial
> though.

Thanks for taking a look!

> 
> > @@ -5203,32 +5194,50 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> > 			  unsigned long end, struct page *ref_page,
> > 			  zap_flags_t zap_flags)
> > {
> > +	bool final = zap_flags & ZAP_FLAG_UNMAP;
> > +
> 
> Not sure why caching final in local variable helps.

No particular reason.  It can be eliminated.

> 
> > 
> > void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
> > 			  unsigned long end, struct page *ref_page,
> > 			  zap_flags_t zap_flags)
> > {
> > +	struct mmu_notifier_range range;
> > 	struct mmu_gather tlb;
> > 
> > +	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
> > +				start, end);
> > +	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
> > 	tlb_gather_mmu(&tlb, vma->vm_mm);
> > +
> > 	__unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);
> 
> Is there a reason for not using range.start and range.end?

After calling adjust_range_if_pmd_sharing_possible, the span from range.start
to range.end could be much greater than the range we actually want to unmap.
The range gets adjusted to account for pmd sharing if that is POSSIBLE.  It
does not know for sure if we will actually 'unshare a pmd'.

I suppose adjust_range_if_pmd_sharing_possible could be modified to actually
check if unmapping will result in unsharing, but it does not do that today.
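
For context, the adjustment just rounds the range out to PUD_SIZE (shared
pmd) boundaries when sharing is possible, roughly (paraphrasing the current
helper from memory):

	/* Extend the range to be PUD aligned for a worst case scenario */
	if (*start > v_start)
		*start = ALIGN_DOWN(*start, PUD_SIZE);
	if (*end < v_end)
		*end = ALIGN(*end, PUD_SIZE);

where v_start/v_end are the PUD aligned bounds of the vma.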

> 
> It is just that every inconsistency is worrying…
> 
> > 
> > @@ -1734,6 +1734,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
> > 	lru_add_drain();
> > 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> > 				address, address + size);
> > +	if (is_vm_hugetlb_page(vma))
> > +		adjust_range_if_pmd_sharing_possible(vma, &range.start,
> > +							&range.end);
> > 	tlb_gather_mmu(&tlb, vma->vm_mm);
> > 	update_hiwater_rss(vma->vm_mm);
> > 	mmu_notifier_invalidate_range_start(&range);
> > @@ -1742,6 +1745,12 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
> > 	tlb_finish_mmu(&tlb);
> > }
> > 
> > +void zap_vma_range(struct vm_area_struct *vma, unsigned long address,
> > +		unsigned long size)
> > +{
> > +	__zap_page_range_single(vma, address, size, NULL);
> 
> Ugh. So zap_vma_range() would actually be emitted as a wrapper function that
> only calls __zap_page_range_single() (or worse __zap_page_range_single()
> which is large would be inlined), unless you use LTO.
> 
> Another option is to declare __zap_page_range_size() in the header and move
> this one to the header as inline wrapper.

Ok, I will keep that in mind.

However, Peter seems to prefer just exposing the current zap_page_range_single.  
The issue with exposing zap_page_range_single as it is today is that we also
need to expose 'struct zap_details', which currently is not visible outside
mm/memory.c.
-- 
Mike Kravetz


* Re: [PATCH v8 1/2] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
  2022-11-10 20:56   ` Peter Xu
@ 2022-11-10 21:55     ` Mike Kravetz
  0 siblings, 0 replies; 16+ messages in thread
From: Mike Kravetz @ 2022-11-10 21:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Naoya Horiguchi, David Hildenbrand,
	Axel Rasmussen, Mina Almasry, Nadav Amit, Rik van Riel,
	Vlastimil Babka, Matthew Wilcox, Andrew Morton, Wei Chen, stable

On 11/10/22 15:56, Peter Xu wrote:
> Hi, Mike,
> 
> Sorry to be late, took me quite some time working on another bug..
> 
> On Mon, Nov 07, 2022 at 05:19:09PM -0800, Mike Kravetz wrote:
> > madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
> > tables associated with the address range.  For hugetlb vmas,
> > zap_page_range will call __unmap_hugepage_range_final.  However,
> > __unmap_hugepage_range_final assumes the passed vma is about to be removed
> > and deletes the vma_lock to prevent pmd sharing as the vma is on the way
> > out.  In the case of madvise(MADV_DONTNEED) the vma remains, but the
> > missing vma_lock prevents pmd sharing and could potentially lead to issues
> > with truncation/fault races.
> > 
> > This issue was originally reported here [1] as a BUG triggered in
> > page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
> > vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
> > prevent pmd sharing.  Subsequent faults on this vma were confused as
> > VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping was
> > not set in new pages added to the page table.  This resulted in pages that
> > appeared anonymous in a VM_SHARED vma and triggered the BUG.
> > 
> > Address issue by:
> > - Add a new zap flag ZAP_FLAG_UNMAP to indicate an unmap call from
> >   unmap_vmas().  This is used to indicate the 'final' unmapping of a vma.
> >   When called via MADV_DONTNEED, this flag is not set and the vm_lock is
> >   not deleted.
> > - mmu notification is removed from __unmap_hugepage_range to avoid
> >   duplication, and notification is added to the other calling routine
> >   (unmap_hugepage_range).
> > - notification range in zap_page_range_single is updated to take into
> >   account the possibility of hugetlb pmd sharing.
> > - zap_page_range_single renamed to __zap_page_range_single to be used
> >   internally within mm/memory.c
> > - zap_vma_range() interface created to zap a range within a single vma.
> > - madvise_dontneed_single_vma is updated to call zap_vma_range instead of
> >   zap_page_range as it only operates on a range within a single vma
> > 
> > [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
> > Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > Reported-by: Wei Chen <harperchen1110@gmail.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >  include/linux/mm.h |  5 +++++
> >  mm/hugetlb.c       | 45 +++++++++++++++++++++++++++------------------
> >  mm/madvise.c       |  4 ++--
> >  mm/memory.c        | 17 +++++++++++++----
> >  4 files changed, 47 insertions(+), 24 deletions(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 978c17df053e..d205bcd9cd2e 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1840,6 +1840,8 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
> >  		  unsigned long size);
> >  void zap_page_range(struct vm_area_struct *vma, unsigned long address,
> >  		    unsigned long size);
> > +void zap_vma_range(struct vm_area_struct *vma, unsigned long address,
> > +		    unsigned long size);
> >  void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
> >  		struct vm_area_struct *start_vma, unsigned long start,
> >  		unsigned long end);
> > @@ -3464,4 +3466,7 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> >   */
> >  #define  ZAP_FLAG_DROP_MARKER        ((__force zap_flags_t) BIT(0))
> >  
> > +/* Set in unmap_vmas() to indicate an unmap call.  Only used by hugetlb */
> > +#define  ZAP_FLAG_UNMAP              ((__force zap_flags_t) BIT(1))
> 
> It seems this is not set anywhere in the patch?

Correct.  Should be set in unmap_vmas.
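
Something like the following (untested sketch, quoting the surrounding
unmap_vmas() code from memory; exact zap_details wiring may differ in the
next version):

	void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
			struct vm_area_struct *vma, unsigned long start_addr,
			unsigned long end_addr)
	{
		struct mmu_notifier_range range;
		struct zap_details details = {
			.zap_flags = ZAP_FLAG_DROP_MARKER | ZAP_FLAG_UNMAP,
			/* careful - we need to zap private pages too */
			.even_cows = true,
		};
		MA_STATE(mas, mt, vma->vm_end, vma->vm_end);

		mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma,
					vma->vm_mm, start_addr, end_addr);
		mmu_notifier_invalidate_range_start(&range);
		do {
			unmap_single_vma(tlb, vma, start_addr, end_addr, &details);
		} while ((vma = mas_find(&mas, end_addr - 1)) != NULL);
		mmu_notifier_invalidate_range_end(&range);
	}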

> > +
> >  #endif /* _LINUX_MM_H */
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index ceb47c4e183a..7c8fbce4441e 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -5072,7 +5072,6 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
> >  	struct page *page;
> >  	struct hstate *h = hstate_vma(vma);
> >  	unsigned long sz = huge_page_size(h);
> > -	struct mmu_notifier_range range;
> >  	unsigned long last_addr_mask;
> >  	bool force_flush = false;
> >  
> > @@ -5087,13 +5086,6 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
> >  	tlb_change_page_size(tlb, sz);
> >  	tlb_start_vma(tlb, vma);
> >  
> > -	/*
> > -	 * If sharing possible, alert mmu notifiers of worst case.
> > -	 */
> > -	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, mm, start,
> > -				end);
> > -	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
> > -	mmu_notifier_invalidate_range_start(&range);
> >  	last_addr_mask = hugetlb_mask_last_page(h);
> >  	address = start;
> >  	for (; address < end; address += sz) {
> > @@ -5178,7 +5170,6 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
> >  		if (ref_page)
> >  			break;
> >  	}
> > -	mmu_notifier_invalidate_range_end(&range);
> >  	tlb_end_vma(tlb, vma);
> >  
> >  	/*
> > @@ -5203,32 +5194,50 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> >  			  unsigned long end, struct page *ref_page,
> >  			  zap_flags_t zap_flags)
> >  {
> > +	bool final = zap_flags & ZAP_FLAG_UNMAP;
> > +
> >  	hugetlb_vma_lock_write(vma);
> >  	i_mmap_lock_write(vma->vm_file->f_mapping);
> >  
> >  	__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
> >  
> >  	/*
> > -	 * Unlock and free the vma lock before releasing i_mmap_rwsem.  When
> > -	 * the vma_lock is freed, this makes the vma ineligible for pmd
> > -	 * sharing.  And, i_mmap_rwsem is required to set up pmd sharing.
> > -	 * This is important as page tables for this unmapped range will
> > -	 * be asynchrously deleted.  If the page tables are shared, there
> > -	 * will be issues when accessed by someone else.
> > +	 * When called via zap_vma_range (MADV_DONTNEED), this is not the
> > +	 * final unmap of the vma, and we do not want to delete the vma_lock.
> >  	 */
> > -	__hugetlb_vma_unlock_write_free(vma);
> > -
> > -	i_mmap_unlock_write(vma->vm_file->f_mapping);
> > +	if (final) {
> > +		/*
> > +		 * Unlock and free the vma lock before releasing i_mmap_rwsem.
> > +		 * When the vma_lock is freed, this makes the vma ineligible
> > +		 * for pmd sharing.  And, i_mmap_rwsem is required to set up
> > +		 * pmd sharing.  This is important as page tables for this
> > +		 * unmapped range will be asynchrously deleted.  If the page
> > +		 * tables are shared, there will be issues when accessed by
> > +		 * someone else.
> > +		 */
> > +		__hugetlb_vma_unlock_write_free(vma);
> > +		i_mmap_unlock_write(vma->vm_file->f_mapping);
> > +	} else {
> > +		i_mmap_unlock_write(vma->vm_file->f_mapping);
> > +		hugetlb_vma_unlock_write(vma);
> > +	}
> >  }
> >  
> >  void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
> >  			  unsigned long end, struct page *ref_page,
> >  			  zap_flags_t zap_flags)
> >  {
> > +	struct mmu_notifier_range range;
> >  	struct mmu_gather tlb;
> >  
> > +	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
> 
> Should this be s/UNMAP/CLEAR/?  As IIUC the unmap path was only happening
> in __unmap_hugepage_range_final().

Right, unmap_hugepage_range is employed in the truncate and hole punch
path where we should be using the CLEAR notifier.
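
I.e. in the next version only the event in the init above changes (sketch):

	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
				start, end);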

> > +				start, end);
> > +	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
> >  	tlb_gather_mmu(&tlb, vma->vm_mm);
> > +
> >  	__unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);
> > +
> > +	mmu_notifier_invalidate_range_end(&range);
> >  	tlb_finish_mmu(&tlb);
> >  }
> >  
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index c7105ec6d08c..9d2625b8029a 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -772,7 +772,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> >   * Application no longer needs these pages.  If the pages are dirty,
> >   * it's OK to just throw them away.  The app will be more careful about
> >   * data it wants to keep.  Be sure to free swap resources too.  The
> > - * zap_page_range call sets things up for shrink_active_list to actually free
> > + * zap_vma_range call sets things up for shrink_active_list to actually free
> >   * these pages later if no one else has touched them in the meantime,
> >   * although we could add these pages to a global reuse list for
> >   * shrink_active_list to pick up before reclaiming other pages.
> > @@ -790,7 +790,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> >  static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> >  					unsigned long start, unsigned long end)
> >  {
> > -	zap_page_range(vma, start, end - start);
> > +	zap_vma_range(vma, start, end - start);
> 
> I'd rather just call zap_page_range_single() directly with NULL passed
> over, considering that this is for stable, but no strong opinions.

As mentioned in reply to Nadav, if we expose zap_page_range_single we
need to expose 'struct zap_details'.  Any concerns there?  That was the
primary reason I did not just call zap_page_range_single directly.
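
For reference, both options sketched out.  zap_vma_range() would be a thin
wrapper (its body is not quoted in this thread, so this is an assumption):

	void zap_vma_range(struct vm_area_struct *vma, unsigned long address,
			   unsigned long size)
	{
		zap_page_range_single(vma, address, size, NULL);
	}

while exposing zap_page_range_single() directly would need something like
this in include/linux/mm.h (parameter list assumed to match the current
static definition; a forward declaration of struct zap_details is enough as
long as callers only ever pass NULL):

	struct zap_details;
	void zap_page_range_single(struct vm_area_struct *vma,
				   unsigned long address, unsigned long size,
				   struct zap_details *details);
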
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v8 2/2] mm: remove zap_page_range and change callers to use zap_vma_range
  2022-11-10 21:27     ` Peter Xu
@ 2022-11-10 22:02       ` Nadav Amit
  2022-11-10 22:17         ` Mike Kravetz
  0 siblings, 1 reply; 16+ messages in thread
From: Nadav Amit @ 2022-11-10 22:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, Linux-MM, kernel list, Naoya Horiguchi,
	David Hildenbrand, Axel Rasmussen, Mina Almasry, Rik van Riel,
	Vlastimil Babka, Matthew Wilcox, Andrew Morton

On Nov 10, 2022, at 1:27 PM, Peter Xu <peterx@redhat.com> wrote:

> Hi, Nadav,
> 
> On Thu, Nov 10, 2022 at 01:09:43PM -0800, Nadav Amit wrote:
>> But, are the callers really able to guarantee that the ranges are all in a
>> single VMA? I am not familiar with the users, but how for instance
>> tcp_zerocopy_receive() can guarantee that no one did some mprotect() of some
>> sorts that caused the original VMA to be split?
> 
> Let me try to answer this one for Mike..  We have two callers in tcp
> zerocopy code for this function:
> 
> tcp_zerocopy_vm_insert_batch_error[2095] zap_page_range(vma, *address, maybe_zap_len);
> tcp_zerocopy_receive[2237]     zap_page_range(vma, address, total_bytes_to_map);
> 
> Both of them take the mmap lock for read, so firstly mprotect is not
> possible.
> 
> The 1st call has:
> 
> 	mmap_read_lock(current->mm);
> 
> 	vma = vma_lookup(current->mm, address);
> 	if (!vma || vma->vm_ops != &tcp_vm_ops) {
> 		mmap_read_unlock(current->mm);
> 		return -EINVAL;
> 	}
> 	vma_len = min_t(unsigned long, zc->length, vma->vm_end - address);
> 	avail_len = min_t(u32, vma_len, inq);
> 	total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
> 	if (total_bytes_to_map) {
> 		if (!(zc->flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT))
> 			zap_page_range(vma, address, total_bytes_to_map);
> 
> Here total_bytes_to_map comes from avail_len <--- vma_len, which is a min()
> of the rest vma range.  So total_bytes_to_map will never go beyond the vma.
> 
> The 2nd call uses maybe_zap_len as len, we need to look two layers of the
> callers, but ultimately it's something smaller than total_bytes_to_map we
> discussed.  Hopefully it proves 100% safety on tcp zerocopy.

Thanks Peter for the detailed explanation.

I had another look at the code and indeed it should not break. I am not sure
whether users who zero-copy receive and mprotect() part of the memory would
not be surprised, but I guess that’s a different story, which I should
further study at some point.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v8 1/2] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
  2022-11-10 21:48     ` Mike Kravetz
@ 2022-11-10 22:07       ` Peter Xu
  2022-11-10 22:22       ` Nadav Amit
  1 sibling, 0 replies; 16+ messages in thread
From: Peter Xu @ 2022-11-10 22:07 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Nadav Amit, Linux-MM, kernel list, Naoya Horiguchi,
	David Hildenbrand, Axel Rasmussen, Mina Almasry, Rik van Riel,
	Vlastimil Babka, Matthew Wilcox, Andrew Morton, Wei Chen, stable

On Thu, Nov 10, 2022 at 01:48:06PM -0800, Mike Kravetz wrote:
> However, Peter seems to prefer just exposing the current zap_page_range_single.  
> The issue with exposing zap_page_range_single as it is today, is that we also
> need to expose 'struct zap_details' which currently is not visible outside
> mm/memory.c.

Ah I see, that's okay to me.  Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v8 2/2] mm: remove zap_page_range and change callers to use zap_vma_range
  2022-11-10 22:02       ` Nadav Amit
@ 2022-11-10 22:17         ` Mike Kravetz
  0 siblings, 0 replies; 16+ messages in thread
From: Mike Kravetz @ 2022-11-10 22:17 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Xu, Linux-MM, kernel list, Naoya Horiguchi,
	David Hildenbrand, Axel Rasmussen, Mina Almasry, Rik van Riel,
	Vlastimil Babka, Matthew Wilcox, Andrew Morton

On 11/10/22 14:02, Nadav Amit wrote:
> On Nov 10, 2022, at 1:27 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> > Hi, Nadav,
> > 
> > On Thu, Nov 10, 2022 at 01:09:43PM -0800, Nadav Amit wrote:
> >> But, are the callers really able to guarantee that the ranges are all in a
> >> single VMA? I am not familiar with the users, but how for instance
> >> tcp_zerocopy_receive() can guarantee that no one did some mprotect() of some
> >> sorts that caused the original VMA to be split?
> > 
> > Let me try to answer this one for Mike..  We have two callers in tcp
> > zerocopy code for this function:
> > 
> > tcp_zerocopy_vm_insert_batch_error[2095] zap_page_range(vma, *address, maybe_zap_len);
> > tcp_zerocopy_receive[2237]     zap_page_range(vma, address, total_bytes_to_map);
> > 
> > Both of them take the mmap lock for read, so firstly mprotect is not
> > possible.
> > 
> > The 1st call has:
> > 
> > 	mmap_read_lock(current->mm);
> > 
> > 	vma = vma_lookup(current->mm, address);
> > 	if (!vma || vma->vm_ops != &tcp_vm_ops) {
> > 		mmap_read_unlock(current->mm);
> > 		return -EINVAL;
> > 	}
> > 	vma_len = min_t(unsigned long, zc->length, vma->vm_end - address);
> > 	avail_len = min_t(u32, vma_len, inq);
> > 	total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
> > 	if (total_bytes_to_map) {
> > 		if (!(zc->flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT))
> > 			zap_page_range(vma, address, total_bytes_to_map);
> > 
> > Here total_bytes_to_map comes from avail_len <--- vma_len, which is a min()
> > of the rest vma range.  So total_bytes_to_map will never go beyond the vma.
> > 
> > The 2nd call uses maybe_zap_len as len, we need to look two layers of the
> > callers, but ultimately it's something smaller than total_bytes_to_map we
> > discussed.  Hopefully it proves 100% safety on tcp zerocopy.
> 
> Thanks Peter for the detailed explanation.
> 
> I had another look at the code and indeed it should not break. I am not sure
> whether users who zero-copy receive and mprotect() part of the memory would
> not be surprised, but I guess that’s a different story, which I should
> further study at some point.

I did audit all the calling sites and am fairly certain the passed ranges
are within a single vma.  Because of this, Peter suggested removing
zap_page_range.  If there is a concern, we can just fix up the mmu notifiers
in zap_page_range and leave it in place.  That is what is done in the patch
currently in mm-hotfixes-unstable.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v8 1/2] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
  2022-11-10 21:48     ` Mike Kravetz
  2022-11-10 22:07       ` Peter Xu
@ 2022-11-10 22:22       ` Nadav Amit
  2022-11-10 22:31         ` Mike Kravetz
  1 sibling, 1 reply; 16+ messages in thread
From: Nadav Amit @ 2022-11-10 22:22 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linux-MM, kernel list, Naoya Horiguchi, David Hildenbrand,
	Axel Rasmussen, Mina Almasry, Peter Xu, Rik van Riel,
	Vlastimil Babka, Matthew Wilcox, Andrew Morton, Wei Chen, stable

On Nov 10, 2022, at 1:48 PM, Mike Kravetz <mike.kravetz@oracle.com> wrote:

>>> void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
>>> 			  unsigned long end, struct page *ref_page,
>>> 			  zap_flags_t zap_flags)
>>> {
>>> +	struct mmu_notifier_range range;
>>> 	struct mmu_gather tlb;
>>> 
>>> +	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
>>> +				start, end);
>>> +	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
>>> 	tlb_gather_mmu(&tlb, vma->vm_mm);
>>> +
>>> 	__unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);
>> 
>> Is there a reason for not using range.start and range.end?
> 
> After calling adjust_range_if_pmd_sharing_possible, range.start - range.end
> could be much greater than the range we actually want to unmap.  The range
> gets adjusted to account for pmd sharing if that is POSSIBLE.  It does not
> know for sure if we will actually 'unshare a pmd'.
> 
> I suppose adjust_range_if_pmd_sharing_possible could be modified to actually
> check if unmapping will result in unsharing, but it does not do that today.

Thanks for the explanation. It’s probably me, but I am still not sure that I
understand the difference between __unmap_hugepage_range() using (start, end)
and __zap_page_range_single() using (address, range.end). Perhaps it is worth
a comment in the code?

But anyhow… shouldn’t unmap_hugepage_range() call
mmu_notifier_invalidate_range_start()?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v8 1/2] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
  2022-11-10 22:22       ` Nadav Amit
@ 2022-11-10 22:31         ` Mike Kravetz
  0 siblings, 0 replies; 16+ messages in thread
From: Mike Kravetz @ 2022-11-10 22:31 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux-MM, kernel list, Naoya Horiguchi, David Hildenbrand,
	Axel Rasmussen, Mina Almasry, Peter Xu, Rik van Riel,
	Vlastimil Babka, Matthew Wilcox, Andrew Morton, Wei Chen, stable

On 11/10/22 14:22, Nadav Amit wrote:
> On Nov 10, 2022, at 1:48 PM, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
> >>> void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
> >>> 			  unsigned long end, struct page *ref_page,
> >>> 			  zap_flags_t zap_flags)
> >>> {
> >>> +	struct mmu_notifier_range range;
> >>> 	struct mmu_gather tlb;
> >>> 
> >>> +	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
> >>> +				start, end);
> >>> +	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
> >>> 	tlb_gather_mmu(&tlb, vma->vm_mm);
> >>> +
> >>> 	__unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);
> >> 
> >> Is there a reason for not using range.start and range.end?
> > 
> > After calling adjust_range_if_pmd_sharing_possible, range.start - range.end
> > could be much greater than the range we actually want to unmap.  The range
> > gets adjusted to account for pmd sharing if that is POSSIBLE.  It does not
> > know for sure if we will actually 'unshare a pmd'.
> > 
> > I suppose adjust_range_if_pmd_sharing_possible could be modified to actually
> > check if unmapping will result in unsharing, but it does not do that today.
> 
> Thanks for the explanation. It’s probably me, but I am still not sure that I
> understand the difference between __unmap_hugepage_range() using (start, end)
> and __zap_page_range_single() using (address, range.end). Perhaps it is worth
> a comment in the code?

__zap_page_range_single is wrong.  It should have been updated to use the
range address .. (address + size).
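
In other words something along these lines, where only the notifier range is
widened for possible pmd sharing and the actual unmap sticks to the caller's
range (a sketch of the intent, not the final patch):

	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
				address, address + size);
	if (is_vm_hugetlb_page(vma))
		adjust_range_if_pmd_sharing_possible(vma, &range.start,
						     &range.end);
	mmu_notifier_invalidate_range_start(&range);
	/*
	 * unmap address .. (address + size), not range.start .. range.end,
	 * as the latter may have been expanded above.
	 */
	unmap_single_vma(&tlb, vma, address, address + size, details);
	mmu_notifier_invalidate_range_end(&range);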

At Peter's suggestion the mmu notifier updates will be broken out in a
separate patch.  I will also add comments, to make this easier to follow.

> But anyhow… shouldn’t unmap_hugepage_range() call
> mmu_notifier_invalidate_range_start()?

Yes, thanks!
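
I.e. the pairing in unmap_hugepage_range() will end up looking roughly like
this (sketch, using the CLEAR event as discussed earlier in the thread):

	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
				start, end);
	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
	tlb_gather_mmu(&tlb, vma->vm_mm);
	mmu_notifier_invalidate_range_start(&range);

	__unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);

	mmu_notifier_invalidate_range_end(&range);
	tlb_finish_mmu(&tlb);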

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2022-11-10 22:34 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-08  1:19 [PATCH v8 0/2] hugetlb MADV_DONTNEED fix and zap_page_range cleanup Mike Kravetz
2022-11-08  1:19 ` [PATCH v8 1/2] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing Mike Kravetz
2022-11-10 20:56   ` Peter Xu
2022-11-10 21:55     ` Mike Kravetz
2022-11-10 20:59   ` Nadav Amit
2022-11-10 21:48     ` Mike Kravetz
2022-11-10 22:07       ` Peter Xu
2022-11-10 22:22       ` Nadav Amit
2022-11-10 22:31         ` Mike Kravetz
2022-11-10 21:07   ` Peter Xu
2022-11-08  1:19 ` [PATCH v8 2/2] mm: remove zap_page_range and change callers to use zap_vma_range Mike Kravetz
2022-11-10 21:09   ` Nadav Amit
2022-11-10 21:27     ` Peter Xu
2022-11-10 22:02       ` Nadav Amit
2022-11-10 22:17         ` Mike Kravetz
2022-11-10 19:46 ` [PATCH v8 0/2] hugetlb MADV_DONTNEED fix and zap_page_range cleanup Mike Kravetz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).