* [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare
@ 2022-11-18  1:10 Peter Xu
  2022-11-18  1:10 ` [PATCH RFC v2 01/12] mm/hugetlb: Let vma_offset_start() to return start Peter Xu
                   ` (12 more replies)
  0 siblings, 13 replies; 22+ messages in thread
From: Peter Xu @ 2022-11-18  1:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Rik van Riel, Muchun Song, Andrew Morton, peterx, James Houghton,
	Nadav Amit, Andrea Arcangeli, David Hildenbrand, Miaohe Lin,
	Mike Kravetz

Based on latest mm-unstable (96aa38b69507).

This can be seen as a follow-up to Mike's recent hugetlb vma lock series
for pmd unsharing, so this series also depends on that one.  Hopefully it
makes the resolution for pmd unsharing more complete.

PS: so far no one has strongly ACKed this, so I'm keeping the RFC tag.  But
I'm already more confident in it than in many of the RFCs I've posted.

PS2: there are a lot of changes compared to rfcv1, so I'm not including a
changelog.  The whole idea is still the same, though.

Problem
=======

huge_pte_offset() is a major helper used by hugetlb code paths to walk a
hugetlb pgtable.  It's used almost everywhere, since the walk is needed
even before taking the pgtable lock.

huge_pte_offset() is always called with the mmap lock held for either read
or write.

For normal memory types that's sufficient, since any pgtable removal
requires the mmap write lock (e.g. munmap or mm destruction).  However,
hugetlb has the pmd unshare feature, which means that not only can a
pgtable page be removed while we walk, but even the very pgtable page
we're walking can be freed once unshared (in this case it can only be the
huge PUD page which contains 512 huge pmd entries, with the vma VM_SHARED
mapped).  That's possible because even though freeing the pgtable page
requires the mmap write lock, that doesn't help when the pgtable is shared
with another mm: we're still at risk even while holding current->mm's mmap
lock.
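
To illustrate, here is a simplified, hypothetical interleaving of the race
in question (an illustration based on the description above, not a
reproduced trace):

  thread 1 (walking mm A,              thread 2 (another task touching the
            mmap lock held)                      same shared pmd range)
  -----------------------------------  -----------------------------------
  ptep = huge_pte_offset(A's mm, ...)
    (ptep points into the shared
     pmd pgtable page)
                                       pmd unshare drops mm A's reference
                                       last sharer unmaps; the shared pmd
                                         pgtable page is freed
  huge_ptep_get(ptep)
    -> use-after-free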

The recent work from Mike on the vma lock can resolve most of this
already.  It works by forbidding pmd unsharing while the lock is held, so
there is no further risk of the pgtable page being freed.  That means if
we can take the vma lock around all huge_pte_offset() callers it'll be
safe.

A bunch of callers already do that as per the latest mm-unstable, but
quite a few others don't, for various reasons.  E.g. it may not be
applicable in contexts that are not allowed to sleep, like FOLL_NOWAIT.
Or, huge_pmd_share() is actually a tricky user of huge_pte_offset(),
because even if we took the vma lock, we'd be walking another mm's vma!
Taking the vma lock for all those vmas is probably not going to work.

I have no report showing that such a race can be triggered, but from the
code I see nothing that stops it from happening.  This series is trying to
resolve that problem.

Resolution
==========

What this series proposes, besides using the vma lock, is that we can also
use other ways to protect the pgtable page from being freed from under us
in huge_pte_offset() context.  The idea is similar to RCU fast-gup.  Note
that fast-gup is safe regarding pmd unsharing even without the vma lock,
because fast-gup relies on RCU to protect walking any pgtable page,
including another mm's.  So fast-gup will never hit a freed page even if
pmd sharing is possible.

Applying the same idea to huge_pte_offset() means that, with proper RCU
protection, the pte_t* pointer returned from huge_pte_offset() is also
always safe to access and de-reference, along with the pgtable lock that
is bound to the pgtable page.  Note that RCU will only protect pgtables if
MMU_GATHER_RCU_TABLE_FREE=y; otherwise we need to disable irqs.  Of
course, none of this locking is needed if pmd sharing is not possible at
all, or on private hugetlb mappings.
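
In other words, the calling pattern the later patches converge on looks
roughly like the sketch below (a simplified version of what patch 7 does
in hugetlb_follow_page_mask(); the lock helpers are the ones introduced in
patch 4):

  hugetlb_walker_lock();    /* no-op / rcu_read_lock() / local_irq_disable() */
  pte = huge_pte_offset(mm, haddr, huge_page_size(h));
  if (pte) {
          ptl = huge_pte_lock(h, mm, pte);
          /* safe to read and de-reference the entry here */
          spin_unlock(ptl);
  }
  hugetlb_walker_unlock();  /* pte must not be de-referenced after this */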

Patch Layout
============

Patch 1-3:         cleanups, or dependencies of the follow-up patches
Patch 4:           the core patch to introduce hugetlb walker lock
Patch 5-11:        each patch resolves one possible race condition
Patch 12:          introduce hugetlb_walk() to replace huge_pte_offset()

Tests
=====

Only lightly tested with the hugetlb kselftests, including uffd.

Comments welcomed, thanks.

Peter Xu (12):
  mm/hugetlb: Let vma_offset_start() to return start
  mm/hugetlb: Move swap entry handling into vma lock for fault
  mm/hugetlb: Don't wait for migration entry during follow page
  mm/hugetlb: Add pgtable walker lock
  mm/hugetlb: Make userfaultfd_huge_must_wait() safe to pmd unshare
  mm/hugetlb: Protect huge_pmd_share() with walker lock
  mm/hugetlb: Use hugetlb walker lock in hugetlb_follow_page_mask()
  mm/hugetlb: Use hugetlb walker lock in follow_hugetlb_page()
  mm/hugetlb: Use hugetlb walker lock in hugetlb_vma_maps_page()
  mm/hugetlb: Use hugetlb walker lock in walk_hugetlb_range()
  mm/hugetlb: Use hugetlb walker lock in page_vma_mapped_walk()
  mm/hugetlb: Introduce hugetlb_walk()

 arch/s390/mm/gmap.c      |   2 +
 fs/hugetlbfs/inode.c     |  41 +++++++-------
 fs/proc/task_mmu.c       |   2 +
 fs/userfaultfd.c         |  24 ++++++---
 include/linux/hugetlb.h  | 112 +++++++++++++++++++++++++++++++++++++++
 include/linux/pagewalk.h |   9 +++-
 include/linux/rmap.h     |   4 ++
 mm/hugetlb.c             |  97 +++++++++++++++++----------------
 mm/page_vma_mapped.c     |   7 ++-
 mm/pagewalk.c            |   6 +--
 10 files changed, 224 insertions(+), 80 deletions(-)

-- 
2.37.3


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH RFC v2 01/12] mm/hugetlb: Let vma_offset_start() to return start
  2022-11-18  1:10 [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare Peter Xu
@ 2022-11-18  1:10 ` Peter Xu
  2022-11-18  1:10 ` [PATCH RFC v2 02/12] mm/hugetlb: Move swap entry handling into vma lock for fault Peter Xu
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2022-11-18  1:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Rik van Riel, Muchun Song, Andrew Morton, peterx, James Houghton,
	Nadav Amit, Andrea Arcangeli, David Hildenbrand, Miaohe Lin,
	Mike Kravetz

Despite its name, vma_offset_start() doesn't return "the start address of
the range" but rather the offset to apply to the vma->vm_start address.

Make it return the real start vaddr instead.  This also helps all the
callers, because whenever the return value is used it's ultimately added
to vma->vm_start anyway.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/hugetlbfs/inode.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 790d2727141a..fdb16246f46e 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -412,10 +412,12 @@ static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
  */
 static unsigned long vma_offset_start(struct vm_area_struct *vma, pgoff_t start)
 {
+	unsigned long offset = 0;
+
 	if (vma->vm_pgoff < start)
-		return (start - vma->vm_pgoff) << PAGE_SHIFT;
-	else
-		return 0;
+		offset = (start - vma->vm_pgoff) << PAGE_SHIFT;
+
+	return vma->vm_start + offset;
 }
 
 static unsigned long vma_offset_end(struct vm_area_struct *vma, pgoff_t end)
@@ -457,7 +459,7 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
 		v_start = vma_offset_start(vma, start);
 		v_end = vma_offset_end(vma, end);
 
-		if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
+		if (!hugetlb_vma_maps_page(vma, v_start, page))
 			continue;
 
 		if (!hugetlb_vma_trylock_write(vma)) {
@@ -473,8 +475,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
 			break;
 		}
 
-		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
-				NULL, ZAP_FLAG_DROP_MARKER);
+		unmap_hugepage_range(vma, v_start, v_end, NULL,
+				     ZAP_FLAG_DROP_MARKER);
 		hugetlb_vma_unlock_write(vma);
 	}
 
@@ -507,10 +509,9 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
 		 */
 		v_start = vma_offset_start(vma, start);
 		v_end = vma_offset_end(vma, end);
-		if (hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
-			unmap_hugepage_range(vma, vma->vm_start + v_start,
-						v_end, NULL,
-						ZAP_FLAG_DROP_MARKER);
+		if (hugetlb_vma_maps_page(vma, v_start, page))
+			unmap_hugepage_range(vma, v_start, v_end, NULL,
+					     ZAP_FLAG_DROP_MARKER);
 
 		kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
 		hugetlb_vma_unlock_write(vma);
@@ -540,8 +541,7 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
 		v_start = vma_offset_start(vma, start);
 		v_end = vma_offset_end(vma, end);
 
-		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
-				     NULL, zap_flags);
+		unmap_hugepage_range(vma, v_start, v_end, NULL, zap_flags);
 
 		/*
 		 * Note that vma lock only exists for shared/non-private
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH RFC v2 02/12] mm/hugetlb: Move swap entry handling into vma lock for fault
  2022-11-18  1:10 [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare Peter Xu
  2022-11-18  1:10 ` [PATCH RFC v2 01/12] mm/hugetlb: Let vma_offset_start() to return start Peter Xu
@ 2022-11-18  1:10 ` Peter Xu
  2022-11-18  1:35   ` Peter Xu
  2022-11-18  1:10 ` [PATCH RFC v2 03/12] mm/hugetlb: Don't wait for migration entry during follow page Peter Xu
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 22+ messages in thread
From: Peter Xu @ 2022-11-18  1:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Rik van Riel, Muchun Song, Andrew Morton, peterx, James Houghton,
	Nadav Amit, Andrea Arcangeli, David Hildenbrand, Miaohe Lin,
	Mike Kravetz

In hugetlb_fault(), there used to be a special path at the entrance to
handle swap entries using huge_pte_offset().  That's unsafe, because
huge_pte_offset() on a pmd-sharable range can access freed pgtables
without either the walker lock or the vma lock held.

The simplest way to make it safe is to move the swap handling to after the
vma lock is held.  We may now need to take the fault mutex for migration
or hwpoison entries (and the vma lock too, but that one is really needed);
neither of them is a hot path, though, so it should be fine.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 24 +++++++-----------------
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c3aab6d5b7aa..62ff3fc51d4e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5824,22 +5824,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	int need_wait_lock = 0;
 	unsigned long haddr = address & huge_page_mask(h);
 
-	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
-	if (ptep) {
-		/*
-		 * Since we hold no locks, ptep could be stale.  That is
-		 * OK as we are only making decisions based on content and
-		 * not actually modifying content here.
-		 */
-		entry = huge_ptep_get(ptep);
-		if (unlikely(is_hugetlb_entry_migration(entry))) {
-			migration_entry_wait_huge(vma, ptep);
-			return 0;
-		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
-			return VM_FAULT_HWPOISON_LARGE |
-				VM_FAULT_SET_HINDEX(hstate_index(h));
-	}
-
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
@@ -5886,8 +5870,14 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * fault, and is_hugetlb_entry_(migration|hwpoisoned) check will
 	 * properly handle it.
 	 */
-	if (!pte_present(entry))
+	if (!pte_present(entry)) {
+		if (unlikely(is_hugetlb_entry_migration(entry)))
+			migration_entry_wait_huge(vma, ptep);
+		else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
+			ret = VM_FAULT_HWPOISON_LARGE |
+			    VM_FAULT_SET_HINDEX(hstate_index(h));
 		goto out_mutex;
+	}
 
 	/*
 	 * If we are going to COW/unshare the mapping later, we examine the
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH RFC v2 03/12] mm/hugetlb: Don't wait for migration entry during follow page
  2022-11-18  1:10 [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare Peter Xu
  2022-11-18  1:10 ` [PATCH RFC v2 01/12] mm/hugetlb: Let vma_offset_start() to return start Peter Xu
  2022-11-18  1:10 ` [PATCH RFC v2 02/12] mm/hugetlb: Move swap entry handling into vma lock for fault Peter Xu
@ 2022-11-18  1:10 ` Peter Xu
  2022-11-18  1:10 ` [PATCH RFC v2 04/12] mm/hugetlb: Add pgtable walker lock Peter Xu
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2022-11-18  1:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Rik van Riel, Muchun Song, Andrew Morton, peterx, James Houghton,
	Nadav Amit, Andrea Arcangeli, David Hildenbrand, Miaohe Lin,
	Mike Kravetz

That's what the code does for !hugetlb pages, so we should logically do
the same for hugetlb, i.e. a migration entry will also be treated as no
page.

This is probably also the last piece of the follow_page code that may
sleep; the previous one was removed in cf994dd8af27 ("mm/gup: remove
FOLL_MIGRATION", 2022-11-16).

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 62ff3fc51d4e..61a1fa678172 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6222,7 +6222,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 	if (WARN_ON_ONCE(flags & FOLL_PIN))
 		return NULL;
 
-retry:
 	pte = huge_pte_offset(mm, haddr, huge_page_size(h));
 	if (!pte)
 		return NULL;
@@ -6245,16 +6244,6 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 			page = NULL;
 			goto out;
 		}
-	} else {
-		if (is_hugetlb_entry_migration(entry)) {
-			spin_unlock(ptl);
-			__migration_entry_wait_huge(pte, ptl);
-			goto retry;
-		}
-		/*
-		 * hwpoisoned entry is treated as no_page_table in
-		 * follow_page_mask().
-		 */
 	}
 out:
 	spin_unlock(ptl);
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH RFC v2 04/12] mm/hugetlb: Add pgtable walker lock
  2022-11-18  1:10 [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare Peter Xu
                   ` (2 preceding siblings ...)
  2022-11-18  1:10 ` [PATCH RFC v2 03/12] mm/hugetlb: Don't wait for migration entry during follow page Peter Xu
@ 2022-11-18  1:10 ` Peter Xu
  2022-11-18  1:10 ` [PATCH RFC v2 05/12] mm/hugetlb: Make userfaultfd_huge_must_wait() safe to pmd unshare Peter Xu
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2022-11-18  1:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Rik van Riel, Muchun Song, Andrew Morton, peterx, James Houghton,
	Nadav Amit, Andrea Arcangeli, David Hildenbrand, Miaohe Lin,
	Mike Kravetz

huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a
hugetlb address.

Normally, it's always safe to walk a generic pgtable as long as we hold
the mmap lock for either read or write, because that guarantees the
pgtable pages will remain valid throughout.

But that's not true for hugetlbfs, especially shared mappings: a hugetlb
pgtable can be freed by pmd unsharing, which means that even with the mmap
lock held for the current mm, the PMD pgtable page can still go away from
under us if pmd unsharing is possible during the walk.

So we have three ways to make it safe:

  (1) If the mapping is private we're not prone to pmd sharing or
      unsharing, so it's okay.

  (2) If we're with the hugetlb vma lock held for either read/write, it's
      okay too because pmd unshare cannot happen at all.

  (3) Otherwise we may need the pgtable walker lock

The pgtable walker lock is implemented in different ways depending on the
config:

  - if !ARCH_WANT_HUGE_PMD_SHARE:      it's no-op
  - else if MMU_GATHER_RCU_TABLE_FREE: it should be rcu_read_lock()
  - else:                              it should be local_irq_disable()

A more efficient approach would be to take the lock only for vmas that can
have pmd sharing (aka, want_pmd_share() returns true), but since the lock
is mostly lightweight (only riscv will use irq disable; x86 & arm will use
rcu), just take it unconditionally for now.

Document all these explicitly for huge_pte_offset() too, because it's not
that obvious.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/hugetlb.h | 99 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 551834cd5299..8f85ad0d5bdb 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -12,6 +12,8 @@
 #include <linux/pgtable.h>
 #include <linux/gfp.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/irqflags.h>
+#include <linux/rcupdate.h>
 
 struct ctl_table;
 struct user_struct;
@@ -24,6 +26,71 @@ typedef struct { unsigned long pd; } hugepd_t;
 #define __hugepd(x) ((hugepd_t) { (x) })
 #endif
 
+/**
+ * @hugetlb_walker_[un]lock(): protects software hugetlb pgtable walkers
+ *
+ * The hugetlb walker lock needs to be taken before huge_pte_offset() and
+ * needs to be released after finishing using the pte_t* returned.  It's
+ * used to guarantee we won't access a freed pgtable page.  Normally in
+ * this case we already have the mmap lock held for read.
+ *
+ * When holding the hugetlb walker lock, the thread cannot sleep, nor can
+ * it already be in irq context (just to simplify unlock with no-save for
+ * !RCU_TABLE_FREE).  Before the thread yields itself, it should release
+ * the walker lock.  After the lock is released, the pte_t* can become
+ * invalid anytime, so it cannot be accessed anymore.
+ *
+ * The only reason for the lock is to protect concurrent pmd unsharing
+ * followed by e.g. munmap() to drop the pgtable page.  For no-pmd-sharing
+ * archs, it's no-op because it's always safe to access hugetlb pgtables
+ * just like generic pgtables.
+ */
+#if !defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE)
+static inline void hugetlb_walker_lock(void)
+{
+}
+static inline void hugetlb_walker_unlock(void)
+{
+}
+static inline bool __hugetlb_walker_locked(void)
+{
+	return true;
+}
+#elif defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
+static inline void hugetlb_walker_lock(void)
+{
+	rcu_read_lock();
+}
+static inline void hugetlb_walker_unlock(void)
+{
+	rcu_read_unlock();
+}
+static inline bool __hugetlb_walker_locked(void)
+{
+	return rcu_read_lock_held();
+}
+#else
+static inline void hugetlb_walker_lock(void)
+{
+	WARN_ON_ONCE(irqs_disabled());
+	local_irq_disable();
+}
+static inline void hugetlb_walker_unlock(void)
+{
+	local_irq_enable();
+}
+static inline bool __hugetlb_walker_locked(void)
+{
+	return irqs_disabled();
+}
+#endif
+
+#ifdef CONFIG_LOCKDEP
+#define  hugetlb_walker_locked()  __hugetlb_walker_locked()
+#else
+#define  hugetlb_walker_locked()  true
+#endif
+
 #ifdef CONFIG_HUGETLB_PAGE
 
 #include <linux/mempolicy.h>
@@ -192,6 +259,38 @@ extern struct list_head huge_boot_pages;
 
 pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long addr, unsigned long sz);
+/*
+ * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
+ * Returns the pte_t* if found, or NULL if the address is not mapped.
+ *
+ * NOTE: since this function will walk all the pgtable pages (including not
+ * only the high-level pgtable pages, but also the PUD entries that can be
+ * unshared concurrently for VM_SHARED), the caller of this function should
+ * be responsible for its thread safety.  One can follow this rule:
+ *
+ *  (1) For private mappings: pmd unsharing is not possible, so it'll
+ *      always be safe if we're with the mmap sem for either read or write.
+ *      This is normally always the case, IOW we don't need to do anything
+ *      special.
+ *
+ *  (2) For shared mappings: pmd unsharing is possible (so the PUD-ranged
+ *      pgtable page can go away from under us!  It can be done by a pmd
+ *      unshare with a follow up munmap() on the other process), then we
+ *      need either:
+ *
+ *     (2.1) hugetlb vma lock read or write held, to make sure pmd unshare
+ *           won't happen upon the range (it also makes sure the pte_t we
+ *           read is the right and stable one), or,
+ *
+ *     (2.2) pgtable walker lock, to make sure even pmd unsharing happened,
+ *           the old shared PUD page won't get freed from under us, so even
+ *           the pteval can be obsolete, at least it's still always safe to
+ *           access the pgtable page (e.g., de-referencing pte_t* would not
+ *           cause use-after-free).
+ *
+ * PS: regarding (2.2), it's the same logic that makes fast-gup safe for
+ * generic mm when walking the pgtables even without the mmap lock.
+ */
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
 unsigned long hugetlb_mask_last_page(struct hstate *h);
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH RFC v2 05/12] mm/hugetlb: Make userfaultfd_huge_must_wait() safe to pmd unshare
  2022-11-18  1:10 [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare Peter Xu
                   ` (3 preceding siblings ...)
  2022-11-18  1:10 ` [PATCH RFC v2 04/12] mm/hugetlb: Add pgtable walker lock Peter Xu
@ 2022-11-18  1:10 ` Peter Xu
  2022-11-18  1:10 ` [PATCH RFC v2 06/12] mm/hugetlb: Protect huge_pmd_share() with walker lock Peter Xu
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2022-11-18  1:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Rik van Riel, Muchun Song, Andrew Morton, peterx, James Houghton,
	Nadav Amit, Andrea Arcangeli, David Hildenbrand, Miaohe Lin,
	Mike Kravetz

We could take the hugetlb walker lock here; instead, take the vma lock
directly.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 07c81ab3fd4d..a602f008dde5 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -376,7 +376,8 @@ static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
  */
 vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 {
-	struct mm_struct *mm = vmf->vma->vm_mm;
+	struct vm_area_struct *vma = vmf->vma;
+	struct mm_struct *mm = vma->vm_mm;
 	struct userfaultfd_ctx *ctx;
 	struct userfaultfd_wait_queue uwq;
 	vm_fault_t ret = VM_FAULT_SIGBUS;
@@ -403,7 +404,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	 */
 	mmap_assert_locked(mm);
 
-	ctx = vmf->vma->vm_userfaultfd_ctx.ctx;
+	ctx = vma->vm_userfaultfd_ctx.ctx;
 	if (!ctx)
 		goto out;
 
@@ -493,6 +494,13 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 
 	blocking_state = userfaultfd_get_blocking_state(vmf->flags);
 
+	/*
+	 * This stabilizes the pgtable for hugetlb against e.g. pmd unsharing.
+	 * It needs to be done before setting the current state.
+	 */
+	if (is_vm_hugetlb_page(vma))
+		hugetlb_vma_lock_read(vma);
+
 	spin_lock_irq(&ctx->fault_pending_wqh.lock);
 	/*
 	 * After the __add_wait_queue the uwq is visible to userland
@@ -507,13 +515,15 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	set_current_state(blocking_state);
 	spin_unlock_irq(&ctx->fault_pending_wqh.lock);
 
-	if (!is_vm_hugetlb_page(vmf->vma))
+	if (!is_vm_hugetlb_page(vma))
 		must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
 						  reason);
 	else
-		must_wait = userfaultfd_huge_must_wait(ctx, vmf->vma,
+		must_wait = userfaultfd_huge_must_wait(ctx, vma,
 						       vmf->address,
 						       vmf->flags, reason);
+	if (is_vm_hugetlb_page(vma))
+		hugetlb_vma_unlock_read(vma);
 	mmap_read_unlock(mm);
 
 	if (likely(must_wait && !READ_ONCE(ctx->released))) {
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH RFC v2 06/12] mm/hugetlb: Protect huge_pmd_share() with walker lock
  2022-11-18  1:10 [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare Peter Xu
                   ` (4 preceding siblings ...)
  2022-11-18  1:10 ` [PATCH RFC v2 05/12] mm/hugetlb: Make userfaultfd_huge_must_wait() safe to pmd unshare Peter Xu
@ 2022-11-18  1:10 ` Peter Xu
  2022-11-18  1:17   ` Peter Xu
  2022-11-18  1:10 ` [PATCH RFC v2 07/12] mm/hugetlb: Use hugetlb walker lock in hugetlb_follow_page_mask() Peter Xu
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 22+ messages in thread
From: Peter Xu @ 2022-11-18  1:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Rik van Riel, Muchun Song, Andrew Morton, peterx, James Houghton,
	Nadav Amit, Andrea Arcangeli, David Hildenbrand, Miaohe Lin,
	Mike Kravetz

huge_pmd_share() is normally called with the vma lock already held, which
makes it feel like we don't really need the walker lock.  But that's not
true, because we only took the vma lock for the "current" vma, not for all
the vmas whose pgtables we're going to walk.

Taking each vma's lock could lead to deadlocks and would be hard to order.
The safe approach is to use the walker lock, which guarantees the pgtable
page stays alive; then we should use get_page_unless_zero() rather than
plain get_page(), to make sure the pgtable page is not being freed in the
meantime.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 61a1fa678172..5ef883184885 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7008,6 +7008,13 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 
 	i_mmap_lock_read(mapping);
+
+	/*
+	 * NOTE: even if we've got the vma read lock, here we still need to
+	 * take the walker lock, because we're not walking the current vma,
+	 * but some other mm's!
+	 */
+	hugetlb_walker_lock();
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -7016,12 +7023,15 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (saddr) {
 			spte = huge_pte_offset(svma->vm_mm, saddr,
 					       vma_mmu_pagesize(svma));
-			if (spte) {
-				get_page(virt_to_page(spte));
+			/*
+			 * When page ref==0, it means it's probably being
+			 * freed; continue with the next one.
+			 */
+			if (spte && get_page_unless_zero(virt_to_page(spte)))
 				break;
-			}
 		}
 	}
+	hugetlb_walker_unlock();
 
 	if (!spte)
 		goto out;
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH RFC v2 07/12] mm/hugetlb: Use hugetlb walker lock in hugetlb_follow_page_mask()
  2022-11-18  1:10 [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare Peter Xu
                   ` (5 preceding siblings ...)
  2022-11-18  1:10 ` [PATCH RFC v2 06/12] mm/hugetlb: Protect huge_pmd_share() with walker lock Peter Xu
@ 2022-11-18  1:10 ` Peter Xu
  2022-11-18  1:10 ` [PATCH RFC v2 08/12] mm/hugetlb: Use hugetlb walker lock in follow_hugetlb_page() Peter Xu
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2022-11-18  1:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Rik van Riel, Muchun Song, Andrew Morton, peterx, James Houghton,
	Nadav Amit, Andrea Arcangeli, David Hildenbrand, Miaohe Lin,
	Mike Kravetz

Hugetlb walker lock makes sure the pte_t* won't go away from under us.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5ef883184885..fc49e3ca8acd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6222,9 +6222,10 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 	if (WARN_ON_ONCE(flags & FOLL_PIN))
 		return NULL;
 
+	hugetlb_walker_lock();
 	pte = huge_pte_offset(mm, haddr, huge_page_size(h));
 	if (!pte)
-		return NULL;
+		goto out_unlock;
 
 	ptl = huge_pte_lock(h, mm, pte);
 	entry = huge_ptep_get(pte);
@@ -6247,6 +6248,8 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 	}
 out:
 	spin_unlock(ptl);
+out_unlock:
+	hugetlb_walker_unlock();
 	return page;
 }
 
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH RFC v2 08/12] mm/hugetlb: Use hugetlb walker lock in follow_hugetlb_page()
  2022-11-18  1:10 [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare Peter Xu
                   ` (6 preceding siblings ...)
  2022-11-18  1:10 ` [PATCH RFC v2 07/12] mm/hugetlb: Use hugetlb walker lock in hugetlb_follow_page_mask() Peter Xu
@ 2022-11-18  1:10 ` Peter Xu
  2022-11-18  1:10 ` [PATCH RFC v2 09/12] mm/hugetlb: Use hugetlb walker lock in hugetlb_vma_maps_page() Peter Xu
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2022-11-18  1:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Rik van Riel, Muchun Song, Andrew Morton, peterx, James Houghton,
	Nadav Amit, Andrea Arcangeli, David Hildenbrand, Miaohe Lin,
	Mike Kravetz

Hugetlb walker lock makes sure the pte_t* won't go away from under us.
A small trick is used to release the walker lock slightly earlier once a
present pte has been found.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index fc49e3ca8acd..e81af6a46c59 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6280,6 +6280,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			break;
 		}
 
+		hugetlb_walker_lock();
 		/*
 		 * Some archs (sparc64, sh*) have multiple pte_ts to
 		 * each hugepage.  We have to make sure we get the
@@ -6304,6 +6305,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		    !hugetlbfs_pagecache_present(h, vma, vaddr)) {
 			if (pte)
 				spin_unlock(ptl);
+			hugetlb_walker_unlock();
 			remainder = 0;
 			break;
 		}
@@ -6325,6 +6327,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 			if (pte)
 				spin_unlock(ptl);
+			hugetlb_walker_unlock();
+
 			if (flags & FOLL_WRITE)
 				fault_flags |= FAULT_FLAG_WRITE;
 			else if (unshare)
@@ -6367,6 +6371,15 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			continue;
 		}
 
+		/*
+		 * When we reach here, it means the pteval is not absent, so
+		 * anyone who wants to free and invalidate the pgtable page
+		 * (aka, pte*) needs to first unmap the entries, which
+		 * requires the pgtable lock.  Since we're holding it,
+		 * we're safe even without the walker lock anymore.
+		 */
+		hugetlb_walker_unlock();
+
 		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
 		page = pte_page(huge_ptep_get(pte));
 
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH RFC v2 09/12] mm/hugetlb: Use hugetlb walker lock in hugetlb_vma_maps_page()
  2022-11-18  1:10 [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare Peter Xu
                   ` (7 preceding siblings ...)
  2022-11-18  1:10 ` [PATCH RFC v2 08/12] mm/hugetlb: Use hugetlb walker lock in follow_hugetlb_page() Peter Xu
@ 2022-11-18  1:10 ` Peter Xu
  2022-11-18  1:10 ` [PATCH RFC v2 10/12] mm/hugetlb: Use hugetlb walker lock in walk_hugetlb_range() Peter Xu
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2022-11-18  1:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Rik van Riel, Muchun Song, Andrew Morton, peterx, James Houghton,
	Nadav Amit, Andrea Arcangeli, David Hildenbrand, Miaohe Lin,
	Mike Kravetz

Hugetlb walker lock makes sure the pte_t* won't go away from under us.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/hugetlbfs/inode.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index fdb16246f46e..265508981ba1 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -387,21 +387,24 @@ static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
 				unsigned long addr, struct page *page)
 {
 	pte_t *ptep, pte;
+	bool result = false;
 
+	hugetlb_walker_lock();
 	ptep = huge_pte_offset(vma->vm_mm, addr,
 			huge_page_size(hstate_vma(vma)));
 
 	if (!ptep)
-		return false;
+		goto out;
 
 	pte = huge_ptep_get(ptep);
 	if (huge_pte_none(pte) || !pte_present(pte))
-		return false;
+		goto out;
 
 	if (pte_page(pte) == page)
-		return true;
-
-	return false;
+		result = true;
+out:
+	hugetlb_walker_unlock();
+	return result;
 }
 
 /*
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH RFC v2 10/12] mm/hugetlb: Use hugetlb walker lock in walk_hugetlb_range()
  2022-11-18  1:10 [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare Peter Xu
                   ` (8 preceding siblings ...)
  2022-11-18  1:10 ` [PATCH RFC v2 09/12] mm/hugetlb: Use hugetlb walker lock in hugetlb_vma_maps_page() Peter Xu
@ 2022-11-18  1:10 ` Peter Xu
  2022-11-18  1:11 ` [PATCH RFC v2 11/12] mm/hugetlb: Use hugetlb walker lock in page_vma_mapped_walk() Peter Xu
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2022-11-18  1:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Rik van Riel, Muchun Song, Andrew Morton, peterx, James Houghton,
	Nadav Amit, Andrea Arcangeli, David Hildenbrand, Miaohe Lin,
	Mike Kravetz

Hugetlb walker lock makes sure the pte_t* won't go away from under us.
One thing to mention is that there are two hugetlb_entry() users that can
yield the thread within hugetlb_entry(); those need an unlock/lock pair
added around the yield.  Meanwhile, document hugetlb_entry() to explain
the lock requirement for sleepable hugetlb_entry()s.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/s390/mm/gmap.c      | 2 ++
 fs/proc/task_mmu.c       | 2 ++
 include/linux/pagewalk.h | 9 ++++++++-
 mm/pagewalk.c            | 2 ++
 4 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 02d15c8dc92e..fb2938e8d1c7 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2644,7 +2644,9 @@ static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
 	end = start + HPAGE_SIZE - 1;
 	__storage_key_init_range(start, end);
 	set_bit(PG_arch_1, &page->flags);
+	hugetlb_walker_unlock();
 	cond_resched();
+	hugetlb_walker_lock();
 	return 0;
 }
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 89338950afd3..ed750a52e60b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1612,7 +1612,9 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
 			frame++;
 	}
 
+	hugetlb_walker_unlock();
 	cond_resched();
+	hugetlb_walker_lock();
 
 	return err;
 }
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 959f52e5867d..7fa3724c6eb5 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -21,7 +21,14 @@ struct mm_walk;
  *			depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD.
  *			Any folded depths (where PTRS_PER_P?D is equal to 1)
  *			are skipped.
- * @hugetlb_entry:	if set, called for each hugetlb entry
+ * @hugetlb_entry:	if set, called for each hugetlb entry.  Note that
+ *			each pte_t* is protected by hugetlb_walker_lock(),
+ *			and the lock does not allow sleep.  If an explicit
+ *			sleep in the entry fn is needed, the caller needs to
+ *			release the lock (hugetlb_walker_unlock()), then
+ *			relock it (hugetlb_walker_lock()) before return.
+ *			After the unlock, the pte_t* may become invalid
+ *			anytime so cannot be accessed anymore.
  * @test_walk:		caller specific callback function to determine whether
  *			we walk over the current vma or not. Returning 0 means
  *			"do page table walk over the current vma", returning
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 7f1c9b274906..abf310011ab1 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -302,6 +302,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
 	const struct mm_walk_ops *ops = walk->ops;
 	int err = 0;
 
+	hugetlb_walker_lock();
 	do {
 		next = hugetlb_entry_end(h, addr, end);
 		pte = huge_pte_offset(walk->mm, addr & hmask, sz);
@@ -314,6 +315,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
 		if (err)
 			break;
 	} while (addr = next, addr != end);
+	hugetlb_walker_unlock();
 
 	return err;
 }
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH RFC v2 11/12] mm/hugetlb: Use hugetlb walker lock in page_vma_mapped_walk()
  2022-11-18  1:10 [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare Peter Xu
                   ` (9 preceding siblings ...)
  2022-11-18  1:10 ` [PATCH RFC v2 10/12] mm/hugetlb: Use hugetlb walker lock in walk_hugetlb_range() Peter Xu
@ 2022-11-18  1:11 ` Peter Xu
  2022-11-18  1:11 ` [PATCH RFC v2 12/12] mm/hugetlb: Introduce hugetlb_walk() Peter Xu
  2022-11-23  9:40 ` [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare David Hildenbrand
  12 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2022-11-18  1:11 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Muchun Song, peterx, Nadav Amit, Miaohe Lin,
	James Houghton, David Hildenbrand, Andrew Morton, Rik van Riel,
	Andrea Arcangeli

Hugetlb walker lock makes sure the pte_t* won't go away from under us.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/rmap.h | 4 ++++
 mm/page_vma_mapped.c | 5 ++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 011a7530dc76..94d25a67db2b 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -13,6 +13,7 @@
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
 #include <linux/memremap.h>
+#include <linux/hugetlb.h>
 
 /*
  * The anon_vma heads a list of private "related" vmas, to scan if
@@ -408,6 +409,9 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
 		pte_unmap(pvmw->pte);
 	if (pvmw->ptl)
 		spin_unlock(pvmw->ptl);
+	/* This needs to be after unlock of the spinlock */
+	if (is_vm_hugetlb_page(pvmw->vma))
+		hugetlb_walker_unlock();
 }
 
 bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 93e13fc17d3c..5ac8a89130f6 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -169,10 +169,13 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 		if (pvmw->pte)
 			return not_found(pvmw);
 
+		hugetlb_walker_lock();
 		/* when pud is not present, pte will be NULL */
 		pvmw->pte = huge_pte_offset(mm, pvmw->address, size);
-		if (!pvmw->pte)
+		if (!pvmw->pte) {
+			hugetlb_walker_unlock();
 			return false;
+		}
 
 		pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
 		if (!check_pte(pvmw))
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH RFC v2 12/12] mm/hugetlb: Introduce hugetlb_walk()
  2022-11-18  1:10 [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare Peter Xu
                   ` (10 preceding siblings ...)
  2022-11-18  1:11 ` [PATCH RFC v2 11/12] mm/hugetlb: Use hugetlb walker lock in page_vma_mapped_walk() Peter Xu
@ 2022-11-18  1:11 ` Peter Xu
  2022-11-23  9:40 ` [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare David Hildenbrand
  12 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2022-11-18  1:11 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Mike Kravetz, Muchun Song, peterx, Nadav Amit, Miaohe Lin,
	James Houghton, David Hildenbrand, Andrew Morton, Rik van Riel,
	Andrea Arcangeli

huge_pte_offset() is the main walker function for hugetlb pgtables, but
the name doesn't really represent what it does.

Instead of renaming it, introduce a wrapper function called hugetlb_walk()
which uses huge_pte_offset() internally, and assert that the proper locks
are held when walking the pgtable.

Note, the vma lock assertion will be a no-op for private mappings.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/hugetlbfs/inode.c    |  4 +---
 fs/userfaultfd.c        |  6 ++----
 include/linux/hugetlb.h | 13 +++++++++++++
 mm/hugetlb.c            | 28 ++++++++++++++--------------
 mm/page_vma_mapped.c    |  2 +-
 mm/pagewalk.c           |  4 +---
 6 files changed, 32 insertions(+), 25 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 265508981ba1..ed7934015290 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -390,9 +390,7 @@ static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
 	bool result = false;
 
 	hugetlb_walker_lock();
-	ptep = huge_pte_offset(vma->vm_mm, addr,
-			huge_page_size(hstate_vma(vma)));
-
+	ptep = hugetlb_walk(vma, addr, huge_page_size(hstate_vma(vma)));
 	if (!ptep)
 		goto out;
 
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index a602f008dde5..f31fe1a9f4c5 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -237,14 +237,12 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
 					 unsigned long flags,
 					 unsigned long reason)
 {
-	struct mm_struct *mm = ctx->mm;
 	pte_t *ptep, pte;
 	bool ret = true;
 
-	mmap_assert_locked(mm);
-
-	ptep = huge_pte_offset(mm, address, vma_mmu_pagesize(vma));
+	mmap_assert_locked(ctx->mm);
 
+	ptep = hugetlb_walk(vma, address, vma_mmu_pagesize(vma));
 	if (!ptep)
 		goto out;
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8f85ad0d5bdb..e75b1ffb93a5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1296,4 +1296,17 @@ bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr);
 #define flush_hugetlb_tlb_range(vma, addr, end)	flush_tlb_range(vma, addr, end)
 #endif
 
+/*
+ * Safe version of huge_pte_offset() to check the locks.  See comments
+ * above huge_pte_offset().
+ */
+static inline pte_t *
+hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
+{
+	if (!hugetlb_walker_locked())
+		hugetlb_vma_assert_locked(vma);
+
+	return huge_pte_offset(vma->vm_mm, addr, sz);
+}
+
 #endif /* _LINUX_HUGETLB_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e81af6a46c59..6c77ae7a3d94 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4814,7 +4814,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	} else {
 		/*
 		 * For shared mappings the vma lock must be held before
-		 * calling huge_pte_offset in the src vma. Otherwise, the
+		 * calling hugetlb_walk() in the src vma. Otherwise, the
 		 * returned ptep could go away if part of a shared pmd and
 		 * another thread calls huge_pmd_unshare.
 		 */
@@ -4824,7 +4824,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	last_addr_mask = hugetlb_mask_last_page(h);
 	for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
 		spinlock_t *src_ptl, *dst_ptl;
-		src_pte = huge_pte_offset(src, addr, sz);
+		src_pte = hugetlb_walk(src_vma, addr, sz);
 		if (!src_pte) {
 			addr |= last_addr_mask;
 			continue;
@@ -5028,7 +5028,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 	hugetlb_vma_lock_write(vma);
 	i_mmap_lock_write(mapping);
 	for (; old_addr < old_end; old_addr += sz, new_addr += sz) {
-		src_pte = huge_pte_offset(mm, old_addr, sz);
+		src_pte = hugetlb_walk(vma, old_addr, sz);
 		if (!src_pte) {
 			old_addr |= last_addr_mask;
 			new_addr |= last_addr_mask;
@@ -5091,7 +5091,7 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 	last_addr_mask = hugetlb_mask_last_page(h);
 	address = start;
 	for (; address < end; address += sz) {
-		ptep = huge_pte_offset(mm, address, sz);
+		ptep = hugetlb_walk(vma, address, sz);
 		if (!ptep) {
 			address |= last_addr_mask;
 			continue;
@@ -5404,7 +5404,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 			hugetlb_vma_lock_read(vma);
 			spin_lock(ptl);
-			ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
+			ptep = hugetlb_walk(vma, haddr, huge_page_size(h));
 			if (likely(ptep &&
 				   pte_same(huge_ptep_get(ptep), pte)))
 				goto retry_avoidcopy;
@@ -5442,7 +5442,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * before the page tables are altered
 	 */
 	spin_lock(ptl);
-	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
+	ptep = hugetlb_walk(vma, haddr, huge_page_size(h));
 	if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
 		/* Break COW or unshare */
 		huge_ptep_clear_flush(vma, haddr, ptep);
@@ -5839,7 +5839,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * until finished with ptep.  This prevents huge_pmd_unshare from
 	 * being called elsewhere and making the ptep no longer valid.
 	 *
-	 * ptep could have already be assigned via huge_pte_offset.  That
+	 * ptep could have already be assigned via hugetlb_walk().  That
 	 * is OK, as huge_pte_alloc will return the same value unless
 	 * something has changed.
 	 */
@@ -6223,7 +6223,7 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 		return NULL;
 
 	hugetlb_walker_lock();
-	pte = huge_pte_offset(mm, haddr, huge_page_size(h));
+	pte = hugetlb_walk(vma, haddr, huge_page_size(h));
 	if (!pte)
 		goto out_unlock;
 
@@ -6288,8 +6288,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 *
 		 * Note that page table lock is not held when pte is null.
 		 */
-		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h),
-				      huge_page_size(h));
+		pte = hugetlb_walk(vma, vaddr & huge_page_mask(h),
+				   huge_page_size(h));
 		if (pte)
 			ptl = huge_pte_lock(h, mm, pte);
 		absent = !pte || huge_pte_none(huge_ptep_get(pte));
@@ -6481,7 +6481,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	last_addr_mask = hugetlb_mask_last_page(h);
 	for (; address < end; address += psize) {
 		spinlock_t *ptl;
-		ptep = huge_pte_offset(mm, address, psize);
+		ptep = hugetlb_walk(vma, address, psize);
 		if (!ptep) {
 			address |= last_addr_mask;
 			continue;
@@ -7037,8 +7037,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 
 		saddr = page_table_shareable(svma, vma, addr, idx);
 		if (saddr) {
-			spte = huge_pte_offset(svma->vm_mm, saddr,
-					       vma_mmu_pagesize(svma));
+			spte = hugetlb_walk(svma, saddr,
+					    vma_mmu_pagesize(svma));
 			/*
 			 * When page ref==0, it means it's probably being
 			 * freed; continue with the next one.
@@ -7400,7 +7400,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
 	hugetlb_vma_lock_write(vma);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	for (address = start; address < end; address += PUD_SIZE) {
-		ptep = huge_pte_offset(mm, address, sz);
+		ptep = hugetlb_walk(vma, address, sz);
 		if (!ptep)
 			continue;
 		ptl = huge_pte_lock(h, mm, ptep);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 5ac8a89130f6..72d72cd73c8f 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -171,7 +171,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 
 		hugetlb_walker_lock();
 		/* when pud is not present, pte will be NULL */
-		pvmw->pte = huge_pte_offset(mm, pvmw->address, size);
+		pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
 		if (!pvmw->pte) {
 			hugetlb_walker_unlock();
 			return false;
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index abf310011ab1..469e60b1b096 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -305,13 +305,11 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
 	hugetlb_walker_lock();
 	do {
 		next = hugetlb_entry_end(h, addr, end);
-		pte = huge_pte_offset(walk->mm, addr & hmask, sz);
-
+		pte = hugetlb_walk(vma, addr & hmask, sz);
 		if (pte)
 			err = ops->hugetlb_entry(pte, hmask, addr, next, walk);
 		else if (ops->pte_hole)
 			err = ops->pte_hole(addr, next, -1, walk);
-
 		if (err)
 			break;
 	} while (addr = next, addr != end);
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC v2 06/12] mm/hugetlb: Protect huge_pmd_share() with walker lock
  2022-11-18  1:10 ` [PATCH RFC v2 06/12] mm/hugetlb: Protect huge_pmd_share() with walker lock Peter Xu
@ 2022-11-18  1:17   ` Peter Xu
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2022-11-18  1:17 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Rik van Riel, Muchun Song, Andrew Morton, James Houghton,
	Nadav Amit, Andrea Arcangeli, David Hildenbrand, Miaohe Lin,
	Mike Kravetz

On Thu, Nov 17, 2022 at 08:10:19PM -0500, Peter Xu wrote:
> huge_pmd_share() is normally called with vma lock held already, it lets us
> feel like we don't really need the walker lock.  But that's not true,
> because we only took the vma lock for "current" vma, but not all the vma
> pgtables we're going to walk upon.
> 
> Taking each vma lock may lead to deadlock and hard to order.  The safe
> approach is just to use the walker lock which guarantees the pgtable page
> being alive, then we should use get_page_unless_zero() rather than
> get_page() only, to make sure the pgtable page is not being freed at the
> meantime.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/hugetlb.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 61a1fa678172..5ef883184885 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -7008,6 +7008,13 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>  	spinlock_t *ptl;
>  
>  	i_mmap_lock_read(mapping);
> +
> +	/*
> +	 * NOTE: even if we've got the vma read lock, here we still need to
> +	 * take the walker lock, because we're not walking the current vma,
> +	 * but some other mm's!
> +	 */
> +	hugetlb_walker_lock();
>  	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
>  		if (svma == vma)
>  			continue;
> @@ -7016,12 +7023,15 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>  		if (saddr) {
>  			spte = huge_pte_offset(svma->vm_mm, saddr,
>  					       vma_mmu_pagesize(svma));
> -			if (spte) {
> -				get_page(virt_to_page(spte));
> +			/*
> +			 * When page ref==0, it means it's probably being
> +			 * freed; continue with the next one.
> +			 */
> +			if (spte && get_page_unless_zero(virt_to_page(spte)))
>  				break;
> -			}
>  		}
>  	}
> +	hugetlb_walker_unlock();
>  
>  	if (!spte)
>  		goto out;
> -- 
> 2.37.3
> 

Will squash this in..

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6c77ae7a3d94..f9abede9de47 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7045,6 +7045,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
                         */
                        if (spte && get_page_unless_zero(virt_to_page(spte)))
                                break;
+                       else
+                               spte = NULL;
                }
        }
        hugetlb_walker_unlock();

-- 
Peter Xu


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC v2 02/12] mm/hugetlb: Move swap entry handling into vma lock for fault
  2022-11-18  1:10 ` [PATCH RFC v2 02/12] mm/hugetlb: Move swap entry handling into vma lock for fault Peter Xu
@ 2022-11-18  1:35   ` Peter Xu
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2022-11-18  1:35 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Rik van Riel, Muchun Song, Andrew Morton, James Houghton,
	Nadav Amit, Andrea Arcangeli, David Hildenbrand, Miaohe Lin,
	Mike Kravetz

On Thu, Nov 17, 2022 at 08:10:15PM -0500, Peter Xu wrote:
> In hugetlb_fault(), there used to have a special path to handle swap entry
> at the entrance using huge_pte_offset().  That's unsafe because
> huge_pte_offset() for a pmd sharable range can access freed pgtables if
> without either the walker lock or vma lock.
> 
> Here the simplest solution for making it safe is just to move the swap
> handling to be after the vma lock being held.  We may need to take the
> fault mutex on either migration or hwpoison entries now (also the vma lock,
> but that's really needed), however neither of them is hot path so it should
> be fine.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/hugetlb.c | 24 +++++++-----------------
>  1 file changed, 7 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c3aab6d5b7aa..62ff3fc51d4e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5824,22 +5824,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	int need_wait_lock = 0;
>  	unsigned long haddr = address & huge_page_mask(h);
>  
> -	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
> -	if (ptep) {
> -		/*
> -		 * Since we hold no locks, ptep could be stale.  That is
> -		 * OK as we are only making decisions based on content and
> -		 * not actually modifying content here.
> -		 */
> -		entry = huge_ptep_get(ptep);
> -		if (unlikely(is_hugetlb_entry_migration(entry))) {
> -			migration_entry_wait_huge(vma, ptep);
> -			return 0;
> -		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> -			return VM_FAULT_HWPOISON_LARGE |
> -				VM_FAULT_SET_HINDEX(hstate_index(h));
> -	}
> -
>  	/*
>  	 * Serialize hugepage allocation and instantiation, so that we don't
>  	 * get spurious allocation failures if two CPUs race to instantiate
> @@ -5886,8 +5870,14 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	 * fault, and is_hugetlb_entry_(migration|hwpoisoned) check will
>  	 * properly handle it.
>  	 */
> -	if (!pte_present(entry))
> +	if (!pte_present(entry)) {
> +		if (unlikely(is_hugetlb_entry_migration(entry)))
> +			migration_entry_wait_huge(vma, ptep);

Hmm no, this needs to release the vma lock and fault mutex..  Now I
remember why I had a note that I need to rework the migration wait code..

I'll try that in the next version; it would be a callback just to release
the proper locks in migration_entry_wait_huge() right after releasing the
pgtable lock, e.g. in migration_entry_wait_on_locked().

> +		else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> +			ret = VM_FAULT_HWPOISON_LARGE |
> +			    VM_FAULT_SET_HINDEX(hstate_index(h));
>  		goto out_mutex;
> +	}
>  
>  	/*
>  	 * If we are going to COW/unshare the mapping later, we examine the
> -- 
> 2.37.3
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare
  2022-11-18  1:10 [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare Peter Xu
                   ` (11 preceding siblings ...)
  2022-11-18  1:11 ` [PATCH RFC v2 12/12] mm/hugetlb: Introduce hugetlb_walk() Peter Xu
@ 2022-11-23  9:40 ` David Hildenbrand
  2022-11-23 15:09   ` Peter Xu
  12 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2022-11-23  9:40 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Rik van Riel, Muchun Song, Andrew Morton, James Houghton,
	Nadav Amit, Andrea Arcangeli, Miaohe Lin, Mike Kravetz

On 18.11.22 02:10, Peter Xu wrote:
> Based on latest mm-unstable (96aa38b69507).
> 
> This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> series for pmd unsharing, so this series also depends on that one.
> Hopefully this series can make it a more complete resolution for pmd
> unsharing.
> 
> PS: so far no one strongly ACKed this, let me keep the RFC tag.  But I
> think I'm already more confident than many of the RFCs I posted.
> 
> PS2: there're a lot of changes comparing to rfcv1, so I'm just not adding
> the changelog.  The whole idea is still the same, though.
> 
> Problem
> =======
> 
> huge_pte_offset() is a major helper used by hugetlb code paths to walk a
> hugetlb pgtable.  It's used mostly everywhere since that's needed even
> before taking the pgtable lock.
> 
> huge_pte_offset() is always called with mmap lock held with either read or
> write.
> 
> For normal memory types that's far enough, since any pgtable removal
> requires mmap write lock (e.g. munmap or mm destructions).  However hugetlb
> has the pmd unshare feature, it means not only the pgtable page can be gone
> from under us when we're doing a walking, but also the pgtable page we're
> walking (even after unshared, in this case it can only be the huge PUD page
> which contains 512 huge pmd entries, with the vma VM_SHARED mapped).  It's
> possible because even though freeing the pgtable page requires mmap write
> lock, it doesn't help us when we're walking on another mm's pgtable, so
> it's still on risk even if we're with the current->mm's mmap lock.
> 
> The recent work from Mike on vma lock can resolve most of this already.
> It's achieved by forbidden pmd unsharing during the lock being taken, so no
> further risk of the pgtable page being freed.  It means if we can take the
> vma lock around all huge_pte_offset() callers it'll be safe.
> 
> There're already a bunch of them that we did as per the latest mm-unstable,
> but also quite a few others that we didn't for various reasons.  E.g. it
> may not be applicable for not-allow-to-sleep contexts like FOLL_NOWAIT.
> Or, huge_pmd_share() is actually a tricky user of huge_pte_offset(),
> because even if we took the vma lock, we're walking on another mm's vma!
> Taking vma lock for all the vmas are probably not gonna work.
> 
> I have totally no report showing that I can trigger such a race, but from
> code wise I never see anything that stops the race from happening.  This
> series is trying to resolve that problem.

Let me try understand the basic problem first:

hugetlb walks page tables semi-lockless: while we hold the mmap lock, we 
don't grab the page table locks. That's very hugetlb specific handling 
and I assume hugetlb uses different mechanisms to sync against 
MADV_DONTNEED, concurrent page faults... but that's no news. hugetlb is 
weird in many ways :)

So, IIUC, you want a mechanism to synchronize against PMD unsharing. 
Can't we use some very basic locking for that?

Using RCU / disabling local irqs seems a bit excessive because we *are* 
holding the mmap lock and only care about concurrent unsharing.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare
  2022-11-23  9:40 ` [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare David Hildenbrand
@ 2022-11-23 15:09   ` Peter Xu
  2022-11-23 18:21     ` Mike Kravetz
  2022-11-25  9:43     ` David Hildenbrand
  0 siblings, 2 replies; 22+ messages in thread
From: Peter Xu @ 2022-11-23 15:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Rik van Riel, Muchun Song, Andrew Morton,
	James Houghton, Nadav Amit, Andrea Arcangeli, Miaohe Lin,
	Mike Kravetz

Hi, David,

Thanks for taking a look.

On Wed, Nov 23, 2022 at 10:40:40AM +0100, David Hildenbrand wrote:
> On 18.11.22 02:10, Peter Xu wrote:
> > Based on latest mm-unstable (96aa38b69507).
> > 
> > This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> > series for pmd unsharing, so this series also depends on that one.
> > Hopefully this series can make it a more complete resolution for pmd
> > unsharing.
> > 
> > PS: so far no one strongly ACKed this, let me keep the RFC tag.  But I
> > think I'm already more confident than many of the RFCs I posted.
> > 
> > PS2: there're a lot of changes comparing to rfcv1, so I'm just not adding
> > the changelog.  The whole idea is still the same, though.
> > 
> > Problem
> > =======
> > 
> > huge_pte_offset() is a major helper used by hugetlb code paths to walk a
> > hugetlb pgtable.  It's used mostly everywhere since that's needed even
> > before taking the pgtable lock.
> > 
> > huge_pte_offset() is always called with mmap lock held with either read or
> > write.
> > 
> > For normal memory types that's far enough, since any pgtable removal
> > requires mmap write lock (e.g. munmap or mm destructions).  However hugetlb
> > has the pmd unshare feature, it means not only the pgtable page can be gone
> > from under us when we're doing a walking, but also the pgtable page we're
> > walking (even after unshared, in this case it can only be the huge PUD page
> > which contains 512 huge pmd entries, with the vma VM_SHARED mapped).  It's
> > possible because even though freeing the pgtable page requires mmap write
> > lock, it doesn't help us when we're walking on another mm's pgtable, so
> > it's still on risk even if we're with the current->mm's mmap lock.
> > 
> > The recent work from Mike on vma lock can resolve most of this already.
> > It's achieved by forbidden pmd unsharing during the lock being taken, so no
> > further risk of the pgtable page being freed.  It means if we can take the
> > vma lock around all huge_pte_offset() callers it'll be safe.

[1]

> > 
> > There're already a bunch of them that we did as per the latest mm-unstable,
> > but also quite a few others that we didn't for various reasons.  E.g. it
> > may not be applicable for not-allow-to-sleep contexts like FOLL_NOWAIT.
> > Or, huge_pmd_share() is actually a tricky user of huge_pte_offset(),

[2]

> > because even if we took the vma lock, we're walking on another mm's vma!
> > Taking vma lock for all the vmas are probably not gonna work.
> > 
> > I have totally no report showing that I can trigger such a race, but from
> > code wise I never see anything that stops the race from happening.  This
> > series is trying to resolve that problem.
> 
> Let me try understand the basic problem first:
> 
> hugetlb walks page tables semi-lockless: while we hold the mmap lock, we
> don't grab the page table locks. That's very hugetlb specific handling and I
> assume hugetlb uses different mechanisms to sync against MADV_DONTNEED,
> concurrent page fault s... but that's no news. hugetlb is weird in many ways
> :)
> 
> So, IIUC, you want a mechanism to synchronize against PMD unsharing. Can't
> we use some very basic locking for that?

Yes we can in most cases.  Please refer to paragraph [1] above, where I
referenced Mike's recent work on the vma lock.  That's the basic locking we
need so far to protect pmd unsharing.  I'll attach the link too in the next
post, which is here:

https://lore.kernel.org/r/20220914221810.95771-1-mike.kravetz@oracle.com

> 
> Using RCU / disabling local irqs seems a bit excessive because we *are*
> holding the mmap lock and only care about concurrent unsharing

The series wanted to address the places where the vma lock is not easy to
take.  It originated from when I was reading Mike's other patch; I forgot
exactly why, but I noticed there are some code paths where we may not want
to take a sleepable lock, e.g. in the follow page code.

The other one is huge_pmd_share(), where we may hold the mmap lock for the
current mm but we're fundamentally walking another mm.  It'll be tricky to
take a sleepable lock in such a condition too.

I mentioned these cases in the other paragraph above [2].  Let me try to
expand that in my next post too.

It's debatable whether all the remaining places can only work with either
RCU or irqs disabled, but the idea is that it should at least speed up those
paths where we can.  Here, irqoff might be a bit heavy, but the RCU lock
should always be preferable to the vma lock when possible; the trade-off is
that we may still see stale pgtable data (since unsharing can still happen
in parallel), while that can be completely avoided when we take the vma lock.
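
To illustrate the idea (a minimal sketch only; it assumes the pgtable pages
are freed with an RCU delay, like what MMU_GATHER_RCU_TABLE_FREE provides):

	rcu_read_lock();
	ptep = huge_pte_offset(mm, address, huge_page_size(h));
	if (ptep)
		/* the value may be stale, but no use-after-free on the page */
		entry = huge_ptep_get(ptep);
	rcu_read_unlock();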

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare
  2022-11-23 15:09   ` Peter Xu
@ 2022-11-23 18:21     ` Mike Kravetz
  2022-11-23 18:56       ` Peter Xu
  2022-11-25  9:43     ` David Hildenbrand
  1 sibling, 1 reply; 22+ messages in thread
From: Mike Kravetz @ 2022-11-23 18:21 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, linux-mm, linux-kernel, Rik van Riel,
	Muchun Song, Andrew Morton, James Houghton, Nadav Amit,
	Andrea Arcangeli, Miaohe Lin

On 11/23/22 10:09, Peter Xu wrote:
> On Wed, Nov 23, 2022 at 10:40:40AM +0100, David Hildenbrand wrote:
> > Let me try understand the basic problem first:
> > 
> > hugetlb walks page tables semi-lockless: while we hold the mmap lock, we
> > don't grab the page table locks. That's very hugetlb specific handling and I
> > assume hugetlb uses different mechanisms to sync against MADV_DONTNEED,
> > concurrent page fault s... but that's no news. hugetlb is weird in many ways
> > :)
> > 
> > So, IIUC, you want a mechanism to synchronize against PMD unsharing. Can't
> > we use some very basic locking for that?
> 
> Yes we can in most cases.  Please refer to above paragraph [1] where I
> referred Mike's recent work on vma lock.  That's the basic locking we need
> so far to protect pmd unsharing.  I'll attach the link too in the next
> post, which is here:
> 
> https://lore.kernel.org/r/20220914221810.95771-1-mike.kravetz@oracle.com
> 
> > 
> > Using RCU / disabling local irqs seems a bit excessive because we *are*
> > holding the mmap lock and only care about concurrent unsharing
> 
> The series wanted to address where the vma lock is not easy to take.  It
> originates from when I was reading Mike's other patch, I forgot why I did
> that but I just noticed there's some code path that we may not want to take
> a sleepable lock, e.g. in follow page code.

Yes, it was the patch suggested by David,

https://lore.kernel.org/linux-mm/20221030225825.40872-1-mike.kravetz@oracle.com/

The issue was that FOLL_NOWAIT could be passed into follow_page_mask.  If so,
then we do not want to potentially sleep on the mutex.

Since you both are on this thread, I thought of/noticed a related issue.  In
follow_hugetlb_page, it looks like we can call hugetlb_fault if FOLL_NOWAIT
is set.  hugetlb_fault certainly has the potential for sleeping.  Is this also
a similar issue?

-- 
Mike Kravetz

> The other one is huge_pmd_share() where we may have the mmap lock for
> current mm but we're fundamentally walking another mm.  It'll be tricky to
> take a sleepable lock in such condition too.
> 
> I mentioned these cases in the other paragraph above [2].  Let me try to
> expand that in my next post too.
> 
> It's debatable whether all the rest places can only work with either RCU or
> irq disabled, but the idea is at least it should speed up those paths when
> we still can.  Here, irqoff might be a bit heavy, but RCU lock should be
> always superior to vma lock when possible, the payoff is we may still see
> stale pgtable data (since unsharing can still happen in parallel), while
> that can be completely avoided when we take the vma lock.
> 
> Thanks,
> 
> -- 
> Peter Xu
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare
  2022-11-23 18:21     ` Mike Kravetz
@ 2022-11-23 18:56       ` Peter Xu
  2022-11-23 19:31         ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Xu @ 2022-11-23 18:56 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: David Hildenbrand, linux-mm, linux-kernel, Rik van Riel,
	Muchun Song, Andrew Morton, James Houghton, Nadav Amit,
	Andrea Arcangeli, Miaohe Lin

On Wed, Nov 23, 2022 at 10:21:30AM -0800, Mike Kravetz wrote:
> On 11/23/22 10:09, Peter Xu wrote:
> > On Wed, Nov 23, 2022 at 10:40:40AM +0100, David Hildenbrand wrote:
> > > Let me try understand the basic problem first:
> > > 
> > > hugetlb walks page tables semi-lockless: while we hold the mmap lock, we
> > > don't grab the page table locks. That's very hugetlb specific handling and I
> > > assume hugetlb uses different mechanisms to sync against MADV_DONTNEED,
> > > concurrent page fault s... but that's no news. hugetlb is weird in many ways
> > > :)
> > > 
> > > So, IIUC, you want a mechanism to synchronize against PMD unsharing. Can't
> > > we use some very basic locking for that?
> > 
> > Yes we can in most cases.  Please refer to above paragraph [1] where I
> > referred Mike's recent work on vma lock.  That's the basic locking we need
> > so far to protect pmd unsharing.  I'll attach the link too in the next
> > post, which is here:
> > 
> > https://lore.kernel.org/r/20220914221810.95771-1-mike.kravetz@oracle.com
> > 
> > > 
> > > Using RCU / disabling local irqs seems a bit excessive because we *are*
> > > holding the mmap lock and only care about concurrent unsharing
> > 
> > The series wanted to address where the vma lock is not easy to take.  It
> > originates from when I was reading Mike's other patch, I forgot why I did
> > that but I just noticed there's some code path that we may not want to take
> > a sleepable lock, e.g. in follow page code.
> 
> Yes, it was the patch suggested by David,
> 
> https://lore.kernel.org/linux-mm/20221030225825.40872-1-mike.kravetz@oracle.com/
> 
> The issue was that FOLL_NOWAIT could be passed into follow_page_mask.  If so,
> then we do not want potentially sleep on the mutex.
> 
> Since you both are on this thread, I thought of/noticed a related issue.  In
> follow_hugetlb_page, it looks like we can call hugetlb_fault if FOLL_NOWAIT
> is set.  hugetlb_fault certainly has the potential for sleeping.  Is this also
> a similar issue?

Yeah, maybe the clean way to do this is: when FAULT_FLAG_RETRY_NOWAIT is set,
we should always try not to sleep at all.
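
Something like below, purely as a sketch of the direction (the exact
condition, hash setup and retry handling would need more care than this):

	/* Hypothetical handling in hugetlb_fault() for NOWAIT faults */
	if (flags & FAULT_FLAG_RETRY_NOWAIT) {
		if (!mutex_trylock(&hugetlb_fault_mutex_table[hash]))
			return VM_FAULT_RETRY;
	} else {
		mutex_lock(&hugetlb_fault_mutex_table[hash]);
	}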

But maybe that's also not urgently needed.  So far I don't see that any real
non-sleepable caller of it exists - the only one (kvm) can actually sleep..

It's definitely not wanted, as kvm only attaches NOWAIT for an async fault,
so ideally any wait should be offloaded into async threads.  Now with the
hugetlb code being able to sleep with NOWAIT, the waiting time will be
accounted to the vcpu's real fault time and will partly invalidate the async
page fault handling.  That said, it also means no immediate issue would be
triggered either.  It's just that for the pmd unshare case we can at least
start using the non-sleeping versions of the locks.

Now I'm more concerned with huge_pmd_share(), which seems to have no good
option but only the RCU approach.

One other thing I noticed is that I cannot quickly figure out whether
follow_hugetlb_page() is needed anymore, since follow_page_mask() also seems
to be fine with walking hugetlb pgtables.

follow_hugetlb_page() can be traced back to the initial git commit.  I had a
feeling that the old version of follow_page_mask() didn't support hugetlb,
but now that it's supported maybe we can drop follow_hugetlb_page() as a
whole?

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare
  2022-11-23 18:56       ` Peter Xu
@ 2022-11-23 19:31         ` David Hildenbrand
  0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2022-11-23 19:31 UTC (permalink / raw)
  To: Peter Xu, Mike Kravetz
  Cc: linux-mm, linux-kernel, Rik van Riel, Muchun Song, Andrew Morton,
	James Houghton, Nadav Amit, Andrea Arcangeli, Miaohe Lin

On 23.11.22 19:56, Peter Xu wrote:
> On Wed, Nov 23, 2022 at 10:21:30AM -0800, Mike Kravetz wrote:
>> On 11/23/22 10:09, Peter Xu wrote:
>>> On Wed, Nov 23, 2022 at 10:40:40AM +0100, David Hildenbrand wrote:
>>>> Let me try understand the basic problem first:
>>>>
>>>> hugetlb walks page tables semi-lockless: while we hold the mmap lock, we
>>>> don't grab the page table locks. That's very hugetlb specific handling and I
>>>> assume hugetlb uses different mechanisms to sync against MADV_DONTNEED,
>>>> concurrent page fault s... but that's no news. hugetlb is weird in many ways
>>>> :)
>>>>
>>>> So, IIUC, you want a mechanism to synchronize against PMD unsharing. Can't
>>>> we use some very basic locking for that?
>>>
>>> Yes we can in most cases.  Please refer to above paragraph [1] where I
>>> referred Mike's recent work on vma lock.  That's the basic locking we need
>>> so far to protect pmd unsharing.  I'll attach the link too in the next
>>> post, which is here:
>>>
>>> https://lore.kernel.org/r/20220914221810.95771-1-mike.kravetz@oracle.com
>>>
>>>>
>>>> Using RCU / disabling local irqs seems a bit excessive because we *are*
>>>> holding the mmap lock and only care about concurrent unsharing
>>>
>>> The series wanted to address where the vma lock is not easy to take.  It
>>> originates from when I was reading Mike's other patch, I forgot why I did
>>> that but I just noticed there's some code path that we may not want to take
>>> a sleepable lock, e.g. in follow page code.
>>
>> Yes, it was the patch suggested by David,
>>
>> https://lore.kernel.org/linux-mm/20221030225825.40872-1-mike.kravetz@oracle.com/
>>
>> The issue was that FOLL_NOWAIT could be passed into follow_page_mask.  If so,
>> then we do not want potentially sleep on the mutex.
>>
>> Since you both are on this thread, I thought of/noticed a related issue.  In
>> follow_hugetlb_page, it looks like we can call hugetlb_fault if FOLL_NOWAIT
>> is set.  hugetlb_fault certainly has the potential for sleeping.  Is this also
>> a similar issue?
> 
> Yeah maybe the clean way to do this is when FAULT_FLAG_RETRY_NOWAIT is set
> we should always try to not sleep at all.

hva_to_pfn_slow(), which sets FOLL_NOWAIT, calls get_user_pages_unlocked(),
which will just do a straight mmap_read_lock().

The interpretation of FOLL_NOWAIT should not be "don't take any 
sleepable locks" but instead more like "don't wait for a page to get 
swapped in".

#define FOLL_NOWAIT  0x20  /* if a disk transfer is needed, start the IO
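
For reference, this is roughly (from memory) how the GUP path translates it
in faultin_page(); it only affects whether the fault path waits for the IO:

	if (*flags & FOLL_NOWAIT)
		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;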


I did not read the full replies yet (sorry, busy hacking :) ) but *any* 
code path that already takes the mmap_read_lock() can just take whatever 
other lock we want -- IMHO. No need to over-complicate our code trying 
to avoid locks in that case.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare
  2022-11-23 15:09   ` Peter Xu
  2022-11-23 18:21     ` Mike Kravetz
@ 2022-11-25  9:43     ` David Hildenbrand
  2022-11-25 13:55       ` Peter Xu
  1 sibling, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2022-11-25  9:43 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Rik van Riel, Muchun Song, Andrew Morton,
	James Houghton, Nadav Amit, Andrea Arcangeli, Miaohe Lin,
	Mike Kravetz

On 23.11.22 16:09, Peter Xu wrote:
> Hi, David,
> 
> Thanks for taking a look.
> 
> On Wed, Nov 23, 2022 at 10:40:40AM +0100, David Hildenbrand wrote:
>> On 18.11.22 02:10, Peter Xu wrote:
>>> Based on latest mm-unstable (96aa38b69507).
>>>
>>> This can be seen as a follow-up series to Mike's recent hugetlb vma lock
>>> series for pmd unsharing, so this series also depends on that one.
>>> Hopefully this series can make it a more complete resolution for pmd
>>> unsharing.
>>>
>>> PS: so far no one strongly ACKed this, let me keep the RFC tag.  But I
>>> think I'm already more confident than many of the RFCs I posted.
>>>
>>> PS2: there're a lot of changes comparing to rfcv1, so I'm just not adding
>>> the changelog.  The whole idea is still the same, though.
>>>
>>> Problem
>>> =======
>>>
>>> huge_pte_offset() is a major helper used by hugetlb code paths to walk a
>>> hugetlb pgtable.  It's used mostly everywhere since that's needed even
>>> before taking the pgtable lock.
>>>
>>> huge_pte_offset() is always called with mmap lock held with either read or
>>> write.
>>>
>>> For normal memory types that's far enough, since any pgtable removal
>>> requires mmap write lock (e.g. munmap or mm destructions).  However hugetlb
>>> has the pmd unshare feature, it means not only the pgtable page can be gone
>>> from under us when we're doing a walking, but also the pgtable page we're
>>> walking (even after unshared, in this case it can only be the huge PUD page
>>> which contains 512 huge pmd entries, with the vma VM_SHARED mapped).  It's
>>> possible because even though freeing the pgtable page requires mmap write
>>> lock, it doesn't help us when we're walking on another mm's pgtable, so
>>> it's still on risk even if we're with the current->mm's mmap lock.
>>>
>>> The recent work from Mike on vma lock can resolve most of this already.
>>> It's achieved by forbidden pmd unsharing during the lock being taken, so no
>>> further risk of the pgtable page being freed.  It means if we can take the
>>> vma lock around all huge_pte_offset() callers it'll be safe.
> 
> [1]
> 
>>>
>>> There're already a bunch of them that we did as per the latest mm-unstable,
>>> but also quite a few others that we didn't for various reasons.  E.g. it
>>> may not be applicable for not-allow-to-sleep contexts like FOLL_NOWAIT.
>>> Or, huge_pmd_share() is actually a tricky user of huge_pte_offset(),
> 
> [2]
> 
>>> because even if we took the vma lock, we're walking on another mm's vma!
>>> Taking vma lock for all the vmas are probably not gonna work.
>>>
>>> I have totally no report showing that I can trigger such a race, but from
>>> code wise I never see anything that stops the race from happening.  This
>>> series is trying to resolve that problem.
>>
>> Let me try understand the basic problem first:
>>
>> hugetlb walks page tables semi-lockless: while we hold the mmap lock, we
>> don't grab the page table locks. That's very hugetlb specific handling and I
>> assume hugetlb uses different mechanisms to sync against MADV_DONTNEED,
>> concurrent page fault s... but that's no news. hugetlb is weird in many ways
>> :)
>>
>> So, IIUC, you want a mechanism to synchronize against PMD unsharing. Can't
>> we use some very basic locking for that?
> 

Sorry for the delay, finally found time to look into this again. :)

> Yes we can in most cases.  Please refer to above paragraph [1] where I
> referred Mike's recent work on vma lock.  That's the basic locking we need
> so far to protect pmd unsharing.  I'll attach the link too in the next
> post, which is here:
> 
> https://lore.kernel.org/r/20220914221810.95771-1-mike.kravetz@oracle.com
> 
>>
>> Using RCU / disabling local irqs seems a bit excessive because we *are*
>> holding the mmap lock and only care about concurrent unsharing
> 
> The series wanted to address where the vma lock is not easy to take.  It
> originates from when I was reading Mike's other patch, I forgot why I did
> that but I just noticed there's some code path that we may not want to take
> a sleepable lock, e.g. in follow page code.

As I stated, whenever we already take the (expensive) mmap lock, the 
least thing we should have to worry about is taking another sleepable 
lock IMHO. Please correct me if I'm wrong.

> 
> The other one is huge_pmd_share() where we may have the mmap lock for
> current mm but we're fundamentally walking another mm.  It'll be tricky to
> take a sleepable lock in such condition too.

We're already grabbing the i_mmap_lock_read(mapping), and the VMAs should be
stable in that interval tree IIUC.  So I wonder if taking the VMA locks would
really be problematic here.  Anything obvious I am missing?

> 
> I mentioned these cases in the other paragraph above [2].  Let me try to
> expand that in my next post too.

That would be great. I yet have to dedicate more time to understand all 
that complexity.

> 
> It's debatable whether all the rest places can only work with either RCU or
> irq disabled, but the idea is at least it should speed up those paths when
> we still can.  Here, irqoff might be a bit heavy, but RCU lock should be
> always superior to vma lock when possible, the payoff is we may still see
> stale pgtable data (since unsharing can still happen in parallel), while
> that can be completely avoided when we take the vma lock.

Disabling IRQs is frowned upon by the RT folks, which is why I'd like to
understand whether it is really required here.  Also, adding RCU to an
already complex mechanism doesn't necessarily make it easier :)

Let me dedicate some time today to dig into some details.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare
  2022-11-25  9:43     ` David Hildenbrand
@ 2022-11-25 13:55       ` Peter Xu
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2022-11-25 13:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Rik van Riel, Muchun Song, Andrew Morton,
	James Houghton, Nadav Amit, Andrea Arcangeli, Miaohe Lin,
	Mike Kravetz

On Fri, Nov 25, 2022 at 10:43:43AM +0100, David Hildenbrand wrote:
> On 23.11.22 16:09, Peter Xu wrote:
> > Hi, David,
> > 
> > Thanks for taking a look.
> > 
> > On Wed, Nov 23, 2022 at 10:40:40AM +0100, David Hildenbrand wrote:
> > > On 18.11.22 02:10, Peter Xu wrote:
> > > > Based on latest mm-unstable (96aa38b69507).
> > > > 
> > > > This can be seen as a follow-up series to Mike's recent hugetlb vma lock
> > > > series for pmd unsharing, so this series also depends on that one.
> > > > Hopefully this series can make it a more complete resolution for pmd
> > > > unsharing.
> > > > 
> > > > PS: so far no one strongly ACKed this, let me keep the RFC tag.  But I
> > > > think I'm already more confident than many of the RFCs I posted.
> > > > 
> > > > PS2: there're a lot of changes comparing to rfcv1, so I'm just not adding
> > > > the changelog.  The whole idea is still the same, though.
> > > > 
> > > > Problem
> > > > =======
> > > > 
> > > > huge_pte_offset() is a major helper used by hugetlb code paths to walk a
> > > > hugetlb pgtable.  It's used mostly everywhere since that's needed even
> > > > before taking the pgtable lock.
> > > > 
> > > > huge_pte_offset() is always called with mmap lock held with either read or
> > > > write.
> > > > 
> > > > For normal memory types that's far enough, since any pgtable removal
> > > > requires mmap write lock (e.g. munmap or mm destructions).  However hugetlb
> > > > has the pmd unshare feature, it means not only the pgtable page can be gone
> > > > from under us when we're doing a walking, but also the pgtable page we're
> > > > walking (even after unshared, in this case it can only be the huge PUD page
> > > > which contains 512 huge pmd entries, with the vma VM_SHARED mapped).  It's
> > > > possible because even though freeing the pgtable page requires mmap write
> > > > lock, it doesn't help us when we're walking on another mm's pgtable, so
> > > > it's still on risk even if we're with the current->mm's mmap lock.
> > > > 
> > > > The recent work from Mike on vma lock can resolve most of this already.
> > > > It's achieved by forbidden pmd unsharing during the lock being taken, so no
> > > > further risk of the pgtable page being freed.  It means if we can take the
> > > > vma lock around all huge_pte_offset() callers it'll be safe.
> > 
> > [1]
> > 
> > > > 
> > > > There're already a bunch of them that we did as per the latest mm-unstable,
> > > > but also quite a few others that we didn't for various reasons.  E.g. it
> > > > may not be applicable for not-allow-to-sleep contexts like FOLL_NOWAIT.
> > > > Or, huge_pmd_share() is actually a tricky user of huge_pte_offset(),
> > 
> > [2]
> > 
> > > > because even if we took the vma lock, we're walking on another mm's vma!
> > > > Taking vma lock for all the vmas are probably not gonna work.
> > > > 
> > > > I have totally no report showing that I can trigger such a race, but from
> > > > code wise I never see anything that stops the race from happening.  This
> > > > series is trying to resolve that problem.
> > > 
> > > Let me try understand the basic problem first:
> > > 
> > > hugetlb walks page tables semi-lockless: while we hold the mmap lock, we
> > > don't grab the page table locks. That's very hugetlb specific handling and I
> > > assume hugetlb uses different mechanisms to sync against MADV_DONTNEED,
> > > concurrent page fault s... but that's no news. hugetlb is weird in many ways
> > > :)
> > > 
> > > So, IIUC, you want a mechanism to synchronize against PMD unsharing. Can't
> > > we use some very basic locking for that?
> > 
> 
> Sorry for the delay, finally found time to look into this again. :)
> 
> > Yes we can in most cases.  Please refer to above paragraph [1] where I
> > referred Mike's recent work on vma lock.  That's the basic locking we need
> > so far to protect pmd unsharing.  I'll attach the link too in the next
> > post, which is here:
> > 
> > https://lore.kernel.org/r/20220914221810.95771-1-mike.kravetz@oracle.com
> > 
> > > 
> > > Using RCU / disabling local irqs seems a bit excessive because we *are*
> > > holding the mmap lock and only care about concurrent unsharing
> > 
> > The series wanted to address where the vma lock is not easy to take.  It
> > originates from when I was reading Mike's other patch, I forgot why I did
> > that but I just noticed there's some code path that we may not want to take
> > a sleepable lock, e.g. in follow page code.
> 
> As I stated, whenever we already take the (expensive) mmap lock, the least
> thing we should have to worry about is taking another sleepable lock IMHO.
> Please correct me if I'm wrong.

Yes, that's not a major concern.  But I still think the follow page path
should sleep as little as possible.  For example, non-hugetlb doesn't sleep
there now.  If with the RCU lock we can do it locklessly, then why not?

The same applies to patch 3 of this patchset - I would think it beneficial
to have even without the new lock type introduced, because it still makes
the follow page path cleaner and makes the hugetlb and non-hugetlb paths match.

> 
> > 
> > The other one is huge_pmd_share() where we may have the mmap lock for
> > current mm but we're fundamentally walking another mm.  It'll be tricky to
> > take a sleepable lock in such condition too.
> 
> We're already grabbing the i_mmap_lock_read(mapping), and the VMAs are
> should be stable in that interval tree IIUC. So I wonder if taking VMA locks
> would really be problematic here. Anything obvious I am missing?

No, I think you're right, and I found that myself just yesterday when I was
writing a reproducer.  huge_pmd_share() is safe here, so at least that
patch in this patchset can be dropped.

> 
> > 
> > I mentioned these cases in the other paragraph above [2].  Let me try to
> > expand that in my next post too.
> 
> That would be great. I yet have to dedicate more time to understand all that
> complexity.
> 
> > 
> > It's debatable whether all the rest places can only work with either RCU or
> > irq disabled, but the idea is at least it should speed up those paths when
> > we still can.  Here, irqoff might be a bit heavy, but RCU lock should be
> > always superior to vma lock when possible, the payoff is we may still see
> > stale pgtable data (since unsharing can still happen in parallel), while
> > that can be completely avoided when we take the vma lock.
> 
> IRQ disabled is frowned upon by RT folks, that's why I'd like to understand
> if this is really required. Also, adding RCU to an already complex mechanism
> doesn't necessarily make it easier :)

I've posted it before, let me copy that over:

arch/arm64/Kconfig:     select MMU_GATHER_RCU_TABLE_FREE     
arch/x86/Kconfig:       select MMU_GATHER_RCU_TABLE_FREE        if PARAVIRT

arch/arm64/Kconfig:     select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
arch/riscv/Kconfig:     select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
arch/x86/Kconfig:       select ARCH_WANT_HUGE_PMD_SHARE

The irqoff thing is definitely unfortunate, but that only happens with
riscv, or x86 when !PARAVIRT.  I would suppose PARAVIRT is a common
config.. so it's mostly a riscv-only concern.  If someday riscv can support
RCU table free we can remove it, and enforce RCU table free for pmd sharing
if wanted.

That said, the "new lock" part is definitely not the core of the patchset
(even if it might read like that..).  The core part is to first identify the
issue with the overlooked usages of huge_pte_offset() and figure out how to
make them always safe.

Let me also update a few more things when at it.

Since it's very hard to reproduce the race discussed in this series, I
didn't try to write a reproducer until yesterday.  I need to inject some
kernel delays to trigger it, and only then can I trigger a use-after-free.
So I think the problem is confirmed.  The rest is how to resolve it, and
whether the vma lock is good enough.

One other thing to mention is that I overlooked one important detail about
the huge pgtable lock, which is actually not protected by RCU (as it's freed
as part of pmd_free()).  IOW, if a hugetlb walker wants to do huge_pte_lock()
after huge_pte_offset(), that lock won't be guarded by RCU, hence the new
lock won't easily work for such callers.  That's another very unfortunate
thing.  I'm not sure whether it's okay to move that part into the page free
RCU callback, but it definitely needs more thought as pmd_free() is an arch
API.
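
To spell out the hazard (illustrative pseudo-flow only, not code from the
series):

	rcu_read_lock();
	ptep = huge_pte_offset(mm, addr, sz);	/* pgtable page pinned by RCU */
	ptl  = huge_pte_lock(h, mm, ptep);	/* but this spinlock lives with
						 * the pmd-table page and may
						 * already be gone via pmd_free(),
						 * which is not RCU-deferred */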

Debatably, irqoff might work if the arch needs an IPI for the tlb flush, so
that irqoff would protect both the pgtable page and the huge_pte_lock().
However, I don't think it's wise to extend the irqoff approach to generic
archs, and "whether a tlb flush requires an IPI" is itself arch-specific,
which makes this overly complicated and unnecessary.

So firstly, I think I need to rework the patchset so that more places take
the vma lock (wherever the pgtable lock is needed).  I tend to keep the RCU
lock because it's lighter, at least for !riscv.  If so, the rules for using
huge_pte_offset() will look like this:

/*
 * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
 * Returns the pte_t* if found, or NULL if the address is not mapped.
 *
 * IMPORTANT: we should normally not directly call this function, instead
 * this is only a common interface to implement arch-specific walker.
 * Please consider using the hugetlb_walk() helper to make sure of the
 * correct locking is satisfied.
 *
 * Since this function will walk all the pgtable pages (including not only
 * high-level pgtable page, but also PUD entry that can be unshared
 * concurrently for VM_SHARED), the caller of this function should be
 *  responsible for its thread safety.  One can follow these rules:
 *
 *  (1) For private mappings: pmd unsharing is not possible, so it'll
 *      always be safe if we're with the mmap sem for either read or write.
 *      This is normally always the case, IOW we don't need to do anything
 *      special.
 *
 *  (2) For shared mappings: pmd unsharing is possible (so the PUD-ranged
 *      pgtable page can go away from under us!  It can be done by a pmd
 *      unshare with a follow up munmap() on the other process), then we
 *      need either:
 *
 *     (2.1) hugetlb vma lock read or write held, to make sure pmd unshare
 *           won't happen upon the range (it also makes sure the pte_t we
 *           read is the right and stable one), or,
 *
 *     (2.2) hugetlb mapping i_mmap_rwsem lock held read or write, to make
 *           sure even if unshare happened the racy unmap() will wait until
 *           i_mmap_rwsem is released, or,
 *
 *     (2.3) pgtable walker lock, to make sure even pmd unsharing happened,
 *           the old shared PUD page won't get freed from under us.  In
 *           this case, the pteval can be obsolete, but at least it's still
 *           always safe to access the page (e.g., de-referencing pte_t*
 *           would not cause use-after-free).  Note, it's not safe to
 *           access pgtable lock with this lock.  If huge_pte_lock()
 *           needed, look for (2.1) or (2.2).
 *
 *  Option (2.3) is the lightest, but it means pmd unshare can still happen,
 *  so the pte obtained from the pgtable walk can be stale; also, only the
 *  page data is safe to access, not anything else (e.g. the pgtable lock).
 *  Option (2.1) is the safest, which guarantees pte stability until the vma
 *  lock is released, however it is heavier than (2.3).
 */

Where hugetlb_walk() will be a new wrapper I'll introduce just to make sure
lock protections:

static inline pte_t *
hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
{
#ifdef CONFIG_LOCKDEP
	/* lockdep_is_held() only defined with CONFIG_LOCKDEP */
	if (vma->vm_flags & VM_MAYSHARE) {
		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

		/*
		 * Here, taking any of the three locks will guarantee
		 * safety of hugetlb pgtable walk.
		 *
		 * For more information on the locking rules of using
		 * huge_pte_offset(), please see the comment above
		 * huge_pte_offset() in the header file.
		 */
		WARN_ON_ONCE(!hugetlb_walker_locked() &&
			     !lockdep_is_held(&vma_lock->rw_sema) &&
			     !lockdep_is_held(
				 &vma->vm_file->f_mapping->i_mmap_rwsem));
	}
#endif

	return huge_pte_offset(vma->vm_mm, addr, sz);
}
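
A caller that can take the vma lock would then look something like this
(assumed usage only; using the vma lock read side from Mike's series):

	hugetlb_vma_lock_read(vma);
	ptep = hugetlb_walk(vma, haddr, huge_page_size(h));
	if (ptep)
		entry = huge_ptep_get(ptep);
	/* ... use the pte, take huge_pte_lock() if needed ... */
	hugetlb_vma_unlock_read(vma);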

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2022-11-25 13:56 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-18  1:10 [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare Peter Xu
2022-11-18  1:10 ` [PATCH RFC v2 01/12] mm/hugetlb: Let vma_offset_start() to return start Peter Xu
2022-11-18  1:10 ` [PATCH RFC v2 02/12] mm/hugetlb: Move swap entry handling into vma lock for fault Peter Xu
2022-11-18  1:35   ` Peter Xu
2022-11-18  1:10 ` [PATCH RFC v2 03/12] mm/hugetlb: Don't wait for migration entry during follow page Peter Xu
2022-11-18  1:10 ` [PATCH RFC v2 04/12] mm/hugetlb: Add pgtable walker lock Peter Xu
2022-11-18  1:10 ` [PATCH RFC v2 05/12] mm/hugetlb: Make userfaultfd_huge_must_wait() safe to pmd unshare Peter Xu
2022-11-18  1:10 ` [PATCH RFC v2 06/12] mm/hugetlb: Protect huge_pmd_share() with walker lock Peter Xu
2022-11-18  1:17   ` Peter Xu
2022-11-18  1:10 ` [PATCH RFC v2 07/12] mm/hugetlb: Use hugetlb walker lock in hugetlb_follow_page_mask() Peter Xu
2022-11-18  1:10 ` [PATCH RFC v2 08/12] mm/hugetlb: Use hugetlb walker lock in follow_hugetlb_page() Peter Xu
2022-11-18  1:10 ` [PATCH RFC v2 09/12] mm/hugetlb: Use hugetlb walker lock in hugetlb_vma_maps_page() Peter Xu
2022-11-18  1:10 ` [PATCH RFC v2 10/12] mm/hugetlb: Use hugetlb walker lock in walk_hugetlb_range() Peter Xu
2022-11-18  1:11 ` [PATCH RFC v2 11/12] mm/hugetlb: Use hugetlb walker lock in page_vma_mapped_walk() Peter Xu
2022-11-18  1:11 ` [PATCH RFC v2 12/12] mm/hugetlb: Introduce hugetlb_walk() Peter Xu
2022-11-23  9:40 ` [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare David Hildenbrand
2022-11-23 15:09   ` Peter Xu
2022-11-23 18:21     ` Mike Kravetz
2022-11-23 18:56       ` Peter Xu
2022-11-23 19:31         ` David Hildenbrand
2022-11-25  9:43     ` David Hildenbrand
2022-11-25 13:55       ` Peter Xu
