From: Peter Xu <peterx@redhat.com>
To: John Hubbard <jhubbard@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	James Houghton <jthoughton@google.com>,
	Jann Horn <jannh@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Rik van Riel <riel@surriel.com>,
	Nadav Amit <nadav.amit@gmail.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	Muchun Song <songmuchun@bytedance.com>,
	David Hildenbrand <david@redhat.com>
Subject: Re: [PATCH 08/10] mm/hugetlb: Make walk_hugetlb_range() safe to pmd unshare
Date: Tue, 6 Dec 2022 11:45:09 -0500	[thread overview]
Message-ID: <Y49xlV8I2/92Flha@x1n> (raw)
In-Reply-To: <0813b9ed-3c92-088c-4fb9-45fb648c6e73@nvidia.com>

[-- Attachment #1: Type: text/plain, Size: 2030 bytes --]

On Mon, Dec 05, 2022 at 03:52:51PM -0800, John Hubbard wrote:
> On 12/5/22 15:33, Mike Kravetz wrote:
> > On 11/29/22 14:35, Peter Xu wrote:
> > > Since walk_hugetlb_range() walks the pgtable, it needs the vma lock
> > > to make sure the pgtable page will not be freed concurrently.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >   mm/pagewalk.c | 2 ++
> > >   1 file changed, 2 insertions(+)
> > > 
> > > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > > index 7f1c9b274906..d98564a7be57 100644
> > > --- a/mm/pagewalk.c
> > > +++ b/mm/pagewalk.c
> > > @@ -302,6 +302,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
> > >   	const struct mm_walk_ops *ops = walk->ops;
> > >   	int err = 0;
> > > +	hugetlb_vma_lock_read(vma);
> > >   	do {
> > >   		next = hugetlb_entry_end(h, addr, end);
> > >   		pte = huge_pte_offset(walk->mm, addr & hmask, sz);
> > 
> > For each found pte, we will be calling mm_walk_ops->hugetlb_entry() with
> > the vma_lock held.  I looked into the various hugetlb_entry routines, and
> > I am not sure about hmm_vma_walk_hugetlb_entry.  It seems like it could
> > possibly call hmm_vma_fault -> handle_mm_fault -> hugetlb_fault.  If this
> > can happen, then we may have an issue as hugetlb_fault will also need to
> > acquire the vma_lock in read mode.

Thanks for spotting that, Mike.

I used to treat that path specially, but that was when I was still using
RCU locks, which don't have this issue.  I overlooked this one during the
switchover.
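
For reference, the nesting Mike describes is roughly the following (a
simplified sketch of the call chain, not the exact call sites):

  walk_hugetlb_range()
    hugetlb_vma_lock_read(vma)
    ops->hugetlb_entry()                /* hmm_vma_walk_hugetlb_entry() */
      hmm_vma_fault()
        handle_mm_fault()
          hugetlb_fault()
            hugetlb_vma_lock_read(vma)  /* same rwsem taken again */

Taking the same rwsem for read twice in one task can block forever once a
writer queues up in between, so the callback must drop the vma lock before
anything that can reach hugetlb_fault().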

> > 
> > I do not know the hmm code well enough to know if this may be an actual
> > issue?
> 
> Oh, this sounds like a serious concern. If we add a new lock, and hold it
> during callbacks that also need to take it, that's not going to work out,
> right?
> 
> And yes, hmm_range_fault() and related things do a good job of revealing
> this kind of deadlock. :)

I've got a fixup attached.  John, since this got your attention, please
have a look too in case there are further issues.

Thanks,

-- 
Peter Xu

[-- Attachment #2: 0001-fixup-mm-hugetlb-Make-walk_hugetlb_range-safe-to-pmd.patch --]
[-- Type: text/plain, Size: 2966 bytes --]

From 9ad1e65a31f51a0dc687cd9d6083b9e920d2da61 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Tue, 6 Dec 2022 11:38:47 -0500
Subject: [PATCH] fixup! mm/hugetlb: Make walk_hugetlb_range() safe to pmd
 unshare
Content-type: text/plain

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/s390/mm/gmap.c      | 2 ++
 fs/proc/task_mmu.c       | 2 ++
 include/linux/pagewalk.h | 8 +++++++-
 mm/hmm.c                 | 8 +++++++-
 4 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 8947451ae021..292a54c490d4 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2643,7 +2643,9 @@ static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
 	end = start + HPAGE_SIZE - 1;
 	__storage_key_init_range(start, end);
 	set_bit(PG_arch_1, &page->flags);
+	hugetlb_vma_unlock_read(walk->vma);
 	cond_resched();
+	hugetlb_vma_lock_read(walk->vma);
 	return 0;
 }
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 89338950afd3..d7155f3bb678 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1612,7 +1612,9 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
 			frame++;
 	}
 
+	hugetlb_vma_unlock_read(walk->vma);
 	cond_resched();
+	hugetlb_vma_lock_read(walk->vma);
 
 	return err;
 }
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 959f52e5867d..1f7c2011f6cb 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -21,7 +21,13 @@ struct mm_walk;
  *			depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD.
  *			Any folded depths (where PTRS_PER_P?D is equal to 1)
  *			are skipped.
- * @hugetlb_entry:	if set, called for each hugetlb entry
+ * @hugetlb_entry:	if set, called for each hugetlb entry.	Note that
+ *			currently the hook function is protected by the
+ *			hugetlb vma lock, to make sure the pte_t* and the
+ *			spinlock are valid to access.  If the hook function
+ *			needs to yield the thread or retake the vma lock for
+ *			some reason, it needs to release the vma lock manually
+ *			and retake it before the function returns.
  * @test_walk:		caller specific callback function to determine whether
  *			we walk over the current vma or not. Returning 0 means
  *			"do page table walk over the current vma", returning
diff --git a/mm/hmm.c b/mm/hmm.c
index 3850fb625dda..dcd624f28bcf 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -493,8 +493,14 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	required_fault =
 		hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, cpu_flags);
 	if (required_fault) {
+		int ret;
+
 		spin_unlock(ptl);
-		return hmm_vma_fault(addr, end, required_fault, walk);
+		hugetlb_vma_unlock_read(vma);
+		/* hmm_vma_fault() can retake the vma lock */
+		ret = hmm_vma_fault(addr, end, required_fault, walk);
+		hugetlb_vma_lock_read(vma);
+		return ret;
 	}
 
 	pfn = pte_pfn(entry) + ((start & ~hmask) >> PAGE_SHIFT);
-- 
2.37.3
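
For hugetlb_entry implementers, the rule documented in the pagewalk.h hunk
above boils down to the pattern below.  This is only a minimal sketch:
my_hugetlb_entry() and do_nonsleeping_work() are made-up names for
illustration, while the hugetlb_vma_*lock_read() helpers, cond_resched()
and the hook signature are the real interfaces.

static int my_hugetlb_entry(pte_t *pte, unsigned long hmask,
			    unsigned long addr, unsigned long end,
			    struct mm_walk *walk)
{
	int ret;

	/*
	 * Entered with the hugetlb vma lock held for read, so *pte and
	 * the pte spinlock are safe to access here.
	 */
	ret = do_nonsleeping_work(pte, addr);

	/*
	 * Anything that may sleep or re-enter hugetlb_fault() must drop
	 * the vma lock first and retake it before returning, so the
	 * unlock in walk_hugetlb_range() stays balanced.
	 */
	hugetlb_vma_unlock_read(walk->vma);
	cond_resched();
	hugetlb_vma_lock_read(walk->vma);

	return ret;
}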


Thread overview: 60+ messages
2022-11-29 19:35 [PATCH 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare Peter Xu
2022-11-29 19:35 ` [PATCH 01/10] mm/hugetlb: Let vma_offset_start() to return start Peter Xu
2022-11-30 10:11   ` David Hildenbrand
2022-11-29 19:35 ` [PATCH 02/10] mm/hugetlb: Don't wait for migration entry during follow page Peter Xu
2022-11-30  4:37   ` Mike Kravetz
2022-11-30 10:15   ` David Hildenbrand
2022-11-29 19:35 ` [PATCH 03/10] mm/hugetlb: Document huge_pte_offset usage Peter Xu
2022-11-30  4:55   ` Mike Kravetz
2022-11-30 15:58     ` Peter Xu
2022-12-05 21:47       ` Mike Kravetz
2022-11-30 10:21   ` David Hildenbrand
2022-11-30 10:24   ` David Hildenbrand
2022-11-30 16:09     ` Peter Xu
2022-11-30 16:11       ` David Hildenbrand
2022-11-30 16:25         ` Peter Xu
2022-11-30 16:31           ` David Hildenbrand
2022-11-29 19:35 ` [PATCH 04/10] mm/hugetlb: Move swap entry handling into vma lock when faulted Peter Xu
2022-12-05 22:14   ` Mike Kravetz
2022-12-05 23:36     ` Peter Xu
2022-11-29 19:35 ` [PATCH 05/10] mm/hugetlb: Make userfaultfd_huge_must_wait() safe to pmd unshare Peter Xu
2022-11-30 16:08   ` David Hildenbrand
2022-12-05 22:23   ` Mike Kravetz
2022-11-29 19:35 ` [PATCH 06/10] mm/hugetlb: Make hugetlb_follow_page_mask() " Peter Xu
2022-11-30 16:09   ` David Hildenbrand
2022-12-05 22:29   ` Mike Kravetz
2022-11-29 19:35 ` [PATCH 07/10] mm/hugetlb: Make follow_hugetlb_page() " Peter Xu
2022-11-30 16:09   ` David Hildenbrand
2022-12-05 22:45   ` Mike Kravetz
2022-11-29 19:35 ` [PATCH 08/10] mm/hugetlb: Make walk_hugetlb_range() " Peter Xu
2022-11-30 16:11   ` David Hildenbrand
2022-12-05 23:33   ` Mike Kravetz
2022-12-05 23:52     ` John Hubbard
2022-12-06 16:45       ` Peter Xu [this message]
2022-12-06 18:50         ` Mike Kravetz
2022-12-06 21:03         ` John Hubbard
2022-12-06 21:51           ` Peter Xu
2022-12-06 22:31             ` John Hubbard
2022-12-07  0:07               ` Peter Xu
2022-12-07  2:38                 ` John Hubbard
2022-12-07 14:58                   ` Peter Xu
2022-11-29 19:35 ` [PATCH 09/10] mm/hugetlb: Make page_vma_mapped_walk() " Peter Xu
2022-11-30 16:18   ` David Hildenbrand
2022-11-30 16:32     ` Peter Xu
2022-11-30 16:39       ` David Hildenbrand
2022-12-05 23:52   ` Mike Kravetz
2022-12-06 17:10     ` Mike Kravetz
2022-12-06 17:39       ` Peter Xu
2022-12-06 17:43         ` Peter Xu
2022-12-06 19:58           ` Mike Kravetz
2022-11-29 19:35 ` [PATCH 10/10] mm/hugetlb: Introduce hugetlb_walk() Peter Xu
2022-11-30  5:18   ` Eric Biggers
2022-11-30 15:37     ` Peter Xu
2022-12-06  0:21       ` Mike Kravetz
2022-11-29 20:49 ` [PATCH 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare Andrew Morton
2022-11-29 21:19   ` Peter Xu
2022-11-29 21:26     ` Andrew Morton
2022-11-29 20:51 ` Andrew Morton
2022-11-29 21:36   ` Peter Xu
2022-11-30  9:46 ` David Hildenbrand
2022-11-30 16:23   ` Peter Xu
